summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2023-10-19bpftool: Wrap struct_ops dump in an arrayManu Bretelle1-0/+6
When dumping a struct_ops, 2 dictionaries are emitted. When using `name`, they were already wrapped in an array, but not when using `id`. Causing `jq` to fail at parsing the payload as it reached the comma following the first dict. This change wraps those dictionaries in an array so valid json is emitted. Before, jq fails to parse the output: ``` $ sudo bpftool struct_ops dump id 1523612 | jq . > /dev/null parse error: Expected value before ',' at line 19, column 2 ``` After, no error parsing the output: ``` sudo ./bpftool struct_ops dump id 1523612 | jq . > /dev/null ``` Signed-off-by: Manu Bretelle <chantr4@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20231018230133.1593152-3-chantr4@gmail.com
2023-10-19bpftool: Fix printing of pointer valueManu Bretelle1-1/+1
When printing a pointer value, "%p" will either print the hexadecimal value of the pointer (e.g `0x1234`), or `(nil)` when NULL. Both of those are invalid json "integer" values and need to be wrapped in quotes. Before: ``` $ sudo bpftool struct_ops dump name ned_dummy_cca | grep next "next": (nil), $ sudo bpftool struct_ops dump name ned_dummy_cca | \ jq '.[1].bpf_struct_ops_tcp_congestion_ops.data.list.next' parse error: Invalid numeric literal at line 29, column 34 ``` After: ``` $ sudo ./bpftool struct_ops dump name ned_dummy_cca | grep next "next": "(nil)", $ sudo ./bpftool struct_ops dump name ned_dummy_cca | \ jq '.[1].bpf_struct_ops_tcp_congestion_ops.data.list.next' "(nil)" ``` Signed-off-by: Manu Bretelle <chantr4@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20231018230133.1593152-2-chantr4@gmail.com
2023-10-19bpf, docs: Define signed modulo as using truncated divisionDave Thaler1-0/+8
There's different mathematical definitions (truncated, floored, rounded, etc.) and different languages have chosen different definitions [0][1]. E.g., languages/libraries that follow Knuth use a different mathematical definition than C uses. This patch specifies which definition BPF uses, as verified by Eduard [2] and others. [0] https://en.wikipedia.org/wiki/Modulo#Variants_of_the_definition [1] https://torstencurdt.com/tech/posts/modulo-of-negative-numbers/ [2] https://lore.kernel.org/bpf/57e6fefadaf3b2995bb259fa8e711c7220ce5290.camel@gmail.com/ Signed-off-by: Dave Thaler <dthaler@microsoft.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: David Vernet <void@manifault.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20231017203020.1500-1-dthaler1968@googlemail.com
2023-10-18selftests/bpf: Add options and frags to xdp_hw_metadataLarysa Zaremba2-13/+67
This is a follow-up to the commit 9b2b86332a9b ("bpf: Allow to use kfunc XDP hints and frags together"). The are some possible implementations problems that may arise when providing metadata specifically for multi-buffer packets, therefore there must be a possibility to test such option separately. Add an option to use multi-buffer AF_XDP xdp_hw_metadata and mark used XDP program as capable to use frags. As for now, xdp_hw_metadata accepts no options, so add simple option parsing logic and a help message. For quick reference, also add an ingress packet generation command to the help message. The command comes from [0]. Example of output for multi-buffer packet: xsk_ring_cons__peek: 1 0xead018: rx_desc[15]->addr=10000000000f000 addr=f100 comp_addr=f000 rx_hash: 0x5789FCBB with RSS type:0x29 rx_timestamp: 1696856851535324697 (sec:1696856851.5353) XDP RX-time: 1696856843158256391 (sec:1696856843.1583) delta sec:-8.3771 (-8377068.306 usec) AF_XDP time: 1696856843158413078 (sec:1696856843.1584) delta sec:0.0002 (156.687 usec) 0xead018: complete idx=23 addr=f000 xsk_ring_cons__peek: 1 0xead018: rx_desc[16]->addr=100000000008000 addr=8100 comp_addr=8000 0xead018: complete idx=24 addr=8000 xsk_ring_cons__peek: 1 0xead018: rx_desc[17]->addr=100000000009000 addr=9100 comp_addr=9000 EoP 0xead018: complete idx=25 addr=9000 Metadata is printed for the first packet only. [0] https://lore.kernel.org/all/20230119221536.3349901-18-sdf@google.com/ Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20231017162800.24080-1-larysa.zaremba@intel.com
2023-10-17selftests/bpf: Add additional mprog query test coverageDaniel Borkmann1-1/+130
Add several new test cases which assert corner cases on the mprog query mechanism, for example, around passing in a too small or a larger array than the current count. ./test_progs -t tc_opts #252 tc_opts_after:OK #253 tc_opts_append:OK #254 tc_opts_basic:OK #255 tc_opts_before:OK #256 tc_opts_chain_classic:OK #257 tc_opts_chain_mixed:OK #258 tc_opts_delete_empty:OK #259 tc_opts_demixed:OK #260 tc_opts_detach:OK #261 tc_opts_detach_after:OK #262 tc_opts_detach_before:OK #263 tc_opts_dev_cleanup:OK #264 tc_opts_invalid:OK #265 tc_opts_max:OK #266 tc_opts_mixed:OK #267 tc_opts_prepend:OK #268 tc_opts_query:OK #269 tc_opts_query_attach:OK #270 tc_opts_replace:OK #271 tc_opts_revision:OK Summary: 20/0 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/bpf/20231017081728.24769-1-daniel@iogearbox.net
2023-10-17selftests/bpf: Add selftest for bpf_task_under_cgroup() in sleepable progYafang Shao2-3/+36
The result is as follows: $ tools/testing/selftests/bpf/test_progs --name=task_under_cgroup #237 task_under_cgroup:OK Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED Without the previous patch, there will be RCU warnings in dmesg when CONFIG_PROVE_RCU is enabled. While with the previous patch, there will be no warnings. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20231007135945.4306-2-laoar.shao@gmail.com
2023-10-17bpf: Fix missed rcu read lock in bpf_task_under_cgroup()Yafang Shao1-1/+6
When employed within a sleepable program not under RCU protection, the use of 'bpf_task_under_cgroup()' may trigger a warning in the kernel log, particularly when CONFIG_PROVE_RCU is enabled: [ 1259.662357] WARNING: suspicious RCU usage [ 1259.662358] 6.5.0+ #33 Not tainted [ 1259.662360] ----------------------------- [ 1259.662361] include/linux/cgroup.h:423 suspicious rcu_dereference_check() usage! Other info that might help to debug this: [ 1259.662366] rcu_scheduler_active = 2, debug_locks = 1 [ 1259.662368] 1 lock held by trace/72954: [ 1259.662369] #0: ffffffffb5e3eda0 (rcu_read_lock_trace){....}-{0:0}, at: __bpf_prog_enter_sleepable+0x0/0xb0 Stack backtrace: [ 1259.662385] CPU: 50 PID: 72954 Comm: trace Kdump: loaded Not tainted 6.5.0+ #33 [ 1259.662391] Call Trace: [ 1259.662393] <TASK> [ 1259.662395] dump_stack_lvl+0x6e/0x90 [ 1259.662401] dump_stack+0x10/0x20 [ 1259.662404] lockdep_rcu_suspicious+0x163/0x1b0 [ 1259.662412] task_css_set.part.0+0x23/0x30 [ 1259.662417] bpf_task_under_cgroup+0xe7/0xf0 [ 1259.662422] bpf_prog_7fffba481a3bcf88_lsm_run+0x5c/0x93 [ 1259.662431] bpf_trampoline_6442505574+0x60/0x1000 [ 1259.662439] bpf_lsm_bpf+0x5/0x20 [ 1259.662443] ? security_bpf+0x32/0x50 [ 1259.662452] __sys_bpf+0xe6/0xdd0 [ 1259.662463] __x64_sys_bpf+0x1a/0x30 [ 1259.662467] do_syscall_64+0x38/0x90 [ 1259.662472] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 1259.662479] RIP: 0033:0x7f487baf8e29 [...] [ 1259.662504] </TASK> This issue can be reproduced by executing a straightforward program, as demonstrated below: SEC("lsm.s/bpf") int BPF_PROG(lsm_run, int cmd, union bpf_attr *attr, unsigned int size) { struct cgroup *cgrp = NULL; struct task_struct *task; int ret = 0; if (cmd != BPF_LINK_CREATE) return 0; // The cgroup2 should be mounted first cgrp = bpf_cgroup_from_id(1); if (!cgrp) goto out; task = bpf_get_current_task_btf(); if (bpf_task_under_cgroup(task, cgrp)) ret = -1; bpf_cgroup_release(cgrp); out: return ret; } After running the program, if you subsequently execute another BPF program, you will encounter the warning. It's worth noting that task_under_cgroup_hierarchy() is also utilized by bpf_current_task_under_cgroup(). However, bpf_current_task_under_cgroup() doesn't exhibit this issue because it cannot be used in sleepable BPF programs. Fixes: b5ad4cdc46c7 ("bpf: Add bpf_task_under_cgroup() kfunc") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Cc: Feng Zhou <zhoufeng.zf@bytedance.com> Cc: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/bpf/20231007135945.4306-1-laoar.shao@gmail.com
2023-10-17net, bpf: Add a warning if NAPI cb missed xdp_do_flush().Sebastian Andrzej Siewior8-0/+66
A few drivers were missing a xdp_do_flush() invocation after XDP_REDIRECT. Add three helper functions each for one of the per-CPU lists. Return true if the per-CPU list is non-empty and flush the list. Add xdp_do_check_flushed() which invokes each helper functions and creates a warning if one of the functions had a non-empty list. Hide everything behind CONFIG_DEBUG_NET. Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20231016125738.Yt79p1uF@linutronix.de
2023-10-17libbpf: Don't assume SHT_GNU_verdef presence for SHT_GNU_versym sectionAndrii Nakryiko1-6/+10
Fix too eager assumption that SHT_GNU_verdef ELF section is going to be present whenever binary has SHT_GNU_versym section. It seems like either SHT_GNU_verdef or SHT_GNU_verneed can be used, so failing on missing SHT_GNU_verdef actually breaks use cases in production. One specific reported issue, which was used to manually test this fix, was trying to attach to `readline` function in BASH binary. Fixes: bb7fa09399b9 ("libbpf: Support symbol versioning for uprobe") Reported-by: Liam Wisehart <liamwisehart@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Manu Bretelle <chantr4@gmail.com> Reviewed-by: Fangrui Song <maskray@google.com> Acked-by: Hengqi Chen <hengqi.chen@gmail.com> Link: https://lore.kernel.org/bpf/20231016182840.4033346-1-andrii@kernel.org
2023-10-17Merge tag 'for-netdev' of ↵Jakub Kicinski120-895/+3519
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-10-16 We've added 90 non-merge commits during the last 25 day(s) which contain a total of 120 files changed, 3519 insertions(+), 895 deletions(-). The main changes are: 1) Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs, from Jiri Olsa. 2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is for systemd to reimplement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services, from Daan De Meyer. 3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich. 4) Improve BPF verifier log output for scalar registers to better disambiguate their internal state wrt defaults vs min/max values matching, from Andrii Nakryiko. 5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving the source IP address with a new BPF_FIB_LOOKUP_SRC flag, from Martynas Pumputis. 6) Add support for open-coded task_vma iterator to help with symbolization for BPF-collected user stacks, from Dave Marchevsky. 7) Add libbpf getters for accessing individual BPF ring buffers which is useful for polling them individually, for example, from Martin Kelly. 8) Extend AF_XDP selftests to validate the SHARED_UMEM feature, from Tushar Vyavahare. 9) Improve BPF selftests cross-building support for riscv arch, from Björn Töpel. 10) Add the ability to pin a BPF timer to the same calling CPU, from David Vernet. 11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments, from Alexandre Ghiti. 12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen. 13) Fix bpftool's skeleton code generation to guarantee that ELF data is 8 byte aligned, from Ian Rogers. 14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4 security mitigations in BPF verifier, from Yafang Shao. 15) Annotate struct bpf_stack_map with __counted_by attribute to prepare BPF side for upcoming __counted_by compiler support, from Kees Cook. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits) bpf: Ensure proper register state printing for cond jumps bpf: Disambiguate SCALAR register state output in verifier logs selftests/bpf: Make align selftests more robust selftests/bpf: Improve missed_kprobe_recursion test robustness selftests/bpf: Improve percpu_alloc test robustness selftests/bpf: Add tests for open-coded task_vma iter bpf: Introduce task_vma open-coded iterator kfuncs selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c bpf: Don't explicitly emit BTF for struct btf_iter_num bpf: Change syscall_nr type to int in struct syscall_tp_t net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set bpf: Avoid unnecessary audit log for CPU security mitigations selftests/bpf: Add tests for cgroup unix socket address hooks selftests/bpf: Make sure mount directory exists documentation/bpf: Document cgroup unix socket address hooks bpftool: Add support for cgroup unix socket address hooks libbpf: Add support for cgroup unix socket address hooks bpf: Implement cgroup sockaddr hooks for unix sockets bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf bpf: Propagate modified uaddrlen from cgroup sockaddr programs ... ==================== Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17page_pool: fragment API support for 32-bit arch with 64-bit DMAYunsheng Lin3-23/+24
Currently page_pool_alloc_frag() is not supported in 32-bit arch with 64-bit DMA because of the overlap issue between pp_frag_count and dma_addr_upper in 'struct page' for those arches, which seems to be quite common, see [1], which means driver may need to handle it when using fragment API. It is assumed that the combination of the above arch with an address space >16TB does not exist, as all those arches have 64b equivalent, it seems logical to use the 64b version for a system with a large address space. It is also assumed that dma address is page aligned when we are dma mapping a page aligned buffer, see [2]. That means we're storing 12 bits of 0 at the lower end for a dma address, we can reuse those bits for the above arches to support 32b+12b, which is 16TB of memory. If we make a wrong assumption, a warning is emitted so that user can report to us. 1. https://lore.kernel.org/all/20211117075652.58299-1-linyunsheng@huawei.com/ 2. https://lore.kernel.org/all/20230818145145.4b357c89@kernel.org/ Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Lorenzo Bianconi <lorenzo@kernel.org> CC: Alexander Duyck <alexander.duyck@gmail.com> CC: Liang Chen <liangchen.linux@gmail.com> CC: Guillaume Tucker <guillaume.tucker@collabora.com> CC: Matthew Wilcox <willy@infradead.org> CC: Linux-MM <linux-mm@kvack.org> Link: https://lore.kernel.org/r/20231013064827.61135-2-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17net: stub tcp_gro_complete if CONFIG_INET=nJacob Keller1-0/+4
A few networking drivers including bnx2x, bnxt, qede, and idpf call tcp_gro_complete as part of offloading TCP GRO. The function is only defined if CONFIG_INET is true, since its TCP specific and is meaningless if the kernel lacks IP networking support. The combination of trying to use the complex network drivers with CONFIG_NET but not CONFIG_INET is rather unlikely in practice: most use cases are going to need IP networking. The tcp_gro_complete function just sets some data in the socket buffer for use in processing the TCP packet in the event that the GRO was offloaded to the device. If the kernel lacks TCP support, such setup will simply go unused. The bnx2x, bnxt, and qede drivers wrap their TCP offload support in CONFIG_INET checks and skip handling on such kernels. The idpf driver did not check CONFIG_INET and thus fails to link if the kernel is configured with CONFIG_NET=y, CONFIG_IDPF=(m|y), and CONFIG_INET=n. While checking CONFIG_INET does allow the driver to bypass significantly more instructions in the event that we know TCP networking isn't supported, the configuration is unlikely to be used widely. Rather than require driver authors to care about this, stub the tcp_gro_complete function when CONFIG_INET=n. This allows drivers to be left as-is. It does mean the idpf driver will perform slightly more work than strictly necessary when CONFIG_INET=n, since it will still execute some of the skb setup in idpf_rx_rsc. However, that work would be performed in the case where CONFIG_INET=y anyways. I did not change the existing drivers, since they appear to wrap a significant portion of code when CONFIG_INET=n. There is little benefit in trashing these drivers just to unwrap and remove the CONFIG_INET check. Using a stub for tcp_gro_complete is still beneficial, as it means future drivers no longer need to worry about this case of CONFIG_NET=y and CONFIG_INET=n, which should reduce noise from buildbots that check such a configuration. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Link: https://lore.kernel.org/r/20231013185502.1473541-1-jacob.e.keller@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17drivers: net: wwan: wwan_core.c: resolved spelling mistakeMuhammad Muzammil1-2/+2
resolved typing mistake from devce to device Signed-off-by: Muhammad Muzammil <m.muzzammilashraf@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231013042304.7881-1-m.muzzammilashraf@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17cgroup, netclassid: on modifying netclassid in cgroup, only consider the ↵Liansen Zhai1-0/+6
main process. When modifying netclassid, the command("echo 0x100001 > net_cls.classid") will take more time on many threads of one process, because the process create many fds. for example, one process exists 28000 fds and 60000 threads, echo command will task 45 seconds. Now, we only consider the main process when exec "iterate_fd", and the time is about 52 milliseconds. Signed-off-by: Liansen Zhai <zhailiansen@kuaishou.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231012090330.29636-1-zhailiansen@kuaishou.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17net: usb: replace deprecated strncpy with strscpyJustin Stitt1-2/+2
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. Other implementations of .*get_drvinfo use strscpy so this patch brings sr_get_drvinfo() in line as well: igb/igb_ethtool.c +851 static void igb_get_drvinfo(struct net_device *netdev, igbvf/ethtool.c 167:static void igbvf_get_drvinfo(struct net_device *netdev, i40e/i40e_ethtool.c 1999:static void i40e_get_drvinfo(struct net_device *netdev, e1000/e1000_ethtool.c 529:static void e1000_get_drvinfo(struct net_device *netdev, ixgbevf/ethtool.c 211:static void ixgbevf_get_drvinfo(struct net_device *netdev, ... Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20231012-strncpy-drivers-net-usb-sr9800-c-v1-1-5540832c8ec2@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17lan78xx: replace deprecated strncpy with strscpyJustin Stitt1-1/+1
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. Other implementations of .*get_drvinfo use strscpy so this patch brings lan78xx_get_drvinfo() in line as well: igb/igb_ethtool.c +851 static void igb_get_drvinfo(struct net_device *netdev, igbvf/ethtool.c 167:static void igbvf_get_drvinfo(struct net_device *netdev, i40e/i40e_ethtool.c 1999:static void i40e_get_drvinfo(struct net_device *netdev, e1000/e1000_ethtool.c 529:static void e1000_get_drvinfo(struct net_device *netdev, ixgbevf/ethtool.c 211:static void ixgbevf_get_drvinfo(struct net_device *netdev, Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://github.com/KSPP/linux/issues/90 Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20231012-strncpy-drivers-net-usb-lan78xx-c-v1-1-99d513061dfc@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17net: phy: smsc: replace deprecated strncpy with ethtool_sprintfJustin Stitt1-4/+2
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. ethtool_sprintf() is designed specifically for get_strings() usage. Let's replace strncpy in favor of this dedicated helper function. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231012-strncpy-drivers-net-phy-smsc-c-v1-1-00528f7524b3@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17net: netcp: replace deprecated strncpy with strscpyJustin Stitt1-2/+2
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. Considering the above, a suitable replacement is `strscpy` [2] due to the fact that it guarantees NUL-termination on the destination buffer without unnecessarily NUL-padding. Other implementations of .*get_drvinfo also use strscpy so this patch brings keystone_get_drvinfo() in line as well: igb/igb_ethtool.c +851 static void igb_get_drvinfo(struct net_device *netdev, igbvf/ethtool.c 167:static void igbvf_get_drvinfo(struct net_device *netdev, i40e/i40e_ethtool.c 1999:static void i40e_get_drvinfo(struct net_device *netdev, e1000/e1000_ethtool.c 529:static void e1000_get_drvinfo(struct net_device *netdev, ixgbevf/ethtool.c 211:static void ixgbevf_get_drvinfo(struct net_device *netdev, Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20231012-strncpy-drivers-net-ethernet-ti-netcp_ethss-c-v1-1-93142e620864@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17tcp: Set pingpong threshold via sysctlHaiyang Zhang6-6/+39
TCP pingpong threshold is 1 by default. But some applications, like SQL DB may prefer a higher pingpong threshold to activate delayed acks in quick ack mode for better performance. The pingpong threshold and related code were changed to 3 in the year 2019 in: commit 4a41f453bedf ("tcp: change pingpong threshold to 3") And reverted to 1 in the year 2022 in: commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"") There is no single value that fits all applications. Add net.ipv4.tcp_pingpong_thresh sysctl tunable, so it can be tuned for optimal performance based on the application needs. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/1697056244-21888-1-git-send-email-haiyangz@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16net, sched: Add tcf_set_drop_reason for {__,}tcf_classifyDaniel Borkmann2-6/+23
Add an initial user for the newly added tcf_set_drop_reason() helper to set the drop reason for internal errors leading to TC_ACT_SHOT inside {__,}tcf_classify(). Right now this only adds a very basic SKB_DROP_REASON_TC_ERROR as a generic fallback indicator to mark drop locations. Where needed, such locations can be converted to more specific codes, for example, when hitting the reclassification limit, etc. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Victor Nogueira <victor@mojatatu.com> Link: https://lore.kernel.org/r/20231009092655.22025-2-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16net, sched: Make tc-related drop reason more flexibleDaniel Borkmann3-7/+17
Currently, the kfree_skb_reason() in sch_handle_{ingress,egress}() can only express a basic SKB_DROP_REASON_TC_INGRESS or SKB_DROP_REASON_TC_EGRESS reason. Victor kicked-off an initial proposal to make this more flexible by disambiguating verdict from return code by moving the verdict into struct tcf_result and letting tcf_classify() return a negative error. If hit, then two new drop reasons were added in the proposal, that is SKB_DROP_REASON_TC_INGRESS_ERROR as well as SKB_DROP_REASON_TC_EGRESS_ERROR. Further analysis of the actual error codes would have required to attach to tcf_classify via kprobe/kretprobe to more deeply debug skb and the returned error. In order to make the kfree_skb_reason() in sch_handle_{ingress,egress}() more extensible, it can be addressed in a more straight forward way, that is: Instead of placing the verdict into struct tcf_result, we can just put the drop reason in there, which does not require changes throughout various classful schedulers given the existing verdict logic can stay as is. Then, SKB_DROP_REASON_TC_ERROR{,_*} can be added to the enum skb_drop_reason to disambiguate between an error or an intentional drop. New drop reason error codes can be added successively to the tc code base. For internal error locations which have not yet been annotated with a SKB_DROP_REASON_TC_ERROR{,_*}, the fallback is SKB_DROP_REASON_TC_INGRESS and SKB_DROP_REASON_TC_EGRESS, respectively. Generic errors could be marked with a SKB_DROP_REASON_TC_ERROR code until they are converted to more specific ones if it is found that they would be useful for troubleshooting. While drop reasons have infrastructure for subsystem specific error codes which are currently used by mac80211 and ovs, Jakub mentioned that it is preferred for tc to use the enum skb_drop_reason core codes given it is a better fit and currently the tooling support is better, too. With regards to the latter: [...] I think Alastair (bpftrace) is working on auto-prettifying enums when bpftrace outputs maps. So we can do something like: $ bpftrace -e 'tracepoint:skb:kfree_skb { @[args->reason] = count(); }' Attaching 1 probe... ^C @[SKB_DROP_REASON_TC_INGRESS]: 2 @[SKB_CONSUMED]: 34 ^^^^^^^^^^^^ names!! Auto-magically. [...] Add a small helper tcf_set_drop_reason() which can be used to set the drop reason into the tcf_result. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Victor Nogueira <victor@mojatatu.com> Link: https://lore.kernel.org/netdev/20231006063233.74345d36@kernel.org Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20231009092655.22025-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16Merge branch 'bpf-log-improvements'Daniel Borkmann8-157/+200
Andrii Nakryiko says: ==================== This patch set fixes ambiguity in BPF verifier log output of SCALAR register in the parts that emit umin/umax, smin/smax, etc ranges. See patch #4 for details. Also, patch #5 fixes an issue with verifier log missing instruction context (state) output for conditionals that trigger precision marking. See details in the patch. First two patches are just improvements to two selftests that are very flaky locally when run in parallel mode. Patch #3 changes 'align' selftest to be less strict about exact verifier log output (which patch #4 changes, breaking lots of align tests as written). Now test does more of a register substate checks, mostly around expected var_off() values. This 'align' selftests is one of the more brittle ones and requires constant adjustment when verifier log output changes, without really catching any new issues. So hopefully these changes can minimize future support efforts for this specific set of tests. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2023-10-16bpf: Ensure proper register state printing for cond jumpsAndrii Nakryiko1-1/+6
Verifier emits relevant register state involved in any given instruction next to it after `;` to the right, if possible. Or, worst case, on the separate line repeating instruction index. E.g., a nice and simple case would be: 2: (d5) if r0 s<= 0x0 goto pc+1 ; R0_w=0 But if there is some intervening extra output (e.g., precision backtracking log) involved, we are supposed to see the state after the precision backtrack log: 4: (75) if r0 s>= 0x0 goto pc+1 mark_precise: frame0: last_idx 4 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r0 stack= before 2: (d5) if r0 s<= 0x0 goto pc+1 mark_precise: frame0: regs=r0 stack= before 1: (b7) r0 = 0 6: R0_w=0 First off, note that in `6: R0_w=0` instruction index corresponds to the next instruction, not to the conditional jump instruction itself, which is wrong and we'll get to that. But besides that, the above is a happy case that does work today. Yet, if it so happens that precision backtracking had to traverse some of the parent states, this `6: R0_w=0` state output would be missing. This is due to a quirk of print_verifier_state() routine, which performs mark_verifier_state_clean(env) at the end. This marks all registers as "non-scratched", which means that subsequent logic to print *relevant* registers (that is, "scratched ones") fails and doesn't see anything relevant to print and skips the output altogether. print_verifier_state() is used both to print instruction context, but also to print an **entire** verifier state indiscriminately, e.g., during precision backtracking (and in a few other situations, like during entering or exiting subprogram). Which means if we have to print entire parent state before getting to printing instruction context state, instruction context is marked as clean and is omitted. Long story short, this is definitely not intentional. So we fix this behavior in this patch by teaching print_verifier_state() to clear scratch state only if it was used to print instruction state, not the parent/callback state. This is determined by print_all option, so if it's not set, we don't clear scratch state. This fixes missing instruction state for these cases. As for the mismatched instruction index, we fix that by making sure we call print_insn_state() early inside check_cond_jmp_op() before we adjusted insn_idx based on jump branch taken logic. And with that we get desired correct information: 9: (16) if w4 == 0x1 goto pc+9 mark_precise: frame0: last_idx 9 first_idx 9 subseq_idx -1 mark_precise: frame0: parent state regs=r4 stack=: R2_w=1944 R4_rw=P1 R10=fp0 mark_precise: frame0: last_idx 8 first_idx 0 subseq_idx 9 mark_precise: frame0: regs=r4 stack= before 8: (66) if w4 s> 0x3 goto pc+5 mark_precise: frame0: regs=r4 stack= before 7: (b7) r4 = 1 9: R4=1 Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20231011223728.3188086-6-andrii@kernel.org
2023-10-16bpf: Disambiguate SCALAR register state output in verifier logsAndrii Nakryiko3-32/+55
Currently the way that verifier prints SCALAR_VALUE register state (and PTR_TO_PACKET, which can have var_off and ranges info as well) is very ambiguous. In the name of brevity we are trying to eliminate "unnecessary" output of umin/umax, smin/smax, u32_min/u32_max, and s32_min/s32_max values, if possible. Current rules are that if any of those have their default value (which for mins is the minimal value of its respective types: 0, S32_MIN, or S64_MIN, while for maxs it's U32_MAX, S32_MAX, S64_MAX, or U64_MAX) *OR* if there is another min/max value that as matching value. E.g., if smin=100 and umin=100, we'll emit only umin=10, omitting smin altogether. This approach has a few problems, being both ambiguous and sort-of incorrect in some cases. Ambiguity is due to missing value could be either default value or value of umin/umax or smin/smax. This is especially confusing when we mix signed and unsigned ranges. Quite often, umin=0 and smin=0, and so we'll have only `umin=0` leaving anyone reading verifier log to guess whether smin is actually 0 or it's actually -9223372036854775808 (S64_MIN). And often times it's important to know, especially when debugging tricky issues. "Sort-of incorrectness" comes from mixing negative and positive values. E.g., if umin is some large positive number, it can be equal to smin which is, interpreted as signed value, is actually some negative value. Currently, that smin will be omitted and only umin will be emitted with a large positive value, giving an impression that smin is also positive. Anyway, ambiguity is the biggest issue making it impossible to have an exact understanding of register state, preventing any sort of automated testing of verifier state based on verifier log. This patch is attempting to rectify the situation by removing ambiguity, while minimizing the verboseness of register state output. The rules are straightforward: - if some of the values are missing, then it definitely has a default value. I.e., `umin=0` means that umin is zero, but smin is actually S64_MIN; - all the various boundaries that happen to have the same value are emitted in one equality separated sequence. E.g., if umin and smin are both 100, we'll emit `smin=umin=100`, making this explicit; - we do not mix negative and positive values together, and even if they happen to have the same bit-level value, they will be emitted separately with proper sign. I.e., if both umax and smax happen to be 0xffffffffffffffff, we'll emit them both separately as `smax=-1,umax=18446744073709551615`; - in the name of a bit more uniformity and consistency, {u32,s32}_{min,max} are renamed to {s,u}{min,max}32, which seems to improve readability. The above means that in case of all 4 ranges being, say, [50, 100] range, we'd previously see hugely ambiguous: R1=scalar(umin=50,umax=100) Now, we'll be more explicit: R1=scalar(smin=umin=smin32=umin32=50,smax=umax=smax32=umax32=100) This is slightly more verbose, but distinct from the case when we don't know anything about signed boundaries and 32-bit boundaries, which under new rules will match the old case: R1=scalar(umin=50,umax=100) Also, in the name of simplicity of implementation and consistency, order for {s,u}32_{min,max} are emitted *before* var_off. Previously they were emitted afterwards, for unclear reasons. This patch also includes a few fixes to selftests that expect exact register state to accommodate slight changes to verifier format. You can see that the changes are pretty minimal in common cases. Note, the special case when SCALAR_VALUE register is a known constant isn't changed, we'll emit constant value once, interpreted as signed value. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20231011223728.3188086-5-andrii@kernel.org
2023-10-16selftests/bpf: Make align selftests more robustAndrii Nakryiko1-120/+121
Align subtest is very specific and finicky about expected verifier log output and format. This is often completely unnecessary as in a bunch of situations test actually cares about var_off part of register state. But given how exact it is right now, any tiny verifier log changes can lead to align tests failures, requiring constant adjustment. This patch tries to make this a bit more robust by making logic first search for specified register and then allowing to match only portion of register state, not everything exactly. This will come handly with follow up changes to SCALAR register output disambiguation. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20231011223728.3188086-4-andrii@kernel.org
2023-10-16selftests/bpf: Improve missed_kprobe_recursion test robustnessAndrii Nakryiko1-4/+4
Given missed_kprobe_recursion is non-serial and uses common testing kfuncs to count number of recursion misses it's possible that some other parallel test can trigger extraneous recursion misses. So we can't expect exactly 1 miss. Relax conditions and expect at least one. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20231011223728.3188086-3-andrii@kernel.org
2023-10-16selftests/bpf: Improve percpu_alloc test robustnessAndrii Nakryiko3-0/+14
Make these non-serial tests filter BPF programs by intended PID of a test runner process. This makes it isolated from other parallel tests that might interfere accidentally. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20231011223728.3188086-2-andrii@kernel.org
2023-10-16tsnep: Inline small fragments within TX descriptorGerhard Engleder2-26/+79
The tsnep network controller is able to extend the descriptor directly with data to be transmitted. In this case no TX data DMA address is necessary. Instead of the TX data DMA address the TX data buffer is placed at the end of the descriptor. The descriptor is read with a 64 bytes DMA read by the tsnep network controller. If the sum of descriptor data and TX data is less than or equal to 64 bytes, then no additional DMA read is necessary to read the TX data. Therefore, it makes sense to inline small fragments up to this limit within the descriptor ring. Inlined fragments need to be copied to the descriptor ring. On the other hand DMA mapping is not necessary. At most 40 bytes are copied, so copying should be faster than DMA mapping. For A53 1.2 GHz copying takes <100ns and DMA mapping takes >200ns. So inlining small fragments should result in lower CPU load. Performance improvement is small. Thus, comparision of CPU load with and without inlining of small fragments did not show any significant difference. With this optimization less DMA reads will be done, which decreases the load of the interconnect. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16Merge branch 'udp-tunnel-route-lookups'David S. Miller7-200/+147
Beniamino Galvani says: ==================== net: consolidate IPv4 route lookup for UDP tunnels At the moment different UDP tunnels rely on different functions for IPv4 route lookup, and those functions all implement the same logic. Only bareudp uses the generic ip_route_output_tunnel(), while geneve and vxlan basically duplicate it slightly differently. This series first extends the generic lookup function so that it is suitable for all UDP tunnel implementations. Then, bareudp, geneve and vxlan are adapted to use them. This results in code with less duplication and hopefully better maintainability. After this series is merged, IPv6 will be converted in a similar way. Changelog: v2 - fix compilation with IPv6 disabled ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16vxlan: use generic function for tunnel IPv4 route lookupBeniamino Galvani1-73/+41
The route lookup can be done now via generic function udp_tunnel_dst_lookup() to replace the custom implementations in vxlan_get_route(). Note that this patch only touches IPv4, while IPv6 still uses vxlan6_get_route(). After IPv6 route lookup gets converted as well, vxlan_xmit_one() can be simplified by removing local variables that will be passed via "struct ip_tunnel_key", such as remote_ip, local_ip, flow_flags, label. Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16geneve: use generic function for tunnel IPv4 route lookupBeniamino Galvani1-66/+32
The route lookup can be done now via generic function udp_tunnel_dst_lookup() to replace the custom implementation in geneve_get_v4_rt(). Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16geneve: add dsfield helper functionBeniamino Galvani1-11/+18
Add a helper function to compute the tos/dsfield. In this way, we can factor out some duplicate code. Also, the helper will be called from more places in the next commit. Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16ipv4: use tunnel flow flags for tunnel route lookupsBeniamino Galvani1-0/+1
Commit 451ef36bd229 ("ip_tunnels: Add new flow flags field to ip_tunnel_key") added a new field to struct ip_tunnel_key to control route lookups. Currently the flag is used by vxlan and geneve tunnels; use it also in udp_tunnel_dst_lookup() so that it affects all tunnel types relying on this function. Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16ipv4: add new arguments to udp_tunnel_dst_lookup()Beniamino Galvani3-20/+25
We want to make the function more generic so that it can be used by other UDP tunnel implementations such as geneve and vxlan. To do that, add the following arguments: - source and destination UDP port; - ifindex of the output interface, needed by vxlan; - the tos, because in some cases it is not taken from struct ip_tunnel_info (for example, when it's inherited from the inner packet); - the dst cache, because not all tunnel types (e.g. vxlan) want to use the one from struct ip_tunnel_info. With these parameters, the function no longer needs the full struct ip_tunnel_info as argument and we can pass only the relevant part of it (struct ip_tunnel_key). Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16ipv4: remove "proto" argument from udp_tunnel_dst_lookup()Beniamino Galvani3-5/+5
The function is now UDP-specific, the protocol is always IPPROTO_UDP. Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16ipv4: rename and move ip_route_output_tunnel()Beniamino Galvani5-58/+58
At the moment ip_route_output_tunnel() is used only by bareudp. Ideally, other UDP tunnel implementations should use it, but to do so the function needs to accept new parameters that are specific for UDP tunnels, such as the ports. Prepare for these changes by renaming the function to udp_tunnel_dst_lookup() and move it to file net/ipv4/udp_tunnel_core.c. Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16selftests: net: remove unused variableszhujun23-5/+3
These variables are never referenced in the code, just remove them Signed-off-by: zhujun2 <zhujun2@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16net: cxgb3: simplify logic for rspq_check_napiChristian Marangi1-6/+1
Simplify logic for rspq_check_napi. Drop redundant and wrong napi_is_scheduled call as it's not race free and directly use the output of napi_schedule to understand if a napi is pending or not. rspq_check_napi main logic is to check if is_new_response is true and check if a napi is not scheduled. The result of this function is then used to detect if we are missing some interrupt and act on top of this... With this knowing, we can rework and simplify the logic and make it less problematic with testing an internal bit for napi. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15Merge branch 'ptp-multiple-readers'David S. Miller9-60/+261
Xabier Marquiegui says: ==================== ptp: Support for multiple filtered timestamp event queue readers On systems with multiple timestamp event channels, there can be scenarios where multiple userspace readers want to access the timestamping data for various purposes. One such example is wanting to use a pps out for time synchronization, and wanting to timestamp external events with the synchronized time base simultaneously. Timestmp event consumers on the other hand, are often interested in a subset of the available timestamp channels. linuxptp ts2phc, for example, is not happy if more than one timestamping channel is active on the device it is reading from. Linked lists are introduced to support multiple timestamp event queue consumers, and timestamp event channel filters through IOCTLs, as well as a debugfs interface to do some simple verifications. Xabier Marquiegui (6): posix-clock: introduce posix_clock_context concept ptp: Replace timestamp event queue with linked list ptp: support multiple timestamp event readers ptp: support event queue reader channel masks ptp: add debugfs interface to see applied channel masks ptp: add testptp mask test drivers/ptp/ptp_chardev.c | 129 ++++++++++++++++---- drivers/ptp/ptp_clock.c | 45 ++++++- drivers/ptp/ptp_private.h | 28 +++-- drivers/ptp/ptp_sysfs.c | 13 +- include/linux/posix-clock.h | 35 ++++-- include/uapi/linux/ptp_clock.h | 2 + kernel/time/posix-clock.c | 36 ++++-- tools/testing/selftests/ptp/ptpchmaskfmt.sh | 14 +++ tools/testing/selftests/ptp/testptp.c | 19 ++- 9 files changed, 261 insertions(+), 60 deletions(-) create mode 100644 tools/testing/selftests/ptp/ptpchmaskfmt.sh --- v6: - correct commit message - correct coding style v5: https://lore.kernel.org/netdev/cover.1696804243.git.reibax@gmail.com/ - fix spelling on commit message - fix memory leak on ptp_open v4: https://lore.kernel.org/netdev/cover.1696511486.git.reibax@gmail.com/ - split modifications in different patches for improved organization - rename posix_clock_user to posix_clock_context - remove unnecessary flush_users clock operation - remove unnecessary tests - simpler queue clean procedure - fix/clean comment lines - simplified release procedures - filter modifications exclusive to currently open instance for simplicity and security - expand mask to 2048 channels - make more secure and simple: mask is only applied to the testptp instance. Use debugfs to verify effects. v3: https://lore.kernel.org/netdev/20230928133544.3642650-1-reibax@gmail.com/ - add this patchset overview file - fix use of safe and non safe linked lists for loops - introduce new posix_clock private_data and ida object ids for better dicrimination of timestamp consumers - safer resource release procedures - filter application by object id, aided by process id - friendlier testptp implementation of event queue channel filters v2: https://lore.kernel.org/netdev/20230912220217.2008895-1-reibax@gmail.com/ - fix ptp_poll() return value - Style changes to comform to checkpatch strict suggestions - more coherent ptp_read error exit routines - fix testptp compilation error: unknown type name 'pid_t' - rename mask variable for easier code traceability - more detailed commit message with two examples v1: https://lore.kernel.org/netdev/20230906104754.1324412-2-reibax@gmail.com/ ==================== Signed-off-by: Xabier Marquiegui <reibax@gmail.com> Suggested-by: Richard Cochran <richardcochran@gmail.com> Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15ptp: add testptp mask testXabier Marquiegui1-1/+18
Add option to test timestamp event queue mask manipulation in testptp. Option -F allows the user to specify a single channel that will be applied on the mask filter via IOCTL. The test program will maintain the file open until user input is received. This allows checking the effect of the IOCTL in debugfs. eg: Console 1: ``` Channel 12 exclusively enabled. Check on debugfs. Press any key to continue ``` Console 2: ``` 0x00000000 0x00000001 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 ``` Signed-off-by: Xabier Marquiegui <reibax@gmail.com> Suggested-by: Richard Cochran <richardcochran@gmail.com> Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15ptp: add debugfs interface to see applied channel masksXabier Marquiegui4-0/+39
Use debugfs to be able to view channel mask applied to every timestamp event queue. Every time the device is opened, a new entry is created in `$DEBUGFS_MOUNTPOINT/ptpN/$INSTANCE_ADDRESS/mask`. The mask value can be viewed grouped in 32bit decimal values using cat, or converted to hexadecimal with the included `ptpchmaskfmt.sh` script. 32 bit values are listed from least significant to most significant. Signed-off-by: Xabier Marquiegui <reibax@gmail.com> Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15ptp: support event queue reader channel masksXabier Marquiegui4-2/+41
On systems with multiple timestamp event channels, some readers might want to receive only a subset of those channels. Add the necessary modifications to support timestamp event channel filtering, including two IOCTL operations: - Clear all channels - Enable one channel The mask modification operations will be applied exclusively on the event queue assigned to the file descriptor used on the IOCTL operation, so the typical procedure to have a reader receiving only a subset of the enabled channels would be: - Open device file - ioctl: clear all channels - ioctl: enable one channel - start reading Calling the enable one channel ioctl more than once will result in multiple enabled channels. Signed-off-by: Xabier Marquiegui <reibax@gmail.com> Suggested-by: Richard Cochran <richardcochran@gmail.com> Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15ptp: support multiple timestamp event readersXabier Marquiegui4-29/+55
Use linked lists to create one event queue per open file. This enables simultaneous readers for timestamp event queues. Signed-off-by: Xabier Marquiegui <reibax@gmail.com> Suggested-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15ptp: Replace timestamp event queue with linked listXabier Marquiegui4-6/+42
Introduce linked lists to access the timestamp event queue. Signed-off-by: Xabier Marquiegui <reibax@gmail.com> Suggested-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15posix-clock: introduce posix_clock_context conceptXabier Marquiegui4-32/+76
Add the necessary structure to support custom private-data per posix-clock user. The previous implementation of posix-clock assumed all file open instances need access to the same clock structure on private_data. The need for individual data structures per file open instance has been identified when developing support for multiple timestamp event queue users for ptp_clock. Signed-off-by: Xabier Marquiegui <reibax@gmail.com> Suggested-by: Richard Cochran <richardcochran@gmail.com> Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15Merge branch 'dpll-phase-offset-phase-adjust'David S. Miller9-18/+517
Arkadiusz Kubalewski says: ==================== dpll: add phase-offset and phase-adjust Improve monitoring and control over dpll devices. Allow user to receive measurement of phase difference between signals on pin and dpll (phase-offset). Allow user to receive and control adjustable value of pin's signal phase (phase-adjust). v4->v5: - rebase series on top of net-next/main, fix conflict - remove redundant attribute type definition in subset definition v3->v4: - do not increase do version of uAPI header as it is not needed (v3 did not have this change) - fix spelling around commit messages, argument descriptions and docs - add missing extack errors on failure set callbacks for pin phase adjust and frequency - remove ice check if value is already set, now redundant as checked in the dpll subsystem v2->v3: - do not increase do version of uAPI header as it is not needed v1->v2: - improve handling for error case of requesting the phase adjust set - align handling for error case of frequency set request with the approach introduced for phase adjust ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15dpll: netlink/core: change pin frequency set behaviorArkadiusz Kubalewski1-8/+42
Align the approach of pin frequency set behavior with the approach introduced with pin phase adjust set. Fail the request if any of devices did not registered the callback ops. If callback op on any pin's registered device fails, return error and rollback the value to previous one. Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15ice: dpll: implement phase related callbacksArkadiusz Kubalewski2-4/+226
Implement new callback ops related to measurement and adjustment of signal phase for pin-dpll in ice driver. Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15dpll: netlink/core: add support for pin-dpll signal phase offset/adjustArkadiusz Kubalewski2-1/+155
Add callback ops for pin-dpll phase measurement. Add callback for pin signal phase adjustment. Add min and max phase adjustment values to pin proprties. Invoke callbacks in dpll_netlink.c when filling the pin details to provide user with phase related attribute values. Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-15dpll: spec: add support for pin-dpll signal phase offset/adjustArkadiusz Kubalewski4-4/+42
Add attributes for providing the user with: - measurement of signals phase offset between pin and dpll - ability to adjust the phase of pin signal Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>