summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-03-21net: dsa: mt7530: fix link-local frames that ingress vlan filtering portsArınç ÜNAL2-9/+23
Whether VLAN-aware or not, on every VID VLAN table entry that has the CPU port as a member of it, frames are set to egress the CPU port with the VLAN tag stacked. This is so that VLAN tags can be appended after hardware special tag (called DSA tag in the context of Linux drivers). For user ports on a VLAN-unaware bridge, frame ingressing the user port egresses CPU port with only the special tag. For user ports on a VLAN-aware bridge, frame ingressing the user port egresses CPU port with the special tag and the VLAN tag. This causes issues with link-local frames, specifically BPDUs, because the software expects to receive them VLAN-untagged. There are two options to make link-local frames egress untagged. Setting CONSISTENT or UNTAGGED on the EG_TAG bits on the relevant register. CONSISTENT means frames egress exactly as they ingress. That means egressing with the VLAN tag they had at ingress or egressing untagged if they ingressed untagged. Although link-local frames are not supposed to be transmitted VLAN-tagged, if they are done so, when egressing through a CPU port, the special tag field will be broken. BPDU egresses CPU port with VLAN tag egressing stacked, received on software: 00:01:25.104821 AF Unknown (382365846), length 106: | STAG | | VLAN | 0x0000: 0000 6c27 614d 4143 0001 0000 8100 0001 ..l'aMAC........ 0x0010: 0026 4242 0300 0000 0000 0000 6c27 614d .&BB........l'aM 0x0020: 4143 0000 0000 0000 6c27 614d 4143 0000 AC......l'aMAC.. 0x0030: 0000 1400 0200 0f00 0000 0000 0000 0000 ................ BPDU egresses CPU port with VLAN tag egressing untagged, received on software: 00:23:56.628708 AF Unknown (25215488), length 64: | STAG | 0x0000: 0000 6c27 614d 4143 0001 0000 0026 4242 ..l'aMAC.....&BB 0x0010: 0300 0000 0000 0000 6c27 614d 4143 0000 ........l'aMAC.. 0x0020: 0000 0000 6c27 614d 4143 0000 0000 1400 ....l'aMAC...... 0x0030: 0200 0f00 0000 0000 0000 0000 ............ BPDU egresses CPU port with VLAN tag egressing tagged, received on software: 00:01:34.311963 AF Unknown (25215488), length 64: | Mess | 0x0000: 0000 6c27 614d 4143 0001 0001 0026 4242 ..l'aMAC.....&BB 0x0010: 0300 0000 0000 0000 6c27 614d 4143 0000 ........l'aMAC.. 0x0020: 0000 0000 6c27 614d 4143 0000 0000 1400 ....l'aMAC...... 0x0030: 0200 0f00 0000 0000 0000 0000 ............ To prevent confusing the software, force the frame to egress UNTAGGED instead of CONSISTENT. This way, frames can't possibly be received TAGGED by software which would have the special tag field broken. VLAN Tag Egress Procedure For all frames, one of these options set the earliest in this order will apply to the frame: - EG_TAG in certain registers for certain frames. This will apply to frame with matching MAC DA or EtherType. - EG_TAG in the address table. This will apply to frame at its incoming port. - EG_TAG in the PVC register. This will apply to frame at its incoming port. - EG_CON and [EG_TAG per port] in the VLAN table. This will apply to frame at its outgoing port. - EG_TAG in the PCR register. This will apply to frame at its outgoing port. EG_TAG in certain registers for certain frames: PPPoE Discovery_ARP/RARP: PPP_EG_TAG and ARP_EG_TAG in the APC register. IGMP_MLD: IGMP_EG_TAG and MLD_EG_TAG in the IMC register. BPDU and PAE: BPDU_EG_TAG and PAE_EG_TAG in the BPC register. REV_01 and REV_02: R01_EG_TAG and R02_EG_TAG in the RGAC1 register. REV_03 and REV_0E: R03_EG_TAG and R0E_EG_TAG in the RGAC2 register. REV_10 and REV_20: R10_EG_TAG and R20_EG_TAG in the RGAC3 register. REV_21 and REV_UN: R21_EG_TAG and RUN_EG_TAG in the RGAC4 register. With this change, it can be observed that a bridge interface with stp_state and vlan_filtering enabled will properly block ports now. Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch") Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-21bpftool: Clean up HOST_CFLAGS, HOST_LDFLAGS for bootstrap bpftoolQuentin Monnet1-4/+5
Bpftool's Makefile uses $(HOST_CFLAGS) to build the bootstrap version of bpftool, in order to pick the flags for the host (where we run the bootstrap version) and not for the target system (where we plan to run the full bpftool binary). But we pass too much information through this variable. In particular, we set HOST_CFLAGS by copying most of the $(CFLAGS); but we do this after the feature detection for bpftool, which means that $(CFLAGS), hence $(HOST_CFLAGS), contain all macro definitions for using the different optional features. For example, -DHAVE_LLVM_SUPPORT may be passed to the $(HOST_CFLAGS), even though the LLVM disassembler is not used in the bootstrap version, and the related library may even be missing for the host architecture. A similar thing happens with the $(LDFLAGS), that we use unchanged for linking the bootstrap version even though they may contains flags to link against additional libraries. To address the $(HOST_CFLAGS) issue, we move the definition of $(HOST_CFLAGS) earlier in the Makefile, before the $(CFLAGS) update resulting from the feature probing - none of which being relevant to the bootstrap version. To clean up the $(LDFLAGS) for the bootstrap version, we introduce a dedicated $(HOST_LDFLAGS) variable that we base on $(LDFLAGS), before the feature probing as well. On my setup, the following macro and libraries are removed from the compiler invocation to build bpftool after this patch: -DUSE_LIBCAP -DHAVE_LLVM_SUPPORT -I/usr/lib/llvm-17/include -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -lLLVM-17 -L/usr/lib/llvm-17/lib Another advantage of cleaning up these flags is that displaying available features with "bpftool version" becomes more accurate for the bootstrap bpftool, and no longer reflects the features detected (and available only) for the final binary. Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Signed-off-by: Quentin Monnet <qmo@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Message-ID: <20240320014103.45641-1-qmo@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-21Merge branch 'report-rcu-qs-for-busy-network-kthreads'Jakub Kicinski3-0/+37
Yan Zhai says: ==================== Report RCU QS for busy network kthreads This changeset fixes a common problem for busy networking kthreads. These threads, e.g. NAPI threads, typically will do: * polling a batch of packets * if there are more work, call cond_resched() to allow scheduling * continue to poll more packets when rx queue is not empty We observed this being a problem in production, since it can block RCU tasks from making progress under heavy load. Investigation indicates that just calling cond_resched() is insufficient for RCU tasks to reach quiescent states. This also has the side effect of frequently clearing the TIF_NEED_RESCHED flag on voluntary preempt kernels. As a result, schedule() will not be called in these circumstances, despite schedule() in fact provides required quiescent states. This at least affects NAPI threads, napi_busy_loop, and also cpumap kthread. By reporting RCU QSes in these kthreads periodically before cond_resched, the blocked RCU waiters can correctly progress. Instead of just reporting QS for RCU tasks, these code share the same concern as noted in the commit d28139c4e967 ("rcu: Apply RCU-bh QSes to RCU-sched and RCU-preempt when safe"). So report a consolidated QS for safety. It is worth noting that, although this problem is reproducible in napi_busy_loop, it only shows up when setting the polling interval to as high as 2ms, which is far larger than recommended 50us-100us in the documentation. So napi_busy_loop is left untouched. Lastly, this does not affect RT kernels, which does not enter the scheduler through cond_resched(). Without the mentioned side effect, schedule() will be called time by time, and clear the RCU task holdouts. V4: https://lore.kernel.org/bpf/cover.1710525524.git.yan@cloudflare.com/ V3: https://lore.kernel.org/lkml/20240314145459.7b3aedf1@kernel.org/t/ V2: https://lore.kernel.org/bpf/ZeFPz4D121TgvCje@debian.debian/ V1: https://lore.kernel.org/lkml/Zd4DXTyCf17lcTfq@debian.debian/#t ==================== Link: https://lore.kernel.org/r/cover.1710877680.git.yan@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-21bpf: report RCU QS in cpumap kthreadYan Zhai1-0/+3
When there are heavy load, cpumap kernel threads can be busy polling packets from redirect queues and block out RCU tasks from reaching quiescent states. It is insufficient to just call cond_resched() in such context. Periodically raise a consolidated RCU QS before cond_resched fixes the problem. Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP") Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/c17b9f1517e19d813da3ede5ed33ee18496bb5d8.1710877680.git.yan@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-21net: report RCU QS on threaded NAPI repollingYan Zhai1-0/+3
NAPI threads can keep polling packets under load. Currently it is only calling cond_resched() before repolling, but it is not sufficient to clear out the holdout of RCU tasks, which prevent BPF tracing programs from detaching for long period. This can be reproduced easily with following set up: ip netns add test1 ip netns add test2 ip -n test1 link add veth1 type veth peer name veth2 netns test2 ip -n test1 link set veth1 up ip -n test1 link set lo up ip -n test2 link set veth2 up ip -n test2 link set lo up ip -n test1 addr add 192.168.1.2/31 dev veth1 ip -n test1 addr add 1.1.1.1/32 dev lo ip -n test2 addr add 192.168.1.3/31 dev veth2 ip -n test2 addr add 2.2.2.2/31 dev lo ip -n test1 route add default via 192.168.1.3 ip -n test2 route add default via 192.168.1.2 for i in `seq 10 210`; do for j in `seq 10 210`; do ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201 done done ip netns exec test2 ethtool -K veth2 gro on ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded' ip netns exec test1 ethtool -K veth1 tso off Then run an iperf3 client/server and a bpftrace script can trigger it: ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null& ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null& bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}' Report RCU quiescent states periodically will resolve the issue. Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support") Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/4c3b0d3f32d3b18949d75b18e5e1d9f13a24f025.1710877680.git.yan@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-21rcu: add a helper to report consolidated flavor QSYan Zhai1-0/+31
When under heavy load, network processing can run CPU-bound for many tens of seconds. Even in preemptible kernels (non-RT kernel), this can block RCU Tasks grace periods, which can cause trace-event removal to take more than a minute, which is unacceptably long. This commit therefore creates a new helper function that passes through both RCU and RCU-Tasks quiescent states every 100 milliseconds. This hard-coded value suffices for current workloads. Suggested-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Yan Zhai <yan@cloudflare.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/90431d46ee112d2b0af04dbfe936faaca11810a5.1710877680.git.yan@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-21ionic: update documentation for XDP supportShannon Nelson1-0/+22
Add information to our documentation for the XDP features and related ethtool stats. While we're here, we also add the missing timestamp stats. Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240319163534.38796-1-shannon.nelson@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-21lib/bitmap: Fix bitmap_scatter() and bitmap_gather() kernel docHerve Codina1-21/+23
The make htmldoc command failed with the following error ... include/linux/bitmap.h:524: ERROR: Unexpected indentation. ... include/linux/bitmap.h:524: CRITICAL: Unexpected section title or transition. Move the visual representation to a literal block. Fixes: de5f84338970 ("lib/bitmap: Introduce bitmap_scatter() and bitmap_gather() helpers") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/linux-kernel/20240312153059.3ffde1b7@canb.auug.org.au/ Signed-off-by: Herve Codina <herve.codina@bootlin.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Acked-by: Yury Norov <yury.norov@gmail.com> Link: https://lore.kernel.org/r/20240314120006.458580-1-herve.codina@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-21Merge tag 'v6.9-rc-smb3-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds15-147/+716
Pull smb server updates from Steve French: - add support for durable file handles (an important data integrity feature) - fixes for potential out of bounds issues - fix possible null dereference in close - getattr fixes - trivial typo fix and minor cleanup * tag 'v6.9-rc-smb3-server-fixes' of git://git.samba.org/ksmbd: ksmbd: remove module version ksmbd: fix potencial out-of-bounds when buffer offset is invalid ksmbd: fix slab-out-of-bounds in smb_strndup_from_utf16() ksmbd: Fix spelling mistake "connction" -> "connection" ksmbd: fix possible null-deref in smb_lazy_parent_lease_break_close ksmbd: add support for durable handles v1/v2 ksmbd: mark SMB2_SESSION_EXPIRED to session when destroying previous session ksmbd: retrieve number of blocks using vfs_getattr in set_file_allocation_info ksmbd: replace generic_fillattr with vfs_getattr
2024-03-21Merge tag 'trace-tools-v6.9' of ↵Linus Torvalds21-306/+650
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull trace tool updates from Steven Rostedt: "Tracing: - Update makefiles for latency-collector and RTLA, using tools/build/ makefiles like perf does, inheriting its benefits. For example, having a proper way to handle library dependencies. - The timerlat tracer has an interface for any tool to use. rtla timerlat tool uses this interface dispatching its own threads as workload. But, rtla timerlat could also be used for any other process. So, add 'rtla timerlat -U' option, allowing the timerlat tool to measure the latency of any task using the timerlat tracer interface. Verification: - Update makefiles for verification/rv, using tools/build/ makefiles like perf does, inheriting its benefits. For example, having a proper way to handle dependencies" * tag 'trace-tools-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tools/rtla: Add -U/--user-load option to timerlat tools/verification: Use tools/build makefiles on rv tools/rtla: Use tools/build makefiles to build rtla tools/tracing: Use tools/build makefiles on latency-collector
2024-03-21netfilter: nf_tables: do not compare internal table flags on updatesPablo Neira Ayuso1-1/+1
Restore skipping transaction if table update does not modify flags. Fixes: 179d9ba5559a ("netfilter: nf_tables: fix table flag updates") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-03-21netfilter: nft_set_pipapo: release elements in clone only from destroy pathPablo Neira Ayuso1-4/+1
Clone already always provides a current view of the lookup table, use it to destroy the set, otherwise it is possible to destroy elements twice. This fix requires: 212ed75dc5fb ("netfilter: nf_tables: integrate pipapo into commit protocol") which came after: 9827a0e6e23b ("netfilter: nft_set_pipapo: release elements in clone from abort path"). Fixes: 9827a0e6e23b ("netfilter: nft_set_pipapo: release elements in clone from abort path") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-03-20kconfig: tests: test dependency after shuffling choicesMasahiro Yamada5-0/+71
Commit c8fb7d7e48d1 ("kconfig: fix broken dependency in randconfig- generated .config") fixed the issue, but I did not add a test case. This commit adds a test case that emulates the reported situation. The test would fail without c8fb7d7e48d1. To handle the choice "choose X", FOO must be calculated beforehand. FOO depends on A, which is a member of another choice "choose A or B". Kconfig _temporarily_ assumes the value of A to proceed. The choice "choose A or B" will be shuffled later, but the result may or may not meet "FOO depends on A". Kconfig should invalidate the symbol values and recompute them. In the real example for ARCH=arm64, the choice "Instrumentation type" needs the value of CPU_BIG_ENDIAN. The choice "Endianness" will be shuffled later. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-03-20kconfig: tests: add a test for randconfig with dependent choicesMasahiro Yamada5-0/+78
Since commit 3b9a19e08960 ("kconfig: loop as long as we changed some symbols in randconfig"), conf_set_all_new_symbols() is repeated until there is no more choice left to be shuffled. The motivation was to shuffle a choice nested in another choice. Although commit 09d5873e4d1f ("kconfig: allow only 'config', 'comment', and 'if' inside 'choice'") disallowed the nested choice structure, we must still keep 3b9a19e08960 because there are still cases where conf_set_all_new_symbols() must iterate. scripts/kconfig/tests/choice_randomize/Kconfig is the test case. The second choice depends on 'B', which is the member of the first choice. With 3b9a19e08960 reverted, we would never get the pattern specified by scripts/kconfig/tests/choice_randomize/expected_config2. A real example can be found in lib/Kconfig.debug. Without 3b9a19e08960, the randconfig would not shuffle the "Compressed Debug information" choice, which depends on DEBUG_INFO, which is derived from another choice "Debug information". My goal is to refactor Kconfig so that randconfig will work more simply, without using the loop. For now, let's add a test case to ensure all dependent choices are shuffled, as it is a somewhat tricky case for the current Kconfig. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-03-20kconfig: tests: support KCONFIG_SEED for the randconfig runnerMasahiro Yamada1-6/+10
This will help get consistent results for randconfig tests. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-03-20Merge tag 'docs-6.9-2' of git://git.lwn.net/linuxLinus Torvalds5-187/+220
Pull more documentation updates from Jonathan Corbet: "A handful of late-arriving documentation fixes and enhancements" * tag 'docs-6.9-2' of git://git.lwn.net/linux: docs: verify/bisect: remove a level of indenting docs: verify/bisect: drop 'v' prefix, EOL aspect, and assorted fixes docs: verify/bisect: check taint flag docs: verify/bisect: improve install instructions docs: handling-regressions.rst: Update regzbot command fixed-by to fix docs: *-regressions.rst: Add colon to regzbot commands doc: Fix typo in admin-guide/cifs/introduction.rst README: Fix spelling
2024-03-20Merge branch 'octeontx2-pf-mbox-fixes'David S. Miller10-88/+225
Subbaraya Sundeep says: ==================== octeontx2-pf: RVU Mailbox fixes This patchset fixes the problems related to RVU mailbox. During long run tests some times VF commands like setting MTU or toggling interface fails because VF mailbox is timedout waiting for response from PF. Below are the fixes Patch 1: There are two types of messages in RVU mailbox namely up and down messages. Down messages are synchronous messages where a PF/VF sends a message to AF and AF replies back with response. UP messages are notifications and are asynchronous like AF sending link events to PF. When VF sends a down message to PF, PF forwards to AF and sends the response from AF back to VF. PF has to forward VF messages since there is no path in hardware for VF to send directly to AF. There is one mailbox interrupt from AF to PF when raised could mean two scenarios one is where AF sending reply to PF for a down message sent by PF and another one is AF sending up message asynchronously when link changed for that PF. Receiving the up message interrupt while PF is in middle of forwarding down message causes mailbox errors. Fix this by receiver detecting the type of message from the mbox data register set by sender. Patch 2: During VF driver remove, VF has to wait until last message is completed and then turn off mailbox interrupts from PF. Patch 3: Do not use ordered workqueue for message processing since multiple works are queued simultaneously by all the VFs and PF link UP messages. Patch 4: When sending link event to VF by PF check whether VF is really up to receive this message. Patch 5: In AF driver, use separate interrupt handlers for the AF-VF interrupt and AF-PF interrupt. Sometimes both interrupts are raised to two CPUs at same time and both CPUs execute same function at same time corrupting the data. v2 changes: Added missing mutex unlock in error path in patch 1 Refactored if else logic in patch 1 as suggested by Paolo Abeni ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-20octeontx2-af: Use separate handlers for interruptsSubbaraya Sundeep1-3/+14
For PF to AF interrupt vector and VF to AF vector same interrupt handler is registered which is causing race condition. When two interrupts are raised to two CPUs at same time then two cores serve same event corrupting the data. Fixes: 7304ac4567bc ("octeontx2-af: Add mailbox IRQ and msg handlers") Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-20octeontx2-pf: Send UP messages to VF only when VF is up.Subbaraya Sundeep1-0/+3
When PF sending link status messages to VF, it is possible that by the time link_event_task work function is executed VF might have brought down. Hence before sending VF link status message check whether VF is up to receive it. Fixes: ad513ed938c9 ("octeontx2-vf: Link event notification support") Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-20octeontx2-pf: Use default max_active works instead of oneSubbaraya Sundeep1-3/+4
Only one execution context for the workqueue used for PF and VFs mailbox communication is incorrect since multiple works are queued simultaneously by all the VFs and PF link UP messages. Hence use default number of execution contexts by passing zero as max_active to alloc_workqueue function. With this fix in place, modify UP messages also to wait until completion. Fixes: d424b6c02415 ("octeontx2-pf: Enable SRIOV and added VF mbox handling") Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-20octeontx2-pf: Wait till detach_resources msg is completeSubbaraya Sundeep2-2/+2
During VF driver remove, a message is sent to detach VF resources to PF but VF is not waiting until message is complete. Also mailbox interrupts need to be turned off after the detach resource message is complete. This patch fixes that problem. Fixes: 05fcc9e08955 ("octeontx2-pf: Attach NIX and NPA block LFs") Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-20octeontx2: Detect the mbox up or down message via registerSubbaraya Sundeep9-83/+205
A single line of interrupt is used to receive up notifications and down reply messages from AF to PF (similarly from PF to its VF). PF acts as bridge and forwards VF messages to AF and sends respsones back from AF to VF. When an async event like link event is received by up message when PF is in middle of forwarding VF message then mailbox errors occur because PF state machine is corrupted. Since VF is a separate driver or VF driver can be in a VM it is not possible to serialize from the start of communication at VF. Hence to differentiate between type of messages at PF this patch makes sender to set mbox data register with distinct values for up and down messages. Sender also checks whether previous interrupt is received before triggering current interrupt by waiting for mailbox data register to become zero. Fixes: 5a6d7c9daef3 ("octeontx2-pf: Mailbox communication with AF") Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-20selftests/bpf: scale benchmark counting by using per-CPU countersAndrii Nakryiko2-17/+62
When benchmarking with multiple threads (-pN, where N>1), we start contending on single atomic counter that both BPF trigger benchmarks are using, as well as "baseline" tests in user space (trig-base and trig-uprobe-base benchmarks). As such, we start bottlenecking on something completely irrelevant to benchmark at hand. Scale counting up by using per-CPU counters on BPF side. On use space side we do the next best thing: hash thread ID to approximate per-CPU behavior. It seems to work quite well in practice. To demonstrate the difference, I ran three benchmarks with 1, 2, 4, 8, 16, and 32 threads: - trig-uprobe-base (no syscalls, pure tight counting loop in user-space); - trig-base (get_pgid() syscall, atomic counter in user-space); - trig-fentry (syscall to trigger fentry program, atomic uncontended per-CPU counter on BPF side). Command used: for b in uprobe-base base fentry; do \ for p in 1 2 4 8 16 32; do \ printf "%-11s %2d: %s\n" $b $p \ "$(sudo ./bench -w2 -d5 -a -p$p trig-$b | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)"; \ done; \ done Before these changes, aggregate throughput across all threads doesn't scale well with number of threads, it actually even falls sharply for uprobe-base due to a very high contention: uprobe-base 1: 138.998 ± 0.650M/s uprobe-base 2: 70.526 ± 1.147M/s uprobe-base 4: 63.114 ± 0.302M/s uprobe-base 8: 54.177 ± 0.138M/s uprobe-base 16: 45.439 ± 0.057M/s uprobe-base 32: 37.163 ± 0.242M/s base 1: 16.940 ± 0.182M/s base 2: 19.231 ± 0.105M/s base 4: 21.479 ± 0.038M/s base 8: 23.030 ± 0.037M/s base 16: 22.034 ± 0.004M/s base 32: 18.152 ± 0.013M/s fentry 1: 14.794 ± 0.054M/s fentry 2: 17.341 ± 0.055M/s fentry 4: 23.792 ± 0.024M/s fentry 8: 21.557 ± 0.047M/s fentry 16: 21.121 ± 0.004M/s fentry 32: 17.067 ± 0.023M/s After these changes, we see almost perfect linear scaling, as expected. The sub-linear scaling when going from 8 to 16 threads is interesting and consistent on my test machine, but I haven't investigated what is causing it this peculiar slowdown (across all benchmarks, could be due to hyperthreading effects, not sure). uprobe-base 1: 139.980 ± 0.648M/s uprobe-base 2: 270.244 ± 0.379M/s uprobe-base 4: 532.044 ± 1.519M/s uprobe-base 8: 1004.571 ± 3.174M/s uprobe-base 16: 1720.098 ± 0.744M/s uprobe-base 32: 3506.659 ± 8.549M/s base 1: 16.869 ± 0.071M/s base 2: 33.007 ± 0.092M/s base 4: 64.670 ± 0.203M/s base 8: 121.969 ± 0.210M/s base 16: 207.832 ± 0.112M/s base 32: 424.227 ± 1.477M/s fentry 1: 14.777 ± 0.087M/s fentry 2: 28.575 ± 0.146M/s fentry 4: 56.234 ± 0.176M/s fentry 8: 106.095 ± 0.385M/s fentry 16: 181.440 ± 0.032M/s fentry 32: 369.131 ± 0.693M/s Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Message-ID: <20240315213329.1161589-1-andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-20bpftool: Remove unnecessary source files from bootstrap versionQuentin Monnet1-4/+1
Commit d510296d331a ("bpftool: Use syscall/loader program in "prog load" and "gen skeleton" command.") added new files to the list of objects to compile in order to build the bootstrap version of bpftool. As far as I can tell, these objects are unnecessary and were added by mistake; maybe a draft version intended to add support for loading loader programs from the bootstrap version. Anyway, we can remove these object files from the list to make the bootstrap bpftool binary a tad smaller and faster to build. Fixes: d510296d331a ("bpftool: Use syscall/loader program in "prog load" and "gen skeleton" command.") Signed-off-by: Quentin Monnet <qmo@kernel.org> Message-ID: <20240320013457.44808-1-qmo@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-20bpftool: Enable libbpf logs when loading pid_iter in debug modeQuentin Monnet1-7/+12
When trying to load the pid_iter BPF program used to iterate over the PIDs of the processes holding file descriptors to BPF links, we would unconditionally silence libbpf in order to keep the output clean if the kernel does not support iterators and loading fails. Although this is the desirable behaviour in most cases, this may hide bugs in the pid_iter program that prevent it from loading, and it makes it hard to debug such load failures, even in "debug" mode. Instead, it makes more sense to print libbpf's logs when we pass the -d|--debug flag to bpftool, so that users get the logs to investigate failures without having to edit bpftool's source code. Signed-off-by: Quentin Monnet <qmo@kernel.org> Message-ID: <20240320012241.42991-1-qmo@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-20Merge branch 'bpf-raw-tracepoint-support-for-bpf-cookie'Alexei Starovoitov14-50/+248
Andrii Nakryiko says: ==================== BPF raw tracepoint support for BPF cookie Add ability to specify and retrieve BPF cookie for raw tracepoint programs. Both BTF-aware (SEC("tp_btf")) and non-BTF-aware (SEC("raw_tp")) are supported, as they are exactly the same at runtime. This issue recently came up in production use cases, where custom tried to switch from slower classic tracepoints to raw tracepoints and ran into this limitation. Luckily, it's not that hard to support this for raw_tp programs. v2->v3: - s/bpf_raw_tp_open/bpf_raw_tracepoint_open_opts/ (Alexei, Eduard); v1->v2: - fixed type definition for stubs of bpf_probe_{register,unregister}; - added __u32 :u32 and aligned raw_tp fields (Jiri); - added Stanislav's ack. ==================== Link: https://lore.kernel.org/r/20240319233852.1977493-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-20selftests/bpf: add raw_tp/tp_btf BPF cookie subtestsAndrii Nakryiko2-1/+129
Add test validating BPF cookie can be passed during raw_tp/tp_btf attachment and can be retried at runtime with bpf_get_attach_cookie() helper. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Message-ID: <20240319233852.1977493-6-andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-20libbpf: add support for BPF cookie for raw_tp/tp_btf programsAndrii Nakryiko5-5/+53
Wire up BPF cookie passing or raw_tp and tp_btf programs, both in low-level and high-level APIs. Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Message-ID: <20240319233852.1977493-5-andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-20bpf: support BPF cookie in raw tracepoint (raw_tp, tp_btf) programsAndrii Nakryiko5-6/+28
Wire up BPF cookie for raw tracepoint programs (both BTF and non-BTF aware variants). This brings them up to part w.r.t. BPF cookie usage with classic tracepoint and fentry/fexit programs. Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Message-ID: <20240319233852.1977493-4-andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-20bpf: pass whole link instead of prog when triggering raw tracepointAndrii Nakryiko5-33/+38
Instead of passing prog as an argument to bpf_trace_runX() helpers, that are called from tracepoint triggering calls, store BPF link itself (struct bpf_raw_tp_link for raw tracepoints). This will allow to pass extra information like BPF cookie into raw tracepoint registration. Instead of replacing `struct bpf_prog *prog = __data;` with corresponding `struct bpf_raw_tp_link *link = __data;` assignment in `__bpf_trace_##call` I just passed `__data` through into underlying bpf_trace_runX() call. This works well because we implicitly cast `void *`, and it also avoids naming clashes with arguments coming from tracepoint's "proto" list. We could have run into the same problem with "prog", we just happened to not have a tracepoint that has "prog" input argument. We are less lucky with "link", as there are tracepoints using "link" argument name already. So instead of trying to avoid naming conflicts, let's just remove intermediate local variable. It doesn't hurt readibility, it's either way a bit of a maze of calls and macros, that requires careful reading. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Message-ID: <20240319233852.1977493-3-andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-20bpf: flatten bpf_probe_register call chainAndrii Nakryiko1-6/+1
bpf_probe_register() and __bpf_probe_register() have identical signatures and bpf_probe_register() just redirect to __bpf_probe_register(). So get rid of this extra function call step to simplify following the source code. It has no difference at runtime due to inlining, of course. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Message-ID: <20240319233852.1977493-2-andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-20tools/rtla: Add -U/--user-load option to timerlatDaniel Bristot de Oliveira4-9/+101
The timerlat tracer provides an interface for any application to wait for the timerlat's periodic wakeup. Currently, rtla timerlat uses it to dispatch its user-space workload (-u option). But as the tracer interface is generic, rtla timerlat can also be used to monitor any workload that uses it. For example, a user might place their own workload to wait on the tracer interface, and monitor the results with rtla timerlat. Add the -U option to rtla timerlat top and hist. With this option, rtla timerlat will not dispatch its workload but only setting up the system, waiting for a user to dispatch its workload. The sample code in this patch is an example of python application that loops in the timerlat tracer fd. To use it, dispatch: # rtla timerlat -U In a terminal, then run the python program on another terminal, specifying the CPU to run it. For example, setting on CPU 1: #./timerlat_load.py 1 Then rtla timerlat will start printing the statistics of the ./timerlat_load.py app. An interesting point is that the "Ret user Timer Latency" value is the overall response time of the load. The sample load does a memory copy to exemplify that. The stop tracing options on rtla timerlat works in this setup as well, including auto analysis. Link: https://lkml.kernel.org/r/36e6bcf18fe15c7601048fd4c65aeb193c502cc8.1707229706.git.bristot@kernel.org Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
2024-03-20tools/verification: Use tools/build makefiles on rvDaniel Bristot de Oliveira6-133/+183
Use tools/build/ makefiles to build rv, inheriting the benefits of it. For example, having a proper way to handle dependencies. Link: https://lkml.kernel.org/r/2a38a8f7b8dc65fa790381ec9ab42fb62beb2e25.1710519524.git.bristot@kernel.org Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: John Kacur <jkacur@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
2024-03-20tools/rtla: Use tools/build makefiles to build rtlaDaniel Bristot de Oliveira7-145/+244
Use tools/build/ makefiles to build rtla, inheriting the benefits of it. For example, having a proper way to handle dependencies. rtla is built using perf infra-structure when building inside the kernel tree. At this point, rtla diverges from perf in two points: Documentation and tarball generation/build. At the documentation level, rtla is one step ahead, placing the documentation at Documentation/tools/rtla/, using the same build tools as kernel documentation. The idea is to move perf documentation to the same scheme and then share the same makefiles. rtla has a tarball target that the (old) RHEL8 uses. The tarball was kept using a simple standalone makefile for compatibility. The standalone makefile shares most of the code, e.g., flags, with regular buildings. The tarball method was set as deprecated. If necessary, we can make a rtla tarball like perf, which includes the entire tools/build. But this would also require changes in the user side (the directory structure changes, and probably the deps to build the package). Inspired on perf and objtool. Link: https://lkml.kernel.org/r/57563abf2715d22515c0c54a87cff3849eca5d52.1710519524.git.bristot@kernel.org Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: John Kacur <jkacur@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
2024-03-20tools/tracing: Use tools/build makefiles on latency-collectorDaniel Bristot de Oliveira4-19/+122
Use tools/build/ makefiles to build latency-collector, inheriting the benefits of it. For example: Before this patch, a missing tracefs/traceevents headers will result in fail like this: ~/linux/tools/tracing/latency $ make cc -Wall -Wextra -g -O2 -o latency-collector latency-collector.c -lpthread latency-collector.c:26:10: fatal error: tracefs.h: No such file or directory 26 | #include <tracefs.h> | ^~~~~~~~~~~ compilation terminated. make: *** [Makefile:14: latency-collector] Error 1 Which is not that helpful. After this change it reports: ~/linux/tools/tracing/latency# make Auto-detecting system features: ... libtraceevent: [ OFF ] ... libtracefs: [ OFF ] libtraceevent is missing. Please install libtraceevent-dev/libtraceevent-devel libtracefs is missing. Please install libtracefs-dev/libtracefs-devel Makefile.config:29: *** Please, check the errors above.. Stop. This type of output is common across other tools in tools/ like perf and objtool. Link: https://lkml.kernel.org/r/872420b0880b11304e4ba144a0086c6478c5b469.1710519524.git.bristot@kernel.org Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: John Kacur <jkacur@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
2024-03-20Merge tag 'ipsec-2024-03-19' of ↵Jakub Kicinski4-9/+20
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2024-03-19 1) Fix possible page_pool leak triggered by esp_output. From Dragos Tatulea. 2) Fix UDP encapsulation in software GSO path. From Leon Romanovsky. * tag 'ipsec-2024-03-19' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: Allow UDP encapsulation only in offload modes net: esp: fix bad handling of pages from page_pool ==================== Link: https://lore.kernel.org/r/20240319110151.409825-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-20devlink: fix port new reply cmd typeJiri Pirko1-1/+1
Due to a c&p error, port new reply fills-up cmd with wrong value, any other existing port command replies and notifications. Fix it by filling cmd with value DEVLINK_CMD_PORT_NEW. Skimmed through devlink userspace implementations, none of them cares about this cmd value. Reported-by: Chenyuan Yang <chenyuan0y@gmail.com> Closes: https://lore.kernel.org/all/ZfZcDxGV3tSy4qsV@cy-server/ Fixes: cd76dcd68d96 ("devlink: Support add and delete devlink port") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://lore.kernel.org/r/20240318091908.2736542-1-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-20tcp: Clear req->syncookie in reqsk_alloc().Kuniyuki Iwashima3-1/+12
syzkaller reported a read of uninit req->syncookie. [0] Originally, req->syncookie was used only in tcp_conn_request() to indicate if we need to encode SYN cookie in SYN+ACK, so the field remains uninitialised in other places. The commit 695751e31a63 ("bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().") added another meaning in ACK path; req->syncookie is set true if SYN cookie is validated by BPF kfunc. After the change, cookie_v[46]_check() always read req->syncookie, but it is not initialised in the normal SYN cookie case as reported by KMSAN. Let's make sure we always initialise req->syncookie in reqsk_alloc(). [0]: BUG: KMSAN: uninit-value in cookie_v4_check+0x22b7/0x29e0 net/ipv4/syncookies.c:477 cookie_v4_check+0x22b7/0x29e0 net/ipv4/syncookies.c:477 tcp_v4_cookie_check net/ipv4/tcp_ipv4.c:1855 [inline] tcp_v4_do_rcv+0xb17/0x10b0 net/ipv4/tcp_ipv4.c:1914 tcp_v4_rcv+0x4ce4/0x5420 net/ipv4/tcp_ipv4.c:2322 ip_protocol_deliver_rcu+0x2a3/0x13d0 net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x332/0x500 net/ipv4/ip_input.c:233 NF_HOOK include/linux/netfilter.h:314 [inline] ip_local_deliver+0x21f/0x490 net/ipv4/ip_input.c:254 dst_input include/net/dst.h:460 [inline] ip_rcv_finish+0x4a2/0x520 net/ipv4/ip_input.c:449 NF_HOOK include/linux/netfilter.h:314 [inline] ip_rcv+0xcd/0x380 net/ipv4/ip_input.c:569 __netif_receive_skb_one_core net/core/dev.c:5538 [inline] __netif_receive_skb+0x319/0x9e0 net/core/dev.c:5652 process_backlog+0x480/0x8b0 net/core/dev.c:5981 __napi_poll+0xe7/0x980 net/core/dev.c:6632 napi_poll net/core/dev.c:6701 [inline] net_rx_action+0x89d/0x1820 net/core/dev.c:6813 __do_softirq+0x1c0/0x7d7 kernel/softirq.c:554 do_softirq+0x9a/0x100 kernel/softirq.c:455 __local_bh_enable_ip+0x9f/0xb0 kernel/softirq.c:382 local_bh_enable include/linux/bottom_half.h:33 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:820 [inline] __dev_queue_xmit+0x2776/0x52c0 net/core/dev.c:4362 dev_queue_xmit include/linux/netdevice.h:3091 [inline] neigh_hh_output include/net/neighbour.h:526 [inline] neigh_output include/net/neighbour.h:540 [inline] ip_finish_output2+0x187a/0x1b70 net/ipv4/ip_output.c:235 __ip_finish_output+0x287/0x810 ip_finish_output+0x4b/0x550 net/ipv4/ip_output.c:323 NF_HOOK_COND include/linux/netfilter.h:303 [inline] ip_output+0x15f/0x3f0 net/ipv4/ip_output.c:433 dst_output include/net/dst.h:450 [inline] ip_local_out net/ipv4/ip_output.c:129 [inline] __ip_queue_xmit+0x1e93/0x2030 net/ipv4/ip_output.c:535 ip_queue_xmit+0x60/0x80 net/ipv4/ip_output.c:549 __tcp_transmit_skb+0x3c70/0x4890 net/ipv4/tcp_output.c:1462 tcp_transmit_skb net/ipv4/tcp_output.c:1480 [inline] tcp_write_xmit+0x3ee1/0x8900 net/ipv4/tcp_output.c:2792 __tcp_push_pending_frames net/ipv4/tcp_output.c:2977 [inline] tcp_send_fin+0xa90/0x12e0 net/ipv4/tcp_output.c:3578 tcp_shutdown+0x198/0x1f0 net/ipv4/tcp.c:2716 inet_shutdown+0x33f/0x5b0 net/ipv4/af_inet.c:923 __sys_shutdown_sock net/socket.c:2425 [inline] __sys_shutdown net/socket.c:2437 [inline] __do_sys_shutdown net/socket.c:2445 [inline] __se_sys_shutdown+0x2a4/0x440 net/socket.c:2443 __x64_sys_shutdown+0x6c/0xa0 net/socket.c:2443 do_syscall_64+0xd5/0x1f0 entry_SYSCALL_64_after_hwframe+0x6d/0x75 Uninit was stored to memory at: reqsk_alloc include/net/request_sock.h:148 [inline] inet_reqsk_alloc+0x651/0x7a0 net/ipv4/tcp_input.c:6978 cookie_tcp_reqsk_alloc+0xd4/0x900 net/ipv4/syncookies.c:328 cookie_tcp_check net/ipv4/syncookies.c:388 [inline] cookie_v4_check+0x289f/0x29e0 net/ipv4/syncookies.c:420 tcp_v4_cookie_check net/ipv4/tcp_ipv4.c:1855 [inline] tcp_v4_do_rcv+0xb17/0x10b0 net/ipv4/tcp_ipv4.c:1914 tcp_v4_rcv+0x4ce4/0x5420 net/ipv4/tcp_ipv4.c:2322 ip_protocol_deliver_rcu+0x2a3/0x13d0 net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x332/0x500 net/ipv4/ip_input.c:233 NF_HOOK include/linux/netfilter.h:314 [inline] ip_local_deliver+0x21f/0x490 net/ipv4/ip_input.c:254 dst_input include/net/dst.h:460 [inline] ip_rcv_finish+0x4a2/0x520 net/ipv4/ip_input.c:449 NF_HOOK include/linux/netfilter.h:314 [inline] ip_rcv+0xcd/0x380 net/ipv4/ip_input.c:569 __netif_receive_skb_one_core net/core/dev.c:5538 [inline] __netif_receive_skb+0x319/0x9e0 net/core/dev.c:5652 process_backlog+0x480/0x8b0 net/core/dev.c:5981 __napi_poll+0xe7/0x980 net/core/dev.c:6632 napi_poll net/core/dev.c:6701 [inline] net_rx_action+0x89d/0x1820 net/core/dev.c:6813 __do_softirq+0x1c0/0x7d7 kernel/softirq.c:554 Uninit was created at: __alloc_pages+0x9a7/0xe00 mm/page_alloc.c:4592 __alloc_pages_node include/linux/gfp.h:238 [inline] alloc_pages_node include/linux/gfp.h:261 [inline] alloc_slab_page mm/slub.c:2175 [inline] allocate_slab mm/slub.c:2338 [inline] new_slab+0x2de/0x1400 mm/slub.c:2391 ___slab_alloc+0x1184/0x33d0 mm/slub.c:3525 __slab_alloc mm/slub.c:3610 [inline] __slab_alloc_node mm/slub.c:3663 [inline] slab_alloc_node mm/slub.c:3835 [inline] kmem_cache_alloc+0x6d3/0xbe0 mm/slub.c:3852 reqsk_alloc include/net/request_sock.h:131 [inline] inet_reqsk_alloc+0x66/0x7a0 net/ipv4/tcp_input.c:6978 tcp_conn_request+0x484/0x44e0 net/ipv4/tcp_input.c:7135 tcp_v4_conn_request+0x16f/0x1d0 net/ipv4/tcp_ipv4.c:1716 tcp_rcv_state_process+0x2e5/0x4bb0 net/ipv4/tcp_input.c:6655 tcp_v4_do_rcv+0xbfd/0x10b0 net/ipv4/tcp_ipv4.c:1929 tcp_v4_rcv+0x4ce4/0x5420 net/ipv4/tcp_ipv4.c:2322 ip_protocol_deliver_rcu+0x2a3/0x13d0 net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x332/0x500 net/ipv4/ip_input.c:233 NF_HOOK include/linux/netfilter.h:314 [inline] ip_local_deliver+0x21f/0x490 net/ipv4/ip_input.c:254 dst_input include/net/dst.h:460 [inline] ip_sublist_rcv_finish net/ipv4/ip_input.c:580 [inline] ip_list_rcv_finish net/ipv4/ip_input.c:631 [inline] ip_sublist_rcv+0x15f3/0x17f0 net/ipv4/ip_input.c:639 ip_list_rcv+0x9ef/0xa40 net/ipv4/ip_input.c:674 __netif_receive_skb_list_ptype net/core/dev.c:5581 [inline] __netif_receive_skb_list_core+0x15c5/0x1670 net/core/dev.c:5629 __netif_receive_skb_list net/core/dev.c:5681 [inline] netif_receive_skb_list_internal+0x106c/0x16f0 net/core/dev.c:5773 gro_normal_list include/net/gro.h:438 [inline] napi_complete_done+0x425/0x880 net/core/dev.c:6113 virtqueue_napi_complete drivers/net/virtio_net.c:465 [inline] virtnet_poll+0x149d/0x2240 drivers/net/virtio_net.c:2211 __napi_poll+0xe7/0x980 net/core/dev.c:6632 napi_poll net/core/dev.c:6701 [inline] net_rx_action+0x89d/0x1820 net/core/dev.c:6813 __do_softirq+0x1c0/0x7d7 kernel/softirq.c:554 CPU: 0 PID: 16792 Comm: syz-executor.2 Not tainted 6.8.0-syzkaller-05562-g61387b8dcf1d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024 Fixes: 695751e31a63 ("bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: Eric Dumazet <edumazet@google.com> Closes: https://lore.kernel.org/bpf/CANn89iKdN9c+C_2JAUbc+VY3DDQjAQukMtiBbormAmAk9CdvQA@mail.gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20240315224710.55209-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-20net/bnx2x: Prevent access to a freed page in page_poolThinh Tran1-3/+3
Fix race condition leading to system crash during EEH error handling During EEH error recovery, the bnx2x driver's transmit timeout logic could cause a race condition when handling reset tasks. The bnx2x_tx_timeout() schedules reset tasks via bnx2x_sp_rtnl_task(), which ultimately leads to bnx2x_nic_unload(). In bnx2x_nic_unload() SGEs are freed using bnx2x_free_rx_sge_range(). However, this could overlap with the EEH driver's attempt to reset the device using bnx2x_io_slot_reset(), which also tries to free SGEs. This race condition can result in system crashes due to accessing freed memory locations in bnx2x_free_rx_sge() 799 static inline void bnx2x_free_rx_sge(struct bnx2x *bp, 800 struct bnx2x_fastpath *fp, u16 index) 801 { 802 struct sw_rx_page *sw_buf = &fp->rx_page_ring[index]; 803 struct page *page = sw_buf->page; .... where sw_buf was set to NULL after the call to dma_unmap_page() by the preceding thread. EEH: Beginning: 'slot_reset' PCI 0011:01:00.0#10000: EEH: Invoking bnx2x->slot_reset() bnx2x: [bnx2x_io_slot_reset:14228(eth1)]IO slot reset initializing... bnx2x 0011:01:00.0: enabling device (0140 -> 0142) bnx2x: [bnx2x_io_slot_reset:14244(eth1)]IO slot reset --> driver unload Kernel attempted to read user page (0) - exploit attempt? (uid: 0) BUG: Kernel NULL pointer dereference on read at 0x00000000 Faulting instruction address: 0xc0080000025065fc Oops: Kernel access of bad area, sig: 11 [#1] ..... Call Trace: [c000000003c67a20] [c00800000250658c] bnx2x_io_slot_reset+0x204/0x610 [bnx2x] (unreliable) [c000000003c67af0] [c0000000000518a8] eeh_report_reset+0xb8/0xf0 [c000000003c67b60] [c000000000052130] eeh_pe_report+0x180/0x550 [c000000003c67c70] [c00000000005318c] eeh_handle_normal_event+0x84c/0xa60 [c000000003c67d50] [c000000000053a84] eeh_event_handler+0xf4/0x170 [c000000003c67da0] [c000000000194c58] kthread+0x1c8/0x1d0 [c000000003c67e10] [c00000000000cf64] ret_from_kernel_thread+0x5c/0x64 To solve this issue, we need to verify page pool allocations before freeing. Fixes: 4cace675d687 ("bnx2x: Alloc 4k fragment for each rx ring buffer element") Signed-off-by: Thinh Tran <thinhtr@linux.ibm.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240315205535.1321-1-thinhtr@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-20Merge tag 'bcachefs-2024-03-19' of https://evilpiepirate.org/git/bcachefsLinus Torvalds26-111/+157
Pull bcachefs fixes from Kent Overstreet: "Assorted bugfixes. Most are fixes for simple assertion pops; the most significant fix is for a deadlock in recovery when we have to rewrite large numbers of btree nodes to fix errors. This was incorrectly running out of the same workqueue as the core interior btree update path - we now give it its own single threaded workqueue. This was visible to users as "bch2_btree_update_start(): error: BCH_ERR_journal_reclaim_would_deadlock" - and then recovery hanging" * tag 'bcachefs-2024-03-19' of https://evilpiepirate.org/git/bcachefs: bcachefs: Fix lost wakeup on journal shutdown bcachefs; Fix deadlock in bch2_btree_update_start() bcachefs: ratelimit errors from async_btree_node_rewrite bcachefs: Run check_topology() first bcachefs: Improve bch2_fatal_error() bcachefs: Fix lost transaction restart error bcachefs: Don't corrupt journal keys gap buffer when dropping alloc info bcachefs: fix for building in userspace bcachefs: bch2_snapshot_is_ancestor() now safe to call in early recovery bcachefs: Fix nested transaction restart handling in bch2_bucket_gens_init() bcachefs: Improve sysfs internal/btree_updates bcachefs: Split out btree_node_rewrite_worker bcachefs: Fix locking in bch2_alloc_write_key() bcachefs: Avoid extent entry type assertions in .invalid() bcachefs: Fix spurious -BCH_ERR_transaction_restart_nested bcachefs: Fix check_key_has_snapshot() call bcachefs: Change "accounting overran journal reservation" to a warning
2024-03-20selftests/bpf: Prevent client connect before server bind in test_tc_tunnel.shAlessandro Carminati (Red Hat)1-1/+12
In some systems, the netcat server can incur in delay to start listening. When this happens, the test can randomly fail in various points. This is an example error message: # ip gre none gso # encap 192.168.1.1 to 192.168.1.2, type gre, mac none len 2000 # test basic connectivity # Ncat: Connection refused. The issue stems from a race condition between the netcat client and server. The test author had addressed this problem by implementing a sleep, which I have removed in this patch. This patch introduces a function capable of sleeping for up to two seconds. However, it can terminate the waiting period early if the port is reported to be listening. Signed-off-by: Alessandro Carminati (Red Hat) <alessandro.carminati@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240314105911.213411-1-alessandro.carminati@gmail.com
2024-03-20Merge branch 'current_pid_tgid-for-all-prog-types'Andrii Nakryiko6-42/+215
Yonghong Song says: ==================== current_pid_tgid() for all prog types Currently bpf_get_current_pid_tgid() is allowed in tracing, cgroup and sk_msg progs while bpf_get_ns_current_pid_tgid() is only allowed in tracing progs. We have an internal use case where for an application running in a container (with pid namespace), user wants to get the pid associated with the pid namespace in a cgroup bpf program. Besides cgroup, the only prog type, supporting bpf_get_current_pid_tgid() but not bpf_get_ns_current_pid_tgid(), is sk_msg. But actually both bpf_get_current_pid_tgid() and bpf_get_ns_current_pid_tgid() helpers do not reveal kernel internal data and there is no reason that they cannot be used in other program types. This patch just did this and enabled these two helpers for all program types. Patch 1 added the kernel support and patches 2-5 added the test for cgroup and sk_msg. Change logs: v1 -> v2: - allow bpf_get_[ns_]current_pid_tgid() for all prog types. - for network related selftests, using netns. ==================== Link: https://lore.kernel.org/r/20240315184849.2974556-1-yonghong.song@linux.dev Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-03-20selftests/bpf: Add a sk_msg prog bpf_get_ns_current_pid_tgid() testYonghong Song2-0/+76
Add a sk_msg bpf program test where the program is running in a pid namespace. The test is successful: #165/4 ns_current_pid_tgid/new_ns_sk_msg:OK Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240315184915.2976718-1-yonghong.song@linux.dev
2024-03-20selftests/bpf: Add a cgroup prog bpf_get_ns_current_pid_tgid() testYonghong Song2-0/+80
Add a cgroup bpf program test where the bpf program is running in a pid namespace. The test is successfully: #165/3 ns_current_pid_tgid/new_ns_cgrp:OK Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240315184910.2976522-1-yonghong.song@linux.dev
2024-03-20selftests/bpf: Refactor out some functions in ns_current_pid_tgid testYonghong Song2-22/+41
Refactor some functions in both user space code and bpf program as these functions are used by later cgroup/sk_msg tests. Another change is to mark tp program optional loading as later patches will use optional loading as well since they have quite different attachment and testing logic. There is no functionality change. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240315184904.2976123-1-yonghong.song@linux.dev
2024-03-20selftests/bpf: Replace CHECK with ASSERT_* in ns_current_pid_tgid testYonghong Song1-17/+19
Replace CHECK in selftest ns_current_pid_tgid with recommended ASSERT_* style. I also shortened subtest name as the prefix of subtest name is covered by the test name already. This patch does fix a testing issue. Currently even if bss->user_{pid,tgid} is not correct, the test still passed since the clone func returns 0. I fixed it to return a non-zero value if bss->user_{pid,tgid} is incorrect. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240315184859.2975543-1-yonghong.song@linux.dev
2024-03-20bpf: Allow helper bpf_get_[ns_]current_pid_tgid() for all prog typesYonghong Song4-8/+4
Currently bpf_get_current_pid_tgid() is allowed in tracing, cgroup and sk_msg progs while bpf_get_ns_current_pid_tgid() is only allowed in tracing progs. We have an internal use case where for an application running in a container (with pid namespace), user wants to get the pid associated with the pid namespace in a cgroup bpf program. Currently, cgroup bpf progs already allow bpf_get_current_pid_tgid(). Let us allow bpf_get_ns_current_pid_tgid() as well. With auditing the code, bpf_get_current_pid_tgid() is also used by sk_msg prog. But there are no side effect to expose these two helpers to all prog types since they do not reveal any kernel specific data. The detailed discussion is in [1]. So with this patch, both bpf_get_current_pid_tgid() and bpf_get_ns_current_pid_tgid() are put in bpf_base_func_proto(), making them available to all program types. [1] https://lore.kernel.org/bpf/20240307232659.1115872-1-yonghong.song@linux.dev/ Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240315184854.2975190-1-yonghong.song@linux.dev
2024-03-19Merge tag 'soc-late-6.9' of ↵Linus Torvalds44-486/+1441
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull more ARM SoC updates from Arnd Bergmann: "These are changes that for some reason ended up not making it into the first four branches but that should still make it into 6.9: - A rework of the omap clock support that touches both drivers and device tree files - The reset controller branch changes that had a dependency on late bugfixes. Merging them here avoids a backmerge of 6.8-rc5 into the drivers branch - The RISC-V/starfive, RISC-V/microchip and ARM/Broadcom devicetree changes that got delayed and needed some extra time in linux-next for wider testing" * tag 'soc-late-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (31 commits) soc: fsl: dpio: fix kcalloc() argument order bus: ts-nbus: Improve error reporting bus: ts-nbus: Convert to atomic pwm API riscv: dts: starfive: jh7110: Add camera subsystem nodes ARM: bcm: stop selecing CONFIG_TICK_ONESHOT ARM: dts: omap3: Update clksel clocks to use reg instead of ti,bit-shift ARM: dts: am3: Update clksel clocks to use reg instead of ti,bit-shift clk: ti: Improve clksel clock bit parsing for reg property clk: ti: Handle possible address in the node name dt-bindings: pwm: opencores: Add compatible for StarFive JH8100 dt-bindings: riscv: cpus: reg matches hart ID reset: Instantiate reset GPIO controller for shared reset-gpios reset: gpio: Add GPIO-based reset controller cpufreq: do not open-code of_phandle_args_equal() of: Add of_phandle_args_equal() helper reset: simple: add support for Sophgo SG2042 dt-bindings: reset: sophgo: support SG2042 riscv: dts: microchip: add specific compatible for mpfs pdma riscv: dts: microchip: add missing CAN bus clocks ARM: brcmstb: Add debug UART entry for 74165 ...
2024-03-19Merge tag 's390-6.9-2' of ↵Linus Torvalds67-553/+771
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull more s390 updates from Heiko Carstens: - Various virtual vs physical address usage fixes - Add new bitwise types and helper functions and use them in s390 specific drivers and code to make it easier to find virtual vs physical address usage bugs. Right now virtual and physical addresses are identical for s390, except for module, vmalloc, and similar areas. This will be changed, hopefully with the next merge window, so that e.g. the kernel image and modules will be located close to each other, allowing for direct branches and also for some other simplifications. As a prerequisite this requires to fix all misuses of virtual and physical addresses. As it turned out people are so used to the concept that virtual and physical addresses are the same, that new bugs got added to code which was already fixed. In order to avoid that even more code gets merged which adds such bugs add and use new bitwise types, so that sparse can be used to find such usage bugs. Most likely the new types can go away again after some time - Provide a simple ARCH_HAS_DEBUG_VIRTUAL implementation - Fix kprobe branch handling: if an out-of-line single stepped relative branch instruction has a target address within a certain address area in the entry code, the program check handler may incorrectly execute cleanup code as if KVM code was executed, leading to crashes - Fix reference counting of zcrypt card objects - Various other small fixes and cleanups * tag 's390-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (41 commits) s390/entry: compare gmap asce to determine guest/host fault s390/entry: remove OUTSIDE macro s390/entry: add CIF_SIE flag and remove sie64a() address check s390/cio: use while (i--) pattern to clean up s390/raw3270: make class3270 constant s390/raw3270: improve raw3270_init() readability s390/tape: make tape_class constant s390/vmlogrdr: make vmlogrdr_class constant s390/vmur: make vmur_class constant s390/zcrypt: make zcrypt_class constant s390/mm: provide simple ARCH_HAS_DEBUG_VIRTUAL support s390/vfio_ccw_cp: use new address translation helpers s390/iucv: use new address translation helpers s390/ctcm: use new address translation helpers s390/lcs: use new address translation helpers s390/qeth: use new address translation helpers s390/zfcp: use new address translation helpers s390/tape: fix virtual vs physical address confusion s390/3270: use new address translation helpers s390/3215: use new address translation helpers ...
2024-03-19tracing: Just use strcmp() for testing __string() and __assign_str() matchSteven Rostedt (Google)1-3/+2
As __assign_str() no longer uses its "src" parameter, there's a check to make sure nothing depends on it being different than what was passed to __string(). It originally just compared the pointer passed to __string() with the pointer passed into __assign_str() via the "src" parameter. But there's a couple of outliers that just pass in a quoted string constant, where comparing the pointers is UB to the compiler, as the compiler is free to create multiple copies of the same string constant. Instead, just use strcmp(). It may slow down the trace event, but this will eventually be removed. Also, fix the issue of passing NULL to strcmp() by adding a WARN_ON() to make sure that both "src" and the pointer saved in __string() are either both NULL or have content, and then checking if "src" is not NULL before performing the strcmp(). Link: https://lore.kernel.org/all/CAHk-=wjxX16kWd=uxG5wzqt=aXoYDf1BgWOKk+qVmAO0zh7sjA@mail.gmail.com/ Fixes: b1afefa62ca9 ("tracing: Use strcmp() in __assign_str() WARN_ON() check") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>