summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2021-06-18net: wwan: Allow WWAN drivers to provide blocking tx and poll functionStephan Gerhold1-2/+11
At the moment, the WWAN core provides wwan_port_txon/off() to implement blocking writes. The tx() port operation should not block, instead wwan_port_txon/off() should be called when the TX queue is full or has free space again. However, in some cases it is not straightforward to make use of that functionality. For example, the RPMSG API used by rpmsg_wwan_ctrl.c does not provide any way to be notified when the TX queue has space again. Instead, it only provides the following operations: - rpmsg_send(): blocking write (wait until there is space) - rpmsg_trysend(): non-blocking write (return error if no space) - rpmsg_poll(): set poll flags depending on TX queue state Generally that's totally sufficient for implementing a char device, but it does not fit well to the currently provided WWAN port ops. Most of the time, using the non-blocking rpmsg_trysend() in the WWAN tx() port operation works just fine. However, with high-frequent writes to the char device it is possible to trigger a situation where this causes issues. For example, consider the following (somewhat unrealistic) example: # dd if=/dev/zero bs=1000 of=/dev/wwan0qmi0 dd: error writing '/dev/wwan0qmi0': Resource temporarily unavailable 1+0 records out This fails immediately after writing the first record. It's likely only a matter of time until this triggers issues for some real application (e.g. ModemManager sending a lot of large QMI packets). The rpmsg_char device does not have this problem, because it uses rpmsg_trysend() and rpmsg_poll() to support non-blocking operations. Make it possible to use the same in the RPMSG WWAN driver by adding two new optional wwan_port_ops: - tx_blocking(): send data blocking if allowed - tx_poll(): set additional TX poll flags This integrates nicely with the RPMSG API and does not require any change in existing WWAN drivers. With these changes, the dd example above blocks instead of exiting with an error. Cc: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: Stephan Gerhold <stephan@gerhold.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18rpmsg: core: Add driver_data for rpmsg_device_idStephan Gerhold1-0/+1
Most device_id structs provide a driver_data field that can be used by drivers to associate data more easily for a particular device ID. Add the same for the rpmsg_device_id. Cc: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Stephan Gerhold <stephan@gerhold.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18Revert "net: add pf_family_names[] for protocol family"David S. Miller1-48/+0
This reverts commit 1f3c98eaddec857e16a7a1c6cd83317b3dc89438. Does not build... Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18net: add pf_family_names[] for protocol familyYejune Deng1-0/+48
Modify the pr_info content from int to char *, this looks more readable. Signed-off-by: Yejune Deng <yejune.deng@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18mptcp: dump csum fields in mptcp_dump_mpextGeliang Tang1-6/+11
In mptcp_dump_mpext, dump the csum fields, csum and csum_reqd in struct mptcp_dump_mpext too. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18mptcp: receive checksum for MP_CAPABLE with dataGeliang Tang1-1/+2
This patch added a new member named csum in struct mptcp_options_received. When parsing the MP_CAPABLE with data, if the checksum is enabled, adjust the expected_opsize. If the receiving option length matches the length with the data checksum, get the checksum value and save it in mp_opt->csum. And in mptcp_incoming_options, pass it to mpext->csum. We always parse any csum/nocsum combination and delay the presence check to later code, to allow reset if missing. Additionally, in the TX path, use the newly introduce ext field to avoid MPTCP csum recomputation on TCP retransmission and unneeded csum update on when setting the data fin_flag. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18mptcp: add csum_reqd in mptcp_out_optionsGeliang Tang1-2/+3
This patch added a new member csum_reqd in struct mptcp_out_options and struct mptcp_subflow_request_sock. Initialized it with the helper function mptcp_is_checksum_enabled(). In mptcp_write_options, if this field is enabled, send out the MP_CAPABLE suboption with the MPTCP_CAP_CHECKSUM_REQD flag. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18mptcp: generate the data checksumGeliang Tang1-0/+1
This patch added a new member named csum in struct mptcp_ext, implemented a new function named mptcp_generate_data_checksum(). Generate the data checksum in mptcp_sendmsg_frag, save it in mpext->csum. Note that we must generate the csum for zero window probe, too. Do the csum update incrementally, to avoid multiple csum computation when the data is appended to existing skb. Note that in a later patch we will skip unneeded csum related operation. Changes not included here to keep the delta small. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18mptcp: add csum_enabled in mptcp_sockGeliang Tang1-0/+1
This patch added a new member named csum_enabled in struct mptcp_sock, used a dummy mptcp_is_checksum_enabled() helper to initialize it. Also added a new member named mptcpi_csum_enabled in struct mptcp_info to expose the csum_enabled flag. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18seg6: add support for SRv6 End.DT46 BehaviorAndrea Mayer1-0/+2
IETF RFC 8986 [1] includes the definition of SRv6 End.DT4, End.DT6, and End.DT46 Behaviors. The current SRv6 code in the Linux kernel only implements End.DT4 and End.DT6 which can be used respectively to support IPv4-in-IPv6 and IPv6-in-IPv6 VPNs. With End.DT4 and End.DT6 it is not possible to create a single SRv6 VPN tunnel to carry both IPv4 and IPv6 traffic. The proposed End.DT46 implementation is meant to support the decapsulation of IPv4 and IPv6 traffic coming from a single SRv6 tunnel. The implementation of the SRv6 End.DT46 Behavior in the Linux kernel greatly simplifies the setup and operations of SRv6 VPNs. The SRv6 End.DT46 Behavior leverages the infrastructure of SRv6 End.DT{4,6} Behaviors implemented so far, because it makes use of a VRF device in order to force the routing lookup into the associated routing table. To make the End.DT46 work properly, it must be guaranteed that the routing table used for routing lookup operations is bound to one and only one VRF during the tunnel creation. Such constraint has to be enforced by enabling the VRF strict_mode sysctl parameter, i.e.: $ sysctl -wq net.vrf.strict_mode=1 Note that the same approach is used for the SRv6 End.DT4 Behavior and for the End.DT6 Behavior in VRF mode. The command used to instantiate an SRv6 End.DT46 Behavior is straightforward, i.e.: $ ip -6 route add 2001:db8::1 encap seg6local action End.DT46 vrftable 100 dev vrf100. [1] https://www.rfc-editor.org/rfc/rfc8986.html#name-enddt46-decapsulation-and-s ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Performance and impact of SRv6 End.DT46 Behavior on the SRv6 Networking ======================================================================= This patch aims to add the SRv6 End.DT46 Behavior with minimal impact on the performance of SRv6 End.DT4 and End.DT6 Behaviors. In order to verify this, we tested the performance of the newly introduced SRv6 End.DT46 Behavior and compared it with the performance of SRv6 End.DT{4,6} Behaviors, considering both the patched kernel and the kernel before applying the End.DT46 patch (referred to as vanilla kernel). In details, the following decapsulation scenarios were considered: 1.a) IPv6 traffic in SRv6 End.DT46 Behavior on patched kernel; 1.b) IPv4 traffic in SRv6 End.DT46 Behavior on patched kernel; 2.a) SRv6 End.DT6 Behavior (VRF mode) on patched kernel; 2.b) SRv6 End.DT4 Behavior on patched kernel; 3.a) SRv6 End.DT6 Behavior (VRF mode) on vanilla kernel (without the End.DT46 patch); 3.b) SRv6 End.DT4 Behavior on vanilla kernel (without the End.DT46 patch). All tests were performed on a testbed deployed on the CloudLab [2] facilities. We considered IPv{4,6} traffic handled by a single core (at 2.4 GHz on a Xeon(R) CPU E5-2630 v3) on kernel 5.13-rc1 using packets of size ~ 100 bytes. Scenario (1.a): average 684.70 kpps; std. dev. 0.7 kpps; Scenario (1.b): average 711.69 kpps; std. dev. 1.2 kpps; Scenario (2.a): average 690.70 kpps; std. dev. 1.2 kpps; Scenario (2.b): average 722.22 kpps; std. dev. 1.7 kpps; Scenario (3.a): average 690.02 kpps; std. dev. 2.6 kpps; Scenario (3.b): average 721.91 kpps; std. dev. 1.2 kpps; Considering the results for the patched kernel (1.a, 1.b, 2.a, 2.b) we observe that the performance degradation incurred in using End.DT46 rather than End.DT6 and End.DT4 respectively for IPv6 and IPv4 traffic is minimal, around 0.9% and 1.5%. Such very minimal performance degradation is the price to be paid if one prefers to use a single tunnel capable of handling both types of traffic (IPv4 and IPv6). Comparing the results for End.DT4 and End.DT6 under the patched and the vanilla kernel (2.a, 2.b, 3.a, 3.b) we observe that the introduction of the End.DT46 patch has no impact on the performance of End.DT4 and End.DT6. [2] https://www.cloudlab.us Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17driver core: add a helper to setup both the of_node and fwnode of a deviceIoana Ciornei1-0/+1
There are many places where both the fwnode_handle and the of_node of a device need to be populated. Add a function which does both so that we have consistency. Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller8-11/+97
Daniel Borkmann says: ==================== pull-request: bpf-next 2021-06-17 The following pull-request contains BPF updates for your *net-next* tree. We've added 50 non-merge commits during the last 25 day(s) which contain a total of 148 files changed, 4779 insertions(+), 1248 deletions(-). The main changes are: 1) BPF infrastructure to migrate TCP child sockets from a listener to another in the same reuseport group/map, from Kuniyuki Iwashima. 2) Add a provably sound, faster and more precise algorithm for tnum_mul() as noted in https://arxiv.org/abs/2105.05398, from Harishankar Vishwanathan. 3) Streamline error reporting changes in libbpf as planned out in the 'libbpf: the road to v1.0' effort, from Andrii Nakryiko. 4) Add broadcast support to xdp_redirect_map(), from Hangbin Liu. 5) Extends bpf_map_lookup_and_delete_elem() functionality to 4 more map types, that is, {LRU_,PERCPU_,LRU_PERCPU_,}HASH, from Denis Salopek. 6) Support new LLVM relocations in libbpf to make them more linker friendly, also add a doc to describe the BPF backend relocations, from Yonghong Song. 7) Silence long standing KUBSAN complaints on register-based shifts in interpreter, from Daniel Borkmann and Eric Biggers. 8) Add dummy PT_REGS macros in libbpf to fail BPF program compilation when target arch cannot be determined, from Lorenz Bauer. 9) Extend AF_XDP to support large umems with 1M+ pages, from Magnus Karlsson. 10) Fix two minor libbpf tc BPF API issues, from Kumar Kartikeya Dwivedi. 11) Move libbpf BPF_SEQ_PRINTF/BPF_SNPRINTF macros that can be used by BPF programs to bpf_helpers.h header, from Florent Revest. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17net: fix mistake path for netdev_features_stringsJian Shen2-3/+3
Th_strings arrays netdev_features_strings, tunable_strings, and phy_tunable_strings has been moved to file net/ethtool/common.c. So fixes the comment. Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16net/smc: Make SMC statistics network namespace awareGuvenc Gulce2-0/+20
Make the gathered SMC statistics network namespace aware, for each namespace collect an own set of statistic information. Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16net/smc: Add netlink support for SMC fallback statisticsGuvenc Gulce1-0/+14
Add support to collect more detailed SMC fallback reason statistics and provide these statistics to user space on the netlink interface. Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16net/smc: Add netlink support for SMC statisticsGuvenc Gulce1-0/+69
Add the netlink function which collects the statistics information and delivers it to the userspace. Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-15net: bonding: Use per-cpu rr_tx_counterJussi Maki1-1/+1
The round-robin rr_tx_counter was shared across CPUs leading to significant cache thrashing at high packet rates. This patch switches the round-robin packet counter to use a per-cpu variable to decide the destination slave. On a test with 2x100Gbit ICE nic with pktgen_sample_04_many_flows.sh (-s 64 -t 32) the tx rate was 19.6Mpps before and 22.3Mpps after this patch. "perf top -e cache_misses" before: 12.31% [bonding] [k] bond_xmit_roundrobin_slave_get 10.59% [sch_fq_codel] [k] fq_codel_dequeue 9.34% [kernel] [k] skb_release_data after: 15.42% [sch_fq_codel] [k] fq_codel_dequeue 10.06% [kernel] [k] __memset 9.12% [kernel] [k] skb_release_data Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-15bpf: Support socket migration by eBPF.Kuniyuki Iwashima3-0/+18
This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to check if the attached eBPF program is capable of migrating sockets. When the eBPF program is attached, we run it for socket migration if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE or net.ipv4.tcp_migrate_req is enabled. Currently, the expected_attach_type is not enforced for the BPF_PROG_TYPE_SK_REUSEPORT type of program. Thus, this commit follows the earlier idea in the commit aac3fc320d94 ("bpf: Post-hooks for sys_bind") to fix up the zero expected_attach_type in bpf_prog_load_fixup_attach_type(). Moreover, this patch adds a new field (migrating_sk) to sk_reuseport_md to select a new listener based on the child socket. migrating_sk varies depending on if it is migrating a request in the accept queue or during 3WHS. - accept_queue : sock (ESTABLISHED/SYN_RECV) - 3WHS : request_sock (NEW_SYN_RECV) In the eBPF program, we can select a new listener by BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning SK_DROP. This feature is useful when listeners have different settings at the socket API level or when we want to free resources as soon as possible. - SK_PASS with selected_sk, select it as a new listener - SK_PASS with selected_sk NULL, fallbacks to the random selection - SK_DROP, cancel the migration. There is a noteworthy point. We select a listening socket in three places, but we do not have struct skb at closing a listener or retransmitting a SYN+ACK. On the other hand, some helper functions do not expect skb is NULL (e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in BPF_FUNC_skb_load_bytes_relative()). So we allocate an empty skb temporarily before running the eBPF program. Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/netdev/20201203042402.6cskdlit5f3mw4ru@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/bpf/20210612123224.12525-10-kuniyu@amazon.co.jp
2021-06-15bpf: Support BPF_FUNC_get_socket_cookie() for BPF_PROG_TYPE_SK_REUSEPORT.Kuniyuki Iwashima1-0/+1
We will call sock_reuseport.prog for socket migration in the next commit, so the eBPF program has to know which listener is closing to select a new listener. We can currently get a unique ID of each listener in the userspace by calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map. This patch makes the pointer of sk available in sk_reuseport_md so that we can get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program. Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/netdev/20201119001154.kapwihc2plp4f7zc@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/bpf/20210612123224.12525-9-kuniyu@amazon.co.jp
2021-06-15tcp: Add reuseport_migrate_sock() to select a new listener.Kuniyuki Iwashima1-0/+3
reuseport_migrate_sock() does the same check done in reuseport_listen_stop_sock(). If the reuseport group is capable of migration, reuseport_migrate_sock() selects a new listener by the child socket hash and increments the listener's sk_refcnt beforehand. Thus, if we fail in the migration, we have to decrement it later. We will support migration by eBPF in the later commits. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-5-kuniyu@amazon.co.jp
2021-06-15tcp: Keep TCP_CLOSE sockets in the reuseport group.Kuniyuki Iwashima1-0/+1
When we close a listening socket, to migrate its connections to another listener in the same reuseport group, we have to handle two kinds of child sockets. One is that a listening socket has a reference to, and the other is not. The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the accept queue of their listening socket. So we can pop them out and push them into another listener's queue at close() or shutdown() syscalls. On the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the three-way handshake and not in the accept queue. Thus, we cannot access such sockets at close() or shutdown() syscalls. Accordingly, we have to migrate immature sockets after their listening socket has been closed. Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At that time, if we could select a new listener from the same reuseport group, no connection would be aborted. However, we cannot do that because reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to the reuseport group from closed sockets. This patch allows TCP_CLOSE sockets to remain in the reuseport group and access it while any child socket references them. The point is that reuseport_detach_sock() was called twice from inet_unhash() and sk_destruct(). This patch replaces the first reuseport_detach_sock() with reuseport_stop_listen_sock(), which checks if the reuseport group is capable of migration. If capable, it decrements num_socks, moves the socket backwards in socks[] and increments num_closed_socks. When all connections are migrated, sk_destruct() calls reuseport_detach_sock() to remove the socket from socks[], decrement num_closed_socks, and set NULL to sk_reuseport_cb. By this change, closed or shutdowned sockets can keep sk_reuseport_cb. Consequently, calling listen() after shutdown() can cause EADDRINUSE or EBUSY in inet_csk_bind_conflict() or reuseport_add_sock() which expects such sockets not to have the reuseport group. Therefore, this patch also loosens such validation rules so that a socket can listen again if it has a reuseport group with num_closed_socks more than 0. When such sockets listen again, we handle them in reuseport_resurrect(). If there is an existing reuseport group (reuseport_add_sock() path), we move the socket from the old group to the new one and free the old one if necessary. If there is no existing group (reuseport_alloc() path), we allocate a new reuseport group, detach sk from the old one, and free it if necessary, not to break the current shutdown behaviour: - we cannot carry over the eBPF prog of shutdowned sockets - we cannot attach/detach an eBPF prog to/from listening sockets via shutdowned sockets Note that when the number of sockets gets over U16_MAX, we try to detach a closed socket randomly to make room for the new listening socket in reuseport_grow(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-4-kuniyu@amazon.co.jp
2021-06-15tcp: Add num_closed_socks to struct sock_reuseport.Kuniyuki Iwashima1-2/+3
As noted in the following commit, a closed listener has to hold the reference to the reuseport group for socket migration. This patch adds a field (num_closed_socks) to struct sock_reuseport to manage closed sockets within the same reuseport group. Moreover, this and the following commits introduce some helper functions to split socks[] into two sections and keep TCP_LISTEN and TCP_CLOSE sockets in each section. Like a double-ended queue, we will place TCP_LISTEN sockets from the front and TCP_CLOSE sockets from the end. TCP_LISTEN----------> <-------TCP_CLOSE +---+---+ --- +---+ --- +---+ --- +---+ | 0 | 1 | ... | i | ... | j | ... | k | +---+---+ --- +---+ --- +---+ --- +---+ i = num_socks - 1 j = max_socks - num_closed_socks k = max_socks - 1 This patch also extends reuseport_add_sock() and reuseport_grow() to support num_closed_socks. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-3-kuniyu@amazon.co.jp
2021-06-15net: Introduce net.ipv4.tcp_migrate_req.Kuniyuki Iwashima1-0/+1
This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this option is enabled or eBPF program is attached, we will be able to migrate child sockets from a listener to another in the same reuseport group after close() or shutdown() syscalls. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-2-kuniyu@amazon.co.jp
2021-06-15net/mlx5: Enlarge interrupt field in CREATE_EQShay Drory1-2/+2
FW is now supporting more than 256 MSI-X per PF (up to 2K). Hence, enlarge interrupt field in CREATE_EQ to make use of the new MSI-X's. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-15net/mlx5: Provide cpumask at EQ creation phaseLeon Romanovsky1-0/+1
The users of EQ are running their code on different CPUs and with various affinity patterns. Move the cpumask setting close to their actual usage. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-14net: core: devlink: add dropped stats traps fieldOleksandr Mazur1-0/+10
Whenever query statistics is issued for trap, devlink subsystem would also fill-in statistics 'dropped' field. This field indicates the number of packets HW dropped and failed to report to the device driver, and thus - to the devlink subsystem itself. In case if device driver didn't register callback for hard drop statistics querying, 'dropped' field will be omitted and not filled. Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-14net: phy: micrel: ksz886x/ksz8081: add cabletest supportOleksij Rempel1-0/+1
This patch support for cable test for the ksz886x switches and the ksz8081 PHY. The patch was tested on a KSZ8873RLL switch with following results: - port 1: - provides invalid values, thus return -ENOTSUPP (Errata: DS80000830A: "LinkMD does not work on Port 1", http://ww1.microchip.com/downloads/en/DeviceDoc/KSZ8873-Errata-DS80000830A.pdf) - port 2: - can detect distance - can detect open on each wire of pair A (wire 1 and 2) - can detect open only on one wire of pair B (only wire 3) - can detect short between wires of a pair (wires 1 + 2 or 3 + 6) - short between pairs is detected as open. For example short between wires 2 + 3 is detected as open. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-14net: phy/dsa micrel/ksz886x add MDI-X supportOleksij Rempel1-0/+2
Add support for MDI-X status and configuration Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-14net: phy: micrel: move phy reg offsets to common headerMichael Grzeschik1-0/+13
Some micrel devices share the same PHY register defines. This patch moves them to one common header so other drivers can reuse them. And reuse generic MII_* defines where possible. Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-12net: qualcomm: rmnet: trailer value is a checksumAlex Elder1-1/+1
The csum_value field in the rmnet_map_dl_csum_trailer structure is a "real" Internet checksum. It is a 16 bit value, in big endian format, which represents an inverted ones' complement sum over pairs of bytes. Make that clear by changing its type to __sum16. This makes a typecast in rmnet_map_ipv4_dl_csum_trailer() and another in rmnet_map_ipv6_dl_csum_trailer() unnecessary. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-12wwan: add interface creation supportJohannes Berg2-0/+40
Add support to create (and destroy) interfaces via a new rtnetlink kind "wwan". The responsible driver has to use the new wwan_register_ops() to make this possible. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Signed-off-by: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-12rtnetlink: add IFLA_PARENT_[DEV|DEV_BUS]_NAMEJohannes Berg1-0/+7
In some cases, for example in the upcoming WWAN framework changes, there's no natural "parent netdev", so sometimes dummy netdevs are created or similar. IFLA_PARENT_DEV_NAME is a new attribute intended to contain a device (sysfs, struct device) name that can be used instead when creating a new netdev, if the rtnetlink family implements it. As suggested by Parav Pandit, we also introduce IFLA_PARENT_DEV_BUS_NAME attribute in order to uniquely identify a device on the system (with bus/name pair). ip-link(8) support for the generic parent device attributes will help us avoid code duplication, so no other link type will require a custom code to handle the parent name attribute. E.g. the WWAN interface creation command will looks like this: $ ip link add wwan0-1 parent-dev wwan0 type wwan channel-id 1 So, some future subsystem (or driver) FOO will have an interface creation command that looks like this: $ ip link add foo1-3 parent-dev foo1 type foo bar-id 3 baz-type Y Below is an example of dumping link info of a random device with these new attributes: $ ip --details link show wlp0s20f3 4: wlp0s20f3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DORMANT group default qlen 1000 ... parent_bus pci parent_dev 0000:00:14.3 Co-developed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Co-developed-by: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: Loic Poulain <loic.poulain@linaro.org> Suggested-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-12rtnetlink: add alloc() method to rtnl_link_opsJohannes Berg1-0/+8
In order to make rtnetlink ops that can create different kinds of devices, like what we want to add to the WWAN framework, the priv_size and setup parameters aren't quite sufficient. Make this easier to manage by allowing ops to allocate their own netdev via an @alloc method that gets the tb netlink data. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-12net: phy: Add 25G BASE-R interface modeSteen Hegelund1-0/+4
Add 25gbase-r phy interface mode Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: pcs: xpcs: export xpcs_do_config and xpcs_link_upVladimir Oltean1-0/+4
The sja1105 hardware has a quirk in that some changes require a switch reset, which loses all configuration. When the reset is initiated, everything needs to be reprogrammed, including the MACs and the PCS. This is currently done in sja1105_static_config_reload() - we manually call sja1105_adjust_port_config(), sja1105_sgmii_pcs_config() and sja1105_sgmii_pcs_force_speed() which are all internal functions. There is a desire for sja1105 to use the common xpcs driver, and that means that the equivalents of those functions, xpcs_do_config() and xpcs_link_up() respectively, will no longer be local functions. Forcing phylink to retrigger a resolve somehow, say by doing dev_close() followed by dev_open() is not really an option, because the CPU port might have a PCS as well, and there is no net device which we can close and reopen for that. Additionally, the dev_close/dev_open sequence might force a renegotiation of the copper-side link for SGMII ports connected to a PHY, and this is undesirable as well, because the switch reset is much quicker than a PHY autoneg, so we would have a lot more downtime. The only solution I see is for the sja1105 driver to keep doing what it's doing, and that means we need to export the equivalents from xpcs for sja1105_sgmii_pcs_config and sja1105_sgmii_pcs_force_speed, and call them directly in sja1105_static_config_reload(). This will be done during the conversion patch. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: pcs: xpcs: add support for NXP SJA1110Vladimir Oltean1-0/+1
The NXP SJA1110 switch integrates its own, non-Synopsys PMA, but it manages it through the register space of the XPCS itself, in a small register window inside MDIO_MMD_VEND2 from address 0x8030 to 0x806e. This coincides with where the registers for the default Synopsys PMA are, but the register definitions are of course not the same. This situation is an odd hardware quirk, but the simplest way to manage it is to drive the SJA1110's PMA from within the XPCS driver. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: pcs: xpcs: add support for NXP SJA1105Vladimir Oltean1-0/+2
The NXP SJA1105 DSA switch integrates a Synopsys SGMII XPCS on port 4. The generic code works fine, except there is an integration issue which needs to be dealt with: in this switch, the XPCS is integrated with a PMA that has the TX lane polarity inverted by default (PLUS is MINUS, MINUS is PLUS). To obtain normal non-inverted behavior, the TX lane polarity must be inverted in the PCS, via the DIGITAL_CONTROL_2 register. We introduce a pma_config() method in xpcs_compat which is called by the phylink_pcs_config() implementation. Also, the NXP SJA1105 returns all zeroes in the PHY ID registers 2 and 3. We need to hack up an ad-hoc PHY ID (OUI is zero, device ID is 1) in order for the XPCS driver to recognize it. This PHY ID is added to the public include/linux/pcs/pcs-xpcs.h for that reason (for the sja1105 driver to be able to use it in a later patch). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: pcs: xpcs: rename mdio_xpcs_args to dw_xpcsVladimir Oltean1-7/+7
The struct mdio_xpcs_args is reminiscent of when a similarly named struct mdio_xpcs_ops existed. Now that that is removed, we can shorten the name to dw_xpcs (dw for DesignWare). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11Merge branch '100GbE' of ↵David S. Miller1-0/+12
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Jake Keller says: ==================== 100GbE Intel Wired LAN Driver Updates 2021-06-11 Extend the ice driver to support basic PTP clock functionality for E810 devices. This includes some tangential work required to setup the sideband queue and driver shared parameters as well. This series only supports E810-based devices. This is because other devices based on the E822 MAC use a different and more complex PHY. The low level device functionality is kept within ice_ptp_hw.c and is designed to be extensible for supporting E822 devices in a future series. This series also only supports very basic functionality including the ptp_clock device and timestamping. Support for configuring periodic outputs and external input timestamps will be implemented in a future series. There are a couple of potential "what? why?" bits in this series I want to point out: 1) the PTP hardware functionality is shared between multiple functions. This means that the same clock registers are shared across multiple PFs. In order to avoid contention or clashing between PFs, firmware assigns "ownership" to one PF, while other PFs are merely "associated" with the timer. Because we share the hardware resource, only the clock owner will allocate and register a PTP clock device. Other PFs determine the appropriate PTP clock index to report by using a firmware interface to read a shared parameter that is set by the owning PF. 2) the ice driver uses its own kthread instead of using do_aux_work. This is because the periodic and asynchronous tasks are necessary for all PFs, but only one PF will allocate the clock. The series is broken up into functional pieces to allow easy review. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11virtio/vsock: update trace event for SEQPACKETArseny Krasnov1-1/+4
Add SEQPACKET socket type to vsock trace event. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11virtio/vsock: rest of SOCK_SEQPACKET supportArseny Krasnov1-0/+5
Small updates to make SOCK_SEQPACKET work: 1) Send SHUTDOWN on socket close for SEQPACKET type. 2) Set SEQPACKET packet type during send. 3) Set 'VIRTIO_VSOCK_SEQ_EOR' bit in flags for last packet of message. 4) Implement data check function for SEQPACKET. 5) Check for max datagram size. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11virtio/vsock: dequeue callback for SOCK_SEQPACKETArseny Krasnov1-0/+5
Callback fetches RW packets from rx queue of socket until whole record is copied(if user's buffer is full, user is not woken up). This is done to not stall sender, because if we wake up user and it leaves syscall, nobody will send credit update for rest of record, and sender will wait for next enter of read syscall at receiver's side. So if user buffer is full, we just send credit update and drop data. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11virtio/vsock: defines and constants for SEQPACKETArseny Krasnov1-0/+9
Add set of defines and constants for SOCK_SEQPACKET support in vsock. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11af_vsock: rest of SEQPACKET supportArseny Krasnov1-0/+2
Add socket ops for SEQPACKET type and .seqpacket_allow() callback to query transports if they support SEQPACKET. Also split path for data check for STREAM and SEQPACKET branches. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11af_vsock: implement send logic for SEQPACKETArseny Krasnov1-0/+2
Update current stream enqueue function for SEQPACKET support: 1) Call transport's seqpacket enqueue callback. 2) Return value from enqueue function is whole record length or error for SOCK_SEQPACKET. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11af_vsock: implement SEQPACKET receive loopArseny Krasnov1-0/+4
Add receive loop for SEQPACKET. It looks like receive loop for STREAM, but there are differences: 1) It doesn't call notify callbacks. 2) It doesn't care about 'SO_SNDLOWAT' and 'SO_RCVLOWAT' values, because there is no sense for these values in SEQPACKET case. 3) It waits until whole record is received. 4) It processes and sets 'MSG_TRUNC' flag. So to avoid extra conditions for two types of socket inside one loop, two independent functions were created. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: phylink: introduce phylink_fwnode_phy_connect()Calvin Johnson1-0/+3
Define phylink_fwnode_phy_connect() to connect phy specified by a fwnode to a phylink instance. Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com> Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Acked-by: Grant Likely <grant.likely@arm.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: mdio: Add ACPI support code for mdioCalvin Johnson1-0/+26
Define acpi_mdiobus_register() to Register mii_bus and create PHYs for each ACPI child node. Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com> Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Acked-by: Grant Likely <grant.likely@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11ACPI: utils: Introduce acpi_get_local_address()Calvin Johnson1-0/+7
Introduce a wrapper around the _ADR evaluation. Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Acked-by: Grant Likely <grant.likely@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: mdiobus: Introduce fwnode_mdiobus_register_phy()Calvin Johnson1-0/+35
Introduce fwnode_mdiobus_register_phy() to register PHYs on the mdiobus. From the compatible string, identify whether the PHY is c45 and based on this create a PHY device instance which is registered on the mdiobus. Along with fwnode_mdiobus_register_phy() also introduce fwnode_find_mii_timestamper() and fwnode_mdiobus_phy_device_register() since they are needed. While at it, also use the newly introduced fwnode operation in of_mdiobus_phy_device_register(). Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com> Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Acked-by: Grant Likely <grant.likely@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net>