summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2018-07-16tls: Refactor tls_offload variable namesBoris Pismenny2-17/+16
For symmetry, we rename tls_offload_context to tls_offload_context_tx before we add tls_offload_context_rx. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16tcp: Don't coalesce decrypted and encrypted SKBsBoris Pismenny2-0/+15
Prevent coalescing of decrypted and encrypted SKBs in GRO and TCP layer. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net: Add TLS RX offload featureIlya Lesokhin1-0/+1
This patch adds a netdev feature to configure TLS RX inline crypto offload. Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net: Add decrypted field to skbBoris Pismenny1-0/+6
The decrypted bit is propogated to cloned/copied skbs. This will be used later by the inline crypto receive side offload of tls. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller4-37/+117
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-07-15 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Various different arm32 JIT improvements in order to optimize code emission and make the JIT code itself more robust, from Russell. 2) Support simultaneous driver and offloaded XDP in order to allow for advanced use-cases where some work is offloaded to the NIC and some to the host. Also add ability for bpftool to load programs and maps beyond just the cgroup case, from Jakub. 3) Add BPF JIT support in nfp for multiplication as well as division. For the latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong. 4) Add BTF pretty print functionality to bpftool in plain and JSON output format, from Okash. 5) Add build and installation to the BPF helper man page into bpftool, from Quentin. 6) Add a TCP BPF callback for listening sockets which is triggered right after the socket transitions to TCP_LISTEN state, from Andrey. 7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup tree and prints all attached programs, from Roman. 8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged packets, from Jesper. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-15bpf: Add BPF_SOCK_OPS_TCP_LISTEN_CBAndrey Ignatov1-0/+1
Add new TCP-BPF callback that is called on listen(2) right after socket transition to TCP_LISTEN state. It fills the gap for listening sockets in TCP-BPF. For example BPF program can set BPF_SOCK_OPS_STATE_CB_FLAG when socket becomes listening and track later transition from TCP_LISTEN to TCP_CLOSE with BPF_SOCK_OPS_STATE_CB callback. Before there was no way to do it with TCP-BPF and other options were much harder to work with. E.g. socket state tracking can be done with tracepoints (either raw or regular) but they can't be attached to cgroup and their lifetime has to be managed separately. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-14tcp: remove redundant rcv_nxt updateYafang Shao1-1/+0
tcp_rcv_nxt_update() is already executed in tcp_data_queue(). This line is redundant. See bellow, tcp_queue_rcv tcp_rcv_nxt_update(tcp_sk(sk), TCP_SKB_CB(skb)->end_seq); tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq); <<<< redundant Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-14net: sched: refactor flower walk to iterate over idrVlad Buslov2-11/+11
Extend struct tcf_walker with additional 'cookie' field. It is intended to be used by classifier walk implementations to continue iteration directly from particular filter, instead of iterating 'skip' number of times. Change flower walk implementation to save filter handle in 'cookie'. Each time flower walk is called, it looks up filter with saved handle directly with idr, instead of iterating over filter linked list 'skip' number of times. This change improves complexity of dumping flower classifier from quadratic to linearithmic. (assuming idr lookup has logarithmic complexity) Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reported-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-14net: ipmr: add support for passing full packet on wrong vifNikolay Aleksandrov1-5/+16
This patch adds support for IGMPMSG_WRVIFWHOLE which is used to pass full packet and real vif id when the incoming interface is wrong. While the RP and FHR are setting up state we need to be sending the registers encapsulated with all the data inside otherwise we lose it. The RP then decapsulates it and forwards it to the interested parties. Currently with WRONGVIF we can only be sending empty register packets and will lose that data. This behaviour can be enabled by using MRT_PIM with val == IGMPMSG_WRVIFWHOLE. This doesn't prevent IGMPMSG_WRONGVIF from happening, it happens in addition to it, also it is controlled by the same throttling parameters as WRONGVIF (i.e. 1 packet per 3 seconds currently). Both messages are generated to keep backwards compatibily and avoid breaking someone who was enabling MRT_PIM with val == 4, since any positive val is accepted and treated the same. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13xdp: support simultaneous driver and hw XDP attachmentJakub Kicinski2-59/+79
Split the query of HW-attached program from the software one. Introduce new .ndo_bpf command to query HW-attached program. This will allow drivers to install different programs in HW and SW at the same time. Netlink can now also carry multiple programs on dump (in which case mode will be set to XDP_ATTACHED_MULTI and user has to check per-attachment point attributes, IFLA_XDP_PROG_ID will not be present). We reuse IFLA_XDP_PROG_ID skb space for second mode, so rtnl_xdp_size() doesn't need to be updated. Note that the installation side is still not there, since all drivers currently reject installing more than one program at the time. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13xdp: factor out common program/flags handling from driversJakub Kicinski1-0/+34
Basic operations drivers perform during xdp setup and query can be moved to helpers in the core. Encapsulate program and flags into a structure and add helpers. Note that the structure is intended as the "main" program information source in the driver. Most drivers will additionally place the program pointer in their fast path or ring structures. The helpers don't have a huge impact now, but they will decrease the code duplication when programs can be installed in HW and driver at the same time. Encapsulating the basic operations in helpers will hopefully also reduce the number of changes to drivers which adopt them. Helpers could really be static inline, but they depend on definition of struct netdev_bpf which means they'd have to be placed in netdevice.h, an already 4500 line header. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13xdp: don't make drivers report attachment modeJakub Kicinski2-6/+9
prog_attached of struct netdev_bpf should have been superseded by simply setting prog_id long time ago, but we kept it around to allow offloading drivers to communicate attachment mode (drv vs hw). Subsequently drivers were also allowed to report back attachment flags (prog_flags), and since nowadays only programs attached will XDP_FLAGS_HW_MODE can get offloaded, we can tell the attachment mode from the flags driver reports. Remove prog_attached member. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13xdp: add per mode attributes for attached programsJakub Kicinski1-4/+26
In preparation for support of simultaneous driver and hardware XDP support add per-mode attributes. The catch-all IFLA_XDP_PROG_ID will still be reported, but user space can now also access the program ID in a new IFLA_XDP_<mode>_PROG_ID attribute. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13devlink: Add generic parameters region_snapshotAlex Vesker1-0/+5
region_snapshot - When set enables capturing region snapshots Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13devlink: Add support for region snapshot read commandAlex Vesker1-0/+182
Add support for DEVLINK_CMD_REGION_READ_GET used for both reading and dumping region data. Read allows reading from a region specific address for given length. Dump allows reading the full region. If only snapshot ID is provided a snapshot dump will be done. If snapshot ID, Address and Length are provided a snapshot read will done. This is used for both snapshot access and will be used in the same way to access current data on the region. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13devlink: Add support for region snapshot delete commandAlex Vesker1-0/+93
Add support for DEVLINK_CMD_REGION_DEL used for deleting a snapshot from a region. The snapshot ID is required. Also added notification support for NEW and DEL of snapshots. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13devlink: Extend the support querying for region snapshot IDsAlex Vesker1-0/+53
Extend the support for DEVLINK_CMD_REGION_GET command to also return the IDs of the snapshot currently present on the region. Each reply will include a nested snapshots attribute that can contain multiple snapshot attributes each with an ID. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13devlink: Add support for region get commandAlex Vesker1-0/+114
Add support for DEVLINK_CMD_REGION_GET command which is used for querying for the supported DEV/REGION values of devlink devices. The support is both for doit and dumpit. Reply includes: BUS_NAME, DEVICE_NAME, REGION_NAME, REGION_SIZE Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13devlink: Add support for creating region snapshotsAlex Vesker1-0/+95
Each device address region can store multiple snapshots, each snapshot is identified using a different numerical ID. This ID is used when deleting a snapshot or showing an address region specific snapshot. This patch exposes a callback to add a new snapshot to an address region. The snapshot will be deleted using the destructor function when destroying a region or when a snapshot delete command from devlink user tool. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13devlink: Add callback to query for snapshot id before snapshot createAlex Vesker1-0/+21
To restrict the driver with the snapshot ID selection a new callback is introduced for the driver to get the snapshot ID before creating a new snapshot. This will also allow giving the same ID for multiple snapshots taken of different regions on the same time. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13devlink: Add support for creating and destroying regionsAlex Vesker1-0/+84
This allows a device to register its supported address regions. Each address region can be accessed directly for example reading the snapshots taken of this address space. Drivers are not limited in the name selection for different regions. An example of a region-name can be: pci cr-space, register-space. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13net: gro: properly remove skb from listPrashant Bhole1-2/+4
Following crash occurs in validate_xmit_skb_list() when same skb is iterated multiple times in the loop and consume_skb() is called. The root cause is calling list_del_init(&skb->list) and not clearing skb->next in d4546c2509b1. list_del_init(&skb->list) sets skb->next to point to skb itself. skb->next needs to be cleared because other parts of network stack uses another kind of SKB lists. validate_xmit_skb_list() uses such list. A similar type of bugfix was reported by Jesper Dangaard Brouer. https://patchwork.ozlabs.org/patch/942541/ This patch clears skb->next and changes list_del_init() to list_del() so that list->prev will maintain the list poison. [ 148.185511] ================================================================== [ 148.187865] BUG: KASAN: use-after-free in validate_xmit_skb_list+0x4b/0xa0 [ 148.190158] Read of size 8 at addr ffff8801e52eefc0 by task swapper/1/0 [ 148.192940] [ 148.193642] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.18.0-rc3+ #25 [ 148.195423] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014 [ 148.199129] Call Trace: [ 148.200565] <IRQ> [ 148.201911] dump_stack+0xc6/0x14c [ 148.203572] ? dump_stack_print_info.cold.1+0x2f/0x2f [ 148.205083] ? kmsg_dump_rewind_nolock+0x59/0x59 [ 148.206307] ? validate_xmit_skb+0x2c6/0x560 [ 148.207432] ? debug_show_held_locks+0x30/0x30 [ 148.208571] ? validate_xmit_skb_list+0x4b/0xa0 [ 148.211144] print_address_description+0x6c/0x23c [ 148.212601] ? validate_xmit_skb_list+0x4b/0xa0 [ 148.213782] kasan_report.cold.6+0x241/0x2fd [ 148.214958] validate_xmit_skb_list+0x4b/0xa0 [ 148.216494] sch_direct_xmit+0x1b0/0x680 [ 148.217601] ? dev_watchdog+0x4e0/0x4e0 [ 148.218675] ? do_raw_spin_trylock+0x10/0x120 [ 148.219818] ? do_raw_spin_lock+0xe0/0xe0 [ 148.221032] __dev_queue_xmit+0x1167/0x1810 [ 148.222155] ? sched_clock+0x5/0x10 [...] [ 148.474257] Allocated by task 0: [ 148.475363] kasan_kmalloc+0xbf/0xe0 [ 148.476503] kmem_cache_alloc+0xb4/0x1b0 [ 148.477654] __build_skb+0x91/0x250 [ 148.478677] build_skb+0x67/0x180 [ 148.479657] e1000_clean_rx_irq+0x542/0x8a0 [ 148.480757] e1000_clean+0x652/0xd10 [ 148.481772] net_rx_action+0x4ea/0xc20 [ 148.482808] __do_softirq+0x1f9/0x574 [ 148.483831] [ 148.484575] Freed by task 0: [ 148.485504] __kasan_slab_free+0x12e/0x180 [ 148.486589] kmem_cache_free+0xb4/0x240 [ 148.487634] kfree_skbmem+0xed/0x150 [ 148.488648] consume_skb+0x146/0x250 [ 148.489665] validate_xmit_skb+0x2b7/0x560 [ 148.490754] validate_xmit_skb_list+0x70/0xa0 [ 148.491897] sch_direct_xmit+0x1b0/0x680 [ 148.493949] __dev_queue_xmit+0x1167/0x1810 [ 148.495103] br_dev_queue_push_xmit+0xce/0x250 [ 148.496196] br_forward_finish+0x276/0x280 [ 148.497234] __br_forward+0x44f/0x520 [ 148.498260] br_forward+0x19f/0x1b0 [ 148.499264] br_handle_frame_finish+0x65e/0x980 [ 148.500398] NF_HOOK.constprop.10+0x290/0x2a0 [ 148.501522] br_handle_frame+0x417/0x640 [ 148.502582] __netif_receive_skb_core+0xaac/0x18f0 [ 148.503753] __netif_receive_skb_one_core+0x98/0x120 [ 148.504958] netif_receive_skb_internal+0xe3/0x330 [ 148.506154] napi_gro_complete+0x190/0x2a0 [ 148.507243] dev_gro_receive+0x9f7/0x1100 [ 148.508316] napi_gro_receive+0xcb/0x260 [ 148.509387] e1000_clean_rx_irq+0x2fc/0x8a0 [ 148.510501] e1000_clean+0x652/0xd10 [ 148.511523] net_rx_action+0x4ea/0xc20 [ 148.512566] __do_softirq+0x1f9/0x574 [ 148.513598] [ 148.514346] The buggy address belongs to the object at ffff8801e52eefc0 [ 148.514346] which belongs to the cache skbuff_head_cache of size 232 [ 148.517047] The buggy address is located 0 bytes inside of [ 148.517047] 232-byte region [ffff8801e52eefc0, ffff8801e52ef0a8) [ 148.519549] The buggy address belongs to the page: [ 148.520726] page:ffffea000794bb00 count:1 mapcount:0 mapping:ffff880106f4dfc0 index:0xffff8801e52ee840 compound_mapcount: 0 [ 148.524325] flags: 0x17ffffc0008100(slab|head) [ 148.525481] raw: 0017ffffc0008100 ffff880106b938d0 ffff880106b938d0 ffff880106f4dfc0 [ 148.527503] raw: ffff8801e52ee840 0000000000190011 00000001ffffffff 0000000000000000 [ 148.529547] page dumped because: kasan: bad access detected Fixes: d4546c2509b1 ("net: Convert GRO SKB handling to list_head.") Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Reported-by: Tyler Hicks <tyhicks@canonical.com> Tested-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13net: ipv4: fix listify ip_rcv_finish in case of forwardingJesper Dangaard Brouer1-1/+7
In commit 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") calling dst_input(skb) was split-out. The ip_sublist_rcv_finish() just calls dst_input(skb) in a loop. The problem is that ip_sublist_rcv_finish() forgot to remove the SKB from the list before invoking dst_input(). Further more we need to clear skb->next as other parts of the network stack use another kind of SKB lists for xmit_more (see dev_hard_start_xmit). A crash occurs if e.g. dst_input() invoke ip_forward(), which calls dst_output()/ip_output() that eventually calls __dev_queue_xmit() + sch_direct_xmit(), and a crash occurs in validate_xmit_skb_list(). This patch only fixes the crash, but there is a huge potential for a performance boost if we can pass an SKB-list through to ip_forward. Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13net/sched: act_skbedit: don't use spinlock in the data pathDavide Caratti1-39/+68
use RCU instead of spin_{,un}lock_bh, to protect concurrent read/write on act_skbedit configuration. This reduces the effects of contention in the data path, in case multiple readers are present. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13net/sched: skbedit: use per-cpu countersDavide Caratti1-4/+4
use per-CPU counters, instead of sharing a single set of stats with all cores: this removes the need of spinlocks when stats are read/updated. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13tcp: use monotonic timestamps for PAWSArnd Bergmann3-6/+7
Using get_seconds() for timestamps is deprecated since it can lead to overflows on 32-bit systems. While the interface generally doesn't overflow until year 2106, the specific implementation of the TCP PAWS algorithm breaks in 2038 when the intermediate signed 32-bit timestamps overflow. A related problem is that the local timestamps in CLOCK_REALTIME form lead to unexpected behavior when settimeofday is called to set the system clock backwards or forwards by more than 24 days. While the first problem could be solved by using an overflow-safe method of comparing the timestamps, a nicer solution is to use a monotonic clocksource with ktime_get_seconds() that simply doesn't overflow (at least not until 136 years after boot) and that doesn't change during settimeofday(). To make 32-bit and 64-bit architectures behave the same way here, and also save a few bytes in the tcp_options_received structure, I'm changing the type to a 32-bit integer, which is now safe on all architectures. Finally, the ts_recent_stamp field also (confusingly) gets used to store a jiffies value in tcp_synq_overflow()/tcp_synq_no_recent_overflow(). This is currently safe, but changing the type to 32-bit requires some small changes there to keep it working. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13net/tls: Use aead_request_alloc/free for request alloc/freeVakul Garg1-8/+4
Instead of kzalloc/free for aead_request allocation and free, use functions aead_request_alloc(), aead_request_free(). It ensures that any sensitive crypto material held in crypto transforms is securely erased from memory. Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Acked-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12tipc: check session number before accepting link protocol messagesJon Maloy3-22/+52
In some virtual environments we observe a significant higher number of packet reordering and delays than we have been used to traditionally. This makes it necessary with stricter checks on incoming link protocol messages' session number, which until now only has been validated for RESET messages. Since the other two message types, ACTIVATE and STATE messages also carry this number, it is easy to extend the validation check to those messages. We also introduce a flag indicating if a link has a valid peer session number or not. This eliminates the mixing of 32- and 16-bit arithmethics we are currently using to achieve this. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12tipc: add sequence number check for link STATE messagesJon Maloy4-6/+32
Some switch infrastructures produce huge amounts of packet duplicates. This becomes a problem if those messages are STATE/NACK protocol messages, causing unnecessary retransmissions of already accepted packets. We now introduce a unique sequence number per STATE protocol message so that duplicates can be identified and ignored. This will also be useful when tracing such cases, and to avert replay attacks when TIPC is encrypted. For compatibility reasons we have to introduce a new capability flag TIPC_LINK_PROTO_SEQNO to handle this new feature. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12Merge branch '10GbE' of ↵David S. Miller4-31/+173
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== L2 Fwd Offload & 10GbE Intel Driver Updates 2018-07-09 This patch series is meant to allow support for the L2 forward offload, aka MACVLAN offload without the need for using ndo_select_queue. The existing solution currently requires that we use ndo_select_queue in the transmit path if we want to associate specific Tx queues with a given MACVLAN interface. In order to get away from this we need to repurpose the tc_to_txq array and XPS pointer for the MACVLAN interface and use those as a means of accessing the queues on the lower device. As a result we cannot offload a device that is configured as multiqueue, however it doesn't really make sense to configure a macvlan interfaced as being multiqueue anyway since it doesn't really have a qdisc of its own in the first place. The big changes in this set are: Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN Disable XPS for single queue devices Replace accel_priv with sb_dev in ndo_select_queue Add sb_dev parameter to fallback function for ndo_select_queue Consolidated ndo_select_queue functions that appeared to be duplicates ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12tcp: expose both send and receive intervals for rate sampleDeepti Raghavan1-0/+4
Congestion control algorithms, which access the rate sample through the tcp_cong_control function, only have access to the maximum of the send and receive interval, for cases where the acknowledgment rate may be inaccurate due to ACK compression or decimation. Algorithms may want to use send rates and receive rates as separate signals. Signed-off-by: Deepti Raghavan <deeptir@mit.edu> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12net: sched: fix unprotected access to rcu cookie pointerVlad Buslov1-2/+7
Fix action attribute size calculation function to take rcu read lock and access act_cookie pointer with rcu dereference. Fixes: eec94fdb0480 ("net: sched: use rcu for action cookie update") Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12net: sched: act_ife: fix memory leak in ife initVlad Buslov1-1/+3
Free params if tcf_idr_check_alloc() returned error. Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12net/ipv6: propagate net.ipv6.conf.all.addr_gen_mode to devicesSabrina Dubroca1-0/+12
This aligns the addr_gen_mode sysctl with the expected behavior of the "all" variant. Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address generation mode") Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12net/ipv6: reserve room for IFLA_INET6_ADDR_GEN_MODESabrina Dubroca1-1/+3
inet6_ifla6_size() is called to check how much space is needed by inet6_fill_link_af() and inet6_fill_ifinfo(), both of which include the IFLA_INET6_ADDR_GEN_MODE attribute. Reserve some room for it. Fixes: bc91b0f07ada ("ipv6: addrconf: implement address generation modes") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12net/ipv6: don't reinitialize ndev->cnf.addr_gen_mode on new inet6_devSabrina Dubroca1-2/+0
The value has already been copied from this netns's devconf_dflt, it shouldn't be reset to the global kernel default. Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address generation mode") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12net/ipv6: fix addrconf_sysctl_addr_gen_modeSabrina Dubroca1-13/+14
addrconf_sysctl_addr_gen_mode() has multiple problems. First, it ignores the errors returned by proc_dointvec(). addrconf_sysctl_addr_gen_mode() calls proc_dointvec() directly, which writes the value to memory, and then checks if it's valid and may return EINVAL. If a bad value is given, the value displayed when reading net.ipv6.conf.foo.addr_gen_mode next time will be invalid. In case the value provided by the user was valid, addrconf_dev_config() won't be called since idev->cnf.addr_gen_mode has already been updated. Fix this in the usual way we deal with values that need to be checked after the proc_do*() helper has returned: define a local ctl_table and storage, call proc_dointvec() on that temporary area, then check and store. addrconf_sysctl_addr_gen_mode() also writes the new value to the global ipv6_devconf_dflt, when we're writing to some netns's default, so that new netns will inherit the value that was set by the change occuring in any netns. That doesn't make any sense, so let's drop this assignment. Finally, since addr_gen_mode is a __u32, switch to proc_douintvec(). Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address generation mode") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12net/sched: flower: Fix null pointer dereference when run tc vlan commandJianbo Liu1-22/+26
Zahari issued tc vlan command without setting vlan_ethtype, which will crash kernel. To avoid this, we must check tb[TCA_FLOWER_KEY_VLAN_ETH_TYPE] is not null before use it. Also we don't need to dump vlan_ethtype or cvlan_ethtype in this case. Fixes: d64efd0926ba ('net/sched: flower: Add supprt for matching on QinQ vlan headers') Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Reported-by: Zahari Doychev <zahari.doychev@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11sch_cake: Conditionally split GSO segmentsToke Høiland-Jørgensen1-26/+73
At lower bandwidths, the transmission time of a single GSO segment can add an unacceptable amount of latency due to HOL blocking. Furthermore, with a software shaper, any tuning mechanism employed by the kernel to control the maximum size of GSO segments is thrown off by the artificial limit on bandwidth. For this reason, we split GSO segments into their individual packets iff the shaper is active and configured to a bandwidth <= 1 Gbps. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11sch_cake: Add overhead compensation support to the rate shaperToke Høiland-Jørgensen1-1/+123
This commit adds configurable overhead compensation support to the rate shaper. With this feature, userspace can configure the actual bottleneck link overhead and encapsulation mode used, which will be used by the shaper to calculate the precise duration of each packet on the wire. This feature is needed because CAKE is often deployed one or two hops upstream of the actual bottleneck (which can be, e.g., inside a DSL or cable modem). In this case, the link layer characteristics and overhead reported by the kernel does not match the actual bottleneck. Being able to set the actual values in use makes it possible to configure the shaper rate much closer to the actual bottleneck rate (our experience shows it is possible to get with 0.1% of the actual physical bottleneck rate), thus keeping latency low without sacrificing bandwidth. The overhead compensation has three tunables: A fixed per-packet overhead size (which, if set, will be accounted from the IP packet header), a minimum packet size (MPU) and a framing mode supporting either ATM or PTM framing. We include a set of common keywords in TC to help users configure the right parameters. If no overhead value is set, the value reported by the kernel is used. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11sch_cake: Add DiffServ handlingToke Høiland-Jørgensen1-16/+423
This adds support for DiffServ-based priority queueing to CAKE. If the shaper is in use, each priority tier gets its own virtual clock, which limits that tier's rate to a fraction of the overall shaped rate, to discourage trying to game the priority mechanism. CAKE defaults to a simple, three-tier mode that interprets most code points as "best effort", but places CS1 traffic into a low-priority "bulk" tier which is assigned 1/16 of the total rate, and a few code points indicating latency-sensitive or control traffic (specifically TOS4, VA, EF, CS6, CS7) into a "latency sensitive" high-priority tier, which is assigned 1/4 rate. The other supported DiffServ modes are a 4-tier mode matching the 802.11e precedence rules, as well as two 8-tier modes, one of which implements strict precedence of the eight priority levels. This commit also adds an optional DiffServ 'wash' mode, which will zero out the DSCP fields of any packet passing through CAKE. While this can technically be done with other mechanisms in the kernel, having the feature available in CAKE significantly decreases configuration complexity; and the implementation cost is low on top of the other DiffServ-handling code. Filters and applications can set the skb->priority field to override the DSCP-based classification into tiers. If TC_H_MAJ(skb->priority) matches CAKE's qdisc handle, the minor number will be interpreted as a priority tier if it is less than or equal to the number of configured priority tiers. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11sch_cake: Add NAT awareness to packet classifierToke Høiland-Jørgensen1-2/+49
When CAKE is deployed on a gateway that also performs NAT (which is a common deployment mode), the host fairness mechanism cannot distinguish internal hosts from each other, and so fails to work correctly. To fix this, we add an optional NAT awareness mode, which will query the kernel conntrack mechanism to obtain the pre-NAT addresses for each packet and use that in the flow and host hashing. When the shaper is enabled and the host is already performing NAT, the cost of this lookup is negligible. However, in unlimited mode with no NAT being performed, there is a significant CPU cost at higher bandwidths. For this reason, the feature is turned off by default. Cc: netfilter-devel@vger.kernel.org Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11netfilter: Add nf_ct_get_tuple_skb global lookup functionToke Høiland-Jørgensen2-0/+51
This adds a global netfilter function to extract a conntrack tuple from an skb. The function uses a new function added to nf_ct_hook, which will try to get the tuple from skb->_nfct, and do a full lookup if that fails. This makes it possible to use the lookup function before the skb has passed through the conntrack init hooks (e.g., in an ingress qdisc). The tuple is copied to the caller to avoid issues with reference counting. The function returns false if conntrack is not loaded, allowing it to be used without incurring a module dependency on conntrack. This is used by the NAT mode in sch_cake. Cc: netfilter-devel@vger.kernel.org Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11sch_cake: Add optional ACK filterToke Høiland-Jørgensen1-2/+452
The ACK filter is an optional feature of CAKE which is designed to improve performance on links with very asymmetrical rate limits. On such links (which are unfortunately quite prevalent, especially for DSL and cable subscribers), the downstream throughput can be limited by the number of ACKs capable of being transmitted in the *upstream* direction. Filtering ACKs can, in general, have adverse effects on TCP performance because it interferes with ACK clocking (especially in slow start), and it reduces the flow's resiliency to ACKs being dropped further along the path. To alleviate these drawbacks, the ACK filter in CAKE tries its best to always keep enough ACKs queued to ensure forward progress in the TCP flow being filtered. It does this by only filtering redundant ACKs. In its default 'conservative' mode, the filter will always keep at least two redundant ACKs in the queue, while in 'aggressive' mode, it will filter down to a single ACK. The ACK filter works by inspecting the per-flow queue on every packet enqueue. Starting at the head of the queue, the filter looks for another eligible packet to drop (so the ACK being dropped is always closer to the head of the queue than the packet being enqueued). An ACK is eligible only if it ACKs *fewer* bytes than the new packet being enqueued, including any SACK options. This prevents duplicate ACKs from being filtered, to avoid interfering with retransmission logic. In addition, we check TCP header options and only drop those that are known to not interfere with sender state. In particular, packets with unknown option codes are never dropped. In aggressive mode, an eligible packet is always dropped, while in conservative mode, at least two ACKs are kept in the queue. Only pure ACKs (with no data segments) are considered eligible for dropping, but when an ACK with data segments is enqueued, this can cause another pure ACK to become eligible for dropping. The approach described above ensures that this ACK filter avoids most of the drawbacks of a naive filtering mechanism that only keeps flow state but does not inspect the queue. This is the rationale for including the ACK filter in CAKE itself rather than as separate module (as the TC filter, for instance). Our performance evaluation has shown that on a 30/1 Mbps link with a bidirectional traffic test (RRUL), turning on the ACK filter on the upstream link improves downstream throughput by ~20% (both modes) and upstream throughput by ~12% in conservative mode and ~40% in aggressive mode, at the cost of ~5ms of inter-flow latency due to the increased congestion. In *really* pathological cases, the effect can be a lot more; for instance, the ACK filter increases the achievable downstream throughput on a link with 100 Kbps in the upstream direction by an order of magnitude (from ~2.5 Mbps to ~25 Mbps). Finally, even though we consider the ACK filter to be safer than most, we do not recommend turning it on everywhere: on more symmetrical link bandwidths the effect is negligible at best. Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11sch_cake: Add ingress modeToke Høiland-Jørgensen1-4/+83
The ingress mode is meant to be enabled when CAKE runs downlink of the actual bottleneck (such as on an IFB device). The mode changes the shaper to also account dropped packets to the shaped rate, as these have already traversed the bottleneck. Enabling ingress mode will also tune the AQM to always keep at least two packets queued *for each flow*. This is done by scaling the minimum queue occupancy level that will disable the AQM by the number of active bulk flows. The rationale for this is that retransmits are more expensive in ingress mode, since dropped packets have to traverse the bottleneck again when they are retransmitted; thus, being more lenient and keeping a minimum number of packets queued will improve throughput in cases where the number of active flows are so large that they saturate the bottleneck even at their minimum window size. This commit also adds a separate switch to enable ingress mode rate autoscaling. If enabled, the autoscaling code will observe the actual traffic rate and adjust the shaper rate to match it. This can help avoid latency increases in the case where the actual bottleneck rate decreases below the shaped rate. The scaling filters out spikes by an EWMA filter. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11sched: Add Common Applications Kept Enhanced (cake) qdiscToke Høiland-Jørgensen3-0/+1879
sch_cake targets the home router use case and is intended to squeeze the most bandwidth and latency out of even the slowest ISP links and routers, while presenting an API simple enough that even an ISP can configure it. Example of use on a cable ISP uplink: tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter To shape a cable download link (ifb and tc-mirred setup elided) tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash CAKE is filled with: * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel derived Flow Queuing system, which autoconfigures based on the bandwidth. * A novel "triple-isolate" mode (the default) which balances per-host and per-flow FQ even through NAT. * An deficit based shaper, that can also be used in an unlimited mode. * 8 way set associative hashing to reduce flow collisions to a minimum. * A reasonable interpretation of various diffserv latency/loss tradeoffs. * Support for zeroing diffserv markings for entering and exiting traffic. * Support for interacting well with Docsis 3.0 shaper framing. * Extensive support for DSL framing types. * Support for ack filtering. * Extensive statistics for measuring, loss, ecn markings, latency variation. A paper describing the design of CAKE is available at https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN). This patch adds the base shaper and packet scheduler, while subsequent commits add the optional (configurable) features. The full userspace API and most data structures are included in this commit, but options not understood in the base version will be ignored. Various versions baking have been available as an out of tree build for kernel versions going back to 3.10, as the embedded router world has been running a few years behind mainline Linux. A stable version has been generally available on lede-17.01 and later. sch_cake replaces a combination of iptables, tc filter, htb and fq_codel in the sqm-scripts, with sane defaults and vastly simpler configuration. CAKE's principal author is Jonathan Morton, with contributions from Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller, Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht, and Loganaden Velvindron. Testing from Pete Heist, Georgios Amanakis, and the many other members of the cake@lists.bufferbloat.net mailing list. tc -s qdisc show dev eth2 qdisc cake 8017: root refcnt 2 bandwidth 1Gbit diffserv3 triple-isolate split-gso rtt 100.0ms noatm overhead 38 mpu 84 Sent 51504294511 bytes 37724591 pkt (dropped 6, overlimits 64958695 requeues 12) backlog 0b 0p requeues 12 memory used: 1053008b of 15140Kb capacity estimate: 970Mbit min/max network layer size: 28 / 1500 min/max overhead-adjusted size: 84 / 1538 average network hdr offset: 14 Bulk Best Effort Voice thresh 62500Kbit 1Gbit 250Mbit target 5.0ms 5.0ms 5.0ms interval 100.0ms 100.0ms 100.0ms pk_delay 5us 5us 6us av_delay 3us 2us 2us sp_delay 2us 1us 1us backlog 0b 0b 0b pkts 3164050 25030267 9530280 bytes 3227519915 35396974782 12879808898 way_inds 0 8 0 way_miss 21 366 25 way_cols 0 0 0 drops 5 0 1 marks 0 0 0 ack_drop 0 0 0 sp_flows 1 3 0 bk_flows 0 1 1 un_flows 0 0 0 max_len 68130 68130 68130 Tested-by: Pete Heist <peteheist@gmail.com> Tested-by: Georgios Amanakis <gamanakis@gmail.com> Signed-off-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10tcp: remove SG-related comment in tcp_sendmsg()Julian Wiedmann1-3/+0
Since commit 74d4a8f8d378 ("tcp: remove sk_can_gso() use"), the code doesn't care whether the interface supports SG. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10net: core: fix use-after-free in __netif_receive_skb_list_coreEdward Cree1-2/+7
__netif_receive_skb_core can free the skb, so we have to use the dequeue- enqueue model when calling it from __netif_receive_skb_list_core. Fixes: 88eb1944e18c ("net: core: propagate SKB lists through packet_type lookup") Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10net: core: fix uses-after-free in list processingEdward Cree1-8/+13
In netif_receive_skb_list_internal(), all of skb_defer_rx_timestamp(), do_xdp_generic() and enqueue_to_backlog() can lead to kfree(skb). Thus, we cannot wait until after they return to remove the skb from the list; instead, we remove it first and, in the pass case, add it to a sublist afterwards. In the case of enqueue_to_backlog() we have already decided not to pass when we call the function, so we do not need a sublist. Fixes: 7da517a3bc52 ("net: core: Another step of skb receive list processing") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-09net: allow fallback function to pass netdevAlexander Duyck2-12/+7
For most of these calls we can just pass NULL through to the fallback function as the sb_dev. The only cases where we cannot are the cases where we might be dealing with either an upper device or a driver that would have configured things to support an sb_dev itself. The only driver that has any significant change in this patch set should be ixgbe as we can drop the redundant functionality that existed in both the ndo_select_queue function and the fallback function that was passed through to us. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>