path: root/net/sched
Age | Commit message | Author | Files | Lines
2020-05-15 | net: sched: cls_flower: implement terse dump support | Vlad Buslov | 1 | -0/+43
Implement tcf_proto_ops->terse_dump() callback for flower classifier. Only dump handle, flags and action data in terse mode. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 | net: sched: implement terse dump support in act | Vlad Buslov | 2 | -8/+50
Extend tcf_action_dump() with boolean argument 'terse' that is used to request terse-mode action dump. In terse mode only essential data needed to identify particular action (action kind, cookie, etc.) and its stats is put to resulting skb and everything else is omitted. Implement tcf_exts_terse_dump() helper in cls API that is intended to be used to request terse dump of all exts (actions) attached to the filter. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 | net: sched: introduce terse dump flag | Vlad Buslov | 1 | -8/+31
Add new TCA_DUMP_FLAGS attribute and use it in cls API to request terse filter output from classifiers with TCA_DUMP_FLAGS_TERSE flag. This option is intended to be used to improve performance of TC filter dump when userland only needs to obtain stats and not the whole classifier/action data. Extend struct tcf_proto_ops with new terse_dump() callback that must be defined by supporting classifier implementations. Support of the options in specific classifiers and actions is implemented in following patches in the series. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | David S. Miller | 5 | -8/+31
Conflicts were all overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 | net: flow_offload: skip hw stats check for FLOW_ACTION_HW_STATS_DONT_CARE | Pablo Neira Ayuso | 1 | -2/+12
This patch adds FLOW_ACTION_HW_STATS_DONT_CARE, which tells the driver that the frontend does not need counters; this hw stats type request never fails. The FLOW_ACTION_HW_STATS_DISABLED type explicitly requests the driver to disable the stats; however, if the driver cannot disable counters, it bails out. TCA_ACT_HW_STATS_* maintains the 1:1 mapping with FLOW_ACTION_HW_STATS_* except for disabled, which is mapped to FLOW_ACTION_HW_STATS_DISABLED (this is 0 in tc). Add tc_act_hw_stats() to perform the mapping between TCA_ACT_HW_STATS_* and FLOW_ACTION_HW_STATS_*. Fixes: 319a1d19471e ("flow_offload: check for basic action hw stats type") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-05 | sch_choke: Remove classid from choke_skb_cb. | David S. Miller | 1 | -1/+0
Suggested by Cong Wang. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-05 | net: sched: choke: Remove unused inline function choke_set_classid | YueHaibing | 1 | -5/+0
There are no callers in-tree anymore since commit 5952fde10c35 ("net: sched: choke: remove dead filter classify code"). Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 | net: partially revert dynamic lockdep key changes | Cong Wang | 1 | -6/+11
This patch reverts the following commits: commit 064ff66e2bef84f1153087612032b5b9eab005bd "bonding: add missing netdev_update_lockdep_key()" commit 53d374979ef147ab51f5d632dfe20b14aebeccd0 "net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()" commit 1f26c0d3d24125992ab0026b0dab16c08df947c7 "net: fix kernel-doc warning in <linux/netdevice.h>" commit ab92d68fc22f9afab480153bd82a20f6e2533769 "net: core: add generic lockdep keys" but keeps the addr_list_lock_key because we still lock addr_list_lock nestedly on stack devices; unlike xmit_lock, this is safe because we don't take addr_list_lock on any fast path. Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 | net_sched: sch_fq: add horizon attribute | Eric Dumazet | 1 | -5/+54
QUIC servers would like to use SO_TXTIME, without having CAP_NET_ADMIN, to efficiently pace UDP packets. As far as sch_fq is concerned, we need to add safety checks, so that a buggy application does not fill the qdisc with packets having delivery time far in the future. This patch adds a configurable horizon (default: 10 seconds), and a configurable policy when a packet is beyond the horizon at enqueue() time: - either drop the packet (default policy) - or cap its delivery time to the horizon. $ tc -s -d qd sh dev eth0 qdisc fq 8022: root refcnt 257 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 10Kb initial_quantum 51160b low_rate_threshold 550Kbit refill_delay 40.0ms timer_slack 10.000us horizon 10.000s Sent 1234215879 bytes 837099 pkt (dropped 21, overlimits 0 requeues 6) backlog 0b 0p requeues 6 flows 1191 (inactive 1177 throttled 0) gc 0 highprio 0 throttled 692 latency 11.480us pkts_too_long 0 alloc_errors 0 horizon_drops 21 horizon_caps 0 v2: fixed an overflow on 32bit kernels in fq_init(), reported by kbuild test robot <lkp@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
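The two horizon policies described above (drop, or cap the delivery time) can be sketched in a few lines of C. This is only an illustration under stated assumptions: `struct horizon_cfg`, `apply_horizon()` and the verdict names are hypothetical and are not taken from sch_fq; only the behaviour (default 10 second horizon, drop by default, otherwise cap) comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the qdisc state; not the sch_fq structures. */
struct horizon_cfg {
	uint64_t horizon_ns;   /* how far ahead of "now" a delivery time may lie */
	bool     drop_policy;  /* true: drop (default policy), false: cap */
	uint64_t drops;        /* counterpart of horizon_drops in the stats */
	uint64_t caps;         /* counterpart of horizon_caps in the stats */
};

enum verdict { VERDICT_ENQUEUE, VERDICT_DROP };

/* Apply the horizon policy to one packet's requested delivery time. */
static enum verdict apply_horizon(struct horizon_cfg *cfg, uint64_t now_ns,
				  uint64_t *tstamp_ns)
{
	uint64_t limit = now_ns + cfg->horizon_ns;

	if (*tstamp_ns <= limit)
		return VERDICT_ENQUEUE;   /* within the horizon: nothing to do */

	if (cfg->drop_policy) {
		cfg->drops++;
		return VERDICT_DROP;      /* beyond the horizon: drop */
	}

	*tstamp_ns = limit;               /* beyond the horizon: cap to it */
	cfg->caps++;
	return VERDICT_ENQUEUE;
}
```

With a 10 second horizon, a packet stamped 11 seconds ahead is either dropped or rescheduled to exactly 10 seconds ahead, depending on the policy.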
2020-05-04 | net_sched: fix tcm_parent in tc filter dump | Cong Wang | 1 | -4/+4
When we tell kernel to dump filters from root (ffff:ffff), those filters on ingress (ffff:0000) are matched, but their true parents must be dumped as they are. However, kernel dumps just whatever we tell it, that is either ffff:ffff or ffff:0000: $ nl-cls-list --dev=dummy0 --parent=root cls basic dev dummy0 id none parent root prio 49152 protocol ip match-all cls basic dev dummy0 id :1 parent root prio 49152 protocol ip match-all $ nl-cls-list --dev=dummy0 --parent=ffff: cls basic dev dummy0 id none parent ffff: prio 49152 protocol ip match-all cls basic dev dummy0 id :1 parent ffff: prio 49152 protocol ip match-all This is confusing and misleading, more importantly this is a regression since 4.15, so the old behavior must be restored. And, when tc filters are installed on a tc class, the parent should be the classid, rather than the qdisc handle. Commit edf6711c9840 ("net: sched: remove classid and q fields from tcf_proto") removed the classid we save for filters, we can just restore this classid in tcf_block. Steps to reproduce this: ip li set dev dummy0 up tc qd add dev dummy0 ingress tc filter add dev dummy0 parent ffff: protocol arp basic action pass tc filter show dev dummy0 root Before this patch: filter protocol arp pref 49152 basic filter protocol arp pref 49152 basic handle 0x1 action order 1: gact action pass random type none pass val 0 index 1 ref 1 bind 1 After this patch: filter parent ffff: protocol arp pref 49152 basic filter parent ffff: protocol arp pref 49152 basic handle 0x1 action order 1: gact action pass random type none pass val 0 index 1 ref 1 bind 1 Fixes: a10fa20101ae ("net: sched: propagate q and parent from caller down to tcf_fill_node") Fixes: edf6711c9840 ("net: sched: remove classid and q fields from tcf_proto") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 | net: sched: fallback to qdisc noqueue if default qdisc setup fail | Jesper Dangaard Brouer | 1 | -3/+14
Currently, if the default qdisc setup/init fails, the device ends up with qdisc "noop", which causes all TX packets to get dropped. With the introduction of sysctl net/core/default_qdisc it is possible to change the default qdisc to a more advanced one, which opens up the possibility that Qdisc_ops->init() can fail. This patch detects this kind of failure and falls back to qdisc "noqueue", which is so simple that its init call will not fail. This allows the interface to continue functioning. V2: As this also captures memory failures, which are transient, the device is not kept in IFF_NO_QUEUE state. This allows the net_device to retry the default qdisc assignment. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
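The fallback idea can be illustrated with a small, self-contained C sketch. Everything here is hypothetical (`struct qdisc_ops_sketch`, `attach_default()`, `always_fails()`); it is not the kernel's attach path, only the pattern the message describes: try the configured default, and if its init fails, use a qdisc whose init cannot fail.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for Qdisc_ops. */
struct qdisc_ops_sketch {
	const char *id;
	int (*init)(void);   /* returns 0 on success, negative on failure */
};

/* Assumed to be so simple that it never fails. */
static int noqueue_init(void) { return 0; }

/* Simulates an advanced qdisc whose init fails (for example, no memory). */
static int always_fails(void) { return -12; }

static const struct qdisc_ops_sketch noqueue_sketch = { "noqueue", noqueue_init };

/* Try the preferred default qdisc; fall back to noqueue if its init fails. */
static const struct qdisc_ops_sketch *
attach_default(const struct qdisc_ops_sketch *preferred)
{
	if (preferred && preferred->init && preferred->init() == 0)
		return preferred;

	/* Falling back keeps the interface transmitting instead of dropping. */
	noqueue_sketch.init();
	return &noqueue_sketch;
}
```

The sketch does not model the V2 note: in the real patch the device is not marked as permanently queue-less, so a later attempt can retry the configured default.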
2020-05-04 | net_sched: sch_skbprio: add message validation to skbprio_change() | Eric Dumazet | 1 | -0/+3
Do not assume the attribute has the right size. Fixes: aea5f654e6b7 ("net/sched: add skbprio scheduler") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 | net_sched: sch_fq: perform a prefetch() earlier | Eric Dumazet | 1 | -1/+1
The prefetch() done in fq_dequeue() can be done a bit earlier after the refactoring of the code done in the prior patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 | net_sched: sch_fq: do not call fq_peek() twice per packet | Eric Dumazet | 1 | -18/+16
This refactors the code to not call fq_peek() from fq_dequeue_head() since the caller can provide the skb. Also rename fq_dequeue_head() to fq_dequeue_skb() because 'head' is a bit vague, given the skb could come from t_root rb-tree. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 | net_sched: sch_fq: use bulk freeing in fq_gc() | Eric Dumazet | 1 | -7/+11
fq_gc() already builds a small array of pointers, so using kmem_cache_free_bulk() needs very little change. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 | net_sched: sch_fq: change fq_flow size/layout | Eric Dumazet | 1 | -2/+7
sizeof(struct fq_flow) is 112 bytes on 64bit arches. This means that half of them use two cache lines, but 50% use three cache lines. This patch adds cache line alignment, and makes sure that only the first cache line is touched by fq_enqueue(), which is more expensive than fq_dequeue() in general. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
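The "half use two cache lines, half use three" claim can be checked with a little arithmetic. The sketch below assumes 64-byte cache lines and objects packed back to back from a cache-line-aligned start (an assumption about the allocator layout, not something the commit states); the function names are made up for this illustration.

```c
#include <stddef.h>

#define CACHE_LINE 64u   /* assumed cache line size in bytes */

/* Number of cache lines touched by an object of `size` bytes at `offset`. */
static unsigned int lines_spanned(size_t offset, size_t size)
{
	size_t first = offset / CACHE_LINE;
	size_t last = (offset + size - 1) / CACHE_LINE;

	return (unsigned int)(last - first + 1);
}

/* Among `n` back-to-back objects, count those spanning exactly `want` lines. */
static unsigned int count_spanning(size_t size, unsigned int n, unsigned int want)
{
	unsigned int i, hits = 0;

	for (i = 0; i < n; i++)
		if (lines_spanned((size_t)i * size, size) == want)
			hits++;
	return hits;
}
```

For 112-byte objects the start offsets modulo 64 cycle through 0, 48, 32, 16, so in every group of four objects two span two lines and two span three. Once the size is padded to 128 bytes by cache line alignment, every object spans exactly two lines.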
2020-05-04 | net_sched: sch_fq: avoid touching f->next from fq_gc() | Eric Dumazet | 1 | -8/+13
A significant amount of cpu cycles is spent in fq_gc(). When fq_gc() does its lookup in the rb-tree, it needs the following fields from struct fq_flow: f->sk (lookup key in the rb-tree), f->fq_node (anchor in the rb-tree), f->next (used to determine if the flow is detached), f->age (used to determine if the flow is a candidate for gc). This unfortunately spans two cache lines (assuming 64 bytes cache lines). We can avoid using f->next if we use the low order bit of f->{age|tail}. This low order bit is 0 if f->tail points to an sk_buff. We set the low order bit to 1 if the union contains a jiffies value. Combined with the following patch, this makes sure we only need to bring into cpu caches one cache line per flow. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
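The low-order-bit trick can be sketched outside the kernel as follows. This is a minimal illustration, not the sch_fq code: `struct pkt`, `struct flow_sketch` and the helper names are hypothetical, and the sketch assumes that object pointers are at least 2-byte aligned and that reading the pointer through the integer member of the union yields its address bits, which holds on mainstream platforms.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for sk_buff; its alignment keeps bit 0 of a pointer 0. */
struct pkt {
	int len;
};

struct flow_sketch {
	union {
		struct pkt *tail;   /* last queued packet while the flow is attached */
		uintptr_t   age;    /* timestamp with bit 0 set while detached */
	};
};

/* Mark the flow detached: store a timestamp and force the low bit to 1. */
static void flow_set_detached(struct flow_sketch *f, uintptr_t now)
{
	f->age = now | 1u;
}

/* Attach a packet: an aligned pointer always has its low bit clear. */
static void flow_set_tail(struct flow_sketch *f, struct pkt *p)
{
	f->tail = p;
}

/* One load from the union is enough to tell the two states apart. */
static bool flow_is_detached(const struct flow_sketch *f)
{
	return (f->age & 1u) != 0;
}
```

The cost of the trick is one bit of timestamp resolution; the benefit is that the detached check no longer needs a separate field in another cache line.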
2020-05-02 | net: schedule: add action gate offloading | Po Liu | 1 | -0/+33
Add the gate action to the flow action entry. Add the gate parameters to tc_setup_flow_action(), queueing them to the entries of the flow_action_entry array provided to the driver. Signed-off-by: Po Liu <Po.Liu@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-02 | net: qos: introduce a gate control flow action | Po Liu | 3 | -0/+649
Introduce an ingress frame gate control flow action. The tc gate action works like this: assume there is a gate that lets specified ingress frames pass during some time slots and drops them during others. A tc filter selects the ingress frames, and the tc gate action specifies in which time slots those frames are passed to the device and in which they are dropped. The tc gate action provides an entry list that tells how long the gate stays open and how long it stays closed. The gate action also assigns a start time that tells when the entry list starts. The driver then repeats the gate entry list cyclically. For the software simulation, the gate action requires the user to assign a clock type.

Below is a setting example in user space. The tc filter matches a stream whose source IP address is 192.168.0.20, and the gate action owns two time slots. The first lasts 200ms with the gate open, letting frames pass; the second lasts 100ms with the gate closed, dropping frames. When the ingress frames exceed a total of 8000000 bytes, the excess frames are dropped in that 200000000ns time slot.

> tc qdisc add dev eth0 ingress
> tc filter add dev eth0 parent ffff: protocol ip \
        flower src_ip 192.168.0.20 \
        action gate index 2 clockid CLOCK_TAI \
        sched-entry open 200000000 -1 8000000 \
        sched-entry close 100000000 -1 -1
> tc chain del dev eth0 ingress chain 0

"sched-entry" follows the taprio naming style. The gate state is "open" or "close", followed by the period in nanoseconds. The next item is the internal priority value, which selects the ingress queue the frames are put in; "-1" means wildcard. The last value is optional and specifies the maximum number of MSDU octets permitted to pass the gate during the specified time interval. If base-time is not set, it defaults to 0, and as a result the start time is ((N + 1) * cycletime), the earliest such time in the future.

The example below filters a stream whose destination MAC address is 10:00:80:00:00:00 and whose IP protocol is ICMP, followed by the gate action. The gate action runs with a single close time slot, which means the gate always stays closed. The total cycle time is 200000000ns. The start time is calculated as:

        1357000000000 + (N + 1) * cycletime

The first such value that lies in the future is the start time. The cycletime is 200000000ns in this case.

> tc filter add dev eth0 parent ffff: protocol ip \
        flower skip_hw ip_proto icmp dst_mac 10:00:80:00:00:00 \
        action gate index 12 base-time 1357000000000 \
        sched-entry close 200000000 -1 -1 \
        clockid CLOCK_TAI

Signed-off-by: Po Liu <Po.Liu@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
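The start-time rule in this message, base-time + (N + 1) * cycletime with the smallest N that lands in the future, can be written as a short C function. This is a sketch of the arithmetic only; `gate_start_time()` is a name invented here, times are plain nanosecond integers, and `cycle_ns` is assumed to be positive.

```c
#include <stdint.h>

/*
 * Return the first cycle boundary strictly after now_ns, counting cycles of
 * cycle_ns from base_ns. If base_ns itself is still in the future, it is
 * used as the start time directly. Assumes cycle_ns > 0.
 */
static int64_t gate_start_time(int64_t base_ns, int64_t cycle_ns, int64_t now_ns)
{
	int64_t n;

	if (base_ns > now_ns)
		return base_ns;

	n = (now_ns - base_ns) / cycle_ns;      /* whole cycles already elapsed */
	return base_ns + (n + 1) * cycle_ns;    /* next boundary in the future */
}
```

For the second example, with base-time 1357000000000 and a 200000000ns cycle, a current time 450000000ns past the base gives N = 2 and a start time of 1357600000000. With base-time left at 0, the result reduces to (N + 1) * cycletime, as the message states.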
2020-05-01 | netlink: remove type-unsafe validation_data pointer | Johannes Berg | 2 | -15/+7
In the netlink policy, we currently have a void *validation_data that's pointing to different things: * a u32 value for bitfield32, * the netlink policy for nested/nested array * the string for NLA_REJECT Remove the pointer and place appropriate type-safe items in the union instead. While at it, completely dissolve the pointer for the bitfield32 case and just put the value there directly. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 | sch_sfq: validate silly quantum values | Eric Dumazet | 1 | -0/+9
syzbot managed to set up sfq so that q->scaled_quantum was zero, triggering an infinite loop in sfq_dequeue() More generally, we must only accept quantum between 1 and 2^18 - 7, meaning scaled_quantum must be in [1, 0x7FFF] range. Otherwise, we also could have a loop in sfq_dequeue() if scaled_quantum happens to be 0x8000, since slot->allot could indefinitely switch between 0 and 0x8000. Fixes: eeaeb068f139 ("sch_sfq: allow big packets and be fair") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+0251e883fe39e7a0cb0a@syzkaller.appspotmail.com Cc: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 | sch_choke: avoid potential panic in choke_reset() | Eric Dumazet | 1 | -1/+2
If choke_init() could not allocate q->tab, we would crash later in choke_reset(). BUG: KASAN: null-ptr-deref in memset include/linux/string.h:366 [inline] BUG: KASAN: null-ptr-deref in choke_reset+0x208/0x340 net/sched/sch_choke.c:326 Write of size 8 at addr 0000000000000000 by task syz-executor822/7022 CPU: 1 PID: 7022 Comm: syz-executor822 Not tainted 5.7.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 __kasan_report.cold+0x5/0x4d mm/kasan/report.c:515 kasan_report+0x33/0x50 mm/kasan/common.c:625 check_memory_region_inline mm/kasan/generic.c:187 [inline] check_memory_region+0x141/0x190 mm/kasan/generic.c:193 memset+0x20/0x40 mm/kasan/common.c:85 memset include/linux/string.h:366 [inline] choke_reset+0x208/0x340 net/sched/sch_choke.c:326 qdisc_reset+0x6b/0x520 net/sched/sch_generic.c:910 dev_deactivate_queue.constprop.0+0x13c/0x240 net/sched/sch_generic.c:1138 netdev_for_each_tx_queue include/linux/netdevice.h:2197 [inline] dev_deactivate_many+0xe2/0xba0 net/sched/sch_generic.c:1195 dev_deactivate+0xf8/0x1c0 net/sched/sch_generic.c:1233 qdisc_graft+0xd25/0x1120 net/sched/sch_api.c:1051 tc_modify_qdisc+0xbab/0x1a00 net/sched/sch_api.c:1670 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5454 netlink_rcv_skb+0x15a/0x410 net/netlink/af_netlink.c:2469 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline] netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6bf/0x7e0 net/socket.c:2362 ___sys_sendmsg+0x100/0x170 net/socket.c:2416 __sys_sendmsg+0xec/0x1b0 net/socket.c:2449 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295 Fixes: 77e62da6e60c ("sch_choke: drop all packets in queue during reset") Signed-off-by: Eric Dumazet 
<edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 | fq_codel: fix TCA_FQ_CODEL_DROP_BATCH_SIZE sanity checks | Eric Dumazet | 1 | -1/+1
My intent was to not let users set a zero drop_batch_size, it seems I once again messed with min()/max(). Fixes: 9d18562a2278 ("fq_codel: add batch ability to fq_codel_drop()") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
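The min()/max() mix-up is easy to reproduce in isolation. The sketch below is a guess at the shape of the mistake based only on the message (a lower bound of 1 on a user-supplied value); the helper names are hypothetical and the functions are not the fq_codel code.

```c
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }
static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Mistaken lower bound: min() lets 0 through and caps everything else at 1. */
static uint32_t batch_size_wrong(uint32_t requested)
{
	return min_u32(1u, requested);
}

/* Intended lower bound: max() raises 0 to 1 and keeps larger values. */
static uint32_t batch_size_right(uint32_t requested)
{
	return max_u32(1u, requested);
}
```

A useful way to remember it: enforcing a minimum uses max(), and enforcing a maximum uses min().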
2020-04-26 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | David S. Miller | 1 | -3/+4
Simple overlapping changes to linux/vermagic.h Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25 | net: sched: report ndo_setup_tc failures via extack | Jesper Dangaard Brouer | 1 | -1/+4
Help end-users of the 'tc' command to see if the driver's ndo_setup_tc function call fails. Troubleshooting when this happens is non-trivial (see full process here[1]), and results in net_device getting assigned the 'qdisc noop', which will drop all TX packets on the interface. [1]: https://github.com/xdp-project/xdp-project/blob/master/areas/arm64/board_nxp_ls1088/nxp-board04-troubleshoot-qdisc.org Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-24 | net: sched : Remove unnecessary cast in kfree | Xu Wang | 1 | -1/+1
Remove unnecessary casts in the argument to kfree. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23 | sched: etf: do not assume all sockets are full blown | Eric Dumazet | 1 | -3/+4
skb->sk does not always point to a full blown socket, we need to use sk_fullsock() before accessing fields which only make sense on full socket. BUG: KASAN: use-after-free in report_sock_error+0x286/0x300 net/sched/sch_etf.c:141 Read of size 1 at addr ffff88805eb9b245 by task syz-executor.5/9630 CPU: 1 PID: 9630 Comm: syz-executor.5 Not tainted 5.7.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd3/0x315 mm/kasan/report.c:382 __kasan_report.cold+0x35/0x4d mm/kasan/report.c:511 kasan_report+0x33/0x50 mm/kasan/common.c:625 report_sock_error+0x286/0x300 net/sched/sch_etf.c:141 etf_enqueue_timesortedlist+0x389/0x740 net/sched/sch_etf.c:170 __dev_xmit_skb net/core/dev.c:3710 [inline] __dev_queue_xmit+0x154a/0x30a0 net/core/dev.c:4021 neigh_hh_output include/net/neighbour.h:499 [inline] neigh_output include/net/neighbour.h:508 [inline] ip6_finish_output2+0xfb5/0x25b0 net/ipv6/ip6_output.c:117 __ip6_finish_output+0x442/0xab0 net/ipv6/ip6_output.c:143 ip6_finish_output+0x34/0x1f0 net/ipv6/ip6_output.c:153 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x239/0x810 net/ipv6/ip6_output.c:176 dst_output include/net/dst.h:435 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] NF_HOOK include/linux/netfilter.h:301 [inline] ip6_xmit+0xe1a/0x2090 net/ipv6/ip6_output.c:280 tcp_v6_send_synack+0x4e7/0x960 net/ipv6/tcp_ipv6.c:521 tcp_rtx_synack+0x10d/0x1a0 net/ipv4/tcp_output.c:3916 inet_rtx_syn_ack net/ipv4/inet_connection_sock.c:669 [inline] reqsk_timer_handler+0x4c2/0xb40 net/ipv4/inet_connection_sock.c:763 call_timer_fn+0x1ac/0x780 kernel/time/timer.c:1405 expire_timers kernel/time/timer.c:1450 [inline] __run_timers kernel/time/timer.c:1774 [inline] __run_timers kernel/time/timer.c:1741 [inline] run_timer_softirq+0x623/0x1600 kernel/time/timer.c:1787 
__do_softirq+0x26c/0x9f7 kernel/softirq.c:292 invoke_softirq kernel/softirq.c:373 [inline] irq_exit+0x192/0x1d0 kernel/softirq.c:413 exiting_irq arch/x86/include/asm/apic.h:546 [inline] smp_apic_timer_interrupt+0x19e/0x600 arch/x86/kernel/apic/apic.c:1140 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:829 </IRQ> RIP: 0010:des_encrypt+0x157/0x9c0 lib/crypto/des.c:792 Code: 85 22 06 00 00 41 31 dc 41 8b 4d 04 44 89 e2 41 83 e4 3f 4a 8d 3c a5 60 72 72 88 81 e2 3f 3f 3f 3f 48 89 f8 48 c1 e8 03 31 d9 <0f> b6 34 28 48 89 f8 c1 c9 04 83 e0 07 83 c0 03 40 38 f0 7c 09 40 RSP: 0018:ffffc90003b5f6c0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13 RAX: 1ffffffff10e4e55 RBX: 00000000d2f846d0 RCX: 00000000d2f846d0 RDX: 0000000012380612 RSI: ffffffff839863ca RDI: ffffffff887272a8 RBP: dffffc0000000000 R08: ffff888091d0a380 R09: 0000000000800081 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000012 R13: ffff8880a8ae8078 R14: 00000000c545c93e R15: 0000000000000006 cipher_crypt_one crypto/cipher.c:75 [inline] crypto_cipher_encrypt_one+0x124/0x210 crypto/cipher.c:82 crypto_cbcmac_digest_update+0x1b5/0x250 crypto/ccm.c:830 crypto_shash_update+0xc4/0x120 crypto/shash.c:119 shash_ahash_update+0xa3/0x110 crypto/shash.c:246 crypto_ahash_update include/crypto/hash.h:547 [inline] hash_sendmsg+0x518/0xad0 crypto/algif_hash.c:102 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x308/0x7e0 net/socket.c:2362 ___sys_sendmsg+0x100/0x170 net/socket.c:2416 __sys_sendmmsg+0x195/0x480 net/socket.c:2506 __do_sys_sendmmsg net/socket.c:2535 [inline] __se_sys_sendmmsg net/socket.c:2532 [inline] __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2532 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x45c829 Code: 0d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b6 fb ff c3 66 2e 0f 1f 
84 00 00 00 00 RSP: 002b:00007f6d9528ec78 EFLAGS: 00000246 ORIG_RAX: 0000000000000133 RAX: ffffffffffffffda RBX: 00000000004fc080 RCX: 000000000045c829 RDX: 0000000000000001 RSI: 0000000020002640 RDI: 0000000000000004 RBP: 000000000078bf00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 00000000000008d7 R14: 00000000004cb7aa R15: 00007f6d9528f6d4 Fixes: 4b15c7075352 ("net/sched: Make etf report drops on error_queue") Fixes: 25db26a91364 ("net/sched: Introduce the ETF Qdisc") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com> Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 | net/sched: act_ct: update nf_conn_acct for act_ct SW offload in flowtable | wenxu | 1 | -0/+2
When act_ct is software-offloaded in the flowtable, the counter of the conntrack entry is never updated. So update the nf_conn_acct counter in the act_ct flowtable software offload. Signed-off-by: wenxu <wenxu@ucloud.cn> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-08 | net: sched: Fix setting last executed chain on skb extension | Paul Blakey | 1 | -0/+1
After driver sets the missed chain on the tc skb extension it is consumed (deleted) by tc_classify_ingress and tc jumps to that chain. If tc now misses on this chain (either no match, or no goto action), then last executed chain remains 0, and the skb extension is not re-added, and the next datapath (ovs) will start from 0. Fix that by setting last executed chain to the chain read from the skb extension, so if there is a miss, we set it back. Fixes: af699626ee26 ("net: sched: Support specifying a starting chain via tc skb ext") Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-04 | net_sched: fix a missing refcnt in tcindex_init() | Cong Wang | 1 | -0/+1
The initial refcnt of struct tcindex_data should be 1, it is clear that I forgot to set it to 1 in tcindex_init(). This leads to a dec-after-zero warning. Reported-by: syzbot+8325e509a1bf83ec741d@syzkaller.appspotmail.com Fixes: 304e024216a8 ("net_sched: add a temporary refcnt for struct tcindex_data") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-01 | net_sched: add a temporary refcnt for struct tcindex_data | Cong Wang | 1 | -6/+38
Although we intentionally use an ordered workqueue for all tc filter works, the ordering is not guaranteed by RCU work, given that tcf_queue_work() is essentially a call_rcu(). This problem is demonstrated by Thomas: CPU 0: tcf_queue_work() tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work); -> Migration to CPU 1 CPU 1: tcf_queue_work(&p->rwork, tcindex_destroy_work); so the 2nd work could be queued before the 1st one, which leads to a free-after-free. Enforcing this order in RCU work is hard as it requires changing RCU code too. Fortunately we can work around this problem in the tcindex filter by taking a temporary refcnt; we only refcnt it right before we begin to destroy it. This simplifies the code a lot, as a full refcnt requires many more changes in tcindex_set_parms(). Reported-by: syzbot+46f513c3033d592409d2@syzkaller.appspotmail.com Fixes: 3d210534cc93 ("net_sched: fix a race condition in tcindex_destroy()") Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-31 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next | David S. Miller | 1 | -0/+3
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-31 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next | David S. Miller | 1 | -1/+1
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for net-next: 1) Add support to specify a stateful expression in set definitions, this allows users to specify e.g. counters per set elements. 2) Flowtable software counter support. 3) Flowtable hardware offload counter support, from wenxu. 3) Parallelize flowtable hardware offload requests, from Paul Blakey. This includes a patch to add one work entry per offload command. 4) Several patches to rework nf_queue refcount handling, from Florian Westphal. 4) A few fixes for the flowtable tunnel offload: Fix crash if tunneling information is missing and set up indirect flow block as TC_SETUP_FT, patch from wenxu. 5) Stricter netlink attribute sanity check on filters, from Romain Bellan and Florent Fourcot. 5) Annotations to make sparse happy, from Jules Irenge. 6) Improve icmp errors in debugging information, from Haishuang Yan. 7) Fix warning in IPVS icmp error debugging, from Haishuang Yan. 8) Fix endianness issue in tcp extension header, from Sergey Marinkevich. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 | bpf: Add socket assign support | Joe Stringer | 1 | -0/+3
Add support for TPROXY via a new bpf helper, bpf_sk_assign(). This helper requires the BPF program to discover the socket via a call to bpf_sk*_lookup_*(), then pass this socket to the new helper. The helper takes its own reference to the socket in addition to any existing reference that may or may not currently be obtained for the duration of BPF processing. For the destination socket to receive the traffic, the traffic must be routed towards that socket via local route. The simplest example route is below, but in practice you may want to route traffic more narrowly (eg by CIDR): $ ip route add local default dev lo This patch avoids trying to introduce an extra bit into the skb->sk, as that would require more invasive changes to all code interacting with the socket to ensure that the bit is handled correctly, such as all error-handling cases along the path from the helper in BPF through to the orphan path in the input. Instead, we opt to use the destructor variable to switch on the prefetch of the socket. Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
2020-03-30 | net: sched: expose HW stats types per action used by drivers | Jiri Pirko | 3 | -2/+11
It may be up to the driver (in case ANY HW stats is passed) to select which type of HW stats it is going to use. Add an infrastructure to expose this information to the user. $ tc filter add dev enp3s0np1 ingress proto ip handle 1 pref 1 flower dst_ip 192.168.1.1 action drop $ tc -s filter show dev enp3s0np1 ingress filter protocol ip pref 1 flower chain 0 filter protocol ip pref 1 flower chain 0 handle 0x1 eth_type ipv4 dst_ip 192.168.1.1 in_hw in_hw_count 2 action order 1: gact action drop random type none pass val 0 index 1 ref 1 bind 1 installed 10 sec used 10 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 used_hw_stats immediate <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 | net: introduce nla_put_bitfield32() helper and use it | Jiri Pirko | 2 | -21/+10
Introduce a helper that takes a value and a selector, packs them into a struct, and puts the struct into a netlink message. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 | netfilter: flowtable: Fix incorrect tc_setup_type type | wenxu | 1 | -1/+1
The indirect block setup should use TC_SETUP_FT as the type instead of TC_SETUP_BLOCK. Adjust existing users of the indirect flow block infrastructure. Fixes: b5140a36da78 ("netfilter: flowtable: add indr block setup support") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27 | cls_flower: Add extack support for flags key | Guillaume Nault | 1 | -4/+7
Pass extack down to fl_set_key_flags() and set message on error. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 cls_flower: Add extack support for src and dst port range options Guillaume Nault 1 -8/+18
Pass extack down to fl_set_key_port_range() and set message on error.

Both the min and max ports would qualify as invalid attributes here. Report the min one as invalid, as it's probably what makes the most sense from a user point of view.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 cls_flower: Add extack support for mpls options Guillaume Nault 1 -5/+18
Pass extack down to fl_set_key_mpls() and set message on error.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 sched: act_pedit: Implement stats_update callback Petr Machata 1 -0/+11
Implement this callback in order to get the offloaded stats added to the kernel stats.

Reported-by: Alexander Petrovskiy <alexpe@mellanox.com>
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 sched: act_skbedit: Implement stats_update callback Petr Machata 1 -0/+11
Implement this callback in order to get the offloaded stats added to the kernel stats.

Reported-by: Alexander Petrovskiy <alexpe@mellanox.com>
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net David S. Miller 5 -8/+19
Overlapping header include additions in macsec.c
A bug fix in 'net' overlapping with the removal of 'version' string in ena_netdev.c
Overlapping test additions in selftests Makefile
Overlapping PCI ID table adjustments in iwlwifi driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-25 net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} build Pablo Neira Ayuso 1 -4/+2
net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’:
net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’
  pkt->skb->tc_redirected = 1;
          ^~
net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’
  pkt->skb->tc_from_ingress = 1;
          ^~

To avoid a direct dependency on tc actions from netfilter, wrap the redirect bits in CONFIG_NET_REDIRECT and move the helpers to include/linux/skbuff.h. Turn on this toggle from the ifb driver, the only existing client of these bits in the tree.

This patch adds skb_set_redirected(), which sets the redirected bit on the skbuff, records whether the packet was redirected from ingress, and resets the timestamp (the timestamp reset was originally missing in the netfilter bugfix).

Fixes: bcfabee1afd99484 ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress")
Reported-by: noreply@ellerman.id.au
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-25 net: cbs: Fix software cbs to consider packet sending time Zh-yuan Ye 1 -1/+11
Currently the software CBS does not consider the packet sending time when depleting the credits. It caused the throughput to be

  Idleslope[kbps] * (Port transmit rate[kbps] / |Sendslope[kbps]|)

where

  Idleslope * (Port transmit rate / (Idleslope + |Sendslope|)) = Idleslope

is expected. In order to fix the issue above, this patch takes the time when the packet sending completes into account by moving the anchor time variable "last" ahead to the send completion time upon transmission and adding wait when the next dequeue request comes before the send completion time of the previous packet.

changelog:
V2->V3:
 - remove unnecessary whitespace cleanup
 - add the checks if port_rate is 0 before division
V1->V2:
 - combine variable "send_completed" into "last"
 - add the comment for estimate of the packet sending

Fixes: 585d763af09c ("net/sched: Introduce Credit Based Shaper (CBS) qdisc")
Signed-off-by: Zh-yuan Ye <ye.zh-yuan@socionext.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
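To make the size of the error concrete, the two throughput expressions from this commit message can be evaluated side by side. The sketch below is only arithmetic on those formulas; the function names and the sample rates (a 1 Gbit/s port with a 200 Mbit/s idleslope) are illustrative assumptions, and the relation sendslope = idleslope - port rate is the usual credit-based shaper definition, which is what makes the expected expression reduce to the idleslope.

```python
def sendslope(idleslope_kbps, port_rate_kbps):
    """Usual CBS relation (assumed here): sendslope = idleslope - port rate."""
    return idleslope_kbps - port_rate_kbps


def throughput_before_fix(idleslope_kbps, port_rate_kbps):
    """Throughput the commit message reports for the unfixed software CBS."""
    # Undefined when idleslope equals the port rate (sendslope would be 0).
    return idleslope_kbps * port_rate_kbps / abs(
        sendslope(idleslope_kbps, port_rate_kbps))


def throughput_expected(idleslope_kbps, port_rate_kbps):
    """Expected throughput; the denominator equals the port rate."""
    return idleslope_kbps * port_rate_kbps / (
        idleslope_kbps + abs(sendslope(idleslope_kbps, port_rate_kbps)))


# Hypothetical sample values, in kbps.
port_rate = 1_000_000
idleslope = 200_000

before = throughput_before_fix(idleslope, port_rate)
expected = throughput_expected(idleslope, port_rate)
print(before, expected)
```

With these sample values the unfixed formula yields 250000 kbps against an expected 200000 kbps, an overshoot of 25 percent; the overshoot grows as the idleslope approaches the port rate.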
2020-03-24 net: sched: rename more stats_types Jakub Kicinski 2 -25/+23
Commit 53eca1f3479f ("net: rename flow_action_hw_stats_types* -> flow_action_hw_stats*") renamed just the flow action types and helpers. For consistency rename variables, enums, struct members and UAPI too (note that this UAPI was not in any official release, yet).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-20 net: tc_skbedit: Make the skbedit priority offloadable Petr Machata 1 -0/+3
The skbedit action "priority" is used for adjusting SKB priority. Allow drivers to offload the action by introducing two new skbedit getters and a new flow action, and initializing appropriately in tc_setup_flow_action().

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 net: sched: Fix hw_stats_type setting in pedit loop Petr Machata 1 -1/+1
In the commit referenced below, hw_stats_type of an entry is set for every entry that corresponds to a pedit action. However, the assignment is only done after the entry pointer is bumped, and therefore could overwrite memory outside of the entries array.

The reason for this positioning may have been that the current entry's hw_stats_type is already set above, before the action-type dispatch. However, if there are no more actions, the assignment is wrong. And if there are, the next round of the for_each_action loop will make the assignment before the action-type dispatch anyway.

Therefore fix this issue by simply reordering the two lines.

Fixes: 74522e7baae2 ("net: sched: set the hw_stats_type in pedit loop")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
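The ordering problem described here can be modelled in a few lines. The sketch below is a simplified stand-in, not the kernel code: the entries array is a Python list sized to the number of pedit keys, the "entry pointer" is an index, and all function and field names are hypothetical. Where the C code would silently write past the array, the Python model raises IndexError, which makes the off-by-one visible.

```python
def fill_entries_buggy(num_keys, hw_stats_type):
    """Assign hw_stats_type AFTER bumping the entry index (the bug)."""
    entries = [{} for _ in range(num_keys)]
    i = 0
    for _ in range(num_keys):
        entries[i]["id"] = "mangle"
        i += 1                                         # bump the "pointer"
        entries[i]["hw_stats_type"] = hw_stats_type    # past the end on last key
    return entries


def fill_entries_fixed(num_keys, hw_stats_type):
    """Assign hw_stats_type BEFORE bumping the entry index (the fix)."""
    entries = [{} for _ in range(num_keys)]
    i = 0
    for _ in range(num_keys):
        entries[i]["id"] = "mangle"
        entries[i]["hw_stats_type"] = hw_stats_type    # current entry
        i += 1                                         # then bump
    return entries


fixed = fill_entries_fixed(3, "immediate")
print(fixed)

try:
    fill_entries_buggy(3, "immediate")
except IndexError as exc:
    print("buggy ordering wrote out of bounds:", exc)
```

Swapping the two statements is the whole fix in this model, mirroring the one-line change reported for the commit.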
2020-03-19 net/sched: act_ct: Fix leak of ct zone template on replace Paul Blakey 1 -1/+1
Currently, on replace, the previous action instance params is swapped with a newly allocated params. The old params is only freed (via kfree_rcu), without releasing the allocated ct zone template related to it.

Call tcf_ct_params_free (via call_rcu) for the old params, so it will release it.

Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-18 net_sched: sch_fq: enable use of hrtimer slack Eric Dumazet 1 -4/+17
Add a new attribute to control the fq qdisc hrtimer slack. Default is set to 10 usec.

When/if packets are throttled, fq sets up an hrtimer that can lead to one interrupt per packet in the throttled queue. By using a timer slack, we allow better use of timer interrupts, by giving them a chance to call multiple timer callbacks at each hardware interrupt.

Also, giving a slack allows FQ to dequeue batches of packets instead of a single one, thus increasing xmit_more efficiency.

This has no negative effect on the rate a TCP flow can sustain, since each TCP flow maintains its own precise vtime (tp->tcp_wstamp_ns).

v2: added strict netlink checking (as feedback from Jakub Kicinski)

Tested: 1000 concurrent flows all using paced packets. 1,000,000 packets sent per second.

Before the patch:

$ vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b swpd     free  buff   cache si so  bi  bo      in    cs us sy id wa st
 0  0    0 60726784 23628 3485992  0  0 138   1     977   535  0 12 87  0  0
 0  0    0 60714700 23628 3485628  0  0   0   0 1568827 26462  0 22 78  0  0
 1  0    0 60716012 23628 3485656  0  0   0   0 1570034 26216  0 22 78  0  0
 0  0    0 60722420 23628 3485492  0  0   0   0 1567230 26424  0 22 78  0  0
 0  0    0 60727484 23628 3485556  0  0   0   0 1568220 26200  0 22 78  0  0
 2  0    0 60718900 23628 3485380  0  0   0  40 1564721 26630  0 22 78  0  0
 2  0    0 60718096 23628 3485332  0  0   0   0 1562593 26432  0 22 78  0  0
 0  0    0 60719608 23628 3485064  0  0   0   0 1563806 26238  0 22 78  0  0
 1  0    0 60722876 23628 3485236  0  0   0 130 1565874 26566  0 22 78  0  0
 1  0    0 60722752 23628 3484908  0  0   0   0 1567646 26247  0 22 78  0  0

After the patch, with a slack of 10 usec, we can see a reduction of interrupts per second, and a small decrease of reported cpu usage.

$ vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b swpd     free  buff   cache si so  bi  bo      in    cs us sy id wa st
 1  0    0 60722564 23628 3484728  0  0 133   1     696   545  0 13 87  0  0
 1  0    0 60722568 23628 3484824  0  0   0   0  977278 25469  0 20 80  0  0
 0  0    0 60716396 23628 3484764  0  0   0   0  979997 25326  0 20 80  0  0
 0  0    0 60713844 23628 3484960  0  0   0   0  981394 25249  0 20 80  0  0
 2  0    0 60720468 23628 3484916  0  0   0   0  982860 25062  0 20 80  0  0
 1  0    0 60721236 23628 3484856  0  0   0   0  982867 25100  0 20 80  0  0
 1  0    0 60722400 23628 3484456  0  0   0   8  982698 25303  0 20 80  0  0
 0  0    0 60715396 23628 3484428  0  0   0   0  981777 25176  0 20 80  0  0
 0  0    0 60716520 23628 3486544  0  0   0  36  978965 27857  0 21 79  0  0
 0  0    0 60719592 23628 3486516  0  0   0  22  977318 25106  0 20 80  0  0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
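The coalescing effect of a timer slack can be illustrated with a toy model. The sketch below assumes a deliberately idealized scheme: each pending wake-up may be served any time between its deadline and its deadline plus the slack, and one interrupt is taken at the latest allowed moment for the earliest pending wake-up, serving everything already due. The function name and the sample deadlines are hypothetical, and the model is not meant to reproduce the measured vmstat figures above.

```python
def count_interrupts(deadlines_ns, slack_ns):
    """Count interrupts needed to serve all deadlines under the toy model."""
    pending = sorted(deadlines_ns)
    interrupts = 0
    i = 0
    while i < len(pending):
        # Fire as late as the earliest pending deadline allows.
        fire_at = pending[i] + slack_ns
        interrupts += 1
        # One interrupt serves every wake-up that is due by then.
        while i < len(pending) and pending[i] <= fire_at:
            i += 1
    return interrupts


# Hypothetical load: 100 wake-ups spaced 1 usec apart.
deadlines = [n * 1_000 for n in range(100)]

no_slack = count_interrupts(deadlines, 0)          # one interrupt per wake-up
with_slack = count_interrupts(deadlines, 10_000)   # 10 usec slack
print(no_slack, with_slack)
```

In this model a 10 usec slack lets each interrupt serve up to eleven wake-ups, so 100 wake-ups need 10 interrupts instead of 100. The real reduction reported in the commit is smaller, which is expected since actual wake-up times are not evenly spaced and other interrupt sources are present.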