From 91705c61b52029ab5da67a15a23eef08667bf40e Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Tue, 23 Jul 2013 14:51:47 +0200 Subject: net: sctp: trivial: update mailing list address The SCTP mailing list address to send patches or questions to is linux-sctp@vger.kernel.org and not lksctp-developers@lists.sourceforge.net anymore. Therefore, update all occurrences. Signed-off-by: Daniel Borkmann Acked-by: Neil Horman Acked-by: Vlad Yasevich Signed-off-by: David S. Miller --- Documentation/networking/sctp.txt | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/sctp.txt b/Documentation/networking/sctp.txt index 0c790a76910e..97b810ca9082 100644 --- a/Documentation/networking/sctp.txt +++ b/Documentation/networking/sctp.txt @@ -19,7 +19,6 @@ of SCTP that is RFC 2960 compliant and provides an programming interface referred to as the UDP-style API of the Sockets Extensions for SCTP, as proposed in IETF Internet-Drafts. - Caveats: -lksctp can be built as statically or as a module. However, be aware that @@ -33,6 +32,4 @@ For more information, please visit the lksctp project website: http://www.sf.net/projects/lksctp Or contact the lksctp developers through the mailing list: - - - + -- cgit v1.2.3 From c9bee3b7fdecb0c1d070c7b54113b3bdfb9a3d36 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Mon, 22 Jul 2013 20:27:07 -0700 Subject: tcp: TCP_NOTSENT_LOWAT socket option The idea of this patch is to add an optional limit on the number of unsent bytes in TCP sockets, to reduce usage of kernel memory. A TCP receiver might announce a big window, and TCP sender autotuning might allow a large number of bytes in the write queue, but this has little performance impact if a large part of this buffering is wasted: the write queue needs to be large only to deal with a large BDP, not necessarily to cope with scheduling delays (incoming ACKs make room for the application to queue more bytes). For most workloads, using a value of 128 KB or less gives applications enough time to react to POLLOUT events (or to be awakened in a blocking sendmsg()). This patch adds two ways to set the limit: 1) Per socket option TCP_NOTSENT_LOWAT 2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets not using the TCP_NOTSENT_LOWAT socket option (or setting a zero value). The default value is UINT_MAX (0xFFFFFFFF), meaning the sysctl has no effect. This changes poll()/select()/epoll() to report POLLOUT only if the number of unsent bytes is below tp->notsent_lowat. Note this might increase the number of sendmsg()/sendfile() calls when using non-blocking sockets, and increase the number of context switches for blocking sockets.
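For illustration only (this sketch is not part of the original patch; set_notsent_lowat() is a made-up helper, and the fallback define mirrors the value the patch adds to include/uapi/linux/tcp.h), the per-socket knob can be exercised from userspace roughly like this:

/* Cap unsent bytes at 128 KB on a single TCP socket; error handling trimmed. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25	/* from include/uapi/linux/tcp.h in this patch */
#endif

static int set_notsent_lowat(int fd, unsigned int bytes)
{
	/* poll()/select()/epoll() will then report POLLOUT only while
	 * the number of unsent bytes stays below 'bytes'
	 */
	return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
			  &bytes, sizeof(bytes));
}

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0 || set_notsent_lowat(fd, 128 * 1024) < 0)
		perror("TCP_NOTSENT_LOWAT");

	/* sockets that do not set the option fall back to the sysctl:
	 *   echo 131072 > /proc/sys/net/ipv4/tcp_notsent_lowat
	 */
	return 0;
}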
Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is defined as: Specify the minimum number of bytes in the buffer until the socket layer will pass the data to the protocol) Tested: netperf sessions, and watching /proc/net/protocols "memory" column for TCP With 200 concurrent netperf -t TCP_STREAM sessions, the amount of kernel memory used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458) lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols TCPv6 1880 2 45458 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y TCP 1696 508 45458 no 208 yes kernel y y y y y y y y y y y y y n y y y y y lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols TCPv6 1880 2 20567 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y TCP 1696 508 20567 no 208 yes kernel y y y y y y y y y y y y y n y y y y y Using 128KB has no bad effect on the throughput or CPU usage of a single flow, although there is an increase in context switches. A bonus is that we hold the socket lock for a shorter amount of time, which should improve latencies of ACK processing. lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3 OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf. Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand Size Size Size (sec) Util Util Util Util Demand Demand Units Final Final % Method % Method 1651584 6291456 16384 20.00 17447.90 10^6bits/s 3.13 S -1.00 U 0.353 -1.000 usec/KB Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3': 412,514 context-switches 200.034645535 seconds time elapsed lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3 OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf. Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand Size Size Size (sec) Util Util Util Util Demand Demand Units Final Final % Method % Method 1593240 6291456 16384 20.00 17321.16 10^6bits/s 3.35 S -1.00 U 0.381 -1.000 usec/KB Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3': 2,675,818 context-switches 200.029651391 seconds time elapsed Signed-off-by: Eric Dumazet Cc: Neal Cardwell Cc: Yuchung Cheng Acked-By: Yuchung Cheng Signed-off-by: David S.
Miller --- Documentation/networking/ip-sysctl.txt | 13 +++++++++++++ include/linux/tcp.h | 1 + include/net/sock.h | 19 +++++++++++++------ include/net/tcp.h | 14 ++++++++++++++ include/uapi/linux/tcp.h | 1 + net/ipv4/sysctl_net_ipv4.c | 7 +++++++ net/ipv4/tcp.c | 7 +++++++ net/ipv4/tcp_ipv4.c | 1 + net/ipv4/tcp_output.c | 3 +++ net/ipv6/tcp_ipv6.c | 1 + 10 files changed, 61 insertions(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 10742902146f..53cea9bcb14c 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -516,6 +516,19 @@ tcp_wmem - vector of 3 INTEGERs: min, default, max this value is ignored. Default: between 64K and 4MB, depending on RAM size. +tcp_notsent_lowat - UNSIGNED INTEGER + A TCP socket can control the amount of unsent bytes in its write queue, + thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll() + reports POLLOUT events if the amount of unsent bytes is below a per + socket value, and if the write queue is not full. sendmsg() will + also not add new buffers if the limit is hit. + + This global variable controls the amount of unsent data for + sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change + to the global variable has immediate effect. + + Default: UINT_MAX (0xFFFFFFFF) + tcp_workaround_signed_windows - BOOLEAN If set, assume no receipt of a window scaling option means the remote TCP is broken and treats the window as a signed quantity. diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 472120b4fac5..9640803a17a7 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -238,6 +238,7 @@ struct tcp_sock { u32 rcv_wnd; /* Current receiver window */ u32 write_seq; /* Tail(+1) of data held in tcp send buffer */ + u32 notsent_lowat; /* TCP_NOTSENT_LOWAT */ u32 pushed_seq; /* Last pushed seq, required to talk to windows */ u32 lost_out; /* Lost packets */ u32 sacked_out; /* SACK'd packets */ diff --git a/include/net/sock.h b/include/net/sock.h index d0b5fdee50a2..b9f2b095b1ab 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -746,11 +746,6 @@ static inline int sk_stream_wspace(const struct sock *sk) extern void sk_stream_write_space(struct sock *sk); -static inline bool sk_stream_memory_free(const struct sock *sk) -{ - return sk->sk_wmem_queued < sk->sk_sndbuf; -} - /* OOB backlog add */ static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb) { @@ -950,6 +945,7 @@ struct proto { unsigned int inuse_idx; #endif + bool (*stream_memory_free)(const struct sock *sk); /* Memory pressure */ void (*enter_memory_pressure)(struct sock *sk); atomic_long_t *memory_allocated; /* Current allocated memory. */ @@ -1088,11 +1084,22 @@ static inline struct cg_proto *parent_cg_proto(struct proto *proto, } #endif +static inline bool sk_stream_memory_free(const struct sock *sk) +{ + if (sk->sk_wmem_queued >= sk->sk_sndbuf) + return false; + + return sk->sk_prot->stream_memory_free ? 
+ sk->sk_prot->stream_memory_free(sk) : true; +} + static inline bool sk_stream_is_writeable(const struct sock *sk) { - return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk); + return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) && + sk_stream_memory_free(sk); } + static inline bool sk_has_memory_pressure(const struct sock *sk) { return sk->sk_prot->memory_pressure != NULL; diff --git a/include/net/tcp.h b/include/net/tcp.h index c5868471abae..18fc999dae3c 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -284,6 +284,7 @@ extern int sysctl_tcp_thin_dupack; extern int sysctl_tcp_early_retrans; extern int sysctl_tcp_limit_output_bytes; extern int sysctl_tcp_challenge_ack_limit; +extern unsigned int sysctl_tcp_notsent_lowat; extern atomic_long_t tcp_memory_allocated; extern struct percpu_counter tcp_sockets_allocated; @@ -1539,6 +1540,19 @@ extern int tcp_gro_complete(struct sk_buff *skb); extern void __tcp_v4_send_check(struct sk_buff *skb, __be32 saddr, __be32 daddr); +static inline u32 tcp_notsent_lowat(const struct tcp_sock *tp) +{ + return tp->notsent_lowat ?: sysctl_tcp_notsent_lowat; +} + +static inline bool tcp_stream_memory_free(const struct sock *sk) +{ + const struct tcp_sock *tp = tcp_sk(sk); + u32 notsent_bytes = tp->write_seq - tp->snd_nxt; + + return notsent_bytes < tcp_notsent_lowat(tp); +} + #ifdef CONFIG_PROC_FS extern int tcp4_proc_init(void); extern void tcp4_proc_exit(void); diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index 8d776ebc4829..377f1e59411d 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -111,6 +111,7 @@ enum { #define TCP_REPAIR_OPTIONS 22 #define TCP_FASTOPEN 23 /* Enable FastOpen on listeners */ #define TCP_TIMESTAMP 24 +#define TCP_NOTSENT_LOWAT 25 /* limit number of unsent bytes in write queue */ struct tcp_repair_opt { __u32 opt_code; diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index b2c123c44d69..69ed203802da 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -554,6 +554,13 @@ static struct ctl_table ipv4_table[] = { .proc_handler = proc_dointvec_minmax, .extra1 = &one, }, + { + .procname = "tcp_notsent_lowat", + .data = &sysctl_tcp_notsent_lowat, + .maxlen = sizeof(sysctl_tcp_notsent_lowat), + .mode = 0644, + .proc_handler = proc_dointvec, + }, { .procname = "tcp_rmem", .data = &sysctl_tcp_rmem, diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 5eca9060bb8e..c27e81392398 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2631,6 +2631,10 @@ static int do_tcp_setsockopt(struct sock *sk, int level, else tp->tsoffset = val - tcp_time_stamp; break; + case TCP_NOTSENT_LOWAT: + tp->notsent_lowat = val; + sk->sk_write_space(sk); + break; default: err = -ENOPROTOOPT; break; @@ -2847,6 +2851,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level, case TCP_TIMESTAMP: val = tcp_time_stamp + tp->tsoffset; break; + case TCP_NOTSENT_LOWAT: + val = tp->notsent_lowat; + break; default: return -ENOPROTOOPT; } diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 2e3f129df0eb..2a5d5c469d17 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2800,6 +2800,7 @@ struct proto tcp_prot = { .unhash = inet_unhash, .get_port = inet_csk_get_port, .enter_memory_pressure = tcp_enter_memory_pressure, + .stream_memory_free = tcp_stream_memory_free, .sockets_allocated = &tcp_sockets_allocated, .orphan_count = &tcp_orphan_count, .memory_allocated = &tcp_memory_allocated, diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 
92fde8d1aa82..884efff5b531 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -65,6 +65,9 @@ int sysctl_tcp_base_mss __read_mostly = TCP_BASE_MSS; /* By default, RFC2861 behavior. */ int sysctl_tcp_slow_start_after_idle __read_mostly = 1; +unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX; +EXPORT_SYMBOL(sysctl_tcp_notsent_lowat); + static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, int push_one, gfp_t gfp); diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 80fe69ef2188..b792e870686b 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1924,6 +1924,7 @@ struct proto tcpv6_prot = { .unhash = inet_unhash, .get_port = inet_csk_get_port, .enter_memory_pressure = tcp_enter_memory_pressure, + .stream_memory_free = tcp_stream_memory_free, .sockets_allocated = &tcp_sockets_allocated, .memory_allocated = &tcp_memory_allocated, .memory_pressure = &tcp_memory_pressure, -- cgit v1.2.3 From 5ad37d5deee1ff7150a2d0602370101de158ad86 Mon Sep 17 00:00:00 2001 From: Hannes Frederic Sowa Date: Fri, 26 Jul 2013 17:43:23 +0200 Subject: tcp: add tcp_syncookies mode to allow unconditional generation of syncookies | If you want to test what effect syncookies have on your | network connections you can set this knob to 2 to enable | unconditional generation of syncookies. Original idea and first implementation by Eric Dumazet. Cc: Florian Westphal Cc: David Miller Signed-off-by: Eric Dumazet Signed-off-by: Hannes Frederic Sowa Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 4 ++++ net/ipv4/tcp_ipv4.c | 5 +++-- net/ipv6/tcp_ipv6.c | 3 ++- 3 files changed, 9 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 53cea9bcb14c..36be26b2ef7a 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -440,6 +440,10 @@ tcp_syncookies - BOOLEAN SYN flood warnings in logs not being really flooded, your server is seriously misconfigured. + If you want to test what effect syncookies have on your + network connections you can set this knob to 2 to enable + unconditional generation of syncookies. + tcp_fastopen - INTEGER Enable TCP Fast Open feature (draft-ietf-tcpm-fastopen) to send data in the opening SYN packet. To use this feature, the client application diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 2a5d5c469d17..280efe5f19c1 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -890,7 +890,7 @@ bool tcp_syn_flood_action(struct sock *sk, NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPREQQFULLDROP); lopt = inet_csk(sk)->icsk_accept_queue.listen_opt; - if (!lopt->synflood_warned) { + if (!lopt->synflood_warned && sysctl_tcp_syncookies != 2) { lopt->synflood_warned = 1; pr_info("%s: Possible SYN flooding on port %d. %s. Check SNMP counters.\n", proto, ntohs(tcp_hdr(skb)->dest), msg); @@ -1462,7 +1462,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb) * limitations, they conserve resources and peer is * evidently real one. 
*/ - if (inet_csk_reqsk_queue_is_full(sk) && !isn) { + if ((sysctl_tcp_syncookies == 2 || + inet_csk_reqsk_queue_is_full(sk)) && !isn) { want_cookie = tcp_syn_flood_action(sk, skb, "TCP"); if (!want_cookie) goto drop; diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index b792e870686b..38c196ca6011 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -963,7 +963,8 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb) if (!ipv6_unicast_destination(skb)) goto drop; - if (inet_csk_reqsk_queue_is_full(sk) && !isn) { + if ((sysctl_tcp_syncookies == 2 || + inet_csk_reqsk_queue_is_full(sk)) && !isn) { want_cookie = tcp_syn_flood_action(sk, skb, "TCPv6"); if (!want_cookie) goto drop; -- cgit v1.2.3 From fd158d79d33d3c8b693e3e2d8c0e3068d529c2dc Mon Sep 17 00:00:00 2001 From: Florian Westphal Date: Mon, 29 Jul 2013 15:41:52 +0200 Subject: netfilter: tproxy: remove nf_tproxy_core, keep tw sk assigned to skb The module was "permanent", due to the special tproxy skb->destructor. Nowadays we have tcp early demux and its sock_edemux destructor in the networking core which can be used instead. Thanks to the early demux changes, the input path now also handles "skb->sk is tw socket" correctly, so this no longer needs the special handling introduced with commit d503b30bd648b3cb4e5f50b65d27e389960cc6d9 (netfilter: tproxy: do not assign timewait sockets to skb->sk). Thus: - move assign_sock function to where it's needed - don't prevent timewait sockets from being assigned to the skb - remove nf_tproxy_core. Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- Documentation/networking/tproxy.txt | 5 ++- include/net/netfilter/nf_tproxy_core.h | 4 --- net/netfilter/Kconfig | 22 +++--------- net/netfilter/Makefile | 3 -- net/netfilter/nf_tproxy_core.c | 62 ---------------------------------- net/netfilter/xt_TPROXY.c | 9 +++++ 6 files changed, 16 insertions(+), 89 deletions(-) delete mode 100644 net/netfilter/nf_tproxy_core.c (limited to 'Documentation') diff --git a/Documentation/networking/tproxy.txt b/Documentation/networking/tproxy.txt index 7b5996d9357e..ec11429e1d42 100644 --- a/Documentation/networking/tproxy.txt +++ b/Documentation/networking/tproxy.txt @@ -2,9 +2,8 @@ Transparent proxy support ========================= This feature adds Linux 2.2-like transparent proxy support to current kernels. -To use it, enable NETFILTER_TPROXY, the socket match and the TPROXY target in -your kernel config. You will need policy routing too, so be sure to enable that -as well. +To use it, enable the socket match and the TPROXY target in your kernel config. +You will need policy routing too, so be sure to enable that as well. 1. 
Making non-local sockets work diff --git a/include/net/netfilter/nf_tproxy_core.h b/include/net/netfilter/nf_tproxy_core.h index 36d9379d4c4b..975ffa4545a9 100644 --- a/include/net/netfilter/nf_tproxy_core.h +++ b/include/net/netfilter/nf_tproxy_core.h @@ -203,8 +203,4 @@ nf_tproxy_get_sock_v6(struct net *net, const u8 protocol, } #endif -/* assign a socket to the skb -- consumes sk */ -void -nf_tproxy_assign_sock(struct sk_buff *skb, struct sock *sk); - #endif diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig index 56d22cae5906..c45fc1a60e0d 100644 --- a/net/netfilter/Kconfig +++ b/net/netfilter/Kconfig @@ -410,20 +410,6 @@ config NF_NAT_TFTP endif # NF_CONNTRACK -# transparent proxy support -config NETFILTER_TPROXY - tristate "Transparent proxying support" - depends on IP_NF_MANGLE - depends on NETFILTER_ADVANCED - help - This option enables transparent proxying support, that is, - support for handling non-locally bound IPv4 TCP and UDP sockets. - For it to work you will have to configure certain iptables rules - and use policy routing. For more information on how to set it up - see Documentation/networking/tproxy.txt. - - To compile it as a module, choose M here. If unsure, say N. - config NETFILTER_XTABLES tristate "Netfilter Xtables support (required for ip_tables)" default m if NETFILTER_ADVANCED=n @@ -720,10 +706,10 @@ config NETFILTER_XT_TARGET_TEE this clone be rerouted to another nexthop. config NETFILTER_XT_TARGET_TPROXY - tristate '"TPROXY" target support' - depends on NETFILTER_TPROXY + tristate '"TPROXY" target transparent proxying support' depends on NETFILTER_XTABLES depends on NETFILTER_ADVANCED + depends on IP_NF_MANGLE select NF_DEFRAG_IPV4 select NF_DEFRAG_IPV6 if IP6_NF_IPTABLES help @@ -731,6 +717,9 @@ config NETFILTER_XT_TARGET_TPROXY REDIRECT. It can only be used in the mangle table and is useful to redirect traffic to a transparent proxy. It does _not_ depend on Netfilter connection tracking and NAT, unlike REDIRECT. + For it to work you will have to configure certain iptables rules + and use policy routing. For more information on how to set it up + see Documentation/networking/tproxy.txt. To compile it as a module, choose M here. If unsure, say N. @@ -1180,7 +1169,6 @@ config NETFILTER_XT_MATCH_SCTP config NETFILTER_XT_MATCH_SOCKET tristate '"socket" match support' - depends on NETFILTER_TPROXY depends on NETFILTER_XTABLES depends on NETFILTER_ADVANCED depends on !NF_CONNTRACK || NF_CONNTRACK diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile index a1abf87d43bf..ebfa7dc747cd 100644 --- a/net/netfilter/Makefile +++ b/net/netfilter/Makefile @@ -61,9 +61,6 @@ obj-$(CONFIG_NF_NAT_IRC) += nf_nat_irc.o obj-$(CONFIG_NF_NAT_SIP) += nf_nat_sip.o obj-$(CONFIG_NF_NAT_TFTP) += nf_nat_tftp.o -# transparent proxy support -obj-$(CONFIG_NETFILTER_TPROXY) += nf_tproxy_core.o - # generic X tables obj-$(CONFIG_NETFILTER_XTABLES) += x_tables.o xt_tcpudp.o diff --git a/net/netfilter/nf_tproxy_core.c b/net/netfilter/nf_tproxy_core.c deleted file mode 100644 index 474d621cbc2e..000000000000 --- a/net/netfilter/nf_tproxy_core.c +++ /dev/null @@ -1,62 +0,0 @@ -/* - * Transparent proxy support for Linux/iptables - * - * Copyright (c) 2006-2007 BalaBit IT Ltd. - * Author: Balazs Scheidler, Krisztian Kovacs - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. 
- * - */ - -#include - -#include -#include -#include -#include -#include - - -static void -nf_tproxy_destructor(struct sk_buff *skb) -{ - struct sock *sk = skb->sk; - - skb->sk = NULL; - skb->destructor = NULL; - - if (sk) - sock_put(sk); -} - -/* consumes sk */ -void -nf_tproxy_assign_sock(struct sk_buff *skb, struct sock *sk) -{ - /* assigning tw sockets complicates things; most - * skb->sk->X checks would have to test sk->sk_state first */ - if (sk->sk_state == TCP_TIME_WAIT) { - inet_twsk_put(inet_twsk(sk)); - return; - } - - skb_orphan(skb); - skb->sk = sk; - skb->destructor = nf_tproxy_destructor; -} -EXPORT_SYMBOL_GPL(nf_tproxy_assign_sock); - -static int __init nf_tproxy_init(void) -{ - pr_info("NF_TPROXY: Transparent proxy support initialized, version 4.1.0\n"); - pr_info("NF_TPROXY: Copyright (c) 2006-2007 BalaBit IT Ltd.\n"); - return 0; -} - -module_init(nf_tproxy_init); - -MODULE_LICENSE("GPL"); -MODULE_AUTHOR("Krisztian Kovacs"); -MODULE_DESCRIPTION("Transparent proxy support core routines"); diff --git a/net/netfilter/xt_TPROXY.c b/net/netfilter/xt_TPROXY.c index d7f195388f66..17c40deafa4f 100644 --- a/net/netfilter/xt_TPROXY.c +++ b/net/netfilter/xt_TPROXY.c @@ -117,6 +117,15 @@ tproxy_handle_time_wait4(struct sk_buff *skb, __be32 laddr, __be16 lport, return sk; } +/* assign a socket to the skb -- consumes sk */ +static void +nf_tproxy_assign_sock(struct sk_buff *skb, struct sock *sk) +{ + skb_orphan(skb); + skb->sk = sk; + skb->destructor = sock_edemux; +} + static unsigned int tproxy_tg4(struct sk_buff *skb, __be32 laddr, __be16 lport, u_int32_t mark_mask, u_int32_t mark_value) -- cgit v1.2.3 From 49dfe76261e427f5521b40321fbc3d947350165d Mon Sep 17 00:00:00 2001 From: Paul Gortmaker Date: Wed, 31 Jul 2013 15:16:20 -0400 Subject: Documentation: add networking/netdev-FAQ.txt A collection of expectations and operational details about how networking development takes place in the context of the netdev mailing list. The content is meant to capture specific items that are unique to netdev workflow, and not re-document generic linux expectations that are already captured elsewhere. This was originally proposed[1] as a regular posting mailing list FAQ, but it probably is more universally accessible here in tree. [1] https://lwn.net/Articles/559211/ Signed-off-by: Paul Gortmaker Signed-off-by: David S. Miller --- Documentation/networking/00-INDEX | 2 + Documentation/networking/netdev-FAQ.txt | 224 ++++++++++++++++++++++++++++++++ 2 files changed, 226 insertions(+) create mode 100644 Documentation/networking/netdev-FAQ.txt (limited to 'Documentation') diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX index 32dfbd924121..18b64b2b8a68 100644 --- a/Documentation/networking/00-INDEX +++ b/Documentation/networking/00-INDEX @@ -124,6 +124,8 @@ multiqueue.txt - HOWTO for multiqueue network device support. netconsole.txt - The network console module netconsole.ko: configuration and notes. +netdev-FAQ.txt + - FAQ describing how to submit net changes to netdev mailing list. netdev-features.txt - Network interface features API description. netdevices.txt diff --git a/Documentation/networking/netdev-FAQ.txt b/Documentation/networking/netdev-FAQ.txt new file mode 100644 index 000000000000..d9112f01c44a --- /dev/null +++ b/Documentation/networking/netdev-FAQ.txt @@ -0,0 +1,224 @@ + +Information you need to know about netdev +----------------------------------------- + +Q: What is netdev? + +A: It is a mailing list for all network related linux stuff. 
This includes + anything found under net/ (i.e. core code like IPv6) and drivers/net + (i.e. hardware specific drivers) in the linux source tree. + + Note that some subsystems (e.g. wireless drivers) which have a high volume + of traffic have their own specific mailing lists. + + The netdev list is managed (like many other linux mailing lists) through + VGER ( http://vger.kernel.org/ ) and archives can be found below: + + http://marc.info/?l=linux-netdev + http://www.spinics.net/lists/netdev/ + + Aside from subsystems like those mentioned above, all network related linux + development (i.e. RFC, review, comments, etc) takes place on netdev. + +Q: How do the changes posted to netdev make their way into linux? + +A: There are always two trees (git repositories) in play. Both are driven + by David Miller, the main network maintainer. There is the "net" tree, + and the "net-next" tree. As you can probably guess from the names, the + net tree is for fixes to existing code already in the mainline tree from + Linus, and net-next is where the new code goes for the future release. + You can find the trees here: + + http://git.kernel.org/?p=linux/kernel/git/davem/net.git + http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git + +Q: How often do changes from these trees make it to the mainline Linus tree? + +A: To understand this, you need to know a bit of background information + on the cadence of linux development. Each new release starts off with + a two week "merge window" where the main maintainers feed their new + stuff to Linus for merging into the mainline tree. After the two weeks, + the merge window is closed, and it is called/tagged "-rc1". No new + features get mainlined after this -- only fixes to the rc1 content + are expected. After roughly a week of collecting fixes to the rc1 + content, rc2 is released. This repeats on a roughly weekly basis + until rc7 (typically; sometimes rc6 if things are quiet, or rc8 if + things are in a state of churn), and a week after the last vX.Y-rcN + was done, the official "vX.Y" is released. + + Relating that to netdev: At the beginning of the 2 week merge window, + the net-next tree will be closed - no new changes/features. The + accumulated new content of the past ~10 weeks will be passed onto + mainline/Linus via a pull request for vX.Y -- at the same time, + the "net" tree will start accumulating fixes for this pulled content + relating to vX.Y. + + An announcement indicating when net-next has been closed is usually + sent to netdev, but knowing the above, you can predict that in advance. + + IMPORTANT: Do not send new net-next content to netdev during the + period in which the net-next tree is closed. + + Shortly after the two weeks have passed, (and vX.Y-rc1 is released) the + tree for net-next reopens to collect content for the next (vX.Y+1) release. + + If you aren't subscribed to netdev and/or are simply unsure if net-next + has re-opened yet, simply check the net-next git repository link above for + any new networking related commits. + + The "net" tree continues to collect fixes for the vX.Y content, and + is fed back to Linus at regular (~weekly) intervals. Meaning that the + focus for "net" is on stabilization and bugfixes. + + Finally, the vX.Y gets released, and the whole cycle starts over. + +Q: So where are we now in this cycle? + +A: Load the mainline (Linus) page here: + + http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git + + and note the top of the "tags" section. If it is rc1, it is early + in the dev cycle. 
If it was tagged rc7 a week ago, then a release + is probably imminent. + +Q: How do I indicate which tree (net vs. net-next) my patch should be in? + +A: Firstly, think whether you have a bug fix or new "next-like" content. + Then once decided, assuming that you use git, use the prefix flag, i.e. + + git format-patch --subject-prefix='PATCH net-next' start..finish + + Use "net" instead of "net-next" (always lower case) in the above for + bug-fix net content. If you don't use git, then note the only magic in + the above is just the subject text of the outgoing e-mail, and you can + manually change it yourself with whatever MUA you are comfortable with. + +Q: I sent a patch and I'm wondering what happened to it. How can I tell + whether it got merged? + +A: Start by looking at the main patchworks queue for netdev: + + http://patchwork.ozlabs.org/project/netdev/list/ + + The "State" field will tell you exactly where things are at with + your patch. + +Q: The above only says "Under Review". How can I find out more? + +A: Generally speaking, the patches get triaged quickly (in less than 48h). + So be patient. Asking the maintainer for status updates on your + patch is a good way to ensure your patch is ignored or pushed to + the bottom of the priority list. + +Q: How can I tell what patches are queued up for backporting to the + various stable releases? + +A: Normally Greg Kroah-Hartman collects stable commits himself, but + for networking, Dave collects up patches he deems critical for the + networking subsystem, and then hands them off to Greg. + + There is a patchworks queue that you can see here: + http://patchwork.ozlabs.org/bundle/davem/stable/?state=* + + It contains the patches which Dave has selected, but not yet handed + off to Greg. If Greg already has the patch, then it will be here: + http://git.kernel.org/cgit/linux/kernel/git/stable/stable-queue.git + + A quick way to find whether the patch is in this stable-queue is + to simply clone the repo, and then git grep the mainline commit ID, e.g. + + stable-queue$ git grep -l 284041ef21fdf2e + releases/3.0.84/ipv6-fix-possible-crashes-in-ip6_cork_release.patch + releases/3.4.51/ipv6-fix-possible-crashes-in-ip6_cork_release.patch + releases/3.9.8/ipv6-fix-possible-crashes-in-ip6_cork_release.patch + stable/stable-queue$ + +Q: I see a network patch and I think it should be backported to stable. + Should I request it via "stable@vger.kernel.org" like the references in + the kernel's Documentation/stable_kernel_rules.txt file say? + +A: No, not for networking. Check the stable queues as per above 1st to see + if it is already queued. If not, then send a mail to netdev, listing + the upstream commit ID and why you think it should be a stable candidate. + + Before you jump to go do the above, do note that the normal stable rules + in Documentation/stable_kernel_rules.txt still apply. So you need to + explicitly indicate why it is a critical fix and exactly what users are + impacted. In addition, you need to convince yourself that you _really_ + think it has been overlooked, vs. having been considered and rejected. + + Generally speaking, the longer it has had a chance to "soak" in mainline, + the better the odds that it is an OK candidate for stable. So scrambling + to request a commit be added the day after it appears should be avoided. + +Q: I have created a network patch and I think it should be backported to + stable. Should I add a "Cc: stable@vger.kernel.org" like the references + in the kernel's Documentation/ directory say? 
+ +A: No. See above answer. In short, if you think it really belongs in + stable, then ensure you write a decent commit log that describes who + gets impacted by the bugfix and how it manifests itself, and when the + bug was introduced. If you do that properly, then the commit will + get handled appropriately and most likely get put in the patchworks + stable queue if it really warrants it. + + If you think there is some valid information relating to it being in + stable that does _not_ belong in the commit log, then use the three + dash marker line as described in Documentation/SubmittingPatches to + temporarily embed that information into the patch that you send. + +Q: Someone said that the comment style and coding convention is different + for the networking content. Is this true? + +A: Yes, in a largely trivial way. Instead of this: + + /* + * foobar blah blah blah + * another line of text + */ + + it is requested that you make it look like this: + + /* foobar blah blah blah + * another line of text + */ + +Q: I am working in existing code that has the former comment style and not the + latter. Should I submit new code in the former style or the latter? + +A: Make it the latter style, so that eventually all code in the domain of + netdev is of this format. + +Q: I found a bug that might have possible security implications or similar. + Should I mail the main netdev maintainer off-list? + +A: No. The current netdev maintainer has consistently requested that people + use the mailing lists and not reach out directly. If you aren't OK with + that, then perhaps consider mailing "security@kernel.org" or reading about + http://oss-security.openwall.org/wiki/mailing-lists/distros + as possible alternative mechanisms. + +Q: What level of testing is expected before I submit my change? + +A: If your changes are against net-next, the expectation is that you + have tested by layering your changes on top of net-next. Ideally you + will have done run-time testing specific to your change, but at a + minimum, your changes should survive an "allyesconfig" and an + "allmodconfig" build without new warnings or failures. + +Q: Any other tips to help ensure my net/net-next patch gets OK'd? + +A: Attention to detail. Re-read your own work as if you were the + reviewer. You can start with using checkpatch.pl, perhaps even + with the "--strict" flag. But do not be mindlessly robotic in + doing so. If your change is a bug-fix, make sure your commit log + indicates the end-user visible symptom, the underlying reason as + to why it happens, and then if necessary, explain why the fix proposed + is the best way to get things done. Don't mangle whitespace, and as + is common, don't mis-indent function arguments that span multiple lines. + If it is your 1st patch, mail it to yourself so you can test apply + it to an unpatched tree to confirm infrastructure didn't mangle it. + + Finally, go back and read Documentation/SubmittingPatches to be + sure you are not repeating some common mistake documented there. -- cgit v1.2.3 From cda5f98e36576596b9230483ec52bff3cc97eb21 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Tue, 6 Aug 2013 21:18:12 +0200 Subject: net: sctp: convert sctp_checksum_disable module param into sctp sysctl Get rid of the last module parameter for SCTP and make this configurable via sysctl for SCTP like all the rest of SCTP's configuration knobs. Signed-off-by: Daniel Borkmann Signed-off-by: David S. 
Miller --- Documentation/networking/ip-sysctl.txt | 8 ++++++++ include/net/netns/sctp.h | 3 +++ include/net/sctp/structs.h | 5 ----- net/sctp/input.c | 4 ++-- net/sctp/output.c | 5 ++++- net/sctp/protocol.c | 5 +++-- net/sctp/sysctl.c | 10 +++++++++- 7 files changed, 29 insertions(+), 11 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 36be26b2ef7a..1b27b0b8687f 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -1507,6 +1507,14 @@ sack_timeout - INTEGER Default: 200 +checksum_disable - BOOLEAN + Disable SCTP checksum computing and verification for debugging purpose. + + 1: Disable checksumming + 0: Enable checksumming + + Default: 0 + valid_cookie_life - INTEGER The default lifetime of the SCTP cookie (in milliseconds). The cookie is used during association establishment. diff --git a/include/net/netns/sctp.h b/include/net/netns/sctp.h index 3573a81815ad..ebfdf1e7d402 100644 --- a/include/net/netns/sctp.h +++ b/include/net/netns/sctp.h @@ -129,6 +129,9 @@ struct netns_sctp { /* Threshold for autoclose timeout, in seconds. */ unsigned long max_autoclose; + + /* Flag to disable SCTP checksumming. */ + int checksum_disable; }; #endif /* __NETNS_SCTP_H__ */ diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h index d9c93a77b1a9..06ebeaaaa9aa 100644 --- a/include/net/sctp/structs.h +++ b/include/net/sctp/structs.h @@ -141,10 +141,6 @@ extern struct sctp_globals { /* This is the sctp port control hash. */ int port_hashsize; struct sctp_bind_hashbucket *port_hashtable; - - /* Flag to indicate whether computing and verifying checksum - * is disabled. */ - bool checksum_disable; } sctp_globals; #define sctp_max_instreams (sctp_globals.max_instreams) @@ -156,7 +152,6 @@ extern struct sctp_globals { #define sctp_assoc_hashtable (sctp_globals.assoc_hashtable) #define sctp_port_hashsize (sctp_globals.port_hashsize) #define sctp_port_hashtable (sctp_globals.port_hashtable) -#define sctp_checksum_disable (sctp_globals.checksum_disable) /* SCTP Socket type: UDP or TCP style. */ typedef enum { diff --git a/net/sctp/input.c b/net/sctp/input.c index fa91aff02388..b9a25e18bb2b 100644 --- a/net/sctp/input.c +++ b/net/sctp/input.c @@ -140,8 +140,8 @@ int sctp_rcv(struct sk_buff *skb) __skb_pull(skb, skb_transport_offset(skb)); if (skb->len < sizeof(struct sctphdr)) goto discard_it; - if (!sctp_checksum_disable && !skb_csum_unnecessary(skb) && - sctp_rcv_checksum(net, skb) < 0) + if (!net->sctp.checksum_disable && !skb_csum_unnecessary(skb) && + sctp_rcv_checksum(net, skb) < 0) goto discard_it; skb_pull(skb, sizeof(struct sctphdr)); diff --git a/net/sctp/output.c b/net/sctp/output.c index 5a55c55d71ad..cdb5f4914e17 100644 --- a/net/sctp/output.c +++ b/net/sctp/output.c @@ -395,6 +395,7 @@ int sctp_packet_transmit(struct sctp_packet *packet) int padding; /* How much padding do we need? */ __u8 has_data = 0; struct dst_entry *dst = tp->dst; + struct net *net; unsigned char *auth = NULL; /* pointer to auth in skb data */ __u32 cksum_buf_len = sizeof(struct sctphdr); @@ -541,7 +542,9 @@ int sctp_packet_transmit(struct sctp_packet *packet) * Note: Adler-32 is no longer applicable, as has been replaced * by CRC32-C as described in . 
*/ - if (!sctp_checksum_disable) { + net = dev_net(dst->dev); + + if (!net->sctp.checksum_disable) { if (!(dst->dev->features & NETIF_F_SCTP_CSUM)) { __u32 crc32 = sctp_start_cksum((__u8 *)sh, cksum_buf_len); diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c index b52ec2510101..a570a6365f87 100644 --- a/net/sctp/protocol.c +++ b/net/sctp/protocol.c @@ -1193,6 +1193,9 @@ static int __net_init sctp_net_init(struct net *net) /* Whether Cookie Preservative is enabled(1) or not(0) */ net->sctp.cookie_preserve_enable = 1; + /* Whether SCTP checksumming is disabled(1) or not(0) */ + net->sctp.checksum_disable = 0; + /* Default sctp sockets to use md5 as their hmac alg */ #if defined (CONFIG_SCTP_DEFAULT_COOKIE_HMAC_MD5) net->sctp.sctp_hmac_alg = "md5"; @@ -1549,6 +1552,4 @@ MODULE_ALIAS("net-pf-" __stringify(PF_INET) "-proto-132"); MODULE_ALIAS("net-pf-" __stringify(PF_INET6) "-proto-132"); MODULE_AUTHOR("Linux Kernel SCTP developers "); MODULE_DESCRIPTION("Support for the SCTP protocol (RFC2960)"); -module_param_named(no_checksums, sctp_checksum_disable, bool, 0644); -MODULE_PARM_DESC(no_checksums, "Disable checksums computing and verification"); MODULE_LICENSE("GPL"); diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c index 190674702b20..754809a7183d 100644 --- a/net/sctp/sysctl.c +++ b/net/sctp/sysctl.c @@ -296,7 +296,15 @@ static struct ctl_table sctp_net_table[] = { .extra1 = &max_autoclose_min, .extra2 = &max_autoclose_max, }, - + { + .procname = "checksum_disable", + .data = &init_net.sctp.checksum_disable, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &one, + }, { /* sentinel */ } }; -- cgit v1.2.3 From 71acc0ddd499cc323199fb1ae350ce9ea0744352 Mon Sep 17 00:00:00 2001 From: "David S. Miller" Date: Fri, 9 Aug 2013 13:09:41 -0700 Subject: Revert "net: sctp: convert sctp_checksum_disable module param into sctp sysctl" This reverts commit cda5f98e36576596b9230483ec52bff3cc97eb21. As per Vlad's request. Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 8 -------- include/net/netns/sctp.h | 3 --- include/net/sctp/structs.h | 5 +++++ net/sctp/input.c | 4 ++-- net/sctp/output.c | 5 +---- net/sctp/protocol.c | 5 ++--- net/sctp/sysctl.c | 10 +--------- 7 files changed, 11 insertions(+), 29 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 1b27b0b8687f..36be26b2ef7a 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -1507,14 +1507,6 @@ sack_timeout - INTEGER Default: 200 -checksum_disable - BOOLEAN - Disable SCTP checksum computing and verification for debugging purpose. - - 1: Disable checksumming - 0: Enable checksumming - - Default: 0 - valid_cookie_life - INTEGER The default lifetime of the SCTP cookie (in milliseconds). The cookie is used during association establishment. diff --git a/include/net/netns/sctp.h b/include/net/netns/sctp.h index ebfdf1e7d402..3573a81815ad 100644 --- a/include/net/netns/sctp.h +++ b/include/net/netns/sctp.h @@ -129,9 +129,6 @@ struct netns_sctp { /* Threshold for autoclose timeout, in seconds. */ unsigned long max_autoclose; - - /* Flag to disable SCTP checksumming. 
*/ - int checksum_disable; }; #endif /* __NETNS_SCTP_H__ */ diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h index 78eedf0cd4b3..422db6cc3214 100644 --- a/include/net/sctp/structs.h +++ b/include/net/sctp/structs.h @@ -135,6 +135,10 @@ extern struct sctp_globals { /* This is the sctp port control hash. */ int port_hashsize; struct sctp_bind_hashbucket *port_hashtable; + + /* Flag to indicate whether computing and verifying checksum + * is disabled. */ + bool checksum_disable; } sctp_globals; #define sctp_max_instreams (sctp_globals.max_instreams) @@ -146,6 +150,7 @@ extern struct sctp_globals { #define sctp_assoc_hashtable (sctp_globals.assoc_hashtable) #define sctp_port_hashsize (sctp_globals.port_hashsize) #define sctp_port_hashtable (sctp_globals.port_hashtable) +#define sctp_checksum_disable (sctp_globals.checksum_disable) /* SCTP Socket type: UDP or TCP style. */ typedef enum { diff --git a/net/sctp/input.c b/net/sctp/input.c index 2873e221e187..5f2068679f83 100644 --- a/net/sctp/input.c +++ b/net/sctp/input.c @@ -134,8 +134,8 @@ int sctp_rcv(struct sk_buff *skb) __skb_pull(skb, skb_transport_offset(skb)); if (skb->len < sizeof(struct sctphdr)) goto discard_it; - if (!net->sctp.checksum_disable && !skb_csum_unnecessary(skb) && - sctp_rcv_checksum(net, skb) < 0) + if (!sctp_checksum_disable && !skb_csum_unnecessary(skb) && + sctp_rcv_checksum(net, skb) < 0) goto discard_it; skb_pull(skb, sizeof(struct sctphdr)); diff --git a/net/sctp/output.c b/net/sctp/output.c index e35b84cc1d9f..0ac3a65daccb 100644 --- a/net/sctp/output.c +++ b/net/sctp/output.c @@ -389,7 +389,6 @@ int sctp_packet_transmit(struct sctp_packet *packet) int padding; /* How much padding do we need? */ __u8 has_data = 0; struct dst_entry *dst = tp->dst; - struct net *net; unsigned char *auth = NULL; /* pointer to auth in skb data */ __u32 cksum_buf_len = sizeof(struct sctphdr); @@ -536,9 +535,7 @@ int sctp_packet_transmit(struct sctp_packet *packet) * Note: Adler-32 is no longer applicable, as has been replaced * by CRC32-C as described in . 
*/ - net = dev_net(dst->dev); - - if (!net->sctp.checksum_disable) { + if (!sctp_checksum_disable) { if (!(dst->dev->features & NETIF_F_SCTP_CSUM)) { __u32 crc32 = sctp_start_cksum((__u8 *)sh, cksum_buf_len); diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c index 54482977a48f..5e17092f4ada 100644 --- a/net/sctp/protocol.c +++ b/net/sctp/protocol.c @@ -1187,9 +1187,6 @@ static int __net_init sctp_net_init(struct net *net) /* Whether Cookie Preservative is enabled(1) or not(0) */ net->sctp.cookie_preserve_enable = 1; - /* Whether SCTP checksumming is disabled(1) or not(0) */ - net->sctp.checksum_disable = 0; - /* Default sctp sockets to use md5 as their hmac alg */ #if defined (CONFIG_SCTP_DEFAULT_COOKIE_HMAC_MD5) net->sctp.sctp_hmac_alg = "md5"; @@ -1546,4 +1543,6 @@ MODULE_ALIAS("net-pf-" __stringify(PF_INET) "-proto-132"); MODULE_ALIAS("net-pf-" __stringify(PF_INET6) "-proto-132"); MODULE_AUTHOR("Linux Kernel SCTP developers "); MODULE_DESCRIPTION("Support for the SCTP protocol (RFC2960)"); +module_param_named(no_checksums, sctp_checksum_disable, bool, 0644); +MODULE_PARM_DESC(no_checksums, "Disable checksums computing and verification"); MODULE_LICENSE("GPL"); diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c index 1b1ee769b5bf..6b36561a1b3b 100644 --- a/net/sctp/sysctl.c +++ b/net/sctp/sysctl.c @@ -290,15 +290,7 @@ static struct ctl_table sctp_net_table[] = { .extra1 = &max_autoclose_min, .extra2 = &max_autoclose_max, }, - { - .procname = "checksum_disable", - .data = &init_net.sctp.checksum_disable, - .maxlen = sizeof(int), - .mode = 0644, - .proc_handler = proc_dointvec_minmax, - .extra1 = &zero, - .extra2 = &one, - }, + { /* sentinel */ } }; -- cgit v1.2.3 From 6c821bd9edc9563b34c7920b4a99fe64992de530 Mon Sep 17 00:00:00 2001 From: Jonas Jensen Date: Thu, 8 Aug 2013 13:34:54 +0200 Subject: net: Add MOXA ART SoCs ethernet driver The MOXA UC-711X hardware(s) has an ethernet controller that seems to be developed internally. The IC used is "RTL8201CP". Since there is no public documentation, this driver is mostly the one published by MOXA that has been heavily cleaned up / ported from linux 2.6.9. Signed-off-by: Jonas Jensen Signed-off-by: David S. 
Miller --- .../devicetree/bindings/net/moxa,moxart-mac.txt | 21 + drivers/net/ethernet/Kconfig | 1 + drivers/net/ethernet/Makefile | 1 + drivers/net/ethernet/moxa/Kconfig | 30 ++ drivers/net/ethernet/moxa/Makefile | 5 + drivers/net/ethernet/moxa/moxart_ether.c | 558 +++++++++++++++++++++ drivers/net/ethernet/moxa/moxart_ether.h | 330 ++++++++++++ 7 files changed, 946 insertions(+) create mode 100644 Documentation/devicetree/bindings/net/moxa,moxart-mac.txt create mode 100644 drivers/net/ethernet/moxa/Kconfig create mode 100644 drivers/net/ethernet/moxa/Makefile create mode 100644 drivers/net/ethernet/moxa/moxart_ether.c create mode 100644 drivers/net/ethernet/moxa/moxart_ether.h (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/moxa,moxart-mac.txt b/Documentation/devicetree/bindings/net/moxa,moxart-mac.txt new file mode 100644 index 000000000000..583418b2c127 --- /dev/null +++ b/Documentation/devicetree/bindings/net/moxa,moxart-mac.txt @@ -0,0 +1,21 @@ +MOXA ART Ethernet Controller + +Required properties: + +- compatible : Must be "moxa,moxart-mac" +- reg : Should contain register location and length +- interrupts : Should contain the mac interrupt number + +Example: + + mac0: mac@90900000 { + compatible = "moxa,moxart-mac"; + reg = <0x90900000 0x100>; + interrupts = <25 0>; + }; + + mac1: mac@92000000 { + compatible = "moxa,moxart-mac"; + reg = <0x92000000 0x100>; + interrupts = <27 0>; + }; diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig index 2037080c504d..506b0248c400 100644 --- a/drivers/net/ethernet/Kconfig +++ b/drivers/net/ethernet/Kconfig @@ -90,6 +90,7 @@ source "drivers/net/ethernet/marvell/Kconfig" source "drivers/net/ethernet/mellanox/Kconfig" source "drivers/net/ethernet/micrel/Kconfig" source "drivers/net/ethernet/microchip/Kconfig" +source "drivers/net/ethernet/moxa/Kconfig" source "drivers/net/ethernet/myricom/Kconfig" config FEALNX diff --git a/drivers/net/ethernet/Makefile b/drivers/net/ethernet/Makefile index 390bd0bfaa27..c0b8789952e7 100644 --- a/drivers/net/ethernet/Makefile +++ b/drivers/net/ethernet/Makefile @@ -42,6 +42,7 @@ obj-$(CONFIG_NET_VENDOR_MARVELL) += marvell/ obj-$(CONFIG_NET_VENDOR_MELLANOX) += mellanox/ obj-$(CONFIG_NET_VENDOR_MICREL) += micrel/ obj-$(CONFIG_NET_VENDOR_MICROCHIP) += microchip/ +obj-$(CONFIG_NET_VENDOR_MOXART) += moxa/ obj-$(CONFIG_NET_VENDOR_MYRI) += myricom/ obj-$(CONFIG_FEALNX) += fealnx.o obj-$(CONFIG_NET_VENDOR_NATSEMI) += natsemi/ diff --git a/drivers/net/ethernet/moxa/Kconfig b/drivers/net/ethernet/moxa/Kconfig new file mode 100644 index 000000000000..1731e050fa27 --- /dev/null +++ b/drivers/net/ethernet/moxa/Kconfig @@ -0,0 +1,30 @@ +# +# MOXART device configuration +# + +config NET_VENDOR_MOXART + bool "MOXA ART devices" + default y + depends on (ARM && ARCH_MOXART) + ---help--- + If you have a network (Ethernet) card belonging to this class, say Y + and read the Ethernet-HOWTO, available from + <http://www.tldp.org/docs.html#howto>. + + Note that the answer to this question doesn't directly affect the + kernel: saying N will just cause the configurator to skip all + the questions about MOXA ART devices. If you say Y, you will be asked + for your specific card in the following questions. + +if NET_VENDOR_MOXART + +config ARM_MOXART_ETHER + tristate "MOXART Ethernet support" + depends on ARM && ARCH_MOXART + select NET_CORE + ---help--- + If you wish to compile a kernel for hardware with a MOXA ART SoC and + want to use the internal ethernet, then you should answer Y to this. 
+ + +endif # NET_VENDOR_MOXART diff --git a/drivers/net/ethernet/moxa/Makefile b/drivers/net/ethernet/moxa/Makefile new file mode 100644 index 000000000000..aa3c73e9e952 --- /dev/null +++ b/drivers/net/ethernet/moxa/Makefile @@ -0,0 +1,5 @@ +# +# Makefile for the MOXART network device drivers. +# + +obj-$(CONFIG_ARM_MOXART_ETHER) += moxart_ether.o diff --git a/drivers/net/ethernet/moxa/moxart_ether.c b/drivers/net/ethernet/moxa/moxart_ether.c new file mode 100644 index 000000000000..abd2c545df83 --- /dev/null +++ b/drivers/net/ethernet/moxa/moxart_ether.c @@ -0,0 +1,558 @@ +/* MOXA ART Ethernet (RTL8201CP) driver. + * + * Copyright (C) 2013 Jonas Jensen + * + * Jonas Jensen + * + * Based on code from + * Moxa Technology Co., Ltd. + * + * This file is licensed under the terms of the GNU General Public + * License version 2. This program is licensed "as is" without any + * warranty of any kind, whether express or implied. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "moxart_ether.h" + +static inline void moxart_emac_write(struct net_device *ndev, + unsigned int reg, unsigned long value) +{ + struct moxart_mac_priv_t *priv = netdev_priv(ndev); + + writel(value, priv->base + reg); +} + +static void moxart_update_mac_address(struct net_device *ndev) +{ + moxart_emac_write(ndev, REG_MAC_MS_ADDRESS, + ((ndev->dev_addr[0] << 8) | (ndev->dev_addr[1]))); + moxart_emac_write(ndev, REG_MAC_MS_ADDRESS + 4, + ((ndev->dev_addr[2] << 24) | + (ndev->dev_addr[3] << 16) | + (ndev->dev_addr[4] << 8) | + (ndev->dev_addr[5]))); +} + +static int moxart_set_mac_address(struct net_device *ndev, void *addr) +{ + struct sockaddr *address = addr; + + if (!is_valid_ether_addr(address->sa_data)) + return -EADDRNOTAVAIL; + + memcpy(ndev->dev_addr, address->sa_data, ndev->addr_len); + moxart_update_mac_address(ndev); + + return 0; +} + +static void moxart_mac_free_memory(struct net_device *ndev) +{ + struct moxart_mac_priv_t *priv = netdev_priv(ndev); + int i; + + for (i = 0; i < RX_DESC_NUM; i++) + dma_unmap_single(&ndev->dev, priv->rx_mapping[i], + priv->rx_buf_size, DMA_FROM_DEVICE); + + if (priv->tx_desc_base) + dma_free_coherent(NULL, TX_REG_DESC_SIZE * TX_DESC_NUM, + priv->tx_desc_base, priv->tx_base); + + if (priv->rx_desc_base) + dma_free_coherent(NULL, RX_REG_DESC_SIZE * RX_DESC_NUM, + priv->rx_desc_base, priv->rx_base); + + kfree(priv->tx_buf_base); + kfree(priv->rx_buf_base); +} + +static void moxart_mac_reset(struct net_device *ndev) +{ + struct moxart_mac_priv_t *priv = netdev_priv(ndev); + + writel(SW_RST, priv->base + REG_MAC_CTRL); + while (readl(priv->base + REG_MAC_CTRL) & SW_RST) + mdelay(10); + + writel(0, priv->base + REG_INTERRUPT_MASK); + + priv->reg_maccr = RX_BROADPKT | FULLDUP | CRC_APD | RX_FTL; +} + +static void moxart_mac_enable(struct net_device *ndev) +{ + struct moxart_mac_priv_t *priv = netdev_priv(ndev); + + writel(0x00001010, priv->base + REG_INT_TIMER_CTRL); + writel(0x00000001, priv->base + REG_APOLL_TIMER_CTRL); + writel(0x00000390, priv->base + REG_DMA_BLEN_CTRL); + + priv->reg_imr |= (RPKT_FINISH_M | XPKT_FINISH_M); + writel(priv->reg_imr, priv->base + REG_INTERRUPT_MASK); + + priv->reg_maccr |= (RCV_EN | XMT_EN | RDMA_EN | XDMA_EN); + writel(priv->reg_maccr, priv->base + REG_MAC_CTRL); +} + +static void moxart_mac_setup_desc_ring(struct net_device *ndev) +{ + struct moxart_mac_priv_t *priv = netdev_priv(ndev); + void __iomem *desc; + int i; + + for (i = 0; i 
< TX_DESC_NUM; i++) { + desc = priv->tx_desc_base + i * TX_REG_DESC_SIZE; + memset(desc, 0, TX_REG_DESC_SIZE); + + priv->tx_buf[i] = priv->tx_buf_base + priv->tx_buf_size * i; + } + writel(TX_DESC1_END, desc + TX_REG_OFFSET_DESC1); + + priv->tx_head = 0; + priv->tx_tail = 0; + + for (i = 0; i < RX_DESC_NUM; i++) { + desc = priv->rx_desc_base + i * RX_REG_DESC_SIZE; + memset(desc, 0, RX_REG_DESC_SIZE); + writel(RX_DESC0_DMA_OWN, desc + RX_REG_OFFSET_DESC0); + writel(RX_BUF_SIZE & RX_DESC1_BUF_SIZE_MASK, + desc + RX_REG_OFFSET_DESC1); + + priv->rx_buf[i] = priv->rx_buf_base + priv->rx_buf_size * i; + priv->rx_mapping[i] = dma_map_single(&ndev->dev, + priv->rx_buf[i], + priv->rx_buf_size, + DMA_FROM_DEVICE); + if (dma_mapping_error(&ndev->dev, priv->rx_mapping[i])) + netdev_err(ndev, "DMA mapping error\n"); + + writel(priv->rx_mapping[i], + desc + RX_REG_OFFSET_DESC2 + RX_DESC2_ADDRESS_PHYS); + writel(priv->rx_buf[i], + desc + RX_REG_OFFSET_DESC2 + RX_DESC2_ADDRESS_VIRT); + } + writel(RX_DESC1_END, desc + RX_REG_OFFSET_DESC1); + + priv->rx_head = 0; + + /* reset the MAC controller TX/RX descriptor base address */ + writel(priv->tx_base, priv->base + REG_TXR_BASE_ADDRESS); + writel(priv->rx_base, priv->base + REG_RXR_BASE_ADDRESS); +} + +static int moxart_mac_open(struct net_device *ndev) +{ + struct moxart_mac_priv_t *priv = netdev_priv(ndev); + + if (!is_valid_ether_addr(ndev->dev_addr)) + return -EADDRNOTAVAIL; + + napi_enable(&priv->napi); + + moxart_mac_reset(ndev); + moxart_update_mac_address(ndev); + moxart_mac_setup_desc_ring(ndev); + moxart_mac_enable(ndev); + netif_start_queue(ndev); + + netdev_dbg(ndev, "%s: IMR=0x%x, MACCR=0x%x\n", + __func__, readl(priv->base + REG_INTERRUPT_MASK), + readl(priv->base + REG_MAC_CTRL)); + + return 0; +} + +static int moxart_mac_stop(struct net_device *ndev) +{ + struct moxart_mac_priv_t *priv = netdev_priv(ndev); + + napi_disable(&priv->napi); + + netif_stop_queue(ndev); + + /* disable all interrupts */ + writel(0, priv->base + REG_INTERRUPT_MASK); + + /* disable all functions */ + writel(0, priv->base + REG_MAC_CTRL); + + return 0; +} + +static int moxart_rx_poll(struct napi_struct *napi, int budget) +{ + struct moxart_mac_priv_t *priv = container_of(napi, + struct moxart_mac_priv_t, + napi); + struct net_device *ndev = priv->ndev; + struct sk_buff *skb; + void __iomem *desc; + unsigned int desc0, len; + int rx_head = priv->rx_head; + int rx = 0; + + while (1) { + desc = priv->rx_desc_base + (RX_REG_DESC_SIZE * rx_head); + desc0 = readl(desc + RX_REG_OFFSET_DESC0); + + if (desc0 & RX_DESC0_DMA_OWN) + break; + + if (desc0 & (RX_DESC0_ERR | RX_DESC0_CRC_ERR | RX_DESC0_FTL | + RX_DESC0_RUNT | RX_DESC0_ODD_NB)) { + net_dbg_ratelimited("packet error\n"); + priv->stats.rx_dropped++; + priv->stats.rx_errors++; + continue; + } + + len = desc0 & RX_DESC0_FRAME_LEN_MASK; + + if (len > RX_BUF_SIZE) + len = RX_BUF_SIZE; + + skb = build_skb(priv->rx_buf[rx_head], priv->rx_buf_size); + if (unlikely(!skb)) { + net_dbg_ratelimited("build_skb failed\n"); + priv->stats.rx_dropped++; + priv->stats.rx_errors++; + } + + skb_put(skb, len); + skb->protocol = eth_type_trans(skb, ndev); + napi_gro_receive(&priv->napi, skb); + rx++; + + ndev->last_rx = jiffies; + priv->stats.rx_packets++; + priv->stats.rx_bytes += len; + if (desc0 & RX_DESC0_MULTICAST) + priv->stats.multicast++; + + writel(RX_DESC0_DMA_OWN, desc + RX_REG_OFFSET_DESC0); + + rx_head = RX_NEXT(rx_head); + priv->rx_head = rx_head; + + if (rx >= budget) + break; + } + + if (rx < budget) { + napi_gro_flush(napi, 
false); + __napi_complete(napi); + } + + priv->reg_imr |= RPKT_FINISH_M; + writel(priv->reg_imr, priv->base + REG_INTERRUPT_MASK); + + return rx; +} + +static void moxart_tx_finished(struct net_device *ndev) +{ + struct moxart_mac_priv_t *priv = netdev_priv(ndev); + unsigned tx_head = priv->tx_head; + unsigned tx_tail = priv->tx_tail; + + while (tx_tail != tx_head) { + dma_unmap_single(&ndev->dev, priv->tx_mapping[tx_tail], + priv->tx_len[tx_tail], DMA_TO_DEVICE); + + priv->stats.tx_packets++; + priv->stats.tx_bytes += priv->tx_skb[tx_tail]->len; + + dev_kfree_skb_irq(priv->tx_skb[tx_tail]); + priv->tx_skb[tx_tail] = NULL; + + tx_tail = TX_NEXT(tx_tail); + } + priv->tx_tail = tx_tail; +} + +static irqreturn_t moxart_mac_interrupt(int irq, void *dev_id) +{ + struct net_device *ndev = (struct net_device *) dev_id; + struct moxart_mac_priv_t *priv = netdev_priv(ndev); + unsigned int ists = readl(priv->base + REG_INTERRUPT_STATUS); + + if (ists & XPKT_OK_INT_STS) + moxart_tx_finished(ndev); + + if (ists & RPKT_FINISH) { + if (napi_schedule_prep(&priv->napi)) { + priv->reg_imr &= ~RPKT_FINISH_M; + writel(priv->reg_imr, priv->base + REG_INTERRUPT_MASK); + __napi_schedule(&priv->napi); + } + } + + return IRQ_HANDLED; +} + +static int moxart_mac_start_xmit(struct sk_buff *skb, struct net_device *ndev) +{ + struct moxart_mac_priv_t *priv = netdev_priv(ndev); + void __iomem *desc; + unsigned int len; + unsigned int tx_head = priv->tx_head; + u32 txdes1; + + desc = priv->tx_desc_base + (TX_REG_DESC_SIZE * tx_head); + + spin_lock_irq(&priv->txlock); + if (readl(desc + TX_REG_OFFSET_DESC0) & TX_DESC0_DMA_OWN) { + net_dbg_ratelimited("no TX space for packet\n"); + priv->stats.tx_dropped++; + return NETDEV_TX_BUSY; + } + + len = skb->len > TX_BUF_SIZE ? TX_BUF_SIZE : skb->len; + + priv->tx_mapping[tx_head] = dma_map_single(&ndev->dev, skb->data, + len, DMA_TO_DEVICE); + if (dma_mapping_error(&ndev->dev, priv->tx_mapping[tx_head])) { + netdev_err(ndev, "DMA mapping error\n"); + return NETDEV_TX_BUSY; + } + + priv->tx_len[tx_head] = len; + priv->tx_skb[tx_head] = skb; + + writel(priv->tx_mapping[tx_head], + desc + TX_REG_OFFSET_DESC2 + TX_DESC2_ADDRESS_PHYS); + writel(skb->data, + desc + TX_REG_OFFSET_DESC2 + TX_DESC2_ADDRESS_VIRT); + + if (skb->len < ETH_ZLEN) { + memset(&skb->data[skb->len], + 0, ETH_ZLEN - skb->len); + len = ETH_ZLEN; + } + + txdes1 = readl(desc + TX_REG_OFFSET_DESC1); + txdes1 |= TX_DESC1_LTS | TX_DESC1_FTS; + txdes1 &= ~(TX_DESC1_FIFO_COMPLETE | TX_DESC1_INTR_COMPLETE); + txdes1 |= (len & TX_DESC1_BUF_SIZE_MASK); + writel(txdes1, desc + TX_REG_OFFSET_DESC1); + writel(TX_DESC0_DMA_OWN, desc + TX_REG_OFFSET_DESC0); + + /* start to send packet */ + writel(0xffffffff, priv->base + REG_TX_POLL_DEMAND); + + priv->tx_head = TX_NEXT(tx_head); + + ndev->trans_start = jiffies; + + spin_unlock_irq(&priv->txlock); + + return NETDEV_TX_OK; +} + +static struct net_device_stats *moxart_mac_get_stats(struct net_device *ndev) +{ + struct moxart_mac_priv_t *priv = netdev_priv(ndev); + + return &priv->stats; +} + +static void moxart_mac_setmulticast(struct net_device *ndev) +{ + struct moxart_mac_priv_t *priv = netdev_priv(ndev); + struct netdev_hw_addr *ha; + int crc_val; + + netdev_for_each_mc_addr(ha, ndev) { + crc_val = crc32_le(~0, ha->addr, ETH_ALEN); + crc_val = (crc_val >> 26) & 0x3f; + if (crc_val >= 32) { + writel(readl(priv->base + REG_MCAST_HASH_TABLE1) | + (1UL << (crc_val - 32)), + priv->base + REG_MCAST_HASH_TABLE1); + } else { + writel(readl(priv->base + REG_MCAST_HASH_TABLE0) | + (1UL 
<< crc_val), + priv->base + REG_MCAST_HASH_TABLE0); + } + } +} + +static void moxart_mac_set_rx_mode(struct net_device *ndev) +{ + struct moxart_mac_priv_t *priv = netdev_priv(ndev); + + spin_lock_irq(&priv->txlock); + + (ndev->flags & IFF_PROMISC) ? (priv->reg_maccr |= RCV_ALL) : + (priv->reg_maccr &= ~RCV_ALL); + + (ndev->flags & IFF_ALLMULTI) ? (priv->reg_maccr |= RX_MULTIPKT) : + (priv->reg_maccr &= ~RX_MULTIPKT); + + if ((ndev->flags & IFF_MULTICAST) && netdev_mc_count(ndev)) { + priv->reg_maccr |= HT_MULTI_EN; + moxart_mac_setmulticast(ndev); + } else { + priv->reg_maccr &= ~HT_MULTI_EN; + } + + writel(priv->reg_maccr, priv->base + REG_MAC_CTRL); + + spin_unlock_irq(&priv->txlock); +} + +static struct net_device_ops moxart_netdev_ops = { + .ndo_open = moxart_mac_open, + .ndo_stop = moxart_mac_stop, + .ndo_start_xmit = moxart_mac_start_xmit, + .ndo_get_stats = moxart_mac_get_stats, + .ndo_set_rx_mode = moxart_mac_set_rx_mode, + .ndo_set_mac_address = moxart_set_mac_address, + .ndo_validate_addr = eth_validate_addr, + .ndo_change_mtu = eth_change_mtu, +}; + +static int moxart_mac_probe(struct platform_device *pdev) +{ + struct device *p_dev = &pdev->dev; + struct device_node *node = p_dev->of_node; + struct net_device *ndev; + struct moxart_mac_priv_t *priv; + struct resource *res; + unsigned int irq; + int ret; + + ndev = alloc_etherdev(sizeof(struct moxart_mac_priv_t)); + if (!ndev) + return -ENOMEM; + + irq = irq_of_parse_and_map(node, 0); + if (irq <= 0) { + netdev_err(ndev, "irq_of_parse_and_map failed\n"); + return -EINVAL; + } + + priv = netdev_priv(ndev); + priv->ndev = ndev; + + res = platform_get_resource(pdev, IORESOURCE_MEM, 0); + ndev->base_addr = res->start; + priv->base = devm_ioremap_resource(p_dev, res); + ret = IS_ERR(priv->base); + if (ret) { + dev_err(p_dev, "devm_ioremap_resource failed\n"); + goto init_fail; + } + + spin_lock_init(&priv->txlock); + + priv->tx_buf_size = TX_BUF_SIZE; + priv->rx_buf_size = RX_BUF_SIZE + + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + + priv->tx_desc_base = dma_alloc_coherent(NULL, TX_REG_DESC_SIZE * + TX_DESC_NUM, &priv->tx_base, + GFP_DMA | GFP_KERNEL); + if (priv->tx_desc_base == NULL) + goto init_fail; + + priv->rx_desc_base = dma_alloc_coherent(NULL, RX_REG_DESC_SIZE * + RX_DESC_NUM, &priv->rx_base, + GFP_DMA | GFP_KERNEL); + if (priv->rx_desc_base == NULL) + goto init_fail; + + priv->tx_buf_base = kmalloc(priv->tx_buf_size * TX_DESC_NUM, + GFP_ATOMIC); + if (!priv->tx_buf_base) + goto init_fail; + + priv->rx_buf_base = kmalloc(priv->rx_buf_size * RX_DESC_NUM, + GFP_ATOMIC); + if (!priv->rx_buf_base) + goto init_fail; + + platform_set_drvdata(pdev, ndev); + + ret = devm_request_irq(p_dev, irq, moxart_mac_interrupt, 0, + pdev->name, ndev); + if (ret) { + netdev_err(ndev, "devm_request_irq failed\n"); + goto init_fail; + } + + ether_setup(ndev); + ndev->netdev_ops = &moxart_netdev_ops; + netif_napi_add(ndev, &priv->napi, moxart_rx_poll, RX_DESC_NUM); + ndev->priv_flags |= IFF_UNICAST_FLT; + ndev->irq = irq; + + SET_NETDEV_DEV(ndev, &pdev->dev); + + ret = register_netdev(ndev); + if (ret) { + free_netdev(ndev); + goto init_fail; + } + + netdev_dbg(ndev, "%s: IRQ=%d address=%pM\n", + __func__, ndev->irq, ndev->dev_addr); + + return 0; + +init_fail: + netdev_err(ndev, "init failed\n"); + moxart_mac_free_memory(ndev); + + return ret; +} + +static int moxart_remove(struct platform_device *pdev) +{ + struct net_device *ndev = platform_get_drvdata(pdev); + + unregister_netdev(ndev); + free_irq(ndev->irq, ndev); + 
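	/* moxart_mac_free_memory() unmaps the RX DMA buffers and frees
+	 * the descriptor rings and buffer areas allocated at probe time
+	 */
+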
moxart_mac_free_memory(ndev);
+	platform_set_drvdata(pdev, NULL);
+	free_netdev(ndev);
+
+	return 0;
+}
+
+static const struct of_device_id moxart_mac_match[] = {
+	{ .compatible = "moxa,moxart-mac" },
+	{ }
+};
+
+static struct platform_driver moxart_mac_driver = {
+	.probe	= moxart_mac_probe,
+	.remove	= moxart_remove,
+	.driver	= {
+		.name		= "moxart-ethernet",
+		.owner		= THIS_MODULE,
+		.of_match_table	= moxart_mac_match,
+	},
+};
+module_platform_driver(moxart_mac_driver);
+
+MODULE_DESCRIPTION("MOXART RTL8201CP Ethernet driver");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Jonas Jensen ");
diff --git a/drivers/net/ethernet/moxa/moxart_ether.h b/drivers/net/ethernet/moxa/moxart_ether.h
new file mode 100644
index 000000000000..2be9280d608c
--- /dev/null
+++ b/drivers/net/ethernet/moxa/moxart_ether.h
@@ -0,0 +1,330 @@
+/* MOXA ART Ethernet (RTL8201CP) driver.
+ *
+ * Copyright (C) 2013 Jonas Jensen
+ *
+ * Jonas Jensen
+ *
+ * Based on code from
+ * Moxa Technology Co., Ltd.
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#ifndef _MOXART_ETHERNET_H
+#define _MOXART_ETHERNET_H
+
+#define TX_REG_OFFSET_DESC0	0
+#define TX_REG_OFFSET_DESC1	4
+#define TX_REG_OFFSET_DESC2	8
+#define TX_REG_DESC_SIZE	16
+
+#define RX_REG_OFFSET_DESC0	0
+#define RX_REG_OFFSET_DESC1	4
+#define RX_REG_OFFSET_DESC2	8
+#define RX_REG_DESC_SIZE	16
+
+#define TX_DESC0_PKT_LATE_COL	0x1		/* abort, late collision */
+#define TX_DESC0_RX_PKT_EXS_COL	0x2		/* abort, >16 collisions */
+#define TX_DESC0_DMA_OWN	0x80000000	/* owned by controller */
+#define TX_DESC1_BUF_SIZE_MASK	0x7ff
+#define TX_DESC1_LTS		0x8000000	/* last TX packet */
+#define TX_DESC1_FTS		0x10000000	/* first TX packet */
+#define TX_DESC1_FIFO_COMPLETE	0x20000000
+#define TX_DESC1_INTR_COMPLETE	0x40000000
+#define TX_DESC1_END		0x80000000
+#define TX_DESC2_ADDRESS_PHYS	0
+#define TX_DESC2_ADDRESS_VIRT	4
+
+#define RX_DESC0_FRAME_LEN	0
+#define RX_DESC0_FRAME_LEN_MASK	0x7FF
+#define RX_DESC0_MULTICAST	0x10000
+#define RX_DESC0_BROADCAST	0x20000
+#define RX_DESC0_ERR		0x40000
+#define RX_DESC0_CRC_ERR	0x80000
+#define RX_DESC0_FTL		0x100000
+#define RX_DESC0_RUNT		0x200000	/* packet less than 64 bytes */
+#define RX_DESC0_ODD_NB		0x400000	/* receive odd nibbles */
+#define RX_DESC0_LRS		0x10000000	/* last receive segment */
+#define RX_DESC0_FRS		0x20000000	/* first receive segment */
+#define RX_DESC0_DMA_OWN	0x80000000
+#define RX_DESC1_BUF_SIZE_MASK	0x7FF
+#define RX_DESC1_END		0x80000000
+#define RX_DESC2_ADDRESS_PHYS	0
+#define RX_DESC2_ADDRESS_VIRT	4
+
+#define TX_DESC_NUM		64
+#define TX_DESC_NUM_MASK	(TX_DESC_NUM-1)
+#define TX_NEXT(N)		(((N) + 1) & (TX_DESC_NUM_MASK))
+#define TX_BUF_SIZE		1600
+#define TX_BUF_SIZE_MAX		(TX_DESC1_BUF_SIZE_MASK+1)
+
+#define RX_DESC_NUM		64
+#define RX_DESC_NUM_MASK	(RX_DESC_NUM-1)
+#define RX_NEXT(N)		(((N) + 1) & (RX_DESC_NUM_MASK))
+#define RX_BUF_SIZE		1600
+#define RX_BUF_SIZE_MAX		(RX_DESC1_BUF_SIZE_MASK+1)
+
+#define REG_INTERRUPT_STATUS	0
+#define REG_INTERRUPT_MASK	4
+#define REG_MAC_MS_ADDRESS	8
+#define REG_MAC_LS_ADDRESS	12
+#define REG_MCAST_HASH_TABLE0	16
+#define REG_MCAST_HASH_TABLE1	20
+#define REG_TX_POLL_DEMAND	24
+#define REG_RX_POLL_DEMAND	28
+#define REG_TXR_BASE_ADDRESS	32
+#define REG_RXR_BASE_ADDRESS	36
+#define REG_INT_TIMER_CTRL	40
+#define REG_APOLL_TIMER_CTRL	44
+#define REG_DMA_BLEN_CTRL	48
+#define REG_RESERVED1		52
+#define REG_MAC_CTRL		136
+#define
REG_MAC_STATUS 140 +#define REG_PHY_CTRL 144 +#define REG_PHY_WRITE_DATA 148 +#define REG_FLOW_CTRL 152 +#define REG_BACK_PRESSURE 156 +#define REG_RESERVED2 160 +#define REG_TEST_SEED 196 +#define REG_DMA_FIFO_STATE 200 +#define REG_TEST_MODE 204 +#define REG_RESERVED3 208 +#define REG_TX_COL_COUNTER 212 +#define REG_RPF_AEP_COUNTER 216 +#define REG_XM_PG_COUNTER 220 +#define REG_RUNT_TLC_COUNTER 224 +#define REG_CRC_FTL_COUNTER 228 +#define REG_RLC_RCC_COUNTER 232 +#define REG_BROC_COUNTER 236 +#define REG_MULCA_COUNTER 240 +#define REG_RP_COUNTER 244 +#define REG_XP_COUNTER 248 + +#define REG_PHY_CTRL_OFFSET 0x0 +#define REG_PHY_STATUS 0x1 +#define REG_PHY_ID1 0x2 +#define REG_PHY_ID2 0x3 +#define REG_PHY_ANA 0x4 +#define REG_PHY_ANLPAR 0x5 +#define REG_PHY_ANE 0x6 +#define REG_PHY_ECTRL1 0x10 +#define REG_PHY_QPDS 0x11 +#define REG_PHY_10BOP 0x12 +#define REG_PHY_ECTRL2 0x13 +#define REG_PHY_FTMAC100_WRITE 0x8000000 +#define REG_PHY_FTMAC100_READ 0x4000000 + +/* REG_INTERRUPT_STATUS */ +#define RPKT_FINISH BIT(0) /* DMA data received */ +#define NORXBUF BIT(1) /* receive buffer unavailable */ +#define XPKT_FINISH BIT(2) /* DMA moved data to TX FIFO */ +#define NOTXBUF BIT(3) /* transmit buffer unavailable */ +#define XPKT_OK_INT_STS BIT(4) /* transmit to ethernet success */ +#define XPKT_LOST_INT_STS BIT(5) /* transmit ethernet lost (collision) */ +#define RPKT_SAV BIT(6) /* FIFO receive success */ +#define RPKT_LOST_INT_STS BIT(7) /* FIFO full, receive failed */ +#define AHB_ERR BIT(8) /* AHB error */ +#define PHYSTS_CHG BIT(9) /* PHY link status change */ + +/* REG_INTERRUPT_MASK */ +#define RPKT_FINISH_M BIT(0) +#define NORXBUF_M BIT(1) +#define XPKT_FINISH_M BIT(2) +#define NOTXBUF_M BIT(3) +#define XPKT_OK_M BIT(4) +#define XPKT_LOST_M BIT(5) +#define RPKT_SAV_M BIT(6) +#define RPKT_LOST_M BIT(7) +#define AHB_ERR_M BIT(8) +#define PHYSTS_CHG_M BIT(9) + +/* REG_MAC_MS_ADDRESS */ +#define MAC_MADR_MASK 0xffff /* 2 MSB MAC address */ + +/* REG_INT_TIMER_CTRL */ +#define TXINT_TIME_SEL BIT(15) /* TX cycle time period */ +#define TXINT_THR_MASK 0x7000 +#define TXINT_CNT_MASK 0xf00 +#define RXINT_TIME_SEL BIT(7) /* RX cycle time period */ +#define RXINT_THR_MASK 0x70 +#define RXINT_CNT_MASK 0xF + +/* REG_APOLL_TIMER_CTRL */ +#define TXPOLL_TIME_SEL BIT(12) /* TX poll time period */ +#define TXPOLL_CNT_MASK 0xf00 +#define TXPOLL_CNT_SHIFT_BIT 8 +#define RXPOLL_TIME_SEL BIT(4) /* RX poll time period */ +#define RXPOLL_CNT_MASK 0xF +#define RXPOLL_CNT_SHIFT_BIT 0 + +/* REG_DMA_BLEN_CTRL */ +#define RX_THR_EN BIT(9) /* RX FIFO threshold arbitration */ +#define RXFIFO_HTHR_MASK 0x1c0 +#define RXFIFO_LTHR_MASK 0x38 +#define INCR16_EN BIT(2) /* AHB bus INCR16 burst command */ +#define INCR8_EN BIT(1) /* AHB bus INCR8 burst command */ +#define INCR4_EN BIT(0) /* AHB bus INCR4 burst command */ + +/* REG_MAC_CTRL */ +#define RX_BROADPKT BIT(17) /* receive broadcast packets */ +#define RX_MULTIPKT BIT(16) /* receive all multicast packets */ +#define FULLDUP BIT(15) /* full duplex */ +#define CRC_APD BIT(14) /* append CRC to transmitted packet */ +#define RCV_ALL BIT(12) /* ignore incoming packet destination */ +#define RX_FTL BIT(11) /* accept packets larger than 1518 B */ +#define RX_RUNT BIT(10) /* accept packets smaller than 64 B */ +#define HT_MULTI_EN BIT(9) /* accept on hash and mcast pass */ +#define RCV_EN BIT(8) /* receiver enable */ +#define ENRX_IN_HALFTX BIT(6) /* enable receive in half duplex mode */ +#define XMT_EN BIT(5) /* transmit enable */ +#define CRC_DIS BIT(4) /* disable CRC 
check when receiving */ +#define LOOP_EN BIT(3) /* internal loop-back */ +#define SW_RST BIT(2) /* software reset, last 64 AHB clocks */ +#define RDMA_EN BIT(1) /* enable receive DMA chan */ +#define XDMA_EN BIT(0) /* enable transmit DMA chan */ + +/* REG_MAC_STATUS */ +#define COL_EXCEED BIT(11) /* more than 16 collisions */ +#define LATE_COL BIT(10) /* transmit late collision detected */ +#define XPKT_LOST BIT(9) /* transmit to ethernet lost */ +#define XPKT_OK BIT(8) /* transmit to ethernet success */ +#define RUNT_MAC_STS BIT(7) /* receive runt detected */ +#define FTL_MAC_STS BIT(6) /* receive frame too long detected */ +#define CRC_ERR_MAC_STS BIT(5) +#define RPKT_LOST BIT(4) /* RX FIFO full, receive failed */ +#define RPKT_SAVE BIT(3) /* RX FIFO receive success */ +#define COL BIT(2) /* collision, incoming packet dropped */ +#define MCPU_BROADCAST BIT(1) +#define MCPU_MULTICAST BIT(0) + +/* REG_PHY_CTRL */ +#define MIIWR BIT(27) /* init write sequence (auto cleared)*/ +#define MIIRD BIT(26) +#define REGAD_MASK 0x3e00000 +#define PHYAD_MASK 0x1f0000 +#define MIIRDATA_MASK 0xffff + +/* REG_PHY_WRITE_DATA */ +#define MIIWDATA_MASK 0xffff + +/* REG_FLOW_CTRL */ +#define PAUSE_TIME_MASK 0xffff0000 +#define FC_HIGH_MASK 0xf000 +#define FC_LOW_MASK 0xf00 +#define RX_PAUSE BIT(4) /* receive pause frame */ +#define TX_PAUSED BIT(3) /* transmit pause due to receive */ +#define FCTHR_EN BIT(2) /* enable threshold mode. */ +#define TX_PAUSE BIT(1) /* transmit pause frame */ +#define FC_EN BIT(0) /* flow control mode enable */ + +/* REG_BACK_PRESSURE */ +#define BACKP_LOW_MASK 0xf00 +#define BACKP_JAM_LEN_MASK 0xf0 +#define BACKP_MODE BIT(1) /* address mode */ +#define BACKP_ENABLE BIT(0) + +/* REG_TEST_SEED */ +#define TEST_SEED_MASK 0x3fff + +/* REG_DMA_FIFO_STATE */ +#define TX_DMA_REQUEST BIT(31) +#define RX_DMA_REQUEST BIT(30) +#define TX_DMA_GRANT BIT(29) +#define RX_DMA_GRANT BIT(28) +#define TX_FIFO_EMPTY BIT(27) +#define RX_FIFO_EMPTY BIT(26) +#define TX_DMA2_SM_MASK 0x7000 +#define TX_DMA1_SM_MASK 0xf00 +#define RX_DMA2_SM_MASK 0x70 +#define RX_DMA1_SM_MASK 0xF + +/* REG_TEST_MODE */ +#define SINGLE_PKT BIT(26) /* single packet mode */ +#define PTIMER_TEST BIT(25) /* automatic polling timer test mode */ +#define ITIMER_TEST BIT(24) /* interrupt timer test mode */ +#define TEST_SEED_SELECT BIT(22) +#define SEED_SELECT BIT(21) +#define TEST_MODE BIT(20) +#define TEST_TIME_MASK 0xffc00 +#define TEST_EXCEL_MASK 0x3e0 + +/* REG_TX_COL_COUNTER */ +#define TX_MCOL_MASK 0xffff0000 +#define TX_MCOL_SHIFT_BIT 16 +#define TX_SCOL_MASK 0xffff +#define TX_SCOL_SHIFT_BIT 0 + +/* REG_RPF_AEP_COUNTER */ +#define RPF_MASK 0xffff0000 +#define RPF_SHIFT_BIT 16 +#define AEP_MASK 0xffff +#define AEP_SHIFT_BIT 0 + +/* REG_XM_PG_COUNTER */ +#define XM_MASK 0xffff0000 +#define XM_SHIFT_BIT 16 +#define PG_MASK 0xffff +#define PG_SHIFT_BIT 0 + +/* REG_RUNT_TLC_COUNTER */ +#define RUNT_CNT_MASK 0xffff0000 +#define RUNT_CNT_SHIFT_BIT 16 +#define TLCC_MASK 0xffff +#define TLCC_SHIFT_BIT 0 + +/* REG_CRC_FTL_COUNTER */ +#define CRCER_CNT_MASK 0xffff0000 +#define CRCER_CNT_SHIFT_BIT 16 +#define FTL_CNT_MASK 0xffff +#define FTL_CNT_SHIFT_BIT 0 + +/* REG_RLC_RCC_COUNTER */ +#define RLC_MASK 0xffff0000 +#define RLC_SHIFT_BIT 16 +#define RCC_MASK 0xffff +#define RCC_SHIFT_BIT 0 + +/* REG_PHY_STATUS */ +#define AN_COMPLETE 0x20 +#define LINK_STATUS 0x4 + +struct moxart_mac_priv_t { + void __iomem *base; + struct net_device_stats stats; + unsigned int reg_maccr; + unsigned int reg_imr; + struct napi_struct napi; + struct 
net_device *ndev; + + dma_addr_t rx_base; + dma_addr_t rx_mapping[RX_DESC_NUM]; + void __iomem *rx_desc_base; + unsigned char *rx_buf_base; + unsigned char *rx_buf[RX_DESC_NUM]; + unsigned int rx_head; + unsigned int rx_buf_size; + + dma_addr_t tx_base; + dma_addr_t tx_mapping[TX_DESC_NUM]; + void __iomem *tx_desc_base; + unsigned char *tx_buf_base; + unsigned char *tx_buf[RX_DESC_NUM]; + unsigned int tx_head; + unsigned int tx_buf_size; + + spinlock_t txlock; + unsigned int tx_len[TX_DESC_NUM]; + struct sk_buff *tx_skb[TX_DESC_NUM]; + unsigned int tx_tail; +}; + +#if TX_BUF_SIZE >= TX_BUF_SIZE_MAX +#error MOXA ART Ethernet device driver TX buffer is too large! +#endif +#if RX_BUF_SIZE >= RX_BUF_SIZE_MAX +#error MOXA ART Ethernet device driver RX buffer is too large! +#endif + +#endif -- cgit v1.2.3 From af61a165187bb94b1dc7628ef815c23d0eacf40b Mon Sep 17 00:00:00 2001 From: Johannes Berg Date: Tue, 2 Jul 2013 18:09:12 +0200 Subject: mac80211: add control port protocol TX control flag A lot of drivers check the frame protocol for ETH_P_PAE, for various reasons (like making those more reliable). Add a new flags bitmap to the TX control info and a new flag indicating the control port protocol is in use to let all drivers also apply such logic to other control port protocols, should they be configured. Also use the new flag in the iwlwifi drivers. Signed-off-by: Johannes Berg --- Documentation/DocBook/80211.tmpl | 1 + drivers/net/wireless/iwlwifi/dvm/tx.c | 2 +- drivers/net/wireless/iwlwifi/iwl-devtrace.h | 7 ++++--- drivers/net/wireless/iwlwifi/mvm/tx.c | 9 ++++----- include/net/mac80211.h | 19 ++++++++++++++++--- net/mac80211/rc80211_minstrel_ht.c | 5 +++-- net/mac80211/tx.c | 8 +++++--- 7 files changed, 34 insertions(+), 17 deletions(-) (limited to 'Documentation') diff --git a/Documentation/DocBook/80211.tmpl b/Documentation/DocBook/80211.tmpl index 49267ea97568..f403ec3c5c9a 100644 --- a/Documentation/DocBook/80211.tmpl +++ b/Documentation/DocBook/80211.tmpl @@ -325,6 +325,7 @@ functions/definitions !Finclude/net/mac80211.h ieee80211_rx_status !Finclude/net/mac80211.h mac80211_rx_flags +!Finclude/net/mac80211.h mac80211_tx_info_flags !Finclude/net/mac80211.h mac80211_tx_control_flags !Finclude/net/mac80211.h mac80211_rate_control_flags !Finclude/net/mac80211.h ieee80211_tx_rate diff --git a/drivers/net/wireless/iwlwifi/dvm/tx.c b/drivers/net/wireless/iwlwifi/dvm/tx.c index 5ee983faa679..f364583b1535 100644 --- a/drivers/net/wireless/iwlwifi/dvm/tx.c +++ b/drivers/net/wireless/iwlwifi/dvm/tx.c @@ -87,7 +87,7 @@ static void iwlagn_tx_cmd_build_basic(struct iwl_priv *priv, priv->lib->bt_params->advanced_bt_coexist && (ieee80211_is_auth(fc) || ieee80211_is_assoc_req(fc) || ieee80211_is_reassoc_req(fc) || - skb->protocol == cpu_to_be16(ETH_P_PAE))) + info->control.flags & IEEE80211_TX_CTRL_PORT_CTRL_PROTO)) tx_flags |= TX_CMD_FLG_IGNORE_BT; diff --git a/drivers/net/wireless/iwlwifi/iwl-devtrace.h b/drivers/net/wireless/iwlwifi/iwl-devtrace.h index 4491c1c72cc7..684c416d3493 100644 --- a/drivers/net/wireless/iwlwifi/iwl-devtrace.h +++ b/drivers/net/wireless/iwlwifi/iwl-devtrace.h @@ -33,10 +33,11 @@ static inline bool iwl_trace_data(struct sk_buff *skb) { struct ieee80211_hdr *hdr = (void *)skb->data; + struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb); - if (ieee80211_is_data(hdr->frame_control)) - return skb->protocol != cpu_to_be16(ETH_P_PAE); - return false; + if (!ieee80211_is_data(hdr->frame_control)) + return false; + return !(info->control.flags & 
IEEE80211_TX_CTRL_PORT_CTRL_PROTO); } static inline size_t iwl_rx_trace_len(const struct iwl_trans *trans, diff --git a/drivers/net/wireless/iwlwifi/mvm/tx.c b/drivers/net/wireless/iwlwifi/mvm/tx.c index f0e96a927407..d62a6d3053ff 100644 --- a/drivers/net/wireless/iwlwifi/mvm/tx.c +++ b/drivers/net/wireless/iwlwifi/mvm/tx.c @@ -91,11 +91,10 @@ static void iwl_mvm_set_tx_cmd(struct iwl_mvm *mvm, struct sk_buff *skb, tx_flags |= TX_CMD_FLG_ACK | TX_CMD_FLG_BAR; /* High prio packet (wrt. BT coex) if it is EAPOL, MCAST or MGMT */ - if (info->band == IEEE80211_BAND_2GHZ && - (skb->protocol == cpu_to_be16(ETH_P_PAE) || - is_multicast_ether_addr(hdr->addr1) || - ieee80211_is_back_req(fc) || - ieee80211_is_mgmt(fc))) + if (info->band == IEEE80211_BAND_2GHZ && + (info->control.flags & IEEE80211_TX_CTRL_PORT_CTRL_PROTO || + is_multicast_ether_addr(hdr->addr1) || + ieee80211_is_back_req(fc) || ieee80211_is_mgmt(fc))) tx_flags |= TX_CMD_FLG_BT_DIS; if (ieee80211_has_morefrags(fc)) diff --git a/include/net/mac80211.h b/include/net/mac80211.h index 9cda3728c2cb..b70c00111323 100644 --- a/include/net/mac80211.h +++ b/include/net/mac80211.h @@ -375,7 +375,7 @@ struct ieee80211_bss_conf { }; /** - * enum mac80211_tx_control_flags - flags to describe transmission information/status + * enum mac80211_tx_info_flags - flags to describe transmission information/status * * These flags are used with the @flags member of &ieee80211_tx_info. * @@ -471,7 +471,7 @@ struct ieee80211_bss_conf { * Note: If you have to add new flags to the enumeration, then don't * forget to update %IEEE80211_TX_TEMPORARY_FLAGS when necessary. */ -enum mac80211_tx_control_flags { +enum mac80211_tx_info_flags { IEEE80211_TX_CTL_REQ_TX_STATUS = BIT(0), IEEE80211_TX_CTL_ASSIGN_SEQ = BIT(1), IEEE80211_TX_CTL_NO_ACK = BIT(2), @@ -507,6 +507,18 @@ enum mac80211_tx_control_flags { #define IEEE80211_TX_CTL_STBC_SHIFT 23 +/** + * enum mac80211_tx_control_flags - flags to describe transmit control + * + * @IEEE80211_TX_CTRL_PORT_CTRL_PROTO: this frame is a port control + * protocol frame (e.g. EAP) + * + * These flags are used in tx_info->control.flags. + */ +enum mac80211_tx_control_flags { + IEEE80211_TX_CTRL_PORT_CTRL_PROTO = BIT(0), +}; + /* * This definition is used as a mask to clear all temporary flags, which are * set by the tx handlers for each transmission attempt by the mac80211 stack. 
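A driver that wants to special-case all configured control port protocols,
not just EAPOL, can now test the flag instead of skb->protocol. A minimal
sketch of the intended pattern (the helper name is made up; only
IEEE80211_SKB_CB() and the flag itself come from mac80211):

static bool drv_tx_is_control_port(struct sk_buff *skb)
{
	struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);

	/* replaces checks like skb->protocol == cpu_to_be16(ETH_P_PAE) */
	return info->control.flags & IEEE80211_TX_CTRL_PORT_CTRL_PROTO;
}
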
@@ -680,7 +692,8 @@ struct ieee80211_tx_info { /* NB: vif can be NULL for injected frames */ struct ieee80211_vif *vif; struct ieee80211_key_conf *hw_key; - /* 8 bytes free */ + u32 flags; + /* 4 bytes free */ } control; struct { struct ieee80211_tx_rate rates[IEEE80211_TX_MAX_RATES]; diff --git a/net/mac80211/rc80211_minstrel_ht.c b/net/mac80211/rc80211_minstrel_ht.c index 7475a7a33797..9eff3824f2d1 100644 --- a/net/mac80211/rc80211_minstrel_ht.c +++ b/net/mac80211/rc80211_minstrel_ht.c @@ -439,12 +439,13 @@ minstrel_aggr_check(struct ieee80211_sta *pubsta, struct sk_buff *skb) { struct ieee80211_hdr *hdr = (struct ieee80211_hdr *) skb->data; struct sta_info *sta = container_of(pubsta, struct sta_info, sta); + struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb); u16 tid; if (unlikely(!ieee80211_is_data_qos(hdr->frame_control))) return; - if (unlikely(skb->protocol == cpu_to_be16(ETH_P_PAE))) + if (unlikely(info->control.flags & IEEE80211_TX_CTRL_PORT_CTRL_PROTO)) return; tid = *ieee80211_get_qos_ctl(hdr) & IEEE80211_QOS_CTL_TID_MASK; @@ -776,7 +777,7 @@ minstrel_ht_get_rate(void *priv, struct ieee80211_sta *sta, void *priv_sta, /* Don't use EAPOL frames for sampling on non-mrr hw */ if (mp->hw->max_rates == 1 && - txrc->skb->protocol == cpu_to_be16(ETH_P_PAE)) + (info->control.flags & IEEE80211_TX_CTRL_PORT_CTRL_PROTO)) sample_idx = -1; else sample_idx = minstrel_get_sample_rate(mp, mi); diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c index 0e42322aa6b1..098ae854ad3c 100644 --- a/net/mac80211/tx.c +++ b/net/mac80211/tx.c @@ -539,9 +539,11 @@ ieee80211_tx_h_check_control_port_protocol(struct ieee80211_tx_data *tx) { struct ieee80211_tx_info *info = IEEE80211_SKB_CB(tx->skb); - if (unlikely(tx->sdata->control_port_protocol == tx->skb->protocol && - tx->sdata->control_port_no_encrypt)) - info->flags |= IEEE80211_TX_INTFL_DONT_ENCRYPT; + if (unlikely(tx->sdata->control_port_protocol == tx->skb->protocol)) { + if (tx->sdata->control_port_no_encrypt) + info->flags |= IEEE80211_TX_INTFL_DONT_ENCRYPT; + info->control.flags |= IEEE80211_TX_CTRL_PORT_CTRL_PROTO; + } return TX_CONTINUE; } -- cgit v1.2.3 From fc4eba58b4c1462ff3d6247b66fb47d6928db6d2 Mon Sep 17 00:00:00 2001 From: Hannes Frederic Sowa Date: Wed, 14 Aug 2013 01:03:46 +0200 Subject: ipv6: make unsolicited report intervals configurable for mld Commit cab70040dfd95ee32144f02fade64f0cb94f31a0 ("net: igmp: Reduce Unsolicited report interval to 1s when using IGMPv3") and 2690048c01f32bf45d1c1e1ab3079bc10ad2aea7 ("net: igmp: Allow user-space configuration of igmp unsolicited report interval") by William Manley made igmp unsolicited report intervals configurable per interface and corrected the interval of unsolicited igmpv3 report messages resendings to 1s. Same needs to be done for IPv6: MLDv1 (RFC2710 7.10.): 10 seconds MLDv2 (RFC3810 9.11.): 1 second Both intervals are configurable via new procfs knobs mldv1_unsolicited_report_interval and mldv2_unsolicited_report_interval. (also added .force_mld_version to ipv6_devconf_dflt to bring structs in line without semantic changes) v2: a) Joined documentation update for IPv4 and IPv6 MLD/IGMP unsolicited_report_interval procfs knobs. b) incorporate stylistic feedback from William Manley v3: a) add new DEVCONF_* values to the end of the enum (thanks to David Miller) Cc: Cong Wang Cc: William Manley Cc: Benjamin LaHaise Cc: YOSHIFUJI Hideaki Signed-off-by: Hannes Frederic Sowa Signed-off-by: David S. 
Miller --- Documentation/networking/ip-sysctl.txt | 18 ++++++++++++++++++ include/linux/ipv6.h | 2 ++ include/uapi/linux/ipv6.h | 2 ++ net/ipv6/addrconf.c | 25 +++++++++++++++++++++++++ net/ipv6/mcast.c | 17 ++++++++++++++--- 5 files changed, 61 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 36be26b2ef7a..debfe857d8f9 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -1039,7 +1039,15 @@ disable_policy - BOOLEAN disable_xfrm - BOOLEAN Disable IPSEC encryption on this interface, whatever the policy +igmpv2_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + IGMPv1 or IGMPv2 report retransmit will take place. + Default: 10000 (10 seconds) +igmpv3_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + IGMPv3 report retransmit will take place. + Default: 1000 (1 seconds) tag - INTEGER Allows you to write a number, which can be used as required. @@ -1331,6 +1339,16 @@ ndisc_notify - BOOLEAN 1 - Generate unsolicited neighbour advertisements when device is brought up or hardware address changes. +mldv1_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + MLDv1 report retransmit will take place. + Default: 10000 (10 seconds) + +mldv2_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + MLDv2 report retransmit will take place. + Default: 1000 (1 second) + icmp/*: ratelimit - INTEGER Limit the maximal rates for sending ICMPv6 packets. diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h index 850e95bc766c..77a478462d8e 100644 --- a/include/linux/ipv6.h +++ b/include/linux/ipv6.h @@ -19,6 +19,8 @@ struct ipv6_devconf { __s32 rtr_solicit_interval; __s32 rtr_solicit_delay; __s32 force_mld_version; + __s32 mldv1_unsolicited_report_interval; + __s32 mldv2_unsolicited_report_interval; #ifdef CONFIG_IPV6_PRIVACY __s32 use_tempaddr; __s32 temp_valid_lft; diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h index 4bda4cf5b0f5..d07ac6903e59 100644 --- a/include/uapi/linux/ipv6.h +++ b/include/uapi/linux/ipv6.h @@ -160,6 +160,8 @@ enum { DEVCONF_ACCEPT_DAD, DEVCONF_FORCE_TLLAO, DEVCONF_NDISC_NOTIFY, + DEVCONF_MLDV1_UNSOLICITED_REPORT_INTERVAL, + DEVCONF_MLDV2_UNSOLICITED_REPORT_INTERVAL, DEVCONF_MAX }; diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 7fd8572bac80..ad12f7c43f25 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -177,6 +177,8 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = { .accept_redirects = 1, .autoconf = 1, .force_mld_version = 0, + .mldv1_unsolicited_report_interval = 10 * HZ, + .mldv2_unsolicited_report_interval = HZ, .dad_transmits = 1, .rtr_solicits = MAX_RTR_SOLICITATIONS, .rtr_solicit_interval = RTR_SOLICITATION_INTERVAL, @@ -211,6 +213,9 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = { .accept_ra = 1, .accept_redirects = 1, .autoconf = 1, + .force_mld_version = 0, + .mldv1_unsolicited_report_interval = 10 * HZ, + .mldv2_unsolicited_report_interval = HZ, .dad_transmits = 1, .rtr_solicits = MAX_RTR_SOLICITATIONS, .rtr_solicit_interval = RTR_SOLICITATION_INTERVAL, @@ -4179,6 +4184,10 @@ static inline void ipv6_store_devconf(struct ipv6_devconf *cnf, array[DEVCONF_RTR_SOLICIT_DELAY] = jiffies_to_msecs(cnf->rtr_solicit_delay); array[DEVCONF_FORCE_MLD_VERSION] = 
cnf->force_mld_version; + array[DEVCONF_MLDV1_UNSOLICITED_REPORT_INTERVAL] = + jiffies_to_msecs(cnf->mldv1_unsolicited_report_interval); + array[DEVCONF_MLDV2_UNSOLICITED_REPORT_INTERVAL] = + jiffies_to_msecs(cnf->mldv2_unsolicited_report_interval); #ifdef CONFIG_IPV6_PRIVACY array[DEVCONF_USE_TEMPADDR] = cnf->use_tempaddr; array[DEVCONF_TEMP_VALID_LFT] = cnf->temp_valid_lft; @@ -4862,6 +4871,22 @@ static struct addrconf_sysctl_table .mode = 0644, .proc_handler = proc_dointvec, }, + { + .procname = "mldv1_unsolicited_report_interval", + .data = + &ipv6_devconf.mldv1_unsolicited_report_interval, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_ms_jiffies, + }, + { + .procname = "mldv2_unsolicited_report_interval", + .data = + &ipv6_devconf.mldv2_unsolicited_report_interval, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_ms_jiffies, + }, #ifdef CONFIG_IPV6_PRIVACY { .procname = "use_tempaddr", diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c index db25b8eb62bd..6c76df9909bf 100644 --- a/net/ipv6/mcast.c +++ b/net/ipv6/mcast.c @@ -108,7 +108,6 @@ static int ip6_mc_leave_src(struct sock *sk, struct ipv6_mc_socklist *iml, struct inet6_dev *idev); -#define IGMP6_UNSOLICITED_IVAL (10*HZ) #define MLD_QRV_DEFAULT 2 #define MLD_V1_SEEN(idev) (dev_net((idev)->dev)->ipv6.devconf_all->force_mld_version == 1 || \ @@ -129,6 +128,18 @@ int sysctl_mld_max_msf __read_mostly = IPV6_MLD_MAX_MSF; pmc != NULL; \ pmc = rcu_dereference(pmc->next)) +static int unsolicited_report_interval(struct inet6_dev *idev) +{ + int iv; + + if (MLD_V1_SEEN(idev)) + iv = idev->cnf.mldv1_unsolicited_report_interval; + else + iv = idev->cnf.mldv2_unsolicited_report_interval; + + return iv > 0 ? iv : 1; +} + int ipv6_sock_mc_join(struct sock *sk, int ifindex, const struct in6_addr *addr) { struct net_device *dev = NULL; @@ -2158,7 +2169,7 @@ static void igmp6_join_group(struct ifmcaddr6 *ma) igmp6_send(&ma->mca_addr, ma->idev->dev, ICMPV6_MGM_REPORT); - delay = net_random() % IGMP6_UNSOLICITED_IVAL; + delay = net_random() % unsolicited_report_interval(ma->idev); spin_lock_bh(&ma->mca_lock); if (del_timer(&ma->mca_timer)) { @@ -2325,7 +2336,7 @@ void ipv6_mc_init_dev(struct inet6_dev *idev) setup_timer(&idev->mc_dad_timer, mld_dad_timer_expire, (unsigned long)idev); idev->mc_qrv = MLD_QRV_DEFAULT; - idev->mc_maxdelay = IGMP6_UNSOLICITED_IVAL; + idev->mc_maxdelay = unsolicited_report_interval(idev); idev->mc_v1_seen = 0; write_unlock_bh(&idev->lock); } -- cgit v1.2.3 From 954c396756e3d31985f7bc6a414a988a4736a7d0 Mon Sep 17 00:00:00 2001 From: Sean Cross Date: Wed, 21 Aug 2013 01:46:12 +0000 Subject: net/phy: micrel: Add OF configuration support for ksz9021 Some boards require custom PHY configuration, for example due to trace length differences. Add the ability to configure these registers in order to get the PHY to function on boards that need it. Because PHYs are auto-detected based on MDIO device IDs, allow PHY configuration to be specified in the parent Ethernet device node if no PHY device node is present. Signed-off-by: Sean Cross Signed-off-by: David S. 
Miller --- .../devicetree/bindings/net/micrel-ksz9021.txt | 49 ++++++++++ drivers/net/phy/micrel.c | 103 ++++++++++++++++++++- 2 files changed, 151 insertions(+), 1 deletion(-) create mode 100644 Documentation/devicetree/bindings/net/micrel-ksz9021.txt (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/micrel-ksz9021.txt b/Documentation/devicetree/bindings/net/micrel-ksz9021.txt new file mode 100644 index 000000000000..997a63f1aea1 --- /dev/null +++ b/Documentation/devicetree/bindings/net/micrel-ksz9021.txt @@ -0,0 +1,49 @@ +Micrel KSZ9021 Gigabit Ethernet PHY + +Some boards require special tuning values, particularly when it comes to +clock delays. You can specify clock delay values by adding +micrel-specific properties to an Ethernet OF device node. + +All skew control options are specified in picoseconds. The minimum +value is 0, and the maximum value is 3000. + +Optional properties: + - rxc-skew-ps : Skew control of RXC pad + - rxdv-skew-ps : Skew control of RX CTL pad + - txc-skew-ps : Skew control of TXC pad + - txen-skew-ps : Skew control of TX_CTL pad + - rxd0-skew-ps : Skew control of RX data 0 pad + - rxd1-skew-ps : Skew control of RX data 1 pad + - rxd2-skew-ps : Skew control of RX data 2 pad + - rxd3-skew-ps : Skew control of RX data 3 pad + - txd0-skew-ps : Skew control of TX data 0 pad + - txd1-skew-ps : Skew control of TX data 1 pad + - txd2-skew-ps : Skew control of TX data 2 pad + - txd3-skew-ps : Skew control of TX data 3 pad + +Examples: + + /* Attach to an Ethernet device with autodetected PHY */ + &enet { + rxc-skew-ps = <3000>; + rxdv-skew-ps = <0>; + txc-skew-ps = <3000>; + txen-skew-ps = <0>; + status = "okay"; + }; + + /* Attach to an explicitly-specified PHY */ + mdio { + phy0: ethernet-phy@0 { + rxc-skew-ps = <3000>; + rxdv-skew-ps = <0>; + txc-skew-ps = <3000>; + txen-skew-ps = <0>; + reg = <0>; + }; + }; + ethernet@70000 { + status = "okay"; + phy = <&phy0>; + phy-mode = "rgmii-id"; + }; diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c index 9ca494550186..c31aad0004cb 100644 --- a/drivers/net/phy/micrel.c +++ b/drivers/net/phy/micrel.c @@ -25,6 +25,7 @@ #include #include #include +#include /* Operation Mode Strap Override */ #define MII_KSZPHY_OMSO 0x16 @@ -53,6 +54,20 @@ #define KS8737_CTRL_INT_ACTIVE_HIGH (1 << 14) #define KSZ8051_RMII_50MHZ_CLK (1 << 7) +/* Write/read to/from extended registers */ +#define MII_KSZPHY_EXTREG 0x0b +#define KSZPHY_EXTREG_WRITE 0x8000 + +#define MII_KSZPHY_EXTREG_WRITE 0x0c +#define MII_KSZPHY_EXTREG_READ 0x0d + +/* Extended registers */ +#define MII_KSZPHY_CLK_CONTROL_PAD_SKEW 0x104 +#define MII_KSZPHY_RX_DATA_PAD_SKEW 0x105 +#define MII_KSZPHY_TX_DATA_PAD_SKEW 0x106 + +#define PS_TO_REG 200 + static int ksz_config_flags(struct phy_device *phydev) { int regval; @@ -65,6 +80,20 @@ static int ksz_config_flags(struct phy_device *phydev) return 0; } +static int kszphy_extended_write(struct phy_device *phydev, + u32 regnum, u16 val) +{ + phy_write(phydev, MII_KSZPHY_EXTREG, KSZPHY_EXTREG_WRITE | regnum); + return phy_write(phydev, MII_KSZPHY_EXTREG_WRITE, val); +} + +static int kszphy_extended_read(struct phy_device *phydev, + u32 regnum) +{ + phy_write(phydev, MII_KSZPHY_EXTREG, regnum); + return phy_read(phydev, MII_KSZPHY_EXTREG_READ); +} + static int kszphy_ack_interrupt(struct phy_device *phydev) { /* bit[7..0] int status, which is a read and clear register. */ @@ -141,6 +170,78 @@ static int ks8051_config_init(struct phy_device *phydev) return rc < 0 ? 
rc : 0; }
+
+static int ksz9021_load_values_from_of(struct phy_device *phydev,
+				       struct device_node *of_node, u16 reg,
+				       char *field1, char *field2,
+				       char *field3, char *field4)
+{
+	int val1 = -1;
+	int val2 = -2;
+	int val3 = -3;
+	int val4 = -4;
+	int newval;
+	int matches = 0;
+
+	if (!of_property_read_u32(of_node, field1, &val1))
+		matches++;
+
+	if (!of_property_read_u32(of_node, field2, &val2))
+		matches++;
+
+	if (!of_property_read_u32(of_node, field3, &val3))
+		matches++;
+
+	if (!of_property_read_u32(of_node, field4, &val4))
+		matches++;
+
+	if (!matches)
+		return 0;
+
+	if (matches < 4)
+		newval = kszphy_extended_read(phydev, reg);
+	else
+		newval = 0;
+
+	/* each value keeps its own distinct sentinel when its property
+	 * is absent, so compare against that sentinel, not -1
+	 */
+	if (val1 != -1)
+		newval = ((newval & 0xfff0) | ((val1 / PS_TO_REG) & 0xf) << 0);
+
+	if (val2 != -2)
+		newval = ((newval & 0xff0f) | ((val2 / PS_TO_REG) & 0xf) << 4);
+
+	if (val3 != -3)
+		newval = ((newval & 0xf0ff) | ((val3 / PS_TO_REG) & 0xf) << 8);
+
+	if (val4 != -4)
+		newval = ((newval & 0x0fff) | ((val4 / PS_TO_REG) & 0xf) << 12);
+
+	return kszphy_extended_write(phydev, reg, newval);
+}
+
+static int ksz9021_config_init(struct phy_device *phydev)
+{
+	struct device *dev = &phydev->dev;
+	struct device_node *of_node = dev->of_node;
+
+	if (!of_node && dev->parent->of_node)
+		of_node = dev->parent->of_node;
+
+	if (of_node) {
+		ksz9021_load_values_from_of(phydev, of_node,
+				    MII_KSZPHY_CLK_CONTROL_PAD_SKEW,
+				    "txen-skew-ps", "txc-skew-ps",
+				    "rxdv-skew-ps", "rxc-skew-ps");
+		ksz9021_load_values_from_of(phydev, of_node,
+				    MII_KSZPHY_RX_DATA_PAD_SKEW,
+				    "rxd0-skew-ps", "rxd1-skew-ps",
+				    "rxd2-skew-ps", "rxd3-skew-ps");
+		ksz9021_load_values_from_of(phydev, of_node,
+				    MII_KSZPHY_TX_DATA_PAD_SKEW,
+				    "txd0-skew-ps", "txd1-skew-ps",
+				    "txd2-skew-ps", "txd3-skew-ps");
+	}
+	return 0;
+}
+
 #define KSZ8873MLL_GLOBAL_CONTROL_4	0x06
 #define KSZ8873MLL_GLOBAL_CONTROL_4_DUPLEX	(1 << 6)
 #define KSZ8873MLL_GLOBAL_CONTROL_4_SPEED	(1 << 4)
@@ -281,7 +382,7 @@ static struct phy_driver ksphy_driver[] = {
 	.name		= "Micrel KSZ9021 Gigabit PHY",
 	.features	= (PHY_GBIT_FEATURES | SUPPORTED_Pause),
 	.flags		= PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
-	.config_init	= kszphy_config_init,
+	.config_init	= ksz9021_config_init,
 	.config_aneg	= genphy_config_aneg,
 	.read_status	= genphy_read_status,
 	.ack_interrupt	= kszphy_ack_interrupt,
-- cgit v1.2.3

From 03f0d916aa0317592dda11bd17c7357858719b6c Mon Sep 17 00:00:00 2001
From: Andy Zhou
Date: Wed, 7 Aug 2013 20:01:00 -0700
Subject: openvswitch: Mega flow implementation

Add wildcarded flow support in kernel datapath.

Wildcarded flows can improve OVS flow set up performance by avoiding
sending matching new flows to the user space program. The exact
performance boost largely depends on the wildcarded flow hit rate.
In the case where all new flows hit wildcard flows, the flow set up
rate is within 5% of that of the linux bridge module.

Pravin has made significant contributions to this patch, including
API clean-ups and bug fixes.
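For readers following the mask semantics described below, a minimal
self-contained sketch of the match rule (illustrative only; a toy
64-bit key stands in for the real struct sw_flow_key):

#include <stdbool.h>
#include <stdint.h>

/* toy stand-in for struct sw_flow_key */
struct toy_key {
	uint64_t bits;
};

/* A '1' mask bit demands equality with the flow key bit;
 * a '0' mask bit matches anything in the incoming packet.
 */
static bool toy_flow_matches(struct toy_key pkt, struct toy_key flow,
			     struct toy_key mask)
{
	return ((pkt.bits ^ flow.bits) & mask.bits) == 0;
}
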
Signed-off-by: Pravin B Shelar
Signed-off-by: Andy Zhou
Signed-off-by: Jesse Gross
---
 Documentation/networking/openvswitch.txt |   40 +
 include/uapi/linux/openvswitch.h         |    9 +-
 net/openvswitch/actions.c                |    6 +-
 net/openvswitch/datapath.c               |  140 ++-
 net/openvswitch/datapath.h               |    6 +
 net/openvswitch/flow.c                   | 1387 ++++++++++++++++++++----------
 net/openvswitch/flow.h                   |   96 ++-
 7 files changed, 1171 insertions(+), 513 deletions(-)
(limited to 'Documentation')

diff --git a/Documentation/networking/openvswitch.txt b/Documentation/networking/openvswitch.txt
index 8fa2dd1e792e..37c20ee2455e 100644
--- a/Documentation/networking/openvswitch.txt
+++ b/Documentation/networking/openvswitch.txt
@@ -91,6 +91,46 @@ Often we ellipsize arguments not important to the discussion, e.g.:

     in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)

+Wildcarded flow key format
+--------------------------
+
+A wildcarded flow is described with two sequences of Netlink attributes
+passed over the Netlink socket. A flow key, exactly as described above, and an
+optional corresponding flow mask.
+
+A wildcarded flow can represent a group of exact match flows. Each '1' bit
+in the mask specifies an exact match with the corresponding bit in the flow
+key. A '0' bit specifies a don't care bit, which will match either a '1' or
+'0' bit of an incoming packet. Using wildcarded flows can improve the flow
+set up rate by reducing the number of new flows that need to be processed
+by the user space program.
+
+Support for the mask Netlink attribute is optional for both the kernel and user
+space program. The kernel can ignore the mask attribute, installing an exact
+match flow, or reduce the number of don't care bits in the kernel to less than
+what was specified by the user space program. In this case, variations in bits
+that the kernel does not implement will simply result in additional flow setups.
+The kernel module will also work with user space programs that neither support
+nor supply flow mask attributes.
+
+Since the kernel may ignore or modify wildcard bits, it can be difficult for
+the userspace program to know exactly what matches are installed. There are
+two possible approaches: reactively install flows as they miss the kernel
+flow table (and therefore not attempt to determine wildcard changes at all)
+or use the kernel's response messages to determine the installed wildcards.
+
+When interacting with userspace, the kernel should maintain the match portion
+of the key exactly as originally installed. This provides a handle to
+identify the flow for all future operations. However, when reporting the
+mask of an installed flow, the mask should include any restrictions imposed
+by the kernel.
+
+The behavior when using overlapping wildcarded flows is undefined. It is the
+responsibility of the user space program to ensure that any incoming packet
+can match at most one flow, wildcarded or not. The current implementation
+performs best-effort detection of overlapping wildcarded flows and may reject
+some but not all of them. However, this behavior may change in future versions.
+
+
 Basic rule for evolving flow keys
 ---------------------------------
diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 52490b0e62b5..de1fa5d3780f 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -1,6 +1,6 @@
 /*
- * Copyright (c) 2007-2011 Nicira Networks.
+ * Copyright (c) 2007-2013 Nicira, Inc.
* * This program is free software; you can redistribute it and/or * modify it under the terms of version 2 of the GNU General Public @@ -379,6 +379,12 @@ struct ovs_key_nd { * @OVS_FLOW_ATTR_CLEAR: If present in a %OVS_FLOW_CMD_SET request, clears the * last-used time, accumulated TCP flags, and statistics for this flow. * Otherwise ignored in requests. Never present in notifications. + * @OVS_FLOW_ATTR_MASK: Nested %OVS_KEY_ATTR_* attributes specifying the + * mask bits for wildcarded flow match. Mask bit value '1' specifies exact + * match with corresponding flow key bit, while mask bit value '0' specifies + * a wildcarded match. Omitting attribute is treated as wildcarding all + * corresponding fields. Optional for all requests. If not present, + * all flow key bits are exact match bits. * * These attributes follow the &struct ovs_header within the Generic Netlink * payload for %OVS_FLOW_* commands. @@ -391,6 +397,7 @@ enum ovs_flow_attr { OVS_FLOW_ATTR_TCP_FLAGS, /* 8-bit OR'd TCP flags. */ OVS_FLOW_ATTR_USED, /* u64 msecs last used in monotonic time. */ OVS_FLOW_ATTR_CLEAR, /* Flag to clear stats, tcp_flags, used. */ + OVS_FLOW_ATTR_MASK, /* Sequence of OVS_KEY_ATTR_* attributes. */ __OVS_FLOW_ATTR_MAX }; diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c index ab101f715447..1f680222f4f7 100644 --- a/net/openvswitch/actions.c +++ b/net/openvswitch/actions.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2007-2012 Nicira, Inc. + * Copyright (c) 2007-2013 Nicira, Inc. * * This program is free software; you can redistribute it and/or * modify it under the terms of version 2 of the GNU General Public @@ -376,8 +376,10 @@ static int output_userspace(struct datapath *dp, struct sk_buff *skb, const struct nlattr *a; int rem; + BUG_ON(!OVS_CB(skb)->pkt_key); + upcall.cmd = OVS_PACKET_CMD_ACTION; - upcall.key = &OVS_CB(skb)->flow->key; + upcall.key = OVS_CB(skb)->pkt_key; upcall.userdata = NULL; upcall.portid = 0; diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index 9d97ef3c9830..d29cd9aa4a67 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2007-2012 Nicira, Inc. + * Copyright (c) 2007-2013 Nicira, Inc. * * This program is free software; you can redistribute it and/or * modify it under the terms of version 2 of the GNU General Public @@ -165,7 +165,7 @@ static void destroy_dp_rcu(struct rcu_head *rcu) { struct datapath *dp = container_of(rcu, struct datapath, rcu); - ovs_flow_tbl_destroy((__force struct flow_table *)dp->table); + ovs_flow_tbl_destroy((__force struct flow_table *)dp->table, false); free_percpu(dp->stats_percpu); release_net(ovs_dp_get_net(dp)); kfree(dp->ports); @@ -226,19 +226,18 @@ void ovs_dp_process_received_packet(struct vport *p, struct sk_buff *skb) struct sw_flow_key key; u64 *stats_counter; int error; - int key_len; stats = this_cpu_ptr(dp->stats_percpu); /* Extract flow from 'skb' into 'key'. */ - error = ovs_flow_extract(skb, p->port_no, &key, &key_len); + error = ovs_flow_extract(skb, p->port_no, &key); if (unlikely(error)) { kfree_skb(skb); return; } /* Look up flow. 
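With megaflows this is a masked lookup: each mask registered in the
flow table is applied to the extracted key in turn and the masked key
is hashed, so lookup cost scales with the number of distinct masks
rather than the number of installed flows.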
*/ - flow = ovs_flow_tbl_lookup(rcu_dereference(dp->table), &key, key_len); + flow = ovs_flow_lookup(rcu_dereference(dp->table), &key); if (unlikely(!flow)) { struct dp_upcall_info upcall; @@ -253,6 +252,7 @@ void ovs_dp_process_received_packet(struct vport *p, struct sk_buff *skb) } OVS_CB(skb)->flow = flow; + OVS_CB(skb)->pkt_key = &key; stats_counter = &stats->n_hit; ovs_flow_used(OVS_CB(skb)->flow, skb); @@ -435,7 +435,7 @@ static int queue_userspace_packet(struct net *net, int dp_ifindex, upcall->dp_ifindex = dp_ifindex; nla = nla_nest_start(user_skb, OVS_PACKET_ATTR_KEY); - ovs_flow_to_nlattrs(upcall_info->key, user_skb); + ovs_flow_to_nlattrs(upcall_info->key, upcall_info->key, user_skb); nla_nest_end(user_skb, nla); if (upcall_info->userdata) @@ -468,7 +468,7 @@ static int flush_flows(struct datapath *dp) rcu_assign_pointer(dp->table, new_table); - ovs_flow_tbl_deferred_destroy(old_table); + ovs_flow_tbl_destroy(old_table, true); return 0; } @@ -611,10 +611,12 @@ static int validate_tp_port(const struct sw_flow_key *flow_key) static int validate_and_copy_set_tun(const struct nlattr *attr, struct sw_flow_actions **sfa) { - struct ovs_key_ipv4_tunnel tun_key; + struct sw_flow_match match; + struct sw_flow_key key; int err, start; - err = ovs_ipv4_tun_from_nlattr(nla_data(attr), &tun_key); + ovs_match_init(&match, &key, NULL); + err = ovs_ipv4_tun_from_nlattr(nla_data(attr), &match, false); if (err) return err; @@ -622,7 +624,8 @@ static int validate_and_copy_set_tun(const struct nlattr *attr, if (start < 0) return start; - err = add_action(sfa, OVS_KEY_ATTR_IPV4_TUNNEL, &tun_key, sizeof(tun_key)); + err = add_action(sfa, OVS_KEY_ATTR_IPV4_TUNNEL, &match.key->tun_key, + sizeof(match.key->tun_key)); add_nested_action_end(*sfa, start); return err; @@ -857,7 +860,6 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info) struct ethhdr *eth; int len; int err; - int key_len; err = -EINVAL; if (!a[OVS_PACKET_ATTR_PACKET] || !a[OVS_PACKET_ATTR_KEY] || @@ -890,11 +892,11 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info) if (IS_ERR(flow)) goto err_kfree_skb; - err = ovs_flow_extract(packet, -1, &flow->key, &key_len); + err = ovs_flow_extract(packet, -1, &flow->key); if (err) goto err_flow_free; - err = ovs_flow_metadata_from_nlattrs(flow, key_len, a[OVS_PACKET_ATTR_KEY]); + err = ovs_flow_metadata_from_nlattrs(flow, a[OVS_PACKET_ATTR_KEY]); if (err) goto err_flow_free; acts = ovs_flow_actions_alloc(nla_len(a[OVS_PACKET_ATTR_ACTIONS])); @@ -908,6 +910,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info) goto err_flow_free; OVS_CB(packet)->flow = flow; + OVS_CB(packet)->pkt_key = &flow->key; packet->priority = flow->key.phy.priority; packet->mark = flow->key.phy.skb_mark; @@ -922,13 +925,13 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info) local_bh_enable(); rcu_read_unlock(); - ovs_flow_free(flow); + ovs_flow_free(flow, false); return err; err_unlock: rcu_read_unlock(); err_flow_free: - ovs_flow_free(flow); + ovs_flow_free(flow, false); err_kfree_skb: kfree_skb(packet); err: @@ -1045,7 +1048,8 @@ static int set_action_to_attr(const struct nlattr *a, struct sk_buff *skb) if (!start) return -EMSGSIZE; - err = ovs_ipv4_tun_to_nlattr(skb, nla_data(ovs_key)); + err = ovs_ipv4_tun_to_nlattr(skb, nla_data(ovs_key), + nla_data(ovs_key)); if (err) return err; nla_nest_end(skb, start); @@ -1093,6 +1097,7 @@ static size_t ovs_flow_cmd_msg_size(const struct sw_flow_actions *acts) { return 
NLMSG_ALIGN(sizeof(struct ovs_header)) + nla_total_size(key_attr_size()) /* OVS_FLOW_ATTR_KEY */ + + nla_total_size(key_attr_size()) /* OVS_FLOW_ATTR_MASK */ + nla_total_size(sizeof(struct ovs_flow_stats)) /* OVS_FLOW_ATTR_STATS */ + nla_total_size(1) /* OVS_FLOW_ATTR_TCP_FLAGS */ + nla_total_size(8) /* OVS_FLOW_ATTR_USED */ @@ -1119,12 +1124,25 @@ static int ovs_flow_cmd_fill_info(struct sw_flow *flow, struct datapath *dp, ovs_header->dp_ifindex = get_dpifindex(dp); + /* Fill flow key. */ nla = nla_nest_start(skb, OVS_FLOW_ATTR_KEY); if (!nla) goto nla_put_failure; - err = ovs_flow_to_nlattrs(&flow->key, skb); + + err = ovs_flow_to_nlattrs(&flow->unmasked_key, + &flow->unmasked_key, skb); + if (err) + goto error; + nla_nest_end(skb, nla); + + nla = nla_nest_start(skb, OVS_FLOW_ATTR_MASK); + if (!nla) + goto nla_put_failure; + + err = ovs_flow_to_nlattrs(&flow->key, &flow->mask->key, skb); if (err) goto error; + nla_nest_end(skb, nla); spin_lock_bh(&flow->lock); @@ -1214,20 +1232,24 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info) { struct nlattr **a = info->attrs; struct ovs_header *ovs_header = info->userhdr; - struct sw_flow_key key; - struct sw_flow *flow; + struct sw_flow_key key, masked_key; + struct sw_flow *flow = NULL; + struct sw_flow_mask mask; struct sk_buff *reply; struct datapath *dp; struct flow_table *table; struct sw_flow_actions *acts = NULL; + struct sw_flow_match match; int error; - int key_len; /* Extract key. */ error = -EINVAL; if (!a[OVS_FLOW_ATTR_KEY]) goto error; - error = ovs_flow_from_nlattrs(&key, &key_len, a[OVS_FLOW_ATTR_KEY]); + + ovs_match_init(&match, &key, &mask); + error = ovs_match_from_nlattrs(&match, + a[OVS_FLOW_ATTR_KEY], a[OVS_FLOW_ATTR_MASK]); if (error) goto error; @@ -1238,9 +1260,13 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info) if (IS_ERR(acts)) goto error; - error = validate_and_copy_actions(a[OVS_FLOW_ATTR_ACTIONS], &key, 0, &acts); - if (error) + ovs_flow_key_mask(&masked_key, &key, &mask); + error = validate_and_copy_actions(a[OVS_FLOW_ATTR_ACTIONS], + &masked_key, 0, &acts); + if (error) { + OVS_NLERR("Flow actions may not be safe on all matching packets.\n"); goto err_kfree; + } } else if (info->genlhdr->cmd == OVS_FLOW_CMD_NEW) { error = -EINVAL; goto error; @@ -1253,8 +1279,11 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info) goto err_unlock_ovs; table = ovsl_dereference(dp->table); - flow = ovs_flow_tbl_lookup(table, &key, key_len); + + /* Check if this is a duplicate flow */ + flow = ovs_flow_lookup(table, &key); if (!flow) { + struct sw_flow_mask *mask_p; /* Bail out if we're not allowed to create a new flow. */ error = -ENOENT; if (info->genlhdr->cmd == OVS_FLOW_CMD_SET) @@ -1267,7 +1296,7 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info) new_table = ovs_flow_tbl_expand(table); if (!IS_ERR(new_table)) { rcu_assign_pointer(dp->table, new_table); - ovs_flow_tbl_deferred_destroy(table); + ovs_flow_tbl_destroy(table, true); table = ovsl_dereference(dp->table); } } @@ -1280,14 +1309,30 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info) } clear_stats(flow); + flow->key = masked_key; + flow->unmasked_key = key; + + /* Make sure mask is unique in the system */ + mask_p = ovs_sw_flow_mask_find(table, &mask); + if (!mask_p) { + /* Allocate a new mask if none exsits. 
*/ + mask_p = ovs_sw_flow_mask_alloc(); + if (!mask_p) + goto err_flow_free; + mask_p->key = mask.key; + mask_p->range = mask.range; + ovs_sw_flow_mask_insert(table, mask_p); + } + + ovs_sw_flow_mask_add_ref(mask_p); + flow->mask = mask_p; rcu_assign_pointer(flow->sf_acts, acts); /* Put flow in bucket. */ - ovs_flow_tbl_insert(table, flow, &key, key_len); + ovs_flow_insert(table, flow); reply = ovs_flow_cmd_build_info(flow, dp, info->snd_portid, - info->snd_seq, - OVS_FLOW_CMD_NEW); + info->snd_seq, OVS_FLOW_CMD_NEW); } else { /* We found a matching flow. */ struct sw_flow_actions *old_acts; @@ -1303,6 +1348,13 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info) info->nlhdr->nlmsg_flags & (NLM_F_CREATE | NLM_F_EXCL)) goto err_unlock_ovs; + /* The unmasked key has to be the same for flow updates. */ + error = -EINVAL; + if (!ovs_flow_cmp_unmasked_key(flow, &key, match.range.end)) { + OVS_NLERR("Flow modification message rejected, unmasked key does not match.\n"); + goto err_unlock_ovs; + } + /* Update actions. */ old_acts = ovsl_dereference(flow->sf_acts); rcu_assign_pointer(flow->sf_acts, acts); @@ -1327,6 +1379,8 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info) ovs_dp_flow_multicast_group.id, PTR_ERR(reply)); return 0; +err_flow_free: + ovs_flow_free(flow, false); err_unlock_ovs: ovs_unlock(); err_kfree: @@ -1344,12 +1398,16 @@ static int ovs_flow_cmd_get(struct sk_buff *skb, struct genl_info *info) struct sw_flow *flow; struct datapath *dp; struct flow_table *table; + struct sw_flow_match match; int err; - int key_len; - if (!a[OVS_FLOW_ATTR_KEY]) + if (!a[OVS_FLOW_ATTR_KEY]) { + OVS_NLERR("Flow get message rejected, Key attribute missing.\n"); return -EINVAL; - err = ovs_flow_from_nlattrs(&key, &key_len, a[OVS_FLOW_ATTR_KEY]); + } + + ovs_match_init(&match, &key, NULL); + err = ovs_match_from_nlattrs(&match, a[OVS_FLOW_ATTR_KEY], NULL); if (err) return err; @@ -1361,7 +1419,7 @@ static int ovs_flow_cmd_get(struct sk_buff *skb, struct genl_info *info) } table = ovsl_dereference(dp->table); - flow = ovs_flow_tbl_lookup(table, &key, key_len); + flow = ovs_flow_lookup_unmasked_key(table, &match); if (!flow) { err = -ENOENT; goto unlock; @@ -1390,8 +1448,8 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info) struct sw_flow *flow; struct datapath *dp; struct flow_table *table; + struct sw_flow_match match; int err; - int key_len; ovs_lock(); dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex); @@ -1404,12 +1462,14 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info) err = flush_flows(dp); goto unlock; } - err = ovs_flow_from_nlattrs(&key, &key_len, a[OVS_FLOW_ATTR_KEY]); + + ovs_match_init(&match, &key, NULL); + err = ovs_match_from_nlattrs(&match, a[OVS_FLOW_ATTR_KEY], NULL); if (err) goto unlock; table = ovsl_dereference(dp->table); - flow = ovs_flow_tbl_lookup(table, &key, key_len); + flow = ovs_flow_lookup_unmasked_key(table, &match); if (!flow) { err = -ENOENT; goto unlock; @@ -1421,13 +1481,13 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info) goto unlock; } - ovs_flow_tbl_remove(table, flow); + ovs_flow_remove(table, flow); err = ovs_flow_cmd_fill_info(flow, dp, reply, info->snd_portid, info->snd_seq, 0, OVS_FLOW_CMD_DEL); BUG_ON(err < 0); - ovs_flow_deferred_free(flow); + ovs_flow_free(flow, true); ovs_unlock(); ovs_notify(reply, info, &ovs_dp_flow_multicast_group); @@ -1457,7 +1517,7 @@ static int ovs_flow_cmd_dump(struct sk_buff *skb, struct 
netlink_callback *cb) bucket = cb->args[0]; obj = cb->args[1]; - flow = ovs_flow_tbl_next(table, &bucket, &obj); + flow = ovs_flow_dump_next(table, &bucket, &obj); if (!flow) break; @@ -1680,7 +1740,7 @@ err_destroy_ports_array: err_destroy_percpu: free_percpu(dp->stats_percpu); err_destroy_table: - ovs_flow_tbl_destroy(ovsl_dereference(dp->table)); + ovs_flow_tbl_destroy(ovsl_dereference(dp->table), false); err_free_dp: release_net(ovs_dp_get_net(dp)); kfree(dp); @@ -2287,7 +2347,7 @@ static void rehash_flow_table(struct work_struct *work) new_table = ovs_flow_tbl_rehash(old_table); if (!IS_ERR(new_table)) { rcu_assign_pointer(dp->table, new_table); - ovs_flow_tbl_deferred_destroy(old_table); + ovs_flow_tbl_destroy(old_table, true); } } } diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h index a91486484916..4d109c176ef3 100644 --- a/net/openvswitch/datapath.h +++ b/net/openvswitch/datapath.h @@ -88,11 +88,13 @@ struct datapath { /** * struct ovs_skb_cb - OVS data in skb CB * @flow: The flow associated with this packet. May be %NULL if no flow. + * @pkt_key: The flow information extracted from the packet. Must be nonnull. * @tun_key: Key for the tunnel that encapsulated this packet. NULL if the * packet is not being tunneled. */ struct ovs_skb_cb { struct sw_flow *flow; + struct sw_flow_key *pkt_key; struct ovs_key_ipv4_tunnel *tun_key; }; #define OVS_CB(skb) ((struct ovs_skb_cb *)(skb)->cb) @@ -183,4 +185,8 @@ struct sk_buff *ovs_vport_cmd_build_info(struct vport *, u32 pid, u32 seq, int ovs_execute_actions(struct datapath *dp, struct sk_buff *skb); void ovs_dp_notify_wq(struct work_struct *work); + +#define OVS_NLERR(fmt, ...) \ + pr_info_once("netlink: " fmt, ##__VA_ARGS__) + #endif /* datapath.h */ diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c index fca282520cee..1fceb9653598 100644 --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2007-2011 Nicira, Inc. + * Copyright (c) 2007-2013 Nicira, Inc. 
* * This program is free software; you can redistribute it and/or * modify it under the terms of version 2 of the GNU General Public @@ -46,6 +46,184 @@ static struct kmem_cache *flow_cache; +static void ovs_sw_flow_mask_set(struct sw_flow_mask *mask, + struct sw_flow_key_range *range, u8 val); + +static void update_range__(struct sw_flow_match *match, + size_t offset, size_t size, bool is_mask) +{ + struct sw_flow_key_range *range = NULL; + size_t start = offset; + size_t end = offset + size; + + if (!is_mask) + range = &match->range; + else if (match->mask) + range = &match->mask->range; + + if (!range) + return; + + if (range->start == range->end) { + range->start = start; + range->end = end; + return; + } + + if (range->start > start) + range->start = start; + + if (range->end < end) + range->end = end; +} + +#define SW_FLOW_KEY_PUT(match, field, value, is_mask) \ + do { \ + update_range__(match, offsetof(struct sw_flow_key, field), \ + sizeof((match)->key->field), is_mask); \ + if (is_mask) { \ + if ((match)->mask) \ + (match)->mask->key.field = value; \ + } else { \ + (match)->key->field = value; \ + } \ + } while (0) + +#define SW_FLOW_KEY_MEMCPY(match, field, value_p, len, is_mask) \ + do { \ + update_range__(match, offsetof(struct sw_flow_key, field), \ + len, is_mask); \ + if (is_mask) { \ + if ((match)->mask) \ + memcpy(&(match)->mask->key.field, value_p, len);\ + } else { \ + memcpy(&(match)->key->field, value_p, len); \ + } \ + } while (0) + +void ovs_match_init(struct sw_flow_match *match, + struct sw_flow_key *key, + struct sw_flow_mask *mask) +{ + memset(match, 0, sizeof(*match)); + match->key = key; + match->mask = mask; + + memset(key, 0, sizeof(*key)); + + if (mask) { + memset(&mask->key, 0, sizeof(mask->key)); + mask->range.start = mask->range.end = 0; + } +} + +static bool ovs_match_validate(const struct sw_flow_match *match, + u64 key_attrs, u64 mask_attrs) +{ + u64 key_expected = 1 << OVS_KEY_ATTR_ETHERNET; + u64 mask_allowed = key_attrs; /* At most allow all key attributes */ + + /* The following mask attributes allowed only if they + * pass the validation tests. */ + mask_allowed &= ~((1 << OVS_KEY_ATTR_IPV4) + | (1 << OVS_KEY_ATTR_IPV6) + | (1 << OVS_KEY_ATTR_TCP) + | (1 << OVS_KEY_ATTR_UDP) + | (1 << OVS_KEY_ATTR_ICMP) + | (1 << OVS_KEY_ATTR_ICMPV6) + | (1 << OVS_KEY_ATTR_ARP) + | (1 << OVS_KEY_ATTR_ND)); + + /* Always allowed mask fields. */ + mask_allowed |= ((1 << OVS_KEY_ATTR_TUNNEL) + | (1 << OVS_KEY_ATTR_IN_PORT) + | (1 << OVS_KEY_ATTR_ETHERTYPE)); + + /* Check key attributes. 
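+ * Per EtherType, key_expected grows with the attributes a packet of that type must carry, and mask_allowed widens only when the field selecting the next layer is matched exactly. For example, a non-fragment IPv4 TCP flow must supply OVS_KEY_ATTR_IPV4 and OVS_KEY_ATTR_TCP, and its TCP mask is honoured only if the ip.proto mask is 0xff.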
*/ + if (match->key->eth.type == htons(ETH_P_ARP) + || match->key->eth.type == htons(ETH_P_RARP)) { + key_expected |= 1 << OVS_KEY_ATTR_ARP; + if (match->mask && (match->mask->key.eth.type == htons(0xffff))) + mask_allowed |= 1 << OVS_KEY_ATTR_ARP; + } + + if (match->key->eth.type == htons(ETH_P_IP)) { + key_expected |= 1 << OVS_KEY_ATTR_IPV4; + if (match->mask && (match->mask->key.eth.type == htons(0xffff))) + mask_allowed |= 1 << OVS_KEY_ATTR_IPV4; + + if (match->key->ip.frag != OVS_FRAG_TYPE_LATER) { + if (match->key->ip.proto == IPPROTO_UDP) { + key_expected |= 1 << OVS_KEY_ATTR_UDP; + if (match->mask && (match->mask->key.ip.proto == 0xff)) + mask_allowed |= 1 << OVS_KEY_ATTR_UDP; + } + + if (match->key->ip.proto == IPPROTO_TCP) { + key_expected |= 1 << OVS_KEY_ATTR_TCP; + if (match->mask && (match->mask->key.ip.proto == 0xff)) + mask_allowed |= 1 << OVS_KEY_ATTR_TCP; + } + + if (match->key->ip.proto == IPPROTO_ICMP) { + key_expected |= 1 << OVS_KEY_ATTR_ICMP; + if (match->mask && (match->mask->key.ip.proto == 0xff)) + mask_allowed |= 1 << OVS_KEY_ATTR_ICMP; + } + } + } + + if (match->key->eth.type == htons(ETH_P_IPV6)) { + key_expected |= 1 << OVS_KEY_ATTR_IPV6; + if (match->mask && (match->mask->key.eth.type == htons(0xffff))) + mask_allowed |= 1 << OVS_KEY_ATTR_IPV6; + + if (match->key->ip.frag != OVS_FRAG_TYPE_LATER) { + if (match->key->ip.proto == IPPROTO_UDP) { + key_expected |= 1 << OVS_KEY_ATTR_UDP; + if (match->mask && (match->mask->key.ip.proto == 0xff)) + mask_allowed |= 1 << OVS_KEY_ATTR_UDP; + } + + if (match->key->ip.proto == IPPROTO_TCP) { + key_expected |= 1 << OVS_KEY_ATTR_TCP; + if (match->mask && (match->mask->key.ip.proto == 0xff)) + mask_allowed |= 1 << OVS_KEY_ATTR_TCP; + } + + if (match->key->ip.proto == IPPROTO_ICMPV6) { + key_expected |= 1 << OVS_KEY_ATTR_ICMPV6; + if (match->mask && (match->mask->key.ip.proto == 0xff)) + mask_allowed |= 1 << OVS_KEY_ATTR_ICMPV6; + + if (match->key->ipv6.tp.src == + htons(NDISC_NEIGHBOUR_SOLICITATION) || + match->key->ipv6.tp.src == htons(NDISC_NEIGHBOUR_ADVERTISEMENT)) { + key_expected |= 1 << OVS_KEY_ATTR_ND; + if (match->mask && (match->mask->key.ipv6.tp.src == htons(0xffff))) + mask_allowed |= 1 << OVS_KEY_ATTR_ND; + } + } + } + } + + if ((key_attrs & key_expected) != key_expected) { + /* Key attributes check failed. */ + OVS_NLERR("Missing expected key attributes (key_attrs=%llx, expected=%llx).\n", + key_attrs, key_expected); + return false; + } + + if ((mask_attrs & mask_allowed) != mask_attrs) { + /* Mask attributes check failed. 
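+ * mask_attrs must be a subset of mask_allowed; any bit outside it means userspace asked to mask a field the checks above did not authorize, e.g. a TCP port mask paired with a wildcarded ip.proto.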
*/ + OVS_NLERR("Mask contains more fields than allowed (mask_attrs=%llx, mask_allowed=%llx).\n", + mask_attrs, mask_allowed); + return false; + } + + return true; +} + static int check_header(struct sk_buff *skb, int len) { if (unlikely(skb->len < len)) @@ -121,12 +299,7 @@ u64 ovs_flow_used_time(unsigned long flow_jiffies) return cur_ms - idle_ms; } -#define SW_FLOW_KEY_OFFSET(field) \ - (offsetof(struct sw_flow_key, field) + \ - FIELD_SIZEOF(struct sw_flow_key, field)) - -static int parse_ipv6hdr(struct sk_buff *skb, struct sw_flow_key *key, - int *key_lenp) +static int parse_ipv6hdr(struct sk_buff *skb, struct sw_flow_key *key) { unsigned int nh_ofs = skb_network_offset(skb); unsigned int nh_len; @@ -136,8 +309,6 @@ static int parse_ipv6hdr(struct sk_buff *skb, struct sw_flow_key *key, __be16 frag_off; int err; - *key_lenp = SW_FLOW_KEY_OFFSET(ipv6.label); - err = check_header(skb, nh_ofs + sizeof(*nh)); if (unlikely(err)) return err; @@ -176,6 +347,21 @@ static bool icmp6hdr_ok(struct sk_buff *skb) sizeof(struct icmp6hdr)); } +void ovs_flow_key_mask(struct sw_flow_key *dst, const struct sw_flow_key *src, + const struct sw_flow_mask *mask) +{ + u8 *m = (u8 *)&mask->key + mask->range.start; + u8 *s = (u8 *)src + mask->range.start; + u8 *d = (u8 *)dst + mask->range.start; + int i; + + memset(dst, 0, sizeof(*dst)); + for (i = 0; i < ovs_sw_flow_mask_size_roundup(mask); i++) { + *d = *s & *m; + d++, s++, m++; + } +} + #define TCP_FLAGS_OFFSET 13 #define TCP_FLAG_MASK 0x3f @@ -224,6 +410,7 @@ struct sw_flow *ovs_flow_alloc(void) spin_lock_init(&flow->lock); flow->sf_acts = NULL; + flow->mask = NULL; return flow; } @@ -263,7 +450,7 @@ static void free_buckets(struct flex_array *buckets) flex_array_free(buckets); } -struct flow_table *ovs_flow_tbl_alloc(int new_size) +static struct flow_table *__flow_tbl_alloc(int new_size) { struct flow_table *table = kmalloc(sizeof(*table), GFP_KERNEL); @@ -281,17 +468,15 @@ struct flow_table *ovs_flow_tbl_alloc(int new_size) table->node_ver = 0; table->keep_flows = false; get_random_bytes(&table->hash_seed, sizeof(u32)); + table->mask_list = NULL; return table; } -void ovs_flow_tbl_destroy(struct flow_table *table) +static void __flow_tbl_destroy(struct flow_table *table) { int i; - if (!table) - return; - if (table->keep_flows) goto skip_flows; @@ -303,31 +488,55 @@ void ovs_flow_tbl_destroy(struct flow_table *table) hlist_for_each_entry_safe(flow, n, head, hash_node[ver]) { hlist_del(&flow->hash_node[ver]); - ovs_flow_free(flow); + ovs_flow_free(flow, false); } } + BUG_ON(!list_empty(table->mask_list)); + kfree(table->mask_list); + skip_flows: free_buckets(table->buckets); kfree(table); } +struct flow_table *ovs_flow_tbl_alloc(int new_size) +{ + struct flow_table *table = __flow_tbl_alloc(new_size); + + if (!table) + return NULL; + + table->mask_list = kmalloc(sizeof(struct list_head), GFP_KERNEL); + if (!table->mask_list) { + table->keep_flows = true; + __flow_tbl_destroy(table); + return NULL; + } + INIT_LIST_HEAD(table->mask_list); + + return table; +} + static void flow_tbl_destroy_rcu_cb(struct rcu_head *rcu) { struct flow_table *table = container_of(rcu, struct flow_table, rcu); - ovs_flow_tbl_destroy(table); + __flow_tbl_destroy(table); } -void ovs_flow_tbl_deferred_destroy(struct flow_table *table) +void ovs_flow_tbl_destroy(struct flow_table *table, bool deferred) { if (!table) return; - call_rcu(&table->rcu, flow_tbl_destroy_rcu_cb); + if (deferred) + call_rcu(&table->rcu, flow_tbl_destroy_rcu_cb); + else + __flow_tbl_destroy(table); } -struct 
sw_flow *ovs_flow_tbl_next(struct flow_table *table, u32 *bucket, u32 *last) +struct sw_flow *ovs_flow_dump_next(struct flow_table *table, u32 *bucket, u32 *last) { struct sw_flow *flow; struct hlist_head *head; @@ -353,11 +562,13 @@ struct sw_flow *ovs_flow_tbl_next(struct flow_table *table, u32 *bucket, u32 *la return NULL; } -static void __flow_tbl_insert(struct flow_table *table, struct sw_flow *flow) +static void __tbl_insert(struct flow_table *table, struct sw_flow *flow) { struct hlist_head *head; + head = find_bucket(table, flow->hash); hlist_add_head_rcu(&flow->hash_node[table->node_ver], head); + table->count++; } @@ -377,8 +588,10 @@ static void flow_table_copy_flows(struct flow_table *old, struct flow_table *new head = flex_array_get(old->buckets, i); hlist_for_each_entry(flow, head, hash_node[old_ver]) - __flow_tbl_insert(new, flow); + __tbl_insert(new, flow); } + + new->mask_list = old->mask_list; old->keep_flows = true; } @@ -386,7 +599,7 @@ static struct flow_table *__flow_tbl_rehash(struct flow_table *table, int n_buck { struct flow_table *new_table; - new_table = ovs_flow_tbl_alloc(n_buckets); + new_table = __flow_tbl_alloc(n_buckets); if (!new_table) return ERR_PTR(-ENOMEM); @@ -405,28 +618,30 @@ struct flow_table *ovs_flow_tbl_expand(struct flow_table *table) return __flow_tbl_rehash(table, table->n_buckets * 2); } -void ovs_flow_free(struct sw_flow *flow) +static void __flow_free(struct sw_flow *flow) { - if (unlikely(!flow)) - return; - kfree((struct sf_flow_acts __force *)flow->sf_acts); kmem_cache_free(flow_cache, flow); } -/* RCU callback used by ovs_flow_deferred_free. */ static void rcu_free_flow_callback(struct rcu_head *rcu) { struct sw_flow *flow = container_of(rcu, struct sw_flow, rcu); - ovs_flow_free(flow); + __flow_free(flow); } -/* Schedules 'flow' to be freed after the next RCU grace period. - * The caller must hold rcu_read_lock for this to be sensible. */ -void ovs_flow_deferred_free(struct sw_flow *flow) +void ovs_flow_free(struct sw_flow *flow, bool deferred) { - call_rcu(&flow->rcu, rcu_free_flow_callback); + if (!flow) + return; + + ovs_sw_flow_mask_del_ref(flow->mask, deferred); + + if (deferred) + call_rcu(&flow->rcu, rcu_free_flow_callback); + else + __flow_free(flow); } /* Schedules 'sf_acts' to be freed after the next RCU grace period. @@ -497,18 +712,15 @@ static __be16 parse_ethertype(struct sk_buff *skb) } static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key, - int *key_lenp, int nh_len) + int nh_len) { struct icmp6hdr *icmp = icmp6_hdr(skb); - int error = 0; - int key_len; /* The ICMPv6 type and code fields use the 16-bit transport port * fields, so we need to store them in 16-bit network byte order. */ key->ipv6.tp.src = htons(icmp->icmp6_type); key->ipv6.tp.dst = htons(icmp->icmp6_code); - key_len = SW_FLOW_KEY_OFFSET(ipv6.tp); if (icmp->icmp6_code == 0 && (icmp->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION || @@ -517,21 +729,17 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key, struct nd_msg *nd; int offset; - key_len = SW_FLOW_KEY_OFFSET(ipv6.nd); - /* In order to process neighbor discovery options, we need the * entire packet. 
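 * skb_linearize() pulls any paged data into the linear area so that nd and its options can be walked with plain pointer arithmetic. A payload too short for a struct nd_msg is not an error: the packet is simply keyed without ND fields (the return 0 below).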
*/ if (unlikely(icmp_len < sizeof(*nd))) - goto out; - if (unlikely(skb_linearize(skb))) { - error = -ENOMEM; - goto out; - } + return 0; + + if (unlikely(skb_linearize(skb))) + return -ENOMEM; nd = (struct nd_msg *)skb_transport_header(skb); key->ipv6.nd.target = nd->target; - key_len = SW_FLOW_KEY_OFFSET(ipv6.nd); icmp_len -= sizeof(*nd); offset = 0; @@ -541,7 +749,7 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key, int opt_len = nd_opt->nd_opt_len * 8; if (unlikely(!opt_len || opt_len > icmp_len)) - goto invalid; + return 0; /* Store the link layer address if the appropriate * option is provided. It is considered an error if @@ -566,16 +774,14 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key, } } - goto out; + return 0; invalid: memset(&key->ipv6.nd.target, 0, sizeof(key->ipv6.nd.target)); memset(key->ipv6.nd.sll, 0, sizeof(key->ipv6.nd.sll)); memset(key->ipv6.nd.tll, 0, sizeof(key->ipv6.nd.tll)); -out: - *key_lenp = key_len; - return error; + return 0; } /** @@ -584,7 +790,6 @@ out: * Ethernet header * @in_port: port number on which @skb was received. * @key: output flow key - * @key_lenp: length of output flow key * * The caller must ensure that skb->len >= ETH_HLEN. * @@ -602,11 +807,9 @@ out: * of a correct length, otherwise the same as skb->network_header. * For other key->eth.type values it is left untouched. */ -int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key, - int *key_lenp) +int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key) { - int error = 0; - int key_len = SW_FLOW_KEY_OFFSET(eth); + int error; struct ethhdr *eth; memset(key, 0, sizeof(*key)); @@ -649,15 +852,13 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key, struct iphdr *nh; __be16 offset; - key_len = SW_FLOW_KEY_OFFSET(ipv4.addr); - error = check_iphdr(skb); if (unlikely(error)) { if (error == -EINVAL) { skb->transport_header = skb->network_header; error = 0; } - goto out; + return error; } nh = ip_hdr(skb); @@ -671,7 +872,7 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key, offset = nh->frag_off & htons(IP_OFFSET); if (offset) { key->ip.frag = OVS_FRAG_TYPE_LATER; - goto out; + return 0; } if (nh->frag_off & htons(IP_MF) || skb_shinfo(skb)->gso_type & SKB_GSO_UDP) @@ -679,21 +880,18 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key, /* Transport layer. 
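 * Port numbers (or ICMP type/code) are filled in only when the complete transport header is present (tcphdr_ok() and friends); otherwise they keep the zeroes from the memset of the key at the top of this function.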
*/ if (key->ip.proto == IPPROTO_TCP) { - key_len = SW_FLOW_KEY_OFFSET(ipv4.tp); if (tcphdr_ok(skb)) { struct tcphdr *tcp = tcp_hdr(skb); key->ipv4.tp.src = tcp->source; key->ipv4.tp.dst = tcp->dest; } } else if (key->ip.proto == IPPROTO_UDP) { - key_len = SW_FLOW_KEY_OFFSET(ipv4.tp); if (udphdr_ok(skb)) { struct udphdr *udp = udp_hdr(skb); key->ipv4.tp.src = udp->source; key->ipv4.tp.dst = udp->dest; } } else if (key->ip.proto == IPPROTO_ICMP) { - key_len = SW_FLOW_KEY_OFFSET(ipv4.tp); if (icmphdr_ok(skb)) { struct icmphdr *icmp = icmp_hdr(skb); /* The ICMP type and code fields use the 16-bit @@ -722,53 +920,49 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key, memcpy(&key->ipv4.addr.dst, arp->ar_tip, sizeof(key->ipv4.addr.dst)); memcpy(key->ipv4.arp.sha, arp->ar_sha, ETH_ALEN); memcpy(key->ipv4.arp.tha, arp->ar_tha, ETH_ALEN); - key_len = SW_FLOW_KEY_OFFSET(ipv4.arp); } } else if (key->eth.type == htons(ETH_P_IPV6)) { int nh_len; /* IPv6 Header + Extensions */ - nh_len = parse_ipv6hdr(skb, key, &key_len); + nh_len = parse_ipv6hdr(skb, key); if (unlikely(nh_len < 0)) { - if (nh_len == -EINVAL) + if (nh_len == -EINVAL) { skb->transport_header = skb->network_header; - else + error = 0; + } else { error = nh_len; - goto out; + } + return error; } if (key->ip.frag == OVS_FRAG_TYPE_LATER) - goto out; + return 0; if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP) key->ip.frag = OVS_FRAG_TYPE_FIRST; /* Transport layer. */ if (key->ip.proto == NEXTHDR_TCP) { - key_len = SW_FLOW_KEY_OFFSET(ipv6.tp); if (tcphdr_ok(skb)) { struct tcphdr *tcp = tcp_hdr(skb); key->ipv6.tp.src = tcp->source; key->ipv6.tp.dst = tcp->dest; } } else if (key->ip.proto == NEXTHDR_UDP) { - key_len = SW_FLOW_KEY_OFFSET(ipv6.tp); if (udphdr_ok(skb)) { struct udphdr *udp = udp_hdr(skb); key->ipv6.tp.src = udp->source; key->ipv6.tp.dst = udp->dest; } } else if (key->ip.proto == NEXTHDR_ICMP) { - key_len = SW_FLOW_KEY_OFFSET(ipv6.tp); if (icmp6hdr_ok(skb)) { - error = parse_icmpv6(skb, key, &key_len, nh_len); - if (error < 0) - goto out; + error = parse_icmpv6(skb, key, nh_len); + if (error) + return error; } } } -out: - *key_lenp = key_len; - return error; + return 0; } static u32 ovs_flow_hash(const struct sw_flow_key *key, int key_start, int key_len) @@ -777,7 +971,7 @@ static u32 ovs_flow_hash(const struct sw_flow_key *key, int key_start, int key_l DIV_ROUND_UP(key_len - key_start, sizeof(u32)), 0); } -static int flow_key_start(struct sw_flow_key *key) +static int flow_key_start(const struct sw_flow_key *key) { if (key->tun_key.ipv4_dst) return 0; @@ -785,39 +979,95 @@ static int flow_key_start(struct sw_flow_key *key) return offsetof(struct sw_flow_key, phy); } -struct sw_flow *ovs_flow_tbl_lookup(struct flow_table *table, - struct sw_flow_key *key, int key_len) +static bool __cmp_key(const struct sw_flow_key *key1, + const struct sw_flow_key *key2, int key_start, int key_len) +{ + return !memcmp((u8 *)key1 + key_start, + (u8 *)key2 + key_start, (key_len - key_start)); +} + +static bool __flow_cmp_key(const struct sw_flow *flow, + const struct sw_flow_key *key, int key_start, int key_len) +{ + return __cmp_key(&flow->key, key, key_start, key_len); +} + +static bool __flow_cmp_unmasked_key(const struct sw_flow *flow, + const struct sw_flow_key *key, int key_start, int key_len) +{ + return __cmp_key(&flow->unmasked_key, key, key_start, key_len); +} + +bool ovs_flow_cmp_unmasked_key(const struct sw_flow *flow, + const struct sw_flow_key *key, int key_len) +{ + int key_start; + key_start = 
flow_key_start(key); + + return __flow_cmp_unmasked_key(flow, key, key_start, key_len); + +} + +struct sw_flow *ovs_flow_lookup_unmasked_key(struct flow_table *table, + struct sw_flow_match *match) +{ + struct sw_flow_key *unmasked = match->key; + int key_len = match->range.end; + struct sw_flow *flow; + + flow = ovs_flow_lookup(table, unmasked); + if (flow && (!ovs_flow_cmp_unmasked_key(flow, unmasked, key_len))) + flow = NULL; + + return flow; +} + +static struct sw_flow *ovs_masked_flow_lookup(struct flow_table *table, + const struct sw_flow_key *flow_key, + struct sw_flow_mask *mask) { struct sw_flow *flow; struct hlist_head *head; - u8 *_key; - int key_start; + int key_start = mask->range.start; + int key_len = mask->range.end; u32 hash; + struct sw_flow_key masked_key; - key_start = flow_key_start(key); - hash = ovs_flow_hash(key, key_start, key_len); - - _key = (u8 *) key + key_start; + ovs_flow_key_mask(&masked_key, flow_key, mask); + hash = ovs_flow_hash(&masked_key, key_start, key_len); head = find_bucket(table, hash); hlist_for_each_entry_rcu(flow, head, hash_node[table->node_ver]) { - - if (flow->hash == hash && - !memcmp((u8 *)&flow->key + key_start, _key, key_len - key_start)) { + if (flow->mask == mask && + __flow_cmp_key(flow, &masked_key, key_start, key_len)) return flow; - } } return NULL; } -void ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow, - struct sw_flow_key *key, int key_len) +struct sw_flow *ovs_flow_lookup(struct flow_table *tbl, + const struct sw_flow_key *key) { - flow->hash = ovs_flow_hash(key, flow_key_start(key), key_len); - memcpy(&flow->key, key, sizeof(flow->key)); - __flow_tbl_insert(table, flow); + struct sw_flow *flow = NULL; + struct sw_flow_mask *mask; + + list_for_each_entry_rcu(mask, tbl->mask_list, list) { + flow = ovs_masked_flow_lookup(tbl, key, mask); + if (flow) /* Found */ + break; + } + + return flow; } -void ovs_flow_tbl_remove(struct flow_table *table, struct sw_flow *flow) + +void ovs_flow_insert(struct flow_table *table, struct sw_flow *flow) +{ + flow->hash = ovs_flow_hash(&flow->key, flow->mask->range.start, + flow->mask->range.end); + __tbl_insert(table, flow); +} + +void ovs_flow_remove(struct flow_table *table, struct sw_flow *flow) { BUG_ON(table->count == 0); hlist_del_rcu(&flow->hash_node[table->node_ver]); @@ -844,149 +1094,84 @@ const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = { [OVS_KEY_ATTR_TUNNEL] = -1, }; -static int ipv4_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_len, - const struct nlattr *a[], u32 *attrs) -{ - const struct ovs_key_icmp *icmp_key; - const struct ovs_key_tcp *tcp_key; - const struct ovs_key_udp *udp_key; - - switch (swkey->ip.proto) { - case IPPROTO_TCP: - if (!(*attrs & (1 << OVS_KEY_ATTR_TCP))) - return -EINVAL; - *attrs &= ~(1 << OVS_KEY_ATTR_TCP); - - *key_len = SW_FLOW_KEY_OFFSET(ipv4.tp); - tcp_key = nla_data(a[OVS_KEY_ATTR_TCP]); - swkey->ipv4.tp.src = tcp_key->tcp_src; - swkey->ipv4.tp.dst = tcp_key->tcp_dst; - break; - - case IPPROTO_UDP: - if (!(*attrs & (1 << OVS_KEY_ATTR_UDP))) - return -EINVAL; - *attrs &= ~(1 << OVS_KEY_ATTR_UDP); - - *key_len = SW_FLOW_KEY_OFFSET(ipv4.tp); - udp_key = nla_data(a[OVS_KEY_ATTR_UDP]); - swkey->ipv4.tp.src = udp_key->udp_src; - swkey->ipv4.tp.dst = udp_key->udp_dst; - break; - - case IPPROTO_ICMP: - if (!(*attrs & (1 << OVS_KEY_ATTR_ICMP))) - return -EINVAL; - *attrs &= ~(1 << OVS_KEY_ATTR_ICMP); - - *key_len = SW_FLOW_KEY_OFFSET(ipv4.tp); - icmp_key = nla_data(a[OVS_KEY_ATTR_ICMP]); - swkey->ipv4.tp.src = 
htons(icmp_key->icmp_type); - swkey->ipv4.tp.dst = htons(icmp_key->icmp_code); - break; - } - - return 0; -} - -static int ipv6_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_len, - const struct nlattr *a[], u32 *attrs) +static bool is_all_zero(const u8 *fp, size_t size) { - const struct ovs_key_icmpv6 *icmpv6_key; - const struct ovs_key_tcp *tcp_key; - const struct ovs_key_udp *udp_key; - - switch (swkey->ip.proto) { - case IPPROTO_TCP: - if (!(*attrs & (1 << OVS_KEY_ATTR_TCP))) - return -EINVAL; - *attrs &= ~(1 << OVS_KEY_ATTR_TCP); - - *key_len = SW_FLOW_KEY_OFFSET(ipv6.tp); - tcp_key = nla_data(a[OVS_KEY_ATTR_TCP]); - swkey->ipv6.tp.src = tcp_key->tcp_src; - swkey->ipv6.tp.dst = tcp_key->tcp_dst; - break; - - case IPPROTO_UDP: - if (!(*attrs & (1 << OVS_KEY_ATTR_UDP))) - return -EINVAL; - *attrs &= ~(1 << OVS_KEY_ATTR_UDP); - - *key_len = SW_FLOW_KEY_OFFSET(ipv6.tp); - udp_key = nla_data(a[OVS_KEY_ATTR_UDP]); - swkey->ipv6.tp.src = udp_key->udp_src; - swkey->ipv6.tp.dst = udp_key->udp_dst; - break; - - case IPPROTO_ICMPV6: - if (!(*attrs & (1 << OVS_KEY_ATTR_ICMPV6))) - return -EINVAL; - *attrs &= ~(1 << OVS_KEY_ATTR_ICMPV6); - - *key_len = SW_FLOW_KEY_OFFSET(ipv6.tp); - icmpv6_key = nla_data(a[OVS_KEY_ATTR_ICMPV6]); - swkey->ipv6.tp.src = htons(icmpv6_key->icmpv6_type); - swkey->ipv6.tp.dst = htons(icmpv6_key->icmpv6_code); + int i; - if (swkey->ipv6.tp.src == htons(NDISC_NEIGHBOUR_SOLICITATION) || - swkey->ipv6.tp.src == htons(NDISC_NEIGHBOUR_ADVERTISEMENT)) { - const struct ovs_key_nd *nd_key; + if (!fp) + return false; - if (!(*attrs & (1 << OVS_KEY_ATTR_ND))) - return -EINVAL; - *attrs &= ~(1 << OVS_KEY_ATTR_ND); - - *key_len = SW_FLOW_KEY_OFFSET(ipv6.nd); - nd_key = nla_data(a[OVS_KEY_ATTR_ND]); - memcpy(&swkey->ipv6.nd.target, nd_key->nd_target, - sizeof(swkey->ipv6.nd.target)); - memcpy(swkey->ipv6.nd.sll, nd_key->nd_sll, ETH_ALEN); - memcpy(swkey->ipv6.nd.tll, nd_key->nd_tll, ETH_ALEN); - } - break; - } + for (i = 0; i < size; i++) + if (fp[i]) + return false; - return 0; + return true; } -static int parse_flow_nlattrs(const struct nlattr *attr, - const struct nlattr *a[], u32 *attrsp) +static int __parse_flow_nlattrs(const struct nlattr *attr, + const struct nlattr *a[], + u64 *attrsp, bool nz) { const struct nlattr *nla; u32 attrs; int rem; - attrs = 0; + attrs = *attrsp; nla_for_each_nested(nla, attr, rem) { u16 type = nla_type(nla); int expected_len; - if (type > OVS_KEY_ATTR_MAX || attrs & (1 << type)) + if (type > OVS_KEY_ATTR_MAX) { + OVS_NLERR("Unknown key attribute (type=%d, max=%d).\n", + type, OVS_KEY_ATTR_MAX); + } + + if (attrs & (1 << type)) { + OVS_NLERR("Duplicate key attribute (type %d).\n", type); return -EINVAL; + } expected_len = ovs_key_lens[type]; - if (nla_len(nla) != expected_len && expected_len != -1) + if (nla_len(nla) != expected_len && expected_len != -1) { + OVS_NLERR("Key attribute has unexpected length (type=%d" + ", length=%d, expected=%d).\n", type, + nla_len(nla), expected_len); return -EINVAL; + } - attrs |= 1 << type; - a[type] = nla; + if (!nz || !is_all_zero(nla_data(nla), expected_len)) { + attrs |= 1 << type; + a[type] = nla; + } } - if (rem) + if (rem) { + OVS_NLERR("Message has %d unknown bytes.\n", rem); return -EINVAL; + } *attrsp = attrs; return 0; } +static int parse_flow_mask_nlattrs(const struct nlattr *attr, + const struct nlattr *a[], u64 *attrsp) +{ + return __parse_flow_nlattrs(attr, a, attrsp, true); +} + +static int parse_flow_nlattrs(const struct nlattr *attr, + const struct nlattr *a[], u64 *attrsp) +{ + return 
__parse_flow_nlattrs(attr, a, attrsp, false); +} + int ovs_ipv4_tun_from_nlattr(const struct nlattr *attr, - struct ovs_key_ipv4_tunnel *tun_key) + struct sw_flow_match *match, bool is_mask) { struct nlattr *a; int rem; bool ttl = false; - - memset(tun_key, 0, sizeof(*tun_key)); + __be16 tun_flags = 0; nla_for_each_nested(a, attr, rem) { int type = nla_type(a); @@ -1000,53 +1185,78 @@ int ovs_ipv4_tun_from_nlattr(const struct nlattr *attr, [OVS_TUNNEL_KEY_ATTR_CSUM] = 0, }; - if (type > OVS_TUNNEL_KEY_ATTR_MAX || - ovs_tunnel_key_lens[type] != nla_len(a)) + if (type > OVS_TUNNEL_KEY_ATTR_MAX) { + OVS_NLERR("Unknown IPv4 tunnel attribute (type=%d, max=%d).\n", + type, OVS_TUNNEL_KEY_ATTR_MAX); return -EINVAL; + } + + if (ovs_tunnel_key_lens[type] != nla_len(a)) { + OVS_NLERR("IPv4 tunnel attribute type has unexpected " + " length (type=%d, length=%d, expected=%d).\n", + type, nla_len(a), ovs_tunnel_key_lens[type]); + return -EINVAL; + } switch (type) { case OVS_TUNNEL_KEY_ATTR_ID: - tun_key->tun_id = nla_get_be64(a); - tun_key->tun_flags |= TUNNEL_KEY; + SW_FLOW_KEY_PUT(match, tun_key.tun_id, + nla_get_be64(a), is_mask); + tun_flags |= TUNNEL_KEY; break; case OVS_TUNNEL_KEY_ATTR_IPV4_SRC: - tun_key->ipv4_src = nla_get_be32(a); + SW_FLOW_KEY_PUT(match, tun_key.ipv4_src, + nla_get_be32(a), is_mask); break; case OVS_TUNNEL_KEY_ATTR_IPV4_DST: - tun_key->ipv4_dst = nla_get_be32(a); + SW_FLOW_KEY_PUT(match, tun_key.ipv4_dst, + nla_get_be32(a), is_mask); break; case OVS_TUNNEL_KEY_ATTR_TOS: - tun_key->ipv4_tos = nla_get_u8(a); + SW_FLOW_KEY_PUT(match, tun_key.ipv4_tos, + nla_get_u8(a), is_mask); break; case OVS_TUNNEL_KEY_ATTR_TTL: - tun_key->ipv4_ttl = nla_get_u8(a); + SW_FLOW_KEY_PUT(match, tun_key.ipv4_ttl, + nla_get_u8(a), is_mask); ttl = true; break; case OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT: - tun_key->tun_flags |= TUNNEL_DONT_FRAGMENT; + tun_flags |= TUNNEL_DONT_FRAGMENT; break; case OVS_TUNNEL_KEY_ATTR_CSUM: - tun_key->tun_flags |= TUNNEL_CSUM; + tun_flags |= TUNNEL_CSUM; break; default: return -EINVAL; - } } - if (rem > 0) - return -EINVAL; - if (!tun_key->ipv4_dst) - return -EINVAL; + SW_FLOW_KEY_PUT(match, tun_key.tun_flags, tun_flags, is_mask); - if (!ttl) + if (rem > 0) { + OVS_NLERR("IPv4 tunnel attribute has %d unknown bytes.\n", rem); return -EINVAL; + } + + if (!is_mask) { + if (!match->key->tun_key.ipv4_dst) { + OVS_NLERR("IPv4 tunnel destination address is zero.\n"); + return -EINVAL; + } + + if (!ttl) { + OVS_NLERR("IPv4 tunnel TTL not specified.\n"); + return -EINVAL; + } + } return 0; } int ovs_ipv4_tun_to_nlattr(struct sk_buff *skb, - const struct ovs_key_ipv4_tunnel *tun_key) + const struct ovs_key_ipv4_tunnel *tun_key, + const struct ovs_key_ipv4_tunnel *output) { struct nlattr *nla; @@ -1054,23 +1264,24 @@ int ovs_ipv4_tun_to_nlattr(struct sk_buff *skb, if (!nla) return -EMSGSIZE; - if (tun_key->tun_flags & TUNNEL_KEY && - nla_put_be64(skb, OVS_TUNNEL_KEY_ATTR_ID, tun_key->tun_id)) + if (output->tun_flags & TUNNEL_KEY && + nla_put_be64(skb, OVS_TUNNEL_KEY_ATTR_ID, output->tun_id)) return -EMSGSIZE; - if (tun_key->ipv4_src && - nla_put_be32(skb, OVS_TUNNEL_KEY_ATTR_IPV4_SRC, tun_key->ipv4_src)) + if (output->ipv4_src && + nla_put_be32(skb, OVS_TUNNEL_KEY_ATTR_IPV4_SRC, output->ipv4_src)) return -EMSGSIZE; - if (nla_put_be32(skb, OVS_TUNNEL_KEY_ATTR_IPV4_DST, tun_key->ipv4_dst)) + if (output->ipv4_dst && + nla_put_be32(skb, OVS_TUNNEL_KEY_ATTR_IPV4_DST, output->ipv4_dst)) return -EMSGSIZE; - if (tun_key->ipv4_tos && - nla_put_u8(skb, OVS_TUNNEL_KEY_ATTR_TOS, 
tun_key->ipv4_tos)) + if (output->ipv4_tos && + nla_put_u8(skb, OVS_TUNNEL_KEY_ATTR_TOS, output->ipv4_tos)) return -EMSGSIZE; - if (nla_put_u8(skb, OVS_TUNNEL_KEY_ATTR_TTL, tun_key->ipv4_ttl)) + if (nla_put_u8(skb, OVS_TUNNEL_KEY_ATTR_TTL, output->ipv4_ttl)) return -EMSGSIZE; - if ((tun_key->tun_flags & TUNNEL_DONT_FRAGMENT) && + if ((output->tun_flags & TUNNEL_DONT_FRAGMENT) && nla_put_flag(skb, OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT)) return -EMSGSIZE; - if ((tun_key->tun_flags & TUNNEL_CSUM) && + if ((output->tun_flags & TUNNEL_CSUM) && nla_put_flag(skb, OVS_TUNNEL_KEY_ATTR_CSUM)) return -EMSGSIZE; @@ -1078,176 +1289,372 @@ int ovs_ipv4_tun_to_nlattr(struct sk_buff *skb, return 0; } -/** - * ovs_flow_from_nlattrs - parses Netlink attributes into a flow key. - * @swkey: receives the extracted flow key. - * @key_lenp: number of bytes used in @swkey. - * @attr: Netlink attribute holding nested %OVS_KEY_ATTR_* Netlink attribute - * sequence. - */ -int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp, - const struct nlattr *attr) +static int metadata_from_nlattrs(struct sw_flow_match *match, u64 *attrs, + const struct nlattr **a, bool is_mask) { - const struct nlattr *a[OVS_KEY_ATTR_MAX + 1]; - const struct ovs_key_ethernet *eth_key; - int key_len; - u32 attrs; - int err; + if (*attrs & (1 << OVS_KEY_ATTR_PRIORITY)) { + SW_FLOW_KEY_PUT(match, phy.priority, + nla_get_u32(a[OVS_KEY_ATTR_PRIORITY]), is_mask); + *attrs &= ~(1 << OVS_KEY_ATTR_PRIORITY); + } - memset(swkey, 0, sizeof(struct sw_flow_key)); - key_len = SW_FLOW_KEY_OFFSET(eth); + if (*attrs & (1 << OVS_KEY_ATTR_IN_PORT)) { + u32 in_port = nla_get_u32(a[OVS_KEY_ATTR_IN_PORT]); - err = parse_flow_nlattrs(attr, a, &attrs); - if (err) - return err; + if (is_mask) + in_port = 0xffffffff; /* Always exact match in_port. */ + else if (in_port >= DP_MAX_PORTS) + return -EINVAL; - /* Metadata attributes. */ - if (attrs & (1 << OVS_KEY_ATTR_PRIORITY)) { - swkey->phy.priority = nla_get_u32(a[OVS_KEY_ATTR_PRIORITY]); - attrs &= ~(1 << OVS_KEY_ATTR_PRIORITY); + SW_FLOW_KEY_PUT(match, phy.in_port, in_port, is_mask); + *attrs &= ~(1 << OVS_KEY_ATTR_IN_PORT); + } else if (!is_mask) { + SW_FLOW_KEY_PUT(match, phy.in_port, DP_MAX_PORTS, is_mask); } - if (attrs & (1 << OVS_KEY_ATTR_IN_PORT)) { - u32 in_port = nla_get_u32(a[OVS_KEY_ATTR_IN_PORT]); - if (in_port >= DP_MAX_PORTS) - return -EINVAL; - swkey->phy.in_port = in_port; - attrs &= ~(1 << OVS_KEY_ATTR_IN_PORT); - } else { - swkey->phy.in_port = DP_MAX_PORTS; + + if (*attrs & (1 << OVS_KEY_ATTR_SKB_MARK)) { + uint32_t mark = nla_get_u32(a[OVS_KEY_ATTR_SKB_MARK]); + + SW_FLOW_KEY_PUT(match, phy.skb_mark, mark, is_mask); + *attrs &= ~(1 << OVS_KEY_ATTR_SKB_MARK); } - if (attrs & (1 << OVS_KEY_ATTR_SKB_MARK)) { - swkey->phy.skb_mark = nla_get_u32(a[OVS_KEY_ATTR_SKB_MARK]); - attrs &= ~(1 << OVS_KEY_ATTR_SKB_MARK); + if (*attrs & (1 << OVS_KEY_ATTR_TUNNEL)) { + if (ovs_ipv4_tun_from_nlattr(a[OVS_KEY_ATTR_TUNNEL], match, + is_mask)) + return -EINVAL; + *attrs &= ~(1 << OVS_KEY_ATTR_TUNNEL); } + return 0; +} - if (attrs & (1 << OVS_KEY_ATTR_TUNNEL)) { - err = ovs_ipv4_tun_from_nlattr(a[OVS_KEY_ATTR_TUNNEL], &swkey->tun_key); - if (err) - return err; +static int ovs_key_from_nlattrs(struct sw_flow_match *match, u64 attrs, + const struct nlattr **a, bool is_mask) +{ + int err; + u64 orig_attrs = attrs; - attrs &= ~(1 << OVS_KEY_ATTR_TUNNEL); - } + err = metadata_from_nlattrs(match, &attrs, a, is_mask); + if (err) + return err; - /* Data attributes. 
*/ - if (!(attrs & (1 << OVS_KEY_ATTR_ETHERNET))) - return -EINVAL; - attrs &= ~(1 << OVS_KEY_ATTR_ETHERNET); + if (attrs & (1 << OVS_KEY_ATTR_ETHERNET)) { + const struct ovs_key_ethernet *eth_key; - eth_key = nla_data(a[OVS_KEY_ATTR_ETHERNET]); - memcpy(swkey->eth.src, eth_key->eth_src, ETH_ALEN); - memcpy(swkey->eth.dst, eth_key->eth_dst, ETH_ALEN); + eth_key = nla_data(a[OVS_KEY_ATTR_ETHERNET]); + SW_FLOW_KEY_MEMCPY(match, eth.src, + eth_key->eth_src, ETH_ALEN, is_mask); + SW_FLOW_KEY_MEMCPY(match, eth.dst, + eth_key->eth_dst, ETH_ALEN, is_mask); + attrs &= ~(1 << OVS_KEY_ATTR_ETHERNET); + } - if (attrs & (1u << OVS_KEY_ATTR_ETHERTYPE) && - nla_get_be16(a[OVS_KEY_ATTR_ETHERTYPE]) == htons(ETH_P_8021Q)) { - const struct nlattr *encap; + if (attrs & (1 << OVS_KEY_ATTR_VLAN)) { __be16 tci; - if (attrs != ((1 << OVS_KEY_ATTR_VLAN) | - (1 << OVS_KEY_ATTR_ETHERTYPE) | - (1 << OVS_KEY_ATTR_ENCAP))) - return -EINVAL; - - encap = a[OVS_KEY_ATTR_ENCAP]; tci = nla_get_be16(a[OVS_KEY_ATTR_VLAN]); - if (tci & htons(VLAN_TAG_PRESENT)) { - swkey->eth.tci = tci; - - err = parse_flow_nlattrs(encap, a, &attrs); - if (err) - return err; - } else if (!tci) { - /* Corner case for truncated 802.1Q header. */ - if (nla_len(encap)) - return -EINVAL; + if (!(tci & htons(VLAN_TAG_PRESENT))) { + if (is_mask) + OVS_NLERR("VLAN TCI mask does not have exact match for VLAN_TAG_PRESENT bit.\n"); + else + OVS_NLERR("VLAN TCI does not have VLAN_TAG_PRESENT bit set.\n"); - swkey->eth.type = htons(ETH_P_8021Q); - *key_lenp = key_len; - return 0; - } else { return -EINVAL; } - } + + SW_FLOW_KEY_PUT(match, eth.tci, tci, is_mask); + attrs &= ~(1 << OVS_KEY_ATTR_VLAN); + } else if (!is_mask) + SW_FLOW_KEY_PUT(match, eth.tci, htons(0xffff), true); if (attrs & (1 << OVS_KEY_ATTR_ETHERTYPE)) { - swkey->eth.type = nla_get_be16(a[OVS_KEY_ATTR_ETHERTYPE]); - if (ntohs(swkey->eth.type) < ETH_P_802_3_MIN) + __be16 eth_type; + + eth_type = nla_get_be16(a[OVS_KEY_ATTR_ETHERTYPE]); + if (is_mask) { + /* Always exact match EtherType. 
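+ * The EtherType mask is forced to 0xffff so that a single masked flow can never match packets of different L3 protocols; whatever mask userspace supplied is overridden here.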
*/ + eth_type = htons(0xffff); + } else if (ntohs(eth_type) < ETH_P_802_3_MIN) { + OVS_NLERR("EtherType is less than minimum (type=%x, min=%x).\n", + ntohs(eth_type), ETH_P_802_3_MIN); return -EINVAL; + } + + SW_FLOW_KEY_PUT(match, eth.type, eth_type, is_mask); attrs &= ~(1 << OVS_KEY_ATTR_ETHERTYPE); - } else { - swkey->eth.type = htons(ETH_P_802_2); + } else if (!is_mask) { + SW_FLOW_KEY_PUT(match, eth.type, htons(ETH_P_802_2), is_mask); } - if (swkey->eth.type == htons(ETH_P_IP)) { + if (attrs & (1 << OVS_KEY_ATTR_IPV4)) { const struct ovs_key_ipv4 *ipv4_key; - if (!(attrs & (1 << OVS_KEY_ATTR_IPV4))) - return -EINVAL; - attrs &= ~(1 << OVS_KEY_ATTR_IPV4); - - key_len = SW_FLOW_KEY_OFFSET(ipv4.addr); ipv4_key = nla_data(a[OVS_KEY_ATTR_IPV4]); - if (ipv4_key->ipv4_frag > OVS_FRAG_TYPE_MAX) + if (!is_mask && ipv4_key->ipv4_frag > OVS_FRAG_TYPE_MAX) { + OVS_NLERR("Unknown IPv4 fragment type (value=%d, max=%d).\n", + ipv4_key->ipv4_frag, OVS_FRAG_TYPE_MAX); return -EINVAL; - swkey->ip.proto = ipv4_key->ipv4_proto; - swkey->ip.tos = ipv4_key->ipv4_tos; - swkey->ip.ttl = ipv4_key->ipv4_ttl; - swkey->ip.frag = ipv4_key->ipv4_frag; - swkey->ipv4.addr.src = ipv4_key->ipv4_src; - swkey->ipv4.addr.dst = ipv4_key->ipv4_dst; - - if (swkey->ip.frag != OVS_FRAG_TYPE_LATER) { - err = ipv4_flow_from_nlattrs(swkey, &key_len, a, &attrs); - if (err) - return err; } - } else if (swkey->eth.type == htons(ETH_P_IPV6)) { - const struct ovs_key_ipv6 *ipv6_key; + SW_FLOW_KEY_PUT(match, ip.proto, + ipv4_key->ipv4_proto, is_mask); + SW_FLOW_KEY_PUT(match, ip.tos, + ipv4_key->ipv4_tos, is_mask); + SW_FLOW_KEY_PUT(match, ip.ttl, + ipv4_key->ipv4_ttl, is_mask); + SW_FLOW_KEY_PUT(match, ip.frag, + ipv4_key->ipv4_frag, is_mask); + SW_FLOW_KEY_PUT(match, ipv4.addr.src, + ipv4_key->ipv4_src, is_mask); + SW_FLOW_KEY_PUT(match, ipv4.addr.dst, + ipv4_key->ipv4_dst, is_mask); + attrs &= ~(1 << OVS_KEY_ATTR_IPV4); + } - if (!(attrs & (1 << OVS_KEY_ATTR_IPV6))) - return -EINVAL; - attrs &= ~(1 << OVS_KEY_ATTR_IPV6); + if (attrs & (1 << OVS_KEY_ATTR_IPV6)) { + const struct ovs_key_ipv6 *ipv6_key; - key_len = SW_FLOW_KEY_OFFSET(ipv6.label); ipv6_key = nla_data(a[OVS_KEY_ATTR_IPV6]); - if (ipv6_key->ipv6_frag > OVS_FRAG_TYPE_MAX) + if (!is_mask && ipv6_key->ipv6_frag > OVS_FRAG_TYPE_MAX) { + OVS_NLERR("Unknown IPv6 fragment type (value=%d, max=%d).\n", + ipv6_key->ipv6_frag, OVS_FRAG_TYPE_MAX); return -EINVAL; - swkey->ipv6.label = ipv6_key->ipv6_label; - swkey->ip.proto = ipv6_key->ipv6_proto; - swkey->ip.tos = ipv6_key->ipv6_tclass; - swkey->ip.ttl = ipv6_key->ipv6_hlimit; - swkey->ip.frag = ipv6_key->ipv6_frag; - memcpy(&swkey->ipv6.addr.src, ipv6_key->ipv6_src, - sizeof(swkey->ipv6.addr.src)); - memcpy(&swkey->ipv6.addr.dst, ipv6_key->ipv6_dst, - sizeof(swkey->ipv6.addr.dst)); - - if (swkey->ip.frag != OVS_FRAG_TYPE_LATER) { - err = ipv6_flow_from_nlattrs(swkey, &key_len, a, &attrs); - if (err) - return err; } - } else if (swkey->eth.type == htons(ETH_P_ARP) || - swkey->eth.type == htons(ETH_P_RARP)) { + SW_FLOW_KEY_PUT(match, ipv6.label, + ipv6_key->ipv6_label, is_mask); + SW_FLOW_KEY_PUT(match, ip.proto, + ipv6_key->ipv6_proto, is_mask); + SW_FLOW_KEY_PUT(match, ip.tos, + ipv6_key->ipv6_tclass, is_mask); + SW_FLOW_KEY_PUT(match, ip.ttl, + ipv6_key->ipv6_hlimit, is_mask); + SW_FLOW_KEY_PUT(match, ip.frag, + ipv6_key->ipv6_frag, is_mask); + SW_FLOW_KEY_MEMCPY(match, ipv6.addr.src, + ipv6_key->ipv6_src, + sizeof(match->key->ipv6.addr.src), + is_mask); + SW_FLOW_KEY_MEMCPY(match, ipv6.addr.dst, + ipv6_key->ipv6_dst, + 
sizeof(match->key->ipv6.addr.dst), + is_mask); + + attrs &= ~(1 << OVS_KEY_ATTR_IPV6); + } + + if (attrs & (1 << OVS_KEY_ATTR_ARP)) { const struct ovs_key_arp *arp_key; - if (!(attrs & (1 << OVS_KEY_ATTR_ARP))) + arp_key = nla_data(a[OVS_KEY_ATTR_ARP]); + if (!is_mask && (arp_key->arp_op & htons(0xff00))) { + OVS_NLERR("Unknown ARP opcode (opcode=%d).\n", + arp_key->arp_op); return -EINVAL; + } + + SW_FLOW_KEY_PUT(match, ipv4.addr.src, + arp_key->arp_sip, is_mask); + SW_FLOW_KEY_PUT(match, ipv4.addr.dst, + arp_key->arp_tip, is_mask); + SW_FLOW_KEY_PUT(match, ip.proto, + ntohs(arp_key->arp_op), is_mask); + SW_FLOW_KEY_MEMCPY(match, ipv4.arp.sha, + arp_key->arp_sha, ETH_ALEN, is_mask); + SW_FLOW_KEY_MEMCPY(match, ipv4.arp.tha, + arp_key->arp_tha, ETH_ALEN, is_mask); + attrs &= ~(1 << OVS_KEY_ATTR_ARP); + } - key_len = SW_FLOW_KEY_OFFSET(ipv4.arp); - arp_key = nla_data(a[OVS_KEY_ATTR_ARP]); - swkey->ipv4.addr.src = arp_key->arp_sip; - swkey->ipv4.addr.dst = arp_key->arp_tip; - if (arp_key->arp_op & htons(0xff00)) + if (attrs & (1 << OVS_KEY_ATTR_TCP)) { + const struct ovs_key_tcp *tcp_key; + + tcp_key = nla_data(a[OVS_KEY_ATTR_TCP]); + if (orig_attrs & (1 << OVS_KEY_ATTR_IPV4)) { + SW_FLOW_KEY_PUT(match, ipv4.tp.src, + tcp_key->tcp_src, is_mask); + SW_FLOW_KEY_PUT(match, ipv4.tp.dst, + tcp_key->tcp_dst, is_mask); + } else { + SW_FLOW_KEY_PUT(match, ipv6.tp.src, + tcp_key->tcp_src, is_mask); + SW_FLOW_KEY_PUT(match, ipv6.tp.dst, + tcp_key->tcp_dst, is_mask); + } + attrs &= ~(1 << OVS_KEY_ATTR_TCP); + } + + if (attrs & (1 << OVS_KEY_ATTR_UDP)) { + const struct ovs_key_udp *udp_key; + + udp_key = nla_data(a[OVS_KEY_ATTR_UDP]); + if (orig_attrs & (1 << OVS_KEY_ATTR_IPV4)) { + SW_FLOW_KEY_PUT(match, ipv4.tp.src, + udp_key->udp_src, is_mask); + SW_FLOW_KEY_PUT(match, ipv4.tp.dst, + udp_key->udp_dst, is_mask); + } else { + SW_FLOW_KEY_PUT(match, ipv6.tp.src, + udp_key->udp_src, is_mask); + SW_FLOW_KEY_PUT(match, ipv6.tp.dst, + udp_key->udp_dst, is_mask); + } + attrs &= ~(1 << OVS_KEY_ATTR_UDP); + } + + if (attrs & (1 << OVS_KEY_ATTR_ICMP)) { + const struct ovs_key_icmp *icmp_key; + + icmp_key = nla_data(a[OVS_KEY_ATTR_ICMP]); + SW_FLOW_KEY_PUT(match, ipv4.tp.src, + htons(icmp_key->icmp_type), is_mask); + SW_FLOW_KEY_PUT(match, ipv4.tp.dst, + htons(icmp_key->icmp_code), is_mask); + attrs &= ~(1 << OVS_KEY_ATTR_ICMP); + } + + if (attrs & (1 << OVS_KEY_ATTR_ICMPV6)) { + const struct ovs_key_icmpv6 *icmpv6_key; + + icmpv6_key = nla_data(a[OVS_KEY_ATTR_ICMPV6]); + SW_FLOW_KEY_PUT(match, ipv6.tp.src, + htons(icmpv6_key->icmpv6_type), is_mask); + SW_FLOW_KEY_PUT(match, ipv6.tp.dst, + htons(icmpv6_key->icmpv6_code), is_mask); + attrs &= ~(1 << OVS_KEY_ATTR_ICMPV6); + } + + if (attrs & (1 << OVS_KEY_ATTR_ND)) { + const struct ovs_key_nd *nd_key; + + nd_key = nla_data(a[OVS_KEY_ATTR_ND]); + SW_FLOW_KEY_MEMCPY(match, ipv6.nd.target, + nd_key->nd_target, + sizeof(match->key->ipv6.nd.target), + is_mask); + SW_FLOW_KEY_MEMCPY(match, ipv6.nd.sll, + nd_key->nd_sll, ETH_ALEN, is_mask); + SW_FLOW_KEY_MEMCPY(match, ipv6.nd.tll, + nd_key->nd_tll, ETH_ALEN, is_mask); + attrs &= ~(1 << OVS_KEY_ATTR_ND); + } + + if (attrs != 0) + return -EINVAL; + + return 0; +} + +/** + * ovs_match_from_nlattrs - parses Netlink attributes into a flow key and + * mask. In case the 'mask' is NULL, the flow is treated as exact match + * flow. Otherwise, it is treated as a wildcarded flow, except the mask + * does not include any don't care bit. + * @match: receives the extracted flow match information. 
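+ * Both the key and, when a mask attribute is supplied, the mask, together with their valid ranges, are filled in.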
+ * @key: Netlink attribute holding nested %OVS_KEY_ATTR_* Netlink attribute + * sequence. The fields should be those of the packet that triggered the + * creation of this flow. + * @mask: Optional. Netlink attribute holding nested %OVS_KEY_ATTR_* Netlink + * attribute sequence that specifies the mask of the wildcarded flow. + */ +int ovs_match_from_nlattrs(struct sw_flow_match *match, + const struct nlattr *key, + const struct nlattr *mask) +{ + const struct nlattr *a[OVS_KEY_ATTR_MAX + 1]; + const struct nlattr *encap; + u64 key_attrs = 0; + u64 mask_attrs = 0; + bool encap_valid = false; + int err; + + err = parse_flow_nlattrs(key, a, &key_attrs); + if (err) + return err; + + if ((key_attrs & (1 << OVS_KEY_ATTR_ETHERNET)) && + (key_attrs & (1 << OVS_KEY_ATTR_ETHERTYPE)) && + (nla_get_be16(a[OVS_KEY_ATTR_ETHERTYPE]) == htons(ETH_P_8021Q))) { + __be16 tci; + + if (!((key_attrs & (1 << OVS_KEY_ATTR_VLAN)) && + (key_attrs & (1 << OVS_KEY_ATTR_ENCAP)))) { + OVS_NLERR("Invalid VLAN frame.\n"); return -EINVAL; - swkey->ip.proto = ntohs(arp_key->arp_op); - memcpy(swkey->ipv4.arp.sha, arp_key->arp_sha, ETH_ALEN); - memcpy(swkey->ipv4.arp.tha, arp_key->arp_tha, ETH_ALEN); + } + + key_attrs &= ~(1 << OVS_KEY_ATTR_ETHERTYPE); + tci = nla_get_be16(a[OVS_KEY_ATTR_VLAN]); + encap = a[OVS_KEY_ATTR_ENCAP]; + key_attrs &= ~(1 << OVS_KEY_ATTR_ENCAP); + encap_valid = true; + + if (tci & htons(VLAN_TAG_PRESENT)) { + err = parse_flow_nlattrs(encap, a, &key_attrs); + if (err) + return err; + } else if (!tci) { + /* Corner case for truncated 802.1Q header. */ + if (nla_len(encap)) { + OVS_NLERR("Truncated 802.1Q header has non-zero encap attribute.\n"); + return -EINVAL; + } + } else { + OVS_NLERR("Encap attribute is set for a non-VLAN frame.\n"); + return -EINVAL; + } } - if (attrs) + err = ovs_key_from_nlattrs(match, key_attrs, a, false); + if (err) + return err; + + if (mask) { + err = parse_flow_mask_nlattrs(mask, a, &mask_attrs); + if (err) + return err; + + if (mask_attrs & 1ULL << OVS_KEY_ATTR_ENCAP) { + __be16 eth_type = 0; + __be16 tci = 0; + + if (!encap_valid) { + OVS_NLERR("Encap mask attribute is set for non-VLAN frame.\n"); + return -EINVAL; + } + + mask_attrs &= ~(1 << OVS_KEY_ATTR_ENCAP); + if (a[OVS_KEY_ATTR_ETHERTYPE]) + eth_type = nla_get_be16(a[OVS_KEY_ATTR_ETHERTYPE]); + + if (eth_type == htons(0xffff)) { + mask_attrs &= ~(1 << OVS_KEY_ATTR_ETHERTYPE); + encap = a[OVS_KEY_ATTR_ENCAP]; + err = parse_flow_mask_nlattrs(encap, a, &mask_attrs); + } else { + OVS_NLERR("VLAN frames must have an exact match on the TPID (mask=%x).\n", + ntohs(eth_type)); + return -EINVAL; + } + + if (a[OVS_KEY_ATTR_VLAN]) + tci = nla_get_be16(a[OVS_KEY_ATTR_VLAN]); + + if (!(tci & htons(VLAN_TAG_PRESENT))) { + OVS_NLERR("VLAN tag present bit must have an exact match (tci_mask=%x).\n", ntohs(tci)); + return -EINVAL; + } + } + + err = ovs_key_from_nlattrs(match, mask_attrs, a, true); + if (err) + return err; + } else { + /* Populate exact match flow's key mask. */ + if (match->mask) + ovs_sw_flow_mask_set(match->mask, &match->range, 0xff); + } + + if (!ovs_match_validate(match, key_attrs, mask_attrs)) return -EINVAL; - *key_lenp = key_len; return 0; } @@ -1255,7 +1662,6 @@ int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp, /** * ovs_flow_metadata_from_nlattrs - parses Netlink attributes into a flow key. * @flow: Receives extracted in_port, priority, tun_key and skb_mark. - * @key_len: Length of key in @flow. Used for calculating flow hash. 
* @attr: Netlink attribute holding nested %OVS_KEY_ATTR_* Netlink attribute * sequence. * @@ -1264,102 +1670,100 @@ int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp, * get the metadata, that is, the parts of the flow key that cannot be * extracted from the packet itself. */ -int ovs_flow_metadata_from_nlattrs(struct sw_flow *flow, int key_len, - const struct nlattr *attr) + +int ovs_flow_metadata_from_nlattrs(struct sw_flow *flow, + const struct nlattr *attr) { struct ovs_key_ipv4_tunnel *tun_key = &flow->key.tun_key; - const struct nlattr *nla; - int rem; + const struct nlattr *a[OVS_KEY_ATTR_MAX + 1]; + u64 attrs = 0; + int err; + struct sw_flow_match match; flow->key.phy.in_port = DP_MAX_PORTS; flow->key.phy.priority = 0; flow->key.phy.skb_mark = 0; memset(tun_key, 0, sizeof(flow->key.tun_key)); - nla_for_each_nested(nla, attr, rem) { - int type = nla_type(nla); - - if (type <= OVS_KEY_ATTR_MAX && ovs_key_lens[type] > 0) { - int err; - - if (nla_len(nla) != ovs_key_lens[type]) - return -EINVAL; - - switch (type) { - case OVS_KEY_ATTR_PRIORITY: - flow->key.phy.priority = nla_get_u32(nla); - break; - - case OVS_KEY_ATTR_TUNNEL: - err = ovs_ipv4_tun_from_nlattr(nla, tun_key); - if (err) - return err; - break; - - case OVS_KEY_ATTR_IN_PORT: - if (nla_get_u32(nla) >= DP_MAX_PORTS) - return -EINVAL; - flow->key.phy.in_port = nla_get_u32(nla); - break; - - case OVS_KEY_ATTR_SKB_MARK: - flow->key.phy.skb_mark = nla_get_u32(nla); - break; - } - } - } - if (rem) + err = parse_flow_nlattrs(attr, a, &attrs); + if (err) return -EINVAL; - flow->hash = ovs_flow_hash(&flow->key, - flow_key_start(&flow->key), key_len); + memset(&match, 0, sizeof(match)); + match.key = &flow->key; + + err = metadata_from_nlattrs(&match, &attrs, a, false); + if (err) + return err; return 0; } -int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb) +int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, + const struct sw_flow_key *output, struct sk_buff *skb) { struct ovs_key_ethernet *eth_key; struct nlattr *nla, *encap; + bool is_mask = (swkey != output); - if (swkey->phy.priority && - nla_put_u32(skb, OVS_KEY_ATTR_PRIORITY, swkey->phy.priority)) + if (nla_put_u32(skb, OVS_KEY_ATTR_PRIORITY, output->phy.priority)) goto nla_put_failure; - if (swkey->tun_key.ipv4_dst && - ovs_ipv4_tun_to_nlattr(skb, &swkey->tun_key)) + if ((swkey->tun_key.ipv4_dst || is_mask) && + ovs_ipv4_tun_to_nlattr(skb, &swkey->tun_key, &output->tun_key)) goto nla_put_failure; - if (swkey->phy.in_port != DP_MAX_PORTS && - nla_put_u32(skb, OVS_KEY_ATTR_IN_PORT, swkey->phy.in_port)) - goto nla_put_failure; + if (swkey->phy.in_port == DP_MAX_PORTS) { + if (is_mask && (output->phy.in_port == 0xffff)) + if (nla_put_u32(skb, OVS_KEY_ATTR_IN_PORT, 0xffffffff)) + goto nla_put_failure; + } else { + u16 upper_u16; + upper_u16 = !is_mask ? 
0 : 0xffff; - if (swkey->phy.skb_mark && - nla_put_u32(skb, OVS_KEY_ATTR_SKB_MARK, swkey->phy.skb_mark)) + if (nla_put_u32(skb, OVS_KEY_ATTR_IN_PORT, + (upper_u16 << 16) | output->phy.in_port)) + goto nla_put_failure; + } + + if (nla_put_u32(skb, OVS_KEY_ATTR_SKB_MARK, output->phy.skb_mark)) goto nla_put_failure; nla = nla_reserve(skb, OVS_KEY_ATTR_ETHERNET, sizeof(*eth_key)); if (!nla) goto nla_put_failure; + eth_key = nla_data(nla); - memcpy(eth_key->eth_src, swkey->eth.src, ETH_ALEN); - memcpy(eth_key->eth_dst, swkey->eth.dst, ETH_ALEN); + memcpy(eth_key->eth_src, output->eth.src, ETH_ALEN); + memcpy(eth_key->eth_dst, output->eth.dst, ETH_ALEN); if (swkey->eth.tci || swkey->eth.type == htons(ETH_P_8021Q)) { - if (nla_put_be16(skb, OVS_KEY_ATTR_ETHERTYPE, htons(ETH_P_8021Q)) || - nla_put_be16(skb, OVS_KEY_ATTR_VLAN, swkey->eth.tci)) + __be16 eth_type; + eth_type = !is_mask ? htons(ETH_P_8021Q) : htons(0xffff); + if (nla_put_be16(skb, OVS_KEY_ATTR_ETHERTYPE, eth_type) || + nla_put_be16(skb, OVS_KEY_ATTR_VLAN, output->eth.tci)) goto nla_put_failure; encap = nla_nest_start(skb, OVS_KEY_ATTR_ENCAP); if (!swkey->eth.tci) goto unencap; - } else { + } else encap = NULL; - } - if (swkey->eth.type == htons(ETH_P_802_2)) + if (swkey->eth.type == htons(ETH_P_802_2)) { + /* + * Ethertype 802.2 is represented in the netlink with omitted + * OVS_KEY_ATTR_ETHERTYPE in the flow key attribute, and + * 0xffff in the mask attribute. Ethertype can also + * be wildcarded. + */ + if (is_mask && output->eth.type) + if (nla_put_be16(skb, OVS_KEY_ATTR_ETHERTYPE, + output->eth.type)) + goto nla_put_failure; goto unencap; + } - if (nla_put_be16(skb, OVS_KEY_ATTR_ETHERTYPE, swkey->eth.type)) + if (nla_put_be16(skb, OVS_KEY_ATTR_ETHERTYPE, output->eth.type)) goto nla_put_failure; if (swkey->eth.type == htons(ETH_P_IP)) { @@ -1369,12 +1773,12 @@ int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb) if (!nla) goto nla_put_failure; ipv4_key = nla_data(nla); - ipv4_key->ipv4_src = swkey->ipv4.addr.src; - ipv4_key->ipv4_dst = swkey->ipv4.addr.dst; - ipv4_key->ipv4_proto = swkey->ip.proto; - ipv4_key->ipv4_tos = swkey->ip.tos; - ipv4_key->ipv4_ttl = swkey->ip.ttl; - ipv4_key->ipv4_frag = swkey->ip.frag; + ipv4_key->ipv4_src = output->ipv4.addr.src; + ipv4_key->ipv4_dst = output->ipv4.addr.dst; + ipv4_key->ipv4_proto = output->ip.proto; + ipv4_key->ipv4_tos = output->ip.tos; + ipv4_key->ipv4_ttl = output->ip.ttl; + ipv4_key->ipv4_frag = output->ip.frag; } else if (swkey->eth.type == htons(ETH_P_IPV6)) { struct ovs_key_ipv6 *ipv6_key; @@ -1382,15 +1786,15 @@ int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb) if (!nla) goto nla_put_failure; ipv6_key = nla_data(nla); - memcpy(ipv6_key->ipv6_src, &swkey->ipv6.addr.src, + memcpy(ipv6_key->ipv6_src, &output->ipv6.addr.src, sizeof(ipv6_key->ipv6_src)); - memcpy(ipv6_key->ipv6_dst, &swkey->ipv6.addr.dst, + memcpy(ipv6_key->ipv6_dst, &output->ipv6.addr.dst, sizeof(ipv6_key->ipv6_dst)); - ipv6_key->ipv6_label = swkey->ipv6.label; - ipv6_key->ipv6_proto = swkey->ip.proto; - ipv6_key->ipv6_tclass = swkey->ip.tos; - ipv6_key->ipv6_hlimit = swkey->ip.ttl; - ipv6_key->ipv6_frag = swkey->ip.frag; + ipv6_key->ipv6_label = output->ipv6.label; + ipv6_key->ipv6_proto = output->ip.proto; + ipv6_key->ipv6_tclass = output->ip.tos; + ipv6_key->ipv6_hlimit = output->ip.ttl; + ipv6_key->ipv6_frag = output->ip.frag; } else if (swkey->eth.type == htons(ETH_P_ARP) || swkey->eth.type == htons(ETH_P_RARP)) { struct ovs_key_arp *arp_key; @@ -1400,11 
+1804,11 @@ int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb) goto nla_put_failure; arp_key = nla_data(nla); memset(arp_key, 0, sizeof(struct ovs_key_arp)); - arp_key->arp_sip = swkey->ipv4.addr.src; - arp_key->arp_tip = swkey->ipv4.addr.dst; - arp_key->arp_op = htons(swkey->ip.proto); - memcpy(arp_key->arp_sha, swkey->ipv4.arp.sha, ETH_ALEN); - memcpy(arp_key->arp_tha, swkey->ipv4.arp.tha, ETH_ALEN); + arp_key->arp_sip = output->ipv4.addr.src; + arp_key->arp_tip = output->ipv4.addr.dst; + arp_key->arp_op = htons(output->ip.proto); + memcpy(arp_key->arp_sha, output->ipv4.arp.sha, ETH_ALEN); + memcpy(arp_key->arp_tha, output->ipv4.arp.tha, ETH_ALEN); } if ((swkey->eth.type == htons(ETH_P_IP) || @@ -1419,11 +1823,11 @@ int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb) goto nla_put_failure; tcp_key = nla_data(nla); if (swkey->eth.type == htons(ETH_P_IP)) { - tcp_key->tcp_src = swkey->ipv4.tp.src; - tcp_key->tcp_dst = swkey->ipv4.tp.dst; + tcp_key->tcp_src = output->ipv4.tp.src; + tcp_key->tcp_dst = output->ipv4.tp.dst; } else if (swkey->eth.type == htons(ETH_P_IPV6)) { - tcp_key->tcp_src = swkey->ipv6.tp.src; - tcp_key->tcp_dst = swkey->ipv6.tp.dst; + tcp_key->tcp_src = output->ipv6.tp.src; + tcp_key->tcp_dst = output->ipv6.tp.dst; } } else if (swkey->ip.proto == IPPROTO_UDP) { struct ovs_key_udp *udp_key; @@ -1433,11 +1837,11 @@ int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb) goto nla_put_failure; udp_key = nla_data(nla); if (swkey->eth.type == htons(ETH_P_IP)) { - udp_key->udp_src = swkey->ipv4.tp.src; - udp_key->udp_dst = swkey->ipv4.tp.dst; + udp_key->udp_src = output->ipv4.tp.src; + udp_key->udp_dst = output->ipv4.tp.dst; } else if (swkey->eth.type == htons(ETH_P_IPV6)) { - udp_key->udp_src = swkey->ipv6.tp.src; - udp_key->udp_dst = swkey->ipv6.tp.dst; + udp_key->udp_src = output->ipv6.tp.src; + udp_key->udp_dst = output->ipv6.tp.dst; } } else if (swkey->eth.type == htons(ETH_P_IP) && swkey->ip.proto == IPPROTO_ICMP) { @@ -1447,8 +1851,8 @@ int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb) if (!nla) goto nla_put_failure; icmp_key = nla_data(nla); - icmp_key->icmp_type = ntohs(swkey->ipv4.tp.src); - icmp_key->icmp_code = ntohs(swkey->ipv4.tp.dst); + icmp_key->icmp_type = ntohs(output->ipv4.tp.src); + icmp_key->icmp_code = ntohs(output->ipv4.tp.dst); } else if (swkey->eth.type == htons(ETH_P_IPV6) && swkey->ip.proto == IPPROTO_ICMPV6) { struct ovs_key_icmpv6 *icmpv6_key; @@ -1458,8 +1862,8 @@ int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb) if (!nla) goto nla_put_failure; icmpv6_key = nla_data(nla); - icmpv6_key->icmpv6_type = ntohs(swkey->ipv6.tp.src); - icmpv6_key->icmpv6_code = ntohs(swkey->ipv6.tp.dst); + icmpv6_key->icmpv6_type = ntohs(output->ipv6.tp.src); + icmpv6_key->icmpv6_code = ntohs(output->ipv6.tp.dst); if (icmpv6_key->icmpv6_type == NDISC_NEIGHBOUR_SOLICITATION || icmpv6_key->icmpv6_type == NDISC_NEIGHBOUR_ADVERTISEMENT) { @@ -1469,10 +1873,10 @@ int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb) if (!nla) goto nla_put_failure; nd_key = nla_data(nla); - memcpy(nd_key->nd_target, &swkey->ipv6.nd.target, + memcpy(nd_key->nd_target, &output->ipv6.nd.target, sizeof(nd_key->nd_target)); - memcpy(nd_key->nd_sll, swkey->ipv6.nd.sll, ETH_ALEN); - memcpy(nd_key->nd_tll, swkey->ipv6.nd.tll, ETH_ALEN); + memcpy(nd_key->nd_sll, output->ipv6.nd.sll, ETH_ALEN); + memcpy(nd_key->nd_tll, output->ipv6.nd.tll, ETH_ALEN); } } 
} @@ -1504,3 +1908,84 @@ void ovs_flow_exit(void) { kmem_cache_destroy(flow_cache); } + +struct sw_flow_mask *ovs_sw_flow_mask_alloc(void) +{ + struct sw_flow_mask *mask; + + mask = kmalloc(sizeof(*mask), GFP_KERNEL); + if (mask) + mask->ref_count = 0; + + return mask; +} + +void ovs_sw_flow_mask_add_ref(struct sw_flow_mask *mask) +{ + mask->ref_count++; +} + +void ovs_sw_flow_mask_del_ref(struct sw_flow_mask *mask, bool deferred) +{ + if (!mask) + return; + + BUG_ON(!mask->ref_count); + mask->ref_count--; + + if (!mask->ref_count) { + list_del_rcu(&mask->list); + if (deferred) + kfree_rcu(mask, rcu); + else + kfree(mask); + } +} + +static bool ovs_sw_flow_mask_equal(const struct sw_flow_mask *a, + const struct sw_flow_mask *b) +{ + u8 *a_ = (u8 *)&a->key + a->range.start; + u8 *b_ = (u8 *)&b->key + b->range.start; + + return (a->range.end == b->range.end) + && (a->range.start == b->range.start) + && (memcmp(a_, b_, ovs_sw_flow_mask_actual_size(a)) == 0); +} + +struct sw_flow_mask *ovs_sw_flow_mask_find(const struct flow_table *tbl, + const struct sw_flow_mask *mask) +{ + struct list_head *ml; + + list_for_each(ml, tbl->mask_list) { + struct sw_flow_mask *m; + m = container_of(ml, struct sw_flow_mask, list); + if (ovs_sw_flow_mask_equal(mask, m)) + return m; + } + + return NULL; +} + +/** + * add a new mask into the mask list. + * The caller needs to make sure that 'mask' is not the same + * as any masks that are already on the list. + */ +void ovs_sw_flow_mask_insert(struct flow_table *tbl, struct sw_flow_mask *mask) +{ + list_add_rcu(&mask->list, tbl->mask_list); +} + +/** + * Set 'range' fields in the mask to the value of 'val'. + */ +static void ovs_sw_flow_mask_set(struct sw_flow_mask *mask, + struct sw_flow_key_range *range, u8 val) +{ + u8 *m = (u8 *)&mask->key + range->start; + + mask->range = *range; + memset(m, val, ovs_sw_flow_mask_size_roundup(mask)); +} diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h index 66ef7220293e..9674e45f6969 100644 --- a/net/openvswitch/flow.h +++ b/net/openvswitch/flow.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2007-2011 Nicira, Inc. + * Copyright (c) 2007-2013 Nicira, Inc. * * This program is free software; you can redistribute it and/or * modify it under the terms of version 2 of the GNU General Public @@ -33,6 +33,8 @@ #include struct sk_buff; +struct sw_flow_mask; +struct flow_table; struct sw_flow_actions { struct rcu_head rcu; @@ -131,6 +133,8 @@ struct sw_flow { u32 hash; struct sw_flow_key key; + struct sw_flow_key unmasked_key; + struct sw_flow_mask *mask; struct sw_flow_actions __rcu *sf_acts; spinlock_t lock; /* Lock for values below. */ @@ -140,6 +144,25 @@ struct sw_flow { u8 tcp_flags; /* Union of seen TCP flags. 
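 * Updated under 'lock' above and reported to userspace as OVS_FLOW_ATTR_TCP_FLAGS.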
*/ }; +struct sw_flow_key_range { + size_t start; + size_t end; +}; + +static inline u16 ovs_sw_flow_key_range_actual_size(const struct sw_flow_key_range *range) +{ + return range->end - range->start; +} + +struct sw_flow_match { + struct sw_flow_key *key; + struct sw_flow_key_range range; + struct sw_flow_mask *mask; +}; + +void ovs_match_init(struct sw_flow_match *match, + struct sw_flow_key *key, struct sw_flow_mask *mask); + struct arp_eth_header { __be16 ar_hrd; /* format of hardware address */ __be16 ar_pro; /* format of protocol address */ @@ -159,21 +182,21 @@ void ovs_flow_exit(void); struct sw_flow *ovs_flow_alloc(void); void ovs_flow_deferred_free(struct sw_flow *); -void ovs_flow_free(struct sw_flow *flow); +void ovs_flow_free(struct sw_flow *, bool deferred); struct sw_flow_actions *ovs_flow_actions_alloc(int actions_len); void ovs_flow_deferred_free_acts(struct sw_flow_actions *); -int ovs_flow_extract(struct sk_buff *, u16 in_port, struct sw_flow_key *, - int *key_lenp); +int ovs_flow_extract(struct sk_buff *, u16 in_port, struct sw_flow_key *); void ovs_flow_used(struct sw_flow *, struct sk_buff *); u64 ovs_flow_used_time(unsigned long flow_jiffies); - -int ovs_flow_to_nlattrs(const struct sw_flow_key *, struct sk_buff *); -int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp, +int ovs_flow_to_nlattrs(const struct sw_flow_key *, + const struct sw_flow_key *, struct sk_buff *); +int ovs_match_from_nlattrs(struct sw_flow_match *match, + const struct nlattr *, const struct nlattr *); -int ovs_flow_metadata_from_nlattrs(struct sw_flow *flow, int key_len, - const struct nlattr *attr); +int ovs_flow_metadata_from_nlattrs(struct sw_flow *flow, + const struct nlattr *attr); #define MAX_ACTIONS_BUFSIZE (32 * 1024) #define TBL_MIN_BUCKETS 1024 @@ -182,6 +205,7 @@ struct flow_table { struct flex_array *buckets; unsigned int count, n_buckets; struct rcu_head rcu; + struct list_head *mask_list; int node_ver; u32 hash_seed; bool keep_flows; @@ -197,22 +221,56 @@ static inline int ovs_flow_tbl_need_to_expand(struct flow_table *table) return (table->count > table->n_buckets); } -struct sw_flow *ovs_flow_tbl_lookup(struct flow_table *table, - struct sw_flow_key *key, int len); -void ovs_flow_tbl_destroy(struct flow_table *table); -void ovs_flow_tbl_deferred_destroy(struct flow_table *table); +struct sw_flow *ovs_flow_lookup(struct flow_table *, + const struct sw_flow_key *); +struct sw_flow *ovs_flow_lookup_unmasked_key(struct flow_table *table, + struct sw_flow_match *match); + +void ovs_flow_tbl_destroy(struct flow_table *table, bool deferred); struct flow_table *ovs_flow_tbl_alloc(int new_size); struct flow_table *ovs_flow_tbl_expand(struct flow_table *table); struct flow_table *ovs_flow_tbl_rehash(struct flow_table *table); -void ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow, - struct sw_flow_key *key, int key_len); -void ovs_flow_tbl_remove(struct flow_table *table, struct sw_flow *flow); -struct sw_flow *ovs_flow_tbl_next(struct flow_table *table, u32 *bucket, u32 *idx); +void ovs_flow_insert(struct flow_table *table, struct sw_flow *flow); +void ovs_flow_remove(struct flow_table *table, struct sw_flow *flow); + +struct sw_flow *ovs_flow_dump_next(struct flow_table *table, u32 *bucket, u32 *idx); extern const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1]; int ovs_ipv4_tun_from_nlattr(const struct nlattr *attr, - struct ovs_key_ipv4_tunnel *tun_key); + struct sw_flow_match *match, bool is_mask); int ovs_ipv4_tun_to_nlattr(struct sk_buff *skb, - const struct 
ovs_key_ipv4_tunnel *tun_key); + const struct ovs_key_ipv4_tunnel *tun_key, + const struct ovs_key_ipv4_tunnel *output); + +bool ovs_flow_cmp_unmasked_key(const struct sw_flow *flow, + const struct sw_flow_key *key, int key_len); + +struct sw_flow_mask { + int ref_count; + struct rcu_head rcu; + struct list_head list; + struct sw_flow_key_range range; + struct sw_flow_key key; +}; + +static inline u16 +ovs_sw_flow_mask_actual_size(const struct sw_flow_mask *mask) +{ + return ovs_sw_flow_key_range_actual_size(&mask->range); +} + +static inline u16 +ovs_sw_flow_mask_size_roundup(const struct sw_flow_mask *mask) +{ + return roundup(ovs_sw_flow_mask_actual_size(mask), sizeof(u32)); +} +struct sw_flow_mask *ovs_sw_flow_mask_alloc(void); +void ovs_sw_flow_mask_add_ref(struct sw_flow_mask *); +void ovs_sw_flow_mask_del_ref(struct sw_flow_mask *, bool deferred); +void ovs_sw_flow_mask_insert(struct flow_table *, struct sw_flow_mask *); +struct sw_flow_mask *ovs_sw_flow_mask_find(const struct flow_table *, + const struct sw_flow_mask *); +void ovs_flow_key_mask(struct sw_flow_key *dst, const struct sw_flow_key *src, + const struct sw_flow_mask *mask); #endif /* flow.h */ -- cgit v1.2.3 From d7064f4c192c19958c72b13cd9556815bedc0432 Mon Sep 17 00:00:00 2001 From: Jeff Kirsher Date: Fri, 23 Aug 2013 17:19:23 -0700 Subject: Documentation/networking/: Update Intel wired LAN driver documentation Updates the documentation to the Intel wired LAN drivers. Signed-off-by: Jeff Kirsher Tested-by: Aaron Brown Tested-by: Phil Schmitt Signed-off-by: David S. Miller --- Documentation/networking/e100.txt | 4 +- Documentation/networking/e1000.txt | 12 ++-- Documentation/networking/e1000e.txt | 16 +++-- Documentation/networking/igb.txt | 67 +++++++++++++++++++-- Documentation/networking/igbvf.txt | 8 +-- Documentation/networking/ixgb.txt | 14 ++--- Documentation/networking/ixgbe.txt | 109 +++++++++++++++++++++++++++++++---- Documentation/networking/ixgbevf.txt | 6 +- 8 files changed, 193 insertions(+), 43 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/e100.txt b/Documentation/networking/e100.txt index fcb6c71cdb69..13a32124bca0 100644 --- a/Documentation/networking/e100.txt +++ b/Documentation/networking/e100.txt @@ -1,7 +1,7 @@ Linux* Base Driver for the Intel(R) PRO/100 Family of Adapters ============================================================== -November 15, 2005 +March 15, 2011 Contents ======== @@ -122,7 +122,7 @@ Additional Configurations NOTE: This setting is not saved across reboots. - Ethtool + ethtool ------- The driver utilizes the ethtool interface for driver configuration and diff --git a/Documentation/networking/e1000.txt b/Documentation/networking/e1000.txt index 71ca95855671..437b2099cced 100644 --- a/Documentation/networking/e1000.txt +++ b/Documentation/networking/e1000.txt @@ -1,8 +1,8 @@ -Linux* Base Driver for the Intel(R) PRO/1000 Family of Adapters -=============================================================== +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== Intel Gigabit Linux driver. -Copyright(c) 1999 - 2010 Intel Corporation. +Copyright(c) 1999 - 2013 Intel Corporation. Contents ======== @@ -420,15 +420,15 @@ Additional Configurations - The maximum MTU setting for Jumbo Frames is 16110. This value coincides with the maximum Jumbo Frames size of 16128. - - Using Jumbo Frames at 10 or 100 Mbps may result in poor performance or - loss of link. 
+ - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in + poor performance or loss of link. - Adapters based on the Intel(R) 82542 and 82573V/E controller do not support Jumbo Frames. These correspond to the following product names: Intel(R) PRO/1000 Gigabit Server Adapter Intel(R) PRO/1000 PM Network Connection - Ethtool + ethtool ------- The driver utilizes the ethtool interface for driver configuration and diagnostics, as well as displaying statistical information. The ethtool diff --git a/Documentation/networking/e1000e.txt b/Documentation/networking/e1000e.txt index 97b5ba942ebf..ad2d9f38ce14 100644 --- a/Documentation/networking/e1000e.txt +++ b/Documentation/networking/e1000e.txt @@ -1,8 +1,8 @@ -Linux* Driver for Intel(R) Network Connection -============================================= +Linux* Driver for Intel(R) Ethernet Network Connection +====================================================== Intel Gigabit Linux driver. -Copyright(c) 1999 - 2010 Intel Corporation. +Copyright(c) 1999 - 2013 Intel Corporation. Contents ======== @@ -259,13 +259,16 @@ Additional Configurations - The maximum MTU setting for Jumbo Frames is 9216. This value coincides with the maximum Jumbo Frames size of 9234 bytes. - - Using Jumbo Frames at 10 or 100 Mbps is not supported and may result in + - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in poor performance or loss of link. - Some adapters limit Jumbo Frames sized packets to a maximum of 4096 bytes and some adapters do not support Jumbo Frames. - Ethtool + - Jumbo Frames cannot be configured on an 82579-based Network device, if + MACSec is enabled on the system. + + ethtool ------- The driver utilizes the ethtool interface for driver configuration and diagnostics, as well as displaying statistical information. We @@ -273,6 +276,9 @@ Additional Configurations http://ftp.kernel.org/pub/software/network/ethtool/ + NOTE: When validating enable/disable tests on some parts (82578, for example) + you need to add a few seconds between tests when working with ethtool. + Speed and Duplex ---------------- Speed and Duplex are configured through the ethtool* utility. For diff --git a/Documentation/networking/igb.txt b/Documentation/networking/igb.txt index 9a2a037194a5..4ebbd659256f 100644 --- a/Documentation/networking/igb.txt +++ b/Documentation/networking/igb.txt @@ -1,8 +1,8 @@ -Linux* Base Driver for Intel(R) Network Connection -================================================== +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== Intel Gigabit Linux driver. -Copyright(c) 1999 - 2010 Intel Corporation. +Copyright(c) 1999 - 2013 Intel Corporation. Contents ======== @@ -36,6 +36,53 @@ Default Value: 0 This parameter adds support for SR-IOV. It causes the driver to spawn up to max_vfs worth of virtual function. +QueuePairs +---------- +Valid Range: 0-1 +Default Value: 1 (TX and RX will be paired onto one interrupt vector) + +If set to 0, when MSI-X is enabled, the TX and RX will attempt to occupy +separate vectors. + +This option can be overridden to 1 if there are not sufficient interrupts +available. This can occur if any combination of RSS, VMDQ, and max_vfs +results in more than 4 queues being used. + +Node +---- +Valid Range: 0-n +Default Value: -1 (off) + + 0 - n: where n is the number of the NUMA node that should be used to + allocate memory for this adapter port. 
+ -1: uses the driver default of allocating memory on whichever processor is + running insmod/modprobe. + + The Node parameter will allow you to pick which NUMA node you want to have + the adapter allocate memory from. All driver structures, in-memory queues, + and receive buffers will be allocated on the node specified. This parameter + is only useful when interrupt affinity is specified, otherwise some portion + of the time the interrupt could run on a different core than the memory is + allocated on, causing slower memory access and impacting throughput, CPU, or + both. + +EEE +--- +Valid Range: 0-1 +Default Value: 1 (enabled) + + A link between two EEE-compliant devices will result in periodic bursts of + data followed by long periods where in the link is in an idle state. This Low + Power Idle (LPI) state is supported in both 1Gbps and 100Mbps link speeds. + NOTE: EEE support requires autonegotiation. + +DMAC +---- +Valid Range: 0-1 +Default Value: 1 (enabled) + Enables or disables DMA Coalescing feature. + + Additional Configurations ========================= @@ -55,10 +102,10 @@ Additional Configurations - The maximum MTU setting for Jumbo Frames is 9216. This value coincides with the maximum Jumbo Frames size of 9234 bytes. - - Using Jumbo Frames at 10 or 100 Mbps may result in poor performance or - loss of link. + - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in + poor performance or loss of link. - Ethtool + ethtool ------- The driver utilizes the ethtool interface for driver configuration and diagnostics, as well as displaying statistical information. The latest @@ -106,6 +153,14 @@ Additional Configurations Where n=the VF that attempted to do the spoofing. + Setting MAC Address, VLAN and Rate Limit Using IProute2 Tool + ------------------------------------------------------------ + You can set a MAC address of a Virtual Function (VF), a default VLAN and the + rate limit using the IProute2 tool. Download the latest version of the + iproute2 tool from Sourceforge if your version does not have all the + features you require. + + Support ======= diff --git a/Documentation/networking/igbvf.txt b/Documentation/networking/igbvf.txt index cbfe4ee65533..40db17a6665b 100644 --- a/Documentation/networking/igbvf.txt +++ b/Documentation/networking/igbvf.txt @@ -1,8 +1,8 @@ -Linux* Base Driver for Intel(R) Network Connection -================================================== +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== Intel Gigabit Linux driver. -Copyright(c) 1999 - 2010 Intel Corporation. +Copyright(c) 1999 - 2013 Intel Corporation. Contents ======== @@ -55,7 +55,7 @@ networking link on the left to search for your adapter: Additional Configurations ========================= - Ethtool + ethtool ------- The driver utilizes the ethtool interface for driver configuration and diagnostics, as well as displaying statistical information. 
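As a side note that is not part of this patch set: the ethtool interface mentioned throughout these driver documents is also reachable programmatically through the SIOCETHTOOL ioctl. Below is a minimal userspace sketch that queries driver name and version; the interface name eth0 is an assumption for illustration:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>

    int main(void)
    {
            struct ethtool_drvinfo drvinfo = { .cmd = ETHTOOL_GDRVINFO };
            struct ifreq ifr;
            int fd = socket(AF_INET, SOCK_DGRAM, 0);

            if (fd < 0)
                    return 1;

            memset(&ifr, 0, sizeof(ifr));
            strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1); /* assumed name */
            ifr.ifr_data = (void *)&drvinfo;

            /* Same interface the ethtool utility uses under the hood. */
            if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
                    printf("driver: %s  version: %s  bus: %s\n",
                           drvinfo.driver, drvinfo.version, drvinfo.bus_info);

            close(fd);
            return 0;
    }
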
The ethtool diff --git a/Documentation/networking/ixgb.txt b/Documentation/networking/ixgb.txt index d75a1f9565bb..1e0c045e89f7 100644 --- a/Documentation/networking/ixgb.txt +++ b/Documentation/networking/ixgb.txt @@ -1,7 +1,7 @@ -Linux Base Driver for 10 Gigabit Intel(R) Network Connection -============================================================= +Linux Base Driver for 10 Gigabit Intel(R) Ethernet Network Connection +===================================================================== -October 9, 2007 +March 14, 2011 Contents @@ -274,9 +274,9 @@ Additional Configurations ------------------------------------------------- Configuring a network driver to load properly when the system is started is distribution dependent. Typically, the configuration process involves adding - an alias line to files in /etc/modprobe.d/ as well as editing other system - startup scripts and/or configuration files. Many popular Linux distributions - ship with tools to make these changes for you. To learn the proper way to + an alias line to /etc/modprobe.conf as well as editing other system startup + scripts and/or configuration files. Many popular Linux distributions ship + with tools to make these changes for you. To learn the proper way to configure a network device for your system, refer to your distribution documentation. If during this process you are asked for the driver or module name, the name for the Linux Base Driver for the Intel 10GbE Family of @@ -306,7 +306,7 @@ Additional Configurations with the maximum Jumbo Frames size of 16128. - Ethtool + ethtool ------- The driver utilizes the ethtool interface for driver configuration and diagnostics, as well as displaying statistical information. The ethtool diff --git a/Documentation/networking/ixgbe.txt b/Documentation/networking/ixgbe.txt index af77ed3c4172..96cccebb839b 100644 --- a/Documentation/networking/ixgbe.txt +++ b/Documentation/networking/ixgbe.txt @@ -1,8 +1,9 @@ -Linux Base Driver for 10 Gigabit PCI Express Intel(R) Network Connection -======================================================================== +Linux* Base Driver for the Intel(R) Ethernet 10 Gigabit PCI Express Family of +Adapters +============================================================================= -Intel Gigabit Linux driver. -Copyright(c) 1999 - 2010 Intel Corporation. +Intel 10 Gigabit Linux driver. +Copyright(c) 1999 - 2013 Intel Corporation. Contents ======== @@ -16,8 +17,8 @@ Contents Identifying Your Adapter ======================== -The driver in this release is compatible with 82598 and 82599-based Intel -Network Connections. +The driver in this release is compatible with 82598, 82599 and X540-based +Intel Network Connections. For more information on how to identify your adapter, go to the Adapter & Driver ID Guide at: @@ -72,7 +73,7 @@ cables that comply with SFF-8431 v4.1 and SFF-8472 v10.4 specifications. Laser turns off for SFP+ when ifconfig down ------------------------------------------- "ifconfig down" turns off the laser for 82599-based SFP+ fiber adapters. -"ifconfig up" turns on the later. +"ifconfig up" turns on the laser. 82598-BASED ADAPTERS @@ -118,6 +119,93 @@ NOTE: For 82598 backplane cards entering 1 gig mode, flow control default behavior is changed to off. Flow control in 1 gig mode on these devices can lead to Tx hangs. +Intel(R) Ethernet Flow Director +------------------------------- +Supports advanced filters that direct receive packets by their flows to +different queues. Enables tight control on routing a flow in the platform. 
+Matches flows and CPU cores for flow affinity. Supports multiple parameters +for flexible flow classification and load balancing. + +Flow director is enabled only if the kernel is multiple TX queue capable. + +An included script (set_irq_affinity.sh) automates setting the IRQ to CPU +affinity. + +You can verify that the driver is using Flow Director by looking at the counter +in ethtool: fdir_miss and fdir_match. + +Other ethtool Commands: +To enable Flow Director + ethtool -K ethX ntuple on +To add a filter + Use -U switch. e.g., ethtool -U ethX flow-type tcp4 src-ip 0x178000a + action 1 +To see the list of filters currently present: + ethtool -u ethX + +Perfect Filter: Perfect filter is an interface to load the filter table that +funnels all flow into queue_0 unless an alternative queue is specified using +"action". In that case, any flow that matches the filter criteria will be +directed to the appropriate queue. + +If the queue is defined as -1, filter will drop matching packets. + +To account for filter matches and misses, there are two stats in ethtool: +fdir_match and fdir_miss. In addition, rx_queue_N_packets shows the number of +packets processed by the Nth queue. + +NOTE: Receive Packet Steering (RPS) and Receive Flow Steering (RFS) are not +compatible with Flow Director. IF Flow Director is enabled, these will be +disabled. + +The following three parameters impact Flow Director. + +FdirMode +-------- +Valid Range: 0-2 (0=off, 1=ATR, 2=Perfect filter mode) +Default Value: 1 + + Flow Director filtering modes. + +FdirPballoc +----------- +Valid Range: 0-2 (0=64k, 1=128k, 2=256k) +Default Value: 0 + + Flow Director allocated packet buffer size. + +AtrSampleRate +-------------- +Valid Range: 1-100 +Default Value: 20 + + Software ATR Tx packet sample rate. For example, when set to 20, every 20th + packet, looks to see if the packet will create a new flow. + +Node +---- +Valid Range: 0-n +Default Value: 1 (off) + + 0 - n: where n is the number of NUMA nodes (i.e. 0 - 3) currently online in + your system + 1: turns this option off + + The Node parameter will allow you to pick which NUMA node you want to have + the adapter allocate memory on. + +max_vfs +------- +Valid Range: 1-63 +Default Value: 0 + + If the value is greater than 0 it will also force the VMDq parameter to be 1 + or more. + + This parameter adds support for SR-IOV. It causes the driver to spawn up to + max_vfs worth of virtual function. + + Additional Configurations ========================= @@ -221,9 +309,10 @@ http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf Known Issues ============ - Enabling SR-IOV in a 32-bit Microsoft* Windows* Server 2008 Guest OS using - Intel (R) 82576-based GbE or Intel (R) 82599-based 10GbE controller under KVM - ----------------------------------------------------------------------------- + Enabling SR-IOV in a 32-bit or 64-bit Microsoft* Windows* Server 2008/R2 + Guest OS using Intel (R) 82576-based GbE or Intel (R) 82599-based 10GbE + controller under KVM + ------------------------------------------------------------------------ KVM Hypervisor/VMM supports direct assignment of a PCIe device to a VM. This includes traditional PCIe devices, as well as SR-IOV-capable devices using Intel 82576-based and 82599-based controllers. 
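For illustration only (this sketch is not part of the patch), the fdir_match and fdir_miss counters mentioned above can also be read over the SIOCETHTOOL ioctl by combining ETHTOOL_GSSET_INFO, ETHTOOL_GSTRINGS and ETHTOOL_GSTATS; the interface name is assumed and error handling is abbreviated:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>

    int main(void)
    {
            struct ifreq ifr;
            struct ethtool_sset_info *sset;
            struct ethtool_gstrings *strs;
            struct ethtool_stats *stats;
            unsigned int i, n;
            int fd = socket(AF_INET, SOCK_DGRAM, 0);

            memset(&ifr, 0, sizeof(ifr));
            strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1); /* assumed name */

            /* How many stats does the driver export? */
            sset = calloc(1, sizeof(*sset) + sizeof(__u32));
            sset->cmd = ETHTOOL_GSSET_INFO;
            sset->sset_mask = 1ULL << ETH_SS_STATS;
            ifr.ifr_data = (void *)sset;
            if (fd < 0 || ioctl(fd, SIOCETHTOOL, &ifr) < 0)
                    return 1;
            n = sset->data[0];

            /* Fetch the stat names ... */
            strs = calloc(1, sizeof(*strs) + n * ETH_GSTRING_LEN);
            strs->cmd = ETHTOOL_GSTRINGS;
            strs->string_set = ETH_SS_STATS;
            strs->len = n;
            ifr.ifr_data = (void *)strs;
            if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
                    return 1;

            /* ... and the values. */
            stats = calloc(1, sizeof(*stats) + n * sizeof(__u64));
            stats->cmd = ETHTOOL_GSTATS;
            stats->n_stats = n;
            ifr.ifr_data = (void *)stats;
            if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
                    return 1;

            for (i = 0; i < n; i++) {
                    const char *name = (const char *)strs->data +
                                       i * ETH_GSTRING_LEN;
                    if (!strncmp(name, "fdir_", 5)) /* fdir_match, fdir_miss */
                            printf("%.*s: %llu\n", ETH_GSTRING_LEN, name,
                                   (unsigned long long)stats->data[i]);
            }
            return 0;
    }

This is equivalent to filtering the output of "ethtool -S ethX" for the fdir_ prefix.
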
diff --git a/Documentation/networking/ixgbevf.txt b/Documentation/networking/ixgbevf.txt index 5a91a41fa946..53d8d2a5a6a3 100644 --- a/Documentation/networking/ixgbevf.txt +++ b/Documentation/networking/ixgbevf.txt @@ -1,8 +1,8 @@ -Linux* Base Driver for Intel(R) Network Connection -================================================== +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== Intel Gigabit Linux driver. -Copyright(c) 1999 - 2010 Intel Corporation. +Copyright(c) 1999 - 2013 Intel Corporation. Contents ======== -- cgit v1.2.3 From b800c3b966bcf004bd8592293a49ed5cb7ea67a9 Mon Sep 17 00:00:00 2001 From: Hannes Frederic Sowa Date: Tue, 27 Aug 2013 01:36:51 +0200 Subject: ipv6: drop fragmented ndisc packets by default (RFC 6980) This patch implements RFC6980: Drop fragmented ndisc packets by default. If a fragmented ndisc packet is received the user is informed that it is possible to disable the check. Cc: Fernando Gont Cc: YOSHIFUJI Hideaki Signed-off-by: Hannes Frederic Sowa Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 6 ++++++ include/linux/ipv6.h | 1 + include/uapi/linux/ipv6.h | 1 + net/ipv6/addrconf.c | 10 ++++++++++ net/ipv6/ndisc.c | 17 +++++++++++++++++ 5 files changed, 35 insertions(+) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index debfe857d8f9..a2be556032c9 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -1349,6 +1349,12 @@ mldv2_unsolicited_report_interval - INTEGER MLDv2 report retransmit will take place. Default: 1000 (1 second) +suppress_frag_ndisc - INTEGER + Control RFC 6980 (Security Implications of IPv6 Fragmentation + with IPv6 Neighbor Discovery) behavior: + 1 - (default) discard fragmented neighbor discovery packets + 0 - allow fragmented neighbor discovery packets + icmp/*: ratelimit - INTEGER Limit the maximal rates for sending ICMPv6 packets. diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h index 9ac5047062c8..28ea38439313 100644 --- a/include/linux/ipv6.h +++ b/include/linux/ipv6.h @@ -50,6 +50,7 @@ struct ipv6_devconf { __s32 accept_dad; __s32 force_tllao; __s32 ndisc_notify; + __s32 suppress_frag_ndisc; void *sysctl; }; diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h index d07ac6903e59..593b0e32d956 100644 --- a/include/uapi/linux/ipv6.h +++ b/include/uapi/linux/ipv6.h @@ -162,6 +162,7 @@ enum { DEVCONF_NDISC_NOTIFY, DEVCONF_MLDV1_UNSOLICITED_REPORT_INTERVAL, DEVCONF_MLDV2_UNSOLICITED_REPORT_INTERVAL, + DEVCONF_SUPPRESS_FRAG_NDISC, DEVCONF_MAX }; diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 2d6d1793bbfe..a7183fc9bbc2 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -204,6 +204,7 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = { .accept_source_route = 0, /* we do not accept RH0 by default. */ .disable_ipv6 = 0, .accept_dad = 1, + .suppress_frag_ndisc = 1, }; static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = { @@ -241,6 +242,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = { .accept_source_route = 0, /* we do not accept RH0 by default. 
*/ .disable_ipv6 = 0, .accept_dad = 1, + .suppress_frag_ndisc = 1, }; /* IPv6 Wildcard Address and Loopback Address defined by RFC2553 */ @@ -4188,6 +4190,7 @@ static inline void ipv6_store_devconf(struct ipv6_devconf *cnf, array[DEVCONF_ACCEPT_DAD] = cnf->accept_dad; array[DEVCONF_FORCE_TLLAO] = cnf->force_tllao; array[DEVCONF_NDISC_NOTIFY] = cnf->ndisc_notify; + array[DEVCONF_SUPPRESS_FRAG_NDISC] = cnf->suppress_frag_ndisc; } static inline size_t inet6_ifla6_size(void) @@ -5001,6 +5004,13 @@ static struct addrconf_sysctl_table .mode = 0644, .proc_handler = proc_dointvec }, + { + .procname = "suppress_frag_ndisc", + .data = &ipv6_devconf.suppress_frag_ndisc, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, { /* sentinel */ } diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c index 04d31c2fbef1..41720feeaa64 100644 --- a/net/ipv6/ndisc.c +++ b/net/ipv6/ndisc.c @@ -1519,10 +1519,27 @@ static void pndisc_redo(struct sk_buff *skb) kfree_skb(skb); } +static bool ndisc_suppress_frag_ndisc(struct sk_buff *skb) +{ + struct inet6_dev *idev = __in6_dev_get(skb->dev); + + if (!idev) + return true; + if (IP6CB(skb)->flags & IP6SKB_FRAGMENTED && + idev->cnf.suppress_frag_ndisc) { + net_warn_ratelimited("Received fragmented ndisc packet. Carefully consider disabling suppress_frag_ndisc.\n"); + return true; + } + return false; +} + int ndisc_rcv(struct sk_buff *skb) { struct nd_msg *msg; + if (ndisc_suppress_frag_ndisc(skb)) + return 0; + if (skb_linearize(skb)) return 0; -- cgit v1.2.3 From 95bd09eb27507691520d39ee1044d6ad831c1168 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 27 Aug 2013 05:46:32 -0700 Subject: tcp: TSO packets automatic sizing After hearing many people over past years complaining against TSO being bursty or even buggy, we are proud to present automatic sizing of TSO packets. One part of the problem is that tcp_tso_should_defer() uses an heuristic relying on upcoming ACKS instead of a timer, but more generally, having big TSO packets makes little sense for low rates, as it tends to create micro bursts on the network, and general consensus is to reduce the buffering amount. This patch introduces a per socket sk_pacing_rate, that approximates the current sending rate, and allows us to size the TSO packets so that we try to send one packet every ms. This field could be set by other transports. Patch has no impact for high speed flows, where having large TSO packets makes sense to reach line rate. For other flows, this helps better packet scheduling and ACK clocking. This patch increases performance of TCP flows in lossy environments. A new sysctl (tcp_min_tso_segs) is added, to specify the minimal size of a TSO packet (default being 2). A follow-up patch will provide a new packet scheduler (FQ), using sk_pacing_rate as an input to perform optional per flow pacing. This explains why we chose to set sk_pacing_rate to twice the current rate, allowing 'slow start' ramp up. sk_pacing_rate = 2 * cwnd * mss / srtt v2: Neal Cardwell reported a suspect deferring of last two segments on initial write of 10 MSS, I had to change tcp_tso_should_defer() to take into account tp->xmit_size_goal_segs Signed-off-by: Eric Dumazet Cc: Neal Cardwell Cc: Yuchung Cheng Cc: Van Jacobson Cc: Tom Herbert Acked-by: Yuchung Cheng Acked-by: Neal Cardwell Signed-off-by: David S. 
Miller --- Documentation/networking/ip-sysctl.txt | 9 +++++++++ include/net/sock.h | 2 ++ include/net/tcp.h | 1 + net/ipv4/sysctl_net_ipv4.c | 10 ++++++++++ net/ipv4/tcp.c | 28 +++++++++++++++++++++++----- net/ipv4/tcp_input.c | 32 +++++++++++++++++++++++++++++++- net/ipv4/tcp_output.c | 2 +- 7 files changed, 77 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index a2be556032c9..1cb3aeb4baff 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -482,6 +482,15 @@ tcp_syn_retries - INTEGER tcp_timestamps - BOOLEAN Enable timestamps as defined in RFC1323. +tcp_min_tso_segs - INTEGER + Minimal number of segments per TSO frame. + Since linux-3.12, TCP does an automatic sizing of TSO frames, + depending on flow rate, instead of filling 64Kbytes packets. + For specific usages, it's possible to force TCP to build big + TSO frames. Note that TCP stack might split too big TSO packets + if available window is too small. + Default: 2 + tcp_tso_win_divisor - INTEGER This allows control over what percentage of the congestion window can be consumed by a single TSO frame. diff --git a/include/net/sock.h b/include/net/sock.h index e4bbcbfd07ea..6ba2e7b0e2b1 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -232,6 +232,7 @@ struct cg_proto; * @sk_napi_id: id of the last napi context to receive data for sk * @sk_ll_usec: usecs to busypoll when there is no data * @sk_allocation: allocation mode + * @sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler) * @sk_sndbuf: size of send buffer in bytes * @sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE, * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings @@ -361,6 +362,7 @@ struct sock { kmemcheck_bitfield_end(flags); int sk_wmem_queued; gfp_t sk_allocation; + u32 sk_pacing_rate; /* bytes per second */ netdev_features_t sk_route_caps; netdev_features_t sk_route_nocaps; int sk_gso_type; diff --git a/include/net/tcp.h b/include/net/tcp.h index dd5e16f66f84..6a6a88db462d 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -281,6 +281,7 @@ extern int sysctl_tcp_early_retrans; extern int sysctl_tcp_limit_output_bytes; extern int sysctl_tcp_challenge_ack_limit; extern unsigned int sysctl_tcp_notsent_lowat; +extern int sysctl_tcp_min_tso_segs; extern atomic_long_t tcp_memory_allocated; extern struct percpu_counter tcp_sockets_allocated; diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 8ed7c32ae28e..540279f4c531 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -29,6 +29,7 @@ static int zero; static int one = 1; static int four = 4; +static int gso_max_segs = GSO_MAX_SEGS; static int tcp_retr1_max = 255; static int ip_local_port_range_min[] = { 1, 1 }; static int ip_local_port_range_max[] = { 65535, 65535 }; @@ -760,6 +761,15 @@ static struct ctl_table ipv4_table[] = { .extra1 = &zero, .extra2 = &four, }, + { + .procname = "tcp_min_tso_segs", + .data = &sysctl_tcp_min_tso_segs, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &gso_max_segs, + }, { .procname = "udp_mem", .data = &sysctl_udp_mem, diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 4e42c03859f4..fdf74090a001 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -283,6 +283,8 @@ int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT; +int sysctl_tcp_min_tso_segs __read_mostly = 2; + struct percpu_counter 
tcp_orphan_count; EXPORT_SYMBOL_GPL(tcp_orphan_count); @@ -785,12 +787,28 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now, xmit_size_goal = mss_now; if (large_allowed && sk_can_gso(sk)) { - xmit_size_goal = ((sk->sk_gso_max_size - 1) - - inet_csk(sk)->icsk_af_ops->net_header_len - - inet_csk(sk)->icsk_ext_hdr_len - - tp->tcp_header_len); + u32 gso_size, hlen; + + /* Maybe we should/could use sk->sk_prot->max_header here ? */ + hlen = inet_csk(sk)->icsk_af_ops->net_header_len + + inet_csk(sk)->icsk_ext_hdr_len + + tp->tcp_header_len; + + /* Goal is to send at least one packet per ms, + * not one big TSO packet every 100 ms. + * This preserves ACK clocking and is consistent + * with tcp_tso_should_defer() heuristic. + */ + gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC); + gso_size = max_t(u32, gso_size, + sysctl_tcp_min_tso_segs * mss_now); + + xmit_size_goal = min_t(u32, gso_size, + sk->sk_gso_max_size - 1 - hlen); - /* TSQ : try to have two TSO segments in flight */ + /* TSQ : try to have at least two segments in flight + * (one in NIC TX ring, another in Qdisc) + */ xmit_size_goal = min_t(u32, xmit_size_goal, sysctl_tcp_limit_output_bytes >> 1); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index ec492eae0cd7..1a84fffe6993 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -688,6 +688,34 @@ static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt) } } +/* Set the sk_pacing_rate to allow proper sizing of TSO packets. + * Note: TCP stack does not yet implement pacing. + * FQ packet scheduler can be used to implement cheap but effective + * TCP pacing, to smooth the burst on large writes when packets + * in flight is significantly lower than cwnd (or rwin) + */ +static void tcp_update_pacing_rate(struct sock *sk) +{ + const struct tcp_sock *tp = tcp_sk(sk); + u64 rate; + + /* set sk_pacing_rate to 200 % of current rate (mss * cwnd / srtt) */ + rate = (u64)tp->mss_cache * 2 * (HZ << 3); + + rate *= max(tp->snd_cwnd, tp->packets_out); + + /* Correction for small srtt : minimum srtt being 8 (1 jiffy << 3), + * be conservative and assume srtt = 1 (125 us instead of 1.25 ms) + * We probably need usec resolution in the future. + * Note: This also takes care of possible srtt=0 case, + * when tcp_rtt_estimator() was not yet called. + */ + if (tp->srtt > 8 + 2) + do_div(rate, tp->srtt); + + sk->sk_pacing_rate = min_t(u64, rate, ~0U); +} + /* Calculate rto without backoff. This is the second half of Van Jacobson's * routine referred to above. */ @@ -3278,7 +3306,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) u32 ack_seq = TCP_SKB_CB(skb)->seq; u32 ack = TCP_SKB_CB(skb)->ack_seq; bool is_dupack = false; - u32 prior_in_flight; + u32 prior_in_flight, prior_cwnd = tp->snd_cwnd, prior_rtt = tp->srtt; u32 prior_fackets; int prior_packets = tp->packets_out; const int prior_unsacked = tp->packets_out - tp->sacked_out; @@ -3383,6 +3411,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) if (icsk->icsk_pending == ICSK_TIME_RETRANS) tcp_schedule_loss_probe(sk); + if (tp->srtt != prior_rtt || tp->snd_cwnd != prior_cwnd) + tcp_update_pacing_rate(sk); return 1; no_queue: diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 884efff5b531..e63ae4c9691d 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1631,7 +1631,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb) /* If a full-sized TSO skb can be sent, do it. 
*/ if (limit >= min_t(unsigned int, sk->sk_gso_max_size, - sk->sk_gso_max_segs * tp->mss_cache)) + tp->xmit_size_goal_segs * tp->mss_cache)) goto send_now; /* Middle in queue won't get any more data, full sendable already? */ -- cgit v1.2.3 From 7ec06da81d2b98859b558d8d551a0c4e3d9516a3 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Wed, 28 Aug 2013 22:13:11 +0200 Subject: net: packet: document available fanout policies Update the documentation to list the fanout policies that are available. Signed-off-by: Daniel Borkmann Signed-off-by: David S. Miller --- Documentation/networking/packet_mmap.txt | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'Documentation') diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt index 8572796b1eb6..c01223628a87 100644 --- a/Documentation/networking/packet_mmap.txt +++ b/Documentation/networking/packet_mmap.txt @@ -543,6 +543,14 @@ TPACKET_V2 --> TPACKET_V3: In the AF_PACKET fanout mode, packet reception can be load balanced among processes. This also works in combination with mmap(2) on packet sockets. +Currently implemented fanout policies are: + + - PACKET_FANOUT_HASH: schedule to socket by skb's rxhash + - PACKET_FANOUT_LB: schedule to socket by round-robin + - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on + - PACKET_FANOUT_RND: schedule to socket by random selection + - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another + Minimal example code by David S. Miller (try things like "./test eth0 hash", "./test eth0 lb", etc.): -- cgit v1.2.3 From e2a240c7d3bcebf90936cc7c22c2729b3a4cec1f Mon Sep 17 00:00:00 2001 From: Sonic Zhang Date: Wed, 28 Aug 2013 18:55:39 +0800 Subject: driver:net:stmmac: Disable DMA store and forward mode if platform data force_thresh_dma_mode is set. Some Synopsys IP implementations, such as BF60x, don't support the DMA store and forward mode. So, set force_thresh_dma_mode to use DMA thresholds only. Update the documentation and devicetree binding as well. Signed-off-by: Sonic Zhang Acked-by: Giuseppe Cavallaro Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/stmmac.txt | 5 +++++ Documentation/networking/stmmac.txt | 3 +++ drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 4 +++- drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c | 5 +++++ include/linux/stmmac.h | 1 + 5 files changed, 17 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/stmmac.txt b/Documentation/devicetree/bindings/net/stmmac.txt index 261c563b5f06..eba0e5e59ebe 100644 --- a/Documentation/devicetree/bindings/net/stmmac.txt +++ b/Documentation/devicetree/bindings/net/stmmac.txt @@ -22,6 +22,11 @@ Required properties: - snps,pbl Programmable Burst Length - snps,fixed-burst Program the DMA to use the fixed burst mode - snps,mixed-burst Program the DMA to use the mixed burst mode +- snps,force_thresh_dma_mode Force DMA to use the threshold mode for + both tx and rx +- snps,force_sf_dma_mode Force DMA to use the Store and Forward + mode for both tx and rx. This flag is + ignored if force_thresh_dma_mode is set.
Optional properties: - mac-address: 6 bytes, mac address diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt index 654d2e55c8cb..457b8bbafb08 100644 --- a/Documentation/networking/stmmac.txt +++ b/Documentation/networking/stmmac.txt @@ -123,6 +123,7 @@ struct plat_stmmacenet_data { int bugged_jumbo; int pmt; int force_sf_dma_mode; + int force_thresh_dma_mode; int riwt_off; void (*fix_mac_speed)(void *priv, unsigned int speed); void (*bus_setup)(void __iomem *ioaddr); @@ -159,6 +160,8 @@ Where: o pmt: core has the embedded power module (optional). o force_sf_dma_mode: force DMA to use the Store and Forward mode instead of the Threshold. + o force_thresh_dma_mode: force DMA to use the Threshold mode instead of + the Store and Forward mode. o riwt_off: force to disable the RX watchdog feature and switch to NAPI mode. o fix_mac_speed: this callback is used for modifying some syscfg registers (on ST SoCs) according to the link speed negotiated by the diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index be406911fd01..8d4ccd35a016 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -1224,7 +1224,9 @@ static void free_dma_desc_resources(struct stmmac_priv *priv) */ static void stmmac_dma_operation_mode(struct stmmac_priv *priv) { - if (priv->plat->force_sf_dma_mode || priv->plat->tx_coe) { + if (priv->plat->force_thresh_dma_mode) + priv->hw->dma->dma_mode(priv->ioaddr, tc, tc); + else if (priv->plat->force_sf_dma_mode || priv->plat->tx_coe) { /* * In case of GMAC, SF mode can be enabled * to perform the TX COE in HW. This depends on: diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c index da8be6e63096..74a89a4aef88 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c @@ -79,6 +79,11 @@ static int stmmac_probe_config_dt(struct platform_device *pdev, of_property_read_u32(np, "snps,pbl", &dma_cfg->pbl); dma_cfg->fixed_burst = of_property_read_bool(np, "snps,fixed-burst"); dma_cfg->mixed_burst = of_property_read_bool(np, "snps,mixed-burst"); + plat->force_thresh_dma_mode = of_property_read_bool(np, "snps,force_thresh_dma_mode"); + if (plat->force_thresh_dma_mode) { + plat->force_sf_dma_mode = 0; + pr_warn("force_sf_dma_mode is ignored if force_thresh_dma_mode is set."); + } return 0; } diff --git a/include/linux/stmmac.h b/include/linux/stmmac.h index 9e495d31516e..bb5deb0feb6b 100644 --- a/include/linux/stmmac.h +++ b/include/linux/stmmac.h @@ -108,6 +108,7 @@ struct plat_stmmacenet_data { int bugged_jumbo; int pmt; int force_sf_dma_mode; + int force_thresh_dma_mode; int riwt_off; void (*fix_mac_speed)(void *priv, unsigned int speed); void (*bus_setup)(void __iomem *ioaddr); -- cgit v1.2.3 From 6da7c8fcbcbdb50ec68c61b40d554c74850fdb91 Mon Sep 17 00:00:00 2001 From: stephen hemminger Date: Tue, 27 Aug 2013 16:19:08 -0700 Subject: qdisc: allow setting default queuing discipline Historically, the pfifo_fast queue discipline has been used by default for all devices. But we have better choices now. This patch allows setting the default queueing discipline with sysctl. This allows easy use of better queueing disciplines on all devices without having to use tc qdisc scripts. It is intended to allow an easy path for distributions to make fq_codel or sfq the default qdisc.
This patch also makes pfifo_fast more of a first-class qdisc, since it is now possible to manually override the default and explicitly use pfifo_fast. The behavior for systems that do not use the sysctl is unchanged; they still get pfifo_fast. Also removes a leftover random # in sysctl net core. Signed-off-by: Stephen Hemminger Acked-by: Eric Dumazet Signed-off-by: David S. Miller --- Documentation/sysctl/net.txt | 13 ++++++++++ include/net/pkt_sched.h | 3 +++ include/net/sch_generic.h | 1 + net/core/sysctl_net_core.c | 30 ++++++++++++++++++++++- net/sched/sch_api.c | 58 ++++++++++++++++++++++++++++++++++++++++++++ net/sched/sch_generic.c | 11 +++++---- net/sched/sch_mq.c | 2 +- net/sched/sch_mqprio.c | 2 +- 8 files changed, 112 insertions(+), 8 deletions(-) (limited to 'Documentation') diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt index d569f2a424d5..9a0319a82470 100644 --- a/Documentation/sysctl/net.txt +++ b/Documentation/sysctl/net.txt @@ -50,6 +50,19 @@ The maximum number of packets that kernel can handle on a NAPI interrupt, it's a Per-CPU variable. Default: 64 +default_qdisc +-------------- + +The default queuing discipline to use for network devices. This allows +overriding the default queue discipline of pfifo_fast with an +alternative. Since the default queuing discipline is created without +additional parameters, it is best suited to queuing disciplines that +work well without configuration, such as stochastic fair queue (sfq), +CoDel (codel) or fair queue CoDel (fq_codel). Don't use queuing disciplines +like Hierarchical Token Bucket or Deficit Round Robin, which require setting +up classes and bandwidths. +Default: pfifo_fast + busy_read ---------------- Low latency busy poll timeout for socket reads. (needs CONFIG_NET_RX_BUSY_POLL) diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h index f7c24f8fbdc5..59ec3cd15d68 100644 --- a/include/net/pkt_sched.h +++ b/include/net/pkt_sched.h @@ -85,6 +85,9 @@ struct Qdisc *fifo_create_dflt(struct Qdisc *sch, struct Qdisc_ops *ops, int register_qdisc(struct Qdisc_ops *qops); int unregister_qdisc(struct Qdisc_ops *qops); +void qdisc_get_default(char *id, size_t len); +int qdisc_set_default(const char *id); + void qdisc_list_del(struct Qdisc *q); struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle); struct Qdisc *qdisc_lookup_class(struct net_device *dev, u32 handle); diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 76368c9d4503..2e31bf14df00 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -316,6 +316,7 @@ extern struct Qdisc noop_qdisc; extern struct Qdisc_ops noop_qdisc_ops; extern struct Qdisc_ops pfifo_fast_ops; extern struct Qdisc_ops mq_qdisc_ops; +extern const struct Qdisc_ops *default_qdisc_ops; struct Qdisc_class_common { u32 classid; diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c index 31107abd2783..cca444190907 100644 --- a/net/core/sysctl_net_core.c +++ b/net/core/sysctl_net_core.c @@ -20,6 +20,7 @@ #include #include #include +#include <net/pkt_sched.h> static int zero = 0; static int one = 1; @@ -193,6 +194,26 @@ static int flow_limit_table_len_sysctl(struct ctl_table *table, int write, } #endif /* CONFIG_NET_FLOW_LIMIT */ +#ifdef CONFIG_NET_SCHED +static int set_default_qdisc(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + char id[IFNAMSIZ]; + struct ctl_table tbl = { + .data = id, + .maxlen = IFNAMSIZ, + }; + int ret; + + qdisc_get_default(id, IFNAMSIZ); + + ret = proc_dostring(&tbl,
write, buffer, lenp, ppos); + if (write && ret == 0) + ret = qdisc_set_default(id); + return ret; +} +#endif + static struct ctl_table net_core_table[] = { #ifdef CONFIG_NET { @@ -315,7 +336,14 @@ static struct ctl_table net_core_table[] = { .mode = 0644, .proc_handler = proc_dointvec }, -# +#endif +#ifdef CONFIG_NET_SCHED + { + .procname = "default_qdisc", + .mode = 0644, + .maxlen = IFNAMSIZ, + .proc_handler = set_default_qdisc + }, #endif #endif /* CONFIG_NET */ { diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index 51b968d3febb..812e57900591 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -131,6 +131,11 @@ static DEFINE_RWLOCK(qdisc_mod_lock); ************************************************/ +/* Qdisc to use by default */ + +const struct Qdisc_ops *default_qdisc_ops = &pfifo_fast_ops; +EXPORT_SYMBOL(default_qdisc_ops); + /* The list of all installed queueing disciplines. */ static struct Qdisc_ops *qdisc_base; @@ -200,6 +205,58 @@ int unregister_qdisc(struct Qdisc_ops *qops) } EXPORT_SYMBOL(unregister_qdisc); +/* Get default qdisc if not otherwise specified */ +void qdisc_get_default(char *name, size_t len) +{ + read_lock(&qdisc_mod_lock); + strlcpy(name, default_qdisc_ops->id, len); + read_unlock(&qdisc_mod_lock); +} + +static struct Qdisc_ops *qdisc_lookup_default(const char *name) +{ + struct Qdisc_ops *q = NULL; + + for (q = qdisc_base; q; q = q->next) { + if (!strcmp(name, q->id)) { + if (!try_module_get(q->owner)) + q = NULL; + break; + } + } + + return q; +} + +/* Set new default qdisc to use */ +int qdisc_set_default(const char *name) +{ + const struct Qdisc_ops *ops; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + write_lock(&qdisc_mod_lock); + ops = qdisc_lookup_default(name); + if (!ops) { + /* Not found, drop lock and try to load module */ + write_unlock(&qdisc_mod_lock); + request_module("sch_%s", name); + write_lock(&qdisc_mod_lock); + + ops = qdisc_lookup_default(name); + } + + if (ops) { + /* Set new default */ + module_put(default_qdisc_ops->owner); + default_qdisc_ops = ops; + } + write_unlock(&qdisc_mod_lock); + + return ops ? 0 : -ENOENT; +} + /* We know handle. Find qdisc among all qdisc's attached to device (root qdisc, all its children, children of children etc.) */ @@ -1854,6 +1911,7 @@ static int __init pktsched_init(void) return err; } + register_qdisc(&pfifo_fast_ops); register_qdisc(&pfifo_qdisc_ops); register_qdisc(&bfifo_qdisc_ops); register_qdisc(&pfifo_head_drop_qdisc_ops); diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 48be3d5c0d92..5078e0c5db8d 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -530,7 +530,6 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = { .dump = pfifo_fast_dump, .owner = THIS_MODULE, }; -EXPORT_SYMBOL(pfifo_fast_ops); static struct lock_class_key qdisc_tx_busylock; @@ -583,6 +582,9 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue, { struct Qdisc *sch; + if (!try_module_get(ops->owner)) + goto errout; + sch = qdisc_alloc(dev_queue, ops); if (IS_ERR(sch)) goto errout; @@ -686,7 +688,7 @@ static void attach_one_default_qdisc(struct net_device *dev, if (dev->tx_queue_len) { qdisc = qdisc_create_dflt(dev_queue, - &pfifo_fast_ops, TC_H_ROOT); + default_qdisc_ops, TC_H_ROOT); if (!qdisc) { netdev_info(dev, "activation failed\n"); return; @@ -739,9 +741,8 @@ void dev_activate(struct net_device *dev) int need_watchdog; /* No queueing discipline is attached to device; - create default one i.e. 
pfifo_fast for devices, - which need queueing and noqueue_qdisc for - virtual interfaces + * create default one for devices, which need queueing + * and noqueue_qdisc for virtual interfaces */ if (dev->qdisc == &noop_qdisc) diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c index 5da78a19ac9a..2e56185736d6 100644 --- a/net/sched/sch_mq.c +++ b/net/sched/sch_mq.c @@ -57,7 +57,7 @@ static int mq_init(struct Qdisc *sch, struct nlattr *opt) for (ntx = 0; ntx < dev->num_tx_queues; ntx++) { dev_queue = netdev_get_tx_queue(dev, ntx); - qdisc = qdisc_create_dflt(dev_queue, &pfifo_fast_ops, + qdisc = qdisc_create_dflt(dev_queue, default_qdisc_ops, TC_H_MAKE(TC_H_MAJ(sch->handle), TC_H_MIN(ntx + 1))); if (qdisc == NULL) diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c index accec33c454c..d44c868cb537 100644 --- a/net/sched/sch_mqprio.c +++ b/net/sched/sch_mqprio.c @@ -124,7 +124,7 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt) for (i = 0; i < dev->num_tx_queues; i++) { dev_queue = netdev_get_tx_queue(dev, i); - qdisc = qdisc_create_dflt(dev_queue, &pfifo_fast_ops, + qdisc = qdisc_create_dflt(dev_queue, default_qdisc_ops, TC_H_MAKE(TC_H_MAJ(sch->handle), TC_H_MIN(i + 1))); if (qdisc == NULL) { -- cgit v1.2.3 From f21278108204ab244cd534a0d45c174ecc559267 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Wed, 4 Sep 2013 00:19:44 +0200 Subject: net: ipv6: mld: document force_mld_version in ip-sysctl.txt Document force_mld_version parameter in ip-sysctl.txt. Signed-off-by: Daniel Borkmann Cc: Hannes Frederic Sowa Acked-by: Hannes Frederic Sowa Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 1cb3aeb4baff..a46d78583ae1 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -1358,6 +1358,11 @@ mldv2_unsolicited_report_interval - INTEGER MLDv2 report retransmit will take place. Default: 1000 (1 second) +force_mld_version - INTEGER + 0 - (default) No enforcement of a MLD version, MLDv1 fallback allowed + 1 - Enforce to use MLD version 1 + 2 - Enforce to use MLD version 2 + suppress_frag_ndisc - INTEGER Control RFC 6980 (Security Implications of IPv6 Fragmentation with IPv6 Neighbor Discovery) behavior: -- cgit v1.2.3
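As a closing usage sketch that is not taken from any of the patches above: the per-interface IPv6 knobs documented in this section, suppress_frag_ndisc and force_mld_version, are plain procfs integers, so they can be set from C just as well as from a shell. Paths and values follow the documentation; the helper name is illustrative and writing requires CAP_NET_ADMIN:

    #include <stdio.h>

    /* Write one integer to /proc/sys/net/ipv6/conf/<iface>/<knob>. */
    static int set_ipv6_conf(const char *iface, const char *knob, int val)
    {
            char path[256];
            FILE *f;

            snprintf(path, sizeof(path),
                     "/proc/sys/net/ipv6/conf/%s/%s", iface, knob);
            f = fopen(path, "w");
            if (!f)
                    return -1;
            fprintf(f, "%d\n", val);
            return fclose(f);
    }

    int main(void)
    {
            /* Enforce MLDv2 on all interfaces. */
            if (set_ipv6_conf("all", "force_mld_version", 2))
                    perror("force_mld_version");

            /* Keep the RFC 6980 protection enabled (the default). */
            if (set_ipv6_conf("all", "suppress_frag_ndisc", 1))
                    perror("suppress_frag_ndisc");

            return 0;
    }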