summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2015-06-08fib_trie: coding style: Use pointer after checkFiro Yang1-8/+13
As Alexander Duyck pointed out that: struct tnode { ... struct key_vector kv[1]; } The kv[1] member of struct tnode is an arry that refernced by a null pointer will not crash the system, like this: struct tnode *p = NULL; struct key_vector *kv = p->kv; As such p->kv doesn't actually dereference anything, it is simply a means for getting the offset to the array from the pointer p. This patch make the code more regular to avoid making people feel odd when they look at the code. Signed-off-by: Firo Yang <firogm@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-08tcp: get_cookie_sock() consolidationEric Dumazet2-23/+6
IPv4 and IPv6 share same implementation of get_cookie_sock(), and there is no point inlining it. We add tcp_ prefix to the common helper name and export it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-07bpf: allow programs to write to certain skb fieldsAlexei Starovoitov1-12/+82
allow programs read/write skb->mark, tc_index fields and ((struct qdisc_skb_cb *)cb)->data. mark and tc_index are generically useful in TC. cb[0]-cb[4] are primarily used to pass arguments from one program to another called via bpf_tail_call() which can be seen in sockex3_kern.c example. All fields of 'struct __sk_buff' are readable to socket and tc_cls_act progs. mark, tc_index are writeable from tc_cls_act only. cb[0]-cb[4] are writeable by both sockets and tc_cls_act. Add verifier tests and improve sample code. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-07bpf: make programs see skb->data == L2 for ingress and egressAlexei Starovoitov3-25/+26
eBPF programs attached to ingress and egress qdiscs see inconsistent skb->data. For ingress L2 header is already pulled, whereas for egress it's present. This is known to program writers which are currently forced to use BPF_LL_OFF workaround. Since programs don't change skb internal pointers it is safe to do pull/push right around invocation of the program and earlier taps and later pt->func() will not be affected. Multiple taps via packet_rcv(), tpacket_rcv() are doing the same trick around run_filter/BPF_PROG_RUN even if skb_shared. This fix finally allows programs to use optimized LD_ABS/IND instructions without BPF_LL_OFF for higher performance. tc ingress + cls_bpf + samples/bpf/tcbpf1_kern.o w/o JIT w/JIT before 20.5 23.6 Mpps after 21.8 26.6 Mpps Old programs with BPF_LL_OFF will still work as-is. We can now undo most of the earlier workaround commit: a166151cbe33 ("bpf: fix bpf helpers to use skb->mac_header relative offsets") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-07tcp: remove redundant checks IIEric Dumazet2-8/+0
For same reasons than in commit 12e25e1041d0 ("tcp: remove redundant checks"), we can remove redundant checks done for timewait sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-07inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitationsEric Dumazet3-2/+11
When an application needs to force a source IP on an active TCP socket it has to use bind(IP, port=x). As most applications do not want to deal with already used ports, x is often set to 0, meaning the kernel is in charge to find an available port. But kernel does not know yet if this socket is going to be a listener or be connected. It has very limited choices (no full knowledge of final 4-tuple for a connect()) With limited ephemeral port range (about 32K ports), it is very easy to fill the space. This patch adds a new SOL_IP socket option, asking kernel to ignore the 0 port provided by application in bind(IP, port=0) and only remember the given IP address. The port will be automatically chosen at connect() time, in a way that allows sharing a source port as long as the 4-tuples are unique. This new feature is available for both IPv4 and IPv6 (Thanks Neal) Tested: Wrote a test program and checked its behavior on IPv4 and IPv6. strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by connect(). Also getsockname() show that the port is still 0 right after bind() but properly allocated after connect(). socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5 setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0 bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0 getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0 connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0 getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0 IPv6 test : socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7 setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0 bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0 getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0 connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0 getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0 I was able to bind()/connect() a million concurrent IPv4 sockets, instead of ~32000 before patch. lpaa23:~# ulimit -n 1000010 lpaa23:~# ./bind --connect --num-flows=1000000 & 1000000 sockets lpaa23:~# grep TCP /proc/net/sockstat TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66 Check that a given source port is indeed used by many different connections : lpaa23:~# ss -t src :40000 | head -10 State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 127.0.0.2:40000 127.0.202.33:44983 ESTAB 0 0 127.0.0.2:40000 127.2.27.240:44983 ESTAB 0 0 127.0.0.2:40000 127.2.98.5:44983 ESTAB 0 0 127.0.0.2:40000 127.0.124.196:44983 ESTAB 0 0 127.0.0.2:40000 127.2.139.38:44983 ESTAB 0 0 127.0.0.2:40000 127.1.59.80:44983 ESTAB 0 0 127.0.0.2:40000 127.3.6.228:44983 ESTAB 0 0 127.0.0.2:40000 127.0.38.53:44983 ESTAB 0 0 127.0.0.2:40000 127.1.197.10:44983 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05mpls: Add MPLS entropy label in flow_keysTom Herbert1-0/+35
In flow dissector if an MPLS header contains an entropy label this is saved in the new keyid field of flow_keys. The entropy label is then represented in the flow hash function input. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Add GRE keyid in flow_keysTom Herbert1-1/+23
In flow dissector if a GRE header contains a keyid this is saved in the new keyid field of flow_keys. The GRE keyid is then represented in the flow hash function input. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Add IPv6 flow label to flow_keysTom Herbert1-19/+10
In flow_dissector set the flow label in flow_keys for IPv6. This also removes the shortcircuiting of flow dissection when a non-zero label is present, the flow label can be considered to provide additional entropy for a hash. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Add VLAN ID to flow_keysTom Herbert1-0/+14
In flow_dissector set vlan_id in flow_keys when VLAN is found. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Get rid of IPv6 hash addresses flow keysTom Herbert1-17/+0
We don't need to return the IPv6 address hash as part of flow keys. In general, using the IPv6 address hash is risky in a hash value since the underlying use of xor provides no entropy. If someone really needs the hash value they can get it from the full IPv6 addresses in flow keys (e.g. from flow_get_u32_src). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Add keys for TIPC addressTom Herbert1-5/+13
Add a new flow key for TIPC addresses. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Add full IPv6 addresses to flow_keysTom Herbert4-29/+114
This patch adds full IPv6 addresses into flow_keys and uses them as input to the flow hash function. The implementation supports either IPv4 or IPv6 addresses in a union, and selector is used to determine how may words to input to jhash2. We also add flow_get_u32_dst and flow_get_u32_src functions which are used to get a u32 representation of the source and destination addresses. For IPv6, ipv6_addr_hash is called. These functions retain getting the legacy values of src and dst in flow_keys. With this patch, Ethertype and IP protocol are now included in the flow hash input. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Get skb hash over flow_keys structureTom Herbert2-13/+43
This patch changes flow hashing to use jhash2 over the flow_keys structure instead just doing jhash_3words over src, dst, and ports. This method will allow us take more input into the hashing function so that we can include full IPv6 addresses, VLAN, flow labels etc. without needing to resort to xor'ing which makes for a poor hash. Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Remove superfluous setting of key_basicTom Herbert1-6/+0
key_basic is set twice in __skb_flow_dissect which seems unnecessary. Remove second one. Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Simplify GRE case in flow_dissectorTom Herbert1-22/+22
Do break when we see routing flag or a non-zero version number in GRE header. Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-04bpf: fix build due to missing tc_verdAlexei Starovoitov1-3/+1
fix build error: net/core/filter.c: In function 'bpf_clone_redirect': net/core/filter.c:1429:18: error: 'struct sk_buff' has no member named 'tc_verd' if (G_TC_AT(skb2->tc_verd) & AT_INGRESS) Fixes: 3896d655f4d4 ("bpf: introduce bpf_clone_redirect() helper") Reported-by: Or Gerlitz <gerlitz.or@gmail.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-04tcp: double default TSQ output bytes limitWei Liu1-2/+2
Xen virtual network driver has higher latency than a physical NIC. Having only 128K as limit for TSQ introduced 30% regression in guest throughput. This patch raises the limit to 256K. This reduces the regression to 8%. This buys us more time to work out a proper solution in the long run. Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-04tcp: remove redundant checksEric Dumazet2-4/+4
tcp_v4_rcv() checks the following before calling tcp_v4_do_rcv(): if (th->doff < sizeof(struct tcphdr) / 4) goto bad_packet; if (!pskb_may_pull(skb, th->doff * 4)) goto discard_it; So following check in tcp_v4_do_rcv() is redundant and "goto csum_err;" is wrong anyway. if (skb->len < tcp_hdrlen(skb) || ...) goto csum_err; A second check can be removed after no_tcp_socket label for same reason. Same tests can be removed in tcp_v6_do_rcv() Note : short tcp frames are not properly accounted in tcpInErrs MIB, because pskb_may_pull() failure simply drops incoming skb, we might fix this in a separate patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-04switchdev: documentation: use switchdev_port_obj_xxx for IPv4 FIB ↵Scott Feldman1-2/+2
add/modify/delete ops Clarify in documentation and code that IPV4 FIB add operation is used for both adding a new FIB entry to the device and for modifying an existing FIB entry on the device. Also, remove left-over references to ipv4_fib ops and replace with details on SWITCHDEV_PORT_IPV4_FIB object. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-04Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-mergeDavid S. Miller4-45/+41
Antonio Quartulli says: ==================== pull request: batman-adv 20150603 here you have our second batch of patches intended for net-next. In this patchset you won't find any new features, but quite some code cleanup work, a bunch of code style fixes and also comments corrections by Markus Pargmann. Moreover you have a patch from Sven Eckelmann removing an unnecessary NULL check in batadv_iv_ogm_update_seqnos(). Please pull or let me know of any problem! ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-04bpf: introduce bpf_clone_redirect() helperAlexei Starovoitov1-0/+40
Allow eBPF programs attached to classifier/actions to call bpf_clone_redirect(skb, ifindex, flags) helper which will mirror or redirect the packet by dynamic ifindex selection from within the program to a target device either at ingress or at egress. Can be used for various scenarios, for example, to load balance skbs into veths, split parts of the traffic to local taps, etc. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-03batman-adv: Remove unnecessary ret variable in algo_registerMarkus Pargmann1-5/+2
Remove ret variable and all jumps. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2015-06-03batman-adv: Remove unnecessary ret variableMarkus Pargmann1-8/+3
We can avoid this indirect return variable by directly returning the error values. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2015-06-03batman-adv: main, batadv_compare_eth return boolMarkus Pargmann1-1/+1
Declare the returntype of batadv_compare_eth as bool. The function called inside this helper function (ether_addr_equal_unaligned) also uses bool as return value, so there is no need to return int. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2015-06-03batman-adv: main, Convert is_my_mac() to boolMarkus Pargmann2-5/+8
It is much clearer to see a bool type as return value than 'int' for functions that are supposed to return true or false. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2015-06-03batman-adv: Remove unnecessary check for orig_ifinfo not NULLSven Eckelmann1-2/+1
orig_ifinfo is dereferenced multiple times in batadv_iv_ogm_update_seqnos before the check for NULL is done. The function also exists at the beginning when orig_ifinfo would have been NULL. This makes the check at the end unnecessary and only confuses the reader/code analyzers. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2015-06-03batman-adv: types, Fix comment on bcast_ownMarkus Pargmann1-3/+4
batadv_orig_bat_iv->bcast_own is actually not a bitfield, it is an array. Adjust the comment accordingly. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2015-06-03batman-adv: iv_ogm, fix comment function nameMarkus Pargmann1-1/+1
This is a small copy paste fix for batadv_ing_buffer_avg. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2015-06-03batman-adv: iv_ogm, fix coding styleMarkus Pargmann1-1/+3
The kernel coding style says, that there should not be multiple assignments in one row. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2015-06-03batman-adv: iv_ogm, Fix dup_status commentMarkus Pargmann1-1/+1
Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2015-06-03batman-adv: iv_ogm_orig_update, style, add missing bracketsMarkus Pargmann1-1/+2
CodingStyle describes that either none or both branches of a conditional have to have brackets. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2015-06-03batman-adv: iv_ogm_queue_add, Simplify expressionsMarkus Pargmann1-2/+2
Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2015-06-03batman-adv: iv_ogm_aggregate_new, simplify error handlingMarkus Pargmann1-15/+13
It is just a bit easier to put the error handling at one place and let multiple error paths use the same calls. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2015-06-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller26-119/+215
Conflicts: drivers/net/phy/amd-xgbe-phy.c drivers/net/wireless/iwlwifi/Kconfig include/net/mac80211.h iwlwifi/Kconfig and mac80211.h were both trivial overlapping changes. The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and the bug fix that happened on the 'net' side is already integrated into the rest of the amd-xgbe driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller1-4/+0
Pablo Neira Ayuso says: ==================== Netfilter fix for net The following patch reverts the ebtables chunk that enforces counters that was introduced in the recently applied d26e2c9ffa38 ('Revert "netfilter: ensure number of counters is >0 in do_replace()"') since this breaks ebtables. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-02vlan: Add GRO support for non hardware accelerated vlanToshiaki Makita1-0/+96
Currently packets with non-hardware-accelerated vlan cannot be handled by GRO. This causes low performance for 802.1ad and stacked vlan, as their vlan tags are currently not stripped by hardware. This patch adds GRO support for non-hardware-accelerated vlan and improves receive performance of them. Test Environment: vlan device (.1Q) on vlan device (.1ad) on ixgbe (82599) Result: - Before $ netperf -t TCP_STREAM -H 192.168.20.2 -l 60 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 60.00 5233.17 Rx side CPU usage: %usr %sys %irq %soft %idle 0.27 58.03 0.00 41.70 0.00 - After $ netperf -t TCP_STREAM -H 192.168.20.2 -l 60 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 60.00 7586.85 Rx side CPU usage: %usr %sys %irq %soft %idle 0.50 25.83 0.00 59.53 14.14 [ Register VLAN offloads with priority 10 -DaveM ] Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-02vti6: Add pmtu handling to vti6_xmit.Steffen Klassert1-0/+14
We currently rely on the PMTU discovery of xfrm. However if a packet is localy sent, the PMTU mechanism of xfrm tries to to local socket notification what might not work for applications like ping that don't check for this. So add pmtu handling to vti6_xmit to report MTU changes immediately. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-02openvswitch: include datapath actions with sampled-packet upcall to userspaceNeil McKee3-10/+33
If new optional attribute OVS_USERSPACE_ATTR_ACTIONS is added to an OVS_ACTION_ATTR_USERSPACE action, then include the datapath actions in the upcall. This Directly associates the sampled packet with the path it takes through the virtual switch. Path information currently includes mangling, encapsulation and decapsulation actions for tunneling protocols GRE, VXLAN, Geneve, MPLS and QinQ, but this extension requires no further changes to accommodate datapath actions that may be added in the future. Adding path information enhances visibility into complex virtual networks. Signed-off-by: Neil McKee <neil.mckee@inmon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-02net: Add priority to packet_offload objects.David S. Miller3-2/+9
When we scan a packet for GRO processing, we want to see the most common packet types in the front of the offload_base list. So add a priority field so we can handle this properly. IPv4/IPv6 get the highest priority with the implicit zero priority field. Next comes ethernet with a priority of 10, and then we have the MPLS types with a priority of 15. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Suggested-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-02Revert "net: core: 'ethtool' issue with querying phy settings"David S. Miller1-9/+1
This reverts commit f96dee13b8e10f00840124255bed1d8b4c6afd6f. It isn't right, ethtool is meant to manage one PHY instance per netdevice at a time, and this is selected by the SET command. Therefore by definition the GET command must only return the settings for the configured and selected PHY. Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-01Revert "netfilter: ensure number of counters is >0 in do_replace()"Bernhard Thaler1-4/+0
This partially reverts commit 1086bbe97a07 ("netfilter: ensure number of counters is >0 in do_replace()") in net/bridge/netfilter/ebtables.c. Setting rules with ebtables does not work any more with 1086bbe97a07 place. There is an error message and no rules set in the end. e.g. ~# ebtables -t nat -A POSTROUTING --src 12:34:56:78:9a:bc -j DROP Unable to update the kernel. Two possible causes: 1. Multiple ebtables programs were executing simultaneously. The ebtables userspace tool doesn't by default support multiple ebtables programs running Reverting the ebtables part of 1086bbe97a07 makes this work again. Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-01net: dsa: Properly propagate errors from dsa_switch_setup_oneFlorian Fainelli1-2/+2
While shuffling some code around, dsa_switch_setup_one() was introduced, and it was modified to return either an error code using ERR_PTR() or a NULL pointer when running out of memory or failing to setup a switch. This is a problem for its caler: dsa_switch_setup() which uses IS_ERR() and expects to find an error code, not a NULL pointer, so we still try to proceed with dsa_switch_setup() and operate on invalid memory addresses. This can be easily reproduced by having e.g: the bcm_sf2 driver built-in, but having no such switch, such that drv->setup will fail. Fix this by using PTR_ERR() consistently which is both more informative and avoids for the caller to use IS_ERR_OR_NULL(). Fixes: df197195a5248 ("net: dsa: split dsa_switch_setup into two functions") Reported-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-01tcp: fix child sockets to use system default congestion control if not setNeal Cardwell2-2/+8
Linux 3.17 and earlier are explicitly engineered so that if the app doesn't specifically request a CC module on a listener before the SYN arrives, then the child gets the system default CC when the connection is established. See tcp_init_congestion_control() in 3.17 or earlier, which says "if no choice made yet assign the current value set as default". The change ("net: tcp: assign tcp cong_ops when tcp sk is created") altered these semantics, so that children got their parent listener's congestion control even if the system default had changed after the listener was created. This commit returns to those original semantics from 3.17 and earlier, since they are the original semantics from 2007 in 4d4d3d1e8 ("[TCP]: Congestion control initialization."), and some Linux congestion control workflows depend on that. In summary, if a listener socket specifically sets TCP_CONGESTION to "x", or the route locks the CC module to "x", then the child gets "x". Otherwise the child gets current system default from net.ipv4.tcp_congestion_control. That's the behavior in 3.17 and earlier, and this commit restores that. Fixes: 55d8694fa82c ("net: tcp: assign tcp cong_ops when tcp sk is created") Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: Glenn Judd <glenn.judd@morganstanley.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-01net/rds Add getsockopt support for SO_RDS_TRANSPORTSowmini Varadhan1-0/+14
The currently attached transport for a PF_RDS socket may be obtained from user space by invoking getsockopt(2) using the SO_RDS_TRANSPORT option at the SOL_RDS level. The integer optval returned will be one of the RDS_TRANS_* constants defined in linux/rds.h. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-01net/rds: Add setsockopt support for SO_RDS_TRANSPORTSowmini Varadhan4-0/+53
An application may deterministically attach the underlying transport for a PF_RDS socket by invoking setsockopt(2) with the SO_RDS_TRANSPORT option at the SOL_RDS level. The integer argument to setsockopt must be one of the RDS_TRANS_* transport types, e.g., RDS_TRANS_TCP. The option must be specified before invoking bind(2) on the socket, and may only be used once on the socket. An attempt to set the option on a bound socket, or to invoke the option after a successful SO_RDS_TRANSPORT attachment, will return EOPNOTSUPP. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-01net/rds: Declare SO_RDS_TRANSPORT and RDS_TRANS_* constants in uapi/linux/rds.hSowmini Varadhan1-5/+0
User space applications that desire to explicitly select the underlying transport for a PF_RDS socket may do so by using the SO_RDS_TRANSPORT socket option at the SOL_RDS level before bind(). The integer argument provided to the socket option would be one of the RDS_TRANS_* values, e.g., RDS_TRANS_TCP. This commit exports the constant values need by such applications via <linux/rds.h> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-01ebpf: allow bpf_ktime_get_ns_proto also for networkingDaniel Borkmann1-0/+2
As this is already exported from tracing side via commit d9847d310ab4 ("tracing: Allow BPF programs to call bpf_ktime_get_ns()"), we might as well want to move it to the core, so also networking users can make use of it, e.g. to measure diffs for certain flows from ingress/egress. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-01udp: fix behavior of wrong checksumsEric Dumazet2-8/+4
We have two problems in UDP stack related to bogus checksums : 1) We return -EAGAIN to application even if receive queue is not empty. This breaks applications using edge trigger epoll() 2) Under UDP flood, we can loop forever without yielding to other processes, potentially hanging the host, especially on non SMP. This patch is an attempt to make things better. We might in the future add extra support for rt applications wanting to better control time spent doing a recv() in a hostile environment. For example we could validate checksums before queuing packets in socket receive queue. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-01Merge tag 'mac80211-next-for-davem-2015-05-29' of ↵David S. Miller10-27/+73
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== As we get closer to the merge window, here are a few more things for -next: * disconnect TDLS stations on CSA to avoid issues * fix a memory leak introduced in a recent commit * switch rfkill and cfg80211 to PM ops * in an unlikely scenario, prevent a bookkeeping value to get corrupted leading to dropped packets * fix a crash in VLAN assignment * switch rfkill-gpio to more modern gpiod API * send disconnected event to userspace with proper local/remote indication ==================== Signed-off-by: David S. Miller <davem@davemloft.net>