summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)AuthorFilesLines
2016-12-10Merge tag 'mac80211-next-for-davem-2016-12-09' of ↵David S. Miller1-0/+11
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== Three fixes: * fix a logic bug introduced by a previous cleanup * fix nl80211 attribute confusing (trying to use a single attribute for two purposes) * fix a long-standing BSS leak that happens when an association attempt is abandoned ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-09cfg80211/mac80211: fix BSS leaks when abandoning assoc attemptsJohannes Berg1-0/+11
When mac80211 abandons an association attempt, it may free all the data structures, but inform cfg80211 and userspace about it only by sending the deauth frame it received, in which case cfg80211 has no link to the BSS struct that was used and will not cfg80211_unhold_bss() it. Fix this by providing a way to inform cfg80211 of this with the BSS entry passed, so that it can clean up properly, and use this ability in the appropriate places in mac80211. This isn't ideal: some code is more or less duplicated and tracing is missing. However, it's a fairly small change and it's thus easier to backport - cleanups can come later. Cc: stable@vger.kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-12-08net: do not read sk_drops if application does not careEric Dumazet1-1/+2
sk_drops can be an often written field, do not read it unless application showed interest. Note that sk_drops can be read via inet_diag, so applications can avoid getting this info from every received packet. In the future, 'reading' sk_drops might require folding per node or per cpu fields, and thus become even more expensive than today. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08net: rfs: add a jump labelEric Dumazet1-11/+14
RFS is not commonly used, so add a jump label to avoid some conditionals in fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08flow dissector: ICMP supportSimon Horman1-0/+17
Allow dissection of ICMP(V6) type and code. This should only occur if a packet is ICMP(V6) and the dissector has FLOW_DISSECTOR_KEY_ICMP set. There are currently no users of FLOW_DISSECTOR_KEY_ICMP. A follow-up patch will allow FLOW_DISSECTOR_KEY_ICMP to be used by the flower classifier. Signed-off-by: Simon Horman <simon.horman@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller13-7/+202
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains a large Netfilter update for net-next, to summarise: 1) Add support for stateful objects. This series provides a nf_tables native alternative to the extended accounting infrastructure for nf_tables. Two initial stateful objects are supported: counters and quotas. Objects are identified by a user-defined name, you can fetch and reset them anytime. You can also use a maps to allow fast lookups using any arbitrary key combination. More info at: http://marc.info/?l=netfilter-devel&m=148029128323837&w=2 2) On-demand registration of nf_conntrack and defrag hooks per netns. Register nf_conntrack hooks if we have a stateful ruleset, ie. state-based filtering or NAT. The new nf_conntrack_default_on sysctl enables this from newly created netnamespaces. Default behaviour is not modified. Patches from Florian Westphal. 3) Allocate 4k chunks and then use these for x_tables counter allocation requests, this improves ruleset load time and also datapath ruleset evaluation, patches from Florian Westphal. 4) Add support for ebpf to the existing x_tables bpf extension. From Willem de Bruijn. 5) Update layer 4 checksum if any of the pseudoheader fields is updated. This provides a limited form of 1:1 stateless NAT that make sense in specific scenario, eg. load balancing. 6) Add support to flush sets in nf_tables. This series comes with a new set->ops->deactivate_one() indirection given that we have to walk over the list of set elements, then deactivate them one by one. The existing set->ops->deactivate() performs an element lookup that we don't need. 7) Two patches to avoid cloning packets, thus speed up packet forwarding via nft_fwd from ingress. From Florian Westphal. 8) Two IPVS patches via Simon Horman: Decrement ttl in all modes to prevent infinite loops, patch from Dwip Banerjee. And one minor refactoring from Gao feng. 9) Revisit recent log support for nf_tables netdev families: One patch to ensure that we correctly handle non-ethernet packets. Another patch to add missing logger definition for netdev. Patches from Liping Zhang. 10) Three patches for nft_fib, one to address insufficient register initialization and another to solve incorrect (although harmless) byteswap operation. Moreover update xt_rpfilter and nft_fib to match lbcast packets with zeronet as source, eg. DHCP Discover packets (0.0.0.0 -> 255.255.255.255). Also from Liping Zhang. 11) Built-in DCCP, SCTP and UDPlite conntrack and NAT support, from Davide Caratti. While DCCP is rather hopeless lately, and UDPlite has been broken in many-cast mode for some little time, let's give them a chance by placing them at the same level as other existing protocols. Thus, users don't explicitly have to modprobe support for this and NAT rules work for them. Some people point to the lack of support in SOHO Linux-based routers that make deployment of new protocols harder. I guess other middleboxes outthere on the Internet are also to blame. Anyway, let's see if this has any impact in the midrun. 12) Skip software SCTP software checksum calculation if the NIC comes with SCTP checksum offload support. From Davide Caratti. 13) Initial core factoring to prepare conversion to hook array. Three patches from Aaron Conole. 14) Gao Feng made a wrong conversion to switch in the xt_multiport extension in a patch coming in the previous batch. Fix it in this batch. 15) Get vmalloc call in sync with kmalloc flags to avoid a warning and likely OOM killer intervention from x_tables. From Marcelo Ricardo Leitner. 16) Update Arturo Borrero's email address in all source code headers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-07net: sock_rps_record_flow() is for connected socketsEric Dumazet1-1/+11
Paolo noticed a cache line miss in UDP recvmsg() to access sk_rxhash, sharing a cache line with sk_drops. sk_drops might be heavily incremented by cpus handling a flood targeting this socket. We might place sk_drops on a separate cache line, but lets try to avoid wasting 64 bytes per socket just for this, since we have other bottlenecks to take care of. sock_rps_record_flow() should only access sk_rxhash for connected flows. Testing sk_state for TCP_ESTABLISHED covers most of the cases for connected sockets, for a zero cost, since system calls using sock_rps_record_flow() also access sk->sk_prot which is on the same cache line. A follow up patch will provide a static_key (Jump Label) since most hosts do not even use RFS. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-07netfilter: nf_tables: support for set flushingPablo Neira Ayuso1-1/+5
This patch adds support for set flushing, that consists of walking over the set elements if the NFTA_SET_ELEM_LIST_ELEMENTS attribute is set. This patch requires the following changes: 1) Add set->ops->deactivate_one() operation: This allows us to deactivate an element from the set element walk path, given we can skip the lookup that happens in ->deactivate(). 2) Add a new nft_trans_alloc_gfp() function since we need to allocate transactions using GFP_ATOMIC given the set walk path happens with held rcu_read_lock. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07netfilter: nf_tables: add stateful object reference to set elementsPablo Neira Ayuso1-0/+9
This patch allows you to refer to stateful objects from set elements. This provides the infrastructure to create maps where the right hand side of the mapping is a stateful object. This allows us to build dictionaries of stateful objects, that you can use to perform fast lookups using any arbitrary key combination. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07netfilter: nft_quota: add depleted flag for objectsPablo Neira Ayuso1-0/+2
Notify on depleted quota objects. The NFT_QUOTA_F_DEPLETED flag indicates we have reached overquota. Add pointer to table from nft_object, so we can use it when sending the depletion notification to userspace. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07netfilter: nf_tables: notify internal updates of stateful objectsPablo Neira Ayuso1-0/+4
Introduce nf_tables_obj_notify() to notify internal state changes in stateful objects. This is used by the quota object to report depletion in a follow up patch. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07netfilter: nf_tables: atomic dump and reset for stateful objectsPablo Neira Ayuso1-1/+2
This patch adds a new NFT_MSG_GETOBJ_RESET command perform an atomic dump-and-reset of the stateful object. This also comes with add support for atomic dump and reset for counter and quota objects. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06netfilter: nf_tables: add stateful objectsPablo Neira Ayuso1-0/+79
This patch augments nf_tables to support stateful objects. This new infrastructure allows you to create, dump and delete stateful objects, that are identified by a user-defined name. This patch adds the generic infrastructure, follow up patches add support for two stateful objects: counters and quotas. This patch provides a native infrastructure for nf_tables to replace nfacct, the extended accounting infrastructure for iptables. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06netfilter: add and use nf_fwd_netdev_egressFlorian Westphal1-0/+1
... so we can use current skb instead of working with a clone. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06netfilter: nft_payload: layer 4 checksum adjustment for pseudoheader fieldsPablo Neira Ayuso1-0/+1
This patch adds a new flag that signals the kernel to update layer 4 checksum if the packet field belongs to the layer 4 pseudoheader. This implicitly provides stateless NAT 1:1 that is useful under very specific usecases. Since rules mangling layer 3 fields that are part of the pseudoheader may potentially convey any layer 4 packet, we have to deal with the layer 4 checksum adjustment using protocol specific code. This patch adds support for TCP, UDP and ICMPv6, since they include the pseudoheader in the layer 4 checksum calculation. ICMP doesn't, so we can skip it. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06netfilter: defrag: only register defrag functionality if neededFlorian Westphal3-2/+10
nf_defrag modules for ipv4 and ipv6 export an empty stub function. Any module that needs the defragmentation hooks registered simply 'calls' this empty function to create a phony module dependency -- modprobe will then load the defrag module too. This extends netfilter ipv4/ipv6 defragmentation modules to delay the hook registration until the functionality is requested within a network namespace instead of module load time for all namespaces. Hooks are only un-registered on module unload or when a namespace that used such defrag functionality exits. We have to use struct net for this as the register hooks can be called before netns initialization here from the ipv4/ipv6 conntrack module init path. There is no unregister functionality support, defrag will always be active once it was requested inside a net namespace. The reason is that defrag has impact on nft and iptables rulesets (without defrag we might see framents). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-05net_sched: gen_estimator: complete rewrite of rate estimatorsEric Dumazet4-13/+18
1) Old code was hard to maintain, due to complex lock chains. (We probably will be able to remove some kfree_rcu() in callers) 2) Using a single timer to update all estimators does not scale. 3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity is not supposed to work well) In this rewrite : - I removed the RB tree that had to be scanned in gen_estimator_active(). qdisc dumps should be much faster. - Each estimator has its own timer. - Estimations are maintained in net_rate_estimator structure, instead of dirtying the qdisc. Minor, but part of the simplification. - Reading the estimator uses RCU and a seqcount to provide proper support for 32bit kernels. - We reduce memory need when estimators are not used, since we store a pointer, instead of the bytes/packets counters. - xt_rateest_mt() no longer has to grab a spinlock. (In the future, xt_rateest_tg() could be switched to per cpu counters) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05Merge branch 'for-upstream' of ↵David S. Miller1-12/+13
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2016-12-03 Here's a set of Bluetooth & 802.15.4 patches for net-next (i.e. 4.10 kernel): - Fix for a potential NULL deref in the ieee802154 netlink code - Fix for the ED values of the at86rf2xx driver - Documentation updates to ieee802154 - Cleanups to u8 vs __u8 usage - Timer API usage cleanups in HCI drivers Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05net: reorganize struct sock for better data localityEric Dumazet1-24/+27
Group fields used in TX path, and keep some cache lines mostly read to permit sharing among cpus. Gained two 4 bytes holes on 64bit arches. Added a place holder for tcp tsq_flags, next to sk_wmem_alloc to speed up tcp_wfree() in the following patch. I have not added ____cacheline_aligned_in_smp, this might be done later. I prefer doing this once inet and tcp/udp sockets reorg is also done. Tested with both TCP and UDP. UDP receiver performance under flood increased by ~20 % : Accessing sk_filter/sk_wq/sk_napi_id no longer stalls because sk_drops was moved away from a critical cache line, now mostly read and shared. /* --- cacheline 4 boundary (256 bytes) --- */ unsigned int sk_napi_id; /* 0x100 0x4 */ int sk_rcvbuf; /* 0x104 0x4 */ struct sk_filter * sk_filter; /* 0x108 0x8 */ union { struct socket_wq * sk_wq; /* 0x8 */ struct socket_wq * sk_wq_raw; /* 0x8 */ }; /* 0x110 0x8 */ struct xfrm_policy * sk_policy[2]; /* 0x118 0x10 */ struct dst_entry * sk_rx_dst; /* 0x128 0x8 */ struct dst_entry * sk_dst_cache; /* 0x130 0x8 */ atomic_t sk_omem_alloc; /* 0x138 0x4 */ int sk_sndbuf; /* 0x13c 0x4 */ /* --- cacheline 5 boundary (320 bytes) --- */ int sk_wmem_queued; /* 0x140 0x4 */ atomic_t sk_wmem_alloc; /* 0x144 0x4 */ long unsigned int sk_tsq_flags; /* 0x148 0x8 */ struct sk_buff * sk_send_head; /* 0x150 0x8 */ struct sk_buff_head sk_write_queue; /* 0x158 0x18 */ __s32 sk_peek_off; /* 0x170 0x4 */ int sk_write_pending; /* 0x174 0x4 */ long int sk_sndtimeo; /* 0x178 0x8 */ Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-04netfilter: conntrack: add nf_conntrack_default_on sysctlFlorian Westphal1-0/+9
This switch (default on) can be used to disable automatic registration of connection tracking functionality in newly created network namespaces. This means that when net namespace goes down (or the tracker protocol module is unloaded) we *might* have to unregister the hooks. We can either add another per-netns variable that tells if the hooks got registered by default, or, alternatively, just call the protocol _put() function and have the callee deal with a possible 'extra' put() operation that doesn't pair with a get() one. This uses the latter approach, i.e. a put() without a get has no effect. Conntrack is still enabled automatically regardless of the new sysctl setting if the new net namespace requires connection tracking, e.g. when NAT rules are created. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: conntrack: register hooks in netns when needed by rulesetFlorian Westphal1-0/+4
This makes use of nf_ct_netns_get/put added in previous patch. We add get/put functions to nf_conntrack_l3proto structure, ipv4 and ipv6 then implement use-count to track how many users (nft or xtables modules) have a dependency on ipv4 and/or ipv6 connection tracking functionality. When count reaches zero, the hooks are unregistered. This delays activation of connection tracking inside a namespace until stateful firewall rule or nat rule gets added. This patch breaks backwards compatibility in the sense that connection tracking won't be active anymore when the protocol tracker module is loaded. This breaks e.g. setups that ctnetlink for flow accounting and the like, without any '-m conntrack' packet filter rules. Followup patch restores old behavour and makes new delayed scheme optional via sysctl. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: add and use nf_ct_netns_get/putFlorian Westphal1-0/+4
currently aliased to try_module_get/_put. Will be changed in next patch when we add functions to make use of ->net argument to store usercount per l3proto tracker. This is needed to avoid registering the conntrack hooks in all netns and later only enable connection tracking in those that need conntrack. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: conntrack: remove unused init_net hookFlorian Westphal1-3/+0
since adf0516845bcd0 ("netfilter: remove ip_conntrack* sysctl compat code") the only user (ipv4 tracker) sets this to an empty stub function. After this change nf_ct_l3proto_pernet_register() is also empty, but this will change in a followup patch to add conditional register of the hooks. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: conntrack: built-in support for UDPliteDavide Caratti3-0/+22
CONFIG_NF_CT_PROTO_UDPLITE is no more a tristate. When set to y, connection tracking support for UDPlite protocol is built-in into nf_conntrack.ko. footprint test: $ ls -l net/netfilter/nf_conntrack{_proto_udplite,}.ko \ net/ipv4/netfilter/nf_conntrack_ipv4.ko \ net/ipv6/netfilter/nf_conntrack_ipv6.ko (builtin)|| udplite| ipv4 | ipv6 |nf_conntrack ---------++--------+--------+--------+-------------- none || 432538 | 828755 | 828676 | 6141434 UDPlite || - | 829649 | 829362 | 6498204 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: conntrack: built-in support for SCTPDavide Caratti3-0/+19
CONFIG_NF_CT_PROTO_SCTP is no more a tristate. When set to y, connection tracking support for SCTP protocol is built-in into nf_conntrack.ko. footprint test: $ ls -l net/netfilter/nf_conntrack{_proto_sctp,}.ko \ net/ipv4/netfilter/nf_conntrack_ipv4.ko \ net/ipv6/netfilter/nf_conntrack_ipv6.ko (builtin)|| sctp | ipv4 | ipv6 | nf_conntrack ---------++--------+--------+--------+-------------- none || 498243 | 828755 | 828676 | 6141434 SCTP || - | 829254 | 829175 | 6547872 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: conntrack: built-in support for DCCPDavide Caratti3-0/+20
CONFIG_NF_CT_PROTO_DCCP is no more a tristate. When set to y, connection tracking support for DCCP protocol is built-in into nf_conntrack.ko. footprint test: $ ls -l net/netfilter/nf_conntrack{_proto_dccp,}.ko \ net/ipv4/netfilter/nf_conntrack_ipv4.ko \ net/ipv6/netfilter/nf_conntrack_ipv6.ko (builtin)|| dccp | ipv4 | ipv6 | nf_conntrack ---------++--------+--------+--------+-------------- none || 469140 | 828755 | 828676 | 6141434 DCCP || - | 830566 | 829935 | 6533526 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: nf_log: do not assume ethernet header in netdev familyLiping Zhang1-1/+3
In netdev family, we will handle non ethernet packets, so using eth_hdr(skb)->h_proto is incorrect. Meanwhile, we can use socket(AF_PACKET...) to sending packets, so skb->protocol is not always set in bridge family. Add an extra parameter into nf_log_l2packet to solve this issue. Fixes: 1fddf4bad0ac ("netfilter: nf_log: add packet logging for netdev family") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: built-in NAT support for UDPliteDavide Caratti1-0/+3
CONFIG_NF_NAT_PROTO_UDPLITE is no more a tristate. When set to y, NAT support for UDPlite protocol is built-in into nf_nat.ko. footprint test: (nf_nat_proto_) |udplite || nf_nat --------------------------+--------++-------- no builtin | 408048 || 2241312 UDPLITE builtin | - || 2577256 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: built-in NAT support for SCTPDavide Caratti1-0/+3
CONFIG_NF_NAT_PROTO_SCTP is no more a tristate. When set to y, NAT support for SCTP protocol is built-in into nf_nat.ko. footprint test: (nf_nat_proto_) | sctp || nf_nat --------------------------+--------++-------- no builtin | 428344 || 2241312 SCTP builtin | - || 2597032 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: built-in NAT support for DCCPDavide Caratti1-0/+3
CONFIG_NF_NAT_PROTO_DCCP is no more a tristate. When set to y, NAT support for DCCP protocol is built-in into nf_nat.ko. footprint test: (nf_nat_proto_) | dccp || nf_nat --------------------------+--------++-------- no builtin | 409800 || 2241312 DCCP builtin | - || 2578968 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04ipv6 addrconf: Implemented enhanced DAD (RFC7527)Erik Nordmark2-1/+5
Implemented RFC7527 Enhanced DAD. IPv6 duplicate address detection can fail if there is some temporary loopback of Ethernet frames. RFC7527 solves this by including a random nonce in the NS messages used for DAD, and if an NS is received with the same nonce it is assumed to be a looped back DAD probe and is ignored. RFC7527 is enabled by default. Can be disabled by setting both of conf/{all,interface}/enhanced_dad to zero. Signed-off-by: Erik Nordmark <nordmark@arista.com> Signed-off-by: Bob Gilligan <gilligan@arista.com> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-04ipv4: fib: Replay events when registering FIB notifierIdo Schimmel1-1/+2
Commit b90eb7549499 ("fib: introduce FIB notification infrastructure") introduced a new notification chain to notify listeners (f.e., switchdev drivers) about addition and deletion of routes. However, upon registration to the chain the FIB tables can already be populated, which means potential listeners will have an incomplete view of the tables. Solve that by dumping the FIB tables and replaying the events to the passed notification block. The dump itself is done using RCU in order not to starve consumers that need RTNL to make progress. The integrity of the dump is ensured by reading the FIB change sequence counter before and after the dump under RTNL. This allows us to avoid the problematic situation in which the dumping process sends a ENTRY_ADD notification following ENTRY_DEL generated by another process holding RTNL. Callers of the registration function may pass a callback that is executed in case the dump was inconsistent with current FIB tables. The number of retries until a consistent dump is achieved is set to a fixed number to prevent callers from looping for long periods of time. In case current limit proves to be problematic in the future, it can be easily converted to be configurable using a sysctl. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-04ipv4: fib: Allow for consistent FIB dumpingIdo Schimmel1-0/+3
The next patch will enable listeners of the FIB notification chain to request a dump of the FIB tables. However, since RTNL isn't taken during the dump, it's possible for the FIB tables to change mid-dump, which will result in inconsistency between the listener's table and the kernel's. Allow listeners to know about changes that occurred mid-dump, by adding a change sequence counter to each net namespace. The counter is incremented just before a notification is sent in the FIB chain. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-04ipv4: fib: Add fib_info_hold() helperIdo Schimmel1-0/+5
As explained in the previous commit, modules are going to need to take a reference on fib info and then drop it using fib_info_put(). Add the fib_info_hold() helper to make the code more readable and also symmetric with fib_info_put(). Signed-off-by: Ido Schimmel <idosch@mellanox.com> Suggested-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03netns: fix net_generic() "id - 1" bloatAlexey Dobriyan1-7/+9
net_generic() function is both a) inline and b) used ~600 times. It has the following code inside ... ptr = ng->ptr[id - 1]; ... "id" is never compile time constant so compiler is forced to subtract 1. And those decrements or LEA [r32 - 1] instructions add up. We also start id'ing from 1 to catch bugs where pernet sybsystem id is not initialized and 0. This is quite pointless idea (nothing will work or immediate interference with first registered subsystem) in general but it hints what needs to be done for code size reduction. Namely, overlaying allocation of pointer array and fixed part of structure in the beginning and using usual base-0 addressing. Ids are just cookies, their exact values do not matter, so lets start with 3 on x86_64. Code size savings (oh boy): -4.2 KB As usual, ignore the initial compiler stupidity part of the table. add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208) function old new delta tipc_nametbl_insert_publ 1250 1270 +20 nlmclnt_lookup_host 686 703 +17 nfsd4_encode_fattr 5930 5941 +11 nfs_get_client 1050 1061 +11 register_pernet_operations 333 342 +9 tcf_mirred_init 843 849 +6 tcf_bpf_init 1143 1149 +6 gss_setup_upcall 990 994 +4 idmap_name_to_id 432 434 +2 ops_init 274 275 +1 nfsd_inject_forget_client 259 260 +1 nfs4_alloc_client 612 613 +1 tunnel_key_walker 164 163 -1 ... tipc_bcbase_select_primary 392 360 -32 mac80211_hwsim_new_radio 2808 2767 -41 ipip6_tunnel_ioctl 2228 2186 -42 tipc_bcast_rcv 715 672 -43 tipc_link_build_proto_msg 1140 1089 -51 nfsd4_lock 3851 3796 -55 tipc_mon_rcv 1012 956 -56 Total: Before=156643951, After=156639743, chg -0.00% Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03netns: add dummy struct inside "struct net_generic"Alexey Dobriyan1-2/+4
This is precursor to fixing "[id - 1]" bloat inside net_generic(). Name "s" is chosen to complement name "u" often used for dummy unions. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03netlink: 2-clause nla_ok()Alexey Dobriyan1-2/+1
nla_ok() consists of 3 clauses: 1) int rem >= (int)sizeof(struct nlattr) 2) u16 nla_len >= sizeof(struct nlattr) 3) u16 nla_len <= int rem The statement is that clause (1) is redundant. What it does is ensuring that "rem" is a positive number, so that in clause (3) positive number will be compared to positive number with no problems. However, "u16" fully fits into "int" and integers do not change value when upcasting even to signed type. Negative integers will be rejected by clause (3) just fine. Small positive integers will be rejected by transitivity of comparison operator. NOTE: all of the above DOES NOT apply to nlmsg_ok() where ->nlmsg_len is u32(!), so 3 clauses AND A CAST TO INT are necessary. Obligatory space savings report: -1.6 KB $ ./scripts/bloat-o-meter ../vmlinux-000* ../vmlinux-001* add/remove: 0/0 grow/shrink: 3/63 up/down: 35/-1692 (-1657) function old new delta validate_scan_freqs 142 155 +13 tcf_em_tree_validate 867 879 +12 dcbnl_ieee_del 328 338 +10 netlbl_cipsov4_add_common.isra 218 215 -3 ... ovs_nla_put_actions 888 806 -82 netlbl_cipsov4_add_std 1648 1566 -82 nl80211_parse_sched_scan 2889 2780 -109 ip_tun_from_nlattr 3086 2945 -141 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller3-4/+6
Couple conflicts resolved here: 1) In the MACB driver, a bug fix to properly initialize the RX tail pointer properly overlapped with some changes to support variable sized rings. 2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix overlapping with a reorganization of the driver to support ACPI, OF, as well as PCI variants of the chip. 3) In 'net' we had several probe error path bug fixes to the stmmac driver, meanwhile a lot of this code was cleaned up and reorganized in 'net-next'. 4) The cls_flower classifier obtained a helper function in 'net-next' called __fl_delete() and this overlapped with Daniel Borkamann's bug fix to use RCU for object destruction in 'net'. It also overlapped with Jiri's change to guard the rhashtable_remove_fast() call with a check against tc_skip_sw(). 5) In mlx4, a revert bug fix in 'net' overlapped with some unrelated changes in 'net-next'. 6) In geneve, a stale header pointer after pskb_expand_head() bug fix in 'net' overlapped with a large reorganization of the same code in 'net-next'. Since the 'net-next' code no longer had the bug in question, there was nothing to do other than to simply take the 'net-next' hunks. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02bpf: Add support for reading socket family, type, protocolDavid Ahern1-0/+15
Add socket family, type and protocol to bpf_sock allowing bpf programs read-only access. Add __sk_flags_offset[0] to struct sock before the bitfield to programmtically determine the offset of the unsigned int containing protocol and type. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02net/sched: cls_flower: Add offload support using egress Hardware deviceHadar Hen Zion1-0/+2
In order to support hardware offloading when the device given by the tc rule is different from the Hardware underline device, extract the mirred (egress) device from the tc action when a filter is added, using the new tc_action_ops, get_dev(). Flower caches the information about the mirred device and use it for calling ndo_setup_tc in filter change, update stats and delete. Calling ndo_setup_tc of the mirred (egress) device instead of the ingress device will allow a resolution between the software ingress device and the underline hardware device. The resolution will take place inside the offloading driver using 'egress_device' flag added to tc_to_netdev struct which is provided to the offloading driver. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02net/sched: act_mirred: Add new tc_action_ops get_dev()Hadar Hen Zion1-0/+2
Adding support to a new tc_action_ops. get_dev is a general option which allows to get the underline device when trying to offload a tc rule. In case of mirred action the returned device is the mirred (egress) device. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02net/sched: Add separate check for skip_hw flagHadar Hen Zion1-4/+15
Creating a difference between two possible cases: 1. Not offloading tc rule since the user sets 'skip_hw' flag. 2. Not offloading tc rule since the device doesn't support offloading. This patch doesn't add any new functionality. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02tcp: randomize tcp timestamp offsets for each connectionFlorian Westphal2-5/+5
jiffies based timestamps allow for easy inference of number of devices behind NAT translators and also makes tracking of hosts simpler. commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset") added the main infrastructure that is needed for per-connection ts randomization, in particular writing/reading the on-wire tcp header format takes the offset into account so rest of stack can use normal tcp_time_stamp (jiffies). So only two items are left: - add a tsoffset for request sockets - extend the tcp isn generator to also return another 32bit number in addition to the ISN. Re-use of ISN generator also means timestamps are still monotonically increasing for same connection quadruple, i.e. PAWS will still work. Includes fixes from Eric Dumazet. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2-4/+4
Pablo Neira Ayuso says: ==================== Netfilter fixes for net This is a large batch of Netfilter fixes for net, they are: 1) Three patches to fix NAT conversion to rhashtable: Switch to rhlist structure that allows to have several objects with the same key. Moreover, fix wrong comparison logic in nf_nat_bysource_cmp() as this is expecting a return value similar to memcmp(). Change location of the nat_bysource field in the nf_conn structure to avoid zeroing this as it breaks interaction with SLAB_DESTROY_BY_RCU and lead us to crashes. From Florian Westphal. 2) Don't allow malformed fragments go through in IPv6, drop them, otherwise we hit GPF, patch from Florian Westphal. 3) Fix crash if attributes are missing in nft_range, from Liping Zhang. 4) Fix arptables 32-bits userspace 64-bits kernel compat, from Hongxu Jia. 5) Two patches from David Ahern to fix netfilter interaction with vrf. From David Ahern. 6) Fix element timeout calculation in nf_tables, we take milliseconds from userspace, but we use jiffies from kernelspace. Patch from Anders K. Pedersen. 7) Missing validation length netlink attribute for nft_hash, from Laura Garcia. 8) Fix nf_conntrack_helper documentation, we don't default to off anymore for a bit of time so let's get this in sync with the code. I know is late but I think these are important, specifically the NAT bits, as they are mostly addressing fallout from recent changes. I also read there are chances to have -rc8, if that is the case, that would also give us a bit more time to test this. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30l2tp: lock socket before checking flags in connect()Guillaume Nault1-0/+2
Socket flags aren't updated atomically, so the socket must be locked while reading the SOCK_ZAPPED flag. This issue exists for both l2tp_ip and l2tp_ip6. For IPv6, this patch also brings error handling for __ip6_datagram_connect() failures. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30tcp: instrument how long TCP is busy sendingFrancis Yan1-1/+5
This patch measures TCP busy time, which is defined as the period of time when sender has data (or FIN) to send. The time starts when data is buffered and stops when the write queue is flushed by ACKs or error events. Note the busy time does not include SYN time, unless data is included in SYN (i.e. Fast Open). It does include FIN time even if the FIN carries no payload. Excluding pure FIN is possible but would incur one additional test in the fast path, which may not be worth it. Signed-off-by: Francis Yan <francisyyan@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30tcp: instrument tcp sender limits chronographsFrancis Yan1-0/+14
This patch implements the skeleton of the TCP chronograph instrumentation on sender side limits: 1) idle (unspec) 2) busy sending data other than 3-4 below 3) rwnd-limited 4) sndbuf-limited The limits are enumerated 'tcp_chrono'. Since a connection in theory can idle forever, we do not track the actual length of this uninteresting idle period. For the rest we track how long the sender spends in each limit. At any point during the life time of a connection, the sender must be in one of the four states. If there are multiple conditions worthy of tracking in a chronograph then the highest priority enum takes precedence over the other conditions. So that if something "more interesting" starts happening, stop the previous chrono and start a new one. The time unit is jiffy(u32) in order to save space in tcp_sock. This implies application must sample the stats no longer than every 49 days of 1ms jiffy. Signed-off-by: Francis Yan <francisyyan@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27Bluetooth: __ variants of u8 and friends are not neccessary inside kernelPavel Machek1-12/+13
bluetooth.h is not part of user API, so __ variants are not neccessary here. Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-11-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+1
udplite conflict is resolved by taking what 'net-next' did which removed the backlog receive method assignment, since it is no longer necessary. Two entries were added to the non-priv ethtool operations switch statement, one in 'net' and one in 'net-next, so simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25devlink: Add E-Switch inline mode controlRoi Dayan1-0/+2
Some HWs need the VF driver to put part of the packet headers on the TX descriptor so the e-switch can do proper matching and steering. The supported modes: none, link, network, transport. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>