summaryrefslogtreecommitdiff
path: root/net/netfilter
AgeCommit message (Collapse)AuthorFilesLines
2017-04-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller7-112/+204
Mostly simple cases of overlapping changes (adding code nearby, a function whose name changes, for example). Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-29netfilter: nfnetlink_queue: fix secctx memory leakLiping Zhang1-2/+7
We must call security_release_secctx to free the memory returned by security_secid_to_secctx, otherwise memory may be leaked forever. Fixes: ef493bd930ae ("netfilter: nfnetlink_queue: add security context information") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-27netfilter: nf_ct_ext: fix possible panic after nf_ct_extend_unregisterLiping Zhang1-3/+10
If one cpu is doing nf_ct_extend_unregister while another cpu is doing __nf_ct_ext_add_length, then we may hit BUG_ON(t == NULL). Moreover, there's no synchronize_rcu invocation after set nf_ct_ext_types[id] to NULL, so it's possible that we may access invalid pointer. But actually, most of the ct extends are built-in, so the problem listed above will not happen. However, there are two exceptions: NF_CT_EXT_NAT and NF_CT_EXT_SYNPROXY. For _EXT_NAT, the panic will not happen, since adding the nat extend and unregistering the nat extend are located in the same file(nf_nat_core.c), this means that after the nat module is removed, we cannot add the nat extend too. For _EXT_SYNPROXY, synproxy extend may be added by init_conntrack, while synproxy extend unregister will be done by synproxy_core_exit. So after nf_synproxy_core.ko is removed, we may still try to add the synproxy extend, then kernel panic may happen. I know it's very hard to reproduce this issue, but I can play a tricky game to make it happen very easily :) Step 1. Enable SYNPROXY for tcp dport 1234 at FORWARD hook: # iptables -I FORWARD -p tcp --dport 1234 -j SYNPROXY Step 2. Queue the syn packet to the userspace at raw table OUTPUT hook. Also note, in the userspace we only add a 20s' delay, then reinject the syn packet to the kernel: # iptables -t raw -I OUTPUT -p tcp --syn -j NFQUEUE --queue-num 1 Step 3. Using "nc 2.2.2.2 1234" to connect the server. Step 4. Now remove the nf_synproxy_core.ko quickly: # iptables -F FORWARD # rmmod ipt_SYNPROXY # rmmod nf_synproxy_core Step 5. After 20s' delay, the syn packet is reinjected to the kernel. Now you will see the panic like this: kernel BUG at net/netfilter/nf_conntrack_extend.c:91! Call Trace: ? __nf_ct_ext_add_length+0x53/0x3c0 [nf_conntrack] init_conntrack+0x12b/0x600 [nf_conntrack] nf_conntrack_in+0x4cc/0x580 [nf_conntrack] ipv4_conntrack_local+0x48/0x50 [nf_conntrack_ipv4] nf_reinject+0x104/0x270 nfqnl_recv_verdict+0x3e1/0x5f9 [nfnetlink_queue] ? nfqnl_recv_verdict+0x5/0x5f9 [nfnetlink_queue] ? nla_parse+0xa0/0x100 nfnetlink_rcv_msg+0x175/0x6a9 [nfnetlink] [...] One possible solution is to make NF_CT_EXT_SYNPROXY extend built-in, i.e. introduce nf_conntrack_synproxy.c and only do ct extend register and unregister in it, similar to nf_conntrack_timeout.c. But having such a obscure restriction of nf_ct_extend_unregister is not a good idea, so we should invoke synchronize_rcu after set nf_ct_ext_types to NULL, and check the NULL pointer when do __nf_ct_ext_add_length. Then it will be easier if we add new ct extend in the future. Last, we use kfree_rcu to free nf_ct_ext, so rcu_barrier() is unnecessary anymore, remove it too. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-27netfilter: nfnl_cthelper: fix a race when walk the nf_ct_helper_hash tableLiping Zhang1-96/+81
The nf_ct_helper_hash table is protected by nf_ct_helper_mutex, while nfct_helper operation is protected by nfnl_lock(NFNL_SUBSYS_CTHELPER). So it's possible that one CPU is walking the nf_ct_helper_hash for cthelper add/get/del, another cpu is doing nf_conntrack_helpers_unregister at the same time. This is dangrous, and may cause use after free error. Note, delete operation will flush all cthelpers added via nfnetlink, so using rcu to do protect is not easy. Now introduce a dummy list to record all the cthelpers added via nfnetlink, then we can walk the dummy list instead of walking the nf_ct_helper_hash. Also, keep nfnl_cthelper_dump_table unchanged, it may be invoked without nfnl_lock(NFNL_SUBSYS_CTHELPER) held. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-27netfilter: invoke synchronize_rcu after set the _hook_ to NULLLiping Zhang4-1/+6
Otherwise, another CPU may access the invalid pointer. For example: CPU0 CPU1 - rcu_read_lock(); - pfunc = _hook_; _hook_ = NULL; - mod unload - - pfunc(); // invalid, panic - rcu_read_unlock(); So we must call synchronize_rcu() to wait the rcu reader to finish. Also note, in nf_nat_snmp_basic_fini, synchronize_rcu() will be invoked by later nf_conntrack_helper_unregister, but I'm inclined to add a explicit synchronize_rcu after set the nf_nat_snmp_hook to NULL. Depend on such obscure assumptions is not a good idea. Last, in nfnetlink_cttimeout, we use kfree_rcu to free the time object, so in cttimeout_exit, invoking rcu_barrier() is not necessary at all, remove it too. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller7-124/+133
Conflicts: drivers/net/ethernet/broadcom/genet/bcmmii.c drivers/net/hyperv/netvsc.c kernel/bpf/hashtab.c Almost entirely overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22netfilter: nfnl_cthelper: Fix memory leakJeffy Chen1-3/+9
We have memory leaks of nf_conntrack_helper & expect_policy. Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-22netfilter: nfnl_cthelper: fix runtime expectation policy updatesPablo Neira Ayuso1-2/+84
We only allow runtime updates of expectation policies for timeout and maximum number of expectations, otherwise reject the update. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Liping Zhang <zlpnobody@gmail.com>
2017-03-21netfilter: nfnl_cthelper: fix incorrect helper->expect_class_maxLiping Zhang1-9/+11
The helper->expect_class_max must be set to the total number of expect_policy minus 1, since we will use the statement "if (class > helper->expect_class_max)" to validate the CTA_EXPECT_CLASS attr in ctnetlink_alloc_expect. So for compatibility, set the helper->expect_class_max to the NFCTH_POLICY_SET_NUM attr's value minus 1. Also: it's invalid when the NFCTH_POLICY_SET_NUM attr's value is zero. 1. this will result "expect_policy = kzalloc(0, GFP_KERNEL);"; 2. we cannot set the helper->expect_class_max to a proper value. So if nla_get_be32(tb[NFCTH_POLICY_SET_NUM]) is zero, report -EINVAL to the userspace. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-20netfilter: fix the warning on unused refcount variableReshetova, Elena1-1/+0
net/netfilter/nfnetlink_acct.c: In function 'nfnl_acct_try_del': net/netfilter/nfnetlink_acct.c:329:15: warning: unused variable 'refcount' [-Wunused-variable] unsigned int refcount; ^ Fixes: b54ab92b84b6 ("netfilter: refcounter conversions") Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-17netfilter: refcounter conversionsReshetova, Elena17-57/+61
refcount_t type and corresponding API (see include/linux/refcount.h) should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-16ipvs: remove an annoying printk in netns initCong Wang1-2/+0
At most it is used for debugging purpose, but I don't think it is even useful for debugging, just remove it. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2017-03-15netfilter: nft_ct: do cleanup work when NFTA_CT_DIRECTION is invalidLiping Zhang1-1/+2
We should jump to invoke __nft_ct_set_destroy() instead of just return error. Fixes: edee4f1e9245 ("netfilter: nft_ct: add zone id set support") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13netfilter: nft_set_rbtree: use per-set rwlock to improve the scalabilityLiping Zhang1-15/+16
Karel Rericha reported that in his test case, ICMP packets going through boxes had normally about 5ms latency. But when running nft, actually listing the sets with interval flags, latency would go up to 30-100ms. This was observed when router throughput is from 600Mbps to 2Gbps. This is because we use a single global spinlock to protect the whole rbtree sets, so "dumping sets" will race with the "key lookup" inevitably. But actually they are all _readers_, so it's ok to convert the spinlock to rwlock to avoid competition between them. Also use per-set rwlock since each set is independent. Reported-by: Karel Rericha <karel@unitednetworks.cz> Tested-by: Karel Rericha <karel@unitednetworks.cz> Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13netfilter: limit: use per-rule spinlock to improve the scalabilityLiping Zhang2-10/+11
The limit token is independent between each rules, so there's no need to use a global spinlock. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13netfilter: nf_conntrack: reduce resolve_normal_ct argsFlorian Westphal1-31/+26
also mark init_conntrack noinline, in most cases resolve_normal_ct will find an existing conntrack entry. text data bss dec hex filename 16735 5707 176 22618 585a net/netfilter/nf_conntrack_core.o 16687 5707 176 22570 582a net/netfilter/nf_conntrack_core.o Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13Revert "netfilter: nf_tables: add flush field to struct nft_set_iter"Pablo Neira Ayuso1-4/+0
This reverts commit 1f48ff6c5393aa7fe290faf5d633164f105b0aa7. This patch is not required anymore now that we keep a dummy list of set elements in the bitmap set implementation, so revert this before we forget this code has no clients. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13netfilter: nft_fib: Support existence checkPhil Sutter1-5/+9
Instead of the actual interface index or name, set destination register to just 1 or 0 depending on whether the lookup succeeded or not if NFTA_FIB_F_PRESENT was set in userspace. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13netfilter: nft_ct: add helper set supportFlorian Westphal1-0/+171
this allows to assign connection tracking helpers to connections via nft objref infrastructure. The idea is to first specifiy a helper object: table ip filter { ct helper some-name { type "ftp" protocol tcp l3proto ip } } and then assign it via nft add ... ct helper set "some-name" helper assignment works for new conntracks only as we cannot expand the conntrack extension area once it has been committed to the main conntrack table. ipv4 and ipv6 protocols are tracked stored separately so we can also handle families that observe both ipv4 and ipv6 traffic. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13netfilter: provide nft_ctx in object init functionFlorian Westphal3-5/+8
this is needed by the upcoming ct helper object type -- we'd like to be able use the table family (ip, ip6, inet) to figure out which helper has to be requested. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13netfilter: nft_set_bitmap: keep a list of dummy elementsPablo Neira Ayuso1-80/+66
Element comments may come without any prior set flag, so we have to keep a list of dummy struct nft_set_ext to keep this information around. This is only useful for set dumps to userspace. From the packet path, this set type relies on the bitmap representation. This patch simplifies the logic since we don't need to allocate the dummy nft_set_ext structure anymore on the fly at the cost of increasing memory consumption because of the list of dummy struct nft_set_ext. Fixes: 665153ff5752 ("netfilter: nf_tables: add bitmap set type") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13netfilter: Force fake conntrack entry to be at least 8 bytes alignedSteven Rostedt (VMware)1-1/+5
Since the nfct and nfctinfo have been combined, the nf_conn structure must be at least 8 bytes aligned, as the 3 LSB bits are used for the nfctinfo. But there's a fake nf_conn structure to denote untracked connections, which is created by a PER_CPU construct. This does not guarantee that it will be 8 bytes aligned and can break the logic in determining the correct nfctinfo. I triggered this on a 32bit machine with the following error: BUG: unable to handle kernel NULL pointer dereference at 00000af4 IP: nf_ct_deliver_cached_events+0x1b/0xfb *pdpt = 0000000031962001 *pde = 0000000000000000 Oops: 0000 [#1] SMP [Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 crc_ccitt ppdev r8169 parport_pc parport OK ] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-test+ #75 Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014 task: c126ec00 task.stack: c1258000 EIP: nf_ct_deliver_cached_events+0x1b/0xfb EFLAGS: 00010202 CPU: 0 EAX: 0021cd01 EBX: 00000000 ECX: 27b0c767 EDX: 32bcb17a ESI: f34135c0 EDI: f34135c0 EBP: f2debd60 ESP: f2debd3c DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 CR0: 80050033 CR2: 00000af4 CR3: 309a0440 CR4: 001406f0 Call Trace: <SOFTIRQ> ? ipv6_skip_exthdr+0xac/0xcb ipv6_confirm+0x10c/0x119 [nf_conntrack_ipv6] nf_hook_slow+0x22/0xc7 nf_hook+0x9a/0xad [ipv6] ? ip6t_do_table+0x356/0x379 [ip6_tables] ? ip6_fragment+0x9e9/0x9e9 [ipv6] ip6_output+0xee/0x107 [ipv6] ? ip6_fragment+0x9e9/0x9e9 [ipv6] dst_output+0x36/0x4d [ipv6] NF_HOOK.constprop.37+0xb2/0xba [ipv6] ? icmp6_dst_alloc+0x2c/0xfd [ipv6] ? local_bh_enable+0x14/0x14 [ipv6] mld_sendpack+0x1c5/0x281 [ipv6] ? mark_held_locks+0x40/0x5c mld_ifc_timer_expire+0x1f6/0x21e [ipv6] call_timer_fn+0x135/0x283 ? detach_if_pending+0x55/0x55 ? mld_dad_timer_expire+0x3e/0x3e [ipv6] __run_timers+0x111/0x14b ? mld_dad_timer_expire+0x3e/0x3e [ipv6] run_timer_softirq+0x1c/0x36 __do_softirq+0x185/0x37c ? test_ti_thread_flag.constprop.19+0xd/0xd do_softirq_own_stack+0x22/0x28 </SOFTIRQ> irq_exit+0x5a/0xa4 smp_apic_timer_interrupt+0x2a/0x34 apic_timer_interrupt+0x37/0x3c By using DEFINE/DECLARE_PER_CPU_ALIGNED we can enforce at least 8 byte alignment as all cache line sizes are at least 8 bytes or more. Fixes: a9e419dc7be6 ("netfilter: merge ctinfo into nfct pointer storage area") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13netfilter: nf_tables: fix mismatch in big-endian systemLiping Zhang3-31/+35
Currently, there are two different methods to store an u16 integer to the u32 data register. For example: u32 *dest = &regs->data[priv->dreg]; 1. *dest = 0; *(u16 *) dest = val_u16; 2. *dest = val_u16; For method 1, the u16 value will be stored like this, either in big-endian or little-endian system: 0 15 31 +-+-+-+-+-+-+-+-+-+-+-+-+ | Value | 0 | +-+-+-+-+-+-+-+-+-+-+-+-+ For method 2, in little-endian system, the u16 value will be the same as listed above. But in big-endian system, the u16 value will be stored like this: 0 15 31 +-+-+-+-+-+-+-+-+-+-+-+-+ | 0 | Value | +-+-+-+-+-+-+-+-+-+-+-+-+ So later we use "memcmp(&regs->data[priv->sreg], data, 2);" to do compare in nft_cmp, nft_lookup expr ..., method 2 will get the wrong result in big-endian system, as 0~15 bits will always be zero. For the similar reason, when loading an u16 value from the u32 data register, we should use "*(u16 *) sreg;" instead of "(u16)*sreg;", the 2nd method will get the wrong value in the big-endian system. So introduce some wrapper functions to store/load an u8 or u16 integer to/from the u32 data register, and use them in the right place. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13netfilter: nft_set_bitmap: fetch the element key based on the set->klenLiping Zhang1-10/+17
Currently we just assume the element key as a u32 integer, regardless of the set key length. This is incorrect, for example, the tcp port number is only 16 bits. So when we use the nft_payload expr to get the tcp dport and store it to dreg, the dport will be stored at 0~15 bits, and 16~31 bits will be padded with zero. So the reg->data[dreg] will be looked like as below: 0 15 31 +-+-+-+-+-+-+-+-+-+-+-+-+ | tcp dport | 0 | +-+-+-+-+-+-+-+-+-+-+-+-+ But for these big-endian systems, if we treate this register as a u32 integer, the element key will be larger than 65535, so the following lookup in bitmap set will cause out of bound access. Another issue is that if we add element with comments in bitmap set(although the comments will be ignored eventually), the element will vanish strangely. Because we treate the element key as a u32 integer, so the comments will become the part of the element key, then the element key will also be larger than 65535 and out of bound access will happen: # nft add element t s { 1 comment test } Since set->klen is 1 or 2, it's fine to treate the element key as a u8 or u16 integer. Fixes: 665153ff5752 ("netfilter: nf_tables: add bitmap set type") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-08netfilter: nf_nat_sctp: fix ICMP packet to be dropped accidentlyYing Xue1-1/+12
Regarding RFC 792, the first 64 bits of the original SCTP datagram's data could be contained in ICMP packet, such as: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Type | Code | Checksum | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | unused | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Internet Header + 64 bits of Original Data Datagram | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ However, according to RFC 4960, SCTP datagram header is as below: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Source Port Number | Destination Port Number | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Verification Tag | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Checksum | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ It means only the first three fields of SCTP header can be carried in ICMP packet except for Checksum field. At present in sctp_manip_pkt(), no matter whether the packet is ICMP or not, it always calculates SCTP packet checksum. However, not only the calculation of checksum is unnecessary for ICMP, but also it causes another fatal issue that ICMP packet is dropped. The header size of SCTP is used to identify whether the writeable length of skb is bigger than skb->len through skb_make_writable() in sctp_manip_pkt(). But when it deals with ICMP packet, skb_make_writable() directly returns false as the writeable length of skb is bigger than skb->len. Subsequently ICMP is dropped. Now we correct this misbahavior. When sctp_manip_pkt() handles ICMP packet, 8 bytes rather than the whole SCTP header size is used to check if writeable length of skb is overflowed. Meanwhile, as it's meaningless to calculate checksum when packet is ICMP, the computation of checksum is ignored as well. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-06netfilter: nf_tables: add nft_set_lookup()Pablo Neira Ayuso4-37/+36
This new function consolidates set lookup via either name or ID by introducing a new nft_set_lookup() function. Replace existing spots where we can use this too. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-06netfilter: nf_tables: validate the expr explicitly after init successfullyLiping Zhang9-35/+13
When we want to validate the expr's dependency or hooks, we must do two things to accomplish it. First, write a X_validate callback function and point ->validate to it. Second, call X_validate in init routine. This is very common, such as fib, nat, reject expr and so on ... It is a little ugly, since we will call X_validate in the expr's init routine, it's better to do it in nf_tables_newexpr. So we can avoid to do this again and again. After doing this, the second step listed above is not useful anymore, remove them now. Patch was tested by nftables/tests/py/nft-test.py and nftables/tests/shell/run-tests.sh. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-06netfilter: nft_hash: support of symmetric hashLaura Garcia Liebana1-1/+98
This patch provides symmetric hash support according to source ip address and port, and destination ip address and port. For this purpose, the __skb_get_hash_symmetric() is used to identify the flow as it uses FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL flag by default. The new attribute NFTA_HASH_TYPE has been included to support different types of hashing functions. Currently supported NFT_HASH_JENKINS through jhash and NFT_HASH_SYM through symhash. The main difference between both types are: - jhash requires an expression with sreg, symhash doesn't. - symhash supports modulus and offset, but not seed. Examples: nft add rule ip nat prerouting ct mark set jhash ip saddr mod 2 nft add rule ip nat prerouting ct mark set symhash mod 2 By default, jenkins hash will be used if no hash type is provided for compatibility reasons. Signed-off-by: Laura Garcia Liebana <laura.garcia@zevenet.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-06netfilter: nft_hash: rename nft_hash to nft_jhashLaura Garcia Liebana1-18/+18
This patch renames the local nft_hash structure and functions to nft_jhash in order to prepare the nft_hash module code to add new hash functions. Signed-off-by: Laura Garcia Liebana <laura.garcia@zevenet.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-06netfilter: nft_exthdr: Allow checking TCP option presence, tooPhil Sutter1-3/+10
Honor NFT_EXTHDR_F_PRESENT flag so we check if the TCP option is present. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds3-85/+59
Pull networking fixes from David Miller: 1) Fix double-free in batman-adv, from Sven Eckelmann. 2) Fix packet stats for fast-RX path, from Joannes Berg. 3) Netfilter's ip_route_me_harder() doesn't handle request sockets properly, fix from Florian Westphal. 4) Fix sendmsg deadlock in rxrpc, from David Howells. 5) Add missing RCU locking to transport hashtable scan, from Xin Long. 6) Fix potential packet loss in mlxsw driver, from Ido Schimmel. 7) Fix race in NAPI handling between poll handlers and busy polling, from Eric Dumazet. 8) TX path in vxlan and geneve need proper RCU locking, from Jakub Kicinski. 9) SYN processing in DCCP and TCP need to disable BH, from Eric Dumazet. 10) Properly handle net_enable_timestamp() being invoked from IRQ context, also from Eric Dumazet. 11) Fix crash on device-tree systems in xgene driver, from Alban Bedel. 12) Do not call sk_free() on a locked socket, from Arnaldo Carvalho de Melo. 13) Fix use-after-free in netvsc driver, from Dexuan Cui. 14) Fix max MTU setting in bonding driver, from WANG Cong. 15) xen-netback hash table can be allocated from softirq context, so use GFP_ATOMIC. From Anoob Soman. 16) Fix MAC address change bug in bgmac driver, from Hari Vyas. 17) strparser needs to destroy strp_wq on module exit, from WANG Cong. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits) strparser: destroy workqueue on module exit sfc: fix IPID endianness in TSOv2 sfc: avoid max() in array size rds: remove unnecessary returned value check rxrpc: Fix potential NULL-pointer exception nfp: correct DMA direction in XDP DMA sync nfp: don't tell FW about the reserved buffer space net: ethernet: bgmac: mac address change bug net: ethernet: bgmac: init sequence bug xen-netback: don't vfree() queues under spinlock xen-netback: keep a local pointer for vif in backend_disconnect() netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups netfilter: nf_conntrack_sip: fix wrong memory initialisation can: flexcan: fix typo in comment can: usb_8dev: Fix memory leak of priv->cmd_msg_buffer can: gs_usb: fix coding style can: gs_usb: Don't use stack memory for USB transfers ixgbe: Limit use of 2K buffers on architectures with 256B or larger cache lines ixgbe: update the rss key on h/w, when ethtool ask for it ...
2017-03-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller3-85/+59
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Missing check for full sock in ip_route_me_harder(), from Florian Westphal. 2) Incorrect sip helper structure initilization that breaks it when several ports are used, from Christophe Leroy. 3) Fix incorrect assumption when looking up for matching with adjacent intervals in the nft_set_rbtree. 4) Fix broken netlink event error reporting in nf_tables that results in misleading ESRCH errors propagated to userspace listeners. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-03netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() failsPablo Neira Ayuso1-78/+55
The underlying nlmsg_multicast() already sets sk->sk_err for us to notify socket overruns, so we should not do anything with this return value. So we just call nfnetlink_set_err() if: 1) We fail to allocate the netlink message. or 2) We don't have enough space in the netlink message to place attributes, which means that we likely need to allocate a larger message. Before this patch, the internal ESRCH netlink error code was propagated to userspace, which is quite misleading. Netlink semantics mandate that listeners just hit ENOBUFS if the socket buffer overruns. Reported-by: Alexander Alemayhu <alexander@alemayhu.com> Tested-by: Alexander Alemayhu <alexander@alemayhu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-03netfilter: nft_set_rbtree: incorrect assumption on lower interval lookupsPablo Neira Ayuso1-5/+4
In case of adjacent ranges, we may indeed see either the high part of the range in first place or the low part of it. Remove this incorrect assumption, let's make sure we annotate the low part of the interval in case of we have adjacent interva intervals so we hit a matching in lookups. Reported-by: Simon Hanisch <hanisch@wh2.tu-dresden.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-03netfilter: nf_conntrack_sip: fix wrong memory initialisationChristophe Leroy1-2/+0
In commit 82de0be6862cd ("netfilter: Add helper array register/unregister functions"), struct nf_conntrack_helper sip[MAX_PORTS][4] was changed to sip[MAX_PORTS * 4], so the memory init should have been changed to memset(&sip[4 * i], 0, 4 * sizeof(sip[i])); But as the sip[] table is allocated in the BSS, it is already set to 0 Fixes: 82de0be6862cd ("netfilter: Add helper array register/unregister functions") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-02sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h>Ingo Molnar1-0/+2
Add #include <linux/cred.h> dependencies to all .c files rely on sched.h doing that for them. Note that even if the count where we need to add extra headers seems high, it's still a net win, because <linux/sched.h> is included in over 2,200 files ... Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds3-4/+5
Pull networking fixes from David Miller: 1) Don't save TIPC header values before the header has been validated, from Jon Paul Maloy. 2) Fix memory leak in RDS, from Zhu Yanjun. 3) We miss to initialize the UID in the flow key in some paths, from Julian Anastasov. 4) Fix latent TOS masking bug in the routing cache removal from years ago, also from Julian. 5) We forget to set the sockaddr port in sctp_copy_local_addr_list(), fix from Xin Long. 6) Missing module ref count drop in packet scheduler actions, from Roman Mashak. 7) Fix RCU annotations in rht_bucket_nested, from Herbert Xu. 8) Fix use after free which happens because L2TP's ipv4 support returns non-zero values from it's backlog_rcv function which ipv4 interprets as protocol values. Fix from Paul Hüber. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits) qed: Don't use attention PTT for configuring BW qed: Fix race with multiple VFs l2tp: avoid use-after-free caused by l2tp_ip_backlog_recv xfrm: provide correct dst in xfrm_neigh_lookup rhashtable: Fix RCU dereference annotation in rht_bucket_nested rhashtable: Fix use before NULL check in bucket_table_free net sched actions: do not overwrite status of action creation. rxrpc: Kernel calls get stuck in recvmsg net sched actions: decrement module reference count after table flush. lib: Allow compile-testing of parman ipv6: check sk sk_type and protocol early in ip_mroute_set/getsockopt sctp: set sin_port for addr param when checking duplicate address net/mlx4_en: fix overflow in mlx4_en_init_timestamp() netfilter: nft_set_bitmap: incorrect bitmap size net: s2io: fix typo argumnet argument net: vxge: fix typo argumnet argument netfilter: nf_ct_expect: Change __nf_ct_expect_check() return value. ipv4: mask tos for input route ipv4: add missing initialization for flowi4_uid lib: fix spelling mistake: "actualy" -> "actually" ...
2017-02-28lib/vsprintf.c: remove %Z supportAlexey Dobriyan7-11/+11
Now that %z is standartised in C99 there is no reason to support %Z. Unlike %L it doesn't even make format strings smaller. Use BUILD_BUG_ON in a couple ATM drivers. In case anyone didn't notice lib/vsprintf.o is about half of SLUB which is in my opinion is quite an achievement. Hopefully this patch inspires someone else to trim vsprintf.c more. Link: http://lkml.kernel.org/r/20170103230126.GA30170@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-28scripts/spelling.txt: add "aligment" pattern and fix typo instancesMasahiro Yamada1-1/+1
Fix typos and add the following to the scripts/spelling.txt: aligment||alignment I did not touch the "N_BYTE_ALIGMENT" macro in drivers/net/wireless/realtek/rtlwifi/wifi.h to avoid unpredictable impact. I fixed "_aligment_handler" in arch/openrisc/kernel/entry.S because it is surrounded by #if 0 ... #endif. It is surely safe and I confirmed "_alignment_handler" is correct. I also fixed the "controler" I found in the same hunk in arch/openrisc/kernel/head.S. Link: http://lkml.kernel.org/r/1481573103-11329-8-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-28scripts/spelling.txt: add "an user" pattern and fix typo instancesMasahiro Yamada1-1/+1
Fix typos and add the following to the scripts/spelling.txt: an user||a user an userspace||a userspace I also added "userspace" to the list since it is a common word in Linux. I found some instances for "an userfaultfd", but I did not add it to the list. I felt it is endless to find words that start with "user" such as "userland" etc., so must draw a line somewhere. Link: http://lkml.kernel.org/r/1481573103-11329-4-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-26netfilter: nft_set_bitmap: incorrect bitmap sizePablo Neira Ayuso1-1/+1
priv->bitmap_size stores the real bitmap size, instead of the full struct nft_bitmap object. Fixes: 665153ff5752 ("netfilter: nf_tables: add bitmap set type") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-26netfilter: nf_ct_expect: Change __nf_ct_expect_check() return value.Jarno Rajahalme1-2/+2
Commit 4dee62b1b9b4 ("netfilter: nf_ct_expect: nf_ct_expect_insert() returns void") inadvertently changed the successful return value of nf_ct_expect_related_report() from 0 to 1 due to __nf_ct_expect_check() returning 1 on success. Prevent this regression in the future by changing the return value of __nf_ct_expect_check() to 0 on success. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-25netfilter: nf_ct_expect: nf_ct_expect_related_report(): Return zero on success.Jarno Rajahalme1-1/+1
Commit 4dee62b1b9b4 ("netfilter: nf_ct_expect: nf_ct_expect_insert() returns void") inadvertently changed the successful return value of nf_ct_expect_related_report() from 0 to 1, which caused openvswitch conntrack integration fail in FTP test cases. Fix this by always returning zero on the success code path. Fixes: 4dee62b1b9b4 ("netfilter: nf_ct_expect: nf_ct_expect_insert() returns void") Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-23netfilter: nft_ct: fix random validation errors for zone set supportFlorian Westphal1-0/+1
Dan reports: net/netfilter/nft_ct.c:549 nft_ct_set_init() error: uninitialized symbol 'len'. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: edee4f1e924582 ("netfilter: nft_ct: add zone id set support") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller6-39/+81
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Revisit warning logic when not applying default helper assignment. Jiri Kosina considers we are breaking existing setups and not warning our users accordinly now that automatic helper assignment has been turned off by default. So let's make him happy by spotting the warning by when we find a helper but we cannot attach, instead of warning on the former deprecated behaviour. Patch from Jiri Kosina. 2) Two patches to fix regression in ctnetlink interfaces with nfnetlink_queue. Specifically, perform more relaxed in CTA_STATUS and do not bail out if CTA_HELP indicates the same helper that we already have. Patches from Kevin Cernekee. 3) A couple of bugfixes for ipset via Jozsef Kadlecsik. Due to wrong index logic in hash set types and null pointer exception in the list:set type. 4) hashlimit bails out with correct userspace parameters due to wrong arithmetics in the code that avoids "divide by zero" when transforming the userspace timing in milliseconds to token credits. Patch from Alban Browaeys. 5) Fix incorrect NFQA_VLAN_MAX definition, patch from Ken-ichirou MATSUZAWA. 6) Don't not declare nfnetlink batch error list as static, since this may be used by several subsystems at the same time. Patch from Liping Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21Merge branch 'master' of git://blackhole.kfki.hu/nfPablo Neira Ayuso2-4/+7
Jozsef Kadlecsik says: ==================== ipset patches for nf Please apply the next patches for ipset in your nf branch. Both patches should go into the stable kernel branches as well, because these are important bugfixes: * Sometimes valid entries in hash:* types of sets were evicted due to a typo in an index. The wrong evictions happen when entries are deleted from the set and the bucket is shrinked. Bug was reported by Eric Ewanco and the patch fixes netfilter bugzilla id #1119. * Fixing of a null pointer exception when someone wants to add an entry to an empty list type of set and specifies an add before/after option. The fix is from Vishwanath Pai. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-21netfilter: nfnetlink: remove static declaration from err_listLiping Zhang1-1/+1
Otherwise, different subsys will race to access the err_list, with holding the different nfnl_lock(subsys_id). But this will not happen now, since ->call_batch is only implemented by nftables, so the err_list is protected by nfnl_lock(NFNL_SUBSYS_NFTABLES). Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-19netfilter: xt_hashlimit: Fix integer divide round to zero.Alban Browaeys1-16/+9
Diving the divider by the multiplier before applying to the input. When this would "divide by zero", divide the multiplier by the divider first then multiply the input by this value. Currently user2creds outputs zero when input value is bigger than the number of slices and lower than scale. This as then user input is applied an integer divide operation to a number greater than itself (scale). That rounds up to zero, then we multiply zero by the credits slice size. iptables -t filter -I INPUT --protocol tcp --match hashlimit --hashlimit 40/second --hashlimit-burst 20 --hashlimit-mode srcip --hashlimit-name syn-flood --jump RETURN thus trigger the overflow detection code: xt_hashlimit: overflow, try lower: 25000/20 (25000 as hashlimit avg and 20 the burst) Here: 134217 slices of (HZ * CREDITS_PER_JIFFY) size. 500000 is user input value 1000000 is XT_HASHLIMIT_SCALE_v2 gives: 0 as user2creds output Setting burst to "1" typically solve the issue ... but setting it to "40" does too ! This is on 32bit arch calling into revision 2 of hashlimit. Signed-off-by: Alban Browaeys <alban.browaeys@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-19netfilter: ipset: Null pointer exception in ipset list:setVishwanath Pai1-3/+6
If we use before/after to add an element to an empty list it will cause a kernel panic. $> cat crash.restore create a hash:ip create b hash:ip create test list:set timeout 5 size 4 add test b before a $> ipset -R < crash.restore Executing the above will crash the kernel. Signed-off-by: Vishwanath Pai <vpai@akamai.com> Reviewed-by: Josh Hunt <johunt@akamai.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2017-02-19Fix bug: sometimes valid entries in hash:* types of sets were evictedJozsef Kadlecsik1-1/+1
Wrong index was used and therefore when shrinking a hash bucket at deleting an entry, valid entries could be evicted as well. Thanks to Eric Ewanco for the thorough bugreport. Fixes netfilter bugzilla #1119 Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>