summaryrefslogtreecommitdiff
path: root/include/net/netfilter
AgeCommit message (Collapse)AuthorFilesLines
2022-06-27netfilter: nf_tables: avoid skb access on nf_stolenFlorian Westphal1-6/+10
When verdict is NF_STOLEN, the skb might have been freed. When tracing is enabled, this can result in a use-after-free: 1. access to skb->nf_trace 2. access to skb->mark 3. computation of trace id 4. dump of packet payload To avoid 1, keep a cached copy of skb->nf_trace in the trace state struct. Refresh this copy whenever verdict is != STOLEN. Avoid 2 by skipping skb->mark access if verdict is STOLEN. 3 is avoided by precomputing the trace id. Only dump the packet when verdict is not "STOLEN". Reported-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-06-06netfilter: nf_tables: bail out early if hardware offload is not supportedPablo Neira Ayuso1-1/+1
If user requests for NFT_CHAIN_HW_OFFLOAD, then check if either device provides the .ndo_setup_tc interface or there is an indirect flow block that has been registered. Otherwise, bail out early from the preparation phase. Moreover, validate that family == NFPROTO_NETDEV and hook is NF_NETDEV_INGRESS. Fixes: c9626a2cbdb2 ("netfilter: nf_tables: add hardware offload support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-06-02netfilter: nf_tables: delete flowtable hooks via transaction listPablo Neira Ayuso1-1/+0
Remove inactive bool field in nft_hook object that was introduced in abadb2f865d7 ("netfilter: nf_tables: delete devices from flowtable"). Move stale flowtable hooks to transaction list instead. Deleting twice the same device does not result in ENOENT. Fixes: abadb2f865d7 ("netfilter: nf_tables: delete devices from flowtable") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-27netfilter: conntrack: re-fetch conntrack after insertionFlorian Westphal1-1/+6
In case the conntrack is clashing, insertion can free skb->_nfct and set skb->_nfct to the already-confirmed entry. This wasn't found before because the conntrack entry and the extension space used to free'd after an rcu grace period, plus the race needs events enabled to trigger. Reported-by: <syzbot+793a590957d9c1b96620@syzkaller.appspotmail.com> Fixes: 71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on insertion race") Fixes: 2ad9d7747c10 ("netfilter: conntrack: free extension area immediately") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16netfilter: nf_conncount: reduce unnecessary GCWilliam Tu1-0/+1
Currently nf_conncount can trigger garbage collection (GC) at multiple places. Each GC process takes a spin_lock_bh to traverse the nf_conncount_list. We found that when testing port scanning use two parallel nmap, because the number of connection increase fast, the nf_conncount_count and its subsequent call to __nf_conncount_add take too much time, causing several CPU lockup. This happens when user set the conntrack limit to +20,000, because the larger the limit, the longer the list that GC has to traverse. The patch mitigate the performance issue by avoiding unnecessary GC with a timestamp. Whenever nf_conncount has done a GC, a timestamp is updated, and beforce the next time GC is triggered, we make sure it's more than a jiffies. By doin this we can greatly reduce the CPU cycles and avoid the softirq lockup. To reproduce it in OVS, $ ovs-appctl dpctl/ct-set-limits zone=1,limit=20000 $ ovs-appctl dpctl/ct-get-limits At another machine, runs two nmap $ nmap -p1- <IP> $ nmap -p1- <IP> Signed-off-by: William Tu <u9012063@gmail.com> Co-authored-by: Yifeng Sun <pkusunyifeng@gmail.com> Reported-by: Greg Rose <gvrose8192@gmail.com> Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: skip verification of zero UDP checksumKevin Mitchell1-4/+17
The checksum is optional for UDP packets. However nf_reject would previously require a valid checksum to elicit a response such as ICMP_DEST_UNREACH. Add some logic to nf_reject_verify_csum to determine if a UDP packet has a zero checksum and should therefore not be verified. Signed-off-by: Kevin Mitchell <kevmitch@arista.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: prefer extension check to pointer checkFlorian Westphal2-17/+16
The pointer check usually results in a 'false positive': its likely that the ctnetlink module is loaded but no event monitoring is enabled. After recent change to autodetect ctnetlink usage and only allocate the ecache extension if a listener is active, check if the extension is present on a given conntrack. If its not there, there is nothing to report and calls to the notification framework can be elided. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: un-inline nf_ct_ecache_ext_addFlorian Westphal1-25/+5
Only called when new ct is allocated or the extension isn't present. This function will be extended, place this in the conntrack module instead of inlining. The callers already depend on nf_conntrack module. Return value is changed to bool, noone used the returned pointer. Make sure that the core drops the newly allocated conntrack if the extension is requested but can't be added. This makes it necessary to ifdef the section, as the stub always returns false we'd drop every new conntrack if the the ecache extension is disabled in kconfig. Add from data path (xt_CT, nft_ct) is unchanged. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: add nf_ct_iter_data object for nf_ct_iterate_cleanup*()Pablo Neira Ayuso1-3/+9
This patch adds a structure to collect all the context data that is passed to the cleanup iterator. struct nf_ct_iter_data { struct net *net; void *data; u32 portid; int report; }; There is a netns field that allows to clean up conntrack entries specifically owned by the specified netns. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: remove unconfirmed listFlorian Westphal1-1/+0
It has no function anymore and can be removed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: extensions: introduce extension genid countFlorian Westphal2-16/+25
Multiple netfilter extensions store pointers to external data in their extension area struct. Examples: 1. Timeout policies 2. Connection tracking helpers. No references are taken for these. When a helper or timeout policy is removed, the conntrack table gets traversed and affected extensions are cleared. Conntrack entries not yet in the hashtable are referenced via a special list, the unconfirmed list. On removal of a policy or connection tracking helper, the unconfirmed list gets traversed an all entries are marked as dying, this prevents them from getting committed to the table at insertion time: core checks for dying bit, if set, the conntrack entry gets destroyed at confirm time. The disadvantage is that each new conntrack has to be added to the percpu unconfirmed list, and each insertion needs to remove it from this list. The list is only ever needed when a policy or helper is removed -- a rare occurrence. Add a generation ID count: Instead of adding to the list and then traversing that list on policy/helper removal, increment a counter that is stored in the extension area. For unconfirmed conntracks, the extension has the genid valid at ct allocation time. Removal of a helper/policy etc. increments the counter. At confirmation time, validate that ext->genid == global_id. If the stored number is not the same, do not allow the conntrack insertion, just like as if a confirmed-list traversal would have flagged the entry as dying. After insertion, the genid is no longer relevant (conntrack entries are now reachable via the conntrack table iterators and is set to 0. This allows removal of the percpu unconfirmed list. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: remove nf_ct_unconfirmed_destroy helperFlorian Westphal1-3/+0
This helper tags connections not yet in the conntrack table as dying. These nf_conn entries will be dropped instead when the core attempts to insert them from the input or postrouting 'confirm' hook. After the previous change, the entries get unlinked from the list earlier, so that by the time the actual exit hook runs, new connections no longer have a timeout policy assigned. Its enough to walk the hashtable instead. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: cttimeout: decouple unlink and free on netns destructionFlorian Westphal1-8/+0
Make it so netns pre_exit unlinks the objects from the pernet list, so they cannot be found anymore. netns core issues a synchronize_rcu() before calling the exit hooks so any the time the exit hooks run unconfirmed nf_conn entries have been free'd or they have been committed to the hashtable. The exit hook still tags unconfirmed entries as dying, this can now be removed in a followup change. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: include ecache dying list in dumpsFlorian Westphal1-0/+2
The new pernet dying list includes conntrack entries that await delivery of the 'destroy' event via ctnetlink. The old percpu dying list will be removed soon. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: ecache: use dedicated list for event redeliveryFlorian Westphal2-3/+2
This disentangles event redelivery and the percpu dying list. Because entries are now stored on a dedicated list, all entries are in NFCT_ECACHE_DESTROY_FAIL state and all entries still have confirmed bit set -- the reference count is at least 1. The 'struct net' back-pointer can be removed as well. The pcpu dying list will be removed eventually, it has no functionality. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-08netfilter: ecache: move to separate structureFlorian Westphal1-2/+6
This makes it easier for a followup patch to only expose ecache related parts of nf_conntrack_net structure. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+18
Merge in overtime fixes, no conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-20netfilter: nft_fib: add reduce supportFlorian Westphal1-0/+3
The fib expression stores to a register, so we can't add empty stub. Check that the register that is being written is in fact redundant. In most cases, this is expected to cancel tracking as re-use is unlikely. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-20netfilter: nft_meta: extend reduce support to bridge familyFlorian Westphal1-0/+2
its enough to export the meta get reduce helper and then call it from nft_meta_bridge too. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-20netfilter: nf_tables: cancel tracking for clobbered destination registersPablo Neira Ayuso2-0/+15
Output of expressions might be larger than one single register, this might clobber existing data. Reset tracking for all destination registers that required to store the expression output. This patch adds three new helper functions: - nft_reg_track_update: cancel previous register tracking and update it. - nft_reg_track_cancel: cancel any previous register tracking info. - __nft_reg_track_cancel: cancel only one single register tracking info. Partial register clobbering detection is also supported by checking the .num_reg field which describes the number of register that are used. This patch updates the following expressions: - meta_bridge - bitwise - byteorder - meta - payload to use these helper functions. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-20netfilter: nf_tables: do not reduce read-only expressionsPablo Neira Ayuso1-0/+8
Skip register tracking for expressions that perform read-only operations on the registers. Define and use a cookie pointer NFT_REDUCE_READONLY to avoid defining stubs for these expressions. This patch re-enables register tracking which was disabled in ed5f85d42290 ("netfilter: nf_tables: disable register tracking"). Follow up patches add remaining register tracking for existing expressions. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-20netfilter: conntrack: Add and use nf_ct_set_auto_assign_helper_warned()Phil Sutter1-0/+1
The function sets the pernet boolean to avoid the spurious warning from nf_ct_lookup_helper() when assigning conntrack helpers via nftables. Fixes: 1a64edf54f55 ("netfilter: nft_ct: add helper set support") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+0
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-16netfilter: flowtable: Fix QinQ and pppoe support for inet tablePablo Neira Ayuso1-0/+18
nf_flow_offload_inet_hook() does not check for 802.1q and PPPoE. Fetch inner ethertype from these encapsulation protocols. Fixes: 72efd585f714 ("netfilter: flowtable: add pppoe support") Fixes: 4cd91f7c290f ("netfilter: flowtable: add vlan support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-08Revert "netfilter: conntrack: tag conntracks picked up in local out hook"Florian Westphal1-1/+0
This was a prerequisite for the ill-fated "netfilter: nat: force port remap to prevent shadowing well-known ports". As this has been reverted, this change can be backed out too. Signed-off-by: Florian Westphal <fw@strlen.de>
2022-03-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-2/+6
net/batman-adv/hard-interface.c commit 690bb6fb64f5 ("batman-adv: Request iflink once in batadv-on-batadv check") commit 6ee3c393eeb7 ("batman-adv: Demote batadv-on-batadv skip error message") https://lore.kernel.org/all/20220302163049.101957-1-sw@simonwunderlich.de/ net/smc/af_smc.c commit 4d08b7b57ece ("net/smc: Fix cleanup when register ULP fails") commit 462791bbfa35 ("net/smc: add sysctl interface for SMC") https://lore.kernel.org/all/20220302112209.355def40@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-02net/sched: act_ct: Fix flow table lookup failure with no originating ifindexPaul Blakey1-1/+5
After cited commit optimizted hw insertion, flow table entries are populated with ifindex information which was intended to only be used for HW offload. This tuple ifindex is hashed in the flow table key, so it must be filled for lookup to be successful. But tuple ifindex is only relevant for the netfilter flowtables (nft), so it's not filled in act_ct flow table lookup, resulting in lookup failure, and no SW offload and no offload teardown for TCP connection FIN/RST packets. To fix this, add new tc ifindex field to tuple, which will only be used for offloading, not for lookup, as it will not be part of the tuple hash. Fixes: 9795ded7f924 ("net/sched: act_ct: Fill offloading tuple iifidx") Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-01netfilter: nf_queue: fix possible use-after-freeFlorian Westphal1-1/+1
Eric Dumazet says: The sock_hold() side seems suspect, because there is no guarantee that sk_refcnt is not already 0. On failure, we cannot queue the packet and need to indicate an error. The packet will be dropped by the caller. v2: split skb prefetch hunk into separate change Fixes: 271b72c7fa82c ("udp: RCU handling for Unicast packets.") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-02-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-3/+1
tools/testing/selftests/net/mptcp/mptcp_join.sh 34aa6e3bccd8 ("selftests: mptcp: add ip mptcp wrappers") 857898eb4b28 ("selftests: mptcp: add missing join check") 6ef84b1517e0 ("selftests: mptcp: more robust signal race test") https://lore.kernel.org/all/20220221131842.468893-1-broonie@kernel.org/ drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/act.h drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/ct.c fb7e76ea3f3b6 ("net/mlx5e: TC, Skip redundant ct clear actions") c63741b426e11 ("net/mlx5e: Fix MPLSoUDP encap to use MPLS action information") 09bf97923224f ("net/mlx5e: TC, Move pedit_headers_action to parse_attr") 84ba8062e383 ("net/mlx5e: Test CT and SAMPLE on flow attr") efe6f961cd2e ("net/mlx5e: CT, Don't set flow flag CT for ct clear flow") 3b49a7edec1d ("net/mlx5e: TC, Reject rules with multiple CT actions") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-20netfilter: nf_tables_offload: incorrect flow offload action array sizePablo Neira Ayuso2-3/+1
immediate verdict expression needs to allocate one slot in the flow offload action array, however, immediate data expression does not need to do so. fwd and dup expression need to allocate one slot, this is missing. Add a new offload_action interface to report if this expression needs to allocate one slot in the flow offload action array. Fixes: be2861dc36d7 ("netfilter: nft_{fwd,dup}_netdev: add offload support") Reported-and-tested-by: Nick Gregory <Nick.Gregory@Sophos.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-09netfilter: nft_cmp: optimize comparison for 16-bytesPablo Neira Ayuso1-0/+9
Allow up to 16-byte comparisons with a new cmp fast version. Use two 64-bit words and calculate the mask representing the bits to be compared. Make sure the comparison is 64-bit aligned and avoid out-of-bound memory access on registers. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-09netfilter: cttimeout: use option structureFlorian Westphal1-2/+6
Instead of two exported functions, export a single option structure. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-09netfilter: ecache: don't use nf_conn spinlockFlorian Westphal1-1/+1
For updating eache missed value we can use cmpxchg. This also avoids need to disable BH. kernel robot reported build failure on v1 because not all arches support cmpxchg for u16, so extend this to u32. This doesn't increase struct size, existing padding is used. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04netfilter: conntrack: remove extension register apiFlorian Westphal7-54/+0
These no longer register/unregister a meaningful structure so remove it. Cc: Paul Blakey <paulb@nvidia.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04netfilter: conntrack: handle ->destroy hook via nat_ops insteadFlorian Westphal1-3/+0
The nat module already exposes a few functions to the conntrack core. Move the nat extension destroy hook to it. After this, no conntrack extension needs a destroy hook. 'struct nf_ct_ext_type' and the register/unregister api can be removed in a followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04netfilter: conntrack: move extension sizes into coreFlorian Westphal1-1/+0
No need to specify this in the registration modules, we already collect all sizes for build-time checks on the maximum combined size. After this change, all extensions except nat have no meaningful content in their nf_ct_ext_type struct definition. Next patch handles nat, this will then allow to remove the dynamic register api completely. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04netfilter: conntrack: make all extensions 8-byte alignnedFlorian Westphal1-4/+1
All extensions except one need 8 byte alignment, so just make that the default. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-19net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPFKumar Kartikeya Dwivedi1-0/+23
This change adds conntrack lookup helpers using the unstable kfunc call interface for the XDP and TC-BPF hooks. The primary usecase is implementing a synproxy in XDP, see Maxim's patchset [0]. Export get_net_ns_by_id as nf_conntrack_bpf.c needs to call it. This object is only built when CONFIG_DEBUG_INFO_BTF_MODULES is enabled. [0]: https://lore.kernel.org/bpf/20211019144655.3483197-1-maximmi@nvidia.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220114163953.1455836-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextJakub Kicinski3-8/+49
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next. This includes one patch to update ovs and act_ct to use nf_ct_put() instead of nf_conntrack_put(). 1) Add netns_tracker to nfnetlink_log and masquerade, from Eric Dumazet. 2) Remove redundant rcu read-size lock in nf_tables packet path. 3) Replace BUG() by WARN_ON_ONCE() in nft_payload. 4) Consolidate rule verdict tracing. 5) Replace WARN_ON() by WARN_ON_ONCE() in nf_tables core. 6) Make counter support built-in in nf_tables. 7) Add new field to conntrack object to identify locally generated traffic, from Florian Westphal. 8) Prevent NAT from shadowing well-known ports, from Florian Westphal. 9) Merge nf_flow_table_{ipv4,ipv6} into nf_flow_table_inet, also from Florian. 10) Remove redundant pointer in nft_pipapo AVX2 support, from Colin Ian King. 11) Replace opencoded max() in conntrack, from Jiapeng Chong. 12) Update conntrack to use refcount_t API, from Florian Westphal. 13) Move ip_ct_attach indirection into the nf_ct_hook structure. 14) Constify several pointer object in the netfilter codebase, from Florian Westphal. 15) Tree-wide replacement of nf_conntrack_put() by nf_ct_put(), also from Florian. 16) Fix egress splat due to incorrect rcu notation, from Florian. 17) Move stateful fields of connlimit, last, quota, numgen and limit out of the expression data area. 18) Build a blob to represent the ruleset in nf_tables, this is a requirement of the new register tracking infrastructure. 19) Add NFT_REG32_NUM to define the maximum number of 32-bit registers. 20) Add register tracking infrastructure to skip redundant store-to-register operations, this includes support for payload, meta and bitwise expresssions. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next: (32 commits) netfilter: nft_meta: cancel register tracking after meta update netfilter: nft_payload: cancel register tracking after payload update netfilter: nft_bitwise: track register operations netfilter: nft_meta: track register operations netfilter: nft_payload: track register operations netfilter: nf_tables: add register tracking infrastructure netfilter: nf_tables: add NFT_REG32_NUM netfilter: nf_tables: add rule blob layout netfilter: nft_limit: move stateful fields out of expression data netfilter: nft_limit: rename stateful structure netfilter: nft_numgen: move stateful fields out of expression data netfilter: nft_quota: move stateful fields out of expression data netfilter: nft_last: move stateful fields out of expression data netfilter: nft_connlimit: move stateful fields out of expression data netfilter: egress: avoid a lockdep splat net: prefer nf_ct_put instead of nf_conntrack_put netfilter: conntrack: avoid useless indirection during conntrack destruction netfilter: make function op structures const netfilter: core: move ip_ct_attach indirection to struct nf_ct_hook netfilter: conntrack: convert to refcount_t api ... ==================== Link: https://lore.kernel.org/r/20220109231640.104123-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-10netfilter: nft_bitwise: track register operationsPablo Neira Ayuso1-0/+2
Check if the destination register already contains the data that this bitwise expression performs. This allows to skip this redundant operation. If the destination contains a different bitwise operation, cancel the register tracking information. If the destination contains no bitwise operation, update the register tracking information. Update the payload and meta expression to check if this bitwise operation has been already performed on the register. Hence, both the payload/meta and the bitwise expressions are reduced. There is also a special case: If source register != destination register and source register is not updated by a previous bitwise operation, then transfer selector from the source register to the destination register. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-10netfilter: nf_tables: add register tracking infrastructurePablo Neira Ayuso1-0/+12
This patch adds new infrastructure to skip redundant selector store operations on the same register to achieve a performance boost from the packet path. This is particularly noticeable in pure linear rulesets but it also helps in rulesets which are already heaving relying in maps to avoid ruleset linear inspection. The idea is to keep data of the most recurrent store operations on register to reuse them with cmp and lookup expressions. This infrastructure allows for dynamic ruleset updates since the ruleset blob reduction happens from the kernel. Userspace still needs to be updated to maximize register utilization to cooperate to improve register data reuse / reduce number of store on register operations. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-10netfilter: nf_tables: add NFT_REG32_NUMPablo Neira Ayuso1-1/+3
Add a definition including the maximum number of 32-bits registers that are used a scratchpad memory area to store data. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-10netfilter: nf_tables: add rule blob layoutPablo Neira Ayuso1-4/+18
This patch adds a blob layout per chain to represent the ruleset in the packet datapath. size (unsigned long) struct nft_rule_dp struct nft_expr ... struct nft_rule_dp struct nft_expr ... struct nft_rule_dp (is_last=1) The new structure nft_rule_dp represents the rule in a more compact way (smaller memory footprint) compared to the control-plane nft_rule structure. The ruleset blob is a read-only data structure. The first field contains the blob size, then the rules containing expressions. There is a trailing rule which is used by the tracing infrastructure which is equivalent to the NULL rule marker in the previous representation. The blob size field does not include the size of this trailing rule marker. The ruleset blob is generated from the commit path. This patch reuses the infrastructure available since 0cbc06b3faba ("netfilter: nf_tables: remove synchronize_rcu in commit phase") to build the array of rules per chain. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-10netfilter: conntrack: avoid useless indirection during conntrack destructionFlorian Westphal1-2/+6
nf_ct_put() results in a usesless indirection: nf_ct_put -> nf_conntrack_put -> nf_conntrack_destroy -> rcu readlock + indirect call of ct_hooks->destroy(). There are two _put helpers: nf_ct_put and nf_conntrack_put. The latter is what should be used in code that MUST NOT cause a linker dependency on the conntrack module (e.g. calls from core network stack). Everyone else should call nf_ct_put() instead. A followup patch will convert a few nf_conntrack_put() calls to nf_ct_put(), in particular from modules that already have a conntrack dependency such as act_ct or even nf_conntrack itself. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-10netfilter: conntrack: Use max() instead of doing it manuallyJiapeng Chong1-1/+1
Fix following coccicheck warning: ./include/net/netfilter/nf_conntrack.h:282:16-17: WARNING opportunity for max(). Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-04net/sched: act_ct: Fill offloading tuple iifidxPaul Blakey2-0/+54
Driver offloading ct tuples can use the information of which devices received the packets that created the offloaded connections, to more efficiently offload them only to the relevant device. Add new act_ct nf conntrack extension, which is used to store the skb devices before offloading the connection, and then fill in the tuple iifindex so drivers can get the device via metadata dissector match. Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-23netfilter: conntrack: tag conntracks picked up in local out hookFlorian Westphal1-0/+1
This allows to identify flows that originate from local machine in a followup patch. It would be possible to make this a ->status bit instead. For now I did not do that yet because I don't have a use-case for exposing this info to userspace. If one comes up the toggle can be replaced with a status bit. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-23netfilter: nf_tables: make counter support built-inPablo Neira Ayuso1-0/+6
Make counter support built-in to allow for direct call in case of CONFIG_RETPOLINE. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-08netfilter: conntrack: annotate data-races around ct->timeoutEric Dumazet1-3/+3
(struct nf_conn)->timeout can be read/written locklessly, add READ_ONCE()/WRITE_ONCE() to prevent load/store tearing. BUG: KCSAN: data-race in __nf_conntrack_alloc / __nf_conntrack_find_get write to 0xffff888132e78c08 of 4 bytes by task 6029 on cpu 0: __nf_conntrack_alloc+0x158/0x280 net/netfilter/nf_conntrack_core.c:1563 init_conntrack+0x1da/0xb30 net/netfilter/nf_conntrack_core.c:1635 resolve_normal_ct+0x502/0x610 net/netfilter/nf_conntrack_core.c:1746 nf_conntrack_in+0x1c5/0x88f net/netfilter/nf_conntrack_core.c:1901 ipv6_conntrack_local+0x19/0x20 net/netfilter/nf_conntrack_proto.c:414 nf_hook_entry_hookfn include/linux/netfilter.h:142 [inline] nf_hook_slow+0x72/0x170 net/netfilter/core.c:619 nf_hook include/linux/netfilter.h:262 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_xmit+0xa3a/0xa60 net/ipv6/ip6_output.c:324 inet6_csk_xmit+0x1a2/0x1e0 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x132a/0x1840 net/ipv4/tcp_output.c:1402 tcp_transmit_skb net/ipv4/tcp_output.c:1420 [inline] tcp_write_xmit+0x1450/0x4460 net/ipv4/tcp_output.c:2680 __tcp_push_pending_frames+0x68/0x1c0 net/ipv4/tcp_output.c:2864 tcp_push_pending_frames include/net/tcp.h:1897 [inline] tcp_data_snd_check+0x62/0x2e0 net/ipv4/tcp_input.c:5452 tcp_rcv_established+0x880/0x10e0 net/ipv4/tcp_input.c:5947 tcp_v6_do_rcv+0x36e/0xa50 net/ipv6/tcp_ipv6.c:1521 sk_backlog_rcv include/net/sock.h:1030 [inline] __release_sock+0xf2/0x270 net/core/sock.c:2768 release_sock+0x40/0x110 net/core/sock.c:3300 sk_stream_wait_memory+0x435/0x700 net/core/stream.c:145 tcp_sendmsg_locked+0xb85/0x25a0 net/ipv4/tcp.c:1402 tcp_sendmsg+0x2c/0x40 net/ipv4/tcp.c:1440 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:644 sock_sendmsg_nosec net/socket.c:704 [inline] sock_sendmsg net/socket.c:724 [inline] __sys_sendto+0x21e/0x2c0 net/socket.c:2036 __do_sys_sendto net/socket.c:2048 [inline] __se_sys_sendto net/socket.c:2044 [inline] __x64_sys_sendto+0x74/0x90 net/socket.c:2044 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888132e78c08 of 4 bytes by task 17446 on cpu 1: nf_ct_is_expired include/net/netfilter/nf_conntrack.h:286 [inline] ____nf_conntrack_find net/netfilter/nf_conntrack_core.c:776 [inline] __nf_conntrack_find_get+0x1c7/0xac0 net/netfilter/nf_conntrack_core.c:807 resolve_normal_ct+0x273/0x610 net/netfilter/nf_conntrack_core.c:1734 nf_conntrack_in+0x1c5/0x88f net/netfilter/nf_conntrack_core.c:1901 ipv6_conntrack_local+0x19/0x20 net/netfilter/nf_conntrack_proto.c:414 nf_hook_entry_hookfn include/linux/netfilter.h:142 [inline] nf_hook_slow+0x72/0x170 net/netfilter/core.c:619 nf_hook include/linux/netfilter.h:262 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_xmit+0xa3a/0xa60 net/ipv6/ip6_output.c:324 inet6_csk_xmit+0x1a2/0x1e0 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x132a/0x1840 net/ipv4/tcp_output.c:1402 __tcp_send_ack+0x1fd/0x300 net/ipv4/tcp_output.c:3956 tcp_send_ack+0x23/0x30 net/ipv4/tcp_output.c:3962 __tcp_ack_snd_check+0x2d8/0x510 net/ipv4/tcp_input.c:5478 tcp_ack_snd_check net/ipv4/tcp_input.c:5523 [inline] tcp_rcv_established+0x8c2/0x10e0 net/ipv4/tcp_input.c:5948 tcp_v6_do_rcv+0x36e/0xa50 net/ipv6/tcp_ipv6.c:1521 sk_backlog_rcv include/net/sock.h:1030 [inline] __release_sock+0xf2/0x270 net/core/sock.c:2768 release_sock+0x40/0x110 net/core/sock.c:3300 tcp_sendpage+0x94/0xb0 net/ipv4/tcp.c:1114 inet_sendpage+0x7f/0xc0 net/ipv4/af_inet.c:833 rds_tcp_xmit+0x376/0x5f0 net/rds/tcp_send.c:118 rds_send_xmit+0xbed/0x1500 net/rds/send.c:367 rds_send_worker+0x43/0x200 net/rds/threads.c:200 process_one_work+0x3fc/0x980 kernel/workqueue.c:2298 worker_thread+0x616/0xa70 kernel/workqueue.c:2445 kthread+0x2c7/0x2e0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 value changed: 0x00027cc2 -> 0x00000000 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 17446 Comm: kworker/u4:5 Tainted: G W 5.16.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: krdsd rds_send_worker Note: I chose an arbitrary commit for the Fixes: tag, because I do not think we need to backport this fix to very old kernels. Fixes: e37542ba111f ("netfilter: conntrack: avoid possible false sharing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-01netfilter: nft_payload: support for inner header matching / manglingPablo Neira Ayuso1-0/+2
Allow to match and mangle on inner headers / payload data after the transport header. There is a new field in the pktinfo structure that stores the inner header offset which is calculated only when requested. Only TCP and UDP supported at this stage. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>