summaryrefslogtreecommitdiff
path: root/lib/radix-tree.c
AgeCommit message (Collapse)AuthorFilesLines
2018-10-21radix tree: Remove multiorder supportMatthew Wilcox1-202/+13
All users have now been converted to the XArray. Removing the support reduces code size and ensures new users will use the XArray instead. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21radix tree test suite: Convert tag_tagged_items to XArrayMatthew Wilcox1-12/+0
The tag_tagged_items() function is supposed to test the page-writeback tagging code. Since that has been converted to the XArray, there's not much point in testing the radix tree's tagging code. This requires using the pthread mutex embedded in the xarray instead of an external lock, so remove the pthread mutexes which protect xarrays/radix trees. Also remove radix_tree_iter_tag_set() as this was the last user. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21radix tree: Remove radix_tree_clear_tagsMatthew Wilcox1-13/+0
The page cache was the only user of this interface and it has now been converted to the XArray. Transform the test into a test of xas_init_marks(). Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21radix tree: Remove radix_tree_maybe_preload_orderMatthew Wilcox1-74/+0
This function was only used by the page cache which is now converted to the XArray. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21radix tree: Remove split/join codeMatthew Wilcox1-169/+2
radix_tree_split and radix_tree_join were never used upstream. Remove them; if they're needed in future they will be replaced by XArray equivalents. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21radix tree: Remove radix_tree_update_node_tMatthew Wilcox1-34/+8
The only user of this functionality was the workingset code, and it's now been converted to the XArray. Remove __radix_tree_delete_node() entirely as it was also only used by the workingset code. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21shmem: Convert shmem_alloc_hugepage to XArrayMatthew Wilcox1-43/+1
xa_find() is a slightly easier API to use than radix_tree_gang_lookup_slot() because it contains its own RCU locking. This commit removes the last user of radix_tree_gang_lookup_slot() so remove the function too. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21page cache: Add and replace pages using the XArrayMatthew Wilcox1-3/+3
Use the XArray APIs to add and replace pages in the page cache. This removes two uses of the radix tree preload API and is significantly shorter code. It also removes the last user of __radix_tree_create() outside radix-tree.c itself, so make it static. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21ida: Convert to XArrayMatthew Wilcox1-71/+0
Use the XA_TRACK_FREE ability to track which entries have a free bit, similarly to how it uses the radix tree's IDR_FREE tag. This eliminates the per-cpu ida_bitmap preload, and fixes the memory consumption regression I introduced when making the IDR able to store any pointer. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21xarray: Add XArray unconditional store operationsMatthew Wilcox1-2/+2
xa_store() differs from radix_tree_insert() in that it will overwrite an existing element in the array rather than returning an error. This is the behaviour which most users want, and those that want more complex behaviour generally want to use the xas family of routines anyway. For memory allocation, xa_store() will first attempt to request memory from the slab allocator; if memory is not immediately available, it will drop the xa_lock and allocate memory, keeping a pointer in the xa_state. It does not use the per-CPU cache, although those will continue to exist until all radix tree users are converted to the xarray. This patch also includes xa_erase() and __xa_erase() for a streamlined way to store NULL. Since there is no need to allocate memory in order to store a NULL in the XArray, we do not need to trouble the user with deciding what memory allocation flags to use. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21xarray: Add XArray load operationMatthew Wilcox1-43/+0
The xa_load function brings with it a lot of infrastructure; xa_empty(), xa_is_err(), and large chunks of the XArray advanced API that are used to implement xa_load. As the test-suite demonstrates, it is possible to use the XArray functions on a radix tree. The radix tree functions depend on the GFP flags being stored in the root of the tree, so it's not possible to use the radix tree functions on an XArray. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21xarray: Define struct xa_nodeMatthew Wilcox1-24/+24
This is a direct replacement for struct radix_tree_node. A couple of struct members have changed name, so convert those. Use a #define so that radix tree users continue to work without change. Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Josef Bacik <jbacik@fb.com>
2018-10-21xarray: Add definition of struct xarrayMatthew Wilcox1-38/+37
This is a direct replacement for struct radix_tree_root. Some of the struct members have changed name; convert those, and use a #define so that radix_tree users continue to work without change. Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Josef Bacik <jbacik@fb.com>
2018-09-30xarray: Change definition of sibling entriesMatthew Wilcox1-45/+19
Instead of storing a pointer to the slot containing the canonical entry, store the offset of the slot. Produces slightly more efficient code (~300 bytes) and simplifies the implementation. Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Josef Bacik <jbacik@fb.com>
2018-09-30xarray: Replace exceptional entriesMatthew Wilcox1-12/+9
Introduce xarray value entries and tagged pointers to replace radix tree exceptional entries. This is a slight change in encoding to allow the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a value entry). It is also a change in emphasis; exceptional entries are intimidating and different. As the comment explains, you can choose to store values or pointers in the xarray and they are both first-class citizens. Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Josef Bacik <jbacik@fb.com>
2018-09-30idr: Permit any valid kernel pointer to be storedMatthew Wilcox1-6/+15
An upcoming change to the encoding of internal entries will set the bottom two bits to 0b10. Unfortunately, m68k only aligns some data structures to 2 bytes, so the IDR will interpret them as internal entries and things will go badly wrong. Change the radix tree so that it stops either when the node indicates that it's the bottom of the tree (shift == 0) or when the entry is not an internal entry. This means we cannot insert an arbitrary kernel pointer as a multiorder entry, but the IDR does not permit multiorder entries. Annoyingly, this means the IDR can no longer take advantage of the radix tree's ability to store a single entry at offset 0 without allocating memory. A pointer which is 2-byte aligned cannot be stored directly in the root as it would be indistinguishable from a node, so we must allocate a node in order to store a 2-byte pointer at index 0. The idr_replace() function does not take a GFP flags argument, so cannot allocate memory. If a user inserts a 4-byte aligned pointer at index 0 and then replaces it with a 2-byte aligned pointer, we must be able to store it. Arbitrary pointer values are still not permitted; pointers of the form 2 + (i * 4) for values of i between 0 and 1023 are reserved for the implementation. These are not valid kernel pointers as they would point into the zero page. This change does cause a runtime memory consumption regression for the IDA. I will recover that later. Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Guenter Roeck <linux@roeck-us.net>
2018-08-22ida: Remove old APIMatthew Wilcox1-9/+0
Delete ida_pre_get(), ida_get_new(), ida_get_new_above() and ida_remove() from the public API. Some of these functions still exist as internal helpers, but they should not be called by consumers. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-08-22radix-tree: Fix UBSAN warningMatthew Wilcox1-1/+1
get_slot_offset() can be called with a NULL 'parent' argument. In this case, the calculated value will not be used, but calculating it is undefined. Rather than fixing the caller (__radix_tree_delete) to not call get_slot_offset(), make get_slot_offset() robust against being called with a NULL parent. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-05-26idr: fix invalid ptr dereference on item deleteMatthew Wilcox1-1/+3
If the radix tree underlying the IDR happens to be full and we attempt to remove an id which is larger than any id in the IDR, we will call __radix_tree_delete() with an uninitialised 'slot' pointer, at which point anything could happen. This was easiest to hit with a single entry at id 0 and attempting to remove a non-0 id, but it could have happened with 64 entries and attempting to remove an id >= 64. Roman said: The syzcaller test boils down to opening /dev/kvm, creating an eventfd, and calling a couple of KVM ioctls. None of this requires superuser. And the result is dereferencing an uninitialized pointer which is likely a crash. The specific path caught by syzbot is via KVM_HYPERV_EVENTD ioctl which is new in 4.17. But I guess there are other user-triggerable paths, so cc:stable is probably justified. Matthew added: We have around 250 calls to idr_remove() in the kernel today. Many of them pass an ID which is embedded in the object they're removing, so they're safe. Picking a few likely candidates: drivers/firewire/core-cdev.c looks unsafe; the ID comes from an ioctl. drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c is similar drivers/atm/nicstar.c could be taken down by a handcrafted packet Link: http://lkml.kernel.org/r/20180518175025.GD6361@bombadil.infradead.org Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree") Reported-by: <syzbot+35666cba7f0a337e2e79@syzkaller.appspotmail.com> Debugged-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-19radix tree: fix multi-order iteration raceRoss Zwisler1-4/+2
Fix a race in the multi-order iteration code which causes the kernel to hit a GP fault. This was first seen with a production v4.15 based kernel (4.15.6-300.fc27.x86_64) utilizing a DAX workload which used order 9 PMD DAX entries. The race has to do with how we tear down multi-order sibling entries when we are removing an item from the tree. Remember for example that an order 2 entry looks like this: struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling] where 'entry' is in some slot in the struct radix_tree_node, and the three slots following 'entry' contain sibling pointers which point back to 'entry.' When we delete 'entry' from the tree, we call : radix_tree_delete() radix_tree_delete_item() __radix_tree_delete() replace_slot() replace_slot() first removes the siblings in order from the first to the last, then at then replaces 'entry' with NULL. This means that for a brief period of time we end up with one or more of the siblings removed, so: struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling] This causes an issue if you have a reader iterating over the slots in the tree via radix_tree_for_each_slot() while only under rcu_read_lock()/rcu_read_unlock() protection. This is a common case in mm/filemap.c. The issue is that when __radix_tree_next_slot() => skip_siblings() tries to skip over the sibling entries in the slots, it currently does so with an exact match on the slot directly preceding our current slot. Normally this works: V preceding slot struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling] ^ current slot This lets you find the first sibling, and you skip them all in order. But in the case where one of the siblings is NULL, that slot is skipped and then our sibling detection is interrupted: V preceding slot struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling] ^ current slot This means that the sibling pointers aren't recognized since they point all the way back to 'entry', so we think that they are normal internal radix tree pointers. This causes us to think we need to walk down to a struct radix_tree_node starting at the address of 'entry'. In a real running kernel this will crash the thread with a GP fault when you try and dereference the slots in your broken node starting at 'entry'. We fix this race by fixing the way that skip_siblings() detects sibling nodes. Instead of testing against the preceding slot we instead look for siblings via is_sibling_entry() which compares against the position of the struct radix_tree_node.slots[] array. This ensures that sibling entries are properly identified, even if they are no longer contiguous with the 'entry' they point to. Link: http://lkml.kernel.org/r/20180503192430.7582-6-ross.zwisler@linux.intel.com Fixes: 148deab223b2 ("radix-tree: improve multiorder iterators") Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: CR, Sapthagirish <sapthagirish.cr@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11radix tree: use GFP_ZONEMASK bits of gfp_t for flagsMatthew Wilcox1-1/+2
Patch series "XArray", v9. (First part thereof). This patchset is, I believe, appropriate for merging for 4.17. It contains the XArray implementation, to eventually replace the radix tree, and converts the page cache to use it. This conversion keeps the radix tree and XArray data structures in sync at all times. That allows us to convert the page cache one function at a time and should allow for easier bisection. Other than renaming some elements of the structures, the data structures are fundamentally unchanged; a radix tree walk and an XArray walk will touch the same number of cachelines. I have changes planned to the XArray data structure, but those will happen in future patches. Improvements the XArray has over the radix tree: - The radix tree provides operations like other trees do; 'insert' and 'delete'. But what most users really want is an automatically resizing array, and so it makes more sense to give users an API that is like an array -- 'load' and 'store'. We still have an 'insert' operation for users that really want that semantic. - The XArray considers locking as part of its API. This simplifies a lot of users who formerly had to manage their own locking just for the radix tree. It also improves code generation as we can now tell RCU that we're holding a lock and it doesn't need to generate as much fencing code. The other advantage is that tree nodes can be moved (not yet implemented). - GFP flags are now parameters to calls which may need to allocate memory. The radix tree forced users to decide what the allocation flags would be at creation time. It's much clearer to specify them at allocation time. - Memory is not preloaded; we don't tie up dozens of pages on the off chance that the slab allocator fails. Instead, we drop the lock, allocate a new node and retry the operation. We have to convert all the radix tree, IDA and IDR preload users before we can realise this benefit, but I have not yet found a user which cannot be converted. - The XArray provides a cmpxchg operation. The radix tree forces users to roll their own (and at least four have). - Iterators take a 'max' parameter. That simplifies many users and will reduce the amount of iteration done. - Iteration can proceed backwards. We only have one user for this, but since it's called as part of the pagefault readahead algorithm, that seemed worth mentioning. - RCU-protected pointers are not exposed as part of the API. There are some fun bugs where the page cache forgets to use rcu_dereference() in the current codebase. - Value entries gain an extra bit compared to radix tree exceptional entries. That gives us the extra bit we need to put huge page swap entries in the page cache. - Some iterators now take a 'filter' argument instead of having separate iterators for tagged/untagged iterations. The page cache is improved by this: - Shorter, easier to read code - More efficient iterations - Reduction in size of struct address_space - Fewer walks from the top of the data structure; the XArray API encourages staying at the leaf node and conducting operations there. This patch (of 8): None of these bits may be used for slab allocations, so we can use them as radix tree flags as long as we mask them off before passing them to the slab allocator. Move the IDR flag from the high bits to the GFP_ZONEMASK bits. Link: http://lkml.kernel.org/r/20180313132639.17387-3-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Jeff Layton <jlayton@kernel.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-22ida: do zeroing in ida_pre_get()Rasmus Villemoes1-1/+1
As far as I can tell, the only place the per-cpu ida_bitmap is populated is in ida_pre_get. The pre-allocated element is stolen in two places in ida_get_new_above, in both cases immediately followed by a memset(0). Since ida_get_new_above is called with locks held, do the zeroing in ida_pre_get, or rather let kmalloc() do it. Also, apparently gcc generates ~44 bytes of code to do a memset(, 0, 128): $ scripts/bloat-o-meter vmlinux.{0,1} add/remove: 0/0 grow/shrink: 2/1 up/down: 5/-88 (-83) Function old new delta ida_pre_get 115 119 +4 vermagic 27 28 +1 ida_get_new_above 715 627 -88 Link: http://lkml.kernel.org/r/20180108225634.15340-1-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Eric Biggers <ebiggers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-07idr: Remove idr_alloc_extMatthew Wilcox1-1/+2
It has no more users, so remove it. Move idr_alloc() back into idr.c, move the guts of idr_alloc_cmn() into idr_alloc_u32(), remove the wrappers around idr_get_free_cmn() and rename it to idr_get_free(). While there is now no interface to allocate IDs larger than a u32, the IDR internals remain ready to handle a larger ID should a need arise. These changes make it possible to provide the guarantee that, if the nextid pointer points into the object, the object's ID will be initialised before a concurrent lookup can find the object. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2017-11-16mm, truncate: do not check mapping for every page being truncatedMel Gorman1-17/+13
During truncation, the mapping has already been checked for shmem and dax so it's known that workingset_update_node is required. This patch avoids the checks on mapping for each page being truncated. In all other cases, a lookup helper is used to determine if workingset_update_node() needs to be called. The one danger is that the API is slightly harder to use as calling workingset_update_node directly without checking for dax or shmem mappings could lead to surprises. However, the API rarely needs to be used and hopefully the comment is enough to give people the hint. sparsetruncate (tiny) 4.14.0-rc4 4.14.0-rc4 oneirq-v1r1 pickhelper-v1r1 Min Time 141.00 ( 0.00%) 140.00 ( 0.71%) 1st-qrtle Time 142.00 ( 0.00%) 141.00 ( 0.70%) 2nd-qrtle Time 142.00 ( 0.00%) 142.00 ( 0.00%) 3rd-qrtle Time 143.00 ( 0.00%) 143.00 ( 0.00%) Max-90% Time 144.00 ( 0.00%) 144.00 ( 0.00%) Max-95% Time 147.00 ( 0.00%) 145.00 ( 1.36%) Max-99% Time 195.00 ( 0.00%) 191.00 ( 2.05%) Max Time 230.00 ( 0.00%) 205.00 ( 10.87%) Amean Time 144.37 ( 0.00%) 143.82 ( 0.38%) Stddev Time 10.44 ( 0.00%) 9.00 ( 13.74%) Coeff Time 7.23 ( 0.00%) 6.26 ( 13.41%) Best99%Amean Time 143.72 ( 0.00%) 143.34 ( 0.26%) Best95%Amean Time 142.37 ( 0.00%) 142.00 ( 0.26%) Best90%Amean Time 142.19 ( 0.00%) 141.85 ( 0.24%) Best75%Amean Time 141.92 ( 0.00%) 141.58 ( 0.24%) Best50%Amean Time 141.69 ( 0.00%) 141.31 ( 0.27%) Best25%Amean Time 141.38 ( 0.00%) 140.97 ( 0.29%) As you'd expect, the gain is marginal but it can be detected. The differences in bonnie are all within the noise which is not surprising given the impact on the microbenchmark. radix_tree_update_node_t is a callback for some radix operations that optionally passes in a private field. The only user of the callback is workingset_update_node and as it no longer requires a mapping, the private field is removed. Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-09radix-tree: must check __radix_tree_preload() return valueEric Dumazet1-4/+5
__radix_tree_preload() only disables preemption if no error is returned. So we really need to make sure callers always check the return value. idr_preload() contract is to always disable preemption, so we need to add a missing preempt_disable() if an error happened. Similarly, ida_pre_get() only needs to call preempt_enable() in the case no error happened. Link: http://lkml.kernel.org/r/1504637190.15310.62.camel@edumazet-glaptop3.roam.corp.google.com Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree") Fixes: 7ad3d4d85c7a ("ida: Move ida_bitmap to a percpu variable") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [4.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds1-3/+3
Pull networking updates from David Miller: 1) Support ipv6 checksum offload in sunvnet driver, from Shannon Nelson. 2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric Dumazet. 3) Allow generic XDP to work on virtual devices, from John Fastabend. 4) Add bpf device maps and XDP_REDIRECT, which can be used to build arbitrary switching frameworks using XDP. From John Fastabend. 5) Remove UFO offloads from the tree, gave us little other than bugs. 6) Remove the IPSEC flow cache, from Florian Westphal. 7) Support ipv6 route offload in mlxsw driver. 8) Support VF representors in bnxt_en, from Sathya Perla. 9) Add support for forward error correction modes to ethtool, from Vidya Sagar Ravipati. 10) Add time filter for packet scheduler action dumping, from Jamal Hadi Salim. 11) Extend the zerocopy sendmsg() used by virtio and tap to regular sockets via MSG_ZEROCOPY. From Willem de Bruijn. 12) Significantly rework value tracking in the BPF verifier, from Edward Cree. 13) Add new jump instructions to eBPF, from Daniel Borkmann. 14) Rework rtnetlink plumbing so that operations can be run without taking the RTNL semaphore. From Florian Westphal. 15) Support XDP in tap driver, from Jason Wang. 16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal. 17) Add Huawei hinic ethernet driver. 18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan Delalande. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits) i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq i40e: avoid NVM acquire deadlock during NVM update drivers: net: xgene: Remove return statement from void function drivers: net: xgene: Configure tx/rx delay for ACPI drivers: net: xgene: Read tx/rx delay for ACPI rocker: fix kcalloc parameter order rds: Fix non-atomic operation on shared flag variable net: sched: don't use GFP_KERNEL under spin lock vhost_net: correctly check tx avail during rx busy polling net: mdio-mux: add mdio_mux parameter to mdio_mux_init() rxrpc: Make service connection lookup always check for retry net: stmmac: Delete dead code for MDIO registration gianfar: Fix Tx flow control deactivation cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6 cxgb4: Fix pause frame count in t4_get_port_stats cxgb4: fix memory leak tun: rename generic_xdp to skb_xdp tun: reserve extra headroom only when XDP is set net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping net: dsa: bcm_sf2: Advertise number of egress queues ...
2017-08-31idr: Add new APIs to support unsigned longChris Mi1-3/+3
The following new APIs are added: int idr_alloc_ext(struct idr *idr, void *ptr, unsigned long *index, unsigned long start, unsigned long end, gfp_t gfp); void *idr_remove_ext(struct idr *idr, unsigned long id); void *idr_find_ext(const struct idr *idr, unsigned long id); void *idr_replace_ext(struct idr *idr, void *ptr, unsigned long id); void *idr_get_next_ext(struct idr *idr, unsigned long *nextid); Signed-off-by: Chris Mi <chrism@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18drm/i915: Replace execbuf vma ht with an idrChris Wilson1-0/+1
This was the competing idea long ago, but it was only with the rewrite of the idr as an radixtree and using the radixtree directly ourselves, along with the realisation that we can store the vma directly in the radixtree and only need a list for the reverse mapping, that made the patch performant enough to displace using a hashtable. Though the vma ht is fast and doesn't require any extra allocation (as we can embed the node inside the vma), it does require a thread for resizing and serialization and will have the occasional slow lookup. That is hairy enough to investigate alternatives and favour them if equivalent in peak performance. One advantage of allocating an indirection entry is that we can support a single shared bo between many clients, something that was done on a first-come first-serve basis for shared GGTT vma previously. To offset the extra allocations, we create yet another kmem_cache for them. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20170816085210.4199-5-chris@chris-wilson.co.uk
2017-05-04lockdep: allow to disable reclaim lockup detectionMichal Hocko1-0/+2
The current implementation of the reclaim lockup detection can lead to false positives and those even happen and usually lead to tweak the code to silence the lockdep by using GFP_NOFS even though the context can use __GFP_FS just fine. See http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example. ================================= [ INFO: inconsistent lock state ] 4.5.0-rc2+ #4 Tainted: G O --------------------------------- inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage. kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes: (&xfs_nondir_ilock_class){++++-+}, at: xfs_ilock+0x177/0x200 [xfs] {RECLAIM_FS-ON-R} state was registered at: mark_held_locks+0x79/0xa0 lockdep_trace_alloc+0xb3/0x100 kmem_cache_alloc+0x33/0x230 kmem_zone_alloc+0x81/0x120 [xfs] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs] __xfs_refcount_find_shared+0x75/0x580 [xfs] xfs_refcount_find_shared+0x84/0xb0 [xfs] xfs_getbmap+0x608/0x8c0 [xfs] xfs_vn_fiemap+0xab/0xc0 [xfs] do_vfs_ioctl+0x498/0x670 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x12/0x6f CPU0 ---- lock(&xfs_nondir_ilock_class); <Interrupt> lock(&xfs_nondir_ilock_class); *** DEADLOCK *** 3 locks held by kswapd0/543: stack backtrace: CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O 4.5.0-rc2+ #4 Call Trace: lock_acquire+0xd8/0x1e0 down_write_nested+0x5e/0xc0 xfs_ilock+0x177/0x200 [xfs] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs] xfs_fs_evict_inode+0xdc/0x1e0 [xfs] evict+0xc5/0x190 dispose_list+0x39/0x60 prune_icache_sb+0x4b/0x60 super_cache_scan+0x14f/0x1a0 shrink_slab.part.63.constprop.79+0x1e9/0x4e0 shrink_zone+0x15e/0x170 kswapd+0x4f1/0xa80 kthread+0xf2/0x110 ret_from_fork+0x3f/0x70 To quote Dave: "Ignoring whether reflink should be doing anything or not, that's a "xfs_refcountbt_init_cursor() gets called both outside and inside transactions" lockdep false positive case. The problem here is lockdep has seen this allocation from within a transaction, hence a GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context. Also note that we have an active reference to this inode. So, because the reclaim annotations overload the interrupt level detections and it's seen the inode ilock been taken in reclaim ("interrupt") context, this triggers a reclaim context warning where it thinks it is unsafe to do this allocation in GFP_KERNEL context holding the inode ilock..." This sounds like a fundamental problem of the reclaim lock detection. It is really impossible to annotate such a special usecase IMHO unless the reclaim lockup detection is reworked completely. Until then it is much better to provide a way to add "I know what I am doing flag" and mark problematic places. This would prevent from abusing GFP_NOFS flag which has a runtime effect even on configurations which have lockdep disabled. Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to skip the current allocation request. While we are at it also make sure that the radix tree doesn't accidentaly override tags stored in the upper part of the gfp_mask. Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Brian Foster <bfoster@redhat.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-07ida: Free correct IDA bitmapMatthew Wilcox1-2/+2
There's a relatively rare race where we look at the per-cpu preallocated IDA bitmap, see it's NULL, allocate a new one, and atomically update it. If the kmalloc() happened to sleep and we were rescheduled to a different CPU, or an interrupt came in at the exact right time, another task might have successfully allocated a bitmap and already deposited it. I forgot what the semantics of cmpxchg() were and ended up freeing the wrong bitmap leading to KASAN reporting a use-after-free. Dmitry found the bug with syzkaller & wrote the patch. I wrote the test case that will reproduce the bug without his patch being applied. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2017-03-01Merge branch 'idr-4.11' of git://git.infradead.org/users/willy/linux-daxLinus Torvalds1-229/+532
Pull IDR rewrite from Matthew Wilcox: "The most significant part of the following is the patch to rewrite the IDR & IDA to be clients of the radix tree. But there's much more, including an enhancement of the IDA to be significantly more space efficient, an IDR & IDA test suite, some improvements to the IDR API (and driver changes to take advantage of those improvements), several improvements to the radix tree test suite and RCU annotations. The IDR & IDA rewrite had a good spin in linux-next and Andrew's tree for most of the last cycle. Coupled with the IDR test suite, I feel pretty confident that any remaining bugs are quite hard to hit. 0-day did a great job of watching my git tree and pointing out problems; as it hit them, I added new test-cases to be sure not to be caught the same way twice" Willy goes on to expand a bit on the IDR rewrite rationale: "The radix tree and the IDR use very similar data structures. Merging the two codebases lets us share the memory allocation pools, and results in a net deletion of 500 lines of code. It also opens up the possibility of exposing more of the features of the radix tree to users of the IDR (and I have some interesting patches along those lines waiting for 4.12) It also shrinks the size of the 'struct idr' from 40 bytes to 24 which will shrink a fair few data structures that embed an IDR" * 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax: (32 commits) radix tree test suite: Add config option for map shift idr: Add missing __rcu annotations radix-tree: Fix __rcu annotations radix-tree: Add rcu_dereference and rcu_assign_pointer calls radix tree test suite: Run iteration tests for longer radix tree test suite: Fix split/join memory leaks radix tree test suite: Fix leaks in regression2.c radix tree test suite: Fix leaky tests radix tree test suite: Enable address sanitizer radix_tree_iter_resume: Fix out of bounds error radix-tree: Store a pointer to the root in each node radix-tree: Chain preallocated nodes through ->parent radix tree test suite: Dial down verbosity with -v radix tree test suite: Introduce kmalloc_verbose idr: Return the deleted entry from idr_remove radix tree test suite: Build separate binaries for some tests ida: Use exceptional entries for small IDAs ida: Move ida_bitmap to a percpu variable Reimplement IDR and IDA using the radix tree radix-tree: Add radix_tree_iter_delete ...
2017-02-14radix-tree: Fix __rcu annotationsMatthew Wilcox1-59/+66
Many places were missing __rcu annotations. A few places needed a few lines of explanation about why it was safe to not use RCU accessors. Add a custom CFLAGS setting to the Makefile to ensure that new patches don't miss RCU annotations. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2017-02-14radix-tree: Add rcu_dereference and rcu_assign_pointer callsMatthew Wilcox1-11/+15
Some of these have been missing for many years. Others were recently introduced by me. Fortunately, we have tools that help us find such things. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2017-02-14radix_tree_iter_resume: Fix out of bounds errorMatthew Wilcox1-1/+0
The address sanitizer occasionally finds an out of bounds error while running the test-suite. It turned out to be a read of the pointer immediately next to the tree root, but this out of bounds error could have occurred elsewhere. This happens because radix_tree_iter_resume() dereferences 'slot' before checking whether we've come to the end of the chunk. We can just delete this line; the value was never used. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2017-02-14radix-tree: Store a pointer to the root in each nodeMatthew Wilcox1-6/+8
Instead of having this mysterious private_data in each radix_tree_node, store a pointer to the root, which can be useful for debugging. This also relieves the mm code from the duty of updating it. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2017-02-14radix-tree: Chain preallocated nodes through ->parentMatthew Wilcox1-5/+4
Chaining through the ->private_data member means we have to zero ->private_data after removing preallocated nodes from the list. We're about to initialise ->parent anyway, so we can avoid zeroing it. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2017-02-14ida: Use exceptional entries for small IDAsMatthew Wilcox1-0/+8
We can use the root entry as a bitmap and save allocating a 128 byte bitmap for an IDA that contains only a few entries (30 on a 32-bit machine, 62 on a 64-bit machine). This costs about 300 bytes of kernel text on x86-64, so as long as 3 IDAs fall into this category, this is a net win for memory consumption. Thanks to Rasmus Villemoes for his work documenting the problem and collecting statistics on IDAs. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2017-02-14ida: Move ida_bitmap to a percpu variableMatthew Wilcox1-3/+42
When we preload the IDA, we allocate an IDA bitmap. Instead of storing that preallocated bitmap in the IDA, we store it in a percpu variable. Generally there are more IDAs in the system than CPUs, so this cuts down on the number of preallocated bitmaps that are unused, and about half of the IDA users did not call ida_destroy() so they were leaking IDA bitmaps. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2017-02-14Reimplement IDR and IDA using the radix treeMatthew Wilcox1-88/+287
The IDR is very similar to the radix tree. It has some functionality that the radix tree did not have (alloc next free, cyclic allocation, a callback-based for_each, destroy tree), which is readily implementable on top of the radix tree. A few small changes were needed in order to use a tag to represent nodes with free space below them. More extensive changes were needed to support storing NULL as a valid entry in an IDR. Plain radix trees still interpret NULL as a not-present entry. The IDA is reimplemented as a client of the newly enhanced radix tree. As in the current implementation, it uses a bitmap at the last level of the tree. Signed-off-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2017-02-14radix-tree: Add radix_tree_iter_deleteMatthew Wilcox1-31/+60
Factor the deletion code out into __radix_tree_delete() and provide a nice iterator-based wrapper around it. If we free the node, advance the iterator to avoid reading from freed memory. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2017-02-14radix-tree: Add radix_tree_iter_tag_clear()Matthew Wilcox1-28/+40
The counterpart to radix_tree_iter_tag_set(), used by the IDR code Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Rehas Sachdeva <aquannie@gmail.com>
2017-02-13EXPORT_SYMBOL radix_tree_replace_slotSong Liu1-0/+1
It will be used in drivers/md/raid5-cache.c Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-28radix tree: constify some pointersMatthew Wilcox1-26/+31
If we're just getting the value of a tag, or looking up an entry, we won't modify the radix tree, so we can declare these functions as taking a const pointer. Mostly for documentation purposes, though it might help code generation. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2017-01-25radix-tree: fix private list warningsMatthew Wilcox1-1/+1
The newly introduced warning in radix_tree_free_nodes() was testing the wrong variable; it should have been 'old' instead of 'node'. Fixes: ea07b862ac8e ("mm: workingset: fix use-after-free in shadow node shrinker") Link: http://lkml.kernel.org/r/20170118163746.GA32495@cmpxchg.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-08mm: workingset: fix use-after-free in shadow node shrinkerJohannes Weiner1-2/+9
Several people report seeing warnings about inconsistent radix tree nodes followed by crashes in the workingset code, which all looked like use-after-free access from the shadow node shrinker. Dave Jones managed to reproduce the issue with a debug patch applied, which confirmed that the radix tree shrinking indeed frees shadow nodes while they are still linked to the shadow LRU: WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200 CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3 Call Trace: delete_node+0x1e4/0x200 __radix_tree_delete_node+0xd/0x10 shadow_lru_isolate+0xe6/0x220 __list_lru_walk_one.isra.4+0x9b/0x190 list_lru_walk_one+0x23/0x30 scan_shadow_nodes+0x2e/0x40 shrink_slab.part.44+0x23d/0x5d0 shrink_node+0x22c/0x330 kswapd+0x392/0x8f0 This is the WARN_ON_ONCE(!list_empty(&node->private_list)) placed in the inlined radix_tree_shrink(). The problem is with 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking"), which passes an update callback into the radix tree to link and unlink shadow leaf nodes when tree entries change, but forgot to pass the callback when reclaiming a shadow node. While the reclaimed shadow node itself is unlinked by the shrinker, its deletion from the tree can cause the left-most leaf node in the tree to be shrunk. If that happens to be a shadow node as well, we don't unlink it from the LRU as we should. Consider this tree, where the s are shadow entries: root->rnode | [0 n] | | [s ] [sssss] Now the shadow node shrinker reclaims the rightmost leaf node through the shadow node LRU: root->rnode | [0 ] | [s ] Because the parent of the deleted node is the first level below the root and has only one child in the left-most slot, the intermediate level is shrunk and the node containing the single shadow is put in its place: root->rnode | [s ] The shrinker again sees a single left-most slot in a first level node and thus decides to store the shadow in root->rnode directly and free the node - which is a leaf node on the shadow node LRU. root->rnode | s Without the update callback, the freed node remains on the shadow LRU, where it causes later shrinker runs to crash. Pass the node updater callback into __radix_tree_delete_node() in case the deletion causes the left-most branch in the tree to collapse too. Also add warnings when linked nodes are freed right away, rather than wait for the use-after-free when the list is scanned much later. Fixes: 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking") Reported-by: Dave Chinner <david@fromorbit.com> Reported-by: Hugh Dickins <hughd@google.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reported-and-tested-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chris Leech <cleech@redhat.com> Cc: Lee Duncan <lduncan@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-15redo: radix tree test suite: fix compilationMatthew Wilcox1-1/+0
[ This resurrects commit 53855d10f456, which was reverted in 2b41226b39b6. It depended on commit d544abd5ff7d ("lib/radix-tree: Convert to hotplug state machine") so now it is correct to apply ] Patch "lib/radix-tree: Convert to hotplug state machine" breaks the test suite as it adds a call to cpuhp_setup_state_nocalls() which is not currently emulated in the test suite. Add it, and delete the emulation of the old CPU hotplug mechanism. Link: http://lkml.kernel.org/r/1480369871-5271-36-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-15radix-tree: ensure counts are initialisedMatthew Wilcox1-21/+20
radix_tree_join() was freeing nodes with a non-zero ->exceptional count, and radix_tree_split() wasn't zeroing ->exceptional when it allocated the new node. Fix this by making all callers of radix_tree_node_alloc() pass in the new counts (and some other always-initialised fields), which will prevent the problem recurring if in future we decide to do something similar. Link: http://lkml.kernel.org/r/1481667692-14500-3-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-15radix-tree: fix replacement for multiorder entriesMatthew Wilcox1-16/+44
When replacing an entry with NULL, we need to delete any sibling entries. Also account deleting exceptional entries properly. Also fix a bug with radix_tree_iter_replace() where we would fail to remove entirely freed nodes. Also fix accounting bug when switching between normal and exceptional entries with replace_slot. Also add testcases for all these bugs. Link: http://lkml.kernel.org/r/1480369871-5271-61-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-15radix-tree: add radix_tree_split_preload()Matthew Wilcox1-1/+23
Calculate how many nodes we need to allocate to split an old_order entry into multiple entries, each of size new_order. The test suite checks that we allocated exactly the right number of nodes; neither too many (checked by rtp->nr == 0), nor too few (checked by comparing nr_allocated before and after the call to radix_tree_split()). Link: http://lkml.kernel.org/r/1480369871-5271-60-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-15radix-tree: add radix_tree_splitMatthew Wilcox1-4/+138
This new function splits a larger multiorder entry into smaller entries (potentially multi-order entries). These entries are initialised to RADIX_TREE_RETRY to ensure that RCU walkers who see this state aren't confused. The caller should then call radix_tree_for_each_slot() and radix_tree_replace_slot() in order to turn these retry entries into the intended new entries. Tags are replicated from the original multiorder entry into each new entry. Link: http://lkml.kernel.org/r/1480369871-5271-59-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>