summaryrefslogtreecommitdiff
path: root/fs/bcachefs/btree_key_cache.c
AgeCommit message (Collapse)AuthorFilesLines
2023-12-03bcachefs: Don't drop journal pins in exit pathKent Overstreet1-2/+0
There's no need to drop journal pins in our exit paths - the code was trying to have everything cleaned up on any shutdown, but better to just tweak the assertions a bit. This fixes a bug where calling into journal reclaim in the exit path would cass a null ptr deref. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-15bcachefs: Kill journal pre-reservationsKent Overstreet1-14/+0
This deletes the complicated and somewhat expensive journal pre-reservation machinery in favor of just using journal watermarks: when the journal is more than half full, we run journal reclaim more aggressively, and when the journal is more than 3/4s full we only allow journal reclaim to get new journal reservations. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-14bcachefs: Run btree key cache shrinker less aggressivelyKent Overstreet1-4/+19
The btree key cache maintains lists of items that have been freed, but can't yet be reclaimed because a bch2_trans_relock() call might find them - we're waiting for SRCU readers to release. Previously, we wouldn't count these items against the number we're attempting to scan for, which would mean we'd evict more live key cache entries - doing quite a bit of potentially unecessary work. With recent work to make sure we don't hold SRCU locks for too long, it should be safe to count all the items on the freelists against number to scan - even if we can't reclaim them yet, we will be able to soon. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-07Merge tag 'bcachefs-2023-11-5' of https://evilpiepirate.org/git/bcachefsLinus Torvalds1-4/+4
Pull more bcachefs updates from Kent Overstreet: "Here's the second big bcachefs pull request. This brings your tree up to date with my master branch, which is what existing bcachefs users are currently running. New features: - rebalance_work btree (and metadata version 1.3): the rebalance thread no longer has to scan to find extents that need processing - big scalability improvement. - sb_errors superblock section: this adds counters for each fsck error type, since filesystem creation, along with the date of the most recent error. It'll get us better bug reports (since users do not typically report errors that fsck was able to fix), and I might add telemetry for this in the future. Fixes include: - multiple snapshot deletion fixes - members_v2 fixups - deleted_inodes btree fixes - copygc thread no longer spins when a device is full but has no fragmented buckets (i.e. rebalance needs to move data around instead) - a fix for a memory reclaim issue with the btree key cache: we're now careful not to hold the srcu read lock that blocks key cache reclaim for too long - an early allocator locking fix, from Brian - endianness fixes, from Brian - CONFIG_BCACHEFS_DEBUG_TRANSACTIONS no longer defaults to y, a big performance improvement on multithreaded workloads" * tag 'bcachefs-2023-11-5' of https://evilpiepirate.org/git/bcachefs: (70 commits) bcachefs: Improve stripe checksum error message bcachefs: Simplify, fix bch2_backpointer_get_key() bcachefs: kill thing_it_points_to arg to backpointer_not_found() bcachefs: bch2_ec_read_extent() now takes btree_trans bcachefs: bch2_stripe_to_text() now prints ptr gens bcachefs: Don't iterate over journal entries just for btree roots bcachefs: Break up bch2_journal_write() bcachefs: Replace ERANGE with private error codes bcachefs: bkey_copy() is no longer a macro bcachefs: x-macro-ify inode flags enum bcachefs: Convert bch2_fs_open() to darray bcachefs: Move __bch2_members_v2_get_mut to sb-members.h bcachefs: bch2_prt_datetime() bcachefs: CONFIG_BCACHEFS_DEBUG_TRANSACTIONS no longer defaults to y bcachefs: Add a comment for BTREE_INSERT_NOJOURNAL usage bcachefs: rebalance_work btree is not a snapshots btree bcachefs: Add missing printk newlines bcachefs: Fix recovery when forced to use JSET_NO_FLUSH journal entry bcachefs: .get_parent() should return an error pointer bcachefs: Fix bch2_delete_dead_inodes() ...
2023-11-03Merge tag 'mm-stable-2023-11-01-14-33' of ↵Linus Torvalds1-9/+12
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
2023-11-02bcachefs: Don't downgrade locks on transaction restartKent Overstreet1-1/+1
We should only be downgrading locks on success - otherwise, our transaction restarts won't be getting the correct locks and we'll livelock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-31bcachefs: Fix shrinker namesKent Overstreet1-1/+1
Shrinkers are now exported to debugfs, so the names can't have slashes in them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-31bcachefs: bch2_btree_id_str()Kent Overstreet1-2/+2
Since we can run with unknown btree IDs, we can't directly index btree IDs into fixed size arrays. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Heap allocate btree_transKent Overstreet1-7/+5
We're using more stack than we'd like in a number of functions, and btree_trans is the biggest object that we stack allocate. But we have to do a heap allocatation to initialize it anyways, so there's no real downside to heap allocating the entire thing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix W=12 build errorsKent Overstreet1-2/+0
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix -Wformat in bch2_btree_key_cache_to_text()Nathan Chancellor1-1/+1
When building bcachefs for 32-bit ARM, there is a compiler warning in bch2_btree_key_cache_to_text() due to use of an incorrect format specifier: fs/bcachefs/btree_key_cache.c:1060:36: error: format specifies type 'size_t' (aka 'unsigned int') but the argument has type 'long' [-Werror,-Wformat] 1060 | prt_printf(out, "nr_freed:\t%zu", atomic_long_read(&c->nr_freed)); | ~~~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | %ld fs/bcachefs/util.h:223:54: note: expanded from macro 'prt_printf' 223 | #define prt_printf(_out, ...) bch2_prt_printf(_out, __VA_ARGS__) | ^~~~~~~~~~~ 1 error generated. On 64-bit architectures, size_t is 'unsigned long', so there is no warning when using %zu but on 32-bit architectures, size_t is 'unsigned int'. Use '%lu' to match the other format specifiers used in this function for printing values returned from atomic_long_read(). Fixes: 6d799930ce0f ("bcachefs: btree key cache pcpu freedlist") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix silent enum conversion errorKent Overstreet1-5/+7
This changes mark_btree_node_locked() to take an enum btree_node_locked_type, not a six_lock_type, since BTREE_NODE_UNLOCKED is -1 which may cause problems converting back and forth to six_lock_type if short enums are in use. With this change, we never store BTREE_NODE_UNLOCKED in a six_lock_type enum. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: mark bch_inode_info and bkey_cached as reclaimableMikulas Patocka1-1/+1
Mark these caches as reclaimable, so that available memory is correctly reported when there is a lot of cached inodes. Note that more work is needed - you should add __GFP_RECLAIMABLE to some of the kmalloc calls, so that they are allocated from the "kmalloc-rcl-*" caches. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Add new assertions for shutdown pathKent Overstreet1-0/+1
We've been seeing assertions pop that indicate the btree node cache or key cache have dirty items when we just did a clean shutdown. Add some more assertions so we can catch this when we're dirtying items. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Kill BTREE_INSERT_USE_RESERVEKent Overstreet1-1/+0
Now that we have journal watermarks and alloc watermarks unified, BTREE_INSERT_USE_RESERVE is redundant and can be deleted. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Kill JOURNAL_WATERMARKKent Overstreet1-1/+1
This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards specifying watermarks once in the transaction commit path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: unregister_shrinker() now safe on not-registered shrinkerKent Overstreet1-2/+1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: allocate_dropping_locks()Kent Overstreet1-10/+3
Add two new helpers for allocating memory with btree locks held: The idea is to first try the allocation with GFP_NOWAIT|__GFP_NOWARN, then if that fails - unlock, retry with GFP_KERNEL, and then call trans_relock(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23six locks: Kill six_lock_state unionKent Overstreet1-3/+3
As suggested by Linus, this drops the six_lock_state union in favor of raw bitmasks. On the one hand, bitfields give more type-level structure to the code. However, a significant amount of the code was working with six_lock_state as a u64/atomic64_t, and the conversions from the bitfields to the u64 were deemed a bit too out-there. More significantly, because bitfield order is poorly defined (#ifdef __LITTLE_ENDIAN_BITFIELD can be used, but is gross), incrementing the sequence number would overflow into the rest of the bitfield if the compiler didn't put the sequence number at the high end of the word. The new code is a bit saner when we're on an architecture without real atomic64_t support - all accesses to lock->state now go through atomic64_*() operations. On architectures with real atomic64_t support, we additionally use atomic bit ops for setting/clearing individual bits. Text size: 7467 bytes -> 4649 bytes - compilers still suck at bitfields. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23six locks: Kill six_lock_pcpu_(alloc|free)Kent Overstreet1-9/+4
six_lock_pcpu_alloc() is an unsafe interface: it's not safe to allocate or free the percpu reader count on an existing lock that's in use, the only safe time to allocate percpu readers is when the lock is first being initialized. This patch adds a flags parameter to six_lock_init(), and instead of six_lock_pcpu_free() we now expose six_lock_exit(), which does the same thing but is less likely to be misused. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: bch2_bkey_get_iter() helpersKent Overstreet1-4/+3
Introduce new helpers for a common pattern: bch2_trans_iter_init(); bch2_btree_iter_peek_slot(); - bch2_bkey_get_iter_type() returns -ENOENT if it doesn't find a key of the correct type - bch2_bkey_get_val_typed() copies the val out of the btree to a (typically stack allocated) variable; it handles the case where the value in the btree is smaller than the current version of the type, zeroing out the remainder. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Private error codes: ENOMEMKent Overstreet1-8/+8
This adds private error codes for most (but not all) of our ENOMEM uses, which makes it easier to track down assorted allocation failures. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: don't bump key cache journal seq on nojournal commitsBrian Foster1-4/+19
fstest generic/388 occasionally reproduces corruptions where an inode has extents beyond i_size. This is a deliberate crash and recovery test, and the post crash+recovery characteristics are usually the same: the inode exists on disk in an early (i.e. just allocated) state based on the journal sequence number associated with the inode. Subsequent inode updates exist in the journal at higher sequence numbers, but the inode hadn't been written back before the associated crash and the post-crash recovery processes a set of journal sequence numbers that doesn't include updates to the inode. In fact, the sequence with the most recent inode key update always happens to be the sequence just before the front of the journal processed by recovery. This last bit is a significant hint that the problem relates to an on-disk journal update of the front of the journal. The root cause of this problem is basically that the inode is updated (multiple times) in-core and in the key cache, each time bumping the key cache sequence number used to control the cache flush. The cache flush skips one or more times, bumping the associated key cache journal pin to the key cache seq value. This has a side effect of holding the inode in memory a bit longer than normal, which helps exacerbate this problem, but is also unsafe in certain cases where the key cache seq may have been updated by a transaction commit that didn't journal the associated key. For example, consider an inode that has been allocated, updated several times in the key cache, journaled, but not yet written back. At this stage, everything should be consistent if the fs happens to crash because the latest update has been journal. Now consider a key update via bch2_extent_update_i_size_sectors() that uses the BTREE_UPDATE_NOJOURNAL flag. While this update may not change inode state, it can have the side effect of bumping ck->seq in bch2_btree_insert_key_cached(). In turn, if a subsequent key cache flush skips due to seq not matching the former, the ck->journal pin is updated to ck->seq even though the most recent key update was not journaled. If this pin happens to reside at the front (tail) of the journal, this means a subsequent journal write can update last_seq to a value beyond that which includes the most recent update to the inode. If this occurs and the fs happens to crash before the inode happens to flush, recovery will see the latest last_seq, fail to recover the inode and leave the inode in the inconsistent state described above. To avoid this problem, skip the key cache seq update on NOJOURNAL commits, except on initial pin add. Pass the insert entry directly to bch2_btree_insert_key_cached() to make the associated flag available and be consistent with btree_insert_key_leaf(). Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Drop some anonymous structs, unionsKent Overstreet1-3/+3
Rust bindgen doesn't cope well with anonymous structs and unions. This patch drops the fancy anonymous structs & unions in bkey_i that let us use the same helpers for bkey_i and bkey_packed; since bkey_packed is an internal type that's never exposed to outside code, it's only a minor inconvenienc. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Centralize btree node lock initializationKent Overstreet1-2/+1
This fixes some confusion in the lockdep code due to initializing btree node/key cache locks with the same lockdep key, but different names. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Kill trans->flagsKent Overstreet1-1/+2
Recursive transaction commits are occasionally necessary - in particular, for the upcoming btree write buffer's flush path. This avoids bugs due to trans->flags being accidentally mutated mid-commit, which can cause c->writes refcount leaks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Don't call bch2_journal_pin_drop() under key cache lockKent Overstreet1-6/+8
This fixes a (harmless) lockdep splat, due to a lock order violation in the key cache exit path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Use six_lock_ip()Kent Overstreet1-1/+1
This uses the new _ip() interface to six locks and hooks it up to btree_path->ip_allocated, when available. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Don't emit tracepoints for expected eventsKent Overstreet1-2/+2
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: key cache: Don't hold btree locks while using GFP_RECLAIMKent Overstreet1-20/+50
This is something we need to do more widely: instead of bothering with GFP_NOIO/GFP_NOFS, if we need to allocate memory while holding locks: - first attempt the allocation with GFP_NOWAIT - if that fails, drop btree locks with bch2_trans_unlock(), then retry with GFP_KERNEL. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Improve bkey_cached_lock_for_evict()Kent Overstreet1-3/+2
We don't need a write lock to check if a key is dirty. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix a livelock in key cache fill pathKent Overstreet1-1/+10
We weren't setting path->uptodate before calling bch2_btree_key_cache_fill() - which causes __bch2_btree_path_upgrade() to fail. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Key cache now works for snapshots btreesKent Overstreet1-8/+8
This switches btree_key_cache_fill() to use a btree iterator, not a btree path, so that it can search for keys in previous snapshots. We also add another iterator flag, BTREE_ITER_KEY_CACHE_FILL, to avoid recursion back into the key cache. This will allow us to re-enable the key cache for inodes in the next patch. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Bring back BTREE_ITER_CACHED_NOFILLKent Overstreet1-2/+1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: New bpos_cmp(), bkey_cmp() replacementsKent Overstreet1-4/+4
This patch introduces - bpos_eq() - bpos_lt() - bpos_le() - bpos_gt() - bpos_ge() and equivalent replacements for bkey_cmp(). Looking at the generated assembly these could probably be improved further, but we already see a significant code size improvement with this patch. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix a use after freeKent Overstreet1-4/+5
This fixes a regression from percpu freedlists in the btree key cache code: in a rare error path, we were immediately freeing a bkey_cached that had been used before and should've waited for an SRCU barrier. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Assorted checkpatch fixesKent Overstreet1-5/+5
checkpatch.pl gives lots of warnings that we don't want - suggested ignore list: ASSIGN_IN_IF UNSPECIFIED_INT - bcachefs coding style prefers single token type names NEW_TYPEDEFS - typedefs are occasionally good FUNCTION_ARGUMENTS - we prefer to look at functions in .c files (hopefully with docbook documentation), not .h file prototypes MULTISTATEMENT_MACRO_USE_DO_WHILE - we have _many_ x-macros and other macros where we can't do this Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Btree key cache shrinker fixKent Overstreet1-10/+32
The shrinker assumes freed key cache items are ordered by age, so that it doesn't have to scan the full list to find items that are old enough (according to the srcu code) to be freed. But percpu freelists broke this ordering; this patch fixes this by ensuring we insert items into the proper position. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Btree key cache improvementsKent Overstreet1-17/+50
- In userspace, we don't have real percpu variables; this patch disables the percpu freelists in userspace - add some error messages for the asserts in bch2_fs_btree_key_cache_exit(); we've been hitting this (only in userspace, oddly), perhaps this will help us track down the error. - bkey_cached_reuse() should likely be taking the key cache lock, and it's a slowpath so it doesn't hurt to Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: bch2_btree_key_cache_scan() doesn't need trylockKent Overstreet1-6/+2
We don't actually allocate memory under the btree key cache lock - so there's no recursion concerns, and the shrinker can just use mutex_lock(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Break out bch2_btree_path_traverse_cached_slowpath()Kent Overstreet1-3/+57
Prep work for further refactoring. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Delete old deadlock avoidance codeKent Overstreet1-21/+6
This deletes our old lock ordering based deadlock avoidance code. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-23bcachefs: Ensure intent locks are marked before taking write locksKent Overstreet1-2/+8
Locks must be correctly marked for the cycle detector to work. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Avoid using btree_node_lock_nopath()Kent Overstreet1-11/+6
With the upcoming cycle detector, we have to be careful about using btree_node_lock_nopath - in particular, using it to take write locks can cause deadlocks. All held locks need to be tracked in a btree_path, so that the cycle detector knows about them - unless we know that we cannot cause deadlocks for other reasons: e.g. we are only taking read locks, or we're in very early fsck (topology repair). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix usage of six lock's percpu mode, key cache versionKent Overstreet1-42/+88
Similar to "bcachefs: Fix usage of six lock's percpu mode", six locks have a percpu mode, but we can't switch between percpu and non percpu modes while a lock is in use: threads attempting to take a read lock may race, and we'll end up with the read count permanently off. Fixing this the "correct" way, in six_lock_pcpu_(alloc|free) would require an RCU barrier, and we don't want to do that - instead, we have to permanently segragate percpu and non percpu objects, including when on freelists. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Refactor bkey_cached_alloc() pathKent Overstreet1-19/+19
Clean up the arguments passed and make them more consistent. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Convert more locking code to btree_bkey_cached_commonKent Overstreet1-1/+1
Ideally, all the code in btree_locking.c should be converted, but then we'd want to convert btree_path to point to btree_key_cached_common too, and then we'd be in for a much bigger cleanup - but a bit of incremental cleanup will still be helpful for the next patches. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: btree_bkey_cached_common->cachedKent Overstreet1-0/+1
Add a type descriptor to btree_bkey_cached_common - there's no reason not to since we've got padding that was otherwise unused, and this is a nice cleanup (and helpful in later patches). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: bch2_btree_node_lock_write_nofail()Kent Overstreet1-5/+6
Taking a write lock will be able to fail, with the new cycle detector - unless we pass it nofail, which is possible but not preferred. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: New locking functionsKent Overstreet1-25/+57
In the future, with the new deadlock cycle detector, we won't be using bare six_lock_* anymore: lock wait entries will all be embedded in btree_trans, and we will need a btree_trans context whenever locking a btree node. This patch plumbs a btree_trans to the few places that need it, and adds two new locking functions - btree_node_lock_nopath, which may fail returning a transaction restart, and - btree_node_lock_nopath_nofail, to be used in places where we know we cannot deadlock (i.e. because we're holding no other locks). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>