summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2023-10-23bcachefs: Delete redundant log messagesKent Overstreet8-64/+17
Now that we have distinct error codes for different memory allocation failures, the early init log messages are no longer needed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Change check for invalid key typesKent Overstreet26-80/+141
As part of the forward compatibility patch series, we need to allow for new key types without complaining loudly when running an old version. This patch changes the flags parameter of bkey_invalid to an enum, and adds a new flag to indicate we're being called from the transaction commit path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Assorted sparse fixesKent Overstreet38-118/+115
- endianness fixes - mark some things static - fix a few __percpu annotations - fix silent enum conversions Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Refactor bch_sb_field_ops handlingKent Overstreet1-11/+18
This changes bch_sb_field_ops lookup to match how bkey_ops now works; for an unknown field type we return an empty ops struct. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Allow for unknown key typesKent Overstreet3-27/+33
This adds a new helper for lookups bkey_ops for a given key type, which returns a null bkey_ops for unknown key types; various bkey_ops users are tweaked as well to handle unknown key types. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Allow for unknown btree IDsKent Overstreet11-42/+91
We need to allow filesystems with metadata from newer versions to be mountable and usable by older versions. This patch enables us to roll out new btrees without a new major version number; we can now handle btree roots for unknown btree types. The unknown btree roots will be retained, and fsck (including backpointers) will check them, the same as other btree types. We add a dynamic array for the extra, unknown btree roots, in addition to the fixed size btree root array, and add new helpers for looking up btree roots. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: flush journal to avoid invalid dev usage entries on recoveryBrian Foster1-0/+11
A crash immediately after device removal can result in an unmountable filesystem due to recovery failure. The following command reliably reproduces on a multi-device fs: bcachefs device remove <dev> && xfs_io -xc shutdown <mnt> The post-crash mount fails with an error similar to the following, reported by fsck: invalid journal entry dev_usage at offset 7994/8034 seq 12: bad dev, fixing This refers to a device usage entry in the journal that refers to the index of the just removed device. Recovery considers this an invalid entry and fails to proceed. Device usage entries are added to journal buffer writes via bch_journal_write() -> bch2_journal_super_entries_add_common(), which means any journal buffer write has content that refers to member devices at the time of the journal write. The device remove sequence already removes metadata references to the device being removed. It then flushes any pins that refer to the device, clears replica entries, removes the in-memory device object and lastly updates the superblock to reflect that the device is no longer present. The problem is that any journal writes that occur during this sequence will include a dev usage entry so long as the device is present. To avoid this problem, we can flush the journal once more after the device entry is removed from the in-core structures, but before the superblock is updated to fully remove the device on-disk. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: mark active journal devices on journal replicas gcBrian Foster1-1/+13
A simple device evacuate, remove, add test loop with concurrent shutdowns occasionally reproduces a problem where the filesystem fails to mount. The mount failure occurs because the filesystem was uncleanly shut down, yet no member device is marked for journal data in the superblock. An fsck detects the problem, restores the mark and allows the mount to proceed without further consistency issues. The reason for the lack of journal data marks is the gc mechanism invoked via bch2_journal_flush_device_pins() runs while the journal happens to be empty. This results in garbage collection of all journal replicas entries. Once the updated replicas table is written to the superblock, the filesystem is put in a transiently unrecoverable state until further journal data is written, because journal recovery expects to find at least one marked journal device whenever the filesystem is not otherwise marked clean (i.e. as on clean unmount). To fix this problem, update the journal replicas gc algorithm to always mark currently active journal replicas entries by writing to the journal. This ensures that only entries for devices that are no longer used for journaling are garbage collected, not just those that don't happen to currently hold journal data. This preserves the journal recovery invariant above and avoids putting the fs into a transiently unrecoverable state. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: bch2_version_compatible()Kent Overstreet5-66/+60
This adds a new helper for checking if an on-disk version is compatible with the running version of bcachefs - prep work for introducing major:minor version numbers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: bch2_version_to_text()Kent Overstreet6-20/+37
Add a new helper for printing out metadata versions in a standard format. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Kill BTREE_INSERT_USE_RESERVEKent Overstreet9-55/+56
Now that we have journal watermarks and alloc watermarks unified, BTREE_INSERT_USE_RESERVE is redundant and can be deleted. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix a null ptr deref in bch2_fs_alloc() error pathKent Overstreet3-1/+6
This fixes a null ptr deref in bch2_free_pending_node_rewrites() when the list head wasn't initialized. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix a format string warningKent Overstreet1-1/+1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Kill JOURNAL_WATERMARKKent Overstreet12-55/+46
This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards specifying watermarks once in the transaction commit path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: BCH_WATERMARK_reclaimKent Overstreet3-2/+6
Add another watermark for journal reclaim - this is needed for the next patches, that unify BCH_WATERMARK with JOURNAL_WATERMARK. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: struct bch_extent_rebalanceKent Overstreet3-2/+26
This adds the extent entry for extents that rebalance needs to do something with. We're adding this ahead of the main rebalance_work patchset, because adding new extent entries can't be done in a forwards-compatible way. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Expand BTREE_NODE_IDKent Overstreet1-3/+14
We now have 20 bits for the btree ID in the on disk format - sufficient for 1 million distinct btrees. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix btree node write error messageKent Overstreet1-1/+1
Error messages should include the error code, when available. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: fsck: Break walk_inode() up into multiple functionsKent Overstreet1-46/+57
Some refactoring, prep work for algorithm improvements related to snapshots. we need to add a bitmap to the list of inodes for "seen this snapshot"; for this bitmap to correctly be available, we'll need to gather the list of inodes first, and later look up the inode for a given snapshot. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix leak in backpointers fsckKent Overstreet1-2/+4
We were forgetting to exit a printbuf - whoops. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: unregister_shrinker() now safe on not-registered shrinkerKent Overstreet2-4/+2
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Add a missing rhashtable_destroy() callKent Overstreet1-0/+1
Fixes https://lore.kernel.org/linux-bcachefs/784c3e6a-75bd-e6ca-535a-43b3e1daf643@kernel.dk/T/#mbf7caf005f960018eba23b58795d06c06c947411 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Improve bch2_bkey_make_mut()Kent Overstreet4-11/+13
bch2_bkey_make_mut() now takes the bkey_s_c by reference and points it at the new, mutable key. This helps in some fsck paths that may have multiple repair operations on the same key. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Reduce stack frame size of bch2_check_alloc_info()Kent Overstreet1-18/+22
Excessive inlining may (on some versions of gcc?) cause excessive stack usage; this turns off some inlining in bch2_check_alloc_info. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: fsck needs BTREE_UPDATE_INTERNAL_SNAPSHOT_NODEKent Overstreet1-10/+12
A few fsck paths weren't using BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE - oops. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Improve error message for overlapping extentsKent Overstreet1-6/+32
We now print out the full previous extent we overlapping with, to aid in debugging and searching through the journal. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix check_pos_snapshot_overwritten()Kent Overstreet1-1/+0
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Rename enum alloc_reserve -> bch_watermarkKent Overstreet15-103/+101
This is prep work for consolidating with JOURNAL_WATERMARK. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: BCH_ERR_fsck -> EINVALKent Overstreet1-1/+1
When we return errors outside of bcachefs, we need to return a standard error code - fix this for BCH_ERR_fsck. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: bch2_trans_mark_pointer() refactoringKent Overstreet1-7/+7
bch2_bucket_backpointer_mod() doesn't need to update the alloc key, we can exit the alloc iter earlier. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix more lockdep splats in debug.cKent Overstreet4-20/+20
Similar to previous fixes, we can't incur page faults while holding btree locks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix lockdep splat in bch2_readdirKent Overstreet1-4/+9
dir_emit() can fault (taking mmap_lock); thus we can't be holding btree locks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Check for ERR_PTR() from filemap_lock_folio()Kent Overstreet1-5/+5
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: New error message helpersKent Overstreet14-171/+171
Add two new helpers for printing error messages with __func__ and bch2_err_str(): - bch_err_fn - bch_err_msg Also kill the old error strings in the recovery path, which were causing us to incorrectly report memory allocation failures - they're not needed anymore. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: fiemap: Fix a lockdep splatKent Overstreet1-1/+4
As with the previous patch, we generally can't hold btree locks while copying to userspace, as that may incur a page fault and require mmap_lock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: seqmutex; fix a lockdep splatKent Overstreet5-23/+96
We can't be holding btree_trans_lock while copying to user space, which might incur a page fault. To fix this, convert it to a seqmutex so we can unlock/relock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Don't call lock_graph_descend() with wait lock heldKent Overstreet1-6/+15
This fixes a deadlock: 01305 WARNING: possible circular locking dependency detected 01305 6.3.0-ktest-gf4de9bee61af #5305 Tainted: G W 01305 ------------------------------------------------------ 01305 cat/14658 is trying to acquire lock: 01305 ffffffc00982f460 (fs_reclaim){+.+.}-{0:0}, at: __kmem_cache_alloc_node+0x48/0x278 01305 01305 but task is already holding lock: 01305 ffffff8011aaf040 (&lock->wait_lock){+.+.}-{2:2}, at: bch2_check_for_deadlock+0x4b8/0xa58 01305 01305 which lock already depends on the new lock. 01305 01305 01305 the existing dependency chain (in reverse order) is: 01305 01305 -> #2 (&lock->wait_lock){+.+.}-{2:2}: 01305 _raw_spin_lock+0x54/0x70 01305 __six_lock_wakeup+0x40/0x1b0 01305 six_unlock_ip+0xe8/0x248 01305 bch2_btree_key_cache_scan+0x720/0x940 01305 shrink_slab.constprop.0+0x284/0x770 01305 shrink_node+0x390/0x828 01305 balance_pgdat+0x390/0x6d0 01305 kswapd+0x2e4/0x718 01305 kthread+0x184/0x1a8 01305 ret_from_fork+0x10/0x20 01305 01305 -> #1 (&c->lock#2){+.+.}-{3:3}: 01305 __mutex_lock+0x104/0x14a0 01305 mutex_lock_nested+0x30/0x40 01305 bch2_btree_key_cache_scan+0x5c/0x940 01305 shrink_slab.constprop.0+0x284/0x770 01305 shrink_node+0x390/0x828 01305 balance_pgdat+0x390/0x6d0 01305 kswapd+0x2e4/0x718 01305 kthread+0x184/0x1a8 01305 ret_from_fork+0x10/0x20 01305 01305 -> #0 (fs_reclaim){+.+.}-{0:0}: 01305 __lock_acquire+0x19d0/0x2930 01305 lock_acquire+0x1dc/0x458 01305 fs_reclaim_acquire+0x9c/0xe0 01305 __kmem_cache_alloc_node+0x48/0x278 01305 __kmalloc_node_track_caller+0x5c/0x278 01305 krealloc+0x94/0x180 01305 bch2_printbuf_make_room.part.0+0xac/0x118 01305 bch2_prt_printf+0x150/0x1e8 01305 bch2_btree_bkey_cached_common_to_text+0x170/0x298 01305 bch2_btree_trans_to_text+0x244/0x348 01305 print_cycle+0x7c/0xb0 01305 break_cycle+0x254/0x528 01305 bch2_check_for_deadlock+0x59c/0xa58 01305 bch2_btree_deadlock_read+0x174/0x200 01305 full_proxy_read+0x94/0xf0 01305 vfs_read+0x15c/0x3a8 01305 ksys_read+0xb8/0x148 01305 __arm64_sys_read+0x48/0x60 01305 invoke_syscall.constprop.0+0x64/0x138 01305 do_el0_svc+0x84/0x138 01305 el0_svc+0x34/0x80 01305 el0t_64_sync_handler+0xb0/0xb8 01305 el0t_64_sync+0x14c/0x150 01305 01305 other info that might help us debug this: 01305 01305 Chain exists of: 01305 fs_reclaim --> &c->lock#2 --> &lock->wait_lock 01305 01305 Possible unsafe locking scenario: 01305 01305 CPU0 CPU1 01305 ---- ---- 01305 lock(&lock->wait_lock); 01305 lock(&c->lock#2); 01305 lock(&lock->wait_lock); 01305 lock(fs_reclaim); 01305 01305 *** DEADLOCK *** Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix bch2_check_discard_freespace_key()Kent Overstreet1-12/+35
We weren't correctly checking the freespace btree - it's an extents btree, which means we need to iterate over each bucket in a freespace extent. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: bch2_trans_unlock_noassert()Kent Overstreet3-1/+11
This fixes a spurious assert in the btree node read path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix bch2_btree_update_start()Kent Overstreet1-1/+1
The calculation for number of nodes to allocate in bch2_btree_update_start() was incorrect - this fixes a BUG_ON() on the small nodes test. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: bch2_extent_ptr_desired_durability()Kent Overstreet3-8/+23
This adds a new helper for getting a pointer's durability irrespective of the device state, and uses it in the the data update path. This fixes a bug where we do a data update but request 0 replicas to be allocated, because the replica being rewritten is on a device marked as failed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: snapshot_to_text() includes snapshot treeKent Overstreet1-2/+3
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix try_decrease_writepoints()Kent Overstreet1-17/+21
- We may need to drop btree locks before taking the writepoint_lock, as is done in other places. - We should be using open_bucket_free_unused(), so that we don't waste space. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Delete weird hacky transaction restart injectionKent Overstreet1-3/+0
since we currently don't have a good fault injection library, bch2_btree_insert_node() was randomly injecting faults based on local_clock(). At the very least this should have been a debug mode only thing, but this is a brittle method so let's just delete it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Write buffer flush needs BTREE_INSERT_NOCHECK_RWKent Overstreet1-0/+1
btree write buffer flush is only invoked from contexts that already hold a write ref, and checking if we're still RW could cause us to fail to completely flush the write buffer when shutting down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: New assertions when marking filesystem cleanKent Overstreet1-0/+5
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: ec: Fix a lost wakeupKent Overstreet1-0/+1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: fix NULL pointer dereference in try_alloc_bucketMikulas Patocka1-1/+2
On Mon, 29 May 2023, Mikulas Patocka wrote: > The oops happens in set_btree_iter_dontneed and it is caused by the fact > that iter->path is NULL. The code in try_alloc_bucket is buggy because it > sets "struct btree_iter iter = { NULL };" and then jumps to the "err" > label that tries to dereference values in "iter". Here I'm sending a patch for it. From: Mikulas Patocka <mpatocka@redhat.com> The function try_alloc_bucket sets the variable "iter" to NULL and then (on various error conditions) jumps to the label "err". On the "err" label, it calls "set_btree_iter_dontneed" that tries to dereference "iter->trans" and "iter->path". So, we get an oops on error condition. This patch fixes the crash by testing that iter.trans and iter.path is non-zero before calling set_btree_iter_dontneed. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix subvol deletion deadlockKent Overstreet1-15/+11
d_prune_aliases() may call bch2_evict_inode(), which needs c->vfs_inodes_list_lock. Fix this by always calling igrab() before putting the inodes onto our disposal list, and then calling d_prune_aliases() with c->vfs_inodes_lock dropped. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: don't spin in rebalance when background target is not usableBrian Foster2-1/+10
If a bcachefs filesystem is configured with a background device (disk group), rebalance will relocate data to this device in the background by checking extent keys for whether they currently reside in the specified target. For keys that do not, rebalance performs a read/write cycle to allow the write path to properly relocate data. If the background target is not usable (read-only, for example), however, the write path doesn't actually move data to another device. Instead, rebalance spins indefinitely reading and rewriting the same data over and over to the same device. If the background target is made available again, the rebalance picks this up, relocates the data, and eventually terminates. To avoid this spinning behavior, update the rebalance background target logic to not only check whether the extent is not in the target, but whether the target is actually usable as well. If not, then don't mark the key for rewrite. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>