Age | Commit message (Collapse) | Author | Files | Lines |
|
This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards
specifying watermarks once in the transaction commit path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
GFP_NOIO dates from the bcache days, when we operated under the block
layer. Now, GFP_NOFS is more appropriate, so switch all GFP_NOIO uses to
GFP_NOFS.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
A user hit this BUG_ON() - it's unclear how it happened, so replace it
with a fatal error that will cause us to go read only, and print out
more information.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The journal write submission path marks the associated replica
entries for journal data in journal_write_done(), which is just
after journal write bio submission. This creates a small window
where journal entries might have been written out, but the
associated replica is not marked such that recovery does not know
that the associated device contains journal data.
Move the replica marking a bit earlier in the write path such that
recovery is guaranteed to recognize that the device contains journal
data in the event of a crash.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
- __bch2_bkey_drop_ptr() -> bch2_bkey_drop_ptr_noerror(), now available
outside extents.
- Split bch2_bkey_has_device() and bch2_bkey_has_device_c(), const and
non const versions
- bch2_extent_has_ptr() now returns the pointer it found
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Rust bindgen doesn't cope well with anonymous structs and unions. This
patch drops the fancy anonymous structs & unions in bkey_i that let us
use the same helpers for bkey_i and bkey_packed; since bkey_packed is an
internal type that's never exposed to outside code, it's only a minor
inconvenienc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This brings back journal_entries_compact(), but in a more efficient form
- we need to do multiple postprocess steps, so iterate over the
journal entries being written just once to make it more efficient.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This patch
- Adds a mechanism for queuing up journal entries prior to the journal
being started, which will be used for early journal log messages
- Adds bch2_fs_log_msg() and improves bch2_trans_log_msg(), which now
take format strings. bch2_fs_log_msg() can be used before or after
the journal has been started, and will use the appropriate mechanism.
- Deletes the now obsolete bch2_journal_log_msg()
- And adds more log messages to the recovery path - messages for
journal/filesystem started, journal entries being blacklisted, and
journal replay starting/finishing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
If it so happens that we crash while dirty, meaning we don't have the
superblock clean section, and we erroneously mark a journal entry we
wrote as blacklisted, we won't be able to recover.
This patch fixes this by adding a fallback: if we've got no superblock
clean section, and no non-ignored journal entries, we try the most
recent ignored journal entry.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This just cleans up and simplifies the code that decides where to resume
writing in the journal - when the code was originally written we weren't
saving the precise location of every journal write found.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
On startup, we need to ensure the first journal entry written is a flush
write: after a clean shutdown we generally don't read the journal, which
means we might be overwriting whatever was there previously, and there
must always be at least one flush entry in the journal or recovery will
fail.
Found by fstests generic/388.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This tweaks the recovery and journal paths so that we don't error out
before we need to: the list_journal command should work, even if we
wouldn't be able to replay successfully.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
If the last journal write didn't complete sucessfully due to a torn
write, we'll detect it as a checksum error. In that case, we should just
pretend that journal entry was never written.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Print out the journal entries we read and will replay as soon as
possible - if we get an error walidating keys it's helpful to know where
it was in the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
- Centralize format strings in bcachefs.h
- Add bch2_fmt_inum_offset() and related helpers
- Switch error messages for inodes to also print out the offset, in
bytes
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
On journal read, previously we would do full journal entry validation
immediately after reading a journal entry.
However, this would lead to errors for journal entries we weren't
actually going to use, either because they were too old or too new
(newer than the most recent flush).
We've observed write tearing on journal entries newer than the newest
flush - which makes sense, prior to a flush there's no guarantees about
write persistence.
This patch defers full journal entry validation until the end of the
journal read path, when we know which journal entries we'll want to use.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Prep work for the next patch, to defer journal entry validation: we now
track for each replica whether we had a good checksum.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Previously, jset_validate() was formatting the initial part of an error
string for every entry it validating - expensive.
This moves that code to journal_entry_err_msg(), which is now only
called if there's an actual error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Also, do some reorganizing/renaming, convert atomic counters in bch_fs
to persistent counters, and add a few missing counters.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This converts bcachefs to the modern printbuf interface/implementation,
synced with the version to be submitted upstream.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Lately we've been doing a lot of debugging by looking at the journal to
see what was changed, and by what code path. This patch adds a new
journal entry type for recording overwrites, so that we don't have to
search backwards through the journal to see what was being overwritten
in order to work out what the triggers were supposed to be doing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
We shouldn't kick journal reclaim unnecessarily, it's got its own timer
for that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This option was useful when the replicas mechism was new and still being
debugged, but hasn't been used in ages - let's delete it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
In journal_entry_add(), we were repeatedly scanning the journal entries
radix tree to scan for old entries that can be freed, with O(n^2)
behaviour. This patch tweaks things to remember the previous last_seq,
so we don't have to scan for entries to free from the start.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
These showed up when building for mips.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Previously, the journal read path used a linked list for storing the
journal entries we read from disk. But there's been a bug that's been
causing journal_flush_delay to incorrectly be set to 0, leading to far
more journal entries than is normal being written out, which then means
filesystems are no longer able to start due to the O(n^2) behaviour of
inserting into/searching that linked list.
Fix this by switching to a radix tree.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Previously, we were missing accounting for buckets in need_gc_gens and
need_discard states. This matters because buckets in those states need
other btree operations done before they can be used, so they can't be
conuted when checking current number of free buckets against the
allocation watermark.
Also, we weren't directly counting free buckets at all. Now, data type 0
== BCH_DATA_free, and free buckets are counted; this means we can get
rid of the separate (poorly defined) count of unavailable buckets.
This is a new on disk format version, with upgrade and fsck required for
the accounting changes.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This adds a new parameter to .key_invalid() methods for whether the key
is being read or written; the idea being that methods can do more
aggressive checks when a key is newly created and being written, when we
wouldn't want to delete the key because of those checks.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
In the old allocator code, buckets would be discarded just prior to
being used - this made sense in bcache where we were discarding buckets
just after invalidating the cached data they contain, but in a
filesystem where we typically have more free space we want to be
discarding buckets when they become empty.
This patch implements the new behaviour - it checks the need_discard
btree for buckets awaiting discards, and then clears the appropriate
bit in the alloc btree, which moves the buckets to the freespace btree.
Additionally, discards are now enabled by default.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Now that we have new persistent data structures for the allocator, this
patch converts the allocator to use them.
Now, foreground bucket allocation uses the freespace btree to find
buckets to allocate, instead of popping buckets off the freelist.
The background allocator threads are no longer needed and are deleted,
as well as the allocator freelists. Now we only need background tasks
for invalidating buckets containing cached data (when we are low on
empty buckets), and for issuing discards.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Something funny is going on with the new code for restoring the journal
write point, and it's hard to reproduce.
We do want to debug this because resuming writing to the journal in the
wrong spot could be something serious. For now, replace the assertion
with an error message and revert to old behaviour when it happens.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Now we've got strings for metadata versions - this changes
bch2_sb_to_text() and our mount log message to use it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This patch tweaks the journal recovery path so that we start writing
right after where we left off, instead of the next empty bucket. This is
partly prep work for supporting zoned devices, but it's also good to do
in general to avoid the journal completely filling up and getting stuck.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
After emergency shutdown, all journal entries will be written as noflush
entries, meaning they will never be used - but they'll still exist for
debugging tools to examine.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Previous patch just moved responsibility for incrementing the journal
sequence number and initializing the new journal entry from
__journal_entry_close() to journal_entry_open(); this patch makes the
analagous change for journal reservation state, incrementing the index
into array of journal_bufs at open time.
This means that __journal_entry_close() never fails to close an open
journal entry, which is important for the next patch that will change
our emergency shutdown behaviour.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
It makes the code more readable if we work off of sequence numbers,
instead of direct indexes into the array of journal buffers.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This patch changes journal_entry_open() to initialize the new journal
entry, not __journal_entry_close().
This also means that journal_cur_seq() refers to the sequence number of
the last journal entry when we don't have an open journal entry, not the
next one.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This replaces the journal flag JOURNAL_NEED_WRITE with per-journal buf
state - more explicit, and solving a race in the old code that would
lead to entries being opened and written unnecessarily.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This patch changes printbufs dynamically allocate and reallocate a
buffer as needed. Stack usage has become a bit of a problem, and a major
cause of that has been static size string buffers on the stack.
The most involved part of this refactoring is that printbufs must now be
exited with printbuf_exit().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
vstruct_bytes() was returning a u64 - it should be a size_t, the corect
type for the size of anything that fits in memory.
Also replace a 64 bit divide with div_u64().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This patch was originally to work around the journal geting stuck in
nochanges mode - but that was just a hack, we needed to fix the actual
bug. It should be fixed now, so revert it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Long ago it was possible to get a journal reservation and not use it,
but that's no longer allowed, which means journal_write_compact() has
very little work to do, and isn't really worth the code anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This patch improves the superblock .to_text() methods and adds methods
for all types that were missing them. It also improves printbufs by
allowing them to specfiy what units we want to be printing in, and adds
new wrapper methods for unifying our kernel and userspace environments.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
bch_scnmemcpy was for printing length-limited strings that might not
have a terminating null - turns out sprintf & pr_buf can do this with
%.*s.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
When viewing what's in the journal, it's more useful to have the logical
location - journal bucket and offset within that bucket - than just the
offset on that device.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|