Kent Overstreet [Mon, 2 Jun 2025 13:26:20 +0000 (09:26 -0400)]
bcachefs: Fix oops in btree_node_seq_matches()
btree_update_nodes_written() needs to wait on in-flight writes to old
nodes before marking them as freed. But it has no reason to pin those
old nodes in memory, so some trickyness ensues.
The update we're completing deleted references to those nodes from the
btree, so we know if they've been evicted they can't be pulled back in.
We just have to check if the nodes we have pointers to are still those
old nodes, and haven't been reused.
To do that we check the node's "sequence number" (actually a random 64
bit cookie), but that lives in the node's data buffer. 'struct btree'
can't be freed until filesystem shutdown (as they're quite small), but
the data buffers can be freed or swapped around.
Commit 1f88c3567495, which was fixing a kmsan warning, assumed that we
could safely do this locklessly with just a READ_ONCE() - if we've got a
non-null ptr it would be safe to read from.
But that's not true if the data buffer is a vmalloc allocation, so we
need to restore the locking that commit deleted (or alternatively RCU
free those data buffers, but there's no other reason for that).
Fixes: 1f88c3567495 ("bcachefs: Fix a KMSAN splat in btree_update_nodes_written()") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 31 May 2025 04:11:52 +0000 (00:11 -0400)]
bcachefs: Fix dirent_casefold_mismatch repair
Instead of simply recreating a mis-casefolded dirent, use the str_hash
repair code, which will rename it if necessary - the dirent might have
been created again with the correct casefolding.
Factor out out bch2_str_hash_repair key() from
__bch2_str_hash_check_key() for the new path to use, and export
bch2_dirent_create_key() as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs: Fix -Wc23-extensions in bch2_check_dirents()
Clang warns (or errors with CONFIG_WERROR=y):
fs/bcachefs/fsck.c:2325:2: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions]
2325 | int ret = bch2_trans_run(c,
| ^
On clang-17 and older, this is an unconditional error:
fs/bcachefs/fsck.c:2325:2: error: expected expression
2325 | int ret = bch2_trans_run(c,
| ^
Move the declaration of ret to the top of the function to resolve both
ways this issue manifests.
Fixes: c72def523799 ("bcachefs: Run check_dirents second time if required") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 31 May 2025 17:10:43 +0000 (13:10 -0400)]
bcachefs: Make check_key_has_snapshot safer
Snapshot deletion v2 added sentinal values for deleted snapshots, so
"key for deleted snapshot" - i.e. snapshot deletion missed something -
is safe to repair automatically.
But if we find a key for a missing snapshot we have no idea what
happened, and we shouldn't delete it unless we're very sure that
everything else is consistent.
So hook it up to the new bch2_require_recovery_pass(), we'll now only
delete if snapshots and subvolumes have recenlty been checked.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 31 May 2025 17:01:44 +0000 (13:01 -0400)]
bcachefs: BCH_RECOVERY_PASS_NO_RATELIMIT
Add a superblock flag to temporarily disable ratelimiting for a recovery
pass.
This will be used to make check_key_has_snapshot safer: we don't want to
delete a key for a missing snapshot unless we know that the snapshots
and subvolumes btrees are consistent, i.e. check_snapshots and
check_subvols have run recently.
Changing those btrees - creating/deleting a subvolume or snapshot - will
set the "disable ratelimit" flag, i.e. ensuring that those passes run if
check_key_has_snapshot discovers an error.
We're only disabling ratelimiting in the snapshot/subvol delete paths,
we're not so concerned about the create paths.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 31 May 2025 16:48:00 +0000 (12:48 -0400)]
bcachefs: bch2_require_recovery_pass()
Add a helper for requiring that a recovery pass has already run: either
run it directly, if we're still in recovery, or if we're not in recovery
check if it has run recently and schedule it if it hasn't.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 31 May 2025 15:58:11 +0000 (11:58 -0400)]
bcachefs: Repair code for directory i_size
We had a bug due due to an incomplete revert of the patch implementing
directory i_size (summing up the size of the dirents), leading to
completely screwy i_size values that underflow.
Most userspace programs don't seem to care (e.g. du ignores it), but it
turns out this broke sshfs, so needs to be repaired.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 30 May 2025 00:06:01 +0000 (20:06 -0400)]
bcachefs: Journal keys are retained until shutdown, or journal replay finishes
If we don't finish journal replay we need to keep journal keys around
until the filesystem shuts down - otherwise e.g. -o norecovery, various
tools (dump, list) break, and eventually we'll be doing journal replay
in the background.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 29 May 2025 21:32:35 +0000 (17:32 -0400)]
bcachefs: Improve error printing in btree_node_check_topology()
We had a bug report where the errors from btree_node_check_topology()
don't seem to be getting printed; log_fsck_err() does some fancy
ratelimiting-type stuff that we don't want here.
Instead, just use bch2_count_fsck_err(); this is simpler, and modelled
after how we're currently handling bucket ref update errors in
buckets.c.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 28 May 2025 20:25:11 +0000 (16:25 -0400)]
bcachefs: bch2_str_hash_check_key() may now be called without snapshots_seen
We don't track snapshot overwrites outside of fsck, so for this to be
called at runtime outside of fsck we need to create it on demand, when
we have repair to do.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 28 May 2025 02:20:27 +0000 (22:20 -0400)]
bcachefs: Runtime self healing for keys for deleted snapshots
If snapshot deletion incorrectly missing some keys and leaves keys for
deleted snapshots, that causes a bit of a problem for data move - we
can't move an extent for a nonexistent snapshot, because the extent
might have to be fragmented, and maintaining correct visibility in child
snapshots doesn't work if it doesn't have a snapshot.
Previously we'd just skip these keys, but it turns out that causes
copygc to spin.
So we need runtime self healing, i.e. calling check_key_has_snapshot()
from the data move path.
Snapshot deletion v2 included sentinal values for deleted snapshot
nodes, so this is quite safe.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 28 May 2025 05:00:34 +0000 (01:00 -0400)]
bcachefs: sysfs/errors
Make the superblock error counters available in sysfs; the only other
way they can be seen is 'show-super', but we don't write the superblock
every time the error count gets incremented.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 28 May 2025 00:51:00 +0000 (20:51 -0400)]
bcachefs: bch2_check_fix_ptrs() can now repair btree roots
This is straightforward enough: check_fix_ptrs() currently only runs
before we go RW, so updating the btree root pointer in c->btree_roots
suffices - it'll be written out in the first journal write we do.
For that, do_bch2_trans_commit_to_journal_replay() now handles
JSET_ENTRY_btree_root entries.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 28 May 2025 02:06:04 +0000 (22:06 -0400)]
bcachefs: Fix misaligned bucket check in journal space calculations
Fix an assertion pop in the tiering_misaligned test: rounding down to
bucket size at the end of the journal space calculations leaves
cur_entry_sectors == 0, which is incorrect with !cur_entry_err.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 28 May 2025 00:37:50 +0000 (20:37 -0400)]
bcachefs: Fix incorrect multiple dev check in journal write path
It's uncomon to have multiple devices with journalling only on a subset,
but can be specified with the 'data_allowed' option. We need to know if
we're doing data/metadata writes to multiple devices, as that requires
issuing flushes before the journal writes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 26 May 2025 16:21:57 +0000 (12:21 -0400)]
bcachefs: Journal read error message improvements
- Don't print a checksum error when we first read a journal entry: we
print a checksum error later if we'll be using the journal entry.
- Continuing with the theme of of improving error messages and grouping
errors into a single log message per error, print a single 'checksum
error' message per journal entry, and use bch2_journal_ptr_to_text()
to print out where on the device it was.
- Factor out checksum error messages and checking for missing journal
entries into helpers, bch2_journal_read() has gotten obnoxiously big.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
this was missing because the wakeup needs to happen after transaction
commit. Also, add a 'kick' counter, to make sure we don't miss a wakeup
that occured right after we finished checking the rebalance_work btree.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 24 May 2025 18:37:20 +0000 (14:37 -0400)]
bcachefs: Ensure we print output of run_recovery_pass if it errors
Also, don't error out in bucket_ref_update_err(): we don't want to
return -BCH_ERR_cannot_rewind_recovery if it's not an insert, if it's an
overwrite we continue.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 24 May 2025 05:56:10 +0000 (01:56 -0400)]
bcachefs: fix REFLINK_P_MAY_UPDATE_OPTIONS
If we're doing a reflink copy of existing reflinked data, we may only
set REFLINK_P_MAY_UPDATE_OPTIONS if it was set on the reflink pointer
we're copying from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 23 May 2025 18:03:06 +0000 (14:03 -0400)]
bcachefs: Ensure we don't use a blacklisted journal seq
Different versions differ on the size of the blacklist range; it is
theoretically possible that we could end up with blacklisted journal
sequence numbers newer than the newest seq we find in the journal, and
pick a new start seq that's blacklisted.
Explicitly check for this in bch2_fs_journal_start().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 23 May 2025 18:19:25 +0000 (14:19 -0400)]
bcachefs: Small check_fix_ptr fixes
We don't want to change the bucket gen, on gen mismatch: it's possible
to have multiple btree nodes with different gens in the same bucket that
we want to keep, if we have to recover from btree node scan.
It's also not necessary to set g->gen_valid; add a comment to that
effect.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 23 May 2025 22:30:10 +0000 (18:30 -0400)]
bcachefs: Fix allocate -> self healing path
When we go to allocate and find taht a bucket in the freespace btree is
actually allocated, we're supposed to return nonzero to tell the
allocator to skip it.
This fixes an emergency read only due to a bucket/ptr gen mismatch - we
also don't return the correct bucket gen when this happens.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 10 Apr 2024 03:53:57 +0000 (23:53 -0400)]
bcachefs: Path must be locked if trans->locked && should_be_locked
If path->should_be_locked is true, that means user code (of the btree
API) has seen, in this transaction, something guarded by the node this
path has locked, and we have to keep it locked until the end of the
transaction.
Assert that we're not violating this; should_be_locked should also be
cleared only in _very_ special situations.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 22 May 2025 22:12:54 +0000 (18:12 -0400)]
bcachefs: bch2_path_get() reuses paths if upgrade_fails & !should_be_locked
Small additional optimization over the previous patch, bringing us
closer to the original behaviour, except when we need to clone to avoid
a transaction restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 22 May 2025 20:54:31 +0000 (16:54 -0400)]
bcachefs: Fix btree_path_get_locks when not doing trans restart
btree_path_get_locks, on failure, shouldn't unlock if we're not issuing
a transaction restart: we might drop locks we're not supposed to (if
path->should_be_locked is set).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 22 May 2025 19:40:24 +0000 (15:40 -0400)]
bcachefs: Kill bch2_path_put_nokeep()
bch2_path_put_nokeep() was intended for paths we wouldn't need to
preserve for a transaction restart - it always frees them right away
when the ref hits 0.
But since paths are shared, freeing unconditionally is a bug, the path
might have been used elsewhere and have should_be_locked set, i.e. we
need to keep it locked until the end of the transaction.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 22 May 2025 16:50:22 +0000 (12:50 -0400)]
bcachefs: Reduce stack usage in data_update_index_update()
Separate tracepoint message generation and other slowpath code into
non-inline functions, and use bch2_trans_log_str() instead of using a
printbuf for our journal message.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 21 May 2025 07:19:18 +0000 (03:19 -0400)]
bcachefs: Improve trace_trans_restart_upgrade
- Convert to a 'fs_str' tracepoint that just emits as a string: this
lets us build up the tracepoint with a printbuf, using our pretty
printers, and they're much easier to manage
- Include locks_held, before and after
- Include the btree node pointer we failed on (error pointer, null, or
real node)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 19 May 2025 14:31:44 +0000 (10:31 -0400)]
bcachefs: BCH_INODE_has_case_insensitive
Add a flag for tracking whether a directory has case-insensitive
descendents - so that overlayfs can disallow mounting, even though the
filesystem supports case insensitivity.
This is a new on disk format version, with a (cheap) upgrade to ensure
the flag is correctly set on existing inodes.
Create, rename and fssetxattr are all plumbed to ensure the new flag is
set, and we've got new fsck code that hooks into check_inode(0.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 17 May 2025 00:43:18 +0000 (20:43 -0400)]
bcachefs: Coalesce accounting in trans commit
Accounting has gotten quite heavy, and there's lots of redundancy in
accounting updates within a transaction, as we often add/delete multiple
extents that touch the same accountign counters.
This will reduce the amount of data that we journal, and reduce pressure
downstream on the btree write buffer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>