git.ipfire.org Git - thirdparty/linux.git/log
11 days ago bcachefs: Fix subvol to missing root repair
Kent Overstreet [Mon, 2 Jun 2025 23:48:27 +0000 (19:48 -0400)] 
bcachefs: Fix subvol to missing root repair

We had a bug where the root inode of a subvolume was erroneously deleted:
bch2_evict_inode() called bch2_inode_rm(), meaning the VFS inode's
i_nlink was somehow set to 0 when it shouldn't have - the inode in the
btree indicated it clearly was not unlinked.

This has been addressed with additional safety checks in
bch2_inode_rm() - pulling in the safety checks we already were doing
when deleting unlinked inodes in recovery - but the really disastrous
bug was in check_subvols(), which on finding a dangling subvol (subvol
with a missing root inode) would delete the subvolume.

I assume this bug dates from early check_directory_structure() code,
which originally handled subvolumes and normal paths - the idea being
that still live contents of the subvolume would get reattached
somewhere.

But that's incorrect, and disastrously so; deleting a subvolume triggers
deleting the snapshot ID it points to, deleting the entire contents.

The correct way to repair is to recreate the root inode if it's missing;
then any contents will get reattached under that subvolume's lost+found.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
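
A minimal sketch of the repair policy described in the commit above, using a
toy in-memory model rather than the real btree code. All names (toy_fs,
toy_subvol, toy_check_subvols) are hypothetical; the only point is that a
dangling subvolume gets its root inode recreated and is never deleted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_INODES 16

struct toy_subvol {
	unsigned root_inum;	/* inode number of the subvolume root */
	bool deleted;		/* true once the subvolume is gone */
};

struct toy_fs {
	bool inode_present[TOY_MAX_INODES];
};

/*
 * Repair dangling subvolumes by recreating the missing root inode.
 * The subvolume itself is never deleted here, so the snapshot it
 * points to (and everything in it) stays alive.
 * Returns the number of root inodes recreated.
 */
size_t toy_check_subvols(struct toy_fs *fs, struct toy_subvol *subvols, size_t nr)
{
	size_t repaired = 0;

	for (size_t i = 0; i < nr; i++) {
		struct toy_subvol *s = &subvols[i];

		if (s->deleted || s->root_inum >= TOY_MAX_INODES)
			continue;
		if (!fs->inode_present[s->root_inum]) {
			fs->inode_present[s->root_inum] = true;	/* recreate root */
			repaired++;
		}
	}
	return repaired;
}

void toy_check_subvols_example(void)
{
	struct toy_fs fs = { .inode_present = { [1] = true } };
	struct toy_subvol sv[] = { { .root_inum = 1 }, { .root_inum = 5 } };

	assert(toy_check_subvols(&fs, sv, 2) == 1);	/* inode 5 recreated */
	assert(!sv[1].deleted);				/* subvolume kept */
}
```

In the real filesystem, contents that lost their parent would then be
reattached under the subvolume's lost+found, as the commit message says; the
toy model does not attempt that step.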
11 days ago bcachefs: Run may_delete_deleted_inode() checks in bch2_inode_rm()
Kent Overstreet [Mon, 2 Jun 2025 21:43:36 +0000 (17:43 -0400)] 
bcachefs: Run may_delete_deleted_inode() checks in bch2_inode_rm()

We had a bug where bch2_evict_inode() incorrectly called bch2_inode_rm()
- the journal clearly showed the inode was not unlinked.

We've got checks that we use in recovery when cleaning up deleted
inodes; lift them to bch2_inode_rm() as well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 days ago bcachefs: delete dead code from may_delete_deleted_inode()
Kent Overstreet [Mon, 2 Jun 2025 22:26:44 +0000 (18:26 -0400)] 
bcachefs: delete dead code from may_delete_deleted_inode()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 days ago bcachefs: Add flags to subvolume_to_text()
Kent Overstreet [Mon, 2 Jun 2025 21:23:49 +0000 (17:23 -0400)] 
bcachefs: Add flags to subvolume_to_text()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 days ago bcachefs: Fix oops in btree_node_seq_matches()
Kent Overstreet [Mon, 2 Jun 2025 13:26:20 +0000 (09:26 -0400)] 
bcachefs: Fix oops in btree_node_seq_matches()

btree_update_nodes_written() needs to wait on in-flight writes to old
nodes before marking them as freed. But it has no reason to pin those
old nodes in memory, so some trickiness ensues.

The update we're completing deleted references to those nodes from the
btree, so we know if they've been evicted they can't be pulled back in.
We just have to check if the nodes we have pointers to are still those
old nodes, and haven't been reused.

To do that we check the node's "sequence number" (actually a random 64
bit cookie), but that lives in the node's data buffer. 'struct btree'
can't be freed until filesystem shutdown (as they're quite small), but
the data buffers can be freed or swapped around.

Commit 1f88c3567495, which was fixing a kmsan warning, assumed that we
could safely do this locklessly with just a READ_ONCE() - if we've got a
non-null ptr it would be safe to read from.

But that's not true if the data buffer is a vmalloc allocation, so we
need to restore the locking that commit deleted (or alternatively RCU
free those data buffers, but there's no other reason for that).

Fixes: 1f88c3567495 ("bcachefs: Fix a KMSAN splat in btree_update_nodes_written()")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 days ago bcachefs: Fix dirent_casefold_mismatch repair
Kent Overstreet [Sat, 31 May 2025 04:11:52 +0000 (00:11 -0400)] 
bcachefs: Fix dirent_casefold_mismatch repair

Instead of simply recreating a mis-casefolded dirent, use the str_hash
repair code, which will rename it if necessary - the dirent might have
been created again with the correct casefolding.

Factor out bch2_str_hash_repair_key() from
__bch2_str_hash_check_key() for the new path to use, and export
bch2_dirent_create_key() as well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 days ago bcachefs: Fix bch2_fsck_rename_dirent() for casefold
Kent Overstreet [Sat, 31 May 2025 21:00:00 +0000 (17:00 -0400)] 
bcachefs: Fix bch2_fsck_rename_dirent() for casefold

bch2_fsck_rename_dirent() was creating bch_dirent keys open-coded - but
we need to use the appropriate helper if the directory is casefolded.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 days ago bcachefs: Redo bch2_dirent_init_name()
Kent Overstreet [Sun, 1 Jun 2025 22:35:18 +0000 (18:35 -0400)] 
bcachefs: Redo bch2_dirent_init_name()

Redo (and simplify somewhat) how casefolded and non casefolded dirents
are initialized, and export this to be used by fsck_rename_dirent().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 days ago bcachefs: Fix -Wc23-extensions in bch2_check_dirents()
Nathan Chancellor [Wed, 4 Jun 2025 19:38:27 +0000 (12:38 -0700)] 
bcachefs: Fix -Wc23-extensions in bch2_check_dirents()

Clang warns (or errors with CONFIG_WERROR=y):

  fs/bcachefs/fsck.c:2325:2: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions]
   2325 |         int ret = bch2_trans_run(c,
        |         ^

On clang-17 and older, this is an unconditional error:

  fs/bcachefs/fsck.c:2325:2: error: expected expression
   2325 |         int ret = bch2_trans_run(c,
        |         ^

Move the declaration of ret to the top of the function to resolve both
ways this issue manifests.

Fixes: c72def523799 ("bcachefs: Run check_dirents second time if required")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
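
The shape of the -Wc23-extensions fix above can be sketched in standalone C.
This is an illustration only, not the code from fs/bcachefs/fsck.c: the
function names and the "again" label are hypothetical. It shows a declaration
hoisted to the top of the function so that the statement after the label is a
plain assignment.

```c
#include <assert.h>

/* Hypothetical stand-in for a pass that may ask to be run again. */
static int toy_run_pass(int *reruns_left)
{
	if (*reruns_left > 0) {
		(*reruns_left)--;
		return 1;	/* "run me a second time" */
	}
	return 0;
}

/* Returns how many times the pass ran. */
int toy_check(int reruns_wanted)
{
	int ret;		/* declared up front, not after the label */
	int passes = 0;
again:
	/*
	 * Writing "int ret = toy_run_pass(...)" here would put a declaration
	 * directly after a label, which is only valid C from C23 onwards.
	 */
	ret = toy_run_pass(&reruns_wanted);
	passes++;
	if (ret > 0)
		goto again;
	return passes;
}

void toy_check_example(void)
{
	assert(toy_check(0) == 1);
	assert(toy_check(1) == 2);
}
```

Hoisting the declaration keeps the code valid for compilers that accept the
C23 form only as an extension as well as for those that reject it outright.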
2 weeks ago bcachefs: Run check_dirents second time if required
Kent Overstreet [Sat, 31 May 2025 23:12:25 +0000 (19:12 -0400)] 
bcachefs: Run check_dirents second time if required

If we move a key backwards, we'll need a second pass to run the rest of
the fsck checks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Run snapshot deletion out of system_long_wq
Kent Overstreet [Sun, 1 Jun 2025 22:24:18 +0000 (18:24 -0400)] 
bcachefs: Run snapshot deletion out of system_long_wq

We don't want this running out of the same workqueue as writes, and
blocking them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Make check_key_has_snapshot safer
Kent Overstreet [Sat, 31 May 2025 17:10:43 +0000 (13:10 -0400)] 
bcachefs: Make check_key_has_snapshot safer

Snapshot deletion v2 added sentinel values for deleted snapshots, so
"key for deleted snapshot" - i.e. snapshot deletion missed something -
is safe to repair automatically.

But if we find a key for a missing snapshot we have no idea what
happened, and we shouldn't delete it unless we're very sure that
everything else is consistent.

So hook it up to the new bch2_require_recovery_pass(); we'll now only
delete if snapshots and subvolumes have recently been checked.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: BCH_RECOVERY_PASS_NO_RATELIMIT
Kent Overstreet [Sat, 31 May 2025 17:01:44 +0000 (13:01 -0400)] 
bcachefs: BCH_RECOVERY_PASS_NO_RATELIMIT

Add a superblock flag to temporarily disable ratelimiting for a recovery
pass.

This will be used to make check_key_has_snapshot safer: we don't want to
delete a key for a missing snapshot unless we know that the snapshots
and subvolumes btrees are consistent, i.e. check_snapshots and
check_subvols have run recently.

Changing those btrees - creating/deleting a subvolume or snapshot - will
set the "disable ratelimit" flag, i.e. ensuring that those passes run if
check_key_has_snapshot discovers an error.

We're only disabling ratelimiting in the snapshot/subvol delete paths,
we're not so concerned about the create paths.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: bch2_require_recovery_pass()
Kent Overstreet [Sat, 31 May 2025 16:48:00 +0000 (12:48 -0400)] 
bcachefs: bch2_require_recovery_pass()

Add a helper for requiring that a recovery pass has already run: either
run it directly, if we're still in recovery, or if we're not in recovery
check if it has run recently and schedule it if it hasn't.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
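
The decision described in the bch2_require_recovery_pass() commit above can be
sketched as a small state machine. This is a toy model under stated
assumptions: the struct, the enum and the notion of "recently" as a fixed time
window are all hypothetical, and the real helper's signature is not shown in
the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_action { TOY_RAN_NOW, TOY_ALREADY_RECENT, TOY_SCHEDULED };

struct toy_fs {
	bool in_recovery;	/* still inside recovery? */
	uint64_t now;		/* current time, arbitrary units */
	uint64_t last_run;	/* when the pass last ran, 0 == never */
	uint64_t recent_window;	/* assumed definition of "recently" */
	bool scheduled;		/* pass queued to run later */
};

enum toy_action toy_require_pass(struct toy_fs *c)
{
	if (c->in_recovery) {
		c->last_run = c->now;	/* run it directly */
		return TOY_RAN_NOW;
	}
	if (c->last_run && c->now - c->last_run <= c->recent_window)
		return TOY_ALREADY_RECENT;
	c->scheduled = true;		/* not recent enough: schedule it */
	return TOY_SCHEDULED;
}

void toy_require_pass_example(void)
{
	struct toy_fs c = { .in_recovery = true, .now = 100, .recent_window = 60 };

	assert(toy_require_pass(&c) == TOY_RAN_NOW);

	c.in_recovery = false;
	c.now = 150;
	assert(toy_require_pass(&c) == TOY_ALREADY_RECENT);

	c.now = 500;
	assert(toy_require_pass(&c) == TOY_SCHEDULED);
	assert(c.scheduled);
}
```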
2 weeks ago bcachefs: bch_err_throw()
Kent Overstreet [Wed, 28 May 2025 15:57:50 +0000 (11:57 -0400)] 
bcachefs: bch_err_throw()

Add a tracepoint for any time we return an error and unwind.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Repair code for directory i_size
Kent Overstreet [Sat, 31 May 2025 15:58:11 +0000 (11:58 -0400)] 
bcachefs: Repair code for directory i_size

We had a bug due to an incomplete revert of the patch implementing
directory i_size (summing up the size of the dirents), leading to
completely screwy i_size values that underflow.

Most userspace programs don't seem to care (e.g. du ignores it), but it
turns out this broke sshfs, so needs to be repaired.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Kill un-reverted directory i_size code
Kent Overstreet [Sat, 31 May 2025 22:32:37 +0000 (18:32 -0400)] 
bcachefs: Kill un-reverted directory i_size code

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Delete redundant fsck_err()
Kent Overstreet [Sat, 31 May 2025 22:47:49 +0000 (18:47 -0400)] 
bcachefs: Delete redundant fsck_err()

'inode_has_wrong_backpointer'; we have more specific errors for every
case afterwards.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Convert BUG() to error
Kent Overstreet [Sat, 31 May 2025 23:20:33 +0000 (19:20 -0400)] 
bcachefs: Convert BUG() to error

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Add better logging to fsck_rename_dirent()
Kent Overstreet [Fri, 30 May 2025 23:09:11 +0000 (19:09 -0400)] 
bcachefs: Add better logging to fsck_rename_dirent()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Replace rcu_read_lock() with guards
Kent Overstreet [Sat, 24 May 2025 20:33:39 +0000 (16:33 -0400)] 
bcachefs: Replace rcu_read_lock() with guards

The new guard(), scoped_guard() allow for more natural code.

Some of the uses with creative flow control have been left.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
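
For readers unfamiliar with guards, here is a userspace sketch of the idea
behind the commit above, built on the GCC/Clang cleanup attribute. It is not
the kernel's guard() implementation: toy_read_lock(), toy_read_unlock() and
toy_guard() are hypothetical, and a nesting counter stands in for the RCU
read-side critical section.

```c
#include <assert.h>

static int read_depth;	/* stand-in for read-side nesting depth */

static void toy_read_lock(void)   { read_depth++; }
static void toy_read_unlock(void) { read_depth--; }

/* Called automatically when the guard variable goes out of scope. */
static void toy_guard_exit(int *unused)
{
	(void)unused;
	toy_read_unlock();
}

/* Take the lock now, release it on every exit from the enclosing scope. */
#define toy_guard() \
	int toy_guard_var __attribute__((cleanup(toy_guard_exit), unused)) = \
		(toy_read_lock(), 0)

int toy_lookup(const int *table, int n, int key)
{
	toy_guard();

	for (int i = 0; i < n; i++)
		if (table[i] == key)
			return i;	/* early return: unlock still runs */
	return -1;
}

int toy_read_depth(void)
{
	return read_depth;
}

void toy_guard_example(void)
{
	int table[] = { 3, 5, 7 };

	assert(toy_lookup(table, 3, 5) == 1);
	assert(toy_read_depth() == 0);	/* released despite early return */
}
```

The early return needs no explicit unlock call, which is the "more natural
code" the commit message refers to.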
2 weeks ago bcachefs: CLASS(btree_trans)
Kent Overstreet [Sun, 25 May 2025 05:41:17 +0000 (01:41 -0400)] 
bcachefs: CLASS(btree_trans)

Allow btree_trans to be used with CLASS().

Automatic cleanup, instead of manually calling bch2_trans_put().

We don't use DEFINE_CLASS because using a static inline for the
constructor breaks bch2_trans_get()'s use of __func__, so we have to
open code it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
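
The __func__ problem mentioned in the commit above is easy to reproduce in
plain C. The sketch below uses hypothetical names (toy_trans and friends), not
the bcachefs ones: inside a static inline constructor __func__ names the
constructor itself, while a macro expands in the caller and so records the
caller's name.

```c
#include <assert.h>
#include <string.h>

struct toy_trans {
	const char *fn;		/* function that created the transaction */
};

/* Constructor as an inline function: __func__ is always this function. */
static inline struct toy_trans toy_trans_get_inline(void)
{
	return (struct toy_trans) { .fn = __func__ };
}

/* Constructor as a macro: __func__ is evaluated in the calling function. */
#define toy_trans_get() ((struct toy_trans) { .fn = __func__ })

const char *caller_via_inline(void)
{
	return toy_trans_get_inline().fn;
}

const char *caller_via_macro(void)
{
	return toy_trans_get().fn;
}

void toy_func_name_example(void)
{
	assert(strcmp(caller_via_inline(), "toy_trans_get_inline") == 0);
	assert(strcmp(caller_via_macro(), "caller_via_macro") == 0);
}
```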
2 weeks ago bcachefs: CLASS(darray)
Kent Overstreet [Thu, 29 May 2025 23:13:42 +0000 (19:13 -0400)] 
bcachefs: CLASS(darray)

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: CLASS(printbuf)
Kent Overstreet [Sun, 25 May 2025 06:59:35 +0000 (02:59 -0400)] 
bcachefs: CLASS(printbuf)

Add a DEFINE_CLASS() for printbufs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: sysfs trigger_journal_commit
Kent Overstreet [Fri, 30 May 2025 00:16:58 +0000 (20:16 -0400)] 
bcachefs: sysfs trigger_journal_commit

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: sysfs trigger_emergency_read_only
Kent Overstreet [Thu, 29 May 2025 22:02:21 +0000 (18:02 -0400)] 
bcachefs: sysfs trigger_emergency_read_only

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: darray_find(), darray_find_p()
Kent Overstreet [Thu, 29 May 2025 20:56:50 +0000 (16:56 -0400)] 
bcachefs: darray_find(), darray_find_p()

New helpers to avoid open coded loops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
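
To illustrate what a "find" helper replaces, here is a toy macro in the spirit
of the commit above. It is not the darray API: struct int_array and
toy_array_find() are hypothetical, and the macro relies on the GCC/Clang
statement-expression and __typeof__ extensions.

```c
#include <assert.h>
#include <stddef.h>

struct int_array {
	int *data;
	size_t nr;
};

/* Pointer to the first element equal to _item, or NULL if none matches. */
#define toy_array_find(_arr, _item) \
({ \
	__typeof__((_arr)->data) _found = NULL; \
	for (size_t _i = 0; _i < (_arr)->nr; _i++) \
		if ((_arr)->data[_i] == (_item)) { \
			_found = &(_arr)->data[_i]; \
			break; \
		} \
	_found; \
})

/* Caller code shrinks to a single expression instead of an open-coded loop. */
int toy_index_of(struct int_array *a, int v)
{
	int *p = toy_array_find(a, v);

	return p ? (int)(p - a->data) : -1;
}

void toy_array_find_example(void)
{
	int d[] = { 4, 8, 15 };
	struct int_array a = { d, 3 };

	assert(toy_index_of(&a, 15) == 2);
	assert(toy_index_of(&a, 16) == -1);
}
```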
2 weeks ago bcachefs: Journal keys are retained until shutdown, or journal replay finishes
Kent Overstreet [Fri, 30 May 2025 00:06:01 +0000 (20:06 -0400)] 
bcachefs: Journal keys are retained until shutdown, or journal replay finishes

If we don't finish journal replay we need to keep journal keys around
until the filesystem shuts down - otherwise e.g. -o norecovery, various
tools (dump, list) break, and eventually we'll be doing journal replay
in the background.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Improve error printing in btree_node_check_topology()
Kent Overstreet [Thu, 29 May 2025 21:32:35 +0000 (17:32 -0400)] 
bcachefs: Improve error printing in btree_node_check_topology()

We had a bug report where the errors from btree_node_check_topology()
don't seem to be getting printed; log_fsck_err() does some fancy
ratelimiting-type stuff that we don't want here.

Instead, just use bch2_count_fsck_err(); this is simpler, and modelled
after how we're currently handling bucket ref update errors in
buckets.c.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: bch2_readdir() now calls str_hash_check_key()
Kent Overstreet [Wed, 28 May 2025 20:34:42 +0000 (16:34 -0400)] 
bcachefs: bch2_readdir() now calls str_hash_check_key()

More self healing code: readdir will now notice if there are dirents
hashed incorrectly, and it'll repair them if errors=fix_safe.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: bch2_str_hash_check_key() may now be called without snapshots_seen
Kent Overstreet [Wed, 28 May 2025 20:25:11 +0000 (16:25 -0400)] 
bcachefs: bch2_str_hash_check_key() may now be called without snapshots_seen

We don't track snapshot overwrites outside of fsck, so for this to be
called at runtime outside of fsck we need to create it on demand, when
we have repair to do.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: __bch2_insert_snapshot_whiteouts() refactoring
Kent Overstreet [Wed, 28 May 2025 19:20:20 +0000 (15:20 -0400)] 
bcachefs: __bch2_insert_snapshot_whiteouts() refactoring

Now uses bch2_get_snapshot_overwrites(), and much shorter.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: bch2_get_snapshot_overwrites()
Kent Overstreet [Wed, 28 May 2025 19:08:19 +0000 (15:08 -0400)] 
bcachefs: bch2_get_snapshot_overwrites()

New helper for getting a list of snapshot IDs that have overwritten a
given key.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: bch2_dev_journal_bucket_delete()
Kent Overstreet [Wed, 28 May 2025 18:26:33 +0000 (14:26 -0400)] 
bcachefs: bch2_dev_journal_bucket_delete()

Recover from "journal and btree in same bucket".

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Runtime self healing for keys for deleted snapshots
Kent Overstreet [Wed, 28 May 2025 02:20:27 +0000 (22:20 -0400)] 
bcachefs: Runtime self healing for keys for deleted snapshots

If snapshot deletion incorrectly misses some keys and leaves keys for
deleted snapshots, that causes a bit of a problem for data move - we
can't move an extent for a nonexistent snapshot, because the extent
might have to be fragmented, and maintaining correct visibility in child
snapshots doesn't work if it doesn't have a snapshot.

Previously we'd just skip these keys, but it turns out that causes
copygc to spin.

So we need runtime self healing, i.e. calling check_key_has_snapshot()
from the data move path.

Snapshot deletion v2 included sentinel values for deleted snapshot
nodes, so this is quite safe.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Don't unlock trans before data_update_init()
Kent Overstreet [Wed, 28 May 2025 20:06:07 +0000 (16:06 -0400)] 
bcachefs: Don't unlock trans before data_update_init()

data_update_init() does need to do btree operations, so delay doing the
unlock-before-io.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Use bch2_err_matches() for BCH_ERR_fsck_(fix|ignore)
Kent Overstreet [Wed, 28 May 2025 15:31:51 +0000 (11:31 -0400)] 
bcachefs: Use bch2_err_matches() for BCH_ERR_fsck_(fix|ignore)

We'll be adding subtypes of these errors, and new error code tracing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Mark bch_errcode helpers __attribute__((const))
Kent Overstreet [Wed, 28 May 2025 15:27:59 +0000 (11:27 -0400)] 
bcachefs: Mark bch_errcode helpers __attribute__((const))

These don't access global memory or dereference pointer arguments - this
enables CSE optimizations.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
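
The two error-code commits above fit together, and a toy version shows how.
The sketch below is not the bcachefs error-code table: the enum values and
function names are hypothetical. It shows an error-class match that walks a
parent chain (so subtypes of a class still match it) and marks the helpers
__attribute__((const)), which is legitimate because they depend only on their
arguments.

```c
#include <assert.h>

enum toy_err {
	TOY_ERR_none = 0,
	TOY_ERR_fsck,		/* error class */
	TOY_ERR_fsck_fix,	/* subtype of TOY_ERR_fsck */
	TOY_ERR_fsck_ignore,	/* subtype of TOY_ERR_fsck */
	TOY_ERR_io,		/* unrelated error */
};

/* Pure function of its argument: no globals read, no pointers followed. */
__attribute__((const))
static enum toy_err toy_err_parent(enum toy_err e)
{
	switch (e) {
	case TOY_ERR_fsck_fix:
	case TOY_ERR_fsck_ignore:
		return TOY_ERR_fsck;
	default:
		return TOY_ERR_none;
	}
}

/* True if err is cls itself or one of its (transitive) subtypes. */
__attribute__((const))
int toy_err_matches(enum toy_err err, enum toy_err cls)
{
	while (err != TOY_ERR_none) {
		if (err == cls)
			return 1;
		err = toy_err_parent(err);
	}
	return 0;
}

void toy_err_example(void)
{
	assert(toy_err_matches(TOY_ERR_fsck_fix, TOY_ERR_fsck));
	assert(!toy_err_matches(TOY_ERR_io, TOY_ERR_fsck));
}
```

Comparing with a match helper instead of "==" is what keeps callers working
once subtypes are added; the const attribute lets the compiler merge repeated
calls with the same arguments.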
2 weeks ago bcachefs: Add missing printbuf_reset() in bch2_check_dirent_inode_dirent()
Kent Overstreet [Thu, 29 May 2025 19:02:37 +0000 (15:02 -0400)] 
bcachefs: Add missing printbuf_reset() in bch2_check_dirent_inode_dirent()

We were accidentally including the contents from the previous
fsck_err().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: sysfs/errors
Kent Overstreet [Wed, 28 May 2025 05:00:34 +0000 (01:00 -0400)] 
bcachefs: sysfs/errors

Make the superblock error counters available in sysfs; the only other
way they can be seen is 'show-super', but we don't write the superblock
every time the error count gets incremented.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: bch2_check_fix_ptrs() can now repair btree roots
Kent Overstreet [Wed, 28 May 2025 00:51:00 +0000 (20:51 -0400)] 
bcachefs: bch2_check_fix_ptrs() can now repair btree roots

This is straightforward enough: check_fix_ptrs() currently only runs
before we go RW, so updating the btree root pointer in c->btree_roots
suffices - it'll be written out in the first journal write we do.

For that, do_bch2_trans_commit_to_journal_replay() now handles
JSET_ENTRY_btree_root entries.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Include b->ob.nr in cached_btree_node_to_text()
Kent Overstreet [Tue, 27 May 2025 18:39:43 +0000 (14:39 -0400)] 
bcachefs: Include b->ob.nr in cached_btree_node_to_text()

We have a bug report that looks like we might be leaking open buckets -
let's check if they got left attached to the cached btree node.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Move devs_sorted to alloc_request
Kent Overstreet [Mon, 26 May 2025 21:03:48 +0000 (17:03 -0400)] 
bcachefs: Move devs_sorted to alloc_request

More stack usage work.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: reduce stack usage in alloc_sectors_start()
Kent Overstreet [Mon, 26 May 2025 21:15:11 +0000 (17:15 -0400)] 
bcachefs: reduce stack usage in alloc_sectors_start()

With typical config options, variables in different inline functions
aren't sharing stack space - and these are slowpaths.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: bch2_alloc_v4_to_text()
Kent Overstreet [Mon, 26 May 2025 18:24:19 +0000 (14:24 -0400)] 
bcachefs: bch2_alloc_v4_to_text()

Specialize the .to_text() for alloc_v4, to avoid the temporary on the
stack for conversion from old versions.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Tweak bch2_data_update_init() for stack usage
Kent Overstreet [Mon, 26 May 2025 17:26:10 +0000 (13:26 -0400)] 
bcachefs: Tweak bch2_data_update_init() for stack usage

- Separate out a slowpath for bkey_nocow_lock()
- Don't call bch2_bkey_ptrs_c() or loop over pointers more than
  necessary

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: kill replicas_sectors arg to __trigger_extent()
Kent Overstreet [Mon, 26 May 2025 18:15:28 +0000 (14:15 -0400)] 
bcachefs: kill replicas_sectors arg to __trigger_extent()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Don't stack allocate bch_writepage_state
Kent Overstreet [Mon, 26 May 2025 20:16:17 +0000 (16:16 -0400)] 
bcachefs: Don't stack allocate bch_writepage_state

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: factor out break_cycle_fail()
Kent Overstreet [Mon, 26 May 2025 20:29:56 +0000 (16:29 -0400)] 
bcachefs: factor out break_cycle_fail()

More stack usage work.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: btree_node_missing_err()
Kent Overstreet [Mon, 26 May 2025 16:48:19 +0000 (12:48 -0400)] 
bcachefs: btree_node_missing_err()

Factor out an error path for a small stack usage improvement.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Kill bkey_buf in btree_path_down()
Kent Overstreet [Sun, 25 May 2025 21:56:45 +0000 (17:56 -0400)] 
bcachefs: Kill bkey_buf in btree_path_down()

Allocate some (smaller) temporary storage in btree_trans for this -
btree_path_down() is in our max-stack call stack.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Add missing error logging in delete_dead_inodes()
Kent Overstreet [Wed, 28 May 2025 00:37:21 +0000 (20:37 -0400)] 
bcachefs: Add missing error logging in delete_dead_inodes()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Fix misaligned bucket check in journal space calculations
Kent Overstreet [Wed, 28 May 2025 02:06:04 +0000 (22:06 -0400)] 
bcachefs: Fix misaligned bucket check in journal space calculations

Fix an assertion pop in the tiering_misaligned test: rounding down to
bucket size at the end of the journal space calculations leaves
cur_entry_sectors == 0, which is incorrect with !cur_entry_err.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Fix incorrect multiple dev check in journal write path
Kent Overstreet [Wed, 28 May 2025 00:37:50 +0000 (20:37 -0400)] 
bcachefs: Fix incorrect multiple dev check in journal write path

It's uncommon to have multiple devices with journalling only on a subset,
but can be specified with the 'data_allowed' option. We need to know if
we're doing data/metadata writes to multiple devices, as that requires
issuing flushes before the journal writes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Catch data_update_done events in trace_io_move_start_fail
Kent Overstreet [Wed, 28 May 2025 01:45:56 +0000 (21:45 -0400)] 
bcachefs: Catch data_update_done events in trace_io_move_start_fail

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: io_move_evacuate_bucket tracepoint, counter
Kent Overstreet [Wed, 28 May 2025 01:54:22 +0000 (21:54 -0400)] 
bcachefs: io_move_evacuate_bucket tracepoint, counter

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: trace_io_move_pred
Kent Overstreet [Tue, 27 May 2025 03:00:21 +0000 (23:00 -0400)] 
bcachefs: trace_io_move_pred

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Fix infinite loop in journal_entry_btree_keys_to_text()
Kent Overstreet [Sun, 25 May 2025 21:04:11 +0000 (17:04 -0400)] 
bcachefs: Fix infinite loop in journal_entry_btree_keys_to_text()

Fix an infinite loop when bkey_i->k.u64s is 0.

This only happens in userspace, where 'bcachefs list_journal' can print
the entire contents of the journal, and non-dirty entries aren't
validated.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
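
Why a zero u64s field loops forever, as described in the commit above, is
easiest to see on a toy record format. The layout below is an assumption made
for illustration (the low byte of each key's first u64 holds the key's length
in u64s); it is not the real bkey format, and the commit message does not say
how the actual fix terminates the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count the keys in a buffer of variable-length records. The iterator
 * advances by each key's own length, so a length of 0 would never move
 * forward; treat it as the end of valid data instead.
 */
int toy_count_keys(const uint64_t *buf, size_t nr_u64s)
{
	int n = 0;
	size_t pos = 0;

	while (pos < nr_u64s) {
		size_t u64s = buf[pos] & 0xff;

		if (!u64s)			/* would loop forever */
			break;
		if (u64s > nr_u64s - pos)	/* key runs past the buffer */
			break;
		n++;
		pos += u64s;
	}
	return n;
}

void toy_count_keys_example(void)
{
	/* a 2-u64 key, a 1-u64 key, then a zero-length (invalid) key */
	uint64_t buf[] = { 2, 0, 1, 0, 3 };

	assert(toy_count_keys(buf, 5) == 2);
}
```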
2 weeks ago bcachefs: Journal read error message improvements
Kent Overstreet [Mon, 26 May 2025 16:21:57 +0000 (12:21 -0400)] 
bcachefs: Journal read error message improvements

- Don't print a checksum error when we first read a journal entry: we
  print a checksum error later if we'll be using the journal entry.

- Continuing with the theme of improving error messages and grouping
  errors into a single log message per error, print a single 'checksum
  error' message per journal entry, and use bch2_journal_ptr_to_text()
  to print out where on the device it was.

- Factor out checksum error messages and checking for missing journal
  entries into helpers, bch2_journal_read() has gotten obnoxiously big.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Don't rewind to run a recovery pass we already ran
Kent Overstreet [Mon, 26 May 2025 15:12:53 +0000 (11:12 -0400)] 
bcachefs: Don't rewind to run a recovery pass we already ran

Fix a small regression from the "run recovery passes" rewrite, which
enabled async recovery passes.

This fixes getting stuck in a loop in recovery.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Move unicode message to after the startup message
Kent Overstreet [Sun, 25 May 2025 15:51:33 +0000 (11:51 -0400)] 
bcachefs: Move unicode message to after the startup message

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Fix missing commit in check_dirents
Kent Overstreet [Sat, 24 May 2025 23:53:03 +0000 (19:53 -0400)] 
bcachefs: Fix missing commit in check_dirents

Other repair code seems to do its own commits, but
check_key_has_snapshot() does not.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Fix lost rebalance wakeups
Kent Overstreet [Sat, 24 May 2025 19:29:50 +0000 (15:29 -0400)] 
bcachefs: Fix lost rebalance wakeups

Fix a missing wakeup in

'bcachefs set-file-option' -> xattr option update -> inode_write

this was missing because the wakeup needs to happen after transaction
commit. Also, add a 'kick' counter, to make sure we don't miss a wakeup
that occurred right after we finished checking the rebalance_work btree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: bch2_kthread_io_clock_wait_once()
Kent Overstreet [Sat, 24 May 2025 19:24:00 +0000 (15:24 -0400)] 
bcachefs: bch2_kthread_io_clock_wait_once()

Add a version of bch2_kthread_io_clock_wait() that only schedules once -
behaving more like schedule_timeout().

This will be used for fixing rebalance wakeups.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 weeks ago bcachefs: Ensure we print output of run_recovery_pass if it errors
Kent Overstreet [Sat, 24 May 2025 18:37:20 +0000 (14:37 -0400)] 
bcachefs: Ensure we print output of run_recovery_pass if it errors

Also, don't error out in bucket_ref_update_err(): we don't want to
return -BCH_ERR_cannot_rewind_recovery if it's not an insert; if it's an
overwrite we continue.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks ago bcachefs: Fix missing BTREE_UPDATE_internal_snapshot_node
Kent Overstreet [Sat, 24 May 2025 18:20:58 +0000 (14:20 -0400)] 
bcachefs: Fix missing BTREE_UPDATE_internal_snapshot_node

Repair code will do updates on older snapshot versions, so needs the
correct annotation.

Reported-by: syzbot+42581416dba62b364750@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks ago bcachefs: fix REFLINK_P_MAY_UPDATE_OPTIONS
Kent Overstreet [Sat, 24 May 2025 05:56:10 +0000 (01:56 -0400)] 
bcachefs: fix REFLINK_P_MAY_UPDATE_OPTIONS

If we're doing a reflink copy of existing reflinked data, we may only
set REFLINK_P_MAY_UPDATE_OPTIONS if it was set on the reflink pointer
we're copying from.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks ago bcachefs: Don't mount bs > ps without TRANSPARENT_HUGEPAGE
Kent Overstreet [Sat, 24 May 2025 01:59:12 +0000 (21:59 -0400)] 
bcachefs: Don't mount bs > ps without TRANSPARENT_HUGEPAGE

Large folios aren't supported without TRANSPARENT_HUGEPAGE

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks ago bcachefs: Fix btree_iter_next_node() for new locking asserts
Kent Overstreet [Sat, 24 May 2025 00:11:43 +0000 (20:11 -0400)] 
bcachefs: Fix btree_iter_next_node() for new locking asserts

We can't unlock a should_be_locked path unless we're in a transaction
restart.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks ago bcachefs: Ensure we don't use a blacklisted journal seq
Kent Overstreet [Fri, 23 May 2025 18:03:06 +0000 (14:03 -0400)] 
bcachefs: Ensure we don't use a blacklisted journal seq

Different versions differ on the size of the blacklist range; it is
theoretically possible that we could end up with blacklisted journal
sequence numbers newer than the newest seq we find in the journal, and
pick a new start seq that's blacklisted.

Explicitly check for this in bch2_fs_journal_start().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
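
One way to picture the check described in the commit above is a start-sequence
picker that refuses to land inside a blacklisted range. This is a guess at the
shape of the problem, not the code in bch2_fs_journal_start(): the range
representation (half-open [start, end)), the names, and the choice to skip
forward past the range are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_seq_range {
	uint64_t start, end;	/* blacklisted sequence numbers: [start, end) */
};

/*
 * Return the first sequence number >= candidate that is not blacklisted.
 * Each move strictly increases candidate past a range's end, so the loop
 * terminates after at most nr moves.
 */
uint64_t toy_pick_start_seq(uint64_t candidate,
			    const struct toy_seq_range *bl, size_t nr)
{
	bool moved;

	do {
		moved = false;
		for (size_t i = 0; i < nr; i++)
			if (candidate >= bl[i].start && candidate < bl[i].end) {
				candidate = bl[i].end;
				moved = true;
			}
	} while (moved);

	return candidate;
}

void toy_pick_start_seq_example(void)
{
	const struct toy_seq_range bl[] = { { 10, 20 }, { 20, 25 } };

	assert(toy_pick_start_seq(5, bl, 2) == 5);	/* not blacklisted */
	assert(toy_pick_start_seq(12, bl, 2) == 25);	/* skips both ranges */
}
```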
3 weeks ago bcachefs: Small check_fix_ptr fixes
Kent Overstreet [Fri, 23 May 2025 18:19:25 +0000 (14:19 -0400)] 
bcachefs: Small check_fix_ptr fixes

We don't want to change the bucket gen, on gen mismatch: it's possible
to have multiple btree nodes with different gens in the same bucket that
we want to keep, if we have to recover from btree node scan.

It's also not necessary to set g->gen_valid; add a comment to that
effect.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks ago bcachefs: Fix opts.recovery_pass_last
Kent Overstreet [Fri, 23 May 2025 22:31:53 +0000 (18:31 -0400)] 
bcachefs: Fix opts.recovery_pass_last

This was lost in the giant recovery pass rework - but it's used heavily
by bcachefs subcommand utilities.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks ago bcachefs: Fix allocate -> self healing path
Kent Overstreet [Fri, 23 May 2025 22:30:10 +0000 (18:30 -0400)] 
bcachefs: Fix allocate -> self healing path

When we go to allocate and find that a bucket in the freespace btree is
actually allocated, we're supposed to return nonzero to tell the
allocator to skip it.

This fixes an emergency read only due to a bucket/ptr gen mismatch - we
also don't return the correct bucket gen when this happens.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks ago bcachefs: Fix endianness in casefold check/repair
Kent Overstreet [Fri, 23 May 2025 17:13:44 +0000 (13:13 -0400)] 
bcachefs: Fix endianness in casefold check/repair

Fixes: 010c89468134 ("bcachefs: Check for casefolded dirents in non casefolded dirs")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks ago bcachefs: Path must be locked if trans->locked && should_be_locked
Kent Overstreet [Wed, 10 Apr 2024 03:53:57 +0000 (23:53 -0400)] 
bcachefs: Path must be locked if trans->locked && should_be_locked

If path->should_be_locked is true, that means user code (of the btree
API) has seen, in this transaction, something guarded by the node this
path has locked, and we have to keep it locked until the end of the
transaction.

Assert that we're not violating this; should_be_locked should also be
cleared only in _very_ special situations.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks ago bcachefs: Simplify bch2_path_put()
Kent Overstreet [Thu, 22 May 2025 19:52:15 +0000 (15:52 -0400)] 
bcachefs: Simplify bch2_path_put()

Simplify the "do we need to keep this locked?" checks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks ago bcachefs: Plumb btree_trans for more locking asserts
Kent Overstreet [Thu, 22 May 2025 19:33:14 +0000 (15:33 -0400)] 
bcachefs: Plumb btree_trans for more locking asserts

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks ago bcachefs: Clear trans->locked before unlock
Kent Overstreet [Thu, 22 May 2025 20:04:15 +0000 (16:04 -0400)] 
bcachefs: Clear trans->locked before unlock

We're adding new should_be_locked assertions: it's going to be illegal
to unlock a should_be_locked path when trans->locked is true.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks ago bcachefs: Clear should_be_locked before unlock in key_cache_drop()
Kent Overstreet [Thu, 22 May 2025 20:03:08 +0000 (16:03 -0400)] 
bcachefs: Clear should_be_locked before unlock in key_cache_drop()

We're adding new should_be_locked assertions; also add a comment
explaining why clearing should_be_locked is safe here.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: bch2_path_get() reuses paths if upgrade_fails & !should_be_locked
Kent Overstreet [Thu, 22 May 2025 22:12:54 +0000 (18:12 -0400)] 
bcachefs: bch2_path_get() reuses paths if upgrade_fails & !should_be_locked

Small additional optimization over the previous patch, bringing us
closer to the original behaviour, except when we need to clone to avoid
a transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: Give out new path if upgrade fails
Kent Overstreet [Thu, 22 May 2025 22:00:45 +0000 (18:00 -0400)] 
bcachefs: Give out new path if upgrade fails

Avoid transaction restarts due to failure to upgrade - we can traverse a
new iterator without a transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: Fix btree_path_get_locks when not doing trans restart
Kent Overstreet [Thu, 22 May 2025 20:54:31 +0000 (16:54 -0400)] 
bcachefs: Fix btree_path_get_locks when not doing trans restart

btree_path_get_locks, on failure, shouldn't unlock if we're not issuing
a transaction restart: we might drop locks we're not supposed to (if
path->should_be_locked is set).
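
A minimal sketch of the rule, assuming a toy path with just two flags.
toy_get_locks() and its can_lock / will_restart parameters are
hypothetical and only model the decision, not the real
btree_path_get_locks() logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified path state. */
struct toy_path {
	bool should_be_locked;	/* must stay locked until transaction end */
	bool locked;		/* locks currently held */
};

/*
 * Try to take locks; `can_lock` stands in for whether locking succeeds.
 * On failure, existing locks are dropped only when the caller is going
 * to restart the transaction anyway. Returns true on success.
 */
static inline bool toy_get_locks(struct toy_path *path, bool can_lock,
				 bool will_restart)
{
	assert(path);

	if (can_lock) {
		path->locked = true;
		return true;
	}
	if (will_restart) {
		/* A restart discards everything the caller has seen. */
		path->locked = false;
		path->should_be_locked = false;
	}
	return false;
}
```

Without the will_restart test, the failure branch would unlock a path
that its user still expects to be locked.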

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: btree_node_locked_type_nowrite()
Kent Overstreet [Thu, 22 May 2025 22:03:32 +0000 (18:03 -0400)] 
bcachefs: btree_node_locked_type_nowrite()

Small helper to improve locking assertions.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: Kill bch2_path_put_nokeep()
Kent Overstreet [Thu, 22 May 2025 19:40:24 +0000 (15:40 -0400)] 
bcachefs: Kill bch2_path_put_nokeep()

bch2_path_put_nokeep() was intended for paths we wouldn't need to
preserve for a transaction restart - it always frees them right away
when the ref hits 0.

But since paths are shared, freeing unconditionally is a bug: the path
might have been used elsewhere and have should_be_locked set, i.e. we
need to keep it locked until the end of the transaction.
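
To make the hazard concrete, here is a toy reference-counted path in
plain C. struct toy_path and toy_path_put() are hypothetical and far
simpler than the real code; the sketch only shows why the ref hitting 0
is not sufficient grounds for freeing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified path: not the real struct btree_path. */
struct toy_path {
	unsigned ref;		/* number of iterators sharing this path */
	bool should_be_locked;	/* must stay locked until the transaction ends */
	bool allocated;		/* still present in the transaction */
};

/*
 * Drop one reference. The path is freed only when nobody references it
 * and nobody needs it to stay locked; a "nokeep" variant that skipped
 * the should_be_locked test would free a path another user relies on.
 */
static inline void toy_path_put(struct toy_path *path)
{
	assert(path->ref > 0);

	if (--path->ref)
		return;			/* still shared */
	if (path->should_be_locked)
		return;			/* keep it until transaction end */
	path->allocated = false;	/* safe to free */
}
```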

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: bch2_journal_write_checksum()
Kent Overstreet [Wed, 7 May 2025 01:54:35 +0000 (21:54 -0400)] 
bcachefs: bch2_journal_write_checksum()

We need to delay checksumming the journal write; we don't know the
blocksize until after we allocate the write.
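
One way to picture the ordering constraint, as a sketch under loud
assumptions: the model below assumes the block size matters because the
write is padded to a whole number of blocks and the checksum covers the
padded length. That reasoning, the toy_csum() stand-in (32-bit FNV-1a)
and toy_finish_write() are all hypothetical and not taken from the
journal code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in checksum (32-bit FNV-1a); not the checksum the journal uses. */
static inline uint32_t toy_csum(const uint8_t *data, size_t len)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Finish a write once the block size is known: round the payload up to a
 * whole number of blocks, zero the padding, then checksum the padded
 * length. Returns the padded length, or 0 if it does not fit in `cap`.
 */
static inline size_t toy_finish_write(uint8_t *buf, size_t cap, size_t payload,
				      size_t block_size, uint32_t *csum)
{
	assert(block_size > 0 && payload <= cap);

	size_t padded = (payload + block_size - 1) / block_size * block_size;

	if (padded > cap)
		return 0;

	memset(buf + payload, 0, padded - payload);
	*csum = toy_csum(buf, padded);
	return padded;
}
```

Under these assumptions, a checksum computed before the block size is
known would cover the wrong number of bytes.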

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: Reduce stack usage in data_update_index_update()
Kent Overstreet [Thu, 22 May 2025 16:50:22 +0000 (12:50 -0400)] 
bcachefs: Reduce stack usage in data_update_index_update()

Separate tracepoint message generation and other slowpath code into
non-inline functions, and use bch2_trans_log_str() instead of using a
printbuf for our journal message.
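
The general technique can be sketched in userspace C as follows. The
function names are hypothetical, and __attribute__((noinline)) is a
GCC/Clang extension used here as an assumption about the toolchain;
this is not the code from data_update_index_update().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Slow path: the large scratch buffer lives in this function's frame.
 * noinline keeps the compiler from merging that frame into the caller's.
 */
__attribute__((noinline))
static void toy_report_failure(char *out, size_t out_len, int value)
{
	char scratch[512];

	snprintf(scratch, sizeof(scratch), "update failed for value %d", value);
	snprintf(out, out_len, "%s", scratch);
}

/* Fast path: small frame; only calls out to the slow path on error. */
static int toy_update(int value, char *out, size_t out_len)
{
	if (value < 0) {
		toy_report_failure(out, out_len, value);
		return -1;
	}
	return value * 2;
}
```

The scratch buffer only occupies stack while the error path is actually
running, instead of being part of every call to toy_update().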

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: bch2_trans_log_str()
Kent Overstreet [Thu, 22 May 2025 16:49:56 +0000 (12:49 -0400)] 
bcachefs: bch2_trans_log_str()

The data update path doesn't need a printbuf for its log message - this
will help reduce stack usage.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: Kill bkey_buf usage in data_update_index_update()
Kent Overstreet [Thu, 22 May 2025 16:34:40 +0000 (12:34 -0400)] 
bcachefs: Kill bkey_buf usage in data_update_index_update()

Reduce stack usage - bkey_buf has a 96-byte buffer on the stack, but the
btree_trans bump allocator works just fine here.
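
For readers unfamiliar with the term, a bump allocator can be sketched
in a few lines of C. The toy_arena below is a hypothetical fixed-size
stand-in and is not the btree_trans allocator; its size, alignment and
names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size arena standing in for per-transaction memory. */
struct toy_arena {
	uint8_t	buf[256];
	size_t	used;
};

/*
 * Bump allocation: hand out the next `size` bytes (rounded up to 8) and
 * advance the offset. Nothing is freed individually; the whole arena is
 * reset at once. Returns NULL when the arena is full.
 */
static inline void *toy_arena_alloc(struct toy_arena *a, size_t size)
{
	size_t aligned = (size + 7) & ~(size_t)7;

	assert(a->used <= sizeof(a->buf));

	if (aligned < size || aligned > sizeof(a->buf) - a->used)
		return NULL;

	void *p = a->buf + a->used;
	a->used += aligned;
	return p;
}

/* Release everything at once, e.g. at the end of a transaction. */
static inline void toy_arena_reset(struct toy_arena *a)
{
	a->used = 0;
}
```

Allocating a temporary key this way moves the bytes out of the caller's
stack frame and into memory that already exists for the transaction.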

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: Drop empty accounting updates
Kent Overstreet [Wed, 21 May 2025 19:54:56 +0000 (15:54 -0400)] 
bcachefs: Drop empty accounting updates

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: Improve trace_trans_restart_upgrade
Kent Overstreet [Wed, 21 May 2025 07:19:18 +0000 (03:19 -0400)] 
bcachefs: Improve trace_trans_restart_upgrade

- Convert to a 'fs_str' tracepoint that just emits as a string: this
  lets us build up the tracepoint with a printbuf, using our pretty
  printers, and they're much easier to manage

- Include locks_held, before and after

- Include the btree node pointer we failed on (error pointer, null, or
  real node)

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: fix bch2_inum_snapshot_to_path()
Kent Overstreet [Wed, 21 May 2025 02:59:58 +0000 (22:59 -0400)] 
bcachefs: fix bch2_inum_snapshot_to_path()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: fix duplicate printk
Kent Overstreet [Wed, 21 May 2025 00:15:39 +0000 (20:15 -0400)] 
bcachefs: fix duplicate printk

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: BCH_INODE_has_case_insensitive
Kent Overstreet [Mon, 19 May 2025 14:31:44 +0000 (10:31 -0400)] 
bcachefs: BCH_INODE_has_case_insensitive

Add a flag for tracking whether a directory has case-insensitive
descendants - so that overlayfs can disallow mounting, even though the
filesystem supports case insensitivity.

This is a new on disk format version, with a (cheap) upgrade to ensure
the flag is correctly set on existing inodes.

Create, rename and fssetxattr are all plumbed to ensure the new flag is
set, and we've got new fsck code that hooks into check_inode().
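
A toy model of how such a flag might be maintained, assuming it is
propagated to every ancestor when a directory becomes case-insensitive.
struct toy_dir and toy_set_casefold() are hypothetical, and the
propagation strategy, including setting the flag on the directory
itself, is an assumption rather than a description of the bcachefs
code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory directory; not the on-disk inode format. */
struct toy_dir {
	struct toy_dir *parent;		/* NULL for the root */
	bool casefold;			/* this directory is case-insensitive */
	bool has_case_insensitive;	/* this dir or something below it is */
};

/*
 * Mark a directory case-insensitive and flag it and all of its
 * ancestors. The walk stops at the first directory that is already
 * flagged, since its own ancestors were flagged when it was.
 */
static inline void toy_set_casefold(struct toy_dir *dir)
{
	assert(dir);

	dir->casefold = true;
	for (struct toy_dir *d = dir; d && !d->has_case_insensitive; d = d->parent)
		d->has_case_insensitive = true;
}
```

A check at the root then answers "is anything below here
case-insensitive?" without walking the tree.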

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: bch2_inode_find_by_inum_snapshot()
Kent Overstreet [Mon, 19 May 2025 14:10:19 +0000 (10:10 -0400)] 
bcachefs: bch2_inode_find_by_inum_snapshot()

Move a fsck.c helper into inode.c, eliminate some duplication, and organize
the inode lookup helpers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: bch2_inum_snapshot_to_path()
Kent Overstreet [Mon, 19 May 2025 13:48:50 +0000 (09:48 -0400)] 
bcachefs: bch2_inum_snapshot_to_path()

Add a better helper for printing out paths of inodes when we don't know
the subvolume, for fsck.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: bch2_rename_trans() only runs rename-to-dir code if needed
Kent Overstreet [Mon, 19 May 2025 13:17:39 +0000 (09:17 -0400)] 
bcachefs: bch2_rename_trans() only runs rename-to-dir code if needed

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: subvol_inum_eq()
Kent Overstreet [Mon, 19 May 2025 13:12:49 +0000 (09:12 -0400)] 
bcachefs: subvol_inum_eq()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: Don't set bi_casefold on non directories
Kent Overstreet [Mon, 19 May 2025 13:15:31 +0000 (09:15 -0400)] 
bcachefs: Don't set bi_casefold on non directories

bi_casefold only makes sense for directories, and since it's one of the
variable length fields setting it unnecessarily wastes space.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: Remove duplicate call to bch2_trans_begin()
Alan Huang [Mon, 19 May 2025 11:51:04 +0000 (19:51 +0800)] 
bcachefs: Remove duplicate call to bch2_trans_begin()

There is one in for_each_btree_key_max().

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 weeks agobcachefs: Call bch2_bkey_set_needs_rebalance() earlier in write path
Kent Overstreet [Fri, 16 May 2025 20:45:44 +0000 (16:45 -0400)] 
bcachefs: Call bch2_bkey_set_needs_rebalance() earlier in write path

There's no reason to be running this inside our transaction; it forces
us to copy the key we're updating to a temporary, which we'd like to
skip.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>