git.ipfire.org Git - thirdparty/kernel/linux.git/log


commit | commitdiff | tree

Mikulas Patocka [Thu, 13 Jul 2023 16:00:28 +0000 (18:00 +0200)]

bcachefs: mark bch_inode_info and bkey_cached as reclaimable

Mark these caches as reclaimable, so that available memory is correctly
reported when there are a lot of cached inodes.

Note that more work is needed - you should add __GFP_RECLAIMABLE to some
of the kmalloc calls, so that they are allocated from the "kmalloc-rcl-*"
caches.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Thu, 13 Jul 2023 02:27:16 +0000 (22:27 -0400)]

bcachefs: Compression levels

This allows including a compression level when specifying a compression
type, e.g.
compression=zstd:15

Values from 1 through 15 indicate compression levels, 0 or unspecified
indicates the default.

For LZ4, values 3-15 specify that the HC algorithm should be used.

Note that for compatibility, extents themselves only include the
compression type, not the compression level. This means that specifying
the same compression algorithm but different compression levels for the
compression and background_compression options will have no effect.
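
As an illustration of the option syntax described above, here is a minimal
userspace sketch of parsing "type" or "type:level". The enum values, the
names table and the function names are hypothetical and are not taken from
the bcachefs sources; only the 0-15 level range and the LZ4 HC threshold
come from the text above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type IDs for this sketch; not the on-disk values. */
enum compression_type { COMP_none, COMP_lz4, COMP_gzip, COMP_zstd, COMP_NR };

static const char * const comp_names[COMP_NR] = { "none", "lz4", "gzip", "zstd" };

struct compression_opt {
	enum compression_type	type;
	unsigned		level;	/* 0 = default, 1..15 = explicit level */
};

/* Parse "type" or "type:level"; returns 0 on success, -1 on bad input. */
static int parse_compression_opt(const char *s, struct compression_opt *out)
{
	const char *colon;
	size_t name_len;
	unsigned long level = 0;

	assert(s && out);

	colon = strchr(s, ':');
	name_len = colon ? (size_t)(colon - s) : strlen(s);

	if (colon) {
		char *end;

		if (colon[1] == '\0')
			return -1;
		level = strtoul(colon + 1, &end, 10);
		if (*end != '\0' || level > 15)
			return -1;
	}

	for (int i = 0; i < COMP_NR; i++) {
		if (strlen(comp_names[i]) == name_len &&
		    !strncmp(comp_names[i], s, name_len)) {
			out->type = (enum compression_type)i;
			out->level = (unsigned)level;
			return 0;
		}
	}
	return -1;
}

/* Per the text above: for LZ4, levels 3-15 select the HC algorithm. */
static int lz4_uses_hc(const struct compression_opt *opt)
{
	return opt->type == COMP_lz4 && opt->level >= 3;
}
```

Because extents record only the type, two options that parse to the same
type but different levels would look identical once the data is on disk.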

XXX: perhaps we could add a warning for this

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Thu, 13 Jul 2023 02:06:37 +0000 (22:06 -0400)]

bcachefs: Extend sb compression type fields to 8 bits

The upper 4 bits are for compression level.
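
A rough sketch of what such an 8-bit encoding could look like. The helper
names are hypothetical, and placing the type in the lower 4 bits is an
assumption inferred from the sentence above rather than taken from the code.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a 4-bit type (assumed: low nibble) and a 4-bit level (high nibble). */
static inline uint8_t compression_opt_pack(unsigned type, unsigned level)
{
	assert(type <= 0xf && level <= 0xf);
	return (uint8_t)((level << 4) | type);
}

static inline unsigned compression_opt_type(uint8_t v)
{
	return v & 0xf;
}

static inline unsigned compression_opt_level(uint8_t v)
{
	return v >> 4;
}
```

With this layout, a field that previously held only a 4-bit type keeps the
same numeric value when the level is 0 (the default).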

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Thu, 13 Jul 2023 02:06:11 +0000 (22:06 -0400)]

bcachefs: bcachefs_format.h should be using __u64

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Wed, 12 Jul 2023 03:47:29 +0000 (23:47 -0400)]

bcachefs: fix_errors option is now a proper enum

Before, it was parsed as a bool but internally it was really an enum:
this lets us pass in all the possible values.

But we special case the option parsing: no supplied value is parsed as
FSCK_FIX_yes, to match the previous behaviour.
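
The special case can be sketched roughly as follows. Only FSCK_FIX_yes is
named in the text above; the other enum members, the names table and the
function name are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Only FSCK_FIX_yes appears in the commit text; the rest are assumed. */
enum fsck_fix { FSCK_FIX_exit, FSCK_FIX_yes, FSCK_FIX_no, FSCK_FIX_ask, FSCK_FIX_NR };

static const char * const fsck_fix_names[FSCK_FIX_NR] = { "exit", "yes", "no", "ask" };

/*
 * val == NULL models a bare "fix_errors" with no value supplied, which
 * keeps the old boolean behaviour by mapping to FSCK_FIX_yes.
 */
static int parse_fix_errors(const char *val, enum fsck_fix *out)
{
	assert(out != NULL);

	if (!val) {
		*out = FSCK_FIX_yes;
		return 0;
	}

	for (int i = 0; i < FSCK_FIX_NR; i++) {
		if (!strcmp(val, fsck_fix_names[i])) {
			*out = (enum fsck_fix)i;
			return 0;
		}
	}
	return -1;	/* unrecognized value */
}
```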

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Thu, 13 Jul 2023 01:48:32 +0000 (21:48 -0400)]

bcachefs: bch_opt_fn

Minor refactoring to get rid of some unneeded token pasting.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Wed, 12 Jul 2023 17:55:03 +0000 (13:55 -0400)]

bcachefs: Convert snapshot table to RCU array

This switches the generic radix tree for the in-memory table of snapshot
nodes to a simple rcu array. This means we have to add new locking to
deal with reallocations, but is faster than traversing the radix tree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Wed, 12 Jul 2023 15:43:03 +0000 (11:43 -0400)]

bcachefs: Add a race_fault() for write buffer slowpath

We haven't hooked up dynamic fault injection quite yet, but we will soon.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Tue, 11 Jul 2023 00:30:04 +0000 (20:30 -0400)]

bcachefs: Add buffered IO fallback for userspace

In userspace, we want to be able to switch to buffered IO when we're
dealing with an image on a filesystem/device that doesn't support the
blocksize the filesystem was formatted with.

This plumbs through !opts.direct_io -> FMODE_BUFFERED, which will be
supported by the shim version of blkdev_get_by_path() in -tools, and it
adds a fallback to disable direct IO and retry for userspace.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Mon, 10 Jul 2023 02:28:08 +0000 (22:28 -0400)]

bcachefs: Fallocate now checks page cache

Previously, fallocate would only check the state of the extents btree
when determining if we need to create a reservation.

But the page cache might already have dirty data or a disk reservation.
This changes __bchfs_fallocate() to call bch2_seek_pagecache_hole() to
check for this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Mon, 10 Jul 2023 21:23:59 +0000 (17:23 -0400)]

bcachefs: Don't start copygc until recovery is finished

With "bcachefs: Snapshot depth, skiplist fields", we now can't run data
move operations until after bch2_check_snapshots() is complete.

Ideally we'd have the copygc (and rebalance) threads wait until
c->curr_recovery_pass has advanced, but the waitlist handling is tricky
- so for now, move starting copygc back to read_write_late().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Mon, 10 Jul 2023 19:56:05 +0000 (15:56 -0400)]

bcachefs: Fix build error on weird gcc

Fixes:
./include/linux/stddef.h:8:14: error: positional initialization of field in ‘struct’ declared with ‘designated_init’ attribute [-Werror=designated-init]

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sun, 25 Jun 2023 22:04:46 +0000 (18:04 -0400)]

bcachefs: Snapshot depth, skiplist fields

This extends KEY_TYPE_snapshot to include some new fields:
- depth, to indicate depth of this particular node from the root
- skip[3], skiplist entries for quickly walking back up to the root

These are to improve bch2_snapshot_is_ancestor(), making it O(ln(n))
instead of O(n) in the snapshot tree depth.

Skiplist nodes are picked at random from the set of ancestor nodes, not
some fixed fraction.

This introduces bcachefs_metadata_version 1.1, snapshot_skiplists.
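
To show how the depth and skip[3] fields can shorten an ancestor walk, here
is a simplified in-memory sketch. It is not the bcachefs implementation: the
struct, the NO_NODE sentinel and the function names are hypothetical, nodes
are indexed in a plain array, and the test data fills the skip entries by
hand rather than at random.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_SKIP	3
#define NO_NODE	UINT32_MAX

struct snap_node {
	uint32_t parent;	/* index of the parent, NO_NODE for the root */
	uint32_t depth;		/* distance from the root */
	uint32_t skip[NR_SKIP];	/* some ancestors of this node, or NO_NODE */
};

/*
 * Take the longest safe step upwards: the shallowest skip entry that is
 * not above target_depth, falling back to the direct parent.
 */
static uint32_t step_up(const struct snap_node *t, uint32_t id, uint32_t target_depth)
{
	uint32_t best = t[id].parent;

	assert(best != NO_NODE);	/* callers only step from depth > target_depth */

	for (int i = 0; i < NR_SKIP; i++) {
		uint32_t s = t[id].skip[i];

		if (s != NO_NODE &&
		    t[s].depth >= target_depth &&
		    t[s].depth < t[best].depth)
			best = s;
	}
	return best;
}

/* A node counts as its own ancestor in this sketch. */
static bool snapshot_is_ancestor(const struct snap_node *t, uint32_t id, uint32_t ancestor)
{
	while (t[id].depth > t[ancestor].depth)
		id = step_up(t, id, t[ancestor].depth);

	return id == ancestor;
}
```

Every skip entry is an ancestor of the node holding it, so jumping to one
that is no shallower than the target can never skip past the target.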

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Mon, 10 Jul 2023 17:42:26 +0000 (13:42 -0400)]

bcachefs: Version table now lists required recovery passes

Now that we've got forward compatibility sorted out, we should be doing
more frequent version upgrades in the future.

To avoid having to run a full fsck for every version upgrade, this
improves the BCH_METADATA_VERSIONS() table to explicitly specify a
bitmask of recovery passes to run when upgrading to or past a given
version.

This means we can also delete PASS_UPGRADE().
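
One way to picture such a table is sketched below. The version numbers, pass
names and function name are all invented for the example; only the idea of a
per-version bitmask that applies "when upgrading to or past" a version comes
from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical recovery passes, one bit each. */
enum {
	PASS_check_alloc	= 1u << 0,
	PASS_check_snapshots	= 1u << 1,
	PASS_check_inodes	= 1u << 2,
};

struct version_entry {
	unsigned version;
	uint64_t passes;	/* passes required when upgrading to or past this version */
};

/* Hypothetical version table, in ascending order. */
static const struct version_entry version_table[] = {
	{ 10, 0 },
	{ 11, PASS_check_alloc },
	{ 12, PASS_check_snapshots | PASS_check_inodes },
	{ 13, 0 },
};

/* OR together the passes of every version in (old_version, new_version]. */
static uint64_t upgrade_recovery_passes(unsigned old_version, unsigned new_version)
{
	uint64_t passes = 0;

	assert(old_version <= new_version);

	for (size_t i = 0; i < sizeof(version_table) / sizeof(version_table[0]); i++) {
		unsigned v = version_table[i].version;

		if (v > old_version && v <= new_version)
			passes |= version_table[i].passes;
	}
	return passes;
}
```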

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Mon, 10 Jul 2023 16:23:01 +0000 (12:23 -0400)]

bcachefs: bch2_sb_maybe_downgrade(), bch2_sb_upgrade()

Add some new helpers, and fix upgrade/downgrade in bch2_fs_initialize().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Mon, 10 Jul 2023 15:17:56 +0000 (11:17 -0400)]

bcachefs: Fix a write buffer flush deadlock

We're not supposed to block if BTREE_INSERT_JOURNAL_RECLAIM && watermark
!= BCH_WATERMARK_reclaim.

This should really be a separate BTREE_INSERT_NONBLOCK flag - add some
comments to that effect, it's not important for this patch.

btree write buffer flush depends on this behaviour though - the first
loop tries to flush sequentially, which doesn't free up space in the
journal optimally. If that can't proceed we bail out and flush in
journal order - that won't work if we're blocked instead of returning an
error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Wed, 28 Jun 2023 02:09:35 +0000 (22:09 -0400)]

bcachefs: bcachefs_metadata_version_major_minor

This introduces major/minor versioning to the superblock version number.
Major version number changes indicate incompatible releases; we can move
forward to a new major version number, but not backwards. Minor version
numbers indicate compatible changes - these add features, but can still
be mounted and used by old versions.

With the recent patches that make it possible to roll out new btrees and
key types without breaking compatibility, we should be able to roll out
most new features without incompatible changes.
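
The major/minor rule can be sketched as follows. The 10-bit minor field and
all helper names are assumptions for illustration, and a real check would
likely also enforce a minimum supported version, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>

#define VERSION_MINOR_BITS	10	/* assumed split; not stated in the text */
#define VERSION_MINOR_MASK	((1u << VERSION_MINOR_BITS) - 1)

static inline unsigned version_make(unsigned major, unsigned minor)
{
	assert(minor <= VERSION_MINOR_MASK);
	return (major << VERSION_MINOR_BITS) | minor;
}

static inline unsigned version_major(unsigned v)
{
	return v >> VERSION_MINOR_BITS;
}

static inline unsigned version_minor(unsigned v)
{
	return v & VERSION_MINOR_MASK;
}

/*
 * A newer minor version on disk is still usable by older code; a newer
 * major version on disk is not.
 */
static bool version_mountable(unsigned on_disk, unsigned running)
{
	return version_major(on_disk) <= version_major(running);
}
```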

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sun, 9 Jul 2023 19:13:30 +0000 (15:13 -0400)]

bcachefs: Add new assertions for shutdown path

We've been seeing assertions pop that indicate the btree node cache or
key cache have dirty items when we just did a clean shutdown.

Add some more assertions so we can catch this when we're dirtying items.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sun, 9 Jul 2023 18:18:28 +0000 (14:18 -0400)]

bcachefs: bch2_xattr_set() now updates ctime

Fixes fstests generic/728

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sun, 9 Jul 2023 18:12:58 +0000 (14:12 -0400)]

bcachefs: Kill bch2_xattr_get()

Inline it into the only caller

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sun, 9 Jul 2023 17:49:34 +0000 (13:49 -0400)]

bcachefs: Fix try_decrease_writepoints()

We were freeing open buckets on the writepoint list, but forgetting to
take them off the writepoint list - whoops

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sun, 9 Jul 2023 17:20:29 +0000 (13:20 -0400)]

bcachefs: Mark as EXPERIMENTAL

As discussed on list, bcachefs is going to be marked as experimental for
a few releases, until the inevitable tide of new bug reports subsides.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Fri, 7 Jul 2023 06:42:28 +0000 (02:42 -0400)]

bcachefs: Enumerate recovery passes

Recovery and fsck have many different passes/jobs to do, which always
run in the same order - but not all of them run all the time. Some are
for fsck, some for unclean shutdown, some for version upgrades.

This adds some new structure: a defined list of recovery passes that we
can run in a loop, as well as consolidating the log messages.

The main benefit is consolidating the "should run this recovery pass"
logic, as well as cleaning up the "this recovery pass has finished"
state; instead of having a bunch of ad-hoc state bits in c->flags, we've
now got c->curr_recovery_pass.

By consolidating the "should run this recovery pass" logic, in the
future, on-disk format upgrades will be able to say "upgrading to this
version requires x passes to run", instead of forcing all of fsck to
run.
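
The structure described above might look roughly like the following sketch:
a single list of passes, a loop that walks it in order, and one
curr_recovery_pass field instead of ad-hoc flag bits. The pass names, the
PASS_* conditions and struct fs are hypothetical; only the name
curr_recovery_pass is taken from the text.

```c
#include <assert.h>
#include <stdio.h>

/* Conditions under which a pass runs (hypothetical names). */
enum pass_when {
	PASS_ALWAYS	= 1 << 0,
	PASS_FSCK	= 1 << 1,
	PASS_UNCLEAN	= 1 << 2,
};

/* The one defined list of passes, in the order they always run. */
#define RECOVERY_PASSES()			\
	x(alloc_read,      PASS_ALWAYS)		\
	x(journal_replay,  PASS_ALWAYS)		\
	x(check_snapshots, PASS_FSCK)		\
	x(fix_unclean,     PASS_UNCLEAN)

enum recovery_pass {
#define x(name, when) RECOVERY_PASS_##name,
	RECOVERY_PASSES()
#undef x
	RECOVERY_PASS_NR
};

static const struct {
	const char	*name;
	unsigned	when;
} recovery_passes[] = {
#define x(name, when) { #name, when },
	RECOVERY_PASSES()
#undef x
};

struct fs {
	unsigned run_when;		/* PASS_* conditions that hold for this mount */
	unsigned curr_recovery_pass;	/* index of the pass currently running */
	unsigned passes_run;		/* bitmask of passes that actually ran */
};

/* Stand-in for the real work of a pass: just record that it ran. */
static int run_one_pass(struct fs *c, unsigned pass)
{
	assert(pass < RECOVERY_PASS_NR && RECOVERY_PASS_NR <= 32);
	c->passes_run |= 1u << pass;
	return 0;
}

static int run_recovery_passes(struct fs *c)
{
	for (c->curr_recovery_pass = 0;
	     c->curr_recovery_pass < RECOVERY_PASS_NR;
	     c->curr_recovery_pass++) {
		unsigned pass = c->curr_recovery_pass;
		int ret;

		/* The single place that decides whether a pass should run. */
		if (!(recovery_passes[pass].when & (PASS_ALWAYS | c->run_when)))
			continue;

		printf("running recovery pass %s\n", recovery_passes[pass].name);
		ret = run_one_pass(c, pass);
		if (ret)
			return ret;
	}
	return 0;
}
```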

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sun, 9 Jul 2023 02:33:29 +0000 (22:33 -0400)]

bcachefs: Stash journal replay params in bch_fs

For the upcoming enumeration of recovery passes, we need all recovery
passes to be called the same way - including journal replay.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sun, 9 Jul 2023 02:27:03 +0000 (22:27 -0400)]

bcachefs: Kill bch2_bucket_gens_read()

This folds bch2_bucket_gens_read() into bch2_alloc_read(), doing the
version check there.

This is prep work for enumerating all recovery passes: we need some
cleanup first to make calling all the recovery passes consistent.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sun, 9 Jul 2023 02:21:45 +0000 (22:21 -0400)]

bcachefs: Fix error path in bch2_journal_flush_device_pins()

We need to always call bch2_replicas_gc_end() after we've called
bch2_replicas_gc_start(), else we leave state around that needs to be
cleaned up.

Partial fix for: https://github.com/koverstreet/bcachefs/issues/560

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Wed, 28 Jun 2023 03:34:02 +0000 (23:34 -0400)]

bcachefs: version_upgrade is now an enum

The version_upgrade parameter is now an enum, not a bool, and it's
persistent in the superblock:
- compatible (default): upgrade to the latest compatible version
- incompatible: upgrade to latest incompatible version
- none

Currently all upgrades are incompatible upgrades, but the next release
will introduce major:minor versions.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Wed, 28 Jun 2023 23:59:56 +0000 (19:59 -0400)]

bcachefs: BCH_SB_VERSION_UPGRADE_COMPLETE()

Version upgrades are not atomic operations: when we do a version upgrade
we need to update the superblock before we start using new features, and
then when the upgrade completes we need to update the superblock again.
This adds a new superblock field so we can detect and handle incomplete
version upgrades.
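
A minimal sketch of the two-step update, assuming a second superblock field
that records the last version whose upgrade fully completed. The struct and
function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical in-memory view of the two superblock fields involved. */
struct superblock {
	unsigned version;			/* version whose features may be in use */
	unsigned version_upgrade_complete;	/* last version whose upgrade finished */
};

/* Step 1: written before any new feature is used. */
static void sb_upgrade_begin(struct superblock *sb, unsigned new_version)
{
	assert(new_version >= sb->version);
	sb->version = new_version;
}

/* Step 2: written once the upgrade work is done. */
static void sb_upgrade_finish(struct superblock *sb)
{
	sb->version_upgrade_complete = sb->version;
}

/* A crash between the two steps leaves the fields unequal. */
static bool sb_upgrade_incomplete(const struct superblock *sb)
{
	return sb->version_upgrade_complete < sb->version;
}
```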

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Fri, 7 Jul 2023 21:09:26 +0000 (17:09 -0400)]

bcachefs: Convert more -EROFS to private error codes

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Fri, 7 Jul 2023 08:38:29 +0000 (04:38 -0400)]

bcachefs: Delete redundant log messages

Now that we have distinct error codes for different memory allocation
failures, the early init log messages are no longer needed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Fri, 7 Jul 2023 01:16:10 +0000 (21:16 -0400)]

bcachefs: Change check for invalid key types

As part of the forward compatibility patch series, we need to allow for
new key types without complaining loudly when running an old version.

This patch changes the flags parameter of bkey_invalid to an enum, and
adds a new flag to indicate we're being called from the transaction
commit path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Fri, 7 Jul 2023 02:47:42 +0000 (22:47 -0400)]

bcachefs: Assorted sparse fixes

- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Fri, 7 Jul 2023 00:11:36 +0000 (20:11 -0400)]

bcachefs: Refactor bch_sb_field_ops handling

This changes bch_sb_field_ops lookup to match how bkey_ops now works;
for an unknown field type we return an empty ops struct.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Thu, 6 Jul 2023 23:23:27 +0000 (19:23 -0400)]

bcachefs: Allow for unknown key types

This adds a new helper for looking up bkey_ops for a given key type, which
returns a null bkey_ops for unknown key types; various bkey_ops users
are tweaked as well to handle unknown key types.
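
The null-ops pattern can be sketched like this. The key types, the ops
struct layout and the function names are invented for the example and do not
mirror the real bkey_ops definition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much reduced ops struct. */
struct key_ops {
	const char	*name;
	int		(*validate)(const void *key);
};

/* Hypothetical validator: rejects a missing key. */
static int validate_nonnull(const void *key)
{
	return key ? 0 : -1;
}

enum { KEY_TYPE_deleted, KEY_TYPE_inode, KEY_TYPE_NR };	/* hypothetical */

static const struct key_ops key_ops_table[KEY_TYPE_NR] = {
	{ "deleted", validate_nonnull },
	{ "inode",   validate_nonnull },
};

/* Returned for any type this build does not know about. */
static const struct key_ops key_ops_null = { NULL, NULL };

static const struct key_ops *key_type_ops(unsigned type)
{
	return type < KEY_TYPE_NR ? &key_ops_table[type] : &key_ops_null;
}

static int key_validate(unsigned type, const void *key)
{
	const struct key_ops *ops = key_type_ops(type);

	assert(ops != NULL);	/* the lookup never returns NULL */

	/* Unknown types have no validate hook and are tolerated. */
	return ops->validate ? ops->validate(key) : 0;
}
```

Callers check individual hooks rather than the ops pointer, so a key type
written by a newer version passes through without a crash or a loud error.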

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Thu, 29 Jun 2023 02:09:13 +0000 (22:09 -0400)]

bcachefs: Allow for unknown btree IDs

We need to allow filesystems with metadata from newer versions to be
mountable and usable by older versions.

This patch enables us to roll out new btrees without a new major version
number; we can now handle btree roots for unknown btree types.

The unknown btree roots will be retained, and fsck (including
backpointers) will check them, the same as other btree types.

We add a dynamic array for the extra, unknown btree roots, in addition
to the fixed size btree root array, and add new helpers for looking up
btree roots.
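
A simplified sketch of the lookup scheme described above: known IDs index a
fixed array, and anything beyond it lands in a growable array. BTREE_ID_NR,
the struct layout and the helper names are assumptions, and the sketch uses
realloc() in place of whatever dynamic array type the kernel code uses.

```c
#include <assert.h>
#include <stdlib.h>

#define BTREE_ID_NR	4	/* hypothetical number of btrees known to this build */

struct btree_root {
	unsigned id;
	int	 alive;
};

struct fs_roots {
	struct btree_root  known[BTREE_ID_NR];	/* fixed size, indexed by id */
	struct btree_root *extra;		/* roots with id >= BTREE_ID_NR */
	size_t		   nr_extra;
};

/* Look up a root; returns NULL for an unknown id that was never recorded. */
static struct btree_root *btree_root_find(struct fs_roots *r, unsigned id)
{
	size_t idx;

	if (id < BTREE_ID_NR)
		return &r->known[id];

	idx = id - BTREE_ID_NR;
	if (idx >= r->nr_extra)
		return NULL;

	assert(r->extra[idx].id == id);
	return &r->extra[idx];
}

/* Look up a root, growing the dynamic array for unknown ids as needed. */
static struct btree_root *btree_root_get(struct fs_roots *r, unsigned id)
{
	if (id >= BTREE_ID_NR) {
		size_t idx = id - BTREE_ID_NR;

		if (idx >= r->nr_extra) {
			struct btree_root *n =
				realloc(r->extra, (idx + 1) * sizeof(*n));

			if (!n)
				return NULL;

			for (size_t i = r->nr_extra; i <= idx; i++) {
				n[i].id = BTREE_ID_NR + (unsigned)i;
				n[i].alive = 0;
			}
			r->extra = n;
			r->nr_extra = idx + 1;
		}
	}
	return btree_root_find(r, id);
}
```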

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Brian Foster [Fri, 30 Jun 2023 17:09:46 +0000 (13:09 -0400)]

bcachefs: flush journal to avoid invalid dev usage entries on recovery

A crash immediately after device removal can result in an
unmountable filesystem due to recovery failure. The following
command reliably reproduces on a multi-device fs:

bcachefs device remove <dev> && xfs_io -xc shutdown <mnt>

The post-crash mount fails with an error similar to the following,
reported by fsck:

invalid journal entry dev_usage at offset 7994/8034 seq 12: bad dev, fixing

This refers to a device usage entry in the journal that refers to
the index of the just removed device. Recovery considers this an
invalid entry and fails to proceed.

Device usage entries are added to journal buffer writes via
bch_journal_write() -> bch2_journal_super_entries_add_common(),
which means any journal buffer write has content that refers to
member devices at the time of the journal write.

The device remove sequence already removes metadata references to
the device being removed. It then flushes any pins that refer to the
device, clears replica entries, removes the in-memory device object
and lastly updates the superblock to reflect that the device is no
longer present. The problem is that any journal writes that occur
during this sequence will include a dev usage entry so long as the
device is present. To avoid this problem, we can flush the journal
once more after the device entry is removed from the in-core
structures, but before the superblock is updated to fully remove the
device on-disk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Brian Foster [Fri, 30 Jun 2023 14:51:46 +0000 (10:51 -0400)]

bcachefs: mark active journal devices on journal replicas gc

A simple device evacuate, remove, add test loop with concurrent
shutdowns occasionally reproduces a problem where the filesystem
fails to mount. The mount failure occurs because the filesystem was
uncleanly shut down, yet no member device is marked for journal data
in the superblock. An fsck detects the problem, restores the mark
and allows the mount to proceed without further consistency issues.

The reason for the lack of journal data marks is that the gc mechanism
invoked via bch2_journal_flush_device_pins() runs while the journal
happens to be empty. This results in garbage collection of all journal
replicas entries. Once the updated replicas table is written to the
superblock, the filesystem is put in a transiently unrecoverable state
until further journal data is written, because journal recovery expects
to find at least one marked journal device whenever the filesystem is
not otherwise marked clean (i.e. as on clean unmount).

To fix this problem, update the journal replicas gc algorithm to always
mark currently active journal replicas entries by writing to the
journal. This ensures that only entries for devices that are no longer
used for journaling are garbage collected, not just those that don't
happen to currently hold journal data. This preserves the journal
recovery invariant above and avoids putting the fs into a transiently
unrecoverable state.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Thu, 29 Jun 2023 00:27:07 +0000 (20:27 -0400)]

bcachefs: bch2_version_compatible()

This adds a new helper for checking if an on-disk version is compatible
with the running version of bcachefs - prep work for introducing
major:minor version numbers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Wed, 28 Jun 2023 23:53:05 +0000 (19:53 -0400)]

bcachefs: bch2_version_to_text()

Add a new helper for printing out metadata versions in a standard
format.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Tue, 27 Jun 2023 21:32:48 +0000 (17:32 -0400)]

bcachefs: Kill BTREE_INSERT_USE_RESERVE

Now that we have journal watermarks and alloc watermarks unified,
BTREE_INSERT_USE_RESERVE is redundant and can be deleted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Wed, 28 Jun 2023 04:01:19 +0000 (00:01 -0400)]

bcachefs: Fix a null ptr deref in bch2_fs_alloc() error path

This fixes a null ptr deref in bch2_free_pending_node_rewrites() when
the list head wasn't initialized.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Wed, 28 Jun 2023 03:28:17 +0000 (23:28 -0400)]

bcachefs: Fix a format string warning

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Tue, 27 Jun 2023 21:32:38 +0000 (17:32 -0400)]

bcachefs: Kill JOURNAL_WATERMARK

This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards
specifying watermarks once in the transaction commit path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Tue, 27 Jun 2023 21:29:20 +0000 (17:29 -0400)]

bcachefs: BCH_WATERMARK_reclaim

Add another watermark for journal reclaim - this is needed for the next
patches, that unify BCH_WATERMARK with JOURNAL_WATERMARK.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Tue, 27 Jun 2023 23:02:17 +0000 (19:02 -0400)]

bcachefs: struct bch_extent_rebalance

This adds the extent entry for extents that rebalance needs to do
something with.

We're adding this ahead of the main rebalance_work patchset, because
adding new extent entries can't be done in a forwards-compatible way.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Tue, 27 Jun 2023 22:01:09 +0000 (18:01 -0400)]

bcachefs: Expand BTREE_NODE_ID

We now have 20 bits for the btree ID in the on disk format - sufficient
for 1 million distinct btrees.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Tue, 27 Jun 2023 23:10:24 +0000 (19:10 -0400)]

bcachefs: Fix btree node write error message

Error messages should include the error code, when available.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sun, 25 Jun 2023 20:35:49 +0000 (16:35 -0400)]

bcachefs: fsck: Break walk_inode() up into multiple functions

Some refactoring, prep work for algorithm improvements related to
snapshots.

We need to add a bitmap to the list of inodes for "seen this snapshot";
for this bitmap to correctly be available, we'll need to gather the list
of inodes first, and later look up the inode for a given snapshot.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Tue, 27 Jun 2023 20:20:05 +0000 (16:20 -0400)]

bcachefs: Fix leak in backpointers fsck

We were forgetting to exit a printbuf - whoops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Tue, 27 Jun 2023 03:31:49 +0000 (23:31 -0400)]

bcachefs: unregister_shrinker() now safe on not-registered shrinker

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Tue, 27 Jun 2023 03:10:21 +0000 (23:10 -0400)]

bcachefs: Add a missing rhashtable_destroy() call

Fixes https://lore.kernel.org/linux-bcachefs/784c3e6a-75bd-e6ca-535a-43b3e1daf643@kernel.dk/T/#mbf7caf005f960018eba23b58795d06c06c947411

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Mon, 26 Jun 2023 22:36:24 +0000 (18:36 -0400)]

bcachefs: Improve bch2_bkey_make_mut()

bch2_bkey_make_mut() now takes the bkey_s_c by reference and points it
at the new, mutable key.

This helps in some fsck paths that may have multiple repair operations
on the same key.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Tue, 27 Jun 2023 02:26:04 +0000 (22:26 -0400)]

bcachefs: Reduce stack frame size of bch2_check_alloc_info()

Excessive inlining may (on some versions of gcc?) cause excessive stack
usage; this turns off some inlining in bch2_check_alloc_info.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sun, 25 Jun 2023 05:34:45 +0000 (01:34 -0400)]

bcachefs: fsck needs BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE

A few fsck paths weren't using BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE -
oops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sun, 25 Jun 2023 03:22:20 +0000 (23:22 -0400)]

bcachefs: Improve error message for overlapping extents

We now print out the full previous extent we're overlapping with, to aid in
debugging and searching through the journal.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sun, 25 Jun 2023 03:20:39 +0000 (23:20 -0400)]

bcachefs: Fix check_pos_snapshot_overwritten()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sat, 24 Jun 2023 23:30:10 +0000 (19:30 -0400)]

bcachefs: Rename enum alloc_reserve -> bch_watermark

This is prep work for consolidating with JOURNAL_WATERMARK.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sat, 24 Jun 2023 19:59:03 +0000 (15:59 -0400)]

bcachefs: BCH_ERR_fsck -> EINVAL

When we return errors outside of bcachefs, we need to return a standard
error code - fix this for BCH_ERR_fsck.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sat, 24 Jun 2023 16:17:57 +0000 (12:17 -0400)]

bcachefs: bch2_trans_mark_pointer() refactoring

bch2_bucket_backpointer_mod() doesn't need to update the alloc key, we
can exit the alloc iter earlier.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Wed, 21 Jun 2023 10:44:44 +0000 (06:44 -0400)]

bcachefs: Fix more lockdep splats in debug.c

Similar to previous fixes, we can't incur page faults while holding
btree locks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Wed, 21 Jun 2023 10:00:04 +0000 (06:00 -0400)]

bcachefs: Fix lockdep splat in bch2_readdir

dir_emit() can fault (taking mmap_lock); thus we can't be holding btree
locks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Wed, 21 Jun 2023 04:31:49 +0000 (00:31 -0400)]

bcachefs: Check for ERR_PTR() from filemap_lock_folio()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Tue, 20 Jun 2023 17:49:25 +0000 (13:49 -0400)]

bcachefs: New error message helpers

Add two new helpers for printing error messages with __func__ and
bch2_err_str():
- bch_err_fn
- bch_err_msg

Also kill the old error strings in the recovery path, which were causing
us to incorrectly report memory allocation failures - they're not needed
anymore.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Tue, 20 Jun 2023 01:12:05 +0000 (21:12 -0400)]

bcachefs: fiemap: Fix a lockdep splat

As with the previous patch, we generally can't hold btree locks while
copying to userspace, as that may incur a page fault and require
mmap_lock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Tue, 20 Jun 2023 01:01:13 +0000 (21:01 -0400)]

bcachefs: seqmutex; fix a lockdep splat

We can't be holding btree_trans_lock while copying to user space, which
might incur a page fault. To fix this, convert it to a seqmutex so we
can unlock/relock.
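
The unlock/relock idea can be sketched in userspace with a pthread mutex and
a sequence counter. This is an illustrative approximation, not the kernel
seqmutex: the struct layout and function names are assumptions, and the
relock here blocks on the mutex before checking the counter.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

struct seqmutex {
	pthread_mutex_t	lock;
	uint32_t	seq;	/* bumped on every unlock; protected by lock */
};

static void seqmutex_lock(struct seqmutex *m)
{
	int ret = pthread_mutex_lock(&m->lock);

	assert(ret == 0);
	(void)ret;
}

/* Unlock and return a token identifying this unlock. */
static uint32_t seqmutex_unlock(struct seqmutex *m)
{
	uint32_t seq = ++m->seq;

	pthread_mutex_unlock(&m->lock);
	return seq;
}

/*
 * Retake the lock, but only succeed if nobody else held it since the
 * unlock that produced 'seq'. On failure the lock is not held and the
 * caller has to restart whatever it was iterating over.
 */
static bool seqmutex_relock(struct seqmutex *m, uint32_t seq)
{
	seqmutex_lock(m);
	if (m->seq != seq) {
		pthread_mutex_unlock(&m->lock);
		return false;
	}
	return true;
}
```

The intended use is to drop the lock before an operation that may fault,
such as a copy to user space, and then relock with the saved token; a failed
relock signals that the protected list may have changed in the meantime.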

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Mon, 19 Jun 2023 04:07:40 +0000 (00:07 -0400)]

bcachefs: Don't call lock_graph_descend() with wait lock held

This fixes a deadlock:

01305 WARNING: possible circular locking dependency detected
01305 6.3.0-ktest-gf4de9bee61af #5305 Tainted: G        W
01305 ------------------------------------------------------
01305 cat/14658 is trying to acquire lock:
01305 ffffffc00982f460 (fs_reclaim){+.+.}-{0:0}, at: __kmem_cache_alloc_node+0x48/0x278
01305
01305 but task is already holding lock:
01305 ffffff8011aaf040 (&lock->wait_lock){+.+.}-{2:2}, at: bch2_check_for_deadlock+0x4b8/0xa58
01305
01305 which lock already depends on the new lock.
01305
01305
01305 the existing dependency chain (in reverse order) is:
01305
01305 -> #2 (&lock->wait_lock){+.+.}-{2:2}:
01305        _raw_spin_lock+0x54/0x70
01305        __six_lock_wakeup+0x40/0x1b0
01305        six_unlock_ip+0xe8/0x248
01305        bch2_btree_key_cache_scan+0x720/0x940
01305        shrink_slab.constprop.0+0x284/0x770
01305        shrink_node+0x390/0x828
01305        balance_pgdat+0x390/0x6d0
01305        kswapd+0x2e4/0x718
01305        kthread+0x184/0x1a8
01305        ret_from_fork+0x10/0x20
01305
01305 -> #1 (&c->lock#2){+.+.}-{3:3}:
01305        __mutex_lock+0x104/0x14a0
01305        mutex_lock_nested+0x30/0x40
01305        bch2_btree_key_cache_scan+0x5c/0x940
01305        shrink_slab.constprop.0+0x284/0x770
01305        shrink_node+0x390/0x828
01305        balance_pgdat+0x390/0x6d0
01305        kswapd+0x2e4/0x718
01305        kthread+0x184/0x1a8
01305        ret_from_fork+0x10/0x20
01305
01305 -> #0 (fs_reclaim){+.+.}-{0:0}:
01305        __lock_acquire+0x19d0/0x2930
01305        lock_acquire+0x1dc/0x458
01305        fs_reclaim_acquire+0x9c/0xe0
01305        __kmem_cache_alloc_node+0x48/0x278
01305        __kmalloc_node_track_caller+0x5c/0x278
01305        krealloc+0x94/0x180
01305        bch2_printbuf_make_room.part.0+0xac/0x118
01305        bch2_prt_printf+0x150/0x1e8
01305        bch2_btree_bkey_cached_common_to_text+0x170/0x298
01305        bch2_btree_trans_to_text+0x244/0x348
01305        print_cycle+0x7c/0xb0
01305        break_cycle+0x254/0x528
01305        bch2_check_for_deadlock+0x59c/0xa58
01305        bch2_btree_deadlock_read+0x174/0x200
01305        full_proxy_read+0x94/0xf0
01305        vfs_read+0x15c/0x3a8
01305        ksys_read+0xb8/0x148
01305        __arm64_sys_read+0x48/0x60
01305        invoke_syscall.constprop.0+0x64/0x138
01305        do_el0_svc+0x84/0x138
01305        el0_svc+0x34/0x80
01305        el0t_64_sync_handler+0xb0/0xb8
01305        el0t_64_sync+0x14c/0x150
01305
01305 other info that might help us debug this:
01305
01305 Chain exists of:
01305   fs_reclaim --> &c->lock#2 --> &lock->wait_lock
01305
01305  Possible unsafe locking scenario:
01305
01305        CPU0                    CPU1
01305        ----                    ----
01305   lock(&lock->wait_lock);
01305                                lock(&c->lock#2);
01305                                lock(&lock->wait_lock);
01305   lock(fs_reclaim);
01305
01305  *** DEADLOCK ***

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sun, 18 Jun 2023 17:25:35 +0000 (13:25 -0400)]

bcachefs: Fix bch2_check_discard_freespace_key()

We weren't correctly checking the freespace btree - it's an extents
btree, which means we need to iterate over each bucket in a freespace
extent.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sun, 18 Jun 2023 17:25:09 +0000 (13:25 -0400)]

bcachefs: bch2_trans_unlock_noassert()

This fixes a spurious assert in the btree node read path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Sat, 17 Jun 2023 03:30:02 +0000 (23:30 -0400)]

bcachefs: Fix bch2_btree_update_start()

The calculation for number of nodes to allocate in
bch2_btree_update_start() was incorrect - this fixes a BUG_ON() on the
small nodes test.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Tue, 13 Jun 2023 19:12:04 +0000 (15:12 -0400)]

bcachefs: bch2_extent_ptr_desired_durability()

This adds a new helper for getting a pointer's durability irrespective
of the device state, and uses it in the data update path.

This fixes a bug where we do a data update but request 0 replicas to be
allocated, because the replica being rewritten is on a device marked as
failed.
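
The difference between the two notions of durability can be sketched as follows. The struct, enum and function names here are hypothetical stand-ins, and the rule that a failed device contributes zero durability is an assumption inferred from the description above.

```c
#include <assert.h>

/* Hypothetical device model, for illustration only. */
enum dev_state { DEV_RW, DEV_RO, DEV_FAILED };

struct dev {
	unsigned	durability;	/* configured durability of the device */
	enum dev_state	state;
};

/* Durability as it counts right now: a failed device contributes nothing. */
static unsigned ptr_durability(const struct dev *d)
{
	return d->state == DEV_FAILED ? 0 : d->durability;
}

/* Durability the pointer is meant to provide, ignoring device state. */
static unsigned ptr_desired_durability(const struct dev *d)
{
	return d->durability;
}
```

When the replica being rewritten lives on a failed device, sizing the new allocation from the first function asks for 0 replicas, while the second still asks for the configured amount.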

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Tue, 13 Jun 2023 19:05:40 +0000 (15:05 -0400)]

bcachefs: snapshot_to_text() includes snapshot tree

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Thu, 16 Mar 2023 22:05:00 +0000 (18:05 -0400)]

bcachefs: Fix try_decrease_writepoints()

- We may need to drop btree locks before taking the writepoint_lock, as
is done in other places.
- We should be using open_bucket_free_unused(), so that we don't waste
space.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sun, 11 Jun 2023 22:24:04 +0000 (18:24 -0400)]

bcachefs: Delete weird hacky transaction restart injection

Since we currently don't have a good fault injection library,
bch2_btree_insert_node() was randomly injecting faults based on
local_clock().

At the very least this should have been a debug mode only thing, but
this is a brittle method so let's just delete it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sun, 11 Jun 2023 23:45:21 +0000 (19:45 -0400)]

bcachefs: Write buffer flush needs BTREE_INSERT_NOCHECK_RW

btree write buffer flush is only invoked from contexts that already hold
a write ref, and checking if we're still RW could cause us to fail to
completely flush the write buffer when shutting down.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sun, 11 Jun 2023 23:21:16 +0000 (19:21 -0400)]

bcachefs: New assertions when marking filesystem clean

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sat, 10 Jun 2023 05:37:16 +0000 (01:37 -0400)]

bcachefs: ec: Fix a lost wakeup

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Mikulas Patocka [Tue, 30 May 2023 12:15:41 +0000 (08:15 -0400)]

bcachefs: fix NULL pointer dereference in try_alloc_bucket

On Mon, 29 May 2023, Mikulas Patocka wrote:

> The oops happens in set_btree_iter_dontneed and it is caused by the fact
> that iter->path is NULL. The code in try_alloc_bucket is buggy because it
> sets "struct btree_iter iter = { NULL };" and then jumps to the "err"
> label that tries to dereference values in "iter".

Here I'm sending a patch for it.

From: Mikulas Patocka <mpatocka@redhat.com>

The function try_alloc_bucket sets the variable "iter" to NULL and then
(on various error conditions) jumps to the label "err". On the "err"
label, it calls "set_btree_iter_dontneed" that tries to dereference
"iter->trans" and "iter->path".

So, we get an oops on error condition.

This patch fixes the crash by testing that iter.trans and iter.path are
non-NULL before calling set_btree_iter_dontneed.
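
The shape of the bug and of the guard can be sketched in plain C. All types and function names below are simplified stand-ins invented for the example; only the pattern (a zero-initialized iterator, an early jump to the error label, and a NULL check before the cleanup call) follows the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the real transaction, path and iterator types. */
struct trans { int unused; };
struct path  { int preserve; };
struct iter  { struct trans *trans; struct path *path; };

/* Dereferences iter->path, so it must not see a zeroed iterator. */
static void iter_dontneed(struct iter *iter)
{
	iter->path->preserve = 0;
}

static int try_alloc(int fail_early, struct trans *t, struct path *p)
{
	struct iter iter = { NULL };	/* every member starts out NULL */
	int ret = 0;

	if (fail_early) {
		ret = -1;
		goto err;		/* iter has not been set up yet */
	}

	iter.trans = t;
	iter.path = p;
err:
	/* The guard: only touch the iterator if it was initialized. */
	if (iter.trans && iter.path)
		iter_dontneed(&iter);
	return ret;
}
```

Without the check at the `err` label, the early-failure path would dereference a NULL `iter.path`.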

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Fri, 9 Jun 2023 19:41:41 +0000 (15:41 -0400)]

bcachefs: Fix subvol deletion deadlock

d_prune_aliases() may call bch2_evict_inode(), which needs
c->vfs_inodes_lock.

Fix this by always calling igrab() before putting the inodes onto our
disposal list, and then calling d_prune_aliases() with
c->vfs_inodes_lock dropped.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Brian Foster [Tue, 30 May 2023 18:51:12 +0000 (14:51 -0400)]

bcachefs: don't spin in rebalance when background target is not usable

If a bcachefs filesystem is configured with a background device
(disk group), rebalance will relocate data to this device in the
background by checking extent keys for whether they currently reside
in the specified target. For keys that do not, rebalance performs a
read/write cycle to allow the write path to properly relocate data.

If the background target is not usable (read-only, for example),
however, the write path doesn't actually move data to another
device. Instead, rebalance spins indefinitely reading and rewriting
the same data over and over to the same device. If the background
target is made available again, the rebalance picks this up,
relocates the data, and eventually terminates.

To avoid this spinning behavior, update the rebalance background
target logic to not only check whether the extent is not in the
target, but whether the target is actually usable as well. If not,
then don't mark the key for rewrite.
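
The updated decision can be summarized in a small predicate. This is a sketch under assumed names: the two structs and their boolean fields are hypothetical, standing in for whatever the rebalance code actually inspects.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summaries of the state rebalance looks at. */
struct extent_state { bool in_target; };
struct target_state { bool usable; };	/* e.g. false when read-only */

/*
 * Mark a key for rewrite only if it is outside the background target
 * AND the target can actually accept the data.
 */
static bool rebalance_needs_rewrite(const struct extent_state *e,
				    const struct target_state *t)
{
	return !e->in_target && t->usable;
}
```

With the old logic (only `!e->in_target`), an unusable target would cause the same extent to be selected on every pass.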

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Brian Foster [Tue, 30 May 2023 18:48:58 +0000 (14:48 -0400)]

bcachefs: push rcu lock down into bch2_target_to_mask()

We have one caller that cycles the rcu lock solely for this call
(via target_rw_devs()), and we'd like to add another. Simplify
things by pushing the rcu lock down into bch2_target_to_mask(),
similar to how bch2_dev_in_target() works.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Brian Foster [Tue, 30 May 2023 18:41:50 +0000 (14:41 -0400)]

bcachefs: create internal disk_groups sysfs file

We have bch2_sb_disk_groups_to_text() to dump disk group labels, but
no good information on device group membership at runtime. Add
bch2_disk_groups_to_text() and an associated 'disk_groups' sysfs
file to print group and device relationships.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Mon, 5 Jun 2023 05:16:00 +0000 (01:16 -0400)]

bcachefs: Clean up tests code

- delete redundant error messages
- convert various code to bch2_trans_run

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Mon, 5 Jun 2023 05:15:33 +0000 (01:15 -0400)]

bcachefs: Improve backpointers error message

The error message here dated from when backpointers could be stored in
alloc keys; now, we should always print the full key.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Tue, 30 May 2023 08:59:30 +0000 (04:59 -0400)]

bcachefs: More drop_locks_do() conversions

Using drop_locks_do() ensures that every unlock() is paired with a
relock(), with proper error checking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sun, 4 Jun 2023 23:40:35 +0000 (19:40 -0400)]

bcachefs: Delete warning from promote_alloc()

It's possible to see a -BCH_ERR_ENOSPC_disk_reservation here, and that's
fine.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sun, 4 Jun 2023 22:08:56 +0000 (18:08 -0400)]

bcachefs: Fix bch2_fsck_ask_yn()

- getline() output includes a newline; without stripping it, the input
never matched and we were just looping

- Make the prompt clearer
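
The newline problem is easy to reproduce in userspace. The sketch below is not the bcachefs-tools code: the function names, the accepted answers and the prompt text are assumptions chosen for the example. Only getline() itself is the real POSIX call.

```c
#define _POSIX_C_SOURCE 200809L	/* for getline() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* getline() keeps the trailing '\n'; remove it if present. */
static void strip_newline(char *buf)
{
	size_t len = strlen(buf);

	if (len && buf[len - 1] == '\n')
		buf[len - 1] = '\0';
}

/* Returns 1 for yes, 0 for no, -1 for anything else. */
static int parse_yn(const char *buf)
{
	if (!strcmp(buf, "y") || !strcmp(buf, "yes"))
		return 1;
	if (!strcmp(buf, "n") || !strcmp(buf, "no"))
		return 0;
	return -1;
}

/* Keep asking until we get a recognizable answer or hit EOF. */
static int ask_yn(FILE *in, FILE *out)
{
	char *buf = NULL;
	size_t cap = 0;
	int ret = -1;

	while (ret < 0) {
		fputs(" (y,n): ", out);
		fflush(out);

		if (getline(&buf, &cap, in) < 0)
			break;		/* EOF or read error */

		strip_newline(buf);
		ret = parse_yn(buf);
	}

	free(buf);
	return ret;
}
```

Comparing the raw buffer "y\n" against "y" never matches, which is why the unstripped version loops forever on valid input.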

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sun, 28 May 2023 23:23:35 +0000 (19:23 -0400)]

bcachefs: replicas_deltas_realloc() uses allocate_dropping_locks()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sun, 28 May 2023 07:44:38 +0000 (03:44 -0400)]

bcachefs: Convert acl.c to allocate_dropping_locks()

More work to avoid allocating memory with btree locks held.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sun, 28 May 2023 07:44:38 +0000 (03:44 -0400)]

bcachefs: allocate_dropping_locks()

Add two new helpers for allocating memory with btree locks held: The
idea is to first try the allocation with GFP_NOWAIT|__GFP_NOWARN, then
if that fails - unlock, retry with GFP_KERNEL, and then call
trans_relock().
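
The try-nonblocking-then-drop-locks idea can be modelled in userspace as below. Everything here is a stand-in: `struct trans`, the two allocation modes and the allocator callback are invented for the sketch, with `ALLOC_NOWAIT` playing the role of GFP_NOWAIT|__GFP_NOWARN and `ALLOC_MAY_BLOCK` the role of GFP_KERNEL. The real helpers are macros in the kernel tree and may differ in detail.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for GFP_NOWAIT|__GFP_NOWARN and GFP_KERNEL. */
enum alloc_mode { ALLOC_NOWAIT, ALLOC_MAY_BLOCK };

/* Toy transaction: tracks lock state and how often locks were dropped. */
struct trans {
	bool		locked;
	unsigned	unlocks;
};

static void trans_unlock(struct trans *t) { t->locked = false; t->unlocks++; }
static int  trans_relock(struct trans *t) { t->locked = true; return 0; }

typedef void *(*alloc_fn)(size_t size, enum alloc_mode mode);

static int alloc_dropping_locks(struct trans *t, alloc_fn alloc,
				size_t size, void **out)
{
	/* Fast path: non-blocking attempt with locks still held. */
	void *p = alloc(size, ALLOC_NOWAIT);
	int ret = 0;

	if (!p) {
		/* Slow path: drop locks, allow blocking, then relock. */
		trans_unlock(t);
		p = alloc(size, ALLOC_MAY_BLOCK);
		ret = trans_relock(t);
		if (!ret && !p)
			ret = -ENOMEM;
	}

	if (ret) {
		free(p);	/* relock failed: do not leak the buffer */
		p = NULL;
	}
	*out = p;
	return ret;
}

/* Test allocator simulating memory pressure: non-blocking always fails. */
static void *pressured_alloc(size_t size, enum alloc_mode mode)
{
	return mode == ALLOC_NOWAIT ? NULL : malloc(size);
}
```

The point of the design is that locks are dropped only when the cheap attempt fails, so the common case pays no unlock/relock cost.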

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Mon, 29 May 2023 20:27:11 +0000 (16:27 -0400)]

bcachefs: Use unlikely() in bch2_err_matches()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Mon, 29 May 2023 06:26:04 +0000 (02:26 -0400)]

bcachefs: Fix error handling in promote path

The promote path had a BUG_ON() for unknown error type, which we're now
seeing: change it to a WARN_ON() - because we're curious what this is -
and otherwise handle it in the normal error path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sun, 28 May 2023 04:59:26 +0000 (00:59 -0400)]

bcachefs: fs-io: Eliminate GFP_NOFS usage

GFP_NOFS doesn't ever make sense. If we're allocating memory it should
be GFP_NOWAIT if btree locks are held, GFP_KERNEL otherwise.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sun, 28 May 2023 05:09:50 +0000 (01:09 -0400)]

bcachefs: bch2_trans_kmalloc no longer allocates memory with btree locks held

When allocating memory, gfp flags should generally be

- GFP_NOWAIT|__GFP_NOWARN if btree locks are held
- GFP_NOFS if in the IO path or otherwise holding resources needed for
IO submission
- GFP_KERNEL otherwise
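
The three rules above read as a small decision function. The enum and function below are hypothetical and exist only to make the rules concrete; checking the btree-lock case first when both conditions hold is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Labels for the three cases listed above (not real gfp_t values). */
enum gfp_choice {
	USE_GFP_NOWAIT_NOWARN,	/* GFP_NOWAIT|__GFP_NOWARN */
	USE_GFP_NOFS,		/* GFP_NOFS */
	USE_GFP_KERNEL,		/* GFP_KERNEL */
};

static enum gfp_choice pick_gfp(bool btree_locks_held, bool in_io_path)
{
	if (btree_locks_held)
		return USE_GFP_NOWAIT_NOWARN;	/* must not block */
	if (in_io_path)
		return USE_GFP_NOFS;		/* must not recurse into the fs */
	return USE_GFP_KERNEL;			/* free to block and reclaim */
}
```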

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sun, 28 May 2023 22:06:27 +0000 (18:06 -0400)]

bcachefs: drop_locks_do()

Add a new helper for the common pattern of:
- trans_unlock()
- do something
- trans_relock()
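
The pattern can be sketched as an ordinary function taking a callback. The real helper is a macro in the kernel tree; the toy `struct trans`, the callback signature and the choice to always relock and report the first error are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy transaction that only tracks whether locks are held. */
struct trans { bool locked; };

static void trans_unlock(struct trans *t) { t->locked = false; }
static int  trans_relock(struct trans *t) { t->locked = true; return 0; }

/*
 * Drop locks, run fn, retake locks. Returns fn's error if it failed,
 * otherwise the result of the relock, so neither error is lost.
 */
static int drop_locks_do_sketch(struct trans *t,
				int (*fn)(struct trans *, void *), void *arg)
{
	int ret, relock_ret;

	trans_unlock(t);
	ret = fn(t, arg);
	relock_ret = trans_relock(t);

	return ret ? ret : relock_ret;
}

/* Example work item: records whether it ran with locks dropped. */
static int blocking_work(struct trans *t, void *arg)
{
	*(bool *) arg = !t->locked;
	return 0;
}
```

Wrapping the sequence in one helper means a caller cannot forget the relock or ignore its return value.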

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sun, 28 May 2023 22:02:38 +0000 (18:02 -0400)]

bcachefs: GFP_NOIO -> GFP_NOFS

GFP_NOIO dates from the bcache days, when we operated under the block
layer. Now, GFP_NOFS is more appropriate, so switch all GFP_NOIO uses to
GFP_NOFS.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sun, 28 May 2023 06:35:34 +0000 (02:35 -0400)]

bcachefs: Ensure bch2_btree_node_get() calls relock() after unlock()

Fix a bug where bch2_btree_node_get() might call bch2_trans_unlock() (in
fill) without calling bch2_trans_relock(); this is a bug when it's done
in the core btree code.

Also, tweak bch2_btree_node_mem_alloc() to drop btree locks before doing
a blocking memory allocation.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sun, 28 May 2023 04:35:35 +0000 (00:35 -0400)]

bcachefs: Avoid __GFP_NOFAIL

We've been using __GFP_NOFAIL for allocating struct bch_folio, our
private per-folio state.

However, that struct is variable size - it holds state for each sector
in the folio, and folios can be quite large now, which means it's
possible for bch_folio to be larger than PAGE_SIZE now.

__GFP_NOFAIL allocations are undesirable in normal circumstances, but
particularly so at >= PAGE_SIZE, and warnings are emitted for that.

So, this patch adds proper error paths and eliminates most uses of
__GFP_NOFAIL. Also, do some more cleanup of gfp flags w.r.t. btree node
locks: we can use GFP_KERNEL, but only if we're not holding btree locks,
and if we are holding btree locks we should be using GFP_NOWAIT.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sun, 28 May 2023 03:19:13 +0000 (23:19 -0400)]

bcachefs: Fix corruption with writeable snapshots

When partially overwriting an extent in an older snapshot, the existing
extent has to be split.

If the existing extent was overwritten in a different (sibling)
snapshot, we have to ensure that the split won't be visible in the
sibling snapshot.

data_update.c already has code for this,
bch2_insert_snapshot_writeouts() - we just need to move it into
btree_update_leaf.c and change bch2_trans_update_extent() to use it as
well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sat, 27 May 2023 23:59:59 +0000 (19:59 -0400)]

bcachefs: Convert -ENOENT to private error codes

As with previous conversions, replace -ENOENT uses with more informative
private error codes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Kent Overstreet [Sat, 27 May 2023 23:55:54 +0000 (19:55 -0400)]

bcachefs: trans_for_each_path_safe()

bch2_btree_trans_to_text() is used on btree_trans objects that are owned
by different threads - when printing out deadlock cycles - so we need a
safe version of trans_for_each_path(), else we race with seeing a
btree_path that was just allocated and not fully initialized.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

A mirror of Linus' kernel repository