git.ipfire.org Git - thirdparty/xfsprogs-dev.git/log

xfsprogs: Release v7.0.0

Update all the necessary files for a v7.0.0 release.

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_scrub: drop the warning about mixed bidirectional codepoints in names

Gedalya complained about receiving warnings about mixed bidirectional
codepoints in a filename:

"First, well-known file name extensions are not internationalized. While
the file name can be in non-latin letters, the extension will be in
latin. Hence you would expect to see file names such as עברית.pdf ."

Gedalya goes on to point out that file names can be created from (say)
the title of an article, which might itself mix RTL and LTR characters.
Both uses are totally fair, but regrettably unfamiliar to 2018-era me.

Unicode TR 36 even weasel-words its own recommendation: "As much as
possible, avoid mixing right-to-left and left-to-right characters in a
single name." Maybe I should have paid more attention to weasel
wording in specifications. :P

Let's fix this by removing the warning altogether.

Reported-by: gedalya@gedalya.net
Link: https://lore.kernel.org/linux-xfs/8961ee4a-3830-498b-a432-5545695db599@gedalya.net/
Link: https://www.unicode.org/reports/tr36/tr36-15.html#Bidirectional_Text_Spoofing
Cc: <linux-xfs@vger.kernel.org> # v4.16.0
Fixes: baa9ed8dca213f ("xfs_scrub: check name for suspicious characters")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_scrub_all: fix deadlock if lsblk produces a lot of output

Patrick Fischer reported a deadlock in find_mounts() that is the result
of lsblk producing so much output that it fills the pipe buffer. When
this happens, lsblk blocks on write()ing the pipe, at which point
waiting for lsblk to exit will also block forever.

Now that we can be reasonably assured that everyone has Python 3.5
(because RHEL6 is long dead), we can replace this whole mess with a call
to subprocess.run that captures the output. The json library can
convert a byte array directly to a python dict, which means we don't
need to concatenate iterated lines or any of that stuff anymore.

Reported-by: patrick.fischer@siedl.net
Link: https://lore.kernel.org/linux-xfs/323580211.1220195.1777554001363.JavaMail.zimbra@siedl.net/
Cc: <linux-xfs@vger.kernel.org> # v4.15.0
Fixes: f1dca11cad1308 ("xfs_scrub: create a script to scrub all xfs filesystems")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs: factor out xfs_attr3_leaf_init

Source kernel commit: e65bb55d7f8c2041c8fdb73cd29b0b4cad4ed847

Factor out wrapper xfs_attr3_leaf_init function, which exported for
external use.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs: factor out xfs_attr3_node_entry_remove

Source kernel commit: ce4e789cf3561c9fac73cc24445bfed9ea0c514b

Factor out wrapper xfs_attr3_node_entry_remove function, which
exported for external use.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs: annotate struct xfs_attr_list_context with __counted_by_ptr

Source kernel commit: e5966096d0856d071269cb5928d6bc33342d2dfd

Add the `__counted_by_ptr` attribute to the `buffer` field of `struct
xfs_attr_list_context`. This field is used to point to a buffer of
size `bufsize`.

The `buffer` field is assigned in:
1. `xfs_ioc_attr_list` in `fs/xfs/xfs_handle.c`
2. `xfs_xattr_list` in `fs/xfs/xfs_xattr.c`
3. `xfs_getparents` in `fs/xfs/xfs_handle.c` (implicitly initialized to NULL)

In `xfs_ioc_attr_list`, `buffer` was assigned before `bufsize`. Reorder
them to ensure `bufsize` is set before `buffer` is assigned, although
no access happens between them.

In `xfs_xattr_list`, `buffer` was assigned before `bufsize`. Reorder
them to ensure `bufsize` is set before `buffer` is assigned.

In `xfs_getparents`, `buffer` is NULL (from zero initialization) and
remains NULL. `bufsize` is set to a non-zero value, but since `buffer`
is NULL, no access occurs.

In all cases, the pointer `buffer` is not accessed before `bufsize` is set.

This patch was generated by CodeMender and reviewed by Bill Wendling.
Tested by running xfstests.

Signed-off-by: Bill Wendling <morbo@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_scrub: warn about unicode variation selectors in names

ArsTechnica recently wrote about a GitHub supply chain attack wherein
non-rendering unicode sequences were embedded in javascript files to
hide payloads that could be decrypted trivially later. While these are
unlikely to appear in file and attribute names, we should warn about
this sort of steganography.

Link: https://arstechnica.com/security/2026/03/supply-chain-attack-using-invisible-code-hits-github-and-other-repositories/
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_quota: display default limits for users with zero usage

When the kernel returns default quota limits for IDs without a dquot
on disk, xfs_quota suppresses the output because it only checks
usage counters (d_bcount, d_icount, d_rtbcount). If all counters
are zero the line is skipped even though non-zero limits apply.

Also check the soft/hard limit fields so that quota output is
displayed whenever limits are configured, even if usage is zero.

This is the userspace companion to the kernel commit:
"xfs: return default quota limits for IDs without a dquot"

Signed-off-by: Ravi Singh <ravising@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

debian: add version control tags to control

Add some vcs fields to point to the official sources for the debian
package (which is to say, upstream xfsprogs).

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

xfs_healer: update weakhandle mountpoint when reconnecting

If we have to reconnect a weakhandle to a different mount point, we
should update the internal mountpoint string to that new value so we
don't have to keep reconnecting it.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_scrub: raise media verification IO limits

To avoid starving other threads of disk IO resources, the read-verify
pool limits the size of verification IOs to limit bandwidth consumption,
and it is willing to over-verify some amount of unwritten media to
reduce the number of verification IO requests sent to the device.

However, these limits were set in 2018 when areal densities were lower
and disk bandwidth was more limited. Increase them now to reduce scan
time on the author's system by 10% in foreground and 50% in background
mode.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_scrub: drop SCSI_VERIFY code from disk.

Now that we have a media verification ioctl in the kernel, drop the
SCSI_VERIFY code, which enables us to drop the dependency on sg and
obviates the need to fix some unit-handling bugs in the HDIO_GETGEO
code.

A subsequent patch will enable larger verification IO sizes, which this
old code cannot handle.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_scrub: clean up device-related error messages

Use consistent terminology for the data/log/rt device in the error
messages.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_scrub: move failmap and other outputs into read_verify_pool

Remove all the indirect verify IO error function calls and whatnot by
making the read_verify_pool track the ranges of failed media and other
problems. Add some new helper functions that report on the outcome of
the read verifciation so that phase6 can report on what happened.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_scrub: index read-verify pools by xfs_device ids

Refactor the read-verify pool array in struct media_verify_state so that
we can index them via enum xfs_device. This will enable further
cleanups in the next few patches.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_scrub: perform media scanning of the log region

Scan the log area for media errors because a defect in a region could
prevent the user from being able to perform log recovery.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

scrub: don't allocate disk for ioctl-based media verify

When the kernel provides the data verification ioctl there is no point
in allocating struct disk and thus opening the underlying devices.
Refactor the code so that we don't have to do that, with the added
benefit of keeping the read verification self-contained in
read_verify.c for the case where the kernel provides the ioctl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
[djwong: break up patch]
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

xfs_scrub: use the verify media ioctl during phase 6 if possible

If the kernel suppots the XFS_IOC_VERIFY_MEDIA ioctl, use that to
perform the phase 6 media scan instead of pwrite or the SCSI VERIFY
command. This enables better integration with xfs_healer and fsnotify;
and reduces the amount of work that userspace has to do.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_scrub: move disk media verification error injection

This isn't really disk-related since it's a knob to make the
read_verify_pool pretend that the media is defective. Move this code
before we add a new media verify path that doesn't require the disk
abstraction.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
[djwong: split off from another hch patch, create a new commit message]
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

scrub: simplify verifier threads calculation

Throwing all CPUs at verifying seems like a bad idea for foreground
scrub, as does abusing I/O opt/min as that says absolutely nothing
about parallelism.  I can't really think of a better way than manually
configuring this except maybe kernel hints.  But the current decisions
are not good defaults, and also are the only user of struct disk for
file systems using the kernel verify ioctl.

The best default seems to be 8, because verification speed increases
diminish above that level.  Note that background service mode now obeys
the thread count restrictions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
[djwong: rebase atop previous patches, leave disk_heads() alone]
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

xfs_scrub: rename nr_io_threads

This variable really describes the number of threads that we should
start up to scan metadata. Rename it to reduce confusion with the media
verification code, which also initiates IO.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

scrub: use enum xfs_device for read verification

Passing the disk pointer and then translating back to an index is a bit
confusing. Rewrite the read_verify and related code to pass around an
enum xfs_device and use that to identify the device.

The disks are placed in an array so that they can be trivially indexed using
the xfs device. Because the value start at 1 this adds an unused slot, but
such a minor waste does not matter in the overall scheme of things.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
[djwong: fix minor merge conflicts]
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

xfs_scrub: don't pass the io_end_arg around everywhere

The value is the same across all callers and read-verify pools, so let's
just pass it into read_verify_pool_alloc instead of making temporary
aliases everywhere and increasing memory usage.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

scrub: simplify the read_verify_pool_alloc interface

Don't pass the miniosize as that's always the fs block size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
[djwong: rebase atop another read verify cleanup]
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

xfs_scrub: move read verification scheduling to phase6.c

Right now there's a weird coupling between read_verify.c and spacemap.c:
Anyone using a read_verify_pool is required to tell the pool how many
threads it's going to use to call read_verify_schedule_io. This is
because the read_verify_pool accumulates verification requests on a
per-thread basis to try to combine adjacent written regions for media
verification. However, the verification requests are made from the
phase6.c callback (check_rmap) that is called from the workers created
by scrub_scan_all_spacemaps.

Yeah, that's confusing: implementation details of spacemap.c must be
inferred by phase6.c and passed to read_verify.c.

Let's fix this by moving the per-thread schedule accumulation to
phase6.c before the next patches constrain the number of IO threads
sending verification requests to the kernel.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

scrub: remove the unused io_disk field in struct read_verify

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

xfs_scrub: fix i18n of the decode_special_owner return value

The function decode_special_owner turns a special fsmap owner into a
printable string value. However, it does not query the gettext catalog
for a local language translation, which leads to annoying multilanguage
failure messages. Fix that by adding the appropriate wrappers.

Cc: <linux-xfs@vger.kernel.org> # v4.15.0
Fixes: b364a9c008fc04 ("xfs_scrub: scrub file data blocks")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_scrub: report truncated devices as media errors

If we encounter a zero-length read of an xfs device but don't hit any
actual media errors, we won't report the truncated device. Fix that.
Also fix the mistake that flushing the verify-pools of the log/rt
devices doesn't actually cause scrub to abort.

Cc: <linux-xfs@vger.kernel.org> # v6.13.0
Fixes: a6e089903f2f58 ("xfs_scrub: tread zero-length read verify as an IO error")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: lift *BYTES helpers to convert.h

Move these byte unit conversion macros to convert.h and amend the macros
to cast to unsigned long long to avoid shifting issues. Now we can use
these same macros throughout the codebase instead of opencoding shifts
and possibly suffering from integer shifting problems.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

mkfs: rename byte unit conversion macros

Rename these macros so that we can promote the generic ones in the next
patch.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: allow bitmap_free to handle a null bitmap pointer

Allow bitmap_free() callers to pass a pointer to a NULL pointer.
This will help subsequent refactorings in xfs_scrub have cleaner
bitmap_free callsites.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

debian: enable xfs_healer on the root filesystem by default

Now that we're finished building autonomous repair, enable the healer
service on the root filesystem by default.  The root filesystem is
mounted by the initrd prior to starting systemd, which is why the
xfs_healer_start service cannot autostart the service for the root
filesystem.

dh_installsystemd won't activate a template service (aka one with an
at-sign in the name) even if it provides a DefaultInstance directive to
make that possible.  Hence we enable this explicitly via the postinst
script.

Note that Debian enables services by default upon package installation,
so this is consistent with their policies.  Their kernel doesn't enable
online fsck, so healer won't do much more than monitor for corruptions
and log them.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

debian/control: listify the build dependencies

This will make it less gross to add more build deps later.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

mkfs: enable online repair if all backrefs are enabled

If all backreferences are enabled in the filesystem, then enable online
repair by default if the user didn't supply any other autofsck setting.
Users might as well get full self-repair capability if they're paying
for the extra metadata.

Note that it's up to each distro to enable the systemd services
according to their own service activation policies. Debian policy is to
enable all systemd services at package installation but they don't
enable online fsck in their Kconfig so the services won't activate.
RHEL and SUSE policy requires sysadmins to enable them explicitly unless
the OS vendor also ships a systemd preset file enabling the services.
Distros without systemd won't get any of the systemd services,
obviously.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_io: add listmount and statmount commands

Add two new commands: one to list all mounts via statmount, now that we
use this in xfs_healer_start, and another to statmount each open file.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_scrub: print systemd service names

Add a hidden switch to xfs_scrub to emit systemd service names for XFS
services targetting filesystems paths instead of opencoding the
computation in things like fstests.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: add a manual page

Add a new section 8 manpage for this service daemon so others can read
about what this program is supposed to do.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: validate that repair fds point to the monitored fs

When xfs_healer reopens a mountpoint to perform a repair, it should
validate that the opened fd points to a file on the same filesystem as
the one being monitored.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: use statmount to find moved filesystems even faster

As noted in the previous patch, it's possible that a mounted filesystem
can move mountpoints between the time of the initial mount (at which
point xfs_healer starts) and when it actually wants to start a repair.
The previous patch fixed that problem by using getmntent to walk
/proc/self/mounts to see if it finds a mount with the same "source"
name, aka data device.

However, this is really slow if there are a lot of filesystems because
we end up wading through a lot of irrelevant information.  However,
statmount() can help us here because as of Linux 7.0 we can open the
passed-in path at startup, call statmount() on it to retrieve the
mnt_id, and then call it again later with that same mnt_id to find the
mountpoint.  Luckily xfs_healthmon didn't get merged until 7.0 so it's
more or less guaranteed to be there if XFS_IOC_HEALTH_MONITOR succeeds.

Obviously if this doesn't work, we can fall back to the slow walk.

This statmount code enables xfs_healer to find a filesystem that has
had its mountpoint moved to a different place in the directory tree
without the use of bind mounts and without needing to walk the entire
mount list:

# mount -t tmpfs urk /mnt
# mount --make-rprivate /mnt
# mkdir -p /mnt/a /mnt/b
# mount /dev/sda /mnt/a
# mount --move /mnt/a /mnt/b

The key here is that the struct mount object is moved, and no new ones
are created.  Therefore, the original mnt_id is still usable.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: use getmntent to find moved filesystems

It's possible that a mounted filesystem can move mountpoints between the
time of the initial mount (at which point xfs_healer starts) and when
it actually wants to start a repair. When this happens,
weakhandle::mountpoint becomes obsolete and opening it will either fail
with ENOENT or the handle revalidation will return ESTALE.

However, we do still have a means to find the mounted filesystem -- the
fsname parameter (aka the path to the data device at mount time). This
is record in /proc/mounts, which means that we can iterate getmntent to
see if we can find the mount elsewhere.

As documented a few patches ago, this would be easier if we had
revocable fds that didn't pin mounts, but that's a very huge ask.

This getmntent code enables xfs_healer to find a filesystem that has
been bind mounted in a new place and the original mountpoint detached:

# mount /dev/sda /mnt
# xfs_healer /mnt &
# mount /mnt /opt --bind
# umount /mnt

The key here is that each bind mount gets a separate struct mount
object.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: run full scrub after lost corruption events or targeted repair failure

If we fail to perform a spot repair of metadata or the kernel tells us
that it lost corruption events due to queue limits, initiate a full run
of the online fsck service to try to fix the error.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: use the autofsck fsproperty to select mode

Make the xfs_healer background service query the autofsck filesystem
property to figure out which operating mode it should use.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: don't start service if kernel support unavailable

Use ExecCondition= in the system service to check if kernel support for
the health monitor is available. If not, we don't want to run the
service, have it fail, and generate a bunch of silly log messages.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: create a service to start the per-mount healer service

Create a daemon to wait for xfs mount events via fsnotify and start up
the per-mount healer service. It's important that we're running in the
same mount namespace as the mount, so we're a fanotify client to avoid
having to filter the mount namespaces ourselves.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: create a per-mount background monitoring service

Create a systemd service definition for our self-healing filesystem
daemon so that we can run it for every mounted filesystem. Add a
hidden switch so that we can print the service unit name for fstests.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: use getparents to look up file names

If the kernel tells about something that happened to a file, use the
GETPARENTS ioctl to try to look up the path to that file for more
ergonomic reporting.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: enable repairing filesystems

Make it so that our health monitoring daemon can initiate repairs in
response to reports of corrupt filesystem metadata.  Repairs are
initiated from the background workers as explained in the previous
patch.

Note that just like xfs_scrub, xfs_healer's ability to repair metadata
relies heavily on back references such as reverse mappings and directory
parent pointers to add redundancy to the filesystem.  Check for these
two features and whine a bit if they are missing, just like scrub.

There's a bit of trickery with the fd that is used to initiate repairs
in the kernel.  Because an open fd will pin the filesystem in memory,
xfs_healer can only hold an open fd to the target filesystem while it's
performing repairs.  Therefore, at startup xfs_healer must sample enough
information about the target filesystem to reconnect to it later on.
Currently, the fs source (aka the data device path) and the root
directory handle are sufficient to do this.

Someday we might be able to have revocable fds, which would eliminate
the need for such efforts in userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: create daemon to listen for health events

Create a daemon program that can listen for and log health events.
Eventually this will be used to self-heal filesystems in real time.

Because events can take a while to process, the main thread reads event
objects from the healthmon fd and dispatches them to a background
workqueue as quickly as it can.  This split of responsibilities is
necessary because the kernel event queue will drop events if the queue
fills up, and each event can take some time to process (logging,
repairs, etc.) so we don't want to lose events.

To be clear, xfs_healer and xfs_scrub are complementary tools:

Scrub walks the whole filesystem, finds stuff that needs fixing or
rebuilding, and rebuilds it.  This is sort of analogous to a patrol
scrub.

Healer listens for metadata corruption messages from the kernel and
issues a targeted repair of that structure.  This is kind of like an
ondemand scrub.

My end goal is that xfs_healer (the service) is active all the time and
can respond instantly to a corruption report, whereas xfs_scrub (the
service) gets run periodically as a cron job.

xfs_healer can decide that it's overwhelmed with problems and start
xfs_scrub to deal with the mess.  Ideally you don't crash the filesystem
and then have to use xfs_repair to smash your way back to a mountable
filesystem.

By default we run xfs_healer as a background service, which means that
we only start two threads -- one to read the events, and another to
process them.  In other words, we try not to use all available hardware
resources for repairs.  The foreground mode switch starts up a large
number of threads to try to increase parallelism, which may or may not
be useful for repairs depending on how much metadata the kernel needs to
scan.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_io: add a media verify command

Add a subcommand to invoke the media verification ioctl to make sure
that we can actually check the storage underneath an xfs filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_io: monitor filesystem health events

Create a subcommand to monitor for health events generated by the kernel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

man2: document the media verification ioctl

Document XFS_IOC_VERIFY_MEDIA, which is a new ioctl for xfs_scrub to
perform media scans on the disks underneath the filesystem. This will
enable media errors to be reported to xfs_healer and fsnotify.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

man2: document the healthmon ioctl

Document the XFS_IOC_HEALTH_MONITOR and
XFS_IOC_HEALTH_FD_ON_MONITORED_FS ioctls.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: add wrappers for listmount and statmount

Add some wrappers for listmount and statmount so that we don't have to
open-code the kernel ABI quirks in every utility program that uses it.
Note that glibc seems to have discussed providing a wrapper in late 2023
but took no action; and the listmount manpage says that there is no
glibc wrapper.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: hoist a couple of service helper functions

Hoist a couple of service/daemon-related helper functions to libfrog so
that we can share the code between xfs_scrub and xfs_healer.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: add support code for starting systemd services programmatically

Add some simple routines for computing the name of systemd service
instances and starting systemd services.  These will be used by the
xfs_healer_start service to start per-filesystem xfs_healer service
instances.

Note that we run systemd helper programs as subprocesses for a couple of
reasons.  First, the path-escaping functionality is not a part of any
library-accessible API, which means it can only be accessed via
systemd-escape(1).  Second, although the service startup functionality
can be reached via dbus, doing so would introduce a new library
dependency.  Systemd is also undergoing a dbus -> varlink RPC transition
so we avoid that mess by calling the cli systemctl(1) program.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: create healthmon event log library functions

Add some helper functions to log health monitoring events so that xfs_io
and xfs_healer can share logging code.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: add a function to grab the path from an open fd and a file handle

handle_walk_paths operates on a file handle, but requires that the fs
has been registered with libhandle via path_to_fshandle.  For a normal
libhandle client this is the desirable behavior because the application
*should* maintain an open fd to the filesystem mount.

However for xfs_healer this isn't going to work well because the healer
mustn't pin the mount while it's running.  It's smart enough to know how
to find and reconnect to the mountpoint, but libhandle doesn't have any
such concept.

Therefore, alter the libfrog getparents code so that xfs_healer can pass
in the mountpoint and reconnected fd without needing libhandle.  All
we're really doing here is trying to obtain a user-visible path for a
file that encountered problems for logging purposes; if it fails, we'll
fall back to logging the inode number.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libxfs: fix printing of cache.c_maxcount.

c_maxcount is usigned, but is being printed as signed.

Signed-off-by: Thiago Becker <tbecker@redhat.com>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

fsr: always print error messages from xfrog_defragrange()

Error messages from xfrog_defragrange() are only printed when
verbose/debug flags are used.

We had reports from users complaining it's hard to find out the error
messages lost in the middle of dozens of other informational messages.

Change packfile() behavior to print errors independently of verbose or
debug flags.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

fsr: package function should check for negative errors

xfrog_defragrange as most other functions from libfrog return
a negative error value, while xfs_fsr's packfile(), expects
a positive error value.

Whenever xfrog_defragrange fails, the switch case always falls into the
default clausule, making the error message pointless.
Fix this by inverting xfrog_defragrange() return value call.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: fix returned valued from xfs_defer_can_append

Source kernel commit: 54fcd2f95f8d216183965a370ec69e1aab14f5da

xfs_defer_can_append returns a bool, it shouldn't be returning
a NULL.

Found by code inspection.

Fixes: 4dffb2cbb483 ("xfs: allow pausing of pending deferred work items")
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Souptick Joarder <souptick.joarder@hpe.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: Remove redundant NULL check after __GFP_NOFAIL

Source kernel commit: 281cb17787d4284a7790b9cbd80fded826ca7739

kzalloc() is called with __GFP_NOFAIL, so a NULL return is not expected.
Drop the redundant !map check in xfs_dabuf_map().
Also switch the nirecs-sized allocation to kcalloc().

Signed-off-by: hongao <hongao@uniontech.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: add static size checks for ioctl UABI

Source kernel commit: 650b774cf94495465d6a38c31bb1a6ce697b6b37

The ioctl structures in libxfs/xfs_fs.h are missing static size checks.
It is useful to have static size checks for these structures as adding
new fields to them could cause issues (e.g. extra padding that may be
inserted by the compiler). So add these checks to xfs/xfs_ondisk.h.

Due to different padding/alignment requirements across different
architectures, to avoid build failures, some structures are ommited from
the size checks. For example, structures with "compat_" definitions in
xfs/xfs_ioctl32.h are ommited.

Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: remove duplicate static size checks

Source kernel commit: e97cbf863d8918452c9f81bebdade8d04e2e7b60

In libxfs/xfs_ondisk.h, remove some duplicate entries of
XFS_CHECK_STRUCT_SIZE().

Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: Add a comment in xfs_log_sb()

Source kernel commit: ac1d977096a17d56c55bd7f90be48e81ac4cec3f

Add a comment explaining why the sb_frextents are updated outside the
if (xfs_has_lazycount(mp) check even though it is a lazycounter.
RT groups are supported only in v5 filesystems which always have
lazycounter enabled - so putting it inside the if(xfs_has_lazycount(mp)
check is redundant.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: remove metafile inodes from the active inode stat

Source kernel commit: 47553dd60b1da88df2354f841a4f71dd4de6478a

The active inode (or active vnode until recently) stat can get much larger
than expected on file systems with a lot of metafile inodes like zoned
file systems on SMR hard disks with 10.000s of rtg rmap inodes.

Remove all metafile inodes from the active counter to make it more useful
to track actual workloads and add a separate counter for active metafile
inodes.

This fixes xfs/177 on SMR hard drives.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: fix code alignment issues in xfs_ondisk.c

Source kernel commit: fd81d3fd01a5ee4bd26a7dc440e7a2209277d14b

Fixup some code alignment issues in xfs_ondisk.c

Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: Refactoring the nagcount and delta calculation

Source kernel commit: a49b7ff63f98ba1c4503869c568c99ecffa478f2

Introduce xfs_growfs_compute_delta() to calculate the nagcount
and delta blocks and refactor the code from xfs_growfs_data_private().
No functional changes.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

Convert 'alloc_obj' family to use the new default GFP_KERNEL argument

Source kernel commit: bf4afc53b77aeaa48b5409da5c8da6bb4eff7f43

This was done entirely with mindless brute force, using

git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/$alloc_objs*(.*$, GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

treewide: Replace kmalloc with kmalloc_obj for non-scalar types

Source kernel commit: 69050f8d6d075dc01af7a5f2f550a8067510366f

This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:     kmalloc(sizeof(TYPE), ...)
are replaced with:      kmalloc_obj(TYPE, ...)

Array allocations:      kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:      kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:      kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>

xfs: give the defer_relog stat a xs_ prefix

Source kernel commit: edf6078212c366459d3c70833290579b200128b1

Make this counter naming consistent with all the others.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: add zone reset error injection

Source kernel commit: 41374ae69ec3a910950d3888f444f80678c6f308

Add a new errortag to test that zone reset errors are handled correctly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: don't validate error tags in the I/O path

Source kernel commit: b8862a09d8256a9037293f1da3b4617b21de26f1

We can trust XFS developers enough to not pass random stuff to
XFS_ERROR_TEST/DELAY. Open code the validity check in xfs_errortag_add,
which is the only place that receives unvalidated error tag values from
user space, and drop the now pointless xfs_errortag_enabled helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: fix spacing style issues in xfs_alloc.c

Source kernel commit: 0ead3b72469e52ca02946b2e5b35fff38bfa061f

Fix checkpatch.pl errors regarding missing spaces around assignment
operators in xfs_alloc_compute_diff() and xfs_alloc_fixup_trees().

Adhere to the Linux kernel coding style by ensuring spaces are placed
around the assignment operator '='.

Signed-off-by: Shin Seong-jun <shinsj4653@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: add a method to replace shortform attrs

Source kernel commit: eaec8aeff31d0679eadb27a13a62942ddbfd7b87

If we're trying to replace an xattr in a shortform attr structure and
the old entry fits the new entry, we can just memcpy and exit without
having to delete, compact, and re-add the entry (or worse use the attr
intent machinery). For parent pointers this only advantages renaming
where the filename length stays the same (e.g. mv autoexec.bat
scandisk.exe) but for regular xattrs it might be useful for updating
security labels and the like.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: speed up parent pointer operations when possible

Source kernel commit: d693534513d8dcdaafcf855986d0fe0476a47462

After a recent fsmark benchmarking run, I observed that the overhead of
parent pointers on file creation and deletion can be a bit high.  On a
machine with 20 CPUs, 128G of memory, and an NVME SSD capable of pushing
750000iops, I see the following results:

$ mkfs.xfs -f -l logdev=/dev/nvme1n1,size=1g /dev/nvme0n1 -n parent=0
meta-data=/dev/nvme0n1           isize=512    agcount=40, agsize=9767586 blks
=                       sectsz=4096  attr=2, projid32bit=1
=                       crc=1        finobt=1, sparse=1, rmapbt=1
=                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
=                       exchange=0   metadir=0
data     =                       bsize=4096   blocks=390703440, imaxpct=5
=                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =/dev/nvme1n1           bsize=4096   blocks=262144, version=2
=                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
=                       rgcount=0    rgsize=0 extents
=                       zoned=0      start=0 reserved=0

So we created 40 AGs, one per CPU.  Now we create 40 directories and run
fsmark:

$ time fs_mark  -D  10000  -S  0  -n  100000  -s  0  -L  8 -d ...
# Version 3.3, 40 thread(s) starting at Wed Dec 10 14:22:07 2025
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 0 bytes, written with an IO size of 16384 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.

parent=0               parent=1
==================     ==================
real    0m57.573s      real    1m2.934s
user    3m53.578s      user    3m53.508s
sys     19m44.440s     sys     25m14.810s

$ time rm -rf ...

parent=0               parent=1
==================     ==================
real    0m59.649s      real    1m12.505s
user    0m41.196s      user    0m47.489s
sys     13m9.566s      sys     20m33.844s

Parent pointers increase the system time by 28% overhead to create 32
million files that are totally empty.  Removing them incurs a system
time increase of 56%.  Wall time increases by 9% and 22%.

For most filesystems, each file tends to have a single owner and not
that many xattrs.  If the xattr structure is shortform, then all xattr
changes are logged with the inode and do not require the the xattr
intent mechanism to persist the parent pointer.

Therefore, we can speed up parent pointer operations by calling the
shortform xattr functions directly if the child's xattr is in short
format.  Now the overhead looks like:

$ time fs_mark  -D  10000  -S  0  -n  100000  -s  0  -L  8 -d ...

parent=0               parent=1
==================     ==================
real    0m58.030s      real    1m0.983s
user    3m54.141s      user    3m53.758s
sys     19m57.003s     sys     21m30.605s

$ time rm -rf ...

parent=0               parent=1
==================     ==================
real    0m58.911s      real    1m4.420s
user    0m41.329s      user    0m45.169s
sys     13m27.857s     sys     15m58.564s

Now parent pointers only increase the system time by 8% for creation and
19% for deletion.  Wall time increases by 5% and 9% now.

Close the performance gap by creating helpers for the attr set, remove,
and replace operations that will try to make direct shortform updates,
and fall back to the attr intent machinery if that doesn't work.  This
works for regular xattrs and for parent pointers.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: reduce xfs_attr_try_sf_addname parameters

Source kernel commit: 1ef7729df1f0c5f7bb63a121164f54d376d35835

The dp parameter to this function is an alias of args->dp, so remove it
for clarity before we go adding new callers.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: strengthen attr leaf block freemap checking

Source kernel commit: 27a0c41f33d8d31558d334b07eb58701aab0b3dd

Check for erroneous overlapping freemap regions and collisions between
freemap regions and the xattr leaf entry array.

Note that we must explicitly zero out the extra freemaps in
xfs_attr3_leaf_compact so that the in-memory buffer has a correctly
initialized freemap array to satisfy the new verification code, even if
subsequent code changes the contents before unlocking the buffer.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: refactor attr3 leaf table size computation

Source kernel commit: a165f7e7633ee0d83926d29e7909fdd8dd4dfadc

Replace all the open-coded callsites with a single static inline helper.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: fix freemap adjustments when adding xattrs to leaf blocks

Source kernel commit: 3eefc0c2b78444b64feeb3783c017d6adc3cd3ce

xfs/592 and xfs/794 both trip this assertion in the leaf block freemap
adjustment code after ~20 minutes of running on my test VMs:

ASSERT(ichdr->firstused >= ichdr->count * sizeof(xfs_attr_leaf_entry_t)
+ xfs_attr3_leaf_hdr_size(leaf));

Upon enabling quite a lot more debugging code, I narrowed this down to
fsstress trying to set a local extended attribute with namelen=3 and
valuelen=71.  This results in an entry size of 80 bytes.

At the start of xfs_attr3_leaf_add_work, the freemap looks like this:

i 0 base 448 size 0 rhs 448 count 46
i 1 base 388 size 132 rhs 448 count 46
i 2 base 2120 size 4 rhs 448 count 46
firstused = 520

where "rhs" is the first byte past the end of the leaf entry array.
This is inconsistent -- the entries array ends at byte 448, but
freemap[1] says there's free space starting at byte 388!

By the end of the function, the freemap is in worse shape:

i 0 base 456 size 0 rhs 456 count 47
i 1 base 388 size 52 rhs 456 count 47
i 2 base 2120 size 4 rhs 456 count 47
firstused = 440

Important note: 388 is not aligned with the entries array element size
of 8 bytes.

Based on the incorrect freemap, the name area starts at byte 440, which
is below the end of the entries array!  That's why the assertion
triggers and the filesystem shuts down.

How did we end up here?  First, recall from the previous patch that the
freemap array in an xattr leaf block is not intended to be a
comprehensive map of all free space in the leaf block.  In other words,
it's perfectly legal to have a leaf block with:

* 376 bytes in use by the entries array
* freemap[0] has [base = 376, size = 8]
* freemap[1] has [base = 388, size = 1500]
* the space between 376 and 388 is free, but the freemap stopped
tracking that some time ago

If we add one xattr, the entries array grows to 384 bytes, and
freemap[0] becomes [base = 384, size = 0].  So far, so good.  But if we
add a second xattr, the entries array grows to 392 bytes, and freemap[0]
gets pushed up to [base = 392, size = 0].  This is bad, because
freemap[1] hasn't been updated, and now the entries array and the free
space claim the same space.

The fix here is to adjust all freemap entries so that none of them
collide with the entries array.  Note that this fix relies on commit
2a2b5932db6758 ("xfs: fix attr leaf header freemap.size underflow") and
the previous patch that resets zero length freemap entries to have
base = 0.

Fixes: 1da177e4c3f415 ("Linux-2.6.12-rc2")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: delete attr leaf freemap entries when empty

Source kernel commit: 6f13c1d2a6271c2e73226864a0e83de2770b6f34

Back in commit 2a2b5932db6758 ("xfs: fix attr leaf header freemap.size
underflow"), Brian Foster observed that it's possible for a small
freemap at the end of the end of the xattr entries array to experience
a size underflow when subtracting the space consumed by an expansion of
the entries array.  There are only three freemap entries, which means
that it is not a complete index of all free space in the leaf block.

This code can leave behind a zero-length freemap entry with a nonzero
base.  Subsequent setxattr operations can increase the base up to the
point that it overlaps with another freemap entry.  This isn't in and of
itself a problem because the code in _leaf_add that finds free space
ignores any freemap entry with zero size.

However, there's another bug in the freemap update code in _leaf_add,
which is that it fails to update a freemap entry that begins midway
through the xattr entry that was just appended to the array.  That can
result in the freemap containing two entries with the same base but
different sizes (0 for the "pushed-up" entry, nonzero for the entry
that's actually tracking free space).  A subsequent _leaf_add can then
allocate xattr namevalue entries on top of the entries array, leading to
data loss.  But fixing that is for later.

For now, eliminate the possibility of confusion by zeroing out the base
of any freemap entry that has zero size.  Because the freemap is not
intended to be a complete index of free space, a subsequent failure to
find any free space for a new xattr will trigger block compaction, which
regenerates the freemap.

It looks like this bug has been in the codebase for quite a long time.

Fixes: 1da177e4c3f415 ("Linux-2.6.12-rc2")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: split and refactor zone validation

Source kernel commit: 19c5b6051ed62d8c4b1cf92e463c1bcf629107f4

Currently xfs_zone_validate mixes validating the software zone state in
the XFS realtime group with validating the hardware state reported in
struct blk_zone and deriving the write pointer from that.

Move all code that works on the realtime group to xfs_init_zone, and only
keep the hardware state validation in xfs_zone_validate. This makes the
code more clear, and allows for better reuse in userspace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: add a xfs_rtgroup_raw_size helper

Source kernel commit: fc633b5c5b80c1d840b7a8bc2828be96582c6b55

Add a helper to figure the on-disk size of a group, accounting for the
XFS_SB_FEAT_INCOMPAT_ZONE_GAPS feature if needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: add missing forward declaration in xfs_zones.h

Source kernel commit: 41263267ef26d315b1425eb9c8a8d7092f9db7c8

Add the missing forward declaration for struct blk_zone in xfs_zones.h.
This avoids headaches with the order of header file inclusion to avoid
compilation errors.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: remove xfs_attr_leaf_hasname

Source kernel commit: 3a65ea768b8094e4699e72f9ab420eb9e0f3f568

The calling convention of xfs_attr_leaf_hasname() is problematic, because
it returns a NULL buffer when xfs_attr3_leaf_read fails, a valid buffer
when xfs_attr3_leaf_lookup_int returns -ENOATTR or -EEXIST, and a
non-NULL buffer pointer for an already released buffer when
xfs_attr3_leaf_lookup_int fails with other error values.

Fix this by simply open coding xfs_attr_leaf_hasname in the callers, so
that the buffer release code is done by each caller of
xfs_attr3_leaf_read.

Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
Reported-by: Mark Tinguely <mark.tinguely@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: directly include xfs_platform.h

Source kernel commit: cf9b52fa7d65362b648927d1d752ec99659f5c43

The xfs.h header conflicts with the public xfs.h in xfsprogs, leading
to a spurious difference in all shared libxfs files that have to
include libxfs_priv.h in userspace. Directly include xfs_platform.h so
that we can add a header of the same name to xfsprogs and remove this
major annoyance for the shared code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: move struct xfs_log_iovec to xfs_log_priv.h

Source kernel commit: 027410591418bded6ba6051151d88fc6fb8a7614

This structure is now only used by the core logging and CIL code.

Also remove the unused typedef.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: add media verification ioctl

Source kernel commit: b8accfd65d31f25b9df15ec2419179b6fa0b21d5

Add a new privileged ioctl so that xfs_scrub can ask the kernel to
verify the media of the devices backing an xfs filesystem, and have any
resulting media errors reported to fsnotify and xfs_healer.

To accomplish this, the kernel allocates a folio between the base page
size and 1MB, and issues read IOs to a gradually incrementing range of
one of the storage devices underlying an xfs filesystem. If any error
occurs, that raw error is reported to the calling process. If the error
happens to be one of the ones that the kernel considers indicative of
data loss, then it will also be reported to xfs_healthmon and fsnotify.

Driving the verification from the kernel enables xfs (and by extension
xfs_scrub) to have precise control over the size and error handling of
IOs that are issued to the underlying block device, and to emit
notifications about problems to other relevant kernel subsystems
immediately.

Note that the caller is also allowed to reduce the size of the IO and
to ask for a relaxation period after each IO.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: check if an open file is on the health monitored fs

Source kernel commit: 8b85dc4090e1c72c6d42acd823514cce67cd54fc

Create a new ioctl for the healthmon file that checks that a given fd
points to the same filesystem that the healthmon file is monitoring.
This allows xfs_healer to check that when it reopens a mountpoint to
perform repairs, the file that it gets matches the filesystem that
generated the corruption report.

(Note that xfs_healer doesn't maintain an open fd to a filesystem that
it's monitoring so that it doesn't pin the mount.)

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: convey file I/O errors to the health monitor

Source kernel commit: dfa8bad3a8796ce1ca4f1d15158e2ecfb9c5c014

Connect the fserror reporting to the health monitor so that xfs can send
events about file I/O errors to the xfs_healer daemon. These events are
entirely informational because xfs cannot regenerate user data, so
hopefully the fsnotify I/O error event gets noticed by the relevant
management systems.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: convey externally discovered fsdax media errors to the health monitor

Source kernel commit: e76e0e3fc9957a5183ddc51dc84c3e471125ab06

Connect the fsdax media failure notification code to the health monitor
so that xfs can send events about that to the xfs_healer daemon.

Later on we'll add the ability for the xfs_scrub media scan (phase 6) to
report the errors that it finds to the kernel so that those are also
logged by xfs_healer.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: convey filesystem shutdown events to the health monitor

Source kernel commit: 74c4795e50f816dbf5cf094691fc4f95bbc729ad

Connect the filesystem shutdown code to the health monitor so that xfs
can send events about that to the xfs_healer daemon.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: convey metadata health events to the health monitor

Source kernel commit: 5eb4cb18e445d09f64ef4b7c8fdc3b2296cb0702

Connect the filesystem metadata health event collection system to the
health monitor so that xfs can send events to xfs_healer as it collects
information.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: convey filesystem unmount events to the health monitor

Source kernel commit: 25ca57fa3624cae9c6b5c6d3fc7f38318ca1402e

In xfs_healthmon_unmount, send events to xfs_healer so that it knows
that nothing further can be done for the filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: create event queuing, formatting, and discovery infrastructure

Source kernel commit: b3a289a2a9397b2e731f334d7d36623a0f9192c5

Create the basic infrastructure that we need to report health events to
userspace.  We need a compact form for recording critical information
about an event and queueing them; a means to notice that we've lost some
events; and a means to format the events into something that userspace
can handle.  Make the kernel export C structures via read().

In a previous iteration of this new subsystem, I wanted to explore data
exchange formats that are more flexible and easier for humans to read
than C structures.  The thought being that when we want to rev (or
worse, enlarge) the event format, it ought to be trivially easy to do
that in a way that doesn't break old userspace.

I looked at formats such as protobufs and capnproto.  These look really
nice in that extending the wire format is fairly easy, you can give it a
data schema and it generates the serialization code for you, handles
endianness problems, etc.  The huge downside is that neither support C
all that well.

Too hard, and didn't want to port either of those huge sprawling
libraries first to the kernel and then again to xfsprogs.  Then I
thought, how about JSON?  Javascript objects are human readable, the
kernel can emit json without much fuss (it's all just strings!) and
there are plenty of interpreters for python/rust/c/etc.

There's a proposed schema format for json, which means that xfs can
publish a description of the events that kernel will emit.  Userspace
consumers (e.g. xfsprogs/xfs_healer) can embed the same schema document
and use it to validate the incoming events from the kernel, which means
it can discard events that it doesn't understand, or garbage being
emitted due to bugs.

However, json has a huge crutch -- javascript is well known for its
vague definitions of what are numbers.  This makes expressing a large
number rather fraught, because the runtime is free to represent a number
in nearly any way it wants.  Stupider ones will truncate values to word
size, others will roll out doubles for uint52_t (yes, fifty-two) with
the resulting loss of precision.  Not good when you're dealing with
discrete units.

It just so happens that python's json library is smart enough to see a
sequence of digits and put them in a u64 (at least on x86_64/aarch64)
but an actual javascript interpreter (pasting into Firefox) isn't
necessarily so clever.

It turns out that none of the proposed json schemas were ever ratified
even in an open-consensus way, so json blobs are still just loosely
structured blobs.  The parsing in userspace was also noticeably slow and
memory-consumptive.

Hence only the C interface survives.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: start creating infrastructure for health monitoring

Source kernel commit: a48373e7d35a89f6f9b39f0d0da9bf158af054ee

Start creating helper functions and infrastructure to pass filesystem
health events to a health monitoring file. Since this is an
administrative interface, we only support a single health monitor
process per filesystem, so we don't need to use anything fancy such as
notifier chains (== tons of indirect calls).

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: fix missing gettext call in current_fixed_time

This error message can be seen by regular users, so it should be looked
up in the internationalization catalogue with gettext.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: hoist some utilities from libxfs

This started with a desire to move the duplicate cmn_err declarations in
libxfs into libfrog/util.h ahead of the patch that renames libxfs_priv.h
to xfs_platform.h.

Then this patch expanded in scope when I realized that there were
several other utility functions that weren't specific to xfs; those go
in libfrog.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libxfs: port various kernel apis from 7.0

Port more kernel APIs from Linux as of 7.0.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libxfs: fix XFS_STATS_DEC

This macro only takes two arguments in the kernel, so fix the definition
here too. All existing callsites #if 0 it into oblivion which is why
we've never noticed, but an upcoming patch in the libxfs sync will not
be so lucky.

Cc: <linux-xfs@vger.kernel.org> # v4.9.0
Fixes: ece930fa14a343 ("xfs: refactor xfs_bunmapi_cow")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>