Eric Sandeen [Thu, 20 Jul 2017 15:51:46 +0000 (10:51 -0500)]
xfs_db: properly set inode type
When we set the type to "inode" the verifier validates multiple
inodes in the current fs block, so setting the buffer size to
that of just one inode is not sufficient and it'll emit spurious
verifier errors for all but the first, as we read off the end:
xfs_db> daddr 99
xfs_db> type inode
Metadata corruption detected at xfs_inode block 0x63/0x200
Metadata corruption detected at xfs_inode block 0x63/0x200
Metadata corruption detected at xfs_inode block 0x63/0x200
Metadata corruption detected at xfs_inode block 0x63/0x200
Metadata corruption detected at xfs_inode block 0x63/0x200
Metadata corruption detected at xfs_inode block 0x63/0x200
Metadata corruption detected at xfs_inode block 0x63/0x200
Use the special set_cur_inode() function for this purpose
as is done in inode_f().
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com>
[sandeen: remove nag/warning printf for now] Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Thu, 20 Jul 2017 15:51:37 +0000 (10:51 -0500)]
xfs_db: redirect printfs when metadumping to stdout
If we're metadumping to stdout, we don't want xfs_db's various dbprintf
statements dumping to stdout because that'll corrupt the metadump.
Therefore, let outf point to the existing stdout and redirect stdout to
stderr for the duration of the dump operation.
Reported-by: David Shaw <dshaw@jabberwocky.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Thu, 20 Jul 2017 15:51:34 +0000 (10:51 -0500)]
mkfs.xfs: allow specification of 0 data stripe width & unit
The "noalign" option works for this too, but it seems reasonable
to allow explicit specification of stripe unit and stripe width
to 0; today, doing so makes the code think it's unspecified,
and so it goes ahead and detects stripe geometry and sets it in the
superblock. That's unexpected and surprising.
Create a new flag that tracks whether a geometry option has been
specified, and if it's set along with 0 values, treat it the
same as if "noalign" had been specified.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Thu, 13 Jul 2017 16:51:27 +0000 (11:51 -0500)]
mkfs: set inode alignment and cluster size for minimum log size estimation
In order for mkfs to calculate the minimum log size correctly, it must
be able to find the transaction type with the largest reservation. The
iunlink transaction reservation size calculation depends on having the
inode cluster size set correctly, which in turn depends on the inode
alignment parameters being set as they will be in the final filesystem.
Therefore we have to set up the inoalignmt field in max_trans_res.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Thu, 13 Jul 2017 16:51:25 +0000 (11:51 -0500)]
mkfs: set agblklog when we're verifying minimum log size
In e5cc9d560a ("mkfs: set agsize prior to calculating minimum log
size"), we set the ag size in the superblock structure so that we can
calculate the maximum btree height correctly. The btree heights are
used to calculate transaction reservation sizes; these sizes are used to
compute the minimum log length; and the minimum log length is checked by
the kernel.
Unfortunately, I didn't realize that some of the btree sizing functions
also depend on the agblklog (log2 of the ag size), so we've been
underestimating the minimum log length allowable, which results in mkfs
formatting filesystems that the kernel refuses to mount.
This can be trivially reproduced by formatting a small (~800M) volume
with rmap and reflink turned on.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Fri, 30 Jun 2017 18:56:29 +0000 (13:56 -0500)]
libxfs: fix fsmap.h inclusion
If we /do/ have HAVE_GETFSMAP defined, we need to include linux/fsmap.h.
Found-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Fri, 30 Jun 2017 16:02:46 +0000 (11:02 -0500)]
xfs_db: identify attr dabtree field types correctly
For whatever reason, the v5 xattr dabtree header fields are mapped to
the directory dabtree header fields, which means that the types are
wrong and hence we cannot use the 'addr' command to step through the
tree. Since the v4 attr dabtree does this correctly, simply port the v5
fields to the attr code too.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Bill O'Donnell [Thu, 29 Jun 2017 18:05:00 +0000 (13:05 -0500)]
xfs_db: improve argument naming in set_cur and set_iocur_type
In set_cur and set_iocur_type, the current naming for arguments
type, block number, and length are t, d, and c, respectively.
Replace these with more intuitive and descriptive names:
type, blknum, and len. Fix type of blknum (xfs_daddr_t) to be
consistent with that of libxfs_readbuf where it's used.
Additionally remove extra blank line in io.c.
Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Bill O'Donnell [Thu, 29 Jun 2017 18:05:00 +0000 (13:05 -0500)]
xfs_db: update buffer size when new type is set
xfs_db doesn't take sector size into account when setting type.
This can result in an errant crc. For example, with a sector size
of 4096:
xfs_db> agi 0
xfs_db> p crc
crc = 0xab85043e (correct)
xfs_db> daddr
current daddr is 16
xfs_db> daddr 42
xfs_db> daddr 16
xfs_db> type agi
Metadata CRC error detected at xfs_agi block 0x10/0x200
xfs_db> p crc
crc = 0xab85043e (bad)
When xfs_db sets the new daddr in daddr_f, it does so with one
BBSIZE sector (512). Changing the type doesn't change the size
of the current buffer in iocur_top, so the checksum is calculated
on the wrong length for the type (when the actual sector size > BBSIZE (512)).
For types with fields, reread the buffer to pick up the correct size for
the new type when it gets set. Facilitate the reread by setting the cursor
with set_cur().
Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[sandeen: fix up long line, clarify subject & comments] Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 21 Jun 2017 22:14:30 +0000 (17:14 -0500)]
xfs_spaceman: add group summary mode
Add a -g switch to show only a per-group summary.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
[sandeen: reset global gflag to 0 for each call] Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 21 Jun 2017 22:14:30 +0000 (17:14 -0500)]
xfs_spaceman: add a man page
Add a manual page describing xfs_spaceman's behavior.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
[sandeen: minor edits] Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Dave Chinner [Wed, 21 Jun 2017 22:14:30 +0000 (17:14 -0500)]
xfs_spaceman: Free space mapping command
Add freespace mapping tool modelled on the xfs_db freesp command.
The advantage of this command over xfs_db is that it can be done
online and is coherent with concurrent modifications to the
filesystem.
This requires the kernel to support the XFS_IOC_GETFSMAP ioctl to map
free space indexes.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick: port from FIEMAPFS to GETFSMAP] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Dave Chinner [Wed, 21 Jun 2017 22:14:30 +0000 (17:14 -0500)]
xfs_spaceman: add new speculative prealloc control
Add a control interface for purging speculative
preallocation via the new ioctls.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick: change xfsctl to ioctl] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
[sandeen: change help to "removes" not "controls"] Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Dave Chinner [Wed, 21 Jun 2017 22:14:30 +0000 (17:14 -0500)]
xfs_spaceman: add FITRIM support
Add support for discarding free space extents via the FITRIM
command. Make it easy to discard a single range, an entire AG or all
the freespace in the filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
[sandeen: minor printf formatting fixup] Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Dave Chinner [Wed, 21 Jun 2017 22:14:30 +0000 (17:14 -0500)]
xfs_spaceman: space management tool
xfs_spaceman is intended as a diagnostic and control tool for space
management operations within XFS, such as examining free space,
managing allocation policies, and issuing block discards on free
space.
The tool is modelled on the xfs_io interface, allowing both
interactive and command line control of the tool, enabling it to be
used in scripts and automated management tools.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick: change xfsctl to ioctl] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 21 Jun 2017 22:14:30 +0000 (17:14 -0500)]
xfs_repair: replace rmap_compare with libxfs version
Now that libxfs has a function to compare rmaps, replace xfs_repair's
helper function with that.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 21 Jun 2017 22:14:30 +0000 (17:14 -0500)]
xfs_io: support the new getfsmap ioctl
Plumb in all the pieces we need to have xfs_io query the GETFSMAP ioctl
for an arbitrary filesystem.
[sandeen: minor doc fixes]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 21 Jun 2017 22:14:30 +0000 (17:14 -0500)]
xfs: introduce the XFS_IOC_GETFSMAP ioctl
Introduce a new ioctl that uses the reverse mapping btree to return
information about the physical layout of the filesystem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[sandeen: rework to have less autoconf magic] Modified-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 21 Jun 2017 22:14:30 +0000 (17:14 -0500)]
libxfs: use crc32c slice-by-8 variant by default
The crc32c code used in xfsprogs was copied directly from the Linux
kernel. However, that code selects slice-by-4 by default, which isn't
the fastest -- that's slice-by-8, which trades table size for speed.
Fix some makefile dependency problems and explicitly select the
algorithm we want. With this patch applied, I see about a 10% drop in
CPU time running xfs_repair.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 21 Jun 2017 22:14:29 +0000 (17:14 -0500)]
libxcmd: add cvt{int, long} to convert strings to int and long
Create some helper functions to convert strings to int or long
in a standard way and work around problems in the C library
atoi/strtol functions.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 21 Jun 2017 22:14:29 +0000 (17:14 -0500)]
xfs_io: refactor numlen into a library function
Refactor the competing numlen implementations into a single library function.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Eric Sandeen [Wed, 21 Jun 2017 22:14:29 +0000 (17:14 -0500)]
xfs_metadump: tag metadump image with informational flags
After the long discussion about warning the user and/or consumer
of xfs_metadumps about dirty logs, it crossed my mind that we
could use the reserved slot in the metadump header to tag the
file with attributes, so the consumer of the metadump knows how
it was created.
This patch adds 4 flags to describe the metadump: The first simply
indicates the presence of any (or no) informational flags.
The old mb_reserved field has been 0 on disk since inception, so
the presence of XFS_METADUMP_INFO_FLAGS indicates that this metadump
may contain the informational flags:
- dirty log
- obfuscated
- full blocks (unused portions of metadata blocks
are not zeroed out).
It then adds a new option to xfs_mdrestore, "-i" to show info,
which can be used with or without a target file:
# xfs_mdrestore -i metadumpfile
metadumpfile: not obfuscated, clean log, full metadata blocks
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Zirong Lang [Wed, 21 Jun 2017 19:40:35 +0000 (14:40 -0500)]
xfs_repair: handle reading superblock from image on larger sector size filesystem
Because xfs_repair uses direct IO, it sometimes can't read the superblock
from an image file that has a smaller sector size than the host filesystem,
especially when the superblock doesn't align with the host filesystem's
sector size.
Fortunately, xfsprogs already has code in xfs_repair.c that does the
isa_file and geometry checks; it turns off O_DIRECT after phase1() if the
sector size is less than the host filesystem's geometry. So move
the isa_file autodetection up before phase1(), and try one geometry
check before phase1() if get_sb() returns OK.
Signed-off-by: Zorro Lang <zlang@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Bill O'Donnell [Wed, 21 Jun 2017 19:40:35 +0000 (14:40 -0500)]
xfs_growfs: ensure target path is an active xfs mountpoint
xfs_growfs manpage clearly states that the target path must be
an active xfs mountpoint.
Current behavior allows xfs_growfs to proceed if the target path
resides anywhere on a mounted xfs filesystem. This could lead to
unexpected results. Unless the target path is an active xfs
mountpoint, reject it. Create a new fs table lookup function which
matches only active xfs mount points, not any file residing within
those mountpoints.
Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Wed, 14 Jun 2017 21:23:32 +0000 (16:23 -0500)]
libxfs: fix xfs_trans_alloc_empty namespace
Do all the right libxfs_ magic for this new function.
Reported-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This structure copy was throwing unaligned access warnings on sparc64:
Kernel unaligned access at TPC[1043c088] xfs_btree_visit_blocks+0x88/0xe0 [xfs]
xfs_btree_copy_ptrs does a memcpy, which avoids it.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
If a malicious user corrupts the refcount btree to cause a cycle between
different levels of the tree, the next mount attempt will deadlock in
the CoW recovery routine while grabbing buffer locks. We can use the
ability to re-grab a buffer that was previously locked to a transaction to
avoid deadlocks, so do that here.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reduce stack usage and get rid of compiler warnings by eliminating
unused variables.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The delalloc -> real block conversion path uses an incorrect
calculation in the case where the middle part of a delalloc extent
is being converted. This is documented as a rare situation because
XFS generally attempts to maximize contiguity by converting as much
of a delalloc extent as possible.
If this situation does occur, the indlen reservation for the two new
delalloc extents left behind by the conversion of the middle range
is calculated and compared with the original reservation. If more
blocks are required, the delta is allocated from the global block
pool. This delta value can be characterized as the difference
between the new total requirement (temp + temp2) and the currently
available reservation minus those blocks that have already been
allocated (startblockval(PREV.br_startblock) - allocated).
The problem is that the current code does not account for previously
allocated blocks correctly. It subtracts the current allocation
count from the (new - old) delta rather than the old indlen
reservation. This means that more indlen blocks than have been
allocated end up stashed in the remaining extents and free space
accounting is broken as a result.
Fix up the calculation to subtract the allocated block count from
the original extent indlen and thus correctly allocate the
reservation delta based on the difference between the new total
requirement and the unused blocks from the original reservation.
Also remove a bogus assert that contradicts the fact that the new
indlen reservation can be larger than the original indlen
reservation.
Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a more
generic name PF_MEMALLOC_NOFS which is in line with an existing
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.
This patch doesn't introduce any functional change.
Link: http://lkml.kernel.org/r/20170306131408.9828-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Nikolay Borisov <nborisov@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
In xfs_reflink_end_cow, we erroneously reserve only enough blocks to
handle adding 1 extent. This is problematic if we fragment free space,
have to do CoW, and then have to perform multiple bmap btree expansions.
Furthermore, the BUI recovery routine doesn't reserve /any/ blocks to
handle btree splits, so log recovery fails after our first error causes
the filesystem to go down.
Therefore, refactor the transaction block reservation macros until we
have a macro that works for our deferred (re)mapping activities, and fix
both problems by using that macro.
With 1k blocks we can hit this fairly often in g/187 if the scratch fs
is big enough.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
XFS only supports the unwritten extent bit in the data fork, and only if
the file system has a version 5 superblock or the unwritten extent
feature bit.
We currently have two routines that validate the invariant:
xfs_check_nostate_extents which returns -EFSCORRUPTED when it's not met,
and xfs_validate_extent that triggers an assert in debug builds.
Both of them iterate over all extents of an inode fork when called,
which isn't very efficient.
This patch instead adds a new helper that verifies the invariant one
extent at a time, and calls it from the places where we iterate over
all extents to convert them from or to the in-memory format. The
callers then return -EFSCORRUPTED when reading invalid extents from
disk, or trigger an assert when writing them to disk.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
We only ever use the normal and unwritten states. And the actual
ondisk format (this enum isn't despite being in xfs_format.h) only
has space for the unwritten bit anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
On some architectures do_div does the pointer compare
trick to make sure that we've sent it an unsigned 64-bit
number. (Why unsigned? I don't know.)
Fix up the few places that squawk about this; in
xfs_bmap_wants_extents() we just used a bare int64_t so change
that to unsigned.
In xfs_adjust_extent_unmap_boundaries() all we wanted was the
mod, and we have an xfs-specific function to handle that w/o
side effects, which includes proper casting for do_div.
In xfs_daddr_to_ag[b]no, we were using the wrong type anyway;
XFS_BB_TO_FSBT returns a block in the filesystem, so use
xfs_rfsblock_t not xfs_daddr_t, and gain the unsignedness
from that type as a bonus.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Now that reflink operations don't set the firstblock value we don't
need the workarounds for non-NULL firstblock values without a prior
allocation.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The main thing that xfs_bmap_remap_alloc does is fixing the AGFL, similar
to what we do in the space allocator. But the reflink code doesn't touch
the allocation btree unlike the normal space allocator, so we couldn't
care less about the state of the AGFL.
So remove xfs_bmap_remap_alloc and just handle the di_nblocks update in
the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Add a new helper to be used for reflink extent list additions instead of
funneling them through xfs_bmapi_write and overloading the firstblock
member in struct xfs_bmalloca and struct xfs_alloc_args.
With some small changes to xfs_bmap_remap_alloc this also means we do
not need a xfs_bmalloca structure for this case at all.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
For the reflink case we'd much rather pass the required arguments than
faking up a struct xfs_bmalloca.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
We never do COW operations for the attr fork, so don't pretend we handle
them.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
bno should be an xfs_fsblock_t, which is 64 bits wide, instead of an
xfs_agblock_t, which truncates the value to 32 bits.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
ndquots is a 32-bit value, and we don't care
about the remainder; there is no reason to use do_div
here, it seems to be the result of a decade+ historical
accident.
Worse, the do_div implementation in userspace breaks
when fed a 32-bit dividend, so we commented it out there
in any case.
Change to simple division, and then we can change
userspace to match, and mandate a 64-bit dividend in
the do_div() in userspace as well.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Introduce a new ioctl that uses the reverse mapping btree to return
information about the physical layout of the filesystem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Add _query_range and _query_all functions to the realtime bitmap
allocator. These two functions are similar in usage to the btree
functions with the same name and will be used for getfsmap and scrub.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Create a helper function that will query all records in a btree.
This will be used by the online repair functions to examine every
record in a btree to rebuild a second btree.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Implement a query_range function for the bnobt and cntbt. This will
be used for getfsmap fallback if there is no rmapbt and by the online
scrub and repair code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Plumb in the pieces (init_high_key, diff_two_keys) necessary to call
query_range on the free space btrees. Remove the debugging asserts
so that we can make queries starting from block 0.
While we're at it, merge the redundant "if (btnum ==" hunks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
"xfs_iread: validation failed for inode 96 failed"
One "failed" seems like enough.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Opencoding the trivial checks makes it much easier to read (and grep..).
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This checks for all the non-normal extent types, including handling both
encodings of delayed allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Zirong Lang [Wed, 3 May 2017 19:54:26 +0000 (14:54 -0500)]
xfs_io: add missed quotation marks in man page
The description of the set_encpolicy command in the xfs_io man page is
missing some quotation marks. It's a little picky, but it does leave the
quotation marks unbalanced.
Signed-off-by: Zorro Lang <zlang@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Zirong Lang [Wed, 3 May 2017 19:54:00 +0000 (14:54 -0500)]
xfs_io: add missed inode command into man page
There's an "inode" command in xfs_io that is used to query physical
information about an inode, but there isn't any information about
it in the xfs_io or other related man pages. So document this command
in the xfs_io man page now.
[sandeen: include some of djwong's edits]
Signed-off-by: Zorro Lang <zlang@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Tue, 2 May 2017 16:12:57 +0000 (11:12 -0500)]
xfs_db: dump metadata btrees via 'btdump'
Introduce a new 'btdump' command that can print the contents of all
blocks of any metadata subtree in the filesystem. This enables
developers and forensic analysts to view a metadata structure without
having to navigate the btree manually.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Tue, 2 May 2017 16:12:55 +0000 (11:12 -0500)]
xfs_db: use iocursor type to guess btree geometry if bad magic
The function block_to_bt plays an integral role in determining the btree
geometry of a block that we want to manipulate with the debugger.
Normally we use the block magic to find the geometry profile, but if the
magic is bad we'll never find it and return NULL. The callers of this
function do not check for NULL and crash.
Therefore, if we can't find a geometry profile matching the magic
number, use the iocursor type to guess the profile and scowl about that
to stdout. This makes it so that even with a corrupt magic we can try
to print the fields instead of crashing the debugger.
[sandeen: comment changes, add magic ASSERT, keep other ASSERTs]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Tue, 2 May 2017 16:12:54 +0000 (11:12 -0500)]
xfs_db: don't print arrays off the end of a buffer
Before printing an array, clamp the array count against the size of the
buffer so that we don't print random heap contents.
[sandeen: re-use fsz variable in call to prfunc]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Tue, 2 May 2017 16:12:53 +0000 (11:12 -0500)]
mkfs.xfs: Assign proper defaults to rmapbt and reflink flags
The "defaultval" field in the options structure was a bit confusing,
so when the rmapbt & reflink options got added, the desire was
to keep them off by default, and "defaultval = 0" got set.
However, the purpose of this field is to define the default value
when the flag is specified with no associated value, i.e.
-m rmapbt vs. -m rmapbt=0 or -m rmapbt=1
Today, the resulting behavior is unexpected, and different from any
other mkfs flags; specifying "-m rmapbt,reflink" results in a
filesystem /without/ those features.
Fix these to be consistent with every other boolean flag in the
mkfs options, so that specifying the flag with no value will
enable the feature.
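The role of "defaultval" can be sketched with a tiny parser for a boolean suboption. This is a simplified illustration, not the mkfs.xfs option code; the struct and function below are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, pared-down description of a boolean suboption. */
struct subopt {
	const char *name;
	int         defaultval;	/* value used when no "=N" is given */
};

/*
 * Parse "name" or "name=N" and return the resulting value,
 * or -1 if the token does not match this suboption.
 */
static int
parse_bool_subopt(const struct subopt *so, const char *token)
{
	size_t len = strlen(so->name);

	if (strncmp(token, so->name, len) != 0)
		return -1;
	if (token[len] == '\0')
		return so->defaultval;	/* bare flag: use defaultval */
	if (token[len] == '=')
		return atoi(token + len + 1) != 0;
	return -1;
}
```

With defaultval = 0 a bare "rmapbt" parses to 0 (feature off), which is the surprising behavior described above; with defaultval = 1 it parses to 1.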
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
chandan [Tue, 2 May 2017 16:12:52 +0000 (11:12 -0500)]
xfs_io: Add statx support for PowerPC architecture
Linux kernel commit f717629c7f834ab2efa05c7dbf0826f1d7c32ade (powerpc:
Wire up statx() syscall) added support for the statx syscall on the
PowerPC architecture. This commit enables the 'statx' sub-command of
xfs_io on PowerPC.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Tue, 2 May 2017 16:12:51 +0000 (11:12 -0500)]
xfs_io: fix statx definition for non-x86 architecture
Apply the same fix to xfs_io as Gwendal did for fstests:
Fix a compilation error for ARM:
__ILP32__ is defined but not __X32_SYSCALL_BIT.
The check should only apply to the x86_64 architecture; statx for other
architectures is not implemented yet - see commit 7acc839c9e57
"statx: Add a system call to make enhanced file info available".
Signed-off-by: Gwendal Grignou <gwendal@chromium.org> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Tue, 2 May 2017 16:12:50 +0000 (11:12 -0500)]
xfs_db: allow write -d to dqblks
Allow write -d to write bad data and recalculate CRC
for dqblks.
Inspired-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Tue, 2 May 2017 16:12:31 +0000 (11:12 -0500)]
xfs_db: allow write -d to inodes
Add a helper function to xfs_db so that we can recalculate the CRC of an
inode whose field we just wrote. This enables us to write arbitrary
values with a good CRC for the purpose of checking the read verifiers on
a v5 filesystem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Tue, 2 May 2017 16:12:31 +0000 (11:12 -0500)]
xfs_db.8: document write -d option
The xfs_db write "-d" option allows us to write bad
data with a good CRC, as added in
86769b3 xfs_db: allow recalculating CRCs on invalid metadata
but never documented, so do that now.
[sandeen: split out doc patch]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Mon, 10 Apr 2017 22:40:58 +0000 (17:40 -0500)]
xfs.5: document barrier deprecation in xfs
Since kernel v4.10, the barrier mount option is deprecated:
2291dab2 xfs: Always flush caches when integrity is required
4cf4573d xfs: deprecate barrier/nobarrier mount option
Document this fact in the xfs(5) manpage.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Mon, 10 Apr 2017 22:34:45 +0000 (17:34 -0500)]
xfs_io: hook up statx
Wire up the statx syscall to xfs_io.
xfs_io> help statx
statx [-v|-r][-m basic | -m all | -m <mask>][-FD] -- extended statistics on the currently open file
Display extended file status.
Options:
-v -- More verbose output
-r -- Print raw statx structure fields
-m mask -- Specify the field mask for the statx call
(can also be 'basic' or 'all'; default STATX_ALL)
-D -- Don't sync attributes with the server
-F -- Force the attributes to be sync'd with the server
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Mon, 10 Apr 2017 22:33:38 +0000 (17:33 -0500)]
xfs_io: refactor stat functions, add raw dump
This adds a "-r" raw structure dump to stat options, and
factors the code a bit; statx will also use print_file_info
and print_xfs_info.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Mon, 10 Apr 2017 22:33:15 +0000 (17:33 -0500)]
xfs_io: move stat functions to new file
Adding statx will add a bit of code, so break stat-related
functions out of open.c into their own new file.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Mon, 10 Apr 2017 22:32:04 +0000 (17:32 -0500)]
xfsprogs: fix build dep on configure.ac
Zorro reported that this sequence:
# git checkout v4.9.0; make realclean; make
# git checkout v4.10.0; make clean; make
fails:
...
Building libxfs
[CC] gen_crc32table
gcc: error: @BUILD_CFLAGS@: No such file or directory
gmake[3]: *** No rule to make target `crc32table.h', needed by `crc32selftest'. Stop.
This is because
0a71e38 build: Allow compiling xfsprogs in a cross compile environment
added the new BUILD_CFLAGS to configure.ac, and unless we re-run
autotools, that variable does not get substituted when
include/builddefs gets built.
(This can be worked around by "make realclean" and then everything
gets regenerated.)
The configure script is generated from configure.ac, so adding
a Make dependency here should resolve such issues in the future.
Reported-by: Zorro Lang <zlang@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Mon, 10 Apr 2017 22:32:04 +0000 (17:32 -0500)]
xfs_repair: pass btnum not magic to phase5 functions
When ed849ef xfs: remove boilerplate around xfs_btree_init_block
was merged from kernelspace, I made only minimal changes at the
libxfs boundary to accommodate the new libxfs_btree_init_block
interface.
We can chase that up a bit higher and remove more code by
passing in btnum from the start; we can also remove the
"finobt" argument from build_ino_tree() because that is
known from the type of tree passed in.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Mon, 10 Apr 2017 22:32:04 +0000 (17:32 -0500)]
xfs_repair: warn about dirty log with -n option
When looking at xfs_repair -n output today, we have no idea if
reported errors may be due to an un-replayed dirty log. If this
is the case, mention it in the output.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Mathias Troiden reproduced a filesystem corruption that resulted in
a zero-sized local format symlink inode. This is invalid state and
results in an inode that cannot be accessed or modified.
The kernel detects this problem on inode access, fails and warns the
user to umount and run xfs_repair. Unfortunately, xfs_repair doesn't
even detect the problem. Thus the user has no path to recovery.
Update xfs_repair to check for invalid zero-sized symlinks and flag
them as corrupted. This results in tossing the inode, but returns
the fs to a valid state.
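The check itself is conceptually small. A sketch, assuming a simplified view of the inode that carries only the fields the test needs (the struct and names are hypothetical, not the xfs_repair code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of the inode fields the check needs. */
enum fork_format { FMT_LOCAL, FMT_EXTENTS, FMT_BTREE };

struct inode_summary {
	bool             is_symlink;
	enum fork_format format;
	uint64_t         size;	/* length of the link target in bytes */
};

/* Return true if the inode should be flagged as corrupt. */
static bool
symlink_is_corrupt(const struct inode_summary *ino)
{
	assert(ino != NULL);
	if (!ino->is_symlink)
		return false;
	/* A local-format symlink must hold at least one byte of target. */
	return ino->format == FMT_LOCAL && ino->size == 0;
}
```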
Reported-by: Mathias Troiden <mathias.troiden@gmail.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Mon, 10 Apr 2017 22:31:55 +0000 (17:31 -0500)]
xfsprogs: remove unused libxfs helper #defines
There are several #defines which aren't used anywhere
in either xfsprogs or the kernel, so remove them from the
libxfs/libxfs_priv.h helper file.
This does leave a few (currently) unused defines in place
if they are still used in the kernel, in the hope that future
libxfs changes might Just Work.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The inline directory verifiers should be called on the inode fork data,
which means after iformat_local on the read side, and prior to
ifork_flush on the write side. This makes the fork verifier more
consistent with the way buffer verifiers work -- i.e. they will operate
on the memory buffer that the code will be reading and writing directly.
Furthermore, revise the verifier function to return -EFSCORRUPTED so
that we don't flood the logs with corruption messages and assert
notices. This has been a particular problem with xfs/348, which
triggers the XFS_WANT_CORRUPTED_RETURN assertions, which halts the
kernel when CONFIG_XFS_DEBUG=y. Disk corruption isn't supposed to do
that, at least not in a verifier.
Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
xfs_extent_busy_flush is a void function, so don't reduce it to zero.
This shuts up gcc warnings about do-nothing statements.
[sandeen: switch to more common ((void)0) paradigm]
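For illustration, a userspace stub for a void function can be written as below. The macro name is a hypothetical stand-in, and the argument list is assumed for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace stub for a kernel function that returns void.
 * Expanding to a bare 0 gives a "statement with no effect" warning;
 * ((void)0) is an explicitly discarded expression, so gcc stays quiet.
 */
#define stub_extent_busy_flush(mp, pag, gen)	((void)0)

static int
caller(void)
{
	int calls = 0;

	stub_extent_busy_flush(NULL, NULL, 0);	/* expands to ((void)0); */
	calls++;
	return calls;
}
```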
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
When we're reading or writing the data fork of an inline directory,
check the contents to make sure we're not overflowing buffers or eating
garbage data. xfs/348 corrupts an inline symlink into an inline
directory, triggering a buffer overflow bug.
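The kind of bounds checking involved can be sketched on a toy layout. Note that the layout below is deliberately simplified and is NOT the real shortform directory format; the error constant is a stand-in as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EFSCORRUPTED 117	/* stand-in error code for the sketch */

/*
 * Toy layout (not the on-disk XFS format):
 *   byte 0    : number of entries
 *   per entry : 1 byte name length, then that many name bytes
 * Returns 0 if every entry fits in the fork, else -SKETCH_EFSCORRUPTED.
 */
static int
verify_inline_dir(const uint8_t *fork, size_t fork_size)
{
	size_t off = 1;
	unsigned int i, count;

	if (fork_size < 1)
		return -SKETCH_EFSCORRUPTED;
	count = fork[0];

	for (i = 0; i < count; i++) {
		size_t namelen;

		if (off >= fork_size)		/* no room for the length byte */
			return -SKETCH_EFSCORRUPTED;
		namelen = fork[off];
		if (namelen == 0)		/* empty names are not valid */
			return -SKETCH_EFSCORRUPTED;
		if (namelen > fork_size - off - 1)	/* name runs off the end */
			return -SKETCH_EFSCORRUPTED;
		off += 1 + namelen;
	}
	/* Entries must account for the whole fork. */
	return off == fork_size ? 0 : -SKETCH_EFSCORRUPTED;
}
```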
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
When a reflink operation causes the bmap code to allocate a btree block
we're currently doing single-AG allocations due to having ->firstblock
set and then try any higher AG due to a little reflink quirk we've put in
when adding the reflink code. But given that we do not have a minleft
reservation of any kind in this AG we can still not have any space in
the same or higher AG even if the file system has enough free space.
To fix this use a XFS_ALLOCTYPE_FIRST_AG allocation in this fall back
path instead.
[And yes, we need to redo this properly instead of piling hacks over
hacks. I'm working on that, but it's not going to be a small series.
In the meantime this fixes the customer reported issue]
Also add a warning for failing allocations to make it easier to debug.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Commit fa7f138 ("xfs: clear delalloc and cache on buffered write
failure") fixed one regression in the iomap error handling code and
exposed another. The fundamental problem is that if a buffered write
is a rewrite of preexisting delalloc blocks and the write fails, the
failure handling code can punch out preexisting blocks with valid
file data.
This was reproduced directly by sub-block writes in the LTP
kernel/syscalls/write/write03 test. A first 100 byte write allocates
a single block in a file. A subsequent 100 byte write fails and
punches out the block, including the data successfully written by
the previous write.
To address this problem, update the ->iomap_begin() handler to
distinguish newly allocated delalloc blocks from preexisting
delalloc blocks via the IOMAP_F_NEW flag. Use this flag in the
->iomap_end() handler to decide when a failed or short write should
punch out delalloc blocks.
This introduces the subtle requirement that ->iomap_begin() should
never combine newly allocated delalloc blocks with existing blocks
in the resulting iomap descriptor. This can occur when a new
delalloc reservation merges with a neighboring extent that is part
of the current write, for example. Therefore, drop the
post-allocation extent lookup from xfs_bmapi_reserve_delalloc() and
just return the record inserted into the fork. This ensures only new
blocks are returned and thus that preexisting delalloc blocks are
always handled as "found" blocks and not punched out on a failed
rewrite.
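The decision made on the ->iomap_end() side can be sketched as follows. The struct, the flag value, and the helper name are hypothetical; only the rule (punch only blocks this write newly allocated) comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_IOMAP_F_NEW 0x01	/* hypothetical flag value */

struct sketch_iomap {
	unsigned int flags;
};

/*
 * Decide whether delalloc blocks should be punched out after a write
 * of `length` bytes of which only `written` bytes succeeded.
 */
static bool
should_punch_delalloc(const struct sketch_iomap *iomap,
		      size_t length, size_t written)
{
	if (written >= length)
		return false;	/* full write, nothing to undo */
	/* Only blocks newly allocated for this write may be punched. */
	return (iomap->flags & SKETCH_IOMAP_F_NEW) != 0;
}
```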
Reported-by: Xiong Zhou <xzhou@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
XFS_ALLOCTYPE_ANY_AG was only used for the RT allocator and is unused
now, and XFS_ALLOCTYPE_START_AG has been unused for a while.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
In various places we currently assert that xfs_bmap_btalloc allocates
from the same AG as the firstblock value passed in, unless it's either
NULLAGNO or the dop_low flag is set. But the reflink code does not
fully follow this convention as it passes in firstblock purely as
a hint for the allocator without actually having previous allocations
in the transaction, and without having a minleft check on the current
AG, leading to the assert firing on a very full and heavily used
file system. As even the reflink code only allocates from equal or
higher AGs for now we can simplify the check to always allow for equal
or higher AGs.
Note that we need to eventually split the two meanings of the firstblock
value. At that point we can also allow the reflink code to allocate
from any AG instead of limiting it in any way.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
When block size is larger than inode cluster size, the call to
XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs
would have set xfs_sb->sb_inoalignmt to 0. This causes
xfs_ialloc_cluster_alignment() to return 0. Due to this
args.minalignslop (in xfs_ialloc_ag_alloc()) gets the unsigned
equivalent of -1 assigned to it. This later causes alloc_len in
xfs_alloc_space_available() to have a value of 0. In such a scenario
when args.total is also 0, the assert statement "ASSERT(args->maxlen >
0);" fails.
This commit fixes the bug by replacing the call to XFS_B_TO_FSBT() in
xfs_ialloc_cluster_alignment() with a call to xfs_icluster_size_fsb().
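The arithmetic behind the failure can be reproduced in isolation. The sketch below assumes a 64 KiB block size (block log 16), an 8 KiB inode cluster, and a slop computed as "alignment - 1"; those values and both helper names are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes to filesystem blocks, truncating (the behavior described above). */
static uint32_t
bytes_to_fsb_trunc(uint32_t bytes, unsigned int blocklog)
{
	return bytes >> blocklog;
}

/* Slop computed as "alignment - 1" in unsigned 32-bit arithmetic. */
static uint32_t
align_slop(uint32_t alignment)
{
	return alignment - 1;	/* wraps to UINT32_MAX when alignment is 0 */
}
```

Here bytes_to_fsb_trunc(8192, 16) is 0, and align_slop(0) wraps to UINT32_MAX, the "unsigned equivalent of -1" mentioned in the description. With a 4 KiB block size (block log 12) the same cluster yields 2 blocks and a slop of 1.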
Suggested-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Certain workloads that punch holes into speculative preallocation can
cause delalloc indirect reservation splits when the delalloc extent is
split in two. If further splits occur, an already short-handed extent
can be split into two in a manner that leaves zero indirect blocks for
one of the two new extents. This occurs because the shortage is large
enough that the xfs_bmap_split_indlen() algorithm completely drains the
requested indlen of one of the extents before it honors the existing
reservation.
This ultimately results in a warning from xfs_bmap_del_extent(). This
has been observed during file copies of large, sparse files using 'cp
--sparse=always.'
To avoid this problem, update xfs_bmap_split_indlen() to explicitly
apply the reservation shortage fairly between both extents. This smooths
out the overall indlen shortage and defers the situation where we end up
with a delalloc extent with zero indlen reservation to extreme
circumstances.
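One way to picture "applying the shortage fairly" is to scale both requests by the same ratio and then hand out any blocks lost to rounding. This is an illustrative sketch under that assumption, not the actual xfs_bmap_split_indlen() code; the helper name and parameters are hypothetical, and overflow of the multiplication is ignored.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale two requested reservations down so that their sum fits in
 * `avail`. On return *want1 and *want2 hold the granted amounts.
 */
static void
split_reservation(uint64_t avail, uint64_t *want1, uint64_t *want2)
{
	uint64_t total = *want1 + *want2;
	uint64_t len1, len2, left;

	if (total <= avail)
		return;		/* no shortage: both get their full request */

	/* Apply the same ratio to both requests. */
	len1 = *want1 * avail / total;
	len2 = *want2 * avail / total;

	/* Hand out blocks lost to rounding, alternating between the two. */
	left = avail - (len1 + len2);
	while (left) {
		if (len1 < *want1) {
			len1++;
			left--;
		}
		if (!left)
			break;
		if (len2 < *want2) {
			len2++;
			left--;
		}
	}
	*want1 = len1;
	*want2 = len2;
}
```

With 3 blocks available and two requests of 4 each, this grants 2 and 1 rather than draining one extent to zero.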
Reported-by: Patrick Dung <mpatdung@gmail.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
When a delalloc extent is created, it can be merged with pre-existing,
contiguous, delalloc extents. When this occurs,
xfs_bmap_add_extent_hole_delay() merges the extents along with the
associated indirect block reservations. The expectation here is that the
combined worst case indlen reservation is always less than or equal to
the indlen reservation for the individual extents.
This is not always the case, however, as existing extents can hold less than
the expected indlen reservation if the extent was previously split due
to a hole punch. If a new extent merges with such an extent, the total
indlen requirement may be larger than the sum of the indlen reservations
held by both extents.
xfs_bmap_add_extent_hole_delay() assumes that the worst case indlen
reservation is always available and assigns it to the merged extent
without consideration for the indlen held by the pre-existing extent. As
a result, the subsequent xfs_mod_fdblocks() call can attempt an
unintentional allocation rather than a free (indicated by an ASSERT()
failure). Further, if the allocation happens to fail in this context,
the failure goes unhandled and creates a filesystem wide block
accounting inconsistency.
Fix xfs_bmap_add_extent_hole_delay() to function as designed. Cap the
indlen reservation assigned to the merged extent to the sum of the
indlen reservations held by each of the individual extents.
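The cap amounts to taking a minimum. A sketch with hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Reservation to assign to a merged delalloc extent.
 * worst_case:           worst-case indlen for the merged length
 * held_left, held_new:  reservations the two extents actually hold
 */
static uint64_t
merged_indlen(uint64_t worst_case, uint64_t held_left, uint64_t held_new)
{
	uint64_t held = held_left + held_new;

	return worst_case < held ? worst_case : held;
}
```

Because the result never exceeds what is already held, the difference handed back to the free block counter is never negative, so the merge cannot turn into an allocation.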
Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Currently we force the log and simply try again if we hit a busy extent,
but especially with online discard enabled it might take a while after
the log force for the busy extents to disappear, and we might have
already completed our second pass.
So instead we add a new waitqueue and a generation counter to the pag
structure so that we can do wakeups once we've removed busy extents,
and we replace the single retry with an unconditional one - after
all we hold the AGF buffer lock, so no other allocations or frees
can be racing with us in this AG.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
When we allocate COW fork blocks for direct I/O writes we currently first
create a delayed allocation, and then convert it to a real allocation
once we've got the delayed one.
As there is no good reason for that, this patch instead makes us call
xfs_bmapi_write from the COW allocation path. The only interesting bits
are a few tweaks to the low-level allocator to allow for this, most notably
the need to remove the call to xfs_bmap_extsize_align for the cowextsize
in xfs_bmap_btalloc - for the existing convert case it's a no-op, but
for the direct allocation case it would blow up our block reservation
way beyond what we reserved for the transaction.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
In the data fork, we only allow extents to perform the following state
transitions:
delay -> real <-> unwritten
There's no way to move directly from a delalloc reservation to an
/unwritten/ allocated extent. However, for the CoW fork we want to be
able to do the following to each extent:
delalloc -> unwritten -> written -> remapped to data fork
This will help us to avoid a race in the speculative CoW preallocation
code between a first thread that is allocating a CoW extent and a second
thread that is remapping part of a file after a write. In order to do
this, however, we need two things: first, we have to be able to
transition from da to unwritten, and second the function that converts
between real and unwritten has to be made aware of the cow fork. Do
both of those things.
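The permitted transitions can be written down as a small predicate. This sketch assumes the CoW fork keeps the data fork transitions and adds delalloc -> unwritten on top; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum ext_state { EXT_DELALLOC, EXT_REAL, EXT_UNWRITTEN };
enum fork_kind { FORK_DATA, FORK_COW };

static bool
transition_allowed(enum fork_kind fork, enum ext_state from,
		   enum ext_state to)
{
	/* delay -> real */
	if (from == EXT_DELALLOC && to == EXT_REAL)
		return true;
	/* real <-> unwritten */
	if ((from == EXT_REAL && to == EXT_UNWRITTEN) ||
	    (from == EXT_UNWRITTEN && to == EXT_REAL))
		return true;
	/* New, CoW fork only: delalloc -> unwritten */
	if (fork == FORK_COW && from == EXT_DELALLOC && to == EXT_UNWRITTEN)
		return true;
	return false;
}
```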
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
We can't handle a bmbt that's taller than BTREE_MAXLEVELS, and there's
no such thing as a zero-level bmbt (for that we have extents format),
so if we see this, send back an error code.
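A sketch of the range check, with an assumed level limit and a stand-in error code (both constants are illustrative and not copied from the XFS headers):

```c
#include <assert.h>

#define SKETCH_BTREE_MAXLEVELS	9	/* assumed limit for illustration */
#define SKETCH_EFSCORRUPTED	117	/* stand-in error code */

/* Return 0 for a usable bmbt height, negative error otherwise. */
static int
check_bmbt_level(unsigned int level)
{
	if (level == 0 || level > SKETCH_BTREE_MAXLEVELS)
		return -SKETCH_EFSCORRUPTED;
	return 0;
}
```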
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>