git.ipfire.org Git - thirdparty/xfsprogs-dev.git/log
Mark Tinguely [Tue, 20 May 2014 08:30:11 +0000 (18:30 +1000)] 
libxfs: don't free xfs_inode until complete

Originally, the xfs_inode was released upon the first
call to xfs_trans_cancel, xfs_trans_commit, or
inode_item_done. This code used the log item lock field
to prevent the release of the inode on the next call to
one of the above functions. This is an unusual use of the
log item lock field, which is supposed to specify which lock
is to be released on transaction commit or cancel. User
space does not perform locking in transactions.

Unfortunately, this breaks any code that relies on multiple
transaction operations. For example, adding an extended
attribute to an inode that does not have an attribute fork
will fail:

 # xfs_db -x XFS_DEVICE
 xfs_db> inode INO_NUM
 xfs_db> attr_set newattribute

This patch does the following:
 1) Removes the iput from the transaction completion and
    requires that the xfs_inode allocators call IRELE()
    when they are done with the pointer. The real time
    inodes are pointed to by the xfs_mount and have a longer
    lifetime.
 2) Removes libxfs_trans_iput() because transaction entries
    are removed in transaction commit and cancel.
 3) Removes libxfs_trans_ihold() which is an obsolete interface.
 4) Removes the now unneeded ili_ilock_flags from the
    xfs_inode_log_item structure.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Mark Tinguely [Tue, 20 May 2014 08:30:01 +0000 (18:30 +1000)] 
libxfs: remove unused argument in trans_iput

Remove the unused second argument to xfs_iput() and
xfs_trans_iput().

Introduce the define "IRELE()" and use it in place of xfs_iput().

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Dave Chinner [Fri, 16 May 2014 05:45:33 +0000 (15:45 +1000)] 
xfsprogs: v3.2.0 release

Update all the version and changelog files for v3.2.0.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Dave Chinner [Fri, 9 May 2014 04:55:19 +0000 (14:55 +1000)] 
xfsprogs: v3.2.0-rc3 release

Update all the various version and changelog files for a v3.2.0-rc3
release.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Dave Chinner [Fri, 9 May 2014 04:50:28 +0000 (14:50 +1000)] 
repair: don't grind CPUs with large extent lists

When repairing a large filesystem with fragmented files, xfs_repair
can grind to a halt, burning multiple CPUs in blkmap_set_ext().
blkmap_set_ext() inserts extents into the blockmap for the inode
fork and it keeps them in order, even if the inserts are not done in
order.

The ordered insert is highly inefficient - it starts at the first
extent and simply walks the array to find the insertion point, i.e.
it is an O(n) operation. When we have a fragmented file with a
large number of extents, the entire mapping operation becomes
very expensive.

The thing is, we are doing the insertion from an *ordered btree
scan* which is inserting the extents in ascending offset order.
IOWs, we are always inserting the extent at the end of the array
after searching the entire array. i.e. the mapping operation cost is
O(N^2).

Fix this simply by reversing the order of the insert slot search.
Start at the end of the blockmap array when we do almost all
insertions, which brings the overhead of each insertion down to O(1)
complexity. This, in turn, reduces the overall map building
operation to O(N), so performance degrades linearly with
increasing extent counts rather than quadratically.
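
The reversed slot search can be sketched as a small stand-alone model. The ext_t type and blkmap_insert() helper below are invented for illustration; this is not the actual blkmap_set_ext() code:

```c
#include <assert.h>

typedef struct { long startoff; long blockcount; } ext_t;

/*
 * Insert an extent into a sorted array, searching for the slot from
 * the END of the array.  For inserts arriving in ascending offset
 * order (as they do from an ordered btree scan) the loop terminates
 * immediately, making each insert O(1) rather than O(n).
 */
static void
blkmap_insert(ext_t *exts, int *nexts, ext_t new)
{
	int i = *nexts;

	/* walk backwards to the first extent below the new one */
	while (i > 0 && exts[i - 1].startoff > new.startoff) {
		exts[i] = exts[i - 1];	/* shift up to make room */
		i--;
	}
	exts[i] = new;
	(*nexts)++;
}
```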

While there, I noticed that the growing of the blkmap array was only
done 4 extents at a time. When we are dealing with files that may
have hundreds of thousands of extents, growing the map only 4 extents
at a time requires excessive amounts of reallocation. Reduce the
reallocation rate by increasing the grow increment according to how
large the array currently is.
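
A size-proportional growth policy might look like the following; grow_increment() and its thresholds are hypothetical, for illustration only:

```c
#include <assert.h>

/*
 * Choose a growth increment that scales with the current array size,
 * so the number of reallocations grows logarithmically with the final
 * extent count instead of linearly (as a fixed +4 increment would).
 */
static int
grow_increment(int current_size)
{
	if (current_size < 16)
		return 4;		/* small arrays: grow gently */
	return current_size / 2;	/* large arrays: grow by 50% */
}
```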

The result is that the test filesystem (27TB, 30M inodes, at ENOSPC)
takes 5m10s to *fully repair* on my test system, rather than getting
15 (of 60) AGs into phase three and sitting there burning 3-4 CPUs
making no progress for over half an hour.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Dave Chinner [Fri, 9 May 2014 04:50:12 +0000 (14:50 +1000)] 
repair: don't unlock prefetch tree to read discontig buffers

The way discontiguous buffers are currently handled in prefetch is
by unlocking the prefetch tree and reading them one at a time in
pf_read_discontig(), inside the normal loop of searching for buffers
to read in a more optimized fashion.

But by unlocking the tree, we allow other threads to come in and
find buffers which we've already stashed locally on our bplist[].
If 2 threads think they own the same set of buffers, they may both
try to delete them from the prefetch btree, and the second one to
arrive will not find it, resulting in:

        fatal error -- prefetch corruption

To fix this, simply abort the buffer gathering loop when we come
across a discontiguous buffer, process the gathered list as per
normal, and then after running the large optimised read, check to
see if the last buffer on the list is a discontiguous buffer.
If it is discontiguous, then issue the discontiguous buffer read
while the locks are not held. We only ever have one discontiguous
buffer per read loop, so it is safe just to check the last buffer in
the list.

The fix is loosely based on a patch provided by Eric Sandeen, who
did all the hard work of finding the bug and demonstrating how to
fix it.

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Nathan Scott [Sat, 3 May 2014 06:13:00 +0000 (16:13 +1000)] 
Update debian changelog in preparation for pending release

In particular, add a note to the changelog about the added
dh_autoconf use so the BTS will be updated appropriately.

Signed-off-by: Nathan Scott <nathans@debian.org>
Nathan Scott [Sat, 3 May 2014 06:10:38 +0000 (16:10 +1000)] 
Fix msgfmt warning when building the German translation

Use the same header present in the Polish language files to
resolve the following warning in the de.po build:
    [MSGFMT] de.mo
de.po:7: warning: header field 'Language' missing in header

Signed-off-by: Nathan Scott <nathans@debian.org>
Nathan Scott [Sat, 3 May 2014 06:09:03 +0000 (16:09 +1000)] 
Fix 32 bit build warning in libxfs, xfs_daddr_t printing

Add the usual type casts to resolve the following warnings:
rdwr.c: In function 'libxfs_getbufr_map':
rdwr.c:499:4: warning: format '%lx' expects argument of type 'long unsigned int', but argument 5 has type 'xfs_daddr_t' [-Wformat]
rdwr.c:499:4: warning: format '%lx' expects argument of type 'long unsigned int', but argument 6 has type 'xfs_daddr_t' [-Wformat]
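
The usual fix for this class of warning is an explicit cast (or the <inttypes.h> format macros). A minimal sketch; format_daddr() is a made-up helper, not the actual rdwr.c code:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a 64-bit block number portably.  The cast silences -Wformat
 * on 32-bit builds, where a 64-bit xfs_daddr_t is not 'long'.
 */
static void
format_daddr(char *buf, size_t len, int64_t blkno)
{
	snprintf(buf, len, "0x%llx", (unsigned long long)blkno);
}
```

The PRIx64 macro ("0x%" PRIx64) is the other common idiom.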

Signed-off-by: Nathan Scott <nathans@debian.org>
Dave Chinner [Thu, 1 May 2014 23:50:55 +0000 (09:50 +1000)] 
xfsprogs: v3.2.0-rc2 release

Update all the various version and changelog files for a v3.2.0-rc2
release.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Thu, 1 May 2014 23:32:59 +0000 (09:32 +1000)] 
mkfs.xfs: prevent close(-1) on protofile error path

My previous cleanups introduced this; in the case where
fd=open() failed, the out_fail: path would try to close(-1).
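
The guard is the standard pattern: only close descriptors that were actually opened. safe_close() below is an illustrative helper, not the mkfs.xfs code:

```c
#include <assert.h>
#include <unistd.h>

/* Close fd only if open() actually succeeded; safe on error paths. */
static int
safe_close(int fd)
{
	if (fd < 0)
		return -1;	/* nothing was opened; don't close(-1) */
	return close(fd);
}
```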

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Thu, 1 May 2014 23:32:33 +0000 (09:32 +1000)] 
xfs_logprint: Fix error handling in xlog_print_trans_efi

A recent change to xlog_print_trans_efi() led to a leaked
"src_f" on this error return.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Dave Chinner [Thu, 1 May 2014 23:31:33 +0000 (09:31 +1000)] 
repair: detect and handle attribute tree CRC errors

Currently the attribute code will not detect and correct errors in
the attribute tree. It also fails to validate the CRCs and headers
on remote attribute blocks. Ensure that all the attribute blocks are
CRC checked and that the processing functions understand the correct
block formats for decoding.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Dave Chinner [Thu, 1 May 2014 23:31:28 +0000 (09:31 +1000)] 
repair: handle remote symlink CRC errors

We can't really repair broken symlink buffer contents, but we can at
least warn about it and correct the CRC error so the symlink is
again readable.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Dave Chinner [Thu, 1 May 2014 23:31:22 +0000 (09:31 +1000)] 
repair: remove more dirv1 leftovers

get_bmapi() and its children were only called by dirv1 code. There
are no current callers, so remove them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Dave Chinner [Thu, 1 May 2014 23:31:17 +0000 (09:31 +1000)] 
repair: report AG btree verifier errors

When we scan the filesystem freespace and inode maps in phase 2, we
don't report CRC errors that are found. We don't really care from a
repair perspective, because the trees are completely rebuilt from
the ground up in phase 5, but from a checking perspective we need to
inform the user that we found inconsistencies.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Dave Chinner [Thu, 1 May 2014 23:31:11 +0000 (09:31 +1000)] 
repair: detect CRC errors in AG headers

repair doesn't currently detect verifier errors in AG header
blocks other than the primary superblock.
They are, fortunately, corrected in the important cases (AGF, AGI
and AGFL) because these structures are rebuilt in phase 5, but if
you run xfs_repair in checking mode it won't report them as bad.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Dave Chinner [Thu, 1 May 2014 23:31:04 +0000 (09:31 +1000)] 
repair: detect and correct CRC errors in directory blocks

repair doesn't currently detect verifier errors in directory
blocks - they cause repair to ignore those blocks and hence fail
because it can't read critical blocks from the directory.

Fix this by having the directory buffer read code detect a verifier
error and retry the read without the verifier if the verifier has
detected an error. Then pass the verifier error with the successfully
read buffer back to the caller, so the caller can handle the error
appropriately. In most cases, this is simply marking the directory
as needing a rebuild, so once the directory entries have been
checked and repaired, it will rewrite all the directory buffers
(including the clean ones) and in the process recalculate all
the CRCs on the directory blocks.

Hence pure CRC errors in directory blocks are now handled correctly
by xfs_repair.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Dave Chinner [Thu, 1 May 2014 23:30:57 +0000 (09:30 +1000)] 
repair: ensure prefetched buffers have CRCs validated

Prefetch currently does not do CRC validation when the IO completes,
due to the optimisation it performs and the fact that it does not
know what type of metadata the buffer is supposed to contain.
Hence, mark all prefetched buffers as "suspect" so that when the
end user tries to read it with a supplied validation function the
validation is run even though the buffer was already in the cache.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Dave Chinner [Thu, 1 May 2014 23:30:47 +0000 (09:30 +1000)] 
db: verify buffer on type change

Currently when the type command is run, we simply change the type
associated with the buffer, but don't verify it. This results in
unchecked CRCs being displayed. Hence when changing the type, run
the verifier associated with the type to determine if the buffer
contents are valid or not.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Dave Chinner [Thu, 1 May 2014 23:30:39 +0000 (09:30 +1000)] 
db: don't claim unchecked CRCs are correct

Currently xfs_db will claim the CRC on a structure is correct if the
buffer is not marked with an error. However, buffers may have been
read without a verifier, and hence have not had their CRCs
validated. In this case, we should report "unchecked" rather than
"correct". For example:

xfs_db> fsb 0x6003f
xfs_db> type dir3
xfs_db> p
dhdr.hdr.magic = 0x58444433
dhdr.hdr.crc = 0x2d0f9c9d (unchecked)
....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Dave Chinner [Mon, 14 Apr 2014 06:42:18 +0000 (16:42 +1000)] 
xfsprogs: v3.2.0-rc1 release

Update all the various version and changelog files for a v3.2.0-rc1
release.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Mark Tinguely [Mon, 14 Apr 2014 06:15:05 +0000 (16:15 +1000)] 
logprint: handle split EFI entry

xfs_logprint does not correctly handle EFI entries that
are split across two log buffers. xfs_efi_copy_format()
falsely interprets the truncated size of the split entry
as indicating a corrupt entry.

If the first log entry has enough information, namely the
number of extents in the entry and the identifier, then
display this information and a warning that this entry is
truncated. Otherwise, if there is not enough information in
the first log buffer, then print a message that the EFI decode
was not possible. These messages are similar to those for split
inode entries.

Example of a continued entry:
Oper (336): tid: f214bdb  len: 44  clientid: TRANS  flags: CONTINUE
EFI:  #regs: 1    num_extents: 2  id: 0xffff880804f63900
EFI free extent data skipped (CONTINUE set, no space)

Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 06:13:58 +0000 (16:13 +1000)] 
libxfs: remove never-read "offset" assignment in readbufr_map & writebufr

libxfs_readbufr_map() & libxfs_writebufr() iterate
over bp->b_map[] and read each chunk.  The loops start
out correctly, getting the offset from bm_bn and the
length from bm_len.  After the IO it correctly
advances the target buffer pointer by len, but then
inexplicably advances "offset" by len as well.  The
whole point of this exercise is to handle discontiguous
ranges - marching offset along by the length of the IO done
is incorrect.

Thankfully offset is immediately reset to the proper
value again at the top of the loop for the next range,
so this is harmless, other than being confusing.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 06:13:54 +0000 (16:13 +1000)] 
xfs_io: fix random pread/pwrite to honor offset

xfs_io's pread & pwrite claim to support a random IO mode
where they will do random IOs between offset & offset+len.

However, offset was ignored, and we did the IOs between 0
and len instead.

Clang caught this by pointing out that the calculated/normalized
"offset" variable was never read.

(NB: If the range is larger than RAND_MAX, these functions don't
work, but that's always been true, so I'll leave it for another
day...)
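
The corrected calculation adds the base offset back in. A simplified sketch with a hypothetical random_offset() helper (the RAND_MAX caveat above still applies):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Pick a random IO offset within [offset, offset + range).  The bug
 * dropped the base offset, producing offsets in [0, range) instead.
 */
static long long
random_offset(long long offset, long long range)
{
	return offset + (random() % range);
}
```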

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 06:13:50 +0000 (16:13 +1000)] 
xfsprogs: remove write-only assignments

There are many instances where variable assignments are made,
but never read (or are re-assigned before they are read).
The Clang static analyzer finds these.

Here's a chunk of what I think are trivial removals of such
assignments; other detections point to more serious problems
(or are shared w/ kernel code so should probably be fixed there
first).

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 06:13:47 +0000 (16:13 +1000)] 
xfs_db: don't use invalid index in ring_f

ring_f() tests for an invalid index which would overrun the
iocur_ring[] array and warns, but then uses it anyway.

Return immediately if it's out of bounds.
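
The fix amounts to bailing out before the array is touched; sketched here with invented names (RING_SIZE, ring_get()), not the actual xfs_db code:

```c
#include <assert.h>
#include <stdio.h>

#define RING_SIZE 8
static int iocur_ring[RING_SIZE];

/* Validate the index first; warn and return without using it. */
static int
ring_get(int index)
{
	if (index < 0 || index >= RING_SIZE) {
		fprintf(stderr, "invalid ring index %d\n", index);
		return -1;	/* return immediately, don't index */
	}
	return iocur_ring[index];
}
```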

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 06:13:44 +0000 (16:13 +1000)] 
mkfs: catch unknown format in protofile parsing

As the code stands today we can't get an unknown format in the
last case statement, but Coverity warns that if we ever do, we'll
use an uninitialized "ip" in the call to libxfs_trans_log_inode().

Adding a default: case to catch unknown formats is defensive and
makes the checker happy.
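
The defensive pattern looks like this; the enum and handler below are generic stand-ins for the protofile code, invented for illustration:

```c
#include <assert.h>

enum fork_fmt { FMT_LOCAL, FMT_EXTENTS, FMT_BTREE };

/*
 * Handle every known format explicitly and catch anything else in a
 * default case, instead of falling through with uninitialized state.
 */
static int
handle_format(int fmt)
{
	switch (fmt) {
	case FMT_LOCAL:
	case FMT_EXTENTS:
	case FMT_BTREE:
		return 0;	/* known format: proceed */
	default:
		return -1;	/* unknown format: fail cleanly */
	}
}
```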

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 06:13:42 +0000 (16:13 +1000)] 
xfs_repair: address never-true tests in repair/bmap.c on 64 bit

The test "if (new_naexts > BLKMAP_NEXTS_MAX)" is never true
on a 64-bit platform; new_naexts is an int, and BLKMAP_NEXTS_MAX
is INT_MAX.  So just wrap the whole thing in the #ifdef.
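
A stand-alone illustration of why the comparison is dead code when both sides are int; exceeds_max() is invented for the example:

```c
#include <assert.h>
#include <limits.h>

/*
 * An int can never be greater than INT_MAX, so this predicate can
 * never return non-zero -- the compiler is free to drop the test.
 */
static int
exceeds_max(int new_naexts)
{
	return new_naexts > INT_MAX;	/* always false */
}
```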

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 06:13:39 +0000 (16:13 +1000)] 
xfs_quota: remove impossible tests in printpath

printpath() had some cut & paste tests of "c" - but
nothing had set it yet other than c=0, so testing it
is pointless.  Just remove tests for non-zero "c"
until we might have set it to something interesting.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 06:13:39 +0000 (16:13 +1000)] 
xfs_db: fix too-large memset value in attr code

memset(value, 0xfeedface, valuelen);

seemed to be trying to fill a new attr with some
recognizable magic, but of course memset can
only set a single byte; switch this to use 'v'.

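The truncation is easy to demonstrate; fill_demo() is an illustrative helper, not the xfs_db attr code:

```c
#include <assert.h>
#include <string.h>

/*
 * memset() converts its fill value to unsigned char, so a multi-byte
 * "magic" constant is silently truncated: 0xfeedface becomes 0xce.
 */
static void
fill_demo(unsigned char *buf, size_t len)
{
	memset(buf, (int)0xfeedface, len);	/* only 0xce survives */
}
```
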
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 06:13:17 +0000 (16:13 +1000)] 
libxfs: annotate a case fallthrough in libxfs_ialloc

This is all working as intended, but add a comment to
make it more obvious to readers and static code
checkers.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 06:13:05 +0000 (16:13 +1000)] 
libxfs: free resources in libxfs_alloc_file_space error paths

The bmap freelist & transaction pointer weren't
being freed in libxfs_alloc_file_space error paths;
more or less copy the error handling that exists
in kernelspace to resolve this.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 06:12:43 +0000 (16:12 +1000)] 
xfs_logprint: fix leak in error path of xlog_print_record()

In 2 error paths we returned without freeing the allocated buf.
Collapse them into a compound test & free buf on the way out.
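
The single-exit pattern can be sketched generically; parse_record() and its error flags are invented, not the actual xlog_print_record() logic:

```c
#include <assert.h>
#include <stdlib.h>

/* Collapse multiple error returns into one path that frees buf. */
static int
parse_record(int bad_header, int bad_body)
{
	int ret = 0;
	char *buf = malloc(256);

	if (!buf)
		return -1;
	if (bad_header || bad_body) {	/* compound test, one exit */
		ret = -1;
		goto out;
	}
	/* ... normal record processing ... */
out:
	free(buf);			/* freed on every path */
	return ret;
}
```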

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 06:12:43 +0000 (16:12 +1000)] 
xfs_quota: fix memory leak in quota_group_type() error path

quota_group_type has some rather contorted logic that's
been around since 2005.

In the (!name) case, if any of the 3 calls setting up ngroups fails,
we fall back to using just one group.

However, if it's the getgroups() call that fails, we overwrite
the allocated gid ptr with &gid, thus leaking that allocated
memory.  Worse, we set "dofree" to 1, so we will free the non-allocated
local var gid.  And that last else case is redundant; if we get there,
gids is guaranteed to be non-null.

Refactor it a bit to be more clear (I hope) and correct.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 06:12:37 +0000 (16:12 +1000)] 
libxfs: fix memory leak in xfs_dir2_node_removename

Fix the leak of kernel memory in xfs_dir2_node_removename()
when xfs_dir2_leafn_remove() returns an error code.

Cross-port of kernel commit 3a8c9208

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 06:12:29 +0000 (16:12 +1000)] 
libxlog: fix memory leak in xlog_recover_add_to_trans

Free the memory in error path of xlog_recover_add_to_trans().
Normally this memory is freed in recovery pass2, but is leaked
in the error path.

Userspace version of kernel commits 519ccb8 & aaaae98

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 06:12:03 +0000 (16:12 +1000)] 
xfsprogs: trivial buffer frees on error paths

Lots of memory leaks on error paths etc., spotted by
Coverity.  This patch rolls up the super-straightforward
fixes across xfsprogs.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 06:12:03 +0000 (16:12 +1000)] 
xfs_fsr: refactor fsrall_cleanup

fsrall_cleanup leaked an fd in the non-timeout
case, but the logic was weird and tortured; refactor
it to make more sense and fix the fd leak as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Mon, 14 Apr 2014 01:34:01 +0000 (11:34 +1000)] 
xfsprogs: fix various fd leaks

Coverity spotted these; several paths where we don't
close fds when we return.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Tue, 8 Apr 2014 08:56:59 +0000 (18:56 +1000)] 
xfs_repair: fix prefetch queue waiting

Commit 97b1fcf ("xfs_repair: fix array overrun in do_inode_prefetch")
introduced a regression:

The thread creation loop has 2 ways to exit; either via
the loop counter based on thread_count, or the break statement
if we've started enough workers to cover all AGs.

Whether or not the loop counter "i" reflects the number of
threads started depends on whether or not we exited via the
break.

The above commit prevented us from indexing off the end
of the queues[] array if we actually advanced "i" all the
way to thread_count, but in the case where we break, "i"
is one *less* than the nr of threads started, so we don't
wait for completion of all threads, and all hell breaks
loose in phase 5.

Just stop with the cleverness of re-using the loop counter -
instead, explicitly count threads that we start, and then use
that counter to wait for each worker to complete.
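
The fix can be modelled as an explicit started-worker count; start_workers() and its arguments below are invented, simplified stand-ins for the repair code:

```c
#include <assert.h>

/*
 * Count the workers we actually start instead of reusing the loop
 * counter, whose final value depends on how the loop exited.  The
 * caller then waits for exactly 'started' workers to complete.
 */
static int
start_workers(int thread_count, int ags_remaining)
{
	int started = 0;
	int i;

	for (i = 0; i < thread_count; i++) {
		if (ags_remaining <= 0)
			break;		/* enough workers for all AGs */
		/* start_worker(i); -- omitted in this sketch */
		started++;
		ags_remaining--;
	}
	return started;
}
```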

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Mark Tinguely [Tue, 8 Apr 2014 08:56:56 +0000 (18:56 +1000)] 
xfsprogs: fix directory hash ordering bug

Commit f5ea1100 ("xfs: add CRCs to dir2/da node blocks") introduced
in 3.10 incorrectly converted the btree hash index array pointer in
xfs_da3_fixhashpath(). It resulted in the current hash always
being compared against the first entry in the btree rather than the
current block index into the btree block's hash entry array. As a
result, it was comparing the wrong hashes, and so could misorder the
entries in the btree.

For most cases, this doesn't cause any problems as it requires hash
collisions to expose the ordering problem. However, when there are
hash collisions within a directory there is a very good probability
that the entries will be ordered incorrectly and that actually
matters when duplicate hashes are placed into or removed from the
btree block hash entry array.

This bug results in an on-disk directory corruption and that results
in directory verifier functions throwing corruption warnings into
the logs. While no data or directory entries are lost, access to
them may be compromised, and attempts to remove entries from a
directory that has suffered from this corruption may result in a
filesystem shutdown.  xfs_repair will fix the directory hash
ordering without data loss occurring.

[dchinner: wrote a useful commit message]

Ported from equivalent kernel commit c88547a8.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
David Sterba [Tue, 8 Apr 2014 08:48:37 +0000 (18:48 +1000)] 
xfsprogs: add a missing QA header file to install-qa target

Included from "include/libxfs.h".

Signed-off-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Tue, 8 Apr 2014 08:48:22 +0000 (18:48 +1000)] 
xfs_db: hide debug bbmap output

Building most of xfsprogs with DEBUG enables extra
checks, asserts, etc., but this bunch of printfs was
extra output that's not generally helpful for most
people's runtime experience - and it breaks xfs/290
with all the noise.

I assume it's for actual debugging use, and not
generally useful, so bury it a bit deeper under
its own #ifdef.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Dave Chinner [Tue, 8 Apr 2014 08:47:35 +0000 (18:47 +1000)] 
repair: ensure that unused superblock fields are zeroed

When we grab a superblock off disk via get_sb(), we don't know what
the in-memory superblock we are filling out previously contained. We
need to ensure that the entire structure is returned in an initialised
state regardless of which fields libxfs_sb_from_disk() populates
from disk. In this case, it doesn't populate the sb_crc field,
and so uninitialised values can escape to disk on v4
filesystems. This causes xfs/031 to fail on v4 filesystems.
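
The general pattern is to zero the destination before decoding; the cut-down struct and sb_from_disk() helper below are invented for illustration:

```c
#include <assert.h>
#include <string.h>

/* A cut-down superblock with a field the v4 disk format doesn't carry. */
struct sb {
	unsigned int blocksize;
	unsigned int crc;	/* not populated from disk on v4 */
};

/*
 * Zero the whole structure before filling in decoded fields, so any
 * field the decoder skips is deterministically zero rather than
 * whatever stale memory the buffer previously held.
 */
static void
sb_from_disk(struct sb *to, unsigned int disk_blocksize)
{
	memset(to, 0, sizeof(*to));	/* initialise everything first */
	to->blocksize = disk_blocksize;	/* then the decoded fields */
}
```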

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Christoph Hellwig [Tue, 25 Feb 2014 19:52:18 +0000 (11:52 -0800)] 
xfs_io: add support for flink

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Fri, 7 Mar 2014 01:44:02 +0000 (12:44 +1100)] 
xfsprogs: fix use after free in inode_item_done()

Commit "3a19fb7 libxfs: stop caching inode structures"
introduced a use after free.

libxfs_iput() already does the check for ip->i_itemp, and a
kmem_zone_free() if it's present, and then frees the ip pointer.
Re-checking ip->i_itemp after the libxfs_iput call will access
the freed ip pointer, as will setting ip->i_itemp to NULL.

Simply remove the offending code to fix this up.
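
The shape of the bug can be modelled with a hypothetical release() helper standing in for libxfs_iput(); the point is that the caller must not touch ip after the call:

```c
#include <assert.h>
#include <stdlib.h>

struct item { int flags; };
struct inode { struct item *i_itemp; };

/*
 * Frees the attached log item and then the inode itself; any later
 * access to ip (or assignment to ip->i_itemp) is a use-after-free.
 * Returning NULL lets callers overwrite their stale pointer.
 */
static struct inode *
release(struct inode *ip)
{
	free(ip->i_itemp);
	free(ip);
	return NULL;
}
```

Assigning the result back (ip = release(ip)) makes any later dereference fail fast; the actual fix simply deleted the code that touched ip after the call.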

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Eric Sandeen [Thu, 6 Mar 2014 23:33:04 +0000 (10:33 +1100)] 
xfs_repair: fix array overrun in do_inode_prefetch

Coverity spotted this:

do_inode_prefetch() does a for loop, creating queues:

	for (i = 0; i < thread_count; i++) {
		...
		create_work_queue(&queues[i], mp, 1);
		...
	}

and then does this to wait for them all to complete:

	for (; i >= 0; i--)
		destroy_work_queue(&queues[i]);

But we leave the first for loop with (i == thread_count), and
the second one will try to index queues[] one past the end.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agorepair: phase 1 does not handle superblock CRCs
Dave Chinner [Thu, 6 Mar 2014 23:27:31 +0000 (10:27 +1100)] 
repair: phase 1 does not handle superblock CRCs

Phase 1 of xfs_repair verifies and corrects the primary
superblock of the filesystem. It does not verify that the CRC of the
superblock that is found is correct, nor does it recalculate the CRC
of the superblock it rewrites.

This happens because phase1 does not use the libxfs buffer cache -
it just uses pread/pwrite on a memory buffer, and works directly
from that buffer. Hence we need to add CRC verification to
verify_sb(), and CRC recalculation to write_primary_sb() so that it
works correctly.

This also enables us to use get_sb() as the method of fetching the
superblock from disk after phase 1 without needing to use the libxfs
buffer cache and guessing at the sector size. This prevents a
verifier error because it attempts to CRC a superblock buffer that
is much longer than the usual sector sizes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_db: Use EFSBADCRC for CRC validity indication
Dave Chinner [Thu, 6 Mar 2014 23:27:30 +0000 (10:27 +1100)] 
xfs_db: Use EFSBADCRC for CRC validity indication

xfs_db currently gives indication as to whether a buffer CRC is ok
or not. Currently it does this by checking for EFSCORRUPTED in the
b_error field of the buffer. Now that we have EFSBADCRC to indicate
a bad CRC independently of structure corruption, use that instead to
drive the CRC correct/incorrect indication in the structured output.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agolibxfs: modify verifiers to differentiate CRC from other errors
Dave Chinner [Thu, 6 Mar 2014 23:26:11 +0000 (10:26 +1100)] 
libxfs: modify verifiers to differentiate CRC from other errors

[userspace port]

Modify all read & write verifiers to differentiate
between CRC errors and other inconsistencies.

This sets the appropriate error number on bp->b_error,
and then calls xfs_verifier_error() if something went
wrong.  That function will issue the appropriate message
to the user.

Also, fix the silly bug in xfs_buf_ioerror() that this patch
exposes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agolibxfs: add xfs_verifier_error()
Dave Chinner [Thu, 6 Mar 2014 23:25:55 +0000 (10:25 +1100)] 
libxfs: add xfs_verifier_error()

[userspace port]

We want to distinguish between corruption, CRC errors,
etc.  In addition, the full stack trace on verifier errors
seems less than helpful; it looks more like an oops than
corruption.

Create a new function to specifically alert the user to
verifier errors, which can differentiate between
EFSCORRUPTED and CRC mismatches.  It doesn't dump stack
unless the xfs error level is turned up high.

Define a new error message (EFSBADCRC) to clearly identify
CRC errors.  (Defined to EBADMSG, bad message)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agolibxfs: add helper for updating checksums on xfs_bufs
Dave Chinner [Thu, 6 Mar 2014 23:25:41 +0000 (10:25 +1100)] 
libxfs: add helper for updating checksums on xfs_bufs

[userspace port]

Many/most callers of xfs_update_cksum() pass bp->b_addr and
BBTOB(bp->b_length) as the first 2 args.  Add a helper
which can just accept the bp and the crc offset, and work
it out on its own, for brevity.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
11 years agolibxfs: add helper for verifying checksums on xfs_bufs
Dave Chinner [Thu, 6 Mar 2014 23:25:33 +0000 (10:25 +1100)] 
libxfs: add helper for verifying checksums on xfs_bufs

[userspace port]

Many/most callers of xfs_verify_cksum() pass bp->b_addr and
BBTOB(bp->b_length) as the first 2 args.  Add a helper
which can just accept the bp and the crc offset, and work
it out on its own, for brevity.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agolibxfs: Use defines for CRC offsets in all cases
Dave Chinner [Thu, 6 Mar 2014 23:25:24 +0000 (10:25 +1100)] 
libxfs: Use defines for CRC offsets in all cases

[userspace port]

Some calls to crc functions used useful #defines,
others used awkward offsetof() constructs.

Switch them all to #define to make things a bit cleaner.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agolibxfs: skip pointless CRC updates after verifier failures
Dave Chinner [Thu, 6 Mar 2014 23:25:17 +0000 (10:25 +1100)] 
libxfs: skip pointless CRC updates after verifier failures

[userspace port]

Most write verifiers don't update CRCs after the verifier
has failed and the buffer has been marked in error.  These
two didn't, but should.

Add returns to the verifier failure block, since the buffer
won't be written anyway.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agolibxfs: limit superblock corruption errors to actual corruption
Dave Chinner [Thu, 6 Mar 2014 23:25:12 +0000 (10:25 +1100)] 
libxfs: limit superblock corruption errors to actual corruption

[userspace port]

Today, if

xfs_sb_read_verify
  xfs_sb_verify
    xfs_mount_validate_sb

detects superblock corruption, it'll be extremely noisy, dumping
2 stacks, 2 hexdumps, etc.

This is because we call XFS_CORRUPTION_ERROR in xfs_mount_validate_sb
as well as in xfs_sb_read_verify.

Also, *any* errors in xfs_mount_validate_sb which are not corruption
per se; things like too-big-blocksize, bad version, bad magic, v1 dirs,
rw-incompat etc - things which do not return EFSCORRUPTED - will
still do the whole XFS_CORRUPTION_ERROR spew when xfs_sb_read_verify
sees any error at all.  And it suggests to the user that they
should run xfs_repair, even if the root cause of the mount failure
is a simple incompatibility.

I'll submit that the probably-not-corrupted errors don't warrant
this much noise, so this patch removes the warning for anything
other than EFSCORRUPTED returns, and replaces the lower-level
XFS_CORRUPTION_ERROR with an xfs_notice().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agolibxfs: skip verification on initial "guess" superblock read
Dave Chinner [Thu, 6 Mar 2014 23:25:04 +0000 (10:25 +1100)] 
libxfs: skip verification on initial "guess" superblock read

[userspace port]

When xfs_readsb() does the very first read of the superblock,
it makes a guess at the length of the buffer, based on the
sector size of the underlying storage.  This may or may
not match the filesystem sector size in sb_sectsize, so
we can't, for example, do a CRC check on it; it might be too short.

In fact, mounting a filesystem with sb_sectsize larger
than the device sector size will cause a mount failure
if CRCs are enabled, because we are checksumming a length
which exceeds the buffer passed to it.

So always read twice; the first time we read with NULL
buffer ops to skip verification; then set the proper
read length, hook up the proper verifier, and give it
another go.

Once we are sure that we've got the right buffer length,
we can also use bp->b_length in the xfs_sb_read_verify,
rather than the less-trusted on-disk sectorsize for
secondary superblocks.  Before this we ran the risk of
passing junk to the crc32c routines, which didn't always
handle extreme values.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agolibxfs: xfs_sb_read_verify() doesn't flag bad crcs on primary sb
Dave Chinner [Thu, 6 Mar 2014 23:24:56 +0000 (10:24 +1100)] 
libxfs: xfs_sb_read_verify() doesn't flag bad crcs on primary sb

[userspace port]

My earlier commit 10e6e65 deserves a layer or two of brown paper
bags.  The logic in that commit means that a CRC failure on the
primary superblock will *never* result in an error return.

Hopefully this fixes it, so that we always return the error
if it's a primary superblock, otherwise only if the filesystem
has CRCs enabled.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agolibxfs: sanitize sb_inopblock in xfs_mount_validate_sb
Dave Chinner [Thu, 6 Mar 2014 23:24:44 +0000 (10:24 +1100)] 
libxfs: sanitize sb_inopblock in xfs_mount_validate_sb

[userspace port]

xfs_mount_validate_sb doesn't check sb_inopblock for sanity
(as does its xfs_repair counterpart, FWIW).

If it's out of bounds, we can go off the rails in, for example,
xfs_inode_buf_verify(), which uses sb_inopblock as a loop
limit when stepping through a metadata buffer.

The problem can be demonstrated easily by corrupting
sb_inopblock with xfs_db and trying to mount the result:

# mkfs.xfs -dfile,name=fsfile,size=1g
# xfs_db -x fsfile
xfs_db> sb 0
xfs_db> write inopblock 512
inopblock = 512
xfs_db> quit

# mount -o loop fsfile  mnt
and we blow up in xfs_inode_buf_verify().

With this patch, we get a (very noisy) corruption error,
and fail the mount as we should.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agolibxfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields
Dave Chinner [Thu, 6 Mar 2014 23:24:33 +0000 (10:24 +1100)] 
libxfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields

[userspace port]

Today, if xfs_sb_read_verify encounters a v4 superblock
with junk past v4 fields which includes data in sb_crc,
it will be treated as a failing checksum and a significant
corruption.

There are known prior bugs which leave junk at the end
of the V4 superblock; we don't need to actually fail the
verification in this case if other checks pan out ok.

So if this is a secondary superblock, and the primary
superblock doesn't indicate that this is a V5 filesystem,
don't treat this as an actual checksum failure.

We should probably check the garbage condition as
we do in xfs_repair, and possibly warn about it
or self-heal, but that's a different scope of work.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_repair.8: Fix a grammatical error
Andrew Clayton [Mon, 3 Mar 2014 01:29:32 +0000 (12:29 +1100)] 
xfs_repair.8: Fix a grammatical error

Fix an issue in the -o force_geometry suboption text.

Signed-off-by: Andrew Clayton <andrew@digital-domain.net>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agometadump: pathname obfuscation overruns symlink buffer
Dave Chinner [Mon, 3 Mar 2014 01:29:32 +0000 (12:29 +1100)] 
metadump: pathname obfuscation overruns symlink buffer

In testing the previous metadump changes, fsstress generated an
inline symlink of 336 bytes in length. This caused corruption of the
restored filesystem that wasn't present in the original filesystem -
it corrupted the magic number of the next inode in the chunk. The
reason being that the symlink data is not null terminated in the
inode literal area, and hence when the symlink consumes the entire
literal area like so:

xfs_db> daddr 0x42679
xfs_db> p
000: 494ea1ff 03010000 00000000 00000000 00000001 00000000 00000000 00000000
020: 53101af9 1678d2a8 53101af9 15fec0a8 53101af9 15fec0a8 00000000 00000150
040: 00000000 00000000 00000000 00000000 00000002 00000000 00000000 d868b52d
060: ffffffff 0ce5477a 00000000 00000002 00000002 0000041c 00000000 00000000
080: 00000000 00000000 00000000 00000000 53101af9 15fec0a8 00000000 00042679
0a0: 6c4e9d4e 84944986 a074cffd 0ea042a8 78787878 78787878 78782f78 78787878
0c0: 78787878 2f787878 78787878 78782f78 78787878 78787878 2f787878 78787878
0e0: 78782f78 78787878 78787878 2f787878 78787878 78782f78 78787878 78787878
100: 2f787878 78787878 78782f78 78787878 78787878 2f787878 78787878 78782f78
120: 78787878 78787878 2f787878 78787878 78782f78 78787878 78787878 2f787878
140: 78787878 78782f78 78787878 78787878 2f787878 78787878 78782f78 78787878
160: 78787878 2f787878 78787878 78782f78 78787878 78787878 2f787878 78787878
180: 78782f78 78787878 78787878 2f787878 78787878 78782f78 78787878 78787878
1a0: 2f787878 78787878 78782f78 78787878 78787878 2f787878 78787878 78782f78
1c0: 78787878 78787878 2f787878 78787878 78782f78 78787878 78787878 2f787878
1e0: 78787878 78782f78 78787878 78787878 2f787878 78787878 78782f78 78787878

the symlink data butts right up against the magic number of the next
inode in the chunk. And then, when obfuscation gets to the final
pathname component, it gets its length via:

/* last (or single) component */
namelen = strlen((char *)comp);
hash = libxfs_da_hashname(comp, namelen);
obfuscate_name(hash, namelen, comp);

strlen(), which looks for a null terminator and finds it several
bytes into the next inode. It then proceeds to obfuscate that
length, including the inode magic number of the next inode....

Fix this by ensuring we can't overrun the symlink buffer length
by assuming that the symlink is not null terminated. Tested against
the filesystem image that triggered the problem in the first place.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agometadump: Only verify obfuscated metadata being dumped
Dave Chinner [Mon, 3 Mar 2014 01:29:32 +0000 (12:29 +1100)] 
metadump: Only verify obfuscated metadata being dumped

The discontiguous buffer support series added a verifier check on
the metadata buffers before they get written to the metadump image.
If this failed, it returned an error, and the result would be that
we stopped processing the metadata and exited, truncating the dump.

xfs_metadump is supposed to dump the metadata in the filesystem for
forensic analysis purposes, which means we actually want it to
retain any corruptions it finds in the filesystem. Hence running the
verifier - even to recalculate CRCs - when the metadata is
unmodified is the wrong thing to be doing. And stopping the dump
when we come across an error is even worse.

We still need to do CRC recalculation when obfuscating names and
attributes. Hence we need to make running the verifier conditional
on the buffer or inode:
a) being uncorrupted when read, and
b) modified by the obfuscation code.

If either of these conditions is not true, then we don't run the
verifier or recalculate the CRCs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agometadump: contiguous metadata objects need to be split
Dave Chinner [Mon, 3 Mar 2014 01:29:32 +0000 (12:29 +1100)] 
metadump: contiguous metadata objects need to be split

On crc enabled filesystems with obfuscation enabled we need to be
able to recalculate the CRCs on individual buffers.
process_single_fsb_objects() reads a contiguous range of single
block objects as a single buffer, and hence we cannot correctly
recalculate the CRCs on them.

Split the loop up into individual buffer reads, processing and
writes rather than a single read, multiple block processing and a
single write.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_growfs: don't grow data if only -m is specified
Eric Sandeen [Mon, 3 Mar 2014 01:29:32 +0000 (12:29 +1100)] 
xfs_growfs: don't grow data if only -m is specified

While writing an xfstest to check imaxpct behavior, I realized
that xfs_growfs -m XXX /mnt/point will not only change
imaxpct, but also grow the filesystem if it's not currently
occupying the entire block device.  This doesn't seem like
the expected behavior, so split it out such that if only
-m, and not -d, is specified, only the imaxpct will be
changed.

This is a change from previous behavior, but I think it
satisfies the principle of least surprise...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_io: test for invalid -Tr flag combination before open
Eric Sandeen [Mon, 3 Mar 2014 01:29:32 +0000 (12:29 +1100)] 
xfs_io: test for invalid -Tr flag combination before open

Coverity spotted this.

It complained that we didn't close the fd before returning in
the error case of incompatible options, but in reality, we wouldn't
have gotten that far because open(O_RDONLY|O_TMPFILE) would be
rejected with EINVAL.

So the error handling test would never actually be true.

Fix this by moving the error checking prior to the open so
the user gets a more useful error message than "Invalid Argument."

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_io: add missing break in O_TMPFILE case
Eric Sandeen [Mon, 3 Mar 2014 01:29:31 +0000 (12:29 +1100)] 
xfs_io: add missing break in O_TMPFILE case

Coverity spotted this.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_io: add fzero command for zeroing range via fallocate
Lukas Czerner [Mon, 3 Mar 2014 01:28:35 +0000 (12:28 +1100)] 
xfs_io: add fzero command for zeroing range via fallocate

Add fzero command which zeroes a range of the file using
fallocate FALLOC_FL_ZERO_RANGE flag.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_logprint: don't advance op counter in xlog_print_trans_icreate
Eric Sandeen [Mon, 3 Mar 2014 01:22:02 +0000 (12:22 +1100)] 
xfs_logprint: don't advance op counter in xlog_print_trans_icreate

xlog_print_trans_icreate is advancing the op counter
"(*i)++" incorrectly; it only contains one region, and
the loop which called it will properly advance the op
once we're done.

Found-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_logprint: Don't error out after split items lose context
Eric Sandeen [Mon, 3 Mar 2014 01:21:49 +0000 (12:21 +1100)] 
xfs_logprint: Don't error out after split items lose context

xfs_logprint recognizes a "left over region from split log item"
but then expects the *next* op to be a valid start to a new
item.  The problem is, we can split e.g. an xfs_inode_log_format
item, skip over it, and then land on the xfs_icdinode_t
data which follows it - this doesn't have a valid log item
magic (XFS_LI_*) and we error out.  This results in something
like:

  xfs_logprint: unknown log operation type (494e)

Fix this by recognizing that we've skipped over an item and
lost the context we're in, so just continue skipping over
op headers until we find the next valid start to a log item.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_copy: band-aids for CRC filesystems
Eric Sandeen [Mon, 3 Mar 2014 01:21:21 +0000 (12:21 +1100)] 
xfs_copy: band-aids for CRC filesystems

xfs_copy needs a fair bit of work for CRCs because it rewrites
UUIDs by default, but this change will get it working properly
with the "-d" (duplicate) option which keeps the same UUID.

Accept the CRC magic as valid, change the ASSERT() to an error
message and exit more gracefully, and don't even get started
if we don't have the '-d' option.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_metadump: include F in getopts string
Eric Sandeen [Mon, 3 Mar 2014 01:19:24 +0000 (12:19 +1100)] 
xfs_metadump: include F in getopts string

I added an F case, but didn't add it to the bash
getopts string last go-round.

/usr/sbin/xfs_metadump: illegal option -- F

I sure thought I tested this, I'm not sure how it got lost.

Reported-by: Boris Ranto <branto@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agorepair: prefetch runs too far ahead
Dave Chinner [Mon, 3 Mar 2014 01:19:14 +0000 (12:19 +1100)] 
repair: prefetch runs too far ahead

When trying to work out why a non-crc filesystem took 1m57 to repair
and the same CRC enabled filesystem took 11m35 to repair, I noticed
that there was far too much CRC checking going on. Prefetched buffers
should not be CRCed, yet shortly after starting this began
to happen. perf profiling also showed up an awful lot of time doing
buffer cache lookups, and the cache profile output indicated that
the hit rate was way below 3%. IOWs, the readahead was getting so
far ahead of the processing that it was thrashing the cache.

That there is a difference in processing rate between CRC and
non-CRC filesystems is not surprising. What is surprising is the
readahead behaviour - it basically just keeps reading ahead until it
has read everything on an AG, and then it goes on to the next AG,
and reads everything on it, and then goes on to the next AG,....

This goes on until it pushes all the buffers the processing threads
need out of the cache, and suddenly they start re-reading from disk
with the various CRC checking verifiers enabled, and we end up going
-really- slow. Yes, threading made up for it a bit, but it's just
wrong.

Basically, the code assumes that IO is going to be slower than
processing, so it doesn't throttle prefetch across AGs to slow
down prefetch to match the processing rate.

So, to fix this, don't let a prefetch thread get more than a single
AG ahead of its processing thread, just like occurs for single
threaded (i.e. -o ag_stride=-1) operation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agorepair: factor out threading setup code
Dave Chinner [Mon, 3 Mar 2014 01:18:25 +0000 (12:18 +1100)] 
repair: factor out threading setup code

The same code is repeated in different places to set up
multithreaded prefetching. This can all be factored into a single
implementation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agorepair: BMBT prefetch needs to be CRC aware
Dave Chinner [Mon, 3 Mar 2014 01:18:01 +0000 (12:18 +1100)] 
repair: BMBT prefetch needs to be CRC aware

I'd been trying to track down a behavioural difference between
non-crc and crc enabled filesystems that was resulting non-crc
filesystem executing prefetch almost 3x faster than CRC filesystems.
After many ratholes, I finally stumbled on the fact that btree
format directories are not being prefetched due to a missing magic
number check, and it's rejecting all XFS_BMAP_CRC_MAGIC format BMBT
buffers. This makes prefetch on CRC enabled filesystems behave the
same as for non-CRC filesystems.

The difference a single line of code can make on a 50 million inode
filesystem with a single threaded prefetch enabled run is pretty
amazing. It goes from 3,000 iops @ 50MB/s to 2,000 IOPS @ 800MB/s
and the cache hit rate goes from 3% to 49%. The runtime difference:

Unpatched:

Phase           Start           End             Duration
Phase 1:        02/21 18:34:12  02/21 18:34:12
Phase 2:        02/21 18:34:12  02/21 18:34:15  3 seconds
Phase 3:        02/21 18:34:15  02/21 18:40:09  5 minutes, 54 seconds
Phase 4:        02/21 18:40:09  02/21 18:46:36  6 minutes, 27 seconds
Phase 5:        02/21 18:46:36  02/21 18:46:37  1 second
Phase 6:        02/21 18:46:37  02/21 18:52:51  6 minutes, 14 seconds
Phase 7:        02/21 18:52:51  02/21 18:52:52  1 second

Total run time: 18 minutes, 40 seconds

Patched:

Phase           Start           End             Duration
Phase 1:        02/21 19:58:23  02/21 19:58:23
Phase 2:        02/21 19:58:23  02/21 19:58:27  4 seconds
Phase 3:        02/21 19:58:27  02/21 19:59:20  53 seconds
Phase 4:        02/21 19:59:20  02/21 20:00:07  47 seconds
Phase 5:        02/21 20:00:07  02/21 20:00:08  1 second
Phase 6:        02/21 20:00:08  02/21 20:00:50  42 seconds
Phase 7:        02/21 20:00:50  02/21 20:00:50

Total run time: 2 minutes, 27 seconds

Is no less impressive. Shame it's just a regression fix. ;)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agorepair: fix prefetch queue limiting
Dave Chinner [Mon, 3 Mar 2014 01:17:46 +0000 (12:17 +1100)] 
repair: fix prefetch queue limiting

The length of the prefetch queue is limited by a semaphore. To avoid
a ABBA deadlock, we only trywait on the semaphore so if we fail to
get it we can kick the IO queues before sleeping. Unfortunately,
the "need to sleep" detection is just a little wrong - it needs to
look at errno, not err, for the EAGAIN value.

Hence this queue throttling has not been working for a long time.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agorepair: use a listhead for the dotdot list
Dave Chinner [Mon, 3 Mar 2014 01:17:34 +0000 (12:17 +1100)] 
repair: use a listhead for the dotdot list

Cleanup suggested by Christoph Hellwig - removes another open coded
list implementation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agorepair: limit auto-striding concurrency appropriately
Dave Chinner [Mon, 3 Mar 2014 01:17:27 +0000 (12:17 +1100)] 
repair: limit auto-striding concurrency appropriately

It's possible to have filesystems with hundreds of AGs on systems
with little concurrency and resources. In this case, we can easily
exhaust memory and fail to create threads and have all sorts of
interesting problems.

xfs/250 can cause this to occur, with failures like:

        - agno = 707
        - agno = 692
fatal error -- cannot create worker threads, error = [11] Resource temporarily unavailable

And this:

        - agno = 484
        - agno = 782
failed to create prefetch thread: Resource temporarily unavailable

Because it's trying to create more threads than a poor little 512MB
single CPU ia32 box can handle.

So, limit concurrency to a maximum of numcpus * 8 to prevent this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agolibxfs: buffer cache hashing is suboptimal
Dave Chinner [Mon, 3 Mar 2014 01:17:07 +0000 (12:17 +1100)] 
libxfs: buffer cache hashing is suboptimal

The hashkey calculation is very simplistic,and throws away an amount
of entropy that should be folded into the hash. The result is
sub-optimal distribution across the hash tables. For example, with a
default 512 entry table, phase 2 results in this:

Max supported entries = 4096
Max utilized entries = 3970
Active entries = 3970
Hash table size = 512
Hits = 0
Misses = 3970
Hit ratio =  0.00
Hash buckets with   0 entries     12 (  0%)
Hash buckets with   1 entries      3 (  0%)
Hash buckets with   2 entries     10 (  0%)
Hash buckets with   3 entries      2 (  0%)
Hash buckets with   4 entries    129 ( 12%)
Hash buckets with   5 entries     20 (  2%)
Hash buckets with   6 entries     54 (  8%)
Hash buckets with   7 entries     22 (  3%)
Hash buckets with   8 entries    150 ( 30%)
Hash buckets with   9 entries     14 (  3%)
Hash buckets with  10 entries     16 (  4%)
Hash buckets with  11 entries      7 (  1%)
Hash buckets with  12 entries     38 ( 11%)
Hash buckets with  13 entries      5 (  1%)
Hash buckets with  14 entries      4 (  1%)
Hash buckets with  17 entries      1 (  0%)
Hash buckets with  19 entries      1 (  0%)
Hash buckets with  23 entries      1 (  0%)
Hash buckets with >24 entries     23 ( 16%)

Now, given a perfect distribution, we should have 8 entries per
chain. What we end up with is nothing like that.

Unfortunately, for phase 3/4 and others, the number of cached
objects results in the cache being expanded to 256k entries, and
so the stats just give this:

Hits = 262276
Misses = 8130393
Hit ratio =  3.13
Hash buckets with >24 entries    512 (100%)

We can't evaluate the efficiency of the hashing algorithm here.
Let's increase the size of the hash table to 32768 entries and go
from there:

Phase 2:

Hash buckets with   0 entries  31884 (  0%)
Hash buckets with   1 entries     35 (  0%)
Hash buckets with   2 entries     78 (  3%)
Hash buckets with   3 entries     30 (  2%)
Hash buckets with   4 entries    649 ( 65%)
Hash buckets with   5 entries     12 (  1%)
Hash buckets with   6 entries     13 (  1%)
Hash buckets with   8 entries     40 (  8%)
Hash buckets with   9 entries      1 (  0%)
Hash buckets with  13 entries      1 (  0%)
Hash buckets with  15 entries      1 (  0%)
Hash buckets with  22 entries      1 (  0%)
Hash buckets with  24 entries     17 ( 10%)
Hash buckets with >24 entries      6 (  4%)

There's a significant number of collisions given the population is
only 15% of the size of the table itself....

Phase 3:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262090
Hash table size = 32768
Hits = 530844
Misses = 7164575
Hit ratio =  6.90
Hash buckets with   0 entries  11898 (  0%)
....
Hash buckets with  12 entries   5513 ( 25%)
Hash buckets with  13 entries   4188 ( 20%)
Hash buckets with  14 entries   2073 ( 11%)
Hash buckets with  15 entries   1811 ( 10%)
Hash buckets with  16 entries   1994 ( 12%)
....
Hash buckets with >24 entries    339 (  4%)

So, a third of the hash table buckets do not even have any entries in them,
despite having more than 7.5 million entries run through the cache.
Median chain lengths are 12-16 entries, ideal is 8. And lots of
collisions on the longer-than-24-entry chains...

Phase 6:

Hash buckets with   0 entries  14573 (  0%)
....
Hash buckets with >24 entries   2291 ( 36%)

Ouch. Not a good distribution at all.

Overall runtime:

Phase           Start           End             Duration
Phase 1:        12/06 11:35:04  12/06 11:35:04
Phase 2:        12/06 11:35:04  12/06 11:35:07  3 seconds
Phase 3:        12/06 11:35:07  12/06 11:38:27  3 minutes, 20 seconds
Phase 4:        12/06 11:38:27  12/06 11:41:32  3 minutes, 5 seconds
Phase 5:        12/06 11:41:32  12/06 11:41:32
Phase 6:        12/06 11:41:32  12/06 11:42:29  57 seconds
Phase 7:        12/06 11:42:29  12/06 11:42:30  1 second

Total run time: 7 minutes, 26 seconds

Modify the hash to be something more workable - steal the linux
kernel inode hash calculation and try that:

phase 2:

Max supported entries = 262144
Max utilized entries = 3970
Active entries = 3970
Hash table size = 32768
Hits = 0
Misses = 3972
Hit ratio =  0.00
Hash buckets with   0 entries  29055 (  0%)
Hash buckets with   1 entries   3464 ( 87%)
Hash buckets with   2 entries    241 ( 12%)
Hash buckets with   3 entries      8 (  0%)

Close to perfect.

Phase 3:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262118
Hash table size = 32768
Hits = 567900
Misses = 7118749
Hit ratio =  7.39
Hash buckets with   5 entries   1572 (  2%)
Hash buckets with   6 entries   2186 (  5%)
Hash buckets with   7 entries   9217 ( 24%)
Hash buckets with   8 entries   8757 ( 26%)
Hash buckets with   9 entries   6135 ( 21%)
Hash buckets with  10 entries   3166 ( 12%)
Hash buckets with  11 entries   1257 (  5%)
Hash buckets with  12 entries    364 (  1%)
Hash buckets with  13 entries     94 (  0%)
Hash buckets with  14 entries     14 (  0%)
Hash buckets with  15 entries      5 (  0%)

A near-perfect bell curve centered on the optimal distribution
number of 8 entries per chain.

Phase 6:

Hash buckets with   0 entries     24 (  0%)
Hash buckets with   1 entries    190 (  0%)
Hash buckets with   2 entries    571 (  0%)
Hash buckets with   3 entries   1263 (  1%)
Hash buckets with   4 entries   2465 (  3%)
Hash buckets with   5 entries   3399 (  6%)
Hash buckets with   6 entries   4002 (  9%)
Hash buckets with   7 entries   4186 ( 11%)
Hash buckets with   8 entries   3773 ( 11%)
Hash buckets with   9 entries   3240 ( 11%)
Hash buckets with  10 entries   2523 (  9%)
Hash buckets with  11 entries   2074 (  8%)
Hash buckets with  12 entries   1582 (  7%)
Hash buckets with  13 entries   1206 (  5%)
Hash buckets with  14 entries    863 (  4%)
Hash buckets with  15 entries    601 (  3%)
Hash buckets with  16 entries    386 (  2%)
Hash buckets with  17 entries    205 (  1%)
Hash buckets with  18 entries    122 (  0%)
Hash buckets with  19 entries     48 (  0%)
Hash buckets with  20 entries     24 (  0%)
Hash buckets with  21 entries     13 (  0%)
Hash buckets with  22 entries      8 (  0%)

A much wider bell curve than phase 3, but still centered around the
optimal value and far, far better than the distribution of the
current hash calculation. Runtime:

Phase           Start           End             Duration
Phase 1:        12/06 11:47:21  12/06 11:47:21
Phase 2:        12/06 11:47:21  12/06 11:47:23  2 seconds
Phase 3:        12/06 11:47:23  12/06 11:50:50  3 minutes, 27 seconds
Phase 4:        12/06 11:50:50  12/06 11:53:57  3 minutes, 7 seconds
Phase 5:        12/06 11:53:57  12/06 11:53:58  1 second
Phase 6:        12/06 11:53:58  12/06 11:54:51  53 seconds
Phase 7:        12/06 11:54:51  12/06 11:54:52  1 second

Total run time: 7 minutes, 31 seconds

Essentially unchanged - this is somewhat of a "swings and
roundabouts" test here because what it is testing is the cache-miss
overhead.

FWIW, the comparison here shows a pretty good case for the existing
hash calculation. On a less populated filesystem (5m inodes rather
than 50m inodes) the typical hash distribution was:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262094
Hash table size = 32768
Hits = 626228
Misses = 800166
Hit ratio = 43.90
Hash buckets with   0 entries  29274 (  0%)
Hash buckets with   3 entries      1 (  0%)
Hash buckets with   4 entries      1 (  0%)
Hash buckets with   7 entries      1 (  0%)
Hash buckets with   8 entries      1 (  0%)
Hash buckets with   9 entries      1 (  0%)
Hash buckets with  12 entries      1 (  0%)
Hash buckets with  13 entries      1 (  0%)
Hash buckets with  16 entries      2 (  0%)
Hash buckets with  18 entries      1 (  0%)
Hash buckets with  22 entries      1 (  0%)
Hash buckets with >24 entries   3483 ( 99%)

Total and utter crap. Same filesystem, new hash function:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262103
Hash table size = 32768
Hits = 673208
Misses = 838265
Hit ratio = 44.54
Hash buckets with   3 entries    558 (  0%)
Hash buckets with   4 entries   1126 (  1%)
Hash buckets with   5 entries   2440 (  4%)
Hash buckets with   6 entries   4249 (  9%)
Hash buckets with   7 entries   5280 ( 14%)
Hash buckets with   8 entries   5598 ( 17%)
Hash buckets with   9 entries   5446 ( 18%)
Hash buckets with  10 entries   3879 ( 14%)
Hash buckets with  11 entries   2405 ( 10%)
Hash buckets with  12 entries   1187 (  5%)
Hash buckets with  13 entries    447 (  2%)
Hash buckets with  14 entries    125 (  0%)
Hash buckets with  15 entries     25 (  0%)
Hash buckets with  16 entries      3 (  0%)

Kinda says it all, really...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agorepair: per AG locks contend for cachelines
Dave Chinner [Mon, 3 Mar 2014 01:16:54 +0000 (12:16 +1100)] 
repair: per AG locks contend for cachelines

The per-ag locks used to protect per-ag block lists are located in a tightly
packed array. That means that they share cachelines, so separate
them out into separate 64 byte regions in the array.

pahole confirms the padding is correctly applied:

struct aglock {
        pthread_mutex_t            lock;                 /*     0    40 */

        /* size: 64, cachelines: 1, members: 1 */
        /* padding: 24 */
};

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agorepair: translation lookups limit scalability
Dave Chinner [Mon, 3 Mar 2014 01:16:44 +0000 (12:16 +1100)] 
repair: translation lookups limit scalability

A bit of perf magic showed that scalability was limited to 3-4
concurrent threads due to contention on a lock inside something
called __dcigettext(). That's in some library somewhere that repair
is linked against, and it turns out to be inside the translation
infrastructure that supports the _() string mechanism:

# Samples: 34K of event 'cs'
# Event count (approx.): 495567
#
# Overhead        Command      Shared Object          Symbol
# ........  .............  .................  ..............
#
    60.30%     xfs_repair  [kernel.kallsyms]  [k] __schedule
               |
               --- 0x63fffff9c
                   process_bmbt_reclist_int
                  |
                  |--39.95%-- __dcigettext
                  |          __lll_lock_wait
                  |          system_call_fastpath
                  |          SyS_futex
                  |          do_futex
                  |          futex_wait
                  |          futex_wait_queue_me
                  |          schedule
                  |          __schedule
                  |
                  |--8.91%-- __lll_lock_wait
                  |          system_call_fastpath
                  |          SyS_futex
                  |          do_futex
                  |          futex_wait
                  |          futex_wait_queue_me
                  |          schedule
                  |          __schedule
                   --51.13%-- [...]

Fix this by initialising global variables that hold the translated
strings at startup, hence avoiding the need to do repeated runtime
translation of the same strings.

Runtime of an unpatched xfs_repair is roughly:

        XFS_REPAIR Summary    Fri Dec  6 11:15:50 2013

Phase           Start           End             Duration
Phase 1:        12/06 10:56:21  12/06 10:56:21
Phase 2:        12/06 10:56:21  12/06 10:56:23  2 seconds
Phase 3:        12/06 10:56:23  12/06 11:01:31  5 minutes, 8 seconds
Phase 4:        12/06 11:01:31  12/06 11:07:08  5 minutes, 37 seconds
Phase 5:        12/06 11:07:08  12/06 11:07:09  1 second
Phase 6:        12/06 11:07:09  12/06 11:15:49  8 minutes, 40 seconds
Phase 7:        12/06 11:15:49  12/06 11:15:50  1 second

Total run time: 19 minutes, 29 seconds

Patched version:

Phase           Start           End             Duration
Phase 1:        12/06 10:36:29  12/06 10:36:29
Phase 2:        12/06 10:36:29  12/06 10:36:31  2 seconds
Phase 3:        12/06 10:36:31  12/06 10:40:08  3 minutes, 37 seconds
Phase 4:        12/06 10:40:08  12/06 10:43:42  3 minutes, 34 seconds
Phase 5:        12/06 10:43:42  12/06 10:43:42
Phase 6:        12/06 10:43:42  12/06 10:50:28  6 minutes, 46 seconds
Phase 7:        12/06 10:50:28  12/06 10:50:29  1 second

Total run time: 14 minutes

Big win!

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agomkfs: proto file creation does not set ftype correctly
Dave Chinner [Fri, 28 Feb 2014 01:04:26 +0000 (12:04 +1100)] 
mkfs: proto file creation does not set ftype correctly

Hence running xfs_repair on an ftype enabled filesystem that has
contents created by a proto file will throw warnings on mismatched
ftype entries and correct them. xfs/031 fails due to this problem.

Fix it by setting up the xname structure with the correct type
fields.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agomkfs: default log size for small filesystems too large
Dave Chinner [Fri, 28 Feb 2014 01:04:12 +0000 (12:04 +1100)] 
mkfs: default log size for small filesystems too large

Recent changes to the log size scaling have resulted in using the
default size multiplier for the log size even on small filesystems.
Commit 88cd79b ("xfs: Add xfs_log_rlimit.c") changed the calculation
of the maximum transaction size that the kernel would issue and
that significantly increased the minimum size of the default log.
As such, the size of the log on small filesystems was typically
larger than the previous default, even though the previous default
was still larger than the minimum needed.

Rework the default log size calculation such that it will use the
original log size default if it is larger than the minimum log size
required, and only use a larger log if the configuration of the
filesystem requires it.

This is especially obvious in xfs/216, where the default log size is
10MB all the way up to 16GB filesystems. The current mkfs selects a
log size of 50MB for the same size filesystems and this is
unnecessarily large.

Return the scaling of the log size for small filesystems to
something similar to what xfs/216 expects.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_io: add support for O_TMPFILE opens
Christoph Hellwig [Sun, 23 Feb 2014 21:13:51 +0000 (08:13 +1100)] 
xfs_io: add support for O_TMPFILE opens

Add a new -T argument to the open command that supports using the
O_TMPFILE flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agolibxfs: clear stale buffer errors on write
Dave Chinner [Sun, 23 Feb 2014 21:13:28 +0000 (08:13 +1100)] 
libxfs: clear stale buffer errors on write

If we've read a buffer and it's had an error (e.g a bad CRC) and the
caller corrects the problem with the buffer and writes it via
libxfs_writebuf() without clearing the error on the buffer,
subsequent reads of the buffer while it is still in cache can see
that error and fail inappropriately.

xfs/033 demonstrates this error, where phase 3 detects the corrupted
root inode and clears, but doesn't clear the b_error field. Later in
phase 6, the code that rebuilds the root directory tries to read the
root inode and sees a buffer with an error on it, thereby triggering
a fatal repair failure:

Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
xfs_inode_buf_verify: XFS_CORRUPTION_ERROR
bad magic number 0x0 on inode 64
....
cleared root inode 64
....
Phase 6 - check inode connectivity...
reinitializing root directory
xfs_imap_to_bp: xfs_trans_read_buf() returned error 117.

fatal error -- could not iget root inode -- error - 117
#

Fix this by assuming buffers that are written are clean and correct
and hence we can zero the b_error field before retiring the buffer
to the cache.

Reported-by: Eric Sandeen <esandeen@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <esandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agolibxfs: contiguous buffers are not discontigous
Dave Chinner [Sun, 23 Feb 2014 21:12:41 +0000 (08:12 +1100)] 
libxfs: contiguous buffers are not discontigous

When discontiguous directory buffer support was fixed in xfs_repair,
(dd9093d xfs_repair: fix discontiguous directory block support)
it changed to using libxfs_getbuf_map() to support mapping
discontiguous blocks, and the prefetch code special cased such
discontiguous buffers.

The issue is that libxfs_getbuf_map() marks all buffers, even
contiguous ones - as LIBXFS_B_DISCONTIG, and so the prefetch code
was treating every buffer as discontiguous. This causes the prefetch
code to completely bypass the large IO optimisations for dense areas
of metadata. Because there was no obvious change in performance or
IO patterns, this wasn't noticed during performance testing.

However, this change mysteriously fixed a regression in xfs/033 in
the v3.2.0-alpha release, and this change in behaviour was
discovered as part of triaging why it "fixed" the regression.
Anyway, restoring the large IO prefetch optimisation results in
a repair on a 10 million inode filesystem dropping from 197s to 173s,
and the peak IOPS rate in phase 3 dropping from 25,000 to roughly
2,000 by trading off a bandwidth increase of roughly 100% (i.e.
200MB/s to 400MB/s). Phase 4 saw similar changes in IO profile and
speed increases.

This, however, re-introduces the regression in xfs/033, which will
now be fixed in a separate patch.

Reported-by: Eric Sandeen <esandeen@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <esandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_repair: fix sibling pointer tests in verify_dir2_path()
Eric Sandeen [Sun, 23 Feb 2014 21:11:29 +0000 (08:11 +1100)] 
xfs_repair: fix sibling pointer tests in verify_dir2_path()

RH QE reported that if we create a 1G filesystem with default
options, mount it, and create inodes until full, then run
repair, repair reports corruption in verify_dir2_path() with:

> bad back pointer in block 8390324 for directory inode 131

Commit 88b32f0 ("xfs: add CRCs to dir2/da node blocks")
had a small error which regressed this; although we switch
to the "newnode" to check sibling pointers, we re-populate
the node hdr with the old "node" data.  This causes the
back pointer test to be testing the wrong node's values.

Fixing this bug fixes the testcase.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_db: fix attribute leaf output for ATTR3 format
Dave Chinner [Wed, 19 Feb 2014 00:52:25 +0000 (11:52 +1100)] 
xfs_db: fix attribute leaf output for ATTR3 format

attr3_leaf_entries_count() checks against the wrong magic number,
hence it returns zero for an entry count when it should be returning
a value. Fixing this makes xfs/021 pass on CRC enabled filesystems.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_io: set argmax to 1 for imap command
Brian Foster [Wed, 19 Feb 2014 00:52:24 +0000 (11:52 +1100)] 
xfs_io: set argmax to 1 for imap command

The imap command supports an optional argument to specify the
number of inode records to capture per ioctl(), but argmax is
currently set to 0. This leads to an error if an argument is
provided on the command line. Set argmax to 1 to support the
optional argument.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_db: fix the setting of unaligned directory fields
Mark Tinguely [Wed, 19 Feb 2014 00:16:09 +0000 (11:16 +1100)] 
xfs_db: fix the setting of unaligned directory fields

Setting the directory startoff, startblock, and blockcount
fields is difficult on both big and little endian machines.
The setting of extentflag was completely broken.

big endian test:
xfs_db> write u.bmx[0].startblock 12
u.bmx[0].startblock = 0
xfs_db> write u.bmx[0].startblock 0xc0000
u.bmx[0].startblock = 192

little endian test:
xfs_db> write u.bmx[0].startblock 12
u.bmx[0].startblock = 211106232532992
xfs_db> write u.bmx[0].startblock 0xc0000
u.bmx[0].startblock = 3221225472

Since the output fields and the lengths are not aligned to a byte,
setbitval requires them to be entered in big endian and properly
byte/nibble shifted. The extentflag output was aligned to a byte
but was not shifted correctly.

Convert the input to big endian on little endian machines and
nibble/byte shift on all platforms so setbitval can set the bits
correctly.

Clean up some whitespace while in the setbitval() function.

[dchinner: cleaned up remaining bad whitespace and comments]

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agometadump: fully support discontiguous directory blocks
Dave Chinner [Mon, 3 Feb 2014 22:41:51 +0000 (09:41 +1100)] 
metadump: fully support discontiguous directory blocks

Now that directory block obfuscation can handle single contiguous
directory blocks, we can make the multi-block code use discontiguous
buffers to read in an entire directory block at a time. This allows
us to pass a complete directory object to the processing function
and hence be able to process any sort of directory object regardless
of its underlying layout.

With this, we can remove the multi-block loop from the directory
processing code and get rid of all the structures used to hold
inter-call state. This greatly simplifies the code as well as
adding the additional functionality.

With this patch, a CRC enabled filesystem now passes xfs/291.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agometadump: walk single fsb objects a block at a time
Dave Chinner [Mon, 3 Feb 2014 22:41:41 +0000 (09:41 +1100)] 
metadump: walk single fsb objects a block at a time

To be able to support arbitrary discontiguous extents in multi-block
objects, we need to be able to process a single object at a time
presented as a single flat buffer. Right now we pass an arbitrary
extent and have the individual object processing functions break it
up and keep track of inter-call state.

This greatly complicates the processing of directory objects, such
that certain formats are simply not handled at all. Instead, for
single block objects loop over the extent a block at a time, feeding
a whole object to the processing function and hence making the
extent walking generic instead of per object.

At this point multi-block directory objects still need to use the
existing code, so duplicate the old single block object code into it
so we can fix it up properly. This means directory block processing
can't be fully cleaned up yet.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agometadump: separate single block objects from multiblock objects
Dave Chinner [Mon, 3 Feb 2014 22:40:13 +0000 (09:40 +1100)] 
metadump: separate single block objects from multiblock objects

When trying to dump objects, we have to treat multi-block objects
differently to single block objects. Separate out the code paths for
single block vs multi-block objects so we can add a separate path
for multi-block objects.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agometadump: support writing discontiguous io cursors
Dave Chinner [Mon, 3 Feb 2014 22:35:36 +0000 (09:35 +1100)] 
metadump: support writing discontiguous io cursors

To handle discontiguous buffers, metadump needs to be able to handle
io cursors that use discontiguous buffer mappings. Factor
write_buf() to extract the data copy routine and use that to
implement support for both flat and discontiguous buffer maps.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agometadump: sanitise write_buf/index return values
Dave Chinner [Mon, 3 Feb 2014 22:35:26 +0000 (09:35 +1100)] 
metadump: sanitise write_buf/index return values

write_buf()/write_index() use confusing boolean return values,
meaning that it's hard to tell what the correct error return is
supposed to be.  Convert them to return zero on success or a
negative errno otherwise so that it's clear what the error case is.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_repair: fix discontiguous directory block support
Dave Chinner [Mon, 3 Feb 2014 00:55:36 +0000 (11:55 +1100)] 
xfs_repair: fix discontiguous directory block support

xfs/291 tests fragmented multi-block directories, and xfs_repair
throws lots of lookup badness errors in phase 3:

        - agno = 1
7fa3d2758740: Badness in key lookup (length)
bp=(bno 0x1e040, len 4096 bytes) key=(bno 0x1e040, len 16384 bytes)
        - agno = 2
7fa3d2758740: Badness in key lookup (length)
bp=(bno 0x2d0e8, len 4096 bytes) key=(bno 0x2d0e8, len 16384 bytes)
7fa3d2758740: Badness in key lookup (length)
bp=(bno 0x2d068, len 4096 bytes) key=(bno 0x2d068, len 16384 bytes)
        - agno = 3
7fa3d2758740: Badness in key lookup (length)
bp=(bno 0x39130, len 4096 bytes) key=(bno 0x39130, len 16384 bytes)
7fa3d2758740: Badness in key lookup (length)
bp=(bno 0x391b0, len 4096 bytes) key=(bno 0x391b0, len 16384 bytes)
7fa3d2758740: Badness in key lookup (length)

This is because it is trying to read a directory buffer in full
(16k), but is finding a single 4k block in the cache instead.

The opposite is happening in phase 6 - phase 6 is trying to read 4k
buffers but is finding a 16k buffer there instead.

The problem is caused by the fact that directory buffers can be
represented as compound buffers or as individual buffers depending
on the code reading the directory blocks. The main problem is that
the IO prefetch code doesn't understand compound buffers, so teach
it about compound buffers to make the problem go away.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agolibxfs: remove map from libxfs_readbufr_map
Dave Chinner [Mon, 3 Feb 2014 00:11:11 +0000 (11:11 +1100)] 
libxfs: remove map from libxfs_readbufr_map

The map passed in to libxfs_readbufr_map is used to check the buffer
matches the map. However, the repair readahead code has no map it
can use to validate the buffer it set up previously, so just get rid
of the map being passed in because it serves no useful purpose.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agolibxfs: add a flags field to libxfs_getbuf_map
Dave Chinner [Mon, 3 Feb 2014 00:10:17 +0000 (11:10 +1100)] 
libxfs: add a flags field to libxfs_getbuf_map

It will be needed to make the repair prefetch code aware of compound
buffers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
11 years agoxfs_repair: add support for validating dirent ftype field
Dave Chinner [Mon, 3 Feb 2014 00:08:44 +0000 (11:08 +1100)] 
xfs_repair: add support for validating dirent ftype field

Add code to track the filetype of an inode from phase 3 when all the
inodes are scanned through to phase 6 when the directory structure
is validated and corrected.

Add code to phase 6 shortform and long form directory entry
validation to read the ftype from the dirent, lookup the inode
record and check they are the same. If they aren't and we are in
no-modify mode, issue a warning such as:

Phase 6 - check inode connectivity...
        - traversing filesystem ...
would fix ftype mismatch (5/1) in directory/child inode 64/68
        - traversal finished ...
        - moving disconnected inodes to lost+found ...

If we are fixing the problem:

Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
fixing ftype mismatch (5/1) in directory/child inode 64/68
        - traversal finished ...
        - moving disconnected inodes to lost+found ...

Note that this is from a leaf form directory entry that was
intentionally corrupted with xfs_db like so:

xfs_db> inode 64
xfs_db> a u3.bmx[0].startblock
xfs_db> p
....
du[3].inumber = 68
du[3].namelen = 11
du[3].name = "syscalltest"
du[3].filetype = 1
du[3].tag = 0x70
....
xfs_db> write du[3].filetype 5
du[3].filetype = 5
xfs_db> quit

Shortform directory entry repair was tested in a similar fashion.

Further, track the ftype in the directory hash table that is built,
so if the directory is rebuilt from scratch it has the necessary
ftype information to rebuild the directory correctly. Further, if we
detect an ftype mismatch, update the entry in the hash so that later
directory errors that lead to it being rebuilt use the corrected
ftype field, not the bad one.

Note that this code pulls in some kernel side code that is currently
in kernel private locations (xfs_mode_to_ftype table), so there'll
be some kernel/userspace sync work needed to bring these back into
line.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>