git.ipfire.org Git - thirdparty/xfsprogs-dev.git/log

]> git.ipfire.org Git - thirdparty/xfsprogs-dev.git/log

projects / thirdparty / xfsprogs-dev.git / log

Eric Sandeen [Mon, 3 Mar 2014 01:22:02 +0000 (12:22 +1100)]

xfs_logprint: don't advance op counter in xlog_print_trans_icreate

xlog_print_trans_icreate is advancing the op counter
"(*i)++" incorrectly; it only contains one region, and
the loop which called it will properly advance the op
once we're done.

Found-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Eric Sandeen [Mon, 3 Mar 2014 01:21:49 +0000 (12:21 +1100)]

xfs_logprint: Don't error out after split items lose context

xfs_logprint recognizes a "left over region from split log item"
but then expects the *next* op to be a valid start to a new
item.  The problem is, we can split i.e. an xfs_inode_log_format
item, skip over it, and then land on the xfs_icdinode_t
data which follows it - this doesn't have a valid log item
magic (XFS_LI_*) and we error out.  This results in something
like:

  xfs_logprint: unknown log operation type (494e)

Fix this by recognizing that we've skipped over an item and
lost the context we're in, so just continue skipping over
op headers until we find the next valid start to a log item.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Eric Sandeen [Mon, 3 Mar 2014 01:21:21 +0000 (12:21 +1100)]

xfs_copy: band-aids for CRC filesystems

xfs_copy needs a fair bit of work for CRCs because it rewrites
UUIDs by default, but this change will get it working properly
with the "-d" (duplicate) option which keeps the same UUID.

Accept the the CRC magic valid, change the ASSERT() to an error
message and exit more gracefully, and don't
even get started if we don't have the '-d' option.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Eric Sandeen [Mon, 3 Mar 2014 01:19:24 +0000 (12:19 +1100)]

xfs_metadump: include F in getopts string

I added an F case, but didn't add it to the bash
getopts string last go-round.

/usr/sbin/xfs_metadump: illegal option -- F

I sure thought I tested this, I'm not sure how it got lost.

Reported-by: Boris Ranto <branto@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Mar 2014 01:19:14 +0000 (12:19 +1100)]

repair: prefetch runs too far ahead

When trying to work out why a non-crc filesystem took 1m57 to repair
and the same CRC enabled filesystem took 11m35 to repair, I noticed
that there was far too much CRC checking going on. Prefetched buffers
should not be CRCed, yet shortly after starting this began
to happen. perf profiling also showed up an awful lot of time doing
buffer cache lookups, and the cache profile output indicated that
the hit rate was way below 3%. IOWs, the readahead was getting so
far ahead of the processing that it was thrashing the cache.

That there is a difference in processing rate between CRC and
non-CRC filesystems is not surprising. What is surprising is the
readahead behaviour - it basically just keeps reading ahead until it
has read everything on an AG, and then it goes on to the next AG,
and reads everything on it, and then goes on to the next AG,....

This goes on until it pushes all the buffers the processing threads
need out of the cache, and suddenly they start re-reading from disk
with the various CRC checking verifiers enabled, and we end up going
-really- slow. Yes, threading made up for it a bit, but it's just
wrong.

Basically, the code assumes that IO is going to be slower than
processing, so it doesn't throttle prefetch across AGs to slow
down prefetch to match the processing rate.

So, to fix this, don't let a prefetch thread get more than a single
AG ahead of it's processing thread, just like occurs for single
threaded (i.e. -o ag_stride=-1) operation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Mar 2014 01:18:25 +0000 (12:18 +1100)]

repair: factor out threading setup code

The same code is repeated in different places to set up
multithreaded prefetching. This can all be factored into a single
implementation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Mar 2014 01:18:01 +0000 (12:18 +1100)]

repair: BMBT prefetch needs to be CRC aware

I'd been trying to track down a behavioural difference between
non-crc and crc enabled filesystems that was resulting non-crc
filesystem executing prefetch almost 3x faster than CRC filesystems.
After amny ratholes, I finally stumbled on the fact that btree
format directories are not being prefetched due to a missing magic
number check, and it's rejecting all XFS_BMAP_CRC_MAGIC format BMBT
buffers. This makes prefetch on CRC enabled filesystems behave the
same as for non-CRC filesystems.

The difference a single line of code can make on a 50 million inode
filesystem with a single threaded prefetch enabled run is pretty
amazing. It goes from 3,000 iops @ 50MB/s to 2,000 IOPS @ 800MB/s
and the cache hit rate goes from 3% to 49%. The runtime difference:

Unpatched:

Phase           Start           End             Duration
Phase 1:        02/21 18:34:12  02/21 18:34:12
Phase 2:        02/21 18:34:12  02/21 18:34:15  3 seconds
Phase 3:        02/21 18:34:15  02/21 18:40:09  5 minutes, 54 seconds
Phase 4:        02/21 18:40:09  02/21 18:46:36  6 minutes, 27 seconds
Phase 5:        02/21 18:46:36  02/21 18:46:37  1 second
Phase 6:        02/21 18:46:37  02/21 18:52:51  6 minutes, 14 seconds
Phase 7:        02/21 18:52:51  02/21 18:52:52  1 second

Total run time: 18 minutes, 40 seconds

Patched:

Phase           Start           End             Duration
Phase 1:        02/21 19:58:23  02/21 19:58:23
Phase 2:        02/21 19:58:23  02/21 19:58:27  4 seconds
Phase 3:        02/21 19:58:27  02/21 19:59:20  53 seconds
Phase 4:        02/21 19:59:20  02/21 20:00:07  47 seconds
Phase 5:        02/21 20:00:07  02/21 20:00:08  1 second
Phase 6:        02/21 20:00:08  02/21 20:00:50  42 seconds
Phase 7:        02/21 20:00:50  02/21 20:00:50

Total run time: 2 minutes, 27 seconds

Is no less impressive. Shame it's just a regression fix. ;)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Mar 2014 01:17:46 +0000 (12:17 +1100)]

repair: fix prefetch queue limiting

The length of the prefetch queue is limited by a semaphore. To avoid
a ABBA deadlock, we only trywait on the semaphore so if we fail to
get it we can kick the IO queues before sleeping. Unfortunately,
the "need to sleep" detection is just a little wrong - it needs to
lok at errno, not err for the EAGAIN value.

Hence this queue throttling has not been working for a long time.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Mar 2014 01:17:34 +0000 (12:17 +1100)]

repair: use a listhead for the dotdot list

Cleanup suggested by Christoph Hellwig - removes another open coded
list implementation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Mar 2014 01:17:27 +0000 (12:17 +1100)]

repair: limit auto-striding concurrency apprpriately

It's possible to have filesystems with hundreds of AGs on systems
with little concurrency and resources. In this case, we can easily
exhaust memory and fail to create threads and have all sorts of
interesting problems.

xfs/250 can cause this to occur, with failures like:

        - agno = 707
        - agno = 692
fatal error -- cannot create worker threads, error = [11] Resource temporarily unavailable

And this:

        - agno = 484
        - agno = 782
failed to create prefetch thread: Resource temporarily unavailable

Because it's trying to create more threads than a poor little 512MB
single CPU ia32 box can handle.

So, limit concurrency to a maximum of numcpus * 8 to prevent this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Mar 2014 01:17:07 +0000 (12:17 +1100)]

libxfs: buffer cache hashing is suboptimal

The hashkey calculation is very simplistic,and throws away an amount
of entropy that should be folded into the hash. The result is
sub-optimal distribution across the hash tables. For example, with a
default 512 entry table, phase 2 results in this:

Max supported entries = 4096
Max utilized entries = 3970
Active entries = 3970
Hash table size = 512
Hits = 0
Misses = 3970
Hit ratio =  0.00
Hash buckets with   0 entries     12 (  0%)
Hash buckets with   1 entries      3 (  0%)
Hash buckets with   2 entries     10 (  0%)
Hash buckets with   3 entries      2 (  0%)
Hash buckets with   4 entries    129 ( 12%)
Hash buckets with   5 entries     20 (  2%)
Hash buckets with   6 entries     54 (  8%)
Hash buckets with   7 entries     22 (  3%)
Hash buckets with   8 entries    150 ( 30%)
Hash buckets with   9 entries     14 (  3%)
Hash buckets with  10 entries     16 (  4%)
Hash buckets with  11 entries      7 (  1%)
Hash buckets with  12 entries     38 ( 11%)
Hash buckets with  13 entries      5 (  1%)
Hash buckets with  14 entries      4 (  1%)
Hash buckets with  17 entries      1 (  0%)
Hash buckets with  19 entries      1 (  0%)
Hash buckets with  23 entries      1 (  0%)
Hash buckets with >24 entries     23 ( 16%)

Now, given a perfect distribution, we shoul dhave 8 entries per
chain. What we end up with is nothing like that.

Unfortunately, for phase 3/4 and others, the number of cached
objects results in the cache being expanded to 256k entries, and
so the stats just give this;

Hits = 262276
Misses = 8130393
Hit ratio =  3.13
Hash buckets with >24 entries    512 (100%)

We can't evaluate the efficiency of the hashing algorithm here.
Let's increase the size of the hash table to 32768 entries and go
from there:

Phase 2:

Hash buckets with   0 entries  31884 (  0%)
Hash buckets with   1 entries     35 (  0%)
Hash buckets with   2 entries     78 (  3%)
Hash buckets with   3 entries     30 (  2%)
Hash buckets with   4 entries    649 ( 65%)
Hash buckets with   5 entries     12 (  1%)
Hash buckets with   6 entries     13 (  1%)
Hash buckets with   8 entries     40 (  8%)
Hash buckets with   9 entries      1 (  0%)
Hash buckets with  13 entries      1 (  0%)
Hash buckets with  15 entries      1 (  0%)
Hash buckets with  22 entries      1 (  0%)
Hash buckets with  24 entries     17 ( 10%)
Hash buckets with >24 entries      6 (  4%)

There's a significant number of collisions given the population is
only 15% of the size of the table itself....

Phase 3:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262090
Hash table size = 32768
Hits = 530844
Misses = 7164575
Hit ratio =  6.90
Hash buckets with   0 entries  11898 (  0%)
....
Hash buckets with  12 entries   5513 ( 25%)
Hash buckets with  13 entries   4188 ( 20%)
Hash buckets with  14 entries   2073 ( 11%)
Hash buckets with  15 entries   1811 ( 10%)
Hash buckets with  16 entries   1994 ( 12%)
....
Hash buckets with >24 entries    339 (  4%)

So, a third of the hash table does not even has any entries in them,
despite having more than 7.5 million entries run through the cache.
Median chain lengths are 12-16 entries, ideal is 8. And lots of
collisions on the longer than 24 entrie chains...

Phase 6:

Hash buckets with   0 entries  14573 (  0%)
....
Hash buckets with >24 entries   2291 ( 36%)

Ouch. Not a good distribution at all.

Overall runtime:

Phase           Start           End             Duration
Phase 1:        12/06 11:35:04  12/06 11:35:04
Phase 2:        12/06 11:35:04  12/06 11:35:07  3 seconds
Phase 3:        12/06 11:35:07  12/06 11:38:27  3 minutes, 20 seconds
Phase 4:        12/06 11:38:27  12/06 11:41:32  3 minutes, 5 seconds
Phase 5:        12/06 11:41:32  12/06 11:41:32
Phase 6:        12/06 11:41:32  12/06 11:42:29  57 seconds
Phase 7:        12/06 11:42:29  12/06 11:42:30  1 second

Total run time: 7 minutes, 26 seconds

Modify the hash to be something more workable - steal the linux
kernel inode hash calculation and try that:

phase 2:

Max supported entries = 262144
Max utilized entries = 3970
Active entries = 3970
Hash table size = 32768
Hits = 0
Misses = 3972
Hit ratio =  0.00
Hash buckets with   0 entries  29055 (  0%)
Hash buckets with   1 entries   3464 ( 87%)
Hash buckets with   2 entries    241 ( 12%)
Hash buckets with   3 entries      8 (  0%)

Close to perfect.

Phase 3:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262118
Hash table size = 32768
Hits = 567900
Misses = 7118749
Hit ratio =  7.39
Hash buckets with   5 entries   1572 (  2%)
Hash buckets with   6 entries   2186 (  5%)
Hash buckets with   7 entries   9217 ( 24%)
Hash buckets with   8 entries   8757 ( 26%)
Hash buckets with   9 entries   6135 ( 21%)
Hash buckets with  10 entries   3166 ( 12%)
Hash buckets with  11 entries   1257 (  5%)
Hash buckets with  12 entries    364 (  1%)
Hash buckets with  13 entries     94 (  0%)
Hash buckets with  14 entries     14 (  0%)
Hash buckets with  15 entries      5 (  0%)

A near-perfect bell curve centered on the optimal distribution
number of 8 entries per chain.

Phase 6:

Hash buckets with   0 entries     24 (  0%)
Hash buckets with   1 entries    190 (  0%)
Hash buckets with   2 entries    571 (  0%)
Hash buckets with   3 entries   1263 (  1%)
Hash buckets with   4 entries   2465 (  3%)
Hash buckets with   5 entries   3399 (  6%)
Hash buckets with   6 entries   4002 (  9%)
Hash buckets with   7 entries   4186 ( 11%)
Hash buckets with   8 entries   3773 ( 11%)
Hash buckets with   9 entries   3240 ( 11%)
Hash buckets with  10 entries   2523 (  9%)
Hash buckets with  11 entries   2074 (  8%)
Hash buckets with  12 entries   1582 (  7%)
Hash buckets with  13 entries   1206 (  5%)
Hash buckets with  14 entries    863 (  4%)
Hash buckets with  15 entries    601 (  3%)
Hash buckets with  16 entries    386 (  2%)
Hash buckets with  17 entries    205 (  1%)
Hash buckets with  18 entries    122 (  0%)
Hash buckets with  19 entries     48 (  0%)
Hash buckets with  20 entries     24 (  0%)
Hash buckets with  21 entries     13 (  0%)
Hash buckets with  22 entries      8 (  0%)

A much wider bell curve than phase 3, but still centered around the
optimal value and far, far better than the distribution of the
current hash calculation. Runtime:

Phase           Start           End             Duration
Phase 1:        12/06 11:47:21  12/06 11:47:21
Phase 2:        12/06 11:47:21  12/06 11:47:23  2 seconds
Phase 3:        12/06 11:47:23  12/06 11:50:50  3 minutes, 27 seconds
Phase 4:        12/06 11:50:50  12/06 11:53:57  3 minutes, 7 seconds
Phase 5:        12/06 11:53:57  12/06 11:53:58  1 second
Phase 6:        12/06 11:53:58  12/06 11:54:51  53 seconds
Phase 7:        12/06 11:54:51  12/06 11:54:52  1 second

Total run time: 7 minutes, 31 seconds

Essentially unchanged - this is somewhat of a "swings and
roundabouts" test here because what it is testing is the cache-miss
overhead.

FWIW, the comparison here shows a pretty good case for the existing
hash calculation. On a less populated filesystem (5m inodes rather
than 50m inodes) the typical hash distribution was:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262094
Hash table size = 32768
Hits = 626228
Misses = 800166
Hit ratio = 43.90
Hash buckets with   0 entries  29274 (  0%)
Hash buckets with   3 entries      1 (  0%)
Hash buckets with   4 entries      1 (  0%)
Hash buckets with   7 entries      1 (  0%)
Hash buckets with   8 entries      1 (  0%)
Hash buckets with   9 entries      1 (  0%)
Hash buckets with  12 entries      1 (  0%)
Hash buckets with  13 entries      1 (  0%)
Hash buckets with  16 entries      2 (  0%)
Hash buckets with  18 entries      1 (  0%)
Hash buckets with  22 entries      1 (  0%)
Hash buckets with >24 entries   3483 ( 99%)

Total and utter crap. Same filesystem, new hash function:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262103
Hash table size = 32768
Hits = 673208
Misses = 838265
Hit ratio = 44.54
Hash buckets with   3 entries    558 (  0%)
Hash buckets with   4 entries   1126 (  1%)
Hash buckets with   5 entries   2440 (  4%)
Hash buckets with   6 entries   4249 (  9%)
Hash buckets with   7 entries   5280 ( 14%)
Hash buckets with   8 entries   5598 ( 17%)
Hash buckets with   9 entries   5446 ( 18%)
Hash buckets with  10 entries   3879 ( 14%)
Hash buckets with  11 entries   2405 ( 10%)
Hash buckets with  12 entries   1187 (  5%)
Hash buckets with  13 entries    447 (  2%)
Hash buckets with  14 entries    125 (  0%)
Hash buckets with  15 entries     25 (  0%)
Hash buckets with  16 entries      3 (  0%)

Kinda says it all, really...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Mar 2014 01:16:54 +0000 (12:16 +1100)]

repair: per AG locks contend for cachelines

The per-ag locks used to protect per-ag block lists are located in a tightly
packed array. That means that they share cachelines, so separate them out inot
separate 64 byte regions in the array.

pahole confirms the padding is correctly applied:

struct aglock {
        pthread_mutex_t            lock;                 /*     0    40 */

        /* size: 64, cachelines: 1, members: 1 */
        /* padding: 24 */
};

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Mar 2014 01:16:44 +0000 (12:16 +1100)]

repair: translation lookups limit scalability

A bit of perf magic showed that scalability was limits to 3-4
concurrent threads due to contention on a lock inside in something
called __dcigettext(). That some library somewhere that repair is
linked against, and it turns out to be inside the translation
infrastructure to support the _() string mechanism:

# Samples: 34K of event 'cs'
# Event count (approx.): 495567
#
# Overhead        Command      Shared Object          Symbol
# ........  .............  .................  ..............
#
    60.30%     xfs_repair  [kernel.kallsyms]  [k] __schedule
               |
               --- 0x63fffff9c
                   process_bmbt_reclist_int
                  |
                  |--39.95%-- __dcigettext
                  |          __lll_lock_wait
                  |          system_call_fastpath
                  |          SyS_futex
                  |          do_futex
                  |          futex_wait
                  |          futex_wait_queue_me
                  |          schedule
                  |          __schedule
                  |
                  |--8.91%-- __lll_lock_wait
                  |          system_call_fastpath
                  |          SyS_futex
                  |          do_futex
                  |          futex_wait
                  |          futex_wait_queue_me
                  |          schedule
                  |          __schedule
                   --51.13%-- [...]

Fix this by initialising global variables that hold the translated
strings at startup, hence avoiding the need to do repeated runtime
translation of the same strings.

Runtime of an unpatched xfs_repair is roughly:

        XFS_REPAIR Summary    Fri Dec  6 11:15:50 2013

Phase           Start           End             Duration
Phase 1:        12/06 10:56:21  12/06 10:56:21
Phase 2:        12/06 10:56:21  12/06 10:56:23  2 seconds
Phase 3:        12/06 10:56:23  12/06 11:01:31  5 minutes, 8 seconds
Phase 4:        12/06 11:01:31  12/06 11:07:08  5 minutes, 37 seconds
Phase 5:        12/06 11:07:08  12/06 11:07:09  1 second
Phase 6:        12/06 11:07:09  12/06 11:15:49  8 minutes, 40 seconds
Phase 7:        12/06 11:15:49  12/06 11:15:50  1 second

Total run time: 19 minutes, 29 seconds

Patched version:

Phase           Start           End             Duration
Phase 1:        12/06 10:36:29  12/06 10:36:29
Phase 2:        12/06 10:36:29  12/06 10:36:31  2 seconds
Phase 3:        12/06 10:36:31  12/06 10:40:08  3 minutes, 37 seconds
Phase 4:        12/06 10:40:08  12/06 10:43:42  3 minutes, 34 seconds
Phase 5:        12/06 10:43:42  12/06 10:43:42
Phase 6:        12/06 10:43:42  12/06 10:50:28  6 minutes, 46 seconds
Phase 7:        12/06 10:50:28  12/06 10:50:29  1 second

Total run time: 14 minutes

Big win!

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Fri, 28 Feb 2014 01:04:26 +0000 (12:04 +1100)]

mkfs: proto file creation does not set ftype correctly

Hence running xfs_repair on a ftype enable filesystem that has
contents created by a proto file will throw warnings on mismatched
ftype entries and correct them. xfs/031 fails due to this problem.

Fix it by settin gup the xname structure with the correct type
fields.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Fri, 28 Feb 2014 01:04:12 +0000 (12:04 +1100)]

mkfs: default log size for small filesystems too large

Recent changes to the log size scaling have resulted in using the
default size multiplier for the log size even on small filesystems.
Commit 88cd79b ("xfs: Add xfs_log_rlimit.c") changed the calculation
of the maximum transaction size that the kernel would issues and
that significantly increased the minimum size of the default log.
As such the size of the log on small filesystems was typically
larger than the prefious default, even though the previous default
was still larger than the minimum needed.

Rework the default log size calculation such that it will use the
original log size default if it is larger than the minimum log size
required, and only use a larger log if the configuration of the
filesystem requires it.

This is especially obvious in xfs/216, where the default log size is
10MB all the way up to 16GB filesystems. The current mkfs selects a
log size of 50MB for the same size filesystems and this is
unnecessarily large.

Return the scaling of the log size for small filesystems to
something similar to what xfs/216 expects.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Christoph Hellwig [Sun, 23 Feb 2014 21:13:51 +0000 (08:13 +1100)]

xfs_io: add support for O_TMPFILE opens

On Thu, Feb 20, 2014 at 02:00:00PM -0800, Christoph Hellwig wrote:
Add a new -T argument to the open command that supports using the
O_TMPFILE flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Sun, 23 Feb 2014 21:13:28 +0000 (08:13 +1100)]

libxfs: clear stale buffer errors on write

If we've read a buffer and it's had an error (e.g a bad CRC) and the
caller corrects the problem with the buffer and writes it via
libxfs_writebuf() without clearing the error on the buffer,
subsequent reads of the buffer while it is still in cache can see
that error and fail inappropriately.

xfs/033 demonstrates this error, where phase 3 detects the corrupted
root inode and clears, but doesn't clear the b_error field. Later in
phase 6, the code that rebuilds the root directory tries to read the
root inode and sees a buffer with an error on it, thereby triggering
a fatal repair failure:

Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
xfs_inode_buf_verify: XFS_CORRUPTION_ERROR
bad magic number 0x0 on inode 64
....
cleared root inode 64
....
Phase 6 - check inode connectivity...
reinitializing root directory
xfs_imap_to_bp: xfs_trans_read_buf() returned error 117.

fatal error -- could not iget root inode -- error - 117
#

Fix this by assuming buffers that are written are clean and correct
and hence we can zero the b_error field before retiring the buffer
to the cache.

Reported-by: Eric Sandeen <esandeen@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <esandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Sun, 23 Feb 2014 21:12:41 +0000 (08:12 +1100)]

libxfs: contiguous buffers are not discontigous

When discontiguous directory buffer support was fixed in xfs_repair,
(dd9093d xfs_repair: fix discontiguous directory block support)
it changed to using libxfs_getbuf_map() to support mapping
discontiguous blocks, and the prefetch code special cased such
discontiguous buffers.

The issue is that libxfs_getbuf_map() marks all buffers, even
contiguous ones - as LIBXFS_B_DISCONTIG, and so the prefetch code
was treating every buffer as discontiguous. This causes the prefetch
code to completely bypass the large IO optimisations for dense areas
of metadata. Because there was no obvious change in performance or
IO patterns, this wasn't noticed during performance testing.

However, this change mysteriously fixed a regression in xfs/033 in
the v3.2.0-alpha release, and this change in behaviour was
discovered as part of triaging why it "fixed" the regression.
Anyway, restoring the large IO prefetch optimisation results
a repair on a 10 million inode filesystem dropping from 197s to 173s,
and the peak IOPS rate in phase 3 dropping from 25,000 to roughly
2,000 by trading off a bandwidth increase of roughly 100% (i.e.
200MB/s to 400MB/s). Phase 4 saw similar changes in IO profile and
speed increases.

This, however, re-introduces the regression in xfs/033, which will
now be fixed in a separate patch.

Reported-by: Eric Sandeen <esandeen@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <esandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Eric Sandeen [Sun, 23 Feb 2014 21:11:29 +0000 (08:11 +1100)]

xfs_repair: fix sibling pointer tests in verify_dir2_path()

RH QE reported that if we create a 1G filesystem with default
options, mount it, and create inodes until full, then run
repair, repair reports corruption in verify_dir2_path() with:

> bad back pointer in block 8390324 for directory inode 131

The commit 88b32f0 xfs: add CRCs to dir2/da node blocks
had a small error which regressed this; although we switch
to the "newnode," to check sibling pointers, we re-populate
the node hdr with the old "node" data. This causes the
backpointer test to be testing the wrong node's values.

Fixing this bug fixes the testcase.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Wed, 19 Feb 2014 00:52:25 +0000 (11:52 +1100)]

xfs_db: fix attribute leaf output for ATTR3 format

attr3_leaf_entries_count() checks against the wrong magic number.
hence returns zero for an entry count when it should be returning a
value. Fixing this makes xfs/021 pass on CRC enabled filesystems.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Brian Foster [Wed, 19 Feb 2014 00:52:24 +0000 (11:52 +1100)]

xfs_io: set argmax to 1 for imap command

The imap command supports an optional argument to specify the
number of inode records to capture per ioctl(), but argmax is
currently set to 0. This leads to an error if an argument is
provided on the command line. Set argmax to 1 to support the
optional argument.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Mark Tinguely [Wed, 19 Feb 2014 00:16:09 +0000 (11:16 +1100)]

xfs_db: fix the setting of unaligned directory fields

Setting the directory startoff, startblock, and blockcount
fields are difficult on both big and little endian machines.
The setting of extentflag was completely broken.

big endian test:
xfs_db> write u.bmx[0].startblock 12
u.bmx[0].startblock = 0
xfs_db> write u.bmx[0].startblock 0xc0000
u.bmx[0].startblock = 192

little endian test:
xfs_db> write u.bmx[0].startblock 12
u.bmx[0].startblock = 211106232532992
xfs_db> write u.bmx[0].startblock 0xc0000
u.bmx[0].startblock = 3221225472

Since the output fields and the lengths are not aligned to a byte,
setbitval requires them to be entered in big endian and properly
byte/nibble shifted. The extentflag output was aligned to a byte
but was not shifted correctly.

Convert the input to big endian on little endian machines and
nibble/byte shift on all platforms so setbitval can set the bits
correctly.

Clean some whitespace while in the setbitbal() function.

[dchinner: cleaned up remaining bad whitespace and comments]

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Feb 2014 22:41:51 +0000 (09:41 +1100)]

metadump: fully support discontiguous directory blocks

Now that directory block obfuscation can handle single contiguous
directory blocks, we can make the multi-block code use discontiguous
buffers to read in an entire directory block at a time. This allows
us to pass a complete directory object to the processing function
and hence be able to process any sort of directory object regardless
of it's underlying layout.

With the, we can remove the multi-block loop from the directory
processing code and get rid of allt eh structures used to hold
inter-call state. This graeatly simplifies the code as well as
adding the additional functionality.

With this patch, a CRC enabled filesystem now passes xfs/291.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Feb 2014 22:41:41 +0000 (09:41 +1100)]

metadump: walk single fsb objects a block at a time

To be able to support arbitrary discontiguous extents in multi-block
objects, we need to be able to process a single object at a time
presented as a single flat buffer. Right now we pass an arbitrary
extent and have the individal object processing functions break it
up and keep track of inter-call state.

This greatly complicates the processing of directory objects, such
that certain formats are simply not handled at all. Instead, for
single block objects loop over the extent a block at a time, feeding
a whole object to the processing function and hence making the
extent walking generic instead of per object.

At thsi point multi-block directory objects still need to use the
existing code, so duplicate the old single block object code into it
so we can fix it up properly. This means directory block processing
can't be fully cleaned up yet.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Feb 2014 22:40:13 +0000 (09:40 +1100)]

metadump: separate single block objects from multiblock objects

When trying to dump objects, we have to treat multi-block objects
differently to single block objects. Separate out the code paths for
single block vs multi-block objects so we can add a separate path
for multi-block objects.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Feb 2014 22:35:36 +0000 (09:35 +1100)]

metadump: support writing discontiguous io cursors

To handle discontiguous buffers, metadump needs to be able to handle
io cursrors that use discontiguous buffer mappings. Factor
write_buf() to extract the data copy routine and use that to
implement support for both flat and discontiguous buffer maps.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Feb 2014 22:35:26 +0000 (09:35 +1100)]

metadump: sanitise write_buf/index return values

Write_buf/write_index use confusing boolean values for return,
meaning that it's hard to tell what the correct error return is
supposed to be. Convert them to return zero on success or a
negative errno otherwise so that it's clear what the error case is.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Feb 2014 00:55:36 +0000 (11:55 +1100)]

xfs_repair: fix discontiguous directory block support

xfs/291 tests fragmented multi-block directories, and xfs_repair
throws lots of lookup badness errors in phase 3:

        - agno = 1
7fa3d2758740: Badness in key lookup (length)
bp=(bno 0x1e040, len 4096 bytes) key=(bno 0x1e040, len 16384 bytes)
        - agno = 2
7fa3d2758740: Badness in key lookup (length)
bp=(bno 0x2d0e8, len 4096 bytes) key=(bno 0x2d0e8, len 16384 bytes)
7fa3d2758740: Badness in key lookup (length)
bp=(bno 0x2d068, len 4096 bytes) key=(bno 0x2d068, len 16384 bytes)
        - agno = 3
7fa3d2758740: Badness in key lookup (length)
bp=(bno 0x39130, len 4096 bytes) key=(bno 0x39130, len 16384 bytes)
7fa3d2758740: Badness in key lookup (length)
bp=(bno 0x391b0, len 4096 bytes) key=(bno 0x391b0, len 16384 bytes)
7fa3d2758740: Badness in key lookup (length)

This is because it is trying to read a directory buffer in full
(16k), but is finding a single 4k block in the cache instead.

The opposite is happening in phase 6 - phase 6 is trying to read 4k
buffers but is finding a 16k buffer there instead.

The problem is caused by the fact that directory buffers can be
represented as compound buffers or as individual buffers depending
on the code reading the directory blocks. The main problem is that
the IO prefetch code doesn't understand compound buffers, so teach
it about compound buffers to make the problem go away.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Feb 2014 00:11:11 +0000 (11:11 +1100)]

libxfs: remove map from libxfs_readbufr_map

The map passed in to libxfs_readbufr_map is used to check the buffer
matches the map. However, the repair readahead code has no map it
can use to validate the buffer it set up previously, so just get rid
of the map being passed in because it serves no useful purpose.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Feb 2014 00:10:17 +0000 (11:10 +1100)]

libxfs: add a flags field to libxfs_getbuf_map

It will be needed to make the repair prefetch code aware of compound
buffers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Dave Chinner [Mon, 3 Feb 2014 00:08:44 +0000 (11:08 +1100)]

xfs_repair: add support for validating dirent ftype field

Add code to track the filetype of an inode from phase 3 when all the
inodes are scanned throught to phase 6 when the directory structure
is validated and corrected.

Add code to phase 6 shortform and long form directory entry
validation to read the ftype from the dirent, lookup the inode
record and check they are the same. If they aren't and we are in
no-modify mode, issue a warning such as:

Phase 6 - check inode connectivity...
        - traversing filesystem ...
would fix ftype mismatch (5/1) in directory/child inode 64/68
        - traversal finished ...
        - moving disconnected inodes to lost+found ...

If we are fixing the problem:

Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
fixing ftype mismatch (5/1) in directory/child inode 64/68
        - traversal finished ...
        - moving disconnected inodes to lost+found ...

Note that this is from a leaf form directory entry that was
intentionally corrupted with xfs_db like so:

xfs_db> inode 64
xfs_db> a u3.bmx[0].startblock
xfs_db> p
....
du[3].inumber = 68
du[3].namelen = 11
du[3].name = "syscalltest"
du[3].filetype = 1
du[3].tag = 0x70
....
xfs_db> write du[3].filetype 5
du[3].filetype = 5
xfs_db> quit

Shortform directory entry repair was tested in a similar fashion.

Further, track the ftype in the directory hash table that is build,
so if the directory is rebuild from scratch it has the necessary
ftype information to rebuild the directory correctly. Further, if we
detect a ftype mismatch, update the entry in the hash so that later
directory errors that lead to it being rebuilt use the corrected
ftype field, not the bad one.

Note that this code pulls in some kernel side code that is currently
in kernel private locations (xfs_mode_to_ftype table), so there'll
be some kernel/userspace sync work needed to bring these back into
line.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

commit | commitdiff | tree

Namjae Jeon [Sun, 8 Dec 2013 14:11:30 +0000 (14:11 +0000)]

xfsprog: xfsio: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate

Add support FALLOC_FL_COLLAPSE_RANGE for fallocate.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Eric Sandeen [Thu, 12 Dec 2013 17:47:59 +0000 (17:47 +0000)]

xfs_metadump: Make -F (force) optional

If we do something crazy like:

# xfs_metadump /root/anaconda.cfg outfile

xfs_metadump will pass "-F" to xfs_db to carry on even in the face
of bad superblock magic. [1]  Depending on what we gave as an input
file, we may very well fail quite badly:

> xfs_metadump: /root/anaconda.cfg is not a valid XFS filesystem (unexpected SB magic number 0x230a2320)
> Floating point exception

I don't think it's possible to harden every path through libxfs for
non-xfs filesystems as input.  (Even if it's possible, I don't think it's
worth the effort).

So I propose making the "-F" optional; by default, xfs_metadump will
say no, this has bad magic, I'm stopping.  If the admin really wants to
try to proceed, suggest that they can use "-F" and they can keep all the
broken pieces.

[1] behavior added in 7f98455 xfs_db: exit on invalid magic number

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Ben Myers [Tue, 10 Dec 2013 20:53:52 +0000 (20:53 +0000)]

xfs_repair: fix process_bmbt_reclist_int

There is a set checks for corruption in block map btrees in
process_bmbt_reclist_int that we identify but currently do not fix.  It
appears that the author's intent in this function was to set error = 1,
and then only clear it when all of the checks were completed
successfully.  Unfortunately error can be cleared when it is used for
the return value of blkmap_set_ext.  Some kinds of corruption are not
being fixed, including duplicate extents, claiming free blocks, claiming
metadata blocks, and multiply used blocks.

Fix this by using error2 for the return code from blkmap_set_ext.

Signed-off-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Eric Sandeen [Tue, 12 Nov 2013 17:14:46 +0000 (17:14 +0000)]

xfs_repair: suggest reboot after dangerous repair

If a "dangerous" repair (repair of a readonly mounted fs)
has been done, suggest an immediate reboot upon completion.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Eric Sandeen [Mon, 18 Nov 2013 16:02:05 +0000 (16:02 +0000)]

xfsprogs: suggest "-d" option for repair of RO mount

We can offer the suggestion of a "-d" repair, if we're
in single-user mode with i.e. the root fs mounted readonly.

This change suggests -d to repair any RO mounted fs.

e2fsck allows this, and users are used to being able to
do it in single-user mode. A separate patch will recommend
a reboot after repair completes.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Rich Johnston [Mon, 25 Nov 2013 19:23:25 +0000 (13:23 -0600)]

3.2.0-alpha2 release

commit | commitdiff | tree

Rich Johnston [Fri, 22 Nov 2013 18:37:43 +0000 (18:37 +0000)]

xfsprogs: update version for 3.2.0-alpha2

Update the VERSION, configure.ac and doc/CHANGES file for alpha
release, 3.2.0-alpha2

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

commit | commitdiff | tree

Eric Sandeen [Thu, 14 Nov 2013 20:54:02 +0000 (20:54 +0000)]

xfsprogs: Fix stray semicolon in libxfs/rdwr.c

The commit:

800db1c1 libxfs: refactor libxfs_buf_read_map for xfs_db

introduced a stray semicolon during the refactoring. Luckily it
only breaks the error case.

(Caught by Coverity, CID 1128894)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Eric Sandeen [Thu, 14 Nov 2013 20:38:08 +0000 (20:38 +0000)]

xfs_repair: correct docs for "-t" units

usage() says that the xfs_repair -t reporting interval is
in minutes, and the manpage says nothing at all about units.

Actually, it's in seconds.

Clarify in both places that the interval is specified in
seconds.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Eric Sandeen [Fri, 18 Oct 2013 22:30:18 +0000 (22:30 +0000)]

xfs_fsr: fix SWAPEXT failures under selinux

If we run xfs_fsr on a system which creates selinux extended
attributes, the temp file created by xfs_fsr may have a
large-ish local extended attribute as soon as it is created.

If the target file has NON-local extended attributes, it may
have a fork offset larger than the temp file, because i.e.
FMT_EXTENTS attributes take up less space. We currently
have no mechanism to grow the temp file's fork offset.
So in this case, the SWAPEXT ioctl will fail.

(With systems using selinux and lots of xattrs, this becomes
fairly common in the real world.)

After testing the target file for a non-local extent, and
checking to see if the temp forkoff needs to be grown on the
first pass, we can add a large attr to knock all attributes on
the temp file out of local format, and grow the fork offset for
this particular case.

This passes xfstest 227, and also resolves issues seen on
a metadata image provided by Gabriel.

Reported-by: Gabriel VLASIU <gabriel@vlasiu.net>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Eric Sandeen [Fri, 18 Oct 2013 22:09:07 +0000 (22:09 +0000)]

xfs_fsr: extra debugging info

Provide 2 new pieces of information when -d is specified to
xfs_fsr:

* If we needed to grow the forkoffset and couldn't
* If we were unable to match the fork offset

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:41:00 +0000 (06:41 +0000)]

repair: fix leaf node directory data check

When walking the leaf node format blocks (LEAFN) in the hash index
of a large directory, we could trip over btree node blocks (DA_NODE)
in the address space if there are enough entries in the directory.
These cause a verifier failure, and hence the directory is
considered corrupt and is trashed and rebuilt unnecessarily. Fix this
by using the correct verifier that can handle both types of blocks
without triggering failures.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:59 +0000 (06:40 +0000)]

repair: Increase default repair parallelism on large filesystems

Large filesystems or high AG count filesystems generally have more
inherent parallelism in the backing storage. We should make use of
this by default to speed up repair times. Make xfs_repair use an
"auto-stride" configuration on filesystems with enough AGs to be
considered "multidisk" configurations.

This difference in elapsed time to repair a 100TB filesystem with
50 million inodes in it with all metadata in flash is:

Time IOPS BW CPU RAM
vanilla: 2719s 2900 55MB/s 25% 0.95GB
patched: 908s varied varied varied 2.33GB

With the patched kernel, there were IO peaks of over 1.3GB/s during
AG scanning. Some phases now run at noticeably different speeds
- phase 3 ran at ~180% CPU, 18,000 IOPS and 130MB/s,
- phase 4 ran at ~280% CPU, 12,000 IOPS and 100MB/s
- the other phases were similar to the vanilla repair.

Memory usage is increased because of the increased buffer cache
size as a result of concurrent AG scanning using it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:58 +0000 (06:40 +0000)]

repair: prefetching is turned off unnecessarily

When we have a large filesystem, prefetching is only enabled when
there is a significant amount of RAM available - roughly 16GB RAM
for every 100TB of disk space. For large filesystems, this memory
usage calculation is mostly derived from the memory needed to track
used space rather than inodes. That is, for a 100TB filesystem with
50 million inodes, only 50M * 4 bytes or 200MB of the required
16GB of RAM is used for tracking inodes. Hence with prefetching
turned off, such a filesystem only uses 230MB of memory to run
repair to completion.

With prefetching turned on, this increases to about 900MB of RAM,
but it is still far, far less than the predicted 16GB of RAM needed
to enable prefetching. Hence we are turning off prefetching when we
really don't need to and hence large filesystems are being checked
slower than they could be.

This patch makes prefetching always be enabled, but adds warnings in
the case that we might not have enough memory to complete
successfully and if it fails to run again with prefetching disabled:

  Memory available for repair (12031MB) may not be sufficient.
  At least 13044MB is needed to repair this filesystem efficiently
  If repair fails due to lack of memory, please
  turn prefetching off (-P) to reduce the memory footprint.

A similar warning is also added when prefetching is disabled and
xfs_repair exhausts memory then more RAM/swap should be added to the
system.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:57 +0000 (06:40 +0000)]

xfsprogs: kill experimental warnings for v5 filesystems

With xfsprogs now being close to feature complete on v5 filesystems,
remove the experimental warnings from the superblock verifier. This
means that we don't need to filter such warnings from the output in
xfstests and so we can see exactly what tests are failing due to
code deficiencies rather than from detecting warning noise.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:56 +0000 (06:40 +0000)]

xfs: support larger inode clusters on v5 filesystems

To allow the kernel to use larger inode clusters than the standard
8192 bytes, we need to set the inode alignment fields appropriately
so that the kernel is consistent in it's inode to buffer mappings.
We set the alignment to allow a constant 32 inodes per cluster,
instead of a fixed 8k cluster size.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:55 +0000 (06:40 +0000)]

db: enable metadump on CRC filesystems

Now that we can calculate CRCs through xfs_db, we can add support
for recalculating CRCs on obfuscated metadump images. This simply
requires us to call the write verifier manually before writing the
buffer to the metadump image.

We don't need to do anything special to mdrestore, as the metadata
blocks it reads from the image file will already have all the
correct CRCs in them. Hence it can be mostly oblivious to the fact
that the filesystem it is restoring contains CRCs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:54 +0000 (06:40 +0000)]

libxfs: work around do_div() not handling 32 bit numerators

The libxfs dquot buffer code uses do_div() with a 32 bit numerator.
This gives incorrect results as do_div() passes the numerator by
reference as a pointer to a 64 bit value. Hence it does the division
using 32 bits of garbage gives the wrong result.

As per Christoph's suggestion, we can kill the usage of do_div()
here completely and just do the division directly, both in userspace
and kernel space.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:53 +0000 (06:40 +0000)]

xfs_db: avoid libxfs buffer lookup warnings

xfs_db is unique in the way it can read the same blocks with
different lengths from disk, so we really need a way to avoid having
duplicate buffers in the cache. To handle this in a generic way,
introduce a "purge on compare failure" feature to libxfs.

What this feature does is instead of throwing a warning when a
buffer miscompare occurs (e.g. due to a length mismatch), it purges
the buffer that is in cache from the cache. We can do this safely in
the context of xfs_db because it always writes back changes made to
buffers before it releases the reference to the buffer. Hence we can
purge buffers directly from the lookup code without having to worry
about whether they are dirty or not.

Doing this purge on miscompare operation avoids the
problem that libxfs is currently warning about, and hence if the
feature flag is set then we don't need to warn about miscompares any
more. Hence the whole problem goes away entirely for xfs_db, without
affecting any of the other users of libxfs based IO.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:52 +0000 (06:40 +0000)]

xfs_db: use inode cluster buffers for inode IO

When we mount the filesystem inside xfs_db, libxfs is tasked with
reading some information from disk, such as root inodes. Because
libxfs does this inode reading, it uses inode cluster buffers to
read the inodes. xfs_db, OTOH, just uses FSB sized buffers to read
inodes, and hence xfs_db throws a warning when reading the root
inode block like so:

$ sudo xfs_db -c "sb 0" -c "p rootino" -c "inode 32" /dev/vda
Version 5 superblock detected. xfsprogs has EXPERIMENTAL support enabled!
Use of these features is at your own risk!
rootino = 32
7f59f20e6740: Badness in key lookup (length)
bp=(bno 0x20, len 8192 bytes) key=(bno 0x20, len 1024 bytes)
$

There is another way this can happen, and that is dumping raw data
from disk using either the "fsb NNN" or "daddr MMM" commands to dump
untyped information. This is always read in sector or filesystem
block units, and so will cause similar badness warnings.

To avoid this problem when reading inodes, teach xfs_db to read
inode clusters rather individual filesystem blocks when asked to
read an inode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:51 +0000 (06:40 +0000)]

db: re-enable write support for v5 filesystems.

As we can now verify and recalculate CRCs on IO, we can modify the
on-disk structures without corrupting the filesyste, This makes it
safe to turn write support on for v5 filesystems for the first time.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:50 +0000 (06:40 +0000)]

db: add a special attribute buffer verifier

Because we only have a single attribute type that is used for all
the attribute buffer types, we need to provide a special verifier
for the read code. That verifier needs to know all the attribute
types and when it find one it knows about, switch to the correct
verifier and call it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:49 +0000 (06:40 +0000)]

db: add a special directory buffer verifier

Because we only have a single directory type that is used for all
the different buffer types, we need to provide a special verifier
for the read code. That verifier needs to know all the directory
types and when it find one it knows about, switch to the correct
verifier and call it.

We already do this for certain readahead cases in the directory
code, so there is precedence for this. If we don't find a magic
number we recognise, the verifier fails...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:48 +0000 (06:40 +0000)]

db: verify and calculate dquot CRCs

When we set the current Io cursor to point at a dquot block, verify
that the dquot CRC is intact. And prior to writing such an IO
cursor, calculate the dquot CRC.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:47 +0000 (06:40 +0000)]

db: verify and calculate inode CRCs

When we set the current IO cursor to point at an inode, verify that
the inode CRC is intact. And prior to writing such an IO cursor,
calculate the inode CRC.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:46 +0000 (06:40 +0000)]

db: indicate if the CRC on a buffer is correct or not

When dumping metadata that has a CRC in it, output not only the CRC
but text to tell us whether the value is correct or not. Hence we
can see at a glance if there's something wrong or not.

Do this by peeking at the buffer attached to the current IO
context. If there was a CRC error, then it will be marked with a
EFSCORRUPTED error. Use this to determine what to output.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:45 +0000 (06:40 +0000)]

db: introduce verifier support into set_cur

To be able to use read and write verifiers, we need to pass the
verifier to the IO routines. We do this via the set_cur() function
used to trigger reading the buffer.

For most metadata types, there is only one type of verifier needed.
For these, we can simply add the verifier to the type table entry
for the given type and use that directly. This type entry is already
carried around by the IO context, so if we ever need to get it again
we have direct access to it in the context we'll be doing IO.

Only attach the verifiers to the v5 filesystem type table; there is
not need for them on v4 filesystems as we don't have to verify or
calculate CRCs for them.

There are some metadata types that have more than one buffer format,
or aren't based in directly in buffers. For these, leave the type
table verifier NULL for now - these will need to be addressed
individually.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:44 +0000 (06:40 +0000)]

db: rewrite IO engine to use libxfs

Now that we have buffers and xfs_buf_maps, it is relatively easy to
convert the IO engine to use libxfs routines. This gets rid of the
most of the differences between mapped and straight buffer reads,
and tracks xfs_bufs directly in the IO context that is being used.

This is not yet a perfect solution, as xfs_db does different sized
IOs for the same block range which will throw warnings like:

xfs_db> inode 64
7ffff7fde740: Badness in key lookup (length)
bp=(bno 0x40, len 8192 bytes) key=(bno 0x40, len 4096 bytes)
xfs_db>

This is when first displaying an inode in the root inode chunk.
These will need to be dealt with on a case by case basis.

Further, xfs_db can build up a large IO stack by the time it has run
to completion. If we don't unwind this IO stack before we shut down
the libxfs caches, metadump and other db programs will exit with
unreleased buffers and emit warnings like:

cache_purge: shake on cache 0x69e4f0 left 7 nodes!?

Hence we need to unwind the iostack as we shut down.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:43 +0000 (06:40 +0000)]

libxfs: refactor libxfs_buf_read_map for xfs_db

xfs_db requires low level read/write buffer primitives that are the
equivalent of libxfs_readbufr/writebufr. The implementation of
libxfs_writebufr already handles discontiguous buffers, but there is
no equivalent libxfs_readbufr_map support in the code.

Refactor libxfs_readbuf_map into two parts - one that does the
buffer cache lookup, and the other that does the read IO. This
provides the implementation of libxfs_readbufr_map that is required
for xfs_db.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:42 +0000 (06:40 +0000)]

db: rewrite bbmap to use xfs_buf_map

Use the libxfs struct xfs_buf_map for recording the extent layout of
discontiguous buffers and convert the read/write to decode them
directory and use read_buf/write_buf to do the extent IO. This
brings the physical xfs_db IO code to be very close to the model
that libxfs uses.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:41 +0000 (06:40 +0000)]

db: separate out straight buffer IO from map based IO.

Libxfs has two different interfaces for getting and reading buffers.
The first is a block/length interface for reading contiguous
regions, and the second is based on extent based xfs_buf_map arrays
for discontiguous regions. The xfs-db code is solely based on a
basic block array interface regardless of the type of region being
read, and so doesn't match to either libxfs interface.

As a first step to converting xfs_db to the libxfs interfaces, add a
simple block/length buffer API and implement it using pread/pwrite.
Then remove the single region conditionals from the basic block array
based interfaces, and convert all the contiguous block read cases to
use the new API.

This new API is temporary - it will be replaced by the equivalent
libxfs interface calls once all the infrastructure preparation for
the changeover has been completed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Christoph Hellwig [Wed, 13 Nov 2013 06:40:40 +0000 (06:40 +0000)]

libxfs: stop caching inode structures

Currently libxfs has a cache for xfs_inode structures.  Unlike in kernelspace
where the inode cache, and the associated page cache for file data is used
for all filesystem operations the libxfs inode cache is only used in few
places:

- the libxfs init code reads the root and realtime inodes when called from
   xfs_db using a special flag, but these inode structure are never referenced
   again
- mkfs uses namespace and bmap routines that take the xfs_inode structure
   to create the root and realtime inodes, as well as any additional files
   specified in the proto file
- the xfs_db attr code uses xfs_inode-based attr routines in the attrset
   and attrget commands
- phase6 of xfs_repair uses xfs_inode-based routines for rebuilding
   directories and moving files to the lost+found directory.
- phase7 of xfs_repair uses struct xfs_inode to modify the nlink count
   of inodes.

So except in repair we never ever reuse a cached inode, and even in repair
the logical inode caching doesn't help:

- in phase 6a we iterate over each inode in the incore inode tree,
   and if it's a directory check/rebuild it
- phase6b then updates the "." and ".." entries for directories
   that need, which means we require the backing buffers.
- phase6c moves disconnected inodes to lost_found, which again needs
   the backing buffer to actually do anything.
- phase7 then only touches inodes for which we need to reset i_nlink,
   which always involves reading, modifying and writing the physical
   inode.
   which always involves modifying the . and .. entries.

Given these facts stop caching the inodes to reduce memory usage
especially in xfs_repair, where this makes a different for large inode
count inodes.  On the upper end this allows repair to complete for
filesystem / amount of memory combinations that previously wouldn't.

With this we probably could increase the memory available to the buffer
cache in xfs_repair, but trying to do so I got a bit lost - the current
formula seems to magic to me to make any sense, and simply doubling the
buffer cache size causes us to run out of memory given that the data cached
in the buffer cache (typically lots of 8k inode buffers and few 4k other
metadata buffers) are much bigger than the inodes cached in the inode
cache.  We probably need a sizing scheme that takes the actual amount
of memory allocated to the buffer cache into account to solve this better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:39 +0000 (06:40 +0000)]

libxfs: fix root inode handling inconsistencies

When "mounting" a filesystem via libxfs_mount(), callers can tell
libxfs to read the root and realtime inodes into cache. However,
when unmounting the filesystem, libxfs_unmount() used to
unconditionally free root inodes if they were present.

This leads to interesting issues like in mkfs, when it handles
creation, reading and freeing of the root and rt inodes itself.
It, however, passes in the flag to tell libxfs_mount() to read the
root inodes and so can result in unbalanced freeing of inodes when
cleaning up during the unmount proceedure.

As it turns out, nothing ever uses mp->m_rootip and so we don't need
to read it in or free it, or even have a pointer to it in the struct
xfs_mount. Similarly, the only user of the realtime inodes is mkfs,
and it initialises them itself. Hence we can kill the m_rootip and
the realtime inode mounting code.

This leaves one user of LIBXFS_MOUNT_ROOTINOS - xfs_db - and that is
only used to initialise the in-core superblock counter values from
the ag header for xfs_check. Move this code to the xfs_db init
functions so we can get rid of the mount parameter previously used
to trigger all these behavours (LIBXFS_MOUNT_ROOTINOS) completely.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:38 +0000 (06:40 +0000)]

xfs: don't emit corruption noise on fs probes

If we get EWRONGFS due to probing of non-xfs filesystems,
there's no need to issue the scary corruption error and backtrace.

Ported from kernel commit 31625f28.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:37 +0000 (06:40 +0000)]

xfs: fix node forward in xfs_node_toosmall

When a node is considered for a merge with a sibling, it overwrites the
sibling pointers of the original incore nodehdr with the sibling's
pointers. This leads to loop considering the original node as a merge
candidate with itself in the second pass, and so it incorrectly
determines a merge should occur.)

Ported from equivalent kernel commit 997def25.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:36 +0000 (06:40 +0000)]

xfs: fix the wrong new_size/rnew_size at xfs_iext_realloc_direct()

At xfs_iext_realloc_direct(), the new_size is changed by adding
if_bytes if originally the extent records are stored at the inline
extent buffer, and we have to switch from it to a direct extent
list for those new allocated extents, this is wrong.

This patch fix above problem and revise the new_size comments at
xfs_iext_realloc_direct() to make it more readable. Also, fix the
comments while switching from the inline extent buffer to a direct
extent list to reflect this change.

Ported from kernel commit 17ec81c1.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:35 +0000 (06:40 +0000)]

xfs: remove newlines from strings passed to __xfs_printk

__xfs_printk adds its own "\n". Having it in the original string
leads to unintentional blank lines from these messages.

Ported from kernel commit 08e96e1a.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:34 +0000 (06:40 +0000)]

libxfs: Minor cleanup and bug fix sync

These bring all the small single line comment, whitespace and minor
code differences into sync with the kernel code. Anything left at
this point is an intentional difference.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:33 +0000 (06:40 +0000)]

libxfs: bring across inode buffer readahead verifier changes

These were made for log recovery readahead in the kernel, so are not
directly used in userspace. Hence bringing the change across is
simply to keep files in sync.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:32 +0000 (06:40 +0000)]

libxfs: xfs_rtalloc.c becomes xfs_rtbitmap.c

To match the split-up of the kernel xfs_rtalloc.c file, convert the
libxfs version of xfs_rtalloc.c to match the newly shared kernel
source file with all the realtime bitmap functions in it,
xfs_rtbitmap.c.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:31 +0000 (06:40 +0000)]

libxfs: bmap btree owner swap support

For CRC enabled filesystems, we can't just swap inode forks from one
inode to another when defragmenting a file - the blocks in the inode
fork bmap btree contain pointers back to the owner inode. Hence if
we are to swap the inode forks we have to atomically modify every
block in the btree during the transaction.

This patch brings across the kernel code for doing the owner
swap of an entire fork - something that we are likely to end up
needing in xfs_repair when reparenting stray inodes to lost+found -
without all the associated swap extents transaction and recovery
cruft as those parts are not needed in userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:30 +0000 (06:40 +0000)]

libxfs: unify xfs_btree.c with kernel code

The libxfs/xfs_btree.c code does not contain a small amount of code
for btree block readahead that the kernel code does. Instead, it
short circuits it at a higher layer and doesn't include the lower
layer functions. There is no harm in calling the lower lay functions
and have them do nothing, and doing so unifies the kernel and
userspace code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:29 +0000 (06:40 +0000)]

xfs: decouple inode and bmap btree header files

Currently the xfs_inode.h header has a dependency on the definition
of the BMAP btree records as the inode fork includes an array of
xfs_bmbt_rec_host_t objects in it's definition.

Move all the btree format definitions from xfs_btree.h,
xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
xfs_format.h to continue the process of centralising the on-disk
format definitions. With this done, the xfs inode definitions are no
longer dependent on btree header files.

The enables a massive culling of unnecessary includes, with close to
200 #include directives removed from the XFS kernel code base.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:28 +0000 (06:40 +0000)]

xfs: split dquot buffer operations out

Parts of userspace want to be able to read and modify dquot buffers
(e.g. xfs_db) so we need to split out the reading and writing of
these buffers so it is easy to shared code with libxfs in userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:27 +0000 (06:40 +0000)]

xfs: create a shared header file for format-related information

All of the buffer operations structures are needed to be exported
for xfs_db, so move them all to a common location rather than
spreading them all over the place. They are verifying the on-disk
format, so while xfs_format.h might be a good place, it is not part
of the on disk format.

Hence we need to create a new header file that we centralise these
related definitions. Start by moving the bffer operations
structures, and then also move all the other definitions that have
crept into xfs_log_format.h and xfs_format.h as there was no other
shared header file to put them in.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:26 +0000 (06:40 +0000)]

xfs: fix some minor sparse warnings

A couple of simple locking annotations and 0 vs NULL warnings.
Nothing that changes any code behaviour, just removes build noise.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Dave Chinner [Wed, 13 Nov 2013 06:40:25 +0000 (06:40 +0000)]

xfsprogs: fix automatic dependency generation

Adding are removing a header file does not result in dependency
regeneration like it should. 'make clean' will rebuild the
dependencies, but a normal 'make' won't. Fix it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Eric Sandeen [Tue, 12 Nov 2013 16:31:12 +0000 (16:31 +0000)]

xfsprogs: remove xfs_check references from fsck.xfs script & manpage

xfs_check is deprecated & slated for removal in June 2014.

We can stop referring to it now, though - pointing fsck.xfs
users to it doesn't make sense at this point.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Christoph Hellwig [Wed, 6 Nov 2013 10:25:39 +0000 (10:25 +0000)]

xfsprogs: add generated crc files to .gitignore

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Mark Tinguely [Thu, 17 Oct 2013 15:28:08 +0000 (15:28 +0000)]

xfsprog: add mkfs.xfs sb v4 support for dirent filetype field

Add directory inode type feature to mkfs.xfs and its manual page.

In sb v4, "mkfs.xfs -n ftype=1" turns on the feature.

The feature is automatically turned on for "-m crc=1", but reject
the use of the "-n ftype=0|1" with the "-m crc=1" option.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Mark Tinguely [Thu, 17 Oct 2013 15:28:07 +0000 (15:28 +0000)]

xfs_progs: add dirent filetype to xfs_db version

Add directory inode type to the xfs_db version command.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Mark Tinguely [Thu, 17 Oct 2013 15:28:06 +0000 (15:28 +0000)]

xfsprog: add dirent filetype information for xfs_info

Make xfs_info aware of the directory inode type by using the
XFS_FSOP_GEOM_FLAGS_FTYPE set in the kernel.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Mark Tinguely [Thu, 17 Oct 2013 15:28:05 +0000 (15:28 +0000)]

xfsprog: add xfs sb v4 support for dirent filetype field

Add xfsprog superblock v4 support for the inode type information
in the xfs directory feature in xfs_sb.h.

This support adds a feature bit for version 4 superblocks and
leaves the original superblock 5 incompatibility bit.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Rich Johnston [Tue, 22 Oct 2013 18:04:53 +0000 (13:04 -0500)]

Revert "xfs: create a shared header file for format-related information"

This reverts commit dd297fd6d366f62ca2f579e510e6287699d0bd5d.

Wait for the kernel commits to shared files before finalising
and committing the userspace patches.

commit | commitdiff | tree

Rich Johnston [Tue, 22 Oct 2013 15:15:20 +0000 (10:15 -0500)]

Revert "[RESEND, 4/7] xfsprogs: xfsio: add support FALLOC_FL_COLLAPSE_RANGE for fallocate"

This reverts commit e64190f8440286a815060524777b435e06a7b364 until we
have the fallocate API support merged into the kernel. The kernel
code is still under review.

commit | commitdiff | tree

Dave Chinner [Mon, 30 Sep 2013 03:15:21 +0000 (03:15 +0000)]

xfs: unify directory/attribute format definitions

The on-disk format definitions for the directory and attribute
structures are spread across 3 header files right now, only one of
which is dedicated to defining on-disk structures and their
manipulation (xfs_dir2_format.h). Pull all the format definitions
into a single header file - xfs_da_format.h - and switch all the
code over to point at that.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Li Zhong [Tue, 15 Oct 2013 02:55:31 +0000 (02:55 +0000)]

xfsprogs: fix resource leak in longform_dir2_rebuild()

coverity scan 997010 reported following leak:

1309                if (error) {
1310                        do_warn(
1311        _("space reservation failed (%d), filesystem may be out of space\n"),
1312                                error);
     25. Breaking from loop
1313                        break;
1314                }

......

1342                libxfs_trans_commit(tp,
1343                                XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_SYNC);
1344        }

CID 997010 (#1 of 1): Resource leak (RESOURCE_LEAK)
26. leaked_storage: Variable "tp" going out of scope leaks the storage it points to.
1345}

Though not reported by coverity, it seems that there might be some entries in
flist which needs to be freed in the failure case below libxfs_dir_createname(),
and libxfs_bunmapi().

The fix cleans up the code by stacking the error handling at the end of the
function, and jumping to the error handler label for the above cases. (fail
directly by calling res_failed() for reservation failure.)

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Eric Sandeen [Fri, 18 Oct 2013 17:59:36 +0000 (17:59 +0000)]

xfs_repair: add d_type when moving files to lost+found

When we move disconnected inodes to lost+found, they aren't
assigned a dtype. Fix this by just setting XFS_DIR3_FT_UNKNOWN
for now. If the files are moved out of lost+found, the type
will be properly set at that time.

When repair gains more type knowledge we could use xfs_mode_to_ftype[]
to set the proper type when moved, but right now it's not a big
deal; UNKNOWN will suffice for files in lost+found, and prevents
us from using an uninitialized value.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Eric Sandeen [Fri, 18 Oct 2013 17:57:37 +0000 (17:57 +0000)]

xfsprogs: Note projid32bit default change in mkfs.xfs manpage

As of xfsprogs 3.2.0, the projid32bit default changed from 0
to 1. Document this.

When we document -m crc etc, we'll also need to note that 0 is
incompatible with crc=1.

Signed-off-by: Eric Sandeen <sandeen@redaht.com>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Eric Sandeen [Thu, 12 Sep 2013 20:56:36 +0000 (20:56 +0000)]

xfs_repair: test for bad level in dir2 node

In traverse_int_dir2block(), the variable 'i' is the level in
the tree, with 0 being a leaf node.  In the "do" loop we
start at the root, and work our way down to a leaf.

If the first node we read is an interior node with NODE_MAGIC,
but it tells us that its level is 0 (a leaf), this is clearly
an inconsistency.

Worse, we'd return with success, bno set, and only level[0]
in the cursor initialized.  Then down this path we'll
segfault when accessing an uninitialized (and zeroed) member
of the cursor's level array:

process_node_dir2
  traverse_int_dir2block  // returns 0 w/ bno set, only level[0] init'd
  process_leaf_level_dir2
    verify_dir2_path(mp, da_cursor, 0) // p_level == 0
       this_level = p_level + 1;
       node = cursor->level[this_level].bp->b_addr; // level[1] uninit & 0'd

Fix this by recognizing that an interior node w/ level 0 is invalid, and
error out as for other inconsistencies.

By the time the level 0 test is done, we have already ensured that
this block has XFS_DA[3]_NODE_MAGIC.

Reported-by: Jan Yves Brueckner <jyb@gmx.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Eric Sandeen [Tue, 17 Sep 2013 17:07:45 +0000 (17:07 +0000)]

xfsprogs: initialize filetype for lost+found creation

If we create lost+found make sure it's got the proper filetype.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Eric Sandeen [Thu, 17 Oct 2013 17:50:16 +0000 (17:50 +0000)]

xfs_repair: avoid segfault if reporting progress early in repair

For a very large filesystem, zeroing the log may take some time.

If we ask for progress reports frequently enough that one fires
before we finish with log zeroing, we try to use a progress format
which has not yet been set up, and segfault:

# mkfs.xfs -d size=60t,file,name=fsfile
# xfs_repair -m 9000 -o ag_stride=32 -t 1 fsfile
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 1 seconds
Phase 2 - using internal log
        - zero log...
Segmentation fault

(gdb) bt
#0  0x0000000000426962 in progress_rpt_thread (p=0x67ad20) at progress.c:234
#1  0x0000003b98a07851 in start_thread (arg=0x7f19d8e47700) at pthread_create.c:301
#2  0x0000003b982e767d in ?? ()
#3  0x0000000000000000 in ?? ()
(gdb) p msgp
$1 = (msg_block_t *) 0x67ad20
(gdb) p msgp->format
$2 = (progress_rpt_t *) 0x0
(gdb)

I suppose we could rig up progress reports for log zeroing, but
that won't usually take terribly long; for now, be defensive
and init the message->format to NULL, and just return early
from the progress thread if we've not yet set up any message.

(Sure, global_msgs is global, and ->format is already NULL,
but to me it's worth being explicit since we will test it).

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Nathan Scott [Thu, 10 Oct 2013 23:56:19 +0000 (23:56 +0000)]

xfsprogs: fix a warning in the deb build

Fixes a warning in the deb build for me, plus some trivial cleanup
in the release script.

Signed-off-by: Nathan Scott <nathans@debian.com>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Colin Watson [Thu, 17 Oct 2013 20:27:02 +0000 (15:27 -0500)]

Sent: Friday, October 11, 2013 12:38:11 AM
Subject: Bug#725971: xfsprogs: config.guess/config.sub out of date for arm64

Package: xfsprogs
Version: 3.1.9
Severity: important
Tags: patch
User: debian-arm@lists.debian.org
Usertags: arm64

xfsprogs' config.guess/config.sub are out of date for the forthcoming
arm64 port.  The attached patch sets things up so that you don't have to
be bothered by this type of bug for future ports.

  * Use the autotools-dev dh addon to update config.guess/config.sub for
    arm64.

Signed-off-by: Colin Watson <cjwatson@ubuntu.com>
Reviewed-by: Nathan Scott <nathans@debian.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Eric Sandeen [Tue, 8 Oct 2013 15:17:50 +0000 (15:17 +0000)]

xfsprogs: restrict platform_test_xfs_fd to regular files

If a special file (block, char, pipe etc) resides on an
xfs filesystem, platform_test_xfs_[fd|path] will return
true, but a subsequent xfsctl will fail, because the file
operations to support the xfs ioctls are not set up on such
files (see i_fop assignments in xfs_setup_inode()).

>From the xfsctl manpage it's pretty clear that these functions
are supposed to return true if a subsequent xfsctl can be
handled, so it makes sense to exclude special files.

This was showing up in xfstest generic/306, which creates
the dev/null block device on an xfstest an tries to pwrite
to it with xfs_io - which emitted a warning when the xfsctl
trying to get geometry failed.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Eric Sandeen [Mon, 7 Oct 2013 17:35:16 +0000 (17:35 +0000)]

xfsprogs: remove incorrect l_sectBBsize assignment in xfs_repair

Commit e0607266 xfsprogs: add crc format support to repair

added a 2nd assignment to l_sectBBsize:

log.l_sectBBsize = 1 << mp->m_sb.sb_logsectlog;

which is incorrect; sb_logsectlog is log2 of the sector size,
in bytes; l_sectBBsize is the size of the log sector in
512-byte units.

So for a 4k sector size log, we were assigning 4096 rather
than 8. This broke xlog_find_tail, and caused xfs_repair
to think that a log was dirty even when it was clean:

"ERROR: The filesystem has valuable metadata changes in a log"

(xfs_logprint didn't have this error, so xfs_logprint -t
agreed that the filesystem really was clean).

Just remove the incorrect assignment; it was already properly
assigned about 12 lines prior:

log.l_sectBBsize = BTOBB(x.lbsize);

and things work again.

(This worked accidentally for 512-sector devices, because
we special-case those and set sb_logsectlog to "0" rather
than 9, so l_sectBBsize came out to "1" (as in 1 sector),
as it should have).

Reporteed-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Namjae Jeon [Sun, 6 Oct 2013 20:13:38 +0000 (20:13 +0000)]

[RESEND, 4/7] xfsprogs: xfsio: add support FALLOC_FL_COLLAPSE_RANGE for fallocate

Add support FALLOC_FL_COLLAPSE_RANGE for fallocate.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Eric Sandeen [Sat, 5 Oct 2013 02:20:03 +0000 (02:20 +0000)]

xfsprogs: fix crc32 build on big endian

While kernelspace can test #ifdef __LITTLE_ENDIAN, this
doesn't work in userspace. __LITTLE_ENDIAN is defined -
as is __BIG_ENDIAN.

So we build on all boxes as __LITTLE_ENDIAN, and the
self-test (thankfully!) fails on big endian boxes.

Fix this by testing __BYTE_ORDER values.

And add an else which should never be hit, but just in case...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

commit | commitdiff | tree

Eric Sandeen [Mon, 30 Sep 2013 17:01:19 +0000 (17:01 +0000)]

xfsprogs: handle symlinks etc in fs_table_initialise_mounts()

Commit:

6a23747d xfs_quota: support relative path as `path' arguments

used realpath() on the supplied pathname to handle things like
relative pathnames and pathnames ending in "/" which otherwise
caused the getmntent scanning to fail.

However, this regressed cases where a path in mtab was a symlink;
realpath() resolves this to the target, and so no match is found.

This causes i.e.:

# xfs_quota -x -c report /dev/mapper/testvg-testlv

to fail with:

xfs_quota: cannot setup path for mount /dev/mapper/testvg-testlv: No such device or address

because the scanning looks for /dev/dm-3, but the long symlink
name is what exists in mtab, and no match is found.

Fix this, but keep the intended enhancements, by testing *both* the
user-specified path (which might be relative, or contain a trailing
slash on a mountpoint) and the realpath-resolved path (which turns
a relative mountpoint into a full path, and removes trailing slashes),
to determine whether the user-specified path is an xfs mountpoint or
device.

While we're at it, add a few comments, and go back to the testing
of "path" not "rpath"; whether or not path is passed to the function
is what determines control flow. If path is specified, and realpath
succeeds, we're guaranteed to have rpath as well, so there is no need
to retest that. rpath is initialized to NULL, so an unconditional
free(rpath) is safe as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

Unnamed repository; edit this file 'description' to name the repository.

RSS Atom