git.ipfire.org Git - thirdparty/xfsprogs-dev.git/log

xfs_repair: test for bad level in dir2 node

In traverse_int_dir2block(), the variable 'i' is the level in
the tree, with 0 being a leaf node.  In the "do" loop we
start at the root, and work our way down to a leaf.

If the first node we read is an interior node with NODE_MAGIC,
but it tells us that its level is 0 (a leaf), this is clearly
an inconsistency.

Worse, we'd return with success, bno set, and only level[0]
in the cursor initialized.  Then down this path we'll
segfault when accessing an uninitialized (and zeroed) member
of the cursor's level array:

process_node_dir2
  traverse_int_dir2block  // returns 0 w/ bno set, only level[0] init'd
  process_leaf_level_dir2
    verify_dir2_path(mp, da_cursor, 0) // p_level == 0
       this_level = p_level + 1;
       node = cursor->level[this_level].bp->b_addr; // level[1] uninit & 0'd

Fix this by recognizing that an interior node w/ level 0 is invalid, and
error out as for other inconsistencies.

By the time the level 0 test is done, we have already ensured that
this block has XFS_DA[3]_NODE_MAGIC.
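
As an illustration only, the added test could look like the sketch
below. The struct, the function name and the depth limit are
hypothetical stand-ins, not the xfs_repair code; the point is that a
block already known to carry node magic must not report level 0.

```c
#include <stdint.h>

#define MAX_DEPTH 5	/* assumed depth limit for this sketch */

/* Hypothetical, simplified view of an interior node header. */
struct node_hdr {
	uint16_t level;	/* 0 would mean "leaf" */
};

/*
 * Return 0 if the level stored in an interior node is plausible,
 * -1 if the block must be treated as corrupt.
 */
static int check_interior_level(const struct node_hdr *hdr)
{
	if (hdr->level == 0)		/* interior node claiming to be a leaf */
		return -1;
	if (hdr->level >= MAX_DEPTH)	/* deeper than the cursor can track */
		return -1;
	return 0;
}
```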

Reported-by: Jan Yves Brueckner <jyb@gmx.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: initialize filetype for lost+found creation

If we create lost+found make sure it's got the proper filetype.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs_repair: avoid segfault if reporting progress early in repair

For a very large filesystem, zeroing the log may take some time.

If we ask for progress reports frequently enough that one fires
before we finish with log zeroing, we try to use a progress format
which has not yet been set up, and segfault:

# mkfs.xfs -d size=60t,file,name=fsfile
# xfs_repair -m 9000 -o ag_stride=32 -t 1 fsfile
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 1 seconds
Phase 2 - using internal log
        - zero log...
Segmentation fault

(gdb) bt
#0  0x0000000000426962 in progress_rpt_thread (p=0x67ad20) at progress.c:234
#1  0x0000003b98a07851 in start_thread (arg=0x7f19d8e47700) at pthread_create.c:301
#2  0x0000003b982e767d in ?? ()
#3  0x0000000000000000 in ?? ()
(gdb) p msgp
$1 = (msg_block_t *) 0x67ad20
(gdb) p msgp->format
$2 = (progress_rpt_t *) 0x0
(gdb)

I suppose we could rig up progress reports for log zeroing, but
that won't usually take terribly long; for now, be defensive
and init the message->format to NULL, and just return early
from the progress thread if we've not yet set up any message.

(Sure, global_msgs is global, and ->format is already NULL,
but to me it's worth being explicit since we will test it).
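
A minimal sketch of the defensive check, assuming simplified stand-ins
for the progress structures (the type and function names below are
hypothetical): the reporting thread asks whether a format has been
installed and skips the report if not.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the progress-report structures. */
struct progress_fmt {
	const char *fmt;
};

struct msg_block {
	struct progress_fmt *format;	/* NULL until a phase sets it up */
	unsigned long done;
};

/*
 * Return 1 if a report can be produced, 0 if the timer fired before
 * any message format was installed (e.g. during log zeroing).
 */
static int progress_report_ready(const struct msg_block *msgp)
{
	if (msgp == NULL || msgp->format == NULL)
		return 0;	/* nothing set up yet: skip this report */
	return 1;
}
```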

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: fix a warning in the deb build

Fixes a warning in the deb build for me, plus some trivial cleanup
in the release script.

Signed-off-by: Nathan Scott <nathans@debian.com>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

Sent: Friday, October 11, 2013 12:38:11 AM
Subject: Bug#725971: xfsprogs: config.guess/config.sub out of date for arm64

Package: xfsprogs
Version: 3.1.9
Severity: important
Tags: patch
User: debian-arm@lists.debian.org
Usertags: arm64

xfsprogs' config.guess/config.sub are out of date for the forthcoming
arm64 port.  The attached patch sets things up so that you don't have to
be bothered by this type of bug for future ports.

  * Use the autotools-dev dh addon to update config.guess/config.sub for
    arm64.

Signed-off-by: Colin Watson <cjwatson@ubuntu.com>
Reviewed-by: Nathan Scott <nathans@debian.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: restrict platform_test_xfs_fd to regular files

If a special file (block, char, pipe etc) resides on an
xfs filesystem, platform_test_xfs_[fd|path] will return
true, but a subsequent xfsctl will fail, because the file
operations to support the xfs ioctls are not set up on such
files (see i_fop assignments in xfs_setup_inode()).

From the xfsctl manpage it's pretty clear that these functions
are supposed to return true if a subsequent xfsctl can be
handled, so it makes sense to exclude special files.

This was showing up in xfstest generic/306, which creates
a dev/null device node on an xfs test filesystem and tries to pwrite
to it with xfs_io - which emitted a warning when the xfsctl
trying to get geometry failed.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: remove incorrect l_sectBBsize assignment in xfs_repair

Commit e0607266 xfsprogs: add crc format support to repair

added a 2nd assignment to l_sectBBsize:

log.l_sectBBsize = 1 << mp->m_sb.sb_logsectlog;

which is incorrect; sb_logsectlog is log2 of the sector size,
in bytes; l_sectBBsize is the size of the log sector in
512-byte units.

So for a 4k sector size log, we were assigning 4096 rather
than 8. This broke xlog_find_tail, and caused xfs_repair
to think that a log was dirty even when it was clean:

"ERROR: The filesystem has valuable metadata changes in a log"

(xfs_logprint didn't have this error, so xfs_logprint -t
agreed that the filesystem really was clean).

Just remove the incorrect assignment; it was already properly
assigned about 12 lines prior:

log.l_sectBBsize = BTOBB(x.lbsize);

and things work again.

(This worked accidentally for 512-sector devices, because
we special-case those and set sb_logsectlog to "0" rather
than 9, so l_sectBBsize came out to "1" (as in 1 sector),
as it should have).
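
The unit mismatch can be checked with a few lines of arithmetic. The
macros below are local copies written for this sketch (assumed to
mirror the usual 512-byte basic block definitions), and the two helper
names are hypothetical.

```c
#include <stdint.h>

#define BBSHIFT	9			/* log2 of a 512-byte basic block */
#define BBSIZE	(1 << BBSHIFT)
/* Bytes to basic blocks, rounding up. */
#define BTOBB(bytes)	(((uint64_t)(bytes) + BBSIZE - 1) >> BBSHIFT)

/* The removed assignment: yields a size in bytes, not basic blocks. */
static uint64_t sect_size_wrong(unsigned int logsectlog)
{
	return (uint64_t)1 << logsectlog;
}

/* The retained assignment: log sector size in 512-byte units. */
static uint64_t sect_size_right(unsigned int lbsize)
{
	return BTOBB(lbsize);
}
```

For a 4k log sector, sect_size_wrong(12) gives 4096 while
sect_size_right(4096) gives 8; for the special-cased 512-byte case,
sect_size_wrong(0) and sect_size_right(512) both give 1.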

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

[RESEND, 4/7] xfsprogs: xfsio: add support FALLOC_FL_COLLAPSE_RANGE for fallocate

Add support for FALLOC_FL_COLLAPSE_RANGE to fallocate.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: fix crc32 build on big endian

While kernelspace can test #ifdef __LITTLE_ENDIAN, this
doesn't work in userspace. __LITTLE_ENDIAN is defined -
as is __BIG_ENDIAN.

So we build on all boxes as __LITTLE_ENDIAN, and the
self-test (thankfully!) fails on big endian boxes.

Fix this by testing __BYTE_ORDER values.

And add an else which should never be hit, but just in case...
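
A sketch of the pattern, assuming glibc's <endian.h>; the
SKETCH_LITTLE_ENDIAN macro and the runtime helper are hypothetical
names used only here. Comparing __BYTE_ORDER against the two constants
works where a bare #ifdef cannot, since both constants are always
defined.

```c
#include <endian.h>
#include <stdint.h>

#if __BYTE_ORDER == __LITTLE_ENDIAN
# define SKETCH_LITTLE_ENDIAN 1
#elif __BYTE_ORDER == __BIG_ENDIAN
# define SKETCH_LITTLE_ENDIAN 0
#else
# error "unknown byte order"	/* should never be hit */
#endif

/* Runtime cross-check: inspect the first byte of a known 32-bit value. */
static int runtime_is_little_endian(void)
{
	const uint32_t probe = 1;

	return *(const unsigned char *)&probe == 1;
}
```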

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: handle symlinks etc in fs_table_initialise_mounts()

Commit:

6a23747d xfs_quota: support relative path as `path' arguments

used realpath() on the supplied pathname to handle things like
relative pathnames and pathnames ending in "/" which otherwise
caused the getmntent scanning to fail.

However, this regressed cases where a path in mtab was a symlink;
realpath() resolves this to the target, and so no match is found.

This causes e.g.:

# xfs_quota -x -c report /dev/mapper/testvg-testlv

to fail with:

xfs_quota: cannot setup path for mount /dev/mapper/testvg-testlv: No such device or address

because the scanning looks for /dev/dm-3, but the long symlink
name is what exists in mtab, and no match is found.

Fix this, but keep the intended enhancements, by testing *both* the
user-specified path (which might be relative, or contain a trailing
slash on a mountpoint) and the realpath-resolved path (which turns
a relative mountpoint into a full path, and removes trailing slashes),
to determine whether the user-specified path is an xfs mountpoint or
device.

While we're at it, add a few comments, and go back to the testing
of "path" not "rpath"; whether or not path is passed to the function
is what determines control flow. If path is specified, and realpath
succeeds, we're guaranteed to have rpath as well, so there is no need
to retest that. rpath is initialized to NULL, so an unconditional
free(rpath) is safe as well.
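
The "test both paths" idea can be sketched as a small matcher. This is
a hypothetical helper, not the function in the tree: it takes the mount
directory and device name from one mtab entry plus the user-specified
path and its realpath()-resolved form.

```c
#include <string.h>

/*
 * Hypothetical matcher: a mount table entry (directory and device name)
 * matches if either the path as typed or its realpath()-resolved form
 * equals the mount directory or the device name. rpath may be NULL.
 */
static int mount_entry_matches(const char *mnt_dir, const char *mnt_fsname,
			       const char *path, const char *rpath)
{
	if (!strcmp(path, mnt_dir) || !strcmp(path, mnt_fsname))
		return 1;
	if (rpath && (!strcmp(rpath, mnt_dir) || !strcmp(rpath, mnt_fsname)))
		return 1;
	return 0;
}
```

With this shape, the symlink name /dev/mapper/testvg-testlv matches via
"path" even though realpath() turned it into /dev/dm-3, and a
mountpoint given with a trailing slash still matches via "rpath".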

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: create a shared header file for format-related information

All of the buffer operations structures need to be exported
for xfs_db, so move them all to a common location rather than
spreading them all over the place. They verify the on-disk
format, so xfs_format.h might seem a good place, but they are not
part of the on-disk format themselves.

Hence we need to create a new header file in which we centralise these
related definitions. Start by moving the buffer operations
structures, and then also move all the other definitions that have
crept into xfs_log_format.h and xfs_format.h as there was no other
shared header file to put them in.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: dirent dtype presence is dependent on directory magic numbers

The determination of whether a directory entry contains a dtype
field originally was dependent on the filesystem having CRCs
enabled. This meant that the format for dtype being enabled could be
determined by checking the directory block magic number rather than
doing a feature bit check. This was useful in that it meant that we
didn't need to pass a struct xfs_mount around to functions that
were already supplied with a directory block header.

Unfortunately, the introduction of dtype fields into the v4
structure via a feature bit meant this "use the directory block
magic number" method of discriminating the dirent entry sizes is
broken. Hence we need to convert the places that use magic number
checks to use feature bit checks so that they work correctly and not
by chance.

The current code works on v4 filesystems only because the dirent
size roundup covers the extra byte needed by the dtype field in the
places where this problem occurs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: don't assert fail on bad inode numbers

Let the inode verifier do its work by returning an error when we
fail to find correct magic numbers in an inode buffer.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: ensure we copy buffer type in da btree root splits

When splitting the root of the da btree, we shuffled data between
buffers and the structures that track them. At one point, we copy
data and state from one buffer to another, including the ops
associated with the buffer. When we do this, we also need to copy
the buffer type associated with the buf log item so that the buffer
is logged correctly. If we don't do that, log recovery won't
recognise it and hence it won't recalculate the CRC on the buffer
after recovery. This leads to a directory block that can't be read
after recovery has run.

Found by inspection after finding the same problem with remote
symlink buffers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: check magic numbers in dir3 leaf verifier first

Calling xfs_dir3_leaf_hdr_from_disk() in a verifier before
validating the magic numbers in the buffer results in ASSERT
failures due to mismatching magic numbers when a corruption occurs.
Seeing as the verifier is supposed to catch the corruption and pass
it back to the caller, having the verifier assert fail on error
defeats the purpose of detecting the errors in the first place.

Check the magic numbers direct from the buffer before decoding the
header.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

libxfs: fix missing filetype updates to xfs_dir2.c

They were missed in the original patch that was committed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: fix return value of verify_set_primary_sb()

If get_sb() fails because of EOF, it will return with retval 1, which will
then be interpreted as XR_BAD_MAGIC("bad magic number") in phase1() when
warning the user.

This patch fixes it by using XR_EOF here, so it will be interpreted correctly.
Also change the associated comments about the return value.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

[v3, 1/2] xfsprogs: fix potential memory leak in verify_set_primary_sb()

If verify_set_primary_sb() completes the secondary sb scanning loop with
too few valid secondaries found (num_ok < num_sbs / 2), it will immediately
return without freeing any of the previously allocated memory (variables
sb, checked, and any items on the geo list). This was reported by
the Coverity scanner as CID 997012, 997013 and 997014.

Fix this by using the out_free_list: goto target for this error case.

Earlier, if get_sb() fails in the secondary scan loop, it goes to
the out: target which does not free any items on the geo list. Fix
this by using the out_free_list: target as well, and remove the now-unused
out: target.
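
The single-cleanup-label pattern looks roughly like the sketch below.
Everything in it is hypothetical (the list type, the counting
allocator, the function name); it only shows that every exit taken
after the list may be populated goes through out_free_list.

```c
#include <stdlib.h>

struct geo_item {
	struct geo_item *next;
};

static int live_allocs;		/* allocation counter, for the sketch only */

static void *tracked_alloc(size_t n)
{
	void *p = calloc(1, n);

	if (p)
		live_allocs++;
	return p;
}

static void tracked_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/*
 * Hypothetical scan: build a list while scanning, then fail or succeed.
 * Both outcomes release the list through the same label.
 */
static int scan_sketch(int nitems, int fail)
{
	struct geo_item *list = NULL, *item;
	int retval = 0;
	int i;

	for (i = 0; i < nitems; i++) {
		item = tracked_alloc(sizeof(*item));
		if (!item) {
			retval = -1;
			goto out_free_list;
		}
		item->next = list;
		list = item;
	}
	if (fail)
		retval = -1;	/* e.g. too few valid secondaries */

out_free_list:
	while (list) {
		item = list->next;
		tracked_free(list);
		list = item;
	}
	return retval;
}
```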

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: fix potential memory leak in repair/sb.c

The following resource leak is reported by Coverity:

CID 997011 (#1 of 1): Resource leak (RESOURCE_LEAK)
6. leaked_storage: Variable "buf" going out of scope leaks the storage it points to.
505 return(XR_EOF);

Add a free(buf) to solve it.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: initialize filetype for xfs_name_dot

If we add the '.' entry in repair, make sure it has a file type
initialized.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

mkfs: add noalign option to usage()

Although it has been added to the manpage, usage() gives no information
about the existence of the noalign option.

Changelog:

V2: Remove space in comma separated options
V3: Align the option with the other alignment options to make mutually
exclusive options more visible

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: avoid array overflow in pf_batch_read()

The while loop in pf_batch_read, and the code preceding it, is really...
quite a thing. I'd love to rewrite it, but I haven't yet found
a particularly cleaner way.

It cleverly hides the fact that we might increment "num" past the
last index of bplist[] and then assign to it. This corrupts memory.

Rather than major surgery for now, just go for the simple fix,
and break out of the loop if we've increased "num" past the
last index.
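
The simple fix amounts to a bounds test before the store. The sketch
below is not pf_batch_read(); the capacity, the function name and the
int payload are assumptions chosen to keep it short.

```c
#include <stddef.h>

#define MAX_BUFS 4	/* assumed capacity for this sketch */

/*
 * Copy candidate entries into a fixed-size list. Returns the number
 * stored, never more than MAX_BUFS.
 */
static size_t collect_sketch(const int *candidates, size_t ncand,
			     int bplist[MAX_BUFS])
{
	size_t num = 0;
	size_t i;

	for (i = 0; i < ncand; i++) {
		if (num == MAX_BUFS)	/* list is full: stop before writing */
			break;
		bplist[num++] = candidates[i];
	}
	return num;
}
```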

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: fix Out-of-bounds access in repair/dinode.c

On Mon, 2013-08-26 at 12:20 -0500, Eric Sandeen wrote:
> On 8/23/13 11:38 AM, Ben Myers wrote:
> > Hey Rich and Li Zhong,
> >
> > On Wed, Aug 21, 2013 at 11:51:11AM -0500, Rich Johnston wrote:
> >> Looks good, thanks for the patch Li Zhong. it has been committed.
> >>
> >> --Rich
> >>
> >> Reviewed-by: Rich Johnston <rjohnston@sgi.com>
> >>
> >> commit e7c05095f5baa9cd2e35a6de03d7dd9f51dd3910
> >> Author: Li Zhong <zhong@linux.vnet.ibm.com>
> >> Date:   Mon Aug 12 06:11:01 2013 +0000
> >>
> >>     xfsprogs: fix Out-of-bounds access in repair/dinode.c
> >>
> >> On 08/12/2013 01:11 AM, Li Zhong wrote:
> >>> Following is reported by coverity in bug 1061528:
> >>>
> >>> 187                        __dirty_no_modify_ret(dirty);
> >>>
> >>> CID 1061528 (#1 of 1): Out-of-bounds access (OVERRUN)53. overrun-buffer-arg: Overrunning array "dinoc->di_pad" of 6 bytes by passing it to a function which accesses it at byte offset 15 using argument "16UL".
> >>> 188                        memset(dinoc->di_pad, 0, 16);
> >>>
> >>> It seems that di_pad here should be di_pad2, as sekharan pointed out.
> >>>
> >>> Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
> >>> ---
> >>>  repair/dinode.c | 4 ++--
> >>>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/repair/dinode.c b/repair/dinode.c
> >>> index e607f0b..94bf2f8 100644
> >>> --- a/repair/dinode.c
> >>> +++ b/repair/dinode.c
> >>> @@ -183,9 +183,9 @@ clear_dinode_core(struct xfs_mount *mp, xfs_dinode_t *dinoc, xfs_ino_t ino_num)
> >>>   }
> >>>
> >>>   for (i = 0; i < 16; i++) {
> >>> - if (dinoc->di_pad[i] != 0) {
> >>> + if (dinoc->di_pad2[i] != 0) {
> >>>   __dirty_no_modify_ret(dirty);
> >>> - memset(dinoc->di_pad, 0, 16);
> >>> + memset(dinoc->di_pad2, 0, 16);
> >>>   break;
> >>>   }
> >>>   }
> >
> > We also discussed this issue a bit in this thread:
> > http://oss.sgi.com/archives/xfs/2013-08/msg00228.html
> >
> > Looks like the loop itself is incorrect and should be removed, and Eric has
> > suggested that the conditional be changed to a memcmp in case the size of the
> > pad changes in the future.  Would either of you care to spin up another patch
> > to clean it up?
>
> I think I was confused; it seems fine as it is in git, not sure what I was
> thinking.
>
> memcmp can't use a bare "0" as an arg, so it's not ideal to use either.
>
> Not a huge fan of the hard-coded 16, but I think the code is correct now; we
> can probably move on to real problems.  ;)

OK :) Or maybe we could improve it with the calculation using sizeof as
below(which I posted in another thread)?

Thanks, Zhong

3.2.0-alpha1 release

Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: update version for 3.2.0-alpha1

Update the VERSION, configure.ac and doc/CHANGES file for alpha release,
3.2.0-alpha1

Signed-off-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: cleanup miscellaneous merge faults

* clean up a few extra tabs
* xfs_buf_map->xfs_buf_ops in libxfs_readbuf and libxfs_readbuf_map args
* don't call the write verifier twice
* put the multithreaded scan_ags back

Signed-off-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

repair: fix segv on directory block read failure

We try to read all blocks in the directory, but if we have a block
form directory we only have one block and so we need to fail if
there is a read error. Otherwise we try to dereference a null buffer
pointer.

While fixing the error handling for a read failure, fix the bug that
caused the read failure - trying to verify a block format buffer
with the data format buffer verifier.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: inode log reservations are too small

We've been seeing occasional problems with log space leaks and
transaction underruns such as this for some time:

XFS (dm-0): xlog_write: reservation summary:
   trans type  = FSYNC_TS (36)
   unit res    = 2740 bytes
   current res = -4 bytes
   total reg   = 0 bytes (o/flow = 0 bytes)
   ophdrs      = 0 (ophdr space = 0 bytes)
   ophdr + reg = 0 bytes
   num regions = 0

Turns out that xfstests generic/311 is reliably reproducing this
problem with the test it runs at sequence 16 of its execution. It is
a 100% reliable reproducer with the mkfs configuration of "-b
size=1024 -m crc=1" on a 10GB scratch device.

The problem? Inode forks in btree format are logged in memory
format, not disk format (i.e. bmbt format, not bmdr format). That
means there is a btree block header being logged, when such a
structure is never written to the inode fork in bmdr format. The
bmdr header in the inode is only 4 bytes, while the bmbt header is
24 bytes for v4 filesystems and 72 bytes for v5 filesystems.

We currently reserve the inode size plus the rounded up overhead of
logging a buffer, which is 128 bytes. That means the reservation
for a 512 byte inode is 640 bytes. What we can actually log is:

inode core, data and attr fork = 512 bytes
inode log format + log op header = 56 + 12 = 68 bytes
data fork bmbt hdr = 24/72 bytes
attr fork bmbt hdr = 24/72 bytes

So, for v2 inodes we can log at least 628 bytes, but if we split that
inode over the end of the log across log buffers, we also need
another log op header, which takes us to 640 bytes. If there's
another reservation taken out of this that I haven't taken into
account (perhaps multiple iclog splits?) or I haven't correctly
calculated the bmbt format space used (entirely possible), then
we will overrun it.

For v3 inodes the maximum is actually 724 bytes, and even a
single maximally sized btree format fork can blow it (652 bytes).
And that's exactly what is happening with the FSYNC_TS transaction
in the above output - it's consumed 644 bytes of space after the CIL
context took the space reserved for it (2100 bytes).
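
The byte counts quoted above can be re-derived with a small helper.
The constants are the figures given in this message; the macro and
function names are made up for the sketch.

```c
#define INODE_SIZE	512	/* inode core + data fork + attr fork */
#define ILF_SIZE	56	/* inode log format */
#define OPHDR_SIZE	12	/* log op header */
#define BMBT_HDR_V4	24	/* bmbt block header, v4 filesystems */
#define BMBT_HDR_V5	72	/* bmbt block header, v5 filesystems */
#define OLD_RESV	(INODE_SIZE + 128)	/* current reservation */

/* Worst case bytes logged for one inode with nforks btree-format forks. */
static int inode_log_bytes(int bmbt_hdr, int nforks, int extra_ophdrs)
{
	return INODE_SIZE + ILF_SIZE + OPHDR_SIZE +
	       nforks * bmbt_hdr + extra_ophdrs * OPHDR_SIZE;
}
```

inode_log_bytes(BMBT_HDR_V4, 2, 0) is 628 and one extra op header makes
it 640, exactly OLD_RESV. inode_log_bytes(BMBT_HDR_V5, 2, 0) is 724 and
a single fork, inode_log_bytes(BMBT_HDR_V5, 1, 0), is already 652.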

This problem has always been present in the XFS code - the btree
format inode forks have always been logged in this manner. Hence
there has always been the possibility of an overrun with such a
transaction. The CRC code has just exposed it frequently enough to
be able to debug and understand the root cause....

So, let's fix all the inode log space reservations.

[ I'm so glad we spent the effort to clean up the transaction
  reservation code. This is an easy fix now. ]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: btree block LSN escaping to disk uninitialised

When testing LSN ordering code for v5 superblocks, it was discovered
that the LSN embedded in the generic btree blocks was
occasionally uninitialised. These values didn't get written to disk
by metadata writeback - they got written by previous transactions in
log recovery.

The issue here is that when the block is first allocated and
initialised, the LSN field was not initialised - it gets overwritten
before IO is issued on the buffer - but the value that is logged by
transactions that modify the header before it is written to disk
(and initialised) contains garbage. Hence the first recovery of the
buffer will stamp garbage into the LSN field, and that can cause
subsequent transactions to not replay correctly.

The fix is simply to initialise the bb_lsn field to zero when we
initialise the block for the first time.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: fix calculation of the number of node entries in a dir3 node

The calculation doesn't take into account the size of the dir v3
header, so overestimates the hash entries in a node. This causes
directory buffer overruns when splitting and merging nodes.
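
To see the size of the error, compare the entry count with and without
the larger header. The header and entry sizes below are assumptions
made for this sketch, not values taken from the patch, and node_ents()
is a hypothetical name.

```c
#define NODE_HDR_V2	16	/* assumed size of the v2 node header */
#define NODE_HDR_V3	64	/* assumed size of the v3 (CRC) node header */
#define NODE_ENTRY	8	/* assumed size of one hash/pointer entry */

/* Number of entries that fit in a node block after its header. */
static unsigned int node_ents(unsigned int blksize, unsigned int hdr_size)
{
	return (blksize - hdr_size) / NODE_ENTRY;
}
```

Under those assumptions a 4096 byte block holds 510 entries when sized
with the v2 header but only 504 with the v3 header, so using the
smaller header on a v3 node would overestimate by 6 entries.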

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: di_flushiter considered harmful

When we made all inode updates transactional, we no longer needed
the log recovery detection for inodes being newer on disk than the
transaction being replayed - it was redundant as replay of the log
would always result in the latest version of the inode being on
disk. It was redundant, but left in place because it wasn't
considered to be a problem.

However, with the new "don't read inodes on create" optimisation,
flushiter has come back to bite us. Essentially, the optimisation
always initialises flushiter to zero in the create transaction,
and so if we then crash and run recovery and the inode already on
disk has a non-zero flushiter it will skip recovery of that inode.
As a result, log recovery does the wrong thing and we end up with a
corrupt filesystem.

Because we have to support old kernel to new kernel upgrades, we
can't just get rid of the flushiter support in log recovery as we
might be upgrading from a kernel that doesn't have fully transactional
inode updates. Unfortunately, for v4 superblocks there is no way to
guarantee that log recovery knows about this fact.

We cannot add a new inode format flag to say it's a "special inode
create" because it won't be understood by older kernels and so
recovery could do the wrong thing on downgrade. We cannot specially
detect the combination of zero mode/non-zero flushiter on disk to
non-zero mode, zero flushiter in the log item during recovery
because wrapping of the flushiter can result in false detection.

Hence that makes this "don't use flushiter" optimisation limited to
a disk format that guarantees that we don't need it. And that means
the only fix here is to limit the "no read IO on create"
optimisation to version 5 superblocks....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: add dtype support to mkfs and db

Now that we have an extra field in the dirent, add support into
xfs_db to be able to view it when looking at directory structures.

Add support to mkfs to create filesystems with filetype - we'll
always set it on CRC enabled filesystems so all new v5 filesystems
will have this functionality enabled.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: Add write support for dirent filetype field

Add support to propagate and add filetype values into the on-disk
dirents. This involves passing the filetype into the xfs_da_args
structure along with the name and namelength for dirent operations,
and encoding it into the dirent at the same time we write the inode
number into the dirent.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

[47/55,V2] xfs: Add read-only support for dirent filetype field

Add support for the file type field in directory entries so that
readdir can return the type of the inode the dirent points to to
userspace without first having to read the inode off disk.

The encoding of the type field is a single byte that is added to the
end of the directory entry name length. For all intents and
purposes, it appends a "hidden" byte to the name field which
contains the type information. As the directory entry is already of
dynamic size, helpers are already required to access and decode the
directory entry structures.
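
A sketch of how a hidden type byte changes the entry size. The fixed
field sizes and the 8-byte alignment are assumptions for illustration,
and the helper names are hypothetical; the actual helpers live in the
directory code.

```c
#include <stddef.h>

#define ENT_ALIGN	8	/* assumed entry alignment */
#define ENT_FIXED	(8 + 1)	/* assumed: 64-bit inode number + name length */
#define ENT_TAG		2	/* assumed: trailing 16-bit tag */

static size_t roundup_sz(size_t x, size_t align)
{
	return ((x + align - 1) / align) * align;
}

/* Entry size for a name of namelen bytes, with or without the type byte. */
static size_t dirent_size(size_t namelen, int has_ftype)
{
	return roundup_sz(ENT_FIXED + namelen + (has_ftype ? 1 : 0) + ENT_TAG,
			  ENT_ALIGN);
}

/* Offset of the hidden type byte: immediately after the name. */
static size_t dirent_ftype_offset(size_t namelen)
{
	return ENT_FIXED + namelen;
}
```

Under these assumptions a 1-byte name takes 16 bytes either way, since
the roundup absorbs the extra byte, while a 5-byte name grows from 16
to 24 bytes once the type byte is present.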

Hence the relevant extraction and iteration helpers are updated to
understand the hidden byte. Helpers for reading and writing the
filetype field from the directory entries are also added. Only the
read helpers are used by this patch. It also adds all the code
necessary to read the type information out of the dirents on disk.

Further we add the superblock feature bit and helpers to indicate
that we understand the on-disk format change. This is not a
compatible change - existing kernels cannot read the new format
successfully - so an incompatible feature flag is added. We don't
allow filesystems to mount with this flag yet - that will be
added once write support is added.

Finally, the code to take the type from the VFS, convert it to an
XFS on-disk type and put it into the xfs_name structures passed
around is added, but the directory code does not use this field yet.
That will be in the next patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: Add xfs_log_rlimit.c

Add source files for xfs_log_rlimit.c. The new file is used for log
size calculations and validation shared with userspace.

[dchinner: xfs_log_calc_max_attrsetm_res() does not modify the
tr_attrsetm reservation, just calculates the maximum. ]

[dchinner: rework loop in xfs_log_get_max_trans_res() ]

[dchinner: implement xfs_log_calc_unit_res() in util.c to give mkfs
a worst case calculation of the log size needed. ]

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: Get rid of all XFS_XXX_LOG_RES() macros

Get rid of all XFS_XXX_LOG_RES() macros since they are obsolete now.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: refactor xfs_trans_reserve() interface

Now that the new xfs_trans_res structure has been introduced, the log
reservation size, log count and log flags are pre-initialized
at mount time. So it's time to refine the xfs_trans_reserve() interface
to make it neater.

Also, introduce a new helper M_RES() to return a pointer to the
mp->m_resv structure to simplify the input.
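
The shape of the interface change can be sketched as follows. The
field names inside xfs_trans_res, the container and mount structs, and
reserve_sketch() are assumptions made for illustration; only
xfs_trans_res, M_RES() and m_resv are named in this message.

```c
/* Simplified stand-ins; field names are assumed for this sketch. */
struct xfs_trans_res {
	unsigned int	tr_logres;	/* log space per transaction, bytes */
	int		tr_logcount;	/* number of log operations */
	int		tr_logflags;	/* e.g. permanent reservation */
};

struct resv_table {			/* hypothetical container */
	struct xfs_trans_res	tr_writeid;
	struct xfs_trans_res	tr_fsyncts;
};

struct mount_sketch {			/* hypothetical mount structure */
	struct resv_table	m_resv;
};

/* Helper returning a pointer to the per-mount reservation table. */
#define M_RES(mp)	(&(mp)->m_resv)

/* Hypothetical consumer: takes one reservation instead of three scalars. */
static unsigned int reserve_sketch(const struct xfs_trans_res *resp)
{
	return resp->tr_logres;
}
```

A caller would then pass &M_RES(mp)->tr_fsyncts rather than a size,
count and flags triple.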

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: Make writeid transaction use tr_writeid

tr_writeid is defined in the mp->m_resv structure; however, it is not
actually used where it should be.

This patch changes the writeid transaction to use tr_writeid so that it
fetches the correct log reservation size.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: Introduce tr_fsyncts to m_reservation

A preparation step.

For now the fsync_ts transaction uses the pre-calculated log reservation
size of tr_swrite.
This patch introduces a new item tr_fsyncts to the mp->m_reservations
structure so that we can fetch the log reservation value for it
in the same manner as the others.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: Introduce a new structure to hold transaction reservation items

Introduce a new structure xfs_trans_res to hold transaction
reservation item info per log ticket.

We also need to improve xfs_trans_resv_calc() by initializing the
log count as well as log flags for permanent log reservation.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: make struct xfs_perag kernel only

The struct xfs_perag has many kernel-only definitions in it,
requiring a __KERNEL__ guard so userspace can use it too. Move it to
xfs_mount.h so that it is kernel-only, and let userspace redefine
its own version of the structure containing only what it needs.
This gets rid of another __KERNEL__ check in the XFS header files.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: move kernel specific type definitions to xfs.h

xfs_types.h is shared with userspace, so having kernel specific
types defined in it is problematic. Move all the kernel specific
defines to xfs_linux.h so we can remove the __KERNEL__ guards from
xfs_types.h

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: remove __KERNEL__ check from xfs_dir2_leaf.c

It's actually an ifndef section, which means it is only included in
userspace. However, it's deep within the libxfs code, so it's
unlikely that the condition checked in userspace can actually occur
(search an empty leaf) through the libxfs interfaces. i.e. if it can
happen in userspace, it can happen in the kernel, so remove it from
userspace too....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: remove __KERNEL__ from debug code

There is no reason the remaining kernel-only debug code needs to
remain kernel-only. Kill the __KERNEL__ part of the defines, and let
userspace handle the debug code appropriately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: kill __KERNEL__ check for debug code in allocation code

Userspace running debug builds is relatively rare, so there's no need
to special case the allocation algorithm code coverage debug switch.
As it is, userspace defines random numbers to 0, so invert the
logic of the switch so it is effectively a no-op in userspace.
This kills another couple of __KERNEL__ users.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: move swap extent code to xfs_extent_ops

Swapping extents is clearly an extent operation, and it is not
shared with userspace. Move the code to xfs_extent_ops.[ch], and
the userspace ioctl structure definition to xfs_fs.h where most of
the other ioctl structure definitions are. This means xfs_dfrag.h is
no longer needed in userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: don't special case shared superblock mounts

Neither kernel nor userspace support shared read-only mounts, so
don't bother special casing the support check to be different
between kernel and userspace. The same check can be used as neither
like it...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: sync minor kernel header differences

There are lots of little differences between kernel and userspace
headers noticeable now that the files are largely the same. Clean up
all the formatting, whitespace and other minor differences in the
userspace headers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: create xfs_bmap_util.[ch]

There is a bunch of code in xfs_bmap.c that is kernel specific and
not shared with userspace. To minimise the difference between the
kernel and userspace code, shift this unshared code to
xfs_bmap_util.c, and the declarations to xfs_bmap_util.h.

The biggest issue here is xfs_bmap_finish() - userspace has its own
definition of this function, and so we need to move it out of
xfs_bmap.[ch]. This means several other files need to include
xfs_bmap_util.c as well.

It also introduces an interesting dance for the stack switching
code in xfs_bmapi_allocate(). The stack switching/workqueue code is
actually moved to xfs_bmap_util.c, so that userspace can simply use
a #define in a header file to connect the dots without needing to
know about the stack switch code at all.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

libxfs: switch over to xfs_sb.c and remove xfs_mount.c

Now that the kernel code has split the superblock specific code out
of xfs_mount.c, we don't need xfs_mount.c anymore. Copy in xfs_sb.c
and remove xfs_mount.c

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: split out the remote symlink handling

The remote symlink format definition and manipulation needs to be
shared with userspace, but the in-kernel interfaces do not. Split
the remote symlink format handling out into xfs_symlink_remote.[ch]
so it can easily be shared with userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: introduce xfs_inode_buf.c for inode buffer operations

The only thing remaining in xfs_inode.[ch] are the operations that
read, write or verify physical inodes in their underlying buffers.
Move all this code to xfs_inode_buf.[ch] so we can stop sharing
xfs_inode.[ch] with userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: move unrelated definitions out of xfs_inode.h

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: move inode fork definitions to a new header file

The inode fork definitions are a combination of on-disk format
definition and in-memory tracking and manipulation. They are both
shared with userspace, so move them all into their own file so
sharing is easy to do and track. This removes all inode fork
related information from xfs_inode.h.

Do the same for all the C code that currently resides in
xfs_inode.c for the same reason.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

libxfs: move transaction code to trans.c

There is very little code left in xfs_trans.c. So little it is not
worth trying to share this file with kernel space any more. Move the
code to libxfs/trans.c, and remove libxfs/xfs_trans.c.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

libxfs: introduce xfs_trans_resv.c

The log space reservation calculation code has been separated from
the core transaction code in kernelspace. This means we can add it
here in preparation for removing xfs_trans.c to further reduce the
differences between kernel and userspace files.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: introduce xfs_quota_defs.h

There are a lot of quota flag definitions that are shared by user
and kernel space. Move them all to xfs_quota_defs.h so we can
unshare xfs_quota.h and remove the __KERNEL__ regions from it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: introduce xfs_rtalloc_defs.h

There are quite a few realtime device definitions shared with
userspace. Move them from xfs_rtalloc.h to xfs_rtalloc_defs.h
so we don't need to share xfs_rtalloc.h with userspace anymore.

This removes the final __KERNEL__ region from the XFS kernel
codebase. Yay!

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: split out on-disk transaction definitions

There's a bunch of definitions in xfs_trans.h that define on-disk
formats - transaction headers that get written into the log, log
item type definitions, etc. Split out everything into a separate
file so that all which remains in xfs_trans.h are kernel only
definitions.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: separate icreate log format definitions from xfs_icreate_item.h

The on disk log format definitions for the icreate log item are
intertwined with the kernel-only in-memory log item definitions.
Separate the log format definitions out into their own header file
so they can easily be shared with userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: separate dquot on disk format definitions out of xfs_quota.h

The on disk format definitions of the on-disk dquot, log formats and
quota off log formats are all intertwined with other definitions for
quotas. Separate them out into their own header file so they can
easily be shared with userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: split out inode log item format definition

The EFI/EFD item format definitions are shared with userspace. Split
them out of header files that contain kernel only definitions to make
it simple to share them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: split out buf log item format definitions

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: split out inode log item format definition

The log item format definitions are shared with userspace. Split them
out of header files that contain kernel only definitions to make it
simple to share them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: separate out log format definitions

The on-disk format definitions for the log are spread randomly
through a couple of header files. Consolidate it all in a single
file that can be shared easily with userspace. This means that
xfs_log.h and xfs_log_priv.h no longer need to be shared with
userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

libxfs: local to remote format support of remote symlinks

This conversion was overlooked earlier on. Now that the differences
between userspace and kernel space are getting smaller this bug is
obvious. Fix it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs: remove local fork format handling from xfs_bmapi_write()

The conversion from local format to extent format requires
interpretation of the data in the fork being converted, so it cannot
be done in a generic way. It is up to the caller to convert the fork
format to extent format before calling into xfs_bmapi_write() so
format conversion can be done correctly.

The code in xfs_bmapi_write() to convert the format is used
implicitly by the attribute and directory code, but they
specifically zero the fork size so that the conversion does not do
any allocation or manipulation. Move this conversion into the
shortform to leaf functions for the dir/attr code so the conversions
are explicitly controlled by all callers.

Now we can remove the conversion code in xfs_bmapi_write.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

libxfs: fix compile warnings

Some of the code shared with userspace causes compilation warnings
from things turned off in the kernel code, such as differences in
variable signedness. Fix those issues.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: define min/max once and use them everywhere

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

libxfs: sync xfs_ialloc.c to the kernel code

Include the missing xfs_difree() function. It's not used by
userspace, but it makes no sense to have just this one arbitrary
difference between the kernel and userspace files.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

libxfs: sync dir2 kernel differences

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

libxfs: sync attr code with kernel

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

libxfs: update xfs_alloc to current kernel version

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

libxfs: sync xfs_da_btree.c

Some variables were renamed in the kernel code, and there are a few
other minor differences. Fix them up.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

libxfs: fix byte swapping on constants

The kernel code uses cpu_to_beXX() on constants in switch()
statements for magic numbers in the btree code. The byte swapping
infrastructure isn't hooked up to the proper byte swap macros to make
this work, so fix it and then swap all the generic btree code over
to match the kernel code.
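
A minimal sketch of why the byte swap has to be a constant expression:
case labels in C only accept integer constant expressions, so the swap
must be a pure macro rather than a function call. The macro names and
magic values below are hypothetical and are not the libxfs definitions;
the byte-order test assumes a GCC/Clang-style predefined macro and falls
back to swapping (little-endian host) when it is absent.

```c
#include <stdint.h>
#include <string.h>

/* Constant-expression byte swap, so the result is legal in a case label. */
#define SWAP32_CONST(x) \
	((uint32_t)((((uint32_t)(x) & 0x000000ffU) << 24) | \
		    (((uint32_t)(x) & 0x0000ff00U) <<  8) | \
		    (((uint32_t)(x) & 0x00ff0000U) >>  8) | \
		    (((uint32_t)(x) & 0xff000000U) >> 24)))

/* Assumption: the compiler predefines __BYTE_ORDER__ (GCC/Clang do). */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define CPU_TO_BE32_CONST(x)	((uint32_t)(x))
#else
#define CPU_TO_BE32_CONST(x)	SWAP32_CONST(x)
#endif

/* Hypothetical magic numbers, for illustration only. */
#define DEMO_MAGIC_A	0x41424354U
#define DEMO_MAGIC_B	0x41423342U

/* Classify a block by its on-disk (big-endian) magic without swapping it. */
int demo_block_type(const unsigned char *disk)
{
	uint32_t be_magic;

	memcpy(&be_magic, disk, sizeof(be_magic));	/* raw on-disk value */
	switch (be_magic) {
	case CPU_TO_BE32_CONST(DEMO_MAGIC_A):
		return 1;
	case CPU_TO_BE32_CONST(DEMO_MAGIC_B):
		return 2;
	default:
		return 0;
	}
}
```

Swapping the constants at compile time means the value read from disk
is compared as-is, with no runtime conversion on the lookup path.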

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

libxfs: ensure btree root split sets blkno correctly

For CRC enabled filesystems, the BMBT is rooted in an inode, so it
passes through a different code path on root splits to the
freespace and inode btrees. The inode based btree root has a
corruption problem on split - it's the same problem we saw in the
directory/attr code where headers are memcpy()d from one block to
another without updating the self describing metadata.

Simple fix - when copying the header out of the root block, make
sure the block number is updated correctly.
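
The class of bug described above can be sketched as follows. The header
layout and function name are hypothetical, not the libxfs btree code;
the point is only that a memcpy()d self-describing header still carries
the source block's address until it is stamped with the new one.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical self-describing block header, for illustration only. */
struct demo_blk_hdr {
	uint32_t magic;
	uint64_t blkno;		/* address of the block this header lives in */
	uint16_t level;
};

/*
 * Copy a header into a newly allocated block. A bare memcpy() would leave
 * dst->blkno pointing at the source block, so stamp the new address.
 */
void demo_copy_hdr(struct demo_blk_hdr *dst, const struct demo_blk_hdr *src,
		   uint64_t new_blkno)
{
	memcpy(dst, src, sizeof(*dst));
	dst->blkno = new_blkno;
}
```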

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

libxfs: fix directory/attribute format issues

Directory data headers and attr leaf headers need padding for 32 bit
systems to correctly align the data sections on 64 bit boundaries.
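
A sketch of the alignment issue, using a hypothetical header rather than
the real directory or attribute structures: some 32 bit ABIs align
64 bit integers on 4 byte boundaries, so without an explicit pad the
structure size (and hence the start of the data section) can differ
between 32 and 64 bit builds.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk header; field names are illustrative only. */
struct demo_data_hdr {
	uint32_t magic;
	uint32_t crc;
	uint64_t blkno;
	uint64_t owner;
	uint32_t best_free[3];	/* 12 bytes: 36 bytes used so far */
	uint32_t pad;		/* explicit pad: 40 bytes on every ABI */
};

/* Fail the build if the data section would not start on an 8 byte boundary. */
_Static_assert(sizeof(struct demo_data_hdr) % 8 == 0,
	       "header must end on a 64 bit boundary");

/* Offset at which the data section begins, right after the header. */
size_t demo_data_offset(void)
{
	return sizeof(struct demo_data_hdr);
}
```

Without the pad member the compiler would add tail padding only where
the ABI requires 8 byte alignment, which is exactly the inconsistency an
on-disk format cannot tolerate.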

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: teach logprint about icreate transaction

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: port inode create transaction changes

Bring across the relevant parts of the new inode create transaction
sufficient to keep kernel/user code in sync and implement the
infrastructure needed to make it work in xfsprogs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: introduce xfs_icreate.h

Bring the new inode create item definitions across from kernel space
for xfs_logprint to be able to parse.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Review-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs_io: v8 add the lseek() SEEK_DATA/SEEK_HOLE support

Add the lseek SEEK_DATA/SEEK_HOLE support into xfs_io.
The result from the lseek() call will be printed to the output.
For example:

xfs_io> seek -hs 609k
Whence Start Result
HOLE 623616 630784
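
For reference, the underlying system call can be sketched as below. This
is not the xfs_io implementation; the function names are hypothetical,
the row format only imitates the example output above, and the numeric
fallbacks for SEEK_DATA/SEEK_HOLE assume Linux.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* glibc exposes SEEK_DATA/SEEK_HOLE under this */
#endif
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback values as used on Linux, in case the headers did not define them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/*
 * Return the offset of the next hole (want_hole != 0) or the next data
 * region at or after 'start', or -1 with errno set on failure.
 */
off_t demo_seek(int fd, off_t start, int want_hole)
{
	return lseek(fd, start, want_hole ? SEEK_HOLE : SEEK_DATA);
}

/* Print one "Whence Start Result" row; returns -1 if the seek failed. */
int demo_print_seek(FILE *out, int fd, off_t start, int want_hole)
{
	off_t res = demo_seek(fd, start, want_hole);

	if (res < 0)
		return -1;
	fprintf(out, "%s\t%lld\t%lld\n", want_hole ? "HOLE" : "DATA",
		(long long)start, (long long)res);
	return 0;
}
```

A file with no holes still reports one implicit hole at end of file, so
SEEK_HOLE from offset 0 returns the file size in that case.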

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs_db: add header to freesp -d output

Today, xfs_db's freesp -d command dumps out a bunch of numbers:

# xfs_db -c "freesp -d" /dev/sdb1
       0        4        1
       0        5        1
       0        6        1
       0        7        1
       0       12   174772
...

which are not useful to the non-code-reading user.
Add some headers:

# xfs_db -c "freesp -d" /dev/sdb1
    agno    agbno      len
       0        4        1
       0        5        1
       0        6        1
       0        7        1
       0       12   174772
...

so there's at least some context.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Mark Tinguely <tinguely@sgi.com>

xfs_repair: zero out unused parts of superblocks

Prior to:
1375cb65 xfs: growfs: don't read garbage for new secondary superblocks

we ran the risk of allowing garbage in secondary superblocks
beyond the in-use sb fields. With kernels 3.10 and beyond, the
verifiers will kick these out as invalid, but xfs_repair does
not detect or repair this condition.

There is superblock stale-data zeroing code, but it is under a
narrow conditional - the bug addressed in the above commit did not
meet that conditional. So change this to check unconditionally.

Further, the checking code was looking at the in-memory
superblock buffer, which was zeroed prior to population, and
would therefore never possibly show any stale data beyond the
last up-rev superblock field.

So instead, check the disk buffer for this garbage condition.

If we detect garbage, we must zero out both the in-memory sb
and the disk buffer; the former may contain unused data
in up-rev sb fields which will be written back out; the latter
may contain garbage beyond all fields, which won't be updated
when we translate the in-memory sb back to disk.

The V4 superblock case was zeroing out the sb_bad_features2
field; we also fix that to leave that field alone.

Lastly, use offsetof() instead of the tortured (__psint_t)
casts & pointer math.
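
A minimal sketch of the offsetof() approach, checking the raw disk
buffer rather than a zeroed in-memory copy. The structure, field names
and sector size are hypothetical and much simpler than the real
superblock.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical superblock layout; only the shape matters here. */
struct demo_sb {
	uint32_t magic;
	uint32_t version;
	uint64_t last_valid_field;	/* last field defined for this version */
	uint64_t newer_field;		/* only valid in later versions */
};

#define DEMO_SECTOR_SIZE 512

/*
 * Zero everything in the on-disk sector past the last valid field.
 * Returns 1 if stale bytes were found and cleared, 0 if already clean.
 */
int demo_zero_stale(unsigned char *sector)
{
	/* first byte after the last valid field, with no pointer casts */
	size_t valid = offsetof(struct demo_sb, newer_field);
	size_t i;

	for (i = valid; i < DEMO_SECTOR_SIZE; i++) {
		if (sector[i] != 0) {
			memset(sector + valid, 0, DEMO_SECTOR_SIZE - valid);
			return 1;
		}
	}
	return 0;
}
```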

Reported-by: Michael Maier <m1278468@allmail.net>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfs_repair: add prototype for alloc_ex_data()

3ac87fbf xfsprogs: fix inode crash in xfs_repair

un-static'd alloc_ex_data and used it in phase6.c,
but didn't put a prototype in a header, so:

phase6.c: In function ‘mk_orphanage’:
phase6.c:943: warning: implicit declaration of function ‘alloc_ex_data’

Fix it...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: fix Out-of-bounds access in repair/dinode.c

Following is reported by coverity in bug 1061528:

187 __dirty_no_modify_ret(dirty);

CID 1061528 (#1 of 1): Out-of-bounds access (OVERRUN)
53. overrun-buffer-arg: Overrunning array "dinoc->di_pad" of 6 bytes by passing it to a function which accesses it at byte offset 15 using argument "16UL".
188 memset(dinoc->di_pad, 0, 16);

It seems that di_pad here should be di_pad2, as sekharan pointed out.
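
One way to make this kind of mismatch harder to write is to derive the
length from the array being cleared. The structure below is a
hypothetical stand-in with illustrative sizes, not the real dinode
core, and this is a general pattern rather than the change made by this
commit.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an inode core with two pad arrays. */
struct demo_dinode {
	uint8_t  pad[6];
	uint16_t flushiter;
	uint8_t  pad2[16];
};

void demo_clear_pad2(struct demo_dinode *d)
{
	/* sizeof(d->pad2) ties the length to the array actually being cleared */
	memset(d->pad2, 0, sizeof(d->pad2));
}
```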

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: fix inode crash in xfs_repair

Adding the lost+found in phase 6 could allocate an inode from
a new inode chunk. Since this chunk was not around in phase 3
when the inode chunks are verified and added to the avl tree,
the avl tree look up will return a NULL pointer. This results
in a NULL dereference and segmentation fault.

Add the newly created inode chunk as if found in the chunk
verification phase.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: Start using pquotaino from on-disk superblock

Start using the new field sb_pquotino from the on-disk superblock if the
version of the superblock supports separate pquotino.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: Remove incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD

Remove incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD. Instead,
start using XFS_GQUOTA_.* XFS_PQUOTA_.* counterparts.

On disk version still uses XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Rich Johnston <rjohnston@sgi.com>

xfsprogs: fix unint var in repair phase6

2 calls to libxfs_bmapi_write exist in repair's phase6
where "first" is uninitialized, but is accessed
in that function.

Normally we call xfs_bmap_init() first to initialize
both the free list and the first block, but in these
cases, the free list var is sent as NULL.

So in these 2 cases, explicitly initialize the "first"
variable to NULLFSBLOCK as xfs_bmap_init() does
elsewhere.

Coverity caught this.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>

xfsprogs: fix agcnts leak in xfs_repair's scan_ags

agcnts is malloc'd but never freed in this function.

Coverity found this.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>

xfsprogs: free bp in xlog_find_tail() error path

xlog_find_tail() currently leaks a bp on one error path.

There is no error target, so manually free the bp before
returning the error.

Found by Coverity.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>

xfsprogs: free bp in xlog_find_zeroed() error path

xlog_find_zeroed() currently leaks a bp on one error path.

Using the bp_err: target resolves this.

Found by Coverity.
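
The single-exit cleanup pattern referred to here can be sketched as
below. The helpers, the failure selector and the error values are
hypothetical; only the bp_err label name is taken from the text above.

```c
#include <errno.h>
#include <stdlib.h>

int demo_live_bufs;	/* counts outstanding buffers so leaks are visible */

static void *demo_get_buf(size_t len)
{
	void *p = malloc(len);

	if (p)
		demo_live_bufs++;
	return p;
}

static void demo_put_buf(void *p)
{
	if (p) {
		free(p);
		demo_live_bufs--;
	}
}

/* fail_step selects which (hypothetical) step fails; 0 means success. */
int demo_find_zeroed(int fail_step)
{
	void *bp;
	int error = 0;

	bp = demo_get_buf(512);
	if (!bp)
		return -ENOMEM;

	if (fail_step == 1) {
		error = -EIO;
		goto bp_err;	/* not "return": bp must be released first */
	}
	if (fail_step == 2) {
		error = -EINVAL;
		goto bp_err;
	}

bp_err:
	demo_put_buf(bp);	/* every path after allocation ends here */
	return error;
}
```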

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>

xfsprogs: fix buffer leak in xlog_print_find_oldest

The error path in this function did not free the buffer
before returning.

Coverity found this one.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>

xfsprogs: avoid double-free in xfs_attr_node_addname

xfs_attr_node_addname()'s error handling tests whether it
should free "state" in the out: error handling label:

out:
        if (state)
                xfs_da_state_free(state);

but an earlier free doesn't set state to NULL afterwards; this
could lead to a double free.  Fix it by setting state to NULL
after it's freed.

This was found by Coverity.
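
A minimal sketch of the free-then-NULL pattern, with hypothetical names
standing in for the real state structure and its free function:

```c
#include <stdlib.h>

int demo_free_calls;	/* counts frees so a double free would be visible */

struct demo_state {
	int dummy;
};

static void demo_state_free(struct demo_state *s)
{
	free(s);
	demo_free_calls++;
}

int demo_node_addname(int take_early_free_path)
{
	struct demo_state *state = malloc(sizeof(*state));

	if (!state)
		return -1;

	if (take_early_free_path) {
		demo_state_free(state);
		state = NULL;	/* without this, out: would free it again */
		goto out;	/* stands in for later work that bails out */
	}

	/* normal path: state is still owned here */
out:
	if (state)
		demo_state_free(state);
	return 0;
}
```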

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>

xfsprogs/io: add readdir command

readdir reads the directory entries from an open directory from
the provided offset (or 0 if not specified). On completion,
readdir prints summary information regarding the number of
operations and bytes transferred. Options are available to specify
the starting offset, length and verbose mode to dump directory
entry information.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>

mkfs.xfs: fix protofile name create block reservation

A large protofile which creates a large directory and requires
a dir tree split, can fail:

  mkfs.xfs: directory createname error [28 - No space left on device]

This is because when we've split a block once, we decrement args->total:
(see kernel commit a7444053fb3ebd3d905e3c7a7bd5ea80a54b083a for the
rationale)

       /* account for newly allocated blocks in reserved blocks total */
       args->total -= dp->i_d.di_nblocks - nblks;

but every call into this path from proto file parsing started
reserved / args->total as only "1" as passed to newdirent() -
so if we allocate a block, args->total hits 0, and then in
xfs_dir2_node_addname():

        /*
         * Add the new leaf entry.
         */
        rval = xfs_dir2_leafn_add(blk->bp, args, blk->index);
        if (rval == 0) {
...
        } else {
                /*
                 * It didn't work, we need to split the leaf block.
                 */
                if (args->total == 0) {
                        ASSERT(rval == ENOSPC);
                        goto done;
                }
                /*
                 * Split the leaf block and insert the new entry.
                 */

we hit the args->total == 0 special case, and don't do the next
split, and ENOSPC gets returned all the way up, and we fail.

So rather than calling newdirent with a total of "1" in every case,
which doesn't account for possible tree splits, we should call it
with a more appropriate value: XFS_DIRENTER_SPACE_RES(mp, name->len),
which will handle the maximum nr of block allocations that might be
needed during a directory entry insert.

Since the reservation required doesn't depend on entry type,
just push this down a level, into newdirent() itself.
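
The failure mode can be illustrated with a toy model. This is not the
directory code: the function and its parameters are hypothetical, and it
only mimics the "decrement the reservation per allocated block, refuse
to split at zero" behaviour described above.

```c
#include <errno.h>

/*
 * Toy model: inserting an entry needs 'splits_needed' block splits, and
 * each split consumes one block from the reservation 'total'.
 */
int demo_dir_addname(unsigned int total, unsigned int splits_needed)
{
	while (splits_needed > 0) {
		if (total == 0)
			return ENOSPC;	/* reservation exhausted: give up */
		total--;		/* account for the newly allocated block */
		splits_needed--;
	}
	return 0;
}
```

With a reservation of 1, the first split succeeds and the second one
returns ENOSPC; a reservation sized for the worst case lets both
complete.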

Reported-by: Boris Ranto <branto@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>

xfs: don't emit v5 superblock warnings on write

We write the superblock every 30s or so which results in the
verifier being called. Right now that results in this output
every 30s:

XFS (vda): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
Use of these features in this kernel is at your own risk!

This spams the logs.

We don't need to check for whether we support v5 superblocks or
whether there are feature bits we don't support set as these are
only relevant when we first mount the filesystem. i.e. on superblock
read. Hence for the write verification we can just skip all the
checks (and hence verbose output) altogether.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

xfs: rework remote attr CRCs

Note: this changes the on-disk remote attribute format. I assert
that this is OK to do as CRCs are marked experimental and the first
kernel it is included in has not yet reached release. Further,
the userspace utilities are still evolving and so anyone using this
stuff right now is a developer or tester using volatile filesystems
for testing this feature. Hence changing the format right now to
save longer term pain is the right thing to do.

The fundamental change is to move from a header per extent in the
attribute to a header per filesystem block in the attribute. This
means there are more header blocks and the parsing of the attribute
data is slightly more complex, but it has the advantage that we
always know the size of the attribute on disk based on the length of
the data it contains.

This is where the header-per-extent method has problems. We don't
know the size of the attribute on disk without first knowing how
many extents are used to hold it. And we can't tell from a
mapping lookup, either, because remote attributes can be allocated
contiguously with other attribute blocks and so there is no obvious
way of determining the actual size of the attribute on disk short of
walking and mapping buffers.

The problem with this approach is that if we map a buffer
incorrectly (e.g. we make the last buffer for the attribute data too
long), we then get buffer cache lookup failure when we map it
correctly. i.e. we get a size mismatch on lookup. This is not
necessarily fatal, but it's a cache coherency problem that can lead
to returning the wrong data to userspace or writing the wrong data
to disk. And debug kernels will assert fail if this occurs.

I found lots of niggly little problems trying to fix this issue on a
4k block size filesystem, finally getting it to pass with lots of
fixes. The thing is, 1024 byte filesystems still failed, and it was
getting really complex handling all the corner cases that were
showing up. And there were clearly more that I hadn't found yet.

It is complex, fragile code, and if we don't fix it now, it will be
complex, fragile code forever more.

Hence the simple fix is to add a header to each filesystem block.
This gives us the same relationship between the attribute data
length and the number of blocks on disk as we have without CRCs -
it's a linear mapping and doesn't require us to guess anything. It
is simple to implement, too - the remote block count calculated at
lookup time can be used by the remote attribute set/get/remove code
without modification for both CRC and non-CRC filesystems. The world
becomes sane again.

Because the copy-in and copy-out now need to iterate over each
filesystem block, I moved them into helper functions so we separate
the block mapping and buffer manipulations from the attribute data
and CRC header manipulations. The code becomes much clearer as a
result, and it is a lot easier to understand and debug. It also
appears to be much more robust - once it worked on 4k block size
filesystems, it has worked without failure on 1k block size
filesystems, too.
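
The linear mapping described above can be sketched as a single
calculation. The function name and the header size passed in are
assumptions for illustration; a header size of 0 models the non-CRC
case.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of filesystem blocks needed to hold attr_len bytes of attribute
 * data when every block carries hdr_size bytes of header.
 */
size_t demo_rmt_blocks(size_t attr_len, size_t blksize, size_t hdr_size)
{
	size_t payload;

	assert(hdr_size < blksize);
	payload = blksize - hdr_size;		/* usable bytes per block */
	return (attr_len + payload - 1) / payload;	/* round up */
}
```

Because the result depends only on the data length and the block size,
no knowledge of the extent layout is needed to know how many blocks the
attribute occupies.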

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

xfs: fully initialise temp leaf in xfs_attr3_leaf_compact

xfs_attr3_leaf_compact() uses a temporary buffer for compacting the
entries in a leaf. It copies the original buffer into the
temporary buffer, then zeros the original buffer completely. It then
copies the entries back into the original buffer. However, the
original buffer has not been correctly initialised, and so the
movement of the entries goes horribly wrong.

Make sure the zeroed destination buffer is fully initialised, and
once we've set up the destination incore header appropriately, write
it back to the buffer before starting to move entries around.

While debugging this, the _d/_s prefixes weren't sufficient to
remind me what buffer was what, so rename them all _src/_dst.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

xfs: fully initialise temp leaf in xfs_attr3_leaf_unbalance

xfs_attr3_leaf_unbalance() uses a temporary buffer for recombining
the entries in two leaves when the destination leaf requires
compaction. The temporary buffer ends up being copied back over the
original destination buffer, so the header in the temporary buffer
needs to contain all the information that is in the destination
buffer.

To make sure the temporary buffer is fully initialised, once we've
set up the temporary incore header appropriately, write it back to
the temporary buffer before starting to move entries around.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>