git.ipfire.org Git - thirdparty/xfsprogs-dev.git/log

xfs: stop holding ILOCK over filldir callbacks

Source kernel commit dbad7c993053d8f482a5f76270a93307537efd8e

The recent change to the readdir locking made in 40194ec ("xfs:
reinstate the ilock in xfs_readdir") for CXFS directory sanity was
probably the wrong thing to do. Deep in the readdir code we
can take page faults in the filldir callback, and so taking a page
fault while holding an inode ilock creates a new set of locking
issues that lockdep warns all over the place about.

The locking order for regular inodes w.r.t. page faults is io_lock
-> pagefault -> mmap_sem -> ilock. The directory readdir code now
triggers ilock -> page fault -> mmap_sem. While we cannot deadlock
at this point, it inverts all the locking patterns that lockdep
normally sees on XFS inodes, and so triggers lockdep. We worked
around this with commit 93a8614 ("xfs: fix directory inode iolock
lockdep false positive"), but that then just moved the lockdep
warning to deeper in the page fault path and triggered on security
inode locks. Fixing the shmem issue there just moved the lockdep
reports somewhere else, and now we are getting false positives from
filesystem freezing annotations getting confused.

Further, if we enter memory reclaim in a readdir path, we now get
lockdep warning about potential deadlocks because the ilock is held
when we enter reclaim. This, again, is different to a regular file
in that we never allow memory reclaim to run while holding the ilock
for regular files. Hence lockdep now throws
ilock->kmalloc->reclaim->ilock warnings.

Basically, the problem is that the ilock is being used to protect
the directory data and the inode metadata, whereas for a regular
file the iolock protects the data and the ilock protects the
metadata. From the VFS perspective, the i_mutex serialises all
accesses to the directory data, and so not holding the ilock for
readdir doesn't matter. The issue is that CXFS doesn't access
directory data via the VFS, so it has no "data serialisaton"
mechanism. Hence we need to hold the IOLOCK in the correct places to
provide this low level directory data access serialisation.

The ilock can then be used just when the extent list needs to be
read, just like we do for regular files. The directory modification
code can take the iolock exclusive when the ilock is also taken,
and this then ensures that readdir is correct excluded while
modifications are in progress.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: fix two comment typos

Source kernel commit fef4ded8cb2e53a47093b334e7730d55a3badc59

Just fix two typos in code comments.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: swap leaf buffer into path struct atomically during path shift

Source kernel commit 7df1c170b9a45ab3a7401c79bbefa9939bf8eafb

The node directory lookup code uses a state structure that tracks the
path of buffers used to search for the hash of a filename through the
leaf blocks. When the lookup encounters a block that ends with the
requested hash, but the entry has not yet been found, it must shift over
to the next block and continue looking for the entry (i.e., duplicate
hashes could continue over into the next block). This shift mechanism
involves walking back up and down the state structure, replacing buffers
at the appropriate btree levels as necessary.

When a buffer is replaced, the old buffer is released and the new buffer
read into the active slot in the path structure. Because the buffer is
read directly into the path slot, a buffer read failure can result in
setting a NULL buffer pointer in an active slot. This throws off the
state cleanup code in xfs_dir2_node_lookup(), which expects to release a
buffer from each active slot. Instead, a BUG occurs due to a NULL
pointer dereference:

  BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8
  IP: [<ffffffffa0585063>] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
  ...
  RIP: 0010:[<ffffffffa0585063>]  [<ffffffffa0585063>] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
  ...
  Call Trace:
   [<ffffffffa05250c6>] xfs_dir2_node_lookup+0xa6/0x2c0 [xfs]
   [<ffffffffa0519f7c>] xfs_dir_lookup+0x1ac/0x1c0 [xfs]
   [<ffffffffa055d0e1>] xfs_lookup+0x91/0x290 [xfs]
   [<ffffffffa05580b3>] xfs_vn_lookup+0x73/0xb0 [xfs]
   [<ffffffff8122de8d>] lookup_real+0x1d/0x50
   [<ffffffff8123330e>] path_openat+0x91e/0x1490
   [<ffffffff81235079>] do_filp_open+0x89/0x100
   ...

This has been reproduced via a parallel fsstress and filesystem shutdown
workload in a loop. The shutdown triggers the read error in the
aforementioned codepath and causes the BUG in xfs_dir2_node_lookup().

Update xfs_da3_path_shift() to update the active path slot atomically
with respect to the caller when a buffer is replaced. This ensures that
the caller always sees the old or new buffer in the slot and prevents
the NULL pointer dereference.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: handle dquot buffer readahead in log recovery correctly

Source kernel commit 7d6a13f023567d573ac362502bb702eda716e654

When we do dquot readahead in log recovery, we do not use a verifier
as the underlying buffer may not have dquots in it. e.g. the
allocation operation hasn't yet been replayed. Hence we do not want
to fail recovery because we detect an operation to be replayed has
not been run yet. This problem was addressed for inodes in commit
d891400 ("xfs: inode buffers may not be valid during recovery
readahead") but the problem was not recognised to exist for dquots
and their buffers as the dquot readahead did not have a verifier.

The result of not using a verifier is that when the buffer is then
next read to replay a dquot modification, the dquot buffer verifier
will only be attached to the buffer if *readahead is not complete*.
Hence we can read the buffer, replay the dquot changes and then add
it to the delwri submission list without it having a verifier
attached to it. This then generates warnings in xfs_buf_ioapply(),
which catches and warns about this case.

Fix this and make it handle the same readahead verifier error cases
as for inode buffers by adding a new readahead verifier that has a
write operation as well as a read operation that marks the buffer as
not done if any corruption is detected. Also make sure we don't run
readahead if the dquot buffer has been marked as cancelled by
recovery.

This will result in readahead either succeeding and the buffer
having a valid write verifier, or readahead failing and the buffer
state requiring the subsequent read to resubmit the IO with the new
verifier. In either case, this will result in the buffer always
ending up with a valid write verifier on it.

Note: we also need to fix the inode buffer readahead error handling
to mark the buffer with EIO. Brian noticed the code I copied from
there wrong during review, so fix it at the same time. Add comments
linking the two functions that handle readahead verifier errors
together so we don't forget this behavioural link in future.

cc: <stable@vger.kernel.org> # 3.12 - current
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: inode recovery readahead can race with inode buffer creation

Source kernel commit b79f4a1c68bb99152d0785ee4ea3ab4396cdacc6

When we do inode readahead in log recovery, we do can do the
readahead before we've replayed the icreate transaction that stamps
the buffer with inode cores. The inode readahead verifier catches
this and marks the buffer as !done to indicate that it doesn't yet
contain valid inodes.

In adding buffer error notification  (i.e. setting b_error = -EIO at
the same time as as we clear the done flag) to such a readahead
verifier failure, we can then get subsequent inode recovery failing
with this error:

XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32

This occurs when readahead completion races with icreate item replay
such as:

inode readahead
find buffer
lock buffer
submit RA io
....
icreate recovery
    xfs_trans_get_buffer
find buffer
lock buffer
<blocks on RA completion>
.....
<ra completion>
fails verifier
clear XBF_DONE
set bp->b_error = -EIO
release and unlock buffer
<icreate gains lock>
icreate initialises buffer
marks buffer as done
adds buffer to delayed write queue
releases buffer

At this point, we have an initialised inode buffer that is up to
date but has an -EIO state registered against it. When we finally
get to recovering an inode in that buffer:

inode item recovery
    xfs_trans_read_buffer
find buffer
lock buffer
sees XBF_DONE is set, returns buffer
    sees bp->b_error is set
fail log recovery!

Essentially, we need xfs_trans_get_buf_map() to clear the error status of
the buffer when doing a lookup. This function returns uninitialised
buffers, so the buffer returned can not be in an error state and
none of the code that uses this function expects b_error to be set
on return. Indeed, there is an ASSERT(!bp->b_error); in the
transaction case in xfs_trans_get_buf_map() that would have caught
this if log recovery used transactions....

This patch firstly changes the inode readahead failure to set -EIO
on the buffer, and secondly changes xfs_buf_get_map() to never
return a buffer with an error state set so this first change doesn't
cause unexpected log recovery failures.

cc: <stable@vger.kernel.org> # 3.12 - current
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: eliminate committed arg from xfs_bmap_finish

Source kernel commit f6106efae5f4144b32f6c10de0dc3e7efc9181e3

Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
associated comments were replicated several times across
the attribute code, all dealing with what to do if the
transaction was or wasn't committed.

And in that replicated code, an ASSERT() test of an
uninitialized variable occurs in several locations:

error = xfs_attr_thing(&args);
if (!error) {
error = xfs_bmap_finish(&args.trans, args.flist,
&committed);
}
if (error) {
ASSERT(committed);

If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
never set "committed", and then test it in the ASSERT.

Fix this up by moving the committed state internal to xfs_bmap_finish,
and add a new inode argument. If an inode is passed in, it is passed
through to __xfs_trans_roll() and joined to the transaction there if
the transaction was committed.

xfs_qm_dqalloc() was a little unique in that it called bjoin rather
than ijoin, but as Dave points out we can detect the committed state
but checking whether (*tpp != tp).

Addresses-Coverity-Id: 102360
Addresses-Coverity-Id: 102361
Addresses-Coverity-Id: 102363
Addresses-Coverity-Id: 102364
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: bmapbt checking on debug kernels too expensive

Source kernel commit e35438196c6a1d8b206471d51e80c380e80e047b

For large sparse or fragmented files, checking every single entry in
the bmapbt on every operation is prohibitively expensive. Especially
as such checks rarely discover problems during normal operations on
high extent coutn files. Our regression tests don't tend to exercise
files with hundreds of thousands to millions of extents, so mostly
this isn't noticed.

However, trying to run things like xfs_mdrestore of large filesystem
dumps on a debug kernel quickly becomes impossible as the CPU is
completely burnt up repeatedly walking the sparse file bmapbt that
is generated for every allocation that is made.

Hence, if the file has more than 10,000 extents, just don't bother
with walking the tree to check it exhaustively. The btree code has
checks that ensure that the newly inserted/removed/modified record
is correctly ordered, so the entrie tree walk in thses cases has
limited additional value.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: get mp from bma->ip in xfs_bmap code

Source kernel commit f1f96c4946590616812711ac19eb7a84be160877

In my earlier commit

  c29aad4 xfs: pass mp to XFS_WANT_CORRUPTED_GOTO

I added some local mp variables with code which indicates that
mp might be NULL.  Coverity doesn't like this now, because the
updated per-fs XFS_STATS macros dereference mp.

I don't think this is actually a problem; from what I can tell,
we cannot get to these functions with a null bma->tp, so my NULL
check was probably pointless.  Still, it's not super obvious.

So switch this code to get mp from the inode on the xfs_bmalloca
structure, with no conditional, because the functions are already
using bmap->ip directly.

Addresses-Coverity-Id: 1339552
Addresses-Coverity-Id: 1339553
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: introduce BMAPI_ZERO for allocating zeroed extents

Source kernel commit 3fbbbea34bac049c0b5938dc065f7d8ee1ef7e67

To enable DAX to do atomic allocation of zeroed extents, we need to
drive the block zeroing deep into the allocator. Because
xfs_bmapi_write() can return merged extents on allocation that were
only partially allocated (i.e. requested range spans allocated and
hole regions, allocation into the hole was contiguous), we cannot
zero the extent returned from xfs_bmapi_write() as that can
overwrite existing data with zeros.

Hence we have to drive the extent zeroing into the allocation code,
prior to where we merge the extents into the BMBT and return the
resultant map. This means we need to propagate this need down to
the xfs_alloc_vextent() and issue the block zeroing at this point.

While this functionality is being introduced for DAX, there is no
reason why it is specific to DAX - we can per-zero blocks during the
allocation transaction on any type of device. It's just slow (and
usually slower than unwritten allocation and conversion) on
traditional block devices so doesn't tend to get used. We can,
however, hook hardware zeroing optimisations via sb_issue_zeroout()
to this operation, so it may be useful in future and hence the
"allocate zeroed blocks" API needs to be implementation neutral.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: per-filesystem stats counter implementation

Source kernel commit ff6d6af2351caea7db681f4539d0d893e400557a

This patch modifies the stats counting macros and the callers
to those macros to properly increment, decrement, and add-to
the xfs stats counts. The counts for global and per-fs stats
are correctly advanced, and cleared by writing a "1" to the
corresponding clear file.

global counts: /sys/fs/xfs/stats/stats
per-fs counts: /sys/fs/xfs/sda*/stats/stats

global clear: /sys/fs/xfs/stats/stats_clear
per-fs clear: /sys/fs/xfs/sda*/stats/stats_clear

[dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
stats structures and macros. ]

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: log local to remote symlink conversions correctly on v5 supers

Source kernel commit b7cdc66be54b64daef593894d12ecc405f117829

A local format symlink inode is converted to extent format when an
extended attribute is set on an inode as part of the attribute fork
creation. This means a block is allocated, the local symlink target name
is copied to the block and the block is logged. Currently,
xfs_bmap_local_to_extents() handles logging the remote block data based
on the size of the data fork prior to the conversion. This is not
correct on v5 superblock filesystems, which add an additional header to
remote symlink blocks that is nonexistent in local format inodes.

As a result, the full length of the remote symlink block content is not
logged. This can lead to corruption should a crash occur and log
recovery replay this transaction.

Since a callout is already used to initialize the new remote symlink
block, update the local-to-extents conversion mechanism to make the
callout also responsible for logging the block. It is already required
to set the log buffer type and format the block appropriately based on
the superblock version. This ensures the remote symlink is always logged
correctly. Note that xfs_bmap_local_to_extents() is only called for
symlinks so there are no other callouts that require modification.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: add missing bmap cancel calls in error paths

Source kernel commit d4a97a04227d5ba91b91888a016e2300861cfbc7

If a failure occurs after the bmap free list is populated and before
xfs_bmap_finish() completes successfully (which returns a partial
list on failure), the bmap free list must be cancelled. Otherwise,
the extent items on the list are never freed and a memory leak
occurs.

Several random error paths throughout the code suffer this problem.
Fix these up such that xfs_bmap_cancel() is always called on error.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: Optimize the loop for xfs_bitmap_empty

Source kernel commit 1d4292bfdc77f4f7c520064be15d0c46bd025fd2

If there is any non zero bit in a long bitmap, it can jump out of the
loop and finish the function as soon as possible.

Signed-off-by: Jia He <hejianet@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_io: add support for changing the new inode DAX attribute

It is changed via the FS_IOC_FSSETXATTR ioctl, so add the new flag
to the platform definitions for userspace that don't this API. Then
add support to the relevant xfs_io chattr and stat commands.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

xfs: introduce per-inode DAX enablement

Source kernel commit 58f88ca2df7270881de2034c8286233a89efe71c

Rather than just being able to turn DAX on and off via a mount
option, some applications may only want to enable DAX for certain
performance critical files in a filesystem.

This patch introduces a new inode flag to enable DAX in the v3 inode
di_flags2 field. It adds support for setting and clearing flags in
the di_flags2 field via the XFS_IOC_FSSETXATTR ioctl, and sets the
S_DAX inode flag appropriately when it is seen.

When this flag is set on a directory, it acts as an "inherit flag".
That is, inodes created in the directory will automatically inherit
the on-disk inode DAX flag, enabling administrators to set up
directory heirarchies that automatically use DAX. Setting this flag
on an empty root directory will make the entire filesystem use DAX
by default.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_fs.h: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR promotion

The kernel commit to make this ioctl promotion (bb99e06ddf) moved
the definitions for the XFS ioctl to uapi/linux/fs.h for the
following reason:

    Hoist the ioctl definitions for the XFS_IOC_FS[SG]SETXATTR API
    from fs/xfs/libxfs/xfs_fs.h to include/uapi/linux/fs.h so that
    the ioctls can be used by all filesystems, not just XFS. This
    enables (initially) ext4 to use the ioctl to set project IDs on
    inodes.

This means we now need to handle this change in userspace as the
uapi/linux/fs.h file may not contain the definitions (i.e. new
xfsprogs/ old linux uapi files) xfsprogs needs to build. Hence we
need to massage the definition in xfs_fs.h to take the values from
the system header if it exists, otherwise keep the old definitions
for compatibility and platforms other than linux.

To this extent, we add the FS* definitions to the platform headers
so the FS* versions are available on all platforms, and add trivial
defines to xfs_fs.h to present the XFS* versions for backwards
compatibility with existing code. New code should always use the FS*
versions, and as such we also convert all the users of XFS* in
xfsprogs to use the FS* definitions.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

libxfs: reset dirty buffer priority on lookup

When a buffer on the dirty MRU is looked up and found, we remove the
buffer from the MRU. However, we've already set the priority of the
buffer to "dirty" so when we are done with it it will go back on the
dirty buffer MRU regardless of whether it needs to or not.

Hence when we move a buffer to a the dirty MRU, record the old
priority and restore it when we remove the buffer from the MRU on
lookup. This will prevent us from putting fixed, now writeable
buffers back on the dirty MRU and allow the cache routine to write,
shake and reclaim the buffers once they are clean.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: keep unflushable buffers off the cache MRUs

There's no point trying to free buffers that are dirty and return
errors on flush as we have to keep them around until the corruption
is fixed. Hence if we fail to flush an inode during a cache shake,
move the buffer to a special dirty MRU list that the cache does not
shake. This prevents memory pressure from seeing these buffers, but
allows subsequent cache lookups to still find them through the hash.
This ensures we don't waste huge amounts of CPU trying to flush and
reclaim buffers that canot be flushed or reclaimed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: don't repeatedly shake unwritable buffers

now that we try to write dirty buffers before we release them, we
can get buildup of unwritable dirty buffers on the LRU lists, This
results in the cache shaker repeatedly trying to write out these
buffers every time the cache fills up. This results in more
corruption warnings, and takes up a lot of time doing reclaiming
nothing. This can effectively livelock the processing parts of phase
4.

Fix this by not trying to write buffers with corruption errors on
them. These errors will get cleared when the buffer is re-read and
fixed and them marked dirty again. At which point, we'll be able to
write them and so the cache can reclaim them successfully.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: don't discard dirty buffers

When we release a buffer from the cache, if it is dirty we wite it
to disk then put the buffer on the free list for recycling. However,
if the write fails (e.g. verifier failure due unfixed corruption) we
effectively throw the buffer and it contents away. This causes all
sorts of problems for xfs_repair as it then re-reads the buffer from
disk on the next access and hence loses all the corrections that had
previously been made, resulting in tripping over corruptions in code
that assumes the corruptions have already been fixed/flagged in the
buffer it receives.

TO fix this, we have to make the cache aware that writes can fail,
and keep the buffer in cache when writes fail. Hence we have to add
an explicit error notification to the flush operation, and we need
to do that before we release the buffer to the free list. This also
means that we need to remove the writeback code from the release
mechanisms, instead replacing them with assertions that the buffers
are already clean.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: directory node splitting does not have an extra block

xfs_da3_split() has to handle all three versions of the
directory/attribute btree structure. The attr tree is v1, the dir
tre is v2 or v3. The main difference between the v1 and v2/3 trees
is the way tree nodes are split - in the v1 tree we can require a
double split to occur because the object to be inserted may be
larger than the space made by splitting a leaf. In this case we need
to do a double split - one to split the full leaf, then another to
allocate an empty leaf block in the correct location for the new
entry. This does not happen with dir (v2/v3) formats as the objects
being inserted are always guaranteed to fit into the new space in
the split blocks.

Indeed, for directories they *may* be an extra block on this buffer
pointer. However, it's guaranteed not to be a leaf block (i.e. a
directory data block) - the directory code only ever places hash
index or free space blocks in this pointer (as a cursor of
sorts), and so to use it as a directory data block will immediately
corrupt the directory.

The problem is that the code assumes that there may be extra blocks
that we need to link into the tree once we've split the root, but
this is not true for either dir or attr trees, because the extra
attr block is always consumed by the last node split before we split
the root. Hence the linking in an extra block is always wrong at the
root split level, and this manifests itself in repair as a directory
corruption in a repaired directory, leaving the directory rebuild
incomplete.

This is a dir v2 zero-day bug - it was in the initial dir v2 commit
that was made back in February 1998.

Fix this by ensuring the linking of the blocks after the root split
never tries to make use of the extra blocks that may be held in the
cursor. They are held there for other purposes and should never be
touched by the root splitting code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: parallelise uncertin inode processing in phase 3

This can take a long time when there are millions of uncertain inodes in badly
broken filesystems. THe processing is per-ag, the data structures are all
per-ag, and the load is mostly CPU time spent checking CRCs on each
uncertaini inode. Parallelising reduced the runtime of this phase on a badly
broken filesytem from ~30 minutes to under 5 miniutes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: parallelise phase 7

It operates on a single AG at a time, sequentially, doing inode updates. All the
data structures accessed and modified are per AG, as are the modification to
on-disk structures. Hence we can run this phase concurrently across multiple
AGs.

This is important for large, broken filesystem repairs, where there can be
millions of inodes that need link counts updated. Once such repair image takes
more than 45 minutes to run phase 7 as a single threaded operation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_mdrestore: correctly account bytes read

Progess indication comes in the form of a "X MB read" output. This
doesn't match up with the actual number of bytes read from the
metadump file because it only accounts header blocks in the file,
not actual metadata blocks that are restored, Hence the number
reported is usually much lower than the size of the metadump file,
hence it's impossible to use to gauge progress of the restore.

While there, fix the progress output so that it overwrites the
previous progress output line correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

metadump: bounds check btree block regions being zeroed

Arkadiusz Miskiewicz reported that metadump was crashing on one of
his corrupted filesystems, and the trace indicated that it was
zeroing unused regions in inode btree blocks when it failed. The
btree block had a corrupt nrecs field, which was resulting in an out
of bounds memset() occurring. Ensure that the region being
generated for zeroing is within bounds before executing the zeroing.

Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

metadump: clean up btree block region zeroing

Abstract out all the common operations in the btree block zeroing
so that we only have one copy of offset/len calculations and don't
require lots of casts for the pointer arithmetic.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_quota: fix up timer command help

The timer command help output suggests that either default, or a
specific ID, can be specified, but this is not the case.  Timers
are fs-wide for each quota type, not specific to the user.  So
remove the "-d|id|name" from the help output.  Also remove
"get" from the description, because it only sets these timers;
"state" returns their values.  Well actually just one of them,
but that's a different problem...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_quota: remove extra 30 seconds from time limit reporting

The addition of 30s makes little sense; it's only added to the
values reported via the userspace tool, and doesn't reflect the
actual values used in the kernel. Setting a time limit to
"00:30:00" and reporting back "00:30:30" makes little sense to me.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: Fix untranslatable strings

Don't substitute in "would" or "will" for %s here, it makes
it untranslatable. Just spell it out twice like every other
message in the function.

Reported-by: Jakub Bogusz <qboosh@pld-linux.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

mkfs.xfs.8: Clarify mkfs vs. mount block size limits.

The current explanation is a bit vague to the uninitiated;
clarify the text to make it obvious that while the mkfs.xfs
utility is able to create valid filesystems with block sizes
up to 65536, the Linux kernel can only mount filesystems with
page sized or smaller blocks.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

mkfs: factor finobt changes into min log size when formatting

Since the finobt affects the size of the log reservations, we need to
be able to include its effects in the calculation of the minimum log
size.

(Not really a problem now, but adding rmapbt will give this one some
bite.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_io: detect the '-R' option in getopt

Configure getopt to accept the -R argument to pwrite.
Accept -F and -B while we're at it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_io: print dedupe errors to stderr, not stdout

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: move struct xfs_attr_shortform to xfs_da_format.h

Move the shortform attr structure definition to the same place as the
other attribute structure definitions for consistency and also so that
xfs/122 verifies the structure size.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_quota: use Q_XGETNEXTQUOTA for report and dump

Rather than a loop over getpwnam() etc, use the Q_XGETNEXTQUOTA
command to iterate through all active quotas in quota
reports and quota file dump.

If Q_XGETNEXTQUOTA fails, go back to the old way.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_quota: make report_mount() & dump_file() take an "output id"

Allow report_mount() and dump_file() to take a *oid pointer,
an "output id" which will be filled in from the returned quota
information. Useful if the quotactl returns an ID for something
other than that which was passed in, i.e. GETNEXTQUOTA.

Also, when printing results, print the id which was actually
returned, not the id which was passed in.

Should be a no-op change at this point; the next patch which
wires in Q_XGETNEXTQUOTA will make use of this.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_quota: define Q_XGETNEXTQUOTA

This simply defines the Q_XGETNEXTQUOTA quotactl in xfsprogs.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

include: Minor whitespace cleanup

While investigating a quota issue, I fixed a few whitespaces.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_copy: use POSIX signal API

uClibc doesn't provide the sigrelse or sighold calls, so this patch replaces
them with POSIX signal API calls (sigprocmask(2), etc).

Refer to Gentoo Bug #477758:
https://bugs.gentoo.org/show_bug.cgi?id=477758

Signed-off-by: Joshua Kinard <kumba@gentoo.org>
Suggested-by: René Rhéaume <rene.rheaume@gmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

configure: remove aio.h check

Remove the configure check for aio.h, which is not provided by uClibc, and
which does not appear to be used by xfsprogs.

Refer to Gentoo Bug #477758:
https://bugs.gentoo.org/show_bug.cgi?id=477758

Signed-off-by: Joshua Kinard <kumba@gentoo.org>
Reported-by: El Goretto <el.goretto@free.fr>
Suggested-by: René Rhéaume <rene.rheaume@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: don't pass negative errnos to strerror()

The error negation work in 12b5319 tripped up a little bit
when we're reporting errors via strerror(). By negating
the error before passing it to strerror, we get i.e.

mkfs.xfs: pwrite64 failed: Unknown error -22

Keep the error positive, but return -error, just as we
do in the else clauses in these functions.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

include/linux.h: Include <stdio.h> for fprintf and stderr

Signed-off-by: Felix Janda <felix.janda@posteo.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

linux.h: Use off64_t instead of loff_t

These are equivalent on glibc, while musl does not know loff_t.

In the long run, it would be preferable to enable transparent LFS so
that off64_t could be replaced by off_t.

Signed-off-by: Felix Janda <felix.janda@posteo.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_fsr: Include <paths.h> for _PATH_MOUNTED

Fixes a compilation failure with musl libc

Signed-off-by: Felix Janda <felix.janda@posteo.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: refactor short btree block verification

Create xfs_btree_sblock_verify() to verify short-format btree blocks
(i.e. the per-AG btrees with 32-bit block pointers) instead of
open-coding them.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct

Because struct xfs_agfl is 36 bytes long and has a 64-bit integer
inside it, gcc will quietly round the structure size up to the nearest
64 bits -- in this case, 40 bytes. This results in the XFS_AGFL_SIZE
macro returning incorrect results for v5 filesystems on 64-bit
machines (118 items instead of 119). As a result, a 32-bit xfs_repair
will see garbage in AGFL item 119 and complain.

Therefore, tell gcc not to pad the structure so that the AGFL size
calculation is correct.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: use a convenience variable instead of open-coding the fork

Use a convenience variable instead of open-coding the inode fork.
This isn't really needed for now, but will become important when we
add the copy-on-write fork later.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: reorder xfs_bmap_add_free args

Move the mount & transaction arguments to the start of xfs_bmap_add_free,
like most API calls. The kernel version of rmap makes this change, so
porting it to xfsprogs will make maintenance easier.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: request inode buffers sized to fit one inode cluster

In get_agino_buf, grab inode buffers using the same size as the inode
processing code. Since the inode processing code uses that same
buffer size, this means that get_agino_buf can serve requests from the
cache instead of pointlessly dropping the cache entry and screaming
about it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfsprogs: Remove trailing blanks on various places

Clean up the code a little by removing blank characters on line ends.
Most of these (if not all) sits there for long time and it is probable
that without this explicit removal, they will stay in the code for much
longer.

Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_quota: allow name lookup when reporting from ID range

Add a new "-l" (lookup) argument to the "report" command, to
enable id->name lookups when doing a report across a lower/upper
ID range as specified by the -L / -U report options.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_quota: push id/name printing down into report_mount()

Push the printing of the id number and/or name during quota
reporting down into report_mount(), rather than doing it in
the caller.

This lets the next patch do a lookup of any ID found with
a quota, when upper/lower ID bounds are specified.  This patch,
however, changes no behavior.  Both the ID and the name are
passed in, and which one gets printed depends on the
presence of the NO_LOOKUP_FLAG.

If "NULL" is passed in (as in the upper/lower ID search case),
then as before, only the ID is printed.  Next patch changes this.

Drops a few lines of code as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_db: remove unnessary checks in process_leaf_node_dir_v2_free

xfs_dir2_free_hdr_t entries are unsigned, so they can't be negative.
remove these unnessary checks in process_leaf_node_dir_v2_free.
Reported by coverity.

Signed-off-by: Vivek Trivedi <t.vivek@samsung.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

mkfs: fix unintentional integer overflow

Fix unintentional integer overflow in mkfs.
reported by coverity.

Signed-off-by: Vivek Trivedi <t.vivek@samsung.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxlog: fix integer overflow in xlog_find_verify_cycle

Fix unintentional integer overflow in xlog_find_verify_cycle.
Reported by coverity.

Signed-off-by: Vivek Trivedi <t.vivek@samsung.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_io: fix a memory leak in imap_f

add NULL check for malloc return and free allocated memory in
return path in imap_f

[dchinner: changed to error stack]

Signed-off-by: Vivek Trivedi <t.vivek@samsung.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_io: allow zero-length reflink/dedupe commands

Since it's convention that calling the reflink/dedupe ioctls with
a zero length means "do this until you hit EOF in the source file",
explicitly permit this as an input. Fix the ioctl code to handle
this correctly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_db: make check work for sparse inodes

Teach the inobt/finobt scanning functions how to deal with sparse
inode chunks well enough that we can pass the spot-check. Should
fix the xfs/076 failures.

v2: Put the alignment checks back in, and for each individual chunk
of inodes hanging off an inobt record, parse each chunk separately.
There is no (future) guarantee that all the inodes in a sparse inobt
record will be contiguous.

Cc: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_io: document zero and help commands in manpage

zero & fzero are 2 different commands using 2 different interfaces
to do the same thing, but may as well document them both.

Also explicitly document "help" to be consistent, and to satisfy
the xfstest which checks for all commands being documented.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

mkfs: get sector size from host fs d_miniosz when mkfs'ing file

Now that we allow logical-sector-sized DIOs even if our xfs
filesystem is set to physical-sector-sized "sectors," we can
allow the creation of filesystem images with block and sector
sizes down to the host device's logical sector size, rather
than the filesystem's sector size.

So in platform_findsizes(), change our query of the filesystem
to an XFS_IOC_DIOINFO query, and use the returned d_miniosz for
sector size in the S_IFREG case.

This allows the creation of i.e. a 2k block sized image on
an xfs filesystem w/ 4k sector size created on a 4k/512
block device, which would otherwise fail today.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

db: add the logformat command

Add the new logformat command to format the log to a specified log cycle
and/or log stripe unit. This command is primarily for testing purposes
and is only available in expert mode.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: conditionalize log format record size optimization

The libxfs log clear mechanism includes an optimization to use the
maximum log buffer size (256k) when the log is being fully written
(i.e., cycle != 1). This behavior is always enabled as no current
callers are concerned with the size of the records written to format the
log to a given cycle.

A log format command will be added to xfs_db to facilitate testing of
the userspace log format code (among other things). This command is not
performance oriented and prefers the ability to format the log with
varying record sizes.

Update libxfs_log_clear() to use a new parameter to enable or disable
the record size optimization. Note that the flag is a no-op when the
cycle == XLOG_INIT_CYCLE (e.g., mkfs).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: format the log with valid log record headers

xfsprogs supports the ability to clear or reformat the log to an
arbitrary cycle number. This functionality is invoked in mkfs,
xfs_repair, etc. The code currently fills the log with unmount records
sized based on the specified log stripe unit alignment. When the stripe
unit is sufficiently larger, however, the resulting log records are
technically invalid. For example, the log buffer size (h_size) field is
always hardcoded to 32k even when the particular record length is
larger.

This has not traditionally been a problem because the kernel does not
fully process the log record when the log is clean. Recent changes to
unconditionally CRC verify the head of the log changes this behavior,
however, which means the log should be valid. Technically, the records
written from userspace are not written with a CRC, but this also helps
support unit testing by writing a log that can be verified with
logprint.

Update libxfs_log_header() to write a valid log record based on the
specified log stripe unit. h_size is updated to the log buffer size
based on the stripe unit and the expected number of extended record
headers are written out based on the log buffer size.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

logprint: use correct ext. header count for unaligned size

A log record has 1 record header and up to 7 additional extended record
headers depending on the size of the record. This is required for log
record packing. A header is required for each 32k of record data.

The header count for a particular record is fixed, based on the log
buffer size (h_size) specified in the record header. logprint calculates
the expected extended header count based on h_size, but does not account
for a log buffer size not aligned with 32k. This results in spurious
invalid header count errors for an otherwise valid log. This can be
reproduced by mounting a filesystem with '-o logbsize=16k' and running
xfs_logprint after a subsequent unmount.

Update xlog_print_extended_headers() to incorporate a non-32k aligned
log buffer size in the expected extended record header count
calculation. This is consistent with how the header count is calculated
in the kernel.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfsprogs: Release v4.3.0

Update all the release files for a 4.3.0 release.

Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: print XFS_WANT_CORRUPTED info with -vvv

In the kernel, the XFS_WANT_CORRUPTED_* macros print the
caller information via XFS_ERROR_REPORT, but in userspace
this is silenced, because i.e. xfs_repair's job is to find
and fix corruption, not to complain loudly about it.

However, there are times when we would like to know if
we've hit an unexpected corruption point. To that end,
make "xfs_repair -vvv" enable noisiness from these macros,
so that they will print the function and line of the caller.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: print name of verifier if it fails

This adds a name to each buf_ops structure, so that if
a verifier fails we can print the type of verifier that
failed it. Should be a slight debugging aid, I hope.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

fsr: more selinux fixes

Commit:

1adfe5c xfs_fsr: fix SWAPEXT failures under selinux

attempted to fix up the fork offset under selinux, where
the temp file is created with a local attribute, but the
target file has remote attributes; this can lead to a smaller
data area in the temp inode, without enough room to swap extents
from the target inode. I remedied this by pushing the temp
file attribute to remote, but *only* if the target file's attr
was also remote.

However, I have a case from the field where the parent dir
and the target file both have a context of:

system_u:object_r:samba_share_t:s0

but new files created in the dir have a context of

unconfined_u:object_r:samba_share_t:s0

This means the temp file has a smaller forkoff, and less space
in the inode for data, so we fail to swap the extents between
the two, because they don't fit.

The following patch fixes this by allowing xfs_fsr to
kick the tempfile's attr out of local format even if the target
file's attr is local, if this will move the forkoff in the right
direction. This does pass all our fsr xfstests, though I'm not
sure we have any real coverage of fsr under selinux...

The only functional change is the test at the very end of the
patch; the rest is comments, ascii art, and removing the
now-extraneous XFS_IOC_FSGETXATTRA ioctl.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

fsr: more tidy up work

Make the mountpoint stat structure "ms" local to
find_mountpoint_check(), no caller uses it. And remove 3rd "sb2"
statbuf in that same function, we can just re-use ms.

rename "struct mntent *mp" to "struct mntent *mnt" - "mp" has its
own common meaning in xfsprogs.

And a couple more minor whitespace removals.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

debian: update initramfs in postinst

All filesystem tools packages for filesystems
that are likely to be used for / and/or /usr should call
"update-initramfs -u" in their postinst.

Signed-off-by: Steve McIntyre <steve@einval.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

fsr: clean up mountpoint checks

Fix up a stray hunk of code from commit 7141fc5 ("xfsprogs: make fsr
use mntinfo when there is no mntent") that coverity reported. Also
clean up a couple of whitespace issues introduced with that commit,
too.

Addresses-Coverity-Id: 1338431
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfsprogs: Release v4.3.0-rc2

Update all the release files for a 4.3.0-rc2 release.

Signed-off-by: Dave Chinner <david@fromorbit.com>

mkfs.xfs.8: Spelling fix

Signed-off-by: Ville Skyttä <ville.skytta@iki.fi>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_db.8: Remove stray backslash near bmap

Signed-off-by: Ville Skyttä <ville.skytta@iki.fi>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfsprogs: make fsr use mntinfo when there is no mntent

For what fsr needs, mntinfo can be used instead of mntent on some
platforms. Exctract the platform-specific code to platform headers.

Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

io: update reflink/dedupe ioctl definitions

The committed version was a pre-review version of the ioctl
interface. Update it to match post-review
version.

[dchinner: I've just aplied the latest patches over the top of
what I originally committed, so the diff here isn't exactly what
Darrick originally sent. ]

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxcmd: provide a common function to report command runtimes

Refactor the open-coded runtime stats reporting into a library
command, then update xfs_io commands to use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

headers: remove definition of ASSERT from xfs.h

If we define ASSERT() in the installed xfs.h header file, programs
including this header file will have their local definitions of
ASSERT screwed up. This is something internal to the xfsprogs build,
so move it to platform_defs.h.in where it will no longer be public.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfsprogs: Release v4.3.0-rc1

Update all the release files for a 4.3.0-rc1 release.

Signed-off-by: Dave Chinner <david@fromorbit.com>

db: fix AGI ops definition in CRC type table

The wrong buffer ops structure was added to the AGI field of the
type table when initially committed. This was not noticed because it
only affects manually setting the type of a buffer from xfs_db. e.g

xfs_db> agi 0
xfs_db> p
.....
crc = 0xbc58d757 (correct)
.....
xfs_db> fsb 2
xfs_db> type agi
Metadata CRC error detected at block 0x10/0x1000
xfs_db>

This is because (trimmed for clarity):

Breakpoint 1, xfs_verifier_error:
(gdb) bt
#0  xfs_verifier_error
#1  xfs_agfl_read_verify
#2  set_iocur_type
#3  type_f
#4  main

It's clear that the wrong verifier is being run (AGFL, not AGI).
The fix is simple.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: fix left-shift overflows

pmask in struct parent_list is a __uint64_t, but in some places
we populated it with "1LL << shift" where shift could be up
to 63; this really needs to be a 1ULL type for this to be correct.

Also spotted by libubsan...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_metadump: Fix unaligned accesses

This fixes some unaligned accesses spotted by libubsan in
xfs_metadump.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_logprint: fix some unaligned accesses

This routine had a fair bit of gyration to avoid unaligned
accesses, but didn't fix them all.
Fix some more spotted at runtime by libubsan.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: fix unaligned accesses

This fixes some unaligned accesses spotted by libubsan in repair.

See Documentation/unaligned-memory-access.txt in the kernel
tree for why these can be a problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: avoid negative (and full-width) shifts in radix-tree.c

Pull this commit over from the kernel, as libubsan spotted a
bad shift at runtime:

    commit 430d275a399175c7c0673459738979287ec1fd22
    Author: Peter Lund <firefly@vax64.dk>
    Date:   Tue Oct 16 23:29:35 2007 -0700

    avoid negative (and full-width) shifts in radix-tree.c

    Negative shifts are not allowed in C (the result is undefined).  Same thing
    with full-width shifts.

    It works on most platforms but not on the VAX with gcc 4.0.1 (it results in an
    "operand reserved" fault).

    Shifting by more than the width of the value on the left is also not
    allowed.  I think the extra '>> 1' tacked on at the end in the original
    code was an attempt to work around that.  Getting rid of that is an extra
    feature of this patch.

    Here's the chapter and verse, taken from the final draft of the C99
    standard ("6.5.7 Bitwise shift operators", paragraph 3):

      "The integer promotions are performed on each of the operands. The
      type of the result is that of the promoted left operand. If the
      value of the right operand is negative or is greater than or equal
      to the width of the promoted left operand, the behavior is
      undefined."

    Thank you to Jan-Benedict Glaw, Christoph Hellwig, Maciej Rozycki, Pekka
    Enberg, Andreas Schwab, and Christoph Lameter for review.  Special thanks
    to Andreas for spotting that my fix only removed half the undefined
    behaviour.

Signed-off-by: Peter Lund <firefly@vax64.dk>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_copy: format v5 sb logs correctly

xfs_copy formats the target filesystem log in non-duplicate copy mode to
stamp the new fs UUID into the log. The current format mechanism resets
the current LSN of the target filesystem to cycle 1, which is invalid
for v5 superblocks. The current LSN of v5 superblocks must always move
forward and remain ahead of metadata LSNs stored in filesystem metadata.

Update the log format helper to detect and use an alternate format
mechanism for v5 superblock logs. Allocate an independent log format
buffer based on the size of the log and format the buffer with an
incremented cycle count using the libxfs log format mechanism.

Note that the new libxfs log format mechanism could be used for both v5
and older superblock formats. The new mechanism requires a new, full log
sized buffer allocation as well as doing I/O to the entire log whereas
the pre-v5 sb mechanism only writes to the log head and tail. This is
due to how xfs_copy uses its own internal buffer data structure rather
than libxfs buftarg structures. As such, keep the original mechanism
around to avoid potential disruption for non-v5 users. The old mechanism
can be removed at some point in the future when the new mechanism is
shaken out and v5 filesystems tend to outnumber v4.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_copy: refactor log format code into new helper

The xfs_copy log format code is mostly open-coded into the main()
function of the application. Support for v5 superblock log formatting
will require an alternate mechanism than that utilized for v4
superblocks and older.

As such, refactor the log formatting code into new helper functions. The
top-level helper iterates the copy target devices and another helper
implements the log format magic. This patch does not change existing
behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_copy: store data buf alignment in buf data structure

The write buffer data structure stores various characteristics of the
write buffer, such as I/O alignment requirements, etc., but does not
include the required data buffer alignment.

Data buffer alignment is a required buffer initialization parameter and
the v5 log format support code would like to initialize an independent
log buffer based on the predetermined alignment constraints encoded into
the global write buffer. Update the write buffer data structure to store
the provided data alignment value such that it can be accessed
throughout the codebase. This patch does not change existing behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_copy: genericize write helper to facilitate separate log buf

xfs_copy uses a fixed write buffer throughout the code for the various
targets. This is a global buffer and is assumed to be the write source
in the do_write() helper used by the log clearing code.

v5 superblock log formatting will require a larger, independent buffer
to format the log. Therefore, update do_write() to accept a write buffer
parameter to optionally override the global write buffer. This patch
does not change current behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_copy: check for dirty log on non-duplicate copies

xfs_copy either copies the log as-is or formats the log of the target(s)
based on whether duplicate copy mode is enabled. The target log is
formatted when non-duplicate mode is enabled because copies gain a new
fs UUID and the new UUID must be stamped into the log.

When non-duplicate mode is enabled, however, xfs_copy does not check
whether the source filesystem log is actually clean. If the source log
is dirty, the target filesystem is silently created with a clean log and
thus ends up in a potentially corrupted state.

Update xfs_copy to check the source log state and fail the copy if in
non-duplicate mode and the log is dirty. Encourage the user to mount the
filesystem or run xfs_repair to clear the log. Note that the log is
scanned unconditionally as opposed to only in non-duplicate mode because
v5 superblock log format support requires the current cycle number to
format the log correctly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

db/metadump: bump lsn when log is cleared on v5 supers

xfs_metadump handles the log in different ways depending on the mode of
operation. If the log is dirty or obfuscation and stale data zeroing are
disabled, the log is copied as is. In all other scenarios, the log is
explicitly zeroed. This is incorrect for version 5 superblocks where the
current LSN is always expected to be ahead of all fs metadata.

Update metadump to use libxfs_log_clear() to format the log with an
elevated LSN rather than zero the log and reset the current the LSN.
Metadump does not use buffers for the dump target, instead using a
cursor implementation to access the log via a single memory buffer.
Therefore, update libxfs_log_clear() to receive an optional (but
exclusive to the buftarg parameter) memory buffer pointer for the log.
If the pointer is provided, the log format is written out to this
buffer. Otherwise, fall back to the original behavior and access the log
through buftarg buffers.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_db: do not reset current lsn from uuid command on v5 supers

The xfs_db uuid command modifes the uuid of the filesystem. As part of
this process, it is required to clear the log and format it with records
that use the new uuid. It currently resets the log to cycle 1, which is
not safe for v5 superblocks.

Update the uuid log clearing implementation to bump the current cycle
when the log is formatted on v5 superblock filesystems. This ensures
that the current LSN remains ahead of metadata LSNs on the subsequent
mount. Since the log is checked and cleared across a couple different
functions, also add a new global xlog structure, associate it with the
preexisting global mount structure and reference it to get and use the
current log cycle.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: seed the max lsn from log state in phase 2

At this point, the log state is always available once repair has
progressed through phase 2 and the log is only ever zeroed when
absolutely necessary. This means that in the common case, repair runs
with the log in a non-initialized state. The libxfs max metadata LSN
tracking initializes the max LSN to zero, however, which will require
updates throughout the repair process even if all metadata LSNs are
behind the current LSN.

Since all metadata LSNs that are behind the current LSN are valid, seed
the libxfs maximum seen LSN value with the log state from phase 2. This
is a minor optimization to minimize global variable updates in the
common case where all (or most) metadata LSNs are valid.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: don't clear the log by default

xfs_repair currently clears the log regardless of whether it is
corrupted, clean or contains data. This is traditionally harmless but
now causes log recovery problems on v5 filesystems. v5 filesystems
expect a clean log to always have an LSN out ahead of the maximum last
modification LSN stamped on any bit of metadata throughout the fs. If
this is not the case, repair must reformat the log with a larger cycle
number after fs processing is complete.

Given that unconditional log clearing actually introduces a filesystem
inconsistency on v5 superblocks (that repair must subsequently recover
from) and provides no tangible benefit for v4 filesystems that otherwise
have a clean and covered log, it is more appropriate behavior to not
clear the log by default.

Update xfs_repair to always and only clear the log when the -L parameter
is specified. Retain the existing logic to require -L or otherwise exit
if the log appears to contain data. Adopt similar behavior if the log
appears to be corrupted.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: format the log with forward cycle number on v5 supers

v5 filesystems use the current LSN and the last modified LSN stored
within fs metadata to provide correct log recovery ordering. xfs_repair
historically clears the log in phase 2. This resets to the current LSN
of the filesystem to the initial cycle, as if the fs was just created.

This is problematic because the filesystem LSN is now behind many
pre-existing metadata structures on-disk until either the current
filesystem LSN catches up or those particular data structures are
modified and written out. If a filesystem crash occurs in the meantime,
log recovery can incorrectly skip log items and cause filesystem
corruption.

Update xfs_repair to check the maximum metadata LSN value against the
current log state once the filesystem has been processed. If the maximum
LSN exceeds the current LSN with respect to the log, reformat the log
with a cycle number that exceeds that of the maximum LSN.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: process the log in no_modify mode

xfs_repair does not zero the log in no_modify mode. In doing so, it also
skips the function that scans the log, locates the head/tail blocks and
sets the current LSN. Now that the log state is used beyond phase 2, the
log scan must occur regardless of whether no_modify mode is enabled or
not.

Update phase 2 to always execute the log scanning code. Push down the
no_modify checks into the log clearing helper such that the log is still
not modified in no_modify mode.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: track log state throughout all recovery phases

xfs_repair examines and clears the log in phase 2. Phase 2 acquires the
log state in local data structures that are lost upon phase exit. v5
filesystems require that the log is formatted with a higher cycle number
after the fs is repaired. This requires assessment of the log state to
determine whether a reformat is necessary.

Rather than duplicate the log processing code, update phase 2 to
populate a globally available log data structure. Add a log pointer to
xfs_mount, as exists in kernel space, that repair uses to store a
reference to the log that is available to various phases. Note that this
patch simply plumbs through the global log data structure and does not
change behavior in any way.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxlog: pull struct xlog out of xlog_is_dirty()

The xlog_is_dirty() helper is called in various places to acquire the
current log head, tail and to determine whether the log is dirty. Some
callers will require additional information to deal with formatting the
log, such as the current LSN. xlog_is_dirty() already acquires this
information through existing sub-helpers, but it is not available to
callers as the xlog structure is allocated on the local stack.

Update xlog_is_dirty() to receive the xlog structure as a parameter and
pass it along such that additional information about the log is
available to callers. Update the existing callers to allocate the xlog
structure on the stack.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: add ability to clear log to arbitrary log cycle

The libxfs_log_clear() helper currently zeroes the log and writes a
single log record such that the kernel code detects the log has been
zeroed and mounts successfully. This is not sufficient for v5
filesystems, which must have the log cleared to an LSN that is
guaranteed to be ahead of any LSN that has been previously stamped into
on-disk metadata.

Update libxfs_log_clear() to support the ability to format the log to an
arbitrary cycle number. First, the log is physically zeroed. A log
record is written to the first block of the log with the desired lsn and
refers to the tail_lsn as the last record of the previous cycle. The
rest of the log is filled with log records of the previous cycle. This
causes the kernel to set the current LSN to start of the desired cycle
number at mount time.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: pass lsn param to log clear and record header logging helpers

In preparation to support the ability to format the log with an
arbitrary cycle number, the log clear and record logging helpers must be
updated to receive the desired cycle and LSN values as parameters.

Update libxfs_log_clear() to receive the desired cycle number to format
the log with. Define a preprocessor directive to represent the currently
hardcoded case of cycle 1. Update libxfs_log_header() to receive the lsn
and tail_lsn of the record to write. Use a NULL value LSN to represent
the currently hardcoded behavior.

All callers are updated to use the current default values. As such, this
patch does not change behavior in any way.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>