git.ipfire.org Git - thirdparty/xfsprogs-dev.git/log
7 years agoxfs_quota: only round up timer reporting > 1 day
Eric Sandeen [Fri, 3 Jun 2016 01:04:15 +0000 (11:04 +1000)] 
xfs_quota: only round up timer reporting > 1 day

I was too hasty with:

d1fe6ff xfs_quota: remove extra 30 seconds from time limit reporting

The point of that extra 30s, it turns out, was to allow the user
to set a limit, query it, and get back what they just set, if
it is set to more than a day.

Without it, if we set a grace period to, say, 3 days and query it
1 second later, the rounding in the time_to_string function returns
"2 days", not "3 days" as it did before, because we are at
2 days 23:59:59 and it essentially applies a floor() for
brevity.  I guess this was confusing.

(I've run into this same conundrum on my stove digital timer;
if you set it to 10m, it blinks "10" at you twice so that you
know what you set, then quickly flips to 9 as it counts down).

In some cases, however (and this is the case that prompted the
prior patch), we display a full "XYZ days hh:mm:ss" - we do this
if the verbose flag is set, or if the timer is less than one day.
In these cases, we should not add the 30s, because we are showing
full time resolution to the user.
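
As a rough sketch of the intended behaviour (the function and names here
are illustrative, not the actual xfs_quota code):

  #include <time.h>

  #define SECONDS_PER_DAY (24 * 60 * 60)

  /* Only fudge the remaining time up by 30s when it will be reported
   * at whole-day granularity; full hh:mm:ss output stays exact. */
  static time_t round_for_display(time_t remaining, int verbose)
  {
          if (!verbose && remaining > SECONDS_PER_DAY)
                  remaining += 30;
          return remaining;
  }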

Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
7 years agoxfs_repair: further improvement on secondary superblock search method
Bill O'Donnell [Fri, 3 Jun 2016 01:04:03 +0000 (11:04 +1000)] 
xfs_repair: further improvement on secondary superblock search method

This patch is a further optimization of the secondary sb search, in
order to handle non-default geometries. Once again, use a method
similar to that of xfs_mkfs to find the fs geometry. Refactor
verify_sb(), creating a new sub-function that checks the sanity of
agblocks and agcount: verify_sb_blocksize().

If verify_sb_blocksize verifies sane parameters, use the found values
for the sb search. Otherwise, try the search with default values. If
both of these faster methods fail, fall back to the original, slower
brute force search.
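
For illustration, the kind of sanity check involved might look like the
following (simplified and hypothetical, not the real verify_sb_blocksize()
code):

  #include <stdbool.h>
  #include <stdint.h>

  static bool geometry_looks_sane(uint32_t blocksize, uint32_t agblocks,
                                  uint32_t agcount, uint64_t dblocks)
  {
          /* block size must be a supported power of two */
          if (blocksize < 512 || blocksize > 65536 ||
              (blocksize & (blocksize - 1)) != 0)
                  return false;
          if (agcount == 0 || agblocks == 0)
                  return false;
          /* all AGs except possibly the last are agblocks long */
          return (uint64_t)(agcount - 1) * agblocks < dblocks &&
                 (uint64_t)agcount * agblocks >= dblocks;
  }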

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
7 years agomkfs.xfs: annotate fallthrough cases in cvtnum
Eric Sandeen [Mon, 30 May 2016 02:21:31 +0000 (12:21 +1000)] 
mkfs.xfs: annotate fallthrough cases in cvtnum

We should really collapse our 3 cvtnum variants,
but for now at least shut up Coverity about this
intentional case fallthrough.
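
The usual way to keep Coverity quiet here is an explicit fall-through
comment on each intentional case; a minimal sketch (not the actual
cvtnum() code):

  static long long suffix_multiplier(char c)
  {
          long long mult = 1;

          switch (c) {
          case 'g':
                  mult *= 1024;
                  /* fall through */
          case 'm':
                  mult *= 1024;
                  /* fall through */
          case 'k':
                  mult *= 1024;
                  break;
          }
          return mult;
  }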

Addresses-Coverity-ID: 1361553
Addresses-Coverity-ID: 1361554
Addresses-Coverity-ID: 1361555
Addresses-Coverity-ID: 1361556
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
7 years agoxfs_quota: check report_mount return value
Eric Sandeen [Mon, 30 May 2016 02:21:31 +0000 (12:21 +1000)] 
xfs_quota: check report_mount return value

The new call to report_mount doesn't check the return value
like every other caller does...

Returning 1 means it printed something; if the terse flag
is used and there is no usage, nothing gets printed.
If we set the NO_HEADER_FLAG anyway, then we won't see
the header for subsequent entries as we expect.

For example, project ID 0 has no usage in this case:

# xfs_quota -x -c "report -a" /mnt/test
Project quota on /mnt/test (/dev/sdb1)
                               Blocks
Project ID       Used       Soft       Hard    Warn/Grace
---------- --------------------------------------------------
#0                  0          0          0     00 [--------]
project          2048          4          4     00 [--none--]

So using the terse flag results in no header when it prints
projects with usage:

# xfs_quota -x -c "report -t -a" /mnt/test
project          2048          4          4     00 [--none--]

With this fix it prints the header as expected:

# xfs_quota -x -c "report -t -a" /mnt/test
Project quota on /mnt/test (/dev/sdb1)
                               Blocks
Project ID       Used       Soft       Hard    Warn/Grace
---------- --------------------------------------------------
project          2048          4          4     00 [--none--]

Addresses-Coverity-Id: 1361552
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
7 years agoxfs_repair: new secondary superblock search method
Bill O'Donnell [Mon, 30 May 2016 02:21:26 +0000 (12:21 +1000)] 
xfs_repair: new secondary superblock search method

Optimize the secondary sb search, using a method similar to that of
xfs_mkfs to find the fs geometry. If this faster method fails to find
a secondary sb, fall back to the original, slower brute force search.

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
7 years agolibxcmd: generalize topology functions
Bill O'Donnell [Mon, 30 May 2016 02:14:18 +0000 (12:14 +1000)] 
libxcmd: generalize topology functions

Move general topology functions from xfs_mkfs to new topology
collection in libxcmd.

[dchinner: fix library dependencies and add them to the debian
   package build script.]

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
7 years agoxfs_db: defang frag command
Eric Sandeen [Mon, 30 May 2016 00:35:56 +0000 (10:35 +1000)] 
xfs_db: defang frag command

Too many people freak out about this fictitious "fragmentation
factor."  In fact, it is largely meaningless, because the number
approaches 100% extremely quickly for just a few extents per file.

I thought about removing it altogether, but perhaps a note about its
uselessness, and a more soothing metric (average extents per file),
might be useful.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
7 years agodb: limit AGFL bno array printing
Dave Chinner [Mon, 30 May 2016 00:33:16 +0000 (10:33 +1000)] 
db: limit AGFL bno array printing

When asking for a single agfl entry such as:

# xfs_db -c "agfl 0" -c "p bno[1]" /dev/ram0
bno[1] = 1:6 2:7 3:8 4:null .....

The result should be just the single entry being asked for.
Currently this outputs the entire remainder of the array starting at
the given index. This makes it difficult to extract single entry
values.

This occurs because the printing of a flat array of number types
does not take into account the range that is specified on the
command line, which is held in fl->low and fl->high. To make this
work for flat arrays of number types (print function fp_num), change
print_flist() to limit the count of values to be emitted to the
range specified. This now gives:

# xfs_db -c "agfl 0" -c "p bno[1-2]" /dev/ram0
bno[1-2] = 1:6 2:7

To further simplify external parsing of single entry values, if only
a single value is requested from the array of fp_num type, don't
print the array index - it's already known. Hence:

# xfs_db -c "agfl 0" -c "p bno[1]" /dev/ram0
bno[1] = 6

This change will take effect on all types of flat number arrays that
are printed. e.g. the range limiting will work for things like the
AGI unlinked list arrays.
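
The shape of the change can be illustrated with a small standalone
function (not the real print_flist() code) that clamps output to the
requested index range and drops the index for a single-entry request:

  #include <stdio.h>

  static void print_bno_range(const unsigned int *bno, int array_len,
                              int low, int high)
  {
          int i;

          if (low == high && low < array_len) {
                  printf("%u\n", bno[low]);   /* single entry: no index */
                  return;
          }
          for (i = low; i <= high && i < array_len; i++)
                  printf("%s%d:%u", i == low ? "" : " ", i, bno[i]);
          putchar('\n');
  }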

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
7 years agoxfs_db: allow recalculating CRCs on invalid metadata
Dave Chinner [Mon, 30 May 2016 00:32:41 +0000 (10:32 +1000)] 
xfs_db: allow recalculating CRCs on invalid metadata

Currently we can't write corrupt structures with valid CRCs on v5
filesystems via xfs_db. To emulate certain types of corruption
resulting from software bugs in the kernel code, we need this
capability to set up the corrupted state, i.e. corrupt state with a
valid CRC needs to appear on disk.

This requires us to avoid running the verifier that would otherwise
prevent writing corrupt state to disk. To enable this, add the CRC
offset to the type table for different buffers and add a new flag to
the write command to trigger running a CRC calculation based on this
type table. We can then insert the calculated value into the correct
location in the buffer...

Because some objects are not directly buffer based, we can't easily
do this CRC trick. Those object types will be marked as
TYP_NO_CRC_OFF, and as a result will emit an error such as:

# xfs_db -x -c "inode 96" -c "write -d magic 0x4949" /dev/ram0
Cannot recalculate CRCs on this type of object
#

All v4 superblock types are configured this way, as are inodes,
dquots and other v5 metadata types that either don't have CRCs or
don't have a fixed offset into a buffer to store their CRC.
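
Conceptually the new write path does something like the following
(a sketch only; the structure field and the checksum helper are
placeholders passed in as a parameter, not real libxfs APIs):

  #include <stdint.h>
  #include <string.h>

  /* Recompute and store the CRC at crc_off within the buffer.  The
   * checksum function is supplied by the caller; it stands in for the
   * real libxfs helper and is not an actual API. */
  static void rewrite_buf_crc(uint8_t *buf, size_t len, size_t crc_off,
                              uint32_t (*cksum)(const uint8_t *, size_t))
  {
          uint32_t crc;

          memset(buf + crc_off, 0, sizeof(crc));
          crc = cksum(buf, len);
          memcpy(buf + crc_off, &crc, sizeof(crc));
  }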

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_db: fix unaligned accesses
Eric Sandeen [Tue, 10 May 2016 07:16:07 +0000 (17:16 +1000)] 
xfs_db: fix unaligned accesses

Fix 2 unaligned accesses in xfs_db which caused bus errors on
sparc64.  Similar treatment was already done in xfs_repair and
xfs_metadump but somehow xfs_db got missed.

Thanks to Anatoly for reminding me that unaligned access is
a thing.  ;)

Resolves-oss-bugzilla: #1140
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agometadump: limit permissible sector sizes
Eric Sandeen [Tue, 10 May 2016 07:16:07 +0000 (17:16 +1000)] 
metadump: limit permissible sector sizes

A metadump is composed of many metablocks, which have the format:

[header|indices][ ... disk sectors ... ]

where "disk sectors" are BBSIZE (512) blocks, and the (indices)
indicate where those disk sectors should land in the restored
image.

The header+indices fit within a single BBSIZE sector, and as such
the number of indices is limited to:

 num_indices = (BBSIZE - sizeof(xfs_metablock_t)) / sizeof(__be64);

In practice, this works out to 63 indices; sadly 64 are required
to store a 32k metadata chunk, if the filesystem was created with
XFS_MAX_SECTORSIZE.  This leads to more sadness later on, as we
index past the end of arrays, etc.
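
For illustration, with BBSIZE = 512 and an 8-byte header (the header
size is assumed here): num_indices = (512 - 8) / 8 = 63, while a 32k
metadata chunk needs 32768 / 512 = 64 index slots - one more than fits.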

For now, just refuse to create a metadump from a 32k sector
filesystem; that's largely just theoretical at this point anyway.

Also check this on mdrestore, and check the lower bound as well;
the AFL fuzzer showed that interesting things happen when the
metadump image claims to contain a sector size of 0.

Oh, and spell "indices" correctly while we're at it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: conflicting values with disabled crc should fail
Jan Tulak [Tue, 10 May 2016 07:16:07 +0000 (17:16 +1000)] 
mkfs: conflicting values with disabled crc should fail

If crc=0, then finobt=1 and spinodes=1 should both fail instead of merely producing a warning.

[dchinner: whitespace fixes. ]

Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: add optional 'reason' for illegal_option
Jan Tulak [Tue, 10 May 2016 07:16:07 +0000 (17:16 +1000)] 
mkfs: add optional 'reason' for illegal_option

Allow us to tell the user what exactly is wrong with the specified
options.  For example, that the value is too small, instead of just
a generic "bad option" message.

[dchinner: fix whitespace, simplify illegal option message ]

Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: unit conversions are case insensitive
Jan Tulak [Tue, 10 May 2016 07:16:07 +0000 (17:16 +1000)] 
mkfs: unit conversions are case insensitive

Solves the question "Should I use 10g or 10G?"

[dchinner: rewrite to convert *sp to lower only once we know it
 is not null and the only character remaining in the number string. ]
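
A minimal sketch of the idea (not the actual cvtnum() code):

  #include <ctype.h>

  static long long apply_unit(long long val, char *sp)
  {
          /* lower-case the suffix only when it is the single trailing char */
          if (*sp != '\0' && *(sp + 1) == '\0')
                  *sp = (char)tolower((unsigned char)*sp);

          switch (*sp) {
          case 'k': return val << 10;
          case 'm': return val << 20;
          case 'g': return val << 30;
          default:  return val;
          }
  }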

Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: move spinodes crc check
Jan Tulak [Tue, 10 May 2016 07:16:07 +0000 (17:16 +1000)] 
mkfs: move spinodes crc check

The spinodes (sparse inodes) CRC check is now handled in the same way as the finobt check.

Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: don't treat files as though they are block devices
Dave Chinner [Tue, 10 May 2016 07:16:07 +0000 (17:16 +1000)] 
mkfs: don't treat files as though they are block devices

If the device is actually a file, and "-d file" is not specified,
mkfs will try to treat it as a block device and get stuff wrong.
Image files don't necessarily have the same sector sizes as the
block device or filesystem underlying the image file, nor should we
be issuing discard ioctls on image files.

To fix this sanely, only require "-d file" if the device name is
invalid to trigger creation of the file. Otherwise, use stat() to
determine if the device is a file or block device and deal with that
appropriately by setting the "isfile" variables and turning off
direct IO. Then ensure that we check the "isfile" options before
doing things that are specific to block devices.

Other file/blockdev issues fixed:
- use getstr to detect specifying the data device name
  twice.
- check file/size/name parameters before anything else.
- overwrite checks need to be done before the image file is
  opened and potentially truncated.
- blkid_get_topology() should not be called for image files,
  so warn when it is called that way.
- zero_old_xfs_structures() emits a spurious error:
"existing superblock read failed: Success"
  when it is run on a truncated image file. Don't warn if we
  see this problem on an image file.
- Don't issue discards on image files.
- Use fsync() for image files, not BLKFLSBUF in
  platform_flush_device() for Linux.

[ Eric Sandeen <sandeen@redhat.com>: move check for no mkfs target,
other minor fixes. ]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: add string options to generic parsing
Dave Chinner [Tue, 10 May 2016 07:16:07 +0000 (17:16 +1000)] 
mkfs: add string options to generic parsing

So that string options are correctly detected for conflicts and
respecification, add a getstr() function that modifies the option
tables appropriately.

[ Eric Sandeen <sandeen@redaht.com>: whitespace fixup ]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: encode conflicts into parsing table
Dave Chinner [Tue, 10 May 2016 07:16:07 +0000 (17:16 +1000)] 
mkfs: encode conflicts into parsing table

Many options conflict, so we need to specify which options conflict
with each other in a generic manner. We already have a "seen"
variable used for respecification detection, so we can also use this
code for conflict detection. Hence add a "conflicts" array to the sub
options parameter definition.
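
The rough shape of such a table entry and its conflict check (field and
function names are illustrative, not the actual mkfs.xfs structures):

  #include <stdbool.h>

  struct subopt {
          const char  *name;
          bool        seen;          /* respecification detection */
          int         conflicts[4];  /* indices of conflicting subopts, -1 terminated */
  };

  static int check_and_mark_seen(struct subopt *opts, int idx)
  {
          int i;

          for (i = 0; i < 4 && opts[idx].conflicts[i] >= 0; i++)
                  if (opts[opts[idx].conflicts[i]].seen)
                          return -1;   /* conflicting option already specified */
          if (opts[idx].seen)
                  return -1;           /* respecified */
          opts[idx].seen = true;
          return 0;
  }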

[ Eric Sandeen <sandeen@redhat.com>: remove explicit L_FILE conflict ]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: merge getnum
Dave Chinner [Tue, 10 May 2016 07:16:07 +0000 (17:16 +1000)] 
mkfs: merge getnum

getnum() is now only called by getnum_checked(). Move the two
together into a single getnum() function and change all the callers
back to getnum().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: table based parsing for converted parameters
Dave Chinner [Tue, 10 May 2016 07:16:07 +0000 (17:16 +1000)] 
mkfs: table based parsing for converted parameters

All the parameters that can be passed as block or sector sizes need
to be passed the block and sector sizes that they should be using
for conversion. For parameter parsing, it is always the same two
variables, so to make things easy just declare them as global
variables so we can avoid needing to pass them to getnum_checked().

We also need to mark these parameters as requiring conversion so
that we don't need to pass this information manually to
getnum_checked(). Further, some of these options are required to
have a power of 2 value, so add optional checking for that as well.
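
The power-of-two check itself is the standard bit trick, for example:

  #include <stdbool.h>

  static bool ispow2(unsigned long long v)
  {
          return v != 0 && (v & (v - 1)) == 0;
  }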

[ Eric Sandeen <sandeen@redhat.com>: fix min/max for "-l su" ]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: add respecification detection to generic parsing
Dave Chinner [Tue, 10 May 2016 07:16:07 +0000 (17:16 +1000)] 
mkfs: add respecification detection to generic parsing

Add flags to the generic input parameter tables so that
respecification can be detected in a generic manner.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: use getnum_checked for all ranged parameters
Dave Chinner [Tue, 10 May 2016 07:16:07 +0000 (17:16 +1000)] 
mkfs: use getnum_checked for all ranged parameters

Now that getnum_checked can handle min/max checking, use this for
all parameters that take straight numbers and don't require unit
conversions.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: getbool is redundant
Dave Chinner [Tue, 10 May 2016 07:16:07 +0000 (17:16 +1000)] 
mkfs: getbool is redundant

getbool() can be replaced with getnum_checked with appropriate
min/max values set for the boolean variables. Make boolean
arguments consistent - all accept 0 or 1 value now.

[ Eric Sandeen <sandeen@redhat.com>: manpage tidiness ]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: structify input parameter passing
Dave Chinner [Tue, 10 May 2016 07:16:06 +0000 (17:16 +1000)] 
mkfs: structify input parameter passing

Passing a large number of parameters around to number conversion
functions is painful. Add a structure to encapsulate the constant
parameters that are passed, and convert getnum_checked to use it.

This is the first real step towards a table driven option parser.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: validate logarithmic parameters sanely
Dave Chinner [Tue, 10 May 2016 07:16:06 +0000 (17:16 +1000)] 
mkfs: validate logarithmic parameters sanely

Testing logarithmic parameters like "-n log=<num>" shows that we do a
terrible job of validating such input. e.g.:

.....
naming   =version 2              bsize=65536  ascii-ci=0 ftype=0
....

Yeah, I just asked for a block size of 2^456858480, and it didn't
get rejected. Great, isn't it?

So, factor out the parsing of logarithmic parameters, and pass in
the maximum valid value that they can take. These maximum values
might not be completely accurate (e.g. block/sector sizes will
affect the eventual valid maximum) but we can get rid of all the
overflows and stupidities before we get to fine-grained validity
checking later in mkfs once things like block and sector sizes have
been finalised.
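
A sketch of the kind of check this adds (names are hypothetical, not
the mkfs.xfs code):

  #include <stdio.h>
  #include <stdlib.h>

  static int getnum_log2(long long log_val, int max_log)
  {
          /* reject anything that would overflow the shift or the field */
          if (log_val < 0 || log_val > max_log) {
                  fprintf(stderr, "log value %lld out of range 0-%d\n",
                          log_val, max_log);
                  exit(1);
          }
          return 1 << log_val;
  }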

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: factor boolean option parsing
Dave Chinner [Tue, 10 May 2016 07:16:06 +0000 (17:16 +1000)] 
mkfs: factor boolean option parsing

Many of the options passed to mkfs take boolean values (0 or 1), and
all hand-roll the same code and validity checks. Factor these out
into a common function.

Note that the lazy-count option is now changed to match other
booleans in that if you don't specify a value, it reverts to the
default value (on) rather than throwing an error.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: validate all input values
Dave Chinner [Tue, 10 May 2016 07:16:06 +0000 (17:16 +1000)] 
mkfs: validate all input values

Right now, mkfs does a poor job of input validation of values. Many
parameters do not check for trailing garbage and so will pass
obviously invalid values as OK. Some don't even detect completely
invalid values, leaving it for other checks later on to fail due to
a bad value conversion - these tend to rely on atoi() implicitly
returning a sane value when it is passed garbage, and atoi gives no
guarantee of the return value when passed garbage.

Clean all this up by passing all strings that need to be converted
into values into a common function that is called regardless of
whether unit conversion is needed or not. Further, make sure every
conversion is checked for a valid result, and abort the moment an
invalid value is detected.

Get rid of the silly "isdigits(), cvtnum()" calls which don't use
any of the conversion capabilities of cvtnum() because we've already
ensured that there are no conversion units in the string via the
isdigits() call. These can simply be replaced by a standard
strtoll() call followed by checking for no trailing bytes.
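
That replacement pattern looks roughly like this (a generic sketch, not
the mkfs code):

  #include <errno.h>
  #include <stdlib.h>

  static long long parse_plain_number(const char *str)
  {
          char *endp;
          long long val;

          errno = 0;
          val = strtoll(str, &endp, 10);
          /* reject empty input, trailing garbage and out-of-range values */
          if (errno || endp == str || *endp != '\0')
                  return -1LL;
          return val;
  }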

Finally, the block size of the filesystem is not known until all
the options have been parsed and we can determine if the default is
to be used. This means any parameter that relies on using conversion
from filesystem block size (the "NNNb" format) requires the block
size to first be specified on the command line so it is known.

Similarly, we make the same rule for specifying counts in sectors.
This is a change from the existing behaviour that assumes sectors
are 512 bytes unless otherwise changed on the command line. This,
unfortunately, leads to complete silliness where you can specify the
sector size as a count of sectors. It also means that you can do
some conversions with 512 byte sector sizes, and others with
whatever was specified on the command line, meaning the mkfs
behaviour changes depending on where in the command line the sector
size is changed....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: Sanitise the superblock feature macros
Dave Chinner [Tue, 10 May 2016 07:16:06 +0000 (17:16 +1000)] 
mkfs: Sanitise the superblock feature macros

They are horrible macros that simply obfuscate the code, so
let's factor the code and make them nice functions.

To do this, add a sb_feat_args structure that carries around the
variables rather than a strange assortment of variables. This means
all the defaults can be clearly defined in a structure
initialisation, and dependent feature bits are easy to check.

[ Eric Sandeen <sandeen@redhat.com>: Properly initialize
sb_feat.dirftype, remove unnecessary continuation chars. ]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: sanitise ftype parameter values.
Dave Chinner [Tue, 10 May 2016 07:16:06 +0000 (17:16 +1000)] 
mkfs: sanitise ftype parameter values.

Because passing "-n ftype=2" should fail.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfsprogs: use common code for multi-disk detection
Dave Chinner [Tue, 10 May 2016 07:16:06 +0000 (17:16 +1000)] 
xfsprogs: use common code for multi-disk detection

Both xfs_repair and mkfs.xfs need to agree on what a "multidisk"
configuration is - mkfs for determining the AG count of the filesystem,
repair for determining how to automatically parallelise its
execution. This requires a bunch of common defines that both mkfs
and repair need to share.

In fact, most of the defines in xfs_mkfs.h could be shared with
other programs (i.e. all the defaults mkfs uses) and so it is
simplest to move xfs_mkfs.h to the shared include directory and add
the new defines to it directly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_repair: fix agf limit error messages
Eric Sandeen [Tue, 10 May 2016 07:16:06 +0000 (17:16 +1000)] 
xfs_repair: fix agf limit error messages

Today we see errors like:

"fllast 118 in agf 94 too large (max = 118)"

which makes no sense.

If we are erroring on X >= Y, Y is clearly not the maximum allowable
value.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agobuild: make librt optional for some platforms
Jan Tulak [Tue, 10 May 2016 07:16:06 +0000 (17:16 +1000)] 
build: make librt optional for some platforms

The functions that we use from librt on Linux are in other libraries
on OS X. So make it possible to disable librt.

The hardcoded librt dependency was added in 0caa2bae.

Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoconfigure: add mailing list contact
Mike Frysinger [Tue, 10 May 2016 07:16:06 +0000 (17:16 +1000)] 
configure: add mailing list contact

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agopo: respect LINGUAS build setting
Mike Frysinger [Tue, 10 May 2016 07:16:06 +0000 (17:16 +1000)] 
po: respect LINGUAS build setting

It is common gettext practice to limit the translations a particular
package will include by setting the LINGUAS environment variable.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_quota: print quota id number if the name can't be found
Zorro Lang [Tue, 10 May 2016 07:16:06 +0000 (17:16 +1000)] 
xfs_quota: print quota id number if the name can't be found

When using the GETNEXTQUOTA ioctl to report project quota, it always
reports an unexpected quota:

  (null) 0 0 0 00 [--------]

ID 0 stores the default quota; even if no one has set a default quota,
it still has quota accounting, just not enforcement. So GETNEXTQUOTA
can find and report this undefined quota.

From this problem, I realised that if any other quota's owner name is
missing, (null) will be printed too, e.g.:

  # xfs_quota -xc "limit -u bsoft=300m bhard=400m test" $mnt
  # xfs_quota -xc "report -u" $mnt
  User ID          Used       Soft       Hard    Warn/Grace
  ---------- --------------------------------------------------
  root                0          0          0     00 [--------]
  test                0     307200     409600     00 [--------]
  # userdel -r test
  # xfs_quota -xc "report -u" $mnt
  User ID          Used       Soft       Hard    Warn/Grace
  ---------- --------------------------------------------------
  root                0          0          0     00 [--------]
  (null)              0     307200     409600     00 [--------]

So this problem is the same as the ID 0 problem above. To deal with
it, this patch prints the ID number if the name can't be found.

However, if we use the old GETQUOTA ioctl, it won't print project ID
0 quota information if it's not defined. That's different from
GETNEXTQUOTA. To keep things consistent, this patch also prints
project ID 0 when the old GETQUOTA is used.
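
The fallback amounts to something like this (a simplified sketch for
the user case; the xfs_quota code also handles groups and projects):

  #include <pwd.h>
  #include <stdio.h>

  static void print_quota_owner(uid_t id)
  {
          struct passwd *pw = getpwuid(id);

          /* fall back to the numeric ID when no name exists for it */
          if (pw)
                  printf("%-10s", pw->pw_name);
          else
                  printf("#%-9u", (unsigned int)id);
  }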

Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_quota: fully support users and groups beginning with digits
Zorro Lang [Tue, 10 May 2016 07:16:06 +0000 (17:16 +1000)] 
xfs_quota: fully support users and groups beginning with digits

A normal user or group name is allowed to begin with digits, but
xfs_quota can't create a limit for such a user or group. The reason
is that the 'strtoul' function only translates the digits at the
beginning of the string and ignores the letters after them.

Commit fd537fc50eeade63bbd2a66105f39d04a011a7f5 tried to fix
"xfsprogs: xfs_quota allow user or group names beginning with
digits", but it doesn't affect the 'limit' command, so a command
like:

  xfs_quota 'limit .... 12345678-user' xxxx

will try to create a limit for username="12345678", not
"12345678-user".

This patch fixes this problem, and test case xfs/138 in xfstests
is used to reproduce this bug.
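
A rough sketch of the intended behaviour (simplified; not the actual
xfs_quota parsing code):

  #include <pwd.h>
  #include <stdlib.h>

  static uid_t resolve_user(const char *name)
  {
          struct passwd *pw = getpwnam(name);
          char *endp;
          unsigned long id;

          if (pw)
                  return pw->pw_uid;       /* name wins, digits or not */
          id = strtoul(name, &endp, 10);
          if (*name != '\0' && *endp == '\0')
                  return (uid_t)id;        /* the whole string was numeric */
          return (uid_t)-1;                /* neither a known name nor a number */
  }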

Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_quota: add missed options -D and -P into man page
Zorro Lang [Tue, 10 May 2016 07:16:06 +0000 (17:16 +1000)] 
xfs_quota: add missed options -D and -P into man page

There are two options in the xfsprogs/quota/init.c:init() function:
the -D option specifies a file to use instead of /etc/projects, and
the -P option a file to use instead of /etc/projid. I didn't know
about these two options when I wrote xfstests case xfs/133, because
there is no information about them in the xfs_quota man page or
other related man pages.

Document these options in the xfs_quota man page.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_io: allow mmap command to reserve some free space
Zorro Lang [Tue, 10 May 2016 07:16:05 +0000 (17:16 +1000)] 
xfs_io: allow mmap command to reserve some free space

When using xfs_io to rewrite LTP's mmap16 testcase for xfstests,
we always hit an mremap ENOMEM error. But according to linux commit:

  90a8020 vfs: fix data corruption when blocksize < pagesize for mmaped data

the reproducer shouldn't use the MREMAP_MAYMOVE flag for mremap(). So we
need a stable method to make mremap able to extend the space from the
original mapped starting address.

Generally if we try to mremap from the original mapped starting point
in a C program, at first we always do:

  addr = mmap(NULL, res_size, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  munmap(addr, res_size);

Then do:

  addr = mmap(addr, real_len, ...);

The "res_size" is bigger than "real_len". This will help us get a
region between real_len and res_size, which maybe free space. The
truth is we can't guarantee that this free memory will stay free.
But this method is still very helpful for reserve some free space
in short time.

This patch bring in the "res_size" by use a new -s option. If don't use
this option, xfs_io -c "mmap 0 1024" -c "mremap 8192" will hit ENOMEM
error nearly 100%, but if use this option, it nearly always remap
successfully.
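
A standalone illustration of the reservation trick described above
(sizes are arbitrary; the Linux-specific mremap() is assumed):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t res_size = 1024 * 1024, real_len = 4096;
          void *addr, *grown;

          /* find a large free region, then release it again */
          addr = mmap(NULL, res_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (addr == MAP_FAILED)
                  return 1;
          munmap(addr, res_size);

          /* map the real, smaller length at (hopefully) the same address */
          addr = mmap(addr, real_len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (addr == MAP_FAILED)
                  return 1;

          /* grow in place, without MREMAP_MAYMOVE */
          grown = mremap(addr, real_len, real_len * 4, 0);
          printf("mremap %s\n", grown == MAP_FAILED ? "failed" : "succeeded");
          return 0;
  }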

Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_io: modify argument errors of mremap command
Zorro Lang [Tue, 10 May 2016 07:16:05 +0000 (17:16 +1000)] 
xfs_io: modify argument errors of mremap command

There are two argument parsing errors in mremap command:

1. "mremap 1024 8192" won't return error and will run "mremap 1024"
silently.

2. The "-f" option can't be used, due to it needing a new address
argument as the fifth argument of mremap() syscall.

Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoMerge branch 'progs-misc-fixes-for-4.6' into for-master
Dave Chinner [Tue, 10 May 2016 07:15:49 +0000 (17:15 +1000)] 
Merge branch 'progs-misc-fixes-for-4.6' into for-master

8 years agoMerge branch 'libxfs-4.6-sync' into for-master
Dave Chinner [Tue, 10 May 2016 07:15:21 +0000 (17:15 +1000)] 
Merge branch 'libxfs-4.6-sync' into for-master

8 years ago  xfsprogs: Release v4.5.0
Dave Chinner [Tue, 15 Mar 2016 04:25:56 +0000 (15:25 +1100)] 
xfsprogs: Release v4.5.0

Update all the release files for a 4.5.0 release.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
8 years ago  xfs_io: implement 'inode' command
Carlos Maiolino [Mon, 29 Feb 2016 05:03:40 +0000 (16:03 +1100)] 
xfs_io: implement 'inode' command

Implements a new xfs_io command, named 'inode', which is used to
query information about an inode's existence and its physical size
in the filesystem.

Supported options:

Default:     -- Return true(1) or false(0) if any inode greater than
                32 bits has been found in the filesystem
[num]        -- Return the inode number if inode [num] is in use,
                or 0 if not
-n [num]     -- Return the next valid inode after [num]
-v           -- Verbose mode: display the inode number and its
                physical size according to the argument used

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years ago  mkfs: fix crash when initializing rbmip
Darrick J. Wong [Mon, 29 Feb 2016 05:03:07 +0000 (16:03 +1100)] 
mkfs: fix crash when initializing rbmip

Initialize rbmip, log the inode, /then/ assign it to the xfs_mount.
Don't try to access rbmip in the xfs_mount before that, or it'll crash.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agolibxfs: fix up mismerge in libxfs_iflush_int
Eric Sandeen [Mon, 29 Feb 2016 05:03:06 +0000 (16:03 +1100)] 
libxfs: fix up mismerge in libxfs_iflush_int

XFS_ISDIR is a bool, don't compare it to S_IFDIR

e37bf5 xfs: mode di_mode to vfs inode had a small mis-merge
from kernelspace, when moving from

if ((ip->i_d.di_mode & S_IFMT) == S_IFDIR)
to
if (XFS_ISDIR(ip) == S_IFDIR)

that "==" should have been dropped.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_io: Prevent divide by zero from {pread,pwrite}_random
Dmitry Monakhov [Sun, 28 Feb 2016 23:46:42 +0000 (10:46 +1100)] 
xfs_io: Prevent divide by zero from {pread,pwrite}_random

The math is wrong if the requested range is less than or equal to the block size:

xfs_io -c 'pwrite -b 4k 8k 4k -R' \
       -c 'pread -b 4k  4k 4k -R' -f file

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_db: Fix dquot command docs
Eric Sandeen [Sun, 28 Feb 2016 23:46:31 +0000 (10:46 +1100)] 
xfs_db: Fix dquot command docs

The dquot command in xfs_db takes quota type as well as an ID.
Fix the usage & man page to reflect this.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_quota: modify commands which can't handle multiple types
Zorro Lang [Sun, 28 Feb 2016 23:45:53 +0000 (10:45 +1100)] 
xfs_quota: modify commands which can't handle multiple types

Some xfs_quota commands can't deal with multiple types together.
For example, if we run "limit -ug ...", one type will overwrite
the other. I found that the commands below can't handle multiple types:

  [quota, limit, timer, warn, dump, restore and quot]

(Although the timer and warn commands can't support the -ugp types
yet, they will one day.)

For every single $command, I changed its ${command}_f function,
${command}_cmd structure and man page.

Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: mode di_mode to vfs inode
Dave Chinner [Thu, 18 Feb 2016 01:04:32 +0000 (12:04 +1100)] 
xfs: mode di_mode to vfs inode

Source kernel commit c19b3b05ae440de50fffe2ac2a9b27392a7448e9

Move the di_mode value from the xfs_icdinode to the VFS inode, reducing
the xfs_icdinode by another 2 bytes and collapsing another 2 byte hole
in the structure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: move di_changecount to VFS inode
Dave Chinner [Thu, 18 Feb 2016 01:03:59 +0000 (12:03 +1100)] 
xfs: move di_changecount to VFS inode

Source kernel commit 83e06f21b439b7b308eda06332a4feef35739e94

We can store the di_changecount in the i_version field of the VFS
inode and remove another 8 bytes from the xfs_icdinode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: move inode generation count to VFS inode
Dave Chinner [Thu, 18 Feb 2016 01:03:57 +0000 (12:03 +1100)] 
xfs: move inode generation count to VFS inode

Source kernel commit 9e9a2674e43353f650ecd19a54eba028eafff82e

Pull another 4 bytes out of the xfs_icdinode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: use vfs inode nlink field everywhere
Dave Chinner [Thu, 18 Feb 2016 01:03:51 +0000 (12:03 +1100)] 
xfs: use vfs inode nlink field everywhere

Source kernel commit 54d7b5c1d03e9711cce2d72237d5b3f5c87431f4

The VFS tracks the inode nlink just like the xfs_icdinode. We can
remove the variable from the icdinode and use the VFS inode variable
everywhere, reducing the size of the xfs_icdinode by a further 4
bytes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: move v1 inode conversion to xfs_inode_from_disk
Dave Chinner [Wed, 17 Feb 2016 08:55:52 +0000 (19:55 +1100)] 
xfs: move v1 inode conversion to xfs_inode_from_disk

Source kernel commit faeb4e4715be017e88e630bda84477afc1dff38b

So we don't have to carry a di_onlink variable around anymore, move
the inode conversion from the v1 inode format to the v2 inode format
into xfs_inode_from_disk(). This means we can remove the di_onlink
field from the struct xfs_icdinode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: cull unnecessary icdinode fields
Dave Chinner [Wed, 17 Feb 2016 06:51:01 +0000 (17:51 +1100)] 
xfs: cull unnecessary icdinode fields

Source kernel commit 93f958f9c41f0bfd10627a2279457df64004d957

Now that the struct xfs_icdinode is not directly related to the
on-disk format, we can cull things in it we really don't need to
store:

- magic number never changes
- padding is not necessary
- next_unlinked is never used
- inode number is redundant
- uuid is redundant
- lsn is accessed directly from dinode
- inode CRC is only accessed directly from dinode

Hence we can remove these from the struct xfs_icdinode and redirect
the code that uses them to the xfs_dinode appropriately.  This
reduces the size of the struct icdinode from 152 bytes to 88 bytes,
and removes a fair chunk of unnecessary code, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: remove timestamps from incore inode
Dave Chinner [Wed, 17 Feb 2016 06:50:13 +0000 (17:50 +1100)] 
xfs: remove timestamps from incore inode

Source kernel commit 3987848c7c2be112e03c82d03821b044f1c0edec

The struct xfs_inode has two copies of the current timestamps in it,
one in the vfs inode and one in the struct xfs_icdinode. Now that we
no longer log the struct xfs_icdinode directly, we don't need to
keep the timestamps in this structure. Instead we can copy them
straight out of the VFS inode when formatting the inode log item or
the on-disk inode.

This reduces the struct xfs_inode in size by 24 bytes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: introduce inode log format object
Dave Chinner [Wed, 17 Feb 2016 06:10:02 +0000 (17:10 +1100)] 
xfs: introduce inode log format object

Source kernel commit f8d55aa0523ad0f78979c222ed18b78ea7be793a

We currently carry around and log an entire inode core in the
struct xfs_inode. A lot of the information in the inode core is
duplicated in the VFS inode, but we cannot remove this duplication
of information because the inode core is logged directly in
xfs_inode_item_format().

Add a new function xfs_inode_item_format_core() that copies the
inode core data into a struct xfs_icdinode that is pulled directly
from the log vector buffer. This means we no longer directly
copy the inode core, but copy the structures one member at a time.
This will be slightly less efficient than copying, but will allow us
to remove duplicate and unnecessary items from the struct xfs_inode.

To enable us to do this, call the new structure a xfs_log_dinode,
so that we know it's different to the physical xfs_dinode and the
in-core xfs_icdinode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: RT bitmap and summary buffers need verifiers
Dave Chinner [Wed, 17 Feb 2016 06:09:02 +0000 (17:09 +1100)] 
xfs: RT bitmap and summary buffers need verifiers

Source kernel commit bf85e0998ae8bafc1e0863d914df3be2b1bc372a

Buffers without verifiers issue runtime warnings on XFS. We don't
have anything we can actually verify in the RT buffers (no CRCs, no
magic numbers, etc), but we still need verifiers to avoid the
warnings.

Add a set of dummy verifier operations for the realtime buffers and
apply them in the appropriate places.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: RT bitmap and summary buffers are not typed
Dave Chinner [Wed, 17 Feb 2016 06:08:02 +0000 (17:08 +1100)] 
xfs: RT bitmap and summary buffers are not typed

Source kernel commit f67ca6eca89cddd355c83639a90109e245f9d5a7

When logging buffers, we attach a type to them that follows the
buffer all the way into the log and is used to identify the buffer
contents in log recovery. Both the realtime summary buffers and the
bitmap buffers do not have types defined or set, so when we try to
log them we see assert failure:

XFS: Assertion failed: (bip->bli_flags & XFS_BLI_STALE) || (xfs_blft_from_flags(&bip->__bli_format) > XFS_BLFT_UNKNOWN_BUF && xfs_blft_from_flags(&bip->__bli_format) < XFS_BLFT_MAX_BUF), file: fs/xfs/xfs_buf_item.c, line: 294

Fix this by adding buffer log format types for these buffers, and
add identification support into log recovery for them. Only build the log
recovery support if CONFIG_XFS_RT=y - we can't get into log recovery for real
time filesystems if support is not built into the kernel, and this avoids
potential build problems.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: remove unused function definitions
Eric Sandeen [Wed, 17 Feb 2016 06:07:02 +0000 (17:07 +1100)] 
xfs: remove unused function definitions

Source kernel commit de0b85a8cf24f8c535b7f719c454e3951298d17a

Old leftovers.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: move buffer invalidation to xfs_btree_free_block
Christoph Hellwig [Wed, 17 Feb 2016 06:06:02 +0000 (17:06 +1100)] 
xfs: move buffer invalidation to xfs_btree_free_block

Source kernel commit edfd9dd549212a0923c9b5b142275dc88912abfa

... instead of leaving it in the methods.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: factor btree block freeing into a helper
Christoph Hellwig [Wed, 17 Feb 2016 06:05:02 +0000 (17:05 +1100)] 
xfs: factor btree block freeing into a helper

Source kernel commit c46ee8ad7856b58eeb383c30ce847897f85c4103

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: handle errors from ->free_blocks in xfs_btree_kill_iroot
Christoph Hellwig [Wed, 17 Feb 2016 06:04:02 +0000 (17:04 +1100)] 
xfs: handle errors from ->free_blocks in xfs_btree_kill_iroot

Source kernel commit 196328ec973a74ee52cc282824e72c3824dc1cf5

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: wire up Q_XGETNEXTQUOTA / get_nextdqblk
Eric Sandeen [Wed, 17 Feb 2016 06:03:02 +0000 (17:03 +1100)] 
xfs: wire up Q_XGETNEXTQUOTA / get_nextdqblk

Source kernel commit 296c24e26ee3af2dbfecb482e6bc9560bd34c455

Add code to allow the Q_XGETNEXTQUOTA quotactl to quickly find
all active quotas by examining the quota inode, and skipping
over unallocated or uninitialized regions.

Userspace can then use this interface rather than, for example, a
getpwent() loop when asked to report all active quotas.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years ago  xfsprogs: Release v4.5.0-rc1
Dave Chinner [Wed, 17 Feb 2016 06:02:02 +0000 (17:02 +1100)] 
xfsprogs: Release v4.5.0-rc1

Update all the release files for a 4.5.0-rc1 release.

Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: Validate the length of on-disk ACLs
Andreas Gruenbacher [Wed, 17 Feb 2016 05:45:50 +0000 (16:45 +1100)] 
xfs: Validate the length of on-disk ACLs

Source kernel commit 86a21c79745ca97676cbd47f8608839382cc0448

In xfs_acl_from_disk, instead of trusting that xfs_acl.acl_cnt is correct,
make sure that the length of the attributes is correct as well.  Also, turn
the aclp parameter into a const pointer.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: stop holding ILOCK over filldir callbacks
Dave Chinner [Wed, 17 Feb 2016 05:45:42 +0000 (16:45 +1100)] 
xfs: stop holding ILOCK over filldir callbacks

Source kernel commit dbad7c993053d8f482a5f76270a93307537efd8e

The recent change to the readdir locking made in 40194ec ("xfs:
reinstate the ilock in xfs_readdir") for CXFS directory sanity was
probably the wrong thing to do. Deep in the readdir code we
can take page faults in the filldir callback, and so taking a page
fault while holding an inode ilock creates a new set of locking
issues that lockdep warns all over the place about.

The locking order for regular inodes w.r.t. page faults is io_lock
-> pagefault -> mmap_sem -> ilock. The directory readdir code now
triggers ilock -> page fault -> mmap_sem. While we cannot deadlock
at this point, it inverts all the locking patterns that lockdep
normally sees on XFS inodes, and so triggers lockdep. We worked
around this with commit 93a8614 ("xfs: fix directory inode iolock
lockdep false positive"), but that then just moved the lockdep
warning to deeper in the page fault path and triggered on security
inode locks. Fixing the shmem issue there just moved the lockdep
reports somewhere else, and now we are getting false positives from
filesystem freezing annotations getting confused.

Further, if we enter memory reclaim in a readdir path, we now get
lockdep warning about potential deadlocks because the ilock is held
when we enter reclaim. This, again, is different to a regular file
in that we never allow memory reclaim to run while holding the ilock
for regular files. Hence lockdep now throws
ilock->kmalloc->reclaim->ilock warnings.

Basically, the problem is that the ilock is being used to protect
the directory data and the inode metadata, whereas for a regular
file the iolock protects the data and the ilock protects the
metadata. From the VFS perspective, the i_mutex serialises all
accesses to the directory data, and so not holding the ilock for
readdir doesn't matter. The issue is that CXFS doesn't access
directory data via the VFS, so it has no "data serialisation"
mechanism. Hence we need to hold the IOLOCK in the correct places to
provide this low level directory data access serialisation.

The ilock can then be used just when the extent list needs to be
read, just like we do for regular files. The directory modification
code can take the iolock exclusive when the ilock is also taken,
and this then ensures that readdir is correctly excluded while
modifications are in progress.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agolibxfs: fix two comment typos
Geliang Tang [Wed, 17 Feb 2016 05:45:41 +0000 (16:45 +1100)] 
libxfs: fix two comment typos

Source kernel commit fef4ded8cb2e53a47093b334e7730d55a3badc59

Just fix two typos in code comments.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: swap leaf buffer into path struct atomically during path shift
Brian Foster [Wed, 17 Feb 2016 05:45:27 +0000 (16:45 +1100)] 
xfs: swap leaf buffer into path struct atomically during path shift

Source kernel commit 7df1c170b9a45ab3a7401c79bbefa9939bf8eafb

The node directory lookup code uses a state structure that tracks the
path of buffers used to search for the hash of a filename through the
leaf blocks. When the lookup encounters a block that ends with the
requested hash, but the entry has not yet been found, it must shift over
to the next block and continue looking for the entry (i.e., duplicate
hashes could continue over into the next block). This shift mechanism
involves walking back up and down the state structure, replacing buffers
at the appropriate btree levels as necessary.

When a buffer is replaced, the old buffer is released and the new buffer
read into the active slot in the path structure. Because the buffer is
read directly into the path slot, a buffer read failure can result in
setting a NULL buffer pointer in an active slot. This throws off the
state cleanup code in xfs_dir2_node_lookup(), which expects to release a
buffer from each active slot. Instead, a BUG occurs due to a NULL
pointer dereference:

  BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8
  IP: [<ffffffffa0585063>] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
  ...
  RIP: 0010:[<ffffffffa0585063>]  [<ffffffffa0585063>] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
  ...
  Call Trace:
   [<ffffffffa05250c6>] xfs_dir2_node_lookup+0xa6/0x2c0 [xfs]
   [<ffffffffa0519f7c>] xfs_dir_lookup+0x1ac/0x1c0 [xfs]
   [<ffffffffa055d0e1>] xfs_lookup+0x91/0x290 [xfs]
   [<ffffffffa05580b3>] xfs_vn_lookup+0x73/0xb0 [xfs]
   [<ffffffff8122de8d>] lookup_real+0x1d/0x50
   [<ffffffff8123330e>] path_openat+0x91e/0x1490
   [<ffffffff81235079>] do_filp_open+0x89/0x100
   ...

This has been reproduced via a parallel fsstress and filesystem shutdown
workload in a loop. The shutdown triggers the read error in the
aforementioned codepath and causes the BUG in xfs_dir2_node_lookup().

Update xfs_da3_path_shift() to update the active path slot atomically
with respect to the caller when a buffer is replaced. This ensures that
the caller always sees the old or new buffer in the slot and prevents
the NULL pointer dereference.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: handle dquot buffer readahead in log recovery correctly
Dave Chinner [Wed, 17 Feb 2016 05:45:24 +0000 (16:45 +1100)] 
xfs: handle dquot buffer readahead in log recovery correctly

Source kernel commit 7d6a13f023567d573ac362502bb702eda716e654

When we do dquot readahead in log recovery, we do not use a verifier
as the underlying buffer may not have dquots in it. e.g. the
allocation operation hasn't yet been replayed. Hence we do not want
to fail recovery because we detect an operation to be replayed has
not been run yet. This problem was addressed for inodes in commit
d891400 ("xfs: inode buffers may not be valid during recovery
readahead") but the problem was not recognised to exist for dquots
and their buffers as the dquot readahead did not have a verifier.

The result of not using a verifier is that when the buffer is then
next read to replay a dquot modification, the dquot buffer verifier
will only be attached to the buffer if *readahead is not complete*.
Hence we can read the buffer, replay the dquot changes and then add
it to the delwri submission list without it having a verifier
attached to it. This then generates warnings in xfs_buf_ioapply(),
which catches and warns about this case.

Fix this and make it handle the same readahead verifier error cases
as for inode buffers by adding a new readahead verifier that has a
write operation as well as a read operation that marks the buffer as
not done if any corruption is detected.  Also make sure we don't run
readahead if the dquot buffer has been marked as cancelled by
recovery.

This will result in readahead either succeeding and the buffer
having a valid write verifier, or readahead failing and the buffer
state requiring the subsequent read to resubmit the IO with the new
verifier.  In either case, this will result in the buffer always
ending up with a valid write verifier on it.

Note: we also need to fix the inode buffer readahead error handling
to mark the buffer with EIO. Brian noticed the code I copied from
there wrong during review, so fix it at the same time. Add comments
linking the two functions that handle readahead verifier errors
together so we don't forget this behavioural link in future.

cc: <stable@vger.kernel.org> # 3.12 - current
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: inode recovery readahead can race with inode buffer creation
Dave Chinner [Wed, 17 Feb 2016 05:44:57 +0000 (16:44 +1100)] 
xfs: inode recovery readahead can race with inode buffer creation

Source kernel commit b79f4a1c68bb99152d0785ee4ea3ab4396cdacc6

When we do inode readahead in log recovery, we can do the
readahead before we've replayed the icreate transaction that stamps
the buffer with inode cores. The inode readahead verifier catches
this and marks the buffer as !done to indicate that it doesn't yet
contain valid inodes.

In adding buffer error notification (i.e. setting b_error = -EIO at
the same time as we clear the done flag) to such a readahead
verifier failure, we can then get subsequent inode recovery failing
with this error:

XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32

This occurs when readahead completion races with icreate item replay
such as:

inode readahead
    find buffer
    lock buffer
    submit RA io
    ....
icreate recovery
    xfs_trans_get_buffer
        find buffer
        lock buffer
        <blocks on RA completion>
    .....
    <ra completion>
        fails verifier
        clear XBF_DONE
        set bp->b_error = -EIO
        release and unlock buffer
    <icreate gains lock>
    icreate initialises buffer
    marks buffer as done
    adds buffer to delayed write queue
    releases buffer

At this point, we have an initialised inode buffer that is up to
date but has an -EIO state registered against it. When we finally
get to recovering an inode in that buffer:

inode item recovery
    xfs_trans_read_buffer
        find buffer
        lock buffer
        sees XBF_DONE is set, returns buffer
    sees bp->b_error is set
    fail log recovery!

Essentially, we need xfs_trans_get_buf_map() to clear the error status of
the buffer when doing a lookup. This function returns uninitialised
buffers, so the buffer returned can not be in an error state and
none of the code that uses this function expects b_error to be set
on return. Indeed, there is an ASSERT(!bp->b_error); in the
transaction case in xfs_trans_get_buf_map() that would have caught
this if log recovery used transactions....

This patch firstly changes the inode readahead failure to set -EIO
on the buffer, and secondly changes xfs_buf_get_map() to never
return a buffer with an error state set so this first change doesn't
cause unexpected log recovery failures.
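
A hedged sketch of that second change; the exact placement inside
xfs_buf_get_map() may differ, but the idea is to discard any stale
readahead error when a cached buffer is found on lookup:

        /* found an existing buffer in the cache */
        if (bp->b_error) {
                /*
                 * A failed readahead left an error behind. This lookup
                 * does not imply the caller wants the old contents, so
                 * clear the error rather than handing it to the caller.
                 */
                xfs_buf_ioerror(bp, 0);
        }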

cc: <stable@vger.kernel.org> # 3.12 - current
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: eliminate committed arg from xfs_bmap_finish
Eric Sandeen [Wed, 17 Feb 2016 05:44:57 +0000 (16:44 +1100)] 
xfs: eliminate committed arg from xfs_bmap_finish

Source kernel commit f6106efae5f4144b32f6c10de0dc3e7efc9181e3

Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
associated comments were replicated several times across
the attribute code, all dealing with what to do if the
transaction was or wasn't committed.

And in that replicated code, an ASSERT() test of an
uninitialized variable occurs in several locations:

        error = xfs_attr_thing(&args);
        if (!error) {
                error = xfs_bmap_finish(&args.trans, args.flist,
                                        &committed);
        }
        if (error) {
                ASSERT(committed);

If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
never set "committed", and then test it in the ASSERT.

Fix this up by moving the committed state internal to xfs_bmap_finish,
and add a new inode argument.  If an inode is passed in, it is passed
through to __xfs_trans_roll() and joined to the transaction there if
the transaction was committed.

xfs_qm_dqalloc() was a little unique in that it called bjoin rather
than ijoin, but as Dave points out we can detect the committed state
by checking whether (*tpp != tp).
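
A hedged sketch of what a simplified call site looks like after the
change (argument and field names approximate):

        error = xfs_attr_thing(&args);
        if (error)
                goto out;
        /*
         * The committed state is now tracked inside xfs_bmap_finish();
         * the inode passed in is rejoined to the new transaction there
         * if the transaction was committed.
         */
        error = xfs_bmap_finish(&args.trans, args.flist, args.dp);
        if (error)
                goto out;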

Addresses-Coverity-Id: 102360
Addresses-Coverity-Id: 102361
Addresses-Coverity-Id: 102363
Addresses-Coverity-Id: 102364
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: bmapbt checking on debug kernels too expensive
Dave Chinner [Wed, 17 Feb 2016 05:44:57 +0000 (16:44 +1100)] 
xfs: bmapbt checking on debug kernels too expensive

Source kernel commit e35438196c6a1d8b206471d51e80c380e80e047b

For large sparse or fragmented files, checking every single entry in
the bmapbt on every operation is prohibitively expensive. Especially
as such checks rarely discover problems during normal operations on
high extent count files. Our regression tests don't tend to exercise
files with hundreds of thousands to millions of extents, so mostly
this isn't noticed.

However, trying to run things like xfs_mdrestore of large filesystem
dumps on a debug kernel quickly becomes impossible as the CPU is
completely burnt up repeatedly walking the sparse file bmapbt that
is generated for every allocation that is made.

Hence, if the file has more than 10,000 extents, just don't bother
with walking the tree to check it exhaustively. The btree code has
checks that ensure that the newly inserted/removed/modified record
is correctly ordered, so the entire tree walk in these cases has
limited additional value.
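
A hedged sketch of the check (threshold from the text; macro usage
approximate):

        /*
         * Exhaustively walking huge bmapbts burns CPU for little gain;
         * the btree insert/remove paths already validate the record
         * being modified, so skip the full check on very large files.
         */
        if (XFS_IFORK_NEXTENTS(ip, whichfork) > 10000)
                return;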

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: get mp from bma->ip in xfs_bmap code
Eric Sandeen [Wed, 17 Feb 2016 05:44:57 +0000 (16:44 +1100)] 
xfs: get mp from bma->ip in xfs_bmap code

Source kernel commit f1f96c4946590616812711ac19eb7a84be160877

In my earlier commit

  c29aad4 xfs: pass mp to XFS_WANT_CORRUPTED_GOTO

I added some local mp variables with code which indicates that
mp might be NULL.  Coverity doesn't like this now, because the
updated per-fs XFS_STATS macros dereference mp.

I don't think this is actually a problem; from what I can tell,
we cannot get to these functions with a null bma->tp, so my NULL
check was probably pointless.  Still, it's not super obvious.

So switch this code to get mp from the inode on the xfs_bmalloca
structure, with no conditional, because the functions are already
using bma->ip directly.
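
The switch amounts to taking the mount unconditionally from the inode
already carried in the xfs_bmalloca, e.g. (sketch):

        struct xfs_mount        *mp = bma->ip->i_mount;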

Addresses-Coverity-Id: 1339552
Addresses-Coverity-Id: 1339553
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: introduce BMAPI_ZERO for allocating zeroed extents
Dave Chinner [Wed, 17 Feb 2016 05:44:53 +0000 (16:44 +1100)] 
xfs: introduce BMAPI_ZERO for allocating zeroed extents

Source kernel commit 3fbbbea34bac049c0b5938dc065f7d8ee1ef7e67

To enable DAX to do atomic allocation of zeroed extents, we need to
drive the block zeroing deep into the allocator. Because
xfs_bmapi_write() can return merged extents on allocation that were
only partially allocated (i.e. requested range spans allocated and
hole regions, allocation into the hole was contiguous), we cannot
zero the extent returned from xfs_bmapi_write() as that can
overwrite existing data with zeros.

Hence we have to drive the extent zeroing into the allocation code,
prior to where we merge the extents into the BMBT and return the
resultant map. This means we need to propagate this need down to
the xfs_alloc_vextent() and issue the block zeroing at this point.
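
A hedged sketch of the caller side for the DAX case (flag and variable
names approximate; the zeroing itself happens further down in the
allocator):

        int     bmapi_flags = 0;

        /* DAX wants the extent zeroed atomically with the allocation */
        if (IS_DAX(VFS_I(ip)))
                bmapi_flags |= XFS_BMAPI_ZERO;

        error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
                                bmapi_flags, &firstfsb, resblks, imap,
                                &nimaps, &free_list);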

While this functionality is being introduced for DAX, there is no
reason why it is specific to DAX - we can pre-zero blocks during the
allocation transaction on any type of device. It's just slow (and
usually slower than unwritten allocation and conversion) on
traditional block devices so doesn't tend to get used. We can,
however, hook hardware zeroing optimisations via sb_issue_zeroout()
to this operation, so it may be useful in future and hence the
"allocate zeroed blocks" API needs to be implementation neutral.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: per-filesystem stats counter implementation
Bill O'Donnell [Mon, 15 Feb 2016 01:24:36 +0000 (12:24 +1100)] 
xfs: per-filesystem stats counter implementation

Source kernel commit ff6d6af2351caea7db681f4539d0d893e400557a

This patch modifies the stats counting macros and the callers
to those macros to properly increment, decrement, and add-to
the xfs stats counts. The counts for global and per-fs stats
are correctly advanced, and cleared by writing a "1" to the
corresponding clear file.
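
A hedged sketch of the shape of the updated macros (the real
definitions live in fs/xfs/xfs_stats.h; exact helper names are
approximate):

        /* bump both the global and the per-mount per-cpu counters */
        #define XFS_STATS_INC(mp, v)                                    \
        do {                                                            \
                per_cpu_ptr(xfsstats.xs_stats, current_cpu())->v++;     \
                per_cpu_ptr((mp)->m_stats.xs_stats, current_cpu())->v++; \
        } while (0)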

global counts: /sys/fs/xfs/stats/stats
per-fs counts: /sys/fs/xfs/sda*/stats/stats

global clear:  /sys/fs/xfs/stats/stats_clear
per-fs clear:  /sys/fs/xfs/sda*/stats/stats_clear

[dchinner: cleaned up macro variables, removed CONFIG_PROC_FS around
 stats structures and macros. ]

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: log local to remote symlink conversions correctly on v5 supers
Brian Foster [Mon, 15 Feb 2016 01:14:38 +0000 (12:14 +1100)] 
xfs: log local to remote symlink conversions correctly on v5 supers

Source kernel commit b7cdc66be54b64daef593894d12ecc405f117829

A local format symlink inode is converted to extent format when an
extended attribute is set on an inode as part of the attribute fork
creation. This means a block is allocated, the local symlink target name
is copied to the block and the block is logged. Currently,
xfs_bmap_local_to_extents() handles logging the remote block data based
on the size of the data fork prior to the conversion. This is not
correct on v5 superblock filesystems, which add an additional header to
remote symlink blocks that is nonexistent in local format inodes.

As a result, the full length of the remote symlink block content is not
logged. This can lead to corruption should a crash occur and log
recovery replay this transaction.

Since a callout is already used to initialize the new remote symlink
block, update the local-to-extents conversion mechanism to make the
callout also responsible for logging the block. It is already required
to set the log buffer type and format the block appropriately based on
the superblock version. This ensures the remote symlink is always logged
correctly. Note that xfs_bmap_local_to_extents() is only called for
symlinks so there are no other callouts that require modification.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: add missing bmap cancel calls in error paths
Brian Foster [Mon, 15 Feb 2016 01:13:38 +0000 (12:13 +1100)] 
xfs: add missing bmap cancel calls in error paths

Source kernel commit d4a97a04227d5ba91b91888a016e2300861cfbc7

If a failure occurs after the bmap free list is populated and before
xfs_bmap_finish() completes successfully (which returns a partial
list on failure), the bmap free list must be cancelled. Otherwise,
the extent items on the list are never freed and a memory leak
occurs.

Several random error paths throughout the code suffer this problem.
Fix these up such that xfs_bmap_cancel() is always called on error.
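
A hedged sketch of the pattern the fixed error paths now follow
(variable names assumed; surrounding code and labels elided):

        error = xfs_bmap_finish(&tp, &free_list, &committed);
        if (error)
                goto out_bmap_cancel;
        ...
        return 0;

 out_bmap_cancel:
        /* free the extent items still held on the (partial) free list */
        xfs_bmap_cancel(&free_list);
        goto out;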

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agolibxfs: Optimize the loop for xfs_bitmap_empty
Jia He [Mon, 15 Feb 2016 01:12:38 +0000 (12:12 +1100)] 
libxfs: Optimize the loop for xfs_bitmap_empty

Source kernel commit 1d4292bfdc77f4f7c520064be15d0c46bd025fd2

If any non-zero bit is found in the bitmap, the loop can exit early
and the function can return as soon as possible.
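
A hedged sketch of the optimised loop (prototype approximated from
xfs_bit.c):

        int
        xfs_bitmap_empty(uint *map, uint size)
        {
                uint i;

                for (i = 0; i < size; i++) {
                        if (map[i] != 0)
                                return 0;       /* set bit found: not empty */
                }
                return 1;
        }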

Signed-off-by: Jia He <hejianet@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_io: add support for changing the new inode DAX attribute
Dave Chinner [Mon, 15 Feb 2016 01:11:38 +0000 (12:11 +1100)] 
xfs_io: add support for changing the new inode DAX attribute

It is changed via the FS_IOC_FSSETXATTR ioctl, so add the new flag
to the platform definitions for userspace headers that don't have
this API. Then
add support to the relevant xfs_io chattr and stat commands.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
8 years agoxfs: introduce per-inode DAX enablement
Dave Chinner [Mon, 15 Feb 2016 01:10:38 +0000 (12:10 +1100)] 
xfs: introduce per-inode DAX enablement

Source kernel commit 58f88ca2df7270881de2034c8286233a89efe71c

Rather than just being able to turn DAX on and off via a mount
option, some applications may only want to enable DAX for certain
performance critical files in a filesystem.

This patch introduces a new inode flag to enable DAX in the v3 inode
di_flags2 field. It adds support for setting and clearing flags in
the di_flags2 field via the XFS_IOC_FSSETXATTR ioctl, and sets the
S_DAX inode flag appropriately when it is seen.
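
A hedged userspace sketch of flipping the flag, using the equivalent
FS_IOC_FS[GS]ETXATTR names (see the promotion commit below) and
assuming a uapi header that provides struct fsxattr and FS_XFLAG_DAX:

        #include <sys/ioctl.h>
        #include <linux/fs.h>

        static int
        set_dax(int fd)
        {
                struct fsxattr  fsx;

                if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
                        return -1;
                fsx.fsx_xflags |= FS_XFLAG_DAX; /* per-inode DAX enable */
                return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
        }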

When this flag is set on a directory, it acts as an "inherit flag".
That is, inodes created in the directory will automatically inherit
the on-disk inode DAX flag, enabling administrators to set up
directory hierarchies that automatically use DAX. Setting this flag
on an empty root directory will make the entire filesystem use DAX
by default.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_fs.h: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR promotion
Dave Chinner [Mon, 15 Feb 2016 01:10:05 +0000 (12:10 +1100)] 
xfs_fs.h: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR promotion

The kernel commit that made this ioctl promotion (bb99e06ddf) moved
the definitions for the XFS ioctl to uapi/linux/fs.h for the
following reason:

    Hoist the ioctl definitions for the XFS_IOC_FS[SG]SETXATTR API
    from fs/xfs/libxfs/xfs_fs.h to include/uapi/linux/fs.h so that
    the ioctls can be used by all filesystems, not just XFS. This
    enables (initially) ext4 to use the ioctl to set project IDs on
    inodes.

This means we now need to handle this change in userspace as the
uapi/linux/fs.h file may not contain the definitions (i.e. new
xfsprogs / old Linux uapi files) that xfsprogs needs to build. Hence we
need to massage the definition in xfs_fs.h to take the values from
the system header if it exists, otherwise keep the old definitions
for compatibility and platforms other than linux.

To this end, we add the FS* definitions to the platform headers
so the FS* versions are available on all platforms, and add trivial
defines to xfs_fs.h to present the XFS* versions for backwards
compatibility with existing code. New code should always use the FS*
versions, and as such we also convert all the users of XFS* in
xfsprogs to use the FS* definitions.
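
A hedged sketch of the resulting compatibility shim (the ioctl numbers
shown are the existing XFS values; treat the exact form as
illustrative):

        /* old uapi headers: supply the FS_* ioctls ourselves */
        #ifndef FS_IOC_FSGETXATTR
        #define FS_IOC_FSGETXATTR       _IOR('X', 31, struct fsxattr)
        #define FS_IOC_FSSETXATTR       _IOW('X', 32, struct fsxattr)
        #endif

        /* keep the XFS_* names as trivial aliases for existing code */
        #define XFS_IOC_FSGETXATTR      FS_IOC_FSGETXATTR
        #define XFS_IOC_FSSETXATTR      FS_IOC_FSSETXATTR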

Signed-off-by: Dave Chinner <dchinner@redhat.com>
8 years agolibxfs: reset dirty buffer priority on lookup
Dave Chinner [Thu, 11 Feb 2016 06:09:10 +0000 (17:09 +1100)] 
libxfs: reset dirty buffer priority on lookup

When a buffer on the dirty MRU is looked up and found, we remove the
buffer from the MRU. However, we've already set the priority of the
buffer to "dirty" so when we are done with it it will go back on the
dirty buffer MRU regardless of whether it needs to or not.

Hence, when we move a buffer to the dirty MRU, record the old
priority and restore it when we remove the buffer from the MRU on
lookup. This will prevent us from putting fixed, now writeable
buffers back on the dirty MRU and allow the cache routine to write,
shake and reclaim the buffers once they are clean.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agolibxfs: keep unflushable buffers off the cache MRUs
Dave Chinner [Thu, 11 Feb 2016 06:09:08 +0000 (17:09 +1100)] 
libxfs: keep unflushable buffers off the cache MRUs

There's no point trying to free buffers that are dirty and return
errors on flush as we have to keep them around until the corruption
is fixed. Hence if we fail to flush an inode during a cache shake,
move the buffer to a special dirty MRU list that the cache does not
shake. This prevents memory pressure from seeing these buffers, but
allows subsequent cache lookups to still find them through the hash.
This ensures we don't waste huge amounts of CPU trying to flush and
reclaim buffers that cannot be flushed or reclaimed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agolibxfs: don't repeatedly shake unwritable buffers
Dave Chinner [Tue, 9 Feb 2016 00:13:40 +0000 (11:13 +1100)] 
libxfs: don't repeatedly shake unwritable buffers

Now that we try to write dirty buffers before we release them, we
can get a buildup of unwritable dirty buffers on the LRU lists. This
results in the cache shaker repeatedly trying to write out these
buffers every time the cache fills up, which produces more
corruption warnings and wastes a lot of time while reclaiming
nothing. This can effectively livelock the processing parts of phase
4.

Fix this by not trying to write buffers with corruption errors on
them. These errors will get cleared when the buffer is re-read and
fixed and then marked dirty again, at which point we'll be able to
write them and the cache can reclaim them successfully.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agolibxfs: don't discard dirty buffers
Dave Chinner [Tue, 9 Feb 2016 00:13:25 +0000 (11:13 +1100)] 
libxfs: don't discard dirty buffers

When we release a buffer from the cache, if it is dirty we write it
to disk and then put the buffer on the free list for recycling. However,
if the write fails (e.g. a verifier failure due to unfixed corruption) we
effectively throw the buffer and its contents away. This causes all
sorts of problems for xfs_repair as it then re-reads the buffer from
disk on the next access and hence loses all the corrections that had
previously been made, resulting in tripping over corruptions in code
that assumes the corruptions have already been fixed/flagged in the
buffer it receives.

To fix this, we have to make the cache aware that writes can fail,
and keep the buffer in cache when writes fail. Hence we have to add
an explicit error notification to the flush operation, and we need
to do that before we release the buffer to the free list. This also
means that we need to remove the writeback code from the release
mechanisms, instead replacing them with assertions that the buffers
are already clean.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agolibxfs: directory node splitting does not have an extra block
Dave Chinner [Tue, 9 Feb 2016 00:13:05 +0000 (11:13 +1100)] 
libxfs: directory node splitting does not have an extra block

xfs_da3_split() has to handle all three versions of the
directory/attribute btree structure. The attr tree is v1, the dir
tre is v2 or v3. The main difference between the v1 and v2/3 trees
is the way tree nodes are split - in the v1 tree we can require a
double split to occur because the object to be inserted may be
larger than the space made by splitting a leaf. In this case we need
to do a double split - one to split the full leaf, then another to
allocate an empty leaf block in the correct location for the new
entry.  This does not happen with dir (v2/v3) formats as the objects
being inserted are always guaranteed to fit into the new space in
the split blocks.

Indeed, for directories there *may* be an extra block held in this buffer
pointer. However, it's guaranteed not to be a leaf block (i.e. a
directory data block) - the directory code only ever places hash
index or free space blocks in this pointer (as a cursor of
sorts), and so to use it as a directory data block will immediately
corrupt the directory.

The problem is that the code assumes that there may be extra blocks
that we need to link into the tree once we've split the root, but
this is not true for either dir or attr trees, because the extra
attr block is always consumed by the last node split before we split
the root. Hence the linking in an extra block is always wrong at the
root split level, and this manifests itself in repair as a directory
corruption in a repaired directory, leaving the directory rebuild
incomplete.

This is a dir v2 zero-day bug - it was in the initial dir v2 commit
that was made back in February 1998.

Fix this by ensuring the linking of the blocks after the root split
never tries to make use of the extra blocks that may be held in the
cursor. They are held there for other purposes and should never be
touched by the root splitting code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agorepair: parallelise uncertain inode processing in phase 3
Dave Chinner [Tue, 9 Feb 2016 00:12:45 +0000 (11:12 +1100)] 
repair: parallelise uncertain inode processing in phase 3

This can take a long time when there are millions of uncertain inodes in badly
broken filesystems. The processing is per-AG, the data structures are all
per-AG, and the load is mostly CPU time spent checking CRCs on each
uncertain inode. Parallelising reduced the runtime of this phase on a badly
broken filesystem from ~30 minutes to under 5 minutes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agorepair: parallelise phase 7
Dave Chinner [Tue, 9 Feb 2016 00:12:08 +0000 (11:12 +1100)] 
repair: parallelise phase 7

It operates on a single AG at a time, sequentially, doing inode updates. All the
data structures accessed and modified are per AG, as are the modifications to
on-disk structures. Hence we can run this phase concurrently across multiple
AGs.

This is important for large, broken filesystem repairs, where there can be
millions of inodes that need link counts updated. One such repair image takes
more than 45 minutes to run phase 7 as a single threaded operation.
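
A hedged sketch of the per-AG fan-out, assuming repair's existing work
queue helpers (signatures approximated from repair/threads.h; the
worker name is hypothetical):

        struct work_queue       wq;
        xfs_agnumber_t          agno;

        create_work_queue(&wq, mp, ag_stride);
        for (agno = 0; agno < glob_agcount; agno++)
                queue_work(&wq, do_link_count_updates, agno, NULL);
        destroy_work_queue(&wq);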

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_mdrestore: correctly account bytes read
Dave Chinner [Tue, 9 Feb 2016 00:09:43 +0000 (11:09 +1100)] 
xfs_mdrestore: correctly account bytes read

Progress indication comes in the form of an "X MB read" output. This
doesn't match up with the actual number of bytes read from the
metadump file because it only accounts for header blocks in the file,
not the actual metadata blocks that are restored. Hence, the number
reported is usually much lower than the size of the metadump file,
making it impossible to use it to gauge the progress of the restore.

While there, fix the progress output so that it overwrites the
previous progress output line correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agometadump: bounds check btree block regions being zeroed
Dave Chinner [Tue, 9 Feb 2016 00:09:27 +0000 (11:09 +1100)] 
metadump: bounds check btree block regions being zeroed

Arkadiusz Miskiewicz reported that metadump was crashing on one of
his corrupted filesystems, and the trace indicated that it was
zeroing unused regions in inode btree blocks when it failed. The
btree block had a corrupt nrecs field, which was resulting in an out
of bounds memset() occurring.  Ensure that the region being
generated for zeroing is within bounds before executing the zeroing.
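
A hedged sketch of the kind of check added (variable names
hypothetical; the real code derives the region from the btree header
and its nrecs field):

        /* zero_off/zero_len are derived from a possibly-corrupt nrecs */
        if (zero_off > mp->m_sb.sb_blocksize ||
            zero_len > mp->m_sb.sb_blocksize - zero_off)
                return;                 /* out of bounds: skip the zeroing */
        memset((char *)block + zero_off, 0, zero_len);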

Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agometadump: clean up btree block region zeroing
Dave Chinner [Tue, 9 Feb 2016 00:09:19 +0000 (11:09 +1100)] 
metadump: clean up btree block region zeroing

Abstract out all the common operations in the btree block zeroing
so that we only have one copy of offset/len calculations and don't
require lots of casts for the pointer arithmetic.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_quota: fix up timer command help
Eric Sandeen [Thu, 4 Feb 2016 21:43:02 +0000 (08:43 +1100)] 
xfs_quota: fix up timer command help

The timer command help output suggests that either default, or a
specific ID, can be specified, but this is not the case.  Timers
are fs-wide for each quota type, not specific to the user.  So
remove the "-d|id|name" from the help output.  Also remove
"get" from the description, because it only sets these timers;
"state" returns their values.  Well actually just one of them,
but that's a different problem...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_quota: remove extra 30 seconds from time limit reporting
Eric Sandeen [Thu, 4 Feb 2016 21:42:53 +0000 (08:42 +1100)] 
xfs_quota: remove extra 30 seconds from time limit reporting

The addition of 30s makes little sense; it's only added to the
values reported via the userspace tool, and doesn't reflect the
actual values used in the kernel. Setting a time limit to
"00:30:00" and reporting back "00:30:30" makes little sense to me.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_repair: Fix untranslatable strings
Eric Sandeen [Thu, 4 Feb 2016 21:42:15 +0000 (08:42 +1100)] 
xfs_repair: Fix untranslatable strings

Don't substitute in "would" or "will" for %s here, it makes
it untranslatable.  Just spell it out twice like every other
message in the function.

Reported-by: Jakub Bogusz <qboosh@pld-linux.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs.xfs.8: Clarify mkfs vs. mount block size limits.
Eric Sandeen [Thu, 4 Feb 2016 21:42:03 +0000 (08:42 +1100)] 
mkfs.xfs.8: Clarify mkfs vs. mount block size limits.

The current explanation is a bit vague to the uninitiated;
clarify the text to make it obvious that while the mkfs.xfs
utility is able to create valid filesystems with block sizes
up to 65536, the Linux kernel can only mount filesystems with
page sized or smaller blocks.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agomkfs: factor finobt changes into min log size when formatting
Darrick J. Wong [Thu, 4 Feb 2016 21:41:42 +0000 (08:41 +1100)] 
mkfs: factor finobt changes into min log size when formatting

Since the finobt affects the size of the log reservations, we need to
be able to include its effects in the calculation of the minimum log
size.

(Not really a problem now, but adding rmapbt will give this one some
bite.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_io: detect the '-R' option in getopt
Darrick J. Wong [Thu, 4 Feb 2016 21:41:18 +0000 (08:41 +1100)] 
xfs_io: detect the '-R' option in getopt

Configure getopt to accept the -R argument to pwrite.
Accept -F and -B while we're at it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_io: print dedupe errors to stderr, not stdout
Darrick J. Wong [Thu, 4 Feb 2016 21:40:42 +0000 (08:40 +1100)] 
xfs_io: print dedupe errors to stderr, not stdout

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agolibxfs: move struct xfs_attr_shortform to xfs_da_format.h
Darrick J. Wong [Thu, 4 Feb 2016 21:40:17 +0000 (08:40 +1100)] 
libxfs: move struct xfs_attr_shortform to xfs_da_format.h

Move the shortform attr structure definition to the same place as the
other attribute structure definitions for consistency and also so that
xfs/122 verifies the structure size.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs_quota: use Q_XGETNEXTQUOTA for report and dump
Eric Sandeen [Thu, 4 Feb 2016 21:39:26 +0000 (08:39 +1100)] 
xfs_quota: use Q_XGETNEXTQUOTA for report and dump

Rather than a loop over getpwnam() etc, use the Q_XGETNEXTQUOTA
command to iterate through all active quotas in quota
reports and quota file dump.

If Q_XGETNEXTQUOTA fails, go back to the old way.
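
A hedged sketch of the iteration for user quotas (Q_XGETNEXTQUOTA and
struct fs_disk_quota come from the XFS quota uapi headers; dev and
report_one() are placeholders):

        struct fs_disk_quota    d;
        __u32                   id = 0;

        while (quotactl(QCMD(Q_XGETNEXTQUOTA, USRQUOTA), dev, id,
                        (void *)&d) == 0) {
                report_one(&d);         /* print this quota entry */
                id = d.d_id + 1;        /* resume the scan after this ID */
        }
        /* any error (e.g. on older kernels) falls back to the getpwnam() walk */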

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>