Dave Chinner [Wed, 13 Nov 2013 06:40:54 +0000 (06:40 +0000)]
libxfs: work around do_div() not handling 32 bit numerators
The libxfs dquot buffer code uses do_div() with a 32 bit numerator.
This gives incorrect results as do_div() passes the numerator by
reference as a pointer to a 64 bit value. Hence it does the division
using 32 bits of garbage and gives the wrong result.
As per Christoph's suggestion, we can kill the usage of do_div()
here completely and just do the division directly, both in userspace
and kernel space.
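A minimal sketch of the difference, for illustration only. do_div_sketch() is a hypothetical stand-in that mimics the by-reference, 64 bit numerator convention described above; it is not the kernel macro. The safe caller widens the 32 bit value into a real 64 bit variable first, while the direct version is what this change moves to.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for do_div() semantics: the numerator is passed by
 * reference as a 64 bit value, is replaced by the quotient, and the
 * remainder is returned.
 */
static uint32_t do_div_sketch(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/* Safe use: copy the 32 bit numerator into a genuine 64 bit variable. */
static uint32_t quot_via_do_div(uint32_t numerator, uint32_t base)
{
	uint64_t wide = numerator;

	do_div_sketch(&wide, base);
	return (uint32_t)wide;
}

/* With 32 bit operands there is nothing to work around: just divide. */
static uint32_t quot_direct(uint32_t numerator, uint32_t base)
{
	return numerator / base;
}
```

Passing the address of a 32 bit variable where a 64 bit one is expected is what reads the extra 32 bits of garbage; the direct division has no such hazard.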
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:53 +0000 (06:40 +0000)]
xfs_db: avoid libxfs buffer lookup warnings
xfs_db is unique in the way it can read the same blocks with
different lengths from disk, so we really need a way to avoid having
duplicate buffers in the cache. To handle this in a generic way,
introduce a "purge on compare failure" feature to libxfs.
What this feature does is instead of throwing a warning when a
buffer miscompare occurs (e.g. due to a length mismatch), it purges
the buffer that is in cache from the cache. We can do this safely in
the context of xfs_db because it always writes back changes made to
buffers before it releases the reference to the buffer. Hence we can
purge buffers directly from the lookup code without having to worry
about whether they are dirty or not.
Doing this purge on miscompare operation avoids the
problem that libxfs is currently warning about, and hence if the
feature flag is set then we don't need to warn about miscompares any
more. Hence the whole problem goes away entirely for xfs_db, without
affecting any of the other users of libxfs based IO.
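As a rough sketch of the idea (not the libxfs implementation), a lookup that purges on a length miscompare might look like the following. All names here, including the flag SKETCH_PURGE_ON_MISCOMPARE, are hypothetical, and the cache is reduced to a small array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_SLOTS		8
#define SKETCH_PURGE_ON_MISCOMPARE	0x1	/* hypothetical feature flag */

struct sketch_buf {
	bool		in_use;
	uint64_t	bno;	/* start block of the cached buffer */
	size_t		len;	/* buffer length in bytes */
};

struct sketch_cache {
	unsigned int		flags;
	unsigned int		warnings;	/* miscompare warnings issued */
	struct sketch_buf	slots[SKETCH_CACHE_SLOTS];
};

/*
 * Look up (bno, len). A hit needs both to match. On a length mismatch the
 * cached buffer is purged if the flag is set, otherwise it is left in place
 * and a warning is counted. Returns the matching buffer, or NULL if the
 * caller has to read from disk.
 */
static struct sketch_buf *
sketch_cache_lookup(struct sketch_cache *c, uint64_t bno, size_t len)
{
	for (int i = 0; i < SKETCH_CACHE_SLOTS; i++) {
		struct sketch_buf *bp = &c->slots[i];

		if (!bp->in_use || bp->bno != bno)
			continue;
		if (bp->len == len)
			return bp;
		if (c->flags & SKETCH_PURGE_ON_MISCOMPARE)
			bp->in_use = false;	/* safe only if never dirty here */
		else
			c->warnings++;
		return NULL;
	}
	return NULL;
}
```

The purge is only safe under the condition stated above: the caller must have written back any changes before dropping its reference.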
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:52 +0000 (06:40 +0000)]
xfs_db: use inode cluster buffers for inode IO
When we mount the filesystem inside xfs_db, libxfs is tasked with
reading some information from disk, such as root inodes. Because
libxfs does this inode reading, it uses inode cluster buffers to
read the inodes. xfs_db, OTOH, just uses FSB sized buffers to read
inodes, and hence xfs_db throws a warning when reading the root
inode block like so:
$ sudo xfs_db -c "sb 0" -c "p rootino" -c "inode 32" /dev/vda
Version 5 superblock detected. xfsprogs has EXPERIMENTAL support enabled!
Use of these features is at your own risk!
rootino = 32
7f59f20e6740: Badness in key lookup (length)
bp=(bno 0x20, len 8192 bytes) key=(bno 0x20, len 1024 bytes)
$
There is another way this can happen, and that is dumping raw data
from disk using either the "fsb NNN" or "daddr MMM" commands to dump
untyped information. This is always read in sector or filesystem
block units, and so will cause similar badness warnings.
To avoid this problem when reading inodes, teach xfs_db to read
inode clusters rather than individual filesystem blocks when asked to
read an inode.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:51 +0000 (06:40 +0000)]
db: re-enable write support for v5 filesystems.
As we can now verify and recalculate CRCs on IO, we can modify the
on-disk structures without corrupting the filesystem. This makes it
safe to turn write support on for v5 filesystems for the first time.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:50 +0000 (06:40 +0000)]
db: add a special attribute buffer verifier
Because we only have a single attribute type that is used for all
the attribute buffer types, we need to provide a special verifier
for the read code. That verifier needs to know all the attribute
types and when it finds one it knows about, switch to the correct
verifier and call it.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:49 +0000 (06:40 +0000)]
db: add a special directory buffer verifier
Because we only have a single directory type that is used for all
the different buffer types, we need to provide a special verifier
for the read code. That verifier needs to know all the directory
types and when it finds one it knows about, switch to the correct
verifier and call it.
We already do this for certain readahead cases in the directory
code, so there is precedent for this. If we don't find a magic
number we recognise, the verifier fails...
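A sketch of what such a dispatching verifier could look like. The magic values are placeholders rather than the real on-disk numbers, and every identifier is hypothetical; the point is only the shape: peek at the magic, pick the matching verifier, call it, and fail if nothing matches.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder magic numbers, NOT the real on-disk values. */
#define SKETCH_MAGIC_BLOCK	0x11111111u
#define SKETCH_MAGIC_DATA	0x22222222u
#define SKETCH_MAGIC_LEAF	0x33333333u

struct sketch_buf;
typedef int (*sketch_verify_fn)(struct sketch_buf *bp);

struct sketch_buf {
	const unsigned char	*data;
	sketch_verify_fn	verify;	/* verifier selected for this buffer */
};

/* Stand-ins for the per-format verifiers; each would check its own layout. */
static int verify_block(struct sketch_buf *bp) { (void)bp; return 0; }
static int verify_data(struct sketch_buf *bp) { (void)bp; return 0; }
static int verify_leaf(struct sketch_buf *bp) { (void)bp; return 0; }

/* Read a big-endian 32 bit value from the start of the buffer. */
static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Generic read verifier: switch to the format-specific one and call it. */
static int sketch_dir_read_verify(struct sketch_buf *bp)
{
	switch (get_be32(bp->data)) {
	case SKETCH_MAGIC_BLOCK:
		bp->verify = verify_block;
		break;
	case SKETCH_MAGIC_DATA:
		bp->verify = verify_data;
		break;
	case SKETCH_MAGIC_LEAF:
		bp->verify = verify_leaf;
		break;
	default:
		return -1;	/* unrecognised magic: verification fails */
	}
	return bp->verify(bp);
}
```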
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:48 +0000 (06:40 +0000)]
db: verify and calculate dquot CRCs
When we set the current IO cursor to point at a dquot block, verify
that the dquot CRC is intact. And prior to writing such an IO
cursor, calculate the dquot CRC.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:47 +0000 (06:40 +0000)]
db: verify and calculate inode CRCs
When we set the current IO cursor to point at an inode, verify that
the inode CRC is intact. And prior to writing such an IO cursor,
calculate the inode CRC.
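The verify-on-read, recalculate-on-write pattern can be sketched as below. This is an illustration under assumptions, not the xfs_db code: the checksum is a plain bitwise CRC32c, the CRC field is treated as zero while checksumming, and the result is stored little-endian at a caller-supplied offset. The helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c_update(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

static uint32_t crc32c(const unsigned char *p, size_t len)
{
	return crc32c_update(0xFFFFFFFFu, p, len) ^ 0xFFFFFFFFu;
}

/* Checksum the whole buffer as if the 4 byte CRC field contained zeroes. */
static uint32_t sketch_cksum(const unsigned char *buf, size_t len, size_t crc_off)
{
	static const unsigned char zero[4];
	uint32_t crc = 0xFFFFFFFFu;

	crc = crc32c_update(crc, buf, crc_off);
	crc = crc32c_update(crc, zero, sizeof(zero));
	crc = crc32c_update(crc, buf + crc_off + 4, len - crc_off - 4);
	return crc ^ 0xFFFFFFFFu;
}

/* Before writing: recalculate and store the CRC (assumed little-endian). */
static void sketch_update_crc(unsigned char *buf, size_t len, size_t crc_off)
{
	uint32_t crc = sketch_cksum(buf, len, crc_off);

	buf[crc_off + 0] = (unsigned char)(crc & 0xff);
	buf[crc_off + 1] = (unsigned char)((crc >> 8) & 0xff);
	buf[crc_off + 2] = (unsigned char)((crc >> 16) & 0xff);
	buf[crc_off + 3] = (unsigned char)((crc >> 24) & 0xff);
}

/* After reading: returns 1 if the stored CRC matches, 0 otherwise. */
static int sketch_verify_crc(const unsigned char *buf, size_t len, size_t crc_off)
{
	uint32_t stored = (uint32_t)buf[crc_off] |
			  ((uint32_t)buf[crc_off + 1] << 8) |
			  ((uint32_t)buf[crc_off + 2] << 16) |
			  ((uint32_t)buf[crc_off + 3] << 24);

	return stored == sketch_cksum(buf, len, crc_off);
}
```

Modifying any byte outside the CRC field after sketch_update_crc() makes sketch_verify_crc() fail, which is the property that makes writes safe only once the CRC is recalculated.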
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:46 +0000 (06:40 +0000)]
db: indicate if the CRC on a buffer is correct or not
When dumping metadata that has a CRC in it, output not only the CRC
but text to tell us whether the value is correct or not. Hence we
can see at a glance if there's something wrong or not.
Do this by peeking at the buffer attached to the current IO
context. If there was a CRC error, then it will be marked with an
EFSCORRUPTED error. Use this to determine what to output.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:45 +0000 (06:40 +0000)]
db: introduce verifier support into set_cur
To be able to use read and write verifiers, we need to pass the
verifier to the IO routines. We do this via the set_cur() function
used to trigger reading the buffer.
For most metadata types, there is only one type of verifier needed.
For these, we can simply add the verifier to the type table entry
for the given type and use that directly. This type entry is already
carried around by the IO context, so if we ever need to get it again
we have direct access to it in the context we'll be doing IO.
Only attach the verifiers to the v5 filesystem type table; there is
no need for them on v4 filesystems as we don't have to verify or
calculate CRCs for them.
There are some metadata types that have more than one buffer format,
or aren't based directly in buffers. For these, leave the type
table verifier NULL for now - these will need to be addressed
individually.
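To illustrate the type table arrangement described here, a simplified sketch follows. The structure and table names are hypothetical and the tables are cut down to three entries; what matters is that the v4 table carries no verifiers, the v5 table carries one per simple type, and multi-format types are left NULL.

```c
#include <stddef.h>
#include <string.h>

struct sketch_buf_ops {
	const char *name;	/* stands in for the read/write verifier pair */
};

struct sketch_typ {
	const char			*name;
	const struct sketch_buf_ops	*bops;	/* NULL: no verifier attached */
};

static const struct sketch_buf_ops sketch_agf_ops = { "agf" };
static const struct sketch_buf_ops sketch_sb_ops = { "sb" };

/* v4: no CRCs to verify or calculate, so no verifiers at all. */
static const struct sketch_typ sketch_v4_types[] = {
	{ "agf", NULL },
	{ "sb", NULL },
	{ "dir", NULL },
};

/* v5: one verifier per simple type; multi-format types stay NULL for now. */
static const struct sketch_typ sketch_v5_types[] = {
	{ "agf", &sketch_agf_ops },
	{ "sb", &sketch_sb_ops },
	{ "dir", NULL },
};

/* What a set_cur()-style caller would pass down to the IO routines. */
static const struct sketch_buf_ops *
sketch_type_ops(const struct sketch_typ *table, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(table[i].name, name) == 0)
			return table[i].bops;
	return NULL;
}
```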
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:44 +0000 (06:40 +0000)]
db: rewrite IO engine to use libxfs
Now that we have buffers and xfs_buf_maps, it is relatively easy to
convert the IO engine to use libxfs routines. This gets rid of
most of the differences between mapped and straight buffer reads,
and tracks xfs_bufs directly in the IO context that is being used.
This is not yet a perfect solution, as xfs_db does different sized
IOs for the same block range which will throw warnings like:
xfs_db> inode 64
7ffff7fde740: Badness in key lookup (length)
bp=(bno 0x40, len 8192 bytes) key=(bno 0x40, len 4096 bytes)
xfs_db>
This is when first displaying an inode in the root inode chunk.
These will need to be dealt with on a case by case basis.
Further, xfs_db can build up a large IO stack by the time it has run
to completion. If we don't unwind this IO stack before we shut down
the libxfs caches, metadump and other db programs will exit with
unreleased buffers and emit warnings like:
cache_purge: shake on cache 0x69e4f0 left 7 nodes!?
Hence we need to unwind the iostack as we shut down.
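The unwind itself amounts to popping until the stack is empty, releasing each buffer reference on the way. A toy model, with hypothetical names and a simple counter standing in for buffer references held in the cache:

```c
#include <stddef.h>

#define SKETCH_STACK_MAX	16

struct sketch_iocur {
	int	holds_buf;	/* non-zero while a cached buffer is referenced */
};

struct sketch_iostack {
	struct sketch_iocur	cur[SKETCH_STACK_MAX];
	int			depth;
	int			live_refs;	/* buffers still referenced */
};

/* Push a new IO cursor that takes a reference on a buffer. */
static int sketch_push(struct sketch_iostack *s)
{
	if (s->depth == SKETCH_STACK_MAX)
		return -1;
	s->cur[s->depth++].holds_buf = 1;
	s->live_refs++;
	return 0;
}

/* Pop the top cursor, dropping its buffer reference if it holds one. */
static void sketch_pop(struct sketch_iostack *s)
{
	if (s->depth == 0)
		return;
	s->depth--;
	if (s->cur[s->depth].holds_buf) {
		s->cur[s->depth].holds_buf = 0;
		s->live_refs--;
	}
}

/* Called at shutdown, before the caches are torn down. */
static void sketch_unwind(struct sketch_iostack *s)
{
	while (s->depth > 0)
		sketch_pop(s);
}
```

If sketch_unwind() is skipped, live_refs stays non-zero at shutdown, which corresponds to the "left 7 nodes" style of warning shown above.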
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:43 +0000 (06:40 +0000)]
libxfs: refactor libxfs_buf_read_map for xfs_db
xfs_db requires low level read/write buffer primitives that are the
equivalent of libxfs_readbufr/writebufr. The implementation of
libxfs_writebufr already handles discontiguous buffers, but there is
no equivalent libxfs_readbufr_map support in the code.
Refactor libxfs_readbuf_map into two parts - one that does the
buffer cache lookup, and the other that does the read IO. This
provides the implementation of libxfs_readbufr_map that is required
for xfs_db.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:42 +0000 (06:40 +0000)]
db: rewrite bbmap to use xfs_buf_map
Use the libxfs struct xfs_buf_map for recording the extent layout of
discontiguous buffers and convert the read/write to decode them
directly and use read_buf/write_buf to do the extent IO. This
brings the physical xfs_db IO code to be very close to the model
that libxfs uses.
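For illustration, reading a discontiguous buffer described by an extent map could be sketched like this. The structure mirrors the start/length idea of a buffer map but uses hypothetical names, assumes 512 byte basic blocks, and uses plain pread() rather than the xfs_db read_buf helper.

```c
#define _XOPEN_SOURCE 700
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BBSHIFT	9	/* assumed 512 byte basic blocks */

struct sketch_buf_map {
	int64_t	bm_bn;	/* start of the extent, in basic blocks */
	int	bm_len;	/* length of the extent, in basic blocks */
};

/*
 * Read each extent of a discontiguous buffer into consecutive regions of
 * buf. Returns 0 on success, -1 on a failed or short read.
 */
static int sketch_read_map(int fd, const struct sketch_buf_map *map,
			   int nmaps, char *buf)
{
	for (int i = 0; i < nmaps; i++) {
		size_t bytes = (size_t)map[i].bm_len << SKETCH_BBSHIFT;
		off_t off = (off_t)map[i].bm_bn << SKETCH_BBSHIFT;

		if (pread(fd, buf, bytes, off) != (ssize_t)bytes)
			return -1;
		buf += bytes;
	}
	return 0;
}
```

A contiguous read is then just the single-extent case, which is why decoding the map directly brings the two IO paths closer together.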
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:41 +0000 (06:40 +0000)]
db: separate out straight buffer IO from map based IO.
Libxfs has two different interfaces for getting and reading buffers.
The first is a block/length interface for reading contiguous
regions, and the second is based on extent based xfs_buf_map arrays
for discontiguous regions. The xfs_db code is solely based on a
basic block array interface regardless of the type of region being
read, and so doesn't match to either libxfs interface.
As a first step to converting xfs_db to the libxfs interfaces, add a
simple block/length buffer API and implement it using pread/pwrite.
Then remove the single region conditionals from the basic block array
based interfaces, and convert all the contiguous block read cases to
use the new API.
This new API is temporary - it will be replaced by the equivalent
libxfs interface calls once all the infrastructure preparation for
the changeover has been completed.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Currently libxfs has a cache for xfs_inode structures. Unlike in kernelspace,
where the inode cache and the associated page cache for file data are used
for all filesystem operations, the libxfs inode cache is only used in a few
places:
- the libxfs init code reads the root and realtime inodes when called from
xfs_db using a special flag, but these inode structures are never referenced
again
- mkfs uses namespace and bmap routines that take the xfs_inode structure
to create the root and realtime inodes, as well as any additional files
specified in the proto file
- the xfs_db attr code uses xfs_inode-based attr routines in the attrset
and attrget commands
- phase6 of xfs_repair uses xfs_inode-based routines for rebuilding
directories and moving files to the lost+found directory.
- phase7 of xfs_repair uses struct xfs_inode to modify the nlink count
of inodes.
So except in repair we never ever reuse a cached inode, and even in repair
the logical inode caching doesn't help:
- in phase 6a we iterate over each inode in the incore inode tree,
and if it's a directory check/rebuild it
- phase6b then updates the "." and ".." entries for directories
that need it, which means we require the backing buffers.
- phase6c moves disconnected inodes to lost_found, which again needs
the backing buffer to actually do anything.
- phase7 then only touches inodes for which we need to reset i_nlink,
which always involves reading, modifying and writing the physical
inode.
which always involves modifying the . and .. entries.
Given these facts stop caching the inodes to reduce memory usage
especially in xfs_repair, where this makes a difference for large inode
count filesystems. On the upper end this allows repair to complete for
filesystem / amount of memory combinations that previously wouldn't.
With this we probably could increase the memory available to the buffer
cache in xfs_repair, but trying to do so I got a bit lost - the current
formula seems too magic to me to make any sense, and simply doubling the
buffer cache size causes us to run out of memory given that the data cached
in the buffer cache (typically lots of 8k inode buffers and few 4k other
metadata buffers) are much bigger than the inodes cached in the inode
cache. We probably need a sizing scheme that takes the actual amount
of memory allocated to the buffer cache into account to solve this better.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:39 +0000 (06:40 +0000)]
libxfs: fix root inode handling inconsistencies
When "mounting" a filesystem via libxfs_mount(), callers can tell
libxfs to read the root and realtime inodes into cache. However,
when unmounting the filesystem, libxfs_unmount() used to
unconditionally free root inodes if they were present.
This leads to interesting issues like in mkfs, when it handles
creation, reading and freeing of the root and rt inodes itself.
It, however, passes in the flag to tell libxfs_mount() to read the
root inodes and so can result in unbalanced freeing of inodes when
cleaning up during the unmount procedure.
As it turns out, nothing ever uses mp->m_rootip and so we don't need
to read it in or free it, or even have a pointer to it in the struct
xfs_mount. Similarly, the only user of the realtime inodes is mkfs,
and it initialises them itself. Hence we can kill the m_rootip and
the realtime inode mounting code.
This leaves one user of LIBXFS_MOUNT_ROOTINOS - xfs_db - and that is
only used to initialise the in-core superblock counter values from
the ag header for xfs_check. Move this code to the xfs_db init
functions so we can get rid of the mount parameter previously used
to trigger all these behaviours (LIBXFS_MOUNT_ROOTINOS) completely.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:37 +0000 (06:40 +0000)]
xfs: fix node forward in xfs_node_toosmall
When a node is considered for a merge with a sibling, it overwrites the
sibling pointers of the original incore nodehdr with the sibling's
pointers. This leads to the loop considering the original node as a merge
candidate with itself in the second pass, and so it incorrectly
determines a merge should occur.
Dave Chinner [Wed, 13 Nov 2013 06:40:36 +0000 (06:40 +0000)]
xfs: fix the wrong new_size/rnew_size at xfs_iext_realloc_direct()
At xfs_iext_realloc_direct(), the new_size is changed by adding
if_bytes if originally the extent records are stored at the inline
extent buffer, and we have to switch from it to a direct extent
list for those newly allocated extents. This is wrong.
This patch fixes the above problem and revises the new_size comments at
xfs_iext_realloc_direct() to make it more readable. Also, fix the
comments while switching from the inline extent buffer to a direct
extent list to reflect this change.
Dave Chinner [Wed, 13 Nov 2013 06:40:34 +0000 (06:40 +0000)]
libxfs: Minor cleanup and bug fix sync
These bring all the small single line comment, whitespace and minor
code differences into sync with the kernel code. Anything left at
this point is an intentional difference.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:33 +0000 (06:40 +0000)]
libxfs: bring across inode buffer readahead verifier changes
These were made for log recovery readahead in the kernel, so are not
directly used in userspace. Hence bringing the change across is
simply to keep files in sync.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:32 +0000 (06:40 +0000)]
libxfs: xfs_rtalloc.c becomes xfs_rtbitmap.c
To match the split-up of the kernel xfs_rtalloc.c file, convert the
libxfs version of xfs_rtalloc.c to match the newly shared kernel
source file with all the realtime bitmap functions in it,
xfs_rtbitmap.c.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:31 +0000 (06:40 +0000)]
libxfs: bmap btree owner swap support
For CRC enabled filesystems, we can't just swap inode forks from one
inode to another when defragmenting a file - the blocks in the inode
fork bmap btree contain pointers back to the owner inode. Hence if
we are to swap the inode forks we have to atomically modify every
block in the btree during the transaction.
This patch brings across the kernel code for doing the owner
swap of an entire fork - something that we are likely to end up
needing in xfs_repair when reparenting stray inodes to lost+found -
without all the associated swap extents transaction and recovery
cruft as those parts are not needed in userspace.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:30 +0000 (06:40 +0000)]
libxfs: unify xfs_btree.c with kernel code
The libxfs/xfs_btree.c code does not contain a small amount of code
for btree block readahead that the kernel code does. Instead, it
short circuits it at a higher layer and doesn't include the lower
layer functions. There is no harm in calling the lower layer functions
and having them do nothing, and doing so unifies the kernel and
userspace code.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:29 +0000 (06:40 +0000)]
xfs: decouple inode and bmap btree header files
Currently the xfs_inode.h header has a dependency on the definition
of the BMAP btree records as the inode fork includes an array of
xfs_bmbt_rec_host_t objects in its definition.
Move all the btree format definitions from xfs_btree.h,
xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
xfs_format.h to continue the process of centralising the on-disk
format definitions. With this done, the xfs inode definitions are no
longer dependent on btree header files.
This enables a massive culling of unnecessary includes, with close to
200 #include directives removed from the XFS kernel code base.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:28 +0000 (06:40 +0000)]
xfs: split dquot buffer operations out
Parts of userspace want to be able to read and modify dquot buffers
(e.g. xfs_db) so we need to split out the reading and writing of
these buffers so it is easy to share code with libxfs in userspace.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:27 +0000 (06:40 +0000)]
xfs: create a shared header file for format-related information
All of the buffer operations structures are needed to be exported
for xfs_db, so move them all to a common location rather than
spreading them all over the place. They are verifying the on-disk
format, so while xfs_format.h might be a good place, it is not part
of the on disk format.
Hence we need to create a new header file that we centralise these
related definitions. Start by moving the buffer operations
structures, and then also move all the other definitions that have
crept into xfs_log_format.h and xfs_format.h as there was no other
shared header file to put them in.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 13 Nov 2013 06:40:25 +0000 (06:40 +0000)]
xfsprogs: fix automatic dependency generation
Adding or removing a header file does not result in dependency
regeneration like it should. 'make clean' will rebuild the
dependencies, but a normal 'make' won't. Fix it.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Rich Johnston [Tue, 22 Oct 2013 15:15:20 +0000 (10:15 -0500)]
Revert "[RESEND, 4/7] xfsprogs: xfsio: add support FALLOC_FL_COLLAPSE_RANGE for fallocate"
This reverts commit e64190f8440286a815060524777b435e06a7b364 until we
have the fallocate API support merged into the kernel. The kernel
code is still under review.
Dave Chinner [Mon, 30 Sep 2013 03:15:21 +0000 (03:15 +0000)]
xfs: unify directory/attribute format definitions
The on-disk format definitions for the directory and attribute
structures are spread across 3 header files right now, only one of
which is dedicated to defining on-disk structures and their
manipulation (xfs_dir2_format.h). Pull all the format definitions
into a single header file - xfs_da_format.h - and switch all the
code over to point at that.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Li Zhong [Tue, 15 Oct 2013 02:55:31 +0000 (02:55 +0000)]
xfsprogs: fix resource leak in longform_dir2_rebuild()
Coverity scan 997010 reported the following leak:
1309 if (error) {
1310 do_warn(
1311 _("space reservation failed (%d), filesystem may be out of space\n"),
1312 error);
25. Breaking from loop
1313 break;
1314 }
CID 997010 (#1 of 1): Resource leak (RESOURCE_LEAK)
26. leaked_storage: Variable "tp" going out of scope leaks the storage it points to.
1345}
Though not reported by Coverity, it seems that there might be some entries in
flist which need to be freed in the failure cases below libxfs_dir_createname()
and libxfs_bunmapi().
The fix cleans up the code by stacking the error handling at the end of the
function, and jumping to the error handler label for the above cases. (fail
directly by calling res_failed() for reservation failure.)
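The "stack the error handling at the end" pattern can be sketched as follows. This is a generic illustration with hypothetical names, not the longform_dir2_rebuild() code: every failure after the allocation jumps to a single label, so no path can leave the function still holding the transaction.

```c
#include <stdlib.h>

struct sketch_trans {
	int	dirty;
};

static int sketch_live_trans;	/* allocated but not yet released */

static struct sketch_trans *sketch_trans_alloc(void)
{
	struct sketch_trans *tp = calloc(1, sizeof(*tp));

	if (tp)
		sketch_live_trans++;
	return tp;
}

/* Both commit and cancel consume the transaction. */
static void sketch_trans_commit(struct sketch_trans *tp)
{
	free(tp);
	sketch_live_trans--;
}

static void sketch_trans_cancel(struct sketch_trans *tp)
{
	free(tp);
	sketch_live_trans--;
}

/*
 * fail_step simulates which operation fails: 0 for none, 1 for the first
 * step, 2 for the second. Returns 0 or a negative error.
 */
static int sketch_rebuild(int fail_step)
{
	struct sketch_trans *tp;
	int error;

	tp = sketch_trans_alloc();
	if (!tp)
		return -1;

	error = (fail_step == 1) ? -1 : 0;	/* first operation */
	if (error)
		goto out_cancel;

	error = (fail_step == 2) ? -2 : 0;	/* second operation */
	if (error)
		goto out_cancel;

	sketch_trans_commit(tp);
	return 0;

out_cancel:
	sketch_trans_cancel(tp);
	return error;
}
```

Compared with breaking out of a loop on error, the single exit label makes the cleanup obligation visible in one place.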
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Eric Sandeen [Fri, 18 Oct 2013 17:59:36 +0000 (17:59 +0000)]
xfs_repair: add d_type when moving files to lost+found
When we move disconnected inodes to lost+found, they aren't
assigned a dtype. Fix this by just setting XFS_DIR3_FT_UNKNOWN
for now. If the files are moved out of lost+found, the type
will be properly set at that time.
When repair gains more type knowledge we could use xfs_mode_to_ftype[]
to set the proper type when moved, but right now it's not a big
deal; UNKNOWN will suffice for files in lost+found, and prevents
us from using an uninitialized value.
Eric Sandeen [Thu, 12 Sep 2013 20:56:36 +0000 (20:56 +0000)]
xfs_repair: test for bad level in dir2 node
In traverse_int_dir2block(), the variable 'i' is the level in
the tree, with 0 being a leaf node. In the "do" loop we
start at the root, and work our way down to a leaf.
If the first node we read is an interior node with NODE_MAGIC,
but it tells us that its level is 0 (a leaf), this is clearly
an inconsistency.
Worse, we'd return with success, bno set, and only level[0]
in the cursor initialized. Then down this path we'll
segfault when accessing an uninitialized (and zeroed) member
of the cursor's level array:
Eric Sandeen [Thu, 17 Oct 2013 17:50:16 +0000 (17:50 +0000)]
xfs_repair: avoid segfault if reporting progress early in repair
For a very large filesystem, zeroing the log may take some time.
If we ask for progress reports frequently enough that one fires
before we finish with log zeroing, we try to use a progress format
which has not yet been set up, and segfault:
# mkfs.xfs -d size=60t,file,name=fsfile
# xfs_repair -m 9000 -o ag_stride=32 -t 1 fsfile
Phase 1 - find and verify superblock...
- reporting progress in intervals of 1 seconds
Phase 2 - using internal log
- zero log...
Segmentation fault
(gdb) bt
#0 0x0000000000426962 in progress_rpt_thread (p=0x67ad20) at progress.c:234
#1 0x0000003b98a07851 in start_thread (arg=0x7f19d8e47700) at pthread_create.c:301
#2 0x0000003b982e767d in ?? ()
#3 0x0000000000000000 in ?? ()
(gdb) p msgp
$1 = (msg_block_t *) 0x67ad20
(gdb) p msgp->format
$2 = (progress_rpt_t *) 0x0
(gdb)
I suppose we could rig up progress reports for log zeroing, but
that won't usually take terribly long; for now, be defensive
and init the message->format to NULL, and just return early
from the progress thread if we've not yet set up any message.
(Sure, global_msgs is global, and ->format is already NULL,
but to me it's worth being explicit since we will test it).
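The defensive check amounts to something like the sketch below. The structure and function names are hypothetical simplifications of the message block shown in the gdb session; only the NULL test before use reflects the fix described.

```c
#include <stddef.h>

struct sketch_format {
	const char	*fmt;
};

struct sketch_msg {
	const struct sketch_format	*format;	/* NULL until a phase sets it */
	int				reports;	/* reports emitted so far */
};

/* Be explicit: no progress format exists until one is installed. */
static void sketch_msg_init(struct sketch_msg *m)
{
	m->format = NULL;
	m->reports = 0;
}

/* One timer tick: returns 1 if a report was emitted, 0 if skipped. */
static int sketch_progress_tick(struct sketch_msg *m)
{
	if (m->format == NULL)
		return 0;	/* fired before setup (e.g. during log zeroing) */
	m->reports++;
	return 1;
}
```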
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
xfsprogs' config.guess/config.sub are out of date for the forthcoming
arm64 port. The attached patch sets things up so that you don't have to
be bothered by this type of bug for future ports.
* Use the autotools-dev dh addon to update config.guess/config.sub for
arm64.
Signed-off-by: Colin Watson <cjwatson@ubuntu.com> Reviewed-by: Nathan Scott <nathans@debian.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Eric Sandeen [Tue, 8 Oct 2013 15:17:50 +0000 (15:17 +0000)]
xfsprogs: restrict platform_test_xfs_fd to regular files
If a special file (block, char, pipe etc) resides on an
xfs filesystem, platform_test_xfs_[fd|path] will return
true, but a subsequent xfsctl will fail, because the file
operations to support the xfs ioctls are not set up on such
files (see i_fop assignments in xfs_setup_inode()).
From the xfsctl manpage it's pretty clear that these functions
are supposed to return true if a subsequent xfsctl can be
handled, so it makes sense to exclude special files.
This was showing up in xfstest generic/306, which creates
the dev/null block device on an xfstest and tries to pwrite
to it with xfs_io - which emitted a warning when the xfsctl
trying to get geometry failed.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Eric Sandeen [Mon, 7 Oct 2013 17:35:16 +0000 (17:35 +0000)]
xfsprogs: remove incorrect l_sectBBsize assignment in xfs_repair
Commit e0607266 xfsprogs: add crc format support to repair
added a 2nd assignment to l_sectBBsize:
log.l_sectBBsize = 1 << mp->m_sb.sb_logsectlog;
which is incorrect; sb_logsectlog is log2 of the sector size,
in bytes; l_sectBBsize is the size of the log sector in
512-byte units.
So for a 4k sector size log, we were assigning 4096 rather
than 8. This broke xlog_find_tail, and caused xfs_repair
to think that a log was dirty even when it was clean:
"ERROR: The filesystem has valuable metadata changes in a log"
(xfs_logprint didn't have this error, so xfs_logprint -t
agreed that the filesystem really was clean).
Just remove the incorrect assignment; it was already properly
assigned about 12 lines prior:
log.l_sectBBsize = BTOBB(x.lbsize);
and things work again.
(This worked accidentally for 512-byte sector devices, because
we special-case those and set sb_logsectlog to "0" rather
than 9, so l_sectBBsize came out to "1" (as in 1 sector),
as it should have).
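The arithmetic can be checked with a small sketch. The helper names are hypothetical; sketch_btobb() follows the usual bytes-to-basic-blocks rounding with 512 byte basic blocks, which is what BTOBB() is assumed to do here.

```c
#include <stdint.h>

#define SKETCH_BBSHIFT	9			/* 512 byte basic blocks */
#define SKETCH_BBSIZE	(1u << SKETCH_BBSHIFT)

/* Bytes to basic blocks, rounding up. */
static uint32_t sketch_btobb(uint32_t bytes)
{
	return (bytes + SKETCH_BBSIZE - 1) >> SKETCH_BBSHIFT;
}

/* The removed assignment: yields a size in bytes, not in basic blocks. */
static uint32_t sketch_wrong_sectbbsize(uint8_t logsectlog)
{
	return 1u << logsectlog;
}
```

For a 4k log sector, sb_logsectlog is 12, so the removed assignment produced 4096 where sketch_btobb(4096) gives 8. With sb_logsectlog set to 0, the wrong formula yields 1, which happens to equal sketch_btobb(512).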
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Eric Sandeen [Mon, 30 Sep 2013 17:01:19 +0000 (17:01 +0000)]
xfsprogs: handle symlinks etc in fs_table_initialise_mounts()
Commit:
6a23747d xfs_quota: support relative path as `path' arguments
used realpath() on the supplied pathname to handle things like
relative pathnames and pathnames ending in "/" which otherwise
caused the getmntent scanning to fail.
However, this regressed cases where a path in mtab was a symlink;
realpath() resolves this to the target, and so no match is found.
xfs_quota: cannot setup path for mount /dev/mapper/testvg-testlv: No such device or address
because the scanning looks for /dev/dm-3, but the long symlink
name is what exists in mtab, and no match is found.
Fix this, but keep the intended enhancements, by testing *both* the
user-specified path (which might be relative, or contain a trailing
slash on a mountpoint) and the realpath-resolved path (which turns
a relative mountpoint into a full path, and removes trailing slashes),
to determine whether the user-specified path is an xfs mountpoint or
device.
While we're at it, add a few comments, and go back to the testing
of "path" not "rpath"; whether or not path is passed to the function
is what determines control flow. If path is specified, and realpath
succeeds, we're guaranteed to have rpath as well, so there is no need
to retest that. rpath is initialized to NULL, so an unconditional
free(rpath) is safe as well.
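The matching rule can be sketched as a pure function. The names are hypothetical and the mtab entry is reduced to its mount directory and device name; treating a NULL path as "match everything" is an assumption about the no-argument case, not something stated above.

```c
#include <stddef.h>
#include <string.h>

/* NULL-safe string equality. */
static int sketch_streq(const char *a, const char *b)
{
	return a && b && strcmp(a, b) == 0;
}

/*
 * path:  what the user typed (may be relative, a symlink, or end in "/")
 * rpath: the realpath()-resolved form of path, or NULL
 * Returns 1 if either form names this mount's directory or device.
 */
static int sketch_path_matches(const char *path, const char *rpath,
			       const char *mnt_dir, const char *mnt_fsname)
{
	if (!path)
		return 1;	/* assumed: no path means accept every entry */

	return sketch_streq(path, mnt_dir) ||
	       sketch_streq(path, mnt_fsname) ||
	       sketch_streq(rpath, mnt_dir) ||
	       sketch_streq(rpath, mnt_fsname);
}
```

With path "/dev/mapper/testvg-testlv" and rpath "/dev/dm-3", an mtab entry listing the long symlink name matches through path even though rpath alone would miss it.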
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Mon, 30 Sep 2013 03:15:20 +0000 (03:15 +0000)]
xfs: create a shared header file for format-related information
All of the buffer operations structures are needed to be exported
for xfs_db, so move them all to a common location rather than
spreading them all over the place. They are verifying the on-disk
format, so while xfs_format.h might be a good place, it is not part
of the on disk format.
Hence we need to create a new header file that we centralise these
related definitions. Start by moving the buffer operations
structures, and then also move all the other definitions that have
crept into xfs_log_format.h and xfs_format.h as there was no other
shared header file to put them in.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Mon, 30 Sep 2013 03:15:19 +0000 (03:15 +0000)]
xfs: dirent dtype presence is dependent on directory magic numbers
The determination of whether a directory entry contains a dtype
field originally was dependent on the filesystem having CRCs
enabled. This meant that the format for dtype being enabled could be
determined by checking the directory block magic number rather than
doing a feature bit check. This was useful in that it meant that we
didn't need to pass a struct xfs_mount around to functions that
were already supplied with a directory block header.
Unfortunately, the introduction of dtype fields into the v4
structure via a feature bit meant this "use the directory block
magic number" method of discriminating the dirent entry sizes is
broken. Hence we need to convert the places that use magic number
checks to use feature bit checks so that they work correctly and not
by chance.
The current code works on v4 filesystems only because the dirent
size roundup covers the extra byte needed by the dtype field in the
places where this problem occurs.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Mon, 30 Sep 2013 03:15:17 +0000 (03:15 +0000)]
xfs: ensure we copy buffer type in da btree root splits
When splitting the root of the da btree, we shuffle data between
buffers and the structures that track them. At one point, we copy
data and state from one buffer to another, including the ops
associated with the buffer. When we do this, we also need to copy
the buffer type associated with the buf log item so that the buffer
is logged correctly. If we don't do that, log recovery won't
recognise it and hence it won't recalculate the CRC on the buffer
after recovery. This leads to a directory block that can't be read
after recovery has run.
Found by inspection after finding the same problem with remote
symlink buffers.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Mon, 30 Sep 2013 03:15:16 +0000 (03:15 +0000)]
xfs: check magic numbers in dir3 leaf verifier first
Calling xfs_dir3_leaf_hdr_from_disk() in a verifier before
validating the magic numbers in the buffer results in ASSERT
failures due to mismatching magic numbers when a corruption occurs.
Seeing as the verifier is supposed to catch the corruption and pass
it back to the caller, having the verifier assert fail on error
defeats the purpose of detecting the errors in the first place.
Check the magic numbers direct from the buffer before decoding the
header.
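The ordering described above can be sketched as follows. This is a minimal illustration, not the libxfs verifier: the ex_ names, the magic values and the four-byte header layout are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic values and header layout, for illustration only. */
#define EX_LEAF_MAGIC	0xABCDu
#define EX_LEAF3_MAGIC	0xABCEu

struct ex_leaf_hdr {
	uint16_t magic;
	uint16_t count;
};

/* Read a big-endian 16-bit value straight from the raw buffer. */
static uint16_t ex_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Check the magic number directly in the buffer before decoding the
 * header, so a corrupt block is reported to the caller as an error
 * rather than tripping an assertion inside the decoder.
 */
static bool ex_leaf_verify(const uint8_t *buf, size_t len,
			   struct ex_leaf_hdr *out)
{
	uint16_t magic;

	if (len < 4)
		return false;

	magic = ex_be16(buf);
	if (magic != EX_LEAF_MAGIC && magic != EX_LEAF3_MAGIC)
		return false;	/* corruption: fail the verifier cleanly */

	/* Only now is it safe to decode the rest of the header. */
	out->magic = magic;
	out->count = ex_be16(buf + 2);
	return true;
}
```

A caller passes the raw block and treats a false return as a corruption error to hand back up the stack.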
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Li Zhong [Thu, 26 Sep 2013 06:48:12 +0000 (06:48 +0000)]
xfsprogs: fix return value of verify_set_primary_sb()
If get_sb() fails because of EOF, it will return with retval 1, which will
then be interpreted as XR_BAD_MAGIC("bad magic number") in phase1() when
warning the user.
This patch fixes it by using XR_EOF here, so that it is interpreted correctly.
Also change the associated comments about the return value.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Li Zhong [Thu, 26 Sep 2013 06:45:32 +0000 (06:45 +0000)]
[v3, 1/2] xfsprogs: fix potential memory leak in verify_set_primary_sb()
If verify_set_primary_sb() completes the secondary sb scanning loop with
too few valid secondaries found (num_ok < num_sbs / 2), it will immediately
return without freeing any of the previously allocated memory (variables
sb, checked, and any items on the geo list). This was reported by
the Coverity scanner as CID 997012, 997013 and 997014.
Fix this by using the out_free_list: goto target for this error case.
Earlier, if get_sb() fails in the secondary scan loop, it goes to
the out: target which does not free any items on the geo list. Fix
this by using the out_free_list: target as well, and remove the now-unused
out: target.
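The single-exit cleanup pattern this fix relies on looks roughly like the sketch below. It is not the code from repair/sb.c: ex_scan(), struct ex_geo and the num_readable parameter are hypothetical stand-ins for the secondary superblock scan.

```c
#include <stdlib.h>

/* Hypothetical list node standing in for an entry on the geo list. */
struct ex_geo {
	struct ex_geo *next;
};

/* Free every node on a singly linked list. */
static void ex_free_list(struct ex_geo *list)
{
	while (list) {
		struct ex_geo *next = list->next;

		free(list);
		list = next;
	}
}

/*
 * Hypothetical scan loop. Every failure after the first allocation,
 * including "too few valid entries", leaves through out_free_list so
 * that nothing is leaked. Returns 0 on success, 1 on failure.
 */
static int ex_scan(int num_sbs, int num_readable)
{
	struct ex_geo *list = NULL;
	int *checked;
	int num_ok = 0;
	int retval = 1;
	int i;

	checked = calloc((size_t)num_sbs, sizeof(*checked));
	if (!checked)
		return 1;	/* nothing else allocated yet */

	for (i = 0; i < num_sbs; i++) {
		struct ex_geo *g;

		if (i >= num_readable)
			continue;	/* this entry could not be validated */

		g = malloc(sizeof(*g));
		if (!g)
			goto out_free_list;
		g->next = list;
		list = g;
		checked[i] = 1;
		num_ok++;
	}

	if (num_ok < num_sbs / 2)
		goto out_free_list;	/* an early return here would leak */

	retval = 0;
out_free_list:
	ex_free_list(list);
	free(checked);
	return retval;
}
```

With one label that frees everything, adding a new error path cannot reintroduce the leak as long as it jumps to that label.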
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Li Zhong [Wed, 18 Sep 2013 09:40:42 +0000 (09:40 +0000)]
xfsprogs: fix potential memory leak in repair/sb.c
The following resource leak is reported by Coverity:
CID 997011 (#1 of 1): Resource leak (RESOURCE_LEAK)
6. leaked_storage: Variable "buf" going out of scope leaks the storage it points to.
505 return(XR_EOF);
Add a free(buf) to solve it.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Carlos Maiolino [Fri, 30 Aug 2013 17:09:50 +0000 (17:09 +0000)]
mkfs: add noalign option to usage()
Although it has been added to the manpage, usage() gives no information
about the existence of the noalign option.
Changelog:
V2: Remove space in comma separated options
V3: Aligned the option with the other alignment options to make mutually
exclusive options more visible
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Eric Sandeen [Fri, 30 Aug 2013 03:55:16 +0000 (03:55 +0000)]
xfsprogs: avoid array overflow in pf_batch_read()
The while loop in pf_batch_read, and the code preceding it, is really...
quite a thing. I'd love to rewrite it, but I haven't yet found
a particularly cleaner way.
It cleverly hides the fact that we might increment "num" past the
last index of bplist[] and then assign to it. This corrupts memory.
Rather than major surgery for now, just go for the simple fix,
and break out of the loop if we've increased "num" past the
last index.
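The shape of the simple fix can be sketched as below. This is not pf_batch_read() itself: EX_BATCH_MAX and ex_fill_batch() are hypothetical, and only the "test the index before storing" idea is taken from the description above.

```c
#include <stddef.h>

#define EX_BATCH_MAX 4	/* hypothetical capacity of the batch array */

/*
 * Copy items from src into batch. The index is tested before every
 * store, so "num" is never used one past the last valid slot.
 * Returns the number of items stored.
 */
static size_t ex_fill_batch(const int *src, size_t nsrc,
			    int batch[EX_BATCH_MAX])
{
	size_t num = 0;
	size_t i;

	for (i = 0; i < nsrc; i++) {
		if (num >= EX_BATCH_MAX)
			break;	/* array is full: stop instead of overflowing */
		batch[num++] = src[i];
	}
	return num;
}
```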
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Li Zhong [Tue, 27 Aug 2013 01:58:34 +0000 (01:58 +0000)]
xfsprogs: fix Out-of-bounds access in repair/dinode.c
On Mon, 2013-08-26 at 12:20 -0500, Eric Sandeen wrote:
> On 8/23/13 11:38 AM, Ben Myers wrote:
> > Hey Rich and Li Zhong,
> >
> > On Wed, Aug 21, 2013 at 11:51:11AM -0500, Rich Johnston wrote:
> >> Looks good, thanks for the patch Li Zhong. it has been committed.
> >>
> >> --Rich
> >>
> >> Reviewed-by: Rich Johnston <rjohnston@sgi.com>
> >>
> >> commit e7c05095f5baa9cd2e35a6de03d7dd9f51dd3910
> >> Author: Li Zhong <zhong@linux.vnet.ibm.com>
> >> Date: Mon Aug 12 06:11:01 2013 +0000
> >>
> >> xfsprogs: fix Out-of-bounds access in repair/dinode.c
> >>
> >> On 08/12/2013 01:11 AM, Li Zhong wrote:
> >>> Following is reported by coverity in bug 1061528:
> >>>
> >>> 187 __dirty_no_modify_ret(dirty);
> >>>
> >>> CID 1061528 (#1 of 1): Out-of-bounds access (OVERRUN)53. overrun-buffer-arg: Overrunning array "dinoc->di_pad" of 6 bytes by passing it to a function which accesses it at byte offset 15 using argument "16UL".
> >>> 188 memset(dinoc->di_pad, 0, 16);
> >>>
> >>> It seems that di_pad here should be di_pad2, as sekharan pointed out.
> >>>
> >>> Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
> >>> ---
> >>> repair/dinode.c | 4 ++--
> >>> 1 file changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/repair/dinode.c b/repair/dinode.c
> >>> index e607f0b..94bf2f8 100644
> >>> --- a/repair/dinode.c
> >>> +++ b/repair/dinode.c
> >>> @@ -183,9 +183,9 @@ clear_dinode_core(struct xfs_mount *mp, xfs_dinode_t *dinoc, xfs_ino_t ino_num)
> >>> }
> >>>
> >>> for (i = 0; i < 16; i++) {
> >>> - if (dinoc->di_pad[i] != 0) {
> >>> + if (dinoc->di_pad2[i] != 0) {
> >>> __dirty_no_modify_ret(dirty);
> >>> - memset(dinoc->di_pad, 0, 16);
> >>> + memset(dinoc->di_pad2, 0, 16);
> >>> break;
> >>> }
> >>> }
> >
> > We also discussed this issue a bit in this thread:
> > http://oss.sgi.com/archives/xfs/2013-08/msg00228.html
> >
> > Looks like the loop itself is incorrect and should be removed, and Eric has
> > suggested that the conditional be changed to a memcmp in case the size of the
> > pad changes in the future. Would either of you care to spin up another patch
> > to clean it up?
>
> I think I was confused; it seems fine as it is in git, not sure what I was
> thinking.
>
> memcmp can't use a bare "0" as an arg, so it's not ideal to use either.
>
> Not a huge fan of the hard-coded 16, but I think the code is correct now; we
> can probably move on to real problems. ;)
OK :) Or maybe we could improve it with the calculation using sizeof as
below(which I posted in another thread)?
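The patch referred to is not reproduced in this log. One way a sizeof-based version could look, assuming a 16-byte di_pad2 field and using the hypothetical names struct ex_dinode and ex_clear_pad():

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the inode core; only the pad field matters. */
struct ex_dinode {
	unsigned char di_pad2[16];
};

/*
 * Zero the padding if any byte in it is set. Both the loop bound and
 * the memset length come from sizeof, so they follow the field if its
 * size ever changes. Returns true when the structure was modified.
 */
static bool ex_clear_pad(struct ex_dinode *d)
{
	size_t i;

	for (i = 0; i < sizeof(d->di_pad2); i++) {
		if (d->di_pad2[i] != 0) {
			memset(d->di_pad2, 0, sizeof(d->di_pad2));
			return true;
		}
	}
	return false;
}
```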
Dave Chinner [Wed, 4 Sep 2013 22:05:59 +0000 (22:05 +0000)]
xfsprogs: cleanup miscellaneous merge faults
* clean up a few extra tabs
* xfs_buf_map->xfs_buf_ops in libxfs_readbuf and libxfs_readbuf_map args
* don't call the write verifier twice
* put the multithreaded scan_ags back
Signed-off-by: Ben Myers <bpm@sgi.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:58 +0000 (22:05 +0000)]
repair: fix segv on directory block read failure
We try to read all blocks in the directory, but if we have a block
form directory we only have one block and so we need to fail if
there is a read error. Otherwise we try to dereference a null buffer
pointer.
While fixing the error handling for a read failure, fix the bug that
caused the read failure - trying to verify a block format buffer
with the data format buffer verifier.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:57 +0000 (22:05 +0000)]
xfs: inode log reservations are too small
We've been seeing occasional problems with log space leaks and
transaction underruns such as this for some time:
XFS (dm-0): xlog_write: reservation summary:
trans type = FSYNC_TS (36)
unit res = 2740 bytes
current res = -4 bytes
total reg = 0 bytes (o/flow = 0 bytes)
ophdrs = 0 (ophdr space = 0 bytes)
ophdr + reg = 0 bytes
num regions = 0
Turns out that xfstests generic/311 is reliably reproducing this
problem with the test it runs at sequence 16 of its execution. It is
a 100% reliable reproducer with the mkfs configuration of "-b
size=1024 -m crc=1" on a 10GB scratch device.
The problem? Inode forks in btree format are logged in memory
format, not disk format (i.e. bmbt format, not bmdr format). That
means there is a btree block header being logged, when such a
structure is never written to the inode fork in bmdr format. The
bmdr header in the inode is only 4 bytes, while the bmbt header is
24 bytes for v4 filesystems and 72 bytes for v5 filesystems.
We currently reserve the inode size plus the rounded up overhead of
logging a buffer, which is 128 bytes. That means the reservation
for a 512 byte inode is 640 bytes. What we can actually log is:
inode core, data and attr fork = 512 bytes
inode log format + log op header = 56 + 12 = 68 bytes
data fork bmbt hdr = 24/72 bytes
attr fork bmbt hdr = 24/72 bytes
So, for a v2 inode we can log at least 628 bytes, but if we split that
inode over the end of the log across log buffers, we also need
another log op header, which takes us to 640 bytes. If there's
another reservation taken out of this that I haven't taken into
account (perhaps multiple iclog splits?) or I haven't correctly
calculated the bmbt format space used (entirely possible), then
we will overrun it.
For v3 inodes the maximum is actually 724 bytes, and even a
single maximally sized btree format fork can blow it (652 bytes).
And that's exactly what is happening with the FSYNC_TS transaction
in the above output - it's consumed 644 bytes of space after the CIL
context took the space reserved for it (2100 bytes).
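The arithmetic above can be checked with a few lines of C. The sizes are the ones quoted in this message; the ex_ names are made up for the sketch.

```c
/* Sizes quoted in the commit message; names are illustrative. */
enum {
	EX_INODE_SIZE	= 512,	/* inode core, data and attr fork */
	EX_ILF_SIZE	= 56,	/* inode log format */
	EX_OPHDR_SIZE	= 12,	/* log op header */
	EX_BMBT_HDR_V4	= 24,	/* bmbt header, v4 filesystems */
	EX_BMBT_HDR_V5	= 72,	/* bmbt header, v5 filesystems */
	EX_OLD_RES	= 512 + 128	/* inode size + rounded buffer overhead */
};

/* Maximum bytes logged for an inode with nforks btree format forks. */
static int ex_max_logged(int bmbt_hdr, int nforks)
{
	return EX_INODE_SIZE + EX_ILF_SIZE + EX_OPHDR_SIZE +
	       nforks * bmbt_hdr;
}

/* Space left in a ticket after the CIL context share and the usage. */
static int ex_remaining(int unit_res, int cil_res, int used)
{
	return unit_res - cil_res - used;
}
```

ex_max_logged(EX_BMBT_HDR_V4, 2) gives 628, and one extra op header brings it to exactly the 640 byte reservation. ex_max_logged(EX_BMBT_HDR_V5, 2) gives 724 and a single v5 fork gives 652, both over 640. ex_remaining(2740, 2100, 644) gives -4, which matches the "current res" line in the output above.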
This problem has always been present in the XFS code - the btree
format inode forks have always been logged in this manner. Hence
there has always been the possibility of an overrun with such a
transaction. The CRC code has just exposed it frequently enough to
be able to debug and understand the root cause....
So, let's fix all the inode log space reservations.
[ I'm so glad we spent the effort to clean up the transaction
reservation code. This is an easy fix now. ]
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:56 +0000 (22:05 +0000)]
xfs: btree block LSN escaping to disk uninitialised
When testing LSN ordering code for v5 superblocks, it was discovered
that the LSN embedded in the generic btree blocks was
occasionally uninitialised. These values didn't get written to disk
by metadata writeback - they got written by previous transactions in
log recovery.
The issue here is that when the block is first allocated and
initialised, the LSN field was not initialised - it gets overwritten
before IO is issued on the buffer - but the value that is logged by
transactions that modify the header before it is written to disk
(and initialised) contains garbage. Hence the first recovery of the
buffer will stamp garbage into the LSN field, and that can cause
subsequent transactions to not replay correctly.
The fix is simply to initialise the bb_lsn field to zero when we
initialise the block for the first time.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:55 +0000 (22:05 +0000)]
xfs: fix calculation of the number of node entries in a dir3 node
The calculation doesn't take into account the size of the dir v3
header, so overestimates the hash entries in a node. This causes
directory buffer overruns when splitting and merging nodes.
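A rough sketch of the corrected calculation. The function name and the example sizes (4096 byte block, 16 byte old header, 64 byte v3 header, 8 byte entry) are assumptions for illustration, not values taken from this commit.

```c
#include <stddef.h>

/*
 * Number of hash entries that fit in a node block once the header
 * actually in use has been subtracted. Passing the smaller old-format
 * header size for a v3 block is the overestimate described above.
 */
static size_t ex_node_ents(size_t blksize, size_t hdr_size, size_t ent_size)
{
	if (ent_size == 0 || blksize < hdr_size)
		return 0;
	return (blksize - hdr_size) / ent_size;
}
```

With the assumed sizes, ex_node_ents(4096, 16, 8) is 510 while ex_node_ents(4096, 64, 8) is 504, so using the old header size on a v3 block would allow six entries past the end of the buffer.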
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:54 +0000 (22:05 +0000)]
xfs: di_flushiter considered harmful
When we made all inode updates transactional, we no longer needed
the log recovery detection for inodes being newer on disk than the
transaction being replayed - it was redundant as replay of the log
would always result in the latest version of the inode being on
disk. It was redundant, but left in place because it wasn't
considered to be a problem.
However, with the new "don't read inodes on create" optimisation,
flushiter has come back to bite us. Essentially, the optimisation
always initialises flushiter to zero in the create transaction,
and so if we then crash and run recovery and the inode already on
disk has a non-zero flushiter it will skip recovery of that inode.
As a result, log recovery does the wrong thing and we end up with a
corrupt filesystem.
Because we have to support old kernel to new kernel upgrades, we
can't just get rid of the flushiter support in log recovery as we
might be upgrading from a kernel that doesn't have fully transactional
inode updates. Unfortunately, for v4 superblocks there is no way to
guarantee that log recovery knows about this fact.
We cannot add a new inode format flag to say it's a "special inode
create" because it won't be understood by older kernels and so
recovery could do the wrong thing on downgrade. We cannot specially
detect the combination of zero mode/non-zero flushiter on disk to
non-zero mode, zero flushiter in the log item during recovery
because wrapping of the flushiter can result in false detection.
Hence that makes this "don't use flushiter" optimisation limited to
a disk format that guarantees that we don't need it. And that means
the only fix here is to limit the "no read IO on create"
optimisation to version 5 superblocks....
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:53 +0000 (22:05 +0000)]
xfsprogs: add dtype support to mkfs and db
Now that we have an extra field in the dirent, add support into
xfs_db to be able to view it when looking at directory structures.
Add support to mkfs to create filesystems with filetype - we'll
always set it on CRC enabled filesystems so all new v5 filesystems
will have this functionality enabled.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:52 +0000 (22:05 +0000)]
xfs: Add write support for dirent filetype field
Add support to propagate and add filetype values into the on-disk
dirents. This involves passing the filetype into the xfs_da_args
structure along with the name and namelength for dirent operations,
and encoding it into the dirent at the same time we write the inode
number into the dirent.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Tue, 10 Sep 2013 21:34:23 +0000 (21:34 +0000)]
[47/55,V2] xfs: Add read-only support for dirent filetype field
Add support for the file type field in directory entries so that
readdir can return the type of the inode the dirent points to to
userspace without first having to read the inode off disk.
The encoding of the type field is a single byte that is added to the
end of the directory entry name length. For all intents and
purposes, it appends a "hidden" byte to the name field which
contains the type information. As the directory entry is already of
dynamic size, helpers are already required to access and decode the
directory entry structures.
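The "hidden byte" encoding can be sketched with a simplified entry model. The layout assumed here (8 byte inode number, 1 byte name length, the name, an optional type byte, a 2 byte tag, all rounded up to 8 bytes) and the ex_ helper names are assumptions for illustration, not the libxfs definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_ALIGN	8u	/* assumed entry alignment */
#define EX_FIXED	11u	/* assumed: 8 (inumber) + 1 (namelen) + 2 (tag) */

/* Size of one entry, with or without the hidden type byte. */
static size_t ex_entsize(uint8_t namelen, bool has_ftype)
{
	size_t raw = EX_FIXED + namelen + (has_ftype ? 1u : 0u);

	return (raw + EX_ALIGN - 1) / EX_ALIGN * EX_ALIGN;
}

/* The type byte sits immediately after the last byte of the name. */
static uint8_t ex_get_ftype(const uint8_t *name, uint8_t namelen)
{
	return name[namelen];
}

static void ex_put_ftype(uint8_t *name, uint8_t namelen, uint8_t ftype)
{
	name[namelen] = ftype;
}
```

Under this model a 1 byte name takes 16 bytes with or without the type byte because the roundup absorbs it, while a 5 byte name grows from 16 to 24 bytes. This is why size helpers need to know whether the field is present.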
Hence the relevant extraction and iteration helpers are updated to
understand the hidden byte. Helpers for reading and writing the
filetype field from the directory entries are also added. Only the
read helpers are used by this patch. It also adds all the code
necessary to read the type information out of the dirents on disk.
Further we add the superblock feature bit and helpers to indicate
that we understand the on-disk format change. This is not a
compatible change - existing kernels cannot read the new format
successfully - so an incompatible feature flag is added. We don't
yet allow filesystems to mount with this flag - that will be
added once write support is added.
Finally, the code to take the type from the VFS, convert it to an
XFS on-disk type and put it into the xfs_name structures passed
around is added, but the directory code does not use this field yet.
That will be in the next patch.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:50 +0000 (22:05 +0000)]
xfs: Add xfs_log_rlimit.c
Add source files for xfs_log_rlimit.c. The new file is used for log
size calculations and validation shared with userspace.
[dchinner: xfs_log_calc_max_attrsetm_res() does not modify the
tr_attrsetm reservation, just calculates the maximum. ]
[dchinner: rework loop in xfs_log_get_max_trans_res() ]
[dchinner: implement xfs_log_calc_unit_res() in util.c to give mkfs
a worst case calculation of the log size needed. ]
Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:49 +0000 (22:05 +0000)]
xfs: Get rid of all XFS_XXX_LOG_RES() macro
Get rid of all XFS_XXX_LOG_RES() macros since they are now obsolete.
Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:48 +0000 (22:05 +0000)]
xfs: refactor xfs_trans_reserve() interface
Now that the new xfs_trans_res structure has been introduced, the log
reservation size, log count as well as log flags are pre-initialized
at mount time. So it's time to refine the xfs_trans_reserve() interface
to be neater.
Also, introduce a new helper M_RES() to return a pointer to the
mp->m_resv structure to simplify the input.
Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:47 +0000 (22:05 +0000)]
xfs: Make writeid transaction use tr_writeid
tr_writeid is defined in the mp->m_resv structure; however, it is not
actually used where it should be.
This patch changes the writeid transaction to use tr_writeid so that
it fetches the correct log reservation size.
Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:46 +0000 (22:05 +0000)]
xfs: Introduce tr_fsyncts to m_reservation
A preparation step.
For now the fsync_ts transaction uses the pre-calculated log reservation
size of tr_swrite.
This patch introduces a new item tr_fsyncts to the mp->m_reservations
structure so that we can fetch the log reservation value for it
in the same manner as the others.
Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:45 +0000 (22:05 +0000)]
xfs: Introduce a new structure to hold transaction reservation items
Introduce a new structure xfs_trans_res to hold transaction
reservation item info per log ticket.
We also need to improve xfs_trans_resv_calc() by initializing the
log count as well as log flags for permanent log reservation.
Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:44 +0000 (22:05 +0000)]
xfs: make struct xfs_perag kernel only
The struct xfs_perag has many kernel-only definitions in it,
requiring a __KERNEL__ guard so userspace can use it too. Move it to
xfs_mount.h so that it is kernel-only, and let userspace redefine
its own version of the structure containing only what it needs.
This gets rid of another __KERNEL__ check in the XFS header files.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:43 +0000 (22:05 +0000)]
xfs: move kernel specific type definitions to xfs.h
xfs_types.h is shared with userspace, so having kernel specific
types defined in it is problematic. Move all the kernel specific
defines to xfs_linux.h so we can remove the __KERNEL__ guards from
xfs_types.h
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:42 +0000 (22:05 +0000)]
xfs: remove __KERNEL__ check from xfs_dir2_leaf.c
It's actually an ifndef section, which means it is only included in
userspace. However, it's deep within the libxfs code, so it's
unlikely that the condition checked in userspace can actually occur
(search an empty leaf) through the libxfs interfaces. i.e. if it can
happen in userspace, it can happen in the kernel, so remove it from
userspace too....
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:41 +0000 (22:05 +0000)]
xfs: remove __KERNEL__ from debug code
There is no reason the remaining kernel-only debug code needs to
remain kernel-only. Kill the __KERNEL__ part of the defines, and let
userspace handle the debug code appropriately.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:40 +0000 (22:05 +0000)]
xfs: kill __KERNEL__ check for debug code in allocation code
Userspace running debug builds is relatively rare, so there's no need
to special case the allocation algorithm code coverage debug switch.
As it is, userspace defines random numbers to 0, so invert the
logic of the switch so it is effectively a no-op in userspace.
This kills another couple of __KERNEL__ users.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:39 +0000 (22:05 +0000)]
xfs: move swap extent code to xfs_extent_ops
Swapping extents is clearly an extent operation, and it is not
shared with userspace. Move the code to xfs_extent_ops.[ch], and
the userspace ioctl structure definition to xfs_fs.h where most of
the other ioctl structure definitions are. This means xfs_dfrag.h is
no longer needed in userspace.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:38 +0000 (22:05 +0000)]
xfs: don't special case shared superblock mounts
Neither kernel nor userspace supports shared read-only mounts, so
don't bother special casing the support check to be different
between kernel and userspace. The same check can be used as neither
likes it...
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:37 +0000 (22:05 +0000)]
xfsprogs: sync minor kernel header differences
There are lots of little differences between kernel and userspace
headers noticeable now that the files are largely the same. Clean up
all the formatting, whitespace and other minor differences in the
userspace headers.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:36 +0000 (22:05 +0000)]
xfs: create xfs_bmap_util.[ch]
There is a bunch of code in xfs_bmap.c that is kernel specific and
not shared with userspace. To minimise the difference between the
kernel and userspace code, shift this unshared code to
xfs_bmap_util.c, and the declarations to xfs_bmap_util.h.
The biggest issue here is xfs_bmap_finish() - userspace has its own
definition of this function, and so we need to move it out of
xfs_bmap.[ch]. This means several other files need to include
xfs_bmap_util.c as well.
It also introduces an interesting dance for the stack switching
code in xfs_bmapi_allocate(). The stack switching/workqueue code is
actually moved to xfs_bmap_util.c, so that userspace can simply use
a #define in a header file to connect the dots without needing to
know about the stack switch code at all.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Tue, 10 Sep 2013 21:32:49 +0000 (21:32 +0000)]
libxfs: switch over to xfs_sb.c and remove xfs_mount.c
Now that the kernel code has split the superblock specific code out
of xfs_mount.c, we don't need xfs_mount.c anymore. Copy in xfs_sb.c
and remove xfs_mount.c
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:34 +0000 (22:05 +0000)]
xfs: split out the remote symlink handling
The remote symlink format definition and manipulation needs to be
shared with userspace, but the in-kernel interfaces do not. Split
the remote symlink format handling out into xfs_symlink_remote.[ch]
so it can easily be shared with userspace.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Review-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:33 +0000 (22:05 +0000)]
xfs: introduce xfs_inode_buf.c for inode buffer operations
The only thing remaining in xfs_inode.[ch] are the operations that
read, write or verify physical inodes in their underlying buffers.
Move all this code to xfs_inode_buf.[ch] so we can stop sharing
xfs_inode.[ch] with userspace.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Review-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:31 +0000 (22:05 +0000)]
xfs: move inode fork definitions to a new header file
The inode fork definitions are a combination of on-disk format
definition and in-memory tracking and manipulation. They are both
shared with userspace, so move them all into their own file so
sharing is easy to do and track. This removes all inode fork
related information from xfs_inode.h.
Do the same for all the C code that currently resides in
xfs_inode.c for the same reason.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Review-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:30 +0000 (22:05 +0000)]
libxfs: move transaction code to trans.c
There is very little code left in xfs_trans.c. So little it is not
worth trying to share this file with kernel space any more. Move the
code to libxfs/trans.c, and remove libxfs/xfs_trans.c.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Review-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:29 +0000 (22:05 +0000)]
libxfs: introduce xfs_trans_resv.c
The log space reservation calculation code has been separated from
the core transaction code in kernelspace. This means we can add it
here in preparation for removing xfs_trans.c to further reduce the
differences between kernel and userspace files.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Review-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:28 +0000 (22:05 +0000)]
xfs: introduce xfs_quota_defs.h
There are a lot of quota flag definitions that are shared by user
and kernel space. Move them all to xfs_quota_defs.h so we can
unshare xfs_quota.h and remove the __KERNEL__ regions from it.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Review-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>
Dave Chinner [Wed, 4 Sep 2013 22:05:27 +0000 (22:05 +0000)]
xfs: introduce xfs_rtalloc_defs.h
There are quite a few realtime device definitions shared with
userspace. Move them from xfs_rtalloc.h to xfs_rtalloc_defs.h
so we don't need to share xfs_rtalloc.h with userspace anymore.
This removes the final __KERNEL__ region from the XFS kernel
codebase. Yay!
Signed-off-by: Dave Chinner <dchinner@redhat.com> Review-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Rich Johnston <rjohnston@sgi.com>