Dave Chinner [Wed, 6 Dec 2017 23:14:27 +0000 (17:14 -0600)]
mkfs: rework stripe calculations
The data and log stripe calculations are spaghettied all over the mkfs
code. This patch pulls all of the different chunks of code together
into calc_stripe_factors() and removes all the redundant/repeated
checks and calculations that are made.
Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Dave Chinner [Wed, 6 Dec 2017 23:14:27 +0000 (17:14 -0600)]
mkfs: factor sectorsize validation
Start factoring all the sector size validation code into a
function that takes cli, dft and cfg structures. This starts
removing option flags and some of the temporary code in the input
parsing structures.
Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Dave Chinner [Wed, 6 Dec 2017 23:14:27 +0000 (17:14 -0600)]
mkfs: introduce default configuration structure
mkfs has lots of options that require default values. Some of these
are centralised, but others aren't. Introduce a new structure
designed to hold default values for all the parameters that need
defaults in one place.
This structure also provides a mechanism for providing mkfs defaults
from a config file. This is not implemented in this series, but a
comment is left where it is expected this functionality will hook
in.
Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Dave Chinner [Wed, 6 Dec 2017 23:14:27 +0000 (17:14 -0600)]
mkfs: factor writing AG headers
There are some slight changes to the way log alignment is calculated
in the change. Instead of using a flag, it checks the log start
block to see if it's different to the first free block in the log
AG, and if it is different then does the aligned setup. This means
we no longer have to care if the log is aligned or not, the code
will do the right thing in all cases.
Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
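The flag-free alignment check described above can be sketched roughly as follows. This is only an illustration: struct log_layout and both helper names are hypothetical and are not the mkfs structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the two block numbers involved. */
struct log_layout {
	uint32_t first_free_agbno;	/* first free block in the AG holding the log */
	uint32_t log_start_agbno;	/* block where the log actually starts */
};

/* The aligned setup is needed only if the log does not start at the first free block. */
static bool log_needs_aligned_setup(const struct log_layout *l)
{
	assert(l->log_start_agbno >= l->first_free_agbno);
	return l->log_start_agbno != l->first_free_agbno;
}

/* Free blocks left in front of the log; zero when no alignment gap exists. */
static uint32_t blocks_before_log(const struct log_layout *l)
{
	assert(l->log_start_agbno >= l->first_free_agbno);
	return l->log_start_agbno - l->first_free_agbno;
}
```

Because the decision is derived from the block numbers themselves, no separate "log is aligned" flag has to be carried around.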
Dave Chinner [Wed, 6 Dec 2017 23:14:27 +0000 (17:14 -0600)]
mkfs: factor out device preparation
Prior to formatting the device(s), we have to take several steps to
prepare them and check that they are appropriate for the formatting
that is about to take place. Pull all this into a single function
that is run before mounting the libxfs infrastructure.
Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Dave Chinner [Wed, 6 Dec 2017 23:14:27 +0000 (17:14 -0600)]
mkfs: Introduce mkfs configuration structure
Formatting the on disk XFS structures requires a certain set of
validated and calculated parameters. By the time we start writing
information to disk this has all been done. Abstract this information
out into a separate structure and initialise it with all the
calculated parameters so we can factor the mkfs formatting code
to use it.
Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Dave Chinner [Wed, 6 Dec 2017 23:14:27 +0000 (17:14 -0600)]
mkfs: add generic subopt parsing table
Abstract out the common subopt parsing code into a common function
and type table so we can factor the parsing code. Add the function
stubs in preparation for factoring.
Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
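A table-driven suboption parser of the kind described above might look like the sketch below. The table layout, the entry names and parse_subopt() are assumptions made for illustration; they are not the definitions used in mkfs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table entry: option letter, suboption name, value parser. */
struct opt_entry {
	char		opt;
	const char	*name;
	int		(*parse)(const char *value, long *out);
};

/* Parse a plain decimal number; reject empty values and trailing junk. */
static int parse_number(const char *value, long *out)
{
	char *end;
	long v = strtol(value, &end, 10);

	if (end == value || *end != '\0')
		return -1;
	*out = v;
	return 0;
}

/* Illustrative table; a const table needs no casts at the call sites. */
static const struct opt_entry opt_table[] = {
	{ 'b', "size",    parse_number },
	{ 'd', "agcount", parse_number },
	{ 's', "size",    parse_number },
};

/* Parse one "name=value" suboption belonging to option letter opt. */
static int parse_subopt(char opt, const char *arg, long *out)
{
	const char *eq;
	size_t i, namelen;

	assert(arg != NULL && out != NULL);
	eq = strchr(arg, '=');
	if (eq == NULL)
		return -1;
	namelen = (size_t)(eq - arg);
	for (i = 0; i < sizeof(opt_table) / sizeof(opt_table[0]); i++) {
		const struct opt_entry *e = &opt_table[i];

		if (e->opt == opt && strlen(e->name) == namelen &&
		    strncmp(e->name, arg, namelen) == 0)
			return e->parse(eq + 1, out);
	}
	return -1;	/* unknown suboption for this option letter */
}
```

Each option letter shares the same lookup loop, so adding a suboption means adding a table row rather than another hand-written parsing branch.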
Dave Chinner [Wed, 6 Dec 2017 23:14:27 +0000 (17:14 -0600)]
mkfs: introduce a structure to hold CLI options
We need to hold the values set from command line options so they can
later be validated and discriminated from the default values that
might be set. This structure will form a connector between the input
parsing and the rest of the mkfs code.
Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
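One way to picture the role of the CLI structure is the sketch below, which keeps a "was it set" bit next to each value so that a later step can tell a user choice from a default. All three structures and resolve_blocksize() are hypothetical miniatures, not the structures introduced by this series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniatures of the three roles: command line, defaults, final config. */
struct cli_params {
	bool		blocksize_set;	/* true only if the user gave a value */
	unsigned int	blocksize;
};

struct dft_params {
	unsigned int	blocksize;
};

struct cfg_params {
	unsigned int	blocksize;
	bool		blocksize_from_cli;
};

/* Pick the user's value when present, otherwise fall back to the default. */
static void resolve_blocksize(const struct cli_params *cli,
			      const struct dft_params *dft,
			      struct cfg_params *cfg)
{
	assert(dft->blocksize != 0);
	if (cli->blocksize_set) {
		cfg->blocksize = cli->blocksize;
		cfg->blocksize_from_cli = true;
	} else {
		cfg->blocksize = dft->blocksize;
		cfg->blocksize_from_cli = false;
	}
}
```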
Dave Chinner [Wed, 6 Dec 2017 23:14:27 +0000 (17:14 -0600)]
mkfs: make subopt table const
Use const for all the tables to remove most of the (char **) casts.
This adds a couple of temporary (const char **) casts that go away
as the input parsing is factored.
Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Dave Chinner [Wed, 6 Dec 2017 23:14:27 +0000 (17:14 -0600)]
mkfs: disallow specifying the sector size of internal log
If the log is on the data device (i.e. internal) then it should
match the sector size the data device is using. If they don't match,
then one or the other doesn't have atomic sector writes and we could
have crash consistency problems. Not to mention that it's simply
wrong to have two different sector sizes for the same device.
Hence enforce the requirement that an internal log device always has
the same sector size as the data device.
Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
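The rule above could be enforced along the lines of this sketch. struct sector_params and validate_log_sectsize() are hypothetical, and rejecting any explicitly given log sector size for an internal log is an assumption based on the commit title.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical parameters; the real code works on the mkfs cli/cfg structures. */
struct sector_params {
	bool		log_is_internal;
	bool		log_sectsize_given;	/* user explicitly set a log sector size */
	unsigned int	data_sectsize;
	unsigned int	log_sectsize;
};

/* Returns 0 on success, -1 if the combination is rejected. */
static int validate_log_sectsize(struct sector_params *p)
{
	assert(p->data_sectsize != 0);
	if (!p->log_is_internal)
		return 0;	/* an external log may have its own sector size */
	if (p->log_sectsize_given) {
		fprintf(stderr,
			"cannot specify a sector size for an internal log\n");
		return -1;
	}
	/* Internal log: always inherit the data device sector size. */
	p->log_sectsize = p->data_sectsize;
	return 0;
}
```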
Darrick J. Wong [Wed, 6 Dec 2017 15:17:08 +0000 (09:17 -0600)]
xfs_db: add missing padding fields
Several data structures are missing padding fields from their field
definitions. Add them so that they can be printed out if explicitly
requested.
Fix the AGI field order to be consistent with the structure
definition while we're at it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
[sandeen: make them available but not printed by default] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
We are very inconsistent about how we print padding fields in on-disk
structures -- sometimes we hide it from printall, sometimes we deviate
from unsigned hex values, etc. Make this all consistent -- always hide
padding values when printing the whole structure, always print them as
unsigned hex integers when explicitly requested.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
[sandeen: switch to never-print instead of always-print] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 6 Dec 2017 15:17:08 +0000 (09:17 -0600)]
xfs_repair: remove old workqueue stuff in favor of libfrog code
Now that we've made a generic workqueue in libfrog, we can remove the
implementation in xfs_repair and turn the old functions into wrappers
that call do_error if they fail. There are no functional changes in
this patch, though some of the names and types have changed.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 6 Dec 2017 15:17:08 +0000 (09:17 -0600)]
libhandle: add missing destructor
Make it so that we can tear down the file descriptor hash table.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 6 Dec 2017 15:17:08 +0000 (09:17 -0600)]
libfrog: add missing function fs_table_destroy
Add a function to tear down the fs_table when we're done
messing with paths.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 6 Dec 2017 15:17:08 +0000 (09:17 -0600)]
libfrog: move paths.c out of libxcmd
Move the fs_table code into libfrog since it's not really a command.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 6 Dec 2017 15:17:08 +0000 (09:17 -0600)]
libfrog: move conversion factors out of libxcmd
Move all the conversion functions out of libxcmd since they'll be used
by scrub, which doesn't have a commandline.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 6 Dec 2017 15:17:08 +0000 (09:17 -0600)]
libfrog: move topology code out of libxcmd
Move the filesystem topology code out of libxcmd.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 6 Dec 2017 15:17:08 +0000 (09:17 -0600)]
libfrog: create a threaded workqueue
Create a thread pool that queues and runs discrete work items. This
will be a namespaced version of the pool in repair/threads.c; a
subsequent patch will switch repair over. xfs_scrub will use the
generic thread pool.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 6 Dec 2017 15:17:08 +0000 (09:17 -0600)]
libfrog: promote avl64 code from xfs_repair
xfs_scrub will make use of the avl64 code, so promote it out of repair
and into libfrog.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 6 Dec 2017 15:17:07 +0000 (09:17 -0600)]
libfrog: move list_sort out of libxfs
List operations aren't really a part of libxfs, so move them to libfrog.
This is purely a directory tree restructuring; no functional changes,
though some indentation fixes are included.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 6 Dec 2017 15:17:07 +0000 (09:17 -0600)]
libfrog: add bit manipulation functions
Duplicate the libxfs bit manipulation functions -- this is for programs
that don't need libxfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 6 Dec 2017 15:17:07 +0000 (09:17 -0600)]
libfrog: move libxfs_log2_roundup to libfrog
Move libxfs_log2_roundup to libfrog and remove the 'libxfs_' prefix.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 6 Dec 2017 15:17:07 +0000 (09:17 -0600)]
libfrog: move all the userspace support stuff into a new library
This library is meant to contain all the Funny Random Other Gunk that
the xfsprogs utilities rely on. Move all that stuff into this library
to reduce the pollution in the other libraries.
Ribbit! Ribbit!
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 6 Dec 2017 15:17:07 +0000 (09:17 -0600)]
man: describe the metadata scrubbing ioctl
Document the XFS-specific metadata scrub/repair ioctl's behavior,
arguments, and side effects.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 6 Dec 2017 15:17:07 +0000 (09:17 -0600)]
xfs_io: provide an interface to the scrub ioctls
Create a new xfs_io command to call the new XFS metadata scrub ioctl.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Wed, 6 Dec 2017 15:17:07 +0000 (09:17 -0600)]
xfs_io: add buf_lru_ref tag to inject table
And catch it at build time if we get out of sync again.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 6 Dec 2017 15:17:07 +0000 (09:17 -0600)]
xfs_io: pull xfs errortag definitions from libxfs
Use the libxfs definitions, don't provide our own.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Nikolay Borisov [Wed, 6 Dec 2017 15:17:07 +0000 (09:17 -0600)]
xfs_io: implement ranged fiemap query
Currently the fiemap implementation of xfs_io doesn't support making
ranged queries. This patch implements two optional arguments which
take the starting offset and the length of the region to be queried.
When the end of the requested region falls within an extent boundary
then we print the whole extent (i.e. return all the information that
the kernel has given us). When the end offset falls within a hole
then the printed hole range is truncated to the requested one since
we do not have information about how long the hole is.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
[sandeen: simplify/rewrite ranged logic] Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
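The end-of-range rule described above (print a trailing extent in full, clamp a trailing hole to the request) can be sketched as follows. The function and parameter names are hypothetical; this is not the xfs_io code, which was rewritten during review.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * End offset to print for the last item of a ranged query.
 * extent_end is only meaningful when last_is_hole is false.
 */
static uint64_t last_printed_end(bool last_is_hole, uint64_t extent_end,
				 uint64_t req_start, uint64_t req_len)
{
	uint64_t req_end = req_start + req_len;

	assert(req_len > 0);
	if (last_is_hole)
		return req_end;	/* hole length is unknown: clamp to the request */
	return extent_end;	/* extent: report what the kernel returned, in full */
}
```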
Dave Chinner [Wed, 6 Dec 2017 15:17:07 +0000 (09:17 -0600)]
xfs_io: fix gcc-7 related printf warnings
New compiler, new checks, new warnings.
Fix the new [-Wformat-truncation=] warnings that io/fsmap.c is
throwing w/ gcc-7.2 because "%lld..%lld" requires a buffer 40
characters long, not 32.
Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
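Independently of the compile-time warning, truncation of such a range string can also be detected at run time from the snprintf() return value. A minimal sketch, assuming a hypothetical helper named format_range():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Format "start..end" into buf; returns false if the text did not fit. */
static bool format_range(char *buf, size_t len, long long start, long long end)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len, "%lld..%lld", start, end);
	/* snprintf returns the length it wanted to write, excluding the NUL. */
	return n >= 0 && (size_t)n < len;
}
```

Sizing the buffer for the widest possible pair of values, as the fix does, makes the failure case unreachable.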
And move them to xfs_linux.h so that xfsprogs can stub them out more
easily.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[sandeen: stub them out in xfsprogs] Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Found the issue with kmemleak.
unreferenced object 0xffff8800674611c0 (size 16):
xfs_iext_insert+0x82a/0xa90 [xfs]
xfs_bmap_add_extent_hole_delay+0x1e5/0x5b0 [xfs]
xfs_bmapi_reserve_delalloc+0x483/0x530 [xfs]
xfs_file_iomap_begin+0xac8/0xd40 [xfs]
iomap_apply+0xb8/0x1b0
iomap_file_buffered_write+0xac/0xe0
xfs_file_buffered_aio_write+0x198/0x420 [xfs]
xfs_file_write_iter+0x23f/0x2a0 [xfs]
__vfs_write+0x23e/0x340
vfs_write+0xe9/0x240
SyS_write+0xa1/0x120
do_syscall_64+0xda/0x260
Signed-off-by: Shu Wang <shuwang@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Be consistent about using uint32_t/uint8_t instead of u32/u8. This is
more so that we don't have to maintain /those/ types in xfsprogs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Jeff Mahoney [Mon, 20 Nov 2017 19:54:02 +0000 (13:54 -0600)]
xfs_io: stat: treat statfs.f_flags as optional
Kernels prior to 2.6.36 didn't contain statfs.f_flags. Distros with
initial releases with kernels prior to this may not have updated
headers with this member. Only attempt to print it if we have the
header with the member defined.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
[sandeen: define HAVE_STATFS_FLAGS in io/Makefile] Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Mon, 20 Nov 2017 19:53:56 +0000 (13:53 -0600)]
xfs_copy: don't hang if /all/ the targets hit write errors
If xfs_copy is told to copy a filesystem and /all/ the writer threads
hit a write error, there won't be any threads to unlock mainwait, which
means that write_wbuf will deadlock with itself trying to lock mainwait.
Therefore, if we discover that all the writer threads are dead, just
bail out.
Discovered by running xfs/073 with a tiny test device.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Zirong Lang [Mon, 20 Nov 2017 19:53:56 +0000 (13:53 -0600)]
xfsprogs: fix wrong do_pwritev definition
In io/pwrite.c, if HAVE_PWRITEV is not defined, we will use:
#define do_pwritev(fd, offset, count, buffer_size) (0)
But the real do_pwritev() function is:
do_pwritev(fd, offset, count, buffer_size, pwritev2_flags);
There's one more 'pwritev2_flags' argument.
Fixes: c5deeac9 "xfs_io: Add support for pwritev2()" Signed-off-by: Zorro Lang <zlang@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
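A sketch of the corrected pattern: the fallback macro takes the same five arguments as the call site, so a caller compiles whether or not HAVE_PWRITEV is defined. The prototype in the #else branch and the write_vectored() caller are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#ifndef HAVE_PWRITEV
/* Fallback stub: five parameters, matching the real call. */
#define do_pwritev(fd, offset, count, buffer_size, pwritev2_flags)	(0)
#else
/* Assumed prototype, shown only so this sketch is complete. */
long do_pwritev(int fd, long long offset, long long count,
		size_t buffer_size, int pwritev2_flags);
#endif

/* Hypothetical caller that always passes all five arguments. */
static long write_vectored(int fd, long long offset, long long count,
			   size_t buffer_size, int pwritev2_flags)
{
	assert(buffer_size > 0);
	return do_pwritev(fd, offset, count, buffer_size, pwritev2_flags);
}
```

With the old four-parameter macro, the same call fails to compile because a function-like macro rejects a mismatched argument count.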
Zirong Lang [Mon, 20 Nov 2017 19:53:56 +0000 (13:53 -0600)]
xfsprogs: fix wrong variable types in pwrite/pread code
The 'Coverity Scan' found a problem in new write_once() function:
272 size_t bytes;
273 bytes = do_pwrite(file->fd, offset, count, count, pwritev2_flags);
>>> CID 1420710: Control flow issues (NO_EFFECT)
>>> This less-than-zero comparison of an unsigned value is never true. "bytes < 0UL".
274 if (bytes < 0)
275 return -1;
That's unreasonable. do_pwrite returns an 'ssize_t' value, which can
be less than zero, but we use a 'size_t' to hold the return value. So
change the size_t to ssize_t so that it can store the return value
correctly.
While we're at it, correct all 'ssize_t' type problems in pwrite/pread
related functions.
Signed-off-by: Zorro Lang <zlang@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
[sandeen: modify commit summary] Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
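The difference between the two types can be seen in a small sketch. fake_pwrite() is a hypothetical stand-in for do_pwrite(), and write_once_sketch() only mirrors the shape of the code quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/* Hypothetical stand-in for do_pwrite(): bytes written, or -1 on error. */
static ssize_t fake_pwrite(size_t count, int fail)
{
	assert(count > 0);
	return fail ? -1 : (ssize_t)count;
}

static int write_once_sketch(size_t count, int fail)
{
	ssize_t bytes;	/* signed, so a -1 error return is preserved */

	bytes = fake_pwrite(count, fail);
	if (bytes < 0)
		return -1;
	return 0;
}
```

Had bytes been declared as size_t, the -1 would have been converted to a very large positive value and the error branch could never be taken.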
Use the uint* types instead of the u_int* types. This will (hopefully)
pair with an xfsprogs cleanup.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
[sandeen: no-op commit, libxfs was already fixed in userspace] Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
And also rename fill to nr_entries to match the rest of the code.
Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Fix to check the correct value, and remove a duplicate handling of the
uneven record number split algorithm.
Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Neither defines an on-disk format, so move them out of xfs_format.h.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This removed an unaligned load per extent, as well as the manual poking
into the on-disk extent format.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
We only have two places that remove 2 extents at the same time, so unroll
the loop there.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
We only have two places that insert 2 extents at the same time, so unroll
the loop there.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Replace the current linear list and the indirection array for the in-core
extent list with a b+tree to avoid the need for larger memory allocations
for the indirection array when lots of extents are present. The current
extent list implementation leads to heavy pressure on the memory
allocator when modifying files with a high extent count, and can lead
to high latencies because of that.
The replacement is a b+tree with a few quirks. The leaf nodes directly
store the extent record in two u64 values. The encoding is a little bit
different from the existing in-core extent records so that the start
offset and length which are required for lookups can be retrieved with
simple mask operations. The inner nodes store a 64-bit key containing
the start offset in the first half of the node, and the pointers to the
next lower level in the second half. In either case we walk the node
from the beginning to the end and do a linear search, as that is more
efficient for the low number of cache lines touched during a search
(2 for the inner nodes, 4 for the leaf nodes) than a binary search.
We store termination markers (zero length for the leaf nodes, an
otherwise impossible high bit for the inner nodes) to terminate the key
list / records instead of storing a count to use the available cache
lines as efficiently as possible.
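The leaf-level search can be sketched as below. This is an illustration only: LEAF_RECS, struct leaf_rec and leaf_lookup() are hypothetical, and the record is kept as two plain fields instead of the packed two-u64 encoding the patch uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEAF_RECS	4	/* hypothetical capacity, kept small for illustration */

struct leaf_rec {
	uint64_t startoff;
	uint64_t len;		/* zero length terminates the record list */
};

struct leaf {
	struct leaf_rec recs[LEAF_RECS];
};

/* Linear scan for the record containing off; NULL if off falls in a hole. */
static const struct leaf_rec *leaf_lookup(const struct leaf *l, uint64_t off)
{
	size_t i;

	assert(l != NULL);
	for (i = 0; i < LEAF_RECS && l->recs[i].len != 0; i++) {
		const struct leaf_rec *r = &l->recs[i];

		if (off < r->startoff)
			break;		/* records are sorted, so we are past it */
		if (off - r->startoff < r->len)
			return r;
	}
	return NULL;
}
```

Using the zero-length record as the terminator means the node needs no separate count field, which is the space saving the paragraph above refers to.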
One quirk of the algorithm is that while we normally split a node half and
half like usual btree implementations we just spill over entries added at
the very end of the list to a new node on its own. This means we get a
100% fill grade for the common cases of bulk insertion when reading an
inode into memory, and when only sequentially appending to a file. The
downside is a slightly higher chance of splits on the first random
insertions.
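The split policy can be reduced to one decision, sketched here under the assumption of a fixed, hypothetical node capacity NODE_SIZE; entries_moved_on_split() is not a function from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define NODE_SIZE	8	/* hypothetical node capacity */

/*
 * Number of existing entries that move to the new node when a full node
 * is split to make room for an insertion at index pos.
 */
static size_t entries_moved_on_split(size_t pos)
{
	assert(pos <= NODE_SIZE);
	if (pos == NODE_SIZE)
		return 0;		/* append: old node stays full, new entry starts a node */
	return NODE_SIZE / 2;		/* otherwise the usual half-and-half split */
}
```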
Both insert and removal manually recurse into the lower levels, but
the bulk deletion of the whole tree is still implemented as a recursive
function call, although one limited by the overall depth and with very
little stack usage in every iteration.
For the first few extents we dynamically grow the list from a single
extent to the next powers of two until we have a first full leaf block
and then build the actual tree.
The code started out based on the generic lib/btree.c code from Joern
Engel based on earlier work from Peter Zijlstra, but has since been
rewritten beyond recognition.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[sandeen: use uint32_t for rec_len, update trace macros & repair/] Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
To make life a little simpler make xfs_bmbt_set_all unaligned access
aware so that we can use it directly on the destination buffer.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Supporting a small bit of data inside the inode fork blows up the fork size
a lot, removing the 32 bytes of inline data halves the effective size of
the inode fork (and it still has a lot of unused padding left), and the
performance of a single kmalloc doesn't show up compared to the size to read
an inode or create one.
It also simplifies the fork management code a lot.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Instead of looking up extents to convert and calling xfs_bmapi_write on
each of them just let xfs_bmapi_write handle the full range. To make
this robust add a new XFS_BMAPI_CONVERT_ONLY that only converts ranges
and never allocates blocks.
[darrick: shorten the stringified CONVERT_ONLY trace flag]
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Add a new xfs_iext_cursor structure to hide the direct extent map
index manipulations. In addition to the existing lookup/get/insert/
remove and update routines new primitives to get the first and last
extent cursor, as well as moving up and down by one extent are
provided. Also new are convenience helpers to increment/decrement the
cursor and retrieve the new extent, as well as to peek into the
previous/next extent without updating the cursor and last but not
least a macro to iterate over all extents in a fork.
[darrick: rename for_each_iext to for_each_xfs_iext]
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[sandeen: use cursor in xfs_repair code as well] Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
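The idea of hiding index manipulation behind a cursor can be illustrated with an array-backed sketch. Every name here (struct ext, struct fork, struct cursor, cur_first, cur_next, cur_get, for_each_ext) is hypothetical and only mimics the shape of the interface described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ext    { uint64_t startoff; uint64_t len; };
struct fork   { const struct ext *recs; size_t nr; };	/* array-backed stand-in */
struct cursor { size_t pos; };				/* opaque to callers */

static void cur_first(struct cursor *cur) { cur->pos = 0; }
static void cur_next(struct cursor *cur)  { cur->pos++; }

/* Copy out the extent at the cursor; false once the cursor runs off the end. */
static bool cur_get(const struct fork *f, const struct cursor *cur,
		    struct ext *out)
{
	if (cur->pos >= f->nr)
		return false;
	*out = f->recs[cur->pos];
	return true;
}

/* Iterate over all extents in a fork without touching an index directly. */
#define for_each_ext(f, cur, out) \
	for (cur_first(cur); cur_get((f), (cur), (out)); cur_next(cur))

static uint64_t total_len(const struct fork *f)
{
	struct cursor cur;
	struct ext e;
	uint64_t sum = 0;

	assert(f != NULL);
	for_each_ext(f, &cur, &e)
		sum += e.len;
	return sum;
}
```

Callers written this way keep working if the storage behind the cursor changes from an array to a tree.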
This actually makes the function very slightly less efficient for now as we
detour through the expanded irec format between the in-core extent format
and the on-disk one instead of just endian swapping them. But with the
incore extent btree the in-core one will use a different format and the
representation will be entirely hidden.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This actually makes the function very slightly less efficient for now as we
detour through the expanded irec format between the in-core extent format
and the on-disk one instead of just endian swapping them. But with the
incore extent btree the in-core one will use a different format and the
representation will be entirely hidden. It also happens to make the
function a whole lot more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This prepares for getting rid of the current in-memory extent format.
At the end of the series we will change the calling convention again
to pass the xfs_bmbt_irec structure once it is available everywhere.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Stop poking before and after the index and just increment or decrement
it while doing our operations on it to prepare for a new extent list
implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Stop poking before and after the index and just increment or decrement
it while doing our operations on it to prepare for a new extent list
implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Stop poking before and after the index and just increment or decrement
it while doing our operations on it to prepare for a new extent list
implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Stop poking before and after the index and just increment or decrement
it while doing our operations on it to prepare for a new extent list
implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Stop poking before and after the index and just increment or decrement
it while doing our operations on it to prepare for a new extent list
implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Stop poking before and after the index and just increment or decrement
it while doing our operations on it to prepare for a new extent list
implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Two cases in xfs_bmap_add_extent_delay_real currently insert a new
extent before updating the existing one that is being split. While
this works fine with a simple extent list, a more complex tree can't
easily cope with overlapping extents. Reshuffle the code a bit to update
the slot of the existing delalloc extent to the new real extent before
inserting the shortened delalloc extent before or after it. This
avoids the overlapping extents while still allowing to update the
br_startblock field of the delalloc extent with the updated indirect
block reservation.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Some were missed in the pass that converted the function return
values from int to bool. Update the remaining ones for consistency.
Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Move the error injection tag names into a libxfs header so that we can
share it between kernel and userspace.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Remove xfs_inode_log_format_t now that xfs_inode_log_format is
explicitly padded and therefore is a real on-disk structure. This
enables xfs/122 to check the size of the structure.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
[sandeen: same treatment in libxlog & logprint/ ] Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
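Why explicit padding makes a structure checkable can be shown with a sketch. struct demo_log_format is a hypothetical layout invented for this example, not the real xfs_inode_log_format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; the pad field makes the 64-bit member's offset explicit. */
struct demo_log_format {
	uint16_t type;
	uint16_t size;
	uint32_t fields;
	uint16_t asize;
	uint16_t dsize;
	uint32_t pad;	/* explicit padding instead of compiler-inserted padding */
	uint64_t ino;
};

/* With explicit padding the size and offsets no longer depend on the ABI. */
static_assert(sizeof(struct demo_log_format) == 24,
	      "demo_log_format has an unexpected size");
static_assert(offsetof(struct demo_log_format, ino) == 16,
	      "ino is not at the expected offset");
```

Without the pad field, a 32-bit ABI that aligns 64-bit integers to 4 bytes would place ino at offset 12, and a fixed-size check could not hold on every platform.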
xfs: remove the inode log format from the inode log item
No need to keep the inode log format around all the time, we can
easily generate it at iop_format time.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
[sandeen: matching change in userspace xfs_trans.h] Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Variable bit is being assigned a value that is never read, hence
the assignment is redundant and can be removed. Cleans up clang
warning:
fs/xfs/libxfs/xfs_rtbitmap.c:675:3: warning: Value stored to
'bit' is never read
Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
When we're done checking all the records/keys in a btree block, compute
the low and high key of the block and compare them to the associated key
in the parent btree block.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
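The check described above can be sketched as follows, assuming records that carry a start and a non-zero length and a parent that stores both a low and a high key. The types and function names are hypothetical, not the scrub code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct rec       { uint64_t start; uint64_t len; };	/* len assumed non-zero */
struct key_range { uint64_t low; uint64_t high; };

/* Compute the lowest start and the highest last-covered key in a block. */
static struct key_range block_key_range(const struct rec *recs, size_t nr)
{
	struct key_range kr;
	size_t i;

	assert(nr > 0);
	kr.low = recs[0].start;
	kr.high = recs[0].start + recs[0].len - 1;
	for (i = 1; i < nr; i++) {
		uint64_t last = recs[i].start + recs[i].len - 1;

		if (recs[i].start < kr.low)
			kr.low = recs[i].start;
		if (last > kr.high)
			kr.high = last;
	}
	return kr;
}

/* The block is consistent only if its computed keys equal the parent's keys. */
static bool block_matches_parent(const struct rec *recs, size_t nr,
				 struct key_range parent)
{
	struct key_range kr = block_key_range(recs, nr);

	return kr.low == parent.low && kr.high == parent.high;
}
```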
Abort a dir/attr btree operation if the attr btree has obvious problems
like loops back to the root or pointers don't point down the tree.
Found by fuzzing btree[0].before to zero in xfs/402, which livelocks on
the cycle in the attr btree.
Apply the same checks to xfs_da3_node_lookup_int.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>