When we are clearing all the bits from the first record that overlaps the
target range, and that record starts before our target range and ends at
or before the end of our target range, we do a lot of unnecessary work:
1) Allocating a prealloc state if we don't have one already;
2) Adjusting that record's start offset to the start of our range and
making the prealloc state cover the range from the original start
offset of that first record to just before the start offset of our
target range, with the same bits as that first record. Then inserting
the prealloc state in the rbtree - this is done in split_state();
3) Removing our adjusted first state from the rbtree since all the bits
were cleared - this is done in clear_state_bit().
This is only wasting time when we can simply trim that first record, so
that it covers the range from its start offset to just before the start
offset of our target range. So optimize for that case and avoid the
prealloc state allocation and the insertion into and deletion from the
rbtree.
This patch is the last patch of a patchset made up of the following
patches (in descending order):
btrfs: optimize clearing all bits from first extent record in an io tree
btrfs: panic instead of warn when splitting extent state not in the tree
btrfs: free cached state outside critical section in wait_extent_bit()
btrfs: avoid unnecessary wake ups on io trees when there are no waiters
btrfs: remove wake parameter from clear_state_bit()
btrfs: change last argument of add_extent_changeset() to boolean
btrfs: use extent_io_tree_panic() instead of BUG_ON()
btrfs: make add_extent_changeset() only return errors or success
btrfs: tag as unlikely branches that call extent_io_tree_panic()
btrfs: turn extent_io_tree_panic() into a macro for better error reporting
btrfs: optimize clearing all bits from the last extent record in an io tree
The following fio script was used to measure performance before and after
applying all the patches:
$ cat ./fio-io-uring-2.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS=""
if [ $# -ne 3 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE RUN_TIME"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
RUN_TIME=$3
cat <<EOF > /tmp/fio-job.ini
[io_uring_rw]
rw=randwrite
fsync=0
fallocate=none
group_reporting=1
direct=1
ioengine=io_uring
fixedbufs=1
iodepth=64
bs=4K
filesize=$FILE_SIZE
runtime=$RUN_TIME
time_based
filename=foobar
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF
echo performance | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
When running this script on a 12-core machine using a 16G null block
device the results were the following:
Before patchset:
$ ./fio-io-uring-2.sh 12 8G 60
(...)
WRITE: bw=74.8MiB/s (78.5MB/s), 74.8MiB/s-74.8MiB/s (78.5MB/s-78.5MB/s), io=4504MiB (4723MB), run=60197-60197msec
After patchset:
$ ./fio-io-uring-2.sh 12 8G 60
(...)
WRITE: bw=82.2MiB/s (86.2MB/s), 82.2MiB/s-82.2MiB/s (86.2MB/s-86.2MB/s), io=4937MiB (5176MB), run=60027-60027msec
Also, using bpftrace to collect the duration (in nanoseconds) of all the
btrfs_clear_extent_bit_changeset() calls done during that fio test and
then making a histogram from that data yielded the following results:
Before patchset:
Count: 6304804
Range: 0.000 - 7587172.000; Mean: 2011.308; Median: 1219.000; Stddev: 17117.533
Percentiles: 90th: 1888.000; 95th: 2189.000; 99th: 16104.000
      0.000 -       8.098:       7 |
      8.098 -      40.385:      20 |
     40.385 -     187.254:     146 |
    187.254 -     855.347:  742048 #######
    855.347 -    3894.426: 5462542 #####################################################
   3894.426 -   17718.848:   41489 |
  17718.848 -   80604.558:   46085 |
  80604.558 -  366664.449:   11285 |
 366664.449 - 1667918.122:     961 |
1667918.122 - 7587172.000:     113 |
After patchset:
Count: 6282879
Range: 0.000 - 6029290.000; Mean: 1896.482; Median: 1126.000; Stddev: 15276.691
Percentiles: 90th: 1741.000; 95th: 2026.000; 99th: 15713.000
      0.000 -      60.014:      12 |
     60.014 -     217.984:      63 |
    217.984 -     784.949:  517515 #####
    784.949 -    2819.823: 5632335 #####################################################
   2819.823 -   10123.127:   55716 #
  10123.127 -   36335.184:   46034 |
  36335.184 -  130412.049:   25708 |
 130412.049 -  468060.350:    4824 |
 468060.350 - 1679903.189:     549 |
1679903.189 - 6029290.000:      84 |
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
int ret = 0;
bool clear;
const bool delete = (bits & EXTENT_CLEAR_ALL_BITS);
+ const u32 bits_to_clear = (bits & ~EXTENT_CTLBITS);
gfp_t mask;
set_gfp_mask_from_bits(&bits, &mask);
*/
if (state->start < start) {
+ /*
+ * If all bits are cleared, there's no point in allocating or
+ * using the prealloc extent, split the state record, insert the
+ * prealloc record and then remove this record. We can just
+ * adjust this record and move on to the next without adding or
+ * removing anything to the tree.
+ */
+ if (state->end <= end && (state->state & ~bits_to_clear) == 0) {
+ const u64 orig_start = state->start;
+
+ if (tree->owner == IO_TREE_INODE_IO)
+ btrfs_split_delalloc_extent(tree->inode, state, start);
+
+ /*
+ * Temporarily adjust this state's range to match the
+ * range for which we are clearing bits.
+ */
+ state->start = start;
+
+ ret = add_extent_changeset(state, bits_to_clear, changeset, false);
+ if (unlikely(ret < 0)) {
+ extent_io_tree_panic(tree, state,
+ "add_extent_changeset", ret);
+ goto out;
+ }
+
+ if (tree->owner == IO_TREE_INODE_IO)
+ btrfs_clear_delalloc_extent(tree->inode, state, bits);
+
+ /*
+ * Now adjust the range to the section for which no bits
+ * are cleared.
+ */
+ state->start = orig_start;
+ state->end = start - 1;
+
+ state_wake_up(tree, state, bits);
+ state = next_search_state(state, end);
+ goto next;
+ }
+
prealloc = alloc_extent_state_atomic(prealloc);
if (!prealloc)
goto search_again;
* We need to split the extent, and clear the bit on the first half.
*/
if (state->start <= end && state->end > end) {
- const u32 bits_to_clear = bits & ~EXTENT_CTLBITS;
-
/*
* If all bits are cleared, there's no point in allocating or
* using the prealloc extent, split the state record, insert the