From: Filipe Manana
Date: Wed, 11 Mar 2026 18:36:36 +0000 (+0000)
Subject: btrfs: optimize clearing all bits from first extent record in an io tree
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=031dd12c37071d0bac01d67794000a75d042cfdb;p=thirdparty%2Fkernel%2Flinux.git

btrfs: optimize clearing all bits from first extent record in an io tree

When we are clearing all the bits from the first record that contains the
target range, and that record ends at or before the end of our target range
but starts before it, we are doing a lot of unnecessary work:

1) Allocating a prealloc state if we don't have one already;

2) Adjusting that record's start offset to the start of our range, making
   the prealloc state cover the range from the record's original start
   offset to the start offset of our target range, with the same bits as
   that first record, and then inserting the prealloc state in the
   rbtree - this is done in split_state();

3) Removing our adjusted first state from the rbtree, since all its bits
   were cleared - this is done in clear_state_bit().

All of this wastes time, since we can simply trim that first record so
that it covers the range from its original start offset to the start
offset of our target range. So optimize for that case and avoid the
prealloc state allocation, the insertion and the deletion from the
rbtree.
This patch is the last patch of a patchset comprised of the following
patches (in descending order):

  btrfs: optimize clearing all bits from first extent record in an io tree
  btrfs: panic instead of warn when splitting extent state not in the tree
  btrfs: free cached state outside critical section in wait_extent_bit()
  btrfs: avoid unnecessary wake ups on io trees when there are no waiters
  btrfs: remove wake parameter from clear_state_bit()
  btrfs: change last argument of add_extent_changeset() to boolean
  btrfs: use extent_io_tree_panic() instead of BUG_ON()
  btrfs: make add_extent_changeset() only return errors or success
  btrfs: tag as unlikely branches that call extent_io_tree_panic()
  btrfs: turn extent_io_tree_panic() into a macro for better error reporting
  btrfs: optimize clearing all bits from the last extent record in an io tree

The following fio script was used to measure performance before and after
applying all the patches:

  $ cat ./fio-io-uring-2.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/nullb0
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS=""

  if [ $# -ne 3 ]; then
      echo "Use $0 NUM_JOBS FILE_SIZE RUN_TIME"
      exit 1
  fi

  NUM_JOBS=$1
  FILE_SIZE=$2
  RUN_TIME=$3

  cat <<EOF > /tmp/fio-job.ini
  [io_uring_rw]
  rw=randwrite
  fsync=0
  fallocate=none
  group_reporting=1
  direct=1
  ioengine=io_uring
  fixedbufs=1
  iodepth=64
  bs=4K
  filesize=$FILE_SIZE
  runtime=$RUN_TIME
  time_based
  filename=foobar
  directory=$MNT
  numjobs=$NUM_JOBS
  thread
  EOF

  echo performance | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  echo
  echo "Using config:"
  echo
  cat /tmp/fio-job.ini
  echo

  umount $MNT &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
  mount $MOUNT_OPTIONS $DEV $MNT

  fio /tmp/fio-job.ini

  umount $MNT

When running this script on a 12-core machine using a 16G null block
device, the results were the following:

Before patchset:

  $ ./fio-io-uring-2.sh 12 8G 60
  (...)
  WRITE: bw=74.8MiB/s (78.5MB/s), 74.8MiB/s-74.8MiB/s (78.5MB/s-78.5MB/s), io=4504MiB (4723MB), run=60197-60197msec

After patchset:

  $ ./fio-io-uring-2.sh 12 8G 60
  (...)
  WRITE: bw=82.2MiB/s (86.2MB/s), 82.2MiB/s-82.2MiB/s (86.2MB/s-86.2MB/s), io=4937MiB (5176MB), run=60027-60027msec

Also, using bpftrace to collect the duration (in nanoseconds) of all the
btrfs_clear_extent_bit_changeset() calls made during that fio test, and
then building a histogram from that data, yielded the following results:

Before patchset:

  Count: 6304804
  Range: 0.000 - 7587172.000; Mean: 2011.308; Median: 1219.000; Stddev: 17117.533
  Percentiles: 90th: 1888.000; 95th: 2189.000; 99th: 16104.000
        0.000 -       8.098:        7 |
        8.098 -      40.385:       20 |
       40.385 -     187.254:      146 |
      187.254 -     855.347:   742048 #######
      855.347 -    3894.426:  5462542 #####################################################
     3894.426 -   17718.848:    41489 |
    17718.848 -   80604.558:    46085 |
    80604.558 -  366664.449:    11285 |
   366664.449 - 1667918.122:      961 |
  1667918.122 - 7587172.000:      113 |

After patchset:

  Count: 6282879
  Range: 0.000 - 6029290.000; Mean: 1896.482; Median: 1126.000; Stddev: 15276.691
  Percentiles: 90th: 1741.000; 95th: 2026.000; 99th: 15713.000
        0.000 -      60.014:       12 |
       60.014 -     217.984:       63 |
      217.984 -     784.949:   517515 #####
      784.949 -    2819.823:  5632335 #####################################################
     2819.823 -   10123.127:    55716 #
    10123.127 -   36335.184:    46034 |
    36335.184 -  130412.049:    25708 |
   130412.049 -  468060.350:     4824 |
   468060.350 - 1679903.189:      549 |
  1679903.189 - 6029290.000:       84 |

Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba
---

diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 72ddd8d2e7a3e..6ae7709cba23e 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -635,6 +635,7 @@ int btrfs_clear_extent_bit_changeset(struct extent_io_tree *tree, u64 start, u64
 	int ret = 0;
 	bool clear;
 	const bool delete = (bits & EXTENT_CLEAR_ALL_BITS);
+	const u32 bits_to_clear = (bits & ~EXTENT_CTLBITS);
 	gfp_t mask;
 
 	set_gfp_mask_from_bits(&bits, &mask);
@@ -712,6 +713,47 @@ hit_next:
 	 */
 	if (state->start < start) {
+		/*
+		 * If all bits are cleared, there's no point in allocating or
+		 * using the prealloc extent, splitting the state record,
+		 * inserting the prealloc record and then removing this record.
+		 * We can just adjust this record and move on to the next
+		 * without adding anything to or removing anything from the
+		 * tree.
+		 */
+		if (state->end <= end && (state->state & ~bits_to_clear) == 0) {
+			const u64 orig_start = state->start;
+
+			if (tree->owner == IO_TREE_INODE_IO)
+				btrfs_split_delalloc_extent(tree->inode, state, start);
+
+			/*
+			 * Temporarily adjust this state's range to match the
+			 * range for which we are clearing bits.
+			 */
+			state->start = start;
+
+			ret = add_extent_changeset(state, bits_to_clear, changeset, false);
+			if (unlikely(ret < 0)) {
+				extent_io_tree_panic(tree, state,
+						     "add_extent_changeset", ret);
+				goto out;
+			}
+
+			if (tree->owner == IO_TREE_INODE_IO)
+				btrfs_clear_delalloc_extent(tree->inode, state, bits);
+
+			/*
+			 * Now adjust the range to the section for which no
+			 * bits are cleared.
+			 */
+			state->start = orig_start;
+			state->end = start - 1;
+
+			state_wake_up(tree, state, bits);
+			state = next_search_state(state, end);
+			goto next;
+		}
+
 		prealloc = alloc_extent_state_atomic(prealloc);
 		if (!prealloc)
 			goto search_again;
@@ -739,8 +781,6 @@ hit_next:
 	 * We need to split the extent, and clear the bit on the first half.
 	 */
 	if (state->start <= end && state->end > end) {
-		const u32 bits_to_clear = bits & ~EXTENT_CTLBITS;
-
 		/*
 		 * If all bits are cleared, there's no point in allocating or
 		 * using the prealloc extent, split the state record, insert the