Zhang Yi <yi.zhang@huawei.com> says:
Currently, we can use the fallocate command to quickly create a
pre-allocated file. However, on most filesystems, such as ext4 and XFS,
fallocate creates pre-allocated blocks in an unwritten state, and the
FALLOC_FL_ZERO_RANGE flag also behaves similarly. The extent state must
be converted to a written state when the user writes data into this
range later, which can trigger numerous metadata changes and consequent
journal I/O. This may lead to significant write amplification and
performance degradation in synchronous write mode. Therefore, we need a
method to create a pre-allocated file with written extents that can be
used for pure overwriting. At the moment, the only method available is
to create an empty file and write zero data into it (for example, using
'dd' with a large block size). However, this method is slow and consumes
a considerable amount of disk bandwidth, so we must pre-allocate files
in advance and cannot add pre-allocated files while user business
services are running.
Fortunately, with the development and increasingly widespread use of
flash-based storage devices, we can efficiently write zeroes to SSDs
using the unmap write zeroes command if the devices do not write
physical zeroes to the media. For example, if SCSI SSDs support the
UNMAP bit or NVMe SSDs support the DEAC bit[1], the write zeroes command
does not write actual data to the device. Instead, NVMe converts the
zeroed range to a deallocated state, which works fast and consumes
almost no disk write bandwidth. Consequently, this feature can provide
us with a faster method for creating pre-allocated files with written
extents and zeroed data. However, please note that this may be a
best-effort optimization rather than a mandatory requirement: some
devices may partially fall back to writing physical zeroes due to
factors such as receiving unaligned commands.
This series aims to implement this by:
1. Introduce a new feature BLK_FEAT_WRITE_ZEROES_UNMAP to the block
device queue limit features, which indicates whether the storage
device explicitly supports the unmapped write zeroes command. This
flag should be set to 1 by the driver if the attached disk supports
this command.
2. Introduce a queue limit flag, BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED,
along with a corresponding sysfs entry. Users can query the support
status of the unmap write zeroes operation and disable this operation
if the write zeroes operation is very slow.
/sys/block/<disk>/queue/write_zeroes_unmap
3. Introduce a new flag, FALLOC_FL_WRITE_ZEROES, into the fallocate.
Filesystems that support this operation should allocate written
extents and issue zeroes to the specified range of the device. For
local block device filesystems, this operation should depend on the
write_zeroes_unmap operation of the underlying block device. It should
return -EOPNOTSUPP if the device doesn't enable the unmap write zeroes
operation.
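From userspace, item 3 could be exercised roughly as follows. This is
only a sketch, not part of the series: preallocate_written() is a
hypothetical helper, the flag value 0x80 is an assumption that should be
checked against the installed include/uapi/linux/falloc.h, and the
ctypes binding assumes 64-bit Linux. When the kernel or filesystem
reports -EOPNOTSUPP (or -ENOSYS), the helper falls back to the slow path
of writing zeroes explicitly.

```python
import ctypes
import errno
import os

# Assumption: flag value as proposed for include/uapi/linux/falloc.h;
# verify against your own kernel headers before relying on it.
FALLOC_FL_WRITE_ZEROES = 0x80

# Assumption: 64-bit Linux, where off_t is a 64-bit integer.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def preallocate_written(path, length, chunk=2 * 1024 * 1024):
    """Hypothetical helper: create `path` holding `length` zeroed bytes.

    Returns "write_zeroes" if fallocate(FALLOC_FL_WRITE_ZEROES) succeeded,
    or "fallback" if zeroes had to be written explicitly.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        if _libc.fallocate(fd, FALLOC_FL_WRITE_ZEROES, 0, length) == 0:
            return "write_zeroes"
        err = ctypes.get_errno()
        if err not in (errno.EOPNOTSUPP, errno.ENOSYS):
            raise OSError(err, os.strerror(err), path)
        # Slow path: write real zeroes, as 'dd' would.
        zeros = memoryview(bytes(chunk))
        offset = 0
        while offset < length:
            step = min(chunk, length - offset)
            offset += os.pwrite(fd, zeros[:step], offset)
        os.fsync(fd)
        return "fallback"
    finally:
        os.close(fd)
```

For example, preallocate_written("bar", 1 << 30) would create a 1 GiB
file. The return value tells the caller which path was taken, so a
service can decide whether allocating while business traffic is running
is acceptable.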
This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and
BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for SCSI, NVMe and
device-mapper drivers, and adds the FALLOC_FL_WRITE_ZEROES and
STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw bdev devices.
Any comments are welcome.
I've tested performance with this series on an ext4 filesystem on my
machine with an Intel Xeon Gold 6248R CPU and a 7TB KCD61LUL7T68 NVMe
SSD, which supports the unmap write zeroes command with the Deallocated
state and the DEAC bit. Feel free to give it a try.
0. Ensure the NVMe device supports the Write Zeroes command.
$ cat /sys/block/nvme5n1/queue/write_zeroes_max_bytes
8388608
$ nvme id-ns -H /dev/nvme5n1 | grep -i -A 3 "dlfeat"
dlfeat : 25
[4:4] : 0x1 Guard Field of Deallocated Logical Blocks is set to CRC
of The Value Read
[3:3] : 0x1 Deallocate Bit in the Write Zeroes Command is Supported
[2:0] : 0x1 Bytes Read From a Deallocated Logical Block and its
Metadata are 0x00
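The dlfeat value above can be decoded by hand. The sketch below only
mirrors the bit ranges that 'nvme id-ns -H' prints ([4:4], [3:3],
[2:0]); decode_dlfeat() and its key names are illustrative, not an
nvme-cli or kernel interface.

```python
def decode_dlfeat(dlfeat):
    """Split the DLFEAT byte into the three fields shown by nvme id-ns -H."""
    return {
        # [4:4] guard field of deallocated blocks is set to the CRC
        "guard_crc": (dlfeat >> 4) & 0x1,
        # [3:3] Deallocate bit (DEAC) supported in Write Zeroes
        "deac_supported": (dlfeat >> 3) & 0x1,
        # [2:0] read behavior of deallocated blocks (0x1: reads as 0x00)
        "read_behavior": dlfeat & 0x7,
    }


fields = decode_dlfeat(25)  # 25 == 0b11001
print(fields)
```

With dlfeat = 25 all three fields come out as 1, which matches the
device output above: DEAC is supported and deallocated blocks read back
as zeroes.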
1. Compare 'dd' and fallocate with unmap write zeroes; the latter is
significantly faster than 'dd'.
Create a 1GB and a 10GB zeroed file.
$ dd if=/dev/zero of=foo bs=2M count=$count oflag=direct
$ time fallocate -w -l $size bar
#1G
dd: 0.5s
FALLOC_FL_WRITE_ZEROES: 0.17s
#10G
dd: 5.0s
FALLOC_FL_WRITE_ZEROES: 1.7s
2. Run fio overwrite and fallocate with unmap write zeroes
simultaneously, fallocate has little impact on write bandwidth and
only slightly affects write latency.
a) Test bandwidth costs.
$ fio -directory=/test -direct=1 -iodepth=10 -fsync=0 -rw=write \
-numjobs=10 -bs=2M -ioengine=libaio -size=20G -runtime=20 \
-fallocate=none -overwrite=1 -group_reporting -name=bw_test
Without background zero range:
bw (MiB/s): min= 2068, max= 2280, per=100.00%, avg=2186.40
With background zero range:
bw (MiB/s): min= 2056, max= 2308, per=100.00%, avg=2186.20
b) Test write latency costs.
$ fio -filename=/test/foo -direct=1 -iodepth=1 -fsync=0 -rw=write \
-numjobs=1 -bs=4k -ioengine=psync -size=5G -runtime=20 \
-fallocate=none -overwrite=1 -group_reporting -name=lat_test
Without background zero range:
lat (nsec): min=9269, max=71635, avg=9840.65
With a background zero range:
lat (usec): min=9, max=982, avg=11.03
3. Compare overwriting in a pre-allocated unwritten file and a written
file in O_DSYNC mode. Writing to a file with written extents is much
faster.
# First mkfs and create a test file according to the three cases below,
# and then run fio.
$ fio -filename=/test/foo -direct=1 -iodepth=1 -fdatasync=1 \
-rw=write -numjobs=1 -bs=4k -ioengine=psync -size=5G \
-runtime=20 -fallocate=none -group_reporting -name=test
unwritten file: IOPS=20.1k, BW=78.7MiB/s
unwritten file + fast_commit: IOPS=42.9k, BW=167MiB/s
written file: IOPS=98.8k, BW=386MiB/s
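As a sanity check on the three results above, the reported bandwidth
should equal IOPS times the 4k block size, and the speedup of the
written file follows from the IOPS ratios. The sketch below only
re-derives numbers already listed; the dictionary layout and function
names are illustrative.

```python
BLOCK_SIZE = 4 * 1024  # bytes, from -bs=4k

# (IOPS, reported bandwidth in MiB/s) for each case listed above
RESULTS = {
    "unwritten file": (20.1e3, 78.7),
    "unwritten file + fast_commit": (42.9e3, 167.0),
    "written file": (98.8e3, 386.0),
}


def bw_mib_s(iops):
    """Bandwidth implied by an IOPS figure at BLOCK_SIZE per I/O."""
    return iops * BLOCK_SIZE / (1024 * 1024)


def speedup(case, baseline):
    """IOPS ratio of `case` over `baseline`."""
    return RESULTS[case][0] / RESULTS[baseline][0]


for name, (iops, reported) in RESULTS.items():
    print(f"{name}: implied {bw_mib_s(iops):.1f} MiB/s, reported {reported}")
print(f"written vs unwritten: {speedup('written file', 'unwritten file'):.2f}x")
print("written vs fast_commit: "
      f"{speedup('written file', 'unwritten file + fast_commit'):.2f}x")
```

The implied bandwidths agree with the reported ones to within rounding,
and the written file comes out roughly 4.9x faster than the plain
unwritten file and roughly 2.3x faster than unwritten with fast_commit.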
* patches from https://lore.kernel.org/20250619111806.3546162-1-yi.zhang@huaweicloud.com:
ext4: add FALLOC_FL_WRITE_ZEROES support
block: add FALLOC_FL_WRITE_ZEROES support
block: factor out common part in blkdev_fallocate()
fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
dm: clear unmap write zeroes limits when disabling write zeroes
scsi: sd: set max_hw_wzeroes_unmap_sectors if device supports SD_ZERO_*_UNMAP
nvmet: set WZDS and DRB if device enables unmap write zeroes operation
nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit
block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits
Link: https://lore.kernel.org/20250619111806.3546162-1-yi.zhang@huaweicloud.com
Signed-off-by: Christian Brauner <brauner@kernel.org>