git.ipfire.org Git - people/ms/linux.git/commit

btrfs: add the beginning of async discard, discard workqueue

When discard is enabled, everytime a pinned extent is released back to
the block_group's free space cache, a discard is issued for the extent.
This is an overeager approach when it comes to discarding and helping
the SSD maintain enough free space to prevent severe garbage collection
situations.

This adds the beginning of async discard. Instead of issuing a discard
prior to returning it to the free space, it is just marked as untrimmed.
The block_group is then added to a LRU which then feeds into a workqueue
to issue discards at a much slower rate. Full discarding of unused block
groups is still done and will be addressed in a future patch of the
series.

For now, we don't persist the discard state of extents and bitmaps.
Therefore, our failure recovery mode will be to consider extents
untrimmed. This lets us handle failure and unmounting as one in the
same.

On a number of Facebook webservers, I collected data every minute
accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
is where we discard extents synchronously before returning them to the
free space cache.

discard=sync:
                 p99 total per minute       p99 total per minute
      Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
    ---------------------------------------------------------------
     Drive A  |           434           |          1170
     Drive B  |           880           |          2330
     Drive C  |          2943           |          3920
     Drive D  |          4763           |          5701

discard=async:
                 p99 total per minute       p99 total per minute
      Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
    --------------------------------------------------------------
     Drive A  |           134           |           956
     Drive B  |            64           |          1972
     Drive C  |            59           |          1032
     Drive D  |            62           |          1200

While it's not great that the stats are cumulative over 1m, all of these
servers are running the same workload and and the delta between the two
are substantial. We are spending significantly less time in
btrfs_finish_extent_commit() which is responsible for discarding.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

author	Dennis Zhou <dennis@kernel.org>
	Sat, 14 Dec 2019 00:22:14 +0000 (16:22 -0800)
committer	David Sterba <dsterba@suse.com>
	Mon, 20 Jan 2020 15:40:57 +0000 (16:40 +0100)
commit	b0643e59cfa609c4b5f246f2b2c33b078f87e9d9
tree	04ff6c50b7ea0f5284b1237d198210a69d0e71f6	tree \| snapshot
parent	da080fe1bad4777b02f6a3db42823a8797aadbca	commit \| diff

fs/btrfs/Makefile		diff \| blob \| blame \| history
fs/btrfs/block-group.c		diff \| blob \| blame \| history
fs/btrfs/block-group.h		diff \| blob \| blame \| history
fs/btrfs/ctree.h		diff \| blob \| blame \| history
fs/btrfs/discard.c	[new file with mode: 0644]	blob
fs/btrfs/discard.h	[new file with mode: 0644]	blob
fs/btrfs/disk-io.c		diff \| blob \| blame \| history
fs/btrfs/extent-tree.c		diff \| blob \| blame \| history
fs/btrfs/free-space-cache.c		diff \| blob \| blame \| history
fs/btrfs/free-space-cache.h		diff \| blob \| blame \| history
fs/btrfs/super.c		diff \| blob \| blame \| history
fs/btrfs/volumes.c		diff \| blob \| blame \| history