Merge branch 'md-suspend-rewrite' into md-next
From Yu Kuai, written by Song Liu
Recent tests with raid10 revealed many issues with the following scenarios:
- add or remove disks to the array
- issue io to the array
At first, we fixed each problem independently, accepting that io can run
concurrently with array reconfiguration. However, as more issues keep
being reported, I am hoping to fix these problems thoroughly.
Refer to how the block layer protects io against queue reconfiguration
(for example, changing the elevator):
blk_mq_freeze_queue
-> wait for all io to be done, and prevent new io from being dispatched
// reconfiguration
blk_mq_unfreeze_queue
I think we can do something similar to synchronize io with array
reconfiguration.
The current synchronization works as follows. For the reconfiguration
operation:
1. Hold 'reconfig_mutex';
2. Check that the rdev can be added/removed; one condition is that there
is no IO (for example, check nr_pending);
3. Do the actual operations to add/remove the rdev; one procedure is to
set/clear a pointer to the rdev;
4. Check if there is still no IO on this rdev; if there is, revert the
change.
The IO path uses rcu_read_lock/unlock() to access rdev. This has several
problems:
- rcu is used wrongly;
- There are lots of places where an old rdev can be read, however, many
of them don't handle the old value correctly;
- Between step 3 and 4, if new io is dispatched, NULL will be read for
the rdev, and data will be lost if step 4 fails.
The new synchronization is similar to blk_mq_freeze_queue(). To add or
remove a disk:
1. Suspend the array, that is, stop new IO from being dispatched
and wait for inflight IO to finish;
2. Add or remove rdevs to/from the array;
3. Resume the array.
The IO path doesn't need to change for now, and all of the rcu
implementation can be removed.
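
The three steps above can be sketched as a small userspace model. This is
only an illustration, not the md code: 'struct array_model' and every
function name below are hypothetical, the nesting counter is an assumption
of the model, and where the kernel would put a submitter or the suspending
task to sleep, the model simply returns false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an array; this is not the kernel's struct mddev. */
struct array_model {
	int suspended;	/* suspend nesting count (assumption of this model) */
	int inflight;	/* IO that has entered and not yet completed */
};

/* IO entry: refused while the array is suspended (a real submitter would wait). */
static bool io_begin(struct array_model *a)
{
	if (a->suspended > 0)
		return false;
	a->inflight++;
	return true;
}

/* IO completion: drops the inflight count taken by io_begin(). */
static void io_end(struct array_model *a)
{
	assert(a->inflight > 0);
	a->inflight--;
}

/* Step 1, first half: stop new IO from being dispatched. */
static void suspend_begin(struct array_model *a)
{
	a->suspended++;
}

/* Step 1, second half: true once all inflight IO has finished. */
static bool suspend_drained(const struct array_model *a)
{
	return a->suspended > 0 && a->inflight == 0;
}

/* Step 3: let IO through again. */
static void resume_array(struct array_model *a)
{
	assert(a->suspended > 0);
	a->suspended--;
}
```

In this model, rdevs may only be added or removed (step 2) while
suspend_drained() returns true, so the IO path can never observe a
half-updated array.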
The main work is then divided into 3 steps:
First, make sure the new apis to suspend the array are general:
- make sure suspend array will wait for io to be done (Done by [1]);
- make sure suspend array can be called for all personalities (Done by [2]);
- make sure suspend array can be called at any time (Done by [3]);
- make sure suspend array doesn't rely on 'reconfig_mutex' (PATCH 3-5);
Second, replace old apis with new apis (PATCH 6-16). Specifically, the
synchronization is changed from:
lock reconfig_mutex
suspend array
make changes
resume array
unlock reconfig_mutex
to:
suspend array
lock reconfig_mutex
make changes
unlock reconfig_mutex
resume array
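
The new ordering can be sketched as a single helper that wraps any
reconfiguration step. This is a userspace model with hypothetical names
('struct reconfig_model', reconfigure(), add_disk()), not the helpers
added by this series; it records each step in a trace string so the order
can be checked.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model: two flags plus a trace of the steps taken. */
struct reconfig_model {
	int suspended;	/* 1 while the array is suspended */
	int locked;	/* 1 while 'reconfig_mutex' is held */
	char trace[8];	/* one letter per step, NUL terminated */
	int n;		/* number of letters recorded so far */
};

/* Append one step letter to the trace. */
static void record(struct reconfig_model *m, char ev)
{
	assert(m->n < (int)sizeof(m->trace) - 1);
	m->trace[m->n++] = ev;
	m->trace[m->n] = '\0';
}

static void suspend_array(struct reconfig_model *m)
{
	m->suspended = 1;
	record(m, 'S');
}

static void lock_reconfig(struct reconfig_model *m)
{
	assert(!m->locked);
	m->locked = 1;
	record(m, 'L');
}

static void unlock_reconfig(struct reconfig_model *m)
{
	assert(m->locked);
	m->locked = 0;
	record(m, 'U');
}

static void resume_array(struct reconfig_model *m)
{
	assert(m->suspended);
	m->suspended = 0;
	record(m, 'R');
}

/* New ordering: suspend, lock, make changes, unlock, resume. */
static void reconfigure(struct reconfig_model *m,
			void (*change)(struct reconfig_model *))
{
	suspend_array(m);
	lock_reconfig(m);
	change(m);
	unlock_reconfig(m);
	resume_array(m);
}

/* Example change: only valid while suspended and locked. */
static void add_disk(struct reconfig_model *m)
{
	assert(m->suspended && m->locked);
	record(m, 'C');
}
```

Calling reconfigure(&m, add_disk) on a zeroed model leaves the trace
"SLCUR": the suspend is the outermost step and 'reconfig_mutex' is taken
only after the array is already suspended.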
Finally, for the remaining paths that involve reconfiguration, suspend the
array first (PATCH 11, 12, [4] and PATCH 17).
Preparatory work:
[1] https://lore.kernel.org/all/20230621165110.1498313-1-yukuai1@huaweicloud.com/
[2] https://lore.kernel.org/all/20230628012931.88911-2-yukuai1@huaweicloud.com/
[3] https://lore.kernel.org/all/20230825030956.1527023-1-yukuai1@huaweicloud.com/
[4] https://lore.kernel.org/all/20230825031622.1530464-1-yukuai1@huaweicloud.com/
* md-suspend-rewrite:
md: rename __mddev_suspend/resume() back to mddev_suspend/resume()
md: remove old apis to suspend the array
md: suspend array in md_start_sync() if array need reconfiguration
md/raid5: replace suspend with quiesce() callback
md/md-linear: cleanup linear_add()
md: cleanup mddev_create/destroy_serial_pool()
md: use new apis to suspend array before mddev_create/destroy_serial_pool
md: use new apis to suspend array for ioctls involed array reconfiguration
md: use new apis to suspend array for adding/removing rdev from state_store()
md: use new apis to suspend array for sysfs apis
md/raid5: use new apis to suspend array
md/raid5-cache: use new apis to suspend array
md/md-bitmap: use new apis to suspend array for location_store()
md/dm-raid: use new apis to suspend array
md: add new helpers to suspend/resume and lock/unlock array
md: add new helpers to suspend/resume array
md: replace is_md_suspended() with 'mddev->suspended' in md_check_recovery()
md/raid5-cache: use READ_ONCE/WRITE_ONCE for 'conf->log'
md: use READ_ONCE/WRITE_ONCE for 'suspend_lo' and 'suspend_hi'