External Reshape

1 Problem statement

External (third-party metadata) reshape differs from native-metadata
reshape in three key ways:

1.1 Format specific constraints

In the native case reshape is limited by what is implemented in the
generic reshape routine (Grow_reshape()) and what is supported by the
kernel. There are exceptional cases where Grow_reshape() may block
operations when it knows that the kernel implementation is broken, but
otherwise the kernel is relied upon to be the final arbiter of what
reshape operations are supported.

In the external case the kernel, and the generic checks in
Grow_reshape(), become the super-set of what reshapes are possible. The
metadata format may not support, or may not yet have implemented, a
given reshape type. The implication for Grow_reshape() is that it must
query the metadata handler and effect changes in the metadata before the
new geometry is posted to the kernel. The ->reshape_super method allows
Grow_reshape() to validate the requested operation and post the metadata
update.

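As a rough illustration of that division of labour, the sketch below
shows the validate-then-update calling pattern. The struct, argument
list, and the disk-count cap are invented for the example; the real
method is a member of struct superswitch and takes the full requested
geometry:

    #include <stdio.h>

    /* Simplified stand-ins: the real types live in mdadm.h and
     * ->reshape_super sees the complete requested geometry. */
    struct supertype;                   /* opaque here */
    struct reshape_req {
            long long size;     /* new component size, -1 = unchanged */
            int level;          /* new raid level,     -1 = unchanged */
            int raid_disks;     /* new disk count,      0 = unchanged */
    };

    /* Format handler: refuse what the format cannot represent and,
     * on success, queue a metadata update for the new geometry. */
    static int demo_reshape_super(struct supertype *st,
                                  struct reshape_req *rq)
    {
            (void)st;
            if (rq->raid_disks > 4)
                    return -1;  /* e.g. format caps the member count */
            /* ... append an update record to st->update_tail ... */
            return 0;
    }

    /* Caller side, mirroring Grow_reshape(): validate before the
     * kernel ever sees the new geometry. */
    static int grow_reshape_sketch(struct supertype *st,
                                   struct reshape_req *rq)
    {
            if (demo_reshape_super(st, rq) < 0) {
                    fprintf(stderr, "reshape rejected by handler\n");
                    return 1;
            }
            /* flush the queued update, then post geometry to sysfs */
            return 0;
    }

    int main(void)
    {
            struct reshape_req rq = { -1, -1, 5 };

            return grow_reshape_sketch(NULL, &rq);  /* rejected: 5 > 4 */
    }
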
1.2 Scope of reshape

Native metadata reshape is always performed at the array scope (no
metadata relationship with sibling arrays on the same disks). External
reshape, depending on the format, may not allow the number of member
disks to be changed in a subarray unless the change is simultaneously
applied to all subarrays in the container. For example the imsm format
requires all member disks to be a member of all subarrays, so a 4-disk
raid5 in a container that also houses a 4-disk raid10 array could not be
reshaped to 5 disks as the imsm format does not support a 5-disk raid10
representation. This requires the ->reshape_super method to check the
contents of the array and either ask the user to run the reshape at
container scope (if all subarrays are agreeable to the change), or
report an error in the case where one subarray cannot support the
change.

1.3 Monitoring / checkpointing

Reshape, unlike rebuild/resync, requires strict checkpointing to survive
interrupted reshape operations. For example, when expanding a raid5
array the first few stripes of the array will be overwritten in a
destructive manner. When restarting the reshape process we need to know
the exact location of the last successfully written stripe, and we need
to restore the data in any partially overwritten stripe. Native
metadata stores this backup data in the unused portion of spares that
are being promoted to array members, or in an external backup file
(located on a non-involved block device).

The kernel is in charge of recording checkpoints of reshape progress,
but mdadm is delegated the task of managing the backup space, which
involves:
1/ Identifying what data will be overwritten in the next unit of the
   reshape operation.
2/ Suspending access to that region so that a snapshot of the data can
   be transferred to the backup space.
3/ Allowing the kernel to reshape the saved region and setting the
   boundary for the next backup.

In the external reshape case we want to preserve this mdadm
'reshape-manager' arrangement, but have a third actor, mdmon, to
consider. It is tempting to give the role of managing reshape to mdmon,
but that is counter to its role as a monitor, and conflicts with the
existing capabilities and role of mdadm to manage the progress of
reshape. For clarity the external reshape implementation maintains the
role of mdmon as a (mostly) passive recorder of raid events, and mdadm
treats it as it would the kernel in the native reshape case (modulo
needing to send explicit metadata update messages and checking that
mdmon took the expected action).

External reshape can use the generic md backup file as a fallback, but in
the optimal/firmware-compatible case the reshape-manager will use the
metadata specific areas for managing reshape. The implementation also
needs to spawn a reshape-manager per subarray when the reshape is being
carried out at the container level. For these two reasons the
->manage_reshape() method is introduced. In addition to the base tasks
mentioned above, this method:
1/ Processes each subarray one at a time in series - where appropriate.
2/ Uses either the generic routines in Grow.c for md-style backup file
   support, or the metadata-format specific location for storing
   recovery data.
This aims to avoid a "midlayer mistake"[1] and lets the metadata handler
optionally take advantage of the generic infrastructure in Grow.c.

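A minimal sketch of that per-subarray dispatch follows; struct
subarray and both helpers are hypothetical stand-ins for the mdinfo
list that container_content() would supply:

    #include <stdio.h>

    /* Hypothetical stand-in for per-array data. */
    struct subarray {
            const char *name;
            int has_native_backup_area; /* format offers its own space? */
    };

    /* Item 2/ above: generic Grow.c backup-file machinery as the
     * fallback, the format's own recovery area when it has one. */
    static int reshape_one(const struct subarray *sa)
    {
            printf("reshaping %s via %s\n", sa->name,
                   sa->has_native_backup_area ? "metadata-specific area"
                                              : "generic md backup file");
            return 0;
    }

    /* Item 1/ above: container-scope reshape processes the
     * subarrays one at a time, in series. */
    static int manage_reshape_sketch(const struct subarray *subs, int n)
    {
            for (int i = 0; i < n; i++)
                    if (reshape_one(&subs[i]) < 0)
                            return -1;
            return 0;
    }

    int main(void)
    {
            struct subarray subs[] = { { "r5vol", 1 }, { "r10vol", 1 } };

            return manage_reshape_sketch(subs, 2) ? 1 : 0;
    }
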
2 Details for specific reshape requests

There are quite a few moving pieces spread out across md, mdadm, and
mdmon for the support of external reshape, and there are several
different types of reshape that need to be handled by the
implementation. A rundown of these details follows.

2.0 General provisions:

Obtain an exclusive open on the container to make sure we are not
running concurrently with a Create() event.

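A minimal sketch of this provision, with an illustrative device node.
An O_EXCL open of a block device fails while another process holds its
own exclusive open, which is what keeps a concurrent Create() out:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /* Hold an exclusive open on the container for the whole
             * reshape; a concurrent Create() attempting the same
             * exclusive access fails instead of racing us. */
            int fd = open("/dev/md127", O_RDONLY | O_EXCL);

            if (fd < 0) {
                    perror("container busy or missing");
                    return 1;
            }
            /* ... drive the reshape while holding fd ... */
            close(fd);
            return 0;
    }
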
2.1 Freezing sync_action

   Before making any attempt at a reshape we 'freeze' every array in
   the container to ensure no spare assignment or recovery happens.
   This involves writing 'frozen' to sync_action and changing the '/'
   after 'external:' in metadata_version to a '-'. mdmon knows that
   this means not to perform any management.

   Before doing this we check that all sync_actions are 'idle', which
   is racy but still useful.
   Afterwards we check that no member array has a spare or a partial
   spare (recovery_start != 'none'), which would indicate a race. If
   any does, we unfreeze again.

   Once this completes we know all the arrays are stable. They may
   still have failed devices, as devices can fail at any time; however
   we treat those like failures that happen during the reshape.

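   The freeze itself is two sysfs writes per member array. A minimal
   sketch, assuming the usual /sys/block/<dev>/md layout; the
   metadata_version value shown is a sample, where real code would
   read the current value and substitute '-' for the '/':

      #include <stdio.h>

      static int write_attr(const char *path, const char *val)
      {
              FILE *f = fopen(path, "w");

              if (!f)
                      return -1;
              fprintf(f, "%s", val);
              return fclose(f);
      }

      static int freeze_member(const char *dev)     /* e.g. "md126" */
      {
              char path[256];

              /* 1/ stop the kernel starting spare assignment/recovery */
              snprintf(path, sizeof(path),
                       "/sys/block/%s/md/sync_action", dev);
              if (write_attr(path, "frozen") < 0)
                      return -1;

              /* 2/ flip "external:/..." to "external:-..." so mdmon
               * stops managing the array ("external:-md127/0" is a
               * sample value for a member of container md127) */
              snprintf(path, sizeof(path),
                       "/sys/block/%s/md/metadata_version", dev);
              return write_attr(path, "external:-md127/0");
      }

      int main(void)
      {
              return freeze_member("md126") ? 1 : 0;
      }
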
2.2 Reshape size

 1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
    initializes st->update_tail
 2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the size
    change is allowed (it is being performed at subarray scope and there
    is enough room) and prepares a metadata update
 3/ mdadm::Grow_reshape(): flushes the metadata update (via
    flush_metadata_update(), or ->sync_metadata())
 4/ mdadm::Grow_reshape(): posts the new size to the kernel

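Step 4 reduces to a single sysfs write once the update has been
flushed. A sketch, with the device name and size illustrative
(component_size is expressed in KiB):

    #include <stdio.h>

    /* Post the new per-device size to the kernel; done only after
     * ->reshape_super() has approved the change. */
    static int set_component_size(const char *dev, unsigned long long kib)
    {
            char path[256];
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/block/%s/md/component_size", dev);
            f = fopen(path, "w");
            if (!f)
                    return -1;
            fprintf(f, "%llu", kib);
            return fclose(f);   /* non-zero if the kernel refused it */
    }

    int main(void)
    {
            /* grow each member device to 100 GiB (in KiB) */
            return set_component_size("md126", 104857600ULL) ? 1 : 0;
    }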

2.3 Reshape level (simple-takeover)

"simple-takeover" implies that the level change can be satisfied without
touching sync_action.

 1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
    initializes st->update_tail
 2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level
    change is allowed (it is being performed at subarray scope) and
    prepares a metadata update
 2a/ raid10 --> raid0: degrade all mirror legs prior to calling
     ->reshape_super
 3/ mdadm::Grow_reshape(): flushes the metadata update (via
    flush_metadata_update(), or ->sync_metadata())
 4/ mdadm::Grow_reshape(): posts the new level to the kernel

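Step 4 here is likewise a single attribute write: the kernel performs,
or refuses, the takeover when the new personality name is written to
md/level. A sketch with illustrative names:

    #include <stdio.h>

    /* Request a takeover; a failed write/fclose() means the kernel
     * rejected the transition. */
    static int set_level(const char *dev, const char *level)
    {
            char path[256];
            FILE *f;

            snprintf(path, sizeof(path), "/sys/block/%s/md/level", dev);
            f = fopen(path, "w");
            if (!f)
                    return -1;
            fprintf(f, "%s", level);
            return fclose(f);
    }

    int main(void)
    {
            /* e.g. the raid10 --> raid0 case from 2a/ above */
            return set_level("md126", "raid0") ? 1 : 0;
    }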
2.4 Reshape chunk, layout

2.5 Reshape raid disks (grow)

 1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail
    because only redundant raid levels can modify the number of raid disks
 2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level
    change is allowed (it is being performed at the proper scope, the
    geometry is permissible, and proper spares are available in the
    container), chooses the spares to use, and prepares a metadata update
 3/ mdadm::Grow_reshape(): converts each subarray in the container to the
    raid level that can perform the reshape and starts mdmon
 4/ mdadm::Grow_reshape(): pushes the update to mdmon
 5/ mdadm::Grow_reshape(): uses container_content to find details of
    the spares and passes them to the kernel
 6/ mdadm::Grow_reshape(): gives the raid_disks update to the kernel,
    sets sync_max, sync_min, suspend_lo, suspend_hi all to zero,
    and starts the reshape by writing 'reshape' to sync_action
 7/ mdmon::monitor notices the sync_action change and tells
    managemon to check for new devices. managemon notices the new
    devices, opens the relevant sysfs files, and passes them all to
    monitor
 8/ mdadm::Grow_reshape(): calls ->manage_reshape to oversee the
    rest of the reshape

 9/ mdadm::<format>->manage_reshape(): saves data that will be overwritten
    by the kernel to either the backup file or the metadata specific
    location, advances sync_max, waits for the reshape, pings mdmon, and
    repeats (a sketch of this loop appears at the end of this section).
    Meanwhile mdmon::read_and_act(): records checkpoints.
    Specifically:

    9a/ if the 'next' stripe to be reshaped will over-write
        itself during reshape then:
        9a.1/ increase suspend_hi to cover a suitable number of
              stripes
        9a.2/ back up those stripes safely
        9a.3/ advance sync_max to allow those stripes to be reshaped
        9a.4/ when sync_completed indicates that those stripes have
              been reshaped, manage_reshape must ping_manager
        9a.5/ when mdmon notices that sync_completed has been updated,
              it records the new checkpoint in the metadata
        9a.6/ after the ping_manager, manage_reshape will increase
              suspend_lo to allow access to those stripes again

    9b/ if the 'next' stripe to be reshaped will over-write unused
        space during reshape then we apply the same process as above,
        except that there is no need to back anything up.
        Note that we *do* need to keep suspend_hi progressing as
        it is not safe to write to the area-under-reshape. For
        kernel-managed-metadata this protection is provided by
        ->reshape_safe, but that does not protect us in the case
        of user-space-managed-metadata.

 10/ mdadm::<format>->manage_reshape(): once the reshape completes, changes
     the raid level back to the nominal raid level (if necessary)

     FIXME: native metadata does not have the capability to record the
     original raid level in the reshape-restart case because the kernel
     always records the current raid level in the metadata, whereas
     external metadata can masquerade at an alternate level based on the
     reshape state.

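Below is a sketch of one pass through the 9a loop, written against the
sysfs attributes named above. The device name, window size, and units
are illustrative; backup_stripes() and ping_mdmon() are stubs for the
format-specific pieces; and waiting is shown as simple polling:

    #include <stdio.h>
    #include <unistd.h>

    static void sysfs_set(const char *attr, unsigned long long v)
    {
            char path[128];
            FILE *f;

            snprintf(path, sizeof(path), "/sys/block/md126/md/%s", attr);
            f = fopen(path, "w");
            if (f) {
                    fprintf(f, "%llu", v);
                    fclose(f);
            }
    }

    static unsigned long long sysfs_get(const char *attr)
    {
            char path[128];
            unsigned long long v = 0;
            FILE *f;

            snprintf(path, sizeof(path), "/sys/block/md126/md/%s", attr);
            f = fopen(path, "r");
            if (f) {
                    /* sync_completed reads "done / total"; take "done" */
                    if (fscanf(f, "%llu", &v) != 1)
                            v = 0;
                    fclose(f);
            }
            return v;
    }

    static void backup_stripes(unsigned long long lo,
                               unsigned long long hi)
    {
            /* stub: copy [lo, hi) to the backup file or the
             * metadata-specific recovery area */
            (void)lo; (void)hi;
    }

    static void ping_mdmon(void)
    {
            /* stub: poke mdmon so it records the checkpoint (9a.5) */
    }

    /* One window: suspend, back up, reshape, checkpoint, resume. */
    static void reshape_window(unsigned long long lo,
                               unsigned long long hi)
    {
            sysfs_set("suspend_hi", hi);    /* 9a.1: block writers  */
            backup_stripes(lo, hi);         /* 9a.2: save the data  */
            sysfs_set("sync_max", hi);      /* 9a.3: allow reshape  */
            while (sysfs_get("sync_completed") < hi)
                    usleep(100 * 1000);     /* 9a.4: wait on kernel */
            ping_mdmon();                   /* 9a.5: checkpoint     */
            sysfs_set("suspend_lo", hi);    /* 9a.6: resume access  */
    }

    int main(void)
    {
            reshape_window(0, 1024);        /* sectors, illustrative */
            return 0;
    }
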
2.6 Reshape raid disks (shrink)

3 TODO

...

[1]: Linux kernel design patterns - part 3, Neil Brown
     http://lwn.net/Articles/336262/