External Reshape

1 Problem statement

External (third-party metadata) reshape differs from native-metadata
reshape in three key ways:

1.1 Format specific constraints

In the native case, reshape is limited by what is implemented in the
generic reshape routine (Grow_reshape()) and what is supported by the
kernel. There are exceptional cases where Grow_reshape() may block
operations when it knows that the kernel implementation is broken, but
otherwise the kernel is relied upon to be the final arbiter of what
reshape operations are supported.

In the external case, the kernel and the generic checks in
Grow_reshape() define the superset of possible reshapes. The
metadata format may not support, or may not yet implement, a given
reshape type. The implication for Grow_reshape() is that it must query
the metadata handler and effect changes in the metadata before the new
geometry is posted to the kernel. The ->reshape_super method allows
Grow_reshape() to validate the requested operation and post the metadata
update.
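
The ordering described above (ask the metadata handler first, post to the
kernel only afterwards) can be sketched roughly as follows. This is a toy
model, not mdadm code: every toy_* name, the two-field request, and the
"raid5 only" format are assumptions made for illustration, and the real
->reshape_super method takes many more arguments.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified request; real mdadm passes many more fields. */
struct toy_request {
	int level;       /* requested raid level */
	int raid_disks;  /* requested number of member disks */
};

/* Hypothetical handler table standing in for the metadata handler. */
struct toy_handler {
	/* Returns 0 and sets *update_queued if the format accepts req. */
	int (*reshape_super)(const struct toy_request *req, bool *update_queued);
};

/* Records what the "kernel" was told; stands in for the real interface. */
struct toy_kernel {
	bool geometry_posted;
	struct toy_request posted;
};

/* Assumed toy format that only implements raid5 geometry changes. */
static int toy_raid5_only(const struct toy_request *req, bool *update_queued)
{
	if (req->level != 5)
		return -1;
	*update_queued = true;
	return 0;
}

static int toy_grow_reshape(const struct toy_handler *h,
			    const struct toy_request *req,
			    struct toy_kernel *k)
{
	bool update_queued = false;

	/* External case: the format may veto what the kernel would accept. */
	if (h && h->reshape_super) {
		if (h->reshape_super(req, &update_queued) != 0)
			return -1;      /* refused: the kernel never sees it */
		if (!update_queued)
			return -1;      /* no metadata update was prepared */
	}

	/* Only now is the new geometry posted to the kernel. */
	k->posted = *req;
	k->geometry_posted = true;
	return 0;
}
```

Passing a NULL handler models the native case, where the kernel alone
decides whether the request is acceptable.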

1.2 Scope of reshape

Native metadata reshape is always performed at the array scope (no
metadata relationship with sibling arrays on the same disks). External
reshape, depending on the format, may not allow the number of member
disks to be changed in a subarray unless the change is simultaneously
applied to all subarrays in the container. For example, the imsm format
requires every member disk to be a member of every subarray, so a 4-disk
raid5 in a container that also houses a 4-disk raid10 array could not be
reshaped to 5 disks because the imsm format does not support a 5-disk
raid10 representation. This requires the ->reshape_super method to check
the contents of the container and ask the user to run the reshape at
container scope (if all subarrays can accept the change), or report an
error in the case where one subarray cannot support the change.
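
A rough sketch of the scope check a handler might perform is shown here.
The toy_* names and the verdict enum are hypothetical. The only format
rule taken from the text is that a 5-disk raid10 is not representable;
treating raid10 as 4-disk only and accepting any other level with two or
more disks are simplifying assumptions, not the actual imsm rules.

```c
#include <stddef.h>

enum toy_verdict {
	TOY_PROCEED,         /* change may be applied at the requested scope */
	TOY_NEED_CONTAINER,  /* ask the user to rerun at container scope */
	TOY_UNSUPPORTED      /* some subarray cannot take the new geometry */
};

struct toy_subarray {
	int level;
	int raid_disks;
};

/* Assumed format rule, for illustration only. */
static int toy_format_supports(int level, int raid_disks)
{
	if (level == 10)
		return raid_disks == 4;
	return raid_disks >= 2;
}

static enum toy_verdict toy_check_scope(const struct toy_subarray *subs,
					size_t count, int new_raid_disks,
					int container_scope)
{
	size_t i;

	/* Every subarray shares the member disks, so all must agree. */
	for (i = 0; i < count; i++)
		if (!toy_format_supports(subs[i].level, new_raid_disks))
			return TOY_UNSUPPORTED;

	/* All agree, but the change must then cover the whole container. */
	if (count > 1 && !container_scope)
		return TOY_NEED_CONTAINER;

	return TOY_PROCEED;
}
```

With a raid5 and a raid10 subarray, a request for 5 disks yields
TOY_UNSUPPORTED. With two raid5 subarrays the same request yields
TOY_NEED_CONTAINER until it is repeated at container scope.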

1.3 Monitoring / checkpointing

Reshape, unlike rebuild/resync, requires strict checkpointing to survive
interrupted reshape operations. For example, when expanding a raid5
array the first few stripes of the array will be overwritten in a
destructive manner. When restarting the reshape process we need to know
the exact location of the last successfully written stripe, and we need
to restore the data in any partially overwritten stripe. Native
metadata stores this backup data in the unused portion of spares that
are being promoted to array members, or in an external backup file
(located on a non-involved block device).

The kernel is in charge of recording checkpoints of reshape progress,
but mdadm is delegated the task of managing the backup space, which
involves:
1/ Identifying what data will be overwritten in the next unit of reshape
   operation.
2/ Suspending access to that region so that a snapshot of the data can
   be transferred to the backup space.
3/ Allowing the kernel to reshape the saved region and setting the
   boundary for the next backup.
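
The three tasks above, plus the restore on restart, can be modelled with
a small in-memory sketch. Everything here is hypothetical: the array is a
16-byte buffer, a "unit" is 4 bytes, the destructive rewrite is simulated
by inverting bytes, and the suspend range is just a pair of fields rather
than a kernel interface.

```c
#include <stddef.h>
#include <string.h>

#define TOY_ARRAY_SIZE 16
#define TOY_UNIT 4

struct toy_state {
	unsigned char data[TOY_ARRAY_SIZE]; /* array contents */
	unsigned char backup[TOY_UNIT];     /* backup space */
	size_t backup_start;                /* start of the saved region */
	int backup_valid;                   /* backup covers an open unit */
	size_t checkpoint;                  /* everything below is reshaped */
	size_t suspend_lo, suspend_hi;      /* access suspended in [lo, hi) */
};

static void toy_start(struct toy_state *s)
{
	size_t i;

	memset(s, 0, sizeof(*s));
	for (i = 0; i < TOY_ARRAY_SIZE; i++)
		s->data[i] = (unsigned char)i;
}

/* Tasks 1 and 2: pick the next region, suspend it, snapshot it. */
static int toy_save_next(struct toy_state *s)
{
	size_t start = s->checkpoint;

	if (start + TOY_UNIT > TOY_ARRAY_SIZE)
		return -1;
	s->suspend_lo = start;
	s->suspend_hi = start + TOY_UNIT;
	memcpy(s->backup, s->data + start, TOY_UNIT);
	s->backup_start = start;
	s->backup_valid = 1;
	return 0;
}

/* Task 3: the "kernel" rewrites the saved region. A value of written
 * below TOY_UNIT models an interruption in the middle of the unit. */
static void toy_kernel_reshape(struct toy_state *s, size_t written)
{
	size_t i;

	if (!s->backup_valid)
		return;
	if (written > TOY_UNIT)
		written = TOY_UNIT;
	for (i = 0; i < written; i++)
		s->data[s->backup_start + i] ^= 0xff;
	if (written == TOY_UNIT) {
		s->checkpoint = s->backup_start + TOY_UNIT;
		s->backup_valid = 0;
		s->suspend_lo = s->suspend_hi = 0;
	}
}

/* Restart: undo a partially overwritten unit from the backup space. */
static void toy_restart(struct toy_state *s)
{
	if (s->backup_valid)
		memcpy(s->data + s->backup_start, s->backup, TOY_UNIT);
}
```

After an interrupted unit the checkpoint still points at the start of
that unit, so restoring the backup returns the array to a state from
which the same unit can be reshaped again.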

In the external reshape case we want to preserve this mdadm
'reshape-manager' arrangement, but have a third actor, mdmon, to
consider. It is tempting to give the role of managing reshape to mdmon,
but that is counter to its role as a monitor, and conflicts with the
existing capabilities and role of mdadm to manage the progress of
reshape. For clarity, the external reshape implementation maintains the
role of mdmon as a (mostly) passive recorder of raid events, and mdadm
treats it as it would the kernel in the native reshape case (modulo
needing to send explicit metadata update messages and checking that
mdmon took the expected action).

External reshape can use the generic md backup file as a fallback, but in the
optimal/firmware-compatible case the reshape-manager will use the
metadata-specific areas for managing reshape. The implementation also needs
to spawn a reshape-manager per subarray when the reshape is being carried out
at the container level. For these two reasons the ->manage_reshape() method
is introduced. This method, in addition to the base tasks mentioned above:
1/ Spawns a manager per-subarray, when necessary
2/ Uses either generic routines in Grow.c for md-style backup file
   support, or uses the metadata-format specific location for storing
   recovery data.
This aims to avoid a "midlayer mistake"[1] and lets the metadata handler
optionally take advantage of generic infrastructure in Grow.c.

2 Details for specific reshape requests

There are quite a few moving pieces spread out across md, mdadm, and mdmon for
the support of external reshape, and there are several different types of
reshape that the implementation must handle. A rundown of these details
follows.

2.0 General provisions:

Obtain an exclusive open on the container to make sure we are not
running concurrently with a Create() event.

2.1 Freezing sync_action

2.2 Reshape size

1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
   initializes st->update_tail
2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the size
   change is allowed (being performed at subarray scope / enough room) and
   to prepare a metadata update
3/ mdadm::Grow_reshape(): flushes the metadata update (via
   flush_metadata_update(), or ->sync_metadata())
4/ mdadm::Grow_reshape(): posts the new size to the kernel
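
A minimal sketch of the four steps above follows. It assumes that
st->update_tail is set only when mdmon is running, so that updates are
queued for the monitor in that case and written directly by mdadm
otherwise. All toy_* names and fields are hypothetical stand-ins, and a
single size field replaces the format-specific update payload.

```c
#include <stddef.h>

/* Hypothetical queued metadata update. */
struct toy_update {
	unsigned long long new_size;
	struct toy_update *next;
};

struct toy_super {
	struct toy_update *updates;       /* queued updates, oldest first */
	struct toy_update **update_tail;  /* non-NULL only with mdmon */
	unsigned long long staged_size;   /* in-memory change, no mdmon */
	unsigned long long written_size;  /* metadata written by mdadm */
	int sent_to_mdmon;                /* updates handed to the monitor */
	unsigned long long kernel_size;   /* last size posted to the kernel */
};

/* Step 1: set up the queue only when a monitor will consume it. */
static void toy_init(struct toy_super *st, int mdmon_running)
{
	st->updates = NULL;
	st->update_tail = mdmon_running ? &st->updates : NULL;
	st->staged_size = st->written_size = st->kernel_size = 0;
	st->sent_to_mdmon = 0;
}

/* Step 2: validate, then queue the update or stage it in memory. */
static int toy_reshape_super(struct toy_super *st, struct toy_update *u,
			     unsigned long long new_size,
			     unsigned long long room)
{
	if (new_size > room)
		return -1;
	if (st->update_tail) {
		u->new_size = new_size;
		u->next = NULL;
		*st->update_tail = u;
		st->update_tail = &u->next;
	} else {
		st->staged_size = new_size;
	}
	return 0;
}

/* Step 3: hand queued updates to the monitor, or write directly. */
static void toy_flush(struct toy_super *st)
{
	if (st->update_tail) {
		struct toy_update *u;

		for (u = st->updates; u; u = u->next)
			st->sent_to_mdmon++;
		st->updates = NULL;
		st->update_tail = &st->updates;
	} else {
		st->written_size = st->staged_size;
	}
}

/* Steps 2 to 4 in order; the kernel is told last. */
static int toy_grow_size(struct toy_super *st, struct toy_update *u,
			 unsigned long long new_size,
			 unsigned long long room)
{
	if (toy_reshape_super(st, u, new_size, room) != 0)
		return -1;
	toy_flush(st);
	st->kernel_size = new_size;
	return 0;
}
```

A refused request leaves both the metadata and the kernel untouched,
which is the property the ordering is meant to guarantee.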


2.3 Reshape level (simple-takeover)

"simple-takeover" implies the level change can be satisfied without touching
sync_action

1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
   initializes st->update_tail
2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level
   change is allowed (being performed at subarray scope) and to prepare a
   metadata update
   2a/ raid10 --> raid0: degrade all mirror legs prior to calling
       ->reshape_super
3/ mdadm::Grow_reshape(): flushes the metadata update (via
   flush_metadata_update(), or ->sync_metadata())
4/ mdadm::Grow_reshape(): posts the new level to the kernel
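
Step 2a can be illustrated with a small helper that decides which
members to keep when degrading the mirror legs. This is a sketch under an
assumption not stated in the text: a "near" raid10 layout in which each
group of adjacent slots holds the copies of the same data, so keeping the
first slot of each group leaves one complete raid0 set. The function name
and the keep[] convention are hypothetical.

```c
#include <stddef.h>

/*
 * Fill keep[i] with 1 for members that stay and 0 for members to be
 * failed and removed. Returns the number of members kept, or -1 if the
 * geometry does not divide into whole mirror groups.
 */
static int toy_pick_raid0_members(int raid_disks, int copies, int *keep)
{
	int i, kept = 0;

	if (copies < 2 || raid_disks <= 0 || raid_disks % copies != 0)
		return -1;

	for (i = 0; i < raid_disks; i++) {
		keep[i] = (i % copies == 0);  /* first leg of each group */
		kept += keep[i];
	}
	return kept;
}
```

For a 4-disk array with 2 copies this keeps slots 0 and 2 and drops
slots 1 and 3, leaving a 2-disk set for the raid0 takeover.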

2.4 Reshape chunk, layout

2.5 Reshape raid disks (grow)

1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail
   because only redundant raid levels can modify the number of raid disks
2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the change
   in raid disks is allowed (being performed at proper scope / permissible
   geometry / proper spares available in the container) and to prepare a
   metadata update.
3/ mdadm::Grow_reshape(): converts each subarray in the container to the
   raid level that can perform the reshape and starts mdmon.
4/ mdadm::Grow_reshape(): pushes the update to mdmon...
   4a/ mdmon::process_update(): marks the array as reshaping
   4b/ mdmon::manage_member(): adds the spares (without assigning a slot)
5/ mdadm::Grow_reshape(): notes that mdmon has assigned spares and invokes
   ->manage_reshape()
6/ mdadm::<format>->manage_reshape(): (for each subarray) sets sync_max to
   zero, starts the reshape, and pings mdmon
   6a/ mdmon::read_and_act(): notices that reshape has started and notifies
       the metadata handler to record the slots chosen by the kernel
7/ mdadm::<format>->manage_reshape(): saves data that will be overwritten by
   the kernel to either the backup file or the metadata specific location,
   advances sync_max, waits for reshape, pings mdmon, and repeats.
   7a/ mdmon::read_and_act(): records checkpoints
8/ mdadm::<format>->manage_reshape(): once reshape completes, changes the
   raid level back to the nominal raid level (if necessary)
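
The manage_reshape() loop (set sync_max to zero, start, then repeatedly
back up, advance sync_max, wait, and ping mdmon) can be sketched as a
small simulation. The structs and toy_* functions are hypothetical: the
"kernel" simply jumps to sync_max when asked to run, the backup is
reduced to a byte counter, and a ping just copies the kernel position
into the monitor's checkpoint.

```c
struct toy_md {
	unsigned long long sync_max;       /* kernel may not pass this */
	unsigned long long sync_completed; /* kernel reshape position */
	unsigned long long size;           /* end of the reshape */
	int reshaping;
};

struct toy_mon {
	unsigned long long checkpoint;     /* last position recorded */
	int pings;
};

/* Stand-in for the kernel: reshape up to sync_max, then stop. */
static void toy_kernel_run(struct toy_md *md)
{
	if (md->reshaping && md->sync_completed < md->sync_max)
		md->sync_completed = md->sync_max;
	if (md->sync_completed >= md->size)
		md->reshaping = 0;
}

/* Stand-in for mdmon noticing progress and recording a checkpoint. */
static void toy_ping(struct toy_mon *mon, const struct toy_md *md)
{
	mon->checkpoint = md->sync_completed;
	mon->pings++;
}

/* Returns the amount of data that passed through the backup space. */
static unsigned long long toy_manage(struct toy_md *md, struct toy_mon *mon,
				     unsigned long long step)
{
	unsigned long long backed_up = 0;

	if (step == 0)
		return 0;

	/* Hold the kernel at zero, start the reshape, tell the monitor. */
	md->sync_max = 0;
	md->sync_completed = 0;
	md->reshaping = 1;
	toy_ping(mon, md);

	while (md->reshaping) {
		unsigned long long next = md->sync_completed + step;

		if (next > md->size)
			next = md->size;
		backed_up += next - md->sync_completed; /* save region */
		md->sync_max = next;                    /* release kernel */
		toy_kernel_run(md);                     /* wait for it */
		toy_ping(mon, md);                      /* checkpoint */
	}
	return backed_up;
}
```

Because sync_max is only advanced after the region has been saved, the
kernel in this model never overwrites data that is not already covered
by a backup.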

FIXME: native metadata does not have the capability to record the original
raid level in the reshape-restart case because the kernel always records the
current raid level to the metadata, whereas external metadata can masquerade
at an alternate level based on the reshape state.

2.6 Reshape raid disks (shrink)

3 TODO

...

[1]: Linux kernel design patterns - part 3, Neil Brown http://lwn.net/Articles/336262/