External Reshape

1 Problem statement

External (third-party metadata) reshape differs from native-metadata
reshape in three key ways:

1.1 Format specific constraints

In the native case reshape is limited by what is implemented in the
generic reshape routine (Grow_reshape()) and what is supported by the
kernel. There are exceptional cases where Grow_reshape() may block
operations when it knows that the kernel implementation is broken, but
otherwise the kernel is relied upon to be the final arbiter of what
reshape operations are supported.

In the external case the kernel, and the generic checks in
Grow_reshape(), become the super-set of what reshapes are possible. The
metadata format may not support, or may not yet implement, a given
reshape type. The implication for Grow_reshape() is that it must query
the metadata handler and effect changes in the metadata before the new
geometry is posted to the kernel. The ->reshape_super method allows
Grow_reshape() to validate the requested operation and post the metadata
update.

1.2 Scope of reshape

Native metadata reshape is always performed at the array scope (no
metadata relationship with sibling arrays on the same disks). External
reshape, depending on the format, may not allow the number of member
disks to be changed in a subarray unless the change is simultaneously
applied to all subarrays in the container. For example the imsm format
requires every member disk to be a member of every subarray, so a 4-disk
raid5 in a container that also houses a 4-disk raid10 array could not be
reshaped to 5 disks as the imsm format does not support a 5-disk raid10
representation. This requires the ->reshape_super method to check the
contents of the array and ask the user to run the reshape at container
scope (if all subarrays are agreeable to the change), or report an
error in the case where one subarray cannot support the change.

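The scope rule above can be sketched as a walk over every subarray in
the container. This is a minimal illustration, not the imsm handler:
struct subarray, level_supports_disks() and first_blocking_subarray()
are hypothetical names, and the rule that raid10 is only representable
with exactly 4 disks is an assumption generalised from the example
above.

```c
#include <stddef.h>

/* Hypothetical, simplified view of a subarray: only its raid level. */
struct subarray {
	int level;	/* e.g. 0, 1, 5, 10 */
};

/*
 * Assumed format rule, for illustration only: raid10 is representable
 * solely with 4 disks; every other level accepts any count >= 2.
 */
int level_supports_disks(int level, int raid_disks)
{
	if (level == 10)
		return raid_disks == 4;
	return raid_disks >= 2;
}

/*
 * Return the index of the first subarray that cannot take the new disk
 * count, or -1 when every subarray is agreeable to the change.
 */
int first_blocking_subarray(const struct subarray *subs, size_t count,
			    int new_raid_disks)
{
	size_t i;

	for (i = 0; i < count; i++)
		if (!level_supports_disks(subs[i].level, new_raid_disks))
			return (int)i;
	return -1;
}
```

A result of -1 corresponds to asking the user to rerun the reshape at
container scope; any other value identifies the subarray to name in the
error report.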
1.3 Monitoring / checkpointing

Reshape, unlike rebuild/resync, requires strict checkpointing to survive
interrupted reshape operations. For example when expanding a raid5
array the first few stripes of the array will be overwritten in a
destructive manner. When restarting the reshape process we need to know
the exact location of the last successfully written stripe, and we need
to restore the data in any partially overwritten stripe. Native
metadata stores this backup data in the unused portion of spares that
are being promoted to array members, or in an external backup file
(located on a non-involved block device).

The kernel is in charge of recording checkpoints of reshape progress,
but mdadm is delegated the task of managing the backup space which
involves:
1/ Identifying what data will be overwritten in the next unit of reshape
   operation
2/ Suspending access to that region so that a snapshot of the data can
   be transferred to the backup space.
3/ Allowing the kernel to reshape the saved region and setting the
   boundary for the next backup.

In the external reshape case we want to preserve this mdadm
'reshape-manager' arrangement, but have a third actor, mdmon, to
consider. It is tempting to give the role of managing reshape to mdmon,
but that is counter to its role as a monitor, and conflicts with the
existing capabilities and role of mdadm to manage the progress of
reshape. For clarity the external reshape implementation maintains the
role of mdmon as a (mostly) passive recorder of raid events, and mdadm
treats it as it would the kernel in the native reshape case (modulo
needing to send explicit metadata update messages and checking that
mdmon took the expected action).

External reshape can use the generic md backup file as a fallback, but in the
optimal/firmware-compatible case the reshape-manager will use the metadata
specific areas for managing reshape. The implementation also needs to spawn a
reshape-manager per subarray when the reshape is being carried out at the
container level. For these two reasons the ->manage_reshape() method is
introduced. This method, in addition to the base tasks mentioned above:
1/ Processes each subarray one at a time in series - where appropriate.
2/ Uses either generic routines in Grow.c for md-style backup file
   support, or uses the metadata-format specific location for storing
   recovery data.
This aims to avoid a "midlayer mistake"[1] and lets the metadata handler
optionally take advantage of generic infrastructure in Grow.c.

2 Details for specific reshape requests

There are quite a few moving pieces spread out across md, mdadm, and mdmon for
the support of external reshape, and there are several different types of
reshape that need to be comprehended by the implementation. A rundown of
these details follows.

2.0 General provisions:

Obtain an exclusive open on the container to make sure we are not
running concurrently with a Create() event.

2.1 Freezing sync_action

   Before making any attempt at a reshape we 'freeze' every array in
   the container to ensure no spare assignment or recovery happens.
   This involves writing 'frozen' to sync_action and changing the '/'
   after 'external:' in metadata_version to a '-'. mdmon knows that
   this means not to perform any management.

   Before doing this we check that all sync_actions are 'idle', which
   is racy but still useful.
   Afterwards we check that all member arrays have no spares
   or partial spares (recovery_start != 'none') which would indicate a
   race. If they do, we unfreeze again.

   Once this completes we know all the arrays are stable. They may
   still have failed devices as devices can fail at any time. However
   we treat those like failures that happen during the reshape.

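The metadata_version half of the freeze can be sketched as a one
character edit on the string. This is an illustration only:
set_frozen_marker() is a hypothetical helper, the sample value
"external:/md127/0" used in the usage note is an assumed example, and
treating unfreeze as the exact reverse flip is an assumption.

```c
#include <string.h>

/*
 * Mark (freeze != 0) or unmark (freeze == 0) a metadata_version string
 * as frozen by flipping the character that follows "external:" between
 * '/' and '-'.
 * Returns 0 on success, -1 if the string is not in the expected state.
 */
int set_frozen_marker(char *metadata_version, int freeze)
{
	static const char prefix[] = "external:";
	size_t plen = sizeof(prefix) - 1;
	char from = freeze ? '/' : '-';
	char to = freeze ? '-' : '/';

	if (strncmp(metadata_version, prefix, plen) != 0)
		return -1;	/* not external metadata */
	if (metadata_version[plen] != from)
		return -1;	/* already in the requested state, or malformed */
	metadata_version[plen] = to;
	return 0;
}
```

Freezing a buffer holding "external:/md127/0" leaves "external:-md127/0"
in place; the caller would still have to write the result back to sysfs
and write 'frozen' to sync_action.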
2.2 Reshape size

 1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
    initializes st->update_tail
 2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the size
    change is allowed (being performed at subarray scope / enough room)
    and prepares a metadata update
 3/ mdadm::Grow_reshape(): flushes the metadata update (via
    flush_metadata_update(), or ->sync_metadata())
 4/ mdadm::Grow_reshape(): posts the new size to the kernel

2.3 Reshape level (simple-takeover)

"simple-takeover" implies the level change can be satisfied without touching
sync_action

 1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
    initializes st->update_tail
 2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level
    change is allowed (being performed at subarray scope) and prepares a
    metadata update
    2a/ raid10 --> raid0: degrade all mirror legs prior to calling
        ->reshape_super
 3/ mdadm::Grow_reshape(): flushes the metadata update (via
    flush_metadata_update(), or ->sync_metadata())
 4/ mdadm::Grow_reshape(): posts the new level to the kernel

2.4 Reshape chunk, layout

2.5 Reshape raid disks (grow)

 1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail
    because only redundant raid levels can modify the number of raid disks
 2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the
    change is allowed (being performed at proper scope / permissible
    geometry / proper spares available in the container), chooses
    the spares to use, and prepares a metadata update.
 3/ mdadm::Grow_reshape(): Converts each subarray in the container to the
    raid level that can perform the reshape and starts mdmon.
 4/ mdadm::Grow_reshape(): Pushes the update to mdmon.
 5/ mdadm::Grow_reshape(): uses container_content to find details of
    the spares and passes them to the kernel.
 6/ mdadm::Grow_reshape(): gives raid_disks update to the kernel,
    sets sync_max, sync_min, suspend_lo, suspend_hi all to zero,
    and starts the reshape by writing 'reshape' to sync_action.
 7/ mdmon::monitor notices the sync_action change and tells
    managemon to check for new devices. managemon notices the new
    devices, opens the relevant sysfs files, and passes them all to
    monitor.
 8/ mdadm::Grow_reshape() calls ->manage_reshape to oversee the
    rest of the reshape.

 9/ mdadm::<format>->manage_reshape(): saves data that will be overwritten by
    the kernel to either the backup file or the metadata specific location,
    advances sync_max, waits for reshape, pings mdmon, and repeats.
    Meanwhile mdmon::read_and_act() records checkpoints.
    Specifically:

    9a/ if the 'next' stripe to be reshaped will over-write
        itself during reshape then:
        9a.1/ increase suspend_hi to cover a suitable number of
              stripes.
        9a.2/ backup those stripes safely.
        9a.3/ advance sync_max to allow those stripes to be reshaped
        9a.4/ when sync_completed indicates that those stripes have
              been reshaped, manage_reshape must ping_manager
        9a.5/ when mdmon notices that sync_completed has been updated,
              it records the new checkpoint in the metadata
        9a.6/ after the ping_manager, manage_reshape will increase
              suspend_lo to allow access to those stripes again

    9b/ if the 'next' stripe to be reshaped will over-write unused
        space during reshape then we apply the same process as above,
        except that there is no need to back anything up.
        Note that we *do* need to keep suspend_hi progressing as
        it is not safe to write to the area-under-reshape. For
        kernel-managed-metadata this protection is provided by
        ->reshape_safe, but that does not protect us in the case
        of user-space-managed-metadata.

 10/ mdadm::<format>->manage_reshape(): Once reshape completes changes the raid
    level back to the nominal raid level (if necessary)

 FIXME: native metadata does not have the capability to record the original
 raid level in reshape-restart case because the kernel always records current
 raid level to the metadata, whereas external metadata can masquerade at an
 alternate level based on the reshape state.

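The ordering of steps 9a.1 to 9a.6 (and of 9b, which skips the backup)
can be sketched with in-memory stand-ins for the sysfs attributes. This
is a simplified model, not the code in Grow.c: struct reshape_state,
kernel_reshape(), mdmon_record() and reshape_one_window() are
hypothetical, all positions share one arbitrary unit, and the kernel
and mdmon are assumed to respond immediately.

```c
/* Hypothetical in-memory stand-ins for the md sysfs attributes. */
struct reshape_state {
	unsigned long long suspend_lo;
	unsigned long long suspend_hi;
	unsigned long long sync_max;
	unsigned long long sync_completed;	/* advanced by the "kernel" */
	unsigned long long checkpoint;		/* recorded by "mdmon" */
	unsigned long long backed_up_to;	/* end of the saved region */
};

/* Stand-in for the kernel: reshape everything up to sync_max. */
void kernel_reshape(struct reshape_state *s)
{
	s->sync_completed = s->sync_max;
}

/* Stand-in for mdmon noticing sync_completed and recording it. */
void mdmon_record(struct reshape_state *s)
{
	s->checkpoint = s->sync_completed;
}

/*
 * One pass over a window of 'len' units. need_backup != 0 models case
 * 9a, need_backup == 0 models case 9b. Returns 0 on success.
 */
int reshape_one_window(struct reshape_state *s, unsigned long long len,
		       int need_backup)
{
	unsigned long long end = s->sync_completed + len;

	s->suspend_hi = end;		/* 9a.1: block access to the window */
	if (need_backup)
		s->backed_up_to = end;	/* 9a.2: data would be copied here */
	s->sync_max = end;		/* 9a.3: let the kernel proceed */
	kernel_reshape(s);		/* 9a.4: wait for sync_completed */
	if (s->sync_completed != end)
		return -1;
	mdmon_record(s);		/* 9a.4 ping, 9a.5 checkpoint */
	s->suspend_lo = end;		/* 9a.6: release the window */
	return 0;
}
```

Calling reshape_one_window() repeatedly moves suspend_lo, suspend_hi
and the checkpoint forward together; suspend_hi keeps advancing even
when need_backup is 0, matching the note under 9b.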
2.6 Reshape raid disks (shrink)

3 Interaction with metadata handler.

  The following calls are made into the metadata handler to assist
  with initiating and monitoring a 'reshape'.

  1/ ->reshape_super is called quite early (after only minimal
     checks) to make sure that the metadata can record the new shape
     and any necessary transitions. It may be passed a 'container'
     or an individual array within a container, and it should notice
     the difference and act accordingly.
     When a reshape is requested against a container it is expected
     that it should be applied to every array in the container,
     however it is up to the metadata handler to determine final
     policy.

     If the reshape is supportable, the internal copy of the metadata
     should be updated, and a metadata update suitable for sending
     to mdmon should be queued.

     If the reshape will involve converting spares into array members,
     this must be recorded in the metadata too.

  2/ ->container_content will be called to find out the new state
     of the array, or of all arrays in the container. Any newly
     added devices (with state==0 and raid_disk >= 0) will be added
     to the array as spares with the relevant slot number.

     It is likely that the info returned by ->container_content will
     have ->reshape_active set, ->reshape_progress set to e.g. 0, and
     new_* set appropriately. mdadm will use this information to
     cause the correct reshape to start at an appropriate time.

  3/ ->set_array_state will be called by mdmon when reshape has
     started and again periodically as it progresses. This should
     record the ->last_checkpoint as the point where reshape has
     progressed to. When the reshape finishes this will be called
     again and it should notice that ->curr_action is no longer
     'reshape' and so should record that the reshape has finished
     providing 'last_checkpoint' has progressed suitably.

  4/ ->manage_reshape will be called once the reshape has been set
     up in the kernel but before sync_max has been moved from 0, so
     no actual reshape will have happened.

     ->manage_reshape should call progress_reshape() to allow the
     reshape to progress, and should back-up any data as indicated
     by the return value. See the documentation of that function
     for more details.
     ->manage_reshape will be called multiple times when a
     container is being reshaped, once for each member array in
     the container.

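The ->set_array_state behaviour described in item 3/ can be sketched as
follows. This is a hypothetical, pared-down model rather than any
format's real handler: struct array_view, struct reshape_record and
record_array_state() are invented for illustration, and reading
"progressed suitably" as "reached the recorded end position" is an
assumption.

```c
#include <string.h>

/* Hypothetical, pared-down view of what mdmon reports for an array. */
struct array_view {
	const char *curr_action;		/* e.g. "reshape" or "idle" */
	unsigned long long last_checkpoint;	/* progress seen so far */
};

/* Hypothetical record kept in the metadata for a reshape in flight. */
struct reshape_record {
	int in_progress;
	unsigned long long checkpoint;
	unsigned long long end;	/* position at which reshape is complete */
};

/*
 * Record progress, and mark the reshape finished only when the action
 * is no longer "reshape" AND the checkpoint has reached the end.
 */
void record_array_state(struct reshape_record *rec,
			const struct array_view *a)
{
	if (!rec->in_progress)
		return;
	if (a->last_checkpoint > rec->checkpoint)
		rec->checkpoint = a->last_checkpoint;
	if (strcmp(a->curr_action, "reshape") != 0 &&
	    rec->checkpoint >= rec->end)
		rec->in_progress = 0;
}
```

The second condition keeps an interrupted reshape (action back to
"idle" before the end is reached) recorded as still in progress, so
that it can be restarted from the last checkpoint.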
  The progress of the metadata is as follows:
   1/ mdadm sends a metadata update to mdmon which marks the array
      as undergoing a reshape. This is set up by
      ->reshape_super and applied by ->process_update.
      For container-wide reshape, this happens once for the whole
      container.
   2/ mdmon notices progress via the sysfs files and calls
      ->set_array_state to update the state periodically.
      For container-wide reshape, this happens repeatedly for
      one array, then repeatedly for the next, etc.
   3/ mdmon notices when reshape has finished and calls
      ->set_array_state to record that the reshape is complete.
      For container-wide reshape, this happens once for each
      member array.

...

[1]: Linux kernel design patterns - part 3, Neil Brown http://lwn.net/Articles/336262/