External Reshape

1 Problem statement

External (third-party metadata) reshape differs from native-metadata
reshape in three key ways:

1.1 Format specific constraints

In the native case reshape is limited by what is implemented in the
generic reshape routine (Grow_reshape()) and what is supported by the
kernel. There are exceptional cases where Grow_reshape() may block
operations when it knows that the kernel implementation is broken, but
otherwise the kernel is relied upon to be the final arbiter of what
reshape operations are supported.

In the external case the kernel, and the generic checks in
Grow_reshape(), become the super-set of what reshapes are possible. The
metadata format may not support, or may not yet implement, a given
reshape type. The implication for Grow_reshape() is that it must query
the metadata handler and effect changes in the metadata before the new
geometry is posted to the kernel. The ->reshape_super method allows
Grow_reshape() to validate the requested operation and post the metadata
update.

1.2 Scope of reshape

Native metadata reshape is always performed at the array scope (no
metadata relationship with sibling arrays on the same disks). External
reshape, depending on the format, may not allow the number of member
disks to be changed in a subarray unless the change is simultaneously
applied to all subarrays in the container. For example the imsm format
requires all member disks to be a member of all subarrays, so a 4-disk
raid5 in a container that also houses a 4-disk raid10 array could not be
reshaped to 5 disks as the imsm format does not support a 5-disk raid10
representation. This requires the ->reshape_super method to check the
contents of the array and ask the user to run the reshape at container
scope (if all subarrays are agreeable to the change), or report an
error in the case where one subarray cannot support the change.

1.3 Monitoring / checkpointing

Reshape, unlike rebuild/resync, requires strict checkpointing to survive
interrupted reshape operations. For example when expanding a raid5
array the first few stripes of the array will be overwritten in a
destructive manner. When restarting the reshape process we need to know
the exact location of the last successfully written stripe, and we need
to restore the data in any partially overwritten stripe. Native
metadata stores this backup data in the unused portion of spares that
are being promoted to array members, or in an external backup file
(located on a non-involved block device).

The kernel is in charge of recording checkpoints of reshape progress,
but mdadm is delegated the task of managing the backup space which
involves:
1/ Identifying what data will be overwritten in the next unit of reshape
   operation
2/ Suspending access to that region so that a snapshot of the data can
   be transferred to the backup space.
3/ Allowing the kernel to reshape the saved region and setting the
   boundary for the next backup.

In the external reshape case we want to preserve this mdadm
'reshape-manager' arrangement, but have a third actor, mdmon, to
consider. It is tempting to give the role of managing reshape to mdmon,
but that is counter to its role as a monitor, and conflicts with the
existing capabilities and role of mdadm to manage the progress of
reshape. For clarity the external reshape implementation maintains the
role of mdmon as a (mostly) passive recorder of raid events, and mdadm
treats it as it would the kernel in the native reshape case (modulo
needing to send explicit metadata update messages and checking that
mdmon took the expected action).

External reshape can use the generic md backup file as a fallback, but in the
optimal/firmware-compatible case the reshape-manager will use the metadata
specific areas for managing reshape. The implementation also needs to spawn a
reshape-manager per subarray when the reshape is being carried out at the
container level. For these two reasons the ->manage_reshape() method is
introduced. In addition to the base tasks mentioned above, this method:
1/ Processes each subarray one at a time in series - where appropriate.
2/ Uses either generic routines in Grow.c for md-style backup file
   support, or uses the metadata-format specific location for storing
   recovery data.
This aims to avoid a "midlayer mistake"[1] and lets the metadata handler
optionally take advantage of generic infrastructure in Grow.c.

2 Details for specific reshape requests

There are quite a few moving pieces spread out across md, mdadm, and mdmon for
the support of external reshape, and there are several different types of
reshape that need to be comprehended by the implementation. A rundown of
these details follows.

2.0 General provisions:

Obtain an exclusive open on the container to make sure we are not
running concurrently with a Create() event.
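
For illustration, a minimal sketch of taking that exclusive open,
assuming the container device node is /dev/md/imsm0 (the path, helper
name, and error handling are illustrative, not mdadm's actual code):

    #include <fcntl.h>
    #include <stdio.h>

    /* An O_EXCL open of a block device fails with EBUSY while any other
     * exclusive opener (e.g. a concurrent Create()) holds it. */
    int open_container_excl(const char *devname)
    {
            int fd = open(devname, O_RDONLY | O_EXCL);

            if (fd < 0)
                    perror(devname);
            return fd;      /* hold this fd for the duration of the reshape */
    }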

2.1 Freezing sync_action

Before making any attempt at a reshape we 'freeze' every array in
the container to ensure no spare assignment or recovery happens.
This involves writing 'frozen' to sync_action and changing the '/'
after 'external:' in metadata_version to a '-'. mdmon knows that
this means not to perform any management.
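
A minimal sketch of the freeze for one member array, assuming sysfs is
mounted at /sys and taking the array's md directory as a parameter (the
helper name and fixed buffer sizes are illustrative; mdadm uses its own
sysfs_* helpers from sysfs.c):

    #include <stdio.h>
    #include <string.h>

    /* e.g. mddir = "/sys/block/md126/md" */
    static int freeze_array(const char *mddir)
    {
            char path[256], ver[64];
            char *slash;
            FILE *f;

            /* stop spare assignment and recovery on this array */
            snprintf(path, sizeof(path), "%s/sync_action", mddir);
            f = fopen(path, "w");
            if (!f || fprintf(f, "frozen\n") < 0 || fclose(f) != 0)
                    return -1;

            /* rewrite e.g. "external:/md127/0" as "external:-md127/0" so
             * that mdmon suspends management of this array */
            snprintf(path, sizeof(path), "%s/metadata_version", mddir);
            f = fopen(path, "r");
            if (!f)
                    return -1;
            if (!fgets(ver, sizeof(ver), f)) {
                    fclose(f);
                    return -1;
            }
            fclose(f);
            slash = strchr(ver, '/');
            if (!slash)
                    return -1;
            *slash = '-';
            f = fopen(path, "w");
            if (!f || fprintf(f, "%s", ver) < 0 || fclose(f) != 0)
                    return -1;
            return 0;
    }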

Before doing this we check that all sync_actions are 'idle', which
is racy but still useful.
Afterwards we check that all member arrays have no spares
or partial spares (recovery_start != 'none') which would indicate a
race. If they do, we unfreeze again.

Once this completes we know all the arrays are stable. They may
still have failed devices as devices can fail at any time. However
we treat those like failures that happen during the reshape.

2.2 Reshape size

1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
   initializes st->update_tail
2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the size change
   is allowed (being performed at subarray scope / enough room), and prepares a
   metadata update
3/ mdadm::Grow_reshape(): flushes the metadata update (via
   flush_metadata_update(), or ->sync_metadata())
4/ mdadm::Grow_reshape(): posts the new size to the kernel
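
As an illustration of step 4, the new size can be posted through the
array's sysfs interface. The helper below is a sketch (mdadm's real
sysfs accessors live in sysfs.c); the device path and size value are
examples, and component_size is interpreted by md in KiB:

    #include <stdio.h>

    /* write a single value to one sysfs attribute */
    static int sysfs_write_one(const char *path, const char *val)
    {
            FILE *f = fopen(path, "w");

            if (!f)
                    return -1;
            if (fprintf(f, "%s\n", val) < 0) {
                    fclose(f);
                    return -1;
            }
            return fclose(f) == 0 ? 0 : -1;
    }

    /* step 4: post the new per-device data size, e.g.
     *   sysfs_write_one("/sys/block/md126/md/component_size", "976762584");
     */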

2.3 Reshape level (simple-takeover)

"simple-takeover" implies the level change can be satisfied without touching
sync_action.

1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
   initializes st->update_tail
2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level change
   is allowed (being performed at subarray scope), and prepares a
   metadata update
2a/ raid10 --> raid0: degrade all mirror legs prior to calling
    ->reshape_super
3/ mdadm::Grow_reshape(): flushes the metadata update (via
   flush_metadata_update(), or ->sync_metadata())
4/ mdadm::Grow_reshape(): posts the new level to the kernel
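
Step 4 here follows the same single-write pattern as in 2.2: reusing the
illustrative sysfs_write_one() helper sketched above, the takeover is
requested by writing the target personality name to the array's 'level'
file (the path and level string are examples):

    /* step 4: ask md to take the array over to the new level */
    if (sysfs_write_one("/sys/block/md126/md/level", "raid0") < 0)
            return -1;      /* kernel rejected or could not perform takeover */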

2.4 Reshape chunk, layout

2.5 Reshape raid disks (grow)

1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail
   because only redundant raid levels can modify the number of raid disks
2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level
   change is allowed (being performed at proper scope / permissible
   geometry / proper spares available in the container), chooses
   the spares to use, and prepares a metadata update.
3/ mdadm::Grow_reshape(): converts each subarray in the container to the
   raid level that can perform the reshape and starts mdmon.
4/ mdadm::Grow_reshape(): pushes the update to mdmon.
5/ mdadm::Grow_reshape(): uses container_content to find details of
   the spares and passes them to the kernel.
6/ mdadm::Grow_reshape(): gives the raid_disks update to the kernel,
   sets sync_max, sync_min, suspend_lo, suspend_hi all to zero,
   and starts the reshape by writing 'reshape' to sync_action.
7/ mdmon::monitor notices the sync_action change and tells
   managemon to check for new devices. managemon notices the new
   devices, opens the relevant sysfs files, and passes them all to
   monitor.
8/ mdadm::Grow_reshape(): calls ->manage_reshape to oversee the
   rest of the reshape.

9/ mdadm::<format>->manage_reshape(): saves data that will be overwritten by
   the kernel to either the backup file or the metadata specific location,
   advances sync_max, waits for the reshape, pings mdmon, and repeats (a
   sketch of this cycle follows the list).
   Meanwhile mdmon::read_and_act() records checkpoints.
   Specifically:

   9a/ if the 'next' stripe to be reshaped will over-write
       itself during reshape then:
       9a.1/ increase suspend_hi to cover a suitable number of
             stripes.
       9a.2/ backup those stripes safely.
       9a.3/ advance sync_max to allow those stripes to be reshaped.
       9a.4/ when sync_completed indicates that those stripes have
             been reshaped, manage_reshape must ping_manager.
       9a.5/ when mdmon notices that sync_completed has been updated,
             it records the new checkpoint in the metadata.
       9a.6/ after the ping_manager, manage_reshape will increase
             suspend_lo to allow access to those stripes again.

   9b/ if the 'next' stripe to be reshaped will over-write unused
       space during reshape then we apply the same process as above,
       except that there is no need to back anything up.
       Note that we *do* need to keep suspend_hi progressing as
       it is not safe to write to the area-under-reshape. For
       kernel-managed-metadata this protection is provided by
       ->reshape_safe, but that does not protect us in the case
       of user-space-managed-metadata.

10/ mdadm::<format>->manage_reshape(): once the reshape completes, changes the
    raid level back to the nominal raid level (if necessary).
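
A minimal sketch of the step-9 backup/advance cycle, assuming
illustrative helpers: backup_stripes() stands in for the Grow.c or
metadata-specific backup routine, the set_*/read_*/wait_* functions for
trivial sysfs accessors, and ping_manager() for mdadm's existing routine
that prods mdmon (the names below and the fixed window arithmetic are
assumptions, not mdadm's real interfaces):

    /* illustrative prototypes; not mdadm's real interfaces */
    extern void set_suspend_hi(unsigned long long s);
    extern void set_suspend_lo(unsigned long long s);
    extern void set_sync_max(unsigned long long s);
    extern unsigned long long read_sync_completed(void);
    extern void wait_for_sync_completed(void);
    extern void backup_stripes(unsigned long long lo, unsigned long long hi);
    extern void ping_manager(const char *container);

    void reshape_loop(const char *container,
                      unsigned long long end, unsigned long long window)
    {
            unsigned long long here = 0;    /* positions in sectors */

            while (here < end) {
                    unsigned long long next = here + window;

                    set_suspend_hi(next);        /* 9a.1: block access */
                    backup_stripes(here, next);  /* 9a.2: snapshot the data */
                    set_sync_max(next);          /* 9a.3: let the kernel reshape */

                    /* 9a.4: wait until md reports the window reshaped */
                    while (read_sync_completed() < next)
                            wait_for_sync_completed();

                    ping_manager(container);     /* 9a.5: mdmon checkpoints */
                    set_suspend_lo(next);        /* 9a.6: re-allow access */
                    here = next;
            }
    }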

FIXME: native metadata does not have the capability to record the original
raid level in the reshape-restart case because the kernel always records the
current raid level in the metadata, whereas external metadata can masquerade
at an alternate level based on the reshape state.

2.6 Reshape raid disks (shrink)

3 Interaction with the metadata handler

The following calls are made into the metadata handler to assist
with initiating and monitoring a 'reshape'.

1/ ->reshape_super is called quite early (after only minimal
   checks) to make sure that the metadata can record the new shape
   and any necessary transitions. It may be passed a 'container'
   or an individual array within a container, and it should notice
   the difference and act accordingly.
   When a reshape is requested against a container it is expected
   that it should be applied to every array in the container;
   however, it is up to the metadata handler to determine final
   policy.

   If the reshape is supportable, the internal copy of the metadata
   should be updated, and a metadata update suitable for sending
   to mdmon should be queued.

   If the reshape will involve converting spares into array members,
   this must be recorded in the metadata too.

2/ ->container_content will be called to find out the new state
   of the array, or of all arrays in the container. Any newly
   added devices (with state==0 and raid_disk >= 0) will be added
   to the array as spares with the relevant slot number.

   It is likely that the info returned by ->container_content will
   have ->reshape_active set, ->reshape_progress set to e.g. 0, and
   the new_* fields set appropriately. mdadm will use this information
   to cause the correct reshape to start at an appropriate time.

3/ ->set_array_state will be called by mdmon when the reshape has
   started and again periodically as it progresses. This should
   record ->last_checkpoint as the point to which the reshape has
   progressed. When the reshape finishes this will be called
   again and it should notice that ->curr_action is no longer
   'reshape' and so should record that the reshape has finished,
   providing 'last_checkpoint' has progressed suitably.

4/ ->manage_reshape will be called once the reshape has been set
   up in the kernel but before sync_max has been moved from 0, so
   no actual reshape will have happened.

   ->manage_reshape should call progress_reshape() to allow the
   reshape to progress, and should back up any data as indicated
   by the return value. See the documentation of that function
   for more details.
   ->manage_reshape will be called multiple times when a
   container is being reshaped, once for each member array in
   the container.
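
For orientation, these four entry points correspond to method pointers
on mdadm's struct superswitch. The sketch below is only a signpost: the
real prototypes are declared in mdadm.h and carry more parameters than
shown here, so treat the argument lists as placeholders rather than the
actual signatures:

    struct supertype;
    struct active_array;
    struct mdinfo;

    struct superswitch {
            /* ... */

            /* validate a requested reshape and queue a metadata update */
            int (*reshape_super)(struct supertype *st, /* geometry args */ ...);

            /* report the (new) state of one or all member arrays */
            struct mdinfo *(*container_content)(struct supertype *st, ...);

            /* record checkpoints and reshape completion in the metadata */
            int (*set_array_state)(struct active_array *a, ...);

            /* drive the backup / suspend / sync_max cycle for one array */
            int (*manage_reshape)(int afd, ...);

            /* ... */
    };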

The progress of the metadata is as follows:
1/ mdadm sends a metadata update to mdmon which marks the array
   as undergoing a reshape. This is set up by
   ->reshape_super and applied by ->process_update.
   For container-wide reshape, this happens once for the whole
   container.
2/ mdmon notices progress via the sysfs files and calls
   ->set_array_state to update the state periodically.
   For container-wide reshape, this happens repeatedly for
   one array, then repeatedly for the next, etc.
3/ mdmon notices when the reshape has finished and calls
   ->set_array_state to record that the reshape is complete.
   For container-wide reshape, this happens once for each
   member array.

...

[1]: Linux kernel design patterns - part 3, Neil Brown
     http://lwn.net/Articles/336262/