Commit | Line | Data |
---|---|---|
d54d79bd DW |
1 | External Reshape |
2 | ||
3 | 1 Problem statement | |
4 | ||
5 | External (third-party metadata) reshape differs from native-metadata | |
6 | reshape in three key ways: | |
7 | ||
8 | 1.1 Format specific constraints | |
9 | ||
10 | In the native case reshape is limited by what is implemented in the | |
11 | generic reshape routine (Grow_reshape()) and what is supported by the | |
12 | kernel. There are exceptional cases where Grow_reshape() may block | |
13 | operations when it knows that the kernel implementation is broken, but | |
14 | otherwise the kernel is relied upon to be the final arbiter of what | |
15 | reshape operations are supported. | |
16 | ||
17 | In the external case the kernel, and the generic checks in | |
18 | Grow_reshape(), become the super-set of what reshapes are possible. The | |
19 | metadata format may not support, or may not yet implement, a given | |
20 | reshape type. The implication for Grow_reshape() is that it must query | |
21 | the metadata handler and effect changes in the metadata before the new | |
22 | geometry is posted to the kernel. The ->reshape_super method allows | |
23 | Grow_reshape() to validate the requested operation and post the metadata | |
24 | update. | |
25 | ||
26 | 1.2 Scope of reshape | |
27 | ||
28 | Native metadata reshape is always performed at the array scope (no | |
29 | metadata relationship with sibling arrays on the same disks). External | |
30 | reshape, depending on the format, may not allow the number of member | |
31 | disks to be changed in a subarray unless the change is simultaneously | |
32 | applied to all subarrays in the container. For example the imsm format | |
33 | requires all member disks to be a member of all subarrays, so a 4-disk | |
34 | raid5 in a container that also houses a 4-disk raid10 array could not be | |
35 | reshaped to 5 disks as the imsm format does not support a 5-disk raid10 | |
36 | representation. This requires the ->reshape_super method to check the | |
37 | contents of the array and ask the user to run the reshape at container | |
38 | scope (if both subarrays are agreeable to the change), or report an | |
39 | error in the case where one subarray cannot support the change. | |
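
The scope check above can be sketched as follows. This is a minimal Python model, not mdadm's C code: `Subarray`, `supports_disks`, and `check_disk_count_change` are hypothetical names, and the constraint table is an illustrative assumption (the text only states that imsm has no 5-disk raid10 representation).

```python
from dataclasses import dataclass


@dataclass
class Subarray:
    name: str
    level: str
    disks: int


def supports_disks(level, disks):
    """Illustrative format constraints (assumed for this sketch)."""
    if level == "raid10":
        return disks == 4          # no 5-disk raid10 representation
    if level == "raid5":
        return disks >= 3
    return disks >= 1


def check_disk_count_change(container, new_disks, at_container_scope):
    """Return (ok, message) for changing the member-disk count.

    Every member disk belongs to every subarray, so the new count must be
    acceptable to all subarrays, and the request must be made at container
    scope when more than one subarray is present.
    """
    blockers = [s.name for s in container
                if not supports_disks(s.level, new_disks)]
    if blockers:
        return False, "unsupported by: " + ", ".join(blockers)
    if len(container) > 1 and not at_container_scope:
        return False, "rerun at container scope"
    return True, "ok"
```

With a 4-disk raid5 and a 4-disk raid10 in one container, a request for 5 disks is rejected outright; with two raid5 subarrays it is accepted only at container scope.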
40 | ||
41 | 1.3 Monitoring / checkpointing | |
42 | ||
43 | Reshape, unlike rebuild/resync, requires strict checkpointing to survive | |
44 | interrupted reshape operations. For example when expanding a raid5 | |
45 | array the first few stripes of the array will be overwritten in a | |
46 | destructive manner. When restarting the reshape process we need to know | |
47 | the exact location of the last successfully written stripe, and we need | |
48 | to restore the data in any partially overwritten stripe. Native | |
49 | metadata stores this backup data in the unused portion of spares that | |
50 | are being promoted to array members, or in an external backup file | |
51 | (located on a non-involved block device). | |
52 | ||
53 | The kernel is in charge of recording checkpoints of reshape progress, | |
54 | but mdadm is delegated the task of managing the backup space which | |
55 | involves: | |
56 | 1/ Identifying what data will be overwritten in the next unit of reshape | |
57 | operation | |
58 | 2/ Suspending access to that region so that a snapshot of the data can | |
59 | be transferred to the backup space. | |
60 | 3/ Allowing the kernel to reshape the saved region and setting the | |
61 | boundary for the next backup. | |
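
The three backup-management tasks above can be modelled as a loop. This is a toy Python sketch under stated assumptions, not the mdadm implementation: the array is a list of blocks, the "reshape" is an in-place per-block transform (a real reshape relocates data), and `reshape_with_backup`, `restore`, and `fail_at` are hypothetical names used to simulate an interruption.

```python
def reshape_with_backup(data, unit, transform, start=0, fail_at=None):
    """Reshape `data` in place, `unit` blocks at a time.

    Returns (checkpoint, backup): the index of the first block of the unit
    that was not completed, and the snapshot taken for that unit.
    """
    checkpoint = start
    backup = None
    while checkpoint < len(data):
        end = min(checkpoint + unit, len(data))
        # 1/ identify the region the next unit will overwrite, and
        # 2/ snapshot it (real code suspends access to the region first)
        backup = {"start": checkpoint, "blocks": data[checkpoint:end]}
        # 3/ let the "kernel" reshape the saved region
        for i in range(checkpoint, end):
            if i == fail_at:
                return checkpoint, backup   # interrupted mid-unit
            data[i] = transform(data[i])
        checkpoint = end                    # boundary for the next backup
    return checkpoint, backup


def restore(data, backup):
    """Put back the original contents of a partially overwritten unit."""
    start = backup["start"]
    data[start:start + len(backup["blocks"])] = backup["blocks"]
```

After an interruption, calling `restore()` and then resuming with `start` set to the returned checkpoint reproduces the result of an uninterrupted run, which is the property the strict checkpointing is meant to guarantee.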
62 | ||
63 | In the external reshape case we want to preserve this mdadm | |
64 | 'reshape-manager' arrangement, but have a third actor, mdmon, to | |
65 | consider. It is tempting to give the role of managing reshape to mdmon, | |
66 | but that is counter to its role as a monitor, and conflicts with the | |
67 | existing capabilities and role of mdadm to manage the progress of | |
68 | reshape. For clarity the external reshape implementation maintains the | |
69 | role of mdmon as a (mostly) passive recorder of raid events, and mdadm | |
70 | treats it as it would the kernel in the native reshape case (modulo | |
71 | needing to send explicit metadata update messages and checking that | |
72 | mdmon took the expected action). | |
73 | ||
74 | External reshape can use the generic md backup file as a fallback, but in the | |
75 | optimal/firmware-compatible case the reshape-manager will use the metadata | |
76 | specific areas for managing reshape. The implementation also needs to spawn a | |
77 | reshape-manager per subarray when the reshape is being carried out at the | |
78 | container level. For these two reasons the ->manage_reshape() method is | |
79 | introduced. This method, in addition to the base tasks mentioned above: | |
80 | 1/ Spawns a manager per-subarray, when necessary | |
81 | 2/ Uses either generic routines in Grow.c for md-style backup file | |
82 | support, or uses the metadata-format specific location for storing | |
83 | recovery data. | |
84 | This aims to avoid a "midlayer mistake"[1] and lets the metadata handler | |
85 | optionally take advantage of generic infrastructure in Grow.c. | |
86 | ||
87 | 2 Details for specific reshape requests | |
88 | ||
89 | There are quite a few moving pieces spread out across md, mdadm, and mdmon for | |
90 | the support of external reshape, and there are several different types of | |
91 | reshape that need to be handled by the implementation. A rundown of | |
92 | these details follows. | |
93 | ||
94 | 2.0 General provisions: | |
95 | ||
96 | Obtain an exclusive open on the container to make sure we are not | |
97 | running concurrently with a Create() event. | |
98 | ||
99 | 2.1 Freezing sync_action | |
100 | ||
101 | 2.2 Reshape size | |
102 | ||
103 | 1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally | |
104 | initializes st->update_tail | |
105 | 2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the size change | |
106 | is allowed (being performed at subarray scope / enough room) and prepares a | |
107 | metadata update | |
108 | 3/ mdadm::Grow_reshape(): flushes the metadata update (via | |
109 | flush_metadata_update(), or ->sync_metadata()) | |
110 | 4/ mdadm::Grow_reshape(): posts the new size to the kernel | |
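
The ordering of the size-reshape steps above (validate, update metadata, then post to the kernel) can be sketched in Python. This is a hedged model only: `grow_size` and the dictionary fields are hypothetical, and the choice between the two flush paths based on whether mdmon is running is an assumption of this sketch.

```python
def grow_size(array, new_size, mdmon_running):
    """Model of the size-reshape sequence; returns (ok, log of steps)."""
    log = []
    # 1/ with mdmon running, updates are queued for it (update_tail set)
    use_update_tail = mdmon_running
    # 2/ the metadata handler validates the request and prepares an update
    if array["scope"] != "subarray" or new_size > array["room"]:
        log.append("reshape_super: rejected")
        return False, log          # nothing reaches the kernel
    log.append("reshape_super: update prepared")
    # 3/ the metadata update is flushed before the kernel sees the change
    log.append("flush_metadata_update" if use_update_tail else "sync_metadata")
    # 4/ only now is the new size posted to the kernel
    array["size"] = new_size
    log.append("kernel: size posted")
    return True, log
```

The point of the sketch is the early return: a request the metadata format rejects never changes the kernel's view of the array.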
111 | ||
112 | ||
113 | 2.3 Reshape level (simple-takeover) | |
114 | ||
115 | "simple-takeover" implies the level change can be satisfied without touching | |
116 | sync_action | |
117 | ||
118 | 1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally | |
119 | initializes st->update_tail | |
120 | 2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level change | |
121 | is allowed (being performed at subarray scope) and prepares a | |
122 | metadata update | |
123 | 2a/ raid10 --> raid0: degrade all mirror legs prior to calling | |
124 | ->reshape_super | |
125 | 3/ mdadm::Grow_reshape(): flushes the metadata update (via | |
126 | flush_metadata_update(), or ->sync_metadata()) | |
127 | 4/ mdadm::Grow_reshape(): posts the new level to the kernel | |
128 | ||
129 | 2.4 Reshape chunk, layout | |
130 | ||
131 | 2.5 Reshape raid disks (grow) | |
132 | ||
133 | 1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail | |
134 | because only redundant raid levels can modify the number of raid disks | |
135 | 2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the raid-disks | |
136 | change is allowed (being performed at proper scope / permissible | |
137 | geometry / proper spares available in the container) and prepares a metadata | |
138 | update. | |
139 | 3/ mdadm::Grow_reshape(): Converts each subarray in the container to the | |
140 | raid level that can perform the reshape and starts mdmon. | |
141 | 4/ mdadm::Grow_reshape(): Pushes the update to mdmon... | |
142 | 4a/ mdmon::process_update(): marks the array as reshaping | |
143 | 4b/ mdmon::manage_member(): adds the spares (without assigning a slot) | |
144 | 5/ mdadm::Grow_reshape(): Notes that mdmon has assigned spares and invokes | |
145 | ->manage_reshape() | |
146 | 6/ mdadm::<format>->manage_reshape(): (for each subarray) sets sync_max to | |
147 | zero, starts the reshape, and pings mdmon | |
148 | 6a/ mdmon::read_and_act(): notices that reshape has started and notifies | |
149 | the metadata handler to record the slots chosen by the kernel | |
150 | 7/ mdadm::<format>->manage_reshape(): saves data that will be overwritten by | |
151 | the kernel to either the backup file or the metadata specific location, | |
152 | advances sync_max, waits for reshape, pings mdmon, and repeats. | |
153 | 7a/ mdmon::read_and_act(): records checkpoints | |
154 | 8/ mdadm::<format>->manage_reshape(): Once reshape completes, changes the raid | |
155 | level back to the nominal raid level (if necessary) | |
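
The sync_max loop that ->manage_reshape() runs in the steps above can be sketched as a small Python simulation. The names `manage_reshape`, `window`, and the returned lists are hypothetical; the kernel and mdmon are reduced to plain variables, so this only illustrates the order of operations, not the real sysfs or mdmon interfaces.

```python
def manage_reshape(total, window):
    """Simulate one subarray's reshape over `total` sectors.

    Returns (backups, checkpoints): the ranges saved before the kernel was
    allowed to overwrite them, and the positions "mdmon" recorded.
    """
    sync_max = 0        # reshape is started with sync_max at zero
    position = 0        # how far the "kernel" has reshaped
    backups = []
    checkpoints = []
    while position < total:
        nxt = min(sync_max + window, total)
        backups.append((sync_max, nxt))   # save data about to be overwritten
        sync_max = nxt                    # advance sync_max
        position = sync_max               # wait for reshape to reach it
        checkpoints.append(position)      # ping mdmon; it records a checkpoint
    return backups, checkpoints
```

Because the kernel never advances past sync_max, every range it overwrites has already been saved, and every recorded checkpoint coincides with a backup boundary.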
156 | ||
157 | FIXME: native metadata does not have the capability to record the original | |
158 | raid level in the reshape-restart case because the kernel always records current | |
159 | raid level to the metadata, whereas external metadata can masquerade at an | |
160 | alternate level based on the reshape state. | |
161 | ||
162 | 2.6 Reshape raid disks (shrink) | |
163 | ||
164 | 3 TODO | |
165 | ||
166 | ... | |
167 | ||
168 | [1]: Linux kernel design patterns - part 3, Neil Brown http://lwn.net/Articles/336262/ |