1 | .\" Copyright Neil Brown and others. |
2 | .\" This program is free software; you can redistribute it and/or modify | |
3 | .\" it under the terms of the GNU General Public License as published by | |
4 | .\" the Free Software Foundation; either version 2 of the License, or | |
5 | .\" (at your option) any later version. | |
6 | .\" See file COPYING in distribution for details. | |
17645275 | 7 | .if n .pl 1000v |
56eb10c0 NB |
8 | .TH MD 4 |
9 | .SH NAME | |
93e790af | 10 | md \- Multiple Device driver aka Linux Software RAID |
56eb10c0 NB |
11 | .SH SYNOPSIS |
12 | .BI /dev/md n | |
13 | .br | |
14 | .BI /dev/md/ n | |
e0fe762a N |
15 | .br |
16 | .BR /dev/md/ name | |
56eb10c0 NB |
17 | .SH DESCRIPTION |
18 | The | |
19 | .B md | |
20 | driver provides virtual devices that are created from one or more | |
e0d19036 | 21 | independent underlying devices. This array of devices often contains |
02b76eea NB |
22 | redundancy and the devices are often disk drives, hence the acronym RAID |
23 | which stands for a Redundant Array of Independent Disks. | |
56eb10c0 NB |
24 | .PP |
25 | .B md | |
599e5a36 NB |
26 | supports RAID levels |
27 | 1 (mirroring), | |
28 | 4 (striped array with parity device), | |
29 | 5 (striped array with distributed parity information), | |
30 | 6 (striped array with distributed dual redundancy information), and | |
31 | 10 (striped and mirrored). | |
32 | If some number of underlying devices fails while using one of these | |
98c6faba NB |
33 | levels, the array will continue to function; this number is one for |
34 | RAID levels 4 and 5, two for RAID level 6, and all but one (N-1) for | |
93e790af | 35 | RAID level 1, and dependent on configuration for level 10. |
56eb10c0 NB |
36 | .PP |
37 | .B md | |
e0d19036 | 38 | also supports a number of pseudo RAID (non-redundant) configurations |
570c0542 NB |
39 | including RAID0 (striped array), LINEAR (catenated array), |
40 | MULTIPATH (a set of different interfaces to the same device), | |
41 | and FAULTY (a layer over a single device into which errors can be injected). | |
56eb10c0 | 42 | |
e0fe762a | 43 | .SS MD METADATA |
bcbb92d4 | 44 | Each device in an array may have some |
e0fe762a N |
45 | .I metadata |
46 | stored in the device. This metadata is sometimes called a | |
47 | .BR superblock . | |
48 | The metadata records information about the structure and state of the array. | |
570c0542 | 49 | This allows the array to be reliably re-assembled after a shutdown. |
56eb10c0 | 50 | |
570c0542 NB |
51 | From Linux kernel version 2.6.10, |
52 | .B md | |
e0fe762a | 53 | provides support for two different formats of metadata, and |
570c0542 NB |
54 | other formats can be added. Prior to this release, only one format is |
55 | supported. | |
56 | ||
b3f1c093 | 57 | The common format \(em known as version 0.90 \(em has |
570c0542 | 58 | a superblock that is 4K long and is written into a 64K aligned block that |
11a3e71d | 59 | starts at least 64K and less than 128K from the end of the device |
56eb10c0 NB |
60 | (i.e. to get the address of the superblock round the size of the |
61 | device down to a multiple of 64K and then subtract 64K). | |
11a3e71d | 62 | The available size of each device is the amount of space before the |
56eb10c0 NB |
63 | super block, so between 64K and 128K is lost when a device in |
64 | incorporated into an MD array. | |
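
For example, on a hypothetical device of 1000030K, the superblock
location would be calculated as:
.nf
    1000030K rounded down to a multiple of 64K  =  1000000K
    1000000K \- 64K                             =   999936K
.fi
so the superblock would be written at offset 999936K, the available
size would be 999936K, and 94K would be lost.
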
This superblock stores multi-byte fields in a processor-dependent
manner, so arrays cannot easily be moved between computers with
different processors.

The new format \(em known as version 1 \(em has a superblock that is
normally 1K long, but can be longer.  It is normally stored between 8K
and 12K from the end of the device, on a 4K boundary, though
variations can be stored at the start of the device (version 1.1) or 4K from
the start of the device (version 1.2).
This metadata format stores multibyte data in a
processor-independent format and supports up to hundreds of
component devices (version 0.90 only supports 28).

The metadata contains, among other things:
.TP
LEVEL
The manner in which the devices are arranged into the array
(LINEAR, RAID0, RAID1, RAID4, RAID5, RAID10, MULTIPATH).
.TP
UUID
a 128 bit Universally Unique Identifier that identifies the array that
contains this device.

.PP
When a version 0.90 array is being reshaped (e.g. adding extra devices
to a RAID5), the version number is temporarily set to 0.91.  This
ensures that if the reshape process is stopped in the middle (e.g. by
a system crash) and the machine boots into an older kernel that does
not support reshaping, then the array will not be assembled (which
would cause data corruption) but will be left untouched until a kernel
that can complete the reshape process is used.

.SS ARRAYS WITHOUT METADATA
While it is usually best to create arrays with superblocks so that
they can be assembled reliably, there are some circumstances when an
array without superblocks is preferred.  These include:
.TP
LEGACY ARRAYS
Early versions of the
.B md
driver only supported LINEAR and RAID0 configurations and did not use
a superblock (which is less critical with these configurations).
While such arrays should be rebuilt with superblocks if possible,
.B md
continues to support them.
.TP
FAULTY
Being a largely transparent layer over a different device, the FAULTY
personality doesn't gain anything from having a superblock.
.TP
MULTIPATH
It is often possible to detect devices which are different paths to
the same storage directly rather than having a distinctive superblock
written to the device and searched for on all paths.  In this case,
a MULTIPATH array with no superblock makes sense.
.TP
RAID1
In some configurations it might be desired to create a RAID1
configuration that does not use a superblock, and to maintain the state of
the array elsewhere.  While not encouraged for general use, it does
have special-purpose uses and is supported.

.SS ARRAYS WITH EXTERNAL METADATA

From release 2.6.28, the
.I md
driver supports arrays with externally managed metadata.  That is,
the metadata is not managed by the kernel but rather by a user-space
program which is external to the kernel.  This allows support for a
variety of metadata formats without cluttering the kernel with lots of
details.
.PP
.I md
is able to communicate with the user-space program through various
sysfs attributes so that it can make appropriate changes to the
metadata \- for example to mark a device as faulty.  When necessary,
.I md
will wait for the program to acknowledge the event by writing to a
sysfs attribute.
The manual page for
.IR mdmon (8)
contains more detail about this interaction.

.SS CONTAINERS
Many metadata formats use a single block of metadata to describe a
number of different arrays which all use the same set of devices.
In this case it is helpful for the kernel to know about the full set
of devices as a whole.  This set is known to md as a
.IR container .
A container is an
.I md
array with externally managed metadata and with device offset and size
set so that it just covers the metadata part of the devices.  The
remainder of each device is available to be incorporated into various
arrays.

.SS LINEAR

A LINEAR array simply catenates the available space on each
drive to form one large virtual drive.

One advantage of this arrangement over the more common RAID0
arrangement is that the array may be reconfigured at a later time with
an extra drive, so the array is made bigger without disturbing the
data that is on the array.  This can even be done on a live
array.

If a chunksize is given with a LINEAR array, the usable space on each
device is rounded down to a multiple of this chunksize.

.SS RAID0

A RAID0 array (which has zero redundancy) is also known as a
striped array.
A RAID0 array is configured at creation with a
.B "Chunk Size"
which must be a power of two (prior to Linux 2.6.31), and at least 4
kibibytes.

The RAID0 driver assigns the first chunk of the array to the first
device, the second chunk to the second device, and so on until all
drives have been assigned one chunk.  This collection of chunks forms a
.BR stripe .
Further chunks are gathered into stripes in the same way, and are
assigned to the remaining space in the drives.

If devices in the array are not all the same size, then once the
smallest device has been exhausted, the RAID0 driver starts
collecting chunks into smaller stripes that only span the drives which
still have remaining space.

A bug was introduced in Linux 3.14 which changed the layout of blocks in
a RAID0 beyond the region that is striped over all devices.  This bug
does not affect an array with all devices the same size, but can affect
other RAID0 arrays.

Linux 5.4 (and some stable kernels to which the change was backported)
will not normally assemble such an array as it cannot know which layout
to use.  There is a module parameter "raid0.default_layout" which can be
set to "1" to force the kernel to use the pre-3.14 layout or to "2" to
force it to use the 3.14-and-later layout.  When creating a new RAID0
array,
.I mdadm
will record the chosen layout in the metadata in a way that allows newer
kernels to assemble the array without needing a module parameter.
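
For example (assuming the pre-3.14 layout is known to match the data
on the array), the parameter can be given on the kernel command line:
.nf
    raid0.default_layout=1
.fi
or, where the raid0 personality is built as a module, through a
modprobe option, e.g. in a hypothetical /etc/modprobe.d/raid0.conf:
.nf
    options raid0 default_layout=1
.fi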

To assemble an old array on a new kernel without using the module parameter,
use either the
.B "\-\-update=layout\-original"
option or the
.B "\-\-update=layout\-alternate"
option.
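
For example, an array known to predate Linux 3.14 (so the original
layout applies; the device names are illustrative) could be assembled
with:
.nf
    mdadm \-\-assemble \-\-update=layout\-original /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
.fi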

Once you have updated the layout you will not be able to mount the array
on an older kernel.  If you need to revert to an older kernel, the
layout information can be erased with the
.B "\-\-update=layout\-unspecified"
option.  If you use this option to
.B \-\-assemble
while running a newer kernel, the array will NOT assemble, but the
metadata will be updated so that it can be assembled on an older kernel.

Note that setting the layout to "unspecified" removes protections against
this bug, and you must be sure that the kernel you use matches the
layout of the array.

.SS RAID1

A RAID1 array is also known as a mirrored set (though mirrors tend to
provide reflected images, which RAID1 does not) or a plex.

Once initialised, each device in a RAID1 array contains exactly the
same data.  Changes are written to all devices in parallel.  Data is
read from any one device.  The driver attempts to distribute read
requests across all devices to maximise performance.

All devices in a RAID1 array should be the same size.  If they are
not, then only the amount of space available on the smallest device is
used (any extra space on other devices is wasted).

Note that the read balancing done by the driver does not make the RAID1
performance profile the same as for RAID0; a single stream of
sequential input will not be accelerated (e.g. a single dd), but
multiple sequential streams or a random workload will use more than one
spindle.  In theory, having an N-disk RAID1 will allow N sequential
threads to read from all disks.

Individual devices in a RAID1 can be marked as "write-mostly".
These drives are excluded from the normal read balancing and will only
be read from when there is no other option.  This can be useful for
devices connected over a slow link.

.SS RAID4

A RAID4 array is like a RAID0 array with an extra device for storing
parity.  This device is the last of the active devices in the
array.  Unlike RAID0, RAID4 also requires that all stripes span all
drives, so extra space on devices that are larger than the smallest is
wasted.

When any block in a RAID4 array is modified, the parity block for that
stripe (i.e. the block in the parity device at the same device offset
as the stripe) is also modified so that the parity block always
contains the "parity" for the whole stripe.  I.e. its content is
equivalent to the result of performing an exclusive-or operation
between all the data blocks in the stripe.

This allows the array to continue to function if one device fails.
The data that was on that device can be calculated as needed from the
parity block and the other data blocks.
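
For example, in a hypothetical stripe with three data blocks D1, D2
and D3, the parity block is
.nf
    P = D1 xor D2 xor D3
.fi
and if the device holding D2 fails, its data can be recomputed as
.nf
    D2 = P xor D1 xor D3
.fi
since xor-ing a value with itself cancels it out.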

.SS RAID5

RAID5 is very similar to RAID4.  The difference is that the parity
blocks for each stripe, instead of being on a single device, are
distributed across all devices.  This allows more parallelism when
writing, as two different block updates will quite possibly affect
parity blocks on different devices so there is less contention.

This also allows more parallelism when reading, as read requests are
distributed over all the devices in the array instead of all but one.

.SS RAID6

RAID6 is similar to RAID5, but can handle the loss of any \fItwo\fP
devices without data loss.  Accordingly, it requires N+2 drives to
store N drives worth of data.

The performance for RAID6 is slightly lower but comparable to RAID5 in
normal mode and single disk failure mode.  It is very slow in dual
disk failure mode, however.

.SS RAID10

RAID10 provides a combination of RAID1 and RAID0, and is sometimes known
as RAID1+0.  Every data block is duplicated some number of times, and
the resulting collection of data blocks is distributed over multiple
drives.

When configuring a RAID10 array, it is necessary to specify the number
of replicas of each data block that are required (this will usually
be\ 2) and whether their layout should be "near", "far" or "offset"
(with "offset" being available since Linux\ 2.6.18).

.B About the RAID10 Layout Examples:
.br
The examples below visualise the chunk distribution on the underlying
devices for the respective layout.

For simplicity it is assumed that the size of the chunks equals the
size of the blocks of the underlying devices as well as those of the
RAID10 device exported by the kernel (for example \fB/dev/md/\fPname).
.br
Therefore the chunks\ /\ chunk numbers map directly to the blocks\ /\
block addresses of the exported RAID10 device.

Decimal numbers (0,\ 1, 2,\ ...) are the chunks of the RAID10 and due
to the above assumption also the blocks and block addresses of the
exported RAID10 device.
.br
Repeated numbers mean copies of a chunk\ /\ block (obviously on
different underlying devices).
.br
Hexadecimal numbers (0x00,\ 0x01, 0x02,\ ...) are the block addresses
of the underlying devices.

.TP
\fB "near" Layout\fP
When "near" replicas are chosen, the multiple copies of a given chunk are laid
out consecutively ("as close to each other as possible") across the stripes of
the array.

With an even number of devices, they will likely (unless some misalignment is
present) lie at the very same offset on the different devices.
.br
This corresponds to the "classic" RAID1+0; that is, two groups of mirrored
devices (in the example below the groups Device\ #1\ /\ #2 and
Device\ #3\ /\ #4 are each a RAID1) both in turn forming a striped RAID0.

.ne 10
.B Example with 2\ copies per chunk and an even number\ (4) of devices:
.TS
tab(;);
C - - - -
C | C | C | C | C |
| - | - | - | - | - |
| C | C | C | C | C |
| C | C | C | C | C |
| C | C | C | C | C |
| C | C | C | C | C |
| C | C | C | C | C |
| C | C | C | C | C |
| - | - | - | - | - |
C C S C S
C C S C S
C C S S S
C C S S S.
;
;Device #1;Device #2;Device #3;Device #4
0x00;0;0;1;1
0x01;2;2;3;3
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.
:;:;:;:;:
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.
0x80;254;254;255;255
;\\---------v---------/;\\---------v---------/
;RAID1;RAID1
;\\---------------------v---------------------/
;RAID0
.TE

.ne 10
.B Example with 2\ copies per chunk and an odd number\ (5) of devices:
.TS
tab(;);
C - - - - -
C | C | C | C | C | C |
| - | - | - | - | - | - |
| C | C | C | C | C | C |
| C | C | C | C | C | C |
| C | C | C | C | C | C |
| C | C | C | C | C | C |
| C | C | C | C | C | C |
| C | C | C | C | C | C |
| - | - | - | - | - | - |
C.
;
;Dev #1;Dev #2;Dev #3;Dev #4;Dev #5
0x00;0;0;1;1;2
0x01;2;3;3;4;4
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.
:;:;:;:;:;:
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.
0x80;317;318;318;319;319
;
.TE

.TP
\fB "far" Layout\fP
When "far" replicas are chosen, the multiple copies of a given chunk
are laid out quite distant ("as far as reasonably possible") from each
other.

First a complete sequence of all data blocks (that is all the data one
sees on the exported RAID10 block device) is striped over the
devices.  Then another (though "shifted") complete sequence of all data
blocks; and so on (in the case of more than 2\ copies per chunk).

The "shift" needed to prevent placing copies of the same chunks on the
same devices is actually a cyclic permutation with offset\ 1 of each
of the stripes within a complete sequence of chunks.
.br
The offset\ 1 is relative to the previous complete sequence of chunks,
so in case of more than 2\ copies per chunk one gets the following
offsets:
.br
1.\ complete sequence of chunks: offset\ =\ \ 0
.br
2.\ complete sequence of chunks: offset\ =\ \ 1
.br
3.\ complete sequence of chunks: offset\ =\ \ 2
.br
:
.br
n.\ complete sequence of chunks: offset\ =\ n-1

.ne 10
.B Example with 2\ copies per chunk and an even number\ (4) of devices:
.TS
tab(;);
C - - - -
C | C | C | C | C |
| - | - | - | - | - |
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| - | - | - | - | - |
C.
;
;Device #1;Device #2;Device #3;Device #4
;
0x00;0;1;2;3;\\
0x01;4;5;6;7;> [#]
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;:
:;:;:;:;:;:
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;:
0x40;252;253;254;255;/
0x41;3;0;1;2;\\
0x42;7;4;5;6;> [#]~
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;:
:;:;:;:;:;:
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;:
0x80;255;252;253;254;/
;
.TE

.ne 10
.B Example with 2\ copies per chunk and an odd number\ (5) of devices:
.TS
tab(;);
C - - - - -
C | C | C | C | C | C |
| - | - | - | - | - | - |
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| - | - | - | - | - | - |
C.
;
;Dev #1;Dev #2;Dev #3;Dev #4;Dev #5
;
0x00;0;1;2;3;4;\\
0x01;5;6;7;8;9;> [#]
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;:
:;:;:;:;:;:;:
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;:
0x40;315;316;317;318;319;/
0x41;4;0;1;2;3;\\
0x42;9;5;6;7;8;> [#]~
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;:
:;:;:;:;:;:;:
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;:
0x80;319;315;316;317;318;/
;
.TE

With [#]\ being the complete sequence of chunks and [#]~\ the cyclic permutation
with offset\ 1 thereof (in the case of more than 2 copies per chunk there would
be ([#]~)~,\ (([#]~)~)~,\ ...).

The advantage of this layout is that MD can easily spread sequential reads over
the devices, making them similar to RAID0 in terms of speed.
.br
The cost is more seeking for writes, making them substantially slower.

.TP
\fB"offset" Layout\fP
When "offset" replicas are chosen, all the copies of a given chunk are
striped consecutively ("offset by the stripe length after each other")
over the devices.

Explained in detail, <number of devices> consecutive chunks are
striped over the devices, immediately followed by a "shifted" copy of
these chunks (and by further such "shifted" copies in the case of more
than 2\ copies per chunk).
.br
This pattern repeats for all further consecutive chunks of the
exported RAID10 device (in other words: all further data blocks).

The "shift" needed to prevent placing copies of the same chunks on the
same devices is actually a cyclic permutation with offset\ 1 of each
of the striped copies of <number of devices> consecutive chunks.
.br
The offset\ 1 is relative to the previous striped copy of <number of
devices> consecutive chunks, so in case of more than 2\ copies per
chunk one gets the following offsets:
.br
1.\ <number of devices> consecutive chunks: offset\ =\ \ 0
.br
2.\ <number of devices> consecutive chunks: offset\ =\ \ 1
.br
3.\ <number of devices> consecutive chunks: offset\ =\ \ 2
.br
:
.br
n.\ <number of devices> consecutive chunks: offset\ =\ n-1

.ne 10
.B Example with 2\ copies per chunk and an even number\ (4) of devices:
.TS
tab(;);
C - - - -
C | C | C | C | C |
| - | - | - | - | - |
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| - | - | - | - | - |
C.
;
;Device #1;Device #2;Device #3;Device #4
;
0x00;0;1;2;3;) AA
0x01;3;0;1;2;) AA~
0x02;4;5;6;7;) AB
0x03;7;4;5;6;) AB~
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;) \.\.\.
:;:;:;:;:; :
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;) \.\.\.
0x79;251;252;253;254;) EX
0x80;254;251;252;253;) EX~
;
.TE

.ne 10
.B Example with 2\ copies per chunk and an odd number\ (5) of devices:
.TS
tab(;);
C - - - - -
C | C | C | C | C | C |
| - | - | - | - | - | - |
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| - | - | - | - | - | - |
C.
;
;Dev #1;Dev #2;Dev #3;Dev #4;Dev #5
;
0x00;0;1;2;3;4;) AA
0x01;4;0;1;2;3;) AA~
0x02;5;6;7;8;9;) AB
0x03;9;5;6;7;8;) AB~
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;) \.\.\.
:;:;:;:;:;:; :
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;) \.\.\.
0x79;314;315;316;317;318;) EX
0x80;318;314;315;316;317;) EX~
;
.TE

With AA,\ AB,\ ..., AZ,\ BA,\ ... being the sets of <number of devices> consecutive
chunks and AA~,\ AB~,\ ..., AZ~,\ BA~,\ ... the cyclic permutations with offset\ 1
thereof (in the case of more than 2 copies per chunk there would be (AA~)~,\ ...
as well as ((AA~)~)~,\ ... and so on).

This should give similar read characteristics to "far" if a suitably large chunk
size is used, but without as much seeking for writes.
.PP

It should be noted that the number of devices in a RAID10 array need
not be a multiple of the number of replicas of each data block; however,
there must be at least as many devices as replicas.

If, for example, an array is created with 5 devices and 2 replicas,
then space equivalent to 2.5 of the devices will be available, and
every block will be stored on two different devices.

Finally, it is possible to have an array with both "near" and "far"
copies.  If an array is configured with 2 near copies and 2 far
copies, then there will be a total of 4 copies of each block, each on
a different drive.  This is an artifact of the implementation and is
unlikely to be of real value.

.SS MULTIPATH

MULTIPATH is not really a RAID at all as there is only one real device
in a MULTIPATH md array.  However there are multiple access points
(paths) to this device, and one of these paths might fail, so there
are some similarities.

A MULTIPATH array is composed of a number of logically different
devices, often fibre channel interfaces, that all refer to the same
real device.  If one of these interfaces fails (e.g. due to cable
problems), the MULTIPATH driver will attempt to redirect requests to
another interface.

The MULTIPATH driver is not receiving any ongoing development and
should be considered a legacy driver.  The device-mapper based
multipath drivers should be preferred for new installations.

.SS FAULTY
The FAULTY md module is provided for testing purposes.  A FAULTY array
has exactly one component device and is normally assembled without a
superblock, so the md array created provides direct access to all of
the data in the component device.

The FAULTY module may be requested to simulate faults to allow testing
of other md levels or of filesystems.  Faults can be chosen to trigger
on read requests or write requests, and can be transient (a subsequent
read/write at the address will probably succeed) or persistent
(subsequent read/write of the same address will fail).  Further, read
faults can be "fixable" meaning that they persist until a write
request at the same address.

Fault types can be requested with a period.  In this case, the fault
will recur repeatedly after the given number of requests of the
relevant type.  For example if persistent read faults have a period of
100, then every 100th read request would generate a fault, and the
faulty sector would be recorded so that subsequent reads on that
sector would also fail.

There is a limit to the number of faulty sectors that are remembered.
Faults generated after this limit is exhausted are treated as
transient.

The list of faulty sectors can be flushed, and the active list of
failure modes can be cleared.

.SS UNCLEAN SHUTDOWN

When changes are made to a RAID1, RAID4, RAID5, RAID6, or RAID10 array
there is a possibility of inconsistency for short periods of time as
each update requires at least two blocks to be written to different
devices, and these writes probably won't happen at exactly the same
time.  Thus if a system with one of these arrays is shut down in the
middle of a write operation (e.g. due to power failure), the array may
not be consistent.

To handle this situation, the md driver marks an array as "dirty"
before writing any data to it, and marks it as "clean" when the array
is being disabled, e.g. at shutdown.  If the md driver finds an array
to be dirty at startup, it proceeds to correct any possible
inconsistency.  For RAID1, this involves copying the contents of the
first drive onto all other drives.  For RAID4, RAID5 and RAID6 this
involves recalculating the parity for each stripe and making sure that
the parity block has the correct data.  For RAID10 it involves copying
one of the replicas of each block onto all the others.  This process,
known as "resynchronising" or "resync", is performed in the background.
The array can still be used, though possibly with reduced performance.

If a RAID4, RAID5 or RAID6 array is degraded (missing at least one
drive, two for RAID6) when it is restarted after an unclean shutdown, it cannot
recalculate parity, and so it is possible that data might be
undetectably corrupted.  The 2.4 md driver
.B does not
alert the operator to this condition.  The 2.6 md driver will fail to
start an array in this condition without manual intervention, though
this behaviour can be overridden by a kernel parameter.
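
The parameter in question is md's
.B start_dirty_degraded
option; when md is built into the kernel it would typically be given
on the kernel command line as:
.nf
    md_mod.start_dirty_degraded=1
.fi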

.SS RECOVERY

If the md driver detects a write error on a device in a RAID1, RAID4,
RAID5, RAID6, or RAID10 array, it immediately disables that device
(marking it as faulty) and continues operation on the remaining
devices.  If there are spare drives, the driver will start recreating
on one of the spare drives the data which was on that failed drive,
either by copying a working drive in a RAID1 configuration, or by
doing calculations with the parity block on RAID4, RAID5 or RAID6, or
by finding and copying originals for RAID10.

In kernels prior to about 2.6.15, a read error would cause the same
effect as a write error.  In later kernels, a read-error will instead
cause md to attempt a recovery by overwriting the bad block; i.e., it
will find the correct data from elsewhere, write it over the block
that failed, and then try to read it back again.  If either the write
or the re-read fails, md will treat the error the same way that a write
error is treated, and will fail the whole device.

While this recovery process is happening, the md driver will monitor
accesses to the array and will slow down the rate of recovery if other
activity is happening, so that normal access to the array will not be
unduly affected.  When no other activity is happening, the recovery
process proceeds at full speed.  The actual speed targets for the two
different situations can be controlled by the
.B speed_limit_min
and
.B speed_limit_max
control files mentioned below.

.SS SCRUBBING AND MISMATCHES

As storage devices can develop bad blocks at any time it is valuable
to regularly read all blocks on all devices in an array so as to catch
such bad blocks early.  This process is called
.IR scrubbing .

md arrays can be scrubbed by writing either
.I check
or
.I repair
to the file
.I md/sync_action
in the
.I sysfs
directory for the device.
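
For example, a check scrub of a hypothetical array md0 can be
triggered, and the resulting mismatch count read back, with:
.nf
    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt
.fi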

Requesting a scrub will cause
.I md
to read every block on every device in the array, and check that the
data is consistent.  For RAID1 and RAID10, this means checking that the copies
are identical.  For RAID4, RAID5, RAID6 this means checking that the
parity block is (or blocks are) correct.

If a read error is detected during this process, the normal read-error
handling causes correct data to be found from other devices and to be
written back to the faulty device.  In many cases this will
effectively
.I fix
the bad block.

If all blocks read successfully but are found not to be consistent,
then this is regarded as a
.IR mismatch .

If
.I check
was used, then no action is taken to handle the mismatch; it is simply
recorded.
If
.I repair
was used, then a mismatch will be repaired in the same way that
.I resync
repairs arrays.  For RAID5/RAID6 new parity blocks are written.  For RAID1/RAID10,
all but one block are overwritten with the content of that one block.

A count of mismatches is recorded in the
.I sysfs
file
.IR md/mismatch_cnt .
This is set to zero when a
scrub starts and is incremented whenever a sector is
found that is a mismatch.
.I md
normally works in units much larger than a single sector and when it
finds a mismatch, it does not determine exactly how many actual sectors were
affected but simply adds the number of sectors in the IO unit that was
used.  So a value of 128 could simply mean that a single 64KB check
found an error (128 x 512 bytes = 64KB).

If an array is created by
.I mdadm
with
.I \-\-assume\-clean
then a subsequent check could be expected to find some mismatches.

On a truly clean RAID5 or RAID6 array, any mismatches should indicate
a hardware problem at some level - software issues should never cause
such a mismatch.

However on RAID1 and RAID10 it is possible for software issues to
cause a mismatch to be reported.  This does not necessarily mean that
the data on the array is corrupted.  It could simply be that the
system does not care what is stored on that part of the array - it is
unused space.

The most likely cause for an unexpected mismatch on RAID1 or RAID10
occurs if a swap partition or swap file is stored on the array.

When the swap subsystem wants to write a page of memory out, it flags
the page as 'clean' in the memory manager and requests the swap device
to write it out.  It is quite possible that the memory will be
changed while the write-out is happening.  In that case the 'clean'
flag will be found to be clear when the write completes and so the
swap subsystem will simply forget that the swapout had been attempted,
and will possibly choose a different page to write out.

If the swap device was on RAID1 (or RAID10), then the data is sent
from memory to a device twice (or more depending on the number of
devices in the array).  Thus it is possible that the memory gets changed
between the times it is sent, so different data can be written to
the different devices in the array.  This will be detected by
.I check
as a mismatch.  However it does not reflect any corruption as the
block where this mismatch occurs is being treated by the swap system as
being empty, and the data will never be read from that block.

It is conceivable for a similar situation to occur on non-swap files,
though it is less likely.

Thus the
.I mismatch_cnt
value cannot be interpreted very reliably on RAID1 or RAID10,
especially when the device is used for swap.

.SS BITMAP WRITE-INTENT LOGGING

From Linux 2.6.13,
.I md
supports a bitmap based write-intent log.  If configured, the bitmap
is used to record which blocks of the array may be out of sync.
Before any write request is honoured, md will make sure that the
corresponding bit in the log is set.  After a period of time with no
writes to an area of the array, the corresponding bit will be cleared.

This bitmap is used for two optimisations.

Firstly, after an unclean shutdown, the resync process will consult
the bitmap and only resync those blocks that correspond to bits in the
bitmap that are set.  This can dramatically reduce resync time.

Secondly, when a drive fails and is removed from the array, md stops
clearing bits in the intent log.  If that same drive is re-added to
the array, md will notice and will only recover the sections of the
drive that are covered by bits in the intent log that are set.  This
can allow a device to be temporarily removed and reinserted without
causing an enormous recovery cost.

The intent log can be stored in a file on a separate device, or it can
be stored near the superblocks of an array which has superblocks.

It is possible to add an intent log to an active array, or remove an
intent log if one is present.
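
For example, with
.I mdadm
an internal intent bitmap can be added to, or removed from, an active
array (here the illustrative /dev/md0) with:
.nf
    mdadm \-\-grow \-\-bitmap=internal /dev/md0
    mdadm \-\-grow \-\-bitmap=none /dev/md0
.fi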

In 2.6.13, intent bitmaps are only supported with RAID1.  Other levels
with redundancy are supported from 2.6.15.

.SS BAD BLOCK LIST

From Linux 3.5, each device in an
.I md
array can store a list of known-bad-blocks.  This list is 4K in size
and usually positioned at the end of the space between the superblock
and the data.

When a block cannot be read and cannot be repaired by writing data
recovered from other devices, the address of the block is stored in
the bad block list.  Similarly if an attempt to write a block fails,
the address will be recorded as a bad block.  If attempting to record
the bad block fails, the whole device will be marked faulty.

Attempting to read from a known bad block will cause a read error.
Attempting to write to a known bad block will be ignored if any write
errors have been reported by the device.  If there have been no write
errors then the data will be written to the known bad block and if
that succeeds, the address will be removed from the list.

This allows an array to fail more gracefully - a few blocks on
different devices can be faulty without taking the whole array out of
action.

The list is particularly useful when recovering to a spare.  If a few blocks
cannot be read from the other devices, the bulk of the recovery can
complete and those few bad blocks will be recorded in the bad block list.

.SS RAID WRITE HOLE

Due to the non-atomic nature of RAID write operations,
interruption of a write operation (by a system crash, etc.) to a RAID456
array can lead to inconsistent parity and data loss (the so-called
RAID-5 write hole).
To plug the write hole, md supports the two mechanisms described below.

.TP
DIRTY STRIPE JOURNAL
From Linux 4.4, md supports a write-ahead journal for RAID456.
When the array is created, an additional journal device can be added to
the array through the write-journal option.  The RAID write journal works
similarly to file system journals.  Before writing to the data
disks, md persists data AND parity of the stripe to the journal
device.  After crashes, md searches the journal device for
incomplete write operations, and replays them to the data disks.

When the journal device fails, the RAID array is forced to run in
read-only mode.
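
For example (device names are illustrative), a RAID5 array with a
dedicated journal device could be created with:
.nf
    mdadm \-\-create /dev/md0 \-\-level=5 \-\-raid\-devices=3 \-\-write\-journal=/dev/sdd1 /dev/sda1 /dev/sdb1 /dev/sdc1
.fi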

.TP
PARTIAL PARITY LOG
From Linux 4.12, md supports a Partial Parity Log (PPL) for RAID5 arrays only.
Partial parity for a write operation is the XOR of stripe data chunks not
modified by the write.  PPL is stored in the metadata region of RAID member
drives, so no additional journal drive is needed.
After crashes, if one of the unmodified data disks of
the stripe is missing, this updated parity can be used to recover its
data.

This mechanism is documented more fully in the file
Documentation/md/raid5-ppl.rst

.SS WRITE-BEHIND

From Linux 2.6.14,
.I md
supports WRITE-BEHIND on RAID1 arrays.

This allows certain devices in the array to be flagged as
.IR write-mostly .
MD will only read from such devices if there is no
other option.

If a write-intent bitmap is also provided, write requests to
write-mostly devices will be treated as write-behind requests and md
will not wait for writes to those requests to complete before
reporting the write as complete to the filesystem.

This allows for a RAID1 with WRITE-BEHIND to be used to mirror data
over a slow link to a remote computer (providing the link isn't too
slow).  The extra latency of the remote link will not slow down normal
operations, but the remote system will still have a reasonably
up-to-date copy of all data.

.SS FAILFAST

From Linux 4.10,
.I md
supports FAILFAST for RAID1 and RAID10 arrays.  This is a flag that
can be set on individual drives, though it is usually set on all
drives, or no drives.

When
.I md
sends an I/O request to a drive that is marked as FAILFAST, and when
the array could survive the loss of that drive without losing data,
.I md
will request that the underlying device does not perform any retries.
This means that a failure will be reported to
.I md
promptly, and it can mark the device as faulty and continue using the
other device(s).
.I md
cannot control the timeout that the underlying devices use to
determine failure.  Any changes desired to that timeout must be set
explicitly on the underlying device, separately from using
.IR mdadm .

If a FAILFAST request does fail, and if it is still safe to mark the
device as faulty without data loss, that will be done and the array
will continue functioning on a reduced number of devices.  If it is not
possible to safely mark the device as faulty,
.I md
will retry the request without disabling retries in the underlying
device.  In any case,
.I md
will not attempt to repair read errors on a device marked as FAILFAST
by writing out the correct data; it will just mark the device as faulty.

FAILFAST is appropriate for storage arrays that have a low probability
of true failure, but will sometimes introduce unacceptable delays to
I/O requests while performing internal maintenance.  The value of
setting FAILFAST involves a trade-off.  The gain is that the chance of
unacceptable delays is substantially reduced.  The cost is that the
unlikely event of data-loss on one device is slightly more likely to
result in data-loss for the array.

When a device in an array using FAILFAST is marked as faulty, it will
usually become usable again in a short while.
.I mdadm
makes no attempt to detect that possibility.  Some separate
mechanism, tuned to the specific details of the expected failure modes,
needs to be created to monitor devices to see when they return to full
functionality, and to then re-add them to the array.  In order for
this "re-add" functionality to be effective, an array using FAILFAST
should always have a write-intent bitmap.

.SS RESTRIPING

.IR Restriping ,
also known as
.IR Reshaping ,
is the process of re-arranging the data stored in each stripe into a
new layout.  This might involve changing the number of devices in the
array (so the stripes are wider), changing the chunk size (so stripes
are deeper or shallower), or changing the arrangement of data and
parity (possibly changing the RAID level, e.g. 1 to 5 or 5 to 6).

As of Linux 2.6.35, md can reshape a RAID4, RAID5, or RAID6 array to
have a different number of devices (more or fewer) and to have a
different layout or chunk size.  It can also convert between these
different RAID levels.  It can also convert between RAID0 and RAID10,
and between RAID0 and RAID4 or RAID5.
Other possibilities may follow in future kernels.

During any restriping process there is a 'critical section' during which
live data is being overwritten on disk.  For the operation of
increasing the number of drives in a RAID5, this critical section
covers the first few stripes (the number being the product of the old
and new number of devices).  After this critical section is passed,
data is only written to areas of the array which no longer hold live
data \(em the live data has already been relocated away.

For a reshape which reduces the number of devices, the 'critical
section' is at the end of the reshape process.

md is not able to ensure data preservation if there is a crash
(e.g. power failure) during the critical section.  If md is asked to
start an array which failed during a critical section of restriping,
it will fail to start the array.

To deal with this possibility, a user-space program must
.IP \(bu 4
Disable writes to that section of the array (using the
.B sysfs
interface),
.IP \(bu 4
take a copy of the data somewhere (i.e. make a backup),
.IP \(bu 4
allow the process to continue and invalidate the backup and restore
write access once the critical section is passed, and
.IP \(bu 4
provide for restoring the critical data before restarting the array
after a system crash.
.PP

.B mdadm
versions from 2.4 do this for growing a RAID5 array.

For operations that do not change the size of the array, like simply
increasing chunk size, or converting RAID5 to RAID6 with one extra
device, the entire process is the critical section.  In this case, the
restripe will need to progress in stages, as a section is suspended,
backed up, restriped, and released.

.SS SYSFS INTERFACE
Each block device appears as a directory in
.I sysfs
(which is usually mounted at
.BR /sys ).
For MD devices, this directory will contain a subdirectory called
.B md
which contains various files for providing access to information about
the array.

This interface is documented more fully in the file
.B Documentation/admin-guide/md.rst
which is distributed with the kernel sources.  That file should be
consulted for full documentation.  The following are just a selection
of attribute files that are available.

.TP
.B md/sync_speed_min
This value, if set, overrides the system-wide setting in
.B /proc/sys/dev/raid/speed_limit_min
for this array only.
Writing the value
.B "system"
to this file will cause the system-wide setting to have effect.
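
For example (md0 being an illustrative array name), a floor of 50000
kibibytes per second could be requested, or the system-wide setting
restored, with:
.nf
    echo 50000 > /sys/block/md0/md/sync_speed_min
    echo system > /sys/block/md0/md/sync_speed_min
.fi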

.TP
.B md/sync_speed_max
This is the partner of
.B md/sync_speed_min
and overrides
.B /proc/sys/dev/raid/speed_limit_max
described below.

.TP
.B md/sync_action
This can be used to monitor and control the resync/recovery process of
MD.
In particular, writing "check" here will cause the array to read all
data blocks and check that they are consistent (e.g. parity is correct,
or all mirror replicas are the same).  Any discrepancies found are
.B NOT
corrected.

A count of problems found will be stored in
.BR md/mismatch_cnt .

Alternately, "repair" can be written which will cause the same check
to be performed, but any errors will be corrected.

Finally, "idle" can be written to stop the check/repair process.

.TP
.B md/stripe_cache_size
This is only available on RAID5 and RAID6.  It records the size (in
pages per device) of the stripe cache which is used for synchronising
all write operations to the array and all read operations if the array
is degraded.  The default is 256.  Valid values are 17 to 32768.
Increasing this number can increase performance in some situations, at
some cost in system memory.  Note, setting this value too high can
result in an "out of memory" condition for the system.

memory_consumed = system_page_size * nr_disks * stripe_cache_size
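
For example, with a 4K system page size, a 5-device array and
stripe_cache_size raised to 4096 (all illustrative values):
.nf
    4096 bytes x 5 disks x 4096 = 80 MiB consumed
    echo 4096 > /sys/block/md0/md/stripe_cache_size
.fi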
addc80c4 | 1142 | |
a5ee6dfb DW |
1143 | .TP |
1144 | .B md/preread_bypass_threshold | |
1145 | This is only available on RAID5 and RAID6. This variable sets the | |
1146 | number of times MD will service a full-stripe-write before servicing a | |
1147 | stripe that requires some "prereading". For fairness this defaults to | |
800053d6 DW |
1148 | 1. Valid values are 0 to stripe_cache_size. Setting this to 0 |
1149 | maximizes sequential-write throughput at the cost of fairness to threads | |
bcbb92d4 | 1150 | doing small or random writes. |
addc80c4 | 1151 | |
.TP
.B md/bitmap/backlog
This value only has an effect on RAID1 arrays when write-mostly devices
are active; write requests to those devices are then processed in the
background (write-behind).

This variable sets a limit on the number of concurrent background writes.
Valid values are 0 to 16383; 0 means that write-behind is not allowed,
while any other number means it can happen. If there are more outstanding
write requests than this number, new writes will be synchronous.

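.IP
For example, write-behind can be disabled entirely on a hypothetical
array
.B md0
with:
.nf

    echo 0 > /sys/block/md0/md/bitmap/backlog

.fi
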
.TP
.B md/bitmap/can_clear
This is for externally managed bitmaps, where the kernel writes the bitmap
itself, but metadata describing the bitmap is managed by mdmon or similar.

When the array is degraded, bits mustn't be cleared. When the array becomes
optimal again, bits can be cleared, but first the metadata needs to record
the current event count. So md sets this to 'false' and notifies mdmon,
then mdmon updates the metadata and writes 'true'.

There is currently no code in mdmon to actually do this, so the mechanism
may not work.

.TP
.B md/bitmap/chunksize
The bitmap chunksize can only be changed when no bitmap is active, and
the value should be a power of 2 and at least 512.

.TP
.B md/bitmap/location
This indicates where the write-intent bitmap for the array is stored.
It can be "none" or "file" or a signed offset from the array metadata,
measured in sectors. You cannot set a file by writing here; that can
only be done with the SET_BITMAP_FILE ioctl.

Writing 'none' to 'bitmap/location' will clear the bitmap; the previous
location value must be written back to restore the bitmap.

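.IP
For example, assuming a hypothetical array
.B md0
whose bitmap is stored 8 sectors after the metadata, the bitmap could
be cleared and then restored with:
.nf

    echo none > /sys/block/md0/md/bitmap/location
    echo +8 > /sys/block/md0/md/bitmap/location

.fi
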
.TP
.B md/bitmap/max_backlog_used
This keeps track of the maximum number of concurrent write-behind requests
for an md array; writing any value to this file will clear it.

.TP
.B md/bitmap/metadata
This can be 'internal', 'clustered' or 'external'. 'internal' is the
default, and means the metadata for the bitmap is stored in the first 256
bytes of the bitmap space. 'clustered' means separate bitmap metadata are
used for each cluster node. 'external' means that bitmap metadata is managed
externally to the kernel.

.TP
.B md/bitmap/space
This shows the space (in sectors) which is available at md/bitmap/location,
and allows the kernel to know when it is safe to resize the bitmap to match
a resized array. It should be big enough to contain the total bytes in the
bitmap.

For 1.0 metadata, the bitmap is assumed to be able to use the space up to
the superblock if it is stored before the superblock, and otherwise up to
4K beyond the superblock. For other metadata versions, no change is assumed
to be possible.

.TP
.B md/bitmap/time_base
This shows the time (in seconds) between disk flushes, and is used when
looking for bits in the bitmap that can be cleared.

The default value is 5 seconds, and it should be an unsigned long value.

.SS KERNEL PARAMETERS

The md driver recognises several different kernel parameters.
.TP
.B raid=noautodetect
This will disable the normal detection of md arrays that happens at
boot time. If a drive is partitioned with MS-DOS style partitions
and any of the 4 main partitions has a partition type of 0xFD,
that partition will normally be inspected to see if it is part of
an MD array, and if any full arrays are found, they are started. This
kernel parameter disables that behaviour.

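.IP
For example, the parameter is simply appended to the kernel command
line in the boot loader configuration (the root device shown is
illustrative):
.nf

    root=/dev/sda2 raid=noautodetect

.fi
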
.TP
.B raid=partitionable
.TP
.B raid=part
These are available in 2.6 and later kernels only. They indicate that
autodetected MD arrays should be created as partitionable arrays, with
a different major device number to the original non-partitionable md
arrays. The device number is listed as
.I mdp
in
.IR /proc/devices .

.TP
.B md_mod.start_ro=1
.TP
.B /sys/module/md_mod/parameters/start_ro
This tells md to start all arrays in read-only mode. This is a soft
read-only that will automatically switch to read-write on the first
write request. However, until that write request, nothing is written
to any device by md, and in particular, no resync or recovery
operation is started.

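.IP
For example, the setting can also be changed at run time through sysfs:
.nf

    echo 1 > /sys/module/md_mod/parameters/start_ro

.fi
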
.TP
.B md_mod.start_dirty_degraded=1
.TP
.B /sys/module/md_mod/parameters/start_dirty_degraded
As mentioned above, md will not normally start a RAID4, RAID5, or
RAID6 array that is both dirty and degraded, as this situation can imply
hidden data loss. This can be awkward if the root filesystem is
affected. Using this module parameter allows such arrays to be started
at boot time. It should be understood that there is a real (though
small) risk of data corruption in this situation.

.TP
.BI md= n , dev , dev ,...
.TP
.BI md=d n , dev , dev ,...
This tells the md driver to assemble
.B /dev/md n
from the listed devices. It is only necessary to start the device
holding the root filesystem this way. Other arrays are best started
once the system is booted.

In 2.6 kernels, the
.B d
immediately after the
.B =
indicates that a partitionable device (e.g.
.BR /dev/md/d0 )
should be created rather than the original non-partitionable device.

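.IP
A typical use is assembling the array that holds the root filesystem
(device names are illustrative):
.nf

    md=0,/dev/sda1,/dev/sdb1 root=/dev/md0

.fi
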
.TP
.BI md= n , l , c , i , dev...
This tells the md driver to assemble a legacy RAID0 or LINEAR array
without a superblock.
.I n
gives the md device number,
.I l
gives the level, 0 for RAID0 or \-1 for LINEAR,
.I c
gives the chunk size as a base-2 logarithm offset by twelve, so 0
means 4K, 1 means 8K.
.I i
is ignored (legacy support).

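.IP
For example, the following kernel command line entry (with illustrative
device names) assembles
.B /dev/md0
as a RAID0 array with 64K chunks, since 2^(4+12) bytes is 64K:
.nf

    md=0,0,4,0,/dev/sda1,/dev/sdb1

.fi
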
.SH FILES
.TP
.B /proc/mdstat
Contains information about the status of currently running arrays.
.TP
.B /proc/sys/dev/raid/speed_limit_min
A readable and writable file that reflects the current "goal" rebuild
speed for times when non-rebuild activity is occurring on an array.
The speed is in Kibibytes per second, and is a per-device rate, not a
per-array rate (which means that an array with more disks will shuffle
more data for a given speed). The default is 1000.

.TP
.B /proc/sys/dev/raid/speed_limit_max
A readable and writable file that reflects the current "goal" rebuild
speed for times when no non-rebuild activity is occurring on an array.
The default is 200,000.

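.IP
For example, both limits can be raised for a faster resync and checked
afterwards (the values, in KiB/s, are illustrative):
.nf

    echo 50000 > /proc/sys/dev/raid/speed_limit_min
    echo 500000 > /proc/sys/dev/raid/speed_limit_max

.fi
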
.SH SEE ALSO
.BR mdadm (8),