.TH MD 4
.SH NAME
md \- Multiple Device driver aka Linux Software RAID
.SH SYNOPSIS
.BI /dev/md n
.br
.BI /dev/md/ n
.SH DESCRIPTION
The
.B md
driver provides virtual devices that are created from one or more
independent underlying devices.  This array of devices often contains
redundancy, and hence the acronym RAID, which stands for a Redundant
Array of Independent Devices.
.PP
.B md
supports RAID levels
1 (mirroring),
4 (striped array with parity device),
5 (striped array with distributed parity information),
6 (striped array with distributed dual redundancy information), and
10 (striped and mirrored).
If some number of underlying devices fails while using one of these
levels, the array will continue to function; this number is one for
RAID levels 4 and 5, two for RAID level 6, all but one (N-1) for
RAID level 1, and dependent on the configuration for level 10.
.PP
.B md
also supports a number of pseudo RAID (non-redundant) configurations
including RAID0 (striped array), LINEAR (catenated array),
MULTIPATH (a set of different interfaces to the same device),
and FAULTY (a layer over a single device into which errors can be injected).

.SS MD SUPER BLOCK
Each device in an array may have a
.I superblock
which records information about the structure and state of the array.
This allows the array to be reliably re-assembled after a shutdown.

From Linux kernel version 2.6.10,
.B md
provides support for two different formats of this superblock, and
other formats can be added.  Prior to this release, only one format was
supported.

The common format - known as version 0.90 - has
a superblock that is 4K long and is written into a 64K aligned block that
starts at least 64K and less than 128K from the end of the device
(i.e. to get the address of the superblock, round the size of the
device down to a multiple of 64K and then subtract 64K).
The available size of each device is the amount of space before the
superblock, so between 64K and 128K is lost when a device is
incorporated into an MD array.
This superblock stores multi-byte fields in a processor-dependent
manner, so arrays cannot easily be moved between computers with
different processors.
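The 0.90 location rule can be sketched numerically (an illustration of the arithmetic described above, not kernel code):

```python
def v090_superblock_offset(device_size: int) -> int:
    """Byte offset of a version 0.90 superblock: round the device size
    down to a multiple of 64K, then subtract 64K."""
    block = 64 * 1024
    return (device_size // block) * block - block

# On a 1 GiB device the superblock sits exactly 64K from the end; only
# the space before it is usable, so 64K-128K of each device is lost.
offset = v090_superblock_offset(1024 ** 3)
print(offset, 1024 ** 3 - offset)  # 1073676288 65536
```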

The new format - known as version 1 - has a superblock that is
normally 1K long, but can be longer.  It is normally stored between 8K
and 12K from the end of the device, on a 4K boundary, though
variations can be stored at the start of the device (version 1.1) or 4K from
the start of the device (version 1.2).
This superblock format stores multi-byte data in a
processor-independent format and supports up to hundreds of
component devices (version 0.90 only supports 28).

The superblock contains, among other things:
.TP
LEVEL
The manner in which the devices are arranged into the array
(linear, raid0, raid1, raid4, raid5, raid10, multipath).
.TP
UUID
a 128 bit Universally Unique Identifier that identifies the array that
this device is part of.

.SS ARRAYS WITHOUT SUPERBLOCKS
While it is usually best to create arrays with superblocks so that
they can be assembled reliably, there are some circumstances where an
array without superblocks is preferred.  These include:
.TP
LEGACY ARRAYS
Early versions of the
.B md
driver only supported Linear and Raid0 configurations and did not use
a superblock (which is less critical with these configurations).
While such arrays should be rebuilt with superblocks if possible,
.B md
continues to support them.
.TP
FAULTY
Being a largely transparent layer over a different device, the FAULTY
personality doesn't gain anything from having a superblock.
.TP
MULTIPATH
It is often possible to detect devices which are different paths to
the same storage directly rather than having a distinctive superblock
written to the device and searched for on all paths.  In this case,
a MULTIPATH array with no superblock makes sense.
.TP
RAID1
In some configurations it might be desired to create a raid1
configuration that does not use a superblock, and to maintain the state of
the array elsewhere.  While not encouraged, this is supported.

.SS LINEAR

A linear array simply catenates the available space on each
drive together to form one large virtual drive.

One advantage of this arrangement over the more common RAID0
arrangement is that the array may be reconfigured at a later time with
an extra drive, and so the array is made bigger without disturbing the
data that is on the array.  However this cannot be done on a live
array.

If a chunksize is given with a LINEAR array, the usable space on each
device is rounded down to a multiple of this chunksize.
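That rounding rule amounts to the following (an illustrative sketch, not driver code):

```python
def linear_usable(device_size: int, chunksize: int) -> int:
    """Usable space on a LINEAR member: the device size rounded down
    to a whole multiple of the chunksize."""
    return (device_size // chunksize) * chunksize

print(linear_usable(1_000_000, 64 * 1024))  # 983040 (15 full 64K chunks)
```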

.SS RAID0

A RAID0 array (which has zero redundancy) is also known as a
striped array.
A RAID0 array is configured at creation with a
.B "Chunk Size"
which must be a power of two, and at least 4 kibibytes.

The RAID0 driver assigns the first chunk of the array to the first
device, the second chunk to the second device, and so on until all
drives have been assigned one chunk.  This collection of chunks forms
a
.BR stripe .
Further chunks are gathered into stripes in the same way, and are
assigned to the remaining space in the drives.

If devices in the array are not all the same size, then once the
smallest device has been exhausted, the RAID0 driver starts
collecting chunks into smaller stripes that only span the drives which
still have remaining space.

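For equal-size devices, the chunk placement described above can be modelled as follows (a simplification that ignores the smaller-stripe behaviour used once uneven devices fill up):

```python
def raid0_map(chunk_index: int, n_devices: int, chunk_size: int = 64 * 1024):
    """Map a logical chunk to (device, byte offset on that device).

    Chunks are dealt out round-robin; one chunk per device forms a stripe.
    """
    device = chunk_index % n_devices
    stripe = chunk_index // n_devices
    return device, stripe * chunk_size

# Three devices: chunks 0-2 form stripe 0, chunk 3 starts stripe 1.
print([raid0_map(i, 3) for i in range(4)])
# [(0, 0), (1, 0), (2, 0), (0, 65536)]
```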
.SS RAID1

A RAID1 array is also known as a mirrored set (though mirrors tend to
provide reflected images, which RAID1 does not) or a plex.

Once initialised, each device in a RAID1 array contains exactly the
same data.  Changes are written to all devices in parallel.  Data is
read from any one device.  The driver attempts to distribute read
requests across all devices to maximise performance.

All devices in a RAID1 array should be the same size.  If they are
not, then only the amount of space available on the smallest device is
used.  Any extra space on other devices is wasted.

.SS RAID4

A RAID4 array is like a RAID0 array with an extra device for storing
parity.  This device is the last of the active devices in the
array.  Unlike RAID0, RAID4 also requires that all stripes span all
drives, so extra space on devices that are larger than the smallest is
wasted.

When any block in a RAID4 array is modified, the parity block for that
stripe (i.e. the block in the parity device at the same device offset
as the stripe) is also modified so that the parity block always
contains the "parity" for the whole stripe.  That is, its contents are
equivalent to the result of performing an exclusive-or operation
between all the data blocks in the stripe.

This allows the array to continue to function if one device fails.
The data that was on that device can be calculated as needed from the
parity block and the other data blocks.

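The parity relationship, and the reconstruction it enables, can be demonstrated directly (an illustration of the exclusive-or arithmetic, not the driver's code):

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]   # three data blocks of a stripe
parity = xor_blocks(data)                        # what the parity device stores

# If data[1] is lost, XOR-ing the parity with the survivors recreates it.
recovered = xor_blocks([parity, data[0], data[2]])
print(recovered == data[1])  # True
```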
.SS RAID5

RAID5 is very similar to RAID4.  The difference is that the parity
blocks for each stripe, instead of being on a single device, are
distributed across all devices.  This allows more parallelism when
writing, as two different block updates will quite possibly affect
parity blocks on different devices, so there is less contention.

This also allows more parallelism when reading, as read requests are
distributed over all the devices in the array instead of all but one.

.SS RAID6

RAID6 is similar to RAID5, but can handle the loss of any \fItwo\fP
devices without data loss.  Accordingly, it requires N+2 drives to
store N drives worth of data.

The performance of RAID6 is slightly lower than, but comparable to,
RAID5 in normal mode and single disk failure mode.  It is very slow in
dual disk failure mode, however.

.SS RAID10

RAID10 provides a combination of RAID1 and RAID0, and is sometimes
known as RAID1+0.  Every data block is duplicated some number of
times, and the resulting collection of data blocks is distributed over
multiple drives.

When configuring a RAID10 array it is necessary to specify the number
of replicas of each data block that are required (this will normally
be 2) and whether the replicas should be 'near' or 'far'.

When 'near' replicas are chosen, the multiple copies of a given chunk
are laid out consecutively across the stripes of the array, so the two
copies of a data block will likely be at the same offset on two
adjacent devices.
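A simplified model of the 'near' layout (assumed here to place copies in consecutive device slots; the real driver handles odd device counts and chunking in more detail):

```python
def raid10_near(block: int, n_devices: int, replicas: int = 2):
    """Place each copy of a block in consecutive device slots, so copies
    normally land at the same offset on adjacent devices."""
    slots = [block * replicas + r for r in range(replicas)]
    return [(s % n_devices, s // n_devices) for s in slots]  # (device, offset)

print(raid10_near(0, 4))  # [(0, 0), (1, 0)]  - same offset, adjacent devices
print(raid10_near(1, 4))  # [(2, 0), (3, 0)]
```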

When 'far' replicas are chosen, the multiple copies of a given chunk
are laid out quite distant from each other.  The first copy of all
data blocks will be striped across the early part of all drives in
RAID0 fashion, and then the next copy of all blocks will be striped
across a later section of all drives, always ensuring that all copies
of any given block are on different drives.

The 'far' arrangement can give sequential read performance equal to
that of a RAID0 array, but at the cost of degraded write performance.

It should be noted that the number of devices in a RAID10 array need
not be a multiple of the number of replicas of each data block, though
there must be at least as many devices as replicas.

If, for example, an array is created with 5 devices and 2 replicas,
then space equivalent to 2.5 of the devices will be available, and
every block will be stored on two different devices.

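The capacity arithmetic in that example generalises (a sketch; real arrays also lose a little space to superblocks and chunk rounding):

```python
def raid10_capacity(n_devices: int, device_size: float, replicas: int = 2) -> float:
    """Usable space of a RAID10 array: total raw space divided by the
    number of replicas of each data block."""
    return n_devices * device_size / replicas

# 5 devices, 2 replicas: space equivalent to 2.5 devices, as above.
print(raid10_capacity(5, 1.0))  # 2.5
```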
Finally, it is possible to have an array with both 'near' and 'far'
copies.  If an array is configured with 2 near copies and 2 far
copies, then there will be a total of 4 copies of each block, each on
a different drive.  This is an artifact of the implementation and is
unlikely to be of real value.

.SS MULTIPATH

MULTIPATH is not really a RAID at all, as there is only one real device
in a MULTIPATH md array.  However there are multiple access points
(paths) to this device, and one of these paths might fail, so there
are some similarities.

A MULTIPATH array is composed of a number of logically different
devices, often fibre channel interfaces, that all refer to the same
real device.  If one of these interfaces fails (e.g. due to cable
problems), the multipath driver will attempt to redirect requests to
another interface.

.SS FAULTY
The FAULTY md module is provided for testing purposes.  A faulty array
has exactly one component device and is normally assembled without a
superblock, so the md array created provides direct access to all of
the data in the component device.

The FAULTY module may be requested to simulate faults to allow testing
of other md levels or of filesystems.  Faults can be chosen to trigger
on read requests or write requests, and can be transient (a subsequent
read/write at the address will probably succeed) or persistent
(subsequent read/write of the same address will fail).  Further, read
faults can be "fixable", meaning that they persist until a write
request at the same address.

Fault types can be requested with a period.  In this case the fault
will recur repeatedly after the given number of requests of the
relevant type.  For example if persistent read faults have a period of
100, then every 100th read request would generate a fault, and the
faulty sector would be recorded so that subsequent reads on that
sector would also fail.

There is a limit to the number of faulty sectors that are remembered.
Faults generated after this limit is exhausted are treated as
transient.

The list of faulty sectors can be flushed, and the active list of
failure modes can be cleared.
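The periodic persistent-read-fault behaviour described above can be modelled like this (a toy model for illustration, not the kernel module; the sector-memory limit is omitted):

```python
class PeriodicReadFault:
    """Toy model of a persistent read fault with a period: every Nth read
    faults, and the faulted sector keeps failing afterwards."""

    def __init__(self, period: int):
        self.period = period
        self.count = 0
        self.bad_sectors = set()

    def read(self, sector: int) -> bool:
        """Return True if the read succeeds, False if it faults."""
        if sector in self.bad_sectors:
            return False                    # persistent: already recorded
        self.count += 1
        if self.count % self.period == 0:
            self.bad_sectors.add(sector)    # record the faulty sector
            return False
        return True

dev = PeriodicReadFault(period=3)
# The third read faults, and sector 2 stays bad afterwards.
print([dev.read(s) for s in [0, 1, 2, 0, 2]])
# [True, True, False, True, False]
```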

.SS UNCLEAN SHUTDOWN

When changes are made to a RAID1, RAID4, RAID5, RAID6, or RAID10 array
there is a possibility of inconsistency for short periods of time, as
each update requires at least two blocks to be written to different
devices, and these writes probably won't happen at exactly the same
time.  Thus if a system with one of these arrays is shut down in the
middle of a write operation (e.g. due to power failure), the array may
not be consistent.

To handle this situation, the md driver marks an array as "dirty"
before writing any data to it, and marks it as "clean" when the array
is being disabled, e.g. at shutdown.  If the md driver finds an array
to be dirty at startup, it proceeds to correct any possible
inconsistency.  For RAID1, this involves copying the contents of the
first drive onto all other drives.  For RAID4, RAID5 and RAID6 this
involves recalculating the parity for each stripe and making sure that
the parity block has the correct data.  For RAID10 it involves copying
one of the replicas of each block onto all the others.  This process,
known as "resynchronising" or "resync", is performed in the background.
The array can still be used, though possibly with reduced performance.

If a RAID4, RAID5 or RAID6 array is degraded (missing at least one
drive) when it is restarted after an unclean shutdown, it cannot
recalculate parity, and so it is possible that data might be
undetectably corrupted.  The 2.4 md driver
.B does not
alert the operator to this condition.  The 2.5 md driver will fail to
start an array in this condition without manual intervention.

.SS RECOVERY

If the md driver detects any error on a device in a RAID1, RAID4,
RAID5, RAID6, or RAID10 array, it immediately disables that device
(marking it as faulty) and continues operation on the remaining
devices.  If there is a spare drive, the driver will start recreating
on one of the spare drives the data that was on the failed drive,
either by copying a working drive in a RAID1 configuration, or by
doing calculations with the parity block on RAID4, RAID5 or RAID6, or
by finding and copying originals for RAID10.

While this recovery process is happening, the md driver will monitor
accesses to the array and will slow down the rate of recovery if other
activity is happening, so that normal access to the array will not be
unduly affected.  When no other activity is happening, the recovery
process proceeds at full speed.  The actual speed targets for the two
different situations can be controlled by the
.B speed_limit_min
and
.B speed_limit_max
control files mentioned below.

.SS BITMAP WRITE-INTENT LOGGING

From Linux 2.6.13,
.I md
supports a bitmap based write-intent log.  If configured, the bitmap
is used to record which blocks of the array may be out of sync.
Before any write request is honoured, md will make sure that the
corresponding bit in the log is set.  After a period of time with no
writes to an area of the array, the corresponding bit will be cleared.

This bitmap is used for two optimisations.

Firstly, after an unclean shutdown, the resync process will consult
the bitmap and only resync those blocks that correspond to bits in the
bitmap that are set.  This can dramatically reduce resync time.

Secondly, when a drive fails and is removed from the array, md stops
clearing bits in the intent log.  If that same drive is re-added to
the array, md will notice and will only recover the sections of the
drive that are covered by bits in the intent log that are set.  This
can allow a device to be temporarily removed and reinserted without
causing an enormous recovery cost.
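The logging discipline above can be sketched as a toy model (for illustration only; the real log works on regions of the array and lives in the kernel):

```python
class WriteIntentBitmap:
    """Toy write-intent log: a bit is set before a write, cleared after a
    quiet period, and consulted at resync time so that only
    possibly-out-of-sync regions are resynced."""

    def __init__(self, regions: int):
        self.bits = [False] * regions

    def before_write(self, region: int):
        self.bits[region] = True        # region may now be out of sync

    def after_quiet_period(self, region: int):
        self.bits[region] = False       # writes reached every device

    def regions_to_resync(self):
        return [i for i, b in enumerate(self.bits) if b]

log = WriteIntentBitmap(8)
log.before_write(2)
log.before_write(5)
log.after_quiet_period(5)
print(log.regions_to_resync())  # [2] - only one region needs resync
```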

The intent log can be stored in a file on a separate device, or it can
be stored near the superblocks of an array which has superblocks.

Subsequent versions of Linux will support hot-adding of bitmaps to
existing arrays.

In 2.6.13, intent bitmaps are only supported with RAID1.  Other levels
will follow.

.SS WRITE-BEHIND

From Linux 2.6.14,
.I md
will support WRITE-BEHIND on RAID1 arrays.

This allows certain devices in the array to be flagged as
.IR write-mostly .
MD will only read from such devices if there is no
other option.

If a write-intent bitmap is also provided, write requests to
write-mostly devices will be treated as write-behind requests and md
will not wait for those writes to complete before
reporting the write as complete to the filesystem.

This allows for a RAID1 with WRITE-BEHIND to be used to mirror data
over a slow link to a remote computer (providing the link isn't too
slow).  The extra latency of the remote link will not slow down normal
operations, but the remote system will still have a reasonably
up-to-date copy of all data.

.SS KERNEL PARAMETERS

The md driver recognises three different kernel parameters.
.TP
.B raid=noautodetect
This will disable the normal detection of md arrays that happens at
boot time.  If a drive is partitioned with MS-DOS style partitions,
then if any of the 4 main partitions has a partition type of 0xFD,
then that partition will normally be inspected to see if it is part of
an MD array, and if any full arrays are found, they are started.  This
kernel parameter disables this behaviour.

.TP
.B raid=partitionable
.TP
.B raid=part
These are available in 2.6 and later kernels only.  They indicate that
autodetected MD arrays should be created as partitionable arrays, with
a different major device number to the original non-partitionable md
arrays.  The device number is listed as
.I mdp
in
.IR /proc/devices .

.TP
.BI md= n , dev , dev ,...
.TP
.BI md=d n , dev , dev ,...
This tells the md driver to assemble
.B /dev/md n
from the listed devices.  It is only necessary to start the device
holding the root filesystem this way.  Other arrays are best started
once the system is booted.

In 2.6 kernels, the
.B d
immediately after the
.B =
indicates that a partitionable device (e.g.
.BR /dev/md/d0 )
should be created rather than the original non-partitionable device.

.TP
.BI md= n , l , c , i , dev...
This tells the md driver to assemble a legacy RAID0 or LINEAR array
without a superblock.
.I n
gives the md device number,
.I l
gives the level, 0 for RAID0 or -1 for LINEAR,
.I c
gives the chunk size as a base-2 logarithm offset by twelve, so 0
means 4K, 1 means 8K.
.I i
is ignored (legacy support).
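The chunk-size encoding works out as follows (an illustration of the arithmetic in the text, not kernel source):

```python
def legacy_chunk_size(c: int) -> int:
    """Chunk size for the legacy md= kernel parameter: 2^(c + 12) bytes,
    so c=0 gives 4K and c=1 gives 8K."""
    return 1 << (c + 12)

print([legacy_chunk_size(c) for c in range(4)])  # [4096, 8192, 16384, 32768]
```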

.SH FILES
.TP
.B /proc/mdstat
Contains information about the status of currently running arrays.
.TP
.B /proc/sys/dev/raid/speed_limit_min
A readable and writable file that reflects the current goal rebuild
speed for times when non-rebuild activity is current on an array.
The speed is in Kibibytes per second, and is a per-device rate, not a
per-array rate (which means that an array with more discs will shuffle
more data for a given speed).  The default is 100.

.TP
.B /proc/sys/dev/raid/speed_limit_max
A readable and writable file that reflects the current goal rebuild
speed for times when no non-rebuild activity is current on an array.
The default is 100,000.

.SH SEE ALSO
.BR mdadm (8),
.BR mkraid (8).