.\" Copyright Neil Brown and others.
.\" This program is free software; you can redistribute it and/or modify
.\" it under the terms of the GNU General Public License as published by
.\" the Free Software Foundation; either version 2 of the License, or
.\" (at your option) any later version.
.\" See file COPYING in distribution for details.
.TH MD 4
.SH NAME
md \- Multiple Device driver aka Linux Software RAID
.SH SYNOPSIS
.BI /dev/md n
.br
.BI /dev/md/ n
.br
.BR /dev/md/ name
.SH DESCRIPTION
The
.B md
driver provides virtual devices that are created from one or more
independent underlying devices. This array of devices often contains
redundancy and the devices are often disk drives, hence the acronym RAID
which stands for a Redundant Array of Independent Disks.
.PP
.B md
supports RAID levels
1 (mirroring),
4 (striped array with parity device),
5 (striped array with distributed parity information),
6 (striped array with distributed dual redundancy information), and
10 (striped and mirrored).
If some number of underlying devices fails while using one of these
levels, the array will continue to function; this number is one for
RAID levels 4 and 5, two for RAID level 6, and all but one (N-1) for
RAID level 1, and dependent on configuration for level 10.
.PP
.B md
also supports a number of pseudo RAID (non-redundant) configurations
including RAID0 (striped array), LINEAR (catenated array),
MULTIPATH (a set of different interfaces to the same device),
and FAULTY (a layer over a single device into which errors can be injected).

.SS MD METADATA
Each device in an array may have some
.I metadata
stored in the device. This metadata is sometimes called a
.BR superblock .
The metadata records information about the structure and state of the array.
This allows the array to be reliably re-assembled after a shutdown.

From Linux kernel version 2.6.10,
.B md
provides support for two different formats of metadata, and
other formats can be added. Prior to this release, only one format was
supported.

The common format \(em known as version 0.90 \(em has
a superblock that is 4K long and is written into a 64K aligned block that
starts at least 64K and less than 128K from the end of the device
(i.e. to get the address of the superblock round the size of the
device down to a multiple of 64K and then subtract 64K).
The available size of each device is the amount of space before the
superblock, so between 64K and 128K is lost when a device is
incorporated into an MD array.
This superblock stores multi-byte fields in a processor-dependent
manner, so arrays cannot easily be moved between computers with
different processors.

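.PP
As an illustration only (the following example is not part of the md
driver or of any md interface), this small C program performs the
version 0.90 superblock address calculation described above; the
device size is an arbitrary made-up value.
.PP
.nf
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* example device size in bytes; real code would query the device */
        uint64_t dev_size = 250059350016ULL;

        /* round down to a multiple of 64K, then step back one 64K block */
        uint64_t sb_offset = (dev_size & ~(uint64_t)(65536 - 1)) - 65536;

        printf("0.90 superblock at byte offset %llu",
               (unsigned long long)sb_offset);
        puts("");   /* puts() supplies the trailing newline */
        return 0;
    }
.fi
.PP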
The new format \(em known as version 1 \(em has a superblock that is
normally 1K long, but can be longer. It is normally stored between 8K
and 12K from the end of the device, on a 4K boundary, though
variations can be stored at the start of the device (version 1.1) or 4K from
the start of the device (version 1.2).
This metadata format stores multibyte data in a
processor-independent format and supports up to hundreds of
component devices (version 0.90 only supports 28).

The metadata contains, among other things:
.TP
LEVEL
The manner in which the devices are arranged into the array
(linear, raid0, raid1, raid4, raid5, raid10, multipath).
.TP
UUID
a 128 bit Universally Unique Identifier that identifies the array that
contains this device.

.PP
When a version 0.90 array is being reshaped (e.g. adding extra devices
to a RAID5), the version number is temporarily set to 0.91. This
ensures that if the reshape process is stopped in the middle (e.g. by
a system crash) and the machine boots into an older kernel that does
not support reshaping, then the array will not be assembled (which
would cause data corruption) but will be left untouched until a kernel
that can complete the reshape process is used.

.SS ARRAYS WITHOUT METADATA
While it is usually best to create arrays with superblocks so that
they can be assembled reliably, there are some circumstances when an
array without superblocks is preferred. These include:
.TP
LEGACY ARRAYS
Early versions of the
.B md
driver only supported Linear and Raid0 configurations and did not use
a superblock (which is less critical with these configurations).
While such arrays should be rebuilt with superblocks if possible,
.B md
continues to support them.
.TP
FAULTY
Being a largely transparent layer over a different device, the FAULTY
personality doesn't gain anything from having a superblock.
.TP
MULTIPATH
It is often possible to detect devices which are different paths to
the same storage directly rather than having a distinctive superblock
written to the device and searched for on all paths. In this case,
a MULTIPATH array with no superblock makes sense.
.TP
RAID1
In some configurations it might be desired to create a raid1
configuration that does not use a superblock, and to maintain the state of
the array elsewhere. While not encouraged for general use, it does
have special-purpose uses and is supported.

.SS ARRAYS WITH EXTERNAL METADATA

From release 2.6.28, the
.I md
driver supports arrays with externally managed metadata. That is,
the metadata is not managed by the kernel but rather by a user-space
program which is external to the kernel. This allows support for a
variety of metadata formats without cluttering the kernel with lots of
details.
.PP
.I md
is able to communicate with the user-space program through various
sysfs attributes so that it can make appropriate changes to the
metadata \- for example to mark a device as faulty. When necessary,
.I md
will wait for the program to acknowledge the event by writing to a
sysfs attribute.
The manual page for
.IR mdmon (8)
contains more detail about this interaction.

.SS CONTAINERS
Many metadata formats use a single block of metadata to describe a
number of different arrays which all use the same set of devices.
In this case it is helpful for the kernel to know about the full set
of devices as a whole. This set is known to md as a
.IR container .
A container is an
.I md
array with externally managed metadata and with device offset and size
so that it just covers the metadata part of the devices. The
remainder of each device is available to be incorporated into various
arrays.

.SS LINEAR

A linear array simply catenates the available space on each
drive to form one large virtual drive.

One advantage of this arrangement over the more common RAID0
arrangement is that the array may be reconfigured at a later time with
an extra drive, so the array is made bigger without disturbing the
data that is on the array. This can even be done on a live
array.

If a chunksize is given with a LINEAR array, the usable space on each
device is rounded down to a multiple of this chunksize.

.SS RAID0

A RAID0 array (which has zero redundancy) is also known as a
striped array.
A RAID0 array is configured at creation with a
.B "Chunk Size"
which must be a power of two (prior to Linux 2.6.31), and at least 4
kibibytes.

The RAID0 driver assigns the first chunk of the array to the first
device, the second chunk to the second device, and so on until all
drives have been assigned one chunk. This collection of chunks forms a
.BR stripe .
Further chunks are gathered into stripes in the same way, and are
assigned to the remaining space in the drives.

If devices in the array are not all the same size, then once the
smallest device has been exhausted, the RAID0 driver starts
collecting chunks into smaller stripes that only span the drives which
still have remaining space.

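.PP
The chunk-to-device mapping described above can be sketched in a few
lines of C. This is purely illustrative rather than code from the md
driver, assumes all member devices are the same size (so the smaller
tail stripes described above never arise), and uses made-up values
for the chunk size, device count and offset.
.PP
.nf
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        const uint64_t chunk = 64 * 1024;        /* example 64K chunk */
        const unsigned ndev  = 4;                /* example member count */
        uint64_t array_offset = 5 * chunk + 123; /* arbitrary test offset */

        uint64_t chunk_nr   = array_offset / chunk;  /* which chunk */
        uint64_t in_chunk   = array_offset % chunk;  /* offset inside it */
        unsigned device     = (unsigned)(chunk_nr % ndev); /* round robin */
        uint64_t dev_offset = (chunk_nr / ndev) * chunk + in_chunk;

        printf("device %u, byte %llu", device,
               (unsigned long long)dev_offset);
        puts("");
        return 0;
    }
.fi
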
.SS RAID1

A RAID1 array is also known as a mirrored set (though mirrors tend to
provide reflected images, which RAID1 does not) or a plex.

Once initialised, each device in a RAID1 array contains exactly the
same data. Changes are written to all devices in parallel. Data is
read from any one device. The driver attempts to distribute read
requests across all devices to maximise performance.

All devices in a RAID1 array should be the same size. If they are
not, then only the amount of space available on the smallest device is
used (any extra space on other devices is wasted).

Note that the read balancing done by the driver does not make the RAID1
performance profile the same as that of RAID0; a single stream of
sequential input will not be accelerated (e.g. a single dd), but
multiple sequential streams or a random workload will use more than one
spindle. In theory, having an N-disk RAID1 will allow N sequential
threads to read from all disks.

Individual devices in a RAID1 can be marked as "write-mostly".
These drives are excluded from the normal read balancing and will only
be read from when there is no other option. This can be useful for
devices connected over a slow link.

.SS RAID4

A RAID4 array is like a RAID0 array with an extra device for storing
parity. This device is the last of the active devices in the
array. Unlike RAID0, RAID4 also requires that all stripes span all
drives, so extra space on devices that are larger than the smallest is
wasted.

When any block in a RAID4 array is modified, the parity block for that
stripe (i.e. the block in the parity device at the same device offset
as the stripe) is also modified so that the parity block always
contains the "parity" for the whole stripe. I.e. its content is
equivalent to the result of performing an exclusive-or operation
between all the data blocks in the stripe.

This allows the array to continue to function if one device fails.
The data that was on that device can be calculated as needed from the
parity block and the other data blocks.

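.PP
To make the exclusive-or relationship above concrete, here is a small
stand-alone C example (not code from the md driver) that computes a
parity block for a three-block stripe and then reconstructs one block
after a simulated device failure; the block count, block size and
contents are arbitrary.
.PP
.nf
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define NDATA 3   /* data blocks per stripe (made-up example) */
    #define BLK   8   /* tiny block size so the idea stays visible */

    int main(void)
    {
        uint8_t data[NDATA][BLK] = { "chunk-0", "chunk-1", "chunk-2" };
        uint8_t parity[BLK] = {0};
        uint8_t rebuilt[BLK] = {0};

        /* parity block = exclusive-or of all data blocks in the stripe */
        for (int d = 0; d < NDATA; d++)
            for (int i = 0; i < BLK; i++)
                parity[i] ^= data[d][i];

        /* pretend device 1 failed: rebuild its block from the others */
        for (int i = 0; i < BLK; i++)
            rebuilt[i] = parity[i] ^ data[0][i] ^ data[2][i];

        puts(memcmp(rebuilt, data[1], BLK) == 0 ? "block recovered"
                                                : "mismatch");
        return 0;
    }
.fi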
.SS RAID5

RAID5 is very similar to RAID4. The difference is that the parity
blocks for each stripe, instead of being on a single device, are
distributed across all devices. This allows more parallelism when
writing, as two different block updates will quite possibly affect
parity blocks on different devices so there is less contention.

This also allows more parallelism when reading, as read requests are
distributed over all the devices in the array instead of all but one.

.SS RAID6

RAID6 is similar to RAID5, but can handle the loss of any \fItwo\fP
devices without data loss. Accordingly, it requires N+2 drives to
store N drives worth of data.

The performance for RAID6 is slightly lower but comparable to RAID5 in
normal mode and single disk failure mode. It is very slow in dual
disk failure mode, however.

.SS RAID10

RAID10 provides a combination of RAID1 and RAID0, and is sometimes known
as RAID1+0. Every datablock is duplicated some number of times, and
the resulting collection of datablocks is distributed over multiple
drives.

When configuring a RAID10 array, it is necessary to specify the number
of replicas of each data block that are required (this will normally
be 2) and whether the replicas should be 'near', 'offset' or 'far'.
(Note that the 'offset' layout is only available from 2.6.18).

When 'near' replicas are chosen, the multiple copies of a given chunk
are laid out consecutively across the stripes of the array, so the two
copies of a datablock will likely be at the same offset on two
adjacent devices.

When 'far' replicas are chosen, the multiple copies of a given chunk
are laid out quite distant from each other. The first copy of all
data blocks will be striped across the early part of all drives in
RAID0 fashion, and then the next copy of all blocks will be striped
across a later section of all drives, always ensuring that all copies
of any given block are on different drives.

The 'far' arrangement can give sequential read performance equal to
that of a RAID0 array, but at the cost of reduced write performance.

When 'offset' replicas are chosen, the multiple copies of a given
chunk are laid out on consecutive drives and at consecutive offsets.
Effectively each stripe is duplicated and the copies are offset by one
device. This should give similar read characteristics to 'far' if a
suitably large chunk size is used, but without as much seeking for
writes.

It should be noted that the number of devices in a RAID10 array need
not be a multiple of the number of replicas of each data block; however,
there must be at least as many devices as replicas.

If, for example, an array is created with 5 devices and 2 replicas,
then space equivalent to 2.5 of the devices will be available, and
every block will be stored on two different devices.

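.PP
The following small C program is a simplified model (not code taken
from the md driver) of the 'near' layout with the 5-device, 2-replica
example above: copies of each chunk occupy consecutive slots, and
slots are filled across the devices row by row. It also shows that,
with an odd number of devices, the copies of some chunks straddle two
rows rather than sitting at the same offset on adjacent devices.
.PP
.nf
    #include <stdio.h>

    int main(void)
    {
        const int ndev = 5, copies = 2;   /* matches the example above */

        for (int chunk = 0; chunk < 5; chunk++) {
            printf("chunk %d:", chunk);
            for (int c = 0; c < copies; c++) {
                int slot = chunk * copies + c;   /* consecutive slots */
                printf("  device %d row %d", slot % ndev, slot / ndev);
            }
            puts("");
        }
        return 0;
    }
.fi
.PP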
Finally, it is possible to have an array with both 'near' and 'far'
copies. If an array is configured with 2 near copies and 2 far
copies, then there will be a total of 4 copies of each block, each on
a different drive. This is an artifact of the implementation and is
unlikely to be of real value.

.SS MULTIPATH

MULTIPATH is not really a RAID at all as there is only one real device
in a MULTIPATH md array. However there are multiple access points
(paths) to this device, and one of these paths might fail, so there
are some similarities.

A MULTIPATH array is composed of a number of logically different
devices, often fibre channel interfaces, that all refer to the same
real device. If one of these interfaces fails (e.g. due to cable
problems), the multipath driver will attempt to redirect requests to
another interface.

The MULTIPATH driver is not receiving any ongoing development and
should be considered a legacy driver. The device-mapper based
multipath drivers should be preferred for new installations.

.SS FAULTY
The FAULTY md module is provided for testing purposes. A faulty array
has exactly one component device and is normally assembled without a
superblock, so the md array created provides direct access to all of
the data in the component device.

The FAULTY module may be requested to simulate faults to allow testing
of other md levels or of filesystems. Faults can be chosen to trigger
on read requests or write requests, and can be transient (a subsequent
read/write at the address will probably succeed) or persistent
(subsequent read/write of the same address will fail). Further, read
faults can be "fixable", meaning that they persist until a write
request at the same address.

Fault types can be requested with a period. In this case, the fault
will recur repeatedly after the given number of requests of the
relevant type. For example, if persistent read faults have a period of
100, then every 100th read request would generate a fault, and the
faulty sector would be recorded so that subsequent reads on that
sector would also fail.

There is a limit to the number of faulty sectors that are remembered.
Faults generated after this limit is exhausted are treated as
transient.

The list of faulty sectors can be flushed, and the active list of
failure modes can be cleared.

.SS UNCLEAN SHUTDOWN

When changes are made to a RAID1, RAID4, RAID5, RAID6, or RAID10 array
there is a possibility of inconsistency for short periods of time as
each update requires at least two blocks to be written to different
devices, and these writes probably won't happen at exactly the same
time. Thus if a system with one of these arrays is shut down in the
middle of a write operation (e.g. due to power failure), the array may
not be consistent.

To handle this situation, the md driver marks an array as "dirty"
before writing any data to it, and marks it as "clean" when the array
is being disabled, e.g. at shutdown. If the md driver finds an array
to be dirty at startup, it proceeds to correct any possible
inconsistency. For RAID1, this involves copying the contents of the
first drive onto all other drives. For RAID4, RAID5 and RAID6 this
involves recalculating the parity for each stripe and making sure that
the parity block has the correct data. For RAID10 it involves copying
one of the replicas of each block onto all the others. This process,
known as "resynchronising" or "resync", is performed in the background.
The array can still be used, though possibly with reduced performance.

If a RAID4, RAID5 or RAID6 array is degraded (missing at least one
drive, two for RAID6) when it is restarted after an unclean shutdown, it cannot
recalculate parity, and so it is possible that data might be
undetectably corrupted. The 2.4 md driver
.B does not
alert the operator to this condition. The 2.6 md driver will fail to
start an array in this condition without manual intervention, though
this behaviour can be overridden by a kernel parameter.

.SS RECOVERY

If the md driver detects a write error on a device in a RAID1, RAID4,
RAID5, RAID6, or RAID10 array, it immediately disables that device
(marking it as faulty) and continues operation on the remaining
devices. If there are spare drives, the driver will start recreating
on one of the spare drives the data which was on that failed drive,
either by copying a working drive in a RAID1 configuration, or by
doing calculations with the parity block on RAID4, RAID5 or RAID6, or
by finding and copying originals for RAID10.

In kernels prior to about 2.6.15, a read error would cause the same
effect as a write error. In later kernels, a read error will instead
cause md to attempt a recovery by overwriting the bad block. i.e. it
will find the correct data from elsewhere, write it over the block
that failed, and then try to read it back again. If either the write
or the re-read fails, md will treat the error the same way that a write
error is treated, and will fail the whole device.

While this recovery process is happening, the md driver will monitor
accesses to the array and will slow down the rate of recovery if other
activity is happening, so that normal access to the array will not be
unduly affected. When no other activity is happening, the recovery
process proceeds at full speed. The actual speed targets for the two
different situations can be controlled by the
.B speed_limit_min
and
.B speed_limit_max
control files mentioned below.

.SS SCRUBBING AND MISMATCHES

As storage devices can develop bad blocks at any time, it is valuable
to regularly read all blocks on all devices in an array so as to catch
such bad blocks early. This process is called
.IR scrubbing .

md arrays can be scrubbed by writing either
.I check
or
.I repair
to the file
.I md/sync_action
in the
.I sysfs
directory for the device.

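.PP
As a minimal sketch of requesting a scrub programmatically (the array
name md0, the usual /sys mount point, and root privileges are all
assumptions of the example; running the shell command "echo check >
/sys/block/md0/md/sync_action" as root is equivalent):
.PP
.nf
    #include <stdio.h>

    /* Start a scrub of /dev/md0 by writing "check" to its sync_action
     * attribute. */
    int main(void)
    {
        FILE *f = fopen("/sys/block/md0/md/sync_action", "w");

        if (f == NULL) {
            perror("open sync_action");
            return 1;
        }
        fputs("check", f);   /* "repair" would also correct mismatches */
        fclose(f);
        return 0;
    }
.fi
.PP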
Requesting a scrub will cause
.I md
to read every block on every device in the array, and check that the
data is consistent. For RAID1 and RAID10, this means checking that the copies
are identical. For RAID4, RAID5, RAID6 this means checking that the
parity block is (or blocks are) correct.

If a read error is detected during this process, the normal read-error
handling causes correct data to be found from other devices and to be
written back to the faulty device. In many cases this will
effectively
.I fix
the bad block.

If all blocks read successfully but are found to not be consistent,
then this is regarded as a
.IR mismatch .

If
.I check
was used, then no action is taken to handle the mismatch; it is simply
recorded.
If
.I repair
was used, then a mismatch will be repaired in the same way that
.I resync
repairs arrays. For RAID5/RAID6 new parity blocks are written. For RAID1/RAID10,
all but one block are overwritten with the content of that one block.

A count of mismatches is recorded in the
.I sysfs
file
.IR md/mismatch_cnt .
This is set to zero when a
scrub starts and is incremented whenever a sector is
found that is a mismatch.
.I md
normally works in units much larger than a single sector and when it
finds a mismatch, it does not determine exactly how many actual sectors were
affected but simply adds the number of sectors in the IO unit that was
used. So a value of 128 could simply mean that a single 64KB check
found an error (128 x 512 bytes = 64KB).

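.PP
A companion sketch (again illustrative only, with md0 as an assumed
array name) reads the mismatch count after a check has completed and
converts it from 512-byte sectors to bytes:
.PP
.nf
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/block/md0/md/mismatch_cnt", "r");
        unsigned long long sectors;

        if (f == NULL || fscanf(f, "%llu", &sectors) != 1) {
            perror("mismatch_cnt");
            return 1;
        }
        fclose(f);
        printf("%llu sectors (%llu bytes) reported as mismatched",
               sectors, sectors * 512);
        puts("");
        return 0;
    }
.fi
.PP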
If an array is created by
.I mdadm
with
.I \-\-assume\-clean
then a subsequent check could be expected to find some mismatches.

On a truly clean RAID5 or RAID6 array, any mismatches should indicate
a hardware problem at some level - software issues should never cause
such a mismatch.

However on RAID1 and RAID10 it is possible for software issues to
cause a mismatch to be reported. This does not necessarily mean that
the data on the array is corrupted. It could simply be that the
system does not care what is stored on that part of the array - it is
unused space.

The most likely cause for an unexpected mismatch on RAID1 or RAID10
occurs if a swap partition or swap file is stored on the array.

When the swap subsystem wants to write a page of memory out, it flags
the page as 'clean' in the memory manager and requests the swap device
to write it out. It is quite possible that the memory will be
changed while the write-out is happening. In that case the 'clean'
flag will be found to be clear when the write completes and so the
swap subsystem will simply forget that the swapout had been attempted,
and will possibly choose a different page to write out.

If the swap device was on RAID1 (or RAID10), then the data is sent
from memory to a device twice (or more depending on the number of
devices in the array). Thus it is possible that the memory gets changed
between the times it is sent, so different data can be written to
the different devices in the array. This will be detected by
.I check
as a mismatch. However it does not reflect any corruption as the
block where this mismatch occurs is being treated by the swap system as
being empty, and the data will never be read from that block.

It is conceivable for a similar situation to occur on non-swap files,
though it is less likely.

Thus the
.I mismatch_cnt
value cannot be interpreted very reliably on RAID1 or RAID10,
especially when the device is used for swap.

.SS BITMAP WRITE-INTENT LOGGING

From Linux 2.6.13,
.I md
supports a bitmap-based write-intent log. If configured, the bitmap
is used to record which blocks of the array may be out of sync.
Before any write request is honoured, md will make sure that the
corresponding bit in the log is set. After a period of time with no
writes to an area of the array, the corresponding bit will be cleared.

This bitmap is used for two optimisations.

Firstly, after an unclean shutdown, the resync process will consult
the bitmap and only resync those blocks that correspond to bits in the
bitmap that are set. This can dramatically reduce resync time.

Secondly, when a drive fails and is removed from the array, md stops
clearing bits in the intent log. If that same drive is re-added to
the array, md will notice and will only recover the sections of the
drive that are covered by bits in the intent log that are set. This
can allow a device to be temporarily removed and reinserted without
causing an enormous recovery cost.

The intent log can be stored in a file on a separate device, or it can
be stored near the superblocks of an array which has superblocks.

It is possible to add an intent log to an active array, or remove an
intent log if one is present.

In 2.6.13, intent bitmaps are only supported with RAID1. Other levels
with redundancy are supported from 2.6.15.

.SS WRITE-BEHIND

From Linux 2.6.14,
.I md
supports WRITE-BEHIND on RAID1 arrays.

This allows certain devices in the array to be flagged as
.IR write-mostly .
MD will only read from such devices if there is no
other option.

If a write-intent bitmap is also provided, write requests to
write-mostly devices will be treated as write-behind requests and md
will not wait for those writes to complete before
reporting the write as complete to the filesystem.

This allows for a RAID1 with WRITE-BEHIND to be used to mirror data
over a slow link to a remote computer (providing the link isn't too
slow). The extra latency of the remote link will not slow down normal
operations, but the remote system will still have a reasonably
up-to-date copy of all data.

575 | ||
addc80c4 NB |
576 | .SS RESTRIPING |
577 | ||
578 | .IR Restriping , | |
579 | also known as | |
580 | .IR Reshaping , | |
581 | is the processes of re-arranging the data stored in each stripe into a | |
582 | new layout. This might involve changing the number of devices in the | |
93e790af | 583 | array (so the stripes are wider), changing the chunk size (so stripes |
addc80c4 | 584 | are deeper or shallower), or changing the arrangement of data and |
93e790af | 585 | parity (possibly changing the raid level, e.g. 1 to 5 or 5 to 6). |
addc80c4 | 586 | |
c64881d7 N |
587 | As of Linux 2.6.35, md can reshape a RAID4, RAID5, or RAID6 array to |
588 | have a different number of devices (more or fewer) and to have a | |
589 | different layout or chunk size. It can also convert between these | |
590 | different RAID levels. It can also convert between RAID0 and RAID10, | |
591 | and between RAID0 and RAID4 or RAID5. | |
592 | Other possibilities may follow in future kernels. | |
addc80c4 NB |
593 | |
594 | During any stripe process there is a 'critical section' during which | |
35cc5be4 | 595 | live data is being overwritten on disk. For the operation of |
addc80c4 NB |
596 | increasing the number of drives in a raid5, this critical section |
597 | covers the first few stripes (the number being the product of the old | |
598 | and new number of devices). After this critical section is passed, | |
599 | data is only written to areas of the array which no longer hold live | |
b3f1c093 | 600 | data \(em the live data has already been located away. |
addc80c4 | 601 | |
c64881d7 N |
602 | For a reshape which reduces the number of devices, the 'critical |
603 | section' is at the end of the reshape process. | |
604 | ||
addc80c4 NB |
605 | md is not able to ensure data preservation if there is a crash |
606 | (e.g. power failure) during the critical section. If md is asked to | |
607 | start an array which failed during a critical section of restriping, | |
608 | it will fail to start the array. | |
609 | ||
610 | To deal with this possibility, a user-space program must | |
611 | .IP \(bu 4 | |
612 | Disable writes to that section of the array (using the | |
613 | .B sysfs | |
614 | interface), | |
615 | .IP \(bu 4 | |
93e790af | 616 | take a copy of the data somewhere (i.e. make a backup), |
addc80c4 | 617 | .IP \(bu 4 |
93e790af | 618 | allow the process to continue and invalidate the backup and restore |
addc80c4 NB |
619 | write access once the critical section is passed, and |
620 | .IP \(bu 4 | |
93e790af | 621 | provide for restoring the critical data before restarting the array |
addc80c4 NB |
622 | after a system crash. |
623 | .PP | |
624 | ||
625 | .B mdadm | |
93e790af | 626 | versions from 2.4 do this for growing a RAID5 array. |
addc80c4 NB |
627 | |
628 | For operations that do not change the size of the array, like simply | |
629 | increasing chunk size, or converting RAID5 to RAID6 with one extra | |
93e790af SW |
630 | device, the entire process is the critical section. In this case, the |
631 | restripe will need to progress in stages, as a section is suspended, | |
c64881d7 | 632 | backed up, restriped, and released. |
addc80c4 NB |
633 | |
.SS SYSFS INTERFACE
Each block device appears as a directory in
.I sysfs
(which is usually mounted at
.BR /sys ).
For MD devices, this directory will contain a subdirectory called
.B md
which contains various files for providing access to information about
the array.

This interface is documented more fully in the file
.B Documentation/md.txt
which is distributed with the kernel sources. That file should be
consulted for full documentation. The following are just a selection
of attribute files that are available.

.TP
.B md/sync_speed_min
This value, if set, overrides the system-wide setting in
.B /proc/sys/dev/raid/speed_limit_min
for this array only.
Writing the value
.B "system"
to this file will cause the system-wide setting to have effect.

.TP
.B md/sync_speed_max
This is the partner of
.B md/sync_speed_min
and overrides
.B /proc/sys/dev/raid/speed_limit_max
described below.

.TP
.B md/sync_action
This can be used to monitor and control the resync/recovery process of
MD.
In particular, writing "check" here will cause the array to read all
data blocks and check that they are consistent (e.g. parity is correct,
or all mirror replicas are the same). Any discrepancies found are
.B NOT
corrected.

A count of problems found will be stored in
.BR md/mismatch_cnt .

Alternately, "repair" can be written which will cause the same check
to be performed, but any errors will be corrected.

Finally, "idle" can be written to stop the check/repair process.

.TP
.B md/stripe_cache_size
This is only available on RAID5 and RAID6. It records the size (in
pages per device) of the stripe cache which is used for synchronising
all write operations to the array and all read operations if the array
is degraded. The default is 256. Valid values are 17 to 32768.
Increasing this number can increase performance in some situations, at
some cost in system memory. Note that setting this value too high can
result in an "out of memory" condition for the system.

memory_consumed = system_page_size * nr_disks * stripe_cache_size

A worked example of this formula appears at the end of this section.

.TP
.B md/preread_bypass_threshold
This is only available on RAID5 and RAID6. This variable sets the
number of times MD will service a full-stripe-write before servicing a
stripe that requires some "prereading". For fairness this defaults to
1. Valid values are 0 to stripe_cache_size. Setting this to 0
maximizes sequential-write throughput at the cost of fairness to threads
doing small or random writes.

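.PP
The following stand-alone C example (illustrative only; the disk count
and cache size are made-up values) evaluates the memory_consumed
formula given above for
.BR md/stripe_cache_size .
.PP
.nf
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long page_size         = sysconf(_SC_PAGESIZE); /* system page size */
        long nr_disks          = 6;                     /* example array */
        long stripe_cache_size = 256;                   /* the default */

        long long memory_consumed =
            (long long)page_size * nr_disks * stripe_cache_size;

        printf("stripe cache would use about %lld bytes", memory_consumed);
        puts("");
        return 0;
    }
.fi
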
.SS KERNEL PARAMETERS

The md driver recognises several different kernel parameters.
.TP
.B raid=noautodetect
This will disable the normal detection of md arrays that happens at
boot time. If a drive is partitioned with MS-DOS style partitions,
then if any of the 4 main partitions has a partition type of 0xFD,
then that partition will normally be inspected to see if it is part of
an MD array, and if any full arrays are found, they are started. This
kernel parameter disables this behaviour.

.TP
.B raid=partitionable
.TP
.B raid=part
These are available in 2.6 and later kernels only. They indicate that
autodetected MD arrays should be created as partitionable arrays, with
a different major device number to the original non-partitionable md
arrays. The device number is listed as
.I mdp
in
.IR /proc/devices .

.TP
.B md_mod.start_ro=1
.TP
.B /sys/module/md_mod/parameters/start_ro
This tells md to start all arrays in read-only mode. This is a soft
read-only that will automatically switch to read-write on the first
write request. However until that write request, nothing is written
to any device by md, and in particular, no resync or recovery
operation is started.

.TP
.B md_mod.start_dirty_degraded=1
.TP
.B /sys/module/md_mod/parameters/start_dirty_degraded
As mentioned above, md will not normally start a RAID4, RAID5, or
RAID6 that is both dirty and degraded as this situation can imply
hidden data loss. This can be awkward if the root filesystem is
affected. Using this module parameter allows such arrays to be started
at boot time. It should be understood that there is a real (though
small) risk of data corruption in this situation.

.TP
.BI md= n , dev , dev ,...
.TP
.BI md=d n , dev , dev ,...
This tells the md driver to assemble
.B /dev/md n
from the listed devices. It is only necessary to start the device
holding the root filesystem this way. Other arrays are best started
once the system is booted.

In 2.6 kernels, the
.B d
immediately after the
.B =
indicates that a partitionable device (e.g.
.BR /dev/md/d0 )
should be created rather than the original non-partitionable device.

.TP
.BI md= n , l , c , i , dev...
This tells the md driver to assemble a legacy RAID0 or LINEAR array
without a superblock.
.I n
gives the md device number,
.I l
gives the level, 0 for RAID0 or -1 for LINEAR,
.I c
gives the chunk size as a base-2 logarithm offset by twelve, so 0
means 4K, 1 means 8K (see the example below).
.I i
is ignored (legacy support).

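.PP
As a small illustration of the chunk-size field described above (the
range of values shown is arbitrary), the size in bytes is simply 1
shifted left by the field value plus twelve:
.PP
.nf
    #include <stdio.h>

    int main(void)
    {
        for (int c = 0; c <= 4; c++) {
            printf("c=%d gives a %dK chunk", c, (1 << (c + 12)) / 1024);
            puts("");
        }
        return 0;
    }
.fi
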
.SH FILES
.TP
.B /proc/mdstat
Contains information about the status of currently running arrays.
.TP
.B /proc/sys/dev/raid/speed_limit_min
A readable and writable file that reflects the current "goal" rebuild
speed for times when non-rebuild activity is current on an array.
The speed is in Kibibytes per second, and is a per-device rate, not a
per-array rate (which means that an array with more disks will shuffle
more data for a given speed). The default is 1000.
(See the example at the end of this section for one way to adjust it.)

.TP
.B /proc/sys/dev/raid/speed_limit_max
A readable and writable file that reflects the current "goal" rebuild
speed for times when no non-rebuild activity is current on an array.
The default is 200,000.

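.PP
These rebuild speed limits are plain writable files. The following
sketch (illustrative only; the value 5000 is arbitrary, root
privileges are assumed, and the shell command "echo 5000 >
/proc/sys/dev/raid/speed_limit_min" run as root is equivalent)
raises the per-device minimum:
.PP
.nf
    #include <stdio.h>

    /* Raise the minimum per-device rebuild speed to 5000 KiB/s. */
    int main(void)
    {
        FILE *f = fopen("/proc/sys/dev/raid/speed_limit_min", "w");

        if (f == NULL) {
            perror("open speed_limit_min");
            return 1;
        }
        fputs("5000", f);
        fclose(f);
        return 0;
    }
.fi
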
.SH SEE ALSO
.BR mdadm (8),
.BR mkraid (8).