.TH MD 4
.SH NAME
md \- Multiple Device driver aka Linux Software RAID
.SH SYNOPSIS
.BI /dev/md n
.br
.BI /dev/md/ n
.SH DESCRIPTION
The
.B md
driver provides virtual devices that are created from one or more
independent underlying devices. This array of devices often contains
redundancy, and hence the acronym RAID which stands for a Redundant
Array of Independent Devices.
.PP
.B md
supports RAID levels
1 (mirroring),
4 (striped array with parity device),
5 (striped array with distributed parity information),
6 (striped array with distributed dual redundancy information), and
10 (striped and mirrored).
If some number of underlying devices fails while using one of these
levels, the array will continue to function; this number is one for
RAID levels 4 and 5, two for RAID level 6, and all but one (N-1) for
RAID level 1, and dependent on configuration for level 10.
.PP
.B md
also supports a number of pseudo RAID (non-redundant) configurations
including RAID0 (striped array), LINEAR (catenated array),
MULTIPATH (a set of different interfaces to the same device),
and FAULTY (a layer over a single device into which errors can be injected).

.SS MD SUPER BLOCK
Each device in an array may have a
.I superblock
which records information about the structure and state of the array.
This allows the array to be reliably re-assembled after a shutdown.

From Linux kernel version 2.6.10,
.B md
provides support for two different formats of this superblock, and
other formats can be added. Prior to this release, only one format was
supported.

The common format - known as version 0.90 - has
a superblock that is 4K long and is written into a 64K aligned block that
starts at least 64K and less than 128K from the end of the device
(i.e. to get the address of the superblock round the size of the
device down to a multiple of 64K and then subtract 64K).
The available size of each device is the amount of space before the
super block, so between 64K and 128K is lost when a device is
incorporated into an MD array.
This superblock stores multi-byte fields in a processor-dependent
manner, so arrays cannot easily be moved between computers with
different processors.

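As an illustration of the rule above for locating a version 0.90
superblock (the device size is a made-up example), consider a device
of 4,200,100K:

.nf
  4,200,100K rounded down to a multiple of 64K  =  4,200,064K
  4,200,064K minus 64K                          =  4,200,000K  (superblock offset)
  4,200,100K minus 4,200,000K                   =        100K  (space unavailable to the array)
.fi
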
The new format - known as version 1 - has a superblock that is
normally 1K long, but can be longer. It is normally stored between 8K
and 12K from the end of the device, on a 4K boundary, though
variations can be stored at the start of the device (version 1.1) or 4K from
the start of the device (version 1.2).
This superblock format stores multibyte data in a
processor-independent format and supports up to hundreds of
component devices (version 0.90 only supports 28).

The superblock contains, among other things:
.TP
LEVEL
The manner in which the devices are arranged into the array
(linear, raid0, raid1, raid4, raid5, raid10, multipath).
.TP
UUID
a 128 bit Universally Unique Identifier that identifies the array that
this device is part of.

.SS ARRAYS WITHOUT SUPERBLOCKS
While it is usually best to create arrays with superblocks so that
they can be assembled reliably, there are some circumstances where an
array without superblocks is preferred. These include:
.TP
LEGACY ARRAYS
Early versions of the
.B md
driver only supported Linear and Raid0 configurations and did not use
a superblock (which is less critical with these configurations).
While such arrays should be rebuilt with superblocks if possible,
.B md
continues to support them.
.TP
FAULTY
Being a largely transparent layer over a different device, the FAULTY
personality doesn't gain anything from having a superblock.
.TP
MULTIPATH
It is often possible to detect devices which are different paths to
the same storage directly rather than having a distinctive superblock
written to the device and searched for on all paths. In this case,
a MULTIPATH array with no superblock makes sense.
.TP
RAID1
In some configurations it might be desired to create a raid1
configuration that does not use a superblock, and to maintain the state of
the array elsewhere. While not encouraged for general use, it does
have special-purpose uses and is supported.

.SS LINEAR

A linear array simply catenates the available space on each
drive together to form one large virtual drive.

One advantage of this arrangement over the more common RAID0
arrangement is that the array may be reconfigured at a later time with
an extra drive and so the array is made bigger without disturbing the
data that is on the array. However this cannot be done on a live
array.

If a chunksize is given with a LINEAR array, the usable space on each
device is rounded down to a multiple of this chunksize.

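A LINEAR array is normally managed with
.BR mdadm (8);
for example (the device names are illustrative only):

.nf
  mdadm --create /dev/md0 --level=linear --raid-devices=2 /dev/sdb1 /dev/sdc1
.fi
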
.SS RAID0

A RAID0 array (which has zero redundancy) is also known as a
striped array.
A RAID0 array is configured at creation with a
.B "Chunk Size"
which must be a power of two, and at least 4 kibibytes.

The RAID0 driver assigns the first chunk of the array to the first
device, the second chunk to the second device, and so on until all
drives have been assigned one chunk. This collection of chunks forms
a
.BR stripe .
Further chunks are gathered into stripes in the same way, and are
assigned to the remaining space in the drives.

If devices in the array are not all the same size, then once the
smallest device has been exhausted, the RAID0 driver starts
collecting chunks into smaller stripes that only span the drives which
still have remaining space.


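As an illustration of this layout, with three equally sized devices
the chunks are distributed as follows:

.nf
              device 1    device 2    device 3
  stripe 0:   chunk 0     chunk 1     chunk 2
  stripe 1:   chunk 3     chunk 4     chunk 5
  stripe 2:   chunk 6     chunk 7     chunk 8
.fi

Once the smallest device is exhausted, later stripes simply span
fewer devices, as described above.
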
.SS RAID1

A RAID1 array is also known as a mirrored set (though mirrors tend to
provide reflected images, which RAID1 does not) or a plex.

Once initialised, each device in a RAID1 array contains exactly the
same data. Changes are written to all devices in parallel. Data is
read from any one device. The driver attempts to distribute read
requests across all devices to maximise performance.

All devices in a RAID1 array should be the same size. If they are
not, then only the amount of space available on the smallest device is
used. Any extra space on other devices is wasted.

.SS RAID4

A RAID4 array is like a RAID0 array with an extra device for storing
parity. This device is the last of the active devices in the
array. Unlike RAID0, RAID4 also requires that all stripes span all
drives, so extra space on devices that are larger than the smallest is
wasted.

When any block in a RAID4 array is modified the parity block for that
stripe (i.e. the block in the parity device at the same device offset
as the stripe) is also modified so that the parity block always
contains the "parity" for the whole stripe. i.e. its contents are
equivalent to the result of performing an exclusive-or operation
between all the data blocks in the stripe.

This allows the array to continue to function if one device fails.
The data that was on that device can be calculated as needed from the
parity block and the other data blocks.

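As a worked illustration (the byte values are arbitrary), consider a
stripe whose three data blocks start with the bytes 0x0F, 0x33 and 0x55:

.nf
  parity byte           =  0x0F xor 0x33 xor 0x55  =  0x69
  reconstructing 0x33   =  0x0F xor 0x55 xor 0x69  =  0x33
.fi
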
.SS RAID5

RAID5 is very similar to RAID4. The difference is that the parity
blocks for each stripe, instead of being on a single device, are
distributed across all devices. This allows more parallelism when
writing as two different block updates will quite possibly affect
parity blocks on different devices so there is less contention.

This also allows more parallelism when reading as read requests are
distributed over all the devices in the array instead of all but one.

.SS RAID6

RAID6 is similar to RAID5, but can handle the loss of any \fItwo\fP
devices without data loss. Accordingly, it requires N+2 drives to
store N drives worth of data.

The performance for RAID6 is slightly lower but comparable to RAID5 in
normal mode and single disk failure mode. It is very slow in dual
disk failure mode, however.

.SS RAID10

RAID10 provides a combination of RAID1 and RAID0, and is sometimes
known as RAID1+0. Every datablock is duplicated some number of times, and
the resulting collection of datablocks is distributed over multiple
drives.

When configuring a RAID10 array it is necessary to specify the number
of replicas of each data block that are required (this will normally
be 2) and whether the replicas should be 'near' or 'far'.

When 'near' replicas are chosen, the multiple copies of a given chunk
are laid out consecutively across the stripes of the array, so the two
copies of a datablock will likely be at the same offset on two
adjacent devices.

When 'far' replicas are chosen, the multiple copies of a given chunk
are laid out quite distant from each other. The first copy of all
data blocks will be striped across the early part of all drives in
RAID0 fashion, and then the next copy of all blocks will be striped
across a later section of all drives, always ensuring that all copies
of any given block are on different drives.

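The two layouts can be sketched as follows for a hypothetical array of
four devices with two replicas (the block letters are purely
illustrative, and the exact rotation of the 'far' copies is an
implementation detail):

.nf
  'near' layout              'far' layout
  dev1 dev2 dev3 dev4        dev1 dev2 dev3 dev4
   A    A    B    B           A    B    C    D
   C    C    D    D           E    F    G    H
   E    E    F    F           ... (first copy)
   ...                        D    A    B    C
                              H    E    F    G
                              ... (second copy)
.fi
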
The 'far' arrangement can give sequential read performance equal to
that of a RAID0 array, but at the cost of degraded write performance.

It should be noted that the number of devices in a RAID10 array need
not be a multiple of the number of replicas of each data block, though
there must be at least as many devices as replicas.

If, for example, an array is created with 5 devices and 2 replicas,
then space equivalent to 2.5 of the devices will be available, and
every block will be stored on two different devices.

Finally, it is possible to have an array with both 'near' and 'far'
copies. If an array is configured with 2 near copies and 2 far
copies, then there will be a total of 4 copies of each block, each on
a different drive. This is an artifact of the implementation and is
unlikely to be of real value.

.SS MULTIPATH

MULTIPATH is not really a RAID at all as there is only one real device
in a MULTIPATH md array. However there are multiple access points
(paths) to this device, and one of these paths might fail, so there
are some similarities.

A MULTIPATH array is composed of a number of logically different
devices, often fibre channel interfaces, that all refer to the same
real device. If one of these interfaces fails (e.g. due to cable
problems), the multipath driver will attempt to redirect requests to
another interface.

.SS FAULTY
The FAULTY md module is provided for testing purposes. A faulty array
has exactly one component device and is normally assembled without a
superblock, so the md array created provides direct access to all of
the data in the component device.

The FAULTY module may be requested to simulate faults to allow testing
of other md levels or of filesystems. Faults can be chosen to trigger
on read requests or write requests, and can be transient (a subsequent
read/write at the address will probably succeed) or persistent
(subsequent read/write of the same address will fail). Further, read
faults can be "fixable", meaning that they persist until a write
request at the same address.

Fault types can be requested with a period. In this case the fault
will recur repeatedly after the given number of requests of the
relevant type. For example, if persistent read faults have a period of
100, then every 100th read request would generate a fault, and the
faulty sector would be recorded so that subsequent reads on that
sector would also fail.

There is a limit to the number of faulty sectors that are remembered.
Faults generated after this limit is exhausted are treated as
transient.

The list of faulty sectors can be flushed, and the active list of
failure modes can be cleared.

.SS UNCLEAN SHUTDOWN

When changes are made to a RAID1, RAID4, RAID5, RAID6, or RAID10 array
there is a possibility of inconsistency for short periods of time as
each update requires at least two blocks to be written to different
devices, and these writes probably won't happen at exactly the same
time. Thus if a system with one of these arrays is shut down in the
middle of a write operation (e.g. due to power failure), the array may
not be consistent.

To handle this situation, the md driver marks an array as "dirty"
before writing any data to it, and marks it as "clean" when the array
is being disabled, e.g. at shutdown. If the md driver finds an array
to be dirty at startup, it proceeds to correct any possible
inconsistency. For RAID1, this involves copying the contents of the
first drive onto all other drives. For RAID4, RAID5 and RAID6 this
involves recalculating the parity for each stripe and making sure that
the parity block has the correct data. For RAID10 it involves copying
one of the replicas of each block onto all the others. This process,
known as "resynchronising" or "resync", is performed in the background.
The array can still be used, though possibly with reduced performance.

If a RAID4, RAID5 or RAID6 array is degraded (missing at least one
drive) when it is restarted after an unclean shutdown, it cannot
recalculate parity, and so it is possible that data might be
undetectably corrupted. The 2.4 md driver
.B does not
alert the operator to this condition. The 2.6 md driver will fail to
start an array in this condition without manual intervention, though
this behaviour can be over-ridden by a kernel parameter.

.SS RECOVERY

If the md driver detects a write error on a device in a RAID1, RAID4,
RAID5, RAID6, or RAID10 array, it immediately disables that device
(marking it as faulty) and continues operation on the remaining
devices. If there is a spare drive, the driver will start recreating
on one of the spare drives the data that was on that failed drive,
either by copying a working drive in a RAID1 configuration, or by
doing calculations with the parity block on RAID4, RAID5 or RAID6, or
by finding and copying originals for RAID10.

In kernels prior to about 2.6.15, a read error would cause the same
effect as a write error. In later kernels, a read-error will instead
cause md to attempt a recovery by overwriting the bad block. i.e. it
will find the correct data from elsewhere, write it over the block
that failed, and then try to read it back again. If either the write
or the re-read fails, md will treat the error the same way that a write
error is treated and will fail the whole device.

While this recovery process is happening, the md driver will monitor
accesses to the array and will slow down the rate of recovery if other
activity is happening, so that normal access to the array will not be
unduly affected. When no other activity is happening, the recovery
process proceeds at full speed. The actual speed targets for the two
different situations can be controlled by the
.B speed_limit_min
and
.B speed_limit_max
control files mentioned below.

.SS BITMAP WRITE-INTENT LOGGING

From Linux 2.6.13,
.I md
supports a bitmap based write-intent log. If configured, the bitmap
is used to record which blocks of the array may be out of sync.
Before any write request is honoured, md will make sure that the
corresponding bit in the log is set. After a period of time with no
writes to an area of the array, the corresponding bit will be cleared.

This bitmap is used for two optimisations.

Firstly, after an unclean shutdown, the resync process will consult
the bitmap and only resync those blocks that correspond to bits in the
bitmap that are set. This can dramatically reduce resync time.

Secondly, when a drive fails and is removed from the array, md stops
clearing bits in the intent log. If that same drive is re-added to
the array, md will notice and will only recover the sections of the
drive that are covered by bits in the intent log that are set. This
can allow a device to be temporarily removed and reinserted without
causing an enormous recovery cost.

The intent log can be stored in a file on a separate device, or it can
be stored near the superblocks of an array which has superblocks.

It is possible to add an intent log to an active array, or remove an
intent log if one is present.

In 2.6.13, intent bitmaps are only supported with RAID1. Other levels
with redundancy are supported from 2.6.15.

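With
.BR mdadm (8),
for example, an internal bitmap can typically be added to or removed
from an active array as follows (the array name is illustrative):

.nf
  mdadm --grow /dev/md0 --bitmap=internal
  mdadm --grow /dev/md0 --bitmap=none
.fi
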
.SS WRITE-BEHIND

From Linux 2.6.14,
.I md
supports WRITE-BEHIND on RAID1 arrays.

This allows certain devices in the array to be flagged as
.IR write-mostly .
MD will only read from such devices if there is no
other option.

If a write-intent bitmap is also provided, write requests to
write-mostly devices will be treated as write-behind requests and md
will not wait for writes to those requests to complete before
reporting the write as complete to the filesystem.

This allows for a RAID1 with WRITE-BEHIND to be used to mirror data
over a slow link to a remote computer (providing the link isn't too
slow). The extra latency of the remote link will not slow down normal
operations, but the remote system will still have a reasonably
up-to-date copy of all data.

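With
.BR mdadm (8)
such a mirror might be set up roughly as follows (the device names,
the internal bitmap and the write-behind limit of 256 outstanding
writes are all illustrative choices):

.nf
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \e
        --bitmap=internal --write-behind=256 \e
        /dev/sda1 --write-mostly /dev/nbd0
.fi
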
.SS RESTRIPING

.IR Restriping ,
also known as
.IR Reshaping ,
is the process of re-arranging the data stored in each stripe into a
new layout. This might involve changing the number of devices in the
array (so the stripes are wider), changing the chunk size (so stripes
are deeper or shallower), or changing the arrangement of data and
parity, possibly changing the raid level (e.g. 1 to 5 or 5 to 6).

As of Linux 2.6.17, md can reshape a raid5 array to have more
devices. Other possibilities may follow in future kernels.

During any restripe there is a 'critical section' during which
live data is being over-written on disk. For the operation of
increasing the number of drives in a raid5, this critical section
covers the first few stripes (the number being the product of the old
and new number of devices). After this critical section is passed,
data is only written to areas of the array which no longer hold live
data - the live data has already been relocated away.

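As a made-up example, when growing a raid5 array from 3 to 4 devices:

.nf
  critical section  =  3 (old devices) x 4 (new devices)  =  12 stripes
.fi
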
md is not able to ensure data preservation if there is a crash
(e.g. power failure) during the critical section. If md is asked to
start an array which failed during a critical section of restriping,
it will fail to start the array.

To deal with this possibility, a user-space program must
.IP \(bu 4
Disable writes to that section of the array (using the
.B sysfs
interface),
.IP \(bu 4
Take a copy of the data somewhere (i.e. make a backup)
.IP \(bu 4
Allow the process to continue and invalidate the backup and restore
write access once the critical section is passed, and
.IP \(bu 4
Provide for restoring the critical data before restarting the array
after a system crash.
.PP

.B mdadm
version 2.4 and later will do this for growing a RAID5 array.

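Such a reshape might be requested roughly as follows (the device name,
device count and backup file location are illustrative only):

.nf
  mdadm --grow /dev/md0 --raid-devices=4 --backup-file=/root/md0-grow.backup
.fi
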
For operations that do not change the size of the array, like simply
increasing chunk size, or converting RAID5 to RAID6 with one extra
device, the entire process is the critical section. In this case the
restripe will need to progress in stages as a section is suspended,
backed up, restriped, and released. This is not yet implemented.

.SS SYSFS INTERFACE
All block devices appear as a directory in
.I sysfs
(usually mounted at
.BR /sys ).
For MD devices, this directory will contain a subdirectory called
.B md
which contains various files for providing access to information about
the array.

This interface is documented more fully in the file
.B Documentation/md.txt
which is distributed with the kernel sources. That file should be
consulted for full documentation. The following are just a selection
of attribute files that are available.

.TP
.B md/sync_speed_min
This value, if set, overrides the system-wide setting in
.B /proc/sys/dev/raid/speed_limit_min
for this array only.
Writing the value
.B system
to this file causes the system-wide setting to have effect.

.TP
.B md/sync_speed_max
This is the partner of
.B md/sync_speed_min
and overrides
.B /proc/sys/dev/raid/speed_limit_max
described below.

.TP
.B md/sync_action
This can be used to monitor and control the resync/recovery process of
MD.
In particular, writing "check" here will cause the array to read all
data blocks and check that they are consistent (e.g. parity is correct,
or all mirror replicas are the same). Any discrepancies found are
.B NOT
corrected.

A count of problems found will be stored in
.BR md/mismatch_cnt .

Alternately, "repair" can be written which will cause the same check
to be performed, but any errors will be corrected.

Finally, "idle" can be written to stop the check/repair process.

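For example, a consistency check of a hypothetical array /dev/md0
might be started, inspected and stopped like this:

.nf
  echo check > /sys/block/md0/md/sync_action
  cat /sys/block/md0/md/mismatch_cnt
  echo idle > /sys/block/md0/md/sync_action
.fi
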
.TP
.B md/stripe_cache_size
This is only available on RAID5 and RAID6. It records the size (in
pages per device) of the stripe cache which is used for synchronising
all read and write operations to the array. The default is 128.
Increasing this number can increase performance in some situations, at
some cost in system memory.

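For example, the cache of a hypothetical array /dev/md0 might be
enlarged like this (the value 256 is only an illustration):

.nf
  echo 256 > /sys/block/md0/md/stripe_cache_size
.fi
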

.SS KERNEL PARAMETERS

The md driver recognises several different kernel parameters.
.TP
.B raid=noautodetect
This will disable the normal detection of md arrays that happens at
boot time. If a drive is partitioned with MS-DOS style partitions,
then if any of the 4 main partitions has a partition type of 0xFD,
then that partition will normally be inspected to see if it is part of
an MD array, and if any full arrays are found, they are started. This
kernel parameter disables this behaviour.

.TP
.B raid=partitionable
.TP
.B raid=part
These are available in 2.6 and later kernels only. They indicate that
autodetected MD arrays should be created as partitionable arrays, with
a different major device number to the original non-partitionable md
arrays. The device number is listed as
.I mdp
in
.IR /proc/devices .

.TP
.B md_mod.start_ro=1
This tells md to start all arrays in read-only mode. This is a soft
read-only that will automatically switch to read-write on the first
write request. However until that write request, nothing is written
to any device by md, and in particular, no resync or recovery
operation is started.

.TP
.B md_mod.start_dirty_degraded=1
As mentioned above, md will not normally start a RAID4, RAID5, or
RAID6 that is both dirty and degraded as this situation can imply
hidden data loss. This can be awkward if the root filesystem is
affected. Using the module parameter allows such arrays to be started
at boot time. It should be understood that there is a real (though
small) risk of data corruption in this situation.

.TP
.BI md= n , dev , dev ,...
.TP
.BI md=d n , dev , dev ,...
This tells the md driver to assemble
.B /dev/md n
from the listed devices. It is only necessary to start the device
holding the root filesystem this way. Other arrays are best started
once the system is booted.

In 2.6 kernels, the
.B d
immediately after the
.B =
indicates that a partitionable device (e.g.
.BR /dev/md/d0 )
should be created rather than the original non-partitionable device.

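For example, either of the following kernel command line fragments
would assemble an array from two illustrative component devices, as a
non-partitionable or a partitionable device respectively:

.nf
  md=0,/dev/sda1,/dev/sdb1
  md=d0,/dev/sda1,/dev/sdb1
.fi
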
.TP
.BI md= n , l , c , i , dev...
This tells the md driver to assemble a legacy RAID0 or LINEAR array
without a superblock.
.I n
gives the md device number,
.I l
gives the level, 0 for RAID0 or -1 for LINEAR,
.I c
gives the chunk size as a base-2 logarithm offset by twelve, so 0
means 4K, 1 means 8K.
.I i
is ignored (legacy support).

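As an illustration, a chunk-size field of 4 selects 2^(4+12) bytes,
i.e. 64K chunks, so a legacy two-device RAID0 with 64K chunks could be
assembled with (device names are illustrative):

.nf
  md=0,0,4,0,/dev/sda1,/dev/sdb1
.fi
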
.SH FILES
.TP
.B /proc/mdstat
Contains information about the status of currently running arrays.
.TP
.B /proc/sys/dev/raid/speed_limit_min
A readable and writable file that reflects the current goal rebuild
speed for times when non-rebuild activity is current on an array.
The speed is in Kibibytes per second, and is a per-device rate, not a
per-array rate (which means that an array with more discs will shuffle
more data for a given speed). The default is 100.

.TP
.B /proc/sys/dev/raid/speed_limit_max
A readable and writable file that reflects the current goal rebuild
speed for times when no non-rebuild activity is current on an array.
The default is 100,000.

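For example, the rebuild ceiling could be lowered temporarily like
this (the value is illustrative only):

.nf
  echo 20000 > /proc/sys/dev/raid/speed_limit_max
.fi
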
.SH SEE ALSO
.BR mdadm (8),
.BR mkraid (8).