.TH MD 4
.SH NAME
md \- Multiple Device driver aka Linux Software Raid
.SH SYNOPSIS
.BI /dev/md n
.br
.BI /dev/md/ n
.SH DESCRIPTION
The
.B md
driver provides virtual devices that are created from one or more
independent underlying devices.  This array of devices often contains
redundancy, and hence the acronym RAID which stands for a Redundant
Array of Independent Devices.
.PP
.B md
supports RAID levels
1 (mirroring),
4 (striped array with parity device),
5 (striped array with distributed parity information),
6 (striped array with distributed dual redundancy information), and
10 (striped and mirrored).
If some number of underlying devices fails while using one of these
levels, the array will continue to function; this number is one for
RAID levels 4 and 5, two for RAID level 6, and all but one (N-1) for
RAID level 1, and dependent on configuration for level 10.
.PP
.B md
also supports a number of pseudo RAID (non-redundant) configurations
including RAID0 (striped array), LINEAR (catenated array),
MULTIPATH (a set of different interfaces to the same device),
and FAULTY (a layer over a single device into which errors can be injected).

.SS MD SUPER BLOCK
Each device in an array may have a
.I superblock
which records information about the structure and state of the array.
This allows the array to be reliably re-assembled after a shutdown.

From Linux kernel version 2.6.10,
.B md
provides support for two different formats of this superblock, and
other formats can be added.  Prior to this release, only one format was
supported.

The common format - known as version 0.90 - has
a superblock that is 4K long and is written into a 64K aligned block that
starts at least 64K and less than 128K from the end of the device
(i.e. to get the address of the superblock round the size of the
device down to a multiple of 64K and then subtract 64K).
The available size of each device is the amount of space before the
super block, so between 64K and 128K is lost when a device is
incorporated into an MD array.
This superblock stores multi-byte fields in a processor-dependent
manner, so arrays cannot easily be moved between computers with
different processors.

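For example, on a hypothetical device of 40131002 KiB, the offset of a
version 0.90 superblock works out as follows (a sketch in shell; the
device size is made up):
.RS 4
.nf
size_kib=40131002
sb_offset_kib=$(( size_kib / 64 * 64 - 64 ))  # round to 64K, minus 64K
echo "$sb_offset_kib"                         # prints 40130880
.fi
.RE
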
The new format - known as version 1 - has a superblock that is
normally 1K long, but can be longer.  It is normally stored between 8K
and 12K from the end of the device, on a 4K boundary, though
variations can be stored at the start of the device (version 1.1) or 4K from
the start of the device (version 1.2).
This superblock format stores multibyte data in a
processor-independent format and supports up to hundreds of
component devices (version 0.90 only supports 28).
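
The superblock format is selected when an array is created.  For
example, the following sketch (device names are hypothetical) uses the
mdadm --metadata option to create a two-device mirror with a version
1.2 superblock:
.RS 4
.nf
mdadm --create /dev/md0 --metadata=1.2 --level=1 --raid-devices=2 \e
      /dev/sda1 /dev/sdb1
.fi
.RE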

The superblock contains, among other things:
.TP
LEVEL
The manner in which the devices are arranged into the array
(linear, raid0, raid1, raid4, raid5, raid10, multipath).
.TP
UUID
a 128 bit Universally Unique Identifier that identifies the array that
this device is part of.

.SS ARRAYS WITHOUT SUPERBLOCKS
While it is usually best to create arrays with superblocks so that
they can be assembled reliably, there are some circumstances where an
array without superblocks is preferred.  These include:
.TP
LEGACY ARRAYS
Early versions of the
.B md
driver only supported Linear and Raid0 configurations and did not use
a superblock (which is less critical with these configurations).
While such arrays should be rebuilt with superblocks if possible,
.B md
continues to support them.
.TP
FAULTY
Being a largely transparent layer over a different device, the FAULTY
personality doesn't gain anything from having a superblock.
.TP
MULTIPATH
It is often possible to detect devices which are different paths to
the same storage directly rather than having a distinctive superblock
written to the device and searched for on all paths.  In this case,
a MULTIPATH array with no superblock makes sense.
.TP
RAID1
In some configurations it might be desired to create a raid1
configuration that does not use a superblock, and to maintain the state of
the array elsewhere.  While not encouraged for general use, it does
have special-purpose uses and is supported.

.SS LINEAR

A linear array simply catenates the available space on each
drive together to form one large virtual drive.

One advantage of this arrangement over the more common RAID0
arrangement is that the array may be reconfigured at a later time with
an extra drive and so the array is made bigger without disturbing the
data that is on the array.  However this cannot be done on a live
array.

If a chunksize is given with a LINEAR array, the usable space on each
device is rounded down to a multiple of this chunksize.

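A LINEAR array might, for example, be created like this (a sketch;
device names are hypothetical):
.RS 4
.nf
mdadm --create /dev/md0 --level=linear --raid-devices=2 \e
      /dev/sdb1 /dev/sdc1
.fi
.RE
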
.SS RAID0

A RAID0 array (which has zero redundancy) is also known as a
striped array.
A RAID0 array is configured at creation with a
.B "Chunk Size"
which must be a power of two, and at least 4 kibibytes.

The RAID0 driver assigns the first chunk of the array to the first
device, the second chunk to the second device, and so on until all
drives have been assigned one chunk.  This collection of chunks forms
a
.BR stripe .
Further chunks are gathered into stripes in the same way, and are
assigned to the remaining space in the drives.

If devices in the array are not all the same size, then once the
smallest device has been exhausted, the RAID0 driver starts
collecting chunks into smaller stripes that only span the drives which
still have remaining space.

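For example, a three-device RAID0 array with a 64 kibibyte chunk size
could be created like this (a sketch; device names are hypothetical):
.RS 4
.nf
mdadm --create /dev/md0 --level=0 --chunk=64 --raid-devices=3 \e
      /dev/sdb1 /dev/sdc1 /dev/sdd1
.fi
.RE
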
.SS RAID1

A RAID1 array is also known as a mirrored set (though mirrors tend to
provide reflected images, which RAID1 does not) or a plex.

Once initialised, each device in a RAID1 array contains exactly the
same data.  Changes are written to all devices in parallel.  Data is
read from any one device.  The driver attempts to distribute read
requests across all devices to maximise performance.

All devices in a RAID1 array should be the same size.  If they are
not, then only the amount of space available on the smallest device is
used.  Any extra space on other devices is wasted.

.SS RAID4

A RAID4 array is like a RAID0 array with an extra device for storing
parity.  This device is the last of the active devices in the
array.  Unlike RAID0, RAID4 also requires that all stripes span all
drives, so extra space on devices that are larger than the smallest is
wasted.

When any block in a RAID4 array is modified the parity block for that
stripe (i.e. the block in the parity device at the same device offset
as the stripe) is also modified so that the parity block always
contains the "parity" for the whole stripe.  i.e. its contents are
equivalent to the result of performing an exclusive-or operation
between all the data blocks in the stripe.

This allows the array to continue to function if one device fails.
The data that was on that device can be calculated as needed from the
parity block and the other data blocks.

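As a small illustration of that exclusive-or relationship (the byte
values below are made up), the parity of a set of data values, and the
recovery of a lost value, can be computed in shell arithmetic:
.RS 4
.nf
d1=0x3a; d2=0xc5; d3=0x0f
parity=$(( d1 ^ d2 ^ d3 ))          # 240, i.e. 0xf0
recovered=$(( d1 ^ d3 ^ parity ))   # 197, i.e. 0xc5, the lost d2
.fi
.RE
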
.SS RAID5

RAID5 is very similar to RAID4.  The difference is that the parity
blocks for each stripe, instead of being on a single device, are
distributed across all devices.  This allows more parallelism when
writing, as two different block updates will quite possibly affect
parity blocks on different devices, so there is less contention.

This also allows more parallelism when reading as read requests are
distributed over all the devices in the array instead of all but one.

.SS RAID6

RAID6 is similar to RAID5, but can handle the loss of any \fItwo\fP
devices without data loss.  Accordingly, it requires N+2 drives to
store N drives worth of data.

The performance for RAID6 is slightly lower but comparable to RAID5 in
normal mode and single disk failure mode.  It is very slow in dual
disk failure mode, however.

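For example, a RAID6 array over four devices (names are hypothetical)
provides roughly two devices' worth of usable space:
.RS 4
.nf
mdadm --create /dev/md0 --level=6 --raid-devices=4 \e
      /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
.fi
.RE
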
.SS RAID10

RAID10 provides a combination of RAID1 and RAID0, and is sometimes known
as RAID1+0.  Every datablock is duplicated some number of times, and
the resulting collection of datablocks is distributed over multiple
drives.

When configuring a RAID10 array it is necessary to specify the number
of replicas of each data block that are required (this will normally
be 2) and whether the replicas should be 'near', 'offset' or 'far'.
(Note that the 'offset' layout is only available from 2.6.18).

When 'near' replicas are chosen, the multiple copies of a given chunk
are laid out consecutively across the stripes of the array, so the two
copies of a datablock will likely be at the same offset on two
adjacent devices.

When 'far' replicas are chosen, the multiple copies of a given chunk
are laid out quite distant from each other.  The first copy of all
data blocks will be striped across the early part of all drives in
RAID0 fashion, and then the next copy of all blocks will be striped
across a later section of all drives, always ensuring that all copies
of any given block are on different drives.

The 'far' arrangement can give sequential read performance equal to
that of a RAID0 array, but at the cost of degraded write performance.

When 'offset' replicas are chosen, the multiple copies of a given
chunk are laid out on consecutive drives and at consecutive offsets.
Effectively each stripe is duplicated and the copies are offset by one
device.  This should give similar read characteristics to 'far' if a
suitably large chunk size is used, but without as much seeking for
writes.

It should be noted that the number of devices in a RAID10 array need
not be a multiple of the number of replicas of each data block, though
there must be at least as many devices as replicas.

If, for example, an array is created with 5 devices and 2 replicas,
then space equivalent to 2.5 of the devices will be available, and
every block will be stored on two different devices.

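Such an array could, for example, be created with mdadm, whose
--layout values n2, o2 and f2 request two 'near', 'offset' or 'far'
replicas respectively (a sketch; device names are hypothetical):
.RS 4
.nf
mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=5 \e
      /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
.fi
.RE
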
Finally, it is possible to have an array with both 'near' and 'far'
copies.  If an array is configured with 2 near copies and 2 far
copies, then there will be a total of 4 copies of each block, each on
a different drive.  This is an artifact of the implementation and is
unlikely to be of real value.

.SS MULTIPATH

MULTIPATH is not really a RAID at all as there is only one real device
in a MULTIPATH md array.  However there are multiple access points
(paths) to this device, and one of these paths might fail, so there
are some similarities.

A MULTIPATH array is composed of a number of logically different
devices, often fibre channel interfaces, that all refer to the same
real device. If one of these interfaces fails (e.g. due to cable
problems), the multipath driver will attempt to redirect requests to
another interface.

.SS FAULTY
The FAULTY md module is provided for testing purposes.  A faulty array
has exactly one component device and is normally assembled without a
superblock, so the md array created provides direct access to all of
the data in the component device.

The FAULTY module may be requested to simulate faults to allow testing
of other md levels or of filesystems.  Faults can be chosen to trigger
on read requests or write requests, and can be transient (a subsequent
read/write at the address will probably succeed) or persistent
(subsequent read/write of the same address will fail).  Further, read
faults can be "fixable" meaning that they persist until a write
request at the same address.

Fault types can be requested with a period.  In this case the fault
will recur repeatedly after the given number of requests of the
relevant type.  For example if persistent read faults have a period of
100, then every 100th read request would generate a fault, and the
faulty sector would be recorded so that subsequent reads on that
sector would also fail.

There is a limit to the number of faulty sectors that are remembered.
Faults generated after this limit is exhausted are treated as
transient.

The list of faulty sectors can be flushed, and the active list of
failure modes can be cleared.

.SS UNCLEAN SHUTDOWN

When changes are made to a RAID1, RAID4, RAID5, RAID6, or RAID10 array
there is a possibility of inconsistency for short periods of time as
each update requires at least two blocks to be written to different
devices, and these writes probably won't happen at exactly the same
time.  Thus if a system with one of these arrays is shut down in the
middle of a write operation (e.g. due to power failure), the array may
not be consistent.

To handle this situation, the md driver marks an array as "dirty"
before writing any data to it, and marks it as "clean" when the array
is being disabled, e.g. at shutdown.  If the md driver finds an array
to be dirty at startup, it proceeds to correct any possible
inconsistency.  For RAID1, this involves copying the contents of the
first drive onto all other drives.  For RAID4, RAID5 and RAID6 this
involves recalculating the parity for each stripe and making sure that
the parity block has the correct data.  For RAID10 it involves copying
one of the replicas of each block onto all the others.  This process,
known as "resynchronising" or "resync", is performed in the background.
The array can still be used, though possibly with reduced performance.

If a RAID4, RAID5 or RAID6 array is degraded (missing at least one
drive) when it is restarted after an unclean shutdown, it cannot
recalculate parity, and so it is possible that data might be
undetectably corrupted.  The 2.4 md driver
.B does not
alert the operator to this condition.  The 2.6 md driver will fail to
start an array in this condition without manual intervention, though
this behaviour can be overridden by a kernel parameter.

.SS RECOVERY

If the md driver detects a write error on a device in a RAID1, RAID4,
RAID5, RAID6, or RAID10 array, it immediately disables that device
(marking it as faulty) and continues operation on the remaining
devices.  If there is a spare drive, the driver will start recreating
on one of the spare drives the data that was on that failed drive,
either by copying a working drive in a RAID1 configuration, or by
doing calculations with the parity block on RAID4, RAID5 or RAID6, or
by finding and copying originals for RAID10.

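For example, a failed device might be replaced by hand like this (a
sketch; device names are hypothetical):
.RS 4
.nf
mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
mdadm /dev/md0 --add /dev/sdd1    # recovery onto the new disk starts
cat /proc/mdstat                  # shows recovery progress
.fi
.RE
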
In kernels prior to about 2.6.15, a read error would cause the same
effect as a write error.  In later kernels, a read-error will instead
cause md to attempt a recovery by overwriting the bad block. i.e. it
will find the correct data from elsewhere, write it over the block
that failed, and then try to read it back again.  If either the write
or the re-read fails, md will treat the error the same way that a write
error is treated and will fail the whole device.

While this recovery process is happening, the md driver will monitor
accesses to the array and will slow down the rate of recovery if other
activity is happening, so that normal access to the array will not be
unduly affected.  When no other activity is happening, the recovery
process proceeds at full speed.  The actual speed targets for the two
different situations can be controlled by the
.B speed_limit_min
and
.B speed_limit_max
control files mentioned below.

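For example, the rebuild rate targets could be raised for the duration
of a recovery like this (a sketch; the values are in KiB/s per device
and are arbitrary):
.RS 4
.nf
echo 5000   > /proc/sys/dev/raid/speed_limit_min
echo 200000 > /proc/sys/dev/raid/speed_limit_max
.fi
.RE
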
.SS BITMAP WRITE-INTENT LOGGING

From Linux 2.6.13,
.I md
supports a bitmap based write-intent log.  If configured, the bitmap
is used to record which blocks of the array may be out of sync.
Before any write request is honoured, md will make sure that the
corresponding bit in the log is set.  After a period of time with no
writes to an area of the array, the corresponding bit will be cleared.

This bitmap is used for two optimisations.

Firstly, after an unclean shutdown, the resync process will consult
the bitmap and only resync those blocks that correspond to bits in the
bitmap that are set.  This can dramatically reduce resync time.

Secondly, when a drive fails and is removed from the array, md stops
clearing bits in the intent log.  If that same drive is re-added to
the array, md will notice and will only recover the sections of the
drive that are covered by bits in the intent log that are set.  This
can allow a device to be temporarily removed and reinserted without
causing an enormous recovery cost.

The intent log can be stored in a file on a separate device, or it can
be stored near the superblocks of an array which has superblocks.

It is possible to add an intent log to an active array, or remove an
intent log if one is present.

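For example, an internal bitmap can be added to, or removed from, an
existing array with mdadm's grow mode (a sketch; the device name is
hypothetical):
.RS 4
.nf
mdadm --grow /dev/md0 --bitmap=internal   # add a write-intent bitmap
mdadm --grow /dev/md0 --bitmap=none       # remove it again
.fi
.RE
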
In 2.6.13, intent bitmaps are only supported with RAID1.  Other levels
with redundancy are supported from 2.6.15.

.SS WRITE-BEHIND

From Linux 2.6.14,
.I md
supports WRITE-BEHIND on RAID1 arrays.

This allows certain devices in the array to be flagged as
.IR write-mostly .
MD will only read from such devices if there is no
other option.

If a write-intent bitmap is also provided, write requests to
write-mostly devices will be treated as write-behind requests and md
will not wait for writes to those requests to complete before
reporting the write as complete to the filesystem.

This allows for a RAID1 with WRITE-BEHIND to be used to mirror data
over a slow link to a remote computer (providing the link isn't too
slow).  The extra latency of the remote link will not slow down normal
operations, but the remote system will still have a reasonably
up-to-date copy of all data.

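For example, a mirror whose second half sits behind a slow link might
be created like this (a sketch; device names and the write-behind
limit are hypothetical, and the bitmap is required for write-behind):
.RS 4
.nf
mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal \e
      --write-behind=256 /dev/sdb1 --write-mostly /dev/sdc1
.fi
.RE
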
.SS RESTRIPING

.IR Restriping ,
also known as
.IR Reshaping ,
is the process of re-arranging the data stored in each stripe into a
new layout.  This might involve changing the number of devices in the
array (so the stripes are wider), changing the chunk size (so stripes
are deeper or shallower), or changing the arrangement of data and
parity, possibly changing the raid level (e.g. 1 to 5 or 5 to 6).

As of Linux 2.6.17, md can reshape a raid5 array to have more
devices.  Other possibilities may follow in future kernels.

During any restripe process there is a 'critical section' during which
live data is being over-written on disk.  For the operation of
increasing the number of drives in a raid5, this critical section
covers the first few stripes (the number being the product of the old
and new number of devices).  After this critical section is passed,
data is only written to areas of the array which no longer hold live
data - the live data has already been relocated away.

md is not able to ensure data preservation if there is a crash
(e.g. power failure) during the critical section.  If md is asked to
start an array which failed during a critical section of restriping,
it will fail to start the array.

To deal with this possibility, a user-space program must
.IP \(bu 4
Disable writes to that section of the array (using the
.B sysfs
interface),
.IP \(bu 4
Take a copy of the data somewhere (i.e. make a backup)
.IP \(bu 4
Allow the process to continue and invalidate the backup and restore
write access once the critical section is passed, and
.IP \(bu 4
Provide for restoring the critical data before restarting the array
after a system crash.
.PP

.B mdadm
version 2.4 and later will do this for growing a RAID5 array.

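For example, a 4-device RAID5 array might be grown onto a fifth disk
like this (a sketch; device names and the backup file path are
hypothetical, and the backup file holds a copy of the critical
section):
.RS 4
.nf
mdadm /dev/md0 --add /dev/sdf1
mdadm --grow /dev/md0 --raid-devices=5 --backup-file=/root/md0.backup
.fi
.RE
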
For operations that do not change the size of the array, like simply
increasing chunk size, or converting RAID5 to RAID6 with one extra
device, the entire process is the critical section.  In this case the
restripe will need to progress in stages as a section is suspended,
backed up,
restriped, and released.  This is not yet implemented.

.SS SYSFS INTERFACE
All block devices appear as a directory in
.I sysfs
(usually mounted at
.BR /sys ).
For MD devices, this directory will contain a subdirectory called
.B md
which contains various files for providing access to information about
the array.

This interface is documented more fully in the file
.B Documentation/md.txt
which is distributed with the kernel sources.  That file should be
consulted for full documentation.  The following are just a selection
of attribute files that are available.

.TP
.B md/sync_speed_min
This value, if set, overrides the system-wide setting in
.B /proc/sys/dev/raid/speed_limit_min
for this array only.
Writing the value
.B system
to this file causes the system-wide setting to have effect.

.TP
.B md/sync_speed_max
This is the partner of
.B md/sync_speed_min
and overrides
.B /proc/sys/dev/raid/speed_limit_max
described below.

.TP
.B md/sync_action
This can be used to monitor and control the resync/recovery process of
MD.
In particular, writing "check" here will cause the array to read all
data blocks and check that they are consistent (e.g. parity is correct,
or all mirror replicas are the same).  Any discrepancies found are
.B NOT
corrected.

A count of problems found will be stored in
.BR md/mismatch_cnt .

Alternately, "repair" can be written which will cause the same check
to be performed, but any errors will be corrected.

Finally, "idle" can be written to stop the check/repair process.

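For example, a consistency check could be started, watched, and then
stopped like this (a sketch; md0 is hypothetical):
.RS 4
.nf
echo check > /sys/block/md0/md/sync_action  # start a read-only check
cat /proc/mdstat                            # watch its progress
echo idle > /sys/block/md0/md/sync_action   # stop it again
.fi
.RE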
.TP
.B md/stripe_cache_size
This is only available on RAID5 and RAID6.  It records the size (in
pages per device) of the stripe cache which is used for synchronising
all read and write operations to the array.  The default is 128.
Increasing this number can increase performance in some situations, at
some cost in system memory.

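For example (a sketch; the value and device name are arbitrary):
.RS 4
.nf
echo 256 > /sys/block/md0/md/stripe_cache_size
.fi
.RE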

.SS KERNEL PARAMETERS

The md driver recognises several different kernel parameters.
.TP
.B raid=noautodetect
This will disable the normal detection of md arrays that happens at
boot time.  If a drive is partitioned with MS-DOS style partitions,
then if any of the 4 main partitions has a partition type of 0xFD,
then that partition will normally be inspected to see if it is part of
an MD array, and if any full arrays are found, they are started.  This
kernel parameter disables this behaviour.

.TP
.B raid=partitionable
.TP
.B raid=part
These are available in 2.6 and later kernels only.  They indicate that
autodetected MD arrays should be created as partitionable arrays, with
a different major device number to the original non-partitionable md
arrays.  The device number is listed as
.I mdp
in
.IR /proc/devices .

.TP
.B md_mod.start_ro=1
This tells md to start all arrays in read-only mode.  This is a soft
read-only that will automatically switch to read-write on the first
write request.  However until that write request, nothing is written
to any device by md, and in particular, no resync or recovery
operation is started.

.TP
.B md_mod.start_dirty_degraded=1
As mentioned above, md will not normally start a RAID4, RAID5, or
RAID6 that is both dirty and degraded as this situation can imply
hidden data loss.  This can be awkward if the root filesystem is
affected.  Using the module parameter allows such arrays to be started
at boot time.  It should be understood that there is a real (though
small) risk of data corruption in this situation.

.TP
.BI md= n , dev , dev ,...
.TP
.BI md=d n , dev , dev ,...
This tells the md driver to assemble
.B /dev/md n
from the listed devices.  It is only necessary to start the device
holding the root filesystem this way.  Other arrays are best started
once the system is booted.

In 2.6 kernels, the
.B d
immediately after the
.B =
indicates that a partitionable device (e.g.
.BR /dev/md/d0 )
should be created rather than the original non-partitionable device.

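For example, a boot loader might pass something like the following (a
sketch; device names are hypothetical) so that the root filesystem on
an md array can be mounted:
.RS 4
.nf
md=0,/dev/sda1,/dev/sdb1 root=/dev/md0
.fi
.RE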
.TP
.BI md= n , l , c , i , dev...
This tells the md driver to assemble a legacy RAID0 or LINEAR array
without a superblock.
.I n
gives the md device number,
.I l
gives the level, 0 for RAID0 or -1 for LINEAR,
.I c
gives the chunk size as a base-2 logarithm offset by twelve, so 0
means 4K, 1 means 8K.
.I i
is ignored (legacy support).

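For example (a sketch; device names are hypothetical), the following
assembles /dev/md0 as a RAID0 of two devices with 64K chunks, since a
chunk-size code of 4 selects 2^(4+12) bytes:
.RS 4
.nf
md=0,0,4,0,/dev/hda1,/dev/hdc1
.fi
.RE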
.SH FILES
.TP
.B /proc/mdstat
Contains information about the status of currently running arrays.
.TP
.B /proc/sys/dev/raid/speed_limit_min
A readable and writable file that reflects the current goal rebuild
speed for times when non-rebuild activity is current on an array.
The speed is in Kibibytes per second, and is a per-device rate, not a
per-array rate (which means that an array with more discs will shuffle
more data for a given speed).  The default is 100.

.TP
.B /proc/sys/dev/raid/speed_limit_max
A readable and writable file that reflects the current goal rebuild
speed for times when no non-rebuild activity is current on an array.
The default is 100,000.

.SH SEE ALSO
.BR mdadm (8),
.BR mkraid (8).