''' Copyright Neil Brown and others.
''' This program is free software; you can redistribute it and/or modify
''' it under the terms of the GNU General Public License as published by
''' the Free Software Foundation; either version 2 of the License, or
''' (at your option) any later version.
''' See file COPYING in distribution for details.
.TH MD 4
.SH NAME
md \- Multiple Device driver aka Linux Software Raid
.SH SYNOPSIS
.BI /dev/md n
.br
.BI /dev/md/ n
.SH DESCRIPTION
The
.B md
driver provides virtual devices that are created from one or more
independent underlying devices. This array of devices often contains
redundancy, and hence the acronym RAID which stands for a Redundant
Array of Independent Devices.
.PP
.B md
supports RAID levels
1 (mirroring),
4 (striped array with parity device),
5 (striped array with distributed parity information),
6 (striped array with distributed dual redundancy information), and
10 (striped and mirrored).
If some number of underlying devices fails while using one of these
levels, the array will continue to function; this number is one for
RAID levels 4 and 5, two for RAID level 6, and all but one (N-1) for
RAID level 1, and dependent on configuration for level 10.
.PP
.B md
also supports a number of pseudo RAID (non-redundant) configurations
including RAID0 (striped array), LINEAR (catenated array),
MULTIPATH (a set of different interfaces to the same device),
and FAULTY (a layer over a single device into which errors can be injected).
.SS MD SUPER BLOCK
Each device in an array may have a
.I superblock
which records information about the structure and state of the array.
This allows the array to be reliably re-assembled after a shutdown.

From Linux kernel version 2.6.10,
.B md
provides support for two different formats of this superblock, and
other formats can be added. Prior to this release, only one format was
supported.

The common format - known as version 0.90 - has
a superblock that is 4K long and is written into a 64K aligned block that
starts at least 64K and less than 128K from the end of the device
(i.e. to get the address of the superblock round the size of the
device down to a multiple of 64K and then subtract 64K).
The available size of each device is the amount of space before the
super block, so between 64K and 128K is lost when a device is
incorporated into an MD array.
This superblock stores multi-byte fields in a processor-dependent
manner, so arrays cannot easily be moved between computers with
different processors.

The new format - known as version 1 - has a superblock that is
normally 1K long, but can be longer. It is normally stored between 8K
and 12K from the end of the device, on a 4K boundary, though
variations can be stored at the start of the device (version 1.1) or 4K from
the start of the device (version 1.2).
This superblock format stores multibyte data in a
processor-independent format and supports up to hundreds of
component devices (version 0.90 only supports 28).
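.PP
As an illustration of the version 0.90 placement rule just described,
the following sketch (Python, illustrative only and not part of the
driver; the function name is hypothetical) computes where such a
superblock starts on a device of a given size in bytes:
.nf

    def v090_superblock_offset(device_size):
        """Byte offset of a version 0.90 md superblock (sketch)."""
        block = 64 * 1024
        # Round the device size down to a multiple of 64K,
        # then subtract 64K.
        return (device_size // block) * block - block

.fi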
.PP
The superblock contains, among other things:
.TP
LEVEL
The manner in which the devices are arranged into the array
(linear, raid0, raid1, raid4, raid5, raid10, multipath).
.TP
UUID
a 128 bit Universally Unique Identifier that identifies the array that
this device is part of.
.PP
When a version 0.90 array is being reshaped (e.g. adding extra devices
to a RAID5), the version number is temporarily set to 0.91. This
ensures that if the reshape process is stopped in the middle (e.g. by
a system crash) and the machine boots into an older kernel that does
not support reshaping, then the array will not be assembled (which
would cause data corruption) but will be left untouched until a kernel
that can complete the reshape process is used.
.SS ARRAYS WITHOUT SUPERBLOCKS
While it is usually best to create arrays with superblocks so that
they can be assembled reliably, there are some circumstances where an
array without superblocks is preferred. These include:
.TP
LEGACY ARRAYS
Early versions of the
.B md
driver only supported Linear and Raid0 configurations and did not use
a superblock (which is less critical with these configurations).
While such arrays should be rebuilt with superblocks if possible,
.B md
continues to support them.
.TP
FAULTY
Being a largely transparent layer over a different device, the FAULTY
personality doesn't gain anything from having a superblock.
.TP
MULTIPATH
It is often possible to detect devices which are different paths to
the same storage directly rather than having a distinctive superblock
written to the device and searched for on all paths. In this case,
a MULTIPATH array with no superblock makes sense.
.TP
RAID1
In some configurations it might be desired to create a raid1
configuration that does not use a superblock, and to maintain the state of
the array elsewhere. While not encouraged for general use, it does
have special-purpose uses and is supported.
.SS LINEAR

A linear array simply catenates the available space on each
drive together to form one large virtual drive.

One advantage of this arrangement over the more common RAID0
arrangement is that the array may be reconfigured at a later time with
an extra drive and so the array is made bigger without disturbing the
data that is on the array. However this cannot be done on a live
array.

If a chunksize is given with a LINEAR array, the usable space on each
device is rounded down to a multiple of this chunksize.
.SS RAID0

A RAID0 array (which has zero redundancy) is also known as a
striped array.
A RAID0 array is configured at creation with a
.B "Chunk Size"
which must be a power of two, and at least 4 kibibytes.

The RAID0 driver assigns the first chunk of the array to the first
device, the second chunk to the second device, and so on until all
drives have been assigned one chunk. This collection of chunks forms
a
.BR stripe .
Further chunks are gathered into stripes in the same way, and are
assigned to the remaining space in the drives.

If devices in the array are not all the same size, then once the
smallest device has been exhausted, the RAID0 driver starts
collecting chunks into smaller stripes that only span the drives which
still have remaining space.
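.PP
The round-robin chunk layout described above can be sketched as
follows (Python, illustrative only, assuming all devices are the same
size; the function name is hypothetical):
.nf

    def raid0_map(offset, chunk_size, n_devices):
        """Map an array byte offset to (device index, device offset)."""
        chunk = offset // chunk_size       # which chunk of the array
        within = offset % chunk_size       # position inside that chunk
        device = chunk % n_devices         # chunks are assigned round-robin
        device_offset = (chunk // n_devices) * chunk_size + within
        return device, device_offset

.fi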
.SS RAID1

A RAID1 array is also known as a mirrored set (though mirrors tend to
provide reflected images, which RAID1 does not) or a plex.

Once initialised, each device in a RAID1 array contains exactly the
same data. Changes are written to all devices in parallel. Data is
read from any one device. The driver attempts to distribute read
requests across all devices to maximise performance.

All devices in a RAID1 array should be the same size. If they are
not, then only the amount of space available on the smallest device is
used. Any extra space on other devices is wasted.
.SS RAID4

A RAID4 array is like a RAID0 array with an extra device for storing
parity. This device is the last of the active devices in the
array. Unlike RAID0, RAID4 also requires that all stripes span all
drives, so extra space on devices that are larger than the smallest is
wasted.

When any block in a RAID4 array is modified the parity block for that
stripe (i.e. the block in the parity device at the same device offset
as the stripe) is also modified so that the parity block always
contains the "parity" for the whole stripe. i.e. its contents are
equivalent to the result of performing an exclusive-or operation
between all the data blocks in the stripe.

This allows the array to continue to function if one device fails.
The data that was on that device can be calculated as needed from the
parity block and the other data blocks.
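.PP
The exclusive-or relationship described above can be illustrated with
a small sketch (Python, illustrative only; the function name is
hypothetical). Reconstructing a missing block is the same operation
applied to the parity block and the surviving data blocks:
.nf

    def stripe_parity(blocks):
        """XOR together the equal-length data blocks of one stripe."""
        parity = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                parity[i] ^= byte
        return bytes(parity)

.fi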
.SS RAID5

RAID5 is very similar to RAID4. The difference is that the parity
blocks for each stripe, instead of being on a single device, are
distributed across all devices. This allows more parallelism when
writing, as two different block updates will quite possibly affect
parity blocks on different devices, so there is less contention.

This also allows more parallelism when reading as read requests are
distributed over all the devices in the array instead of all but one.

.SS RAID6

RAID6 is similar to RAID5, but can handle the loss of any \fItwo\fP
devices without data loss. Accordingly, it requires N+2 drives to
store N drives worth of data.

The performance for RAID6 is slightly lower than but comparable to RAID5 in
normal mode and single disk failure mode. It is very slow in dual
disk failure mode, however.
.SS RAID10

RAID10 provides a combination of RAID1 and RAID0, and is sometimes known
as RAID1+0. Every datablock is duplicated some number of times, and
the resulting collection of datablocks is distributed over multiple
drives.

When configuring a RAID10 array it is necessary to specify the number
of replicas of each data block that are required (this will normally
be 2) and whether the replicas should be 'near', 'offset' or 'far'.
(Note that the 'offset' layout is only available from 2.6.18).

When 'near' replicas are chosen, the multiple copies of a given chunk
are laid out consecutively across the stripes of the array, so the two
copies of a datablock will likely be at the same offset on two
adjacent devices.

When 'far' replicas are chosen, the multiple copies of a given chunk
are laid out quite distant from each other. The first copy of all
data blocks will be striped across the early part of all drives in
RAID0 fashion, and then the next copy of all blocks will be striped
across a later section of all drives, always ensuring that all copies
of any given block are on different drives.

The 'far' arrangement can give sequential read performance equal to
that of a RAID0 array, but at the cost of degraded write performance.

When 'offset' replicas are chosen, the multiple copies of a given
chunk are laid out on consecutive drives and at consecutive offsets.
Effectively each stripe is duplicated and the copies are offset by one
device. This should give similar read characteristics to 'far' if a
suitably large chunk size is used, but without as much seeking for
writes.

It should be noted that the number of devices in a RAID10 array need
not be a multiple of the number of replicas of each data block, though
there must be at least as many devices as replicas.

If, for example, an array is created with 5 devices and 2 replicas,
then space equivalent to 2.5 of the devices will be available, and
every block will be stored on two different devices.

Finally, it is possible to have an array with both 'near' and 'far'
copies. If an array is configured with 2 near copies and 2 far
copies, then there will be a total of 4 copies of each block, each on
a different drive. This is an artifact of the implementation and is
unlikely to be of real value.
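.PP
The usable capacity follows from the replica count, as in the
five-device example above. A minimal sketch (Python, illustrative
only; the function name is hypothetical):
.nf

    def raid10_capacity(n_devices, device_size, n_replicas):
        """Usable space: total raw space divided by copies per block."""
        return n_devices * device_size / n_replicas

    # 5 devices of 100 units with 2 replicas -> 250 units,
    # i.e. the space of 2.5 devices.

.fi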
.SS MULTIPATH

MULTIPATH is not really a RAID at all as there is only one real device
in a MULTIPATH md array. However there are multiple access points
(paths) to this device, and one of these paths might fail, so there
are some similarities.

A MULTIPATH array is composed of a number of logically different
devices, often fibre channel interfaces, that all refer to the same
real device. If one of these interfaces fails (e.g. due to cable
problems), the multipath driver will attempt to redirect requests to
another interface.
.SS FAULTY
The FAULTY md module is provided for testing purposes. A faulty array
has exactly one component device and is normally assembled without a
superblock, so the md array created provides direct access to all of
the data in the component device.

The FAULTY module may be requested to simulate faults to allow testing
of other md levels or of filesystems. Faults can be chosen to trigger
on read requests or write requests, and can be transient (a subsequent
read/write at the address will probably succeed) or persistent
(subsequent read/write of the same address will fail). Further, read
faults can be "fixable" meaning that they persist until a write
request at the same address.

Fault types can be requested with a period. In this case the fault
will recur repeatedly after the given number of requests of the
relevant type. For example if persistent read faults have a period of
100, then every 100th read request would generate a fault, and the
faulty sector would be recorded so that subsequent reads on that
sector would also fail.

There is a limit to the number of faulty sectors that are remembered.
Faults generated after this limit is exhausted are treated as
transient.

The list of faulty sectors can be flushed, and the active list of
failure modes can be cleared.
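.PP
The periodic persistent-read-fault behaviour described above can be
modelled with the following sketch (Python, purely a model of the
description rather than the kernel implementation; the class name is
hypothetical and a positive period is assumed):
.nf

    class PersistentReadFaults:
        """Model of periodic persistent read faults."""
        def __init__(self, period, limit=1024):
            self.period = period        # every Nth read adds a fault
            self.limit = limit          # how many bad sectors are remembered
            self.reads = 0
            self.bad_sectors = set()

        def read(self, sector):
            self.reads += 1
            if sector in self.bad_sectors:
                return "error"          # remembered sectors keep failing
            if self.reads % self.period == 0:
                if len(self.bad_sectors) < self.limit:
                    self.bad_sectors.add(sector)
                return "error"          # over the limit: transient fault
            return "ok"

.fi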
.SS UNCLEAN SHUTDOWN

When changes are made to a RAID1, RAID4, RAID5, RAID6, or RAID10 array
there is a possibility of inconsistency for short periods of time as
each update requires at least two blocks to be written to different
devices, and these writes probably won't happen at exactly the same
time. Thus if a system with one of these arrays is shut down in the
middle of a write operation (e.g. due to power failure), the array may
not be consistent.
.PP
To handle this situation, the md driver marks an array as "dirty"
before writing any data to it, and marks it as "clean" when the array
is being disabled, e.g. at shutdown. If the md driver finds an array
to be dirty at startup, it proceeds to correct any possible
inconsistency. For RAID1, this involves copying the contents of the
first drive onto all other drives. For RAID4, RAID5 and RAID6 this
involves recalculating the parity for each stripe and making sure that
the parity block has the correct data. For RAID10 it involves copying
one of the replicas of each block onto all the others. This process,
known as "resynchronising" or "resync", is performed in the background.
The array can still be used, though possibly with reduced performance.

If a RAID4, RAID5 or RAID6 array is degraded (missing at least one
drive) when it is restarted after an unclean shutdown, it cannot
recalculate parity, and so it is possible that data might be
undetectably corrupted. The 2.4 md driver
.B does not
alert the operator to this condition. The 2.6 md driver will fail to
start an array in this condition without manual intervention, though
this behaviour can be overridden by a kernel parameter.
.SS RECOVERY

If the md driver detects a write error on a device in a RAID1, RAID4,
RAID5, RAID6, or RAID10 array, it immediately disables that device
(marking it as faulty) and continues operation on the remaining
devices. If there is a spare drive, the driver will start recreating
on one of the spare drives the data that was on the failed drive,
either by copying a working drive in a RAID1 configuration, or by
doing calculations with the parity block on RAID4, RAID5 or RAID6, or
by finding and copying originals for RAID10.

In kernels prior to about 2.6.15, a read error would cause the same
effect as a write error. In later kernels, a read-error will instead
cause md to attempt a recovery by overwriting the bad block. i.e. it
will find the correct data from elsewhere, write it over the block
that failed, and then try to read it back again. If either the write
or the re-read fails, md will treat the error the same way that a write
error is treated and will fail the whole device.

While this recovery process is happening, the md driver will monitor
accesses to the array and will slow down the rate of recovery if other
activity is happening, so that normal access to the array will not be
unduly affected. When no other activity is happening, the recovery
process proceeds at full speed. The actual speed targets for the two
different situations can be controlled by the
.B speed_limit_min
and
.B speed_limit_max
control files mentioned below.
.SS BITMAP WRITE-INTENT LOGGING

From Linux 2.6.13,
.I md
supports a bitmap based write-intent log. If configured, the bitmap
is used to record which blocks of the array may be out of sync.
Before any write request is honoured, md will make sure that the
corresponding bit in the log is set. After a period of time with no
writes to an area of the array, the corresponding bit will be cleared.

This bitmap is used for two optimisations.

Firstly, after an unclean shutdown, the resync process will consult
the bitmap and only resync those blocks that correspond to bits in the
bitmap that are set. This can dramatically reduce resync time.

Secondly, when a drive fails and is removed from the array, md stops
clearing bits in the intent log. If that same drive is re-added to
the array, md will notice and will only recover the sections of the
drive that are covered by bits in the intent log that are set. This
can allow a device to be temporarily removed and reinserted without
causing an enormous recovery cost.

The intent log can be stored in a file on a separate device, or it can
be stored near the superblocks of an array which has superblocks.
.PP
It is possible to add an intent log to an active array, or remove an
intent log if one is present.

In 2.6.13, intent bitmaps are only supported with RAID1. Other levels
with redundancy are supported from 2.6.15.
.SS WRITE-BEHIND

From Linux 2.6.14,
.I md
supports WRITE-BEHIND on RAID1 arrays.

This allows certain devices in the array to be flagged as
.IR write-mostly .
MD will only read from such devices if there is no
other option.

If a write-intent bitmap is also provided, write requests to
write-mostly devices will be treated as write-behind requests and md
will not wait for those writes to complete before
reporting the write as complete to the filesystem.

This allows for a RAID1 with WRITE-BEHIND to be used to mirror data
over a slow link to a remote computer (providing the link isn't too
slow). The extra latency of the remote link will not slow down normal
operations, but the remote system will still have a reasonably
up-to-date copy of all data.
.SS RESTRIPING

.IR Restriping ,
also known as
.IR Reshaping ,
is the process of re-arranging the data stored in each stripe into a
new layout. This might involve changing the number of devices in the
array (so the stripes are wider), changing the chunk size (so stripes
are deeper or shallower), or changing the arrangement of data and
parity, possibly changing the raid level (e.g. 1 to 5 or 5 to 6).

As of Linux 2.6.17, md can reshape a raid5 array to have more
devices. Other possibilities may follow in future kernels.

During any restripe process there is a 'critical section' during which
live data is being over-written on disk. For the operation of
increasing the number of drives in a raid5, this critical section
covers the first few stripes (the number being the product of the old
and new number of devices). After this critical section is passed,
data is only written to areas of the array which no longer hold live
data - the live data has already been relocated away.

md is not able to ensure data preservation if there is a crash
(e.g. power failure) during the critical section. If md is asked to
start an array which failed during a critical section of restriping,
it will fail to start the array.

To deal with this possibility, a user-space program must
.IP \(bu 4
Disable writes to that section of the array (using the
.B sysfs
interface),
.IP \(bu 4
Take a copy of the data somewhere (i.e. make a backup),
.IP \(bu 4
Allow the process to continue and invalidate the backup and restore
write access once the critical section is passed, and
.IP \(bu 4
Provide for restoring the critical data before restarting the array
after a system crash.
.PP

.B mdadm
version 2.4 and later will do this for growing a RAID5 array.

For operations that do not change the size of the array, like simply
increasing chunk size, or converting RAID5 to RAID6 with one extra
device, the entire process is the critical section. In this case the
restripe will need to progress in stages as a section is suspended,
backed up,
restriped, and released. This is not yet implemented.
.SS SYSFS INTERFACE
All block devices appear as a directory in
.I sysfs
(usually mounted at
.BR /sys ).
For MD devices, this directory will contain a subdirectory called
.B md
which contains various files for providing access to information about
the array.

This interface is documented more fully in the file
.B Documentation/md.txt
which is distributed with the kernel sources. That file should be
consulted for full documentation. The following are just a selection
of attribute files that are available.

.TP
.B md/sync_speed_min
This value, if set, overrides the system-wide setting in
.B /proc/sys/dev/raid/speed_limit_min
for this array only.
Writing the value
.B system
to this file causes the system-wide setting to take effect.

.TP
.B md/sync_speed_max
This is the partner of
.B md/sync_speed_min
and overrides
.B /proc/sys/dev/raid/speed_limit_max
described below.

.TP
.B md/sync_action
This can be used to monitor and control the resync/recovery process of
MD.
In particular, writing "check" here will cause the array to read all
data blocks and check that they are consistent (e.g. parity is correct,
or all mirror replicas are the same). Any discrepancies found are
.B NOT
corrected.

A count of problems found will be stored in
.BR md/mismatch_cnt .

Alternatively, "repair" can be written which will cause the same check
to be performed, but any errors will be corrected.

Finally, "idle" can be written to stop the check/repair process.

.TP
.B md/stripe_cache_size
This is only available on RAID5 and RAID6. It records the size (in
pages per device) of the stripe cache which is used for synchronising
all read and write operations to the array. The default is 128.
Increasing this number can increase performance in some situations, at
some cost in system memory.
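.PP
As an illustration of using these attribute files, the following
sketch (Python, illustrative only; it must run as root, and the device
name "md0" is just an example) starts a consistency check and reads
back the result:
.nf

    import time

    def check_array(md_dev="md0"):
        base = "/sys/block/%s/md" % md_dev
        with open(base + "/sync_action", "w") as f:
            f.write("check")              # start a consistency check
        while True:                       # wait for the check to finish
            with open(base + "/sync_action") as f:
                if f.read().strip() == "idle":
                    break
            time.sleep(5)
        with open(base + "/mismatch_cnt") as f:
            return int(f.read())          # discrepancies found, not corrected

.fi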
.SS KERNEL PARAMETERS

The md driver recognises several different kernel parameters.
.TP
.B raid=noautodetect
This will disable the normal detection of md arrays that happens at
boot time. If a drive is partitioned with MS-DOS style partitions
and any of the 4 main partitions has a partition type of 0xFD,
then that partition will normally be inspected to see if it is part of
an MD array, and if any full arrays are found, they are started. This
kernel parameter disables this behaviour.

.TP
.B raid=partitionable
.TP
.B raid=part
These are available in 2.6 and later kernels only. They indicate that
autodetected MD arrays should be created as partitionable arrays, with
a different major device number to the original non-partitionable md
arrays. The device number is listed as
.I mdp
in
.IR /proc/devices .

.TP
.B md_mod.start_ro=1
This tells md to start all arrays in read-only mode. This is a soft
read-only that will automatically switch to read-write on the first
write request. However, until that write request, nothing is written
to any device by md, and in particular, no resync or recovery
operation is started.

.TP
.B md_mod.start_dirty_degraded=1
As mentioned above, md will not normally start a RAID4, RAID5, or
RAID6 that is both dirty and degraded as this situation can imply
hidden data loss. This can be awkward if the root filesystem is
affected. Using the module parameter allows such arrays to be started
at boot time. It should be understood that there is a real (though
small) risk of data corruption in this situation.
.TP
.BI md= n , dev , dev ,...
.TP
.BI md=d n , dev , dev ,...
This tells the md driver to assemble
.B /dev/md n
from the listed devices. It is only necessary to start the device
holding the root filesystem this way. Other arrays are best started
once the system is booted.

In 2.6 kernels, the
.B d
immediately after the
.B =
indicates that a partitionable device (e.g.
.BR /dev/md/d0 )
should be created rather than the original non-partitionable device.

.TP
.BI md= n , l , c , i , dev...
This tells the md driver to assemble a legacy RAID0 or LINEAR array
without a superblock.
.I n
gives the md device number,
.I l
gives the level, 0 for RAID0 or -1 for LINEAR,
.I c
gives the chunk size as a base-2 logarithm offset by twelve, so 0
means 4K, 1 means 8K.
.I i
is ignored (legacy support).
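.PP
For illustration, the chunk-size encoding just described corresponds
to the following calculation (Python, illustrative only; the function
name is hypothetical):
.nf

    def legacy_chunk_size(c):
        """Chunk size in bytes for the legacy md= parameter field c."""
        return 1 << (c + 12)    # c=0 -> 4096 (4K), c=1 -> 8192 (8K)

.fi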
.SH FILES
.TP
.B /proc/mdstat
Contains information about the status of currently running arrays.
.TP
.B /proc/sys/dev/raid/speed_limit_min
A readable and writable file that reflects the current goal rebuild
speed for times when non-rebuild activity is current on an array.
The speed is in Kibibytes per second, and is a per-device rate, not a
per-array rate (which means that an array with more discs will shuffle
more data for a given speed). The default is 100.

.TP
.B /proc/sys/dev/raid/speed_limit_max
A readable and writable file that reflects the current goal rebuild
speed for times when no non-rebuild activity is current on an array.
The default is 100,000.

.SH SEE ALSO
.BR mdadm (8),
.BR mkraid (8).