X-Git-Url: http://git.ipfire.org/?p=thirdparty%2Fmdadm.git;a=blobdiff_plain;f=md.4;h=0712af255dd5a962789ec0955bc9fe8d33832d7b;hp=5f6c3a7c264af867e377588bb17ea6ae7179d7cc;hb=45c43276d02a32876c7e1f9f0d04580595141b3d;hpb=956a13fb850321bed8568dfa8692c0c323538d7c diff --git a/md.4 b/md.4 index 5f6c3a7c..0712af25 100644 --- a/md.4 +++ b/md.4 @@ -4,6 +4,7 @@ .\" the Free Software Foundation; either version 2 of the License, or .\" (at your option) any later version. .\" See file COPYING in distribution for details. +.if n .pl 1000v .TH MD 4 .SH NAME md \- Multiple Device driver aka Linux Software RAID @@ -40,7 +41,7 @@ MULTIPATH (a set of different interfaces to the same device), and FAULTY (a layer over a single device into which errors can be injected). .SS MD METADATA -Each device in an array may have some +Each device in an array may have some .I metadata stored in the device. This metadata is sometimes called a .BR superblock . @@ -176,7 +177,7 @@ device is rounded down to a multiple of this chunksize. A RAID0 array (which has zero redundancy) is also known as a striped array. A RAID0 array is configured at creation with a -.B "Chunk Size" +.B "Chunk Size" which must be a power of two (prior to Linux 2.6.31), and at least 4 kibibytes. @@ -192,6 +193,27 @@ smallest device has been exhausted, the RAID0 driver starts collecting chunks into smaller stripes that only span the drives which still have remaining space. +A bug was introduced in linux 3.14 which changed the layout of blocks in +a RAID0 beyond the region that is striped over all devices. This bug +does not affect an array with all devices the same size, but can affect +other RAID0 arrays. + +Linux 5.4 (and some stable kernels to which the change was backported) +will not normally assemble such an array as it cannot know which layout +to use. There is a module parameter "raid0.default_layout" which can be +set to "1" to force the kernel to use the pre-3.14 layout or to "2" to +force it to use the 3.14-and-later layout. when creating a new RAID0 +array, +.I mdadm +will record the chosen layout in the metadata in a way that allows newer +kernels to assemble the array without needing a module parameter. + +To assemble an old array on a new kernel without using the module parameter, +use either the +.B "--update=layout-original" +option or the +.B "--update=layout-alternate" +option. .SS RAID1 @@ -267,31 +289,326 @@ the resulting collection of datablocks are distributed over multiple drives. When configuring a RAID10 array, it is necessary to specify the number -of replicas of each data block that are required (this will normally -be 2) and whether the replicas should be 'near', 'offset' or 'far'. -(Note that the 'offset' layout is only available from 2.6.18). - -When 'near' replicas are chosen, the multiple copies of a given chunk -are laid out consecutively across the stripes of the array, so the two -copies of a datablock will likely be at the same offset on two -adjacent devices. - -When 'far' replicas are chosen, the multiple copies of a given chunk -are laid out quite distant from each other. The first copy of all -data blocks will be striped across the early part of all drives in -RAID0 fashion, and then the next copy of all blocks will be striped -across a later section of all drives, always ensuring that all copies -of any given block are on different drives. - -The 'far' arrangement can give sequential read performance equal to -that of a RAID0 array, but at the cost of reduced write performance. 
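Returning to the RAID0 behaviour described above (chunks are striped across all devices until the smallest device is exhausted, after which striping continues over only the devices that still have space), the following minimal Python sketch maps a logical chunk number to a device and an on-device chunk offset under that simplified "zone" model. The function names and device sizes are illustrative assumptions, and the sketch deliberately does not reproduce the exact block numbering of either the pre-3.14 or the 3.14-and-later kernel layout; the tail region beyond the smallest device is precisely where those two layouts differ.

# Minimal sketch of RAID0 "zone" striping over unequally sized devices.
# Conceptual model only -- it does NOT reproduce the exact on-disk layout
# of any particular kernel version.

def raid0_zones(dev_chunks):
    """Split devices into zones: each zone stripes over the devices
    that still have capacity left at that depth."""
    zones = []                       # (start_depth, end_depth, [device indices])
    prev = 0
    for depth in sorted(set(dev_chunks)):
        members = [i for i, n in enumerate(dev_chunks) if n > prev]
        zones.append((prev, depth, members))
        prev = depth
    return zones

def chunk_to_device(logical_chunk, dev_chunks):
    """Map a logical chunk number to (device index, chunk offset on that device)."""
    consumed = 0
    for start, end, members in raid0_zones(dev_chunks):
        zone_chunks = (end - start) * len(members)
        if logical_chunk < consumed + zone_chunks:
            within = logical_chunk - consumed
            stripe, pos = divmod(within, len(members))
            return members[pos], start + stripe
        consumed += zone_chunks
    raise ValueError("chunk beyond end of array")

if __name__ == "__main__":
    sizes = [100, 100, 200]          # chunks per device (hypothetical)
    for c in (0, 1, 2, 299, 300, 399):
        print(c, chunk_to_device(c, sizes))

The first 300 chunks are striped over all three hypothetical devices; chunks 300 onwards fall entirely on the larger third device, which is the region affected by the layout ambiguity described above.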
- -When 'offset' replicas are chosen, the multiple copies of a given -chunk are laid out on consecutive drives and at consecutive offsets. -Effectively each stripe is duplicated and the copies are offset by one -device. This should give similar read characteristics to 'far' if a -suitably large chunk size is used, but without as much seeking for -writes. +of replicas of each data block that are required (this will usually +be\ 2) and whether their layout should be "near", "far" or "offset" +(with "offset" being available since Linux\ 2.6.18). + +.B About the RAID10 Layout Examples: +.br +The examples below visualise the chunk distribution on the underlying +devices for the respective layout. + +For simplicity it is assumed that the size of the chunks equals the +size of the blocks of the underlying devices as well as those of the +RAID10 device exported by the kernel (for example \fB/dev/md/\fPname). +.br +Therefore the chunks\ /\ chunk numbers map directly to the blocks\ /\ +block addresses of the exported RAID10 device. + +Decimal numbers (0,\ 1, 2,\ ...) are the chunks of the RAID10 and due +to the above assumption also the blocks and block addresses of the +exported RAID10 device. +.br +Repeated numbers mean copies of a chunk\ /\ block (obviously on +different underlying devices). +.br +Hexadecimal numbers (0x00,\ 0x01, 0x02,\ ...) are the block addresses +of the underlying devices. + +.TP +\fB "near" Layout\fP +When "near" replicas are chosen, the multiple copies of a given chunk are laid +out consecutively ("as close to each other as possible") across the stripes of +the array. + +With an even number of devices, they will likely (unless some misalignment is +present) lay at the very same offset on the different devices. +.br +This is as the "classic" RAID1+0; that is two groups of mirrored devices (in the +example below the groups Device\ #1\ /\ #2 and Device\ #3\ /\ #4 are each a +RAID1) both in turn forming a striped RAID0. + +.ne 10 +.B Example with 2\ copies per chunk and an even number\ (4) of devices: +.TS +tab(;); + C - - - - + C | C | C | C | C | +| - | - | - | - | - | +| C | C | C | C | C | +| C | C | C | C | C | +| C | C | C | C | C | +| C | C | C | C | C | +| C | C | C | C | C | +| C | C | C | C | C | +| - | - | - | - | - | + C C S C S + C C S C S + C C S S S + C C S S S. +; +;Device #1;Device #2;Device #3;Device #4 +0x00;0;0;1;1 +0x01;2;2;3;3 +\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\. +:;:;:;:;: +\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\. +0x80;254;254;255;255 +;\\---------v---------/;\\---------v---------/ +;RAID1;RAID1 +;\\---------------------v---------------------/ +;RAID0 +.TE + +.ne 10 +.B Example with 2\ copies per chunk and an odd number\ (5) of devices: +.TS +tab(;); + C - - - - - + C | C | C | C | C | C | +| - | - | - | - | - | - | +| C | C | C | C | C | C | +| C | C | C | C | C | C | +| C | C | C | C | C | C | +| C | C | C | C | C | C | +| C | C | C | C | C | C | +| C | C | C | C | C | C | +| - | - | - | - | - | - | +C. +; +;Dev #1;Dev #2;Dev #3;Dev #4;Dev #5 +0x00;0;0;1;1;2 +0x01;2;3;3;4;4 +\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\. +:;:;:;:;:;: +\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\. +0x80;317;318;318;319;319 +; +.TE + +.TP +\fB "far" Layout\fP +When "far" replicas are chosen, the multiple copies of a given chunk +are laid out quite distant ("as far as reasonably possible") from each +other. + +First a complete sequence of all data blocks (that is all the data one +sees on the exported RAID10 block device) is striped over the +devices. 
Then another (though "shifted") complete sequence of all data +blocks; and so on (in the case of more than 2\ copies per chunk). + +The "shift" needed to prevent placing copies of the same chunks on the +same devices is actually a cyclic permutation with offset\ 1 of each +of the stripes within a complete sequence of chunks. +.br +The offset\ 1 is relative to the previous complete sequence of chunks, +so in case of more than 2\ copies per chunk one gets the following +offsets: +.br +1.\ complete sequence of chunks: offset\ =\ \ 0 +.br +2.\ complete sequence of chunks: offset\ =\ \ 1 +.br +3.\ complete sequence of chunks: offset\ =\ \ 2 +.br + : +.br +n.\ complete sequence of chunks: offset\ =\ n-1 + +.ne 10 +.B Example with 2\ copies per chunk and an even number\ (4) of devices: +.TS +tab(;); + C - - - - + C | C | C | C | C | +| - | - | - | - | - | +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| - | - | - | - | - | +C. +; +;Device #1;Device #2;Device #3;Device #4 +; +0x00;0;1;2;3;\\ +0x01;4;5;6;7;> [#] +\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;: +:;:;:;:;:;: +\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;: +0x40;252;253;254;255;/ +0x41;3;0;1;2;\\ +0x42;7;4;5;6;> [#]~ +\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;: +:;:;:;:;:;: +\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;: +0x80;255;252;253;254;/ +; +.TE + +.ne 10 +.B Example with 2\ copies per chunk and an odd number\ (5) of devices: +.TS +tab(;); + C - - - - - + C | C | C | C | C | C | +| - | - | - | - | - | - | +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| - | - | - | - | - | - | +C. +; +;Dev #1;Dev #2;Dev #3;Dev #4;Dev #5 +; +0x00;0;1;2;3;4;\\ +0x01;5;6;7;8;9;> [#] +\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;: +:;:;:;:;:;:;: +\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;: +0x40;315;316;317;318;319;/ +0x41;4;0;1;2;3;\\ +0x42;9;5;6;7;8;> [#]~ +\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;: +:;:;:;:;:;:;: +\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;: +0x80;319;315;316;317;318;/ +; +.TE + +With [#]\ being the complete sequence of chunks and [#]~\ the cyclic permutation +with offset\ 1 thereof (in the case of more than 2 copies per chunk there would +be ([#]~)~,\ (([#]~)~)~,\ ...). + +The advantage of this layout is that MD can easily spread sequential reads over +the devices, making them similar to RAID0 in terms of speed. +.br +The cost is more seeking for writes, making them substantially slower. + +.TP +\fB"offset" Layout\fP +When "offset" replicas are chosen, all the copies of a given chunk are +striped consecutively ("offset by the stripe length after each other") +over the devices. + +Explained in detail, consecutive chunks are +striped over the devices, immediately followed by a "shifted" copy of +these chunks (and by further such "shifted" copies in the case of more +than 2\ copies per chunk). +.br +This pattern repeats for all further consecutive chunks of the +exported RAID10 device (in other words: all further data blocks). 
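As a concrete companion to the layout descriptions and tables in this section, the following Python sketch maps a logical chunk number to the (device index, device chunk address) pairs holding its copies for the "near", "far" and "offset" layouts, under the same simplifying assumption used above (chunk size equal to block size). The function names are illustrative; the formulas are derived from the tables shown in this section (the "offset" tables follow below) and assume plain layouts without combined near and far copies.

# Illustrative mapping of a RAID10 logical chunk to its copies, matching
# the layout tables in this section (chunk size == block size, no mixed
# near+far configuration).  Function names are for illustration only.

def near(chunk, devices, copies=2):
    """'near' layout: copies laid out consecutively across the stripe."""
    return [((copies * chunk + r) % devices,          # device index
             (copies * chunk + r) // devices)         # chunk address on it
            for r in range(copies)]

def far(chunk, devices, chunks_per_section, copies=2):
    """'far' layout: each copy lives in its own section of every device,
    with the stripe shifted by one device per section."""
    return [((chunk + r) % devices,
             r * chunks_per_section + chunk // devices)
            for r in range(copies)]

def offset(chunk, devices, copies=2):
    """'offset' layout: each stripe is immediately followed by its
    shifted copies."""
    return [((chunk + r) % devices,
             (chunk // devices) * copies + r)
            for r in range(copies)]

if __name__ == "__main__":
    # Reproduce entries from the 4-device examples; devices are 0-based here.
    print(near(2, 4))          # [(0, 1), (1, 1)]  -> chunk 2 on devices #1/#2 at 0x01
    print(far(3, 4, 0x41))     # [(3, 0), (0, 65)] -> 65 == 0x41, second section
    print(offset(3, 4))        # [(3, 0), (0, 1)]

In the "far" case, chunks_per_section is the length (in chunks) of one complete copy section on each device, 0x41 in the 4-device example above.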
+ +The "shift" needed to prevent placing copies of the same chunks on the +same devices is actually a cyclic permutation with offset\ 1 of each +of the striped copies of consecutive chunks. +.br +The offset\ 1 is relative to the previous striped copy of consecutive chunks, so in case of more than 2\ copies per +chunk one gets the following offsets: +.br +1.\ consecutive chunks: offset\ =\ \ 0 +.br +2.\ consecutive chunks: offset\ =\ \ 1 +.br +3.\ consecutive chunks: offset\ =\ \ 2 +.br + : +.br +n.\ consecutive chunks: offset\ =\ n-1 + +.ne 10 +.B Example with 2\ copies per chunk and an even number\ (4) of devices: +.TS +tab(;); + C - - - - + C | C | C | C | C | +| - | - | - | - | - | +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| C | C | C | C | C | L +| - | - | - | - | - | +C. +; +;Device #1;Device #2;Device #3;Device #4 +; +0x00;0;1;2;3;) AA +0x01;3;0;1;2;) AA~ +0x02;4;5;6;7;) AB +0x03;7;4;5;6;) AB~ +\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;) \.\.\. +:;:;:;:;:; : +\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;) \.\.\. +0x79;251;252;253;254;) EX +0x80;254;251;252;253;) EX~ +; +.TE + +.ne 10 +.B Example with 2\ copies per chunk and an odd number\ (5) of devices: +.TS +tab(;); + C - - - - - + C | C | C | C | C | C | +| - | - | - | - | - | - | +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| C | C | C | C | C | C | L +| - | - | - | - | - | - | +C. +; +;Dev #1;Dev #2;Dev #3;Dev #4;Dev #5 +; +0x00;0;1;2;3;4;) AA +0x01;4;0;1;2;3;) AA~ +0x02;5;6;7;8;9;) AB +0x03;9;5;6;7;8;) AB~ +\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;) \.\.\. +:;:;:;:;:;:; : +\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;) \.\.\. +0x79;314;315;316;317;318;) EX +0x80;318;314;315;316;317;) EX~ +; +.TE + +With AA,\ AB,\ ..., AZ,\ BA,\ ... being the sets of consecutive +chunks and AA~,\ AB~,\ ..., AZ~,\ BA~,\ ... the cyclic permutations with offset\ 1 +thereof (in the case of more than 2 copies per chunk there would be (AA~)~,\ ... +as well as ((AA~)~)~,\ ... and so on). + +This should give similar read characteristics to "far" if a suitably large chunk +size is used, but without as much seeking for writes. +.PP + It should be noted that the number of devices in a RAID10 array need not be a multiple of the number of replica of each data block; however, @@ -301,7 +618,7 @@ If, for example, an array is created with 5 devices and 2 replicas, then space equivalent to 2.5 of the devices will be available, and every block will be stored on two different devices. -Finally, it is possible to have an array with both 'near' and 'far' +Finally, it is possible to have an array with both "near" and "far" copies. If an array is configured with 2 near copies and 2 far copies, then there will be a total of 4 copies of each block, each on a different drive. This is an artifact of the implementation and is @@ -551,7 +868,7 @@ intent log if one is present. In 2.6.13, intent bitmaps are only supported with RAID1. Other levels with redundancy are supported from 2.6.15. -.SS BAD BLOCK LOG +.SS BAD BLOCK LIST From Linux 3.5 each device in an .I md @@ -561,7 +878,7 @@ and the data. When a block cannot be read and cannot be repaired by writing data recovered from other devices, the address of the block is stored in -the bad block log. 
Similarly if an attempt to write a block fails,
+the bad block list.  Similarly if an attempt to write a block fails,
 the address will be recorded as a bad block.  If attempting to record
 the bad block fails, the whole device will be marked faulty.
 
@@ -575,9 +892,29 @@ This allows an array to fail more gracefully - a few blocks on
 different devices can be faulty without taking the whole array out of
 action.
 
-The log is particularly useful when recovering to a spare.  If a few blocks
+The list is particularly useful when recovering to a spare.  If a few blocks
 cannot be read from the other devices, the bulk of the recovery can
-complete and those few bad blocks will be recorded in the bad block log.
+complete and those few bad blocks will be recorded in the bad block list.
+
+.SS RAID456 WRITE JOURNAL
+
+Because RAID456 write operations are not atomic, an interruption of a
+write operation (system crash, etc.) to a RAID456 array can leave the
+parity inconsistent with the data and cause data loss (the so-called
+RAID-5 write hole).
+
+To plug the write hole, from Linux 4.4 (to be confirmed),
+.I md
+supports a write-ahead journal for RAID456.  When the array is created,
+an additional journal device can be added to the array through the
+.I write-journal
+option.  The RAID write journal works similarly to a file system journal.
+Before writing to the data disks, md persists both the data and the parity
+of the stripe to the journal device.  After a crash, md searches the
+journal device for incomplete write operations, and replays them to the
+data disks.
+
+When the journal device fails, the RAID array is forced to run in
+read-only mode.
 
 .SS WRITE-BEHIND
 
@@ -601,6 +938,60 @@ slow).  The extra latency of the remote link will not slow down normal
 operations, but the remote system will still have a reasonably
 up-to-date copy of all data.
 
+.SS FAILFAST
+
+From Linux 4.10,
+.I md
+supports FAILFAST for RAID1 and RAID10 arrays.  This is a flag that
+can be set on individual drives, though it is usually set on all
+drives, or no drives.
+
+When
+.I md
+sends an I/O request to a drive that is marked as FAILFAST, and when
+the array could survive the loss of that drive without losing data,
+.I md
+will request that the underlying device does not perform any retries.
+This means that a failure will be reported to
+.I md
+promptly, and it can mark the device as faulty and continue using the
+other device(s).
+.I md
+cannot control the timeout that the underlying devices use to
+determine failure.  Any changes desired to that timeout must be set
+explicitly on the underlying device, separately from using
+.IR mdadm .
+
+If a FAILFAST request does fail, and if it is still safe to mark the
+device as faulty without data loss, that will be done and the array
+will continue functioning on a reduced number of devices.  If it is not
+possible to safely mark the device as faulty,
+.I md
+will retry the request without disabling retries in the underlying
+device.  In any case,
+.I md
+will not attempt to repair read errors on a device marked as FAILFAST
+by writing out the correct data.  It will just mark the device as faulty.
+
+FAILFAST is appropriate for storage arrays that have a low probability
+of true failure, but will sometimes introduce unacceptable delays to
+I/O requests while performing internal maintenance.  The value of
+setting FAILFAST involves a trade-off.  The gain is that the chance of
+unacceptable delays is substantially reduced.  The cost is that the
+unlikely event of data-loss on one device is slightly more likely to
+result in data-loss for the array.
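Referring back to the RAID456 WRITE JOURNAL section above, the following toy Python sketch illustrates why an interrupted stripe update opens the write hole, and how journalling the data and parity together allows a clean replay. It is a conceptual model only, not a description of md's journal format; all names and values are invented for illustration.

# Conceptual illustration of the RAID-5 write hole and how a write-ahead
# journal closes it.  Toy model only, not md's on-disk format.

from functools import reduce

def parity(chunks):
    """RAID5 parity is the byte-wise XOR of the data chunks in a stripe."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

# A 3-chunk stripe plus its parity.
stripe = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(stripe)

# Update chunk 1.  Without a journal, the data write may complete while
# the parity write is lost in a crash -> parity no longer matches the
# data, and a later rebuild would reconstruct garbage (the "write hole").
new_stripe = [stripe[0], b"XXXX", stripe[2]]
assert parity(new_stripe) != p

# With a journal, the new data AND the new parity are persisted to the
# journal device first; after a crash the pair is simply replayed.
journal = {"data": (1, b"XXXX"), "parity": parity(new_stripe)}

def replay(journal, stripe):
    idx, data = journal["data"]
    stripe = list(stripe)
    stripe[idx] = data
    return stripe, journal["parity"]

recovered, recovered_parity = replay(journal, stripe)
assert parity(recovered) == recovered_parity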
+
+When a device in an array using FAILFAST is marked as faulty, it will
+usually become usable again in a short while.
+.I mdadm
+makes no attempt to detect that possibility.  Some separate
+mechanism, tuned to the specific details of the expected failure modes,
+needs to be created to monitor devices to see when they return to full
+functionality, and to then re-add them to the array.  In order for
+this "re-add" functionality to be effective, an array using FAILFAST
+should always have a write-intent bitmap.
+
 .SS RESTRIPING
 
 .IR Restriping ,
@@ -729,7 +1120,76 @@ number of times MD will service a full-stripe-write before servicing a
 stripe that requires some "prereading".  For fairness this defaults to 1.
 Valid values are 0 to stripe_cache_size.  Setting this to 0 maximizes
 sequential-write throughput at the cost of fairness to threads
-doing small or random writes.
+doing small or random writes. 
+
+.TP
+.B md/bitmap/backlog
+The value stored in this file only has an effect on RAID1 when write-mostly
+devices are active, and write requests to those devices are processed in
+the background.
+
+This variable sets a limit on the number of concurrent background writes.
+Valid values are 0 to 16383; 0 means that write-behind is not allowed,
+while any other value means it can happen.  If there are more outstanding
+write-behind requests than this number, new writes will be synchronous.
+
+.TP
+.B md/bitmap/can_clear
+This is for externally managed bitmaps, where the kernel writes the bitmap
+itself, but metadata describing the bitmap is managed by mdmon or similar.
+
+While the array is degraded, bits must not be cleared.  When the array
+becomes optimal again, bits can be cleared, but first the metadata needs
+to record the current event count.  So md sets this to 'false' and notifies
+mdmon, then mdmon updates the metadata and writes 'true'.
+
+There is no code in mdmon to actually do this, so it may not work.
+
+.TP
+.B md/bitmap/chunksize
+The bitmap chunksize can only be changed when no bitmap is active, and
+the value must be a power of 2 and at least 512 bytes.
+
+.TP
+.B md/bitmap/location
+This indicates where the write-intent bitmap for the array is stored.
+It can be "none", "file", or a signed offset from the array metadata,
+measured in sectors.  You cannot set a file by writing here; that can
+only be done with the SET_BITMAP_FILE ioctl.
+
+Writing 'none' to 'bitmap/location' will clear the bitmap; the previous
+location value must be written back to restore the bitmap.
+
+.TP
+.B md/bitmap/max_backlog_used
+This keeps track of the maximum number of concurrent write-behind requests
+seen for an md array; writing any value to this file will clear it.
+
+.TP
+.B md/bitmap/metadata
+This can be 'internal', 'clustered' or 'external'.  'internal' is the
+default, and means that the bitmap metadata is stored in the first 256
+bytes of the bitmap space.  'clustered' means that separate bitmap metadata
+is used for each cluster node.  'external' means that the bitmap metadata
+is managed externally to the kernel.
+
+.TP
+.B md/bitmap/space
+This shows the space (in sectors) which is available at md/bitmap/location,
+and allows the kernel to know when it is safe to resize the bitmap to match
+a resized array.  It should be big enough to contain the total bytes in the
+bitmap.
+
+For 1.0 metadata, the bitmap is assumed to be able to use the space up to
+the superblock if it is located before the superblock, otherwise up to 4K
+beyond the superblock.  For other metadata versions, it is assumed that no
+change is possible.
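The md/bitmap/* attributes described above (together with md/bitmap/time_base, described next) live in each array's sysfs directory. As a small illustration, the following Python sketch simply dumps them for inspection; the array name md0 is an assumption, reading the files is harmless, and writing them requires root and is subject to the restrictions noted above.

# Read the documented md/bitmap sysfs attributes of an array.
# Assumes an array named md0 exists; adjust the name as needed.

from pathlib import Path

ATTRS = ("location", "chunksize", "metadata", "space",
         "backlog", "max_backlog_used", "can_clear", "time_base")

def dump_bitmap_attrs(array="md0"):
    bitmap_dir = Path("/sys/block") / array / "md" / "bitmap"
    for name in ATTRS:
        attr = bitmap_dir / name
        try:
            value = attr.read_text().strip()
        except OSError as exc:       # e.g. no bitmap, or attribute absent
            value = f"<unreadable: {exc.strerror}>"
        print(f"md/bitmap/{name}: {value}")

if __name__ == "__main__":
    dump_bitmap_attrs()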
+
+.TP
+.B md/bitmap/time_base
+This shows the time (in seconds) between disk flushes, and is used when
+looking for bits in the bitmap that can be cleared.
+
+The default value is 5 seconds; the value written should be an unsigned long.
 
 .SS KERNEL PARAMETERS