MD(4)                                                            MD(4)



NAME
       md - Multiple Device driver aka Linux Software Raid

SYNOPSIS
       /dev/mdn
       /dev/md/n

DESCRIPTION
       The md driver provides virtual devices that are created
       from one or more independent underlying devices.  This
       array of devices often contains redundancy, and hence the
       acronym RAID which stands for a Redundant Array of
       Independent Devices.

       md supports RAID levels 1 (mirroring), 4 (striped array
       with parity device) and 5 (striped array with distributed
       parity information).  If a single underlying device fails
       while using one of these levels, the array will continue
       to function.

       md also supports a number of pseudo RAID (non-redundant)
       configurations including RAID0 (striped array), LINEAR
       (catenated array) and MULTIPATH (a set of different
       interfaces to the same device).


MD SUPER BLOCK
       With the exception of Legacy Arrays described below, each
       device that is incorporated into an MD array has a super
       block written towards the end of the device.  This
       superblock records information about the structure and
       state of the array so that the array can be reliably
       reassembled after a shutdown.

       The superblock is 4K long and is written into a 64K
       aligned block that starts at least 64K and less than 128K
       from the end of the device (i.e. to get the address of the
       superblock, round the size of the device down to a
       multiple of 64K and then subtract 64K).  The available
       size of each device is the amount of space before the
       superblock, so between 64K and 128K is lost when a device
       is incorporated into an MD array.
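
       The arithmetic above is simple to state in code.  The
       following is a minimal sketch of the address calculation
       (an illustration of the rule above, not code from the
       driver), assuming sizes are measured in bytes:

           #include <stdio.h>

           #define MD_RESERVED (64 * 1024)  /* 64K unit */

           /* Round the device size down to a multiple of 64K,
            * then subtract 64K: the superblock's byte offset. */
           static unsigned long long
           superblock_offset(unsigned long long dev_size)
           {
               return dev_size / MD_RESERVED * MD_RESERVED
                      - MD_RESERVED;
           }

           int main(void)
           {
               /* hypothetical 1000205 KiB device */
               unsigned long long size = 1000205ULL * 1024;
               printf("superblock at byte %llu\n",
                      superblock_offset(size));
               return 0;
           }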

       The superblock contains, among other things:

       LEVEL  The manner in which the devices are arranged into
              the array (linear, raid0, raid1, raid4, raid5,
              multipath).

       UUID   a 128 bit Universally Unique Identifier that
              identifies the array that this device is part of.


LEGACY ARRAYS
       Early versions of the md driver only supported Linear and
       Raid0 configurations and so did not use an MD superblock
       (as there is no state that needs to be recorded).  While
       it is strongly recommended that all newly created arrays
       utilise a superblock to help ensure that they are
       assembled properly, the md driver still supports legacy
       linear and raid0 md arrays that do not have a superblock.


LINEAR
       A linear array simply catenates the available space on
       each drive together to form one large virtual drive.

       One advantage of this arrangement over the more common
       RAID0 arrangement is that the array may be reconfigured at
       a later time with an extra drive, so the array can be made
       bigger without disturbing the data that is on the array.
       However, this cannot be done on a live array.
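
       As an illustration of the catenation (a hypothetical
       sketch, not the driver's code), mapping a virtual offset
       to a member device is a walk over the cumulative device
       sizes:

           /* Hypothetical sketch: find which member device a
            * virtual-array offset falls on.  sizes[] holds each
            * device's available size, same units as offset. */
           struct target { int dev; unsigned long long off; };

           static struct target
           linear_map(unsigned long long offset,
                      const unsigned long long sizes[], int ndev)
           {
               struct target t = { -1, 0 };
               for (int i = 0; i < ndev; i++) {
                   if (offset < sizes[i]) {
                       t.dev = i;      /* lands on this device */
                       t.off = offset;
                       break;
                   }
                   offset -= sizes[i]; /* skip this device */
               }
               return t;               /* dev == -1: past end */
           }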


RAID0
       A RAID0 array (which has zero redundancy) is also known as
       a striped array.  A RAID0 array is configured at creation
       with a Chunk Size which must be a multiple of 4 kibibytes.

       The RAID0 driver places the first chunk of the array on
       the first device, the second chunk on the second device,
       and so on until all drives have been assigned one chunk.
       This collection of chunks forms a stripe.  Further chunks
       are gathered into stripes in the same way, and are
       assigned to the remaining space in the drives.

       If the devices in the array are not all the same size,
       then once the smallest device has been exhausted, the
       RAID0 driver starts collecting chunks into smaller stripes
       that only span the drives which still have remaining
       space.
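
       For the common case where the devices are all the same
       size (so every stripe spans every drive), the placement
       rule can be sketched as follows; this is an illustration
       of the scheme above, not the driver's code:

           /* Hypothetical sketch: map an array chunk number to
            * a device and a chunk offset within that device,
            * assuming equal-sized devices. */
           struct chunk_loc { int dev; unsigned long long pos; };

           static struct chunk_loc
           raid0_map(unsigned long long chunk, int ndev)
           {
               struct chunk_loc loc;
               loc.dev = (int)(chunk % ndev); /* rotate */
               loc.pos = chunk / ndev;        /* stripe number */
               return loc;
           }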


RAID1
       A RAID1 array is also known as a mirrored set (though
       mirrors tend to provide reflected images, which RAID1 does
       not) or a plex.

       Once initialised, each device in a RAID1 array contains
       exactly the same data.  Changes are written to all devices
       in parallel.  Data is read from any one device.  The
       driver attempts to distribute read requests across all
       devices to maximise performance.

       All devices in a RAID1 array should be the same size.  If
       they are not, then only the amount of space available on
       the smallest device is used.  Any extra space on other
       devices is wasted.
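
       The behaviour just described can be sketched like this (a
       hypothetical illustration; the real driver uses more
       refined read balancing than a simple rotation):

           #include <stddef.h>

           /* Assumed helpers standing in for per-device I/O. */
           void dev_write(int dev, const void *buf,
                          unsigned long long off, size_t len);
           void dev_read(int dev, void *buf,
                         unsigned long long off, size_t len);

           static int next_read;  /* rotation counter */

           /* Writes fan out to every mirror. */
           static void raid1_write(const void *buf,
                   unsigned long long off, size_t len, int ndev)
           {
               for (int i = 0; i < ndev; i++)
                   dev_write(i, buf, off, len);
           }

           /* Any single mirror can satisfy a read. */
           static void raid1_read(void *buf,
                   unsigned long long off, size_t len, int ndev)
           {
               dev_read(next_read++ % ndev, buf, off, len);
           }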


RAID4
       A RAID4 array is like a RAID0 array with an extra device
       for storing parity.  Unlike RAID0, RAID4 also requires
       that all stripes span all drives, so extra space on
       devices that are larger than the smallest is wasted.

       When any block in a RAID4 array is modified, the parity
       block for that stripe (i.e. the block in the parity device
       at the same device offset as the stripe) is also modified
       so that the parity block always contains the "parity" for
       the whole stripe, i.e. its contents are equivalent to the
       result of performing an exclusive-or operation between all
       the data blocks in the stripe.

       This allows the array to continue to function if one
       device fails.  The data that was on that device can be
       calculated as needed from the parity block and the other
       data blocks.
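
       Because the parity is a plain exclusive-or, computing it
       and reconstructing a lost block are the same operation.  A
       minimal sketch (illustrative only, not the driver's code):

           #include <stddef.h>

           /* XOR a set of blocks together.  Called on the data
            * blocks it yields the parity block; called on the
            * parity block plus the surviving data blocks it
            * yields the missing data block. */
           static void xor_blocks(unsigned char *out,
                                  unsigned char *const blk[],
                                  int nblk, size_t blocksize)
           {
               for (size_t i = 0; i < blocksize; i++) {
                   unsigned char p = 0;
                   for (int b = 0; b < nblk; b++)
                       p ^= blk[b][i];
                   out[i] = p;
               }
           }

       Reconstruction works because x ^ x = 0: XOR-ing the parity
       with all surviving data blocks cancels everything except
       the missing block.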


RAID5
       RAID5 is very similar to RAID4.  The difference is that
       the parity blocks for each stripe, instead of being on a
       single device, are distributed across all devices.  This
       allows more parallelism when writing, as two different
       block updates will quite possibly affect parity blocks on
       different devices, so there is less contention.

       This also allows more parallelism when reading, as read
       requests are distributed over all the devices in the array
       instead of all but one.
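
       As an illustration of the distribution (a hypothetical
       round-robin placement; the md driver supports several
       specific parity layouts), the parity block can simply move
       to a different device on each successive stripe:

           /* Hypothetical rotating parity placement. */
           static int
           raid5_parity_dev(unsigned long long stripe, int ndev)
           {
               return (int)(stripe % ndev);
           }

           /* Map the k-th data chunk of a stripe to a device,
            * skipping the device holding this stripe's parity. */
           static int
           raid5_data_dev(unsigned long long stripe, int k,
                          int ndev)
           {
               int pd = raid5_parity_dev(stripe, ndev);
               return (k < pd) ? k : k + 1;
           }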


MULTIPATH
       MULTIPATH is not really a RAID at all, as there is only
       one real device in a MULTIPATH md array.  However, there
       are multiple access points (paths) to this device, and one
       of these paths might fail, so there are some similarities.

       A MULTIPATH array is composed of a number of different
       devices, often fibre channel interfaces, that all refer to
       the same real device.  If one of these interfaces fails
       (e.g. due to cable problems), the multipath driver will
       attempt to redirect requests to another interface.
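
       Failover of this kind can be sketched as trying each path
       in turn until one succeeds (a hypothetical illustration,
       not the driver's code):

           #include <stdbool.h>

           /* Assumed helper: submit a request on one path. */
           bool submit_on_path(int path, void *req);

           /* Try the working paths in order; mark a path bad
            * when a request on it fails, and fail over. */
           static int multipath_submit(void *req, bool ok[],
                                       int npaths)
           {
               for (int p = 0; p < npaths; p++) {
                   if (!ok[p])
                       continue;        /* already failed */
                   if (submit_on_path(p, req))
                       return p;        /* request went through */
                   ok[p] = false;       /* fail over to next */
               }
               return -1;               /* no working path */
           }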


UNCLEAN SHUTDOWN
       When changes are made to a RAID1, RAID4, or RAID5 array,
       there is a possibility of inconsistency for short periods
       of time, as each update requires at least two blocks to be
       written to different devices, and these writes probably
       won't happen at exactly the same time.  Thus, if a system
       with one of these arrays is shut down in the middle of a
       write operation (e.g. due to power failure), the array may
       not be consistent.

       To handle this situation, the md driver marks an array as
       "dirty" before writing any data to it, and marks it as
       "clean" when the array is being disabled, e.g. at
       shutdown.  If the md driver finds an array to be dirty at
       startup, it proceeds to correct any possible
       inconsistency.  For RAID1, this involves copying the
       contents of the first drive onto all other drives.  For
       RAID4 or RAID5 this involves recalculating the parity for
       each stripe and making sure that the parity block has the
       correct data.
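
       The ordering of the dirty and clean marks is the essence
       of the scheme.  A hypothetical sketch of the logic (not
       the driver's implementation):

           /* Assumed helpers for the sketch. */
           void set_dirty_flag(int dirty); /* update superblock */
           int  dirty_flag(void);
           void resync_array(void);  /* copy mirror or redo
                                      * parity, as appropriate */

           void md_start(void)
           {
               if (dirty_flag())
                   resync_array();  /* unclean shutdown */
           }

           void md_before_first_write(void)
           {
               set_dirty_flag(1);   /* dirty before any data */
           }

           void md_stop(void)
           {
               set_dirty_flag(0);   /* clean only when idle */
           }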

       If a RAID4 or RAID5 array is degraded (missing one drive)
       when it is restarted after an unclean shutdown, it cannot
       recalculate parity, and so it is possible that data might
       be undetectably corrupted.  The md driver currently does
       not alert the operator to this condition.  It should
       probably fail to start an array in this condition without
       manual intervention.


RECOVERY
       If the md driver detects any error on a device in a RAID1,
       RAID4, or RAID5 array, it immediately disables that device
       (marking it as faulty) and continues operation on the
       remaining devices.  If there is a spare drive, the driver
       will start recreating, on one of the spare drives, the
       data that was on the failed drive, either by copying from
       a working drive in a RAID1 configuration, or by doing
       calculations with the parity block for RAID4 and RAID5.

       While this recovery process is happening, the md driver
       will monitor accesses to the array and will slow down the
       rate of recovery if other activity is happening, so that
       normal access to the array will not be unduly affected.
       When no other activity is happening, the recovery process
       proceeds at full speed.  The actual speed targets for the
       two different situations can be controlled by the
       speed_limit_min and speed_limit_max control files
       mentioned below.


FILES
       /proc/mdstat
              Contains information about the status of currently
              running arrays.

       /proc/sys/dev/raid/speed_limit_min
              A readable and writable file that reflects the
              current goal rebuild speed for times when
              non-rebuild activity is current on an array.  The
              speed is in Kibibytes per second, and is a
              per-device rate, not a per-array rate (which means
              that an array with more discs will shuffle more
              data for a given speed).  The default is 100.

       /proc/sys/dev/raid/speed_limit_max
              A readable and writable file that reflects the
              current goal rebuild speed for times when no
              non-rebuild activity is current on an array.  The
              default is 100,000.
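
       Both files are ordinary readable and writable files, so
       the limits can be inspected and changed with plain file
       I/O.  A minimal sketch (the paths are those above; the
       value 1000 is an arbitrary example and writing requires
       the appropriate privilege):

           #include <stdio.h>

           int main(void)
           {
               const char *p =
                   "/proc/sys/dev/raid/speed_limit_min";
               char buf[32];
               FILE *f;

               /* Read the current per-device minimum rebuild
                * speed, in Kibibytes per second. */
               if ((f = fopen(p, "r")) != NULL) {
                   if (fgets(buf, sizeof(buf), f))
                       printf("speed_limit_min = %s", buf);
                   fclose(f);
               }

               /* Raise the minimum to 1000 KiB/s per device. */
               if ((f = fopen(p, "w")) != NULL) {
                   fprintf(f, "1000\n");
                   fclose(f);
               }
               return 0;
           }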

SEE ALSO
       mdadm(8), mkraid(8).



                                                                 MD(4)