Use O_EXCL when opening component devices to be assembled into an array

diff --git a/md.4 b/md.4

index 87f5a35df2c3f3d9e3f22f3f5609b5d5abc529b3..4643dd2598f1cae9843c24cac1c578f7a9b2646f 100644
--- a/md.4
+++ b/md.4
@@ -9,52 +9,275 @@ md \- Multiple Device driver aka Linux Software Raid
  The
  .B md
  driver provides virtual devices that are created from one or more
-independant underlying devices.  This array of devices often contains
+independent underlying devices.  This array of devices often contains
  redundancy, and hence the acronym RAID which stands for a Redundant
-Array of Independant Devices.
+Array of Independent Devices.
  .PP
  .B md
-support RAID levels 1 (mirroring) 4 (striped array with parity device) and 5
-(striped array with distributed parity information.  If a single underlying
-device fails while using one of these level, they array will continue
-to function.
+supports RAID levels 1 (mirroring), 4 (striped array with parity
+device), 5 (striped array with distributed parity information) and 6
+(striped array with distributed dual redundancy information).  If
+some number of underlying devices fail while using one of these
+levels, the array will continue to function; this number is one for
+RAID levels 4 and 5, two for RAID level 6, and all but one (N-1) for
+RAID level 1.
  .PP
  .B md
-also supports a number of pseudo RAID (non-redundant) configuations
+also supports a number of pseudo RAID (non-redundant) configurations
  including RAID0 (striped array), LINEAR (catenated array) and
  MULTIPATH (a set of different interfaces to the same device).
  
-.SS RAID SUPER BLOCK
+.SS MD SUPER BLOCK
  With the exception of Legacy Arrays described below, each device that
-is incorportated into an MD array has a
+is incorporated into an MD array has a
  .I super block
  written towards the end of the device.  This superblock records
  information about the structure and state of the array so that the
-array an be reliably re-assembled after a shutdown.
+array can be reliably re-assembled after a shutdown.
  
  The superblock is 4K long and is written into a 64K aligned block that
-start at least 64K and less than 128K from the end of the device
+starts at least 64K and less than 128K from the end of the device
  (i.e. to get the address of the superblock round the size of the
  device down to a multiple of 64K and then subtract 64K).
-The available size of each device is the ammount of space before the
+The available size of each device is the amount of space before the
  super block, so between 64K and 128K is lost when a device is
  incorporated into an MD array.
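
The placement rule above can be written out as a small calculation. This is only a sketch of the arithmetic described in the text, not code taken from the md driver; the function names and the choice of bytes as the unit are assumptions made for illustration.

```python
KIB = 1024
RESERVED = 64 * KIB  # the 64K alignment and offset stated in the text


def superblock_offset(device_size: int) -> int:
    """Byte offset of the superblock for a device of device_size bytes.

    Follows the rule in the text: round the device size down to a
    multiple of 64K, then subtract 64K.
    """
    if device_size < RESERVED:
        raise ValueError("device is too small to hold a superblock")
    return (device_size // RESERVED) * RESERVED - RESERVED


def available_size(device_size: int) -> int:
    """Space before the superblock, i.e. what the array can use."""
    return superblock_offset(device_size)


if __name__ == "__main__":
    size = 1024 * KIB + 1000  # hypothetical device, slightly over 1 MiB
    print(superblock_offset(size), size - available_size(size))
```

For any device size the difference between the device size and the available size falls between 64K and 128K, which matches the loss described above.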
  
  The superblock contains, among other things:
  .TP
  LEVEL
-The 
+The manner in which the devices are arranged into the array
+(linear, raid0, raid1, raid4, raid5, multipath).
  .TP
  UUID
  a 128 bit Universally Unique Identifier that identifies the array that
  this device is part of.
  
+.SS LEGACY ARRAYS
+Early versions of the
+.B md
+driver only supported Linear and Raid0 configurations and so
+did not use an MD superblock (as there is no state that needs to be
+recorded).  While it is strongly recommended that all newly created
+arrays utilise a superblock to help ensure that they are assembled
+properly, the
+.B md
+driver still supports legacy linear and raid0 md arrays that
+do not have a superblock.
+
  .SS LINEAR
+
+A linear array simply catenates the available space on each
+drive together to form one large virtual drive.
+
+One advantage of this arrangement over the more common RAID0
+arrangement is that the array may be reconfigured at a later time with
+an extra drive and so the array is made bigger without disturbing the
+data that is on the array.  However this cannot be done on a live
+array.
+
+
  .SS RAID0
+
+A RAID0 array (which has zero redundancy) is also known as a
+striped array.
+A RAID0 array is configured at creation with a
+.B "Chunk Size" 
+which must be a power of two, and at least 4 kibibytes.
+
+The RAID0 driver assigns the first chunk of the array to the first
+device, the second chunk to the second device, and so on until all
+drives have been assigned one chunk.  This collection of chunks forms
+a
+.BR stripe .
+Further chunks are gathered into stripes in the same way, and these
+stripes are assigned to the remaining space on the drives.
+
+If devices in the array are not all the same size, then once the
+smallest device has been exhausted, the RAID0 driver starts
+collecting chunks into smaller stripes that only span the drives which
+still have remaining space.
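
The chunk assignment described above, including the narrower stripes used once smaller devices are full, can be sketched as follows. This is an illustration of the description only, not the driver's implementation; the function name `raid0_locate`, the use of chunk counts as device sizes, and the assumption that devices keep their original order within every stripe are all hypothetical.

```python
def raid0_locate(chunk_index, device_chunks):
    """Map a logical chunk number to (device index, chunk offset on device).

    device_chunks[i] is the capacity of device i, counted in chunks.
    """
    if chunk_index < 0:
        raise IndexError("chunk index must be non-negative")
    start = 0                # per-device offset where the current zone begins
    remaining = chunk_index  # chunks still to skip before reaching the target
    for limit in sorted(set(device_chunks)):
        # Only devices with space beyond `start` take part in this zone.
        members = [d for d, size in enumerate(device_chunks) if size > start]
        zone_chunks = (limit - start) * len(members)
        if remaining < zone_chunks:
            stripe, position = divmod(remaining, len(members))
            return members[position], start + stripe
        remaining -= zone_chunks
        start = limit
    raise IndexError("chunk index beyond end of array")


if __name__ == "__main__":
    sizes = [2, 4, 4]  # hypothetical: one small device, two larger ones
    for n in range(10):
        print(n, raid0_locate(n, sizes))
```

With the sizes above, the first six chunks form two stripes across all three devices, and the last four chunks form two stripes across the two larger devices only.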
+
+
  .SS RAID1
+
+A RAID1 array is also known as a mirrored set (though mirrors tend to
+provide reflected images, which RAID1 does not) or a plex.
+
+Once initialised, each device in a RAID1 array contains exactly the
+same data.  Changes are written to all devices in parallel.  Data is
+read from any one device.  The driver attempts to distribute read
+requests across all devices to maximise performance.
+
+All devices in a RAID1 array should be the same size.  If they are
+not, then only the amount of space available on the smallest device is
+used.  Any extra space on other devices is wasted.
+
  .SS RAID4
+
+A RAID4 array is like a RAID0 array with an extra device for storing
+parity. This device is the last of the active devices in the
+array. Unlike RAID0, RAID4 also requires that all stripes span all
+drives, so extra space on devices that are larger than the smallest is
+wasted.
+
+When any block in a RAID4 array is modified the parity block for that
+stripe (i.e. the block in the parity device at the same device offset
+as the stripe) is also modified so that the parity block always
+contains the "parity" for the whole stripe, i.e. its contents are
+equivalent to the result of performing an exclusive-or operation
+between all the data blocks in the stripe.
+
+This allows the array to continue to function if one device fails.
+The data that was on that device can be calculated as needed from the
+parity block and the other data blocks.
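
The exclusive-or relationship, and the way a lost block is recovered from it, can be shown with a few bytes. This is a minimal sketch of the arithmetic only; the helper name `xor_blocks` and the sample data are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise exclusive-or of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks that make up one stripe.
data = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# If the device holding data[1] fails, its contents can be recalculated
# from the parity block and the surviving data blocks.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

The same calculation recovers any single missing block, because XOR-ing a value into the result twice cancels it out.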
+
  .SS RAID5
-.SS REBUILD/RESYNC
+
+RAID5 is very similar to RAID4.  The difference is that the parity
+blocks for each stripe, instead of being on a single device, are
+distributed across all devices.  This allows more parallelism when
+writing as two different block updates will quite possibly affect
+parity blocks on different devices so there is less contention.
+
+This also allows more parallelism when reading as read requests are
+distributed over all the devices in the array instead of all but one.
+
+.SS RAID6
+
+RAID6 is similar to RAID5, but can handle the loss of any \fItwo\fP
+devices without data loss.  Accordingly, it requires N+2 drives to
+store N drives worth of data.
+
+The performance for RAID6 is slightly lower but comparable to RAID5 in
+normal mode and single disk failure mode.  It is very slow in dual
+disk failure mode, however.
+
+.SS MULTIPATH
+
+MULTIPATH is not really a RAID at all as there is only one real device
+in a MULTIPATH md array.  However there are multiple access points
+(paths) to this device, and one of these paths might fail, so there
+are some similarities.
+
+A MULTIPATH array is composed of a number of logically different
+devices, often fibre channel interfaces, that all refer to the same
+real device. If one of these interfaces fails (e.g. due to cable
+problems), the multipath driver will attempt to redirect requests to
+another interface.
+
+.SS FAULTY
+The FAULTY md module is provided for testing purposes.  A faulty array
+has exactly one component device and is normally assembled without a
+superblock, so the md array created provides direct access to all of
+the data in the component device.
+
+The FAULTY module may be requested to simulate faults to allow testing
+of other md levels or of filesystems.  Faults can be chosen to trigger
+on read requests or write requests, and can be transient (a subsequent
+read/write at the address will probably succeed) or persistent
+(subsequent read/write of the same address will fail).  Further, read
+faults can be "fixable", meaning that they persist until a write
+request at the same address.
+
+Fault types can be requested with a period.  In this case the fault
+will recur repeatedly after the given number of requests of the
+relevant type.  For example if persistent read faults have a period of
+100, then every 100th read request would generate a fault, and the
+faulty sector would be recorded so that subsequent reads on that
+sector would also fail.
+
+There is a limit to the number of faulty sectors that are remembered.
+Faults generated after this limit is exhausted are treated as
+transient.
+
+The list of faulty sectors can be flushed, and the active list of
+failure modes can be cleared.
+
+.SS UNCLEAN SHUTDOWN
+
+When changes are made to a RAID1, RAID4, RAID5 or RAID6 array there is a
+possibility of inconsistency for short periods of time as each update
+requires at least two blocks to be written to different devices, and
+these writes probably won't happen at exactly the same time.
+Thus if a system with one of these arrays is shut down in the middle of
+a write operation (e.g. due to power failure), the array may not be
+consistent.
+
+To handle this situation, the md driver marks an array as "dirty"
+before writing any data to it, and marks it as "clean" when the array
+is being disabled, e.g. at shutdown.  If the md driver finds an array
+to be dirty at startup, it proceeds to correct any possible
+inconsistency.  For RAID1, this involves copying the contents of the
+first drive onto all other drives.  For RAID4, RAID5 and RAID6 this
+involves recalculating the parity for each stripe and making sure that
+the parity block has the correct data.  This process, known as
+"resynchronising" or "resync" is performed in the background.  The
+array can still be used, though possibly with reduced performance.
+
+If a RAID4, RAID5 or RAID6 array is degraded (missing at least one
+drive) when it is restarted after an unclean shutdown, it cannot
+recalculate parity, and so it is possible that data might be
+undetectably corrupted.  The 2.4 md driver
+.B does not
+alert the operator to this condition.  The 2.5 md driver will fail to
+start an array in this condition without manual intervention.
+
+.SS RECOVERY
+
+If the md driver detects any error on a device in a RAID1, RAID4,
+RAID5 or RAID6 array, it immediately disables that device (marking it
+as faulty) and continues operation on the remaining devices.  If there
+is a spare drive, the driver will start recreating on one of the spare
+drives the data that was on the failed drive, either by copying a
+working drive in a RAID1 configuration, or by doing calculations with
+the parity block on RAID4, RAID5 or RAID6.
+
+While this recovery process is happening, the md driver will monitor
+accesses to the array and will slow down the rate of recovery if other
+activity is happening, so that normal access to the array will not be
+unduly affected.  When no other activity is happening, the recovery
+process proceeds at full speed.  The actual speed targets for the two
+different situations can be controlled by the
+.B speed_limit_min
+and
+.B speed_limit_max
+control files mentioned below.
+
+.SS KERNEL PARAMETERS
+
+The md driver recognises three different kernel parameters.
+.TP
+.B raid=noautodetect
+This will disable the normal detection of md arrays that happens at
+boot time.  If a drive is partitioned with MS-DOS style partitions,
+then if any of the 4 main partitions has a partition type of 0xFD,
+then that partition will normally be inspected to see if it is part of
+an MD array, and if any full arrays are found, they are started.  This
+kernel parameter disables this behaviour.
+
+.TP
+.BI md= n , dev , dev ,...
+This tells the md driver to assemble
+.BI /dev/md n
+from the listed devices.  It is only necessary to start the device
+holding the root filesystem this way.  Other arrays are best started
+once the system is booted.
+
+.TP
+.BI md= n , l , c , i , dev...
+This tells the md driver to assemble a legacy RAID0 or LINEAR array
+without a superblock.
+.I n
+gives the md device number,
+.I l
+gives the level, 0 for RAID0 or -1 for LINEAR,
+.I c
+gives the chunk size as a base-2 logarithm offset by twelve, so 0
+means 4K, 1 means 8K.
+.I i
+is ignored (legacy support).
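
The encoding of the chunk size parameter can be checked with a short calculation. This is a sketch of the arithmetic stated above (a base-2 logarithm offset by twelve), not code from the kernel; the function names are hypothetical.

```python
def chunk_size_from_param(c: int) -> int:
    """Chunk size in bytes for parameter value c: 2 ** (c + 12)."""
    if c < 0:
        raise ValueError("chunk parameter must not be negative")
    return 1 << (c + 12)


def param_from_chunk_size(size: int) -> int:
    """Inverse mapping; size must be a power of two of at least 4K."""
    if size < 4096 or size & (size - 1):
        raise ValueError("chunk size must be a power of two, at least 4096")
    return size.bit_length() - 1 - 12


if __name__ == "__main__":
    for c in range(5):
        print(c, chunk_size_from_param(c) // 1024, "K")
```

So a value of 0 gives 4K and 1 gives 8K, as stated above, and a 64K chunk would be written as 4.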
+
  .SH FILES
  .TP
  .B /proc/mdstat