Assemble: allow members of containers to be assembled and auto-assembled.

diff --git a/md.4 b/md.4

index 87f5a35df2c3f3d9e3f22f3f5609b5d5abc529b3..ea12eaffaf1570f040801ff518c5a62b4b655d12 100644
--- a/md.4
+++ b/md.4
@@ -1,6 +1,12 @@
+.\" Copyright Neil Brown and others.
+.\"   This program is free software; you can redistribute it and/or modify
+.\"   it under the terms of the GNU General Public License as published by
+.\"   the Free Software Foundation; either version 2 of the License, or
+.\"   (at your option) any later version.
+.\" See file COPYING in distribution for details.
  .TH MD 4
  .SH NAME
-md \- Multiple Device driver aka Linux Software Raid
+md \- Multiple Device driver aka Linux Software RAID
  .SH SYNOPSIS
  .BI /dev/md n
  .br
@@ -9,67 +15,622 @@ md \- Multiple Device driver aka Linux Software Raid
  The
  .B md
  driver provides virtual devices that are created from one or more
-independant underlying devices.  This array of devices often contains
-redundancy, and hence the acronym RAID which stands for a Redundant
-Array of Independant Devices.
+independent underlying devices.  This array of devices often contains
+redundancy and the devices are often disk drives, hence the acronym RAID
+which stands for a Redundant Array of Independent Disks.
  .PP
  .B md
-support RAID levels 1 (mirroring) 4 (striped array with parity device) and 5
-(striped array with distributed parity information.  If a single underlying
-device fails while using one of these level, they array will continue
-to function.
+supports RAID levels
+1 (mirroring),
+4 (striped array with parity device),
+5 (striped array with distributed parity information),
+6 (striped array with distributed dual redundancy information), and
+10 (striped and mirrored).
+If some number of underlying devices fail while using one of these
+levels, the array will continue to function; this number is one for
+RAID levels 4 and 5, two for RAID level 6, all but one (N-1) for
+RAID level 1, and dependent on configuration for RAID level 10.
  .PP
  .B md
-also supports a number of pseudo RAID (non-redundant) configuations
-including RAID0 (striped array), LINEAR (catenated array) and
-MULTIPATH (a set of different interfaces to the same device).
-
-.SS RAID SUPER BLOCK
-With the exception of Legacy Arrays described below, each device that
-is incorportated into an MD array has a
-.I super block
-written towards the end of the device.  This superblock records
-information about the structure and state of the array so that the
-array an be reliably re-assembled after a shutdown.
-
-The superblock is 4K long and is written into a 64K aligned block that
-start at least 64K and less than 128K from the end of the device
+also supports a number of pseudo RAID (non-redundant) configurations
+including RAID0 (striped array), LINEAR (catenated array),
+MULTIPATH (a set of different interfaces to the same device),
+and FAULTY (a layer over a single device into which errors can be injected).
+
+.SS MD SUPER BLOCK
+Each device in an array may have a
+.I superblock
+which records information about the structure and state of the array.
+This allows the array to be reliably re-assembled after a shutdown.
+
+From Linux kernel version 2.6.10,
+.B md
+provides support for two different formats of this superblock, and
+other formats can be added.  Prior to this release, only one format was
+supported.
+
+The common format \(em known as version 0.90 \(em has
+a superblock that is 4K long and is written into a 64K aligned block that
+starts at least 64K and less than 128K from the end of the device
  (i.e. to get the address of the superblock round the size of the
  device down to a multiple of 64K and then subtract 64K).
-The available size of each device is the ammount of space before the
+The available size of each device is the amount of space before the
  super block, so between 64K and 128K is lost when a device in
  incorporated into an MD array.
+This superblock stores multi-byte fields in a processor-dependent
+manner, so arrays cannot easily be moved between computers with
+different processors.
+
+The new format \(em known as version 1 \(em has a superblock that is
+normally 1K long, but can be longer.  It is normally stored between 8K
+and 12K from the end of the device, on a 4K boundary, though
+variations can be stored at the start of the device (version 1.1) or 4K from
+the start of the device (version 1.2).
+This superblock format stores multibyte data in a
+processor-independent format and supports up to hundreds of
+component devices (version 0.90 only supports 28).
  
  The superblock contains, among other things:
  .TP
  LEVEL
-The 
+The manner in which the devices are arranged into the array
+(linear, raid0, raid1, raid4, raid5, raid10, multipath).
  .TP
  UUID
  a 128 bit Universally Unique Identifier that identifies the array that
-this device is part of.
+contains this device.
+
+When a version 0.90 array is being reshaped (e.g. adding extra devices
+to a RAID5), the version number is temporarily set to 0.91.  This
+ensures that if the reshape process is stopped in the middle (e.g. by
+a system crash) and the machine boots into an older kernel that does
+not support reshaping, then the array will not be assembled (which
+would cause data corruption) but will be left untouched until a kernel
+that can complete the reshape process is used.
+
+.SS ARRAYS WITHOUT SUPERBLOCKS
+While it is usually best to create arrays with superblocks so that
+they can be assembled reliably, there are some circumstances when an
+array without superblocks is preferred.  These include:
+.TP
+LEGACY ARRAYS
+Early versions of the
+.B md
+driver only supported Linear and Raid0 configurations and did not use
+a superblock (which is less critical with these configurations).
+While such arrays should be rebuilt with superblocks if possible,
+.B md
+continues to support them.
+.TP
+FAULTY
+Being a largely transparent layer over a different device, the FAULTY
+personality doesn't gain anything from having a superblock.
+.TP
+MULTIPATH
+It is often possible to detect devices which are different paths to
+the same storage directly rather than having a distinctive superblock
+written to the device and searched for on all paths.  In this case,
+a MULTIPATH array with no superblock makes sense.
+.TP
+RAID1
+In some configurations it might be desired to create a raid1
+configuration that does not use a superblock, and to maintain the state of
+the array elsewhere.  While not encouraged for general use, it does
+have special-purpose uses and is supported.
  
  .SS LINEAR
+
+A linear array simply catenates the available space on each
+drive to form one large virtual drive.
+
+One advantage of this arrangement over the more common RAID0
+arrangement is that the array may be reconfigured at a later time with
+an extra drive, so the array is made bigger without disturbing the
+data that is on the array.  This can even be done on a live
+array.
+
+If a chunksize is given with a LINEAR array, the usable space on each
+device is rounded down to a multiple of this chunksize.
+
  .SS RAID0
+
+A RAID0 array (which has zero redundancy) is also known as a
+striped array.
+A RAID0 array is configured at creation with a
+.B "Chunk Size" 
+which must be a power of two, and at least 4 kibibytes.
+
+The RAID0 driver assigns the first chunk of the array to the first
+device, the second chunk to the second device, and so on until all
+drives have been assigned one chunk.  This collection of chunks forms
+a
+.BR stripe .
+Further chunks are gathered into stripes in the same way, and are
+assigned to the remaining space in the drives.
+
+If devices in the array are not all the same size, then once the
+smallest device has been exhausted, the RAID0 driver starts
+collecting chunks into smaller stripes that only span the drives which
+still have remaining space.
+
+
  .SS RAID1
+
+A RAID1 array is also known as a mirrored set (though mirrors tend to
+provide reflected images, which RAID1 does not) or a plex.
+
+Once initialised, each device in a RAID1 array contains exactly the
+same data.  Changes are written to all devices in parallel.  Data is
+read from any one device.  The driver attempts to distribute read
+requests across all devices to maximise performance.
+
+All devices in a RAID1 array should be the same size.  If they are
+not, then only the amount of space available on the smallest device is
+used (any extra space on other devices is wasted).
+
+Note that the read balancing done by the driver does not give RAID1
+the same performance profile as RAID0; a single stream of
+sequential input will not be accelerated (e.g. a single dd), but
+multiple sequential streams or a random workload will use more than one
+spindle. In theory, having an N-disk RAID1 will allow N sequential
+threads to read from all disks.
+
  .SS RAID4
+
+A RAID4 array is like a RAID0 array with an extra device for storing
+parity. This device is the last of the active devices in the
+array. Unlike RAID0, RAID4 also requires that all stripes span all
+drives, so extra space on devices that are larger than the smallest is
+wasted.
+
+When any block in a RAID4 array is modified, the parity block for that
+stripe (i.e. the block in the parity device at the same device offset
+as the stripe) is also modified so that the parity block always
+contains the "parity" for the whole stripe.  I.e. its content is
+equivalent to the result of performing an exclusive-or operation
+between all the data blocks in the stripe.
+
+This allows the array to continue to function if one device fails.
+The data that was on that device can be calculated as needed from the
+parity block and the other data blocks.
+
  .SS RAID5
-.SS REBUILD/RESYNC
+
+RAID5 is very similar to RAID4.  The difference is that the parity
+blocks for each stripe, instead of being on a single device, are
+distributed across all devices.  This allows more parallelism when
+writing, as two different block updates will quite possibly affect
+parity blocks on different devices so there is less contention.
+
+This also allows more parallelism when reading, as read requests are
+distributed over all the devices in the array instead of all but one.
+
+.SS RAID6
+
+RAID6 is similar to RAID5, but can handle the loss of any \fItwo\fP
+devices without data loss.  Accordingly, it requires N+2 drives to
+store N drives worth of data.
+
+The performance for RAID6 is slightly lower but comparable to RAID5 in
+normal mode and single disk failure mode.  It is very slow in dual
+disk failure mode, however.
+
+.SS RAID10
+
+RAID10 provides a combination of RAID1 and RAID0, and is sometimes known
+as RAID1+0.  Every datablock is duplicated some number of times, and
+the resulting collection of datablocks is distributed over multiple
+drives.
+
+When configuring a RAID10 array, it is necessary to specify the number
+of replicas of each data block that are required (this will normally
+be 2) and whether the replicas should be 'near', 'offset' or 'far'.
+(Note that the 'offset' layout is only available from 2.6.18).
+
+When 'near' replicas are chosen, the multiple copies of a given chunk
+are laid out consecutively across the stripes of the array, so the two
+copies of a datablock will likely be at the same offset on two
+adjacent devices.
+
+When 'far' replicas are chosen, the multiple copies of a given chunk
+are laid out quite distant from each other.  The first copy of all
+data blocks will be striped across the early part of all drives in
+RAID0 fashion, and then the next copy of all blocks will be striped
+across a later section of all drives, always ensuring that all copies
+of any given block are on different drives.
+
+The 'far' arrangement can give sequential read performance equal to
+that of a RAID0 array, but at the cost of reduced write performance.
+
+When 'offset' replicas are chosen, the multiple copies of a given
+chunk are laid out on consecutive drives and at consecutive offsets.
+Effectively each stripe is duplicated and the copies are offset by one
+device.   This should give similar read characteristics to 'far' if a
+suitably large chunk size is used, but without as much seeking for
+writes.
+
+It should be noted that the number of devices in a RAID10 array need
+not be a multiple of the number of replicas of each data block; however,
+there must be at least as many devices as replicas.
+
+If, for example, an array is created with 5 devices and 2 replicas,
+then space equivalent to 2.5 of the devices will be available, and
+every block will be stored on two different devices.
+
+Finally, it is possible to have an array with both 'near' and 'far'
+copies.  If an array is configured with 2 near copies and 2 far
+copies, then there will be a total of 4 copies of each block, each on
+a different drive.  This is an artifact of the implementation and is
+unlikely to be of real value.
+
+.SS MULTIPATH
+
+MULTIPATH is not really a RAID at all as there is only one real device
+in a MULTIPATH md array.  However there are multiple access points
+(paths) to this device, and one of these paths might fail, so there
+are some similarities.
+
+A MULTIPATH array is composed of a number of logically different
+devices, often fibre channel interfaces, that all refer to the same
+real device. If one of these interfaces fails (e.g. due to cable
+problems), the multipath driver will attempt to redirect requests to
+another interface. 
+
+.SS FAULTY
+The FAULTY md module is provided for testing purposes.  A faulty array
+has exactly one component device and is normally assembled without a
+superblock, so the md array created provides direct access to all of
+the data in the component device.
+
+The FAULTY module may be requested to simulate faults to allow testing
+of other md levels or of filesystems.  Faults can be chosen to trigger
+on read requests or write requests, and can be transient (a subsequent
+read/write at the address will probably succeed) or persistent
+(subsequent read/write of the same address will fail).  Further, read
+faults can be "fixable" meaning that they persist until a write
+request at the same address.
+
+Fault types can be requested with a period.  In this case, the fault
+will recur repeatedly after the given number of requests of the
+relevant type.  For example if persistent read faults have a period of
+100, then every 100th read request would generate a fault, and the
+faulty sector would be recorded so that subsequent reads on that
+sector would also fail.
+
+There is a limit to the number of faulty sectors that are remembered.
+Faults generated after this limit is exhausted are treated as
+transient.
+
+The list of faulty sectors can be flushed, and the active list of
+failure modes can be cleared.
+
+.SS UNCLEAN SHUTDOWN
+
+When changes are made to a RAID1, RAID4, RAID5, RAID6, or RAID10 array
+there is a possibility of inconsistency for short periods of time as
+each update requires at least two blocks to be written to different
+devices, and these writes probably won't happen at exactly the same
+time.  Thus if a system with one of these arrays is shut down in the
+middle of a write operation (e.g. due to power failure), the array may
+not be consistent.
+
+To handle this situation, the md driver marks an array as "dirty"
+before writing any data to it, and marks it as "clean" when the array
+is being disabled, e.g. at shutdown.  If the md driver finds an array
+to be dirty at startup, it proceeds to correct any possible
+inconsistency.  For RAID1, this involves copying the contents of the
+first drive onto all other drives.  For RAID4, RAID5 and RAID6 this
+involves recalculating the parity for each stripe and making sure that
+the parity block has the correct data.  For RAID10 it involves copying
+one of the replicas of each block onto all the others.  This process,
+known as "resynchronising" or "resync" is performed in the background.
+The array can still be used, though possibly with reduced performance.
+
+If a RAID4, RAID5 or RAID6 array is degraded (missing at least one
+drive, two for RAID6) when it is restarted after an unclean shutdown, it cannot
+recalculate parity, and so it is possible that data might be
+undetectably corrupted.  The 2.4 md driver
+.B does not
+alert the operator to this condition.  The 2.6 md driver will fail to
+start an array in this condition without manual intervention, though
+this behaviour can be overridden by a kernel parameter.
+
+.SS RECOVERY
+
+If the md driver detects a write error on a device in a RAID1, RAID4,
+RAID5, RAID6, or RAID10 array, it immediately disables that device
+(marking it as faulty) and continues operation on the remaining
+devices.  If there are spare drives, the driver will start recreating
+on one of the spare drives the data which was on that failed drive,
+either by copying a working drive in a RAID1 configuration, or by
+doing calculations with the parity block on RAID4, RAID5 or RAID6, or
+by finding and copying originals for RAID10.
+
+In kernels prior to about 2.6.15, a read error would cause the same
+effect as a write error.  In later kernels, a read error will instead
+cause md to attempt a recovery by overwriting the bad block, i.e. it
+will find the correct data from elsewhere, write it over the block
+that failed, and then try to read it back again.  If either the write
+or the re-read fails, md will treat the error the same way that a write
+error is treated, and will fail the whole device.
+
+While this recovery process is happening, the md driver will monitor
+accesses to the array and will slow down the rate of recovery if other
+activity is happening, so that normal access to the array will not be
+unduly affected.  When no other activity is happening, the recovery
+process proceeds at full speed.  The actual speed targets for the two
+different situations can be controlled by the
+.B speed_limit_min
+and
+.B speed_limit_max
+control files mentioned below.
+
+.SS BITMAP WRITE-INTENT LOGGING
+
+From Linux 2.6.13,
+.I md
+supports a bitmap based write-intent log.  If configured, the bitmap
+is used to record which blocks of the array may be out of sync.
+Before any write request is honoured, md will make sure that the
+corresponding bit in the log is set.  After a period of time with no
+writes to an area of the array, the corresponding bit will be cleared.
+
+This bitmap is used for two optimisations.
+
+Firstly, after an unclean shutdown, the resync process will consult
+the bitmap and only resync those blocks that correspond to bits in the
+bitmap that are set.  This can dramatically reduce resync time.
+
+Secondly, when a drive fails and is removed from the array, md stops
+clearing bits in the intent log.  If that same drive is re-added to
+the array, md will notice and will only recover the sections of the
+drive that are covered by bits in the intent log that are set.  This
+can allow a device to be temporarily removed and reinserted without
+causing an enormous recovery cost.
+
+The intent log can be stored in a file on a separate device, or it can
+be stored near the superblocks of an array which has superblocks.
+
+It is possible to add an intent log to an active array, or remove an
+intent log if one is present.
+
+In 2.6.13, intent bitmaps are only supported with RAID1.  Other levels
+with redundancy are supported from 2.6.15.
+
+.SS WRITE-BEHIND
+
+From Linux 2.6.14,
+.I md
+supports WRITE-BEHIND on RAID1 arrays.
+
+This allows certain devices in the array to be flagged as
+.IR write-mostly .
+MD will only read from such devices if there is no
+other option.
+
+If a write-intent bitmap is also provided, write requests to
+write-mostly devices will be treated as write-behind requests and md
+will not wait for writes to those devices to complete before
+reporting the write as complete to the filesystem.
+
+This allows for a RAID1 with WRITE-BEHIND to be used to mirror data
+over a slow link to a remote computer (providing the link isn't too
+slow).  The extra latency of the remote link will not slow down normal
+operations, but the remote system will still have a reasonably
+up-to-date copy of all data.
+
+.SS RESTRIPING
+
+.IR Restriping ,
+also known as
+.IR Reshaping ,
+is the process of re-arranging the data stored in each stripe into a
+new layout.  This might involve changing the number of devices in the
+array (so the stripes are wider), changing the chunk size (so stripes
+are deeper or shallower), or changing the arrangement of data and
+parity (possibly changing the raid level, e.g. 1 to 5 or 5 to 6).
+
+As of Linux 2.6.17, md can reshape a raid5 array to have more
+devices.  Other possibilities may follow in future kernels.
+
+During any restripe process there is a 'critical section' during which
+live data is being overwritten on disk.  For the operation of
+increasing the number of drives in a raid5, this critical section
+covers the first few stripes (the number being the product of the old
+and new number of devices).  After this critical section is passed,
+data is only written to areas of the array which no longer hold live
+data \(em the live data has already been relocated away.
+
+md is not able to ensure data preservation if there is a crash
+(e.g. power failure) during the critical section.  If md is asked to
+start an array which failed during a critical section of restriping,
+it will fail to start the array.
+
+To deal with this possibility, a user-space program must
+.IP \(bu 4
+Disable writes to that section of the array (using the
+.B sysfs
+interface),
+.IP \(bu 4
+take a copy of the data somewhere (i.e. make a backup),
+.IP \(bu 4
+allow the process to continue and invalidate the backup and restore
+write access once the critical section is passed, and
+.IP \(bu 4
+provide for restoring the critical data before restarting the array
+after a system crash.
+.PP
+
+.B mdadm
+versions from 2.4 do this for growing a RAID5 array.
+
+For operations that do not change the size of the array, like simply
+increasing chunk size, or converting RAID5 to RAID6 with one extra
+device, the entire process is the critical section.  In this case, the
+restripe will need to progress in stages, as a section is suspended,
+backed up,
+restriped, and released; this is not yet implemented.
+
+.SS SYSFS INTERFACE
+Each block device appears as a directory in
+.I sysfs
+(which is usually mounted at
+.BR /sys ).
+For MD devices, this directory will contain a subdirectory called
+.B md
+which contains various files for providing access to information about
+the array.
+
+This interface is documented more fully in the file
+.B Documentation/md.txt
+which is distributed with the kernel sources.  That file should be
+consulted for full documentation.  The following are just a selection
+of attribute files that are available.
+
+.TP
+.B md/sync_speed_min
+This value, if set, overrides the system-wide setting in
+.B /proc/sys/dev/raid/speed_limit_min
+for this array only.
+Writing the value
+.B "system"
+to this file will cause the system-wide setting to have effect.
+
+.TP
+.B md/sync_speed_max
+This is the partner of
+.B md/sync_speed_min
+and overrides
+.B /proc/sys/dev/raid/speed_limit_max
+described below.
+
+.TP
+.B md/sync_action
+This can be used to monitor and control the resync/recovery process of
+MD.
+In particular, writing "check" here will cause the array to read all
+data blocks and check that they are consistent (e.g. parity is correct,
+or all mirror replicas are the same).  Any discrepancies found are
+.B NOT
+corrected.
+
+A count of problems found will be stored in
+.BR md/mismatch_count .
+
+Alternatively, "repair" can be written, which will cause the same check
+to be performed, but any errors will be corrected.
+
+Finally, "idle" can be written to stop the check/repair process.
+
+.TP
+.B md/stripe_cache_size
+This is only available on RAID5 and RAID6.  It records the size (in
+pages per device) of the stripe cache which is used for synchronising
+all write operations to the array and all read operations if the array
+is degraded.  The default is 256.  Valid values are 17 to 32768.
+Increasing this number can increase performance in some situations, at
+some cost in system memory.  Note, setting this value too high can
+result in an "out of memory" condition for the system.
+
+memory_consumed = system_page_size * nr_disks * stripe_cache_size
+
+.TP
+.B md/preread_bypass_threshold
+This is only available on RAID5 and RAID6.  This variable sets the
+number of times MD will service a full-stripe-write before servicing a
+stripe that requires some "prereading".  For fairness this defaults to
+1.  Valid values are 0 to stripe_cache_size.  Setting this to 0
+maximizes sequential-write throughput at the cost of fairness to threads
+doing small or random writes.  
+
+.SS KERNEL PARAMETERS
+
+The md driver recognises several different kernel parameters.
+.TP
+.B raid=noautodetect
+This will disable the normal detection of md arrays that happens at
+boot time.  If a drive is partitioned with MS-DOS style partitions,
+then if any of the 4 main partitions has a partition type of 0xFD,
+then that partition will normally be inspected to see if it is part of
+an MD array, and if any full arrays are found, they are started.  This
+kernel parameter disables this behaviour.
+
+.TP
+.B raid=partitionable
+.TP
+.B raid=part
+These are available in 2.6 and later kernels only.  They indicate that
+autodetected MD arrays should be created as partitionable arrays, with
+a different major device number to the original non-partitionable md
+arrays.  The device number is listed as
+.I mdp
+in
+.IR /proc/devices .
+
+.TP
+.B md_mod.start_ro=1
+This tells md to start all arrays in read-only mode.  This is a soft
+read-only that will automatically switch to read-write on the first
+write request.  However until that write request, nothing is written
+to any device by md, and in particular, no resync or recovery
+operation is started.
+
+.TP
+.B md_mod.start_dirty_degraded=1
+As mentioned above, md will not normally start a RAID4, RAID5, or
+RAID6 that is both dirty and degraded as this situation can imply
+hidden data loss.  This can be awkward if the root filesystem is
+affected.  Using this module parameter allows such arrays to be started
+at boot time.  It should be understood that there is a real (though
+small) risk of data corruption in this situation.
+
+.TP
+.BI md= n , dev , dev ,...
+.TP
+.BI md=d n , dev , dev ,...
+This tells the md driver to assemble
+.BI /dev/md n
+from the listed devices.  It is only necessary to start the device
+holding the root filesystem this way.  Other arrays are best started
+once the system is booted.
+
+In 2.6 kernels, the
+.B d
+immediately after the
+.B =
+indicates that a partitionable device (e.g.
+.BR /dev/md/d0 )
+should be created rather than the original non-partitionable device.
+
+.TP
+.BI md= n , l , c , i , dev...
+This tells the md driver to assemble a legacy RAID0 or LINEAR array
+without a superblock.
+.I n
+gives the md device number,
+.I l
+gives the level, 0 for RAID0 or -1 for LINEAR,
+.I c
+gives the chunk size as a base-2 logarithm offset by twelve, so 0
+means 4K, 1 means 8K.
+.I i
+is ignored (legacy support).
+
  .SH FILES
  .TP
  .B /proc/mdstat
  Contains information about the status of currently running array.
  .TP
  .B /proc/sys/dev/raid/speed_limit_min
-A readable and writable file that reflects the current goal rebuild
+A readable and writable file that reflects the current "goal" rebuild
  speed for times when non-rebuild activity is current on an array.
  The speed is in Kibibytes per second, and is a per-device rate, not a
-per-array rate (which means that an array with more disc will shuffle
+per-array rate (which means that an array with more disks will shuffle
  more data for a given speed).   The default is 100.
  
  .TP
  .B /proc/sys/dev/raid/speed_limit_max
-A readable and writable file that reflects the current goal rebuild
+A readable and writable file that reflects the current "goal" rebuild
  speed for times when no non-rebuild activity is current on an array.
  The default is 100,000.
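
The arithmetic described in the added text can be checked with a short Python sketch. It is illustrative only and is not part of the patch or of mdadm: the function names (`sb090_offset`, `xor_parity`, `legacy_chunk_size`, `stripe_cache_bytes`) are hypothetical, and the formulas are taken directly from the man page wording above (version 0.90 superblock placement, RAID4/RAID5 exclusive-or parity, the `c` field of the `md=n,l,c,i,dev...` parameter, and the `stripe_cache_size` memory formula).

```python
from functools import reduce

KIB = 1024


def sb090_offset(device_size: int) -> int:
    """Byte offset of a version 0.90 superblock, per the text: round the
    device size down to a multiple of 64K, then subtract 64K."""
    block = 64 * KIB
    return (device_size // block) * block - block


def xor_parity(blocks):
    """XOR equal-length byte blocks together, as described for the
    RAID4/RAID5 parity block. Feeding in the parity block plus all but
    one data block yields the missing data block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def legacy_chunk_size(c: int) -> int:
    """Chunk size in bytes for the 'c' field of md=n,l,c,i,dev...:
    a base-2 logarithm offset by twelve (0 means 4K, 1 means 8K)."""
    return 1 << (c + 12)


def stripe_cache_bytes(page_size: int, nr_disks: int, stripe_cache_size: int) -> int:
    """memory_consumed = system_page_size * nr_disks * stripe_cache_size"""
    return page_size * nr_disks * stripe_cache_size


if __name__ == "__main__":
    size = 1000 * KIB  # hypothetical device size
    print("0.90 superblock offset:", sb090_offset(size))
    print("space lost to superblock:", size - sb090_offset(size))

    data = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]
    parity = xor_parity(data)
    # Rebuild the second data block from parity and the surviving blocks.
    print("rebuilt block:", xor_parity([parity, data[0], data[2]]))

    print("stripe cache bytes:", stripe_cache_bytes(4096, 4, 256))
```

The space lost on a device (`size - sb090_offset(size)`) always falls between 64K and 128K, which matches the statement in the MD SUPER BLOCK section.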