mdmon-design.txt

author NeilBrown <neilb@suse.de>

Thu, 16 Dec 2010 11:12:26 +0000 (22:12 +1100)

committer NeilBrown <neilb@suse.de>

Thu, 16 Dec 2010 11:12:26 +0000 (22:12 +1100)
author NeilBrown <neilb@suse.de>
Thu, 16 Dec 2010 11:12:26 +0000 (22:12 +1100)
committer NeilBrown <neilb@suse.de>
Thu, 16 Dec 2010 11:12:26 +0000 (22:12 +1100)
diff --git a/mdmon-design.txt b/mdmon-design.txt

new file mode 100644 (file)

index 0000000..f09184a
--- /dev/null
+++ b/mdmon-design.txt
@@ -0,0 +1,146 @@
+
+When managing a RAID1 array which uses metadata other than the
+"native" metadata understood by the kernel, mdadm makes use of a
+partner program named 'mdmon' to manage some aspects of updating
+that metadata and synchronising the metadata with the array state.
+
+This document provides some details on how mdmon works.
+
+Containers
+----------
+
+As background: mdadm makes a distinction between an 'array' and a
+'container'.  Other sources sometimes use the term 'volume' or
+'device' for an 'array', and may use the term 'array' for a
+'container'.
+
+For our purposes:
+ - a 'container' is a collection of devices which are described by a
+   single set of metadata.  The metadata may be stored equally
+   on all devices, or different devices may have quite different
+   subsets of the total metadata.  But there is conceptually one set
+   of metadata that unifies the devices.
+
+ - an 'array' is a set of datablock from various devices which
+   together are used to present the abstraction of a single linear
+   sequence of block, which may provide data redundancy or enhanced
+   performance.
+
+So a container has some metadata and provides a number of arrays which
+are described by that metadata.
+
+Sometimes this model doesn't work perfectly.  For example, global
+spares may have their own metadata which is quite different from the
+metadata from any device that participates in one or more arrays.
+Such a global spare might still need to belong to some container so
+that it is available to be used should a failure arise.  In that case
+we consider the 'metadata' to be the union of the metadata on the
+active devices which describes the arrays, and the metadata on the
+global spares which only describes the spares.  In this case different
+devices in the one container will have quite different metadata.
+
+
+Purpose
+-------
+
+The main purpose of mdmon is to update the metadata in response to
+changes to the array which need to be reflected in the metadata before
+futures writes to the array can safely be performed.
+These include:
+ - transitions from 'clean' to 'dirty'.
+ - recording the devices have failed.
+ - recording the progress of a 'reshape'
+
+This requires mdmon to be running at any time that the array is
+writable (a read-only array does not require mdmon to be running).
+
+Because mdmon must be able to process these metadata updates at any
+time, it must (when running) have exclusive write access to the
+metadata.  Any other changes (e.g. reconfiguration of the array) must
+go through mdmon.
+
+A secondary role for mdmon is to activate spares when a device fails.
+This role is much less time-critical than the other metadata updates,
+so it could be performed by a separate process, possibly
+"mdadm --monitor" which has a related role of moving devices between
+arrays.  A main reason for including this functionality in mdmon is
+that in the native-metadata case this function is handled in the
+kernel, and mdmon's reason for existence to provide functionality
+which is otherwise handled by the kernel.
+
+
+Design overview
+---------------
+
+mdmon is structured as two threads with a common address space and
+common data structures.  These threads are know as the 'monitor' and
+the 'manager'.
+
+The 'monitor' has the primary role of monitoring the array for
+important state changes and updating the metadata accordingly.  As
+writes to the array can be blocked until 'monitor' completes and
+acknowledges the update, it much be very careful not to block itself.
+In particular it must not block waiting for any write to complete else
+it could deadlock.  This means that it must not allocate memory as
+doing this can require dirty memory to be written out and if the
+system choose to write to the array that mdmon is monitoring, the
+memory allocation could deadlock.
+
+So 'monitor' must never allocate memory and must limit the number of
+other system call it performs. It may:
+ - use select (or poll) to wait for activity on a file descriptor
+ - read from a sysfs file descriptor
+ - write to a sysfs file descriptor
+ - write the metadata out to the block devices using O_DIRECT
+ - send a signal (kill) to the manager thread
+
+It must not e.g. open files or do anything similar that might allocate
+resources.
+
+The 'manager' thread does everything else that is needed.  If any
+files are to be opened (e.g. because a device has been added to the
+array), the manager does that.  If any memory needs to be allocated
+(e.g. to hold data about a new array as can happen when one set of
+metadata describes several arrays), the manager performs that
+allocation.
+
+The 'manager' is also responsible for communicating with mdadm and
+assigning spares to replace failed devices.
+
+
+Handling metadata updates
+-------------------------
+
+There are a number of cases in which mdadm needs to update the
+metdata which mdmon is managing.  These include:
+ - creating a new array in an active container
+ - adding a device to a container
+ - reconfiguring an array
+etc.
+
+To complete these updates, mdadm must send a message to mdmon which
+will merge the update into the metadata as it is at that moment.
+
+To achieve this, mdmon creates a Unix Domain Socket which the manager
+thread listens on.  mdadm sends a message over this socket.  The
+manager thread examines the message to see if it will require
+allocating any memory and allocates it.  This is done in the
+'prepare_update' metadata method.
+
+The update message is then queued for handling by the monitor thread
+which it will do when convenient.  The monitor thread calls
+->process_update which should atomically make the required changes to
+the metadata, making use of the pre-allocate memory as required.  Any
+memory the is no-longer needed can be placed back in the request and
+the manager thread will free it.
+
+The exact format of a metadata update is up to the implementer of the
+metadata handlers.  It will simply describe a change that needs to be
+made.  It will sometimes contain fragments of the metadata to be
+copied in to place.  However the ->process_update routine must make
+sure not to over-write any field that the monitor thread might have
+updated, such as a 'device failed' or 'array is dirty' state.
+
+When the monitor thread has completed the update and written it to the
+devices, an acknowledgement message is sent back over the socket so
+that mdadm knows it is complete.
author	NeilBrown <neilb@suse.de>
	Thu, 16 Dec 2010 11:12:26 +0000 (22:12 +1100)
committer	NeilBrown <neilb@suse.de>
	Thu, 16 Dec 2010 11:12:26 +0000 (22:12 +1100)