When managing a RAID1 array which uses metadata other than the
"native" metadata understood by the kernel, mdadm makes use of a
partner program named 'mdmon' to manage some aspects of updating
that metadata and synchronising the metadata with the array state.

This document provides some details on how mdmon works.

Containers
----------

As background: mdadm makes a distinction between an 'array' and a
'container'.  Other sources sometimes use the term 'volume' or
'device' for an 'array', and may use the term 'array' for a
'container'.

For our purposes:
 - a 'container' is a collection of devices which are described by a
   single set of metadata.  The metadata may be stored equally
   on all devices, or different devices may have quite different
   subsets of the total metadata.  But there is conceptually one set
   of metadata that unifies the devices.

 - an 'array' is a set of data blocks from various devices which
   together are used to present the abstraction of a single linear
   sequence of blocks, which may provide data redundancy or enhanced
   performance.

So a container has some metadata and provides a number of arrays which
are described by that metadata.

Sometimes this model doesn't work perfectly.  For example, global
spares may have their own metadata which is quite different from the
metadata from any device that participates in one or more arrays.
Such a global spare might still need to belong to some container so
that it is available to be used should a failure arise.  In that case
we consider the 'metadata' to be the union of the metadata on the
active devices which describes the arrays, and the metadata on the
global spares which only describes the spares.  In this case different
devices in the one container will have quite different metadata.
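
As an illustration only, the container/array model above could be
written down as the following C structures.  Every name here is
hypothetical and none of it is taken from the mdadm sources; the
helper shows that a global spare is a member of the container while
contributing blocks to no array.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical; they only illustrate the model. */
struct extent {
    size_t dev;                 /* index into container.devices */
    unsigned long long start;   /* first block used on that device */
    unsigned long long len;     /* number of blocks used */
};

struct array {
    const char *name;
    const struct extent *extents;   /* blocks taken from member devices */
    size_t n_extents;
};

struct container {
    const char *const *devices;     /* every member device, spares included */
    size_t n_devices;
    const struct array *arrays;     /* arrays described by the metadata */
    size_t n_arrays;
};

/* Count the arrays that take blocks from device 'dev'.  A global spare
 * belongs to the container but contributes to no array, so it yields 0. */
size_t arrays_using_device(const struct container *c, size_t dev)
{
    size_t count = 0;

    assert(c != NULL && dev < c->n_devices);
    for (size_t a = 0; a < c->n_arrays; a++) {
        for (size_t e = 0; e < c->arrays[a].n_extents; e++) {
            if (c->arrays[a].extents[e].dev == dev) {
                count++;
                break;      /* count each array at most once */
            }
        }
    }
    return count;
}
```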


Purpose
-------

The main purpose of mdmon is to update the metadata in response to
changes to the array which need to be reflected in the metadata before
future writes to the array can safely be performed.
These include:
 - transitions from 'clean' to 'dirty'
 - recording that devices have failed
 - recording the progress of a 'reshape'

This requires mdmon to be running at any time that the array is
writable (a read-only array does not require mdmon to be running).

Because mdmon must be able to process these metadata updates at any
time, it must (when running) have exclusive write access to the
metadata.  Any other changes (e.g. reconfiguration of the array) must
go through mdmon.

A secondary role for mdmon is to activate spares when a device fails.
This role is much less time-critical than the other metadata updates,
so it could be performed by a separate process, possibly
"mdadm --monitor" which has a related role of moving devices between
arrays.  A main reason for including this functionality in mdmon is
that in the native-metadata case this function is handled in the
kernel, and mdmon's reason for existence is to provide functionality
which is otherwise handled by the kernel.


Design overview
---------------

mdmon is structured as two threads with a common address space and
common data structures.  These threads are known as the 'monitor' and
the 'manager'.

The 'monitor' has the primary role of monitoring the array for
important state changes and updating the metadata accordingly.  As
writes to the array can be blocked until 'monitor' completes and
acknowledges the update, it must be very careful not to block itself.
In particular it must not block waiting for any write to complete, else
it could deadlock.  This means that it must not allocate memory, as
doing this can require dirty memory to be written out, and if the
system chooses to write to the array that mdmon is monitoring, the
memory allocation could deadlock.

So 'monitor' must never allocate memory and must limit the number of
other system calls it performs.  It may:
 - use select (or poll) to wait for activity on a file descriptor
 - read from a sysfs file descriptor
 - write to a sysfs file descriptor
 - write the metadata out to the block devices using O_DIRECT
 - send a signal (kill) to the manager thread

It must not e.g. open files or do anything similar that might allocate
resources.
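
The division of labour described above can be sketched as follows.
This is not mdmon's code: the function names are hypothetical, and the
use of the 'except' set in select() reflects the way sysfs attributes
normally report a change.  The point is that the descriptor is opened
once, outside the monitor, and the monitor then only waits, rewinds
and reads into a buffer supplied by its caller.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Manager side (hypothetical helper): opening a file may allocate
 * kernel resources, so it is done here and never in the monitor. */
int open_attr(const char *path)
{
    return open(path, O_RDONLY);
}

/* Monitor side: wait until the attribute reports a change.  A sysfs
 * attribute signals a change as an exceptional condition, so the
 * descriptor goes in the 'except' set rather than the 'read' set. */
int wait_for_change(int fd)
{
    fd_set efds;

    FD_ZERO(&efds);
    FD_SET(fd, &efds);
    return select(fd + 1, NULL, NULL, &efds, NULL);
}

/* Monitor side: re-read the attribute through the already-open
 * descriptor into a caller-supplied buffer.  No allocation and no
 * open() take place.  Returns the length read, or -1 on error. */
ssize_t read_attr(int fd, char *buf, size_t len)
{
    ssize_t n;

    assert(buf != NULL && len > 0);
    if (lseek(fd, 0, SEEK_SET) < 0)     /* rewind before each read */
        return -1;
    n = read(fd, buf, len - 1);
    if (n < 0)
        return -1;
    while (n > 0 && buf[n - 1] == '\n') /* strip trailing newline */
        n--;
    buf[n] = '\0';
    return n;
}
```

A monitor loop built on this sketch would call wait_for_change() and
then read_attr() with a buffer that lives on its stack or in a
structure the manager allocated earlier.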

The 'manager' thread does everything else that is needed.  If any
files are to be opened (e.g. because a device has been added to the
array), the manager does that.  If any memory needs to be allocated
(e.g. to hold data about a new array as can happen when one set of
metadata describes several arrays), the manager performs that
allocation.

The 'manager' is also responsible for communicating with mdadm and
assigning spares to replace failed devices.


Handling metadata updates
-------------------------

There are a number of cases in which mdadm needs to update the
metadata which mdmon is managing.  These include:
 - creating a new array in an active container
 - adding a device to a container
 - reconfiguring an array
etc.

To complete these updates, mdadm must send a message to mdmon which
will merge the update into the metadata as it is at that moment.

To achieve this, mdmon creates a Unix Domain Socket which the manager
thread listens on.  mdadm sends a message over this socket.  The
manager thread examines the message to see if it will require
allocating any memory and, if so, allocates it.  This is done in the
->prepare_update metadata method.

The update message is then queued for handling by the monitor thread,
which processes it when convenient.  The monitor thread calls
->process_update which should atomically make the required changes to
the metadata, making use of the pre-allocated memory as required.  Any
memory that is no longer needed can be placed back in the request and
the manager thread will free it.
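
A minimal sketch of this hand-off follows.  The structures, the
"create:" request format and the sketch_* function names are all
assumptions made for illustration; the real method signatures and
update format belong to each metadata handler.  What the sketch shows
is only the ownership rule: the manager allocates, the monitor
consumes without allocating, and anything left in the request is
freed by the manager.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ARRAYS 4

/* Hypothetical in-memory metadata: just a list of array names. */
struct metadata {
    char *array_name[MAX_ARRAYS];
    int n_arrays;
};

/* Hypothetical update request as queued between the two threads. */
struct update {
    const char *request;    /* e.g. "create:<name>" (made-up format) */
    void *space;            /* memory pre-allocated by the manager */
    size_t space_len;
};

/* Manager thread: inspect the request and allocate whatever the
 * monitor will need.  Returns 0 on success, -1 if allocation fails. */
int sketch_prepare_update(struct update *u)
{
    assert(u != NULL && u->request != NULL);
    u->space = NULL;
    u->space_len = 0;
    if (strncmp(u->request, "create:", 7) == 0) {
        u->space_len = strlen(u->request + 7) + 1;
        u->space = malloc(u->space_len);
        if (u->space == NULL)
            return -1;
    }
    return 0;
}

/* Monitor thread: apply the change using only pre-allocated memory.
 * If the memory is consumed, ownership moves into the metadata;
 * otherwise it stays in the request for the manager to free. */
void sketch_process_update(struct metadata *md, struct update *u)
{
    if (u->space != NULL && md->n_arrays < MAX_ARRAYS) {
        memcpy(u->space, u->request + 7, u->space_len);
        md->array_name[md->n_arrays++] = u->space;
        u->space = NULL;
    }
}

/* Manager thread: release whatever the monitor left in the request. */
void sketch_finish_update(struct update *u)
{
    free(u->space);
    u->space = NULL;
}
```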

The exact format of a metadata update is up to the implementer of the
metadata handlers.  It will simply describe a change that needs to be
made.  It will sometimes contain fragments of the metadata to be
copied into place.  However, the ->process_update routine must make
sure not to overwrite any field that the monitor thread might have
updated, such as a 'device failed' or 'array is dirty' state.
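
One way to honour that rule, shown here purely as a sketch, is to mask
out the monitor-owned state when copying a fragment into place.  The
flag names, values and structure layout below are invented for the
example and do not correspond to any real metadata format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state bits. */
#define DEV_FAILED    0x1u  /* set by the monitor on device failure */
#define ARRAY_DIRTY   0x2u  /* set by the monitor on clean -> dirty */
#define DEV_IN_SYNC   0x4u  /* may be changed by an update from mdadm */
#define MONITOR_OWNED (DEV_FAILED | ARRAY_DIRTY)

struct dev_state {
    unsigned flags;
    unsigned long long data_offset;
};

/* Copy a fragment from an update into the live metadata, keeping
 * whatever the monitor has recorded in the bits it owns. */
void merge_fragment(struct dev_state *cur, const struct dev_state *frag)
{
    assert(cur != NULL && frag != NULL);
    cur->data_offset = frag->data_offset;
    cur->flags = (frag->flags & ~MONITOR_OWNED)
               | (cur->flags & MONITOR_OWNED);
}
```

With this masking, a fragment that was built before a device failed
cannot clear the 'failed' bit that the monitor has since recorded.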

When the monitor thread has completed the update and written it to the
devices, an acknowledgement message is sent back over the socket so
that mdadm knows it is complete.
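
The request/acknowledgement exchange can be sketched as below.  The
wire format used here (a 32-bit length followed by the payload, and a
single acknowledgement byte) is an assumption made for the example
and is not the format mdmon actually uses; the function names are
likewise hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define ACK_BYTE 'A'    /* hypothetical acknowledgement value */

/* Read exactly 'len' bytes, coping with short reads. */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Write exactly 'len' bytes, coping with short writes. */
static int write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* mdadm side: send one update, length first. */
int send_update(int fd, const char *msg)
{
    uint32_t len = (uint32_t)strlen(msg);

    if (write_full(fd, &len, sizeof len) < 0)
        return -1;
    return write_full(fd, msg, len);
}

/* mdadm side: block until mdmon confirms the update is on disk. */
int wait_for_ack(int fd)
{
    char ack;

    if (read_full(fd, &ack, 1) < 0)
        return -1;
    return ack == ACK_BYTE ? 0 : -1;
}

/* mdmon side: receive one update into a fixed buffer.
 * Returns the payload length, or -1 on error or oversize message. */
int receive_update(int fd, char *buf, size_t buflen)
{
    uint32_t len;

    assert(buf != NULL && buflen > 0);
    if (read_full(fd, &len, sizeof len) < 0 || len >= buflen)
        return -1;
    if (read_full(fd, buf, len) < 0)
        return -1;
    buf[len] = '\0';
    return (int)len;
}

/* mdmon side: sent only after the metadata has been written out. */
int send_ack(int fd)
{
    char ack = ACK_BYTE;

    return write_full(fd, &ack, 1);
}
```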