When managing a RAID1 array which uses metadata other than the
"native" metadata understood by the kernel, mdadm makes use of a
partner program named 'mdmon' to manage some aspects of updating
that metadata and synchronising the metadata with the array state.

This document provides some details on how mdmon works.

Containers
----------

As background: mdadm makes a distinction between an 'array' and a
'container'. Other sources sometimes use the term 'volume' or
'device' for an 'array', and may use the term 'array' for a
'container'.

For our purposes:
 - a 'container' is a collection of devices which are described by a
   single set of metadata. The metadata may be stored equally
   on all devices, or different devices may have quite different
   subsets of the total metadata. But there is conceptually one set
   of metadata that unifies the devices.

 - an 'array' is a set of data blocks from various devices which
   together are used to present the abstraction of a single linear
   sequence of blocks, which may provide data redundancy or enhanced
   performance.

So a container has some metadata and provides a number of arrays which
are described by that metadata.

Sometimes this model doesn't work perfectly. For example, global
spares may have their own metadata which is quite different from the
metadata from any device that participates in one or more arrays.
Such a global spare might still need to belong to some container so
that it is available to be used should a failure arise. In that case
we consider the 'metadata' to be the union of the metadata on the
active devices which describes the arrays, and the metadata on the
global spares which only describes the spares. In this case different
devices in the one container will have quite different metadata.


Purpose
-------

The main purpose of mdmon is to update the metadata in response to
changes to the array which need to be reflected in the metadata before
future writes to the array can safely be performed.
These include:
 - transitions from 'clean' to 'dirty'.
 - recording that devices have failed.
 - recording the progress of a 'reshape'.

This requires mdmon to be running at any time that the array is
writable (a read-only array does not require mdmon to be running).

Because mdmon must be able to process these metadata updates at any
time, it must (when running) have exclusive write access to the
metadata. Any other changes (e.g. reconfiguration of the array) must
go through mdmon.

A secondary role for mdmon is to activate spares when a device fails.
This role is much less time-critical than the other metadata updates,
so it could be performed by a separate process, possibly
"mdadm --monitor" which has a related role of moving devices between
arrays. A main reason for including this functionality in mdmon is
that in the native-metadata case this function is handled in the
kernel, and mdmon's reason for existence is to provide functionality
which is otherwise handled by the kernel.


Design overview
---------------

mdmon is structured as two threads with a common address space and
common data structures. These threads are known as the 'monitor' and
the 'manager'.

The 'monitor' has the primary role of monitoring the array for
important state changes and updating the metadata accordingly. As
writes to the array can be blocked until 'monitor' completes and
acknowledges the update, it must be very careful not to block itself.
In particular it must not block waiting for any write to complete,
else it could deadlock. This means that it must not allocate memory,
as doing this can require dirty memory to be written out, and if the
system chooses to write to the array that mdmon is monitoring, the
memory allocation could deadlock.

So 'monitor' must never allocate memory and must limit the number of
other system calls it performs. It may:
 - use select (or poll) to wait for activity on a file descriptor
 - read from a sysfs file descriptor
 - write to a sysfs file descriptor
 - write the metadata out to the block devices using O_DIRECT
 - send a signal (kill) to the manager thread

It must not e.g. open files or do anything similar that might allocate
resources.
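
The division of labour above can be illustrated with a minimal sketch
in C. It is not taken from the mdmon sources: open_attr(),
wait_for_change() and read_attr() are hypothetical names, and the
sketch assumes that sysfs reports a changed attribute to poll() as
POLLPRI/POLLERR. The manager-side helper opens the file once; the
monitor-side helpers only poll, seek and read into a buffer supplied
by the caller.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/* Manager side (hypothetical): opening files may allocate resources,
 * so it is done here, ahead of time, and never by the monitor. */
int open_attr(const char *path)
{
	assert(path != NULL);
	return open(path, O_RDONLY);
}

/* Monitor side (hypothetical): wait for a change on an already-open
 * sysfs attribute.  Returns 1 on an event, 0 on timeout, -1 on error. */
int wait_for_change(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };
	int n = poll(&pfd, 1, timeout_ms);

	if (n < 0)
		return -1;
	return n > 0;
}

/* Monitor side (hypothetical): re-read the attribute from offset 0
 * into a caller-supplied buffer.  Nothing is allocated or opened.
 * Returns the value length with a trailing newline stripped, or -1. */
ssize_t read_attr(int fd, char *buf, size_t len)
{
	ssize_t n;

	assert(buf != NULL && len > 0);
	if (lseek(fd, 0, SEEK_SET) < 0)
		return -1;
	n = read(fd, buf, len - 1);
	if (n < 0)
		return -1;
	if (n > 0 && buf[n - 1] == '\n')
		n--;
	buf[n] = '\0';
	return n;
}
```

A monitor loop built on these helpers would call wait_for_change()
and then read_attr() on each descriptor, using only buffers that
were set up before the loop started.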

The 'manager' thread does everything else that is needed. If any
files are to be opened (e.g. because a device has been added to the
array), the manager does that. If any memory needs to be allocated
(e.g. to hold data about a new array as can happen when one set of
metadata describes several arrays), the manager performs that
allocation.

The 'manager' is also responsible for communicating with mdadm and
assigning spares to replace failed devices.


Handling metadata updates
-------------------------

There are a number of cases in which mdadm needs to update the
metadata which mdmon is managing. These include:
 - creating a new array in an active container
 - adding a device to a container
 - reconfiguring an array
etc.

To complete these updates, mdadm must send a message to mdmon which
will merge the update into the metadata as it is at that moment.

To achieve this, mdmon creates a Unix Domain Socket which the manager
thread listens on. mdadm sends a message over this socket. The
manager thread examines the message to see whether handling it will
require any memory, and if so allocates it. This is done in the
'prepare_update' metadata method.

The update message is then queued for handling by the monitor thread,
which will process it when convenient. The monitor thread calls
->process_update which should atomically make the required changes to
the metadata, making use of the pre-allocated memory as required. Any
memory that is no longer needed can be placed back in the request and
the manager thread will free it.
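
The hand-off of pre-allocated memory can be sketched as follows.
This is an illustration only: the structures, the function
signatures and the finish_update() helper are hypothetical and do
not reproduce the real metadata handler interface. The point it
shows is that only the manager-side functions call malloc() or
free(), while the monitor-side function uses what was reserved and
leaves anything unused in the request.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ARRAYS 4

/* Hypothetical update request passed from manager to monitor. */
struct update {
	const char *buf;	/* message received from mdadm */
	size_t len;
	void *space;		/* memory reserved by the manager */
};

/* Hypothetical in-memory metadata: names of arrays in the container. */
struct meta {
	char *names[MAX_ARRAYS];
	int count;
};

/* Manager thread: may allocate.  Reserve room for a new array name. */
int prepare_update(struct update *u)
{
	assert(u != NULL && u->buf != NULL);
	u->space = malloc(u->len + 1);
	return u->space ? 0 : -1;
}

/* Monitor thread: must not allocate.  Either takes ownership of the
 * reserved space or leaves it in the request for the manager. */
void process_update(struct meta *m, struct update *u)
{
	if (m->count >= MAX_ARRAYS)
		return;		/* unused: u->space stays in the request */
	memcpy(u->space, u->buf, u->len);
	((char *)u->space)[u->len] = '\0';
	m->names[m->count++] = u->space;
	u->space = NULL;	/* now owned by the metadata */
}

/* Manager thread: free whatever the monitor handed back. */
void finish_update(struct update *u)
{
	free(u->space);
	u->space = NULL;
}
```

In this sketch a full table is the only reason for memory to be
handed back; a real handler could return memory for other reasons,
such as an old structure that has just been replaced.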

The exact format of a metadata update is up to the implementer of the
metadata handlers. It will simply describe a change that needs to be
made. It will sometimes contain fragments of the metadata to be
copied into place. However the ->process_update routine must make
sure not to overwrite any field that the monitor thread might have
updated, such as a 'device failed' or 'array is dirty' state.

When the monitor thread has completed the update and written it to the
devices, an acknowledgement message is sent back over the socket so
that mdadm knows it is complete.
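
The request/acknowledgement exchange can be sketched with a pair of
connected Unix domain sockets. The framing used here (a 4-byte
length in host byte order followed by the payload, and a one-byte
acknowledgement) is an assumption made for the example and is not
the wire format used by mdadm and mdmon; all function names are
hypothetical. Retrying after EINTR is omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define ACK_BYTE 'A'	/* hypothetical acknowledgement value */

static int write_all(int fd, const void *p, size_t n)
{
	const char *c = p;

	while (n > 0) {
		ssize_t w = write(fd, c, n);
		if (w <= 0)
			return -1;
		c += w;
		n -= (size_t)w;
	}
	return 0;
}

static int read_all(int fd, void *p, size_t n)
{
	char *c = p;

	while (n > 0) {
		ssize_t r = read(fd, c, n);
		if (r <= 0)
			return -1;	/* error or peer closed */
		c += r;
		n -= (size_t)r;
	}
	return 0;
}

/* mdadm side: send one update message. */
int send_update(int sock, const void *msg, uint32_t len)
{
	assert(msg != NULL);
	if (write_all(sock, &len, sizeof(len)) < 0)
		return -1;
	return write_all(sock, msg, len);
}

/* mdmon manager side: receive one message of at most cap bytes. */
ssize_t recv_update(int sock, void *buf, uint32_t cap)
{
	uint32_t len;

	if (read_all(sock, &len, sizeof(len)) < 0 || len > cap)
		return -1;
	if (read_all(sock, buf, len) < 0)
		return -1;
	return (ssize_t)len;
}

/* mdmon side: sent only after the metadata is on the devices. */
int send_ack(int sock)
{
	char c = ACK_BYTE;

	return write_all(sock, &c, 1);
}

/* mdadm side: block until the update has been acknowledged. */
int wait_ack(int sock)
{
	char c;

	if (read_all(sock, &c, 1) < 0)
		return -1;
	return c == ACK_BYTE ? 0 : -1;
}
```

The two ends can be exercised in a single process by creating them
with socketpair(AF_UNIX, SOCK_STREAM, 0, sv) and calling
send_update(), recv_update(), send_ack() and wait_ack() in that
order.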