Commit | Line | Data |
---|---|---|
1 | ||
2 | When managing a RAID1 array which uses metadata other than the | |
3 | "native" metadata understood by the kernel, mdadm makes use of a | |
4 | partner program named 'mdmon' to manage some aspects of updating | |
5 | that metadata and synchronising the metadata with the array state. | |
6 | ||
7 | This document provides some details on how mdmon works. | |
8 | ||
9 | Containers | |
10 | ---------- | |
11 | ||
12 | As background: mdadm makes a distinction between an 'array' and a | |
13 | 'container'. Other sources sometimes use the term 'volume' or | |
14 | 'device' for an 'array', and may use the term 'array' for a | |
15 | 'container'. | |
16 | ||
17 | For our purposes: | |
18 | - a 'container' is a collection of devices which are described by a | |
19 | single set of metadata. The metadata may be stored equally | |
20 | on all devices, or different devices may have quite different | |
21 | subsets of the total metadata. But there is conceptually one set | |
22 | of metadata that unifies the devices. | |
23 | ||
24 | - an 'array' is a set of data blocks from various devices which | |
25 | together are used to present the abstraction of a single linear | |
26 | sequence of blocks, which may provide data redundancy or enhanced | |
27 | performance. | |
28 | ||
29 | So a container has some metadata and provides a number of arrays which | |
30 | are described by that metadata. | |
31 | ||
32 | Sometimes this model doesn't work perfectly. For example, global | |
33 | spares may have their own metadata which is quite different from the | |
34 | metadata from any device that participates in one or more arrays. | |
35 | Such a global spare might still need to belong to some container so | |
36 | that it is available to be used should a failure arise. In that case | |
37 | we consider the 'metadata' to be the union of the metadata on the | |
38 | active devices which describes the arrays, and the metadata on the | |
39 | global spares which only describes the spares. In this case different | |
40 | devices in the one container will have quite different metadata. | |
41 | ||
42 | ||
43 | Purpose | |
44 | ------- | |
45 | ||
46 | The main purpose of mdmon is to update the metadata in response to | |
47 | changes to the array which need to be reflected in the metadata before | |
48 | future writes to the array can safely be performed. | |
49 | These include: | |
50 | - transitions from 'clean' to 'dirty'. | |
51 | - recording that devices have failed. | |
52 | - recording the progress of a 'reshape'. | |
53 | ||
54 | This requires mdmon to be running at any time that the array is | |
55 | writable (a read-only array does not require mdmon to be running). | |
56 | ||
57 | Because mdmon must be able to process these metadata updates at any | |
58 | time, it must (when running) have exclusive write access to the | |
59 | metadata. Any other changes (e.g. reconfiguration of the array) must | |
60 | go through mdmon. | |
61 | ||
62 | A secondary role for mdmon is to activate spares when a device fails. | |
63 | This role is much less time-critical than the other metadata updates, | |
64 | so it could be performed by a separate process, possibly | |
65 | "mdadm --monitor" which has a related role of moving devices between | |
66 | arrays. A main reason for including this functionality in mdmon is | |
67 | that in the native-metadata case this function is handled in the | |
68 | kernel, and mdmon's reason for existence is to provide functionality | |
69 | which is otherwise handled by the kernel. | |
70 | ||
71 | ||
72 | Design overview | |
73 | --------------- | |
74 | ||
75 | mdmon is structured as two threads with a common address space and | |
76 | common data structures. These threads are known as the 'monitor' and | |
77 | the 'manager'. | |
78 | ||
79 | The 'monitor' has the primary role of monitoring the array for | |
80 | important state changes and updating the metadata accordingly. As | |
81 | writes to the array can be blocked until 'monitor' completes and | |
82 | acknowledges the update, it must be very careful not to block itself. | |
83 | In particular it must not block waiting for any write to complete, or | |
84 | it could deadlock. This means that it must not allocate memory, as | |
85 | doing this can require dirty memory to be written out, and if the | |
86 | system chooses to write to the array that mdmon is monitoring, the | |
87 | memory allocation could deadlock. | |
88 | ||
89 | So 'monitor' must never allocate memory and must limit the number of | |
90 | other system calls it performs. It may: | |
91 | - use select (or poll) to wait for activity on a file descriptor | |
92 | - read from a sysfs file descriptor | |
93 | - write to a sysfs file descriptor | |
94 | - write the metadata out to the block devices using O_DIRECT | |
95 | - send a signal (kill) to the manager thread | |
96 | ||
97 | It must not e.g. open files or do anything similar that might allocate | |
98 | resources. | |
99 | ||
100 | The 'manager' thread does everything else that is needed. If any | |
101 | files are to be opened (e.g. because a device has been added to the | |
102 | array), the manager does that. If any memory needs to be allocated | |
103 | (e.g. to hold data about a new array as can happen when one set of | |
104 | metadata describes several arrays), the manager performs that | |
105 | allocation. | |
106 | ||
107 | The 'manager' is also responsible for communicating with mdadm and | |
108 | assigning spares to replace failed devices. | |
109 | ||
110 | ||
111 | Handling metadata updates | |
112 | ------------------------- | |
113 | ||
114 | There are a number of cases in which mdadm needs to update the | |
115 | metadata which mdmon is managing. These include: | |
116 | - creating a new array in an active container | |
117 | - adding a device to a container | |
118 | - reconfiguring an array | |
119 | etc. | |
120 | ||
121 | To complete these updates, mdadm must send a message to mdmon which | |
122 | will merge the update into the metadata as it is at that moment. | |
123 | ||
124 | To achieve this, mdmon creates a Unix Domain Socket which the manager | |
125 | thread listens on. mdadm sends a message over this socket. The | |
126 | manager thread examines the message to see if it will require | |
127 | allocating any memory and, if so, allocates it. This is done in the | |
128 | 'prepare_update' metadata method. | |
129 | ||
130 | The update message is then queued for handling by the monitor thread | |
131 | which it will do when convenient. The monitor thread calls | |
132 | ->process_update which should atomically make the required changes to | |
133 | the metadata, making use of the pre-allocated memory as required. Any | |
134 | memory that is no longer needed can be placed back in the request and | |
135 | the manager thread will free it. | |
136 | ||
137 | The exact format of a metadata update is up to the implementer of the | |
138 | metadata handlers. It will simply describe a change that needs to be | |
139 | made. It will sometimes contain fragments of the metadata to be | |
140 | copied into place. However, the ->process_update routine must make | |
141 | sure not to overwrite any field that the monitor thread might have | |
142 | updated, such as a 'device failed' or 'array is dirty' state. | |
143 | ||
144 | When the monitor thread has completed the update and written it to the | |
145 | devices, an acknowledgement message is sent back over the socket so | |
146 | that mdadm knows it is complete. |
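
The hand-off between the two threads described above can be illustrated
with a small single-threaded sketch in C. This is not mdmon's code: the
structure layouts, the "create-array" change string and the
release_update helper are hypothetical, invented for illustration. Only
the method names prepare_update and process_update come from this
document, and the signatures used here are simplified assumptions. The
point it tries to show is the division of labour: the manager allocates
before queueing, the monitor only swaps pointers, and the manager frees
whatever is handed back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical update record; field names are illustrative only. */
struct update {
    char   change[32];  /* description of the change sent by mdadm */
    void  *space;       /* memory pre-allocated by the manager, or NULL */
    size_t space_len;   /* size of that memory in bytes */
};

/* Toy stand-in for a container's in-memory metadata. */
struct metadata {
    void *array_info;   /* owned by the metadata once attached */
    int   dirty;        /* state owned by the monitor thread */
    int   generation;   /* bumped on every applied update */
};

/* Manager side: the only place where allocation is allowed.
 * Returns 0 on success, -1 if memory could not be obtained. */
int prepare_update(struct update *u)
{
    if (strcmp(u->change, "create-array") == 0) {
        u->space_len = 64;  /* arbitrary size for the sketch */
        u->space = calloc(1, u->space_len);
        if (u->space == NULL)
            return -1;
    }
    return 0;
}

/* Monitor side: must neither allocate nor free. It consumes the
 * pre-allocated memory and places anything it no longer needs back
 * in the request. */
void process_update(struct metadata *md, struct update *u)
{
    if (strcmp(u->change, "create-array") == 0) {
        void *old = md->array_info;

        assert(u->space != NULL);   /* manager must have prepared it */
        md->array_info = u->space;  /* take ownership of new memory */
        u->space = old;             /* hand the old memory back */
    }
    md->generation++;
    /* md->dirty is deliberately left alone: that field may have been
     * changed by the monitor and must not be overwritten here. */
}

/* Manager side again: free whatever the monitor handed back. */
void release_update(struct update *u)
{
    free(u->space);
    u->space = NULL;
    u->space_len = 0;
}
```

A caller would run the three steps in order: prepare_update in the
manager, process_update in the monitor, then release_update back in the
manager. In a real two-thread design the request would also pass
through a queue, and an acknowledgement would be sent over the socket
after the metadata had been written to the devices; both are omitted
here to keep the sketch short.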