From: Neil Brown <neilb@suse.de>
Subject: swap over network documentation
Patch-mainline: No
References: FATE#303834

Document describing the problem and proposed solution

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>

---
 Documentation/network-swap.txt | 270 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 270 insertions(+)

Index: linux-2.6.26/Documentation/network-swap.txt
===================================================================
--- /dev/null
+++ linux-2.6.26/Documentation/network-swap.txt
@@ -0,0 +1,270 @@
+
+Problem:
+  When Linux needs to allocate memory it may find that there is
+  insufficient free memory so it needs to reclaim space that is in
+  use but not needed at the moment. There are several options:
+
+  1/ Shrink a kernel cache such as the inode or dentry cache. This
+     is fairly easy but provides limited returns.
+  2/ Discard 'clean' pages from the page cache. This is easy, and
+     works well as long as there are clean pages in the page cache.
+     Similarly clean 'anonymous' pages can be discarded - if there
+     are any.
+  3/ Write out some dirty page-cache pages so that they become clean.
+     The VM limits the number of dirty page-cache pages to e.g. 40%
+     of available memory so that (among other reasons) a "sync" will
+     not take excessively long. So there should never be excessive
+     amounts of dirty pagecache.
+     Writing out dirty page-cache pages involves work by the
+     filesystem which may need to allocate memory itself. To avoid
+     deadlock, filesystems use GFP_NOFS when allocating memory on the
+     write-out path. When this is used, cleaning dirty page-cache
+     pages is not an option so if the filesystem finds that memory
+     is tight, another option must be found.
+  4/ Write out dirty anonymous pages to the "Swap" partition/file.
+     This is the most interesting for a couple of reasons.
+     a/ Unlike dirty page-cache pages, there is no need to write anon
+        pages out unless we are actually short of memory. Thus they
+        tend to be left to last.
+     b/ Anon pages tend to be updated randomly and unpredictably, and
+        flushing them out of memory can have a very significant
+        performance impact on the process using them. This contrasts
+        with page-cache pages which are often written sequentially
+        and often treated as "write-once, read-many".
+     So anon pages tend to be left until last to be cleaned, and may
+     be the only cleanable pages while there are still some dirty
+     page-cache pages (which are waiting on a GFP_NOFS allocation).
+
+[I don't find the above wholly satisfying. There seems to be too much
+ hand-waving. If someone can provide better text explaining why
+ swapout is a special case, that would be great.]
+
+So we need to be able to write to the swap file/partition without
+needing to allocate any memory ... or only a small well controlled
+amount.
+
+The VM reserves a small amount of memory that can only be allocated
+for use as part of the swap-out procedure. It is only available to
+processes with the PF_MEMALLOC flag set, which is typically just the
+memory cleaner.
+
+Traditionally swap-out is performed directly to block devices (swap
+files on block-device filesystems are supported by examining the
+mapping from file offset to device offset in advance, and then using
+the device offsets to write directly to the device). Block devices
+are required to pre-allocate any memory that might be needed during
+write-out, and to block when the pre-allocated memory is exhausted
+and no other memory is available. They can be sure not to block
+forever as the pre-allocated memory will be returned as soon as the
+data it is being used for has been written out. The primary
+mechanism for pre-allocating memory is called "mempools".
+
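The mempool idea can be sketched in userspace. Everything below (the toy_mempool type and its helpers) is a simplified stand-in for the kernel's mempool_create()/mempool_alloc()/mempool_free(), not the real API: the point is only that allocation falls back to a pre-allocated reserve instead of failing, and frees refill the reserve first.

```c
#include <assert.h>
#include <stdlib.h>

/* A toy mempool: a stack of pre-allocated elements used only when the
 * normal allocator cannot be (simulated here by heap_ok == 0). */
struct toy_mempool {
    void *elems[8];    /* pre-allocated reserve elements */
    int nr_free;       /* how many reserve elements remain */
    size_t elem_size;
};

int toy_mempool_init(struct toy_mempool *p, int nr, size_t size)
{
    p->nr_free = 0;
    p->elem_size = size;
    for (int i = 0; i < nr && i < 8; i++) {
        void *e = malloc(size);
        if (!e)
            return -1;
        p->elems[p->nr_free++] = e;
    }
    return 0;
}

/* Try the normal allocator first; under memory pressure, dip into the
 * reserve instead of failing, which is what lets writeout progress. */
void *toy_mempool_alloc(struct toy_mempool *p, int heap_ok)
{
    if (heap_ok)
        return malloc(p->elem_size);
    return p->nr_free ? p->elems[--p->nr_free] : 0;
}

/* Freed elements refill the reserve before going back to the heap,
 * so the pool is guaranteed to recover as writes complete. */
void toy_mempool_free(struct toy_mempool *p, void *e)
{
    if (p->nr_free < 8)
        p->elems[p->nr_free++] = e;
    else
        free(e);
}
```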
+This approach does not work for writing anonymous pages
+(i.e. swapping) over a network, using e.g. NFS or NBD or iSCSI.
+
+
+The main reason that it does not work is that when data from an anon
+page is written to the network, we must wait for a reply to confirm
+the data is safe. Receiving that reply will consume memory and,
+significantly, we need to allocate memory for an incoming packet
+before we can tell if it is the reply we are waiting for or not.
+
+The secondary reason is that the network code is not written to use
+mempools and in most cases does not need to use them. Changing all
+allocations in the networking layer to use mempools would be quite
+intrusive, and would waste memory, and probably cause a slow-down in
+the common case of not swapping over the network.
+
+These problems are addressed by enhancing the system of memory
+reserves used by PF_MEMALLOC and requiring any in-kernel networking
+client that is used for swap-out to indicate which sockets are used
+for swapout so they can be handled specially in low memory situations.
+
+There are several major parts to this enhancement:
+
+1/ page->reserve, GFP_MEMALLOC
+
+  To handle low memory conditions we need to know when those
+  conditions exist. Having a global "low on memory" flag seems easy,
+  but its implementation is problematic. Instead we make it possible
+  to tell if a recent memory allocation required use of the emergency
+  memory pool.
+  For pages returned by alloc_page, the new page->reserve flag
+  can be tested. If this is set, then a low memory condition was
+  current when the page was allocated, so the memory should be used
+  carefully. (Because low memory conditions are transient, this
+  state is kept in an overloaded member instead of in page flags, which
+  would suggest a more permanent state.)
+
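A userspace model may make the intended behaviour concrete. The names here (model_alloc_page, free_pages, emergency_pages) are invented for illustration and are not the kernel's; the behaviour modelled is the one described above: normal allocations fail when only the reserve is left, while GFP_MEMALLOC allocations succeed and come back with ->reserve set.

```c
#include <assert.h>

/* Model of the proposed page->reserve signal. */
struct page { int reserve; };

static int free_pages = 1;       /* pages above the watermark */
static int emergency_pages = 4;  /* the emergency reserve */

struct page *model_alloc_page(int gfp_memalloc)
{
    static struct page pg;
    if (free_pages > 0) {
        free_pages--;
        pg.reserve = 0;          /* no memory pressure */
        return &pg;
    }
    if (gfp_memalloc && emergency_pages > 0) {
        emergency_pages--;
        pg.reserve = 1;          /* low-memory condition was current */
        return &pg;
    }
    return 0;                    /* non-emergency callers must wait */
}
```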
+  For memory allocated using slab/slub: If a page that is added to a
+  kmem_cache is found to have page->reserve set, then a s->reserve
+  flag is set for the whole kmem_cache. Further allocations will only
+  be returned from that page (or any other page in the cache) if they
+  are emergency allocations (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).
+  Non-emergency allocations will block in alloc_page until a
+  non-reserve page is available. Once a non-reserve page has been
+  added to the cache, the s->reserve flag on the cache is removed.
+
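The cache-wide gating can be modelled the same way; model_cache, model_kmem_alloc and model_cache_add_page are illustrative names, not kernel interfaces. Objects in a reserve-backed cache are handed out only to emergency callers until a non-reserve page arrives:

```c
#include <assert.h>

/* Model of the proposed kmem_cache s->reserve gating. */
struct model_cache { int reserve; int objects; };

int model_kmem_alloc(struct model_cache *s, int emergency)
{
    if (s->objects == 0)
        return -1;              /* would fall through to alloc_page */
    if (s->reserve && !emergency)
        return -1;              /* non-emergency callers must block */
    s->objects--;
    return 0;
}

/* Adding a fresh page sets or clears the cache-wide reserve flag
 * according to whether the page came from the emergency pool. */
void model_cache_add_page(struct model_cache *s, int page_reserve, int objs)
{
    s->objects += objs;
    s->reserve = page_reserve;
}
```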
+  Because slab objects have no individual state, it is hard to pass
+  reserve state along, so the current code relies on a regular
+  allocation failing. Various allocation wrappers help here.
+
+  This allows us to
+  a/ request use of the emergency pool when allocating memory
+     (GFP_MEMALLOC), and
+  b/ find out if the emergency pool was used.
+
+2/ SK_MEMALLOC, sk_buff->emergency
+
+  When memory from the reserve is used to store incoming network
+  packets, the memory must be freed (and the packet dropped) as soon
+  as we find out that the packet is not for a socket that is used for
+  swap-out.
+  To achieve this we have an ->emergency flag for skbs, and an
+  SK_MEMALLOC flag for sockets.
+  When memory is allocated for an skb, it is allocated with
+  GFP_MEMALLOC (if we are currently swapping over the network at
+  all). If a subsequent test shows that the emergency pool was used,
+  ->emergency is set.
+  When the skb is finally attached to its destination socket, the
+  SK_MEMALLOC flag on the socket is tested. If the skb has
+  ->emergency set, but the socket does not have SK_MEMALLOC set, then
+  the skb is immediately freed and the packet is dropped.
+  This ensures that reserve memory is never queued on a socket that is
+  not used for swapout.
+
+  Similarly, if an skb is ever queued for delivery to user-space, for
+  example by netfilter, the ->emergency flag is tested and the skb is
+  released if ->emergency is set. (So the storage path must not pass
+  through a userspace helper, otherwise the packets will never arrive
+  and we will deadlock.)
+
+  This ensures that memory from the emergency reserve can be used to
+  allow swapout to proceed, but will not get caught up in any other
+  network queue.
+
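The socket-attach check reduces to a few lines; the struct and function names below are invented stand-ins for sk_buff, sock and the receive path, but the decision is exactly the one described: reserve-backed skbs destined for a non-swapout socket are dropped immediately.

```c
#include <assert.h>

/* Model of the ->emergency / SK_MEMALLOC check at socket attach. */
struct model_skb  { int emergency; int freed; };
struct model_sock { int sk_memalloc; int queued; };

void model_sock_queue_rcv(struct model_sock *sk, struct model_skb *skb)
{
    if (skb->emergency && !sk->sk_memalloc) {
        skb->freed = 1;   /* kfree_skb(): drop, return reserve memory */
        return;
    }
    sk->queued++;         /* normal delivery */
}
```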
+
+3/ pages_emergency
+
+  The above would be sufficient if the total memory below the lowest
+  memory watermark (i.e. the size of the emergency reserve) were known
+  to be enough to hold all transient allocations needed for writeout.
+  I'm a little blurry on how big the current emergency pool is, but it
+  isn't big and certainly hasn't been sized to allow network traffic
+  to consume any.
+
+  We could simply make the size of the reserve bigger. However in the
+  common case that we are not swapping over the network, that would be
+  a waste of memory.
+
+  So a new "watermark" is defined: pages_emergency. This is
+  effectively added to the current low water marks, so that pages from
+  this emergency pool can only be allocated if one of PF_MEMALLOC or
+  GFP_MEMALLOC is set.
+
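In effect the watermark test gains one term; this sketch uses invented names (watermark_ok, pages_low, set_pages_emergency) rather than the real zone-watermark code, but shows how the reserve stays invisible to ordinary allocations while remaining available to PF_MEMALLOC/GFP_MEMALLOC ones:

```c
#include <assert.h>

/* Model of the pages_emergency watermark. */
static long pages_low = 64;
static long pages_emergency = 0;  /* raised when net-swap is enabled */

int watermark_ok(long free_pages, int memalloc)
{
    long min = pages_low;
    if (!memalloc)
        min += pages_emergency;   /* reserve hidden from normal allocs */
    return free_pages > min;
}

void set_pages_emergency(long pages)
{
    pages_emergency = pages;      /* adjusted dynamically on swapon/off */
}
```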
+  pages_emergency can be changed dynamically based on need. When
+  swapout over the network is required, pages_emergency is increased
+  to cover the maximum expected load. When network swapout is
+  disabled, pages_emergency is decreased.
+
+  To determine how much to increase it by, we introduce reservation
+  groups....
+
+3a/ reservation groups
+
+  The memory used transiently for swapout can be in a number of
+  different places, e.g. the network route cache, the network
+  fragment cache, in transit between network card and socket, or (in
+  the case of NFS) in sunrpc data structures awaiting a reply.
+  We need to ensure each of these is limited in the amount of memory
+  they use, and that the maximum is included in the reserve.
+
+  The memory required by the network layer only needs to be reserved
+  once, even if there are multiple swapout paths using the network
+  (e.g. NFS and NBD and iSCSI, though using all three for swapout at
+  the same time would be unusual).
+
+  So we create a tree of reservation groups. The network might
+  register a collection of reservations, but not mark them as being in
+  use. NFS and sunrpc might similarly register a collection of
+  reservations, and attach it to the network reservations as it
+  depends on them.
+  When swapout over NFS is requested, the NFS/sunrpc reservations are
+  activated, which implicitly activates the network reservations.
+
+  The total new reservation is added to pages_emergency.
+
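The tree and its "shared parent counted once" property can be sketched as follows; res_group and res_group_activate are illustrative names, and the dependency chain (NFS depends on the network group, as would NBD) follows the example in the text:

```c
#include <assert.h>

/* Model of a reservation-group tree: activating a group activates its
 * ancestors, but a shared ancestor is only paid for once. */
struct res_group {
    long pages;                /* reserve this group needs */
    int active;                /* activation count */
    struct res_group *parent;  /* group this one depends on */
};

/* Returns the number of pages pages_emergency must grow by. */
long res_group_activate(struct res_group *g)
{
    long added = 0;
    for (; g; g = g->parent)
        if (g->active++ == 0)
            added += g->pages; /* first user pays for this group */
    return added;
}
```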
+  Provided each memory usage stays beneath the registered limit (at
+  least when allocating memory from reserves), the system will never
+  run out of emergency memory, and swapout will not deadlock.
+
+  It is worth noting here that it is not critical that each usage
+  stays beneath the limit 100% of the time. Occasional excess is
+  acceptable provided that the memory will be freed again within a
+  short amount of time that does *not* require waiting for any event
+  that itself might require memory.
+  This is because, at all stages of transmit and receive, it is
+  acceptable to discard all transient memory associated with a
+  particular writeout and try again later. On transmit, the page can
+  be re-queued for later transmission. On receive, the packet can be
+  dropped assuming that the peer will resend after a timeout.
+
+  Thus allocations that are truly transient and will be freed without
+  blocking do not strictly need to be reserved for. Doing so might
+  still be a good idea to ensure forward progress doesn't take too
+  long.
+
+4/ low-mem accounting
+
+  Most places that might hold on to emergency memory (e.g. route
+  cache, fragment cache, etc.) already place a limit on the amount of
+  memory that they can use. This limit can simply be reserved using
+  the above mechanism and no more needs to be done.
+
+  However some memory usage might not be accounted with sufficient
+  firmness to allow an appropriate emergency reservation. The
+  in-flight skbs for incoming packets are one such example.
+
+  To support this, a low-overhead mechanism for accounting memory
+  usage against the reserves is provided. This mechanism uses the
+  same data structure that is used to store the emergency memory
+  reservations, through the addition of a 'usage' field.
+
+  Before we attempt allocation from the memory reserves, we must check
+  if the resulting 'usage' is below the reservation. If so, we increase
+  the usage and attempt the allocation (which should succeed). If
+  the projected 'usage' exceeds the reservation we'll either fail the
+  allocation, or wait for 'usage' to decrease enough so that it would
+  succeed, depending on __GFP_WAIT.
+
+  When memory that was allocated for that purpose is freed, the
+  'usage' field is checked again. If it is non-zero, then the size of
+  the freed memory is subtracted from the usage, making sure the usage
+  never becomes less than zero.
+
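The charge/uncharge pair described in the last two paragraphs can be sketched directly; reserve_charge and reserve_uncharge are invented names, and a real implementation would take the spin lock mentioned below around both updates:

```c
#include <assert.h>

/* Model of 'usage' accounting against an emergency reservation. */
struct reserve {
    long limit;  /* pages reserved for this user */
    long usage;  /* pages currently charged */
};

int reserve_charge(struct reserve *r, long pages)
{
    if (r->usage + pages > r->limit)
        return -1;       /* fail or wait, depending on __GFP_WAIT */
    r->usage += pages;
    return 0;            /* allocation may proceed, should succeed */
}

void reserve_uncharge(struct reserve *r, long pages)
{
    r->usage -= pages;
    if (r->usage < 0)
        r->usage = 0;    /* usage never becomes less than zero */
}
```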
+  This provides adequate accounting with minimal overheads when not in
+  a low memory condition. When a low memory condition is encountered
+  it does add the cost of a spin lock necessary to serialise updates
+  to 'usage'.
+
+
+
+5/ swapon/swapoff/swap_out/swap_in
+
+  So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
+  any network socket that it uses, and can know when to account
+  reserve memory carefully, new address_space_operations are
+  available.
+  "swapon" requests that an address space (i.e. a file) be made ready
+  for swapout. swap_out and swap_in request the actual IO. They
+  together must ensure that each swap_out request can succeed without
+  allocating more emergency memory than was reserved by swapon. swapoff
+  is used to reverse the state changes caused by swapon when we disable
+  the swap file.
+
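The exact prototypes of these operations are not given in this document, so the signatures and bookkeeping below are assumptions for illustration only: swapon is where a network filesystem would mark its transport socket SK_MEMALLOC (and activate its reservation group), and swapoff reverses that.

```c
#include <assert.h>

/* Illustrative model of the swapon/swapoff state changes. */
struct model_mapping { int sk_memalloc; };

int model_swapon(struct model_mapping *m)
{
    m->sk_memalloc = 1;  /* mark the transport socket for swapout */
    return 0;
}

int model_swapoff(struct model_mapping *m)
{
    m->sk_memalloc = 0;  /* reverse the state changes of swapon */
    return 0;
}
```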
+
+Thanks for reading this far. I hope it made sense :-)
+
+Neil Brown (with updates from Peter Zijlstra)
+
+