From: Neil Brown <neilb@suse.de>
Subject: swap over network documentation
Patch-mainline: No
References: FATE#303834

Document describing the problem and proposed solution

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>

---
 Documentation/network-swap.txt | 270 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 270 insertions(+)

Index: linux-2.6.26/Documentation/network-swap.txt
===================================================================
--- /dev/null
+++ linux-2.6.26/Documentation/network-swap.txt
@@ -0,0 +1,270 @@
+
+Problem:
+ When Linux needs to allocate memory it may find that there is
+ insufficient free memory so it needs to reclaim space that is in
+ use but not needed at the moment. There are several options:
+
+ 1/ Shrink a kernel cache such as the inode or dentry cache. This
+ is fairly easy but provides limited returns.
+ 2/ Discard 'clean' pages from the page cache. This is easy, and
+ works well as long as there are clean pages in the page cache.
+ Similarly clean 'anonymous' pages can be discarded - if there
+ are any.
+ 3/ Write out some dirty page-cache pages so that they become clean.
+ The VM limits the number of dirty page-cache pages to e.g. 40%
+ of available memory so that (among other reasons) a "sync" will
+ not take excessively long. So there should never be excessive
+ amounts of dirty pagecache.
+ Writing out dirty page-cache pages involves work by the
+ filesystem, which may need to allocate memory itself. To avoid
+ deadlock, filesystems use GFP_NOFS when allocating memory on the
+ write-out path. When this is used, cleaning dirty page-cache
+ pages is not an option, so if the filesystem finds that memory
+ is tight, another option must be found.
+ 4/ Write out dirty anonymous pages to the "Swap" partition/file.
+ This is the most interesting for a couple of reasons.
+ a/ Unlike dirty page-cache pages, there is no need to write anon
+ pages out unless we are actually short of memory. Thus they
+ tend to be left to last.
+ b/ Anon pages tend to be updated randomly and unpredictably, and
+ flushing them out of memory can have a very significant
+ performance impact on the process using them. This contrasts
+ with page-cache pages which are often written sequentially
+ and often treated as "write-once, read-many".
+ So anon pages tend to be left until last to be cleaned, and may
+ be the only cleanable pages while there are still some dirty
+ page-cache pages (which are waiting on a GFP_NOFS allocation).
+
+[I don't find the above wholly satisfying. There seems to be too much
+ hand-waving. If someone can provide better text explaining why
+ swapout is a special case, that would be great.]
+
+So we need to be able to write to the swap file/partition without
+needing to allocate any memory ... or only a small well-controlled
+amount.
+
+The VM reserves a small amount of memory that can only be allocated
+for use as part of the swap-out procedure. It is only available to
+processes with the PF_MEMALLOC flag set, which is typically just the
+memory cleaner.
+
+Traditionally swap-out is performed directly to block devices (swap
+files on block-device filesystems are supported by examining the
+mapping from file offset to device offset in advance, and then using
+the device offsets to write directly to the device). Block device
+drivers are required to pre-allocate any memory that might be needed
+during write-out, and to block when the pre-allocated memory is
+exhausted and no other memory is available. They can be sure not to
+block forever, as the pre-allocated memory will be returned as soon as
+the data it is being used for has been written out. The primary
+mechanism for pre-allocating memory is called "mempools".
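+
+For reference, the existing mempool interface is used roughly as in
+the sketch below. This is only an illustration of the general
+pattern; the pool, the element size and the helper names are invented
+for the example and are not part of this patch.
+
+   #include <linux/mempool.h>
+   #include <linux/gfp.h>
+   #include <linux/errno.h>
+
+   /* Enough elements to guarantee forward progress for write-out even
+    * when the page allocator has nothing left to give. */
+   #define WRITEOUT_POOL_MIN 16
+
+   static mempool_t *writeout_pool;
+
+   static int writeout_pool_init(void)
+   {
+           /* Pre-allocate WRITEOUT_POOL_MIN elements of 256 bytes each. */
+           writeout_pool = mempool_create_kmalloc_pool(WRITEOUT_POOL_MIN, 256);
+           return writeout_pool ? 0 : -ENOMEM;
+   }
+
+   static void *alloc_writeout_request(void)
+   {
+           /* May sleep, but cannot fail: when the pool is exhausted it
+            * waits for an element to be given back (mempool_free()) by
+            * a completing write, so write-out always makes progress. */
+           return mempool_alloc(writeout_pool, GFP_NOIO);
+   }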
+
+This approach does not work for writing anonymous pages
+(i.e. swapping) over a network, using e.g. NFS or NBD or iSCSI.
+
+
+The main reason that it does not work is that when data from an anon
+page is written to the network, we must wait for a reply to confirm
+the data is safe. Receiving that reply will consume memory and,
+significantly, we need to allocate memory to an incoming packet before
+we can tell if it is the reply we are waiting for or not.
+
+The secondary reason is that the network code is not written to use
+mempools and in most cases does not need to use them. Changing all
+allocations in the networking layer to use mempools would be quite
+intrusive, and would waste memory, and probably cause a slow-down in
+the common case of not swapping over the network.
+
+These problems are addressed by enhancing the system of memory
+reserves used by PF_MEMALLOC and requiring any in-kernel networking
+client that is used for swap-out to indicate which sockets are used
+for swapout, so they can be handled specially in low memory situations.
+
+There are several major parts to this enhancement:
+
+1/ page->reserve, GFP_MEMALLOC
+
+ To handle low memory conditions we need to know when those
+ conditions exist. Having a global "low on memory" flag seems easy,
+ but its implementation is problematic. Instead we make it possible
+ to tell if a recent memory allocation required use of the emergency
+ memory pool.
+ For pages returned by alloc_page, the new page->reserve flag
+ can be tested. If this is set, then a low memory condition was
+ current when the page was allocated, so the memory should be used
+ carefully. (Because low memory conditions are transient, this
+ state is kept in an overloaded member instead of in page flags, which
+ would suggest a more permanent state.)
+
+ For memory allocated using slab/slub: If a page that is added to a
+ kmem_cache is found to have page->reserve set, then a s->reserve
+ flag is set for the whole kmem_cache. Further allocations will only
+ be returned from that page (or any other page in the cache) if they
+ are emergency allocations (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).
+ Non-emergency allocations will block in alloc_page until a
+ non-reserve page is available. Once a non-reserve page has been
+ added to the cache, the s->reserve flag on the cache is removed.
+
+ Because slab objects have no individual state, it is hard to pass
+ the reserve state along; the current code relies on a regular
+ allocation failing first. Various allocation wrappers help here.
+
+ This allows us to
+ a/ request use of the emergency pool when allocating memory
+ (GFP_MEMALLOC), and
+ b/ find out if the emergency pool was used.
+
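+ The interface described above might be used roughly as in the sketch
+ below. GFP_MEMALLOC and page->reserve are the names introduced by
+ this patch set, and the helper itself is invented for illustration:
+
+   static struct page *swap_rx_alloc_page(void)
+   {
+           /* GFP_MEMALLOC permits dipping into the emergency reserve. */
+           struct page *page = alloc_page(GFP_ATOMIC | GFP_MEMALLOC);
+
+           if (page && page->reserve) {
+                   /* The emergency pool was used: we are in a low
+                    * memory condition, so this page must only be used
+                    * for work that helps free memory (such as swap-out
+                    * traffic) and must be freed again promptly. */
+           }
+           return page;
+   }
+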
+2/ SK_MEMALLOC, sk_buff->emergency.
+
+ When memory from the reserve is used to store incoming network
+ packets, the memory must be freed (and the packet dropped) as soon
+ as we find out that the packet is not for a socket that is used for
+ swap-out.
+ To achieve this we have an ->emergency flag for skbs, and an
+ SK_MEMALLOC flag for sockets.
+ When memory is allocated for an skb, it is allocated with
+ GFP_MEMALLOC (if we are currently swapping over the network at
+ all). If a subsequent test shows that the emergency pool was used,
+ ->emergency is set.
+ When the skb is finally attached to its destination socket, the
+ SK_MEMALLOC flag on the socket is tested. If the skb has
+ ->emergency set, but the socket does not have SK_MEMALLOC set, then
+ the skb is immediately freed and the packet is dropped.
+ This ensures that reserve memory is never queued on a socket that is
+ not used for swapout.
+
+ Similarly, if an skb is ever queued for delivery to user-space, for
+ example by netfilter, the ->emergency flag is tested and the skb is
+ released if ->emergency is set. (So the storage path obviously may
+ not pass through a user-space helper, otherwise the packets will
+ never arrive and we will deadlock.)
+
+ This ensures that memory from the emergency reserve can be used to
+ allow swapout to proceed, but will not get caught up in any other
+ network queue.
+
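+ The queueing decision described above amounts to something like the
+ sketch below. The skb ->emergency flag and SK_MEMALLOC are the names
+ used by this patch set; the helper itself is invented for
+ illustration:
+
+   /* Called once an incoming skb has been matched to its socket. */
+   static int sock_may_queue_skb(struct sock *sk, struct sk_buff *skb)
+   {
+           /* Emergency memory may only be queued on sockets that are
+            * part of the swap-out path (those flagged SK_MEMALLOC). */
+           if (skb->emergency && !sock_flag(sk, SK_MEMALLOC)) {
+                   kfree_skb(skb);        /* drop it and free the reserve */
+                   return 0;
+           }
+           return 1;                      /* safe to queue */
+   }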
+
+3/ pages_emergency
+
+ The above would be sufficient if the total memory below the lowest
+ memory watermark (i.e. the size of the emergency reserve) were known
+ to be enough to hold all transient allocations needed for writeout.
+ I'm a little blurry on how big the current emergency pool is, but it
+ isn't big and certainly hasn't been sized to allow network traffic
+ to consume any.
+
+ We could simply make the size of the reserve bigger. However, in the
+ common case that we are not swapping over the network, that would be
+ a waste of memory.
+
+ So a new "watermark" is defined: pages_emergency. This is
+ effectively added to the current low water marks, so that pages from
+ this emergency pool can only be allocated if one of PF_MEMALLOC or
+ GFP_MEMALLOC is set.
+
+ pages_emergency can be changed dynamically based on need. When
+ swapout over the network is required, pages_emergency is increased
+ to cover the maximum expected load. When network swapout is
+ disabled, pages_emergency is decreased.
+
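+ In effect the allocator's watermark test becomes something like the
+ sketch below. This is purely illustrative; the real logic is more
+ involved, and the helper, the zone field and the global used here are
+ simplified names for the example:
+
+   extern unsigned long pages_emergency;   /* the new watermark */
+
+   static int zone_alloc_allowed(struct zone *z, gfp_t gfp_mask,
+                                 unsigned long free_pages)
+   {
+           unsigned long min = z->pages_min;
+
+           /* Ordinary allocations must also stay above the emergency
+            * reserve, keeping it for PF_MEMALLOC/GFP_MEMALLOC users. */
+           if (!(gfp_mask & GFP_MEMALLOC) && !(current->flags & PF_MEMALLOC))
+                   min += pages_emergency;
+
+           return free_pages > min;
+   }
+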
+ To determine how much to increase it by, we introduce reservation
+ groups....
+
+3a/ reservation groups
+
+ The memory used transiently for swapout can be in a number of
+ different places, e.g. the network route cache, the network
+ fragment cache, in transit between network card and socket, or (in
+ the case of NFS) in sunrpc data structures awaiting a reply.
+ We need to ensure each of these is limited in the amount of memory
+ it uses, and that the maximum is included in the reserve.
+
+ The memory required by the network layer only needs to be reserved
+ once, even if there are multiple swapout paths using the network
+ (e.g. NFS and NBD and iSCSI, though using all three for swapout at
+ the same time would be unusual).
+
+ So we create a tree of reservation groups. The network might
+ register a collection of reservations, but not mark them as being in
+ use. NFS and sunrpc might similarly register a collection of
+ reservations, and attach it to the network reservations as it
+ depends on them.
+ When swapout over NFS is requested, the NFS/sunrpc reservations are
+ activated, which implicitly activates the network reservations.
+
+ The total new reservation is added to pages_emergency.
+
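+ A toy model of the idea (this is not the actual interface from the
+ patch set; the structure and names below are invented only to show
+ how activation propagates up the tree):
+
+   struct reserve_group {
+           const char *name;
+           unsigned long pages;          /* worst-case need of this user */
+           int active;                   /* number of users activating us */
+           struct reserve_group *parent; /* e.g. NFS/sunrpc -> network */
+   };
+
+   /* Activating a group activates everything it depends on; each group
+    * newly brought into use adds its worst case to pages_emergency. */
+   static void reserve_group_activate(struct reserve_group *g)
+   {
+           for (; g; g = g->parent)
+                   if (g->active++ == 0)
+                           pages_emergency += g->pages;
+   }
+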
+ Provided each memory usage stays beneath the registered limit (at
+ least when allocating memory from reserves), the system will never
+ run out of emergency memory, and swapout will not deadlock.
+
+ It is worth noting here that it is not critical that each usage
+ stays beneath the limit 100% of the time. Occasional excess is
+ acceptable provided that the memory will be freed again within a
+ short amount of time that does *not* require waiting for any event
+ that itself might require memory.
+ This is because, at all stages of transmit and receive, it is
+ acceptable to discard all transient memory associated with a
+ particular writeout and try again later. On transmit, the page can
+ be re-queued for later transmission. On receive, the packet can be
+ dropped assuming that the peer will resend after a timeout.
+
+ Thus allocations that are truly transient and will be freed without
+ blocking do not strictly need to be reserved for. Doing so might
+ still be a good idea to ensure forward progress doesn't take too
+ long.
+
+4/ low-mem accounting
+
+ Most places that might hold on to emergency memory (e.g. route
+ cache, fragment cache etc.) already place a limit on the amount of
+ memory that they can use. This limit can simply be reserved using
+ the above mechanism and no more needs to be done.
+
+ However, some memory usage might not be accounted with sufficient
+ firmness to allow an appropriate emergency reservation. The
+ in-flight skbs for incoming packets are one such example.
+
+ To support this, a low-overhead mechanism for accounting memory
+ usage against the reserves is provided. This mechanism uses the
+ same data structure that is used to store the emergency memory
+ reservations through the addition of a 'usage' field.
+
+ Before we attempt allocation from the memory reserves, we must check
+ if the resulting 'usage' is below the reservation. If so, we increase
+ the usage and attempt the allocation (which should succeed). If
+ the projected 'usage' exceeds the reservation, we will either fail the
+ allocation, or wait for 'usage' to decrease enough so that it would
+ succeed, depending on __GFP_WAIT.
+
+ When memory that was allocated for that purpose is freed, the
+ 'usage' field is checked again. If it is non-zero, then the size of
+ the freed memory is subtracted from the usage, making sure the usage
+ never becomes less than zero.
+
+ This provides adequate accounting with minimal overheads when not in
+ a low memory condition. When a low memory condition is encountered,
+ it does add the cost of a spinlock necessary to serialise updates
+ to 'usage'.
+
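+ The charge/uncharge logic described above can be sketched as follows
+ (the structure and function names are invented; only the behaviour
+ follows the text):
+
+   struct mem_reserve_acct {
+           unsigned long limit;   /* pages reserved for this user */
+           unsigned long usage;   /* pages currently charged */
+           spinlock_t lock;       /* serialises updates to 'usage' */
+   };
+
+   /* Returns 0 if 'pages' may be taken from the reserve, -ENOMEM if
+    * the charge would exceed the reservation.  A !__GFP_WAIT caller
+    * fails at this point; a __GFP_WAIT caller would instead wait for
+    * 'usage' to drop and then retry. */
+   static int reserve_charge(struct mem_reserve_acct *r, unsigned long pages)
+   {
+           int ret = -ENOMEM;
+
+           spin_lock(&r->lock);
+           if (r->usage + pages <= r->limit) {
+                   r->usage += pages;
+                   ret = 0;
+           }
+           spin_unlock(&r->lock);
+           return ret;
+   }
+
+   static void reserve_uncharge(struct mem_reserve_acct *r, unsigned long pages)
+   {
+           spin_lock(&r->lock);
+           /* Make sure 'usage' never goes below zero. */
+           r->usage -= min(pages, r->usage);
+           spin_unlock(&r->lock);
+   }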
+
+
+5/ swapon/swapoff/swap_out/swap_in
+
+ So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
+ any network socket that it uses, and can know when to account
+ reserve memory carefully, new address_space_operations are
+ available.
+ "swapon" requests that an address space (i.e. a file) be made ready
+ for swapout. swap_out and swap_in request the actual IO. They
+ together must ensure that each swap_out request can succeed without
+ allocating more emergency memory than was reserved by swapon. swapoff
+ is used to reverse the state changes caused by swapon when we disable
+ the swap file.
+
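+ For a filesystem this might end up looking roughly like the fragment
+ below. The exact prototypes are not spelled out in this document, so
+ the handler names are placeholders invented for illustration:
+
+   static const struct address_space_operations nfs_file_aops = {
+           /* ... existing operations ... */
+           .swapon   = nfs_swapon,    /* reserve memory, set SK_MEMALLOC */
+           .swapoff  = nfs_swapoff,   /* undo the above */
+           .swap_out = nfs_swap_out,  /* write one anon page to the file */
+           .swap_in  = nfs_swap_in,   /* read it back in on fault */
+   };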
+
+Thanks for reading this far. I hope it made sense :-)
+
+Neil Brown (with updates from Peter Zijlstra)
+
+