From: Neil Brown <neilb@suse.de>
Subject: swap over network documentation
Patch-mainline: No
References: FATE#303834

Document describing the problem and proposed solution

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>

---
 Documentation/network-swap.txt | 270 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 270 insertions(+)

Index: linux-2.6.26/Documentation/network-swap.txt
===================================================================
--- /dev/null
+++ linux-2.6.26/Documentation/network-swap.txt
@@ -0,0 +1,270 @@
+
+Problem:
+ When Linux needs to allocate memory it may find that there is
+ insufficient free memory so it needs to reclaim space that is in
+ use but not needed at the moment. There are several options:
+
+ 1/ Shrink a kernel cache such as the inode or dentry cache. This
+    is fairly easy but provides limited returns.
+ 2/ Discard 'clean' pages from the page cache. This is easy, and
+    works well as long as there are clean pages in the page cache.
+    Similarly clean 'anonymous' pages can be discarded - if there
+    are any.
+ 3/ Write out some dirty page-cache pages so that they become clean.
+    The VM limits the number of dirty page-cache pages to e.g. 40%
+    of available memory so that (among other reasons) a "sync" will
+    not take excessively long. So there should never be excessive
+    amounts of dirty pagecache.
+    Writing out dirty page-cache pages involves work by the
+    filesystem which may need to allocate memory itself. To avoid
+    deadlock, filesystems use GFP_NOFS when allocating memory on the
+    write-out path. When this is used, cleaning dirty page-cache
+    pages is not an option so if the filesystem finds that memory
+    is tight, another option must be found.
+ 4/ Write out dirty anonymous pages to the "Swap" partition/file.
+    This is the most interesting for a couple of reasons.
+    a/ Unlike dirty page-cache pages, there is no need to write anon
+       pages out unless we are actually short of memory. Thus they
+       tend to be left to last.
+    b/ Anon pages tend to be updated randomly and unpredictably, and
+       flushing them out of memory can have a very significant
+       performance impact on the process using them. This contrasts
+       with page-cache pages which are often written sequentially
+       and often treated as "write-once, read-many".
+    So anon pages tend to be left until last to be cleaned, and may
+    be the only cleanable pages while there are still some dirty
+    page-cache pages (which are waiting on a GFP_NOFS allocation).
+
+[I don't find the above wholly satisfying. There seems to be too much
+ hand-waving. If someone can provide better text explaining why
+ swapout is a special case, that would be great.]
+
+So we need to be able to write to the swap file/partition without
+needing to allocate any memory ... or only a small well controlled
+amount.
+
+The VM reserves a small amount of memory that can only be allocated
+for use as part of the swap-out procedure. It is only available to
+processes with the PF_MEMALLOC flag set, which is typically just the
+memory cleaner.
+
+Traditionally swap-out is performed directly to block devices (swap
+files on block-device filesystems are supported by examining the
+mapping from file offset to device offset in advance, and then using
+the device offsets to write directly to the device). Block device
+drivers are required to pre-allocate any memory that might be needed
+during write-out, and to block when the pre-allocated memory is
+exhausted and no other memory is available. They can be sure not to
+block forever as the pre-allocated memory will be returned as soon as
+the data it is being used for has been written out. The primary
+mechanism for pre-allocating memory is called "mempools".
+
+This approach does not work for writing anonymous pages
+(i.e. swapping) over a network, using e.g. NFS or NBD or iSCSI.
+
+
+The main reason that it does not work is that when data from an anon
+page is written to the network, we must wait for a reply to confirm
+the data is safe. Receiving that reply will consume memory and,
+significantly, we need to allocate memory to an incoming packet before
+we can tell if it is the reply we are waiting for or not.
+
+The secondary reason is that the network code is not written to use
+mempools and in most cases does not need to use them. Changing all
+allocations in the networking layer to use mempools would be quite
+intrusive, would waste memory, and would probably cause a slow-down in
+the common case of not swapping over the network.
+
+These problems are addressed by enhancing the system of memory
+reserves used by PF_MEMALLOC and requiring any in-kernel networking
+client that is used for swap-out to indicate which sockets are used
+for swapout so they can be handled specially in low memory situations.
+
+There are several major parts to this enhancement:
+
+1/ page->reserve, GFP_MEMALLOC
+
+   To handle low memory conditions we need to know when those
+   conditions exist. Having a global "low on memory" flag seems easy,
+   but its implementation is problematic. Instead we make it possible
+   to tell if a recent memory allocation required use of the emergency
+   memory pool.
+   For pages returned by alloc_page, the new page->reserve flag
+   can be tested. If this is set, then a low memory condition was
+   current when the page was allocated, so the memory should be used
+   carefully. (Because low memory conditions are transient, this
+   state is kept in an overloaded member instead of in page flags, which
+   would suggest a more permanent state.)
+
+   For memory allocated using slab/slub: If a page that is added to a
+   kmem_cache is found to have page->reserve set, then a s->reserve
+   flag is set for the whole kmem_cache. Further allocations will only
+   be returned from that page (or any other page in the cache) if they
+   are emergency allocations (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).
+   Non-emergency allocations will block in alloc_page until a
+   non-reserve page is available. Once a non-reserve page has been
+   added to the cache, the s->reserve flag on the cache is removed.
+
+   Because slab objects carry no individual state, it is hard to pass
+   the reserve state along, so the current code relies on a regular
+   allocation failing; various allocation wrappers help here.
+
+   This allows us to
+   a/ request use of the emergency pool when allocating memory
+      (GFP_MEMALLOC), and
+   b/ find out if the emergency pool was used.
+
+2/ SK_MEMALLOC, sk_buff->emergency.
+
+   When memory from the reserve is used to store incoming network
+   packets, the memory must be freed (and the packet dropped) as soon
+   as we find out that the packet is not for a socket that is used for
+   swap-out.
+   To achieve this we have an ->emergency flag for skbs, and an
+   SK_MEMALLOC flag for sockets.
+   When memory is allocated for an skb, it is allocated with
+   GFP_MEMALLOC (if we are currently swapping over the network at
+   all). If a subsequent test shows that the emergency pool was used,
+   ->emergency is set.
+   When the skb is finally attached to its destination socket, the
+   SK_MEMALLOC flag on the socket is tested. If the skb has
+   ->emergency set, but the socket does not have SK_MEMALLOC set, then
+   the skb is immediately freed and the packet is dropped.
+   This ensures that reserve memory is never queued on a socket that is
+   not used for swapout.
+
+   Similarly, if an skb is ever queued for delivery to user-space, for
+   example by netfilter, the ->emergency flag is tested and the skb is
+   released if ->emergency is set. (So the storage route obviously may
+   not pass through a userspace helper, otherwise the packets will
+   never arrive and we will deadlock.)
+
+   This ensures that memory from the emergency reserve can be used to
+   allow swapout to proceed, but will not get caught up in any other
+   network queue.
+
+
+3/ pages_emergency
+
+   The above would be sufficient if the total memory below the lowest
+   memory watermark (i.e. the size of the emergency reserve) were known
+   to be enough to hold all transient allocations needed for writeout.
+   I'm a little blurry on how big the current emergency pool is, but it
+   isn't big and certainly hasn't been sized to allow network traffic
+   to consume any.
+
+   We could simply make the size of the reserve bigger. However in the
+   common case that we are not swapping over the network, that would be
+   a waste of memory.
+
+   So a new "watermark" is defined: pages_emergency. This is
+   effectively added to the current low water marks, so that pages from
+   this emergency pool can only be allocated if one of PF_MEMALLOC or
+   GFP_MEMALLOC is set.
+
+   pages_emergency can be changed dynamically based on need. When
+   swapout over the network is required, pages_emergency is increased
+   to cover the maximum expected load. When network swapout is
+   disabled, pages_emergency is decreased.
+
+   To determine how much to increase it by, we introduce reservation
+   groups....
+
+3a/ reservation groups
+
+   The memory used transiently for swapout can be in a number of
+   different places, e.g. the network route cache, the network
+   fragment cache, in transit between network card and socket, or (in
+   the case of NFS) in sunrpc data structures awaiting a reply.
+   We need to ensure each of these is limited in the amount of memory
+   it uses, and that the maximum is included in the reserve.
+
+   The memory required by the network layer only needs to be reserved
+   once, even if there are multiple swapout paths using the network
+   (e.g. NFS and NBD and iSCSI, though using all three for swapout at
+   the same time would be unusual).
+
+   So we create a tree of reservation groups. The network might
+   register a collection of reservations, but not mark them as being in
+   use. NFS and sunrpc might similarly register a collection of
+   reservations, and attach it to the network reservations as it
+   depends on them.
+   When swapout over NFS is requested, the NFS/sunrpc reservations are
+   activated, which implicitly activates the network reservations.
+
+   The total new reservation is added to pages_emergency.
+
+   Provided each memory usage stays beneath the registered limit (at
+   least when allocating memory from reserves), the system will never
+   run out of emergency memory, and swapout will not deadlock.
+
+   It is worth noting here that it is not critical that each usage
+   stays beneath the limit 100% of the time. Occasional excess is
+   acceptable provided that the memory will be freed again within a
+   short amount of time that does *not* require waiting for any event
+   that itself might require memory.
+   This is because, at all stages of transmit and receive, it is
+   acceptable to discard all transient memory associated with a
+   particular writeout and try again later. On transmit, the page can
+   be re-queued for later transmission. On receive, the packet can be
+   dropped assuming that the peer will resend after a timeout.
+
+   Thus allocations that are truly transient and will be freed without
+   blocking do not strictly need to be reserved for. Doing so might
+   still be a good idea to ensure forward progress doesn't take too
+   long.
+
+4/ low-mem accounting
+
+   Most places that might hold on to emergency memory (e.g. the route
+   cache, fragment cache, etc.) already place a limit on the amount of
+   memory that they can use. This limit can simply be reserved using
+   the above mechanism and no more needs to be done.
+
+   However some memory usage might not be accounted with sufficient
+   firmness to allow an appropriate emergency reservation. The
+   in-flight skbs for incoming packets are one such example.
+
+   To support this, a low-overhead mechanism for accounting memory
+   usage against the reserves is provided. This mechanism uses the
+   same data structure that is used to store the emergency memory
+   reservations, through the addition of a 'usage' field.
+
+   Before we attempt an allocation from the memory reserves, we must
+   check whether the resulting 'usage' would stay below the reservation.
+   If so, we increase the usage and attempt the allocation (which should
+   succeed). If the projected 'usage' exceeds the reservation we either
+   fail the allocation, or wait for 'usage' to decrease enough so that
+   it would succeed, depending on __GFP_WAIT.
+
+   When memory that was allocated for that purpose is freed, the
+   'usage' field is checked again. If it is non-zero, then the size of
+   the freed memory is subtracted from the usage, making sure the usage
+   never becomes less than zero.
+
+   This provides adequate accounting with minimal overheads when not in
+   a low memory condition. When a low memory condition is encountered
+   it does add the cost of a spin lock necessary to serialise updates
+   to 'usage'.
+
+
+
+5/ swapon/swapoff/swap_out/swap_in
+
+   So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
+   any network socket that it uses, and can know when to account
+   reserve memory carefully, new address_space_operations are
+   available.
+   "swapon" requests that an address space (i.e. a file) be made ready
+   for swapout. swap_out and swap_in request the actual IO. Together
+   they must ensure that each swap_out request can succeed without
+   allocating more emergency memory than was reserved by swapon. swapoff
+   is used to reverse the state changes caused by swapon when we disable
+   the swap file.
+
+
+Thanks for reading this far. I hope it made sense :-)
+
+Neil Brown (with updates from Peter Zijlstra)
+
+