From: Neil Brown <neilb@suse.de>
Subject: swap over network documentation
Patch-mainline: No
References: FATE#303834

Document describing the problem and proposed solution

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>

---
 Documentation/network-swap.txt | 270 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 270 insertions(+)

Index: linux-2.6.26/Documentation/network-swap.txt
===================================================================
--- /dev/null
+++ linux-2.6.26/Documentation/network-swap.txt
@@ -0,0 +1,270 @@
+
+Problem:
+  When Linux needs to allocate memory it may find that there is
+  insufficient free memory so it needs to reclaim space that is in
+  use but not needed at the moment. There are several options:
+
+  1/ Shrink a kernel cache such as the inode or dentry cache. This
+     is fairly easy but provides limited returns.
+  2/ Discard 'clean' pages from the page cache. This is easy, and
+     works well as long as there are clean pages in the page cache.
+     Similarly clean 'anonymous' pages can be discarded - if there
+     are any.
+  3/ Write out some dirty page-cache pages so that they become clean.
+     The VM limits the number of dirty page-cache pages to e.g. 40%
+     of available memory so that (among other reasons) a "sync" will
+     not take excessively long. So there should never be excessive
+     amounts of dirty pagecache.
+     Writing out dirty page-cache pages involves work by the
+     filesystem which may need to allocate memory itself. To avoid
+     deadlock, filesystems use GFP_NOFS when allocating memory on the
+     write-out path. When this is used, cleaning dirty page-cache
+     pages is not an option so if the filesystem finds that memory
+     is tight, another option must be found.
+  4/ Write out dirty anonymous pages to the "Swap" partition/file.
+     This is the most interesting for a couple of reasons.
+     a/ Unlike dirty page-cache pages, there is no need to write anon
+        pages out unless we are actually short of memory. Thus they
+        tend to be left to last.
+     b/ Anon pages tend to be updated randomly and unpredictably, and
+        flushing them out of memory can have a very significant
+        performance impact on the process using them. This contrasts
+        with page-cache pages which are often written sequentially
+        and often treated as "write-once, read-many".
+     So anon pages tend to be left until last to be cleaned, and may
+     be the only cleanable pages while there are still some dirty
+     page-cache pages (which are waiting on a GFP_NOFS allocation).
+
+[I don't find the above wholly satisfying. There seems to be too much
+ hand-waving. If someone can provide better text explaining why
+ swapout is a special case, that would be great.]
+
+So we need to be able to write to the swap file/partition without
+needing to allocate any memory ... or only a small well controlled
+amount.
+
+The VM reserves a small amount of memory that can only be allocated
+for use as part of the swap-out procedure. It is only available to
+processes with the PF_MEMALLOC flag set, which is typically just the
+memory cleaner.
+
+Traditionally swap-out is performed directly to block devices (swap
+files on block-device filesystems are supported by examining the
+mapping from file offset to device offset in advance, and then using
+the device offsets to write directly to the device). Block devices
+are required to pre-allocate any memory that might be needed during
+write-out, and to block when the pre-allocated memory is exhausted
+and no other memory is available. They can be sure not to block
+forever as the pre-allocated memory will be returned as soon as the
+data it is being used for has been written out. The primary
+mechanism for pre-allocating memory is called "mempools".
+
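The mempool idea can be sketched in userspace. Everything below (the toy_mempool type and its helpers) is a simplified stand-in for the kernel's mempool_create()/mempool_alloc()/mempool_free(), not the real API: the point is only that allocation falls back to a pre-allocated reserve instead of failing, and frees refill the reserve first.

```c
#include <assert.h>
#include <stdlib.h>

/* A toy mempool: a stack of pre-allocated elements used only when the
 * normal allocator cannot be (simulated here by heap_ok == 0). */
struct toy_mempool {
    void *elems[8];    /* pre-allocated reserve elements */
    int nr_free;       /* how many reserve elements remain */
    size_t elem_size;
};

int toy_mempool_init(struct toy_mempool *p, int nr, size_t size)
{
    p->nr_free = 0;
    p->elem_size = size;
    for (int i = 0; i < nr && i < 8; i++) {
        void *e = malloc(size);
        if (!e)
            return -1;
        p->elems[p->nr_free++] = e;
    }
    return 0;
}

/* Try the normal allocator first; under memory pressure, dip into the
 * reserve instead of failing, which is what lets writeout progress. */
void *toy_mempool_alloc(struct toy_mempool *p, int heap_ok)
{
    if (heap_ok)
        return malloc(p->elem_size);
    return p->nr_free ? p->elems[--p->nr_free] : 0;
}

/* Freed elements refill the reserve before going back to the heap,
 * so the pool is guaranteed to recover as writes complete. */
void toy_mempool_free(struct toy_mempool *p, void *e)
{
    if (p->nr_free < 8)
        p->elems[p->nr_free++] = e;
    else
        free(e);
}
```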
+This approach does not work for writing anonymous pages
+(i.e. swapping) over a network, using e.g. NFS or NBD or iSCSI.
+
+
+The main reason that it does not work is that when data from an anon
+page is written to the network, we must wait for a reply to confirm
+the data is safe. Receiving that reply will consume memory and,
+significantly, we need to allocate memory for an incoming packet
+before we can tell if it is the reply we are waiting for or not.
+
+The secondary reason is that the network code is not written to use
+mempools and in most cases does not need to use them. Changing all
+allocations in the networking layer to use mempools would be quite
+intrusive, and would waste memory, and probably cause a slow-down in
+the common case of not swapping over the network.
+
+These problems are addressed by enhancing the system of memory
+reserves used by PF_MEMALLOC and requiring any in-kernel networking
+client that is used for swap-out to indicate which sockets are used
+for swapout so they can be handled specially in low memory situations.
+
+There are several major parts to this enhancement:
+
+1/ page->reserve, GFP_MEMALLOC
+
+  To handle low memory conditions we need to know when those
+  conditions exist. Having a global "low on memory" flag seems easy,
+  but its implementation is problematic. Instead we make it possible
+  to tell if a recent memory allocation required use of the emergency
+  memory pool.
+  For pages returned by alloc_page, the new page->reserve flag
+  can be tested. If this is set, then a low memory condition was
+  current when the page was allocated, so the memory should be used
+  carefully. (Because low memory conditions are transient, this
+  state is kept in an overloaded member instead of in page flags, which
+  would suggest a more permanent state.)
+
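A userspace model may make the intended behaviour concrete. The names here (model_alloc_page, free_pages, emergency_pages) are invented for illustration and are not the kernel's; the behaviour modelled is the one described above: normal allocations fail when only the reserve is left, while GFP_MEMALLOC allocations succeed and come back with ->reserve set.

```c
#include <assert.h>

/* Model of the proposed page->reserve signal. */
struct page { int reserve; };

static int free_pages = 1;       /* pages above the watermark */
static int emergency_pages = 4;  /* the emergency reserve */

struct page *model_alloc_page(int gfp_memalloc)
{
    static struct page pg;
    if (free_pages > 0) {
        free_pages--;
        pg.reserve = 0;          /* no memory pressure */
        return &pg;
    }
    if (gfp_memalloc && emergency_pages > 0) {
        emergency_pages--;
        pg.reserve = 1;          /* low-memory condition was current */
        return &pg;
    }
    return 0;                    /* non-emergency callers must wait */
}
```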
+  For memory allocated using slab/slub: If a page that is added to a
+  kmem_cache is found to have page->reserve set, then a s->reserve
+  flag is set for the whole kmem_cache. Further allocations will only
+  be returned from that page (or any other page in the cache) if they
+  are emergency allocations (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).
+  Non-emergency allocations will block in alloc_page until a
+  non-reserve page is available. Once a non-reserve page has been
+  added to the cache, the s->reserve flag on the cache is removed.
+
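The cache-wide gating can be modelled the same way; model_cache, model_kmem_alloc and model_cache_add_page are illustrative names, not kernel interfaces. Objects in a reserve-backed cache are handed out only to emergency callers until a non-reserve page arrives:

```c
#include <assert.h>

/* Model of the proposed kmem_cache s->reserve gating. */
struct model_cache { int reserve; int objects; };

int model_kmem_alloc(struct model_cache *s, int emergency)
{
    if (s->objects == 0)
        return -1;              /* would fall through to alloc_page */
    if (s->reserve && !emergency)
        return -1;              /* non-emergency callers must block */
    s->objects--;
    return 0;
}

/* Adding a fresh page sets or clears the cache-wide reserve flag
 * according to whether the page came from the emergency pool. */
void model_cache_add_page(struct model_cache *s, int page_reserve, int objs)
{
    s->objects += objs;
    s->reserve = page_reserve;
}
```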
+  Because slab objects have no individual state, it is hard to pass
+  reserve state along, so the current code relies on a regular
+  allocation failing. Various allocation wrappers help here.
+
+  This allows us to
+  a/ request use of the emergency pool when allocating memory
+     (GFP_MEMALLOC), and
+  b/ find out if the emergency pool was used.
+
+2/ SK_MEMALLOC, sk_buff->emergency
+
+  When memory from the reserve is used to store incoming network
+  packets, the memory must be freed (and the packet dropped) as soon
+  as we find out that the packet is not for a socket that is used for
+  swap-out.
+  To achieve this we have an ->emergency flag for skbs, and an
+  SK_MEMALLOC flag for sockets.
+  When memory is allocated for an skb, it is allocated with
+  GFP_MEMALLOC (if we are currently swapping over the network at
+  all). If a subsequent test shows that the emergency pool was used,
+  ->emergency is set.
+  When the skb is finally attached to its destination socket, the
+  SK_MEMALLOC flag on the socket is tested. If the skb has
+  ->emergency set, but the socket does not have SK_MEMALLOC set, then
+  the skb is immediately freed and the packet is dropped.
+  This ensures that reserve memory is never queued on a socket that is
+  not used for swapout.
+
+  Similarly, if an skb is ever queued for delivery to user-space, for
+  example by netfilter, the ->emergency flag is tested and the skb is
+  released if ->emergency is set. (So the storage path must not pass
+  through a userspace helper, otherwise the packets will never arrive
+  and we will deadlock.)
+
+  This ensures that memory from the emergency reserve can be used to
+  allow swapout to proceed, but will not get caught up in any other
+  network queue.
+
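The socket-attach check reduces to a few lines; the struct and function names below are invented stand-ins for sk_buff, sock and the receive path, but the decision is exactly the one described: reserve-backed skbs destined for a non-swapout socket are dropped immediately.

```c
#include <assert.h>

/* Model of the ->emergency / SK_MEMALLOC check at socket attach. */
struct model_skb  { int emergency; int freed; };
struct model_sock { int sk_memalloc; int queued; };

void model_sock_queue_rcv(struct model_sock *sk, struct model_skb *skb)
{
    if (skb->emergency && !sk->sk_memalloc) {
        skb->freed = 1;   /* kfree_skb(): drop, return reserve memory */
        return;
    }
    sk->queued++;         /* normal delivery */
}
```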
+
+3/ pages_emergency
+
+  The above would be sufficient if the total memory below the lowest
+  memory watermark (i.e. the size of the emergency reserve) were known
+  to be enough to hold all transient allocations needed for writeout.
+  I'm a little blurry on how big the current emergency pool is, but it
+  isn't big and certainly hasn't been sized to allow network traffic
+  to consume any.
+
+  We could simply make the size of the reserve bigger. However in the
+  common case that we are not swapping over the network, that would be
+  a waste of memory.
+
+  So a new "watermark" is defined: pages_emergency. This is
+  effectively added to the current low water marks, so that pages from
+  this emergency pool can only be allocated if one of PF_MEMALLOC or
+  GFP_MEMALLOC is set.
+
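In effect the watermark test gains one term; this sketch uses invented names (watermark_ok, pages_low, set_pages_emergency) rather than the real zone-watermark code, but shows how the reserve stays invisible to ordinary allocations while remaining available to PF_MEMALLOC/GFP_MEMALLOC ones:

```c
#include <assert.h>

/* Model of the pages_emergency watermark. */
static long pages_low = 64;
static long pages_emergency = 0;  /* raised when net-swap is enabled */

int watermark_ok(long free_pages, int memalloc)
{
    long min = pages_low;
    if (!memalloc)
        min += pages_emergency;   /* reserve hidden from normal allocs */
    return free_pages > min;
}

void set_pages_emergency(long pages)
{
    pages_emergency = pages;      /* adjusted dynamically on swapon/off */
}
```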
+  pages_emergency can be changed dynamically based on need. When
+  swapout over the network is required, pages_emergency is increased
+  to cover the maximum expected load. When network swapout is
+  disabled, pages_emergency is decreased.
+
+  To determine how much to increase it by, we introduce reservation
+  groups....
+
+3a/ reservation groups
+
+  The memory used transiently for swapout can be in a number of
+  different places, e.g. the network route cache, the network
+  fragment cache, in transit between network card and socket, or (in
+  the case of NFS) in sunrpc data structures awaiting a reply.
+  We need to ensure each of these is limited in the amount of memory
+  they use, and that the maximum is included in the reserve.
+
+  The memory required by the network layer only needs to be reserved
+  once, even if there are multiple swapout paths using the network
+  (e.g. NFS and NBD and iSCSI, though using all three for swapout at
+  the same time would be unusual).
+
+  So we create a tree of reservation groups. The network might
+  register a collection of reservations, but not mark them as being in
+  use. NFS and sunrpc might similarly register a collection of
+  reservations, and attach it to the network reservations as it
+  depends on them.
+  When swapout over NFS is requested, the NFS/sunrpc reservations are
+  activated, which implicitly activates the network reservations.
+
+  The total new reservation is added to pages_emergency.
+
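The tree and its "shared parent counted once" property can be sketched as follows; res_group and res_group_activate are illustrative names, and the dependency chain (NFS depends on the network group, as would NBD) follows the example in the text:

```c
#include <assert.h>

/* Model of a reservation-group tree: activating a group activates its
 * ancestors, but a shared ancestor is only paid for once. */
struct res_group {
    long pages;                /* reserve this group needs */
    int active;                /* activation count */
    struct res_group *parent;  /* group this one depends on */
};

/* Returns the number of pages pages_emergency must grow by. */
long res_group_activate(struct res_group *g)
{
    long added = 0;
    for (; g; g = g->parent)
        if (g->active++ == 0)
            added += g->pages; /* first user pays for this group */
    return added;
}
```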
+  Provided each memory usage stays beneath the registered limit (at
+  least when allocating memory from reserves), the system will never
+  run out of emergency memory, and swapout will not deadlock.
+
+  It is worth noting here that it is not critical that each usage
+  stays beneath the limit 100% of the time. Occasional excess is
+  acceptable provided that the memory will be freed again within a
+  short amount of time that does *not* require waiting for any event
+  that itself might require memory.
+  This is because, at all stages of transmit and receive, it is
+  acceptable to discard all transient memory associated with a
+  particular writeout and try again later. On transmit, the page can
+  be re-queued for later transmission. On receive, the packet can be
+  dropped assuming that the peer will resend after a timeout.
+
+  Thus allocations that are truly transient and will be freed without
+  blocking do not strictly need to be reserved for. Doing so might
+  still be a good idea to ensure forward progress doesn't take too
+  long.
+
+4/ low-mem accounting
+
+  Most places that might hold on to emergency memory (e.g. route
+  cache, fragment cache, etc.) already place a limit on the amount of
+  memory that they can use. This limit can simply be reserved using
+  the above mechanism and no more needs to be done.
+
+  However some memory usage might not be accounted with sufficient
+  firmness to allow an appropriate emergency reservation. The
+  in-flight skbs for incoming packets are one such example.
+
+  To support this, a low-overhead mechanism for accounting memory
+  usage against the reserves is provided. This mechanism uses the
+  same data structure that is used to store the emergency memory
+  reservations, through the addition of a 'usage' field.
+
+  Before we attempt allocation from the memory reserves, we must check
+  if the resulting 'usage' is below the reservation. If so, we increase
+  the usage and attempt the allocation (which should succeed). If
+  the projected 'usage' exceeds the reservation we'll either fail the
+  allocation, or wait for 'usage' to decrease enough so that it would
+  succeed, depending on __GFP_WAIT.
+
+  When memory that was allocated for that purpose is freed, the
+  'usage' field is checked again. If it is non-zero, then the size of
+  the freed memory is subtracted from the usage, making sure the usage
+  never becomes less than zero.
+
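The charge/uncharge pair described in the last two paragraphs can be sketched directly; reserve_charge and reserve_uncharge are invented names, and a real implementation would take the spin lock mentioned below around both updates:

```c
#include <assert.h>

/* Model of 'usage' accounting against an emergency reservation. */
struct reserve {
    long limit;  /* pages reserved for this user */
    long usage;  /* pages currently charged */
};

int reserve_charge(struct reserve *r, long pages)
{
    if (r->usage + pages > r->limit)
        return -1;       /* fail or wait, depending on __GFP_WAIT */
    r->usage += pages;
    return 0;            /* allocation may proceed, should succeed */
}

void reserve_uncharge(struct reserve *r, long pages)
{
    r->usage -= pages;
    if (r->usage < 0)
        r->usage = 0;    /* usage never becomes less than zero */
}
```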
+  This provides adequate accounting with minimal overheads when not in
+  a low memory condition. When a low memory condition is encountered
+  it does add the cost of a spin lock necessary to serialise updates
+  to 'usage'.
+
+
+
+5/ swapon/swapoff/swap_out/swap_in
+
+  So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
+  any network socket that it uses, and can know when to account
+  reserve memory carefully, new address_space_operations are
+  available.
+  "swapon" requests that an address space (i.e. a file) be made ready
+  for swapout. swap_out and swap_in request the actual IO. They
+  together must ensure that each swap_out request can succeed without
+  allocating more emergency memory than was reserved by swapon. swapoff
+  is used to reverse the state changes caused by swapon when we disable
+  the swap file.
+
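The exact prototypes of these operations are not given in this document, so the signatures and bookkeeping below are assumptions for illustration only: swapon is where a network filesystem would mark its transport socket SK_MEMALLOC (and activate its reservation group), and swapoff reverses that.

```c
#include <assert.h>

/* Illustrative model of the swapon/swapoff state changes. */
struct model_mapping { int sk_memalloc; };

int model_swapon(struct model_mapping *m)
{
    m->sk_memalloc = 1;  /* mark the transport socket for swapout */
    return 0;
}

int model_swapoff(struct model_mapping *m)
{
    m->sk_memalloc = 0;  /* reverse the state changes of swapon */
    return 0;
}
```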
+
+Thanks for reading this far. I hope it made sense :-)
+
+Neil Brown (with updates from Peter Zijlstra)
+
+