From: Neil Brown <neilb@suse.de>
Subject: swap over network documentation
Patch-mainline: No
References: FATE#303834

Document describing the problem and proposed solution

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>

---
 Documentation/network-swap.txt | 270 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 270 insertions(+)

Index: linux-2.6.26/Documentation/network-swap.txt
===================================================================
--- /dev/null
+++ linux-2.6.26/Documentation/network-swap.txt
@@ -0,0 +1,270 @@
+
+Problem:
+ When Linux needs to allocate memory it may find that there is
+ insufficient free memory so it needs to reclaim space that is in
+ use but not needed at the moment. There are several options:
+
+ 1/ Shrink a kernel cache such as the inode or dentry cache. This
+ is fairly easy but provides limited returns.
+ 2/ Discard 'clean' pages from the page cache. This is easy, and
+ works well as long as there are clean pages in the page cache.
+ Similarly clean 'anonymous' pages can be discarded - if there
+ are any.
+ 3/ Write out some dirty page-cache pages so that they become clean.
+ The VM limits the number of dirty page-cache pages to e.g. 40%
+ of available memory so that (among other reasons) a "sync" will
+ not take excessively long. So there should never be excessive
+ amounts of dirty pagecache.
+ Writing out dirty page-cache pages involves work by the
+ filesystem, which may need to allocate memory itself. To avoid
+ deadlock, filesystems use GFP_NOFS when allocating memory on the
+ write-out path. When this is used, cleaning dirty page-cache
+ pages is not an option, so if the filesystem finds that memory
+ is tight, another option must be found.
+ 4/ Write out dirty anonymous pages to the "Swap" partition/file.
+ This is the most interesting for a couple of reasons.
+ a/ Unlike dirty page-cache pages, there is no need to write anon
+ pages out unless we are actually short of memory. Thus they
+ tend to be left to last.
+ b/ Anon pages tend to be updated randomly and unpredictably, and
+ flushing them out of memory can have a very significant
+ performance impact on the process using them. This contrasts
+ with page-cache pages which are often written sequentially
+ and often treated as "write-once, read-many".
+ So anon pages tend to be left until last to be cleaned, and may
+ be the only cleanable pages while there are still some dirty
+ page-cache pages (which are waiting on a GFP_NOFS allocation).
+
+[I don't find the above wholly satisfying. There seems to be too much
+ hand-waving. If someone can provide better text explaining why
+ swapout is a special case, that would be great.]
+
+So we need to be able to write to the swap file/partition without
+needing to allocate any memory ... or only a small well-controlled
+amount.
+
+The VM reserves a small amount of memory that can only be allocated
+for use as part of the swap-out procedure. It is only available to
+processes with the PF_MEMALLOC flag set, which is typically just the
+memory cleaner.
+
+Traditionally swap-out is performed directly to block devices (swap
+files on block-device filesystems are supported by examining the
+mapping from file offset to device offset in advance, and then using
+the device offsets to write directly to the device). Block device
+drivers are required to pre-allocate any memory that might be needed
+during write-out, and to block when the pre-allocated memory is
+exhausted and no other memory is available. They can be sure not to
+block forever, as the pre-allocated memory will be returned as soon as
+the data it is being used for has been written out. The primary
+mechanism for pre-allocating memory is called "mempools".
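+
+For reference, the existing mempool interface is used roughly as in
+the sketch below. This is only an illustration of the general
+pattern; the pool, the element size and the helper names are invented
+for the example and are not part of this patch.
+
+   #include <linux/mempool.h>
+   #include <linux/gfp.h>
+   #include <linux/errno.h>
+
+   /* Enough elements to guarantee forward progress for write-out even
+    * when the page allocator has nothing left to give. */
+   #define WRITEOUT_POOL_MIN 16
+
+   static mempool_t *writeout_pool;
+
+   static int writeout_pool_init(void)
+   {
+           /* Pre-allocate WRITEOUT_POOL_MIN elements of 256 bytes each. */
+           writeout_pool = mempool_create_kmalloc_pool(WRITEOUT_POOL_MIN, 256);
+           return writeout_pool ? 0 : -ENOMEM;
+   }
+
+   static void *alloc_writeout_request(void)
+   {
+           /* May sleep, but cannot fail: when the pool is exhausted it
+            * waits for an element to be given back (mempool_free()) by
+            * a completing write, so write-out always makes progress. */
+           return mempool_alloc(writeout_pool, GFP_NOIO);
+   }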
+
+This approach does not work for writing anonymous pages
+(i.e. swapping) over a network, using e.g. NFS or NBD or iSCSI.
+
+
+The main reason that it does not work is that when data from an anon
+page is written to the network, we must wait for a reply to confirm
+the data is safe. Receiving that reply will consume memory and,
+significantly, we need to allocate memory to an incoming packet before
+we can tell if it is the reply we are waiting for or not.
+
+The secondary reason is that the network code is not written to use
+mempools and in most cases does not need to use them. Changing all
+allocations in the networking layer to use mempools would be quite
+intrusive, and would waste memory, and probably cause a slow-down in
+the common case of not swapping over the network.
+
+These problems are addressed by enhancing the system of memory
+reserves used by PF_MEMALLOC and requiring any in-kernel networking
+client that is used for swap-out to indicate which sockets are used
+for swapout, so they can be handled specially in low memory situations.
+
+There are several major parts to this enhancement:
+
+1/ page->reserve, GFP_MEMALLOC
+
+ To handle low memory conditions we need to know when those
+ conditions exist. Having a global "low on memory" flag seems easy,
+ but its implementation is problematic. Instead we make it possible
+ to tell if a recent memory allocation required use of the emergency
+ memory pool.
+ For pages returned by alloc_page, the new page->reserve flag
+ can be tested. If this is set, then a low memory condition was
+ current when the page was allocated, so the memory should be used
+ carefully. (Because low memory conditions are transient, this
+ state is kept in an overloaded member instead of in page flags, which
+ would suggest a more permanent state.)
+
+ For memory allocated using slab/slub: If a page that is added to a
+ kmem_cache is found to have page->reserve set, then a s->reserve
+ flag is set for the whole kmem_cache. Further allocations will only
+ be returned from that page (or any other page in the cache) if they
+ are emergency allocations (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).
+ Non-emergency allocations will block in alloc_page until a
+ non-reserve page is available. Once a non-reserve page has been
+ added to the cache, the s->reserve flag on the cache is removed.
+
+ Because slab objects have no individual state, it is hard to pass
+ the reserve state along; the current code relies on a regular
+ allocation failing first. Various allocation wrappers help here.
+
+ This allows us to
+ a/ request use of the emergency pool when allocating memory
+ (GFP_MEMALLOC), and
+ b/ find out if the emergency pool was used.
+
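+ The interface described above might be used roughly as in the sketch
+ below. GFP_MEMALLOC and page->reserve are the names introduced by
+ this patch set, and the helper itself is invented for illustration:
+
+   static struct page *swap_rx_alloc_page(void)
+   {
+           /* GFP_MEMALLOC permits dipping into the emergency reserve. */
+           struct page *page = alloc_page(GFP_ATOMIC | GFP_MEMALLOC);
+
+           if (page && page->reserve) {
+                   /* The emergency pool was used: we are in a low
+                    * memory condition, so this page must only be used
+                    * for work that helps free memory (such as swap-out
+                    * traffic) and must be freed again promptly. */
+           }
+           return page;
+   }
+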
+2/ SK_MEMALLOC, sk_buff->emergency.
+
+ When memory from the reserve is used to store incoming network
+ packets, the memory must be freed (and the packet dropped) as soon
+ as we find out that the packet is not for a socket that is used for
+ swap-out.
+ To achieve this we have an ->emergency flag for skbs, and an
+ SK_MEMALLOC flag for sockets.
+ When memory is allocated for an skb, it is allocated with
+ GFP_MEMALLOC (if we are currently swapping over the network at
+ all). If a subsequent test shows that the emergency pool was used,
+ ->emergency is set.
+ When the skb is finally attached to its destination socket, the
+ SK_MEMALLOC flag on the socket is tested. If the skb has
+ ->emergency set, but the socket does not have SK_MEMALLOC set, then
+ the skb is immediately freed and the packet is dropped.
+ This ensures that reserve memory is never queued on a socket that is
+ not used for swapout.
+
+ Similarly, if an skb is ever queued for delivery to user-space, for
+ example by netfilter, the ->emergency flag is tested and the skb is
+ released if ->emergency is set. (So the storage path obviously may
+ not pass through a user-space helper, otherwise the packets will
+ never arrive and we will deadlock.)
+
+ This ensures that memory from the emergency reserve can be used to
+ allow swapout to proceed, but will not get caught up in any other
+ network queue.
+
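+ The queueing decision described above amounts to something like the
+ sketch below. The skb ->emergency flag and SK_MEMALLOC are the names
+ used by this patch set; the helper itself is invented for
+ illustration:
+
+   /* Called once an incoming skb has been matched to its socket. */
+   static int sock_may_queue_skb(struct sock *sk, struct sk_buff *skb)
+   {
+           /* Emergency memory may only be queued on sockets that are
+            * part of the swap-out path (those flagged SK_MEMALLOC). */
+           if (skb->emergency && !sock_flag(sk, SK_MEMALLOC)) {
+                   kfree_skb(skb);        /* drop it and free the reserve */
+                   return 0;
+           }
+           return 1;                      /* safe to queue */
+   }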
+
+3/ pages_emergency
+
+ The above would be sufficient if the total memory below the lowest
+ memory watermark (i.e. the size of the emergency reserve) were known
+ to be enough to hold all transient allocations needed for writeout.
+ I'm a little blurry on how big the current emergency pool is, but it
+ isn't big and certainly hasn't been sized to allow network traffic
+ to consume any.
+
+ We could simply make the size of the reserve bigger. However, in the
+ common case that we are not swapping over the network, that would be
+ a waste of memory.
+
+ So a new "watermark" is defined: pages_emergency. This is
+ effectively added to the current low water marks, so that pages from
+ this emergency pool can only be allocated if one of PF_MEMALLOC or
+ GFP_MEMALLOC is set.
+
+ pages_emergency can be changed dynamically based on need. When
+ swapout over the network is required, pages_emergency is increased
+ to cover the maximum expected load. When network swapout is
+ disabled, pages_emergency is decreased.
+
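+ In effect the allocator's watermark test becomes something like the
+ sketch below. This is purely illustrative; the real logic is more
+ involved, and the helper, the zone field and the global used here are
+ simplified names for the example:
+
+   extern unsigned long pages_emergency;   /* the new watermark */
+
+   static int zone_alloc_allowed(struct zone *z, gfp_t gfp_mask,
+                                 unsigned long free_pages)
+   {
+           unsigned long min = z->pages_min;
+
+           /* Ordinary allocations must also stay above the emergency
+            * reserve, keeping it for PF_MEMALLOC/GFP_MEMALLOC users. */
+           if (!(gfp_mask & GFP_MEMALLOC) && !(current->flags & PF_MEMALLOC))
+                   min += pages_emergency;
+
+           return free_pages > min;
+   }
+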
+ To determine how much to increase it by, we introduce reservation
+ groups....
+
+3a/ reservation groups
+
+ The memory used transiently for swapout can be in a number of
+ different places, e.g. the network route cache, the network
+ fragment cache, in transit between network card and socket, or (in
+ the case of NFS) in sunrpc data structures awaiting a reply.
+ We need to ensure each of these is limited in the amount of memory
+ it uses, and that the maximum is included in the reserve.
+
+ The memory required by the network layer only needs to be reserved
+ once, even if there are multiple swapout paths using the network
+ (e.g. NFS and NBD and iSCSI, though using all three for swapout at
+ the same time would be unusual).
+
+ So we create a tree of reservation groups. The network might
+ register a collection of reservations, but not mark them as being in
+ use. NFS and sunrpc might similarly register a collection of
+ reservations, and attach it to the network reservations as it
+ depends on them.
+ When swapout over NFS is requested, the NFS/sunrpc reservations are
+ activated, which implicitly activates the network reservations.
+
+ The total new reservation is added to pages_emergency.
+
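+ A toy model of the idea (this is not the actual interface from the
+ patch set; the structure and names below are invented only to show
+ how activation propagates up the tree):
+
+   struct reserve_group {
+           const char *name;
+           unsigned long pages;          /* worst-case need of this user */
+           int active;                   /* number of users activating us */
+           struct reserve_group *parent; /* e.g. NFS/sunrpc -> network */
+   };
+
+   /* Activating a group activates everything it depends on; each group
+    * newly brought into use adds its worst case to pages_emergency. */
+   static void reserve_group_activate(struct reserve_group *g)
+   {
+           for (; g; g = g->parent)
+                   if (g->active++ == 0)
+                           pages_emergency += g->pages;
+   }
+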
+ Provided each memory usage stays beneath the registered limit (at
+ least when allocating memory from reserves), the system will never
+ run out of emergency memory, and swapout will not deadlock.
+
+ It is worth noting here that it is not critical that each usage
+ stays beneath the limit 100% of the time. Occasional excess is
+ acceptable provided that the memory will be freed again within a
+ short amount of time that does *not* require waiting for any event
+ that itself might require memory.
+ This is because, at all stages of transmit and receive, it is
+ acceptable to discard all transient memory associated with a
+ particular writeout and try again later. On transmit, the page can
+ be re-queued for later transmission. On receive, the packet can be
+ dropped assuming that the peer will resend after a timeout.
+
+ Thus allocations that are truly transient and will be freed without
+ blocking do not strictly need to be reserved for. Doing so might
+ still be a good idea to ensure forward progress doesn't take too
+ long.
+
+4/ low-mem accounting
+
+ Most places that might hold on to emergency memory (e.g. route
+ cache, fragment cache etc.) already place a limit on the amount of
+ memory that they can use. This limit can simply be reserved using
+ the above mechanism and no more needs to be done.
+
+ However, some memory usage might not be accounted with sufficient
+ firmness to allow an appropriate emergency reservation. The
+ in-flight skbs for incoming packets are one such example.
+
+ To support this, a low-overhead mechanism for accounting memory
+ usage against the reserves is provided. This mechanism uses the
+ same data structure that is used to store the emergency memory
+ reservations through the addition of a 'usage' field.
+
+ Before we attempt allocation from the memory reserves, we must check
+ if the resulting 'usage' is below the reservation. If so, we increase
+ the usage and attempt the allocation (which should succeed). If
+ the projected 'usage' exceeds the reservation, we will either fail the
+ allocation, or wait for 'usage' to decrease enough so that it would
+ succeed, depending on __GFP_WAIT.
+
+ When memory that was allocated for that purpose is freed, the
+ 'usage' field is checked again. If it is non-zero, then the size of
+ the freed memory is subtracted from the usage, making sure the usage
+ never becomes less than zero.
+
+ This provides adequate accounting with minimal overheads when not in
+ a low memory condition. When a low memory condition is encountered,
+ it does add the cost of a spinlock necessary to serialise updates
+ to 'usage'.
+
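+ The charge/uncharge logic described above can be sketched as follows
+ (the structure and function names are invented; only the behaviour
+ follows the text):
+
+   struct mem_reserve_acct {
+           unsigned long limit;   /* pages reserved for this user */
+           unsigned long usage;   /* pages currently charged */
+           spinlock_t lock;       /* serialises updates to 'usage' */
+   };
+
+   /* Returns 0 if 'pages' may be taken from the reserve, -ENOMEM if
+    * the charge would exceed the reservation.  A !__GFP_WAIT caller
+    * fails at this point; a __GFP_WAIT caller would instead wait for
+    * 'usage' to drop and then retry. */
+   static int reserve_charge(struct mem_reserve_acct *r, unsigned long pages)
+   {
+           int ret = -ENOMEM;
+
+           spin_lock(&r->lock);
+           if (r->usage + pages <= r->limit) {
+                   r->usage += pages;
+                   ret = 0;
+           }
+           spin_unlock(&r->lock);
+           return ret;
+   }
+
+   static void reserve_uncharge(struct mem_reserve_acct *r, unsigned long pages)
+   {
+           spin_lock(&r->lock);
+           /* Make sure 'usage' never goes below zero. */
+           r->usage -= min(pages, r->usage);
+           spin_unlock(&r->lock);
+   }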
+
+
+5/ swapon/swapoff/swap_out/swap_in
+
+ So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
+ any network socket that it uses, and can know when to account
+ reserve memory carefully, new address_space_operations are
+ available.
+ "swapon" requests that an address space (i.e. a file) be made ready
+ for swapout. swap_out and swap_in request the actual IO. They
+ together must ensure that each swap_out request can succeed without
+ allocating more emergency memory than was reserved by swapon. swapoff
+ is used to reverse the state changes caused by swapon when we disable
+ the swap file.
+
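+ For a filesystem this might end up looking roughly like the fragment
+ below. The exact prototypes are not spelled out in this document, so
+ the handler names are placeholders invented for illustration:
+
+   static const struct address_space_operations nfs_file_aops = {
+           /* ... existing operations ... */
+           .swapon   = nfs_swapon,    /* reserve memory, set SK_MEMALLOC */
+           .swapoff  = nfs_swapoff,   /* undo the above */
+           .swap_out = nfs_swap_out,  /* write one anon page to the file */
+           .swap_in  = nfs_swap_in,   /* read it back in on fault */
+   };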
+
+Thanks for reading this far. I hope it made sense :-)
+
+Neil Brown (with updates from Peter Zijlstra)
+
+