--- /dev/null
+From a77ebd333cd810d7b680d544be88c875131c2bd3 Mon Sep 17 00:00:00 2001
+From: Mel Gorman <mgorman@suse.de>
+Date: Thu, 12 Jan 2012 17:19:22 -0800
+Subject: mm: compaction: allow compaction to isolate dirty pages
+
+From: Mel Gorman <mgorman@suse.de>
+
+commit a77ebd333cd810d7b680d544be88c875131c2bd3 upstream.
+
+Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
+ information by reducing LRU list churning had the side-effect of
+ reducing THP allocation success rates. This was part of a series
+ to restore the success rates while preserving the reclaim fix.
+
+Short summary: There are severe stalls when a USB stick using VFAT is
+used with THP enabled that are reduced by this series. If you are
+experiencing this problem, please test and report back. Considering I
+have seen complaints from openSUSE and Fedora users on this as well as a
+few private mails, I'm guessing it's a widespread issue. This is a new
+type of USB-related stall because it is due to synchronous compaction
+writing, whereas in the past the big problem was dirty pages reaching
+the end of the LRU and being written by reclaim.
+
+Am cc'ing Andrew this time and this series would replace
+mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch.
+I'm also cc'ing Dave Jones as he might have merged that patch to Fedora
+for wider testing and ideally it would be reverted and replaced by this
+series.
+
+That said, the later patches could really do with some review. If this
+series is not the answer then a new direction needs to be discussed
+because, as it is, the stalls are unacceptable as the results in this
+cover letter show.
+
+For testers that try backporting this to 3.1, it won't work because
+there is a non-obvious dependency on not writing back pages in direct
+reclaim so you need those patches too.
+
+Changelog since V5
+o Rebase to 3.2-rc5
+o Tidy up the changelogs a bit
+
+Changelog since V4
+o Added reviewed-bys, credited Andrea properly for sync-light
+o Allow dirty pages without mappings to be considered for migration
+o Bound the number of pages freed for compaction
+o Isolate PageReclaim pages on their own LRU list
+
+This is against 3.2-rc5 and follows on from discussions on "mm: Do
+not stall in synchronous compaction for THP allocations" and "[RFC
+PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed
+patch eliminated stalls due to compaction, which sometimes resulted in
+user-visible interactivity problems in browsers, by simply never using
+sync compaction. The downside was that THP allocation success rates
+were lower because dirty pages were not being migrated, as reported by
+Andrea. His approach to fixing this was nacked on the grounds that it
+reverted fixes from Rik that had reduced the number of pages reclaimed,
+because excessive reclaim severely impacted his workload's performance.
+
+This series attempts to reconcile the requirements of maximising THP
+usage without stalling in a user-visible fashion due to compaction
+or cheating by reclaiming an excessive number of pages.
+
+Patch 1 partially reverts commit 39deaf85 to allow migration to isolate
+ dirty pages. This is because migration can move some dirty
+ pages without blocking.
+
+Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using
+ synchronous compaction when it should be. This is unrelated
+ to the reported stalls but is worth fixing.
+
+Patch 3 checks if we isolated a compound page during lumpy scan and
+	accounts for it properly. For the most part, this affects
+ tracing so it's unrelated to the stalls but worth fixing.
+
+Patch 4 notes that it is possible to abort reclaim early for compaction
+ and return 0 to the page allocator potentially entering the
+ "may oom" path. This has not been observed in practice but
+ the rest of the series potentially makes it easier to happen.
+
+Patch 5 adds a sync parameter to the migratepage callback and gives
+ the callback responsibility for migrating the page without
+ blocking if sync==false. For example, fallback_migrate_page
+ will not call writepage if sync==false. This increases the
+ number of pages that can be handled by asynchronous compaction
+ thereby reducing stalls.
+
+Patch 6 restores filter-awareness to isolate_lru_page for migration.
+ In practice, it means that pages under writeback and pages
+ without a ->migratepage callback will not be isolated
+ for migration.
+
+Patch 7 avoids calling direct reclaim if compaction is deferred but
+ makes sure that compaction is only deferred if sync
+ compaction was used.
+
+Patch 8 introduces a sync-light migration mechanism that sync compaction
+ uses. The objective is to allow some stalls but to not call
+ ->writepage which can lead to significant user-visible stalls.
+
+Patch 9 notes that while we want to abort reclaim ASAP to allow
+	compaction to go ahead, we leave a very small window of
+	opportunity for compaction to run. This patch allows more pages
+	to be freed by reclaim but bounds the number to a reasonable
+	level based on the high watermark on each zone.
+
+Patch 10 allows slabs to be shrunk even after compaction_ready() is
+ true for one zone. This is to avoid a problem whereby a single
+ small zone can abort reclaim even though no pages have been
+ reclaimed and no suitably large zone is in a usable state.
+
+Patch 11 fixes a problem with the rate of page scanning. As reclaim
+	rarely stalls on pages under writeback, scan rates are very
+	high. This is particularly true for direct reclaim, which is
+	not calling writepage. The vmstat figures implied that much of
+	this was busy work with PageReclaim pages marked for immediate
+	reclaim. This patch is a prototype that moves these pages to
+	their own LRU list.
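+
+To illustrate what patches 5 and 8 expect of a ->migratepage
+implementation once it takes a sync flag, here is a rough sketch. This
+is not code from the series and the function name is invented; it only
+shows the contract: if sync is false, the callback must fail gracefully
+with -EBUSY rather than block, typically by refusing any work that
+would require writeout.
+
+	static int example_migratepage(struct address_space *mapping,
+				       struct page *newpage,
+				       struct page *page, bool sync)
+	{
+		/* Dirty pages need writeout, which an async caller cannot afford */
+		if (PageDirty(page) && !sync)
+			return -EBUSY;
+
+		/* Otherwise the page can be moved without blocking */
+		return migrate_page(mapping, newpage, page, sync);
+	}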
+
+This has been tested and other than 2 USB keys getting trashed,
+nothing horrible fell out. That said, I am a bit unhappy with the
+rescue logic in patch 11 but did not find a better way around it. It
+does significantly reduce scan rates and System CPU time indicating
+it is the right direction to take.
+
+What is of critical importance is that stalls due to compaction
+are massively reduced even though sync compaction was still
+allowed. Testing from people complaining about stalls when copying to
+USB sticks with THP enabled would be particularly welcome.
+
+The following tests all involve THP usage and USB keys in some
+way. Each test follows this type of pattern
+
+1. Read from some fast storage, be it raw device or file. Each time
+   the copy finishes, start again until the test ends
+2. Write a large file to a filesystem on a USB stick. Each time the copy
+   finishes, start again until the test ends
+3. When memory is low, start an alloc process that creates a mapping
+   the size of physical memory to stress THP allocation. This is the
+   "real" part of the test and the part that is meant to trigger
+   stalls when THP is enabled (a minimal stand-in for such a process
+   is sketched after this list). Copying continues in the background.
+4. Record the CPU usage and execution time of the alloc process
+5. Record the number of THP allocs and fallbacks as well as the number of THP
+   pages in use at the end of the test just before alloc exited
+6. Run the test 5 times to get an idea of variability
+7. Between each run, sync is run, caches are dropped and the test
+   waits until nr_dirty is a small number to avoid interference
+   or caching between iterations that would skew the figures.
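+
+As a minimal stand-in for the alloc process in step 3 (this is not the
+MMTests implementation, just an illustrative sketch), something like
+the following is enough: map an anonymous region roughly the size of
+physical memory and touch every page so the THP fault path is
+exercised while the copies run in the background.
+
+	#include <stdio.h>
+	#include <sys/mman.h>
+	#include <unistd.h>
+
+	int main(void)
+	{
+		long page_size = sysconf(_SC_PAGESIZE);
+		long nr_pages = sysconf(_SC_PHYS_PAGES);
+		size_t length = (size_t)page_size * nr_pages;
+		size_t offset;
+		char *p;
+
+		/* Anonymous mapping roughly the size of physical memory */
+		p = mmap(NULL, length, PROT_READ | PROT_WRITE,
+			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+		if (p == MAP_FAILED) {
+			perror("mmap");
+			return 1;
+		}
+
+		/* Touch one byte per page to force (huge)page allocation */
+		for (offset = 0; offset < length; offset += page_size)
+			p[offset] = 1;
+
+		munmap(p, length);
+		return 0;
+	}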
+
+The individual tests were then
+
+writebackCPDeviceBasevfat
+ Disable THP, read from a raw device (sda), vfat on USB stick
+writebackCPDeviceBaseext4
+ Disable THP, read from a raw device (sda), ext4 on USB stick
+writebackCPDevicevfat
+ THP enabled, read from a raw device (sda), vfat on USB stick
+writebackCPDeviceext4
+ THP enabled, read from a raw device (sda), ext4 on USB stick
+writebackCPFilevfat
+ THP enabled, read from a file on fast storage and USB, both vfat
+writebackCPFileext4
+ THP enabled, read from a file on fast storage and USB, both ext4
+
+The kernels tested were
+
+3.1 3.1
+vanilla 3.2-rc5
+freemore Patches 1-10
+immediate Patches 1-11 (reported as isolate-v6r1 in the tables below)
+andrea The 8 patches Andrea posted as a basis of comparison
+
+The results are very long unfortunately. I'll start with the case
+where we are not using THP at all
+
+writebackCPDeviceBasevfat
+ 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
+System Time 1.28 ( 0.00%) 54.49 (-4143.46%) 48.63 (-3687.69%) 4.69 ( -265.11%) 51.88 (-3940.81%)
++/- 0.06 ( 0.00%) 2.45 (-4305.55%) 4.75 (-8430.57%) 7.46 (-13282.76%) 4.76 (-8440.70%)
+User Time 0.09 ( 0.00%) 0.05 ( 40.91%) 0.06 ( 29.55%) 0.07 ( 15.91%) 0.06 ( 27.27%)
++/- 0.02 ( 0.00%) 0.01 ( 45.39%) 0.02 ( 25.07%) 0.00 ( 77.06%) 0.01 ( 52.24%)
+Elapsed Time 110.27 ( 0.00%) 56.38 ( 48.87%) 49.95 ( 54.70%) 11.77 ( 89.33%) 53.43 ( 51.54%)
++/- 7.33 ( 0.00%) 3.77 ( 48.61%) 4.94 ( 32.63%) 6.71 ( 8.50%) 4.76 ( 35.03%)
+THP Active 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
++/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
+Fault Alloc 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
++/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
+Fault Fallback 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
++/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
+
+The THP figures are obviously all 0 because THP was disabled. The
+main thing to watch is the elapsed times and how they compare to
+the times when THP is enabled later. It's also important to note that
+elapsed time is improved by this series as System CPU time is much
+reduced.
+
+writebackCPDevicevfat
+
+ 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
+System Time 1.22 ( 0.00%) 13.89 (-1040.72%) 46.40 (-3709.20%) 4.44 ( -264.37%) 47.37 (-3789.33%)
++/- 0.06 ( 0.00%) 22.82 (-37635.56%) 3.84 (-6249.44%) 6.48 (-10618.92%) 6.60 (-10818.53%)
+User Time 0.06 ( 0.00%) 0.06 ( -6.90%) 0.05 ( 17.24%) 0.05 ( 13.79%) 0.04 ( 31.03%)
++/- 0.01 ( 0.00%) 0.01 ( 33.33%) 0.01 ( 33.33%) 0.01 ( 39.14%) 0.01 ( 25.46%)
+Elapsed Time 10445.54 ( 0.00%) 2249.92 ( 78.46%) 70.06 ( 99.33%) 16.59 ( 99.84%) 472.43 ( 95.48%)
++/- 643.98 ( 0.00%) 811.62 ( -26.03%) 10.02 ( 98.44%) 7.03 ( 98.91%) 59.99 ( 90.68%)
+THP Active 15.60 ( 0.00%) 35.20 ( 225.64%) 65.00 ( 416.67%) 70.80 ( 453.85%) 62.20 ( 398.72%)
++/- 18.48 ( 0.00%) 51.29 ( 277.59%) 15.99 ( 86.52%) 37.91 ( 205.18%) 22.02 ( 119.18%)
+Fault Alloc 121.80 ( 0.00%) 76.60 ( 62.89%) 155.40 ( 127.59%) 181.20 ( 148.77%) 286.60 ( 235.30%)
++/- 73.51 ( 0.00%) 61.11 ( 83.12%) 34.89 ( 47.46%) 31.88 ( 43.36%) 68.13 ( 92.68%)
+Fault Fallback 881.20 ( 0.00%) 926.60 ( -5.15%) 847.60 ( 3.81%) 822.00 ( 6.72%) 716.60 ( 18.68%)
++/- 73.51 ( 0.00%) 61.26 ( 16.67%) 34.89 ( 52.54%) 31.65 ( 56.94%) 67.75 ( 7.84%)
+MMTests Statistics: duration
+User/Sys Time Running Test (seconds) 3540.88 1945.37 716.04 64.97 1937.03
+Total Elapsed Time (seconds) 52417.33 11425.90 501.02 230.95 2520.28
+
+The first thing to note is the "Elapsed Time" for the vanilla kernels
+of 2249 seconds versus 56 with THP disabled which might explain the
+reports of USB stalls with THP enabled. Applying the patches brings
+performance in line with THP-disabled performance while isolating
+pages for immediate reclaim from the LRU cuts down System CPU time.
+
+The "Fault Alloc" success rate figures are also improved. The vanilla
+kernel only managed to allocate 76.6 pages on average over the course
+of 5 iterations where as applying the series allocated 181.20 on
+average albeit it is well within variance. It's worth noting that
+applies the series at least descreases the amount of variance which
+implies an improvement.
+
+Andrea's series had a higher success rate for THP allocations but
+at a severe cost in elapsed time, which is still better than vanilla
+but much worse than disabling THP altogether. One can bring my
+series close to Andrea's by removing this check
+
+ /*
+ * If compaction is deferred for high-order allocations, it is because
+	 * sync compaction recently failed. If this is the case and the caller
+ * has requested the system not be heavily disrupted, fail the
+ * allocation now instead of entering direct reclaim
+ */
+ if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
+ goto nopage;
+
+I didn't include a patch that removed the above check because hurting
+overall performance to improve the THP figure is not what the average
+user wants. It's something to consider though if someone really wants
+to maximise THP usage no matter what it does to the workload initially.
+
+This is a summary of the vmstat figures from the same test.
+
+ 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
+Page Ins 3257266139 1111844061 17263623 10901575 161423219
+Page Outs 81054922 30364312 3626530 3657687 8753730
+Swap Ins 3294 2851 6560 4964 4592
+Swap Outs 390073 528094 620197 790912 698285
+Direct pages scanned 1077581700 3024951463 1764930052 115140570 5901188831
+Kswapd pages scanned 34826043 7112868 2131265 1686942 1893966
+Kswapd pages reclaimed 28950067 4911036 1246044 966475 1497726
+Direct pages reclaimed 805148398 280167837 3623473 2215044 40809360
+Kswapd efficiency 83% 69% 58% 57% 79%
+Kswapd velocity 664.399 622.521 4253.852 7304.360 751.490
+Direct efficiency 74% 9% 0% 1% 0%
+Direct velocity 20557.737 264745.137 3522673.849 498551.938 2341481.435
+Percentage direct scans 96% 99% 99% 98% 99%
+Page writes by reclaim 722646 529174 620319 791018 699198
+Page writes file 332573 1080 122 106 913
+Page writes anon 390073 528094 620197 790912 698285
+Page reclaim immediate 0 2552514720 1635858848 111281140 5478375032
+Page rescued immediate 0 0 0 87848 0
+Slabs scanned 23552 23552 9216 8192 9216
+Direct inode steals 231 0 0 0 0
+Kswapd inode steals 0 0 0 0 0
+Kswapd skipped wait 28076 786 0 61 6
+THP fault alloc 609 383 753 906 1433
+THP collapse alloc 12 6 0 0 6
+THP splits 536 211 456 593 1136
+THP fault fallback 4406 4633 4263 4110 3583
+THP collapse fail 120 127 0 0 4
+Compaction stalls 1810 728 623 779 3200
+Compaction success 196 53 60 80 123
+Compaction failures 1614 675 563 699 3077
+Compaction pages moved 193158 53545 243185 333457 226688
+Compaction move failure 9952 9396 16424 23676 45070
+
+The main things to look at are
+
+1. Page In/out figures are much reduced by the series.
+
+2. Direct page scanning is incredibly high (264745.137 pages scanned
+ per second on the vanilla kernel) but isolating PageReclaim pages
+ on their own list reduces the number of pages scanned significantly.
+
+3. The fact that "Page rescued immediate" is a positive number implies
+ that we sometimes race removing pages from the LRU_IMMEDIATE list
+ that need to be put back on a normal LRU but it happens only for
+ 0.07% of the pages marked for immediate reclaim.
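+
+For reference, the derived figures appear to be simple ratios of the
+raw counters: "velocity" is pages scanned divided by total elapsed
+time (e.g. kswapd on 3.1.0-vanilla: 34826043 / 52417.33 ~= 664 pages
+scanned per second) and "efficiency" is pages reclaimed divided by
+pages scanned (e.g. 28950067 / 34826043 ~= 83% for kswapd on the same
+kernel).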
+
+writebackCPDeviceext4
+ 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
+System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%)
++/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%)
+User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%)
++/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%)
+Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%)
++/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%)
+THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%)
++/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%)
+Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%)
++/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%)
+Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%)
++/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%)
+MMTests Statistics: duration
+User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45
+Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26
+
+Similar test but the USB stick is using ext4 instead of vfat. As
+ext4 does not use writepage for migration, the large stalls due to
+compaction when THP is enabled are not observed. Still, isolating
+PageReclaim pages on their own list helped completion time largely
+by reducing the number of pages scanned by direct reclaim, although
+time spent in congestion_wait could also be a factor.
+
+Again, Andrea's series had far higher success rates for THP allocation
+at the cost of elapsed time. I didn't look too closely but a quick
+look at the vmstat figures tells me kswapd reclaimed 8 times more pages
+than the patch series and direct reclaim reclaimed roughly three times
+as many pages. It follows that if memory is aggressively reclaimed,
+there will be more available for THP.
+
+writebackCPFilevfat
+ 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
+System Time 1.76 ( 0.00%) 29.10 (-1555.52%) 46.01 (-2517.18%) 4.79 ( -172.35%) 54.89 (-3022.53%)
++/- 0.14 ( 0.00%) 25.61 (-18185.17%) 2.15 (-1434.83%) 6.60 (-4610.03%) 9.75 (-6863.76%)
+User Time 0.05 ( 0.00%) 0.07 ( -45.83%) 0.05 ( -4.17%) 0.06 ( -29.17%) 0.06 ( -16.67%)
++/- 0.02 ( 0.00%) 0.02 ( 20.11%) 0.02 ( -3.14%) 0.01 ( 31.58%) 0.01 ( 47.41%)
+Elapsed Time 22520.79 ( 0.00%) 1082.85 ( 95.19%) 73.30 ( 99.67%) 32.43 ( 99.86%) 291.84 ( 98.70%)
++/- 7277.23 ( 0.00%) 706.29 ( 90.29%) 19.05 ( 99.74%) 17.05 ( 99.77%) 125.55 ( 98.27%)
+THP Active 83.80 ( 0.00%) 12.80 ( 15.27%) 15.60 ( 18.62%) 13.00 ( 15.51%) 0.80 ( 0.95%)
++/- 66.81 ( 0.00%) 20.19 ( 30.22%) 5.92 ( 8.86%) 15.06 ( 22.54%) 1.17 ( 1.75%)
+Fault Alloc 171.00 ( 0.00%) 67.80 ( 39.65%) 97.40 ( 56.96%) 125.60 ( 73.45%) 133.00 ( 77.78%)
++/- 82.91 ( 0.00%) 30.69 ( 37.02%) 53.91 ( 65.02%) 55.05 ( 66.40%) 21.19 ( 25.56%)
+Fault Fallback 832.00 ( 0.00%) 935.20 ( -12.40%) 906.00 ( -8.89%) 877.40 ( -5.46%) 870.20 ( -4.59%)
++/- 82.91 ( 0.00%) 30.69 ( 62.98%) 54.01 ( 34.86%) 55.05 ( 33.60%) 20.91 ( 74.78%)
+MMTests Statistics: duration
+User/Sys Time Running Test (seconds) 7229.81 928.42 704.52 80.68 1330.76
+Total Elapsed Time (seconds) 112849.04 5618.69 571.11 360.54 1664.28
+
+In this case, the test is reading/writing only from filesystems but as
+it's vfat, it's slow due to calling writepage during compaction. Little
+to observe really - the time to complete the test goes way down
+with the series applied and THP allocation success rates go up in
+comparison to 3.2-rc5. The success rates are lower than 3.1.0 but
+the elapsed time for that kernel is abysmal so it is not really a
+sensible comparison.
+
+As before, Andrea's series allocates more THPs at the cost of overall
+performance.
+
+writebackCPFileext4
+ 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
+System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%)
++/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%)
+User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%)
++/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%)
+Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%)
++/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%)
+THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%)
++/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%)
+Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%)
++/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%)
+Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%)
++/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%)
+MMTests Statistics: duration
+User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45
+Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26
+
+Same type of story - elapsed times go down. In this case, allocation
+success rates are roughly the same. As before, Andrea's series has
+higher success rates but takes a lot longer.
+
+Overall the series does reduce latencies and, while the tests are
+inherently racy as alloc competes with the cp processes, the variability
+is reported. The THP allocation rates are not as high as they could
+be, but that is because pushing them higher would require being more
+aggressive about reclaim and compaction, impacting overall performance.
+
+This patch:
+
+Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
+noted that compaction does not migrate dirty or writeback pages and that
+it was meaningless to pick the page and re-add it to the LRU list.
+
+What was missed during review is that asynchronous migration moves dirty
+pages if their ->migratepage callback is migrate_page() because these can
+be moved without blocking. This potentially impacted hugepage allocation
+success rates by a factor depending on how many dirty pages are in the
+system.
+
+This patch partially reverts 39deaf85 to allow migration to isolate dirty
+pages again. This increases how much compaction disrupts the LRU but that
+is addressed later in the series.
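+
+For context, whether a dirty page can actually be moved without
+blocking is still decided at migration time. At this point in the
+series, the check in move_to_new_page() (reworked by a later patch in
+the series) reads
+
+	if (PageDirty(page) && !sync &&
+	    mapping->a_ops->migratepage != migrate_page)
+		rc = -EBUSY;
+
+so dirty swapcache/tmpfs pages, whose migratepage callback is
+migrate_page(), can still be migrated by asynchronous compaction once
+they have been isolated.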
+
+Signed-off-by: Mel Gorman <mgorman@suse.de>
+Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
+Reviewed-by: Rik van Riel <riel@redhat.com>
+Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
+Cc: Dave Jones <davej@redhat.com>
+Cc: Jan Kara <jack@suse.cz>
+Cc: Andy Isaacson <adi@hexapodia.org>
+Cc: Nai Xia <nai.xia@gmail.com>
+Cc: Johannes Weiner <jweiner@redhat.com>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ mm/compaction.c | 3 ---
+ 1 file changed, 3 deletions(-)
+
+--- a/mm/compaction.c
++++ b/mm/compaction.c
+@@ -371,9 +371,6 @@ static isolate_migrate_t isolate_migrate
+ continue;
+ }
+
+- if (!cc->sync)
+- mode |= ISOLATE_CLEAN;
+-
+ /* Try isolate the page */
+ if (__isolate_lru_page(page, mode, 0) != 0)
+ continue;
--- /dev/null
+From b969c4ab9f182a6e1b2a0848be349f99714947b0 Mon Sep 17 00:00:00 2001
+From: Mel Gorman <mgorman@suse.de>
+Date: Thu, 12 Jan 2012 17:19:34 -0800
+Subject: mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage
+
+From: Mel Gorman <mgorman@suse.de>
+
+commit b969c4ab9f182a6e1b2a0848be349f99714947b0 upstream.
+
+Stable note: Not tracked in Bugzilla. A fix aimed at preserving page
+ aging information by reducing LRU list churning had the side-effect
+ of reducing THP allocation success rates. This was part of a series
+ to restore the success rates while preserving the reclaim fix.
+
+Asynchronous compaction is used when allocating transparent hugepages to
+avoid blocking for long periods of time. Due to reports of stalling,
+there was a debate on disabling synchronous compaction but this severely
+impacted allocation success rates. Part of the reason was that many dirty
+pages are skipped in asynchronous compaction by the following check:
+
+ if (PageDirty(page) && !sync &&
+ mapping->a_ops->migratepage != migrate_page)
+ rc = -EBUSY;
+
+This skips over all mapping aops using buffer_migrate_page() even though
+it is possible to migrate some of these pages without blocking. This
+patch updates the ->migratepage callback with a "sync" parameter. It is
+the responsibility of the callback to fail gracefully if migration would
+block.
+
+Signed-off-by: Mel Gorman <mgorman@suse.de>
+Reviewed-by: Rik van Riel <riel@redhat.com>
+Cc: Andrea Arcangeli <aarcange@redhat.com>
+Cc: Minchan Kim <minchan.kim@gmail.com>
+Cc: Dave Jones <davej@redhat.com>
+Cc: Jan Kara <jack@suse.cz>
+Cc: Andy Isaacson <adi@hexapodia.org>
+Cc: Nai Xia <nai.xia@gmail.com>
+Cc: Johannes Weiner <jweiner@redhat.com>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Mel Gorman <mgorman@suse.de>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/btrfs/disk-io.c | 4 -
+ fs/hugetlbfs/inode.c | 3 -
+ fs/nfs/internal.h | 2
+ fs/nfs/write.c | 4 -
+ include/linux/fs.h | 9 ++-
+ include/linux/migrate.h | 2
+ mm/migrate.c | 129 ++++++++++++++++++++++++++++++++++--------------
+ 7 files changed, 106 insertions(+), 47 deletions(-)
+
+--- a/fs/btrfs/disk-io.c
++++ b/fs/btrfs/disk-io.c
+@@ -801,7 +801,7 @@ static int btree_submit_bio_hook(struct
+
+ #ifdef CONFIG_MIGRATION
+ static int btree_migratepage(struct address_space *mapping,
+- struct page *newpage, struct page *page)
++ struct page *newpage, struct page *page, bool sync)
+ {
+ /*
+ * we can't safely write a btree page from here,
+@@ -816,7 +816,7 @@ static int btree_migratepage(struct addr
+ if (page_has_private(page) &&
+ !try_to_release_page(page, GFP_KERNEL))
+ return -EAGAIN;
+- return migrate_page(mapping, newpage, page);
++ return migrate_page(mapping, newpage, page, sync);
+ }
+ #endif
+
+--- a/fs/hugetlbfs/inode.c
++++ b/fs/hugetlbfs/inode.c
+@@ -568,7 +568,8 @@ static int hugetlbfs_set_page_dirty(stru
+ }
+
+ static int hugetlbfs_migrate_page(struct address_space *mapping,
+- struct page *newpage, struct page *page)
++ struct page *newpage, struct page *page,
++ bool sync)
+ {
+ int rc;
+
+--- a/fs/nfs/internal.h
++++ b/fs/nfs/internal.h
+@@ -315,7 +315,7 @@ void nfs_commit_release_pages(struct nfs
+
+ #ifdef CONFIG_MIGRATION
+ extern int nfs_migrate_page(struct address_space *,
+- struct page *, struct page *);
++ struct page *, struct page *, bool);
+ #else
+ #define nfs_migrate_page NULL
+ #endif
+--- a/fs/nfs/write.c
++++ b/fs/nfs/write.c
+@@ -1662,7 +1662,7 @@ out_error:
+
+ #ifdef CONFIG_MIGRATION
+ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
+- struct page *page)
++ struct page *page, bool sync)
+ {
+ /*
+ * If PagePrivate is set, then the page is currently associated with
+@@ -1677,7 +1677,7 @@ int nfs_migrate_page(struct address_spac
+
+ nfs_fscache_release_page(page, GFP_KERNEL);
+
+- return migrate_page(mapping, newpage, page);
++ return migrate_page(mapping, newpage, page, sync);
+ }
+ #endif
+
+--- a/include/linux/fs.h
++++ b/include/linux/fs.h
+@@ -607,9 +607,12 @@ struct address_space_operations {
+ loff_t offset, unsigned long nr_segs);
+ int (*get_xip_mem)(struct address_space *, pgoff_t, int,
+ void **, unsigned long *);
+- /* migrate the contents of a page to the specified target */
++ /*
++ * migrate the contents of a page to the specified target. If sync
++ * is false, it must not block.
++ */
+ int (*migratepage) (struct address_space *,
+- struct page *, struct page *);
++ struct page *, struct page *, bool);
+ int (*launder_page) (struct page *);
+ int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
+ unsigned long);
+@@ -2478,7 +2481,7 @@ extern int generic_check_addressable(uns
+
+ #ifdef CONFIG_MIGRATION
+ extern int buffer_migrate_page(struct address_space *,
+- struct page *, struct page *);
++ struct page *, struct page *, bool);
+ #else
+ #define buffer_migrate_page NULL
+ #endif
+--- a/include/linux/migrate.h
++++ b/include/linux/migrate.h
+@@ -11,7 +11,7 @@ typedef struct page *new_page_t(struct p
+
+ extern void putback_lru_pages(struct list_head *l);
+ extern int migrate_page(struct address_space *,
+- struct page *, struct page *);
++ struct page *, struct page *, bool);
+ extern int migrate_pages(struct list_head *l, new_page_t x,
+ unsigned long private, bool offlining,
+ bool sync);
+--- a/mm/migrate.c
++++ b/mm/migrate.c
+@@ -220,6 +220,55 @@ out:
+ pte_unmap_unlock(ptep, ptl);
+ }
+
++#ifdef CONFIG_BLOCK
++/* Returns true if all buffers are successfully locked */
++static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
++{
++ struct buffer_head *bh = head;
++
++ /* Simple case, sync compaction */
++ if (sync) {
++ do {
++ get_bh(bh);
++ lock_buffer(bh);
++ bh = bh->b_this_page;
++
++ } while (bh != head);
++
++ return true;
++ }
++
++ /* async case, we cannot block on lock_buffer so use trylock_buffer */
++ do {
++ get_bh(bh);
++ if (!trylock_buffer(bh)) {
++ /*
++ * We failed to lock the buffer and cannot stall in
++ * async migration. Release the taken locks
++ */
++ struct buffer_head *failed_bh = bh;
++ put_bh(failed_bh);
++ bh = head;
++ while (bh != failed_bh) {
++ unlock_buffer(bh);
++ put_bh(bh);
++ bh = bh->b_this_page;
++ }
++ return false;
++ }
++
++ bh = bh->b_this_page;
++ } while (bh != head);
++ return true;
++}
++#else
++static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
++ bool sync)
++{
++ return true;
++}
++#endif /* CONFIG_BLOCK */
++
+ /*
+ * Replace the page in the mapping.
+ *
+@@ -229,7 +278,8 @@ out:
+ * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
+ */
+ static int migrate_page_move_mapping(struct address_space *mapping,
+- struct page *newpage, struct page *page)
++ struct page *newpage, struct page *page,
++ struct buffer_head *head, bool sync)
+ {
+ int expected_count;
+ void **pslot;
+@@ -259,6 +309,19 @@ static int migrate_page_move_mapping(str
+ }
+
+ /*
++ * In the async migration case of moving a page with buffers, lock the
++ * buffers using trylock before the mapping is moved. If the mapping
++ * was moved, we later failed to lock the buffers and could not move
++ * the mapping back due to an elevated page count, we would have to
++ * block waiting on other references to be dropped.
++ */
++ if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
++ page_unfreeze_refs(page, expected_count);
++ spin_unlock_irq(&mapping->tree_lock);
++ return -EAGAIN;
++ }
++
++ /*
+ * Now we know that no one else is looking at the page.
+ */
+ get_page(newpage); /* add cache reference */
+@@ -415,13 +478,13 @@ EXPORT_SYMBOL(fail_migrate_page);
+ * Pages are locked upon entry and exit.
+ */
+ int migrate_page(struct address_space *mapping,
+- struct page *newpage, struct page *page)
++ struct page *newpage, struct page *page, bool sync)
+ {
+ int rc;
+
+ BUG_ON(PageWriteback(page)); /* Writeback must be complete */
+
+- rc = migrate_page_move_mapping(mapping, newpage, page);
++ rc = migrate_page_move_mapping(mapping, newpage, page, NULL, sync);
+
+ if (rc)
+ return rc;
+@@ -438,28 +501,28 @@ EXPORT_SYMBOL(migrate_page);
+ * exist.
+ */
+ int buffer_migrate_page(struct address_space *mapping,
+- struct page *newpage, struct page *page)
++ struct page *newpage, struct page *page, bool sync)
+ {
+ struct buffer_head *bh, *head;
+ int rc;
+
+ if (!page_has_buffers(page))
+- return migrate_page(mapping, newpage, page);
++ return migrate_page(mapping, newpage, page, sync);
+
+ head = page_buffers(page);
+
+- rc = migrate_page_move_mapping(mapping, newpage, page);
++ rc = migrate_page_move_mapping(mapping, newpage, page, head, sync);
+
+ if (rc)
+ return rc;
+
+- bh = head;
+- do {
+- get_bh(bh);
+- lock_buffer(bh);
+- bh = bh->b_this_page;
+-
+- } while (bh != head);
++ /*
++ * In the async case, migrate_page_move_mapping locked the buffers
++ * with an IRQ-safe spinlock held. In the sync case, the buffers
++ * need to be locked now
++ */
++ if (sync)
++ BUG_ON(!buffer_migrate_lock_buffers(head, sync));
+
+ ClearPagePrivate(page);
+ set_page_private(newpage, page_private(page));
+@@ -536,10 +599,13 @@ static int writeout(struct address_space
+ * Default handling if a filesystem does not provide a migration function.
+ */
+ static int fallback_migrate_page(struct address_space *mapping,
+- struct page *newpage, struct page *page)
++ struct page *newpage, struct page *page, bool sync)
+ {
+- if (PageDirty(page))
++ if (PageDirty(page)) {
++ if (!sync)
++ return -EBUSY;
+ return writeout(mapping, page);
++ }
+
+ /*
+ * Buffers may be managed in a filesystem specific way.
+@@ -549,7 +615,7 @@ static int fallback_migrate_page(struct
+ !try_to_release_page(page, GFP_KERNEL))
+ return -EAGAIN;
+
+- return migrate_page(mapping, newpage, page);
++ return migrate_page(mapping, newpage, page, sync);
+ }
+
+ /*
+@@ -585,29 +651,18 @@ static int move_to_new_page(struct page
+
+ mapping = page_mapping(page);
+ if (!mapping)
+- rc = migrate_page(mapping, newpage, page);
+- else {
++ rc = migrate_page(mapping, newpage, page, sync);
++ else if (mapping->a_ops->migratepage)
+ /*
+- * Do not writeback pages if !sync and migratepage is
+- * not pointing to migrate_page() which is nonblocking
+- * (swapcache/tmpfs uses migratepage = migrate_page).
++ * Most pages have a mapping and most filesystems provide a
++ * migratepage callback. Anonymous pages are part of swap
++ * space which also has its own migratepage callback. This
++ * is the most common path for page migration.
+ */
+- if (PageDirty(page) && !sync &&
+- mapping->a_ops->migratepage != migrate_page)
+- rc = -EBUSY;
+- else if (mapping->a_ops->migratepage)
+- /*
+- * Most pages have a mapping and most filesystems
+- * should provide a migration function. Anonymous
+- * pages are part of swap space which also has its
+- * own migration function. This is the most common
+- * path for page migration.
+- */
+- rc = mapping->a_ops->migratepage(mapping,
+- newpage, page);
+- else
+- rc = fallback_migrate_page(mapping, newpage, page);
+- }
++ rc = mapping->a_ops->migratepage(mapping,
++ newpage, page, sync);
++ else
++ rc = fallback_migrate_page(mapping, newpage, page, sync);
+
+ if (rc) {
+ newpage->mapping = NULL;
--- /dev/null
+From 39deaf8585152f1a35c1676d3d7dc6ae0fb65967 Mon Sep 17 00:00:00 2001
+From: Minchan Kim <minchan.kim@gmail.com>
+Date: Mon, 31 Oct 2011 17:06:51 -0700
+Subject: mm: compaction: make isolate_lru_page() filter-aware
+
+From: Minchan Kim <minchan.kim@gmail.com>
+
+commit 39deaf8585152f1a35c1676d3d7dc6ae0fb65967 upstream.
+
+Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU
+ list leading to poor reclaim decisions which has a variable
+ performance impact.
+
+In async mode, compaction doesn't migrate dirty or writeback pages. So,
+it's meaningless to pick such a page and re-add it to the LRU list.
+
+Of course, when we isolate the page in compaction, the page might be dirty
+or under writeback, but by the time we try to migrate it the page may no
+longer be dirty or under writeback, so it could be migrated. But that is
+very unlikely as the isolate-and-migrate cycle is much faster than writeout.
+
+So, this patch reduces CPU overhead and prevents unnecessary LRU churning.
+
+Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
+Acked-by: Johannes Weiner <hannes@cmpxchg.org>
+Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
+Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
+Acked-by: Mel Gorman <mgorman@suse.de>
+Acked-by: Rik van Riel <riel@redhat.com>
+Reviewed-by: Michal Hocko <mhocko@suse.cz>
+Cc: Andrea Arcangeli <aarcange@redhat.com>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Mel Gorman <mgorman@suse.de>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ include/linux/mmzone.h | 2 ++
+ mm/compaction.c | 7 +++++--
+ mm/vmscan.c | 3 +++
+ 3 files changed, 10 insertions(+), 2 deletions(-)
+
+--- a/include/linux/mmzone.h
++++ b/include/linux/mmzone.h
+@@ -162,6 +162,8 @@ static inline int is_unevictable_lru(enu
+ #define ISOLATE_INACTIVE ((__force isolate_mode_t)0x1)
+ /* Isolate active pages */
+ #define ISOLATE_ACTIVE ((__force isolate_mode_t)0x2)
++/* Isolate clean file */
++#define ISOLATE_CLEAN ((__force isolate_mode_t)0x4)
+
+ /* LRU Isolation modes. */
+ typedef unsigned __bitwise__ isolate_mode_t;
+--- a/mm/compaction.c
++++ b/mm/compaction.c
+@@ -261,6 +261,7 @@ static isolate_migrate_t isolate_migrate
+ unsigned long last_pageblock_nr = 0, pageblock_nr;
+ unsigned long nr_scanned = 0, nr_isolated = 0;
+ struct list_head *migratelist = &cc->migratepages;
++ isolate_mode_t mode = ISOLATE_ACTIVE|ISOLATE_INACTIVE;
+
+ /* Do not scan outside zone boundaries */
+ low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
+@@ -370,9 +371,11 @@ static isolate_migrate_t isolate_migrate
+ continue;
+ }
+
++ if (!cc->sync)
++ mode |= ISOLATE_CLEAN;
++
+ /* Try isolate the page */
+- if (__isolate_lru_page(page,
+- ISOLATE_ACTIVE|ISOLATE_INACTIVE, 0) != 0)
++ if (__isolate_lru_page(page, mode, 0) != 0)
+ continue;
+
+ VM_BUG_ON(PageTransCompound(page));
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -1045,6 +1045,9 @@ int __isolate_lru_page(struct page *page
+
+ ret = -EBUSY;
+
++ if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page)))
++ return ret;
++
+ if (likely(get_page_unless_zero(page))) {
+ /*
+ * Be careful not to clear PageLRU until after we're
--- /dev/null
+From 0dabec93de633a87adfbbe1d800a4c56cd19d73b Mon Sep 17 00:00:00 2001
+From: Minchan Kim <minchan.kim@gmail.com>
+Date: Mon, 31 Oct 2011 17:06:57 -0700
+Subject: mm: migration: clean up unmap_and_move()
+
+From: Minchan Kim <minchan.kim@gmail.com>
+
+commit 0dabec93de633a87adfbbe1d800a4c56cd19d73b upstream.
+
+Stable note: Not tracked in Bugzilla. This patch makes later patches
+ easier to apply but has no other impact.
+
+unmap_and_move() is one big, messy function. Clean it up.
+
+Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
+Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
+Cc: Johannes Weiner <hannes@cmpxchg.org>
+Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
+Cc: Mel Gorman <mgorman@suse.de>
+Cc: Rik van Riel <riel@redhat.com>
+Cc: Michal Hocko <mhocko@suse.cz>
+Cc: Andrea Arcangeli <aarcange@redhat.com>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+
+---
+ mm/migrate.c | 75 +++++++++++++++++++++++++++++++----------------------------
+ 1 file changed, 40 insertions(+), 35 deletions(-)
+
+--- a/mm/migrate.c
++++ b/mm/migrate.c
+@@ -621,38 +621,18 @@ static int move_to_new_page(struct page
+ return rc;
+ }
+
+-/*
+- * Obtain the lock on page, remove all ptes and migrate the page
+- * to the newly allocated page in newpage.
+- */
+-static int unmap_and_move(new_page_t get_new_page, unsigned long private,
+- struct page *page, int force, bool offlining, bool sync)
++static int __unmap_and_move(struct page *page, struct page *newpage,
++ int force, bool offlining, bool sync)
+ {
+- int rc = 0;
+- int *result = NULL;
+- struct page *newpage = get_new_page(page, private, &result);
++ int rc = -EAGAIN;
+ int remap_swapcache = 1;
+ int charge = 0;
+ struct mem_cgroup *mem;
+ struct anon_vma *anon_vma = NULL;
+
+- if (!newpage)
+- return -ENOMEM;
+-
+- if (page_count(page) == 1) {
+- /* page was freed from under us. So we are done. */
+- goto move_newpage;
+- }
+- if (unlikely(PageTransHuge(page)))
+- if (unlikely(split_huge_page(page)))
+- goto move_newpage;
+-
+- /* prepare cgroup just returns 0 or -ENOMEM */
+- rc = -EAGAIN;
+-
+ if (!trylock_page(page)) {
+ if (!force || !sync)
+- goto move_newpage;
++ goto out;
+
+ /*
+ * It's not safe for direct compaction to call lock_page.
+@@ -668,7 +648,7 @@ static int unmap_and_move(new_page_t get
+ * altogether.
+ */
+ if (current->flags & PF_MEMALLOC)
+- goto move_newpage;
++ goto out;
+
+ lock_page(page);
+ }
+@@ -785,27 +765,52 @@ uncharge:
+ mem_cgroup_end_migration(mem, page, newpage, rc == 0);
+ unlock:
+ unlock_page(page);
++out:
++ return rc;
++}
++
++/*
++ * Obtain the lock on page, remove all ptes and migrate the page
++ * to the newly allocated page in newpage.
++ */
++static int unmap_and_move(new_page_t get_new_page, unsigned long private,
++ struct page *page, int force, bool offlining, bool sync)
++{
++ int rc = 0;
++ int *result = NULL;
++ struct page *newpage = get_new_page(page, private, &result);
++
++ if (!newpage)
++ return -ENOMEM;
+
+-move_newpage:
++ if (page_count(page) == 1) {
++ /* page was freed from under us. So we are done. */
++ goto out;
++ }
++
++ if (unlikely(PageTransHuge(page)))
++ if (unlikely(split_huge_page(page)))
++ goto out;
++
++ rc = __unmap_and_move(page, newpage, force, offlining, sync);
++out:
+ if (rc != -EAGAIN) {
+- /*
+- * A page that has been migrated has all references
+- * removed and will be freed. A page that has not been
+- * migrated will have kepts its references and be
+- * restored.
+- */
+- list_del(&page->lru);
++ /*
++ * A page that has been migrated has all references
++ * removed and will be freed. A page that has not been
++ * migrated will have kepts its references and be
++ * restored.
++ */
++ list_del(&page->lru);
+ dec_zone_page_state(page, NR_ISOLATED_ANON +
+ page_is_file_cache(page));
+ putback_lru_page(page);
+ }
+-
+ /*
+ * Move the new page to the LRU. If migration was not successful
+ * then this will free the page.
+ */
+ putback_lru_page(newpage);
+-
+ if (result) {
+ if (rc)
+ *result = rc;
--- /dev/null
+From f80c0673610e36ae29d63e3297175e22f70dde5f Mon Sep 17 00:00:00 2001
+From: Minchan Kim <minchan.kim@gmail.com>
+Date: Mon, 31 Oct 2011 17:06:55 -0700
+Subject: mm: zone_reclaim: make isolate_lru_page() filter-aware
+
+From: Minchan Kim <minchan.kim@gmail.com>
+
+commit f80c0673610e36ae29d63e3297175e22f70dde5f upstream.
+
+Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU list
+ leading to poor reclaim decisions which has a variable
+ performance impact.
+
+In the __zone_reclaim case, we don't want to shrink mapped pages.
+Nonetheless, we isolate mapped pages and re-add them to the head of the
+LRU. That is unnecessary CPU overhead and causes LRU churning.
+
+Of course, when we isolate the page it might be mapped, but by the time we
+try to migrate it the page may no longer be mapped, so it could be
+migrated. But the race is rare and, even when it happens, it's no big deal.
+
+Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
+Acked-by: Johannes Weiner <hannes@cmpxchg.org>
+Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
+Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
+Reviewed-by: Michal Hocko <mhocko@suse.cz>
+Cc: Mel Gorman <mgorman@suse.de>
+Cc: Rik van Riel <riel@redhat.com>
+Cc: Andrea Arcangeli <aarcange@redhat.com>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Mel Gorman <mgorman@suse.de>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ include/linux/mmzone.h | 2 ++
+ mm/vmscan.c | 20 ++++++++++++++++++--
+ 2 files changed, 20 insertions(+), 2 deletions(-)
+
+--- a/include/linux/mmzone.h
++++ b/include/linux/mmzone.h
+@@ -164,6 +164,8 @@ static inline int is_unevictable_lru(enu
+ #define ISOLATE_ACTIVE ((__force isolate_mode_t)0x2)
+ /* Isolate clean file */
+ #define ISOLATE_CLEAN ((__force isolate_mode_t)0x4)
++/* Isolate unmapped file */
++#define ISOLATE_UNMAPPED ((__force isolate_mode_t)0x8)
+
+ /* LRU Isolation modes. */
+ typedef unsigned __bitwise__ isolate_mode_t;
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -1048,6 +1048,9 @@ int __isolate_lru_page(struct page *page
+ if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page)))
+ return ret;
+
++ if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
++ return ret;
++
+ if (likely(get_page_unless_zero(page))) {
+ /*
+ * Be careful not to clear PageLRU until after we're
+@@ -1471,6 +1474,12 @@ shrink_inactive_list(unsigned long nr_to
+ reclaim_mode |= ISOLATE_ACTIVE;
+
+ lru_add_drain();
++
++ if (!sc->may_unmap)
++ reclaim_mode |= ISOLATE_UNMAPPED;
++ if (!sc->may_writepage)
++ reclaim_mode |= ISOLATE_CLEAN;
++
+ spin_lock_irq(&zone->lru_lock);
+
+ if (scanning_global_lru(sc)) {
+@@ -1588,19 +1597,26 @@ static void shrink_active_list(unsigned
+ struct page *page;
+ struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+ unsigned long nr_rotated = 0;
++ isolate_mode_t reclaim_mode = ISOLATE_ACTIVE;
+
+ lru_add_drain();
++
++ if (!sc->may_unmap)
++ reclaim_mode |= ISOLATE_UNMAPPED;
++ if (!sc->may_writepage)
++ reclaim_mode |= ISOLATE_CLEAN;
++
+ spin_lock_irq(&zone->lru_lock);
+ if (scanning_global_lru(sc)) {
+ nr_taken = isolate_pages_global(nr_pages, &l_hold,
+ &pgscanned, sc->order,
+- ISOLATE_ACTIVE, zone,
++ reclaim_mode, zone,
+ 1, file);
+ zone->pages_scanned += pgscanned;
+ } else {
+ nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
+ &pgscanned, sc->order,
+- ISOLATE_ACTIVE, zone,
++ reclaim_mode, zone,
+ sc->mem_cgroup, 1, file);
+ /*
+ * mem_cgroup_isolate_pages() keeps track of
vmscan-abort-reclaim-compaction-if-compaction-can-proceed.patch
mm-compaction-trivial-clean-up-in-acct_isolated.patch
mm-change-isolate-mode-from-define-to-bitwise-type.patch
+mm-compaction-make-isolate_lru_page-filter-aware.patch
+mm-zone_reclaim-make-isolate_lru_page-filter-aware.patch
+mm-migration-clean-up-unmap_and_move.patch
+mm-compaction-allow-compaction-to-isolate-dirty-pages.patch
+mm-compaction-determine-if-dirty-pages-can-be-migrated-without-blocking-within-migratepage.patch