--- /dev/null
+From a77ebd333cd810d7b680d544be88c875131c2bd3 Mon Sep 17 00:00:00 2001
+From: Mel Gorman <mgorman@suse.de>
+Date: Thu, 12 Jan 2012 17:19:22 -0800
+Subject: mm: compaction: allow compaction to isolate dirty pages
+
+From: Mel Gorman <mgorman@suse.de>
+
+commit a77ebd333cd810d7b680d544be88c875131c2bd3 upstream.
+
+Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
+ information by reducing LRU list churning had the side-effect of
+ reducing THP allocation success rates. This was part of a series
+ to restore the success rates while preserving the reclaim fix.
+
+Short summary: There are severe stalls when a USB stick using VFAT is
+used with THP enabled that are reduced by this series. If you are
+experiencing this problem, please test and report back. Considering I
+have seen complaints from openSUSE and Fedora users on this as well as a
+few private mails, I'm guessing it's a widespread issue. This is a new
+type of USB-related stall because it is due to synchronous compaction
+writing, whereas in the past the big problem was dirty pages reaching
+the end of the LRU and being written by reclaim.
+
+Am cc'ing Andrew this time and this series would replace
+mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch.
+I'm also cc'ing Dave Jones as he might have merged that patch to Fedora
+for wider testing and ideally it would be reverted and replaced by this
+series.
+
+That said, the later patches could really do with some review. If this
+series is not the answer then a new direction needs to be discussed
+because, as it is, the stalls are unacceptable as the results in this
+cover letter show.
+
+For testers that try backporting this to 3.1, it won't work because
+there is a non-obvious dependency on not writing back pages in direct
+reclaim so you need those patches too.
+
+Changelog since V5
+o Rebase to 3.2-rc5
+o Tidy up the changelogs a bit
+
+Changelog since V4
+o Added reviewed-bys, credited Andrea properly for sync-light
+o Allow dirty pages without mappings to be considered for migration
+o Bound the number of pages freed for compaction
+o Isolate PageReclaim pages on their own LRU list
+
+This is against 3.2-rc5 and follows on from discussions on "mm: Do
+not stall in synchronous compaction for THP allocations" and "[RFC
+PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed
+patch eliminated stalls due to compaction, which sometimes resulted in
+user-visible interactivity problems in browsers, by simply never using
+sync compaction. The downside was that THP allocation success rates
+were lower because dirty pages were not being migrated, as reported by
+Andrea. His approach to fixing this was nacked on the grounds that it
+reverted fixes from Rik that had reduced the number of pages reclaimed,
+because excessive reclaim severely impacted his workload's performance.
+
+This series attempts to reconcile the requirements of maximising THP
+usage without stalling in a user-visible fashion due to compaction
+or cheating by reclaiming an excessive number of pages.
+
+Patch 1 partially reverts commit 39deaf85 to allow migration to isolate
+ dirty pages. This is because migration can move some dirty
+ pages without blocking.
+
+Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using
+ synchronous compaction when it should be. This is unrelated
+ to the reported stalls but is worth fixing.
+
+Patch 3 checks if we isolated a compound page during lumpy scan and
+	accounts for it properly. For the most part, this affects
+ tracing so it's unrelated to the stalls but worth fixing.
+
+Patch 4 notes that it is possible to abort reclaim early for compaction
+ and return 0 to the page allocator potentially entering the
+ "may oom" path. This has not been observed in practice but
+ the rest of the series potentially makes it easier to happen.
+
+Patch 5 adds a sync parameter to the migratepage callback and gives
+ the callback responsibility for migrating the page without
+ blocking if sync==false. For example, fallback_migrate_page
+ will not call writepage if sync==false. This increases the
+ number of pages that can be handled by asynchronous compaction
+ thereby reducing stalls.
+
+Patch 6 restores filter-awareness to isolate_lru_page for migration.
+ In practice, it means that pages under writeback and pages
+ without a ->migratepage callback will not be isolated
+ for migration.
+
+Patch 7 avoids calling direct reclaim if compaction is deferred but
+ makes sure that compaction is only deferred if sync
+ compaction was used.
+
+Patch 8 introduces a sync-light migration mechanism that sync compaction
+ uses. The objective is to allow some stalls but to not call
+ ->writepage which can lead to significant user-visible stalls.
+
+Patch 9 notes that while we want to abort reclaim ASAP to allow
+	compaction to go ahead, we leave a very small window of
+	opportunity for compaction to run. This patch allows more pages
+	to be freed by reclaim but bounds the number to a reasonable
+	level based on the high watermark on each zone.
+
+Patch 10 allows slabs to be shrunk even after compaction_ready() is
+ true for one zone. This is to avoid a problem whereby a single
+ small zone can abort reclaim even though no pages have been
+ reclaimed and no suitably large zone is in a usable state.
+
+Patch 11 fixes a problem with the rate of page scanning. As reclaim
+	rarely stalls on pages under writeback, scan rates are very
+	high. This is particularly true for direct reclaim, which is
+	not calling writepage. The vmstat figures implied that much of
+	this was busy work with PageReclaim pages marked for immediate
+	reclaim. This patch is a prototype that moves these pages to
+	their own LRU list.
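+
+To illustrate what patches 5 and 8 expect of a ->migratepage
+implementation once it takes a sync flag, here is a rough sketch. This
+is not code from the series and the function name is invented; it only
+shows the contract: if sync is false, the callback must fail gracefully
+with -EBUSY rather than block, typically by refusing any work that
+would require writeout.
+
+	static int example_migratepage(struct address_space *mapping,
+				       struct page *newpage,
+				       struct page *page, bool sync)
+	{
+		/* Dirty pages need writeout, which an async caller cannot afford */
+		if (PageDirty(page) && !sync)
+			return -EBUSY;
+
+		/* Otherwise the page can be moved without blocking */
+		return migrate_page(mapping, newpage, page, sync);
+	}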
+
+This has been tested and other than 2 USB keys getting trashed,
+nothing horrible fell out. That said, I am a bit unhappy with the
+rescue logic in patch 11 but did not find a better way around it. It
+does significantly reduce scan rates and System CPU time indicating
+it is the right direction to take.
+
+What is of critical importance is that stalls due to compaction
+are massively reduced even though sync compaction was still
+allowed. Testing from people complaining about stalls when copying to
+USB sticks with THP enabled would be particularly welcome.
+
+The following tests all involve THP usage and USB keys in some
+way. Each test follows this type of pattern
+
+1. Read from some fast storage, be it raw device or file. Each time
+   the copy finishes, start again until the test ends
+2. Write a large file to a filesystem on a USB stick. Each time the copy
+   finishes, start again until the test ends
+3. When memory is low, start an alloc process that creates a mapping
+   the size of physical memory to stress THP allocation. This is the
+   "real" part of the test and the part that is meant to trigger
+   stalls when THP is enabled (a minimal stand-in for such a process
+   is sketched after this list). Copying continues in the background.
+4. Record the CPU usage and execution time of the alloc process
+5. Record the number of THP allocs and fallbacks as well as the number of THP
+   pages in use at the end of the test just before alloc exited
+6. Run the test 5 times to get an idea of variability
+7. Between each run, sync is run, caches are dropped and the test
+   waits until nr_dirty is a small number to avoid interference
+   or caching between iterations that would skew the figures.
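+
+As a minimal stand-in for the alloc process in step 3 (this is not the
+MMTests implementation, just an illustrative sketch), something like
+the following is enough: map an anonymous region roughly the size of
+physical memory and touch every page so the THP fault path is
+exercised while the copies run in the background.
+
+	#include <stdio.h>
+	#include <sys/mman.h>
+	#include <unistd.h>
+
+	int main(void)
+	{
+		long page_size = sysconf(_SC_PAGESIZE);
+		long nr_pages = sysconf(_SC_PHYS_PAGES);
+		size_t length = (size_t)page_size * nr_pages;
+		size_t offset;
+		char *p;
+
+		/* Anonymous mapping roughly the size of physical memory */
+		p = mmap(NULL, length, PROT_READ | PROT_WRITE,
+			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+		if (p == MAP_FAILED) {
+			perror("mmap");
+			return 1;
+		}
+
+		/* Touch one byte per page to force (huge)page allocation */
+		for (offset = 0; offset < length; offset += page_size)
+			p[offset] = 1;
+
+		munmap(p, length);
+		return 0;
+	}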
+
+The individual tests were then
+
+writebackCPDeviceBasevfat
+ Disable THP, read from a raw device (sda), vfat on USB stick
+writebackCPDeviceBaseext4
+ Disable THP, read from a raw device (sda), ext4 on USB stick
+writebackCPDevicevfat
+ THP enabled, read from a raw device (sda), vfat on USB stick
+writebackCPDeviceext4
+ THP enabled, read from a raw device (sda), ext4 on USB stick
+writebackCPFilevfat
+ THP enabled, read from a file on fast storage and USB, both vfat
+writebackCPFileext4
+ THP enabled, read from a file on fast storage and USB, both ext4
+
+The kernels tested were
+
+3.1 3.1
+vanilla 3.2-rc5
+freemore Patches 1-10
+immediate Patches 1-11 (reported as isolate-v6r1 in the tables below)
+andrea The 8 patches Andrea posted as a basis of comparison
+
+The results are very long unfortunately. I'll start with the case
+where we are not using THP at all
+
+writebackCPDeviceBasevfat
+ 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
+System Time 1.28 ( 0.00%) 54.49 (-4143.46%) 48.63 (-3687.69%) 4.69 ( -265.11%) 51.88 (-3940.81%)
++/- 0.06 ( 0.00%) 2.45 (-4305.55%) 4.75 (-8430.57%) 7.46 (-13282.76%) 4.76 (-8440.70%)
+User Time 0.09 ( 0.00%) 0.05 ( 40.91%) 0.06 ( 29.55%) 0.07 ( 15.91%) 0.06 ( 27.27%)
++/- 0.02 ( 0.00%) 0.01 ( 45.39%) 0.02 ( 25.07%) 0.00 ( 77.06%) 0.01 ( 52.24%)
+Elapsed Time 110.27 ( 0.00%) 56.38 ( 48.87%) 49.95 ( 54.70%) 11.77 ( 89.33%) 53.43 ( 51.54%)
++/- 7.33 ( 0.00%) 3.77 ( 48.61%) 4.94 ( 32.63%) 6.71 ( 8.50%) 4.76 ( 35.03%)
+THP Active 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
++/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
+Fault Alloc 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
++/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
+Fault Fallback 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
++/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
+
+The THP figures are obviously all 0 because THP was disabled. The
+main thing to watch is the elapsed times and how they compare to
+the times when THP is enabled later. It's also important to note that
+elapsed time is improved by this series as System CPU time is much
+reduced.
+
+writebackCPDevicevfat
+
+ 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
+System Time 1.22 ( 0.00%) 13.89 (-1040.72%) 46.40 (-3709.20%) 4.44 ( -264.37%) 47.37 (-3789.33%)
++/- 0.06 ( 0.00%) 22.82 (-37635.56%) 3.84 (-6249.44%) 6.48 (-10618.92%) 6.60 (-10818.53%)
+User Time 0.06 ( 0.00%) 0.06 ( -6.90%) 0.05 ( 17.24%) 0.05 ( 13.79%) 0.04 ( 31.03%)
++/- 0.01 ( 0.00%) 0.01 ( 33.33%) 0.01 ( 33.33%) 0.01 ( 39.14%) 0.01 ( 25.46%)
+Elapsed Time 10445.54 ( 0.00%) 2249.92 ( 78.46%) 70.06 ( 99.33%) 16.59 ( 99.84%) 472.43 ( 95.48%)
++/- 643.98 ( 0.00%) 811.62 ( -26.03%) 10.02 ( 98.44%) 7.03 ( 98.91%) 59.99 ( 90.68%)
+THP Active 15.60 ( 0.00%) 35.20 ( 225.64%) 65.00 ( 416.67%) 70.80 ( 453.85%) 62.20 ( 398.72%)
++/- 18.48 ( 0.00%) 51.29 ( 277.59%) 15.99 ( 86.52%) 37.91 ( 205.18%) 22.02 ( 119.18%)
+Fault Alloc 121.80 ( 0.00%) 76.60 ( 62.89%) 155.40 ( 127.59%) 181.20 ( 148.77%) 286.60 ( 235.30%)
++/- 73.51 ( 0.00%) 61.11 ( 83.12%) 34.89 ( 47.46%) 31.88 ( 43.36%) 68.13 ( 92.68%)
+Fault Fallback 881.20 ( 0.00%) 926.60 ( -5.15%) 847.60 ( 3.81%) 822.00 ( 6.72%) 716.60 ( 18.68%)
++/- 73.51 ( 0.00%) 61.26 ( 16.67%) 34.89 ( 52.54%) 31.65 ( 56.94%) 67.75 ( 7.84%)
+MMTests Statistics: duration
+User/Sys Time Running Test (seconds) 3540.88 1945.37 716.04 64.97 1937.03
+Total Elapsed Time (seconds) 52417.33 11425.90 501.02 230.95 2520.28
+
+The first thing to note is the "Elapsed Time" for the vanilla kernels
+of 2249 seconds versus 56 with THP disabled which might explain the
+reports of USB stalls with THP enabled. Applying the patches brings
+performance in line with THP-disabled performance while isolating
+pages for immediate reclaim from the LRU cuts down System CPU time.
+
+The "Fault Alloc" success rate figures are also improved. The vanilla
+kernel only managed to allocate 76.6 pages on average over the course
+of 5 iterations where as applying the series allocated 181.20 on
+average albeit it is well within variance. It's worth noting that
+applies the series at least descreases the amount of variance which
+implies an improvement.
+
+Andrea's series had a higher success rate for THP allocations but
+at a severe cost in elapsed time, which is still better than vanilla
+but much worse than disabling THP altogether. One can bring my
+series close to Andrea's by removing this check
+
+ /*
+ * If compaction is deferred for high-order allocations, it is because
+	 * sync compaction recently failed. If this is the case and the caller
+ * has requested the system not be heavily disrupted, fail the
+ * allocation now instead of entering direct reclaim
+ */
+ if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
+ goto nopage;
+
+I didn't include a patch that removed the above check because hurting
+overall performance to improve the THP figure is not what the average
+user wants. It's something to consider though if someone really wants
+to maximise THP usage no matter what it does to the workload initially.
+
+This is a summary of the vmstat figures from the same test.
+
+ 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
+Page Ins 3257266139 1111844061 17263623 10901575 161423219
+Page Outs 81054922 30364312 3626530 3657687 8753730
+Swap Ins 3294 2851 6560 4964 4592
+Swap Outs 390073 528094 620197 790912 698285
+Direct pages scanned 1077581700 3024951463 1764930052 115140570 5901188831
+Kswapd pages scanned 34826043 7112868 2131265 1686942 1893966
+Kswapd pages reclaimed 28950067 4911036 1246044 966475 1497726
+Direct pages reclaimed 805148398 280167837 3623473 2215044 40809360
+Kswapd efficiency 83% 69% 58% 57% 79%
+Kswapd velocity 664.399 622.521 4253.852 7304.360 751.490
+Direct efficiency 74% 9% 0% 1% 0%
+Direct velocity 20557.737 264745.137 3522673.849 498551.938 2341481.435
+Percentage direct scans 96% 99% 99% 98% 99%
+Page writes by reclaim 722646 529174 620319 791018 699198
+Page writes file 332573 1080 122 106 913
+Page writes anon 390073 528094 620197 790912 698285
+Page reclaim immediate 0 2552514720 1635858848 111281140 5478375032
+Page rescued immediate 0 0 0 87848 0
+Slabs scanned 23552 23552 9216 8192 9216
+Direct inode steals 231 0 0 0 0
+Kswapd inode steals 0 0 0 0 0
+Kswapd skipped wait 28076 786 0 61 6
+THP fault alloc 609 383 753 906 1433
+THP collapse alloc 12 6 0 0 6
+THP splits 536 211 456 593 1136
+THP fault fallback 4406 4633 4263 4110 3583
+THP collapse fail 120 127 0 0 4
+Compaction stalls 1810 728 623 779 3200
+Compaction success 196 53 60 80 123
+Compaction failures 1614 675 563 699 3077
+Compaction pages moved 193158 53545 243185 333457 226688
+Compaction move failure 9952 9396 16424 23676 45070
+
+The main things to look at are
+
+1. Page In/out figures are much reduced by the series.
+
+2. Direct page scanning is incredibly high (264745.137 pages scanned
+ per second on the vanilla kernel) but isolating PageReclaim pages
+ on their own list reduces the number of pages scanned significantly.
+
+3. The fact that "Page rescued immediate" is a positive number implies
+ that we sometimes race removing pages from the LRU_IMMEDIATE list
+ that need to be put back on a normal LRU but it happens only for
+ 0.07% of the pages marked for immediate reclaim.
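+
+For reference, the derived figures appear to be simple ratios of the
+raw counters: "velocity" is pages scanned divided by total elapsed
+time (e.g. kswapd on 3.1.0-vanilla: 34826043 / 52417.33 ~= 664 pages
+scanned per second) and "efficiency" is pages reclaimed divided by
+pages scanned (e.g. 28950067 / 34826043 ~= 83% for kswapd on the same
+kernel).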
+
+writebackCPDeviceext4
+ 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
+System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%)
++/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%)
+User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%)
++/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%)
+Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%)
++/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%)
+THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%)
++/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%)
+Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%)
++/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%)
+Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%)
++/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%)
+MMTests Statistics: duration
+User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45
+Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26
+
+Similar test but the USB stick is using ext4 instead of vfat. As
+ext4 does not use writepage for migration, the large stalls due to
+compaction when THP is enabled are not observed. Still, isolating
+PageReclaim pages on their own list helped completion time largely
+by reducing the number of pages scanned by direct reclaim, although
+time spent in congestion_wait could also be a factor.
+
+Again, Andrea's series had far higher success rates for THP allocation
+at the cost of elapsed time. I didn't look too closely but a quick
+look at the vmstat figures tells me kswapd reclaimed 8 times more pages
+than the patch series and direct reclaim reclaimed roughly three times
+as many pages. It follows that if memory is aggressively reclaimed,
+there will be more available for THP.
+
+writebackCPFilevfat
+ 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
+System Time 1.76 ( 0.00%) 29.10 (-1555.52%) 46.01 (-2517.18%) 4.79 ( -172.35%) 54.89 (-3022.53%)
++/- 0.14 ( 0.00%) 25.61 (-18185.17%) 2.15 (-1434.83%) 6.60 (-4610.03%) 9.75 (-6863.76%)
+User Time 0.05 ( 0.00%) 0.07 ( -45.83%) 0.05 ( -4.17%) 0.06 ( -29.17%) 0.06 ( -16.67%)
++/- 0.02 ( 0.00%) 0.02 ( 20.11%) 0.02 ( -3.14%) 0.01 ( 31.58%) 0.01 ( 47.41%)
+Elapsed Time 22520.79 ( 0.00%) 1082.85 ( 95.19%) 73.30 ( 99.67%) 32.43 ( 99.86%) 291.84 ( 98.70%)
++/- 7277.23 ( 0.00%) 706.29 ( 90.29%) 19.05 ( 99.74%) 17.05 ( 99.77%) 125.55 ( 98.27%)
+THP Active 83.80 ( 0.00%) 12.80 ( 15.27%) 15.60 ( 18.62%) 13.00 ( 15.51%) 0.80 ( 0.95%)
++/- 66.81 ( 0.00%) 20.19 ( 30.22%) 5.92 ( 8.86%) 15.06 ( 22.54%) 1.17 ( 1.75%)
+Fault Alloc 171.00 ( 0.00%) 67.80 ( 39.65%) 97.40 ( 56.96%) 125.60 ( 73.45%) 133.00 ( 77.78%)
++/- 82.91 ( 0.00%) 30.69 ( 37.02%) 53.91 ( 65.02%) 55.05 ( 66.40%) 21.19 ( 25.56%)
+Fault Fallback 832.00 ( 0.00%) 935.20 ( -12.40%) 906.00 ( -8.89%) 877.40 ( -5.46%) 870.20 ( -4.59%)
++/- 82.91 ( 0.00%) 30.69 ( 62.98%) 54.01 ( 34.86%) 55.05 ( 33.60%) 20.91 ( 74.78%)
+MMTests Statistics: duration
+User/Sys Time Running Test (seconds) 7229.81 928.42 704.52 80.68 1330.76
+Total Elapsed Time (seconds) 112849.04 5618.69 571.11 360.54 1664.28
+
+In this case, the test is reading/writing only from filesystems but as
+it's vfat, it's slow due to calling writepage during compaction. Little
+to observe really - the time to complete the test goes way down
+with the series applied and THP allocation success rates go up in
+comparison to 3.2-rc5. The success rates are lower than 3.1.0 but
+the elapsed time for that kernel is abysmal so it is not really a
+sensible comparison.
+
+As before, Andrea's series allocates more THPs at the cost of overall
+performance.
+
+writebackCPFileext4
+ 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
+System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%)
++/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%)
+User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%)
++/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%)
+Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%)
++/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%)
+THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%)
++/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%)
+Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%)
++/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%)
+Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%)
++/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%)
+MMTests Statistics: duration
+User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45
+Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26
+
+Same type of story - elapsed times go down. In this case, allocation
+success rates are roughly the same. As before, Andrea's series has
+higher success rates but takes a lot longer.
+
+Overall the series does reduce latencies and, while the tests are
+inherently racy as alloc competes with the cp processes, the variability
+is reported. The THP allocation rates are not as high as they could
+be, but that is because pushing them higher would require being more
+aggressive about reclaim and compaction, impacting overall performance.
+
+This patch:
+
+Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
+noted that compaction does not migrate dirty or writeback pages and that
+it was meaningless to pick the page and re-add it to the LRU list.
+
+What was missed during review is that asynchronous migration moves dirty
+pages if their ->migratepage callback is migrate_page() because these can
+be moved without blocking. This potentially impacted hugepage allocation
+success rates by a factor depending on how many dirty pages are in the
+system.
+
+This patch partially reverts 39deaf85 to allow migration to isolate dirty
+pages again. This increases how much compaction disrupts the LRU but that
+is addressed later in the series.
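+
+For context, whether a dirty page can actually be moved without
+blocking is still decided at migration time. At this point in the
+series, the check in move_to_new_page() (reworked by a later patch in
+the series) reads
+
+	if (PageDirty(page) && !sync &&
+	    mapping->a_ops->migratepage != migrate_page)
+		rc = -EBUSY;
+
+so dirty swapcache/tmpfs pages, whose migratepage callback is
+migrate_page(), can still be migrated by asynchronous compaction once
+they have been isolated.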
+
+Signed-off-by: Mel Gorman <mgorman@suse.de>
+Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
+Reviewed-by: Rik van Riel <riel@redhat.com>
+Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
+Cc: Dave Jones <davej@redhat.com>
+Cc: Jan Kara <jack@suse.cz>
+Cc: Andy Isaacson <adi@hexapodia.org>
+Cc: Nai Xia <nai.xia@gmail.com>
+Cc: Johannes Weiner <jweiner@redhat.com>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ mm/compaction.c | 3 ---
+ 1 file changed, 3 deletions(-)
+
+--- a/mm/compaction.c
++++ b/mm/compaction.c
+@@ -371,9 +371,6 @@ static isolate_migrate_t isolate_migrate
+ continue;
+ }
+
+- if (!cc->sync)
+- mode |= ISOLATE_CLEAN;
+-
+ /* Try isolate the page */
+ if (__isolate_lru_page(page, mode, 0) != 0)
+ continue;
--- /dev/null
+From b969c4ab9f182a6e1b2a0848be349f99714947b0 Mon Sep 17 00:00:00 2001
+From: Mel Gorman <mgorman@suse.de>
+Date: Thu, 12 Jan 2012 17:19:34 -0800
+Subject: mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage
+
+From: Mel Gorman <mgorman@suse.de>
+
+commit b969c4ab9f182a6e1b2a0848be349f99714947b0 upstream.
+
+Stable note: Not tracked in Bugzilla. A fix aimed at preserving page
+ aging information by reducing LRU list churning had the side-effect
+ of reducing THP allocation success rates. This was part of a series
+ to restore the success rates while preserving the reclaim fix.
+
+Asynchronous compaction is used when allocating transparent hugepages to
+avoid blocking for long periods of time. Due to reports of stalling,
+there was a debate on disabling synchronous compaction but this severely
+impacted allocation success rates. Part of the reason was that many dirty
+pages are skipped in asynchronous compaction by the following check:
+
+ if (PageDirty(page) && !sync &&
+ mapping->a_ops->migratepage != migrate_page)
+ rc = -EBUSY;
+
+This skips over all mapping aops using buffer_migrate_page() even though
+it is possible to migrate some of these pages without blocking. This
+patch updates the ->migratepage callback with a "sync" parameter. It is
+the responsibility of the callback to fail gracefully if migration would
+block.
+
+Signed-off-by: Mel Gorman <mgorman@suse.de>
+Reviewed-by: Rik van Riel <riel@redhat.com>
+Cc: Andrea Arcangeli <aarcange@redhat.com>
+Cc: Minchan Kim <minchan.kim@gmail.com>
+Cc: Dave Jones <davej@redhat.com>
+Cc: Jan Kara <jack@suse.cz>
+Cc: Andy Isaacson <adi@hexapodia.org>
+Cc: Nai Xia <nai.xia@gmail.com>
+Cc: Johannes Weiner <jweiner@redhat.com>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Mel Gorman <mgorman@suse.de>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/btrfs/disk-io.c | 4 -
+ fs/hugetlbfs/inode.c | 3 -
+ fs/nfs/internal.h | 2
+ fs/nfs/write.c | 4 -
+ include/linux/fs.h | 9 ++-
+ include/linux/migrate.h | 2
+ mm/migrate.c | 129 ++++++++++++++++++++++++++++++++++--------------
+ 7 files changed, 106 insertions(+), 47 deletions(-)
+
+--- a/fs/btrfs/disk-io.c
++++ b/fs/btrfs/disk-io.c
+@@ -801,7 +801,7 @@ static int btree_submit_bio_hook(struct
+
+ #ifdef CONFIG_MIGRATION
+ static int btree_migratepage(struct address_space *mapping,
+- struct page *newpage, struct page *page)
++ struct page *newpage, struct page *page, bool sync)
+ {
+ /*
+ * we can't safely write a btree page from here,
+@@ -816,7 +816,7 @@ static int btree_migratepage(struct addr
+ if (page_has_private(page) &&
+ !try_to_release_page(page, GFP_KERNEL))
+ return -EAGAIN;
+- return migrate_page(mapping, newpage, page);
++ return migrate_page(mapping, newpage, page, sync);
+ }
+ #endif
+
+--- a/fs/hugetlbfs/inode.c
++++ b/fs/hugetlbfs/inode.c
+@@ -568,7 +568,8 @@ static int hugetlbfs_set_page_dirty(stru
+ }
+
+ static int hugetlbfs_migrate_page(struct address_space *mapping,
+- struct page *newpage, struct page *page)
++ struct page *newpage, struct page *page,
++ bool sync)
+ {
+ int rc;
+
+--- a/fs/nfs/internal.h
++++ b/fs/nfs/internal.h
+@@ -315,7 +315,7 @@ void nfs_commit_release_pages(struct nfs
+
+ #ifdef CONFIG_MIGRATION
+ extern int nfs_migrate_page(struct address_space *,
+- struct page *, struct page *);
++ struct page *, struct page *, bool);
+ #else
+ #define nfs_migrate_page NULL
+ #endif
+--- a/fs/nfs/write.c
++++ b/fs/nfs/write.c
+@@ -1662,7 +1662,7 @@ out_error:
+
+ #ifdef CONFIG_MIGRATION
+ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
+- struct page *page)
++ struct page *page, bool sync)
+ {
+ /*
+ * If PagePrivate is set, then the page is currently associated with
+@@ -1677,7 +1677,7 @@ int nfs_migrate_page(struct address_spac
+
+ nfs_fscache_release_page(page, GFP_KERNEL);
+
+- return migrate_page(mapping, newpage, page);
++ return migrate_page(mapping, newpage, page, sync);
+ }
+ #endif
+
+--- a/include/linux/fs.h
++++ b/include/linux/fs.h
+@@ -607,9 +607,12 @@ struct address_space_operations {
+ loff_t offset, unsigned long nr_segs);
+ int (*get_xip_mem)(struct address_space *, pgoff_t, int,
+ void **, unsigned long *);
+- /* migrate the contents of a page to the specified target */
++ /*
++ * migrate the contents of a page to the specified target. If sync
++ * is false, it must not block.
++ */
+ int (*migratepage) (struct address_space *,
+- struct page *, struct page *);
++ struct page *, struct page *, bool);
+ int (*launder_page) (struct page *);
+ int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
+ unsigned long);
+@@ -2478,7 +2481,7 @@ extern int generic_check_addressable(uns
+
+ #ifdef CONFIG_MIGRATION
+ extern int buffer_migrate_page(struct address_space *,
+- struct page *, struct page *);
++ struct page *, struct page *, bool);
+ #else
+ #define buffer_migrate_page NULL
+ #endif
+--- a/include/linux/migrate.h
++++ b/include/linux/migrate.h
+@@ -11,7 +11,7 @@ typedef struct page *new_page_t(struct p
+
+ extern void putback_lru_pages(struct list_head *l);
+ extern int migrate_page(struct address_space *,
+- struct page *, struct page *);
++ struct page *, struct page *, bool);
+ extern int migrate_pages(struct list_head *l, new_page_t x,
+ unsigned long private, bool offlining,
+ bool sync);
+--- a/mm/migrate.c
++++ b/mm/migrate.c
+@@ -220,6 +220,55 @@ out:
+ pte_unmap_unlock(ptep, ptl);
+ }
+
++#ifdef CONFIG_BLOCK
++/* Returns true if all buffers are successfully locked */
++static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
++{
++ struct buffer_head *bh = head;
++
++ /* Simple case, sync compaction */
++ if (sync) {
++ do {
++ get_bh(bh);
++ lock_buffer(bh);
++ bh = bh->b_this_page;
++
++ } while (bh != head);
++
++ return true;
++ }
++
++ /* async case, we cannot block on lock_buffer so use trylock_buffer */
++ do {
++ get_bh(bh);
++ if (!trylock_buffer(bh)) {
++ /*
++ * We failed to lock the buffer and cannot stall in
++ * async migration. Release the taken locks
++ */
++ struct buffer_head *failed_bh = bh;
++ put_bh(failed_bh);
++ bh = head;
++ while (bh != failed_bh) {
++ unlock_buffer(bh);
++ put_bh(bh);
++ bh = bh->b_this_page;
++ }
++ return false;
++ }
++
++ bh = bh->b_this_page;
++ } while (bh != head);
++ return true;
++}
++#else
++static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
++ bool sync)
++{
++ return true;
++}
++#endif /* CONFIG_BLOCK */
++
+ /*
+ * Replace the page in the mapping.
+ *
+@@ -229,7 +278,8 @@ out:
+ * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
+ */
+ static int migrate_page_move_mapping(struct address_space *mapping,
+- struct page *newpage, struct page *page)
++ struct page *newpage, struct page *page,
++ struct buffer_head *head, bool sync)
+ {
+ int expected_count;
+ void **pslot;
+@@ -259,6 +309,19 @@ static int migrate_page_move_mapping(str
+ }
+
+ /*
++ * In the async migration case of moving a page with buffers, lock the
++ * buffers using trylock before the mapping is moved. If the mapping
++ * was moved, we later failed to lock the buffers and could not move
++ * the mapping back due to an elevated page count, we would have to
++ * block waiting on other references to be dropped.
++ */
++ if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
++ page_unfreeze_refs(page, expected_count);
++ spin_unlock_irq(&mapping->tree_lock);
++ return -EAGAIN;
++ }
++
++ /*
+ * Now we know that no one else is looking at the page.
+ */
+ get_page(newpage); /* add cache reference */
+@@ -415,13 +478,13 @@ EXPORT_SYMBOL(fail_migrate_page);
+ * Pages are locked upon entry and exit.
+ */
+ int migrate_page(struct address_space *mapping,
+- struct page *newpage, struct page *page)
++ struct page *newpage, struct page *page, bool sync)
+ {
+ int rc;
+
+ BUG_ON(PageWriteback(page)); /* Writeback must be complete */
+
+- rc = migrate_page_move_mapping(mapping, newpage, page);
++ rc = migrate_page_move_mapping(mapping, newpage, page, NULL, sync);
+
+ if (rc)
+ return rc;
+@@ -438,28 +501,28 @@ EXPORT_SYMBOL(migrate_page);
+ * exist.
+ */
+ int buffer_migrate_page(struct address_space *mapping,
+- struct page *newpage, struct page *page)
++ struct page *newpage, struct page *page, bool sync)
+ {
+ struct buffer_head *bh, *head;
+ int rc;
+
+ if (!page_has_buffers(page))
+- return migrate_page(mapping, newpage, page);
++ return migrate_page(mapping, newpage, page, sync);
+
+ head = page_buffers(page);
+
+- rc = migrate_page_move_mapping(mapping, newpage, page);
++ rc = migrate_page_move_mapping(mapping, newpage, page, head, sync);
+
+ if (rc)
+ return rc;
+
+- bh = head;
+- do {
+- get_bh(bh);
+- lock_buffer(bh);
+- bh = bh->b_this_page;
+-
+- } while (bh != head);
++ /*
++ * In the async case, migrate_page_move_mapping locked the buffers
++ * with an IRQ-safe spinlock held. In the sync case, the buffers
++ * need to be locked now
++ */
++ if (sync)
++ BUG_ON(!buffer_migrate_lock_buffers(head, sync));
+
+ ClearPagePrivate(page);
+ set_page_private(newpage, page_private(page));
+@@ -536,10 +599,13 @@ static int writeout(struct address_space
+ * Default handling if a filesystem does not provide a migration function.
+ */
+ static int fallback_migrate_page(struct address_space *mapping,
+- struct page *newpage, struct page *page)
++ struct page *newpage, struct page *page, bool sync)
+ {
+- if (PageDirty(page))
++ if (PageDirty(page)) {
++ if (!sync)
++ return -EBUSY;
+ return writeout(mapping, page);
++ }
+
+ /*
+ * Buffers may be managed in a filesystem specific way.
+@@ -549,7 +615,7 @@ static int fallback_migrate_page(struct
+ !try_to_release_page(page, GFP_KERNEL))
+ return -EAGAIN;
+
+- return migrate_page(mapping, newpage, page);
++ return migrate_page(mapping, newpage, page, sync);
+ }
+
+ /*
+@@ -585,29 +651,18 @@ static int move_to_new_page(struct page
+
+ mapping = page_mapping(page);
+ if (!mapping)
+- rc = migrate_page(mapping, newpage, page);
+- else {
++ rc = migrate_page(mapping, newpage, page, sync);
++ else if (mapping->a_ops->migratepage)
+ /*
+- * Do not writeback pages if !sync and migratepage is
+- * not pointing to migrate_page() which is nonblocking
+- * (swapcache/tmpfs uses migratepage = migrate_page).
++ * Most pages have a mapping and most filesystems provide a
++ * migratepage callback. Anonymous pages are part of swap
++ * space which also has its own migratepage callback. This
++ * is the most common path for page migration.
+ */
+- if (PageDirty(page) && !sync &&
+- mapping->a_ops->migratepage != migrate_page)
+- rc = -EBUSY;
+- else if (mapping->a_ops->migratepage)
+- /*
+- * Most pages have a mapping and most filesystems
+- * should provide a migration function. Anonymous
+- * pages are part of swap space which also has its
+- * own migration function. This is the most common
+- * path for page migration.
+- */
+- rc = mapping->a_ops->migratepage(mapping,
+- newpage, page);
+- else
+- rc = fallback_migrate_page(mapping, newpage, page);
+- }
++ rc = mapping->a_ops->migratepage(mapping,
++ newpage, page, sync);
++ else
++ rc = fallback_migrate_page(mapping, newpage, page, sync);
+
+ if (rc) {
+ newpage->mapping = NULL;
--- /dev/null
+From 39deaf8585152f1a35c1676d3d7dc6ae0fb65967 Mon Sep 17 00:00:00 2001
+From: Minchan Kim <minchan.kim@gmail.com>
+Date: Mon, 31 Oct 2011 17:06:51 -0700
+Subject: mm: compaction: make isolate_lru_page() filter-aware
+
+From: Minchan Kim <minchan.kim@gmail.com>
+
+commit 39deaf8585152f1a35c1676d3d7dc6ae0fb65967 upstream.
+
+Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU
+ list leading to poor reclaim decisions which has a variable
+ performance impact.
+
+In async mode, compaction doesn't migrate dirty or writeback pages. So,
+it's meaningless to pick such a page and re-add it to the LRU list.
+
+Of course, when we isolate the page in compaction, the page might be dirty
+or under writeback, but by the time we try to migrate it the page may no
+longer be dirty or under writeback, so it could be migrated. But that is
+very unlikely as the isolate-and-migrate cycle is much faster than writeout.
+
+So, this patch reduces CPU overhead and prevents unnecessary LRU churning.
+
+Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
+Acked-by: Johannes Weiner <hannes@cmpxchg.org>
+Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
+Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
+Acked-by: Mel Gorman <mgorman@suse.de>
+Acked-by: Rik van Riel <riel@redhat.com>
+Reviewed-by: Michal Hocko <mhocko@suse.cz>
+Cc: Andrea Arcangeli <aarcange@redhat.com>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Mel Gorman <mgorman@suse.de>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ include/linux/mmzone.h | 2 ++
+ mm/compaction.c | 7 +++++--
+ mm/vmscan.c | 3 +++
+ 3 files changed, 10 insertions(+), 2 deletions(-)
+
+--- a/include/linux/mmzone.h
++++ b/include/linux/mmzone.h
+@@ -162,6 +162,8 @@ static inline int is_unevictable_lru(enu
+ #define ISOLATE_INACTIVE ((__force isolate_mode_t)0x1)
+ /* Isolate active pages */
+ #define ISOLATE_ACTIVE ((__force isolate_mode_t)0x2)
++/* Isolate clean file */
++#define ISOLATE_CLEAN ((__force isolate_mode_t)0x4)
+
+ /* LRU Isolation modes. */
+ typedef unsigned __bitwise__ isolate_mode_t;
+--- a/mm/compaction.c
++++ b/mm/compaction.c
+@@ -261,6 +261,7 @@ static isolate_migrate_t isolate_migrate
+ unsigned long last_pageblock_nr = 0, pageblock_nr;
+ unsigned long nr_scanned = 0, nr_isolated = 0;
+ struct list_head *migratelist = &cc->migratepages;
++ isolate_mode_t mode = ISOLATE_ACTIVE|ISOLATE_INACTIVE;
+
+ /* Do not scan outside zone boundaries */
+ low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
+@@ -370,9 +371,11 @@ static isolate_migrate_t isolate_migrate
+ continue;
+ }
+
++ if (!cc->sync)
++ mode |= ISOLATE_CLEAN;
++
+ /* Try isolate the page */
+- if (__isolate_lru_page(page,
+- ISOLATE_ACTIVE|ISOLATE_INACTIVE, 0) != 0)
++ if (__isolate_lru_page(page, mode, 0) != 0)
+ continue;
+
+ VM_BUG_ON(PageTransCompound(page));
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -1045,6 +1045,9 @@ int __isolate_lru_page(struct page *page
+
+ ret = -EBUSY;
+
++ if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page)))
++ return ret;
++
+ if (likely(get_page_unless_zero(page))) {
+ /*
+ * Be careful not to clear PageLRU until after we're
--- /dev/null
+From 0dabec93de633a87adfbbe1d800a4c56cd19d73b Mon Sep 17 00:00:00 2001
+From: Minchan Kim <minchan.kim@gmail.com>
+Date: Mon, 31 Oct 2011 17:06:57 -0700
+Subject: mm: migration: clean up unmap_and_move()
+
+From: Minchan Kim <minchan.kim@gmail.com>
+
+commit 0dabec93de633a87adfbbe1d800a4c56cd19d73b upstream.
+
+Stable note: Not tracked in Bugzilla. This patch makes later patches
+ easier to apply but has no other impact.
+
+unmap_and_move() is one big, messy function. Clean it up.
+
+Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
+Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
+Cc: Johannes Weiner <hannes@cmpxchg.org>
+Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
+Cc: Mel Gorman <mgorman@suse.de>
+Cc: Rik van Riel <riel@redhat.com>
+Cc: Michal Hocko <mhocko@suse.cz>
+Cc: Andrea Arcangeli <aarcange@redhat.com>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+
+---
+ mm/migrate.c | 75 +++++++++++++++++++++++++++++++----------------------------
+ 1 file changed, 40 insertions(+), 35 deletions(-)
+
+--- a/mm/migrate.c
++++ b/mm/migrate.c
+@@ -621,38 +621,18 @@ static int move_to_new_page(struct page
+ return rc;
+ }
+
+-/*
+- * Obtain the lock on page, remove all ptes and migrate the page
+- * to the newly allocated page in newpage.
+- */
+-static int unmap_and_move(new_page_t get_new_page, unsigned long private,
+- struct page *page, int force, bool offlining, bool sync)
++static int __unmap_and_move(struct page *page, struct page *newpage,
++ int force, bool offlining, bool sync)
+ {
+- int rc = 0;
+- int *result = NULL;
+- struct page *newpage = get_new_page(page, private, &result);
++ int rc = -EAGAIN;
+ int remap_swapcache = 1;
+ int charge = 0;
+ struct mem_cgroup *mem;
+ struct anon_vma *anon_vma = NULL;
+
+- if (!newpage)
+- return -ENOMEM;
+-
+- if (page_count(page) == 1) {
+- /* page was freed from under us. So we are done. */
+- goto move_newpage;
+- }
+- if (unlikely(PageTransHuge(page)))
+- if (unlikely(split_huge_page(page)))
+- goto move_newpage;
+-
+- /* prepare cgroup just returns 0 or -ENOMEM */
+- rc = -EAGAIN;
+-
+ if (!trylock_page(page)) {
+ if (!force || !sync)
+- goto move_newpage;
++ goto out;
+
+ /*
+ * It's not safe for direct compaction to call lock_page.
+@@ -668,7 +648,7 @@ static int unmap_and_move(new_page_t get
+ * altogether.
+ */
+ if (current->flags & PF_MEMALLOC)
+- goto move_newpage;
++ goto out;
+
+ lock_page(page);
+ }
+@@ -785,27 +765,52 @@ uncharge:
+ mem_cgroup_end_migration(mem, page, newpage, rc == 0);
+ unlock:
+ unlock_page(page);
++out:
++ return rc;
++}
++
++/*
++ * Obtain the lock on page, remove all ptes and migrate the page
++ * to the newly allocated page in newpage.
++ */
++static int unmap_and_move(new_page_t get_new_page, unsigned long private,
++ struct page *page, int force, bool offlining, bool sync)
++{
++ int rc = 0;
++ int *result = NULL;
++ struct page *newpage = get_new_page(page, private, &result);
++
++ if (!newpage)
++ return -ENOMEM;
+
+-move_newpage:
++ if (page_count(page) == 1) {
++ /* page was freed from under us. So we are done. */
++ goto out;
++ }
++
++ if (unlikely(PageTransHuge(page)))
++ if (unlikely(split_huge_page(page)))
++ goto out;
++
++ rc = __unmap_and_move(page, newpage, force, offlining, sync);
++out:
+ if (rc != -EAGAIN) {
+- /*
+- * A page that has been migrated has all references
+- * removed and will be freed. A page that has not been
+- * migrated will have kepts its references and be
+- * restored.
+- */
+- list_del(&page->lru);
++ /*
++ * A page that has been migrated has all references
++ * removed and will be freed. A page that has not been
++ * migrated will have kepts its references and be
++ * restored.
++ */
++ list_del(&page->lru);
+ dec_zone_page_state(page, NR_ISOLATED_ANON +
+ page_is_file_cache(page));
+ putback_lru_page(page);
+ }
+-
+ /*
+ * Move the new page to the LRU. If migration was not successful
+ * then this will free the page.
+ */
+ putback_lru_page(newpage);
+-
+ if (result) {
+ if (rc)
+ *result = rc;
--- /dev/null
+From f80c0673610e36ae29d63e3297175e22f70dde5f Mon Sep 17 00:00:00 2001
+From: Minchan Kim <minchan.kim@gmail.com>
+Date: Mon, 31 Oct 2011 17:06:55 -0700
+Subject: mm: zone_reclaim: make isolate_lru_page() filter-aware
+
+From: Minchan Kim <minchan.kim@gmail.com>
+
+commit f80c0673610e36ae29d63e3297175e22f70dde5f upstream.
+
+Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU list
+ leading to poor reclaim decisions which has a variable
+ performance impact.
+
+In the __zone_reclaim case, we don't want to shrink mapped pages.
+Nonetheless, we isolate mapped pages and re-add them to the head of the
+LRU. That is unnecessary CPU overhead and causes LRU churning.
+
+Of course, when we isolate the page it might be mapped, but by the time we
+try to migrate it the page may no longer be mapped, so it could be
+migrated. But the race is rare and, even when it happens, it's no big deal.
+
+Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
+Acked-by: Johannes Weiner <hannes@cmpxchg.org>
+Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
+Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
+Reviewed-by: Michal Hocko <mhocko@suse.cz>
+Cc: Mel Gorman <mgorman@suse.de>
+Cc: Rik van Riel <riel@redhat.com>
+Cc: Andrea Arcangeli <aarcange@redhat.com>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Mel Gorman <mgorman@suse.de>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ include/linux/mmzone.h | 2 ++
+ mm/vmscan.c | 20 ++++++++++++++++++--
+ 2 files changed, 20 insertions(+), 2 deletions(-)
+
+--- a/include/linux/mmzone.h
++++ b/include/linux/mmzone.h
+@@ -164,6 +164,8 @@ static inline int is_unevictable_lru(enu
+ #define ISOLATE_ACTIVE ((__force isolate_mode_t)0x2)
+ /* Isolate clean file */
+ #define ISOLATE_CLEAN ((__force isolate_mode_t)0x4)
++/* Isolate unmapped file */
++#define ISOLATE_UNMAPPED ((__force isolate_mode_t)0x8)
+
+ /* LRU Isolation modes. */
+ typedef unsigned __bitwise__ isolate_mode_t;
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -1048,6 +1048,9 @@ int __isolate_lru_page(struct page *page
+ if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page)))
+ return ret;
+
++ if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
++ return ret;
++
+ if (likely(get_page_unless_zero(page))) {
+ /*
+ * Be careful not to clear PageLRU until after we're
+@@ -1471,6 +1474,12 @@ shrink_inactive_list(unsigned long nr_to
+ reclaim_mode |= ISOLATE_ACTIVE;
+
+ lru_add_drain();
++
++ if (!sc->may_unmap)
++ reclaim_mode |= ISOLATE_UNMAPPED;
++ if (!sc->may_writepage)
++ reclaim_mode |= ISOLATE_CLEAN;
++
+ spin_lock_irq(&zone->lru_lock);
+
+ if (scanning_global_lru(sc)) {
+@@ -1588,19 +1597,26 @@ static void shrink_active_list(unsigned
+ struct page *page;
+ struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+ unsigned long nr_rotated = 0;
++ isolate_mode_t reclaim_mode = ISOLATE_ACTIVE;
+
+ lru_add_drain();
++
++ if (!sc->may_unmap)
++ reclaim_mode |= ISOLATE_UNMAPPED;
++ if (!sc->may_writepage)
++ reclaim_mode |= ISOLATE_CLEAN;
++
+ spin_lock_irq(&zone->lru_lock);
+ if (scanning_global_lru(sc)) {
+ nr_taken = isolate_pages_global(nr_pages, &l_hold,
+ &pgscanned, sc->order,
+- ISOLATE_ACTIVE, zone,
++ reclaim_mode, zone,
+ 1, file);
+ zone->pages_scanned += pgscanned;
+ } else {
+ nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
+ &pgscanned, sc->order,
+- ISOLATE_ACTIVE, zone,
++ reclaim_mode, zone,
+ sc->mem_cgroup, 1, file);
+ /*
+ * mem_cgroup_isolate_pages() keeps track of
vmscan-abort-reclaim-compaction-if-compaction-can-proceed.patch
mm-compaction-trivial-clean-up-in-acct_isolated.patch
mm-change-isolate-mode-from-define-to-bitwise-type.patch
+mm-compaction-make-isolate_lru_page-filter-aware.patch
+mm-zone_reclaim-make-isolate_lru_page-filter-aware.patch
+mm-migration-clean-up-unmap_and_move.patch
+mm-compaction-allow-compaction-to-isolate-dirty-pages.patch
+mm-compaction-determine-if-dirty-pages-can-be-migrated-without-blocking-within-migratepage.patch