From: Greg Kroah-Hartman Date: Wed, 8 Aug 2012 23:02:49 +0000 (-0700) Subject: 3.5-stable patches X-Git-Tag: v3.5.1~2^2 X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=9cc0c6e9c3b467a7594aed55da1fa6e53e0dcd68;p=thirdparty%2Fkernel%2Fstable-queue.git 3.5-stable patches added patches: memcg-further-prevent-oom-with-too-many-dirty-pages.patch memcg-prevent-oom-with-too-many-dirty-pages.patch mm-fix-wrong-argument-of-migrate_huge_pages-in-soft_offline_huge_page.patch pcdp-use-early_ioremap-early_iounmap-to-access-pcdp-table.patch --- diff --git a/queue-3.5/memcg-further-prevent-oom-with-too-many-dirty-pages.patch b/queue-3.5/memcg-further-prevent-oom-with-too-many-dirty-pages.patch new file mode 100644 index 00000000000..7e7b03cc646 --- /dev/null +++ b/queue-3.5/memcg-further-prevent-oom-with-too-many-dirty-pages.patch @@ -0,0 +1,122 @@ +From c3b94f44fcb0725471ecebb701c077a0ed67bd07 Mon Sep 17 00:00:00 2001 +From: Hugh Dickins +Date: Tue, 31 Jul 2012 16:45:59 -0700 +Subject: memcg: further prevent OOM with too many dirty pages + +From: Hugh Dickins + +commit c3b94f44fcb0725471ecebb701c077a0ed67bd07 upstream. + +The may_enter_fs test turns out to be too restrictive: though I saw no +problem with it when testing on 3.5-rc6, it very soon OOMed when I tested +on 3.5-rc6-mm1. I don't know what the difference there is, perhaps I just +slightly changed the way I started off the testing: dd if=/dev/zero +of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync repeatedly, in 20M +memory.limit_in_bytes cgroup to ext4 on USB stick. + +ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with +AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why +the transaction needs to be started even before allocating pagecache +memory. But it may not be worth worrying about these days: if direct +reclaim avoids FS writeback, does __GFP_FS now mean anything? + +Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop +device; but since that also masks off __GFP_IO, we can test for __GFP_IO +directly, ignoring may_enter_fs and __GFP_FS. + +But even so, the test still OOMs sometimes: when originally testing on +3.5-rc6, it OOMed about one time in five or ten; when testing just now on +3.5-rc6-mm1, it OOMed on the first iteration. + +This residual problem comes from an accumulation of pages under ordinary +writeback, not marked PageReclaim, so rightly not causing the memcg check +to wait on their writeback: these too can prevent shrink_page_list() from +freeing any pages, so many times that memcg reclaim fails and OOMs. + +Deal with these in the same way as direct reclaim now deals with dirty FS +pages: mark them PageReclaim. It is appropriate to rotate these to tail +of list when writepage completes, but more importantly, the PageReclaim +flag makes memcg reclaim wait on them if encountered again. Increment +NR_VMSCAN_IMMEDIATE? That's arguable: I chose not. + +Setting PageReclaim here may occasionally race with end_page_writeback() +clearing it: lru_deactivate_fn() already faced the same race, and +correctly concluded that the window is small and the issue non-critical. + +With these changes, the test runs indefinitely without OOMing on ext4, +ext3 and ext2: I'll move on to test with other filesystems later. + +Trivia: invert conditions for a clearer block without an else, and goto +keep_locked to do the unlock_page. 
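For reference, with this patch applied on top of memcg-prevent-oom-with-too-many-dirty-pages.patch (also in this queue), the PageWriteback() branch of shrink_page_list() ends up reading roughly as follows (condensed from the hunk below, with the long comments trimmed):

        if (PageWriteback(page)) {
                if (global_reclaim(sc) ||
                    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
                        /* flag the page so a later memcg reclaim pass waits on it */
                        SetPageReclaim(page);
                        nr_writeback++;
                        goto keep_locked;
                }
                /* memcg reclaim, PageReclaim set, __GFP_IO allowed: throttle here */
                wait_on_page_writeback(page);
        }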
+ +Signed-off-by: Hugh Dickins +Cc: KAMEZAWA Hiroyuki +Cc: Minchan Kim +Cc: Rik van Riel +Cc: Ying Han +Cc: Greg Thelen +Cc: Hugh Dickins +Cc: Mel Gorman +Cc: Johannes Weiner +Cc: Fengguang Wu +Acked-by: Michal Hocko +Cc: Dave Chinner +Cc: Theodore Ts'o +Signed-off-by: Andrew Morton +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman + +--- + mm/vmscan.c | 33 ++++++++++++++++++++++++--------- + 1 file changed, 24 insertions(+), 9 deletions(-) + +--- a/mm/vmscan.c ++++ b/mm/vmscan.c +@@ -723,23 +723,38 @@ static unsigned long shrink_page_list(st + /* + * memcg doesn't have any dirty pages throttling so we + * could easily OOM just because too many pages are in +- * writeback from reclaim and there is nothing else to +- * reclaim. ++ * writeback and there is nothing else to reclaim. + * +- * Check may_enter_fs, certainly because a loop driver ++ * Check __GFP_IO, certainly because a loop driver + * thread might enter reclaim, and deadlock if it waits + * on a page for which it is needed to do the write + * (loop masks off __GFP_IO|__GFP_FS for this reason); + * but more thought would probably show more reasons. ++ * ++ * Don't require __GFP_FS, since we're not going into ++ * the FS, just waiting on its writeback completion. ++ * Worryingly, ext4 gfs2 and xfs allocate pages with ++ * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so ++ * testing may_enter_fs here is liable to OOM on them. + */ +- if (!global_reclaim(sc) && PageReclaim(page) && +- may_enter_fs) +- wait_on_page_writeback(page); +- else { ++ if (global_reclaim(sc) || ++ !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) { ++ /* ++ * This is slightly racy - end_page_writeback() ++ * might have just cleared PageReclaim, then ++ * setting PageReclaim here end up interpreted ++ * as PageReadahead - but that does not matter ++ * enough to care. What we do want is for this ++ * page to have PageReclaim set next time memcg ++ * reclaim reaches the tests above, so it will ++ * then wait_on_page_writeback() to avoid OOM; ++ * and it's also appropriate in global reclaim. ++ */ ++ SetPageReclaim(page); + nr_writeback++; +- unlock_page(page); +- goto keep; ++ goto keep_locked; + } ++ wait_on_page_writeback(page); + } + + references = page_check_references(page, sc); diff --git a/queue-3.5/memcg-prevent-oom-with-too-many-dirty-pages.patch b/queue-3.5/memcg-prevent-oom-with-too-many-dirty-pages.patch new file mode 100644 index 00000000000..004d8b6973a --- /dev/null +++ b/queue-3.5/memcg-prevent-oom-with-too-many-dirty-pages.patch @@ -0,0 +1,182 @@ +From e62e384e9da8d9a0c599795464a7e76fd490931c Mon Sep 17 00:00:00 2001 +From: Michal Hocko +Date: Tue, 31 Jul 2012 16:45:55 -0700 +Subject: memcg: prevent OOM with too many dirty pages + +From: Michal Hocko + +commit e62e384e9da8d9a0c599795464a7e76fd490931c upstream. + +The current implementation of dirty pages throttling is not memcg aware +which makes it easy to have memcg LRUs full of dirty pages. Without +throttling, these LRUs can be scanned faster than the rate of writeback, +leading to memcg OOM conditions when the hard limit is small. + +This patch fixes the problem by throttling the allocating process +(possibly a writer) during the hard limit reclaim by waiting on +PageReclaim pages. We are waiting only for PageReclaim pages because +those are the pages that made one full round over LRU and that means that +the writeback is much slower than scanning. 
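In code, the throttling point added here to the PageWriteback() case of shrink_page_list() is roughly the following (condensed from the hunk at the end of this patch; the follow-up memcg-further-prevent-oom-with-too-many-dirty-pages.patch in this queue later replaces the may_enter_fs test with a direct __GFP_IO check):

        if (PageWriteback(page)) {
                if (!global_reclaim(sc) && PageReclaim(page) && may_enter_fs)
                        /* memcg hard-limit reclaim: wait for writeback to complete */
                        wait_on_page_writeback(page);
                else {
                        nr_writeback++;
                        unlock_page(page);
                        goto keep;
                }
        }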
+ +The solution is far from being ideal - long term solution is memcg aware +dirty throttling - but it is meant to be a band aid until we have a real +fix. We are seeing this happening during nightly backups which are placed +into containers to prevent from eviction of the real working set. + +The change affects only memcg reclaim and only when we encounter +PageReclaim pages which is a signal that the reclaim doesn't catch up on +with the writers so somebody should be throttled. This could be +potentially unfair because it could be somebody else from the group who +gets throttled on behalf of the writer but as writers need to allocate as +well and they allocate in higher rate the probability that only innocent +processes would be penalized is not that high. + +I have tested this change by a simple dd copying /dev/zero to tmpfs or +ext3 running under small memcg (1G copy under 5M, 60M, 300M and 2G +containers) and dd got killed by OOM killer every time. With the patch I +could run the dd with the same size under 5M controller without any OOM. +The issue is more visible with slower devices for output. + +* With the patch +================ +* tmpfs size=2G +--------------- +$ vim cgroup_cache_oom_test.sh +$ ./cgroup_cache_oom_test.sh 5M +using Limit 5M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s +$ ./cgroup_cache_oom_test.sh 60M +using Limit 60M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s +$ ./cgroup_cache_oom_test.sh 300M +using Limit 300M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s +$ ./cgroup_cache_oom_test.sh 2G +using Limit 2G for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s + +* ext3 +------ +$ ./cgroup_cache_oom_test.sh 5M +using Limit 5M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s +$ ./cgroup_cache_oom_test.sh 60M +using Limit 60M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s +$ ./cgroup_cache_oom_test.sh 300M +using Limit 300M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s +$ ./cgroup_cache_oom_test.sh 2G +using Limit 2G for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s + +* Without the patch +=================== +* tmpfs size=2G +--------------- +$ ./cgroup_cache_oom_test.sh 5M +using Limit 5M for group +./cgroup_cache_oom_test.sh: line 46: 4668 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count +$ ./cgroup_cache_oom_test.sh 60M +using Limit 60M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s +$ ./cgroup_cache_oom_test.sh 300M +using Limit 300M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s +$ ./cgroup_cache_oom_test.sh 2G +using Limit 2G for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s + +* ext3 +------ +$ ./cgroup_cache_oom_test.sh 5M +using Limit 5M for group +./cgroup_cache_oom_test.sh: line 46: 4689 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count +$ ./cgroup_cache_oom_test.sh 60M +using Limit 60M for group +./cgroup_cache_oom_test.sh: line 46: 4692 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count +$ ./cgroup_cache_oom_test.sh 300M 
+using Limit 300M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s +$ ./cgroup_cache_oom_test.sh 2G +using Limit 2G for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s + +[akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n] +[hughd@google.com: fix deadlock with loop driver] +Reviewed-by: Mel Gorman +Acked-by: Johannes Weiner +Reviewed-by: Fengguang Wu +Signed-off-by: Michal Hocko +Cc: KAMEZAWA Hiroyuki +Cc: Minchan Kim +Cc: Rik van Riel +Cc: Ying Han +Cc: Greg Thelen +Cc: Hugh Dickins +Signed-off-by: Andrew Morton +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman + +--- + mm/vmscan.c | 23 ++++++++++++++++++++--- + 1 file changed, 20 insertions(+), 3 deletions(-) + +--- a/mm/vmscan.c ++++ b/mm/vmscan.c +@@ -720,9 +720,26 @@ static unsigned long shrink_page_list(st + (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); + + if (PageWriteback(page)) { +- nr_writeback++; +- unlock_page(page); +- goto keep; ++ /* ++ * memcg doesn't have any dirty pages throttling so we ++ * could easily OOM just because too many pages are in ++ * writeback from reclaim and there is nothing else to ++ * reclaim. ++ * ++ * Check may_enter_fs, certainly because a loop driver ++ * thread might enter reclaim, and deadlock if it waits ++ * on a page for which it is needed to do the write ++ * (loop masks off __GFP_IO|__GFP_FS for this reason); ++ * but more thought would probably show more reasons. ++ */ ++ if (!global_reclaim(sc) && PageReclaim(page) && ++ may_enter_fs) ++ wait_on_page_writeback(page); ++ else { ++ nr_writeback++; ++ unlock_page(page); ++ goto keep; ++ } + } + + references = page_check_references(page, sc); diff --git a/queue-3.5/mm-fix-wrong-argument-of-migrate_huge_pages-in-soft_offline_huge_page.patch b/queue-3.5/mm-fix-wrong-argument-of-migrate_huge_pages-in-soft_offline_huge_page.patch new file mode 100644 index 00000000000..19b26981229 --- /dev/null +++ b/queue-3.5/mm-fix-wrong-argument-of-migrate_huge_pages-in-soft_offline_huge_page.patch @@ -0,0 +1,55 @@ +From dc32f63453f56d07a1073a697dcd843dd3098c09 Mon Sep 17 00:00:00 2001 +From: Joonsoo Kim +Date: Mon, 30 Jul 2012 14:39:04 -0700 +Subject: mm: fix wrong argument of migrate_huge_pages() in soft_offline_huge_page() + +From: Joonsoo Kim + +commit dc32f63453f56d07a1073a697dcd843dd3098c09 upstream. + +Commit a6bc32b89922 ("mm: compaction: introduce sync-light migration for +use by compaction") changed the declaration of migrate_pages() and +migrate_huge_pages(). + +But it missed changing the argument of migrate_huge_pages() in +soft_offline_huge_page(). In this case, we should call +migrate_huge_pages() with MIGRATE_SYNC. + +Additionally, there is a mismatch between type the of argument and the +function declaration for migrate_pages(). + +Signed-off-by: Joonsoo Kim +Cc: Christoph Lameter +Cc: Mel Gorman +Acked-by: David Rientjes +Cc: "Aneesh Kumar K.V" +Signed-off-by: Andrew Morton +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman + +--- + mm/memory-failure.c | 6 +++--- + 1 file changed, 3 insertions(+), 3 deletions(-) + +--- a/mm/memory-failure.c ++++ b/mm/memory-failure.c +@@ -1431,8 +1431,8 @@ static int soft_offline_huge_page(struct + /* Keep page count to indicate a given hugepage is isolated. 
*/ + + list_add(&hpage->lru, &pagelist); +- ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0, +- true); ++ ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, false, ++ MIGRATE_SYNC); + if (ret) { + struct page *page1, *page2; + list_for_each_entry_safe(page1, page2, &pagelist, lru) +@@ -1561,7 +1561,7 @@ int soft_offline_page(struct page *page, + page_is_file_cache(page)); + list_add(&page->lru, &pagelist); + ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, +- 0, MIGRATE_SYNC); ++ false, MIGRATE_SYNC); + if (ret) { + putback_lru_pages(&pagelist); + pr_info("soft offline: %#lx: migration failed %d, type %lx\n", diff --git a/queue-3.5/pcdp-use-early_ioremap-early_iounmap-to-access-pcdp-table.patch b/queue-3.5/pcdp-use-early_ioremap-early_iounmap-to-access-pcdp-table.patch new file mode 100644 index 00000000000..8d1f21e2145 --- /dev/null +++ b/queue-3.5/pcdp-use-early_ioremap-early_iounmap-to-access-pcdp-table.patch @@ -0,0 +1,70 @@ +From 6c4088ac3a4d82779903433bcd5f048c58fb1aca Mon Sep 17 00:00:00 2001 +From: Greg Pearson +Date: Mon, 30 Jul 2012 14:39:05 -0700 +Subject: pcdp: use early_ioremap/early_iounmap to access pcdp table + +From: Greg Pearson + +commit 6c4088ac3a4d82779903433bcd5f048c58fb1aca upstream. + +efi_setup_pcdp_console() is called during boot to parse the HCDP/PCDP +EFI system table and setup an early console for printk output. The +routine uses ioremap/iounmap to setup access to the HCDP/PCDP table +information. + +The call to ioremap is happening early in the boot process which leads +to a panic on x86_64 systems: + + panic+0x01ca + do_exit+0x043c + oops_end+0x00a7 + no_context+0x0119 + __bad_area_nosemaphore+0x0138 + bad_area_nosemaphore+0x000e + do_page_fault+0x0321 + page_fault+0x0020 + reserve_memtype+0x02a1 + __ioremap_caller+0x0123 + ioremap_nocache+0x0012 + efi_setup_pcdp_console+0x002b + setup_arch+0x03a9 + start_kernel+0x00d4 + x86_64_start_reservations+0x012c + x86_64_start_kernel+0x00fe + +This replaces the calls to ioremap/iounmap in efi_setup_pcdp_console() +with calls to early_ioremap/early_iounmap which can be called during +early boot. + +This patch was tested on an x86_64 prototype system which uses the +HCDP/PCDP table for early console setup. 
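For quick reference, the net change below is just the pair of mapping calls in efi_setup_pcdp_console():

        pcdp = early_ioremap(efi.hcdp, 4096);
        ...
        early_iounmap(pcdp, 4096);

Unlike iounmap(), early_iounmap() also takes the length of the mapping, so the same 4096 bytes mapped at the top of the function are passed again on the unmap path.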
+ +Signed-off-by: Greg Pearson +Acked-by: Khalid Aziz +Signed-off-by: Andrew Morton +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman + +--- + drivers/firmware/pcdp.c | 4 ++-- + 1 file changed, 2 insertions(+), 2 deletions(-) + +--- a/drivers/firmware/pcdp.c ++++ b/drivers/firmware/pcdp.c +@@ -95,7 +95,7 @@ efi_setup_pcdp_console(char *cmdline) + if (efi.hcdp == EFI_INVALID_TABLE_ADDR) + return -ENODEV; + +- pcdp = ioremap(efi.hcdp, 4096); ++ pcdp = early_ioremap(efi.hcdp, 4096); + printk(KERN_INFO "PCDP: v%d at 0x%lx\n", pcdp->rev, efi.hcdp); + + if (strstr(cmdline, "console=hcdp")) { +@@ -131,6 +131,6 @@ efi_setup_pcdp_console(char *cmdline) + } + + out: +- iounmap(pcdp); ++ early_iounmap(pcdp, 4096); + return rc; + } diff --git a/queue-3.5/series b/queue-3.5/series index efa8c454963..1167b31d577 100644 --- a/queue-3.5/series +++ b/queue-3.5/series @@ -12,3 +12,7 @@ nilfs2-fix-deadlock-issue-between-chcp-and-thaw-ioctls.patch media-ene_ir-fix-driver-initialisation.patch media-m5mols-correct-reported-iso-values.patch media-videobuf-dma-contig-restore-buffer-mapping-for-uncached-bufers.patch +pcdp-use-early_ioremap-early_iounmap-to-access-pcdp-table.patch +memcg-prevent-oom-with-too-many-dirty-pages.patch +memcg-further-prevent-oom-with-too-many-dirty-pages.patch +mm-fix-wrong-argument-of-migrate_huge_pages-in-soft_offline_huge_page.patch