From: Greg Kroah-Hartman Date: Wed, 8 Aug 2012 23:02:49 +0000 (-0700) Subject: 3.5-stable patches X-Git-Tag: v3.5.1~2^2 X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=9cc0c6e9c3b467a7594aed55da1fa6e53e0dcd68;p=thirdparty%2Fkernel%2Fstable-queue.git 3.5-stable patches added patches: memcg-further-prevent-oom-with-too-many-dirty-pages.patch memcg-prevent-oom-with-too-many-dirty-pages.patch mm-fix-wrong-argument-of-migrate_huge_pages-in-soft_offline_huge_page.patch pcdp-use-early_ioremap-early_iounmap-to-access-pcdp-table.patch --- diff --git a/queue-3.5/memcg-further-prevent-oom-with-too-many-dirty-pages.patch b/queue-3.5/memcg-further-prevent-oom-with-too-many-dirty-pages.patch new file mode 100644 index 00000000000..7e7b03cc646 --- /dev/null +++ b/queue-3.5/memcg-further-prevent-oom-with-too-many-dirty-pages.patch @@ -0,0 +1,122 @@ +From c3b94f44fcb0725471ecebb701c077a0ed67bd07 Mon Sep 17 00:00:00 2001 +From: Hugh Dickins +Date: Tue, 31 Jul 2012 16:45:59 -0700 +Subject: memcg: further prevent OOM with too many dirty pages + +From: Hugh Dickins + +commit c3b94f44fcb0725471ecebb701c077a0ed67bd07 upstream. + +The may_enter_fs test turns out to be too restrictive: though I saw no +problem with it when testing on 3.5-rc6, it very soon OOMed when I tested +on 3.5-rc6-mm1. I don't know what the difference there is, perhaps I just +slightly changed the way I started off the testing: dd if=/dev/zero +of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync repeatedly, in 20M +memory.limit_in_bytes cgroup to ext4 on USB stick. + +ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with +AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why +the transaction needs to be started even before allocating pagecache +memory. But it may not be worth worrying about these days: if direct +reclaim avoids FS writeback, does __GFP_FS now mean anything? + +Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop +device; but since that also masks off __GFP_IO, we can test for __GFP_IO +directly, ignoring may_enter_fs and __GFP_FS. + +But even so, the test still OOMs sometimes: when originally testing on +3.5-rc6, it OOMed about one time in five or ten; when testing just now on +3.5-rc6-mm1, it OOMed on the first iteration. + +This residual problem comes from an accumulation of pages under ordinary +writeback, not marked PageReclaim, so rightly not causing the memcg check +to wait on their writeback: these too can prevent shrink_page_list() from +freeing any pages, so many times that memcg reclaim fails and OOMs. + +Deal with these in the same way as direct reclaim now deals with dirty FS +pages: mark them PageReclaim. It is appropriate to rotate these to tail +of list when writepage completes, but more importantly, the PageReclaim +flag makes memcg reclaim wait on them if encountered again. Increment +NR_VMSCAN_IMMEDIATE? That's arguable: I chose not. + +Setting PageReclaim here may occasionally race with end_page_writeback() +clearing it: lru_deactivate_fn() already faced the same race, and +correctly concluded that the window is small and the issue non-critical. + +With these changes, the test runs indefinitely without OOMing on ext4, +ext3 and ext2: I'll move on to test with other filesystems later. + +Trivia: invert conditions for a clearer block without an else, and goto +keep_locked to do the unlock_page. 
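For reference, with this patch applied on top of memcg-prevent-oom-with-too-many-dirty-pages.patch (also in this queue), the PageWriteback() branch of shrink_page_list() ends up reading roughly as follows (condensed from the hunk below, with the long comments trimmed):

        if (PageWriteback(page)) {
                if (global_reclaim(sc) ||
                    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
                        /* flag the page so a later memcg reclaim pass waits on it */
                        SetPageReclaim(page);
                        nr_writeback++;
                        goto keep_locked;
                }
                /* memcg reclaim, PageReclaim set, __GFP_IO allowed: throttle here */
                wait_on_page_writeback(page);
        }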
+ +Signed-off-by: Hugh Dickins +Cc: KAMEZAWA Hiroyuki +Cc: Minchan Kim +Cc: Rik van Riel +Cc: Ying Han +Cc: Greg Thelen +Cc: Hugh Dickins +Cc: Mel Gorman +Cc: Johannes Weiner +Cc: Fengguang Wu +Acked-by: Michal Hocko +Cc: Dave Chinner +Cc: Theodore Ts'o +Signed-off-by: Andrew Morton +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman + +--- + mm/vmscan.c | 33 ++++++++++++++++++++++++--------- + 1 file changed, 24 insertions(+), 9 deletions(-) + +--- a/mm/vmscan.c ++++ b/mm/vmscan.c +@@ -723,23 +723,38 @@ static unsigned long shrink_page_list(st + /* + * memcg doesn't have any dirty pages throttling so we + * could easily OOM just because too many pages are in +- * writeback from reclaim and there is nothing else to +- * reclaim. ++ * writeback and there is nothing else to reclaim. + * +- * Check may_enter_fs, certainly because a loop driver ++ * Check __GFP_IO, certainly because a loop driver + * thread might enter reclaim, and deadlock if it waits + * on a page for which it is needed to do the write + * (loop masks off __GFP_IO|__GFP_FS for this reason); + * but more thought would probably show more reasons. ++ * ++ * Don't require __GFP_FS, since we're not going into ++ * the FS, just waiting on its writeback completion. ++ * Worryingly, ext4 gfs2 and xfs allocate pages with ++ * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so ++ * testing may_enter_fs here is liable to OOM on them. + */ +- if (!global_reclaim(sc) && PageReclaim(page) && +- may_enter_fs) +- wait_on_page_writeback(page); +- else { ++ if (global_reclaim(sc) || ++ !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) { ++ /* ++ * This is slightly racy - end_page_writeback() ++ * might have just cleared PageReclaim, then ++ * setting PageReclaim here end up interpreted ++ * as PageReadahead - but that does not matter ++ * enough to care. What we do want is for this ++ * page to have PageReclaim set next time memcg ++ * reclaim reaches the tests above, so it will ++ * then wait_on_page_writeback() to avoid OOM; ++ * and it's also appropriate in global reclaim. ++ */ ++ SetPageReclaim(page); + nr_writeback++; +- unlock_page(page); +- goto keep; ++ goto keep_locked; + } ++ wait_on_page_writeback(page); + } + + references = page_check_references(page, sc); diff --git a/queue-3.5/memcg-prevent-oom-with-too-many-dirty-pages.patch b/queue-3.5/memcg-prevent-oom-with-too-many-dirty-pages.patch new file mode 100644 index 00000000000..004d8b6973a --- /dev/null +++ b/queue-3.5/memcg-prevent-oom-with-too-many-dirty-pages.patch @@ -0,0 +1,182 @@ +From e62e384e9da8d9a0c599795464a7e76fd490931c Mon Sep 17 00:00:00 2001 +From: Michal Hocko +Date: Tue, 31 Jul 2012 16:45:55 -0700 +Subject: memcg: prevent OOM with too many dirty pages + +From: Michal Hocko + +commit e62e384e9da8d9a0c599795464a7e76fd490931c upstream. + +The current implementation of dirty pages throttling is not memcg aware +which makes it easy to have memcg LRUs full of dirty pages. Without +throttling, these LRUs can be scanned faster than the rate of writeback, +leading to memcg OOM conditions when the hard limit is small. + +This patch fixes the problem by throttling the allocating process +(possibly a writer) during the hard limit reclaim by waiting on +PageReclaim pages. We are waiting only for PageReclaim pages because +those are the pages that made one full round over LRU and that means that +the writeback is much slower than scanning. 
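In code, the throttling point added here to the PageWriteback() case of shrink_page_list() is roughly the following (condensed from the hunk at the end of this patch; the follow-up memcg-further-prevent-oom-with-too-many-dirty-pages.patch in this queue later replaces the may_enter_fs test with a direct __GFP_IO check):

        if (PageWriteback(page)) {
                if (!global_reclaim(sc) && PageReclaim(page) && may_enter_fs)
                        /* memcg hard-limit reclaim: wait for writeback to complete */
                        wait_on_page_writeback(page);
                else {
                        nr_writeback++;
                        unlock_page(page);
                        goto keep;
                }
        }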
+ +The solution is far from being ideal - long term solution is memcg aware +dirty throttling - but it is meant to be a band aid until we have a real +fix. We are seeing this happening during nightly backups which are placed +into containers to prevent from eviction of the real working set. + +The change affects only memcg reclaim and only when we encounter +PageReclaim pages which is a signal that the reclaim doesn't catch up on +with the writers so somebody should be throttled. This could be +potentially unfair because it could be somebody else from the group who +gets throttled on behalf of the writer but as writers need to allocate as +well and they allocate in higher rate the probability that only innocent +processes would be penalized is not that high. + +I have tested this change by a simple dd copying /dev/zero to tmpfs or +ext3 running under small memcg (1G copy under 5M, 60M, 300M and 2G +containers) and dd got killed by OOM killer every time. With the patch I +could run the dd with the same size under 5M controller without any OOM. +The issue is more visible with slower devices for output. + +* With the patch +================ +* tmpfs size=2G +--------------- +$ vim cgroup_cache_oom_test.sh +$ ./cgroup_cache_oom_test.sh 5M +using Limit 5M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s +$ ./cgroup_cache_oom_test.sh 60M +using Limit 60M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s +$ ./cgroup_cache_oom_test.sh 300M +using Limit 300M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s +$ ./cgroup_cache_oom_test.sh 2G +using Limit 2G for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s + +* ext3 +------ +$ ./cgroup_cache_oom_test.sh 5M +using Limit 5M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s +$ ./cgroup_cache_oom_test.sh 60M +using Limit 60M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s +$ ./cgroup_cache_oom_test.sh 300M +using Limit 300M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s +$ ./cgroup_cache_oom_test.sh 2G +using Limit 2G for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s + +* Without the patch +=================== +* tmpfs size=2G +--------------- +$ ./cgroup_cache_oom_test.sh 5M +using Limit 5M for group +./cgroup_cache_oom_test.sh: line 46: 4668 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count +$ ./cgroup_cache_oom_test.sh 60M +using Limit 60M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s +$ ./cgroup_cache_oom_test.sh 300M +using Limit 300M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s +$ ./cgroup_cache_oom_test.sh 2G +using Limit 2G for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s + +* ext3 +------ +$ ./cgroup_cache_oom_test.sh 5M +using Limit 5M for group +./cgroup_cache_oom_test.sh: line 46: 4689 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count +$ ./cgroup_cache_oom_test.sh 60M +using Limit 60M for group +./cgroup_cache_oom_test.sh: line 46: 4692 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count +$ ./cgroup_cache_oom_test.sh 300M 
+using Limit 300M for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s +$ ./cgroup_cache_oom_test.sh 2G +using Limit 2G for group +1000+0 records in +1000+0 records out +1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s + +[akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n] +[hughd@google.com: fix deadlock with loop driver] +Reviewed-by: Mel Gorman +Acked-by: Johannes Weiner +Reviewed-by: Fengguang Wu +Signed-off-by: Michal Hocko +Cc: KAMEZAWA Hiroyuki +Cc: Minchan Kim +Cc: Rik van Riel +Cc: Ying Han +Cc: Greg Thelen +Cc: Hugh Dickins +Signed-off-by: Andrew Morton +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman + +--- + mm/vmscan.c | 23 ++++++++++++++++++++--- + 1 file changed, 20 insertions(+), 3 deletions(-) + +--- a/mm/vmscan.c ++++ b/mm/vmscan.c +@@ -720,9 +720,26 @@ static unsigned long shrink_page_list(st + (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); + + if (PageWriteback(page)) { +- nr_writeback++; +- unlock_page(page); +- goto keep; ++ /* ++ * memcg doesn't have any dirty pages throttling so we ++ * could easily OOM just because too many pages are in ++ * writeback from reclaim and there is nothing else to ++ * reclaim. ++ * ++ * Check may_enter_fs, certainly because a loop driver ++ * thread might enter reclaim, and deadlock if it waits ++ * on a page for which it is needed to do the write ++ * (loop masks off __GFP_IO|__GFP_FS for this reason); ++ * but more thought would probably show more reasons. ++ */ ++ if (!global_reclaim(sc) && PageReclaim(page) && ++ may_enter_fs) ++ wait_on_page_writeback(page); ++ else { ++ nr_writeback++; ++ unlock_page(page); ++ goto keep; ++ } + } + + references = page_check_references(page, sc); diff --git a/queue-3.5/mm-fix-wrong-argument-of-migrate_huge_pages-in-soft_offline_huge_page.patch b/queue-3.5/mm-fix-wrong-argument-of-migrate_huge_pages-in-soft_offline_huge_page.patch new file mode 100644 index 00000000000..19b26981229 --- /dev/null +++ b/queue-3.5/mm-fix-wrong-argument-of-migrate_huge_pages-in-soft_offline_huge_page.patch @@ -0,0 +1,55 @@ +From dc32f63453f56d07a1073a697dcd843dd3098c09 Mon Sep 17 00:00:00 2001 +From: Joonsoo Kim +Date: Mon, 30 Jul 2012 14:39:04 -0700 +Subject: mm: fix wrong argument of migrate_huge_pages() in soft_offline_huge_page() + +From: Joonsoo Kim + +commit dc32f63453f56d07a1073a697dcd843dd3098c09 upstream. + +Commit a6bc32b89922 ("mm: compaction: introduce sync-light migration for +use by compaction") changed the declaration of migrate_pages() and +migrate_huge_pages(). + +But it missed changing the argument of migrate_huge_pages() in +soft_offline_huge_page(). In this case, we should call +migrate_huge_pages() with MIGRATE_SYNC. + +Additionally, there is a mismatch between type the of argument and the +function declaration for migrate_pages(). + +Signed-off-by: Joonsoo Kim +Cc: Christoph Lameter +Cc: Mel Gorman +Acked-by: David Rientjes +Cc: "Aneesh Kumar K.V" +Signed-off-by: Andrew Morton +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman + +--- + mm/memory-failure.c | 6 +++--- + 1 file changed, 3 insertions(+), 3 deletions(-) + +--- a/mm/memory-failure.c ++++ b/mm/memory-failure.c +@@ -1431,8 +1431,8 @@ static int soft_offline_huge_page(struct + /* Keep page count to indicate a given hugepage is isolated. 
*/ + + list_add(&hpage->lru, &pagelist); +- ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0, +- true); ++ ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, false, ++ MIGRATE_SYNC); + if (ret) { + struct page *page1, *page2; + list_for_each_entry_safe(page1, page2, &pagelist, lru) +@@ -1561,7 +1561,7 @@ int soft_offline_page(struct page *page, + page_is_file_cache(page)); + list_add(&page->lru, &pagelist); + ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, +- 0, MIGRATE_SYNC); ++ false, MIGRATE_SYNC); + if (ret) { + putback_lru_pages(&pagelist); + pr_info("soft offline: %#lx: migration failed %d, type %lx\n", diff --git a/queue-3.5/pcdp-use-early_ioremap-early_iounmap-to-access-pcdp-table.patch b/queue-3.5/pcdp-use-early_ioremap-early_iounmap-to-access-pcdp-table.patch new file mode 100644 index 00000000000..8d1f21e2145 --- /dev/null +++ b/queue-3.5/pcdp-use-early_ioremap-early_iounmap-to-access-pcdp-table.patch @@ -0,0 +1,70 @@ +From 6c4088ac3a4d82779903433bcd5f048c58fb1aca Mon Sep 17 00:00:00 2001 +From: Greg Pearson +Date: Mon, 30 Jul 2012 14:39:05 -0700 +Subject: pcdp: use early_ioremap/early_iounmap to access pcdp table + +From: Greg Pearson + +commit 6c4088ac3a4d82779903433bcd5f048c58fb1aca upstream. + +efi_setup_pcdp_console() is called during boot to parse the HCDP/PCDP +EFI system table and setup an early console for printk output. The +routine uses ioremap/iounmap to setup access to the HCDP/PCDP table +information. + +The call to ioremap is happening early in the boot process which leads +to a panic on x86_64 systems: + + panic+0x01ca + do_exit+0x043c + oops_end+0x00a7 + no_context+0x0119 + __bad_area_nosemaphore+0x0138 + bad_area_nosemaphore+0x000e + do_page_fault+0x0321 + page_fault+0x0020 + reserve_memtype+0x02a1 + __ioremap_caller+0x0123 + ioremap_nocache+0x0012 + efi_setup_pcdp_console+0x002b + setup_arch+0x03a9 + start_kernel+0x00d4 + x86_64_start_reservations+0x012c + x86_64_start_kernel+0x00fe + +This replaces the calls to ioremap/iounmap in efi_setup_pcdp_console() +with calls to early_ioremap/early_iounmap which can be called during +early boot. + +This patch was tested on an x86_64 prototype system which uses the +HCDP/PCDP table for early console setup. 
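For quick reference, the net change below is just the pair of mapping calls in efi_setup_pcdp_console():

        pcdp = early_ioremap(efi.hcdp, 4096);
        ...
        early_iounmap(pcdp, 4096);

Unlike iounmap(), early_iounmap() also takes the length of the mapping, so the same 4096 bytes mapped at the top of the function are passed again on the unmap path.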
+ +Signed-off-by: Greg Pearson +Acked-by: Khalid Aziz +Signed-off-by: Andrew Morton +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman + +--- + drivers/firmware/pcdp.c | 4 ++-- + 1 file changed, 2 insertions(+), 2 deletions(-) + +--- a/drivers/firmware/pcdp.c ++++ b/drivers/firmware/pcdp.c +@@ -95,7 +95,7 @@ efi_setup_pcdp_console(char *cmdline) + if (efi.hcdp == EFI_INVALID_TABLE_ADDR) + return -ENODEV; + +- pcdp = ioremap(efi.hcdp, 4096); ++ pcdp = early_ioremap(efi.hcdp, 4096); + printk(KERN_INFO "PCDP: v%d at 0x%lx\n", pcdp->rev, efi.hcdp); + + if (strstr(cmdline, "console=hcdp")) { +@@ -131,6 +131,6 @@ efi_setup_pcdp_console(char *cmdline) + } + + out: +- iounmap(pcdp); ++ early_iounmap(pcdp, 4096); + return rc; + } diff --git a/queue-3.5/series b/queue-3.5/series index efa8c454963..1167b31d577 100644 --- a/queue-3.5/series +++ b/queue-3.5/series @@ -12,3 +12,7 @@ nilfs2-fix-deadlock-issue-between-chcp-and-thaw-ioctls.patch media-ene_ir-fix-driver-initialisation.patch media-m5mols-correct-reported-iso-values.patch media-videobuf-dma-contig-restore-buffer-mapping-for-uncached-bufers.patch +pcdp-use-early_ioremap-early_iounmap-to-access-pcdp-table.patch +memcg-prevent-oom-with-too-many-dirty-pages.patch +memcg-further-prevent-oom-with-too-many-dirty-pages.patch +mm-fix-wrong-argument-of-migrate_huge_pages-in-soft_offline_huge_page.patch