From: Greg Kroah-Hartman
Date: Thu, 25 Jan 2018 06:49:21 +0000 (+0100)
Subject: 4.9-stable patches
X-Git-Tag: v4.4.114~32
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=44e352713216d5c3a1b837a15754eebdb66051bf;p=thirdparty%2Fkernel%2Fstable-queue.git

4.9-stable patches

added patches:
mm-fix-100-cpu-kswapd-busyloop-on-unreclaimable-nodes.patch
---

diff --git a/queue-4.9/mm-fix-100-cpu-kswapd-busyloop-on-unreclaimable-nodes.patch b/queue-4.9/mm-fix-100-cpu-kswapd-busyloop-on-unreclaimable-nodes.patch
new file mode 100644
index 00000000000..04d83bb1dde
--- /dev/null
+++ b/queue-4.9/mm-fix-100-cpu-kswapd-busyloop-on-unreclaimable-nodes.patch
@@ -0,0 +1,314 @@
+From c73322d098e4b6f5f0f0fa1330bf57e218775539 Mon Sep 17 00:00:00 2001
+From: Johannes Weiner
+Date: Wed, 3 May 2017 14:51:51 -0700
+Subject: mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
+
+From: Johannes Weiner
+
+commit c73322d098e4b6f5f0f0fa1330bf57e218775539 upstream.
+
+Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
+cleanups".
+
+Jia reported a scenario in which the kswapd of a node indefinitely spins
+at 100% CPU usage. We have seen similar cases at Facebook.
+
+The kernel's current method of judging its ability to reclaim a node (or
+whether to back off and sleep) is based on the amount of scanned pages
+in proportion to the amount of reclaimable pages. In Jia's and our
+scenarios, there are no reclaimable pages in the node, however, and the
+condition for backing off is never met. Kswapd busyloops in an attempt
+to restore the watermarks while having nothing to work with.
+
+This series reworks the definition of an unreclaimable node based not on
+scanning but on whether kswapd is able to actually reclaim pages in
+MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria
+the page allocator uses for giving up on direct reclaim and invoking the
+OOM killer. If it cannot free any pages, kswapd will go to sleep and
+leave further attempts to direct reclaim invocations, which will either
+make progress and re-enable kswapd, or invoke the OOM killer.
+
+Patch #1 fixes the immediate problem Jia reported, the remainder are
+smaller fixlets, cleanups, and overall phasing out of the old method.
+
+Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
+and directly related to #5, but in itself not relevant to the series.
+
+If the whole series is too ambitious for 4.11, I would consider the
+first three patches fixes, the rest cleanups.
+
+This patch (of 9):
+
+Jia He reports a problem with kswapd spinning at 100% CPU when
+requesting more hugepages than memory available in the system:
+
+$ echo 4000 >/proc/sys/vm/nr_hugepages
+
+top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
+Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
+%Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
+KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
+KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem
+
+ PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
+ 76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3
+
+At that time, there are no reclaimable pages left in the node, but as
+kswapd fails to restore the high watermarks it refuses to go to sleep.
+
+Kswapd needs to back away from nodes that fail to balance. Up until
+commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
+nodes") kswapd had such a mechanism. It considered zones whose
+theoretically reclaimable pages it had reclaimed six times over as
+unreclaimable and backed away from them. This guard was erroneously
+removed as the patch changed the definition of a balanced node.
+
+However, simply restoring this code wouldn't help in the case reported
+here: there *are* no reclaimable pages that could be scanned until the
+threshold is met. Kswapd would stay awake anyway.
+
+Introduce a new and much simpler way of backing off. If kswapd runs
+through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
+page, make it back off from the node. This is the same number of shots
+direct reclaim takes before declaring OOM. Kswapd will go to sleep on
+that node until a direct reclaimer manages to reclaim some pages, thus
+proving the node reclaimable again.
+
+[hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
+ Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
+[shakeelb@google.com: fix condition for throttle_direct_reclaim]
+ Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
+Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
+Signed-off-by: Johannes Weiner
+Signed-off-by: Shakeel Butt
+Reported-by: Jia He
+Tested-by: Jia He
+Acked-by: Michal Hocko
+Acked-by: Hillf Danton
+Acked-by: Minchan Kim
+Signed-off-by: Andrew Morton
+Signed-off-by: Linus Torvalds
+Cc: Dmitry Shmidt
+Signed-off-by: Greg Kroah-Hartman
+
+---
+ include/linux/mmzone.h | 2 ++
+ mm/internal.h | 6 ++++++
+ mm/page_alloc.c | 9 ++-------
+ mm/vmscan.c | 47 ++++++++++++++++++++++++++++++++---------------
+ mm/vmstat.c | 2 +-
+ 5 files changed, 43 insertions(+), 23 deletions(-)
+
+--- a/include/linux/mmzone.h
++++ b/include/linux/mmzone.h
+@@ -633,6 +633,8 @@ typedef struct pglist_data {
+ int kswapd_order;
+ enum zone_type kswapd_classzone_idx;
+
++ int kswapd_failures; /* Number of 'reclaimed == 0' runs */
++
+ #ifdef CONFIG_COMPACTION
+ int kcompactd_max_order;
+ enum zone_type kcompactd_classzone_idx;
+--- a/mm/internal.h
++++ b/mm/internal.h
+@@ -74,6 +74,12 @@ static inline void set_page_refcounted(s
+ extern unsigned long highest_memmap_pfn;
+
+ /*
++ * Maximum number of reclaim retries without progress before the OOM
++ * killer is consider the only way forward.
++ */
++#define MAX_RECLAIM_RETRIES 16
++
++/*
+ * in mm/vmscan.c:
+ */
+ extern int isolate_lru_page(struct page *page);
+--- a/mm/page_alloc.c
++++ b/mm/page_alloc.c
+@@ -3422,12 +3422,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_ma
+ }
+
+ /*
+- * Maximum number of reclaim retries without any progress before OOM killer
+- * is consider as the only way to move forward.
+- */
+-#define MAX_RECLAIM_RETRIES 16
+-
+-/*
+ * Checks whether it makes sense to retry the reclaim to make a forward progress
+ * for the given allocation request.
+ * The reclaim feedback represented by did_some_progress (any progress during
+@@ -4385,7 +4379,8 @@ void show_free_areas(unsigned int filter
+ K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
+ K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
+ node_page_state(pgdat, NR_PAGES_SCANNED),
+- !pgdat_reclaimable(pgdat) ? "yes" : "no");
++ pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
++ "yes" : "no"); + } + + for_each_populated_zone(zone) { +--- a/mm/vmscan.c ++++ b/mm/vmscan.c +@@ -2606,6 +2606,15 @@ static bool shrink_node(pg_data_t *pgdat + } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, + sc->nr_scanned - nr_scanned, sc)); + ++ /* ++ * Kswapd gives up on balancing particular nodes after too ++ * many failures to reclaim anything from them and goes to ++ * sleep. On reclaim progress, reset the failure counter. A ++ * successful direct reclaim run will revive a dormant kswapd. ++ */ ++ if (reclaimable) ++ pgdat->kswapd_failures = 0; ++ + return reclaimable; + } + +@@ -2680,10 +2689,6 @@ static void shrink_zones(struct zonelist + GFP_KERNEL | __GFP_HARDWALL)) + continue; + +- if (sc->priority != DEF_PRIORITY && +- !pgdat_reclaimable(zone->zone_pgdat)) +- continue; /* Let kswapd poll it */ +- + /* + * If we already have plenty of memory free for + * compaction in this zone, don't free any more. +@@ -2820,7 +2825,7 @@ retry: + return 0; + } + +-static bool pfmemalloc_watermark_ok(pg_data_t *pgdat) ++static bool allow_direct_reclaim(pg_data_t *pgdat) + { + struct zone *zone; + unsigned long pfmemalloc_reserve = 0; +@@ -2828,6 +2833,9 @@ static bool pfmemalloc_watermark_ok(pg_d + int i; + bool wmark_ok; + ++ if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES) ++ return true; ++ + for (i = 0; i <= ZONE_NORMAL; i++) { + zone = &pgdat->node_zones[i]; + if (!managed_zone(zone) || +@@ -2908,7 +2916,7 @@ static bool throttle_direct_reclaim(gfp_ + + /* Throttle based on the first usable node */ + pgdat = zone->zone_pgdat; +- if (pfmemalloc_watermark_ok(pgdat)) ++ if (allow_direct_reclaim(pgdat)) + goto out; + break; + } +@@ -2930,14 +2938,14 @@ static bool throttle_direct_reclaim(gfp_ + */ + if (!(gfp_mask & __GFP_FS)) { + wait_event_interruptible_timeout(pgdat->pfmemalloc_wait, +- pfmemalloc_watermark_ok(pgdat), HZ); ++ allow_direct_reclaim(pgdat), HZ); + + goto check_pending; + } + + /* Throttle until kswapd wakes the process */ + wait_event_killable(zone->zone_pgdat->pfmemalloc_wait, +- pfmemalloc_watermark_ok(pgdat)); ++ allow_direct_reclaim(pgdat)); + + check_pending: + if (fatal_signal_pending(current)) +@@ -3116,7 +3124,7 @@ static bool prepare_kswapd_sleep(pg_data + + /* + * The throttled processes are normally woken up in balance_pgdat() as +- * soon as pfmemalloc_watermark_ok() is true. But there is a potential ++ * soon as allow_direct_reclaim() is true. But there is a potential + * race between when kswapd checks the watermarks and a process gets + * throttled. There is also a potential race if processes get + * throttled, kswapd wakes, a large process exits thereby balancing the +@@ -3130,6 +3138,10 @@ static bool prepare_kswapd_sleep(pg_data + if (waitqueue_active(&pgdat->pfmemalloc_wait)) + wake_up_all(&pgdat->pfmemalloc_wait); + ++ /* Hopeless node, leave it to direct reclaim */ ++ if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES) ++ return true; ++ + for (i = 0; i <= classzone_idx; i++) { + struct zone *zone = pgdat->node_zones + i; + +@@ -3216,9 +3228,9 @@ static int balance_pgdat(pg_data_t *pgda + count_vm_event(PAGEOUTRUN); + + do { ++ unsigned long nr_reclaimed = sc.nr_reclaimed; + bool raise_priority = true; + +- sc.nr_reclaimed = 0; + sc.reclaim_idx = classzone_idx; + + /* +@@ -3297,7 +3309,7 @@ static int balance_pgdat(pg_data_t *pgda + * able to safely make forward progress. 
Wake them + */ + if (waitqueue_active(&pgdat->pfmemalloc_wait) && +- pfmemalloc_watermark_ok(pgdat)) ++ allow_direct_reclaim(pgdat)) + wake_up_all(&pgdat->pfmemalloc_wait); + + /* Check if kswapd should be suspending */ +@@ -3308,10 +3320,14 @@ static int balance_pgdat(pg_data_t *pgda + * Raise priority if scanning rate is too low or there was no + * progress in reclaiming pages + */ +- if (raise_priority || !sc.nr_reclaimed) ++ nr_reclaimed = sc.nr_reclaimed - nr_reclaimed; ++ if (raise_priority || !nr_reclaimed) + sc.priority--; + } while (sc.priority >= 1); + ++ if (!sc.nr_reclaimed) ++ pgdat->kswapd_failures++; ++ + out: + /* + * Return the order kswapd stopped reclaiming at as +@@ -3511,6 +3527,10 @@ void wakeup_kswapd(struct zone *zone, in + if (!waitqueue_active(&pgdat->kswapd_wait)) + return; + ++ /* Hopeless node, leave it to direct reclaim */ ++ if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES) ++ return; ++ + /* Only wake kswapd if all zones are unbalanced */ + for (z = 0; z <= classzone_idx; z++) { + zone = pgdat->node_zones + z; +@@ -3781,9 +3801,6 @@ int node_reclaim(struct pglist_data *pgd + sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages) + return NODE_RECLAIM_FULL; + +- if (!pgdat_reclaimable(pgdat)) +- return NODE_RECLAIM_FULL; +- + /* + * Do not scan if the allocation should not be delayed. + */ +--- a/mm/vmstat.c ++++ b/mm/vmstat.c +@@ -1421,7 +1421,7 @@ static void zoneinfo_show_print(struct s + "\n node_unreclaimable: %u" + "\n start_pfn: %lu" + "\n node_inactive_ratio: %u", +- !pgdat_reclaimable(zone->zone_pgdat), ++ pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES, + zone->zone_start_pfn, + zone->zone_pgdat->inactive_ratio); + seq_putc(m, '\n'); diff --git a/queue-4.9/series b/queue-4.9/series index 6ba711da5e9..59514c5206d 100644 --- a/queue-4.9/series +++ b/queue-4.9/series @@ -24,3 +24,4 @@ reiserfs-don-t-preallocate-blocks-for-extended-attributes.patch fs-fcntl-f_setown-avoid-undefined-behaviour.patch scsi-libiscsi-fix-shifting-of-did_requeue-host-byte.patch revert-module-add-retpoline-tag-to-vermagic.patch +mm-fix-100-cpu-kswapd-busyloop-on-unreclaimable-nodes.patch
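
For readers who want the mechanism at a glance rather than from the diff hunks: the patch boils down to a per-node counter, pgdat->kswapd_failures, that is incremented whenever a full balance_pgdat() run reclaims nothing, reset in shrink_node() whenever any reclaim (kswapd or direct) makes progress, and checked against MAX_RECLAIM_RETRIES before kswapd keeps running or is woken again. The stand-alone C sketch below illustrates only that counting behaviour; the node_state structure, balance_run(), kswapd_should_run(), the main() driver and the page numbers are invented for illustration and are not kernel code -- only the names kswapd_failures and MAX_RECLAIM_RETRIES mirror the patch.

/*
 * Illustrative user-space sketch of the back-off scheme -- not kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_RECLAIM_RETRIES 16	/* same limit direct reclaim uses before OOM */

struct node_state {
	int kswapd_failures;		/* consecutive runs that reclaimed nothing */
	unsigned long reclaimable;	/* pages a balancing run could still free */
};

/* Analogue of one balance_pgdat() pass: returns pages reclaimed this run. */
static unsigned long balance_run(struct node_state *node)
{
	unsigned long freed = node->reclaimable > 32 ? 32 : node->reclaimable;

	node->reclaimable -= freed;
	if (freed)
		node->kswapd_failures = 0;	/* any progress revives the node */
	else
		node->kswapd_failures++;	/* another fruitless pass */
	return freed;
}

/*
 * Analogue of the prepare_kswapd_sleep()/wakeup_kswapd() checks: once the
 * node is hopeless, kswapd stays asleep and direct reclaim takes over.
 */
static bool kswapd_should_run(const struct node_state *node)
{
	return node->kswapd_failures < MAX_RECLAIM_RETRIES;
}

int main(void)
{
	struct node_state node = { .kswapd_failures = 0, .reclaimable = 100 };
	int run = 0;

	while (kswapd_should_run(&node)) {
		unsigned long freed = balance_run(&node);

		printf("run %2d: freed %3lu pages, failures %2d\n",
		       ++run, freed, node.kswapd_failures);
	}
	printf("node declared unreclaimable after %d empty runs; kswapd sleeps\n",
	       node.kswapd_failures);
	return 0;
}

The same kswapd_failures >= MAX_RECLAIM_RETRIES condition is what /proc/zoneinfo reports as node_unreclaimable after this patch (see the mm/vmstat.c hunk above), so a node that kswapd has given up on is visible from user space.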