From: Greg Kroah-Hartman
Date: Thu, 25 Jan 2018 06:49:21 +0000 (+0100)
Subject: 4.9-stable patches
X-Git-Tag: v4.4.114~32
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=44e352713216d5c3a1b837a15754eebdb66051bf;p=thirdparty%2Fkernel%2Fstable-queue.git

4.9-stable patches

added patches:
mm-fix-100-cpu-kswapd-busyloop-on-unreclaimable-nodes.patch
---

diff --git a/queue-4.9/mm-fix-100-cpu-kswapd-busyloop-on-unreclaimable-nodes.patch b/queue-4.9/mm-fix-100-cpu-kswapd-busyloop-on-unreclaimable-nodes.patch
new file mode 100644
index 00000000000..04d83bb1dde
--- /dev/null
+++ b/queue-4.9/mm-fix-100-cpu-kswapd-busyloop-on-unreclaimable-nodes.patch
@@ -0,0 +1,314 @@
+From c73322d098e4b6f5f0f0fa1330bf57e218775539 Mon Sep 17 00:00:00 2001
+From: Johannes Weiner
+Date: Wed, 3 May 2017 14:51:51 -0700
+Subject: mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
+
+From: Johannes Weiner
+
+commit c73322d098e4b6f5f0f0fa1330bf57e218775539 upstream.
+
+Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
+cleanups".
+
+Jia reported a scenario in which the kswapd of a node indefinitely spins
+at 100% CPU usage. We have seen similar cases at Facebook.
+
+The kernel's current method of judging its ability to reclaim a node (or
+whether to back off and sleep) is based on the amount of scanned pages
+in proportion to the amount of reclaimable pages. In Jia's and our
+scenarios, there are no reclaimable pages in the node, however, and the
+condition for backing off is never met. Kswapd busyloops in an attempt
+to restore the watermarks while having nothing to work with.
+
+This series reworks the definition of an unreclaimable node based not on
+scanning but on whether kswapd is able to actually reclaim pages in
+MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria
+the page allocator uses for giving up on direct reclaim and invoking the
+OOM killer. If it cannot free any pages, kswapd will go to sleep and
+leave further attempts to direct reclaim invocations, which will either
+make progress and re-enable kswapd, or invoke the OOM killer.
+
+Patch #1 fixes the immediate problem Jia reported, the remainder are
+smaller fixlets, cleanups, and overall phasing out of the old method.
+
+Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
+and directly related to #5, but in itself not relevant to the series.
+
+If the whole series is too ambitious for 4.11, I would consider the
+first three patches fixes, the rest cleanups.
+
+This patch (of 9):
+
+Jia He reports a problem with kswapd spinning at 100% CPU when
+requesting more hugepages than memory available in the system:
+
+$ echo 4000 >/proc/sys/vm/nr_hugepages
+
+top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
+Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
+%Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
+KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
+KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem
+
+ PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
+ 76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3
+
+At that time, there are no reclaimable pages left in the node, but as
+kswapd fails to restore the high watermarks it refuses to go to sleep.
+
+Kswapd needs to back away from nodes that fail to balance. Up until
+commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
+nodes") kswapd had such a mechanism. It considered zones whose
+theoretically reclaimable pages it had reclaimed six times over as
+unreclaimable and backed away from them. This guard was erroneously
+removed as the patch changed the definition of a balanced node.
+
+However, simply restoring this code wouldn't help in the case reported
+here: there *are* no reclaimable pages that could be scanned until the
+threshold is met. Kswapd would stay awake anyway.
+
+Introduce a new and much simpler way of backing off. If kswapd runs
+through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
+page, make it back off from the node. This is the same number of shots
+direct reclaim takes before declaring OOM. Kswapd will go to sleep on
+that node until a direct reclaimer manages to reclaim some pages, thus
+proving the node reclaimable again.
+
+[hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
+ Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
+[shakeelb@google.com: fix condition for throttle_direct_reclaim]
+ Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
+Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
+Signed-off-by: Johannes Weiner
+Signed-off-by: Shakeel Butt
+Reported-by: Jia He
+Tested-by: Jia He
+Acked-by: Michal Hocko
+Acked-by: Hillf Danton
+Acked-by: Minchan Kim
+Signed-off-by: Andrew Morton
+Signed-off-by: Linus Torvalds
+Cc: Dmitry Shmidt
+Signed-off-by: Greg Kroah-Hartman
+
+---
+ include/linux/mmzone.h | 2 ++
+ mm/internal.h | 6 ++++++
+ mm/page_alloc.c | 9 ++-------
+ mm/vmscan.c | 47 ++++++++++++++++++++++++++++++++---------------
+ mm/vmstat.c | 2 +-
+ 5 files changed, 43 insertions(+), 23 deletions(-)
+
+--- a/include/linux/mmzone.h
++++ b/include/linux/mmzone.h
+@@ -633,6 +633,8 @@ typedef struct pglist_data {
+ int kswapd_order;
+ enum zone_type kswapd_classzone_idx;
+
++ int kswapd_failures; /* Number of 'reclaimed == 0' runs */
++
+ #ifdef CONFIG_COMPACTION
+ int kcompactd_max_order;
+ enum zone_type kcompactd_classzone_idx;
+--- a/mm/internal.h
++++ b/mm/internal.h
+@@ -74,6 +74,12 @@ static inline void set_page_refcounted(s
+ extern unsigned long highest_memmap_pfn;
+
+ /*
++ * Maximum number of reclaim retries without progress before the OOM
++ * killer is consider the only way forward.
++ */
++#define MAX_RECLAIM_RETRIES 16
++
++/*
+ * in mm/vmscan.c:
+ */
+ extern int isolate_lru_page(struct page *page);
+--- a/mm/page_alloc.c
++++ b/mm/page_alloc.c
+@@ -3422,12 +3422,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_ma
+ }
+
+ /*
+- * Maximum number of reclaim retries without any progress before OOM killer
+- * is consider as the only way to move forward.
+- */
+-#define MAX_RECLAIM_RETRIES 16
+-
+-/*
+ * Checks whether it makes sense to retry the reclaim to make a forward progress
+ * for the given allocation request.
+ * The reclaim feedback represented by did_some_progress (any progress during
+@@ -4385,7 +4379,8 @@ void show_free_areas(unsigned int filter
+ K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
+ K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
+ node_page_state(pgdat, NR_PAGES_SCANNED),
+- !pgdat_reclaimable(pgdat) ? "yes" : "no");
++ pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
++ "yes" : "no"); + } + + for_each_populated_zone(zone) { +--- a/mm/vmscan.c ++++ b/mm/vmscan.c +@@ -2606,6 +2606,15 @@ static bool shrink_node(pg_data_t *pgdat + } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, + sc->nr_scanned - nr_scanned, sc)); + ++ /* ++ * Kswapd gives up on balancing particular nodes after too ++ * many failures to reclaim anything from them and goes to ++ * sleep. On reclaim progress, reset the failure counter. A ++ * successful direct reclaim run will revive a dormant kswapd. ++ */ ++ if (reclaimable) ++ pgdat->kswapd_failures = 0; ++ + return reclaimable; + } + +@@ -2680,10 +2689,6 @@ static void shrink_zones(struct zonelist + GFP_KERNEL | __GFP_HARDWALL)) + continue; + +- if (sc->priority != DEF_PRIORITY && +- !pgdat_reclaimable(zone->zone_pgdat)) +- continue; /* Let kswapd poll it */ +- + /* + * If we already have plenty of memory free for + * compaction in this zone, don't free any more. +@@ -2820,7 +2825,7 @@ retry: + return 0; + } + +-static bool pfmemalloc_watermark_ok(pg_data_t *pgdat) ++static bool allow_direct_reclaim(pg_data_t *pgdat) + { + struct zone *zone; + unsigned long pfmemalloc_reserve = 0; +@@ -2828,6 +2833,9 @@ static bool pfmemalloc_watermark_ok(pg_d + int i; + bool wmark_ok; + ++ if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES) ++ return true; ++ + for (i = 0; i <= ZONE_NORMAL; i++) { + zone = &pgdat->node_zones[i]; + if (!managed_zone(zone) || +@@ -2908,7 +2916,7 @@ static bool throttle_direct_reclaim(gfp_ + + /* Throttle based on the first usable node */ + pgdat = zone->zone_pgdat; +- if (pfmemalloc_watermark_ok(pgdat)) ++ if (allow_direct_reclaim(pgdat)) + goto out; + break; + } +@@ -2930,14 +2938,14 @@ static bool throttle_direct_reclaim(gfp_ + */ + if (!(gfp_mask & __GFP_FS)) { + wait_event_interruptible_timeout(pgdat->pfmemalloc_wait, +- pfmemalloc_watermark_ok(pgdat), HZ); ++ allow_direct_reclaim(pgdat), HZ); + + goto check_pending; + } + + /* Throttle until kswapd wakes the process */ + wait_event_killable(zone->zone_pgdat->pfmemalloc_wait, +- pfmemalloc_watermark_ok(pgdat)); ++ allow_direct_reclaim(pgdat)); + + check_pending: + if (fatal_signal_pending(current)) +@@ -3116,7 +3124,7 @@ static bool prepare_kswapd_sleep(pg_data + + /* + * The throttled processes are normally woken up in balance_pgdat() as +- * soon as pfmemalloc_watermark_ok() is true. But there is a potential ++ * soon as allow_direct_reclaim() is true. But there is a potential + * race between when kswapd checks the watermarks and a process gets + * throttled. There is also a potential race if processes get + * throttled, kswapd wakes, a large process exits thereby balancing the +@@ -3130,6 +3138,10 @@ static bool prepare_kswapd_sleep(pg_data + if (waitqueue_active(&pgdat->pfmemalloc_wait)) + wake_up_all(&pgdat->pfmemalloc_wait); + ++ /* Hopeless node, leave it to direct reclaim */ ++ if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES) ++ return true; ++ + for (i = 0; i <= classzone_idx; i++) { + struct zone *zone = pgdat->node_zones + i; + +@@ -3216,9 +3228,9 @@ static int balance_pgdat(pg_data_t *pgda + count_vm_event(PAGEOUTRUN); + + do { ++ unsigned long nr_reclaimed = sc.nr_reclaimed; + bool raise_priority = true; + +- sc.nr_reclaimed = 0; + sc.reclaim_idx = classzone_idx; + + /* +@@ -3297,7 +3309,7 @@ static int balance_pgdat(pg_data_t *pgda + * able to safely make forward progress. 
Wake them + */ + if (waitqueue_active(&pgdat->pfmemalloc_wait) && +- pfmemalloc_watermark_ok(pgdat)) ++ allow_direct_reclaim(pgdat)) + wake_up_all(&pgdat->pfmemalloc_wait); + + /* Check if kswapd should be suspending */ +@@ -3308,10 +3320,14 @@ static int balance_pgdat(pg_data_t *pgda + * Raise priority if scanning rate is too low or there was no + * progress in reclaiming pages + */ +- if (raise_priority || !sc.nr_reclaimed) ++ nr_reclaimed = sc.nr_reclaimed - nr_reclaimed; ++ if (raise_priority || !nr_reclaimed) + sc.priority--; + } while (sc.priority >= 1); + ++ if (!sc.nr_reclaimed) ++ pgdat->kswapd_failures++; ++ + out: + /* + * Return the order kswapd stopped reclaiming at as +@@ -3511,6 +3527,10 @@ void wakeup_kswapd(struct zone *zone, in + if (!waitqueue_active(&pgdat->kswapd_wait)) + return; + ++ /* Hopeless node, leave it to direct reclaim */ ++ if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES) ++ return; ++ + /* Only wake kswapd if all zones are unbalanced */ + for (z = 0; z <= classzone_idx; z++) { + zone = pgdat->node_zones + z; +@@ -3781,9 +3801,6 @@ int node_reclaim(struct pglist_data *pgd + sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages) + return NODE_RECLAIM_FULL; + +- if (!pgdat_reclaimable(pgdat)) +- return NODE_RECLAIM_FULL; +- + /* + * Do not scan if the allocation should not be delayed. + */ +--- a/mm/vmstat.c ++++ b/mm/vmstat.c +@@ -1421,7 +1421,7 @@ static void zoneinfo_show_print(struct s + "\n node_unreclaimable: %u" + "\n start_pfn: %lu" + "\n node_inactive_ratio: %u", +- !pgdat_reclaimable(zone->zone_pgdat), ++ pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES, + zone->zone_start_pfn, + zone->zone_pgdat->inactive_ratio); + seq_putc(m, '\n'); diff --git a/queue-4.9/series b/queue-4.9/series index 6ba711da5e9..59514c5206d 100644 --- a/queue-4.9/series +++ b/queue-4.9/series @@ -24,3 +24,4 @@ reiserfs-don-t-preallocate-blocks-for-extended-attributes.patch fs-fcntl-f_setown-avoid-undefined-behaviour.patch scsi-libiscsi-fix-shifting-of-did_requeue-host-byte.patch revert-module-add-retpoline-tag-to-vermagic.patch +mm-fix-100-cpu-kswapd-busyloop-on-unreclaimable-nodes.patch
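
For readers who want the mechanism at a glance rather than from the diff hunks: the patch boils down to a per-node counter, pgdat->kswapd_failures, that is incremented whenever a full balance_pgdat() run reclaims nothing, reset in shrink_node() whenever any reclaim (kswapd or direct) makes progress, and checked against MAX_RECLAIM_RETRIES before kswapd keeps running or is woken again. The stand-alone C sketch below illustrates only that counting behaviour; the node_state structure, balance_run(), kswapd_should_run(), the main() driver and the page numbers are invented for illustration and are not kernel code -- only the names kswapd_failures and MAX_RECLAIM_RETRIES mirror the patch.

/*
 * Illustrative user-space sketch of the back-off scheme -- not kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_RECLAIM_RETRIES 16	/* same limit direct reclaim uses before OOM */

struct node_state {
	int kswapd_failures;		/* consecutive runs that reclaimed nothing */
	unsigned long reclaimable;	/* pages a balancing run could still free */
};

/* Analogue of one balance_pgdat() pass: returns pages reclaimed this run. */
static unsigned long balance_run(struct node_state *node)
{
	unsigned long freed = node->reclaimable > 32 ? 32 : node->reclaimable;

	node->reclaimable -= freed;
	if (freed)
		node->kswapd_failures = 0;	/* any progress revives the node */
	else
		node->kswapd_failures++;	/* another fruitless pass */
	return freed;
}

/*
 * Analogue of the prepare_kswapd_sleep()/wakeup_kswapd() checks: once the
 * node is hopeless, kswapd stays asleep and direct reclaim takes over.
 */
static bool kswapd_should_run(const struct node_state *node)
{
	return node->kswapd_failures < MAX_RECLAIM_RETRIES;
}

int main(void)
{
	struct node_state node = { .kswapd_failures = 0, .reclaimable = 100 };
	int run = 0;

	while (kswapd_should_run(&node)) {
		unsigned long freed = balance_run(&node);

		printf("run %2d: freed %3lu pages, failures %2d\n",
		       ++run, freed, node.kswapd_failures);
	}
	printf("node declared unreclaimable after %d empty runs; kswapd sleeps\n",
	       node.kswapd_failures);
	return 0;
}

The same kswapd_failures >= MAX_RECLAIM_RETRIES condition is what /proc/zoneinfo reports as node_unreclaimable after this patch (see the mm/vmstat.c hunk above), so a node that kswapd has given up on is visible from user space.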