releases/4.19.35/sched-fair-do-not-re-read-h_load_next-during-hierarchical-load-calculation.patch
From 0e9f02450da07fc7b1346c8c32c771555173e397 Mon Sep 17 00:00:00 2001
From: Mel Gorman <mgorman@techsingularity.net>
Date: Tue, 19 Mar 2019 12:36:10 +0000
Subject: sched/fair: Do not re-read ->h_load_next during hierarchical load calculation

From: Mel Gorman <mgorman@techsingularity.net>

commit 0e9f02450da07fc7b1346c8c32c771555173e397 upstream.

A NULL pointer dereference bug was reported on a distribution kernel but
the same issue should be present in the mainline kernel. It occurred on
s390 but should not be arch-specific. A partial oops looks like:

  Unable to handle kernel pointer dereference in virtual kernel address space
  ...
  Call Trace:
  ...
  try_to_wake_up+0xfc/0x450
  vhost_poll_wakeup+0x3a/0x50 [vhost]
  __wake_up_common+0xbc/0x178
  __wake_up_common_lock+0x9e/0x160
  __wake_up_sync_key+0x4e/0x60
  sock_def_readable+0x5e/0x98

The bug hits at any time between 1 hour and 3 days. The dereference
occurs in update_cfs_rq_h_load when accumulating h_load. The problem is
that cfs_rq->h_load_next is not protected by any locking and can be
updated by parallel calls to task_h_load. Depending on the compiler,
code may be generated that re-reads cfs_rq->h_load_next after the check
for NULL and then oopses when reading se->avg.load_avg. The disassembly
showed that it was possible to re-read h_load_next after the check for
NULL.

While this does not appear to be an issue with later compilers,
generating the correct code is still accidental rather than guaranteed.
Full locking in this path would have high overhead, so this patch uses
READ_ONCE to read h_load_next only once and checks it for NULL before
dereferencing. It was confirmed that there were no further oopses after
10 days of testing.

As Peter pointed out, it is also necessary to use WRITE_ONCE() to avoid
any potential problems with store tearing.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Fixes: 685207963be9 ("sched: Move h_load calculation to task_h_load()")
Link: https://lkml.kernel.org/r/20190319123610.nsivgf3mjbjjesxb@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 kernel/sched/fair.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7437,10 +7437,10 @@ static void update_cfs_rq_h_load(struct
 	if (cfs_rq->last_h_load_update == now)
 		return;
 
-	cfs_rq->h_load_next = NULL;
+	WRITE_ONCE(cfs_rq->h_load_next, NULL);
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_load_next = se;
+		WRITE_ONCE(cfs_rq->h_load_next, se);
 		if (cfs_rq->last_h_load_update == now)
 			break;
 	}
@@ -7450,7 +7450,7 @@ static void update_cfs_rq_h_load(struct
 		cfs_rq->last_h_load_update = now;
 	}
 
-	while ((se = cfs_rq->h_load_next) != NULL) {
+	while ((se = READ_ONCE(cfs_rq->h_load_next)) != NULL) {
 		load = cfs_rq->h_load;
 		load = div64_ul(load * se->avg.load_avg,
 				cfs_rq_load_avg(cfs_rq) + 1);