git.ipfire.org Git - thirdparty/kernel/stable-queue.git/commitdiff
4.19-stable patches
author Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mon, 4 Feb 2019 06:04:56 +0000 (07:04 +0100)
committer Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mon, 4 Feb 2019 06:04:56 +0000 (07:04 +0100)
added patches:
btrfs-clean-up-pending-block-groups-when-transaction-commit-aborts.patch
btrfs-fix-deadlock-when-allocating-tree-block-during-leaf-node-split.patch
btrfs-on-error-always-free-subvol_name-in-btrfs_mount.patch
kernel-exit.c-release-ptraced-tasks-before-zap_pid_ns_processes.patch
mm-hugetlb.c-teach-follow_hugetlb_page-to-handle-foll_nowait.patch
mm-hwpoison-use-do_send_sig_info-instead-of-force_sig.patch
mm-memory_hotplug-fix-scan_movable_pages-for-gigantic-hugepages.patch
mm-migrate-don-t-rely-on-__pagemovable-of-newpage-after-unlocking-it.patch
mm-oom-fix-use-after-free-in-oom_kill_process.patch
oom-oom_reaper-do-not-enqueue-same-task-twice.patch

queue-4.19/btrfs-clean-up-pending-block-groups-when-transaction-commit-aborts.patch [new file with mode: 0644]
queue-4.19/btrfs-fix-deadlock-when-allocating-tree-block-during-leaf-node-split.patch [new file with mode: 0644]
queue-4.19/btrfs-on-error-always-free-subvol_name-in-btrfs_mount.patch [new file with mode: 0644]
queue-4.19/kernel-exit.c-release-ptraced-tasks-before-zap_pid_ns_processes.patch [new file with mode: 0644]
queue-4.19/mm-hugetlb.c-teach-follow_hugetlb_page-to-handle-foll_nowait.patch [new file with mode: 0644]
queue-4.19/mm-hwpoison-use-do_send_sig_info-instead-of-force_sig.patch [new file with mode: 0644]
queue-4.19/mm-memory_hotplug-fix-scan_movable_pages-for-gigantic-hugepages.patch [new file with mode: 0644]
queue-4.19/mm-migrate-don-t-rely-on-__pagemovable-of-newpage-after-unlocking-it.patch [new file with mode: 0644]
queue-4.19/mm-oom-fix-use-after-free-in-oom_kill_process.patch [new file with mode: 0644]
queue-4.19/oom-oom_reaper-do-not-enqueue-same-task-twice.patch [new file with mode: 0644]
queue-4.19/series

diff --git a/queue-4.19/btrfs-clean-up-pending-block-groups-when-transaction-commit-aborts.patch b/queue-4.19/btrfs-clean-up-pending-block-groups-when-transaction-commit-aborts.patch
new file mode 100644 (file)
index 0000000..7fab4c3
--- /dev/null
@@ -0,0 +1,119 @@
+From c7cc64a98512ffc41df86d14a414eb3b09bf7481 Mon Sep 17 00:00:00 2001
+From: David Sterba <dsterba@suse.com>
+Date: Wed, 23 Jan 2019 17:09:16 +0100
+Subject: btrfs: clean up pending block groups when transaction commit aborts
+
+From: David Sterba <dsterba@suse.com>
+
+commit c7cc64a98512ffc41df86d14a414eb3b09bf7481 upstream.
+
+The fstests generic/475 stresses transaction aborts and can reveal
+space accounting or use-after-free bugs regarding block groups.
+
+In this case, pending block groups remain linked to the transaction's
+structures after the commit aborts in the middle.
+
+The corrupted slabs lead to failures in subsequent tests, e.g. generic/476:
+
+  [ 8172.752887] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
+  [ 8172.755799] #PF error: [normal kernel read fault]
+  [ 8172.757571] PGD 661ae067 P4D 661ae067 PUD 3db8e067 PMD 0
+  [ 8172.759000] Oops: 0000 [#1] PREEMPT SMP
+  [ 8172.760209] CPU: 0 PID: 39 Comm: kswapd0 Tainted: G        W         5.0.0-rc2-default #408
+  [ 8172.762495] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
+  [ 8172.765772] RIP: 0010:shrink_page_list+0x2f9/0xe90
+  [ 8172.770453] RSP: 0018:ffff967f00663b18 EFLAGS: 00010287
+  [ 8172.771184] RAX: 0000000000000000 RBX: ffff967f00663c20 RCX: 0000000000000000
+  [ 8172.772850] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8c0620ab20e0
+  [ 8172.774629] RBP: ffff967f00663dd8 R08: 0000000000000000 R09: 0000000000000000
+  [ 8172.776094] R10: ffff8c0620ab22f8 R11: ffff8c063f772688 R12: ffff967f00663b78
+  [ 8172.777533] R13: ffff8c063f625600 R14: ffff8c063f625608 R15: dead000000000200
+  [ 8172.778886] FS:  0000000000000000(0000) GS:ffff8c063d400000(0000) knlGS:0000000000000000
+  [ 8172.780545] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
+  [ 8172.781787] CR2: 0000000000000058 CR3: 000000004e962000 CR4: 00000000000006f0
+  [ 8172.783547] Call Trace:
+  [ 8172.784112]  shrink_inactive_list+0x194/0x410
+  [ 8172.784747]  shrink_node_memcg.constprop.85+0x3a5/0x6a0
+  [ 8172.785472]  shrink_node+0x62/0x1e0
+  [ 8172.786011]  balance_pgdat+0x216/0x460
+  [ 8172.786577]  kswapd+0xe3/0x4a0
+  [ 8172.787085]  ? finish_wait+0x80/0x80
+  [ 8172.787795]  ? balance_pgdat+0x460/0x460
+  [ 8172.788799]  kthread+0x116/0x130
+  [ 8172.789640]  ? kthread_create_on_node+0x60/0x60
+  [ 8172.790323]  ret_from_fork+0x24/0x30
+  [ 8172.794253] CR2: 0000000000000058
+
+or accounting errors at umount time:
+
+  [ 8159.537251] WARNING: CPU: 2 PID: 19031 at fs/btrfs/extent-tree.c:5987 btrfs_free_block_groups+0x3d5/0x410 [btrfs]
+  [ 8159.543325] CPU: 2 PID: 19031 Comm: umount Tainted: G        W         5.0.0-rc2-default #408
+  [ 8159.545472] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
+  [ 8159.548155] RIP: 0010:btrfs_free_block_groups+0x3d5/0x410 [btrfs]
+  [ 8159.554030] RSP: 0018:ffff967f079cbde8 EFLAGS: 00010206
+  [ 8159.555144] RAX: 0000000001000000 RBX: ffff8c06366cf800 RCX: 0000000000000000
+  [ 8159.556730] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff8c06255ad800
+  [ 8159.558279] RBP: ffff8c0637ac0000 R08: 0000000000000001 R09: 0000000000000000
+  [ 8159.559797] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8c0637ac0108
+  [ 8159.561296] R13: ffff8c0637ac0158 R14: 0000000000000000 R15: dead000000000100
+  [ 8159.562852] FS:  00007f7f693b9fc0(0000) GS:ffff8c063d800000(0000) knlGS:0000000000000000
+  [ 8159.564839] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
+  [ 8159.566160] CR2: 00007f7f68fab7b0 CR3: 000000000aec7000 CR4: 00000000000006e0
+  [ 8159.567898] Call Trace:
+  [ 8159.568597]  close_ctree+0x17f/0x350 [btrfs]
+  [ 8159.569628]  generic_shutdown_super+0x64/0x100
+  [ 8159.570808]  kill_anon_super+0x14/0x30
+  [ 8159.571857]  btrfs_kill_super+0x12/0xa0 [btrfs]
+  [ 8159.573063]  deactivate_locked_super+0x29/0x60
+  [ 8159.574234]  cleanup_mnt+0x3b/0x70
+  [ 8159.575176]  task_work_run+0x98/0xc0
+  [ 8159.576177]  exit_to_usermode_loop+0x83/0x90
+  [ 8159.577315]  do_syscall_64+0x15b/0x180
+  [ 8159.578339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
+
+This fix is based on two of Josef's patches that used side effects of
+btrfs_create_pending_block_groups(); this fix introduces a helper that
+does what we need.
+
+CC: stable@vger.kernel.org # 4.4+
+CC: Josef Bacik <josef@toxicpanda.com>
+Reviewed-by: Nikolay Borisov <nborisov@suse.com>
+Signed-off-by: David Sterba <dsterba@suse.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/btrfs/transaction.c |   16 ++++++++++++++++
+ 1 file changed, 16 insertions(+)
+
+--- a/fs/btrfs/transaction.c
++++ b/fs/btrfs/transaction.c
+@@ -1886,6 +1886,21 @@ static void cleanup_transaction(struct b
+       kmem_cache_free(btrfs_trans_handle_cachep, trans);
+ }
++/*
++ * Release reserved delayed ref space of all pending block groups of the
++ * transaction and remove them from the list
++ */
++static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans)
++{
++       struct btrfs_fs_info *fs_info = trans->fs_info;
++       struct btrfs_block_group_cache *block_group, *tmp;
++
++       list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) {
++               btrfs_delayed_refs_rsv_release(fs_info, 1);
++               list_del_init(&block_group->bg_list);
++       }
++}
++
+ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
+ {
+       /*
+@@ -2286,6 +2301,7 @@ scrub_continue:
+       btrfs_scrub_continue(fs_info);
+ cleanup_transaction:
+       btrfs_trans_release_metadata(trans);
++      btrfs_cleanup_pending_block_groups(trans);
+       btrfs_trans_release_chunk_metadata(trans);
+       trans->block_rsv = NULL;
+       btrfs_warn(fs_info, "Skipping commit of aborted transaction.");
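
For reference, the cleanup helper added above relies on the usual iterate-while-unlinking
shape of list_for_each_entry_safe(). Below is a minimal userspace sketch of that pattern
(illustrative only; the list type and the names are invented here, not the kernel's list.h API):

#include <stdio.h>
#include <stdlib.h>

struct pending_bg {
        int id;                         /* stand-in for block group state */
        struct pending_bg *next;
};

static void cleanup_pending(struct pending_bg **head)
{
        struct pending_bg *bg = *head, *tmp;

        while (bg) {
                tmp = bg->next;         /* grab the successor before freeing */
                printf("releasing pending block group %d\n", bg->id);
                free(bg);               /* analogous to dropping the reserve */
                bg = tmp;
        }
        *head = NULL;                   /* list is now empty */
}

int main(void)
{
        struct pending_bg *head = NULL;

        for (int i = 0; i < 3; i++) {
                struct pending_bg *bg = malloc(sizeof(*bg));
                bg->id = i;
                bg->next = head;
                head = bg;
        }
        cleanup_pending(&head);
        return 0;
}
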
diff --git a/queue-4.19/btrfs-fix-deadlock-when-allocating-tree-block-during-leaf-node-split.patch b/queue-4.19/btrfs-fix-deadlock-when-allocating-tree-block-during-leaf-node-split.patch
new file mode 100644 (file)
index 0000000..f580435
--- /dev/null
@@ -0,0 +1,197 @@
+From a6279470762c19ba97e454f90798373dccdf6148 Mon Sep 17 00:00:00 2001
+From: Filipe Manana <fdmanana@suse.com>
+Date: Fri, 25 Jan 2019 11:48:51 +0000
+Subject: Btrfs: fix deadlock when allocating tree block during leaf/node split
+
+From: Filipe Manana <fdmanana@suse.com>
+
+commit a6279470762c19ba97e454f90798373dccdf6148 upstream.
+
+When splitting a leaf or node from one of the trees that are modified when
+flushing pending block groups (extent, chunk, device and free space trees),
+we need to allocate a new tree block, which in turn can result in the need
+to allocate a new block group. After allocating the new block group we may
+need to flush new block groups that were previously allocated during the
+course of the current transaction, which is what may cause a deadlock due
+to attempts to write lock the same leaf or node twice, as when splitting
+a leaf or node we are holding a write lock on it and on its parent node.
+
+The same type of deadlock can also happen when increasing the tree's
+height, since we are holding a lock on the existing root while allocating
+the tree block to use as the new root node.
+
+An example trace when the deadlock happens during the leaf split path is:
+
+  [27175.293054] CPU: 0 PID: 3005 Comm: kworker/u17:6 Tainted: G        W         4.19.16 #1
+  [27175.293942] Hardware name: Penguin Computing Relion 1900/MD90-FS0-ZB-XX, BIOS R15 06/25/2018
+  [27175.294846] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
+  (...)
+  [27175.298384] RSP: 0018:ffffab2087107758 EFLAGS: 00010246
+  [27175.299269] RAX: 0000000000000bbd RBX: ffff9fadc7141c48 RCX: 0000000000000001
+  [27175.300155] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff9fadc7141c48
+  [27175.301023] RBP: 0000000000000001 R08: ffff9faeb6ac1040 R09: ffff9fa9c0000000
+  [27175.301887] R10: 0000000000000000 R11: 0000000000000040 R12: ffff9fb21aac8000
+  [27175.302743] R13: ffff9fb1a64d6a20 R14: 0000000000000001 R15: ffff9fb1a64d6a18
+  [27175.303601] FS:  0000000000000000(0000) GS:ffff9fb21fa00000(0000) knlGS:0000000000000000
+  [27175.304468] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
+  [27175.305339] CR2: 00007fdc8743ead8 CR3: 0000000763e0a006 CR4: 00000000003606f0
+  [27175.306220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
+  [27175.307087] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
+  [27175.307940] Call Trace:
+  [27175.308802]  btrfs_search_slot+0x779/0x9a0 [btrfs]
+  [27175.309669]  ? update_space_info+0xba/0xe0 [btrfs]
+  [27175.310534]  btrfs_insert_empty_items+0x67/0xc0 [btrfs]
+  [27175.311397]  btrfs_insert_item+0x60/0xd0 [btrfs]
+  [27175.312253]  btrfs_create_pending_block_groups+0xee/0x210 [btrfs]
+  [27175.313116]  do_chunk_alloc+0x25f/0x300 [btrfs]
+  [27175.313984]  find_free_extent+0x706/0x10d0 [btrfs]
+  [27175.314855]  btrfs_reserve_extent+0x9b/0x1d0 [btrfs]
+  [27175.315707]  btrfs_alloc_tree_block+0x100/0x5b0 [btrfs]
+  [27175.316548]  split_leaf+0x130/0x610 [btrfs]
+  [27175.317390]  btrfs_search_slot+0x94d/0x9a0 [btrfs]
+  [27175.318235]  btrfs_insert_empty_items+0x67/0xc0 [btrfs]
+  [27175.319087]  alloc_reserved_file_extent+0x84/0x2c0 [btrfs]
+  [27175.319938]  __btrfs_run_delayed_refs+0x596/0x1150 [btrfs]
+  [27175.320792]  btrfs_run_delayed_refs+0xed/0x1b0 [btrfs]
+  [27175.321643]  delayed_ref_async_start+0x81/0x90 [btrfs]
+  [27175.322491]  normal_work_helper+0xd0/0x320 [btrfs]
+  [27175.323328]  ? move_linked_works+0x6e/0xa0
+  [27175.324160]  process_one_work+0x191/0x370
+  [27175.324976]  worker_thread+0x4f/0x3b0
+  [27175.325763]  kthread+0xf8/0x130
+  [27175.326531]  ? rescuer_thread+0x320/0x320
+  [27175.327284]  ? kthread_create_worker_on_cpu+0x50/0x50
+  [27175.328027]  ret_from_fork+0x35/0x40
+  [27175.328741] ---[ end trace 300a1b9f0ac30e26 ]---
+
+Fix this by preventing the flushing of new block groups when splitting a
+leaf/node and when inserting a new root node for one of the trees modified
+by the flushing operation, similar to what is done when COWing a node/leaf
+from one of these trees.
+
+Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202383
+Reported-by: Eli V <eliventer@gmail.com>
+CC: stable@vger.kernel.org # 4.4+
+Signed-off-by: Filipe Manana <fdmanana@suse.com>
+Signed-off-by: David Sterba <dsterba@suse.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/btrfs/ctree.c |   78 +++++++++++++++++++++++++++++++++++--------------------
+ 1 file changed, 50 insertions(+), 28 deletions(-)
+
+--- a/fs/btrfs/ctree.c
++++ b/fs/btrfs/ctree.c
+@@ -1003,6 +1003,48 @@ static noinline int update_ref_for_cow(s
+       return 0;
+ }
++static struct extent_buffer *alloc_tree_block_no_bg_flush(
++                                        struct btrfs_trans_handle *trans,
++                                        struct btrfs_root *root,
++                                        u64 parent_start,
++                                        const struct btrfs_disk_key *disk_key,
++                                        int level,
++                                        u64 hint,
++                                        u64 empty_size)
++{
++      struct btrfs_fs_info *fs_info = root->fs_info;
++      struct extent_buffer *ret;
++
++      /*
++       * If we are COWing a node/leaf from the extent, chunk, device or free
++       * space trees, make sure that we do not finish block group creation of
++       * pending block groups. We do this to avoid a deadlock.
++       * COWing can result in allocation of a new chunk, and flushing pending
++       * block groups (btrfs_create_pending_block_groups()) can be triggered
++       * when finishing allocation of a new chunk. Creation of a pending block
++       * group modifies the extent, chunk, device and free space trees,
++       * therefore we could deadlock with ourselves since we are holding a
++       * lock on an extent buffer that btrfs_create_pending_block_groups() may
++       * try to COW later.
++       * For similar reasons, we also need to delay flushing pending block
++       * groups when splitting a leaf or node, from one of those trees, since
++       * we are holding a write lock on it and its parent or when inserting a
++       * new root node for one of those trees.
++       */
++      if (root == fs_info->extent_root ||
++          root == fs_info->chunk_root ||
++          root == fs_info->dev_root ||
++          root == fs_info->free_space_root)
++              trans->can_flush_pending_bgs = false;
++
++      ret = btrfs_alloc_tree_block(trans, root, parent_start,
++                                   root->root_key.objectid, disk_key, level,
++                                   hint, empty_size);
++      trans->can_flush_pending_bgs = true;
++
++      return ret;
++}
++
+ /*
+  * does the dirty work in cow of a single block.  The parent block (if
+  * supplied) is updated to point to the new cow copy.  The new buffer is marked
+@@ -1050,28 +1092,8 @@ static noinline int __btrfs_cow_block(st
+       if ((root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) && parent)
+               parent_start = parent->start;
+-      /*
+-       * If we are COWing a node/leaf from the extent, chunk, device or free
+-       * space trees, make sure that we do not finish block group creation of
+-       * pending block groups. We do this to avoid a deadlock.
+-       * COWing can result in allocation of a new chunk, and flushing pending
+-       * block groups (btrfs_create_pending_block_groups()) can be triggered
+-       * when finishing allocation of a new chunk. Creation of a pending block
+-       * group modifies the extent, chunk, device and free space trees,
+-       * therefore we could deadlock with ourselves since we are holding a
+-       * lock on an extent buffer that btrfs_create_pending_block_groups() may
+-       * try to COW later.
+-       */
+-      if (root == fs_info->extent_root ||
+-          root == fs_info->chunk_root ||
+-          root == fs_info->dev_root ||
+-          root == fs_info->free_space_root)
+-              trans->can_flush_pending_bgs = false;
+-
+-      cow = btrfs_alloc_tree_block(trans, root, parent_start,
+-                      root->root_key.objectid, &disk_key, level,
+-                      search_start, empty_size);
+-      trans->can_flush_pending_bgs = true;
++      cow = alloc_tree_block_no_bg_flush(trans, root, parent_start, &disk_key,
++                                         level, search_start, empty_size);
+       if (IS_ERR(cow))
+               return PTR_ERR(cow);
+@@ -3383,8 +3405,8 @@ static noinline int insert_new_root(stru
+       else
+               btrfs_node_key(lower, &lower_key, 0);
+-      c = btrfs_alloc_tree_block(trans, root, 0, root->root_key.objectid,
+-                                 &lower_key, level, root->node->start, 0);
++      c = alloc_tree_block_no_bg_flush(trans, root, 0, &lower_key, level,
++                                       root->node->start, 0);
+       if (IS_ERR(c))
+               return PTR_ERR(c);
+@@ -3513,8 +3535,8 @@ static noinline int split_node(struct bt
+       mid = (c_nritems + 1) / 2;
+       btrfs_node_key(c, &disk_key, mid);
+-      split = btrfs_alloc_tree_block(trans, root, 0, root->root_key.objectid,
+-                      &disk_key, level, c->start, 0);
++      split = alloc_tree_block_no_bg_flush(trans, root, 0, &disk_key, level,
++                                           c->start, 0);
+       if (IS_ERR(split))
+               return PTR_ERR(split);
+@@ -4298,8 +4320,8 @@ again:
+       else
+               btrfs_item_key(l, &disk_key, mid);
+-      right = btrfs_alloc_tree_block(trans, root, 0, root->root_key.objectid,
+-                      &disk_key, 0, l->start, 0);
++      right = alloc_tree_block_no_bg_flush(trans, root, 0, &disk_key, 0,
++                                           l->start, 0);
+       if (IS_ERR(right))
+               return PTR_ERR(right);
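
The helper added above works by temporarily clearing trans->can_flush_pending_bgs around
a single allocation, so the allocator cannot recurse into the block-group flushing path
while tree locks are held. A minimal sketch of that suppress-and-restore shape follows
(the struct and function names below are invented for illustration, not the btrfs code):

#include <stdbool.h>
#include <stdio.h>

struct txn {
        bool can_flush_pending;         /* may the allocator flush side work? */
};

static int alloc_block(struct txn *t)
{
        if (t->can_flush_pending)
                printf("allocator also flushes pending work\n");
        else
                printf("allocator defers pending work\n");
        return 0;
}

static int alloc_block_no_flush(struct txn *t)
{
        int ret;

        t->can_flush_pending = false;   /* suppress the re-entrant path */
        ret = alloc_block(t);
        t->can_flush_pending = true;    /* restore the normal behaviour */
        return ret;
}

int main(void)
{
        struct txn t = { .can_flush_pending = true };

        alloc_block(&t);                /* normal call site */
        alloc_block_no_flush(&t);       /* call site that holds tree locks */
        return 0;
}
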
diff --git a/queue-4.19/btrfs-on-error-always-free-subvol_name-in-btrfs_mount.patch b/queue-4.19/btrfs-on-error-always-free-subvol_name-in-btrfs_mount.patch
new file mode 100644 (file)
index 0000000..e652e37
--- /dev/null
@@ -0,0 +1,51 @@
+From 532b618bdf237250d6d4566536d4b6ce3d0a31fe Mon Sep 17 00:00:00 2001
+From: "Eric W. Biederman" <ebiederm@xmission.com>
+Date: Wed, 30 Jan 2019 07:54:12 -0600
+Subject: btrfs: On error always free subvol_name in btrfs_mount
+
+From: Eric W. Biederman <ebiederm@xmission.com>
+
+commit 532b618bdf237250d6d4566536d4b6ce3d0a31fe upstream.
+
+The subvol_name is allocated in btrfs_parse_subvol_options and is
+consumed and freed in mount_subvol.  Add a free to the error paths that
+don't call mount_subvol so that it is guaranteed that subvol_name is
+freed when an error happens.
+
+Fixes: 312c89fbca06 ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()")
+Cc: stable@vger.kernel.org # v4.19+
+Reviewed-by: Nikolay Borisov <nborisov@suse.com>
+Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
+Reviewed-by: David Sterba <dsterba@suse.com>
+Signed-off-by: David Sterba <dsterba@suse.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/btrfs/super.c |    3 +++
+ 1 file changed, 3 insertions(+)
+
+--- a/fs/btrfs/super.c
++++ b/fs/btrfs/super.c
+@@ -1677,6 +1677,7 @@ static struct dentry *btrfs_mount(struct
+                               flags | SB_RDONLY, device_name, data);
+                       if (IS_ERR(mnt_root)) {
+                               root = ERR_CAST(mnt_root);
++                              kfree(subvol_name);
+                               goto out;
+                       }
+@@ -1686,12 +1687,14 @@ static struct dentry *btrfs_mount(struct
+                       if (error < 0) {
+                               root = ERR_PTR(error);
+                               mntput(mnt_root);
++                              kfree(subvol_name);
+                               goto out;
+                       }
+               }
+       }
+       if (IS_ERR(mnt_root)) {
+               root = ERR_CAST(mnt_root);
++              kfree(subvol_name);
+               goto out;
+       }
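
The rule the patch above enforces is simple: subvol_name is either handed to mount_subvol(),
which takes ownership and frees it, or it must be freed on every path that bails out first.
A small userspace sketch of that ownership rule (the function names below are placeholders,
not the btrfs code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int consume(char *name)          /* takes ownership and frees name */
{
        printf("mounted subvolume %s\n", name);
        free(name);
        return 0;
}

static int do_mount(int fail_early)
{
        char *subvol_name = malloc(16);

        if (!subvol_name)
                return -1;
        strcpy(subvol_name, "subvol");

        if (fail_early) {
                free(subvol_name);      /* error path: nobody else will free it */
                return -1;
        }

        return consume(subvol_name);    /* success path: ownership handed off */
}

int main(void)
{
        do_mount(1);                    /* error path */
        do_mount(0);                    /* success path */
        return 0;
}
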
diff --git a/queue-4.19/kernel-exit.c-release-ptraced-tasks-before-zap_pid_ns_processes.patch b/queue-4.19/kernel-exit.c-release-ptraced-tasks-before-zap_pid_ns_processes.patch
new file mode 100644 (file)
index 0000000..6cbeb0c
--- /dev/null
@@ -0,0 +1,73 @@
+From 8fb335e078378c8426fabeed1ebee1fbf915690c Mon Sep 17 00:00:00 2001
+From: Andrei Vagin <avagin@gmail.com>
+Date: Fri, 1 Feb 2019 14:20:24 -0800
+Subject: kernel/exit.c: release ptraced tasks before zap_pid_ns_processes
+
+From: Andrei Vagin <avagin@gmail.com>
+
+commit 8fb335e078378c8426fabeed1ebee1fbf915690c upstream.
+
+Currently, exit_ptrace() adds all ptraced tasks to a dead list, then
+zap_pid_ns_processes() waits on all tasks in the current pidns, and only
+then are the tasks from the dead list released.
+
+zap_pid_ns_processes() can get stuck on waiting tasks from the dead
+list.  In this case, we will have one unkillable process with one or
+more dead children.
+
+Thanks to Oleg for the advice to release tasks in find_child_reaper().
+
+Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com
+Fixes: 7c8bd2322c7f ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()")
+Signed-off-by: Andrei Vagin <avagin@gmail.com>
+Signed-off-by: Oleg Nesterov <oleg@redhat.com>
+Cc: "Eric W. Biederman" <ebiederm@xmission.com>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ kernel/exit.c |   12 ++++++++++--
+ 1 file changed, 10 insertions(+), 2 deletions(-)
+
+--- a/kernel/exit.c
++++ b/kernel/exit.c
+@@ -558,12 +558,14 @@ static struct task_struct *find_alive_th
+       return NULL;
+ }
+-static struct task_struct *find_child_reaper(struct task_struct *father)
++static struct task_struct *find_child_reaper(struct task_struct *father,
++                                              struct list_head *dead)
+       __releases(&tasklist_lock)
+       __acquires(&tasklist_lock)
+ {
+       struct pid_namespace *pid_ns = task_active_pid_ns(father);
+       struct task_struct *reaper = pid_ns->child_reaper;
++      struct task_struct *p, *n;
+       if (likely(reaper != father))
+               return reaper;
+@@ -579,6 +581,12 @@ static struct task_struct *find_child_re
+               panic("Attempted to kill init! exitcode=0x%08x\n",
+                       father->signal->group_exit_code ?: father->exit_code);
+       }
++
++      list_for_each_entry_safe(p, n, dead, ptrace_entry) {
++              list_del_init(&p->ptrace_entry);
++              release_task(p);
++      }
++
+       zap_pid_ns_processes(pid_ns);
+       write_lock_irq(&tasklist_lock);
+@@ -668,7 +676,7 @@ static void forget_original_parent(struc
+               exit_ptrace(father, dead);
+       /* Can drop and reacquire tasklist_lock */
+-      reaper = find_child_reaper(father);
++      reaper = find_child_reaper(father, dead);
+       if (list_empty(&father->children))
+               return;
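
The ordering that matters in the patch above: entries parked on the dead list still pin the
pid namespace's task accounting, so they must be released before the blocking wait, not after.
A toy model of that ordering (all names and counts below are invented for illustration):

#include <stdio.h>

static int nr_tasks = 3;                /* tasks still accounted in the pidns */

static void release_dead_list(int dead)
{
        while (dead-- > 0)
                nr_tasks--;             /* release_task() drops the accounting */
}

static const char *zap_and_wait(void)
{
        /* zap_pid_ns_processes() essentially waits until nr_tasks reaches 0 */
        return nr_tasks == 0 ? "completes" : "would block forever";
}

int main(void)
{
        int dead = 3;                   /* ptraced children parked on the dead list */

        release_dead_list(dead);        /* the fix: release them before waiting */
        printf("zap_pid_ns_processes() %s\n", zap_and_wait());
        return 0;
}
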
diff --git a/queue-4.19/mm-hugetlb.c-teach-follow_hugetlb_page-to-handle-foll_nowait.patch b/queue-4.19/mm-hugetlb.c-teach-follow_hugetlb_page-to-handle-foll_nowait.patch
new file mode 100644 (file)
index 0000000..622418f
--- /dev/null
@@ -0,0 +1,44 @@
+From 1ac25013fb9e4ed595cd608a406191e93520881e Mon Sep 17 00:00:00 2001
+From: Andrea Arcangeli <aarcange@redhat.com>
+Date: Fri, 1 Feb 2019 14:20:16 -0800
+Subject: mm/hugetlb.c: teach follow_hugetlb_page() to handle FOLL_NOWAIT
+
+From: Andrea Arcangeli <aarcange@redhat.com>
+
+commit 1ac25013fb9e4ed595cd608a406191e93520881e upstream.
+
+hugetlb needs the same fix as faultin_nopage (which was applied in
+commit 96312e61282a ("mm/gup.c: teach get_user_pages_unlocked to handle
+FOLL_NOWAIT")) or KVM hangs because it thinks the mmap_sem was already
+released by hugetlb_fault() if it returned VM_FAULT_RETRY, but it wasn't
+in the FOLL_NOWAIT case.
+
+Link: http://lkml.kernel.org/r/20190109020203.26669-2-aarcange@redhat.com
+Fixes: ce53053ce378 ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()")
+Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
+Tested-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
+Reported-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
+Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
+Reviewed-by: Peter Xu <peterx@redhat.com>
+Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ mm/hugetlb.c |    3 ++-
+ 1 file changed, 2 insertions(+), 1 deletion(-)
+
+--- a/mm/hugetlb.c
++++ b/mm/hugetlb.c
+@@ -4269,7 +4269,8 @@ long follow_hugetlb_page(struct mm_struc
+                               break;
+                       }
+                       if (ret & VM_FAULT_RETRY) {
+-                              if (nonblocking)
++                              if (nonblocking &&
++                                  !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
+                                       *nonblocking = 0;
+                               *nr_pages = 0;
+                               /*
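
The one-line fix above encodes a simple rule: only tell the caller "the lock was dropped"
when the fault path really dropped it, which it does not do in the NOWAIT case. A small
sketch of that rule (the flag names below are placeholders, not the real FOLL_/FAULT_FLAG_
constants):

#include <stdio.h>

#define FAULT_RETRY   0x1
#define FLAG_NOWAIT   0x2

static void handle_retry(int ret, int flags, int *locked_hint)
{
        /* Clear the hint only when the lock really was released. */
        if ((ret & FAULT_RETRY) && locked_hint && !(flags & FLAG_NOWAIT))
                *locked_hint = 0;
}

int main(void)
{
        int locked = 1;

        handle_retry(FAULT_RETRY, FLAG_NOWAIT, &locked);
        printf("with NOWAIT: locked=%d (lock still held)\n", locked);

        handle_retry(FAULT_RETRY, 0, &locked);
        printf("without NOWAIT: locked=%d (caller must relock)\n", locked);
        return 0;
}
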
diff --git a/queue-4.19/mm-hwpoison-use-do_send_sig_info-instead-of-force_sig.patch b/queue-4.19/mm-hwpoison-use-do_send_sig_info-instead-of-force_sig.patch
new file mode 100644 (file)
index 0000000..a6fbadb
--- /dev/null
@@ -0,0 +1,58 @@
+From 6376360ecbe525a9c17b3d081dfd88ba3e4ed65b Mon Sep 17 00:00:00 2001
+From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
+Date: Fri, 1 Feb 2019 14:21:08 -0800
+Subject: mm: hwpoison: use do_send_sig_info() instead of force_sig()
+
+From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
+
+commit 6376360ecbe525a9c17b3d081dfd88ba3e4ed65b upstream.
+
+Currently memory_failure() is racy against a process exiting, which
+results in a kernel crash by NULL pointer dereference.
+
+The root cause is that memory_failure() uses force_sig() to forcibly
+kill asynchronous (meaning not in the current context) processes.  As
+discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM
+fixes, this is not the right thing to do.  OOM solves this issue by using
+do_send_sig_info() as done in commit d2d393099de2 ("signal:
+oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this
+patch suggests doing the same for hwpoison.  do_send_sig_info()
+properly takes the siglock via lock_task_sighand(), so it is free from
+the reported race.
+
+I confirmed that the reported bug reproduces with inserting some delay
+in kill_procs(), and it never reproduces with this patch.
+
+Note that memory_failure() can send another type of signal using
+force_sig_mceerr(), and the reported race shouldn't happen on it because
+force_sig_mceerr() is called only for synchronous processes (i.e.
+BUS_MCEERR_AR happens only when some process accesses the corrupted
+memory).
+
+Link: http://lkml.kernel.org/r/20190116093046.GA29835@hori1.linux.bs1.fc.nec.co.jp
+Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
+Reported-by: Jane Chu <jane.chu@oracle.com>
+Reviewed-by: Dan Williams <dan.j.williams@intel.com>
+Reviewed-by: William Kucharski <william.kucharski@oracle.com>
+Cc: Oleg Nesterov <oleg@redhat.com>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ mm/memory-failure.c |    3 ++-
+ 1 file changed, 2 insertions(+), 1 deletion(-)
+
+--- a/mm/memory-failure.c
++++ b/mm/memory-failure.c
+@@ -372,7 +372,8 @@ static void kill_procs(struct list_head
+                       if (fail || tk->addr_valid == 0) {
+                               pr_err("Memory failure: %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
+                                      pfn, tk->tsk->comm, tk->tsk->pid);
+-                              force_sig(SIGKILL, tk->tsk);
++                              do_send_sig_info(SIGKILL, SEND_SIG_PRIV,
++                                               tk->tsk, PIDTYPE_PID);
+                       }
+                       /*
diff --git a/queue-4.19/mm-memory_hotplug-fix-scan_movable_pages-for-gigantic-hugepages.patch b/queue-4.19/mm-memory_hotplug-fix-scan_movable_pages-for-gigantic-hugepages.patch
new file mode 100644 (file)
index 0000000..17f915e
--- /dev/null
@@ -0,0 +1,99 @@
+From eeb0efd071d821a88da3fbd35f2d478f40d3b2ea Mon Sep 17 00:00:00 2001
+From: Oscar Salvador <osalvador@suse.de>
+Date: Fri, 1 Feb 2019 14:20:47 -0800
+Subject: mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages
+
+From: Oscar Salvador <osalvador@suse.de>
+
+commit eeb0efd071d821a88da3fbd35f2d478f40d3b2ea upstream.
+
+This is the same sort of error we saw in commit 17e2e7d7e1b8 ("mm,
+page_alloc: fix has_unmovable_pages for HugePages").
+
+Gigantic hugepages cross several memblocks, so it can be that the page
+we get in scan_movable_pages() is a page-tail belonging to a
+1G-hugepage.  If that happens, page_hstate()->size_to_hstate() will
+return NULL, and we will blow up in hugepage_migration_supported().
+
+The splat is as follows:
+
+  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
+  #PF error: [normal kernel read fault]
+  PGD 0 P4D 0
+  Oops: 0000 [#1] SMP PTI
+  CPU: 1 PID: 1350 Comm: bash Tainted: G            E     5.0.0-rc1-mm1-1-default+ #27
+  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
+  RIP: 0010:__offline_pages+0x6ae/0x900
+  Call Trace:
+   memory_subsys_offline+0x42/0x60
+   device_offline+0x80/0xa0
+   state_store+0xab/0xc0
+   kernfs_fop_write+0x102/0x180
+   __vfs_write+0x26/0x190
+   vfs_write+0xad/0x1b0
+   ksys_write+0x42/0x90
+   do_syscall_64+0x5b/0x180
+   entry_SYSCALL_64_after_hwframe+0x44/0xa9
+  Modules linked in: af_packet(E) xt_tcpudp(E) ipt_REJECT(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv4(E) ip_set(E) nfnetlink(E) ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ebtable_filter(E) ebtables(E) iptable_filter(E) ip_tables(E) x_tables(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) bochs_drm(E) ttm(E) aesni_intel(E) drm_kms_helper(E) aes_x86_64(E) crypto_simd(E) cryptd(E) glue_helper(E) drm(E) virtio_net(E) syscopyarea(E) sysfillrect(E) net_failover(E) sysimgblt(E) pcspkr(E) failover(E) i2c_piix4(E) fb_sys_fops(E) parport_pc(E) parport(E) button(E) btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) sd_mod(E) ata_generic(E) ata_piix(E) ahci(E) libahci(E) libata(E) crc32c_intel(E) serio_raw(E) virtio_pci(E) virtio_ring(E) virtio(E) sg(E) scsi_mod(E) autofs4(E)
+
+[akpm@linux-foundation.org: fix brace layout, per David.  Reduce indentation]
+Link: http://lkml.kernel.org/r/20190122154407.18417-1-osalvador@suse.de
+Signed-off-by: Oscar Salvador <osalvador@suse.de>
+Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
+Acked-by: Michal Hocko <mhocko@suse.com>
+Reviewed-by: David Hildenbrand <david@redhat.com>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ mm/memory_hotplug.c |   36 ++++++++++++++++++++----------------
+ 1 file changed, 20 insertions(+), 16 deletions(-)
+
+--- a/mm/memory_hotplug.c
++++ b/mm/memory_hotplug.c
+@@ -1326,23 +1326,27 @@ int test_pages_in_a_zone(unsigned long s
+ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
+ {
+       unsigned long pfn;
+-      struct page *page;
++
+       for (pfn = start; pfn < end; pfn++) {
+-              if (pfn_valid(pfn)) {
+-                      page = pfn_to_page(pfn);
+-                      if (PageLRU(page))
+-                              return pfn;
+-                      if (__PageMovable(page))
+-                              return pfn;
+-                      if (PageHuge(page)) {
+-                              if (hugepage_migration_supported(page_hstate(page)) &&
+-                                  page_huge_active(page))
+-                                      return pfn;
+-                              else
+-                                      pfn = round_up(pfn + 1,
+-                                              1 << compound_order(page)) - 1;
+-                      }
+-              }
++              struct page *page, *head;
++              unsigned long skip;
++
++              if (!pfn_valid(pfn))
++                      continue;
++              page = pfn_to_page(pfn);
++              if (PageLRU(page))
++                      return pfn;
++              if (__PageMovable(page))
++                      return pfn;
++
++              if (!PageHuge(page))
++                      continue;
++              head = compound_head(page);
++              if (hugepage_migration_supported(page_hstate(head)) &&
++                  page_huge_active(head))
++                      return pfn;
++              skip = (1 << compound_order(head)) - (page - head);
++              pfn += skip - 1;
+       }
+       return 0;
+ }
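
The pfn arithmetic in the new loop above can be checked by hand: with 4 KiB base pages a
1 GiB hugepage spans 2^18 pfns, and the "skip - 1" increment plus the loop's own pfn++
lands exactly on the first pfn past the compound page. A worked example (the head pfn
value below is an arbitrary assumption):

#include <stdio.h>

int main(void)
{
        unsigned long order = 18;               /* 1 GiB / 4 KiB base pages */
        unsigned long pages = 1UL << order;     /* 262144 pages in the hugepage */
        unsigned long head_pfn = 0x40000;       /* assumed head pfn */
        unsigned long pfn = head_pfn + 5;       /* scan hit a tail page */

        unsigned long offset = pfn - head_pfn;  /* page - head */
        unsigned long skip = pages - offset;    /* remaining pages to step over */

        pfn += skip - 1;                        /* body of the loop */
        pfn += 1;                               /* the for-loop's own pfn++ */

        printf("next pfn scanned: %#lx (expected %#lx)\n",
               pfn, head_pfn + pages);
        return 0;
}
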
diff --git a/queue-4.19/mm-migrate-don-t-rely-on-__pagemovable-of-newpage-after-unlocking-it.patch b/queue-4.19/mm-migrate-don-t-rely-on-__pagemovable-of-newpage-after-unlocking-it.patch
new file mode 100644 (file)
index 0000000..a5db262
--- /dev/null
@@ -0,0 +1,97 @@
+From e0a352fabce61f730341d119fbedf71ffdb8663f Mon Sep 17 00:00:00 2001
+From: David Hildenbrand <david@redhat.com>
+Date: Fri, 1 Feb 2019 14:21:19 -0800
+Subject: mm: migrate: don't rely on __PageMovable() of newpage after unlocking it
+
+From: David Hildenbrand <david@redhat.com>
+
+commit e0a352fabce61f730341d119fbedf71ffdb8663f upstream.
+
+We had a race in the old balloon compaction code, before b1123ea6d3b3
+("mm: balloon: use general non-lru movable page feature") refactored it,
+that became visible after backporting 195a8c43e93d ("virtio-balloon:
+deflate via a page list") without the refactoring.
+
+The bug existed from commit d6d86c0a7f8d ("mm/balloon_compaction:
+redesign ballooned pages management") till b1123ea6d3b3 ("mm: balloon:
+use general non-lru movable page feature").  d6d86c0a7f8d
+("mm/balloon_compaction: redesign ballooned pages management") was
+backported to 3.12, so the broken kernels are stable kernels [3.12 -
+4.7].
+
+There was a subtle race between dropping the page lock of the newpage in
+__unmap_and_move() and checking for __is_movable_balloon_page(newpage).
+
+Just after dropping this page lock, virtio-balloon could go ahead and
+deflate the newpage, effectively dequeueing it and clearing PageBalloon,
+in turn making __is_movable_balloon_page(newpage) fail.
+
+This resulted in dropping the reference of the newpage via
+putback_lru_page(newpage) instead of put_page(newpage), leading to
+page->lru getting modified and a !LRU page ending up in the LRU lists.
+With 195a8c43e93d ("virtio-balloon: deflate via a page list")
+backported, one would suddenly get corrupted lists in
+release_pages_balloon():
+
+- WARNING: CPU: 13 PID: 6586 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
+- list_del corruption. prev->next should be ffffe253961090a0, but was dead000000000100
+
+Nowadays this race is no longer possible, but it is hidden behind very
+ugly handling of __ClearPageMovable() and __PageMovable().
+
+__ClearPageMovable() will not make __PageMovable() fail, only
+PageMovable().  So the new check (__PageMovable(newpage)) will still
+hold even after newpage was dequeued by virtio-balloon.
+
+If anybody would ever change that special handling, the BUG would be
+introduced again.  So instead, make it explicit and use the information
+of the original isolated page before migration.
+
+This patch can be backported fairly easy to stable kernels (in contrast
+to the refactoring).
+
+Link: http://lkml.kernel.org/r/20190129233217.10747-1-david@redhat.com
+Fixes: d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management")
+Signed-off-by: David Hildenbrand <david@redhat.com>
+Reported-by: Vratislav Bendel <vbendel@redhat.com>
+Acked-by: Michal Hocko <mhocko@suse.com>
+Acked-by: Rafael Aquini <aquini@redhat.com>
+Cc: Mel Gorman <mgorman@techsingularity.net>
+Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
+Cc: Michal Hocko <mhocko@suse.com>
+Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
+Cc: Jan Kara <jack@suse.cz>
+Cc: Andrea Arcangeli <aarcange@redhat.com>
+Cc: Dominik Brodowski <linux@dominikbrodowski.net>
+Cc: Matthew Wilcox <willy@infradead.org>
+Cc: Vratislav Bendel <vbendel@redhat.com>
+Cc: Rafael Aquini <aquini@redhat.com>
+Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
+Cc: Minchan Kim <minchan@kernel.org>
+Cc: <stable@vger.kernel.org>   [3.12 - 4.7]
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ mm/migrate.c |    7 +++++--
+ 1 file changed, 5 insertions(+), 2 deletions(-)
+
+--- a/mm/migrate.c
++++ b/mm/migrate.c
+@@ -1118,10 +1118,13 @@ out:
+        * If migration is successful, decrease refcount of the newpage
+        * which will not free the page because new page owner increased
+        * refcounter. As well, if it is LRU page, add the page to LRU
+-       * list in here.
++       * list in here. Use the old state of the isolated source page to
++       * determine if we migrated a LRU page. newpage was already unlocked
++       * and possibly modified by its owner - don't rely on the page
++       * state.
+        */
+       if (rc == MIGRATEPAGE_SUCCESS) {
+-              if (unlikely(__PageMovable(newpage)))
++              if (unlikely(!is_lru))
+                       put_page(newpage);
+               else
+                       putback_lru_page(newpage);
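
The patch above records whether the isolated source page was an LRU page before migration
and never re-examines newpage after it has been unlocked, because its owner may already
have changed it. A minimal sketch of that snapshot-then-decide pattern (the types and
names below are invented for illustration):

#include <stdbool.h>
#include <stdio.h>

struct page_like {
        bool movable;                   /* may be cleared by its owner at any time */
};

static void unlock_and_hand_back(struct page_like *p)
{
        p->movable = false;             /* owner "deflates" the page behind our back */
}

static void finish_migration(struct page_like *newpage, bool is_lru)
{
        /* is_lru was recorded from the isolated source page earlier */
        unlock_and_hand_back(newpage);  /* analogous to unlocking newpage */

        /* Decide from the recorded state, not from newpage->movable now. */
        if (is_lru)
                printf("putback_lru_page(newpage)\n");
        else
                printf("put_page(newpage)\n");
}

int main(void)
{
        struct page_like newpage = { .movable = true };

        finish_migration(&newpage, false);      /* a non-LRU (e.g. balloon) page */
        return 0;
}
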
diff --git a/queue-4.19/mm-oom-fix-use-after-free-in-oom_kill_process.patch b/queue-4.19/mm-oom-fix-use-after-free-in-oom_kill_process.patch
new file mode 100644 (file)
index 0000000..e687d12
--- /dev/null
@@ -0,0 +1,70 @@
+From cefc7ef3c87d02fc9307835868ff721ea12cc597 Mon Sep 17 00:00:00 2001
+From: Shakeel Butt <shakeelb@google.com>
+Date: Fri, 1 Feb 2019 14:20:54 -0800
+Subject: mm, oom: fix use-after-free in oom_kill_process
+
+From: Shakeel Butt <shakeelb@google.com>
+
+commit cefc7ef3c87d02fc9307835868ff721ea12cc597 upstream.
+
+A syzbot instance running on an upstream kernel found a use-after-free bug
+in oom_kill_process().  On further inspection it seems like the process
+selected to be oom-killed has exited even before reaching
+read_lock(&tasklist_lock) in oom_kill_process().  More specifically, the
+tsk->usage is 1, which is due to get_task_struct() in oom_evaluate_task();
+the put_task_struct() within for_each_thread() frees the tsk while
+for_each_thread() still tries to access it.  The easiest fix is to do a
+get/put across the for_each_thread() on the selected task.
+
+Now the next question is: should we continue with the oom-kill as the
+previously selected task has exited? However, before adding more
+complexity and heuristics, let's answer why we even look at the children
+of the oom-kill selected task.  The select_bad_process() has already selected
+the worst process in the system/memcg.  Due to race, the selected
+process might not be the worst at the kill time but does that matter?
+The userspace can use the oom_score_adj interface to prefer children to
+be killed before the parent.  I looked at the history but it seems like
+this is there before git history.
+
+Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com
+Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com
+Fixes: 6b0c81b3be11 ("mm, oom: reduce dependency on tasklist_lock")
+Signed-off-by: Shakeel Butt <shakeelb@google.com>
+Reviewed-by: Roman Gushchin <guro@fb.com>
+Acked-by: Michal Hocko <mhocko@suse.com>
+Cc: David Rientjes <rientjes@google.com>
+Cc: Johannes Weiner <hannes@cmpxchg.org>
+Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ mm/oom_kill.c |    8 ++++++++
+ 1 file changed, 8 insertions(+)
+
+--- a/mm/oom_kill.c
++++ b/mm/oom_kill.c
+@@ -962,6 +962,13 @@ static void oom_kill_process(struct oom_
+        * still freeing memory.
+        */
+       read_lock(&tasklist_lock);
++
++      /*
++       * The task 'p' might have already exited before reaching here. The
++       * put_task_struct() will free task_struct 'p' while the loop still try
++       * to access the field of 'p', so, get an extra reference.
++       */
++      get_task_struct(p);
+       for_each_thread(p, t) {
+               list_for_each_entry(child, &t->children, sibling) {
+                       unsigned int child_points;
+@@ -981,6 +988,7 @@ static void oom_kill_process(struct oom_
+                       }
+               }
+       }
++      put_task_struct(p);
+       read_unlock(&tasklist_lock);
+       /*
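
The fix above is the classic "take an extra reference around the walk" pattern: bump the
count before the loop so that a concurrent final put cannot free the object mid-iteration,
and drop the reference afterwards. A userspace refcount sketch of the same shape (not the
kernel's task_struct accounting):

#include <stdio.h>
#include <stdlib.h>

struct task_like {
        int usage;                      /* reference count */
        int pid;
};

static void get_task(struct task_like *t) { t->usage++; }

static void put_task(struct task_like *t)
{
        if (--t->usage == 0) {
                printf("freeing task %d\n", t->pid);
                free(t);
        }
}

int main(void)
{
        struct task_like *p = malloc(sizeof(*p));

        p->usage = 1;                   /* the reference the selector took */
        p->pid = 1234;

        get_task(p);                    /* extra reference for the walk below */
        put_task(p);                    /* the selector's reference drops here */
        printf("still safe to read pid %d\n", p->pid);
        put_task(p);                    /* our reference; object freed now */
        return 0;
}
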
diff --git a/queue-4.19/oom-oom_reaper-do-not-enqueue-same-task-twice.patch b/queue-4.19/oom-oom_reaper-do-not-enqueue-same-task-twice.patch
new file mode 100644 (file)
index 0000000..ef689d1
--- /dev/null
@@ -0,0 +1,104 @@
+From 9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3 Mon Sep 17 00:00:00 2001
+From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
+Date: Fri, 1 Feb 2019 14:20:31 -0800
+Subject: oom, oom_reaper: do not enqueue same task twice
+
+From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
+
+commit 9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3 upstream.
+
+Arkadiusz reported that enabling memcg's group oom killing causes
+strange memcg statistics where there is no task in a memcg even though
+the number of tasks in that memcg is not 0.  It turned out that there is
+a bug in wake_oom_reaper() which allows enqueuing the same task twice,
+which makes it impossible to decrease the number of tasks in that memcg
+due to a refcount leak.
+
+This bug has existed since the OOM reaper became invokable from the
+task_will_free_mem(current) path in out_of_memory() in Linux 4.7:
+
+  T1@P1     |T2@P1     |T3@P1     |OOM reaper
+  ----------+----------+----------+------------
+                                   # Processing an OOM victim in a different memcg domain.
+                        try_charge()
+                          mem_cgroup_out_of_memory()
+                            mutex_lock(&oom_lock)
+             try_charge()
+               mem_cgroup_out_of_memory()
+                 mutex_lock(&oom_lock)
+  try_charge()
+    mem_cgroup_out_of_memory()
+      mutex_lock(&oom_lock)
+                            out_of_memory()
+                              oom_kill_process(P1)
+                                do_send_sig_info(SIGKILL, @P1)
+                                mark_oom_victim(T1@P1)
+                                wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
+                            mutex_unlock(&oom_lock)
+                 out_of_memory()
+                   mark_oom_victim(T2@P1)
+                   wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
+                 mutex_unlock(&oom_lock)
+      out_of_memory()
+        mark_oom_victim(T1@P1)
+        wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
+      mutex_unlock(&oom_lock)
+                                   # Completed processing an OOM victim in a different memcg domain.
+                                   spin_lock(&oom_reaper_lock)
+                                   # T1@P1 is dequeued.
+                                   spin_unlock(&oom_reaper_lock)
+
+but memcg's group oom killing made it easier to trigger this bug by
+calling wake_oom_reaper() on the same task from one out_of_memory()
+request.
+
+Fix this bug using an approach used by commit 855b018325737f76 ("oom,
+oom_reaper: disable oom_reaper for oom_kill_allocating_task").  As a
+side effect of this patch, this patch also avoids enqueuing multiple
+threads sharing memory via task_will_free_mem(current) path.
+
+Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
+Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
+Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
+Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
+Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
+Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
+Acked-by: Michal Hocko <mhocko@suse.com>
+Acked-by: Roman Gushchin <guro@fb.com>
+Cc: Tejun Heo <tj@kernel.org>
+Cc: Aleksa Sarai <asarai@suse.de>
+Cc: Jay Kamat <jgkamat@fb.com>
+Cc: Johannes Weiner <hannes@cmpxchg.org>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ include/linux/sched/coredump.h |    1 +
+ mm/oom_kill.c                  |    4 ++--
+ 2 files changed, 3 insertions(+), 2 deletions(-)
+
+--- a/include/linux/sched/coredump.h
++++ b/include/linux/sched/coredump.h
+@@ -71,6 +71,7 @@ static inline int get_dumpable(struct mm
+ #define MMF_HUGE_ZERO_PAGE    23      /* mm has ever used the global huge zero page */
+ #define MMF_DISABLE_THP               24      /* disable THP for all VMAs */
+ #define MMF_OOM_VICTIM                25      /* mm is the oom victim */
++#define MMF_OOM_REAP_QUEUED   26      /* mm was queued for oom_reaper */
+ #define MMF_DISABLE_THP_MASK  (1 << MMF_DISABLE_THP)
+ #define MMF_INIT_MASK         (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
+--- a/mm/oom_kill.c
++++ b/mm/oom_kill.c
+@@ -634,8 +634,8 @@ static int oom_reaper(void *unused)
+ static void wake_oom_reaper(struct task_struct *tsk)
+ {
+-      /* tsk is already queued? */
+-      if (tsk == oom_reaper_list || tsk->oom_reaper_list)
++      /* mm is already queued? */
++      if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
+               return;
+       get_task_struct(tsk);
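
The new check above relies on test_and_set_bit() being atomic and returning the previous
value, so only the first of several racing wake_oom_reaper() calls actually enqueues the
victim. A C11 sketch of the same enqueue-at-most-once guard (MMF_OOM_REAP_QUEUED is
modelled here by a plain atomic_flag):

#include <stdatomic.h>
#include <stdio.h>

/* Stands in for the MMF_OOM_REAP_QUEUED bit in mm->flags. */
static atomic_flag reap_queued = ATOMIC_FLAG_INIT;

static void wake_reaper(const char *who)
{
        /* test-and-set returns the previous value: only the first caller
         * sees "clear" and gets to enqueue the victim. */
        if (atomic_flag_test_and_set(&reap_queued)) {
                printf("%s: already queued, skipping\n", who);
                return;
        }
        printf("%s: enqueued for reaping\n", who);
}

int main(void)
{
        wake_reaper("T1@P1");   /* first caller enqueues */
        wake_reaper("T2@P1");   /* second caller is rejected */
        wake_reaper("T1@P1");   /* re-requests are rejected as well */
        return 0;
}
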
index e1b0686d5087d93fa838d964707a7c5f5cbcd136..6bbaaeb3baa1f484caea5af38c0baea6e8859f4c 100644 (file)
@@ -56,3 +56,13 @@ ib-hfi1-remove-overly-conservative-vm_exec-flag-check.patch
 platform-x86-asus-nb-wmi-map-0x35-to-key_screenlock.patch
 platform-x86-asus-nb-wmi-drop-mapping-of-0x33-and-0x.patch
 mmc-sdhci-iproc-handle-mmc_of_parse-errors-during-probe.patch
+btrfs-clean-up-pending-block-groups-when-transaction-commit-aborts.patch
+btrfs-fix-deadlock-when-allocating-tree-block-during-leaf-node-split.patch
+btrfs-on-error-always-free-subvol_name-in-btrfs_mount.patch
+kernel-exit.c-release-ptraced-tasks-before-zap_pid_ns_processes.patch
+mm-hugetlb.c-teach-follow_hugetlb_page-to-handle-foll_nowait.patch
+oom-oom_reaper-do-not-enqueue-same-task-twice.patch
+mm-memory_hotplug-fix-scan_movable_pages-for-gigantic-hugepages.patch
+mm-oom-fix-use-after-free-in-oom_kill_process.patch
+mm-hwpoison-use-do_send_sig_info-instead-of-force_sig.patch
+mm-migrate-don-t-rely-on-__pagemovable-of-newpage-after-unlocking-it.patch