From: Greg Kroah-Hartman
Date: Wed, 9 Dec 2020 09:15:12 +0000 (+0100)
Subject: 4.19-stable patches
X-Git-Tag: v5.9.14~39
X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=c5780612154d9da37f0a1c083928cae5d334a2ca;p=thirdparty%2Fkernel%2Fstable-queue.git

4.19-stable patches

added patches:
	mm-list_lru-set-shrinker-map-bit-when-child-nr_items-is-not-zero.patch
	mm-swapfile-do-not-sleep-with-a-spin-lock-held.patch
---

diff --git a/queue-4.19/mm-list_lru-set-shrinker-map-bit-when-child-nr_items-is-not-zero.patch b/queue-4.19/mm-list_lru-set-shrinker-map-bit-when-child-nr_items-is-not-zero.patch
new file mode 100644
index 00000000000..17f10e8c1e5
--- /dev/null
+++ b/queue-4.19/mm-list_lru-set-shrinker-map-bit-when-child-nr_items-is-not-zero.patch
@@ -0,0 +1,131 @@
+From 8199be001a470209f5c938570cc199abb012fe53 Mon Sep 17 00:00:00 2001
+From: Yang Shi
+Date: Sat, 5 Dec 2020 22:14:48 -0800
+Subject: mm: list_lru: set shrinker map bit when child nr_items is not zero
+
+From: Yang Shi
+
+commit 8199be001a470209f5c938570cc199abb012fe53 upstream.
+
+While investigating a slab cache bloat problem, a significant amount of
+negative dentry cache was seen; confusingly, it was neither shrunk by
+the reclaimer (the host was under very tight memory pressure) nor by
+dropping caches.  The vmcore shows over 14M negative dentry objects on
+the lru, but tracing shows they were not scanned at all.
+
+Further investigation shows the memcg's vfs shrinker_map bit is not
+set, so the reclaimer and cache dropping simply skip calling the vfs
+shrinker, and we had to reboot the hosts to get the memory back.
+
+I didn't manage to come up with a reproducer in a test environment, and
+the problem can't be reproduced after rebooting.  But code inspection
+suggests a race between shrinker map bit clearing and reparenting.  The
+hypothesis is elaborated below.
+
+The memcg hierarchy on our production environment looks like:
+
+        root
+       /    \
+  system    user
+
+The main workloads run under user slice's children, and they create and
+remove memcgs frequently, so reparenting happens very often under user
+slice, but no task runs under user slice directly.
+
+With the frequent reparenting and tight memory pressure, the
+hypothetical race condition below may happen:
+
+          CPU A                          CPU B
+reparent
+    dst->nr_items == 0
+                                 shrinker:
+                                     total_objects == 0
+    add src->nr_items to dst
+    set_bit
+                                     return SHRINK_EMPTY
+                                     clear_bit
+child memcg offline
+    replace child's kmemcg_id with
+    parent's (in memcg_offline_kmem())
+                                 list_lru_del() between shrinker runs
+                                     see parent's kmemcg_id
+                                     dec dst->nr_items
+reparent again
+    dst->nr_items may go negative
+    due to concurrent list_lru_del()
+
+                                 The second run of shrinker:
+                                     read nr_items without any
+                                     synchronization, so it may
+                                     see an intermediate negative
+                                     nr_items, then total_objects
+                                     may return 0 coincidentally
+
+                                     keep the bit cleared
+    dst->nr_items != 0
+    skip set_bit
+    add src->nr_items to dst
+
+After this point dst->nr_items may never go back to zero, so
+reparenting will never set the shrinker_map bit again.  And since no
+task runs under user slice directly, no new object will be added to its
+lru to set the shrinker map bit either.  That bit stays cleared
+forever.
+
+How does list_lru_del() race with reparenting?  Reparenting replaces
+the children's kmemcg_id with the parent's without taking nlru->lock,
+so list_lru_del() may see the parent's kmemcg_id while actually
+deleting items from the child's lru and therefore decrement the
+parent's nr_items, so the parent's nr_items may go negative, as commit
+2788cf0c401c ("memcg: reparent list_lrus and free kmemcg_id on css
+offline") says.
+
+Since it is impossible that dst->nr_items goes negative and
+src->nr_items goes to zero at the same time, it seems we could set the
+shrinker map bit iff src->nr_items != 0.
We could synchronize +list_lru_count_one() and reparenting with nlru->lock, but it seems +checking src->nr_items in reparenting is the simplest and avoids lock +contention. + +Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance") +Suggested-by: Roman Gushchin +Signed-off-by: Yang Shi +Signed-off-by: Andrew Morton +Reviewed-by: Roman Gushchin +Reviewed-by: Shakeel Butt +Acked-by: Kirill Tkhai +Cc: Vladimir Davydov +Cc: [4.19] +Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman + +--- + mm/list_lru.c | 10 +++++----- + 1 file changed, 5 insertions(+), 5 deletions(-) + +--- a/mm/list_lru.c ++++ b/mm/list_lru.c +@@ -542,7 +542,6 @@ static void memcg_drain_list_lru_node(st + struct list_lru_node *nlru = &lru->node[nid]; + int dst_idx = dst_memcg->kmemcg_id; + struct list_lru_one *src, *dst; +- bool set; + + /* + * Since list_lru_{add,del} may be called under an IRQ-safe lock, +@@ -554,11 +553,12 @@ static void memcg_drain_list_lru_node(st + dst = list_lru_from_memcg_idx(nlru, dst_idx); + + list_splice_init(&src->list, &dst->list); +- set = (!dst->nr_items && src->nr_items); +- dst->nr_items += src->nr_items; +- if (set) ++ ++ if (src->nr_items) { ++ dst->nr_items += src->nr_items; + memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru)); +- src->nr_items = 0; ++ src->nr_items = 0; ++ } + + spin_unlock_irq(&nlru->lock); + } diff --git a/queue-4.19/mm-swapfile-do-not-sleep-with-a-spin-lock-held.patch b/queue-4.19/mm-swapfile-do-not-sleep-with-a-spin-lock-held.patch new file mode 100644 index 00000000000..9f1833c0862 --- /dev/null +++ b/queue-4.19/mm-swapfile-do-not-sleep-with-a-spin-lock-held.patch @@ -0,0 +1,53 @@ +From b11a76b37a5aa7b07c3e3eeeaae20b25475bddd3 Mon Sep 17 00:00:00 2001 +From: Qian Cai +Date: Sat, 5 Dec 2020 22:14:55 -0800 +Subject: mm/swapfile: do not sleep with a spin lock held + +From: Qian Cai + +commit 
b11a76b37a5aa7b07c3e3eeeaae20b25475bddd3 upstream. + +We can't call kvfree() with a spin lock held, so defer it. Fixes a +might_sleep() runtime warning. + +Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation") +Signed-off-by: Qian Cai +Signed-off-by: Andrew Morton +Reviewed-by: Andrew Morton +Cc: Hugh Dickins +Cc: +Link: https://lkml.kernel.org/r/20201202151549.10350-1-qcai@redhat.com +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman + +--- + mm/swapfile.c | 4 +++- + 1 file changed, 3 insertions(+), 1 deletion(-) + +--- a/mm/swapfile.c ++++ b/mm/swapfile.c +@@ -2825,6 +2825,7 @@ late_initcall(max_swapfiles_check); + static struct swap_info_struct *alloc_swap_info(void) + { + struct swap_info_struct *p; ++ struct swap_info_struct *defer = NULL; + unsigned int type; + int i; + int size = sizeof(*p) + nr_node_ids * sizeof(struct plist_node); +@@ -2854,7 +2855,7 @@ static struct swap_info_struct *alloc_sw + smp_wmb(); + WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1); + } else { +- kvfree(p); ++ defer = p; + p = swap_info[type]; + /* + * Do not memset this entry: a racing procfs swap_next() +@@ -2867,6 +2868,7 @@ static struct swap_info_struct *alloc_sw + plist_node_init(&p->avail_lists[i], 0); + p->flags = SWP_USED; + spin_unlock(&swap_lock); ++ kvfree(defer); + spin_lock_init(&p->lock); + spin_lock_init(&p->cont_lock); + diff --git a/queue-4.19/powerpc-pseries-pass-msi-affinity-to-irq_create_mapping.patch b/queue-4.19/powerpc-pseries-pass-msi-affinity-to-irq_create_mapping.patch deleted file mode 100644 index 49f2d4d9395..00000000000 --- a/queue-4.19/powerpc-pseries-pass-msi-affinity-to-irq_create_mapping.patch +++ /dev/null @@ -1,54 +0,0 @@ -From 9ea69a55b3b9a71cded9726af591949c1138f235 Mon Sep 17 00:00:00 2001 -From: Laurent Vivier -Date: Thu, 26 Nov 2020 09:28:52 +0100 -Subject: powerpc/pseries: Pass MSI affinity to irq_create_mapping() - -From: Laurent Vivier - -commit 9ea69a55b3b9a71cded9726af591949c1138f235 upstream. 
- -With virtio multiqueue, normally each queue IRQ is mapped to a CPU. - -Commit 0d9f0a52c8b9f ("virtio_scsi: use virtio IRQ affinity") exposed -an existing shortcoming of the arch code by moving virtio_scsi to -the automatic IRQ affinity assignment. - -The affinity is correctly computed in msi_desc but this is not applied -to the system IRQs. - -It appears the affinity is correctly passed to rtas_setup_msi_irqs() but -lost at this point and never passed to irq_domain_alloc_descs() -(see commit 06ee6d571f0e ("genirq: Add affinity hint to irq allocation")) -because irq_create_mapping() doesn't take an affinity parameter. - -Use the new irq_create_mapping_affinity() function, which allows to forward -the affinity setting from rtas_setup_msi_irqs() to irq_domain_alloc_descs(). - -With this change, the virtqueues are correctly dispatched between the CPUs -on pseries. - -Fixes: e75eafb9b039 ("genirq/msi: Switch to new irq spreading infrastructure") -Signed-off-by: Laurent Vivier -Signed-off-by: Thomas Gleixner -Reviewed-by: Greg Kurz -Acked-by: Michael Ellerman -Cc: stable@vger.kernel.org -Link: https://lore.kernel.org/r/20201126082852.1178497-3-lvivier@redhat.com -Signed-off-by: Greg Kroah-Hartman - ---- - arch/powerpc/platforms/pseries/msi.c | 3 ++- - 1 file changed, 2 insertions(+), 1 deletion(-) - ---- a/arch/powerpc/platforms/pseries/msi.c -+++ b/arch/powerpc/platforms/pseries/msi.c -@@ -462,7 +462,8 @@ again: - return hwirq; - } - -- virq = irq_create_mapping(NULL, hwirq); -+ virq = irq_create_mapping_affinity(NULL, hwirq, -+ entry->affinity); - - if (!virq) { - pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq); diff --git a/queue-4.19/series b/queue-4.19/series index 172c12dae87..eea6879cc27 100644 --- a/queue-4.19/series +++ b/queue-4.19/series @@ -18,5 +18,6 @@ cifs-fix-potential-use-after-free-in-cifs_echo_request.patch i2c-imx-don-t-generate-stop-condition-if-arbitration-has-been-lost.patch scsi-mpt3sas-fix-ioctl-timeout.patch 
dm-writecache-fix-the-maximum-number-of-arguments.patch -powerpc-pseries-pass-msi-affinity-to-irq_create_mapping.patch dm-remove-invalid-sparse-__acquires-and-__releases-annotations.patch +mm-list_lru-set-shrinker-map-bit-when-child-nr_items-is-not-zero.patch +mm-swapfile-do-not-sleep-with-a-spin-lock-held.patch