From: Greg Kroah-Hartman
Date: Sun, 13 Aug 2017 15:56:30 +0000 (-0700)
Subject: 4.4-stable patches
X-Git-Tag: v3.18.66~14
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=39379d0a8da96dc597e2b63a2c75928f5cebaa3b;p=thirdparty%2Fkernel%2Fstable-queue.git

4.4-stable patches

added patches:
	mm-ratelimit-pfns-busy-info-message.patch
---

diff --git a/queue-4.4/mm-ratelimit-pfns-busy-info-message.patch b/queue-4.4/mm-ratelimit-pfns-busy-info-message.patch
new file mode 100644
index 00000000000..d23046fed6e
--- /dev/null
+++ b/queue-4.4/mm-ratelimit-pfns-busy-info-message.patch
@@ -0,0 +1,79 @@
+From 75dddef32514f7aa58930bde6a1263253bc3d4ba Mon Sep 17 00:00:00 2001
+From: Jonathan Toppins
+Date: Thu, 10 Aug 2017 15:23:35 -0700
+Subject: mm: ratelimit PFNs busy info message
+
+From: Jonathan Toppins
+
+commit 75dddef32514f7aa58930bde6a1263253bc3d4ba upstream.
+
+The RDMA subsystem can generate several thousand of these messages per
+second eventually leading to a kernel crash. Ratelimit these messages
+to prevent this crash.
+
+Doug said:
+ "I've been carrying a version of this for several kernel versions. I
+  don't remember when they started, but we have one (and only one) class
+  of machines: Dell PE R730xd, that generate these errors. When it
+  happens, without a rate limit, we get rcu timeouts and kernel oopses.
+  With the rate limit, we just get a lot of annoying kernel messages but
+  the machine continues on, recovers, and eventually the memory
+  operations all succeed"
+
+And:
+ "> Well... why are all these EBUSY's occurring? It sounds inefficient
+  > (at least) but if it is expected, normal and unavoidable then
+  > perhaps we should just remove that message altogether?
+
+  I don't have an answer to that question. To be honest, I haven't
+  looked real hard. We never had this at all, then it started out of the
+  blue, but only on our Dell 730xd machines (and it hits all of them),
+  but no other classes or brands of machines. And we have our 730xd
+  machines loaded up with different brands and models of cards (for
+  instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an
+  ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines
+  meant it wasn't tied to any particular brand/model of RDMA hardware.
+  To me, it always smelled of a hardware oddity specific to maybe the
+  CPUs or mainboard chipsets in these machines, so given that I'm not an
+  mm expert anyway, I never chased it down.
+
+  A few other relevant details: it showed up somewhere around 4.8/4.9 or
+  thereabouts. It never happened before, but the prinkt has been there
+  since the 3.18 days, so possibly the test to trigger this message was
+  changed, or something else in the allocator changed such that the
+  situation started happening on these machines?
+
+  And, like I said, it is specific to our 730xd machines (but they are
+  all identical, so that could mean it's something like their specific
+  ram configuration is causing the allocator to hit this on these
+  machine but not on other machines in the cluster, I don't want to say
+  it's necessarily the model of chipset or CPU, there are other bits of
+  identicalness between these machines)"
+
+Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com
+Signed-off-by: Jonathan Toppins
+Reviewed-by: Doug Ledford
+Tested-by: Doug Ledford
+Cc: Michal Hocko
+Cc: Vlastimil Babka
+Cc: Mel Gorman
+Cc: Hillf Danton
+Signed-off-by: Andrew Morton
+Signed-off-by: Linus Torvalds
+Signed-off-by: Greg Kroah-Hartman
+
+---
+ mm/page_alloc.c | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+--- a/mm/page_alloc.c
++++ b/mm/page_alloc.c
+@@ -6804,7 +6804,7 @@ int alloc_contig_range(unsigned long sta
+ 
+ 	/* Make sure the range is really isolated. */
+ 	if (test_pages_isolated(outer_start, end, false)) {
+-		pr_info("%s: [%lx, %lx) PFNs busy\n",
++		pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
+ 			__func__, outer_start, end);
+ 		ret = -EBUSY;
+ 		goto done;
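
For readers unfamiliar with ratelimited printing, the following is a rough userspace sketch of the interval/burst idea behind pr_info_ratelimited(). It is not the kernel implementation: struct rl_state, rl_allow() and report_busy() are hypothetical names invented for this illustration, time is passed in as a plain tick counter, and the interval and burst values are chosen by the caller (the kernel's defaults are, to my knowledge, a 5 second interval and a burst of 10).

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace analogue of a per-call-site ratelimit state. */
struct rl_state {
	long interval;	/* window length, in caller-defined ticks */
	int burst;	/* messages allowed per window */
	long begin;	/* tick at which the current window started */
	int printed;	/* messages emitted in the current window */
	int missed;	/* messages suppressed in the current window */
	bool started;	/* false until the first call opens a window */
};

/* Return true if a message may be emitted at time `now`. */
static bool rl_allow(struct rl_state *rs, long now)
{
	/* Open a fresh window on first use or once the interval has elapsed. */
	if (!rs->started || now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
		rs->started = true;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	/* Over the burst limit: count the message, but do not emit it. */
	rs->missed++;
	return false;
}

/* Emit the "PFNs busy" line only when the limiter allows it. */
static bool report_busy(struct rl_state *rs, long now,
			unsigned long start, unsigned long end)
{
	if (!rl_allow(rs, now))
		return false;

	printf("alloc_contig_range: [%lx, %lx) PFNs busy\n", start, end);
	return true;
}
```

With a burst of 10, a caller that hits the busy path a thousand times inside one window emits ten lines and silently counts the rest, which is the behaviour the commit message relies on to keep the machine responsive.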