From: Greg Kroah-Hartman
Date: Fri, 21 Oct 2016 07:43:04 +0000 (+0200)
Subject: 4.4-stable patches
X-Git-Tag: v4.7.10~3
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=51737902f6ae94cd61521c783c3a03913da528dc;p=thirdparty%2Fkernel%2Fstable-queue.git

4.4-stable patches

added patches:
      cfq-fix-starvation-of-asynchronous-writes.patch
---

diff --git a/queue-4.4/cfq-fix-starvation-of-asynchronous-writes.patch b/queue-4.4/cfq-fix-starvation-of-asynchronous-writes.patch
new file mode 100644
index 00000000000..ab96eaacd9c
--- /dev/null
+++ b/queue-4.4/cfq-fix-starvation-of-asynchronous-writes.patch
@@ -0,0 +1,188 @@
+From 3932a86b4b9d1f0b049d64d4591ce58ad18b44ec Mon Sep 17 00:00:00 2001
+From: Glauber Costa
+Date: Thu, 22 Sep 2016 20:59:59 -0400
+Subject: cfq: fix starvation of asynchronous writes
+
+From: Glauber Costa
+
+commit 3932a86b4b9d1f0b049d64d4591ce58ad18b44ec upstream.
+
+While debugging timeouts happening in my application workload (ScyllaDB), I have
+observed calls to open() taking a long time, ranging everywhere from 2 seconds -
+the first ones that are enough to time out my application - to more than 30
+seconds.
+
+The problem seems to happen because XFS may block on pending metadata updates
+under certain circumstances, and that's confirmed with the following backtrace
+taken by the offcputime tool (iovisor/bcc):
+
+    ffffffffb90c57b1 finish_task_switch
+    ffffffffb97dffb5 schedule
+    ffffffffb97e310c schedule_timeout
+    ffffffffb97e1f12 __down
+    ffffffffb90ea821 down
+    ffffffffc046a9dc xfs_buf_lock
+    ffffffffc046abfb _xfs_buf_find
+    ffffffffc046ae4a xfs_buf_get_map
+    ffffffffc046babd xfs_buf_read_map
+    ffffffffc0499931 xfs_trans_read_buf_map
+    ffffffffc044a561 xfs_da_read_buf
+    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
+    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
+    ffffffffc0452e0f xfs_dir2_leaf_lookup
+    ffffffffc044d9d3 xfs_dir_lookup
+    ffffffffc047d1d9 xfs_lookup
+    ffffffffc0479e53 xfs_vn_lookup
+    ffffffffb925347a path_openat
+    ffffffffb9254a71 do_filp_open
+    ffffffffb9242a94 do_sys_open
+    ffffffffb9242b9e sys_open
+    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
+    00007fb0698162ed [unknown]
+
+Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very
+high "Dispatch wait" times, in the dozens-of-seconds range and consistent with
+the open() times I saw in that run.
+
+Still from the blktrace output, we can, after searching a bit, identify the
+request that wasn't dispatched:
+
+  8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
+  8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
+  8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
+  8,0   11      206    81.092496118   804  I  WM 141698288 + 8 ( 22911) [xfsaild/sda1]
+          <==== 'I' means Inserted (into the IO scheduler) ===================================>
+  8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
+          <==== Only 15s later the CFQ scheduler dispatches the request ======================>
+
+As we can see above, in this particular example CFQ took 15 seconds to dispatch
+this request. Going back to the full trace, we can see that the xfsaild queue
+had plenty of opportunity to run, and it was selected as the active queue many
+times.
+It would just always be preempted by something else (example):
+
+  8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
+  8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
+  8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
+  8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
+  8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
+  8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
+  8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0
+
+where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
+IO dispatchers.
+
+The requests preempting the xfsaild queue are synchronous requests. That's a
+characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
+While it can be argued that preempting ASYNC requests in favor of SYNC is part
+of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
+goal.
+
+Moreover, unless I am misunderstanding something, that breaks the expectation
+set by the "fifo_expire_async" tunable, which in my system is set to the
+default.
+
+Looking at the code, it seems to me that the issue is that after we make
+an async queue active, there is no guarantee that it will execute any request.
+
+When the queue itself tests cfq_may_dispatch(), it can bail if it sees SYNC
+requests in flight. An incoming request from another queue can also preempt it
+in such a situation before we have the chance to execute anything (as seen in
+the trace above).
+
+This patch sets the must_dispatch flag if we notice that we have requests
+that are already fifo_expired. This flag is always cleared after
+cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
+the queue for subsequent requests (unless they are themselves expired).
+
+Care is taken during preempt to still allow rt requests to preempt us
+regardless.
+
+Testing my workload with this patch applied produces much better results.
+From the application side I see no timeouts, and the open() latency histogram
+generated by systemtap looks much better, with the worst outlier at 131ms:
+
+Latency histogram of xfs_buf_lock acquisition (microseconds):
+ value |-------------------------------------------------- count
+     0 |                                                     11
+     1 |@@@@                                                161
+     2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
+     4 |@                                                    54
+     8 |                                                     36
+    16 |                                                      7
+    32 |                                                      0
+    64 |                                                      0
+     ~
+  1024 |                                                      0
+  2048 |                                                      0
+  4096 |                                                      1
+  8192 |                                                      1
+ 16384 |                                                      2
+ 32768 |                                                      0
+ 65536 |                                                      0
+131072 |                                                      1
+262144 |                                                      0
+524288 |                                                      0
+
+Signed-off-by: Glauber Costa
+CC: Jens Axboe
+CC: linux-block@vger.kernel.org
+CC: linux-kernel@vger.kernel.org
+Signed-off-by: Glauber Costa
+Signed-off-by: Jens Axboe
+Signed-off-by: Greg Kroah-Hartman
+
+---
+ block/cfq-iosched.c | 13 ++++++++++---
+ 1 file changed, 10 insertions(+), 3 deletions(-)
+
+--- a/block/cfq-iosched.c
++++ b/block/cfq-iosched.c
+@@ -3003,7 +3003,6 @@ static struct request *cfq_check_fifo(st
+ 	if (time_before(jiffies, rq->fifo_time))
+ 		rq = NULL;
+ 
+-	cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
+ 	return rq;
+ }
+ 
+@@ -3377,6 +3376,9 @@ static bool cfq_may_dispatch(struct cfq_
+ {
+ 	unsigned int max_dispatch;
+ 
++	if (cfq_cfqq_must_dispatch(cfqq))
++		return true;
++
+ 	/*
+ 	 * Drain async requests before we start sync IO
+ 	 */
+@@ -3468,15 +3470,20 @@ static bool cfq_dispatch_request(struct
+ 
+ 	BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
+ 
++	rq = cfq_check_fifo(cfqq);
++	if (rq)
++		cfq_mark_cfqq_must_dispatch(cfqq);
++
+ 	if (!cfq_may_dispatch(cfqd, cfqq))
+ 		return false;
+ 
+ 	/*
+ 	 * follow expired path, else get first next available
+ 	 */
+-	rq = cfq_check_fifo(cfqq);
+ 	if (!rq)
+ 		rq = cfqq->next_rq;
++	else
++		cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
+ 
+ 	/*
+ 	 * insert request into driver dispatch list
+@@ -3944,7 +3951,7 @@ cfq_should_preempt(struct cfq_data *cfqd
+ 	 * if the new request is sync, but the currently running queue is
+ 	 * not, let the sync request have priority.
+ 	 */
+-	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
++	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
+ 		return true;
+ 
+ 	if (new_cfqq->cfqg != cfqq->cfqg)
diff --git a/queue-4.4/series b/queue-4.4/series
index 9ce2174e845..704e616a3f7 100644
--- a/queue-4.4/series
+++ b/queue-4.4/series
@@ -22,3 +22,4 @@ crypto-ghash-generic-move-common-definitions-to-a-new-header-file.patch
 crypto-vmx-fix-memory-corruption-caused-by-p8_ghash.patch
 dlm-free-workqueues-after-the-connections.patch
 vfs-move-permission-checking-into-notify_change-for-utimes-null.patch
+cfq-fix-starvation-of-asynchronous-writes.patch
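
As added context for the hunks above, here is a minimal, self-contained C sketch of
the reordering the patch introduces: the FIFO is now checked before the
dispatch-permission test, an expired FIFO head sets a must_dispatch flag that lets
an async queue pass both the "drain async before sync" check and the
sync-preemption check, and the flag is cleared again once a request is dispatched.
This is only an illustration under simplified assumptions; the types and helpers
here (sketch_queue, sketch_rq, may_dispatch, dispatch_one) are hypothetical
stand-ins, not the actual cfq-iosched.c structures.

/*
 * Sketch of the dispatch ordering introduced by the patch.
 * All names are simplified stand-ins for the cfq-iosched internals.
 */
#include <stdbool.h>
#include <stdio.h>

struct sketch_rq {
	unsigned long fifo_time;	/* deadline assigned at insert time */
	bool sync;
};

struct sketch_queue {
	bool sync;			/* queue carries sync requests */
	bool must_dispatch;		/* new flag: FIFO head already expired */
	struct sketch_rq *fifo_head;	/* oldest request, may be NULL */
	struct sketch_rq *next_rq;	/* normal next candidate */
	int sync_in_flight;		/* sync requests still in the driver */
};

/* Analogue of cfq_check_fifo(): return the head only if it has expired. */
static struct sketch_rq *check_fifo(struct sketch_queue *q, unsigned long now)
{
	struct sketch_rq *rq = q->fifo_head;

	if (rq && now < rq->fifo_time)
		rq = NULL;
	return rq;
}

/* Analogue of cfq_may_dispatch(): async IO normally waits for sync IO to
 * drain, but a must_dispatch queue is allowed through unconditionally. */
static bool may_dispatch(struct sketch_queue *q)
{
	if (q->must_dispatch)
		return true;
	if (!q->sync && q->sync_in_flight > 0)
		return false;
	return true;
}

/* Analogue of the patched cfq_dispatch_request(): look at the FIFO first,
 * mark must_dispatch if the head already expired, then test permission. */
static bool dispatch_one(struct sketch_queue *q, unsigned long now)
{
	struct sketch_rq *rq = check_fifo(q, now);

	if (rq)
		q->must_dispatch = true;

	if (!may_dispatch(q))
		return false;

	if (!rq)
		rq = q->next_rq;
	if (!rq)
		return false;

	printf("dispatching %s request (must_dispatch=%d)\n",
	       rq->sync ? "sync" : "async", q->must_dispatch);
	q->must_dispatch = false;	/* cleared again after the dispatch */
	return true;
}

int main(void)
{
	struct sketch_rq expired = { .fifo_time = 100, .sync = false };
	struct sketch_queue async_q = {
		.sync = false,
		.fifo_head = &expired,
		.next_rq = &expired,
		.sync_in_flight = 3,	/* in-flight sync IO would normally block us */
	};

	/* Without the flag this would bail in may_dispatch(); with the
	 * expired FIFO head it dispatches despite the in-flight sync IO. */
	dispatch_one(&async_q, 200);
	return 0;
}

Run as-is, the sketch reports that the async request is dispatched even though
sync IO is in flight, which is the behaviour change the commit message describes;
in the real scheduler the same flag also prevents an incoming sync request from
preempting the queue before that dispatch happens.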