From: Greg Kroah-Hartman
Date: Fri, 21 Oct 2016 07:43:04 +0000 (+0200)
Subject: 4.4-stable patches
X-Git-Tag: v4.7.10~3
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=51737902f6ae94cd61521c783c3a03913da528dc;p=thirdparty%2Fkernel%2Fstable-queue.git

4.4-stable patches

added patches:
      cfq-fix-starvation-of-asynchronous-writes.patch
---

diff --git a/queue-4.4/cfq-fix-starvation-of-asynchronous-writes.patch b/queue-4.4/cfq-fix-starvation-of-asynchronous-writes.patch
new file mode 100644
index 00000000000..ab96eaacd9c
--- /dev/null
+++ b/queue-4.4/cfq-fix-starvation-of-asynchronous-writes.patch
@@ -0,0 +1,188 @@
+From 3932a86b4b9d1f0b049d64d4591ce58ad18b44ec Mon Sep 17 00:00:00 2001
+From: Glauber Costa
+Date: Thu, 22 Sep 2016 20:59:59 -0400
+Subject: cfq: fix starvation of asynchronous writes
+
+From: Glauber Costa
+
+commit 3932a86b4b9d1f0b049d64d4591ce58ad18b44ec upstream.
+
+While debugging timeouts happening in my application workload (ScyllaDB), I have
+observed calls to open() taking a long time, ranging everywhere from 2 seconds -
+the first ones that are enough to time out my application - to more than 30
+seconds.
+
+The problem seems to happen because XFS may block on pending metadata updates
+under certain circumstances, and that's confirmed with the following backtrace
+taken by the offcputime tool (iovisor/bcc):
+
+    ffffffffb90c57b1 finish_task_switch
+    ffffffffb97dffb5 schedule
+    ffffffffb97e310c schedule_timeout
+    ffffffffb97e1f12 __down
+    ffffffffb90ea821 down
+    ffffffffc046a9dc xfs_buf_lock
+    ffffffffc046abfb _xfs_buf_find
+    ffffffffc046ae4a xfs_buf_get_map
+    ffffffffc046babd xfs_buf_read_map
+    ffffffffc0499931 xfs_trans_read_buf_map
+    ffffffffc044a561 xfs_da_read_buf
+    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
+    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
+    ffffffffc0452e0f xfs_dir2_leaf_lookup
+    ffffffffc044d9d3 xfs_dir_lookup
+    ffffffffc047d1d9 xfs_lookup
+    ffffffffc0479e53 xfs_vn_lookup
+    ffffffffb925347a path_openat
+    ffffffffb9254a71 do_filp_open
+    ffffffffb9242a94 do_sys_open
+    ffffffffb9242b9e sys_open
+    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
+    00007fb0698162ed [unknown]
+
+Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very
+high "Dispatch wait" times, in the dozens-of-seconds range and consistent with
+the open() times I saw in that run.
+
+Still from the blktrace output, we can, after searching a bit, identify the
+request that wasn't dispatched:
+
+  8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
+  8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
+  8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
+  8,0   11      206    81.092496118   804  I  WM 141698288 + 8 ( 22911) [xfsaild/sda1]
+          <==== 'I' means Inserted (into the IO scheduler) ===================================>
+  8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
+          <==== Only 15s later the CFQ scheduler dispatches the request ======================>
+
+As we can see above, in this particular example CFQ took 15 seconds to dispatch
+this request. Going back to the full trace, we can see that the xfsaild queue
+had plenty of opportunity to run, and it was selected as the active queue many
+times.
+It would just always be preempted by something else (example):
+
+  8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
+  8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
+  8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
+  8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
+  8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
+  8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
+  8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0
+
+where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
+IO dispatchers.
+
+The requests preempting the xfsaild queue are synchronous requests. That's a
+characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
+While it can be argued that preempting ASYNC requests in favor of SYNC is part
+of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
+goal.
+
+Moreover, unless I am misunderstanding something, that breaks the expectation
+set by the "fifo_expire_async" tunable, which in my system is set to the
+default.
+
+Looking at the code, it seems to me that the issue is that after we make
+an async queue active, there is no guarantee that it will execute any request.
+
+When the queue itself tests cfq_may_dispatch(), it can bail if it sees SYNC
+requests in flight. An incoming request from another queue can also preempt it
+in such a situation before we have the chance to execute anything (as seen in
+the trace above).
+
+This patch sets the must_dispatch flag if we notice that we have requests
+that are already fifo_expired. This flag is always cleared after
+cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
+the queue for subsequent requests (unless they are themselves expired).
+
+Care is taken during preempt to still allow rt requests to preempt us
+regardless.
+
+Testing my workload with this patch applied produces much better results.
+From the application side I see no timeouts, and the open() latency histogram
+generated by systemtap looks much better, with the worst outlier at 131ms:
+
+Latency histogram of xfs_buf_lock acquisition (microseconds):
+ value |-------------------------------------------------- count
+     0 |                                                     11
+     1 |@@@@                                                161
+     2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
+     4 |@                                                    54
+     8 |                                                     36
+    16 |                                                      7
+    32 |                                                      0
+    64 |                                                      0
+     ~
+  1024 |                                                      0
+  2048 |                                                      0
+  4096 |                                                      1
+  8192 |                                                      1
+ 16384 |                                                      2
+ 32768 |                                                      0
+ 65536 |                                                      0
+131072 |                                                      1
+262144 |                                                      0
+524288 |                                                      0
+
+Signed-off-by: Glauber Costa
+CC: Jens Axboe
+CC: linux-block@vger.kernel.org
+CC: linux-kernel@vger.kernel.org
+Signed-off-by: Glauber Costa
+Signed-off-by: Jens Axboe
+Signed-off-by: Greg Kroah-Hartman
+
+---
+ block/cfq-iosched.c | 13 ++++++++++---
+ 1 file changed, 10 insertions(+), 3 deletions(-)
+
+--- a/block/cfq-iosched.c
++++ b/block/cfq-iosched.c
+@@ -3003,7 +3003,6 @@ static struct request *cfq_check_fifo(st
+ 	if (time_before(jiffies, rq->fifo_time))
+ 		rq = NULL;
+ 
+-	cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
+ 	return rq;
+ }
+ 
+@@ -3377,6 +3376,9 @@ static bool cfq_may_dispatch(struct cfq_
+ {
+ 	unsigned int max_dispatch;
+ 
++	if (cfq_cfqq_must_dispatch(cfqq))
++		return true;
++
+ 	/*
+ 	 * Drain async requests before we start sync IO
+ 	 */
+@@ -3468,15 +3470,20 @@ static bool cfq_dispatch_request(struct
+ 
+ 	BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
+ 
++	rq = cfq_check_fifo(cfqq);
++	if (rq)
++		cfq_mark_cfqq_must_dispatch(cfqq);
++
+ 	if (!cfq_may_dispatch(cfqd, cfqq))
+ 		return false;
+ 
+ 	/*
+ 	 * follow expired path, else get first next available
+ 	 */
+-	rq = cfq_check_fifo(cfqq);
+ 	if (!rq)
+ 		rq = cfqq->next_rq;
++	else
++		cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
+ 
+ 	/*
+ 	 * insert request into driver dispatch list
+@@ -3944,7 +3951,7 @@ cfq_should_preempt(struct cfq_data *cfqd
+ 	 * if the new request is sync, but the currently running queue is
+ 	 * not, let the sync request have priority.
+ 	 */
+-	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
++	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
+ 		return true;
+ 
+ 	if (new_cfqq->cfqg != cfqq->cfqg)
diff --git a/queue-4.4/series b/queue-4.4/series
index 9ce2174e845..704e616a3f7 100644
--- a/queue-4.4/series
+++ b/queue-4.4/series
@@ -22,3 +22,4 @@ crypto-ghash-generic-move-common-definitions-to-a-new-header-file.patch
 crypto-vmx-fix-memory-corruption-caused-by-p8_ghash.patch
 dlm-free-workqueues-after-the-connections.patch
 vfs-move-permission-checking-into-notify_change-for-utimes-null.patch
+cfq-fix-starvation-of-asynchronous-writes.patch
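
As added context for the hunks above, here is a minimal, self-contained C sketch of
the reordering the patch introduces: the FIFO is now checked before the
dispatch-permission test, an expired FIFO head sets a must_dispatch flag that lets
an async queue pass both the "drain async before sync" check and the
sync-preemption check, and the flag is cleared again once a request is dispatched.
This is only an illustration under simplified assumptions; the types and helpers
here (sketch_queue, sketch_rq, may_dispatch, dispatch_one) are hypothetical
stand-ins, not the actual cfq-iosched.c structures.

/*
 * Sketch of the dispatch ordering introduced by the patch.
 * All names are simplified stand-ins for the cfq-iosched internals.
 */
#include <stdbool.h>
#include <stdio.h>

struct sketch_rq {
	unsigned long fifo_time;	/* deadline assigned at insert time */
	bool sync;
};

struct sketch_queue {
	bool sync;			/* queue carries sync requests */
	bool must_dispatch;		/* new flag: FIFO head already expired */
	struct sketch_rq *fifo_head;	/* oldest request, may be NULL */
	struct sketch_rq *next_rq;	/* normal next candidate */
	int sync_in_flight;		/* sync requests still in the driver */
};

/* Analogue of cfq_check_fifo(): return the head only if it has expired. */
static struct sketch_rq *check_fifo(struct sketch_queue *q, unsigned long now)
{
	struct sketch_rq *rq = q->fifo_head;

	if (rq && now < rq->fifo_time)
		rq = NULL;
	return rq;
}

/* Analogue of cfq_may_dispatch(): async IO normally waits for sync IO to
 * drain, but a must_dispatch queue is allowed through unconditionally. */
static bool may_dispatch(struct sketch_queue *q)
{
	if (q->must_dispatch)
		return true;
	if (!q->sync && q->sync_in_flight > 0)
		return false;
	return true;
}

/* Analogue of the patched cfq_dispatch_request(): look at the FIFO first,
 * mark must_dispatch if the head already expired, then test permission. */
static bool dispatch_one(struct sketch_queue *q, unsigned long now)
{
	struct sketch_rq *rq = check_fifo(q, now);

	if (rq)
		q->must_dispatch = true;

	if (!may_dispatch(q))
		return false;

	if (!rq)
		rq = q->next_rq;
	if (!rq)
		return false;

	printf("dispatching %s request (must_dispatch=%d)\n",
	       rq->sync ? "sync" : "async", q->must_dispatch);
	q->must_dispatch = false;	/* cleared again after the dispatch */
	return true;
}

int main(void)
{
	struct sketch_rq expired = { .fifo_time = 100, .sync = false };
	struct sketch_queue async_q = {
		.sync = false,
		.fifo_head = &expired,
		.next_rq = &expired,
		.sync_in_flight = 3,	/* in-flight sync IO would normally block us */
	};

	/* Without the flag this would bail in may_dispatch(); with the
	 * expired FIFO head it dispatches despite the in-flight sync IO. */
	dispatch_one(&async_q, 200);
	return 0;
}

Run as-is, the sketch reports that the async request is dispatched even though
sync IO is in flight, which is the behaviour change the commit message describes;
in the real scheduler the same flag also prevents an incoming sync request from
preempting the queue before that dispatch happens.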