From: Willy Tarreau <w@1wt.eu>
Date: Thu, 18 Jul 2013 19:49:32 +0000 (+0200)
Subject: BUG/MEDIUM: splicing: fix abnormal CPU usage with splicing
X-Git-Tag: v1.5-dev20~332
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=61d39a0e2a047df78f7f3bfcf5584090913cdc65;p=thirdparty%2Fhaproxy.git

BUG/MEDIUM: splicing: fix abnormal CPU usage with splicing

Mark Janssen reported an issue in 1.5-dev19 which was introduced
in 1.5-dev12 by commit 96199b10. From time to time, randomly, the
CPU usage spikes to 100% for seconds to minutes.

A deep analysis of the traces provided shows that it happens when
waiting for the response to a second pipelined HTTP request, or
when trying to handle the received shutdown advertised by epoll()
after the last block of data. Each time, splice() was involved with
data pending in the pipe.

The cause of this was that such events could not be taken into account
by splice nor by recv and were left pending :

  - the transfer of the last block of data, optionally with a shutdown
    was not handled by splice() because of the validation that to_forward
    is higher than MIN_SPLICE_FORWARD ;

  - the next recv() call was inhibited because of the test on presence
    of data in the pipe. This is also what prevented the recv() call
    from handling a response to a pipelined request until the client
    had ACKed the previous response.

No less than 4 different methods were experimented to fix this, and the
current one was finally chosen. The principle is that if an event is not
caught by splice(), then it MUST be caught by recv(). So we remove the
condition on the pipe's emptiness to perform an recv(), and in order to
prevent recv() from being used in the middle of a transfer, we mark
supposedly full pipes with CO_FL_WAIT_ROOM, which makes sense because
the reason for stopping a splice()-based receive is that the pipe is
supposed to be full.

The net effect is that we don't wake up and sleep in loops during these
transient states. This happened much more often than expected, sometimes
for a few cycles at end of transfers, but rarely long enough to be
noticed, unless a client timed out with data pending in the pipe. The
effect on CPU usage is visible even when transfering 1MB objects in
pipeline, where the CPU usage drops from 10 to 6% on a small machine at
medium bandwidth.

Some further improvements are needed :
  - the last chunk of a splice() transfer is never done using splice due
    to the test on to_forward. This is wrong and should be performed with
    splice if the pipe has not yet been emptied ;

  - si_chk_snd() should not be called when the write event is already being
    polled, otherwise we're almost certain to get EAGAIN.

Many thanks to Mark for all the traces he cared to provide, they were
essential for understanding this issue which was not reproducible
without.

Only 1.5-dev is affected, no backport is needed.
---

diff --git a/src/raw_sock.c b/src/raw_sock.c
index e030253591..b9bb8dcb1e 100644
--- a/src/raw_sock.c
+++ b/src/raw_sock.c
@@ -165,6 +165,7 @@ int raw_sock_to_pipe(struct connection *conn, struct pipe *pipe, unsigned int co
 			/* We've read enough of it for this time, let's stop before
 			 * being asked to poll.
 			 */
+			conn->flags |= CO_FL_WAIT_ROOM;
 			break;
 		}
 	} /* while */
diff --git a/src/stream_interface.c b/src/stream_interface.c
index 8a21d398a3..90e4044fb0 100644
--- a/src/stream_interface.c
+++ b/src/stream_interface.c
@@ -964,20 +964,29 @@ static void si_conn_recv_cb(struct connection *conn)
 		if (conn->flags & CO_FL_ERROR)
 			goto out_error;
 
-		if (conn->flags & CO_FL_WAIT_ROOM) /* most likely the pipe is full */
+		if (conn->flags & CO_FL_WAIT_ROOM) {
+			/* the pipe is full or we have read enough data that it
+			 * could soon be full. Let's stop before needing to poll.
+			 */
 			si->flags |= SI_FL_WAIT_ROOM;
+			__conn_data_stop_recv(conn);
+		}
 
 		/* splice not possible (anymore), let's go on on standard copy */
 	}
 
  abort_splice:
-	/* release the pipe if we can, which is almost always the case */
-	if (chn->pipe && !chn->pipe->data) {
+	if (chn->pipe && unlikely(!chn->pipe->data)) {
 		put_pipe(chn->pipe);
 		chn->pipe = NULL;
 	}
 
-	while (!chn->pipe && !(conn->flags & (CO_FL_ERROR | CO_FL_SOCK_RD_SH | CO_FL_DATA_RD_SH | CO_FL_WAIT_RD | CO_FL_WAIT_ROOM | CO_FL_HANDSHAKE))) {
+	/* Important note : if we're called with POLL_IN|POLL_HUP, it means the read polling
+	 * was enabled, which implies that the recv buffer was not full. So we have a guarantee
+	 * that if such an event is not handled above in splice, it will be handled here by
+	 * recv().
+	 */
+	while (!(conn->flags & (CO_FL_ERROR | CO_FL_SOCK_RD_SH | CO_FL_DATA_RD_SH | CO_FL_WAIT_RD | CO_FL_WAIT_ROOM | CO_FL_HANDSHAKE))) {
 		max = bi_avail(chn);
 
 		if (!max) {