From 651e22f2701b4113989237c3048d17337dd2185c Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
Date: Wed, 6 Aug 2014 14:11:33 -0400
Subject: ring-buffer: Always reset iterator to reader page

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

commit 651e22f2701b4113989237c3048d17337dd2185c upstream.

When performing a consuming read, the ring buffer swaps out a
page from the ring buffer with an empty page, and the page that
was swapped out becomes the new reader page. The reader page
is owned by the reader, and since it was swapped out of the ring
buffer, writers do not have access to it (there is an exception
to that rule, but it is out of scope for this commit).

Reading the "trace" file is a non-consuming read, which means
that the data in the ring buffer will not be modified. When the
trace file is opened, a ring buffer iterator is allocated and writes
to the ring buffer are disabled, so that the iterator will not have
issues iterating over the data.

Although writes are disabled, the ring buffer does not disable other
reads, or even consuming reads. If a consuming read happens, then
the iterator is reset and starts reading from the beginning again.

My tests would sometimes trigger this bug on my i386 box:

WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
Modules linked in:
CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
 00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
 f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
 ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80
Call Trace:
 [<c18796b3>] dump_stack+0x4b/0x75
 [<c103a0e3>] warn_slowpath_common+0x7e/0x95
 [<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
 [<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
 [<c103a185>] warn_slowpath_fmt+0x33/0x35
 [<c10bd85a>] __trace_find_cmdline+0x66/0xaa
 [<c10bed04>] trace_find_cmdline+0x40/0x64
 [<c10c3c16>] trace_print_context+0x27/0xec
 [<c10c4360>] ? trace_seq_printf+0x37/0x5b
 [<c10c0b15>] print_trace_line+0x319/0x39b
 [<c10ba3fb>] ? ring_buffer_read+0x47/0x50
 [<c10c13b1>] s_show+0x192/0x1ab
 [<c10bfd9a>] ? s_next+0x5a/0x7c
 [<c112e76e>] seq_read+0x267/0x34c
 [<c1115a25>] vfs_read+0x8c/0xef
 [<c112e507>] ? seq_lseek+0x154/0x154
 [<c1115ba2>] SyS_read+0x54/0x7f
 [<c188488e>] syscall_call+0x7/0xb
---[ end trace 3f507febd6b4cc83 ]---
>>>> ##### CPU 1 buffer started ####

This was the __trace_find_cmdline() function complaining about the pid
in the event record being negative.

After adding more test cases, this would trigger more often. Strangely
enough, it would never trigger on a single test, but only when running
all the tests. I believe that was because it required one of the tests
to be shutting down via delayed instances while a new test started up.

After spending several days debugging this, I found that it was caused
by the iterator becoming corrupted. Debugging further, I found out why:
it happened in rb_iter_reset().

As consuming reads may not read the full reader page, but only part
of it, there is a "read" field to record where the last read took place.
The iterator must also start at that read position. In the rb_iter_reset()
code, if the reader page was disconnected from the ring buffer, the
iterator would start at the head page within the ring buffer (where
writes still happen). But the mistake there was that it still used the
"read" field to set the iterator's start on the head page, where it
should always start at zero, because readers never read from within
the ring buffer where writes occur.

I originally wrote a patch to have it set iter->head to 0 instead of
iter->head_page->read, but then I questioned why it wasn't always
setting the iterator to point to the reader page, as the reader page
is still valid. list_empty(reader_page->list) just means that the swap
was successful, but the reader_page may still hold data.

There was a bug report a long time ago, which was not reproducible,
about trace_pipe (a consuming read) not matching trace (an iterator
read). This may explain why that happened.

Anyway, the correct fix for this bug is to always use the reader page
and never reset the iterator to inside the writable ring buffer.

Fixes: d769041f8653 "ring_buffer: implement new locking"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 kernel/trace/ring_buffer.c |   17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3357,21 +3357,16 @@ static void rb_iter_reset(struct ring_bu
 	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
 
 	/* Iterator usage is expected to have record disabled */
-	if (list_empty(&cpu_buffer->reader_page->list)) {
-		iter->head_page = rb_set_head_page(cpu_buffer);
-		if (unlikely(!iter->head_page))
-			return;
-		iter->head = iter->head_page->read;
-	} else {
-		iter->head_page = cpu_buffer->reader_page;
-		iter->head = cpu_buffer->reader_page->read;
-	}
+	iter->head_page = cpu_buffer->reader_page;
+	iter->head = cpu_buffer->reader_page->read;
+
+	iter->cache_reader_page = iter->head_page;
+	iter->cache_read = iter->head;
+
 	if (iter->head)
 		iter->read_stamp = cpu_buffer->read_stamp;
 	else
 		iter->read_stamp = iter->head_page->page->time_stamp;
-	iter->cache_reader_page = cpu_buffer->reader_page;
-	iter->cache_read = cpu_buffer->read;
 }
 
 /**