]>
Commit | Line | Data |
---|---|---|
2532c0bb CW |
1 | From 7658cc289288b8ae7dd2c2224549a048431222b3 Mon Sep 17 00:00:00 2001 |
2 | From: Linus Torvalds <torvalds@macmini.osdl.org> | |
3 | Date: Fri, 29 Dec 2006 10:00:58 -0800 | |
4 | Subject: VM: Fix nasty and subtle race in shared mmap'ed page writeback | |
5 | ||
6 | The VM layer (on the face of it, fairly reasonably) expected that when | |
7 | it does a ->writepage() call to the filesystem, it would write out the | |
8 | full page at that point in time. Especially since it had earlier marked | |
9 | the whole page dirty with "set_page_dirty()". | |
10 | ||
11 | But that isn't actually the case: ->writepage() does not actually write | |
12 | a page, it writes the parts of the page that have been explicitly marked | |
13 | dirty before, *and* that had not got written out for other reasons since | |
14 | the last time we told it they were dirty. | |
15 | ||
16 | That last caveat is the important one. | |
17 | ||
18 | Which _most_ of the time ends up being the whole page (since we had | |
19 | called "set_page_dirty()" on the page earlier), but if the filesystem | |
20 | had done any dirty flushing of its own (for example, to honor some | |
21 | internal write ordering guarantees), it might end up doing only a | |
22 | partial page IO (or none at all) when ->writepage() is actually called. | |
23 | ||
24 | That is the correct thing in general (since we actually often _want_ | |
25 | only the known-dirty parts of the page to be written out), but the | |
26 | shared dirty page handling had implicitly forgotten about these details, | |
27 | and had a number of cases where it was doing just the "->writepage()" | |
28 | part, without telling the low-level filesystem that the whole page might | |
29 | have been re-dirtied as part of being mapped writably into user space. | |
30 | ||
31 | Since most of the time the FS did actually write out the full page, we | |
32 | didn't notice this for a loong time, and this needed some really odd | |
33 | patterns to trigger. But it caused occasional corruption with rtorrent | |
34 | and with the Debian "apt" database, because both use shared mmaps to | |
35 | update the end result. | |
36 | ||
37 | This fixes it. Finally. After way too much hair-pulling. | |
38 | ||
39 | Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> | |
40 | Acked-by: Martin J. Bligh <mbligh@google.com> | |
41 | Acked-by: Martin Michlmayr <tbm@cyrius.com> | |
42 | Acked-by: Martin Johansson <martin@fatbob.nu> | |
43 | Acked-by: Ingo Molnar <mingo@elte.hu> | |
44 | Acked-by: Andrei Popa <andrei.popa@i-neo.ro> | |
45 | Cc: High Dickins <hugh@veritas.com> | |
46 | Cc: Andrew Morton <akpm@osdl.org>, | |
47 | Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> | |
48 | Cc: Segher Boessenkool <segher@kernel.crashing.org> | |
49 | Cc: David Miller <davem@davemloft.net> | |
50 | Cc: Arjan van de Ven <arjan@infradead.org> | |
51 | Cc: Gordon Farquharson <gordonfarquharson@gmail.com> | |
52 | Cc: Guillaume Chazarain <guichaz@yahoo.fr> | |
53 | Cc: Theodore Tso <tytso@mit.edu> | |
54 | Cc: Kenneth Cheng <kenneth.w.chen@intel.com> | |
55 | Cc: Tobias Diedrich <ranma@tdiedrich.de> | |
56 | Signed-off-by: Linus Torvalds <torvalds@osdl.org> | |
57 | [chrisw: backport to 2.6.19.1] | |
58 | Signed-off-by: Chris Wright <chrisw@sous-sol.org> | |
59 | --- | |
60 | mm/page-writeback.c | 39 ++++++++++++++++++++++++++++++++++----- | |
61 | 1 file changed, 34 insertions(+), 5 deletions(-) | |
62 | ||
63 | --- linux-2.6.19.1.orig/mm/page-writeback.c | |
64 | +++ linux-2.6.19.1/mm/page-writeback.c | |
65 | @@ -893,12 +893,41 @@ int clear_page_dirty_for_io(struct page | |
66 | { | |
67 | struct address_space *mapping = page_mapping(page); | |
68 | ||
69 | - if (mapping) { | |
70 | + if (mapping && mapping_cap_account_dirty(mapping)) { | |
71 | + /* | |
72 | + * Yes, Virginia, this is indeed insane. | |
73 | + * | |
74 | + * We use this sequence to make sure that | |
75 | + * (a) we account for dirty stats properly | |
76 | + * (b) we tell the low-level filesystem to | |
77 | + * mark the whole page dirty if it was | |
78 | + * dirty in a pagetable. Only to then | |
79 | + * (c) clean the page again and return 1 to | |
80 | + * cause the writeback. | |
81 | + * | |
82 | + * This way we avoid all nasty races with the | |
83 | + * dirty bit in multiple places and clearing | |
84 | + * them concurrently from different threads. | |
85 | + * | |
86 | + * Note! Normally the "set_page_dirty(page)" | |
87 | + * has no effect on the actual dirty bit - since | |
88 | + * that will already usually be set. But we | |
89 | + * need the side effects, and it can help us | |
90 | + * avoid races. | |
91 | + * | |
92 | + * We basically use the page "master dirty bit" | |
93 | + * as a serialization point for all the different | |
94 | + * threads doing their things. | |
95 | + * | |
96 | + * FIXME! We still have a race here: if somebody | |
97 | + * adds the page back to the page tables in | |
98 | + * between the "page_mkclean()" and the "TestClearPageDirty()", | |
99 | + * we might have it mapped without the dirty bit set. | |
100 | + */ | |
101 | + if (page_mkclean(page)) | |
102 | + set_page_dirty(page); | |
103 | if (TestClearPageDirty(page)) { | |
104 | - if (mapping_cap_account_dirty(mapping)) { | |
105 | - page_mkclean(page); | |
106 | - dec_zone_page_state(page, NR_FILE_DIRTY); | |
107 | - } | |
108 | + dec_zone_page_state(page, NR_FILE_DIRTY); | |
109 | return 1; | |
110 | } | |
111 | return 0; |