]>
Commit | Line | Data |
---|---|---|
11f02f71 GKH |
1 | From 02a3307aa9c20b4f6626255b028f07f6cfa16feb Mon Sep 17 00:00:00 2001 |
2 | From: Liu Bo <bo.liu@linux.alibaba.com> | |
3 | Date: Wed, 16 May 2018 01:37:36 +0800 | |
4 | Subject: btrfs: fix reading stale metadata blocks after degraded raid1 mounts | |
5 | ||
6 | From: Liu Bo <bo.liu@linux.alibaba.com> | |
7 | ||
8 | commit 02a3307aa9c20b4f6626255b028f07f6cfa16feb upstream. | |
9 | ||
10 | If a btree block, aka. extent buffer, is not available in the extent | |
11 | buffer cache, it'll be read out from the disk instead, i.e. | |
12 | ||
13 | btrfs_search_slot() | |
14 | read_block_for_search() # hold parent and its lock, go to read child | |
15 | btrfs_release_path() | |
16 | read_tree_block() # read child | |
17 | ||
18 | Unfortunately, the parent lock got released before reading child, so | |
19 | commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had | |
20 | used 0 as parent transid to read the child block. It forces | |
21 | read_tree_block() not to check if parent transid is different with the | |
22 | generation id of the child that it reads out from disk. | |
23 | ||
24 | A simple PoC is included in btrfs/124, | |
25 | ||
26 | 0. A two-disk raid1 btrfs, | |
27 | ||
28 | 1. Right after mkfs.btrfs, block A is allocated to be device tree's root. | |
29 | ||
30 | 2. Mount this filesystem and put it in use, after a while, device tree's | |
31 | root got COW but block A hasn't been allocated/overwritten yet. | |
32 | ||
33 | 3. Umount it and reload the btrfs module to remove both disks from the | |
34 | global @fs_devices list. | |
35 | ||
36 | 4. mount -odegraded dev1 and write some data, so now block A is allocated | |
37 | to be a leaf in checksum tree. Note that only dev1 has the latest | |
38 | metadata of this filesystem. | |
39 | ||
40 | 5. Umount it and mount it again normally (with both disks), since raid1 | |
41 | can pick up one disk by the writer task's pid, if btrfs_search_slot() | |
42 | needs to read block A, dev2 which does NOT have the latest metadata | |
43 | might be read for block A, then we got a stale block A. | |
44 | ||
45 | 6. As parent transid is not checked, block A is marked as uptodate and | |
46 | put into the extent buffer cache, so the future search won't bother | |
47 | to read disk again, which means it'll make changes on this stale | |
48 | one and make it dirty and flush it onto disk. | |
49 | ||
50 | To avoid the problem, parent transid needs to be passed to | |
51 | read_tree_block(). | |
52 | ||
53 | In order to get a valid parent transid, we need to hold the parent's | |
54 | lock until finishing reading child. | |
55 | ||
56 | This patch needs to be slightly adapted for stable kernels, the | |
57 | &first_key parameter added to read_tree_block() is from 4.16+ | |
58 | (581c1760415c4). The fix is to replace 0 by 'gen'. | |
59 | ||
60 | Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race") | |
61 | CC: stable@vger.kernel.org # 4.4+ | |
62 | Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> | |
63 | Reviewed-by: Filipe Manana <fdmanana@suse.com> | |
64 | Reviewed-by: Qu Wenruo <wqu@suse.com> | |
65 | [ update changelog ] | |
66 | Signed-off-by: David Sterba <dsterba@suse.com> | |
67 | Signed-off-by: Nikolay Borisov <nborisov@suse.com> | |
68 | Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> | |
69 | ||
70 | --- | |
71 | fs/btrfs/ctree.c | 6 +++--- | |
72 | 1 file changed, 3 insertions(+), 3 deletions(-) | |
73 | ||
74 | --- a/fs/btrfs/ctree.c | |
75 | +++ b/fs/btrfs/ctree.c | |
76 | @@ -2497,10 +2497,8 @@ read_block_for_search(struct btrfs_trans | |
77 | if (p->reada) | |
78 | reada_for_search(root, p, level, slot, key->objectid); | |
79 | ||
80 | - btrfs_release_path(p); | |
81 | - | |
82 | ret = -EAGAIN; | |
83 | - tmp = read_tree_block(root, blocknr, 0); | |
84 | + tmp = read_tree_block(root, blocknr, gen); | |
85 | if (!IS_ERR(tmp)) { | |
86 | /* | |
87 | * If the read above didn't mark this buffer up to date, | |
88 | @@ -2512,6 +2510,8 @@ read_block_for_search(struct btrfs_trans | |
89 | ret = -EIO; | |
90 | free_extent_buffer(tmp); | |
91 | } | |
92 | + | |
93 | + btrfs_release_path(p); | |
94 | return ret; | |
95 | } | |
96 |