Before commit 0366a1c70b89 ("powerpc/irq: Run softirqs off the top of
the irq stack"), check_stack_overflow() was called by do_IRQ(), before
switching to the irq stack.
In that commit, do_IRQ() was renamed __do_irq(), and is now executing
on the irq stack, so check_stack_overflow() has just become almost
useless.
Move check_stack_overflow() call in do_IRQ() to do the check while
still on the current stack.
ext4_check_dir_entry() currently does not catch a case when a directory
entry ends so close to the block end that the header of the next
directory entry would not fit in the remaining space. This can lead to
directory iteration code trying to access address beyond end of current
buffer head leading to oops.
We now use btrfs_can_overcommit() to see if we can flip a block group
read only. Before this would fail because we weren't taking into
account the usable un-allocated space for allocating chunks. With my
patches we were allowed to do the balance, which is technically correct.
The test is trying to start balance on degraded mount. So now we're
trying to allocate a chunk and cannot because we want to allocate a
RAID1 chunk, but there's only 1 device that's available for usage. This
results in an ENOSPC.
But we shouldn't even be making it this far, we don't have enough
devices to restripe. The problem is we're using btrfs_num_devices(),
that also includes missing devices. That's not actually what we want, we
need to use rw_devices.
The chunk_mutex is not needed here, rw_devices changes only in device
add, remove or replace, all are excluded by EXCL_OP mechanism.
Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ add stacktrace, update changelog, drop chunk_mutex ] Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
btrfs_del_root_ref() will simply WARN_ON() if the ref doesn't match in
any way, and then continue to delete the reference. This shouldn't
happen, we have these values because there's more to the reference than
the original root and the sub root. If any of these checks fail, return
-ENOENT.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Although it is safe to call this on already released paths with no locks
held or extent buffers, removing the redundant btrfs_release_path is
reasonable.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 3.16 as dependency of commit d49d3287e74f
"btrfs: fix invalid removal of root ref"] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When starting writeback for a range that covers part of a preallocated
extent, due to a race with writeback for another range that also covers
another part of the same preallocated extent, we can end up in an infinite
loop.
Consider the following example where for inode 280 we have two dirty
ranges:
range A, from 294912 to 303103, 8192 bytes
range B, from 348160 to 438271, 90112 bytes
and we have the following file extent item layout for our inode:
leaf 38895616 gen 24544 total ptrs 29 free space 13820 owner 5
(...)
item 27 key (280 108 200704) itemoff 14598 itemsize 53
extent data disk bytenr 0 nr 0 type 1 (regular)
extent data offset 0 nr 94208 ram 94208
item 28 key (280 108 294912) itemoff 14545 itemsize 53
extent data disk bytenr 10433052672 nr 81920 type 2 (prealloc)
extent data offset 0 nr 81920 ram 81920
Then the following happens:
1) Writeback starts for range B (from 348160 to 438271), execution of
run_delalloc_nocow() starts;
2) The first iteration of run_delalloc_nocow()'s whil loop leaves us at
the extent item at slot 28, pointing to the prealloc extent item
covering the range from 294912 to 376831. This extent covers part of
our range;
3) An ordered extent is created against that extent, covering the file
range from 348160 to 376831 (28672 bytes);
4) We adjust 'cur_offset' to 376832 and move on to the next iteration of
the while loop;
5) The call to btrfs_lookup_file_extent() leaves us at the same leaf,
pointing to slot 29, 1 slot after the last item (the extent item
we processed in the previous iteration);
6) Because we are a slot beyond the last item, we call btrfs_next_leaf(),
which releases the search path before doing a another search for the
last key of the leaf (280 108 294912);
7) Right after btrfs_next_leaf() released the path, and before it did
another search for the last key of the leaf, writeback for the range
A (from 294912 to 303103) completes (it was previously started at
some point);
8) Upon completion of the ordered extent for range A, the prealloc extent
we previously found got split into two extent items, one covering the
range from 294912 to 303103 (8192 bytes), with a type of regular extent
(and no longer prealloc) and another covering the range from 303104 to
376831 (73728 bytes), with a type of prealloc and an offset of 8192
bytes. So our leaf now has the following layout:
leaf 38895616 gen 24544 total ptrs 31 free space 13664 owner 5
(...)
item 27 key (280 108 200704) itemoff 14598 itemsize 53
extent data disk bytenr 0 nr 0 type 1
extent data offset 0 nr 8192 ram 94208
item 28 key (280 108 208896) itemoff 14545 itemsize 53
extent data disk bytenr 10433142784 nr 86016 type 1
extent data offset 0 nr 86016 ram 86016
item 29 key (280 108 294912) itemoff 14492 itemsize 53
extent data disk bytenr 10433052672 nr 81920 type 1
extent data offset 0 nr 8192 ram 81920
item 30 key (280 108 303104) itemoff 14439 itemsize 53
extent data disk bytenr 10433052672 nr 81920 type 2
extent data offset 8192 nr 73728 ram 81920
9) After btrfs_next_leaf() returns, we have our path pointing to that same
leaf and at slot 30, since it has a key we didn't have before and it's
the first key greater then the key that was previously the last key of
the leaf (key (280 108 294912));
10) The extent item at slot 30 covers the range from 303104 to 376831
which is in our target range, so we process it, despite having already
created an ordered extent against this extent for the file range from
348160 to 376831. This is because we skip to the next extent item only
if its end is less than or equals to the start of our delalloc range,
and not less than or equals to the current offset ('cur_offset');
12) We then call create_io_em() for a 0 bytes range starting at offset
376832;
13) Then create_io_em() enters an infinite loop because its calls to
btrfs_drop_extent_cache() do nothing due to the 0 length range
passed to it. So no existing extent maps that cover the offset
376832 get removed, and therefore calls to add_extent_mapping()
return -EEXIST, resulting in an infinite loop. This loop from
create_io_em() is the following:
do {
btrfs_drop_extent_cache(BTRFS_I(inode), em->start,
em->start + em->len - 1, 0);
write_lock(&em_tree->lock);
ret = add_extent_mapping(em_tree, em, 1);
write_unlock(&em_tree->lock);
/*
* The caller has taken lock_extent(), who could race with us
* to add em?
*/
} while (ret == -EEXIST);
Also, each call to btrfs_drop_extent_cache() triggers a warning because
the start offset passed to it (376832) is smaller then the end offset
(376832 - 1) passed to it by -1, due to the 0 length:
Which results in flooding dmesg/syslog since btrfs_drop_extent_cache()
uses WARN_ON() and not WARN_ON_ONCE().
So fix this issue by changing run_delalloc_nocow()'s loop to move to the
next extent item when the current extent item ends at at offset less than
or equals to the current offset instead of the start offset.
Fixes: 80ff385665b7fc ("Btrfs: update nodatacow code v2") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
If we fail to read the fs root corresponding with a reloc root we'll
just break out and free the reloc roots. But we remove our current
reloc_root from this list higher up, which means we'll leak this
reloc_root. Fix this by adding ourselves back to the reloc_roots list
so we are properly cleaned up.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
My fsstress modifications coupled with generic/475 uncovered a failure
to mount and replay the log if we hit a orphaned root. We do not want
to replay the log for an orphan root, but it's completely legitimate to
have an orphaned root with a log attached. Fix this by simply skipping
replaying the log. We still need to pin it's root node so that we do
not overwrite it while replaying other logs, as we re-read the log root
at every stage of the replay.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 3.16:
- Pass fs_info->extent_root, instead of fs_info, as first argument to
btrfs_pin_extent_for_log_replay()
- Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the
uuid tree we'll just continue and do btrfs_next_item(). However we've
done a btrfs_release_path() at this point and no longer have a valid
path. So increment the key and go back and do a normal search.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We can just abort the transaction here, and in fact do that for every
other failure in this function except these two cases.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 3.16:
- Add root as second argument to btrfs_abort_transaction()
- Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When a tree mod log user no longer needs to use the tree it calls
btrfs_put_tree_mod_seq() to remove itself from the list of users and
delete all no longer used elements of the tree's red black tree, which
should be all elements with a sequence number less then our equals to
the caller's sequence number. However the logic is broken because it
can delete and free elements from the red black tree that have a
sequence number greater then the caller's sequence number:
1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the
tree mod log;
2) The task which got assigned the sequence number 1 calls
btrfs_put_tree_mod_seq();
3) Sequence number 1 is deleted from the list of sequence numbers;
4) The current minimum sequence number is computed to be the sequence
number 2;
5) A task using sequence number 2 is at tree_mod_log_rewind() and gets
a pointer to one of its elements from the red black tree through
a call to tree_mod_log_search();
6) The task with sequence number 1 iterates the red black tree of tree
modification elements and deletes (and frees) all elements with a
sequence number less then or equals to 2 (the computed minimum sequence
number) - it ends up only leaving elements with sequence numbers of 3
and 4;
7) The task with sequence number 2 now uses the pointer to its element,
already freed by the other task, at __tree_mod_log_rewind(), resulting
in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces
a trace like the following:
Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that
have a sequence number equals to the computed minimum sequence number, and
not just elements with a sequence number greater then that minimum.
Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Testing with the new fsstress uncovered a pretty nasty deadlock with
lookup and snapshot deletion.
Process A
unlink
-> final iput
-> inode_tree_del
-> synchronize_srcu(subvol_srcu)
Process B
btrfs_lookup <- srcu_read_lock() acquired here
-> btrfs_iget
-> find inode that has I_FREEING set
-> __wait_on_freeing_inode()
We're holding the srcu_read_lock() while doing the iget in order to make
sure our fs root doesn't go away, and then we are waiting for the inode
to finish freeing. However because the free'ing process is doing a
synchronize_srcu() we deadlock.
Fix this by dropping the synchronize_srcu() in inode_tree_del(). We
don't need people to stop accessing the fs root at this point, we're
only adding our empty root to the dead roots list.
A larger much more invasive fix is forthcoming to address how we deal
with fs roots, but this fixes the immediate problem.
Fixes: 76dda93c6ae2 ("Btrfs: add snapshot/subvolume destroy ioctl") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 3.16: No fs_info variable was used here] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We should not be leaving half-mapped usages with potentially invalid
keycodes, as that may confuse hidinput_find_key() when the key is located
by index, which may end up feeding way too large keycode into the VT
keyboard handler and cause OOB write there:
BUG: KASAN: global-out-of-bounds in clear_bit include/asm-generic/bitops-instrumented.h:56 [inline]
BUG: KASAN: global-out-of-bounds in kbd_keycode drivers/tty/vt/keyboard.c:1411 [inline]
BUG: KASAN: global-out-of-bounds in kbd_event+0xe6b/0x3790 drivers/tty/vt/keyboard.c:1495
Write of size 8 at addr ffffffff89a1b2d8 by task syz-executor108/1722
...
kbd_keycode drivers/tty/vt/keyboard.c:1411 [inline]
kbd_event+0xe6b/0x3790 drivers/tty/vt/keyboard.c:1495
input_to_handler+0x3b6/0x4c0 drivers/input/input.c:118
input_pass_values.part.0+0x2e3/0x720 drivers/input/input.c:145
input_pass_values drivers/input/input.c:949 [inline]
input_set_keycode+0x290/0x320 drivers/input/input.c:954
evdev_handle_set_keycode_v2+0xc4/0x120 drivers/input/evdev.c:882
evdev_do_ioctl drivers/input/evdev.c:1150 [inline]
Reported-by: syzbot+19340dff067c2d3835c0@syzkaller.appspotmail.com Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Tested-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
However it is not above.
Hence, make do above remove sequence.
Fixes: fa417f7b520ee ("IB/mlx4: Add support for IBoE") Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Link: https://lore.kernel.org/r/20191212091214.315005-3-leon@kernel.org Signed-off-by: Doug Ledford <dledford@redhat.com>
[bwh: Backported to 3.16: No need to call mlx4_ib_diag_cleanup()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When device is being removed (e.g during VPI port link type change
from ETH to IB), tasks for gid table changes should not be executed.
Flush the current queue of tasks and block further tasks from entering the queue.
Signed-off-by: Moni Shoua <monis@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
As the commit 677fe555cbfb ("serial: imx: Fix recursive locking bug")
has mentioned the uart driver might cause recursive locking between
normal printing and the kernel debugging facilities (e.g. sysrq and
oops). In the commit it gave out suggestion for fixing recursive
locking issue: "The solution is to avoid locking in the sysrq case
and trylock in the oops_in_progress case."
This patch follows the suggestion (also used the exactly same code with
other serial drivers, e.g. amba-pl011.c) to fix the recursive locking
issue, this can avoid stuck caused by deadlock and print out log for
sysrq and oops.
Fixes: 04896a77a97b ("msm_serial: serial driver for MSM7K onboard serial peripheral.") Signed-off-by: Leo Yan <leo.yan@linaro.org> Reviewed-by: Jeffrey Hugo <jeffrey.l.hugo@gmail.com> Link: https://lore.kernel.org/r/20191127141544.4277-2-leo.yan@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The original ID that was added for Comet Lake PCH was
actually for the -LP (low power) variant even though the
constant for it said CMLH. Changing that while at it.
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Acked-by: Felipe Balbi <balbi@kernel.org> Link: https://lore.kernel.org/r/20191212093713.60614-1-heikki.krogerus@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.16:
- No properties support
- Adjust context, indentation] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
This patch adds the necessary PCI ID for TGP-LP devices.
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
[bwh: Backported to 3.16:
- No properties support
- Adjust context, indentation] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
[bwh: Backported to 3.16:
- No properties support
- Adjust context, indentation] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
[bwh: Backported to 3.16:
- No properties support
- Adjust context, indentation] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
[bwh: Backported to 3.16: adjust context, indentation] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Intel Cannonlake PCH has the same DWC3 than Intel
Sunrisepoint. Add the new IDs to the supported devices.
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
[bwh: Backported to 3.16: adjust context, indentation] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Intel Gemini Lake SoC has the same DWC3 than Broxton. Add
the new ID to the supported Devices.
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
[bwh: Backported to 3.16: adjust context, indentation] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Intel Kabylake PCH has the same DWC3 than Intel
Sunrisepoint. Add the new ID to the supported devices.
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
[bwh: Backported to 3.16: adjust context, indentation] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
BXT-M is a Intel Broxton SoC based platform with unique PCI ID.
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
[bwh: Backported to 3.16: adjust context, indentation] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Felipe Balbi <balbi@ti.com>
[bwh: Backported to 3.16: adjust context, indentation] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Felipe Balbi <balbi@ti.com>
[bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The device controller is the same but it has different PCI ID. Add this new
ID to the driver's list of supported IDs.
Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
spin_unlock_irqrestore() might be called with stale flags after
reading port status, possibly restoring interrupts to a incorrect
state.
If a usb2 port just finished resuming while the port status is read
the spin lock will be temporary released and re-acquired in a separate
function. The flags parameter is passed as value instead of a pointer,
not updating flags properly before the final spin_unlock_irqrestore()
is called.
xhci driver claims it needs XHCI_TRUST_TX_LENGTH quirk for both
Broadcom/Cavium and a Renesas xHC controllers.
The quirk was inteded for handling false "success" complete event for
transfers that had data left untransferred.
These transfers should complete with "short packet" events instead.
In these two new cases the false "success" completion is reported
after a "short packet" if the TD consists of several TRBs.
xHCI specs 4.10.1.1.2 say remaining TRBs should report "short packet"
as well after the first short packet in a TD, but this issue seems so
common it doesn't make sense to add the quirk for all vendors.
Turn these events into short packets automatically instead.
This gets rid of the "The WARN Successful completion on short TX for
slot 1 ep 1: needs XHCI_TRUST_TX_LENGTH quirk" warning in many cases.
Reported-by: Eli Billauer <eli.billauer@gmail.com> Reported-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Eli Billauer <eli.billauer@gmail.com> Tested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20191211142007.8847-6-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The syzbot fuzzer found a slab-out-of-bounds bug in the HID report
handler. The bug was caused by a report descriptor which included a
field with size 12 bits and count 4899, for a total size of 7349
bytes.
The usbhid driver uses at most a single-page 4-KB buffer for reports.
In the test there wasn't any problem about overflowing the buffer,
since only one byte was received from the device. Rather, the bug
occurred when the HID core tried to extract the data from the report
fields, which caused it to try reading data beyond the end of the
allocated buffer.
This patch fixes the problem by rejecting any report whose total
length exceeds the HID_MAX_BUFFER_SIZE limit (minus one byte to allow
for a possible report index). In theory a device could have a report
longer than that, but if there was such a thing we wouldn't handle it
correctly anyway.
Reported-and-tested-by: syzbot+09ef48aa58261464b621@syzkaller.appspotmail.com Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
In case we have to migrate a ballon page to a newpage of another zone, the
managed page count of both zones is wrong. Paired with memory offlining
(which will adjust the managed page count), we can trigger kernel crashes
and all kinds of different symptoms.
In another instance (older kernel), I was able to observe this
(negative page count :/):
[ 180.896971] Offlined Pages 32768
[ 182.667462] Offlined Pages 32768
[ 184.408117] Offlined Pages 32768
[ 186.026321] Offlined Pages 32768
[ 187.684861] Offlined Pages 32768
[ 189.227013] Offlined Pages 32768
[ 190.830303] Offlined Pages 32768
[ 190.833071] Built 1 zonelists, mobility grouping on. Total pages: -36920272750453009
In another instance (older kernel), I was no longer able to start any
process:
[root@vm ~]# [ 214.348068] Offlined Pages 32768
[ 215.973009] Offlined Pages 32768
cat /proc/meminfo
-bash: fork: Cannot allocate memory
[root@vm ~]# cat /proc/meminfo
-bash: fork: Cannot allocate memory
Fix it by properly adjusting the managed page count when migrating if
the zone changed. The managed page count of the zones now looks after
unplug of the DIMM (and after deflating the balloon) just like before
inflating the balloon (and plugging+onlining the DIMM).
We'll temporarily modify the totalram page count. If this ever becomes a
problem, we can fine tune by providing helpers that don't touch
the totalram pages (e.g., adjust_zone_managed_page_count()).
Please note that fixing up the managed page count is only necessary when
we adjusted the managed page count when inflating - only if we
don't have VIRTIO_BALLOON_F_DEFLATE_ON_OOM. With that feature, the
managed page count is not touched when inflating/deflating.
Reported-by: Yumei Huang <yuhuang@redhat.com> Fixes: 3dcc0571cd64 ("mm: correctly update zone->managed_pages") Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: virtualization@lists.linux-foundation.org Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
[bwh: Backported to 3.16: Deflate-on-OOM is not supported at all so don't
check that flag] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
gpiolib has a corner case with open drain outputs that are emulated.
When such outputs are outputting a logic 1, emulation will set the
hardware to input mode, which will cause gpiod_get_direction() to
report that it is in input mode. This is different from the behaviour
with a true open-drain output.
Unify the semantics here.
Suggested-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Make sure to use the current alternate setting when verifying the
interface descriptors to avoid binding to an invalid interface.
Failing to do so could cause the driver to misbehave or trigger a WARN()
in usb_submit_urb() that kernels with panic_on_warn set would choke on.
Fixes: 2865d42c78a9 ("staging: r8712u: Add the new driver to the mainline kernel") Signed-off-by: Johan Hovold <johan@kernel.org> Link: https://lore.kernel.org/r/20191210114751.5119-3-johan@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Make sure that the interrupt interface has an endpoint before trying to
access its endpoint descriptors to avoid dereferencing a NULL pointer.
The driver binds to the interrupt interface with interface number 0, but
must not assume that this interface or its current alternate setting are
the first entries in the corresponding configuration arrays.
Fixes: b72458a80c75 ("[PATCH] USB: Eagle and ADI 930 usb adsl modem driver") Signed-off-by: Johan Hovold <johan@kernel.org> Link: https://lore.kernel.org/r/20191210112601.3561-2-johan@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Make sure to use the current alternate setting when looking up the
endpoints on epic devices to avoid binding to an invalid interface.
Failing to do so could cause the driver to misbehave or trigger a WARN()
in usb_submit_urb() that kernels with panic_on_warn set would choke on.
Fixes: 6e8cf7751f9f ("USB: add EPIC support to the io_edgeport driver") Signed-off-by: Johan Hovold <johan@kernel.org> Link: https://lore.kernel.org/r/20191210112601.3561-5-johan@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The problem arises because our read() function grabs a lock of the
circular buffer, finds something of interest, then invokes copy_to_user()
straight from the buffer, which in turn takes mm->mmap_sem. In the same
time, the callback mon_bin_vma_fault() is invoked under mm->mmap_sem.
It attempts to take the fetch lock and deadlocks.
This patch does away with protecting of our page list with any
semaphores, and instead relies on the kernel not close the device
while mmap is active in a process.
In addition, we prohibit re-sizing of a buffer while mmap is active.
This way, when (now unlocked) fault is processed, it works with the
page that is intended to be mapped-in, and not some other random page.
Note that this may have an ABI impact, but hopefully no legitimate
program is this wrong.
Signed-off-by: Pete Zaitcev <zaitcev@redhat.com> Reported-by: syzbot+56f9673bb4cdcbeb0e92@syzkaller.appspotmail.com Reviewed-by: Alan Stern <stern@rowland.harvard.edu> Fixes: 46eb14a6e158 ("USB: fix usbmon BUG trigger") Link: https://lore.kernel.org/r/20191204203941.3503452b@suzdal.zaitcev.lan Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Explicitly initialize URB structure urb_list field in usb_init_urb().
This field can be potentially accessed uninitialized and its
initialization is coherent with the usage of list_del_init() in
usb_hcd_unlink_urb_from_ep() and usb_giveback_urb_bh() and its
explicit initialization in usb_hcd_submit_urb() error path.
Fix a general protection fault when accessing the endpoint descriptors
which could be triggered by a malicious device due to missing sanity
checks on the number of endpoints.
Reported-by: syzbot+35b1c403a14f5c89eba7@syzkaller.appspotmail.com Fixes: 07dc1f9f2f80 ("[PATCH] isdn4linux: Siemens Gigaset drivers - M105 USB DECT adapter") Cc: Hansjoerg Lipp <hjlipp@web.de> Cc: Tilman Schmidt <tilman@imap.cc> Signed-off-by: Johan Hovold <johan@kernel.org> Link: https://lore.kernel.org/r/20191202085610.12719-2-johan@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.16: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
On this error path we call qla4xxx_mem_free() and then the caller also
calls qla4xxx_free_adapter() which calls qla4xxx_mem_free(). It leads to a
couple double frees:
Fixes: afaf5a2d341d ("[SCSI] Initial Commit of qla4xxx") Link: https://lore.kernel.org/r/20191203094421.hw7ex7qr3j2rbsmx@kili.mountain Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Certain ACPI-enumerated devices represented as platform devices in
Linux, like fans, require special low-level power management handling
implemented by their drivers that is not in agreement with the ACPI
PM domain behavior. That leads to problems with managing ACPI fans
during system-wide suspend and resume.
For this reason, make acpi_dev_pm_attach() skip the affected devices
by adding a list of device IDs to avoid to it and putting the IDs of
the affected devices into that list.
Fixes: e5cc8ef31267 (ACPI / PM: Provide ACPI PM callback routines for subsystems) Reported-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
There is softlockup when using TPACKET_V3:
...
NMI watchdog: BUG: soft lockup - CPU#2 stuck for 60010ms!
(__irq_svc) from [<c0558a0c>] (_raw_spin_unlock_irqrestore+0x44/0x54)
(_raw_spin_unlock_irqrestore) from [<c027b7e8>] (mod_timer+0x210/0x25c)
(mod_timer) from [<c0549c30>]
(prb_retire_rx_blk_timer_expired+0x68/0x11c)
(prb_retire_rx_blk_timer_expired) from [<c027a7ac>]
(call_timer_fn+0x90/0x17c)
(call_timer_fn) from [<c027ab6c>] (run_timer_softirq+0x2d4/0x2fc)
(run_timer_softirq) from [<c021eaf4>] (__do_softirq+0x218/0x318)
(__do_softirq) from [<c021eea0>] (irq_exit+0x88/0xac)
(irq_exit) from [<c0240130>] (msa_irq_exit+0x11c/0x1d4)
(msa_irq_exit) from [<c0209cf0>] (handle_IPI+0x650/0x7f4)
(handle_IPI) from [<c02015bc>] (gic_handle_irq+0x108/0x118)
(gic_handle_irq) from [<c0558ee4>] (__irq_usr+0x44/0x5c)
...
If __ethtool_get_link_ksettings() is failed in
prb_calc_retire_blk_tmo(), msec and tmo will be zero, so tov_in_jiffies
is zero and the timer expire for retire_blk_timer is turn to
mod_timer(&pkc->retire_blk_timer, jiffies + 0),
which will trigger cpu usage of softirq is 100%.
Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.") Tested-by: Xiao Jiangfeng <xiaojiangfeng@huawei.com> Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
1) syzbot reported an uninit-value in bond_neigh_setup() [1]
bond_neigh_setup() uses a temporary on-stack 'struct neigh_parms parms',
but only clears parms.neigh_setup field.
A stacked bonding device would then enter bond_neigh_setup()
and read garbage from parms->dev.
If we get really unlucky and garbage is matching @dev, then we
could recurse and eventually crash.
Let's make sure the whole structure is cleared to avoid surprises.
2) bond_neigh_setup() can be called while another cpu manipulates
the master device, removing or adding a slave.
We need at least rcu protection to prevent use-after-free.
Note: Prior code does not support a stack of bonding devices,
this patch does not attempt to fix this, and leave a comment instead.
Local variable description: ----parms@bond_neigh_init
Variable was created at:
bond_neigh_init+0x8c/0x4b0 drivers/net/bonding/bond_main.c:3617
bond_neigh_init+0x8c/0x4b0 drivers/net/bonding/bond_main.c:3617
Fixes: 9918d5bf329d ("bonding: modify only neigh_parms owned by us") Fixes: 234bcf8a499e ("net/bonding: correctly proxy slave neigh param setup ndo function") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
neigh_cleanup() has not been used for seven years, and was a wrong design.
Messing with shared pointer in bond_neigh_init() without proper
memory barriers would at least trigger syzbot complains eventually.
It is time to remove this stuff.
Fixes: b63b70d87741 ("IPoIB: Use a private hash table for path lookup in xmit path") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When a GPIO offset in a lookup table is out-of-range, the printed error
message (1) does not include the actual out-of-range value, and (2)
contains an off-by-one error in the upper bound.
Avoid user confusion by also printing the actual GPIO offset, and
correcting the upper bound of the range.
While at it, use "%u" for unsigned int.
Sample impact:
-requested GPIO 0 is out of range [0..32] for chip e6052000.gpio
+requested GPIO 0 (45) is out of range [0..31] for chip e6052000.gpio
Fixes: 2a3cf6a3599e9015 ("gpiolib: return -ENOENT if no GPIO mapping exists") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/r/20191127095919.4214-1-geert+renesas@glider.be Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Userspace might bogusly sent NFT_DATA_VERDICT in several netlink
attributes that assume NFT_DATA_VALUE. Moreover, make sure that error
path invokes nft_data_release() to decrement the reference count on the
chain object.
Fixes: 96518518cc41 ("netfilter: add nftables") Fixes: 0f3cd9b36977 ("netfilter: nf_tables: add range expression") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
[bwh: Backported to 3.16:
- Drop changes in nft_get_set_elem(), nft_range
- Call nft_data_uninit() instead of nft_data_release()
- Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Do not assume userspace always sends us NFT_DATA_VALUE for bitwise and
cmp expressions. Although NFT_DATA_VERDICT does not make any sense, it
is still possible to handcraft a netlink message using this incorrect
data type.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
[bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Commit 39ce8150a079 ("pinctrl: baytrail: Serialize all register access")
added a spinlock around all register accesses because:
"There is a hardware issue in Intel Baytrail where concurrent GPIO register
access might result reads of 0xffffffff and writes might get dropped
completely."
Testing has shown that this does not catch all cases, there are still
2 problems remaining
1) The original fix uses a spinlock per byt_gpio device / struct,
additional testing has shown that this is not sufficient concurent
accesses to 2 different GPIO banks also suffer from the same problem.
This commit fixes this by moving to a single global lock.
2) The original fix did not add a lock around the register accesses in
the suspend/resume handling.
Since pinctrl-baytrail.c is using normal suspend/resume handlers,
interrupts are still enabled during suspend/resume handling. Nothing
should be using the GPIOs when they are being taken down, _but_ the
GPIOs themselves may still cause interrupts, which are likely to
use (read) the triggering GPIO. So we need to protect against
concurrent GPIO register accesses in the suspend/resume handlers too.
This commit fixes this by adding the missing spin_lock / unlock calls.
The 2 fixes together fix the Acer Switch 10 SW5-012 getting completely
confused after a suspend resume. The DSDT for this device has a bug
in its _LID method which reprograms the home and power button trigger-
flags requesting both high and low _level_ interrupts so the IRQs for
these 2 GPIOs continuously fire. This combined with the saving of
registers during suspend, triggers concurrent GPIO register accesses
resulting in saving 0xffffffff as pconf0 value during suspend and then
when restoring this on resume the pinmux settings get all messed up,
resulting in various I2C busses being stuck, the wifi no longer working
and often the tablet simply not coming out of suspend at all.
Fixes: 39ce8150a079 ("pinctrl: baytrail: Serialize all register access") Signed-off-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
[bwh: Backported to 3.16:
- Drop changes in functions that don't exist here
- Delete local pointers to byt_gpio that become unused
- Adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
There is a hardware issue in Intel Baytrail where concurrent GPIO register
access might result reads of 0xffffffff and writes might get dropped
completely.
Prevent this from happening by taking the serializing lock in all places
where it is possible that more than one thread might be accessing the
hardware concurrently.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[bwh: Backported to 3.16: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Instead of handling everything in the driver's first level interrupt
handler, we can take advantage of already existing flow handlers that are
provided by the IRQ core.
This changes the functionality a bit also. Previously the driver looped
over pending interrupts in a single loop, restarting the loop if some
interrupt changed state. This caused problem with Lenovo Thinkpad 10
digitizer that it was not able to deassert the interrupt before the driver
disabled the interrupt for good (looplimit was exhausted).
Rework the interrupt handling logic a bit so that we provide proper mask,
ack and unmask operations in terms of Baytrail GPIO hardware and loop over
pending interrupts only once. If the interrupt remains asserted the first
level handler will be re-triggered automatically.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[bwh: Backported to 3.16 as dependency of commit 39ce8150a079
"pinctrl: baytrail: Serialize all register access":
- Adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
If the pin is already configured as GPIO and it has any of the triggering
flags set, we may get spurious interrupts depending on the state of the
pin.
Prevent this by clearing the triggering flags on such pins. However, if the
pin is also configured as "direct IRQ" we leave the flags as is. Otherwise
it will prevent interrupts that are routed directly to IO-APIC.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[bwh: Backported to 3.16:
- Add definition of BYT_DIRECT_IRQ_EN
- Adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Zotac ZBOX PI320, a Baytrail based mini-PC, has power button connected to a
GPIO pin and it is exposed to the operating system as Windows 8 button
array. This is implemented in Linux as a driver using gpio_keys.
However, BIOS on this particula machine forgot to mux the pin to be a GPIO
instead of native function, which results following message to be seen on
the console:
byt_gpio INT33FC:02: pin 16 cannot be used as GPIO.
This causes power button to not work as the driver was not able to request
the GPIO it needs.
So instead of completely preventing this we allow turning the pin as GPIO
but issue warning that something might be wrong.
Reported-by: Benjamin Adler <benadler@gmx.net> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[bwh: Backported to 3.16: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We got the following warnings from thin_check during thin-pool setup:
$ thin_check /dev/vdb
examining superblock
examining devices tree
missing devices: [1, 84]
too few entries in btree_node: 41, expected at least 42 (block 138, max_entries = 126)
examining mapping tree
The phenomenon is the number of entries in one node of details_info tree is
less than (max_entries / 3). And it can be easily reproduced by the following
procedures:
$ new a thin pool
$ presume the max entries of details_info tree is 126
$ new 127 thin devices (e.g. 1~127) to make the root node being full
and then split
$ remove the first 43 (e.g. 1~43) thin devices to make the children
reblance repeatedly
$ stop the thin pool
$ thin_check
The root cause is that the B-tree removal procedure in __rebalance2()
doesn't guarantee the invariance: the minimal number of entries in
non-root node should be >= (max_entries / 3).
Simply fix the problem by increasing the rebalance threshold to
make sure the number of entries in each child will be greater
than or equal to (max_entries / 3 + 1), so no matter which
child is used for removal, the number will still be valid.
Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When assiging and testing taskstats in taskstats_exit() there's a race
when setting up and reading sig->stats when a thread-group with more
than one thread exits:
write to 0xffff8881157bbe10 of 8 bytes by task 7951 on cpu 0:
taskstats_tgid_alloc kernel/taskstats.c:567 [inline]
taskstats_exit+0x6b7/0x717 kernel/taskstats.c:596
do_exit+0x2c2/0x18e0 kernel/exit.c:864
do_group_exit+0xb4/0x1c0 kernel/exit.c:983
get_signal+0x2a2/0x1320 kernel/signal.c:2734
do_signal+0x3b/0xc00 arch/x86/kernel/signal.c:815
exit_to_usermode_loop+0x250/0x2c0 arch/x86/entry/common.c:159
prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
do_syscall_64+0x2d7/0x2f0 arch/x86/entry/common.c:299
entry_SYSCALL_64_after_hwframe+0x44/0xa9
read to 0xffff8881157bbe10 of 8 bytes by task 7949 on cpu 1:
taskstats_tgid_alloc kernel/taskstats.c:559 [inline]
taskstats_exit+0xb2/0x717 kernel/taskstats.c:596
do_exit+0x2c2/0x18e0 kernel/exit.c:864
do_group_exit+0xb4/0x1c0 kernel/exit.c:983
__do_sys_exit_group kernel/exit.c:994 [inline]
__se_sys_exit_group kernel/exit.c:992 [inline]
__x64_sys_exit_group+0x2e/0x30 kernel/exit.c:992
do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fix this by using smp_load_acquire() and smp_store_release().
Reported-by: syzbot+c5d03165a1bd1dead0c1@syzkaller.appspotmail.com Fixes: 34ec12349c8a ("taskstats: cleanup ->signal->stats allocation") Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Marco Elver <elver@google.com> Reviewed-by: Will Deacon <will@kernel.org> Reviewed-by: Andrea Parri <parri.andrea@gmail.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Link: https://lore.kernel.org/r/20191009114809.8643-1-christian.brauner@ubuntu.com Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Curtis Taylor and Jon Maxwell reported and debugged a crash on 3.10
based kernel.
Crash occurs in ctnetlink_conntrack_events because net->nfnl socket is
NULL. The nfnl socket was set to NULL by netns destruction running on
another cpu.
The exiting network namespace calls the relevant destructors in the
following order:
1. ctnetlink_net_exit_batch
This nulls out the event callback pointer in struct netns.
2. nfnetlink_net_exit_batch
This nulls net->nfnl socket and frees it.
3. nf_conntrack_cleanup_net_list
This removes all remaining conntrack entries.
This is order is correct. The only explanation for the crash so ar is:
cpu1: a. fetches rcu protected pointer to obtain ctnetlink event callback.
b. gets interrupted.
cpu2: runs netns exit handlers:
a runs ctnetlink destructor, event cb pointer set to NULL.
b runs nfnetlink destructor, nfnl socket is closed and set to NULL.
cpu1: c. resumes and trips over NULL net->nfnl.
Problem appears to be that ctnetlink_net_exit_batch only prevents future
callers of nf_conntrack_eventmask_report() from obtaining the callback.
It doesn't wait of other cpus that might have already obtained the
callbacks address.
I don't see anything in upstream kernels that would prevent similar
crash: We need to wait for all cpus to have exited the event callback.
Fixes: 9592a5c01e79dbc59eb56fa ("netfilter: ctnetlink: netns support") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The distributed arp table is using a DHT to store and retrieve MAC address
information for an IP address. This is done using unicast messages to
selected peers. The potential peers are looked up using the IP address and
the VID.
While the IP address is always stored in big endian byte order, this is not
the case of the VID. It can (depending on the host system) either be big
endian or little endian. The host must therefore always convert it to big
endian to ensure that all devices calculate the same peers for the same
lookup data.
Fixes: be1db4f6615b ("batman-adv: make the Distributed ARP Table vlan aware") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
[bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Pavel Tatashin [Fri, 12 Jan 2018 20:00:02 +0000 (15:00 -0500)]
x86/pti/efi: broken conversion from efi to kernel page table
In entry_64.S we have code like this:
/* Unconditionally use kernel CR3 for do_nmi() */
/* %rax is saved above, so OK to clobber here */
ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_KAISER
/* If PCID enabled, NOFLUSH now and NOFLUSH on return */
ALTERNATIVE "", "bts $63, %rax", X86_FEATURE_PCID
pushq %rax
/* mask off "user" bit of pgd address and 12 PCID bits: */
andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax
movq %rax, %cr3
2:
/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
call do_nmi
With this instruction:
andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax
We unconditionally switch from whatever our CR3 was to kernel page table.
But, in arch/x86/platform/efi/efi_64.c We temporarily set a different page
table, that does not have the kernel page table with 0x1000 offset from it.
Look in efi_thunk() and efi_thunk_set_virtual_address_map().
So, while CR3 points to the other page table, we get an NMI interrupt,
and clear 0x1000 from CR3, resulting in a bogus CR3 if the 0x1000 bit was
set.
The efi page table comes from realmode/rm/trampoline_64.S:
Notice: alignment is PAGE_SIZE, so after applying KAISER_SHADOW_PGD_OFFSET
which equal to PAGE_SIZE, we can get a different page table.
But, even if we fix alignment, here the trampoline binary is later copied
into dynamically allocated memory in reserve_real_mode(), so we need to
fix that place as well.
Fixes: f9a1666f97b3 ("KAISER: Kernel Address Isolation") Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Adjust the Fixes field for 3.16] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
v4.8-rc3 commit 99f3c90d0d ("dm flakey: error READ bios during the
down_interval") overlooked the 'drop_writes' feature, which is meant to
allow reads to be issued rather than errored, during the down_interval.
Fixes: 99f3c90d0d ("dm flakey: error READ bios during the down_interval") Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fix to return error code -EINVAL instead of 0, as is done elsewhere in
this function.
Fixes: e80d1c805a3b ("dm: do not override error code returned from dm_get_device()") Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Some of the device mapper targets override the error code returned by
dm_get_device() and return either -EINVAL or -ENXIO. There is nothing
gained by this override. It is better to propagate the returned error
code unchanged to caller.
This work was motivated by hitting an issue where the underlying device
was busy but -EINVAL was being returned. After this change we get
-EBUSY instead and it is easier to figure out the problem.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
[bwh: Backported to 3.16: drop changes to dm-log-writes] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We recently had a customer issue with a corrupted filesystem. When
trying to mount this image btrfs panicked with a division by zero in
calc_stripe_length().
The corrupt chunk had a 'num_stripes' value of 1. calc_stripe_length()
takes this value and divides it by the number of copies the RAID profile
is expected to have to calculate the amount of data stripes. As a DUP
profile is expected to have 2 copies this division resulted in 1/2 = 0.
Later then the 'data_stripes' variable is used as a divisor in the
stripe length calculation which results in a division by 0 and thus a
kernel panic.
When encountering a filesystem with a DUP block group and a
'num_stripes' value unequal to 2, refuse mounting as the image is
corrupted and will lead to unexpected behaviour.
Code inspection showed a RAID1 block group has the same issues.
Fixes: e06cd3dd7cea ("Btrfs: add validadtion checks for chunk loading") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Although we have tree level check at tree read runtime, it's completely
based on its parent level.
We still need to do accurate level check to avoid invalid tree blocks
sneak into kernel space.
The check itself is simple, for leaf its level should always be 0.
For nodes its level should be in range [1, BTRFS_MAX_LEVEL - 1].
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 4.4:
- Pass root instead of fs_info to generic_err()
- Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
A crafted btrfs image with incorrect chunk<->block group mapping will
trigger a lot of unexpected things as the mapping is essential.
Although the problem can be caught by block group item checker
added in "btrfs: tree-checker: Verify block_group_item", it's still not
sufficient. A sufficiently valid block group item can pass the check
added by the mentioned patch but could fail to match the existing chunk.
This patch will add extra block group -> chunk mapping check, to ensure
we have a completely matching (start, len, flags) chunk for each block
group at mount time.
Here we reuse the original helper find_first_block_group(), which is
already doing the basic bg -> chunk checks, adding further checks of the
start/len and type flags.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199837 Reported-by: Xu Wen <wen.xu@gatech.edu> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 4.4: Use root->fs_info instead of fs_info] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
A crafted image with invalid block group items could make free space cache
code to cause panic.
We could detect such invalid block group item by checking:
1) Item size
Known fixed value.
2) Block group size (key.offset)
We have an upper limit on block group item (10G)
3) Chunk objectid
Known fixed value.
4) Type
Only 4 valid type values, DATA, METADATA, SYSTEM and DATA|METADATA.
No more than 1 bit set for profile type.
5) Used space
No more than the block group size.
This should allow btrfs to detect and refuse to mount the crafted image.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199849 Reported-by: Xu Wen <wen.xu@gatech.edu> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 4.4:
- In check_leaf_item(), pass root->fs_info to check_block_group_item()
- Include <linux/sizes.h> (in ctree.h, to match upstream)
- Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
I've noticed that the updated item checker stack consumption increased
dramatically in 542f5385e20cf97447 ("btrfs: tree-checker: Add checker
for dir item")
tree-checker.c:check_leaf +552 (176 -> 728)
The array is 255 bytes long, dynamic allocation would slow down the
sanity checks so it's more reasonable to keep it on-stack. Moving the
variable to the scope of use reduces the stack usage again
tree-checker.c:check_leaf -264 (728 -> 464)
Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The return value of sizeof() is of type size_t, so we must print it
using the %z format modifier rather than %l to avoid this warning
on some architectures:
fs/btrfs/tree-checker.c: In function 'check_dir_item':
fs/btrfs/tree-checker.c:273:50: error: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'u32' {aka 'unsigned int'} [-Werror=format=]
Fixes: 005887f2e3e0 ("btrfs: tree-checker: Add checker for dir item") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Add checker for dir item, for key types DIR_ITEM, DIR_INDEX and
XATTR_ITEM.
This checker does comprehensive checks for:
1) dir_item header and its data size
Against item boundary and maximum name/xattr length.
This part is mostly the same as old verify_dir_item().
2) dir_type
Against maximum file types, and against key type.
Since XATTR key should only have FT_XATTR dir item, and normal dir
item type should not have XATTR key.
The check between key->type and dir_type is newly introduced by this
patch.
3) name hash
For XATTR and DIR_ITEM key, key->offset is name hash (crc32c).
Check the hash of the name against the key to ensure it's correct.
The name hash check is only found in btrfs-progs before this patch.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 4.4: BTRFS_MAX_XATTR_SIZE() takes a root instead of an
fs_info, and yields a value of type size_t instead of unsigned int] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
[Cause]
Btrfs will call btrfs_check_leaf() in btrfs_mark_buffer_dirty() to check
if the leaf is valid with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y.
However quite some btrfs_mark_buffer_dirty() callers(*) don't really
initialize its item data but only initialize its item pointers, leaving
item data uninitialized.
This makes tree-checker catch uninitialized data as error, causing
such panic.
*: These callers include but not limited to
setup_items_for_insert()
btrfs_split_item()
btrfs_expand_item()
[Fix]
Add a new parameter @check_item_data to btrfs_check_leaf().
With @check_item_data set to false, item data check will be skipped and
fallback to old btrfs_check_leaf() behavior.
So we can still get early warning if we screw up item pointers, and
avoid false panic.
Cc: Filipe Manana <fdmanana@gmail.com> Reported-by: Lakshmipathi.G <lakshmipathi.g@gmail.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 4.4: adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Use inline function to replace macro since we don't need
stringification.
(Macro still exists until all callers get updated)
And add more info about the error, and replace EIO with EUCLEAN.
For nr_items error, report if it's too large or too small, and output
the valid value range.
For node block pointer, added a new alignment checker.
For key order, also output the next key to make the problem more
obvious.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording adjustments, unindented long strings ] Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 4.4:
- Use root->sectorsize instead of root->fs_info->sectorsize
- BTRFS_NODEPTRS_PER_BLOCK() takes a root instead of an fs_info, and yields
a value of type size_t instead of unsigned int] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
It's no doubt the comprehensive tree block checker will become larger,
so moving them into their own files is quite reasonable.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 4.4:
- The moved code is slightly different
- Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
EXTENT_CSUM checker is a relatively easy one, only needs to check:
1) Objectid
Fixed to BTRFS_EXTENT_CSUM_OBJECTID
2) Key offset alignment
Must be aligned to sectorsize
3) Item size alignedment
Must be aligned to csum size
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 4.4: Use root->sectorsize instead of
root->fs_info->sectorsize] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Add extra checks for item with EXTENT_DATA type. This checks the
following thing:
0) Key offset
All key offsets must be aligned to sectorsize.
Inline extent must have 0 for key offset.
1) Item size
Uncompressed inline file extent size must match item size.
(Compressed inline file extent has no information about its on-disk size.)
Regular/preallocated file extent size must be a fixed value.
2) Every member of regular file extent item
Including alignment for bytenr and offset, possible value for
compression/encryption/type.
3) Type/compression/encode must be one of the valid values.
This should be the most comprehensive and strict check in the context
of btrfs_item for EXTENT_DATA.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ switch to BTRFS_FILE_EXTENT_TYPES, similar to what
BTRFS_COMPRESS_TYPES does ] Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 4.4:
- Use root->sectorsize instead of root->fs_info->sectorsize
- Adjust filename] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Current check_leaf() function does a good job checking key order and
item offset/size.
However it only checks from slot 0 to the last but one slot, this is
good but makes later expansion hard.
So this refactoring iterates from slot 0 to the last slot.
For key comparison, it uses a key with all 0 as initial key, so all
valid keys should be larger than that.
And for item size/offset checks, it compares current item end with
previous item offset.
For slot 0, use leaf end as a special case.
This makes later item/key offset checks and item size checks easier to
be implemented.
Also, makes check_leaf() to return -EUCLEAN other than -EIO to indicate
error.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 4.4:
- BTRFS_LEAF_DATA_SIZE() takes a root rather than an fs_info
- Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We have reader helpers for most of the on-disk structures that use
an extent_buffer and pointer as offset into the buffer that are
read-only. We should mark them as const and, in turn, allow consumers
of these interfaces to mark the buffers const as well.
No impact on code, but serves as documentation that a buffer is intended
not to be modified.
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We can not simply use the owner field from an extent buffer's header to
get the id of the respective tree when the extent buffer is from a
relocation tree. When we create the root for a relocation tree we leave
(on purpose) the owner field with the same value as the subvolume's tree
root (we do this at ctree.c:btrfs_copy_root()). So we must ignore extent
buffers from relocation trees, which have the BTRFS_HEADER_FLAG_RELOC
flag set, because otherwise we will always consider the extent buffer
as not being the root of the tree (the root of original subvolume tree
is always different from the root of the respective relocation tree).
This lead to assertion failures when running with the integrity checker
enabled (CONFIG_BTRFS_FS_CHECK_INTEGRITY=y) such as the following:
This can be easily reproduced by running xfstests with the integrity
checker enabled.
Fixes: 1ba98d086fe3 (Btrfs: detect corruption when non-root leaf has zero item) Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
This is an additional patch to
"Btrfs: memset to avoid stale content in btree node block".
This uses memset to initialize the unused space in a leaf to avoid
potential stale content, which may be incurred by pushing items
between sibling leaves.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
In a corrupted btrfs image, we can come across this BUG_ON and
get an unreponsive system, but if we return errors instead,
its caller can handle everything gracefully by aborting the current
transaction.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We need to check items in a node to make sure that we're reading
a valid one, otherwise we could get various crashes while processing
delayed_refs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
During updating btree, we could push items between sibling
nodes/leaves, for leaves data sections starts reversely from
the end of the block while for nodes we only have key pairs
which are stored one by one from the start of the block.
So we could do try to push key pairs from one node to the next
node right in the tree, and after that, we update the node's
nritems to reflect the correct end while leaving the stale
content in the node. One may intentionally corrupt the fs
image and access the stale content by bumping the nritems and
causes various crashes.
This takes the in-memory @nritems as the correct one and
gets to memset the unused part of a btree node.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
This can only happen with CONFIG_BTRFS_FS_CHECK_INTEGRITY=y.
Commit 1ba98d0 ("Btrfs: detect corruption when non-root leaf has zero item")
assumes that a leaf is its root when leaf->bytenr == btrfs_root_bytenr(root),
however, we should not use btrfs_root_bytenr(root) since it's mainly got
updated during committing transaction. So the check can fail when doing
COW on this leaf while it is a root.
This changes to use "if (leaf == btrfs_root_node(root))" instead, just like
how we check whether leaf is a root in __btrfs_cow_block().
Fixes: 1ba98d086fe3 (Btrfs: detect corruption when non-root leaf has zero item) Reported-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When btree node (level = 1) has nritems which equals to zero,
we can end up with panic due to insert_ptr()'s
BUG_ON(slot > nritems);
where slot is 1 and nritems is 0, as copy_for_split() calls
insert_ptr(.., path->slots[1] + 1, ...);
A invalid value results in the whole mess, this adds the check
for btree's node nritems so that we stop reading block when
when something is wrong.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Right now we treat leaf which has zero item as a valid one
because we could have an empty tree, that is, a root that is
also a leaf without any item, however, in the same case but
when the leaf is not a root, we can end up with hitting the
BUG_ON(1) in btrfs_extend_item() called by
setup_inline_extent_backref().
This makes us check the situation as a corruption if leaf is
not its own root.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
With btrfs-corrupt-block, one can drop one chunk item and mounting
will end up with a panic in btrfs_full_stripe_len().
This doesn't not remove the BUG_ON, but instead checks it a bit
earlier when we find the block group item.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
To prevent fuzzed filesystem images from panic the whole system,
we need various validation checks to refuse to mount such an image
if btrfs finds any invalid value during loading chunks, including
both sys_array and regular chunks.
Note that these checks may not be sufficient to cover all corner cases,
feel free to add more checks.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Enhance chunk validation:
1) Num_stripes
We already have such check but it's only in super block sys chunk
array.
Now check all on-disk chunks.
2) Chunk logical
It should be aligned to sector size.
This behavior should be *DOUBLE CHECKED* for 64K sector size like
PPC64 or AArch64.
Maybe we can found some hidden bugs.
3) Chunk length
Same as chunk logical, should be aligned to sector size.
4) Stripe length
It should be power of 2.
5) Chunk type
Any bit out of TYPE_MAS | PROFILE_MASK is invalid.
With all these much restrict rules, several fuzzed image reported in
mail list should no longer cause kernel panic.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>