tools/bpf/bpftool/.gitignore has the "bpftool" pattern, which is
intended to ignore the following build artifact:
tools/bpf/bpftool/bpftool
However, the .gitignore entry is effective not only for the current
directory, but also for any sub-directories.
So, from the point of .gitignore grammar, the following check-in file
is also considered to be ignored:
tools/bpf/bpftool/bash-completion/bpftool
As the manual gitignore(5) says "Files already tracked by Git are not
affected", this is not a problem as far as Git is concerned.
However, Git is not the only program that parses .gitignore because
.gitignore is useful to distinguish build artifacts from source files.
For example, tar(1) supports the --exclude-vcs-ignore option. As of
writing, this option does not work perfectly, but it intends to create
a tarball excluding files specified by .gitignore.
So, I believe it is better to fix this issue.
You can fix it by prefixing the pattern with a slash; the leading slash
means the specified pattern is relative to the current directory.
Test test_libbpf.sh failed on my development server with failure
-bash-4.4$ sudo ./test_libbpf.sh
[0] libbpf: Error in bpf_object__probe_name():Operation not permitted(1).
Couldn't load basic 'r0 = 0' BPF program.
test_libbpf: failed at file test_l4lb.o
selftests: test_libbpf [FAILED]
-bash-4.4$
The reason is because my machine has 64KB locked memory by default which
is not enough for this program to get locked memory.
Similar to other bpf selftests, let us increase RLIMIT_MEMLOCK
to infinity, which fixed the issue.
When unmapping the AF_XDP memory regions used for the rings, an
invalid address was passed to the munmap() calls. Instead of passing
the beginning of the memory region, the descriptor region was passed
to munmap.
When the userspace application tried to tear down an AF_XDP socket,
the operation failed and the application would still have a reference
to socket it wished to get rid of.
Reported-by: William Tu <u9012063@gmail.com> Fixes: 1cad07884239 ("libbpf: add support for using AF_XDP sockets") Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Tested-by: William Tu <u9012063@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Fixed possible memory leak in i40e_vc_add_cloud_filter function:
cfilter is being allocated and in some error conditions
the function returns without freeing the memory.
Fix of integer truncation from u16 (type of queue_id value) to u8
when calling i40e_vc_isvalid_queue_id function.
When build perf for ARC recently, there was a build failure due to lack
of __NR_bpf.
| Auto-detecting system features:
|
| ... get_cpuid: [ OFF ]
| ... bpf: [ on ]
|
| # error __NR_bpf not defined. libbpf does not support your arch.
^~~~~
| bpf.c: In function 'sys_bpf':
| bpf.c:66:17: error: '__NR_bpf' undeclared (first use in this function)
| return syscall(__NR_bpf, cmd, attr, size);
| ^~~~~~~~
| sys_bpf
The SD Physical Layer Spec says the following: Since the SD Memory Card
shall support at least the two bus modes 1-bit or 4-bit width, then any SD
Card shall set at least bits 0 and 2 (SD_BUS_WIDTH="0101").
This change verifies the card has specified a bus width.
AMD SDHC Device 7806 can get into a bad state after a card disconnect
where anything transferred via the DATA lines will always result in a
zero filled buffer. Currently the driver will continue without error if
the HC is in this condition. A block device will be created, but reading
from it will result in a zero buffer. This makes it seem like the SD
device has been erased, when in actuality the data is never getting
copied from the DATA lines to the data buffer.
SCR is the first command in the SD initialization sequence that uses the
DATA lines. By checking that the response was invalid, we can abort
mounting the card.
This patch has to do with the life cycle of glocks and buffers. When
gfs2 metadata or journaled data is queued to be written, a gfs2_bufdata
object is assigned to track the buffer, and that is queued to various
lists, including the glock's gl_ail_list to indicate it's on the active
items list. Once the page associated with the buffer has been written,
it is removed from the ail list, but its life isn't over until a revoke
has been successfully written.
So after the block is written, its bufdata object is moved from the
glock's gl_ail_list to a file-system-wide list of pending revokes,
sd_log_le_revoke. At that point the glock still needs to track how many
revokes it contributed to that list (in gl_revokes) so that things like
glock go_sync can ensure all the metadata has been not only written, but
also revoked before the glock is granted to a different node. This is
to guarantee journal replay doesn't replay the block once the glock has
been granted to another node.
Ross Lagerwall recently discovered a race in which an inode could be
evicted, and its glock freed after its ail list had been synced, but
while it still had unwritten revokes on the sd_log_le_revoke list. The
evict decremented the glock reference count to zero, which allowed the
glock to be freed. After the revoke was written, function
revoke_lo_after_commit tried to adjust the glock's gl_revokes counter
and clear its GLF_LFLUSH flag, at which time it referenced the freed
glock.
This patch fixes the problem by incrementing the glock reference count
in gfs2_add_revoke when the glock's first bufdata object is moved from
the glock to the global revokes list. Later, when the glock's last such
bufdata object is freed, the reference count is decremented. This
guarantees that whichever process finishes last (the revoke writing or
the evict) will properly free the glock, and neither will reference the
glock after it has been freed.
Reported-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Since QP destruction frees memory, hfi1_wq should have the WQ_MEM_RECLAIM.
The hfi1_wq does not allocate memory with GFP_KERNEL or otherwise become
entangled with memory reclaim, so this flag is appropriate.
Fixes: 0a226edd203f ("staging/rdma/hfi1: Use parallel workqueue for SDMA engines") Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
This issue is found by running liburing/test/io_uring_setup test.
When test run, the testcase "attempt to bind to invalid cpu" would not
pass with messages like:
io_uring_setup(1, 0xbfc2f7c8), \
flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, \
resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, \
sq_thread_cpu: 2
expected -1, got 3
FAIL
On my system, there is:
CPU(s) possible : 0-3
CPU(s) online : 0-1
CPU(s) offline : 2-3
CPU(s) present : 0-1
The sq_thread_cpu 2 is offline on my system, so the bind should fail.
But cpu_possible() will pass the check. We shouldn't be able to bind
to an offline cpu. Use cpu_online() to do the check.
After the change, the testcase run as expected: EINVAL will be returned
for cpu offlined.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
As part of the freeze operation, gfs2_freeze_func() is left blocking
on a request to hold the sd_freeze_gl in SH. This glock is held in EX
by the gfs2_freeze() code.
A subsequent call to gfs2_unfreeze() releases the EXclusively held
sd_freeze_gl, which allows gfs2_freeze_func() to acquire it in SH and
resume its operation.
gfs2_unfreeze(), however, doesn't wait for gfs2_freeze_func() to complete.
If a umount is issued right after unfreeze, it could result in an
inconsistent filesystem because some journal data (statfs update) isn't
written out.
Refer to commit 24972557b12c for a more detailed explanation of how
freeze/unfreeze work.
This patch causes gfs2_unfreeze() to wait for gfs2_freeze_func() to
complete before returning to the user.
Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Actually we don't do anything with return value from
nfs_wait_client_init_complete in nfs_match_client, as a
consequence if we get a fatal signal and client is not
fully initialised, we'll loop to "again" label
This has been proven to cause soft lockups on some scenarios
(no-carrier but configured network interfaces)
Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The AFS3 FID is three 32-bit unsigned numbers and is represented as three
up-to-8-hex-digit numbers separated by colons to the afs.fid xattr.
However, with the advent of support for YFS, the FID is now a 64-bit volume
number, a 96-bit vnode/inode number and a 32-bit uniquifier (as before).
Whilst the sprintf in afs_xattr_get_fid() has been partially updated (it
currently ignores the upper 32 bits of the 96-bit vnode number), the size
of the stack-based buffer has not been increased to match, thereby allowing
stack corruption to occur.
Fix this by increasing the buffer size appropriately and conditionally
including the upper part of the vnode number if it is non-zero. The latter
requires the lower part to be zero-padded if the upper part is non-zero.
Fixes: 3b6492df4153 ("afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Under certain conditions, lru_count may drop below zero resulting in
a large amount of log spam like this:
vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \
negative objects to delete nr=-1
This happens as follows:
1) A glock is moved from lru_list to the dispose list and lru_count is
decremented.
2) The dispose function calls cond_resched() and drops the lru lock.
3) Another thread takes the lru lock and tries to add the same glock to
lru_list, checking if the glock is on an lru list.
4) It is on a list (actually the dispose list) and so it avoids
incrementing lru_count.
5) The glock is moved to lru_list.
5) The original thread doesn't dispose it because it has been re-added
to the lru list but the lru_count has still decreased by one.
Fix by checking if the LRU flag is set on the glock rather than checking
if the glock is on some list and rearrange the code so that the LRU flag
is added/removed precisely when the glock is added/removed from lru_list.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
There is currently no corresponding patch in master due to additional
changes that would be significantly different from plain revert in the
respective stable branch.
The range argument was not handled correctly and could cause trim to
overlap allocated areas or reach beyond the end of the device. The
address space that fitrim normally operates on is in logical
coordinates, while the discards are done on the physical device extents.
This distinction cannot be made with the current ioctl interface and
caused the confusion.
The bug depends on the layout of block groups and does not always
happen. The whole-fs trim (run by default by the fstrim tool) is not
affected.
Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 59c08c69c278 ("netfilter: ctnetlink: Support L3 protocol-filter
on flush") introduced a user-space regression when flushing connection
track entries. Before this commit, the nfgen_family field was not used
by the kernel and all entries were removed. Since this commit,
nfgen_family is used to filter out entries that should not be removed.
One example a broken tool is conntrack. conntrack always sets
nfgen_family to AF_INET, so after 59c08c69c278 only IPv4 entries were
removed with the -F parameter.
Pablo Neira Ayuso suggested using nfgenmsg->version to resolve the
regression, and this commit implements his suggestion. nfgenmsg->version
is so far set to zero, so it is well-suited to be used as a flag for
selecting old or new flush behavior. If version is 0, nfgen_family is
ignored and all entries are used. If user-space sets the version to one
(or any other value than 0), then the new behavior is used. As version
only can have two valid values, I chose not to add a new
NFNETLINK_VERSION-constant.
Fixes: 59c08c69c278 ("netfilter: ctnetlink: Support L3 protocol-filter on flush") Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com> Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
What happens there is that we are replacing file->path.mnt of
a file we'd just opened with a clone and we need the write
count contribution to be transferred from original mount to
new one. That's it. We do *NOT* want any kind of freeze
protection for the duration of switchover.
IOW, we should just use __mnt_{want,drop}_write() for that
switchover; no need to bother with mnt_{want,drop}_write()
there.
Tested-by: Amir Goldstein <amir73il@gmail.com> Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Syzbot has reported some issues with the locking assumptions made for
the multicast tt/tvlv worker: It was able to trigger the WARN_ON() in
batadv_mcast_mla_tt_retract() and batadv_mcast_mla_tt_add().
While hard/not reproduceable for us so far it seems that the
delayed_work_pending() we use might not be quite safe from reordering.
Therefore this patch adds an explicit, new spinlock to protect the
update of the mla_list and flags in bat_priv and then removes the
WARN_ON(delayed_work_pending()).
Reported-by: syzbot+83f2d54ec6b7e417e13f@syzkaller.appspotmail.com Reported-by: syzbot+050927a651272b145a5d@syzkaller.appspotmail.com Reported-by: syzbot+979ffc89b87309b1b94b@syzkaller.appspotmail.com Reported-by: syzbot+f9f3f388440283da2965@syzkaller.appspotmail.com Fixes: cbebd363b2e9 ("batman-adv: Use own timer for multicast TT and TVLV updates") Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
synchronize_rcu() is fine when the rcu callbacks only need
to free memory (kfree_rcu() or direct kfree() call rcu call backs)
__dev_map_entry_free() is a bit more complex, so we need to make
sure that call queued __dev_map_entry_free() callbacks have completed.
sysbot report:
BUG: KASAN: use-after-free in dev_map_flush_old kernel/bpf/devmap.c:365
[inline]
BUG: KASAN: use-after-free in __dev_map_entry_free+0x2a8/0x300
kernel/bpf/devmap.c:379
Read of size 8 at addr ffff8801b8da38c8 by task ksoftirqd/1/18
Memory state around the buggy address: ffff8801b8da3780: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ffff8801b8da3800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff8801b8da3880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^ ffff8801b8da3900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8801b8da3980: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
Fixes: 546ac1ffb70d ("bpf: add devmap, a map for storing net device references") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+457d3e2ffbcf31aee5c0@syzkaller.appspotmail.com Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In ssb_modinit, it does not fail SSB init when ssb_host_pcmcia_init failed,
however in ssb_modexit, ssb_host_pcmcia_exit calls pcmcia_unregister_driver
unconditionally, which may tigger a NULL pointer dereference issue as above.
Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: 399500da18f7 ("ssb: pick PCMCIA host code support from b43 driver") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
syzkaller reported crashes on kfree() called from
vivid_vid_cap_s_selection(). This looks like a simple typo, as
dev->bitmap_cap is allocated with vzalloc() throughout the file.
Fixes: ef834f7836ec0 ("[media] vivid: add the video capture and output
parts")
Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Syzbot <syzbot+6c0effb5877f6b0344e2@syzkaller.appspotmail.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Calling VIDIOC_DQBUF can release the core serialization lock pointed to
by vb2_queue->lock if it has to wait for a new buffer to arrive.
However, if userspace dup()ped the video device filehandle, then it is
possible to read or call DQBUF from two filehandles at the same time.
It is also possible to call REQBUFS from one filehandle while the other
is waiting for a buffer. This will remove all the buffers and reallocate
new ones. Removing all the buffers isn't the problem here (that's already
handled correctly by DQBUF), but the reallocating part is: DQBUF isn't
aware that the buffers have changed.
This is fixed by setting a flag whenever the lock is released while waiting
for a buffer to arrive. And checking the flag where needed so we can return
-EBUSY.
Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl> Reported-by: Syzbot <syzbot+4180ff9ca6810b06c1e9@syzkaller.appspotmail.com> Reviewed-by: Tomasz Figa <tfiga@chromium.org> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Memory state around the buggy address: ffff8881dc7adf00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff8881dc7adf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8881dc7ae000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^ ffff8881dc7ae080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8881dc7ae100: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
There are already cleanup handlings in serial_ir_init error path,
no need to call serial_ir_exit do it again in serial_ir_init_module,
otherwise will trigger a use-after-free issue.
Fixes: fa5dc29c1fcc ("[media] lirc_serial: move out of staging and rename to serial_ir") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Memory state around the buggy address: ffff8881f59a6a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff8881f59a6a80: 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc
>ffff8881f59a6b00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
^ ffff8881f59a6b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8881f59a6c00: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
cpia2_init does not check return value of cpia2_init, if it failed
in usb_register_driver, there is already cleanup using driver_unregister.
No need call cpia2_usb_cleanup on module exit.
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This nasty little syzbot repro:
https://syzkaller.appspot.com/x/repro.syz?x=12c7a94f400000
Creates overlay mounts where the same directory is both in upper and lower
layers. Simplified example:
mkdir foo work
mount -t overlay none foo -o"lowerdir=.,upperdir=foo,workdir=work"
The repro runs several threads in parallel that attempt to chdir into foo
and attempt to symlink/rename/exec/mkdir the file bar.
The repro hits a WARN_ON() I placed in ovl_instantiate(), which suggests
that an overlay inode already exists in cache and is hashed by the pointer
of the real upper dentry that ovl_create_real() has just created. At the
point of the WARN_ON(), for overlay dir inode lock is held and upper dir
inode lock, so at first, I did not see how this was possible.
On a closer look, I see that after ovl_create_real(), because of the
overlapping upper and lower layers, a lookup by another thread can find the
file foo/bar that was just created in upper layer, at overlay path
foo/foo/bar and hash the an overlay inode with the new real dentry as lower
dentry. This is possible because the overlay directory foo/foo is not
locked and the upper dentry foo/bar is in dcache, so ovl_lookup() can find
it without taking upper dir inode shared lock.
Overlapping layers is considered a wrong setup which would result in
unexpected behavior, but it shouldn't crash the kernel and it shouldn't
trigger WARN_ON() either, so relax this WARN_ON() and leave a pr_warn()
instead to cover all cases of failure to get an overlay inode.
The error returned from failure to insert new inode to cache with
inode_insert5() was changed to -EEXIST, to distinguish from the error
-ENOMEM returned on failure to get/allocate inode with iget5_locked().
Reported-by: syzbot+9c69c282adc4edd2b540@syzkaller.appspotmail.com Fixes: 01b39dcc9568 ("ovl: use inode_insert5() to hash a newly...") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A failed call to kobject_init_and_add() must be followed by a call to
kobject_put(). Currently in the error path when adding fs_devices we
are missing this call. This could be fixed by calling
btrfs_sysfs_remove_fsid() if btrfs_sysfs_add_fsid() returns an error or
by adding a call to kobject_put() directly in btrfs_sysfs_add_fsid().
Here we choose the second option because it prevents the slightly
unusual error path handling requirements of kobject from leaking out
into btrfs functions.
Add a call to kobject_put() in the error path of kobject_add_and_init().
This causes the release method to be called if kobject_init_and_add()
fails. open_tree() is the function that calls btrfs_sysfs_add_fsid()
and the error code in this function is already written with the
assumption that the release method is called during the error path of
open_tree() (as seen by the call to btrfs_sysfs_remove_fsid() under the
fail_fsdev_sysfs label).
Cc: stable@vger.kernel.org # v4.4+ Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If a call to kobject_init_and_add() fails we must call kobject_put()
otherwise we leak memory.
Calling kobject_put() when kobject_init_and_add() fails drops the
refcount back to 0 and calls the ktype release method (which in turn
calls the percpu destroy and kfree).
Add call to kobject_put() in the error path of call to
kobject_init_and_add().
Cc: stable@vger.kernel.org # v4.4+ Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
inode) that happens to be ranged, which happens during a msync() or writes
for files opened with O_SYNC for example, we can end up with a corrupt log,
due to different file extent items representing ranges that overlap with
each other, or hit some assertion failures.
When doing a ranged fsync we only flush delalloc and wait for ordered
exents within that range. If while we are logging items from our inode
ordered extents for adjacent ranges complete, we end up in a race that can
make us insert the file extent items that overlap with others we logged
previously and the assertion failures.
For example, if tree-log.c:copy_items() receives a leaf that has the
following file extents items, all with a length of 4K and therefore there
is an implicit hole in the range 68K to 72K - 1:
It copies them to the log tree. However due to the need to detect implicit
holes, it may release the path, in order to look at the previous leaf to
detect an implicit hole, and then later it will search again in the tree
for the first file extent item key, with the goal of locking again the
leaf (which might have changed due to concurrent changes to other inodes).
However when it locks again the leaf containing the first key, the key
corresponding to the extent at offset 72K may not be there anymore since
there is an ordered extent for that range that is finishing (that is,
somewhere in the middle of btrfs_finish_ordered_io()), and it just
removed the file extent item but has not yet replaced it with a new file
extent item, so the part of copy_items() that does hole detection will
decide that there is a hole in the range starting from 68K to 76K - 1,
and therefore insert a file extent item to represent that hole, having
a key offset of 68K. After that we now have a log tree with 2 different
extent items that have overlapping ranges:
1) The file extent item copied before copy_items() released the path,
which has a key offset of 72K and a length of 4K, representing the
file range 72K to 76K - 1.
2) And a file extent item representing a hole that has a key offset of
68K and a length of 8K, representing the range 68K to 76K - 1. This
item was inserted after releasing the path, and overlaps with the
extent item inserted before.
The overlapping extent items can cause all sorts of unpredictable and
incorrect behaviour, either when replayed or if a fast (non full) fsync
happens later, which can trigger a BUG_ON() when calling
btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a
trace like the following:
A sample of a corrupt log tree leaf with overlapping extents I got from
running btrfs/072:
item 14 key (295 108 200704) itemoff 2599 itemsize 53
extent data disk bytenr 0 nr 0
extent data offset 0 nr 458752 ram 458752
item 15 key (295 108 659456) itemoff 2546 itemsize 53
extent data disk bytenr 4343541760 nr 770048
extent data offset 606208 nr 163840 ram 770048
item 16 key (295 108 663552) itemoff 2493 itemsize 53
extent data disk bytenr 4343541760 nr 770048
extent data offset 610304 nr 155648 ram 770048
item 17 key (295 108 819200) itemoff 2440 itemsize 53
extent data disk bytenr 4334788608 nr 4096
extent data offset 0 nr 4096 ram 4096
The file extent item at offset 659456 (item 15) ends at offset 823296
(659456 + 163840) while the next file extent item (item 16) starts at
offset 663552.
Another different problem that the race can trigger is a failure in the
assertions at tree-log.c:copy_items(), which expect that the first file
extent item key we found before releasing the path exists after we have
released path and that the last key we found before releasing the path
also exists after releasing the path:
$ cat -n fs/btrfs/tree-log.c
4080 if (need_find_last_extent) {
4081 /* btrfs_prev_leaf could return 1 without releasing the path */
4082 btrfs_release_path(src_path);
4083 ret = btrfs_search_slot(NULL, inode->root, &first_key,
4084 src_path, 0, 0);
4085 if (ret < 0)
4086 return ret;
4087 ASSERT(ret == 0);
(...)
4103 if (i >= btrfs_header_nritems(src_path->nodes[0])) {
4104 ret = btrfs_next_leaf(inode->root, src_path);
4105 if (ret < 0)
4106 return ret;
4107 ASSERT(ret == 0);
4108 src = src_path->nodes[0];
4109 i = 0;
4110 need_find_last_extent = true;
4111 }
(...)
The second assertion implicitly expects that the last key before the path
release still exists, because the surrounding while loop only stops after
we have found that key. When this assertion fails it produces a stack like
this:
So fix this race between a full ranged fsync and writeback of adjacent
ranges by flushing all delalloc and waiting for all ordered extents to
complete before logging the inode. This is the simplest way to solve the
problem because currently the full fsync path does not deal with ranges
at all (it assumes a full range from 0 to LLONG_MAX) and it always needs
to look at adjacent ranges for hole detection. For use cases of ranged
fsyncs this can make a few fsyncs slower but on the other hand it can
make some following fsyncs to other ranges do less work or no need to do
anything at all. A full fsync is rare anyway and happens only once after
loading/creating an inode and once after less common operations such as a
shrinking truncate.
This is an issue that exists for a long time, and was often triggered by
generic/127, because it does mmap'ed writes and msync (which triggers a
ranged fsync). Adding support for the tree checker to detect overlapping
extents (next patch in the series) and trigger a WARN() when such cases
are found, and then calling btrfs_check_leaf_full() at the end of
btrfs_insert_file_extent() made the issue much easier to detect. Running
btrfs/072 with that change to the tree checker and making fsstress open
files always with O_SYNC made it much easier to trigger the issue (as
triggering it with generic/127 is very rare).
CC: stable@vger.kernel.org # 3.16+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a
file that has holes and has file extent items spanning two or more leafs,
we can end up falling to back to a full transaction commit due to a logic
bug that leads to failure to insert a duplicate file extent item that is
meant to represent a hole between the last file extent item of a leaf and
the first file extent item in the next leaf. The failure (EEXIST error)
leads to a transaction commit (as most errors when logging an inode do).
For example, we have the two following leafs:
Leaf N:
-----------------------------------------------
| ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) |
-----------------------------------------------
The file extent item at the end of leaf N has a length of 4Kb,
representing the file range from 64K to 68K - 1.
Leaf N + 1:
-----------------------------------------------
| (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... |
-----------------------------------------------
The file extent item at the first slot of leaf N + 1 has a length of
4Kb too, representing the file range from 72K to 76K - 1.
During the full fsync path, when we are at tree-log.c:copy_items() with
leaf N as a parameter, after processing the last file extent item, that
represents the extent at offset 64K, we take a look at the first file
extent item at the next leaf (leaf N + 1), and notice there's a 4K hole
between the two extents, and therefore we insert a file extent item
representing that hole, starting at file offset 68K and ending at offset
72K - 1. However we don't update the value of *last_extent, which is used
to represent the end offset (plus 1, non-inclusive end) of the last file
extent item inserted in the log, so it stays with a value of 68K and not
with a value of 72K.
Then, when copy_items() is called for leaf N + 1, because the value of
*last_extent is smaller then the offset of the first extent item in the
leaf (68K < 72K), we look at the last file extent item in the previous
leaf (leaf N) and see it there's a 4K gap between it and our first file
extent item (again, 68K < 72K), so we decide to insert a file extent item
representing the hole, starting at file offset 68K and ending at offset
72K - 1, this insertion will fail with -EEXIST being returned from
btrfs_insert_file_extent() because we already inserted a file extent item
representing a hole for this offset (68K) in the previous call to
copy_items(), when processing leaf N.
The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(),
which falls back to a full transaction commit.
Fix this by adjusting *last_extent after inserting a hole when we had to
look at the next leaf.
Fixes: 4ee3fad34a9c ("Btrfs: fix fsync after hole punching when using no-holes feature") Cc: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently when we fail to COW a path at btrfs_update_root() we end up
always aborting the transaction. However all the current callers of
btrfs_update_root() are able to deal with errors returned from it, many do
end up aborting the transaction themselves (directly or not, such as the
transaction commit path), other BUG_ON() or just gracefully cancel whatever
they were doing.
When syncing the fsync log, we call btrfs_update_root() through
tree-log.c:update_log_root(), and if it returns an -ENOSPC error, the log
sync code does not abort the transaction, instead it gracefully handles
the error and returns -EAGAIN to the fsync handler, so that it falls back
to a transaction commit. Any other error different from -ENOSPC, makes the
log sync code abort the transaction.
So remove the transaction abort from btrfs_update_log() when we fail to
COW a path to update the root item, so that if an -ENOSPC failure happens
we avoid aborting the current transaction and have a chance of the fsync
succeeding after falling back to a transaction commit.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203413 Fixes: 79787eaab46121 ("btrfs: replace many BUG_ONs with proper error handling") Cc: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a file's compression property is set as zlib or zstd but leave
the compression mount option not be set, that means btrfs will try
to compress the file with default compression level. But in
btrfs_compress_pages(), it calls get_workspace() with level = 0.
This will return a workspace with a wrong compression level.
For zlib, the compression level in the workspace will be 0
(that means "store only"). And for zstd, the compression in the
workspace will be 1, not the default level 3.
How to reproduce:
mkfs -t btrfs /dev/sdb
mount /dev/sdb /mnt/
mkdir /mnt/zlib
btrfs property set /mnt/zlib/ compression zlib
dd if=/dev/zero of=/mnt/zlib/compression-friendly-file-10M bs=1M count=10
sync
btrfs-debugfs -f /mnt/zlib/compression-friendly-file-10M
btrfs-debugfs output:
* before:
...
(258 9961472): ram 524288 disk 1106247680 disk_size 524288
file: ... extents 20 disk size 10485760 logical size 10485760 ratio 1.00
* after:
...
(258 10354688): ram 131072 disk 14217216 disk_size 4096
file: ... extents 80 disk size 327680 logical size 10485760 ratio 32.00
The steps for zstd are similar, but need to put a debugging message to
show the level of the return workspace in zstd_get_workspace().
This commit adds a check of the compression level before getting a
workspace by set_level().
CC: stable@vger.kernel.org # 5.1+ Signed-off-by: Johnny Chang <johnnyc@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If we have an error writing out a delalloc range in
btrfs_punch_hole_lock_range we'll unlock the inode and then goto
out_only_mutex, where we will again unlock the inode. This is bad,
don't do this.
Fixes: f27451f22996 ("Btrfs: add support for fallocate's zero range operation") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 4d207133e9c3 changed the types of the statistic values in struct
gfs2_lkstats from s64 to u64. Because of that, what should be a signed
value in gfs2_update_stats turned into an unsigned value. When shifted
right, we end up with a large positive value instead of a small negative
value, which results in an incorrect variance estimate.
Fixes: 4d207133e9c3 ("gfs2: Make statistics unsigned, suitable for use with do_div()") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Cc: stable@vger.kernel.org # v4.4+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
DMA allocations that can't sleep may return non-remapped addresses, but
we do not properly handle them in the mmap and get_sgtable methods.
Resolve non-vmalloc addresses using virt_to_page to handle this corner
case.
Cc: <stable@vger.kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Although we merged support for pseudo-nmi using interrupt priority
masking in 5.1, we've since uncovered a number of non-trivial issues
with the implementation. Although there are patches pending to address
these problems, we're facing issues that prevent us from merging them at
this current time:
7290d5809571 ("module: use relative references for __ksymtab entries")
updated the ksymtab handling of some KASLR capable architectures
so that ksymtab entries are emitted as pairs of 32-bit relative
references. This reduces the size of the entries, but more
importantly, it gets rid of statically assigned absolute
addresses, which require fixing up at boot time if the kernel
is self relocating (which takes a 24 byte RELA entry for each
member of the ksymtab struct).
Since ksymtab entries are always part of the same module as the
symbol they export, it was assumed at the time that a 32-bit
relative reference is always sufficient to capture the offset
between a ksymtab entry and its target symbol.
Unfortunately, this is not always true: in the case of per-CPU
variables, a per-CPU variable's base address (which usually differs
from the actual address of any of its per-CPU copies) is allocated
in the vicinity of the ..data.percpu section in the core kernel
(i.e., in the per-CPU reserved region which follows the section
containing the core kernel's statically allocated per-CPU variables).
Since we randomize the module space over a 4 GB window covering
the core kernel (based on the -/+ 4 GB range of an ADRP/ADD pair),
we may end up putting the core kernel out of the -/+ 2 GB range of
32-bit relative references of module ksymtab entries that refer to
per-CPU variables.
So reduce the module randomization range a bit further. We lose
1 bit of randomization this way, but this is something we can
tolerate.
Jeff discovered that performance improves from ~375K iops to ~519K iops
on a simple psync-write fio workload when moving the location of 'struct
page' from the default PMEM location to DRAM. This result is surprising
because the expectation is that 'struct page' for dax is only needed for
third party references to dax mappings. For example, a dax-mapped buffer
passed to another system call for direct-I/O requires 'struct page' for
sending the request down the driver stack and pinning the page. There is
no usage of 'struct page' for first party access to a file via
read(2)/write(2) and friends.
However, this "no page needed" expectation is violated by
CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in
copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The
check_heap_object() helper routine assumes the buffer is backed by a
slab allocator (DRAM) page and applies some checks. Those checks are
invalid, dax pages do not originate from the slab, and redundant,
dax_iomap_actor() has already validated that the I/O is within bounds.
Specifically that routine validates that the logical file offset is
within bounds of the file, then it does a sector-to-pfn translation
which validates that the physical mapping is within bounds of the block
device.
Bypass additional hardened usercopy overhead and call the 'no check'
versions of the copy_{to,from}_iter operations directly.
Fixes: 0aed55af8834 ("x86, uaccess: introduce copy_from_iter_flushcache...") Cc: <stable@vger.kernel.org> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <willy@infradead.org> Reported-and-tested-by: Jeff Smits <jeff.smits@intel.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Current logic does not allow VCPU to be loaded onto CPU with
APIC ID 255. This should be allowed since the host physical APIC ID
field in the AVIC Physical APIC table entry is an 8-bit value,
and APIC ID 255 is valid in system with x2APIC enabled.
Instead, do not allow VCPU load if the host APIC ID cannot be
represented by an 8-bit value.
Also, use the more appropriate AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK
instead of AVIC_MAX_PHYSICAL_ID_COUNT.
When assigning kvm irqfd we didn't check the irqchip mode but we allow
KVM_IRQFD to succeed with all the irqchip modes. However it does not
make much sense to create irqfd even without the kernel chips. Let's
provide a arch-dependent helper to check whether a specific irqfd is
allowed by the arch. At least for x86, it should make sense to check:
- when irqchip mode is NONE, all irqfds should be disallowed, and,
- when irqchip mode is SPLIT, irqfds that are with resamplefd should
be disallowed.
For either of the case, previously we'll silently ignore the irq or
the irq ack event if the irqchip mode is incorrect. However that can
cause misterious guest behaviors and it can be hard to triage. Let's
fail KVM_IRQFD even earlier to detect these incorrect configurations.
CC: Paolo Bonzini <pbonzini@redhat.com> CC: Radim Krčmář <rkrcmar@redhat.com> CC: Alex Williamson <alex.williamson@redhat.com> CC: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pankaj reports that starting with commit ad428cdb525a "dax: Check the
end of the block-device capacity with dax_direct_access()" device-mapper
no longer allows dax operation. This results from the stricter checks in
__bdev_dax_supported() that validate that the start and end of a
block-device map to the same 'pagemap' instance.
Teach the dax-core and device-mapper to validate the 'pagemap' on a
per-target basis. This is accomplished by refactoring the
bdev_dax_supported() internals into generic_fsdax_supported() which
takes a sector range to validate. Consequently generic_fsdax_supported()
is suitable to be used in a device-mapper ->iterate_devices() callback.
A new ->dax_supported() operation is added to allow composite devices to
split and route upper-level bdev_dax_supported() requests.
Fixes: ad428cdb525a ("dax: Check the end of the block-device...") Cc: <stable@vger.kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reported-by: Pankaj Gupta <pagupta@redhat.com> Reviewed-by: Pankaj Gupta <pagupta@redhat.com> Tested-by: Pankaj Gupta <pagupta@redhat.com> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Without this check a snapshot is taken whenever a bucket's max is hit,
rather than only when the global max is hit, as it should be.
Before:
In this example, we do a first run of the workload (cyclictest),
examine the output, note the max ('triggering value') (347), then do
a second run and note the max again.
In this case, the max in the second run (39) is below the max in the
first run, but since we haven't cleared the histogram, the first max
is still in the histogram and is higher than any other max, so it
should still be the max for the snapshot. It isn't however - the
value should still be 347 after the second run.
# echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmax($wakeup_lat).save(next_prio,next_comm,prev_pid,prev_prio,prev_comm):onmax($wakeup_lat).snapshot() if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
Snapshot taken (see tracing/snapshot). Details:
triggering value { onmax($wakeup_lat) }: 39
triggered by event with key: { next_pid: 2150 }
After:
In this example, we do a first run of the workload (cyclictest),
examine the output, note the max ('triggering value') (375), then do
a second run and note the max again.
In this case, the max in the second run is still 375, the highest in
any bucket, as it should be.
The iproc host eMMC/SD controller hold time does not meet the
specification in the HS50 mode. This problem can be mitigated
by disabling the HISPD bit; thus forcing the controller output
data to be driven on the falling clock edges rather than the
rising clock edges.
Stable tag (v4.12+) chosen to assist stable kernel maintainers so that
the change does not produce merge conflicts backporting to older kernel
versions. In reality, the timing bug existed since the driver was first
introduced but there is no need for this driver to be supported in kernel
versions that old.
The iproc host eMMC/SD controller hold time does not meet the
specification in the HS50 mode. This problem can be mitigated
by disabling the HISPD bit; thus forcing the controller output
data to be driven on the falling clock edges rather than the
rising clock edges.
This change applies only to the Cygnus platform.
Stable tag (v4.12+) chosen to assist stable kernel maintainers so that
the change does not produce merge conflicts backporting to older kernel
versions. In reality, the timing bug existed since the driver was first
introduced but there is no need for this driver to be supported in kernel
versions that old.
The kernel self-tests picked up an issue with CTR mode:
alg: skcipher: p8_aes_ctr encryption test failed (wrong result) on test vector 3, cfg="uneven misaligned splits, may sleep"
In the aesp8-ppc code from OpenSSL, there are two paths that
increment IVs: the bulk (8 at a time) path, and the individual
path which is used when there are fewer than 8 AES blocks to
process.
In the bulk path, the IV is incremented with vadduqm: "Vector
Add Unsigned Quadword Modulo", which does 128-bit addition.
In the individual path, however, the IV is incremented with
vadduwm: "Vector Add Unsigned Word Modulo", which instead
does 4 32-bit additions. Thus the IV would instead become FFFFFFFFFFFFFFFFFFFFFFFF00000000, throwing off the result.
Use vadduqm.
This was probably a typo originally, what with q and w being
adjacent. It is a pretty narrow edge case: I am really
impressed by the quality of the kernel self-tests!
Fixes: 5c380d623ed3 ("crypto: vmx - Add support for VMS instructions by ASM") Cc: stable@vger.kernel.org Signed-off-by: Daniel Axtens <dja@axtens.net> Acked-by: Nayna Jain <nayna@linux.ibm.com> Tested-by: Nayna Jain <nayna@linux.ibm.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The "hmac(sha3-224-generic)" algorithm has a descsize of 368 bytes,
which is greater than HASH_MAX_DESCSIZE (360) which is only enough for
sha3-224-generic. The check in shash_prepare_alg() doesn't catch this
because the HMAC template doesn't set descsize on the algorithms, but
rather sets it on each individual HMAC transform.
This causes a stack buffer overflow when SHASH_DESC_ON_STACK() is used
with hmac(sha3-224-generic).
Fix it by increasing HASH_MAX_DESCSIZE to the real maximum. Also add a
sanity check to hmac_init().
This was detected by the improved crypto self-tests in v5.2, by loading
the tcrypt module with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y enabled. I
didn't notice this bug when I ran the self-tests by requesting the
algorithms via AF_ALG (i.e., not using tcrypt), probably because the
stack layout differs in the two cases and that made a difference here.
KASAN report:
BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:359 [inline]
BUG: KASAN: stack-out-of-bounds in shash_default_import+0x52/0x80 crypto/shash.c:223
Write of size 360 at addr ffff8880651defc8 by task insmod/3689
This patch introduced regressions for devices that come online in
read-only state and subsequently switch to read-write.
Given how the partition code is currently implemented it is not
possible to persist the read-only flag across a device revalidate
call. This may need to get addressed in the future since it is common
for user applications to proactively call BLKRRPART.
Reverting this commit will re-introduce a regression where a
device-initiated revalidate event will cause the admin state to be
forgotten. A separate patch will address this issue.
Fixes: 20bd1d026aac ("scsi: sd: Keep disk read-only when re-reading partition") Cc: <stable@vger.kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This barrier only applies to the read-modify-write operations; in
particular, it does not apply to the atomic_set() primitive.
Replace the barrier with an smp_mb().
Fixes: dac56212e8127 ("bio: skip atomic inc/dec of ->bi_cnt for most use cases") Cc: stable@vger.kernel.org Reported-by: "Paul E. McKenney" <paulmck@linux.ibm.com> Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <ming.lei@redhat.com> Cc: linux-block@vger.kernel.org Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
672ff6cff80c ("KVM: x86: Raise #GP when guest vCPU do not support PMU")
my AMD guests started #GPing like this:
general protection fault: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 4355 Comm: bash Not tainted 5.1.0-rc6+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:x86_perf_event_update+0x3b/0xa0
with Code: pointing to RDPMC. It is RDPMC because the guest has the
hardware watchdog CONFIG_HARDLOCKUP_DETECTOR_PERF enabled which uses
perf. Instrumenting kvm_pmu_rdpmc() some, showed that it fails due to:
if (!pmu->version)
return 1;
which the above commit added. Since AMD's PMU leaves the version at 0,
that causes the #GP injection into the guest.
Set pmu->version arbitrarily to 1 and move it above the non-applicable
struct kvm_pmu members.
Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Cc: kvm@vger.kernel.org Cc: Liran Alon <liran.alon@oracle.com> Cc: Mihai Carabas <mihai.carabas@oracle.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: x86@kernel.org Cc: stable@vger.kernel.org Fixes: 672ff6cff80c ("KVM: x86: Raise #GP when guest vCPU do not support PMU") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 11988499e62b ("KVM: x86: Skip EFER vs. guest CPUID checks for
host-initiated writes", 2019-04-02) introduced a "return false" in a
function returning int, and anyway set_efer has a "nonzero on error"
conventon so it should be returning 1.
Reported-by: Pavel Machek <pavel@denx.de> Fixes: 11988499e62b ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes") Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We didn't wait for outstanding direct IO during truncate in nojournal
mode (as we skip orphan handling in that case). This can lead to fs
corruption or stale data exposure if truncate ends up freeing blocks
and these get reallocated before direct IO finishes. Fix the condition
determining whether the wait is necessary.
CC: stable@vger.kernel.org Fixes: 1c9114f9c0f1 ("ext4: serialize unlocked dio reads with truncate") Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It is possible that unlinked inode enters ext4_setattr() (e.g. if
somebody calls ftruncate(2) on unlinked but still open file). In such
case we should not delete the inode from the orphan list if truncate
fails. Note that this is mostly a theoretical concern as filesystem is
corrupted if we reach this path anyway but let's be consistent in our
orphan handling.
Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
User Mode Linux does not have access to the ip or sp fields of the pt_regs,
and accessing them causes UML to fail to build. Hide the int3_emulate_jmp()
and int3_emulate_call() instructions from UML, as it doesn't need them
anyway.
Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A fallthrough in switch/case was introduced in f627caf55b8e ("fbdev:
sm712fb: fix crashes and garbled display during DPMS modesetting"),
due to my copy-paste error, which would cause the memory clock frequency
for SM720 to be programmed to SM712.
Since it only reprograms the clock to a different frequency, it's only
a benign issue without visible side-effect, so it also evaded Sudip
Mukherjee's code review and regression tests. scripts/checkpatch.pl
also failed to discover the issue, possibly due to nested switch
statements.
This issue was found by Stephen Rothwell by building linux-next with
-Wimplicit-fallthrough.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: f627caf55b8e ("fbdev: sm712fb: fix crashes and garbled display during DPMS modesetting") Signed-off-by: Yifeng Li <tomli@tomli.me> Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Cc: "Gustavo A. R. Silva" <gustavo@embeddedor.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The main 3.3V regulator sources a series of additional regulators.
This patch adds a small delay, so when the 3.3V regulator comes
on it delays a bit before the subsequent regulators can come on.
This reduces the inrush current a bit on the external DC power
supply to help prevent a situation where the sourcing power supply
cannot source enough current and overloads and the kit fails to
start.
Fixes: 1c207f911fe9 ("ARM: dts: imx: Add support for Logic PD i.MX6QD EVM") Signed-off-by: Adam Ford <aford173@gmail.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some USB peripherals draw more power, and the sourcing regulator
take a little time to turn on. This patch fixes an issue where
some devices occasionally do not get detected, because the power
isn't quite ready when communication starts, so we add a bit
of a delay.
Fixes: 1c207f911fe9 ("ARM: dts: imx: Add support for Logic PD i.MX6QD EVM") Signed-off-by: Adam Ford <aford173@gmail.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after
merge_reloc_roots()") expands the life span of root->reloc_root.
This breaks certain checs of fs_info->reloc_ctl. Before that commit, if
we have a root with valid reloc_root, then it's ensured to have
fs_info->reloc_ctl.
But now since reloc_root doesn't always mean a valid fs_info->reloc_ctl,
such check is unreliable and can cause the following NULL pointer
dereference:
As Stepan Golosunov points out, there is a small mistake in the
get_timespec64() function in the kernel. It was originally added under the
assumption that CONFIG_64BIT_TIME would get enabled on all 32-bit and
64-bit architectures, but when the conversion was done, it was only turned
on for 32-bit ones.
The effect is that the get_timespec64() function never clears the upper
half of the tv_nsec field for 32-bit tasks in compat mode. Clearing this is
required for POSIX compliant behavior of functions that pass a 'timespec'
structure with a 64-bit tv_sec and a 32-bit tv_nsec, plus uninitialized
padding.
The easiest fix for linux-5.1 is to just make the Kconfig symbol
unconditional, as it was originally intended. As a follow-up, the #ifdef
CONFIG_64BIT_TIME can be removed completely..
Note: for native 32-bit mode, no change is needed, this works as
designed and user space should never need to clear the upper 32
bits of the tv_nsec field, in or out of the kernel.
One of the biggest issues we face right now with picking LRU map over
regular hash table is that a map walk out of user space, for example,
to just dump the existing entries or to remove certain ones, will
completely mess up LRU eviction heuristics and wrong entries such
as just created ones will get evicted instead. The reason for this
is that we mark an entry as "in use" via bpf_lru_node_set_ref() from
system call lookup side as well. Thus upon walk, all entries are
being marked, so information of actual least recently used ones
are "lost".
In case of Cilium where it can be used (besides others) as a BPF
based connection tracker, this current behavior causes disruption
upon control plane changes that need to walk the map from user space
to evict certain entries. Discussion result from bpfconf [0] was that
we should simply just remove marking from system call side as no
good use case could be found where it's actually needed there.
Therefore this patch removes marking for regular LRU and per-CPU
flavor. If there ever should be a need in future, the behavior could
be selected via map creation flag, but due to mentioned reason we
avoid this here.
[0] http://vger.kernel.org/bpfconf.html
Fixes: 29ba732acbee ("bpf: Add BPF_MAP_TYPE_LRU_HASH") Fixes: 8f8449384ec3 ("bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add a callback map_lookup_elem_sys_only() that map implementations
could use over map_lookup_elem() from system call side in case the
map implementation needs to handle the latter differently than from
the BPF data path. If map_lookup_elem_sys_only() is set, this will
be preferred pick for map lookups out of user space. This hook is
used in a follow-up fix for LRU map, but once development window
opens, we can convert other map types from map_lookup_elem() (here,
the one called upon BPF_MAP_LOOKUP_ELEM cmd is meant) over to use
the callback to simplify and clean up the latter.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For iptable module to load a bpf program from a pinned location, it
only retrieve a loaded program and cannot change the program content so
requiring a write permission for it might not be necessary.
Also when adding or removing an unrelated iptable rule, it might need to
flush and reload the xt_bpf related rules as well and triggers the inode
permission check. It might be better to remove the write premission
check for the inode so we won't need to grant write access to all the
processes that flush and restore iptables rules.
In commit 376991db4b64 ("driver core: Postpone DMA tear-down until after
devres release"), we changed the ordering of tearing down the device DMA
ops and releasing all the device's resources; this was because the DMA ops
should be maintained until we release the device's managed DMA memories.
However, we have seen another crash on an arm64 system when a
device driver probe fails:
hisi_sas_v3_hw 0000:74:02.0: Adding to iommu group 2
scsi host1: hisi_sas_v3_hw
BUG: Bad page state in process swapper/0 pfn:313f5
page:ffff7e0000c4fd40 count:1 mapcount:0
mapping:0000000000000000 index:0x0
flags: 0xfffe00000001000(reserved)
raw: 0fffe00000001000ffff7e0000c4fd48ffff7e0000c4fd48 0000000000000000
raw: 0000000000000000000000000000000000000001ffffffff 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags: 0x1000(reserved)
Modules linked in:
CPU: 49 PID: 1 Comm: swapper/0 Not tainted 5.1.0-rc1-43081-g22d97fd-dirty #1433
Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI
RC0 - V1.12.01 01/29/2019
Call trace:
dump_backtrace+0x0/0x118
show_stack+0x14/0x1c
dump_stack+0xa4/0xc8
bad_page+0xe4/0x13c
free_pages_check_bad+0x4c/0xc0
__free_pages_ok+0x30c/0x340
__free_pages+0x30/0x44
__dma_direct_free_pages+0x30/0x38
dma_direct_free+0x24/0x38
dma_free_attrs+0x9c/0xd8
dmam_release+0x20/0x28
release_nodes+0x17c/0x220
devres_release_all+0x34/0x54
really_probe+0xc4/0x2c8
driver_probe_device+0x58/0xfc
device_driver_attach+0x68/0x70
__driver_attach+0x94/0xdc
bus_for_each_dev+0x5c/0xb4
driver_attach+0x20/0x28
bus_add_driver+0x14c/0x200
driver_register+0x6c/0x124
__pci_register_driver+0x48/0x50
sas_v3_pci_driver_init+0x20/0x28
do_one_initcall+0x40/0x25c
kernel_init_freeable+0x2b8/0x3c0
kernel_init+0x10/0x100
ret_from_fork+0x10/0x18
Disabling lock debugging due to kernel taint
BUG: Bad page state in process swapper/0 pfn:313f6
page:ffff7e0000c4fd80 count:1 mapcount:0
mapping:0000000000000000 index:0x0
[ 89.322983] flags: 0xfffe00000001000(reserved)
raw: 0fffe00000001000ffff7e0000c4fd88ffff7e0000c4fd88 0000000000000000
raw: 0000000000000000000000000000000000000001ffffffff 0000000000000000
The crash occurs for the same reason.
In this case, on the really_probe() failure path, we are still clearing
the DMA ops prior to releasing the device's managed memories.
This patch fixes this issue by reordering the DMA ops teardown and the
call to devres_release_all() on the failure path.
Reported-by: Xiang Chen <chenxiang66@hisilicon.com> Tested-by: Xiang Chen <chenxiang66@hisilicon.com> Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On imx8mq B0 chip, AHB/SDMA clock ratio 2:1 can't be supported,
since SDMA clock ratio has to be increased to 250Mhz, AHB can't reach
to 500Mhz, so use 1:1 instead.
To limit this change to the imx8mq for now this patch also adds an
im8mq-sdma compatible string.
Signed-off-by: Angus Ainslie (Purism) <angus@akkea.ca> Acked-by: Robin Gong <yibin.gong@nxp.com> Signed-off-by: Vinod Koul <vkoul@kernel.org> Cc: Richard Leitner <richard.leitner@skidata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The problem is that any 'uptodate' vs 'disks' check is not precise
in this path. Put a "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the
device that might try to kick off writes and then skip the action.
Better to prevent the raid driver from taking unexpected action *and* keep
the system alive vs killing the machine with BUG_ON.
Note: fixed warning reported by kbuild test robot <lkp@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com> Cc: Nigel Croxon <ncroxon@redhat.com> Cc: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 61697a6abd24 ("dm: eliminate 'split_discard_bios' flag from DM
target interface") incorrectly removed code from
__send_changing_extent_only() that is required to impose a per-target IO
boundary on IO that exceeds max_io_len_target_boundary(). Otherwise
"special" IO (e.g. DISCARD, WRITE SAME, WRITE ZEROES) can write beyond
where allowed.
Fix this by restoring the max_io_len_target_boundary() limit in
__send_changing_extent_only()
Fixes: 61697a6abd24 ("dm: eliminate 'split_discard_bios' flag from DM target interface") Cc: stable@vger.kernel.org # 5.1+ Signed-off-by: Michael Lass <bevan@bi-co.net> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Starting from commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per
POSIX") files opened even via nonseekable_open gate read and write via lock
and do not allow them to be run simultaneously. This can create read vs
write deadlock if a filesystem is trying to implement a socket-like file
which is intended to be simultaneously used for both read and write from
filesystem client. See commit 10dce8af3422 ("fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock") for details and e.g. commit 581d21a2d02a ("xenbus: fix deadlock
on writes to /proc/xen/xenbus") for a similar deadlock example on
/proc/xen/xenbus.
To avoid such deadlock it was tempting to adjust fuse_finish_open to use
stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags,
but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
and in particular GVFS which actually uses offset in its read and write
handlers
so if we would do such a change it will break a real user.
Add another flag (FOPEN_STREAM) for filesystem servers to indicate that the
opened handler is having stream-like semantics; does not use file position
and thus the kernel is free to issue simultaneous read and write request on
opened file handle.
This patch together with stream_open() should be added to stable kernels
starting from v3.14+. This will allow to patch OSSPD and other FUSE
filesystems that provide stream-like files to return FOPEN_STREAM |
FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on all
kernel versions. This should work because fuse_finish_open ignores unknown
open flags returned from a filesystem and so passing FOPEN_STREAM to a
kernel that is not aware of this flag cannot hurt. In turn the kernel that
is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE
is sufficient to implement streams without read vs write deadlock.
Commit b592211c33f7 ("dm mpath: fix attached_handler_name leak and
dangling hw_handler_name pointer") fixed a memory leak for the case
where setup_scsi_dh() returns failure. But setup_scsi_dh may return
success and not "use" attached_handler_name if the
retain_attached_hwhandler flag is not set on the map. As setup_scsi_sh
properly "steals" the pointer by nullifying it, freeing it
unconditionally in parse_path() is safe.
Fixes: b592211c33f7 ("dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer") Cc: stable@vger.kernel.org Reported-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The dm_early_create() function (which deals with "dm-mod.create=" kernel
command line option) calls dm_hash_insert() who gets an extra reference
to the md object.
In case of failure, this reference wasn't being released, causing
dm_destroy() to hang, thus hanging the whole boot process.
Fix this by calling __hash_remove() in the error path.
Fixes: 6bbc923dfcf57d ("dm: add support to directly boot to a mapped device") Cc: stable@vger.kernel.org Signed-off-by: Helen Koike <helen.koike@collabora.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we use separate devices for data and metadata, dm-integrity would
incorrectly calculate the size of the metadata device as if it had
512-byte block size - and it would refuse activation with larger block
size and smaller metadata device.
Fix this so that it takes actual block size into account, which fixes
the following reported issue:
https://gitlab.com/cryptsetup/cryptsetup/issues/450
The information about tag size should not be printed without debug info
set. Also print device major:minor in the error message to identify the
device instance.
Also use rate limiting and debug level for info about used crypto API
implementaton. This is important because during online reencryption
the existing message saturates syslog (because we are moving hotzone
across the whole device).
Cc: stable@vger.kernel.org Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the target line contains an invalid device, delay_ctr() will call
delay_dtr() with NULL workqueue. Attempting to destroy the NULL
workqueue causes a crash.
dm-init should allow up to DM_MAX_{DEVICES,TARGETS} for devices/targets,
and not DM_MAX_{DEVICES,TARGETS} - 1.
Fix the checks and also fix the error message when the number of devices
is surpassed.
Fixes: 6bbc923dfcf57d ("dm: add support to directly boot to a mapped device") Cc: stable@vger.kernel.org Signed-off-by: Helen Koike <helen.koike@collabora.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The function blkdev_report_zones() returns success even if no zone
information is reported (empty report). Empty zone reports can only
happen if the report start sector passed exceeds the device capacity.
The conditions for this to happen are either a bug in the caller code,
or, a change in the device that forced the low level driver to change
the device capacity to a value that is lower than the report start
sector. This situation includes a failed disk revalidation resulting in
the disk capacity being changed to 0.
If this change happens while dm-zoned is in its initialization phase
executing dmz_init_zones(), this function may enter an infinite loop
and hang the system. To avoid this, add a check to disallow empty zone
reports and bail out early. Also fix the function dmz_update_zone() to
make sure that the report for the requested zone was correctly obtained.
Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Shaun Tancheff <shaun@tancheff.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Due to an erratum in some Pericom PCIe-to-PCI bridges in reverse mode
(conventional PCI on primary side, PCIe on downstream side), the Retrain
Link bit needs to be cleared manually to allow the link training to
complete successfully.
If it is not cleared manually, the link training is continuously restarted
and no devices below the PCI-to-PCIe bridge can be accessed. That means
drivers for devices below the bridge will be loaded but won't work and may
even crash because the driver is only reading 0xffff.
See the Pericom Errata Sheet PI7C9X111SLB_errata_rev1.2_102711.pdf for
details. Devices known as affected so far are: PI7C9X110, PI7C9X111SL,
PI7C9X130.
Add a new flag, clear_retrain_link, in struct pci_dev. Quirks for affected
devices set this bit.
Note that pcie_retrain_link() lives in aspm.c because that's currently the
only place we use it, but this erratum is not specific to ASPM, and we may
retrain links for other reasons in the future.
Signed-off-by: Stefan Mätje <stefan.maetje@esd.eu>
[bhelgaas: apply regardless of CONFIG_PCIEASPM] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> CC: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reestablish the PCIe link very early in the resume process in case it
went down to prevent PCI accesses from hanging the bus. Such accesses
can happen early in the PCI resume process, as early as the
SUSPEND_RESUME_NOIRQ step, thus the link must be reestablished in the
driver resume_noirq() callback.
Fixes: e015f88c368d ("PCI: rcar: Add support for R-Car H3 to pcie-rcar") Signed-off-by: Kazufumi Ikeda <kaz-ikeda@xc.jp.nec.com> Signed-off-by: Gaku Inami <gaku.inami.xw@bp.renesas.com> Signed-off-by: Marek Vasut <marek.vasut+renesas@gmail.com>
[lorenzo.pieralisi@arm.com: reformatted commit log] Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Simon Horman <horms+renesas@verge.net.au> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Cc: stable@vger.kernel.org Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Phil Edworthy <phil.edworthy@renesas.com> Cc: Simon Horman <horms+renesas@verge.net.au> Cc: Wolfram Sang <wsa@the-dreams.de> Cc: linux-renesas-soc@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 60ed982a4e78 ("PCI/AER: Move internal declarations to
drivers/pci/pci.h") changed pci_aer_init() to return "void", but didn't
change the stub for when CONFIG_PCIEAER isn't enabled. Change the stub to
match.
Two functions allocate a host bridge: devm_pci_alloc_host_bridge() and
pci_alloc_host_bridge(). At the moment, only the unmanaged one initializes
the PCIe feature bits, which prevents from using features such as hotplug
or AER on some systems, when booting with device tree. Make the
initialization code common.
On ThinkPad P50 SKUs with an Nvidia Quadro M1000M instead of the M2000M
variant, the BIOS does not always reset the secondary Nvidia GPU during
reboot if the laptop is configured in Hybrid Graphics mode. The reason is
unknown, but the following steps and possibly a good bit of patience will
reproduce the issue:
1. Boot up the laptop normally in Hybrid Graphics mode
2. Make sure nouveau is loaded and that the GPU is awake
3. Allow the Nvidia GPU to runtime suspend itself after being idle
4. Reboot the machine, the more sudden the better (e.g. sysrq-b may help)
5. If nouveau loads up properly, reboot the machine again and go back to
step 2 until you reproduce the issue
This results in some very strange behavior: the GPU will be left in exactly
the same state it was in when the previously booted kernel started the
reboot. This has all sorts of bad side effects: for starters, this
completely breaks nouveau starting with a mysterious EVO channel failure
that happens well before we've actually used the EVO channel for anything:
This causes a timeout trying to bring up the GR ctx:
nouveau 0000:01:00.0: timeout
WARNING: CPU: 0 PID: 12 at drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgf100.c:1547 gf100_grctx_generate+0x7b2/0x850 [nouveau]
Hardware name: LENOVO 20EQS64N0B/20EQS64N0B, BIOS N1EET82W (1.55 ) 12/18/2018
Workqueue: events_long drm_dp_mst_link_probe_work [drm_kms_helper]
...
nouveau 0000:01:00.0: gr: wait for idle timeout (en: 1, ctxsw: 0, busy: 1)
nouveau 0000:01:00.0: gr: wait for idle timeout (en: 1, ctxsw: 0, busy: 1)
nouveau 0000:01:00.0: fifo: fault 01 [WRITE] at 0000000000008000 engine 00 [GR] client 15 [HUB/SCC_NB] reason c4 [] on channel -1 [0000000000 unknown]
The GPU never manages to recover. Booting without loading nouveau causes
issues as well, since the GPU starts sending spurious interrupts that cause
other device's IRQs to get disabled by the kernel:
irq 16: nobody cared (try booting with the "irqpoll" option)
...
handlers:
[<000000007faa9e99>] i801_isr [i2c_i801]
Disabling IRQ #16
...
serio: RMI4 PS/2 pass-through port at rmi4-00.fn03
i801_smbus 0000:00:1f.4: Timeout waiting for interrupt!
i801_smbus 0000:00:1f.4: Transaction timeout
rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-110).
i801_smbus 0000:00:1f.4: Timeout waiting for interrupt!
i801_smbus 0000:00:1f.4: Transaction timeout
rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
This causes the touchpad and sometimes other things to get disabled.
Since this happens without nouveau, we can't fix this problem from nouveau
itself.
Add a PCI quirk for the specific P50 variant of this GPU. Make sure the
GPU is advertising NoReset- so we don't reset the GPU when the machine is
in Dedicated graphics mode (where the GPU being initialized by the BIOS is
normal and expected). Map the GPU MMIO space and read the magic 0x2240c
register, which will have bit 1 set if the device was POSTed during a
previous boot. Once we've confirmed all of this, reset the GPU and
re-disable it - bringing it back to a healthy state.
When using PCI passthrough with this device, the host machine locks up
completely when starting the VM, requiring a hard reboot. Add a quirk to
avoid bus resets on this device.
Fixes: c3e59ee4e766 ("PCI: Mark Atheros AR93xx to avoid bus reset") Link: https://lore.kernel.org/linux-pci/20190107213248.3034-1-james.prestwood@linux.intel.com Signed-off-by: James Prestwood <james.prestwood@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> CC: stable@vger.kernel.org # v3.14+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ATS is broken on the Radeon R7 GPU (at least for Stoney Ridge based laptop)
and causes IOMMU stalls and system failure. Disable ATS on these devices
to make them usable again with IOMMU enabled.
Thanks to Joerg Roedel <jroedel@suse.de> for help.
[bhelgaas: In the email thread mentioned below, Alex suspects the real
problem is in sbios or iommu, so it may affect only certain systems, and it
may affect other devices in those systems as well. However, per Joerg we
lack the ability to debug further, so this quirk is the best we can do for
now.]
On a Thinkpad s30 (Pentium III / i440MX, Lynx3DM), blanking the display
or starting the X server will crash and freeze the system, or garble the
display.
Experiments showed this problem can mostly be solved by adjusting the
order of register writes. Also, sm712fb failed to consider the difference
of clock frequency when unblanking the display, and programs the clock for
SM712 to SM720.
Fix them by adjusting the order of register writes, and adding an
additional check for SM720 for programming the clock frequency.
Loongson MIPS netbooks use 1024x600 LCD panels, which is the original
target platform of this driver, but nearly all old x86 laptops have
1024x768. Lighting 768 panels using 600's timings would partially
garble the display. Since it's not possible to distinguish them reliably,
we change the default to 768, but keep 600 as-is on MIPS.
Further, earlier laptops, such as IBM Thinkpad 240X, has a 800x600 LCD
panel, this driver would probably garbled those display. As we don't
have one for testing, the original behavior of the driver is kept as-is,
but the problem has been documented is the comments.
In order to support the 1024x600 panel on Yeeloong Loongson MIPS
laptop, the original 1024x768-16 table was modified to 1024x600-16,
without leaving the original. It causes problem on x86 laptop as
the 1024x768-16 support was still claimed but not working.
On a Thinkpad s30 (Pentium III / i440MX, Lynx3DM), running fbtest or X
will crash the machine instantly, because the VRAM/framebuffer is not
mapped correctly.
On SM712, the framebuffer starts at the beginning of address space, but
SM720's framebuffer starts at the 1 MiB offset from the beginning. However,
sm712fb fails to take this into account, as a result, writing to the
framebuffer will destroy all the registers and kill the system immediately.
Another problem is the driver assumes 8 MiB of VRAM for SM720, but some
SM720 system, such as this IBM Thinkpad, only has 4 MiB of VRAM.
Fix this problem by removing the hardcoded VRAM size, adding a function to
query the amount of VRAM from register MCR76 on SM720, and adding proper
framebuffer offset.
Please note that the memory map may have additional problems on Big-Endian
system, which is not available for testing by myself. But I highly suspect
that the original code is also broken on Big-Endian machines for SM720, so
at least we are not making the problem worse. More, the driver also assumed
SM710/SM712 has 4 MiB of VRAM, but it has a 2 MiB version as well, and used
in earlier laptops, such as IBM Thinkpad 240X, the driver would probably
crash on them. I've never seen one of those machines and cannot fix it, but
I have documented these problems in the comments.
When the machine is booted in VGA mode, loading sm712fb would cause
a glitch of random pixels shown on the screen. To prevent it from
happening, we first clear the entire framebuffer, and we also need
to stop calling smtcfb_setmode() during initialization, the fbdev
layer will call it for us later when it's ready.
On a Thinkpad s30 (Pentium III / i440MX, Lynx3DM), rebooting with
sm712fb framebuffer driver would cause a white screen of death on
the next POST, presumably the proper timings for the LCD panel was
not reprogrammed properly by the BIOS.
Experiments showed a few CRTC Scratch Registers, including CRT3D,
CRT3E and CRT3F may be used internally by BIOS as some flags. CRT3B is
a hardware testing register, we shouldn't mess with it. CRT3C has
blanking signal and line compare control, which is not needed for this
driver.
Stop writing to CR3B-CR3F (a.k.a CRT3B-CRT3F) registers. Even if these
registers don't have side-effect on other systems, writing to them is
also highly questionable.
On a Thinkpad s30 (Pentium III / i440MX, Lynx3DM), the amount of Video
RAM is not detected correctly by the xf86-video-siliconmotion driver.
This is because sm712fb overwrites the GPR71 Scratch Pad Register, which
is set by BIOS on x86 and used to indicate amount of VRAM.
Other Scratch Pad Registers, including GPR70/74/75, don't have the same
side-effect, but overwriting to them is still questionable, as they are
not related to modesetting.
Stop writing to SR70/71/74/75 (a.k.a GPR70/71/74/75).
On a Thinkpad s30 (Pentium III / i440MX, Lynx3DM), rebooting with
sm712fb framebuffer driver would cause the role of brightness up/down
button to swap.
Experiments showed the FPR30 register caused this behavior. Moreover,
even if this register don't have side-effect on other systems, over-
writing it is also highly questionable, since it was originally
configurated by the motherboard manufacturer by hardwiring pull-down
resistors to indicate the type of LCD panel. We should not mess with
it.
38ac0287b7f4 ("fbdev/efifb: Honour UEFI memory map attributes when mapping the FB")
updated the EFI framebuffer code to use memory mappings for the linear
framebuffer that are permitted by the memory attributes described by the
EFI memory map for the particular region, if the framebuffer happens to
be covered by the EFI memory map (which is typically only the case for
framebuffers in shared memory). This is required since non-x86 systems
may require cacheable attributes for memory mappings that are shared
with other masters (such as GPUs), and this information cannot be
described by the Graphics Output Protocol (GOP) EFI protocol itself,
and so we rely on the EFI memory map for this.
As reported by James, this breaks some x86 systems:
[ 1.173368] efifb: probing for efifb
[ 1.173386] efifb: abort, cannot remap video memory 0x1d5000 @ 0xcf800000
[ 1.173395] Trying to free nonexistent resource <00000000cf800000-00000000cf9d4bff>
[ 1.173413] efi-framebuffer: probe of efi-framebuffer.0 failed with error -5
The problem turns out to be that the memory map entry that describes the
framebuffer has no memory attributes listed at all, and so we end up with
a mem_flags value of 0x0.
So work around this by ensuring that the memory map entry's attribute field
has a sane value before using it to mask the set of usable attributes.
Reported-by: James Hilliard <james.hilliard1@gmail.com> Tested-by: James Hilliard <james.hilliard1@gmail.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: <stable@vger.kernel.org> # v4.19+ Cc: Borislav Petkov <bp@alien8.de> Cc: James Morse <james.morse@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Jones <pjones@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Fixes: 38ac0287b7f4 ("fbdev/efifb: Honour UEFI memory map attributes when ...") Link: http://lkml.kernel.org/r/20190516213159.3530-2-ard.biesheuvel@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This is a bit of a mess, to put it mildly. But, it's a bug
that only seems to have showed up in 4.20 but wasn't noticed
until now, because nobody uses MPX.
MPX has the arch_unmap() hook inside of munmap() because MPX
uses bounds tables that protect other areas of memory. When
memory is unmapped, there is also a need to unmap the MPX
bounds tables. Barring this, unused bounds tables can eat 80%
of the address space.
But, the recursive do_munmap() that gets called vi arch_unmap()
wreaks havoc with __do_munmap()'s state. It can result in
freeing populated page tables, accessing bogus VMA state,
double-freed VMAs and more.
See the "long story" further below for the gory details.
To fix this, call arch_unmap() before __do_unmap() has a chance
to do anything meaningful. Also, remove the 'vma' argument
and force the MPX code to do its own, independent VMA lookup.
== UML / unicore32 impact ==
Remove unused 'vma' argument to arch_unmap(). No functional
change.
I compile tested this on UML but not unicore32.
== powerpc impact ==
powerpc uses arch_unmap() well to watch for munmap() on the
VDSO and zeroes out 'current->mm->context.vdso_base'. Moving
arch_unmap() makes this happen earlier in __do_munmap(). But,
'vdso_base' seems to only be used in perf and in the signal
delivery that happens near the return to userspace. I can not
find any likely impact to powerpc, other than the zeroing
happening a little earlier.
powerpc does not use the 'vma' argument and is unaffected by
its removal.
I compile-tested a 64-bit powerpc defconfig.
== x86 impact ==
For the common success case this is functionally identical to
what was there before. For the munmap() failure case, it's
possible that some MPX tables will be zapped for memory that
continues to be in use. But, this is an extraordinarily
unlikely scenario and the harm would be that MPX provides no
protection since the bounds table got reset (zeroed).
I can't imagine anyone doing this:
ptr = mmap();
// use ptr
ret = munmap(ptr);
if (ret)
// oh, there was an error, I'll
// keep using ptr.
Because if you're doing munmap(), you are *done* with the
memory. There's probably no good data in there _anyway_.
This passes the original reproducer from Richard Biener as
well as the existing mpx selftests/.
The long story:
munmap() has a couple of pieces:
1. Find the affected VMA(s)
2. Split the start/end one(s) if neceesary
3. Pull the VMAs out of the rbtree
4. Actually zap the memory via unmap_region(), including
freeing page tables (or queueing them to be freed).
5. Fix up some of the accounting (like fput()) and actually
free the VMA itself.
This specific ordering was actually introduced by:
dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
during the 4.20 merge window. The previous __do_munmap() code
was actually safe because the only thing after arch_unmap() was
remove_vma_list(). arch_unmap() could not see 'vma' in the
rbtree because it was detached, so it is not even capable of
doing operations unsafe for remove_vma_list()'s use of 'vma'.
Richard Biener reported a test that shows this in dmesg:
[1216548.787498] BUG: Bad rss-counter state mm:0000000017ce560b idx:1 val:551
[1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576
What triggered this was the recursive do_munmap() called via
arch_unmap(). It was freeing page tables that has not been
properly zapped.
But, the problem was bigger than this. For one, arch_unmap()
can free VMAs. But, the calling __do_munmap() has variables
that *point* to VMAs and obviously can't handle them just
getting freed while the pointer is still in use.
I tried a couple of things here. First, I tried to fix the page
table freeing problem in isolation, but I then found the VMA
issue. I also tried having the MPX code return a flag if it
modified the rbtree which would force __do_munmap() to re-walk
to restart. That spiralled out of control in complexity pretty
fast.
Just moving arch_unmap() and accepting that the bonkers failure
case might eat some bounds tables seems like the simplest viable
fix.
This was also reported in the following kernel bugzilla entry:
There are some reports that this commit triggered this bug:
dd2283f2605 ("mm: mmap: zap pages with read mmap_sem in munmap")
While that commit certainly made the issues easier to hit, I believe
the fundamental issue has been with us as long as MPX itself, thus
the Fixes: tag below is for one of the original MPX commits.
[ mingo: Minor edits to the changelog and the patch. ]
Reported-by: Richard Biener <rguenther@suse.de> Reported-by: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: linux-arch@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-um@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org Cc: stable@vger.kernel.org Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap") Link: http://lkml.kernel.org/r/20190419194747.5E1AD6DC@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>