When some ioctl flags are checked we return EOPNOTSUPP, like for
BTRFS_SCRUB_SUPPORTED_FLAGS, BTRFS_SUBVOL_CREATE_ARGS_MASK or fallocate
modes. The EINVAL is supposed to be for a supported but invalid
values or combination of options. Fix that when checking send flags so
it's consistent with the rest.
If a subvolume still exists, forbid deleting its qgroup 0/subvolid.
This behavior generally leads to incorrect behavior in squotas and
doesn't have a legitimate purpose.
Fixes: cecbb533b5fc ("btrfs: record simple quota deltas in delayed refs") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[BUG]
There is a syzbot crash, triggered by the ASSERT() during subvolume
creation:
assertion failed: !anon_dev, in fs/btrfs/disk-io.c:1319
------------[ cut here ]------------
kernel BUG at fs/btrfs/disk-io.c:1319!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
RIP: 0010:btrfs_get_root_ref.part.0+0x9aa/0xa60
<TASK>
btrfs_get_new_fs_root+0xd3/0xf0
create_subvol+0xd02/0x1650
btrfs_mksubvol+0xe95/0x12b0
__btrfs_ioctl_snap_create+0x2f9/0x4f0
btrfs_ioctl_snap_create+0x16b/0x200
btrfs_ioctl+0x35f0/0x5cf0
__x64_sys_ioctl+0x19d/0x210
do_syscall_64+0x3f/0xe0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
---[ end trace 0000000000000000 ]---
[CAUSE]
During create_subvol(), after inserting root item for the newly created
subvolume, we would trigger btrfs_get_new_fs_root() to get the
btrfs_root of that subvolume.
The idea here is, we have preallocated an anonymous device number for
the subvolume, thus we can assign it to the new subvolume.
But there is really nothing preventing things like backref walk to read
the new subvolume.
If that happens before we call btrfs_get_new_fs_root(), the subvolume
would be read out, with a new anonymous device number assigned already.
In that case, we would trigger ASSERT(), as we really expect no one to
read out that subvolume (which is not yet accessible from the fs).
But things like backref walk is still possible to trigger the read on
the subvolume.
Thus our assumption on the ASSERT() is not correct in the first place.
[FIX]
Fix it by removing the ASSERT(), and just free the @anon_dev, reset it
to 0, and continue.
If the subvolume tree is read out by something else, it should have
already get a new anon_dev assigned thus we only need to free the
preallocated one.
Reported-by: Chenyuan Yang <chenyuan0y@gmail.com> Fixes: 2dfb1e43f57d ("btrfs: preallocate anon block device at first phase of snapshot creation") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Creating a qgroup 0/subvolid leads to various races and it isn't
helpful, because you can't specify a subvol id when creating a subvol,
so you can't be sure it will be the right one. Any requirements on the
automatic subvol can be gratified by using a higher level qgroup and the
inheritance parameters of subvol creation.
Fixes: cecbb533b5fc ("btrfs: record simple quota deltas in delayed refs") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since commit 28270e25c69a ("btrfs: always reserve space for delayed refs
when starting transaction") we started not only to reserve metadata space
for the delayed refs a caller of btrfs_start_transaction() might generate
but also to try to fully refill the delayed refs block reserve, because
there are several case where we generate delayed refs and haven't reserved
space for them, relying on the global block reserve. Relying too much on
the global block reserve is not always safe, and can result in hitting
-ENOSPC during transaction commits or worst, in rare cases, being unable
to mount a filesystem that needs to do orphan cleanup or anything that
requires modifying the filesystem during mount, and has no more
unallocated space and the metadata space is nearly full. This was
explained in detail in that commit's change log.
However the gap between the reserved amount and the size of the delayed
refs block reserve can be huge, so attempting to reserve space for such
a gap can result in allocating many metadata block groups that end up
not being used. After a recent patch, with the subject:
"btrfs: add new unused block groups to the list of unused block groups"
We started to add new block groups that are unused to the list of unused
block groups, to avoid having them around for a very long time in case
they are never used, because a block group is only added to the list of
unused block groups when we deallocate the last extent or when mounting
the filesystem and the block group has 0 bytes used. This is not a problem
introduced by the commit mentioned earlier, it always existed as our
metadata space reservations are, most of the time, pessimistic and end up
not using all the space they reserved, so we can occasionally end up with
one or two unused metadata block groups for a long period. However after
that commit mentioned earlier, we are just more pessimistic in the
metadata space reservations when starting a transaction and therefore the
issue is more likely to happen.
This however is not always enough because we might create unused metadata
block groups when reserving metadata space at a high rate if there's
always a gap in the delayed refs block reserve and the cleaner kthread
isn't triggered often enough or is busy with other work (running delayed
iputs, cleaning deleted roots, etc), not to mention the block group's
allocated space is only usable for a new block group after the transaction
used to remove it is committed.
A user reported that he's getting a lot of allocated metadata block groups
but the usage percentage of metadata space was very low compared to the
total allocated space, specially after running a series of block group
relocations.
So for now stop trying to refill the gap in the delayed refs block reserve
and reserve space only for the delayed refs we are expected to generate
when starting a transaction.
Space reservations for metadata are, most of the time, pessimistic as we
reserve space for worst possible cases - where tree heights are at the
maximum possible height (8), we need to COW every extent buffer in a tree
path, need to split extent buffers, etc.
For data, we generally reserve the exact amount of space we are going to
allocate. The exception here is when using compression, in which case we
reserve space matching the uncompressed size, as the compression only
happens at writeback time and in the worst possible case we need that
amount of space in case the data is not compressible.
This means that when there's not available space in the corresponding
space_info object, we may need to allocate a new block group, and then
that block group might not be used after all. In this case the block
group is never added to the list of unused block groups and ends up
never being deleted - except if we unmount and mount again the fs, as
when reading block groups from disk we add unused ones to the list of
unused block groups (fs_info->unused_bgs). Otherwise a block group is
only added to the list of unused block groups when we deallocate the
last extent from it, so if no extent is ever allocated, the block group
is kept around forever.
This also means that if we have a bunch of tasks reserving space in
parallel we can end up allocating many block groups that end up never
being used or kept around for too long without being used, which has
the potential to result in ENOSPC failures in case for example we over
allocate too many metadata block groups and then end up in a state
without enough unallocated space to allocate a new data block group.
This is more likely to happen with metadata reservations as of kernel
6.7, namely since commit 28270e25c69a ("btrfs: always reserve space for
delayed refs when starting transaction"), because we started to always
reserve space for delayed references when starting a transaction handle
for a non-zero number of items, and also to try to reserve space to fill
the gap between the delayed block reserve's reserved space and its size.
So to avoid this, when finishing the creation a new block group, add the
block group to the list of unused block groups if it's still unused at
that time. This way the next time the cleaner kthread runs, it will delete
the block group if it's still unused and not needed to satisfy existing
space reservations.
Reported-by: Ivan Shapovalov <intelfx@intelfx.name> Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/ CC: stable@vger.kernel.org # 6.7+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Before deleting a block group that is in the list of unused block groups
(fs_info->unused_bgs), we check if the block group became used before
deleting it, as extents from it may have been allocated after it was added
to the list.
However even if the block group was not yet used, there may be tasks that
have only reserved space and have not yet allocated extents, and they
might be relying on the availability of the unused block group in order
to allocate extents. The reservation works first by increasing the
"bytes_may_use" field of the corresponding space_info object (which may
first require flushing delayed items, allocating a new block group, etc),
and only later a task does the actual allocation of extents.
For metadata we usually don't end up using all reserved space, as we are
pessimistic and typically account for the worst cases (need to COW every
single node in a path of a tree at maximum possible height, etc). For
data we usually reserve the exact amount of space we're going to allocate
later, except when using compression where we always reserve space based
on the uncompressed size, as compression is only triggered when writeback
starts so we don't know in advance how much space we'll actually need, or
if the data is compressible.
So don't delete an unused block group if the total size of its space_info
object minus the block group's size is less then the sum of used space and
space that may be used (space_info->bytes_may_use), as that means we have
tasks that reserved space and may need to allocate extents from the block
group. In this case, besides skipping the deletion, re-add the block group
to the list of unused block groups so that it may be reconsidered later,
in case the tasks that reserved space end up not needing to allocate
extents from it.
Allowing the deletion of the block group while we have reserved space, can
result in tasks failing to allocate metadata extents (-ENOSPC) while under
a transaction handle, resulting in a transaction abort, or failure during
writeback for the case of data extents.
CC: stable@vger.kernel.org # 6.0+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add a helper function to determine if a block group is being used and make
use of it at btrfs_delete_unused_bgs(). This helper will also be used in
future code changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
boundaries") incured regression for stress-ng pthread benchmark [1]. It
is because THP get allocated to pthread's stack area much more possible
than before. Pthread's stack area is allocated by mmap without
VM_GROWSDOWN or VM_GROWSUP flag, so kernel can't tell whether it is a
stack area or not.
The MAP_STACK flag is used to mark the stack area, but it is a no-op on
Linux. Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating THP
for such stack area.
In commit 4356e9f841f7 ("work around gcc bugs with 'asm goto' with
outputs") I did the gcc workaround unconditionally, because the cause of
the bad code generation wasn't entirely clear.
In the meantime, Jakub Jelinek debugged the issue, and has come up with
a fix in gcc [2], which also got backported to the still maintained
branches of gcc-11, gcc-12 and gcc-13.
Note that while the fix technically wasn't in the original gcc-14
branch, Jakub says:
"while it is true that no GCC 14 snapshots until today (or whenever the
fix will be committed) have the fix, for GCC trunk it is up to the
distros to use the latest snapshot if they use it at all and would
allow better testing of the kernel code without the workaround, so
that if there are other issues they won't be discovered years later.
Most userland code doesn't actually use asm goto with outputs..."
so we will consider gcc-14 to be fixed - if somebody is using gcc
snapshots of the gcc-14 before the fix, they should upgrade.
Note that while the bug goes back to gcc-11, in practice other gcc
changes seem to have effectively hidden it since gcc-12.1 as per a
bisect by Jakub. So even a gcc-14 snapshot without the fix likely
doesn't show actual problems.
Also, make the default 'asm_goto_output()' macro mark the asm as
volatile by hand, because of an unrelated gcc issue [1] where it doesn't
match the documented behavior ("asm goto is always volatile").
We've had issues with gcc and 'asm goto' before, and we created a
'asm_volatile_goto()' macro for that in the past: see commits 3f0116c3238a ("compiler/gcc4: Add quirk for 'asm goto' miscompilation
bug") and a9f180345f53 ("compiler/gcc4: Make quirk for
asm_volatile_goto() unconditional").
Then, much later, we ended up removing the workaround in commit 43c249ea0b1e ("compiler-gcc.h: remove ancient workaround for gcc PR
58670") because we no longer supported building the kernel with the
affected gcc versions, but we left the macro uses around.
Now, Sean Christopherson reports a new version of a very similar
problem, which is fixed by re-applying that ancient workaround. But the
problem in question is limited to only the 'asm goto with outputs'
cases, so instead of re-introducing the old workaround as-is, let's
rename and limit the workaround to just that much less common case.
It looks like there are at least two separate issues that all hit in
this area:
(a) some versions of gcc don't mark the asm goto as 'volatile' when it
has outputs:
rbtree lazy gc on insert might collect an end interval element that has
been just added in this transactions, skip end interval elements that
are not yet active.
DDPP: Disable Data path Parity Protection.
When it is 0x0, Data path Parity Protection is enabled.
When it is 0x1, Data path Parity Protection is disabled.
The cited commit introduces and uses the string constants dpp_tx_err and
dpp_rx_err. These are assigned to constant fields of the array
dwxgmac3_error_desc.
It has been reported that on GCC 6 and 7.5.0 this results in warnings
such as:
.../dwxgmac2_core.c:836:20: error: initialiser element is not constant
{ true, "TDPES0", dpp_tx_err },
I have been able to reproduce this using: GCC 7.5.0, 8.4.0, 9.4.0 and 10.5.0.
But not GCC 13.2.0.
So it seems this effects older compilers but not newer ones.
As Jon points out in his report, the minimum compiler supported by
the kernel is GCC 5.1, so it does seem that this ought to be fixed.
It is not clear to me what combination of 'const', if any, would address
this problem. So this patch takes of using #defines for the string
constants
Compile tested only.
Fixes: 46eba193d04f ("net: stmmac: xgmac: fix handling of DPP safety error for DMA channels") Reported-by: Jon Hunter <jonathanh@nvidia.com> Closes: https://lore.kernel.org/netdev/c25eb595-8d91-40ea-9f52-efa15ebafdbc@nvidia.com/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202402081135.lAxxBXHk-lkp@intel.com/ Signed-off-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240208-xgmac-const-v1-1-e69a1eeabfc8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
While inlining csum_and_memcpy() into memcpy_to_iter_csum(), the from
address passed to csum_partial_copy_nocheck() was accidentally changed.
This causes a regression in applications using UDP, as for example
OpenAFS, causing loss of datagrams.
Fixes: dc32bff195b4 ("iov_iter, net: Fold in csum_and_memcpy()") Cc: David Howells <dhowells@redhat.com> Cc: stable@vger.kernel.org Cc: regressions@lists.linux.dev Signed-off-by: Michael Lass <bevan@bi-co.net> Reviewed-by: Jeffrey Altman <jaltman@auristor.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If we have multiple clients and some/all are flooding the receives to
such an extent that we can retry a LOT handling multishot receives, then
we can be starving some clients and hence serving traffic in an
imbalanced fashion.
Limit multishot retry attempts to some arbitrary value, whose only
purpose serves to ensure that we don't keep serving a single connection
for way too long. We default to 32 retries, which should be more than
enough to provide fairness, yet not so small that we'll spend too much
time requeuing rather than handling traffic.
Since our poll handling is edge triggered, multishot handlers retry
internally until they know that no more data is available. In
preparation for limiting these retries, add an internal return code,
IOU_REQUEUE, which can be used to inform the poll backend about the
handler wanting to retry, but that this should happen through a normal
task_work requeue rather than keep hammering on the issue side for this
one request.
No functional changes in this patch, nobody is using this return code
just yet.
REQ_OP_FLUSH is only for internal use in the blk-mq and request based
drivers. File systems and other block layer consumers must use
REQ_OP_WRITE | REQ_PREFLUSH as documented in
Documentation/block/writeback_cache_control.rst.
While REQ_OP_FLUSH appears to work for blk-mq drivers it does not
get the proper flush state machine handling, and completely fails
for any bio based drivers, including all the stacking drivers. The
block layer will also get a check in 6.8 to reject this use case
entirely.
[Note: completely untested, but as this never got fixed since the
original bug report in November:
In __bch2_ioctl_subvolume_create(), we grab s_umount unconditionally
and unlock it at the end of the function. There is a comment
"why do we need this lock?" about the lock coming from
commit 42d237320e98 ("bcachefs: Snapshot creation, deletion")
The reason is that __bch2_ioctl_subvolume_create() calls
sync_inodes_sb() which enforce locked s_umount to writeback all dirty
nodes before doing snapshot works.
Fix it by read locking s_umount for snapshotting only and unlocking
s_umount after sync_inodes_sb().
Signed-off-by: Su Yue <glass.su@suse.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Drop t he loop in bch2_kthread_io_clock_wait(): this allows the code
that uses it to be woken up for other reasons, and fixes a bug where
rebalance wouldn't wake up when a scan was requested.
This raises the possibility of spurious wakeups, but callers should
always be able to handle that reasonably well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Daniel Hill <daniel@gluo.nz> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The deadlock can easily be reproduced on machines like the Lenovo ThinkPad
X13s by adding a delay to increase the race window during asynchronous
probe where another thread can take a write lock.
Add a new pci_set_power_state_locked() and associated helper functions that
can be called with the PCI bus semaphore held to avoid taking the read lock
twice.
io_read_mshot() always relies on poll triggering retries, and this works
fine as long as we do a retry per size of the buffer being read. The
buffer size is given by the size of the buffer(s) in the given buffer
group ID.
But if we're reading less than what is available, then we don't always
get to read everything that is available. For example, if the buffers
available are 32 bytes and we have 64 bytes to read, then we'll
correctly read the first 32 bytes and then wait for another poll trigger
before we attempt the next read. This next poll trigger may never
happen, in which case we just sit forever and never make progress, or it
may trigger at some point in the future, and now we're just delivering
the available data much later than we should have.
io_read_mshot() could do retries itself, but that is wasteful as we'll
be going through all of __io_read() again, and most likely in vain.
Rather than do that, bump our poll reference count and have
io_poll_check_events() do one more loop and check with vfs_poll() if we
have more data to read. If we do, io_read_mshot() will get invoked again
directly and we'll read the next chunk.
io_poll_multishot_retry() must only get called from inside
io_poll_issue(), which is our multishot retry handler, as we know we
already "own" the request at this point.
Cc: stable@vger.kernel.org Link: https://github.com/axboe/liburing/issues/1041 Fixes: fc68fcda0491 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If we use IORING_OP_RECV with provided buffers and pass in '0' as the
length of the request, the length is retrieved from the selected buffer.
If MSG_WAITALL is also set and we get a short receive, then we may hit
the retry path which decrements sr->len and increments the buffer for
a retry. However, the length is still zero at this point, which means
that sr->len now becomes huge and import_ubuf() will cap it to
MAX_RW_COUNT and subsequently return -EFAULT for the range as a whole.
Fix this by always assigning sr->len once the buffer has been selected.
After commit 936e4d49ecbc ("Input: atkbd - skip ATKBD_CMD_GETID in
translated mode") the keyboard on Dell XPS 13 9350 / 9360 / 9370 models
has stopped working after a suspend/resume.
The problem appears to be that atkbd_probe() fails when called
from atkbd_reconnect() on resume, which on systems where
ATKBD_CMD_GETID is skipped can only happen by ATKBD_CMD_SETLEDS
failing. ATKBD_CMD_SETLEDS failing because ATKBD_CMD_GETID was
skipped is weird, but apparently that is what is happening.
Fix this by also skipping ATKBD_CMD_SETLEDS when skipping
ATKBD_CMD_GETID.
Fixes: 936e4d49ecbc ("Input: atkbd - skip ATKBD_CMD_GETID in translated mode") Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Closes: https://lore.kernel.org/linux-input/0aa4a61f-c939-46fe-a572-08022e8931c7@molgen.mpg.de/ Closes: https://bbs.archlinux.org/viewtopic.php?pid=2146300 Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218424 Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2260517 Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> Cc: stable@vger.kernel.org Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20240126160724.13278-2-hdegoede@redhat.com Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When closing the laptop lid with an external screen connected, the mouse
pointer has a constant movement to the lower right corner. Opening the
lid again stops this movement, but after that the touchpad does no longer
register clicks.
The touchpad is connected both via i2c-hid and PS/2, the predecessor of
this device (NS70MU) has the same layout in this regard and also strange
behaviour caused by the psmouse and the i2c-hid driver fighting over
touchpad control. This fix is reusing the same workaround by just
disabling the PS/2 aux port, that is only used by the touchpad, to give the
i2c-hid driver the lone control over the touchpad.
The hrtimers migration on CPU-down hotplug process has been moved
earlier, before the CPU actually goes to die. This leaves a small window
of opportunity to queue an hrtimer in a blind spot, leaving it ignored.
For example a practical case has been reported with RCU waking up a
SCHED_FIFO task right before the CPUHP_AP_IDLE_DEAD stage, queuing that
way a sched/rt timer to the local offline CPU.
Make sure such situations never go unnoticed and warn when that happens.
Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240129235646.3171983-4-boqun.feng@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
xHCI 4.9 explicitly forbids assuming that the xHC has released its
ownership of a multi-TRB TD when it reports an error on one of the
early TRBs. Yet the driver makes such assumption and releases the TD,
allowing the remaining TRBs to be freed or overwritten by new TDs.
The xHC should also report completion of the final TRB due to its IOC
flag being set by us, regardless of prior errors. This event cannot
be recognized if the TD has already been freed earlier, resulting in
"Transfer event TRB DMA ptr not part of current TD" error message.
Fix this by reusing the logic for processing isoc Transaction Errors.
This also handles hosts which fail to report the final completion.
Fix transfer length reporting on Babble errors. They may be caused by
device malfunction, no guarantee that the buffer has been filled.
The last TRB of a isoc TD might not trigger an event if there was
an error event for a TRB mid TD. This is seen on a NEC Corporation
uPD720200 USB 3.0 Host
After an error mid a multi-TRB TD the xHC should according to xhci 4.9.1
generate events for passed TRBs with IOC flag set if it proceeds to the
next TD. This event is either a copy of the original error, or a
"success" transfer event.
If that event is missing then the driver and xHC host get out of sync as
the driver is still expecting a transfer event for that first TD, while
xHC host is already sending events for the next TD in the list.
This leads to
"Transfer event TRB DMA ptr not part of current TD" messages.
As a solution we tag the isoc TDs that get error events mid TD.
If an event doesn't match the first TD, then check if the tag is
set, and event points to the next TD.
In that case give back the fist TD and process the next TD normally
Make sure TD status and transferred length stay valid in both cases
with and without final TD completion event.
Upstream commit bac1ec551434 ("usb: xhci: Set quirk for
XHCI_SG_TRB_CACHE_SIZE_QUIRK") introduced a new quirk in XHCI
which fixes XHC timeout, which was seen on synopsys XHCs while
using SG buffers. Currently this quirk can only be set using
xhci private data. But there are some drivers like dwc3/host.c
which adds adds quirks using software node for xhci device.
Hence set this xhci quirk by iterating over device properties.
Cc: stable@vger.kernel.org # 5.11 Fixes: bac1ec551434 ("usb: xhci: Set quirk for XHCI_SG_TRB_CACHE_SIZE_QUIRK") Signed-off-by: Prashanth K <quic_prashk@quicinc.com> Link: https://lore.kernel.org/r/20240116055816.1169821-3-quic_prashk@quicinc.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Upstream commit bac1ec551434 ("usb: xhci: Set quirk for
XHCI_SG_TRB_CACHE_SIZE_QUIRK") introduced a new quirk in XHCI
which fixes XHC timeout, which was seen on synopsys XHCs while
using SG buffers. But the support for this quirk isn't present
in the DWC3 layer.
We will encounter this XHCI timeout/hung issue if we run iperf
loopback tests using RTL8156 ethernet adaptor on DWC3 targets
with scatter-gather enabled. This gets resolved after enabling
the XHCI_SG_TRB_CACHE_SIZE_QUIRK. This patch enables it using
the xhci device property since its needed for DWC3 controller.
In Synopsys DWC3 databook,
Table 9-3: xHCI Debug Capability Limitations
Chained TRBs greater than TRB cache size: The debug capability
driver must not create a multi-TRB TD that describes smaller
than a 1K packet that spreads across 8 or more TRBs on either
the IN TR or the OUT TR.
During memory error injection test on kernels >= v6.4, the kernel panics
like below. However, this issue couldn't be reproduced on kernels <= v6.3.
mce: [Hardware Error]: CPU 296: Machine Check Exception: f Bank 1: bd80000000100134
mce: [Hardware Error]: RIP 10:<ffffffff821b9776> {__get_user_nocheck_4+0x6/0x20}
mce: [Hardware Error]: TSC 411a93533ed ADDR 346a8730040 MISC 86
mce: [Hardware Error]: PROCESSOR 0:a06d0 TIME 1706000767 SOCKET 1 APIC 211 microcode 80001490
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
Kernel panic - not syncing: Fatal local machine check
The MCA code can recover from an in-kernel #MC if the fixup type is
EX_TYPE_UACCESS, explicitly indicating that the kernel is attempting to
access userspace memory. However, if the fixup type is EX_TYPE_DEFAULT
the only thing that is raised for an in-kernel #MC is a panic.
ex_handler_uaccess() would warn if users gave a non-canonical addresses
(with bit 63 clear) to {get, put}_user(), which was unexpected.
Therefore, commit
b19b74bc99b1 ("x86/mm: Rework address range check in get_user() and put_user()")
replaced _ASM_EXTABLE_UA() with _ASM_EXTABLE() for {get, put}_user()
fixups. However, the new fixup type EX_TYPE_DEFAULT results in a panic.
Commit
6014bc27561f ("x86-64: make access_ok() independent of LAM")
added the check gp_fault_address_ok() right before the WARN_ONCE() in
ex_handler_uaccess() to not warn about non-canonical user addresses due
to LAM.
With that in place, revert back to _ASM_EXTABLE_UA() for {get,put}_user()
exception fixups in order to be able to handle in-kernel MCEs correctly
again.
[ bp: Massage commit message. ]
Fixes: b19b74bc99b1 ("x86/mm: Rework address range check in get_user() and put_user()") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20240129063842.61584-1-qiuxu.zhuo@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kenneth Feng <kenneth.feng@amd.com> Cc: stable@vger.kernel.org Fixes: 5f38ac54e605 ("drm/amd/pm: fix the high voltage and temperature issue") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3062 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Given that ERROR_RECOVERY calls into PORT_RESET for Hi-Zing
the CC pins, setting CC pins to default state during PORT_RESET
breaks error recovery.
4.5.2.2.2.1 ErrorRecovery State Requirements
The port shall not drive VBUS or VCONN, and shall present a
high-impedance to ground (above zOPEN) on its CC1 and CC2 pins.
Hi-Zing the CC pins is the inteded behavior for PORT_RESET.
CC pins are set to default state after tErrorRecovery in
PORT_RESET_WAIT_OFF.
4.5.2.2.2.2 Exiting From ErrorRecovery State
A Sink shall transition to Unattached.SNK after tErrorRecovery.
A Source shall transition to Unattached.SRC after tErrorRecovery.
Cc: stable@vger.kernel.org Cc: Frank Wang <frank.wang@rock-chips.com> Fixes: 1e35f074399d ("usb: typec: tcpm: fix cc role at port reset") Signed-off-by: Badhri Jagan Sridharan <badhri@google.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240117114742.2587779-1-badhri@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The device IMST USB-Stick for Smart Meter is a rebranded IMST iM871A-USB
Wireless M-Bus USB-adapter. It is used to read wireless water, gas and
electricity meters.
Signed-off-by: Leonard Dallmayr <leonard.dallmayr@mailbox.org> Cc: stable@vger.kernel.org Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Update the USB serial option driver support for the Fibocom
FM101-GL
LTE modules as there are actually several different variants.
- VID:PID 2cb7:01a3, FM101-GL are laptop M.2 cards (with
MBIM interfaces for /Linux/Chrome OS)
The RODE NT-USB+ is marketed as a professional usb microphone, however the
usb audio interface is a mess:
[ 1.130977] usb 1-5: new full-speed USB device number 2 using xhci_hcd
[ 1.503906] usb 1-5: config 1 has an invalid interface number: 5 but max is 4
[ 1.503912] usb 1-5: config 1 has no interface number 4
[ 1.519689] usb 1-5: New USB device found, idVendor=19f7, idProduct=0035, bcdDevice= 1.09
[ 1.519695] usb 1-5: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ 1.519697] usb 1-5: Product: RØDE NT-USB+
[ 1.519699] usb 1-5: Manufacturer: RØDE
[ 1.519700] usb 1-5: SerialNumber: 1D773A1A
[ 8.327495] usb 1-5: 1:1: cannot get freq at ep 0x82
[ 8.344500] usb 1-5: 1:2: cannot get freq at ep 0x82
[ 8.365499] usb 1-5: 2:1: cannot get freq at ep 0x2
Add QUIRK_FLAG_GET_SAMPLE_RATE to work around the broken sample rate get.
I have asked Rode support to fix it, but they show no interest.
When iocg_kick_delay() is called from a CPU different than the one which set
the delay, @now may be in the past of @iocg->delay_at leading to the
following warning:
UBSAN: shift-out-of-bounds in block/blk-iocost.c:1359:23
shift exponent 18446744073709 is too large for 64-bit type 'u64' (aka 'unsigned long long')
...
Call Trace:
<TASK>
dump_stack_lvl+0x79/0xc0
__ubsan_handle_shift_out_of_bounds+0x2ab/0x300
iocg_kick_delay+0x222/0x230
ioc_rqos_merge+0x1d7/0x2c0
__rq_qos_merge+0x2c/0x80
bio_attempt_back_merge+0x83/0x190
blk_attempt_plug_merge+0x101/0x150
blk_mq_submit_bio+0x2b1/0x720
submit_bio_noacct_nocheck+0x320/0x3e0
__swap_writepage+0x2ab/0x9d0
The underflow itself doesn't really affect the behavior in any meaningful
way; however, the past timestamp may exaggerate the delay amount calculated
later in the code, which shouldn't be a material problem given the nature of
the delay mechanism.
If @now is in the past, this CPU is racing another CPU which recently set up
the delay and there's nothing this CPU can contribute w.r.t. the delay.
Let's bail early from iocg_kick_delay() in such cases.
Correct header file is needed for getting CLOSE_RANGE_* macros.
Previously it was tested with newer glibc which didn't show the need to
include the header which was a mistake.
The original code didn't update the firmware version if the
"next slot" of the AFI register isn't zero or if the
"current slot" field is zero; in those cases it assumed
that a reset was needed.
However, the NVMe specification doesn't exclude the possibility that
the "next slot" value is equal to the "current slot" value,
meaning that the same firmware slot will be activated after performing
a controller level reset; in this case a reset is clearly not
necessary and we can safely update the firmware version.
Modify the code so the kernel will report that a Controller Level Reset
is needed only in the following cases:
1) If the "current slot" field is zero. This is invalid and means that
something is wrong, a reset is needed.
or
2) if the "next slot" field isn't zero AND it's not equal to the
"current slot" value. This means that at the next reset a different
firmware slot will be activated.
Fixes: 983a338b96c8 ("nvme: update firmware version after commit") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The percpu area overflow_stacks is exported from arch/riscv/kernel/traps.c
for use in the entry code, but is not declared anywhere. Add the relevant
declaration to arch/riscv/include/asm/stacktrace.h to silence the following
sparse warning:
arch/riscv/kernel/traps.c:395:1: warning: symbol '__pcpu_scope_overflow_stack' was not declared. Should it be static?
We don't add the stackinfo_get_overflow() call as for some of the other
architectures as this doesn't seem to be used yet, so just silence the
warning.
The fscrypt code will use i_blkbits to setup ci_data_unit_bits when
allocating the new inode, but ceph will initiate i_blkbits ater when
filling the inode, which is too late. Since ci_data_unit_bits will only
be used by the fscrypt framework so initiating i_blkbits with
CEPH_FSCRYPT_BLOCK_SHIFT is safe.
Link: https://tracker.ceph.com/issues/64035 Fixes: 5b1188847180 ("fscrypt: support crypto data unit size less than filesystem block size") Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
A short read may occur while reading the message footer from the
socket. Later, when the socket is ready for another read, the
messenger invokes all read_partial_*() handlers, including
read_partial_sparse_msg_data(). The expectation is that
read_partial_sparse_msg_data() would bail, allowing the messenger to
invoke read_partial() for the footer and pick up where it left off.
However read_partial_sparse_msg_data() violates that and ends up
calling into the state machine in the OSD client. The sparse-read
state machine assumes that it's a new op and interprets some piece of
the footer as the sparse-read header and returns bogus extents/data
length, etc.
To determine whether read_partial_sparse_msg_data() should bail, let's
reuse cursor->total_resid. Because once it reaches to zero that means
all the extents and data have been successfully received in last read,
else it could break out when partially reading any of the extents and
data. And then osd_sparse_read() could continue where it left off.
[ idryomov: changelog ]
Link: https://tracker.ceph.com/issues/63586 Fixes: d396f89db39a ("libceph: add sparse read support to msgr1") Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
These functions are supposed to behave like other read_partial_*()
handlers: the contract with messenger v1 is that the handler bails if
the area of the message it's responsible for is already processed.
This comes up when handling short reads from the socket.
[ idryomov: changelog ]
Signed-off-by: Xiubo Li <xiubli@redhat.com> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Stable-dep-of: 8e46a2d068c9 ("libceph: just wait for more data to be available on the socket") Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit 4373534a9850 ("scsi: core: Move scsi_host_busy() out of host lock
for waking up EH handler") intended to fix a hard lockup issue triggered by
EH. The core idea was to move scsi_host_busy() out of the host lock when
processing individual commands for EH. However, a suggested style change
inadvertently caused scsi_host_busy() to remain under the host lock. Fix
this by calling scsi_host_busy() outside the lock.
Fixes: 4373534a9850 ("scsi: core: Move scsi_host_busy() out of host lock for waking up EH handler") Cc: Sathya Prakash Veerichetty <safhya.prakash@broadcom.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240203024521.2006455-1-ming.lei@redhat.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The spare_init() calls memmap_populate() many times to create VA to PA
mapping for the VMEMMAP area, where all "struct page" are located once
CONFIG_SPARSEMEM_VMEMMAP is defined. These "struct page" are later
initialized in the zone_sizes_init() function. However, during this
process, no sfence.vma instruction is executed for this VMEMMAP area.
This omission may cause the hart to fail to perform page table walk
because some data related to the address translation is invisible to the
hart. To solve this issue, the local_flush_tlb_kernel_range() is called
right after the sparse_init() to execute a sfence.vma instruction for this
VMEMMAP area, ensuring that all data related to the address translation
is visible to the hart.
The pcpu setup when using the page allocator sets up a new vmalloc
mapping very early in the boot process, so early that it cannot use the
flush_cache_vmap() function which may depend on structures not yet
initialized (for example in riscv, we currently send an IPI to flush
other cpus TLB).
But on some architectures, we must call flush_cache_vmap(): for example,
in riscv, some uarchs can cache invalid TLB entries so we need to flush
the new established mapping to avoid taking an exception.
So fix this by introducing a new function flush_cache_vmap_early() which
is called right after setting the new page table entry and before
accessing this new mapping. This new function implements a local flush
tlb on riscv and is no-op for other architectures (same as today).
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Dennis Zhou <dennis@kernel.org>
Stable-dep-of: d9807d60c145 ("riscv: mm: execute local TLB flush after populating vmemmap") Signed-off-by: Sasha Levin <sashal@kernel.org>
The issue here is when this is called from ntfs_load_attr_list(). The
"size" comes from le32_to_cpu(attr->res.data_size) so it can't overflow
on a 64bit systems but on 32bit systems the "+ 1023" can overflow and
the result is zero. This means that the kmalloc will succeed by
returning the ZERO_SIZE_PTR and then the memcpy() will crash with an
Oops on the next line.
Fixes: be71b5cba2e6 ("fs/ntfs3: Add attrib operations") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Pipapo needs a scratchpad area to keep state during matching.
This state can be large and thus cannot reside on stack.
Each set preallocates percpu areas for this.
On each match stage, one scratchpad half starts with all-zero and the other
is inited to all-ones.
At the end of each stage, the half that starts with all-ones is
always zero. Before next field is tested, pointers to the two halves
are swapped, i.e. resmap pointer turns into fill pointer and vice versa.
After the last field has been processed, pipapo stashes the
index toggle in a percpu variable, with assumption that next packet
will start with the all-zero half and sets all bits in the other to 1.
This isn't reliable.
There can be multiple sets and we can't be sure that the upper
and lower half of all set scratch map is always in sync (lookups
can be conditional), so one set might have swapped, but other might
not have been queried.
Thus we need to keep the index per-set-and-cpu, just like the
scratchpad.
Note that this bug fix is incomplete, there is a related issue.
avx2 and normal implementation might use slightly different areas of the
map array space due to the avx2 alignment requirements, so
m->scratch (generic/fallback implementation) and ->scratch_aligned
(avx) may partially overlap. scratch and scratch_aligned are not distinct
objects, the latter is just the aligned address of the former.
After this change, write to scratch_align->map_index may write to
scratch->map, so this issue becomes more prominent, we can set to 1
a bit in the supposedly-all-zero area of scratch->map[].
A followup patch will remove the scratch_aligned and makes generic and
avx code use the same (aligned) area.
Its done in a separate change to ease review.
Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Add a timestamp field at the beginning of the transaction, store it
in the nftables per-netns area.
Update set backend .insert, .deactivate and sync gc path to use the
timestamp, this avoids that an element expires while control plane
transaction is still unfinished.
.lookup and .update, which are used from packet path, still use the
current time to check if the element has expired. And .get path and dump
also since this runs lockless under rcu read size lock. Then, there is
async gc which also needs to check the current time since it runs
asynchronously from a workqueue.
Fixes: c3e1b005ed1c ("netfilter: nf_tables: add set element timeout support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
'stream_enc_regs' array is an array of dcn10_stream_enc_registers
structures. The array is initialized with four elements, corresponding
to the four calls to stream_enc_regs() in the array initializer. This
means that valid indices for this array are 0, 1, 2, and 3.
The error message 'stream_enc_regs' 4 <= 5 below, is indicating that
there is an attempt to access this array with an index of 5, which is
out of bounds. This could lead to undefined behavior
Here, eng_id is used as an index to access the stream_enc_regs array. If
eng_id is 5, this would result in an out-of-bounds access on the
stream_enc_regs array.
Thus fixing Buffer overflow error in dcn301_stream_encoder_create
reported by Smatch:
drivers/gpu/drm/amd/amdgpu/../display/dc/resource/dcn301/dcn301_resource.c:1011 dcn301_stream_encoder_create() error: buffer overflow 'stream_enc_regs' 4 <= 5
Fixes: 3a83e4e64bb1 ("drm/amd/display: Add dcn3.01 support to DC (v2)") Cc: Roman Li <Roman.Li@amd.com> Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Cc: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Roman Li <roman.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
'panel_cntl' structure used to control the display panel could be null,
dereferencing it could lead to a null pointer access.
Fixes the below:
drivers/gpu/drm/amd/amdgpu/../display/dc/hwss/dcn21/dcn21_hwseq.c:269 dcn21_set_backlight_level() error: we previously assumed 'panel_cntl' could be null (see line 250)
Fixes: 474ac4a875ca ("drm/amd/display: Implement some asic specific abm call backs.") Cc: Yongqiang Sun <yongqiang.sun@amd.com> Cc: Anthony Koo <Anthony.Koo@amd.com> Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Cc: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Anthony Koo <Anthony.Koo@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
xt_check_{match,target} expects u16, but NFTA_RULE_COMPAT_PROTO is u32.
NLA_POLICY_MAX(NLA_BE32, 65535) cannot be used because .max in
nla_policy is s16, see 3e48be05f3c7 ("netlink: add attribute range
validation to policy").
Fixes: 0ca743a55991 ("netfilter: nf_tables: add compatibility layer for x_tables") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
cmsg_ipv6 test requests tcpdump to capture 4 packets,
and sends until tcpdump quits. Only the first packet
is "real", however, and the rest are basic UDP packets.
So if tcpdump doesn't start in time it will miss
the real packet and only capture the UDP ones.
This makes the test fail on slow machine (no KVM or with
debug enabled) 100% of the time, while it passes in fast
environments.
Repeat the "real" / expected packet.
Fixes: 9657ad09e1fa ("selftests: net: test IPV6_TCLASS") Fixes: 05ae83d5a4a2 ("selftests: net: test IPV6_HOPLIMIT") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
In case devlink_rel_nested_in_notify_work() can not take the devlink
lock mutex. Convert the work to delayed work and in case of reschedule
do it jiffie later and avoid potential looping.
Suggested-by: Paolo Abeni <pabeni@redhat.com> Fixes: c137743bce02 ("devlink: introduce object and nested devlink relationship infra") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240205171114.338679-1-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
This forms a self-cyclic reference that GC should finally untangle
but does not due to lack of MSG_OOB handling, resulting in memory
leak.
Recently, commit 11498715f266 ("af_unix: Remove io_uring code for
GC.") removed io_uring's dead code in GC and revealed the problem.
The code was executed at the final stage of GC and unconditionally
moved all GC candidates from gc_candidates to gc_inflight_list.
That papered over the reported problem by always making the following
WARN_ON_ONCE(!list_empty(&gc_candidates)) false.
The problem has been there since commit 2aab4b969002 ("af_unix: fix
struct pid leaks in OOB support") added full scm support for MSG_OOB
while fixing another bug.
To fix this problem, we must call kfree_skb() for unix_sk(sk)->oob_skb
if the socket still exists in gc_candidates after purging collected skb.
Then, we need to set NULL to oob_skb before calling kfree_skb() because
it calls last fput() and triggers unix_release_sock(), where we call
duplicate kfree_skb(u->oob_skb) if not NULL.
Note that the leaked socket remained being linked to a global list, so
kmemleak also could not detect it. We need to check /proc/net/protocol
to notice the unfreed socket.
The cause of this issue is that when tipc_nl_bearer_add() is called with
the TIPC_NLA_BEARER_UDP_OPTS attribute, tipc_udp_nl_bearer_add() is called
even if the bearer is not UDP.
tipc_udp_is_known_peer() called by tipc_udp_nl_bearer_add() assumes that
the media_ptr field of the tipc_bearer has an udp_bearer type object, so
the function goes crazy for non-UDP bearers.
This patch fixes the issue by checking the bearer type before calling
tipc_udp_nl_bearer_add() in tipc_nl_bearer_add().
In very slow environments, most big TCP cases including
segmentation and reassembly of big TCP packets have a good
chance to fail: by default the TCP client uses write size
well below 64K. If the host is low enough autocorking is
unable to build real big TCP packets.
Address the issue using much larger write operations.
Note that is hard to observe the issue without an extremely
slow and/or overloaded environment; reduce the TCP transfer
time to allow for much easier/faster reproducibility.
Fixes: 6bb382bcf742 ("selftests: add a selftest for big tcp") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Fix the counting of new acks and nacks when parsing a packet - something
that is used in congestion control.
As the code stands, it merely notes if there are any nacks whereas what we
really should do is compare the previous SACK table to the new one,
assuming we get two successive ACK packets with nacks in them. However, we
really don't want to do that if we can avoid it as the tables might not
correspond directly as one may be shifted from the other - something that
will only get harder to deal with once extended ACK tables come into full
use (with a capacity of up to 8192).
Instead, count the number of nacks shifted out of the old SACK, the number
of nacks retained in the portion still active and the number of new acks
and nacks in the new table then calculate what we need.
Note this ends up a bit of an estimate as the Rx protocol allows acks to be
withdrawn by the receiver and packets requested to be retransmitted.
Fixes: d57a3a151660 ("rxrpc: Save last ACK's SACK table rather than marking txbufs") Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Stop rxrpc from sending a DUP ACK in response to a PING RESPONSE ACK on a
dead call. We may have initiated the ping but the call may have beaten the
response to completion.
Fixes: 18bfeba50dfd ("rxrpc: Perform terminal call ACK/ABORT retransmission from conn processor") Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Fix the construction of delayed ACKs to not set the reference serial number
as they can't be used as an RTT reference.
Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
In the Rx protocol, every packet generated is marked with a per-connection
monotonically increasing serial number. This number can be referenced in
an ACK packet generated in response to an incoming packet - thereby
allowing the sender to use this for RTT determination, amongst other
things.
However, if the reference field in the ACK is zero, it doesn't refer to any
incoming packet (it could be a ping to find out if a packet got lost, for
example) - so we shouldn't generate zero serial numbers.
Fix the generation of serial numbers to retry if it comes up with a zero.
Furthermore, since the serial numbers are only ever allocated within the
I/O thread this connection is bound to, there's no need for atomics so
remove that too.
Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
The .compat section is a dummy PE section that contains the address of
the 32-bit entrypoint of the 64-bit kernel image if it is bootable from
32-bit firmware (i.e., CONFIG_EFI_MIXED=y)
This section is only 8 bytes in size and is only referenced from the
loader, and so it is placed at the end of the memory view of the image,
to avoid the need for padding it to 4k, which is required for sections
appearing in the middle of the image.
Unfortunately, this violates the PE/COFF spec, and even if most EFI
loaders will work correctly (including the Tianocore reference
implementation), PE loaders do exist that reject such images, on the
basis that both the file and memory views of the file contents should be
described by the section headers in a monotonically increasing manner
without leaving any gaps.
So reorganize the sections to avoid this issue. This results in a slight
padding overhead (< 4k) which can be avoided if desired by disabling
CONFIG_EFI_MIXED (which is only needed in rare cases these days)
This code prints the wrong variable in the warning message. It should
print "i" instead of "info->offset". On the first iteration "info" is
uninitialized leading to a crash and on subsequent iterations it prints
the previous offset instead of the current one.
Fixes: e0f74ed4634d ("i915/gvt: Separate the MMIO tracking table from GVT-g") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/11957c20-b178-4027-9b0a-e32e9591dd7c@moroto.mountain Reviewed-by: Zhenyu Wang <zhenyuw@linux.intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
inet_recv_error() is called without holding the socket lock.
IPv6 socket could mutate to IPv4 with IPV6_ADDRFORM
socket option and trigger a KCSAN warning.
Fixes: f4713a3dfad0 ("net-timestamp: make tcp_recvmsg call ipv6_recv_error for AF_INET6 socks") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Before commit 7108b80a542b ("hwmon/coretemp: Handle large core ID
value"), there is a fixed mapping between
1. cpu_core_id
2. the index in pdata->core_data[] array
3. the sysfs attr name, aka "tempX_"
The later two always equal cpu_core_id + 2.
After the commit, pdata->core_data[] index is got from ida so that it
can handle sparse core ids and support more cores within a package.
However, the commit erroneously maps the sysfs attr name to
pdata->core_data[] index instead of cpu_core_id + 2.
As a result, the code is not aligned with the comments, and brings user
visible changes in hwmon sysfs on systems with sparse core id.
For example, before commit 7108b80a542b ("hwmon/coretemp: Handle large
core ID value"),
/sys/class/hwmon/hwmon2/temp2_label:Core 0
/sys/class/hwmon/hwmon2/temp3_label:Core 1
/sys/class/hwmon/hwmon2/temp4_label:Core 2
/sys/class/hwmon/hwmon2/temp5_label:Core 3
/sys/class/hwmon/hwmon2/temp6_label:Core 4
/sys/class/hwmon/hwmon3/temp10_label:Core 8
/sys/class/hwmon/hwmon3/temp11_label:Core 9
after commit,
/sys/class/hwmon/hwmon2/temp2_label:Core 0
/sys/class/hwmon/hwmon2/temp3_label:Core 1
/sys/class/hwmon/hwmon2/temp4_label:Core 2
/sys/class/hwmon/hwmon2/temp5_label:Core 3
/sys/class/hwmon/hwmon2/temp6_label:Core 4
/sys/class/hwmon/hwmon2/temp7_label:Core 8
/sys/class/hwmon/hwmon2/temp8_label:Core 9
Restore the previous behavior and rework the code, comments and variable
names to avoid future confusions.
When qmem_alloc and pfvf->hw_ops->sq_aq_init fails, sq->sg should be
freed to prevent memleak.
Fixes: c9c12d339d93 ("octeontx2-pf: Add support for PTP clock") Signed-off-by: Zhipeng Lu <alexious@zju.edu.cn> Acked-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>