When config table entries don't match with the device to be probed,
currently we fall back to SND_INTEL_DSP_DRIVER_ANY, which means to
allow any drivers to bind with it.
This was set so with the assumption (or hope) that all controller
drivers should cover the devices generally, but in practice, this
caused a problem as reported recently. Namely, when a specific
kconfig for SOF isn't set for the modern Intel chips like Alderlake,
a wrong driver (AVS) got probed and failed. This is because we have
entries like:
so this entry is effective only when CONFIG_SND_SOC_SOF_ALDERLAKE is
set. If not set, there is no matching entry, hence it returns
SND_INTEL_DSP_DRIVER_ANY as fallback. OTOH, if the kconfig is set, it
explicitly falls back to SND_INTEL_DSP_DRIVER_LEGACY when no DMIC or
SoundWire is found -- that was the working scenario. That being said,
the current setup may be broken for modern Intel chips that are
supposed to work with either SOF or legacy driver when the
corresponding kconfig were missing.
For addressing the problem above, this patch changes the fallback
driver to the legacy driver, i.e. return SND_INTEL_DSP_DRIVER_LEGACY
type as much as possible. When CONFIG_SND_HDA_INTEL is also disabled,
the fallback is set to SND_INTEL_DSP_DRIVER_ANY type, just to be sure.
A race condition exists between the read loop and IRQ `complete()` call.
An interrupt could call the complete() between the inner loop and
reinit_completion(), potentially losing the completion event and causing
an unnecessary timeout. Moving reinit_completion() before the loop
prevents this. A premature signal will only result in a spurious wakeup
and another wait cycle, which is preferable to waiting for a timeout.
A race condition was found in sg_proc_debug_helper(). It was observed on
a system using an IBM LTO-9 SAS Tape Drive (ULTRIUM-TD9) and monitoring
/proc/scsi/sg/debug every second. A very large elapsed time would
sometimes appear. This is caused by two race conditions.
We reproduced the issue with an IBM ULTRIUM-HH9 tape drive on an x86_64
architecture. A patched kernel was built, and the race condition could
not be observed anymore after the application of this patch. A
reproducer C program utilising the scsi_debug module was also built by
Changhui Zhong and can be viewed here:
The first race happens between the reading of hp->duration in
sg_proc_debug_helper() and request completion in sg_rq_end_io(). The
hp->duration member variable may hold either of two types of
information:
#1 - The start time of the request. This value is present while
the request is not yet finished.
#2 - The total execution time of the request (end_time - start_time).
If sg_proc_debug_helper() executes *after* the value of hp->duration was
changed from #1 to #2, but *before* srp->done is set to 1 in
sg_rq_end_io(), a fresh timestamp is taken in the else branch, and the
elapsed time (value type #2) is subtracted from a timestamp, which
cannot yield a valid elapsed time (which is a type #2 value as well).
To fix this issue, the value of hp->duration must change under the
protection of the sfp->rq_list_lock in sg_rq_end_io(). Since
sg_proc_debug_helper() takes this read lock, the change to srp->done and
srp->header.duration will happen atomically from the perspective of
sg_proc_debug_helper() and the race condition is thus eliminated.
The second race condition happens between sg_proc_debug_helper() and
sg_new_write(). Even though hp->duration is set to the current time
stamp in sg_add_request() under the write lock's protection, it gets
overwritten by a call to get_sg_io_hdr(), which calls copy_from_user()
to copy struct sg_io_hdr from userspace into kernel space. hp->duration
is set to the start time again in sg_common_write(). If
sg_proc_debug_helper() is called between these two calls, an arbitrary
value set by userspace (usually zero) is used to compute the elapsed
time.
To fix this issue, hp->duration must be set to the current timestamp
again after get_sg_io_hdr() returns successfully. A small race window
still exists between get_sg_io_hdr() and setting hp->duration, but this
window is only a few instructions wide and does not result in observable
issues in practice, as confirmed by testing.
Additionally, we fix the format specifier from %d to %u for printing
unsigned int values in sg_proc_debug_helper().
Signed-off-by: Michal Rábek <mrabek@redhat.com> Suggested-by: Tomas Henzl <thenzl@redhat.com> Tested-by: Changhui Zhong <czhong@redhat.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Reviewed-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Link: https://patch.msgid.link/20251212160900.64924-1-mrabek@redhat.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Drivers does cache sync during runtime resume, setting all writable
registers. Not all writable registers are set in cache default, resulting
in the erorr message:
fsl-sai 30c30000.sai: using zero-initialized flat cache, this may cause
unexpected behavior
Fix this by adding missing writable register defaults.
unregister_netdevice: waiting for vcan0 to become free. Usage count = 2
even after commit 93a27b5891b8 ("can: j1939: add missing calls in
NETDEV_UNREGISTER notification handler") was added. A debug printk() patch
found that j1939_session_activate() can succeed even after
j1939_cancel_active_session() from j1939_netdev_notify(NETDEV_UNREGISTER)
has completed.
Since j1939_cancel_active_session() is processed with the session list lock
held, checking ndev->reg_state in j1939_session_activate() with the session
list lock held can reliably close the race window.
Pass character "0" rather than NULL terminator to properly format
queue restoration SMI events. Currently, the NULL terminator precedes
the newline character that is intended to delineate separate events
in the SMI event buffer, which can break userspace parsers.
Signed-off-by: Brian Kocoloski <brian.kocoloski@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 6e7143e5e6e21f9d5572e0390f7089e6d53edf3c) Signed-off-by: Sasha Levin <sashal@kernel.org>
This driver is migrated to use threaded IRQ since commit 5972eb05ca32
("spi: spi-mt65xx: Use threaded interrupt for non-SPIMEM transfer"), and
we almost always want to disable the interrupt line to avoid excess
interrupts while the threaded handler is processing SPI transfer.
Use IRQF_ONESHOT for that purpose.
In practice, we see MediaTek devices show SPI transfer timeout errors
when communicating with ChromeOS EC in certain scenarios, and with
IRQF_ONESHOT, the issue goes away.
Currently nf_tables will traverse the entire table (chain graph), starting
from the entry points (base chains), exploring all possible paths
(chain jumps). But there are cases where we could avoid revalidation.
Then the second rule does not need to revalidate j2, and, by extension j3,
because this was already checked during validation of the first rule.
We need to validate it only for rule 3.
This is needed because chain loop detection also ensures we do not exceed
the jump stack: Just because we know that j2 is cycle free, its last jump
might now exceed the allowed stack size. We also need to update all
reachable chains with the new largest observed call depth.
Care has to be taken to revalidate even if the chain depth won't be an
issue: chain validation also ensures that expressions are not called from
invalid base chains. For example, the masquerade expression can only be
called from NAT postrouting base chains.
Therefore we also need to keep record of the base chain context (type,
hooknum) and revalidate if the chain becomes reachable from a different
hook location.
Reported-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Closes: https://lore.kernel.org/netfilter-devel/20251118221735.GA5477@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/ Tested-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
Fix inconsistent error handling for sscanf() return value check.
Implicit boolean conversion is used instead of explicit return
value checks. The code checks if (!sscanf(...)) which is incorrect
because:
1. sscanf returns the number of successfully parsed items
2. On success, it returns 1 (one item passed)
3. On failure, it returns 0 or EOF
4. The check 'if (!sscanf(...))' is wrong because it treats
success (1) as failure
All occurrences of sscanf() now uses explicit return value check.
With this behavior it returns '-EINVAL' when parsing fails (returns
0 or EOF), and continues when parsing succeeds (returns 1).
The device becomes visible to userspace via device_register()
even before it fully initialized by idr_init(). If userspace
or another thread tries to register a zone immediately after
device_register(), the control_type_valid() will fail because
the control_type is not yet in the list. The IDR is not yet
initialized, so this race condition causes zone registration
failure.
Move idr_init() and list addition before device_register()
fix the race condition.
Signed-off-by: Sumeet Pawnikar <sumeet4linux@gmail.com>
[ rjw: Subject adjustment, empty line added ] Link: https://patch.msgid.link/20251205190216.5032-1-sumeet4linux@gmail.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Some Potron SFP+ XGSPON ONU sticks are shipped with different EEPROM
vendor ID and vendor name strings, but are otherwise functionally
identical to the existing "Potron SFP+ XGSPON ONU Stick" handled by
sfp_quirk_potron().
These modules, including units distributed under the "Better Internet"
branding, use the same UART pin assignment and require the same
TX_FAULT/LOS behaviour and boot delay. Re-use the existing Potron
quirk for this EEPROM variant.
unregister_netdevice: waiting for sit0 to become free. Usage count = 2
problem. A debug printk() patch found that a refcount is obtained at
xdp_convert_md_to_buff() from bpf_prog_test_run_xdp().
According to commit ec94670fcb3b ("bpf: Support specifying ingress via
xdp_md context in BPF_PROG_TEST_RUN"), the refcount obtained by
xdp_convert_md_to_buff() will be released by xdp_convert_buff_to_md().
Therefore, we can consider that the error handling path introduced by
commit 1c1949982524 ("bpf: introduce frags support to
bpf_prog_test_run_xdp()") forgot to call xdp_convert_buff_to_md().
The xdp_frame structure takes up part of the XDP frame headroom,
limiting the size of the metadata. However, in bpf_test_run, we don't
take this into account, which makes it possible for userspace to supply
a metadata size that is too large (taking up the entire headroom).
If userspace supplies such a large metadata size in live packet mode,
the xdp_update_frame_from_buff() call in xdp_test_run_init_page() call
will fail, after which packet transmission proceeds with an
uninitialised frame structure, leading to the usual Bad Stuff.
The commit in the Fixes tag fixed a related bug where the second check
in xdp_update_frame_from_buff() could fail, but did not add any
additional constraints on the metadata size. Complete the fix by adding
an additional check on the metadata size. Reorder the checks slightly to
make the logic clearer and add a comment.
To test bpf_xdp_pull_data(), an xdp packet containing fragments as well
as free linear data area after xdp->data_end needs to be created.
However, bpf_prog_test_run_xdp() always fills the linear area with
data_in before creating fragments, leaving no space to pull data. This
patch will allow users to specify the linear data size through
ctx->data_end.
Currently, ctx_in->data_end must match data_size_in and will not be the
final ctx->data_end seen by xdp programs. This is because ctx->data_end
is populated according to the xdp_buff passed to test_run. The linear
data area available in an xdp_buff, max_linear_sz, is alawys filled up
before copying data_in into fragments.
This patch will allow users to specify the size of data that goes into
the linear area. When ctx_in->data_end is different from data_size_in,
only ctx_in->data_end bytes of data will be put into the linear area when
creating the xdp_buff.
While ctx_in->data_end will be allowed to be different from data_size_in,
it cannot be larger than the data_size_in as there will be no data to
copy from user space. If it is larger than the maximum linear data area
size, the layout suggested by the user will not be honored. Data beyond
max_linear_sz bytes will still be copied into fragments.
Finally, since it is possible for a NIC to produce a xdp_buff with empty
linear data area, allow it when calling bpf_test_init() from
bpf_prog_test_run_xdp() so that we can test XDP kfuncs with such
xdp_buff. This is done by moving lower-bound check to callers as most of
them already do except bpf_prog_test_run_skb(). The change also fixes a
bug that allows passing an xdp_buff with data < ETH_HLEN. This can
happen when ctx is used and metadata is at least ETH_HLEN.
Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250922233356.3356453-7-ameryhung@gmail.com
Stable-dep-of: e558cca21779 ("bpf, test_run: Subtract size of xdp_frame from allowed metadata size") Signed-off-by: Sasha Levin <sashal@kernel.org>
Change the variable naming in bpf_prog_test_run_xdp() to make the
overall logic less confusing. As different modes were added to the
function over the time, some variables got overloaded, making
it hard to understand and changing the code becomes error-prone.
Replace "size" with "linear_sz" where it refers to the size of metadata
and data. If "size" refers to input data size, use test.data_size_in
directly.
Replace "max_data_sz" with "max_linear_sz" to better reflect the fact
that it is the maximum size of metadata and data (i.e., linear_sz). Also,
xdp_rxq.frags_size is always PAGE_SIZE, so just set it directly instead
of subtracting headroom and tailroom and adding them back.
Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250922233356.3356453-6-ameryhung@gmail.com
Stable-dep-of: e558cca21779 ("bpf, test_run: Subtract size of xdp_frame from allowed metadata size") Signed-off-by: Sasha Levin <sashal@kernel.org>
The bpf selftest xdp_adjust_tail/xdp_adjust_frags_tail_grow failed on
arm64 with 64KB page:
xdp_adjust_tail/xdp_adjust_frags_tail_grow:FAIL
In bpf_prog_test_run_xdp(), the xdp->frame_sz is set to 4K, but later on
when constructing frags, with 64K page size, the frag data_len could
be more than 4K. This will cause problems in bpf_xdp_frags_increase_tail().
To fix the failure, the xdp->frame_sz is set to be PAGE_SIZE so kernel
can test different page size properly. With the kernel change, the user
space and bpf prog needs adjustment. Currently, the MAX_SKB_FRAGS default
value is 17, so for 4K page, the maximum packet size will be less than 68K.
To test 64K page, a bigger maximum packet size than 68K is desired. So two
different functions are implemented for subtest xdp_adjust_frags_tail_grow.
Depending on different page size, different data input/output sizes are used
to adapt with different page size.
[BUG]
For the following write sequence with 64K page size and 4K fs block size,
it will lead to file extent items to be inserted without any data
checksum:
Note for file offset 60K, we're inserting a file extent without any
data checksum.
Also note that range [32K, 36K) didn't reach
insert_ordered_extent_file_extent(), which is the correct behavior as
that OE is fully truncated, should not result any file extent.
Although file extent at 60K will be later dropped by btrfs_truncate(),
if the transaction got committed after file extent inserted but before
the file extent dropping, we will have a small window where we have a
file extent beyond EOF and without any data checksum.
That will cause "btrfs check" to report error.
[CAUSE]
The sequence happens like this:
- Buffered write dirtied the page cache and updated isize
Now the inode size is 64K, with the following page cache layout:
0 16K 32K 48K 64K
|/////////////| |//| |//|
- Truncate the inode to 16K
Which will trigger writeback through:
btrfs_setsize()
|- truncate_setsize()
| Now the inode size is set to 16K
|
|- btrfs_truncate()
|- btrfs_wait_ordered_range() for [16K, u64(-1)]
|- btrfs_fdatawrite_range() for [16K, u64(-1)}
|- extent_writepage() for folio 0
|- writepage_delalloc()
| Generated OE for [0, 16K), [32K, 36K] and [60K, 64K)
|
|- extent_writepage_io()
Then inside extent_writepage_io(), the dirty fs blocks are handled
differently:
- Submit write for range [0, 16K)
As they are still inside the inode size (16K).
- Mark OE [32K, 36K) as truncated
Since we only call btrfs_lookup_first_ordered_range() once, which
returned the first OE after file offset 16K.
- Mark all OEs inside range [16K, 64K) as finished
Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished.
For OE [32K, 36K) since it's already marked as truncated, and its
truncated length is 0, no file extent will be inserted.
For OE [60K, 64K) it has never been submitted thus has no data
checksum, and we insert the file extent as usual.
This is the root cause of file extent at 60K to be inserted without
any data checksum.
- Clear dirty flags for range [16K, 64K)
It is the function btrfs_folio_clear_dirty() which searches and clears
any dirty blocks inside that range.
[FIX]
The bug itself was introduced a long time ago, way before subpage and
large folio support.
At that time, fs block size must match page size, thus the range
[cur, end) is just one fs block.
But later with subpage and large folios, the same range [cur, end)
can have multiple blocks and ordered extents.
Later commit 18de34daa7c6 ("btrfs: truncate ordered extent when skipping
writeback past i_size") was fixing a bug related to subpage/large
folios, but it's still utilizing the old range [cur, end), meaning only
the first OE will be marked as truncated.
The proper fix here is to make EOF handling block-by-block, not trying
to handle the whole range to @end.
By this we always locate and truncate the OE for every dirty block.
While running test case btrfs/192 from fstests with support for large
folios (needs CONFIG_BTRFS_EXPERIMENTAL=y) I ended up getting very sporadic
btrfs check failures reporting that csum items were missing. Looking into
the issue it turned out that btrfs check searches for csum items of a file
extent item with a range that spans beyond the i_size of a file and we
don't have any, because the kernel's writeback code skips submitting bios
for ranges beyond eof. It's not expected however to find a file extent item
that crosses the rounded up (by the sector size) i_size value, but there is
a short time window where we can end up with a transaction commit leaving
this small inconsistency between the i_size and the last file extent item.
Example btrfs check output when this happens:
$ btrfs check /dev/sdc
Opening filesystem to check...
Checking filesystem on /dev/sdc
UUID: 69642c61-5efb-4367-aa31-cdfd4067f713
[1/8] checking log skipped (none written)
[2/8] checking root items
[3/8] checking extents
[4/8] checking free space tree
[5/8] checking fs roots
root 5 inode 332 errors 1000, some csum missing
ERROR: errors found in fs roots
(...)
Looking at a tree dump of the fs tree (root 5) for inode 332 we have:
We can see that the file extent item for file offset 589824 has a length of
64K and its number of bytes is 64K. Looking at the inode item we see that
its i_size is 610969 bytes which falls within the range of that file extent
item [589824, 655360[.
We see that the csum item number 15 covers the first 24K of the file extent
item - it ends at offset 21770240 and the extent's disk_bytenr is 21745664,
so we have:
We see that the next csum item (number 16) is completely outside the range,
so the remaining 40K of the extent doesn't have csum items in the tree.
If we round up the i_size to the sector size, we get:
round_up(610969, 4096) = 614400
If we subtract from that the file offset for the extent item we get:
614400 - 589824 = 24K
So the missing 40K corresponds to the end of the file extent item's range
minus the rounded up i_size:
655360 - 614400 = 40K
Normally we don't expect a file extent item to span over the rounded up
i_size of an inode, since when truncating, doing hole punching and other
operations that trim a file extent item, the number of bytes is adjusted.
There is however a short time window where the kernel can end up,
temporarily,persisting an inode with an i_size that falls in the middle of
the last file extent item and the file extent item was not yet trimmed (its
number of bytes reduced so that it doesn't cross i_size rounded up by the
sector size).
The steps (in the kernel) that lead to such scenario are the following:
1) We have inode I as an empty file, no allocated extents, i_size is 0;
2) A buffered write is done for file range [589824, 655360[ (length of
64K) and the i_size is updated to 655360. Note that we got a single
large folio for the range (64K);
3) A truncate operation starts that reduces the inode's i_size down to
610969 bytes. The truncate sets the inode's new i_size at
btrfs_setsize() by calling truncate_setsize() and before calling
btrfs_truncate();
4) At btrfs_truncate() we trigger writeback for the range starting at
610304 (which is the new i_size rounded down to the sector size) and
ending at (u64)-1;
5) During the writeback, at extent_write_cache_pages(), we get from the
call to filemap_get_folios_tag(), the 64K folio that starts at file
offset 589824 since it contains the start offset of the writeback
range (610304);
6) At writepage_delalloc() we find the whole range of the folio is dirty
and therefore we run delalloc for that 64K range ([589824, 655360[),
reserving a 64K extent, creating an ordered extent, etc;
7) At extent_writepage_io() we submit IO only for subrange [589824, 614400[
because the inode's i_size is 610969 bytes (rounded up by sector size
is 614400). There, in the while loop we intentionally skip IO beyond
i_size to avoid any unnecessay work and just call
btrfs_mark_ordered_io_finished() for the range [614400, 655360[ (which
has a 40K length);
8) Once the IO finishes we finish the ordered extent by ending up at
btrfs_finish_one_ordered(), join transaction N, insert a file extent
item in the inode's subvolume tree for file offset 589824 with a number
of bytes of 64K, and update the inode's delayed inode item or directly
the inode item with a call to btrfs_update_inode_fallback(), which
results in storing the new i_size of 610969 bytes;
9) Transaction N is committed either by the transaction kthread or some
other task committed it (in response to a sync or fsync for example).
At this point we have inode I persisted with an i_size of 610969 bytes
and file extent item that starts at file offset 589824 and has a number
of bytes of 64K, ending at an offset of 655360 which is beyond the
i_size rounded up to the sector size (614400).
--> So after a crash or power failure here, the btrfs check program
reports that error about missing checksum items for this inode, as
it tries to lookup for checksums covering the whole range of the
extent;
10) Only after transaction N is committed that at btrfs_truncate() the
call to btrfs_start_transaction() starts a new transaction, N + 1,
instead of joining transaction N. And it's with transaction N + 1 that
it calls btrfs_truncate_inode_items() which updates the file extent
item at file offset 589824 to reduce its number of bytes from 64K down
to 24K, so that the file extent item's range ends at the i_size
rounded up to the sector size (614400 bytes).
Fix this by truncating the ordered extent at extent_writepage_io() when we
skip writeback because the current offset in the folio is beyond i_size.
This ensures we don't ever persist a file extent item with a number of
bytes beyond the rounded up (by sector size) value of the i_size.
For the future large folio support, our filemap can have folios with
different sizes, thus we can no longer rely on a fixed blocks_per_page
value.
To prepare for that future, here we do:
- Remove btrfs_fs_info::sectors_per_page
- Introduce a helper, btrfs_blocks_per_folio()
Which uses the folio size to calculate the number of blocks for each
folio.
- Migrate the existing btrfs_fs_info::sectors_per_page to use that
helper
There are some exceptions:
* Metadata nodesize < page size support
In the future, even if we support large folios, we will only
allocate a folio that matches our nodesize.
Thus we won't have a folio covering multiple metadata unless
nodesize < page size.
* Existing subpage bitmap dump
We use a single unsigned long to store the bitmap.
That means until we change the bitmap dumping code, our upper limit
for folio size will only be 256K (4K block size, 64 bit unsigned
long).
* btrfs_is_subpage() check
This will be migrated into a future patch.
Previously when those functions failed, there was no error message at
all, making the debugging much harder.
So here we introduce extra error messages for:
- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()
- writepage_delalloc() when btrfs_run_delalloc_range() failed
- extent_writepage() when extent_writepage_io() failed
One example of the new debug error messages is the following one:
run fstests generic/750 at 2024-12-08 12:41:41
BTRFS: device fsid 461b25f5-e240-4543-8deb-e7c2bd01a6d3 devid 1 transid 8 /dev/mapper/test-scratch1 (253:4) scanned by mount (2436600)
BTRFS info (device dm-4): first mount of filesystem 461b25f5-e240-4543-8deb-e7c2bd01a6d3
BTRFS info (device dm-4): using crc32c (crc32c-arm64) checksum algorithm
BTRFS info (device dm-4): forcing free space tree for sector size 4096 with page size 65536
BTRFS info (device dm-4): using free-space-tree
BTRFS warning (device dm-4): read-write for sector size 4096 with page size 65536 is experimental
BTRFS info (device dm-4): checking UUID tree
BTRFS error (device dm-4): cow_file_range failed, root=363 inode=412 start=503808 len=98304: -28
BTRFS error (device dm-4): run_delalloc_nocow failed, root=363 inode=412 start=503808 len=98304: -28
BTRFS error (device dm-4): failed to run delalloc range, root=363 ino=412 folio=458752 submit_bitmap=11-15 start=503808 len=98304: -28
Which shows an error from cow_file_range() which is called inside a
nocow write attempt, along with the extra bitmap from
writepage_delalloc().
For btrfs_folio_assert_not_dirty() and btrfs_folio_set_lock(), we call
bitmap_test_range_all_zero() to ensure the involved range has no
dirty/lock bit already set.
However with my recent enhanced delalloc range error handling, I was
hitting the ASSERT() inside btrfs_folio_set_lock(), and it turns out
that some error handling path is not properly updating the folio flags.
So add some extra dumping for the ASSERTs to dump the involved bitmap
to help debug.
[BUG]
If we failed to compress the range, or cannot reserve a large enough
data extent (e.g. too fragmented free space), we will fall back to
submit_uncompressed_range().
But inside submit_uncompressed_range(), run_delalloc_cow() can also fail
due to -ENOSPC or any other error.
In that case there are 3 bugs in the error handling:
1) Double freeing for the same ordered extent
This can lead to crash due to ordered extent double accounting
2) Start/end writeback without updating the subpage writeback bitmap
3) Unlock the folio without clear the subpage lock bitmap
Both bugs 2) and 3) will crash the kernel if the btrfs block size is
smaller than folio size, as the next time the folio gets writeback/lock
updates, subpage will find the bitmap already have the range set,
triggering an ASSERT().
[CAUSE]
Bug 1) happens in the following call chain:
submit_uncompressed_range()
|- run_delalloc_cow()
| |- cow_file_range()
| |- btrfs_reserve_extent()
| Failed with -ENOSPC or whatever error
|
|- btrfs_clean_up_ordered_extents()
| |- btrfs_mark_ordered_io_finished()
| Which cleans all the ordered extents in the async_extent range.
|
|- btrfs_mark_ordered_io_finished()
Which cleans the folio range.
The finished ordered extents may not be immediately removed from the
ordered io tree, as they are removed inside a work queue.
So the second btrfs_mark_ordered_io_finished() may find the finished but
not-yet-removed ordered extents, and double free them.
Furthermore, the second btrfs_mark_ordered_io_finished() is not subpage
compatible, as it uses fixed folio_pos() with PAGE_SIZE, which can cover
other ordered extents.
Bugs 2) and 3) are more straightforward, btrfs just calls folio_unlock(),
folio_start_writeback() and folio_end_writeback(), other than the helpers
which handle subpage cases.
[FIX]
For bug 1) since the first btrfs_cleanup_ordered_extents() call is
handling the whole range, we should not do the second
btrfs_mark_ordered_io_finished() call.
And for the first btrfs_cleanup_ordered_extents(), we no longer need to
pass the @locked_page parameter, as we are already in the async extent
context, thus will never rely on the error handling inside
btrfs_run_delalloc_range().
So just let the btrfs_clean_up_ordered_extents() handle every folio
equally.
For bug 2) we should not even call
folio_start_writeback()/folio_end_writeback() anymore.
As the error handling protocol, cow_file_range() should clear
dirty flag and start/finish the writeback for the whole range passed in.
For bug 3) just change the folio_unlock() to btrfs_folio_end_lock()
helper.
If ac97_add_adapter() fails, put_device() is the correct way to drop
the device reference. kfree() is not required.
Add kfree() if idr_alloc() fails and in ac97_adapter_release() to do
the cleanup.
Sheng Yong reported [1] that Android APEX images didn't work with commit 072a7c7cdbea ("erofs: don't bother with s_stack_depth increasing for
now") because "EROFS-formatted APEX file images can be stored within an
EROFS-formatted Android system partition."
In response, I sent a quick fat-fingered [PATCH v3] to address the
report. Unfortunately, the updated condition was incorrect:
The condition `!sb->s_bdev` is always true for all file-backed EROFS
mounts, making the check effectively a no-op.
The real fix tested and confirmed by Sheng Yong [2] at that time was
[PATCH v3 RESEND], which correctly ensures the following EROFS^2 setup
works:
EROFS (on a block device) + EROFS (file-backed mount)
But sadly I screwed it up again by upstreaming the outdated [PATCH v3].
This patch applies the same logic as the delta between the upstream
[PATCH v3] and the real fix [PATCH v3 RESEND].
Previously, commit d53cd891f0e4 ("erofs: limit the level of fs stacking
for file-backed mounts") bumped `s_stack_depth` by one to avoid kernel
stack overflow when stacking an unlimited number of EROFS on top of
each other.
This fix breaks composefs mounts, which need EROFS+ovl^2 sometimes
(and such setups are already used in production for quite a long time).
One way to fix this regression is to bump FILESYSTEM_MAX_STACK_DEPTH
from 2 to 3, but proving that this is safe in general is a high bar.
After a long discussion on GitHub issues [1] about possible solutions,
one conclusion is that there is no need to support nesting file-backed
EROFS mounts on stacked filesystems, because there is always the option
to use loopback devices as a fallback.
As a quick fix for the composefs regression for this cycle, instead of
bumping `s_stack_depth` for file backed EROFS mounts, we disallow
nesting file-backed EROFS over EROFS and over filesystems with
`s_stack_depth` > 0.
This works for all known file-backed mount use cases (composefs,
containerd, and Android APEX for some Android vendors), and the fix is
self-contained.
Essentially, we are allowing one extra unaccounted fs stacking level of
EROFS below stacking filesystems, but EROFS can only be used in the read
path (i.e. overlayfs lower layers), which typically has much lower stack
usage than the write path.
We can consider increasing FILESYSTEM_MAX_STACK_DEPTH later, after more
stack usage analysis or using alternative approaches, such as splitting
the `s_stack_depth` limitation according to different combinations of
stacking.
The max buffer size of ENETC RX BD is 0xFFFF bytes, so if the PAGE_SIZE
is greater than 128K, ENETC_RXB_DMA_SIZE and ENETC_RXB_DMA_SIZE_XDP will
be greater than 0xFFFF, thus causing a build warning.
This will not cause any practical issues because ENETC is currently only
used on the ARM64 platform, and the max PAGE_SIZE is 64K. So this patch
is only for fixing the build warning that occurs when compiling ENETC
drivers for other platforms.
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202601050637.kHEKKOG7-lkp@intel.com/ Fixes: e59bc32df2e9 ("net: enetc: correct the value of ENETC_RXB_TRUESIZE") Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260107091204.1980222-1-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
`qfq_class->leaf_qdisc->q.qlen > 0` does not imply that the class
itself is active.
Two qfq_class objects may point to the same leaf_qdisc. This happens
when:
1. one QFQ qdisc is attached to the dev as the root qdisc, and
2. another QFQ qdisc is temporarily referenced (e.g., via qdisc_get()
/ qdisc_put()) and is pending to be destroyed, as in function
tc_new_tfilter.
When packets are enqueued through the root QFQ qdisc, the shared
leaf_qdisc->q.qlen increases. At the same time, the second QFQ
qdisc triggers qdisc_put and qdisc_destroy: the qdisc enters
qfq_reset() with its own q->q.qlen == 0, but its class's leaf
qdisc->q.qlen > 0. Therefore, the qfq_reset would wrongly deactivate
an inactive aggregate and trigger a null-deref in qfq_deactivate_agg:
For years I wondered why the Apple Cinema Display driver would not
just work for me. Turns out the hidraw driver instantly takes it
over. Fix by adding appledisplay VID/PIDs to hid_have_special_driver.
Fixes: 069e8a65cd79 ("Driver for Apple Cinema Display") Signed-off-by: René Rebe <rene@exactco.de> Signed-off-by: Jiri Kosina <jkosina@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
This patch fixes the edge case behavior on ifup/ifdown and
linking/unlinking two netdevsim interfaces:
1. unlink two interfaces netdevsim1 and netdevsim2
2. ifdown netdevsim1
3. ifup netdevsim1
4. link two interfaces netdevsim1 and netdevsim2
5. (Now two interfaces are linked in terms of netdevsim peer, but
carrier state of the two interfaces remains DOWN.)
This inconsistent behavior is caused by the current implementation,
which only cares about the "link, then ifup" order, not "ifup, then
link" order. This patch fixes the inconsistency by calling
netif_carrier_on() when two netdevsim interfaces are linked.
This patch fixes buggy behavior on NetworkManager-based systems which
causes the netdevsim test to fail with the following error:
# timeout set to 600
# selftests: drivers/net/netdevsim: peer.sh
# 2025/12/25 00:54:03 socat[9115] W address is opened in read-write mode but only supports read-only
# 2025/12/25 00:56:17 socat[9115] W connect(7, AF=2 192.168.1.1:1234, 16): Connection timed out
# 2025/12/25 00:56:17 socat[9115] E TCP:192.168.1.1:1234: Connection timed out
# expected 3 bytes, got 0
# 2025/12/25 00:56:17 socat[9109] W exiting on signal 15
not ok 13 selftests: drivers/net/netdevsim: peer.sh # exit=1
This patch also solves timeout on TCP Fast Open (TFO) test in
NetworkManager-based systems because it also depends on netdevsim's
carrier consistency.
The HW only supports a maximum Rx buffer size of 16K-128. On systems
using large pages, the libeth logic can configure the buffer size to be
larger than this. The upper bound is PAGE_SIZE while the lower bound is
MTU rounded up to the nearest power of 2. For example, ARM systems with
a 64K page size and an mtu of 9000 will set the Rx buffer size to 16K,
which will cause the config Rx queues message to fail.
Initialize the bufq/fill queue buf_len field to the maximum supported
size. This will trigger the libeth logic to cap the maximum Rx buffer
size by reducing the upper bound.
Fixes: 74d1412ac8f37 ("idpf: use libeth Rx buffer management for payload buffer") Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Acked-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: David Decotigny <ddecotig@google.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
During a successful reset the driver would re-allocate vport resources
while keeping the netdevs intact. However, in case of an error in the
init task, the netdev of the failing vport will be unregistered,
effectively removing the network interface:
ip -br a
ens801f0d1 DOWN
ens801f0d2 DOWN
ens801f0d3 DOWN
Re-shuffle the logic in the error path of the init task to make sure the
netdevs remain intact. This will allow the driver to attempt recovery via
subsequent resets, provided the FW is still functional.
The main change is to make sure that idpf_decfg_netdev() is not called
should the init task fail during a reset. The error handling is
consolidated under unwind_vports, as the removed labels had the same
cleanup logic split depending on the point of failure.
Fixes: ce1b75d0635c ("idpf: add ptypes and MAC filter support") Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
When skb_segment_list() is called during packet forwarding, it handles
packets that were aggregated by the GRO engine.
Historically, the segmentation logic in skb_segment_list assumes that
individual segments are split from a parent SKB and may need to carry
their own socket memory accounting. Accordingly, the code transfers
truesize from the parent to the newly created segments.
Prior to commit ed4cccef64c1 ("gro: fix ownership transfer"), this
truesize subtraction in skb_segment_list() was valid because fragments
still carry a reference to the original socket.
However, commit ed4cccef64c1 ("gro: fix ownership transfer") changed
this behavior by ensuring that fraglist entries are explicitly
orphaned (skb->sk = NULL) to prevent illegal orphaning later in the
stack. This change meant that the entire socket memory charge remained
with the head SKB, but the corresponding accounting logic in
skb_segment_list() was never updated.
As a result, the current code unconditionally adds each fragment's
truesize to delta_truesize and subtracts it from the parent SKB. Since
the fragments are no longer charged to the socket, this subtraction
results in an effective under-count of memory when the head is freed.
This causes sk_wmem_alloc to remain non-zero, preventing socket
destruction and leading to a persistent memory leak.
The leak can be observed via KMEMLEAK when tearing down the networking
environment:
Since skb_segment_list() is exclusively used for SKB_GSO_FRAGLIST
packets constructed by GRO, the truesize adjustment is removed.
The call to skb_release_head_state() must be preserved. As documented in
commit cf673ed0e057 ("net: fix fraglist segmentation reference count
leak"), it is still required to correctly drop references to SKB
extensions that may be overwritten during __copy_skb_header().
Fixes: ed4cccef64c1 ("gro: fix ownership transfer") Signed-off-by: Mohammad Heib <mheib@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260104213101.352887-1-mheib@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
These marcos are not used after commit b5b4287accd7 ("riscv: mm: Use
hint address in mmap if available"). Cleanup VA_USER_XXX definitions
in asm/pgtable.h.
Fixes: b5b4287accd7 ("riscv: mm: Use hint address in mmap if available") Signed-off-by: Guo Ren (Alibaba DAMO Academy) <guoren@kernel.org> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Link: https://patch.msgid.link/20251201005850.702569-1-guoren@kernel.org Signed-off-by: Paul Walmsley <pjw@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[BUG]
Since the introduction of btrfs bs < ps support, v1 cache was never on
the plan due to its hard coded PAGE_SIZE usage, and the future plan to
properly deprecate it.
However for bs < ps cases, even if 'nospace_cache,clear_cache' mount
option is specified, it's never respected and free space tree is always
enabled:
This means a different behavior compared to bs >= ps cases.
[CAUSE]
The forcing usage of v2 space cache is done inside
btrfs_set_free_space_cache_settings(), however it never checks if we're
even using space cache but always enabling v2 cache.
[FIX]
Instead unconditionally enable v2 cache, only forcing v2 cache if the
old v1 cache is required.
Now v2 space cache can be properly disabled on bs < ps cases:
mkfs.btrfs -f -O ^bgt,fst $dev
mount $dev $mnt -o clear_cache,nospace_cache
umount $mnt
btrfs ins dump-super $dev
...
compat_ro_flags 0x0
...
Fixes: 9f73f1aef98b ("btrfs: force v2 space cache usage for subpage mount") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Fix the max number of bits passed to find_first_zero_bit() in
bnxt_alloc_agg_idx(). We were incorrectly passing the number of
long words. find_first_zero_bit() may fail to find a zero bit and
cause a wrong ID to be used. If the wrong ID is already in use, this
can cause data corruption. Sometimes an error like this can also be
seen:
Fix it by passing the correct number of bits MAX_TPA_P5. Use
DECLARE_BITMAP() to more cleanly define the bitmap. Add a sanity
check to warn if a bit cannot be found and reset the ring [MChan].
Fixes: ec4d8e7cf024 ("bnxt_en: Add TPA ID mapping logic for 57500 chips.") Reviewed-by: Ray Jui <ray.jui@broadcom.com> Signed-off-by: Srijit Bose <srijit.bose@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20251231083625.3911652-1-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit 1f52d7b62285 ("net: wwan: iosm: Enable M.2 7360 WWAN card support")
allocated memory for pp_qlt in ipc_mux_init() but did not free it in
ipc_mux_deinit(). This results in a memory leak when the driver is
unloaded.
Free the allocated memory in ipc_mux_deinit() to fix the leak.
Dumping module EEPROM on newer modules is supported through the netlink
interface only.
Querying with old userspace ethtool (or other tools, such as 'lshw')
which still uses the ioctl interface results in an error message that
could flood dmesg (in addition to the expected error return value).
The original message was added under the assumption that the driver
should be able to handle all module types, but now that such flows are
easily triggered from userspace, it doesn't serve its purpose.
Change the log level of the print in mlx5_query_module_eeprom() to
debug.
Fixes: bb64143eee8c ("net/mlx5e: Add ethtool support for dump module EEPROM") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20251225132717.358820-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Directly increment the TSO features incurs a side effect: it will also
directly clear the flags in NETIF_F_ALL_FOR_ALL on the master device,
which can cause issues such as the inability to enable the nocache copy
feature on the bonding driver.
The fix is to include NETIF_F_ALL_FOR_ALL in the update mask, thereby
preventing it from being cleared.
Fixes: b0ce3508b25e ("bonding: allow TSO being set on bonding master") Signed-off-by: Di Zhu <zhud@hygon.cn> Link: https://patch.msgid.link/20251224012224.56185-1-zhud@hygon.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
skbuff_fclone_cache was created without defining a usercopy region,
[1] unlike skbuff_head_cache which properly whitelists the cb[] field.
[2] This causes a usercopy BUG() when CONFIG_HARDENED_USERCOPY is
enabled and the kernel attempts to copy sk_buff.cb data to userspace
via sock_recv_errqueue() -> put_cmsg().
The crash occurs when: 1. TCP allocates an skb using alloc_skb_fclone()
(from skbuff_fclone_cache) [1]
2. The skb is cloned via skb_clone() using the pre-allocated fclone
[3] 3. The cloned skb is queued to sk_error_queue for timestamp
reporting 4. Userspace reads the error queue via recvmsg(MSG_ERRQUEUE)
5. sock_recv_errqueue() calls put_cmsg() to copy serr->ee from skb->cb
[4] 6. __check_heap_object() fails because skbuff_fclone_cache has no
usercopy whitelist [5]
When cloned skbs allocated from skbuff_fclone_cache are used in the
socket error queue, accessing the sock_exterr_skb structure in skb->cb
via put_cmsg() triggers a usercopy hardening violation:
Commit 15faa1f67ab4 ("lan966x: Fix crash when adding interface under a lag")
fixed a similar issue in the lan966x driver caused by a NULL pointer dereference.
The ocelot_set_aggr_pgids() function in the ocelot driver has similar logic
and is susceptible to the same crash.
This issue specifically affects the ocelot_vsc7514.c frontend, which leaves
unused ports as NULL pointers. The felix_vsc9959.c frontend is unaffected as
it uses the DSA framework which registers all ports.
Fix this by checking if the port pointer is valid before accessing it.
Fixes: 528d3f190c98 ("net: mscc: ocelot: drop the use of the "lags" array") Signed-off-by: Jerry Wu <w.7erry@foxmail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/tencent_75EF812B305E26B0869C673DD1160866C90A@qq.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
When using an 802.1ad bridge with vlan_tunnel, the C-VLAN tag is
incorrectly stripped from frames during egress processing.
br_handle_egress_vlan_tunnel() uses skb_vlan_pop() to remove the S-VLAN
from hwaccel before VXLAN encapsulation. However, skb_vlan_pop() also
moves any "next" VLAN from the payload into hwaccel:
/* move next vlan tag to hw accel tag */
__skb_vlan_pop(skb, &vlan_tci);
__vlan_hwaccel_put_tag(skb, vlan_proto, vlan_tci);
For QinQ frames where the C-VLAN sits in the payload, this moves it to
hwaccel where it gets lost during VXLAN encapsulation.
Fix by calling __vlan_hwaccel_clear_tag() directly, which clears only
the hwaccel S-VLAN and leaves the payload untouched.
This path is only taken when vlan_tunnel is enabled and tunnel_info
is configured, so 802.1Q bridges are unaffected.
Tested with 802.1ad bridge + VXLAN vlan_tunnel, verified C-VLAN
preserved in VXLAN payload via tcpdump.
Fixes: 11538d039ac6 ("bridge: vlan dst_metadata hooks in ingress and egress paths") Signed-off-by: Alexandre Knecht <knecht.alexandre@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20251228020057.2788865-1-knecht.alexandre@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently last_gc is being updated everytime a new connection is
tracked, that means that it is updated even if a GC wasn't performed.
With a sufficiently high packet rate, it is possible to always bypass
the GC, causing the list to grow infinitely.
Update the last_gc value only when a GC has been actually performed.
In nf_tables_newrule(), if nft_use_inc() fails, the function jumps to
the err_release_rule label without freeing the allocated flow, leading
to a memory leak.
Fix this by adding a new label err_destroy_flow and jumping to it when
nft_use_inc() fails. This ensures that the flow is properly released
in this error case.
GPIO drivers with latch input support may miss short pulses on input
pins even when input latching is enabled. The generic interrupt logic in
the pca953x driver reports interrupts by comparing the current input
value against the previously sampled one and only signals an event when
a level change is observed between two reads.
For short pulses, the first edge is captured when the input register is
read, but if the signal returns to its previous level before the read,
the second edge is not observed. As a result, successive pulses can
produce identical input values at read time and no level change is
detected, causing interrupts to be missed. Below timing diagram shows
this situation where the top signal is the input pin level and the
bottom signal indicates the latched value.
PCAL variants provide an interrupt status register that records which
pins triggered an interrupt, but the status and input registers cannot
be read atomically. The interrupt status is only cleared when the input
port is read, and the input value must also be read to determine the
triggering edge. If another interrupt occurs on a different line after
the status register has been read but before the input register is
sampled, that event will not be reflected in the earlier status
snapshot, so relying solely on the interrupt status register is also
insufficient.
Support for input latching and interrupt status handling was previously
added by [1], but the interrupt status-based logic was reverted by [2]
due to these issues. This patch addresses the original problem by
combining both sources of information. Events indicated by the interrupt
status register are merged with events detected through the existing
level-change logic. As a result:
* short pulses, whose second edges are invisible, are detected via the
interrupt status register, and
* interrupts that occur between the status and input reads are still
caught by the generic level-change logic.
This significantly improves robustness on devices that signal interrupts
as short pulses, while avoiding the issues that led to the earlier
reversion. In practice, even if only the first edge of a pulse is
observable, the interrupt is reliably detected.
This fixes missed interrupts from an Ilitek touch controller with its
interrupt line connected to a PCAL6416A, where active-low pulses are
approximately 200 us long.
[1] commit 44896beae605 ("gpio: pca953x: add PCAL9535 interrupt support for Galileo Gen2")
[2] commit d6179f6c6204 ("gpio: pca953x: Improve interrupt support")
Fixes: d6179f6c6204 ("gpio: pca953x: Improve interrupt support") Signed-off-by: Ernest Van Hoecke <ernest.vanhoecke@toradex.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20251217153050.142057-1-ernestvanhoecke@gmail.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Adds support for level-triggered interrupts in the PCA953x GPIO
expander driver. Previously, the driver only supported edge-triggered
interrupts, which could lead to missed events in scenarios where an
interrupt condition persists until it is explicitly cleared.
By enabling level-triggered interrupts, the driver can now detect and
respond to sustained interrupt conditions more reliably.
During nft_synproxy eval we are reading nf_synproxy_info struct which
can be modified on update operation concurrently. As nf_synproxy_info
struct fits in 32 bits, use READ_ONCE/WRITE_ONCE annotations.
set->klen has to be used, not sizeof(). The latter only compares a
single register but a full check of the entire key is needed.
Example:
table ip t {
map s {
typeof iifname . ip saddr : verdict
flags interval
}
}
nft add element t s '{ "lo" . 10.0.0.0/24 : drop }' # no error, expected
nft add element t s '{ "lo" . 10.0.0.0/24 : drop }' # no error, expected
nft add element t s '{ "lo" . 10.0.0.0/8 : drop }' # bug: no error
The 3rd 'add element' should be rejected via -ENOTEMPTY, not -EEXIST,
so userspace / nft can report an error to the user.
The latter is only correct for the 2nd case (re-add of existing element).
As-is, userspace is told that the command was successful, but no elements were
added.
After this patch, 3rd command gives:
Error: Could not process rule: File exists
add element t s { "lo" . 127.0.0.0/8 . "lo" : drop }
^^^^^^^^^^^^^^^^^^^^^^^^^
Fixes: 0eb4b5ee33f2 ("netfilter: nft_set_pipapo: Separate partial and complete overlap cases on insertion") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
The commit 616effc0272b5 ("arm64: dts: imx8: Fix lpuart DMA channel
order") swap uart rx and tx channel at common imx8-ss-dma.dtsi. But miss
update imx8qm-ss-dma.dtsi.
The commit 5a8e9b022e569 ("arm64: dts: imx8qm-ss-dma: Pass lpuart
dma-names") just simple add dma-names as binding doc requirement.
Correct lpuart0 - lpuart3 dma rx and tx channels, and use defines for
the FSL_EDMA_RX flag.
Fixes: 5a8e9b022e56 ("arm64: dts: imx8qm-ss-dma: Pass lpuart dma-names") Signed-off-by: Sherry Sun <sherry.sun@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Reviewed-by: Alexander Stein <alexander.stein@ew.tq-group.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Add missing 'clocks' property to LAN8740Ai PHY node, to allow the PHY driver
to manage LAN8740Ai CLKIN reference clock supply. This fixes sporadic link
bouncing caused by interruptions on the PHY reference clock, by letting the
PHY driver manage the reference clock and assure there are no interruptions.
This follows the matching PHY driver recommendation described in commit bedd8d78aba3 ("net: phy: smsc: LAN8710/20: add phy refclk in support")
Fixes: 8d6712695bc8 ("arm64: dts: imx8mp: Add support for DH electronics i.MX8M Plus DHCOM and PDK2") Signed-off-by: Marek Vasut <marek.vasut@mailbox.org> Tested-by: Christoph Niedermaier <cniedermaier@dh-electronics.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
For SD card, according to the spec requirement, for sd card power reset
operation, it need sd card supply voltage to be lower than 0.5v and keep
over 1ms, otherwise, next time power back the sd card supply voltage to
3.3v, sd card can't support SD3.0 mode again.
To match such requirement on imx8qm-mek board, add 4.8ms delay between
sd power off and power on.
The restarting message from PF to VF is sent twice during AER error
handling: once from adf_error_detected() and again from
adf_disable_sriov().
This causes userspace subservices to shutdown unexpectedly when they
receive a duplicate restarting message after already being restarted.
Avoid calling adf_pf2vf_notify_restarting() and
adf_pf2vf_wait_for_restarting_complete() from adf_error_detected() so
that the restarting msg is sent only once from PF to VF.
A similar situation occurred in dml2, which was resolved by
commit e4479aecf658 ("drm/amd/display: Increase sanitizer frame larger
than limit when compile testing with clang") by increasing the limit for
clang when compile testing with certain sanitizer enabled, so that
allmodconfig (an easy testing target) continues to work.
Apply that same change to the dml folder to clear up the warning for
allmodconfig, unbreaking the build.
Currently, there are several files in drm/amd/display that aim to have a
higher -Wframe-larger-than value to avoid instances of that warning with
a lower value from the user's configuration. However, with the way that
it is currently implemented, it does not respect the user's request via
CONFIG_FRAME_WARN for a higher stack frame limit, which can cause pain
when new instances of the warning appear and break the build due to
CONFIG_WERROR.
Adjust the logic to switch from a hard coded -Wframe-larger-than value
to only using the value as a minimum clamp and deferring to the
requested value from CONFIG_FRAME_WARN if it is higher.
Suggested-by: Harry Wentland <harry.wentland@amd.com> Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Closes: https://lore.kernel.org/2025013003-audience-opposing-7f95@gregkh/ Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Stable-dep-of: 70740454377f ("drm/amd/display: Apply e4479aecf658 to dml") Signed-off-by: Sasha Levin <sashal@kernel.org>
When evicting an inode the first thing we do is to setup tracing for it,
which implies fetching the root's id. But in btrfs_evict_inode() the
root might be NULL, as implied in the next check that we do in
btrfs_evict_inode().
Hence, we either should set the ->root_objectid to 0 in case the root is
NULL, or we move tracing setup after checking that the root is not
NULL. Setting the rootid to 0 at least gives us the possibility to trace
this call even in the case when the root is NULL, so that's the solution
taken here.
Fixes: 1abe9b8a138c ("Btrfs: add initial tracepoint support for btrfs") Reported-by: syzbot+d991fea1b4b23b1f6bf8@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d991fea1b4b23b1f6bf8 Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Qgroupid Referenced Exclusive Parent Path
-------- ---------- --------- ------ ----
0/5 16.00KiB 16.00KiB - <toplevel>
0/256 16.00KiB 16.00KiB 1/100 subv1
0/257 16.00KiB 16.00KiB 1/100 snap1
1/100 32.00KiB 32.00KiB 2/100 2/100<1 member qgroup>
2/100 16.00KiB 16.00KiB - <0 member qgroups>
# Note that 2/100 is not updated, and qgroup numbers are inconsistent
umount $mnt
[CAUSE]
If the snapshot source subvolume belongs to a parent qgroup, and the new
snapshot target is also added to the new same parent qgroup, we allow a
quick update without marking qgroup inconsistent.
But that quick update only update the parent qgroup, without checking if
there is any more parent qgroups.
[FIX]
Iterate through all parent qgroups during the quick inherit.
Reported-by: Boris Burkov <boris@bur.io> Fixes: b20fe56cd285 ("btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
qgroup_snapshot_quick_inherit() detects conditions where the snapshot
destination would land in the same parent qgroup as the snapshot source
subvolume. In this case we can avoid costly qgroup calculations and just
add the nodesize of the new snapshot to the parent.
However, in the case of squotas this is actually a double count, and
also an undercount for deeper qgroup nestings.
# Create working snapshot with --inherit (auto-adds to Q1)
# src=intermediate (in only Q1)
# dst=snap (inheriting only into Q1)
# This double counts the 16k nodesize of the snapshot in Q1, and
# undercounts it in Q2.
btrfs subvolume snapshot -i 1/100 "$mnt/intermediate" "$mnt/snap" >/dev/null
snap_id=$(btrfs subvolume show "$mnt/snap" | grep 'Subvolume ID:' | awk '{print $3}')
# Fully complete snapshot creation
sync
# Delete working snapshot
# Q1 and Q2 will lose the full snap usage
btrfs subvolume delete "$mnt/snap" >/dev/null
# Delete intermediate and remove from Q1
# Q1 and Q2 will lose the full intermediate usage
btrfs qgroup remove "0/$inter_id" 1/100 "$mnt"
btrfs subvolume delete "$mnt/intermediate" >/dev/null
# Q1 should be at 0, but still has 16k. Q2 is "correct" at 0 (for now...)
# Trigger cleaner, wait for deletions
mount -o remount,sync=1 "$mnt"
btrfs subvolume sync "$mnt" "$snap_id"
btrfs subvolume sync "$mnt" "$inter_id"
# Remove Q1 from Q2
# Frees 16k more from Q2, underflowing it to 16EiB
btrfs qgroup remove 1/100 2/100 "$mnt"
# And show the bad state:
btrfs qgroup show -pc "$mnt"
Fix this by simply not doing this quick inheritance with squotas.
I suspect that it is also wrong in normal qgroups to not recurse up the
qgroup tree in the quick inherit case, though other consistency checks
will likely fix it anyway.
Fixes: b20fe56cd285 ("btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
When probing the exp-attached sata device, libsas/libata will issue a
hard reset in sas_probe_sata() -> ata_sas_async_probe(), then a
broadcast event will be received after the disk probe fails, and this
commit causes the probe will be re-executed on the disk, and a faulty
disk may get into an indefinite loop of probe.
Therefore, revert this commit, although it can fix some temporary issues
with disk probe failure.
Signed-off-by: Xingui Yang <yangxingui@huawei.com> Reviewed-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://patch.msgid.link/20251202065627.140361-1-yangxingui@huawei.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
When a W-LUN resume fails, its parent devices in the SCSI hierarchy,
including the scsi_target, may be runtime suspended. Subsequently, the
error handler in ufshcd_recover_pm_error() fails to set the W-LUN device
back to active because the parent target is not active. This results in
the following errors:
google-ufshcd 3c2d0000.ufs: ufshcd_err_handler started; HBA state eh_fatal; ...
ufs_device_wlun 0:0:0:49488: START_STOP failed for power mode: 1, result 40000
ufs_device_wlun 0:0:0:49488: ufshcd_wl_runtime_resume failed: -5
...
ufs_device_wlun 0:0:0:49488: runtime PM trying to activate child device 0:0:0:49488 but parent (target0:0:0) is not active
Address this by:
1. Ensuring the W-LUN's parent scsi_target is runtime resumed before
attempting to set the W-LUN to active within
ufshcd_recover_pm_error().
2. Explicitly checking for power.runtime_error on the HBA and W-LUN
devices before calling pm_runtime_set_active() to clear the error
state.
3. Adding pm_runtime_get_sync(hba->dev) in
ufshcd_err_handling_prepare() to ensure the HBA itself is active
during error recovery, even if a child device resume failed.
These changes ensure the device power states are managed correctly
during error recovery.
Signed-off-by: Brian Kao <powenkao@google.com> Tested-by: Brian Kao <powenkao@google.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://patch.msgid.link/20251112063214.1195761-1-powenkao@google.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
During adapter reset, ipr device driver blocks config space access but
can't block MMIO access for MSI-X entries. There is very small window:
irqbalance daemon kicks in during adapter reset before ipr driver calls
pci_restore_state(pdev) to restore MSI-X table.
irqbalance daemon reads back all 0 for that MSI-X vector in
__pci_read_msi_msg().
irqbalance daemon:
msi_domain_set_affinity()
->irq_chip_set_affinity_patent()
->xive_irq_set_affinity()
->irq_chip_compose_msi_msg()
->pseries_msi_compose_msg()
->__pci_read_msi_msg(): read all 0 since didn't call pci_restore_state
->irq_chip_write_msi_msg()
-> pci_write_msg_msi(): write 0 to the msix vector entry
When ipr driver calls pci_restore_state(pdev) in
ipr_reset_restore_cfg_space(), the MSI-X vector entry has been cleared
by irqbalance daemon in pci_write_msg_msix().
pci_restore_state()
->__pci_restore_msix_state()
Below is the MSI-X table for ipr adapter after irqbalance daemon kicked
in during adapter reset:
IRQ balancing is not required during adapter reset.
Enable "IRQ_NO_BALANCING" flag before starting adapter reset and disable
it after calling pci_restore_state(). The irqbalance daemon is disabled
for this short period of time (~2s).
When automounting, the fs_context should be fixed up to use the cred
from the parent filesystem, since the operation is just extending the
namespace. Authorisation to enter that namespace will already have been
provided by the preceding lookup.
'version' is an enum, thus cast of pointer on 64-bit compile test with
clang W=1 causes:
rockchip_pdm.c:583:17: error: cast to smaller integer type 'enum rk_pdm_version' from 'const void *' [-Werror,-Wvoid-pointer-to-enum-cast]
This was already fixed in commit 49a4a8d12612 ("ASoC: rockchip: Fix
Wvoid-pointer-to-enum-cast warning") but then got bad in
commit 9958d85968ed ("ASoC: Use device_get_match_data()").
Discussion on LKML also pointed out that 'uintptr_t' is not the correct
type and either 'kernel_ulong_t' or 'unsigned long' should be used,
with several arguments towards the latter [1].
We have observed an NFSv4 client receiving a LOCK reply with a status of
NFS4ERR_OLD_STATEID and subsequently retrying the LOCK request with an
earlier seqid value in the stateid. As this was for a new lockowner,
that would imply that nfs_set_open_stateid_locked() had updated the open
stateid seqid with an earlier value.
Looking at nfs_set_open_stateid_locked(), if the incoming seqid is out
of sequence, the task will sleep on the state->waitq for up to 5
seconds. If the task waits for the full 5 seconds, then after finishing
the wait it'll update the open stateid seqid with whatever value the
incoming seqid has. If there are multiple waiters in this scenario,
then the last one to perform said update may not be the one with the
highest seqid.
Add a check to ensure that the seqid can only be incremented, and add a
tracepoint to indicate when old seqids are skipped.
Signed-off-by: Scott Mayhew <smayhew@redhat.com> Reviewed-by: Benjamin Coddington <bcodding@hammerspace.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
There is reported 'scheduling while atomic' bug when using dm-snapshot on
real-time kernels. The reason for the bug is that the hlist_bl code does
preempt_disable() when taking the lock and the kernel attempts to take
other spinlocks while holding the hlist_bl lock.
Fix this by converting a hlist_bl spinlock into a regular spinlock.
Similar in nature to ab107276607af90b13a5994997e19b7b9731e251. glibc-2.42
drops the legacy termio struct, but the ioctls.h header still defines some
TC* constants in terms of termio (via sizeof). Hardcode the values instead.
This fixes building Python for example, which falls over like:
./Modules/termios.c:1119:16: error: invalid application of 'sizeof' to incomplete type 'struct termio'
gup_pgd_range() is invoked with disabled interrupts and invokes
__kmap_local_page_prot() via pte_offset_map(), gup_p4d_range().
With HIGHPTE enabled, __kmap_local_page_prot() invokes kmap_high_get()
which uses a spinlock_t via lock_kmap_any(). This leads to an
sleeping-while-atomic error on PREEMPT_RT because spinlock_t becomes a
sleeping lock and must not be acquired in atomic context.
The loop in map_new_virtual() uses wait_queue_head_t for wake up which
also is using a spinlock_t.
Since HIGHPTE is rarely needed at all, turn it off for PREEMPT_RT
to allow the use of get_user_pages_fast().
[arnd: rework patch to turn off HIGHPTE instead of HAVE_PAST_GUP]
Co-developed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
In the csky_cmpxchg_fixup function, it is incorrect to use the global
variable csky_cmpxchg_stw to determine the address where the exception
occurred.The global variable csky_cmpxchg_stw stores the opcode at the
time of the exception, while &csky_cmpxchg_stw shows the address where
the exception occurred.
Signed-off-by: Yang Li <yang.li85200@gmail.com> Signed-off-by: Guo Ren <guoren@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
This patch ensures the gt will be awake for the entire duration
of the resume sequences until GuCRC takes over and GT-C6 gets
re-enabled.
Before suspending GT-C6 is kept enabled, but upon resume, GuCRC
is not yet alive to properly control the exits and some cases of
instability and corruption related to GT-C6 can be observed.
Currently calc_target() clears t->paused if the request shouldn't be
paused anymore, but doesn't ever set t->paused even though it's able to
determine when the request should be paused. Setting t->paused is left
to __submit_request() which is fine for regular requests but doesn't
work for linger requests -- since __submit_request() doesn't operate
on linger requests, there is nowhere for lreq->t.paused to be set.
One consequence of this is that watches don't get reestablished on
paused -> unpaused transitions in cases where requests have been paused
long enough for the (paused) unwatch request to time out and for the
subsequent (re)watch request to enter the paused state. On top of the
watch not getting reestablished, rbd_reregister_watch() gets stuck with
rbd_dev->watch_mutex held:
It's waiting for lreq->reg_commit_wait to be completed, but for that to
happen the respective request needs to end up on need_resend_linger list
and be kicked when requests are unpaused. There is no chance for that
if the request in question is never marked paused in the first place.
The fact that rbd_dev->watch_mutex remains taken out forever then
prevents the image from getting unmapped -- "rbd unmap" would inevitably
hang in D state on an attempt to grab the mutex.
When a fault occurs, the connection is abandoned, reestablished, and any
pending operations are retried. The OSD client tracks the progress of a
sparse-read reply using a separate state machine, largely independent of
the messenger's state.
If a connection is lost mid-payload or the sparse-read state machine
returns an error, the sparse-read state is not reset. The OSD client
will then interpret the beginning of a new reply as the continuation of
the old one. If this makes the sparse-read machinery enter a failure
state, it may never recover, producing loops like:
libceph: [0] got 0 extents
libceph: data len 142248331 != extent len 0
libceph: osd0 (1)...:6801 socket error on read
libceph: data len 142248331 != extent len 0
libceph: osd0 (1)...:6801 socket error on read
Therefore, reset the sparse-read state in osd_fault(), ensuring retries
start from a clean state.
Cc: stable@vger.kernel.org Fixes: f628d7999727 ("libceph: add sparse read support to OSD client") Signed-off-by: Sam Edwards <CFSworks@gmail.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently any error from ceph_auth_handle_reply_done() is propagated
via finish_auth() but isn't returned from mon_handle_auth_done(). This
results in higher layers learning that (despite the monitor considering
us to be successfully authenticated) something went wrong in the
authentication phase and reacting accordingly, but msgr2 still trying
to proceed with establishing the session in the background. In the
case of secure mode this can trigger a WARN in setup_crypto() and later
lead to a NULL pointer dereference inside of prepare_auth_signature().
free_choose_arg_map() may dereference a NULL pointer if its caller fails
after a partial allocation.
For example, in decode_choose_args(), if allocation of arg_map->args
fails, execution jumps to the fail label and free_choose_arg_map() is
called. Since arg_map->size is updated to a non-zero value before memory
allocation, free_choose_arg_map() will iterate over arg_map->args and
dereference a NULL pointer.
To prevent this potential NULL pointer dereference and make
free_choose_arg_map() more resilient, add checks for pointers before
iterating.
If the osdmap is (maliciously) corrupted such that the incremental
osdmap epoch is different from what is expected, there is no need to
BUG. Instead, just declare the incremental osdmap to be invalid.
During the transition to use channel contexts throughout, the
ability to do injection while in monitor mode concurrent with
another interface was lost, since the (virtual) monitor won't
have a chanctx assigned in this scenario.
It's harder to fix drivers that actually transitioned to using
channel contexts themselves, such as mt76, but it's easy to do
those that are (still) just using the emulation. Do that.
struct iw_point {
void __user *pointer; /* Pointer to the data (in user space) */
__u16 length; /* number of fields or size in bytes */
__u16 flags; /* Optional params */
};
Make sure to zero the structure to avoid disclosing 32bits of kernel data
to user space.
Fixes: 87de87d5e47f ("wext: Dispatch and handle compat ioctls entirely in net/wireless/wext.c") Reported-by: syzbot+bfc7323743ca6dbcc3d3@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/695f83f3.050a0220.1c677c.0392.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260108101927.857582-1-edumazet@google.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The gpio_chip settings in this driver say the controller can't sleep
but it actually uses a mutex for synchronization. This triggers the
following BUG():