KVM: s390: Prevent memslots outside the ASCE range
With KVM_S390_VM_MEM_LIMIT_SIZE, userspace can set the highest address
allowed for the VM. Creating a memslot that lies over the maximum
address does not make sense and is only a potential source of bugs.
Prevent creation of memslots over the maximum address, and prevent the
maximum address from being reduced below the end of existing memslots.
Fixes: e38c884df921 ("KVM: s390: Switch to new gmap") Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260602142356.169458-9-imbrenda@linux.ibm.com>
Pavel Begunkov [Tue, 2 Jun 2026 10:08:25 +0000 (11:08 +0100)]
io_uring/bpf-ops: restrict ctx access to BPF
BPF programs should have no need in looking into struct io_ring_ctx, if
anything, most of such cases would be anti patterns like looking up ring
indices directly via the context.
Replace it with a new empty structure, which is just an alias to struct
io_ring_ctx. It'll create a new BTF type and fail verification if a BPF
program tries to access it (beyond the first byte). It'll also give more
flexibility for the future, and otherwise it can be made aligned with
io_ring_ctx as before with struct groups if ever needed or extended in a
different way.
One of the complications of trying to use the shmem helpers to create a
scatterlist for shmem objects is that we need to be able to provide a
guarantee that the driver cannot be unbound for the lifetime of the
scatterlist.
The easiest way of handling this seems to be just hooking up an unmap
operation to devres the first time we create a scatterlist, which allows us
to still take advantage of gem shmem facilities without breaking that
guarantee. To allow for this, we extract __drm_gem_shmem_free_sgt_locked()
- which allows a caller (e.g. the rust bindings) to manually unmap the sgt
for a gem object as needed.
Lyude Paul [Tue, 28 Apr 2026 19:03:41 +0000 (15:03 -0400)]
rust: drm: gem: s/device::Device/Device/ for shmem.rs
We're about to start explicitly mentioning kernel devices as well in this
file, so this makes it easier to differentiate the two by allowing us to
import `device` as `kernel::device`.
Another small follow-up from the sashiko findings about signed loaders.
In particular, closing the gap to reject exclusive maps in iterators.
====================
Daniel Borkmann [Tue, 2 Jun 2026 13:30:52 +0000 (15:30 +0200)]
selftests/bpf: Test that exclusive maps are rejected as iter targets
Add a subtest to map_excl that creates an exclusive map and verifies a
bpf_map_elem iterator cannot be attached to it, which would otherwise
let an unrelated program read and overwrite the map's contents through
the iterator's writable value buffer.
sashiko complained that 38498c0ebacd ("selftests/bpf: Adjust verifier_map_ptr
for the map's excl field") would slightly decrease the test coverage given
before the test was against the verifier rejecting the ops pointer. Recover
the old test with the right offsets and add the existing one as an additional
test case.
# LDLIBS=-static PKG_CONFIG='pkg-config --static' ./vmtest.sh -- ./test_progs -t verifier_map_ptr
[ 1.672932] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
#637/1 verifier_map_ptr/bpf_map_ptr: read with negative offset rejected:OK
#637/2 verifier_map_ptr/bpf_map_ptr: read with negative offset rejected @unpriv:OK
#637/3 verifier_map_ptr/bpf_map_ptr: write rejected:OK
#637/4 verifier_map_ptr/bpf_map_ptr: write rejected @unpriv:OK
#637/5 verifier_map_ptr/bpf_map_ptr: read non-existent field rejected:OK
#637/6 verifier_map_ptr/bpf_map_ptr: read non-existent field rejected @unpriv:OK
#637/7 verifier_map_ptr/bpf_map_ptr: read beyond excl field rejected:OK
#637/8 verifier_map_ptr/bpf_map_ptr: read beyond excl field rejected @unpriv:OK
#637/9 verifier_map_ptr/bpf_map_ptr: read ops field accepted:OK
#637/10 verifier_map_ptr/bpf_map_ptr: read ops field accepted @unpriv:OK
#637/11 verifier_map_ptr/bpf_map_ptr: r = 0, map_ptr = map_ptr + r:OK
#637/12 verifier_map_ptr/bpf_map_ptr: r = 0, map_ptr = map_ptr + r @unpriv:OK
#637/13 verifier_map_ptr/bpf_map_ptr: r = 0, r = r + map_ptr:OK
#637/14 verifier_map_ptr/bpf_map_ptr: r = 0, r = r + map_ptr @unpriv:OK
#637 verifier_map_ptr:OK
[...]
Summary: 2/20 PASSED, 0 SKIPPED, 0 FAILED
Daniel Borkmann [Tue, 2 Jun 2026 13:30:50 +0000 (15:30 +0200)]
libbpf: Guard add_data() against size overflow
add_data() computes size8 = roundup(size, 8) and then hands size8 to
realloc_data_buf() before doing memcpy(gen->data_cur, data, size) with
the original size. A wrapped size8 passes through the realloc_data_buf()
INT32_MAX check. Harden this against overflow, though not realistic to
happen in practice.
Daniel Borkmann [Tue, 2 Jun 2026 13:30:49 +0000 (15:30 +0200)]
bpf: Reject exclusive maps for bpf_map_elem iterators
Exclusive maps (aka excl_prog_hash) are meant to be reachable only
from the single program whose hash matches. This is enforced by
check_map_prog_compatibility() when the map is referenced from a
program such as signed BPF loaders.
A bpf_map_elem iterator, however, binds its target map at attach
time in bpf_iter_attach_map() instead of referencing it from the
program, so the exclusivity check is never reached. On top of that,
the iterator exposes the map value as a writable buffer.
Linus Torvalds [Tue, 2 Jun 2026 15:59:35 +0000 (08:59 -0700)]
Merge tag 'mm-hotfixes-stable-2026-06-01-20-58' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM fixes from Andrew Morton:
"13 hotfixes. All are for MM. 10 are cc:stable and the remaining 3
address post-7.1 issues or aren't considered suitable for backporting.
There's a three-patch series "userfaultfd: verify VMA state across
UFFDIO_COPY retry" from Mike Rapoport which fixes a few uffd things.
The rest are singletons - please see the individual changelogs for
details"
* tag 'mm-hotfixes-stable-2026-06-01-20-58' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
userfaultfd: remove redundant check in vm_uffd_ops()
userfaultfd: refuse to __mfill_atomic_pte() for unsupported VMAs
userfaultfd: verify VMA state across UFFDIO_COPY retry
mm/huge_memory: update file PMD counter before folio_put()
mm/huge_memory: update file PUD counter before folio_put()
mm/hugetlb_vmemmap: fix incorrect vmemmap restore in rollback
mm/damon/ops-common: call folio_test_lru() after folio_get()
mm/cma: fix reserved page leak on activation failure
mm/memory-failure: fix hugetlb_lock AA deadlock in get_huge_page_for_hwpoison
mm/hugetlb: restore reservation on error in hugetlb folio copy paths
mm/cma_debug: fix invalid accesses for inactive CMA areas
memcg: use round-robin victim selection in refill_stock
mm/hugetlb: avoid false positive lockdep assertion
dt-bindings: arm-smmu: Correct and add constraints for Hawi, Shikra and Kaanapali
Previous commit 75949eb02653 ("dt-bindings: arm-smmu: Constrain clocks
for newer Qualcomm variants") duplicated constraints for
qcom,sm6350-smmu-500 and qcom,sm6375-smmu-500 - these are already part
of previous "if:" block.
It also missed enforcing one clock for qcom,kaanapali-smmu-500 in GPU
case and missed simultaneously added Shikra and Hawi.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Signed-off-by: Will Deacon <will@kernel.org>
Lizhi Hou [Tue, 2 Jun 2026 04:06:24 +0000 (21:06 -0700)]
accel/amdxdna: Preserve user address when PASID is disabled
When PASID is not used, the buffer user address is set to
AMDXDNA_INVALID_ADDR. As a result, heap buffer user address validation
fails even though the original userspace address is available.
Preserve the userspace address regardless of PASID usage so heap buffer
address validation works correctly.
Ard Biesheuvel [Fri, 29 May 2026 15:02:06 +0000 (17:02 +0200)]
arm64: mm: Unmap kernel data/bss entirely from the linear map
The linear aliases of the kernel text and rodata are also mapped
read-only in the linear map. Given that the contents of these regions
are mostly identical to the version in the loadable image, mapping them
read-only and leaving their contents visible is a reasonable hardening
measure.
Data and bss, however, are now also mapped read-only but the contents of
these regions are more likely to contain data that we'd rather not leak.
So let's unmap these entirely in the linear map when the kernel is
running normally.
When going into hibernation or waking up from it, these regions need to
be mapped, so map the region initially, and toggle the valid bit so
map/unmap the region as needed.
Doing so is required because pages covering the kernel image are marked
as PageReserved, and therefore disregarded for snapshotting by the
hibernate logic unless they are mapped.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Ard Biesheuvel [Fri, 29 May 2026 15:02:05 +0000 (17:02 +0200)]
arm64: mm: Map the kernel data/bss read-only in the linear map
On systems where the bootloader adheres to the original arm64 boot
protocol, the placement of the kernel in the physical address space is
highly predictable, and this makes the placement of its linear alias in
the kernel virtual address space equally predictable, given the lack of
randomization of the linear map.
The linear aliases of the kernel text and rodata regions are already
mapped read-only, but the kernel data and bss are mapped read-write in
this region. This is not needed, so map them read-only as well.
Note that the statically allocated kernel page tables do need to be
modifiable via the linear map, so leave these mapped read-write.
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Ard Biesheuvel [Fri, 29 May 2026 15:02:04 +0000 (17:02 +0200)]
mm: Make empty_zero_page[] const
The empty zero page is used to back any kernel or user space mapping
that is supposed to remain cleared, and so the page itself is never
supposed to be modified.
So mark it as const, which moves it into .rodata rather than .bss: on
most architectures, this ensures that both the kernel's mapping of it
and any aliases that are accessible via the kernel direct (linear) map
are mapped read-only, and cannot be used (inadvertently or maliciously)
to corrupt the contents of the zero page.
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
the zero page did double duty as a boot params region, and was cleared
separately, as it was not part of BSS. The memset() in question was
dropped by that commit, but the __flush_wback_region() call remained.
As empty_zero_page[] has been moved to BSS, it can be treated as any
other BSS memory, and so the cache flush can be dropped.
Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Mike Rapoport <rppt@kernel.org> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Ard Biesheuvel [Fri, 29 May 2026 15:02:02 +0000 (17:02 +0200)]
powerpc/code-patching: Avoid r/w mapping of the zero page
The only remaining use of map_patch_area() is mapping the zero page, and
immediately unmapping it again so that the intermediate page table
levels are all guaranteed to be populated.
The use of the zero page here is completely arbitrary, and not harmful
per se, but currently, it creates a writable mapping, and does so in a
manner that requires that the empty_zero_page[] symbol is not
const-qualified.
Given that this is about to change, and that map_patch_area() now never
maps anything other than the zero page, let's simplify the code and
- remove the helpers and call [un]map_kernel_page() directly
- take the PA of empty_zero_page directly
- create a read-only temporary mapping.
This allows empty_zero_page[] to be repainted as const u8[] in a
subsequent patch, without making substantial changes to this code
patching logic.
Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Link: https://lore.kernel.org/all/20260520085423.485402-1-ardb@kernel.org/ Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Ard Biesheuvel [Fri, 29 May 2026 15:02:01 +0000 (17:02 +0200)]
arm64: mm: Don't abuse memblock NOMAP to check for overlaps
Now that the linear region mapping routines respect existing table
mappings and contiguous block and page mappings, it is no longer needed
to fiddle with the memblock tables to set and clear the NOMAP attribute
in order to omit text and rodata when creating the linear map.
Instead, map the kernel text and rodata alias first with the desired
initial attributes and granularity, so that the loop iterating over the
memblocks will not remap it in a manner that prevents it from being
remapped with updated attributes later.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Ard Biesheuvel [Fri, 29 May 2026 15:02:00 +0000 (17:02 +0200)]
arm64: Move fixmap and kasan page tables to end of kernel image
Move the fixmap and kasan page tables out of the BSS section, and place
them at the end of the image, right before the init_pg_dir section where
some of the other statically allocated page tables live.
These page tables are currently the only data objects in vmlinux that
are meant to be accessed via the kernel image's linear alias, and so
placing them together allows the remainder of the data/bss section to be
remapped read-only or unmapped entirely.
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Ard Biesheuvel [Fri, 29 May 2026 15:01:59 +0000 (17:01 +0200)]
arm64: mm: Permit contiguous attribute for preliminary mappings
There are a few cases where we omit the contiguous hint for mappings
that start out as read-write and are remapped read-only later, on the
basis that manipulating live descriptors with the PTE_CONT attribute set
is unsafe. When support for the contiguous hint was added to the code,
the ARM ARM was ambiguous about this, and so we erred on the side of
caution.
In the meantime, this has been clarified [0], and regions that will be
remapped in their entirety, retaining the contiguous bit on all entries,
can use the contiguous hint both in the initial mapping as well as the
one that replaces it. Note that this requires that the logic that may be
called to remap overlapping regions respects existing valid descriptors
that have the contiguous bit cleared.
So omit the NO_CONT_MAPPINGS flag in places where it is unneeded.
[0] RJQQTC
For a TLB lookup in a contiguous region mapped by translation table entries that
have consistent values for the Contiguous bit, but have the OA, attributes, or
permissions misprogrammed, that TLB lookup is permitted to produce an OA, access
permissions, and memory attributes that are consistent with any one of the
programmed translation table values.
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Ard Biesheuvel [Fri, 29 May 2026 15:01:58 +0000 (17:01 +0200)]
arm64: kfence: Avoid NOMAP tricks when mapping the early pool
Now that the map_mem() routines respect existing page mappings and
contiguous granule sized blocks with the contiguous bit cleared, there
is no longer a reason to play tricks with the memblock NOMAP attribute.
Instead, the kfence pool can be allocated and mapped with page
granularity first, and this granularity will be respected when the rest
of DRAM is mapped later, even if block and contiguous mappings are
allowed for the remainder of those mappings.
Add the NO_EXEC_MAPPINGS flag to ensure that hierarchical XN attributes
are set on the intermediate page tables that are allocated when mapping
the pool.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Ard Biesheuvel [Fri, 29 May 2026 15:01:57 +0000 (17:01 +0200)]
arm64: mm: Permit contiguous descriptors to be manipulated
Currently, pgattr_change_is_safe() is overly pedantic when it comes to
descriptors with the contiguous hint attribute set, as it rejects
assignments even if the old and the new value are the same.
In fact, as per ARM ARM RJQQTC, manipulating descriptors with the
contiguous bit set is safe as long as the bit itself does not change
value, in the sense that no TLB conflict aborts or other exceptions may
be raised as a result. Inconsistent permission attributes within the
contiguous region may result in any of the alternatives to be taken to
apply to the entire region, which might be a programming error, but it
does not constitute an unsafe manipulation in terms of what
pgattr_change_is_safe() is intended to detect.
So drop the special PTE_CONT check, but still omit PTE_CONT from 'mask'
so that modifying the bit is still regarded as unsafe.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Ard Biesheuvel [Fri, 29 May 2026 15:01:56 +0000 (17:01 +0200)]
arm64: mm: Preserve non-contiguous descriptors when mapping DRAM
Instead of blindly overwriting existing live entries regardless of the
value of their contiguous bit when mapping DRAM regions at
contiguous-hint granularity, check whether the contiguous region in
question contains any valid descriptors that have the contiguous bit
cleared, and in that case, leave the contiguous bit unset on the entire
region. This permits the logic of mapping the kernel's linear alias to
be simplified in a subsequent patch.
Note that this can only result in a misprogrammed contiguous bit (as per
ARM ARM RNGLXZ) if the region in question already contains a mix of
valid contiguous and valid non-contiguous descriptors, in which case it
was already misprogrammed to begin with.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Ard Biesheuvel [Fri, 29 May 2026 15:01:55 +0000 (17:01 +0200)]
arm64: mm: Preserve existing table mappings when mapping DRAM
Instead of blindly overwriting an existing table entry when mapping DRAM
regions, take care not to replace a pre-existing table entry with a
block entry. This permits the logic of mapping the kernel's linear alias
to be simplified in a subsequent patch.
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Ard Biesheuvel [Fri, 29 May 2026 15:01:54 +0000 (17:01 +0200)]
arm64: mm: Check for pud_/pmd_set_huge() failures on kernel mappings
Sashiko reports:
| If pmd_set_huge() rejects an unsafe page table transition (such as
| mapping a different physical address over an existing block mapping),
| it returns 0 and leaves the page table entry unmodified.
|
| Because *pmdp remains unmodified, READ_ONCE(pmd_val(*pmdp)) will equal
| pmd_val(old_pmd). The transition from old_pmd to old_pmd is evaluated
| as safe by pgattr_change_is_safe(), so the BUG_ON never triggers.
|
| This allows invalid and unsafe mapping updates to be silently dropped
| instead of panicking, leaving stale memory mappings active while the
| caller assumes the update was successful.
The same applies to pud_set_huge() in alloc_init_pud().
Given how it is generally preferred to limp on rather than blow up the
system if an unexpected condition such as this one occurs, and the fact
that there are no known cases where this disparity results in real
problems, let's WARN on these failures rather than BUG, allowing the
system to survive to the point where it can actually report them.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Ard Biesheuvel [Fri, 29 May 2026 15:01:53 +0000 (17:01 +0200)]
arm64: mm: Drop redundant pgd_t* argument from map_mem()
__map_memblock() and map_mem() always operate on swapper_pg_dir, so
there is no need to pass around a pgd_t pointer between them.
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Ard Biesheuvel [Fri, 29 May 2026 15:01:52 +0000 (17:01 +0200)]
arm64: mm: Remove bogus stop condition from map_mem() loop
The memblock API guarantees that start is not greater than or equal to
end, so there is no need to test it. And if it were, it is doubtful that
breaking out of the loop would be a reasonable course of action here
(rather than attempting to map the remaining regions)
So let's drop this check.
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Mark Brown [Tue, 2 Jun 2026 15:21:48 +0000 (16:21 +0100)]
ASoC: loongson: Refactor DMA and regmap handling
Binbin Zhou <zhoubinbin@loongson.cn> says:
This series refactors the Loongson I2S ASoC drivers, reducing code
duplication and improving DMA differentiation. It also adds an entry
in MAINTAINERS and applies a few fixes to the es8323 codec driver.
These changes have been tested on Loongson-2K0300 (platform, eDMA) and
Loongson-2K2000 (PCI, iDMA) boards.
Binbin Zhou [Mon, 1 Jun 2026 09:29:39 +0000 (17:29 +0800)]
ASoC: loongson: Separate external shared DMA from the platform interface
The Loongson I2S platform driver (used on LS2K1000, LS7A etc.) relies on
an external DMA engine (e.g., dw_dmac) rather than the internal DMA.
However, its DMA-related code was originally embedded in
loongson_i2s_plat.c, duplicating logic that should be shared.
Extract the external DMA (eDMA) support from the platform driver and move
it into loongson_dma.c alongside the existing internal DMA (iDMA) code.
This change eliminates code duplication and prepares for future
consolidation of DMA selection logic.
Binbin Zhou [Mon, 1 Jun 2026 09:29:38 +0000 (17:29 +0800)]
ASoC: loongson: Use the `idma` identifier for internal DMA variables
The Loongson I2S controller can work with two types of DMA:
- Internal DMA (iDMA): integrated DMA engine, driven by dedicated
registers and interrupts.
- External DMA (eDMA): generic DMA engine (e.g., dw_dmac), using the
standard dmaengine API.
To distinguish these two distinct implementations, rename all
internal-DMA-related structures, functions, and the component driver
to use the "idma" prefix.
Binbin Zhou [Mon, 1 Jun 2026 09:29:37 +0000 (17:29 +0800)]
ASoC: loongson: Combined regmap definitions
Previously, the regmap configuration for Loongson I2S controller was
duplicated in both PCI and platform glue drivers. Move the common
regmap configuration into the shared loongson_i2s.c to avoid code
duplication and centralize register access handling.
While moving, adjust the following:
- Mark RX_DATA/TX_DATA/I2S_CTRL as volatile registers. The PCI version
incorrectly marked CFG/CFG1 as volatile, which prevented proper
regcache synchronization.
- Change cache type from REGCACHE_FLAT to REGCACHE_MAPLE. The register
map is sparse and the number of registers is small; MAPLE tree provides
better scalability and is the recommended cache type for modern
regmap users.
Also, the following warning for the i2s_plat driver will be eliminated:
loongson-i2s-plat loongson-i2s: using zero-initialized flat cache, this may cause unexpected behavior.
Wandun Chen [Mon, 25 May 2026 12:17:00 +0000 (20:17 +0800)]
of: reserved_mem: only support one <base size> entry in reg property
A /reserved-memory child node may have multiple <base size> tuples in
'reg' property, but multiple entries in 'reg' have never been fully
functional:
- fdt_scan_reserved_mem() in the early pass loops over every
tuple and reserves them all.
- fdt_scan_reserved_mem_late() reads 'reg' by
of_flat_dt_get_addr_size(), which returns false if entries != 1.
So 'reg' property with multiple <base size> entries will be
skipped, no reserved_mem entry is created in reserved_mem[].
Supporting multiple <base size> tuples is not a good idea:
- It requires reserved_mem_ops->node_init support. Currently,
CMA(rmem_cma_setup) and DMA(rmem_dma_setup) are not supported.
- of_reserved_mem_lookup() is name-based, only the first entry in
multiple <base size> tuples will be found.
So change to support one <base size> entry in 'reg' property.
Also update dt binding:
https://github.com/devicetree-org/dt-schema/pull/197
Mark Brown [Tue, 2 Jun 2026 15:13:08 +0000 (16:13 +0100)]
ASoC: mediatek: mt8192 probe cleanup
Cássio Gabriel <cassiogabrielcontato@gmail.com> says:
Fix two MT8192 AFE probe cleanup issues that mirror the recently fixed
MT8189 and MT8196 paths.
The first patch registers a devm cleanup action for a successful
reserved-memory assignment so later probe failures and driver unbind
release it.
The second patch checks the temporary runtime resume used while
reinitializing the regmap cache and makes the regcache failure path drop
the PM reference and clear pm_runtime_bypass_reg_ctl.
Cássio Gabriel [Wed, 27 May 2026 13:55:47 +0000 (10:55 -0300)]
ASoC: mediatek: mt8192: Check runtime resume during probe
The MT8192 AFE probe enables runtime PM temporarily while reinitializing
the regmap cache from hardware, but it uses pm_runtime_get_sync()
without checking the return value. If runtime resume fails, probe keeps
going without the device necessarily being accessible, and
pm_runtime_get_sync() may leave the PM usage count incremented.
The regmap_reinit_cache() failure path also returns before dropping the
temporary PM reference and before clearing pm_runtime_bypass_reg_ctl.
Use pm_runtime_resume_and_get() so resume failures do not leak a usage
count, and clear the temporary bypass flag after dropping the probe PM
reference on all regmap_reinit_cache() outcomes.
Cássio Gabriel [Wed, 27 May 2026 13:55:46 +0000 (10:55 -0300)]
ASoC: mediatek: mt8192: Release reserved memory on cleanup
The MT8192 AFE probe calls of_reserved_mem_device_init() and falls
back to preallocated buffers when no reserved memory region is
available. When the reserved memory assignment succeeds, however, the
driver never releases it.
Register a devm cleanup action after a successful reserved-memory
assignment so the assignment is released on probe failure and driver
unbind.
Cássio Gabriel <cassiogabrielcontato@gmail.com> says:
The MT8183 AFE probe has two cleanup gaps that match issues
recently fixed in newer MediaTek AFE drivers.
First, reserved memory assigned with of_reserved_mem_device_init()
is never released on driver removal or later probe failures.
Second, the probe-time runtime PM resume used before reinitializing
the regmap cache is unchecked, and a regmap_reinit_cache() failure
skips the temporary PM put.
Fix both issues with a devm reserved-memory release action and
checked runtime PM resume handling.
Cássio Gabriel [Wed, 27 May 2026 13:41:49 +0000 (10:41 -0300)]
ASoC: mediatek: mt8183: Check runtime resume during probe
The MT8183 AFE probe uses pm_runtime_get_sync() before reading hardware
defaults into the regmap cache, but does not check whether runtime resume
failed. If regmap_reinit_cache() then fails, the temporary runtime PM
usage count is also not released.
Use pm_runtime_resume_and_get() so resume failures abort probe without
leaking a usage count, and release the temporary reference before
handling the regmap cache result.
Cássio Gabriel [Wed, 27 May 2026 13:41:48 +0000 (10:41 -0300)]
ASoC: mediatek: mt8183: Release reserved memory on cleanup
The MT8183 AFE probe can assign reserved memory with
of_reserved_mem_device_init(), but the assignment is never released on
driver removal or later probe failures.
Register a devm cleanup action so the reserved memory assignment is
released consistently, matching newer Mediatek AFE drivers.
Mark Brown [Tue, 2 Jun 2026 15:09:27 +0000 (16:09 +0100)]
regulator: Use named initializers for platform_device_id arrays
Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com> says:
this series targets to use named initializers for platform_device_id
arrays. In general these are better readable for humans and more robust
to changes in the respective struct definition.
regulator: Unify usage of space and comma in platform_device_id arrays
After converting all these arrays to use named initializers and fixing
coding style en passant, adapt the coding style also for those drivers that
already used named initializers before for consistency.
regulator: Use named initializers for platform_device_id arrays
Named initializers are better readable and more robust to changes of the
struct definition. This robustness is relevant for a planned change to
struct platform_device_id replacing .driver_data by an anonymous unit.
While touching these arrays unify spacing and usage of commas.
regulator: Drop unused assignment of platform_device_id driver data
Several drivers explicitly set the .driver_data member of struct
platform_device_id to zero without relying on that value. Drop these
unused assignments.
While touching these arrays unify spacing, usage of commas and use
named initializers for .name.
Cássio Gabriel [Mon, 25 May 2026 17:18:03 +0000 (14:18 -0300)]
ASoC: codecs: rk3328: Use managed GPIO and clock helpers
rk3328_platform_probe() acquires the mute GPIO with gpiod_get_optional()
but never releases it. It also enables mclk and pclk manually while
relying on probe error labels for unwind, and the driver has no platform
remove callback to disable those clocks after a successful unbind.
This path has already needed fixes for missing clock unwinds on probe
errors. Use devm_gpiod_get_optional() and devm_clk_get_enabled() so the
GPIO and enabled clock lifetimes are tied to the device. This removes the
manual error labels and makes both probe failure and driver unbind follow
the normal devres cleanup path.
Wentao Liang [Wed, 27 May 2026 10:48:50 +0000 (10:48 +0000)]
regulator: scmi: fix of_node refcount leak in scmi_regulator_probe()
scmi_regulator_probe() calls of_find_node_by_name() which takes a
reference on the returned device node. On the error path where
process_scmi_regulator_of_node() fails, the function returns without
calling of_node_put() on the child node, leaking the reference.
Add of_node_put(np) on the error path to properly release the
reference.
Cássio Gabriel [Fri, 22 May 2026 02:30:07 +0000 (23:30 -0300)]
ASoC: rockchip: i2s: Use managed hclk and runtime PM cleanup
The Rockchip I2S driver mixes devm-managed probe resources with manual
runtime PM and hclk cleanup. This leaves the remove path doing runtime PM
shutdown and clock disable before devm-managed ASoC and PCM resources are
released.
Keep the bus clock enabled for the device lifetime with
devm_clk_get_enabled(), and move the runtime PM teardown into devres so the
unwind order matches the managed registrations. This also removes the
remove callback, which only existed for cleanup.
Use a devm action for the final runtime suspend and register it before the
managed runtime PM action, so teardown disables runtime PM before forcing
the device into the suspended state.
ASoC: cs35l56: Share common SoundWire interrupt enable/disable code
Move the duplicated SoundWire interrupt enable/disable code into shared
functions. These new functions are in cs35l56.c to prevent circular
dependency between cs35l56.c and cs35l56-sdw.c
Make sure _kvm_s390_pv_make_secure() takes the pte lock for the given
address when attempting to make the page secure.
One of the steps in making the page secure is freezing the folio using
folio_ref_freeze(), which temporarily sets the reference count to 0.
Any attempt to get such a folio while frozen will fail and cause a
warning to be printed.
Other users of folio_ref_freeze() make sure that the page is not mapped
while it's being frozen, thus preventing gup functions from being able
to access it. For _kvm_s390_pv_make_secure(), this is not possible,
because the page needs to be mapped in order for the import to succeed.
By taking the pte lock, gup functions will be blocked until the import
operation is done, thus avoiding the race.
In theory this does not completely solve the issue: if a page is mapped
through multiple mappings, locking one pte does not protect from
calling gup on it through the other mapping. In practice this does not
happen and it is a decent stopgap solution until a more correct
solution is available.
Fixes: e38c884df921 ("KVM: s390: Switch to new gmap") Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260602142356.169458-8-imbrenda@linux.ibm.com>
Fix the fault-in code so that it does not return success if a
concurrent unmap event invalidated the fault-in process between the
best-effort lockless check and the proper check with lock.
The new behaviour is to retry, like the best-effort lockless check
already did.
This prevents the fault-in handler from returning success without
having actually faulted in the requested page.
KVM: s390: vsie: Fix rmap handling in _do_shadow_crste()
Fix _do_shadow_crste() to also apply a mask on the reverse address, to
prevent spurious entries from being created, like already done in
gmap_protect_rmap().
Fixes: e38c884df921 ("KVM: s390: Switch to new gmap") Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260602142356.169458-6-imbrenda@linux.ibm.com>
KVM: s390: Avoid potentially sleeping while atomic when zapping pages
Factor out try_get_locked_pte(), which behaves similarly to
get_locked_pte(), but does not attempt to allocate missing tables and
performs a spin_trylock() instead of blocking.
The new function is also exported, since it will be used in other
patches.
If intermediate entries are missing, there can be no pte swap entry to
free, so it's safe to ignore them.
This avoids potentially sleeping while atomic.
Fixes: e38c884df921 ("KVM: s390: Switch to new gmap") Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260602142356.169458-4-imbrenda@linux.ibm.com>
The previous incorrect behaviour cleared the vsie_notif bit without
returning false, which allowed shadow crstes to be installed without
the vsie_notif bit.
Return false and do not perform the operation if an unshadow event has
been triggered, but still attempt to clear the vsie_notif bit from the
existing crste.
This will prevent the installation of shadow crstes without vsie_notif
bit and will also prevent the caller from looping forever if it was
not checking for the sg->invalidated flag.
In _gmap_unmap_crste(), the crste to be unmapped is zapped calling
gmap_crstep_xchg_atomic() exactly once, and expecting it to succeed.
This is a reasonable sanity check, since kvm->mmu_lock is being held in
write mode, and thus no races should be possible.
An upcoming patch will change the behaviour of gmap_crstep_xchg_atomic()
to return false and clear the vsie_notif bit if the operation triggers
an unshadow operation. With the new behaviour, an unmap operation that
triggers an unshadow would cause the VM to be killed.
Prepare for the change by checking if the vsie_notif bit was set in
the old crste if gmap_crstep_xchg_atomic() fails the first time, and
try a second time. The second time no failures are allowed.
Steven Rostedt [Mon, 1 Jun 2026 17:07:46 +0000 (13:07 -0400)]
tracing/eprobes: Allow use of BTF names to dereference pointers
Add syntax to the parsing of eprobes to be able to typecast a trace event
field that is a pointer to a structure.
Currently, a dereference must be a number, where the user has to figure
out manually the offset of a member of a structure that they want to
dereference.
But for event probes that records a field that happens to be a pointer to
a structure, it cannot dereference these values with BTF naming, but
must use numerical offsets.
For example, to find out what device a sk_buff is pointing to in the
net_dev_xmit trace event, one must first use gdb to find the offsets of the
members of the structures:
Extract the SHA-384 hash, RSA public key, and RSA signature from the
FMC ELF32 firmware sections. FSP Chain of Trust verification needs
these to validate the FMC image during boot.
Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Eliot Courtney <ecourtney@nvidia.com> Link: https://patch.msgid.link/20260602032111.224790-14-jhubbard@nvidia.com
[acourbot: derive `Zeroable` on `FmcSignature` for in-place initialization] Co-developed-by: Alexandre Courbot <acourbot@nvidia.com> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Hopper and Blackwell use FSP instead of SEC2 for secure boot. The
driver must wait for FSP secure boot to complete before continuing
with GSP bring-up. Poll for boot success with a 5-second timeout, and
return the FSP interface only on success so that later Chain of Trust
operations cannot run before FSP is ready. The interface owns the FSP
falcon and the FMC firmware.
Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Eliot Courtney <ecourtney@nvidia.com> Link: https://patch.msgid.link/20260602032111.224790-13-jhubbard@nvidia.com
[acourbot: use `inspect_err` instead of `map_err` and display actual error]
[acourbot: limit visibility of `fsp_hal` to `super``] Co-developed-by: Alexandre Courbot <acourbot@nvidia.com> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
FSP is the Falcon that runs FMC firmware on Hopper and Blackwell.
Load the FMC ELF in two forms: the image section that FSP boots from,
and the full Firmware object for later signature extraction during
Chain of Trust verification. Declare the FMC image in the module's
firmware table so it is bundled for FSP-based chipsets.
Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Eliot Courtney <ecourtney@nvidia.com> Link: https://patch.msgid.link/20260602032111.224790-12-jhubbard@nvidia.com Co-developed-by: Alexandre Courbot <acourbot@nvidia.com> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Add the FSP (Foundation Security Processor) falcon engine type that
will handle secure boot and Chain of Trust operations on Hopper and
Blackwell architectures.
The FSP falcon replaces SEC2's role in the boot sequence for these newer
architectures. This initial stub just defines the falcon type and its
base address.
John Hubbard [Tue, 2 Jun 2026 03:20:57 +0000 (20:20 -0700)]
gpu: nova-core: add auto-detection of 32-bit, 64-bit firmware images
A firmware image may be either a 32-bit or a 64-bit ELF, and callers
should not have to know which. Detect the ELF class from the image
header at parse time and dispatch to the matching parser, so a single
entry point handles both layouts.
John Hubbard [Tue, 2 Jun 2026 03:20:56 +0000 (20:20 -0700)]
gpu: nova-core: add support for 32-bit firmware images
Some GPU firmware images are packaged as 32-bit ELF rather than 64-bit.
Add a 32-bit implementation of the shared ELF section-parsing
abstraction so those images can be parsed alongside the existing 64-bit
path.
Introduce a single ELF format abstraction that ties each ELF header
type to its matching section-header type. This keeps the shared
section parser ready for upcoming ELF32 support and avoids mixing
32-bit and 64-bit ELF layouts by mistake.
John Hubbard [Tue, 2 Jun 2026 03:20:54 +0000 (20:20 -0700)]
gpu: nova-core: Blackwell: use correct sysmem flush registers
Blackwell GPUs moved the sysmem flush page registers away from the
Ampere/Ada location. GB10x routes the flush through a pair of HSHUB0
register sets (primary and egress) that must both be programmed to
the same address. GB20x routes it through FBHUB0.
Define these registers relative to their HSHUB0 and FBHUB0 bases, as
Open RM does, and implement the flush paths in the GB10x and GB20x
framebuffer HALs.
The GSP-RM boot working memory portion of the WPR2 heap must be
larger on Hopper and later GPUs than on Turing, Ampere, and Ada.
Select the larger value for those generations.
Hopper and Blackwell need a larger non-WPR heap than the 1 MiB that
earlier architectures use. Hopper and Blackwell GB10x need 2 MiB, while
Blackwell GB20x needs 2 MiB + 128 KiB. These sizes diverge by family,
so give Hopper and each Blackwell family its own framebuffer HAL and
select the non-WPR heap size per chipset family.
GSP boot needs to know how much framebuffer memory is reserved for
the PMU. Compute it per architecture: Blackwell dGPUs reserve a
non-zero amount, earlier architectures leave it at zero, matching
Open RM behavior.
Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Eliot Courtney <ecourtney@nvidia.com> Link: https://patch.msgid.link/20260602032111.224790-4-jhubbard@nvidia.com Co-developed-by: Alexandre Courbot <acourbot@nvidia.com> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
John Hubbard [Tue, 2 Jun 2026 03:20:50 +0000 (20:20 -0700)]
gpu: nova-core: Hopper/Blackwell: new location for PCI config mirror
Hopper and Blackwell GPUs moved the PCI config space mirror from
0x088000 to 0x092000. Select the correct address per architecture
when building the GSP system info command.
Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Eliot Courtney <ecourtney@nvidia.com> Link: https://patch.msgid.link/20260602032111.224790-3-jhubbard@nvidia.com Co-developed-by: Alexandre Courbot <acourbot@nvidia.com> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
John Hubbard [Tue, 2 Jun 2026 03:20:49 +0000 (20:20 -0700)]
gpu: nova-core: set DMA mask width based on GPU architecture
Replace the hardcoded 47-bit DMA mask with a GPU HAL method that
provides the correct value for the architecture.
Set the DMA mask in Gpu::new(). Gpu owns all DMA allocations for
the device, so no concurrent allocations can exist while the
constructor is still running.
Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Gary Guo <gary@garyguo.net> Reviewed-by: Eliot Courtney <ecourtney@nvidia.com> Acked-by: Danilo Krummrich <dakr@kernel.org> Link: https://patch.msgid.link/20260602032111.224790-2-jhubbard@nvidia.com Co-developed-by: Alexandre Courbot <acourbot@nvidia.com> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Yu Kuai [Mon, 1 Jun 2026 06:15:02 +0000 (14:15 +0800)]
block, bfq: release cgroup stats with bfq_group
BFQ cgroup stats contain percpu counters embedded in struct bfq_group,
but the old free path destroys them from bfq_pd_free(), which is tied
to blkg policy-data teardown.
That is not the same lifetime as struct bfq_group. BFQ pins bfq_group
while bfq_queue entities refer to it, so bfq_pd_free() can drop the
policy-data reference while other bfq_group references still exist. The
following blkcg change also defers policy-data release through RCU and
leaves BFQ to run the final bfqg_put() from an RCU callback. For that
conversion, stats teardown must belong to the last bfq_group put, not to
policy-data teardown.
Move stats teardown to bfqg_put() so the embedded counters are destroyed
exactly when the last bfq_group reference is released, before kfree(bfqg).
Without this preparatory change, the RCU-delayed policy-data free
conversion reproduced the following KASAN report:
BUG: KASAN: slab-use-after-free in percpu_counter_destroy_many+0xf1/0x2e0
Write of size 8 at addr ffff88811d9409e0 by task test_blkcg/535
Last potentially related work creation:
kasan_save_stack+0x3e/0x60
kasan_record_aux_stack+0x99/0xb0
call_rcu+0x55/0x5c0
blkg_free_workfn+0x130/0x220
process_scheduled_works+0x655/0xb60
worker_thread+0x446/0x600
kthread+0x1f4/0x230
ret_from_fork+0x259/0x420
ret_from_fork_asm+0x1a/0x30
Fixes Coccinelle/coccicheck warning reported by do_div.cocci.
Compared to do_div(), div64_ul() does not implicitly cast the divisor and
does not unnecessarily calculate the remainder.
There are no functional changes. The benefit is purely a semantic cleanup
that better communicates the intent of the division and resolves the
static analysis warning.
driver core: Use system_percpu_wq instead of system_wq
Commit 1137838865bf ("driver core: Use mod_delayed_work to prevent lost
deferred probe work") added a use of system_wq, which is deprecated in
favor of system_percpu_wq added by commit 128ea9f6ccfb ("workqueue: Add
system_percpu_wq and system_dfl_wq"). An upcoming warning in the
workqueue tree flags this with:
workqueue: work func deferred_probe_timeout_work_func enqueued on deprecated workqueue. Use system_{percpu|dfl}_wq instead.
Switch to system_percpu_wq to clear up the warning.
Yao Sang [Thu, 28 May 2026 07:36:01 +0000 (15:36 +0800)]
nvme: refresh multipath head zoned limits from path limits
queue_limits_stack_bdev() updates the multipath head limits from the
path queue, but it does not propagate max_open_zones or
max_active_zones. As a result, a zoned multipath namespace head can
keep stale 0/0 values even after a ready path reports finite zoned
resource limits.
When refreshing the head limits in nvme_update_ns_info(), stack the
zoned resource limits directly after stacking the path queue limits.
Use min_not_zero() so the block layer's 0 value keeps its "no limit"
meaning while finite limits are combined conservatively.
This avoids advertising "no limit" on the multipath head while keeping
the zoned-limit handling local to the NVMe multipath update path.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yao Sang <sangyao@kylinos.cn> Signed-off-by: Keith Busch <kbusch@kernel.org>
liuxixin [Thu, 28 May 2026 10:00:01 +0000 (18:00 +0800)]
nvme: fix FDP fdpcidx bounds check
The fdpcidx bounds check sets n = NUMFDPC + 1 but used > instead of >=,
incorrectly accepting fdp_idx when it equals n (i.e. NUMFDPC + 1).
Fixes: 30b5f20bb2dd ("nvme: register fdp parameters with the block layer") Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: liuxixin <gliuxen@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Linus Walleij [Tue, 2 Jun 2026 11:27:47 +0000 (13:27 +0200)]
Merge tag 'v7.2-rockchip-dts64-1' of https://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip into soc/dt
New peripherals: Watchdog on RK3528, MIPI CSI-2 receiver on RK3588.
Adding frl-enable-gpios to a number of boards for HDMI 2.0 support.
And a bunch of fixes and new peripherals for a number of boards.
* tag 'v7.2-rockchip-dts64-1' of https://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip: (30 commits)
arm64: dts: rockchip: Add watchdog node for RK3528
arm64: dts: rockchip: add mipi csi-2 receiver nodes to rk3588
arm64: dts: rockchip: fix rk809 interrupt pin on rk3566-roc-pc
arm64: dts: rockchip: Add missing pinctrl-names to rk3588s boards
arm64: dts: rockchip: Add missing pinctrl-names to rk3588 boards
arm64: dts: rockchip: Add missing pinctrl-names to rk3576 boards
arm64: dts: rockchip: Drop unnecessary #{address,size}-cells from rk3588-jaguar
arm64: dts: rockchip: Add frl-enable-gpios to rk3588s-roc-pc
arm64: dts: rockchip: Add frl-enable-gpios to rk3588s-orangepi-cm5-base
arm64: dts: rockchip: Add frl-enable-gpios to rk3588s-khadas-edge2
arm64: dts: rockchip: Add frl-enable-gpios to rk3588s-gameforce-ace
arm64: dts: rockchip: Add frl-enable-gpios to rk3588s boards
arm64: dts: rockchip: Add frl-enable-gpios to rk3588 boards
arm64: dts: rockchip: Add frl-enable-gpios to rk3576-nanopi-r76s
arm64: dts: rockchip: Add frl-enable-gpios to rk3576-luckfox-core3576
arm64: dts: rockchip: Add frl-enable-gpios to rk3576 boards
arm64: dts: rockchip: Add AP6275P wireless support for Khadas Edge 2L
arm64: dts: rockchip: Add HYM8563 RTC for Khadas Edge 2L
arm64: dts: rockchip: Add #{address,size}-cells to Chromium-based /firmware
arm64: dts: rockchip: Add HDMI and VOP support for Khadas Edge 2L
...
wifi: mac80211: limit injected antenna index in ieee80211_parse_tx_radiotap
When parsing the radiotap header of an injected frame,
ieee80211_parse_tx_radiotap() uses the IEEE80211_RADIOTAP_ANTENNA value
directly as a shift count:
*iterator.this_arg is an 8-bit value taken straight from the frame
supplied by userspace, so BIT() can be asked to shift by up to 255. That
is undefined behaviour on the unsigned long and is reported by UBSAN:
UBSAN: shift-out-of-bounds in net/mac80211/tx.c:2174:30
shift exponent 235 is too large for 64-bit type 'unsigned long'
Call Trace:
ieee80211_parse_tx_radiotap+0xadb/0x1950 net/mac80211/tx.c:2174
ieee80211_monitor_start_xmit+0xb1f/0x1250 net/mac80211/tx.c:2451
...
packet_sendmsg+0x3eb6/0x50f0 net/packet/af_packet.c:3109
info->control.antennas is a 2-bit bitmap (u8 antennas:2), so only antenna
indices 0 and 1 can ever be represented. Ignore any larger value instead
of shifting out of bounds.
Reported-by: syzbot+8e0622f6d9446420271f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=8e0622f6d9446420271f Fixes: ef246a1480cc ("wifi: mac80211: support antenna control in injection") Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Link: https://patch.msgid.link/20260531011721.102941-1-kartikey406@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Yuqi Xu [Fri, 29 May 2026 15:25:37 +0000 (23:25 +0800)]
wifi: nl80211: reject oversized EMA RNR lists
nl80211_parse_rnr_elems() stores the parsed element count in a
u8-backed cfg80211_rnr_elems::cnt field and uses that count to size
the flexible array allocation.
Reject nested NL80211_ATTR_EMA_RNR_ELEMS input once the count reaches
255, before incrementing it again. This keeps the parser aligned with
the data structure it fills and matches the existing bound check used
by nl80211_parse_mbssid_elems().
Fixes: dbbb27e183b1 ("cfg80211: support RNR for EMA AP") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Assisted-by: Codex:gpt-5.4 Signed-off-by: Yuqi Xu <xuyuqiabc@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Link: https://patch.msgid.link/20260529152542.1412734-1-n05ec@lzu.edu.cn Signed-off-by: Johannes Berg <johannes.berg@intel.com>
nvme-tcp: Use WQ_PERCPU explicitly if wq_unbound is false.
Since commit 21c05ca88a54 ("workqueue: Add warnings and ensure
one among WQ_PERCPU or WQ_UNBOUND is present"), we must explicitly
set WQ_PERCPU or WQ_UNBOUND when creating workqueue.
nvme_tcp_init_module() sets WQ_UNBOUND when the module param
wq_unbound is set, but otherwise, WQ_PERCPU is missing, triggering
the warning below:
workqueue: nvme_tcp_wq is using neither WQ_PERCPU or WQ_UNBOUND. Setting WQ_PERCPU.
WARNING: kernel/workqueue.c:5856 at __alloc_workqueue+0x1d02/0x2070 kernel/workqueue.c:5855, CPU#0: swapper/0/1
Bryam Vargas [Wed, 27 May 2026 20:00:00 +0000 (15:00 -0500)]
nvmet: fix pre-auth out-of-bounds heap read in Discovery Get Log Page
nvmet_execute_disc_get_log_page() validates only the dword alignment
of the host-supplied Log Page Offset (lpo). The 64-bit offset is then
added to a small kzalloc'd buffer that holds the discovery log page
and the result is passed straight to nvmet_copy_to_sgl(), which
memcpy()s data_len bytes out to the host with no source-side bound
check:
The Discovery controller is unauthenticated -- nvmet_host_allowed()
returns true unconditionally for the discovery subsystem -- so the call
is reachable pre-authentication by any TCP/RDMA/FC peer that can reach
the nvmet target. With a discovery log page of ~1 KiB, an attacker
requesting up to 4 KiB starting at offset == alloc_len reads the next
slab page out and gets its content returned over the fabric (an
empirical run on a default nvmet-tcp loopback target leaked 81
canonical kernel pointers in one Get Log Page response). Pointing the
offset at unmapped kernel memory faults the in-kernel memcpy and
crashes (or panics, on panic_on_oops=1) the target host instead.
The attacker-controlled source-side offset pattern
"nvmet_copy_to_sgl(req, 0, buffer + ATTACKER_OFFSET, ...)" is unique
to nvmet_execute_disc_get_log_page in the entire nvmet codebase: every
other Get Log Page handler in admin-cmd.c either ignores lpo (and
silently starts every response at offset 0) or tracks a local
destination offset with a fixed source pointer.
Validate the host-supplied offset against the log page size, cap the
copy length to what is actually available, and zero-fill any remainder
of the host transfer buffer. The zero-fill matches the existing
short-response pattern in nvmet_execute_get_log_changed_ns()
(admin-cmd.c) and prevents leaking transport SGL contents when the
host asks for more bytes than the log page contains.
Fixes: a07b4970f464 ("nvmet: add a generic NVMe target") Cc: stable@vger.kernel.org Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
According to the C standard, the evaluation order of expressions in an
initializer list is indeterminately sequenced. The compiler (Clang, in
this KMSAN build) evaluates `cpu_to_node(ids.cpu_id)` *before*
`ids.cpu_id` is initialized with `task_cpu(t)`.
This is fixed by moving the assignment of ids.node_id outside the
structure initialization.
Fixes: 82f572449cfe ("rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode") Closes: https://syzkaller.appspot.com/bug?extid=185a631927096f9da2fc Reported-by: syzbot+185a631927096f9da2fc@syzkaller.appspotmail.com Signed-off-by: Qing Wang <wangqing7171@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://patch.msgid.link/20260602030854.574038-1-wangqing7171@gmail.com
Peter Zijlstra [Tue, 2 Jun 2026 07:10:05 +0000 (07:10 +0000)]
sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime()
assign_cfs_rq_runtime() during update_curr() sets the resched indicator
and relies on check_cfs_rq_runtime() during pick_next_task() /
put_prev_entity() to throttle the hierarchy once current task is
preempted / blocks.
Per-task throttle, on the other hand, uses throttle_cfs_rq() to simply
propagate the throttle signals, and then relies on task work to
individually throttle the runnable tasks on their way out to the
userspace.
Remove check_cfs_rq_runtime() and unify throttling into
account_cfs_rq_runtime() which only sets the cfs_rq->throttled,
cfs_rq->throttle_count indicators via throttle_cfs_rq() and optionally
adds the task work to the current task (donor) it is on the throttled
hierarchy.
throttle_cfs_rq() requests for sched_cfs_bandwidth_slice() worth of
bandwidth for the current hierarchy that enable it to continue running
uninterrupted when selected. For the rest, it requests a bare minimum of
"1" to ensure some bandwidth is available and pass the
"runtime_remaining > 0" checks once selected.
For SCHED_PROXY_EXEC, a mutex holder cannot exit to userspace without
dropping it first and the mutex_unlock() ensures proxy is stopped before
the mutex handoff which preserves the current semantics for running a
throttled task until it exits to the userspace even if it acts as a
donor.
[ prateek: rebased on tip, comments, commit message. ]
Reviewed-By: Benjamin Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Link: https://patch.msgid.link/20260602071005.11942-1-kprateek.nayak@amd.com
K Prateek Nayak [Tue, 2 Jun 2026 05:25:30 +0000 (05:25 +0000)]
sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up()
An update_curr() during the enqueue of throttled task will start
throttling the hierarchy from subsequent commit. This can lead to
tg_throttle_down() seeing non-empty throttled_limbo_list for the cfs_rq
attaching the task from throttled_limbo_list one by one. For example:
R
|
A
/ \
*B C
|
rq->curr
*B is throttled with tasks on hte limbo list. When the tasks are
unthrottled via tg_unthrottle_up() and entity of group B is placed onto
A, update_curr() is called to catch up the vruntime and it may throttle
group A causing the subsequent tg_throttle_down() to see the pending
task's on B's limbo list.
Move the tasks from throttled_limbo_list onto a local list before
starting the unthrottle to prevent the splat described above. If the
hierarchy is throttled again in middle of an unthrottle, put the pending
tasks back onto the limbo list to prevent running them unnecessarily.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Benjamin Segall <bsegall@google.com> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Link: https://patch.msgid.link/20260602052531.11450-2-kprateek.nayak@amd.com
K Prateek Nayak [Tue, 2 Jun 2026 05:25:29 +0000 (05:25 +0000)]
sched/fair: Call update_curr() before unthrottling the hierarchy
Subsequent commits will allow update_curr() to throttle the hierarchy
when the runtime accounting exceeds allocated quota. Call update_curr()
before the unthrottle event, and in tg_unthrottle_up() to catch up on
any remaining runtime and stabilize the "runtime_remaining" and
"throttle_count" for that cfs_rq.
Doing an update_curr() early ensures the cfs_rq is not throttled right
back up again when the unthrottle is in progress.
Since all callers of unthrottle_cfs_rq(), except two, already update the
rq_clock and call rq_clock_start_loop_update(), move the
update_rq_clock() from unthrottle_cfs_rq() to the callers that don't
update the rq_clock.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Benjamin Segall <bsegall@google.com> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Link: https://patch.msgid.link/20260602052531.11450-1-kprateek.nayak@amd.com
K Prateek Nayak [Tue, 2 Jun 2026 05:00:02 +0000 (05:00 +0000)]
sched/fair: Use throttled_csd_list for local unthrottle
When distribute_cfs_runtime() encounters a local cfs_rq, it adds it to a
local list and unthrottles it at the end, when it is done unthrottling
other cfs_rq(s) on cfs_b->throttled_cfs_rq until the bandwidth runs out.
Instead of using a local list, reuse the local CPU's
rq->throttled_csd_list and the __cfsb_csd_unthrottle() path for
unthrottle.
If this is the first cfs_rq to be queued on the "throttled_csd_list", it
prevents the need for a remote CPUs to interrupt this local CPU if they
themselves are performing async unthrottle.
If this is not the first cfs_rq on the list, there is an async unthrottle
operation pending on this local CPU and the unthrottle can be batched
together.
No functional changes intended.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Benjamin Segall <bsegall@google.com> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Link: https://patch.msgid.link/20260602050005.11160-3-kprateek.nayak@amd.com
K Prateek Nayak [Tue, 2 Jun 2026 05:00:01 +0000 (05:00 +0000)]
sched/fair: Convert cfs bandwidth throttling to use guards
Routine conversion of rcu_read_lock(), spin_lock*, and rq_lock usage
within the cfs bandwidth controller to use class guards.
Only notable changes are:
- Checking for "cfs_rq->runtime_remaining <= 0" instead of the inverse
to spot a throttle and break early. This also saves the need
for extra indentation in the unthrottle case.
- Reordering of list_del_rcu() against throttled_clock indicator update
in unthrottle_cfs_rq(). Both are done with "cfs_b->lock" held after
the "cfs_rq->throttled" is cleared which make the reordering safe
against concurrent list modifications.
No functional changes intended.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Link: https://patch.msgid.link/20260602050005.11160-2-kprateek.nayak@amd.com
Zecheng Li [Fri, 22 May 2026 14:15:50 +0000 (10:15 -0400)]
sched/fair: Allocate cfs_tg_state with percpu allocator
To remove the cfs_rq pointer array in task_group, allocate the combined
cfs_rq and sched_entity using the per-cpu allocator.
This patch implements the following:
- Changes task_group->cfs_rq from 'struct cfs_rq **' to
'struct cfs_rq __percpu *'.
- Updates memory allocation in alloc_fair_sched_group() and
free_fair_sched_group() to use alloc_percpu() and free_percpu()
respectively.
- Uses the inline accessor tg_cfs_rq(tg, cpu) with per_cpu_ptr() to retrieve
the pointer to cfs_rq for the given task group and CPU.
- Replaces direct accesses tg->cfs_rq[cpu] with calls to the new tg_cfs_rq(tg,
cpu) helper.
- Handles the root_task_group: since struct rq is already a per-cpu variable
(runqueues), its embedded cfs_rq (rq->cfs) is also per-cpu. Therefore, we
assign root_task_group.cfs_rq = &runqueues.cfs.
- Cleanup the code in initializing the root task group.
This change places each CPU's cfs_rq and sched_entity in its local per-cpu
memory area to remove the per-task_group pointer arrays.
Signed-off-by: Zecheng Li <zecheng@google.com> Signed-off-by: Zecheng Li <zli94@ncsu.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Josh Don <joshdon@google.com> Link: https://patch.msgid.link/20260522141623.600235-4-zli94@ncsu.edu
Zecheng Li [Fri, 22 May 2026 14:15:49 +0000 (10:15 -0400)]
sched/fair: Remove task_group->se pointer array
Now that struct sched_entity is co-located with struct cfs_rq for non-root task
groups, the task_group->se pointer array is redundant. The associated
sched_entity can be loaded directly from the cfs_rq.
This patch performs the access conversion with the helpers:
- is_root_task_group(tg): checks if a task group is the root task group. It
compares the task group's address with the global root_task_group variable.
- tg_se(tg, cpu): retrieves the cfs_rq and returns the address of the
co-located se. This function checks if tg is the root task group to ensure
behaving the same of previous tg->se[cpu]. Replaces all accesses that use
the tg->se[cpu] pointer array with calls to the new tg_se(tg, cpu) accessor.
- cfs_rq_se(cfs_rq): simplifies access paths like cfs_rq->tg->se[...] to use
the co-located sched_entity. This function also checks if tg is the root
task group to ensure same behavior.
Since tg_se is not in very hot code paths, and the branch is a register
comparison with an immediate value (`&root_task_group`), the performance impact
is expected to be negligible.
Signed-off-by: Zecheng Li <zecheng@google.com> Signed-off-by: Zecheng Li <zli94@ncsu.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Josh Don <joshdon@google.com> Link: https://patch.msgid.link/20260522141623.600235-3-zli94@ncsu.edu
Zecheng Li [Fri, 22 May 2026 14:15:48 +0000 (10:15 -0400)]
sched/fair: Co-locate cfs_rq and sched_entity in cfs_tg_state
Improve data locality and reduce pointer chasing by allocating struct
cfs_rq and struct sched_entity together for non-root task groups. This
is achieved by introducing a new combined struct cfs_tg_state that
holds both objects in a single allocation.
This patch:
- Introduces struct cfs_tg_state that embeds cfs_rq, sched_entity, and
sched_statistics together in a single structure.
- Updates __schedstats_from_se() in stats.h to use cfs_tg_state for accessing
sched_statistics from a group sched_entity.
- Modifies alloc_fair_sched_group() and free_fair_sched_group() to allocate
and free the new struct as a single unit.
- Modifies the per-CPU pointers in task_group->se and task_group->cfs_rq to
point to the members in the new combined structure.
Signed-off-by: Zecheng Li <zecheng@google.com> Signed-off-by: Zecheng Li <zli94@ncsu.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Josh Don <joshdon@google.com> Link: https://patch.msgid.link/20260522141623.600235-2-zli94@ncsu.edu
Guanyou.Chen [Fri, 22 May 2026 13:09:59 +0000 (21:09 +0800)]
sched: restore timer_slack_ns when resetting RT policy on fork
Commit ed4fb6d7ef68 ("hrtimer: Use and report correct timerslack values
for realtime tasks") sets timer_slack_ns to 0 for RT tasks in
__setscheduler_params(). However, when an RT task with SCHED_RESET_ON_FORK
creates child threads, the children inherit timer_slack_ns=0 from the
parent. sched_fork() resets the child's policy to SCHED_NORMAL but does
not restore timer_slack_ns, leaving the child permanently running with
zero slack.
Fix this by restoring timer_slack_ns from default_timer_slack_ns in
sched_fork() when resetting from RT/DL to NORMAL policy, matching the
existing behavior in __setscheduler_params().
Note: this fix alone requires a correct default_timer_slack_ns to be
effective. See the following patch for that fix.
Fixes: ed4fb6d7ef68 ("hrtimer: Use and report correct timerslack values for realtime tasks") Reported-by: Qiaoting.Lin <linqiaoting@xiaomi.com> Signed-off-by: Guanyou.Chen <chenguanyou@xiaomi.com> Signed-off-by: Chunhui.Li <chunhui.li@mediatek.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260522131000.1664983-2-chenguanyou@xiaomi.com
The reason for this condition is that the signal condition in
try_to_block_task() would set_task_blocked_in_waking(). However, it no longer
does that, in fact, that path does clear_task_blocked_on().
there are a few other edge cases that needed this. But they're all
variants of PROXY_WAKING leaking out. And since PROXY_WAKING is now
gone, this is no longer needed either.
K Prateek Nayak [Tue, 26 May 2026 09:43:02 +0000 (11:43 +0200)]
sched/proxy: Remove PROXY_WAKING
Now that the proxy path uses ->is_blocked, use the '->is_blocked &&
!->blocked_on' state instead of PROXY_WAKING. Notably, this is where a
blocked_on relation is broken but the donor task might still need a return
migration.