Paolo Bonzini [Fri, 12 Jun 2026 08:12:22 +0000 (10:12 +0200)]
Merge tag 'kvm-x86-selftests-7.2' of https://github.com/kvm-x86/linux into HEAD
KVM selftests changes for 7.2
- Randomize the dirty log test's delay when reaping the bitmap on the first
pass. Always waiting only 1ms hid a KVM RISC-V bug, because the test reaped
the bitmap before KVM could build up enough state to hit the bug.
Paolo Bonzini [Fri, 12 Jun 2026 08:11:59 +0000 (10:11 +0200)]
Merge tag 'kvm-x86-mmu-7.2' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 7.2
- Use the kernel's "enum pg_level" in the TDX APIs instead of the TDX-Module's
level definitions (which are 0-based).
- Rework the TDX memory APIs to not require/assume that guest memory is
backed by "struct page" (in preparation for guest_memfd hugepage support).
- Overhaul the TDP MMU => S-EPT code to move as much S-EPT specific logic as
possible into the TDX code, and to funnel (almost) all S-EPT updates into
a single chokepoint. The motivation is largely to prepare for upcoming
Dynamic PAMT support, but the cleanups are nice to have on their own.
- Plug a hole in the shadow MMU where KVM fails to recursively zap nested TDP
shadow when L1 is tearing its TDP page tables from the bottom up, as KVM's
TDP MMU now does.
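The level renumbering in the first item above can be sketched as follows. This is a minimal Python model, not KVM code: the PgLevel values mirror the kernel's "enum pg_level" (where 4K is 1), the 0-based TDX-Module numbering is taken from the changelog, and both helper names are hypothetical.

```python
from enum import IntEnum


class PgLevel(IntEnum):
    """Python mirror of the kernel's enum pg_level (1-based for real levels)."""
    NONE = 0
    LEVEL_4K = 1
    LEVEL_2M = 2
    LEVEL_1G = 3


def pg_level_to_tdx_level(level):
    """Hypothetical helper: map a kernel level to the 0-based TDX-Module level."""
    if level == PgLevel.NONE:
        raise ValueError("PG_LEVEL_NONE has no TDX-Module equivalent")
    return int(level) - 1


def tdx_level_to_pg_level(tdx_level):
    """Hypothetical helper: map a 0-based TDX-Module level back to PgLevel."""
    return PgLevel(tdx_level + 1)
```

Keeping the conversion in one place means callers only ever handle the kernel's numbering, which is the point of the API change.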
Paolo Bonzini [Fri, 12 Jun 2026 08:11:09 +0000 (10:11 +0200)]
Merge tag 'kvm-x86-misc-7.2' of https://github.com/kvm-x86/linux into HEAD
KVM misc x86 changes for 7.2
- Handle EXIT_FASTPATH_EXIT_USERSPACE in vendor code to ensure vendor code
gets a chance to handle things like reaping the PML buffer.
- Ensure KVM's copy of CR0 and CR3 are up-to-date on SVM prior to invoking
fastpath handlers.
- Update KVM's view of PV async enabling if and only if the MSR write fully
succeeds.
- Fix a variety of issues where the emulator doesn't honor guest-debug state,
and clean up related code along the way.
- Synthesize EPT Violation and #NPF "error code" bits when injecting faults
into L1 that didn't originate in hardware (in which case the VMCS/VMCB
doesn't hold relevant information).
- Add support for virtualizing (well, emulating) AMD's flavor of CPL>0 CPUID
faulting.
- Clean up the GPR APIs so that KVM's use of "raw" is consistent, and fix a
variety of minor bugs along the way.
- Fix an OOB memory access due to not checking the VP ID when handling a
Hyper-V PV TLB flush for L2.
- Fix a bug in the mediated PMU's handling of fixed counters that allowed the
guest to bypass the PMU event filter.
- Allow userspace to return EAGAIN when handling SNP and TDX hypercalls, so
that KVM can forward a "retry" status code to the guest, and reserve all
unused error codes for future usage.
Paolo Bonzini [Fri, 12 Jun 2026 08:08:52 +0000 (10:08 +0200)]
Merge tag 'kvm-x86-gmem-7.2' of https://github.com/kvm-x86/linux into HEAD
KVM guest_memfd changes for 7.2
- Return -EEXIST instead of -EINVAL if userspace attempts to bind a gmem
range to multiple memslots, and fix the test that was supposed to ensure
KVM returns -EEXIST.
- Treat memslot binding offsets and sizes as unsigned values to fix a bug
where KVM interprets a large "offset + size" as a negative value and allows
a nonsensical offset.
- Use the inode number instead of the page offset for the NUMA interleaving
index to fix a bug where the effective index would jump by two for
consecutive pages (the caller also adds in the page offset).
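The "offset + size" item above can be illustrated with a small Python model of 64-bit arithmetic. It is a sketch under assumptions: the two check functions are hypothetical and only imitate a signed versus an unsigned range check; they are not KVM's actual validation code.

```python
U64_MASK = (1 << 64) - 1


def as_s64(value):
    """Reinterpret the low 64 bits of value as a signed 64-bit integer."""
    value &= U64_MASK
    return value - (1 << 64) if value >> 63 else value


def signed_range_ok(offset, size, file_size):
    """Buggy model: the sum wraps to a negative value and passes the check."""
    return as_s64(offset) >= 0 and as_s64(offset + size) <= file_size


def unsigned_range_ok(offset, size, file_size):
    """Fixed model: unsigned comparison written so that nothing can wrap."""
    return size <= file_size and offset <= file_size - size
```

With offset = 0x7FFFFFFFFFFFF000, size = 0x2000 and a 64KiB file, the signed model accepts the binding because the sum is negative, while the unsigned model rejects it.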
Marc Zyngier [Fri, 12 Jun 2026 08:08:31 +0000 (09:08 +0100)]
Merge branch kvm-arm64/vgic-v5-PPI-fixes into kvmarm-master/next
* kvm-arm64/vgic-v5-PPI-fixes:
: .
: Substantial cleanup of the vgic-v5 PPI support. From the original
: cover letter:
:
: "With the GICv5 PPI support merged in, it has become obvious that a few
: things could be improved, both from the correctness and maintainability
: angles."
: .
KVM: arm64: Fix arch timer interrupts for GICv3-on-GICv5 guests
irqchip/gic-v5: Immediately exec priority drop following activate
Documentation: KVM: Clarify that PMU_V3_IRQ IntID requirements for GICv5
Documentation: KVM: Fix typos in VGICv5 documentation
KVM: arm64: selftests: Improve error handling for GICv5 PPI selftest
KVM: arm64: selftests: Cleanup unused vars in GICv5 PPI selftest
KVM: arm64: selftests: Add missing GIC CDEN to no-vgic-v5 selftest
KVM: arm64: vgic-v5: Atomically assign bits to PPI DVI bitmap
KVM: arm64: vgic-v5: Add missing trap handing for NV triage
KVM: arm64: vgic-v5: Limit support to 64 PPIs
KVM: arm64: vgic: Rationalise per-CPU irq accessor
KVM: arm64: vgic-v5: Drop defensive checks from vgic_v5_ppi_queue_irq_unlock()
KVM: arm64: vgic: Consolidate vgic_allocate_private_irqs_locked()
KVM: arm64: vgic: Constify struct irq_ops usage
KVM: arm64: vgic-v5: Drop pointless ARM64_HAS_GICV5_CPUIF check
KVM: arm64: vgic-v5: Remove use of __assign_bit() with a constant
KVM: arm64: vgic-v5: Move PPI caps into kvm_vgic_global_state
KVM: arm64: vgic-v5: Add for_each_visible_v5_ppi() iterator
Marc Zyngier [Fri, 12 Jun 2026 08:08:25 +0000 (09:08 +0100)]
Merge branch kvm-arm64/pkvm-fixes-7.2 into kvmarm-master/next
* kvm-arm64/pkvm-fixes-7.2:
: .
: Assorted pKVM fixes for 7.2:
:
: - Ensure that the vcpu memcache is filled in a number of cases (donate,
: share, selftest)
:
: - Fix vmemmap page order handling by resetting it when initialising the
: memory pool
:
: - Don't leak page references on failed memory donation
:
: - Add sanity-check for refcounted pages when donating/sharing pages
:
: - Clear __hyp_running_vcpu on state flush
:
: - Check LR upper bound against a trusted value
:
: - Assorted fixes for the host-side tracking of the pages shared with
: EL2 as a result of some Sashiko testing from Fuad
:
: - Correctly forward HCR_EL2.VSE from host to guest, so that protected
: guests can see SErrors
: .
KVM: arm64: Roll back partial shares on kvm_share_hyp() failure
KVM: arm64: Avoid host/hyp share desync on unshare hypercall failure
KVM: arm64: Free hyp-share tracking node when share hypercall fails
KVM: arm64: Flush HCR_EL2.VSE to deliver SErrors to pKVM guests
KVM: arm64: Bound used_lrs when flushing the pKVM hyp vCPU
KVM: arm64: Clear __hyp_running_vcpu when flushing the pKVM hyp vCPU
KVM: arm64: Pre-check vcpu memcache for host->guest donate
KVM: arm64: Pre-check vcpu memcache for host->guest share
KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache
KVM: arm64: Add fail-safe for refcounted pages in __pkvm_hyp_donate_host
KVM: arm64: Fix __pkvm_init_vm error path
KVM: arm64: Reset page order in pKVM hyp_pool
Marc Zyngier [Fri, 12 Jun 2026 08:04:24 +0000 (09:04 +0100)]
Merge branch kvm-arm64/nv-granule-sizes into kvmarm-master/next
* kvm-arm64/nv-granule-sizes:
: .
: Tidying up of the behaviour when the selected page size is not
: implemented, courtesy of Wei-Lin Chang. From the initial cover
: letter:
:
: "This small series fixes the granule size selection for software stage-1
: and stage-2 walks. Previously we treat the guest's TCR/VTCR.TGx as-is
: and use the encoded granule size for the walks. However this is
: incorrect if the granule sizes are not advertised in the guest's
: ID_AA64MMFR0_EL1.TGRAN*. The architecture specifies that when an
: unsupported size is programmed in TGx, it must be treated as an
: implemented size. Fix this by choosing an available one while
: prioritizing PAGE_SIZE."
: .
KVM: arm64: Fallback to a supported value for unsupported guest TGx
KVM: arm64: nv: Use literal granule size in TLBI range calculation
KVM: arm64: Factor out TG0/1 decoding of VTCR and TCR
KVM: arm64: nv: Rename vtcr_to_walk_info() to setup_s2_walk()
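The granule fallback described in the cover letter above can be sketched in Python. This is an illustration, not the KVM implementation: the function name is hypothetical, granules are given as byte sizes, and picking the smallest supported size when PAGE_SIZE is itself unavailable is an assumption made only so that the result is deterministic.

```python
def effective_granule(requested, supported, page_size):
    """Return the granule size to use for a software page-table walk.

    requested: granule size encoded in the guest's TGx field
    supported: set of granule sizes advertised to the guest
    page_size: host PAGE_SIZE, preferred when a fallback is needed
    """
    if not supported:
        raise ValueError("at least one granule size must be supported")
    if requested in supported:
        return requested      # the guest's choice is implemented: use it as-is
    if page_size in supported:
        return page_size      # unsupported choice: prefer PAGE_SIZE
    return min(supported)     # assumption: otherwise take the smallest one
```

For example, a guest that programs 16K while only 4K and 64K are advertised would be walked with 4K on a 4K-page host.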
Marc Zyngier [Fri, 12 Jun 2026 08:03:57 +0000 (09:03 +0100)]
Merge branch kvm-arm64/nv-fp-elision into kvmarm-master/next
* kvm-arm64/nv-fp-elision:
: .
: Significantly reduce the overhead of the context switch between L1 and
: L2 guests by eliding the save/restore of the FP/SIMD/SVE registers, as
: this state is shared between the two guests, and therefore can be left
: live.
: .
KVM: arm64: nv: Don't save/restore FP register during a nested ERET or exception
KVM: arm64: nv: Track L2 to L1 exception emulation
Marc Zyngier [Fri, 12 Jun 2026 08:03:24 +0000 (09:03 +0100)]
Merge branch kvm-arm64/no-lazy-vgic-init into kvmarm-master/next
* kvm-arm64/no-lazy-vgic-init:
: .
: Fix an ugly situation where the vgic lazy init could happen in
: non-preemptible contexts such as vcpu reset, resulting in lockdep
: splats.
:
: This requires revamping the way in-kernel emulated devices
: (timers, PMU) present their interrupts to the vgic, and
: making sure there is no need to init the vgic on the back of that.
: .
KVM: arm64: vgic-v2: Don't init the vgic on in-kernel interrupt injection
KVM: arm64: vgic-v2: Force vgic init on injection outside the run loop
KVM: arm64: pmu: Kill the PMU interrupt level cache
KVM: arm64: timer: Kill the per-timer irq level cache
KVM: arm64: Simplify userspace notification of interrupt state
KVM: arm64: timer: Repaint kvm_timer_{should,irq_can}_fire() to kvm_timer_{pending,enabled}()
Jackie Liu [Thu, 4 Jun 2026 07:51:47 +0000 (15:51 +0800)]
KVM: arm64: vgic-its: Make ABI commit helpers return void
The return values of vgic_its_set_abi() and vgic_its_commit_v0() are always
0 and do not carry useful error information. Simplify by changing them to
void.
Suggested-by: Oliver Upton <oupton@kernel.org>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Oliver Upton <oupton@kernel.org>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://patch.msgid.link/20260604075147.53299-1-liu.yun@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Mikhail Lobanov [Wed, 10 Jun 2026 19:19:04 +0000 (22:19 +0300)]
xfs: shut down the filesystem on a failed mount
A corrupt/crafted XFS image can make mount fail after background inode
inactivation has already been enabled. xfs_mountfs() turns on inodegc
(xfs_inodegc_start()) right after log recovery, but the quota subsystem
(mp->m_quotainfo) is only allocated much later, in xfs_qm_newmount() /
xfs_qm_mount_quotas(). The quota accounting flags in mp->m_qflags are
parsed from the mount options before xfs_mountfs() even runs.
If the mount then aborts in between - e.g. xfs_rtmount_inodes() failing
with "failed to read RT inodes" - the unwind path flushes the inodegc
queue, which inactivates the inodes that are still queued, and
xfs_inactive() calls xfs_qm_dqattach(). That path trusts
XFS_IS_QUOTA_ON() (the flag is set) and dereferences the not yet
allocated mp->m_quotainfo:
XFS (loop0): failed to read RT inodes
Oops: general protection fault, probably for non-canonical address
0xdffffc000000002a: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000150-0x0000000000000157]
Workqueue: xfs-inodegc/loop0 xfs_inodegc_worker
RIP: 0010:__mutex_lock+0xfe/0x930
Call Trace:
xfs_qm_dqget_cache_lookup+0x63/0x7f0
xfs_qm_dqget_inode+0x336/0x860
xfs_qm_dqattach_one+0x232/0x4e0
xfs_qm_dqattach_locked+0x2c6/0x470
xfs_qm_dqattach+0x46/0x70
xfs_inactive+0x988/0xe80
xfs_inodegc_worker+0x27c/0x730
The NULL m_quotainfo deref is only one symptom. The deeper problem is
that a failed mount should not be inactivating inodes at all: it must
not write to the (possibly corrupt, only partially set up) persistent
metadata of a filesystem we just refused to mount, and the subsystems
inactivation relies on may not be initialised.
Mark the filesystem shut down before flushing the inodegc queue in the
xfs_mountfs() failure path. With the preceding patch a shut down mount
no longer inactivates the queued inodes: xfs_inactive() returns early so
they are dropped straight to reclaim instead. They are still pulled down
so reclaim can free them (which is why the flush was added in commit ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")), but
without touching the on-disk structures - matching that comment's own
"pull down all the state and flee" intent.
Use SHUTDOWN_META_IO_ERROR for the shutdown: it is the generic "cannot
safely touch metadata" reason already used elsewhere in this file and in
the xfs_ifree() failure path, and unlike SHUTDOWN_FORCE_UMOUNT it does
not log a misleading "User initiated shutdown received". A failed mount
is not necessarily on-disk corruption (it can be a transient I/O or
resource error), so SHUTDOWN_CORRUPT_ONDISK would not be accurate either.
Found by fuzzing XFS with syzkaller (corrupt image mount); reproduced and
verified under QEMU/KASAN.
Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
Signed-off-by: Mikhail Lobanov <m.lobanov@rosa.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Mikhail Lobanov [Wed, 10 Jun 2026 19:19:03 +0000 (22:19 +0300)]
xfs: skip inode inactivation on a shut down mount
XFS already declines to inactivate inodes on a shut down mount, but only
at queue time: xfs_inode_mark_reclaimable() calls
xfs_inode_needs_inactive(), which returns false when the mount is shut
down ("If the log isn't running, push inodes straight to reclaim"), and
then drops the dquots and marks the inode reclaimable directly.
An inode that was queued for background inactivation while the mount was
still live is not covered by that check: the inodegc worker still calls
xfs_inactive() on it even after the mount has been shut down in the
meantime. Inactivation modifies persistent metadata and runs
transactions that cannot complete on a shut down mount, and it relies on
subsystems (e.g. quota) that a torn down, or never fully set up, mount
may not have available.
Honour the same invariant in xfs_inactive() itself: if the mount is shut
down, return early before doing any inactivation work. The dquots
attached to the inode are released by the existing xfs_qm_dqdetach() at
the out: label, so references are not leaked, and the caller then makes
the inode reclaimable exactly as before.
On its own this is a consistency fix with the existing queue-time
behaviour; it is also a prerequisite for shutting the mount down in the
xfs_mountfs() failure path in the following patch.
Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
Signed-off-by: Mikhail Lobanov <m.lobanov@rosa.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Because CYCLE_LSN/BLOCK_LSN are defined in xfs_log_format.h, XFS_LSN_CMP
forces a xfs_log_format.h dependency in xfs_log.h. Move XFS_LSN_CMP
to xfs_log_format.h and drop the macro/inline indirection to clean up
our header mess a little bit.
This also helps xfsprogs, which doesn't have xfs_log.h, but needs
XFS_LSN_CMP.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Yao Sang [Fri, 12 Jun 2026 02:44:30 +0000 (10:44 +0800)]
xfs: shut down zoned file systems on writeback errors
Zoned writeback allocates space from an open zone and advances the
in-memory allocation state before submitting the bio. The completion
path only records the written blocks and updates the mapping on success.
If the write fails, XFS cannot tell how far the device write pointer
advanced and cannot safely roll the open zone accounting back.
This was observed while investigating xfs/643 and xfs/646 on an external
ZNS realtime device. A writeback error after consuming space from an
open zone left later writers waiting for open-zone or GC progress that
could not happen. xfs/643 exposed this through the GC defragmentation
path, while xfs/646 exposed the same failure mode through the
truncate/EOF-zeroing space wait path.
There is no local recovery path in ioend completion that can restore a
consistent zoned allocation state after the device has rejected the
write. Treat writeback errors for zoned inodes as fatal and force a
file system shutdown from the ioend completion path. The existing
shutdown path wakes zoned allocation waiters and makes future space
waits return -EIO instead of leaving tasks stuck waiting for progress.
Signed-off-by: Yao Sang <sangyao@kylinos.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Jason Gunthorpe [Fri, 27 Mar 2026 15:23:58 +0000 (12:23 -0300)]
iommu/amd: Make CMD_INV_IOMMU_ALL_PAGES_ADDRESS match the spec
The spec in Table 14 defines the "Entire Cache" case as having the low
12 bits as zero. Indeed the command format doesn't even have the low
12 bits. Since there is only one user now, fix the constant to have 0
in the low 12 bits instead of 1 and remove the masking.
Jason Gunthorpe [Fri, 27 Mar 2026 15:23:57 +0000 (12:23 -0300)]
iommu/amd: Have amd_iommu_domain_flush_pages() use last
Finish clearing out the size/last/end switching by converting
amd_iommu_domain_flush_pages() to use last-based logic.
This algorithm is simpler than the previous one. Ultimately all it needs
to do is select powers of two that are aligned to the address and no
longer than the distance to last.
The new version is fully safe for size = U64_MAX and last = U64_MAX.
Finally, the gather can be passed through natively without risking an
overflow in (gather->end - gather->start + 1).
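The overflow and the last-based selection described above can be sketched in Python. This is a model of the idea only, not the driver's code: both function names are hypothetical, addresses are plain integers, and Python's unbounded integers stand in for the careful u64 handling a C implementation needs (the single 2^64-byte chunk below cannot be stored in a u64 size).

```python
U64_MAX = (1 << 64) - 1


def naive_size_u64(start, end):
    """(end - start + 1) as a wrapping u64; a full-range gather yields 0."""
    return (end - start + 1) & U64_MAX


def split_pow2(start, last):
    """Cover [start, last] with power-of-two chunks aligned to their address."""
    if start > last:
        raise ValueError("start must not exceed last")
    chunks = []
    addr = start
    while True:
        span = last - addr + 1                      # bytes left, inclusive
        by_length = 1 << (span.bit_length() - 1)    # largest 2^n <= span
        by_align = (addr & -addr) if addr else by_length  # largest 2^n dividing addr
        size = min(by_align, by_length)
        chunks.append((addr, size))
        if addr + size - 1 == last:
            return chunks
        addr += size
```

For instance, split_pow2(0x1000, 0x3FFF) gives one 4KiB chunk at 0x1000 followed by one 8KiB chunk at 0x2000, while naive_size_u64(0, U64_MAX) wraps to 0, which is the size-based failure mode the commit avoids.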
Jason Gunthorpe [Fri, 27 Mar 2026 15:23:56 +0000 (12:23 -0300)]
iommu/amd: Pass last in through to build_inv_address()
This is the trivial call chain below amd_iommu_domain_flush_pages().
Cases that are doing a full invalidate will pass a last of U64_MAX.
This avoids converting between size and last, and type confusion with
size_t, unsigned long and u64 all being used in different places along
the driver's invalidation path. Consistently use u64 in the internals.
Arnd Bergmann [Fri, 12 Jun 2026 07:01:25 +0000 (09:01 +0200)]
Merge tag 'bst-arm64-emmc-driver-dts-for-v7.2' of https://github.com/BlackSesame-SoC/linux into soc/dt
arm64: BST C1200 eMMC DTS for v7.2
Black Sesame Technologies:
Enable eMMC controller on BST C1200 CDCU1.0 board:
- Add mmc0 node in bstc1200.dtsi (DWCMSHC SDHCI controller)
- Add fixed clock definition and reserved SRAM bounce buffer
- Enable mmc0 with 8-bit bus on CDCU1.0 ADAS 4C2G board
The MMC driver was merged via mmc-next in v7.1-rc1;
this is the remaining DTS piece.
Signed-off-by: Gordon Ge <gordon.ge@bst.ai>
* tag 'bst-arm64-emmc-driver-dts-for-v7.2' of https://github.com/BlackSesame-SoC/linux:
arm64: dts: bst: enable eMMC controller in C1200
Dong Chenchen [Tue, 9 Jun 2026 09:21:17 +0000 (17:21 +0800)]
xfrm: Fix dev use-after-free in xfrm async resumption
xfrm async resumption holds the skb->dev refcnt until after transport_finish.
However, xfrm_rcv_cb may change skb->dev to the tunnel dev without taking a
device reference, as vti_rcv_cb does. The subsequent async resumption
then decrements the tunnel device's reference count, which leads to a
use-after-free of the tunnel dev and a refcnt leak of the original dev, as below:
unregister_netdevice: waiting for vti1 to become free. Usage count = -2
Stash the original skb->dev to fix refcnt imbalance. The new skb->dev set
by xfrm_rcv_cb can race with device teardown. Extend rcu protection over
xfrm_rcv_cb and transport_finish to prevent races.
Fixes: 1c428b038400 ("xfrm: hold dev ref until after transport_finish NF_HOOK")
Reported-by: Xu Chunxiao <xuchunxiao3@huawei.com>
Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Unlike the authentication (x->aalg) and encryption (x->ealg) branches of
the same function, the compression branch never initializes
calg->alg_key_len. IPComp carries no key and the allocation only
reserves sizeof(struct xfrm_algo) (i.e. no room for a key), so the field
is left containing uninitialized slab data.
calg->alg_key_len is later used as a length by xfrm_algo_clone() when an
IPComp state is cloned during XFRM_MSG_MIGRATE:
where xfrm_alg_len() returns sizeof(*alg) + (alg_key_len + 7) / 8. With
a non-zero garbage alg_key_len, kmemdup() reads past the end of the
68-byte calg object. Adding an IPComp SA via PF_KEY and then migrating
it triggers (net-next, KASAN, init_on_alloc=0):
The buggy address belongs to the object at ff11000025a74980
which belongs to the cache kmalloc-96 of size 96
The buggy address is located 0 bytes inside of
allocated 68-byte region [ff11000025a74980, ff11000025a749c4)
Depending on the uninitialized value the same field can instead request
an oversized kmemdup() allocation and make the migration clone fail.
The XFRM netlink path is not affected: verify_one_alg() rejects an
XFRMA_ALG_COMP attribute shorter than xfrm_alg_len(), so a calg added via
XFRM_MSG_NEWSA is always self-consistent.
Initialize calg->alg_key_len to 0, matching the aalg/ealg branches.
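The length arithmetic quoted above can be worked through in Python. The formula and the 68-byte object size come from the commit message; the function below is only a model of xfrm_alg_len(), and the sample garbage value is made up for illustration.

```python
SIZEOF_XFRM_ALGO = 68  # size of the calg object, as reported in the commit message


def xfrm_alg_len_model(alg_key_len_bits):
    """Model of xfrm_alg_len(): sizeof(*alg) + (alg_key_len + 7) / 8."""
    return SIZEOF_XFRM_ALGO + (alg_key_len_bits + 7) // 8


def overread_bytes(alg_key_len_bits):
    """Bytes a clone would read beyond the 68-byte allocation."""
    return xfrm_alg_len_model(alg_key_len_bits) - SIZEOF_XFRM_ALGO
```

With alg_key_len initialized to 0 the clone copies exactly 68 bytes; with a leftover value such as 4096 it would try to copy 512 bytes past the end of the object.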
Sanman Pradhan [Sun, 7 Jun 2026 16:47:34 +0000 (16:47 +0000)]
xfrm: use compat translator only for u64 alignment mismatch
The XFRM compat layer (CONFIG_XFRM_USER_COMPAT) translates 32-bit xfrm
netlink and setsockopt messages into the native 64-bit layout. It is
only needed on architectures where the 32-bit and 64-bit ABIs disagree
on u64 alignment, which the kernel encodes as COMPAT_FOR_U64_ALIGNMENT.
That symbol is defined only by arch/x86. XFRM_USER_COMPAT depends on it,
so the translator can never be built on any other architecture,
including arm64, which still provides a 32-bit compat ABI (CONFIG_COMPAT)
for AArch32 EL0 userspace. On arm64 the AArch32 EABI already aligns u64
to 8 bytes, identical to the AArch64 ABI, so no translation is required
and the native code path is correct for 32-bit tasks.
However, xfrm_user_rcv_msg() and xfrm_user_policy() gate on
in_compat_syscall() alone and then call xfrm_get_translator(), which
returns NULL when no translator is registered. On arm64 that is always
the case, so every xfrm netlink message and the XFRM_POLICY setsockopt
issued by a 32-bit task returns -EOPNOTSUPP. A 32-bit userspace process
on arm64 (and on any other arch with CONFIG_COMPAT but without
COMPAT_FOR_U64_ALIGNMENT) therefore cannot configure XFRM state or
policy through the XFRM_USER netlink API, and cannot use the XFRM_POLICY
setsockopt path, because both fail before reaching the native parser.
The translator series replaced the blanket compat rejection with a
translator lookup. That made the path usable on x86 when the translator
is available, but left architectures that cannot build the translator
permanently rejected even when their compat layout already matches the
native layout. Let those architectures use the native parser instead.
Gate the translator requirement on COMPAT_FOR_U64_ALIGNMENT instead of
on in_compat_syscall() alone. Gating on the ABI property rather than on
CONFIG_XFRM_USER_COMPAT is deliberate: on x86 with IA32_EMULATION=y but
XFRM_USER_COMPAT=n, a 32-bit task must still be rejected rather than
routed through the native parser, which would misread genuinely
4-byte-aligned x86-32 messages. COMPAT_FOR_U64_ALIGNMENT is the ABI
property that makes the XFRM translator mandatory.
Only the receive/input direction needs the guard. The send, dump and
notification paths already call the translator as "if (xtr) { ... }"
with no error on NULL, so on arches without a translator they no-op and
the kernel emits native 64-bit-layout messages, which is what an AArch32
task expects.
Tested on Juniper SRX hardware: with the fix, 32-bit IPsec userspace
netlink and XFRM_POLICY setsockopt operations that previously failed
with -EOPNOTSUPP now succeed; x86 behaviour is unchanged by inspection.
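The gating change can be summarised as a small decision function. This is a Python sketch of the logic described above, not the kernel code: the function names and string results are hypothetical, and the three booleans stand in for in_compat_syscall(), COMPAT_FOR_U64_ALIGNMENT and whether xfrm_get_translator() finds a translator.

```python
def old_receive_path(in_compat, compat_for_u64_alignment, translator_registered):
    """Before the fix: every compat task needs a translator."""
    if in_compat:
        return "translate" if translator_registered else "-EOPNOTSUPP"
    return "native"


def new_receive_path(in_compat, compat_for_u64_alignment, translator_registered):
    """After the fix: a translator is required only where the ABIs differ."""
    if in_compat and compat_for_u64_alignment:
        return "translate" if translator_registered else "-EOPNOTSUPP"
    return "native"
```

A 32-bit task on arm64 corresponds to (True, False, False): rejected by the old gate, routed to the native parser by the new one. On x86 without the translator, (True, True, False), both versions still reject the message.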
Albert Yang [Fri, 12 Jun 2026 00:40:23 +0000 (08:40 +0800)]
arm64: dts: bst: enable eMMC controller in C1200
Add mmc0 node for the DWCMSHC SDHCI controller with basic configuration
(disabled by default) and fixed clock definition in bstc1200.dtsi.
Enable mmc0 with board-specific configuration including 8-bit bus
width and reserved SRAM bounce buffer on the CDCU1.0 ADAS 4C2G board.
The bounce buffer in reserved SRAM addresses hardware constraints
where the eMMC controller cannot access main system memory through
SMMU due to a hardware bug, and all DRAM is located outside the
4GB boundary.
Signed-off-by: Albert Yang <yangzh0906@thundersoft.com>
Acked-by: Gordon Ge <gordon.ge@bst.ai>
Signed-off-by: Gordon Ge <gordon.ge@bst.ai>
Propagate the actual error code returned by rmi_read() in
rmi_f12_read_sensor_tuning() instead of hardcoding -ENODEV.
Also, since rmi_read() returns 0 on success, use 'if (ret)'
instead of 'if (ret < 0)'.
Dmitry Torokhov [Tue, 5 May 2026 04:59:47 +0000 (21:59 -0700)]
Input: rmi4 - use sizeof(*ptr) and idiomatic checks in f12 allocators
Using sizeof(*ptr) is preferred over sizeof(struct) because it is
more robust against type changes. Also switch to checking for
allocation failure immediately after each call, and update
formatting.
Dmitry Torokhov [Tue, 5 May 2026 04:59:46 +0000 (21:59 -0700)]
Input: rmi4 - use devm_kmalloc for F12 data packet buffer
The sensor->data_pkt buffer is used exclusively to store incoming
hardware data during the attention handler, where it is entirely
overwritten by either memcpy() or rmi_read_block(). Therefore,
there is no need to zero-initialize it during probe.
Switch to devm_kmalloc() to avoid the unnecessary memset overhead.
Dmitry Torokhov [Tue, 5 May 2026 04:59:45 +0000 (21:59 -0700)]
Input: rmi4 - use flexible array member for IRQ masks in F12
Use a flexible array member to allocate the IRQ masks at the end of
the f12_data structure, and use the struct_size() helper to
calculate the allocation size safely. This replaces manual pointer
arithmetic.
Dmitry Torokhov [Tue, 5 May 2026 04:59:44 +0000 (21:59 -0700)]
Input: rmi4 - use unaligned access helpers in F12
Use get_unaligned_le16() instead of manual bit shifts to construct
16-bit values for max_x, max_y, pitch_x, pitch_y, and object
coordinates in the F12 parsing logic. This simplifies the code and
makes the endianness explicit.
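The equivalence between the manual shifts and an explicit little-endian read can be shown in Python. This is an analogy rather than kernel code: struct.unpack_from() with the "<H" format plays the role of get_unaligned_le16(), and both function names are illustrative.

```python
import struct


def le16_manual(buf, offset):
    """Build a 16-bit value by hand: low byte first, high byte shifted up."""
    return buf[offset] | (buf[offset + 1] << 8)


def le16_helper(buf, offset):
    """Same read with the byte order stated explicitly in the format string."""
    return struct.unpack_from("<H", buf, offset)[0]
```

Both functions read the same value from any offset, aligned or not; the helper form simply states the byte order instead of leaving it implied by the shift direction.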
Dmitry Torokhov [Tue, 5 May 2026 04:59:43 +0000 (21:59 -0700)]
Input: rmi4 - change reg_size type to u32
Change reg_size from unsigned long to u32 to save space and ensure
consistent size across 32-bit and 64-bit architectures, and use
DECLARE_BITMAP() for subpacket_map.
Also pack the structure by rearranging the members to avoid holes,
and use size_add() to prevent potential integer overflows when
calculating the total size of registers.
Dmitry Torokhov [Tue, 5 May 2026 04:59:42 +0000 (21:59 -0700)]
Input: rmi4 - refactor F12 probe function
The F12 probe function contains highly repetitive logic for parsing
register descriptors and their individual data items. Refactor the
function to use loops to eliminate redundancy, and clarify the code.
Dmitry Torokhov [Tue, 5 May 2026 04:59:40 +0000 (21:59 -0700)]
Input: rmi4 - refactor function allocation and registration
Currently, rmi_create_function() allocates memory for the rmi_function
structure, but rmi_register_function() initializes the device via
device_initialize(). This split of ownership makes error handling in
rmi_create_function() confusing because the caller must be aware that
if rmi_register_function() fails, it has already called put_device() to
clean up the memory.
To make the memory lifecycle explicit and fix potential leaks cleanly,
introduce rmi_alloc_function() to handle memory allocation and device
initialization, and make the caller of rmi_register_function()
responsible for cleanup.
Dmitry Torokhov [Tue, 5 May 2026 04:59:39 +0000 (21:59 -0700)]
Input: rmi4 - use local presence map in rmi_read_register_desc()
The presence map is only used during the parsing of the register
descriptor, so we can make it a local variable instead of storing it
in struct rmi_register_descriptor.
Also fix the spelling of the constant and the variable name (presence
instead of presense).
Dmitry Torokhov [Thu, 11 Jun 2026 01:28:33 +0000 (18:28 -0700)]
Input: rmi4 - initialize attn_fifo properly
attn_fifo is allocated as part of struct rmi_driver_data using
devm_kzalloc in rmi_driver_probe. However, it is never initialized.
A zero-initialized kfifo has its mask set to 0, which effectively
limits its capacity to 1 element instead of the declared 16.
This can lead to lost attention data and memory leaks of the attention
data payload if multiple attention events are received before the
threaded interrupt handler can process them.
Initialize attn_fifo using INIT_KFIFO after allocating rmi_driver_data.
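Why a zeroed fifo holds only one element can be seen in a simplified Python model of a power-of-two ring buffer. This is not the kernel's kfifo: the class is hypothetical and assumes the usual scheme where free-running in/out counters are masked on access and the fifo counts as full once the used length exceeds the mask.

```python
class TinyFifo:
    """Simplified ring buffer indexed through a power-of-two mask."""

    def __init__(self, size=None):
        # size=None models a zero-filled, never-initialized fifo: mask stays 0.
        self.mask = 0 if size is None else size - 1
        self.slots = [None] * (self.mask + 1)
        self.head = 0  # counts elements ever written ("in")
        self.tail = 0  # counts elements ever read ("out")

    def put(self, item):
        """Store item; return False if the fifo is full and item was dropped."""
        if self.head - self.tail > self.mask:
            return False
        self.slots[self.head & self.mask] = item
        self.head += 1
        return True


def accepted(fifo, count):
    """Number of puts out of count that the fifo accepts without any reads."""
    return sum(1 for i in range(count) if fifo.put(i))
```

With a mask of 0 the second put already fails, so accepted(TinyFifo(), 16) is 1, whereas accepted(TinyFifo(16), 16) is 16. Every rejected put corresponds to an attention payload that is lost.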
Dmitry Torokhov [Tue, 5 May 2026 04:59:34 +0000 (21:59 -0700)]
Input: rmi4 - fix num_subpackets overflow in register descriptor
RMI_REG_DESC_SUBPACKET_BITS is defined as 296 (37 * BITS_PER_BYTE). This
may overflow num_subpackets in struct rmi_register_desc_item which is
defined as a u8.
Fix this by changing the type of num_subpackets to u16.
Dmitry Torokhov [Tue, 5 May 2026 04:59:33 +0000 (21:59 -0700)]
Input: rmi4 - fix type overflow in register counts
The number of registers in the RMI4 register descriptor is populated
by counting the bits in the presence map using bitmap_weight(). Since
the presence map can contain up to 256 bits (RMI_REG_DESC_PRESENSE_BITS),
storing this count in a u8 can overflow to 0 if all 256 bits are set.
Change the num_registers field in struct rmi_register_descriptor
from u8 to u16 to prevent potential integer overflow and ensure safe
processing of devices reporting large descriptors.
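The truncation is easy to check with a short Python model, where masking stands in for storing a count in a fixed-width unsigned field. The helper names are illustrative; the values 256 and 296 are the bit counts quoted in this commit and the preceding one.

```python
def store_u8(value):
    """Value as it would read back from an 8-bit unsigned field."""
    return value & 0xFF


def store_u16(value):
    """Value as it would read back from a 16-bit unsigned field."""
    return value & 0xFFFF
```

A full 256-bit presence map stored in a u8 reads back as 0, and 296 subpacket bits read back as 40; both values survive intact in a u16.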
When reading the register descriptor, the base address is incremented by
1 to read the presence register block. However, after reading the
presence register block, the address is incorrectly incremented by only
1 byte (++addr) instead of the actual size of the presence block
(size_presence_reg). This causes the subsequent structure block read to
read from the wrong memory location if the presence block is larger than
1 byte.
Fix this by advancing the address by size_presence_reg.
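The addressing error can be reproduced with a toy register map in Python. This is a sketch under assumptions: the flat byte layout (one size byte, then the presence block, then the structure block) and both function names are hypothetical simplifications of the descriptor read described above.

```python
def read_block(regs, addr, length):
    """Return length bytes starting at addr from the fake register space."""
    return regs[addr:addr + length]


def structure_block_addr(regs, addr, fixed=True):
    """Walk the descriptor header and return where the structure block starts."""
    size_presence_reg = regs[addr]
    addr += 1                                   # step over the size byte
    presence = read_block(regs, addr, size_presence_reg)
    assert len(presence) == size_presence_reg   # the whole block was read
    # Fixed: skip the whole presence block. Buggy: skip only one byte (++addr).
    addr += size_presence_reg if fixed else 1
    return addr
```

With a 3-byte presence block starting at address 1, the fixed walk lands on address 4, while the buggy one lands on address 2, in the middle of the presence block. The two agree only when the block is exactly one byte long.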
Dave Airlie [Fri, 12 Jun 2026 03:57:16 +0000 (13:57 +1000)]
Merge tag 'drm-xe-fixes-2026-06-11' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
UAPI Changes:
Cross-subsystem Changes:
Core Changes:
Driver Changes:
- fix oops in suspend/shutdown without display (Jani)
- RAS fixes (Raag)
- Use HW_ERR prefix in log (Raag)
- include all registered queues in TLB invalidation (Tangudu)
- Fix refcount leak in xe_range_tree in error paths (Wentao)
- fix job timeout recovery for unstarted jobs and kernel queues (Rodrigo)
Wentao Liang [Thu, 4 Jun 2026 10:27:06 +0000 (10:27 +0000)]
crypto: tegra - fix refcount leak in tegra_se_host1x_submit()
The timeout error path in tegra_se_host1x_submit() returns without
calling host1x_job_put(), while all other paths (success, submit
error, pin error) properly release the job reference through the
job_put label. Since host1x_job_alloc() initializes the reference
count and host1x_job_put() is required to drop it, omitting it on
timeout causes a permanent refcount leak.
Fix this by redirecting the timeout return to the existing job_put
label, ensuring the job reference and any associated syncpt
references are consistently released.
Ilya Dryomov [Wed, 3 Jun 2026 15:50:04 +0000 (17:50 +0200)]
crypto: testmgr - allow authenc(hmac(sha{256,384}),cts(cbc(aes))) in FIPS mode
hmac(sha256), hmac(sha384) and cts(cbc(aes)) algorithms have been
marked as FIPS allowed for years. Mark the respective authenc()
constructions per RFC 8009 ("AES Encryption with HMAC-SHA2 for
Kerberos 5") as such as well.
SP 800-57 Part 3 Rev. 1 from Jan 2015 [1] links the draft of what
became RFC 8009 in Oct 2016 as approved in section 6.3 Procurement
Guidance (item/recommendation 3).
Wentao Liang [Wed, 3 Jun 2026 11:03:27 +0000 (11:03 +0000)]
hwrng: jh7110 - fix refcount leak in starfive_trng_read()
The starfive_trng_read() function acquires a runtime PM reference
via pm_runtime_get_sync() but fails to release it on two error
paths. If starfive_trng_wait_idle() or starfive_trng_cmd() returns
an error, the function exits without calling
pm_runtime_put_sync_autosuspend(), leaving the runtime PM usage
counter permanently elevated and preventing the device from entering
runtime suspend.
Refactor the function to use a unified error path that calls
pm_runtime_put_sync_autosuspend() before returning.
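The unified error path can be modelled in Python, with try/finally playing the role of a single exit label. Everything here is hypothetical: FakeDevice, trng_read_model() and its flags only imitate a usage counter that is taken once and must be dropped on every return path.

```python
import errno


class FakeDevice:
    """Stand-in for a device with a runtime PM usage counter."""

    def __init__(self):
        self.usage_count = 0


def trng_read_model(dev, wait_idle_ok=True, cmd_ok=True):
    """Take a PM reference, run two steps that may fail, always drop it."""
    dev.usage_count += 1            # models pm_runtime_get_sync()
    try:
        if not wait_idle_ok:
            return -errno.ETIMEDOUT  # illustrative error code
        if not cmd_ok:
            return -errno.EIO        # illustrative error code
        return 0
    finally:
        dev.usage_count -= 1        # models pm_runtime_put_sync_autosuspend()
```

Whichever step fails, the counter returns to zero, so the modelled device can still reach runtime suspend.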
Thorsten Blum [Tue, 2 Jun 2026 22:25:19 +0000 (00:25 +0200)]
crypto: atmel-ecc - drop dead code in atmel_ecdh_max_size
atmel_ecdh_init_tfm() always allocates ctx->fallback, so it is never
NULL in atmel_ecdh_max_size(). Remove the dead code and return
crypto_kpp_maxsize() directly.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Felix Gu [Tue, 2 Jun 2026 14:55:35 +0000 (22:55 +0800)]
crypto: cavium/cpt - fix DMA cleanup using wrong loop index
The sg_cleanup error path used list[i] instead of list[j] when unmapping
DMA buffers, leaking successfully mapped entries and repeatedly unmapping
the failed one.
Fixes: c694b233295b ("crypto: cavium - Add the Virtual Function driver for CPT") Signed-off-by: Felix Gu <ustc.gu@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
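The index mix-up is easier to see in a reduced form. The sketch below is a user-space model under assumptions: struct buf, map_one(), unmap_one() and map_all() are hypothetical names, and a boolean flag stands in for a real DMA mapping.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one DMA buffer descriptor. */
struct buf {
	bool mapped;
};

/* Models a mapping attempt; fails for the entry at index fail_at. */
static bool map_one(struct buf *b, size_t idx, size_t fail_at)
{
	if (idx == fail_at)
		return false;
	b->mapped = true;
	return true;
}

static void unmap_one(struct buf *b)
{
	b->mapped = false;
}

/* Returns 0 on success, -1 after rolling back a partial mapping. */
static int map_all(struct buf *list, size_t count, size_t fail_at)
{
	size_t i, j;

	for (i = 0; i < count; i++) {
		if (!map_one(&list[i], i, fail_at))
			goto sg_cleanup;
	}
	return 0;

sg_cleanup:
	/* list[i] is the entry that failed; undo entries 0..i-1 via j. */
	for (j = 0; j < i; j++)
		unmap_one(&list[j]);
	return -1;
}

/* Helper for checking the result: counts entries still mapped. */
static size_t count_mapped(const struct buf *list, size_t count)
{
	size_t n = 0, k;

	for (k = 0; k < count; k++)
		n += list[k].mapped ? 1 : 0;
	return n;
}
```

Writing list[i] inside the cleanup loop would touch only the failed entry on every iteration, which is the behaviour the commit message describes.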
Felix Gu [Tue, 2 Jun 2026 14:38:26 +0000 (22:38 +0800)]
crypto: marvell/octeontx - fix DMA cleanup using wrong loop index
The sg_cleanup path used list[i] instead of list[j] when unmapping DMA
buffers, leaking successfully mapped entries and repeatedly unmapping
the failed one.
Fixes: 10b4f09491bf ("crypto: marvell - add the Virtual Function driver for CPT") Signed-off-by: Felix Gu <ustc.gu@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
MAINTAINERS: make myself the maintainer of the Qualcomm QCE driver
Qualcomm wants to keep supporting and extending the crypto engine driver.
Thara has not been active for many months, so change the maintainer to
myself and upgrade the driver to Supported.
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com> Acked-by: Krzysztof Kozlowski <krzk@kernel.org> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Rosen Penev [Tue, 2 Jun 2026 01:46:45 +0000 (18:46 -0700)]
crypto: amcc - convert irq_of_parse_and_map to platform_get_irq
Replace the deprecated irq_of_parse_and_map() call with the modern
platform_get_irq() in the probe function. This also improves error
handling: platform_get_irq() returns a negative errno on failure,
whereas irq_of_parse_and_map() returned 0.
Change the irq field in struct crypto4xx_core_device from u32 to int
to match the return type of platform_get_irq().
Assisted-by: opencode:big-pickle Signed-off-by: Rosen Penev <rosenp@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
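Why the field type has to change along with the call can be shown with a small user-space sketch. Everything here is hypothetical: fake_get_irq() only imitates a lookup that returns a negative errno on failure, the IRQ number 42 and the -ENXIO value are illustrative, and the two structs model the field before and after the change.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lookup: an IRQ number on success, negative errno on failure. */
static int fake_get_irq(bool have_irq)
{
	return have_irq ? 42 : -ENXIO;
}

struct dev_u32 { uint32_t irq; };	/* field type before the change */
struct dev_int { int irq; };		/* field type after the change */

/* Signed field: the negative errno is visible and can be propagated. */
static int probe_int(struct dev_int *dev, bool have_irq)
{
	dev->irq = fake_get_irq(have_irq);
	if (dev->irq < 0)
		return dev->irq;
	return 0;
}

/* Unsigned field with a "0 means failure" check: the errno is missed. */
static int probe_u32(struct dev_u32 *dev, bool have_irq)
{
	dev->irq = (uint32_t)fake_get_irq(have_irq);
	if (!dev->irq)
		return -ENODEV;
	return 0;	/* a negative errno is now a huge, bogus IRQ number */
}
```

In the unsigned variant a failed lookup is reported as success, so the signed field is what makes the new error check meaningful.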
Eric Biggers [Mon, 1 Jun 2026 16:07:57 +0000 (16:07 +0000)]
crypto: sun4i-ss - Remove insecure and unused rng_alg
Remove sun4i_ss_rng, as it is insecure and unused:
- It has multiple vulnerabilities. sun4i_ss_prng_seed() is missing
locking and has a buffer overflow. sun4i_ss_prng_generate() fails to
fill the entire buffer with cryptographic random bytes, because it
rounds the destination length down and also doesn't actually wait for
the hardware to be ready before pulling bytes from it.
- No user of this code is known. It's usable only theoretically via the
"rng" algorithm type of AF_ALG. But userspace actually just uses the
actual Linux RNG (/dev/random etc) instead. And rng_algs don't
contribute entropy to the actual Linux RNG either. (This may have
been confused with hwrng, which does contribute entropy.)
The sun4i_ss_prng_seed() buffer overflow was reported by Tianchu Chen
and discovered by Atuin - Automated Vulnerability Discovery Engine.
There's no point in fixing all these vulnerabilities individually when
this is unused code, so let's just remove it.
Fixes: b8ae5c7387ad ("crypto: sun4i-ss - support the Security System PRNG") Cc: stable@vger.kernel.org Reported-by: Tianchu Chen <flynnnchen@tencent.com> Closes: https://lore.kernel.org/r/af749a8447bd7f0e9dd26ca6c87e9c6afecb09d9@linux.dev/ Acked-by: Corentin LABBE <clabbe.montjoie@gmail.com> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Dave Jiang [Thu, 11 Jun 2026 23:03:55 +0000 (16:03 -0700)]
cxl/test: Unregister cxl_acpi in cxl_test_init() error path
In cxl_test_init(), once cxl_mock_platform_device_add() succeeds, all
subsequent error paths need to call platform_device_unregister() instead
of platform_device_put() to clean up.
Fixes: 67dcdd4d3b83 ("tools/testing/cxl: Introduce a mocked-up CXL port hierarchy") Reported-by: sashiko-bot Reviewed-by: Alison Schofield <alison.schofield@intel.com> Link: https://patch.msgid.link/20260611230355.198912-1-dave.jiang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Chun-Tse Shao [Thu, 11 Jun 2026 21:56:32 +0000 (14:56 -0700)]
perf stat: Fix false NMI watchdog warning in aggregation modes
In aggregation modes (e.g. --per-socket, --per-die, etc.), a counter
might not be scheduled or counted on specific aggregate groups if it was
not assigned to the CPUs belonging to those groups. However, the
printout() check triggers the "print_free_counters_hint" logic
unconditionally for any supported counter with a missing count. This
results in a false "Some events weren't counted. Try disabling the NMI
watchdog" warning.
Furthermore, the NMI watchdog only reserves performance counters on core
PMUs. Uncore PMU events (e.g. CHA, IMC) are not affected by the NMI
watchdog, but their failures also falsely triggered this warning.
This warning was originally introduced in commit 02d492e5dcb72c00 ("perf
stat: Issue a HW watchdog disable hint").
To fix this, restrict setting of print_free_counters_hint to only
trigger for core PMU events by checking counter->pmu and
counter->pmu->is_core.
Example before/after:
$ perf stat -M lpm_miss_lat --metric-only --per-socket -a -- sleep 1
James Clark [Thu, 11 Jun 2026 11:13:46 +0000 (12:13 +0100)]
perf test: Compile named_threads workload with -O0
The work loop relies on the compiler not optimizing it away. Although
named_threads_work is deliberately not static for that reason, the
compiler could still optimize it away.
Fix it by compiling without optimization. Also add -fno-inline for
consistency and in case anyone wants to look at callstacks.
Fixes: b5dd510be55e8670 ("perf test: Add named_threads workload") Closes: https://lore.kernel.org/all/20260609160001.2739E1F00893@smtp.kernel.org Reported-by: sashiko-bot <sashiko-bot@kernel.org> Reviewed-by: Leo Yan <leo.yan@arm.com> Signed-off-by: James Clark <james.clark@linaro.org> Cc: Ian Rogers <irogers@google.com> Cc: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jakub Kicinski [Fri, 12 Jun 2026 00:08:03 +0000 (17:08 -0700)]
Merge tag 'for-net-next-2026-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next:
core:
- hci_sync: Add support for HCI_LE_Set_Host_Feature [v2]
- SMP: Use AES-CMAC library API
- sockets: convert to getsockopt_iter
- Add SPDX id lines to some source files
drivers:
- btintel_pcie: Support Product level reset
- btintel_pcie: Add support for smart trigger dump
- btintel_pcie: Add 50 ms delay before MAC init on BlazarIW
- btintel_pcie: Separate coredump work from RX work
- btmtk: add event filter to filter specific event
- btrtl: fix RTL8761B/BU broken LE extended scan
- btusb: Add Realtek RTL8922AE VID/PID 0bda/d922
- btusb: Add Realtek RTL8922AE VID/PID 0bda/d923
- btusb: MT7922: Add VID/PID 0e8d/223c
- btusb: MT7925: Add VID/PID 0e8d/8c38
- btusb: Add support for TP-Link TL-UB250
- btusb: Add Mercusys MA530 for Realtek RTL8761BUV
- btusb: Add TP-Link UB600 for Realtek 8761BUV
- btusb: Add support for Intel Lizard Peak 2 (0x8087:0x0040)
- btusb: Add USB ID 2c4e:0128 for Mercusys MA60XNB
- btusb: MT7925: Add VID/PID 13d3/3609
* tag 'for-net-next-2026-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (49 commits)
Bluetooth: btintel_pcie: Separate coredump work from RX work
Bluetooth: btmtksdio: fix infinite loop in btmtksdio_txrx_work()
Bluetooth: qca: Add BT FW build version to kernel log
Bluetooth: vhci: validate devcoredump state before side effects
Bluetooth: L2CAP: validate connectionless PSM length
Bluetooth: hci: validate codec capability element length
Bluetooth: L2CAP: Fix UAF in channel timeout by holding conn ref
Bluetooth: btintel_pcie: Load IOSF debug regs by controller variant
Bluetooth: btintel_pcie: Add 50 ms delay before MAC init on BlazarIW
Bluetooth: Add SPDX id lines to some source files
Bluetooth: btintel_pcie: Add support for smart trigger dump
Bluetooth: hci_h5: reset hci_uart::priv in the close() method
Bluetooth: btusb: clean up probe error handling
Bluetooth: btusb: fix wakeup irq devres lifetime
Bluetooth: btusb: fix wakeup source leak on probe failure
Bluetooth: btusb: fix use-after-free on marvell probe failure
Bluetooth: btusb: fix use-after-free on registration failure
Bluetooth: btmtk: fix URB leak in alloc_mtk_intr_urb error path
Bluetooth: hci_core: Fix UAF in hci_unregister_dev()
Bluetooth: hci_event: fix simultaneous discovery stuck in FINDING
...
====================
Jakub Kicinski [Fri, 12 Jun 2026 00:06:55 +0000 (17:06 -0700)]
Merge tag 'nfc-net-next-20260611' of https://codeberg.org/linux-nfc/linux
David Heidelberg says:
====================
NFC updates for net-next 20260611
- nxp-nci: Add ISO15693 support
- nxp-nci: treat -ENXIO in IRQ thread as no data available
- nci: uart: Constify struct tty_ldisc_ops
- trf7970a: fix comment typos
- Use named initializers for struct i2c_device_id
- MAINTAINERS: Update address for David Heidelberg
* tag 'nfc-net-next-20260611' of https://codeberg.org/linux-nfc/linux:
MAINTAINERS: Update address for David Heidelberg
nfc: Use named initializers for struct i2c_device_id
nfc: nxp-nci: treat -ENXIO in IRQ thread as no data available
nfc: nxp-nci: Add ISO15693 support
nfc: nci: uart: Constify struct tty_ldisc_ops
nfc: trf7970a: fix comment typos
====================
Chunkai Deng [Tue, 26 May 2026 03:38:03 +0000 (11:38 +0800)]
dt-bindings: mailbox: qcom: Add IPCC support for Maili Platform
Document the Inter-Processor Communication Controller on the Qualcomm
Maili Platform, which will be used to route interrupts across various
subsystems found on the SoC.
====================
tipc: fix netlink gate and receive-path bugs
This is v4 of the public TIPC series. The only change from v3 is in
patch 1: TIPC_NL_MEDIA_SET now uses GENL_UNS_ADMIN_PERM like the other
mutators, instead of GENL_ADMIN_PERM, so the whole series uses the
namespace-aware CAP_NET_ADMIN check that matches the legacy TIPC netlink
path. Patches 2 and 3 are unchanged.
Patch 1 gives the TIPCv2 mutating generic-netlink operations the admin
gate the legacy API already has, so a local unprivileged process can no
longer change TIPC state. Patch 2 drops CONN_ACK messages that
acknowledge more outstanding sends than exist, preventing the
snt_unacked underflow. Patch 3 rejects peer bindings with lower > upper,
which would otherwise leak binding-table memory.
====================
tipc: reject inverted service ranges from peer bindings
tipc_update_nametbl() inserts a binding advertised by a peer node using
the lower and upper service-range bounds taken directly from the wire,
without checking that lower <= upper. The local bind path validates the
ordering (tipc_uaddr_valid()), but the name-distribution path does not.
A binding with lower > upper is inserted at the far end of the
service-range rbtree (keyed on lower) where no lookup or withdrawal can
ever match it (service_range_foreach_match() requires sr->lower <= end).
The publication, its service_range node and the augmented rbtree entry
are then leaked for the lifetime of the namespace, and there is no
per-peer cap equivalent to TIPC_MAX_PUBL on locally created bindings.
Reject inverted ranges in the network path as well. A peer node can
otherwise leak unbounded binding-table memory by sending PUBLICATION
items with lower > upper.
Fixes: 37922ea4a310 ("tipc: permit overlapping service ranges in name table") Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech> Link: https://patch.msgid.link/20260610124003.3831170-4-michael.bommarito@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
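The check added on the network path amounts to validating the bounds before anything is inserted. A minimal sketch, assuming a simplified table: struct peer_binding, binding_range_valid() and insert_peer_binding() are hypothetical names, and a counter stands in for the binding table.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical binding as advertised by a peer, bounds taken from the wire. */
struct peer_binding {
	uint32_t type;
	uint32_t lower;
	uint32_t upper;
};

/* A service range is well-formed only if lower <= upper. */
static bool binding_range_valid(const struct peer_binding *b)
{
	return b->lower <= b->upper;
}

/*
 * Models the insert path: inverted ranges are dropped before any
 * table state is allocated, so nothing can be leaked for them.
 * Returns true if the binding was inserted.
 */
static bool insert_peer_binding(const struct peer_binding *b,
				unsigned int *table_count)
{
	if (!binding_range_valid(b))
		return false;

	(*table_count)++;
	return true;
}
```

A range with lower equal to upper is still accepted, since it describes a single valid service instance.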
tipc_sk_conn_proto_rcv() subtracts the peer-supplied connection ack count
from the unsigned 16-bit send counter snt_unacked without checking that it
does not exceed the number of messages actually outstanding:
tsk->snt_unacked -= msg_conn_ack(hdr);
msg_conn_ack() is read straight from a received CONN_MANAGER/CONN_ACK
message. If the ack count is larger than snt_unacked, the subtraction
wraps to a near-maximum value, leaving tsk_conn_cong() permanently true
and starving the connection of further transmits.
Validate the ACK count at the start of the CONN_ACK block and drop the
message if it acknowledges more messages than are outstanding. A peer (or,
for a local connection, the connected peer socket) can otherwise wedge a
TIPC connection's send side by sending an oversized connection ack.
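The wrap-around and the guard against it can be reproduced with plain 16-bit arithmetic. This is a user-space sketch: struct conn, ack_unchecked() and ack_checked() are hypothetical names, and only the snt_unacked field name is taken from the message above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical connection state with a 16-bit counter of unacked sends. */
struct conn {
	uint16_t snt_unacked;
};

/* Unchecked subtraction: wraps around when ack > snt_unacked. */
static void ack_unchecked(struct conn *c, uint16_t ack)
{
	c->snt_unacked -= ack;
}

/*
 * Checked variant: an ack that covers more messages than are
 * outstanding is rejected and the counter is left untouched.
 * Returns true if the ack was applied.
 */
static bool ack_checked(struct conn *c, uint16_t ack)
{
	if (ack > c->snt_unacked)
		return false;

	c->snt_unacked -= ack;
	return true;
}
```

With 3 messages outstanding, an unchecked ack of 5 leaves the counter at 65534, while the checked variant keeps it at 3 and reports the message as invalid.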
tipc: require net admin for TIPCv2 netlink mutators
TIPCv2 registers mutating generic-netlink operations without admin
permission flags. Generic netlink only checks CAP_NET_ADMIN when an
operation sets GENL_ADMIN_PERM or GENL_UNS_ADMIN_PERM, so a local
unprivileged process can currently change TIPC state through commands
such as TIPC_NL_NET_SET, TIPC_NL_KEY_SET, TIPC_NL_KEY_FLUSH, and
bearer enable/disable.
The legacy TIPC netlink API already checks netlink_net_capable(...,
CAP_NET_ADMIN) for administrative commands. Give the TIPCv2 mutators
the equivalent generic-netlink gate. Use GENL_UNS_ADMIN_PERM, which
maps to the same namespace-aware CAP_NET_ADMIN check that
netlink_net_capable() performs, so the behaviour matches the legacy
path and keeps working for CAP_NET_ADMIN holders in a non-initial user
namespace (containers).
A QEMU/KASAN repro run as uid/gid 65534 with zero effective
capabilities previously succeeded in changing the network id and node
identity, setting and flushing key material, and enabling/disabling a
UDP bearer. With this patch applied the same operations fail with
-EPERM.
Lorenzo Bianconi [Wed, 10 Jun 2026 13:25:13 +0000 (15:25 +0200)]
net: airoha: simplify WAN device check in airoha_dev_init()
airoha_register_gdm_devices() iterates eth->ports[] in order, so GDM2's
netdev is always registered before GDM3/GDM4. This means the explicit
check for eth->ports[1] && eth->ports[1]->devs[0] is a redundant
special-case of what airoha_get_wan_gdm_dev() already covers, since
GDM2 is always marked as WAN during its own ndo_init.
Remove the redundant check and rely solely on airoha_get_wan_gdm_dev()
which handles both the GDM2-present and GDM2-absent cases.
Victor Nogueira [Wed, 10 Jun 2026 13:28:24 +0000 (10:28 -0300)]
net/sched: sch_hfsc: Don't make class passive twice
update_vf() is called from two places for the same class during a single
dequeue when the class's child qdisc (e.g. codel/fq_codel) drops its last
packets while dequeuing:
1. The child calls qdisc_tree_reduce_backlog(), which, now that the child
is empty, invokes hfsc_qlen_notify() -> update_vf(cl, 0, 0) and turns
the class passive (cl_nactive is decremented up the hierarchy).
2. hfsc_dequeue() then calls update_vf(cl, qdisc_pkt_len(skb), cur_time)
to charge the dequeued bytes.
On the second call the class is already passive, but its child qdisc is
still empty, so update_vf() arms go_passive again:
The leaf is then skipped by the cl_nactive == 0 check inside the loop,
which does not clear go_passive, so the stale go_passive propagates to the
parent and decrements its cl_nactive a second time. A parent that still
has other active children is driven to cl_nactive == 0 and removed from
the vttree, even though those siblings are still backlogged. They are
never dequeued again and the qdisc stalls.
Fix this by only arming go_passive when the class is actually active, so an
already-passive class no longer triggers a second passive transition. The
byte accounting (cl->cl_total += len) still runs for every ancestor, so
dequeued bytes continue to be counted exactly once.
Fixes: 51eb3b65544c ("sch_hfsc: make hfsc_qlen_notify() idempotent") Reported-by: Anirudh Gupta <anirudhrudr@gmail.com> Closes: https://lore.kernel.org/netdev/CAN2cbVe79oj0O9==m4+4x3v+O+qzRagA=2=wkrp9i9=CqYvyZA@mail.gmail.com/ Tested-by: Anirudh Gupta <anirudhrudr@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260610132824.3027549-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
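The double transition can be modelled with a stripped-down class hierarchy. This is a sketch under assumptions, not the sch_hfsc code: struct cls and its fields are hypothetical, virtual-time and vttree handling are left out, and the exact condition used by the patch is not shown in the message, so the nactive test below is an assumption.

```c
#include <stddef.h>

/* Hypothetical, reduced class: only the fields this sketch needs. */
struct cls {
	struct cls *parent;	/* NULL for the root class */
	int nactive;		/* models cl_nactive */
	int qlen;		/* backlog of the child qdisc */
	unsigned long total;	/* models cl_total */
};

static void update_vf(struct cls *cl, unsigned int len)
{
	int go_passive = 0;

	/* Arm go_passive only if the class is still active. */
	if (cl->qlen == 0 && cl->nactive > 0)
		go_passive = 1;

	for (; cl->parent != NULL; cl = cl->parent) {
		cl->total += len;	/* byte accounting at every level */

		if (cl->nactive == 0)
			continue;	/* already passive, no transition */

		if (go_passive && --cl->nactive == 0)
			go_passive = 1;	/* last active child left, propagate */
		else
			go_passive = 0;
	}
}
```

Calling update_vf() twice on an empty leaf whose parent has two active children now decrements the parent once only; without the nactive test the second call would carry a stale go_passive past the skipped leaf and drive the parent to zero.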
Daniel Borkmann [Tue, 9 Jun 2026 21:22:40 +0000 (23:22 +0200)]
net: Stop leased rxq before uninstalling its memory provider
netif_rxq_cleanup_unlease() tears down the memory provider that was
installed on a physical RX queue through a netkit queue lease. It
currently revokes the provider's DMA mappings before stopping the
physical queue:
This inverts the ordering used by the regular teardown paths (normal
device unregister and the io_uring zcrx close path), which stop the
queue before revoking the provider's mappings.
With the physical queue still live, its NAPI can keep consuming
net_iov entries from the page_pool alloc cache after the
__netif_mp_uninstall_rxq() has already cleared their dma_addr,
opening a window for the device to DMA to a stale or zero address.
Fix it by swapping the two calls so the queue is stopped (and its
NAPI quiesced) before the provider is uninstalled. No functional
regression was observed across repeated runs of the nk_qlease.py
HW selftest, which exercises the lease teardown path; this was
tested against fbnic QEMU emulation.
Fixes: 5602ad61ebee ("net: Proxy netif_mp_{open,close}_rxq for leased queues") Reported-by: Ahmed Abdelmoemen <ahmedabdelmoumen05@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: David Wei <dw@davidwei.uk> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260609212240.677889-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wentao Liang [Tue, 9 Jun 2026 08:47:30 +0000 (08:47 +0000)]
mlxsw: fix refcount leak in mlxsw_sp_vrs_lpm_tree_replace()
When mlxsw_sp_vrs_lpm_tree_replace() fails after replacing some VRs,
the error rollback loop does not correctly revert the preceding
replacements. The loop decrements the index but fails to update the
vr pointer, which still points to the VR that caused the failure. As
a result, the condition and the rollback call always operate on the
same VR, potentially calling mlxsw_sp_vr_lpm_tree_replace() multiple
times on it while never rolling back the earlier VRs. Those VRs
continue to hold a reference to new_tree acquired via
mlxsw_sp_lpm_tree_hold(), leaking the reference count of new_tree.
Fix by reinitializing vr inside the error loop with the updated index:
vr = &mlxsw_sp->router->vrs[i];
so that the loop correctly iterates over all VRs that were actually
replaced.
Cc: stable@vger.kernel.org Fixes: fc922bb0dd94 ("mlxsw: spectrum_router: Use one LPM tree for all virtual routers") Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260609084730.215732-1-vulab@iscas.ac.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
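The stale-pointer problem in the rollback loop can be shown with a reduced model. The names are hypothetical: struct vr carries only a tree id, replace_all() stands in for the replace function, and fail_at simulates the VR whose replacement fails.

```c
/* Hypothetical virtual router: only the id of the tree it uses. */
struct vr {
	int tree_id;
};

/*
 * Moves every VR using old_id to new_id. If the replacement at index
 * fail_at fails, the VRs already moved are rolled back to old_id.
 * Returns 0 on success, -1 on failure.
 */
static int replace_all(struct vr *vrs, int count, int old_id, int new_id,
		       int fail_at)
{
	struct vr *vr;
	int i;

	for (i = 0; i < count; i++) {
		vr = &vrs[i];
		if (vr->tree_id != old_id)
			continue;
		if (i == fail_at)
			goto err_tree_replace;
		vr->tree_id = new_id;
	}
	return 0;

err_tree_replace:
	for (i--; i >= 0; i--) {
		vr = &vrs[i];	/* re-derive vr for each index */
		if (vr->tree_id != new_id)
			continue;
		vr->tree_id = old_id;
	}
	return -1;
}
```

Without the assignment inside the second loop, vr keeps pointing at the VR that failed, so the earlier entries are never visited and stay on the new tree.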
Wentao Liang [Tue, 9 Jun 2026 08:37:09 +0000 (08:37 +0000)]
mlxsw: fix refcount leak in mlxsw_sp_port_lag_join()
When mlxsw_sp_port_lag_index_get() fails, mlxsw_sp_port_lag_join()
returns an error without releasing the lag reference obtained by
the earlier mlxsw_sp_lag_get(). All other error paths in the
function jump to the cleanup label that ends with
mlxsw_sp_lag_put(), so this is a single missed release.
Fix the leak by replacing the bare 'return err' with a goto to the
existing error cleanup label, which will drop the reference safely.
Cc: stable@vger.kernel.org Fixes: 0d65fc13042f ("mlxsw: spectrum: Implement LAG port join/leave") Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260609083709.209743-1-vulab@iscas.ac.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
ksz87xx: add support for low-loss cable equalizer errata
This patch implements the KSZ87xx short cable erratum
described in Microchip document DS80000687C for KSZ87xx switches
and the following support article:
Microchip documents two independent mechanisms to mitigate this issue:
adjusting the receiver low‑pass filter bandwidth and reducing the DSP
equalizer initial value. These registers are located in the switch’s
internal LinkMD table and cannot be accessed directly through a
stand‑alone PHY driver.
To keep the PHY‑facing API clean, this series models the erratum handling
as vendor‑specific Clause 22 PHY registers, virtualized by the KSZ8 DSA
driver. Accesses are intercepted by ksz8_r_phy() / ksz8_w_phy() and
translated into the appropriate indirect LinkMD register writes. The
erratum affects the shared PHY analog front‑end and therefore applies
globally to the switch.
Based on review feedback, the user‑visible interface is kept deliberately
simple and predictable:
- A boolean “short‑cable” PHY tunable applies a documented and
conservative preset (LPF bandwidth 62MHz, DSP EQ initial value 0).
This is the recommended KISS interface for the common short‑cable
scenario.
- Two additional integer PHY tunables allow advanced or experimental
tuning of the LPF bandwidth and the DSP EQ initial value. These
controls are orthogonal, have no ordering requirements, and simply
override the corresponding setting when written.
The tunables act as simple setters with no implicit state machine or
invalid combinations, avoiding surprises for userspace and not relying
on extended error reporting or netlink ethtool support.
This series contains:
1. Support for the KSZ87xx low‑loss cable erratum in the KSZ8 DSA driver,
including the short‑cable preset and orthogonal tuning controls.
2. Addition of vendor‑specific PHY tunable identifiers for the
short‑cable preset, LPF bandwidth, and DSP EQ initial value.
3. Exposure of these tunables through the Micrel PHY driver via
get_tunable / set_tunable callbacks.
This version follows the design agreed upon during v3 review and
reworks the interface accordingly.
====================
Add support for the KSZ87xx low-loss cable PHY tunables in the Micrel
PHY driver by implementing get_tunable and set_tunable callbacks.
These callbacks expose vendor-specific PHY tunables used to control the
KSZ87xx embedded PHY receiver behavior when operating with short or
low-loss Ethernet cables. The tunables provide:
- a boolean short-cable preset applying known good settings;
- an integer LPF bandwidth control;
- an integer DSP EQ initial value control.
The Micrel PHY driver forwards these tunables via standard phy_read() /
phy_write() operations, which are virtualized by the KSZ8 DSA driver and
translated into the appropriate indirect switch register accesses.
This patch implements the KSZ87xx short cable erratum
described in Microchip document DS80000687C for KSZ87xx switches
and the following support article:
KSZ87xx devices require a workaround for the Module 3 low-loss cable
condition, controlled through the switch TABLE_LINK_MD_V indirect
registers.
This change models the erratum handling as vendor-specific Clause 22 PHY
registers, virtualized by the KSZ8 DSA driver and accessed via
ksz8_r_phy() / ksz8_w_phy(). The following controls are provided:
- A boolean “short-cable” preset, which applies a documented and
conservative configuration (LPF 62 MHz bandwidth and DSP EQ initial
value 0), and is the recommended interface for typical use cases.
- Separate LPF bandwidth and DSP EQ initial value controls intended for
advanced or experimental tuning. These are orthogonal and independent,
and override the corresponding settings without requiring any specific
ordering.
The preset and tunables act as simple setters with no implicit state
machine or invalid combinations, keeping the API predictable and aligned
with the KISS principle.
The erratum affects the shared PHY analog front-end and therefore applies
globally to the switch.
Minxi Hou [Tue, 9 Jun 2026 16:57:25 +0000 (00:57 +0800)]
selftests/net/openvswitch: add flow modify test
Add mod_flow() and the mod-flow CLI command to ovs-dpctl.py, exercising
OVS_FLOW_CMD_SET. Add test_flow_set which first modifies an existing
flow with new actions and verifies the change via traffic, then modifies
the same flow without actions and verifies the kernel handles the
no-actions case gracefully.
The no-actions path is unreachable from userspace OVS tools (dpctl
mod-flow requires actions) but reachable via raw netlink. This is the
code path where Adrian Moreno found a possible kfree_skb of ERR_PTR
when reply allocation fails after locking.
Make parse() skip OVS_FLOW_ATTR_ACTIONS when actstr is None so the
kernel enters the post-lock allocation branch in ovs_flow_cmd_set().
After the no-actions set, verify via dump-flows that the flow retained
its drop action.
Nicolai Buchwitz [Wed, 10 Jun 2026 11:48:35 +0000 (13:48 +0200)]
net: bcmgenet: convert RX path to page_pool
Replace the per-packet __netdev_alloc_skb() + dma_map_single() in the
RX path with page_pool. SKBs are built from pool pages via
napi_build_skb() with skb_mark_for_recycle() so the network stack
returns pages to the pool, and DMA mapping happens once per page
instead of once per packet.
Reject HW-reported lengths smaller than the RSB so a runt cannot
underflow the SKB build path.
Drop the now-unused priv->rx_buf_len field and the rx_dma_failed soft
MIB counter (nothing increments it after the conversion). This
removes the "rx_dma_failed" entry from ethtool -S, which is a
user-visible change for monitoring tools that key on stat names.
net: airoha: move get_sport() callback at the beginning of airoha_enable_gdm2_loopback()
Move the get_sport() callback invocation at the beginning of
airoha_enable_gdm2_loopback() routine in order to avoid leaving the
hardware in a partially configured state if get_sport() fails.
Previously, get_sport() was called after GDM2 forwarding, loopback,
channel, length, VIP and IFC registers had already been programmed.
A failure at that point would return an error leaving GDM2 with
loopback enabled but WAN port, PPE CPU port and flow control mappings
not configured.
Performing the get_sport() lookup before any register write guarantees
the routine either completes the full configuration sequence or exits
with no side effects on the hardware.
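The reordering follows a common rule: run every step that can fail before the first side effect. A minimal sketch, assuming a simplified device: struct hw, lookup_sport() and enable_loopback() are hypothetical, and two fields stand in for the programmed registers.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical hardware state standing in for the programmed registers. */
struct hw {
	bool loopback_enabled;
	int wan_sport;
};

/* Hypothetical lookup that can fail, modelling the get_sport() callback. */
static int lookup_sport(bool supported)
{
	return supported ? 3 : -EINVAL;
}

/* The fallible lookup runs first, so a failure leaves hw untouched. */
static int enable_loopback(struct hw *hw, bool sport_supported)
{
	int sport = lookup_sport(sport_supported);

	if (sport < 0)
		return sport;	/* nothing has been written yet */

	hw->loopback_enabled = true;
	hw->wan_sport = sport;
	return 0;
}
```

The routine therefore either applies the whole configuration or returns the error with the state unchanged.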
====================
mptcp: pm: drop TCP TS with ADD_ADDRv6 + port
Up to this series, it was possible to add a "signal" MPTCP endpoint with
an IPv6 address and a port, or to directly request to send an ADD_ADDR
with a v6 address and a port, but the expected ADD_ADDR wasn't sent when
TCP timestamps were used for the connection.
In fact, such a signalling option cannot be sent when TCP timestamps are
used due to a lack of option space: the limit is 40 bytes and, with
padding, the TCP timestamps option takes 12 bytes, while an ADD_ADDR
with an IPv6 address and a port takes 30 bytes, 42 bytes in total. The
selected solution here is to simply drop the TCP timestamps option when
such an ADD_ADDR of 30 bytes needs to be sent.
- Patches 1-3: small cleanups to avoid computing ADD/RM_ADDR twice.
- Patches 4-7: the new feature, controlled by a new sysctl knob.
- Patch 8: extra checks in the MPTCP Join selftests.
- Patches 9-15: A bunch of refactoring: renamed confusing helpers and
variables, and prevent future misused functions.
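The option-space budget from the cover letter can be checked with a few lines of arithmetic. The constant and function names below are hypothetical and chosen for this sketch; only the three sizes (40, 12 and 30 bytes) come from the text above.

```c
#include <stdbool.h>

/* Sizes quoted in the cover letter; the names are made up for this sketch. */
enum {
	OPT_SPACE_MAX = 40,	/* TCP option space limit */
	TS_PADDED_LEN = 12,	/* TCP timestamps option, with padding */
	ADD_ADDR6_PORT_LEN = 30,	/* ADD_ADDR with IPv6 address and port */
};

/* True if an ADD_ADDR of the given size fits in the option space. */
static bool options_fit(int add_addr_len, bool with_timestamps)
{
	int used = add_addr_len + (with_timestamps ? TS_PADDED_LEN : 0);

	return used <= OPT_SPACE_MAX;
}

/* True if dropping the timestamps option is what makes the ADD_ADDR fit. */
static bool must_drop_timestamps(int add_addr_len)
{
	return !options_fit(add_addr_len, true) &&
	       options_fit(add_addr_len, false);
}
```

For the 30-byte case, 30 + 12 = 42 exceeds the 40-byte limit, while 30 bytes alone fit, which is why the timestamps option is the one that gets dropped.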
====================
mptcp_pm_announced_del_timer() removes the matched ADD_ADDR entry (if
found) from the ADD_ADDR list only if check_id is false. That's
dangerous, and not clear, because it means the caller should free the
entry only in some cases, and it is easy to miss that.
Instead, make it static, and call it from mptcp_pm_add_addr_echoed,
which is the only other case where mptcp_pm_add_addr_del_timer should be
called with check_id set to true. Bonus with that: a second call to
mptcp_pm_add_addr_lookup_by_addr() can be avoided.
Note that instead of adding the signature above to avoid a compilation
issue because this helper is called before the definition of the
function, the whole helper is moved above where it is first called. Its
content is untouched, except for the addition of the 'static' keyword.
Note that the signature is added above: it is easier than moving the
code around, because this helper depends on mptcp_pm_schedule_work which
is declared below.
While at it, explicitly mark it as to be called while pm->lock is held.
Similar to the two previous commits, using the 'add' prefix is
confusing, also confirmed by [1].
Now that the structure has been renamed to include 'add_addr' in its
name, easier to know the timer is linked to the ADD_ADDR, no need to
add the confusing prefix, or an unneeded longer one.
While at it, also update the ADD_ADDR timer helper to clearly specify it
is linked to ADD_ADDR, and it is not there to add a new timer.
Similar to the previous commit, only using the 'add' or 'anno' prefixes
is confusing -- they are generally associated with the action of adding
something, or the Latin word for "year" -- and lacks uniformity.
This has been causing issues in the past, e.g. del_add_timer seemed to
suggest the goal is to delete a previously added timer.
Instead, use the mptcp_pm_announced_ prefix.
While at it, slightly improve some helpers:
- mptcp_lookup_anno_list_by_saddr: no need to specify what is used to do
the lookup: mptcp_pm_announced_lookup.
- mptcp_pm_sport_in_anno_list: it doesn't just compare the port, but the
whole address linked to the subflow: mptcp_pm_announced_has_ssk.
- mptcp_pm_alloc_anno_list: it allocates one item of the list, not a
whole list: mptcp_pm_announced_alloc.
Using only the 'add' prefix is confusing: does it refer to a generic
added entry or address, or specifically to ADD_ADDRs. Using add_addr
removes this confusion.
Similar to most places in the MPTCP code. So instead of passing the
subflow list and use list_for_each_entry(subflow, list, node), pass the
msk and use mptcp_for_each_subflow(msk, subflow).
That's clearer and more uniform with the rest.
While at it, add 'pm_' prefix for the exported one to easily identify
the origin. Plus replace 'lookup' by 'has', because a bool is returned.
Before, they were only checked on demand, but it seems better to check
them each time received ADD_ADDRs are checked.
Errors are only reported when the counter exists, and the value is not
the expected one. This is similar to what is done in chk_join_nr: it
reduces the output, and avoids a lot of 'skip' when validating older
kernels. Also here, some tests need to adapt the default expected
counters, e.g. when ADD_ADDR echo are dropped on the reception side, or
it is not possible to send an ADD_ADDR due to the limited option space.