git.ipfire.org Git - thirdparty/linux.git/log

Merge tag 'kvm-x86-svm-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 7.2

- Add support for virtualizing gPAT (KVM previously just used L1's PAT when
   running L2).

- Fix goofs where KVM mishandles side effects (e.g. single-step and PMC
   updates) when emulating VMRUN.

- Fix a variety of bugs in AVIC's handling of x2APIC MSR interception, most
   notably where KVM didn't disable interception of IRR, ISR, and TMR regs.

- Add support for virtualizing Host-Only/Guest-Only bits in the mediated PMU.

Merge tag 'kvm-x86-vmx-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 7.2

- Fix a largely benign bug where KVM TDX would incorrectly state it could
emulate several x2APIC MSRs.

- Use the "safe" WRMSR API when proxying LBR MSR writes as the to-be-written
value is guest controlled and completely unvalidated.

Merge tag 'kvm-x86-vfio-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM VFIO changes for 7.2

Use guard() to cleanup up various KVM+VFIO flows.

Merge tag 'kvm-x86-sev-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM SEV changes for 7.2

- Don't advertise support for unusuable VM types, and account for VM types
   that are disabled by firmware, e.g. to mitigate security vulnerabilities.

- Rewrite the SEV {en,de}crypt debug ioctls as they were riddle with bugs and
   unnecessarily complicated, and add comprehensive tests.

- Clean up and deduplicate the SEV page pinning code.

- Fix minor goofs related to writing back CPUID information after firmware
   rejects a CPUID page for an SNP vCPU.

Merge tag 'kvm-x86-selftests-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM selftests changes for 7.2

- Randomize the dirty log test's delay when reaping the bitmap on the first
pass, as always waiting only 1ms hid a KVM RISC-V bug as the test reaped the
bitmap before KVM could build up enough state to hit the bug.

- A pile of one-off fixes and cleanups.

Merge tag 'kvm-x86-mmu-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 7.2

- Use the kernel's "enum pg_level" in the TDX APIs instead of the TDX-Module's
   level definitions (which are 0-based).

- Rework the TDX memory APIs to not require/assume that guest memory is
   backed by "struct page" (in prepartion for guest_memfd hugepage support).

- Overhaul the TDP MMU => S-EPT code to move as much S-EPT specific logic as
   possible into the TDX code, and to funnel (almost) all S-EPT updates into
   a single chokepoint.  The motivation is largely to prepare for upcoming
   Dynamic PAMT support, but the cleanups are nice to have on their own.

- Plug a hole in the shadow MMU where KVM fails to recursively zap nested TDP
   shadow when L1 is tearing its TDP page tables from the bottom up, as KVM's
   TDP MMU now does.

Merge tag 'kvm-x86-misc-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM misc x86 changes for 7.2

- Handle EXIT_FASTPATH_EXIT_USERSPACE in vendor code to ensure vendor code
   gets a chance to handle things like reaping the PML buffer.

- Ensure KVM's copy of CR0 and CR3 are up-to-date on SVM prior to invoking
   fastpath handlers.

- Update KVM's view of PV async enabling if and only if the MSR write fully
   succeeds.

- Fix a variety of issues where the emulator doesn't honor guest-debug state,
   and clean up related code along the way.

- Synthesize EPT Violation and #NPF "error code" bits when injecting faults
   into L1 that didn't originate in hardware (in which case the VMCS/VMCB
   doesn't hold relevant information).

- Add support for virtualizing (well, emulating) AMD's flavor of CPL>0 CPUID
   faulting.

- Clean up the GPR APIs so that KVM's use of "raw" is consistent, and fix a
   variety of minor bugs along the way.

- Fix an OOB memory access due to not checking the VP ID when handling a
   Hyper-V PV TLB flush for L2.

- Fix a bug in the mediated PMU's handling of fixed counters that allowed the
   guest to bypass the PMU event filter.

- Allow userspace to return EAGAIN when handling SNP and TDX hypercalls, so
   the KVM can forward a "retry" status code to the guest, and reserve all
   unused error codes for future usage.

- Misc fixes and cleanups.

Merge tag 'kvm-x86-gmem-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM guest_memfd changes for 7.2

- Return -EEXIST instead of -EINVAL if userspace attempts to bind a gmem
   range to multiple memslots, and fix the test that was supposed to ensure
   KVM returns -EEXIST.

- Treat memslot binding offsets and sizes as unsigned values to fix a bug
   where KVM interprets a large "offset + size" as a negative value and allows
   a nonsensical offset.

- Use the inode number instead of the page offset for the NUMA interleaving
   index to fix a bug where the effective index would jump by two for
   consecutive pages (the caller also adds in the page offset).

Merge branch kvm-arm64/vgic-v5-PPI-fixes into kvmarm-master/next

* kvm-arm64/vgic-v5-PPI-fixes:
  : .
  : Substantial cleanup of the vgic-v5 PPI support. From the original
  : cover letter:
  :
  : "With the GICv5 PPi support merged in, it has become obvious that a few
  :  things could be improved, both from the correctness and maintainability
  :  angles."
  : .
  KVM: arm64: Fix arch timer interrupts for GICv3-on-GICv5 guests
  irqchip/gic-v5: Immediately exec priority drop following activate
  Documentation: KVM: Clarify that PMU_V3_IRQ IntID requirements for GICv5
  Documentation: KVM: Fix typos in VGICv5 documentation
  KVM: arm64: selftests: Improve error handling for GICv5 PPI selftest
  KVM: arm64: selftests: Cleanup unused vars in GICv5 PPI selftest
  KVM: arm64: selftests: Add missing GIC CDEN to no-vgic-v5 selftest
  KVM: arm64: vgic-v5: Atomically assign bits to PPI DVI bitmap
  KVM: arm64: vgic-v5: Add missing trap handing for NV triage
  KVM: arm64: vgic-v5: Limit support to 64 PPIs
  KVM: arm64: vgic: Rationalise per-CPU irq accessor
  KVM: arm64: vgic-v5: Drop defensive checks from vgic_v5_ppi_queue_irq_unlock()
  KVM: arm64: vgic: Consolidate vgic_allocate_private_irqs_locked()
  KVM: arm64: vgic: Constify struct irq_ops usage
  KVM: arm64: vgic-v5: Drop pointless ARM64_HAS_GICV5_CPUIF check
  KVM: arm64: vgic-v5: Remove use of __assign_bit() with a constant
  KVM: arm64: vgic-v5: Move PPI caps into kvm_vgic_global_state
  KVM: arm64: vgic-v5: Add for_each_visible_v5_ppi() iterator

Signed-off-by: Marc Zyngier <maz@kernel.org>

Merge branch kvm-arm64/pkvm-fixes-7.2 into kvmarm-master/next

* kvm-arm64/pkvm-fixes-7.2:
  : .
  : Assorted pKVM fixes for 7.2:
  :
  : - Ensure that the vcpu memcache is filled in a number of cases (donate,
  :   share, selftest)
  :
  : - Fix vmemmap page order handling by resetting it when initialising the
  :   memory pool
  :
  : - Don't leak page references on failed memory donation
  :
  : - Add sanity-check for refcounted pages when donating/sharing pages
  :
  : - Clear __hyp_running_vcpu on state flush
  :
  : - Check LR upper bound against a trusted value
  :
  : - Assorted fixes for the host-side tracking of the pages shared with
  :   EL2 as a result of some Sashiko testing from Fuad
  :
  : - Correctly forward HCR_EL2.VSE from host to guest, so that protected
  :   guests can see SErrors
  : .
  KVM: arm64: Roll back partial shares on kvm_share_hyp() failure
  KVM: arm64: Avoid host/hyp share desync on unshare hypercall failure
  KVM: arm64: Free hyp-share tracking node when share hypercall fails
  KVM: arm64: Flush HCR_EL2.VSE to deliver SErrors to pKVM guests
  KVM: arm64: Bound used_lrs when flushing the pKVM hyp vCPU
  KVM: arm64: Clear __hyp_running_vcpu when flushing the pKVM hyp vCPU
  KVM: arm64: Pre-check vcpu memcache for host->guest donate
  KVM: arm64: Pre-check vcpu memcache for host->guest share
  KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache
  KVM: arm64: Add fail-safe for refcounted pages in __pkvm_hyp_donate_host
  KVM: arm64: Fix __pkvm_init_vm error path
  KVM: arm64: Reset page order in pKVM hyp_pool

Signed-off-by: Marc Zyngier <maz@kernel.org>

Merge branch kvm-arm64/nv-granule-sizes into kvmarm-master/next

* kvm-arm64/nv-granule-sizes:
  : .
  : Tidying up of the behaviour when the selected page size in not
  : implemented, courtesy of Wei-Lin Chang. From the initial cover
  : letter:
  :
  : "This small series fixes the granule size selection for software stage-1
  :  and stage-2 walks. Previously we treat the guest's TCR/VTCR.TGx as-is
  :  and use the encoded granule size for the walks. However this is
  :  incorrect if the granule sizes are not advertised in the guest's
  :  ID_AA64MMFR0_EL1.TGRAN*. The architecture specifies that when an
  :  unsupported size is programed in TGx, it must be treated as an
  :  implemented size. Fix this by choosing an available one while
  :  prioritizing PAGE_SIZE."
  : .
  KVM: arm64: Fallback to a supported value for unsupported guest TGx
  KVM: arm64: nv: Use literal granule size in TLBI range calculation
  KVM: arm64: Factor out TG0/1 decoding of VTCR and TCR
  KVM: arm64: nv: Rename vtcr_to_walk_info() to setup_s2_walk()

Signed-off-by: Marc Zyngier <maz@kernel.org>

Merge branch kvm-arm64/nv-fp-elision into kvmarm-master/next

* kvm-arm64/nv-fp-elision:
  : .
  : Significantly reduce the overhead of the context switch between L1 and
  : L2 guests by eliding the save/restore of the FP/SIMD/SVE registers, as
  : this state is shared between the two guests, and therefore can be left
  : live.
  : .
  KVM: arm64: nv: Don't save/restore FP register during a nested ERET or exception
  KVM: arm64: nv: Track L2 to L1 exception emulation

Signed-off-by: Marc Zyngier <maz@kernel.org>

Merge branch kvm-arm64/no-lazy-vgic-init into kvmarm-master/next

* kvm-arm64/no-lazy-vgic-init:
  : .
  : Fix an ugly situation where the vgic lazy init could happen in
  : non-preemtible contexts such as vcpu reset, resulting in lockdep
  : splats.
  :
  : This requires revamping the way in-kernel emulation of devices
  : (timers, PMU) are presenting their interrupt to the vgic, and
  : make sure there is no need to init the vgic on the back of that.
  : .
  KVM: arm64: vgic-v2: Don't init the vgic on in-kernel interrupt injection
  KVM: arm64: vgic-v2: Force vgic init on injection outside the run loop
  KVM: arm64: pmu: Kill the PMU interrupt level cache
  KVM: arm64: timer: Kill the per-timer irq level cache
  KVM: arm64: Simplify userspace notification of interrupt state
  KVM: arm64: timer: Repaint kvm_timer_{should,irq_can}_fire() to kvm_timer_{pending,enabled}()

Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: vgic-its: Make ABI commit helpers return void

The return values of vgic_its_set_abi() and vgic_its_commit_v0() are always
0 and do not carry useful error information. Simplify by changing them to
void.

Suggested-by: Oliver Upton <oupton@kernel.org>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Oliver Upton <oupton@kernel.org>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://patch.msgid.link/20260604075147.53299-1-liu.yun@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>

xfs: shut down the filesystem on a failed mount

A corrupt/crafted XFS image can make mount fail after background inode
inactivation has already been enabled.  xfs_mountfs() turns on inodegc
(xfs_inodegc_start()) right after log recovery, but the quota subsystem
(mp->m_quotainfo) is only allocated much later, in xfs_qm_newmount() /
xfs_qm_mount_quotas().  The quota accounting flags in mp->m_qflags are
parsed from the mount options before xfs_mountfs() even runs.

If the mount then aborts in between - e.g. xfs_rtmount_inodes() failing
with "failed to read RT inodes" - the unwind path flushes the inodegc
queue, which inactivates the inodes that are still queued, and
xfs_inactive() calls xfs_qm_dqattach().  That path trusts
XFS_IS_QUOTA_ON() (the flag is set) and dereferences the not yet
allocated mp->m_quotainfo:

  XFS (loop0): failed to read RT inodes
  Oops: general protection fault, probably for non-canonical address
        0xdffffc000000002a: 0000 [#1] PREEMPT SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x0000000000000150-0x0000000000000157]
  Workqueue: xfs-inodegc/loop0 xfs_inodegc_worker
  RIP: 0010:__mutex_lock+0xfe/0x930
  Call Trace:
   xfs_qm_dqget_cache_lookup+0x63/0x7f0
   xfs_qm_dqget_inode+0x336/0x860
   xfs_qm_dqattach_one+0x232/0x4e0
   xfs_qm_dqattach_locked+0x2c6/0x470
   xfs_qm_dqattach+0x46/0x70
   xfs_inactive+0x988/0xe80
   xfs_inodegc_worker+0x27c/0x730

The NULL m_quotainfo deref is only one symptom.  The deeper problem is
that a failed mount should not be inactivating inodes at all: it must
not write to the (possibly corrupt, only partially set up) persistent
metadata of a filesystem we just refused to mount, and the subsystems
inactivation relies on may not be initialised.

Mark the filesystem shut down before flushing the inodegc queue in the
xfs_mountfs() failure path.  With the preceding patch a shut down mount
no longer inactivates the queued inodes: xfs_inactive() returns early so
they are dropped straight to reclaim instead.  They are still pulled down
so reclaim can free them (which is why the flush was added in commit
ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")), but
without touching the on-disk structures - matching that comment's own
"pull down all the state and flee" intent.

Use SHUTDOWN_META_IO_ERROR for the shutdown: it is the generic "cannot
safely touch metadata" reason already used elsewhere in this file and in
the xfs_ifree() failure path, and unlike SHUTDOWN_FORCE_UMOUNT it does
not log a misleading "User initiated shutdown received".  A failed mount
is not necessarily on-disk corruption (it can be a transient I/O or
resource error), so SHUTDOWN_CORRUPT_ONDISK would not be accurate either.

Found by fuzzing XFS with syzkaller (corrupt image mount); reproduced and
verified under QEMU/KASAN.

Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
Signed-off-by: Mikhail Lobanov <m.lobanov@rosa.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: skip inode inactivation on a shut down mount

XFS already declines to inactivate inodes on a shut down mount, but only
at queue time: xfs_inode_mark_reclaimable() calls
xfs_inode_needs_inactive(), which returns false when the mount is shut
down ("If the log isn't running, push inodes straight to reclaim"), and
then drops the dquots and marks the inode reclaimable directly.

An inode that was queued for background inactivation while the mount was
still live is not covered by that check: the inodegc worker still calls
xfs_inactive() on it even after the mount has been shut down in the
meantime. Inactivation modifies persistent metadata and runs
transactions that cannot complete on a shut down mount, and it relies on
subsystems (e.g. quota) that a torn down, or never fully set up, mount
may not have available.

Honour the same invariant in xfs_inactive() itself: if the mount is shut
down, return early before doing any inactivation work. The dquots
attached to the inode are released by the existing xfs_qm_dqdetach() at
the out: label, so references are not leaked, and the caller then makes
the inode reclaimable exactly as before.

On its own this is a consistency fix with the existing queue-time
behaviour; it is also a prerequisite for shutting the mount down in the
xfs_mountfs() failure path in the following patch.

Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
Signed-off-by: Mikhail Lobanov <m.lobanov@rosa.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

Merge tag 'kvm-x86-generic-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM generic changes for 7.2

- Rename invalidate_begin() to invalidate_start() throughout KVM to follow
the kernel's nomenclature, e.g. for mmu_notifiers.

- Minor cleanups.

Merge tag 'kvm-s390-master-7.1-4' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

KVM: s390: A few more misc gmap fixes.

xfs: move XFS_LSN_CMP to xfs_log_format.h

Because CYCLE_LSN/BLOCK_LSN are defined in xfs_log_format.h, XFS_LSN_CMP
forces a xfs_log_format.h dependency in xfs_log.h. Move XFS_LSN_CMP
to xfs_log_format.h and drop the macro/inline indirection to clean up
our header mess a little bit.

This also helps xfsprogs, which doesn't have xfs_log.h, but needs
XFS_LSN_CMP.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: shut down zoned file systems on writeback errors

Zoned writeback allocates space from an open zone and advances the
in-memory allocation state before submitting the bio. The completion
path only records the written blocks and updates the mapping on success.
If the write fails, XFS cannot tell how far the device write pointer
advanced and cannot safely roll the open zone accounting back.

This was observed while investigating xfs/643 and xfs/646 on an external
ZNS realtime device. A writeback error after consuming space from an
open zone left later writers waiting for open-zone or GC progress that
could not happen. xfs/643 exposed this through the GC defragmentation
path, while xfs/646 exposed the same failure mode through the
truncate/EOF-zeroing space wait path.

There is no local recovery path in ioend completion that can restore a
consistent zoned allocation state after the device has rejected the
write. Treat writeback errors for zoned inodes as fatal and force a
file system shutdown from the ioend completion path. The existing
shutdown path wakes zoned allocation waiters and makes future space
waits return -EIO instead of leaving tasks stuck waiting for progress.

Signed-off-by: Yao Sang <sangyao@kylinos.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

iommu/amd: Control INVALIDATE_IOMMU_PAGES PDE from the gather

Now that AMD uses iommupt, it is easy to make use of the PDE bit. If
the gather has no free list then no page directory entries were
changed.

Pass GN/PDE through the invalidation call chain in a u32 flags field
that is OR'd into data[2] and set it properly from the gather.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Wei Wang <wei.w.wang@hotmail.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

iommu/amd: Make CMD_INV_IOMMU_ALL_PAGES_ADDRESS match the spec

The spec in Table 14 defines the "Entire Cache" case as having the low
12 bits as zero. Indeed the command format doesn't even have the low
12 bits. Since there is only one user now, fix the constant to have 0
in the low 12 bits instead of 1 and remove the masking.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Wei Wang <wei.w.wang@hotmail.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

iommu/amd: Have amd_iommu_domain_flush_pages() use last

Finish clearing out the size/last/end switching by converting
amd_iommu_domain_flush_pages() to use last-based logic.

This algorithm is simpler than the previous. Ultimately all this wants
to do is select powers of two that are aligned to address and not
longer than the distance to last.

The new version is fully safe for size = U64_MAX and last = U64_MAX.

Finally, the gather can be passed through natively without risking an
overflow in (gather->end - gather->start + 1).

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Wei Wang <wei.w.wang@hotmail.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

iommu/amd: Pass last in through to build_inv_address()

This is the trivial call chain below amd_iommu_domain_flush_pages().

Cases that are doing a full invalidate will pass a last of U64_MAX.

This avoids converting between size and last, and type confusion with
size_t, unsigned long and u64 all being used in different places along
the driver's invalidation path. Consistently use u64 in the internals.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

iommu/amd: Simplify build_inv_address()

This function is doing more work than it needs to:

- iommu_num_pages() is pointless, the fls() is going to compute the
   required page size already.

- It is easier to understand as sz_lg2, which is 12 if size is 4K,
   than msb_diff which is 11 if size is 4K.

- Simplify the control flow to early exit on the out of range cases.

- Use the usual last instead of end to signify an inclusive last
   address.

- Use GENMASK to compute the 1's mask.

- Use GENMASK to compute the address mask for the command layout,
   not PAGE_MASK.

- Directly reference the spec language that defines the 52 bit
   limit.

No functional change intended.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Wei Wang <wei.w.wang@hotmail.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

Merge tag 'bst-arm64-emmc-driver-defconfig-for-v7.2' of https://github.com/BlackSesame-SoC/linux into soc/defconfig

arm64: BST C1200 eMMC defconfig for v7.2

Black Sesame Technologies:

Enable eMMC controller on BST C1200 CDCU1.0 board:
- Enable CONFIG_MMC_SDHCI_BST=y in arm64 defconfig

The MMC driver was merged via mmc-next in v7.1-rc1.
This is the remaining defconfig piece.

* tag 'bst-arm64-emmc-driver-defconfig-for-v7.2' of https://github.com/BlackSesame-SoC/linux:
arm64: defconfig: enable BST SDHCI controller

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Merge tag 'bst-arm64-emmc-driver-dts-for-v7.2' of https://github.com/BlackSesame-SoC/linux into soc/dt

arm64: BST C1200 eMMC DTS for v7.2

Black Sesame Technologies:

Enable eMMC controller on BST C1200 CDCU1.0 board:
    - Add mmc0 node in bstc1200.dtsi (DWCMSHC SDHCI controller)
    - Add fixed clock definition and reserved SRAM bounce buffer
    - Enable mmc0 with 8-bit bus on CDCU1.0 ADAS 4C2G board
The MMC driver was merged via mmc-next in v7.1-rc1.
this is the remaining DTS piece.

Signed-off-by: Gordon Ge <gordon.ge@bst.ai>
* tag 'bst-arm64-emmc-driver-dts-for-v7.2' of https://github.com/BlackSesame-SoC/linux:
  arm64: dts: bst: enable eMMC controller in C1200

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

xfrm: Fix dev use-after-free in xfrm async resumption

xfrm async resumption hold skb->dev refcnt until after transport_finish.
However, xfrm_rcv_cb may modify skb->dev to tunnel dev without taking
device reference, such as vti_rcv_cb. The subsequent async resumption
will decrement the tunnel device's reference count, which lead to uaf
of tunnel dev and refcnt leak of orig dev as below:

unregister_netdevice: waiting for vti1 to become free. Usage count = -2

Stash the original skb->dev to fix refcnt imbalance. The new skb->dev set
by xfrm_rcv_cb can race with device teardown. Extend rcu protection over
xfrm_rcv_cb and transport_finish to prevent races.

Fixes: 1c428b038400 ("xfrm: hold dev ref until after transport_finish NF_HOOK")
Reported-by: Xu Chunxiao <xuchunxiao3@huawei.com>
Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

net: af_key: initialize alg_key_len for IPComp states

pfkey_msg2xfrm_state() handles the IPComp (SADB_X_SATYPE_IPCOMP) case by
allocating x->calg and copying only the algorithm name:

x->calg = kmalloc_obj(*x->calg);
if (!x->calg) {
err = -ENOMEM;
goto out;
}
strcpy(x->calg->alg_name, a->name);
x->props.calgo = sa->sadb_sa_encrypt;

Unlike the authentication (x->aalg) and encryption (x->ealg) branches of
the same function, the compression branch never initializes
calg->alg_key_len.  IPComp carries no key and the allocation only
reserves sizeof(struct xfrm_algo) (i.e. no room for a key), so the field
is left containing uninitialized slab data.

calg->alg_key_len is later used as a length by xfrm_algo_clone() when an
IPComp state is cloned during XFRM_MSG_MIGRATE:

xfrm_state_migrate()
  xfrm_state_clone_and_setup()
    x->calg = xfrm_algo_clone(orig->calg);
      kmemdup(orig, xfrm_alg_len(orig));

where xfrm_alg_len() returns sizeof(*alg) + (alg_key_len + 7) / 8.  With
a non-zero garbage alg_key_len, kmemdup() reads past the end of the
68-byte calg object.  Adding an IPComp SA via PF_KEY and then migrating
it triggers (net-next, KASAN, init_on_alloc=0):

  BUG: KASAN: slab-out-of-bounds in kmemdup_noprof+0x44/0x60
  Read of size 4164 at addr ff11000025a74980 by task diag2/9287
  CPU: 3 UID: 0 PID: 9287 Comm: diag2 7.1.0-rc6-g903db046d557 #1
  Call Trace:
   <TASK>
   dump_stack_lvl+0x10e/0x1f0
   print_report+0xf7/0x600
   kasan_report+0xe4/0x120
   kasan_check_range+0x105/0x1b0
   __asan_memcpy+0x23/0x60
   kmemdup_noprof+0x44/0x60
   xfrm_state_migrate+0x70a/0x1da0
   xfrm_migrate+0x753/0x18a0
   xfrm_do_migrate+0xb47/0xf10
   xfrm_user_rcv_msg+0x411/0xb50
   netlink_rcv_skb+0x158/0x420
   xfrm_netlink_rcv+0x71/0x90
   netlink_unicast+0x584/0x850
   netlink_sendmsg+0x8b0/0xdc0
   ____sys_sendmsg+0x9f7/0xb90
   ___sys_sendmsg+0x134/0x1d0
   __sys_sendmsg+0x16d/0x220
   do_syscall_64+0x116/0x7d0
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   </TASK>

  Allocated by task 9287:
   kasan_save_stack+0x33/0x60
   kasan_save_track+0x14/0x30
   __kasan_kmalloc+0xaa/0xb0
   pfkey_add+0x2652/0x2ea0
   pfkey_process+0x6d0/0x830
   pfkey_sendmsg+0x42c/0x850
   __sys_sendto+0x461/0x4b0
   __x64_sys_sendto+0xe0/0x1c0
   do_syscall_64+0x116/0x7d0
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

  The buggy address belongs to the object at ff11000025a74980
   which belongs to the cache kmalloc-96 of size 96
  The buggy address is located 0 bytes inside of
   allocated 68-byte region [ff11000025a74980, ff11000025a749c4)

Depending on the uninitialized value the same field can instead request
an oversized kmemdup() allocation and make the migration clone fail.

The XFRM netlink path is not affected: verify_one_alg() rejects an
XFRMA_ALG_COMP attribute shorter than xfrm_alg_len(), so a calg added via
XFRM_MSG_NEWSA is always self-consistent.

Initialize calg->alg_key_len to 0, matching the aalg/ealg branches.

Fixes: 80c9abaabf42 ("[XFRM]: Extension for dynamic update of endpoint address(es)")
Cc: stable@vger.kernel.org
Signed-off-by: Zijing Yin <yzjaurora@gmail.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

xfrm: use compat translator only for u64 alignment mismatch

The XFRM compat layer (CONFIG_XFRM_USER_COMPAT) translates 32-bit xfrm
netlink and setsockopt messages into the native 64-bit layout. It is
only needed on architectures where the 32-bit and 64-bit ABIs disagree
on u64 alignment, which the kernel encodes as COMPAT_FOR_U64_ALIGNMENT.

That symbol is defined only by arch/x86. XFRM_USER_COMPAT depends on it,
so the translator can never be built on any other architecture,
including arm64, which still provides a 32-bit compat ABI (CONFIG_COMPAT)
for AArch32 EL0 userspace. On arm64 the AArch32 EABI already aligns u64
to 8 bytes, identical to the AArch64 ABI, so no translation is required
and the native code path is correct for 32-bit tasks.

However, xfrm_user_rcv_msg() and xfrm_user_policy() gate on
in_compat_syscall() alone and then call xfrm_get_translator(), which
returns NULL when no translator is registered. On arm64 that is always
the case, so every xfrm netlink message and the XFRM_POLICY setsockopt
issued by a 32-bit task returns -EOPNOTSUPP. A 32-bit userspace process
on arm64 (and on any other arch with CONFIG_COMPAT but without
COMPAT_FOR_U64_ALIGNMENT) therefore cannot configure XFRM state or
policy through the XFRM_USER netlink API, and cannot use the XFRM_POLICY
setsockopt path, because both fail before reaching the native parser.

The translator series replaced the blanket compat rejection with a
translator lookup. That made the path usable on x86 when the translator
is available, but left architectures that cannot build the translator
permanently rejected even when their compat layout already matches the
native layout. Let those architectures use the native parser instead.

Gate the translator requirement on COMPAT_FOR_U64_ALIGNMENT instead of
on in_compat_syscall() alone. Gating on the ABI property rather than on
CONFIG_XFRM_USER_COMPAT is deliberate: on x86 with IA32_EMULATION=y but
XFRM_USER_COMPAT=n, a 32-bit task must still be rejected rather than
routed through the native parser, which would misread genuinely
4-byte-aligned x86-32 messages. COMPAT_FOR_U64_ALIGNMENT is the ABI
property that makes the XFRM translator mandatory.

Only the receive/input direction needs the guard. The send, dump and
notification paths already call the translator as "if (xtr) { ... }"
with no error on NULL, so on arches without a translator they no-op and
the kernel emits native 64-bit-layout messages, which is what an AArch32
task expects.

Tested on Juniper SRX hardware: with the fix, 32-bit IPsec userspace
netlink and XFRM_POLICY setsockopt operations that previously failed
with -EOPNOTSUPP now succeed; x86 behaviour is unchanged by inspection.

Fixes: 5106f4a8acff ("xfrm/compat: Add 32=>64-bit messages translator")
Fixes: 96392ee5a13b ("xfrm/compat: Translate 32-bit user_policy from sockptr")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

arm64: defconfig: enable BST SDHCI controller

Enable CONFIG_MMC_SDHCI_BST to support eMMC on Black Sesame
Technologies C1200 boards.

Signed-off-by: Albert Yang <yangzh0906@thundersoft.com>
Acked-by: Gordon Ge <gordon.ge@bst.ai>
Signed-off-by: Gordon Ge <gordon.ge@bst.ai>

arm64: dts: bst: enable eMMC controller in C1200

Add mmc0 node for the DWCMSHC SDHCI controller with basic configuration
(disabled by default) and fixed clock definition in bstc1200.dtsi.

Enable mmc0 with board-specific configuration including 8-bit bus
width and reserved SRAM bounce buffer on the CDCU1.0 ADAS 4C2G board.

The bounce buffer in reserved SRAM addresses hardware constraints
where the eMMC controller cannot access main system memory through
SMMU due to a hardware bug, and all DRAM is located outside the
4GB boundary.

Signed-off-by: Albert Yang <yangzh0906@thundersoft.com>
Acked-by: Gordon Ge <gordon.ge@bst.ai>
Signed-off-by: Gordon Ge <gordon.ge@bst.ai>

Input: rmi4 - update formatting in F12

Clean up various style and formatting issues in the F12 code.

Link: https://patch.msgid.link/20260505045952.1570713-20-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - propagate proper error code in F12 sensor tuning

Propagate the actual error code returned by rmi_read() in
rmi_f12_read_sensor_tuning() instead of hardcoding -ENODEV.
Also, since rmi_read() returns 0 on success, use 'if (ret)'
instead of 'if (ret < 0)'.

Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-19-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - simplify size calculations in F12

Use min_t() to simplify the clamping logic when calculating the
number of objects to process and the number of valid bytes in the
attention handler.

Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-18-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - use sizeof(*ptr) and idiomatic checks in f12 allocators

Using sizeof(*ptr) is preferred over sizeof(struct) because it is
more robust against type changes. Also switch to checking for
allocation failure immediately after each call, and update
formatting.

Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-17-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - use devm_kmalloc for F12 data packet buffer

The sensor->data_pkt buffer is used exclusively to store incoming
hardware data during the attention handler, where it is entirely
overwritten by either memcpy() or rmi_read_block(). Therefore,
there is no need to zero-initialize it during probe.

Switch to devm_kmalloc() to avoid the unnecessary memset overhead.

Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-16-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - use flexible array member for IRQ masks in F12

Use a flexible array member to allocate the IRQ masks at the end of
the f12_data structure, and use the struct_size() helper to
calculate the allocation size safely. This replaces manual pointer
arithmetic.

Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-15-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - use unaligned access helpers in F12

Use get_unaligned_le16() instead of manual bit shifts to construct
16-bit values for max_x, max_y, pitch_x, pitch_y, and object
coordinates in the F12 parsing logic. This simplifies the code and
makes the endianness explicit.

Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-14-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - change reg_size type to u32

Change reg_size from unsigned long to u32 to save space and ensure
consistent size across 32-bit and 64-bit architectures, and use
DECLARE_BITMAP() for subpacket_map.

Also pack the structure by rearranging the members to avoid holes,
and use size_add() to prevent potential integer overflows when
calculating the total size of registers.

Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-13-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - refactor F12 probe function

The F12 probe function contains highly repetitive logic for parsing
register descriptors and their individual data items. Refactor the
function to use loops to eliminate redundancy, and clarify the code.

Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-12-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - use kzalloc_flex() for struct rmi_function

struct rmi_function contains a flexible array member irq_mask.
Convert the manual kzalloc size calculation to use the kzalloc_flex()
macro.

Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-11-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - refactor function allocation and registration

Currently, rmi_create_function() allocates memory for the rmi_function
structure, but rmi_register_function() initializes the device via
device_initialize(). This split of ownership makes error handling in
rmi_create_function() confusing because the caller must be aware that
if rmi_register_function() fails, it has already called put_device() to
clean up the memory.

To make the memory lifecycle explicit and fix potential leaks cleanly
introduce rmi_alloc_function() to handle memory allocation and device
initialization, and make the caller of rmi_register_function()
responsible for cleanup.

Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-10-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - use local presence map in rmi_read_register_desc()

The presence map is only used during the parsing of the register
descriptor, so we can make it a local variable instead of storing it
in struct rmi_register_descriptor.

Also fix the spelling of the constant and the variable name (presence
instead of presense).

Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-9-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - fix limit in rmi_register_desc_has_subpacket()

rmi_register_desc_has_subpacket() should use RMI_REG_DESC_SUBPACKET_BITS,
not RMI_REG_DESC_PRESENCE_BITS, as the limit for subpacket_map.

Fixes: 2b6a321da9a2 ("Input: synaptics-rmi4 - add support for Synaptics RMI4 devices")
Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-8-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - fix bit count in bitmap_copy()

bitmap_copy() takes number of bits, not bytes (or longs). Correct
the bit count in rmi_driver_set_irq_bits() and
rmi_driver_clear_irq_bits().

Fixes: 2b6a321da9a2 ("Input: synaptics-rmi4 - add support for Synaptics RMI4 devices")
Cc: stable@vger.kernel.org
Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-7-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - iterative IRQ handler

The current IRQ handler uses recursion to drain the attention FIFO,
which can lead to stack overflow on deep queues. Convert it to a
loop.

Fixes: b908d3cd812a ("Input: synaptics-rmi4 - allow to add attention data")
Cc: stable@vger.kernel.org
Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-6-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - fix memory leak in rmi_set_attn_data()

kfifo_put() returns 0 if the FIFO is full. In this case, we must
free the memory allocated for the attention data to avoid a leak.

Fixes: b908d3cd812a ("Input: synaptics-rmi4 - allow to add attention data")
Cc: stable@vger.kernel.org
Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-5-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - initialize attn_fifo properly

attn_fifo is allocated as part of struct rmi_driver_data using
devm_kzalloc in rmi_driver_probe. However, it is never initialized.
A zero-initialized kfifo has its mask set to 0, which effectively
limits its capacity to 1 element instead of the declared 16.
This can lead to lost attention data and memory leaks of the attention
data payload if multiple attention events are received before the
threaded interrupt handler can process them.

Initialize attn_fifo using INIT_KFIFO after allocating rmi_driver_data.

Reported-by: sashiko-bot@kernel.org
Assisted-by: Antigravity:gemini-3.5-flash
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - fix num_subpackets overflow in register descriptor

RMI_REG_DESC_SUBPACKET_BITS is defined as 296 (37 * BITS_PER_BYTE). This
may overflow num_subpackets in struct rmi_register_desc_item which is
defined as a u8.

Fix this by changing the type of num_subpackets to u16.

Fixes: 2b6a321da9a2 ("Input: synaptics-rmi4 - add support for Synaptics RMI4 devices")
Cc: stable@vger.kernel.org
Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-4-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - fix type overflow in register counts

The number of registers in the RMI4 register descriptor is populated
by counting the bits in the presence map using bitmap_weight(). Since
the presence map can contain up to 256 bits (RMI_REG_DESC_PRESENSE_BITS),
storing this count in a u8 can overflow to 0 if all 256 bits are set.

Change the num_registers field in struct rmi_register_descriptor
from u8 to u16 to prevent potential integer overflow and ensure safe
processing of devices reporting large descriptors.

Fixes: 2b6a321da9a2 ("Input: synaptics-rmi4 - add support for Synaptics RMI4 devices")
Cc: stable@vger.kernel.org
Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-3-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - refactor register descriptor parsing

Factor out parsing a register descriptor item from
rmi_read_register_desc() and ensure there are no out-of-bounds accesses.

Use get_unaligned_le16() and get_unaligned_le32() for reading multi-byte
values.

Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 2b6a321da9a2 ("Input: synaptics-rmi4 - add support for Synaptics RMI4 devices")
Cc: stable@vger.kernel.org
Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-2-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: rmi4 - fix register descriptor address calculation

When reading the register descriptor, the base address is incremented by
1 to read the presence register block. However, after reading the
presence register block, the address is incorrectly incremented by only
1 byte (++addr) instead of the actual size of the presence block
(size_presence_reg). This causes the subsequent structure block read to
read from the wrong memory location if the presence block is larger than
1 byte.

Fix this by advancing the address by size_presence_reg.

Fixes: 2b6a321da9a2 ("Input: synaptics-rmi4 - add support for Synaptics RMI4 devices")
Cc: stable@vger.kernel.org
Assisted-by: Gemini:gemini-3.1-pro
Link: https://patch.msgid.link/20260505045952.1570713-1-dmitry.torokhov@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Merge tag 'drm-xe-fixes-2026-06-11' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes

UAPI Changes:

Cross-subsystem Changes:

Core Changes:

Driver Changes:
- fix oops in suspend/shutdown without display (Jani)
- RAS fixes (Raag)
- Use HW_ERR prefix in log (Raag)
- include all registered queues in TLB invalidation (Tangudu)
- Fix refcount leak in xe_range_tree in error paths (Wentao)
- fix job timeout recovery for unstarted jobs and kernel queues (Rodrigo)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/aitt8ZkYmxIT9cdP@gsse-cloud1.jf.intel.com

Merge tag 'drm-intel-fixes-2026-06-11' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes

- Check supported link rates DPCD read [edp] (Nikita Zhandarovich)
- Fix phys BO pread/pwrite with offset [gem] (Joonas Lahtinen)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Tvrtko Ursulin <tursulin@igalia.com>
Link: https://patch.msgid.link/aipkcUDnTlzre-8F@linux

ip6_tunnel: annotate data-races around t->err_count and t->err_time

ip6_tnl_xmit() and ipip6_tunnel_xmit() run locklessly (dev->lltx == true).

ip6gre_err() and ipip6_err() also run locklessly.

We need to add READ_ONCE() and WRITE_ONCE() annotations
around t->err_count and t->err_time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260610171458.1359630-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

crypto: tegra - fix refcount leak in tegra_se_host1x_submit()

The timeout error path in tegra_se_host1x_submit() returns without
calling host1x_job_put(), while all other paths (success, submit
error, pin error) properly release the job reference through the
job_put label. Since host1x_job_alloc() initializes the reference
count and host1x_job_put() is required to drop it, omitting it on
timeout causes a permanent refcount leak.

Fix this by redirecting the timeout return to the existing job_put
label, ensuring the job reference and any associated syncpt
references are consistently released.

Cc: stable@vger.kernel.org
Fixes: 0880bb3b00c8 ("crypto: tegra - Add Tegra Security Engine driver")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Reviewed-by: Akhil R <akhilrajeev@nvidia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: rng - Free default RNG on module exit

When the rng module is removed the default RNG will be leaked.
Call crypto_del_default_rng to free it if possible.

Fixes: 7cecadb7cca8 ("crypto: rng - Do not free default RNG when it becomes unused")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: testmgr - allow authenc(hmac(sha{256,384}),cts(cbc(aes))) in FIPS mode

hmac(sha256), hmac(sha384) and cts(cbc(aes)) algorithms have been
marked as FIPS allowed for years. Mark the respective authenc()
constructions per RFC 8009 ("AES Encryption with HMAC-SHA2 for
Kerberos 5") as such as well.

SP 800-57 Part 3 Rev. 1 from Jan 2015 [1] links the draft of what
became RFC 8009 in Oct 2016 as approved in section 6.3 Procurement
Guidance (item/recommendation 3).

[1] https://csrc.nist.gov/pubs/sp/800/57/pt3/r1/final

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

hwrng: jh7110 - fix refcount leak in starfive_trng_read()

The starfive_trng_read() function acquires a runtime PM reference
via pm_runtime_get_sync() but fails to release it on two error
paths. If starfive_trng_wait_idle() or starfive_trng_cmd() returns
an error, the function exits without calling
pm_runtime_put_sync_autosuspend(), leaving the runtime PM usage
counter permanently elevated and preventing the device from entering
runtime suspend.

Refactor the function to use a unified error path that calls
pm_runtime_put_sync_autosuspend() before returning.

Cc: stable@vger.kernel.org
Fixes: c388f458bc34 ("hwrng: starfive - Add TRNG driver for StarFive SoC")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: atmel-ecc - drop dead code in atmel_ecdh_max_size

atmel_ecdh_init_tfm() always allocates ctx->fallback, so it is never
NULL in atmel_ecdh_max_size(). Remove the dead code and return
crypto_kpp_maxsize() directly.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: cavium/cpt - fix DMA cleanup using wrong loop index

The sg_cleanup error path used list[i] instead of list[j] when unmapping
DMA buffers, leaking successfully mapped entries and repeatedly unmapping
the failed one.

Fixes: c694b233295b ("crypto: cavium - Add the Virtual Function driver for CPT")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: marvell/octeontx - fix DMA cleanup using wrong loop index

The sg_cleanup path used list[i] instead of list[j] when unmapping DMA
buffers, leaking successfully mapped entries and repeatedly unmapping
the failed one.

Fixes: 10b4f09491bf ("crypto: marvell - add the Virtual Function driver for CPT")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

MAINTAINERS: make myself the maintainer of the Qualcomm QCE driver

Qualcomm wants to keep supporting and extending the crypto engine driver.
Thara has not been active for many months, so change the maintainer to
myself and upgrade the driver to Supported.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Acked-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: amcc - convert irq_of_parse_and_map to platform_get_irq

Replace the deprecated irq_of_parse_and_map() call with the modern
platform_get_irq() in the probe function. This also improves error
handling: platform_get_irq() returns a negative errno on failure,
whereas irq_of_parse_and_map() returned 0.

Change the irq field in struct crypto4xx_core_device from u32 to int
to match the return type of platform_get_irq().

Assisted-by: opencode:big-pickle
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: sun4i-ss - Remove insecure and unused rng_alg

Remove sun4i_ss_rng, as it is insecure and unused:

- It has multiple vulnerabilities.  sun4i_ss_prng_seed() is missing
  locking and has a buffer overflow.  sun4i_ss_prng_generate() fails to
  fill the entire buffer with cryptographic random bytes, because it
  rounds the destination length down and also doesn't actually wait for
  the hardware to be ready before pulling bytes from it.

- No user of this code is known.  It's usable only theoretically via the
  "rng" algorithm type of AF_ALG.  But userspace actually just uses the
  actual Linux RNG (/dev/random etc) instead.  And rng_algs don't
  contribute entropy to the actual Linux RNG either.  (This may have
  been confused with hwrng, which does contribute entropy.)

The sun4i_ss_prng_seed() buffer overflow was reported by Tianchu Chen
and discovered by Atuin - Automated Vulnerability Discovery Engine

There's no point in fixing all these vulnerabilities individually when
this is unused code, so let's just remove it.

Fixes: b8ae5c7387ad ("crypto: sun4i-ss - support the Security System PRNG")
Cc: stable@vger.kernel.org
Reported-by: Tianchu Chen <flynnnchen@tencent.com>
Closes: https://lore.kernel.org/r/af749a8447bd7f0e9dd26ca6c87e9c6afecb09d9@linux.dev/
Acked-by: Corentin LABBE <clabbe.montjoie@gmail.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

hwrng: xilinx - Move xilinx-rng into drivers/char/hw_random/

Since this file just implements a hwrng driver, move it into
drivers/char/hw_random/. Rename the kconfig option accordingly as well.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

cxl/test: Add check after kzalloc() memory in alloc_mock_res()

alloc_mock_res() calls kzalloc() without checking the return value.
Add scope based resource management to deal with the allocated memory
cleanly.

Reported-by: sashiko-bot
Fixes: 67dcdd4d3b83 ("tools/testing/cxl: Introduce a mocked-up CXL port hierarchy")
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20260611230305.197390-1-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>

cxl/test: Unregister cxl_acpi in cxl_test_init() error path

In cxl_test_init(), Once cxl_mock_platform_device_add() succeeds, all
error paths after needs to call platform_device_unregister() instead of
platform_device_put() to clean up.

Fixes: 67dcdd4d3b83 ("tools/testing/cxl: Introduce a mocked-up CXL port hierarchy")
Reported-by: sashiko-bot
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20260611230355.198912-1-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>

perf stat: Fix false NMI watchdog warning in aggregation modes

In aggregation modes (e.g. --per-socket, --per-die, etc.), a counter
might not be scheduled or counted on specific aggregate groups if it was
not assigned to the CPUs belonging to those groups. However, the
printout() check triggers the "print_free_counters_hint" logic
unconditionally for any supported counter with a missing count. This
results in a false "Some events weren't counted. Try disabling the NMI
watchdog" warning.

Furthermore, the NMI watchdog only reserves performance counters on core
PMUs. Uncore PMU events (e.g. CHA, IMC) are not affected by the NMI
watchdog, but their failures also falsely triggered this warning.

This warning was originally introduced in commit 02d492e5dcb72c00 ("perf
stat: Issue a HW watchdog disable hint")

To fix this, restrict setting of print_free_counters_hint to only
trigger for core PMU events by checking counter->pmu and
counter->pmu->is_core.

Example before/after:

$ perf stat -M lpm_miss_lat --metric-only --per-socket -a -- sleep 1

Before:

Performance counter stats for 'system wide':

       ns  lpm_miss_lat_rem ns  lpm_miss_lat_loc
S0      126                202.3               207.9
S1      126                231.9               259.3

       1.006029831 seconds time elapsed

Some events weren't counted. Try disabling the NMI watchdog:
        echo 0 > /proc/sys/kernel/nmi_watchdog
        perf stat ...
        echo 1 > /proc/sys/kernel/nmi_watchdog

After:

Performance counter stats for 'system wide':

       ns  lpm_miss_lat_rem ns  lpm_miss_lat_loc
S0      126                202.3               207.9
S1      126                231.9               259.3

       1.006029831 seconds time elapsed

Assisted-by: Gemini:gemini-next
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Chun-Tse Shao <ctshao@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf test: Compile named_threads workload with -O0

The work loop relies on the compiler not optimizing it away, although
named_threads_work is not static for that reason, the compiler could
still do it.

Fix it by compiling without optimization. Also add -fno-inline for
consistency and in case anyone wants to look at callstacks.

Fixes: b5dd510be55e8670 ("perf test: Add named_threads workload")
Closes: https://lore.kernel.org/all/20260609160001.2739E1F00893@smtp.kernel.org
Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

Merge tag 'for-net-next-2026-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

core:
- hci_sync: Add support for HCI_LE_Set_Host_Feature [v2]
- SMP: Use AES-CMAC library API
- sockets: convert to getsockopt_iter
- Add SPDX id lines to some source files

drivers:
- btintel_pcie: Support Product level reset
- btintel_pcie: Add support for smart trigger dump
- btintel_pcie: Add 50 ms delay before MAC init on BlazarIW
- btintel_pcie: Separate coredump work from RX work
- btmtk: add event filter to filter specific event
- btrtl: fix RTL8761B/BU broken LE extended scan
- btusb: Add Realtek RTL8922AE VID/PID 0bda/d922
- btusb: Add Realtek RTL8922AE VID/PID 0bda/d923
- btusb: MT7922: Add VID/PID 0e8d/223c
- btusb: MT7925: Add VID/PID 0e8d/8c38
- btusb: Add support for TP-Link TL-UB250
- btusb: Add Mercusys MA530 for Realtek RTL8761BUV
- btusb: Add TP-Link UB600 for Realtek 8761BUV
- btusb: Add support for Intel Lizard Peak 2 (0x8087:0x0040)
- btusb: Add USB ID 2c4e:0128 for Mercusys MA60XNB
- btusb: MT7925: Add VID/PID 13d3/3609

* tag 'for-net-next-2026-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (49 commits)
  Bluetooth: btintel_pcie: Separate coredump work from RX work
  Bluetooth: btmtksdio: fix infinite loop in btmtksdio_txrx_work()
  Bluetooth: qca: Add BT FW build version to kernel log
  Bluetooth: vhci: validate devcoredump state before side effects
  Bluetooth: L2CAP: validate connectionless PSM length
  Bluetooth: hci: validate codec capability element length
  Bluetooth: L2CAP: Fix UAF in channel timeout by holding conn ref
  Bluetooth: btintel_pcie: Load IOSF debug regs by controller variant
  Bluetooth: btintel_pcie: Add 50 ms delay before MAC init on BlazarIW
  Bluetooth: Add SPDX id lines to some source files
  Bluetooth: btintel_pcie: Add support for smart trigger dump
  Bluetooth: hci_h5: reset hci_uart::priv in the close() method
  Bluetooth: btusb: clean up probe error handling
  Bluetooth: btusb: fix wakeup irq devres lifetime
  Bluetooth: btusb: fix wakeup source leak on probe failure
  Bluetooth: btusb: fix use-after-free on marvell probe failure
  Bluetooth: btusb: fix use-after-free on registration failure
  Bluetooth: btmtk: fix URB leak in alloc_mtk_intr_urb error path
  Bluetooth: hci_core: Fix UAF in hci_unregister_dev()
  Bluetooth: hci_event: fix simultaneous discovery stuck in FINDING
  ...
====================

Link: https://patch.msgid.link/20260611183358.176776-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'nfc-net-next-20260611' of https://codeberg.org/linux-nfc/linux

David Heidelberg says:

====================
NFC updates for net-next 20260611

- nxp-nci: Add ISO15693 support
- nxp-nci: treat -ENXIO in IRQ thread as no data available
- nci: uart: Constify struct tty_ldisc_ops
- trf7970a: fix comment typos
- Use named initializers for struct i2c_device_id
- MAINTAINERS: Update address for David Heidelberg

* tag 'nfc-net-next-20260611' of https://codeberg.org/linux-nfc/linux:
  MAINTAINERS: Update address for David Heidelberg
  nfc: Use named initializers for struct i2c_device_id
  nfc: nxp-nci: treat -ENXIO in IRQ thread as no data available
  nfc: nxp-nci: Add ISO15693 support
  nfc: nci: uart: Constify struct tty_ldisc_ops
  nfc: trf7970a: fix comment typos
====================

Link: https://patch.msgid.link/1aed7555-3d24-413c-b284-bc85fdd33055@ixit.cz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: mailbox: qcom: Add IPCC support for Maili Platform

Document the Inter-Processor Communication Controller on the Qualcomm
Maili Platform, which will be used to route interrupts across various
subsystems found on the SoC.

Signed-off-by: Chunkai Deng <chunkai.deng@oss.qualcomm.com>
Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>

io_uring/zcrx: kill dead 'sock' member in struct io_zcrx_args

This member is only ever assigned, never read. Kill it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'tipc-fix-netlink-gate-and-receive-path-bugs'

Michael Bommarito says:

====================
tipc: fix netlink gate and receive-path bugs

This is v4 of the public TIPC series. The only change from v3 is in
patch 1: TIPC_NL_MEDIA_SET now uses GENL_UNS_ADMIN_PERM like the other
mutators, instead of GENL_ADMIN_PERM, so the whole series uses the
namespace-aware CAP_NET_ADMIN check that matches the legacy TIPC netlink
path. Patches 2 and 3 are unchanged.

Patch 1 gives the TIPCv2 mutating generic-netlink operations the admin
gate the legacy API already has, so a local unprivileged process can no
longer change TIPC state. Patch 2 drops CONN_ACK messages that
acknowledge more outstanding sends than exist, preventing the
snt_unacked underflow. Patch 3 rejects peer bindings with lower > upper,
which would otherwise leak binding-table memory.
====================

Link: https://patch.msgid.link/20260610124003.3831170-1-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tipc: reject inverted service ranges from peer bindings

tipc_update_nametbl() inserts a binding advertised by a peer node using
the lower and upper service-range bounds taken directly from the wire,
without checking that lower <= upper. The local bind path validates the
ordering (tipc_uaddr_valid()), but the name-distribution path does not.

A binding with lower > upper is inserted at the far end of the
service-range rbtree (keyed on lower) where no lookup or withdrawal can
ever match it (service_range_foreach_match() requires sr->lower <= end).
The publication, its service_range node and the augmented rbtree entry
are then leaked for the lifetime of the namespace, and there is no
per-peer cap equivalent to TIPC_MAX_PUBL on locally created bindings.

Reject inverted ranges in the network path as well. A peer node can
otherwise leak unbounded binding-table memory by sending PUBLICATION
items with lower > upper.

Fixes: 37922ea4a310 ("tipc: permit overlapping service ranges in name table")
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20260610124003.3831170-4-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tipc: prevent snt_unacked underflow on CONN_ACK

tipc_sk_conn_proto_rcv() subtracts the peer-supplied connection ack count
from the unsigned 16-bit send counter snt_unacked without checking that it
does not exceed the number of messages actually outstanding:

tsk->snt_unacked -= msg_conn_ack(hdr);

msg_conn_ack() is read straight from a received CONN_MANAGER/CONN_ACK
message. If the ack count is larger than snt_unacked, the subtraction
wraps to a near-maximum value, leaving tsk_conn_cong() permanently true
and starving the connection of further transmits.

Validate the ACK count at the start of the CONN_ACK block and drop the
message if it acknowledges more messages than are outstanding. A peer (or,
for a local connection, the connected peer socket) can otherwise wedge a
TIPC connection's send side by sending an oversized connection ack.

Fixes: 10724cc7bb78 ("tipc: redesign connection-level flow control")
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20260610124003.3831170-3-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tipc: require net admin for TIPCv2 netlink mutators

TIPCv2 registers mutating generic-netlink operations without admin
permission flags. Generic netlink only checks CAP_NET_ADMIN when an
operation sets GENL_ADMIN_PERM or GENL_UNS_ADMIN_PERM, so a local
unprivileged process can currently change TIPC state through commands
such as TIPC_NL_NET_SET, TIPC_NL_KEY_SET, TIPC_NL_KEY_FLUSH, and
bearer enable/disable.

The legacy TIPC netlink API already checks netlink_net_capable(...,
CAP_NET_ADMIN) for administrative commands. Give the TIPCv2 mutators
the equivalent generic-netlink gate. Use GENL_UNS_ADMIN_PERM, which
maps to the same namespace-aware CAP_NET_ADMIN check that
netlink_net_capable() performs, so the behaviour matches the legacy
path and keeps working for CAP_NET_ADMIN holders in a non-initial user
namespace (containers).

A QEMU/KASAN repro run as uid/gid 65534 with zero effective
capabilities previously succeeded in changing the network id and node
identity, setting and flushing key material, and enabling/disabling a
UDP bearer. With this patch applied the same operations fail with
-EPERM.

Fixes: 0655f6a8635b ("tipc: add bearer disable/enable to new netlink api")
Link: https://lore.kernel.org/all/20260604163102.2658553-1-dominik.czarnota@trailofbits.com/
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20260610124003.3831170-2-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: simplify WAN device check in airoha_dev_init()

airoha_register_gdm_devices() iterates eth->ports[] in order, so GDM2's
netdev is always registered before GDM3/GDM4. This means the explicit
check for eth->ports[1] && eth->ports[1]->devs[0] is a redundant
special-case of what airoha_get_wan_gdm_dev() already covers, since
GDM2 is always marked as WAN during its own ndo_init.
Remove the redundant check and rely solely on airoha_get_wan_gdm_dev()
which handles both the GDM2-present and GDM2-absent cases.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20260610-airoha-eth-simplify-dev-init-v2-1-8f244e69b0d4@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: sch_hfsc: Don't make class passive twice

update_vf() is called from two places for the same class during a single
dequeue when the class's child qdisc (e.g. codel/fq_codel) drops its last
packets while dequeuing:

1. The child calls qdisc_tree_reduce_backlog(), which, now that the child
   is empty, invokes hfsc_qlen_notify() -> update_vf(cl, 0, 0) and turns
   the class passive (cl_nactive is decremented up the hierarchy).

2. hfsc_dequeue() then calls update_vf(cl, qdisc_pkt_len(skb), cur_time)
   to charge the dequeued bytes.

On the second call the class is already passive, but its child qdisc is
still empty, so update_vf() arms go_passive again:

      if (cl->qdisc->q.qlen == 0 && cl->cl_flags & HFSC_FSC)
              go_passive = 1;

The leaf is then skipped by the cl_nactive == 0 check inside the loop,
which does not clear go_passive, so the stale go_passive propagates to the
parent and decrements its cl_nactive a second time. A parent that still
has other active children is driven to cl_nactive == 0 and removed from
the vttree, even though those siblings are still backlogged. They are
never dequeued again and the qdisc stalls.

Fix this by only arming go_passive when the class is actually active, so an
already-passive class no longer triggers a second passive transition. The
byte accounting (cl->cl_total += len) still runs for every ancestor, so
dequeued bytes continue to be counted exactly once.

Fixes: 51eb3b65544c ("sch_hfsc: make hfsc_qlen_notify() idempotent")
Reported-by: Anirudh Gupta <anirudhrudr@gmail.com>
Closes: https://lore.kernel.org/netdev/CAN2cbVe79oj0O9==m4+4x3v+O+qzRagA=2=wkrp9i9=CqYvyZA@mail.gmail.com/
Tested-by: Anirudh Gupta <anirudhrudr@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260610132824.3027549-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Stop leased rxq before uninstalling its memory provider

netif_rxq_cleanup_unlease() tears down the memory provider that was
installed on a physical RX queue through a netkit queue lease. It
currently revokes the provider's DMA mappings before stopping the
physical queue:

__netif_mp_uninstall_rxq(virt_rxq, p); /* DMA unmap */
__netif_mp_close_rxq(phys_rxq->dev, rxq_idx, p); /* queue stop */

This inverts the ordering used by the regular teardown paths (normal
device unregister and the io_uring zcrx close path), which stop the
queue before revoking the provider's mappings.

With the physical queue still live, its NAPI can keep consuming
net_iov entries from the page_pool alloc cache after the
__netif_mp_uninstall_rxq() has already cleared their dma_addr,
opening a window for the device to DMA to a stale or zero address.

Fix it by swapping the two calls so the queue is stopped (and its
NAPI quiesced) before the provider is uninstalled. No functional
regression was observed across repeated runs of the nk_qlease.py
HW selftest, which exercises the lease teardown path; this was
tested against fbnic QEMU emulation.

Fixes: 5602ad61ebee ("net: Proxy netif_mp_{open,close}_rxq for leased queues")
Reported-by: Ahmed Abdelmoemen <ahmedabdelmoumen05@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Wei <dw@davidwei.uk>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260609212240.677889-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: fix refcount leak in mlxsw_sp_vrs_lpm_tree_replace()

When mlxsw_sp_vrs_lpm_tree_replace() fails after replacing some VRs,
the error rollback loop does not correctly revert the preceding
replacements. The loop decrements the index but fails to update the
vr pointer, which still points to the VR that caused the failure. As
a result, the condition and the rollback call always operate on the
same VR, potentially calling mlxsw_sp_vr_lpm_tree_replace() multiple
times on it while never rolling back the earlier VRs. Those VRs
continue to hold a reference to new_tree acquired via
mlxsw_sp_lpm_tree_hold(), leaking the reference count of new_tree.

Fix by reinitializing vr inside the error loop with the updated index:

vr = &mlxsw_sp->router->vrs[i];

so that the loop correctly iterates over all VRs that were actually
replaced.

Cc: stable@vger.kernel.org
Fixes: fc922bb0dd94 ("mlxsw: spectrum_router: Use one LPM tree for all virtual routers")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260609084730.215732-1-vulab@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: fix refcount leak in mlxsw_sp_port_lag_join()

When mlxsw_sp_port_lag_index_get() fails, mlxsw_sp_port_lag_join()
returns an error without releasing the lag reference obtained by
the earlier mlxsw_sp_lag_get(). All other error paths in the
function jump to the cleanup label that ends with
mlxsw_sp_lag_put(), so this is a single missed release.

Fix the leak by replacing the bare 'return err' with a goto to the
existing error cleanup label, which will drop the reference safely.

Cc: stable@vger.kernel.org
Fixes: 0d65fc13042f ("mlxsw: spectrum: Implement LAG port join/leave")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260609083709.209743-1-vulab@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ksz87xx-add-support-for-low-loss-cable-equalizer-errata'

Fidelio Lawson says:

====================
ksz87xx: add support for low-loss cable equalizer errata

This patch implements the KSZ87xx short cable erratum
described in Microchip document DS80000687C for KSZ87xx switches
and the following support article:

Link: https://support.microchip.com/s/article/Solution-for-Using-CAT-5E-or-CAT-6-Short-Cable-with-a-Link-Issue-for-the-KSZ8795-Family
According to the erratum, the embedded PHY receiver in KSZ87xx switches is
tuned by default for long, high-loss Ethernet cables. When operating with
short or low-loss cables (for example CAT5e or CAT6), the PHY equalizer may
over-amplify the incoming signal, leading to internal distortion and link
establishment failures.

Microchip documents two independent mechanisms to mitigate this issue:
adjusting the receiver low‑pass filter bandwidth and reducing the DSP
equalizer initial value. These registers are located in the switch’s
internal LinkMD table and cannot be accessed directly through a
stand‑alone PHY driver.

To keep the PHY‑facing API clean, this series models the erratum handling
as vendor‑specific Clause 22 PHY registers, virtualized by the KSZ8 DSA
driver. Accesses are intercepted by ksz8_r_phy() / ksz8_w_phy() and
translated into the appropriate indirect LinkMD register writes. The
erratum affects the shared PHY analog front‑end and therefore applies
globally to the switch.

Based on review feedback, the user‑visible interface is kept deliberately
simple and predictable:

- A boolean “short‑cable” PHY tunable applies a documented and
  conservative preset (LPF bandwidth 62MHz, DSP EQ initial value 0).
  This is the recommended KISS interface for the common short‑cable
  scenario.

- Two additional integer PHY tunables allow advanced or experimental
  tuning of the LPF bandwidth and the DSP EQ initial value. These
  controls are orthogonal, have no ordering requirements, and simply
  override the corresponding setting when written.

The tunables act as simple setters with no implicit state machine or
invalid combinations, avoiding surprises for userspace and not relying
on extended error reporting or netlink ethtool support.

This series contains:

  1. Support for the KSZ87xx low‑loss cable erratum in the KSZ8 DSA driver,
     including the short‑cable preset and orthogonal tuning controls.

  2. Addition of vendor‑specific PHY tunable identifiers for the
     short‑cable preset, LPF bandwidth, and DSP EQ initial value.

  3. Exposure of these tunables through the Micrel PHY driver via
     get_tunable / set_tunable callbacks.

This version follows the design agreed upon during v3 review and
reworks the interface accordingly.
====================

Link: https://patch.msgid.link/20260609-ksz87xx_errata_low_loss_connections-v10-0-9ba4418cf3db@exotec.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: micrel: expose KSZ87xx low-loss cable tunables

Add support for the KSZ87xx low-loss cable PHY tunables in the Micrel
PHY driver by implementing get_tunable and set_tunable callbacks.

These callbacks expose vendor-specific PHY tunables used to control the
KSZ87xx embedded PHY receiver behavior when operating with short or
low-loss Ethernet cables. The tunables provide:

- a boolean short-cable preset applying known good settings;
- an integer LPF bandwidth control;
- an integer DSP EQ initial value control.

The Micrel PHY driver forwards these tunables via standard phy_read() /
phy_write() operations, which are virtualized by the KSZ8 DSA driver and
translated into the appropriate indirect switch register accesses.

Reviewed-by: Marek Vasut <marex@nabladev.com>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Signed-off-by: Fidelio Lawson <fidelio.lawson@exotec.com>
Link: https://patch.msgid.link/20260609-ksz87xx_errata_low_loss_connections-v10-3-9ba4418cf3db@exotec.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethtool: add KSZ87xx low-loss cable PHY tunables

Introduce vendor-specific PHY tunable identifiers to control the
KSZ87xx low-loss cable erratum handling through the ethtool PHY
tunable interface.

The following tunables are added:

- a boolean "short-cable" tunable, applying a documented and
  conservative preset intended for short or low-loss Ethernet cables;

- an integer LPF bandwidth tunable, allowing advanced adjustment of the
  receiver low-pass filter bandwidth;

- an integer DSP EQ initial value tunable, allowing advanced tuning of
  the PHY equalizer initialization.

The actual behavior is implemented by the corresponding PHY and switch
drivers.

Reviewed-by: Marek Vasut <marex@nabladev.com>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Signed-off-by: Fidelio Lawson <fidelio.lawson@exotec.com>
Link: https://patch.msgid.link/20260609-ksz87xx_errata_low_loss_connections-v10-2-9ba4418cf3db@exotec.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: implement KSZ87xx Module 3 low-loss cable errata

Implement the KSZ87xx short cable workaround.

This patch implements the KSZ87xx short cable erratum
described in Microchip document DS80000687C for KSZ87xx switches
and the following support article:

Link: https://support.microchip.com/s/article/Solution-for-Using-CAT-5E-or-CAT-6-Short-Cable-with-a-Link-Issue-for-the-KSZ8795-Family
The issue affects short or low-loss cable links (e.g. CAT5e/CAT6),
where the PHY receiver equalizer may amplify high-amplitude signals
excessively, resulting in internal distortion and link establishment
failures.

KSZ87xx devices require a workaround for the Module 3 low-loss cable
condition, controlled through the switch TABLE_LINK_MD_V indirect
registers.

This change models the erratum handling as vendor-specific Clause 22 PHY
registers, virtualized by the KSZ8 DSA driver and accessed via
ksz8_r_phy() / ksz8_w_phy(). The following controls are provided:

- A boolean “short-cable” preset, which applies a documented and
  conservative configuration (LPF 62 MHz bandwidth and DSP EQ initial
  value 0), and is the recommended interface for typical use cases.

- Separate LPF bandwidth and DSP EQ initial value controls intended for
  advanced or experimental tuning. These are orthogonal and independent,
  and override the corresponding settings without requiring any specific
  ordering.

The preset and tunables act as simple setters with no implicit state
machine or invalid combinations, keeping the API predictable and aligned
with the KISS principle.

The erratum affects the shared PHY analog front-end and therefore applies
globally to the switch.

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Marek Vasut <marex@nabladev.com>
Signed-off-by: Fidelio Lawson <fidelio.lawson@exotec.com>
Link: https://patch.msgid.link/20260609-ksz87xx_errata_low_loss_connections-v10-1-9ba4418cf3db@exotec.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net/openvswitch: add flow modify test

Add mod_flow() and the mod-flow CLI command to ovs-dpctl.py, exercising
OVS_FLOW_CMD_SET. Add test_flow_set which first modifies an existing
flow with new actions and verifies the change via traffic, then modifies
the same flow without actions and verifies the kernel handles the
no-actions case gracefully.

The no-actions path is unreachable from userspace OVS tools (dpctl
mod-flow requires actions) but reachable via raw netlink. This is the
code path where Adrian Moreno found a possible kfree_skb of ERR_PTR
when reply allocation fails after locking.

Make parse() skip OVS_FLOW_ATTR_ACTIONS when actstr is None so the
kernel enters the post-lock allocation branch in ovs_flow_cmd_set().
After the no-actions set, verify via dump-flows that the flow retained
its drop action.

Suggested-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Minxi Hou <houminxi@gmail.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20260609165725.107484-1-houminxi@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bcmgenet: convert RX path to page_pool

Replace the per-packet __netdev_alloc_skb() + dma_map_single() in the
RX path with page_pool. SKBs are built from pool pages via
napi_build_skb() with skb_mark_for_recycle() so the network stack
returns pages to the pool, and DMA mapping happens once per page
instead of once per packet.

Reject HW-reported lengths smaller than the RSB so a runt cannot
underflow the SKB build path.

Drop the now-unused priv->rx_buf_len field and the rx_dma_failed soft
MIB counter (nothing increments it after the conversion). This
removes the "rx_dma_failed" entry from ethtool -S, which is a
user-visible change for monitoring tools that key on stat names.

Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Justin Chen <justin.chen@broadcom.com>
Tested-by: Justin Chen <justin.chen@broadcom.com>
Link: https://patch.msgid.link/20260610114835.2225423-1-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: move get_sport() callback at the beginning of airoha_enable_gdm2_loopback()

Move the get_sport() callback invocation at the beginning of
airoha_enable_gdm2_loopback() routine in order to avoid leaving the
hardware in a partially configured state if get_sport() fails.
Previously, get_sport() was called after GDM2 forwarding, loopback,
channel, length, VIP and IFC registers had already been programmed.
A failure at that point would return an error leaving GDM2 with
loopback enabled but WAN port, PPE CPU port and flow control mappings
not configured.
Performing the get_sport() lookup before any register write guarantees
the routine either completes the full configuration sequence or exits
with no side effects on the hardware.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260608-airoha_enable_gdm2_loopback-minor-change-v1-1-1787a0f42b31@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mptcp-pm-drop-tcp-ts-with-add_addrv6-port'

Matthieu Baerts says:

====================
mptcp: pm: drop TCP TS with ADD_ADDRv6 + port

Up to this series, it was possible to add a "signal" MPTCP endpoint with
an IPv6 address and a port, or to directly request to send an ADD_ADDR
with a v6 address and a port, but the expected ADD_ADDR wasn't sent when
TCP timestamps was used for the connection.

In fact, such signalling option cannot be sent when TCP timestamps is
used due to a lack of option space: the limit is at 40 bytes, and, with
padding, TCP timestamps is taking 12 bytes, while an ADD_ADDR IPv6 +
port is taking 30 bytes. The selected solution here is to simply drop
the TCP timestamps option when such ADD_ADDR of 30 bytes needs to be
sent.

- Patches 1-3: small cleanups to avoid computing ADD/RM_ADDR twice.

- Patches 4-7: the new feature, controlled by a new sysctl knob.

- Patch 8: extra checks in the MPTCP Join selftests.

- Patches 9-15: A bunch of refactoring: renamed confusing helpers and
variables, and prevent future misused functions.
====================

Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-0-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: options: rst: drop unused skb parameter

It was passed since its introduction in commit dc87efdb1a5c ("mptcp: add
mptcp reset option support"), but never used.

Simply removes it.

Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-15-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: avoid using del_timer directly

mptcp_pm_announced_del_timer() removes the matched ADD_ADDR entry (if
found) from the ADD_ADDR list only if check_id is false. That's
dangerous, and not clear, because it means the caller should be free the
entry only in some cases, and it easy to miss that.

Instead, make it static, and call it from mptcp_pm_add_addr_echoed,
which is the only other case where mptcp_pm_add_addr_del_timer should be
called with check_id set to true. Bonus with that: a second call to
mptcp_pm_add_addr_lookup_by_addr() can be avoided.

Note that instead of adding the signature above to avoid a compilation
issue because this helper is called before the definition of the
function, the whole helper is moved above where it is first called. Its
content is untouched, except the addition of the 'static' keyboard.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-14-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: make mptcp_pm_add_addr_send_ack static

Only used in pm.c.

Note that the signature is added above: it is easier than moving the
code around, because this helper depends on mptcp_pm_schedule_work which
is declared below.

While at it, explicitly mark it as to be called while pm->lock is held.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-13-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: remove add_ prefix from timer

Similar to the two previous commits, using the 'add' prefix is
confusing, also confirmed by [1].

Now that the structure has been renamed to include 'add_addr' in its
name, easier to know the timer is linked to the ADD_ADDR, no need to
add the confusing prefix, or an unneeded longer one.

While at it, also update the ADD_ADDR timer helper to clearly specify it
is linked to ADD_ADDR, and it is not there to add a new timer.

Link: https://lore.kernel.org/20251117100745.1913963-1-edumazet@google.com
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-12-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: uniform announced addresses helpers

Similar to the previous commit, only using the 'add' or 'anno' prefixes
is confusing -- generally associated to the action of adding something,
or the Latin name for "year" -- and lack of uniformity.

This has been causing issues in the past, e.g. del_add_timer seemed to
suggest the goal is to delete a previously added timer.

Instead, use the mptcp_pm_announced_ prefix.

While at it, slightly improves some helpers:

- mptcp_lookup_anno_list_by_saddr: no need to specify what is used to do
  the lookup: mptcp_pm_announced_lookup.

- mptcp_pm_sport_in_anno_list: it doesn't just compare the port, but the
  whole address linked to the sublow: mptcp_pm_announced_has_ssk.

- mptcp_pm_alloc_anno_list: it allocates one item of the list, not a
  whole list: mptcp_pm_announced_alloc.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-11-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: rename add_entry structure to add_addr

Using only the 'add' prefix is confusing: does it refer to a generic
added entry or address, or specifically to ADD_ADDRs. Using add_addr
removes this confusion.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-10-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: use for_each_subflow helper

Similar to most places in the MPTCP code. So instead of passing the
subflow list and use list_for_each_entry(subflow, list, node), pass the
msk and use mptcp_for_each_subflow(msk, subflow).

That's clearer and more uniform with the rest.

While at it, add 'pm_' prefix for the exported one to easily identify
the origin. Plus replace 'lookup' by 'has', because a bool is returned.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-9-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: always check sent/dropped ADD_ADDRs

Before, they were only checked on demand, but it seems better to check
them each time received ADD_ADDRs are checked.

Errors are only reported when the counter exists, and the value is not
the expected one. This is similar to what is done in chk_join_nr: it
reduces the output, and avoids a lot of 'skip' when validating older
kernels. Also here, some tests need to adapt the default expected
counters, e.g. when ADD_ADDR echo are dropped on the reception side, or
it is not possible to send an ADD_ADDR due to the limited option space.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-8-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>