git.ipfire.org Git - thirdparty/kernel/linux.git/log

arm64: dts: aspeed: Fix duplicate pinctrl labels and address scheme

A report from shashiko-bot highlighted some concerns concurrent to
application of the series[1].

Fix duplicate pinctrl_tach{0-15} and pinctrl_n{cts,dcd,dsr,ri}5 labels
in aspeed-g7-soc1-pinctrl.dtsi. These didn't cause errors from dtc
because dtc accepts duplicate labels for duplicate nodes specified
through a node reference[2].

Drop the cpu-index from secondary/tertiary container nodes: reduce
the "#address-cells" from 2 to 1 and update unit-addresses and reg
accordingly. The 2-cell scheme was proposed in an early mailing list
sketch to prompt discussion[3], but the design evolved in ways that made
it unnecessary.

Also remove URL comments from the DTS. The links were to comments in
the kernel sources with discussion justifying the approach, but are not
necessary to carry forward.

[arj: Extend discussion in the commit message]

Link: https://lore.kernel.org/all/20260609025708.ADBFE1F00893@smtp.kernel.org/ [1]
Link: https://lore.kernel.org/all/b226339bb2abe42ce23e90eadbc654b426131083.camel@codeconstruct.com.au/ [2]
Link: https://lore.kernel.org/all/1a2ca78746e00c2ec4bfc2953a897c48376ed36f.camel@codeconstruct.com.au/ [3]
Suggested-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Fixes: e77bb5dc5759 ("arm64: dts: aspeed: Add initial AST27xx SoC device tree")
Signed-off-by: Ryan Chen <ryan_chen@aspeedtech.com>
Link: https://patch.msgid.link/20260611-dtsi_fix-v1-1-ef2b7cd86d6d@aspeedtech.com
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Link: https://lore.kernel.org/r/20260612-aspeed-arm64-dt-v1-1-d1d1a4737905@codeconstruct.com.au
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

serial: 8250_pci: Don't specify conflicting values to pci_device_id members

The PCI_VDEVICE macro assigns 0 to .class and .class_mask to allow the
next value in the initializer to define the value for .driver_data.

So the construct

	{
		PCI_VDEVICE(INTASHIELD, 0x0D21),
		.class = PCI_CLASS_COMMUNICATION_MULTISERIAL << 8,
		.class_mask = 0xffff00,
		.driver_data = pbn_b2_4_115200,
	},

introduced in commit 44e55f1f3088 ("serial: 8250_pci: Consistently
define pci_device_ids using named initializers") has conflicting
assignments. In only some configurations (i.e. W=1 for me) that makes
the compiler unhappy.

So convert the two affected items to PCI_DEVICE, which doesn't have that
hidden assignment to .class and .class_mask.
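
The conversion can be sketched outside the kernel as follows. This is a
minimal illustration, not the driver's code: the struct and both macros
are simplified stand-ins modelled on the kernel definitions, and the
vendor ID, class code and board index values are assumptions made only
so the example compiles.

```c
#include <stdint.h>

/* Simplified stand-in for the kernel's struct pci_device_id. */
struct pci_device_id {
	uint32_t vendor, device;
	uint32_t subvendor, subdevice;
	uint32_t class, class_mask;
	unsigned long driver_data;
};

#define PCI_ANY_ID				(~0u)
/* Values assumed for illustration only. */
#define PCI_VENDOR_ID_INTASHIELD		0x135a
#define PCI_CLASS_COMMUNICATION_MULTISERIAL	0x0702
enum { pbn_b2_4_115200 = 1 };	/* hypothetical board index */

/*
 * Stand-ins for the two macros: PCI_VDEVICE ends with two positional
 * zeros that land in .class and .class_mask, PCI_DEVICE stops after
 * .subdevice.
 */
#define PCI_DEVICE(vend, dev) \
	.vendor = (vend), .device = (dev), \
	.subvendor = PCI_ANY_ID, .subdevice = PCI_ANY_ID
#define PCI_VDEVICE(vend, dev) \
	.vendor = PCI_VENDOR_ID_##vend, .device = (dev), \
	.subvendor = PCI_ANY_ID, .subdevice = PCI_ANY_ID, 0, 0

/* Every member is initialised exactly once, so nothing is overridden. */
const struct pci_device_id fixed_entry = {
	PCI_DEVICE(PCI_VENDOR_ID_INTASHIELD, 0x0D21),
	.class = PCI_CLASS_COMMUNICATION_MULTISERIAL << 8,
	.class_mask = 0xffff00,
	.driver_data = pbn_b2_4_115200,
};

/* PCI_VDEVICE stays unproblematic when only .driver_data follows it. */
const struct pci_device_id vdevice_entry = {
	PCI_VDEVICE(INTASHIELD, 0x0D21),
	.driver_data = pbn_b2_4_115200,
};
```

Writing .class and .class_mask after PCI_VDEVICE would initialise both
members twice, which is the conflict described above.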

Fixes: 44e55f1f3088 ("serial: 8250_pci: Consistently define pci_device_ids using named initializers")
Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Closes: https://lore.kernel.org/linux-serial/ah_5qVKOf8LXG1Xo@ashevche-desk.local/T/#ma6eab90ca801b4292639f5c255a89b4033b33d21
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20260603095616.937968-2-u.kleine-koenig@baylibre.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

vc_screen: fix null-ptr-deref in vcs_notifier() during concurrent vcs_write

A KASAN null-ptr-deref was observed in vcs_notifier():

BUG: KASAN: null-ptr-deref in vcs_notifier+0x98/0x130
Read of size 2 at addr [...]

The issue is a race condition in vcs_write(). When the console_lock is
temporarily dropped (to copy data from userspace), the vc_data pointer
obtained from vcs_vc() may become stale. After re-acquiring the lock,
vcs_vc() is called again to re-validate the pointer. If the vc has been
deallocated in the meantime, vcs_vc() returns NULL, and the while loop
breaks (with written > 0). However, after the loop, vcs_scr_updated(vc)
is still called with the now-NULL vc pointer, leading to a null pointer
dereference in the notifier chain (vcs_notifier dereferences param->vc).

Fix this by adding a NULL check for vc before calling vcs_scr_updated().
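
The shape of the race and of the guard can be sketched in userspace.
All names below (lookup_vc, screen_updated, write_chunks) are
hypothetical mocks rather than the kernel functions, and the
"deallocation" is simulated by a parameter instead of a second task.

```c
#include <stddef.h>

struct vc_data { int updates; };		/* mock console state */

struct vc_data console;
static struct vc_data *current_vc = &console;	/* NULL once "deallocated" */

/* Mock of vcs_vc(): returns the live console, or NULL if it is gone. */
static struct vc_data *lookup_vc(void)
{
	return current_vc;
}

/* Mock of vcs_scr_updated(): dereferences vc, so vc must not be NULL. */
static void screen_updated(struct vc_data *vc)
{
	vc->updates++;
}

/*
 * Write up to `chunks` chunks. `dealloc_at` names the chunk during which
 * the console disappears while the lock is dropped (negative: never).
 * Returns the number of chunks written.
 */
int write_chunks(int chunks, int dealloc_at)
{
	struct vc_data *vc = lookup_vc();
	int written = 0;

	while (vc && written < chunks) {
		/* lock dropped here: another task may free the console */
		if (written == dealloc_at)
			current_vc = NULL;
		/* lock re-taken: re-validate the pointer */
		vc = lookup_vc();
		if (!vc)
			break;
		written++;
	}

	/* The guard: written may be > 0 while vc is already NULL. */
	if (written && vc)
		screen_updated(vc);
	return written;
}
```

Without the `vc` test in the final condition, a call such as
write_chunks(3, 1) would pass NULL to screen_updated().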

Fixes: 8fb9ea65c9d1 ("vc_screen: reload load of struct vc_data pointer in vcs_write() to avoid UAF")
Cc: stable@vger.kernel.org
Signed-off-by: Yi Yang <yiyang13@huawei.com>
Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://patch.msgid.link/20260604060734.2914976-1-yiyang13@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

serial: qcom_geni: Fix RX DMA stall when SE_DMA_RX_LEN_IN is zero

In qcom_geni_serial_handle_rx_dma(), geni_se_rx_dma_unprep() clears
port->rx_dma_addr before SE_DMA_RX_LEN_IN is read. If the register is zero,
for example when the RX stale counter fires on an idle line, the handler
returns without calling geni_se_rx_dma_prep().

The next RX DMA interrupt then hits the !port->rx_dma_addr guard and
returns immediately, so the RX DMA buffer is never rearmed and later input
is lost.

Keep the handler on the rearm path when rx_in is zero. Warn about the
unexpected zero-length DMA completion, skip received-data handling, and
always call geni_se_rx_dma_prep().
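
A minimal sketch of the resulting control flow, using a mock port
structure. The helper names and the fake DMA address are assumptions for
illustration; only rx_dma_addr and the rx_in value (standing for the
content of SE_DMA_RX_LEN_IN) come from the description above.

```c
#include <stdint.h>

/* Mock port state. */
struct port {
	uintptr_t rx_dma_addr;	/* 0 means "no RX DMA buffer armed" */
	unsigned int received;	/* bytes handed to the upper layer */
	unsigned int warnings;	/* zero-length completions seen */
};

static void rx_dma_unprep(struct port *p) { p->rx_dma_addr = 0; }
static void rx_dma_prep(struct port *p)   { p->rx_dma_addr = 0x1000; }

/* rx_in stands for the value read from SE_DMA_RX_LEN_IN. */
void handle_rx_dma(struct port *p, uint32_t rx_in)
{
	if (!p->rx_dma_addr)
		return;			/* nothing armed, nothing to do */

	rx_dma_unprep(p);		/* clears rx_dma_addr */

	if (rx_in)
		p->received += rx_in;	/* normal data path */
	else
		p->warnings++;		/* unexpected, but not fatal */

	rx_dma_prep(p);			/* always rearm, even when rx_in == 0 */
}
```

Returning early in the rx_in == 0 case instead would leave rx_dma_addr
at zero, so every later call would stop at the first guard.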

Fixes: 2aaa43c70778 ("tty: serial: qcom-geni-serial: add support for serial engine DMA")
Cc: stable@vger.kernel.org
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Viken Dadhaniya <viken.dadhaniya@oss.qualcomm.com>
Link: https://patch.msgid.link/20260528-serial-rx-0-byte-fix-v2-1-b4195cfe342f@oss.qualcomm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ata: Use named initializers for pci_device_id arrays

While less compact, named initializers make it easier to see which
member of a struct is assigned which value without having to look up
the declaration of the struct. They are also more robust against
changes to the struct definition.

The mentioned robustness is relevant for a planned change to struct
pci_device_id that replaces .driver_data by an anonymous union.

Also drop the comma after a few list terminators.

This patch doesn't modify the compiled arrays; only their representation
in source form benefits. The former was confirmed with x86 and arm64
builds.
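
The claim that both spellings compile to the same data can be checked in
a small userspace sketch. The struct is a simplified stand-in and the ID
values are made up for illustration.

```c
#include <string.h>

/* Simplified stand-in for struct pci_device_id. */
struct id {
	unsigned int vendor, device;
	unsigned long driver_data;
};

/* Positional form: the reader must know the member order. */
const struct id positional[] = {
	{ 0x8086, 0x1234, 2 },
	{ 0 }
};

/* Named form: same values, members spelled out. */
const struct id named[] = {
	{ .vendor = 0x8086, .device = 0x1234, .driver_data = 2 },
	{ 0 }
};

/* Returns 1 when both arrays have identical size and contents. */
int same_bytes(void)
{
	return sizeof(positional) == sizeof(named) &&
	       memcmp(positional, named, sizeof(named)) == 0;
}
```

If a member were later inserted before driver_data, the positional
entry would silently assign 2 to the wrong member, while the named entry
would keep its meaning.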

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Signed-off-by: Niklas Cassel <cassel@kernel.org>

ata: Drop unused assignments of pci_device_id driver data

The drivers explicitly set the .driver_data member of struct
pci_device_id to zero without relying on that value. Drop these unused
assignments.

While touching these arrays, convert the one driver not using PCI_DEVICE
to use that macro and align the array's coding style to what is used
most for these (i.e. break very long lines, a single space in the list
terminator and no trailing comma).

This patch doesn't modify the compiled array, only its representation in
source form benefits. The former was confirmed with builds on x86 and
arm64.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Signed-off-by: Niklas Cassel <cassel@kernel.org>

s390: Revert support for DCACHE_WORD_ACCESS

load_unaligned_zeropad() reads eight bytes from unaligned addresses and may
cross page boundaries. It handles the exception that may occur when the
read from the second page faults.

For pages which are donated to the Ultravisor for secure execution purposes
the do_secure_storage_access() exception handler however does not handle
such exceptions correctly. Such an exception may result in an endless
exception loop which will never be resolved.

An attempt to fix this [1] turned out to be not sufficient. For now revert
load_unaligned_zeropad() until this problem has been resolved in a proper
way.

Note that the implementation of load_unaligned_zeropad() itself is
correct. The revert is just a temporary workaround until there is a
complete fix for secure storage access exceptions.

[1] commit b00be77302d7 ("s390/mm: Add missing secure storage access fixups for donated memory")

Fixes: 802ba53eefc5 ("s390: add support for DCACHE_WORD_ACCESS")
Cc: stable@vger.kernel.org
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

Merge branch 'slab/for-7.2/alloc_token' into slab/for-next

Merge series "slab: support for compiler-assisted type-based slab cache
partitioning" from Marco Elver. From the cover letter [6]:

Rework the general infrastructure around RANDOM_KMALLOC_CACHES into more
flexible KMALLOC_PARTITION_CACHES, with the former being a partitioning
mode of the latter.

Introduce a new mode, KMALLOC_PARTITION_TYPED, which leverages a feature
available in Clang 22 and later, called "allocation tokens" via
__builtin_infer_alloc_token() [1]. Unlike KMALLOC_PARTITION_RANDOM
(formerly RANDOM_KMALLOC_CACHES), this mode deterministically assigns a
slab cache to an allocation of type T, regardless of allocation site.

The builtin __builtin_infer_alloc_token(<malloc-args>, ...) instructs
the compiler to infer an allocation type from arguments commonly passed
to memory-allocating functions and returns a type-derived token ID. The
implementation passes kmalloc-args to the builtin: the compiler performs
best-effort type inference, and then recognizes common patterns such as
`kmalloc(sizeof(T), ...)`, `kmalloc(sizeof(T) * n, ...)`, but also
`(T *)kmalloc(...)`. Where the compiler fails to infer a type the
fallback token (default: 0) is chosen.

Note: kmalloc_obj(..) APIs fix the pattern of how size and result type
are expressed, and therefore ensure there's not much drift in which
patterns the compiler needs to recognize. Specifically, kmalloc_obj()
and friends expand to `(TYPE *)KMALLOC(__obj_size, GFP)`, which the
compiler recognizes via the cast to TYPE*.

Clang's default token ID calculation is described as [1]:

   typehashpointersplit: This mode assigns a token ID based on the hash
   of the allocated type's name, where the top half ID-space is reserved
   for types that contain pointers and the bottom half for types that do
   not contain pointers.
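
The split described in the quote can be sketched as follows. This is an
illustration only: the hash (FNV-1a) and the ID-space size of 16 are
assumptions chosen for the example and are not what Clang uses.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_TOKENS 16u	/* assumed size of the token ID space */

/* FNV-1a, picked only for illustration; not the hash Clang uses. */
static uint32_t hash_name(const char *s)
{
	uint32_t h = 2166136261u;

	while (*s) {
		h ^= (unsigned char)*s++;
		h *= 16777619u;
	}
	return h;
}

/*
 * Bottom half of the ID space: types without pointers.
 * Top half: types that contain pointers.
 */
unsigned int token_for(const char *type_name, bool contains_pointer)
{
	unsigned int half = NUM_TOKENS / 2;
	unsigned int slot = hash_name(type_name) % half;

	return contains_pointer ? half + slot : slot;
}
```

The token depends only on the type name and the pointer property, so two
allocation sites of the same type always map to the same ID.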

Separating pointer-containing objects from pointerless objects and data
allocations can help mitigate certain classes of memory corruption
exploits [2]: attackers who gain a buffer overflow on a primitive
buffer cannot use it to directly corrupt pointers or other critical
metadata in an object residing in a different, isolated heap region.

It is important to note that heap isolation strategies offer a
best-effort approach, and do not provide a 100% security guarantee,
albeit achievable at relatively low performance cost. Note that this
also does not prevent cross-cache attacks: while waiting for future
features like SLAB_VIRTUAL [3] to provide physical page isolation, this
feature should be deployed alongside SHUFFLE_PAGE_ALLOCATOR and
init_on_free=1 to mitigate cross-cache attacks and page-reuse attacks as
much as possible today.

With all that, my kernel (x86 defconfig) shows me a histogram of slab
cache object distribution per /proc/slabinfo (after boot):

  <slab cache>      <objs> <hist>
  kmalloc-part-15    1465  ++++++++++++++
  kmalloc-part-14    2988  +++++++++++++++++++++++++++++
  kmalloc-part-13    1656  ++++++++++++++++
  kmalloc-part-12    1045  ++++++++++
  kmalloc-part-11    1697  ++++++++++++++++
  kmalloc-part-10    1489  ++++++++++++++
  kmalloc-part-09     965  +++++++++
  kmalloc-part-08     710  +++++++
  kmalloc-part-07     100  +
  kmalloc-part-06     217  ++
  kmalloc-part-05     105  +
  kmalloc-part-04    4047  ++++++++++++++++++++++++++++++++++++++++
  kmalloc-part-03     183  +
  kmalloc-part-02     283  ++
  kmalloc-part-01     316  +++
  kmalloc            1422  ++++++++++++++

The above /proc/slabinfo snapshot shows me there are 6673 allocated
objects (slabs 00 - 07) that the compiler either claims contain no
pointers or could not infer a type for, and 12015 objects that contain
pointers (slabs 08 - 15). On the whole, this looks relatively sane.
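
The two totals can be reproduced from the histogram. The sketch below
copies the object counts from the table and treats the plain "kmalloc"
cache as slab 00, which is the reading under which the totals match.

```c
/* Object counts from the histogram; index 0 is the plain "kmalloc" cache. */
static const unsigned int objs[16] = {
	1422, 316, 283, 183, 4047, 105, 217, 100,	/* slabs 00 - 07 */
	710, 965, 1489, 1697, 1045, 1656, 2988, 1465,	/* slabs 08 - 15 */
};

/* Sum the object counts of slabs first..last (inclusive). */
unsigned int sum_range(unsigned int first, unsigned int last)
{
	unsigned int total = 0;

	for (unsigned int i = first; i <= last; i++)
		total += objs[i];
	return total;
}
```

sum_range(0, 7) gives 6673 and sum_range(8, 15) gives 12015.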

Additionally, when I compile my kernel with -Rpass=alloc-token, which
provides diagnostics where (after dead-code elimination) type inference
failed, I see 186 allocation sites where the compiler failed to identify
a type (down from 966 when I sent the RFC [4]). Some initial review
confirms these are mostly variable sized buffers, but also include
structs with trailing flexible length arrays.

Link: https://clang.llvm.org/docs/AllocToken.html [1]
Link: https://blog.dfsec.com/ios/2025/05/30/blasting-past-ios-18/ [2]
Link: https://lwn.net/Articles/944647/ [3]
Link: https://lore.kernel.org/all/20250825154505.1558444-1-elver@google.com/ [4]
Link: https://discourse.llvm.org/t/rfc-a-framework-for-allocator-partitioning-hints/87434 [5]
Link: https://lore.kernel.org/all/20260511200136.3201646-1-elver@google.com/ [6]

Merge branch 'slab/for-7.2/alloc_bulk' into slab/for-next

Merge two separately sent but vaguely related patches from Christoph
Hellwig. One changes the kmem_cache_alloc_bulk() API to return bool,
because it was already acting as all-or-nothing, and that aspect was
not documented. Existing callers are updated.
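
The all-or-nothing contract can be illustrated with a userspace
analogue. alloc_bulk() and free_bulk() below are hypothetical helpers
built on malloc(), not the slab API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical analogue: fill all n slots, or none of them. */
bool alloc_bulk(size_t obj_size, size_t n, void **objs)
{
	for (size_t i = 0; i < n; i++) {
		objs[i] = malloc(obj_size);
		if (!objs[i]) {
			/* Undo the partial fill so the caller sees nothing. */
			while (i--) {
				free(objs[i]);
				objs[i] = NULL;
			}
			return false;
		}
	}
	return true;
}

/* Release n objects obtained from a successful alloc_bulk(). */
void free_bulk(size_t n, void **objs)
{
	for (size_t i = 0; i < n; i++) {
		free(objs[i]);
		objs[i] = NULL;
	}
}
```

With a bool return, a caller only needs `if (!alloc_bulk(...))` and
never has to reason about a partial count.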

The second patch simplifies the mempool_alloc_bulk() API to stop
skipping over non-NULL entries in the array, and removes a related
parameter that said how many are non-NULL.

A similar simplification of alloc_pages_bulk() is being discussed as
well and should follow in the near future.

Merge branch 'slab/for-7.2/tools' into slab/for-next

Merge series "Cleanup and fix tools/mm/slabinfo utility" from Xuewen
Wang.

This series fixes one bug and cleans up two code quality issues in
tools/mm/slabinfo.

Additionally, add this tool and other related scripts and tools to the
SLAB ALLOCATOR section of MAINTAINERS.

Link: https://lore.kernel.org/all/20260518062159.80664-1-wangxuewen@kylinos.cn/

mm/slab: do not limit zeroing to orig_size when only red zoning is enabled

When init (zeroing) on allocation is requested, for kmalloc() we
generally have to zero the full object size even if a smaller size is
requested, in order to provide krealloc()'s __GFP_ZERO guarantees.

But if we track the requested size, krealloc() uses that information to
do the right thing, so we can zero only the requested size. With red
zoning also enabled, any extra size became part of the red zone, so it
must not be zeroed and thus we must zero only the requested size.

However the current check is imprecise, and will trigger also when only
SLAB_RED_ZONE is enabled without SLAB_STORE_USER (which enables tracking
the requested size). This means enabling red zoning alone can compromise
krealloc()'s __GFP_ZERO contract.

Fix this by using slub_debug_orig_size() instead, which is the exact
check for whether the requested size is tracked. We don't need to care
if red zoning is also enabled or not. Also update and expand the
comment accordingly.
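
The resulting decision can be reduced to a small function. This is a
sketch with hypothetical names: the boolean parameter plays the role of
slub_debug_orig_size(), and the sizes are passed in directly.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * How many bytes to zero on allocation. `orig_size_tracked` is true only
 * when the requested size is recorded for the object.
 */
size_t zero_len(size_t object_size, size_t orig_size, bool orig_size_tracked)
{
	/* Tracked: krealloc() knows orig_size, so the tail can stay as is. */
	if (orig_size_tracked)
		return orig_size;

	/* Untracked, including red zoning alone: zero the whole object. */
	return object_size;
}
```

Whether red zoning is enabled does not appear in the condition at all,
which mirrors the statement that it no longer needs to be considered.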

Fixes: 9ce67395f5a0 ("mm/slub: only zero requested size of buffer for kzalloc when debug enabled")
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-1-7190909db118@kernel.org
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Hao Li <hao.li@linux.dev>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

MAINTAINERS: hand over I2C to Andi Shyti

After 13.5 years of maintaining I2C, it is finally time for me to move
to other areas. So, I hereby transfer I2C maintainership to Andi Shyti.
He has been taking care of the I2C host drivers for a while now and
kindly agreed to look after the whole subsystem. Thank you, Andi! I also
want to thank all contributors, reviewers, and fellow maintainers making
all these years a mostly smooth ride. Happy hacking, everyone!

Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260609091612.8228-4-wsa+renesas@sang-engineering.com

ALSA: pcxhr: Share PLL frequency register calculation

The PCXHR and HR222 clock paths duplicate the PLL divider calculation and
register encoding. The HR222 variant extends the same format with an
additional range for rates above those supported by the older boards.

Move the complete encoding into pcxhr_pll_freq_register() and pass each
hardware path its existing maximum frequency. The additional encoding
branch is unreachable with the older 110 kHz limit, so this preserves both
paths' accepted ranges and generated register values while removing the
duplicate implementation and its long-standing TODO.

Signed-off-by: Cássio Gabriel <cassiogabrielcontato@gmail.com>
Link: https://patch.msgid.link/20260612-alsa-pcxhr-pll-helper-v1-1-c84ae2bd2e9b@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>

ALSA: usb-audio: qcom: Guard sideband endpoint removal

qmi_stop_session() conditionally looks up the cached data and sync
endpoints, but removes each endpoint unconditionally.

The data endpoint is always present for an active offload stream, while
the sync endpoint is optional. When no sync endpoint exists, ep still
refers to the data endpoint and the code attempts to remove that endpoint
a second time. The current sideband implementation rejects the duplicate
removal, but the teardown path should not pass an unrelated endpoint for
an absent sync endpoint.

Only look up and remove an endpoint when its cached pipe exists, check the
lookup result, and clear the cached pipe after handling it. This matches
the normal stream-disable path.
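
A minimal sketch of the guarded teardown, using mock types. The struct
layout, lookup_ep() and remove_one() are assumptions for illustration
and do not reflect the driver's data structures.

```c
#include <stddef.h>

struct ep { int removed; };		/* mock endpoint */

struct session {
	int data_pipe, sync_pipe;	/* cached pipes; 0 means "absent" */
	struct ep *data_ep, *sync_ep;	/* what a lookup would find */
};

/* Mock lookup: may legitimately return NULL. */
static struct ep *lookup_ep(struct session *s, int pipe)
{
	if (pipe == s->data_pipe)
		return s->data_ep;
	if (pipe == s->sync_pipe)
		return s->sync_ep;
	return NULL;
}

static void remove_one(struct session *s, int *pipe)
{
	struct ep *ep;

	if (!*pipe)
		return;			/* no cached pipe: nothing to remove */

	ep = lookup_ep(s, *pipe);
	if (ep)				/* check the lookup result */
		ep->removed++;		/* stands in for the removal call */

	*pipe = 0;			/* clear the cached pipe afterwards */
}

void stop_session(struct session *s)
{
	remove_one(s, &s->data_pipe);
	remove_one(s, &s->sync_pipe);
}
```

With no sync endpoint cached, the second remove_one() call returns
before any lookup, so the data endpoint is never handed in twice.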

Fixes: 326bbc348298 ("ALSA: usb-audio: qcom: Introduce QC USB SND offloading support")
Signed-off-by: Cássio Gabriel <cassiogabrielcontato@gmail.com>
Link: https://patch.msgid.link/20260611-alsa-usb-qcom-guard-sideband-endpoint-removal-v1-1-00e73787c156@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>

Merge tag 'kvmarm-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 7.2

* New features:

  - None. Zilch. Nada. Que dalle.

* Fixes and other improvements:

  - Significant cleanup of the vgic-v5 PPI support which was merged in
    7.1. This makes the code more maintainable, and squashes a couple
    of bugs in the meantime.

  - Set of fixes for the handling of the MMU in an NV context,
    particularly VNCR-triggered faults. S1POE support is fixed
    as well.

  - Large set of pKVM fixes, mostly addressing recurring issues
    around hypervisor tracking of donated pages in obscure cases
    where the donation could fail and leave things in a bizarre
    state.

  - Fixes for the so-called "lazy vgic init", which resulted in
    sleeping operations in non-preemptible sections. This turned
    out to be far more invasive than initially expected...

  - Reduce the overhead of L1/L2 context switch by not touching
    the FP registers.

  - Fix the way non-implemented page sizes are dealt with when
    a guest insists on using them for S2 translation.

  - The usual set of low-impact fixes and cleanups all over the map.

Merge branch 'kvm-single-pdptrs' into HEAD

The non-MMU changes/preliminary cleanups from the "split kvm_mmu in
three" series[1]. The final outcome is to have a single copy of the
PDPTRs (in vcpu->arch) instead of two (in root_mmu and nested_mmu).

[1] https://lore.kernel.org/kvm/20260603105814.10236-1-pbonzini@redhat.com/T/#t

Merge tag 'thunderbolt-for-v7.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt into usb-next

Mika writes:

thunderbolt: Changes for v7.2 merge window

This includes following USB4/Thunderbolt changes for the v7.2 merge
window:

  - Make the driver more compliant with the connection manager guide.
  - Improvements over Thunderbolt XDomain service handling.
  - USB4STREAM driver.
  - Split out PCIe bits into pci.c to allow the driver to work on
    non-PCIe hosts as well.
  - Various fixes and improvements.

All these have been in linux-next with no reported issues.

* tag 'thunderbolt-for-v7.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt: (41 commits)
  thunderbolt: debugfs: Fix sideband write size check
  thunderbolt: debugfs: Fix margining error counter buffer leak
  thunderbolt: test: Release third DP tunnel
  thunderbolt: Prevent XDomain delayed work use-after-free on disconnect
  thunderbolt: test: Add KUnit tests for property parser bounds checks
  thunderbolt: Add some more descriptive probe error messages
  thunderbolt: Require nhi->ops be valid
  thunderbolt: Separate out common NHI bits
  thunderbolt: Move pci_device out of tb_nhi
  thunderbolt: Increase Notification Timeout to 255 ms for USB4 routers
  thunderbolt: Increase timeout for Configuration Ready bit
  thunderbolt: Verify Router Ready bit is set after router enumeration
  thunderbolt: Verify PCIe adapter in detect state before tunnel setup
  thunderbolt: Activate path hops from source to destination
  thunderbolt: Fix lane bonding log when bonding not possible
  thunderbolt: Don't access path config space on Lane 1 adapters in tb_switch_reset_host()
  thunderbolt: Improve multi-display DisplayPort tunnel allocation
  docs: admin-guide: thunderbolt: Add instructions how to use USB4STREAM
  thunderbolt: Add support for USB4STREAM
  thunderbolt: Add support for ConfigFS
  ...

KVM: x86/mmu: move pdptrs out of the MMU

PDPTRs are part of the CPU state.  A bit unconventionally, they are
reached via vcpu->arch.walk_mmu instead of being stored in vcpu->arch
directly.  That is nice in principle---it would allow TDP shadow paging
to have its own PDPTRs---but it is not necessary, because EPT has no
PDPTRs and NPT does not cache them.

Since kvm_pdptr_read does not otherwise need the MMU, drop the pdptrs
from the MMU altogether.  There is however something to be careful
about, in that PDPTRs are now not stored separately in root_mmu and
nested_mmu for L1 and L2 guests.  In practice this was already not
an issue:

- for EPT the VMCS0x has to keep them up to date; and for the purpose
  of emulation they are always loaded from the VMCS on vmentry/vmexit,
  thanks to the clearing of dirty and available register bitmaps in
  vmx_switch_vmcs()

- for NPT, VCPU_EXREG_PDPTR is similarly cleared for nNPT, which does
  not cache the PDPTRs; while for non-nNPT the PDPTRs are loaded
  together with the load of CR3.

Note that page table PDPTRs are not affected, since they are stored
in pae_root.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260530165545.25599-6-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: check that kvm_handle_invpcid is only invoked with shadow paging

This is true for both Intel and AMD. On Intel, "enable INVPCID" is
set unconditionally if supported, but the vmexit is triggered by the
"INVLPG exiting" control which is disabled by enable_ept. On AMD, KVM
can intercept INVPCID if NPT is enabled but only in order to inject #UD
in the guest.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260530165545.25599-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: nSVM: invalidate cached PDPTRs across nested NPT transitions

When L2 runs under nested NPT and uses PAE paging, KVM's cached PDPTRs
in mmu->pdptrs[] can hold stale or wrong values after nested
transitions and across migration restore, because both
nested_svm_load_cr3() and svm_get_nested_state_pages() only refresh
PDPTRs on the !nested_npt path.

The user-visible bug is on migration restore of an L2 running with nested
NPT and 32-bit PAE paging, if userspace uses KVM_SET_SREGS rather than
KVM_SET_SREGS2.  In that case, load_pdptrs() leaves VCPU_EXREG_PDPTR
marked as available, and kvm_pdptr_read() will use a stale translation
that used L1 GPAs instead of L2 nGPAs.  svm_get_nested_state_pages()
runs on first KVM_RUN but skips the refresh because nested_npt_enabled()
is true.  The CPU itself reads L2's PDPTRs correctly from memory via
L1's NPT, but KVM-side walking of guest PAE page tables uses the bogus
cached values.

Unlike Intel's GUEST_PDPTR0..3 fields in the VMCS, SVM has no
VMCB-cached PDPTR state: the in-memory PDPTEs at the current CR3 are
the only source of truth, and svm_cache_reg(VCPU_EXREG_PDPTR) simply
reloads them from memory via load_pdptrs().  Clearing the avail
bit (and the dirty bit because !avail/dirty is invalid) to force
a reload when PDPTRs are needed fixes the bug.

Do the same for nested_svm_load_cr3()'s nested_npt branch, so that
the invariant "PDPTRs need reloading" is handled similarly for both
immediate and deferred loading.
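
The avail/dirty mechanism can be sketched with a mock register cache.
The struct, the register index and both helpers are hypothetical and
only model the idea of dropping a cached value so that the next read
reloads it.

```c
enum { EXREG_PDPTR = 0 };	/* hypothetical register index */

struct vcpu {
	unsigned long regs_avail;	/* bit set: cached value is valid */
	unsigned long regs_dirty;	/* bit set: cache is newer than hardware */
	unsigned long pdptr_cache;	/* mock cached value */
	unsigned long pdptr_memory;	/* mock in-memory value, the source of truth */
};

/* Drop the cached value; dirty without avail would be an invalid state. */
void invalidate_pdptrs(struct vcpu *v)
{
	v->regs_avail &= ~(1ul << EXREG_PDPTR);
	v->regs_dirty &= ~(1ul << EXREG_PDPTR);
}

/* Read through the cache, reloading from memory when not available. */
unsigned long read_pdptr(struct vcpu *v)
{
	if (!(v->regs_avail & (1ul << EXREG_PDPTR))) {
		v->pdptr_cache = v->pdptr_memory;
		v->regs_avail |= 1ul << EXREG_PDPTR;
	}
	return v->pdptr_cache;
}
```

As long as the avail bit stays set, read_pdptr() keeps returning the
stale cached value, which corresponds to the bug described above.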

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260530165545.25599-4-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: nVMX: remove unnecessary code in prepare_vmcs02_rare

The early vmwrite of the PDPTRs in prepare_vmcs02_rare() is redundant, because
every write it does will be performed by prepare_vmcs02() if it is actually
needed.

In any case where the emulator or the processor need the PDPTR, either
is_pae_paging() is true on vmentry, or a write of CR0, CR4 or EFER will
cause a vmexit to L0. The next vmentry will refresh the PDPTRs in the
vmcs02 from vmcs12.

In fact, in the original version[1] of what ended up being commit
c7554efc8335 ("KVM: nVMX: Copy PDPTRs to/from vmcs12 only when
necessary"), the writes in what is now prepare_vmcs02_rare() were removed.
When the mega-collection of optimizations was posted[2], the removal of
that code got dropped as a rebase goof, so reinstate it.

[1] https://lore.kernel.org/all/20190507160640.4812-16-sean.j.christopherson@intel.com
[2] https://lore.kernel.org/all/1560445409-17363-31-git-send-email-pbonzini@redhat.com

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260530165545.25599-3-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: remove nested_mmu from mmu_is_nested()

nested_mmu is always stored into vcpu->arch.walk_mmu at the same time as
guest_mmu is stored into vcpu->arch.mmu. But nested_mmu is not even
a proper MMU, it is only used for page walking; plus the fact that
walk_mmu has to be switched at all is just an implementation detail.

In the end what matters here is whether the guest is using nested
page tables; vmx/nested.c and svm/nested.c check it to see if they
are in nEPT or nNPT context respectively. So switch to checking
root_mmu vs. guest_mmu, which is a more cogent test.
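
A sketch of the check under simplified assumptions: the struct below is
a mock with only the members named above, enter_nested() and
leave_nested() are hypothetical helpers, and the exact form of the
comparison in the patch may differ.

```c
#include <stdbool.h>

struct mmu { int unused; };	/* mock; contents are irrelevant here */

struct vcpu_arch {
	struct mmu root_mmu, guest_mmu, nested_mmu;
	struct mmu *mmu;	/* MMU used for the guest's page tables */
	struct mmu *walk_mmu;	/* MMU used for guest page walks */
};

/* Both pointers are switched together on nested entry and exit. */
void enter_nested(struct vcpu_arch *a)
{
	a->mmu = &a->guest_mmu;
	a->walk_mmu = &a->nested_mmu;
}

void leave_nested(struct vcpu_arch *a)
{
	a->mmu = &a->root_mmu;
	a->walk_mmu = &a->root_mmu;
}

/* Test the MMU in use rather than the page-walk helper. */
bool mmu_is_nested(const struct vcpu_arch *a)
{
	return a->mmu == &a->guest_mmu;
}
```

Because the two pointers always change together, the predicate gives the
same answer as a test on walk_mmu while no longer mentioning nested_mmu.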

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260511150648.685374-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260530165545.25599-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Merge branch kvm-arm64/nv-mmu-7.2 into kvmarm-master/next

* kvm-arm64/nv-mmu-7.2:
  : .
  : Assorted collection of fixes for NV MMU bugs
  :
  : - Correctly plug AT S1E1A handling in the emulation backend
  :
  : - Make CPTR_EL2.E0POE depend on FEAT_S1POE
  :
  : - Drop the reference on the page if the VNCR translation
  :   races with an MMU notifier
  :
  : - Correctly synthesise an SEA if a page table walk fails due
  :   to a guest error
  :
  : - Fully invalidate the VNCR TLB and fixmap when translating
  :   for a new VNCR
  :
  : - Restart S1 walk when the S2 walk fails due to a race condition
  :
  : - Correctly return -EAGAIN when a S1 walk fails
  :
  : - Fix block mapping validity check in stage-1 walker for 64kB pages
  :
  : - Fix potential NULL dereference when performing an EL2 TLBI targeting
  :   the VNCR page
  :
  : - Hold kvm->mmu_lock while initialising the vncr_tlb pointer
  : .
  KVM: arm64: nv: Hold kvm->mmu_lock while initialising vcpu->arch.vncr_tlb
  KVM: arm64: nv: Avoid dereferencing NULL VNCR pseudo-TLB
  KVM: arm64: Fix block mapping validity check in stage-1 walker
  KVM: arm64: nv: Restart stage-1 walk if stage-2 desc update fails
  KVM: arm64: Restart instruction upon race in __kvm_at_s12()
  KVM: arm64: nv: Inject SEA TTW when desc update can't write to GPA
  KVM: arm64: nv: Fully update VNCR fixmap state in kvm_translate_vncr()
  KVM: arm64: Don't leak PFN when kvm_translate_vncr() races MMU notifier
  arm64: cpufeature: Expose ID_AA64ISAR2_EL1.ATS1A to KVM
  KVM: arm64: Wire AT S1E1A in the system instruction handling table
  KVM: arm64: Key CPTR_EL2.E0POE propagation on FEAT_S1POE

Signed-off-by: Marc Zyngier <maz@kernel.org>

Merge branch kvm-arm64/misc-7.2 into kvmarm-master/next

* kvm-arm64/misc-7.2:
  : .
  : - Check for a valid vcpu pointer upon deactivating traps when handling
  :   a HYP panic in VHE mode
  :
  : - Make the __deactivate_fgt() macro use its arguments instead of the
  :   surrounding context
  :
  : - Don't bother with initialising TPIDR_EL2 in the hyp stubs, as this
  :   is already taken care of in more obvious places
  :
  : - Drop the unused kvm_arch pointer passed to __load_stage2()
  :
  : - Return -EOPNOTSUPP when a hypercall fails for some reason, instead of
  :   returning whatever was in the result structure
  :
  : - Make the ITS ABI selection helpers return void, which avoids wondering
  :   about the nature of the return code (always 0)
  : .
  KVM: arm64: vgic-its: Make ABI commit helpers return void
  KVM: arm64: Set a Linux errno on SMCCC error in kvm_call_hyp_nvhe()
  KVM: arm64: Remove @arch from __load_stage2()
  KVM: arm64: Don't populate TPIDR_EL2 in finalise_el2()
  KVM: arm64: Fix __deactivate_fgt macro parameter typo
  KVM: arm64: Guard against NULL vcpu on VHE hyp panic path

Signed-off-by: Marc Zyngier <maz@kernel.org>

Merge tag 'kvm-x86-svm-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 7.2

- Add support for virtualizing gPAT (KVM previously just used L1's PAT when
   running L2).

- Fix goofs where KVM mishandles side effects (e.g. single-step and PMC
   updates) when emulating VMRUN.

- Fix a variety of bugs in AVIC's handling of x2APIC MSR interception, most
   notably where KVM didn't disable interception of IRR, ISR, and TMR regs.

- Add support for virtualizing Host-Only/Guest-Only bits in the mediated PMU.

Merge tag 'kvm-x86-vmx-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 7.2

- Fix a largely benign bug where KVM TDX would incorrectly state it could
emulate several x2APIC MSRs.

- Use the "safe" WRMSR API when proxying LBR MSR writes as the to-be-written
value is guest controlled and completely unvalidated.

Merge tag 'kvm-x86-vfio-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM VFIO changes for 7.2

Use guard() to clean up various KVM+VFIO flows.

Merge tag 'kvm-x86-sev-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM SEV changes for 7.2

- Don't advertise support for unusable VM types, and account for VM types
   that are disabled by firmware, e.g. to mitigate security vulnerabilities.

- Rewrite the SEV {en,de}crypt debug ioctls as they were riddled with bugs and
   unnecessarily complicated, and add comprehensive tests.

- Clean up and deduplicate the SEV page pinning code.

- Fix minor goofs related to writing back CPUID information after firmware
   rejects a CPUID page for an SNP vCPU.

Merge tag 'kvm-x86-selftests-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM selftests changes for 7.2

- Randomize the dirty log test's delay when reaping the bitmap on the first
pass, as always waiting only 1ms hid a KVM RISC-V bug as the test reaped the
bitmap before KVM could build up enough state to hit the bug.

- A pile of one-off fixes and cleanups.

Merge tag 'kvm-x86-mmu-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 7.2

- Use the kernel's "enum pg_level" in the TDX APIs instead of the TDX-Module's
   level definitions (which are 0-based).

- Rework the TDX memory APIs to not require/assume that guest memory is
   backed by "struct page" (in preparation for guest_memfd hugepage support).

- Overhaul the TDP MMU => S-EPT code to move as much S-EPT specific logic as
   possible into the TDX code, and to funnel (almost) all S-EPT updates into
   a single chokepoint.  The motivation is largely to prepare for upcoming
   Dynamic PAMT support, but the cleanups are nice to have on their own.

- Plug a hole in the shadow MMU where KVM fails to recursively zap nested TDP
   shadow when L1 is tearing its TDP page tables from the bottom up, as KVM's
   TDP MMU now does.

Merge tag 'kvm-x86-misc-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM misc x86 changes for 7.2

- Handle EXIT_FASTPATH_EXIT_USERSPACE in vendor code to ensure vendor code
   gets a chance to handle things like reaping the PML buffer.

- Ensure KVM's copy of CR0 and CR3 are up-to-date on SVM prior to invoking
   fastpath handlers.

- Update KVM's view of PV async enabling if and only if the MSR write fully
   succeeds.

- Fix a variety of issues where the emulator doesn't honor guest-debug state,
   and clean up related code along the way.

- Synthesize EPT Violation and #NPF "error code" bits when injecting faults
   into L1 that didn't originate in hardware (in which case the VMCS/VMCB
   doesn't hold relevant information).

- Add support for virtualizing (well, emulating) AMD's flavor of CPL>0 CPUID
   faulting.

- Clean up the GPR APIs so that KVM's use of "raw" is consistent, and fix a
   variety of minor bugs along the way.

- Fix an OOB memory access due to not checking the VP ID when handling a
   Hyper-V PV TLB flush for L2.

- Fix a bug in the mediated PMU's handling of fixed counters that allowed the
   guest to bypass the PMU event filter.

- Allow userspace to return EAGAIN when handling SNP and TDX hypercalls, so
   that KVM can forward a "retry" status code to the guest, and reserve all
   unused error codes for future usage.

- Misc fixes and cleanups.

Merge tag 'kvm-x86-gmem-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM guest_memfd changes for 7.2

- Return -EEXIST instead of -EINVAL if userspace attempts to bind a gmem
   range to multiple memslots, and fix the test that was supposed to ensure
   KVM returns -EEXIST.

- Treat memslot binding offsets and sizes as unsigned values to fix a bug
   where KVM interprets a large "offset + size" as a negative value and allows
   a nonsensical offset.

- Use the inode number instead of the page offset for the NUMA interleaving
   index to fix a bug where the effective index would jump by two for
   consecutive pages (the caller also adds in the page offset).
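
The signedness fix in the second item can be illustrated with a minimal userspace sketch. The function names and the file_size parameter are hypothetical and this is not KVM's code; it only shows how a huge offset, once viewed as a signed value, slips past a naive bounds check, while an unsigned check written without "offset + size" rejects it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Naive signed check: a huge unsigned offset reinterpreted as signed is
 * negative, so "offset + size" looks small and the range is accepted.
 */
static bool binding_ok_signed(int64_t offset, int64_t size, int64_t file_size)
{
	return offset + size <= file_size;
}

/*
 * Unsigned check that never computes "offset + size", so it cannot wrap:
 * first bound the offset, then compare the size against what is left.
 */
static bool binding_ok_unsigned(uint64_t offset, uint64_t size,
				uint64_t file_size)
{
	if (size == 0 || offset > file_size)
		return false;
	return size <= file_size - offset;
}
```

With a 1 MiB file, binding_ok_signed(-4096, 4096, 1 << 20) accepts the nonsensical binding, whereas binding_ok_unsigned(UINT64_MAX - 4095, 4096, 1 << 20) rejects the same bit pattern.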

Merge branch kvm-arm64/vgic-v5-PPI-fixes into kvmarm-master/next

* kvm-arm64/vgic-v5-PPI-fixes:
  : .
  : Substantial cleanup of the vgic-v5 PPI support. From the original
  : cover letter:
  :
  : "With the GICv5 PPI support merged in, it has become obvious that a few
  :  things could be improved, both from the correctness and maintainability
  :  angles."
  : .
  KVM: arm64: Fix arch timer interrupts for GICv3-on-GICv5 guests
  irqchip/gic-v5: Immediately exec priority drop following activate
  Documentation: KVM: Clarify that PMU_V3_IRQ IntID requirements for GICv5
  Documentation: KVM: Fix typos in VGICv5 documentation
  KVM: arm64: selftests: Improve error handling for GICv5 PPI selftest
  KVM: arm64: selftests: Cleanup unused vars in GICv5 PPI selftest
  KVM: arm64: selftests: Add missing GIC CDEN to no-vgic-v5 selftest
  KVM: arm64: vgic-v5: Atomically assign bits to PPI DVI bitmap
  KVM: arm64: vgic-v5: Add missing trap handing for NV triage
  KVM: arm64: vgic-v5: Limit support to 64 PPIs
  KVM: arm64: vgic: Rationalise per-CPU irq accessor
  KVM: arm64: vgic-v5: Drop defensive checks from vgic_v5_ppi_queue_irq_unlock()
  KVM: arm64: vgic: Consolidate vgic_allocate_private_irqs_locked()
  KVM: arm64: vgic: Constify struct irq_ops usage
  KVM: arm64: vgic-v5: Drop pointless ARM64_HAS_GICV5_CPUIF check
  KVM: arm64: vgic-v5: Remove use of __assign_bit() with a constant
  KVM: arm64: vgic-v5: Move PPI caps into kvm_vgic_global_state
  KVM: arm64: vgic-v5: Add for_each_visible_v5_ppi() iterator

Signed-off-by: Marc Zyngier <maz@kernel.org>

Merge branch kvm-arm64/pkvm-fixes-7.2 into kvmarm-master/next

* kvm-arm64/pkvm-fixes-7.2:
  : .
  : Assorted pKVM fixes for 7.2:
  :
  : - Ensure that the vcpu memcache is filled in a number of cases (donate,
  :   share, selftest)
  :
  : - Fix vmemmap page order handling by resetting it when initialising the
  :   memory pool
  :
  : - Don't leak page references on failed memory donation
  :
  : - Add sanity-check for refcounted pages when donating/sharing pages
  :
  : - Clear __hyp_running_vcpu on state flush
  :
  : - Check LR upper bound against a trusted value
  :
  : - Assorted fixes for the host-side tracking of the pages shared with
  :   EL2 as a result of some Sashiko testing from Fuad
  :
  : - Correctly forward HCR_EL2.VSE from host to guest, so that protected
  :   guests can see SErrors
  : .
  KVM: arm64: Roll back partial shares on kvm_share_hyp() failure
  KVM: arm64: Avoid host/hyp share desync on unshare hypercall failure
  KVM: arm64: Free hyp-share tracking node when share hypercall fails
  KVM: arm64: Flush HCR_EL2.VSE to deliver SErrors to pKVM guests
  KVM: arm64: Bound used_lrs when flushing the pKVM hyp vCPU
  KVM: arm64: Clear __hyp_running_vcpu when flushing the pKVM hyp vCPU
  KVM: arm64: Pre-check vcpu memcache for host->guest donate
  KVM: arm64: Pre-check vcpu memcache for host->guest share
  KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache
  KVM: arm64: Add fail-safe for refcounted pages in __pkvm_hyp_donate_host
  KVM: arm64: Fix __pkvm_init_vm error path
  KVM: arm64: Reset page order in pKVM hyp_pool

Signed-off-by: Marc Zyngier <maz@kernel.org>

Merge branch kvm-arm64/nv-granule-sizes into kvmarm-master/next

* kvm-arm64/nv-granule-sizes:
  : .
  : Tidying up of the behaviour when the selected page size is not
  : implemented, courtesy of Wei-Lin Chang. From the initial cover
  : letter:
  :
  : "This small series fixes the granule size selection for software stage-1
  :  and stage-2 walks. Previously we treat the guest's TCR/VTCR.TGx as-is
  :  and use the encoded granule size for the walks. However this is
  :  incorrect if the granule sizes are not advertised in the guest's
  :  ID_AA64MMFR0_EL1.TGRAN*. The architecture specifies that when an
  :  unsupported size is programed in TGx, it must be treated as an
  :  implemented size. Fix this by choosing an available one while
  :  prioritizing PAGE_SIZE."
  : .
  KVM: arm64: Fallback to a supported value for unsupported guest TGx
  KVM: arm64: nv: Use literal granule size in TLBI range calculation
  KVM: arm64: Factor out TG0/1 decoding of VTCR and TCR
  KVM: arm64: nv: Rename vtcr_to_walk_info() to setup_s2_walk()
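
The fallback rule quoted from the cover letter above (use the programmed granule if the guest advertises it, otherwise choose an available one while prioritizing PAGE_SIZE) might look roughly like the sketch below. The bitmask encoding, the names and the smallest-first order used after the PAGE_SIZE preference are all assumptions made for illustration; the real code works on the TGx and ID_AA64MMFR0_EL1.TGRAN* encodings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-bit-per-granule encoding, not the TGx field values. */
enum granule_sketch {
	GRANULE_4K  = 1u << 0,
	GRANULE_16K = 1u << 1,
	GRANULE_64K = 1u << 2,
};

/*
 * requested: granule programmed by the guest
 * supported: mask of granules advertised to the guest
 * host:      granule matching the host PAGE_SIZE
 */
static unsigned int pick_granule(unsigned int requested,
				 unsigned int supported, unsigned int host)
{
	static const unsigned int order[] = {
		GRANULE_4K, GRANULE_16K, GRANULE_64K,
	};
	size_t i;

	assert(supported != 0);	/* at least one granule must be implemented */

	if (requested & supported)
		return requested;	/* programmed size is implemented */
	if (host & supported)
		return host;		/* prefer the host PAGE_SIZE */
	for (i = 0; i < sizeof(order) / sizeof(order[0]); i++)
		if (order[i] & supported)
			return order[i];	/* assumed order: smallest first */
	return 0;
}
```

For example, a guest that programs 16K while only 4K and 64K are advertised ends up with 4K on a 4K host.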

Signed-off-by: Marc Zyngier <maz@kernel.org>

Merge branch kvm-arm64/nv-fp-elision into kvmarm-master/next

* kvm-arm64/nv-fp-elision:
  : .
  : Significantly reduce the overhead of the context switch between L1 and
  : L2 guests by eliding the save/restore of the FP/SIMD/SVE registers, as
  : this state is shared between the two guests, and therefore can be left
  : live.
  : .
  KVM: arm64: nv: Don't save/restore FP register during a nested ERET or exception
  KVM: arm64: nv: Track L2 to L1 exception emulation

Signed-off-by: Marc Zyngier <maz@kernel.org>

Merge branch kvm-arm64/no-lazy-vgic-init into kvmarm-master/next

* kvm-arm64/no-lazy-vgic-init:
  : .
  : Fix an ugly situation where the vgic lazy init could happen in
  : non-preemptible contexts such as vcpu reset, resulting in lockdep
  : splats.
  :
  : This requires revamping the way in-kernel emulated devices
  : (timers, PMU) present their interrupts to the vgic, and
  : making sure there is no need to init the vgic on the back of that.
  : .
  KVM: arm64: vgic-v2: Don't init the vgic on in-kernel interrupt injection
  KVM: arm64: vgic-v2: Force vgic init on injection outside the run loop
  KVM: arm64: pmu: Kill the PMU interrupt level cache
  KVM: arm64: timer: Kill the per-timer irq level cache
  KVM: arm64: Simplify userspace notification of interrupt state
  KVM: arm64: timer: Repaint kvm_timer_{should,irq_can}_fire() to kvm_timer_{pending,enabled}()

Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: vgic-its: Make ABI commit helpers return void

The return values of vgic_its_set_abi() and vgic_its_commit_v0() are always
0 and do not carry useful error information. Simplify by changing them to
void.

Suggested-by: Oliver Upton <oupton@kernel.org>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Oliver Upton <oupton@kernel.org>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://patch.msgid.link/20260604075147.53299-1-liu.yun@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>

xfs: shut down the filesystem on a failed mount

A corrupt/crafted XFS image can make mount fail after background inode
inactivation has already been enabled.  xfs_mountfs() turns on inodegc
(xfs_inodegc_start()) right after log recovery, but the quota subsystem
(mp->m_quotainfo) is only allocated much later, in xfs_qm_newmount() /
xfs_qm_mount_quotas().  The quota accounting flags in mp->m_qflags are
parsed from the mount options before xfs_mountfs() even runs.

If the mount then aborts in between - e.g. xfs_rtmount_inodes() failing
with "failed to read RT inodes" - the unwind path flushes the inodegc
queue, which inactivates the inodes that are still queued, and
xfs_inactive() calls xfs_qm_dqattach().  That path trusts
XFS_IS_QUOTA_ON() (the flag is set) and dereferences the not yet
allocated mp->m_quotainfo:

  XFS (loop0): failed to read RT inodes
  Oops: general protection fault, probably for non-canonical address
        0xdffffc000000002a: 0000 [#1] PREEMPT SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x0000000000000150-0x0000000000000157]
  Workqueue: xfs-inodegc/loop0 xfs_inodegc_worker
  RIP: 0010:__mutex_lock+0xfe/0x930
  Call Trace:
   xfs_qm_dqget_cache_lookup+0x63/0x7f0
   xfs_qm_dqget_inode+0x336/0x860
   xfs_qm_dqattach_one+0x232/0x4e0
   xfs_qm_dqattach_locked+0x2c6/0x470
   xfs_qm_dqattach+0x46/0x70
   xfs_inactive+0x988/0xe80
   xfs_inodegc_worker+0x27c/0x730

The NULL m_quotainfo deref is only one symptom.  The deeper problem is
that a failed mount should not be inactivating inodes at all: it must
not write to the (possibly corrupt, only partially set up) persistent
metadata of a filesystem we just refused to mount, and the subsystems
inactivation relies on may not be initialised.

Mark the filesystem shut down before flushing the inodegc queue in the
xfs_mountfs() failure path.  With the preceding patch a shut down mount
no longer inactivates the queued inodes: xfs_inactive() returns early so
they are dropped straight to reclaim instead.  They are still pulled down
so reclaim can free them (which is why the flush was added in commit
ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")), but
without touching the on-disk structures - matching that comment's own
"pull down all the state and flee" intent.

Use SHUTDOWN_META_IO_ERROR for the shutdown: it is the generic "cannot
safely touch metadata" reason already used elsewhere in this file and in
the xfs_ifree() failure path, and unlike SHUTDOWN_FORCE_UMOUNT it does
not log a misleading "User initiated shutdown received".  A failed mount
is not necessarily on-disk corruption (it can be a transient I/O or
resource error), so SHUTDOWN_CORRUPT_ONDISK would not be accurate either.

Found by fuzzing XFS with syzkaller (corrupt image mount); reproduced and
verified under QEMU/KASAN.

Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
Signed-off-by: Mikhail Lobanov <m.lobanov@rosa.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: skip inode inactivation on a shut down mount

XFS already declines to inactivate inodes on a shut down mount, but only
at queue time: xfs_inode_mark_reclaimable() calls
xfs_inode_needs_inactive(), which returns false when the mount is shut
down ("If the log isn't running, push inodes straight to reclaim"), and
then drops the dquots and marks the inode reclaimable directly.

An inode that was queued for background inactivation while the mount was
still live is not covered by that check: the inodegc worker still calls
xfs_inactive() on it even after the mount has been shut down in the
meantime. Inactivation modifies persistent metadata and runs
transactions that cannot complete on a shut down mount, and it relies on
subsystems (e.g. quota) that a torn down, or never fully set up, mount
may not have available.

Honour the same invariant in xfs_inactive() itself: if the mount is shut
down, return early before doing any inactivation work. The dquots
attached to the inode are released by the existing xfs_qm_dqdetach() at
the out: label, so references are not leaked, and the caller then makes
the inode reclaimable exactly as before.

On its own this is a consistency fix with the existing queue-time
behaviour; it is also a prerequisite for shutting the mount down in the
xfs_mountfs() failure path in the following patch.

Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
Signed-off-by: Mikhail Lobanov <m.lobanov@rosa.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

Merge tag 'kvm-x86-generic-7.2' of https://github.com/kvm-x86/linux into HEAD

KVM generic changes for 7.2

- Rename invalidate_begin() to invalidate_start() throughout KVM to follow
the kernel's nomenclature, e.g. for mmu_notifiers.

- Minor cleanups.

Merge tag 'kvm-s390-master-7.1-4' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

KVM: s390: A few more misc gmap fixes.

xfs: move XFS_LSN_CMP to xfs_log_format.h

Because CYCLE_LSN/BLOCK_LSN are defined in xfs_log_format.h, XFS_LSN_CMP
forces a xfs_log_format.h dependency in xfs_log.h. Move XFS_LSN_CMP
to xfs_log_format.h and drop the macro/inline indirection to clean up
our header mess a little bit.

This also helps xfsprogs, which doesn't have xfs_log.h, but needs
XFS_LSN_CMP.
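
For readers unfamiliar with the comparison, here is a minimal userspace sketch of an LSN compare written as a plain inline function. It assumes the layout implied by CYCLE_LSN/BLOCK_LSN, with the cycle in the upper 32 bits and the block in the lower 32 bits; lsn_t, make_lsn() and lsn_cmp() are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t lsn_t;	/* stand-in for xfs_lsn_t */

static_assert(sizeof(lsn_t) == 8, "an LSN packs two 32-bit halves");

static inline uint32_t lsn_cycle(lsn_t lsn)
{
	return (uint32_t)((uint64_t)lsn >> 32);
}

static inline uint32_t lsn_block(lsn_t lsn)
{
	return (uint32_t)lsn;
}

/* Helper for building test values; cycle is assumed to fit in 31 bits. */
static inline lsn_t make_lsn(uint32_t cycle, uint32_t block)
{
	return (lsn_t)(((uint64_t)cycle << 32) | block);
}

/* Order by cycle first, then by block: <0, 0 or >0 like memcmp(). */
static inline int lsn_cmp(lsn_t a, lsn_t b)
{
	if (lsn_cycle(a) != lsn_cycle(b))
		return lsn_cycle(a) < lsn_cycle(b) ? -1 : 1;
	if (lsn_block(a) != lsn_block(b))
		return lsn_block(a) < lsn_block(b) ? -1 : 1;
	return 0;
}
```

Because the helper only needs the two field accessors, it can live next to them in a single header, which is the dependency cleanup described above.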

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: shut down zoned file systems on writeback errors

Zoned writeback allocates space from an open zone and advances the
in-memory allocation state before submitting the bio. The completion
path only records the written blocks and updates the mapping on success.
If the write fails, XFS cannot tell how far the device write pointer
advanced and cannot safely roll the open zone accounting back.

This was observed while investigating xfs/643 and xfs/646 on an external
ZNS realtime device. A writeback error after consuming space from an
open zone left later writers waiting for open-zone or GC progress that
could not happen. xfs/643 exposed this through the GC defragmentation
path, while xfs/646 exposed the same failure mode through the
truncate/EOF-zeroing space wait path.

There is no local recovery path in ioend completion that can restore a
consistent zoned allocation state after the device has rejected the
write. Treat writeback errors for zoned inodes as fatal and force a
file system shutdown from the ioend completion path. The existing
shutdown path wakes zoned allocation waiters and makes future space
waits return -EIO instead of leaving tasks stuck waiting for progress.

Signed-off-by: Yao Sang <sangyao@kylinos.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

iommu/amd: Control INVALIDATE_IOMMU_PAGES PDE from the gather

Now that AMD uses iommupt, it is easy to make use of the PDE bit. If
the gather has no free list then no page directory entries were
changed.

Pass GN/PDE through the invalidation call chain in a u32 flags field
that is OR'd into data[2] and set it properly from the gather.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Wei Wang <wei.w.wang@hotmail.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

iommu/amd: Make CMD_INV_IOMMU_ALL_PAGES_ADDRESS match the spec

The spec in Table 14 defines the "Entire Cache" case as having the low
12 bits as zero. Indeed the command format doesn't even have the low
12 bits. Since there is only one user now, fix the constant to have 0
in the low 12 bits instead of 1 and remove the masking.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Wei Wang <wei.w.wang@hotmail.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

iommu/amd: Have amd_iommu_domain_flush_pages() use last

Finish clearing out the size/last/end switching by converting
amd_iommu_domain_flush_pages() to use last-based logic.

This algorithm is simpler than the previous. Ultimately all this wants
to do is select powers of two that are aligned to address and not
longer than the distance to last.

The new version is fully safe for size = U64_MAX and last = U64_MAX.

Finally, the gather can be passed through natively without risking an
overflow in (gather->end - gather->start + 1).
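
A userspace sketch of the idea, not the driver's code: repeatedly pick the largest power-of-two block that is aligned to the current address and does not extend past last. All names are hypothetical. Only "last - addr" is ever computed, never "last - addr + 1", which is what keeps the whole-address-space case free of overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pow2_block {
	uint64_t addr;		/* naturally aligned start */
	unsigned int sz_lg2;	/* log2 of the size; 64 = entire space */
};

/* Largest lg2 such that 2^lg2 divides addr; 64 for addr == 0. */
static unsigned int align_lg2(uint64_t addr)
{
	unsigned int lg2 = 0;

	if (addr == 0)
		return 64;
	while (!(addr & 1)) {
		addr >>= 1;
		lg2++;
	}
	return lg2;
}

/* Largest lg2 such that 2^lg2 - 1 <= len_minus_1. */
static unsigned int fit_lg2(uint64_t len_minus_1)
{
	unsigned int lg2 = 0;

	if (len_minus_1 == UINT64_MAX)
		return 64;
	while (lg2 < 63 && (((uint64_t)1 << (lg2 + 1)) - 1) <= len_minus_1)
		lg2++;
	return lg2;
}

/* Stores up to max blocks in out; returns how many blocks are needed. */
static size_t split_range(uint64_t addr, uint64_t last,
			  struct pow2_block *out, size_t max)
{
	size_t n = 0;

	assert(addr <= last);
	for (;;) {
		unsigned int a = align_lg2(addr);
		unsigned int f = fit_lg2(last - addr);
		unsigned int lg2 = a < f ? a : f;
		uint64_t block_last;

		if (n < max) {
			out[n].addr = addr;
			out[n].sz_lg2 = lg2;
		}
		n++;
		if (lg2 == 64)
			break;		/* one block covers everything */
		block_last = addr + (((uint64_t)1 << lg2) - 1);
		if (block_last == last)
			break;
		addr = block_last + 1;
	}
	return n;
}
```

In this sketch the range 0x1000..0x3fff splits into a 4K block at 0x1000 and an 8K block at 0x2000, and the range 0..U64_MAX yields a single block with sz_lg2 == 64.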

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Wei Wang <wei.w.wang@hotmail.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

iommu/amd: Pass last in through to build_inv_address()

This is the trivial call chain below amd_iommu_domain_flush_pages().

Cases that are doing a full invalidate will pass a last of U64_MAX.

This avoids converting between size and last, and type confusion with
size_t, unsigned long and u64 all being used in different places along
the driver's invalidation path. Consistently use u64 in the internals.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

iommu/amd: Simplify build_inv_address()

This function is doing more work than it needs to:

- iommu_num_pages() is pointless, the fls() is going to compute the
   required page size already.

- It is easier to understand as sz_lg2, which is 12 if size is 4K,
   than msb_diff which is 11 if size is 4K.

- Simplify the control flow to early exit on the out of range cases.

- Use the usual last instead of end to signify an inclusive last
   address.

- Use GENMASK to compute the 1's mask.

- Use GENMASK to compute the address mask for the command layout,
   not PAGE_MASK.

- Directly reference the spec language that defines the 52 bit
   limit.

No functional change intended.
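
The sz_lg2 convention from the list above can be sketched in userspace as follows. fls64_sketch() and genmask64() are local re-implementations that follow the conventions of the kernel's fls64() and GENMASK_ULL(); range_sz_lg2() and range_base() are hypothetical helpers. The sketch computes the smallest naturally aligned power-of-two region containing both address and last, and deliberately does not reproduce the command encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the highest set bit counting from 1; 0 when v == 0. */
static unsigned int fls64_sketch(uint64_t v)
{
	unsigned int n = 0;

	while (v) {
		v >>= 1;
		n++;
	}
	return n;
}

/* Mask with bits hi..lo (inclusive) set. */
static uint64_t genmask64(unsigned int hi, unsigned int lo)
{
	assert(lo <= hi && hi < 64);
	return (UINT64_MAX >> (63 - hi)) & (UINT64_MAX << lo);
}

/*
 * log2 of the smallest aligned power-of-two region that contains both
 * address and the inclusive last address: 12 for a single 4K page.
 * The older msb_diff style would be this value minus one.
 */
static unsigned int range_sz_lg2(uint64_t address, uint64_t last)
{
	return fls64_sketch(address ^ last);
}

/* Start of that region: clear the low sz_lg2 bits of address. */
static uint64_t range_base(uint64_t address, unsigned int sz_lg2)
{
	if (sz_lg2 == 0)
		return address;
	if (sz_lg2 >= 64)
		return 0;
	return address & ~genmask64(sz_lg2 - 1, 0);
}
```

For address 0x1000 and last 0x1fff this gives sz_lg2 == 12, while 0x1000..0x2fff needs the 16K region starting at 0 (sz_lg2 == 14) because the two addresses first differ at bit 13.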

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Wei Wang <wei.w.wang@hotmail.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

Merge tag 'bst-arm64-emmc-driver-defconfig-for-v7.2' of https://github.com/BlackSesame-SoC/linux into soc/defconfig

arm64: BST C1200 eMMC defconfig for v7.2

Black Sesame Technologies:

Enable eMMC controller on BST C1200 CDCU1.0 board:
- Enable CONFIG_MMC_SDHCI_BST=y in arm64 defconfig

The MMC driver was merged via mmc-next in v7.1-rc1.
This is the remaining defconfig piece.

* tag 'bst-arm64-emmc-driver-defconfig-for-v7.2' of https://github.com/BlackSesame-SoC/linux:
arm64: defconfig: enable BST SDHCI controller

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Merge tag 'bst-arm64-emmc-driver-dts-for-v7.2' of https://github.com/BlackSesame-SoC/linux into soc/dt

arm64: BST C1200 eMMC DTS for v7.2

Black Sesame Technologies:

Enable eMMC controller on BST C1200 CDCU1.0 board:
- Add mmc0 node in bstc1200.dtsi (DWCMSHC SDHCI controller)
- Add fixed clock definition and reserved SRAM bounce buffer
- Enable mmc0 with 8-bit bus on CDCU1.0 ADAS 4C2G board

The MMC driver was merged via mmc-next in v7.1-rc1.
This is the remaining DTS piece.

Signed-off-by: Gordon Ge <gordon.ge@bst.ai>
* tag 'bst-arm64-emmc-driver-dts-for-v7.2' of https://github.com/BlackSesame-SoC/linux:
  arm64: dts: bst: enable eMMC controller in C1200

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

arm64: defconfig: enable BST SDHCI controller

Enable CONFIG_MMC_SDHCI_BST to support eMMC on Black Sesame
Technologies C1200 boards.

Signed-off-by: Albert Yang <yangzh0906@thundersoft.com>
Acked-by: Gordon Ge <gordon.ge@bst.ai>
Signed-off-by: Gordon Ge <gordon.ge@bst.ai>

arm64: dts: bst: enable eMMC controller in C1200

Add mmc0 node for the DWCMSHC SDHCI controller with basic configuration
(disabled by default) and fixed clock definition in bstc1200.dtsi.

Enable mmc0 with board-specific configuration including 8-bit bus
width and reserved SRAM bounce buffer on the CDCU1.0 ADAS 4C2G board.

The bounce buffer in reserved SRAM addresses hardware constraints
where the eMMC controller cannot access main system memory through
SMMU due to a hardware bug, and all DRAM is located outside the
4GB boundary.

Signed-off-by: Albert Yang <yangzh0906@thundersoft.com>
Acked-by: Gordon Ge <gordon.ge@bst.ai>
Signed-off-by: Gordon Ge <gordon.ge@bst.ai>

Merge tag 'drm-xe-fixes-2026-06-11' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes

UAPI Changes:

Cross-subsystem Changes:

Core Changes:

Driver Changes:
- fix oops in suspend/shutdown without display (Jani)
- RAS fixes (Raag)
- Use HW_ERR prefix in log (Raag)
- include all registered queues in TLB invalidation (Tangudu)
- Fix refcount leak in xe_range_tree in error paths (Wentao)
- fix job timeout recovery for unstarted jobs and kernel queues (Rodrigo)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/aitt8ZkYmxIT9cdP@gsse-cloud1.jf.intel.com

Merge tag 'drm-intel-fixes-2026-06-11' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes

- Check supported link rates DPCD read [edp] (Nikita Zhandarovich)
- Fix phys BO pread/pwrite with offset [gem] (Joonas Lahtinen)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Tvrtko Ursulin <tursulin@igalia.com>
Link: https://patch.msgid.link/aipkcUDnTlzre-8F@linux

ip6_tunnel: annotate data-races around t->err_count and t->err_time

ip6_tnl_xmit() and ipip6_tunnel_xmit() run locklessly (dev->lltx == true).

ip6gre_err() and ipip6_err() also run locklessly.

We need to add READ_ONCE() and WRITE_ONCE() annotations
around t->err_count and t->err_time.
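
As a rough userspace illustration of what such annotations look like, the sketch below defines simplified READ_ONCE()/WRITE_ONCE() macros using volatile accesses and the GCC/Clang __typeof__ extension; the kernel's real macros are more elaborate. struct tunnel_sketch and both functions are hypothetical and only loosely modelled on an error counter with a timestamp.

```c
#include <assert.h>

/* Simplified stand-ins for the kernel macros (GCC/Clang __typeof__). */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

struct tunnel_sketch {
	int err_count;
	unsigned long err_time;
};

/* Error-handler side: record that an error was seen at time "now". */
static void tunnel_note_error(struct tunnel_sketch *t, unsigned long now)
{
	WRITE_ONCE(t->err_count, READ_ONCE(t->err_count) + 1);
	WRITE_ONCE(t->err_time, now);
}

/*
 * Transmit side: returns 1 if a recent error should be reported.
 * Each shared field is loaded exactly once into a local or expression.
 */
static int tunnel_consume_error(struct tunnel_sketch *t, unsigned long now,
				unsigned long timeout)
{
	int count = READ_ONCE(t->err_count);

	if (count <= 0)
		return 0;
	if (now - READ_ONCE(t->err_time) < timeout) {
		WRITE_ONCE(t->err_count, count - 1);
		return 1;
	}
	WRITE_ONCE(t->err_count, 0);
	return 0;
}
```

The annotations do not make the read-modify-write sequences atomic. They mark the accesses as intentionally lockless and keep the compiler from tearing, refetching or fusing them.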

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260610171458.1359630-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

crypto: tegra - fix refcount leak in tegra_se_host1x_submit()

The timeout error path in tegra_se_host1x_submit() returns without
calling host1x_job_put(), while all other paths (success, submit
error, pin error) properly release the job reference through the
job_put label. Since host1x_job_alloc() initializes the reference
count and host1x_job_put() is required to drop it, omitting it on
timeout causes a permanent refcount leak.

Fix this by redirecting the timeout return to the existing job_put
label, ensuring the job reference and any associated syncpt
references are consistently released.
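
The shape of the fix can be sketched with a toy reference-counted job in userspace. Every name below is hypothetical and no host1x API is used; the point is only that every exit after allocation, including the timeout, funnels through one label that drops the reference.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct job_sketch {
	int refcount;
};

static int live_jobs;	/* lets a caller check for leaks */

static struct job_sketch *job_alloc(void)
{
	struct job_sketch *job = calloc(1, sizeof(*job));

	if (job) {
		job->refcount = 1;
		live_jobs++;
	}
	return job;
}

static void job_put_ref(struct job_sketch *job)
{
	assert(job->refcount > 0);
	if (--job->refcount == 0) {
		free(job);
		live_jobs--;
	}
}

enum outcome_sketch { SUBMIT_OK, SUBMIT_ERR, WAIT_TIMEOUT };

static int submit_sketch(enum outcome_sketch outcome)
{
	struct job_sketch *job = job_alloc();
	int ret = 0;

	if (!job)
		return -ENOMEM;
	if (outcome == SUBMIT_ERR) {
		ret = -EIO;
		goto job_put;
	}
	if (outcome == WAIT_TIMEOUT) {
		ret = -ETIMEDOUT;
		goto job_put;	/* a bare "return ret" here would leak the job */
	}
job_put:
	job_put_ref(job);
	return ret;
}
```

After submit_sketch(WAIT_TIMEOUT) returns -ETIMEDOUT, live_jobs is back at zero, which is the property the timeout path was missing.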

Cc: stable@vger.kernel.org
Fixes: 0880bb3b00c8 ("crypto: tegra - Add Tegra Security Engine driver")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Reviewed-by: Akhil R <akhilrajeev@nvidia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: rng - Free default RNG on module exit

When the rng module is removed the default RNG will be leaked.
Call crypto_del_default_rng to free it if possible.

Fixes: 7cecadb7cca8 ("crypto: rng - Do not free default RNG when it becomes unused")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: testmgr - allow authenc(hmac(sha{256,384}),cts(cbc(aes))) in FIPS mode

hmac(sha256), hmac(sha384) and cts(cbc(aes)) algorithms have been
marked as FIPS allowed for years. Mark the respective authenc()
constructions per RFC 8009 ("AES Encryption with HMAC-SHA2 for
Kerberos 5") as such as well.

SP 800-57 Part 3 Rev. 1 from Jan 2015 [1] links the draft of what
became RFC 8009 in Oct 2016 as approved in section 6.3 Procurement
Guidance (item/recommendation 3).

[1] https://csrc.nist.gov/pubs/sp/800/57/pt3/r1/final

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

hwrng: jh7110 - fix refcount leak in starfive_trng_read()

The starfive_trng_read() function acquires a runtime PM reference
via pm_runtime_get_sync() but fails to release it on two error
paths. If starfive_trng_wait_idle() or starfive_trng_cmd() returns
an error, the function exits without calling
pm_runtime_put_sync_autosuspend(), leaving the runtime PM usage
counter permanently elevated and preventing the device from entering
runtime suspend.

Refactor the function to use a unified error path that calls
pm_runtime_put_sync_autosuspend() before returning.

Cc: stable@vger.kernel.org
Fixes: c388f458bc34 ("hwrng: starfive - Add TRNG driver for StarFive SoC")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: atmel-ecc - drop dead code in atmel_ecdh_max_size

atmel_ecdh_init_tfm() always allocates ctx->fallback, so it is never
NULL in atmel_ecdh_max_size(). Remove the dead code and return
crypto_kpp_maxsize() directly.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: cavium/cpt - fix DMA cleanup using wrong loop index

The sg_cleanup error path used list[i] instead of list[j] when unmapping
DMA buffers, leaking successfully mapped entries and repeatedly unmapping
the failed one.
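
A minimal userspace model of the unwind loop, with hypothetical names and no DMA API involved, shows why the cleanup must index with j. The assert in unmap_one() would fire in the buggy variant, because list[i] is the entry that failed and was never mapped.

```c
#include <assert.h>
#include <stdbool.h>

struct ent_sketch {
	bool mapped;
};

/* Pretend mapping: fails for index fail_at, succeeds otherwise. */
static bool map_one(struct ent_sketch *e, int idx, int fail_at)
{
	if (idx == fail_at)
		return false;
	e->mapped = true;
	return true;
}

static void unmap_one(struct ent_sketch *e)
{
	assert(e->mapped);	/* unmapping an unmapped entry is a bug */
	e->mapped = false;
}

static int map_all(struct ent_sketch *list, int n, int fail_at)
{
	int i, j;

	for (i = 0; i < n; i++)
		if (!map_one(&list[i], i, fail_at))
			goto cleanup;
	return 0;

cleanup:
	/* Undo entries 0..i-1. Using list[i] here was the bug. */
	for (j = 0; j < i; j++)
		unmap_one(&list[j]);
	return -1;
}
```

When map_all() fails at index 2 of 4, entries 0 and 1 are unmapped again and entries 2 and 3 are left untouched.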

Fixes: c694b233295b ("crypto: cavium - Add the Virtual Function driver for CPT")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: marvell/octeontx - fix DMA cleanup using wrong loop index

The sg_cleanup path used list[i] instead of list[j] when unmapping DMA
buffers, leaking successfully mapped entries and repeatedly unmapping
the failed one.

Fixes: 10b4f09491bf ("crypto: marvell - add the Virtual Function driver for CPT")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

MAINTAINERS: make myself the maintainer of the Qualcomm QCE driver

Qualcomm wants to keep supporting and extending the crypto engine driver.
Thara has not been active for many months, so change the maintainer to
myself and upgrade the driver to Supported.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Acked-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: amcc - convert irq_of_parse_and_map to platform_get_irq

Replace the deprecated irq_of_parse_and_map() call with the modern
platform_get_irq() in the probe function. This also improves error
handling: platform_get_irq() returns a negative errno on failure,
whereas irq_of_parse_and_map() returned 0.

Change the irq field in struct crypto4xx_core_device from u32 to int
to match the return type of platform_get_irq().
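
Why the field type matters can be shown with a small userspace sketch. lookup_irq_sketch() is a hypothetical stand-in that follows the "positive IRQ number or negative errno" convention described above; it is not the platform API, and -ENXIO is just an arbitrary error value.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Stand-in lookup: a positive IRQ number, or a negative errno. */
static int lookup_irq_sketch(int have_irq)
{
	return have_irq ? 42 : -ENXIO;
}

struct dev_u32 {
	uint32_t irq;
};

struct dev_int {
	int irq;
};

/* Old-style consumer: unsigned field, 0 is the only "no IRQ" value. */
static int probe_unsigned(struct dev_u32 *d, int have_irq)
{
	d->irq = (uint32_t)lookup_irq_sketch(have_irq);
	/* A negative errno is now a huge non-zero "IRQ number". */
	return d->irq ? 0 : -ENODEV;
}

/* Signed field: the negative errno survives and can be propagated. */
static int probe_signed(struct dev_int *d, int have_irq)
{
	d->irq = lookup_irq_sketch(have_irq);
	return d->irq < 0 ? d->irq : 0;
}
```

With no IRQ available, probe_unsigned() reports success and stores a bogus number, while probe_signed() returns the error to its caller.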

Assisted-by: opencode:big-pickle
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: sun4i-ss - Remove insecure and unused rng_alg

Remove sun4i_ss_rng, as it is insecure and unused:

- It has multiple vulnerabilities.  sun4i_ss_prng_seed() is missing
  locking and has a buffer overflow.  sun4i_ss_prng_generate() fails to
  fill the entire buffer with cryptographic random bytes, because it
  rounds the destination length down and also doesn't actually wait for
  the hardware to be ready before pulling bytes from it.

- No user of this code is known.  It's usable only theoretically via the
  "rng" algorithm type of AF_ALG.  But userspace actually just uses the
  actual Linux RNG (/dev/random etc) instead.  And rng_algs don't
  contribute entropy to the actual Linux RNG either.  (This may have
  been confused with hwrng, which does contribute entropy.)

The sun4i_ss_prng_seed() buffer overflow was reported by Tianchu Chen
and discovered by Atuin - Automated Vulnerability Discovery Engine.

There's no point in fixing all these vulnerabilities individually when
this is unused code, so let's just remove it.

Fixes: b8ae5c7387ad ("crypto: sun4i-ss - support the Security System PRNG")
Cc: stable@vger.kernel.org
Reported-by: Tianchu Chen <flynnnchen@tencent.com>
Closes: https://lore.kernel.org/r/af749a8447bd7f0e9dd26ca6c87e9c6afecb09d9@linux.dev/
Acked-by: Corentin LABBE <clabbe.montjoie@gmail.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

hwrng: xilinx - Move xilinx-rng into drivers/char/hw_random/

Since this file just implements a hwrng driver, move it into
drivers/char/hw_random/. Rename the kconfig option accordingly as well.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

cxl/test: Add check after kzalloc() memory in alloc_mock_res()

alloc_mock_res() calls kzalloc() without checking the return value.
Add scope based resource management to deal with the allocated memory
cleanly.
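
Scope-based cleanup of this kind can be approximated in userspace with the GCC/Clang cleanup attribute. The sketch below uses hypothetical names (res_sketch, auto_free_res, alloc_mock_res_sketch) and plain calloc()/free() rather than the kernel helpers. It checks the allocation and relies on the attribute to free the object on every early return.

```c
#include <assert.h>
#include <stdlib.h>

struct res_sketch {
	int start, end;
};

static int live_allocs;	/* lets a caller check for leaks */

/* Called automatically when an annotated variable goes out of scope. */
static void free_res_sketch(struct res_sketch **p)
{
	if (*p) {
		free(*p);
		live_allocs--;
	}
}

#define auto_free_res __attribute__((cleanup(free_res_sketch)))

static struct res_sketch *alloc_mock_res_sketch(int fail_alloc,
						int fail_setup)
{
	struct res_sketch *res auto_free_res =
		fail_alloc ? NULL : calloc(1, sizeof(*res));
	struct res_sketch *ret;

	if (!res)
		return NULL;	/* the allocation is checked */
	live_allocs++;

	if (fail_setup)
		return NULL;	/* res is freed by the cleanup handler */

	/* Success: hand ownership to the caller and disarm the cleanup. */
	ret = res;
	res = NULL;
	return ret;
}
```

On success the caller owns the object and frees it itself; on either failure nothing is left allocated.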

Reported-by: sashiko-bot
Fixes: 67dcdd4d3b83 ("tools/testing/cxl: Introduce a mocked-up CXL port hierarchy")
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20260611230305.197390-1-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>

cxl/test: Unregister cxl_acpi in cxl_test_init() error path

In cxl_test_init(), once cxl_mock_platform_device_add() succeeds, all
subsequent error paths need to call platform_device_unregister() instead of
platform_device_put() to clean up.

Fixes: 67dcdd4d3b83 ("tools/testing/cxl: Introduce a mocked-up CXL port hierarchy")
Reported-by: sashiko-bot
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20260611230355.198912-1-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>

perf stat: Fix false NMI watchdog warning in aggregation modes

In aggregation modes (e.g. --per-socket, --per-die, etc.), a counter
might not be scheduled or counted on specific aggregate groups if it was
not assigned to the CPUs belonging to those groups. However, the
printout() check triggers the "print_free_counters_hint" logic
unconditionally for any supported counter with a missing count. This
results in a false "Some events weren't counted. Try disabling the NMI
watchdog" warning.

Furthermore, the NMI watchdog only reserves performance counters on core
PMUs. Uncore PMU events (e.g. CHA, IMC) are not affected by the NMI
watchdog, but their failures also falsely triggered this warning.

This warning was originally introduced in commit 02d492e5dcb72c00 ("perf
stat: Issue a HW watchdog disable hint").

To fix this, restrict setting of print_free_counters_hint to only
trigger for core PMU events by checking counter->pmu and
counter->pmu->is_core.

Example before/after:

$ perf stat -M lpm_miss_lat --metric-only --per-socket -a -- sleep 1

Before:

Performance counter stats for 'system wide':

       ns  lpm_miss_lat_rem ns  lpm_miss_lat_loc
S0      126                202.3               207.9
S1      126                231.9               259.3

       1.006029831 seconds time elapsed

Some events weren't counted. Try disabling the NMI watchdog:
        echo 0 > /proc/sys/kernel/nmi_watchdog
        perf stat ...
        echo 1 > /proc/sys/kernel/nmi_watchdog

After:

Performance counter stats for 'system wide':

       ns  lpm_miss_lat_rem ns  lpm_miss_lat_loc
S0      126                202.3               207.9
S1      126                231.9               259.3

       1.006029831 seconds time elapsed

Assisted-by: Gemini:gemini-next
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Chun-Tse Shao <ctshao@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf test: Compile named_threads workload with -O0

The work loop relies on the compiler not optimizing it away. Although
named_threads_work is deliberately not static for that reason, the
compiler could still optimize the loop away.

Fix it by compiling without optimization. Also add -fno-inline for
consistency and in case anyone wants to look at callstacks.

Fixes: b5dd510be55e8670 ("perf test: Add named_threads workload")
Closes: https://lore.kernel.org/all/20260609160001.2739E1F00893@smtp.kernel.org
Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

Merge tag 'for-net-next-2026-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

core:
- hci_sync: Add support for HCI_LE_Set_Host_Feature [v2]
- SMP: Use AES-CMAC library API
- sockets: convert to getsockopt_iter
- Add SPDX id lines to some source files

drivers:
- btintel_pcie: Support Product level reset
- btintel_pcie: Add support for smart trigger dump
- btintel_pcie: Add 50 ms delay before MAC init on BlazarIW
- btintel_pcie: Separate coredump work from RX work
- btmtk: add event filter to filter specific event
- btrtl: fix RTL8761B/BU broken LE extended scan
- btusb: Add Realtek RTL8922AE VID/PID 0bda/d922
- btusb: Add Realtek RTL8922AE VID/PID 0bda/d923
- btusb: MT7922: Add VID/PID 0e8d/223c
- btusb: MT7925: Add VID/PID 0e8d/8c38
- btusb: Add support for TP-Link TL-UB250
- btusb: Add Mercusys MA530 for Realtek RTL8761BUV
- btusb: Add TP-Link UB600 for Realtek 8761BUV
- btusb: Add support for Intel Lizard Peak 2 (0x8087:0x0040)
- btusb: Add USB ID 2c4e:0128 for Mercusys MA60XNB
- btusb: MT7925: Add VID/PID 13d3/3609

* tag 'for-net-next-2026-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (49 commits)
  Bluetooth: btintel_pcie: Separate coredump work from RX work
  Bluetooth: btmtksdio: fix infinite loop in btmtksdio_txrx_work()
  Bluetooth: qca: Add BT FW build version to kernel log
  Bluetooth: vhci: validate devcoredump state before side effects
  Bluetooth: L2CAP: validate connectionless PSM length
  Bluetooth: hci: validate codec capability element length
  Bluetooth: L2CAP: Fix UAF in channel timeout by holding conn ref
  Bluetooth: btintel_pcie: Load IOSF debug regs by controller variant
  Bluetooth: btintel_pcie: Add 50 ms delay before MAC init on BlazarIW
  Bluetooth: Add SPDX id lines to some source files
  Bluetooth: btintel_pcie: Add support for smart trigger dump
  Bluetooth: hci_h5: reset hci_uart::priv in the close() method
  Bluetooth: btusb: clean up probe error handling
  Bluetooth: btusb: fix wakeup irq devres lifetime
  Bluetooth: btusb: fix wakeup source leak on probe failure
  Bluetooth: btusb: fix use-after-free on marvell probe failure
  Bluetooth: btusb: fix use-after-free on registration failure
  Bluetooth: btmtk: fix URB leak in alloc_mtk_intr_urb error path
  Bluetooth: hci_core: Fix UAF in hci_unregister_dev()
  Bluetooth: hci_event: fix simultaneous discovery stuck in FINDING
  ...
====================

Link: https://patch.msgid.link/20260611183358.176776-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'nfc-net-next-20260611' of https://codeberg.org/linux-nfc/linux

David Heidelberg says:

====================
NFC updates for net-next 20260611

- nxp-nci: Add ISO15693 support
- nxp-nci: treat -ENXIO in IRQ thread as no data available
- nci: uart: Constify struct tty_ldisc_ops
- trf7970a: fix comment typos
- Use named initializers for struct i2c_device_id
- MAINTAINERS: Update address for David Heidelberg

* tag 'nfc-net-next-20260611' of https://codeberg.org/linux-nfc/linux:
  MAINTAINERS: Update address for David Heidelberg
  nfc: Use named initializers for struct i2c_device_id
  nfc: nxp-nci: treat -ENXIO in IRQ thread as no data available
  nfc: nxp-nci: Add ISO15693 support
  nfc: nci: uart: Constify struct tty_ldisc_ops
  nfc: trf7970a: fix comment typos
====================

Link: https://patch.msgid.link/1aed7555-3d24-413c-b284-bc85fdd33055@ixit.cz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: mailbox: qcom: Add IPCC support for Maili Platform

Document the Inter-Processor Communication Controller on the Qualcomm
Maili Platform, which will be used to route interrupts across various
subsystems found on the SoC.

Signed-off-by: Chunkai Deng <chunkai.deng@oss.qualcomm.com>
Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>

io_uring/zcrx: kill dead 'sock' member in struct io_zcrx_args

This member is only ever assigned, never read. Kill it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'tipc-fix-netlink-gate-and-receive-path-bugs'

Michael Bommarito says:

====================
tipc: fix netlink gate and receive-path bugs

This is v4 of the public TIPC series. The only change from v3 is in
patch 1: TIPC_NL_MEDIA_SET now uses GENL_UNS_ADMIN_PERM like the other
mutators, instead of GENL_ADMIN_PERM, so the whole series uses the
namespace-aware CAP_NET_ADMIN check that matches the legacy TIPC netlink
path. Patches 2 and 3 are unchanged.

Patch 1 gives the TIPCv2 mutating generic-netlink operations the admin
gate the legacy API already has, so a local unprivileged process can no
longer change TIPC state. Patch 2 drops CONN_ACK messages that
acknowledge more outstanding sends than exist, preventing the
snt_unacked underflow. Patch 3 rejects peer bindings with lower > upper,
which would otherwise leak binding-table memory.
====================

Link: https://patch.msgid.link/20260610124003.3831170-1-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tipc: reject inverted service ranges from peer bindings

tipc_update_nametbl() inserts a binding advertised by a peer node using
the lower and upper service-range bounds taken directly from the wire,
without checking that lower <= upper. The local bind path validates the
ordering (tipc_uaddr_valid()), but the name-distribution path does not.

A binding with lower > upper is inserted at the far end of the
service-range rbtree (keyed on lower) where no lookup or withdrawal can
ever match it (service_range_foreach_match() requires sr->lower <= end).
The publication, its service_range node and the augmented rbtree entry
are then leaked for the lifetime of the namespace, and there is no
per-peer cap equivalent to TIPC_MAX_PUBL on locally created bindings.

Reject inverted ranges in the network path as well. A peer node can
otherwise leak unbounded binding-table memory by sending PUBLICATION
items with lower > upper.
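
A minimal sketch of the ordering check, assuming a reduced binding table.
struct mock_service_range, struct mock_table and both helpers are
hypothetical; only the lower <= upper requirement is taken from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced view of a service range received from a peer. */
struct mock_service_range {
	uint32_t lower;
	uint32_t upper;
};

/* Hypothetical binding table that only counts inserted bindings. */
struct mock_table {
	size_t nr_bindings;
};

/* Accept a peer-advertised range only if its bounds are ordered. */
static bool peer_range_valid(const struct mock_service_range *sr)
{
	assert(sr != NULL);
	return sr->lower <= sr->upper;
}

/* Insert a peer binding; inverted ranges are dropped before any insert. */
static int mock_insert_peer_binding(struct mock_table *tbl,
				    const struct mock_service_range *sr)
{
	assert(tbl != NULL);

	if (!peer_range_valid(sr))
		return -1;	/* rejected, nothing is stored or leaked */

	tbl->nr_bindings++;
	return 0;
}
```

Because the check runs before anything is stored, a rejected PUBLICATION
item leaves the table untouched in this model.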

Fixes: 37922ea4a310 ("tipc: permit overlapping service ranges in name table")
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20260610124003.3831170-4-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tipc: prevent snt_unacked underflow on CONN_ACK

tipc_sk_conn_proto_rcv() subtracts the peer-supplied connection ack count
from the unsigned 16-bit send counter snt_unacked without checking that it
does not exceed the number of messages actually outstanding:

tsk->snt_unacked -= msg_conn_ack(hdr);

msg_conn_ack() is read straight from a received CONN_MANAGER/CONN_ACK
message. If the ack count is larger than snt_unacked, the subtraction
wraps to a near-maximum value, leaving tsk_conn_cong() permanently true
and starving the connection of further transmits.

Validate the ACK count at the start of the CONN_ACK block and drop the
message if it acknowledges more messages than are outstanding. A peer (or,
for a local connection, the connected peer socket) can otherwise wedge a
TIPC connection's send side by sending an oversized connection ack.
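
The wrap-around and the added validation can be illustrated with a small
userspace model. struct mock_tsk and the two helpers are hypothetical;
only the unsigned 16-bit snt_unacked counter and the drop-on-oversized-ack
behaviour come from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced socket state: only the unacked send counter. */
struct mock_tsk {
	uint16_t snt_unacked;
};

/* Unchecked subtraction: wraps if the peer acks more than is outstanding. */
static void conn_ack_unchecked(struct mock_tsk *tsk, uint16_t ack)
{
	assert(tsk != NULL);
	tsk->snt_unacked -= ack;
}

/* Checked variant: returns false (drop the message) on an oversized ack. */
static bool conn_ack_checked(struct mock_tsk *tsk, uint16_t ack)
{
	assert(tsk != NULL);

	if (ack > tsk->snt_unacked)
		return false;

	tsk->snt_unacked -= ack;
	return true;
}
```

With 3 messages outstanding and an ack count of 5, the unchecked variant
leaves snt_unacked at 65534, while the checked variant rejects the ack and
keeps the counter at 3.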

Fixes: 10724cc7bb78 ("tipc: redesign connection-level flow control")
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20260610124003.3831170-3-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tipc: require net admin for TIPCv2 netlink mutators

TIPCv2 registers mutating generic-netlink operations without admin
permission flags. Generic netlink only checks CAP_NET_ADMIN when an
operation sets GENL_ADMIN_PERM or GENL_UNS_ADMIN_PERM, so a local
unprivileged process can currently change TIPC state through commands
such as TIPC_NL_NET_SET, TIPC_NL_KEY_SET, TIPC_NL_KEY_FLUSH, and
bearer enable/disable.

The legacy TIPC netlink API already checks netlink_net_capable(...,
CAP_NET_ADMIN) for administrative commands. Give the TIPCv2 mutators
the equivalent generic-netlink gate. Use GENL_UNS_ADMIN_PERM, which
maps to the same namespace-aware CAP_NET_ADMIN check that
netlink_net_capable() performs, so the behaviour matches the legacy
path and keeps working for CAP_NET_ADMIN holders in a non-initial user
namespace (containers).

A QEMU/KASAN repro run as uid/gid 65534 with zero effective
capabilities previously succeeded in changing the network id and node
identity, setting and flushing key material, and enabling/disabling a
UDP bearer. With this patch applied the same operations fail with
-EPERM.

Fixes: 0655f6a8635b ("tipc: add bearer disable/enable to new netlink api")
Link: https://lore.kernel.org/all/20260604163102.2658553-1-dominik.czarnota@trailofbits.com/
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20260610124003.3831170-2-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: simplify WAN device check in airoha_dev_init()

airoha_register_gdm_devices() iterates eth->ports[] in order, so GDM2's
netdev is always registered before GDM3/GDM4. This means the explicit
check for eth->ports[1] && eth->ports[1]->devs[0] is a redundant
special-case of what airoha_get_wan_gdm_dev() already covers, since
GDM2 is always marked as WAN during its own ndo_init.
Remove the redundant check and rely solely on airoha_get_wan_gdm_dev()
which handles both the GDM2-present and GDM2-absent cases.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20260610-airoha-eth-simplify-dev-init-v2-1-8f244e69b0d4@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: sch_hfsc: Don't make class passive twice

update_vf() is called from two places for the same class during a single
dequeue when the class's child qdisc (e.g. codel/fq_codel) drops its last
packets while dequeuing:

1. The child calls qdisc_tree_reduce_backlog(), which, now that the child
   is empty, invokes hfsc_qlen_notify() -> update_vf(cl, 0, 0) and turns
   the class passive (cl_nactive is decremented up the hierarchy).

2. hfsc_dequeue() then calls update_vf(cl, qdisc_pkt_len(skb), cur_time)
   to charge the dequeued bytes.

On the second call the class is already passive, but its child qdisc is
still empty, so update_vf() arms go_passive again:

      if (cl->qdisc->q.qlen == 0 && cl->cl_flags & HFSC_FSC)
              go_passive = 1;

The leaf is then skipped by the cl_nactive == 0 check inside the loop,
which does not clear go_passive, so the stale go_passive propagates to the
parent and decrements its cl_nactive a second time. A parent that still
has other active children is driven to cl_nactive == 0 and removed from
the vttree, even though those siblings are still backlogged. They are
never dequeued again and the qdisc stalls.

Fix this by only arming go_passive when the class is actually active, so an
already-passive class no longer triggers a second passive transition. The
byte accounting (cl->cl_total += len) still runs for every ancestor, so
dequeued bytes continue to be counted exactly once.
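
The double decrement can be reproduced with a heavily reduced model of the
class hierarchy. Everything here is an assumption made for illustration:
struct mock_class, mock_update_vf() and its guard parameter are
hypothetical, and byte accounting, the HFSC_FSC flag and the vttree are
left out. Only the go_passive / cl_nactive interaction follows the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced HFSC class. */
struct mock_class {
	struct mock_class *parent;	/* NULL for the root class */
	unsigned int nactive;		/* active children, or 1 for an active leaf */
	unsigned int qlen;		/* packets left in the child qdisc */
};

/*
 * Reduced update_vf(): walk from the class up to, but not including, the
 * root. With guard == false, go_passive is armed whenever the queue is
 * empty; with guard == true it is armed only if the class is still active.
 */
static void mock_update_vf(struct mock_class *cl, bool guard)
{
	bool go_passive = false;

	assert(cl != NULL);
	if (cl->qlen == 0 && (!guard || cl->nactive > 0))
		go_passive = true;

	for (; cl->parent != NULL; cl = cl->parent) {
		if (cl->nactive == 0)
			continue;	/* skipped, but go_passive stays set */
		/* Decrement only while a passive transition is propagating. */
		go_passive = go_passive && --cl->nactive == 0;
	}
}

/* Run both update_vf() calls of one dequeue; return the parent's nactive. */
static unsigned int mock_dequeue_last_packet(bool guard)
{
	struct mock_class root = { NULL, 1, 0 };
	struct mock_class parent = { &root, 2, 0 };	/* two active children */
	struct mock_class leaf = { &parent, 1, 0 };	/* child qdisc just emptied */

	mock_update_vf(&leaf, guard);	/* first call, from the qlen notification */
	mock_update_vf(&leaf, guard);	/* second call, from the dequeue path */
	return parent.nactive;
}
```

Without the guard the parent ends at 0 active children although one
sibling is still backlogged; with the guard it correctly ends at 1.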

Fixes: 51eb3b65544c ("sch_hfsc: make hfsc_qlen_notify() idempotent")
Reported-by: Anirudh Gupta <anirudhrudr@gmail.com>
Closes: https://lore.kernel.org/netdev/CAN2cbVe79oj0O9==m4+4x3v+O+qzRagA=2=wkrp9i9=CqYvyZA@mail.gmail.com/
Tested-by: Anirudh Gupta <anirudhrudr@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260610132824.3027549-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Stop leased rxq before uninstalling its memory provider

netif_rxq_cleanup_unlease() tears down the memory provider that was
installed on a physical RX queue through a netkit queue lease. It
currently revokes the provider's DMA mappings before stopping the
physical queue:

__netif_mp_uninstall_rxq(virt_rxq, p); /* DMA unmap */
__netif_mp_close_rxq(phys_rxq->dev, rxq_idx, p); /* queue stop */

This inverts the ordering used by the regular teardown paths (normal
device unregister and the io_uring zcrx close path), which stop the
queue before revoking the provider's mappings.

With the physical queue still live, its NAPI can keep consuming
net_iov entries from the page_pool alloc cache after the
__netif_mp_uninstall_rxq() has already cleared their dma_addr,
opening a window for the device to DMA to a stale or zero address.

Fix it by swapping the two calls so the queue is stopped (and its
NAPI quiesced) before the provider is uninstalled. No functional
regression was observed across repeated runs of the nk_qlease.py
HW selftest, which exercises the lease teardown path; this was
tested against fbnic QEMU emulation.
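
Why the order matters can be shown with a toy model in which a NAPI poll
lands between the two teardown steps. struct mock_rxq and all mock_*
helpers are hypothetical and do not mirror the kernel functions beyond the
stop-then-uninstall ordering described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced RX queue state. */
struct mock_rxq {
	bool running;		/* NAPI may still poll while true */
	uint64_t dma_addr;	/* 0 once the provider mapping is revoked */
	unsigned int stale_dma;	/* polls that used a revoked mapping */
};

/* A poll on a live queue with a cleared dma_addr counts as a stale DMA. */
static void mock_napi_poll(struct mock_rxq *q)
{
	if (q->running && q->dma_addr == 0)
		q->stale_dma++;
}

static void mock_mp_uninstall(struct mock_rxq *q)
{
	q->dma_addr = 0;
}

static void mock_queue_stop(struct mock_rxq *q)
{
	q->running = false;
}

/* Tear down the lease; one poll is simulated between the two steps. */
static unsigned int mock_unlease(struct mock_rxq *q, bool stop_first)
{
	assert(q != NULL);

	if (stop_first) {
		mock_queue_stop(q);
		mock_napi_poll(q);
		mock_mp_uninstall(q);
	} else {
		mock_mp_uninstall(q);
		mock_napi_poll(q);
		mock_queue_stop(q);
	}
	return q->stale_dma;
}
```

In this model the old order records one stale access, while stopping the
queue first records none.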

Fixes: 5602ad61ebee ("net: Proxy netif_mp_{open,close}_rxq for leased queues")
Reported-by: Ahmed Abdelmoemen <ahmedabdelmoumen05@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Wei <dw@davidwei.uk>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260609212240.677889-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: fix refcount leak in mlxsw_sp_vrs_lpm_tree_replace()

When mlxsw_sp_vrs_lpm_tree_replace() fails after replacing some VRs,
the error rollback loop does not correctly revert the preceding
replacements. The loop decrements the index but fails to update the
vr pointer, which still points to the VR that caused the failure. As
a result, the condition and the rollback call always operate on the
same VR, potentially calling mlxsw_sp_vr_lpm_tree_replace() multiple
times on it while never rolling back the earlier VRs. Those VRs
continue to hold a reference to new_tree acquired via
mlxsw_sp_lpm_tree_hold(), leaking the reference count of new_tree.

Fix by reinitializing vr inside the error loop with the updated index:

vr = &mlxsw_sp->router->vrs[i];

so that the loop correctly iterates over all VRs that were actually
replaced.
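
The corrected rollback loop can be sketched on a plain array. struct
mock_vr, MOCK_NR_VRS and the mock_* functions are hypothetical and reduce
an LPM tree to an integer id; the point is only that vr is re-derived from
the index on every rollback iteration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_NR_VRS 4

/* Hypothetical reduced virtual router. */
struct mock_vr {
	bool used;
	int tree;	/* id of the LPM tree this VR points at */
};

/* Hypothetical per-VR replace; fails on request to simulate an error. */
static int mock_vr_tree_replace(struct mock_vr *vr, int tree, bool fail)
{
	if (fail)
		return -1;
	vr->tree = tree;
	return 0;
}

/* Replace the tree on all used VRs; on failure, revert those already done. */
static int mock_vrs_tree_replace(struct mock_vr *vrs, int old_tree,
				 int new_tree, int fail_at)
{
	struct mock_vr *vr;
	int i;

	assert(vrs != NULL);
	for (i = 0; i < MOCK_NR_VRS; i++) {
		vr = &vrs[i];
		if (!vr->used)
			continue;
		if (mock_vr_tree_replace(vr, new_tree, i == fail_at))
			goto err_replace;
	}
	return 0;

err_replace:
	for (i--; i >= 0; i--) {
		vr = &vrs[i];	/* re-derive vr from the updated index */
		if (vr->used)
			mock_vr_tree_replace(vr, old_tree, false);
	}
	return -1;
}
```

Without the assignment inside the rollback loop, vr would keep pointing at
the VR that failed and the earlier entries would stay on new_tree.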

Cc: stable@vger.kernel.org
Fixes: fc922bb0dd94 ("mlxsw: spectrum_router: Use one LPM tree for all virtual routers")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260609084730.215732-1-vulab@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: fix refcount leak in mlxsw_sp_port_lag_join()

When mlxsw_sp_port_lag_index_get() fails, mlxsw_sp_port_lag_join()
returns an error without releasing the lag reference obtained by
the earlier mlxsw_sp_lag_get(). All other error paths in the
function jump to the cleanup label that ends with
mlxsw_sp_lag_put(), so this is a single missed release.

Fix the leak by replacing the bare 'return err' with a goto to the
existing error cleanup label, which will drop the reference safely.
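
The pattern is the usual get/put pairing with a shared cleanup label. The
sketch below is a userspace model: struct mock_lag, the mock_* helpers and
the err_lag_put label name are hypothetical, and index_err stands in for
the index lookup failing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced LAG object: just a reference count. */
struct mock_lag {
	int refs;
};

static struct mock_lag *mock_lag_get(struct mock_lag *lag)
{
	assert(lag != NULL);
	lag->refs++;
	return lag;
}

static void mock_lag_put(struct mock_lag *lag)
{
	assert(lag != NULL && lag->refs > 0);
	lag->refs--;
}

/* Join a port to a LAG; index_err simulates the index lookup result. */
static int mock_port_lag_join(struct mock_lag *lag_obj, int index_err)
{
	struct mock_lag *lag = mock_lag_get(lag_obj);
	int err;

	err = index_err;
	if (err)
		goto err_lag_put;	/* a bare "return err" would leak the ref */

	return 0;	/* success: the port keeps its reference */

err_lag_put:
	mock_lag_put(lag);
	return err;
}
```

A failed join leaves the reference count where it started, and a
successful join holds exactly one reference.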

Cc: stable@vger.kernel.org
Fixes: 0d65fc13042f ("mlxsw: spectrum: Implement LAG port join/leave")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260609083709.209743-1-vulab@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ksz87xx-add-support-for-low-loss-cable-equalizer-errata'

Fidelio Lawson says:

====================
ksz87xx: add support for low-loss cable equalizer errata

This patch implements the KSZ87xx short cable erratum
described in Microchip document DS80000687C for KSZ87xx switches
and the following support article:

Link: https://support.microchip.com/s/article/Solution-for-Using-CAT-5E-or-CAT-6-Short-Cable-with-a-Link-Issue-for-the-KSZ8795-Family

According to the erratum, the embedded PHY receiver in KSZ87xx switches is
tuned by default for long, high-loss Ethernet cables. When operating with
short or low-loss cables (for example CAT5e or CAT6), the PHY equalizer may
over-amplify the incoming signal, leading to internal distortion and link
establishment failures.

Microchip documents two independent mechanisms to mitigate this issue:
adjusting the receiver low‑pass filter bandwidth and reducing the DSP
equalizer initial value. These registers are located in the switch’s
internal LinkMD table and cannot be accessed directly through a
stand‑alone PHY driver.

To keep the PHY‑facing API clean, this series models the erratum handling
as vendor‑specific Clause 22 PHY registers, virtualized by the KSZ8 DSA
driver. Accesses are intercepted by ksz8_r_phy() / ksz8_w_phy() and
translated into the appropriate indirect LinkMD register writes. The
erratum affects the shared PHY analog front‑end and therefore applies
globally to the switch.

Based on review feedback, the user‑visible interface is kept deliberately
simple and predictable:

- A boolean “short‑cable” PHY tunable applies a documented and
  conservative preset (LPF bandwidth 62MHz, DSP EQ initial value 0).
  This is the recommended KISS interface for the common short‑cable
  scenario.

- Two additional integer PHY tunables allow advanced or experimental
  tuning of the LPF bandwidth and the DSP EQ initial value. These
  controls are orthogonal, have no ordering requirements, and simply
  override the corresponding setting when written.

The tunables act as simple setters with no implicit state machine or
invalid combinations, avoiding surprises for userspace and not relying
on extended error reporting or netlink ethtool support.

This series contains:

  1. Support for the KSZ87xx low‑loss cable erratum in the KSZ8 DSA driver,
     including the short‑cable preset and orthogonal tuning controls.

  2. Addition of vendor‑specific PHY tunable identifiers for the
     short‑cable preset, LPF bandwidth, and DSP EQ initial value.

  3. Exposure of these tunables through the Micrel PHY driver via
     get_tunable / set_tunable callbacks.

This version follows the design agreed upon during v3 review and
reworks the interface accordingly.
====================

Link: https://patch.msgid.link/20260609-ksz87xx_errata_low_loss_connections-v10-0-9ba4418cf3db@exotec.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: micrel: expose KSZ87xx low-loss cable tunables

Add support for the KSZ87xx low-loss cable PHY tunables in the Micrel
PHY driver by implementing get_tunable and set_tunable callbacks.

These callbacks expose vendor-specific PHY tunables used to control the
KSZ87xx embedded PHY receiver behavior when operating with short or
low-loss Ethernet cables. The tunables provide:

- a boolean short-cable preset applying known good settings;
- an integer LPF bandwidth control;
- an integer DSP EQ initial value control.

The Micrel PHY driver forwards these tunables via standard phy_read() /
phy_write() operations, which are virtualized by the KSZ8 DSA driver and
translated into the appropriate indirect switch register accesses.

Reviewed-by: Marek Vasut <marex@nabladev.com>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Signed-off-by: Fidelio Lawson <fidelio.lawson@exotec.com>
Link: https://patch.msgid.link/20260609-ksz87xx_errata_low_loss_connections-v10-3-9ba4418cf3db@exotec.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethtool: add KSZ87xx low-loss cable PHY tunables

Introduce vendor-specific PHY tunable identifiers to control the
KSZ87xx low-loss cable erratum handling through the ethtool PHY
tunable interface.

The following tunables are added:

- a boolean "short-cable" tunable, applying a documented and
  conservative preset intended for short or low-loss Ethernet cables;

- an integer LPF bandwidth tunable, allowing advanced adjustment of the
  receiver low-pass filter bandwidth;

- an integer DSP EQ initial value tunable, allowing advanced tuning of
  the PHY equalizer initialization.

The actual behavior is implemented by the corresponding PHY and switch
drivers.

Reviewed-by: Marek Vasut <marex@nabladev.com>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Signed-off-by: Fidelio Lawson <fidelio.lawson@exotec.com>
Link: https://patch.msgid.link/20260609-ksz87xx_errata_low_loss_connections-v10-2-9ba4418cf3db@exotec.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: implement KSZ87xx Module 3 low-loss cable errata

Implement the KSZ87xx short cable workaround.

This patch implements the KSZ87xx short cable erratum
described in Microchip document DS80000687C for KSZ87xx switches
and the following support article:

Link: https://support.microchip.com/s/article/Solution-for-Using-CAT-5E-or-CAT-6-Short-Cable-with-a-Link-Issue-for-the-KSZ8795-Family

The issue affects short or low-loss cable links (e.g. CAT5e/CAT6),
where the PHY receiver equalizer may amplify high-amplitude signals
excessively, resulting in internal distortion and link establishment
failures.

KSZ87xx devices require a workaround for the Module 3 low-loss cable
condition, controlled through the switch TABLE_LINK_MD_V indirect
registers.

This change models the erratum handling as vendor-specific Clause 22 PHY
registers, virtualized by the KSZ8 DSA driver and accessed via
ksz8_r_phy() / ksz8_w_phy(). The following controls are provided:

- A boolean “short-cable” preset, which applies a documented and
  conservative configuration (LPF 62 MHz bandwidth and DSP EQ initial
  value 0), and is the recommended interface for typical use cases.

- Separate LPF bandwidth and DSP EQ initial value controls intended for
  advanced or experimental tuning. These are orthogonal and independent,
  and override the corresponding settings without requiring any specific
  ordering.

The preset and tunables act as simple setters with no implicit state
machine or invalid combinations, keeping the API predictable and aligned
with the KISS principle.

The erratum affects the shared PHY analog front-end and therefore applies
globally to the switch.

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Marek Vasut <marex@nabladev.com>
Signed-off-by: Fidelio Lawson <fidelio.lawson@exotec.com>
Link: https://patch.msgid.link/20260609-ksz87xx_errata_low_loss_connections-v10-1-9ba4418cf3db@exotec.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net/openvswitch: add flow modify test

Add mod_flow() and the mod-flow CLI command to ovs-dpctl.py, exercising
OVS_FLOW_CMD_SET. Add test_flow_set which first modifies an existing
flow with new actions and verifies the change via traffic, then modifies
the same flow without actions and verifies the kernel handles the
no-actions case gracefully.

The no-actions path is unreachable from userspace OVS tools (dpctl
mod-flow requires actions) but reachable via raw netlink. This is the
code path where Adrian Moreno found a possible kfree_skb of ERR_PTR
when reply allocation fails after locking.

Make parse() skip OVS_FLOW_ATTR_ACTIONS when actstr is None so the
kernel enters the post-lock allocation branch in ovs_flow_cmd_set().
After the no-actions set, verify via dump-flows that the flow retained
its drop action.

Suggested-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Minxi Hou <houminxi@gmail.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20260609165725.107484-1-houminxi@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bcmgenet: convert RX path to page_pool

Replace the per-packet __netdev_alloc_skb() + dma_map_single() in the
RX path with page_pool. SKBs are built from pool pages via
napi_build_skb() with skb_mark_for_recycle() so the network stack
returns pages to the pool, and DMA mapping happens once per page
instead of once per packet.

Reject HW-reported lengths smaller than the RSB so a runt cannot
underflow the SKB build path.
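
A minimal sketch of the length check, assuming the RSB size is passed in as
a parameter. mock_rx_payload_len() is a hypothetical helper, not a driver
function; it only shows why the comparison must come before the
subtraction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Compute the payload length behind the RSB. Returns false for a runt
 * whose hardware-reported length is smaller than the RSB itself; without
 * this check the unsigned subtraction below would wrap around.
 */
static bool mock_rx_payload_len(size_t hw_len, size_t rsb_len,
				size_t *payload)
{
	assert(payload != NULL);

	if (hw_len < rsb_len)
		return false;	/* reject: too short to contain the RSB */

	*payload = hw_len - rsb_len;
	return true;
}
```

The caller would drop the frame when the helper returns false and only
build an SKB from a length that has been validated.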

Drop the now-unused priv->rx_buf_len field and the rx_dma_failed soft
MIB counter (nothing increments it after the conversion). This
removes the "rx_dma_failed" entry from ethtool -S, which is a
user-visible change for monitoring tools that key on stat names.

Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Justin Chen <justin.chen@broadcom.com>
Tested-by: Justin Chen <justin.chen@broadcom.com>
Link: https://patch.msgid.link/20260610114835.2225423-1-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: move get_sport() callback at the beginning of airoha_enable_gdm2_loopback()

Move the get_sport() callback invocation to the beginning of the
airoha_enable_gdm2_loopback() routine in order to avoid leaving the
hardware in a partially configured state if get_sport() fails.
Previously, get_sport() was called after GDM2 forwarding, loopback,
channel, length, VIP and IFC registers had already been programmed.
A failure at that point would return an error leaving GDM2 with
loopback enabled but WAN port, PPE CPU port and flow control mappings
not configured.
Performing the get_sport() lookup before any register write guarantees
the routine either completes the full configuration sequence or exits
with no side effects on the hardware.
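
The reordering follows the general "look up first, write afterwards"
pattern, sketched here on a toy register model. struct mock_hw,
mock_get_sport() and mock_enable_loopback() are hypothetical and reduce
the real register sequence to two flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hardware state reduced to two flags. */
struct mock_hw {
	bool loopback;		/* loopback-related registers programmed */
	bool wan_mapped;	/* WAN port / CPU port mappings programmed */
};

/* Hypothetical lookup: returns a source port, or a negative error. */
static int mock_get_sport(int result)
{
	return result;
}

/* Enable loopback; the fallible lookup runs before any state is changed. */
static int mock_enable_loopback(struct mock_hw *hw, int sport_result)
{
	int sport;

	assert(hw != NULL);

	sport = mock_get_sport(sport_result);
	if (sport < 0)
		return sport;	/* nothing has been written yet */

	hw->loopback = true;
	hw->wan_mapped = true;
	return 0;
}
```

On a lookup failure both flags stay clear, so the model never ends up with
loopback enabled but the mappings missing.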

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260608-airoha_enable_gdm2_loopback-minor-change-v1-1-1787a0f42b31@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mptcp-pm-drop-tcp-ts-with-add_addrv6-port'

Matthieu Baerts says:

====================
mptcp: pm: drop TCP TS with ADD_ADDRv6 + port

Up to this series, it was possible to add a "signal" MPTCP endpoint with
an IPv6 address and a port, or to directly request to send an ADD_ADDR
with a v6 address and a port, but the expected ADD_ADDR wasn't sent when
TCP timestamps were used for the connection.

In fact, such a signalling option cannot be sent when TCP timestamps are
used, due to a lack of option space: the limit is 40 bytes and, with
padding, TCP timestamps take 12 bytes, while an ADD_ADDR with IPv6 +
port takes 30 bytes. The selected solution here is to simply drop
the TCP timestamps option when such ADD_ADDR of 30 bytes needs to be
sent.
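
The option-space arithmetic can be written out as a small helper. The
constants are the sizes quoted above; the MOCK_* names and
mock_must_drop_ts() are hypothetical, and the sysctl knob that controls the
behaviour in the series is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sizes quoted in the description, in bytes. */
static const size_t MOCK_TCP_OPT_SPACE = 40;	/* TCP option space limit */
static const size_t MOCK_TCP_TS_ALIGNED = 12;	/* timestamps with padding */

/*
 * Return true when the timestamps option has to be dropped so that an
 * ADD_ADDR option of add_addr_len bytes fits in the same header.
 */
static bool mock_must_drop_ts(bool ts_enabled, size_t add_addr_len)
{
	assert(add_addr_len <= MOCK_TCP_OPT_SPACE);

	if (!ts_enabled)
		return false;

	return MOCK_TCP_TS_ALIGNED + add_addr_len > MOCK_TCP_OPT_SPACE;
}
```

For the 30-byte ADD_ADDR with IPv6 address and port, 12 + 30 = 42 bytes
exceeds the 40-byte limit, so the helper reports that timestamps must go;
anything up to 28 bytes still fits alongside them.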

- Patches 1-3: small cleanups to avoid computing ADD/RM_ADDR twice.

- Patches 4-7: the new feature, controlled by a new sysctl knob.

- Patch 8: extra checks in the MPTCP Join selftests.

- Patches 9-15: a bunch of refactoring: renamed confusing helpers and
variables, and prevented future misuse of some functions.
====================

Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-0-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: options: rst: drop unused skb parameter

It was passed since its introduction in commit dc87efdb1a5c ("mptcp: add
mptcp reset option support"), but never used.

Simply remove it.

Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-15-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: avoid using del_timer directly

mptcp_pm_announced_del_timer() removes the matched ADD_ADDR entry (if
found) from the ADD_ADDR list only if check_id is false. That's
dangerous, and not clear, because it means the caller should free the
entry only in some cases, and it is easy to miss that.

Instead, make it static, and call it from mptcp_pm_add_addr_echoed,
which is the only other case where mptcp_pm_add_addr_del_timer should be
called with check_id set to true. Bonus with that: a second call to
mptcp_pm_add_addr_lookup_by_addr() can be avoided.

Note that instead of adding the signature above to avoid a compilation
issue because this helper is called before the definition of the
function, the whole helper is moved above where it is first called. Its
content is untouched, except for the addition of the 'static' keyword.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-14-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: make mptcp_pm_add_addr_send_ack static

Only used in pm.c.

Note that the signature is added above: it is easier than moving the
code around, because this helper depends on mptcp_pm_schedule_work which
is declared below.

While at it, explicitly mark it as to be called while pm->lock is held.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-13-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: remove add_ prefix from timer

Similar to the two previous commits, using the 'add' prefix is
confusing, also confirmed by [1].

Now that the structure has been renamed to include 'add_addr' in its
name, it is easier to see that the timer is linked to the ADD_ADDR: no
need to add the confusing prefix, or an unneeded longer one.

While at it, also update the ADD_ADDR timer helper to clearly specify it
is linked to ADD_ADDR, and it is not there to add a new timer.

Link: https://lore.kernel.org/20251117100745.1913963-1-edumazet@google.com
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-12-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: uniform announced addresses helpers

Similar to the previous commit, only using the 'add' or 'anno' prefixes
is confusing -- generally associated with the action of adding something,
or the Latin name for "year" -- and lacks uniformity.

This has been causing issues in the past, e.g. del_add_timer seemed to
suggest the goal is to delete a previously added timer.

Instead, use the mptcp_pm_announced_ prefix.

While at it, slightly improve some helpers:

- mptcp_lookup_anno_list_by_saddr: no need to specify what is used to do
  the lookup: mptcp_pm_announced_lookup.

- mptcp_pm_sport_in_anno_list: it doesn't just compare the port, but the
  whole address linked to the subflow: mptcp_pm_announced_has_ssk.

- mptcp_pm_alloc_anno_list: it allocates one item of the list, not a
  whole list: mptcp_pm_announced_alloc.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-11-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: rename add_entry structure to add_addr

Using only the 'add' prefix is confusing: does it refer to a generic
added entry or address, or specifically to ADD_ADDRs. Using add_addr
removes this confusion.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-10-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: use for_each_subflow helper

Similar to most places in the MPTCP code. So instead of passing the
subflow list and using list_for_each_entry(subflow, list, node), pass the
msk and use mptcp_for_each_subflow(msk, subflow).

That's clearer and more uniform with the rest.

While at it, add 'pm_' prefix for the exported one to easily identify
the origin. Plus replace 'lookup' by 'has', because a bool is returned.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-9-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: always check sent/dropped ADD_ADDRs

Before, they were only checked on demand, but it seems better to check
them each time received ADD_ADDRs are checked.

Errors are only reported when the counter exists, and the value is not
the expected one. This is similar to what is done in chk_join_nr: it
reduces the output, and avoids a lot of 'skip' when validating older
kernels. Also here, some tests need to adapt the default expected
counters, e.g. when ADD_ADDR echoes are dropped on the reception side,
or when it is not possible to send an ADD_ADDR due to the limited
option space.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-8-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>