git.ipfire.org Git - thirdparty/kernel/linux.git/log

batman-adv: fix fragment reassembly length accounting

batman-adv keeps a running payload length for queued fragments and uses it
to validate a fragment chain before reassembly.

That accounting currently allows the accumulated fragment length to be
truncated during updates. As a result, malformed fragment chains can
bypass the intended validation and drive reassembly with inconsistent
length state, leading to a local denial of service.

Fix the accounting by storing the accumulated length in a length-typed
field and rejecting update overflows before the existing validation logic
runs.
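
The truncation problem described above can be sketched in userspace as follows. This is only an illustration of the general failure mode, not the batman-adv code: the struct, the function names and the 16-bit width of the narrow accumulator are all assumptions, and __builtin_add_overflow() (a GCC/Clang builtin) stands in for the kernel's overflow helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a fragment chain; not the batman-adv structures. */
struct frag_chain {
	uint16_t narrow_total; /* illustrative narrow accumulator, can truncate */
	size_t total;          /* length-typed accumulator */
};

/* Buggy pattern: the running sum is silently truncated to 16 bits. */
static void chain_add_len_truncating(struct frag_chain *c, size_t len)
{
	c->narrow_total = (uint16_t)(c->narrow_total + len);
}

/* Fixed pattern: refuse the update if the sum would overflow, so that
 * later validation only ever sees a consistent total. */
static bool chain_add_len_checked(struct frag_chain *c, size_t len)
{
	size_t sum;

	assert(c != NULL);
	if (__builtin_add_overflow(c->total, len, &sum))
		return false; /* caller is expected to drop the chain */
	c->total = sum;
	return true;
}
```

With two 40000-byte updates the narrow field ends up at 14464 while the checked accumulator holds 80000, which is the kind of mismatch a validation step can be fooled by.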

The fix was verified against the original reproducer and against valid
fragment reassembly paths.

Fixes: 610bfc6bc99b ("batman-adv: Receive fragmented packets and merge")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Ruide Cao <caoruide123@gmail.com>
Tested-by: Ren Wei <enjou1224z@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Sven Eckelmann <sven@narfation.org>

x86/xen: Tolerate nested XEN_LAZY_MMU entering/leaving

With the support of nested lazy mmu sections it can happen that
arch_enter_lazy_mmu_mode() is being called twice without a call of
arch_leave_lazy_mmu_mode() in between, as the lazy_mmu_*() helpers
are not disabling preemption when checking for nested lazy mmu
sections.

This is a problem when running as a Xen PV guest, as
xen_enter_lazy_mmu() and xen_leave_lazy_mmu() don't tolerate this
case.

Fix that in xen_enter_lazy_mmu() and xen_leave_lazy_mmu() in order
not to hurt all other lazy mmu mode users.

Fixes: 291b3abed657 ("x86/xen: use lazy_mmu_state when context-switching")
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20260508143933.493013-1-jgross@suse.com>

x86/xen: Fix xen_e820_swap_entry_with_ram()

When swapping a not page-aligned E820 map entry with RAM, the start
address of the modified entry is calculated wrong (the offset into the
page is subtracted instead of being added to the page address).
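
A small sketch of the arithmetic, assuming 4 KiB pages. The function and macro names are hypothetical and this is not the body of xen_e820_swap_entry_with_ram(); it only contrasts subtracting the in-page offset with adding it.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096ULL /* assumed page size for the example */

/* Buggy shape: the in-page offset is subtracted from the page address. */
static uint64_t swapped_start_buggy(uint64_t new_page_addr, uint64_t orig_start)
{
	assert(new_page_addr % PAGE_SIZE_SKETCH == 0);
	return new_page_addr - (orig_start % PAGE_SIZE_SKETCH);
}

/* Fixed shape: the in-page offset is added to the page address. */
static uint64_t swapped_start_fixed(uint64_t new_page_addr, uint64_t orig_start)
{
	assert(new_page_addr % PAGE_SIZE_SKETCH == 0);
	return new_page_addr + (orig_start % PAGE_SIZE_SKETCH);
}
```

For an entry that started at offset 0x123 into its page, the buggy variant lands in the page below the intended one.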

Fixes: be35d91c8880 ("xen: tolerate ACPI NVS memory overlapping with Xen allocated memory")
Reported-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20260505102417.208138-1-jgross@suse.com>

gcc-plugins: Always define CONST_CAST_GIMPLE and CONST_CAST_TREE

For gcc-16, the CONST_CAST macro family was removed. Add back what
we were using in gcc-common.h, as they are simple wrappers.

See GCC commits:
c3d96ff9e916c02584aa081f03ab999292efbb50
458c7926d48959abcb2c1adaa22458e27459a551

Suggested-by: Ingo Saitz <ingo@hannover.ccc.de>
Link: https://lore.kernel.org/lkml/ab6OKoay0OWkywjK@spatz.zoo
Fixes: 6b90bd4ba40b ("GCC plugin infrastructure")
Tested-by: Ivan Bulatovic <combuster@archlinux.us>
Tested-by: Christopher Cradock <christopher@cradock.myzen.co.uk>
Signed-off-by: Kees Cook <kees@kernel.org>

Merge tag 'net-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter.

  Previous releases - regressions:

   - ethtool: fix NULL pointer dereference in phy_reply_size

   - netfilter:
      - allocate hook ops while under mutex
      - close dangling table module init race
      - restore nf_conntrack helper propagation via expectation

   - tcp:
      - fix potential UAF in reqsk_timer_handler().
      - fix out-of-bounds access for twsk in tcp_ao_established_key().

   - vsock: fix empty payload in tap skb for non-linear buffers

   - hsr: fix NULL pointer dereference in hsr_get_node_data()

   - eth:
      - cortina: fix RX drop accounting
      - ice: fix locking in ice_dcb_rebuild()

  Previous releases - always broken:

   - napi: avoid gro timer misfiring at end of busypoll

   - sched:
      - dualpi2: initialize timer earlier in dualpi2_init()
      - sch_cbs: Call qdisc_reset for child qdisc

   - shaper:
      - fix ordering issue in net_shaper_commit()
      - reject handle IDs exceeding internal bit-width

   - ipv6: flowlabel: enforce per-netns limit for unprivileged callers

   - tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring

   - smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint

   - sctp: revalidate list cursor after sctp_sendmsg_to_asoc() in SCTP_SENDALL

   - batman-adv:
      - reject new tp_meter sessions during teardown
      - purge non-released claims

   - eth:
      - i40e: cleanup PTP registration on probe failure
      - idpf: fix double free and use-after-free in aux device error paths
      - ena: fix potential use-after-free in get_timestamp"

* tag 'net-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
  net: phy: DP83TC811: add reading of abilities
  net: tls: prevent chain-after-chain in plain text SG
  net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
  net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot
  macsec: use rcu_work to defer TX SA crypto cleanup out of softirq
  macsec: use rcu_work to defer RX SA crypto cleanup out of softirq
  macsec: introduce dedicated workqueue for SA crypto cleanup
  net: net_failover: Fix the deadlock in slave register
  MAINTAINERS: update atlantic driver maintainer
  selftests/tc-testing: Add QFQ/CBS qlen underflow test
  net/sched: sch_cbs: Call qdisc_reset for child qdisc
  FDDI: defza: Sanitise the reset safety timer
  net: ethernet: ravb: Do not check URAM suspension when WoL is active
  ethtool: fix ethnl_bitmap32_not_zero() bit interval semantics
  net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint
  net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoS
  net: atm: fix skb leak in sigd_send() default branch
  net: ethtool: phy: avoid NULL deref when PHY driver is unbound
  net: atlantic: preserve PCI wake-from-D3 on shutdown when WOL enabled
  net: shaper: reject QUEUE scope handle with missing id
  ...

smb: client: avoid integer overflow in SMB2 READ length check

SMB2 READ response validation in cifs_readv_receive() and
handle_read_data() checks data_offset + data_len against the received
buffer length.  Both values are attacker-controlled fields from the
server response and are stored as unsigned int, so the addition can
wrap before the bounds check:

fs/smb/client/transport.c:1259
if (!use_rdma_mr && (data_offset + data_len > buflen))

fs/smb/client/smb2ops.c:4839
else if (buf_len >= data_offset + data_len)

A malicious SMB server can use this to bypass validation.  In the
non-encrypted receive path the client attempts an oversized socket
read and stalls for the SMB response timeout (180 seconds) before
reconnecting.  In the SMB3 encrypted path, runtime testing shows the
malformed length can reach copy_to_iter() in handle_read_data() with
attacker-controlled size, where usercopy hardening stops the oversized
copy before bytes reach userspace.

Guard both call sites with check_add_overflow(), which is already
used elsewhere in this subsystem (smb2pdu.c).  On overflow, treat the
response as malformed and reject with -EIO.
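
The wrap and the guard can be sketched in userspace like this. The function names are hypothetical, the types are simplified to uint32_t, and __builtin_add_overflow() (GCC/Clang) stands in for the kernel's check_add_overflow(); the real call sites return -EIO rather than a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same shape as the unguarded comparison: the sum can wrap to a small value. */
static bool read_len_ok_unguarded(uint32_t data_offset, uint32_t data_len,
				  uint32_t buflen)
{
	return !(data_offset + data_len > buflen);
}

/* Guarded variant: an overflowing sum is treated as a malformed response. */
static bool read_len_ok_guarded(uint32_t data_offset, uint32_t data_len,
				uint32_t buflen)
{
	uint32_t end;

	if (__builtin_add_overflow(data_offset, data_len, &end))
		return false; /* the patch rejects with -EIO at this point */
	return end <= buflen;
}

/* Usage: a server-chosen length that wraps the sum to zero. */
static void read_len_selfcheck(void)
{
	assert(read_len_ok_unguarded(0x100, 0xffffff00u, 0x1000));
	assert(!read_len_ok_guarded(0x100, 0xffffff00u, 0x1000));
}
```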

Signed-off-by: Jeremy Erazo <mendozayt13@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

Merge tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit fixes from Paul Moore:

- Correctly log the inheritable capabilities

- Honor AUDIT_LOCKED in the AUDIT_TRIM and AUDIT_MAKE_EQUIV commands

* tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIV
audit: fix incorrect inheritable capability in CAPSET records

phy: apple: atc: Fix typec switch/mux leak on unbind

atcphy_probe_switch() and atcphy_probe_mux() discard the pointers
returned by typec_switch_register() and typec_mux_register(). The
platform driver has no .remove callback, so when the driver unbinds
(e.g. via sysfs unbind) neither typec_switch_unregister() nor
typec_mux_unregister() is called. The framework reference taken in
typec_switch_register() (device_initialize() + device_add() in
drivers/usb/typec/mux.c) is therefore never dropped and the
typec_switch_dev / typec_mux_dev objects stay live forever, with
their sysfs entries under the typec_mux class also left behind. A
subsequent rebind cannot recreate them with the same fwnode-derived
name.

Save the registered handles and unregister them through
devm_add_action_or_reset() so framework registration is torn down
in step with the driver's other devm-managed state. While here,
drop struct apple_atcphy::sw and ::mux: they were declared with the
consumer-side types (typec_switch *, typec_mux *) instead of the
provider-side types and were never assigned.

Scope of the fix
================
This patch fixes the registration leak only. It does not close the
use-after-free window that arises when a consumer that obtained a
reference via fwnode_typec_switch_get() / fwnode_typec_mux_get()
outlives the provider unbind: such consumers keep the underlying
typec_switch_dev / typec_mux_dev alive past device_unregister(),
and a later typec_switch_set() / typec_mux_set() still invokes the
registered atcphy_sw_set() / atcphy_mux_set(), which dereferences
the freed apple_atcphy through typec_{switch,mux}_get_drvdata().

On Apple Silicon the relevant consumers are the typec port and the
cd321x controller registered by drivers/usb/typec/tipd/core.c.
Cable plug / orientation events and alt-mode transitions trigger
the .set callbacks via:

  tps6598x_interrupt()                 drivers/usb/typec/tipd/core.c
    tps6598x_handle_plug_event()
      tps6598x_connect()/_disconnect()
        typec_set_orientation()        drivers/usb/typec/class.c
          typec_switch_set(port->sw)   drivers/usb/typec/mux.c
            atcphy_sw_set()            drivers/phy/apple/atc.c

  cd321x_update_work()                 drivers/usb/typec/tipd/core.c
    cd321x_typec_update_mode()
      typec_mux_set(cd321x->mux)       drivers/usb/typec/mux.c
        atcphy_mux_set()               drivers/phy/apple/atc.c

Closing that window requires framework support for invalidating
consumer-held references on provider unbind. The same
consumer-survives-provider pattern has been discussed for the PHY
framework [1] and is out of scope here.

[1] https://lore.kernel.org/linux-phy/aZejMSJ9qqRWb2pX@google.com/

Fixes: 8e98ca1e74db ("phy: apple: Add Apple Type-C PHY")
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Tested-by: Joshua Peisach <jpeisach@ubuntu.com>
Link: https://lkml.kernel.org/r/6ec1ed08328340db42655287afd5fa4067316b11.camel@perches.com
Link: https://patch.msgid.link/20260508201958.30060-1-devnexen@gmail.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>

selftests/nolibc: test open mode handling

Add a selftest for the new O_TMPFILE open mode handling.
While O_CREAT or openat() are not tested, the code is the same,
so assume these also work.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260514-nolibc-open-tmpfile-v2-4-b4c6c5efa266@weissschuh.net

tools/nolibc: always pass mode to open syscall

When O_TMPFILE is set, the open mode needs to be passed to the kernel as
per the documentation. Currently this is not done.
Instead of checking for O_TMPFILE explicitly and making the conditionals
more complex, just always pass the mode to the kernel. If no value was
passed the mode will be garbage, but the kernel will ignore it anyways.

Fixes: a7604ba149e7 ("tools/nolibc/sys: make open() take a vararg on the 3rd argument")
Suggested-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/lkml/afRfjdovT6pNtwtP@1wt.eu/
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260514-nolibc-open-tmpfile-v2-3-b4c6c5efa266@weissschuh.net

tools/nolibc: split open mode handling into a macro

This logic is duplicated and some upcoming extensions would require even
more duplicated logic.

Move it into a macro to avoid the duplication and allow cleaner changes.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260514-nolibc-open-tmpfile-v2-2-b4c6c5efa266@weissschuh.net

tools/nolibc: split implicit open flags into a macro

This logic is duplicated and its current form will be in the way of some
upcoming simplifications.

Move it into a macro to avoid the duplication and enable some cleanups.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260514-nolibc-open-tmpfile-v2-1-b4c6c5efa266@weissschuh.net

ptrace: slightly saner 'get_dumpable()' logic

The 'dumpability' of a task is fundamentally about the memory image of
the task - the concept comes from whether it can core dump or not - and
makes no sense when you don't have an associated mm.

And almost all users do in fact use it only for the case where the task
has a mm pointer.

But we have one odd special case: ptrace_may_access() uses 'dumpable' to
check various other things entirely independently of the MM (typically
explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for
threads that no longer have a VM (and maybe never did, like most kernel
threads).

It's not what this flag was designed for, but it is what it is.

The ptrace code does check that the uid/gid matches, so you do have to
be uid-0 to see kernel thread details, but this means that the
traditional "drop capabilities" model doesn't make any difference for
this all.

Make it all make a *bit* more sense by saying that if you don't have a
MM pointer, we'll use a cached "last dumpability" flag if the thread
ever had a MM (it will be zero for kernel threads since it is never
set), and require a proper CAP_SYS_PTRACE capability to override.

Reported-by: Qualys Security Advisory <qsa@qualys.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

drm/xe/gt_idle: Use NSEC_PER_MSEC instead of float literal

The residency multiplier conversion in get_residency_ms() used the
floating-point literal 1e6 as the divisor of mul_u64_u32_div(). While
the compiler constant-folds this to an integer, using float literals
in kernel code is bad practice since the kernel generally avoids
floating-point operations.

Replace 1e6 with the standard NSEC_PER_MSEC macro from <linux/time64.h>,
which is both self-documenting (ns to ms conversion) and unambiguously
integer. Add the corresponding include rather than relying on
transitive inclusion.

No functional change.
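
A userspace sketch of why this is a readability change only. The helper below is a simplified stand-in for mul_u64_u32_div() (it lacks the kernel's wide intermediate), and the tick-to-millisecond functions and the 1280 ns multiplier used in the checks are hypothetical, not taken from get_residency_ms().

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000L /* same value as the kernel macro */

/* Simplified stand-in: keep a * mul below 2^64 when using this sketch. */
static uint64_t mul_u64_u32_div_sketch(uint64_t a, uint32_t mul, uint32_t div)
{
	assert(div != 0);
	return a * mul / div;
}

/* Old shape: the double literal is implicitly converted to an integer. */
static uint64_t ticks_to_ms_float_literal(uint64_t ticks, uint32_t ns_per_tick)
{
	return mul_u64_u32_div_sketch(ticks, ns_per_tick, 1e6);
}

/* New shape: an integer macro that also documents the unit conversion. */
static uint64_t ticks_to_ms_macro(uint64_t ticks, uint32_t ns_per_tick)
{
	return mul_u64_u32_div_sketch(ticks, ns_per_tick, NSEC_PER_MSEC);
}
```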

Assisted-by: Claude:claude-opus-4.6
Reviewed-by: Nitin Gote <nitin.r.gote@intel.com>
Link: https://patch.msgid.link/20260511153307.223435-1-shuicheng.lin@intel.com
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>

drm/xe/gsc: Fix double-free of managed BO in error path

The error path in xe_gsc_init_post_hwconfig() explicitly frees a BO
allocated with xe_managed_bo_create_pin_map() via
xe_bo_unpin_map_no_vm(). Since the managed BO already has a devm
cleanup action registered, this causes a double-free when devm
unwinds during probe failure.

Remove the explicit free and let devm handle it, consistent with
all other xe_managed_bo_create_pin_map() callers.

Fixes: 2e5d47fe7839 ("drm/xe/uc: Use managed bo for HuC and GSC objects")
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Assisted-by: Claude:claude-opus-4.6
Link: https://patch.msgid.link/20260511154134.223696-1-shuicheng.lin@intel.com
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>

bpf: Use array_map_meta_equal for percpu array inner map replacement

percpu_array_map_ops.map_meta_equal points to the generic
bpf_map_meta_equal(), which does not compare max_entries. When a
percpu array serves as an inner map, replacing it with one that has
fewer max_entries bypasses the check. Since percpu_array_map_gen_lookup()
inlines the original template's index_mask as a JIT immediate, a lookup
on the replacement map can access pptrs[] out of bounds.

Point percpu_array_map_ops.map_meta_equal to array_map_meta_equal(),
which already enforces the max_entries equality check.
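
The mismatch can be modelled in a few lines. This is a deliberately reduced sketch: the struct, its fields other than max_entries and index_mask, and the function names are assumptions, and the generic comparison is reduced to a single field to show only that max_entries is not part of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the map metadata relevant here. */
struct pcpu_array_model {
	uint32_t value_size;
	uint32_t max_entries;
	uint32_t index_mask; /* 2^k - 1, derived from max_entries */
};

/* Slot computed by an inlined lookup whose mask was fixed when the
 * program was built against the template inner map. */
static uint32_t inlined_slot(uint32_t index, uint32_t index_mask_imm)
{
	assert((index_mask_imm & (index_mask_imm + 1)) == 0);
	return index & index_mask_imm;
}

/* Generic comparison (reduced): max_entries is not checked. */
static bool meta_equal_generic_sketch(const struct pcpu_array_model *a,
				      const struct pcpu_array_model *b)
{
	return a->value_size == b->value_size;
}

/* Array comparison: additionally requires equal max_entries. */
static bool meta_equal_array_sketch(const struct pcpu_array_model *a,
				    const struct pcpu_array_model *b)
{
	return meta_equal_generic_sketch(a, b) &&
	       a->max_entries == b->max_entries;
}
```

With a template of 8 entries (mask 7) and a replacement of 4 entries, index 6 maps to slot 6, which lies past the end of the smaller map; only the array comparison rejects the replacement.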

Add a selftest to verify that replacing a percpu array inner map with
a differently-sized one is rejected.

Fixes: db69718b8efa ("bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps")
Signed-off-by: Guannan Wang <wgnbuaa@gmail.com>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260514074454.77491-1-wgnbuaa@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

cifs: client: stage smb3_reconfigure() updates and restore ctx on failure

smb3_reconfigure() moves strings out of cifs_sb->ctx before the
multichannel update, so a later failure can leave the live context
with NULL strings or options that do not match the session.

Stage the new ctx separately, commit it only on success, and restore
the snapshot on failure. Also make smb3_sync_session_ctx_passwords()
all-or-nothing.

Commit session passwords before channel updates so newly added channels
authenticate with the staged credentials.

Fixes: ef529f655a2c ("cifs: client: allow changing multichannel mount options on remount")
Reported-by: RAJASI MANDAL <rajasimandalos@gmail.com>
Closes: https://lore.kernel.org/lkml/CAEY6_V1+dzW3OD5zqXhsWyXwrDTrg5tAMGZ1AJ7_GAuRE+aevA@mail.gmail.com/
Link: https://lore.kernel.org/lkml/xkr2dlvgibq5j6gkcxd3yhhnj4atgxw2uy4eug2pxm7wy7nbms@iq6cf5taa65v/
Reviewed-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

nvme-apple: Reset q->sq_tail during queue init

Fixes a "duplicate tag error for tag 0" firmware crash during controller
reset while setting up a queue on Apple A11 / T8015 caused by stale
entries in the submission queue due to an invalid sq_tail offset after
reset.

Fixes: 04d8ecf37b5e ("nvme: apple: Add Apple A11 support")
Cc: stable@vger.kernel.org
Suggested-by: Yuriy Havrylyuk <yhavry@gmail.com>
Reviewed-by: Sven Peter <sven@kernel.org>
Signed-off-by: Nick Chan <towinchenmi@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

smb/client: fix possible infinite loop and oob read in symlink_data()

On 32-bit architectures, the infinite loop is as follows:

  len = p->ErrorDataLength == 0xfffffff8
  u8 *next = p->ErrorContextData + len
  next == p

On 32-bit architectures, the out-of-bounds read is as follows:

  len = p->ErrorDataLength == 0xfffffff0
  u8 *next = p->ErrorContextData + len
  next == (u8 *)p - 8
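
The two cases above can be reproduced with 32-bit unsigned arithmetic standing in for 32-bit pointers. The 8-byte offset of ErrorContextData is implied by the first case (adding 0xfffffff8 returns to p); the function name and the sample address are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of ErrorContextData from the start of the context header. */
#define CTX_DATA_OFFSET 8u

/* Address of the "next" context as a 32-bit pointer would compute it;
 * uint32_t arithmetic wraps modulo 2^32 in the same way. */
static uint32_t next_ctx_addr32(uint32_t p, uint32_t error_data_length)
{
	return p + CTX_DATA_OFFSET + error_data_length;
}

/* Usage: the two lengths from the description above. */
static void symlink_wrap_selfcheck(void)
{
	uint32_t p = 0x10000000u; /* arbitrary sample address */

	assert(next_ctx_addr32(p, 0xfffffff8u) == p);     /* no progress: loop */
	assert(next_ctx_addr32(p, 0xfffffff0u) == p - 8); /* before p: OOB read */
}
```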

Reported-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Fixes: 76894f3e2f71 ("cifs: improve symlink handling for smb2+")
Cc: stable@vger.kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: Steve French <stfrench@microsoft.com>

ovpn: fix race between deleting interface and adding new peer

While deleting an existing ovpn interface, there is a very
narrow window where adding a new peer via netlink may cause
the netdevice to hang and prevent its unregistration.

It may happen during ovpn_dellink(), when all existing peers are
freed and the device is queued for deregistration, but a
CMD_PEER_NEW message comes in adding a new peer that takes again
a reference to the netdev.

At this point there is no way to release the device because we are
under the assumption that all peers were already released.

Fix the race condition by releasing all peers in ndo_uninit(),
when the netdevice has already been removed from the netdev
list.

ovpn_peer_add() also now has an extra check that forces the
function to bail out if the device reg_state is not REGISTERED.
This way any incoming CMD_PEER_NEW racing with the interface
deletion routine will simply stop before adding the peer.

Note that the above check happens while holding the netdev_lock
to prevent racing netdev state changes.

ovpn_dellink() is now empty and can be removed.

Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Closes: https://lore.kernel.org/netdev/aaVgJ16edTfQkYbx@v4bel/
Suggested-by: Sabrina Dubroca <sd@queasysnail.net>
Fixes: 80747caef33d ("ovpn: introduce the ovpn_peer object")
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>

ovpn: respect peer refcount in CMD_NEW_PEER error path

ovpn_nl_peer_new_doit()'s error path calls ovpn_peer_release() directly
rather than ovpn_peer_put(), bypassing the kref. The accompanying
comment ("peer was not yet hashed, thus it is not used in any context")
holds for UDP but not for TCP.

For UDP, the ovpn_socket union uses the .ovpn arm and never points back
at a peer; UDP encap_recv looks up peers via the not-yet-populated
hashtables, so the new peer is unreachable until ovpn_peer_add()
publishes it.

For TCP, ovpn_socket_new() sets ovpn_sock->peer and
ovpn_tcp_socket_attach() publishes ovpn_sock via rcu_assign_sk_user_data().
From that moment until ovpn_socket_release() detaches in the error path,
the TCP fd is fully wired: userspace recvmsg / sendmsg / close / poll
on the fd, as well as the strparser-driven ovpn_tcp_rcv() path, can
reach the peer through sk_user_data -> ovpn_sock->peer and bump its
refcount via ovpn_peer_hold().

ovpn_tcp_socket_wait_finish() (called inside ovpn_socket_release())
drains strparser and the tx work, but does not synchronize with
userspace syscall callers that already hold a peer reference. If
ovpn_nl_peer_modify() or ovpn_peer_add() returns an error while such
a caller is in flight - notably an ovpn_tcp_recvmsg() blocked in
__skb_recv_datagram() on peer->tcp.user_queue - the direct
ovpn_peer_release() destroys the peer while the caller still holds
the reference, and the eventual ovpn_peer_put() from that caller
operates on freed memory.

Replace the direct destructor call with ovpn_peer_put() so the kref
correctly defers destruction until the last reference is dropped.
In the common case where no concurrent user is present, behaviour is
unchanged: the kref hits zero immediately and ovpn_peer_release_kref()
runs the same destructor.
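
The difference between the two error-path variants can be sketched with a toy, non-atomic reference count. The type and function names are hypothetical stand-ins for the ovpn peer and its kref; a real kref uses atomic operations and the destructor frees memory instead of setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical peer model: "destroyed" marks where memory would be freed. */
struct peer_model {
	int refcount;
	bool destroyed;
};

/* Destructor, as called directly by the old error path. */
static void peer_model_release(struct peer_model *p)
{
	p->destroyed = true;
}

/* A concurrent user (for example a blocked receive) takes a reference. */
static void peer_model_hold(struct peer_model *p)
{
	p->refcount++;
}

/* Reference drop: the destructor runs only when the last user is gone. */
static void peer_model_put(struct peer_model *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0)
		peer_model_release(p);
}
```

If the error path calls peer_model_put() while another holder exists, the object survives until that holder's own put; with no other holder the put destroys it immediately, matching the unchanged common case.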

With this conversion ovpn_peer_release() has no callers outside peer.c
- ovpn_peer_release_kref() in the same translation unit is the only
remaining user - so make it static and drop its declaration from
peer.h.

Fixes: 11851cbd60ea ("ovpn: implement TCP transport")
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>

ovpn: tcp - use cached peer pointer in ovpn_tcp_close()

ovpn_tcp_close() loads the ovpn_socket via rcu_dereference_sk_user_data()
under rcu_read_lock(), takes a reference on sock->peer, caches the peer
pointer in a local, and drops the read lock. It then passes sock->peer
(rather than the cached local) to ovpn_peer_del(), re-dereferencing the
ovpn_socket after the RCU read section has ended.

Unlike ovpn_tcp_sendmsg(), which uses the same "load under RCU, use
after unlock" pattern but is protected by lock_sock() held across the
function, ovpn_tcp_close() runs without the socket lock: inet_release()
invokes sk_prot->close() without taking lock_sock first.

ovpn_socket_release() can therefore complete its kref_put -> detach ->
synchronize_rcu -> kfree(sock) sequence concurrently, in the window
after ovpn_tcp_close() drops rcu_read_lock() but before it dereferences
sock->peer. The synchronize_rcu() in ovpn_socket_release() protects
readers that use the dereferenced pointer inside the RCU read section,
not those that escape the pointer to a local and use it afterwards.

A reproducer follows the pattern of commit 94560267d6c4 ("ovpn: tcp -
don't deref NULL sk_socket member after tcp_close()"): trigger a peer
removal (keepalive expiration or netlink OVPN_CMD_DEL_PEER) at the same
moment userspace closes the TCP fd. That commit fixed the detach-side
of the same race window; this one fixes the close-side at a different
victim.

Tighten the entry block to read sock->peer exactly once into the cached
peer local, and route all subsequent uses (the hold check, the
ovpn_peer_del() call, and the prot->close() invocation) through that
local. sock->peer is only ever written once in ovpn_socket_new() under
lock_sock(), before rcu_assign_sk_user_data() publishes the ovpn_socket,
and is never reassigned afterwards - but the previous multi-read pattern
made that invariant implicit rather than explicit. The same multi-read
shape exists in ovpn_tcp_recvmsg(), ovpn_tcp_sendmsg(),
ovpn_tcp_data_ready() and ovpn_tcp_write_space(); those will be cleaned
up via a dedicated helper in a follow-up net-next series.

Fixes: 11851cbd60ea ("ovpn: implement TCP transport")
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>

selftests: ovpn: reduce remaining ping flood counts

Commit 201ba706318d ("selftests: ovpn: reduce ping count in test.sh")
lowered the baseline traffic flood ping count to avoid flakes on slower
CI instances, however some instances were left out.

Apply the same limit to the remaining ovpn selftest flood pings that
still request 500 packets.

Fixes: 201ba706318d ("selftests: ovpn: reduce ping count in test.sh")
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>

io_uring/rsrc: raise registered buffer 1GB limit

There's no real reason to have a limit, as the memory is accounted by
the lockmem limits anyway, if any exist. io_pin_pages() will still
restrict the maximum allowed limit per buffer, which is INT_MAX
number of pages. Cap it a bit lower than that, at 1TB for a 64-bit
system. Surely that should be enough for everyone. For now.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/rsrc: bump struct io_mapped_ubuf length field to size_t

In preparation for supporting bigger individual buffers, bump the length
field to a full 8-bytes with size_t rather than an unsigned int.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/rsrc: add huge page accounting for registered buffers

Track huge page references in a per-ring xarray to prevent double
accounting when the same huge page is used by multiple registered
buffers, either within the same ring or across cloned rings.

When registering buffers backed by huge pages, we need to account for
RLIMIT_MEMLOCK. But if multiple buffers share the same huge page (common
with cloned buffers), we must not account for the same page multiple
times. Similarly, we must only unaccount when the last reference to a
huge page is released.

Maintain a per-ring xarray (hpage_acct) that tracks reference counts for
each huge page. When registering a buffer, for each unique huge page,
increment its accounting reference count, and only account pages that
are newly added.

When unregistering a buffer, for each unique huge page, decrement its
refcount. Once the refcount hits zero, the page is unaccounted.
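
The register/unregister rule can be sketched with a small fixed-size table in place of the xarray. Everything here is an assumption made for illustration: the table size, the use of an address as the key, and the function names; the actual charging against RLIMIT_MEMLOCK is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 16 /* arbitrary capacity for the sketch */

struct hpage_ref {
	uintptr_t key;     /* identifies one huge page */
	unsigned int refs; /* 0 means the slot is free */
};

struct hpage_acct_model {
	struct hpage_ref slot[MAX_TRACKED];
};

static struct hpage_ref *hpage_find(struct hpage_acct_model *t, uintptr_t key)
{
	for (size_t i = 0; i < MAX_TRACKED; i++)
		if (t->slot[i].refs && t->slot[i].key == key)
			return &t->slot[i];
	return NULL;
}

/* Returns 1 if the page is newly tracked (caller must account it),
 * 0 if it was already accounted, -1 if the table is full. */
static int hpage_get(struct hpage_acct_model *t, uintptr_t key)
{
	struct hpage_ref *r = hpage_find(t, key);

	if (r) {
		r->refs++;
		return 0;
	}
	for (size_t i = 0; i < MAX_TRACKED; i++) {
		if (!t->slot[i].refs) {
			t->slot[i].key = key;
			t->slot[i].refs = 1;
			return 1;
		}
	}
	return -1;
}

/* Returns true when the last reference is gone (caller must unaccount). */
static bool hpage_put(struct hpage_acct_model *t, uintptr_t key)
{
	struct hpage_ref *r = hpage_find(t, key);

	assert(r != NULL);
	return --r->refs == 0;
}
```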

Note: any accounting is done against the ctx->user that was assigned when
the ring was set up. As before, if root is running the operation, no
accounting is done.

With these changes, any use of imu->acct_pages is also dead, hence kill
it from struct io_mapped_ubuf. This shrinks it from 56b to 48b on a
64-bit arch. Additionally, hpage_already_acct() is gone, which was an
O(M*M) scan over current + previous registrations.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

Bluetooth: hci_qca: Convert timeout from jiffies to ms

Since the timer uses jiffies as its unit rather than ms, the timeout value
must be converted from ms to jiffies when configuring the timer. Otherwise,
the intended 8s timeout is incorrectly set to approximately 33s.
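
A rough sketch of the unit mix-up. The tick rate is passed in explicitly and the value 250 used in the checks is only an assumed configuration; the real inflation factor depends on the kernel's HZ setting. The conversion helper is a simplified userspace stand-in for msecs_to_jiffies().

```c
#include <assert.h>

/* Simplified stand-in for msecs_to_jiffies(): rounds up to whole ticks. */
static unsigned long msecs_to_jiffies_sketch(unsigned int ms, unsigned int hz)
{
	assert(hz > 0);
	return ((unsigned long)ms * hz + 999) / 1000;
}

/* How long a timer armed with the given tick count actually runs. */
static unsigned long jiffies_to_secs_sketch(unsigned long j, unsigned int hz)
{
	assert(hz > 0);
	return j / hz;
}
```

At an assumed 250 ticks per second, an 8000 ms timeout converts to 2000 ticks (8 s), whereas passing the raw 8000 as a tick count yields 32 s, four times the intended duration.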

To improve readability, embed msecs_to_jiffies() directly in the macro
definitions and drop the _MS suffix from macros that now yield jiffies
values: MEMDUMP_TIMEOUT, FW_DOWNLOAD_TIMEOUT, IBS_DISABLE_SSR_TIMEOUT,
CMD_TRANS_TIMEOUT, and IBS_BTSOC_TX_IDLE_TIMEOUT.

IBS_WAKE_RETRANS_TIMEOUT_MS and IBS_HOST_TX_IDLE_TIMEOUT_MS are
intentionally left unchanged. Their values are stored in the struct fields
wake_retrans and tx_idle_delay, which hold ms values at runtime and can be
modified via debugfs. The msecs_to_jiffies() conversion happens at each
call site against the field value, so it cannot be embedded in the macro.

Wake timer depends on commit c347ca17d62a

Cc: stable@vger.kernel.org
Fixes: d841502c79e3 ("Bluetooth: hci_qca: Collect controller memory dump during SSR")
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Acked-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Shuai Zhang <shuai.zhang@oss.qualcomm.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: L2CAP: ecred_reconfigure: send packed pdu, not stack pointer

Commit 1c08108f3014 ("Bluetooth: L2CAP: Avoid -Wflex-array-member-not-at-end
warnings") converted the on-stack request PDU in l2cap_ecred_reconfigure()
from an explicit packed struct to DEFINE_RAW_FLEX(), but did not adjust the
size and source-pointer arguments to l2cap_send_cmd():

  -    struct {
  -            struct l2cap_ecred_reconf_req req;
  -            __le16 scid;
  -    } pdu;
  +    DEFINE_RAW_FLEX(struct l2cap_ecred_reconf_req, pdu, scid, 1);
       ...
       l2cap_send_cmd(conn, chan->ident, L2CAP_ECRED_RECONF_REQ,
                      sizeof(pdu), &pdu);

After the conversion, DEFINE_RAW_FLEX() expands to declare an anonymous
union pdu_u plus a local pointer "pdu" pointing at it. Therefore:

  - sizeof(pdu) is now sizeof(struct l2cap_ecred_reconf_req *) = 8 on
    64-bit (4 on 32-bit), not the 6 bytes of (mtu, mps, scid[1]).
  - &pdu is the address of the local pointer's stack storage, not the
    address of the request payload.

l2cap_send_cmd() forwards (data, count) to l2cap_build_cmd(), which calls
skb_put_data(skb, data, count). The L2CAP_ECRED_RECONFIGURE_REQ packet
body therefore contains 8 bytes copied from the kernel stack starting at
&pdu -- the 8 bytes overlap the pdu pointer's value, leaking a kernel
stack address to the paired Bluetooth peer. The intended (mtu, mps, scid)
fields are not transmitted at all, so the peer rejects the request as
malformed and the L2CAP_ECRED_RECONFIGURE feature itself has been broken
for the local-side initiator since the introducing commit landed.

The sibling site l2cap_ecred_conn_req() in the same commit was converted
correctly (sizeof(*pdu) + len, pdu); only this site was missed.

Restore the original semantics: pass the full flex-struct size via
struct_size(pdu, scid, 1) and the pdu pointer (the struct address) as
the source.
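
The size mismatch can be reproduced in userspace. The struct below is a model of the request layout described above (two 16-bit fields followed by a flexible array of 16-bit IDs), with hypothetical names; the aligned byte buffer stands in for the storage that DEFINE_RAW_FLEX() declares, and the explicit sum stands in for struct_size(pdu, scid, 1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the request: fixed header plus a flexible array of IDs. */
struct reconf_req_model {
	uint16_t mtu;
	uint16_t mps;
	uint16_t scid[];
};

static_assert(sizeof(struct reconf_req_model) == 4, "header is 4 bytes");

/* Buggy length: "pdu" is a pointer, so this is the pointer size. */
static size_t len_from_sizeof_pointer(void)
{
	_Alignas(struct reconf_req_model) unsigned char storage[6] = {0};
	struct reconf_req_model *pdu = (struct reconf_req_model *)storage;

	(void)pdu;
	return sizeof(pdu);
}

/* Intended length: header plus one flexible-array element. */
static size_t len_from_struct_size(void)
{
	_Alignas(struct reconf_req_model) unsigned char storage[6] = {0};
	struct reconf_req_model *pdu = (struct reconf_req_model *)storage;

	(void)pdu;
	return sizeof(*pdu) + 1 * sizeof(pdu->scid[0]);
}
```

The second function always yields 6, while the first yields whatever a pointer occupies on the build target, which is why the captured packet bodies were 8 bytes on a 64-bit host.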

Validated on a stock 7.0-based host kernel via the real call path:
setsockopt(SOL_BLUETOOTH, BT_RCVMTU, ...) on a BT_CONNECTED
L2CAP_MODE_EXT_FLOWCTL socket emits an L2CAP_ECRED_RECONFIGURE_REQ
whose body is 8 bytes (the on-stack pdu local's value) rather than
the expected 6. Three captures from fresh socket / fresh hciemu peer
on the same host -- low bytes vary per call, high 0xffff confirms a
kernel virtual address (KASLR-randomised stack slot, not a fixed
string):

  RECONF_REQ body (ident=0x02 len=8): 42 fb 54 af 0e ca ff ff
  RECONF_REQ body (ident=0x02 len=8): 52 3d 2e af 0e ca ff ff
  RECONF_REQ body (ident=0x02 len=8): b2 fc 5b af 0e ca ff ff

After this patch the body is 6 bytes carrying the expected
little-endian (mtu, mps, scid).

Cc: stable@vger.kernel.org
Fixes: 1c08108f3014 ("Bluetooth: L2CAP: Avoid -Wflex-array-member-not-at-end warnings")
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: btmtk: accept too short WMT FUNC_CTRL events

MT7925 (USB ID 0e8d:e025) on fw version 20260106153314 sends WMT
FUNC_CTRL events that are missing the status field.

Prior to commit 006b9943b982 ("Bluetooth: btmtk: validate WMT event SKB
length before struct access") the status was read from out-of-bounds of
SKB data, which usually would result in success with
BTMTK_WMT_ON_UNDONE, although I don't know the intent here. The bounds
check added in that commit returns with error instead, producing
"Bluetooth: hci0: Failed to send wmt func ctrl (-22)" and makes the
device unusable.

Fix the regression by interpreting a too short packet as status
BTMTK_WMT_ON_UNDONE, which makes the device work normally again.

Fixes: 634a4408c061 ("Bluetooth: btmtk: validate WMT event SKB length before struct access")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> # MT7922 (0489:e0e2)
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: serialize accept_q access

bt_sock_poll() walks the accept queue without synchronization, while
child teardown can unlink the same socket and drop its last reference.
The unsynchronized accept queue walk has existed since the initial
Bluetooth import.

Protect accept_q with a dedicated lock for queue updates and polling.
Also rework bt_accept_dequeue() to take temporary child references under
the queue lock before dropping it and locking the child socket.

Fixes: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Reported-by: Jann Horn <jannh@google.com>
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

cpufreq/amd-pstate: Drop Kconfig option for dynamic EPP

Some performance issues have been identified with dynamic EPP, and we
don't want distributions turning it on by default and exposing them to
users at this time.

Drop the Kconfig option, and require an explicit opt-in from the kernel
command line or the runtime sysfs option to turn it on.

Reported-by: Viktor Jägersküpper <viktor_jaegerskuepper@freenet.de>
Closes: https://lore.kernel.org/linux-pm/14a87c99-785c-4b16-bfce-35ecbf053448@freenet.de/
Reported-by: Stuart Meckle <stuartmeckle@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221473
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20260512221947.1652988-1-mario.limonciello@amd.com
(fix sysfs file path)
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>

drm/ttm: Fix ttm_bo_shrink() infinite LRU walk on backup failure

Apply the same fix as b2ed01e7ad ("drm/ttm: Fix ttm_bo_swapout()
infinite LRU walk on swapout failure") to the ttm_bo_shrink() path.

Move del_bulk_move from before the backup to after success only,
using ttm_resource_del_bulk_move_unevictable() since the resource
is now unevictable once fully backed up.

Fixes: 70d645deac98 ("drm/ttm: Add helpers for shrinking")
Cc: Christian König <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: dri-devel@lists.freedesktop.org
Cc: stable@vger.kernel.org # v6.15+
Assisted-by: GitHub_Copilot:claude-opus-4.6
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patch.msgid.link/20260511162443.24352-1-thomas.hellstrom@linux.intel.com
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>

io_uring/net: allow filtering on IORING_OP_CONNECT

This adds custom filtering for IORING_OP_CONNECT, where the target
family is always exposed, and (for AF_INET / AF_INET6) port and
address are exposed. port and v4_addr are in network byte order so
filter authors can compare against on-wire constants.

Skip population unless addr_len covers the populated fields, to
avoid leaking stale io_async_msghdr data on short connects.
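
A minimal userspace sketch of the "populate only what addr_len covers"
rule, restricted to AF_INET for brevity. The struct connect_view, its
has_inet flag and fill_connect_view() are hypothetical names for
illustration; only the family / port / v4_addr field names come from
the description above, and the real filter context may look different.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical view handed to a filter. */
struct connect_view {
	uint16_t family;
	uint16_t port;		/* network byte order */
	uint32_t v4_addr;	/* network byte order */
	int has_inet;		/* set only if port/v4_addr were populated */
};

static void fill_connect_view(struct connect_view *v, const void *addr,
			      size_t addr_len)
{
	const size_t fam_off = offsetof(struct sockaddr, sa_family);
	struct sockaddr_in sin;
	sa_family_t family;

	/* Start from zeroes so no stale data can leak through. */
	memset(v, 0, sizeof(*v));

	if (addr_len < fam_off + sizeof(family))
		return;
	memcpy(&family, (const unsigned char *)addr + fam_off, sizeof(family));
	v->family = family;

	/* Short connect: leave port and address unpopulated. */
	if (family != AF_INET || addr_len < sizeof(sin))
		return;

	memcpy(&sin, addr, sizeof(sin));
	v->port = sin.sin_port;
	v->v4_addr = sin.sin_addr.s_addr;
	v->has_inet = 1;
}
```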

Signed-off-by: Shouvik Kar <auxcorelabs@gmail.com>
Link: https://patch.msgid.link/20260512110242.26219-1-auxcorelabs@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: parenthesize io_ring_head_to_buf() expansion

Wrap the io_ring_head_to_buf() macro value in an extra pair of parentheses
so it is safe when composed into larger expressions, and to satisfy
scripts/checkpatch.pl.
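
For readers unfamiliar with the hazard, here is a generic illustration
of why a macro value wants its own outer parentheses. TWICE_BARE and
TWICE_SAFE are hypothetical macros made up for this sketch; they are
not the io_ring_head_to_buf() definition, whose body is not shown here.

```c
/* Hypothetical macros: same expansion, with and without outer parens. */
#define TWICE_BARE(x)	(x) + (x)
#define TWICE_SAFE(x)	((x) + (x))

/* Expands to 3 * (n) + (n): the multiplication binds to the first term. */
static int scaled_bare(int n)
{
	return 3 * TWICE_BARE(n);
}

/* Expands to 3 * ((n) + (n)): the whole macro value is multiplied. */
static int scaled_safe(int n)
{
	return 3 * TWICE_SAFE(n);
}
```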

Signed-off-by: Yi Xie <xieyi@kylinos.cn>
Link: https://patch.msgid.link/20260514083443.203387-1-xieyi@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>

net: phy: DP83TC811: add reading of abilities

At this time the driver is not listing any speeds
it supports. This should be ETHTOOL_LINK_MODE_100baseT1_Full_BIT
for DP83TC811. Add the missing call for phylib to read the abilities.

Fixes: b753a9faaf9a ("net: phy: DP83TC811: Introduce support for the DP83TC811 phy")
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Sven Schuchmann <schuchmann@schleissheimer.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260512071949.6218-1-schuchmann@schleissheimer.de
[pabeni@redhat.com: dropped revision history]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mm/slub: hold cpus_read_lock around flush_rcu_sheaves_on_cache()

flush_rcu_sheaves_on_cache() calls queue_work_on() in a
for_each_online_cpu() loop, which requires the cpu to stay online.
But cpus_read_lock() is not held in kvfree_rcu_barrier_on_cache() and the
set of "online cpus" is subject to change.

There are two paths that call flush_rcu_sheaves_on_cache():

  // has cpus_read_lock()
  flush_all_rcu_sheaves()
    -> flush_rcu_sheaves_on_cache()

  // no cpus_read_lock()
  kvfree_rcu_barrier_on_cache()
    -> flush_rcu_sheaves_on_cache()

Fix this by holding cpus_read_lock() in kvfree_rcu_barrier_on_cache().

Why not move cpus_read_lock() from flush_all_rcu_sheaves() into
flush_rcu_sheaves_on_cache()? The reason is it would introduce a new lock
order (slab_mutex -> cpu_hotplug_lock). The reverse order
(cpu_hotplug_lock -> slab_mutex) is established by

- cpuhp_setup_state_nocalls(..., slub_cpu_setup, ...)
- kmem_cache_destroy()

The two orders together would form an AB-BA deadlock.

Finally, add lockdep_assert_cpus_held() in flush_rcu_sheaves_on_cache()
to catch the same problem in the future.

Fixes: 0f35040de593 ("mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction")
Cc: <stable@vger.kernel.org>
Signed-off-by: Qing Wang <wangqing7171@gmail.com>
Link: https://patch.msgid.link/20260512035035.762317-1-wangqing7171@gmail.com
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

kbuild: pacman-pkg: package unstripped vDSO libraries

The unstripped vDSO files are useful for debugging.
They are provided in the upstream 'linux-headers' package.

Also package them as part of 'make pacman-pkg'.
Make them part of the '-debug' package, as they fit there best.
This differs from the upstream package as that has no '-debug' variant.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20260318-kbuild-pacman-vdso-install-v1-1-48ceb31c0e80@weissschuh.net
Signed-off-by: Nathan Chancellor <nathan@kernel.org>

KVM: x86: nSVM: Save/restore gPAT with KVM_{GET,SET}_NESTED_STATE

Add a 'gpat' field to kvm_svm_nested_state_hdr to carry L2's guest PAT
value across save and restore.

When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled and the vCPU is in
guest mode with nested NPT enabled, save vmcb02's g_pat into the header on
KVM_GET_NESTED_STATE, and restore it on KVM_SET_NESTED_STATE.

Host-initiated accesses to IA32_PAT (via KVM_GET/SET_MSRS) always target
L1's hPAT, so they cannot be used to save or restore gPAT. The separate
header field ensures that KVM_GET/SET_MSRS and KVM_GET/SET_NESTED_STATE are
independent and can be ordered arbitrarily during save and restore.

Note that struct kvm_svm_nested_state_hdr is included in a union padded to
120 bytes, so there is room to add the gpat field without changing any
offsets.
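
The padding argument can be checked with a small userspace model. The
types below are simplified stand-ins (the real union has further
members that are omitted here), assuming the SVM header previously held
only a 64-bit vmcb_pa field:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the header before and after the change. */
struct svm_hdr_old {
	uint64_t vmcb_pa;
};

struct svm_hdr_new {
	uint64_t vmcb_pa;
	uint64_t gpat;		/* new field, appended at the end */
};

/* The union is padded to 120 bytes either way. */
union hdr_old {
	struct svm_hdr_old svm;
	uint8_t pad[120];
};

union hdr_new {
	struct svm_hdr_new svm;
	uint8_t pad[120];
};

_Static_assert(sizeof(union hdr_old) == 120, "old union is 120 bytes");
_Static_assert(sizeof(union hdr_new) == 120, "new union is 120 bytes");
_Static_assert(offsetof(struct svm_hdr_new, vmcb_pa) ==
	       offsetof(struct svm_hdr_old, vmcb_pa),
	       "existing field keeps its offset");

static size_t hdr_size_old(void)
{
	return sizeof(union hdr_old);
}

static size_t hdr_size_new(void)
{
	return sizeof(union hdr_new);
}
```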

Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260407190343.325299-9-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: Documentation: document KVM_{GET,SET}_NESTED_STATE for SVM

Document the nested state constants and structures for SVM that were added
by commit cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and
KVM_SET_NESTED_STATE").

Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260407190343.325299-8-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: nSVM: Save gPAT to vmcb12.g_pat on VMEXIT

According to the APM volume 3 pseudo-code for "VMRUN," when nested paging
is enabled in the vmcb, the guest PAT register (gPAT) is saved to the vmcb
on emulated VMEXIT.

When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled and the vCPU is in
guest mode with nested NPT enabled, save the vmcb02 g_pat field to the
vmcb12 g_pat field on emulated VMEXIT.

Fixes: 15038e147247 ("KVM: SVM: obey guest PAT")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260407190343.325299-7-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: nSVM: Redirect IA32_PAT accesses to either hPAT or gPAT

When handling PAT accesses from L2, route PAT accesses to either hPAT or
gPAT based on whether or not L2 has a separate PAT, i.e. if KVM is actually
emulating gPAT, instead of using L1's PAT for everything.  Specifically, if
KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled, the vCPU is in guest mode
with nested NPT enabled, *and* the access is from the guest (i.e. is not
from the host stuffing PAT as part of save/restore), then redirect guest
PAT accesses to the gPAT "register" in vmcb02, i.e. emulate gPAT for L2.

Always route non-guest accesses to hPAT, i.e. L1's PAT in vcpu->arch.pat,
to ensure that KVM_{G,S}ET_MSRS and KVM_{G,S}ET_NESTED_STATE are
independent of each other and can be ordered arbitrarily during save and
restore.  E.g. if KVM didn't exempt host accesses, then whether a write to
PAT hit hPAT or gPAT would vary based on whether userspace restores PAT
before or after nested state.  Note, gPAT is saved and restored separately
via KVM_{G,S}ET_NESTED_STATE.
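
The routing rule described above can be modelled as a small decision
function. This is only a sketch of the stated conditions; struct
pat_ctx, enum pat_target and pat_access_target() are hypothetical names
and do not correspond to KVM's internal helpers.

```c
#include <stdbool.h>

enum pat_target {
	PAT_TARGET_HPAT,	/* L1's PAT */
	PAT_TARGET_GPAT,	/* L2's PAT in vmcb02 */
};

/* Hypothetical snapshot of the state the decision depends on. */
struct pat_ctx {
	bool shared_pat_quirk_disabled;
	bool in_guest_mode;
	bool nested_npt_enabled;
};

static enum pat_target pat_access_target(const struct pat_ctx *c,
					 bool host_initiated)
{
	/* Host (save/restore) accesses always hit hPAT. */
	if (host_initiated)
		return PAT_TARGET_HPAT;

	/* Guest accesses hit gPAT only when L2 really has its own PAT. */
	if (c->shared_pat_quirk_disabled && c->in_guest_mode &&
	    c->nested_npt_enabled)
		return PAT_TARGET_GPAT;

	return PAT_TARGET_HPAT;
}
```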

WARN if there's a host-initiated access to PAT from within KVM_RUN, i.e. if
KVM itself initiated the access, as there are no such accesses today, and
it's not clear what the "right" behavior would be.

Fixes: 15038e147247 ("KVM: SVM: obey guest PAT")
Signed-off-by: Jim Mattson <jmattson@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: nSVM: Set vmcb02.g_pat correctly for nested NPT

When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled and nested NPT is
enabled in vmcb12, copy the (cached and validated) vmcb12 g_pat field to
vmcb02's g_pat, giving L2 its own independent guest PAT register.

When the quirk is enabled (default), or when NPT is enabled but nested NPT
is disabled, copy L1's IA32_PAT MSR to the vmcb02 g_pat field, since L2
shares the IA32_PAT MSR with L1.

When NPT is disabled, the g_pat field is ignored by hardware.

Fixes: 15038e147247 ("KVM: SVM: obey guest PAT")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260407190343.325299-5-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: nSVM: Cache and validate vmcb12 g_pat

When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled and nested paging is
enabled in vmcb12, validate g_pat at emulated VMRUN and cause an immediate
VMEXIT with exit code VMEXIT_INVALID if it is invalid, as specified in the
APM, volume 2: "Nested Paging and VMRUN/VMEXIT."
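
As an illustration of what "invalid" means for a PAT value: each of the
eight byte-wide entries must hold one of the architecturally defined
memory types 0, 1, 4, 5, 6 or 7. The helper below, pat_value_valid(),
is a hypothetical userspace sketch of that rule; KVM's own check may be
written differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if every PAT entry holds a defined memory type. */
static bool pat_value_valid(uint64_t pat)
{
	for (int i = 0; i < 8; i++) {
		uint8_t type = (pat >> (i * 8)) & 0xff;

		/* Encodings 2 and 3 and anything above 7 are reserved. */
		if (type > 7 || type == 2 || type == 3)
			return false;
	}
	return true;
}
```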

Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260407190343.325299-4-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: nSVM: Clear VMCB_NPT clean bit when updating hPAT from guest mode

When running an L2 guest and writing to MSR_IA32_CR_PAT, the host PAT value
is stored in both vmcb01's g_pat field and vmcb02's g_pat field, but the
clean bit was only being cleared for vmcb02.

Introduce the helper vmcb_set_gpat() which sets vmcb->save.g_pat and marks
the VMCB dirty for VMCB_NPT. Use this helper in both svm_set_msr() for
updating vmcb01 and in nested_vmcb02_compute_g_pat() for updating vmcb02,
ensuring both VMCBs' NPT fields are properly marked dirty.
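
The idea behind such a helper is that the value and its clean bit are
always updated together, so no caller can forget the second step. A
toy model, with a hypothetical struct vmcb_model and a made-up
CLEAN_NPT bit position (neither matches the real VMCB layout):

```c
#include <stdint.h>

#define CLEAN_NPT	(1u << 4)	/* made-up bit for illustration */

/* Toy model: a cached field plus a mask of fields known to be clean. */
struct vmcb_model {
	uint64_t g_pat;
	uint32_t clean;
};

/* Update g_pat and mark the NPT state dirty in one place. */
static void model_set_gpat(struct vmcb_model *v, uint64_t pat)
{
	v->g_pat = pat;
	v->clean &= ~CLEAN_NPT;
}
```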

Fixes: 4995a3685f1b ("KVM: SVM: Use a separate vmcb for the nested L2 guest")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260407190343.325299-3-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Define KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT

Define a quirk to control whether nested SVM shares L1's PAT with L2
(legacy behavior) or gives L2 its own independent gPAT (correct behavior
per the APM).

When the quirk is enabled (default), L2 shares L1's PAT, preserving the
legacy KVM behavior. When userspace disables the quirk, KVM correctly
virtualizes the PAT for nested SVM guests, giving L2 a separate gPAT as
specified in the AMD architecture.

Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260407190343.325299-2-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

docs: threat-model: don't limit root capabilities to CAP_SYS_ADMIN

The threat-model document says that only users with CAP_SYS_ADMIN can carry
out a number of admin-level tasks, but there are numerous capabilities that
can confer that sort of power. Generalize the text slightly to make it
clear that CAP_SYS_ADMIN is not the only all-powerful capability.

Acked-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

docs: security-bugs: add a link to the threat-model documentation

Rather than make readers search for this document, just add a link to it
where it is referenced.

(While I was at it, I removed the unused and unneeded _threatmodel label
from the top of threat-model.rst).

Acked-by: Willy Tarreau <w@1wt.eu>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

timers: Fix flseep() typo in kernel-doc comment

Signed-off-by: Gitle Mikkelsen <gitlem@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260501170616.1402-1-gitlem@gmail.com

ARM: dts: socfpga: arria10: Increase JFFS2 rootfs partition size

Increase the JFFS2 partition size to support a larger root filesystem.
Also fix the partition label to match the actual start address.

Signed-off-by: Niravkumar L Rabara <niravkumar.l.rabara@altera.com>
Signed-off-by: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>

RAS/AMD/ATL: Drop malformed default N from Kconfig

The capital letters are for symbols and N in 'default N' will be evaluated as
another, nonexistent, Kconfig symbol, and not as the 'no' it should be. More
importantly, 'n' *is* the default already. Hence just drop the malformed line.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20260513205021.368190-1-andriy.shevchenko@linux.intel.com

net: tls: prevent chain-after-chain in plain text SG

Sashiko points out that if end = 0 (start != 0) the current
code will create a chain link to content type right after
the wrap link:

  This would create a chain where the wrap link points directly
  to another chain link. The scatterlist API sg_next iterator
  does not recursively resolve consecutive chain links.

meaning this is illegal input to crypto.

The wrapping link is unnecessary if end = 0. end is the entry after
the last one used so end = 0 means there's nothing pushed after
the wrap:

   end         start            i
    v            v              v
  [   ]...[   ][ d ][ d ][ d ][ d ][rsv for wrap]

Skip the wrapping in this case.

TLS 1.3 can use the "wrapping slot" for its chaining if end = 0.
This avoids the chain-after-chain.

Move the wrap chaining before marking END and chaining off content
type. That feels like a more logical ordering to me, but should not
matter from a functional perspective.

Reported-by: Sashiko <sashiko-bot@kernel.org>
Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260511174920.433155-3-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring

When an sk_msg scatterlist ring wraps (sg.end < sg.start),
tls_push_record() chains the tail portion of the ring to the head
using sg_chain(). An extra entry in the sg array is reserved for
this:

  struct sk_msg_sg {
        [...]
        /* The extra two elements:
         * 1) used for chaining the front and sections when the list becomes
         *    partitioned (e.g. end < start). The crypto APIs require the
         *    chaining;
         * 2) to chain tailer SG entries after the message.
         */
        struct scatterlist              data[MAX_MSG_FRAGS + 2];

The current code uses MAX_SKB_FRAGS + 1 as the ring size:

    sg_chain(&msg_pl->sg.data[msg_pl->sg.start],
             MAX_SKB_FRAGS - msg_pl->sg.start + 1,
             msg_pl->sg.data);

This places the chain pointer at

  sg_chain(&data[start], MAX_SKB_FRAGS - start + 1, ...) =>
  &data[start] + (MAX_SKB_FRAGS - start + 1) - 1 =
  &data[start + (MAX_SKB_FRAGS - start + 1) - 1] =
  &data[MAX_SKB_FRAGS]

instead of the true last entry. This is likely due to a "race" of
the commit under Fixes landing close to
commit 031097d9e079 ("bpf: sk_msg, zap ingress queue on psock down")

Convert to ARRAY_SIZE and drop the data[start] / - start (as suggested
by Sabrina).
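
The index arithmetic can be replayed in a few lines of userspace C. The
sketch assumes that sg_chain(prv, n, next) stores the link in
prv[n - 1], that MAX_MSG_FRAGS equals MAX_SKB_FRAGS, and it picks 17 as
an arbitrary illustrative value; chain_slot_old() and chain_slot_new()
are hypothetical helper names.

```c
#include <stddef.h>

#define MAX_SKB_FRAGS	17		/* illustrative value only */
#define MAX_MSG_FRAGS	MAX_SKB_FRAGS	/* assumed to be identical */
#define DATA_ENTRIES	(MAX_MSG_FRAGS + 2)

/* Slot written by the old call, as an index into data[]. */
static size_t chain_slot_old(size_t start)
{
	size_t nents = MAX_SKB_FRAGS - start + 1;

	return start + nents - 1;
}

/* Slot written when chaining over the whole array (ARRAY_SIZE). */
static size_t chain_slot_new(void)
{
	return DATA_ENTRIES - 1;
}
```

Whatever the value of start, the old expression lands on
data[MAX_SKB_FRAGS], one entry short of the last slot of the array.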

Reported-by: 钱一铭 <yimingqian591@gmail.com>
Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260511174920.433155-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

clk: scpi: Unregister child clock providers on remove

SCPI clock providers are registered for each child node in
scpi_clk_add(), but scpi_clocks_remove() unregisters the parent node on
each iteration.

of_clk_del_provider() matches providers by the node used at registration
time, so passing the parent node leaves the child providers registered.
This leaks the provider allocations and the node references held by the
clock provider core.

Pass the child node to of_clk_del_provider() so the remove path matches
the probe path.

Fixes: cd52c2a4b5c4 ("clk: add support for clocks provided by SCP(System Control Processor)")
Signed-off-by: Stepan Ionichev <sozdayvek@gmail.com>
Link: https://patch.msgid.link/20260513090900.5323-1-sozdayvek@gmail.com
(sudeep.holla: Updated commit title and message a bit)
Signed-off-by: Sudeep Holla <sudeep.holla@kernel.org>

drm/ttm: Convert -EAGAIN from dmem_cgroup_try_charge to -ENOSPC

dmem_cgroup_try_charge() returns -EAGAIN when the cgroup limit is
hit and the charge fails. TTM has no concept of -EAGAIN from resource
allocation; -ENOSPC is the canonical error meaning "no space, try
eviction". Convert at the source in ttm_resource_alloc() so no caller
needs to handle an unexpected error code, and clean up the now-redundant
-EAGAIN check in ttm_bo_alloc_resource().

Without this, -EAGAIN escaping ttm_resource_alloc() during an eviction
walk causes the walk to terminate early instead of continuing to the
next candidate.
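
A loose userspace model of the effect on an eviction walk, assuming a
walk that continues past -ENOSPC and aborts on any other error. All
function names here (try_charge(), resource_alloc(), eviction_walk())
are hypothetical stand-ins, not the TTM functions.

```c
#include <errno.h>

/* Stand-in for the charge call: fails with -EAGAIN at the limit. */
static int try_charge(int over_limit)
{
	return over_limit ? -EAGAIN : 0;
}

/* Convert at the source so callers only ever see -ENOSPC. */
static int resource_alloc(int over_limit)
{
	int ret = try_charge(over_limit);

	if (ret == -EAGAIN)
		ret = -ENOSPC;
	return ret;
}

/*
 * Try each candidate in turn. -ENOSPC means "keep going"; any other
 * error ends the walk. Returns the index that succeeded, or an error.
 */
static int eviction_walk(const int *over_limit, int n)
{
	for (int i = 0; i < n; i++) {
		int ret = resource_alloc(over_limit[i]);

		if (ret == 0)
			return i;
		if (ret != -ENOSPC)
			return ret;
	}
	return -ENOSPC;
}
```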

Cc: Friedrich Vock <friedrich.vock@gmx.de>
Cc: Maarten Lankhorst <dev@lankhorst.se>
Cc: Tejun Heo <tj@kernel.org>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: dri-devel@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v6.14+
Fixes: 2b624a2c1865 ("drm/ttm: Handle cgroup based eviction in TTM")
Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Maarten Lankhorst <dev@lankhrost.se>
Link: https://patch.msgid.link/20260508160920.230339-1-thomas.hellstrom@linux.intel.com

Merge branch 'bridge-add-selective-forwarding-of-gratuitous-neighbor-announcements'

Danielle Ratson says:

====================
bridge: Add selective forwarding of gratuitous neighbor announcements

The existing neighbor suppression unconditionally suppresses gratuitous
ARPs and unsolicited Neighbor Advertisements, which prevents fast
mobility of hosts between VTEPs.

This series adds a new neigh_forward_grat option that provides
independent control of gratuitous ARP and unsolicited NA forwarding.
When neigh_suppress and neigh_forward_grat are both enabled,
regular neighbor discovery is suppressed while gratuitous announcements
are forwarded.

The implementation marks gratuitous ARPs and unsolicited NAs in
BR_INPUT_SKB_CB during input processing, then checks the per-output-port
neigh_forward_grat setting during flooding. This allows gratuitous
announcements from any input port to be selectively forwarded based on
each output port's individual configuration.

Both port-level control (via IFLA_BRPORT_NEIGH_FORWARD_GRAT) and
per-VLAN control (via BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT) are
provided. The default value of OFF preserves existing behavior.

This behavior is in accordance with RFC 9161 (Section 3.6), which
recommends that VTEPs forward gratuitous ARP and unsolicited NA messages
to avoid traffic disruption during host mobility events.

The new attributes use NLA_U8, although the kernel netlink guideline
recommends NLA_U32 as the minimum integer type on the grounds that
alignment makes smaller types equivalent on the wire. For a simple
on/off attribute there is no technical advantage to u32 over u8, and
keeping u8 preserves consistency with all surrounding bridge port
attributes and avoids introducing new helpers alongside the existing
infrastructure.

Patchset overview:
Patch #1: adds uapi headers.
Patches #2-#3: support selective forwarding of gratuitous ARP.
Patches #4-#5: add netlink handling.
Patch #6: adds tests.

Please see iproute related patches in the last 3 commits of:
https://github.com/daniellerts/iproute2
====================

Link: https://patch.msgid.link/20260511065936.4173106-1-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: net: Add tests for neigh_forward_grat option

Add tests to validate the neigh_forward_grat bridge option for selective
forwarding of gratuitous neighbor announcements.

The tests verify per-port and per-VLAN control of gratuitous neighbor
announcement forwarding for both IPv4 (gratuitous ARP) and IPv6
(unsolicited NA):
- When neigh_suppress is enabled with neigh_forward_grat off (default),
gratuitous announcements are suppressed
- When neigh_forward_grat is enabled, gratuitous announcements are
forwarded while regular neighbor discovery remains suppressed

For IPv4, use arping to send gratuitous ARP packets. For IPv6, use
mausezahn to craft unsolicited Neighbor Advertisement packets.

For the per-port tests, the IPv4 test exercises the ip link interface,
while the IPv6 test exercises the bridge link interface.
The per-VLAN tests use the bridge interface throughout, as per-VLAN
attributes are only accessible via 'bridge vlan'.

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260511065936.4173106-7-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bridge: Add per-VLAN netlink handling for neigh_forward_grat

Add netlink handlers for the per-VLAN neigh_forward_grat option via
BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT attribute.

The per-VLAN option provides fine-grained control, allowing different
VLANs on the same port to have different gratuitous ARP/unsolicited NA
forwarding behavior.

This enables control via 'bridge' commands:
# bridge vlan set dev eth0 vid 10 neigh_suppress on
# bridge vlan set dev eth0 vid 10 neigh_forward_grat on

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-6-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bridge: Add port-level netlink handling for neigh_forward_grat

Add netlink handlers for the port-level neigh_forward_grat option via
IFLA_BRPORT_NEIGH_FORWARD_GRAT attribute.

The default value of OFF preserves existing behavior, i.e. gratuitous ARP
and unsolicited NA are suppressed when neigh_suppress is enabled. Users can
explicitly set it to ON to allow these packets through.

Example for enabling control via 'bridge link' command:
# bridge link set dev eth0 neigh_suppress on
# bridge link set dev eth0 neigh_forward_grat on

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-5-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bridge: Add selective forwarding of gratuitous neighbor announcements

The existing neighbor suppression unconditionally suppresses gratuitous
ARPs and unsolicited Neighbor Advertisements, which prevents fast
mobility of hosts between VTEPs.

Add the neigh_forward_grat option to allow selective control of gratuitous
neighbor announcements. When neigh_suppress is enabled but
neigh_forward_grat is disabled (default), gratuitous announcements are
suppressed. When neigh_forward_grat is enabled, gratuitous announcements
are forwarded while regular neighbor discovery remains suppressed.

The implementation provides per-output-port control by:
1. Adding a 'grat_arp' flag to BR_INPUT_SKB_CB to mark gratuitous ARPs and
   unsolicited NAs.
2. Setting both grat_arp and proxyarp_replied flags in
   br_do_proxy_suppress_arp() and br_do_suppress_nd() when gratuitous
   packets are detected.
3. Checking neigh_forward_grat per output port during flooding:
   - For gratuitous ARPs/NAs: suppress unless the output port has
     neigh_forward_grat enabled.
   - For regular ARPs/NDs: maintain existing behavior.

This allows gratuitous announcements from any input port to be selectively
forwarded based on each output port's individual neigh_forward_grat
setting, enabling gratuitous neighbor announcements to be flooded to the
VXLAN fabric.

Regular neighbor discovery (ARP requests, NS queries, solicited replies)
remains controlled by neigh_suppress and is unaffected.
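
A simplified model of the per-output-port decision, covering only
output ports that have neigh_suppress enabled. The struct skb_marks
and flood_to_suppressing_port() names are hypothetical; the two flags
mirror the grat_arp and proxyarp_replied marks described above, and the
"existing behavior" branch is an assumption about how replied packets
are treated.

```c
#include <stdbool.h>

/* Marks set on the packet during input processing. */
struct skb_marks {
	bool proxyarp_replied;
	bool grat_arp;
};

/*
 * Decide whether to flood a packet to an output port that has
 * neigh_suppress enabled.
 */
static bool flood_to_suppressing_port(const struct skb_marks *m,
				      bool port_forward_grat)
{
	/* Gratuitous announcement: forward only if the port opted in. */
	if (m->grat_arp)
		return port_forward_grat;

	/* Regular ARP/ND: assumed existing rule, drop if already replied. */
	return !m->proxyarp_replied;
}
```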

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-4-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bridge: Add internal flags for neigh_forward_grat

Add internal flags for the neigh_forward_grat feature:

- BR_NEIGH_FORWARD_GRAT: Port-level flag
- BR_VLFLAG_NEIGH_FORWARD_GRAT_ENABLED: Per-VLAN flag

These will be used to control whether gratuitous ARP and unsolicited NA
packets are forwarded when neighbor suppression is enabled.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-3-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bridge: uapi: Add neigh_forward_grat netlink attributes

Add netlink attributes for controlling gratuitous ARP and unsolicited NA
forwarding when neighbor suppression is enabled.

Add IFLA_BRPORT_NEIGH_FORWARD_GRAT for port-level control and
BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT for per-VLAN control.

The new attributes provide independent control of gratuitous ARP and
unsolicited NA packets. Operators can enable forwarding for those packets
for fast mobility across VTEPs while keeping general neighbor suppression
active.

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260511065936.4173106-2-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vdso/gettimeofday: Reload sequence counter after switch to time page in do_aux()

After switching to the real data pages, the sequence counter needs to be
reloaded from there. The code using vdso_read_begin_timens() assumed
this worked by 'continue' jumping to the *beginning* of the do-while
retry loop. However the 'continue' jumps to the *end* of said loop,
evaluating the exit condition. If the data page has a sequence counter
of '1' it will match the one from the time namespace page and prematurely
exit the retry loop. This would result in garbage returned to the caller.

Reload the sequence counter after switching the pages by using an inner
while loop again, which will loop at most once.

The loop generates slightly better code than an explicit reload through
'seq = vdso_read_begin()'.
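
The 'continue' pitfall can be reproduced in plain C. The sketch below
is a single-threaded model, not the vDSO code: struct clock_page,
read_continue() and read_inner_loop() are hypothetical, and -1 stands
in for the garbage a caller would otherwise receive.

```c
/* Toy data page: a sequence counter and a payload. */
struct clock_page {
	unsigned int seq;
	int value;
};

/* Buggy shape: 'continue' goes to the exit condition, not the top. */
static int read_continue(const struct clock_page *ns,
			 const struct clock_page *real)
{
	const struct clock_page *p = ns;
	unsigned int seq;
	int v = -1;		/* sentinel for "never read" */

	do {
		seq = p->seq;
		if (p == ns) {
			p = real;
			continue;	/* seq still holds the ns value */
		}
		v = p->value;
	} while (seq != p->seq);

	return v;
}

/* Fixed shape: reload seq in an inner loop that runs at most once. */
static int read_inner_loop(const struct clock_page *ns,
			   const struct clock_page *real)
{
	const struct clock_page *p = ns;
	unsigned int seq;
	int v = -1;

	do {
		seq = p->seq;
		while (p == ns) {
			p = real;
			seq = p->seq;
		}
		v = p->value;
	} while (seq != p->seq);

	return v;
}
```

When both pages carry the same counter value, read_continue() leaves
the loop without ever reading the payload.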

Fixes: ed78b7b2c5ae ("vdso/gettimeofday: Add a helper to read the sequence lock of a time namespace aware clock")
Reported-by: Ricardo Ribalda <ribalda@chromium.org>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Ricardo Ribalda <ribalda@chromium.org>
Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Link: https://patch.msgid.link/20260422-vdso-aux-timens-loop-v1-1-e2dd8c7164cc@linutronix.de
Closes: https://lore.kernel.org/lkml/CANiDSCsOy0P1if-gJZqOM5pTJ0RDcwVfru1B7KFbTOEMqjPKJw@mail.gmail.com/

net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot

On the SMC-D client, slot 0 of ini->ism_dev[]/ini->ism_chid[] is
reserved for an SMC-Dv1 device. smc_find_ism_v2_device_clnt()
populates V2 entries starting at index 1, so when no V1 device is
selected slot 0 is left in its kzalloc()'ed state with ism_dev[0] ==
NULL and ism_chid[0] == 0.

smc_v2_determine_accepted_chid() then matches the peer's CHID against
the array starting from index 0 using the CHID alone. A malicious
peer replying to a SMC-Dv2-only proposal with d1.chid == 0 matches
the empty slot, ini->ism_selected becomes 0, and the subsequent
ism_dev[0]->lgr_lock dereference in smc_conn_create() faults at
offsetof(struct smcd_dev, lgr_lock) == 0x68:

  BUG: KASAN: null-ptr-deref in _raw_spin_lock_bh+0x79/0xe0
  Write of size 4 at addr 0000000000000068 by task exploit/144
  Call Trace:
   _raw_spin_lock_bh
   smc_conn_create (net/smc/smc_core.c:1997)
   __smc_connect (net/smc/af_smc.c:1447)
   smc_connect (net/smc/af_smc.c:1720)
   __sys_connect
   __x64_sys_connect
   do_syscall_64

Require ism_dev[i] to be non-NULL before accepting a CHID match.
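
A compact model of the lookup, with and without the extra check. The
struct names, MAX_SLOTS and both find_chid_*() helpers are hypothetical
simplifications; only the ism_dev[] / ism_chid[] array names are taken
from the description above.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SLOTS	4	/* arbitrary size for the sketch */

struct dev_model {
	int id;
};

struct init_info_model {
	struct dev_model *ism_dev[MAX_SLOTS];
	uint16_t ism_chid[MAX_SLOTS];
};

/* Old behaviour: match on the CHID alone, starting at slot 0. */
static int find_chid_unchecked(const struct init_info_model *ini,
			       uint16_t chid)
{
	for (int i = 0; i < MAX_SLOTS; i++)
		if (ini->ism_chid[i] == chid)
			return i;
	return -1;
}

/* Fixed behaviour: an empty slot can never be selected. */
static int find_chid_checked(const struct init_info_model *ini,
			     uint16_t chid)
{
	for (int i = 0; i < MAX_SLOTS; i++)
		if (ini->ism_dev[i] && ini->ism_chid[i] == chid)
			return i;
	return -1;
}
```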

Fixes: a7c9c5f4af7f ("net/smc: CLC accept / confirm V2")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260511062138.2839584-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-sched-netem-enhancements'

Stephen Hemminger says:

====================
net/sched: netem: enhancements

This is a collection of improvements to netem found while
investigating the fixes now in net tree.

v5 - minor cleanups
- add allocation_errors counter
- align counters with where impairment occurs
====================

Link: https://patch.msgid.link/20260509171123.307549-1-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: netem: add per-impairment extended statistics

Add 64-bit counters for each impairment netem applies (delay, loss,
ECN marking, corruption, duplication, reordering) and for skb
allocation failures during enqueue. Exposed through TCA_STATS_APP
as struct tc_netem_xstats.

Counters increment when an impairment occurs, independent of later
events that may mask its on-wire effect. Added allocation_errors
(similar to sch_fq) to account for when impairment could not be
applied due to memory pressure, etc.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/20260509171123.307549-6-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: netem: handle multi-segment skb in corruption

The packet corruption code only flipped bits in the linear
header portion of the skb, skipping corruption when
skb_headlen() was zero.

Linearize the whole skb if necessary before corruption.
Extends d64cb81dcbd5 ("net/sched: sch_netem: fix out-of-bounds access
in packet corruption") with a more general solution.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/20260509171123.307549-5-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: netem: replace pr_info with netlink extack error messages

Use netlink extack to report errors instead of sending them
to the kernel log with pr_info(). The error message can then be seen
with tc commands, which also avoids log spam.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/20260509171123.307549-4-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: netem: remove useless VERSION

The version printed was never updated, and the kernel version is
a better indication of what is fixed or not.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/20260509171123.307549-3-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: netem: reorder struct netem_sched_data

The current layout of struct netem_sched_data can be improved
by optimizing cache locality, compacting data types (use u8
for enum) and eliminating unused elements.

Reorganize the struct as follows:

  - Cacheline 0 holds the tfifo state (t_root/t_head/t_tail/t_len),
    counter, and the unconditional enqueue scalars
    latency/jitter/rate/gap/loss.

  - Cacheline 1 holds the remaining zero-check scalars
    (duplicate/reorder/corrupt/ecn), all five crndstate correlation
    structures, and loss_model.

  - Cacheline 2 holds prng, delay_dist, the slot dequeue state,
    slot_dist, and the inner classful qdisc pointer.

  - Rate-shaping fields, q->limit (config-only; the fast path reads
    sch->limit), and the CLG Markov state move to the warm tail.

  - tc_netem_slot slot_config and qdisc_watchdog (only consulted on
    slot reschedule and watchdog wake) move to the cold tail.

Also reorder struct clgstate to place the u8 state member after the
u32 transition probabilities.  This removes the 3-byte interior hole
without changing the struct's size.
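
The effect of moving the u8 member can be checked with a small layout
sketch. The member names below (state, a1..a5) and the two struct names
are assumptions made for illustration, and the offsets hold on ABIs
where uint32_t is 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed old layout: u8 first, followed by a 3-byte interior hole. */
struct clgstate_before {
	uint8_t state;
	uint32_t a1;
	uint32_t a2;
	uint32_t a3;
	uint32_t a4;
	uint32_t a5;
};

/* Assumed new layout: u8 last, padding moves to the tail. */
struct clgstate_after {
	uint32_t a1;
	uint32_t a2;
	uint32_t a3;
	uint32_t a4;
	uint32_t a5;
	uint8_t state;
};

/* The reorder must not change the overall size. */
static_assert(sizeof(struct clgstate_before) == sizeof(struct clgstate_after),
	      "reordering must not change the struct size");

/* Bytes of padding between 'state' and 'a1' in the old layout. */
static size_t interior_padding_before(void)
{
	return offsetof(struct clgstate_before, a1) - sizeof(uint8_t);
}
```

The padding does not disappear; it becomes tail padding, which is why
the size stays the same while the interior hole goes away.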

Should have no functional change.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/20260509171123.307549-2-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

arm_mpam: Check whether the config array is allocated before destroying it

__destroy_component_cfg() is called to free the configuration array.
It uses the embedded 'garbage' structure, which means the array has
to be allocated.

If __destroy_component_cfg() is called from mpam_disable() before the
configuration was ever allocated, then a NULL pointer is dereferenced.

Check for this case and return early if the configuration is not
allocated.

__destroy_component_cfg() also frees the mbwu_state as this is allocated
by __allocate_component_cfg(). As the mbwu_state is allocated after
comp->cfg is set, and is also under mpam_list_lock, only the first
pointer needs checking.

Fixes: 3bd04fe7d807 ("arm_mpam: Extend reset logic to allow devices to be reset any time")
Cc: <stable@vger.kernel.org>
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm_mpam: Fix false positive assert failure during mpam_disable()

mpam_assert_partid_sizes_fixed() is used to document that the caller
doesn't expect the discovered PARTID size to change while it is walking
a list sized by PARTID. Typically the MSC state is not written to until
all the MSC have been discovered and this value is set.

However, if discovering the MSC fails and schedules mpam_disable(),
then the MSC state is written to reset it. In this case the
discovered PARTID size may become smaller - but only PARTID 0
will be used once resctrl_exit() has been called.

Skip the WARN_ON_ONCE() if mpam_disable_reason has been set.

Fixes: 3bd04fe7d807 ("arm_mpam: Extend reset logic to allow devices to be reset any time")
Cc: <stable@vger.kernel.org>
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm_mpam: Improve check for whether or not NRDY is hardware managed

mpam_ris_hw_probe_csu_nrdy() sets and clears MSMON_CSU.NRDY and checks
whether its configuration sticks. However, hardware isn't given a chance
to disagree. Based on rule LRTGP, in MPAM specification IHI0099 version
B.b, the hardware will set NRDY if it needs time to establish a count after
a configuration change.

Enable the monitor so that NRDY becomes relevant and change the
configuration after clearing NRDY to try and coax the hardware into setting
it.

Fixes: 8c90dc68a5de ("arm_mpam: Probe the hardware features resctrl supports")
Cc: <stable@vger.kernel.org>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm_mpam: Pretend that NRDY is always hardware managed

Rule ZTXDS of the MPAM specification, IHI0099 version B.b, states: "If a
monitor does not support automatic updates of NRDY, software can use that
bit for any purpose."

As software is not reliably informed whether or not the monitor supports
automatic updates of NRDY, always assume that hardware may manage NRDY but
don't rely on it. When NRDY is truly untouched by hardware then, as it is
written to 0 on configuration, it will always read 0.

At probe it's checked if MSMON_CSU.NRDY and MSMON_MBWU.NRDY are hardware
managed but not MSMON_MBWU_L.NRDY. Specialize the checking for hardware
managed NRDY to CSU counters as this is the only case where hardware
management makes sense. Continue to inform the user if MSMON_CSU.NRDY
appears to be hardware managed but the firmware doesn't provide the
associated time limit for the automatic clearing of NRDY. Remove the NRDY
feature flags as they are now unused.

Fixes: 8c90dc68a5de ("arm_mpam: Probe the hardware features resctrl supports")
Cc: <stable@vger.kernel.org>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm_mpam: Fix monitor instance selection when checking for hardware NRDY

In _mpam_ris_hw_probe_hw_nrdy() a new register value to select the first
monitor and relevant RIS is prepared in mon_sel. However, it is written to
the monitor value register, e.g. MSMON_CSU, rather than MSMON_CFG_MON_SEL.

As MSMON_CFG_MON_SEL is a 32 bit register, update the type of mon_sel to
u32. Write mon_sel to the intended register, MSMON_CFG_MON_SEL.

Fixes: 8c90dc68a5de ("arm_mpam: Probe the hardware features resctrl supports")
Cc: <stable@vger.kernel.org>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

slab: fix kernel-docs for mm-api

The mm-api kernel-docs have been disconnected from their symbols. While
the scripts were previously taught to handle the _noprof suffix added by
allocation tagging (in 51a7bf0238c2 "scripts/kernel-doc: drop "_noprof"
on function prototypes"), this does not handle cases where the internal
implementation function has an additional leading underscore. The added
optional parameters (via DECL_KMALLOC_PARAMS) further complicate parsing
the internal signatures.

When the kernel-doc block remains above the internal implementation
function but uses the public API name, the documentation generator fails
to associate the documented symbol.

Simply moving the docs to the macros in slab.h fixes the association but
causes loss of types in the generated documentation (rendering as e.g.
untyped 'kmalloc(size, flags)' macro).

Fix this by:

1. Moving the kernel-doc comment blocks from slub.c to slab.h, placing
   them directly above the user-facing macros.

2. Providing explicit, typed C prototypes for the documented APIs inside
   '#if 0 /* kernel-doc */' blocks.

3. Converting the variadic macros for the documented APIs to use
   explicit arguments to match the documentation.

No functional change intended.

Signed-off-by: Marco Elver <elver@google.com>
Link: https://patch.msgid.link/20260511200136.3201646-3-elver@google.com
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

slab: improve KMALLOC_PARTITION_RANDOM randomness

When using CONFIG_KMALLOC_PARTITION_RANDOM, _RET_IP_ was previously used
to identify the allocation site. _RET_IP_, however, evaluates to the
caller's parent's instruction pointer rather than the actual allocation
site; this would lead to collisions where a function performs multiple
allocations.

With the generalization to kmalloc_token_t, we now generate the token at
the outermost macro, and using _THIS_IP_ would fix this for all cases.

Unfortunately, the generic implementation of _THIS_IP_ relies on taking
the address of a local label, which is considered broken by both GCC [1]
and Clang [2] because label addresses are only expected to be used with
computed gotos. While the generic version more or less works today, it
is known to be brittle. For example, Clang -O2 always returns 1 when
this function is inlined:

static inline unsigned long get_ip(void)
{ return ({ __label__ __here; __here: (unsigned long)&&__here; }); }

To provide a reliable unique identifier without breaking architectures
relying on the generic _THIS_IP_, introduce _CODE_LOCATION_: it resolves
to _THIS_IP_ where architectures provide a safe implementation, and
falls back to a zero-cost static marker where _THIS_IP_ is broken.

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120071 [1]
Link: https://github.com/llvm/llvm-project/issues/138272 [2]
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Link: https://patch.msgid.link/20260511200136.3201646-2-elver@google.com
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

slab: support for compiler-assisted type-based slab cache partitioning

Rework the general infrastructure around RANDOM_KMALLOC_CACHES into more
flexible KMALLOC_PARTITION_CACHES, with the former being a partitioning
mode of the latter.

Introduce a new mode, KMALLOC_PARTITION_TYPED, which leverages a feature
available in Clang 22 and later, called "allocation tokens" via
__builtin_infer_alloc_token() [1]. Unlike KMALLOC_PARTITION_RANDOM
(formerly RANDOM_KMALLOC_CACHES), this mode deterministically assigns a
slab cache to an allocation of type T, regardless of allocation site.

The builtin __builtin_infer_alloc_token(<malloc-args>, ...) instructs
the compiler to infer an allocation type from arguments commonly passed
to memory-allocating functions and returns a type-derived token ID. The
implementation passes kmalloc-args to the builtin: the compiler performs
best-effort type inference, and then recognizes common patterns such as
`kmalloc(sizeof(T), ...)`, `kmalloc(sizeof(T) * n, ...)`, but also
`(T *)kmalloc(...)`. Where the compiler fails to infer a type the
fallback token (default: 0) is chosen.

Note: kmalloc_obj(..) APIs fix the pattern in which size and result type
are expressed, and therefore ensure there's not much drift in which
patterns the compiler needs to recognize. Specifically, kmalloc_obj()
and friends expand to `(TYPE *)KMALLOC(__obj_size, GFP)`, which the
compiler recognizes via the cast to TYPE*.

Clang's default token ID calculation is described as [1]:

   typehashpointersplit: This mode assigns a token ID based on the hash
   of the allocated type's name, where the top half ID-space is reserved
   for types that contain pointers and the bottom half for types that do
   not contain pointers.

Separating pointer-containing objects from pointerless objects and data
allocations can help mitigate certain classes of memory corruption
exploits [2]: attackers who gain a buffer overflow on a primitive
buffer cannot use it to directly corrupt pointers or other critical
metadata in an object residing in a different, isolated heap region.

It is important to note that heap isolation strategies offer a
best-effort approach, and do not provide a 100% security guarantee,
albeit achievable at relatively low performance cost. Note that this
also does not prevent cross-cache attacks: while waiting for future
features like SLAB_VIRTUAL [3] to provide physical page isolation, this
feature should be deployed alongside SHUFFLE_PAGE_ALLOCATOR and
init_on_free=1 to mitigate cross-cache attacks and page-reuse attacks as
much as possible today.

With all that, my kernel (x86 defconfig) shows me a histogram of slab
cache object distribution per /proc/slabinfo (after boot):

  <slab cache>      <objs> <hist>
  kmalloc-part-15    1465  ++++++++++++++
  kmalloc-part-14    2988  +++++++++++++++++++++++++++++
  kmalloc-part-13    1656  ++++++++++++++++
  kmalloc-part-12    1045  ++++++++++
  kmalloc-part-11    1697  ++++++++++++++++
  kmalloc-part-10    1489  ++++++++++++++
  kmalloc-part-09     965  +++++++++
  kmalloc-part-08     710  +++++++
  kmalloc-part-07     100  +
  kmalloc-part-06     217  ++
  kmalloc-part-05     105  +
  kmalloc-part-04    4047  ++++++++++++++++++++++++++++++++++++++++
  kmalloc-part-03     183  +
  kmalloc-part-02     283  ++
  kmalloc-part-01     316  +++
  kmalloc            1422  ++++++++++++++

The above /proc/slabinfo snapshot shows me there are 6673 allocated
objects (slabs 00 - 07) that the compiler claims contain no pointers or
it was unable to infer the type of, and 12015 objects that contain
pointers (slabs 08 - 15). On the whole, this looks relatively sane.

Additionally, when I compile my kernel with -Rpass=alloc-token, which
provides diagnostics where (after dead-code elimination) type inference
failed, I see 186 allocation sites where the compiler failed to identify
a type (down from 966 when I sent the RFC [4]). Some initial review
confirms these are mostly variable sized buffers, but also include
structs with trailing flexible length arrays.

Link: https://clang.llvm.org/docs/AllocToken.html
Link: https://blog.dfsec.com/ios/2025/05/30/blasting-past-ios-18/
Link: https://lwn.net/Articles/944647/
Link: https://lore.kernel.org/all/20250825154505.1558444-1-elver@google.com/
Link: https://discourse.llvm.org/t/rfc-a-framework-for-allocator-partitioning-hints/87434
Acked-by: GONG Ruiqi <gongruiqi1@huawei.com>
Co-developed-by: Harry Yoo (Oracle) <harry@kernel.org>
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Link: https://patch.msgid.link/20260511200136.3201646-1-elver@google.com
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

xfrm: Reject excessive values for XFRMA_TFCPAD

tfcpad is a u32, but that full range is excessive for padding.
Limit it to max IP length (64k).

Signed-off-by: David Ahern <dahern@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

xfrm: Check for underflow in xfrm_state_mtu

Leo Lin reported an OOB write issue in the esp component:

  xfrm_state_mtu() returns u32 but performs its arithmetic in unsigned
  modulo-2^32 space using an attacker-influenced "header_len + authsize +
  net_adj" subtracted from a small "mtu" argument. A nobody user can
  install an IPv4 ESP tunnel SA with a large authentication key
  (XFRMA_ALG_AUTH_TRUNC, e.g. hmac(sha512), 64-byte key, 64-byte trunc),
  configure a small interface MTU (68 bytes), and set XFRMA_TFCPAD to a
  large value. When a single UDP datagram is then sent through the
  tunnel, xfrm_state_mtu() underflows to a near-2^32 value, and
  esp_output() consumes it as a signed int via:

        padto      = min(x->tfcpad, xfrm_state_mtu(x, mtu_cached))
        esp.tfclen = padto - skb->len   (assigned to int)

  esp.tfclen ends up negative (e.g. -207). It is sign-extended to size_t
  when passed to memset() inside esp_output_fill_trailer(), producing a
  ~16 EB write of zeroes at skb_tail_pointer(skb). KASAN logs it as
  "Write of size 18446744073709551537 at addr ffff888...".

Check for underflow and return 1. This causes the sendmsg attempt to
fail with ENETUNREACH.
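
The wraparound and the guard can be illustrated with a reduced model.
The real xfrm_state_mtu() also handles block alignment and net_adj;
here the whole overhead is folded into one hypothetical 'overhead'
argument, and both function names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Unchecked: wraps modulo 2^32 when overhead exceeds mtu. */
static uint32_t payload_mtu_unchecked(uint32_t mtu, uint32_t overhead)
{
	return mtu - overhead;
}

/* Checked: report a minimal MTU of 1 instead of wrapping. */
static uint32_t payload_mtu_checked(uint32_t mtu, uint32_t overhead)
{
	if (overhead > mtu)
		return 1;

	assert(overhead <= mtu);	/* guaranteed by the check above */
	return mtu - overhead;
}
```

With an MTU of 68 and an assumed overhead of 200 bytes, the unchecked
variant yields 4294967164 while the checked variant yields 1.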

Fixes: c5c252389374 ("[XFRM]: Optimize MTU calculation")
Reported-by: Leo Lin <leo@depthfirst.com>
Assisted-by: Codex:26.506.31004
Signed-off-by: David Ahern <dahern@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

mm/slub: defer freelist construction until after bulk allocation from a new slab

Allocations from a fresh slab can consume all of its objects, and the
freelist built during slab allocation is discarded immediately as a result.

Instead of special-casing the whole-slab bulk refill case, defer freelist
construction until after objects are emitted from a fresh slab.
new_slab() now only allocates the slab and initializes its metadata.
refill_objects() then obtains a fresh slab and lets alloc_from_new_slab()
emit objects directly, building a freelist only for the objects left
unallocated; the same change is applied to alloc_single_from_new_slab().

To keep CONFIG_SLAB_FREELIST_RANDOM=y/n on the same path, introduce a
small iterator abstraction for walking free objects in allocation order.
The iterator is used both for filling the sheaf and for building the
freelist of the remaining objects.
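
A user-space sketch of the deferred scheme, under simplifying
assumptions: a fixed number of objects per slab, sequential order only
(no freelist randomization), and hypothetical names (struct slab_stub,
alloc_from_fresh_slab(), freelist_length()) that do not match the SLUB
code.

```c
#include <assert.h>
#include <stddef.h>

#define SLAB_OBJECTS 8	/* assumed objects per slab */

/* A free object stores the freelist link inside its own memory. */
union object {
	union object *next;
	unsigned char payload[32];
};

struct slab_stub {
	union object objs[SLAB_OBJECTS];
	union object *freelist;
	unsigned int inuse;
};

/*
 * Hand out up to 'want' objects directly, then link only the objects
 * that were not handed out. If the whole slab is consumed, no freelist
 * is built at all.
 */
static unsigned int alloc_from_fresh_slab(struct slab_stub *slab,
					  void **out, unsigned int want)
{
	unsigned int taken = want < SLAB_OBJECTS ? want : SLAB_OBJECTS;
	union object **link;
	unsigned int i;

	assert(slab != NULL && out != NULL);

	for (i = 0; i < taken; i++)
		out[i] = &slab->objs[i];

	link = &slab->freelist;
	for (; i < SLAB_OBJECTS; i++) {
		*link = &slab->objs[i];
		link = &slab->objs[i].next;
	}
	*link = NULL;

	slab->inuse = taken;
	return taken;
}

/* Count the objects left on the freelist. */
static unsigned int freelist_length(const struct slab_stub *slab)
{
	unsigned int n = 0;

	for (const union object *o = slab->freelist; o; o = o->next)
		n++;
	return n;
}
```

Taking 3 of 8 objects leaves a 5-entry freelist; taking all 8 leaves
the freelist empty without any link ever being written.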

Also mark setup_object() inline. After this optimization, the compiler no
longer consistently inlines this helper in the hot path, which can hurt
performance. Explicitly marking it inline restores the expected code
generation.

This reduces per-object overhead when allocating from a fresh slab.
The most direct benefit is in the paths that allocate objects first and
only build a freelist for the remainder afterward: bulk allocation from
a new slab in refill_objects(), single-object allocation from a new slab
in ___slab_alloc(), and the corresponding early-boot paths that now use
the same deferred-freelist scheme. Since refill_objects() is also used to
refill sheaves, the optimization is not limited to the small set of
kmem_cache_alloc_bulk()/kmem_cache_free_bulk() users; regular allocation
workloads may benefit as well when they refill from a fresh slab.

In slub_bulk_bench, the time per object drops by about 42% to 70% with
CONFIG_SLAB_FREELIST_RANDOM=n, and by about 54% to 69% with
CONFIG_SLAB_FREELIST_RANDOM=y. This benchmark is intended to isolate the
cost removed by this change: each iteration allocates exactly
slab->objects from a fresh slab. That makes it a near best-case scenario
for deferred freelist construction, because the old path still built a
full freelist even when no objects remained, while the new path avoids
that work. Realistic workloads may see smaller end-to-end gains depending
on how often allocations reach this fresh-slab refill path.

Benchmark results (slub_bulk_bench):
Machine: qemu-system-x86 -m 1024M -smp 8 -enable-kvm -cpu host
Kernel: Linux 7.1.0-rc1-next-20260429
Config: x86_64_defconfig
Cpu: 0
Rounds: 20
Total: 256MB

- CONFIG_SLAB_FREELIST_RANDOM=n -

obj_size=16, batch=256:
before: 5.44 +- 0.07 ns/object
after: 3.12 +- 0.03 ns/object
delta: -42.6%

obj_size=32, batch=128:
before: 7.57 +- 0.32 ns/object
after: 3.79 +- 0.07 ns/object
delta: -49.9%

obj_size=64, batch=64:
before: 11.27 +- 0.09 ns/object
after: 4.83 +- 0.06 ns/object
delta: -57.2%

obj_size=128, batch=32:
before: 19.38 +- 0.13 ns/object
after: 6.43 +- 0.08 ns/object
delta: -66.8%

obj_size=256, batch=32:
before: 23.59 +- 0.18 ns/object
after: 6.97 +- 0.07 ns/object
delta: -70.5%

obj_size=512, batch=32:
before: 21.06 +- 0.14 ns/object
after: 7.12 +- 0.17 ns/object
delta: -66.2%

- CONFIG_SLAB_FREELIST_RANDOM=y -

obj_size=16, batch=256:
before: 9.42 +- 0.11 ns/object
after: 4.36 +- 0.19 ns/object
delta: -53.7%

obj_size=32, batch=128:
before: 12.19 +- 0.62 ns/object
after: 4.93 +- 0.07 ns/object
delta: -59.6%

obj_size=64, batch=64:
before: 17.01 +- 0.73 ns/object
after: 6.14 +- 0.12 ns/object
delta: -63.9%

obj_size=128, batch=32:
before: 23.71 +- 1.10 ns/object
after: 8.35 +- 0.18 ns/object
delta: -64.8%

obj_size=256, batch=32:
before: 29.20 +- 0.35 ns/object
after: 9.44 +- 1.32 ns/object
delta: -67.7%

obj_size=512, batch=32:
before: 29.35 +- 0.79 ns/object
after: 9.21 +- 0.34 ns/object
delta: -68.6%

Link: https://github.com/HSM6236/slub_bulk_test.git
Suggested-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Hao Li <hao.li@linux.dev>
Tested-by: Hao Li <hao.li@linux.dev>
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Link: https://patch.msgid.link/202604302204413066CxdJnJ3RAGH_7iE4EBIO@zte.com.cn
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

powerpc/time: Remove redundant preempt_disable|enable() calls from arch_irq_work_raise()

A kernel panic is observed when handling machine check exceptions from
real mode.

  BUG: Unable to handle kernel data access on read at 0xc00000006be21300
  Oops: Kernel access of bad area, sig: 11 [#1]
  MSR:  8000000000001003 <SF,ME,RI,LE>  CR: 88222248  XER: 00000005
  CFAR: c00000000003ffc4 DAR: c00000006be21300 DSISR: 40000000 IRQMASK: 0
  NIP [c000000000029e40] arch_irq_work_raise+0x10/0x70
  LR [c00000000003ffc8] machine_check_queue_event+0xa8/0x150
  Call Trace:
  [c0000000179d3c70] [c00000000003ff64] machine_check_queue_event+0x44/0x150
  [c0000000179d3d30] [c0000000000084e0] machine_check_early_common+0x1f0/0x2c0

The crash occurs because arch_irq_work_raise() calls preempt_disable()
from machine check exception (MCE) handlers running in real mode. In
this context, accessing the preempt_count can fault, leading to the panic.

The preempt_disable()/preempt_enable() pair in arch_irq_work_raise()
was originally added by commit 0fe1ac48bef0 ("powerpc/perf_event: Fix
oops due to perf_event_do_pending call") to avoid races while raising
irq work from exception context.

Later, commit 471ba0e686cb ("irq_work: Do not raise an IPI when
queueing work on the local CPU") added preemption protection in
irq_work_queue() path, while commit 20b876918c06 ("irq_work: Use per
cpu atomics instead of regular atomics") added equivalent
protection in irq_work_queue_on() before reaching arch_irq_work_raise():

  irq_work_queue() / irq_work_queue_on()
    -> preempt_disable()
      -> __irq_work_queue_local()
        -> irq_work_raise()
          -> arch_irq_work_raise()

As a result, callers other than mce_irq_work_raise() already execute
with preemption disabled, making the additional
preempt_disable()/preempt_enable() pair in arch_irq_work_raise()
redundant.

The arch_irq_work_raise() function executes in NMI context when called
from the MCE handler. Hence we will not be preempted or scheduled out since
we are in NMI context with MSR[EE]=0. Therefore, it is safe to remove
the preempt_disable()/preempt_enable() calls from here.

Remove it to avoid accessing preempt_count from real mode context.

Fixes: cc15ff327569 ("powerpc/mce: Avoid using irq_work_queue() in realmode")
Suggested-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Sayali Patil <sayalip@linux.ibm.com>
[Maddy: Fixed the commit title]
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260513081413.222490-1-sayalip@linux.ibm.com

Input: synaptics - add LEN2058 to SMBus passlist for ThinkPad E490

The Lenovo ThinkPad E490 (PNP ID: LEN2058) has a Synaptics TM3471-020
touchpad that supports SMBus/RMI4 mode but is not listed in
smbus_pnp_ids[]. Without this entry, RMI4 over SMBus is not enabled
by default, and the touchpad falls back to PS/2 mode.

Adding LEN2058 to the passlist enables automatic RMI4 detection without
requiring the psmouse.synaptics_intertouch parameter, and matches
the behavior of similar ThinkPad models already in the list
(E480/LEN2054, E580/LEN2055).

Tested on ThinkPad E490 with kernel 7.0.5-zen1 and Arch Linux.
RMI4 over SMBus is confirmed working without any kernel parameters.

Signed-off-by: Nicolás Bazaes <contacto@bazaes.cl>
Assisted-by: Claude:claude-sonnet-4-6
Link: https://patch.msgid.link/20260514013552.14234-1-contacto@bazaes.cl
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

riscv: misaligned: Make enabling delegation depend on NONPORTABLE

The unaligned access emulation code in Linux has various deficiencies.
For example, it doesn't emulate vector instructions [1] [2], and doesn't
emulate KVM guest accesses. Therefore, requesting misaligned exception
delegation with SBI FWFT actually regresses vector instructions' and KVM
guests' behavior.

Until Linux can handle it properly, guard these sbi_fwft_set() calls
behind RISCV_SBI_FWFT_DELEGATE_MISALIGNED, which in turn depends on
NONPORTABLE. Those who are sure that this wouldn't be a problem can
enable this option, perhaps getting better performance.

The rest of the existing code proceeds as before, except as if
SBI_FWFT_MISALIGNED_EXC_DELEG is not available, to handle any remaining
address misaligned exceptions on a best-effort basis. The KVM SBI FWFT
implementation is also not touched, but it is disabled if the firmware
emulates unaligned accesses.

Cc: stable@vger.kernel.org
Fixes: cf5a8abc6560 ("riscv: misaligned: request misaligned exception from SBI")
Reported-by: Songsong Zhang <U2FsdGVkX1@gmail.com> # KVM
Link: https://lore.kernel.org/linux-riscv/38ce44c1-08cf-4e3f-8ade-20da224f529c@iscas.ac.cn/
Link: https://lore.kernel.org/linux-riscv/b3cfcdac-0337-4db0-a611-258f2868855f@iscas.ac.cn/
Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/20260401-riscv-misaligned-dont-delegate-v2-1-5014a288c097@iscas.ac.cn
Signed-off-by: Paul Walmsley <pjw@kernel.org>

riscv: Docs: fix unmatched quote warning

'make htmldocs' complains about ``prctrl` -- so add a second '`' to
avoid the warning.

Documentation/arch/riscv/zicfilp.rst:79: WARNING: Inline literal start-string without end-string. [docutils]

Fixes: 08ee1559052b ("prctl: cfi: change the branch landing pad prctl()s to be more descriptive")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20260406232304.1892528-1-rdunlap@infradead.org
Signed-off-by: Paul Walmsley <pjw@kernel.org>

Merge tag 'amd-drm-next-7.2-2026-05-13' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-7.2-2026-05-13:

amdgpu:
- Userq fixes
- DCN 3.2 fix
- RAS fixes
- GC 12 fixes
- Add PTL support for profiler
- SMU multi-msg helpers
- OLED fix
- Misc cleanups
- DC aux transfer refactor
- Introduce dc_plane_cm and migrate surface update color path
- IPS fixes
- DCN 4.2 updates
- SR-IOV fixes
- Add FRL registers for HDMI 2.1
- NBIO 7.11.4 updates
- VPE 2.0 support
- Aldebaran SMU update

amdkfd:
- Add profiler API

UAPI:
- Add profiler IOCTL
Userspace: https://github.com/ROCm/rocm-systems/commit/40abc95a6463a61bb318a67efd6d9cc3e5ee8839

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260513232911.41274-1-alexander.deucher@amd.com

io_uring: validate user-controlled cq.head in io_cqe_cache_refill()

A fuzzing run reproduced an unkillable io_uring task stuck at ~100% CPU:

[root@fedora io_uring_stress]# ps -ef | grep io_uring
root 1240 1 99 13:36 ? 00:01:35 [io_uring_stress] <defunct>

The task loops inside io_cqring_wait() and never returns to userspace,
and SIGKILL has no effect.

This is caused by the CQ ring exposing rings->cq.head to userspace as
writable, while the authoritative tail lives in kernel-private
ctx->cached_cq_tail. io_cqe_cache_refill() computes free space as an
unsigned subtraction:

free = ctx->cq_entries - min(tail - head, ctx->cq_entries);

If userspace keeps head within [0, tail], the subtraction is well
defined and min() just acts as a defensive clamp. But if userspace
advances head past tail, (tail - head) wraps to a huge value, free
becomes 0, and io_cqe_cache_refill() fails. The CQE is pushed onto the
overflow list and IO_CHECK_CQ_OVERFLOW_BIT is set.

The wait loop in io_cqring_wait() relies on an invariant: refill() only
fails when the CQ is *physically* full, in which case rings->cq.tail has
been advanced to iowq->cq_tail and io_should_wake() returns true. The
tampered head breaks this: refill() fails while the ring is not full, no
OCQE is copied in, rings->cq.tail never catches up, io_should_wake()
stays false, and io_cqring_wait_schedule() keeps returning early because
IO_CHECK_CQ_OVERFLOW_BIT is still set. The result is a tight retry loop
that never returns to userspace.

Introduce io_cqring_queued() as the single point that converts the
(tail, head) pair into a trustworthy queued count. Since the real
head/tail distance is bounded by cq_entries (far below 2^31), a signed
comparison reliably detects userspace moving head past tail; in that
case treat the queue as empty so callers see the full cache as free and
forward progress is preserved.
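
The difference between the old computation and the signed check can be
sketched in user space. The function names below are hypothetical and
this is not the io_uring.c implementation; the comparison against
INT32_MAX is an equivalent way of writing the signed test.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Old scheme: a head past tail wraps to a huge distance, so free == 0. */
static uint32_t cq_free_old(uint32_t tail, uint32_t head,
			    uint32_t cq_entries)
{
	return cq_entries - min_u32(tail - head, cq_entries);
}

/*
 * New scheme: a distance with the top bit set means head is past
 * tail; treat the queue as empty in that case.
 */
static uint32_t cq_queued(uint32_t tail, uint32_t head, uint32_t cq_entries)
{
	uint32_t dist = tail - head;

	assert(cq_entries <= (uint32_t)INT32_MAX);

	if (dist > (uint32_t)INT32_MAX)	/* same as (int32_t)dist < 0 */
		return 0;
	return min_u32(dist, cq_entries);
}

static uint32_t cq_free_new(uint32_t tail, uint32_t head,
			    uint32_t cq_entries)
{
	return cq_entries - cq_queued(tail, head, cq_entries);
}
```

A legitimate counter wrap (tail has wrapped past 2^32, head has not)
still gives a small positive distance, so only a tampered head is
treated as an empty queue.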

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://patch.msgid.link/20260514021847.4062782-1-wozizhi@huaweicloud.com
[axboe: fixup commit message, kill 'queued' var, and keep it all in
io_uring.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'net-sched-refine-fq_codel-memory-limits'

Eric Dumazet says:

====================
net/sched: refine fq_codel memory limits

Packets that are associated with local sockets sk_wmem_alloc
do not really need additional memory control.

First patch makes is_skb_wmem() available to modules.

Second patch uses is_skb_wmem() in fq_codel.
====================

Link: https://patch.msgid.link/20260512094859.3673997-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: fq_codel: local packets no longer count against memory limit

Commit 95b58430abe7 ("fq_codel: add memory limitation per queue")
claimed that the 32Mb default was "reasonable even for heavy duty usages."

In practice, this is not the case.

Packets that are associated with local sockets sk_wmem_alloc
do not really need additional memory control.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260512094859.3673997-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: make is_skb_wmem() available to modules

The following patch will use is_skb_wmem() from fq_codel.

Provide __sock_wfree() only if CONFIG_INET=y.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260512094859.3673997-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mlx5e-improve-rss-indirection-table-sizing-and-resizing'

Tariq Toukan says:

====================
net/mlx5e: improve RSS indirection table sizing and resizing

This series by Yael improves mlx5e RSS indirection table handling around
channel count changes and large RSS configurations.

The series:
* removes the XOR8-specific channel count limitation,
* advertises the maximum supported RSS indirection table size,
* fixes resizing of non-default RSS contexts,
* allows resizing configured default RSS contexts during channel
  changes,
* and increases the default RSS spread factor from 2x to 4x to improve
  traffic distribution for large channel counts.

Together, these changes make RSS table sizing more flexible and robust,
while improving load balancing behavior on large systems.
====================

Link: https://patch.msgid.link/20260511172719.330490-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: increase RSS indirection table spread factor

Increase the RQT uniform spread factor from 2 to 4 so that each channel
gets more indirection table entries and traffic is spread more evenly.
For num_channels > 64, imbalance drops from up to ~50% to up to ~25%.
For 64 or fewer channels, the 256-entry minimum already provides at least
4x coverage and the table size is unchanged by this commit.

This satisfies the minimum 4x coverage requirement validated by the
generic RSS selftest commit 9e3d4dae9832 ("selftests: drv-net: rss:
validate min RSS table size").

The 4x spread factor is best-effort and the table size is always capped by
the device's log_max_rqt_size capability.
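
The imbalance figures can be reproduced with a small userspace model. It
assumes the table size is channels times the spread factor, rounded up to
a power of two, raised to the 256-entry minimum and then capped by the
device limit, and that entries are assigned to channels uniformly. All
function names are hypothetical and the driver's actual sizing code may
differ:

```c
#include <assert.h>
#include <stdint.h>

#define MIN_TABLE_SIZE_MODEL 256u	/* the "256 entry minimum" above */

static uint32_t roundup_pow2_model(uint32_t v)
{
	uint32_t p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/* Model of the table size for a channel count and spread factor. */
static uint32_t rqt_size_model(uint32_t channels, uint32_t factor,
			       uint32_t log_max_rqt_size)
{
	uint32_t size, cap;

	assert(channels > 0 && factor > 0 && log_max_rqt_size < 32);
	size = roundup_pow2_model(channels * factor);
	cap = 1u << log_max_rqt_size;
	if (size < MIN_TABLE_SIZE_MODEL)
		size = MIN_TABLE_SIZE_MODEL;
	if (size > cap)
		size = cap;	/* the spread factor is best-effort */
	return size;
}

/*
 * Percentage by which the busiest channel exceeds the least busy one when
 * the table entries are spread as evenly as possible over the channels.
 */
static uint32_t imbalance_pct_model(uint32_t size, uint32_t channels)
{
	uint32_t lo = size / channels;
	uint32_t hi = lo + (size % channels != 0);

	return lo ? (hi - lo) * 100 / lo : 100;
}
```

For 127 channels the model gives a 256-entry table at factor 2 (some
channels get 3 entries, others 2) and a 512-entry table at factor 4 (5
versus 4 entries), which matches the ~50% and ~25% bounds quoted above.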

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: resize configured default RSS context table on channel change

mlx5e_ethtool_set_channels() rejected channel count changes that
required a different RQT size when the default context indirection
table was user-configured. This restriction was introduced by
commit ee3572409f74 ("net/mlx5e: RSS, Block changing channels number
when RXFH is configured").

Lift the restriction. Validate the resize upfront with
ethtool_rxfh_indir_can_resize(), then fold or unfold the table
in-place via ethtool_rxfh_indir_resize() inside state_lock, before
mlx5e_safe_switch_params(), so the preactivate callback sees the
correct table content when it programs the HW.

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: resize non-default RSS indirection tables on channel change

When the channel count changes and the RQT size changes with it, a
problem arises for non-default RSS contexts. The driver-side indirection
table grows actual_table_size without filling the new entries; stale
entries from a prior larger configuration may be re-exposed, causing
mlx5e_calc_indir_rqns() to WARN on an out-of-range index.

Replace mlx5e_rss_params_indir_modify_actual_size() with
mlx5e_rss_ctx_resize(), which fills new entries by replicating
the existing pattern, matching what ethtool_rxfh_ctxs_resize() does
for the same case. Also restrict the loop to non-default contexts.
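
A userspace sketch of "filling new entries by replicating the existing
pattern". The helper names are hypothetical, and the shrink check (the
table must be periodic in the new size) is an assumption made for this
model rather than a description of the ethtool core helpers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of a resize check: growing is always possible, shrinking only if
 * the dropped entries repeat the part of the table that is kept.
 */
static bool indir_can_resize_model(const uint32_t *tbl, size_t old_size,
				   size_t new_size)
{
	size_t i;

	if (old_size == 0 || new_size == 0)
		return false;
	for (i = new_size; i < old_size; i++)
		if (tbl[i] != tbl[i % new_size])
			return false;
	return true;
}

/*
 * Model of the resize itself. tbl must have room for new_size entries.
 * Growing overwrites whatever was in the new slots, so stale values from
 * an earlier, larger configuration are never re-exposed. Shrinking needs
 * no writes: the kept prefix already holds the folded pattern.
 */
static void indir_resize_model(uint32_t *tbl, size_t old_size, size_t new_size)
{
	size_t i;

	assert(old_size > 0);
	for (i = old_size; i < new_size; i++)
		tbl[i] = tbl[i % old_size];
}
```

Growing {0, 1, 2, 3} from 4 to 8 entries yields {0, 1, 2, 3, 0, 1, 2, 3},
whatever the upper four slots held before.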

Call ethtool_rxfh_ctxs_can_resize() before acquiring state_lock to
validate that all non-default contexts can be resized, and
ethtool_rxfh_ctxs_resize() after releasing it to fold or unfold their
indirection tables. Both functions acquire rss_lock internally and
cannot be called under state_lock. RTNL, held by all set_channels
callers, serialises context creation and deletion, making the pre-lock
check safe.

Guard both ethtool calls on mlx5e_rx_res_rss_cnt() > 1: skip the
validation and resize when no non-default contexts exist. This
naturally covers representors and IPoIB, which share
mlx5e_ethtool_set_channels() but cannot have non-default RSS contexts.

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: advertise max RSS indirection table size to ethtool

Set rxfh_indir_space to the maximum indirection table size the driver
can support: the next power of two above MLX5E_MAX_NUM_CHANNELS times
MLX5E_UNIFORM_SPREAD_RQT_FACTOR.

Without this, ethtool_rxfh_ctxs_can_resize() returns -EINVAL, blocking
non-default RSS contexts from tracking indirection table size changes
when the channel count changes.

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: remove channel count limit for XOR8 RSS hash

mlx5e_ethtool_set_channels() and mlx5e_rxfh_hfunc_check() rejected
channel counts that would produce an indirection table larger than 256
entries when the XOR8 hash function was active. This check was
introduced in commit 49e6c9387051 ("net/mlx5e: RSS, Block XOR hash
with over 128 channels").

XOR8 yields an 8-bit hash, so in practice only up to 256 entries in the
indirection table can be reached due to limited entropy. However, this
does not provide a strong justification for prohibiting larger
indirection tables. Remove the limitation.
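
The entropy argument can be illustrated with a tiny userspace model. It
assumes, for illustration only, that the table index is the 8-bit hash
value taken modulo the table size; the function name is hypothetical and
the hardware's actual indexing is not described here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Count how many distinct indirection table slots an 8-bit hash can
 * select, assuming index = hash % table_size.
 */
static size_t reachable_entries_model(size_t table_size)
{
	bool seen[256] = { false };
	size_t count = 0;
	unsigned int hash;

	assert(table_size > 0);
	for (hash = 0; hash < 256; hash++) {
		size_t idx = hash % table_size;	/* always below 256 */

		if (!seen[idx]) {
			seen[idx] = true;
			count++;
		}
	}
	return count;
}
```

Under that assumption a 512- or 1024-entry table still has only 256
reachable slots. The larger table is wasteful rather than incorrect, which
is consistent with removing the limit instead of keeping it.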

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'amd-drm-fixes-7.1-2026-05-13' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes

amd-drm-fixes-7.1-2026-05-13:

amdgpu:
- Userq fixes
- DCN 3.2 fix
- RAS fix
- GC 12 fix

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260513224053.40670-1-alexander.deucher@amd.com

Merge branch 'macsec-use-rcu_work-to-fix-crypto-cleanup-in-softirq-context'

Jinliang Zheng says:

====================
macsec: use rcu_work to fix crypto cleanup in softirq context

From: Jinliang Zheng <alexjlzheng@tencent.com>

crypto_free_aead() can internally call vunmap() (e.g. via dma_free_attrs()
in hardware crypto drivers like hisi_sec2), which must not be invoked from
softirq context. Both free_rxsa() and free_txsa() are RCU callbacks that
run in softirq, causing a kernel crash on affected hardware.

This series fixes the issue by deferring the actual cleanup to a workqueue
using rcu_work, which combines the RCU grace period and workqueue dispatch
into a single primitive.

Two design decisions worth noting:

1. rcu_work instead of schedule_work() + synchronize_rcu()

   An alternative would be to call schedule_work() directly from
   macsec_rxsa_put()/macsec_txsa_put(), then call synchronize_rcu() at
   the start of the work handler to replace the grace period previously
   provided by call_rcu(). However, synchronize_rcu() blocks the worker
   thread for the duration of a full RCU grace period. Under high SA
   churn (e.g. tearing down an interface with many SAs), each SA would
   occupy a worker thread while waiting, and multiple concurrent calls
   cannot share the same grace period — leading to unnecessary latency
   and resource waste.

   rcu_work uses call_rcu_hurry() internally, which is fully asynchronous:
   the worker thread is only dispatched after the grace period has elapsed,
   and multiple concurrent queue_rcu_work() calls naturally batch under the
   same grace period via the RCU subsystem's existing coalescing mechanism.

2. Dedicated workqueue instead of system_wq

   Using a dedicated workqueue (macsec_wq) allows macsec_exit() to drain
   exactly the work items belonging to this module — by calling
   destroy_workqueue() after rcu_barrier(). If system_wq were used,
   flush_scheduled_work() would drain all pending work items across the
   entire system, creating unnecessary coupling with unrelated subsystems
   and potentially causing unexpected delays. The dedicated workqueue
   provides a clean, contained teardown path.
====================

Link: https://patch.msgid.link/20260511153102.2640368-1-alexjlzheng@tencent.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

macsec: use rcu_work to defer TX SA crypto cleanup out of softirq

free_txsa() is an RCU callback running in softirq context, but calls
crypto_free_aead() which can invoke vunmap() internally on hardware
crypto drivers (e.g. hisi_sec2), triggering a kernel crash.

Use rcu_work to defer the cleanup to a workqueue, for the same reasons
as the analogous fix to free_rxsa() in the previous patch.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260511153102.2640368-4-alexjlzheng@tencent.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

macsec: use rcu_work to defer RX SA crypto cleanup out of softirq

crypto_free_aead() can internally invoke vunmap() (e.g. via
dma_free_attrs() in hardware crypto drivers such as hisi_sec2).
vunmap() must not be called from softirq context, but free_rxsa()
is an RCU callback that runs in softirq, leading to a kernel crash:

  vunmap+0x4c/0x70
  __iommu_dma_free+0xd0/0x138
  dma_free_attrs+0xf4/0x100
  sec_aead_exit+0x64/0xb8 [hisi_sec2]
  crypto_destroy_tfm+0x98/0x110
  free_rxsa+0x28/0x50 [macsec]
  rcu_do_batch+0x184/0x460
  rcu_core+0xf4/0x1f8
  handle_softirqs+0x118/0x330

Use rcu_work to defer the cleanup to a workqueue. rcu_work dispatches
the worker asynchronously after the RCU grace period, so no thread
blocks waiting, and concurrent releases of multiple SAs naturally
share the same grace period.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260511153102.2640368-3-alexjlzheng@tencent.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

macsec: introduce dedicated workqueue for SA crypto cleanup

Introduce a dedicated ordered workqueue, macsec_wq, which will be used
by subsequent patches to defer SA crypto cleanup (crypto_free_aead and
related teardown) out of softirq context.

Using a dedicated workqueue instead of system_wq allows macsec_exit()
to drain exactly the work items belonging to this module via
destroy_workqueue(), without interfering with unrelated work items on
system_wq or causing unexpected delays elsewhere.

rcu_barrier() in macsec_exit() ensures all in-flight rcu_work callbacks
have enqueued their work items before destroy_workqueue() drains and
destroys the queue, making the two-step teardown correct and complete.
The same sequence is kept in the error path of macsec_init() as a
precaution, to mirror macsec_exit() and stay safe if work ever becomes
queueable before this point in the future.
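
The two-step teardown can be sketched as a userspace model. None of the
names below are kernel APIs: sa_put_model() stands in for queueing an
rcu_work, rcu_barrier_model() for rcu_barrier(), and drain_model() for the
drain performed by destroy_workqueue(). Grace periods and threads are not
simulated, only the ordering:

```c
#include <assert.h>
#include <stddef.h>

struct work_model {
	void (*fn)(struct work_model *w);
	struct work_model *next;
};

struct queue_model {
	struct work_model *head;
	struct work_model **tail;
};

static void queue_init_model(struct queue_model *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

static void queue_push_model(struct queue_model *q, struct work_model *w)
{
	w->next = NULL;
	*q->tail = w;
	q->tail = &w->next;
}

static struct work_model *queue_pop_model(struct queue_model *q)
{
	struct work_model *w = q->head;

	if (w) {
		q->head = w->next;
		if (!q->head)
			q->tail = &q->head;
	}
	return w;
}

struct sa_model {
	struct work_model cleanup;
	int crypto_freed;
};

/* Work handler: runs in "process context", where freeing is allowed. */
static void sa_cleanup_model(struct work_model *w)
{
	struct sa_model *sa = (struct sa_model *)
		((char *)w - offsetof(struct sa_model, cleanup));

	sa->crypto_freed = 1;
}

/* Last reference dropped: the cleanup first waits for a grace period. */
static void sa_put_model(struct queue_model *rcu_pending, struct sa_model *sa)
{
	assert(!sa->crypto_freed);
	sa->cleanup.fn = sa_cleanup_model;
	queue_push_model(rcu_pending, &sa->cleanup);
}

/* Every pending callback runs; each one only enqueues its work item. */
static void rcu_barrier_model(struct queue_model *rcu_pending,
			      struct queue_model *wq)
{
	struct work_model *w;

	while ((w = queue_pop_model(rcu_pending)))
		queue_push_model(wq, w);
}

/* Run all queued work items and return how many were executed. */
static int drain_model(struct queue_model *wq)
{
	struct work_model *w;
	int n = 0;

	while ((w = queue_pop_model(wq))) {
		w->fn(w);
		n++;
	}
	return n;
}
```

Draining before the barrier finds nothing to run, because the work items
are still parked behind their callbacks. Running the barrier first and
then draining frees every SA, which is why the order matters.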

While at it, rename the error labels in macsec_init() from the
resource-named style (rtnl:, notifier:, wq:) to the err_xxx: style
(err_rtnl:, err_notifier:, err_destroy_wq:) to align with the broader
kernel convention.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260511153102.2640368-2-alexjlzheng@tencent.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>