git.ipfire.org Git - thirdparty/linux.git/log

selftests: ovpn: reduce remaining ping flood counts

Commit 201ba706318d ("selftests: ovpn: reduce ping count in test.sh")
lowered the baseline traffic flood ping count to avoid flakes on slower
CI instances, however some instances were left out.

Apply the same limit to the remaining ovpn selftest flood pings that
still request 500 packets.

Fixes: 201ba706318d ("selftests: ovpn: reduce ping count in test.sh")
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>

io_uring/rsrc: raise registered buffer 1GB limit

There's no real reason to have a limit, as the memory is accounted by
the lockmem limits anyway, if any exist. io_pin_pages() will still
restrict the maximum allowed limit per buffer, which is INT_MAX
number of pages. Cap it a bit lower than that, at 1TB for a 64-bit
system. Surely that should be enough for everyone. For now.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/rsrc: bump struct io_mapped_ubuf length field to size_t

In preparation for supporting bigger individual buffers, bump the length
field to a full 8-bytes with size_t rather than an unsigned int.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/rsrc: add huge page accounting for registered buffers

Track huge page references in a per-ring xarray to prevent double
accounting when the same huge page is used by multiple registered
buffers, either within the same ring or across cloned rings.

When registering buffers backed by huge pages, we need to account for
RLIMIT_MEMLOCK. But if multiple buffers share the same huge page (common
with cloned buffers), we must not account for the same page multiple
times. Similarly, we must only unaccount when the last reference to a
huge page is released.

Maintain a per-ring xarray (hpage_acct) that tracks reference counts for
each huge page. When registering a buffer, for each unique huge page,
increment its accounting reference count, and only account pages that
are newly added.

When unregistering a buffer, for each unique huge page, decrement its
refcount. Once the refcount hits zero, the page is unaccounted.

Note: any account is done against the ctx->user that was assigned when
the ring was setup. As before, if root is running the operation, no
accounting is done.

With these changes, any use of imu->acct_pages is also dead, hence kill
it from struct io_mapped_ubuf. This shrinks it from 56b to 48b on a
64-bit arch. Additionally, hpage_already_acct() is gone, which was an
O(M*M) scan over current + previous registrations.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

Bluetooth: hci_qca: Convert timeout from jiffies to ms

Since the timer uses jiffies as its unit rather than ms, the timeout value
must be converted from ms to jiffies when configuring the timer. Otherwise,
the intended 8s timeout is incorrectly set to approximately 33s.

To improve readability, embed msecs_to_jiffies() directly in the macro
definitions and drop the _MS suffix from macros that now yield jiffies
values: MEMDUMP_TIMEOUT, FW_DOWNLOAD_TIMEOUT, IBS_DISABLE_SSR_TIMEOUT,
CMD_TRANS_TIMEOUT, and IBS_BTSOC_TX_IDLE_TIMEOUT.

IBS_WAKE_RETRANS_TIMEOUT_MS and IBS_HOST_TX_IDLE_TIMEOUT_MS are
intentionally left unchanged. Their values are stored in the struct fields
wake_retrans and tx_idle_delay, which hold ms values at runtime and can be
modified via debugfs. The msecs_to_jiffies() conversion happens at each
call site against the field value, so it cannot be embedded in the macro.

Wake timer depends on commit c347ca17d62a

Cc: stable@vger.kernel.org
Fixes: d841502c79e3 ("Bluetooth: hci_qca: Collect controller memory dump during SSR")
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Acked-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Shuai Zhang <shuai.zhang@oss.qualcomm.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: L2CAP: ecred_reconfigure: send packed pdu, not stack pointer

Commit 1c08108f3014 ("Bluetooth: L2CAP: Avoid -Wflex-array-member-not-at-end
warnings") converted the on-stack request PDU in l2cap_ecred_reconfigure()
from an explicit packed struct to DEFINE_RAW_FLEX(), but did not adjust the
size and source-pointer arguments to l2cap_send_cmd():

  -    struct {
  -            struct l2cap_ecred_reconf_req req;
  -            __le16 scid;
  -    } pdu;
  +    DEFINE_RAW_FLEX(struct l2cap_ecred_reconf_req, pdu, scid, 1);
       ...
       l2cap_send_cmd(conn, chan->ident, L2CAP_ECRED_RECONF_REQ,
                      sizeof(pdu), &pdu);

After the conversion, DEFINE_RAW_FLEX() expands to declare an anonymous
union pdu_u plus a local pointer "pdu" pointing at it. Therefore:

  - sizeof(pdu) is now sizeof(struct l2cap_ecred_reconf_req *) = 8 on
    64-bit (4 on 32-bit), not the 6 bytes of (mtu, mps, scid[1]).
  - &pdu is the address of the local pointer's stack storage, not the
    address of the request payload.

l2cap_send_cmd() forwards (data, count) to l2cap_build_cmd(), which calls
skb_put_data(skb, data, count). The L2CAP_ECRED_RECONFIGURE_REQ packet
body therefore contains 8 bytes copied from the kernel stack starting at
&pdu -- the 8 bytes overlap the pdu pointer's value, leaking a kernel
stack address to the paired Bluetooth peer. The intended (mtu, mps, scid)
fields are not transmitted at all, so the peer rejects the request as
malformed and the L2CAP_ECRED_RECONFIGURE feature itself has been broken
for the local-side initiator since the introducing commit landed.

The sibling site l2cap_ecred_conn_req() in the same commit was converted
correctly (sizeof(*pdu) + len, pdu); only this site was missed.

Restore the original semantics: pass the full flex-struct size via
struct_size(pdu, scid, 1) and the pdu pointer (the struct address) as
the source.

Validated on a stock 7.0-based host kernel via the real call path:
setsockopt(SOL_BLUETOOTH, BT_RCVMTU, ...) on a BT_CONNECTED
L2CAP_MODE_EXT_FLOWCTL socket emits an L2CAP_ECRED_RECONFIGURE_REQ
whose body is 8 bytes (the on-stack pdu local's value) rather than
the expected 6. Three captures from fresh socket / fresh hciemu peer
on the same host -- low bytes vary per call, high 0xffff confirms a
kernel virtual address (KASLR-randomised stack slot, not a fixed
string):

  RECONF_REQ body (ident=0x02 len=8): 42 fb 54 af 0e ca ff ff
  RECONF_REQ body (ident=0x02 len=8): 52 3d 2e af 0e ca ff ff
  RECONF_REQ body (ident=0x02 len=8): b2 fc 5b af 0e ca ff ff

After this patch the body is 6 bytes carrying the expected
little-endian (mtu, mps, scid).

Cc: stable@vger.kernel.org
Fixes: 1c08108f3014 ("Bluetooth: L2CAP: Avoid -Wflex-array-member-not-at-end warnings")
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: btmtk: accept too short WMT FUNC_CTRL events

MT7925 (USB ID 0e8d:e025) on fw version 20260106153314 sends WMT
FUNC_CTRL events that are missing the status field.

Prior to commit 006b9943b982 ("Bluetooth: btmtk: validate WMT event SKB
length before struct access") the status was read from out-of-bounds of
SKB data, which usually would result to success with
BTMTK_WMT_ON_UNDONE, although I don't know the intent here. The bounds
check added in that commit returns with error instead, producing
"Bluetooth: hci0: Failed to send wmt func ctrl (-22)" and makes the
device unusable.

Fix the regression by interpreting too short packet as status
BTMTK_WMT_ON_UNDONE, which makes the device work normally again.

Fixes: 634a4408c061 ("Bluetooth: btmtk: validate WMT event SKB length before struct access")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> # MT7922 (0489:e0e2)
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: serialize accept_q access

bt_sock_poll() walks the accept queue without synchronization, while
child teardown can unlink the same socket and drop its last reference.
The unsynchronized accept queue walk has existed since the initial
Bluetooth import.

Protect accept_q with a dedicated lock for queue updates and polling.
Also rework bt_accept_dequeue() to take temporary child references under
the queue lock before dropping it and locking the child socket.

Fixes: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Reported-by: Jann Horn <jannh@google.com>
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

cpufreq/amd-pstate: Drop Kconfig option for dynamic EPP

There are some performance issues being identified by dynamic EPP
and we don't want to have distributions turning it on by default
exposing them to users at this time.

Drop the kconfig option, and require an explicit opt in from kernel
command line or runtime sysfs option to turn it on.

Reported-by: Viktor Jägersküpper <viktor_jaegerskuepper@freenet.de>
Closes: https://lore.kernel.org/linux-pm/14a87c99-785c-4b16-bfce-35ecbf053448@freenet.de/
Reported-by: Stuart Meckle <stuartmeckle@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221473
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20260512221947.1652988-1-mario.limonciello@amd.com
(fix sysfs file path)
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>

drm/ttm: Fix ttm_bo_shrink() infinite LRU walk on backup failure

Apply the same fix as b2ed01e7ad ("drm/ttm: Fix ttm_bo_swapout()
infinite LRU walk on swapout failure") to the ttm_bo_shrink() path.

Move del_bulk_move from before the backup to after success only,
using ttm_resource_del_bulk_move_unevictable() since the resource
is now unevictable once fully backed up.

Fixes: 70d645deac98 ("drm/ttm: Add helpers for shrinking")
Cc: Christian König <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: dri-devel@lists.freedesktop.org
Cc: stable@vger.kernel.org # v6.15+
Assisted-by: GitHub_Copilot:claude-opus-4.6
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patch.msgid.link/20260511162443.24352-1-thomas.hellstrom@linux.intel.com
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>

io_uring/net: allow filtering on IORING_OP_CONNECT

This adds custom filtering for IORING_OP_CONNECT, where the target
family is always exposed, and (for AF_INET / AF_INET6) port and
address are exposed. port and v4_addr are in network byte order so
filter authors can compare against on-wire constants.

Skip population unless addr_len covers the populated fields, to
avoid leaking stale io_async_msghdr data on short connects.

Signed-off-by: Shouvik Kar <auxcorelabs@gmail.com>
Link: https://patch.msgid.link/20260512110242.26219-1-auxcorelabs@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: parenthesize io_ring_head_to_buf() expansion

Wrap the io_ring_head_to_buf() macro value in an extra pair of parentheses
so it is safe when composed into larger expressions, and to satisfy
scripts/checkpatch.pl.

Signed-off-by: Yi Xie <xieyi@kylinos.cn>
Link: https://patch.msgid.link/20260514083443.203387-1-xieyi@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>

net: phy: DP83TC811: add reading of abilities

At this time the driver is not listing any speeds
it supports. This should be ETHTOOL_LINK_MODE_100baseT1_Full_BIT
for DP83TC811. Add the missing call for phylib to read the abilities.

Fixes: b753a9faaf9a ("net: phy: DP83TC811: Introduce support for the DP83TC811 phy")
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Sven Schuchmann <schuchmann@schleissheimer.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260512071949.6218-1-schuchmann@schleissheimer.de
[pabeni@redhat.com: dropped revision history]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mm/slub: hold cpus_read_lock around flush_rcu_sheaves_on_cache()

flush_rcu_sheaves_on_cache() calls queue_work_on() in a
for_each_online_cpu() loop, which requires the cpu to stay online.
But cpus_read_lock() is not held in kvfree_rcu_barrier_on_cache() and the
set of "online cpus" is subject to change.

There are two paths that call flush_rcu_sheaves_on_cache():

  // has cpus_read_lock()
  flush_all_rcu_sheaves()
    -> flush_rcu_sheaves_on_cache()

  // no cpus_read_lock()
  kvfree_rcu_barrier_on_cache()
    -> flush_rcu_sheaves_on_cache()

Fix this by holding cpus_read_lock() in kvfree_rcu_barrier_on_cache().

Why not move cpus_read_lock() from flush_all_rcu_sheaves() into
flush_rcu_sheaves_on_cache()? The reason is it would introduce a new lock
order (slab_mutex -> cpu_hotplug_lock). The reverse order
(cpu_hotplug_lock -> slab_mutex) is established by

- cpuhp_setup_state_nocalls(..., slub_cpu_setup, ...)
- kmem_cache_destroy()

The two orders together would form an AB-BA deadlock.

Finally, add lockdep_assert_cpus_held() in flush_rcu_sheaves_on_cache()
to catch the same problem in the future.

Fixes: 0f35040de593 ("mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction")
Cc: <stable@vger.kernel.org>
Signed-off-by: Qing Wang <wangqing7171@gmail.com>
Link: https://patch.msgid.link/20260512035035.762317-1-wangqing7171@gmail.com
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

kbuild: pacman-pkg: package unstripped vDSO libraries

The unstripped vDSO files are useful for debugging.
They are provided in the upstream 'linux-headers' package.

Also package them as part of 'make pacman-pkg'.
Make them part of the '-debug' package, as they fit there best.
This differs from the upstream package as that has no '-debug' variant.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20260318-kbuild-pacman-vdso-install-v1-1-48ceb31c0e80@weissschuh.net
Signed-off-by: Nathan Chancellor <nathan@kernel.org>

KVM: x86: nSVM: Save/restore gPAT with KVM_{GET,SET}_NESTED_STATE

Add a 'gpat' field to kvm_svm_nested_state_hdr to carry L2's guest PAT
value across save and restore.

When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled and the vCPU is in
guest mode with nested NPT enabled, save vmcb02's g_pat into the header on
KVM_GET_NESTED_STATE, and restore it on KVM_SET_NESTED_STATE.

Host-initiated accesses to IA32_PAT (via KVM_GET/SET_MSRS) always target
L1's hPAT, so they cannot be used to save or restore gPAT. The separate
header field ensures that KVM_GET/SET_MSRS and KVM_GET/SET_NESTED_STATE are
independent and can be ordered arbitrarily during save and restore.

Note that struct kvm_svm_nested_state_hdr is included in a union padded to
120 bytes, so there is room to add the gpat field without changing any
offsets.

Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260407190343.325299-9-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: Documentation: document KVM_{GET,SET}_NESTED_STATE for SVM

Document the nested state constants and structures for SVM that were added
by commit cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and
KVM_SET_NESTED_STATE").

Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260407190343.325299-8-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: nSVM: Save gPAT to vmcb12.g_pat on VMEXIT

According to the APM volume 3 pseudo-code for "VMRUN," when nested paging
is enabled in the vmcb, the guest PAT register (gPAT) is saved to the vmcb
on emulated VMEXIT.

When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled and the vCPU is in
guest mode with nested NPT enabled, save the vmcb02 g_pat field to the
vmcb12 g_pat field on emulated VMEXIT.

Fixes: 15038e147247 ("KVM: SVM: obey guest PAT")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260407190343.325299-7-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: nSVM: Redirect IA32_PAT accesses to either hPAT or gPAT

When handling PAT accesses from L2, route PAT accesses to either hPAT or
gPAT based on whether or not L2 has a separate PAT, i.e. if KVM is actually
emulating gPAT, instead of using L1's PAT for everything.  Specifically, if
KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled, the vCPU is in guest mode
with nested NPT enabled, *and* the access if from the guest (i.e. is not
from the host stuffing PAT as part of save/restore), then redirect guest
PAT accesses to the gPAT "register" in vmcb02, i.e. emulate gPAT for L2.

Always route non-guest accesses to hPAT, i.e. L1's PAT in vcpu->arch.pat,
to ensures that KVM_{G,S}ET_MSRS and KVM_{G,S}ET_NESTED_STATE are
independent of each other and can be ordered arbitrarily during save and
restore.  E.g. if KVM didn't exempt host accesses, then whether a write to
PAT hit hPAT or gPAT would vary based on whether userspace restores PAT
before or after nested state.  Note, gPAT is saved and restored separately
via KVM_{G,S}ET_NESTED_STATE.

WARN if there's a host-initiated access to PAT from within KVM_RUN, i.e. if
KVM itself initiated the access, as there are no such accesses today, and
it's not clear what the "right" behavior would be.

Fixes: 15038e147247 ("KVM: SVM: obey guest PAT")
Signed-off-by: Jim Mattson <jmattson@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: nSVM: Set vmcb02.g_pat correctly for nested NPT

When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled and nested NPT is
enabled in vmcb12, copy the (cached and validated) vmcb12 g_pat field to
vmcb02's g_pat, giving L2 its own independent guest PAT register.

When the quirk is enabled (default), or when NPT is enabled but nested NPT
is disabled, copy L1's IA32_PAT MSR to the vmcb02 g_pat field, since L2
shares the IA32_PAT MSR with L1.

When NPT is disabled, the g_pat field is ignored by hardware.

Fixes: 15038e147247 ("KVM: SVM: obey guest PAT")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260407190343.325299-5-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: nSVM: Cache and validate vmcb12 g_pat

When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled and nested paging is
enabled in vmcb12, validate g_pat at emulated VMRUN and cause an immediate
VMEXIT with exit code VMEXIT_INVALID if it is invalid, as specified in the
APM, volume 2: "Nested Paging and VMRUN/VMEXIT."

Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260407190343.325299-4-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: nSVM: Clear VMCB_NPT clean bit when updating hPAT from guest mode

When running an L2 guest and writing to MSR_IA32_CR_PAT, the host PAT value
is stored in both vmcb01's g_pat field and vmcb02's g_pat field, but the
clean bit was only being cleared for vmcb02.

Introduce the helper vmcb_set_gpat() which sets vmcb->save.g_pat and marks
the VMCB dirty for VMCB_NPT. Use this helper in both svm_set_msr() for
updating vmcb01 and in nested_vmcb02_compute_g_pat() for updating vmcb02,
ensuring both VMCBs' NPT fields are properly marked dirty.

Fixes: 4995a3685f1b ("KVM: SVM: Use a separate vmcb for the nested L2 guest")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260407190343.325299-3-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Define KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT

Define a quirk to control whether nested SVM shares L1's PAT with L2
(legacy behavior) or gives L2 its own independent gPAT (correct behavior
per the APM).

When the quirk is enabled (default), L2 shares L1's PAT, preserving the
legacy KVM behavior. When userspace disables the quirk, KVM correctly
virtualizes the PAT for nested SVM guests, giving L2 a separate gPAT as
specified in the AMD architecture.

Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260407190343.325299-2-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

docs: threat-model: don't limit root capabilities to CAP_SYS_ADMIN

The threat-model document says that only users with CAP_SYS_ADMIN can carry
out a number of admin-level tasks, but there are numerous capabilities that
can confer that sort of power. Generalize the text slightly to make it
clear that CAP_SYS_ADMIN is not the only all-powerful capability.

Acked-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

docs: security-bugs: add a link to the threat-model documentation

Rather than make readers search for this document, just a link to it where
it is referenced.

(While I was at it, I removed the unused and unneeded _threatmodel label
from the top of threat-model.rst).

Acked-by: Willy Tarreau <w@1wt.eu>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

timers: Fix flseep() typo in kernel-doc comment

Signed-off-by: Gitle Mikkelsen <gitlem@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260501170616.1402-1-gitlem@gmail.com

ARM: dts: socfpga: arria10: Increase JFFS2 rootfs partition size

Increase the JFFS2 partition size to support larger root filesystem.
Also fix the partition label to match the actual start address.

Signed-off-by: Niravkumar L Rabara <niravkumar.l.rabara@altera.com>
Signed-off-by: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>

RAS/AMD/ATL: Drop malformed default N from Kconfig

The capital letters are for symbols and N in 'default N' will be evaluated as
another, nonexistent, Kconfig symbol, and not as the 'no' it should be. More
importantly, 'n' *is* the default already. Hence just drop the malformed line.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20260513205021.368190-1-andriy.shevchenko@linux.intel.com

net: tls: prevent chain-after-chain in plain text SG

Sashiko points out that if end = 0 (start != 0) the current
code will create a chain link to content type right after
the wrap link:

  This would create a chain where the wrap link points directly
  to another chain link. The scatterlist API sg_next iterator
  does not recursively resolve consecutive chain links.

meaning this is illegal input to crypto.

The wrapping link is unnecessary if end = 0. end is the entry after
the last one used so end = 0 means there's nothing pushed after
the wrap:

   end         start            i
    v            v              v
  [   ]...[   ][ d ][ d ][ d ][ d ][rsv for wrap]

Skip the wrapping in this case.

TLS 1.3 can use the "wrapping slot" for it's chaining if end = 0.
This avoids the chain-after-chain.

Move the wrap chaining before marking END and chaining off content
type, that feels like more logical ordering to me, but should not
matter from functional perspective.

Reported-by: Sashiko <sashiko-bot@kernel.org>
Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260511174920.433155-3-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring

When an sk_msg scatterlist ring wraps (sg.end < sg.start),
tls_push_record() chains the tail portion of the ring to the head
using sg_chain(). An extra entry in the sg array is reserved for
this:

  struct sk_msg_sg {
        [...]
        /* The extra two elements:
         * 1) used for chaining the front and sections when the list becomes
         *    partitioned (e.g. end < start). The crypto APIs require the
         *    chaining;
         * 2) to chain tailer SG entries after the message.
         */
        struct scatterlist              data[MAX_MSG_FRAGS + 2];

The current code uses MAX_SKB_FRAGS + 1 as the ring size:

    sg_chain(&msg_pl->sg.data[msg_pl->sg.start],
             MAX_SKB_FRAGS - msg_pl->sg.start + 1,
             msg_pl->sg.data);

This places the chain pointer at

  sg_chain(data[start], (MAX_SKB_FRAGS - msg_start + 1) .. =
  &data[start] + (MAX_SKB_FRAGS - msg_start + 1) - 1 =
  data[start + (MAX_SKB_FRAGS - start + 1) - 1] =
  data[MAX_SKB_FRAGS]

instead of the true last entry. This is likely due to a "race" of
the commit under Fixes landing close to
commit 031097d9e079 ("bpf: sk_msg, zap ingress queue on psock down")

Convert to ARRAY_SIZE and drop the data[start] / - start (as suggested
by Sabrina).

Reported-by: 钱一铭 <yimingqian591@gmail.com>
Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260511174920.433155-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

clk: scpi: Unregister child clock providers on remove

SCPI clock providers are registered for each child node in
scpi_clk_add(), but scpi_clocks_remove() unregisters the parent node on
each iteration.

of_clk_del_provider() matches providers by the node used at registration
time, so passing the parent node leaves the child providers registered.
This leaks the provider allocations and the node references held by the
clock provider core.

Pass the child node to of_clk_del_provider() so the remove path matches
the probe path.

Fixes: cd52c2a4b5c4 ("clk: add support for clocks provided by SCP(System Control Processor)")
Signed-off-by: Stepan Ionichev <sozdayvek@gmail.com>
Link: https://patch.msgid.link/20260513090900.5323-1-sozdayvek@gmail.com
(sudeep.holla: Updated commit title and message a bit)
Signed-off-by: Sudeep Holla <sudeep.holla@kernel.org>

drm/ttm: Convert -EAGAIN from dmem_cgroup_try_charge to -ENOSPC

dmem_cgroup_try_charge() returns -EAGAIN when the cgroup limit is
hit and the charge fails. TTM has no concept of -EAGAIN from resource
allocation; -ENOSPC is the canonical error meaning "no space, try
eviction". Convert at the source in ttm_resource_alloc() so no caller
needs to handle an unexpected error code, and clean up the now-redundant
-EAGAIN check in ttm_bo_alloc_resource().

Without this, -EAGAIN escaping ttm_resource_alloc() during an eviction
walk causes the walk to terminate early instead of continuing to the
next candidate.

Cc: Friedrich Vock <friedrich.vock@gmx.de>
Cc: Maarten Lankhorst <dev@lankhorst.se>
Cc: Tejun Heo <tj@kernel.org>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: dri-devel@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v6.14+
Fixes: 2b624a2c1865 ("drm/ttm: Handle cgroup based eviction in TTM")
Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Maarten Lankhorst <dev@lankhrost.se>
Link: https://patch.msgid.link/20260508160920.230339-1-thomas.hellstrom@linux.intel.com

Merge branch 'bridge-add-selective-forwarding-of-gratuitous-neighbor-announcements'

Danielle Ratson says:

====================
bridge: Add selective forwarding of gratuitous neighbor announcements

The existing neighbor suppression unconditionally suppresses gratuitous
ARPs and unsolicited Neighbor Advertisements, which prevents fast
mobility of hosts between VTEPs.

This series adds a new neigh_forward_grat option that provides
independent control of gratuitous ARP and unsolicited NA forwarding.
When neigh_suppress is enabled but neigh_forward_grat is enabled,
regular neighbor discovery is suppressed while gratuitous announcements
are forwarded.

The implementation marks gratuitous ARPs and unsolicited NAs in
BR_INPUT_SKB_CB during input processing, then checks the per-output-port
neigh_forward_grat setting during flooding. This allows gratuitous
announcements from any input port to be selectively forwarded based on
each output port's individual configuration.

Both port-level control (via IFLA_BRPORT_NEIGH_FORWARD_GRAT) and
per-VLAN control (via BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT) are
provided. The default value of OFF preserves existing behavior.

This behavior is in accordance with RFC 9161 (Section 3.6), which
recommends that VTEPs forward gratuitous ARP and unsolicited NA messages
to avoid traffic disruption during host mobility events.

The new attributes use NLA_U8, although the kernel netlink guideline
recommends NLA_U32 as the minimum integer type on the grounds that
alignment makes smaller types equivalent on the wire. For a simple
on/off attribute there is no technical advantage to u32 over u8, and
keeping u8 preserves consistency with all surrounding bridge port
attributes and avoids introducing new helpers alongside the existing
infrastructure.

Patchset overview:
Patch #1: adds uapi headers.
Patches #2-#3: support selective forwarding of gratuitous ARP.
Patches #4-#5: add netlink handling.
Patch #6: adds tests.

Please see iproute related patches in the last 3 commits of:
https://github.com/daniellerts/iproute2
====================

Link: https://patch.msgid.link/20260511065936.4173106-1-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: net: Add tests for neigh_forward_grat option

Add tests to validate the neigh_forward_grat bridge option for selective
forwarding of gratuitous neighbor announcements.

The tests verify per-port and per-VLAN control of gratuitous neighbor
announcement forwarding for both IPv4 (gratuitous ARP) and IPv6
(unsolicited NA):
- When neigh_suppress is enabled with neigh_forward_grat off (default),
gratuitous announcements are suppressed
- When neigh_forward_grat is enabled, gratuitous announcements are
forwarded while regular neighbor discovery remains suppressed

For IPv4, use arping to send gratuitous ARP packets. For IPv6, use
mausezahn to craft unsolicited Neighbor Advertisement packets.

For the per-port tests, the IPv4 test exercises the ip link interface,
while the IPv6 test exercises the bridge link interface.
The per-VLAN tests use the bridge interface throughout, as per-VLAN
attributes are only accessible via 'bridge vlan'.

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260511065936.4173106-7-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bridge: Add per-VLAN netlink handling for neigh_forward_grat

Add netlink handlers for the per-VLAN neigh_forward_grat option via
BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT attribute.

The per-VLAN option provides fine-grained control, allowing different
VLANs on the same port to have different gratuitous ARP/unsolicited NA
forwarding behavior.

This enables control via 'bridge' commands:
# bridge vlan set dev eth0 vid 10 neigh_suppress on
# bridge vlan set dev eth0 vid 10 neigh_forward_grat on

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-6-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bridge: Add port-level netlink handling for neigh_forward_grat

Add netlink handlers for the port-level neigh_forward_grat option via
IFLA_BRPORT_NEIGH_FORWARD_GRAT attribute.

The default value of OFF preserves existing behavior, i.e. gratuitous ARP
and unsolicited NA are suppressed when neigh_suppress is enabled. Users can
explicitly set it to ON to allow these packets through.

Example for enabling control via 'bridge link' command:
# bridge link set dev eth0 neigh_suppress on
# bridge link set dev eth0 neigh_forward_grat on

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-5-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bridge: Add selective forwarding of gratuitous neighbor announcements

The existing neighbor suppression unconditionally suppresses gratuitous
ARPs and unsolicited Neighbor Advertisements, which prevents fast
mobility of hosts between VTEPs.

Add the neigh_forward_grat option to allow selective control of gratuitous
neighbor announcements. When neigh_suppress is enabled but
neigh_forward_grat is disabled (default), gratuitous announcements are
suppressed. When neigh_forward_grat is enabled, gratuitous announcements
are forwarded while regular neighbor discovery remains suppressed.

The implementation provides per-output-port control by:
1. Adding a 'grat_arp' flag to BR_INPUT_SKB_CB to mark gratuitous ARPs and
   unsolicited NAs.
2. Setting both grat_arp and proxyarp_replied flags in
   br_do_proxy_suppress_arp() and br_do_suppress_nd() when gratuitous
   packets are detected.
3. Checking neigh_forward_grat per output port during flooding:
   - For gratuitous ARPs/NAs: suppress unless the output port has
     neigh_forward_grat enabled.
   - For regular ARPs/NDs: maintain existing behavior.

This allows gratuitous announcements from any input port to be selectively
forwarded based on each output port's individual neigh_forward_grat
setting, enabling gratuitous neighbor announcements to be flooded to the
VXLAN fabric.

Regular neighbor discovery (ARP requests, NS queries, solicited replies)
remains controlled by neigh_suppress and is unaffected.

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-4-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bridge: Add internal flags for neigh_forward_grat

Add internal flags for the neigh_forward_grat feature:

- BR_NEIGH_FORWARD_GRAT: Port-level flag
- BR_VLFLAG_NEIGH_FORWARD_GRAT_ENABLED: Per-VLAN flag

These will be used to control whether gratuitous ARP and unsolicited NA
packets are forwarded when neighbor suppression is enabled.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-3-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bridge: uapi: Add neigh_forward_grat netlink attributes

Add netlink attributes for controlling gratuitous ARP and unsolicited NA
forwarding when neighbor suppression is enabled.

Add IFLA_BRPORT_NEIGH_FORWARD_GRAT for port-level control and
BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT for per-VLAN control.

The new attributes provide independent control of gratuitous ARP and
unsolicited NA packets. Operators can enable forwarding for those packets
for fast mobility across VTEPs while keeping general neighbor suppression
active.

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260511065936.4173106-2-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vdso/gettimeofday: Reload sequence counter after switch to time page in do_aux()

After switching to the real data pages, the sequence counter needs to be
reloaded from there. The code using vdso_read_begin_timens() assumed
this worked by 'continue' jumping to the *beginning* of the do-while
retry loop. However the 'continue' jumps to the *end* of said loop,
evaluating the exit condition. If the data page has a sequence counter
of '1' it will match the one from the time namespace page and prematurely
exit the retry loop. This would result in garbage returned to the caller.

Reload the sequence counter after switching the pages by using an inner
while loop again, which will loop at most once.

The loop generates slightly better code than an explicit reload through
'seq = vdso_read_begin()'.

Fixes: ed78b7b2c5ae ("vdso/gettimeofday: Add a helper to read the sequence lock of a time namespace aware clock")
Reported-by: Ricardo Ribalda <ribalda@chromium.org>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Ricardo Ribalda <ribalda@chromium.org>
Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Link: https://patch.msgid.link/20260422-vdso-aux-timens-loop-v1-1-e2dd8c7164cc@linutronix.de
Closes: https://lore.kernel.org/lkml/CANiDSCsOy0P1if-gJZqOM5pTJ0RDcwVfru1B7KFbTOEMqjPKJw@mail.gmail.com/

net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot

On the SMC-D client, slot 0 of ini->ism_dev[]/ini->ism_chid[] is
reserved for an SMC-Dv1 device. smc_find_ism_v2_device_clnt()
populates V2 entries starting at index 1, so when no V1 device is
selected slot 0 is left in its kzalloc()'ed state with ism_dev[0] ==
NULL and ism_chid[0] == 0.

smc_v2_determine_accepted_chid() then matches the peer's CHID against
the array starting from index 0 using the CHID alone. A malicious
peer replying to a SMC-Dv2-only proposal with d1.chid == 0 matches
the empty slot, ini->ism_selected becomes 0, and the subsequent
ism_dev[0]->lgr_lock dereference in smc_conn_create() faults at
offsetof(struct smcd_dev, lgr_lock) == 0x68:

  BUG: KASAN: null-ptr-deref in _raw_spin_lock_bh+0x79/0xe0
  Write of size 4 at addr 0000000000000068 by task exploit/144
  Call Trace:
   _raw_spin_lock_bh
   smc_conn_create (net/smc/smc_core.c:1997)
   __smc_connect (net/smc/af_smc.c:1447)
   smc_connect (net/smc/af_smc.c:1720)
   __sys_connect
   __x64_sys_connect
   do_syscall_64

Require ism_dev[i] to be non-NULL before accepting a CHID match.

Fixes: a7c9c5f4af7f ("net/smc: CLC accept / confirm V2")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260511062138.2839584-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-sched-netem-enhancements'

Stephen Hemminger says:

====================
net/sched: netem: enhancements

This is a collection of improvements to netem found while
investigating the fixes now in net tree.

v5 - minor cleanups
- add allocation_errors counter
- align counters with where impairment occurs
====================

Link: https://patch.msgid.link/20260509171123.307549-1-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: netem: add per-impairment extended statistics

Add 64-bit counters for each impairment netem applies (delay, loss,
ECN marking, corruption, duplication, reordering) and for skb
allocation failures during enqueue. Exposed through TCA_STATS_APP
as struct tc_netem_xstats.

Counters increment when an impairment is occurs, independent of later
events that may mask its on-wire effect. Added allocation_errors
(similar to sch_fq) to account for when impairment could not be
applied due to memory pressure, etc.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/20260509171123.307549-6-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: netem: handle multi-segment skb in corruption

The packet corruption code only flipped bits in the linear
header portion of the skb, skipping corruption when
skb_headlen() was zero.

Linearize the whole skb if necessary before corruption.
Extends d64cb81dcbd5 ("net/sched: sch_netem: fix out-of-bounds access
in packet corruption") with a more general solution.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/20260509171123.307549-5-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: netem: replace pr_info with netlink extack error messages

Use netlink extack to report errors instead of sending them
to the kernel log with pr_info(). The error message can them be seen
with tc commands; and avoids log spam.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/20260509171123.307549-4-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: netem: remove useless VERSION

The version printed was never updated and kernel version is
better indication of what is fixed or not.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/20260509171123.307549-3-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: netem: reorder struct netem_sched_data

The current layout of struct netem_sched_data can be improved
by optimizing cache locality, compacting data types (use u8
for enum) and eliminating unused elements.

Reorganize the struct as follows:

  - Cacheline 0 holds the tfifo state (t_root/t_head/t_tail/t_len),
    counter, and the unconditional enqueue scalars
    latency/jitter/rate/gap/loss.

  - Cacheline 1 holds the remaining zero-check scalars
    (duplicate/reorder/corrupt/ecn), all five crndstate correlation
    structures, and loss_model.

  - Cacheline 2 holds prng, delay_dist, the slot dequeue state,
    slot_dist, and the inner classful qdisc pointer.

  - Rate-shaping fields, q->limit (config-only; the fast path reads
    sch->limit), and the CLG Markov state move to the warm tail.

  - tc_netem_slot slot_config and qdisc_watchdog (only consulted on
    slot reschedule and watchdog wake) move to the cold tail.

Also reorder struct clgstate to place the u8 state member after the
u32 transition probabilities.  This removes the 3-byte interior hole
without changing the struct's size.

Should have no functional change.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/20260509171123.307549-2-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

arm_mpam: Check whether the config array is allocated before destroying it

__destroy_component_cfg() is called to free the configuration array.
It uses the embedded 'garbage' structure, which means the array has
to be allocated.

If __destroy_component_cfg() is called from mpam_disable() before the
configuration was ever allocated, then a NULL pointer is dereferenced.

Check for this case and return early if the configuration is not
allocated.

__destroy_component_cfg() also frees the mbwu_state as this is allocated
by __allocate_component_cfg(). As the mbwu_state is allocated after
comp->cfg is set, and is also under mpam_list_lock, only the first
pointer needs checking.

Fixes: 3bd04fe7d807 ("arm_mpam: Extend reset logic to allow devices to be reset any time")
Cc: <stable@vger.kernel.org>
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm_mpam: Fix false positive assert failure during mpam_disable()

mpam_assert_partid_sizes_fixed() is used to document that the caller
doesn't expect the discovered PARTID size to change while it is walking
a list sized by PARTID. Typically the MSC state is not written to until
all the MSC have been discovered and this value is set.

However, if discovering the MSC fails and schedules mpam_disable(),
then the MSC state is written to reset it. In this case the
discovered PARTID size may be become smaller - but only PARTID 0
will be used once resctrl_exit() has been called.

Skip the WARN_ON_ONCE() if mpam_disable_reason has been set.

Fixes: 3bd04fe7d807 ("arm_mpam: Extend reset logic to allow devices to be reset any time")
Cc: <stable@vger.kernel.org>
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm_mpam: Improve check for whether or not NRDY is hardware managed

mpam_ris_hw_probe_csu_nrdy() sets and clears MSMON_CSU.NRDY and checks
whether it's configuration sticks. However, hardware isn't given a chance
to disagree. Based on rule LRTGP, in MPAM specification IHI0099 version
B.b, the hardware will set NRDY if it needs time to establish a count after
a configuration change.

Enable the monitor so that NRDY becomes relevant and change the
configuration after clearing NRDY to try and coax the hardware into setting
it.

Fixes: 8c90dc68a5de ("arm_mpam: Probe the hardware features resctrl supports")
Cc: <stable@vger.kernel.org>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm_mpam: Pretend that NRDY is always hardware managed

Rule ZTXDS of the MPAM specification, IHI009 version B.b, states: "If a
monitor does not support automatic updates of NRDY, software can use that
bit for any purpose."

As software is not reliably informed whether or not the monitor supports
automatic updates of NRDY always assume that hardware may manage NRDY but
don't rely on it. When NRDY is truly untouched by hardware then, as it is
written to 0 on configuration, it will always read 0.

At probe it's checked if MSMON_CSU.NRDY and MSMON_MBWU.NRDY are hardware
managed but not MSMON_MBWU_L.NDRY. Specialize the checking for hardware
managed NRDY to CSU counters as this is the only case where hardware
management makes sense. Continue to inform the user if MSMON_CSU.NRDY
appears to be hardware managed but the firmware doesn't provide the
associated time limit for the automatic clearing of NRDY. Remove the NRDY
feature flags as they are now unused.

Fixes: 8c90dc68a5de ("arm_mpam: Probe the hardware features resctrl supports")
Cc: <stable@vger.kernel.org>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm_mpam: Fix monitor instance selection when checking for hardware NRDY

In _mpam_ris_hw_probe_hw_nrdy() a new register value to select the first
monitor and relevant RIS is prepared in mon_sel. However, it is written to
the monitor value register, e.g. MSMON_CSU, rather than MSMON_CFG_MON_SEL.

As MSMON_CFG_MON_SEL is a 32 bit register update the type of mon_sel to
u32. Write mon_sel to the intended register, MSMON_CFG_MON_SEL.

Fixes: 8c90dc68a5de ("arm_mpam: Probe the hardware features resctrl supports")
Cc: <stable@vger.kernel.org>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

slab: fix kernel-docs for mm-api

The mm-api kernel-docs have been disconnected from their symbols. While
the scripts were previously taught to handle the _noprof suffix added by
allocation tagging (in 51a7bf0238c2 "scripts/kernel-doc: drop "_noprof"
on function prototypes"), this does not handle cases where the internal
implementation function has an additional leading underscore. The added
optional parameters (via DECL_KMALLOC_PARAMS) further complicate parsing
the internal signatures.

When the kernel-doc block remains above the internal implementation
function but uses the public API name, the documentation generator fails
to associate the documented symbol.

Simply moving the docs to the macros in slab.h fixes the association but
causes loss of types in the generated documentation (rendering as e.g.
untyped 'kmalloc(size, flags)' macro).

Fix this by:

1. Moving the kernel-doc comment blocks from slub.c to slab.h, placing
   them directly above the user-facing macros.

2. Providing explicit, typed C prototypes for the documented APIs inside
   '#if 0 /* kernel-doc */' blocks.

3. Converting the variadic macros for the documented APIs to use
   explicit arguments to match the documentation.

No functional change intended.

Signed-off-by: Marco Elver <elver@google.com>
Link: https://patch.msgid.link/20260511200136.3201646-3-elver@google.com
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

slab: improve KMALLOC_PARTITION_RANDOM randomness

When using CONFIG_KMALLOC_PARTITION_RANDOM, _RET_IP_ was previously used
to identify the allocation site. _RET_IP_, however, evaluates to the
caller's parent's instruction pointer rather than the actual allocation
site; this would lead to collisions where a function performs multiple
allocations.

With the generalization to kmalloc_token_t, we now generate the token at
the outermost macro, and using _THIS_IP_ would fix this for all cases.

Unfortunately, the generic implementation of _THIS_IP_ relies on taking
the address of a local label, which is considered broken by both GCC [1]
and Clang [2] because label addresses are only expected to be used with
computed gotos. While the generic version more or less works today, it
is known to be brittle. For example, Clang -O2 always returns 1 when
this function is inlined:

static inline unsigned long get_ip(void)
{ return ({ __label__ __here; __here: (unsigned long)&&__here; }); }

To provide a reliable unique identifier without breaking architectures
relying on the generic _THIS_IP_, introduce _CODE_LOCATION_: it resolves
to _THIS_IP_ where architectures provide a safe implementation, and
falls back to a zero-cost static marker where _THIS_IP_ is broken.

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120071
Link: https://github.com/llvm/llvm-project/issues/138272
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Link: https://patch.msgid.link/20260511200136.3201646-2-elver@google.com
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

slab: support for compiler-assisted type-based slab cache partitioning

Rework the general infrastructure around RANDOM_KMALLOC_CACHES into more
flexible KMALLOC_PARTITION_CACHES, with the former being a partitioning
mode of the latter.

Introduce a new mode, KMALLOC_PARTITION_TYPED, which leverages a feature
available in Clang 22 and later, called "allocation tokens" via
__builtin_infer_alloc_token() [1]. Unlike KMALLOC_PARTITION_RANDOM
(formerly RANDOM_KMALLOC_CACHES), this mode deterministically assigns a
slab cache to an allocation of type T, regardless of allocation site.

The builtin __builtin_infer_alloc_token(<malloc-args>, ...) instructs
the compiler to infer an allocation type from arguments commonly passed
to memory-allocating functions and returns a type-derived token ID. The
implementation passes kmalloc-args to the builtin: the compiler performs
best-effort type inference, and then recognizes common patterns such as
`kmalloc(sizeof(T), ...)`, `kmalloc(sizeof(T) * n, ...)`, but also
`(T *)kmalloc(...)`. Where the compiler fails to infer a type the
fallback token (default: 0) is chosen.

Note: kmalloc_obj(..) APIs fix the pattern how size and result type are
expressed, and therefore ensures there's not much drift in which
patterns the compiler needs to recognize. Specifically, kmalloc_obj()
and friends expand to `(TYPE *)KMALLOC(__obj_size, GFP)`, which the
compiler recognizes via the cast to TYPE*.

Clang's default token ID calculation is described as [1]:

   typehashpointersplit: This mode assigns a token ID based on the hash
   of the allocated type's name, where the top half ID-space is reserved
   for types that contain pointers and the bottom half for types that do
   not contain pointers.

Separating pointer-containing objects from pointerless objects and data
allocations can help mitigate certain classes of memory corruption
exploits [2]: attackers who gains a buffer overflow on a primitive
buffer cannot use it to directly corrupt pointers or other critical
metadata in an object residing in a different, isolated heap region.

It is important to note that heap isolation strategies offer a
best-effort approach, and do not provide a 100% security guarantee,
albeit achievable at relatively low performance cost. Note that this
also does not prevent cross-cache attacks: while waiting for future
features like SLAB_VIRTUAL [3] to provide physical page isolation, this
feature should be deployed alongside SHUFFLE_PAGE_ALLOCATOR and
init_on_free=1 to mitigate cross-cache attacks and page-reuse attacks as
much as possible today.

With all that, my kernel (x86 defconfig) shows me a histogram of slab
cache object distribution per /proc/slabinfo (after boot):

  <slab cache>      <objs> <hist>
  kmalloc-part-15    1465  ++++++++++++++
  kmalloc-part-14    2988  +++++++++++++++++++++++++++++
  kmalloc-part-13    1656  ++++++++++++++++
  kmalloc-part-12    1045  ++++++++++
  kmalloc-part-11    1697  ++++++++++++++++
  kmalloc-part-10    1489  ++++++++++++++
  kmalloc-part-09     965  +++++++++
  kmalloc-part-08     710  +++++++
  kmalloc-part-07     100  +
  kmalloc-part-06     217  ++
  kmalloc-part-05     105  +
  kmalloc-part-04    4047  ++++++++++++++++++++++++++++++++++++++++
  kmalloc-part-03     183  +
  kmalloc-part-02     283  ++
  kmalloc-part-01     316  +++
  kmalloc            1422  ++++++++++++++

The above /proc/slabinfo snapshot shows me there are 6673 allocated
objects (slabs 00 - 07) that the compiler claims contain no pointers or
it was unable to infer the type of, and 12015 objects that contain
pointers (slabs 08 - 15). On a whole, this looks relatively sane.

Additionally, when I compile my kernel with -Rpass=alloc-token, which
provides diagnostics where (after dead-code elimination) type inference
failed, I see 186 allocation sites where the compiler failed to identify
a type (down from 966 when I sent the RFC [4]). Some initial review
confirms these are mostly variable sized buffers, but also include
structs with trailing flexible length arrays.

Link: https://clang.llvm.org/docs/AllocToken.html
Link: https://blog.dfsec.com/ios/2025/05/30/blasting-past-ios-18/
Link: https://lwn.net/Articles/944647/
Link: https://lore.kernel.org/all/20250825154505.1558444-1-elver@google.com/
Link: https://discourse.llvm.org/t/rfc-a-framework-for-allocator-partitioning-hints/87434
Acked-by: GONG Ruiqi <gongruiqi1@huawei.com>
Co-developed-by: Harry Yoo (Oracle) <harry@kernel.org>
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Link: https://patch.msgid.link/20260511200136.3201646-1-elver@google.com
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

xfrm: Reject excessive values for XFRMA_TFCPAD

tfcpad is a u32, but that full range is excessive for padding.
Limit it to max IP length (64k).

Signed-off-by: David Ahern <dahern@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

xfrm: Check for underflow in xfrm_state_mtu

Leo Lin reported OOB write issue in esp component:

  xfrm_state_mtu() returns u32 but performs its arithmetic in unsigned
  modulo-2^32 space using an attacker-influenced "header_len + authsize +
  net_adj" subtracted from a small "mtu" argument. A nobody user can
  install an IPv4 ESP tunnel SA with a large authentication key
  (XFRMA_ALG_AUTH_TRUNC, e.g. hmac(sha512), 64-byte key, 64-byte trunc),
  configure a small interface MTU (68 bytes), and set XFRMA_TFCPAD to a
  large value. When a single UDP datagram is then sent through the
  tunnel, xfrm_state_mtu() underflows to a near-2^32 value, and
  esp_output() consumes it as a signed int via:

        padto      = min(x->tfcpad, xfrm_state_mtu(x, mtu_cached))
        esp.tfclen = padto - skb->len   (assigned to int)

  esp.tfclen ends up negative (e.g. -207). It is sign-extended to size_t
  when passed to memset() inside esp_output_fill_trailer(), producing a
  ~16 EB write of zeroes at skb_tail_pointer(skb). KASAN logs it as
  "Write of size 18446744073709551537 at addr ffff888...".

Check for underflow and return 1. This causes the sendmsg attempt to
fail with ENETUNREACH.

Fixes: c5c252389374 ("[XFRM]: Optimize MTU calculation")
Reported-by: Leo Lin <leo@depthfirst.com>
Assisted-by: Codex:26.506.31004
Signed-off-by: David Ahern <dahern@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

mm/slub: defer freelist construction until after bulk allocation from a new slab

Allocations from a fresh slab can consume all of its objects, and the
freelist built during slab allocation is discarded immediately as a result.

Instead of special-casing the whole-slab bulk refill case, defer freelist
construction until after objects are emitted from a fresh slab.
new_slab() now only allocates the slab and initializes its metadata.
refill_objects() then obtains a fresh slab and lets alloc_from_new_slab()
emit objects directly, building a freelist only for the objects left
unallocated; the same change is applied to alloc_single_from_new_slab().

To keep CONFIG_SLAB_FREELIST_RANDOM=y/n on the same path, introduce a
small iterator abstraction for walking free objects in allocation order.
The iterator is used both for filling the sheaf and for building the
freelist of the remaining objects.

Also mark setup_object() inline. After this optimization, the compiler no
longer consistently inlines this helper in the hot path, which can hurt
performance. Explicitly marking it inline restores the expected code
generation.

This reduces per-object overhead when allocating from a fresh slab.
The most direct benefit is in the paths that allocate objects first and
only build a freelist for the remainder afterward: bulk allocation from
a new slab in refill_objects(), single-object allocation from a new slab
in ___slab_alloc(), and the corresponding early-boot paths that now use
the same deferred-freelist scheme. Since refill_objects() is also used to
refill sheaves, the optimization is not limited to the small set of
kmem_cache_alloc_bulk()/kmem_cache_free_bulk() users; regular allocation
workloads may benefit as well when they refill from a fresh slab.

In slub_bulk_bench, the time per object drops by about 42% to 70% with
CONFIG_SLAB_FREELIST_RANDOM=n, and by about 58% to 69% with
CONFIG_SLAB_FREELIST_RANDOM=y. This benchmark is intended to isolate the
cost removed by this change: each iteration allocates exactly
slab->objects from a fresh slab. That makes it a near best-case scenario
for deferred freelist construction, because the old path still built a
full freelist even when no objects remained, while the new path avoids
that work. Realistic workloads may see smaller end-to-end gains depending
on how often allocations reach this fresh-slab refill path.

Benchmark results (slub_bulk_bench):
Machine: qemu-system-x86 -m 1024M -smp 8 -enable-kvm -cpu host
Kernel: Linux 7.1.0-rc1-next-20260429
Config: x86_64_defconfig
Cpu: 0
Rounds: 20
Total: 256MB

- CONFIG_SLAB_FREELIST_RANDOM=n -

obj_size=16, batch=256:
before: 5.44 +- 0.07 ns/object
after: 3.12 +- 0.03 ns/object
delta: -42.6%

obj_size=32, batch=128:
before: 7.57 +- 0.32 ns/object
after: 3.79 +- 0.07 ns/object
delta: -49.9%

obj_size=64, batch=64:
before: 11.27 +- 0.09 ns/object
after: 4.83 +- 0.06 ns/object
delta: -57.2%

obj_size=128, batch=32:
before: 19.38 +- 0.13 ns/object
after: 6.43 +- 0.08 ns/object
delta: -66.8%

obj_size=256, batch=32:
before: 23.59 +- 0.18 ns/object
after: 6.97 +- 0.07 ns/object
delta: -70.5%

obj_size=512, batch=32:
before: 21.06 +- 0.14 ns/object
after: 7.12 +- 0.17 ns/object
delta: -66.2%

- CONFIG_SLAB_FREELIST_RANDOM=y -

obj_size=16, batch=256:
before: 9.42 +- 0.11 ns/object
after: 4.36 +- 0.19 ns/object
delta: -53.7%

obj_size=32, batch=128:
before: 12.19 +- 0.62 ns/object
after: 4.93 +- 0.07 ns/object
delta: -59.6%

obj_size=64, batch=64:
before: 17.01 +- 0.73 ns/object
after: 6.14 +- 0.12 ns/object
delta: -63.9%

obj_size=128, batch=32:
before: 23.71 +- 1.10 ns/object
after: 8.35 +- 0.18 ns/object
delta: -64.8%

obj_size=256, batch=32:
before: 29.20 +- 0.35 ns/object
after: 9.44 +- 1.32 ns/object
delta: -67.7%

obj_size=512, batch=32:
before: 29.35 +- 0.79 ns/object
after: 9.21 +- 0.34 ns/object
delta: -68.6%

Link: https://github.com/HSM6236/slub_bulk_test.git
Suggested-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Hao Li <hao.li@linux.dev>
Tested-by: Hao Li <hao.li@linux.dev>
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Link: https://patch.msgid.link/202604302204413066CxdJnJ3RAGH_7iE4EBIO@zte.com.cn
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

powerpc/time: Remove redundant preempt_disable|enable() calls from arch_irq_work_raise()

A kernel panic is observed when handling machine check exceptions from
real mode.

  BUG: Unable to handle kernel data access on read at 0xc00000006be21300
  Oops: Kernel access of bad area, sig: 11 [#1]
  MSR:  8000000000001003 <SF,ME,RI,LE>  CR: 88222248  XER: 00000005
  CFAR: c00000000003ffc4 DAR: c00000006be21300 DSISR: 40000000 IRQMASK: 0
  NIP [c000000000029e40] arch_irq_work_raise+0x10/0x70
  LR [c00000000003ffc8] machine_check_queue_event+0xa8/0x150
  Call Trace:
  [c0000000179d3c70] [c00000000003ff64] machine_check_queue_event+0x44/0x150
  [c0000000179d3d30] [c0000000000084e0] machine_check_early_common+0x1f0/0x2c0

The crash occurs because arch_irq_work_raise() calls preempt_disable()
from machine check exception (MCE) handlers running in real mode. In
this context, accessing the preempt_count can fault, leading to the panic.

The preempt_disable()/preempt_enable() pair in arch_irq_work_raise()
was originally added by commit 0fe1ac48bef0 ("powerpc/perf_event: Fix
oops due to perf_event_do_pending call") to avoid races while raising
irq work from exception context.

Later, commit 471ba0e686cb ("irq_work: Do not raise an IPI when
queueing work on the local CPU") added preemption protection in
irq_work_queue() path, while commit 20b876918c06 ("irq_work: Use per
cpu atomics instead of regular atomics") added equivalent
protection in irq_work_queue_on() before reaching arch_irq_work_raise():

  irq_work_queue() / irq_work_queue_on()
    -> preempt_disable()
      -> __irq_work_queue_local()
        -> irq_work_raise()
          -> arch_irq_work_raise()

As a result, callers other than mce_irq_work_raise() already execute
with preemption disabled, making the additional
preempt_disable()/preempt_enable() pair in arch_irq_work_raise()
redundant.

The arch_irq_work_raise() function executes in NMI context when called
from MCE handler. Hence we will not be preempted or scheduled out since
we are in NMI context with MSR[EE]=0. Therefore, it is safe to remove
the preempt_disable()/preempt_enable() calls from here.

Remove it to avoid accessing preempt_count from real mode context.

Fixes: cc15ff327569 ("powerpc/mce: Avoid using irq_work_queue() in realmode")
Suggested-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Sayali Patil <sayalip@linux.ibm.com>
[Maddy: Fixed the commit title]
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260513081413.222490-1-sayalip@linux.ibm.com

Input: synaptics - add LEN2058 to SMBus passlist for ThinkPad E490

The Lenovo ThinkPad E490 (PNP ID: LEN2058) has a Synaptics TM3471-020
touchpad that supports SMBus/RMI4 mode but is not listed in
smbus_pnp_ids[]. Without this entry, RMI4 over SMBus is not enabled
by default, and the touchpad falls back to PS/2 mode.

Adding LEN2058 to the passlist enables automatic RMI4 detection without
requiring the psmouse.synaptics_intertouch parameter, and matches
the behavior of similar ThinkPad models already in the list
(E480/LEN2054, E580/LEN2055).

Tested on ThinkPad E490 with kernel 7.0.5-zen1 and Arch Linux.
RMI4 over SMBus is confirmed working without any kernel parameters.

Signed-off-by: Nicolás Bazaes <contacto@bazaes.cl>
Assisted-by: Claude:claude-sonnet-4-6
Link: https://patch.msgid.link/20260514013552.14234-1-contacto@bazaes.cl
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

riscv: misaligned: Make enabling delegation depend on NONPORTABLE

The unaligned access emulation code in Linux has various deficiencies.
For example, it doesn't emulate vector instructions [1] [2], and doesn't
emulate KVM guest accesses. Therefore, requesting misaligned exception
delegation with SBI FWFT actually regresses vector instructions' and KVM
guests' behavior.

Until Linux can handle it properly, guard these sbi_fwft_set() calls
behind RISCV_SBI_FWFT_DELEGATE_MISALIGNED, which in turn depends on
NONPORTABLE. Those who are sure that this wouldn't be a problem can
enable this option, perhaps getting better performance.

The rest of the existing code proceeds as before, except as if
SBI_FWFT_MISALIGNED_EXC_DELEG is not available, to handle any remaining
address misaligned exceptions on a best-effort basis. The KVM SBI FWFT
implementation is also not touched, but it is disabled if the firmware
emulates unaligned accesses.

Cc: stable@vger.kernel.org
Fixes: cf5a8abc6560 ("riscv: misaligned: request misaligned exception from SBI")
Reported-by: Songsong Zhang <U2FsdGVkX1@gmail.com> # KVM
Link: https://lore.kernel.org/linux-riscv/38ce44c1-08cf-4e3f-8ade-20da224f529c@iscas.ac.cn/
Link: https://lore.kernel.org/linux-riscv/b3cfcdac-0337-4db0-a611-258f2868855f@iscas.ac.cn/
Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/20260401-riscv-misaligned-dont-delegate-v2-1-5014a288c097@iscas.ac.cn
Signed-off-by: Paul Walmsley <pjw@kernel.org>

riscv: Docs: fix unmatched quote warning

'make htmldocs' complains about ``prctrl` -- so add a second '`' to
avoid the warning.

Documentation/arch/riscv/zicfilp.rst:79: WARNING: Inline literal start-string without end-string. [docutils]

Fixes: 08ee1559052b ("prctl: cfi: change the branch landing pad prctl()s to be more descriptive")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20260406232304.1892528-1-rdunlap@infradead.org
Signed-off-by: Paul Walmsley <pjw@kernel.org>

Merge tag 'amd-drm-next-7.2-2026-05-13' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-7.2-2026-05-13:

amdgpu:
- Userq fixes
- DCN 3.2 fix
- RAS fixes
- GC 12 fixes
- Add PTL support for profiler
- SMU multi-msg helpers
- OLED fix
- Misc cleanups
- DC aux transfer refactor
- Introduce dc_plane_cm and migrate surface update color path
- IPS fixes
- DCN 4.2 updates
- SR-IOV fixes
- Add FRL registers for HDMI 2.1
- NBIO 7.11.4 updates
- VPE 2.0 support
- Aldebaran SMU update

amdkfd:
- Add profiler API

UAPI:
- Add profiler IOCTL
Userspace: https://github.com/ROCm/rocm-systems/commit/40abc95a6463a61bb318a67efd6d9cc3e5ee8839

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260513232911.41274-1-alexander.deucher@amd.com

io_uring: validate user-controlled cq.head in io_cqe_cache_refill()

A fuzzing run reproduced an unkillable io_uring task stuck at ~100% CPU:

[root@fedora io_uring_stress]# ps -ef | grep io_uring
root 1240 1 99 13:36 ? 00:01:35 [io_uring_stress] <defunct>

The task loops inside io_cqring_wait() and never returns to userspace,
and SIGKILL has no effect.

This is caused by the CQ ring exposing rings->cq.head to userspace as
writable, while the authoritative tail lives in kernel-private
ctx->cached_cq_tail. io_cqe_cache_refill() computes free space as an
unsigned subtraction:

free = ctx->cq_entries - min(tail - head, ctx->cq_entries);

If userspace keeps head within [0, tail], the subtraction is well
defined and min() just acts as a defensive clamp. But if userspace
advances head past tail, (tail - head) wraps to a huge value, free
becomes 0, and io_cqe_cache_refill() fails. The CQE is pushed onto the
overflow list and IO_CHECK_CQ_OVERFLOW_BIT is set.

The wait loop in io_cqring_wait() relies on an invariant: refill() only
fails when the CQ is *physically* full, in which case rings->cq.tail has
been advanced to iowq->cq_tail and io_should_wake() returns true. The
tampered head breaks this: refill() fails while the ring is not full, no
OCQE is copied in, rings->cq.tail never catches up, io_should_wake()
stays false, and io_cqring_wait_schedule() keeps returning early because
IO_CHECK_CQ_OVERFLOW_BIT is still set. The result is a tight retry loop
that never returns to userspace.

Introduce io_cqring_queued() as the single point that converts the
(tail, head) pair into a trustworthy queued count. Since the real
head/tail distance is bounded by cq_entries (far below 2^31), a signed
comparison reliably detects userspace moving head past tail; in that
case treat the queue as empty so callers see the full cache as free and
forward progress is preserved.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://patch.msgid.link/20260514021847.4062782-1-wozizhi@huaweicloud.com
[axboe: fixup commit message, kill 'queued' var, and keep it all in
io_uring.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'net-sched-refine-fq_codel-memory-limits'

Eric Dumazet says:

====================
net/sched: refine fq_codel memory limits

Packets that are associated with local sockets sk_wmem_alloc
do not really need additional memory control.

First patch makes is_skb_wmem() available to modules.

Second patch uses is_skb_wmem() in fq_codel.
====================

Link: https://patch.msgid.link/20260512094859.3673997-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: fq_codel: local packets no longer count against memory limit

Commit 95b58430abe7 ("fq_codel: add memory limitation per queue")
claimed that the 32Mb default was "reasonable even for heavy duty usages."

In practice, this is not the case.

Packets that are associated with local sockets sk_wmem_alloc
do not really need additional memory control.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260512094859.3673997-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: make is_skb_wmem() available to modules

Following patch will use is_skb_wmem() from fq_codel.

Provide __sock_wfree() only if CONFIG_INET=y

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260512094859.3673997-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mlx5e-improve-rss-indirection-table-sizing-and-resizing'

Tariq Toukan says:

====================
net/mlx5e: improve RSS indirection table sizing and resizing

This series by Yael improves mlx5e RSS indirection table handling around
channel count changes and large RSS configurations.

The series:
* removes the XOR8-specific channel count limitation,
* advertises the maximum supported RSS indirection table size,
* fixes resizing of non-default RSS contexts,
* allows resizing configured default RSS contexts during channel
changes,
* and increases the default RSS spread factor from 2x to 4x to improve
traffic distribution for large channel counts.

Together, these changes make RSS table sizing more flexible and robust,
while improving load balancing behavior on large systems.
====================

Link: https://patch.msgid.link/20260511172719.330490-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: increase RSS indirection table spread factor

Increase the RQT uniform spread factor from 2 to 4 so that each channel
gets more indirection table entries and traffic is spread more evenly.
For num_channels > 64 imbalance drops from up to ~50% to up to ~25%.
For 64 or fewer channels the 256 entry minimum already provides at least
4x coverage and the table size is unchanged by this commit.

This satisfies the minimum 4x coverage requirement validated by the
generic RSS selftest commit 9e3d4dae9832 ("selftests: drv-net: rss:
validate min RSS table size").

The 4x spread factor is best-effort and the table size is always capped by
the device's log_max_rqt_size capability.

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: resize configured default RSS context table on channel change

mlx5e_ethtool_set_channels() rejected channel count changes that
required a different RQT size when the default context indirection
table was user-configured. This restriction was introduced by
commit ee3572409f74 ("net/mlx5e: RSS, Block changing channels number
when RXFH is configured").

Lift the restriction. Validate the resize upfront with
ethtool_rxfh_indir_can_resize(), then fold or unfold the table
in-place via ethtool_rxfh_indir_resize() inside state_lock, before
mlx5e_safe_switch_params(), so the preactivate callback sees the
correct table content when it programs the HW.

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: resize non-default RSS indirection tables on channel change

When the channel count changes and the RQT size changes with it, a
problem arise for non-default RSS contexts. The driver-side indirection
table grows actual_table_size without filling the new entries; stale
entries from a prior larger configuration may be re-exposed, causing
mlx5e_calc_indir_rqns() to WARN on an out-of-range index.

Replace mlx5e_rss_params_indir_modify_actual_size() with
mlx5e_rss_ctx_resize(), which fills new entries by replicating
the existing pattern, matching what ethtool_rxfh_ctxs_resize() does
for the same case. And restrict the loop to non-default contexts.

Call ethtool_rxfh_ctxs_can_resize() before acquiring state_lock to
validate that all non-default contexts can be resized, and
ethtool_rxfh_ctxs_resize() after releasing it to fold or unfold their
indirection tables. Both functions acquire rss_lock internally and
cannot be called under state_lock. RTNL, held by all set_channels
callers, serialises context creation and deletion making the pre-lock
check safe.

Guard both ethtool calls on mlx5e_rx_res_rss_cnt() > 1: skip the
validation and resize when no non-default contexts exist. This
naturally covers representors and IPoIB, which share
mlx5e_ethtool_set_channels() but cannot have non-default RSS contexts.

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: advertise max RSS indirection table size to ethtool

Set rxfh_indir_space to the maximum indirection table size the driver
can support: the next power of two above MLX5E_MAX_NUM_CHANNELS times
MLX5E_UNIFORM_SPREAD_RQT_FACTOR.

Without this, ethtool_rxfh_ctxs_can_resize() returns -EINVAL, blocking
non-default RSS contexts from tracking indirection table size changes
when the channel count changes.

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: remove channel count limit for XOR8 RSS hash

mlx5e_ethtool_set_channels() and mlx5e_rxfh_hfunc_check() rejected
channel counts that would produce an indirection table larger than 256
entries when the XOR8 hash function was active. This check was
introduced in commit 49e6c9387051 ("net/mlx5e: RSS, Block XOR hash
with over 128 channels").

XOR8 yields an 8-bit hash, so in practice only up to 256 entries in the
indirection table can be reached due to limited entropy. However, this
does not provide a strong justification for prohibiting larger
indirection tables. Remove the limitation.

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'amd-drm-fixes-7.1-2026-05-13' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes

amd-drm-fixes-7.1-2026-05-13:

amdgpu:
- Userq fixes
- DCN 3.2 fix
- RAS fix
- GC 12 fix

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260513224053.40670-1-alexander.deucher@amd.com

Merge branch 'macsec-use-rcu_work-to-fix-crypto-cleanup-in-softirq-context'

Jinliang Zheng says:

====================
macsec: use rcu_work to fix crypto cleanup in softirq context

From: Jinliang Zheng <alexjlzheng@tencent.com>

crypto_free_aead() can internally call vunmap() (e.g. via dma_free_attrs()
in hardware crypto drivers like hisi_sec2), which must not be invoked from
softirq context. Both free_rxsa() and free_txsa() are RCU callbacks that
run in softirq, causing a kernel crash on affected hardware.

This series fixes the issue by deferring the actual cleanup to a workqueue
using rcu_work, which combines the RCU grace period and workqueue dispatch
into a single primitive.

Two design decisions worth noting:

1. rcu_work instead of schedule_work() + synchronize_rcu()

   An alternative would be to call schedule_work() directly from
   macsec_rxsa_put()/macsec_txsa_put(), then call synchronize_rcu() at
   the start of the work handler to replace the grace period previously
   provided by call_rcu(). However, synchronize_rcu() blocks the worker
   thread for the duration of a full RCU grace period. Under high SA
   churn (e.g. tearing down an interface with many SAs), each SA would
   occupy a worker thread while waiting, and multiple concurrent calls
   cannot share the same grace period — leading to unnecessary latency
   and resource waste.

   rcu_work uses call_rcu_hurry() internally, which is fully asynchronous:
   the worker thread is only dispatched after the grace period has elapsed,
   and multiple concurrent queue_rcu_work() calls naturally batch under the
   same grace period via the RCU subsystem's existing coalescing mechanism.

2. Dedicated workqueue instead of system_wq

   Using a dedicated workqueue (macsec_wq) allows macsec_exit() to drain
   exactly the work items belonging to this module — by calling
   destroy_workqueue() after rcu_barrier(). If system_wq were used,
   flush_scheduled_work() would drain all pending work items across the
   entire system, creating unnecessary coupling with unrelated subsystems
   and potentially causing unexpected delays. The dedicated workqueue
   provides a clean, contained teardown path.
====================

Link: https://patch.msgid.link/20260511153102.2640368-1-alexjlzheng@tencent.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

macsec: use rcu_work to defer TX SA crypto cleanup out of softirq

free_txsa() is an RCU callback running in softirq context, but calls
crypto_free_aead() which can invoke vunmap() internally on hardware
crypto drivers (e.g. hisi_sec2), triggering a kernel crash.

Use rcu_work to defer the cleanup to a workqueue, for the same reasons
as the analogous fix to free_rxsa() in the previous patch.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260511153102.2640368-4-alexjlzheng@tencent.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

macsec: use rcu_work to defer RX SA crypto cleanup out of softirq

crypto_free_aead() can internally invoke vunmap() (e.g. via
dma_free_attrs() in hardware crypto drivers such as hisi_sec2).
vunmap() must not be called from softirq context, but free_rxsa()
is an RCU callback that runs in softirq, leading to a kernel crash:

  vunmap+0x4c/0x70
  __iommu_dma_free+0xd0/0x138
  dma_free_attrs+0xf4/0x100
  sec_aead_exit+0x64/0xb8 [hisi_sec2]
  crypto_destroy_tfm+0x98/0x110
  free_rxsa+0x28/0x50 [macsec]
  rcu_do_batch+0x184/0x460
  rcu_core+0xf4/0x1f8
  handle_softirqs+0x118/0x330

Use rcu_work to defer the cleanup to a workqueue. rcu_work dispatches
the worker asynchronously after the RCU grace period, so no thread
blocks waiting, and concurrent releases of multiple SAs naturally
share the same grace period.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260511153102.2640368-3-alexjlzheng@tencent.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

macsec: introduce dedicated workqueue for SA crypto cleanup

Introduce a dedicated ordered workqueue, macsec_wq, which will be used
by subsequent patches to defer SA crypto cleanup (crypto_free_aead and
related teardown) out of softirq context.

Using a dedicated workqueue instead of system_wq allows macsec_exit()
to drain exactly the work items belonging to this module via
destroy_workqueue(), without interfering with unrelated work items on
system_wq or causing unexpected delays elsewhere.

rcu_barrier() in macsec_exit() ensures all in-flight rcu_work callbacks
have enqueued their work items before destroy_workqueue() drains and
destroys the queue, making the two-step teardown correct and complete.
The same sequence is kept in the error path of macsec_init() as a
precaution, to mirror macsec_exit() and stay safe if work ever becomes
queueable before this point in the future.

While at it, rename the error labels in macsec_init() from the
resource-named style (rtnl:, notifier:, wq:) to the err_xxx: style
(err_rtnl:, err_notifier:, err_destroy_wq:) to align with the broader
kernel convention.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260511153102.2640368-2-alexjlzheng@tencent.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: net_failover: Fix the deadlock in slave register

There is netdev_lock_ops() before the NETDEV_REGISTER notifier
in register_netdevice(), so use the non-locking functions
in net_failover_slave_register().
failover_slave_register() in failover_existing_slave_register() adds lock
and unlock ops too.

Call Trace:
<TASK>
__schedule+0x30d/0x7a0
schedule+0x27/0x90
schedule_preempt_disabled+0x15/0x30
__mutex_lock.constprop.0+0x538/0x9e0
__mutex_lock_slowpath+0x13/0x20
mutex_lock+0x3b/0x50
dev_set_mtu+0x40/0xe0
net_failover_slave_register+0x24/0x280
failover_slave_register+0x103/0x1b0
failover_event+0x15e/0x210
? dropmon_net_event+0xac/0xe0
notifier_call_chain+0x5e/0xe0
raw_notifier_call_chain+0x16/0x30
call_netdevice_notifiers_info+0x52/0xa0
register_netdevice+0x5f4/0x7c0
register_netdev+0x1e/0x40
_mlx5e_probe+0xe2/0x370 [mlx5_core]
mlx5e_probe+0x59/0x70 [mlx5_core]
? __pfx_mlx5e_probe+0x10/0x10 [mlx5_core]

Fixes: 4c975fd70002 ("net: hold instance lock during NETDEV_REGISTER/UP")
Signed-off-by: Faicker Mo <faicker.mo@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'drm-intel-fixes-2026-05-13' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes

- Skip __i915_request_skip() for already signaled requests (Sebastian Brzezinka)
- Fix VSC dynamic range signaling for RGB formats [dp] (Chaitanya Kumar Borah)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Tvrtko Ursulin <tursulin@igalia.com>
Link: https://patch.msgid.link/agSVZmNC_qV4G6jQ@linux

Merge branch 'bpf-maximum-combined-stack-depth'

Paul Chaignon says:

====================
bpf: Maximum combined stack depth

This patchset dumps the maximum combined stack depth in verifier logs
and parses it in veristat.

Changes in v3:
  - Increment spec_cnt field in veristat for new MAX_STACK id (AI bot).
Changes in v2:
  - Remove unnecessary max_stack_depth assignment (Eduard).
  - Fix and test incorrect handling of private stacks.
  - Add veristat metric (Eduard).
====================

Link: https://patch.msgid.link/cover.1778700777.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

veristat: Report max stack depth

This patch adds a new "Max stack depth" field to the set of gathered
statistics. This field reports the maximum combined stack depth compared
to the 512 bytes limit. It is null for rejected programs.

Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/a27ed8f336669152c4b1b05e920aee4438e3e2b3.1778700777.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Test reported max stack depth

This patch tests the maximum stack depth reporting in verifier logs,
with a couple special cases covered: fastcall, private stacks (main
subprog & callee), and rounding up to 16 bytes. For that last one, we
need to skip the test when JIT compilation is disabled as the rounding
is then to 32 bytes.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/075d22efd4338385a92f13b7817025cc3f04ec60.1778700777.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Report maximum combined stack depth

We've hit the 512 bytes limit on stack depth a few times in Cilium
recently. As a result, we started reporting in CI our current maximum
stack depth across all configurations for each BPF program.

Unfortunately, that is not trivial to compute in userspace. The
verifier reports the stack depths of individual subprogs at the end of
the logs. However the maximum combined stack depth also depends on the
callgraph of those subprogs (the max combined stack depth is the height
of the callgraph weighted by per-subprog stack depths). We can compute
a callgraph in userspace from the loaded instructions, but it often
doesn't match the verifier's own callgraph because of dead code
elimination. Our current approach relies on dumping the BPF_LOG_LEVEL2
logs, but this feels overkill considering the verifier already has the
information we need.

The patch lets the verifier dump the maximum combined stack depth in
the logs, on the same line as the per-subprog stack depths:

stack depth 16+256 max 272

The per-subprog stack depths and the new max stack depth are not
directly comparable. The former is sometimes updated during fixups,
while the latter is not. As a result, even with a single subprog, we
may end up with two slightly different values. The aim of the new max
value is to be closest to what is actually enforced by the verifier.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/d3d23a0410f87f116f3bbaa98a815dbae113bda2.1778700777.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

MAINTAINERS: update atlantic driver maintainer

Igor Russkikh and Egor Pomozov have left Marvell.
Take over maintenance of the atlantic driver and its PTP subsystem.

Signed-off-by: Sukhdeep Singh <sukhdeeps@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'netpoll-move-out-netconsole-specific-functions'

Breno Leitao says:

====================
netpoll: move out netconsole-specific functions

netpoll and netconsole were created together and their code has
been intermixed in net/core/netpoll.c for decades. The result is
that netpoll exposes two send-side interfaces:

* a generic "give me an sk_buff" path used by every stacked-device
driver (bonding, team, vlan, bridge, macvlan, dsa),

* a second path that takes raw bytes and builds a UDP/IP/Ethernet
packet -- exclusively for netconsole.

The packet builder, an skb pool allocator, and several
netconsole-specific helpers all live next to the generic plumbing even
though no other consumer ever touches them.

Worse, every netpoll user pays for that overlap: struct netpoll carries
an skb_pool and a refill work_struct that only netconsole's find_skb()
ever reads from, and net-core has to review unrelated changes (TTL, hop
limit, IP ID generation, source MAC selection, pool sizing) just because
they happen to be coded inside netpoll.

This is a waste of memory for something useless.

This series splits the netconsole-specific code out:

* netpoll_send_udp() and its private helpers (push_ipv6, push_ipv4,
push_eth, push_udp, netpoll_udp_checksum, find_skb) move into
drivers/net/netconsole.c, leaving netpoll with a single skb-only
send interface that is the same for every user.

The moves are one function per patch for reviewability; helpers are
temporarily EXPORT_SYMBOL_GPL'd while netpoll_send_udp() is still in
netpoll calling them, then those exports are dropped together once
netpoll_send_udp() itself moves.

The only new permanent export is zap_completion_queue(), needed because
find_skb() still drains the per-CPU TX completion queue before
allocating.

struct netpoll is unchanged in this series; making the pool itself
netconsole-private (and reclaiming the skb_pool / refill_wq fields for
the rest of netpoll's users) is the natural follow-up, once this patchset
lands.
====================

Link: https://patch.msgid.link/20260512-netconsole_split-v2-0-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move find_skb() from netpoll

find_skb() is the netconsole-specific entry into the netpoll skb
pool: every other netpoll consumer (bonding, team, vlan, bridge,
macvlan, dsa) builds its own sk_buff and never touches the pool.
With netpoll_send_udp() (its only caller) now living in netconsole,
find_skb() can join it.

Move find_skb() into drivers/net/netconsole.c as a file-static
helper, drop EXPORT_SYMBOL_GPL(find_skb) and remove its prototype
from include/linux/netpoll.h. find_skb() drains TX completions via
netpoll_zap_completion_queue(), which is already exported in the
NETDEV_INTERNAL namespace, so netconsole picks up
MODULE_IMPORT_NS("NETDEV_INTERNAL") to consume it.

The skb pool's lifecycle (np->skb_pool, np->refill_wq, refill_skbs(),
refill_skbs_work_handler(), skb_pool_flush()) stays in netpoll: it
is initialised in __netpoll_setup() and torn down in
__netpoll_cleanup(), both of which remain netpoll's responsibility.
The refill work queued via schedule_work(&np->refill_wq) from the
moved find_skb() runs refill_skbs_work_handler() in netpoll without
any further plumbing.

This is pure code motion: the function body is unchanged and its
sole caller (netpoll_send_udp(), already moved by an earlier patch)
keeps invoking it the same way. Pre-existing concerns about
find_skb() running from NMI/printk context (zap_completion_queue()
re-entry, skb_pool spinlocks, GFP_ATOMIC allocation, fallback skb
sizing vs. MAX_SKB_SIZE, PREEMPT_RT semantics of __kfree_skb()) are
inherited as-is and are not addressed here; they predate this
series and are out of scope. Fixing them is left for follow-up
work.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-9-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netpoll: rename and export netpoll_zap_completion_queue()

zap_completion_queue() drains the per-CPU softnet completion queue.
Rename it with the netpoll_ prefix shared by the rest of the
subsystem's public API, and promote it from file-static to
EXPORT_SYMBOL_NS_GPL in the NETDEV_INTERNAL namespace so the upcoming
netconsole-side find_skb() can call it once the function moves out.
A forward declaration is added to include/linux/netpoll.h, and the
old file-static forward declaration is dropped.

No functional change.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-8-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move netpoll_udp_checksum() from netpoll

netpoll_udp_checksum() computes the UDP checksum for netconsole's
packets. Move it into drivers/net/netconsole.c as a file-static
helper; drop its EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.

This was the last csum_ipv6_magic() consumer in net/core/netpoll.c,
so drop the now-stale <net/ip6_checksum.h> include there. Pull it
into netconsole.c so the moved code keeps building.

It was also the last udp_hdr() consumer in net/core/netpoll.c. The
file no longer needs anything from <net/udp.h> (the UDP socket-layer
helpers); MAX_SKB_SIZE only needs struct udphdr, which is provided
by the lighter <linux/udp.h>. Swap the include accordingly.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-7-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move push_udp() from netpoll

push_udp() builds the UDP header (and triggers the checksum) for
netconsole's UDP packets. Move it into drivers/net/netconsole.c as
a file-static helper; drop its EXPORT_SYMBOL_GPL and remove the
prototype from include/linux/netpoll.h.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-6-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move push_eth() from netpoll

push_eth() builds the Ethernet header for netconsole's UDP packets.
Move it into drivers/net/netconsole.c as a file-static helper; drop
its EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-5-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move push_ipv4() from netpoll

push_ipv4() builds the IPv4 header for netconsole's UDP packets.
Move it into drivers/net/netconsole.c as a file-static helper; drop
its EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.

put_unaligned() is no longer used in net/core/netpoll.c, so drop
the now-stale <linux/unaligned.h> include from there. Pull it into
netconsole.c so the moved code keeps building.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-4-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move push_ipv6() from netpoll

push_ipv6() builds the IPv6 header for netconsole's UDP packets.
Its only caller, netpoll_send_udp(), now lives in netconsole, so
the helper can move there as a file-static function. Drop its
EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-3-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move netpoll_send_udp() from netpoll

Move netpoll_send_udp() from net/core/netpoll.c into
drivers/net/netconsole.c as a static helper, drop EXPORT_SYMBOL(),
and remove the prototype from include/linux/netpoll.h.

netconsole was the only in-tree caller of this entry point. Every
other netpoll consumer (bonding, team, vlan, bridge, macvlan, dsa)
already builds its own sk_buff and hands it to netpoll_send_skb(),
so the netpoll send-side interface is now skb-only.

The helpers it depends on (find_skb(), push_ipv6(), push_ipv4(),
push_udp(), push_eth(), netpoll_udp_checksum()) were exposed in
the previous patches and stay in net/core/netpoll.c for now.
Subsequent patches move each of them into netconsole one at a time
and drop the corresponding EXPORT_SYMBOL_GPL.

Pull <linux/ip.h>, <linux/ipv6.h> and <linux/udp.h> into netconsole.c
so the moved code can name the header structures.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-2-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netpoll: expose UDP packet builder helpers for netconsole

Promote each from file-static to EXPORT_SYMBOL_GPL and forward-
declare them in include/linux/netpoll.h so netconsole can call
them once netpoll_send_udp() moves out.

These exports are kept until the end of the series, when
al of them move into netconsole.

No functional change.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-1-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

riscv: cfi: reduce shadow stack size limit from 4GB to 2GB

Follow the ARM64 GCS (Guarded Control Stack) implementation approach
by reducing the shadow stack size allocation from min(RLIMIT_STACK, 4GB)
to min(RLIMIT_STACK/2, 2GB). See commit 506496bcbb42 ("arm64/gcs: Ensure
that new threads have a GCS")

Rationale:

1. Shadow stacks only store return addresses (8 bytes per entry), not
   local variables, function parameters, or saved registers. A 2GB
   shadow stack is far more than sufficient for any practical
   application, even with extremely deep recursion. Using half the size
   maintains adequate margin while being more resource-efficient.

2. On memory-constrained systems (e.g., platforms with only 4GB of
   physical memory, which is a common configuration), allocating 4GB
   of virtual address space for shadow stack per process/thread can
   lead to virtual memory allocation failures when the overcommit mode
   is set to OVERCOMMIT_GUESS or OVERCOMMIT_NEVER:
   Error: "__vm_enough_memory: not enough memory for the allocation"

This reduces virtual address space consumption by 50% while maintaining
more than adequate space for return address storage.

Signed-off-by: Zong Li <zong.li@sifive.com>
Link: https://patch.msgid.link/20260428024105.645162-1-zong.li@sifive.com
[pjw@kernel.org: clean up patch description]
Signed-off-by: Paul Walmsley <pjw@kernel.org>

net: usb: pegasus: replace simple_strtoul with kstrtouint

simple_strtoul() is deprecated as it has no error checking. Replace it
with kstrtouint() which returns an error code on invalid input, and add
appropriate error handling.

Also add a NULL check before parsing flags, since strsep() can set id
to NULL if the input has fewer tokens than expected.

Preserve the original behavior for a trailing colon by checking *id
before parsing flags, so an empty string results in flags = 0 rather
than an error.

Signed-off-by: Sajal Gupta <sajal2005gupta@gmail.com>
Link: https://patch.msgid.link/20260509095518.2640-1-sajal2005gupta@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipvlan: use netif_receive_skb() in ipvlan_process_multicast()

ipvlan_process_multicast() runs from process context, there is no
risk of stack overflow if we call netif_receive_skb() instead
of netif_rx().

This avoids some overhead adding/removing skbs to/from a per-cpu
backlog and raising/processing NET_RX softirqs.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260512042019.3300975-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/tc-testing: Add QFQ/CBS qlen underflow test

Since CBS was not calling reset for its child qdisc, there are scenarios
where it could cause an underflow on its parent's qlen/backlog. When the
parent is QFQ, a null-ptr deref could occur.

Add a test case that reproduces the underflow followed by a null-ptr
deref scenario.

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: sch_cbs: Call qdisc_reset for child qdisc

During a reset, CBS is not calling reset on its child qdisc, which
might cause qlen/backlog accounting issues. For example, if we have CBS
with a QFQ parent and a netem child with delay, we can create a scenario
where the parent's qlen underflows. QFQ, specifically, uses qlen to
check whether it should deference a pointer, so this scenario may cause
a null-ptr deref in QFQ:

[   43.875639][  T319] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] SMP KASAN NOPTI
[   43.876124][  T319] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
[   43.876417][  T319] CPU: 10 UID: 0 PID: 319 Comm: ping Not tainted 7.0.0-13039-ge728258debd5 #773 PREEMPT(full)
[   43.876751][  T319] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   43.876949][  T319] RIP: 0010:qfq_dequeue+0x35c/0x1650
[   43.877123][  T319] Code: 00 fc ff df 80 3c 02 00 0f 85 17 0e 00 00 4c 8d 73 48 48 89 9d b8 02 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 76 0c 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b
[   43.877648][  T319] RSP: 0018:ffff8881017ef4f0 EFLAGS: 00010216
[   43.877845][  T319] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000
[   43.878073][  T319] RDX: 0000000000000009 RSI: 0000000c40000000 RDI: ffff88810eef02b0
[   43.878306][  T319] RBP: ffff88810eef0000 R08: ffff88810eef0280 R09: 1ffff1102120fd63
[   43.878523][  T319] R10: 1ffff1102120fd66 R11: 1ffff1102120fd67 R12: 0000000c40000000
[   43.878742][  T319] R13: ffff88810eef02b8 R14: 0000000000000048 R15: 0000000020000000
[   43.878959][  T319] FS:  00007f9c51c47c40(0000) GS:ffff88817a0be000(0000) knlGS:0000000000000000
[   43.879214][  T319] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   43.879403][  T319] CR2: 000055e69a2230a8 CR3: 000000010c07a000 CR4: 0000000000750ef0
[   43.879621][  T319] PKRU: 55555554
[   43.879735][  T319] Call Trace:
[   43.879844][  T319]  <TASK>
[   43.879924][  T319]  __qdisc_run+0x169/0x1900
[   43.880075][  T319]  ? dev_qdisc_enqueue+0x8b/0x210
[   43.880222][  T319]  __dev_queue_xmit+0x2346/0x37a0
[   43.880376][  T319]  ? register_lock_class+0x3f/0x800
[   43.880531][  T319]  ? srso_alias_return_thunk+0x5/0xfbef5
[   43.880684][  T319]  ? __pfx___dev_queue_xmit+0x10/0x10
[   43.880834][  T319]  ? srso_alias_return_thunk+0x5/0xfbef5
[   43.880977][  T319]  ? __lock_acquire+0x819/0x1df0
[   43.881124][  T319]  ? srso_alias_return_thunk+0x5/0xfbef5
[   43.881275][  T319]  ? srso_alias_return_thunk+0x5/0xfbef5
[   43.881418][  T319]  ? __asan_memcpy+0x3c/0x60
[   43.881563][  T319]  ? srso_alias_return_thunk+0x5/0xfbef5
[   43.881708][  T319]  ? eth_header+0x165/0x1a0
[   43.881853][  T319]  ? lockdep_hardirqs_on_prepare+0xdb/0x1a0
[   43.882031][  T319]  ? srso_alias_return_thunk+0x5/0xfbef5
[   43.882174][  T319]  ? neigh_resolve_output+0x3cc/0x7e0
[   43.882325][  T319]  ? srso_alias_return_thunk+0x5/0xfbef5
[   43.882471][  T319]  ip_finish_output2+0x6b6/0x1e10

Fix this by calling qdisc_reset for CBS' child qdisc.
Sashiko caught an issue which could result in a null ptr deref if
qdisc_create_dflt() is invoked on an unitialised cbs qdisc which is exposed
by this patch. We add an early return if the qdisc is null to address this.
This is a similar approach used by two other fixes[1][2].

The proper fix for this specific issue elucidated by sashiko is to remove
the call to qdisc_reset when qdisc_create_dflt fails. Since the dflt qdisc
isn't attached anywhere yet at that point, calling the reset callback doesn't
make much sense (and as stated has been a source of two other bugs).
We plan on  submitting this fix in a later patch.
[1] https://lore.kernel.org/netdev/20221018063201.306474-2-shaozhengchao@huawei.com/
[2] https://lore.kernel.org/netdev/20221018063201.306474-4-shaozhengchao@huawei.com/

Fixes: 585d763af09c ("net/sched: Introduce Credit Based Shaper (CBS) qdisc")
Reported-by: Junyoung Jang <graypanda.inzag@gmail.com>
Tested-by: Junyoung Jang <graypanda.inzag@gmail.com>
Tested-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>