git.ipfire.org Git - thirdparty/kernel/linux.git/log

sh: Drop cache flush of the zero page at boot

SuperH performs cache maintenance on the zero page during boot,
presumably because before commit

6215d9f4470f ("arch, mm: consolidate empty_zero_page")

the zero page did double duty as a boot params region, and was cleared
separately, as it was not part of BSS. The memset() in question was
dropped by that commit, but the __flush_wback_region() call remained.

As empty_zero_page[] has been moved to BSS, it can be treated as any
other BSS memory, and so the cache flush can be dropped.

Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>

powerpc/code-patching: Avoid r/w mapping of the zero page

The only remaining use of map_patch_area() is mapping the zero page, and
immediately unmapping it again so that the intermediate page table
levels are all guaranteed to be populated.

The use of the zero page here is completely arbitrary, and not harmful
per se, but currently, it creates a writable mapping, and does so in a
manner that requires that the empty_zero_page[] symbol is not
const-qualified.

Given that this is about to change, and that map_patch_area() now never
maps anything other than the zero page, let's simplify the code and
- remove the helpers and call [un]map_kernel_page() directly
- take the PA of empty_zero_page directly
- create a read-only temporary mapping.

This allows empty_zero_page[] to be repainted as const u8[] in a
subsequent patch, without making substantial changes to this code
patching logic.

Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Link: https://lore.kernel.org/all/20260520085423.485402-1-ardb@kernel.org/
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>

arm64: mm: Don't abuse memblock NOMAP to check for overlaps

Now that the linear region mapping routines respect existing table
mappings and contiguous block and page mappings, it is no longer needed
to fiddle with the memblock tables to set and clear the NOMAP attribute
in order to omit text and rodata when creating the linear map.

Instead, map the kernel text and rodata alias first with the desired
initial attributes and granularity, so that the loop iterating over the
memblocks will not remap it in a manner that prevents it from being
remapped with updated attributes later.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>

arm64: Move fixmap and kasan page tables to end of kernel image

Move the fixmap and kasan page tables out of the BSS section, and place
them at the end of the image, right before the init_pg_dir section where
some of the other statically allocated page tables live.

These page tables are currently the only data objects in vmlinux that
are meant to be accessed via the kernel image's linear alias, and so
placing them together allows the remainder of the data/bss section to be
remapped read-only or unmapped entirely.

Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>

arm64: mm: Permit contiguous attribute for preliminary mappings

There are a few cases where we omit the contiguous hint for mappings
that start out as read-write and are remapped read-only later, on the
basis that manipulating live descriptors with the PTE_CONT attribute set
is unsafe. When support for the contiguous hint was added to the code,
the ARM ARM was ambiguous about this, and so we erred on the side of
caution.

In the meantime, this has been clarified [0], and regions that will be
remapped in their entirety, retaining the contiguous bit on all entries,
can use the contiguous hint both in the initial mapping as well as the
one that replaces it. Note that this requires that the logic that may be
called to remap overlapping regions respects existing valid descriptors
that have the contiguous bit cleared.

So omit the NO_CONT_MAPPINGS flag in places where it is unneeded.

[0] RJQQTC

For a TLB lookup in a contiguous region mapped by translation table entries that
have consistent values for the Contiguous bit, but have the OA, attributes, or
permissions misprogrammed, that TLB lookup is permitted to produce an OA, access
permissions, and memory attributes that are consistent with any one of the
programmed translation table values.

Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>

arm64: kfence: Avoid NOMAP tricks when mapping the early pool

Now that the map_mem() routines respect existing page mappings and
contiguous granule sized blocks with the contiguous bit cleared, there
is no longer a reason to play tricks with the memblock NOMAP attribute.

Instead, the kfence pool can be allocated and mapped with page
granularity first, and this granularity will be respected when the rest
of DRAM is mapped later, even if block and contiguous mappings are
allowed for the remainder of those mappings.

Add the NO_EXEC_MAPPINGS flag to ensure that hierarchical XN attributes
are set on the intermediate page tables that are allocated when mapping
the pool.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>

arm64: mm: Permit contiguous descriptors to be manipulated

Currently, pgattr_change_is_safe() is overly pedantic when it comes to
descriptors with the contiguous hint attribute set, as it rejects
assignments even if the old and the new value are the same.

In fact, as per ARM ARM RJQQTC, manipulating descriptors with the
contiguous bit set is safe as long as the bit itself does not change
value, in the sense that no TLB conflict aborts or other exceptions may
be raised as a result. Inconsistent permission attributes within the
contiguous region may result in any of the alternatives to be taken to
apply to the entire region, which might be a programming error, but it
does not constitute an unsafe manipulation in terms of what
pgattr_change_is_safe() is intended to detect.

So drop the special PTE_CONT check, but still omit PTE_CONT from 'mask'
so that modifying the bit is still regarded as unsafe.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>

arm64: mm: Preserve non-contiguous descriptors when mapping DRAM

Instead of blindly overwriting existing live entries regardless of the
value of their contiguous bit when mapping DRAM regions at
contiguous-hint granularity, check whether the contiguous region in
question contains any valid descriptors that have the contiguous bit
cleared, and in that case, leave the contiguous bit unset on the entire
region. This permits the logic of mapping the kernel's linear alias to
be simplified in a subsequent patch.

Note that this can only result in a misprogrammed contiguous bit (as per
ARM ARM RNGLXZ) if the region in question already contains a mix of
valid contiguous and valid non-contiguous descriptors, in which case it
was already misprogrammed to begin with.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>

arm64: mm: Preserve existing table mappings when mapping DRAM

Instead of blindly overwriting an existing table entry when mapping DRAM
regions, take care not to replace a pre-existing table entry with a
block entry. This permits the logic of mapping the kernel's linear alias
to be simplified in a subsequent patch.

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>

arm64: mm: Check for pud_/pmd_set_huge() failures on kernel mappings

Sashiko reports:

| If pmd_set_huge() rejects an unsafe page table transition (such as
| mapping a different physical address over an existing block mapping),
| it returns 0 and leaves the page table entry unmodified.
|
| Because *pmdp remains unmodified, READ_ONCE(pmd_val(*pmdp)) will equal
| pmd_val(old_pmd). The transition from old_pmd to old_pmd is evaluated
| as safe by pgattr_change_is_safe(), so the BUG_ON never triggers.
|
| This allows invalid and unsafe mapping updates to be silently dropped
| instead of panicking, leaving stale memory mappings active while the
| caller assumes the update was successful.

The same applies to pud_set_huge() in alloc_init_pud().

Given how it is generally preferred to limp on rather than blow up the
system if an unexpected condition such as this one occurs, and the fact
that there are no known cases where this disparity results in real
problems, let's WARN on these failures rather than BUG, allowing the
system to survive to the point where it can actually report them.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>

arm64: mm: Drop redundant pgd_t* argument from map_mem()

__map_memblock() and map_mem() always operate on swapper_pg_dir, so
there is no need to pass around a pgd_t pointer between them.

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>

arm64: mm: Remove bogus stop condition from map_mem() loop

The memblock API guarantees that start is not greater than or equal to
end, so there is no need to test it. And if it were, it is doubtful that
breaking out of the loop would be a reasonable course of action here
(rather than attempting to map the remaining regions)

So let's drop this check.

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>

regulator: Use named initializers for platform_device_id arrays

Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com> says:

this series targets to use named initializers for platform_device_id
arrays. In general these are better readable for humans and more robust
to changes in the respective struct definition.

This robustness is needed as I want to do

Link: https://patch.msgid.link/cover.1779878004.git.u.kleine-koenig@baylibre.com

regulator: Unify usage of space and comma in platform_device_id arrays

After converting all these arrays to use named initializers and fixing
coding style en passant, adapt the coding style also for those drivers that
already used named initializers before for consistency.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Link: https://patch.msgid.link/a3a2736ebfcfa5a228dcebfbfefc14960dcce314.1779878004.git.u.kleine-koenig@baylibre.com
Signed-off-by: Mark Brown <broonie@kernel.org>

regulator: Use named initializers for platform_device_id arrays

Named initializers are better readable and more robust to changes of the
struct definition. This robustness is relevant for a planned change to
struct platform_device_id replacing .driver_data by an anonymous unit.

While touching these arrays unify spacing and usage of commas.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Acked-by: Karel Balej <balejk@matfyz.cz>
Reviewed-by: Matti Vaittinen <mazziesaccount@gmail.com>
Link: https://patch.msgid.link/d02f55dfd5bdd743ae5cd76f2a5af0d346226a68.1779878004.git.u.kleine-koenig@baylibre.com
Signed-off-by: Mark Brown <broonie@kernel.org>

regulator: Drop unused assignment of platform_device_id driver data

Several drivers explicitly set the .driver_data member of struct
platform_device_id to zero without relying on that value. Drop these
unused assignments.

While touching these arrays unify spacing, usage of commas and use
named initializers for .name.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Link: https://patch.msgid.link/613cd1bed263c2bf562ee714595f6d57f442804d.1779878004.git.u.kleine-koenig@baylibre.com
Signed-off-by: Mark Brown <broonie@kernel.org>

regulator: scmi: fix of_node refcount leak in scmi_regulator_probe()

scmi_regulator_probe() calls of_find_node_by_name() which takes a
reference on the returned device node. On the error path where
process_scmi_regulator_of_node() fails, the function returns without
calling of_node_put() on the child node, leaking the reference.

Add of_node_put(np) on the error path to properly release the
reference.

Cc: stable@vger.kernel.org
Fixes: 0fbeae70ee7c ("regulator: add SCMI driver")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Link: https://patch.msgid.link/20260527104850.872415-1-vulab@iscas.ac.cn
Signed-off-by: Mark Brown <broonie@kernel.org>

mount: honour SB_NOUSER in the new mount API

One should *not* be allowed to mount one of those, new API or not.

Reported-by: Denis Arefev <arefev@swemel.ru>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://patch.msgid.link/20260602020444.GP2636677@ZenIV
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

KVM: s390: Lock pte when making page secure

Make sure _kvm_s390_pv_make_secure() takes the pte lock for the given
address when attempting to make the page secure.

One of the steps in making the page secure is freezing the folio using
folio_ref_freeze(), which temporarily sets the reference count to 0.
Any attempt to get such a folio while frozen will fail and cause a
warning to be printed.

Other users of folio_ref_freeze() make sure that the page is not mapped
while it's being frozen, thus preventing gup functions from being able
to access it. For _kvm_s390_pv_make_secure(), this is not possible,
because the page needs to be mapped in order for the import to succeed.

By taking the pte lock, gup functions will be blocked until the import
operation is done, thus avoiding the race.

In theory this does not completely solve the issue: if a page is mapped
through multiple mappings, locking one pte does not protect from
calling gup on it through the other mapping. In practice this does not
happen and it is a decent stopgap solution until a more correct
solution is available.

Fixes: e38c884df921 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260602142356.169458-8-imbrenda@linux.ibm.com>

KVM: s390: Fix fault-in code

Fix the fault-in code so that it does not return success if a
concurrent unmap event invalidated the fault-in process between the
best-effort lockless check and the proper check with lock.

The new behaviour is to retry, like the best-effort lockless check
already did.

This prevents the fault-in handler from returning success without
having actually faulted in the requested page.

Fixes: e907ae530133 ("KVM: s390: Add helper functions for fault handling")
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260602142356.169458-7-imbrenda@linux.ibm.com>

KVM: s390: vsie: Fix rmap handling in _do_shadow_crste()

Fix _do_shadow_crste() to also apply a mask on the reverse address, to
prevent spurious entries from being created, like already done in
gmap_protect_rmap().

Fixes: e38c884df921 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260602142356.169458-6-imbrenda@linux.ibm.com>

KVM: s390: Fix guest / virtual address confusion in _essa_clear_cbrl()

Until now, gmap_helper_zap_one_page() was being called with the guest
absolute address, but it expects a userspace virtual address.

This meant that in the best case the requested pages were not being
discarded, and in the worst case that the wrong pages were being
discarded.

Fix this by converting the guest absolute address to host virtual
before passing it to gmap_helper_zap_one_page().

Fixes: e38c884df921 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260602142356.169458-5-imbrenda@linux.ibm.com>

KVM: s390: Avoid potentially sleeping while atomic when zapping pages

Factor out try_get_locked_pte(), which behaves similarly to
get_locked_pte(), but does not attempt to allocate missing tables and
performs a spin_trylock() instead of blocking.

The new function is also exported, since it will be used in other
patches.

If intermediate entries are missing, there can be no pte swap entry to
free, so it's safe to ignore them.

This avoids potentially sleeping while atomic.

Fixes: e38c884df921 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260602142356.169458-4-imbrenda@linux.ibm.com>

KVM: s390: Fix _gmap_crstep_xchg_atomic()

The previous incorrect behaviour cleared the vsie_notif bit without
returning false, which allowed shadow crstes to be installed without
the vsie_notif bit.

Return false and do not perform the operation if an unshadow event has
been triggered, but still attempt to clear the vsie_notif bit from the
existing crste.

This will prevent the installation of shadow crstes without vsie_notif
bit and will also prevent the caller from looping forever if it was
not checking for the sg->invalidated flag.

Fixes: b827ef02f409 ("KVM: s390: Remove non-atomic dat_crstep_xchg()")
Fixes: a2c17f9270cc ("KVM: s390: New gmap code")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260602142356.169458-3-imbrenda@linux.ibm.com>

KVM: s390: Fix _gmap_unmap_crste()

In _gmap_unmap_crste(), the crste to be unmapped is zapped calling
gmap_crstep_xchg_atomic() exactly once, and expecting it to succeed.
This is a reasonable sanity check, since kvm->mmu_lock is being held in
write mode, and thus no races should be possible.

An upcoming patch will change the behaviour of gmap_crstep_xchg_atomic()
to return false and clear the vsie_notif bit if the operation triggers
an unshadow operation. With the new behaviour, an unmap operation that
triggers an unshadow would cause the VM to be killed.

Prepare for the change by checking if the vsie_notif bit was set in
the old crste if gmap_crstep_xchg_atomic() fails the first time, and
try a second time. The second time no failures are allowed.

Fixes: b827ef02f409 ("KVM: s390: Remove non-atomic dat_crstep_xchg()")
Fixes: a2c17f9270cc ("KVM: s390: New gmap code")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260602142356.169458-2-imbrenda@linux.ibm.com>

tracing/eprobes: Allow use of BTF names to dereference pointers

Add syntax to the parsing of eprobes to be able to typecast a trace event
field that is a pointer to a structure.

Currently, a dereference must be a number, where the user has to figure
out manually the offset of a member of a structure that they want to
dereference.

But for event probes that records a field that happens to be a pointer to
a structure, it cannot dereference these values with BTF naming, but
must use numerical offsets.

For example, to find out what device a sk_buff is pointing to in the
net_dev_xmit trace event, one must first use gdb to find the offsets of the
members of the structures:

(gdb) p &((struct sk_buff *)0)->dev
$1 = (struct net_device **) 0x10
(gdb) p &((struct net_device *)0)->name
$2 = (char (*)[16]) 0x118

And then use the raw numbers to dereference:

  # echo 'e:xmit net.net_dev_xmit +0x118(+0x10($skbaddr)):string' >> dynamic_events

If BTF is in the kernel, then instead, the skbaddr can be typecast to
sk_buff and use the normal dereference logic.

  # echo 'e:xmit net.net_dev_xmit (sk_buff)skbaddr->dev->name:string' >> dynamic_events
  # echo 1 > events/eprobes/xmit/enable
  # cat trace
[..]
    sshd-session-1022    [000] b..2.   860.249343: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.250061: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.250142: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.263553: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.283820: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.302716: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.322905: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.342828: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.362268: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.382335: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.400856: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.419893: xmit: (net.net_dev_xmit) arg1="enp7s0"

The syntax is simply: (STRUCT)(FIELD)->MEMBER[->MEMBER..]

Also add comments around the #else and #endif of #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
to know what they are for.

Link: https://lore.kernel.org/all/20260601130746.2139d926@gandalf.local.home/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

block, bfq: release cgroup stats with bfq_group

BFQ cgroup stats contain percpu counters embedded in struct bfq_group,
but the old free path destroys them from bfq_pd_free(), which is tied
to blkg policy-data teardown.

That is not the same lifetime as struct bfq_group. BFQ pins bfq_group
while bfq_queue entities refer to it, so bfq_pd_free() can drop the
policy-data reference while other bfq_group references still exist. The
following blkcg change also defers policy-data release through RCU and
leaves BFQ to run the final bfqg_put() from an RCU callback. For that
conversion, stats teardown must belong to the last bfq_group put, not to
policy-data teardown.

Move stats teardown to bfqg_put() so the embedded counters are destroyed
exactly when the last bfq_group reference is released, before kfree(bfqg).

Without this preparatory change, the RCU-delayed policy-data free
conversion reproduced the following KASAN report:

  BUG: KASAN: slab-use-after-free in percpu_counter_destroy_many+0xf1/0x2e0
  Write of size 8 at addr ffff88811d9409e0 by task test_blkcg/535

  CPU: 0 UID: 0 PID: 535 Comm: test_blkcg Not tainted 7.1.0-rc2-g1e14adca0199 #1 PREEMPT  ea13f83d4b74a12510d20db4a7d9a0fe8275f05c
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x54/0x70
   print_address_description+0x77/0x200
   ? percpu_counter_destroy_many+0xf1/0x2e0
   print_report+0x64/0x70
   kasan_report+0x118/0x150
   ? percpu_counter_destroy_many+0xf1/0x2e0
   percpu_counter_destroy_many+0xf1/0x2e0
   __mmdrop+0x1d8/0x350
   finish_task_switch+0x3f5/0x570
   __schedule+0xe8e/0x18a0
   schedule+0xfe/0x1c0
   schedule_timeout+0x7f/0x1d0
   __wait_for_common+0x26c/0x3f0
   wait_for_completion_state+0x21/0x40
   call_usermodehelper_exec+0x271/0x2c0
   __request_module+0x296/0x410
   elv_iosched_store+0x1bc/0x2c0
   queue_attr_store+0x152/0x1c0
   kernfs_fop_write_iter+0x1d7/0x280
   vfs_write+0x580/0x630
   ksys_write+0xec/0x190
   do_syscall_64+0x156/0x490
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

  Allocated by task 535:
   kasan_save_track+0x3e/0x80
   __kasan_kmalloc+0x72/0x90
   bfq_pd_alloc+0x60/0x100 [bfq]
   blkg_create+0x3bb/0xbe0
   blkg_lookup_create+0x3a2/0x460
   blkg_conf_start+0x24a/0x2d0
   bfq_io_set_weight+0x17f/0x430 [bfq]
   cgroup_file_write+0x1c5/0x4b0
   kernfs_fop_write_iter+0x1d7/0x280
   vfs_write+0x580/0x630
   ksys_write+0xec/0x190
   do_syscall_64+0x156/0x490
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

  Freed by task 0:
   kasan_save_track+0x3e/0x80
   kasan_save_free_info+0x46/0x50
   __kasan_slab_free+0x3a/0x60
   kfree+0x14e/0x4f0
   rcu_core+0x6f3/0xcd0
   handle_softirqs+0x1a0/0x550
   __irq_exit_rcu+0x8c/0x150
   irq_exit_rcu+0xe/0x20
   sysvec_apic_timer_interrupt+0x6e/0x80
   asm_sysvec_apic_timer_interrupt+0x1a/0x20

  Last potentially related work creation:
   kasan_save_stack+0x3e/0x60
   kasan_record_aux_stack+0x99/0xb0
   call_rcu+0x55/0x5c0
   blkg_free_workfn+0x130/0x220
   process_scheduled_works+0x655/0xb60
   worker_thread+0x446/0x600
   kthread+0x1f4/0x230
   ret_from_fork+0x259/0x420
   ret_from_fork_asm+0x1a/0x30

Signed-off-by: Yu Kuai <yukuai@fygo.io>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260601061502.899552-1-yukuai@fygo.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>

driver core: Use system_percpu_wq instead of system_wq

Commit 1137838865bf ("driver core: Use mod_delayed_work to prevent lost
deferred probe work") added a use of system_wq, which is deprecated in
favor of system_percpu_wq added by commit 128ea9f6ccfb ("workqueue: Add
system_percpu_wq and system_dfl_wq"). An upcoming warning in the
workqueue tree flags this with:

workqueue: work func deferred_probe_timeout_work_func enqueued on deprecated workqueue. Use system_{percpu|dfl}_wq instead.

Switch to system_percpu_wq to clear up the warning.

Fixes: 1137838865bf ("driver core: Use mod_delayed_work to prevent lost deferred probe work")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Link: https://patch.msgid.link/20260601-driver-core-fix-system_wq-warning-v1-1-f9001a70ee25@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

nvme: refresh multipath head zoned limits from path limits

queue_limits_stack_bdev() updates the multipath head limits from the
path queue, but it does not propagate max_open_zones or
max_active_zones. As a result, a zoned multipath namespace head can
keep stale 0/0 values even after a ready path reports finite zoned
resource limits.

When refreshing the head limits in nvme_update_ns_info(), stack the
zoned resource limits directly after stacking the path queue limits.
Use min_not_zero() so the block layer's 0 value keeps its "no limit"
meaning while finite limits are combined conservatively.

This avoids advertising "no limit" on the multipath head while keeping
the zoned-limit handling local to the NVMe multipath update path.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yao Sang <sangyao@kylinos.cn>
Signed-off-by: Keith Busch <kbusch@kernel.org>

nvme: fix FDP fdpcidx bounds check

The fdpcidx bounds check sets n = NUMFDPC + 1 but used > instead of >=,
incorrectly accepting fdp_idx when it equals n (i.e. NUMFDPC + 1).

Fixes: 30b5f20bb2dd ("nvme: register fdp parameters with the block layer")
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: liuxixin <gliuxen@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

wifi: mac80211: limit injected antenna index in ieee80211_parse_tx_radiotap

When parsing the radiotap header of an injected frame,
ieee80211_parse_tx_radiotap() uses the IEEE80211_RADIOTAP_ANTENNA value
directly as a shift count:

info->control.antennas |= BIT(*iterator.this_arg);

*iterator.this_arg is an 8-bit value taken straight from the frame
supplied by userspace, so BIT() can be asked to shift by up to 255. That
is undefined behaviour on the unsigned long and is reported by UBSAN:

  UBSAN: shift-out-of-bounds in net/mac80211/tx.c:2174:30
  shift exponent 235 is too large for 64-bit type 'unsigned long'
  Call Trace:
   ieee80211_parse_tx_radiotap+0xadb/0x1950 net/mac80211/tx.c:2174
   ieee80211_monitor_start_xmit+0xb1f/0x1250 net/mac80211/tx.c:2451
   ...
   packet_sendmsg+0x3eb6/0x50f0 net/packet/af_packet.c:3109

info->control.antennas is a 2-bit bitmap (u8 antennas:2), so only antenna
indices 0 and 1 can ever be represented. Ignore any larger value instead
of shifting out of bounds.

Reported-by: syzbot+8e0622f6d9446420271f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=8e0622f6d9446420271f
Fixes: ef246a1480cc ("wifi: mac80211: support antenna control in injection")
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Link: https://patch.msgid.link/20260531011721.102941-1-kartikey406@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

wifi: nl80211: reject oversized EMA RNR lists

nl80211_parse_rnr_elems() stores the parsed element count in a
u8-backed cfg80211_rnr_elems::cnt field and uses that count to size
the flexible array allocation.

Reject nested NL80211_ATTR_EMA_RNR_ELEMS input once the count reaches
255, before incrementing it again. This keeps the parser aligned with
the data structure it fills and matches the existing bound check used
by nl80211_parse_mbssid_elems().

Fixes: dbbb27e183b1 ("cfg80211: support RNR for EMA AP")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Assisted-by: Codex:gpt-5.4
Signed-off-by: Yuqi Xu <xuyuqiabc@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Link: https://patch.msgid.link/20260529152542.1412734-1-n05ec@lzu.edu.cn
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

ARM64: remove unnecessary architecture-specific <asm/device.h>

arch/arm64/include/asm/device.h is identical to
include/asm-generic/device.h, and therefore the ARM64-specific version
is unnecessary. Remove it.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Signed-off-by: Will Deacon <will@kernel.org>

nvme-tcp: Use WQ_PERCPU explicitly if wq_unbound is false.

Since commit 21c05ca88a54 ("workqueue: Add warnings and ensure
one among WQ_PERCPU or WQ_UNBOUND is present"), we must explicitly
set WQ_PERCPU or WQ_UNBOUND when creating workqueue.

nvme_tcp_init_module() sets WQ_UNBOUND when the module param
wq_unbound is set, but otherwise, WQ_PERCPU is missing, triggering
the warning below:

workqueue: nvme_tcp_wq is using neither WQ_PERCPU or WQ_UNBOUND. Setting WQ_PERCPU.
WARNING: kernel/workqueue.c:5856 at __alloc_workqueue+0x1d02/0x2070 kernel/workqueue.c:5855, CPU#0: swapper/0/1

Let's set WQ_PERCPU if wq_unbound is false.

Reported-by: syzbot+d078cba4418e65f61984@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6a1a9a86.323e8352.141b09.0001.GAE@google.com/
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

nvmet: fix pre-auth out-of-bounds heap read in Discovery Get Log Page

nvmet_execute_disc_get_log_page() validates only the dword alignment
of the host-supplied Log Page Offset (lpo).  The 64-bit offset is then
added to a small kzalloc'd buffer that holds the discovery log page
and the result is passed straight to nvmet_copy_to_sgl(), which
memcpy()s data_len bytes out to the host with no source-side bound
check:

    u64 offset      = nvmet_get_log_page_offset(req->cmd);  /* 64-bit host */
    size_t data_len = nvmet_get_log_page_len(req->cmd);     /* 32-bit host */
    ...
    if (offset & 0x3) { ... }                               /* only check */
    ...
    alloc_len = sizeof(*hdr) + entry_size * discovery_log_entries(req);
    buffer = kzalloc(alloc_len, GFP_KERNEL);
    ...
    status = nvmet_copy_to_sgl(req, 0, buffer + offset, data_len);

The Discovery controller is unauthenticated -- nvmet_host_allowed()
returns true unconditionally for the discovery subsystem -- so the call
is reachable pre-authentication by any TCP/RDMA/FC peer that can reach
the nvmet target.  With a discovery log page of ~1 KiB, an attacker
requesting up to 4 KiB starting at offset == alloc_len reads the next
slab page out and gets its content returned over the fabric (an
empirical run on a default nvmet-tcp loopback target leaked 81
canonical kernel pointers in one Get Log Page response).  Pointing the
offset at unmapped kernel memory faults the in-kernel memcpy and
crashes (or panics, on panic_on_oops=1) the target host instead.

The attacker-controlled source-side offset pattern
"nvmet_copy_to_sgl(req, 0, buffer + ATTACKER_OFFSET, ...)" is unique
to nvmet_execute_disc_get_log_page in the entire nvmet codebase: every
other Get Log Page handler in admin-cmd.c either ignores lpo (and
silently starts every response at offset 0) or tracks a local
destination offset with a fixed source pointer.

Validate the host-supplied offset against the log page size, cap the
copy length to what is actually available, and zero-fill any remainder
of the host transfer buffer.  The zero-fill matches the existing
short-response pattern in nvmet_execute_get_log_changed_ns()
(admin-cmd.c) and prevents leaking transport SGL contents when the
host asks for more bytes than the log page contains.

Fixes: a07b4970f464 ("nvmet: add a generic NVMe target")
Cc: stable@vger.kernel.org
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>

rseq: Fix using an uninitialized stack variable in rseq_exit_user_update()

There is an bug in which an uninitialized stack variable is used in
rseq_exit_user_update() as reported by syzbot:

BUG: KMSAN: kernel-infoleak in rseq_set_ids_get_csaddr include/linux/rseq_entry.h:502 [inline]

The local variable:

struct rseq_ids ids = {
.cpu_id = task_cpu(t),
.mm_cid = task_mm_cid(t),
.node_id = cpu_to_node(ids.cpu_id),
};

According to the C standard, the evaluation order of expressions in an
initializer list is indeterminately sequenced. The compiler (Clang, in
this KMSAN build) evaluates `cpu_to_node(ids.cpu_id)` *before*
`ids.cpu_id` is initialized with `task_cpu(t)`.

This is fixed by moving the assignment of ids.node_id outside the
structure initialization.

Fixes: 82f572449cfe ("rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode")
Closes: https://syzkaller.appspot.com/bug?extid=185a631927096f9da2fc
Reported-by: syzbot+185a631927096f9da2fc@syzkaller.appspotmail.com
Signed-off-by: Qing Wang <wangqing7171@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://patch.msgid.link/20260602030854.574038-1-wangqing7171@gmail.com

sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime()

assign_cfs_rq_runtime() during update_curr() sets the resched indicator
and relies on check_cfs_rq_runtime() during pick_next_task() /
put_prev_entity() to throttle the hierarchy once current task is
preempted / blocks.

Per-task throttle, on the other hand, uses throttle_cfs_rq() to simply
propagate the throttle signals, and then relies on task work to
individually throttle the runnable tasks on their way out to the
userspace.

Remove check_cfs_rq_runtime() and unify throttling into
account_cfs_rq_runtime() which only sets the cfs_rq->throttled,
cfs_rq->throttle_count indicators via throttle_cfs_rq() and optionally
adds the task work to the current task (donor) it is on the throttled
hierarchy.

throttle_cfs_rq() requests for sched_cfs_bandwidth_slice() worth of
bandwidth for the current hierarchy that enable it to continue running
uninterrupted when selected. For the rest, it requests a bare minimum of
"1" to ensure some bandwidth is available and pass the
"runtime_remaining > 0" checks once selected.

For SCHED_PROXY_EXEC, a mutex holder cannot exit to userspace without
dropping it first and the mutex_unlock() ensures proxy is stopped before
the mutex handoff which preserves the current semantics for running a
throttled task until it exits to the userspace even if it acts as a
donor.

[ prateek: rebased on tip, comments, commit message. ]

Reviewed-By: Benjamin Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20260602071005.11942-1-kprateek.nayak@amd.com

sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up()

An update_curr() during the enqueue of throttled task will start
throttling the hierarchy from subsequent commit. This can lead to
tg_throttle_down() seeing non-empty throttled_limbo_list for the cfs_rq
attaching the task from throttled_limbo_list one by one. For example:

     R
     |
     A
    / \
  *B   C
       |
       rq->curr

*B is throttled with tasks on hte limbo list. When the tasks are
unthrottled via tg_unthrottle_up() and entity of group B is placed onto
A, update_curr() is called to catch up the vruntime and it may throttle
group A causing the subsequent tg_throttle_down() to see the pending
task's on B's limbo list.

  tg_unthrottle_up()
    /* --cfs_rq->throttle_count == 0 */
    list_for_each_entry_safe(p, cfs_rq->throttled_limbo_list)
      enqueue_task_fair()
        enqueue_entity(se /* B->se */)
          update_curr(cfs_rq /* A->gcfs_rq */)
            account_cfs_rq_runtime(cfs_rq)
              throttle_cfs_rq(cfs_rq /* A->gcfs_rq */ )
                tg_throttle_down()
                  /* Reaches B->cfs_rq with throttle_count == 0 */

                  !!! !list_empty(&cfs_rq->throttled_limbo_list)) !!!

Move the tasks from throttled_limbo_list onto a local list before
starting the unthrottle to prevent the splat described above. If the
hierarchy is throttled again in middle of an unthrottle, put the pending
tasks back onto the limbo list to prevent running them unnecessarily.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Benjamin Segall <bsegall@google.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20260602052531.11450-2-kprateek.nayak@amd.com

sched/fair: Call update_curr() before unthrottling the hierarchy

Subsequent commits will allow update_curr() to throttle the hierarchy
when the runtime accounting exceeds allocated quota. Call update_curr()
before the unthrottle event, and in tg_unthrottle_up() to catch up on
any remaining runtime and stabilize the "runtime_remaining" and
"throttle_count" for that cfs_rq.

Doing an update_curr() early ensures the cfs_rq is not throttled right
back up again when the unthrottle is in progress.

Since all callers of unthrottle_cfs_rq(), except two, already update the
rq_clock and call rq_clock_start_loop_update(), move the
update_rq_clock() from unthrottle_cfs_rq() to the callers that don't
update the rq_clock.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Benjamin Segall <bsegall@google.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20260602052531.11450-1-kprateek.nayak@amd.com

sched/fair: Use throttled_csd_list for local unthrottle

When distribute_cfs_runtime() encounters a local cfs_rq, it adds it to a
local list and unthrottles it at the end, when it is done unthrottling
other cfs_rq(s) on cfs_b->throttled_cfs_rq until the bandwidth runs out.

Instead of using a local list, reuse the local CPU's
rq->throttled_csd_list and the __cfsb_csd_unthrottle() path for
unthrottle.

If this is the first cfs_rq to be queued on the "throttled_csd_list", it
prevents the need for a remote CPUs to interrupt this local CPU if they
themselves are performing async unthrottle.

If this is not the first cfs_rq on the list, there is an async unthrottle
operation pending on this local CPU and the unthrottle can be batched
together.

No functional changes intended.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Benjamin Segall <bsegall@google.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20260602050005.11160-3-kprateek.nayak@amd.com

sched/fair: Convert cfs bandwidth throttling to use guards

Routine conversion of rcu_read_lock(), spin_lock*, and rq_lock usage
within the cfs bandwidth controller to use class guards.

Only notable changes are:

- Checking for "cfs_rq->runtime_remaining <= 0" instead of the inverse
   to spot a throttle and break early. This also saves the need
   for extra indentation in the unthrottle case.

- Reordering of list_del_rcu() against throttled_clock indicator update
   in unthrottle_cfs_rq(). Both are done with "cfs_b->lock" held after
   the "cfs_rq->throttled" is cleared which make the reordering safe
   against concurrent list modifications.

No functional changes intended.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20260602050005.11160-2-kprateek.nayak@amd.com

sched/fair: Allocate cfs_tg_state with percpu allocator

To remove the cfs_rq pointer array in task_group, allocate the combined
cfs_rq and sched_entity using the per-cpu allocator.

This patch implements the following:

- Changes task_group->cfs_rq from 'struct cfs_rq **' to
   'struct cfs_rq __percpu *'.

- Updates memory allocation in alloc_fair_sched_group() and
   free_fair_sched_group() to use alloc_percpu() and free_percpu()
   respectively.

- Uses the inline accessor tg_cfs_rq(tg, cpu) with per_cpu_ptr() to retrieve
   the pointer to cfs_rq for the given task group and CPU.

- Replaces direct accesses tg->cfs_rq[cpu] with calls to the new tg_cfs_rq(tg,
   cpu) helper.

- Handles the root_task_group: since struct rq is already a per-cpu variable
   (runqueues), its embedded cfs_rq (rq->cfs) is also per-cpu. Therefore, we
   assign root_task_group.cfs_rq = &runqueues.cfs.

- Cleanup the code in initializing the root task group.

This change places each CPU's cfs_rq and sched_entity in its local per-cpu
memory area to remove the per-task_group pointer arrays.

Signed-off-by: Zecheng Li <zecheng@google.com>
Signed-off-by: Zecheng Li <zli94@ncsu.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Josh Don <joshdon@google.com>
Link: https://patch.msgid.link/20260522141623.600235-4-zli94@ncsu.edu

sched/fair: Remove task_group->se pointer array

Now that struct sched_entity is co-located with struct cfs_rq for non-root task
groups, the task_group->se pointer array is redundant. The associated
sched_entity can be loaded directly from the cfs_rq.

This patch performs the access conversion with the helpers:

- is_root_task_group(tg): checks if a task group is the root task group. It
   compares the task group's address with the global root_task_group variable.

- tg_se(tg, cpu): retrieves the cfs_rq and returns the address of the
   co-located se. This function checks if tg is the root task group to ensure
   behaving the same of previous tg->se[cpu]. Replaces all accesses that use
   the tg->se[cpu] pointer array with calls to the new tg_se(tg, cpu) accessor.

- cfs_rq_se(cfs_rq): simplifies access paths like cfs_rq->tg->se[...] to use
   the co-located sched_entity. This function also checks if tg is the root
   task group to ensure same behavior.

Since tg_se is not in very hot code paths, and the branch is a register
comparison with an immediate value (`&root_task_group`), the performance impact
is expected to be negligible.

Signed-off-by: Zecheng Li <zecheng@google.com>
Signed-off-by: Zecheng Li <zli94@ncsu.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Josh Don <joshdon@google.com>
Link: https://patch.msgid.link/20260522141623.600235-3-zli94@ncsu.edu

sched/fair: Co-locate cfs_rq and sched_entity in cfs_tg_state

Improve data locality and reduce pointer chasing by allocating struct
cfs_rq and struct sched_entity together for non-root task groups. This
is achieved by introducing a new combined struct cfs_tg_state that
holds both objects in a single allocation.

This patch:

- Introduces struct cfs_tg_state that embeds cfs_rq, sched_entity, and
   sched_statistics together in a single structure.

- Updates __schedstats_from_se() in stats.h to use cfs_tg_state for accessing
   sched_statistics from a group sched_entity.

- Modifies alloc_fair_sched_group() and free_fair_sched_group() to allocate
   and free the new struct as a single unit.

- Modifies the per-CPU pointers in task_group->se and task_group->cfs_rq to
   point to the members in the new combined structure.

Signed-off-by: Zecheng Li <zecheng@google.com>
Signed-off-by: Zecheng Li <zli94@ncsu.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Josh Don <joshdon@google.com>
Link: https://patch.msgid.link/20260522141623.600235-2-zli94@ncsu.edu

sched: restore timer_slack_ns when resetting RT policy on fork

Commit ed4fb6d7ef68 ("hrtimer: Use and report correct timerslack values
for realtime tasks") sets timer_slack_ns to 0 for RT tasks in
__setscheduler_params(). However, when an RT task with SCHED_RESET_ON_FORK
creates child threads, the children inherit timer_slack_ns=0 from the
parent. sched_fork() resets the child's policy to SCHED_NORMAL but does
not restore timer_slack_ns, leaving the child permanently running with
zero slack.

Fix this by restoring timer_slack_ns from default_timer_slack_ns in
sched_fork() when resetting from RT/DL to NORMAL policy, matching the
existing behavior in __setscheduler_params().

Note: this fix alone requires a correct default_timer_slack_ns to be
effective. See the following patch for that fix.

Fixes: ed4fb6d7ef68 ("hrtimer: Use and report correct timerslack values for realtime tasks")
Reported-by: Qiaoting.Lin <linqiaoting@xiaomi.com>
Signed-off-by: Guanyou.Chen <chenguanyou@xiaomi.com>
Signed-off-by: Chunhui.Li <chunhui.li@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260522131000.1664983-2-chenguanyou@xiaomi.com

MAINTAINERS: Fix spelling mistake in Peter's name

Fix a typo in Peter's name which was added by commit 113d0a6b3954
("MAINTAINERS: Add Peter explicitly to the psi section").

Signed-off-by: Zenghui Yu <zenghui.yu@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260530160842.29089-1-zenghui.yu@linux.dev

sched: Simplify ttwu_runnable()

Note that both proxy and delayed tasks have ->is_blocked set. Use this one
condition to guard both paths.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260526113322.714832584%40infradead.org

sched/proxy: Remove superfluous clear_task_blocked_in()

Per the discussion here:

  https://lore.kernel.org/all/20260403112810.GG3738786@noisy.programming.kicks-ass.net/

The reason for this condition is that the signal condition in
try_to_block_task() would set_task_blocked_in_waking(). However, it no longer
does that, in fact, that path does clear_task_blocked_on().

Further, per the discussions here:

  https://lore.kernel.org/r/dc61cf77-e541-441d-a708-c40e19aa0db2%40amd.com
  https://lore.kernel.org/r//9dd1d24d-45d3-4ee2-8e67-8305b34bfb6d%40amd.com

there are a few other edge cases that needed this. But they're all
variants of PROXY_WAKING leaking out. And since PROXY_WAKING is now
gone, this is no longer needed either.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260526113322.120970670%40infradead.org

sched/proxy: Remove PROXY_WAKING

Now that the proxy path uses ->is_blocked, use the '->is_blocked &&
!->blocked_on' state instead of PROXY_WAKING. Notably, this is where a
blocked_on relation is broken but the donor task might still need a return
migration.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260526113322.596522894%40infradead.org

sched/proxy: Switch proxy to use p->is_blocked

Rather than gate the proxy paths with p->blocked_on, use p->is_blocked.

This opens up the state: '->is_blocked && !->blocked_on' for future use.

Notably, only proxy and delayed tasks can be ->on_rq && ->is_blocked, and it is
guaranteed that sched_class::pick_task() will never return a delayed task.
Therefore any task returned from pick_next_task() that has ->is_blocked set,
must be a proxy task.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260526113322.477954312%40infradead.org

sched/proxy: Only return migrate when needed

Current code will 'unconditionally' return migrate on PROXY_WAKING, even if the
task is (still) on the original CPU.

Check task_cpu(p) against p->waking_cpu, which per proxy_set_task_cpu()
preserves the original CPU the task was on. If they do not mis-match, there is
no need to go through the more expensive wakeup path.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260527082916.GP3126523%40noisy.programming.kicks-ass.net

sched: Be more strict about p->is_blocked

Upon entry to try_to_block_task(), p->is_blocked should be false. After all,
the prior wakeup would have made it so per ttwu_do_wakeup().

Ensure this is the case, rather than clearing it in the path that doesn't set
it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260526113322.364017314%40infradead.org

sched/proxy: Optimize try_to_wake_up()

The reason for the clause in try_to_wake_up() is, per its comment, that
find_proxy_task()'s proxy_deactivate() is not always called with a cleared
p->blocked_on.

However, that seems silly and easily cured. Make sure to always call
proxy_deactivate() with a cleared p->blocked_on such that we might remove this
clause from the common wake-up path.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260526113322.244729903%40infradead.org

sched: Add blocked_donor link to task for smarter mutex handoffs

Add link to the task this task is proxying for, and use it so
the mutex owner can do an intelligent hand-off of the mutex to
the task that the owner is running on behalf.

[jstultz: This patch was split out from larger proxy patch]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-8-jstultz@google.com

sched: Add is_blocked task flag

Add a new is_blocked flag to the task struct. This flag is set
by try_to_block_task() and cleared by ttwu_do_wakeup() and
tracks if the task is blocked.

Traditionally this would mirror !p->on_rq, however due things
like DELAY_DEQUEUE and PROXY_EXEC, this can diverge, so its
useful to manage separately.

Additionally with this, we might be able to get rid of the
p->se.sched_delayed (ab)use in the core code (eventually).

Taken whole cloth from Peter's email:
https://lore.kernel.org/lkml/20260501132143.GC1026330@noisy.programming.kicks-ass.net/

With a few additional p->is_blocked = 0 in a few cases where
we return current if blocked_on gets zeroed or there is
no owner. This may hint that these current special cases
might be dropped eventually.

This change also helps resolve wait-queue stalls seen with
proxy-execution. See previous patch attempts for details:
https://lore.kernel.org/lkml/20260430215103.2978955-2-jstultz@google.com/

Reported-by: Vineeth Pillai <vineethrp@google.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-7-jstultz@google.com

sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case

This patch adds logic so try_to_wake_up() will notice if we are
waking a task where blocked_on == PROXY_WAKING, and if necessary
dequeue the task so the wakeup will naturally return-migrate the
donor task back to a cpu it can run on.

This helps performance as we do the dequeue and wakeup under the
locks normally taken in the try_to_wake_up() and avoids having
to do proxy_force_return() from __schedule(), which has to
re-take similar locks and then force a pick again loop.

This was split out from the larger proxy patch, and
significantly reworked.

Credits for the original patch go to:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-6-jstultz@google.com

sched: Rework block_task so it can be directly called

Pull most of the logic out of try_to_block_task() and put it
into block_task() directly, so that we can call block_task() and
not have to worry about the failing cases in try_to_block_task()

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-5-jstultz@google.com

sched: deadline: Add dl_rq->curr pointer to address issues with Proxy Exec

The DL scheduler keeps the current task in the rbtree, since
the deadline value isn't usually chagned while the task is
runnable.

This results in set_next_task() and put_prev_task() being
simpler, but unfortunately this causes complexity elsewhere.

Specifically when update_curr_dl() updates the deadline, it has
to dequeue and then enqueue the task.

From put_prev_task_dl(), we first call update_curr_dl(), and
then call enqueue_pushable_dl_task().

However, with Proxy Exec this goes awry. Since when a mutex is
released, we might wake the waiting rq->donor. This will cause
put_prev_task() to be called on the donor to take it off the
cpu for return migration.

At that point, from put_prev_task_dl() the update_curr_dl()
logic will dequeue & enqueue the task, and the enqueue function
will call enqueue_pushable_dl_task() (since the task_current()
check won't prevent it). Then back up the callstack in
put_prev_task_dl() we'll end up calling
enqueue_pushable_dl_task() again, tripping the
!RB_EMPTY_NODE(&p->pushable_dl_tasks) warning.

So to avoid this, use Peter's suggested[1] approach, and add a
dl_rq->curr pointer that is set/cleared from set_next_task()/
put_prev_task(), which effectively tracks the rq->donor. We can
then use this to avoid adding the active donor to the pushable
list from enqueue_task_dl().

[1]: https://lore.kernel.org/lkml/20260304095123.GP606826@noisy.programming.kicks-ass.net/

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-4-jstultz@google.com

sched: deadline: Add some helper variables to cleanup deadline logic

As part of an improvement to handling pushable deadline tasks,
Peter suggested this cleanup[1], to use helper values for
dl_entity and dl_rq in the enqueue_task_dl() and
put_prev_task_dl() functions. There should be no functional
change from this patch.

To make sure this cleanup change doesn't obscure later logic
changes, I've split it into its own patch.

[1]: https://lore.kernel.org/lkml/20260304095123.GP606826@noisy.programming.kicks-ass.net/

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-3-jstultz@google.com

sched: Rework prev_balance() to avoid stale prev references

Historically, the prev value from __schedule() was the rq->curr.
This prev value is passed down through numerous functions, and
used in the class scheduler implementations. The fact that
prev was on_cpu until the end of __schedule(), meant it was
stable across the rq lock drops that the class->balance()
implementations often do.

However, with proxy-exec, the prev passed to functions called
by __schedule() is rq->donor, which may not be the same as
rq->curr and may not be on_cpu, this makes the prev value
potentially unstable across rq lock drops.

A recently found issue with proxy-exec, is when we begin doing
return migration from try_to_wake_up(), its possible we may be
waking up the rq->donor. When we do this, we proxy_resched_idle()
to put_prev_set_next() setting the rq->donor to rq->idle, allowing
the rq->donor to be return migrated and allowed to run.

This however runs into trouble, as on another cpu we might be in
the middle of calling __schedule(). Conceptually the rq lock is
held for the majority of the time, but in calling prev_balance()
its possible the class->balance() handler call may briefly drop the rq lock.
This opens a window for try_to_wake_up() to wake and return migrate the
rq->donor before the class logic reacquires the rq lock.

Unfortunately prev_balance() pass in a prev argument, to which we pass
rq->donor. However this prev value can now become stale and incorrect across a
rq lock drop.

So, to correct this, rework the prev_balance() call so that it does not take a
"prev" argument.

Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-2-jstultz@google.com

locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING

Vineeth found came up with a test driver that could trip up
workqueue stalls. After fixing one issue this test found,
Vineeth reported the test was still failing.

Greatly simplified, a task that tries to take a mutex already
owned by another task that is sleeping, can hit a edge case in
the mutex_lock_common() case.

If the task fails to get the lock, calls into schedule, but gets
a spurious wakeup, it will find that it is first waiter, and
go into the mutex_optimistic_spin() logic. Though before calling
mutex_optimistic_spin(), we clear task blocked_on state, since
mutex_optimistic_spin() may call schedule() if need_resched() is
set.

After mutex_optimistic_spin() fails, we set blocked_on again,
restart the main mutex loop, try to take the lock and call into
schedule_preempt_disabled().

From there, with proxy-execution, we'll see the task is
blocked_on, follow the chain, see the owner is sleeping and
dequeue the waiting task from the runqueue.

This all sounds fine and reasonable. But what I had missed is
that in mutex_optimistic_spin(), not only do we call schedule()
but we set TASK_RUNNABLE right before doing so.

This is ok for that invocation of schedule(). But when we come
back we re-set the blocked_on we had just cleared, but we do not
re-set the task state to TASK_INTERRUPTIBLE/UNINTERRUPTIBLE.

This means we have a task that is blocked_on & TASK_RUNNABLE,
so when the proxy execution code dequeues the task, we are
in trouble since future wakeups will be shortcut by the
ttwu_state_match() check.

Thus, to avoid this, after mutex_optimistic_spin(), set the task
state back when we set blocked_on.

Many many thanks again to Vineeth for his very useful testing
driver that uncovered this long hidden bug, that I hadn't
tripped in all my testing! Very impressed with the problems he's
uncovered!

Reported-by: Vineeth Pillai <vineethrp@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Vineeth Pillai <vineethrp@google.com>
Link: https://patch.msgid.link/20260430215103.2978955-3-jstultz@google.com

Merge branch 'tip/sched/urgent'

Pick up urgent fixes.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>

nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks

When nvme_ns_head_submit_bio() remaps a bio from the multipath head to a
per-path namespace, bio_set_dev() clears BIO_REMAPPED.  The remapped bio
is then resubmitted through submit_bio_noacct() which calls
bio_check_eod() because BIO_REMAPPED is not set.

This races with nvme_ns_remove() which zeroes the per-path capacity
before synchronize_srcu():

  CPU 0 (IO submission)
  ---------------------
  srcu_read_lock()
  nvme_find_path() -> ns
    [NVME_NS_READY is set]

  CPU 1 (namespace removal)
  -------------------------
  clear_bit(NVME_NS_READY)
  set_capacity(ns->disk, 0)
  synchronize_srcu()  <- blocks

  CPU 0 (IO submission)
  ---------------------
  bio_set_dev(bio, ns->disk->part0)
    [clears BIO_REMAPPED]
  submit_bio_noacct(bio)
    -> bio_check_eod() sees capacity=0
    -> bio fails with IO error

The SRCU read lock prevents synchronize_srcu() from completing, but does
not prevent set_capacity(0) from executing.  The bio fails the EOD check
before it reaches the NVMe driver, so nvme_failover_req() never gets a
chance to redirect it to another path of multipath.  IO errors are
reported to the application despite another path being available.

On older kernels (before commit 0b64682e78f7 "block: skip unnecessary
checks for split bio"), the same race was also reachable through split
remainders resubmitted via submit_bio_noacct().

Fix this by setting BIO_REMAPPED after bio_set_dev() in
nvme_ns_head_submit_bio().  This skips bio_check_eod() on the per-path
device; the EOD check already passed on the multipath head.

NVMe per-path namespace devices are always whole disks (bd_partno=0), so
the blk_partition_remap() skip also gated by BIO_REMAPPED is a no-op.
The flag does not persist across failover and cannot go stale if the
namespace geometry changes between attempts: nvme_failover_req() calls
bio_set_dev() to redirect the bio back to the multipath head, which
clears BIO_REMAPPED.  When nvme_requeue_work() resubmits through
submit_bio_noacct(), bio_check_eod() runs normally against the current
capacity.

Same approach as commit 3a905c37c351 ("block: skip bio_check_eod for
partition-remapped bios").

Fixes: a7c7f7b2b641 ("nvme: use bio_set_dev to assign ->bi_bdev")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Igor Achkinazi <igor.achkinazi@dell.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

xfrm: iptfs: fix use-after-free on first_skb in __input_process_payload

__input_process_payload() stores first_skb into xtfs->ra_newskb under
drop_lock when starting partial reassembly, then unlocks and breaks out
of the processing loop. The post-loop check reads xtfs->ra_newskb
without the lock to decide whether first_skb is still owned:

if (first_skb && first_iplen && !defer && first_skb != xtfs->ra_newskb)

Between spin_unlock and this read, a concurrent CPU running
iptfs_reassem_cont() (or the drop_timer hrtimer) can complete
reassembly, NULL xtfs->ra_newskb, and free the skb. The check then
evaluates first_skb != NULL as true, and pskb_trim/ip_summed/consume_skb
operate on the freed skb — a use-after-free in skbuff_head_cache.

Replace the unlocked read with a local bool that records whether
first_skb was handed to the reassembly state in the current call. The
flag is set after the existing spin_unlock, before the break, using the
pointer equality that is stable at that point (first_skb == skb iff
first_skb was stored in ra_newskb).

Fixes: 3f3339885fb3 ("xfrm: iptfs: add reusing received skb for the tunnel egress packet")
Signed-off-by: Zhenghang Xiao <kipreyyy@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

nvme-multipath: require exact iopolicy names for module parameter

The iopolicy module parameter uses strncmp prefix matching, so values
like "numax" are accepted as "numa". The per-subsystem sysfs attribute
already requires an exact match via sysfs_streq(). Parse both through
a shared helper so invalid values are rejected consistently.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: liyouhong <liyouhong@kylinos.cn>
Signed-off-by: Keith Busch <kbusch@kernel.org>

nvme-multipath: pass NS head to nvme_mpath_revalidate_paths()

In nvme_mpath_revalidate_paths(), we are passed a NS pointer and use that
to lookup the NS head and then use that same NS pointer as an iter variable.

It makes more sense pass the NS head and use a local variable for the NS
iter.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

USB: serial: io_ti: fix heap overflow in build_i2c_fw_hdr()

build_i2c_fw_hdr() allocates a fixed-size buffer of
(16*1024 - 512) + sizeof(struct ti_i2c_firmware_rec) bytes, then
copies le16_to_cpu(img_header->Length) bytes into it without
validating that Length fits within the available space after the
firmware record header.

img_header->Length is a __le16 from the firmware file and can be
up to 65535. check_fw_sanity() validates the total firmware size
but not img_header->Length specifically.

Fix by rejecting images where img_header->Length exceeds the
available destination space.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Korwel <adriank20047@gmail.com>
Signed-off-by: Johan Hovold <johan@kernel.org>

USB: serial: io_ti: fix heap overflow in get_manuf_info()

get_manuf_info() reads le16_to_cpu(rom_desc->Size) bytes from the
device I2C EEPROM into a buffer allocated with kmalloc_obj(), which
is sizeof(struct edge_ti_manuf_descriptor) = 10 bytes.

The Size field comes from the device and is only validated (in
check_i2c_image()) to make sure the descriptor fits within
TI_MAX_I2C_SIZE (16384 bytes), not against the destination buffer size.
A malicious USB device can therefore set Size to any value up to 16377,
causing a heap overflow of up to 16367 bytes when plugged into a host
running this driver.

valid_csum() is called after read_rom() and also iterates
buffer[0..Size-1], compounding the out-of-bounds access.

Fix by rejecting descriptors with unexpected length before calling
read_rom().

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Korwel <adriank20047@gmail.com>
[ johan: amend commit message; also check for short descriptors ]
Signed-off-by: Johan Hovold <johan@kernel.org>

pps: Convert to ktime_get_snapshot_id()

ktime_get_snapshot() resolves to ktime_get_snapshot_id(CLOCK_REALTIME).

Make it obvious in the code and convert the readout to use the
snapshot::systime and monoraw fields instead of snapshot::real and raw,
which aregoing away.

Similar to the PPS generators, avoid the more expensive snapshot when
CONFIG_NTP_PPS is disabled.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Arthur Kiyanovski <akiyano@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260529195557.123410250@kernel.org

pps: generators: Use ktime_get_real_ts64() instead of ktime_get_snapshot()

There is no reason to use the more complex ktime_get_snapshot() for
retrieving CLOCK_REALTIME.

Just use ktime_get_real_ts64(), which avoids the extra timespec64
conversion as a bonus.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Arthur Kiyanovski <akiyano@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Rodolfo Giometti <giometti@enneenne.com>
Link: https://patch.msgid.link/20260529195557.074439049@kernel.org

timekeeping: Use system_time_snapshot::systime/monoraw instead of ::real/raw

system_time_snapshot::systime provides the same information as
system_time_snapshot::real when the snapshot was taken with
ktime_get_snapshot_id(CLOCK_REALTIME).

Convert the history interpolation over to use 'systime' and 'monoraw' as
'real/raw' are going away once all users are converted.

As a side effect this is the first step to support CLOCK_AUX with
get_device_crosstime_stamp() and the history interpolation.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Tested-by: Arthur Kiyanovski <akiyano@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260529195557.024415766@kernel.org

timekeeping: Provide ktime_get_snapshot_id()

ktime_get_snapshot() provides a snapshot of the underlying clocksource
counter value and the corresponding CLOCK_MONOTONIC_RAW, CLOCK_REALTIME and
CLOCK_BOOTTIME timestamps.

There is no usage of CLOCK_REALTIME and CLOCK_BOOTTIME at the same time and
CLOCK_BOOTTIME support was just added for the ARM64 KVM tracing mechanism,
which needs CLOCK_BOOTTIME and the underlying clocksource counter value.

ktime_get_snapshot() is also not suitable for usage with CLOCK_AUX, but
that's a prerequisite to support PTP hardware timestamping for CLOCK_AUX
steering.

As a first step, rename ktime_get_snapshot() to ktime_get_snapshot_id(),
which now takes a clockid argument to select the clock which needs to be
captured. The result is stored in system_time_snapshot::systime, which will
replace the system_time_snapshot::real/boot members once all usage sites
have been converted.

ktime_get_snapshot() is a simple wrapper which hands in CLOCK_REALTIME as
clockid argument for the conversion period. That means CLOCK_REALTIME is
now captured twice, but that redunancy is only temporary.

As all usage sites of struct system_time_snapshot has to be updated anyway,
rename the 'raw' member to 'monoraw' for clarity.

No functional change vs. current users of ktime_get_snapshot()

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Tested-by: Arthur Kiyanovski <akiyano@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260529195556.971591633@kernel.org

ext2: fix ignored return value of generic_write_sync()

Fix ext2_dio_write_iter() to propagate the error returned by
generic_write_sync() instead of silently discarding it, which could
cause write(2) to return success to userspace on O_SYNC/O_DSYNC files
even when the sync failed.

The correct pattern, already used in ext2_dax_write_iter() in the same
file and in ext4, xfs, f2fs among others, is:
if (ret > 0)
ret = generic_write_sync(iocb, ret);

Found by Linux Verification Center (linuxtesting.org) with SVACE.

[JK: Reflect also filemap_write_and_wait() return value]

Fixes: fb5de4358e1a ("ext2: Move direct-io to use iomap")
Signed-off-by: Danila Chernetsov <listdansp@mail.ru>
Link: https://patch.msgid.link/20260530122311.136803-1-listdansp@mail.ru
Signed-off-by: Jan Kara <jack@suse.cz>

NFS: write_completion: dereference loop-local req, not hdr->req

5d3869a41f36 ("NFS: fix writeback in presence of errors") introduced
a dereference of hdr->req->wb_lock_context in nfs_write_completion's
per-request loop.  hdr->req is set once at nfs_pgheader_init() time
and is not refcount-protected for the lifetime of the loop; when hdr
aggregates requests from multiple page groups (common under heavy
NFSv3 writeback), a parallel COMMIT on hdr->req's group can drop the
last reference and free it while the outer loop is still iterating
requests from other groups.  KASAN catches this as an 8-byte read at
offset +24 of a freed nfs_page slab object (wb_lock_context).

All requests in a given pgio share the same open_context, so reading
the loop-local req's wb_lock_context yields the same value and is
safe -- req is still on hdr->pages and holds its writeback kref
through the commit branch.

Caught with kasan:

BUG: KASAN: slab-use-after-free in nfs_write_completion+0x8f8/0xa50 [nfs]
Read of size 8 at addr ffff888118af2058 by task kworker/u16:16/122062
CPU: 2 UID: 0 PID: 122062 Comm: kworker/u16:16 Kdump: loaded Not tainted 7.1.0-rc4+ #ge05a759574b2 PREEMPT
Workqueue: nfsiod rpc_async_release
Call Trace:
<TASK>
dump_stack_lvl+0xaf/0x100
? nfs_write_completion+0x8f8/0xa50 [nfs]
print_report+0x157/0x4a1
? __virt_addr_valid+0x1fb/0x400
? nfs_write_completion+0x8f8/0xa50 [nfs]
kasan_report+0xc2/0x190
? nfs_write_completion+0x8f8/0xa50 [nfs]
nfs_write_completion+0x8f8/0xa50 [nfs]
? nfs_commit_release_pages+0xbd0/0xbd0 [nfs]
? lock_acquire+0x182/0x2e0
? process_one_work+0x937/0x1890
? nfs_pgio_header_alloc+0xd0/0xd0 [nfs]
rpc_free_task+0xee/0x160
rpc_async_release+0x5d/0xb0
process_one_work+0x9b0/0x1890
? pwq_dec_nr_in_flight+0xed0/0xed0
? rpc_final_put_task+0x140/0x140
worker_thread+0x75a/0x10a0
? process_one_work+0x1890/0x1890
? kthread+0x1af/0x4d0
? process_one_work+0x1890/0x1890
kthread+0x3d3/0x4d0
? kthread_affine_node+0x2c0/0x2c0
ret_from_fork+0x669/0xa50
? native_tss_update_io_bitmap+0x660/0x660
? __switch_to+0x9dd/0x1310
? kthread_affine_node+0x2c0/0x2c0
ret_from_fork_asm+0x11/0x20
</TASK>

Allocated by task 121997 on cpu 3 at 31643.290294s:
kasan_save_stack+0x1e/0x40
kasan_save_track+0x13/0x60
__kasan_slab_alloc+0x62/0x70
kmem_cache_alloc_noprof+0x1ab/0x4e0
nfs_page_create+0x152/0x460 [nfs]
nfs_page_create_from_folio+0x7e/0x210 [nfs]
nfs_update_folio+0x7a9/0x32a0 [nfs]
nfs_write_end+0x290/0xc60 [nfs]
generic_perform_write+0x4ce/0x990
nfs_file_write+0x6b3/0xce0 [nfs]
vfs_write+0x63c/0xfa0
ksys_write+0x122/0x240
do_syscall_64+0xc3/0x13f0
entry_SYSCALL_64_after_hwframe+0x4b/0x53

Freed by task 122046 on cpu 0 at 31647.037964s:
kasan_save_stack+0x1e/0x40
kasan_save_track+0x13/0x60
kasan_save_free_info+0x37/0x60
__kasan_slab_free+0x3b/0x60
kmem_cache_free+0x11b/0x5a0
nfs_page_group_destroy+0x13a/0x210 [nfs]
nfs_unlock_and_release_request+0x64/0x90 [nfs]
nfs_commit_release_pages+0x339/0xbd0 [nfs]
nfs_commit_release+0x51/0xb0 [nfs]
rpc_free_task+0xee/0x160
rpc_async_release+0x5d/0xb0
process_one_work+0x9b0/0x1890
worker_thread+0x75a/0x10a0
kthread+0x3d3/0x4d0
ret_from_fork+0x669/0xa50
ret_from_fork_asm+0x11/0x20

The buggy address belongs to the object at ffff888118af2040\x0a which belongs to the cache nfs_page of size 96
The buggy address is located 24 bytes inside of\x0a freed 96-byte region [ffff888118af2040, ffff888118af20a0)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x118af2
head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x4000000000000040(head|zone=2)
page_type: f5(slab)
raw: 4000000000000040 ffff88818cf2c4c0 ffffea000e61b990 ffffea0004e7d110
raw: 0000000000000000 0000000800190019 00000000f5000000 0000000000000000
head: 4000000000000040 ffff88818cf2c4c0 ffffea000e61b990 ffffea0004e7d110
head: 0000000000000000 0000000800190019 00000000f5000000 0000000000000000
head: 4000000000000001 ffffffffffffff81 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000002
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 1, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 121997, tgid 121997 (rsync), ts 31643290274577, free_ts 31642154777182
post_alloc_hook+0xd1/0x100
get_page_from_freelist+0xbad/0x2910
__alloc_frozen_pages_noprof+0x1c6/0x4a0
allocate_slab+0x330/0x620
___slab_alloc+0xe9/0x930
kmem_cache_alloc_noprof+0x35b/0x4e0
nfs_page_create+0x152/0x460 [nfs]
nfs_page_create_from_folio+0x7e/0x210 [nfs]
nfs_update_folio+0x7a9/0x32a0 [nfs]
nfs_write_end+0x290/0xc60 [nfs]
generic_perform_write+0x4ce/0x990
nfs_file_write+0x6b3/0xce0 [nfs]
vfs_write+0x63c/0xfa0
ksys_write+0x122/0x240
do_syscall_64+0xc3/0x13f0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
page last free pid 122202 tgid 122202 stack trace:
__free_frozen_pages+0x6da/0xf30
qlist_free_all+0x53/0x130
kasan_quarantine_reduce+0x198/0x1f0
__kasan_slab_alloc+0x46/0x70
kmem_cache_alloc_noprof+0x1ab/0x4e0
__alloc_object+0x2f/0x230
__create_object+0x22/0x80
kmem_cache_alloc_node_noprof+0x416/0x4d0
__alloc_skb+0x146/0x6e0
tcp_stream_alloc_skb+0x35/0x660
tcp_sendmsg_locked+0x1746/0x4260
tcp_sendmsg+0x2f/0x40
inet_sendmsg+0x9e/0xe0
__sock_sendmsg+0xd9/0x180
sock_sendmsg+0x122/0x200
xprt_sock_sendmsg+0x4ff/0x9a0

Memory state around the buggy address:
ffff888118af1f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
ffff888118af1f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888118af2000: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
                                                    ^
ffff888118af2080: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
ffff888118af2100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================

Reviewed-by Jeff Layton <jlayton@kernel.org>

Fixes: 5d3869a41f36 ("NFS: fix writeback in presence of errors")
Cc: Olga Kornievskaia <okorniev@redhat.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: linux-nfs@vger.kernel.org
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

drm/imx: Fix three kernel-doc warnings in dcss-scaler.c

Fix the following W=1 kerneldoc warnings by adding the missing parameter
descriptions for @phase0_identity and @nn_interpolation in
dcss_scaler_filter_design() and @phase0_identity in
dcss_scaler_gaussian_filter()

Warning: drivers/gpu/drm/imx/dcss/dcss-scaler.c:173 function parameter 'phase0_identity' not described in 'dcss_scaler_gaussian_filter'
Warning: drivers/gpu/drm/imx/dcss/dcss-scaler.c:270 function parameter 'phase0_identity' not described in 'dcss_scaler_filter_design'
Warning: drivers/gpu/drm/imx/dcss/dcss-scaler.c:270 function parameter 'nn_interpolation' not described in 'dcss_scaler_filter_design'

Fixes: 9021c317b770 ("drm/imx: Add initial support for DCSS on iMX8MQ")
Signed-off-by: Yicong Hui <yiconghui@gmail.com>
Reviewed-by: Laurentiu Palcu <laurentiu.palcu@oss.nxp.com>
Link: https://patch.msgid.link/20260406180013.2442096-1-yiconghui@gmail.com
Signed-off-by: Liu Ying <victor.liu@nxp.com>

kbuild: rust: clean `*.long-type-*.txt` files

When rustc prints an error containing a long type that doesn't fit
in a line, it will write the whole thing in a .txt file -- see commit
420dd187e157 (".gitignore: ignore rustc long type txt files") for more
details.

These files are purely compiler artifacts and are not created
intentionally by the build system.

Thus add them to the `clean` target to stop them from cluttering up the
source tree.

Suggested-by: Miguel Ojeda <ojeda@kernel.org>
Link: https://github.com/Rust-for-Linux/linux/issues/1236
Signed-off-by: Joel Kamminga <contact@jkam.dev>
Acked-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20260530184944.10459-1-contact@jkam.dev
[ Reworded and linked to the previous related commit. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>

accel/ivpu: Add buffer overflow check in MS get_info_ioctl

Add validation that the info size returned from the metric stream info
query is not exceeded when checked against the allocated buffer size.
If the firmware returns a size larger than the buffer, reject the
operation with -EOVERFLOW instead of proceeding with an incorrect
buffer copy.

Fixes: cdfad4db7756 ("accel/ivpu: Add NPU profiling support")
Cc: stable@vger.kernel.org # v6.18+
Signed-off-by: Andrzej Kacprowski <andrzej.kacprowski@linux.intel.com>
Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Link: https://patch.msgid.link/20260529120841.135852-1-andrzej.kacprowski@linux.intel.com

accel/ivpu: Add bounds checks for firmware log indices

Add validation that read and write indices in the firmware log buffer
are within valid bounds (< data_size) before using them. If
out-of-bounds indices are encountered (from firmware), clamp them to
safe values instead of proceeding with invalid offsets.

This prevents potential out-of-bounds buffer access when firmware
supplies invalid log indices.

Fixes: 1fc1251149a7 ("accel/ivpu: Refactor functions in ivpu_fw_log.c")
Cc: stable@vger.kernel.org # v6.18+
Signed-off-by: Andrzej Kacprowski <andrzej.kacprowski@linux.intel.com>
Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Link: https://patch.msgid.link/20260529115842.135378-1-andrzej.kacprowski@linux.intel.com

accel/ivpu: Add bounds check for firmware runtime memory

Validate that the firmware runtime memory specified in the image
header is properly aligned and sized to hold the firmware image.
This prevents errors during memory allocation and image transfer.

Fixes: 2007e210b6a1 ("accel/ivpu: Split FW runtime and global memory buffers")
Cc: stable@vger.kernel.org # v7.0+
Signed-off-by: Andrzej Kacprowski <andrzej.kacprowski@linux.intel.com>
Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Link: https://patch.msgid.link/20260529120853.135876-1-andrzej.kacprowski@linux.intel.com

platform/chrome: Prevent build for big-endian systems

Both ARM and ARM64 which are a dependency for CHROME_PLATFORMS have
seldomly used big-endian variants.

The ChromeOS EC framework and drivers are written under the assumption
that they will be running on a little-endian systems. Code which would
be broken on big-endian can be found trivially.

Some examples:
cros_ec.c: suspend_params.sleep_timeout_ms = ec_dev->suspend_timeout_ms
cros_ec_debugfs.c: resp->time_since_ec_boot_ms
cros_ec_wdt.c: arg.req.reboot_timeout_sec = wdd->timeout

Prevent the build for big-endian systems.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20260531-cros-big-endian-v1-2-0cc90f39c636@weissschuh.net
Signed-off-by: Tzung-Bi Shih <tzungbi@kernel.org>

platform/chrome: Remove superfluous dependencies from CROS_EC

CROS_EC depends on CHROME_PLATFORMS which already declares these
dependencies.

Remove the duplication.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20260531-cros-big-endian-v1-1-0cc90f39c636@weissschuh.net
Signed-off-by: Tzung-Bi Shih <tzungbi@kernel.org>

Merge tag 'for-7.1/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fix from Mikulas Patocka:

- fix race condition in dm-cache-policy-smq

* tag 'for-7.1/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm cache policy smq: check allocation under invalidate lock

devlink: Release nested relation on devlink free

devlink relation state is normally released from devl_unregister(), which
calls devlink_rel_put(). This misses devlink instances that get a nested
relation before registration and then fail probe before devl_register() is
reached.

That flow can happen for SFs. The child devlink gets linked to its
parent before registration, then a later probe error calls devlink_free()
directly. Since the instance was never registered, devl_unregister() is not
called and devlink->rel is leaked.

Release any pending relation from devlink_free() as well. The registered
path is unchanged because devl_unregister() already clears devlink->rel
before devlink_free() runs.

Fixes: c137743bce02 ("devlink: introduce object and nested devlink relationship infra")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20260528191411.3270532-1-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'auxdisplay-v7.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/andy/linux-auxdisplay

Pull auxdisplay updates from Andy Shevchenko:

- Fix potential out-of-bound access in line-display library

- Miscellaneous refactoring and cleaning up

[ Andy says this could easily be delayed until 7.2, but it's _so_ tiny
  that it's more work for me to schedule it for later than to just take
  it now, and just doesn't seem worth delaying    - Linus ]

* tag 'auxdisplay-v7.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/andy/linux-auxdisplay:
  auxdisplay: Kconfig: drop unneeded quotes in PANEL_BOOT_MESSAGE dep
  auxdisplay: line-display: fix OOB read on zero-length message_store()
  auxdisplay: max6959: use regmap_assign_bits() in max6959_enable()

l2tp: pppol2tp: hold reference to session in pppol2tp_ioctl()

pppol2tp_ioctl() read sock->sk->sk_user_data directly without any
locks or reference counting.  If a controllable sleep was induced during
copy_from_user() (e.g. via a userfaultfd page fault sleep), a concurrent
socket close could trigger pppol2tp_session_close() asynchronously.  This
frees the l2tp_session structure via the l2tp_session_del_work workqueue.
Upon resuming, the ioctl thread dereferences the stale session pointer,
resulting in a Use-After-Free (UAF).

Fix this by securely fetching the session reference using the RCU-safe,
refcounted helper pppol2tp_sock_to_session(sk) on entry.  This locks the
session's refcount across the sleep.  We structured the function to exit
via standard err breaks, guaranteeing that l2tp_session_put() is cleanly
called on all return paths to drop the reference.

To preserve existing behavior we validate the session and its magic
signature only for the specific L2TP commands that require it.  This
ensures that generic/unknown ioctls called on an unconnected socket
still return -ENOIOCTLCMD and correctly fall back to generic handlers
(e.g. in sock_do_ioctl()).

Signed-off-by: Lee Jones <lee@kernel.org>
Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Link: https://patch.msgid.link/20260527133630.2120612-1-lee@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6lowpan: fix off-by-one in multicast context address compression

The second memcpy in lowpan_iphc_mcast_ctx_addr_compress() uses
&data[1] as destination and &ipaddr->s6_addr[11] as source, but
both should be offset by one: &data[2] and &ipaddr->s6_addr[12]
respectively.

This off-by-one has two consequences:
1. data[1] is overwritten with s6_addr[11], corrupting the RIID
   field in the compressed multicast address
2. data[5] is never written, so uninitialized kernel stack memory
   is transmitted over the network via lowpan_push_hc_data(),
   leaking kernel stack contents

The correct inline data layout must match what the decompression
function lowpan_uncompress_multicast_ctx_daddr() expects:
  data[0..1] = s6_addr[1..2]  (flags/scope + RIID)
  data[2..5] = s6_addr[12..15] (group ID)

Also zero-initialize the data array as a defensive measure against
similar bugs in the future.

Fixes: 5609c185f24d ("6lowpan: iphc: add support for stateful compression")
Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn>
Reported-by: Ao Wang <wangao@seu.edu.cn>
Reported-by: Xuewei Feng <fengxw06@126.com>
Reported-by: Qi Li <qli01@tsinghua.edu.cn>
Reported-by: Ke Xu <xuke@tsinghua.edu.cn>
Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://patch.msgid.link/20260527081806.42747-1-zhaoyz24@mails.tsinghua.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

RDMA/core: Validate the passed in fops for ib_get_ucaps()

Sashiko pointed out it is not safe to rely only on the devt because
char/block alias so if the user finds a block device with the same dev_t
it can masquerade as a ucap cdev fd.

Test the f_ops to only accept authentic cdevs.

Link: https://patch.msgid.link/r/0-v1-fd9482545e37+1e25-ib_ucaps_fd_ops_jgg@nvidia.com
Cc: stable@vger.kernel.org
Fixes: 61e51682816d ("RDMA/uverbs: Introduce UCAP (User CAPabilities) API")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

net/sched: act_api: use RCU with deferred freeing for action lifecycle

When NEWTFILTER and DELFILTER are run concurrently it is possible to create a
race with an associated action.

Let's illustrate with CPU0 running NEWTFILTER and CPU1 running DELFILTER:

0: mutex_lock() <-- holds the idr lock
0: rcu_read_lock()
0: p = idr_find(idr, index) <-- action p is valid (RCU protects IDR)
0: mutex_unlock() <-- releases the idr lock
1: refcount_dec_and_mutex_lock() <-- refcnt 1->0, mutex held
1: idr_remove(idr, index) <-- Action removed from IDR
1: mutex_unlock() <-- mutex released allowing us to delete the action
1: tcf_action_cleanup(p); kfree(p) <-- Kfrees p immediately, no deferral
0: refcount_inc_not_zero(&p->tcfa_refcnt) <-- ouch, UAF p points to freed memory

This patch fixes the race condition between NEWTFILTER and DELFILTER by
adding struct rcu_head to tc_action used in the deferral and introducing a
call_rcu() in the delete path to defer the final kfree().

Note: this is a revert of commit d7fb60b9cafb ("net_sched: get rid of tcfa_rcu")
but also modernization/simplification to directly use kfree_rcu().

Let's illustrate the new restored code path:

0: rcu_read_lock()
1: refcount_dec_and_mutex_lock() <-- refcnt 1->0, mutex held
1: idr_remove(idr, index)
1: mutex_unlock()
1: call_rcu(&p->tcfa_rcu, tcf_action_rcu_free) <-- defer kfree after grace period
0: p = idr_find(idr, index)
0: refcount_inc_not_zero(&p->tcfa_refcnt) <-- fails, refcnt already 0
1: rcu_read_unlock() <-- release so freeing can run after grace period

After CPU1 calls idr_remove(), the object is no longer reachable through the IDR.
CPU0's subsequent idr_find() will return NULL, and even if it still held a
stale pointer, the immediate kfree() is now deferred until after the RCU grace
period, so no UAF can occur.

Fixes: d7fb60b9cafb ("net_sched: get rid of tcfa_rcu")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: Kyle Zeng <kylebot@openai.com>
Tested-by: Victor Nogueira <victor@mojatatu.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Tested-by: Kyle Zeng <kylebot@openai.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260531160812.68020-1-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

x86/cpu: Keep the PROCESSOR_SELECT menu together

Having a stray kconfig symbol in the middle of the PROCESSOR_SELECT menu
(this symbol plus its dependent symbols) causes the menu dependencies
not to be displayed correctly in "make {menu,n,g,x}config".

Move the BROADCAST_TLB_FLUSH symbol away from the PROCESSOR_SELECT menu
so that the list of processors is displayed correctly.

Fixes: 767ae437a32d ("x86/mm: Add INVLPGB feature and Kconfig entry")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20260519173526.10985-1-rdunlap@infradead.org

docs/dyndbg: explain flags parse 1st

When writing queries to >control, flags are parsed 1st, since they are
the only required field, and they require specific compositions. So
if the flags draw an error (on those specifics), then keyword errors
aren't reported. This can be mildly confusing/annoying, so explain it
instead.

cc: linux-doc@vger.kernel.org
Reviewed-by: Louis Chauvet <louis.chauvet@bootlin.com>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260502-dyndbg-doc-v1-2-67cc4a93a77e@gmail.com>

docs/dyndbg: update examples \012 to \n

commit 47ea6f99d06e ("dyndbg: use ESCAPE_SPACE for cat control")
changed the control-file to display format strings with "\n" rather
than "\012". Update the docs to match the new reality.

Reviewed-by: Louis Chauvet <louis.chauvet@bootlin.com>
Tested-by: Louis Chauvet <louis.chauvet@bootlin.com>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260502-dyndbg-doc-v1-1-67cc4a93a77e@gmail.com>

docs: kernel-parameters: Fix stale sticore file paths

Update file paths for sticore references that
became stale when drivers were reorganized:
- drivers/video/console/sticore.c -> drivers/video/

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260531140541.4115641-1-costa.shul@redhat.com>

docs: real-time: Fix duplicated sched(7) text

The man page reference appeared twice - once as plain text and
once as a hyperlink. Remove the plain text duplicate.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260531141823.4118954-1-costa.shul@redhat.com>

docs: kgdb: Fix stale source file paths

Update two file paths that became stale when kgdb/kdb sources
were reorganized:
- kernel/debugger/debug_core.c -> kernel/debug/debug_core.c
- drivers/char/kdb_keyboard.c -> kernel/debug/kdb/kdb_keyboard.c

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260531140207.4114764-1-costa.shul@redhat.com>

docs: sonypi: Fix stale header file path

The sonypi.h header was moved from drivers/char/ to
include/linux/. Update the reference.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260531135850.4113774-1-costa.shul@redhat.com>

docs: kernel-parameters: Remove sa1100ir IrDA parameter

The sa1100ir parameter referenced drivers/net/irda/sa1100_ir.c,
which was removed along with the entire IrDA stack in commit d64c2a76123f
("staging: irda: remove the irda network stack and drivers").

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260531135455.4113157-1-costa.shul@redhat.com>

iommu: Documentation: rearrange, update kernel-parameters

Add text for some undescribed iommu= parameters (merge, nomerge,
biomerge, panic, nopanic, pt, nopt). Add "usedac" and its description.
Add that iommu=pt is equivalent to iommu.passthrough=1
and that iommu=nopt is equivalent to iommu.passthrough=0.

Move the PPC/POWERNV heading & its option "nobypass" to a separate
area since the current "iommu=" applies only to X86 (according to
its heading).

Unindent the AMD GART IOMMU options heading to make it stand out.
Also add its kconfig symbol name to be explicit about what these
options apply to.

Make sure that the IOMMU options that are listed under AMD Gart
HW IOMMU-specific options are only for that product; i.e., add "force"
there and move "merge", "nomerge", and "panic" to the general IOMMU
options area.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260528054611.1524937-1-rdunlap@infradead.org>

Merge tag 'md-7.2-20260531' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-7.2/block

Pull MD updates and fixes from Yu Kuai:

"Bug Fixes:
- Only requeue dm-raid bios when dm is suspending. (Benjamin Marzinski)
- Reset raid10 read_slot when reusing r10bio for discard. (Chen Cheng)
- Fix raid1/raid10 deadlock in read error recovery path.
   (Abd-Alrhman Masalkhi)
- Fix raid1/raid10 error-path detection with md_cloned_bio().
   (Abd-Alrhman Masalkhi)
- Fix raid1/raid10 bio accounting for split md cloned bios.
   (Abd-Alrhman Masalkhi)
- Fix raid1 nr_pending leak in REQ_ATOMIC bad-block path.
   (Abd-Alrhman Masalkhi)

Improvements:
- Skip redundant raid_disks updates when the value is unchanged.
   (Abd-Alrhman Masalkhi)

Cleanups:
- Update MAINTAINERS email addresses. (Yu Kuai, Li Nan)
- Clean up raid1 read error handling. (Christoph Hellwig)
- Move the exceed_read_errors condition out of fix_read_error().
   (Christoph Hellwig)
- Use str_plural() in raid0 dump_zones(). (Thorsten Blum)"

* tag 'md-7.2-20260531' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
  md/raid0: use str_plural helper in dump_zones
  raid1: fix nr_pending leak in REQ_ATOMIC bad-block error path
  md/raid1: move the exceed_read_errors condition out of fix_read_error
  md/raid1: cleanup handle_read_error
  md/raid1,raid10: fix bio accounting for split md cloned bios
  md/raid1,raid10: fix error-path detection with md_cloned_bio()
  md/raid1,raid10: fix deadlock in read error recovery path
  md/raid10: reset read_slot when reusing r10bio for discard
  md: skip redundant raid_disks update when value is unchanged
  dm-raid: only requeue bios when dm is suspending
  MAINTAINERS: Update Li Nan's E-mail address
  MAINTAINERS: update Yu Kuai's email address

docs: md: fix grammar in speed_limit description

Replace 'This are' with 'These are' in the md sysfs speed limit
section to correct grammar and improve readability.

Signed-off-by: Miguel Martín Gil <miguel.martin.gil.uni@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260525214554.2196-1-miguel.martin.gil.uni@gmail.com>

docs: changes.rst: restore pahole 1.26 minimum (regressed by sort)

Commit 9edd04c4189e ("docs: Raise minimum pahole version to 1.26 for
KF_IMPLICIT_ARGS kfuncs") raised the minimum required pahole version
from 1.22 to 1.26 in the requirements table and added a paragraph
explaining the failure mode for distributions still shipping pahole
v1.25 (e.g. Ubuntu 24.04 LTS).

The next day, commit ece7e57afd51 ("docs: changes.rst and ver_linux:
sort the lists") came through a different tree (docs vs sched_ext) and
re-flowed the table alphabetically, but its base did not include
9edd04c4189e.  When the two commits met in mainline, the textual rewrite
of the table won and the version bump was lost.  The added "Since Linux
7.0..." paragraph also disappeared.

The result is that changes.rst on master (v7.1-rc5) lists pahole 1.22
again, even though sched_ext kfuncs annotated with KF_IMPLICIT_ARGS
genuinely require v1.26 to produce a correct vmlinux BTF.  Users on
distributions with pahole v1.25 hit "func_proto incompatible with
vmlinux" when loading any sched_ext BPF program (scx_simple,
scx_qmap, ...) and have no documentation pointing them at the version
gap.

Restore both changes from 9edd04c4189e.

Fixes: ece7e57afd51 ("docs: changes.rst and ver_linux: sort the lists")
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260526022033.1301884-1-zhanxusheng@xiaomi.com>