git.ipfire.org Git - thirdparty/kernel/linux.git/log

workqueue: Show all busy workers in stall diagnostics

show_cpu_pool_hog() only prints workers whose task is currently running
on the CPU (task_is_running()).  This misses workers that are busy
processing a work item but are sleeping or blocked — for example, a
worker that clears PF_WQ_WORKER and enters wait_event_idle().  Such a
worker still occupies a pool slot and prevents progress, yet produces
an empty backtrace section in the watchdog output.

This is happening on real arm64 systems, where
toggle_allocation_gate() IPIs every single CPU in the machine (which
lacks NMI), causing workqueue stalls that show empty backtraces because
toggle_allocation_gate() is sleeping in wait_event_idle().

Remove the task_is_running() filter so every in-flight worker in the
pool's busy_hash is dumped.  The busy_hash is protected by pool->lock,
which is already held.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: Show in-flight work item duration in stall diagnostics

When diagnosing workqueue stalls, knowing how long each in-flight work
item has been executing is valuable. Add a current_start timestamp
(jiffies) to struct worker, set it when a work item begins execution in
process_one_work(), and print the elapsed wall-clock time in show_pwq().

Unlike current_at (which tracks CPU runtime and resets on wakeup for
CPU-intensive detection), current_start is never reset because the
diagnostic cares about total wall-clock time including sleeps.

Before: in-flight: 165:stall_work_fn [wq_stall]
After: in-flight: 165:stall_work_fn [wq_stall] for 100s

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: Rename pool->watchdog_ts to pool->last_progress_ts

The watchdog_ts name doesn't convey what the timestamp actually tracks.
This field tracks the last time a workqueue got progress.

Rename it to last_progress_ts to make it clear that it records when the
pool last made forward progress (started processing new work items).

No functional change.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: Use POOL_BH instead of WQ_BH when checking pool flags

pr_cont_worker_id() checks pool->flags against WQ_BH, which is a
workqueue-level flag (defined in workqueue.h). Pool flags use a
separate namespace with POOL_* constants (defined in workqueue.c).
The correct constant is POOL_BH. Both WQ_BH and POOL_BH are defined
as (1 << 0) so this has no behavioral impact, but it is semantically
wrong and inconsistent with every other pool-level BH check in the
file.

Fixes: 4cb1ef64609f ("workqueue: Implement BH workqueues to eventually replace tasklets")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>

Merge tag 'modules-7.0-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux

Pull module fixes from Sami Tolvanen:

- Fix a potential kernel panic in the module loader by adding a bounds
   check for the ELF section index. This prevents crashes if attempting
   to load a module that uses SHN_XINDEX or is corrupted.

- Fix the Kconfig menu layout for module versioning, signing, and
   compression options so they correctly appear as submenus in
   menuconfig.

- Remove a redundant lockdep_free_key_range() call in the load_module()
   error path. This is already handled by module_deallocate() calling
   free_mod_mem() since the module_memory rework.

* tag 'modules-7.0-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
  module: Fix kernel panic when a symbol st_shndx is out of bounds
  module: Fix the modversions and signing submenus
  module: Remove duplicate freeing of lockdep classes

Merge tag 'vfs-7.0-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

- kthread: consolidate kthread exit paths to prevent use-after-free

- iomap:
    - don't mark folio uptodate if read IO has bytes pending
    - don't report direct-io retries to fserror
    - reject delalloc mappings during writeback

- ns: tighten visibility checks

- netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict
   sequence

* tag 'vfs-7.0-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  iomap: reject delalloc mappings during writeback
  iomap: don't mark folio uptodate if read IO has bytes pending
  selftests: fix mntns iteration selftests
  nstree: tighten permission checks for listing
  nsfs: tighten permission checks for handle opening
  nsfs: tighten permission checks for ns iteration ioctls
  netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence
  kthread: consolidate kthread exit paths to prevent use-after-free
  iomap: don't report direct-io retries to fserror

Merge tag 'sysctl-7.00-fixes-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl

Pull sysctl fix from Joel Granados:

- Fix error when reporting jiffies converted values back to user space

Return the converted value instead of "Invalid argument" error

* tag 'sysctl-7.00-fixes-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
time/jiffies: Fix sysctl file error on configurations where USER_HZ < HZ

Merge tag 'media/v7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media fix from Mauro Carvalho Chehab:
"Fix for MPEG-TS decoder in dvb-net"

* tag 'media/v7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: dvb-net: fix OOB access in ULE extension header tables

Merge tag 'pinctrl-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:
"All of these are driver fixes except a memory leak in the
  pinconf_generic_parse_dt_config() helper which is the most
  important fix.

   - Rename and fix up the Intel Equilibrium immutable interrupt chip

   - Handle the Qualcomm QCS615 dual edge GPIO IRQ by adding the right
     flag

   - Fix a memory leak in the widely used pinconf_generic_parse_dt_config()
     and a more local leak in aml_dt_node_to_map_pinmux()

   - Fix double put in the Cirrus cs42l43_pin_probe()

   - Staticize amdisp_pinctrl_ops, Qualcomm SDM660 groups and functions

   - Unexport CIX sky1_pinctrl_pm_ops

   - Fix configuration of deferred pin in the Rockchip driver

   - Implement .get_direction() in the Sunxi driver squelching a dmesg
     warning message

   - Fix a readout of the last bank of registers in the Cypress CY8C95x0
     driver"

* tag 'pinctrl-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: cy8c95x0: Don't miss reading the last bank registers
  pinctrl: sunxi: Implement gpiochip::get_direction()
  pinctrl: rockchip: Fix configuring a deferred pin
  pinctrl: cirrus: cs42l43: Fix double-put in cs42l43_pin_probe()
  pinctrl: meson: amlogic-a4: Fix device node reference leak in aml_dt_node_to_map_pinmux()
  pinctrl: qcom: sdm660-lpass-lpi: Make groups and functions variables static
  pinctrl: cix: sky1: Unexport sky1_pinctrl_pm_ops
  pinctrl: amdisp: Make amdisp_pinctrl_ops variable static
  pinctrl: pinconf-generic: Fix memory leak in pinconf_generic_parse_dt_config()
  pinctrl: qcom: qcs615: Add missing dual edge GPIO IRQ errata flag
  pinctrl: equilibrium: fix warning trace on load
  pinctrl: equilibrium: rename irq_chip function callbacks

iomap: reject delalloc mappings during writeback

Filesystems should never provide a delayed allocation mapping to
writeback; they're supposed to allocate the space before replying.
This can lead to weird IO errors and crashes in the block layer if the
filesystem is being malicious, or if it hadn't set iomap->dev because
it's a delalloc mapping.

Fix this by failing writeback on delalloc mappings. Currently no
filesystems actually misbehave in this manner, but we ought to be
stricter about things like that.

Cc: stable@vger.kernel.org # v5.5
Fixes: 598ecfbaa742ac ("iomap: lift the xfs writeback code to iomap")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/20260302173002.GL13829@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>

Merge patch "iomap: don't mark folio uptodate if read IO has bytes pending"

Joanne Koong <joannelkoong@gmail.com> says:

This is a fix for this scenario:

->read_folio() gets called on a folio size that is 16k while the file is 4k:
  a) ifs->read_bytes_pending gets initialized to 16k
  b) ->read_folio_range() is called for the 4k read
  c) the 4k read succeeds, ifs->read_bytes_pending is now 12k and the
0 to 4k range is marked uptodate
  d) the post-eof blocks are zeroed and marked uptodate in the call to
iomap_set_range_uptodate()
  e) iomap_set_range_uptodate() sees all the ranges are marked
uptodate and it marks the folio uptodate
  f) iomap_read_end() gets called to subtract the 12k from
ifs->read_bytes_pending. it too sees all the ranges are marked
uptodate and marks the folio uptodate using XOR
  g) the XOR call clears the uptodate flag on the folio

The same situation can occur if the last range read for the folio is done as
an inline read and all the previous ranges have already completed by the time
the inline read completes.

For more context, the full discussion can be found in [1]. There was a
discussion about alternative approaches in that thread, but they had more
complications.

There is another discussion in v1 [2] about consolidating the read paths.
Until that is resolved, this patch fixes the issue.

[1] https://lore.kernel.org/linux-fsdevel/CAJnrk1Z9za5w4FoJqTGx50zR2haHHaoot1KJViQyEHJQq4=34w@mail.gmail.com/#t
[2] https://lore.kernel.org/linux-fsdevel/20260219003911.344478-1-joannelkoong@gmail.com/T/#u

* patches from https://patch.msgid.link/20260303233420.874231-1-joannelkoong@gmail.com:
  iomap: don't mark folio uptodate if read IO has bytes pending

Link: https://patch.msgid.link/20260303233420.874231-1-joannelkoong@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>

iomap: don't mark folio uptodate if read IO has bytes pending

If a folio has ifs metadata attached to it and the folio is partially
read in through an async IO helper with the rest of it then being read
in through post-EOF zeroing or as inline data, and the helper
successfully finishes the read first, then post-EOF zeroing / reading
inline will mark the folio as uptodate in iomap_set_range_uptodate().

This is a problem because when the read completion path later calls
iomap_read_end(), it will call folio_end_read(), which sets the uptodate
bit using XOR semantics. Calling folio_end_read() on a folio that was
already marked uptodate clears the uptodate bit.

Fix this by not marking the folio as uptodate if the read IO has bytes
pending. The folio uptodate state will be set in the read completion
path through iomap_end_read() -> folio_end_read().

Reported-by: Wei Gao <wegao@suse.com>
Suggested-by: Sasha Levin <sashal@kernel.org>
Tested-by: Wei Gao <wegao@suse.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: stable@vger.kernel.org # v6.19
Link: https://lore.kernel.org/linux-fsdevel/aYbmy8JdgXwsGaPP@autotest-wegao.qe.prg2.suse.org/
Fixes: b2f35ac4146d ("iomap: add caller-provided callbacks for read and readahead")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260303233420.874231-2-joannelkoong@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>

time/jiffies: Fix sysctl file error on configurations where USER_HZ < HZ

Commit 2dc164a48e6fd ("sysctl: Create converter functions with two new
macros") incorrectly returns error to user space when jiffies sysctl
converter is used. The old overflow check got replaced with an
unconditional one:
+ if (USER_HZ < HZ)
+ return -EINVAL;
which will always be true on configurations with "USER_HZ < HZ".

Remove the check; it is no longer needed as clock_t_to_jiffies() returns
ULONG_MAX for the overflow case and proc_int_u2k_conv_uop() checks for
"> INT_MAX" after conversion

Fixes: 2dc164a48e6fd ("sysctl: Create converter functions with two new macros")
Reported-by: Colm Harrington <colm.harrington@oracle.com>
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>

Merge tag 'cgroup-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

- Fix circular locking dependency in cpuset partition code by
   deferring housekeeping_update() calls to a workqueue instead
   of calling them directly under cpus_read_lock

- Fix null-ptr-deref in rebuild_sched_domains_cpuslocked() when
   generate_sched_domains() returns NULL due to kmalloc failure

- Fix incorrect cpuset behavior for effective_xcpus in
   partition_xcpus_del() and cpuset_update_tasks_cpumask()
   in update_cpumasks_hier()

- Fix race between task migration and cgroup iteration

* tag 'cgroup-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: fix null-ptr-deref in rebuild_sched_domains_cpuslocked
  cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
  cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together
  kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command
  cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed
  cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
  cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier()
  cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del()
  cgroup: fix race between task migration and iteration

Merge tag 'sched_ext-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

- Fix starvation of scx_enable() under fair-class saturation by
   offloading the enable path to an RT kthread

- Fix out-of-bounds access in idle mask initialization on systems with
   non-contiguous NUMA node IDs

- Fix a preemption window during scheduler exit and a refcount
   underflow in cgroup init error path

- Fix SCX_EFLAG_INITIALIZED being a no-op flag

- Add READ_ONCE() annotations for KCSAN-clean lockless accesses and
   replace naked scx_root dereferences with container_of() in kobject
   callbacks

- Tooling and selftest fixes: compilation issues with clang 17,
   strtoul() misuse, unused options cleanup, and Kconfig sync

* tag 'sched_ext-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix starvation of scx_enable() under fair-class saturation
  sched_ext: Remove redundant css_put() in scx_cgroup_init()
  selftests/sched_ext: Fix peek_dsq.bpf.c compile error for clang 17
  selftests/sched_ext: Add -fms-extensions to bpf build flags
  tools/sched_ext: Add -fms-extensions to bpf build flags
  sched_ext: Use READ_ONCE() for plain reads of scx_watchdog_timeout
  sched_ext: Replace naked scx_root dereferences in kobject callbacks
  sched_ext: Use READ_ONCE() for the read side of dsq->nr update
  tools/sched_ext: fix strtoul() misuse in scx_hotplug_seq()
  sched_ext: Fix SCX_EFLAG_INITIALIZED being a no-op flag
  sched_ext: Fix out-of-bounds access in scx_idle_init_masks()
  sched_ext: Disable preemption between scx_claim_exit() and kicking helper work
  tools/sched_ext: Add Kconfig to sync with upstream
  tools/sched_ext: Sync README.md Kconfig with upstream scx
  selftests/sched_ext: Remove duplicated unistd.h include in rt_stall.c
  tools/sched_ext: scx_sdt: Remove unused '-f' option
  tools/sched_ext: scx_central: Remove unused '-p' option
  selftests/sched_ext: Fix unused-result warning for read()
  selftests/sched_ext: Abort test loop on signal

sched_ext: Fix starvation of scx_enable() under fair-class saturation

During scx_enable(), the READY -> ENABLED task switching loop changes the
calling thread's sched_class from fair to ext. Since fair has higher
priority than ext, saturating fair-class workloads can indefinitely starve
the enable thread, hanging the system. This was introduced when the enable
path switched from preempt_disable() to scx_bypass() which doesn't protect
against fair-class starvation. Note that the original preempt_disable()
protection wasn't complete either - in partial switch modes, the calling
thread could still be starved after preempt_enable() as it may have been
switched to ext class.

Fix it by offloading the enable body to a dedicated system-wide RT
(SCHED_FIFO) kthread which cannot be starved by either fair or ext class
tasks. scx_enable() lazily creates the kthread on first use and passes the
ops pointer through a struct scx_enable_cmd containing the kthread_work,
then synchronously waits for completion.

The workfn runs on a different kthread from sch->helper (which runs
disable_work), so it can safely flush disable_work on the error path
without deadlock.

Fixes: 8c2090c504e9 ("sched_ext: Initialize in bypass mode")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Tejun Heo <tj@kernel.org>

Merge tag 'for-7.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
"One-liner or short fixes for minor/moderate problems reported recently:

   - fixes or level adjustments of error messages

   - fix leaked transaction handles after aborted transactions, when
     using the remap tree feature

   - fix a few leaked chunk maps after errors

   - fix leaked page array in io_uring encoded read if an error occurs
     and the 'finished' is not called

   - fix double release of reserved extents when doing a range COW

   - don't commit super block when the filesystem is in shutdown state

   - fix squota accounting condition when checking members vs parent
     usage

   - other error handling fixes"

* tag 'for-7.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: check block group lookup in remove_range_from_remap_tree()
  btrfs: fix transaction handle leaks in btrfs_last_identity_remap_gone()
  btrfs: fix chunk map leak in btrfs_map_block() after btrfs_translate_remap()
  btrfs: fix chunk map leak in btrfs_map_block() after btrfs_chunk_map_num_copies()
  btrfs: fix compat mask in error messages in btrfs_check_features()
  btrfs: print correct subvol num if active swapfile prevents deletion
  btrfs: fix warning in scrub_verify_one_metadata()
  btrfs: fix objectid value in error message in check_extent_data_ref()
  btrfs: fix incorrect key offset in error message in check_dev_extent_item()
  btrfs: fix error message order of parameters in btrfs_delete_delayed_dir_index()
  btrfs: don't commit the super block when unmounting a shutdown filesystem
  btrfs: free pages on error in btrfs_uring_read_extent()
  btrfs: fix referenced/exclusive check in squota_check_parent_usage()
  btrfs: remove pointless WARN_ON() in cache_save_setup()
  btrfs: convert log messages to error level in btrfs_replay_log()
  btrfs: remove btrfs_handle_fs_error() after failure to recover log trees
  btrfs: remove redundant warning message in btrfs_check_uuid_tree()
  btrfs: change warning messages to error level in open_ctree()
  btrfs: fix a double release on reserved extents in cow_one_range()
  btrfs: handle discard errors in in btrfs_finish_extent_commit()

sched_ext: Remove redundant css_put() in scx_cgroup_init()

The iterator css_for_each_descendant_pre() walks the cgroup hierarchy
under cgroup_lock(). It does not increment the reference counts on
yielded css structs.

According to the cgroup documentation, css_put() should only be used
to release a reference obtained via css_get() or css_tryget_online().
Since the iterator does not use either of these to acquire a reference,
calling css_put() in the error path of scx_cgroup_init() causes a
refcount underflow.

Remove the unbalanced css_put() to prevent a potential Use-After-Free
(UAF) vulnerability.

Fixes: 819513666966 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

selftests/sched_ext: Fix peek_dsq.bpf.c compile error for clang 17

When compiling sched_ext selftests using clang 17.0.6, it raised
compiler crash and build error:

Error at line 68: Unsupport signed division for DAG: 0x55b2f9a60240:
i64 = sdiv 0x55b2f9a609b0, Constant:i64<100>, peek_dsq.bpf.c:68:25 @[
peek_dsq.bpf.c:95:4 @[ peek_dsq.bpf.c:169:8 @[ peek
_dsq.bpf.c:140:6 ] ] ]Please convert to unsigned div/mod

After digging, it's not a compiler error, clang supported Signed division
only when using -mcpu=v4, while we use -mcpu=v3 currently, the better way
is to use unsigned div, see [1] for details.

[1] https://github.com/llvm/llvm-project/issues/70433

Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

selftests/sched_ext: Add -fms-extensions to bpf build flags

Similar to commit 835a50753579 ("selftests/bpf: Add -fms-extensions to
bpf build flags") and commit 639f58a0f480 ("bpftool: Fix build warnings
due to MS extensions")

Fix "declaration does not declare anything" warning by using
-fms-extensions and -Wno-microsoft-anon-tag flags to build bpf programs
that #include "vmlinux.h"

Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

tools/sched_ext: Add -fms-extensions to bpf build flags

Similar to commit 835a50753579 ("selftests/bpf: Add -fms-extensions to
bpf build flags") and commit 639f58a0f480 ("bpftool: Fix build warnings
due to MS extensions")

The kernel is now built with -fms-extensions, therefore
generated vmlinux.h contains types like:
struct aes_key {
        struct aes_enckey;
        union aes_invkey_arch inv_k;
};

struct ns_common {
...
        union {
                struct ns_tree;
                struct callback_head ns_rcu;
        };
};

Which raise warning like below when building scx scheduler:

tools/sched_ext/build/include/vmlinux.h:50533:3: warning:
declaration does not declare anything [-Wmissing-declarations]
50533 |                 struct ns_tree;
       |                 ^
Fix it by using -fms-extensions and -Wno-microsoft-anon-tag flags
to build bpf programs that #include "vmlinux.h"

Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: Use READ_ONCE() for plain reads of scx_watchdog_timeout

scx_watchdog_timeout is written with WRITE_ONCE() in scx_enable():

    WRITE_ONCE(scx_watchdog_timeout, timeout);

However, three read-side accesses use plain reads without the matching
READ_ONCE():

    /* check_rq_for_timeouts() - L2824 */
    last_runnable + scx_watchdog_timeout

    /* scx_watchdog_workfn() - L2852 */
    scx_watchdog_timeout / 2

    /* scx_enable() - L5179 */
    scx_watchdog_timeout / 2

The KCSAN documentation requires that if one accessor uses WRITE_ONCE()
to annotate lock-free access, all other accesses must also use the
appropriate accessor. Plain reads alongside WRITE_ONCE() leave the pair
incomplete and can trigger KCSAN warnings.

Note that scx_tick() already uses the correct READ_ONCE() annotation:

    last_check + READ_ONCE(scx_watchdog_timeout)

Fix the three remaining plain reads to match, making all accesses to
scx_watchdog_timeout consistently annotated and KCSAN-clean.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

uaccess: Fix scoped_user_read_access() for 'pointer to const'

If a 'const struct foo __user *ptr' is used for the address passed to
scoped_user_read_access() then you get a warning/error

uaccess.h:691:1: error: initialization discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers]

for the

void __user *_tmpptr = __scoped_user_access_begin(mode, uptr, size, elbl)

assignment.

Fix by using 'auto' for both _tmpptr and the redeclaration of uptr.
Replace the CLASS() with explicit __cleanup() functions on uptr.

Fixes: e497310b4ffb ("uaccess: Provide scoped user access regions")
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Reviewed-and-tested-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

sched_ext: Replace naked scx_root dereferences in kobject callbacks

scx_attr_ops_show() and scx_uevent() access scx_root->ops.name directly.
This is problematic for two reasons:

1. The file-level comment explicitly identifies naked scx_root
   dereferences as a temporary measure that needs to be replaced
   with proper per-instance access.

2. scx_attr_events_show(), the neighboring sysfs show function in
   the same group, already uses the correct pattern:

       struct scx_sched *sch = container_of(kobj, struct scx_sched, kobj);

   Having inconsistent access patterns in the same sysfs/uevent
   group is error-prone.

The kobject embedded in struct scx_sched is initialized as:

    kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root");

so container_of(kobj, struct scx_sched, kobj) correctly retrieves
the owning scx_sched instance in both callbacks.

Replace the naked scx_root dereferences with container_of()-based
access, consistent with scx_attr_events_show() and in preparation
for proper multi-instance scx_sched support.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: Use READ_ONCE() for the read side of dsq->nr update

scx_bpf_dsq_nr_queued() reads dsq->nr via READ_ONCE() without holding
any lock, making dsq->nr a lock-free concurrently accessed variable.
However, dsq_mod_nr(), the sole writer of dsq->nr, only uses
WRITE_ONCE() on the write side without the matching READ_ONCE() on the
read side:

    WRITE_ONCE(dsq->nr, dsq->nr + delta);
                        ^^^^^^^
                        plain read -- KCSAN data race

The KCSAN documentation requires that if one accessor uses READ_ONCE()
or WRITE_ONCE() on a variable to annotate lock-free access, all other
accesses must also use the appropriate accessor. A plain read on the
right-hand side of WRITE_ONCE() leaves the pair incomplete and will
trigger KCSAN warnings.

Fix by using READ_ONCE() for the read side of the update:

    WRITE_ONCE(dsq->nr, READ_ONCE(dsq->nr) + delta);

This is consistent with scx_bpf_dsq_nr_queued() and makes the
concurrent access annotation complete and KCSAN-clean.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Merge tag 'nfsd-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

- Restore previous nfsd thread count reporting behavior

- Fix credential reference leaks in the NFSD netlink admin protocol

* tag 'nfsd-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: report the requested maximum number of threads instead of number running
  nfsd: Fix cred ref leak in nfsd_nl_listener_set_doit().
  nfsd: Fix cred ref leak in nfsd_nl_threads_set_doit().

Linux 7.0-rc2

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"Arm:

   - Make sure we don't leak any S1POE state from guest to guest when
     the feature is supported on the HW, but not enabled on the host

   - Propagate the ID registers from the host into non-protected VMs
     managed by pKVM, ensuring that the guest sees the intended feature
     set

   - Drop double kern_hyp_va() from unpin_host_sve_state(), which could
     bite us if we were to change kern_hyp_va() to not being idempotent

   - Don't leak stage-2 mappings in protected mode

   - Correctly align the faulting address when dealing with single page
     stage-2 mappings for PAGE_SIZE > 4kB

   - Fix detection of virtualisation-capable GICv5 IRS, due to the
     maintainer being obviously fat fingered... [his words, not mine]

   - Remove duplication of code retrieving the ASID for the purpose of
     S1 PT handling

   - Fix slightly abusive const-ification in vgic_set_kvm_info()

  Generic:

   - Remove internal Kconfigs that are now set on all architectures

   - Remove per-architecture code to enable KVM_CAP_SYNC_MMU, all
     architectures finally enable it in Linux 7.0"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: always define KVM_CAP_SYNC_MMU
  KVM: remove CONFIG_KVM_GENERIC_MMU_NOTIFIER
  KVM: arm64: Deduplicate ASID retrieval code
  irqchip/gic-v5: Fix inversion of IRS_IDR0.virt flag
  KVM: arm64: Revert accidental drop of kvm_uninit_stage2_mmu() for non-NV VMs
  KVM: arm64: Fix protected mode handling of pages larger than 4kB
  KVM: arm64: vgic: Handle const qualifier from gic_kvm_info allocation type
  KVM: arm64: Remove redundant kern_hyp_va() in unpin_host_sve_state()
  KVM: arm64: Fix ID register initialization for non-protected pKVM guests
  KVM: arm64: Optimise away S1POE handling when not supported by host
  KVM: arm64: Hide S1POE from guests when not supported by the host

Merge tag 'core-debugobjects-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull debugobjects fix from Thomas Gleixner:
"A single fix for debugobjects.

  The deferred page initialization prevents debug objects from
  allocating slab pages until the initialization is complete. That
  causes depletion of the pool and disabling of debugobjects.

  The reason is that debugobjects uses __GFP_HIGH for allocations as it
  might be invoked from arbitrary contexts. When PREEMPT_COUNT is
  disabled there is no way to know whether the context is safe to set
  __GFP_KSWAPD_RECLAIM.

  This worked until v6.18. Since then allocations w/o a reclaim flag
  cause new_slab() to end up in alloc_frozen_pages_nolock_noprof(),
  which returns early when deferred page initialization has not yet
  completed.

  Work around that when PREEMPT_COUNT is enabled as the preempt counter
  allows debugobjects to add __GFP_KSWAPD_RECLAIM to the GFP flags when
  the context is preemtible. When PREEMPT_COUNT is disabled the context
  is unknown and the reclaim bit can't be set because the caller might
  hold locks which might deadlock in the allocator.

  That makes debugobjects depend on PREEMPT_COUNT ||
  !DEFERRED_STRUCT_PAGE_INIT, which limits the coverage slightly, but
  keeps it functional for most cases"

* tag 'core-debugobjects-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  debugobject: Make it work with deferred page initialization - again

Merge tag 'x86-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:

- Fix speculative safety in fred_extint()

- Fix __WARN_printf() trap in early_fixup_exception()

- Fix clang-build boot bug for unusual alignments, triggered by
   CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y

- Replace the final few __ASSEMBLY__ stragglers that snuck in lately
   into non-UAPI x86 headers and use __ASSEMBLER__ consistently (again)

* tag 'x86-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/headers: Replace __ASSEMBLY__ stragglers with __ASSEMBLER__
  x86/cfi: Fix CFI rewrite for odd alignments
  x86/bug: Handle __WARN_printf() trap in early_fixup_exception()
  x86/fred: Correct speculative safety in fred_extint()

Merge tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Ingo Molnar:
"Improve the inlining of jiffies_to_msecs() and jiffies_to_usecs(), for
  the common HZ=100, 250 or 1000 cases. Only use a function call for odd
  HZ values like HZ=300 that generate more code.

  The function call overhead showed up in performance tests of the TCP
  code"

* tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time/jiffies: Inline jiffies_to_msecs() and jiffies_to_usecs()

Merge tag 'sched-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:

- Fix zero_vruntime tracking when there's a single task running

- Fix slice protection logic

- Fix the ->vprot logic for reniced tasks

- Fix lag clamping in mixed slice workloads

- Fix objtool uaccess warning (and bug) in the
   !CONFIG_RSEQ_SLICE_EXTENSION case caused by unexpected un-inlining,
   which triggers with older compilers

- Fix a comment in the rseq registration rseq_size bound check code

- Fix a legacy RSEQ ABI quirk that handled 32-byte area sizes
   differently, which special size we now reached naturally and want to
   avoid. The visible ugliness of the new reserved field will be avoided
   the next time the RSEQ area is extended.

* tag 'sched-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rseq: slice ext: Ensure rseq feature size differs from original rseq size
  rseq: Clarify rseq registration rseq_size bound check comment
  sched/core: Fix wakeup_preempt's next_class tracking
  rseq: Mark rseq_arm_slice_extension_timer() __always_inline
  sched/fair: Fix lag clamp
  sched/eevdf: Update se->vprot in reweight_entity()
  sched/fair: Only set slice protection at pick time
  sched/fair: Fix zero_vruntime tracking

Merge tag 'perf-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf events fixes from Ingo Molnar:

- Fix lock ordering bug found by lockdep in perf_event_wakeup()

- Fix uncore counter enumeration on Granite Rapids and Sierra Forest

- Fix perf_mmap() refcount bug found by Syzkaller

- Fix __perf_event_overflow() vs perf_remove_from_context() race

* tag 'perf-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix __perf_event_overflow() vs perf_remove_from_context() race
  perf/core: Fix refcount bug and potential UAF in perf_mmap
  perf/x86/intel/uncore: Add per-scheduler IMC CAS count events
  perf/core: Fix invalid wait context in ctx_sched_in()

Merge tag 'locking-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Ingo Molnar:
"Now that LLVM 22 has been released officially, require a release
  version to use the new CONFIG_WARN_CONTEXT_ANALYSIS feature.

  In particular this avoids the widely used Android clang 22.0.1
  pre-release build which is known to be broken for this usecase"

* tag 'locking-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lib/Kconfig.debug: Require a release version of LLVM 22 for context analysis

Merge tag 'irq-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irqchip driver fixes from Ingo Molnar:

- Fix frozen interrupt bug in the sifive-plic driver

- Limit per-device MSI interrupts on uncommon gic-v3-its hardware
   variants

- Address Sparse warning by constifying a variable in the MMP driver

- Revert broken commit and also fix an error check in the ls-extirq
   driver

* tag 'irq-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/ls-extirq: Fix devm_of_iomap() error check
  Revert "irqchip/ls-extirq: Use for_each_of_imap_item iterator"
  irqchip/mmp: Make icu_irq_chip variable static const
  irqchip/gic-v3-its: Limit number of per-device MSIs to the range the ITS supports
  irqchip/sifive-plic: Fix frozen interrupt due to affinity setting

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"All changes in drivers (well technically SES is enclosure services,
  but its change is minor). The biggest is the write combining change in
  lpfc followed by the additional NULL checks in mpi3mr"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: ufs: core: Fix shift out of bounds when MAXQ=32
  scsi: ufs: core: Move link recovery for hibern8 exit failure to wl_resume
  scsi: ufs: core: Fix possible NULL pointer dereference in ufshcd_add_command_trace()
  scsi: snic: MAINTAINERS: Update snic maintainers
  scsi: snic: Remove unused linkstatus
  scsi: pm8001: Fix use-after-free in pm8001_queue_command()
  scsi: mpi3mr: Add NULL checks when resetting request and reply queues
  scsi: ufs: core: Reset urgent_bkops_lvl to allow runtime PM power mode
  scsi: ses: Fix devices attaching to different hosts
  scsi: ufs: core: Fix RPMB region size detection for UFS 2.2
  scsi: storvsc: Fix scheduling while atomic on PREEMPT_RT
  scsi: lpfc: Properly set WC for DPP mapping

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

- Fix alignment of arm64 JIT buffer to prevent atomic tearing (Fuad
   Tabba)

- Fix invariant violation for single value tnums in the verifier
   (Harishankar Vishwanathan, Paul Chaignon)

- Fix a bunch of issues found by ASAN in selftests/bpf (Ihor Solodrai)

- Fix race in devmpa and cpumap on PREEMPT_RT (Jiayuan Chen)

- Fix show_fdinfo of kprobe_multi when cookies are not present (Jiri
   Olsa)

- Fix race in freeing special fields in BPF maps to prevent memory
   leaks (Kumar Kartikeya Dwivedi)

- Fix OOB read in dmabuf_collector (T.J. Mercier)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (36 commits)
  selftests/bpf: Avoid simplification of crafted bounds test
  selftests/bpf: Test refinement of single-value tnum
  bpf: Improve bounds when tnum has a single possible value
  bpf: Introduce tnum_step to step through tnum's members
  bpf: Fix race in devmap on PREEMPT_RT
  bpf: Fix race in cpumap on PREEMPT_RT
  selftests/bpf: Add tests for special fields races
  bpf: Retire rcu_trace_implies_rcu_gp() from local storage
  bpf: Delay freeing fields in local storage
  bpf: Lose const-ness of map in map_check_btf()
  bpf: Register dtor for freeing special fields
  selftests/bpf: Fix OOB read in dmabuf_collector
  selftests/bpf: Fix a memory leak in xdp_flowtable test
  bpf: Fix stack-out-of-bounds write in devmap
  bpf: Fix kprobe_multi cookies access in show_fdinfo callback
  bpf, arm64: Force 8-byte alignment for JIT buffer to prevent atomic tearing
  selftests/bpf: Don't override SIGSEGV handler with ASAN
  selftests/bpf: Check BPFTOOL env var in detect_bpftool_path()
  selftests/bpf: Fix out-of-bounds array access bugs reported by ASAN
  selftests/bpf: Fix array bounds warning in jit_disasm_helpers
  ...

Merge tag 'driver-core-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core

Pull driver core fixes from Danilo Krummrich:

- Do not register imx_clk_scu_driver in imx8qxp_clk_probe(); besides
   fixing two other issues, this avoids a deadlock in combination with
   commit dc23806a7c47 ("driver core: enforce device_lock for
   driver_match_device()")

- Move secondary node lookup from device_get_next_child_node() to
   fwnode_get_next_child_node(); this avoids issues when users switch
   from the device API to the fwnode API

- Export io_define_{read,write}!() to avoid unused import warnings when
   CONFIG_PCI=n

* tag 'driver-core-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
  clk: scu/imx8qxp: do not register driver in probe()
  rust: io: macro_export io_define_read!() and io_define_write!()
  device property: Allow secondary lookup in fwnode_get_next_child_node()

Merge tag 'v7.0rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

- Two multichannel fixes

- Locking fix for superblock flags

- Fix to remove debug message that could log password

- Cleanup fix for setting credentials

* tag 'v7.0rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: Use snprintf in cifs_set_cifscreds
  smb: client: Don't log plaintext credentials in cifs_set_cifscreds
  smb: client: fix broken multichannel with krb5+signing
  smb: client: use atomic_t for mnt_cifs_flags
  smb: client: fix cifs_pick_channel when channels are equally loaded

firewire: ohci: initialize page array to use alloc_pages_bulk() correctly

The call of alloc_pages_bulk() skips to fill entries of page array when
the entries already have values. While, 1394 OHCI PCI driver passes the
page array without initializing. It could cause invalid state at PFN
validation in vmap().

Fixes: f2ae92780ab9 ("firewire: ohci: split page allocation from dma mapping")
Reported-by: John Ogness <john.ogness@linutronix.de>
Reported-and-tested-by: Harald Arnesen <linux@skogtun.org>
Reported-and-tested-by: David Gow <david@davidgow.net>
Closes: https://lore.kernel.org/lkml/87tsv1vig5.fsf@jogness.linutronix.de/
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'spi-fix-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
"One fix for the stm32 driver which got broken for DMA chaining cases,
  plus a removal of some straggling bindings for the Bikal SoC which has
  been pulled out of the kernel"

* tag 'spi-fix-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: stm32: fix missing pointer assignment in case of dma chaining
  spi: dt-bindings: snps,dw-abp-ssi: Remove unused bindings

Merge tag 'regulator-fix-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Pull regulator fixes from Mark Brown:
"A small pile of fixes, none of which are super major - the code fixes
  are improved error handling and fixing a leak of a device node.

  We also have a typo fix and an improvement to make the binding example
  for mt6359 more directly usable"

* tag 'regulator-fix-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: Kconfig: fix a typo
  regulator: bq257xx: Fix device node reference leak in bq257xx_reg_dt_parse_gpio()
  regulator: fp9931: Fix PM runtime reference leak in fp9931_hwmon_read()
  regulator: tps65185: check devm_kzalloc() result in probe
  regulator: dt-bindings: mt6359: make regulator names unique

Merge tag 's390-7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 fixes from Vasily Gorbik:

- Fix guest pfault init to pass a physical address to DIAG 0x258,
   restoring pfault interrupts and avoiding vCPU stalls during host
   page-in

- Fix kexec/kdump hangs with stack protector by marking
   s390_reset_system() __no_stack_protector; set_prefix(0) switches
   lowcore and the canary no longer matches

- Fix idle/vtime cputime accounting (idle-exit ordering, vtimer
   double-forwarding) and small cleanups

* tag 's390-7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/pfault: Fix virtual vs physical address confusion
  s390/kexec: Disable stack protector in s390_reset_system()
  s390/idle: Remove psw_idle() prototype
  s390/vtime: Use lockdep_assert_irqs_disabled() instead of BUG_ON()
  s390/vtime: Use __this_cpu_read() / get rid of READ_ONCE()
  s390/irq/idle: Remove psw bits early
  s390/idle: Inline update_timer_idle()
  s390/idle: Slightly optimize idle time accounting
  s390/idle: Add comment for non obvious code
  s390/vtime: Fix virtual timer forwarding
  s390/idle: Fix cpu idle exit cpu time accounting

Merge tag 'kvmarm-fixes-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 7.0, take #1

- Make sure we don't leak any S1POE state from guest to guest when
  the feature is supported on the HW, but not enabled on the host

- Propagate the ID registers from the host into non-protected VMs
  managed by pKVM, ensuring that the guest sees the intended feature set

- Drop double kern_hyp_va() from unpin_host_sve_state(), which could
  bite us if we were to change kern_hyp_va() to not being idempotent

- Don't leak stage-2 mappings in protected mode

- Correctly align the faulting address when dealing with single page
  stage-2 mappings for PAGE_SIZE > 4kB

- Fix detection of virtualisation-capable GICv5 IRS, due to the
  maintainer being obviously fat fingered...

- Remove duplication of code retrieving the ASID for the purpose of
  S1 PT handling

- Fix slightly abusive const-ification in vgic_set_kvm_info()

KVM: always define KVM_CAP_SYNC_MMU

KVM_CAP_SYNC_MMU is provided by KVM's MMU notifiers, which are now always
available. Move the definition from individual architectures to common
code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: remove CONFIG_KVM_GENERIC_MMU_NOTIFIER

All architectures now use MMU notifier for KVM page table management.
Remove the Kconfig symbol and the code that is used when it is
disabled.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Merge branch 'fix-invariant-violation-for-single-value-tnums'

Paul Chaignon says:

====================
Fix invariant violation for single-value tnums

We're hitting an invariant violation in Cilium that sometimes leads to
BPF programs being rejected and Cilium failing to start [1]. As far as
I know this is the first case of invariant violation found in a real
program (i.e., not by a fuzzer). The following extract from verifier
logs shows what's happening:

  from 201 to 236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0
  236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0
  ; if (magic == MARK_MAGIC_HOST || magic == MARK_MAGIC_OVERLAY || magic == MARK_MAGIC_ENCRYPT) @ bpf_host.c:1337
  236: (16) if w9 == 0xe00 goto pc+45   ; R9=scalar(smin=umin=smin32=umin32=3585,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100))
  237: (16) if w9 == 0xf00 goto pc+1
  verifier bug: REG INVARIANTS VIOLATION (false_reg1): range bounds violation u64=[0xe01, 0xe00] s64=[0xe01, 0xe00] u32=[0xe01, 0xe00] s32=[0xe01, 0xe00] var_off=(0xe00, 0x0)

More details are given in the second patch, but in short, the verifier
should be able to detect that the false branch of instruction 237 is
never true. After instruction 236, the u64 range and the tnum overlap
in a single value, 0xf00.

The long-term solution to invariant violation is likely to rely on the
refinement + invariant violation check to detect dead branches, as
started by Eduard. To fix the current issue, we need something with
less refactoring that we can backport to affected kernels.

The solution implemented in the second patch is to improve the bounds
refinement to avoid this case. It relies on a new tnum helper,
tnum_step, first sent as an RFC in [2]. The last two patches extend and
update the selftests.

Link: https://github.com/cilium/cilium/issues/44216
Link: https://lore.kernel.org/bpf/20251107192328.2190680-2-harishankar.vishwanathan@gmail.com/
Changes in v3:
  - Fix commit description error spotted by AI bot.
  - Simplify constants in first two tests (Eduard).
  - Rework comment on third test (Eduard).
  - Add two new negative test cases (Eduard).
  - Rebased.
Changes in v2:
  - Add guard suggested by Hari in tnum_step, to avoid undefined
    behavior spotted by AI code review.
  - Add explanation diagrams in code as suggested by Eduard.
  - Rework conditions for readability as suggested by Eduard.
  - Updated reference to SMT formula.
  - Rebased.
====================

Link: https://patch.msgid.link/cover.1772225741.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Avoid simplification of crafted bounds test

The reg_bounds_crafted tests validate the verifier's range analysis
logic. They focus on the actual ranges and thus ignore the tnum. As a
consequence, they carry the assumption that the tested cases can be
reproduced in userspace without using the tnum information.

Unfortunately, the previous change the refinement logic breaks that
assumption for one test case:

  (u64)2147483648 (u32)<op> [4294967294; 0x100000000]

The tested bytecode is shown below. Without our previous improvement, on
the false branch of the condition, R7 is only known to have u64 range
[0xfffffffe; 0x100000000]. With our improvement, and using the tnum
information, we can deduce that R7 equals 0x100000000.

  19: (bc) w0 = w6                ; R6=0x80000000
  20: (bc) w0 = w7                ; R7=scalar(smin=umin=0xfffffffe,smax=umax=0x100000000,smin32=-2,smax32=0,var_off=(0x0; 0x1ffffffff))
  21: (be) if w6 <= w7 goto pc+3  ; R6=0x80000000 R7=0x100000000

R7's tnum is (0; 0x1ffffffff). On the false branch, regs_refine_cond_op
refines R7's u32 range to [0; 0x7fffffff]. Then, __reg32_deduce_bounds
refines the s32 range to 0 using u32 and finally also sets u32=0.
From this, __reg_bound_offset improves the tnum to (0; 0x100000000).
Finally, our previous patch uses this new tnum to deduce that it only
intersect with u64=[0xfffffffe; 0x100000000] in a single value:
0x100000000.

Because the verifier uses the tnum to reach this constant value, the
selftest is unable to reproduce it by only simulating ranges. The
solution implemented in this patch is to change the test case such that
there is more than one overlap value between u64 and the tnum. The max.
u64 value is thus changed from 0x100000000 to 0x300000000.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/50641c6a7ef39520595dcafa605692427c1006ec.1772225741.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Test refinement of single-value tnum

This patch introduces selftests to cover the new bounds refinement
logic introduced in the previous patch. Without the previous patch,
the first two tests fail because of the invariant violation they
trigger. The last test fails because the R10 access is not detected as
dead code. In addition, all three tests fail because of R0 having a
non-constant value in the verifier logs.

In addition, the last two cases are covering the negative cases: when we
shouldn't refine the bounds because the u64 and tnum overlap in at least
two values.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/90d880c8cf587b9f7dc715d8961cd1b8111d01a8.1772225741.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Improve bounds when tnum has a single possible value

We're hitting an invariant violation in Cilium that sometimes leads to
BPF programs being rejected and Cilium failing to start [1]. The
following extract from verifier logs shows what's happening:

  from 201 to 236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0
  236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0
  ; if (magic == MARK_MAGIC_HOST || magic == MARK_MAGIC_OVERLAY || magic == MARK_MAGIC_ENCRYPT) @ bpf_host.c:1337
  236: (16) if w9 == 0xe00 goto pc+45   ; R9=scalar(smin=umin=smin32=umin32=3585,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100))
  237: (16) if w9 == 0xf00 goto pc+1
  verifier bug: REG INVARIANTS VIOLATION (false_reg1): range bounds violation u64=[0xe01, 0xe00] s64=[0xe01, 0xe00] u32=[0xe01, 0xe00] s32=[0xe01, 0xe00] var_off=(0xe00, 0x0)

We reach instruction 236 with two possible values for R9, 0xe00 and
0xf00. This is perfectly reflected in the tnum, but of course the ranges
are less accurate and cover [0xe00; 0xf00]. Taking the fallthrough path
at instruction 236 allows the verifier to reduce the range to
[0xe01; 0xf00]. The tnum is however not updated.

With these ranges, at instruction 237, the verifier is not able to
deduce that R9 is always equal to 0xf00. Hence the fallthrough pass is
explored first, the verifier refines the bounds using the assumption
that R9 != 0xf00, and ends up with an invariant violation.

This pattern of impossible branch + bounds refinement is common to all
invariant violations seen so far. The long-term solution is likely to
rely on the refinement + invariant violation check to detect dead
branches, as started by Eduard. To fix the current issue, we need
something with less refactoring that we can backport.

This patch uses the tnum_step helper introduced in the previous patch to
detect the above situation. In particular, three cases are now detected
in the bounds refinement:

1. The u64 range and the tnum only overlap in umin.
   u64:  ---[xxxxxx]-----
   tnum: --xx----------x-

2. The u64 range and the tnum only overlap in the maximum value
   represented by the tnum, called tmax.
   u64:  ---[xxxxxx]-----
   tnum: xx-----x--------

3. The u64 range and the tnum only overlap in between umin (excluded)
   and umax.
   u64:  ---[xxxxxx]-----
   tnum: xx----x-------x-

To detect these three cases, we call tnum_step(tnum, umin), which
returns the smallest member of the tnum greater than umin, called
tnum_next here. We're in case (1) if umin is part of the tnum and
tnum_next is greater than umax. We're in case (2) if umin is not part of
the tnum and tnum_next is equal to tmax. Finally, we're in case (3) if
umin is not part of the tnum, tnum_next is inferior or equal to umax,
and calling tnum_step a second time gives us a value past umax.

This change implements these three cases. With it, the above bytecode
looks as follows:

  0: (85) call bpf_get_prandom_u32#7    ; R0=scalar()
  1: (47) r0 |= 3584                    ; R0=scalar(smin=0x8000000000000e00,umin=umin32=3584,smin32=0x80000e00,var_off=(0xe00; 0xfffffffffffff1ff))
  2: (57) r0 &= 3840                    ; R0=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100))
  3: (15) if r0 == 0xe00 goto pc+2      ; R0=3840
  4: (15) if r0 == 0xf00 goto pc+1
  4: R0=3840
  6: (95) exit

In addition to the new selftests, this change was also verified with
Agni [3]. For the record, the raw SMT is available at [4]. The property
it verifies is that: If a concrete value x is contained in all input
abstract values, after __update_reg_bounds, it will continue to be
contained in all output abstract values.

Link: https://github.com/cilium/cilium/issues/44216
Link: https://pchaigno.github.io/test-verifier-complexity.html
Link: https://github.com/bpfverif/agni
Link: https://pastebin.com/raw/naCfaqNx
Fixes: 0df1a55afa83 ("bpf: Warn on internal verifier errors")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Tested-by: Marco Schirrmeister <mschirrmeister@gmail.com>
Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/ef254c4f68be19bd393d450188946821c588565d.1772225741.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Introduce tnum_step to step through tnum's members

This commit introduces tnum_step(), a function that, when given t, and a
number z returns the smallest member of t larger than z. The number z
must be greater or equal to the smallest member of t and less than the
largest member of t.

The first step is to compute j, a number that keeps all of t's known
bits, and matches all unknown bits to z's bits. Since j is a member of
the t, it is already a candidate for result. However, we want our result
to be (minimally) greater than z.

There are only two possible cases:

(1) Case j <= z. In this case, we want to increase the value of j and
make it > z.
(2) Case j > z. In this case, we want to decrease the value of j while
keeping it > z.

(Case 1) j <= z

t = xx11x0x0
z = 10111101 (189)
j = 10111000 (184)
         ^
         k

(Case 1.1) Let's first consider the case where j < z. We will address j
== z later.

Since z > j, there had to be a bit position that was 1 in z and a 0 in
j, beyond which all positions of higher significance are equal in j and
z. Further, this position could not have been unknown in a, because the
unknown positions of a match z. This position had to be a 1 in z and
known 0 in t.

Let k be position of the most significant 1-to-0 flip. In our example, k
= 3 (starting the count at 1 at the least significant bit).  Setting (to
1) the unknown bits of t in positions of significance smaller than
k will not produce a result > z. Hence, we must set/unset the unknown
bits at positions of significance higher than k. Specifically, we look
for the next larger combination of 1s and 0s to place in those
positions, relative to the combination that exists in z. We can achieve
this by concatenating bits at unknown positions of t into an integer,
adding 1, and writing the bits of that result back into the
corresponding bit positions previously extracted from z.

>From our example, considering only positions of significance greater
than k:

t =  xx..x
z =  10..1
    +    1
     -----
     11..0

This is the exact combination 1s and 0s we need at the unknown bits of t
in positions of significance greater than k. Further, our result must
only increase the value minimally above z. Hence, unknown bits in
positions of significance smaller than k should remain 0. We finally
have,

result = 11110000 (240)

(Case 1.2) Now consider the case when j = z, for example

t = 1x1x0xxx
z = 10110100 (180)
j = 10110100 (180)

Matching the unknown bits of the t to the bits of z yielded exactly z.
To produce a number greater than z, we must set/unset the unknown bits
in t, and *all* the unknown bits of t candidates for being set/unset. We
can do this similar to Case 1.1, by adding 1 to the bits extracted from
the masked bit positions of z. Essentially, this case is equivalent to
Case 1.1, with k = 0.

t =  1x1x0xxx
z =  .0.1.100
    +       1
    ---------
     .0.1.101

This is the exact combination of bits needed in the unknown positions of
t. After recalling the known positions of t, we get

result = 10110101 (181)

(Case 2) j > z

t = x00010x1
z = 10000010 (130)
j = 10001011 (139)
^
k

Since j > z, there had to be a bit position which was 0 in z, and a 1 in
j, beyond which all positions of higher significance are equal in j and
z. This position had to be a 0 in z and known 1 in t. Let k be the
position of the most significant 0-to-1 flip. In our example, k = 4.

Because of the 0-to-1 flip at position k, a member of t can become
greater than z if the bits in positions greater than k are themselves >=
to z. To make that member *minimally* greater than z, the bits in
positions greater than k must be exactly = z. Hence, we simply match all
of t's unknown bits in positions more significant than k to z's bits. In
positions less significant than k, we set all t's unknown bits to 0
to retain minimality.

In our example, in positions of greater significance than k (=4),
t=x000. These positions are matched with z (1000) to produce 1000. In
positions of lower significance than k, t=10x1. All unknown bits are set
to 0 to produce 1001. The final result is:

result = 10001001 (137)

This concludes the computation for a result > z that is a member of t.

The procedure for tnum_step() in this commit implements the idea
described above. As a proof of correctness, we verified the algorithm
against a logical specification of tnum_step. The specification asserts
the following about the inputs t, z and output res that:

1. res is a member of t, and
2. res is strictly greater than z, and
3. there does not exist another value res2 such that
3a. res2 is also a member of t, and
3b. res2 is greater than z
3c. res2 is smaller than res

We checked the implementation against this logical specification using
an SMT solver. The verification formula in SMTLIB format is available
at [1]. The verification returned an "unsat": indicating that no input
assignment exists for which the implementation and the specification
produce different outputs.

In addition, we also automatically generated the logical encoding of the
C implementation using Agni [2] and verified it against the same
specification. This verification also returned an "unsat", confirming
that the implementation is equivalent to the specification. The formula
for this check is also available at [3].

Link: https://pastebin.com/raw/2eRWbiit
Link: https://github.com/bpfverif/agni
Link: https://pastebin.com/raw/EztVbBJ2
Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Link: https://lore.kernel.org/r/93fdf71910411c0f19e282ba6d03b4c65f9c5d73.1772225741.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'bpf-cpumap-devmap-fix-per-cpu-bulk-queue-races-on-preempt_rt'

Jiayuan Chen says:

====================
bpf: Fix per-CPU bulk queue races on PREEMPT_RT

On PREEMPT_RT kernels, local_bh_disable() only calls migrate_disable()
(when PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
preemption. This means CFS scheduling can preempt a task inside the
per-CPU bulk queue (bq) operations in cpumap and devmap, allowing
another task on the same CPU to concurrently access the same bq,
leading to use-after-free, list corruption, and kernel panics.

Patch 1 fixes the cpumap race in bq_flush_to_queue(), originally
reported by syzbot [1].

Patch 2 fixes the same class of race in devmap's bq_xmit_all(),
identified by code inspection after Sebastian Andrzej Siewior pointed
out that devmap has the same per-CPU bulk queue pattern [2].

Both patches use local_lock_nested_bh() to serialize access to the
per-CPU bq. On non-RT this is a pure lockdep annotation with no
overhead; on PREEMPT_RT it provides a per-CPU sleeping lock.

To reproduce the devmap race, insert an mdelay(100) in bq_xmit_all()
after "cnt = bq->count" and before the actual transmit loop. Then pin
two threads to the same CPU, each running BPF_PROG_TEST_RUN with an XDP
program that redirects to a DEVMAP entry (e.g. a veth pair). CFS
timeslicing during the mdelay window causes interleaving. Without the
fix, KASAN reports null-ptr-deref due to operating on freed frames:

  BUG: KASAN: null-ptr-deref in __build_skb_around+0x22d/0x340
  Write of size 32 at addr 0000000000000d50 by task devmap_race_rep/449

  CPU: 0 UID: 0 PID: 449 Comm: devmap_race_rep Not tainted 6.19.0+ #31 PREEMPT_RT
  Call Trace:
   <TASK>
   __build_skb_around+0x22d/0x340
   build_skb_around+0x25/0x260
   __xdp_build_skb_from_frame+0x103/0x860
   veth_xdp_rcv_bulk_skb.isra.0+0x162/0x320
   veth_xdp_rcv.constprop.0+0x61e/0xbb0
   veth_poll+0x280/0xb50
   __napi_poll.constprop.0+0xa5/0x590
   net_rx_action+0x4b0/0xea0
   handle_softirqs.isra.0+0x1b3/0x780
   __local_bh_enable_ip+0x12a/0x240
   xdp_test_run_batch.constprop.0+0xedd/0x1f60
   bpf_test_run_xdp_live+0x304/0x640
   bpf_prog_test_run_xdp+0xd24/0x1b70
   __sys_bpf+0x61c/0x3e00
   </TASK>

  Kernel panic - not syncing: Fatal exception in interrupt

[1] https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/
[2] https://lore.kernel.org/bpf/20260212023634.366343-1-jiayuan.chen@linux.dev/

v3 -> v4: https://lore.kernel.org/all/20260213034018.284146-1-jiayuan.chen@linux.dev/
- Move panic trace to cover letter. (Sebastian Andrzej Siewior)
- Add Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> to both patches
  from cover letter.

v2 -> v3: https://lore.kernel.org/bpf/20260212023634.366343-1-jiayuan.chen@linux.dev/
- Fix commit message: remove incorrect "spin_lock() becomes rt_mutex"
claim, the per-CPU bq has no spin_lock at all. (Sebastian Andrzej Siewior)
- Fix commit message: accurately describe local_lock_nested_bh()
behavior instead of referencing local_lock(). (Sebastian Andrzej Siewior)
- Remove incomplete discussion of snapshot alternative.
(Sebastian Andrzej Siewior)
- Remove panic trace from commit message. (Sebastian Andrzej Siewior)
- Add patch 2/2 for devmap, same race pattern. (Sebastian Andrzej Siewior)

v1 -> v2: https://lore.kernel.org/bpf/20260211064417.196401-1-jiayuan.chen@linux.dev/
- Use local_lock_nested_bh()/local_unlock_nested_bh() instead of
local_lock()/local_unlock(), since these paths already run under
local_bh_disable(). (Sebastian Andrzej Siewior)
- Replace "Caller must hold bq->bq_lock" comment with
lockdep_assert_held() in bq_flush_to_queue(). (Sebastian Andrzej Siewior)
- Fix Fixes tag to 3253cb49cbad ("softirq: Allow to drop the
softirq-BKL lock on PREEMPT_RT") which is the actual commit that
makes the race possible. (Sebastian Andrzej Siewior)
====================

Link: https://patch.msgid.link/20260225121459.183121-1-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Fix race in devmap on PREEMPT_RT

On PREEMPT_RT kernels, the per-CPU xdp_dev_bulk_queue (bq) can be
accessed concurrently by multiple preemptible tasks on the same CPU.

The original code assumes bq_enqueue() and __dev_flush() run atomically
with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() (when
PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
preemption, which allows CFS scheduling to preempt a task during
bq_xmit_all(), enabling another task on the same CPU to enter
bq_enqueue() and operate on the same per-CPU bq concurrently.

This leads to several races:

1. Double-free / use-after-free on bq->q[]: bq_xmit_all() snapshots
   cnt = bq->count, then iterates bq->q[0..cnt-1] to transmit frames.
   If preempted after the snapshot, a second task can call bq_enqueue()
   -> bq_xmit_all() on the same bq, transmitting (and freeing) the
   same frames. When the first task resumes, it operates on stale
   pointers in bq->q[], causing use-after-free.

2. bq->count and bq->q[] corruption: concurrent bq_enqueue() modifying
   bq->count and bq->q[] while bq_xmit_all() is reading them.

3. dev_rx/xdp_prog teardown race: __dev_flush() clears bq->dev_rx and
   bq->xdp_prog after bq_xmit_all(). If preempted between
   bq_xmit_all() return and bq->dev_rx = NULL, a preempting
   bq_enqueue() sees dev_rx still set (non-NULL), skips adding bq to
   the flush_list, and enqueues a frame. When __dev_flush() resumes,
   it clears dev_rx and removes bq from the flush_list, orphaning the
   newly enqueued frame.

4. __list_del_clearprev() on flush_node: similar to the cpumap race,
   both tasks can call __list_del_clearprev() on the same flush_node,
   the second dereferences the prev pointer already set to NULL.

The race between task A (__dev_flush -> bq_xmit_all) and task B
(bq_enqueue -> bq_xmit_all) on the same CPU:

  Task A (xdp_do_flush)          Task B (ndo_xdp_xmit redirect)
  ----------------------         --------------------------------
  __dev_flush(flush_list)
    bq_xmit_all(bq)
      cnt = bq->count  /* e.g. 16 */
      /* start iterating bq->q[] */
    <-- CFS preempts Task A -->
                                   bq_enqueue(dev, xdpf)
                                     bq->count == DEV_MAP_BULK_SIZE
                                     bq_xmit_all(bq, 0)
                                       cnt = bq->count  /* same 16! */
                                       ndo_xdp_xmit(bq->q[])
                                       /* frames freed by driver */
                                       bq->count = 0
    <-- Task A resumes -->
      ndo_xdp_xmit(bq->q[])
      /* use-after-free: frames already freed! */

Fix this by adding a local_lock_t to xdp_dev_bulk_queue and acquiring
it in bq_enqueue() and __dev_flush(). These paths already run under
local_bh_disable(), so use local_lock_nested_bh() which on non-RT is
a pure annotation with no overhead, and on PREEMPT_RT provides a
per-CPU sleeping lock that serializes access to the bq.

Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260225121459.183121-3-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Fix race in cpumap on PREEMPT_RT

On PREEMPT_RT kernels, the per-CPU xdp_bulk_queue (bq) can be accessed
concurrently by multiple preemptible tasks on the same CPU.

The original code assumes bq_enqueue() and __cpu_map_flush() run
atomically with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() (when
PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
preemption, which allows CFS scheduling to preempt a task during
bq_flush_to_queue(), enabling another task on the same CPU to enter
bq_enqueue() and operate on the same per-CPU bq concurrently.

This leads to several races:

1. Double __list_del_clearprev(): after bq->count is reset in
   bq_flush_to_queue(), a preempting task can call bq_enqueue() ->
   bq_flush_to_queue() on the same bq when bq->count reaches
   CPU_MAP_BULK_SIZE. Both tasks then call __list_del_clearprev()
   on the same bq->flush_node, the second call dereferences the
   prev pointer that was already set to NULL by the first.

2. bq->count and bq->q[] races: concurrent bq_enqueue() can corrupt
   the packet queue while bq_flush_to_queue() is processing it.

The race between task A (__cpu_map_flush -> bq_flush_to_queue) and
task B (bq_enqueue -> bq_flush_to_queue) on the same CPU:

  Task A (xdp_do_flush)          Task B (cpu_map_enqueue)
  ----------------------         ------------------------
  bq_flush_to_queue(bq)
    spin_lock(&q->producer_lock)
    /* flush bq->q[] to ptr_ring */
    bq->count = 0
    spin_unlock(&q->producer_lock)
                                   bq_enqueue(rcpu, xdpf)
    <-- CFS preempts Task A -->      bq->q[bq->count++] = xdpf
                                     /* ... more enqueues until full ... */
                                     bq_flush_to_queue(bq)
                                       spin_lock(&q->producer_lock)
                                       /* flush to ptr_ring */
                                       spin_unlock(&q->producer_lock)
                                       __list_del_clearprev(flush_node)
                                         /* sets flush_node.prev = NULL */
    <-- Task A resumes -->
    __list_del_clearprev(flush_node)
      flush_node.prev->next = ...
      /* prev is NULL -> kernel oops */

Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it
in bq_enqueue() and __cpu_map_flush(). These paths already run under
local_bh_disable(), so use local_lock_nested_bh() which on non-RT is
a pure annotation with no overhead, and on PREEMPT_RT provides a
per-CPU sleeping lock that serializes access to the bq.

To reproduce, insert an mdelay(100) between bq->count = 0 and
__list_del_clearprev() in bq_flush_to_queue(), then run reproducer
provided by syzkaller.

Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260225121459.183121-2-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'close-race-in-freeing-special-fields-and-map-value'

Kumar Kartikeya Dwivedi says:

====================
Close race in freeing special fields and map value

There exists a race across various map types where the freeing of
special fields (tw, timer, wq, kptr, etc.) can be done eagerly when a
logical delete operation is done on a map value, such that the program
which continues to have access to such a map value can recreate the
fields and cause them to leak.

The set contains fixes for this case. It is a continuation of Mykyta's
previous attempt in [0], but applies to all fields. A test is included
which reproduces the bug reliably in absence of the fixes.

Local Storage Benchmarks
------------------------
Evaluation Setup: Benchmarked on a dual-socket Intel Xeon Gold 6348 (Ice
Lake) @ 2.60GHz (56 cores / 112 threads), with the CPU governor set to
performance. Bench was pinned to a single NUMA node throughout the test.

Benchmark comes from [1] using the following command:
./bench -p 1 local-storage-create --storage-type <socket,task> --batch-size <16,32,64>

Before the test, 10 runs of all cases ([socket|task] x 3 batch sizes x 7
iterations per batch size) are done to warm up and prime the machine.

Then, 3 runs of all cases are done (with and without the patch, across
reboots).

For each comparison, we have 21 samples, i.e. per batch size (e.g.
socket 16) of a given local storage, we have 3 runs x 7 iterations.

The statistics (mean, median, stddev) and t-test is done for each
scenario (local storage and batch size pair) individually (21 samples
for either case). All values are for local storage creations in thousand
creations / sec (k/s).

       Baseline (without patch)               With patch                       Delta
     Case      Median        Mean   Std. Dev.   Median        Mean   Std. Dev.   Median       %
---------------------------------------------------------------------------------------------------
socket 16     432.026     431.941    1.047     431.347     431.953    1.635      -0.679    -0.16%
socket 32     432.641     432.818    1.535     432.488     432.302    1.508      -0.153    -0.04%
socket 64     431.504     431.996    1.337     429.145     430.326    2.469      -2.359    -0.55%
  task 16      38.816      39.382    1.456      39.657      39.337    1.831      +0.841    +2.17%
  task 32      38.815      39.644    2.690      38.721      39.122    1.636      -0.094    -0.24%
  task 64      37.562      38.080    1.701      39.554      38.563    1.689      +1.992    +5.30%

The cases for socket are within the range of noise, and improvements in task
local storage are due to high variance (CV ~4%-6% across batch sizes). The only
statistically significant case worth mentioning is socket with batch size 64
with p-value from t-test < 0.05, but the absolute difference is small (~2k/s).

TL;DR there doesn't appear to be any significant regression or improvement.

  [0]: https://lore.kernel.org/bpf/20260216131341.1285427-1-mykyta.yatsenko5@gmail.com
  [1]: https://lore.kernel.org/bpf/20260205222916.1788211-1-ameryhung@gmail.com

Changelog:
----------
v2 -> v3
v2: https://lore.kernel.org/bpf/20260227052031.3988575-1-memxor@gmail.com

* Add syzbot Tested-by.
* Add Amery's Reviewed-by.
* Fix missing rcu_dereference_check() in __bpf_selem_free_rcu. (BPF CI Bot)
* Remove migrate_disable() in bpf_selem_free_rcu. (Alexei)

v1 -> v2
v1: https://lore.kernel.org/bpf/20260225185121.2057388-1-memxor@gmail.com

* Add Paul's Reviewed-by.
* Fix use-after-free in accessing bpf_mem_alloc embedded in map. (syzbot CI)
* Add benchmark numbers for local storage.
* Add extra test case for per-cpu hashmap coverage with up to 16 refcount leaks.
* Target bpf tree.
====================

Link: https://patch.msgid.link/20260227224806.646888-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add tests for special fields races

Add a couple of tests to ensure that the refcount drops to zero when we
exercise the race where creation of a special field succeeds the logical
bpf_obj_free_fields done when deleting an element. Prior to previous
changes, the fields would be freed eagerly and repopulate and end up
leaking, causing the reference to not drop down correctly. Running this
test on a kernel without fixes will cause a hang in delete_module, since
the module reference stays active due to the leaked kptr not dropping
it. After the fixes tests succeed as expected.

Reviewed-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260227224806.646888-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Retire rcu_trace_implies_rcu_gp() from local storage

This assumption will always hold going forward, hence just remove the
various checks and assume it is true with a comment for the uninformed
reader.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260227224806.646888-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Delay freeing fields in local storage

Currently, when use_kmalloc_nolock is false, the freeing of fields for a
local storage selem is done eagerly before waiting for the RCU or RCU
tasks trace grace period to elapse. This opens up a window where the
program which has access to the selem can recreate the fields after the
freeing of fields is done eagerly, causing memory leaks when the element
is finally freed and returned to the kernel.

Make a few changes to address this. First, delay the freeing of fields
until after the grace periods have expired using a __bpf_selem_free_rcu
wrapper which is eventually invoked after transitioning through the
necessary number of grace period waits. Replace usage of the kfree_rcu
with call_rcu to be able to take a custom callback. Finally, care needs
to be taken to extend the rcu barriers for all cases, and not just when
use_kmalloc_nolock is true, as RCU and RCU tasks trace callbacks can be
in flight for either case and access the smap field, which is used to
obtain the BTF record to walk over special fields in the map value.

While we're at it, drop migrate_disable() from bpf_selem_free_rcu, since
migration should be disabled for RCU callbacks already.

Fixes: 9bac675e6368 ("bpf: Postpone bpf_obj_free_fields to the rcu callback")
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260227224806.646888-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Lose const-ness of map in map_check_btf()

BPF hash map may now use the map_check_btf() callback to decide whether
to set a dtor on its bpf_mem_alloc or not. Unlike C++ where members can
opt out of const-ness using mutable, we must lose the const qualifier on
the callback such that we can avoid the ugly cast. Make the change and
adjust all existing users, and lose the comment in hashtab.c.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260227224806.646888-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Register dtor for freeing special fields

There is a race window where BPF hash map elements can leak special
fields if the program with access to the map value recreates these
special fields between the check_and_free_fields done on the map value
and its eventual return to the memory allocator.

Several ways were explored prior to this patch, most notably [0] tried
to use a poison value to reject attempts to recreate special fields for
map values that have been logically deleted but still accessible to BPF
programs (either while sitting in the free list or when reused). While
this approach works well for task work, timers, wq, etc., it is harder
to apply the idea to kptrs, which have a similar race and failure mode.

Instead, we change bpf_mem_alloc to allow registering destructor for
allocated elements, such that when they are returned to the allocator,
any special fields created while they were accessible to programs in the
mean time will be freed. If these values get reused, we do not free the
fields again before handing the element back. The special fields thus
may remain initialized while the map value sits in a free list.

When bpf_mem_alloc is retired in the future, a similar concept can be
introduced to kmalloc_nolock-backed kmem_cache, paired with the existing
idea of a constructor.

Note that the destructor registration happens in map_check_btf, after
the BTF record is populated and (at that point) avaiable for inspection
and duplication. Duplication is necessary since the freeing of embedded
bpf_mem_alloc can be decoupled from actual map lifetime due to logic
introduced to reduce the cost of rcu_barrier()s in mem alloc free path in
9f2c6e96c65e ("bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.").

As such, once all callbacks are done, we must also free the duplicated
record. To remove dependency on the bpf_map itself, also stash the key
size of the map to obtain value from htab_elem long after the map is
gone.

[0]: https://lore.kernel.org/bpf/20260216131341.1285427-1-mykyta.yatsenko5@gmail.com

Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
Fixes: 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260227224806.646888-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:
"The diffstat is dominated by changes to our TLB invalidation errata
  handling and the introduction of a new GCS selftest to catch one of
  the issues that is fixed here relating to PROT_NONE mappings.

   - Fix cpufreq warning due to attempting a cross-call with interrupts
     masked when reading local AMU counters

   - Fix DEBUG_PREEMPT warning from the delay loop when it tries to
     access per-cpu errata workaround state for the virtual counter

   - Re-jig and optimise our TLB invalidation errata workarounds in
     preparation for more hardware brokenness

   - Fix GCS mappings to interact properly with PROT_NONE and to avoid
     corrupting the pte on CPUs with FEAT_LPA2

   - Fix ioremap_prot() to extract only the memory attributes from the
     user pte and ignore all the other 'prot' bits"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: topology: Fix false warning in counters_read_on_cpu() for same-CPU reads
  arm64: Fix sampling the "stable" virtual counter in preemptible section
  arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
  arm64: tlb: Allow XZR argument to TLBI ops
  kselftest: arm64: Check access to GCS after mprotect(PROT_NONE)
  arm64: gcs: Honour mprotect(PROT_NONE) on shadow stack mappings
  arm64: gcs: Do not set PTE_SHARED on GCS mappings if FEAT_LPA2 is enabled
  arm64: io: Extract user memory type in ioremap_prot()
  arm64: io: Rename ioremap_prot() to __ioremap_prot()

Merge tag 'pci-v7.0-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull pci fixes from Bjorn Helgaas:

- Update MAINTAINERS email address (Shawn Guo)

- Refresh cached Endpoint driver MSI Message Address to fix a v7.0
   regression when kernel changes the address after firmware has
   configured it (Niklas Cassel)

- Flush Endpoint MSI-X writes so they complete before the outbound ATU
   entry is unmapped (Niklas Cassel)

- Correct the PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 value, which broke VMM use
   of PCI capabilities (Bjorn Helgaas)

* tag 'pci-v7.0-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
  PCI: Correct PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 value
  PCI: dwc: ep: Flush MSI-X write before unmapping its ATU entry
  PCI: dwc: ep: Refresh MSI Message Address cache on change
  MAINTAINERS: Update Shawn Guo's address for HiSilicon PCIe controller driver

Merge patch series "tighten nstree visibility checks"

Christian Brauner <brauner@kernel.org> says:

Listing various namespaces is currently only scoped on owning namespace.
We can make this more fine-grained so that we scope visibility even
tighter. To make it possible to change behavior restrict visibility for
now. This shouldn't be a big deal as there aren't actual large users out
there and paves the way to make this even cleaner in the future.

* patches from https://patch.msgid.link/20260226-work-visibility-fixes-v1-0-d2c2853313bd@kernel.org:
  selftests: fix mntns iteration selftests
  nstree: tighten permission checks for listing
  nsfs: tighten permission checks for handle opening
  nsfs: tighten permission checks for ns iteration ioctls

Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-0-d2c2853313bd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>

selftests: fix mntns iteration selftests

Now that we changed permission checking make sure that we reflect that
in the selftests.

Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-4-d2c2853313bd@kernel.org
Fixes: 9d87b1067382 ("selftests: add tests for mntns iteration")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.14+
Signed-off-by: Christian Brauner <brauner@kernel.org>

nstree: tighten permission checks for listing

Even privileged services should not necessarily be able to see other
privileged service's namespaces so they can't leak information to each
other. Use may_see_all_namespaces() helper that centralizes this policy
until the nstree adapts.

Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-3-d2c2853313bd@kernel.org
Fixes: 76b6f5dfb3fd ("nstree: add listns()")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.19+
Signed-off-by: Christian Brauner <brauner@kernel.org>

nsfs: tighten permission checks for handle opening

Even privileged services should not necessarily be able to see other
privileged service's namespaces so they can't leak information to each
other. Use may_see_all_namespaces() helper that centralizes this policy
until the nstree adapts.

Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-2-d2c2853313bd@kernel.org
Fixes: 5222470b2fbb ("nsfs: support file handles")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.18+
Signed-off-by: Christian Brauner <brauner@kernel.org>

nsfs: tighten permission checks for ns iteration ioctls

Even privileged services should not necessarily be able to see other
privileged service's namespaces so they can't leak information to each
other. Use may_see_all_namespaces() helper that centralizes this policy
until the nstree adapts.

Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-1-d2c2853313bd@kernel.org
Fixes: a1d220d9dafa ("nsfs: iterate through mount namespaces")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.12+
Signed-off-by: Christian Brauner <brauner@kernel.org>

tools/sched_ext: fix strtoul() misuse in scx_hotplug_seq()

scx_hotplug_seq() uses strtoul() but validates the result with a
negative check (val < 0), which can never be true for an unsigned
return value.

Use the endptr mechanism to verify the entire string was consumed,
and check errno == ERANGE for overflow detection.

Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Merge tag 'cxl-fixes-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull cxl fixes from Dave Jiang:

- Fix incorrect usages of decoder flags

- Validate payload size before accessing contents

- Fix race condition when creating nvdimm objects

- Fix deadlock on attach failure

* tag 'cxl-fixes-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl/region: Test CXL_DECODER_F_NORMALIZED_ADDRESSING as a bitmask
  cxl: Test CXL_DECODER_F_LOCK as a bitmask
  cxl/mbox: validate payload size before accessing contents in cxl_payload_from_user_allowed()
  cxl: Fix race of nvdimm_bus object when creating nvdimm objects
  cxl: Move devm_cxl_add_nvdimm_bridge() to cxl_pmem.ko
  cxl/port: Hold port host lock during dport adding.
  cxl/port: Introduce port_to_host() helper
  cxl/memdev: fix deadlock in cxl_memdev_autoremove() on attach failure

Merge tag 'mmc-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc

Pull MMC fixes from Ulf Hansson:
"MMC core:
   - Avoid bitfield RMW for claim/retune flags

  MMC host:
   - dw_mmc-rockchip: Fix runtime PM support for internal phase support
   - mmci: Fix device_node reference leak in of_get_dml_pipe_index()
   - sdhci-brcmstb: Use correct register offset for V1 pin_sel restore"

* tag 'mmc-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: core: Avoid bitfield RMW for claim/retune flags
  mmc: sdhci-brcmstb: use correct register offset for V1 pin_sel restore
  mmc: dw_mmc-rockchip: Fix runtime PM support for internal phase support
  mmc: mmci: Fix device_node reference leak in of_get_dml_pipe_index()

Merge tag 'block-7.0-20260227' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:
"Two sets of fixes, one for drbd, and one for the zoned loop driver"

* tag 'block-7.0-20260227' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  zloop: check for spurious options passed to remove
  zloop: advertise a volatile write cache
  drbd: fix null-pointer dereference on local read error
  drbd: Replace deprecated strcpy with strscpy
  drbd: fix "LOGIC BUG" in drbd_al_begin_io_nonblock()

Merge tag 'io_uring-7.0-20260227' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fixes from Jens Axboe:
"Just two minor patches in here, ensuring the use of READ_ONCE() for
  sqe field reading is consistent across the codebase. There were two
  missing cases, now they are covered too"

* tag 'io_uring-7.0-20260227' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/timeout: READ_ONCE sqe->addr
  io_uring/cmd_net: use READ_ONCE() for ->addr3 read

Merge tag 'xfs-fixes-7.0-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Carlos Maiolino:
"Nothing reeeally stands out here: a few bug fixes, some refactoring to
  easily fit the bug fixes, and a couple cosmetic changes"

* tag 'xfs-fixes-7.0-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: add static size checks for ioctl UABI
  xfs: remove duplicate static size checks
  xfs: Add comments for usages of some macros.
  xfs: Update lazy counters in xfs_growfs_rt_bmblock()
  xfs: Add a comment in xfs_log_sb()
  xfs: Fix xfs_last_rt_bmblock()
  xfs: don't report half-built inodes to fserror
  xfs: don't report metadata inodes to fserror
  xfs: fix potential pointer access race in xfs_healthmon_get
  xfs: fix xfs_group release bug in xfs_dax_notify_dev_failure
  xfs: fix xfs_group release bug in xfs_verify_report_losses
  xfs: fix copy-paste error in previous fix
  xfs: Fix error pointer dereference
  xfs: remove metafile inodes from the active inode stat
  xfs: cleanup inode counter stats
  xfs: fix code alignment issues in xfs_ondisk.c
  xfs: Replace &rtg->rtg_group with rtg_group()
  xfs: Refactoring the nagcount and delta calculation
  xfs: Replace ASSERT with XFS_IS_CORRUPT in xfs_rtcopy_summary()

Merge tag 'slab-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab fixes from Vlastimil Babka:

- Fix for spurious page allocation warnings on sheaf refill (Harry Yoo)

- Fix for CONFIG_MEM_ALLOC_PROFILING_DEBUG warnings (Suren
   Baghdasaryan)

- Fix for kernel-doc warning on ksize() (Sanjay Chitroda)

- Fix to avoid setting slab->stride later than on slab allocation.
   Doesn't yet fix the reports from powerpc; debugging is making
   progress (Harry Yoo)

* tag 'slab-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab: initialize slab->stride early to avoid memory ordering issues
  mm/slub: drop duplicate kernel-doc for ksize()
  mm/slab: mark alloc tags empty for sheaves allocated with __GFP_NO_OBJ_EXT
  mm/slab: pass __GFP_NOWARN to refill_sheaf() if fallback is available

Merge tag 'gpio-fixes-for-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio fixes from Bartosz Golaszewski:

- fix memory leaks in shared GPIO management

- normalize the return values of gpio_chip::get() in GPIO core on
   behalf of drivers that return invalid values (this is done because
   adding stricter sanitization of callback retvals led to breakages in
   existing users, we'll revert that once all are fixed)

* tag 'gpio-fixes-for-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  gpiolib: normalize the return value of gc->get() on behalf of buggy drivers
  gpio: shared: fix memory leaks

Merge tag 'sound-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"A bunch of small device-specific fixes. Mostly quirks and fix-ups for
  USB- and HD-audio at this time, in addition to a couple of ASoC AMD
  and Cirrus fixes"

* tag 'sound-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (24 commits)
  ASoC: SDCA: Fix comments for sdca_irq_request()
  ALSA: us144mkii: Drop kernel-doc markers
  ALSA: usb: qcom: Correct parameter comment for uaudio_transfer_buffer_setup()
  ALSA: usb-audio: Drop superfluous kernel-doc markers
  ALSA: hda: cs35l56: Remove unnecessary struct cs_dsp_client_ops
  ALSA: hda: cs35l56: Fix signedness error in cs35l56_hda_posture_put()
  ALSA: usb-audio: Use correct version for UAC3 header validation
  ALSA: hda/realtek: add quirk for Acer Nitro ANV15-51
  ALSA: hda/intel: increase default bdl_pos_adj for Nvidia controllers
  ALSA: usb-audio: Use inclusive terms
  ALSA: usb-audio: Avoid implicit feedback mode on DIYINHK USB Audio 2.0
  ALSA: usb-audio: Check max frame size for implicit feedback mode, too
  ALSA: usb-audio: Cap the packet size pre-calculations
  ASoC: amd: yc: Add ASUS EXPERTBOOK BM1503CDA to quirk table
  ASoC: cs42l43: Report insert for exotic peripherals
  ALSA: usb-audio: Skip clock selector for Focusrite devices
  ALSA: usb-audio: Add QUIRK_FLAG_SKIP_IFACE_SETUP
  ALSA: usb-audio: Remove VALIDATE_RATES quirk for Focusrite devices
  ALSA: usb-audio: Improve Focusrite sample rate filtering
  ALSA: hda/realtek: add quirk for Samsung Galaxy Book Flex (NT950QCT-A38A)
  ...

Merge tag 'drm-fixes-2026-02-27' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
"Regular fixes pull, amdxdna and amdgpu are the main ones, with a
  couple of intel fixes, then a scattering of fixes across drivers,
  nothing too major.

  i915/display:
   - Fix Panel Replay stuck with X during mode transitions on Panther
     Lake

  xe:
   - W/a fix for multi-cast registers
   - Fix xe_sync initialization issues

  amdgpu:
   - UserQ fixes
   - DC fix
   - RAS fixes
   - VCN 5 fix
   - Slot reset fix
   - Remove MES workaround that's no longer needed

  amdxdna:
   - deadlock fix
   - NULL ptr deref fix
   - suspend failure fix
   - OOB access fix
   - buffer overflow fix
   - input sanitiation fix
   - firmware loading fix

  dw-dp:
   - An error handling fix

  ethosu:
   - A binary shift overflow fix

  imx:
   - An error handling fix

  logicvc:
   - A dt node reference leak fix

  nouveau:
   - A WARN_ON removal

  samsung-dsim:
   - A memory leak fix

  tiny:
   - sharp-memory: NULL pointer deref fix

  vmwgfx:
   - A reference count and error handling fix"

* tag 'drm-fixes-2026-02-27' of https://gitlab.freedesktop.org/drm/kernel: (39 commits)
  drm/amd: Disable MES LR compute W/A
  drm/amdgpu: Fix error handling in slot reset
  drm/amdgpu/vcn5: Add SMU dpm interface type
  drm/amdgpu: Fix locking bugs in error paths
  drm/amdgpu: Unlock a mutex before destroying it
  drm/amd/display: Use GFP_ATOMIC in dc_create_stream_for_sink
  drm/amdgpu: add upper bound check on user inputs in wait ioctl
  drm/amdgpu: add upper bound check on user inputs in signal ioctl
  drm/amdgpu/userq: Do not allow userspace to trivially triger kernel warnings
  drm/amdgpu/userq: Fix reference leak in amdgpu_userq_wait_ioctl
  accel/amdxdna: Use a different name for latest firmware
  drm/client: Do not destroy NULL modes
  drm/gpusvm: Fix drm_gpusvm_pages_valid_unlocked() kernel-doc
  drm/xe/sync: Fix user fence leak on alloc failure
  drm/xe/sync: Cleanup partially initialized sync on parse failure
  drm/xe/wa: Steer RMW of MCR registers while building default LRC
  accel/amdxdna: Validate command buffer payload count
  accel/amdxdna: Prevent ubuf size overflow
  accel/amdxdna: Fix out-of-bounds memset in command slot handling
  accel/amdxdna: Fix command hang on suspended hardware context
  ...

PCI: Correct PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 value

fb82437fdd8c ("PCI: Change capability register offsets to hex") incorrectly
converted the PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 value from decimal 52 to hex
0x32:

-#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 52 /* v2 endpoints with link end here */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 0x32 /* end of v2 EPs w/ link */

This broke PCI capabilities in a VMM because subsequent ones weren't
DWORD-aligned.

Change PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 to the correct value of 0x34.

fb82437fdd8c was from Baruch Siach <baruch@tkos.co.il>, but this was not
Baruch's fault; it's a mistake I made when applying the patch.

Fixes: fb82437fdd8c ("PCI: Change capability register offsets to hex")
Reported-by: David Woodhouse <dwmw2@infradead.org>
Closes: https://lore.kernel.org/all/3ae392a0158e9d9ab09a1d42150429dd8ca42791.camel@infradead.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Krzysztof Wilczyński <kwilczynski@kernel.org>

smb: client: Use snprintf in cifs_set_cifscreds

Replace unbounded sprintf() calls with the safer snprintf(). Avoid using
magic numbers and use strlen() to calculate the key descriptor buffer
size. Save the size in a local variable and reuse it for the bounded
snprintf() calls. Remove CIFSCREDS_DESC_SIZE.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

mm/slab: initialize slab->stride early to avoid memory ordering issues

When alloc_slab_obj_exts() is called later (instead of during slab
allocation and initialization), slab->stride and slab->obj_exts are
updated after the slab is already accessible by multiple CPUs.

The current implementation does not enforce memory ordering between
slab->stride and slab->obj_exts. For correctness, slab->stride must be
visible before slab->obj_exts. Otherwise, concurrent readers may observe
slab->obj_exts as non-zero while stride is still stale.

With stale slab->stride, slab_obj_ext() could return the wrong obj_ext.
This could cause two problems:

  - obj_cgroup_put() is called on the wrong objcg, leading to
    a use-after-free due to incorrect reference counting [1] by
    decrementing the reference count more than it was incremented.

  - refill_obj_stock() is called on the wrong objcg, leading to
    a page_counter overflow [2] by uncharging more memory than charged.

Fix this by unconditionally initializing slab->stride in
alloc_slab_obj_exts_early(), before the need_slab_obj_exts() check.
In the case of SLAB_OBJ_EXT_IN_OBJ, it is overridden in the function.

This ensures updates to slab->stride become visible before the slab
can be accessed by other CPUs via the per-node partial slab list
(protected by spinlock with acquire/release semantics).

Thanks to Shakeel Butt for pointing out this issue [3].

[vbabka@kernel.org: the bug reports [1] and [2] are not yet fully fixed,
with investigation ongoing, but it is nevertheless a step in the right
direction to only set stride once after allocating the slab and not
change it later ]

Fixes: 7a8e71bc619d ("mm/slab: use stride to access slabobj_ext")
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Link: https://lore.kernel.org/lkml/ca241daa-e7e7-4604-a48d-de91ec9184a5@linux.ibm.com
Link: https://lore.kernel.org/all/ddff7c7d-c0c3-4780-808f-9a83268bbf0c@linux.ibm.com
Link: https://lore.kernel.org/linux-mm/aZu9G9mVIVzSm6Ft@hyeyoo
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

media: dvb-net: fix OOB access in ULE extension header tables

The ule_mandatory_ext_handlers[] and ule_optional_ext_handlers[] tables
in handle_one_ule_extension() are declared with 255 elements (valid
indices 0-254), but the index htype is derived from network-controlled
data as (ule_sndu_type & 0x00FF), giving a range of 0-255. When
htype equals 255, an out-of-bounds read occurs on the function pointer
table, and the OOB value may be called as a function pointer.

Add a bounds check on htype against the array size before either table
is accessed. Out-of-range values now cause the SNDU to be discarded.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Ariel Silver <arielsilver77@gmail.com>
Signed-off-by: Ariel Silver <arielsilver77@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

smb: client: Don't log plaintext credentials in cifs_set_cifscreds

When debug logging is enabled, cifs_set_cifscreds() logs the key
payload and exposes the plaintext username and password. Remove the
debug log to avoid exposing credentials.

Fixes: 8a8798a5ff90 ("cifs: fetch credentials out of keyring for non-krb5 auth multiuser mounts")
Cc: stable@vger.kernel.org
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: fix broken multichannel with krb5+signing

When mounting a share with 'multichannel,max_channels=n,sec=krb5i',
the client was duplicating signing key for all secondary channels,
thus making the server fail all commands sent from secondary channels
due to bad signatures.

Every channel has its own signing key, so when establishing a new
channel with krb5 auth, make sure to use the new session key as the
derived key to generate channel's signing key in SMB2_auth_kerberos().

Repro:

$ mount.cifs //srv/share /mnt -o multichannel,max_channels=4,sec=krb5i
$ sleep 5
$ umount /mnt
$ dmesg
  ...
  CIFS: VFS: sign fail cmd 0x5 message id 0x2
  CIFS: VFS: \\srv SMB signature verification returned error = -13
  CIFS: VFS: sign fail cmd 0x5 message id 0x2
  CIFS: VFS: \\srv SMB signature verification returned error = -13
  CIFS: VFS: sign fail cmd 0x4 message id 0x2
  CIFS: VFS: \\srv SMB signature verification returned error = -13

Reported-by: Xiaoli Feng <xifeng@redhat.com>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: use atomic_t for mnt_cifs_flags

Use atomic_t for cifs_sb_info::mnt_cifs_flags as it's currently
accessed locklessly and may be changed concurrently in mount/remount
and reconnect paths.

Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>

Merge tag 'v7.0-rc1-ksmbd-server-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:

- auth security improvement

- fix potential buffer overflow in smbdirect negotiation

* tag 'v7.0-rc1-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix signededness bug in smb_direct_prepare_negotiation()
ksmbd: Compare MACs in constant time

Merge tag 'mm-hotfixes-stable-2026-02-26-14-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"12 hotfixes.  7 are cc:stable.  8 are for MM.

  All are singletons - please see the changelogs for details"

* tag 'mm-hotfixes-stable-2026-02-26-14-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: update Yosry Ahmed's email address
  mailmap: add entry for Daniele Alessandrelli
  mm: fix NULL NODE_DATA dereference for memoryless nodes on boot
  mm/tracing: rss_stat: ensure curr is false from kthread context
  mm/kfence: fix KASAN hardware tag faults during late enablement
  mm/damon/core: disallow non-power of two min_region_sz
  Squashfs: check metadata block offset is within range
  MAINTAINERS, mailmap: update e-mail address for Vlastimil Babka
  liveupdate: luo_file: remember retrieve() status
  mm: thp: deny THP for files on anonymous inodes
  mm: change vma_alloc_folio_noprof() macro to inline function
  mm/kfence: disable KFENCE upon KASAN HW tags enablement

Merge tag 'amd-drm-fixes-7.0-2026-02-26' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes

amd-drm-fixes-7.0-2026-02-26:

amdgpu:
- UserQ fixes
- DC fix
- RAS fixes
- VCN 5 fix
- Slot reset fix
- Remove MES workaround that's no longer needed

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260226161330.3549393-1-alexander.deucher@amd.com

Merge tag 'dma-mapping-7.0-2026-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux

Pull dma-mapping fixes from Marek Szyprowski:
"Two DMA-mapping fixes for the recently merged API rework (Jiri Pirko
  and Stian Halseth)"

* tag 'dma-mapping-7.0-2026-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  sparc: Fix page alignment in dma mapping
  dma-mapping: avoid random addr value print out on error path

spi: stm32: fix missing pointer assignment in case of dma chaining

Commit c4f2c05ab029 ("spi: stm32: fix pointer-to-pointer variables usage")
introduced a regression since dma descriptors generated as part of the
stm32_spi_prepare_rx_dma_mdma_chaining function are not well propagated
to the caller function, leading to mdma-dma chaining being no more
functional.

Fixes: c4f2c05ab029 ("spi: stm32: fix pointer-to-pointer variables usage")
Signed-off-by: Alain Volmat <alain.volmat@foss.st.com>
Acked-by: Antonio Quartulli <antonio@mandelbit.com>
Link: https://patch.msgid.link/20260224-spi-stm32-chaining-fix-v1-1-5da7a4851b66@foss.st.com
Signed-off-by: Mark Brown <broonie@kernel.org>

Merge tag 'drm-xe-fixes-2026-02-26' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes

- W/a fix for multi-cast registers (Roper)
- Fix xe_sync initialization issues (Shuicheng)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patch.msgid.link/aaBGHy_0RLGGIBP5@intel.com

Merge tag 'acpi-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
"New platform quirks for two systems:

   - Add a quirk for Lenovo G70-35 to save the ACPI NVS memory on system
     suspend (Piotr Mazek)

   - Add a DMI quirk for Acer Aspire One D255 to work around a backlight
     issue by returning false to _OSI("Windows 2009") (Sofia Schneider)"

* tag 'acpi-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: OSI: Add DMI quirk for Acer Aspire One D255
  ACPI: PM: Save NVS memory on Lenovo G70-35

pinctrl: cy8c95x0: Don't miss reading the last bank registers

When code had been changed to use for_each_set_clump8(), it mistakenly
switched from chip->nport to chip->tpin since the cy8c9540 and cy8c9560
have a 4-pin gap. This, in particular, led to the missed read of
the last bank interrupt status register and hence missing interrupts
on those pins. Restore the upper limit in for_each_set_clump8() to take
into consideration that gap.

Fixes: 83e29a7a1fdf ("pinctrl: cy8c95x0; Switch to use for_each_set_clump8()")
Cc: stable@vger.kernel.org
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>

Merge tag 'pm-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
"These fix two intel_pstate driver issues causing it to crash on sysfs
  attribute accesses when some CPUs in the system are offline, finalize
  changes related to turning pm_runtime_put() into a void function, and
  update Daniel Lezcano's contact information:

   - Fix two issues in the intel_pstate driver causing it to crash when
     its sysfs interface is used on a system with some offline CPUs
     (David Arcari, Srinivas Pandruvada)

   - Update the last user of the pm_runtime_put() return value to
     discard it and turn pm_runtime_put() into a void function (Rafael
     Wysocki)

   - Update Daniel Lezcano's contact information in MAINTAINERS and
     .mailmap (Daniel Lezcano)"

* tag 'pm-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  MAINTAINERS: Update contact with the kernel.org address
  cpufreq: intel_pstate: Fix crash during turbo disable
  cpufreq: intel_pstate: Fix NULL pointer dereference in update_cpu_qos_request()
  PM: runtime: Change pm_runtime_put() return type to void
  pmdomain: imx: gpcv2: Discard pm_runtime_put() return value

Merge tag 'for-linus-7.0-1' of https://github.com/cminyard/linux-ipmi

Pull IPMI driver fixes from Corey Minyard:
"This mostly revolves around getting the driver to behave when the IPMI
  device misbehaves. Past attempts have not worked very well because I
  didn't have hardware I could make do this, and AI was fairly useless
  for help on this.

  So I modified qemu and my test suite so I could reproduce a
  misbehaving IPMI device, and with that I was able to fix the issues"

* tag 'for-linus-7.0-1' of https://github.com/cminyard/linux-ipmi:
  ipmi:si: Fix check for a misbehaving BMC
  ipmi:msghandler: Handle error returns from the SMI sender
  ipmi:si: Don't block module unload if the BMC is messed up
  ipmi:si: Use a long timeout when the BMC is misbehaving
  ipmi:si: Handle waiting messages when BMC failure detected
  ipmi:ls2k: Make ipmi_ls2k_platform_driver static
  ipmi: ipmb: initialise event handler read bytes
  ipmi: Consolidate the run to completion checking for xmit msgs lock
  ipmi: Fix use-after-free and list corruption on sender error

sched_ext: Fix SCX_EFLAG_INITIALIZED being a no-op flag

SCX_EFLAG_INITIALIZED is the sole member of enum scx_exit_flags with no
explicit value, so the compiler assigns it 0. This makes the bitwise OR
in scx_ops_init() a no-op:

sch->exit_info->flags |= SCX_EFLAG_INITIALIZED; /* |= 0 */

As a result, BPF schedulers cannot distinguish whether ops.init()
completed successfully by inspecting exit_info->flags.

Assign the value 1LLU << 0 so the flag is actually set.

Fixes: f3aec2adce8d ("sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init()")
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Merge branch 'acpi-pm'

Add a quirk for Lenovo G70-35 to save the ACPI NVS memory on system
suspend (Piotr Mazek)

* acpi-pm:
ACPI: PM: Save NVS memory on Lenovo G70-35

Merge branches 'pm-cpufreq' and 'pm-runtime'

Merge cpufreq and runtime PM updates for 7.0-rc2:

- Fix two issues in the intel_pstate driver causing it to crash when
   its sysfs interface is used on a system with some offline CPUs (David
   Arcari, Srinivas Pandruvada)

- Update the last user of the pm_runtime_put() return value to discard
   it and turn pm_runtime_put() into a void function (Rafael Wysocki)

* pm-cpufreq:
  cpufreq: intel_pstate: Fix crash during turbo disable
  cpufreq: intel_pstate: Fix NULL pointer dereference in update_cpu_qos_request()

* pm-runtime:
  PM: runtime: Change pm_runtime_put() return type to void
  pmdomain: imx: gpcv2: Discard pm_runtime_put() return value

Merge tag 'drm-misc-fixes-2026-02-26' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes

Several fixes for:
  - amdxdna: Fix for a deadlock, a NULL pointer dereference, a suspend
    failure, a hang, an out-of-bounds access, a buffer overflow, input
    sanitization and other minor fixes.
  - dw-dp: An error handling fix
  - ethosu: A binary shift overflow fix
  - imx: An error handling fix
  - logicvc: A dt node reference leak fix
  - nouveau: A WARN_ON removal
  - samsung-dsim: A memory leak fix
  - sharp-memory: A NULL pointer dereference fix
  - vmgfx: A reference count and error handling fix

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <mripard@redhat.com>
Link: https://patch.msgid.link/20260226-heretic-stimulating-swine-6a2f27@penduick

Merge tag 'drm-intel-fixes-2026-02-25' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes

- Fix #7153: Panel Replay stuck with X during mode transitions on Panther Lake

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patch.msgid.link/aZ8JxQkN5oMxXsT6@jlahtine-mobl

selftests/bpf: Fix OOB read in dmabuf_collector

Dmabuf name allocations can be less than DMA_BUF_NAME_LEN characters,
but bpf_probe_read_kernel always tries to read exactly that many bytes.
If a name is less than DMA_BUF_NAME_LEN characters,
bpf_probe_read_kernel will read past the end. bpf_probe_read_kernel_str
stops at the first NUL terminator so use it instead, like
iter_dmabuf_for_each already does.

Fixes: ae5d2c59ecd7 ("selftests/bpf: Add test for dmabuf_iter")
Reported-by: Jerome Lee <jaewookl@quicinc.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Link: https://lore.kernel.org/r/20260225003349.113746-1-tjmercier@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>