git.ipfire.org Git - thirdparty/kernel/linux.git/log

vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags

A recent build failure[1] exposed the diffculty of working with the
current octal and hex definitions of O_ flags when trying to find a gap
for a new flag. This difficulty is compounded by the fact that O_ flags
may have architectural specific values.

Replace the hex/octal #defines, which are hard to parse when looking for
free bits, with explicit bit shifts like (1 << 11). Also, add comments
that identify which architectures redefine some of the seemingly free
("cursed") bits in uapi/asm-generic/fcntl.h. These should not be used to
define new O_ flags (for now, at least).

The translastion was done with Claude Opus 4.8, and verified with a
(non-AI) gawk script. The accounting of which architectures claim
which bit-gaps in uapi/asm-generic/fcntl.h is also done by hand.

[1]: https://lore.kernel.org/all/agruPPybCx8q2XcJ@sirena.org.uk/

Assisted-by: Claude:Opus 4.8
Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
Link: https://patch.msgid.link/20260604222405.5382-1-jkoolstra@xs4all.nl
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

bpf: add bpf_real_inode() kfunc

Add a sleepable BPF kfunc that resolves the real inode backing a dentry
via d_real_inode(). On overlay/union filesystems the inode attached to
the dentry is the overlay inode which does not carry the underlying
device information. d_real_inode() resolves through the overlay and
returns the inode from the lower, real filesystem.

This is used in the RestrictFilesytemAccess bpf program that has been
merged into systemd a little while ago.

Link: https://github.com/systemd/systemd/pull/41340
Link: https://patch.msgid.link/20260526-work-bpf-verity-v2-1-cd0b1850d31b@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

bpf: Add simple xattr support to bpffs

Add support for extended attributes on bpffs inodes so that user space
and BPF LSM programs can attach metadata, for example, a content hash
or a security label - to a pinned object or directory. BPF LSM or user
space tooling can then uniformly look at this (e.g. security.bpf.*) in
similar way to other fs'es. The store is in-memory and non-persistent:
it lives only for the lifetime of the mount, like everything else in
bpffs. The modelling is similar to tmpfs.

bpffs serves the trusted.* and security.* namespaces; user.* is left
unsupported. As bpffs is FS_USERNS_MOUNT, security.* is reachable by
the unprivileged mounter in a user namespace, and thus we are using
the simple_xattr_set_limited infra there (trusted.* needs global
CAP_SYS_ADMIN).

bpf_fill_super() is open-coded instead of using simple_fill_super(),
because the root inode must now be allocated through bpf_fs_alloc_inode()
i.e. carry the bpf_fs_inode wrapper and come from the right cache -
which requires s_op (and s_xattr) to be installed before the first
inode is created. While at it, also harden s_iflags with SB_I_NOEXEC
and SB_I_NODEV.

bpf_fs_listxattr() is only reachable through the filesystem via
i_op->listxattr, so the BPF token inode is left untouched. Name-based
fsetxattr()/fgetxattr() on a token fd still work since the get/set
handlers are installed at the superblock.

For security.* namespace, we use simple_xattr_set_limited() but
there was no simple_xattr_add_limited() API yet which was needed
in bpf_fs_initxattrs() to avoid underflows in the accounting. The
symlink target is freed in bpf_free_inode() rather than in
bpf_destroy_inode() so that it is released only after an RCU grace
period, as an RCU path walk following the symlink may still
dereference inode->i_link in security_inode_follow_link(). Lastly,
the bpf_symlink() allocated the symlink target is switched to
GFP_KERNEL_ACCOUNT, so the string is charged to the caller's memcg.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20260602074012.416289-1-daniel@iogearbox.net
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

kernfs: link kn to its parent before the LSM init hook

After commit 12e9e3cd03b5 ("simpe_xattr: use per-sb cache"),
kernfs_xattr_set() and kernfs_xattr_get() compute the cache via
kernfs_root(kn) before any other check.  kernfs_root(kn) walks
kn->__parent first and falls back to kn->dir.root, both of which are
NULL on a freshly kmem_cache_zalloc()'d kn. kn->__parent was being set
in kernfs_new_node() after __kernfs_new_node() returned, and kn->dir.root
is set even later by kernfs_create_dir_ns() / kernfs_create_empty_dir().

The LSM kernfs_init_security hook is invoked from inside
__kernfs_new_node(), before either field has been initialized.
selinux_kernfs_init_security() ends with kernfs_xattr_set(kn,
XATTR_NAME_SELINUX, ...).  kernfs_root(kn) then returns NULL, and
&((struct kernfs_root *)NULL)->xa_cache evaluates to
offsetof(struct kernfs_root, xa_cache) which faults:

  BUG: kernel NULL pointer dereference, address: 00000000000000e0
  RIP: 0010:simple_xattr_set+0x27/0x8b0
  Call Trace:
   kernfs_xattr_set+0x63/0xb0
   selinux_kernfs_init_security+0x13b/0x270
   security_kernfs_init_security+0x36/0xc0
   __kernfs_new_node+0x182/0x290
   kernfs_new_node+0x80/0xc0
   kernfs_create_dir_ns+0x2b/0xa0
   cgroup_create+0x116/0x380
   cgroup_mkdir+0x7c/0x1a0

Reproduces deterministically at PID 1 (systemd) on an SELinux-enabled
distro. The first cgroup mkdir under /sys/fs/cgroup with a labelled
parent panics the kernel.

The LSM hook's contract is that the kn_dir argument is the parent of
the new kn, so kn->__parent should already point at kn_dir when the
hook runs.  Move kernfs_get(parent) and rcu_assign_pointer of
kn->__parent from kernfs_new_node() into __kernfs_new_node() right
before the security hook, and unwind the parent reference on the
err_out4 path.  kernfs_root(kn) then takes its parent branch during
the hook and returns parent->dir.root, which is the correct root.

This also closes the same-shape latent bug in kernfs_xattr_get() (which
today is hidden only by kernfs_iattrs_noalloc() returning NULL on a
fresh kn).

Fixes: 12e9e3cd03b5 ("simpe_xattr: use per-sb cache")
Reported-by: Calum Mackay <calum.mackay@oracle.com>
Closes: https://lore.kernel.org/all/5386153f-9112-4971-98fc-de90d7aae2c6@oracle.com/
Link: https://patch.msgid.link/20260526-ablief-demut-wehen-aef8446ef5c9@brauner
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Merge patch series "Rework simple xattrs"

Christian Brauner (Amutable) <brauner@kernel.org> says:

Rework the simple xattrs api to make it more efficient and easier to
use for all consumers.

* patches from https://patch.msgid.link/20260605135322.2632068-1-mszeredi@redhat.com:
  simpe_xattr: use per-sb cache
  simple_xattr: change interface to pass struct simple_xattrs **
  tmpfs: simplify constructing "security.foo" xattr names
  kernfs: fix xattr race condition with multiple superblocks

Link: https://patch.msgid.link/20260605135322.2632068-1-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>

simpe_xattr: use per-sb cache

Move the hash table to the super block to remove excessive overhead in case
of small number of xattrs per inode.

Add linked list to the inode, used for listxattr and eviction.  Listxattr
uses rcu protection to iterate the list of xattrs.

Before being made per-sb, lazy allocation was protected by inode lock.  Now
inode lock no longer provides sufficient exclusion, so use cmpxchg() to
ensure atomicity.

Though I haven't found a description of this pattern, after some research
it seems that cmpxchg_release() and READ_ONCE() should provide the
necessary memory barriers.

Use simple_xattr_free_rcu() in simple_xattrs_free(). This is needed because
the hash table is now shared between inodes and lookup on a different inode
might be running the compare function on the just freed element within the
RCU grace period.

Following stats are based on slabinfo diff, after creating 100k empty
files, then adding a "user.test=foo" xattr to each:

v7.0 (no rhashtable):
  File creation: 993.40 bytes/file
  Xattr addition: 79.99 bytes/file

v7.1-rc2 (per-inode rhashtable):
  File creation: 939.73 bytes/file
  Xattr addition: 1296.08 bytes/file

v7.1-rc2 + this patch (per-sb rhashtable)
  File creation: 946.84 bytes/file
  Xattr addition: 111.86 bytes/file

The overhead of a single xattr is reduced to nearly v7.0 levels.  The per
xattr overhead is slightly larger due to the addition of three pointers to
struct simple_xattr.

Fixes: b32c4a213698 ("xattr: add rhashtable-based simple_xattr infrastructure")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://patch.msgid.link/20260605135322.2632068-5-mszeredi@redhat.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

simple_xattr: change interface to pass struct simple_xattrs **

Change the simple_xattr API to accept pointer-to-pointer (struct
simple_xattrs **) instead of pointer. This allows the functions to handle
lazy allocation internally without requiring callers to use
simple_xattrs_lazy_alloc().

The simple_xattr_set(), simple_xattr_set_limited() and simple_xattr_add()
functions now handle allocation when xattrs is NULL. simple_xattrs_free()
now also frees the xattrs structure itself and sets the pointer to NULL.

This simplifies callers and removes the need for most callers to explicitly
manage xattrs allocation and lifetime.

In shmem_initxattrs(), the total required space for all initial xattrs
(ispace) is pre-calculated and deducted from sbinfo->free_ispace.

Since this patch modifies the function to add new xattrs directly to the
inode's &info->xattrs list rather than using a local temporary variable, a
failure means that the partially populated info->xattrs list remains
attached to the inode.

When the VFS caller handles the -ENOMEM error, it drops the newly created
inode via iput(), shmem_free_inode() adds freed to sbinfo->free_ispace a
second time, permanently inflating the tmpfs free space quota.

Fix by substracting already added xattrs from ispace.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://patch.msgid.link/20260605135322.2632068-4-mszeredi@redhat.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

tmpfs: simplify constructing "security.foo" xattr names

Use kasprintf() instead of doing it with kmalloc() + 2 x memcpy().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://patch.msgid.link/20260605135322.2632068-3-mszeredi@redhat.com
Tested-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

kernfs: fix xattr race condition with multiple superblocks

Multiple superblocks with different namespaces can share the same
kernfs_node when kernfs_test_super() finds a matching root but
different namespace. This means multiple inodes from different
superblocks can reference the same kernfs_node->iattr->xattrs
structure.

The VFS layer only holds per-inode locks during xattr operations,
which is insufficient to serialize concurrent xattr modifications on
the shared kernfs_node. This can lead to race conditions in
simple_xattr_set() where the lookup->replace/remove sequence is not
atomic with respect to operations from other superblocks.

Fix this by protecting xattr operations with the existing hashed
kernfs_locks->open_file_mutex[] array, which is already used to
protect per-node open file data. The hashed mutex array provides
scalable per-node serialization (scaled by CPU count, up to 1024 locks
on 32+ CPU systems) with zero memory overhead.

Changes:
- Rename open_file_mutex[] to node_mutex[] to reflect dual purpose
- Add kernfs_node_lock_ptr() and kernfs_node_lock() helpers
- Protect simple_xattr_set() calls in kernfs_xattr_set() and
kernfs_vfs_user_xattr_set() with the hashed mutex
- Update file.c to use new helpers via compatibility wrappers
- Update documentation to explain the extended lock usage

Fixes: b32c4a213698 ("xattr: add rhashtable-based simple_xattr infrastructure")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260601162454.2116375-1-mszeredi%40redhat.com
Assisted-by: Claude:claude-sonnet-4-5
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://patch.msgid.link/20260605135322.2632068-2-mszeredi@redhat.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

debugobjects: Don't call fill_pool() in early boot hardirq context

When booting a debug PREEMPT_RT kernel on an ARM64 system, a "inconsistent
{HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage" lockdep warning message was
reported to the console.

During early boot, interrupts are enabled before the scheduler is
enabled. In this window (before SYSTEM_SCHEDULING is set) interrupts can
fire and in the hard interrupt context handler attempt to fill the pool

This can lead to a deadlock when the interrupt occurred when the interrupt
hits a region which holds a lock that is required to be taken in the
allocation path.

Add a new can_fill_pool() helper and reorder the exception rule and forbid
this scenario by excluding allocations from hard interrupt context.

Fixes: 06e0ae988f6e ("debugobjects: Allow to refill the pool before SYSTEM_SCHEDULING")
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260605173038.495075-1-longman@redhat.com

Merge tag 'pin-init-v7.2' of https://github.com/Rust-for-Linux/linux into rust-next

Pull pin-init updates from Gary Guo:
"User visible changes:

   - Do not generate 'non_snake_case' warnings for identifiers that are
     syntactically just users of a field name. This would allow all
     '#[allow(non_snake_case)]' in nova-core to be removed, which I will
     send to the nova tree next cycle.

   - Filter non-cfg attributes out properly in derived structs. This
     improves pin-init compatibility with other derive macros.

   - Insert projection types' where clause properly.

  Other changes:

   - Bump MSRV to 1.82, plus associated cleanups.

   - Overhaul how init slots are projected. The new approach is easier
     to justify with safety comments.

   - Mark more functions as inline, which should help mitigate the
     super-long symbol name issue due to lack of inlining.

   - Various small code quality cleanups."

* tag 'pin-init-v7.2' of https://github.com/Rust-for-Linux/linux: (27 commits)
  rust: pin_init: internal: use `loop {}` to produce never value
  rust: pin-init: remove `E` from `InitClosure`
  rust: pin-init: move `InitClosure` out from `__internal`
  rust: pin-init: docs: fix typos in MaybeZeroable documentation
  rust: pin-init: internal: suppress `non_snake_case` lint in `[pin_]init!`
  rust: pin-init: internal: suppress `non_snake_case` lint in `#[pin_data]`
  rust: pin-init: internal: pin_data: filter non-`#[cfg]` attr in generated code
  rust: pin-init: internal: project using full slot
  rust: pin-init: internal: project slots instead of references
  rust: pin-init: internal: make `make_closure` inherent methods
  rust: pin-init: internal: use marker on drop guard type for pinned fields
  rust: pin-init: internal: init: handle code blocks early
  rust: pin-init: internal: add `PhantomInvariant` and `PhantomInvariantLifetime`
  rust: pin-init: internal: pin_data: add struct to record field info
  rust: pin-init: internal: pin_data: use closure for `handle_field`
  rust: pin-init: examples: fix `useless_borrows_in_formatting` clippy warning
  rust: pin-init: internal: remove `collect_tuple` polyfill after MSRV bump
  rust: pin-init: internal: turn `PhantomPinned` error into warnings
  rust: pin-init: cleanup workaround for old Rust compiler
  rust: pin-init: fix badge URL in README
  ...

kconfig: Remove the architecture specific config for Propeller

The CONFIG_PROPELLER_CLANG option currently depends on
ARCH_SUPPORTS_PROPELLER_CLANG, but this dependency seems unnecessary.

Remove ARCH_SUPPORTS_PROPELLER_CLANG and allow users to control
Propeller builds solely through CONFIG_PROPELLER_CLANG. This simplifies
the kconfig and avoids potential confusion.

Move the .llvm_bb_addr_map sections grouping to
include/asm-generic/vmlinux.lds.h.

The Propeller documentation has been updated to reflect the most
recent tool location and now includes instructions for arm64.

Contributor Acknowledgments:
* SPE instructions: Daniel Hoekwater <hoekwater@google.com>

Signed-off-by: Rong Xu <xur@google.com>
Suggested-by: Will Deacon <will@kernel.org>
Suggested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Yabin Cui <yabinc@google.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20260604195612.3757860-3-xur@google.com
Signed-off-by: Nathan Chancellor <nathan@kernel.org>

kconfig: Remove the architecture specific config for AutoFDO

The CONFIG_AUTOFDO_CLANG option currently depends on
ARCH_SUPPORTS_AUTOFDO_CLANG, but this dependency seems unnecessary.

Remove ARCH_SUPPORTS_AUTOFDO_CLANG and allow users to control AutoFDO
builds solely through CONFIG_AUTOFDO_CLANG. This simplifies the kconfig
and avoids potential confusion.

Expand the AutoFDO documentation to include instructions for arm64.

Contributor acknowledgments:
* SPE instructions: Daniel Hoekwater <hoekwater@google.com>
* ETM instructions: Yabin Cui <yabinc@google.com>

Signed-off-by: Rong Xu <xur@google.com>
Suggested-by: Will Deacon <will@kernel.org>
Tested-by: Yabin Cui <yabinc@google.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20260604195612.3757860-2-xur@google.com
Signed-off-by: Nathan Chancellor <nathan@kernel.org>

modpost: Add __llvm_covfun and __llvm_covmap to section_white_list

Modpost emits hundreds of warnings when using Clang to build for ARCH=um
and CONFIG_GCOV=y. e.g.:
    vmlinux (__llvm_covfun): unexpected non-allocatable section.
    Did you forget to use "ax"/"aw" in a .S file?
    Note that for example <linux/init.h> contains
    section definitions for use in .S files.

For example, when we use LLVM for a kunit user mode build with coverage:
    python3 tools/testing/kunit/kunit.py build --make_options LLVM=1 \
        --kunitconfig=tools/testing/kunit/configs/default.config \
        --kunitconfig=tools/testing/kunit/configs/coverage_uml.config

The behaviour occurs when building the kernel for ARCH=um with code
coverage enabled. The warnings come from modpost's check_sec_ref
function, which ensures no sections reference others that will be
discarded. covfun and covmap sections must reference __init and __exit
sections to collect coverage data, triggering the modpost warning.

To suppress these warnings, these section names have been added to
modpost's whitelist. This is unlikely to suppress legitimate warnings as
Clang will only insert these sections when building with coverage, and
can be assumed to manage these references safely.

Signed-off-by: James Lee <james@codeconstruct.com.au>
Link: https://patch.msgid.link/20260604-dev-coverage-patch-v1-1-9f9368253cb4@codeconstruct.com.au
Signed-off-by: Nathan Chancellor <nathan@kernel.org>

selftests/bpf: Inspect the signature verdict exposed to BPF LSM

Add a minimal BPF LSM program on lsm/bpf_prog_load that, for loads on
the monitored thread, reads back prog->aux->sig.{verdict,keyring_type,
keyring_serial}, and a signed_loader subtest that drives the same
gen_loader loader through the hook twice: i) /unsigned/ where the LSM
must observe UNSIGNED, no keyring and serial 0; ii) /signed/ where the
very same insns signed against the session keyring must be observed as
VERIFIED with a user keyring, and the recorded keyring_serial must be
equal to the resolved session keyring serial. Loading (not running) the
loader is sufficient since the verdict is attached at load time.

  # LDLIBS=-static PKG_CONFIG='pkg-config --static' ./vmtest.sh -- ./test_progs -t signed_loader
  [    1.970530] clocksource: Switched to clocksource tsc
  #405/1   signed_loader/metadata_check_shape:OK
  #405/2   signed_loader/metadata_match:OK
  #405/3   signed_loader/metadata_sha_mismatch:OK
  #405/4   signed_loader/metadata_not_exclusive:OK
  #405/5   signed_loader/metadata_hash_not_computed:OK
  #405/6   signed_loader/signature_enforced:OK
  #405/7   signed_loader/signature_too_large:OK
  #405/8   signed_loader/signature_bad_keyring:OK
  #405/9   signed_loader/metadata_ctx_max_entries_ignored:OK
  #405/10  signed_loader/metadata_ctx_initial_value_ignored:OK
  #405/11  signed_loader/signature_authenticates_insns:OK
  #405/12  signed_loader/hash_requires_frozen:OK
  #405/13  signed_loader/no_update_after_freeze:OK
  #405/14  signed_loader/freeze_writable_mmap:OK
  #405/15  signed_loader/no_writable_mmap_frozen:OK
  #405/16  signed_loader/map_hash_matches_libbpf:OK
  #405/17  signed_loader/map_hash_multi_element:OK
  #405/18  signed_loader/map_hash_bad_size:OK
  #405/19  signed_loader/map_hash_unsupported_type:OK
  #405/20  signed_loader/lsm_signature_verdict:OK
  #405     signed_loader:OK
  Summary: 1/20 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260605213518.544262-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Expose signature verdict via bpf_prog_aux

BPF_PROG_LOAD verifies the loader signature but does not record the
outcome on the BPF program. [BPF] LSMs and audit can read attr->signature
and attr->keyring_id to infer "was this signed, and if so, against which
keyring".

Add prog->aux->sig (verdict + keyring_{type,serial}), populated by
bpf_prog_load before the LSM hook. keyring_type classifies the keyring
the load referenced (builtin, secondary, platform or user), while
keyring_serial records the serial of the keyring the signature was
actually validated against. System keyrings carry a pseudo key pointer
with no user-visible serial and are reported as 0, as are unsigned loads.
Failed verifications reject the load before the hook runs, so it observes
only either UNSIGNED or VERIFIED.

Signed-off-by: KP Singh <kpsingh@kernel.org>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260605213518.544262-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'selftests-bpf-libarena-add-initial-data-structures'

Emil Tsalapatis says:

====================
selftests/bpf: libarena: Add initial data structures

Add two new data structures to libarena. These data structures initially
resided in the sched-ext repo (https://github.com/sched-ext/scx) and
have been adapted to the internal libarena build system. The data
structures are:

- Red black tree: Fundamental tree data structure that can also serve
  as a base for more domain-specific data structures.

- Lev-Chase deque: Queue data structure that allows efficient work
  stealing, useful in scheduling scenarios.

The data structures are accompanied by selftests that are automatically
discovered by the existing libarena test_progs selftest and incorporated
in the CI.

CHANGELOG
=========

v3 -> v4 (https://lore.kernel.org/bpf/20260604235016.20856-1-emil@etsalapatis.com/)
- Turn off load_acquire/store_relesase - dependent selftests for s390 (CI)
- Various style/non-functional nits (AI)

v2 -> v3 (https://lore.kernel.org/bpf/20260603182727.3922-1-emil@etsalapatis.com/)

- Add workaround to handle LLVM 21 and GCC 15 assignment-to-memset promotions
  that are causing verification failures for arena programs (CI)
- Incorporate Sashiko feedback for cleanup edge cases (Sashiko)
- Simplify some of the ordering semantics in spmc

v1 -> v2 (https://lore.kernel.org/bpf/20260511214100.9487-1-emil@etsalapatis.com/):

- Rename tests from st_ to test_ (Alexei)
- Removed the freelist caches from the rbtrees, previously used to defer freeing (Alexei)
- Moved the type and function definitions to use the __arena identifier
- Removed the typecasts during function return and directly return __arena
  pointers (Alexei)
- Renamed queues to spmc queues to abstract away the algorithm (Alexei)
- Adjusted the memory barriers in the spmc queue
- Added multithreaded testing harness for libarena programs (Alexei)
- Added parallel selftest for queues (Alexei)
- Split least upper bound and exact find operations back into separate
  functions to prevent RB_DUPLICATE-related bug (AI)
====================

Link: https://patch.msgid.link/20260605222020.5231-1-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: libarena: parallel test harness and spmc parallel selftest

Add a parallel test for the SPMC Lev-Chase workstealing queue. The queue
is built to be wait-free even when there are multiple consumers, and
the parallel selftest provides a signal on whether the queue behaves
correctly when stress tested.

To support the test, this patch includes a test harness for parallel
selftests. The spmc selftest acts as an example of the naming and other
conventions expected by the harness.

Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260605222020.5231-4-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: libarena: Add spmc queue data structure

Expand libarena with a single producer multiple consumer deque data
structure. This is a single producer, multiple consumer lockless structure
that permits efficient work stealing. The structure is a Lev-Chase queue,
so it is lock-free and wait-free.

The data structure exposes three main calls. two of them are available to
the thread owning the queue and one available to all threads in the program:

spmc_owner_push(): Push an item to the top of the queue.
spmc_owner_pop(): Pop an item from the top of the queue.
spmc_steal(): Steal a thread from the bottom of the queue from
any thread.

Note that the queue is not really FIFO for all consumers, since
non-owners of the queue can only work steal from the bottom.

Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260605222020.5231-3-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: libarena: Add rbtree data structure

Add a native red-black tree data structure to libarena.
The data structure supports multiple APIs (key-value based,
node based) with which users can query and modify it. The
tree uses the libarena memory allocator to manage its data.

Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260605222020.5231-2-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

net: stmmac: xgmac: report L3/L4 filter match count in ethtool stats

Read the L3FM and L4FM bits from the RX descriptor status word (RDES2)
and increment the corresponding ethtool statistics counters. This allows
users to observe L3/L4 filter hit rates via ethtool -S.

Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com>
Signed-off-by: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260604083037.24407-1-muhammad.nazim.amirul.nazle.asmade@altera.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: microchip: sparx5: clean up PSFP resources on flower setup failure

sparx5_tc_flower_psfp_setup() allocates PSFP stream gate, flow meter and
stream filter resources before adding VCAP actions. If a later step
fails, the resources allocated earlier in the function are not unwound.

Add error paths to release the stream filter, flow meter and stream gate
when setup fails after they have been acquired.

Also make sparx5_psfp_fm_add() return the acquired flow-meter id before
the existing-flow-meter early return. When an existing flow meter is
reused, sparx5_psfp_fm_get() increments its pool reference count, but the
caller previously kept psfp_fmid as 0. If a later setup step failed, the
error path could try to delete flow-meter id 0 instead of the reused flow
meter, leaving the incremented reference behind.

Signed-off-by: Haoxiang Li <lihaoxiang@isrc.iscas.ac.cn>
Link: https://patch.msgid.link/20260603061716.747282-1-lihaoxiang@isrc.iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlabel: validate unlabeled address and mask attribute lengths

netlbl_unlabel_addrinfo_get() used the address attribute length to
determine whether the attribute data could be read as an IPv4 or IPv6
address, but did not independently validate the corresponding mask
attribute length. A crafted Generic Netlink request could therefore
provide a valid IPv4/IPv6 address attribute with a shorter mask
attribute, which would later be read as a full struct in_addr or
struct in6_addr.

NLA_BINARY policy lengths are maximum lengths by default, so use
NLA_POLICY_EXACT_LEN() for the unlabeled IPv4/IPv6 address and mask
attributes. This rejects short attributes during policy validation and
also exposes the exact length requirements through policy introspection.

Fixes: 8cc44579d1bd ("NetLabel: Introduce static network labels for unlabeled connections")
Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-airoha-support-multiple-net_devices-connected-to-the-same-gdm-port'

Lorenzo Bianconi says:

====================
net: airoha: Support multiple net_devices connected to the same GDM port

EN7581 or AN7583 SoCs support connecting multiple external SerDes (e.g.
Ethernet or USB SerDes) to GDM3 or GDM4 ports via a hw arbiter that
manages the traffic in a TDM manner. As a result multiple net_devices can
connect to the same GDM{3,4} port and there is a theoretical "1:n"
relation between GDM ports and net_devices.

           ┌─────────────────────────────────┐
           │                                 │    ┌──────┐
           │                         P1 GDM1 ├────►MT7530│
           │                                 │    └──────┘
           │                                 │      ETH0 (DSA conduit)
           │                                 │
           │              PSE/FE             │
           │                                 │
           │                                 │
           │                                 │    ┌─────┐
           │                         P0 CDM1 ├────►QDMA0│
           │  P4                     P9 GDM4 │    └─────┘
           └──┬─────────────────────────┬────┘
              │                         │
           ┌──▼──┐                 ┌────▼────┐
           │ PPE │                 │   ARB   │
           └─────┘                 └─┬─────┬─┘
                                     │     │
                                  ┌──▼──┐┌─▼───┐
                                  │ ETH ││ USB │
                                  └─────┘└─────┘
                                   ETH1   ETH2

This series introduces support for multiple net_devices connected to the
same Frame Engine (FE) GDM port (GDM3 or GDM4) via an external hw
arbiter. Please note GDM1 or GDM2 does not support the connection with
the external arbiter.
====================

Link: https://patch.msgid.link/20260603-airoha-eth-multi-serdes-v9-0-5d476bc2f426@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Support multiple LAN/WAN interfaces for hw MAC address configuration

The EN7581 and AN7583 SoCs provide registers to configure hardware LAN/WAN
MAC addresses. These registers are used during FE hw acceleration to
determine whether received traffic is destined to this host (L3 traffic)
or should be switched to another device (L2 traffic).
The SoC hardware design assumes all interfaces configured as LAN (or WAN)
share the MAC address MSBs, which are programmed into the
REG_FE_{LAN,WAN}_MAC_H register. The LSBs of 'local' mac addresses can be
expressed as a range via the REG_FE_MAC_LMIN and REG_FE_MAC_LMAX
registers. In order to properly accelerate the traffic, FE module requires
the user to configure the REG_FE_{LAN,WAN}_MAC_H register respecting this
limitation. Please note a misconfiguration in REG_FE_{LAN,WAN}_MAC_H
will still allow the user to log into the device for debugging.
Previously, only a single interface was considered when programming these
registers. Extend the logic to derive the correct minimum and maximum
values for REG_FE_MAC_LMIN/REG_FE_MAC_LMAX when two or more interfaces are
configured as LAN or WAN. Since this functionality was not available
before this series, no regression is introduced.

Tested-by: Madhur Agrawal <madhur.agrawal@airoha.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260603-airoha-eth-multi-serdes-v9-6-5d476bc2f426@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Introduce WAN device flag

Introduce WAN flag to specify if a given device is used to transmit/receive
WAN or LAN traffic. Current codebase supports specifying LAN/WAN device
configuration in ndo_init() callback during device bootstrap.
In order to consider setups where LAN configuration is used even for
GDM3/GDM4 devices, check airoha_is_lan_gdm_dev() to select pse_port in
airoha_ppe_foe_entry_prepare().
Please note after this patch, it will be possible to specify multiple LAN
devices but just a single WAN one. Please note this change is not visible
to the user since airoha_eth driver currently supports just the internal
phy available via the MT7530 DSA switch and there are no WAN interfaces
officially supported since PCS/external phy is not merged mainline yet
(it will be posted with following patches).

Tested-by: Xuegang Lu <xuegang.lu@airoha.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260603-airoha-eth-multi-serdes-v9-5-5d476bc2f426@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Do not stop GDM port if it is shared

Theoretically, in the current codebase, two independent net_devices can
be connected to the same GDM port so we need to check the GDM port is not
used by any other running net_device before setting the forward
configuration to FE_PSE_PORT_DROP.
Moreover, always set in GDM_LONG_LEN_MASK field of REG_GDM_LEN_CFG
register the maximum MTU of all running net_devices connected to the same
GDM port.

Tested-by: Xuegang Lu <xuegang.lu@airoha.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260603-airoha-eth-multi-serdes-v9-4-5d476bc2f426@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Support multiple net_devices for a single FE GDM port

EN7581 or AN7583 SoCs support connecting multiple external SerDes (e.g.
Ethernet or USB SerDes) to GDM3 or GDM4 ports via a hw arbiter that
manages the traffic in a TDM manner. As a result multiple net_devices can
connect to the same GDM{3,4} port and there is a theoretical "1:n"
relation between GDM ports and net_devices.

           ┌─────────────────────────────────┐
           │                                 │    ┌──────┐
           │                         P1 GDM1 ├────►MT7530│
           │                                 │    └──────┘
           │                                 │      ETH0 (DSA conduit)
           │                                 │
           │              PSE/FE             │
           │                                 │
           │                                 │
           │                                 │    ┌─────┐
           │                         P0 CDM1 ├────►QDMA0│
           │  P4                     P9 GDM4 │    └─────┘
           └──┬─────────────────────────┬────┘
              │                         │
           ┌──▼──┐                 ┌────▼────┐
           │ PPE │                 │   ARB   │
           └─────┘                 └─┬─────┬─┘
                                     │     │
                                  ┌──▼──┐┌─▼───┐
                                  │ ETH ││ USB │
                                  └─────┘└─────┘
                                   ETH1   ETH2

Introduce support for multiple net_devices connected to the same Frame
Engine (FE) GDM port (GDM3 or GDM4) via an external hw arbiter.
Please note GDM1 or GDM2 does not support the connection with the external
arbiter.
Add get_dev_from_sport callback since EN7581 and AN7583 have different
logics for the net_device type connected to GDM3 or GDM4.

Tested-by: Xuegang Lu <xuegang.lu@airoha.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260603-airoha-eth-multi-serdes-v9-3-5d476bc2f426@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Remove private net_device pointer in airoha_gdm_dev struct

Remove redundant net_device pointer inside airoha_gdm_dev struct and
rely on netdev_from_priv routine instead. Please note this patch does
not introduce any logical change, just code refactoring.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260603-airoha-eth-multi-serdes-v9-2-5d476bc2f426@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: airoha: Add GDM port ethernet child node

EN7581 and AN7583 SoCs support connecting multiple external SerDes to GDM3
or GDM4 ports via a hw arbiter that manages the traffic in a TDM manner.
As a result multiple net_devices can connect to the same GDM{3,4} port
and there is a theoretical "1:n" relation between GDM ports and
net_devices.
Introduce the ethernet node child of a specific GDM port in order to model
a given net_device that is connected via the external arbiter to the
GDM{3,4} port. This new ethernet node is defined by the "airoha,eth-port"
compatible string. Please note GDM1 and GDM2 does not support the
connection with the external arbiter and they are represented by an
ethernet node defined by the "airoha,eth-mac" compatible string.

Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260603-airoha-eth-multi-serdes-v9-1-5d476bc2f426@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mdio-realtek-rtl9300-refactor-initialization-and-port-lookup'

Markus Stockhausen says:

====================
net: mdio: realtek-rtl9300: Refactor initialization and port lookup

The Realtek Otto switch platform consists of four different series

- RTL838x aka maple   : 28 port 1G Switches
- RTL839x aka cypress : 52 port 1G Switches
- RTL930x aka longan  : 28 port 1G/2.5G/10G Switches
- RTL931x aka mango   : 56 port 1G/2.5G/10G Switches

A lot has been done to enhance the ethernet MDIO driver for simple
integration of more SoCs. Now it is time to solve inconveniences
that were discovered during daily operation. That includes

- Consistent "of" API usage
- Tightening error handling and improving overall robustness.
- Adding support for PHY packages.
- Fixing setup order issues. These currently hinder the driver from
  properly enabling the hardware on devices where U-Boot skips the
  setup and leaves the controller registers untouched.
====================

Link: https://patch.msgid.link/20260603175924.123019-1-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: realtek-rtl9300: Correctly handle ethernet-phy-package

Realtek Otto switches usually make use of multiport PHYs (e.g. 8 port
1G RTL8218D or 4 port 2.5G RTL8224). The device tree can describe this
fact via an "ethernet-phy-package" node that resides between the bus
and the PHY node.

When looking up the device tree bus node via the chain port->phy->parent
the driver totally ignores the existence of a PHY package. Enhance the
lookup to take care of this feature.

Link: https://github.com/openwrt/openwrt/pull/23591
Signed-off-by: Manuel Stocker <mensi@mensi.ch>
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Link: https://patch.msgid.link/20260603175924.123019-8-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: realtek-rtl9300: reorder controller setup

After the former refactoring the existing otto_emdio_9300_mdiobus_init()
contains only the c22/c45 bus mode setup. Like the topology setup this
must run before bus registration. Otherwise the bus does not "speak" the
right protocol for PHY setup.

This setup is device-specific and other SoCs will need to set up other
register bits in the controller in the future. Therefore

- Relocate c22/c45 device tree readout to the very beginning of the probing
- Add a new device-specific setup_controller() into the info structure.
- Relocate otto_emdio_priv to satisfy the new info structure dependency.
- Rename otto_emdio_9300_mdiobus_init accordingly and add it to the
RTL9300 info structure. At the same time, adapt register naming
for the function to make it clear that it only applies to this SoC.
- Call setup_controller() prior to bus registration.

Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Link: https://patch.msgid.link/20260603175924.123019-7-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: realtek-rtl9300: relocate c22/c45 device tree readout

otto_emdio_map_ports() is the central place to lookup the topology and the
properties of the Realtek ethernet MDIO controller from the device tree.
Deviating from this the c22/c45 detection via "ethernet-phy-ieee802.3-c45"
is running separately in otto_emdio_probe_one(). It loops over the same
nodes, just at a later point in time.

There is no benefit to divide this setup and to have a time window where
the data structure is only filled partially. Additionally it uses the
"fwnode" API. Consolidate the setup and convert it to the "of" API.

Remark. This is a subtle change for dangling PHY nodes (not referenced
by ethernet-ports). Before this commit all PHY nodes were evaluated for
c45 setup, now only the referenced ones.

Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Link: https://patch.msgid.link/20260603175924.123019-6-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: realtek-rtl9300: relocate topology setup

Until now the driver sets up the port to bus/address topology of the
controller after all buses are set up via otto_emdio_probe_one(). This
does not work for devices where U-Boot skips this setup. It is not
only needed for the hardware internal background PHY polling engine
but it is essential for access to the PHYs during probing.

Depending on the SoC type there exist two different register arrays

- Bus mapping registers (RTL930x, RTL931x) define to which bus the port
  is attached. E.g. [1]
- Address mapping registers (RTL838x, RTL930x, RTL931x) define to which
  address of the bus the port is attached. E.g. [2]

Relocate the topology setup and make it generic. For this

- Define device-specific bus_base/addr_base attributes that give the
  register base address where the mapping lives. In case one or both are
  not given the SoC does not support this specific type of mapping.
- Create a helper otto_emdio_setup_topology() that writes the detected
  topology to the registers.
- Call this helper prior to otto_emdio_probe_one().
- Remove unneeded code from otto_emdio_9300_mdiobus_init().
- Due to the added prefixes, increase define indentation

Subtle change: The old coding used regmap_bulk_write and silently wrote
bus=0/address=0 to mapping registers for ports that are out of scope.
The new coding leaves those untouched.

[1] https://svanheule.net/realtek/longan/register/smi_port0_15_polling_sel
[2] https://svanheule.net/realtek/longan/register/smi_port0_5_addr_ctrl

Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Link: https://patch.msgid.link/20260603175924.123019-5-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: realtek-rtl9300: harden otto_emdio_probe_one()

The bus probing of the MDIO driver uses a two stage approach.

1. The device tree "ethernet-ports" node is scanned to build a mapping
between ports and PHYs.

2. The children of the device tree "controller" are scanned to create
the individual MDIO buses.

The first step already checks the consistency of the PHY and bus nodes
that are linked via the ports. But it might miss a dangling bus child
node that is not linked. Step two simply iterates over all bus child
nodes and might read malformed data from nodes not checked in step one.

Harden this and return a meaningful error message.

Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Link: https://patch.msgid.link/20260603175924.123019-4-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: realtek-rtl9300: harden otto_emdio_map_ports()

Due to its design the MDIO driver needs to set up a port to bus/address
mapping during probing. The "ethernet-ports" subnodes are scanned and
from the "phy-handle" property the MDIO nodes are looked up. In case of
a malformed device tree the driver might produce out-of-bounds accesses.
The PHY address is not checked against the maximum supported address.

Add a sanity check and drop the unneeded MAX_SMI_ADDR define.

Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Link: https://patch.msgid.link/20260603175924.123019-3-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: realtek-rtl9300: Refactor otto_emdio_map_ports()

This function has multiple issues:
- It uses __free low level cleanups
- It mixes "fwnode" and "of" functions

Convert this to a uniform "of" usage and manual reference counting
cleanup. With that also fix two subtle lookup bugs in the original
code.

  mdio_dn = phy_dn->parent;
  if (mdio_dn->parent != dev->of_node)
    continue;

This skips an API access and therefore misses reference counting.
Additionally in the case of a very buggy device tree, phy_dn might
be a root node. Looking up its grandparent leads to a NULL pointer
access.

Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Link: https://patch.msgid.link/20260603175924.123019-2-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnge: fix context mem iteration

The firmware advertises context memory (backing store) types
through a linked list, with BNGE_CTX_INV serving as the
end-of-list sentinel.
However, the driver incorrectly assumes that the list is strictly
ordered and prematurely terminates traversal when it encounters
an unrecognized type (>=BNGE_CTX_V2_MAX). As a result, any valid
context types that appear later in the chain are silently skipped,
leading to incomplete memory configuration and eventual driver load
failure.

Fix this by traversing the entire list until the BNGE_CTX_INV sentinel
is reached, while safely ignoring only those context types that fall
outside the supported range.

Fixes: 29c5b358f385 ("bng_en: Add backing store support")
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Dharmender Garg <dharmender.garg@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac4: Report DCB feature capability

Bit 16 of the MAC HW Feature1 register reports the DCB (Data Centre
Bridging) feature. Read it so that dma_cap.dcben and the debugfs
report it accurately. Right now it is always reported as being disabled.

Signed-off-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260603173644.24371-1-ovidiu.panait.rb@renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ena: PHC: Add missing barrier

Add dma_rmb() barrier after req_id completion check in
ena_com_phc_get_timestamp(). On weakly-ordered architectures,
payload fields may be read before req_id is observed as updated.

Fixes: e0ea34158ee8 ("net: ena: Add PHC support in the ENA driver")
Closes: https://sashiko.dev/#/patchset/20260430032507.11586-1-akiyano%40amazon.com
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Add NULL check for of_reserved_mem_lookup() in airoha_qdma_init_hfwd_queues()

of_reserved_mem_lookup() may return NULL if the reserved memory region
referenced by the "memory-region" phandle is not found in the reserved
memory table (e.g. due to a misconfigured DTS or a removed
memory-region node).  The current code dereferences the returned
pointer without checking for NULL, leading to a kernel NULL pointer
dereference at the following lines:

    dma_addr = rmem->base;                          // line 1156
    num_desc = div_u64(rmem->size, buf_size);       // line 1160

Add a NULL check after of_reserved_mem_lookup() and return -ENODEV if
the lookup fails, which is consistent with the existing error handling
for of_parse_phandle() failure in the same code block.

Fixes: 3a1ce9e3d01b ("net: airoha: Add the capability to allocate hwfd buffers via reserved-memory")
Cc: stable@vger.kernel.org
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

hsr: broadcast netlink notifications in the device's net namespace

The HSR generic netlink family sets .netnsok = true. HSR devices can
live in network namespaces other than init_net.

Two async notifiers broadcast events with genlmsg_multicast(). They
are hsr_nl_ringerror() and hsr_nl_nodedown(). That helper delivers
only on the default genl socket in init_net. So the events always land
in init_net. The network namespace of the device does not matter.

This has two effects. A listener in the device's own namespace never
sees its own ring error and node down events. A privileged listener in
init_net receives events from HSR devices in other namespaces. The
payload carries the peer node MAC (HSR_A_NODE_ADDR) and the slave port
ifindex (HSR_A_IFINDEX).

Switch both callers to genlmsg_multicast_netns(). Other families with
.netnsok = true already do this. Examples are gtp, ovpn, team,
batman-adv, netdev-genl, ethtool and handshake.

hsr_nl_ringerror() already has the slave port. It uses
dev_net(port->dev). hsr_nl_nodedown() takes the namespace from the
master port via hsr_port_get_hsr().

Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>
Link: https://patch.msgid.link/20260604054949.2999304-1-maoyixie.tju@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-devmem-allow-bind-rx-from-non-init-user-namespaces'

Bobby Eshleman says:

====================
net: devmem: allow bind-rx from non-init user namespaces

NETDEV_CMD_BIND_RX is GENL_ADMIN_PERM, which checks CAP_NET_ADMIN
against init_user_ns. With netkit and netns support for devmem, it is
now useful to let workloads holding CAP_NET_ADMIN only in their own
user_ns issue bind-rx for a netns owned by that user_ns.

The first patch switches the flag to GENL_UNS_ADMIN_PERM so the check
uses the target netns's owning user_ns. Init remains permitted.

The second patch just adds test cases. They are identical to
nk_devmem.py tests, but using a non-init userns.
====================

Link: https://patch.msgid.link/20260602-nl-prov-v2-0-ad721142c641@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: add userns devmem RX test

Add userns_devmem.py, which mirrors nk_devmem.py but places the netkit
guest in a netns whose owning user_ns is non-init. ncdevmem is ran there
via nsenter so the bind-rx call is issued with creds that hold
CAP_NET_ADMIN only in the child user_ns.

Without the preceding GENL_UNS_ADMIN_PERM patch the test fails at
bind-rx with EPERM, but with the patch the transfer completes and tests
pass.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260602-nl-prov-v2-2-ad721142c641@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: devmem: allow bind-rx from non-init user namespaces

NETDEV_CMD_BIND_RX is currently GENL_ADMIN_PERM, which checks
CAP_NET_ADMIN against init userns. With recent container/netkit/ns
support for devmem, other userns/netns use cases come online and require
bind-rx to allow CAP_NET_ADMIN in non-init user ns as well.

Switch the flag to GENL_UNS_ADMIN_PERM to allow bind-rx for
CAP_NET_ADMIN in the netns's owning userns as well.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260602-nl-prov-v2-1-ad721142c641@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'drm-fixes-2026-06-06' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
"Weekly drm fixes, not contributing to things settling down
  unfortunately. Lots of driver fixes for various bounds checks, leaks
  and UAF type things, i915/xe probably the most sane, amdgpu has a mix
  of fixes all over, then ethosu has lots of small fixes.

  The problem of fixing thing in private has really hit us with the
  change handle ioctl, and "Sima was right" and we should have disabled
  the ioctl, since it was only introduced a couple of kernels ago and
  failed to upstream it's tests in time.

  The patch here fixes the problems Sima identified, but disables the
  ioctl as well, with a list of known problems in it and a request for
  proper tests to be written and upstreamed. It's a niche user ioctl
  designed for CRIU with AMD ROCm, so I think it's fine to just disable
  it.

  Maybe this week will settle down.

  core:
   - disable the gem change handle ioctl for security reasons (plan to
     fix it on list later with proper test coverage)

  dumb-buffer:
   - remove strict limits on buffer geometry

  amdgpu:
   - BT.2020 fix for DCE
   - DC bounds checking fixes
   - SDMA 7.1 fix
   - UserQ fixes
   - SI fix
   - SMU 13 fixes
   - SMU 14 fixes
   - GC 12.1 fix
   - Userptr fix
   - GC 10.1 fix
   - GART fix for non-4K pages

  amdkfd:
   - UAF race fix
   - Fix a potential NULL pointer dereference
   - GC 11 buffer overflow fix for SDMA

  xe:
   - Revert removing support for unpublished NVL-S GuC
   - Suspend fixes related to multi-queue

  i915:
   - Fix color blob reference handling in intel_plane_state
   - Revert "drm/i915/backlight: Remove try_vesa_interface"

  ethosu:
   - reject unsupported NPU_OP_RESIZE
   - fix index of IFM region
   - fix weight index
   - fix overflows in DMA-size calculations
   - reject DMA commands with uninitialized length
   - fix OOB write in ethosu_gem_cmdstream_copy_and_validate

  imx:
   - fix kernel-doc warnings

  ivpu:
   - add overflow checks in firmware handling and get_info_ioctl

  v3d:
   - wait for pending L2T flush before cleaning caches
   - fix leak of vaddr
   - skip CSD when it has zeroed workgroups
   - fix ref counting in performance monitoring"

* tag 'drm-fixes-2026-06-06' of https://gitlab.freedesktop.org/drm/kernel: (50 commits)
  drm/gem: Try to fix change_handle ioctl, attempt 4
  Revert "drm/i915/backlight: Remove try_vesa_interface"
  accel/ethosu: fix OOB write in ethosu_gem_cmdstream_copy_and_validate()
  accel/ethosu: reject DMA commands with uninitialized length
  accel/ethosu: fix arithmetic issues in dma_length()
  accel/ethosu: fix wrong weight index in NPU_SET_SCALE1_LENGTH on U85
  accel/ethosu: reject NPU_OP_RESIZE commands from userspace
  accel/ethosu: fix IFM region index out-of-bounds in command stream parser
  drm/v3d: Fix global performance monitor reference counting
  drm/xe/multi_queue: skip submit when primary queue is suspended
  drm/xe: Clear pending_disable before signaling suspend fence
  Revert "drm/xe: Skip exec queue schedule toggle if queue is idle during suspend"
  drm/amd/pm: smu_v14_0_0: use SoftMin for gfxclk in set_soft_freq_limited_range
  drm/amdgpu: Fix incorrect VRAM GART mappings on non-4K page size systems
  drm/amdgpu/userq: move wptr_obj cleanup in mqd_destroy
  drm/amdgpu: improve the userq seq BO free bit lookup
  drm/amdgpu/userq: remove the vital queue unmap logging
  drm/amdkfd: Fix buffer overflow in SDMA queue checkpoint/restore on GFX11
  drm/amdkfd: fix NULL dereference in get_queue_ids()
  drm/amdgpu: set noretry=1 as default for GFX 10.1.x (Navi10/12/14)
  ...

ipv6: remove obsolete EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL()

These symbols don't need to be exported, they are only used from vmlinux:

- inet6addr_notifier_call_chain
- inet6addr_validator_notifier_call_chain
- in6addr_linklocal_allnodes
- in6addr_linklocal_allrouters
- in6addr_interfacelocal_allnodes
- in6addr_interfacelocal_allrouters
- in6addr_sitelocal_allrouters
- inet6_cleanup_sock
- ip6_sk_dst_lookup_flow
- ipv6_proxy_select_ident
- ip6_sk_update_pmtu
- ip6_sk_redirect

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260604174555.2801532-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: remove obsolete EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL()

These symbols no longer need to be exported, they are only used from
vmlinux:

- inet_send_prepare
- inet_splice_eof
- inet_sk_rebuild_header
- inet_current_timestamp
- snmp_fold_field
- snmp_get_cpu_field64
- snmp_fold_field64
- fib_nh_common_release
- fib_nh_common_init
- fib_nexthop_info
- fib_add_nexthop
- ip_build_and_send_pkt
- ipv4_sk_update_pmtu
- ipv4_sk_redirect
- rt_dst_clone

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260604173413.2782008-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bridge-prepare-lockless-br_port_fill_attrs-i'

Eric Dumazet says:

====================
bridge: prepare lockless br_port_fill_attrs() (I)

medium-term goal is to allow "ip link show" dump commands to run without RTNL.

This round of patches adds/fixes some lockess accesses in bridge.

This is not complete, more patches will come later.

Ultimately all changes to p->flags should use set_bit()/clear_bit().

I repeat (AI agents might read this cover ?):

Many p->flags accesses are racy, and will hopefully be fixed in a
future series.
====================

Link: https://patch.msgid.link/20260604141343.2124500-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: read p->flags once in br_port_fill_attrs()

We might run br_port_fill_attrs() locklessly in the future.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260604141343.2124500-12-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: provide lockless access to p->config_pending

Needed for sysfs show_config_pending(), BRCTL_GET_PORT_INFO
and upcoming RTNL avoidance in "ip link" dumps (cf br_port_fill_attrs()).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260604141343.2124500-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: provide lockless access to p->port_id

sysfs show_port_id() and BRCTL_GET_PORT_INFO need this.

This will be needed for upcoming RTNL avoidance in "ip link"
dumps (cf br_port_fill_attrs()).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260604141343.2124500-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: provide lockless access to p->priority

sysfs show_priority() needs this.

Also br_port_fill_attrs() might in the future run without RTNL.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260604141343.2124500-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: provide lockless access to p->designated_port

Add READ_ONCE()/WRITE_ONCE() annotations around p->designated_port

This is needed at least for sysfs show_designated_port(), BRCTL_GET_PORT_INFO
and upcoming RTNL avoidance in "ip link" dumps (cf br_port_fill_attrs()).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260604141343.2124500-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: provide lockless access to p->designated_cost

Add READ_ONCE()/WRITE_ONCE() annotations around p->designated_cost

This is needed at least for sysfs show_designated_cost(), BRCTL_GET_PORT_INFO
and upcoming RTNL avoidance in "ip link" dumps (cf br_port_fill_attrs()).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260604141343.2124500-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: provide lockless access to p->path_cost

Add READ_ONCE()/WRITE_ONCE() annotations around p->path_cost.

This is needed at least for sysfs show_path_cost(), BRCTL_GET_PORT_INFO
and upcoming RTNL avoidance in "ip link" dumps (cf br_port_fill_attrs()).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260604141343.2124500-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: use BR_ADMIN_COST_BIT

Use set_bit() and test_bit() lockless functions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260604141343.2124500-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: use BR_PROMISC_BIT

Use BR_PROMISC_BIT and set_bit(), clear_bit() and test_bit() lockless
functions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260604141343.2124500-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: add bridge_flags_bit enum

We want to use atomic operations for lockless p->flags changes
and reads.

Add definitions for bits in addition of masks so that we can use
test_bit(), clear_bit() and set_bit() in subsequent patches.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260604141343.2124500-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: add a READ_ONCE() in br_timer_value()

br_timer_value() can be called locklessly, the expires field
could be changed concurrently.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260604141343.2124500-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: do not detect PPPoX loopback

By default, pppd attempts to detect loopbacks on the underlying
interface using a pseudo-randomly generated magic number and checks if
the same value is received. The seed for the PRNG is a hash of hostname
XOR current time XOR pid, which is likely to collide on NIPA, causing
false positives. Disable magic number generation.

Reported-by: Matthieu Baerts <matttbe@kernel.org>
Fixes: 7af2a94f4dcf ("selftests: net: add tests for PPPoL2TP")
Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
Link: https://patch.msgid.link/20260603061746.23452-1-qingfang.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: cpsw_new: unregister devlink on port registration failure

cpsw_probe() registers devlink before registering the CPSW ports.

If cpsw_register_ports() fails, the error path only unregisters the
notifiers and then releases the lower level resources. It does not undo
the successful cpsw_register_devlink() call, leaving the devlink instance
and its parameters registered after probe has failed.

Add a devlink cleanup label for the path where devlink registration has
already succeeded, and use it when port registration fails.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Link: https://patch.msgid.link/20260604043115.1409134-1-lgs201920130244@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'intel-wired-lan-driver-updates-2026-06-02-i40e-ice-idpf'

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2026-06-02 (ice, idpf)

Petr Oros adds missing callbacks for U.FL DPLL pins on ice.

Alok Tiwari corrects copy/paste error causing incorrect reporting of PTP
mailbox capability for idpf.
====================

Link: https://patch.msgid.link/20260602225513.393338-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

idpf: fix mailbox capability for set device clock time

The current code incorrectly uses VIRTCHNL2_CAP_PTP_SET_DEVICE_CLK_TIME
for both direct and mailbox capabilities, causing mailbox-only support
to be ignored and potentially reporting IDPF_PTP_NONE.

Fixes: d5dba8f7206da ("idpf: add PTP clock configuration")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20260602225513.393338-4-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: fix missing priority callbacks for U.FL DPLL pins

The U.FL2 input pin advertises DPLL_PIN_CAPABILITIES_PRIORITY_CAN_CHANGE
in its capability mask, but ice_dpll_pin_ufl_ops does not provide
.prio_get and .prio_set callbacks. As a result the DPLL subsystem
cannot report or accept priority for U.FL pins: pin-get omits the prio
field on U.FL2 and pin-set with prio is rejected as invalid, even
though the capability is present. This prevents user space from using
priority to select or disable U.FL2 as a DPLL input source.

Reproducer with iproute2 (dpll command):

  # dpll pin show board-label U.FL2
  pin id 16:
    module-name ice
    board-label U.FL2
    type ext
    capabilities priority-can-change|state-can-change
    parent-device:
      id 0 direction input state selectable phase-offset 0
    /* note: no "prio" between "direction" and "state",
       even though priority-can-change is advertised */

  # dpll pin set id 16 parent-device 0 prio 5
  RTNETLINK answers: Operation not supported

After the fix the prio field is reported by pin show and pin set with
prio is accepted on U.FL2.

Add the missing .prio_get and .prio_set callbacks to
ice_dpll_pin_ufl_ops, reusing ice_dpll_sw_input_prio_{get,set}. The
same ops struct is shared by U.FL1 and U.FL2: U.FL2 (input) delegates
to the backing hardware input pin, while U.FL1 (output) does not
advertise DPLL_PIN_CAPABILITIES_PRIORITY_CAN_CHANGE so the dpll core
capability gate never invokes prio_set for it, and prio_get reports
the OUTPUT sentinel (ICE_DPLL_PIN_PRIO_OUTPUT) on the output side
exactly like the SMA path does today.

Fixes: 2dd5d03c77e2 ("ice: redesign dpll sma/u.fl pins control")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Petr Oros <poros@redhat.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20260602225513.393338-3-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/bpf: Fix test_lirc test

Since commit 68a99f6a0ebf ("media: lirc: report ir receiver overflow"),
the rc-loopback driver does not accept edges over 50ms, as these are
never seen in real life ir protocols. Fix this.

Signed-off-by: Sean Young <sean@mess.org>
Link: https://lore.kernel.org/r/20260605151417.777614-1-sean@mess.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'add-validation-for-bpf_set_retval-helper'

Xu Kuohai says:

====================
Add validation for bpf_set_retval helper

From: Xu Kuohai <xukuohai@huawei.com>

The bpf_set_retval() helper is used by cgroup BPF programs to set the
return value of the kernel hook. The argument type for this helper is
ARG_ANYTHING. This allows setting a positive value, which no cgroup
hook expects and can cause issues, such as the kernel panic reported
in [1].

This series adds validation for the argument of the bpf_set_retval()
helper.

For BPF_LSM_CGROUP, the same validation as BPF_LSM_MAC is enforced,
i.e. validate the argument against the LSM hook specific range, which
is returned by bpf_lsm_get_retval_range().

For all other cgroup program types, restrict the argument to
[-MAX_ERRNO, 0], which matches the kernel convention of 0 for success
and negative errno for error.

BPF_CGROUP_GETSOCKOPT is an exception from this restriction, since valid
getsockopt implementations may return positive values (e.g. optlen), as
allowed by commit c4dcfdd406aa ("bpf: Move getsockopt retval to struct
bpf_cg_run_ctx").

[1] https://lore.kernel.org/all/567d3206-74a5-44e5-99c6-779c425f399e@std.uestc.edu.cn

v5:
- Use resolve_prog_type(env->prog) instead of env->prog->type for prog type checks
- Target bpf-next tree

v4: https://lore.kernel.org/bpf/20260604130458.617765-1-xukuohai@huaweicloud.com
- Remove the return value limit for BPF_CGROUP_GETSOCKOPT type
- Refine the range of return value of bpf_get_retval helper

v3: https://lore.kernel.org/bpf/20260530101239.590395-1-xukuohai@huaweicloud.com/
- Mark R1 as precise to prevent validation bypass via branch pruning (sashiko)

v2: https://lore.kernel.org/bpf/20260530055557.549474-1-xukuohai@huaweicloud.com/
- Extend validation from LSM cgroup BPF type to all cgroup BPF types (sashiko)

v1: https://lore.kernel.org/bpf/20260523085806.417723-1-xukuohai@huaweicloud.com/
====================

Link: https://patch.msgid.link/20260605140243.664590-1-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add tests for bpf_set_retval validation

Add verifier tests to validate bpf_set_retval argument for cgroup
program types.

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> #v1
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20260605140243.664590-4-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add validation for bpf_set_retval argument

The bpf_set_retval() helper is used by cgroup BPF programs to set the
return value of the target hook. The argument type for this helper is
ARG_ANYTHING. This allows setting a positive value, which no cgroup
hook expects and can cause issues, such as:

- BPF_LSM_CGROUP: a positive value from bpf_lsm_socket_create bypasses
  the err < 0 check in __sock_create(), leaving the socket object
  unallocated. The positive return value is then propagated to the
  syscall entry __sys_socket(), which also bypasses the IS_ERR() guard
  and ultimately causes a NULL pointer dereference.

- BPF_CGROUP_DEVICE: a positive value can be returned through cgroup
  device bpf prog -> devcgroup_check_permission() -> bdev_permission()
  -> bdev_file_open_by_dev(), where ERR_PTR(positive) produces a pointer
  that IS_ERR() does not catch, leading to a wild pointer dereference.

- BPF_CGROUP_SOCK: a positive value can be returned through cgroup sock
  bpf prog -> __cgroup_bpf_run_filter_sk() -> inet_create() ->
  __sock_create(), where inet_create() frees the newly allocated sk
  via sk_common_release() and sets sock->sk = NULL on the non-zero
  return, but __sock_create() only checks err < 0 for cleanup, so a
  positive retval bypasses cleanup and returns a socket with NULL sk
  to userspace, triggering a NULL pointer dereference on subsequent
  socket operations.

- BPF_CGROUP_SYSCTL: a positive value can be returned through the cgroup
  bpf prog -> __cgroup_bpf_run_filter_sysctl() -> proc_sys_call_handler(),
  where a non-zero return bypasses the normal sysctl proc_handler and is
  returned directly to userspace as return value of read() or write()
  syscall.

So add validation for the argument of the bpf_set_retval() helper.

For BPF_LSM_CGROUP, enforce the LSM hook specific range returned by
bpf_lsm_get_retval_range().

For all other cgroup program types, restrict the argument to
[-MAX_ERRNO, 0], which matches the kernel convention of 0 for success
and negative errno for error.

BPF_CGROUP_GETSOCKOPT is an exception, since valid getsockopt
implementations may return positive values, as allowed by commit
c4dcfdd406aa ("bpf: Move getsockopt retval to struct bpf_cg_run_ctx").

Also refine the return value range of bpf_get_retval() so that
values returned by bpf_get_retval() can be passed directly to
bpf_set_retval() without extra manual bounds checking.

Fixes: b44123b4a3dc ("bpf: Add cgroup helpers bpf_{get,set}_retval to get/set syscall return value")
Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor")
Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
Closes: https://lore.kernel.org/all/567d3206-74a5-44e5-99c6-779c425f399e@std.uestc.edu.cn
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20260605140243.664590-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Restrict bpf_set_retval argument in sk_bypass_prot_mem

Test sk_bypass_prot_mem passes an unchecked value as argument to helper
bpf_set_retval(). The argument can be outside the valid range enforced
by the strict retval validation added in the next patch.

Restrict the argument to -EFAULT when it is outside the valid range, so
the test will not be rejected by the verifier when retval validation
is enforced.

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20260605140243.664590-2-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

drm/gem: Try to fix change_handle ioctl, attempt 4

[airlied: just added some comments on how to reenable]
On-list because the cat is out of the bag and we're clearly not good
enough to figure this out in private. The story thus far:

5e28b7b94408 ("drm: Set old handle to NULL before prime swap in
change_handle") tried to fix a race condition between the gem_close and
gem_change_handle ioctls, but got a few things wrong:

- There's a confusion with the local variable handle, which is actually
  the new handle, and so the two-stage trick was actually applied to the
  wrong idr slot. 7164d78559b0 ("drm/gem: fix race between
  change_handle and handle_delete") tried to fix that by adding yet
  another code block, but forgot to add the error handling. Which meant
  we now have two paths, both kinda wrong.

- dc366607c41c ("drm: Replace old pointer to new idr") tried to apply
  another fix, but inconsistently, again because of the handle confusion
  - this would be the right fix (kinda, somewhat, it's a mess) if we'd
  do the two-stage approach for the new handle. Except that wasn't the
  intent of the original fix.

We also didn't have an igt merged for the original ioctl, which is a big
no-go. This was attempted to address off-list in the original bugfix,
and amd QA people claimed the bug was fixed now. Very clearly that's not
the case. Here's my attempt to sort this out:

- Rename the local variable to new_handle, the old aliasing with
  args->handle is just too dangerously confusing.

- Merge the gem obj lookup with the two-stage idr_replace so that we
  avoid getting ourselves confused there.

- This means we don't have a surplus temporary reference anymore, only
  an inherited from the idr. A concurrent gem_close on the new_handle
  could steal that. Fix that with the same two-stage approach
  create_tail uses. This is a bit overkill as documented in the comment,
  but I also don't trust my ability to understand this all correctly, so
  go with the established pattern we have from other ioctls instead for
  maximum paranoia.

- Adjust error paths. I've tried to make the error and success paths
  common, because they are identical except for which handle is removed
  and on which we call idr_replace to (re)install the object again. But
  that made things messier to read, so I've left it at the more verbose
  version, which unfortunately hides the symmetry in the entire code
  flow a bit.

- While at it, also replace the 7 space indent with 1 tab.

And finally, because I flat out don't trust my abilities here at all
anymore:

- Disable the ioctl until we have the igt situation and everything else
  sorted out on-list and with full consensus.

v2:

Sashiko noticed that I didn't handle the error path for idr_replace
correctly, it must be checked with IS_ERR_OR_NULL like in
gem_handle_delete. So yeah, definitely should just the existing paths
1:1 because this is endless amounts of tricky.

Also add the Fixes: line for the original ioctl, I forgot that too.

Reported-by: DARKNAVY (@DarkNavyOrg) <vr@darknavy.com>
Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
Fixes: dc366607c41c ("drm: Replace old pointer to new idr")
Cc: syzbot+d7c9eed171647e421013@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Cc: Edward Adam Davis <eadavis@qq.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Fixes: 5e28b7b94408 ("drm: Set old handle to NULL before prime swap in change_handle")
Cc: David Francis <David.Francis@amd.com>
Cc: Puttimet Thammasaeng <pwn8official@gmail.com>
Cc: Christian Koenig <Christian.Koenig@amd.com>
Fixes: 7164d78559b0 ("drm/gem: fix race between change_handle and handle_delete")
Cc: Zhenghang Xiao <kipreyyy@gmail.com>
Fixes: 5e28b7b94408 ("drm: Set old handle to NULL before prime swap in change_handle")
Reviewed-by: David Francis <David.Francis@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patch.msgid.link/20260604194437.1725314-1-simona.vetter@ffwll.ch

Merge branch 'bpf-fix-sysctl-new-value-handling-in-__cgroup_bpf_run_filter_sysctl'

Dawei Feng says:

====================
bpf: fix sysctl new-value handling in __cgroup_bpf_run_filter_sysctl

This series fixes three bugs in the sysctl write-buffer replacement path
of __cgroup_bpf_run_filter_sysctl(). It resolves a kvzalloc()/kfree()
mismatch, adds a missing NUL terminator to the replacement string, and
updates a stale return value check to safely restore the replacement
functionality.

Patch Summary:
- patch 1 NUL-terminates the replaced sysctl value
- patch 2 uses kvfree() for the replaced sysctl write buffer
- patch 3 restores sysctl new-value replacement

Changelog:
v2 -> v3:
- reordered patches 1 and 2
- added the missing Reviewed-by/Acked-by tags to patches 2 and 3
- fixed the incorrect Fixes tag in patch 3
- simplified the dynamic test logs in patch 1 and 2, and updated
titles
====================

Link: https://patch.msgid.link/20260603105317.944304-1-dawei.feng@seu.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Restore sysctl new-value from 1 to 0

Commit 4e63acdff864 ("bpf: Introduce bpf_sysctl_{get,set}_new_value
helpers") changed the success return value to 0, but failed to update the
corresponding check in __cgroup_bpf_run_filter_sysctl(). Since
bpf_prog_run_array_cg() now returns 0 on success, the legacy ret == 1
condition is never satisfied. As a result, the modified value is ignored,
and bpf_sysctl_set_new_value() fails to replace the write buffer.

Fix this by checking for a return value of 0 instead, so cgroup/sysctl
programs can correctly replace the pending sysctl buffer.

This bug was discovered during a manual code review. Tested via a
cgroup/sysctl BPF reproducer overriding writes to a target sysctl.
Pre-fix, bpf_sysctl_set_new_value("foo") was silently ignored: the write
returned 8192 and the value remained "600". Post-fix, the BPF replacement
buffer properly propagates: the write returns 3 and the value updates to
"foo".

Fixes: f10d05966196 ("bpf: Make BPF_PROG_RUN_ARRAY return -err instead of allow boolean")
Cc: stable@vger.kernel.org
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Acked-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20260603105317.944304-4-dawei.feng@seu.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: use kvfree() for replaced sysctl write buffer

proc_sys_call_handler() allocates its temporary sysctl buffer with
kvzalloc() and passes it to __cgroup_bpf_run_filter_sysctl(). Since
kvzalloc() may fall back to vmalloc() for large allocations, freeing
that buffer with kfree() is wrong and can corrupt memory.

Use kvfree() to safely handle both kmalloc and kvzalloc()/vmalloc
allocations.

The bug was first flagged by an experimental analysis tool we are
developing for kernel memory-management bugs while analyzing
v6.13-rc1. The tool is still under development and is not yet publicly
available. Manual inspection confirms that the bug is still
present in v7.1-rc5.

Reproduced the bug based on v7.1-rc4 in a QEMU x86_64 guest booted with
KASAN and CONFIG_FAILSLAB enabled. To exercise the replacement path, the
test tree also included the accompanying fix for the stale ret == 1
check in __cgroup_bpf_run_filter_sysctl(). The reproducer confines
failslab injections to the proc_sys_call_handler() range, uses
stacktrace-depth=32, and injects fail-nth=1 while writing 8191 bytes to
/proc/sys/kernel/domainname from a task in the target cgroup. Under
that setup, fail-nth=1 triggered the fault:

  BUG: unable to handle page fault for address: ffffeb0200024d48
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000  SMP KASAN NOPTI
  CPU: 2 UID: 0 PID: 209 Comm: repro_proc_sys_ Not tainted 7.1.0-rc4-00686-g97625979a5d4  PREEMPT(lazy)
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
  RIP: 0010:kfree+0x6e/0x510
  ...
  Call Trace:
   <TASK>
   ? __cgroup_bpf_run_filter_sysctl+0x626/0xc30
   __cgroup_bpf_run_filter_sysctl+0x74d/0xc30
   ? __pfx___cgroup_bpf_run_filter_sysctl+0x10/0x10
   ? srso_return_thunk+0x5/0x5f
   ? __kvmalloc_node_noprof+0x345/0x870
   ? proc_sys_call_handler+0x250/0x480
   ? srso_return_thunk+0x5/0x5f
   proc_sys_call_handler+0x3a2/0x480
   ? __pfx_proc_sys_call_handler+0x10/0x10
   ? srso_return_thunk+0x5/0x5f
   ? selinux_file_permission+0x39f/0x500
   ? srso_return_thunk+0x5/0x5f
   ? lock_is_held_type+0x9e/0x120
   vfs_write+0x98e/0x1000
   ...
   </TASK>

With this fix applied on top of the same test setup, rerunning the
reproducer with fail-nth=1 yields no corresponding Oops reports.

Fixes: 4508943794ef ("proc: use kvzalloc for our kernel buffer")
Cc: stable@vger.kernel.org
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn>
Link: https://lore.kernel.org/r/20260603105317.944304-3-dawei.feng@seu.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: NUL-terminate replaced sysctl value

When writing to sysctls, proc_sys_call_handler() guarantees that the
buffer passed to proc handlers is NUL-terminated. If
bpf_sysctl_set_new_value() replaces the pending sysctl value, it can
hand a replacement buffer directly to proc handlers. However, the
helper currently copies only buf_len bytes into that buffer without
appending a NUL terminator, leaving downstream parsers vulnerable to
out-of-bounds access.

Fix this by appending a '\0' after the replaced value to restore the
expected sysctl semantics. Since the helper already rejects buf_len
greater than PAGE_SIZE - 1, there is always room for the extra byte.

Reproduced in a QEMU x86_64 guest booted with KASAN while exercising
the sysctl replacement path with a cgroup/sysctl BPF program. The
reproducer targets `/proc/sys/net/core/flow_limit_cpu_bitmap`, fills
the original user write buffer with non-zero bytes, and overrides the
sysctl value so the replacement buffer lacks a terminating NUL. Under
that setup, the pre-fix kernel reported:

  BUG: KASAN: slab-out-of-bounds in strnchrnul+0x72/0x90
  Read of size 1 at addr ffff88800de57000 by task repro_patch3/66
  CPU: 0 UID: 0 PID: 66 Comm: repro_patch3 Not tainted 7.1.0-rc3-00269-g8370ca1f87cc #6 PREEMPT(lazy)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x68/0xa0
   print_report+0xcb/0x5e0
   ? __virt_addr_valid+0x21d/0x3f0
   ? strnchrnul+0x72/0x90
   ? strnchrnul+0x72/0x90
   kasan_report+0xca/0x100
   ? strnchrnul+0x72/0x90
   strnchrnul+0x72/0x90
   bitmap_parse+0x37/0x2e0
   flow_limit_cpu_sysctl+0xc6/0x840
   ? __pfx_flow_limit_cpu_sysctl+0x10/0x10
   ? __kvmalloc_node_noprof+0x5ba/0x870
   proc_sys_call_handler+0x31d/0x480
   ? __pfx_proc_sys_call_handler+0x10/0x10
   ? selinux_file_permission+0x39f/0x500
   ? lock_is_held_type+0x9e/0x120
   vfs_write+0x98e/0x1000
   ...
   </TASK>
  The buggy address is located 0 bytes to the right of
  allocated 4096-byte region [ffff88800de56000, ffff88800de57000)
With this fix applied, rerunning the same sysctl-targeted path yields
no corresponding KASAN reports.

Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260603105317.944304-2-dawei.feng@seu.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge tag 'drm-intel-fixes-2026-06-05' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes

- Fix color blob reference handling in intel_plane_state (Chaitanya Kumar Borah)
- Revert "drm/i915/backlight: Remove try_vesa_interface" [backlight] (Suraj Kandpal)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Tvrtko Ursulin <tursulin@igalia.com>
Link: https://patch.msgid.link/aiKgmwz7VGOaFXIv@linux

Merge tag 'drm-misc-fixes-2026-06-05' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes

Short summary of fixes pull:

dumb-buffer:
- remove strict limits on buffer geometry

ethosu:
- reject unsupported NPU_OP_RESIZE
- fix index of IFM region
- fix weight index
- fix overflows in DMA-size calculations
- reject DMA commands with uninitialized length
- fix OOB write in ethosu_gem_cmdstream_copy_and_validate

imx:
- fix kernel-doc warnings

ivpu:
- add overflow checks in firmware handling and get_info_ioctl

v3d:
- wait for pending L2T flush before cleaning caches
- fix leak of vaddr
- skip CSD when it has zeroed workgroups
- fix ref counting in performance monitoring

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patch.msgid.link/20260605072602.GA268798@linux.fritz.box

Merge branch 'bpf-update-transport_header-when-encapsulating-udp-tunnel-in-lwt'

Leon Hwang says:

====================
bpf: Update transport_header when encapsulating UDP tunnel in lwt

Currently, bpf_lwt_push_ip_encap() does not update skb->transport_header.
When a driver, e.g. ice, reuses the stale skb->transport_header to
offload checksum computation to NIC hardware, VxLAN packets encapsulated
by bpf_lwt_push_encap() helper may be dropped due to incorrect checksum.

Update skb->transport_header in bpf_lwt_push_ip_encap() whenever the
encapsulated packet uses UDP, so checksum offload works correctly.

Changes:
v3 -> v4:
* Address comments from Emil:
  * Make the logic of skb_set_transport_header() clearer in patch #1.
  * Fold the code of fexit_lwt_push_ip_encap() into test_lwt_ip_encap.c in
    patch #2.
  * Resolve assorted issues of test in patch #2.
* v3: https://lore.kernel.org/bpf/20260601150203.20352-1-leon.hwang@linux.dev/

v2 -> v3:
* Drop patch #1 and #2 of v2 that aim to resolve potential issues
  reported by sashiko (per Alexei).
* Check target IP version and UDP tunnel in test (per sashiko).
* v2: https://lore.kernel.org/bpf/20260529151351.69911-1-leon.hwang@linux.dev/

v1 -> v2:
* Address sashiko's reviews:
  * Fix TOCTOU issue in lwt to avoid changing hdr after checks.
  * Add check iph->ihl < 5 in lwt to avoid infinite-loop in MIPS driver.
  * Update comment style in selftests with BPF comment style.
* v1: https://lore.kernel.org/bpf/20260525142650.2569-1-leon.hwang@linux.dev/
====================

Link: https://patch.msgid.link/20260602150931.49629-1-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add tests to verify the fix of encapsulating VxLAN in lwt

Add two tests to verify the transport header of skb has been set when
encapsulate VxLAN using bpf_lwt_push_encap() helper.

1. VxLAN over IPv4.
2. VxLAN over IPv6.

Without the fix, the tests would fail:

lwt_ip_encap_vxlan:FAIL:transport_hdr offset unexpected transport_hdr offset: actual 70 != expected 20
#208 lwt_ip_encap_vxlan_ipv4:FAIL
lwt_ip_encap_vxlan:FAIL:transport_hdr offset unexpected transport_hdr offset: actual 110 != expected 40
#209 lwt_ip_encap_vxlan_ipv6:FAIL

The unexpected offsets are: outer encap headers
(IPv4: iphdr+udp+vxlan+eth = 50 bytes, IPv6: ipv6hdr+udp+vxlan+eth = 70 bytes)
plus the inner IP header (20 or 40 bytes), because without the fix
transport_header still points at the inner transport layer instead of the
outer UDP header.

Assisted-by: Claude:claude-sonnet-4-6
Cc: Leon Hwang <leon.huangfu@shopee.com>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260602150931.49629-3-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Update transport_header when encapsulating UDP tunnel in lwt

Currently, bpf_lwt_push_ip_encap() does not update skb->transport_header.
When a driver, e.g. ice, reuses the stale skb->transport_header to
offload checksum computation to NIC hardware, VxLAN packets encapsulated
by bpf_lwt_push_encap() helper may be dropped due to incorrect checksum.

Update skb->transport_header in bpf_lwt_push_ip_encap() whenever the
encapsulated packet uses UDP, so checksum offload works correctly.

Fixes: 52f278774e79 ("bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap")
Cc: Leon Hwang <leon.huangfu@shopee.com>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260602150931.49629-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'risc-v-jit-support-for-bpf_get_current_task-_btf'

Varun R Mallya says:

====================
RISC-V JIT support for bpf_get_current_task/_btf

These two patches add support for the bpf_get_current_task and
bpf_get_current_task_btf kfuncs in RISC-V JIT and add a selftest.

The first patch adds support for cpu and feature detection on
the JIT disassembly helper function as RISC-V JITed code was not
being disassembled using `LLVMCreateDisasm` as it was missing the
"+c" CPU feature and JITed code contained RISC-V Compressed (C)
Extension. This patch generalizes that to detect CPU features and
enables testing on more RISC-V JIT work ahead.

The second patch, which actually adds this support has been benchmarked
on QEMU RISC-V and shows significant improvements.
It was benchmarked using a simple loop inside a bpf program that ran
bpf_get_current_task() and execution time was measured. It used
bpf_prog_test_run_opts() to repeatedly trigger the BPF program. The loop
ran in 1 second intervals and it kept firing bpf_prog_test_run_opts() as
fast as possible until a second had elapsed and then reported statistics.
====================

Link: https://patch.msgid.link/20260602205847.102825-1-varunrmallya@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, riscv: inline bpf_get_current_task() and bpf_get_current_task_btf()

On RISC-V, the current task pointer is stored in the thread pointer
register (tp). Emit a single `mv a5, tp` instead of a full helper
call for BPF_FUNC_get_current_task and BPF_FUNC_get_current_task_btf.

Register bpf_jit_inlines_helper_call() entries for both helpers so the
verifier treats them as inlined, and add the expected `mv a5, tp`
annotation to the riscv64 selftests.

The following show changes before and after this patch.

Before patch:

      auipc  t1,0x817a    # load upper PC-relative address
      jalr   -2004(t1)    # call bpf_get_current_task helper
      mv     a5,a0        # move return value to BPF_REG_0

After patch:

      mv     a5,tp        # directly: a5 = current (tp = thread pointer)

Benchmark (bpf_prog_test_run wrapping bpf_get_current_task in loop,
batch=100, 10s, QEMU RISC-V):

              | runs/sec  | helper-calls/sec | ns/call
-------------+-----------+------------------+---------
Before patch |   173,490 |       17,349,090 |      57
After patch  |   320,497 |       32,049,780 |      31
-------------+-----------+------------------+---------
Improvement  |   +84.7%  |          +84.7%  |  -45.6%

Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/r/20260602205847.102825-3-varunrmallya@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: use host CPU features in JIT disassembler

Pass the host CPU name and feature string to
LLVMCreateDisasmCPUFeatures() instead of using LLVMCreateDisasm(), so
the disassembler correctly decodes CPU-specific instructions and
extensions such as RISC-V compressed and vector instructions.

Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
Reviewed-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/r/20260602205847.102825-2-varunrmallya@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'bpf-check-tail-zero-of-bpf_map_info-and-bpf_prog_info'

Leon Hwang says:

====================
bpf: Check tail zero of bpf_map_info and bpf_prog_info

Check the tail bytes of bpf_map_info and bpf_prog_info due to padding
when getting map info and prog info via BPF_OBJ_GET_INFO_BY_FD, which
was discussed in the thread
"bpf: Check tail zero of bpf_common_attr using offsetofend" [1].

Links:
[1] https://lore.kernel.org/bpf/20260518145446.6794-2-leon.hwang@linux.dev/

Changes:
v2 -> v3:
* Add "__u32 :32" to bpf_map_info and bpf_prog_info (per Alexei).
* v2: https://lore.kernel.org/bpf/20260604150505.99129-1-leon.hwang@linux.dev/

v1 -> v2:
* Collect Acked-by tags from Mykyta, thanks.
* Update Fixes tag in patch #2 (per bot+bpf-ci)
* v1: https://lore.kernel.org/bpf/20260603144518.67065-1-leon.hwang@linux.dev/
====================

Link: https://patch.msgid.link/20260605155249.20772-1-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add tests to verify checking padding bytes for bpf_[map,prog]_info

Add two tests to verify that the tail padding 4 bytes of struct
bpf_map_info and bpf_prog_info are checked in syscall.c using
bpf_check_uarg_tail_zero().

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260605155249.20772-4-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Check tail zero of bpf_prog_info

Since there're 4 bytes padding at the end of struct bpf_prog_info, they
won't be checked by bpf_check_uarg_tail_zero().

pahole -C bpf_prog_info ./vmlinux
struct bpf_prog_info {
...
__u32 attach_btf_obj_id; /* 220 4 */
__u32 attach_btf_id; /* 224 4 */

/* size: 232, cachelines: 4, members: 38 */
/* sum members: 224 */
/* sum bitfield members: 1 bits, bit holes: 1, sum bit holes: 31 bits */
/* padding: 4 */
/* forced alignments: 9 */
/* last cacheline: 40 bytes */
} __attribute__((__aligned__(8)));

If a future kernel extension adds a new 4-byte field, older userspace
programs allocating this structure on the stack might inadvertently pass
uninitialized stack garbage into the new field, permanently breaking
backward compatibility. -- sashiko [1]

Fix it by changing sizeof(info) to
offsetofend(struct bpf_prog_info, attach_btf_id).

And, add "__u32 :32" to the tail of struct bpf_prog_info.

[1] https://lore.kernel.org/bpf/20260513224823.6494FC19425@smtp.kernel.org/

Fixes: aba64c7da983 ("bpf: Add verified_insns to bpf_prog_info and fdinfo")
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260605155249.20772-3-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Check tail zero of bpf_map_info

Since there're 4 bytes padding at the end of struct bpf_map_info, they
won't be checked by bpf_check_uarg_tail_zero().

pahole -C bpf_map_info ./vmlinux
struct bpf_map_info {
...
__u64 hash __attribute__((__aligned__(8))); /* 88 8 */
__u32 hash_size; /* 96 4 */

/* size: 104, cachelines: 2, members: 18 */
/* padding: 4 */
/* forced alignments: 1 */
/* last cacheline: 40 bytes */
} __attribute__((__aligned__(8)));

If a future kernel extension adds a new 4-byte field, older userspace
programs allocating this structure on the stack might inadvertently pass
uninitialized stack garbage into the new field, permanently breaking
backward compatibility. -- sashiko [1]

Fix it by changing sizeof(info) to
offsetofend(struct bpf_map_info, hash_size).

And, add "__u32 :32" to the tail of struct bpf_map_info.

[1] https://lore.kernel.org/bpf/20260513224823.6494FC19425@smtp.kernel.org/

Fixes: ea2e6467ac36 ("bpf: Return hashes of maps in BPF_OBJ_GET_INFO_BY_FD")
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260605155249.20772-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

perf sched: Replace BUG_ON and add NULL checks in replay event helpers

get_new_event() has three issues:

1. The zalloc() result is dereferenced without a NULL check, crashing
   on allocation failure.

2. BUG_ON(!task->atoms) kills the process when realloc() fails.
   Since perf.data is untrusted input, this should be a graceful error.

3. The realloc pattern assigns directly to task->atoms, losing the old
   pointer on failure.  task->nr_events is also incremented before the
   realloc, leaving corrupted state on failure.

Fix get_new_event() to:
  - Check the zalloc() result before dereferencing
  - Use a temporary for realloc() to avoid losing the old pointer
  - Increment nr_events only after successful realloc
  - Return NULL instead of calling BUG_ON on failure

Also fix add_sched_event_wakeup() where zalloc() for wait_sem is
passed to sem_init() without a NULL check.

Update all callers (add_sched_event_run, add_sched_event_wakeup,
add_sched_event_sleep) to handle NULL returns by returning early.
The replay may produce incomplete output on OOM but will not crash.

Fixes: ec156764d424 ("perf sched: Import schedbench.c")
Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf sched: Use thread__put() in free_idle_threads()

free_idle_threads() calls thread__delete() directly instead of
thread__put(), bypassing the reference counting lifecycle.  Under
REFCNT_CHECKING builds, this leaks the pointer handle since
thread__delete() frees the object without going through the refcount
wrapper.

The idle threads are created via thread__new() (refcount=1) in
get_idle_thread().  Callers get additional references via thread__get()
which they release with thread__put().  free_idle_threads() drops the
base reference — thread__put() is the correct call, matching the
thread__new() acquisition.

Fixes: 49394a2a24c7 ("perf sched timehist: Introduce timehist command")
Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf sched: Fix thread reference leak in idle hist processing

timehist_sched_change_event() sets itr->last_thread to NULL at the end
of idle hist processing without calling thread__put() first. The
thread reference was acquired via thread__get() in timehist_get_thread()
(line 2581), so every idle context switch leaks a thread reference when
--idle-hist is active.

Use thread__zput() to properly release the reference before clearing
the pointer.

Fixes: 5d8f17fb5822 ("perf sched timehist: Add -I/--idle-hist option")
Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf sched: Use is_idle_sample() for idle thread runtime cast guard

timehist_sched_change_event() uses thread__tid(thread) == 0 to decide
whether to cast thread_runtime to idle_thread_runtime.  However, a
crafted perf.data can set common_pid=0 and common_tid=0 (the perf_sample
fields) while prev_pid != 0 (the tracepoint field).  is_idle_sample()
returns false (it checks prev_pid for sched_switch), so
timehist_get_thread() goes through machine__findnew_thread() and returns
the machine's TID 0 thread — whose priv data is a regular thread_runtime,
not the larger idle_thread_runtime allocated by init_idle_thread().

The subsequent cast to idle_thread_runtime reads past the thread_runtime
allocation, accessing itr->last_thread, itr->cursor, and itr->callchain
from adjacent heap memory.  Writing to itr->last_thread corrupts the
heap; calling thread__put() on the OOB value frees an arbitrary pointer.

Replace the thread__tid() == 0 check with is_idle_sample(), which uses
the tracepoint-specific prev_pid field and correctly identifies whether
the sample originated from an idle thread with idle_thread_runtime priv.

Fixes: 5d8f17fb5822 ("perf sched timehist: Add -I/--idle-hist option")
Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf sched: Clean up idle_threads entry on init failure

get_idle_thread() allocates a thread via thread__new() and stores it in
idle_threads[cpu], then calls init_idle_thread() to set up the private
data. If init_idle_thread() fails (e.g. OOM for the idle_thread_runtime
struct), the function returns NULL but leaves the partially initialized
thread in idle_threads[cpu].

On subsequent calls for the same CPU, get_idle_thread() finds a non-NULL
idle_threads[cpu], skips allocation, and returns thread__get() on a
thread that has no priv data. Callers then get a thread whose
thread__priv() returns NULL, leading to unexpected behavior.

Release the thread and reset the slot to NULL on init failure so the
entry doesn't persist in a corrupted state.

Fixes: 49394a2a24c7 ("perf sched timehist: Introduce timehist command")
Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf c2c: Bounds-check CPU IDs in setup_nodes() topology loop

setup_nodes() iterates CPU maps from the perf.data topology header and
uses cpu.cpu directly as an array index into cpu2node[] (allocated with
c2c.cpus_cnt = env->nr_cpus_avail entries) and __set_bit(cpu.cpu, set)
(bitmap also sized to c2c.cpus_cnt).

A crafted perf.data with topology CPU IDs exceeding nr_cpus_avail causes
out-of-bounds heap writes into both the cpu2node array and the per-node
bitmap.

Add a bounds check to skip CPU IDs that fall outside the valid range.

Fixes: 1e181b92a2da ("perf c2c report: Add 'node' sort key")
Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf c2c: Bounds-check CPU and node IDs before bitmap and array access

c2c_he__set_cpu() passes sample->cpu directly to __set_bit(cpu, cpuset)
after only checking for the (u32)-1 sentinel.  The cpuset bitmap is
allocated with c2c.cpus_cnt bits (from env->nr_cpus_avail), so a crafted
perf.data with CPU IDs exceeding that count causes out-of-bounds heap
writes.

c2c_he__set_node() similarly passes the node ID from mem2node__node()
to __set_bit(node, nodeset) after only checking for negative values.
The nodeset bitmap is sized to c2c.nodes_cnt (from env->nr_numa_nodes),
so a node ID exceeding that causes OOB writes.

process_sample_event() indexes c2c.cpu2node[cpu] and
c2c_he->node_stats[node] without bounds checking.  Both arrays are
sized to c2c.cpus_cnt and c2c.nodes_cnt respectively.

Add bounds checks in all three paths:
  - c2c_he__set_cpu(): return if sample->cpu >= c2c.cpus_cnt
  - c2c_he__set_node(): return if node >= c2c.nodes_cnt
  - process_sample_event(): clamp cpu to 0 if >= cpus_cnt,
    guard node_stats access with bounds check

Fixes: 1e181b92a2da ("perf c2c report: Add 'node' sort key")
Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

dt-bindings: display: panel: Describe Samsung SOFEF01-M DDIC

Document the Samsung SOFEF01-M Display-Driver-IC and 1080x2520@60Hz
command-mode DSI panels found in many Sony phones:
- Sony Xperia 5 (kumano bahamut): amb609tc01
- Sony Xperia 10 II (seine pdx201): ams597ut01
- Sony Xperia 10 III (lena pdx213): ams597ut04
- Sony Xperia 10 IV (murray pdx225): ams597ut05
- Sony Xperia 10 V (zambezi pdx235): ams605dk01
- Sony Xperia 10 VI (columbia pdx246): ams605dk01

Signed-off-by: Marijn Suijten <marijn.suijten@somainline.org>
Link: https://patch.msgid.link/20251222-drm-panels-sony-v2-4-82a87465d163@somainline.org
[robh: move vci-supply property to top-level]
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>

Merge branch 'selftests-bpf-tolerate-partial-builds-across-kernel-configs'

Ricardo B. Marliere says:

====================
selftests/bpf: Tolerate partial builds across kernel configs

Currently the BPF selftests can only be built by using the minimum kernel
configuration defined in tools/testing/selftests/bpf/config*. This poses a
problem in distribution kernels that may have some of the flags disabled or
set as module. For example, we have been running the tests regularly in
openSUSE Tumbleweed [1] [2] but to work around this fact we created a
special package [3] that build the tests against an auxiliary vmlinux with
the BPF Kconfig. We keep a list of known issues that may happen due to,
amongst other things, configuration mismatches [4] [5].

The maintenance of this package is far from ideal, especially for
enterprise kernels. The goal of this series is to enable the common usecase
of running the following in any system:

```sh
make -C tools/testing/selftests install \
     SKIP_TARGETS= \
     TARGETS=bpf \
     BPF_STRICT_BUILD=0 \
     O=/lib/modules/$(uname -r)/build
```

As an example, the following script targeting a minimal config can be used
for testing:

```sh
make defconfig
scripts/config --file .config \
               --enable DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT \
               --enable DEBUG_INFO_BTF \
               --enable BPF_SYSCALL \
               --enable BPF_JIT
make olddefconfig
make -j$(nproc)
make -j$(nproc) -C tools/testing/selftests install \
     SKIP_TARGETS= \
     TARGETS=bpf \
     BPF_STRICT_BUILD=0
```

This produces a test_progs binary with 592 subtests, against the total of
727. Many of them will still fail or be skipped at runtime due to lack of
symbols, but at least there will be a clear way of building the tests.

[1]: https://openqa.opensuse.org/tests/5811715
[2]: https://openqa.opensuse.org/tests/5811730
[3]: https://src.opensuse.org/rmarliere/kselftests
[4]: https://github.com/openSUSE/kernel-qe/blob/main/kselftests_known_issues.yaml
[5]: https://openqa.opensuse.org/tests/5811730/logfile?filename=run_kselftests-config_mismatches.txt
---
Changes in v12:
- Rebase from 11 to 8 commits: split the test_kmods KDIR patch into a
  pure KDIR fix (no PERMISSIVE) and a separate tolerance patch; squash
  the BPF_STRICT_BUILD toggle with its first PERMISSIVE user; squash
  the four test_progs partial-build patches into one; move the install
  fix to immediately follow the first PERMISSIVE user
- Link to v11: https://patch.msgid.link/20260430-selftests-bpf_misconfig-v11-0-e11f7a8c4fdc@suse.com

Changes in v11:
- Gate the BTFIDS pretty-print on $(filter 1,$(V)) so V=0 and V=2 still
  print the marker; only V=1 suppresses it (patch 6)
- Use asm volatile ("") in the uprobe_multi_func_{1,2,3} weak stubs to
  match the strong definitions in prog_tests/uprobe_multi_test.c
  (patch 10)
- Link to v10: https://patch.msgid.link/20260430-selftests-bpf_misconfig-v10-0-cd302a31af16@suse.com

Changes in v10:
- Drop $(wildcard $^) from the bench link recipe so cc fails cleanly on
  the first missing input and the || fallback emits one SKIP-LINK line
  instead of a wall of undefined-reference errors; also makes make -n
  accurate (patch 9)
- Include <alloca.h> in testing_helpers.c for alloca() (patch 10)
- Link to v9: https://patch.msgid.link/20260429-selftests-bpf_misconfig-v9-0-c311f06b4791@suse.com

Changes in v9:
- Also pass KBUILD_OUTPUT=$(KMOD_O_VALID) when invoking kbuild so an
  inherited command-line KBUILD_OUTPUT cannot disagree with O= (patch 2)
- Note BPFOBJ remains a normal prereq intentionally (patch 5)
- Sub-shell isolate cd/CC and use $@ for $(RM); gate the BTFIDS
  pretty-print on $(V) so verbose mode does not double-print (patch 6)
- Restrict permissive compile-failure tolerance and partial-link to
  test_progs%; runners with strong cross-object references (test_maps)
  keep strict semantics (patches 6 and 8)
- Link to v8: https://patch.msgid.link/20260428-selftests-bpf_misconfig-v8-0-bf02cf97dbcb@suse.com

Changes in v8:
- In permissive mode, keep source changes to already-built tests
  triggering relinks, while using recipe-time $(wildcard ...) to pick
  up fresh .test.o files without duplicate linker inputs (patch 8)
- Resolve relative O=/KBUILD_OUTPUT before recursing into test_kmods,
  and only treat it as a kernel build dir when it contains
  Module.symvers (patch 2)
- Note in the commit message that PERMISSIVE-mode bisecting here needs
  the next two patches (patch 6)
- Clarify in the commit message that order-only skeleton prereqs do not
  break the new-skel-via-local-header case (patch 5)
- Exclude not-built tests from the success count so summary and JSON
  report them only as skipped (patch 7)
- Tighten commit messages for accuracy and clarity
- Link to v7: https://patch.msgid.link/20260416-selftests-bpf_misconfig-v7-0-a078e18012e4@suse.com

Changes in v7:
- Use $(abspath) for KMOD_O so relative O= paths resolve correctly
  when make -C changes directory (patch 2)
- Guard make clean against missing KDIR unconditionally; there is
  nothing to clean when kernel headers are absent (patch 2)
- Drop explicit $(TRUNNER_TEST_OBJS) from linker filter in strict mode;
  $^ already contains them as normal prerequisites (patch 8)
- Link to v6: https://patch.msgid.link/20260416-selftests-bpf_misconfig-v6-0-7efeab504af1@suse.com

Changes in v6:
- Add --ignore-missing-args to -extras rsync so out-of-tree permissive
  builds do not abort when .ko files are absent (patch 2)
- Use $(abspath) for KMOD_O so relative O= paths resolve correctly
  when make -C changes directory (patch 2)
- Guard make clean against missing KDIR unconditionally so cleaning
  does not abort when kernel headers are absent (patch 2)
- Remove stale skeleton headers in early-exit paths when .bpf.o is
  missing on incremental builds (patch 3)
- Fix strict-mode skeleton rules: use && before temp file cleanup so
  bpftool failures are not masked by rm -f exit code (patch 3)
- Track test filter selection separately from not_built so -t/-n flags
  are respected for unbuilt tests (patch 7)
- Make TRUNNER_TEST_OBJS order-only only in permissive mode, preserving
  incremental relinking in strict builds (patch 8)
- Drop explicit $(TRUNNER_TEST_OBJS) from linker filter in strict mode;
  they are already in $^ as normal prerequisites (patch 8)
- Reorder: move skip-unbuilt-tests patch before partial-linking patch
  for bisectability
- Link to v5: https://patch.msgid.link/20260415-selftests-bpf_misconfig-v5-0-03d0a52a898a@suse.com

Changes in v5:
- Add BPF_STRICT_BUILD toggle as patch 1 so every subsequent patch
  gates tolerance behind PERMISSIVE from the start, making the series
  bisectable with strict-by-default at every point
- Fix O= commit message; make parent cp conditional (patch 2)
- Tolerate linked skeleton failures (patch 3)
- Skip feature detection for emit_tests (patch 4)
- Clarify bench is all-or-nothing in commit message (patch 8)
- Move stack_mprotect() to testing_helpers.c, drop weak stubs (patch 9)
- Report not-built tests as "SKIP (not built)" in output (patch 10)
- Drop overly broad 2>/dev/null || true from install rsync; rely solely
  on --ignore-missing-args which already handles absent files (patch 11)
- Link to v4: https://patch.msgid.link/20260406-selftests-bpf_misconfig-v4-0-9914f50efdf7@suse.com

Changes in v4:
- Drop the test_kmods kselftest module flow patch: lib.mk gen_mods_dir
  invokes $(MAKE) -C $(TEST_GEN_MODS_DIR) without forwarding
  RESOLVE_BTFIDS, breaking ASAN and GCC BPF CI builds (Makefile.modfinal
  cannot find resolve_btfids in the kbuild output tree)
- Link to v3:
  https://patch.msgid.link/20260406-selftests-bpf_misconfig-v3-0-587a1114263c@suse.com

Changes in v3:
- Split test_kmods patch into two: fix KDIR handling (O= passthrough,
  EXTRA_CFLAGS/EXTRA_LDFLAGS clearing) and wire into lib.mk via
  TEST_GEN_MODS_DIR
- Pass O= through to the kernel module build so artifacts land in the
  output tree, not the source tree
- Clear EXTRA_CFLAGS and EXTRA_LDFLAGS when invoking the kernel build to
  prevent host flags (e.g. -static) leaking into module compilation
- Replace the bespoke test_kmods pattern rule with lib.mk module
  infrastructure (TEST_GEN_MODS_DIR); lib.mk now drives build and clean
  lifecycle
- Make the .ko copy step resilient: emit SKIP instead of failing when a
  module is absent
- Expand the uprobe weak stub comment in bpf_cookie.c to explain why
  noinline is required
- Link to v2:
  https://patch.msgid.link/20260403-selftests-bpf_misconfig-v2-0-f06700380a9d@suse.com

Changes in v2:
- Skip test_kmods build/clean when KDIR directory does not exist
- Use `Module.symvers` instead of `.config` for in-tree detection
- Fix skeleton order-only prereqs commit message
- Guard BTFIDS step when .test.o is absent
- Add `__weak stack_mprotect()` stubs in `bpf_cookie.c` and `iters.c`
- Link to v1:
  https://patch.msgid.link/20260401-selftests-bpf_misconfig-v1-0-3ae42c0af76f@suse.com

Assisted-by: {codex,claude}
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Ricardo B. Marliere <rbm@suse.com>
---
Ricardo B. Marlière (11):
      selftests/bpf: Add BPF_STRICT_BUILD toggle
      selftests/bpf: Fix test_kmods KDIR to honor O= and distro kernels
      selftests/bpf: Tolerate BPF and skeleton generation failures
      selftests/bpf: Avoid rebuilds when running emit_tests
      selftests/bpf: Make skeleton headers order-only prerequisites of .test.d
      selftests/bpf: Tolerate test file compilation failures
      selftests/bpf: Skip tests whose objects were not built
      selftests/bpf: Allow test_progs to link with a partial object set
      selftests/bpf: Tolerate benchmark build failures
      selftests/bpf: Provide weak definitions for cross-test functions
      selftests/bpf: Tolerate missing files during install

tools/testing/selftests/bpf/Makefile               | 177 ++++++++++++++-------
.../testing/selftests/bpf/prog_tests/bpf_cookie.c  |  17 +-
tools/testing/selftests/bpf/prog_tests/iters.c     |   2 -
tools/testing/selftests/bpf/prog_tests/test_lsm.c  |  22 ---
tools/testing/selftests/bpf/test_kmods/Makefile    |  30 +++-
tools/testing/selftests/bpf/test_progs.c           |  53 +++++-
tools/testing/selftests/bpf/test_progs.h           |   1 +
tools/testing/selftests/bpf/testing_helpers.c      |  18 +++
tools/testing/selftests/bpf/testing_helpers.h      |   1 +
9 files changed, 226 insertions(+), 95 deletions(-)
---
base-commit: b93c55b4932dd7e32dca8cf34a3443cc87a02906
change-id: 20260401-selftests-bpf_misconfig-4c33ef5c56da

Best regards,
--
Ricardo B. Marlière <rbm@suse.com>
====================

Link: https://patch.msgid.link/20260602-selftests-bpf_misconfig-v12-0-27f898b3ba26@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Tolerate missing files during install

With partial builds, some TEST_GEN_FILES entries can be absent at install
time. rsync treats missing source arguments as fatal and aborts kselftest
installation.

Override INSTALL_SINGLE_RULE in selftests/bpf to use --ignore-missing-args,
while keeping the existing bpf-specific INSTALL_RULE extension logic. Also
add --ignore-missing-args to the TEST_INST_SUBDIRS rsync loop so that
subdirectories with no .bpf.o files (e.g. when a test runner flavor was
skipped) do not abort installation.

Note that the INSTALL_SINGLE_RULE override applies globally to all file
categories including static source files (TEST_PROGS, TEST_FILES). These
are version-controlled and should always be present, so the practical risk
is negligible.

Signed-off-by: Ricardo B. Marlière <rbm@suse.com>
Link: https://lore.kernel.org/r/20260602-selftests-bpf_misconfig-v12-11-27f898b3ba26@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Provide weak definitions for cross-test functions

Some test files reference functions defined in other translation units that
may not be compiled when skeletons are missing. Replace forward
declarations of uprobe_multi_func_{1,2,3}() with weak no-op stubs so the
linker resolves them regardless of which objects are present.

The stub bodies are `asm volatile ("")` rather than empty, matching the
shape of the strong definitions in prog_tests/uprobe_multi_test.c. This
keeps the weak and strong sides on the same footing for the optimiser
(noinline + asm-barrier), which is the form upstream already relies on
for these functions.

Move stack_mprotect() from test_lsm.c into testing_helpers.c so it is
always available. The previous weak-stub approach returned 0, which would
cause callers expecting -1/EPERM to fail their assertions
deterministically. Having the real implementation in a shared utility
avoids this problem entirely.

Include <alloca.h> for alloca() so the build does not rely on glibc's
implicit declaration via <stdlib.h>.

Signed-off-by: Ricardo B. Marlière <rbm@suse.com>
Link: https://lore.kernel.org/r/20260602-selftests-bpf_misconfig-v12-10-27f898b3ba26@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Tolerate benchmark build failures

Benchmark objects depend on skeletons that may be missing when some BPF
programs fail to build. In that case, benchmark object compilation or final
bench linking should not abort the full selftests/bpf build.

Keep both steps non-fatal, emit SKIP-BENCH or SKIP-LINK, and remove failed
outputs so stale objects or binaries are not reused by later incremental
builds. Note that because bench.c statically references every benchmark via
extern symbols, partial linking is not possible: if any single benchmark
object fails, the entire bench binary is skipped. This is by design -- the
error handler catches all compilation failures including genuine ones, but
those are caught by full-config CI runs.

Signed-off-by: Ricardo B. Marlière <rbm@suse.com>
Link: https://lore.kernel.org/r/20260602-selftests-bpf_misconfig-v12-9-27f898b3ba26@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>