Leon Hwang [Wed, 7 Jan 2026 02:20:17 +0000 (10:20 +0800)]
bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_array maps
Introduce support for the BPF_F_ALL_CPUS flag in percpu_array maps to
allow updating values for all CPUs with a single value for both
update_elem and update_batch APIs.
Introduce support for the BPF_F_CPU flag in percpu_array maps to allow:
* updating the value for a specified CPU for both the update_elem and
update_batch APIs.
* looking up the value for a specified CPU for both the lookup_elem and
lookup_batch APIs.
The BPF_F_CPU flag is passed via:
* map_flags of lookup_elem and update_elem APIs along with embedded cpu
info.
* elem_flags of lookup_batch and update_batch APIs along with embedded
cpu info.
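As a rough illustration, a user-space update targeting a single CPU could look like the sketch below. It assumes the cpu index is what gets embedded in the upper 32 bits of the flags value, per the description above; the function and variable names are made up for the example:
    #include <linux/bpf.h>
    #include <bpf/bpf.h>

    /* Sketch: assumes the target cpu index is embedded in the upper 32 bits
     * of the flags value passed alongside BPF_F_CPU.
     */
    static int update_one_cpu(int map_fd, __u32 key, __u32 value, __u32 target_cpu)
    {
            __u64 flags = BPF_F_CPU | ((__u64)target_cpu << 32);

            return bpf_map_update_elem(map_fd, &key, &value, flags);
    }

    /* BPF_F_ALL_CPUS: one value replicated to every CPU. */
    static int update_all_cpus(int map_fd, __u32 key, __u32 value)
    {
            return bpf_map_update_elem(map_fd, &key, &value, BPF_F_ALL_CPUS);
    }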
====================
bpf/verifier: Allow calling arena functions when holding BPF lock
BPF arena-related kfuncs now cannot sleep, so they are safe to call
while holding a spinlock. However, the verifier still rejects
programs that do so. Update the verifier to allow arena kfunc
calls while holding a lock.
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Changes v1->v2: (https://lore.kernel.org/r/20260106-arena-under-lock-v1-0-6ca9c121d826@etsalapatis.com)
- Added patch to account for active locks in in_sleepable_context() (AI)
====================
Emil Tsalapatis [Tue, 6 Jan 2026 23:36:44 +0000 (18:36 -0500)]
bpf: Allow calls to arena functions while holding spinlocks
The bpf_arena_*_pages() kfuncs can be called from sleepable contexts,
but the verifier still prevents BPF programs from calling them while
holding a spinlock. Amend the verifier to allow for BPF programs
calling arena page management functions while holding a lock.
Emil Tsalapatis [Tue, 6 Jan 2026 23:36:43 +0000 (18:36 -0500)]
bpf: Check active lock count in in_sleepable_context()
The in_sleepable_context() function is used to specialize the BPF code
in do_misc_fixups(). With the addition of nonsleepable arena kfuncs,
there are kfuncs whose specialization depends on whether we are
holding a lock. We should use the nonsleepable version while
holding a lock and the sleepable one when not.
Add a check for active_locks to account for locking when specializing
arena kfuncs.
Puranjay Mohan [Fri, 2 Jan 2026 22:15:12 +0000 (14:15 -0800)]
bpf: Replace __opt annotation with __nullable for kfuncs
The __opt annotation was originally introduced specifically for
buffer/size argument pairs in bpf_dynptr_slice() and
bpf_dynptr_slice_rdwr(), allowing the buffer pointer to be NULL while
still validating the size as a constant. The __nullable annotation
serves the same purpose but is more general and is already used
throughout the BPF subsystem for raw tracepoints, struct_ops, and other
kfuncs.
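For reference, both annotations are suffixes on the kfunc parameter name. A hedged sketch of what the unification looks like for bpf_dynptr_slice(); treat the exact prototype as illustrative:
    /* Before: __opt marks the buffer of a buffer/size pair as optional. */
    __bpf_kfunc void *bpf_dynptr_slice(const struct bpf_dynptr *p, u32 offset,
                                       void *buffer__opt, u32 buffer__szk);

    /* After: the same intent is expressed with the generic __nullable suffix. */
    __bpf_kfunc void *bpf_dynptr_slice(const struct bpf_dynptr *p, u32 offset,
                                       void *buffer__nullable, u32 buffer__szk);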
This patch unifies the two annotations by replacing __opt with
__nullable. The key change is in the verifier's
get_kfunc_ptr_arg_type() function, where mem/size pair detection is now
performed before the nullable check. This ensures that buffer/size
pairs are correctly classified as KF_ARG_PTR_TO_MEM_SIZE even when the
buffer is nullable, while adding an !arg_mem_size condition to the
nullable check prevents interference with mem/size pair handling.
When processing KF_ARG_PTR_TO_MEM_SIZE arguments, the verifier now uses
is_kfunc_arg_nullable() instead of the removed is_kfunc_arg_optional()
to determine whether to skip size validation for NULL buffers.
This is the first documentation added for the __nullable annotation,
which has been in use since it was introduced but was previously
undocumented.
No functional changes to verifier behavior - nullable buffer/size pairs
continue to work exactly as before.
====================
memcg accounting for BPF arena
v4: https://lore.kernel.org/all/20260102181333.3033679-1-puranjay@kernel.org/
Changes in v4->v5:
- Remove unused variables from bpf_map_alloc_pages() (CI)
v3: https://lore.kernel.org/all/20260102151852.570285-1-puranjay@kernel.org/
Changes in v3->v4:
- Do memcg set/recover in arena_reserve_pages() rather than
bpf_arena_reserve_pages() for symmetry with other kfuncs (Alexei)
v2: https://lore.kernel.org/all/20251231141434.3416822-1-puranjay@kernel.org/
Changes in v2->v3:
- Remove memcg accounting from bpf_map_alloc_pages() as the caller does
it already. (Alexei)
- Do memcg set/recover in arena_alloc/free_pages() rather than
bpf_arena_alloc/free_pages(), it reduces copy pasting in
sleepable/non_sleepable functions.
v1: https://lore.kernel.org/all/20251230153006.1347742-1-puranjay@kernel.org/
Changes in v1->v2:
- Return both pointers through arguments from bpf_map_memcg_enter and
make it return void. (Alexei)
- Add memcg accounting in arena_free_worker (AI)
This set adds memcg accounting logic into arena kfuncs and other places
that do allocations in arena.c.
====================
Puranjay Mohan [Fri, 2 Jan 2026 20:02:28 +0000 (12:02 -0800)]
bpf: arena: Reintroduce memcg accounting
When arena allocations were converted from bpf_map_alloc_pages() to
kmalloc_nolock() to support non-sleepable contexts, memcg accounting was
inadvertently lost. This commit restores proper memory accounting for
all arena-related allocations.
All arena-related allocations are accounted to the memcg of the process
that created the bpf_arena.
====================
bpf: Make KF_TRUSTED_ARGS default
v2: https://lore.kernel.org/all/20251231171118.1174007-1-puranjay@kernel.org/
Changes in v2->v3:
- Fix documentation: add a new section for kfunc parameters (Eduard)
- Remove all occurrences of KF_TRUSTED from comments, etc. (Eduard)
- Fix the netfilter kfuncs to drop dead NULL checks.
- Fix the selftest for netfilter kfuncs to check for verification failures
and remove the runtime failure cases that are no longer possible after
this change
v1: https://lore.kernel.org/all/20251224192448.3176531-1-puranjay@kernel.org/
Changes in v1->v2:
- Update kfunc_dynptr_param selftest to use a real pointer that is not
ptr_to_stack and not CONST_PTR_TO_DYNPTR rather than casting 1
(Alexei)
- Thoroughly review all kfuncs in the tree to find regressions or missing
annotations. (Eduard)
- Fix kfuncs found from the above step.
This series makes trusted arguments the default requirement for all BPF
kfuncs, inverting the current opt-in model. Instead of requiring
explicit KF_TRUSTED_ARGS flags, kfuncs now require trusted arguments by
default and must explicitly opt out using __nullable/__opt annotations
or the KF_RCU flag.
This improves security and type safety by preventing BPF programs from
passing untrusted or NULL pointers to kernel functions at verification
time, while maintaining flexibility for the small number of kfuncs that
legitimately need to accept NULL or RCU pointers.
MOTIVATION
The current opt-in model is error-prone and inconsistent. Most kfuncs already
require trusted pointers from sources like KF_ACQUIRE, struct_ops callbacks, or
tracepoints. Making trusted arguments the default:
- Prevents NULL pointer dereferences at verification time
- Reduces defensive NULL checks in kernel code
- Provides better error messages for invalid BPF programs
- Aligns with existing patterns (context pointers, struct_ops already trusted)
IMPACT ANALYSIS
Comprehensive analysis of all 304+ kfuncs across 37 kernel files found:
- Most kfuncs (299/304) are already safe and require no changes
- Only 4 kfuncs required fixes (all included in this series)
- 0 regressions found in independent verification
All bpf selftests are passing. The hid_bpf tests are also passing:
# PASSED: 20 / 20 tests passed.
# Totals: pass:20 fail:0 xfail:0 xpass:0 skip:0 error:0
bpf programs in drivers/hid/bpf/progs/ show no regression as shown by
veristat:
The verifier now validates kfunc arguments in this order:
1. NULL check (runs first): Rejects NULL unless parameter has __nullable/__opt
2. Trusted check: Rejects untrusted pointers unless kfunc has KF_RCU
Special cases that bypass trusted checking:
- Context pointers (xdp_md, __sk_buff): Handled via KF_ARG_PTR_TO_CTX
- Struct_ops callbacks: Pre-marked as PTR_TRUSTED during initialization
- KF_RCU kfuncs: Have separate validation path for RCU pointers
BACKWARD COMPATIBILITY
This affects BPF program verification, not runtime:
- Valid programs passing trusted pointers: Continue to work
- Programs with bugs: May now fail verification (preventing runtime crashes)
This series introduces two intentional breaking changes to the BPF
verifier's kfunc handling:
1. NULL pointer rejection timing: Kfuncs that previously accepted NULL
pointers without KF_TRUSTED_ARGS will now reject NULL at verification
time instead of returning runtime errors. This affects netfilter
connection tracking functions (bpf_xdp_ct_lookup, bpf_skb_ct_lookup,
bpf_xdp_ct_alloc, bpf_skb_ct_alloc), which now enforce their documented
"Cannot be NULL" requirements at load time rather than returning -EINVAL
at runtime.
2. Fentry/fexit program restrictions: BPF programs using fentry/fexit
attachment points can no longer pass their callback arguments directly
to kfuncs, as these arguments are not marked as trusted by default.
Programs requiring trusted argument semantics should migrate to tp_btf
(tracepoint with BTF) attachment points where arguments are guaranteed
trusted by the verifier.
Both changes strengthen the verifier's safety guarantees by catching
errors earlier in the development cycle and are accompanied by
comprehensive test updates demonstrating the new expected behaviors.
====================
Puranjay Mohan [Fri, 2 Jan 2026 18:00:36 +0000 (10:00 -0800)]
selftests: bpf: Fix test_bpf_nf for trusted args becoming default
With trusted args now being the default, passing NULL to kfunc
parameters that are pointers causes verifier rejection rather than a
runtime error. The test_bpf_nf test was failing because it attempted to
pass NULL to bpf_xdp_ct_lookup() to verify runtime error handling.
Since the NULL check now happens at verification time, remove the
runtime test case that passed NULL to the bpf_tuple parameter and
instead add verification-time tests to ensure the verifier correctly
rejects programs that pass NULL to trusted arguments.
Puranjay Mohan [Fri, 2 Jan 2026 18:00:35 +0000 (10:00 -0800)]
selftests: bpf: fix cgroup_hierarchical_stats
The cgroup_hierarchical_stats selftest uses an fentry program attached
to cgroup_attach_task and then passes the received &dst_cgrp->self to
the css_rstat_updated() kfunc. The verifier now assumes that all kfuncs
take only trusted pointer arguments, and pointers received by fentry
programs are not marked trusted by default.
Use a tp_btf program in place of fentry for this test; pointers
received by tp_btf programs are marked trusted by the verifier.
Puranjay Mohan [Fri, 2 Jan 2026 18:00:34 +0000 (10:00 -0800)]
selftests: bpf: fix test_kfunc_dynptr_param
As the verifier now assumes that all kfuncs take only trusted pointer
arguments, passing 0 (NULL) to a kfunc that doesn't mark the argument as
__nullable or __opt will be rejected with a failure message of: Possibly
NULL pointer passed to trusted arg<n>
Pass a non-null value to the kfunc to test the expected failure mode.
Puranjay Mohan [Fri, 2 Jan 2026 18:00:33 +0000 (10:00 -0800)]
selftests: bpf: Update failure message for rbtree_fail
The rbtree_api_use_unchecked_remove_retval() selftest passes a pointer
received from bpf_rbtree_remove() to bpf_rbtree_add() without checking
for NULL; this was earlier caught by __check_ptr_off_reg() in the
verifier. Now the verifier assumes every kfunc only takes trusted pointer
arguments, so it catches this NULL pointer earlier in the path and
provides a more accurate failure message.
Puranjay Mohan [Fri, 2 Jan 2026 18:00:32 +0000 (10:00 -0800)]
selftests: bpf: Update kfunc_param_nullable test for new error message
With trusted args now being the default, the NULL pointer check runs
before type-specific validation. Update test3 to expect the new error
message "Possibly NULL pointer passed to trusted arg0" instead of the
old dynptr-specific error message.
Puranjay Mohan [Fri, 2 Jan 2026 18:00:31 +0000 (10:00 -0800)]
HID: bpf: drop dead NULL checks in kfuncs
As KF_TRUSTED_ARGS is now considered the default for all kfuncs, the verifier
will not allow passing NULL pointers to these kfuncs. These checks for
NULL pointers can therefore be removed.
Puranjay Mohan [Fri, 2 Jan 2026 18:00:30 +0000 (10:00 -0800)]
bpf: xfrm: drop dead NULL check in bpf_xdp_get_xfrm_state()
As KF_TRUSTED_ARGS is now considered the default for all kfuncs, the
opts parameter in bpf_xdp_get_xfrm_state() can never be NULL. The verifier
will detect this at load time and will not allow passing NULL to this
function. This matches the documentation above the kfunc, which says this
parameter (opts) "Cannot be NULL".
Puranjay Mohan [Fri, 2 Jan 2026 18:00:29 +0000 (10:00 -0800)]
bpf: net: netfilter: drop dead NULL checks
bpf_xdp_ct_lookup() and bpf_skb_ct_lookup() receive bpf_tuple and opts
parameters that are expected to be non-NULL for real usage (see the doc
strings above the functions). They return an error if NULL is passed for
opts or tuple.
The verifier will now reject programs that pass NULL to these
parameters, so the kfuncs can assume that these are always valid pointers
and the NULL checks for these parameters can be dropped.
Puranjay Mohan [Fri, 2 Jan 2026 18:00:28 +0000 (10:00 -0800)]
bpf: Remove redundant KF_TRUSTED_ARGS flag from all kfuncs
Now that KF_TRUSTED_ARGS is the default for all kfuncs, remove the
explicit KF_TRUSTED_ARGS flag from all kfunc definitions and remove the
flag itself.
Puranjay Mohan [Fri, 2 Jan 2026 18:00:27 +0000 (10:00 -0800)]
bpf: Make KF_TRUSTED_ARGS the default for all kfuncs
Change the verifier to make trusted args the default requirement for
all kfuncs by removing is_kfunc_trusted_args() and assuming it to always
return true.
This works because:
1. Context pointers (xdp_md, __sk_buff, etc.) are handled through their
own KF_ARG_PTR_TO_CTX case label and bypass the trusted check
2. Struct_ops callback arguments are already marked as PTR_TRUSTED during
initialization and pass is_trusted_reg()
3. KF_RCU kfuncs are handled separately via is_kfunc_rcu() checks at
call sites (always checked with || alongside is_kfunc_trusted_args)
This simple change makes all kfuncs require trusted args by default
while maintaining correct behavior for all existing special cases.
Note: This change means kfuncs that previously accepted NULL pointers
without KF_TRUSTED_ARGS will now reject NULL at verification time.
Several netfilter kfuncs are affected: bpf_xdp_ct_lookup(),
bpf_skb_ct_lookup(), bpf_xdp_ct_alloc(), and bpf_skb_ct_alloc() all
accept NULL for their bpf_tuple and opts parameters internally (checked
in __bpf_nf_ct_lookup), but after this change the verifier rejects NULL
before the kfunc is even called. This is acceptable because these kfuncs
don't work with NULL parameters in their proper usage. Such programs will
now be rejected at load time rather than receiving a runtime error, which
shouldn't make a difference to BPF programs that were using these kfuncs
properly.
Ihor Solodrai [Wed, 31 Dec 2025 18:39:29 +0000 (10:39 -0800)]
scripts/gen-btf.sh: Reduce log verbosity
Remove info messages from gen-btf.sh, as they are unnecessarily
detailed and sometimes inaccurate [1]. A verbose log can be produced by
passing V=1 to make, which sets -x for the shell.
Ihor Solodrai [Wed, 31 Dec 2025 01:25:57 +0000 (17:25 -0800)]
resolve_btfids: Implement --patch_btfids
Recent changes in BTF generation [1] rely on the ${OBJCOPY} command to
update .BTF_ids section data in target ELF files.
This exposed a bug in the llvm-objcopy --update-section code path that
may lead to corruption of the target ELF file. Specifically, because of
the bug, st_shndx of some symbols may be (incorrectly) set to 0xffff
(SHN_XINDEX) [2][3].
While there is a pending fix for LLVM, it'll take some time before it
lands (likely in 22.x). And the kernel build must keep working with
older LLVM toolchains in the foreseeable future.
Using GNU objcopy for .BTF_ids update would work, but it would require
changes to LLVM-based build process, likely breaking existing build
environments as discussed in [2].
To work around the llvm-objcopy bug, implement a --patch_btfids code path in
resolve_btfids as a drop-in replacement for:
====================
bpf: unify state pruning handling of invalid/misc stack slots
This change unifies the state pruning handling of NOT_INIT registers and
STACK_INVALID/STACK_MISC stack slots between the regular and
iterator/callback-based loop cases.
The change results in a modest verifier performance improvement:
========= selftests: master vs loop-stack-misc-pruning =========
Total progs: 305
Old success: 292
New success: 292
total_insns diff min: -49.55%
total_insns diff max: 0.00%
0 -> value: 0
value -> 0: 0
total_insns abs max old: 257,545
total_insns abs max new: 243,678
-50 .. -45 %: 7
-30 .. -20 %: 5
-20 .. -10 %: 14
-10 .. 0 %: 18
0 .. 5 %: 261
There is also a significant verifier performance improvement for some
bpf_loop() heavy Meta internal programs (~ -40% processed instructions).
====================
Eduard Zingerman [Wed, 31 Dec 2025 05:36:04 +0000 (21:36 -0800)]
selftests/bpf: iterator based loop and STACK_MISC states pruning
The test case first initializes 9 stack slots as STACK_MISC,
then conditionally updates each of them to SCALAR spill inside an
iterator based loop. This leads to 2**9 combinations of MISC/SPILL
marks for these slots at the iterator next call.
The loop converges only if the verifier treats such states as
equivalent, otherwise visited states are evicted from the states cache
too quickly.
Eduard Zingerman [Wed, 31 Dec 2025 05:36:03 +0000 (21:36 -0800)]
bpf: allow states pruning for misc/invalid slots in iterator loops
Within an iterator or callback based loop, it should be safe to prune
the current state if the old state stack slot is marked as
STACK_INVALID or STACK_MISC:
- either all branches of the old state lead to a program exit;
- or some branch of the old state leads to the current state.
This is the same logic as applied in non-loop cases when
states_equal() is called in NOT_EXACT mode.
The test case that exercises stacksafe() and demonstrates the
difference in verification performance is included in the next patch.
I'm not sure if it is possible to prepare a test case that exercises
regsafe(); it appears that the compute_live_registers() pass makes
this impossible.
Nevertheless, for code readability reasons, I think that stacksafe()
and regsafe() should handle STACK_INVALID / NOT_INIT symmetrically.
Hence, this commit changes both functions.
====================
bpf: calls to bpf_loop() should have an SCC and accumulate backedges
This is a correctness fix for the verification of BPF programs that
work with callback-calling functions. The problem is the same as the
issue fixed by series [1] for iterator-based loops: some of the states
created while processing the callback function body might have
incomplete read or precision marks.
An example of an unsafe program that is accepted without this fix can
be found in patch #2.
Eduard Zingerman [Tue, 30 Dec 2025 07:13:08 +0000 (23:13 -0800)]
selftests/bpf: test cases for bpf_loop SCC and state graph backedges
Test for state graph backedges accumulation for SCCs formed by
bpf_loop(). Equivalent to the following C program:
int main(void) {
1:      fp[-8] = bpf_get_prandom_u32();
2:      fp[-16] = -32; // used in a memory access below
3:      bpf_loop(7, loop_cb4, fp, 0);
4:      return 0;
}

int loop_cb4(int i, void *ctx) {
5:      if (unlikely(ctx[-8] > bpf_get_prandom_u32()))
6:              *(u64 *)(fp + ctx[-16]) = 42; // aligned access expected
7:      if (unlikely(fp[-8] > bpf_get_prandom_u32()))
8:              ctx[-16] = -31; // makes said access unaligned
9:      return 0;
}
If state graph backedges are not accumulated properly at the SCC
formed by loop_cb4() call from bpf_loop(), the state {ctx[-16]=-32}
injected at instruction 9 on verification path 1,2,3,5,7,9,4 would be
considered fully verified and would lack precision mark for ctx[-16].
This would lead to early pruning of verification path 1,2,3,5,7,8,9 in
state {ctx[-16]=-31}, which in turn leads to the incorrect assumption
that the above program is safe.
Eduard Zingerman [Tue, 30 Dec 2025 07:13:07 +0000 (23:13 -0800)]
bpf: bpf_scc_visit instance and backedges accumulation for bpf_loop()
Calls like bpf_loop() or bpf_for_each_map_elem() introduce loops that
are not explicitly present in the control-flow graph. The verifier
processes such calls by repeatedly interpreting the callback function
body within the same verification path (until the current state
converges with a previous state).
Such loops require a bpf_scc_visit instance in order to allow the
accumulation of the state graph backedges. Otherwise, certain
checkpoint states created within the bodies of such loops will have
incomplete precision marks.
See the next patch for an example of an unsafe program that the
verifier would accept without this change.
Fixes: 96c6aa4c63af ("bpf: compute SCCs in program control flow graph")
Fixes: c9e31900b54c ("bpf: propagate read/precision marks over state graph backedges")
Reported-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20251229-scc-for-callbacks-v1-1-ceadfe679900@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Puranjay Mohan [Tue, 30 Dec 2025 19:51:32 +0000 (11:51 -0800)]
selftests/bpf: Fix verifier_arena_large/big_alloc3 test
The big_alloc3() test tries to allocate 2051 pages at once in
non-sleepable context and this can fail sporadically on resource
constrained systems, so skip this test in case of such failures.
Ihor Solodrai [Mon, 29 Dec 2025 20:28:23 +0000 (12:28 -0800)]
scripts/gen-btf.sh: Fix .btf.o generation when compiling for RISCV
gen-btf.sh emits a .btf.o file with BTF sections to be linked into
vmlinux in link-vmlinux.sh
This .btf.o file is created by compiling an empty string with ${CC},
and then adding BTF sections into it with ${OBJCOPY}.
To ensure the .btf.o is linkable when cross-compiling with LLVM, we
have to also pass ${KBUILD_FLAGS}, which in particular control the
target word size.
====================
Remove KF_SLEEPABLE from arena kfuncs
V7: https://lore.kernel.org/all/20251222190815.4112944-1-puranjay@kernel.org/
Changes in V7->v8:
- Use clear_lo32(arena->user_vm_start) in place of user_vm_start in patch 3
V6: https://lore.kernel.org/all/20251217184438.3557859-1-puranjay@kernel.org/
Changes in v6->v7:
- Fix a deadlock in patch 1, that was being fixed in patch 2. Move the fix to patch 1.
- Call flush_cache_vmap() after setting up the mappings as it is
required by some architectures.
V5: https://lore.kernel.org/all/20251212044516.37513-1-puranjay@kernel.org/
Changes in v5->v6:
Patch 1:
- Add a missing ; to make sure this patch builds individually. (AI)
V4: https://lore.kernel.org/all/20251212004350.6520-1-puranjay@kernel.org/
Changes in v4->v5:
Patch 1:
- Fix a memory leak in arena_alloc_pages(); it was being fixed in
Patch 3, but every patch should be complete in itself. (AI)
Patch 3:
- Don't do useless addition in arena_alloc_pages() (Alexei)
- Add a comment about kmalloc_nolock() failure and expectations.
v3: https://lore.kernel.org/all/20251117160150.62183-1-puranjay@kernel.org/
Changes in v3->v4:
- Coding style changes related to comments in Patch 2/3 (Alexei)
v2: https://lore.kernel.org/all/20251114111700.43292-1-puranjay@kernel.org/
Changes in v2->v3:
Patch 1:
- Call range_tree_destroy() in error path of
populate_pgtable_except_pte() in arena_map_alloc() (AI)
Patch 2:
- Fix double mutex_unlock() in the error path of
arena_alloc_pages() (AI)
- Fix coding style issues (Alexei)
Patch 3:
- Unlock spinlock before returning from arena_vm_fault() in case
BPF_F_SEGV_ON_FAULT is set by user. (AI)
- Use __llist_del_all() in place of llist_del_all for on-stack
llist (free_pages) (Alexei)
- Fix build issues on 32-bit systems where arena.c is not compiled.
(kernel test robot)
- Make bpf_arena_alloc_pages() polymorphic so it knows if it has
been called in sleepable or non-sleepable context. This
information is passed to arena_free_pages() in the error path.
Patch 4:
- Add a better comment for the big_alloc3() test that triggers
kmalloc_nolock()'s limit and if bpf_arena_alloc_pages() works
correctly above this limit.
v1: https://lore.kernel.org/all/20251111163424.16471-1-puranjay@kernel.org/
Changes in v1->v2:
Patch 1:
- Import tlbflush.h to fix build issue in loongarch. (kernel
test robot)
- Fix unused variable error in apply_range_clear_cb() (kernel
test robot)
- Call bpf_map_area_free() on error path of
populate_pgtable_except_pte() (AI)
- Use PAGE_SIZE in apply_to_existing_page_range() (AI)
Patch 2:
- Cap allocation made by kmalloc_nolock() for pages array to
KMALLOC_MAX_CACHE_SIZE and reuse the array in an explicit loop
to overcome this limit. (AI)
Patch 3:
- Do page_ref_add(page, 1); under the spinlock to mitigate a
race (AI)
Patch 4:
- Add a new testcase big_alloc3() verifier_arena_large.c that
tries to allocate a large number of pages at once, this is to
trigger the kmalloc_nolock() limit in Patch 2 and see if the
loop logic works correctly.
This set allows arena kfuncs to be called from non-sleepable contexts.
It is achieved by the following changes:
The range_tree is now protected with a rqspinlock and not a mutex;
this change alone is enough to make bpf_arena_reserve_pages() any-context
safe.
bpf_arena_alloc_pages() had four points where it could sleep:
1. Mutex to protect range_tree: now replaced with rqspinlock
2. kvcalloc() for allocations: now replaced with kmalloc_nolock()
3. Allocating pages with bpf_map_alloc_pages(): this already calls
alloc_pages_nolock() in non-sleepable contexts and therefore is safe.
4. Setting up kernel page tables with vm_area_map_pages():
vm_area_map_pages() may allocate memory while inserting pages into the
bpf arena's vm_area. Now, at arena creation time, all page table levels
except the last are populated; when new pages need to be inserted,
apply_to_page_range() is called again, which only does set_pte_at() for
those pages and does not allocate memory.
The above four changes make bpf_arena_alloc_pages() any context safe.
bpf_arena_free_pages() has to do the following steps:
1. Update the range_tree
2. vm_area_unmap_pages(): to unmap pages from kernel vm_area
3. flush the tlb: done in step 2, already.
4. zap_pages(): to unmap pages from user page tables
5. free pages.
The third patch in this set makes bpf_arena_free_pages() polymorphic using
the specialize_kfunc() mechanism. When called from a sleepable context,
arena_free_pages() remains mostly unchanged except the following:
1. rqspinlock is taken now instead of the mutex for the range tree
2. Instead of using vm_area_unmap_pages() that can free intermediate page
table levels, apply_to_existing_page_range() with a callback is used
that only does pte_clear() on the last level and leaves the intermediate
page table levels intact. This is needed to make sure that
bpf_arena_alloc_pages() can safely do set_pte_at() without allocating
intermediate page tables.
When arena_free_pages() is called from a non-sleepable context, or it fails to
acquire the rqspinlock in the sleepable case, a lock-less list of struct
arena_free_span is used to queue the uaddr and page cnt. kmalloc_nolock()
is used to allocate the arena_free_span; this can fail, but we need to make
this trade-off for frees done from non-sleepable contexts.
arena_free_pages() then raises an irq_work whose handler in turn schedules
work that iterates this list and clears PTEs, flushes TLBs, zaps pages, and
frees pages for the queued uaddrs and page cnts.
apply_range_clear_cb() with apply_to_existing_page_range() is used to
clear PTEs and collect the pages to be freed; the struct llist_node pcp_llist
member in struct page is used to do this.
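The deferral itself follows a common kernel pattern (lock-less list plus irq_work plus workqueue). A heavily simplified sketch of that pattern is shown below; the names and the kmalloc_nolock() signature are assumptions for illustration, not the actual arena.c code:
    struct arena_free_span {
            struct llist_node node;
            unsigned long uaddr;
            u32 page_cnt;
    };

    static LLIST_HEAD(free_spans);
    static void arena_free_worker(struct work_struct *w);
    static DECLARE_WORK(free_work, arena_free_worker);

    static void free_irq_workfn(struct irq_work *w)
    {
            schedule_work(&free_work);      /* safe from hard-irq context */
    }
    static DEFINE_IRQ_WORK(free_irq_work, free_irq_workfn);

    /* Non-sleepable (or lock-contended) free path: queue and defer. */
    static void defer_free(unsigned long uaddr, u32 page_cnt)
    {
            struct arena_free_span *s;

            s = kmalloc_nolock(sizeof(*s), 0, NUMA_NO_NODE);  /* may fail */
            if (!s)
                    return;
            s->uaddr = uaddr;
            s->page_cnt = page_cnt;
            llist_add(&s->node, &free_spans);
            irq_work_queue(&free_irq_work);
    }

    static void arena_free_worker(struct work_struct *w)
    {
            struct arena_free_span *s, *tmp;

            llist_for_each_entry_safe(s, tmp, llist_del_all(&free_spans), node) {
                    /* clear PTEs, flush TLB, zap user mappings, free pages */
                    kfree(s);
            }
    }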
====================
Puranjay Mohan [Mon, 22 Dec 2025 19:50:19 +0000 (11:50 -0800)]
selftests: bpf: test non-sleepable arena allocations
As arena kfuncs can now be called from non-sleepable contexts, test this
by adding non-sleepable copies of the tests in verifier_arena; this is done
by using a socket program instead of a syscall program.
Add a new test case in verifier_arena_large to check that the
bpf_arena_alloc_pages() works for more than 1024 pages.
1024 * sizeof(struct page *) is the upper limit of kmalloc_nolock() but
bpf_arena_alloc_pages() should still succeed because it re-uses this
array in a loop.
Augment the arena_list selftest to also run in non-sleepable context by
taking rcu_read_lock.
Puranjay Mohan [Mon, 22 Dec 2025 19:50:18 +0000 (11:50 -0800)]
bpf: arena: make arena kfuncs any context safe
Make arena related kfuncs any context safe by the following changes:
bpf_arena_alloc_pages() and bpf_arena_reserve_pages():
Replace the usage of the mutex with a rqspinlock for range tree and use
kmalloc_nolock() wherever needed. Use free_pages_nolock() to free pages
from any context.
apply_range_set/clear_cb() with apply_to_page_range() has already made
populating the vm_area in bpf_arena_alloc_pages() any context safe.
bpf_arena_free_pages(): defer the main logic to a workqueue if it is
called from a non-sleepable context.
specialize_kfunc() is used to replace the sleepable arena_free_pages()
with bpf_arena_free_pages_non_sleepable() when the verifier detects the
call is from a non-sleepable context.
In the non-sleepable case, arena_free_pages() queues the address and the
page count to be freed to a lock-less list of struct arena_free_span
and raises an irq_work. The irq_work handler calls schedule_work() as
it is safe to be called from irq context. arena_free_worker() (the work
queue handler) iterates these spans and clears PTEs, flushes the TLB, zaps
pages, and calls __free_page().
Puranjay Mohan [Mon, 22 Dec 2025 19:50:17 +0000 (11:50 -0800)]
bpf: arena: use kmalloc_nolock() in place of kvcalloc()
To make arena_alloc_pages() safe to be called from any context, replace
kvcalloc() with kmalloc_nolock() so that it doesn't sleep or take any
locks. kmalloc_nolock() returns NULL for allocations larger than
KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with
4KB pages. So, round the allocation done by kmalloc_nolock() down to
1024 * 8 bytes and reuse the array in a loop.
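A minimal sketch of the chunking idea follows; it is not the actual arena.c code, the helper name is made up, and the kmalloc_nolock() signature used here is an assumption:
    /* Keep the pointer array within what kmalloc_nolock() can serve and
     * reuse it across iterations instead of allocating one huge array.
     */
    static long alloc_in_chunks(long page_cnt)
    {
            long chunk = min_t(long, page_cnt,
                               KMALLOC_MAX_CACHE_SIZE / sizeof(struct page *));
            struct page **pages;
            long done = 0;

            pages = kmalloc_nolock(chunk * sizeof(*pages), 0, NUMA_NO_NODE);
            if (!pages)
                    return -ENOMEM;
            while (done < page_cnt) {
                    long nr = min(chunk, page_cnt - done);

                    /* allocate and map 'nr' pages, reusing the same array */
                    done += nr;
            }
            /* the pages array is freed with the matching nolock-safe free */
            return done;
    }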
Puranjay Mohan [Mon, 22 Dec 2025 19:50:16 +0000 (11:50 -0800)]
bpf: arena: populate vm_area without allocating memory
vm_area_map_pages() may allocate memory while inserting pages into the bpf
arena's vm_area. In order to make the bpf_arena_alloc_pages() kfunc safe to
call from non-sleepable contexts, change bpf arena to populate pages without
allocating memory:
- at arena creation time populate all page table levels except
the last level
- when new pages need to be inserted call apply_to_page_range() again
with apply_range_set_cb() which will only set_pte_at() those pages and
will not allocate memory.
- when freeing pages call apply_to_existing_page_range() with
apply_range_clear_cb() to clear the pte for the page to be removed. This
doesn't free intermediate page table levels.
Daniel Gomez [Sat, 20 Dec 2025 03:48:09 +0000 (04:48 +0100)]
bpf: crypto: replace -EEXIST with -EBUSY
The -EEXIST error code is reserved by the module loading infrastructure
to indicate that a module is already loaded. When a module's init
function returns -EEXIST, userspace tools like kmod interpret this as
"module already loaded" and treat the operation as successful, returning
0 to the user even though the module initialization actually failed.
This follows the precedent set by commit 54416fd76770 ("netfilter:
conntrack: helper: Replace -EEXIST by -EBUSY") which fixed the same
issue in nf_conntrack_helper_register().
This affects bpf_crypto_skcipher module. While the configuration
required to build it as a module is unlikely in practice, it is
technically possible, so fix it for correctness.
====================
mm: bpf kfuncs to access memcg data
Introduce kfuncs to simplify the access to the memcg data.
These kfuncs can be used to accelerate monitoring use cases and
for implementing custom OOM policies once BPF OOM is landed.
This patchset was separated out from the BPF OOM patchset to simplify
the logistics and accelerate the landing of the part which is useful
by itself. No functional changes since BPF OOM v2.
v4:
- refactored memcg vm event and stat item idx checks (by Alexei)
v3:
- dropped redundant kfuncs flags (by Alexei)
- fixed kdocs warnings (by Alexei)
- merged memcg stats access patches into one (by Alexei)
- restored root memcg usage reporting, added a comment
- added checks for enum boundaries
- added Shakeel and JP as co-maintainers (by Shakeel)
v2:
- added mem_cgroup_disabled() checks (by Shakeel B.)
- added special handling of the root memcg in bpf_mem_cgroup_usage()
(by Shakeel B.)
- minor fixes in the kselftest (by Shakeel B.)
- added a MAINTAINERS entry (by Shakeel B.)
JP Kobryn [Tue, 23 Dec 2025 04:41:55 +0000 (20:41 -0800)]
bpf: selftests: selftests for memcg stat kfuncs
Add test coverage for the kfuncs that fetch memcg stats. Using some common
stats, test scenarios ensuring that the given stat increases by some
arbitrary amount. The stats selected cover the three categories represented
by the enums: node_stat_item, memcg_stat_item, vm_event_item.
Since only a subset of all stats are queried, use a static struct made up
of fields for each stat. Write to the struct with the fetched values when
the bpf program is invoked and read the fields in the user mode program for
verification.
These functions are useful for implementing BPF OOM policies, but
can also be used to accelerate access to the memcg data. Reading
it through cgroupfs is much more expensive, roughly 5x, mostly
because of the need to convert the data to text and back.
JP Kobryn:
An experiment was setup to compare the performance of a program that
uses the traditional method of reading memory.stat vs a program using
the new kfuncs. The control program opens up the root memory.stat file
and for 1M iterations reads, converts the string values to numeric data,
then seeks back to the beginning. The experimental program sets up the
requisite libbpf objects and for 1M iterations invokes a bpf program
which uses the kfuncs to fetch all available stats for node_stat_item,
memcg_stat_item, and vm_event_item types.
The results showed a significant perf benefit on the experimental side,
outperforming the control side by a margin of 93%. In kernel mode,
elapsed time was reduced by 80%, while in user mode, over 99% of time
was saved.
control: elapsed time
real 0m38.318s
user 0m25.131s
sys 0m13.070s
experiment: elapsed time
real 0m2.789s
user 0m0.187s
sys 0m2.512s
Roman Gushchin [Tue, 23 Dec 2025 04:41:53 +0000 (20:41 -0800)]
mm: introduce bpf_get_root_mem_cgroup() BPF kfunc
Introduce a BPF kfunc to get a trusted pointer to the root memory
cgroup. It's very handy to traverse the full memcg tree, e.g.
for handling a system-wide OOM.
It's possible to obtain this pointer by traversing the memcg tree
up from any known memcg, but it's sub-optimal and makes BPF programs
more complex and less efficient.
bpf_get_root_mem_cgroup() has KF_ACQUIRE | KF_RET_NULL semantics;
however, in reality it's not necessary to bump the corresponding
reference counter - the root memory cgroup is immortal and reference
counting is skipped, see css_get(). Once set, root_mem_cgroup is always a
valid memcg pointer. It's safe to call bpf_put_mem_cgroup() for the pointer
obtained with bpf_get_root_mem_cgroup(); it's effectively a no-op.
Roman Gushchin [Tue, 23 Dec 2025 04:41:52 +0000 (20:41 -0800)]
mm: introduce BPF kfuncs to deal with memcg pointers
To effectively operate with memory cgroups in BPF there is a need
to convert css pointers to memcg pointers. A simple container_of()
cast, which is used in kernel code, can't be used in BPF because
from the verifier's point of view that's an out-of-bounds memory access.
Introduce helper get/put kfuncs which can be used to get
a refcounted memcg pointer from the css pointer:
- bpf_get_mem_cgroup,
- bpf_put_mem_cgroup.
bpf_get_mem_cgroup() can take both a memcg's css and the corresponding
cgroup's "self" css. This allows it to be used with the existing cgroup
iterator, which iterates over the cgroup tree, not the memcg tree.
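A hedged sketch of how a BPF program could pair the two kfuncs with a cgroup iterator is shown below; the __ksym prototypes, program section, and field accesses are written from the description above and should be treated as assumptions:
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    /* Assumed kfunc prototypes, declared as kernel symbols in the program. */
    extern struct mem_cgroup *
    bpf_get_mem_cgroup(struct cgroup_subsys_state *css) __ksym;
    extern void bpf_put_mem_cgroup(struct mem_cgroup *memcg) __ksym;

    SEC("iter/cgroup")
    int dump_memcg(struct bpf_iter__cgroup *ctx)
    {
            struct cgroup *cgrp = ctx->cgroup;
            struct mem_cgroup *memcg;

            if (!cgrp)
                    return 0;
            /* the cgroup's "self" css is accepted, per the description above */
            memcg = bpf_get_mem_cgroup(&cgrp->self);
            if (!memcg)
                    return 0;
            /* ... read memcg data here ... */
            bpf_put_mem_cgroup(memcg);
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";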
Puranjay Mohan [Fri, 19 Dec 2025 19:13:08 +0000 (11:13 -0800)]
bpf: arm64: Fix sparse warnings
ctx->image is declared as __le32 because arm64 instructions are LE
regardless of the CPU's runtime endianness. emit_u32_data() emits raw data
and not instructions, so cast the value to __le32 to fix the sparse
warning.
Cast the function pointer to void * before doing arithmetic.
Matt Bobrowski [Tue, 16 Dec 2025 13:30:00 +0000 (13:30 +0000)]
selftests/bpf: add test case for BPF LSM hook bpf_lsm_mmap_file
Add a trivial test case asserting that the BPF verifier enforces
PTR_MAYBE_NULL semantics on the struct file pointer argument of BPF
LSM hook bpf_lsm_mmap_file().
Dereferencing the struct file pointer passed into bpf_lsm_mmap_file()
without explicitly performing a NULL check first should not be
permitted by the BPF verifier as it can lead to NULL pointer
dereferences and a kernel crash.
Matt Bobrowski [Tue, 16 Dec 2025 13:29:59 +0000 (13:29 +0000)]
bpf: annotate file argument as __nullable in bpf_lsm_mmap_file
As reported in [0], anonymous memory mappings are not backed by a
struct file instance. Consequently, the struct file pointer passed to
the security_mmap_file() LSM hook is NULL in such cases.
The BPF verifier is currently unaware of this, allowing BPF LSM
programs to dereference this struct file pointer without needing to
perform an explicit NULL check. This leads to potential NULL pointer
dereference and a kernel crash.
Add a strong override for bpf_lsm_mmap_file() which annotates the
struct file pointer parameter with the __nullable suffix. This
explicitly informs the BPF verifier that this pointer (PTR_MAYBE_NULL)
can be NULL, forcing BPF LSM programs to perform a check on it before
dereferencing it.
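On the program side, the practical effect is that BPF LSM programs attached to this hook must now check the pointer before use; a minimal sketch (using the usual BPF_PROG macro from bpf_tracing.h; the program name is made up):
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    SEC("lsm/mmap_file")
    int BPF_PROG(check_mmap_file, struct file *file, unsigned long reqprot,
                 unsigned long prot, unsigned long flags)
    {
            if (!file)      /* anonymous mapping: no backing struct file */
                    return 0;
            /* file is known to be non-NULL here and may be dereferenced */
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";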
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/5e460d3c.4c3e9.19adde547d8.Coremail.kaiyanm@hust.edu.cn/
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20251216133000.3690723-1-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Uros Bizjak [Mon, 8 Dec 2025 16:33:34 +0000 (17:33 +0100)]
x86/bpf: Avoid emitting LOCK prefix for XCHG atomic ops
The x86 XCHG instruction is implicitly locked when one of the
operands is a memory location, making an explicit LOCK prefix
unnecessary.
Stop emitting the LOCK prefix for BPF_XCHG in the JIT atomic
read-modify-write helpers. This avoids redundant instruction
prefixes while preserving correct atomic semantics.
====================
bpf: Optimize recursion detection on arm64
V2: https://lore.kernel.org/all/20251217233608.2374187-1-puranjay@kernel.org/
Changes in v2->v3:
- Added acked by Yonghong
- Patch 2:
- Change alignment of active from 8 to 4
- Use le32_to_cpu in place of get_unaligned_le32()
V1: https://lore.kernel.org/all/20251217162830.2597286-1-puranjay@kernel.org/
Changes in V1->V2:
- Patch 2:
- Put preempt_enable()/disable() around RMW accesses to mitigate
race conditions, because with CONFIG_PREEMPT_RCU and sleepable
bpf programs, preemption can cause no bpf prog to execute in
case of recursion.
BPF programs detect recursion using a per-CPU 'active' flag in struct
bpf_prog. The trampoline currently sets/clears this flag with atomic
operations.
On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
operations are relatively slow. Unlike x86_64, where per-CPU updates
can avoid cross-core atomicity, arm64 LSE atomics are always atomic
across all cores, which is unnecessary overhead for strictly per-CPU
state.
This patch removes atomics from the recursion detection path on arm64.
It was discovered in [1] that per-CPU atomics that don't return a value
were extremely slow on some arm64 platforms; Catalin added a fix in
commit 535fdfc5a228 ("arm64: Use load LSE atomics for the non-return
per-CPU atomic operations") to solve this issue, but it seems to have
caused a regression on the fentry benchmark.
Using the fentry benchmark from the bpf selftests shows the following:
./tools/testing/selftests/bpf/bench trig-fentry
+---------------------------------------------+------------------------+
| Configuration | Total Operations (M/s) |
+---------------------------------------------+------------------------+
| bpf-next/master with Catalin’s fix reverted | 51.770 |
|---------------------------------------------|------------------------|
| bpf-next/master | 43.271 |
| bpf-next/master with this change | 43.271 |
+---------------------------------------------+------------------------+
All benchmarks were run on a KVM based vm with Neoverse-V2 and 8 cpus.
This patch yields a 25% improvement in this benchmark compared to
bpf-next. Notably, reverting Catalin's fix also results in a performance
gain for this benchmark, which is interesting but expected.
For completeness, this benchmark was also run with the change enabled on
x86-64, which resulted in a 30% regression in the fentry benchmark. So,
it is only enabled on arm64.
P.S. - Here is more data with other program types:
Puranjay Mohan [Fri, 19 Dec 2025 18:44:18 +0000 (10:44 -0800)]
bpf: arm64: Optimize recursion detection by not using atomics
BPF programs detect recursion using a per-CPU 'active' flag in struct
bpf_prog. The trampoline currently sets/clears this flag with atomic
operations.
On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
operations are relatively slow. Unlike x86_64, where per-CPU updates
can avoid cross-core atomicity, arm64 LSE atomics are always atomic
across all cores, which is unnecessary overhead for strictly per-CPU
state.
This patch removes atomics from the recursion detection path on arm64 by
changing 'active' to a per-CPU array of four u8 counters, one per
context: {NMI, hard-irq, soft-irq, normal}. The running context uses a
non-atomic increment/decrement on its element. After increment,
recursion is detected by reading the array as a u32 and verifying that
only the expected element changed; any change in another element
indicates inter-context recursion, and a value > 1 in the same element
indicates same-context recursion.
For example, starting from {0,0,0,0}, a normal-context trigger changes
the array to {0,0,0,1}. If an NMI arrives on the same CPU and triggers
the program, the array becomes {1,0,0,1}. When the NMI context checks
the u32 against the expected mask for normal (0x00000001), it observes
0x01000001 and correctly reports recursion. Same-context recursion is
detected analogously.
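A hedged sketch of the scheme described above; the field names, the context-to-index mapping, and the endianness handling are illustrative only, not the actual kernel code:
    struct bpf_active {
            u8 cnt[4];      /* one counter per context, e.g. [3] = normal */
    };

    /* Non-atomic: only the current context on this CPU touches its slot. */
    static bool bpf_prog_enter_recursion_check(struct bpf_active *a, int ctx_idx)
    {
            u32 expected = 1u << (ctx_idx * 8);    /* only our byte == 1 */
            u32 snapshot;

            a->cnt[ctx_idx]++;
            memcpy(&snapshot, a->cnt, sizeof(snapshot));
            if (snapshot != expected) {
                    /* another context's byte is non-zero, or our byte > 1 */
                    a->cnt[ctx_idx]--;
                    return false;   /* recursion detected */
            }
            return true;
    }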
Puranjay Mohan [Fri, 19 Dec 2025 18:44:17 +0000 (10:44 -0800)]
bpf: move recursion detection logic to helpers
BPF programs detect recursion by doing atomic inc/dec on a per-cpu
active counter from the trampoline. Create two helpers for operations on
this active counter; this makes it easy to change the recursion
detection logic in the future.
====================
resolve_btfids: Support for BTF modifications
This series changes resolve_btfids and kernel build scripts to enable
BTF transformations in resolve_btfids. The main motivation for enhancing
resolve_btfids is to reduce the dependency of the kernel build on pahole
capabilities [1] and enable BTF features and optimizations [2][3]
particular to the kernel.
Patches #1-#4 in the series are non-functional changes in
resolve_btfids.
Patch #5 makes kernel build notice pahole version changes between
builds.
Patch #6 changes minimum version of pahole required for
CONFIG_DEBUG_INFO_BTF to v1.22
Patch #7 makes a small prep change in selftests/bpf build.
The last patch (#8) makes significant changes in resolve_btfids and
introduces scripts/gen-btf.sh. See implementation details in the patch
description.
Successful BPF CI run: https://github.com/kernel-patches/bpf/actions/runs/20378061470
v2->v3:
- add patch #4 bumping minimum pahole version (Andrii, Alan)
- add patch #5 pre-fixing resolve_btfids test (Donglin)
- add GEN_BTF var and assemble RESOLVE_BTFIDS_FLAGS in Makefile.btf (Alan)
- implement --distill_base flag in resolve_btfids, set it depending
on KBUILD_EXTMOD in Makefile.btf (Eduard)
- various implementation nits, see the v2 thread for details (Andrii, Eduard)
v1->v2:
- gen-btf.sh and other shell script fixes (Donglin)
- update selftests build (Donglin)
- generate .BTF.base only when KBUILD_EXTMOD is set (Alan)
- proper endianness handling for cross-compilation
- change elf_begin mode from ELF_C_RDWR_MMAP to ELF_C_READ_MMAP_PRIVATE
- remove compressed_section_fix()
- nit NULL check in patch #3 (suggested by AI)
Ihor Solodrai [Fri, 19 Dec 2025 18:18:25 +0000 (10:18 -0800)]
resolve_btfids: Change in-place update with raw binary output
Currently resolve_btfids updates .BTF_ids section of an ELF file
in-place, based on the contents of provided BTF, usually within the
same input file, and optionally a BTF base.
Change resolve_btfids behavior to enable BTF transformations as part
of its main operation. To achieve this, in-place ELF write in
resolve_btfids is replaced with generation of the following binaries:
* ${1}.BTF with .BTF section data
* ${1}.BTF_ids with .BTF_ids section data if it existed in ${1}
* ${1}.BTF.base with .BTF.base section data for out-of-tree modules
The execution of resolve_btfids and consumption of its output is
orchestrated by scripts/gen-btf.sh introduced in this patch.
The motivation for emitting binary data is that it allows simplifying
resolve_btfids implementation by delegating ELF update to the $OBJCOPY
tool [1], which is already widely used across the codebase.
There are two distinct paths for BTF generation and resolve_btfids
application in the kernel build: for vmlinux and for kernel modules.
For the vmlinux binary a .BTF section is added in a roundabout way to
ensure correct linking. The patch doesn't change this approach, only
the implementation is a little different.
Before this patch it worked as follows:
* pahole consumed .tmp_vmlinux1 [2] and added .BTF section with
llvm-objcopy [3] to it
* then everything except the .BTF section was stripped from .tmp_vmlinux1
into a .tmp_vmlinux1.bpf.o object [2], later linked into vmlinux
* resolve_btfids was executed later on vmlinux.unstripped [4],
updating it in-place
After this patch gen-btf.sh implements the following:
* pahole consumes .tmp_vmlinux1 and produces a *detached* file with
raw BTF data
* resolve_btfids consumes .tmp_vmlinux1 and detached BTF to produce
(potentially modified) .BTF, and .BTF_ids sections data
* a .tmp_vmlinux1.bpf.o object is then produced with objcopy copying
BTF output of resolve_btfids
* .BTF_ids data gets embedded into vmlinux.unstripped in
link-vmlinux.sh by objcopy --update-section
For kernel modules, creating a special .bpf.o file is not necessary,
and so embedding of sections data produced by resolve_btfids is
straightforward with objcopy.
With this patch an ELF file becomes effectively read-only within
resolve_btfids, which allows deleting elf_update() call and satellite
code (like compressed_section_fix [5]).
Endianness handling of .BTF_ids data is also changed. Previously the
"flags" part of the section was bswapped in sets_patch() [6], and then
Elf_Type was modified before elf_update() to signal to libelf that
bswap may be necessary. With this patch we explicitly bswap entire
data buffer on load and on dump.
Ihor Solodrai [Fri, 19 Dec 2025 18:18:24 +0000 (10:18 -0800)]
selftests/bpf: Run resolve_btfids only for relevant .test.o objects
A selftest targeting resolve_btfids functionality relies on a resolved
.BTF_ids section to be available in the TRUNNER_BINARY. The underlying
BTF data is taken from a special BPF program (btf_data.c), and so
resolve_btfids is executed as a part of a TRUNNER_BINARY build recipe
on the final binary.
Subsequent patches in this series allow resolve_btfids to modify BTF
before resolving the symbols, which means that the test needs access
to that modified BTF [1]. Currently the test simply reads in
btf_data.bpf.o on the assumption that BTF hasn't changed.
Implement resolve_btfids call only for particular test objects (just
resolve_btfids.test.o for now). The test objects are linked into the
TRUNNER_BINARY, and so .BTF_ids section will be available there.
This will make it trivial for the resolve_btfids test to access BTF
modified by resolve_btfids.
Ihor Solodrai [Fri, 19 Dec 2025 18:18:23 +0000 (10:18 -0800)]
lib/Kconfig.debug: Set the minimum required pahole version to v1.22
Subsequent patches in the series change vmlinux linking scripts to
unconditionally pass --btf_encode_detached to pahole, which was
introduced in v1.22 [1][2].
This change allows removing the PAHOLE_HAS_SPLIT_BTF Kconfig option and
other checks for older pahole versions.
Ihor Solodrai [Fri, 19 Dec 2025 18:13:18 +0000 (10:13 -0800)]
kbuild: Sync kconfig when PAHOLE_VERSION changes
This patch implements kconfig re-sync when the pahole version changes
between builds, similar to how it happens for compiler version change
via CC_VERSION_TEXT.
Define PAHOLE_VERSION in the top-level Makefile and export it for
config builds. Set CONFIG_PAHOLE_VERSION default to the exported
variable.
Kconfig records the PAHOLE_VERSION value in
include/config/auto.conf.cmd [1].
The Makefile includes auto.conf.cmd, so if PAHOLE_VERSION changes
between builds, make detects a dependency change and triggers
syncconfig to update the kconfig [2].
For external module builds, add a warning message in the prepare
target, similar to the existing compiler version mismatch warning.
Note that if pahole is not installed or available, PAHOLE_VERSION is
set to 0 by pahole-version.sh, so the (un)installation of pahole is
treated as a version change.
Ihor Solodrai [Fri, 19 Dec 2025 18:13:15 +0000 (10:13 -0800)]
resolve_btfids: Factor out load_btf()
Increase the lifetime of parsed BTF in resolve_btfids by factoring
load_btf() routine out of symbols_resolve() and storing the base_btf
and btf pointers in the struct object.
- Fix verifier assumptions of bpf_d_path's output buffer (Shuran Liu)
- Fix warnings in libbpf when built with -Wdiscarded-qualifiers under
C23 (Mikhail Gavrilov)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: add regression test for bpf_d_path()
bpf: Fix verifier assumptions of bpf_d_path's output buffer
selftests/bpf: Add test for truncated dmabuf_iter reads
bpf: Fix truncated dmabuf iterator reads
x86/unwind/orc: Support reliable unwinding through BPF stack frames
bpf: Add bpf_has_frame_pointer()
bpf, arm64: Do not audit capability check in do_jit()
libbpf: Fix -Wdiscarded-qualifiers under C23
bpftool: Fix build warnings due to MS extensions
net: smc: SMC_HS_CTRL_BPF should depend on BPF_JIT
selftests/bpf: Add -fms-extensions to bpf build flags
Linus Torvalds [Wed, 17 Dec 2025 03:48:30 +0000 (15:48 +1200)]
Merge tag 's390-6.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:
- clear 'Search boot program' flag when 'bootprog' sysfs file is
written to override a value set from Hardware Management Console
- fix cyclic dead-lock in zpci_zdev_put() and zpci_scan_devices()
functions when triggering PCI device recovery using sysfs
- annotate the expected lock context imbalance in zpci_release_device()
function to fix a sparse complaint
- fix the logic to fallback to the return address register value in the
topmost frame when stack tracing uses a back chain
* tag 's390-6.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/stacktrace: Do not fallback to RA register
s390/pci: Annotate lock context imbalance in zpci_release_device()
s390/pci: Fix cyclic dead-lock in zpci_zdev_put() and zpci_scan_devices()
s390/ipl: Clear SBP flag when bootprog is set
====================
libbpf: move arena variables out of the zero page
Modify libbpf to place arena globals at the end of the arena mapping
instead of the very beginning. This allows programs to leave the
"zero page" of the arena unmapped, so that NULL arena pointer
dereferences trigger a page fault and associated backtrace in BPF streams.
In contrast, the current policy of placing global data in the zero page
means that NULL dereferences silently corrupt global data, e.g., arena
qspinlock state. This makes arena bugs more difficult to debug.
The patchset adds code to libbpf to move global arena data to the end of
the arena. At load time, libbpf adjusts each symbol's location within
the arena to point to the right location in the arena. The patchset
also adjusts the arena skeleton pointer to point to the arena globals,
now that they are not in the beginning of the arena region.
- Added Acks by Eduard
- Changed jumptable sym_off to unsigned int for consistency (AI)
- Adjusted selftests to ensure arena globals are actually mapped in (Eduard)
- (Patch 2) Adjusted selftests that were failing because they were expecting the
now removed "direct map offset" error message
- Remove unnecessary kernel bounds check in resolve_pseudo_ldimm64
(Andrii)
- Added patch to turn sym_off unsigned to prevent overflow (AI)
- Remove obsolete references to offsets from test patch description
(Andrii)
- Use size_t for arena_data_off (Andrii)
- Remove extra mutable variable from offset calculations (Andrii)
- Moved globals to the end of the mapping: (Andrii)
- Removed extra parameter for offset and parameter picking logic
- Removed padding in the skeleton
- Removed additional libbpf call
- Added Reviewed-by from Eduard on patch 1
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
====================
Emil Tsalapatis [Tue, 16 Dec 2025 17:33:25 +0000 (12:33 -0500)]
selftests/bpf: Add tests for the arena offset of globals
Add tests for the new libbpf globals arena offset logic. The
tests cover the cases of the globals being as large as the arena
itself and of them being smaller than the arena. In the latter case,
the data is placed at the end of the arena, and the beginning
of the arena is free.
Emil Tsalapatis [Tue, 16 Dec 2025 17:33:24 +0000 (12:33 -0500)]
libbpf: Move arena globals to the end of the arena
Arena globals are currently placed at the beginning of the arena
by libbpf. This is convenient, but prevents users from reserving
guard pages in the beginning of the arena to identify NULL pointer
dereferences. Adjust the load logic to place the globals at the
end of the arena instead.
Also modify bpftool to set the arena pointer in the program's BPF
skeleton to point to the globals. Users now call bpf_map__initial_value()
to find the beginning of the arena mapping and use the arena pointer
in the skeleton to determine which part of the mapping holds the
arena globals and which part is free.
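In user space, the resulting split between the start of the arena mapping and the start of the globals could then be queried roughly as in the fragment below; the map and skeleton field names ("arena") are assumptions that depend on how the map is named in the program:
    size_t arena_sz;
    void *arena_base = bpf_map__initial_value(skel->maps.arena, &arena_sz);
    void *globals = skel->arena;    /* set by the generated skeleton */

    /* [arena_base, globals) is free for the program's own use;
     * [globals, arena_base + arena_sz) holds the arena global variables.
     */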
Emil Tsalapatis [Tue, 16 Dec 2025 17:33:23 +0000 (12:33 -0500)]
libbpf: Turn relo_core->sym_off unsigned
The symbols' relocation offsets in BPF are stored in an int field,
but cannot actually be negative. When, in the next patch, libbpf relocates
globals to the end of the arena, it is also possible to have valid
offsets > 2GiB that are used to calculate the final relo offsets.
Avoid accidentally interpreting large offsets as negative by turning
the sym_off field unsigned.
Emil Tsalapatis [Tue, 16 Dec 2025 17:33:22 +0000 (12:33 -0500)]
bpf/verifier: Do not limit maximum direct offset into arena map
The verifier currently limits direct offsets into a map to 512MiB
to avoid overflow during pointer arithmetic. However, this prevents
arena maps from using direct addressing instructions to access data
at the end of > 512MiB arena maps. This is necessary when moving
arena globals to the end of the arena instead of the front.
Refactor the verifier code to remove the offset calculation during
direct value access calculations. This is possible because the only
two map types that implement .map_direct_value_addr() are arrays and
arenas, and they both do their own internal checks to ensure the
offset is within bounds.
Adjust selftests that expect the old error. These tests still fail
because the verifier identifies the access as out of bounds for the
map, so change them to expect an "invalid access to map value pointer"
error instead.
Emil Tsalapatis [Tue, 16 Dec 2025 17:33:21 +0000 (12:33 -0500)]
selftests/bpf: Explicitly account for globals in verifier_arena_large
The big_alloc1 test in verifier_arena_large assumes that the arena base
and the first page allocated by bpf_arena_alloc_pages are identical.
This is not the case, because the first page in the arena is populated
by global arena data. The test still passes because the code makes the
tacit assumption that the first page is at offset PAGE_SIZE instead of
0.
Make this distinction explicit in the code, and adjust the page offsets
requested during the test to count from the beginning of the arena
instead of using the address of the first allocated page.
Linus Torvalds [Tue, 16 Dec 2025 07:44:36 +0000 (19:44 +1200)]
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull shmem rename fixes from Al Viro:
"A couple of shmem rename fixes - recent regression from tree-in-dcache
series and older breakage from stable directory offsets stuff"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
shmem: fix recovery on rename failures
shmem_whiteout(): fix regression from tree-in-dcache series
Linus Torvalds [Tue, 16 Dec 2025 07:34:09 +0000 (19:34 +1200)]
Merge tag 'v6.19-rc1-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- Fix set xattr name validation
- Fix session refcount leak
- Minor cleanup
- smbdirect (RDMA) fixes: improve receive completion handling and connection setup
* tag 'v6.19-rc1-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix buffer validation by including null terminator size in EA length
ksmbd: Fix refcount leak when invalid session is found on session lookup
ksmbd: remove redundant DACL check in smb_check_perm_dacl
ksmbd: convert comma to semicolon
smb: server: defer the initial recv completion logic to smb_direct_negotiate_recv_work()
smb: server: initialize recv_io->cqe.done = recv_done just once
smb: smbdirect: introduce smbdirect_socket.connect.{lock,work}
Linus Torvalds [Tue, 16 Dec 2025 07:28:20 +0000 (19:28 +1200)]
Merge tag 'for-6.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix missing btrfs_path release after printing a relocation error
message
- fix extent changeset leak on mmap write after failure to reserve
metadata
- fix fs devices list structure freeing; it could potentially be leaked
under some circumstances
- tree log fixes:
- fix incremental directory logging where inodes for new dentries
were incorrectly skipped
- don't log conflicting inode if it's a directory moved in the
current transaction
- regression fixes:
- fix incorrect btrfs_path freeing when it's auto-cleaned
- revert commit simplifying preallocation of temporary structures
in qgroup functions, some cases were not handled properly
* tag 'for-6.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix changeset leak on mmap write after failure to reserve metadata
btrfs: fix memory leak of fs_devices in degraded seed device path
btrfs: fix a potential path leak in print_data_reloc_error()
Revert "btrfs: add ASSERTs on prealloc in qgroup functions"
btrfs: do not skip logging new dentries when logging a new name
btrfs: don't log conflicting inode if it's a dir moved in the current transaction
btrfs: tests: fix double btrfs_path free in remove_extent_ref()
Linus Torvalds [Tue, 16 Dec 2025 07:24:35 +0000 (19:24 +1200)]
Merge tag 'sched_ext-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix memory leak when destroying helper kthread workers during
scheduler disable
- Fix bypass depth accounting on scx_enable() failure which could leave
the system permanently in bypass mode
- Fix missing preemption handling when moving tasks to local DSQs via
scx_bpf_dsq_move()
- Misc fixes including NULL check for put_prev_task(), flushing stdout
in selftests, and removing unused code
* tag 'sched_ext-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Remove unused code in the do_pick_task_scx()
selftests/sched_ext: flush stdout before test to avoid log spam
sched_ext: Fix missing post-enqueue handling in move_local_task_to_local_dsq()
sched_ext: Factor out local_dsq_post_enq() from dispatch_enqueue()
sched_ext: Fix bypass depth leak on scx_enable() failure
sched/ext: Avoid null ptr traversal when ->put_prev_task() is called with NULL next
sched_ext: Fix the memleak for sch->helper objects
Linus Torvalds [Tue, 16 Dec 2025 07:21:17 +0000 (19:21 +1200)]
Merge tag 'cgroup-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
- Fix a race condition in css_rstat_updated() where CMPXCHG without
LOCK prefix could cause lnode corruption when the flusher runs
concurrently on another CPU. The issue was introduced in 6.17 and
causes memcg stats to become corrupted in production.
* tag 'cgroup-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated
Al Viro [Sat, 13 Dec 2025 22:50:23 +0000 (17:50 -0500)]
shmem: fix recovery on rename failures
maple_tree insertions can fail if we are seriously short on memory;
simple_offset_rename() does not recover well if it runs into that.
The same goes for simple_offset_rename_exchange().
Moreover, shmem_whiteout() expects that if it succeeds, the caller will
progress to d_move(), i.e. that shmem_rename2() won't fail past the
successful call of shmem_whiteout().
Not hard to fix, fortunately - mtree_store() can't fail if the index we
are trying to store into is already present in the tree as a singleton.
For simple_offset_rename_exchange() that's enough - we just need to be
careful about the order of operations.
For simple_offset_rename() the solution is to preinsert the target into the
tree for new_dir; the rest can be done without any potentially failing
operations.
That preinsertion has to be done in shmem_rename2() rather than in
simple_offset_rename() itself - otherwise we'd need to deal with the
possibility of failure after successful shmem_whiteout().
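A short sketch of the property the fix relies on (illustrative only, not the
shmem code): once an index already holds a singleton entry, overwriting it
with mtree_store() needs no allocation and therefore cannot fail.

	int example(struct maple_tree *mt, unsigned long index,
		    void *placeholder, void *final)
	{
		/* Preinsertion: this store may still fail under memory pressure... */
		int err = mtree_store(mt, index, placeholder, GFP_KERNEL);

		if (err)
			return err;

		/*
		 * ...but replacing the existing singleton cannot, so the rename
		 * can be committed past the point of no return without having
		 * to undo a successful whiteout.
		 */
		mtree_store(mt, index, final, GFP_KERNEL);
		return 0;
	}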
Fixes: a2e459555c5f ("shmem: stable directory offsets")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Namjae Jeon [Sun, 14 Dec 2025 06:06:34 +0000 (15:06 +0900)]
ksmbd: fix buffer validation by including null terminator size in EA length
The smb2_set_ea function, which handles Extended Attributes (EA),
was performing buffer validation checks that incorrectly omitted the size
of the null terminating character (+1 byte) for EA Name.
This patch fixes the issue by explicitly adding '+ 1' to EaNameLength where
the null terminator is expected to be present in the buffer, ensuring
the validation accurately reflects the total required buffer size.
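As a hedged sketch of the corrected bound (struct and field names follow the
description above and may not match the ksmbd code exactly):

	/* Bytes this EA entry needs inside the request buffer. */
	size_t needed = offsetof(struct smb2_ea_info, name) +
			eabuf->EaNameLength + 1 /* NUL terminator */ +
			le16_to_cpu(eabuf->EaValueLength);

	if (needed > buf_len)	/* reject entries that would overrun the buffer */
		return -EINVAL;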
Cc: stable@vger.kernel.org
Reported-by: Roger <roger.andersen@protonmail.com>
Reported-by: Stanislas Polu <spolu@dust.tt>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Namjae Jeon [Sun, 14 Dec 2025 06:05:56 +0000 (15:05 +0900)]
ksmbd: Fix refcount leak when invalid session is found on session lookup
When a session is found but its state is not SMB2_SESSION_VALID, it is
treated as if no valid session was found, but the reference count acquired
by the session lookup is never decremented, which results in a reference
count leak. This patch fixes the issue by explicitly calling
ksmbd_user_session_put() to release the reference to the session.
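A minimal sketch of the fixed lookup path (simplified; the surrounding error
handling is omitted and variable names are illustrative):

	sess = ksmbd_session_lookup_all(conn, sess_id);
	if (sess && sess->state != SMB2_SESSION_VALID) {
		ksmbd_user_session_put(sess);	/* drop the lookup reference */
		sess = NULL;			/* treat as "no valid session" */
	}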
Cc: stable@vger.kernel.org
Reported-by: Alexandre <roger.andersen@protonmail.com>
Reported-by: Stanislas Polu <spolu@dust.tt>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Chen Ni [Tue, 18 Nov 2025 01:32:29 +0000 (09:32 +0800)]
ksmbd: convert comma to semicolon
Replace comma between expressions with semicolons.
Using a ',' in place of a ';' can have unintended side effects.
Although that is not the case here, it seems best to use ';'
unless ',' is intended.
Found by inspection.
No functional change intended.
Compile tested only.
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
smb: server: defer the initial recv completion logic to smb_direct_negotiate_recv_work()
The previous change to relax WARN_ON_ONCE(SMBDIRECT_SOCKET_*) checks in
recv_done() and smb_direct_cm_handler() seems to work around the
problem that the order of initial recv completion and
RDMA_CM_EVENT_ESTABLISHED is random, but it's still
a bit ugly.
This implements a better solution by deferring the recv completion
processing to smb_direct_negotiate_recv_work(), which is queued only
once both events have arrived.
To avoid more invasive changes to the main recv_done callback, I
introduced smb_direct_negotiate_recv_done(), which is only used for
the first PDU; this will allow further cleanup and simplification of
recv_done() in a future patch.
smb_direct_negotiate_recv_work() is also kept minimal: it only does
basic error checking and the transition from
SMBDIRECT_SOCKET_NEGOTIATE_NEEDED to
SMBDIRECT_SOCKET_NEGOTIATE_RUNNING, which allows smb_direct_prepare()
to continue as before.
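The queuing condition can be sketched roughly as follows (the counter and
field names are illustrative assumptions, not the actual smbdirect code):

	/*
	 * Called from both the CM handler (RDMA_CM_EVENT_ESTABLISHED) and the
	 * first recv completion; whichever event arrives second queues the
	 * work, so the ordering of the two events no longer matters.
	 */
	static void smb_direct_negotiate_event(struct smbdirect_socket *sc)
	{
		if (atomic_inc_return(&sc->negotiate_events) == 2)
			queue_work(sc->workqueue, &sc->negotiate_recv_work);
	}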
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
smb: server: initialize recv_io->cqe.done = recv_done just once
smbdirect_recv_io structures are pre-allocated so we can set the
callback function just once.
This will make it easy to move smb_direct_post_recv to common code
soon.
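Roughly (illustrative; the pool and list names are assumptions), the callback
is assigned when the receive buffers are set up rather than before every
repost:

	for (i = 0; i < recv_max; i++) {
		struct smbdirect_recv_io *recv_io = mempool_alloc(pool, GFP_KERNEL);

		if (!recv_io)
			goto fail;
		recv_io->cqe.done = recv_done;	/* set once; reused on every ib_post_recv() */
		list_add_tail(&recv_io->list, &free_list);
	}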
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
smb: smbdirect: introduce smbdirect_socket.connect.{lock,work}
This will first be used by the server in order to defer the
processing of the initial recv of the negotiation request.
But in the future it will also be used by the client to implement an
async connect.
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Jens Remus [Thu, 11 Dec 2025 11:24:50 +0000 (12:24 +0100)]
s390/stacktrace: Do not fallback to RA register
The logic to fall back to the return address (RA) register value in
the topmost frame when stack tracing using back chain is broken in
multiple ways:
When assuming the RA register 14 has not been saved yet one must assume
that a new user stack frame has not been allocated either. Therefore
the back chain would not contain the stack pointer (SP) at entry, but
the caller's SP at its entry instead.
Therefore, when falling back to the RA register 14 value it would also be
necessary to fall back to the SP register 15 value. Otherwise an invalid
combination of RA register 14 and caller's SP at its entry (from the
back chain) is used.
In the topmost frame the back chain contains either the caller's SP at
its entry (before having allocated a new stack frame in the prologue),
the SP at entry (after having allocated a new stack frame), or an
uninitialized value (during static/dynamic stack allocation). In both
cases where the back chain is valid either the caller or prologue must
have saved its respective RA to the respective frame. Therefore, if the
RA obtained from the frame pointed to by the back chain is invalid, this
does not indicate that the IP in the topmost frame is still early in the
prologue and the RA has not been saved.
Benjamin Block [Fri, 5 Dec 2025 15:47:17 +0000 (16:47 +0100)]
s390/pci: Fix cyclic dead-lock in zpci_zdev_put() and zpci_scan_devices()
When triggering PCI device recovery by writing into the SysFS attribute
`recover` of a Physical Function with existing child SR-IOV Virtual
Functions, lockdep is reporting a possible deadlock between three
threads:
In zpci_bus_scan_busses() the `zbus_list_lock` is taken for the whole
duration of the function, which also includes taking
`pci_rescan_remove_lock`, among other things. But `zbus_list_lock` only
really needs to protect modifications of the global registration list
`zbus_list`; it can be dropped while the functions within the list
iteration run, and this breaks the cycle above.
Break up zpci_bus_scan_busses() into an "iterator" zpci_bus_get_next()
that iterates over `zbus_list` element by element, and acquires and
releases `zbus_list_lock` as necessary, but never keeps holding it.
References to `zpci_bus` objects are acquired and released accordingly
as the iteration advances.
The reference counting on `zpci_bus` objects is also changed so that all
put() and get() operations are done under the protection of
`zbus_list_lock`, and if the operation results in a modification of
`zpci_bus_list`, this modification is done in the same critical section
(apart from the very first initialization). This way objects are never seen
on the list that are about to be released and/or half-initialized.
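The iterator pattern described above can be sketched as follows (simplified;
structure, field and function names are assumptions rather than the exact
s390 code):

	static struct zpci_bus *zpci_bus_get_next(struct zpci_bus *prev)
	{
		struct zpci_bus *next = NULL;

		mutex_lock(&zbus_list_lock);
		if (!prev)
			next = list_first_entry_or_null(&zbus_list, struct zpci_bus, bus_next);
		else if (!list_is_last(&prev->bus_next, &zbus_list))
			next = list_next_entry(prev, bus_next);
		if (next)
			kref_get(&next->kref);			/* pin before the lock is dropped */
		if (prev)
			kref_put(&prev->kref, zpci_bus_release);	/* ref ops stay under the list lock */
		mutex_unlock(&zbus_list_lock);

		return next;
	}

	/*
	 * Caller pattern replacing "hold zbus_list_lock for the whole scan":
	 *
	 *	for (zbus = zpci_bus_get_next(NULL); zbus; zbus = zpci_bus_get_next(zbus))
	 *		zpci_bus_scan_bus(zbus);	// may take pci_rescan_remove_lock safely
	 */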
Fixes: 14c87ba8123a ("s390/pci: separate zbus registration from scanning")
Suggested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Benjamin Block <bblock@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Sven Schnelle [Fri, 5 Dec 2025 09:58:57 +0000 (10:58 +0100)]
s390/ipl: Clear SBP flag when bootprog is set
With z16 a new flag 'search boot program' was introduced for
list-directed IPL (SCSI, NVMe, ECKD DASD). If this flag is set,
e.g. via selecting the "Automatic" value for the "Boot program
selector" control on an HMC load panel, it is copied to the reipl
structure from the initial ipl structure. When a user now sets a
boot prog via sysfs, the flag is not cleared and the bootloader
will again automatically select the boot program, ignoring user
configuration.
To avoid that, clear the SBP flag when a bootprog sysfs file is
written.
Cc: stable@vger.kernel.org
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Linus Torvalds [Sun, 14 Dec 2025 03:35:35 +0000 (15:35 +1200)]
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"The only core fix is in doc; all the others are in drivers, with the
biggest impacts in libsas being the rollback on error handling and in
ufs coming from a couple of error handling fixes, one causing a crash
if it's activated before scanning and the other fixing W-LUN
resumption"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: qcom: Fix confusing cleanup.h syntax
scsi: libsas: Add rollback handling when an error occurs
scsi: device_handler: Return error pointer in scsi_dh_attached_handler_name()
scsi: ufs: core: Fix a deadlock in the frequency scaling code
scsi: ufs: core: Fix an error handler crash
scsi: Revert "scsi: libsas: Fix exp-attached device scan after probe failure scanned in again after probe failed"
scsi: ufs: core: Fix RPMB link error by reversing Kconfig dependencies
scsi: qla4xxx: Use time conversion macros
scsi: qla2xxx: Enable/disable IRQD_NO_BALANCING during reset
scsi: ipr: Enable/disable IRQD_NO_BALANCING during reset
scsi: imm: Fix use-after-free bug caused by unfinished delayed work
scsi: target: sbp: Remove KMSG_COMPONENT macro
scsi: core: Correct documentation for scsi_device_quiesce()
scsi: mpi3mr: Prevent duplicate SAS/SATA device entries in channel 1
scsi: target: Reset t_task_cdb pointer in error case
scsi: ufs: core: Fix EH failure after W-LUN resume error
Linus Torvalds [Sun, 14 Dec 2025 03:24:10 +0000 (15:24 +1200)]
Merge tag 'ceph-for-6.19-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"We have a patch that adds an initial set of tracepoints to the MDS
client from Max, a fix that hardens osdmap parsing code from myself
(marked for stable) and a few assorted fixups"
* tag 'ceph-for-6.19-rc1' of https://github.com/ceph/ceph-client:
rbd: stop selecting CRC32, CRYPTO, and CRYPTO_AES
ceph: stop selecting CRC32, CRYPTO, and CRYPTO_AES
libceph: make decode_pool() more resilient against corrupted osdmaps
libceph: Amend checking to fix `make W=1` build breakage
ceph: Amend checking to fix `make W=1` build breakage
ceph: add trace points to the MDS client
libceph: fix log output race condition in OSD client
T.J. Mercier [Sun, 7 Dec 2025 09:10:04 +0000 (01:10 -0800)]
bpf: Fix bpf_seq_read docs for increased buffer size
Commit af65320948b8 ("bpf: Bump iter seq size to support BTF
representation of large data structures") increased the fixed buffer
size from PAGE_SIZE to PAGE_SIZE << 3, but the docs for the function
didn't get updated at the same time. Update them.
Linus Torvalds [Sat, 13 Dec 2025 18:12:46 +0000 (06:12 +1200)]
Merge tag 'smp-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug fix from Ingo Molnar:
- Fix CPU hotplug callbacks to disable interrupts on UP kernels
* tag 'smp-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu: Make atomic hotplug callbacks run with interrupts disabled on UP