4 weeks ago   bpf: drop KF_ACQUIRE flag on BPF kfunc bpf_get_root_mem_cgroup()
Matt Bobrowski [Tue, 13 Jan 2026 08:39:48 +0000 (08:39 +0000)] 
bpf: drop KF_ACQUIRE flag on BPF kfunc bpf_get_root_mem_cgroup()

With the BPF verifier now treating pointers to struct types returned
from BPF kfuncs as implicitly trusted by default, there is no need for
bpf_get_root_mem_cgroup() to be annotated with the KF_ACQUIRE flag.

bpf_get_root_mem_cgroup() does not acquire any references, but rather
simply returns a NULL pointer or a pointer to a struct mem_cgroup
object that is valid for the entire lifetime of the kernel.

This simplifies BPF programs using this kfunc by removing the
requirement to pair the call with bpf_put_mem_cgroup().
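A minimal sketch of what this looks like from the BPF side, assuming the extern
declaration below; the attach point and program body are illustrative, not taken
from the patch:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern struct mem_cgroup *bpf_get_root_mem_cgroup(void) __ksym;

char _license[] SEC("license") = "GPL";

SEC("fentry/kfree")
int probe(void *ctx)
{
        struct mem_cgroup *root = bpf_get_root_mem_cgroup();

        if (!root)
                return 0;
        /* root is valid for the lifetime of the kernel, so no
         * bpf_put_mem_cgroup(root) pairing is required anymore. */
        return 0;
}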

Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260113083949.2502978-2-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 weeks ago   bpf: return PTR_TO_BTF_ID | PTR_TRUSTED from BPF kfuncs by default
Matt Bobrowski [Tue, 13 Jan 2026 08:39:47 +0000 (08:39 +0000)] 
bpf: return PTR_TO_BTF_ID | PTR_TRUSTED from BPF kfuncs by default

Teach the BPF verifier to treat pointers to struct types returned from
BPF kfuncs as implicitly trusted (PTR_TO_BTF_ID | PTR_TRUSTED) by
default. Returning untrusted pointers to struct types from BPF kfuncs
should be considered an exception only, and certainly not the norm.

Update existing selftests to reflect the change in register type
printing (e.g. `ptr_` becoming `trusted_ptr_` in verifier error
messages).

Link: https://lore.kernel.org/bpf/aV4nbCaMfIoM0awM@google.com/
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260113083949.2502978-1-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 weeks ago   Merge branch 'improve-the-performance-of-btf-type-lookups-with-binary-search'
Andrii Nakryiko [Wed, 14 Jan 2026 00:10:40 +0000 (16:10 -0800)] 
Merge branch 'improve-the-performance-of-btf-type-lookups-with-binary-search'

Donglin Peng says:

====================
Improve the performance of BTF type lookups with binary search

From: Donglin Peng <pengdonglin@xiaomi.com>

The series addresses the performance limitations of linear search in large
BTFs by:
1. Adding BTF permutation support
2. Using resolve_btfids to sort BTF during the build phase
3. Checking BTF sorting
4. Using binary search when looking up types

Patch #1 introduces the btf__permute interface in libbpf to re-lay out BTF.
Patch #2 adds test cases to validate the functionality of btf__permute in base
and split BTF scenarios.
Patch #3 introduces a new phase in the resolve_btfids tool to sort BTF by name
in ascending order.
Patches #4-#7 implement the sorting check and binary search.
Patches #8-#10 optimize type lookup performance of some functions by skipping
anonymous types or invoking btf_find_by_name_kind.
Patch #11 refactors the code by calling str_is_empty.

Here is a simple performance test result [1] for lookups to find 87,584 named
types in vmlinux BTF:

./vmtest.sh -- ./test_progs -t btf_permute/perf -v

Results:
| Condition          | Lookup Time | Improvement  |
|--------------------|-------------|--------------|
| Unsorted (Linear)  |  36,534 ms  | Baseline     |
| Sorted (Binary)    |      15 ms  | 2437x faster |

The binary search implementation reduces lookup time from 36.5 seconds to 15
milliseconds, achieving a **2437x** speedup for large-scale type queries.

Changelog:
v12:
- Set the start_id to 1 instead of btf->start_id in the btf__find_by_name (AI)

v11:
Link: https://lore.kernel.org/bpf/20260108031645.1350069-1-dolinux.peng@gmail.com/
- PATCH #1: Modify implementation of btf__permute: id_map[0] must be 0 for base BTF (Andrii)
- PATCH #3: Refactor the code (Andrii)
- PATCH #4~8:
  - Revert to using the binary search in v7 to simplify the code (Andrii)
  - Refactor the code of btf_check_sorted (Andrii, Eduard)
  - Rename sorted_start_id to named_start_id
  - Rename btf_sorted_start_id to btf_named_start_id, and add comments (Andrii, Eduard)

v10:
Link: https://lore.kernel.org/all/20251218113051.455293-1-dolinux.peng@gmail.com/
- Improve btf__permute() documentation (Eduard)
- Fall back to linear search when locating anonymous types (Eduard)
- Remove redundant NULL name check in libbpf's linear search path (Eduard)
- Simplify btf_check_sorted() implementation (Eduard)
- Treat kernel modules as unsorted by default
- Introduce btf_is_sorted and btf_sorted_start_id for clarity (Eduard)
- Fix optimizations in btf_find_decl_tag_value() and btf_prepare_func_args()
  to support split BTF
- Remove linear search branch in determine_ptr_size()
- Rebase onto Ihor's v4 patch series [4]

v9:
Link: https://lore.kernel.org/bpf/20251208062353.1702672-1-dolinux.peng@gmail.com/
- Optimize the performance of the function determine_ptr_size by invoking
  btf__find_by_name_kind
- Optimize the performance of btf_find_decl_tag_value/btf_prepare_func_args/
  bpf_core_add_cands by skipping anonymous types
- Rebase the patch series onto Ihor's v3 patch series [3]

v8:
Link: https://lore.kernel.org/bpf/20251126085025.784288-1-dolinux.peng@gmail.com/
- Remove the type dropping feature of btf__permute (Andrii)
- Refactor the code of btf__permute (Andrii, Eduard)
- Make the self-test code cleaner (Eduard)
- Reconstruct the BTF sorting patch based on Ihor's patch series [2]
- Simplify the sorting logic and place anonymous types before named types
  (Andrii, Eduard)
- Optimize type lookup performance of two kernel functions
- Refactoring the binary search and type lookup logic achieves a 4.2%
  performance gain, reducing the average lookup time (via the perf test
  code in [1] for 60,995 named types in vmlinux BTF) from 10,217 us (v7) to
  9,783 us (v8).

v7:
Link: https://lore.kernel.org/all/20251119031531.1817099-1-dolinux.peng@gmail.com/
- btf__permute API refinement: Adjusted id_map and id_map_cnt parameter
  usage so that for base BTF, id_map[0] now contains the new id of original
  type id 1 (instead of VOID type id 0), improving logical consistency
- Selftest updates: Modified test cases to align with the API usage changes
- Refactor the code of resolve_btfids

v6:
Link: https://lore.kernel.org/all/20251117132623.3807094-1-dolinux.peng@gmail.com/
- ID Map-based reimplementation of btf__permute (Andrii)
- Build-time BTF sorting using resolve_btfids (Alexei, Eduard)
- Binary search method refactoring (Andrii)
- Enhanced selftest coverage

v5:
Link: https://lore.kernel.org/all/20251106131956.1222864-1-dolinux.peng@gmail.com/
- Refactor binary search implementation for improved efficiency
  (Thanks to Andrii and Eduard)
- Extend btf__permute interface with 'ids_sz' parameter to support
  type dropping feature (suggested by Andrii). Plan subsequent reimplementation of
  id_map version for comparative analysis with current sequence interface
- Add comprehensive test coverage for type dropping functionality
- Enhance function comment clarity and accuracy

v4:
Link: https://lore.kernel.org/all/20251104134033.344807-1-dolinux.peng@gmail.com/
- Abstracted btf_dedup_remap_types logic into a helper function (suggested by Eduard).
- Removed btf_sort.c and implemented sorting separately for libbpf and kernel (suggested by Andrii).
- Added test cases for both base BTF and split BTF scenarios (suggested by Eduard).
- Added validation for name-only sorting of types (suggested by Andrii)
- Refactored btf__permute implementation to reduce complexity (suggested by Andrii)
- Add doc comments for btf__permute (suggested by Andrii)

v3:
Link: https://lore.kernel.org/all/20251027135423.3098490-1-dolinux.peng@gmail.com/
- Remove sorting logic from libbpf and provide a generic btf__permute() interface (suggested
  by Andrii)
- Omitted the search direction patch to avoid conflicts with base BTF (suggested by Eduard).
- Include btf_sort.c directly in btf.c to reduce function call overhead

v2:
Link: https://lore.kernel.org/all/20251020093941.548058-1-dolinux.peng@gmail.com/
- Moved sorting to the build phase to reduce overhead (suggested by Alexei).
- Integrated sorting into btf_dedup_compact_and_sort_types (suggested by Eduard).
- Added sorting checks during BTF parsing.
- Consolidated common logic into btf_sort.c for sharing (suggested by Alan).

v1:
Link: https://lore.kernel.org/all/20251013131537.1927035-1-dolinux.peng@gmail.com/
[1] https://github.com/pengdonglin137/btf_sort_test
[2] https://lore.kernel.org/bpf/20251126012656.3546071-1-ihor.solodrai@linux.dev/
[3] https://lore.kernel.org/bpf/20251205223046.4155870-1-ihor.solodrai@linux.dev/
[4] https://lore.kernel.org/bpf/20251218003314.260269-1-ihor.solodrai@linux.dev/
====================

Link: https://patch.msgid.link/20260109130003.3313716-1-dolinux.peng@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
4 weeks ago   btf: Refactor the code by calling str_is_empty
Donglin Peng [Fri, 9 Jan 2026 13:00:03 +0000 (21:00 +0800)] 
btf: Refactor the code by calling str_is_empty

Call the str_is_empty() helper to clarify the code. No functional
changes are introduced.

Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-12-dolinux.peng@gmail.com
4 weeks ago   bpf: Optimize the performance of find_bpffs_btf_enums
Donglin Peng [Fri, 9 Jan 2026 13:00:01 +0000 (21:00 +0800)] 
bpf: Optimize the performance of find_bpffs_btf_enums

Currently, vmlinux BTF is unconditionally sorted during the build
phase, so btf_find_by_name_kind() takes the binary search branch.
find_bpffs_btf_enums() can therefore be optimized by calling
btf_find_by_name_kind().

Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-10-dolinux.peng@gmail.com
4 weeks ago   bpf: Skip anonymous types in type lookup for performance
Donglin Peng [Fri, 9 Jan 2026 13:00:00 +0000 (21:00 +0800)] 
bpf: Skip anonymous types in type lookup for performance

Currently, vmlinux and kernel module BTFs are unconditionally sorted
during the build phase, with named types placed at the end. The lookup
can therefore skip anonymous types and start at the first named type.
In my vmlinux BTF there are 61,747 anonymous types, so the loop count
is reduced by 61,747.
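Illustrative sketch only (the named_start_id bookkeeping is the concept from this
series, not the exact kernel code): once anonymous types are known to sit before
named ones, a by-name scan can begin at the first named type id instead of 1.

static s32 find_named_linear(const struct btf *btf, const char *name,
                             u32 named_start_id)
{
        u32 i, total = btf_nr_types(btf);

        for (i = named_start_id; i < total; i++) {
                const struct btf_type *t = btf_type_by_id(btf, i);
                const char *tname = btf_name_by_offset(btf, t->name_off);

                if (tname && !strcmp(tname, name))
                        return i;
        }
        return -ENOENT;
}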

Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-9-dolinux.peng@gmail.com
4 weeks ago   btf: Verify BTF sorting
Donglin Peng [Fri, 9 Jan 2026 12:59:59 +0000 (20:59 +0800)] 
btf: Verify BTF sorting

This patch checks whether the BTF is sorted by name in ascending order.
If sorted, binary search will be used when looking up types.

Specifically, vmlinux and kernel module BTFs are always sorted during
the build phase with anonymous types placed before named types, so we
only need to identify the starting ID of named types.

Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-8-dolinux.peng@gmail.com
4 weeks ago   btf: Optimize type lookup with binary search
Donglin Peng [Fri, 9 Jan 2026 12:59:58 +0000 (20:59 +0800)] 
btf: Optimize type lookup with binary search

Improve btf_find_by_name_kind() performance by adding binary search
support for sorted types. Falls back to linear search for compatibility.
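A minimal sketch of the sorted-case lookup, assuming names are stored in ascending
strcmp() order starting at the first named type id; the real btf_find_by_name_kind()
additionally matches the kind and handles neighbouring entries with the same name.

static s32 find_by_name_sorted(const struct btf *btf, const char *name,
                               s32 named_start_id)
{
        s32 lo = named_start_id, hi = btf_nr_types(btf) - 1;

        while (lo <= hi) {
                s32 mid = lo + (hi - lo) / 2;
                const struct btf_type *t = btf_type_by_id(btf, mid);
                int cmp = strcmp(btf_name_by_offset(btf, t->name_off), name);

                if (!cmp)
                        return mid;     /* caller still checks the kind */
                if (cmp < 0)
                        lo = mid + 1;
                else
                        hi = mid - 1;
        }
        return -ENOENT;
}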

Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-7-dolinux.peng@gmail.com
4 weeks ago   libbpf: Verify BTF sorting
Donglin Peng [Fri, 9 Jan 2026 12:59:57 +0000 (20:59 +0800)] 
libbpf: Verify BTF sorting

This patch checks whether the BTF is sorted by name in ascending
order. If sorted, binary search will be used when looking up types.

Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-6-dolinux.peng@gmail.com
4 weeks ago   libbpf: Optimize type lookup with binary search for sorted BTF
Donglin Peng [Fri, 9 Jan 2026 12:59:56 +0000 (20:59 +0800)] 
libbpf: Optimize type lookup with binary search for sorted BTF

This patch introduces binary search optimization for BTF type lookups
when the BTF instance contains sorted types.

The optimization significantly improves performance when searching for
types in large BTF instances with sorted types. For unsorted BTF, the
implementation falls back to the original linear search.

Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-5-dolinux.peng@gmail.com
4 weeks ago   tools/resolve_btfids: Support BTF sorting feature
Donglin Peng [Fri, 9 Jan 2026 12:59:55 +0000 (20:59 +0800)] 
tools/resolve_btfids: Support BTF sorting feature

This introduces a new BTF sorting phase that specifically sorts
BTF types by name in ascending order, so that the binary search
can be used to look up types.
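An illustrative comparator for such a phase (not the resolve_btfids code itself):
type ids are ordered so that anonymous types come first and named types follow in
ascending name order, which is the layout the kernel and libbpf later detect.

static int cmp_type_by_name(const void *a, const void *b, void *priv)
{
        struct btf *btf = priv;
        const struct btf_type *ta = btf__type_by_id(btf, *(const __u32 *)a);
        const struct btf_type *tb = btf__type_by_id(btf, *(const __u32 *)b);
        const char *na = btf__name_by_offset(btf, ta->name_off);
        const char *nb = btf__name_by_offset(btf, tb->name_off);
        bool anon_a = !na || !*na, anon_b = !nb || !*nb;

        if (anon_a != anon_b)
                return anon_a ? -1 : 1;         /* anonymous types first */
        return anon_a ? 0 : strcmp(na, nb);     /* then ascending by name */
}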

Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-4-dolinux.peng@gmail.com
4 weeks ago   selftests/bpf: Add test cases for btf__permute functionality
Donglin Peng [Fri, 9 Jan 2026 12:59:54 +0000 (20:59 +0800)] 
selftests/bpf: Add test cases for btf__permute functionality

This patch introduces test cases for the btf__permute function to ensure
it works correctly with both base BTF and split BTF scenarios.

The test suite includes:
- test_permute_base: Validates permutation on base BTF
- test_permute_split: Tests permutation on split BTF

Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-3-dolinux.peng@gmail.com
4 weeks ago   libbpf: Add BTF permutation support for type reordering
Donglin Peng [Fri, 9 Jan 2026 12:59:53 +0000 (20:59 +0800)] 
libbpf: Add BTF permutation support for type reordering

Introduce btf__permute() API to allow in-place rearrangement of BTF types.
This function reorganizes BTF type order according to a provided array of
type IDs, updating all type references to maintain consistency.
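Conceptually, a permutation has two halves: move the type records into their new
order, and rewrite every type reference through an old-id to new-id map. A sketch
of the reference-rewriting half for struct/union members (field layout per
uapi/linux/btf.h; not the libbpf implementation):

static void remap_member_refs(struct btf_type *t, const __u32 *id_map)
{
        struct btf_member *m = (struct btf_member *)(t + 1);
        __u16 i, vlen = BTF_INFO_VLEN(t->info);

        for (i = 0; i < vlen; i++)
                m[i].type = id_map[m[i].type];  /* old id -> new id */
}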

Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-2-dolinux.peng@gmail.com
4 weeks ago   bpf: Remove an unused parameter in check_func_proto
Song Chen [Mon, 5 Jan 2026 15:50:09 +0000 (23:50 +0800)] 
bpf: Remove an unused parameter in check_func_proto

The func_id parameter is not needed in check_func_proto.
This patch removes it.

Signed-off-by: Song Chen <chensong_2000@189.cn>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260105155009.4581-1-chensong_2000@189.cn
4 weeks ago   Merge branch 'bpf-recognize-special-arithmetic-shift-in-the-verifier'
Alexei Starovoitov [Tue, 13 Jan 2026 17:33:38 +0000 (09:33 -0800)] 
Merge branch 'bpf-recognize-special-arithmetic-shift-in-the-verifier'

Puranjay Mohan says:

====================
bpf: Recognize special arithmetic shift in the verifier

v3: https://lore.kernel.org/all/20260103022310.935686-1-puranjay@kernel.org/
Changes in v3->v4:
- Fork verifier state while processing BPF_OR when src_reg has [-1,0]
  range and 2nd operand is a constant. This is to detect the following pattern:
i32 X > -1 ? C1 : -1 --> (X >>s 31) | C1
- Add selftests for above.
- Remove __description("s>>=63") (Eduard in another patchset)

v2: https://lore.kernel.org/bpf/20251115022611.64898-1-alexei.starovoitov@gmail.com/
Changes in v2->v3:
- fork verifier state while processing BPF_AND when src_reg has [-1,0]
  range and 2nd operand is a constant.

v1->v2:
Use __mark_reg32_known() or __mark_reg_known() for zero too.
Add comment to selftest.

v1:
https://lore.kernel.org/bpf/20251114031039.63852-1-alexei.starovoitov@gmail.com/
====================

Link: https://patch.msgid.link/20260112201424.816836-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 weeks ago   selftests/bpf: Add tests for s>>=31 and s>>=63
Alexei Starovoitov [Mon, 12 Jan 2026 20:13:58 +0000 (12:13 -0800)] 
selftests/bpf: Add tests for s>>=31 and s>>=63

Add tests for special arithmetic shift right.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260112201424.816836-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 weeks ago   bpf: Recognize special arithmetic shift in the verifier
Alexei Starovoitov [Mon, 12 Jan 2026 20:13:57 +0000 (12:13 -0800)] 
bpf: Recognize special arithmetic shift in the verifier

Cilium's bpf_wireguard.bpf.c, when compiled with -O1, fails to load
with the following verifier log:

192: (79) r2 = *(u64 *)(r10 -304)     ; R2=pkt(r=40) R10=fp0 fp-304=pkt(r=40)
...
227: (85) call bpf_skb_store_bytes#9          ; R0=scalar()
228: (bc) w2 = w0                     ; R0=scalar() R2=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
229: (c4) w2 s>>= 31                  ; R2=scalar(smin=0,smax=umax=0xffffffff,smin32=-1,smax32=0,var_off=(0x0; 0xffffffff))
230: (54) w2 &= -134                  ; R2=scalar(smin=0,smax=umax=umax32=0xffffff7a,smax32=0x7fffff7a,var_off=(0x0; 0xffffff7a))
...
232: (66) if w2 s> 0xffffffff goto pc+125     ; R2=scalar(smin=umin=umin32=0x80000000,smax=umax=umax32=0xffffff7a,smax32=-134,var_off=(0x80000000; 0x7fffff7a))
...
238: (79) r4 = *(u64 *)(r10 -304)     ; R4=scalar() R10=fp0 fp-304=scalar()
239: (56) if w2 != 0xffffff78 goto pc+210     ; R2=0xffffff78 // -136
...
258: (71) r1 = *(u8 *)(r4 +0)
R4 invalid mem access 'scalar'

The error might confuse most bpf authors, since the fp-304 slot had a
'pkt' pointer at insn 192 and became 'scalar' at 238. That happened
because bpf_skb_store_bytes() clears all packet pointers, including
those on the stack. At first glance it might look like a bug in the
source code, since the ctx->data pointer should have been reloaded
after the call to bpf_skb_store_bytes().

The relevant part of cilium source code looks like this:

// bpf/lib/nodeport.h
int dsr_set_ipip6()
{
        if (ctx_adjust_hroom(...))
                return DROP_INVALID; // -134
        if (ctx_store_bytes(...))
                return DROP_WRITE_ERROR; // -141
        return 0;
}

bool dsr_fail_needs_reply(int code)
{
        if (code == DROP_FRAG_NEEDED) // -136
                return true;
        return false;
}

tail_nodeport_ipv6_dsr()
{
        ret = dsr_set_ipip6(...);
        if (!IS_ERR(ret)) {
                ...
        } else {
                if (dsr_fail_needs_reply(ret))
                        return dsr_reply_icmp6(...);
        }
}

The code doesn't have arithmetic shift by 31 and it reloads ctx->data
every time it needs to access it. So it's not a bug in the source code.

The reason is DAGCombiner::foldSelectCCToShiftAnd() LLVM transformation:

  // If this is a select where the false operand is zero and the compare is a
  // check of the sign bit, see if we can perform the "gzip trick":
  // select_cc setlt X, 0, A, 0 -> and (sra X, size(X)-1), A
  // select_cc setgt X, 0, A, 0 -> and (not (sra X, size(X)-1)), A

The conditional branch in dsr_set_ipip6() and its return values
are optimized into BPF_ARSH plus BPF_AND:

227: (85) call bpf_skb_store_bytes#9
228: (bc) w2 = w0
229: (c4) w2 s>>= 31   ; R2=scalar(smin=0,smax=umax=0xffffffff,smin32=-1,smax32=0,var_off=(0x0; 0xffffffff))
230: (54) w2 &= -134   ; R2=scalar(smin=0,smax=umax=umax32=0xffffff7a,smax32=0x7fffff7a,var_off=(0x0; 0xffffff7a))

after insn 230 the register w2 can only be 0 or -134,
but the verifier approximates it, since there is no way to
represent two scalars in bpf_reg_state.
After the fall-through at insn 232 the w2 can only be -134, hence the
branch at insn
239: (56) if w2 != -136 goto pc+210
should always be taken, and the trapping insn 258 should never execute.
LLVM generated correct code, but the verifier follows an impossible
path and rejects a valid program. To fix this issue, recognize this
special LLVM optimization and fork the verifier state.
So after insn 229: (c4) w2 s>>= 31
the verifier has two states to explore:
one with w2 = 0 and another with w2 = 0xffffffff,
which makes the verifier accept bpf_wireguard.c.

A similar pattern exists where an OR operation is used in place of the
AND operation; the verifier detects that pattern as well by forking the
state before the OR operation when the scalar is in the range [-1,0].
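Written out in plain C (32-bit, assuming arithmetic right shift of negative
values, as on the targets in question), the two identities the verifier now
recognizes are:

#include <assert.h>
#include <stdint.h>

int main(void)
{
        const int32_t C = -134;
        int32_t xs[] = { 5, 0, -7 };

        for (int i = 0; i < 3; i++) {
                int32_t x = xs[i];
                int32_t sign = x >> 31;                   /* 0 or -1 */

                assert((sign & C) == (x < 0 ? C : 0));    /* AND pattern */
                assert((sign | C) == (x > -1 ? C : -1));  /* OR pattern */
        }
        return 0;
}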

Note there are 20+ such patterns in bpf_wireguard.o compiled
with -O1 and -O2, but they're rarely seen in other production
bpf programs, so the push_stack() approach is not a concern.

Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260112201424.816836-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 weeks ago   Merge branch 'fix-a-few-selftest-failure-due-to-64k-page'
Alexei Starovoitov [Tue, 13 Jan 2026 17:32:16 +0000 (09:32 -0800)] 
Merge branch 'fix-a-few-selftest-failure-due-to-64k-page'

Yonghong Song says:

====================
Fix a few selftest failures due to 64K page size

Fix a few arm64 selftest failures caused by the 64K page size. Please
see each individual patch for why the test failed and how it gets fixed.
====================

Link: https://patch.msgid.link/20260113061018.3797051-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 weeks ago   selftests/bpf: Fix verifier_arena_globals1 failure with 64K page
Yonghong Song [Tue, 13 Jan 2026 06:10:33 +0000 (22:10 -0800)] 
selftests/bpf: Fix verifier_arena_globals1 failure with 64K page

With 64K page on arm64, verifier_arena_globals1 failed like below:
  ...
  libbpf: map 'arena': failed to create: -E2BIG
  ...
  #509/1   verifier_arena_globals1/check_reserve1:FAIL
  ...

With a 64K page size, if the number of arena pages is (1UL << 20), the
total memory will exceed 4G and map creation will fail. Adjusting
ARENA_PAGES based on the actual page size fixes the problem.
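The arithmetic behind the fix, as an illustrative sketch (macro names are
assumptions, not the selftest's): scale the page count by the page size so the
arena's byte size stays the same as in the 4K case.

#define ARENA_PAGES_4K  (1UL << 20)                          /* original count */
#define ARENA_PAGES     (ARENA_PAGES_4K * 4096 / PAGE_SIZE)  /* 1<<20 @ 4K, 1<<16 @ 64K */
/* PAGE_SIZE stands for the runtime page size, however the test obtains it */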

Cc: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260113061033.3798549-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 weeks ago   selftests/bpf: Fix sk_bypass_prot_mem failure with 64K page
Yonghong Song [Tue, 13 Jan 2026 06:10:28 +0000 (22:10 -0800)] 
selftests/bpf: Fix sk_bypass_prot_mem failure with 64K page

The current selftest sk_bypass_prot_mem only supports a 4K page size.
When running with 64K pages on arm64, the following failure happens:
  ...
  check_bypass:FAIL:no bypass unexpected no bypass: actual 3 <= expected 32
  ...
  #385/1   sk_bypass_prot_mem/TCP  :FAIL
  ...
  check_bypass:FAIL:no bypass unexpected no bypass: actual 4 <= expected 32
  ...
  #385/2   sk_bypass_prot_mem/UDP  :FAIL
  ...

Adding support for 64K pages as well fixes the failure.

Cc: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260113061028.3798326-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 weeks ago   selftests/bpf: Fix dmabuf_iter/lots_of_buffers failure with 64K page
Yonghong Song [Tue, 13 Jan 2026 06:10:23 +0000 (22:10 -0800)] 
selftests/bpf: Fix dmabuf_iter/lots_of_buffers failure with 64K page

On arm64 with a 64K page size, I observed the following test failure:
  ...
  subtest_dmabuf_iter_check_lots_of_buffers:FAIL:total_bytes_read unexpected total_bytes_read:
      actual 4696 <= expected 65536
  #97/3    dmabuf_iter/lots_of_buffers:FAIL

With a 4K page size on x86, the total_bytes_read is 4593.
With a 64K page size on arm64, the total_bytes_read is 4696.

In progs/dmabuf_iter.c, for each iteration, the output is
  BPF_SEQ_PRINTF(seq, "%lu\n%llu\n%s\n%s\n", inode, size, name, exporter);

The only difference between the 4K and 64K pages is 'size' in the
above BPF_SEQ_PRINTF. The 4K page will output '4096' and the 64K page
will output '65536', so the total_bytes_read with a 64K page is
slightly greater than with a 4K page.

Adjusting the expected total_bytes_read from 65536 to 4096 fixes the issue.

Cc: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260113061023.3798085-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 weeks ago   bpf: Consistently use reg_state() for register access in the verifier
Mykyta Yatsenko [Tue, 13 Jan 2026 13:48:26 +0000 (13:48 +0000)] 
bpf: Consistently use reg_state() for register access in the verifier

Replace the pattern of declaring a local regs array from cur_regs()
and then indexing into it with the more concise reg_state() helper.
This simplifies the code by eliminating intermediate variables and
makes register access more consistent throughout the verifier.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20260113134826.2214860-1-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 weeks ago   Merge branch 'use-correct-destructor-kfunc-types'
Alexei Starovoitov [Tue, 13 Jan 2026 02:53:57 +0000 (18:53 -0800)] 
Merge branch 'use-correct-destructor-kfunc-types'

Sami Tolvanen says:

====================
While running BPF self-tests with CONFIG_CFI (Control Flow
Integrity) enabled, I ran into a couple of failures in
bpf_obj_free_fields() caused by type mismatches between the
btf_dtor_kfunc_t function pointer type and the registered
destructor functions.

It looks like we can't change the argument type for these
functions to match btf_dtor_kfunc_t because the verifier doesn't
like void pointer arguments for functions used in BPF programs.
This series therefore fixes the issue by adding stubs with the
correct type to use as destructors for each instance I found in
the kernel tree.

The last patch changes btf_check_dtor_kfuncs() to enforce the
function type when CFI is enabled, so we don't end up registering
destructors that panic the kernel.

v5:
- Rebased on bpf-next/master again.

v4: https://lore.kernel.org/bpf/20251126221724.897221-6-samitolvanen@google.com/
- Rebased on bpf-next/master.
- Renamed CONFIG_CFI_CLANG to CONFIG_CFI.
- Picked up Acked/Tested-by tags.

v3: https://lore.kernel.org/bpf/20250728202656.559071-6-samitolvanen@google.com/
- Renamed the functions and went back to __bpf_kfunc based
  on review feedback.

v2: https://lore.kernel.org/bpf/20250725214401.1475224-6-samitolvanen@google.com/
- Annotated the stubs with CFI_NOSEAL to fix issues with IBT
  sealing on x86.
- Changed __bpf_kfunc to explicit __used __retain.

v1: https://lore.kernel.org/bpf/20250724223225.1481960-6-samitolvanen@google.com/
====================

Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260110082548.113748-6-samitolvanen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 weeks ago   bpf, btf: Enforce destructor kfunc type with CFI
Sami Tolvanen [Sat, 10 Jan 2026 08:25:53 +0000 (08:25 +0000)] 
bpf, btf: Enforce destructor kfunc type with CFI

Ensure that registered destructor kfuncs have the same type
as btf_dtor_kfunc_t to avoid a kernel panic on systems with
CONFIG_CFI enabled.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260110082548.113748-10-samitolvanen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 weeks ago   selftests/bpf: Use the correct destructor kfunc type
Sami Tolvanen [Sat, 10 Jan 2026 08:25:52 +0000 (08:25 +0000)] 
selftests/bpf: Use the correct destructor kfunc type

With CONFIG_CFI enabled, the kernel strictly enforces that indirect
function calls use a function pointer type that matches the target
function. As bpf_testmod_ctx_release() signature differs from the
btf_dtor_kfunc_t pointer type used for the destructor calls in
bpf_obj_free_fields(), add a stub function with the correct type to
fix the type mismatch.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260110082548.113748-9-samitolvanen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 weeks ago   bpf: net_sched: Use the correct destructor kfunc type
Sami Tolvanen [Sat, 10 Jan 2026 08:25:51 +0000 (08:25 +0000)] 
bpf: net_sched: Use the correct destructor kfunc type

With CONFIG_CFI enabled, the kernel strictly enforces that indirect
function calls use a function pointer type that matches the
target function. As bpf_kfree_skb() signature differs from the
btf_dtor_kfunc_t pointer type used for the destructor calls in
bpf_obj_free_fields(), add a stub function with the correct type to
fix the type mismatch.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260110082548.113748-8-samitolvanen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 weeks ago   bpf: crypto: Use the correct destructor kfunc type
Sami Tolvanen [Sat, 10 Jan 2026 08:25:50 +0000 (08:25 +0000)] 
bpf: crypto: Use the correct destructor kfunc type

With CONFIG_CFI enabled, the kernel strictly enforces that indirect
function calls use a function pointer type that matches the target
function. I ran into the following type mismatch when running BPF
self-tests:

  CFI failure at bpf_obj_free_fields+0x190/0x238 (target:
    bpf_crypto_ctx_release+0x0/0x94; expected type: 0xa488ebfc)
  Internal error: Oops - CFI: 00000000f2008228 [#1]  SMP
  ...

As bpf_crypto_ctx_release() is also used in BPF programs and using
a void pointer as the argument would make the verifier unhappy, add
a simple stub function with the correct type and register it as the
destructor kfunc instead.
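A hedged sketch of the stub pattern described above (the typedef mirrors
btf_dtor_kfunc_t's shape and the stub name is illustrative):

typedef void (*btf_dtor_kfunc_t)(void *);

static void bpf_crypto_ctx_release_dtor(void *ctx)
{
        /* BPF programs keep calling the typed kfunc; only the destructor
         * registered for bpf_obj_free_fields() takes void *, so the
         * indirect call's type now matches under CFI. */
        bpf_crypto_ctx_release(ctx);
}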

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Tested-by: Viktor Malik <vmalik@redhat.com>
Link: https://lore.kernel.org/r/20260110082548.113748-7-samitolvanen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 weeks ago   libbpf: Fix OOB read in btf_dump_get_bitfield_value
Varun R Mallya [Tue, 6 Jan 2026 23:35:27 +0000 (05:05 +0530)] 
libbpf: Fix OOB read in btf_dump_get_bitfield_value

When dumping bitfield data, btf_dump_get_bitfield_value() reads data
based on the underlying type's size (t->size). However, it does not
verify that the provided data buffer (data_sz) is large enough to
contain these bytes.

If btf_dump__dump_type_data() is called with a buffer smaller than
the type's size, this leads to an out-of-bounds read. This was
confirmed by AddressSanitizer in the linked issue.

Fix this by ensuring we do not read past the provided data_sz limit.
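One way to express that guard, as a sketch only (the actual fix lives in
btf_dump_get_bitfield_value() in tools/lib/bpf/btf_dump.c):

/* refuse to read t->size bytes if the caller's buffer is smaller */
if (data_sz < t->size)
        return -E2BIG;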

Fixes: a1d3cc3c5eca ("libbpf: Avoid use of __int128 in typed dump display")
Reported-by: Harrison Green <harrisonmichaelgreen@gmail.com>
Suggested-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260106233527.163487-1-varunrmallya@gmail.com
Closes: https://github.com/libbpf/libbpf/issues/928
4 weeks ago   bpftool: Make skeleton C++ compatible with explicit casts
WanLi Niu [Tue, 6 Jan 2026 02:31:23 +0000 (10:31 +0800)] 
bpftool: Make skeleton C++ compatible with explicit casts

Fix C++ compilation errors in the generated skeleton by adding explicit
pointer casts and using char * subtraction for offset calculation.

error: invalid conversion from 'void*' to '<obj_name>*' [-fpermissive]
      |         skel = skel_alloc(sizeof(*skel));
      |                ~~~~~~~~~~^~~~~~~~~~~~~~~
      |                          |
      |                          void*

error: arithmetic on pointers to void
      |         skel->ctx.sz = (void *)&skel->links - (void *)skel;
      |                        ~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~

error: assigning to 'struct <obj_name>__<ident> *' from incompatible type 'void *'
      |                 skel-><ident> = skel_prep_map_data((void *)data, 4096,
      |                             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                 sizeof(data) - 1);
      |                                                 ~~~~~~~~~~~~~~~~~

error: assigning to 'struct <obj_name>__<ident> *' from incompatible type 'void *'
      |         skel-><ident> = skel_finalize_map_data(&skel->maps.<ident>.initial_value,
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                         4096, PROT_READ | PROT_WRITE, skel->maps.<ident>.map_fd);
      |                                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

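Concretely, the kind of change the errors above call for in the generated
skeleton looks like this (obj_name stands in for the generated struct name;
a sketch of the template output, not the bpftool source):

/* before: valid C, rejected by C++ */
skel = skel_alloc(sizeof(*skel));
skel->ctx.sz = (void *)&skel->links - (void *)skel;

/* after: explicit cast and char * arithmetic, valid in both C and C++ */
skel = (struct obj_name *)skel_alloc(sizeof(*skel));
skel->ctx.sz = (char *)&skel->links - (char *)skel;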
Minimum reproducer:

$ cat test.bpf.c
int val; // placed in .bss section

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

SEC("raw_tracepoint/sched_wakeup_new") int handle(void *ctx) { return 0; }

$ cat test.cpp
#include <cerrno>

extern "C" {
#include "test.bpf.skel.h"
}

$ bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
$ clang -g -O2 -target bpf -c test.bpf.c -o test.bpf.o
$ bpftool gen skeleton test.bpf.o -L  > test.bpf.skel.h
$ g++ -c test.cpp -I.

Co-developed-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: WanLi Niu <niuwl1@chinatelecom.cn>
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260106023123.2928-1-kiraskyler@163.com
5 weeks ago   Merge branch 'bpf-selftests-fixes-for-gcc-bpf-16'
Alexei Starovoitov [Wed, 7 Jan 2026 05:04:11 +0000 (21:04 -0800)] 
Merge branch 'bpf-selftests-fixes-for-gcc-bpf-16'

Jose E. Marchesi says:

====================
bpf: selftests fixes for GCC-BPF 16

Hello.

Just a couple of small fixes to get the BPF selftests to build with what
will become GCC 16 this spring.  One of the regressions is due to a
change in the behavior of a warning in GCC 16.  The other is due to
the fact that GCC 16 actually implements btf_decl_tag and
btf_type_tag.

Salud!
====================

Link: https://patch.msgid.link/20260106173650.18191-1-jose.marchesi@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   bpf: GCC requires function attributes before the declarator
Jose E. Marchesi [Tue, 6 Jan 2026 17:36:50 +0000 (18:36 +0100)] 
bpf: GCC requires function attributes before the declarator

GCC insists on placing attributes before the declarator in function
declarations.  Now that GCC supports btf_decl_tag, and __tag1 and
__tag2 therefore expand to actual attributes, the compiler now
complains about

  static __noinline int foo(int x __tag1 __tag2) __tag1 __tag2

progs/test_btf_decl_tag.c:36:1: error: attributes should be specified \
before the declarator in a function definition

This patch simply places the tags before the declarator.

Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Cc: david.faust@oracle.com
Cc: cupertino.miranda@oracle.com
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260106173650.18191-3-jose.marchesi@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   bpf: adapt selftests to GCC 16 -Wunused-but-set-variable
Jose E. Marchesi [Tue, 6 Jan 2026 17:36:49 +0000 (18:36 +0100)] 
bpf: adapt selftests to GCC 16 -Wunused-but-set-variable

GCC 16 has changed the semantics of -Wunused-but-set-variable, as well
as introducing new options -Wunused-but-set-variable={0,1,2,3} to
adjust the level of support.

One of the changes is that GCC now treats 'sum += 1' and 'sum++' as
non-usage, whereas clang (and GCC < 16) considers the first as usage
and the second as non-usage, which is sort of inconsistent.

The GCC 16 -Wunused-but-set-variable=2 option implements the previous
semantics of -Wunused-but-set-variable, but since it is a new option,
it cannot be used unconditionally for forward-compatibility, just for
backwards-compatibility.

So this patch adds pragmas to the two self-tests impacted by this,
progs/free_timer.c and progs/rcu_read_lock.c, to make GCC ignore
-Wunused-but-set-variable warnings when compiling them with GCC > 15.

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44677#c25 for details
on why this regression got introduced in GCC upstream.

Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Cc: david.faust@oracle.com
Cc: cupertino.miranda@oracle.com
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260106173650.18191-2-jose.marchesi@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   scripts/gen-btf.sh: Ensure initial object in gen_btf_o is ELF with correct endianness
Nathan Chancellor [Tue, 6 Jan 2026 22:44:20 +0000 (15:44 -0700)] 
scripts/gen-btf.sh: Ensure initial object in gen_btf_o is ELF with correct endianness

After commit 600605853f87 ("scripts/gen-btf.sh: Fix .btf.o generation
when compiling for RISCV"), there is an error from llvm-objcopy when
CONFIG_LTO_CLANG is enabled:

  llvm-objcopy: error: '.tmp_vmlinux1.btf.o': The file was not recognized as a valid object file
  Failed to generate BTF for vmlinux

KBUILD_CFLAGS includes CC_FLAGS_LTO, which makes clang emit an LLVM IR
object, rather than an ELF one as expected by llvm-objcopy.

Most areas of the kernel deal with this by filtering out CC_FLAGS_LTO
from KBUILD_CFLAGS for the particular object or directory but this is
not so easy to do in bash. Just include '-fno-lto' after KBUILD_CFLAGS
to ensure an ELF object is consistently created as the initial .o file.

Additionally, while there is no reported or discovered bug yet, the
absence of KBUILD_CPPFLAGS from this command could result in incorrect
endianness because KBUILD_CPPFLAGS typically contains '-mbig-endian' and
'-mlittle-endian' so that biendian toolchains can be used. Include it in
this ${CC} command to hopefully limit necessary changes to this command
for the foreseeable future.

Fixes: 600605853f87 ("scripts/gen-btf.sh: Fix .btf.o generation when compiling for RISCV")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260106-fix-gen-btf-sh-lto-v2-1-01d3e1c241c4@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   Merge branch 'bpf-introduce-bpf_f_cpu-and-bpf_f_all_cpus-flags-for-percpu-maps'
Alexei Starovoitov [Wed, 7 Jan 2026 04:48:33 +0000 (20:48 -0800)] 
Merge branch 'bpf-introduce-bpf_f_cpu-and-bpf_f_all_cpus-flags-for-percpu-maps'

Leon Hwang says:

====================
bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps

This patch set introduces the BPF_F_CPU and BPF_F_ALL_CPUS flags for
percpu maps, as the requirement of BPF_F_ALL_CPUS flag for percpu_array
maps was discussed in the thread of
"[PATCH bpf-next v3 0/4] bpf: Introduce global percpu data"[1].

The goal of the BPF_F_ALL_CPUS flag is to reduce data caching overhead in light
skeletons by allowing a single value to be reused to update values across all
CPUs. This avoids the M:N problem where M cached values are used to update a
map on a kernel with N CPUs.
map on N CPUs kernel.

The BPF_F_CPU flag is accompanied by *flags*-embedded cpu info, which
specifies the target CPU for the operation:

* For lookup operations: the flag field alongside cpu info enables querying
  a value on the specified CPU.
* For update operations: the flag field alongside cpu info enables updating
  the value for the specified CPU.

Links:
[1] https://lore.kernel.org/bpf/20250526162146.24429-1-leon.hwang@linux.dev/

Changes:
v12 -> v13:
* No changes, rebased on latest tree.

v11 -> v12:
* Dropped the v11 changes.
* Stabilized the lru_percpu_hash map test by keeping an extra spare entry,
  which can be used temporarily during updates to avoid unintended LRU
  evictions.

v10 -> v11:
* Support the combination of BPF_EXIST and BPF_F_CPU/BPF_F_ALL_CPUS for
  update operations.
* Fix unstable lru_percpu_hash map test using the combination of
  BPF_EXIST and BPF_F_CPU/BPF_F_ALL_CPUS to avoid LRU eviction
  (reported by Alexei).

v9 -> v10:
* Add tests to verify array and hash maps do not support BPF_F_CPU and
  BPF_F_ALL_CPUS flags.
* Address comment from Andrii:
  * Copy map value using copy_map_value_long for percpu_cgroup_storage
    maps in a separate patch.

v8 -> v9:
* Change value type from u64 to u32 in selftests.
* Address comments from Andrii:
  * Keep value_size unaligned and update everywhere for consistency when
    cpu flags are specified.
  * Update value by getting pointer for percpu hash and percpu
    cgroup_storage maps.

v7 -> v8:
* Address comments from Andrii:
  * Check BPF_F_LOCK when update percpu_array, percpu_hash and
    lru_percpu_hash maps.
  * Refactor flags check in __htab_map_lookup_and_delete_batch().
  * Keep value_size unaligned and copy value using copy_map_value() in
    __htab_map_lookup_and_delete_batch() when BPF_F_CPU is specified.
  * Update warn message in libbpf's validate_map_op().
  * Update comment of libbpf's bpf_map__lookup_elem().

v6 -> v7:
* Get correct value size for percpu_hash and lru_percpu_hash in
  update_batch API.
* Set 'count' as 'max_entries' in test cases for lookup_batch API.
* Address comment from Alexei:
  * Move cpu flags check into bpf_map_check_op_flags().

v5 -> v6:
* Move bpf_map_check_op_flags() from 'bpf.h' to 'syscall.c'.
* Address comments from Alexei:
  * Drop the refactoring code of data copying logic for percpu maps.
  * Drop bpf_map_check_op_flags() wrappers.

v4 -> v5:
* Address comments from Andrii:
  * Refactor data copying logic for all percpu maps.
  * Drop this_cpu_ptr() micro-optimization.
  * Drop cpu check in libbpf's validate_map_op().
  * Enhance bpf_map_check_op_flags() using *allowed flags* instead of
    'extra_flags_mask'.

v3 -> v4:
* Address comments from Andrii:
  * Remove unnecessary map_type check in bpf_map_value_size().
  * Reduce code churn.
  * Remove unnecessary do_delete check in
    __htab_map_lookup_and_delete_batch().
  * Introduce bpf_percpu_copy_to_user() and bpf_percpu_copy_from_user().
  * Rename check_map_flags() to bpf_map_check_op_flags() with
    extra_flags_mask.
  * Add human-readable pr_warn() explanations in validate_map_op().
  * Use flags in bpf_map__delete_elem() and
    bpf_map__lookup_and_delete_elem().
  * Drop "for alignment reasons".
link: https://lore.kernel.org/bpf/20250821160817.70285-1-leon.hwang@linux.dev/
v2 -> v3:
* Address comments from Alexei:
  * Use BPF_F_ALL_CPUS instead of BPF_ALL_CPUS magic.
  * Introduce these two cpu flags for all percpu maps.
* Address comments from Jiri:
  * Reduce some unnecessary u32 cast.
  * Refactor more generic map flags check function.
  * A code style issue.
link: https://lore.kernel.org/bpf/20250805163017.17015-1-leon.hwang@linux.dev/
v1 -> v2:
* Address comments from Andrii:
  * Embed cpu info as high 32 bits of *flags* totally.
  * Use ERANGE instead of E2BIG.
  * Few format issues.
====================

Link: https://patch.msgid.link/20260107022022.12843-1-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   selftests/bpf: Add cases to test BPF_F_CPU and BPF_F_ALL_CPUS flags
Leon Hwang [Wed, 7 Jan 2026 02:20:22 +0000 (10:20 +0800)] 
selftests/bpf: Add cases to test BPF_F_CPU and BPF_F_ALL_CPUS flags

Add test coverage for the new BPF_F_CPU and BPF_F_ALL_CPUS flags support
in percpu maps. The following APIs are exercised:

* bpf_map_update_batch()
* bpf_map_lookup_batch()
* bpf_map_update_elem()
* bpf_map__update_elem()
* bpf_map_lookup_elem_flags()
* bpf_map__lookup_elem()

For lru_percpu_hash map, set max_entries to
'libbpf_num_possible_cpus() + 1' and only use the first
'libbpf_num_possible_cpus()' entries. This ensures a spare entry is always
available in the LRU free list, avoiding eviction.

When updating an existing key in lru_percpu_hash map:

1. l_new = prealloc_lru_pop();  /* Borrow from free list */
2. l_old = lookup_elem_raw();   /* Found, key exists */
3. pcpu_copy_value();           /* In-place update */
4. bpf_lru_push_free();         /* Return l_new to free list */

Also add negative tests to verify that non-percpu array and hash maps
reject the BPF_F_CPU and BPF_F_ALL_CPUS flags.

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-8-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   libbpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu maps
Leon Hwang [Wed, 7 Jan 2026 02:20:21 +0000 (10:20 +0800)] 
libbpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu maps

Add libbpf support for the BPF_F_CPU flag for percpu maps by embedding the
cpu info into the high 32 bits of:

1. **flags**: bpf_map_lookup_elem_flags(), bpf_map__lookup_elem(),
   bpf_map_update_elem() and bpf_map__update_elem()
2. **opts->elem_flags**: bpf_map_lookup_batch() and
   bpf_map_update_batch()

And the flag can be BPF_F_ALL_CPUS, but cannot be
'BPF_F_CPU | BPF_F_ALL_CPUS'.

Behavior:

* If the flag is BPF_F_ALL_CPUS, the update is applied across all CPUs.
* If the flag is BPF_F_CPU, it updates value only to the specified CPU.
* If the flag is BPF_F_CPU, lookup value only from the specified CPU.
* lookup does not support BPF_F_ALL_CPUS.
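A usage sketch based on the description above (it assumes a percpu map whose
value_size is 4, so a single 4-byte value is exchanged when a cpu flag is set;
'map' is a struct bpf_map * taken from the skeleton, and error handling is
omitted):

__u32 cpu = 2, key = 0, value = 42;
__u64 flags;

/* write the same value into every CPU's slot */
flags = BPF_F_ALL_CPUS;
bpf_map__update_elem(map, &key, sizeof(key), &value, sizeof(value), flags);

/* read back only CPU 2's slot */
flags = BPF_F_CPU | ((__u64)cpu << 32);
bpf_map__lookup_elem(map, &key, sizeof(key), &value, sizeof(value), flags);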

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-7-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_cgroup_storage maps
Leon Hwang [Wed, 7 Jan 2026 02:20:20 +0000 (10:20 +0800)] 
bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_cgroup_storage maps

Introduce BPF_F_ALL_CPUS flag support for percpu_cgroup_storage maps to
allow updating values for all CPUs with a single value for update_elem
API.

Introduce BPF_F_CPU flag support for percpu_cgroup_storage maps to
allow:

* update value for specified CPU for update_elem API.
* lookup value for specified CPU for lookup_elem API.

The BPF_F_CPU flag is passed via map_flags along with embedded cpu info.

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-6-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   bpf: Copy map value using copy_map_value_long for percpu_cgroup_storage maps
Leon Hwang [Wed, 7 Jan 2026 02:20:19 +0000 (10:20 +0800)] 
bpf: Copy map value using copy_map_value_long for percpu_cgroup_storage maps

Copy the map value using copy_map_value_long() to keep the style
consistent with the other percpu maps.

No functional change intended.

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-5-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_hash and lru_percpu_ha...
Leon Hwang [Wed, 7 Jan 2026 02:20:18 +0000 (10:20 +0800)] 
bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_hash and lru_percpu_hash maps

Introduce BPF_F_ALL_CPUS flag support for percpu_hash and lru_percpu_hash
maps to allow updating values for all CPUs with a single value for both
update_elem and update_batch APIs.

Introduce BPF_F_CPU flag support for percpu_hash and lru_percpu_hash
maps to allow:

* update value for specified CPU for both update_elem and update_batch
APIs.
* lookup value for specified CPU for both lookup_elem and lookup_batch
APIs.

The BPF_F_CPU flag is passed via:

* map_flags along with embedded cpu info.
* elem_flags along with embedded cpu info.

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-4-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_array maps
Leon Hwang [Wed, 7 Jan 2026 02:20:17 +0000 (10:20 +0800)] 
bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_array maps

Introduce support for the BPF_F_ALL_CPUS flag in percpu_array maps to
allow updating values for all CPUs with a single value for both
update_elem and update_batch APIs.

Introduce support for the BPF_F_CPU flag in percpu_array maps to allow:

* update value for specified CPU for both update_elem and update_batch
APIs.
* lookup value for specified CPU for both lookup_elem and lookup_batch
APIs.

The BPF_F_CPU flag is passed via:

* map_flags of lookup_elem and update_elem APIs along with embedded cpu
info.
* elem_flags of lookup_batch and update_batch APIs along with embedded
cpu info.

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-3-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags
Leon Hwang [Wed, 7 Jan 2026 02:20:16 +0000 (10:20 +0800)] 
bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags

Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags and check them for
following APIs:

* 'map_lookup_elem()'
* 'map_update_elem()'
* 'generic_map_lookup_batch()'
* 'generic_map_update_batch()'

Also, get the correct value size for these APIs.
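Illustratively (the helper name is an assumption, not the patch's), "correct
value size" means that for a percpu map a single value is exchanged with user
space when a cpu flag is set, instead of the usual 8-byte-aligned per-CPU array:

static u32 percpu_copy_size(const struct bpf_map *map, u64 flags)
{
        if (flags & (BPF_F_CPU | BPF_F_ALL_CPUS))
                return map->value_size;         /* one (unaligned) value */
        return round_up(map->value_size, 8) * num_possible_cpus();
}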

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   Merge branch 'bpf-verifier-allow-calling-arena-functions-when-holding-bpf-lock'
Alexei Starovoitov [Wed, 7 Jan 2026 01:42:55 +0000 (17:42 -0800)] 
Merge branch 'bpf-verifier-allow-calling-arena-functions-when-holding-bpf-lock'

Emil Tsalapatis says:

====================
bpf/verifier: Allow calling arena functions when holding BPF lock

BPF arena-related kfuncs now cannot sleep, so they are safe to call
while holding a spinlock. However, the verifier still rejects
programs that do so. Update the verifier to allow arena kfunc
calls while holding a lock.

Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Changes v1->v2: (https://lore.kernel.org/r/20260106-arena-under-lock-v1-0-6ca9c121d826@etsalapatis.com)
- Added patch to account for active locks in_sleepable_context() (AI)
====================

Link: https://patch.msgid.link/20260106-arena-under-lock-v2-0-378e9eab3066@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   selftests/bpf: add tests for arena kfuncs under lock
Emil Tsalapatis [Tue, 6 Jan 2026 23:36:45 +0000 (18:36 -0500)] 
selftests/bpf: add tests for arena kfuncs under lock

Add selftests to ensure the verifier permits calling the arena
kfunc API while holding a lock.

Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260106-arena-under-lock-v2-3-378e9eab3066@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   bpf: Allow calls to arena functions while holding spinlocks
Emil Tsalapatis [Tue, 6 Jan 2026 23:36:44 +0000 (18:36 -0500)] 
bpf: Allow calls to arena functions while holding spinlocks

The bpf_arena_*_pages() kfuncs can be called from sleepable contexts,
but the verifier still prevents BPF programs from calling them while
holding a spinlock. Amend the verifier to allow for BPF programs
calling arena page management functions while holding a lock.

Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260106-arena-under-lock-v2-2-378e9eab3066@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   bpf: Check active lock count in in_sleepable_context()
Emil Tsalapatis [Tue, 6 Jan 2026 23:36:43 +0000 (18:36 -0500)] 
bpf: Check active lock count in in_sleepable_context()

The in_sleepable_context() function is used to specialize the BPF code
in do_misc_fixups(). With the addition of nonsleepable arena kfuncs,
there are kfuncs whose specialization depends on whether we are
holding a lock. We should use the nonsleepable version while
holding a lock and the sleepable one when not.

Add a check for active_locks to account for locking when specializing
arena kfuncs.

Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260106-arena-under-lock-v2-1-378e9eab3066@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   mm: drop mem_cgroup_usage() declaration from memcontrol.h
Roman Gushchin [Tue, 6 Jan 2026 04:23:13 +0000 (20:23 -0800)] 
mm: drop mem_cgroup_usage() declaration from memcontrol.h

mem_cgroup_usage() is not used outside of memcg-v1 code;
the declaration was added by mistake.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20260106042313.140256-1-roman.gushchin@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   bpf: Replace __opt annotation with __nullable for kfuncs
Puranjay Mohan [Fri, 2 Jan 2026 22:15:12 +0000 (14:15 -0800)] 
bpf: Replace __opt annotation with __nullable for kfuncs

The __opt annotation was originally introduced specifically for
buffer/size argument pairs in bpf_dynptr_slice() and
bpf_dynptr_slice_rdwr(), allowing the buffer pointer to be NULL while
still validating the size as a constant.  The __nullable annotation
serves the same purpose but is more general and is already used
throughout the BPF subsystem for raw tracepoints, struct_ops, and other
kfuncs.

This patch unifies the two annotations by replacing __opt with
__nullable.  The key change is in the verifier's
get_kfunc_ptr_arg_type() function, where mem/size pair detection is now
performed before the nullable check.  This ensures that buffer/size
pairs are correctly classified as KF_ARG_PTR_TO_MEM_SIZE even when the
buffer is nullable, while adding an !arg_mem_size condition to the
nullable check prevents interference with mem/size pair handling.

When processing KF_ARG_PTR_TO_MEM_SIZE arguments, the verifier now uses
is_kfunc_arg_nullable() instead of the removed is_kfunc_arg_optional()
to determine whether to skip size validation for NULL buffers.
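For reference, a hypothetical kfunc shaped like the buffer/size pairs discussed
above might look as follows (the function and argument names are made up; the
point is only that the buffer may be NULL while its size is still validated):

__bpf_kfunc int bpf_copy_out(void *buf__nullable, u32 buf__sz)
{
        if (!buf__nullable)
                return 0;       /* NULL buffer is allowed */
        /* ... write up to buf__sz bytes into buf__nullable ... */
        return buf__sz;
}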

This is the first documentation added for the __nullable annotation,
which has been in use since it was introduced but was previously
undocumented.

No functional changes to verifier behavior - nullable buffer/size pairs
continue to work exactly as before.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102221513.1961781-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks ago   Merge branch 'memcg-accounting-for-bpf-arena'
Alexei Starovoitov [Fri, 2 Jan 2026 22:31:59 +0000 (14:31 -0800)] 
Merge branch 'memcg-accounting-for-bpf-arena'

Puranjay Mohan says:

====================
memcg accounting for BPF arena

v4: https://lore.kernel.org/all/20260102181333.3033679-1-puranjay@kernel.org/
Changes in v4->v5:
- Remove unused variables from bpf_map_alloc_pages() (CI)

v3: https://lore.kernel.org/all/20260102151852.570285-1-puranjay@kernel.org/
Changes in v3->v4:
- Do memcg set/recover in arena_reserve_pages() rather than
  bpf_arena_reserve_pages() for symmetry with other kfuncs (Alexei)

v2: https://lore.kernel.org/all/20251231141434.3416822-1-puranjay@kernel.org/
Changes in v2->v3:
- Remove memcg accounting from bpf_map_alloc_pages() as the caller does
  it already. (Alexei)
- Do memcg set/recover in arena_alloc/free_pages() rather than
  bpf_arena_alloc/free_pages(), it reduces copy pasting in
  sleepable/non_sleepable functions.

v1: https://lore.kernel.org/all/20251230153006.1347742-1-puranjay@kernel.org/
Changes in v1->v2:
- Return both pointers through arguments from bpf_map_memcg_enter and
  make it return void. (Alexei)
- Add memcg accounting in arena_free_worker (AI)

This set adds memcg accounting logic into arena kfuncs and other places
that do allocations in arena.c.
====================

Link: https://patch.msgid.link/20260102200230.25168-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks agobpf: arena: Reintroduce memcg accounting
Puranjay Mohan [Fri, 2 Jan 2026 20:02:28 +0000 (12:02 -0800)] 
bpf: arena: Reintroduce memcg accounting

When arena allocations were converted from bpf_map_alloc_pages() to
kmalloc_nolock() to support non-sleepable contexts, memcg accounting was
inadvertently lost. This commit restores proper memory accounting for
all arena-related allocations.

All arena-related allocations are accounted to the memcg of the process
that created the bpf_arena.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102200230.25168-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks agobpf: syscall: Introduce memcg enter/exit helpers
Puranjay Mohan [Fri, 2 Jan 2026 20:02:27 +0000 (12:02 -0800)] 
bpf: syscall: Introduce memcg enter/exit helpers

Introduce bpf_map_memcg_enter() and bpf_map_memcg_exit() helpers to
reduce code duplication in memcg context management.

bpf_map_memcg_enter() gets the memcg from the map, sets it as active,
and returns both the previous and the now active memcg.

bpf_map_memcg_exit() restores the previous active memcg and releases the
reference obtained during enter.
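
A conceptual sketch of the pair, following the description above (the
helper names inside the bodies exist in the memcg/BPF code, but the
exact implementation may differ):

  /* conceptual sketch, not the exact implementation */
  void bpf_map_memcg_enter(const struct bpf_map *map,
                           struct mem_cgroup **memcg,
                           struct mem_cgroup **old_memcg)
  {
          *memcg = bpf_map_get_memcg(map);        /* takes a reference */
          *old_memcg = set_active_memcg(*memcg);
  }

  void bpf_map_memcg_exit(struct mem_cgroup *memcg,
                          struct mem_cgroup *old_memcg)
  {
          set_active_memcg(old_memcg);
          mem_cgroup_put(memcg);                  /* drops the enter reference */
  }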

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102200230.25168-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks agoMerge branch 'bpf-make-kf_trusted_args-default'
Alexei Starovoitov [Fri, 2 Jan 2026 20:04:29 +0000 (12:04 -0800)] 
Merge branch 'bpf-make-kf_trusted_args-default'

Puranjay Mohan says:

====================
bpf: Make KF_TRUSTED_ARGS default

v2: https://lore.kernel.org/all/20251231171118.1174007-1-puranjay@kernel.org/
Changes in v2->v3:
- Fix documentation: add a new section for kfunc parameters (Eduard)
- Remove all occurrences of KF_TRUSTED from comments, etc. (Eduard)
- Fix the netfilter kfuncs to drop dead NULL checks.
- Fix the selftest for netfilter kfuncs to check for verification failures
  and remove the runtime failures that are no longer possible after this
  change

v1: https://lore.kernel.org/all/20251224192448.3176531-1-puranjay@kernel.org/
Changes in v1->v2:
- Update kfunc_dynptr_param selftest to use a real pointer that is not
  ptr_to_stack and not CONST_PTR_TO_DYNPTR rather than casting 1
  (Alexei)
- Thoroughly review all kfuncs in the tree to find regressions or missing
  annotations. (Eduard)
- Fix kfuncs found from the above step.

This series makes trusted arguments the default requirement for all BPF
kfuncs, inverting the current opt-in model. Instead of requiring
explicit KF_TRUSTED_ARGS flags, kfuncs now require trusted arguments by
default and must explicitly opt-out using __nullable/__opt annotations
or the KF_RCU flag.

This improves security and type safety by preventing BPF programs from
passing untrusted or NULL pointers to kernel functions at verification
time, while maintaining flexibility for the small number of kfuncs that
legitimately need to accept NULL or RCU pointers.

MOTIVATION

The current opt-in model is error-prone and inconsistent. Most kfuncs already
require trusted pointers from sources like KF_ACQUIRE, struct_ops callbacks, or
tracepoints. Making trusted arguments the default:

- Prevents NULL pointer dereferences at verification time
- Reduces defensive NULL checks in kernel code
- Provides better error messages for invalid BPF programs
- Aligns with existing patterns (context pointers, struct_ops already trusted)

IMPACT ANALYSIS

Comprehensive analysis of all 304+ kfuncs across 37 kernel files found:
- Most kfuncs (299/304) are already safe and require no changes
- Only 4 kfuncs required fixes (all included in this series)
- 0 regressions found in independent verification

All bpf selftests are passing. The hid_bpf tests are also passing:
# PASSED: 20 / 20 tests passed.
# Totals: pass:20 fail:0 xfail:0 xpass:0 skip:0 error:0

bpf programs in drivers/hid/bpf/progs/ show no regression as shown by
veristat:

Done. Processed 24 files, 62 programs. Skipped 0 files, 0 programs.

TECHNICAL DETAILS

The verifier now validates kfunc arguments in this order:
1. NULL check (runs first): Rejects NULL unless parameter has __nullable/__opt
2. Trusted check: Rejects untrusted pointers unless kfunc has KF_RCU

Special cases that bypass trusted checking:
- Context pointers (xdp_md, __sk_buff): Handled via KF_ARG_PTR_TO_CTX
- Struct_ops callbacks: Pre-marked as PTR_TRUSTED during initialization
- KF_RCU kfuncs: Have separate validation path for RCU pointers

BACKWARD COMPATIBILITY

This affects BPF program verification, not runtime:
- Valid programs passing trusted pointers: Continue to work
- Programs with bugs: May now fail verification (preventing runtime crashes)

This series introduces two intentional breaking changes to the BPF
verifier's kfunc handling:

1. NULL pointer rejection timing: Kfuncs that previously accepted NULL
pointers without KF_TRUSTED_ARGS will now reject NULL at verification
time instead of returning runtime errors. This affects netfilter
connection tracking functions (bpf_xdp_ct_lookup, bpf_skb_ct_lookup,
bpf_xdp_ct_alloc, bpf_skb_ct_alloc), which now enforce their documented
"Cannot be NULL" requirements at load time rather than returning -EINVAL
at runtime.

2. Fentry/fexit program restrictions: BPF programs using fentry/fexit
attachment points can no longer pass their callback arguments directly
to kfuncs, as these arguments are not marked as trusted by default.
Programs requiring trusted argument semantics should migrate to tp_btf
(tracepoint with BTF) attachment points where arguments are guaranteed
trusted by the verifier.

Both changes strengthen the verifier's safety guarantees by catching
errors earlier in the development cycle and are accompanied by
comprehensive test updates demonstrating the new expected behaviors.
====================

Link: https://patch.msgid.link/20260102180038.2708325-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks agoselftests: bpf: Fix test_bpf_nf for trusted args becoming default
Puranjay Mohan [Fri, 2 Jan 2026 18:00:36 +0000 (10:00 -0800)] 
selftests: bpf: Fix test_bpf_nf for trusted args becoming default

With trusted args now being the default, passing NULL to kfunc
parameters that are pointers causes verifier rejection rather than a
runtime error. The test_bpf_nf test was failing because it attempted to
pass NULL to bpf_xdp_ct_lookup() to verify runtime error handling.

Since the NULL check now happens at verification time, remove the
runtime test case that passed NULL to the bpf_tuple parameter and
instead add verification-time tests to ensure the verifier correctly
rejects programs that pass NULL to trusted arguments.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102180038.2708325-11-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks agoselftests: bpf: fix cgroup_hierarchical_stats
Puranjay Mohan [Fri, 2 Jan 2026 18:00:35 +0000 (10:00 -0800)] 
selftests: bpf: fix cgroup_hierarchical_stats

The cgroup_hierarchical_stats selftest uses an fentry program attached
to cgroup_attach_task and then passes the received &dst_cgrp->self to
the css_rstat_updated() kfunc. The verifier now assumes that all kfuncs
take only trusted pointer arguments, and pointers received by fentry
programs are not marked trusted by default.

Use a tp_btf program in place of fentry for this test; pointers
received by tp_btf programs are marked trusted by the verifier.
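
The gist of the change, as a sketch (the tracepoint's full argument list
is omitted here and not authoritative):

  /* before (now rejected, fentry args are not trusted by default):
   *         SEC("fentry/cgroup_attach_task")
   * after: */
  SEC("tp_btf/cgroup_attach_task")
  int BPF_PROG(counter, struct cgroup *dst_cgrp /* , ... */)
  {
          /* &dst_cgrp->self is a trusted pointer here and can be passed
           * to the css_rstat_updated() kfunc */
          return 0;
  }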

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102180038.2708325-10-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks agoselftests: bpf: fix test_kfunc_dynptr_param
Puranjay Mohan [Fri, 2 Jan 2026 18:00:34 +0000 (10:00 -0800)] 
selftests: bpf: fix test_kfunc_dynptr_param

As the verifier now assumes that all kfuncs take only trusted pointer
arguments, passing 0 (NULL) to a kfunc that doesn't mark the argument as
__nullable or __opt is rejected with the failure message: Possibly
NULL pointer passed to trusted arg<n>

Pass a non-NULL value to the kfunc to test the expected failure mode.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102180038.2708325-9-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks agoselftests: bpf: Update failure message for rbtree_fail
Puranjay Mohan [Fri, 2 Jan 2026 18:00:33 +0000 (10:00 -0800)] 
selftests: bpf: Update failure message for rbtree_fail

The rbtree_api_use_unchecked_remove_retval() selftest passes a pointer
received from bpf_rbtree_remove() to bpf_rbtree_add() without checking
for NULL; this was earlier caught by __check_ptr_off_reg() in the
verifier. Now that the verifier assumes every kfunc takes only trusted
pointer arguments, it catches this possibly-NULL pointer earlier in the
path and provides a more accurate failure message.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102180038.2708325-8-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks agoselftests: bpf: Update kfunc_param_nullable test for new error message
Puranjay Mohan [Fri, 2 Jan 2026 18:00:32 +0000 (10:00 -0800)] 
selftests: bpf: Update kfunc_param_nullable test for new error message

With trusted args now being the default, the NULL pointer check runs
before type-specific validation. Update test3 to expect the new error
message "Possibly NULL pointer passed to trusted arg0" instead of the
old dynptr-specific error message.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102180038.2708325-7-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks agoHID: bpf: drop dead NULL checks in kfuncs
Puranjay Mohan [Fri, 2 Jan 2026 18:00:31 +0000 (10:00 -0800)] 
HID: bpf: drop dead NULL checks in kfuncs

As KF_TRUSTED_ARGS is now considered the default for all kfuncs, the
verifier will not allow passing NULL pointers to these kfuncs. The
checks for NULL pointers can therefore be removed.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102180038.2708325-6-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks agobpf: xfrm: drop dead NULL check in bpf_xdp_get_xfrm_state()
Puranjay Mohan [Fri, 2 Jan 2026 18:00:30 +0000 (10:00 -0800)] 
bpf: xfrm: drop dead NULL check in bpf_xdp_get_xfrm_state()

As KF_TRUSTED_ARGS is now considered the default for all kfuncs, the
opts parameter in bpf_xdp_get_xfrm_state() can never be NULL. The
verifier will detect this at load time and will not allow passing NULL
to this function. This matches the documentation above the kfunc, which
says this parameter (opts) cannot be NULL.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102180038.2708325-5-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks agobpf: net: netfilter: drop dead NULL checks
Puranjay Mohan [Fri, 2 Jan 2026 18:00:29 +0000 (10:00 -0800)] 
bpf: net: netfilter: drop dead NULL checks

bpf_xdp_ct_lookup() and bpf_skb_ct_lookup() receive bpf_tuple and opts
parameters that are expected to be non-NULL in real usage (see the doc
strings above the functions). They return an error if NULL is passed
for opts or tuple.

The verifier now rejects programs that pass NULL to these parameters,
so the kfuncs can assume they are always valid pointers; drop the NULL
checks for these parameters.
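
From the BPF program side, a call like the following sketch is now
rejected at load time instead of returning an error at run time
(surrounding declarations elided; the exact verifier wording may differ):

  struct bpf_ct_opts opts = { .netns_id = -1 };
  struct nf_conn *ct;

  /* NULL tuple: previously a runtime error, now a verifier rejection,
   * e.g. "Possibly NULL pointer passed to trusted arg1" */
  ct = bpf_xdp_ct_lookup(ctx, NULL, 0, &opts, sizeof(opts));
  if (ct)
          bpf_ct_release(ct);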

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102180038.2708325-4-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks agobpf: Remove redundant KF_TRUSTED_ARGS flag from all kfuncs
Puranjay Mohan [Fri, 2 Jan 2026 18:00:28 +0000 (10:00 -0800)] 
bpf: Remove redundant KF_TRUSTED_ARGS flag from all kfuncs

Now that KF_TRUSTED_ARGS is the default for all kfuncs, remove the
explicit KF_TRUSTED_ARGS flag from all kfunc definitions and remove the
flag itself.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102180038.2708325-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 weeks agobpf: Make KF_TRUSTED_ARGS the default for all kfuncs
Puranjay Mohan [Fri, 2 Jan 2026 18:00:27 +0000 (10:00 -0800)] 
bpf: Make KF_TRUSTED_ARGS the default for all kfuncs

Change the verifier to make trusted args the default requirement for
all kfuncs by removing is_kfunc_trusted_args() and assuming it always
returns true.

This works because:
1. Context pointers (xdp_md, __sk_buff, etc.) are handled through their
   own KF_ARG_PTR_TO_CTX case label and bypass the trusted check
2. Struct_ops callback arguments are already marked as PTR_TRUSTED during
   initialization and pass is_trusted_reg()
3. KF_RCU kfuncs are handled separately via is_kfunc_rcu() checks at
   call sites (always checked with || alongside is_kfunc_trusted_args)

This simple change makes all kfuncs require trusted args by default
while maintaining correct behavior for all existing special cases.

Note: This change means kfuncs that previously accepted NULL pointers
without KF_TRUSTED_ARGS will now reject NULL at verification time.
Several netfilter kfuncs are affected: bpf_xdp_ct_lookup(),
bpf_skb_ct_lookup(), bpf_xdp_ct_alloc(), and bpf_skb_ct_alloc() all
accept NULL for their bpf_tuple and opts parameters internally (checked
in __bpf_nf_ct_lookup), but after this change the verifier rejects NULL
before the kfunc is even called. This is acceptable because these kfuncs
don't work with NULL parameters in their proper usage. Now they will be
rejected rather than returning an error, which shouldn't make a
difference to BPF programs that were using these kfuncs properly.
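
Conceptually, the per-argument checks now run in roughly this order
(paraphrased pseudocode, not the actual check_kfunc_args() code):

  /* for each pointer argument of a kfunc call (paraphrased) */
  if (arg_is_ctx) {
          /* xdp_md, __sk_buff, ...: handled via KF_ARG_PTR_TO_CTX,
           * bypasses the trusted check */
  } else if (reg_may_be_null && !is_kfunc_arg_nullable(btf, arg)) {
          /* "Possibly NULL pointer passed to trusted argN" */
          return -EINVAL;
  } else if (!is_trusted_reg(reg) && !is_kfunc_rcu(meta)) {
          /* untrusted pointers are rejected now that trusted args
           * are the default */
          return -EINVAL;
  }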

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102180038.2708325-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
6 weeks agoselftests/bpf: veristat: fix printing order in output_stats()
Puranjay Mohan [Wed, 31 Dec 2025 22:10:50 +0000 (14:10 -0800)] 
selftests/bpf: veristat: fix printing order in output_stats()

The order of the variables in the printf() doesn't match the text and
therefore veristat prints something like this:

Done. Processed 24 files, 0 programs. Skipped 62 files, 0 programs.

When it should print:

Done. Processed 24 files, 62 programs. Skipped 0 files, 0 programs.

Fix the order of variables in the printf() call.
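
For illustration, the corrected call reads roughly like this (variable
names assumed, not copied from the veristat source):

  printf("Done. Processed %d files, %d programs. Skipped %d files, %d programs.\n",
         files_processed, progs_processed, files_skipped, progs_skipped);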

Fixes: 518fee8bfaf2 ("selftests/bpf: make veristat skip non-BPF and failing-to-open BPF objects")
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251231221052.759396-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
6 weeks agobpf: Update BPF_PROG_RUN documentation
SungRock Jung [Sun, 21 Dec 2025 07:00:41 +0000 (07:00 +0000)] 
bpf: Update BPF_PROG_RUN documentation

LWT_SEG6LOCAL no longer supports test_run starting from v6.11
so remove it from the list of program types supported by BPF_PROG_RUN.

Add TRACING and NETFILTER program types to reflect the
current set of types that implement test_run.

Signed-off-by: SungRock Jung <tjdfkr2421@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20251221070041.26592-1-tjdfkr2421@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
6 weeks agoscripts/gen-btf.sh: Reduce log verbosity
Ihor Solodrai [Wed, 31 Dec 2025 18:39:29 +0000 (10:39 -0800)] 
scripts/gen-btf.sh: Reduce log verbosity

Remove info messages from gen-btf.sh, as they are unnecessarily
detailed and sometimes inaccurate [1].  A verbose log can be produced by
passing V=1 to make, which sets -x for the shell.

[1] https://lore.kernel.org/bpf/CAADnVQ+biTSDaNtoL=ct9XtBJiXYMUqGYLqu604C3D8N+8YH9A@mail.gmail.com/

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20251231183929.65668-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
6 weeks agoresolve_btfids: Implement --patch_btfids
Ihor Solodrai [Wed, 31 Dec 2025 01:25:57 +0000 (17:25 -0800)] 
resolve_btfids: Implement --patch_btfids

Recent changes in BTF generation [1] rely on ${OBJCOPY} command to
update .BTF_ids section data in target ELF files.

This exposed a bug in the llvm-objcopy --update-section code path that
may lead to corruption of the target ELF file. Specifically, because of
the bug, st_shndx of some symbols may be (incorrectly) set to 0xffff
(SHN_XINDEX) [2][3].

While there is a pending fix for LLVM, it'll take some time before it
lands (likely in 22.x). And the kernel build must keep working with
older LLVM toolchains in the foreseeable future.

Using GNU objcopy for .BTF_ids update would work, but it would require
changes to LLVM-based build process, likely breaking existing build
environments as discussed in [2].

To work around llvm-objcopy bug, implement --patch_btfids code path in
resolve_btfids as a drop-in replacement for:

    ${OBJCOPY} --update-section .BTF_ids=${btf_ids} ${elf}

which works specifically for the .BTF_ids section:

    ${RESOLVE_BTFIDS} --patch_btfids ${btf_ids} ${elf}

This feature in resolve_btfids can be removed at some point in the
future, when llvm-objcopy with a relevant bugfix becomes common.

[1] https://lore.kernel.org/bpf/20251219181321.1283664-1-ihor.solodrai@linux.dev/
[2] https://lore.kernel.org/bpf/20251224005752.201911-1-ihor.solodrai@linux.dev/
[3] https://github.com/llvm/llvm-project/issues/168060#issuecomment-3533552952

Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20251231012558.1699758-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
6 weeks agoMerge branch 'bpf-unify-state-pruning-handling-of-invalid-misc-stack-slots'
Alexei Starovoitov [Wed, 31 Dec 2025 17:01:13 +0000 (09:01 -0800)] 
Merge branch 'bpf-unify-state-pruning-handling-of-invalid-misc-stack-slots'

Eduard Zingerman says:

====================
bpf: unify state pruning handling of invalid/misc stack slots

This change unifies states pruning handling of NOT_INIT registers,
STACK_INVALID/STACK_MISC stack slots for regular and iterator/callback
based loop cases.

The change results in a modest verifier performance improvement:

========= selftests: master vs loop-stack-misc-pruning =========

File                             Program               Insns (A)  Insns (B)  Insns     (DIFF)
-------------------------------  --------------------  ---------  ---------  ----------------
test_tcp_custom_syncookie.bpf.o  tcp_custom_syncookie      38307      18430  -19877 (-51.89%)
xdp_synproxy_kern.bpf.o          syncookie_tc              23035      19067   -3968 (-17.23%)
xdp_synproxy_kern.bpf.o          syncookie_xdp             21022      18516   -2506 (-11.92%)

Total progs: 4173
Old success: 2520
New success: 2521
total_insns diff min:  -99.99%
total_insns diff max:    0.00%
0 -> value: 0
value -> 0: 0
total_insns abs max old: 837,487
total_insns abs max new: 837,487
-100 .. -90  %: 1
 -60 .. -50  %: 3
 -50 .. -40  %: 2
 -40 .. -30  %: 2
 -30 .. -20  %: 8
 -20 .. -10  %: 4
 -10 .. 0    %: 5
   0 .. 5    %: 4148

========= scx: master vs loop-stack-misc-pruning =========

File                       Program           Insns (A)  Insns (B)  Insns     (DIFF)
-------------------------  ----------------  ---------  ---------  ----------------
scx_arena_selftests.bpf.o  arena_selftest       257545     243678   -13867 (-5.38%)
scx_chaos.bpf.o            chaos_dispatch        13989      12804    -1185 (-8.47%)
scx_layered.bpf.o          layered_dispatch      27600      13925  -13675 (-49.55%)

Total progs: 305
Old success: 292
New success: 292
total_insns diff min:  -49.55%
total_insns diff max:    0.00%
0 -> value: 0
value -> 0: 0
total_insns abs max old: 257,545
total_insns abs max new: 243,678
 -50 .. -45  %: 7
 -30 .. -20  %: 5
 -20 .. -10  %: 14
 -10 .. 0    %: 18
   0 .. 5    %: 261

There is also a significant verifier performance improvement for some
bpf_loop() heavy Meta internal programs (~ -40% processed instructions).
====================

Link: https://patch.msgid.link/20251230-loop-stack-misc-pruning-v1-0-585cfd6cec51@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
6 weeks agoselftests/bpf: iterator based loop and STACK_MISC states pruning
Eduard Zingerman [Wed, 31 Dec 2025 05:36:04 +0000 (21:36 -0800)] 
selftests/bpf: iterator based loop and STACK_MISC states pruning

The test case first initializes 9 stack slots as STACK_MISC,
then conditionally updates each of them to a SCALAR spill inside an
iterator-based loop. This leads to 2**9 combinations of MISC/SPILL
marks for these slots at the iterator next call.
The loop converges only if the verifier treats such states as
equivalent, otherwise visited states are evicted from the states cache
too quickly.
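
A rough C sketch of the pattern (not the selftest itself; bpf_for() is
the open-coded iterator macro from the selftests headers, and the
byte-wise init is intended to leave the slots marked STACK_MISC):

  __u64 slots[9];
  int i, j;

  __builtin_memset(slots, 0xaa, sizeof(slots));   /* slots become STACK_MISC */

  bpf_for(i, 0, 100) {                            /* iterator based loop */
          for (j = 0; j < 9; j++)
                  if (bpf_get_prandom_u32() & 1)
                          slots[j] = j;           /* slot becomes a SCALAR spill */
  }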

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251230-loop-stack-misc-pruning-v1-2-585cfd6cec51@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
6 weeks agobpf: allow states pruning for misc/invalid slots in iterator loops
Eduard Zingerman [Wed, 31 Dec 2025 05:36:03 +0000 (21:36 -0800)] 
bpf: allow states pruning for misc/invalid slots in iterator loops

Within an iterator or callback based loop, it should be safe to prune
the current state if the old state stack slot is marked as
STACK_INVALID or STACK_MISC:
- either all branches of the old state lead to a program exit;
- or some branch of the old state leads to the current state.

This is the same logic as applied in non-loop cases when
states_equal() is called in NOT_EXACT mode.

The test case that exercises stacksafe() and demonstrates the
difference in verification performance is included in the next patch.
I'm not sure if it is possible to prepare a test case that exercises
regsafe(); it appears that the compute_live_registers() pass makes
this impossible.

Nevertheless, for code readability reasons, I think that stacksafe()
and regsafe() should handle STACK_INVALID / NOT_INIT symmetrically.
Hence, this commit changes both functions.
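
The rule being unified, paraphrased (not the actual stacksafe()/regsafe()
code):

  /* stacksafe(): when not comparing states exactly, an old STACK_MISC or
   * STACK_INVALID slot is compatible with whatever the current state holds */
  if (exact == NOT_EXACT &&
      (old_type == STACK_MISC || old_type == STACK_INVALID))
          continue;

  /* regsafe(): symmetrically, an old NOT_INIT register matches anything */
  if (exact == NOT_EXACT && rold->type == NOT_INIT)
          return true;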

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251230-loop-stack-misc-pruning-v1-1-585cfd6cec51@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
6 weeks agoMerge branch 'bpf-calls-to-bpf_loop-should-have-an-scc-and-accumulate-backedges'
Alexei Starovoitov [Tue, 30 Dec 2025 23:42:42 +0000 (15:42 -0800)] 
Merge branch 'bpf-calls-to-bpf_loop-should-have-an-scc-and-accumulate-backedges'

Eduard Zingerman says:

====================
bpf: calls to bpf_loop() should have an SCC and accumulate backedges

This is a correctness fix for the verification of BPF programs that
work with callback-calling functions. The problem is the same as the
issue fixed by series [1] for iterator-based loops: some of the states
created while processing the callback function body might have
incomplete read or precision marks.

An example of an unsafe program that is accepted without this fix can
be found in patch #2.

There is some impact on verification performance:

File                             Program               Insns (A)  Insns (B)  Insns      (DIFF)
-------------------------------  --------------------  ---------  ---------  -----------------
pyperf600_bpf_loop.bpf.o         on_event                   4247       9985   +5738 (+135.11%)
setget_sockopt.bpf.o             skops_sockopt              5719       7446    +1727 (+30.20%)
setget_sockopt.bpf.o             socket_post_create         1253       1603     +350 (+27.93%)
strobemeta_bpf_loop.bpf.o        on_event                   3424       7224   +3800 (+110.98%)
test_tcp_custom_syncookie.bpf.o  tcp_custom_syncookie      11929      38307  +26378 (+221.12%)
xdp_synproxy_kern.bpf.o          syncookie_tc              13986      23035    +9049 (+64.70%)
xdp_synproxy_kern.bpf.o          syncookie_xdp             13881      21022    +7141 (+51.44%)

Total progs: 4172
Old success: 2520
New success: 2520
total_insns diff min:    0.00%
total_insns diff max:  221.12%
0 -> value: 0
value -> 0: 0
total_insns abs max old: 837,487
total_insns abs max new: 837,487
   0 .. 5    %: 4163
   5 .. 15   %: 2
  25 .. 35   %: 2
  50 .. 60   %: 1
  60 .. 70   %: 1
 110 .. 120  %: 1
 135 .. 145  %: 1
 220 .. 225  %: 1

[1] https://lore.kernel.org/bpf/174968344350.3524559.14906547029551737094.git-patchwork-notify@kernel.org/
---
====================

Link: https://patch.msgid.link/20251229-scc-for-callbacks-v1-0-ceadfe679900@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
6 weeks agoselftests/bpf: test cases for bpf_loop SCC and state graph backedges
Eduard Zingerman [Tue, 30 Dec 2025 07:13:08 +0000 (23:13 -0800)] 
selftests/bpf: test cases for bpf_loop SCC and state graph backedges

Test for state graph backedges accumulation for SCCs formed by
bpf_loop(). Equivalent to the following C program:

  int main(void) {
    1: fp[-8] = bpf_get_prandom_u32();
    2: fp[-16] = -32;                       // used in a memory access below
    3: bpf_loop(7, loop_cb4, fp, 0);
    4: return 0;
  }

  int loop_cb4(int i, void *ctx) {
    5: if (unlikely(ctx[-8] > bpf_get_prandom_u32()))
    6:   *(u64 *)(fp + ctx[-16]) = 42;      // aligned access expected
    7: if (unlikely(fp[-8] > bpf_get_prandom_u32()))
    8:   ctx[-16] = -31;                    // makes said access unaligned
    9: return 0;
  }

If state graph backedges are not accumulated properly at the SCC
formed by loop_cb4() call from bpf_loop(), the state {ctx[-16]=-32}
injected at instruction 9 on verification path 1,2,3,5,7,9,4 would be
considered fully verified and would lack precision mark for ctx[-16].
This would lead to early pruning of verification path 1,2,3,5,7,8,9 in
state {ctx[-16]=-31}, which in turn leads to the incorrect assumption
that the above program is safe.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251229-scc-for-callbacks-v1-2-ceadfe679900@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
6 weeks agobpf: bpf_scc_visit instance and backedges accumulation for bpf_loop()
Eduard Zingerman [Tue, 30 Dec 2025 07:13:07 +0000 (23:13 -0800)] 
bpf: bpf_scc_visit instance and backedges accumulation for bpf_loop()

Calls like bpf_loop() or bpf_for_each_map_elem() introduce loops that
are not explicitly present in the control-flow graph. The verifier
processes such calls by repeatedly interpreting the callback function
body within the same verification path (until the current state
converges with a previous state).

Such loops require a bpf_scc_visit instance in order to allow the
accumulation of the state graph backedges. Otherwise, certain
checkpoint states created within the bodies of such loops will have
incomplete precision marks.

See the next patch for an example of an unsafe program that the
verifier would otherwise accept.

Fixes: 96c6aa4c63af ("bpf: compute SCCs in program control flow graph")
Fixes: c9e31900b54c ("bpf: propagate read/precision marks over state graph backedges")
Reported-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20251229-scc-for-callbacks-v1-1-ceadfe679900@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
6 weeks agoselftests/bpf: Fix verifier_arena_large/big_alloc3 test
Puranjay Mohan [Tue, 30 Dec 2025 19:51:32 +0000 (11:51 -0800)] 
selftests/bpf: Fix verifier_arena_large/big_alloc3 test

The big_alloc3() test tries to allocate 2051 pages at once in
non-sleepable context, and this can fail sporadically on
resource-constrained systems, so skip this test in case of such failures.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251230195134.599463-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
6 weeks agoscripts/gen-btf.sh: Fix .btf.o generation when compiling for RISCV
Ihor Solodrai [Mon, 29 Dec 2025 20:28:23 +0000 (12:28 -0800)] 
scripts/gen-btf.sh: Fix .btf.o generation when compiling for RISCV

gen-btf.sh emits a .btf.o file with BTF sections to be linked into
vmlinux in link-vmlinux.sh.

This .btf.o file is created by compiling an empty string with ${CC},
and then adding BTF sections into it with ${OBJCOPY}.

To ensure the .btf.o is linkable when cross-compiling with LLVM, we
have to also pass ${KBUILD_FLAGS}, which in particular control the
target word size.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512240559.2M06DSX7-lkp@intel.com/
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20251229202823.569619-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agoMerge branch 'remove-kf_sleepable-from-arena-kfuncs'
Alexei Starovoitov [Tue, 23 Dec 2025 19:30:00 +0000 (11:30 -0800)] 
Merge branch 'remove-kf_sleepable-from-arena-kfuncs'

Puranjay Mohan says:

====================
Remove KF_SLEEPABLE from arena kfuncs

V7: https://lore.kernel.org/all/20251222190815.4112944-1-puranjay@kernel.org/
Changes in V7->v8:
- Use clear_lo32(arena->user_vm_start) in place of user_vm_start in patch 3

V6: https://lore.kernel.org/all/20251217184438.3557859-1-puranjay@kernel.org/
Changes in v6->v7:
- Fix a deadlock in patch 1, that was being fixed in patch 2. Move the fix to patch 1.
- Call flush_cache_vmap() after setting up the mappings as it is
  required by some architectures.

V5: https://lore.kernel.org/all/20251212044516.37513-1-puranjay@kernel.org/
Changes in v5->v6:
Patch 1:
- Add a missing ; to make sure this patch builds individually. (AI)

V4: https://lore.kernel.org/all/20251212004350.6520-1-puranjay@kernel.org/
Changes in v4->v5:
Patch 1:
- Fix a memory leak in arena_alloc_pages(), it was being fixed in
  Patch 3 but, every patch should be complete in itself. (AI)
Patch 3:
- Don't do useless addition in arena_alloc_pages() (Alexei)
- Add a comment about kmalloc_nolock() failure and expectations.

v3: https://lore.kernel.org/all/20251117160150.62183-1-puranjay@kernel.org/
Changes in v3->v4:
- Coding style changes related to comments in Patch 2/3 (Alexei)

v2: https://lore.kernel.org/all/20251114111700.43292-1-puranjay@kernel.org/
Changes in v2->v3:
Patch 1:
        - Call range_tree_destroy() in error path of
          populate_pgtable_except_pte() in arena_map_alloc() (AI)
Patch 2:
        - Fix double mutex_unlock() in the error path of
          arena_alloc_pages() (AI)
        - Fix coding style issues (Alexei)
Patch 3:
        - Unlock spinlock before returning from arena_vm_fault() in case
          BPF_F_SEGV_ON_FAULT is set by user. (AI)
        - Use __llist_del_all() in place of llist_del_all for on-stack
          llist (free_pages) (Alexei)
        - Fix build issues on 32-bit systems where arena.c is not compiled.
          (kernel test robot)
        - Make bpf_arena_alloc_pages() polymorphic so it knows if it has
          been called in sleepable or non-sleepable context. This
          information is passed to arena_free_pages() in the error path.
Patch 4:
        - Add a better comment for the big_alloc3() test that triggers
          kmalloc_nolock()'s limit and if bpf_arena_alloc_pages() works
          correctly above this limit.

v1: https://lore.kernel.org/all/20251111163424.16471-1-puranjay@kernel.org/
Changes in v1->v2:
Patch 1:
        - Import tlbflush.h to fix build issue in loongarch. (kernel
          test robot)
        - Fix unused variable error in apply_range_clear_cb() (kernel
          test robot)
        - Call bpf_map_area_free() on error path of
          populate_pgtable_except_pte() (AI)
        - Use PAGE_SIZE in apply_to_existing_page_range() (AI)
Patch 2:
        - Cap allocation made by kmalloc_nolock() for pages array to
          KMALLOC_MAX_CACHE_SIZE and reuse the array in an explicit loop
          to overcome this limit. (AI)
Patch 3:
        - Do page_ref_add(page, 1); under the spinlock to mitigate a
          race (AI)
Patch 4:
        - Add a new testcase big_alloc3() verifier_arena_large.c that
          tries to allocate a large number of pages at once, this is to
          trigger the kmalloc_nolock() limit in Patch 2 and see if the
          loop logic works correctly.

This set allows arena kfuncs to be called from non-sleepable contexts.
It is achieved by the following changes:

The range_tree is now protected with a rqspinlock and not a mutex,
this change is enough to make bpf_arena_reserve_pages() any context
safe.

bpf_arena_alloc_pages() had four points where it could sleep:

1. Mutex to protect range_tree: now replaced with rqspinlock

2. kvcalloc() for allocations: now replaced with kmalloc_nolock()

3. Allocating pages with bpf_map_alloc_pages(): this already calls
   alloc_pages_nolock() in non-sleepable contexts and therefore is safe.

4. Setting up kernel page tables with vm_area_map_pages():
   vm_area_map_pages() may allocate memory while inserting pages into
   bpf arena's vm_area. Now, at arena creation time, populate all page
   table levels except the last level, and when new pages need to be
   inserted, call apply_to_page_range() again, which will only do
   set_pte_at() for those pages and will not allocate memory.

The above four changes make bpf_arena_alloc_pages() any context safe.

bpf_arena_free_pages() has to do the following steps:

1. Update the range_tree
2. vm_area_unmap_pages(): to unmap pages from kernel vm_area
3. flush the tlb: done in step 2, already.
4. zap_pages(): to unmap pages from user page tables
5. free pages.

The third patch in this set makes bpf_arena_free_pages() polymorphic using
the specialize_kfunc() mechanism. When called from a sleepable context,
arena_free_pages() remains mostly unchanged except the following:
1. rqspinlock is taken now instead of the mutex for the range tree
2. Instead of using vm_area_unmap_pages() that can free intermediate page
   table levels, apply_to_existing_page_range() with a callback is used
   that only does pte_clear() on the last level and leaves the intermediate
   page table levels intact. This is needed to make sure that
   bpf_arena_alloc_pages() can safely do set_pte_at() without allocating
   intermediate page tables.

When arena_free_pages() is called from a non-sleepable context or it fails to
acquire the rqspinlock in the sleepable case, a lock-less list of struct
arena_free_span is used to queue the uaddr and page cnt. kmalloc_nolock()
is used to allocate this arena_free_span; this can fail, but we need to make
this trade-off for frees done from non-sleepable contexts.

arena_free_pages() then raises an irq_work whose handler in turn schedules
work that iterates this list and clears PTEs, flushes TLBs, zaps pages, and
frees pages for the queued uaddr and page cnts.

apply_range_clear_cb() with apply_to_existing_page_range() is used to
clear PTEs and collect the pages to be freed; the struct llist_node
pcp_llist member in struct page is used for this.
====================

Link: https://patch.msgid.link/20251222195022.431211-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agoselftests: bpf: test non-sleepable arena allocations
Puranjay Mohan [Mon, 22 Dec 2025 19:50:19 +0000 (11:50 -0800)] 
selftests: bpf: test non-sleepable arena allocations

As arena kfuncs can now be called from non-sleepable contexts, test this
by adding non-sleepable copies of the tests in verifier_arena; this is done
by using a socket program instead of a syscall program.

Add a new test case in verifier_arena_large to check that
bpf_arena_alloc_pages() works for more than 1024 pages.
1024 * sizeof(struct page *) is the upper limit of kmalloc_nolock() but
bpf_arena_alloc_pages() should still succeed because it re-uses this
array in a loop.

Augment the arena_list selftest to also run in non-sleepable context by
taking rcu_read_lock.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251222195022.431211-5-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agobpf: arena: make arena kfuncs any context safe
Puranjay Mohan [Mon, 22 Dec 2025 19:50:18 +0000 (11:50 -0800)] 
bpf: arena: make arena kfuncs any context safe

Make arena related kfuncs any context safe by the following changes:

bpf_arena_alloc_pages() and bpf_arena_reserve_pages():
Replace the usage of the mutex with a rqspinlock for the range tree and use
kmalloc_nolock() wherever needed. Use free_pages_nolock() to free pages
from any context.
apply_range_set/clear_cb() with apply_to_page_range() has already made
populating the vm_area in bpf_arena_alloc_pages() any context safe.

bpf_arena_free_pages(): defer the main logic to a workqueue if it is
called from a non-sleepable context.

specialize_kfunc() is used to replace the sleepable arena_free_pages()
with bpf_arena_free_pages_non_sleepable() when the verifier detects the
call is from a non-sleepable context.

In the non-sleepable case, arena_free_pages() queues the address and the
page count to be freed to a lock-less list of struct arena_free_span
and raises an irq_work. The irq_work handler calls schedule_work(), which
is safe to call from irq context.  arena_free_worker() (the work
queue handler) iterates these spans and clears PTEs, flushes the TLB, zaps
pages, and calls __free_page().
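
A condensed sketch of that deferral path (struct and worker names are
from the commit message; the bpf_arena field names are hypothetical):

  struct arena_free_span {
          struct llist_node node;
          unsigned long uaddr;
          u32 page_cnt;
  };

  /* non-sleepable free: queue the span and kick an irq_work */
  span = kmalloc_nolock(sizeof(*span), 0, NUMA_NO_NODE);
  if (span) {
          span->uaddr = uaddr;
          span->page_cnt = page_cnt;
          llist_add(&span->node, &arena->free_spans);
          irq_work_queue(&arena->free_irq_work);
  }

  /* irq_work handler: schedule_work() is safe from hard-irq context and
   * defers the PTE clearing / TLB flush / freeing to arena_free_worker() */
  static void arena_free_irq(struct irq_work *work)
  {
          struct bpf_arena *arena = container_of(work, struct bpf_arena,
                                                 free_irq_work);

          schedule_work(&arena->free_work);
  }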

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251222195022.431211-4-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agobpf: arena: use kmalloc_nolock() in place of kvcalloc()
Puranjay Mohan [Mon, 22 Dec 2025 19:50:17 +0000 (11:50 -0800)] 
bpf: arena: use kmalloc_nolock() in place of kvcalloc()

To make arena_alloc_pages() safe to be called from any context, replace
kvcalloc() with kmalloc_nolock() so that it doesn't sleep or take any
locks. kmalloc_nolock() returns NULL for allocations larger than
KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with
4KB pages. So, round the allocation done by kmalloc_nolock() down to
1024 * 8 bytes and reuse the array in a loop.
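
A sketch of the capped-and-reused array pattern (variable names
illustrative, not the actual arena code):

  /* at most KMALLOC_MAX_CACHE_SIZE worth of page pointers per round,
   * i.e. 1024 pointers with 4K pages */
  long max_per_round = KMALLOC_MAX_CACHE_SIZE / sizeof(struct page *);
  long remaining = page_cnt, round;
  struct page **pages;

  pages = kmalloc_nolock(min(page_cnt, max_per_round) * sizeof(*pages),
                         0, NUMA_NO_NODE);
  if (!pages)
          return 0;
  while (remaining) {
          round = min(remaining, max_per_round);
          /* allocate and map 'round' pages, reusing the same array */
          remaining -= round;
  }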

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251222195022.431211-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agobpf: arena: populate vm_area without allocating memory
Puranjay Mohan [Mon, 22 Dec 2025 19:50:16 +0000 (11:50 -0800)] 
bpf: arena: populate vm_area without allocating memory

vm_area_map_pages() may allocate memory while inserting pages into bpf
arena's vm_area. In order to make the bpf_arena_alloc_pages() kfunc
non-sleepable, change bpf arena to populate pages without
allocating memory:
- at arena creation time populate all page table levels except
  the last level
- when new pages need to be inserted call apply_to_page_range() again
  with apply_range_set_cb() which will only set_pte_at() those pages and
  will not allocate memory.
- when freeing pages call apply_to_existing_page_range() with
  apply_range_clear_cb() to clear the PTE for the page to be removed, as
  sketched below. This doesn't free intermediate page table levels.
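
A conceptual sketch of the two per-PTE callbacks (the callback names are
from this commit; the data carrier struct is hypothetical):

  struct apply_range_data {            /* hypothetical carrier of state */
          struct page **pages;
  };

  static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
  {
          struct apply_range_data *d = data;

          /* page table levels were pre-populated at arena creation, so this
           * only writes the last-level PTE and never allocates memory */
          set_pte_at(&init_mm, addr, pte, mk_pte(*d->pages++, PAGE_KERNEL));
          return 0;
  }

  static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
  {
          /* clears only the PTE; intermediate page table levels stay intact */
          pte_clear(&init_mm, addr, pte);
          return 0;
  }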

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251222195022.431211-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agobpf: crypto: replace -EEXIST with -EBUSY
Daniel Gomez [Sat, 20 Dec 2025 03:48:09 +0000 (04:48 +0100)] 
bpf: crypto: replace -EEXIST with -EBUSY

The -EEXIST error code is reserved by the module loading infrastructure
to indicate that a module is already loaded. When a module's init
function returns -EEXIST, userspace tools like kmod interpret this as
"module already loaded" and treat the operation as successful, returning
0 to the user even though the module initialization actually failed.

This follows the precedent set by commit 54416fd76770 ("netfilter:
conntrack: helper: Replace -EEXIST by -EBUSY") which fixed the same
issue in nf_conntrack_helper_register().

This affects the bpf_crypto_skcipher module. While the configuration
required to build it as a module is unlikely in practice, it is
technically possible, so fix it for correctness.

Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/r/20251220-dev-module-init-eexists-bpf-v1-1-7f186663dbe7@samsung.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agoMerge branch 'allow-calling-kfuncs-from-raw_tp-programs'
Alexei Starovoitov [Tue, 23 Dec 2025 06:23:38 +0000 (22:23 -0800)] 
Merge branch 'allow-calling-kfuncs-from-raw_tp-programs'

Puranjay Mohan says:

====================
Allow calling kfuncs from raw_tp programs

V1: https://lore.kernel.org/all/20251218145514.339819-1-puranjay@kernel.org/
Changes in V1->V2:
- Update selftests to allow success for raw_tp programs calling kfuncs.

This set enables calling kfuncs from raw_tp programs.
====================

Link: https://patch.msgid.link/20251222133250.1890587-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agoselftests: bpf: fix tests with raw_tp calling kfuncs
Puranjay Mohan [Mon, 22 Dec 2025 13:32:46 +0000 (05:32 -0800)] 
selftests: bpf: fix tests with raw_tp calling kfuncs

As the previous commit allowed raw_tp programs to call kfuncs, some of the
selftests that were expected to fail will now succeed.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20251222133250.1890587-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agobpf: allow calling kfuncs from raw_tp programs
Puranjay Mohan [Mon, 22 Dec 2025 13:32:45 +0000 (05:32 -0800)] 
bpf: allow calling kfuncs from raw_tp programs

Associate raw tracepoint program type with the kfunc tracing hook. This
allows calling kfuncs from raw_tp programs.
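
For example, a sketch like the following - using an arbitrary, already
existing kfunc - would previously fail to load from a raw_tp program and
is now accepted:

  extern struct task_struct *bpf_task_from_pid(s32 pid) __ksym;
  extern void bpf_task_release(struct task_struct *p) __ksym;

  SEC("raw_tp/sched_switch")
  int rawtp_calls_kfunc(void *ctx)
  {
          struct task_struct *t = bpf_task_from_pid(1);

          if (t)
                  bpf_task_release(t);
          return 0;
  }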

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20251222133250.1890587-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agoMerge branch 'mm-bpf-kfuncs-to-access-memcg-data'
Alexei Starovoitov [Tue, 23 Dec 2025 06:20:22 +0000 (22:20 -0800)] 
Merge branch 'mm-bpf-kfuncs-to-access-memcg-data'

Roman Gushchin says:

====================
mm: bpf kfuncs to access memcg data

Introduce kfuncs to simplify access to memcg data.
These kfuncs can be used to accelerate monitoring use cases and
to implement custom OOM policies once BPF OOM lands.

This patchset was separated out from the BPF OOM patchset to simplify
the logistics and accelerate the landing of the part which is useful
by itself. No functional changes since BPF OOM v2.

v4:
  - refactored memcg vm event and stat item idx checks (by Alexei)

v3:
  - dropped redundant kfuncs flags (by Alexei)
  - fixed kdocs warnings (by Alexei)
  - merged memcg stats access patches into one (by Alexei)
  - restored root memcg usage reporting, added a comment
  - added checks for enum boundaries
  - added Shakeel and JP as co-maintainers (by Shakeel)

v2:
  - added mem_cgroup_disabled() checks (by Shakeel B.)
  - added special handling of the root memcg in bpf_mem_cgroup_usage()
  (by Shakeel B.)
  - minor fixes in the kselftest (by Shakeel B.)
  - added a MAINTAINERS entry (by Shakeel B.)

v1:
  https://lore.kernel.org/bpf/87ike29s5r.fsf@linux.dev/T/#t
====================

Link: https://patch.msgid.link/20251223044156.208250-1-roman.gushchin@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agoMAINTAINERS: add an entry for MM BPF extensions
Roman Gushchin [Tue, 23 Dec 2025 04:41:56 +0000 (20:41 -0800)] 
MAINTAINERS: add an entry for MM BPF extensions

Let's create a separate entry for MM BPF extensions: these patches
often require an attention from both bpf and mm communities.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20251223044156.208250-7-roman.gushchin@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agobpf: selftests: selftests for memcg stat kfuncs
JP Kobryn [Tue, 23 Dec 2025 04:41:55 +0000 (20:41 -0800)] 
bpf: selftests: selftests for memcg stat kfuncs

Add test coverage for the kfuncs that fetch memcg stats. Using some common
stats, add test scenarios ensuring that the given stat increases by some
arbitrary amount. The stats selected cover the three categories represented
by the enums: node_stat_item, memcg_stat_item, vm_event_item.

Since only a subset of all stats are queried, use a static struct made up
of fields for each stat. Write to the struct with the fetched values when
the bpf program is invoked and read the fields in the user mode program for
verification.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20251223044156.208250-6-roman.gushchin@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agomm: introduce BPF kfuncs to access memcg statistics and events
Roman Gushchin [Tue, 23 Dec 2025 04:41:54 +0000 (20:41 -0800)] 
mm: introduce BPF kfuncs to access memcg statistics and events

Introduce BPF kfuncs to conveniently access memcg data:
  - bpf_mem_cgroup_vm_events(),
  - bpf_mem_cgroup_memory_events(),
  - bpf_mem_cgroup_usage(),
  - bpf_mem_cgroup_page_state(),
  - bpf_mem_cgroup_flush_stats().

These functions are useful for implementing BPF OOM policies, but
can also be used to accelerate access to memcg data. Reading
it through cgroupfs is much more expensive, roughly 5x, mostly
because of the need to convert the data to text and back.
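
As a usage illustration only - the exact kfunc signatures are defined by
the kernel and only approximated here from the names above - a BPF
program might do something like:

  /* signatures approximated; check the kernel for the real declarations */
  extern void bpf_mem_cgroup_flush_stats(struct mem_cgroup *memcg) __ksym;
  extern u64 bpf_mem_cgroup_usage(struct mem_cgroup *memcg) __ksym;
  extern unsigned long bpf_mem_cgroup_page_state(struct mem_cgroup *memcg,
                                                 int item) __ksym;

  /* with a valid (trusted) memcg pointer in hand: */
  bpf_mem_cgroup_flush_stats(memcg);
  usage = bpf_mem_cgroup_usage(memcg);
  anon  = bpf_mem_cgroup_page_state(memcg, NR_ANON_MAPPED);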

JP Kobryn:
An experiment was set up to compare the performance of a program that
uses the traditional method of reading memory.stat vs a program using
the new kfuncs. The control program opens up the root memory.stat file
and for 1M iterations reads, converts the string values to numeric data,
then seeks back to the beginning. The experimental program sets up the
requisite libbpf objects and for 1M iterations invokes a bpf program
which uses the kfuncs to fetch all available stats for node_stat_item,
memcg_stat_item, and vm_event_item types.

The results showed a significant perf benefit on the experimental side,
outperforming the control side by a margin of 93%. In kernel mode,
elapsed time was reduced by 80%, while in user mode, over 99% of time
was saved.

control: elapsed time
real    0m38.318s
user    0m25.131s
sys     0m13.070s

experiment: elapsed time
real    0m2.789s
user    0m0.187s
sys     0m2.512s

control: perf data
33.43% a.out libc.so.6         [.] __vfscanf_internal
 6.88% a.out [kernel.kallsyms] [k] vsnprintf
 6.33% a.out libc.so.6         [.] _IO_fgets
 5.51% a.out [kernel.kallsyms] [k] format_decode
 4.31% a.out libc.so.6         [.] __GI_____strtoull_l_internal
 3.78% a.out [kernel.kallsyms] [k] string
 3.53% a.out [kernel.kallsyms] [k] number
 2.71% a.out libc.so.6         [.] _IO_sputbackc
 2.41% a.out [kernel.kallsyms] [k] strlen
 1.98% a.out a.out             [.] main
 1.70% a.out libc.so.6         [.] _IO_getline_info
 1.51% a.out libc.so.6         [.] __isoc99_sscanf
 1.47% a.out [kernel.kallsyms] [k] memory_stat_format
 1.47% a.out [kernel.kallsyms] [k] memcpy_orig
 1.41% a.out [kernel.kallsyms] [k] seq_buf_printf

experiment: perf data
10.55% memcgstat bpf_prog_..._query [k] bpf_prog_16aab2f19fa982a7_query
 6.90% memcgstat [kernel.kallsyms]  [k] memcg_page_state_output
 3.55% memcgstat [kernel.kallsyms]  [k] _raw_spin_lock
 3.12% memcgstat [kernel.kallsyms]  [k] memcg_events
 2.87% memcgstat [kernel.kallsyms]  [k] __memcg_slab_post_alloc_hook
 2.73% memcgstat [kernel.kallsyms]  [k] kmem_cache_free
 2.70% memcgstat [kernel.kallsyms]  [k] entry_SYSRETQ_unsafe_stack
 2.25% memcgstat [kernel.kallsyms]  [k] __memcg_slab_free_hook
 2.06% memcgstat [kernel.kallsyms]  [k] get_page_from_freelist

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Co-developed-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Link: https://lore.kernel.org/r/20251223044156.208250-5-roman.gushchin@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agomm: introduce bpf_get_root_mem_cgroup() BPF kfunc
Roman Gushchin [Tue, 23 Dec 2025 04:41:53 +0000 (20:41 -0800)] 
mm: introduce bpf_get_root_mem_cgroup() BPF kfunc

Introduce a BPF kfunc to get a trusted pointer to the root memory
cgroup. It's very handy to traverse the full memcg tree, e.g.
for handling a system-wide OOM.

It's possible to obtain this pointer by traversing the memcg tree
up from any known memcg, but it's sub-optimal and makes BPF programs
more complex and less efficient.

bpf_get_root_mem_cgroup() has KF_ACQUIRE | KF_RET_NULL semantics;
however, in reality it's not necessary to bump the corresponding
reference counter - the root memory cgroup is immortal and reference
counting is skipped, see css_get(). Once set, root_mem_cgroup is always
a valid memcg pointer. It's safe to call bpf_put_mem_cgroup() for the
pointer obtained with bpf_get_root_mem_cgroup(); it's effectively a no-op.
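
A small usage sketch (kfunc declarations approximated; the put is shown
only to illustrate that pairing it remains legal):

  extern struct mem_cgroup *bpf_get_root_mem_cgroup(void) __ksym;
  extern void bpf_put_mem_cgroup(struct mem_cgroup *memcg) __ksym;

  struct mem_cgroup *root = bpf_get_root_mem_cgroup();

  if (root) {
          /* ... walk the memcg tree starting from the root ... */
          bpf_put_mem_cgroup(root);  /* effectively a no-op for the root memcg */
  }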

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20251223044156.208250-4-roman.gushchin@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agomm: introduce BPF kfuncs to deal with memcg pointers
Roman Gushchin [Tue, 23 Dec 2025 04:41:52 +0000 (20:41 -0800)] 
mm: introduce BPF kfuncs to deal with memcg pointers

To effectively operate with memory cgroups in BPF there is a need
to convert css pointers to memcg pointers. A simple container_of()
cast, as used in kernel code, can't be used in BPF because
from the verifier's point of view it is an out-of-bounds memory access.

Introduce helper get/put kfuncs which can be used to get
a refcounted memcg pointer from the css pointer:
  - bpf_get_mem_cgroup,
  - bpf_put_mem_cgroup.

bpf_get_mem_cgroup() can take both a memcg's css and the corresponding
cgroup's "self" css. This allows it to be used with the existing cgroup
iterator, which iterates over the cgroup tree, not the memcg tree.
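
A usage sketch with the cgroup iterator (declarations approximated;
&cgrp->self is the cgroup's own css mentioned above):

  extern struct mem_cgroup *
  bpf_get_mem_cgroup(struct cgroup_subsys_state *css) __ksym;
  extern void bpf_put_mem_cgroup(struct mem_cgroup *memcg) __ksym;

  SEC("iter/cgroup")
  int dump_memcg(struct bpf_iter__cgroup *ctx)
  {
          struct cgroup *cgrp = ctx->cgroup;
          struct mem_cgroup *memcg;

          if (!cgrp)
                  return 0;

          memcg = bpf_get_mem_cgroup(&cgrp->self);
          if (!memcg)
                  return 0;
          /* ... read memcg data ... */
          bpf_put_mem_cgroup(memcg);
          return 0;
  }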

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20251223044156.208250-3-roman.gushchin@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agomm: declare memcg_page_state_output() in memcontrol.h
Roman Gushchin [Tue, 23 Dec 2025 04:41:51 +0000 (20:41 -0800)] 
mm: declare memcg_page_state_output() in memcontrol.h

To use memcg_page_state_output() in bpf_memcontrol.c move the
declaration from v1-specific memcontrol-v1.h to memcontrol.h.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://lore.kernel.org/r/20251223044156.208250-2-roman.gushchin@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agobpf: arm64: Fix sparse warnings
Puranjay Mohan [Fri, 19 Dec 2025 19:13:08 +0000 (11:13 -0800)] 
bpf: arm64: Fix sparse warnings

ctx->image is declared as __le32 because arm64 instructions are LE
regardless of the CPU's runtime endianness. emit_u32_data() emits raw
data, not instructions, so cast the value to __le32 to fix the sparse
warning.

Cast function pointer to void * before doing arithmetic.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251219191310.3204425-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agoselftests/bpf: add test case for BPF LSM hook bpf_lsm_mmap_file
Matt Bobrowski [Tue, 16 Dec 2025 13:30:00 +0000 (13:30 +0000)] 
selftests/bpf: add test case for BPF LSM hook bpf_lsm_mmap_file

Add a trivial test case asserting that the BPF verifier enforces
PTR_MAYBE_NULL semantics on the struct file pointer argument of BPF
LSM hook bpf_lsm_mmap_file().

Dereferencing the struct file pointer passed into bpf_lsm_mmap_file()
without explicitly performing a NULL check first should not be
permitted by the BPF verifier as it can lead to NULL pointer
dereferences and a kernel crash.

Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20251216133000.3690723-2-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agobpf: annotate file argument as __nullable in bpf_lsm_mmap_file
Matt Bobrowski [Tue, 16 Dec 2025 13:29:59 +0000 (13:29 +0000)] 
bpf: annotate file argument as __nullable in bpf_lsm_mmap_file

As reported in [0], anonymous memory mappings are not backed by a
struct file instance. Consequently, the struct file pointer passed to
the security_mmap_file() LSM hook is NULL in such cases.

The BPF verifier is currently unaware of this, allowing BPF LSM
programs to dereference this struct file pointer without needing to
perform an explicit NULL check. This leads to potential NULL pointer
dereference and a kernel crash.

Add a strong override for bpf_lsm_mmap_file() which annotates the
struct file pointer parameter with the __nullable suffix. This
explicitly informs the BPF verifier that this pointer (PTR_MAYBE_NULL)
can be NULL, forcing BPF LSM programs to perform a check on it before
dereferencing it.

[0] https://lore.kernel.org/bpf/5e460d3c.4c3e9.19adde547d8.Coremail.kaiyanm@hust.edu.cn/
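
On the BPF side, programs attached to this hook now have to check the
pointer before dereferencing it; a minimal sketch:

  SEC("lsm/mmap_file")
  int BPF_PROG(mmap_file_check, struct file *file, unsigned long reqprot,
               unsigned long prot, unsigned long flags)
  {
          /* file is PTR_MAYBE_NULL: anonymous mappings have no backing file */
          if (!file)
                  return 0;

          /* safe to dereference file only after the explicit NULL check */
          return 0;
  }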

Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/5e460d3c.4c3e9.19adde547d8.Coremail.kaiyanm@hust.edu.cn/
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20251216133000.3690723-1-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agox86/bpf: Avoid emitting LOCK prefix for XCHG atomic ops
Uros Bizjak [Mon, 8 Dec 2025 16:33:34 +0000 (17:33 +0100)] 
x86/bpf: Avoid emitting LOCK prefix for XCHG atomic ops

The x86 XCHG instruction is implicitly locked when one of the
operands is a memory location, making an explicit LOCK prefix
unnecessary.

Stop emitting the LOCK prefix for BPF_XCHG in the JIT atomic
read-modify-write helpers. This avoids redundant instruction
prefixes while preserving correct atomic semantics.

No functional change for other atomic operations.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20251208163420.7643-1-ubizjak@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks agoMerge branch 'bpf-optimize-recursion-detection-on-arm64'
Alexei Starovoitov [Sun, 21 Dec 2025 18:54:37 +0000 (10:54 -0800)] 
Merge branch 'bpf-optimize-recursion-detection-on-arm64'

Puranjay Mohan says:

====================
bpf: Optimize recursion detection on arm64

V2: https://lore.kernel.org/all/20251217233608.2374187-1-puranjay@kernel.org/
Changes in v2->v3:
- Added acked by Yonghong
- Patch 2:
        - Change alignment of active from 8 to 4
        - Use le32_to_cpu in place of get_unaligned_le32()

V1: https://lore.kernel.org/all/20251217162830.2597286-1-puranjay@kernel.org/
Changes in V1->V2:
- Patch 2:
        - Put preempt_enable()/disable() around RMW accesses to mitigate
          race conditions. Because on CONFIG_PREEMPT_RCU and sleepable
  bpf programs, preemption can cause no bpf prog to execute in
  case of recursion.

BPF programs detect recursion using a per-CPU 'active' flag in struct
bpf_prog. The trampoline currently sets/clears this flag with atomic
operations.

On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
operations are relatively slow. Unlike x86_64 - where per-CPU updates
can avoid cross-core atomicity - arm64 LSE atomics are always atomic
across all cores, which is unnecessary overhead for strictly per-CPU
state.

This patch removes atomics from the recursion detection path on arm64.

It was discovered in [1] that per-CPU atomics that don't return a value
were extremely slow on some arm64 platforms. Catalin added a fix in
commit 535fdfc5a228 ("arm64: Use load LSE atomics for the non-return
per-CPU atomic operations") to solve this issue, but it seems to have
caused a regression in the fentry benchmark.

Using the fentry benchmark from the bpf selftests shows the following:

  ./tools/testing/selftests/bpf/bench trig-fentry

 +---------------------------------------------+------------------------+
 |               Configuration                 | Total Operations (M/s) |
 +---------------------------------------------+------------------------+
 | bpf-next/master with Catalin's fix reverted |         51.770         |
 |---------------------------------------------|------------------------|
 | bpf-next/master                             |         43.271         |
 | bpf-next/master with this change            |         43.271         |
 +---------------------------------------------+------------------------+

All benchmarks were run on a KVM-based VM with Neoverse-V2 and 8 CPUs.

This patch yields a 25% improvement in this benchmark compared to
bpf-next. Notably, reverting Catalin's fix also results in a performance
gain for this benchmark, which is interesting but expected.

For completeness, this benchmark was also run with the change enabled on
x86-64, which resulted in a 30% regression. So the change is only
enabled on arm64.

P.S. - Here is more data with other program types:

 +-----------------+-----------+-----------+----------+
 |     Metric      |  Before   |   After   | % Diff   |
 +-----------------+-----------+-----------+----------+
 | fentry          |   43.149  |   53.948  | +25.03%  |
 | fentry.s        |   41.831  |   50.937  | +21.76%  |
 | rawtp           |   50.834  |   58.731  | +15.53%  |
 | fexit           |   31.118  |   34.360  | +10.42%  |
 | tp              |   39.536  |   41.632  |  +5.30%  |
 | syscall-count   |    8.053  |    8.305  |  +3.13%  |
 | fmodret         |   33.940  |   34.769  |  +2.44%  |
 | kprobe          |    9.970  |    9.998  |  +0.28%  |
 | usermode-count  |  224.886  |  224.839  |  -0.02%  |
 | kernel-count    |  154.229  |  153.043  |  -0.77%  |
 +-----------------+-----------+-----------+----------+

[1] https://lore.kernel.org/all/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop/
====================

Link: https://patch.msgid.link/20251219184422.2899902-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks ago bpf: arm64: Optimize recursion detection by not using atomics
Puranjay Mohan [Fri, 19 Dec 2025 18:44:18 +0000 (10:44 -0800)] 
bpf: arm64: Optimize recursion detection by not using atomics

BPF programs detect recursion using a per-CPU 'active' flag in struct
bpf_prog. The trampoline currently sets/clears this flag with atomic
operations.

On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
operations are relatively slow. Unlike on x86_64, where per-CPU updates
can avoid cross-core atomicity, arm64 LSE atomics are always atomic
across all cores, which is unnecessary overhead for strictly per-CPU
state.

This patch removes atomics from the recursion detection path on arm64 by
changing 'active' to a per-CPU array of four u8 counters, one per
context: {NMI, hard-irq, soft-irq, normal}. The running context uses a
non-atomic increment/decrement on its element.  After increment,
recursion is detected by reading the array as a u32 and verifying that
only the expected element changed; any change in another element
indicates inter-context recursion, and a value > 1 in the same element
indicates same-context recursion.

For example, starting from {0,0,0,0}, a normal-context trigger changes
the array to {0,0,0,1}.  If an NMI arrives on the same CPU and triggers
the program, the array becomes {1,0,0,1}. When the NMI context checks
the u32 against the expected mask for normal (0x00000001), it observes
0x01000001 and correctly reports recursion. Same-context recursion is
detected analogously.
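
A minimal userspace-style sketch of this scheme (function names are
illustrative; the kernel operates on the per-CPU 'active' array of
struct bpf_prog):

  #include <stdint.h>
  #include <string.h>

  enum ctx { CTX_NMI, CTX_HARDIRQ, CTX_SOFTIRQ, CTX_NORMAL };

  /* Returns 0 if entry is allowed, 1 if recursion was detected. */
  static int prog_enter(uint8_t active[4], enum ctx c)
  {
          uint32_t word, expected;
          uint8_t exp_bytes[4] = {0};

          active[c]++;                          /* non-atomic, per-CPU */

          memcpy(&word, active, sizeof(word));  /* all four counters at once */
          exp_bytes[c] = 1;                     /* only our byte may be set */
          memcpy(&expected, exp_bytes, sizeof(expected));

          if (word != expected) {
                  /* another context changed its byte, or ours is > 1 */
                  active[c]--;
                  return 1;
          }
          return 0;
  }

  static void prog_exit(uint8_t active[4], enum ctx c)
  {
          active[c]--;
  }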

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251219184422.2899902-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks ago bpf: move recursion detection logic to helpers
Puranjay Mohan [Fri, 19 Dec 2025 18:44:17 +0000 (10:44 -0800)] 
bpf: move recursion detection logic to helpers

BPF programs detect recursion by doing an atomic inc/dec on a per-CPU
'active' counter from the trampoline. Create two helpers for operations
on this active counter; this makes it easy to change the recursion
detection logic in the future.

This commit makes no functional changes.
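
As a sketch of the refactor (helper names are illustrative, not
necessarily those used in the patch), the trampoline-facing logic
reduces to two small wrappers around the existing atomic counter:

  /* Kernel-context sketch: the trampoline calls these helpers instead
   * of open-coding the per-CPU atomic inc/dec, so the detection scheme
   * can later be changed in one place.
   */
  static bool bpf_prog_recursion_enter(struct bpf_prog *prog)
  {
          /* counter went from 0 to 1: no recursion */
          return this_cpu_inc_return(*prog->active) == 1;
  }

  static void bpf_prog_recursion_exit(struct bpf_prog *prog)
  {
          this_cpu_dec(*prog->active);
  }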

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251219184422.2899902-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 weeks ago Merge branch 'resolve_btfids-support-for-btf-modifications'
Andrii Nakryiko [Fri, 19 Dec 2025 18:55:40 +0000 (10:55 -0800)] 
Merge branch 'resolve_btfids-support-for-btf-modifications'

Ihor Solodrai says:

====================
resolve_btfids: Support for BTF modifications

This series changes resolve_btfids and the kernel build scripts to
enable BTF transformations in resolve_btfids. The main motivation for
enhancing resolve_btfids is to reduce the kernel build's dependency on
pahole capabilities [1] and to enable BTF features and optimizations
[2][3] specific to the kernel.

Patches #1-#4 in the series are non-functional changes in
resolve_btfids.

Patch #5 makes the kernel build notice pahole version changes between
builds.

Patch #6 raises the minimum pahole version required for
CONFIG_DEBUG_INFO_BTF to v1.22.

Patch #7 makes a small prep change in selftests/bpf build.

The last patch (#8) makes significant changes in resolve_btfids and
introduces scripts/gen-btf.sh. See implementation details in the patch
description.

Successful BPF CI run: https://github.com/kernel-patches/bpf/actions/runs/20378061470

[1] https://lore.kernel.org/dwarves/ba1650aa-fafd-49a8-bea4-bdddee7c38c9@linux.dev/
[2] https://lore.kernel.org/bpf/20251029190113.3323406-1-ihor.solodrai@linux.dev/
[3] https://lore.kernel.org/bpf/20251119031531.1817099-1-dolinux.peng@gmail.com/
---

v6->v7:
  - documentation edits in patches #5 and #6 (Nicolas)

v6: https://lore.kernel.org/bpf/20251219020006.785065-1-ihor.solodrai@linux.dev/

v5->v6:
  - patch #8: fix double free when btf__distill_base fails (reported by AI)
    https://lore.kernel.org/bpf/e269870b8db409800045ee0061fc02d21721e0efadd99ca83960b48f8db7b3f3@mail.kernel.org/

v5: https://lore.kernel.org/bpf/20251219003147.587098-1-ihor.solodrai@linux.dev/

v4->v5:
  - patch #3: fix an off-by-one bug (reported by AI)
    https://lore.kernel.org/bpf/106b6e71bce75b8f12a85f2f99e75129e67af7287f6d81fa912589ece14044f9@mail.kernel.org/
  - patch #8: cleanup GEN_BTF in Makefile.btf

v4: https://lore.kernel.org/bpf/20251218003314.260269-1-ihor.solodrai@linux.dev/

v3->v4:
  - add patch #4: "resolve_btfids: Always build with -Wall -Werror"
  - add patch #5: "kbuild: Sync kconfig when PAHOLE_VERSION changes" (Alan)
  - fix clang cross-compilation (LKP)
    https://lore.kernel.org/bpf/cecb6351-ea9a-4f8a-863a-82c9ef02f012@linux.dev/
  - remove GEN_BTF env variable (Andrii)
  - nits and cleanup in resolve_btfids/main.c (Andrii, Eduard)
  - nits in a patch bumping minimum pahole version (Andrii, AI)

v3: https://lore.kernel.org/bpf/20251205223046.4155870-1-ihor.solodrai@linux.dev/

v2->v3:
  - add patch #4 bumping minimum pahole version (Andrii, Alan)
  - add patch #5 pre-fixing resolve_btfids test (Donglin)
  - add GEN_BTF var and assemble RESOLVE_BTFIDS_FLAGS in Makefile.btf (Alan)
  - implement --distill_base flag in resolve_btfids, set it depending
    on KBUILD_EXTMOD in Makefile.btf (Eduard)
  - various implementation nits, see the v2 thread for details (Andrii, Eduard)

v2: https://lore.kernel.org/bpf/20251127185242.3954132-1-ihor.solodrai@linux.dev/

v1->v2:
  - gen-btf.sh and other shell script fixes (Donglin)
  - update selftests build (Donglin)
  - generate .BTF.base only when KBUILD_EXTMOD is set (Alan)
  - proper endianness handling for cross-compilation
  - change elf_begin mode from ELF_C_RDWR_MMAP to ELF_C_READ_MMAP_PRIVATE
  - remove compressed_section_fix()
  - nit NULL check in patch #3 (suggested by AI)

v1: https://lore.kernel.org/bpf/20251126012656.3546071-1-ihor.solodrai@linux.dev/
====================

Link: https://patch.msgid.link/20251219181321.1283664-1-ihor.solodrai@linux.dev
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
7 weeks ago resolve_btfids: Change in-place update with raw binary output
Ihor Solodrai [Fri, 19 Dec 2025 18:18:25 +0000 (10:18 -0800)] 
resolve_btfids: Change in-place update with raw binary output

Currently resolve_btfids updates the .BTF_ids section of an ELF file
in-place, based on the contents of the provided BTF (usually within the
same input file) and, optionally, a BTF base.

Change resolve_btfids behavior to enable BTF transformations as part
of its main operation. To achieve this, the in-place ELF write in
resolve_btfids is replaced with generation of the following binaries:
  * ${1}.BTF with .BTF section data
  * ${1}.BTF_ids with .BTF_ids section data if it existed in ${1}
  * ${1}.BTF.base with .BTF.base section data for out-of-tree modules

The execution of resolve_btfids and consumption of its output is
orchestrated by scripts/gen-btf.sh introduced in this patch.

The motivation for emitting binary data is that it simplifies the
resolve_btfids implementation by delegating the ELF update to the
$OBJCOPY tool [1], which is already widely used across the codebase.

There are two distinct paths for BTF generation and resolve_btfids
application in the kernel build: for vmlinux and for kernel modules.

For the vmlinux binary, a .BTF section is added in a roundabout way to
ensure correct linking. The patch doesn't change this approach; only
the implementation differs slightly.

Before this patch it worked as follows:

  * pahole consumed .tmp_vmlinux1 [2] and added a .BTF section to it
    with llvm-objcopy [3]
  * then everything except the .BTF section was stripped from .tmp_vmlinux1
    into a .tmp_vmlinux1.bpf.o object [2], later linked into vmlinux
  * resolve_btfids was executed later on vmlinux.unstripped [4],
    updating it in-place

After this patch gen-btf.sh implements the following:

  * pahole consumes .tmp_vmlinux1 and produces a *detached* file with
    raw BTF data
  * resolve_btfids consumes .tmp_vmlinux1 and detached BTF to produce
    (potentially modified) .BTF, and .BTF_ids sections data
  * a .tmp_vmlinux1.bpf.o object is then produced with objcopy copying
    BTF output of resolve_btfids
  * .BTF_ids data gets embedded into vmlinux.unstripped in
    link-vmlinux.sh by objcopy --update-section

For kernel modules, creating a special .bpf.o file is not necessary,
so the section data produced by resolve_btfids is embedded directly
with objcopy.

With this patch an ELF file becomes effectively read-only within
resolve_btfids, which allows deleting the elf_update() call and
satellite code (like compressed_section_fix [5]).

Endianness handling of .BTF_ids data is also changed. Previously the
"flags" part of the section was bswapped in sets_patch() [6], and then
Elf_Type was modified before elf_update() to signal to libelf that a
bswap may be necessary. With this patch we explicitly bswap the entire
data buffer on load and on dump.
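
The load/dump handling can be pictured as below, a sketch assuming the
.BTF_ids payload is a sequence of 32-bit words (the function name is
illustrative):

  #include <stddef.h>
  #include <stdint.h>

  /* Byte-swap every 32-bit word of a .BTF_ids data buffer; applied once
   * on load and once on dump, instead of relying on libelf's
   * Elf_Type-driven translation during elf_update().
   */
  static void btf_ids_bswap_words(uint32_t *words, size_t nwords)
  {
          for (size_t i = 0; i < nwords; i++)
                  words[i] = __builtin_bswap32(words[i]);
  }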

[1] https://lore.kernel.org/bpf/131b4190-9c49-4f79-a99d-c00fac97fa44@linux.dev/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/scripts/link-vmlinux.sh?h=v6.18#n110
[3] https://git.kernel.org/pub/scm/devel/pahole/pahole.git/tree/btf_encoder.c?h=v1.31#n1803
[4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/scripts/link-vmlinux.sh?h=v6.18#n284
[5] https://lore.kernel.org/bpf/20200819092342.259004-1-jolsa@kernel.org/
[6] https://lore.kernel.org/bpf/cover.1707223196.git.vmalik@redhat.com/

Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20251219181825.1289460-3-ihor.solodrai@linux.dev
7 weeks ago selftests/bpf: Run resolve_btfids only for relevant .test.o objects
Ihor Solodrai [Fri, 19 Dec 2025 18:18:24 +0000 (10:18 -0800)] 
selftests/bpf: Run resolve_btfids only for relevant .test.o objects

A selftest targeting resolve_btfids functionality relies on a resolved
.BTF_ids section being available in the TRUNNER_BINARY. The underlying
BTF data is taken from a special BPF program (btf_data.c), and so
resolve_btfids is executed as part of the TRUNNER_BINARY build recipe
on the final binary.

Subsequent patches in this series allow resolve_btfids to modify BTF
before resolving the symbols, which means that the test needs access
to that modified BTF [1]. Currently the test simply reads in
btf_data.bpf.o on the assumption that BTF hasn't changed.

Run resolve_btfids only on particular test objects (just
resolve_btfids.test.o for now). The test objects are linked into the
TRUNNER_BINARY, and so the .BTF_ids section will be available there.

This will make it trivial for the resolve_btfids test to access BTF
modified by resolve_btfids.

[1] https://lore.kernel.org/bpf/CAErzpmvsgSDe-QcWH8SFFErL6y3p3zrqNri5-UHJ9iK2ChyiBw@mail.gmail.com/

Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20251219181825.1289460-2-ihor.solodrai@linux.dev
7 weeks ago lib/Kconfig.debug: Set the minimum required pahole version to v1.22
Ihor Solodrai [Fri, 19 Dec 2025 18:18:23 +0000 (10:18 -0800)] 
lib/Kconfig.debug: Set the minimum required pahole version to v1.22

Subsequent patches in the series change vmlinux linking scripts to
unconditionally pass --btf_encode_detached to pahole, which was
introduced in v1.22 [1][2].

This change allows removing the PAHOLE_HAS_SPLIT_BTF Kconfig option and
other checks for older pahole versions.

[1] https://github.com/acmel/dwarves/releases/tag/v1.22
[2] https://lore.kernel.org/bpf/cbafbf4e-9073-4383-8ee6-1353f9e5869c@oracle.com/

Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Nicolas Schier <nsc@kernel.org>
Link: https://lore.kernel.org/bpf/20251219181825.1289460-1-ihor.solodrai@linux.dev