selftests/bpf: Add stress test for rqspinlock in NMI
Introduce a kernel module that will exercise lock acquisition in the NMI
path, and bias toward creating contention such that NMI waiters end up
being non-head waiters. Prior to the rqspinlock fix made in the commit 0d80e7f951be ("rqspinlock: Choose trylock fallback for NMI waiters"), it
was possible for the queueing path of non-head waiters to get stuck in
NMI, which this stress test reproduces fairly easily with just 3 CPUs.
Both AA and ABBA flavors are supported, and it will serve as a test case
for future fixes that address this corner case. More information about
the problem in question is available in the commit cited above. When the
fix is reverted, this stress test will lock up the system.
To enable this test automatically through the test_progs infrastructure,
add a load_module_params API to exercise both AA and ABBA cases when
running the test.
Note that the test runs for at most 5 seconds, and becomes a noop after
that, in order to allow the system to make forward progress. In
addition, CPU 0 is always kept untouched by the created threads and
NMIs. The test will automatically scale to the number of available
online CPUs.
Note that at least 3 CPUs are necessary to run this test, hence skip the
selftest in case the environment has less than 3 CPUs available.
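A rough sketch of the test_progs side under these assumptions (the module
name and the parameter string below are illustrative placeholders, not
necessarily the actual ones):

    /* Hedged sketch: load the stress module in one flavor via the new
     * helper; the "mode=aa" parameter name is hypothetical. */
    if (!ASSERT_OK(load_module_params("bpf_test_rqspinlock.ko", "mode=aa"),
                   "load_module_params"))
        return;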
Daniel Borkmann [Fri, 26 Sep 2025 17:12:01 +0000 (19:12 +0200)]
selftests/bpf: Add test case for different expected_attach_type
Add a small test case which adds two programs - one calling the other
through a tailcall - and check that BPF rejects them in case of different
expected_attach_type values:
# ./vmtest.sh -- ./test_progs -t xdp_devmap
[...]
#641/1 xdp_devmap_attach/DEVMAP with programs in entries:OK
#641/2 xdp_devmap_attach/DEVMAP with frags programs in entries:OK
#641/3 xdp_devmap_attach/Verifier check of DEVMAP programs:OK
#641/4 xdp_devmap_attach/DEVMAP with programs in entries on veth:OK
#641 xdp_devmap_attach:OK
Summary: 2/4 PASSED, 0 SKIPPED, 0 FAILED
Daniel Borkmann [Fri, 26 Sep 2025 17:12:00 +0000 (19:12 +0200)]
bpf: Enforce expected_attach_type for tailcall compatibility
Yinhao et al. recently reported:
Our fuzzer tool discovered an uninitialized pointer issue in the
bpf_prog_test_run_xdp() function within the Linux kernel's BPF subsystem.
This leads to a NULL pointer dereference when a BPF program attempts to
dereference the txq member of the struct xdp_buff object.
The test initializes two programs of BPF_PROG_TYPE_XDP: progA acts as the
entry point for bpf_prog_test_run_xdp() and its expected_attach_type can
be neither BPF_XDP_DEVMAP nor BPF_XDP_CPUMAP. progA calls into a slot
of a tailcall map it owns. progB's expected_attach_type must be BPF_XDP_DEVMAP
to pass xdp_is_valid_access() validation. The program returns struct xdp_md's
egress_ifindex, and the latter is only allowed to be accessed under mentioned
expected_attach_type. progB is then inserted into the tailcall which progA
calls.
The underlying issue goes beyond XDP though. Another example are programs
of type BPF_PROG_TYPE_CGROUP_SOCK_ADDR. sock_addr_is_valid_access() as well
as sock_addr_func_proto() have different logic depending on the programs'
expected_attach_type. Similarly, a program attached to BPF_CGROUP_INET4_GETPEERNAME
should not be allowed doing a tailcall into a program which calls bpf_bind()
out of BPF which is only enabled for BPF_CGROUP_INET4_CONNECT.
In short, specifying expected_attach_type allows opening up additional
functionality or restrictions beyond what the basic bpf_prog_type enables.
The use of tailcalls must not violate these constraints. Fix it by enforcing
expected_attach_type in __bpf_prog_map_compatible().
Note that we only enforce this for tailcall maps, but not for BPF devmaps or
cpumaps: There, the programs are invoked through dev_map_bpf_prog_run*() and
cpu_map_bpf_prog_run*() which set up a new environment / context and therefore
these situations are not prone to this issue.
Fixes: 5e43f899b03a ("bpf: Check attach type at prog load time")
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20250926171201.188490-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
D. Wythe [Fri, 26 Sep 2025 07:17:51 +0000 (15:17 +0800)]
libbpf: Fix error when st-prefix_ops and ops from differ btf
When a module registers a struct_ops, the struct_ops type and its
corresponding map_value type ("bpf_struct_ops_") may reside in different
btf objects; there are four possible cases:
+--------+---------------+-------------+---------------------------------+
| |bpf_struct_ops_| xxx_ops | |
+--------+---------------+-------------+---------------------------------+
| case 0 | btf_vmlinux | btf_vmlinux | be used and reg only in vmlinux |
+--------+---------------+-------------+---------------------------------+
| case 1 | btf_vmlinux | mod_btf | INVALID |
+--------+---------------+-------------+---------------------------------+
| case 2 | mod_btf | btf_vmlinux | reg in mod but be used both in |
| | | | vmlinux and mod. |
+--------+---------------+-------------+---------------------------------+
| case 3 | mod_btf | mod_btf | be used and reg only in mod |
+--------+---------------+-------------+---------------------------------+
Currently we figure out the mod_btf by searching with the struct_ops type,
which makes it impossible to figure out the mod_btf when the struct_ops
type is in btf_vmlinux while its corresponding map_value type is in
mod_btf (case 2).
The fix is to use the corresponding map_value type ("bpf_struct_ops_")
as the lookup anchor instead of the struct_ops type to figure out the
`btf` and `mod_btf` via find_ksym_btf_id(), and then we can locate
the kern_type_id via btf__find_by_name_kind() with the `btf` we just
obtained from find_ksym_btf_id().
With this change the lookup obtains the correct btf and mod_btf for case 2,
preserves correct behavior for other valid cases, and still fails as
expected for the invalid scenario (case 1).
Fixes: 590a00888250 ("bpf: libbpf: Add STRUCT_OPS support")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/bpf/20250926071751.108293-1-alibuda@linux.alibaba.com
selftests/bpf: Test changing packet data from kfunc
bpf_xdp_pull_data() is the first kfunc that changes packet data. Make
sure the verifier clears all packet pointers after calling a packet data
changing kfunc.
Tao Chen [Thu, 25 Sep 2025 17:50:30 +0000 (01:50 +0800)]
selftests/bpf: Add stacktrace map lookup_and_delete_elem test case
Add tests for stacktrace map lookup and delete:
1. use bpf_map_lookup_and_delete_elem to lookup and delete the target
stack_id,
2. lookup the deleted stack_id again to double check.
Tao Chen [Thu, 25 Sep 2025 17:50:29 +0000 (01:50 +0800)]
selftests/bpf: Refactor stacktrace_map case with skeleton
The loading method of the stacktrace_map test case looks too outdated,
refactor it with skeleton, and we can use global variable feature in
the next patch.
Tao Chen [Thu, 25 Sep 2025 17:50:28 +0000 (01:50 +0800)]
bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE
The stacktrace map can easily become full, which will lead to failure in
obtaining the stack. In addition to increasing the size of the map,
another solution is to delete the stack_id after looking it up from
the user, so extend the existing bpf_map_lookup_and_delete_elem()
functionality to stacktrace map types.
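From user space, the new functionality is reachable through the existing
libbpf wrapper; a minimal sketch (map fd and stack_id assumed to come from
the surrounding test):

    #include <bpf/bpf.h>

    /* Fetch and delete one stack trace, then confirm the entry is gone. */
    static int drain_stack(int map_fd, __u32 stack_id)
    {
        __u64 ips[127]; /* bounded by kernel.perf_event_max_stack */
        int err;

        err = bpf_map_lookup_and_delete_elem(map_fd, &stack_id, ips);
        if (err)
            return err;
        /* the second lookup must now fail with -ENOENT */
        return bpf_map_lookup_elem(map_fd, &stack_id, ips) ? 0 : -1;
    }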
bpf_cookie can fail on perf_event_open(), when it runs after the task_work
selftest. The task_work test causes perf to lower
sysctl_perf_event_sample_rate, and bpf_cookie uses sample_freq,
which is validated against that sysctl. As a result,
perf_event_open() rejects the attr if the (now tighter) limit is
exceeded.
From perf_event_open():
    if (attr.freq) {
        if (attr.sample_freq > sysctl_perf_event_sample_rate)
            return -EINVAL;
    } else {
        if (attr.sample_period & (1ULL << 63))
            return -EINVAL;
    }
Switch bpf_cookie to use sample_period, which is not checked against
sysctl_perf_event_sample_rate.
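In perf_event_attr terms, the change boils down to the following (values
illustrative):

    struct perf_event_attr attr = {
        .size = sizeof(attr),
        .type = PERF_TYPE_SOFTWARE,
        .config = PERF_COUNT_SW_CPU_CLOCK,
        /* was: .freq = 1, .sample_freq = ...; rejected once the
         * task_work test lowers sysctl_perf_event_sample_rate */
        .sample_period = 1000000, /* not checked against the sysctl */
    };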
selftests/bpf: Test changing packet data from global functions with a kfunc
The verifier should invalidate all packet pointers after a packet data
changing kfunc is called. So, similar to commit 3f23ee5590d9
("selftests/bpf: test for changing packet data from global functions"),
test changing packet data from global functions to make sure packet
pointers are indeed invalidated.
The task_work selftest does not properly handle cleanup during failures:
* destroy the bpf_link
* the perf event fd is passed to the bpf_link, so there is no need to
close it if the link was created successfully
* goto cleanup if fork() failed, and close the pipe.
Magnus Karlsson [Mon, 15 Sep 2025 12:01:44 +0000 (14:01 +0200)]
MAINTAINERS: Delete inactive maintainers from AF_XDP
Delete Björn Töpel and Jonathan Lemon as maintainer and reviewer,
respectively, as they have not been contributing towards AF_XDP for
several years. I have spoken to Björn and he is ok with his removal.
Andrea Righi [Wed, 24 Sep 2025 08:14:26 +0000 (10:14 +0200)]
bpf: Mark kfuncs as __noclone
Some distributions (e.g., CachyOS) support building the kernel with -O3,
but doing so may break kfuncs, resulting in their symbols not being
properly exported.
In fact, with gcc -O3, some kfuncs may be optimized away despite being
annotated as noinline. This happens because gcc can still clone the
function during IPA optimizations, e.g., by duplicating or inlining it
into callers, and then dropping the standalone symbol. This breaks BTF
ID resolution since resolve_btfids relies on the presence of a global
symbol for each kfunc.
Currently, this is not an issue for upstream, because we don't allow
building the kernel with -O3, but it may be safer to address it anyway,
to prevent potential issues in the future if compilers become more
aggressive with optimizations.
Therefore, add __noclone to __bpf_kfunc to ensure kfuncs are never
cloned and remain distinct, globally visible symbols, regardless of
the optimization level.
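The resulting annotation is a one-liner; a sketch, assuming the current
__used/__retain/noinline definition in include/linux/btf.h:

    /* keep kfuncs as standalone, globally visible symbols */
    #define __bpf_kfunc __used __retain noinline __noclone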
Fixes: 57e7c169cd6af ("bpf: Add __bpf_kfunc tag for marking kernel functions as kfuncs")
Acked-by: David Vernet <void@manifault.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Link: https://lore.kernel.org/r/20250924081426.156934-1-arighi@nvidia.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
====================
we recently had several requests for tetragon to be able to change
user application function return value or divert its execution through
instruction pointer change.
This patchset adds support for uprobe program to change app's registers
including instruction pointer.
v4 changes:
- rebased on bpf-next/master, we will handle the future simple conflict
with tip/perf/core
- changed condition in kprobe_prog_is_valid_access [Andrii]
- added acks
====================
Jiri Olsa [Tue, 16 Sep 2025 21:52:57 +0000 (23:52 +0200)]
uprobe: Do not emulate/sstep original instruction when ip is changed
If uprobe handler changes the instruction pointer we still execute
(single step or emulate) the original instruction and increment the
(new) ip with its length.
This makes the new instruction pointer bogus and application will
likely crash on illegal instruction execution.
If user decided to take execution elsewhere, it makes little sense
to execute the original instruction, so let's skip it.
Jiri Olsa [Tue, 16 Sep 2025 21:52:56 +0000 (23:52 +0200)]
bpf: Allow uprobe program to change context registers
Currently uprobe (BPF_PROG_TYPE_KPROBE) program can't write to the
context registers data. While this makes sense for kprobe attachments,
for uprobe attachment it might make sense to be able to change user
space registers to alter application execution.
Since uprobe and kprobe programs share the same type (BPF_PROG_TYPE_KPROBE),
we can't deny write access to context during the program load. We need
to check on it during program attachment to see if it's going to be
kprobe or uprobe.
Hence, store the program's attempt to write to the context at load
time and check it during the attachment.
Martin KaFai Lau [Tue, 23 Sep 2025 20:35:13 +0000 (13:35 -0700)]
Merge branch 'add-kfunc-bpf_xdp_pull_data'
Amery Hung says:
====================
Add kfunc bpf_xdp_pull_data
v7 -> v6
patch 5 (new patch)
- Rename variables in bpf_prog_test_run_xdp()
patch 6
- Fix bugs (Martin)
v6 -> v5
patch 6
- v5 selftest failed on S390 when changing how tailroom occupied by
skb_shared_info is calculated. Revert selftest to v4, where we get
SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) by running an XDP
program
patch 4
- Instead of adding is_xdp to bpf_test_init, move lower-bound check
of user_size to callers (Martin)
- Simplify linear data size calculation (Martin)
Tried to build on top of the mlx5 patchset by Christoph that avoids
copying payload to the linear part, but got a kernel panic. Will rebase
on that patchset if it gets merged first, or separate the mlx5 fix
from this set.
patch 1
- Remove the unnecessary head frag search (Dragos)
- Rewind the end frag pointer to simplify the change (Dragos)
- Rewind the end frag pointer and recalculate truesize only when the
number of frags changed (Dragos)
patch 3
- Fix len == zero behavior. To mirror bpf_skb_pull_data() correctly,
the kfunc should do nothing (Stanislav)
- Fix a pointer wrap around bug (Jakub)
- Use memmove() when moving sinfo->frags (Jakub)
selftests: drv-net: Pull data before parsing headers
It is possible for drivers to generate xdp packets with data residing
entirely in fragments. To keep parsing headers using direct packet
access, call bpf_xdp_pull_data() to pull headers into the linear data
area.
Test bpf_xdp_pull_data() with xdp packets with different layouts. The
xdp bpf program first checks if the layout is as expected. Then, it
calls bpf_xdp_pull_data(). Finally, it checks the 0xbb marker at offset
1024 using direct packet access.
bpf: Support specifying linear xdp packet data size for BPF_PROG_TEST_RUN
To test bpf_xdp_pull_data(), an xdp packet containing fragments as well
as free linear data area after xdp->data_end needs to be created.
However, bpf_prog_test_run_xdp() always fills the linear area with
data_in before creating fragments, leaving no space to pull data. This
patch will allow users to specify the linear data size through
ctx->data_end.
Currently, ctx_in->data_end must match data_size_in and will not be the
final ctx->data_end seen by xdp programs. This is because ctx->data_end
is populated according to the xdp_buff passed to test_run. The linear
data area available in an xdp_buff, max_linear_sz, is always filled up
before copying data_in into fragments.
This patch will allow users to specify the size of data that goes into
the linear area. When ctx_in->data_end is different from data_size_in,
only ctx_in->data_end bytes of data will be put into the linear area when
creating the xdp_buff.
While ctx_in->data_end will be allowed to be different from data_size_in,
it cannot be larger than the data_size_in as there will be no data to
copy from user space. If it is larger than the maximum linear data area
size, the layout suggested by the user will not be honored. Data beyond
max_linear_sz bytes will still be copied into fragments.
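From user space, this looks roughly as follows (packet buffer and sizes
illustrative):

    struct xdp_md ctx_in = {
        .data_end = 256, /* only 256 bytes go into the linear area */
    };
    LIBBPF_OPTS(bpf_test_run_opts, opts,
        .data_in = pkt,
        .data_size_in = sizeof(pkt), /* the rest lands in fragments */
        .ctx_in = &ctx_in,
        .ctx_size_in = sizeof(ctx_in),
    );
    err = bpf_prog_test_run_opts(prog_fd, &opts);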
Finally, since it is possible for a NIC to produce an xdp_buff with empty
linear data area, allow it when calling bpf_test_init() from
bpf_prog_test_run_xdp() so that we can test XDP kfuncs with such an
xdp_buff. This is done by moving the lower-bound check to callers, as most
of them already do except bpf_prog_test_run_skb(). The change also fixes a
bug that allowed passing an xdp_buff with data < ETH_HLEN. This could
happen when ctx is used and metadata is at least ETH_HLEN.
bpf: Make variables in bpf_prog_test_run_xdp less confusing
Change the variable naming in bpf_prog_test_run_xdp() to make the
overall logic less confusing. As different modes were added to the
function over the time, some variables got overloaded, making
it hard to understand and changing the code becomes error-prone.
Replace "size" with "linear_sz" where it refers to the size of metadata
and data. If "size" refers to input data size, use test.data_size_in
directly.
Replace "max_data_sz" with "max_linear_sz" to better reflect the fact
that it is the maximum size of metadata and data (i.e., linear_sz). Also,
xdp_rxq.frags_size is always PAGE_SIZE, so just set it directly instead
of subtracting headroom and tailroom and adding them back.
bpf: Clear packet pointers after changing packet data in kfuncs
bpf_xdp_pull_data() may change packet data and therefore packet pointers
need to be invalidated. Add bpf_xdp_pull_data() to the special kfunc
list instead of introducing a new KF_ flag until there are more kfuncs
changing packet data.
Add kfunc, bpf_xdp_pull_data(), to support pulling data from xdp
fragments. Similar to bpf_skb_pull_data(), bpf_xdp_pull_data() makes
the first len bytes of data directly readable and writable in bpf
programs. If the "len" argument is larger than the linear data size,
data in fragments will be copied to the linear data area when there
is enough room. Specifically, the kfunc will try to use the tailroom
first. When the tailroom is not enough, metadata and data will be
shifted down to make room for pulling data.
A use case of the kfunc is to decapsulate headers residing in xdp
fragments. It is possible for a NIC driver to place headers in xdp
fragments. To keep using direct packet access for parsing and
decapsulating headers, users can pull headers into the linear data
area by calling bpf_xdp_pull_data() and then pop the header with
bpf_xdp_adjust_head().
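A sketch of that decapsulation pattern (outer header length illustrative):

    extern int bpf_xdp_pull_data(struct xdp_md *xdp, __u32 len) __ksym;

    SEC("xdp")
    int decap(struct xdp_md *xdp)
    {
        const __u32 outer_len = 20; /* illustrative outer header size */

        /* make the possibly frag-resident header directly accessible */
        if (bpf_xdp_pull_data(xdp, outer_len))
            return XDP_DROP;
        /* ... parse the header with direct packet access ... */
        if (bpf_xdp_adjust_head(xdp, outer_len)) /* pop the header */
            return XDP_DROP;
        return XDP_PASS;
    }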
bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail
Move skb_frag_t adjustment into bpf_xdp_shrink_data() and extend its
functionality to be able to shrink an xdp fragment from both head and
tail. In a later patch, bpf_xdp_pull_data() will reuse it to shrink an
xdp fragment from head.
Additionally, in bpf_xdp_frags_shrink_tail(), breaking the loop when
bpf_xdp_shrink_data() returns false (i.e., not releasing the current
fragment) is not necessary as the loop condition, offset > 0, has the
same effect. Remove the else branch to simplify the code.
bpf: Clear pfmemalloc flag when freeing all fragments
It is possible for bpf_xdp_adjust_tail() to free all fragments. The
kfunc currently clears the XDP_FLAGS_HAS_FRAGS bit, but not
XDP_FLAGS_FRAGS_PF_MEMALLOC. So far, this has not caused an issue when
building sk_buff from xdp_buff since all readers of xdp_buff->flags
use the flag only when there are fragments. Clear the
XDP_FLAGS_FRAGS_PF_MEMALLOC bit as well to make the flags correct.
In the __arch_prepare_bpf_trampoline() function, retval_off is only
meaningful when save_ret is true, so the current logic is correct.
However, in the original logic, retval_off is only initialized under
certain conditions; for example, in the fmod_ret logic, the compiler
cannot know that the presence of fmod_ret programs implies that
BPF_TRAMP_F_CALL_ORIG is set in flags, which results in an
uninitialized-symbol compilation warning.
So initialize retval_off unconditionally to fix it.
bpftool: Add bash completion for program signing options
Commit 40863f4d6ef2 ("bpftool: Add support for signing BPF programs")
added new options for "bpftool prog load" and "bpftool gen skeleton".
This commit brings the relevant update to the bash completion file.
We slightly rework the processing of options to make completion more
resilient for options that take an argument.
====================
bpf: Allow union argument in trampoline based programs
While tracing 'release_pages' with bpfsnoop[0], the verifier reports:
The function release_pages arg0 type UNION is unsupported.
However, it should be acceptable to trace functions that have 'union'
arguments.
This patch set enables such support in the verifier by allowing 'union'
as a valid argument type.
Changes:
v3 -> v4:
* Address comments from Alexei:
* Trim bpftrace output in patch #1 log.
* Drop the referenced commit info and the test output in patch #2 log.
v2 -> v3:
* Address comments from Alexei:
* Reuse the existing flag BTF_FMODEL_STRUCT_ARG.
* Update the comment of the flag BTF_FMODEL_STRUCT_ARG.
v1 -> v2:
* Add 16B 'union' argument support in x86_64 trampoline.
* Update selftests using bpf_testmod.
* Add test case about 16-bytes 'union' argument.
* Address comments from Alexei:
* Study the patch set about 'struct' argument support.
* Update selftests to cover more cases.
v1: https://lore.kernel.org/bpf/20250905133226.84675-1-leon.hwang@linux.dev/
Leon Hwang [Fri, 19 Sep 2025 04:41:10 +0000 (12:41 +0800)]
selftests/bpf: Add union argument tests using fexit programs
Add test coverage for union argument support using fexit programs:
* 8B union argument - verify that the verifier accepts it and that fexit
programs can trace such functions.
* 16B union argument - verify that the verifier accepts it and that
fexit programs can access the argument, which is passed using two
registers.
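A hedged sketch of the 8-byte case, using the BPF_PROG2 convention for
by-value aggregate arguments (the traced function name is a placeholder
for the bpf_testmod hook):

    union kern_union {
        __u64 whole;
        struct { __u32 lo, hi; } halves;
    };

    SEC("fexit/bpf_testmod_union_arg") /* hypothetical target */
    int BPF_PROG2(check_union, union kern_union, u, int, ret)
    {
        bpf_printk("lo=%u hi=%u ret=%d", u.halves.lo, u.halves.hi, ret);
        return 0;
    }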
v3 -> v4:
v3: https://lore.kernel.org/all/20250915162848.54282-1-puranjay@kernel.org/
- Update bpf_jit_supports_insn() in riscv jit to reject signed arena loads (Eduard)
- Fix coding style related to braces usage in an if statement in x86 jit (Eduard)
v2 -> v3:
v2: https://lore.kernel.org/bpf/20250514175415.2045783-1-memxor@gmail.com/
- Fix encoding for the generated instructions in x86 JIT (Eduard)
The patch in v2 was generating instructions like:
42 63 44 20 f8 movslq -0x8(%rax,%r12), %eax
This doesn't make sense because movslq outputs a 64-bit result, but
the destination register here is set to eax (32-bit). The fix it to
set the REX.W bit in the opcode, that means changing
EMIT2(add_3mod(0x40, ...)) to EMIT2(add_3mod(0x48, ...))
- Add arm64 support
- Add selftests for signed loads from arena.
v1 -> v2:
v1: https://lore.kernel.org/bpf/20250509194956.1635207-1-memxor@gmail.com
- Use bpf_jit_supports_insn. (Alexei)
Currently, signed load instructions into arena memory are unsupported.
The compiler is free to generate these, and on GCC-14 we see a
corresponding error when it happens. The hurdle in supporting them is
deciding which unused opcode to use to mark them for the JIT's own
consumption. After much thinking, it appears 0xc0 / BPF_NOSPEC can be
combined with load instructions to identify signed arena loads. Use
this to recognize and JIT them appropriately, and remove the verifier
side limitation on the program if the JIT supports them.
====================
selftests: bpf: Add tests for signed loads from arena
Add tests for loading 8, 16, and 32 bits with sign extension from arena,
also verify that exception handling is working correctly and correct
assembly is being generated by the x86 and arm64 JITs.
Add support for signed loads from arena, which are internally converted
to loads with mode BPF_PROBE_MEM32SX by the verifier. The
implementation is similar to BPF_PROBE_MEMSX and BPF_MEMSX but for
BPF_PROBE_MEM32SX, arena_vm_base is added to the src register to form
the address.
Currently, signed load instructions into arena memory are unsupported.
The compiler is free to generate these, and on GCC-14 we see a
corresponding error when it happens. The hurdle in supporting them is
deciding which unused opcode to use to mark them for the JIT's own
consumption. After much thinking, it appears 0xc0 / BPF_NOSPEC can be
combined with load instructions to identify signed arena loads. Use
this to recognize and JIT them appropriately, and remove the verifier
side limitation on the program if the JIT supports them.
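For reference, the kind of C that makes the compiler emit such loads
(using the address_space(1) attribute as in the selftests'
bpf_arena_common.h):

    #define __arena __attribute__((address_space(1)))

    long read_signed(int __arena *p)
    {
        /* 32-bit load sign-extended to 64 bits: emitted as BPF_MEMSX,
         * rewritten by the verifier to BPF_PROBE_MEM32SX for arena */
        return *p;
    }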
This patch introduces a new mechanism for BPF programs to schedule
deferred execution in the context of a specific task using the kernel’s
task_work infrastructure.
The new bpf_task_work interface enables BPF use cases that
require sleepable subprogram execution within task context, for example,
scheduling a sleepable function from a context that does not
allow sleeping, such as NMI.
Introduce kfuncs bpf_task_work_schedule_signal() and
bpf_task_work_schedule_resume() for scheduling BPF callbacks,
corresponding to the different modes used by task_work (TWA_SIGNAL or TWA_RESUME).
The implementation manages scheduling state via metadata objects (struct
bpf_task_work_context). Pointers to bpf_task_work_context are stored
in BPF map values. State transitions are handled via an atomic
state machine (bpf_task_work_state) to ensure correctness under
concurrent usage and deletion, lifetime is guarded by refcounting and
RCU Tasks Trace.
Kfuncs call task_work_add() indirectly via irq_work to avoid locking in
potentially NMI context.
Changelog:
---
v7 -> v8
v7: https://lore.kernel.org/bpf/20250922232611.614512-1-mykyta.yatsenko5@gmail.com/
* Fix unused variable warning in patch 1
* Decrease stress test time from 2 to 1 second
* Went through CI warnings, other than unused variable, there are just
2 new in kernel/bpf/helpers.c related to newly introduced kfuncs, these
look expected.
v6 -> v7
v6: https://lore.kernel.org/bpf/20250918132615.193388-1-mykyta.yatsenko5@gmail.com/
* Added stress test
* Extending refactoring in patch 1
* Changing comment and removing one check for map->usercnt in patch 7
v5 -> v6
v5: https://lore.kernel.org/bpf/20250916233651.258458-1-mykyta.yatsenko5@gmail.com/
* Fixing readability in verifier.c:check_map_field_pointer()
* Removing BUG_ON from helpers.c
v4 -> v5
v4:
https://lore.kernel.org/all/20250915201820.248977-1-mykyta.yatsenko5@gmail.com/
* Fix invalid/null pointer dereference bug, reported by syzbot
* Nits in selftests
v3 -> v4
v3: https://lore.kernel.org/all/20250905164508.1489482-1-mykyta.yatsenko5@gmail.com/
* Modify async callback return value processing in verifier, to allow
non-zero return values.
* Change return type of the callback from void to int, as verifier
expects scalar value.
* Switched to void* for bpf_map API kfunc arguments to avoid casts.
* Addressing numerous nits and small improvements.
v2 -> v3
v2: https://lore.kernel.org/all/20250815192156.272445-1-mykyta.yatsenko5@gmail.com/
* Introduce ref counting
* Add patches with minor verifier and btf.c refactorings to avoid code
duplication
* Rework initiation of the task work scheduling to handle race with map
usercnt dropping to zero
====================
Add stress tests for BPF task-work scheduling kfuncs. The tests spawn
multiple threads that concurrently schedule task_work callbacks against
the same and different map values to exercise the kfuncs under high
contention.
Verify callbacks are reliably enqueued and executed with no drops.
Introduce selftests that check the BPF task work scheduling mechanism.
Validate that the verifier does not accept incorrect calls to the
bpf_task_work_schedule kfuncs.
Implementation of the new bpf_task_work_schedule kfuncs, that let a BPF
program schedule task_work callbacks for a target task:
* bpf_task_work_schedule_signal() - schedules with TWA_SIGNAL
* bpf_task_work_schedule_resume() - schedules with TWA_RESUME
Each map value should embed a struct bpf_task_work, which the kernel
side pairs with struct bpf_task_work_kern, containing a pointer to
struct bpf_task_work_ctx, that maintains metadata relevant for the
concrete callback scheduling.
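A rough usage sketch under these assumptions (attach point and map shape
are illustrative; the kfunc declaration via __ksym is elided):

    struct elem {
        struct bpf_task_work tw;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct elem);
    } tw_map SEC(".maps");

    static int tw_cb(struct bpf_map *map, void *key, void *value)
    {
        /* runs later in task context, where sleeping is allowed */
        return 0;
    }

    SEC("perf_event")
    int schedule_cb(void *ctx)
    {
        struct task_struct *task = bpf_get_current_task_btf();
        int key = 0;
        struct elem *e = bpf_map_lookup_elem(&tw_map, &key);

        if (e)
            bpf_task_work_schedule_resume(task, &e->tw, &tw_map,
                                          tw_cb, NULL);
        return 0;
    }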
A small state machine and refcounting scheme ensures safe reuse and
teardown. State transitions:
_______________________________
| |
v |
[standby] ---> [pending] --> [scheduling] --> [scheduled]
^ |________________|_________
| |
| v
| [running]
|_______________________________________________________|
All states may transition into FREED state:
[pending] [scheduling] [scheduled] [running] [standby] -> [freed]
A FREED terminal state coordinates with map-value
deletion (bpf_task_work_cancel_and_free()).
Scheduling itself is deferred via irq_work to keep the kfunc callable
from NMI context.
Lifetime is guarded with refcount_t + RCU Tasks Trace.
Main components:
* struct bpf_task_work_context – Metadata and state management per task
work.
* enum bpf_task_work_state – A state machine to serialize work
scheduling and execution.
* bpf_task_work_schedule() – The central helper that initiates
scheduling.
* bpf_task_work_acquire_ctx() - Attempts to take ownership of the context
pointed to by the passed struct bpf_task_work; allocates a new context
if none exists yet.
* bpf_task_work_callback() – Invoked when the actual task_work runs.
* bpf_task_work_irq() – An intermediate step (runs in softirq context)
to enqueue task work.
* bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
Flow of successful task work scheduling
1) bpf_task_work_schedule_* is called from BPF code.
2) Transition state from STANDBY to PENDING, mark context as owned by
this task work scheduler
3) irq_work_queue() schedules bpf_task_work_irq().
4) Transition state from PENDING to SCHEDULING (noop if transition
successful)
5) bpf_task_work_irq() attempts task_work_add(). If successful, state
transitions to SCHEDULED.
6) Task work calls bpf_task_work_callback(), which transitions state to
RUNNING.
7) BPF callback is executed
8) Context is cleaned up, refcounts released, context state set back to
STANDBY.
Calculation of the BPF map key, given a pointer to a value, is already
duplicated in a couple of places in helpers, and the next patch
introduces another use case.
This patch extracts that functionality into a separate function.
This patch adds necessary plumbing in verifier, syscall and maps to
support handling new kfunc bpf_task_work_schedule and kernel structure
bpf_task_work. The idea is similar to how we already handle bpf_wq and
bpf_timer.
verifier changes validate calls to bpf_task_work_schedule to make sure
it is safe and expected invariants hold.
btf part is required to detect bpf_task_work structure inside map value
and store its offset, which will be used in the next patch to calculate
key and value addresses.
arraymap and hashtab changes are needed to handle freeing of the
bpf_task_work: run code needed to deinitialize it, for example cancel
task_work callback if possible.
The use of bpf_task_work and proper implementation for kfuncs are
introduced in the next patch.
bpf: verifier: permit non-zero returns from async callbacks
The verifier currently enforces a zero return value for all async
callbacks—a constraint originally introduced for bpf_timer. That
restriction is too narrow for other async use cases.
Relax the rule by allowing non-zero return codes from async callbacks in
general, while preserving the zero-return requirement for bpf_timer to
maintain its existing semantics.
bpf: htab: extract helper for freeing special structs
Extract the cleanup of known embedded structs into the dedicated helper.
Remove duplication and introduce a single source of truth for freeing
special embedded structs in hashtab.
bpf: extract generic helper from process_timer_func()
Refactor the verifier by pulling the common logic from
process_timer_func() into a dedicated helper. This allows reusing
process_async_func() helper for verifying bpf_task_work struct in the
next patch.
Reduce code duplication in detection of the known special field types in
map values. This refactoring helps to avoid copying a chunk of code in
the next patch of the series.
BPF signing has gone through multiple discussions at various conferences with
the kernel and BPF community, and the following patch series is a culmination
of the current state of discussion on signed BPF programs. Once signing is
implemented, the next focus would be to implement the right security policies
for all BPF use-cases (dynamically generated bpf programs, simple non CO-RE
programs).
Signing also paves the way for allowing unprivileged users to
load vetted BPF programs and helps in adhering to the principle of least
privilege by avoiding unnecessary elevation of privileges to CAP_BPF and
CAP_SYS_ADMIN (of course, with the appropriate security policy active).
An early version of this design was proposed in [1]:
The key idea of the design is to use a signing algorithm that allows
us to integrity-protect a number of future payloads, including their
order, by creating a chain of trust.
Consider that Alice needs to send messages M_1, M_2, ..., M_n to Bob.
We define blocks of data such that:
B_i = M_i || H(B_{i+1})            (for i < n)
B_n = M_n || H(termination_marker)
(Each block contains its corresponding message and the hash of the
*next* block in the chain.)
Alice does the following (e.g., on a build system where all payloads
are available):
* Assembles the blocks B_1, B_2, ..., B_n.
* Calculates H(B_1) and signs it, yielding Sig(H(B_1)).
Alice sends the following to Bob:
M_1, H(B_2), Sig(H(B_1))
Bob receives this payload and does the following:
* Reconstructs B_1 as B_1' using the received M_1 and H(B_2)
(i.e., B_1' = M_1 || H(B_2)).
* Recomputes H(B_1') and verifies the signature against the
received Sig(H(B_1)).
* If the signature verifies, it establishes the integrity of M_1
and H(B_2) (and transitively, the integrity of the entire chain). Bob
now stores the verified H(B_2) until it receives the next message.
* When Bob receives M_2 (and H(B_3) if n > 2), it reconstructs
B_2' (e.g., B_2' = M_2 || H(B_3), or if n=2, B_2' = M_2 ||
H(termination_marker)). Bob then computes H(B_2') and compares it
against the stored H(B_2) that was verified in the previous step.
This process continues until the last block is received and verified.
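A minimal sketch of Bob's per-step check in C, using OpenSSL's SHA256()
as H (message size bound illustrative):

    #include <openssl/sha.h>
    #include <string.h>

    /* Reconstruct B_i' = M_i || H(B_{i+1}) and compare its hash with
     * the previously trusted H(B_i). */
    static int verify_step(const unsigned char *msg, size_t msg_len,
                           const unsigned char *next_hash,    /* H(B_{i+1}) */
                           const unsigned char *trusted_hash) /* H(B_i) */
    {
        unsigned char block[4096 + SHA256_DIGEST_LENGTH];
        unsigned char digest[SHA256_DIGEST_LENGTH];

        if (msg_len > 4096)
            return 0;
        memcpy(block, msg, msg_len);
        memcpy(block + msg_len, next_hash, SHA256_DIGEST_LENGTH);
        SHA256(block, msg_len + SHA256_DIGEST_LENGTH, digest);
        return memcmp(digest, trusted_hash, SHA256_DIGEST_LENGTH) == 0;
    }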
Now, applying this to the BPF signing use-case, we simplify to two messages:
M_1 = I_loader (the instructions of the loader program)
M_2 = M_metadata (the metadata for the loader program, passed in a
map, which includes the programs to be loaded and other context)
For this specific BPF case, we will directly sign a composite of the
first message and the hash of the second. Let H_meta = H(M_metadata).
The block to be signed is effectively:
B_signed = I_loader || H_meta
The signature generated is Sig(B_signed).
The process then follows a similar pattern to the Alice and Bob model,
where the kernel (Bob) verifies I_loader and H_meta using the
signature. Then, the trusted I_loader is responsible for verifying
M_metadata against the trusted H_meta.
From an implementation standpoint:
bpftool (or some other tool in a trusted build environment) knows
about the metadata (M_metadata) and the loader program (I_loader). It
first calculates H_meta = H(M_metadata). Then it constructs the object
to be signed and computes the signature:
Sig(I_loader || H_meta)
The loader program and the metadata are a hermetic representation of the source
of the eBPF program, its maps and context. The loader program is generated by
libbpf as a part of a standard API i.e. bpf_object__gen_loader.
While users can use light skeletons as a convenient method to use signing
support, they can directly use the loader program generation using libbpf
(bpf_object__gen_loader) into their own trusted toolchains.
libbpf, which has access to the program's instruction buffer, is a key part
of the TCB of the build environment.
An advanced threat model that does not intend to depend on libbpf (or any
provenant userspace BPF libraries) due to supply chain risks, despite it being
developed in the kernel source and by the kernel community, will require
reimplementing a lot of the core BPF userspace support (like instruction
relocation, map handling).
Such an advanced user would also need to integrate the generation of the loader
into their toolchain.
Given that many use-cases (e.g. Cilium) generate trusted BPF programs,
trusted loaders are an inevitability and a requirement for signing support,
and entrusting loader programs will be a fundamental requirement for any
security policy.
The initial instructions of the loader program verify the SHA256 hash
of the metadata (M_metadata) that will be passed in a map. These instructions
effectively embed the precomputed H_meta as immediate values.
The reason for a simpler UAPI is that it is more future proof, e.g. with more
stable instruction buffers and loader programs being generated directly by
compilers. A simple API also allows simple programs, e.g. for networking, that
don't need loader programs to use signing directly.
OBJ_GET_INFO_BY_FD is used to get information about BPF objects (maps, programs, links) and
returning the hash of the map is a natural extension of the UAPI as it can be
helpful for debugging, fingerprinting etc.
Currently, it's only implemented for BPF_MAP_TYPE_ARRAY. It can be trivially
extended for BPF programs to return the complete SHA256 along with the tag.
The SHA is stored in struct bpf_map for exclusive and frozen maps.
An exclusive map is created by passing an excl_prog_hash
(and excl_prog_hash_size) in the BPF_MAP_CREATE command.
When a BPF program is subsequently loaded and it attempts to use this map,
the kernel will compare the program's own SHA256 hash against the one
registered with the map, if matching, it will be added to prog->used_maps[].
The program load will fail if the hashes do not match or if the map is
already in use by another (non-matching) exclusive program.
Exclusive maps ensure that no other BPF program can compromise the integrity
of the map after signature verification.
NOTE: Exclusive maps cannot be added as inner maps.
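At the syscall level, creating such a map looks roughly like this (field
names as introduced by this series; sizes illustrative):

    union bpf_attr attr = {};

    attr.map_type = BPF_MAP_TYPE_ARRAY;
    attr.key_size = sizeof(__u32);
    attr.value_size = 64;
    attr.max_entries = 1;
    /* 32-byte SHA256 of the loader program, computed at build time */
    attr.excl_prog_hash = (__u64)(unsigned long)prog_sha256;
    attr.excl_prog_hash_size = 32;
    map_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));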
When the signed loader program is loaded, the kernel will:
* Compute the hash of the provided I_loader bytecode.
* Verify the signature against this computed hash.
* Check if the metadata map (now exclusive) is intended for this
program's hash.
The signature check happens in BPF_PROG_LOAD before the security_bpf_prog
LSM hook.
This ensures that the loaded loader program (I_loader), including the
embedded expected hash of the metadata (H_meta), is trusted.
Since the loader program is now trusted, it can be entrusted to verify
the actual metadata (M_metadata) read from the (now exclusive and
frozen) map against the embedded (and trusted) H_meta. There is no
Time-of-Check-Time-of-Use (TOCTOU) vulnerability here because:
* The signature covers the I_loader and its embedded H_meta.
* The metadata map M_metadata is frozen before the loader program is loaded
and associated with it.
* The map is made exclusive to the specific (signed and verified)
loader program.
selftests/bpf: Enable signature verification for some lskel tests
The test harness uses the verify_sig_setup.sh script to generate the required
key material for program signing.
Generate key material for signing some lskel programs and use
xxd to convert the verification certificate into a C header file.
Finally, update the main test runner to load this
certificate into the session keyring via the add_key() syscall before
executing any tests. Use the session keyring in the tests with signed
programs.
* For gen skeleton, embed a pre-generated signature into the C skeleton
file. This supports the use of signed programs in compiled applications.
bpftool gen skeleton -S -k <private_key> -i <identity_cert> fentry_test.bpf.o
Generation of the loader program and its metadata map is implemented in
libbpf (bpf_object__gen_loader). bpftool generates a skeleton that loads
the program and automates the required steps: freezing the map, creating
an exclusive map, loading, and running. Users can use standard libbpf
APIs directly or integrate loader program generation into their own
toolchains.
libbpf: Embed and verify the metadata hash in the loader
To fulfill the BPF signing contract, represented as Sig(I_loader ||
H_meta), the generated trusted loader program must verify the integrity
of the metadata. This signature cryptographically binds the loader's
instructions (I_loader) to a hash of the metadata (H_meta).
The verification process is embedded directly into the loader program.
Upon execution, the loader loads the runtime hash from struct bpf_map
i.e. BPF_PSEUDO_MAP_IDX and compares this runtime hash against an
expected hash value that has been hardcoded directly by
bpf_object__gen_loader.
The load from struct bpf_map can be improved by calling
BPF_OBJ_GET_INFO_BY_FD from the kernel context, once BPF_OBJ_GET_INFO_BY_FD
has been updated to be callable from the kernel context.
The following instructions are generated:
    ld_imm64 r1, const_ptr_to_map    // insn[0].src_reg == BPF_PSEUDO_MAP_IDX
    r2 = *(u64 *)(r1 + 0);
    ld_imm64 r3, sha256_of_map_part1 // constant precomputed by bpftool (part of H_meta)
    if r2 != r3 goto out;
    r2 = *(u64 *)(r1 + 8);
    ld_imm64 r3, sha256_of_map_part2 // (part of H_meta)
    if r2 != r3 goto out;
    r2 = *(u64 *)(r1 + 16);
    ld_imm64 r3, sha256_of_map_part3 // (part of H_meta)
    if r2 != r3 goto out;
    r2 = *(u64 *)(r1 + 24);
    ld_imm64 r3, sha256_of_map_part4 // (part of H_meta)
    if r2 != r3 goto out;
    ...
* The metadata map is created as an exclusive map (with an
excl_prog_hash). This restricts map access exclusively to the signed
loader program, preventing tampering by other processes.
* The map is then frozen, making it read-only from userspace.
* BPF_OBJ_GET_INFO_BY_FD instructs the kernel to compute the hash of the
metadata map (H') and store it in bpf_map->sha.
* The loader is then loaded with the signature which is then verified by
the kernel.
Loading signed programs prebuilt into the kernel is not currently
supported. This can be supported by enabling BPF_OBJ_GET_INFO_BY_FD to be
called from the kernel.
bpf: Implement signature verification for BPF programs
This patch extends the BPF_PROG_LOAD command by adding three new fields
to `union bpf_attr` in the user-space API:
- signature: A pointer to the signature blob.
- signature_size: The size of the signature blob.
- keyring_id: The serial number of a loaded kernel keyring (e.g.,
the user or session keyring) containing the trusted public keys.
When a BPF program is loaded with a signature, the kernel:
1. Retrieves the trusted keyring using the provided `keyring_id`.
2. Verifies the supplied signature against the BPF program's
instruction buffer.
3. If the signature is valid and was generated by a key in the trusted
keyring, the program load proceeds.
4. If no signature is provided, the load proceeds as before, allowing
for backward compatibility. LSMs can choose to restrict unsigned
programs and implement a security policy.
5. If signature verification fails for any reason,
the program is not loaded.
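Putting the new fields together, a signed load boils down to the
following sketch (other mandatory prog-load fields elided):

    union bpf_attr attr = {};

    attr.prog_type = BPF_PROG_TYPE_SYSCALL; /* e.g. a loader program */
    attr.insns = (__u64)(unsigned long)insns;
    attr.insn_cnt = insn_cnt;
    attr.license = (__u64)(unsigned long)"GPL";
    attr.signature = (__u64)(unsigned long)sig; /* signature blob */
    attr.signature_size = sig_len;
    attr.keyring_id = keyring_serial; /* e.g. the session keyring */
    prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));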
The main reason is that 'r8' in insn '70' is not an arena pointer.
Further debugging at llvm side shows that llvm commit ([1]) caused
the failure. For the original code:
page[i] = NULL;
page[i + 1] = NULL;
llvm transformed it to something like the following at the source level:
__builtin_memset(&page[i], 0, 16)
Such transformation prevents llvm BPFCheckAndAdjustIR pass from
generating proper addr_space_cast insns ([2]).
Adding support in the llvm BPFCheckAndAdjustIR pass should work, but it is
unclear whether such a pattern exists in real applications.
At the same time, simply adding a compiler memory barrier between the two
'page' assignments fixes the issue.
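Something along these lines (barrier() as defined in bpf_helpers.h):

    page[i] = NULL;
    barrier(); /* compiler barrier: stops llvm from merging the two
                * stores into a __builtin_memset() */
    page[i + 1] = NULL;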
Tom Stellard [Wed, 17 Sep 2025 18:38:47 +0000 (11:38 -0700)]
bpftool: Fix -Wuninitialized-const-pointer warnings with clang >= 21
This fixes the build with -Werror -Wall.
btf_dumper.c:71:31: error: variable 'finfo' is uninitialized when passed as a const pointer argument here [-Werror,-Wuninitialized-const-pointer]
71 | info.func_info = ptr_to_u64(&finfo);
| ^~~~~
prog.c:2294:31: error: variable 'func_info' is uninitialized when passed as a const pointer argument here [-Werror,-Wuninitialized-const-pointer]
2294 | info.func_info = ptr_to_u64(&func_info);
|
Tao Chen [Fri, 19 Sep 2025 03:48:16 +0000 (11:48 +0800)]
bpftool: Fix UAF in get_delegate_value
The return value ret pointer is pointing opts_copy, but opts_copy
gets freed in get_delegate_value before return, fix this by free
the mntent->mnt_opts strdup memory after show delegate value.
The verifier processes this program by exploring two paths:
- 1 -> 2 -> 3 -> 4
- 1 -> 2 -> 3 -> 5 -> 6
When instruction (5) is processed, the current liveness tracking
mechanism moves up the register parent links and records a "read" mark
for stack slot -8 at checkpoint #1, stopping because of the "write"
mark recorded at instruction (2).
This patch set replaces the existing liveness tracking mechanism with
a path-insensitive data flow analysis. The program above is processed
as follows:
- a data structure representing live stack slots for
instructions 1-6 in frame #0 is allocated;
- when instruction (2) is processed, record that slot -8 is written at
instruction (2) in frame #0;
- when instruction (5) is processed, record that slot -8 is read at
instruction (5) in frame #0;
- when instruction (6) is processed, propagate read mark for slot -8
up the control flow graph to instructions 3 and 2.
The key difference is that the new mechanism operates on a control
flow graph and associates read and write marks with pairs of (call
chain, instruction index). In contrast, the old mechanism operates on
verifier states and register parent links, associating read and write
marks with verifier states.
Motivation
==========
As it stands, this patch set makes liveness tracking slightly less
precise, as it no longer distinguishes individual program paths taken
by the verifier during symbolic execution.
See the "Impact on verification performance" section for details.
However, this change is intended as a stepping stone toward the
following goals:
- Short term, integrate precision tracking into liveness analysis and
remove the following code:
- verifier backedge states accumulation in is_state_visited();
- most of the logic for precision tracking;
- jump history tracking.
- Long term, help with more efficient loop verification handling.
In a sense, precision tracking is very similar to liveness tracking.
The data flow equations for liveness tracking look as follows:
live_after = U [state[s].live_before for s in insn_successors(i)]
state[i].live_before = (live_after / state[i].must_write) U state[i].may_read
While data flow equations for precision tracking look as follows:
precise_after =
U [state[s].precise_before for s in insn_successors(i)]
// if some of the instruction outputs are precise,
// assume its inputs to be precise
induced_precise =
⎧ state[i].may_read if (state[i].may_write ∩ precise_after) ≠ ∅
⎨
⎩ ∅ otherwise
Where:
- `may_read` set represents a union of all possibly read slots
(any slot in the `may_read` set might be read by the instruction);
- `must_write` set represents an intersection of all possibly written slots
(any slot in `must_write` set is guaranteed to be written by the instruction).
- `may_write` set represents a union of all possibly written slots
(any slot in `may_write` set might be written by the instruction).
This means that precision tracking can be implemented as a logical
extension of liveness tracking:
- track registers as well as stack slots;
- add bit masks to represent `precise_before` and `may_write`;
- add above equations for `precise_before` computation;
- (linked registers require some additional consideration).
Such extension would allow removal of:
- precision propagation logic in verifier.c:
- backtrack_insn()
- mark_chain_precision()
- propagate_{precision,backedges}()
- push_jmp_history() and related data structures, which are only used
by precision tracking;
- add_scc_backedge() and related backedge state accumulation in
is_state_visited(), superseded by per-callchain function state
accumulated by liveness analysis.
The hope here is that unifying liveness and precision tracking will
reduce overall amount of code and make it easier to reason about.
How this helps with loops?
--------------------------
As it stands, this patch set shares the same deficiency as the current
liveness tracking mechanism. Liveness marks on stack slots cannot be
used to prune states when processing iterator-based loops:
- such states still have branches to be explored;
- meaning that not all stack slot reads have been discovered.
For any checkpoint state created at instruction (1), it is only
possible to rely on read marks for slots fp[-8] and fp[-16] once all
child states of (1) have been explored. Thus, when the verifier
transitions from (7) to (1), it cannot rely on read marks.
However, sacrificing path-sensitivity makes it possible to run the
analysis defined in this patch set before the main verification pass,
if estimates for value ranges are available.
E.g. for the following program:
If an estimate for `r2` range is available before the main
verification pass, it can be used to populate read marks at
instruction (4) and run the liveness analysis. Thus making
conservative liveness information available during loops verification.
Such estimates can be provided by some form of value range analysis.
Value range analysis is also necessary to address loop verification
from another angle: computing boundaries for loop induction variables
and iteration counts.
The hope here is that the new liveness tracking mechanism will support
the broader goal of making loop verification more efficient.
Validation
==========
The change was tested on three program sets:
- bpf selftests
- sched_ext
- Meta's internal set of programs
Commit [#8] enables a special mode where both the current and new
liveness analyses are enabled simultaneously. This mode signals an
error if the new algorithm considers a stack slot dead while the
current algorithm assumes it is alive. This mode was very useful for
debugging. At the time of posting, no such errors have been reported
for the above program sets.
[#8] "bpf: signal error if old liveness is more conservative than new"
Impact on memory consumption
============================
Debug patch [1] extends the kernel and veristat to count the amount of
memory allocated for storing analysis data. This patch is not included
in the submission. The maximal observed impact for the above program
sets is 2.6Mb.
Data below is shown in bytes.
For bpf selftests top 5 consumers look as follows:
v1: https://lore.kernel.org/bpf/20250911010437.2779173-1-eddyz87@gmail.com/T/
v1 -> v2:
- compute_postorder() fixed to handle jumps with offset -1 (syzbot).
- is_state_visited() in patch #9 fixed access to uninitialized `err`
(kernel test robot, Dan Carpenter).
- Selftests added.
- Fixed bug with write marks propagation from callee to caller,
see verifier_live_stack.c:caller_stack_write() test case.
- Added a patch for __not_msg() annotation for test_loader based
tests.
v2: https://lore.kernel.org/bpf/20250918-callchain-sensitive-liveness-v2-0-214ed2653eee@gmail.com/
v2 -> v3:
- Added __diag_ignore_all("-Woverride-init", ...) in liveness.c for
bpf_insn_successors() (suggested by Alexei).
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
====================
Eduard Zingerman [Fri, 19 Sep 2025 02:18:45 +0000 (19:18 -0700)]
selftests/bpf: test cases for callchain sensitive live stack tracking
- simple propagation of read/write marks;
- joining read/write marks from conditional branches;
- avoid must_write marks when the same instruction accesses different
stack offsets on different execution paths;
- avoid must_write marks when the same instruction accesses stack
and non-stack pointers on different execution paths;
- read/write marks propagation to outer stack frame;
- independent read marks for different callchains ending with the same
function;
- bpf_calls_callback() dependent logic in
liveness.c:bpf_stack_slot_alive().
Eduard Zingerman [Fri, 19 Sep 2025 02:18:44 +0000 (19:18 -0700)]
selftests/bpf: __not_msg() tag for test_loader framework
This patch adds tags __not_msg(<msg>) and __not_msg_unpriv(<msg>).
The test fails if <msg> is found in the verifier log.
If __not_msg() is situated between __msg() tags, the framework matches
__msg() tags first, and then checks that <msg> is not present in the
portion of the log between the bracketing __msg() tags.
__not_msg() tags bracketed by the same __msg() group are effectively
unordered.
The idea is borrowed from LLVM's FileCheck with its CHECK-NOT syntax.
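Usage then looks like this (messages illustrative):

    SEC("socket")
    __success
    __msg("r0 = 0")
    /* must not appear between the bracketing __msg() tags */
    __not_msg("invalid access to map value")
    __msg("exit")
    __naked void example(void)
    {
        asm volatile ("r0 = 0; exit;" ::: __clobber_all);
    }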
Eduard Zingerman [Fri, 19 Sep 2025 02:18:43 +0000 (19:18 -0700)]
bpf: table based bpf_insn_successors()
Converting bpf_insn_successors() to use a lookup table makes it ~1.5
times faster.
Also remove unnecessary conditionals:
- `idx + 1 < prog->len` is unnecessary because after check_cfg() all
jump targets are guaranteed to be within a program;
- `i == 0 || succ[0] != dst` is unnecessary because any client of
bpf_insn_successors() can handle duplicate edges:
- compute_live_registers()
- compute_scc()
Moving bpf_insn_successors() to liveness.c allows its inlining in
liveness.c:__update_stack_liveness().
Such inlining speeds up __update_stack_liveness() by ~40%.
bpf_insn_successors() is used in both verifier.c and liveness.c.
perf shows such a move does not negatively impact users in verifier.c,
as these are executed only once before the main verification pass,
unlike __update_stack_liveness(), which can be triggered multiple
times.
Eduard Zingerman [Fri, 19 Sep 2025 02:18:42 +0000 (19:18 -0700)]
bpf: disable and remove registers chain based liveness
Remove register chain based liveness tracking:
- struct bpf_reg_state->{parent,live} fields are no longer needed;
- REG_LIVE_WRITTEN marks are superseded by bpf_mark_stack_write()
calls;
- mark_reg_read() calls are superseded by bpf_mark_stack_read();
- log.c:print_liveness() is superseded by logging in liveness.c;
- propagate_liveness() is superseded by bpf_update_live_stack();
- no need to establish register chains in is_state_visited() anymore;
- fix a bunch of tests expecting "_w" suffixes in verifier log
messages.
Eduard Zingerman [Fri, 19 Sep 2025 02:18:41 +0000 (19:18 -0700)]
bpf: signal error if old liveness is more conservative than new
Unlike the new algorithm, register chain based liveness tracking is
fully path sensitive, and thus should be strictly more accurate.
Validate the new algorithm by signaling an error whenever it considers
a stack slot dead while the old algorithm considers it alive.
Allocate analysis instance:
- Add bpf_stack_liveness_{init,free}() calls to bpf_check().
Notify the instance about any stack reads and writes:
- Add bpf_mark_stack_write() call at every location where
REG_LIVE_WRITTEN is recorded for a stack slot.
- Add bpf_mark_stack_read() call at every location mark_reg_read() is
called.
- Both bpf_mark_stack_{read,write}() rely on
env->liveness->cur_instance callchain being in sync with
env->cur_state. It is possible to update env->liveness->cur_instance
every time a mark read/write is called, but that costs a hash table
lookup and is noticeable in the performance profile. Hence, manually
reset env->liveness->cur_instance whenever the verifier changes
env->cur_state call stack:
- call bpf_reset_live_stack_callchain() when the verifier enters a
subprogram;
- call bpf_update_live_stack() when the verifier exits a subprogram
(it implies the reset).
Make sure bpf_update_live_stack() is called for a callchain before
issuing liveness queries. And make sure that bpf_update_live_stack()
is called for any callee callchain first:
- Add bpf_update_live_stack() call at every location that processes
BPF_EXIT:
- exit from a subprogram;
- before pop_stack() call.
This makes sure that bpf_update_live_stack() is called for callee
callchains before caller callchains.
Make sure must_write marks are set to zero for instructions that
do not always access the stack:
- Wrap do_check_insn() with bpf_reset_stack_write_marks() /
bpf_commit_stack_write_marks() calls.
Any calls to bpf_mark_stack_write() are accumulated between this
pair of calls. If no bpf_mark_stack_write() calls were made
it means that the instruction does not access the stack (at least
on the current verification path) and it is important to record
this fact.
Finally, use bpf_live_stack_query_init() / bpf_stack_slot_alive()
to query stack liveness info.
The manual tracking of the correct order for callee/caller
bpf_update_live_stack() calls is a bit convoluted and may warrant some
automation in future revisions.
Eduard Zingerman [Fri, 19 Sep 2025 02:18:39 +0000 (19:18 -0700)]
bpf: callchain sensitive stack liveness tracking using CFG
This commit adds a flow-sensitive, context-sensitive, path-insensitive
data flow analysis for live stack slots:
- flow-sensitive: uses program control flow graph to compute data flow
values;
- context-sensitive: collects data flow values for each possible call
chain in a program;
- path-insensitive: does not distinguish between separate control flow
graph paths reaching the same instruction.
Compared to the current path-sensitive analysis, this approach trades
some precision for not having to enumerate every path in the program.
This gives a theoretical capability to run the analysis before main
verification pass. See cover letter for motivation.
The basic idea is as follows:
- Data flow values indicate stack slots that might be read and stack
slots that are definitely written.
- Data flow values are collected for each
(call chain, instruction number) combination in the program.
- Within a subprogram, data flow values are propagated using control
flow graph.
- Data flow values are transferred from entry instructions of callee
subprograms to call sites in caller subprograms.
In other words, a tree of all possible call chains is constructed.
Each node of this tree represents a subprogram. Read and write marks
are collected for each instruction of each node. Live stack slots are
first computed for lower level nodes. Then, information about outer
stack slots that might be read or are definitely written by a
subprogram is propagated one level up, to the corresponding call
instructions of the upper nodes. The procedure repeats until the root
node is processed.
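The per-instruction propagation is the classic backward liveness data
flow. A minimal self-contained sketch, with stack slots modeled as
bits of a 64-bit mask (the mask representation is illustrative, not
the kernel's types):

  /* live_before = may_read | (live_after & ~must_write): a slot is
   * live before an instruction if the instruction may read it, or if
   * it is live after the instruction and not definitely overwritten. */
  static unsigned long long
  live_before(unsigned long long may_read, unsigned long long must_write,
              unsigned long long live_after)
  {
          return may_read | (live_after & ~must_write);
  }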
In the absence of value range analysis, stack read/write marks are
collected during main verification pass, and data flow computation is
triggered each time verifier.c:states_equal() needs to query the
information.
Implementation details are documented in kernel/bpf/liveness.c.
Quantitative data about verification performance changes and memory
consumption is in the cover letter.
Eduard Zingerman [Fri, 19 Sep 2025 02:18:38 +0000 (19:18 -0700)]
bpf: compute instructions postorder per subprogram
The next patch requires a postorder traversal of individual
subprograms. Facilitate this by moving env->cfg.insn_postorder
computation from check_cfg() to a separate pass, as check_cfg()
descends into called subprograms (and it needs to, because of
merge_callee_effects() logic).
env->cfg.insn_postorder is used only by compute_live_registers();
this function does not track cross-subprogram dependencies, thus the
change does not affect its operation.
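A per-subprogram postorder boils down to a DFS that stays within the
subprogram's own instructions. A minimal sketch (recursive for
brevity; types and array layout are illustrative):

  #include <stdbool.h>

  /* Append the instructions of one subprogram in postorder; succ[i]
   * lists the CFG successors of instruction i, restricted to the
   * same subprogram. */
  static void dfs_postorder(int insn, int **succ, int *succ_cnt,
                            bool *visited, int *postorder, int *cnt)
  {
          int i;

          if (visited[insn])
                  return;
          visited[insn] = true;
          for (i = 0; i < succ_cnt[insn]; i++)
                  dfs_postorder(succ[insn][i], succ, succ_cnt,
                                visited, postorder, cnt);
          postorder[(*cnt)++] = insn;
  }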
Eduard Zingerman [Fri, 19 Sep 2025 02:18:36 +0000 (19:18 -0700)]
bpf: remove redundant REG_LIVE_READ check in stacksafe()
stacksafe() is called in exact == NOT_EXACT mode only for states that
have been processed by clean_verifier_states(). The latter replaces
dead stack spills with a series of STACK_INVALID masks. Such masks are
already handled by stacksafe().
Eduard Zingerman [Fri, 19 Sep 2025 02:18:35 +0000 (19:18 -0700)]
bpf: use compute_live_registers() info in clean_func_state
Prepare for bpf_reg_state->live field removal by leveraging
insn_aux_data->live_regs_before instead of bpf_reg_state->live in
clean_func_state(). This is similar to logic in
func_states_equal(). No changes in verification performance for
selftests or sched_ext.
Eduard Zingerman [Fri, 19 Sep 2025 02:18:34 +0000 (19:18 -0700)]
bpf: bpf_verifier_state->cleaned flag instead of REG_LIVE_DONE
Prepare for bpf_reg_state->live field removal by introducing a
separate flag to track if clean_verifier_state() had been applied to
the state. No functional changes.
bpf: Return hashes of maps in BPF_OBJ_GET_INFO_BY_FD
Currently only array maps are supported, but the implementation can be
extended for other maps and objects. The hash is memoized only for
exclusive and frozen maps as their content is stable until the exclusive
program modifies the map.
This is required for BPF signing, enabling a trusted loader program to
verify a map's integrity. The loader retrieves
the map's runtime hash from the kernel and compares it against an
expected hash computed at build time.
Implement setters and getters that allow a map to be registered as
exclusive to a specified program. The registration must be done
before the exclusive program is loaded.
Use AF_ALG sockets so that libbpf does not depend on OpenSSL. The
helper is used by the loader generation code to embed the metadata
hash in the loader program, and by the bpf_map__make_exclusive API to
calculate the hash of the program the map is exclusive to.
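For reference, AF_ALG makes such hashing possible without any
userspace crypto library. A minimal self-contained sketch of the
pattern (error handling trimmed; this is not libbpf's actual helper):

  #include <unistd.h>
  #include <sys/socket.h>
  #include <linux/if_alg.h>

  static int sha256_af_alg(const void *data, size_t len,
                           unsigned char out[32])
  {
          struct sockaddr_alg sa = {
                  .salg_family = AF_ALG,
                  .salg_type   = "hash",
                  .salg_name   = "sha256",
          };
          int tfm, op, err = -1;

          tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
          if (tfm < 0)
                  return -1;
          if (bind(tfm, (struct sockaddr *)&sa, sizeof(sa)) == 0) {
                  op = accept(tfm, NULL, NULL);
                  if (op >= 0) {
                          if (send(op, data, len, 0) == (ssize_t)len &&
                              read(op, out, 32) == 32)
                                  err = 0;
                          close(op);
                  }
          }
          close(tfm);
          return err;
  }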
Exclusive maps restrict map access to specific programs using a hash.
The current hash used for this is SHA1, which is prone to collisions.
This patch uses SHA256, which is more resilient against
collisions. This new hash is stored in bpf_prog and used by the verifier
to determine if a program can access a given exclusive map.
The original 64-bit tags are kept, as they are used by users as a short,
possibly colliding program identifier for non-security purposes.
Currently, KF_RCU_PROTECTED only applies to iterator APIs, and even
there in a convoluted fashion: the presence of this flag on the kfunc
is used to set MEM_RCU in the iterator type, and the lack of RCU
protection results in an error only later, once the next() or
destroy() methods are invoked on the iterator. While there is no bug,
this is certainly unintuitive, and it makes enforcement of the flag
iterator-specific.
In the interest of making this flag useful for other upcoming kfuncs,
e.g. scx_bpf_cpu_curr() [0][1], add enforcement for invoking the kfunc
in an RCU critical section in general.
In addition to this, the aforementioned kfunc also needs to return an
RCU protected pointer, which currently has no generic kfunc flag or
annotation. Add such a flag as well while we are at it.
* Drop KF_RET_RCU and fold change into KF_RCU_PROTECTED. (Andrea, Alexei)
* Update tests for non-struct pointer return values with KF_RCU_PROTECTED.
====================
With this enforcement in place, iterator APIs using KF_RCU_PROTECTED
error out earlier, at the kfunc call itself, instead of only once the
next() or destroy() methods are invoked without RCU CS protection.
In addition to this, if the kfuncs tagged KF_RCU_PROTECTED return a
pointer value, ensure that this pointer value is only usable in an RCU
critical section. There might be edge cases where the return value is
special and doesn't need to imply MEM_RCU semantics, but in general, the
assumption should hold for the majority of kfuncs, and we can revisit
things if necessary later.
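As an illustration, registration and use of such a kfunc could look
as follows (BTF_ID_FLAGS() is the kernel's kfunc registration macro;
the KF_RET_NULL pairing and the BPF-side variable names are
assumptions, not the exact upstream change):

  BTF_KFUNCS_START(rcu_protected_ids)
  BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_RCU_PROTECTED | KF_RET_NULL)
  BTF_KFUNCS_END(rcu_protected_ids)

  /* BPF program side: both the call and the returned pointer must
   * stay inside the RCU critical section. */
  struct task_struct *p;
  s32 pid = 0;

  bpf_rcu_read_lock();
  p = scx_bpf_cpu_curr(cpu);
  if (p)
          pid = p->pid; /* p carries MEM_RCU: invalid after unlock */
  bpf_rcu_read_unlock();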
Eduard Zingerman [Tue, 16 Sep 2025 21:22:51 +0000 (14:22 -0700)]
selftests/bpf: trigger verifier.c:maybe_exit_scc() for a speculative state
This is a test case minimized from a syzbot reproducer from [1].
The test case triggers verifier.c:maybe_exit_scc() without a
preceding call to verifier.c:maybe_enter_scc() on a speculative
symbolic execution path.
- Non-speculative execution path 0-3 does not allocate any checkpoints
(and hence does not call maybe_enter_scc()), and schedules a
speculative jump from 2 to 1.
- Speculative execution path stops immediately because of an infinite
loop detection and triggers verifier.c:update_branch_counts() ->
maybe_exit_scc() calls.
Eduard Zingerman [Tue, 16 Sep 2025 21:22:50 +0000 (14:22 -0700)]
bpf: don't report verifier bug for missing bpf_scc_visit on speculative path
Syzbot generated a program that triggers a verifier_bug() call in
maybe_exit_scc(). maybe_exit_scc() assumes that, when called for a
state with insn_idx in some SCC, there should be an instance of struct
bpf_scc_visit allocated for that SCC. Turns out the assumption does
not hold for speculative execution paths. See example in the next
patch.
maybe_exit_scc() is called from update_branch_counts() for states that
reach branch count of zero, meaning that path exploration for a
particular path is finished. Path exploration can finish in one of
three ways:
a. Verification error is found. In this case, update_branch_counts()
is called only for non-speculative paths.
b. Top level BPF_EXIT is reached. Such instructions are never a part of
an SCC, so compute_scc_callchain() in maybe_exit_scc() will return
false, and maybe_exit_scc() will return early.
c. A checkpoint is reached and matched. Checkpoints are created by
is_state_visited(), which calls maybe_enter_scc(), which allocates
bpf_scc_visit instances for checkpoints within SCCs.
Hence, for non-speculative symbolic execution paths, the assumption
still holds: if maybe_exit_scc() is called for a state within an SCC,
a bpf_scc_visit instance must exist.
This patch removes the verifier_bug() call for speculative paths.
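The shape of the fix can be sketched as follows (field and helper
names follow the text above; this is not the exact upstream diff):

  /* Tolerate a missing bpf_scc_visit on speculative paths instead
   * of flagging a verifier bug. */
  if (!visit) {
          if (st->speculative)
                  return 0;
          verifier_bug(env, "no bpf_scc_visit for callchain");
          return -EFAULT;
  }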
Fixes: c9e31900b54c ("bpf: propagate read/precision marks over state graph backedges")
Reported-by: syzbot+3afc814e8df1af64b653@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/68c85acd.050a0220.2ff435.03a4.GAE@google.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250916212251.3490455-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Paul Chaignon [Wed, 17 Sep 2025 08:10:53 +0000 (10:10 +0200)]
selftests/bpf: Test accesses to ctx padding
This patch adds tests covering the various paddings in ctx structures.
In case of sk_lookup BPF programs, the behavior is a bit different
because accesses to the padding are explicitly allowed. Other cases
result in a clear reject from the verifier.
Paul Chaignon [Wed, 17 Sep 2025 08:08:00 +0000 (10:08 +0200)]
bpf: Explicitly check accesses to bpf_sock_addr
Syzkaller found a kernel warning on the following sock_addr program:
0: r0 = 0
1: r2 = *(u32 *)(r1 +60)
2: exit
which triggers:
verifier bug: error during ctx access conversion (0)
This happens because offset 60 in bpf_sock_addr corresponds to an
implicit padding of 4 bytes, right after msg_src_ip4. Access to this
padding isn't rejected in sock_addr_is_valid_access, so the verifier
later fails to convert the access.
This patch fixes it by explicitly checking the various fields of
bpf_sock_addr in sock_addr_is_valid_access.
I checked the other ctx structures and is_valid_access functions and
didn't find any other similar cases. Other cases of (properly handled)
padding are covered in new tests in a subsequent patch.
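A sketch of the explicit-field approach in
sock_addr_is_valid_access() (bpf_ctx_range()/bpf_ctx_range_till() are
the kernel's offset-range helpers; this is not the exact upstream
diff):

  switch (off) {
  case bpf_ctx_range(struct bpf_sock_addr, user_ip4):
  case bpf_ctx_range_till(struct bpf_sock_addr, user_ip6[0], user_ip6[3]):
  case bpf_ctx_range(struct bpf_sock_addr, msg_src_ip4):
          break;          /* explicitly listed fields are allowed */
  default:
          return false;   /* everything else, incl. padding, is rejected */
  }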
Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg")
Reported-by: syzbot+136ca59d411f92e821b7@syzkaller.appspotmail.com
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Closes: https://syzkaller.appspot.com/bug?extid=136ca59d411f92e821b7
Link: https://lore.kernel.org/bpf/b58609d9490649e76e584b0361da0abd3c2c1779.1758094761.git.paul.chaignon@gmail.com
If bpf_patch_insn_single() returns an error, the `new_data`
allocated at (1) will be freed at (3). However, at (2) this pointer
is stored in `env->insn_aux_data`, which is freed unconditionally by
verifier.c:bpf_check() on both happy and error paths, leading to a
double-free.
Fix this by removing the vfree() call at (3); ownership of `new_data`
has already passed to `env->insn_aux_data` at that point.
Alan Maguire [Thu, 11 Sep 2025 16:30:56 +0000 (17:30 +0100)]
selftests/bpf: More open-coded gettid syscall cleanup
Commit 0e2fb011a0ba ("selftests/bpf: Clean up open-coded gettid syscall
invocations") addressed the issue that older libc may not have a gettid()
function call wrapper for the associated syscall.
A few more instances have crept into tests; use sys_gettid() instead,
and poison raw gettid() usage to avoid future issues.
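The pattern is simple. A sketch of the wrapper and the compile-time
poisoning (assuming GCC/clang's #pragma poison, as the log
describes):

  #include <unistd.h>
  #include <sys/syscall.h>

  static inline pid_t sys_gettid(void)
  {
          return syscall(SYS_gettid);
  }

  /* Reject any future direct gettid() use at compile time. */
  #pragma GCC poison gettid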
====================
Remove use of current->cgns in bpf_cgroup_from_id
bpf_cgroup_from_id currently ends up doing a check on whether the cgroup
being looked up is a descendant of the root cgroup of the current task's
cgroup namespace. This leads to unreliable results, since this kfunc
can be invoked from arbitrary contexts, with an arbitrary value of
current. Fix this by removing namespace-awareness in the kfunc, and
include a test that detects such a case and fails without the fix.
* Add Ack from Tejun.
* Fix selftest to perform namespace migration and cgroup setup in a
child process to avoid changing test_progs namespace.
====================
selftests/bpf: Add a test for bpf_cgroup_from_id lookup in non-root cgns
Make sure that we only switch the cgroup namespace and enter a new
cgroup in a child process separate from test_progs, to not mess up the
environment for subsequent tests.
To remove this cgroup, we need to wait for the child to exit, and then
rmdir its cgroup. If the read call fails, or waitpid succeeds, we know
the child exited (the read call fails once the last write end of the
pipe is closed; otherwise, waitpid blocks until the child calls
exit(2)). We then invoke a newly introduced remove_cgroup_pid()
helper, which identifies the cgroup path using the passed-in pid of
the now-dead child, instead of the current process pid (getpid()).
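The parent-side synchronization can be sketched as follows (fd names
are illustrative, and remove_cgroup_pid()'s exact signature is an
assumption):

  char byte;

  read(pipe_fds[0], &byte, 1);   /* returns once the child, the sole
                                  * writer, exits and closes its end */
  waitpid(child_pid, NULL, 0);   /* reap the child */
  remove_cgroup_pid(child_pid);  /* rmdir cgroup derived from child pid */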
bpf: Do not limit bpf_cgroup_from_id to current's namespace
The bpf_cgroup_from_id kfunc relies on cgroup_get_from_id to obtain the
cgroup corresponding to a given cgroup ID. This helper can be called
in many contexts where the current task is arbitrary. A recent
example is its use in sched_ext's ops.tick() to obtain the root
cgroup pointer. Since the current task can be whatever user space
task was preempted by the timer tick, the behavior of the helper is
unreliable.
Refactor out __cgroup_get_from_id as the non-namespace aware version of
cgroup_get_from_id, and change bpf_cgroup_from_id to make use of it.
There is no compatibility breakage here: performing the lookup
against the root cgroup namespace only permits a wider set of lookups
to succeed. Cgroup IDs are globally unique across namespaces, and
thus don't need to be retranslated.
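A minimal BPF-side sketch: the acquire/release pairing is unchanged,
only the namespace filtering is gone (the cgid variable is
illustrative):

  struct cgroup *cgrp = bpf_cgroup_from_id(cgid);

  if (cgrp) {
          /* ... inspect cgrp ... */
          bpf_cgroup_release(cgrp);
  }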
On systems with CONFIG_NR_CPUS > 1024 in the kernel config, the
selftest fails as arena_spin_lock_irqsave() returns EOPNOTSUPP (e.g.,
on powerpc the default value for CONFIG_NR_CPUS is 8192). The
selftest is now skipped when the BPF program returns EOPNOTSUPP, with
a descriptive message logged.
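An illustrative skip pattern (test__skip() is test_progs'
infrastructure; the skeleton and field names are assumptions):

  if (skel->bss->err == -EOPNOTSUPP) {
          printf("arena_spin_lock_irqsave not supported, skipping\n");
          test__skip();
          return;
  }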
Leon Hwang [Mon, 15 Sep 2025 12:16:57 +0000 (20:16 +0800)]
selftests/bpf: Skip timer_interrupt case when bpf_timer is not supported
Like commit fbdd61c94bcb ("selftests/bpf: Skip timer cases when bpf_timer is not supported"),
the 'timer_interrupt' test case should be skipped if the verifier
rejects bpf_timer with -EOPNOTSUPP.
bpftool: Search for tracefs at /sys/kernel/tracing first
With "bpftool prog tracelog", bpftool prints messages from the trace
pipe. To do so, it first needs to find the tracefs mount point to open
the pipe. Bpftool looks at a few "default" locations, including
/sys/kernel/debug/tracing and /sys/kernel/tracing.
Some of these locations, namely /tracing and /trace, are not standard.
They are in the list because some users used to hardcode the tracing
directory to short names; but we have no compelling reason to look at
these locations. If we fail to find the tracefs at the default
locations, we have an additional step to find it by parsing /proc/mounts
anyway, so it's safe to remove these entries from the list of default
locations to check.
Additionally, Alexei reports that looking for the tracefs at
/sys/kernel/debug/tracing may automatically mount the file system under
that location, and generate a kernel log message telling that
auto-mounting there is deprecated. To avoid this message, let's swap the
order for checking the potential mount points: try /sys/kernel/tracing
first, which should be the standard location nowadays. The kernel log
message may still appear if the tracefs is not mounted on
/sys/kernel/tracing when we run bpftool.
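The resulting probe order can be sketched as follows (the array name
is illustrative; paths are the ones named above):

  static const char * const tracefs_locations[] = {
          "/sys/kernel/tracing",         /* standard location, tried first */
          "/sys/kernel/debug/tracing",   /* legacy path, may auto-mount */
  };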