Menglong Dong [Sat, 21 Jun 2025 04:55:01 +0000 (12:55 +0800)]
bpf: Make update_prog_stats() always_inline
The function update_prog_stats() is called from the BPF trampoline.
In most cases the compiler inlines it, but we can't rely on that
always happening, so mark it __always_inline to avoid the potential
call overhead.
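A minimal sketch of the change (the function lives in kernel/bpf/trampoline.c; the body is elided here):

  static __always_inline void update_prog_stats(struct bpf_prog *prog,
                                                u64 start)
  {
          /* per-CPU stats update elided; a real function call here would
           * add avoidable per-invocation overhead
           */
  }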
Yuan Chen [Fri, 20 Jun 2025 01:21:33 +0000 (09:21 +0800)]
bpftool: Fix memory leak in dump_xx_nlmsg on realloc failure
In function dump_xx_nlmsg(), when realloc() fails to allocate memory,
the original pointer to the buffer is overwritten with NULL. This causes
a memory leak because the previously allocated buffer becomes unreachable
without being freed.
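A minimal sketch of the safe-realloc idiom (a hypothetical helper, not the actual bpftool code):

  #include <stdlib.h>

  static int grow_buf(char **buf, size_t new_len)
  {
          char *tmp = realloc(*buf, new_len);

          if (!tmp) {
                  free(*buf);     /* without this, the old buffer leaks */
                  *buf = NULL;
                  return -1;
          }
          *buf = tmp;
          return 0;
  }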
Slava Imameev [Fri, 20 Jun 2025 15:18:12 +0000 (01:18 +1000)]
selftests/bpf: Add test for bpftool access to read-only protected maps
Add selftest cases that validate bpftool's expected behavior when
accessing maps protected from modification via security_bpf_map.
The test includes a BPF program attached to security_bpf_map with two maps:
- A protected map that only allows read-only access
- An unprotected map that allows full access
The test script attaches the BPF program to security_bpf_map and
verifies that for the bpftool map command:
- Read access works on both maps
- Write access fails on the protected map
- Write access succeeds on the unprotected map
- These behaviors remain consistent when the maps are pinned
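A hedged sketch of what such a protection could look like (the map name, the FMODE_WRITE value, and the program name are assumptions here, not the actual selftest code):

  #include "vmlinux.h"
  #include <errno.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  #define FMODE_WRITE 0x2 /* value from include/linux/fs.h */

  char _license[] SEC("license") = "GPL";

  SEC("lsm/bpf_map")
  int BPF_PROG(protect_map, struct bpf_map *map, fmode_t fmode)
  {
          /* deny write access to the protected map, allow everything else */
          if (!bpf_strncmp(map->name, sizeof("prot_map"), "prot_map") &&
              (fmode & FMODE_WRITE))
                  return -EPERM;
          return 0;
  }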
Slava Imameev [Fri, 20 Jun 2025 15:18:11 +0000 (01:18 +1000)]
bpftool: Use appropriate permissions for map access
Modify several functions in tools/bpf/bpftool/common.c to allow
specification of requested access for file descriptors, such as
read-only access.
Update bpftool to request only read access for maps when write
access is not required. This fixes errors when reading from maps
that are protected from modification via security_bpf_map.
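For example, obtaining a read-only map FD could look roughly like this (bpf_map_get_fd_by_id_opts() and BPF_F_RDONLY are existing libbpf/UAPI symbols; this is a sketch, not the actual bpftool diff):

  #include <bpf/bpf.h>

  static int open_map_fd_rdonly(__u32 id)
  {
          LIBBPF_OPTS(bpf_get_fd_by_id_opts, opts,
                  .open_flags = BPF_F_RDONLY,
          );

          /* an LSM that forbids FMODE_WRITE will still allow this FD */
          return bpf_map_get_fd_by_id_opts(id, &opts);
  }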
Luis Gerhorst [Thu, 19 Jun 2025 14:26:47 +0000 (16:26 +0200)]
powerpc/bpf: Fix warning for unused ori31_emitted
Without this, the compiler (clang21) might emit a warning under W=1
because the variable ori31_emitted is set but never used if
CONFIG_PPC_BOOK3S_64=n.
Without this patch:
$ make -j $(nproc) W=1 ARCH=powerpc SHELL=/bin/bash arch/powerpc/net
[...]
CC arch/powerpc/net/bpf_jit_comp.o
CC arch/powerpc/net/bpf_jit_comp64.o
../arch/powerpc/net/bpf_jit_comp64.c: In function 'bpf_jit_build_body':
../arch/powerpc/net/bpf_jit_comp64.c:417:28: warning: variable 'ori31_emitted' set but not used [-Wunused-but-set-variable]
417 | bool sync_emitted, ori31_emitted;
| ^~~~~~~~~~~~~
AR arch/powerpc/net/built-in.a
With this patch:
[...]
CC arch/powerpc/net/bpf_jit_comp.o
CC arch/powerpc/net/bpf_jit_comp64.o
AR arch/powerpc/net/built-in.a
Fixes: dff883d9e93a ("bpf, arm64, powerpc: Change nospec to include v1 barrier")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506180402.uUXwVoSH-lkp@intel.com/
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/r/20250619142647.2157017-1-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eduard Zingerman [Wed, 18 Jun 2025 09:31:34 +0000 (02:31 -0700)]
selftests/bpf: include limits.h needed for PATH_MAX directly
Constant PATH_MAX is used in function unpriv_helpers.c:open_config().
This constant is provided by include file <limits.h>.
The dependency was added by commit [1], which does not include
<limits.h> directly, relying instead on <limits.h> being included from
zlib.h -> zconf.h.
As it turns out, this is not the case for all systems, e.g. on
Fedora 41 zlib 1.3.1 is used, and there <limits.h> is not included
from zconf.h. Hence, there is a compilation error on Fedora 41.
[1] commit fc2915bb8bfc ("selftests/bpf: More precise cpu_mitigations state detection")
Fixes: fc2915bb8bfc ("selftests/bpf: More precise cpu_mitigations state detection")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Viktor Malik <vmalik@redhat.com>
Link: https://lore.kernel.org/r/20250618093134.3078870-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
James Bottomley [Tue, 17 Jun 2025 14:57:36 +0000 (10:57 -0400)]
bpf: Fix key serial argument of bpf_lookup_user_key()
The underlying lookup_user_key() function uses a signed 32 bit integer
for key serial numbers because legitimate serial numbers are positive
(and > 3) and keyrings are negative. Using a u32 for the keyring in
the bpf function doesn't currently cause any conversion problems but
will start to trip the signed to unsigned conversion warnings when the
kernel enables them, so convert the argument to signed (and update the
tests accordingly) before it acquires more users.
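The gist of the change, sketched as a diff (modulo the exact kfunc annotations in kernel/trace/bpf_trace.c):

  -struct bpf_key *bpf_lookup_user_key(u32 serial, u64 flags)
  +struct bpf_key *bpf_lookup_user_key(s32 serial, u64 flags)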
Yuan Chen [Tue, 17 Jun 2025 13:24:42 +0000 (09:24 -0400)]
bpftool: Fix JSON writer resource leak in version command
When using `bpftool --version -j/-p`, the JSON writer object
created in do_version() was not properly destroyed after use.
This caused a memory leak each time the version command was
executed with JSON output.
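The shape of the fix, as a hedged sketch against bpftool's json_writer API (jsonw_new()/jsonw_destroy() are the real helpers; the surrounding function is illustrative):

  static int do_version_json(void)
  {
          json_writer_t *w = jsonw_new(stdout);

          if (!w)
                  return -1;
          jsonw_start_object(w);
          /* ... emit version and feature fields ... */
          jsonw_end_object(w);
          jsonw_destroy(&w);      /* previously missing: the writer leaked */
          return 0;
  }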
Fixes: 004b45c0e51a ("tools: bpftool: provide JSON output for all possible commands")
Eduard Zingerman [Tue, 17 Jun 2025 00:57:10 +0000 (17:57 -0700)]
selftests/bpf: More precise cpu_mitigations state detection
test_progs and test_verifier binaries execute unpriv tests under the
following conditions:
- unpriv BPF is enabled;
- CPU mitigations are enabled (see [1] for details).
The detection of the "mitigations enabled" state is performed by
unpriv_helpers.c:get_mitigations_off() by inspecting the kernel boot
command line, looking for the parameter "mitigations=off".
Such a detection scheme won't work for certain configurations,
e.g. when CONFIG_CPU_MITIGATIONS is disabled and the boot parameter is
not supplied.
Mis-detection leads to test_progs executing tests meant to be run
only with mitigations enabled, e.g.
verifier_and.c:known_subreg_with_unknown_reg(), and reporting false
failures.
Internally, the verifier sets bpf_verifier_env->bypass_spec_{v1,v4}
based on the value returned by kernel/cpu.c:cpu_mitigations_off().
This function is backed by the variable kernel/cpu.c:cpu_mitigations.
This state is not fully introspectable via sysfs. The closest proxy
is /sys/devices/system/cpu/vulnerabilities/spectre_v1, but it reports
the "vulnerable" state only if mitigations are disabled *and* the
current cpu is vulnerable, while the verifier does not check the cpu
state.
There are only two ways the kernel/cpu.c:cpu_mitigations can be set:
- via boot parameter;
- via CONFIG_CPU_MITIGATIONS option.
This commit updates unpriv_helpers.c:get_mitigations_off() to scan
/boot/config-$(uname -r) and /proc/config.gz for the
CONFIG_CPU_MITIGATIONS value in addition to the boot command line check.
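A hedged sketch of the /proc/config.gz part of the check (gzopen()/gzgets() are standard zlib; the real helper also handles /boot/config-$(uname -r) and the command line):

  #include <stdio.h>
  #include <string.h>
  #include <zlib.h>

  /* 1: mitigations compiled out, 0: enabled, -1: unknown */
  static int mitigations_off_in_config(void)
  {
          char line[256];
          int ret = -1;
          gzFile f = gzopen("/proc/config.gz", "r");

          if (!f)
                  return -1;
          while (gzgets(f, line, sizeof(line))) {
                  if (!strncmp(line, "CONFIG_CPU_MITIGATIONS=y", 24))
                          ret = 0;
                  else if (!strncmp(line, "# CONFIG_CPU_MITIGATIONS is not set", 35))
                          ret = 1;
          }
          gzclose(f);
          return ret;
  }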
Tested using the following configurations:
- mitigations enabled (unpriv tests are enabled)
- mitigations disabled via boot cmdline (unpriv tests skipped)
- mitigations disabled via CONFIG_CPU_MITIGATIONS
(unpriv tests skipped)
Yonghong Song [Tue, 17 Jun 2025 04:49:56 +0000 (21:49 -0700)]
selftests/bpf: Fix RELEASE build failure with gcc14
With gcc14, when building with RELEASE=1, I hit the four compilation
failures below:
Error 1:
In file included from test_loader.c:6:
test_loader.c: In function ‘run_subtest’:
test_progs.h:194:17: error: ‘retval’ may be used uninitialized in this
function [-Werror=maybe-uninitialized]
194 | fprintf(stdout, ##format); \
| ^~~~~~~
test_loader.c:958:13: note: ‘retval’ was declared here
958 | int retval, err, i;
| ^~~~~~
The uninitialized variable 'retval' could actually cause an incorrect result.
Error 2:
In function ‘test_fd_array_cnt’:
prog_tests/fd_array.c:71:14: error: ‘btf_id’ may be used uninitialized in this
function [-Werror=maybe-uninitialized]
71 | fd = bpf_btf_get_fd_by_id(id);
| ^~~~~~~~~~~~~~~~~~~~~~~~
prog_tests/fd_array.c:302:15: note: ‘btf_id’ was declared here
302 | __u32 btf_id;
| ^~~~~~
Changing ASSERT_GE to ASSERT_EQ can fix the compilation error. Otherwise,
there is no functionality change.
Error 3:
prog_tests/tailcalls.c: In function ‘test_tailcall_hierarchy_count’:
prog_tests/tailcalls.c:1402:23: error: ‘fentry_data_fd’ may be used uninitialized
in this function [-Werror=maybe-uninitialized]
1402 | err = bpf_map_lookup_elem(fentry_data_fd, &i, &val);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The code is correct. The change intends to silence gcc errors.
Error 4: (this error only happens on arm64)
In file included from prog_tests/log_buf.c:4:
prog_tests/log_buf.c: In function ‘bpf_prog_load_log_buf’:
./test_progs.h:390:22: error: ‘log_buf’ may be used uninitialized [-Werror=maybe-uninitialized]
390 | int ___err = libbpf_get_error(___res); \
| ^~~~~~~~~~~~~~~~~~~~~~~~
prog_tests/log_buf.c:158:14: note: in expansion of macro ‘ASSERT_OK_PTR’
158 | if (!ASSERT_OK_PTR(log_buf, "log_buf_alloc"))
| ^~~~~~~~~~~~~
In file included from selftests/bpf/tools/include/bpf/bpf.h:32,
from ./test_progs.h:36:
selftests/bpf/tools/include/bpf/libbpf_legacy.h:113:17:
note: by argument 1 of type ‘const void *’ to ‘libbpf_get_error’ declared here
113 | LIBBPF_API long libbpf_get_error(const void *ptr);
| ^~~~~~~~~~~~~~~~
Adding a pragma to disable maybe-uninitialized fixed the issue.
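The pragma pattern used for Error 4 looks like this (a sketch; the exact placement in prog_tests/log_buf.c may differ):

  #pragma GCC diagnostic push
  #pragma GCC diagnostic ignored "-Wmaybe-uninitialized"
          if (!ASSERT_OK_PTR(log_buf, "log_buf_alloc"))
                  goto cleanup;
  #pragma GCC diagnostic pop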
The same cleanup cycles are done in push_stack() and push_async_cb();
both functions are only reachable from do_check_common() via
do_check() -> do_check_insn().
Hence, I think that the current state should not be freed in the
push_*() functions and the pop_stack() loop there is not needed.
This would also fix the 'symptom' for [2], but the issue also has a
simpler fix which was sent separately. This fix also makes sure the
push_*() callers always return an error for which
error_recoverable_with_nospec(err) is false. This is required because
otherwise we try to recover and access the stale `state`.
Moving free_verifier_state() and pop_stack(..., pop_log=false) to happen
after the bpf_vlog_reset() call in do_check_common() is fine because the
pop_stack() call that is moved does not call bpf_vlog_reset() with the
pop_log=false parameter.
Eduard Zingerman [Fri, 13 Jun 2025 17:53:31 +0000 (10:53 -0700)]
selftests/bpf: verify jset handling in CFG computation
A test case to check if both branches of jset are explored when
computing program CFG.
At 'if r1 & 0x7 ...':
- register 'r2' is computed alive only if jump branch of jset
instruction is followed;
- register 'r0' is computed alive only if fallthrough branch of jset
instruction is followed.
Eduard Zingerman [Fri, 13 Jun 2025 17:53:30 +0000 (10:53 -0700)]
bpf: handle jset (if a & b ...) as a jump in CFG computation
BPF_JSET is a conditional jump and currently verifier.c:can_jump()
does not know about that. This can lead to incorrect live registers
and SCC computation.
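A hedged sketch of the fix in verifier.c:can_jump() (the surrounding switch is abbreviated): BPF_JSET must be listed with the other conditional jumps so that both CFG successors are visited:

  switch (BPF_OP(insn->code)) {
  case BPF_JA:
  case BPF_JEQ:
  case BPF_JNE:
  case BPF_JLT:
  case BPF_JLE:
  case BPF_JGT:
  case BPF_JGE:
  case BPF_JSGT:
  case BPF_JSGE:
  case BPF_JSLT:
  case BPF_JSLE:
  case BPF_JCOND:
  case BPF_JSET:          /* previously missing */
          return true;
  }
  return false;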
====================
veristat: memory accounting for bpf programs
When working on the verifier, it is sometimes interesting to know how a
particular change affects memory consumption. This patch-set modifies
veristat to provide such information. As a collateral, the kernel needs
an update to make allocations reachable from BPF program load
accountable in memcg statistics.
Here is a sample output:
Program          Peak states  Peak memory (MiB)
---------------  -----------  -----------------
lavd_select_cpu         2153                 43
lavd_enqueue            1982                 41
lavd_dispatch           3480                 28
Technically, this is implemented by creating and entering a new cgroup
at the start of veristat execution. The difference between values from
cgroup "memory.peak" file before and after bpf_object__load() is used
as a metric.
To minimize measurement jitter, the data is collected in megabytes.
Changelog:
v2: https://lore.kernel.org/bpf/20250612130835.2478649-1-eddyz87@gmail.com/
v2 -> v3:
- bpf_verifier_state->jmp_history and
bpf_verifier_env->explored_states allocations are switched from
GFP_USER to GFP_KERNEL_ACCOUNT (Andrii, Alexei);
- veristat.c:STR macro removed, PATH_MAX-1 == 4095 is hard-coded in
scanf format strings (Andrii);
- env->{orig,stat}_cgroup size changed to PATH_MAX (Andrii);
- snprintf_trunc() is removed, flag -Wno-format-truncation
is added to CFLAGS for veristat.o when compiled with gcc;
v1: https://lore.kernel.org/bpf/20250605230609.1444980-1-eddyz87@gmail.com/
v1 -> v2:
- a single cgroup, created at the beginning of execution, is now used
for measurements (Andrii, Mykyta);
- cgroup namespace is not created, as it turned out to be useless
(Andrii);
- veristat no longer mounts cgroup fs or changes subtree_control,
instead it looks for an existing mount point and reports an error if
memory.peak file can't be opened (Andrii, Alexei);
- if the 'mem_peak' statistic is not enabled, veristat skips cgroup
setup;
- code sharing with cgroup_helpers.c was considered but was decided
against to simplify veristat github sync.
====================
Eduard Zingerman [Fri, 13 Jun 2025 07:21:47 +0000 (00:21 -0700)]
veristat: Memory accounting for bpf programs
This commit adds a new mem_peak / "Peak memory (MiB)" field to the
set of gathered statistics. The field is intended as an estimate of
peak verifier memory consumption for processing of a given program.
Mechanically, the stat is collected as follows:
- At the beginning of handle_verif_mode() a new cgroup is created
and veristat process is moved into this cgroup.
- At each program load:
- bpf_object__load() is split into bpf_object__prepare() and
bpf_object__load() to avoid accounting for memory allocated for
maps;
- before bpf_object__load():
- a write to "memory.peak" file of the new cgroup is used to reset
cgroup statistics;
- updated value is read from "memory.peak" file and stashed;
- after bpf_object__load() "memory.peak" is read again and
difference between new and stashed values is used as a metric.
If any of the above steps fails, veristat proceeds without collecting
mem_peak information for the program, reporting mem_peak as -1.
While memcg provides data in bytes (converted from pages), veristat
converts it to megabytes to avoid jitter when comparing results of
different executions.
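A hedged sketch of the read side (the path and helper name are illustrative; on cgroup v2, writing to memory.peak resets the counter):

  #include <stdio.h>

  static long read_peak_mib(const char *peak_path)
  {
          long bytes = -1;
          FILE *f = fopen(peak_path, "r");

          if (!f || fscanf(f, "%ld", &bytes) != 1)
                  bytes = -1;
          if (f)
                  fclose(f);
          return bytes < 0 ? -1 : bytes >> 20;    /* bytes -> MiB */
  }

The metric is then the value read after bpf_object__load() minus the value stashed right after the reset.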
The change has no measurable impact on veristat running time.
A correlation between "Peak states" and "Peak memory" fields provides
a sanity check for gathered statistics, e.g. a sample of data for
sched_ext programs:
Eduard Zingerman [Fri, 13 Jun 2025 07:21:46 +0000 (00:21 -0700)]
bpf: Include verifier memory allocations in memcg statistics
This commit adds the __GFP_ACCOUNT flag to verifier-induced memory
allocations. The intent is to account for all allocations reachable
from the BPF_PROG_LOAD command, which is needed to track verifier memory
consumption in veristat. This includes allocations done in verifier.c
and some allocations in btf.c; functions in log.c do not allocate.
There is also a utility function bpf_memcg_flags() which selectively
adds the __GFP_ACCOUNT flag depending on the `cgroup.memory=nobpf`
option. As far as I understand [1], the idea is to exclude bpf_prog
instances and maps from memcg accounting as these objects do not
strictly belong to a specific cgroup, hence it should not apply here.
(btf_parse_fields() is reachable from both program load and map
creation, but the allocated record is not persistent and is freed as
soon as map_check_btf() exits.)
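For example, the explored_states switch mentioned in the cover letter changes roughly like this (illustrative diff, exact call site abbreviated):

  -       env->explored_states = kvcalloc(state_htab_size(env),
  -                                       sizeof(struct list_head), GFP_USER);
  +       env->explored_states = kvcalloc(state_htab_size(env),
  +                                       sizeof(struct list_head),
  +                                       GFP_KERNEL_ACCOUNT);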
Ruslan Semchenko [Thu, 12 Jun 2025 13:18:16 +0000 (16:18 +0300)]
tools/bpf_jit_disasm: Fix potential negative tpath index in get_exec_path()
If readlink() fails, len will be -1, which can cause negative indexing
and undefined behavior. This patch ensures that len is set to 0 on
readlink failure, preventing such issues.
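A sketch of the fix pattern (the real get_exec_path() reads the /proc/<pid>/exe link):

  #include <limits.h>
  #include <unistd.h>

  static void get_exec_path(char *tpath, size_t size)
  {
          ssize_t len = readlink("/proc/self/exe", tpath, size - 1);

          if (len < 0)
                  len = 0;        /* avoid the negative index below */
          tpath[len] = 0;
  }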
====================
bpf: Fix a few test failures with 64K page size
Changelog:
v2 -> v3:
- v2: https://lore.kernel.org/bpf/20250611171519.2033193-1-yonghong.song@linux.dev/
- Add additional comments for xdp_adjust_tail test.
- Use actual kernel page size to set new_len for Patch 2 tests.
v1 -> v2:
- v1: https://lore.kernel.org/bpf/20250608165534.1019914-1-yonghong.song@linux.dev/
- For xdp_adjust_tail, let the kernel test_run handle various page sizes for xdp progs.
- For two change_tail tests, make code easier to understand.
- Resolved a new test failure (xdp_do_redirect).
====================
Yonghong Song [Thu, 12 Jun 2025 03:50:37 +0000 (20:50 -0700)]
selftests/bpf: Fix two net related test failures with 64K page size
When running BPF selftests on arm64 with a 64K page size, I encountered
the following two test failures:
sockmap_basic/sockmap skb_verdict change tail:FAIL
tc_change_tail:FAIL
With further debugging, I identified the root cause in the following
kernel code within __bpf_skb_change_tail():
u32 max_len = BPF_SKB_MAX_LEN;
u32 min_len = __bpf_skb_min_len(skb);
int ret;
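/* excerpt continued (paraphrased): the length check that trips here */
if (unlikely(flags || new_len > max_len || new_len < min_len))
	return -EINVAL;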
With a 4K page size, new_len = 65535 exceeds max_len = 16064, so the
function returns -EINVAL. However, with a 64K page size, max_len
increases to 261824, allowing execution to proceed further in the
function. This is because BPF_SKB_MAX_LEN scales with the page size:
larger page sizes result in higher max_len values.
Updating the new_len parameter in both tests based on actual kernel
page size resolved both failures.
Yonghong Song [Thu, 12 Jun 2025 03:50:32 +0000 (20:50 -0700)]
bpf: Fix an issue in bpf_prog_test_run_xdp when page size greater than 4K
The bpf selftest xdp_adjust_tail/xdp_adjust_frags_tail_grow failed on
arm64 with 64KB page:
xdp_adjust_tail/xdp_adjust_frags_tail_grow:FAIL
In bpf_prog_test_run_xdp(), the xdp->frame_sz is set to 4K, but later on
when constructing frags, with 64K page size, the frag data_len could
be more than 4K. This will cause problems in bpf_xdp_frags_increase_tail().
To fix the failure, the xdp->frame_sz is set to PAGE_SIZE so the kernel
can test different page sizes properly. With the kernel change, the user
space and bpf prog need adjustment. Currently, the MAX_SKB_FRAGS default
value is 17, so for a 4K page the maximum packet size will be less than
68K. To test a 64K page, a maximum packet size bigger than 68K is
desired, so two different functions are implemented for the subtest
xdp_adjust_frags_tail_grow, and different data input/output sizes are
used depending on the page size.
Song Liu [Thu, 12 Jun 2025 22:11:00 +0000 (15:11 -0700)]
bpf: Initialize used but uninit variable in propagate_liveness()
With input changed == NULL, a local variable tmp is used for "changed".
Initialize tmp properly, so that it can be used in the following:
*changed |= err > 0;
Otherwise, UBSAN will complain:
UBSAN: invalid-load in kernel/bpf/verifier.c:18924:4
load of value <some random value> is not a valid value for type '_Bool'
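The shape of the fix, as a hedged sketch (not the verifier's real signature):

  static void propagate_sketch(bool *changed, int err)
  {
          bool tmp = false;       /* the fix: initialize the fallback */

          if (!changed)
                  changed = &tmp;
          *changed |= err > 0;    /* loads *changed; garbage here trips UBSAN */
  }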
Fixes: dfb2d4c64b82 ("bpf: set 'changed' status if propagate_liveness() did any updates")
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250612221100.2153401-1-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Thu, 12 Jun 2025 04:30:49 +0000 (21:30 -0700)]
docs/bpf: Default cpu version changed from v1 to v3 in llvm 20
The default cpu version is changed from v1 to v3 in llvm version 20.
See [1] for more detailed reasoning. Update bpf_devel_QA.rst so
developers can find such information easily.
Fushuai Wang [Thu, 12 Jun 2025 08:42:08 +0000 (16:42 +0800)]
selftests/bpf: fix signedness bug in redir_partial()
When xsend() returns -1 (error), the check 'n < sizeof(buf)' incorrectly
treats it as success due to unsigned promotion. Explicitly check for -1
first.
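A hedged sketch of the fix (xsend() is the selftest helper; variable names are illustrative):

  ssize_t n = xsend(fd, buf, sizeof(buf), 0);

  /* check the error explicitly: in 'n < sizeof(buf)', -1 is promoted
   * to an unsigned type and compares as "huge", i.e. as success
   */
  if (n == -1)
          goto fail;
  if ((size_t)n < sizeof(buf))
          goto fail;              /* partial send */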
Luis Gerhorst [Wed, 11 Jun 2025 21:07:28 +0000 (23:07 +0200)]
bpf: Fix state use-after-free on push_stack() err
Without this, `state->speculative` is used after the cleanup cycles in
push_stack() or push_async_cb() freed `env->cur_state` (i.e., `state`).
Avoid this by relying on the short-circuit logic to only access `state`
if the error is recoverable (and make sure it never is after push_*()
failed).
push_*() callers must always return an error for which
error_recoverable_with_nospec(err) is false if push_*() returns NULL,
otherwise we try to recover and access the stale `state`. This is only
violated by sanitize_ptr_alu(), thus also fix this case to return
-ENOMEM.
state->speculative does not make sense if the error path of push_*()
ran. In that case, `state->speculative &&
error_recoverable_with_nospec(err)` as a whole should already never
evaluate to true (because all cases where push_stack() fails must return
-ENOMEM/-EFAULT). As mentioned, this is only violated by the
push_stack() call in sanitize_speculative_path() which returns -EACCES
without [1] (through REASON_STACK in sanitize_err() after
sanitize_ptr_alu()). To fix this, return -ENOMEM for REASON_STACK (which
is also the behavior we will have after [1]).
Checked that it fixes the syzbot reproducer as expected.
====================
bpf: propagate read/precision marks over state graph backedges
Current loop_entry-based states comparison logic does not handle the
following case:
  .-> A --.   Assume the states are visited in the order A, B, C.
  |   |   |   Assume that state B reaches a state equivalent to state A.
  |   v   v   At this point, state C is not processed yet, so state A
  '-- B   C   has not received any read or precision marks from C.
              As a result, these marks won't be propagated to B.
If B has incomplete marks, it is unsafe to use it in states_equal()
checks. This issue was first reported in [1].
This patch-set
--------------
Here is the gist of the algorithm implemented by this patch-set:
- Compute strongly connected components (SCCs) in the program CFG.
- When a verifier state enters an SCC, that state is recorded as the
SCC's entry point.
- When a verifier state is found to be equivalent to another
(e.g., B to A in the example above), it is recorded as a
states-graph backedge.
- Backedges are accumulated per SCC (*).
- When an SCC entry state reaches `branches == 0`, propagate read and
precision marks through the backedges until a fixed point is reached
(e.g., from A to B, from C to A, and then again from A to B).
(*) This is an oversimplification, see patch #8 for details.
Unfortunately, this means that commit [2] needs to be reverted,
as precision propagation requires access to jump history,
and backedges represent history not belonging to `env->cur_state`.
Details are provided in patch #8; a comment in `is_state_visited()`
explains most of the mechanics.
Patch #2 adds a `compute_scc()` function, which computes SCCs in the
program CFG. This function was tested using property-based testing in
[3], but it is not included in selftests.
Previous attempt
----------------
A previous attempt to fix this is described in [4]:
1. Within the states loop, `states_equal(... RANGE_WITHIN)` ignores
read and precision marks.
2. For states outside the loop, all registers for states within the
loop are marked as read and precise.
This approach led to an 86x regression on the `cond_break1` selftest.
In that test, one loop was followed by another, and a certain variable
was incremented in the second loop. This variable was marked as
precise due to rule (2), which hindered convergence in the first loop.
After some off-list discussion, it was decided that this might be a
typical case and such regressions are undesirable.
This patch-set avoids such eager precision markings.
Alternatives
------------
Another option is to associate a mask of read/written/precise stack
slots with each instruction. This mask can be populated during
verifier states exploration. Upon reaching an `EXIT` instruction or an
equivalent state, the accumulated masks can be used to propagate
read/written/precise bits across the program's control flow graph
using an analysis similar to use-def.
Unfortunately, a naive implementation of this approach [5] results in
a 10x regression in `veristat` for some `sched_ext` programs due to
the inability to express the must-write property. This issue requires
further investigation.
Changes in verification performance
-----------------------------------
There are some veristat regressions when comparing with master using
selftests and sched_ext BPF binaries. The comparison is done using
master from [6] and this patch-set from [7] where memory accounting
logic is added to veristat.
========= selftests: master vs patch-set =========
v1: https://lore.kernel.org/bpf/20250524191932.389444-1-eddyz87@gmail.com/
v1 -> v2:
- Rebase
- added mem_peak statistics (Alexei)
- selftests: fixed comments and removed useless r7 assignments (Yonghong)
v2: https://lore.kernel.org/bpf/20250606210352.1692944-1-eddyz87@gmail.com/
v2 -> v3:
- Rebase
Links
-----
[1] https://lore.kernel.org/bpf/20250312031344.3735498-1-eddyz87@gmail.com/
[2] commit 96a30e469ca1 ("bpf: use common instruction history across all states")
[3] https://github.com/eddyz87/scc-test
[4] https://lore.kernel.org/bpf/20250426104634.744077-1-eddyz87@gmail.com/
[5] https://github.com/eddyz87/bpf/tree/propagate-read-and-precision-in-cfg
[6] https://github.com/eddyz87/bpf/tree/veristat-memory-accounting
[7] https://github.com/eddyz87/bpf/tree/scc-accumulate-backedges
====================
Without a fix that instructs the verifier to ignore the branches count
for loop entries, verification proceeds as follows:
- 1-4, state is {r6=-32,fp-8=active};
- 6, checkpoint A is created with {r6=-32,fp-8=active};
- 7, checkpoint B is created with {r6=-32,fp-8=active},
  push state {r6=-32,fp-8=active} from 7 to 9;
- 8,12,13, {r6=-32,fp-8=drained}, exit;
- pop state with {r6=-32,fp-8=active} from 7 to 9;
- 9, push state {r6=-32,fp-8=active} from 9 to 10;
- 6, checkpoint C is created with {r6=-32,fp-8=active};
- 7, checkpoint A is hit, no precision propagated for r6 to C;
- pop state {r6=-32,fp-8=active} from 9 to 10;
- 10, state is {r6=-31,fp-8=active}, r6 is marked as read and precise,
  these marks are propagated to checkpoints A and B (but not C, as
  it is not the parent of the current state);
- 6, {r6=-31,fp-8=active}, checkpoint C is hit, because r6 is not
  marked precise for this checkpoint;
- the program is accepted, despite the possibility of an unaligned u64
  stack access at offset -31.
The test case absent_mark_in_the_middle_state2 is similar, except for
the following change:
  int loop1(num, iter)
  {
          while (bpf_iter_num_next(iter)) {
                  if (unlikely(bpf_get_prandom_u32()))
                          *(fp + num) = 7;
          }
          return 0;
  }

  int loop1_wrapper(iter)
  {
          r6 = -32;
          if (unlikely(bpf_get_prandom_u32()))
                  r6 = -31;
          loop1(r6, iter);
          return 0;
  }
The unsafe state is reached in a similar manner, but the loop is
located inside a subprogram that is called from two locations in the
main subprogram. This detail is important for exercising
bpf_scc_visit->backedges memory management.
Eduard Zingerman [Wed, 11 Jun 2025 20:08:34 +0000 (13:08 -0700)]
bpf: remove {update,get}_loop_entry functions
The previous patch switched read and precision tracking for
iterator-based loops from state-graph-based loop tracking to
control-flow-graph-based loop tracking.
This patch removes the now-unused `update_loop_entry()` and
`get_loop_entry()` functions, which were part of the state-graph-based
logic.
Eduard Zingerman [Wed, 11 Jun 2025 20:08:33 +0000 (13:08 -0700)]
bpf: propagate read/precision marks over state graph backedges
Current loop_entry-based exact states comparison logic does not handle
the following case:
  .-> A --.   Assume the states are visited in the order A, B, C.
  |   |   |   Assume that state B reaches a state equivalent to state A.
  |   v   v   At this point, state C is not processed yet, so state A
  '-- B   C   has not received any read or precision marks from C.
              As a result, these marks won't be propagated to B.
If B has incomplete marks, it is unsafe to use it in states_equal()
checks.
This commit replaces the existing logic with the following:
- Strongly connected components (SCCs) are computed over the program's
control flow graph (intraprocedurally).
- When a verifier state enters an SCC, that state is recorded as the
SCC entry point.
- When a verifier state is found equivalent to another (e.g., B to A
in the example), it is recorded as a states graph backedge.
Backedges are accumulated per SCC.
- When an SCC entry state reaches `branches == 0`, read and precision
marks are propagated through the backedges (e.g., from A to B, from
C to A, and then again from A to B).
To support nested subprogram calls, the entry state and backedge list
are associated not with the SCC itself but with an object called
`bpf_scc_callchain`. A callchain is a tuple `(callsite*, scc_id)`,
where `callsite` is the index of a call instruction for each frame
except the last.
See the comments added in `is_state_visited()` and
`compute_scc_callchain()` for more details.
Eduard Zingerman [Wed, 11 Jun 2025 20:08:32 +0000 (13:08 -0700)]
bpf: move REG_LIVE_DONE check to clean_live_states()
The next patch will add a relatively heavy-weight operation to
clean_live_states(); this operation can be skipped if REG_LIVE_DONE
is set. Move the check from clean_verifier_state() to
clean_live_states() as a small refactoring commit.
Eduard Zingerman [Wed, 11 Jun 2025 20:08:28 +0000 (13:08 -0700)]
bpf: frame_insn_idx() utility function
A function to return the instruction index for a given frame in a
state's call stack. It will be used by the next patch.
The `state->insn_idx = env->insn_idx;` assignment in do_check()
allows using frame_insn_idx() with env->cur_state.
At the moment, bpf_verifier_state->insn_idx is set when a new cached
state is added in is_state_visited() and accessed only in contexts
where the state is already in the cache. Hence this assignment does not
change verifier behaviour.
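A hedged sketch of the helper (field names per bpf_verifier_state; for inner frames, the call site recorded in the next frame identifies the frame's current instruction):

  static u32 frame_insn_idx(struct bpf_verifier_state *st, u32 frame)
  {
          return frame == st->curframe ? st->insn_idx
                                       : st->frame[frame + 1]->callsite;
  }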
Eduard Zingerman [Wed, 11 Jun 2025 20:08:27 +0000 (13:08 -0700)]
bpf: compute SCCs in program control flow graph
Compute strongly connected components in the program CFG.
Assign an SCC number to each instruction, recorded in
env->insn_aux[*].scc. Use Tarjan's algorithm for SCC computation
adapted to run non-recursively.
For debug purposes print out computed SCCs as a part of full program
dump in compute_live_registers() at log level 2, e.g.:
Eduard Zingerman [Wed, 11 Jun 2025 20:08:26 +0000 (13:08 -0700)]
Revert "bpf: use common instruction history across all states"
This reverts commit 96a30e469ca1d2b8cc7811b40911f8614b558241.
Next patches in the series modify propagate_precision() to allow
arbitrary starting state. Precision propagation requires access to
jump history, and arbitrary states represent history not belonging to
`env->cur_state`.
Yonghong Song [Wed, 11 Jun 2025 16:21:03 +0000 (09:21 -0700)]
selftests/bpf: Fix cgroup_mprog_ordering failure due to uninitialized variable
On arm64, the cgroup_mprog_ordering selftest failed in a test_progs run
when built with the clang compiler. The reason is that the socklen_t
optlen is not initialized.
In kernel function do_ip_getsockopt(), we have
if (copy_from_sockptr(&len, optlen, sizeof(int)))
return -EFAULT;
if (len < 0)
return -EINVAL;
The above 'len' variable ends up negative and hence the test failed.
But the test is okay on x86_64. I checked the x86_64 asm code and I
didn't see explicit initialization of 'optlen', but its value happened
to be 0, so the kernel didn't return an error. This is pure luck.
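The fix in the selftest is of this shape (names illustrative):

  int optval = 0, err;
  socklen_t optlen = sizeof(optval);      /* was uninitialized */

  err = getsockopt(fd, SOL_IP, IP_TOS, &optval, &optlen);
  /* with garbage (negative) optlen, do_ip_getsockopt() returns -EINVAL */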
Eslam Khafagy [Sat, 7 Jun 2025 22:24:25 +0000 (01:24 +0300)]
bpf, doc: Improve wording of docs
The phrase "dividing -1" is one I find confusing. E.g.,
"INT_MIN dividing -1" sounds like "-1 / INT_MIN" rather than the inverse.
"divided by" instead of "dividing" assuming the inverse is meant.
Tobias Klauser [Tue, 10 Jun 2025 14:07:56 +0000 (16:07 +0200)]
bpf: adjust path to trace_output sample eBPF program
The sample file was renamed from trace_output_kern.c to
trace_output.bpf.c in commit d4fffba4d04b ("samples/bpf: Change _kern
suffix to .bpf with syscall tracing program"). Adjust the path in the
documentation comment for bpf_perf_event_output.
====================
This improves the expressiveness of unprivileged BPF by inserting
speculation barriers instead of rejecting the programs.
The approach was previously presented at LPC'24 [1] and RAID'24 [2].
To mitigate the Spectre v1 (PHT) vulnerability, the kernel rejects
potentially-dangerous unprivileged BPF programs as of
commit 9183671af6db ("bpf: Fix leakage under speculation on mispredicted
branches"). In [2], we have analyzed 364 object files from open source
projects (Linux Samples and Selftests, BCC, Loxilb, Cilium, libbpf
Examples, Parca, and Prevail) and found that this affects 31% to 54% of
programs.
To resolve this in the majority of cases, this patchset adds a fallback
for mitigating Spectre v1 using speculation barriers.
optimistically attempts to verify all speculative paths but uses
speculation barriers against v1 when unsafe behavior is detected. This
allows for more programs to be accepted without disabling the BPF
Spectre mitigations (e.g., by setting cpu_mitigations_off()).
For this, it relies on the fact that speculation barriers generally
prevent all later instructions from executing if the speculation was not
correct (not only loads). See patch 7 ("bpf: Fall back to nospec for
Spectre v1") for a detailed description and references to the relevant
vendor documentation (AMD and Intel x86-64, ARM64, and PowerPC).
In [1] we have measured the overhead of this approach relative to having
mitigations off and including the upstream Spectre v4 mitigations. For
event tracing and stack-sampling profilers, we found that mitigations
increase BPF program execution time by 0% to 62%. For the Loxilb network
load balancer, we have measured a 14% slowdown in SCTP performance but
no significant slowdown for TCP. This overhead only applies to programs
that were previously rejected.
I reran the expressiveness-evaluation with v6.14 and made sure the main
results still match those from [1] and [2] (which used v6.5).
Main design decisions are:
* Do not use separate bytecode insns for v1 and v4 barriers (inspired by
Daniel Borkmann's question at LPC). This simplifies the verifier
significantly and has the only downside that performance on PowerPC is
not as high as it could be.
* Allow archs to still disable v1/v4 mitigations separately by setting
bpf_jit_bypass_spec_v1/v4(). This has the benefit that archs can
benefit from improved BPF expressiveness / performance if they are not
vulnerable (e.g., ARM64 for v4 in the kernel).
* Do not remove the empty BPF_NOSPEC implementation for backends for
which it is unknown whether they are vulnerable to Spectre v1.
[1] https://lpc.events/event/18/contributions/1954/ ("Mitigating
Spectre-PHT using Speculation Barriers in Linux eBPF")
[2] https://arxiv.org/pdf/2405.00078 ("VeriFence: Lightweight and
Precise Spectre Defenses for Untrusted Linux Kernel Extensions")
Changes:
* v3 -> v4:
- Remove insn parameter from do_check_insn() and extract
process_bpf_exit_full as a function as requested by Eduard
- Investigate apparent sanitize_check_bounds() bug reported by
Kartikeya (does appear to not be a bug but only confusing code),
sent separate patch to document it and add an assert
- Remove already-merged commit 1 ("selftests/bpf: Fix caps for
__xlated/jited_unpriv")
- Drop former commit 10 ("bpf: Allow nospec-protected var-offset stack
access") as it did not include a test and there are other places
where var-off is rejected. Also, none of the tested real-world
programs used var-off in the paper. Therefore keep the old behavior
for now and potentially prepare a patch that converts all cases
later if required.
- Add link to AMD lfence and PowerPC speculation barrier (ori 31,31,0)
documentation
- Move detailed barrier documentation to commit 7 ("bpf: Fall back to
nospec for Spectre v1")
- Link to v3: https://lore.kernel.org/all/20250501073603.1402960-1-luis.gerhorst@fau.de/
* v2 -> v3:
- Fix
https://lore.kernel.org/oe-kbuild-all/202504212030.IF1SLhz6-lkp@intel.com/
and similar by moving the bpf_jit_bypass_spec_v1/v4() prototypes out
of the #ifdef CONFIG_BPF_SYSCALL. Decided not to move them to
filter.h (where similar bpf_jit_*() prototypes live) as they would
still have to be duplicated in bpf.h to be usable to
bpf_bypass_spec_v1/v4() (unless including filter.h in bpf.h is an
option).
- Fix
https://lore.kernel.org/oe-kbuild-all/202504220035.SoGveGpj-lkp@intel.com/
by moving the variable declarations out of the switch-case.
- Build touched C files with W=2 and bpf config on x86 to check that
there are no other warnings introduced.
- Found 3 more checkpatch warnings that can be fixed without degrading
readability.
- Rebase to bpf-next 2025-05-01
- Link to v2: https://lore.kernel.org/bpf/20250421091802.3234859-1-luis.gerhorst@fau.de/
* v1 -> v2:
- Drop former commits 9 ("bpf: Return PTR_ERR from push_stack()") and 11
("bpf: Fall back to nospec for spec path verification") as suggested
by Alexei. This series therefore no longer changes push_stack() to
return PTR_ERR.
- Add detailed explanation of how lfence works internally and how it
affects the algorithm.
- Add tests checking that nospec instructions are inserted in expected
locations using __xlated_unpriv as suggested by Eduard (also,
include a fix for __xlated_unpriv)
- Add a test for the mitigations from the description of
commit 9183671af6db ("bpf: Fix leakage under speculation on
mispredicted branches")
- Remove unused variables from do_check[_insn]() as suggested by
Eduard.
- Remove INSN_IDX_MODIFIED to improve readability as suggested by
Eduard. This also causes the nospec_result-check to run (and fail)
for jumping-ops. Add a warning to assert that this check must never
succeed in that case.
- Add details on the safety of patch 10 ("bpf: Allow nospec-protected
var-offset stack access") based on the feedback on v1.
- Rebase to bpf-next-250420
- Link to v1: https://lore.kernel.org/all/20250313172127.1098195-1-luis.gerhorst@fau.de/
* RFC -> v1:
- rebase to bpf-next-250313
- tests: mark expected successes/new errors
- add bpf_jit_bypass_spec_v1/v4() to avoid #ifdef in
bpf_bypass_spec_v1/v4()
- ensure that nospec with v1-support is implemented for archs for
which GCC supports speculation barriers, except for MIPS
- arm64: emit speculation barrier
- powerpc: change nospec to include v1 barrier
- discuss potential security (archs that do not impl. BPF nospec) and
performance (only PowerPC) regressions
- Link to RFC: https://lore.kernel.org/bpf/20250224203619.594724-1-luis.gerhorst@fau.de/
====================
Luis Gerhorst [Tue, 3 Jun 2025 21:24:28 +0000 (23:24 +0200)]
bpf: Fall back to nospec for Spectre v1
This implements the core of the series and causes the verifier to fall
back to mitigating Spectre v1 using speculation barriers. The approach
was presented at LPC'24 [1] and RAID'24 [2].
If we find any forbidden behavior on a speculative path, we insert a
nospec (e.g., lfence speculation barrier on x86) before the instruction
and stop verifying the path. While verifying a speculative path, we can
furthermore stop verification of that path whenever we encounter a
nospec instruction.
A minimal example program would look as follows:
A = true
B = true
if A goto e
f()
if B goto e
unsafe()
e: exit
There are the following speculative and non-speculative paths
(`cur->speculative` and `speculative` referring to the value of the
push_stack() parameters):
- A = true
- B = true
- if A goto e
  - A && !cur->speculative && !speculative
    - exit
  - !A && !cur->speculative && speculative
    - f()
    - if B goto e
      - B && cur->speculative && !speculative
        - exit
      - !B && cur->speculative && speculative
        - unsafe()
If f() contains any unsafe behavior under Spectre v1 and the unsafe
behavior matches `state->speculative &&
error_recoverable_with_nospec(err)`, do_check() will now add a nospec
before f() instead of rejecting the program:
A = true
B = true
if A goto e
nospec
f()
if B goto e
unsafe()
e: exit
Alternatively, the algorithm also takes advantage of nospec instructions
inserted for other reasons (e.g., Spectre v4). Taking the program above
as an example, speculative path exploration can stop before f() if a
nospec was inserted there because of Spectre v4 sanitization.
In this example, all instructions after the nospec are dead code (and
with the nospec they are also dead code speculatively).
For this, it relies on the fact that speculation barriers generally
prevent all later instructions from executing if the speculation was not
correct:
* On Intel x86_64, lfence acts as full speculation barrier, not only as
a load fence [3]:
An LFENCE instruction or a serializing instruction will ensure that
no later instructions execute, even speculatively, until all prior
instructions complete locally. [...] Inserting an LFENCE instruction
after a bounds check prevents later operations from executing before
the bound check completes.
This was experimentally confirmed in [4].
* On AMD x86_64, lfence is dispatch-serializing [5] (requires MSR
C001_1029[1] to be set if the MSR is supported, this happens in
init_amd()). AMD further specifies "A dispatch serializing instruction
forces the processor to retire the serializing instruction and all
previous instructions before the next instruction is executed" [8]. As
dispatch is not specific to memory loads or branches, lfence therefore
also affects all instructions there. Also, if retiring a branch means
its PC change becomes architectural (should be), this means any
"wrong" speculation is aborted as required for this series.
* ARM's SB speculation barrier instruction also affects "any instruction
that appears later in the program order than the barrier" [6].
* PowerPC's barrier also affects all subsequent instructions [7]:
[...] executing an ori R31,R31,0 instruction ensures that all
instructions preceding the ori R31,R31,0 instruction have completed
before the ori R31,R31,0 instruction completes, and that no
subsequent instructions are initiated, even out-of-order, until
after the ori R31,R31,0 instruction completes. The ori R31,R31,0
instruction may complete before storage accesses associated with
instructions preceding the ori R31,R31,0 instruction have been
performed
Regarding the example, this implies that `if B goto e` will not execute
before `if A goto e` completes. Once `if A goto e` completes, the CPU
should find that the speculation was wrong and continue with `exit`.
If there is any other path that leads to `if B goto e` (and therefore
`unsafe()`) without going through `if A goto e`, then a nospec will
still be needed there. However, this patch assumes this other path will
be explored separately and therefore be discovered by the verifier even
if the exploration discussed here stops at the nospec.
This patch furthermore has the unfortunate consequence that Spectre v1
mitigations now only support architectures which implement BPF_NOSPEC.
Before this commit, Spectre v1 mitigations prevented exploits by
rejecting the programs on all architectures. Because some JITs do not
implement BPF_NOSPEC, this patch therefore may regress unpriv BPF's
security to a limited extent:
* The regression is limited to systems vulnerable to Spectre v1, have
unprivileged BPF enabled, and do NOT emit insns for BPF_NOSPEC. The
latter is not the case for x86 64- and 32-bit, arm64, and powerpc
64-bit and they are therefore not affected by the regression.
According to commit a6f6a95f2580 ("LoongArch, bpf: Fix jit to skip
speculation barrier opcode"), LoongArch is not vulnerable to Spectre
v1 and therefore also not affected by the regression.
* To the best of my knowledge this regression may therefore only affect
MIPS. This is deemed acceptable because unpriv BPF is still disabled
there by default. As stated in a previous commit, BPF_NOSPEC could be
implemented for MIPS based on GCC's speculation_barrier
implementation.
* It is unclear which other architectures (besides x86 64- and 32-bit,
ARM64, PowerPC 64-bit, LoongArch, and MIPS) supported by the kernel
are vulnerable to Spectre v1. Also, it is not clear if barriers are
available on these architectures. Implementing BPF_NOSPEC on these
architectures therefore is non-trivial. Searching GCC and the kernel
for speculation barrier implementations for these architectures
yielded no result.
* If any of those regressed systems is also vulnerable to Spectre v4,
the system was already vulnerable to Spectre v4 attacks based on
unpriv BPF before this patch and the impact is therefore further
limited.
As an alternative to regressing security, one could still reject
programs if the architecture does not emit BPF_NOSPEC (e.g., by removing
the empty BPF_NOSPEC-case from all JITs except for LoongArch where it
appears justified). However, this will cause rejections on these archs
that are likely unfounded in the vast majority of cases.
In the tests, some are now successful where we previously had a
false-positive (i.e., rejection). Change them to reflect where the
nospec should be inserted (using __xlated_unpriv) and modify the error
message if the nospec is able to mitigate a problem that previously
shadowed another problem (in that case __xlated_unpriv does not work,
therefore just add a comment).
To avoid duplicating this ifdef whenever we check for nospec insns
using __xlated_unpriv, define SPEC_V1 once here. This also improves
readability. PowerPC can probably also be added here; however, omit it
for now because the BPF CI currently does not include a test.
Limit it to EPERM, EACCES, and EINVAL (and not everything except for
EFAULT and ENOMEM) as it already has the desired effect for most
real-world programs. I briefly went through all the occurrences of
EPERM, EINVAL, and EACCES in verifier.c to validate that catching them
like this makes sense.
Thanks to Dustin for their help in checking the vendor documentation.
[1] https://lpc.events/event/18/contributions/1954/ ("Mitigating
Spectre-PHT using Speculation Barriers in Linux eBPF")
[2] https://arxiv.org/pdf/2405.00078 ("VeriFence: Lightweight and
Precise Spectre Defenses for Untrusted Linux Kernel Extensions")
[3] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/runtime-speculative-side-channel-mitigations.html
("Managed Runtime Speculative Execution Side Channel Mitigations")
[4] https://dl.acm.org/doi/pdf/10.1145/3359789.3359837 ("Speculator: a
tool to analyze speculative execution attacks and mitigations" -
Section 4.6 "Stopping Speculative Execution")
[5] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/software-techniques-for-managing-speculation.pdf
("White Paper - SOFTWARE TECHNIQUES FOR MANAGING SPECULATION ON AMD
PROCESSORS - REVISION 5.09.23")
[6] https://developer.arm.com/documentation/ddi0597/2020-12/Base-Instructions/SB--Speculation-Barrier-
("SB - Speculation Barrier - Arm Armv8-A A32/T32 Instruction Set
Architecture (2020-12)")
[7] https://wiki.raptorcs.com/w/images/5/5f/OPF_PowerISA_v3.1C.pdf
("Power ISA™ - Version 3.1C - May 26, 2024 - Section 9.2.1 of Book
III")
[8] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/40332.pdf
("AMD64 Architecture Programmer’s Manual Volumes 1–5 - Revision 4.08
- April 2024 - 7.6.4 Serializing Instructions")
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Henriette Herzog <henriette.herzog@rub.de>
Cc: Dustin Nguyen <nguyen@cs.fau.de>
Cc: Maximilian Ott <ott@cs.fau.de>
Cc: Milan Stephan <milan.stephan@fau.de>
Link: https://lore.kernel.org/r/20250603212428.338473-1-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Luis Gerhorst [Tue, 3 Jun 2025 21:20:24 +0000 (23:20 +0200)]
bpf: Rename sanitize_stack_spill to nospec_result
This is made to clarify that this flag will cause a nospec to be added
after this insn and can therefore be relied upon to reduce speculative
path analysis.
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Henriette Herzog <henriette.herzog@rub.de>
Cc: Maximilian Ott <ott@cs.fau.de>
Cc: Milan Stephan <milan.stephan@fau.de>
Link: https://lore.kernel.org/r/20250603212024.338154-1-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Luis Gerhorst [Tue, 3 Jun 2025 21:17:03 +0000 (23:17 +0200)]
bpf, arm64, powerpc: Change nospec to include v1 barrier
This changes the semantics of BPF_NOSPEC (previously a v4-only barrier)
to always emit a speculation barrier that works against both Spectre v1
AND v4. If mitigation is not needed on an architecture, the backend
should set bpf_jit_bypass_spec_v4/v1().
As of now, this commit only has the user-visible implication that unpriv
BPF's performance on PowerPC is reduced. This is the case because we
have to emit additional v1 barrier instructions for BPF_NOSPEC now.
This commit is required for a future commit to allow us to rely on
BPF_NOSPEC for Spectre v1 mitigation. As of this commit, the feature
that nospec acts as a v1 barrier is unused.
Commit f5e81d111750 ("bpf: Introduce BPF nospec instruction for
mitigating Spectre v4") noted that mitigation instructions for v1 and v4
might be different on some archs. While this would potentially offer
improved performance on PowerPC, it was dismissed after the following
considerations:
* Only having one barrier simplifies the verifier and allows us to
easily rely on v4-induced barriers for reducing the complexity of
v1-induced speculative path verification.
* For the architectures that implemented BPF_NOSPEC, only PowerPC has
distinct instructions for v1 and v4. Even there, some insns may be
shared between the barriers for v1 and v4 (e.g., 'ori 31,31,0' and
'sync'). If this is still found to impact performance in an
unacceptable way, BPF_NOSPEC can be split into BPF_NOSPEC_V1 and
BPF_NOSPEC_V4 later. As an optimization, we can already skip v1/v4
insns from being emitted for PowerPC with this setup if
bypass_spec_v1/v4 is set.
Vulnerability-status for BPF_NOSPEC-based Spectre mitigations (v4 as of
this commit, v1 in the future) is therefore:
* x86 (32-bit and 64-bit), ARM64, and PowerPC (64-bit): Mitigated - This
patch implements BPF_NOSPEC for these architectures. The previous
v4-only version was supported since commit f5e81d111750 ("bpf:
Introduce BPF nospec instruction for mitigating Spectre v4") and
commit b7540d625094 ("powerpc/bpf: Emit stf barrier instruction
sequences for BPF_NOSPEC").
* LoongArch: Not Vulnerable - Commit a6f6a95f2580 ("LoongArch, bpf: Fix
jit to skip speculation barrier opcode") is the only other past commit
related to BPF_NOSPEC and indicates that the insn is not required
there.
* MIPS: Vulnerable (if unprivileged BPF is enabled) -
Commit a6f6a95f2580 ("LoongArch, bpf: Fix jit to skip speculation
barrier opcode") indicates that it is not vulnerable, but this
contradicts the kernel and Debian documentation. Therefore, I assume
that there exist vulnerable MIPS CPUs (but maybe not from Loongson?).
In the future, BPF_NOSPEC could be implemented for MIPS based on the
GCC speculation_barrier [1]. For now, we rely on unprivileged BPF
being disabled by default.
* Other: Unknown - To the best of my knowledge there is no definitive
information available that indicates that any other arch is
vulnerable. They are therefore left untouched (BPF_NOSPEC is not
implemented, but bypass_spec_v1/v4 is also not set).
I did the following testing to ensure the insn encoding is correct:
* ARM64:
* 'dsb nsh; isb' was successfully tested with the BPF CI in [2]
* 'sb' locally using QEMU v7.2.15 -cpu max (emitted sb insn is
executed for example with './test_progs -t verifier_array_access')
* PowerPC: The following configs were tested locally with ppc64le QEMU
v8.2 '-machine pseries -cpu POWER9':
* STF_BARRIER_EIEIO + CONFIG_PPC_BOOK3S_64
* STF_BARRIER_SYNC_ORI (forced on) + CONFIG_PPC_BOOK3S_64
* STF_BARRIER_FALLBACK (forced on) + CONFIG_PPC_BOOK3S_64
* CONFIG_PPC_E500 (forced on) + STF_BARRIER_EIEIO
* CONFIG_PPC_E500 (forced on) + STF_BARRIER_SYNC_ORI (forced on)
* CONFIG_PPC_E500 (forced on) + STF_BARRIER_FALLBACK (forced on)
* CONFIG_PPC_E500 (forced on) + STF_BARRIER_NONE (forced on)
Most of those combinations should not occur in practice, but I was not
able to get a PPC e6500 rootfs (for testing PPC_E500 without forcing
it on). In any case, this should ensure that there are no unexpected
conflicts between the insns when combined like this. Individual v1/v4
barriers were already emitted elsewhere.
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Cc: Henriette Herzog <henriette.herzog@rub.de>
Cc: Maximilian Ott <ott@cs.fau.de>
Cc: Milan Stephan <milan.stephan@fau.de>
Link: https://lore.kernel.org/r/20250603211703.337860-1-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
JITs can set bpf_jit_bypass_spec_v1/v4() if they want the verifier to
skip analysis/patching for the respective vulnerability. For v4, this
will reduce the number of barriers the verifier inserts. For v1, it
allows more programs to be accepted.
The primary motivation for this is to not regress unpriv BPF's
performance on ARM64 in a future commit where BPF_NOSPEC is also used
against Spectre v1.
This has the user-visible change that v1-induced rejections on
non-vulnerable PowerPC CPUs are avoided.
For now, this does not change the semantics of BPF_NOSPEC. It is still a
v4-only barrier and must not be implemented if bypass_spec_v4 is always
true for the arch. Changing it to a v1 AND v4-barrier is done in a
future commit.
As an alternative to bypass_spec_v1/v4, one could introduce NOSPEC_V1
AND NOSPEC_V4 instructions and allow backends to skip their lowering as
suggested by commit f5e81d111750 ("bpf: Introduce BPF nospec instruction
for mitigating Spectre v4"). Adding bpf_jit_bypass_spec_v1/v4() was
found to be preferable for the following reason:
* bypass_spec_v1/v4 benefits non-vulnerable CPUs: Always performing the
same analysis (not taking into account whether the current CPU is
vulnerable) needlessly restricts users of CPUs that are not
vulnerable. The only use case for this would be portability-testing,
but this can later be added easily when needed by allowing users to
force bypass_spec_v1/v4 to false.
* Portability is still acceptable: Directly disabling the analysis
instead of skipping the lowering of BPF_NOSPEC(_V1/V4) might allow
programs on non-vulnerable CPUs to be accepted while the program will
be rejected on vulnerable CPUs. With the fallback to speculation
barriers for Spectre v1 implemented in a future commit, this will only
affect programs that do variable stack-accesses or are very complex.
For PowerPC, the SEC_FTR checking in bpf_jit_bypass_spec_v4() is based
on the check that was previously located in the BPF_NOSPEC case.
For LoongArch, it would likely be safe to set both
bpf_jit_bypass_spec_v1() and _v4() according to
commit a6f6a95f2580 ("LoongArch, bpf: Fix jit to skip speculation
barrier opcode"). This is omitted here as I am unable to do any testing
for LoongArch.
Hari's ack concerns the PowerPC part only.
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Cc: Henriette Herzog <henriette.herzog@rub.de>
Cc: Maximilian Ott <ott@cs.fau.de>
Cc: Milan Stephan <milan.stephan@fau.de>
Link: https://lore.kernel.org/r/20250603211318.337474-1-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Luis Gerhorst [Tue, 3 Jun 2025 20:57:53 +0000 (22:57 +0200)]
bpf: Return -EFAULT on misconfigurations
Mark these cases as non-recoverable to later prevent them from being
caught when they occur during speculative path verification.
Eduard writes [1]:
The only place I'm aware of that might act upon a specific error code
from the verifier syscall is libbpf. Looking through libbpf code, it
seems that this change does not interfere with libbpf.
Luis Gerhorst [Tue, 3 Jun 2025 20:57:52 +0000 (22:57 +0200)]
bpf: Move insn if/else into do_check_insn()
This is required to catch the errors later and fall back to a nospec if
on a speculative path.
Eliminate the regs variable as it is only used once and insn_idx is not
modified in-between the definition and usage.
Do not pass insn but compute it in the function itself. As Eduard points
out [1], insn is assumed to correspond to env->insn_idx in many places
(e.g., __check_reg_arg()).
Move code into do_check_insn(), replace
* "continue" with "return 0" after modifying insn_idx
* "goto process_bpf_exit" with "return PROCESS_BPF_EXIT"
* "goto process_bpf_exit_full" with "return process_bpf_exit_full()"
* "do_print_state = " with "*do_print_state = "
Ihor Solodrai [Mon, 9 Jun 2025 18:30:24 +0000 (11:30 -0700)]
selftests/bpf: Add test cases with CONST_PTR_TO_MAP null checks
A test requires the following to happen:
* CONST_PTR_TO_MAP value is checked for null
* the code in the null branch fails verification
Add test cases:
* direct global map_ptr comparison to null
* lookup inner map, then two checks (the first transforms
map_value_or_null into map_ptr)
* lookup inner map, spill-fill it, then check for null
* use an array of ringbufs to recreate a common coding pattern [1]
Ihor Solodrai [Mon, 9 Jun 2025 18:30:22 +0000 (11:30 -0700)]
bpf: Make reg_not_null() true for CONST_PTR_TO_MAP
When reg->type is CONST_PTR_TO_MAP, it cannot be null. However, the
verifier explores the branches under rX == 0 in check_cond_jmp_op()
even if reg->type is CONST_PTR_TO_MAP, because it was not checked for
in reg_not_null().
Fix this by adding CONST_PTR_TO_MAP to the set of types that are
considered non-nullable in reg_not_null().
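A simplified sketch of the change (the in-tree helper covers more types
and conditions):
	static bool reg_not_null(const struct bpf_reg_state *reg)
	{
		enum bpf_reg_type type = base_type(reg->type);

		if (type_may_be_null(reg->type))
			return false;

		return type == PTR_TO_SOCKET ||
		       type == PTR_TO_MAP_VALUE ||
		       type == PTR_TO_MEM ||
		       type == CONST_PTR_TO_MAP;	/* new: a map pointer is never null */
	}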
An old "unpriv: cmp map pointer with zero" selftest fails with this
change, because now early out correctly triggers in
check_cond_jmp_op(), making the verification to pass.
In practice verifier may allow pointer to null comparison in unpriv,
since in many cases the relevant branch and comparison op are removed
as dead code. So change the expected test result to __success_unpriv.
Tao Chen [Fri, 6 Jun 2025 15:02:58 +0000 (23:02 +0800)]
bpf: Add show_fdinfo for perf_event
After commit 1b715e1b0ec5 ("bpf: Support ->fill_link_info for perf_event") added
perf_event info, we can also show the info via cat /proc/[fd]/fdinfo.
====================
bpf: Implement mprog API on top of existing cgroup progs
Currently, cgroup progs are simply appended at attachment time, so their
ordering follows attachment order. This is not ideal. In some cases, users
want a specific ordering at a particular cgroup
level. For example, in Meta, we have a case where three different
applications all have cgroup/setsockopt progs and they require specific
ordering. The current approach is to use a bpfchainer, where one bpf prog
contains multiple global functions and each global function can be
freplaced by a prog for a specific application. The ordering of global
functions decides the ordering of those application specific bpf progs.
Using a bpfchainer is a centralized approach and is not desirable, as
one of the applications must act as a daemon. The decentralized attachment
approach is more favorable for those applications.
To address this, the existing mprog API ([2]) seems an ideal solution,
supporting BPF_F_BEFORE and BPF_F_AFTER flags on top of the existing cgroup
bpf implementation. More specifically, support is added for prog/link
attachment with BPF_F_BEFORE and BPF_F_AFTER. The kernel mprog
interface ([2]) is not used; the implementation is done directly in the
cgroup bpf code base. The mprog 'revision' is also implemented in
attach/detach/replace, so users can query the revision number to check
for changes to the cgroup prog list.
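For example, an application could insert its prog relative to another one
roughly as follows (fd and revision values are hypothetical):
	LIBBPF_OPTS(bpf_prog_attach_opts, opts,
		.flags = BPF_F_ALLOW_MULTI | BPF_F_BEFORE,
		.relative_fd = anchor_prog_fd,		/* insert before this prog */
		.expected_revision = last_seen_revision,	/* fail if list changed */
	);
	err = bpf_prog_attach_opts(prog_fd, cgroup_fd, BPF_CGROUP_SETSOCKOPT, &opts);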
The patch set contains 5 patches. Patch 1 adds revision support for
cgroup bpf progs. Patch 2 implements mprog API implementation for
prog/link attach and revision update. Patch 3 adds a new libbpf
API to do cgroup link attach with flags like BPF_F_BEFORE/BPF_F_AFTER.
Patches 4 and 5 add two tests to validate the implementation.
Changelogs:
v4 -> v5:
- v4: https://lore.kernel.org/bpf/20250530173812.1823479-1-yonghong.song@linux.dev/
- Remove the early prog/link checking based on flags and id_or_fd, as later
code will do the checking as well.
- Do proper cgroup flag checking for bpf_prog_attach().
v3 -> v4:
- v3: https://lore.kernel.org/bpf/20250517162720.4077882-1-yonghong.song@linux.dev/
- Refactor some to make BPF_F_BEFORE/BPF_F_AFTER handling easier to understand.
- Previously, I degraded 'link' to 'prog' for later mprog handling. This is
not correct. Similar to mprog.c, we should check 'link' instead of
link->prog, since it is possible for two different links to have the same
underlying prog, and we do not want to miss supporting such a use case.
v2 -> v3:
- v2: https://lore.kernel.org/bpf/20250508223524.487875-1-yonghong.song@linux.dev/
- Big change: replace get_anchor_prog() with get_prog_list() so that
'struct bpf_prog_list *' is returned directly.
- Support 'BPF_F_BEFORE | BPF_F_AFTER' attachment if the prog list is empty
and flags do not have 'BPF_F_LINK | BPF_F_ID' and id_or_fd is 0.
- Add BPF_F_LINK support.
- Patch 4 is added to reuse id_from_prog_fd() and id_from_link_fd().
v1 -> v2:
- v1: https://lore.kernel.org/bpf/20250411011523.1838771-1-yonghong.song@linux.dev/
- Change cgroup_bpf.revisions from atomic64_t to u64.
- Added missing bpf_prog_put in various places.
- Rename get_cmp_prog() to get_anchor_prog(). The implementation tries to
find the anchor prog regardless of whether id_or_fd is set.
- Rename bpf_cgroup_prog_attached() to is_cgroup_prog_type() and handle
BPF_PROG_TYPE_LSM properly (with BPF_LSM_CGROUP attach type).
- I kept the 'id || id_or_fd' condition, as the 'id' condition is also used
in mprog.c, so I assume it is okay in cgroup.c as well.
====================
Yonghong Song [Fri, 6 Jun 2025 16:31:56 +0000 (09:31 -0700)]
selftests/bpf: Add two selftests for mprog API based cgroup progs
Two tests are added:
- cgroup_mprog_opts, which mimics tc_opts.c ([1]). Both prog and link
attach are tested. Some negative tests are also included.
- cgroup_mprog_ordering, which actually runs the program with some mprog
API flags.
Yonghong Song [Fri, 6 Jun 2025 16:31:51 +0000 (09:31 -0700)]
selftests/bpf: Move some tc_helpers.h functions to test_progs.h
Move static inline functions id_from_prog_fd() and id_from_link_fd()
from prog_tests/tc_helpers.h to test_progs.h so these two functions
can be reused for later cgroup mprog selftests.
Yonghong Song [Fri, 6 Jun 2025 16:31:46 +0000 (09:31 -0700)]
libbpf: Support link-based cgroup attach with options
Currently libbpf supports bpf_program__attach_cgroup() with signature:
LIBBPF_API struct bpf_link *
bpf_program__attach_cgroup(const struct bpf_program *prog, int cgroup_fd);
To support mprog-style attachment, additional fields like flags,
relative_{fd,id} and expected_revision are needed.
Add a new API:
LIBBPF_API struct bpf_link *
bpf_program__attach_cgroup_opts(const struct bpf_program *prog, int cgroup_fd,
const struct bpf_cgroup_opts *opts);
where bpf_cgroup_opts contains all above needed fields.
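A usage sketch (identifiers other than the API names are hypothetical):
	LIBBPF_OPTS(bpf_cgroup_opts, opts,
		.flags = BPF_F_BEFORE,
		.relative_id = anchor_prog_id,
		.expected_revision = rev,
	);
	link = bpf_program__attach_cgroup_opts(prog, cgroup_fd, &opts);
	if (!link)
		err = -errno;	/* errno is set by libbpf on failure */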
Yonghong Song [Fri, 6 Jun 2025 16:31:41 +0000 (09:31 -0700)]
bpf: Implement mprog API on top of existing cgroup progs
Currently, cgroup progs are simply appended at attachment time, so their
ordering follows attachment order. This is not ideal. In some cases, users
want a specific ordering at a particular cgroup level. To address this, the
existing mprog API seems an ideal solution, supporting BPF_F_BEFORE and
BPF_F_AFTER flags.
But there are a few obstacles to directly using the kernel mprog
interface. Currently, cgroup bpf progs already support prog
attach/detach/replace and link-based attach/detach/replace. For example,
in struct bpf_prog_array_item, the cgroup_storage field needs to live
together with the bpf prog, but the mprog API struct bpf_mprog_fp only has
the bpf_prog as a member, which makes it difficult to use the kernel mprog
interface. In another case, the current cgroup prog detach tries to use
the same flag as in attach. This differs from the kernel mprog interface,
which uses flags passed from user space.
So to avoid modifying existing behavior, I made the following changes to
support mprog API for cgroup progs:
- The support is for prog list at cgroup level. Cross-level prog list
(a.k.a. effective prog list) is not supported.
- Previously, BPF_F_PREORDER is supported only for prog attach, now
BPF_F_PREORDER is also supported by link-based attach.
- For attach, BPF_F_BEFORE/BPF_F_AFTER/BPF_F_ID/BPF_F_LINK is supported
similar to kernel mprog but with different implementation.
- For detach and replace, use the existing implementation.
- For attach, detach and replace, the revision for a particular prog
list, associated with a particular attach type, is incremented by 1.
Yonghong Song [Fri, 6 Jun 2025 16:31:36 +0000 (09:31 -0700)]
cgroup: Add bpf prog revisions to struct cgroup_bpf
One of the key items in the mprog API is the revision for the prog list.
The revision number is increased whenever the prog list changes, e.g., on
attach, detach or replace.
Add a 'revisions' field to struct cgroup_bpf, representing revisions for
all cgroup-related attachment types. The initial revision value is
set to 1, the same as the kernel mprog implementation.
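Assuming the cgroup query path exposes the field the same way the
mprog-based hooks do, reading the revision back might look like:
	LIBBPF_OPTS(bpf_prog_query_opts, qopts);

	err = bpf_prog_query_opts(cgroup_fd, BPF_CGROUP_SETSOCKOPT, &qopts);
	if (!err)
		printf("revision: %llu\n", (unsigned long long)qopts.revision);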
====================
selftests/bpf: Fix a few test failures with arm64 64KB page
My local arm64 host has a 64KB page size, and so does the VM used to run
test_progs. A few selftests assume a 4KB page size and failed in my
environment.
Patch 1 reduces long assert logs so that when a test fails, developers
can check the logs easily. Patches 2-4 fix three selftest failures.
Changelogs:
v3 -> v4:
- v3: https://lore.kernel.org/bpf/20250606213048.340421-1-yonghong.song@linux.dev/
- In v3, I tried to use __kconfig with CONFIG_ARM64_64K_PAGES to decide
between 4K and 64K alignment. But CI was unhappy with this, most likely
because of lskel. So in v4, simply adjust/increase the numbers to be
64K-aligned for the test_ringbuf_write test.
v2 -> v3:
- v2: https://lore.kernel.org/bpf/20250606174139.3036576-1-yonghong.song@linux.dev/
- Fix veristat failure with bpf object file test_ringbuf_write.bpf.o.
v1 -> v2:
- v1: https://lore.kernel.org/bpf/20250606032309.444401-1-yonghong.song@linux.dev/
- Fix a problem with selftest release build, basically from
BUILD_BUG_ON to ASSERT_LT.
====================
Yonghong Song [Sat, 7 Jun 2025 01:36:21 +0000 (18:36 -0700)]
selftests/bpf: Fix ringbuf/ringbuf_write test failure with arm64 64KB page size
The ringbuf max_entries must be PAGE_ALIGNED; see the kernel function
ringbuf_map_alloc(). So for the arm64 64KB page size, adjust max_entries
and other related metrics accordingly.
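Where a light skeleton is not involved, a portable alternative is to size
the map from the runtime page size before loading, e.g.:
	long page_sz = sysconf(_SC_PAGESIZE);	/* 4096 or 65536 */

	/* ringbuf max_entries must be a page-aligned power of two;
	 * skel->maps.ringbuf is a hypothetical skeleton handle */
	bpf_map__set_max_entries(skel->maps.ringbuf, page_sz);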
Yonghong Song [Sat, 7 Jun 2025 01:36:15 +0000 (18:36 -0700)]
selftests/bpf: Fix bpf_mod_race test failure with arm64 64KB page size
Currently, uffd_register.range.len is set to 4096 for the command
'ioctl(uffd, UFFDIO_REGISTER, &uffd_register)'. For the arm64 64KB page
size, the len must be 64KB-aligned, as page size alignment is required.
See fs/userfaultfd.c:validate_unaligned_range().
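A sketch of the page-size-agnostic registration (the mapping 'addr' is
assumed to be page-aligned):
	struct uffdio_register uffd_register = {
		.range.start = (unsigned long)addr,
		.range.len   = sysconf(_SC_PAGESIZE),	/* was a hardcoded 4096 */
		.mode        = UFFDIO_REGISTER_MODE_MISSING,
	};
	if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register) < 0)
		perror("UFFDIO_REGISTER");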
Rong Tao [Thu, 5 Jun 2025 08:45:14 +0000 (16:45 +0800)]
selftests/bpf: rbtree: Fix incorrect global variable usage
Within the __add_three() function, the function parameters should be used
instead of global variables, so that the variables groot_nested.inner.root
and groot_nested.inner.glock in rbtree_add_nodes_nested() are tested
correctly.
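The shape of the fix, sketched with a hypothetical node type and 'less'
callback:
	static void __add_three(struct bpf_rb_root *root, struct bpf_spin_lock *lock)
	{
		struct node_data *n = bpf_obj_new(typeof(*n));

		if (!n)
			return;
		bpf_spin_lock(lock);			/* was: bpf_spin_lock(&glock) */
		bpf_rbtree_add(root, &n->node, less);	/* was: bpf_rbtree_add(&groot, ...) */
		bpf_spin_unlock(lock);
	}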
Blake Jones [Tue, 3 Jun 2025 20:37:00 +0000 (13:37 -0700)]
libbpf: Add support for printing BTF character arrays as strings
The BTF dumper code currently displays arrays of characters as just that -
arrays, with each character formatted individually. Sometimes this is what
makes sense, but it's nice to be able to treat that array as a string.
This change adds a special case to the btf_dump functionality to allow
0-terminated arrays of single-byte integer values to be printed as
character strings. Characters for which isprint() returns false are
printed as hex-escaped values. This is enabled when the new ".emit_strings"
field is set to 1 in the btf_dump_type_data_opts structure.
As an example, here's what it looks like to dump the string "hello" using
a few different field values for btf_dump_type_data_opts (.compact = 1):
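A sketch of the invocation (dumper setup and typed data elided; the
commented output is approximate, not the original example):
	LIBBPF_OPTS(btf_dump_type_data_opts, opts,
		.compact = 1,
		.emit_strings = 1,	/* print 0-terminated char arrays as strings */
	);
	/* with .emit_strings = 0, each element is formatted individually,
	 * roughly: ['h','e','l','l','o',]
	 * with .emit_strings = 1, the array prints as: "hello"
	 */
	err = btf_dump__dump_type_data(d, type_id, data, data_sz, &opts);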
Luis Gerhorst [Tue, 3 Jun 2025 20:45:57 +0000 (22:45 +0200)]
bpf: Clarify sanitize_check_bounds()
As is, it appears as if pointer arithmetic is allowed for everything
except PTR_TO_{STACK,MAP_VALUE} if one only looks at
sanitize_check_bounds(). However, this is misleading as the function
only works together with retrieve_ptr_limit() and the two must be kept
in sync. This patch documents the interdependency and adds a check to
ensure they stay in sync.
adjust_ptr_min_max_vals(): Because the preceding switch returns -EACCES
for every opcode except for ADD/SUB, the sanitize_needed() following the
sanitize_check_bounds() call is always true if reached. This means,
unless sanitize_check_bounds() detected that the pointer goes OOB
because of the ADD/SUB and returns -EACCES, sanitize_ptr_alu() always
executes after sanitize_check_bounds().
The following shows that this also implies that retrieve_ptr_limit()
runs in all relevant cases.
Note that there are two calls to sanitize_ptr_alu(), these are simply
needed to easily calculate the correct alu_limit as explained in
commit 7fedb63a8307 ("bpf: Tighten speculative pointer arithmetic
mask"). The truncation-simulation is already performed on the first
call.
In the second sanitize_ptr_alu(commit_window = true), we always run
retrieve_ptr_limit(), unless:
* can_skip_alu_sanitation() is true, notably `BPF_SRC(insn->code) ==
BPF_K`. BPF_K is fine because it means that there is no scalar
register (which could be subject to speculative scalar confusion due
to Spectre v4) that goes into the ALU operation. The pointer register
can not be subject to v4-based value confusion due to the nospec
added. Thus, in this case it would have been fine to also skip
sanitize_check_bounds().
* If we are on a speculative path (`vstate->speculative`) and in the
second "commit" phase, sanitize_ptr_alu() always just returns 0. This
makes sense because there are no ALU sanitization limits to be learned
from speculative paths. Furthermore, because the sanitization will
ensure that pointer arithmetic stays in (architectural) bounds, the
sanitize_check_bounds() on the speculative path could also be skipped.
The second case needs more attention: Assume we have some ALU operation
that is used with scalars architecturally, but with a
non-PTR_TO_{STACK,MAP_VALUE} pointer (e.g., PTR_TO_PACKET)
speculatively. It might appear as if this would allow unsanitized
pointer ALU operations, but this cannot happen because one of the
following two always holds:
* The type mismatch stems from Spectre v4, then it is prevented by a
nospec after the possibly-bypassed store involving the pointer. There
is no speculative path simulated for this case thus it never happens.
* The type mismatch stems from a Spectre v1 gadget like the following:
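A hedged reconstruction of such a gadget's shape (the original listing
is not preserved here; register roles follow the discussion below):
	    r2 = <scalar>
	    if r1 != 1 goto l1      /* architecturally never taken... */
	    r2 = <PTR_TO_PACKET>    /* ...but explored speculatively */
	l1: if r4 != 1 goto l2      /* the "r4-if block" */
	    r2 += r0                /* pointer ALU; r2's type differs by path */
	l2: exit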
If `r2 = PTR_TO_PACKET` is indeed dead code, it will be sanitized to
`goto -1` (as is the case for the r4-if block). If it is not (e.g., if
`r1 = r4 = 1` is possible), it will also be explored on an
architectural path and retrieve_ptr_limit() will reject it.
To summarize, the exception for `vstate->speculative` is safe.
Back to retrieve_ptr_limit(): It only allows the ALU operation if the
involved pointer register (can be either source or destination for ADD)
is PTR_TO_STACK or PTR_TO_MAP_VALUE. Otherwise, it returns -EOPNOTSUPP.
Therefore, sanitize_check_bounds() returning 0 for
non-PTR_TO_{STACK,MAP_VALUE} is fine because retrieve_ptr_limit() also
runs for all relevant cases and prevents unsafe operations.
To summarize, we allow unsanitized pointer arithmetic with 64-bit
ADD/SUB for the following instructions if the requirements from
retrieve_ptr_limit() AND sanitize_check_bounds() hold:
Jiawei Zhao [Sat, 31 May 2025 09:51:11 +0000 (17:51 +0800)]
libbpf: Correct some typos and syntax issues in usdt doc
Fix some incorrect words, such as "and" -> "an", "it's" -> "its". Fix
some grammar issues, such as removing redundant "will", "would
complicated" -> "would complicate".
Tao Chen [Tue, 3 Jun 2025 15:43:07 +0000 (23:43 +0800)]
bpf: Add cookie to raw_tp bpf_link_info
After commit 68ca5d4eebb8 ("bpf: support BPF cookie in raw tracepoint
(raw_tp, tp_btf) programs") added cookie support, we can also show the
cookie in bpf_link_info, as is done for kprobe etc.
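A hedged sketch of reading it back (tp_name buffer handling omitted):
	struct bpf_link_info info = {};
	__u32 len = sizeof(info);

	if (!bpf_link_get_info_by_fd(link_fd, &info, &len))
		printf("raw_tp cookie: %llu\n",
		       (unsigned long long)info.raw_tracepoint.cookie);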
Linus Torvalds [Thu, 5 Jun 2025 15:54:47 +0000 (08:54 -0700)]
Merge tag 'rtc-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"There are two new drivers this cycle. There is also support for a
negative offset for RTCs that have been shipped with a date set using
an epoch that is before 1970. This unfortunately happens with some
products that ship with a vendor kernel and an out of tree driver.
Core:
- support negative offsets for RTCs that have shipped with an epoch
earlier than 1970
New drivers:
- NXP S32G2/S32G3
- Sophgo CV1800
Drivers:
- loongson: fix missing alarm notifications for ACPI
- m41t80: kickstart ocillator upon failure
- mt6359: mt6357 support
- pcf8563: fix wrong alarm register
- sh: cleanups"
* tag 'rtc-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (39 commits)
rtc: mt6359: Add mt6357 support
rtc: test: Test date conversion for dates starting in 1900
rtc: test: Also test time and wday outcome of rtc_time64_to_tm()
rtc: test: Emit the seconds-since-1970 value instead of days-since-1970
rtc: Fix offset calculation for .start_secs < 0
rtc: Make rtc_time64_to_tm() support dates before 1970
rtc: pcf8563: fix wrong alarm register
rtc: rzn1: support input frequencies other than 32768Hz
rtc: rzn1: Disable controller before initialization
dt-bindings: rtc: rzn1: add optional second clock
rtc: m41t80: reduce verbosity
rtc: m41t80: kickstart ocillator upon failure
rtc: s32g: add NXP S32G2/S32G3 SoC support
dt-bindings: rtc: add schema for NXP S32G2/S32G3 SoCs
dt-bindings: at91rm9260-rtt: add microchip,sama7d65-rtt
dt-bindings: rtc: at91rm9200: add microchip,sama7d65-rtc
rtc: loongson: Add missing alarm notifications for ACPI RTC events
rtc: sophgo: add rtc support for Sophgo CV1800 SoC
rtc: stm32: drop unused module alias
rtc: s3c: drop unused module alias
...
Linus Torvalds [Thu, 5 Jun 2025 15:49:30 +0000 (08:49 -0700)]
Merge tag 'dmaengine-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine
Pull dmaengine updates from Vinod Koul:
"A fairly small update for the dmaengine subsystem. This has a new ARM
dmaengine driver and couple of new device support and few driver
changes:
New support:
- Renesas RZ/V2H(P) dma support for r9a09g057
- Arm DMA-350 driver
- Tegra Tegra264 ADMA support
Linus Torvalds [Thu, 5 Jun 2025 15:20:21 +0000 (08:20 -0700)]
Merge tag 'phy-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy
Pull phy updates from Vinod Koul:
"As usual featuring couple of new driver and bunch of new device
support and some driver changes to Freescale, rockchip driver along
with couple of yaml binding conversions.
New Support:
- Qualcomm IPQ5424 qusb2 support, IPQ5018 uniphy-pcie driver
- Rockchip usb2 support for RK3562, RK3036 usb2 phy support
- Samsung exynos2200 eusb2 phy support and driver refactoring for
this support, exynos7870 USBDRD support
- Mediatek MT7988 xs-phy support
- Broadcom BCM74110 usb phy support
- Renesas RZ/V2H(P) usb2 phy support
Updates:
- Freescale phy rate calculation updates, i.MX95 tuning support
- Better error handling for amlogic pcie phy
- Rockchip color depth configuration and management support
- Yaml binding conversion for RK3399 Type-C and PCIe Phy"
* tag 'phy-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy: (77 commits)
phy: tegra: p2u: Broaden architecture dependency
phy: rockchip: inno-usb2: Add usb2 phy support for rk3562
dt-bindings: phy: rockchip,inno-usb2phy: add rk3562
phy: rockchip: inno-usb2: add phy definition for rk3036
dt-bindings: phy: rockchip,inno-usb2phy: add rk3036 compatible
phy: freescale: fsl-samsung-hdmi: Improve LUT search for best clock
phy: freescale: fsl-samsung-hdmi: Refactor finding PHY settings
phy: freescale: fsl-samsung-hdmi: Rename phy_clk_round_rate
phy: renesas: phy-rcar-gen3-usb2: Add USB2.0 PHY support for RZ/V2H(P)
phy: renesas: phy-rcar-gen3-usb2: Sort compatible entries by SoC part number
dt-bindings: phy: renesas,usb2-phy: Document RZ/V2H(P) SoC
dt-bindings: phy: renesas,usb2-phy: Add clock constraint for RZ/G2L family
phy: exynos5-usbdrd: support Exynos USBDRD 3.2 4nm controller
phy: phy-snps-eusb2: add support for exynos2200
phy: phy-snps-eusb2: refactor reference clock init
phy: phy-snps-eusb2: make reset control optional
phy: phy-snps-eusb2: make repeater optional
phy: phy-snps-eusb2: split phy init code
phy: phy-snps-eusb2: refactor constructs names
phy: move phy-qcom-snps-eusb2 out of its vendor sub-directory
...
Linus Torvalds [Thu, 5 Jun 2025 15:07:24 +0000 (08:07 -0700)]
Merge tag 'soundwire-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire
Pull soundwire updates from Vinod Koul:
"A couple of small core changes and an Intel driver change:
- sdw_assign_device_num() logic simplification, using an internal slave
id for irqs and optimizing the computation of port params in specific
stream states
- Intel driver updates for ACE3+ microphone privacy status reporting
and enabling the status in HDA Intel driver"
* tag 'soundwire-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire:
soundwire: only compute port params in specific stream states
ASoC: SOF: Intel: hda: Set the mic_privacy flag for soundwire with ACE3+
soundwire: intel: Add awareness of ACE3+ microphone privacy
soundwire: bus: Add internal slave ID and use for IRQs
soundwire: bus: Simplify sdw_assign_device_num()
Linus Torvalds [Thu, 5 Jun 2025 04:18:37 +0000 (21:18 -0700)]
Merge tag 'rust-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux
Pull Rust updates from Miguel Ojeda:
"Toolchain and infrastructure:
- KUnit '#[test]'s:
- Support KUnit-mapped 'assert!' macros.
The support that landed last cycle was very basic, and the
'assert!' macros panicked since they were the standard library
ones. Now, they are mapped to the KUnit ones in a similar way to
how it is done for doctests, reusing the infrastructure there.
# my_first_test: ASSERTION FAILED at rust/kernel/lib.rs:251
Expected 42 == 43 to be true, but is false
# my_first_test.speed: normal
not ok 1 my_first_test
- Support tests with checked 'Result' return types.
The return value of test functions that return a 'Result' will
be checked, thus one can now easily catch errors when e.g. using
the '?' operator in tests.
With this, a failing test like:
#[test]
fn my_test() -> Result {
f()?;
Ok(())
}
will report:
# my_test: ASSERTION FAILED at rust/kernel/lib.rs:321
Expected is_test_result_ok(my_test()) to be true, but is false
# my_test.speed: normal
not ok 1 my_test
- Add 'kunit_tests' to the prelude.
- Clarify the remaining language unstable features in use.
- Compile 'core' with edition 2024 for Rust >= 1.87.
- Workaround 'bindgen' issue with forward references to 'enum' types.
- objtool: relax slice condition to cover more 'noreturn' functions.
- Use absolute paths in macros referencing 'core' and 'kernel'
crates.
- Skip '-mno-fdpic' flag for bindgen in GCC 32-bit arm builds.
- Clean some 'doc_markdown' lint hits -- we may enable it later on.
'kernel' crate:
- 'alloc' module:
- 'Box': support for type coercion, e.g. 'Box<T>' to 'Box<dyn U>'
if 'T' implements 'U'.
- 'Vec': implement new methods (prerequisites for nova-core and
binder): 'truncate', 'resize', 'clear', 'pop',
'push_within_capacity' (with new error type 'PushError'),
'drain_all', 'retain', 'remove' (with new error type
'RemoveError'), insert_within_capacity' (with new error type
'InsertError').
In addition, simplify 'push' using 'spare_capacity_mut', split
'set_len' into 'inc_len' and 'dec_len', add type invariant 'len
<= capacity' and simplify 'truncate' using 'dec_len'.
- 'time' module:
- Morph the Rust hrtimer subsystem into the Rust timekeeping
subsystem, covering delay, sleep, timekeeping, timers. This new
subsystem has all the relevant timekeeping C maintainers listed
in the entry.
- Replace 'Ktime' with 'Delta' and 'Instant' types to represent a
duration of time and a point in time.
- Temporarily add 'Ktime' to 'hrtimer' module to allow 'hrtimer'
to delay converting to 'Instant' and 'Delta'.
- 'xarray' module:
- Add a Rust abstraction for the 'xarray' data structure. This
abstraction allows Rust code to leverage the 'xarray' to store
types that implement 'ForeignOwnable'. This support is a
dependency for memory backing feature of the Rust null block
driver, which is waiting to be merged.
- Set up an entry in 'MAINTAINERS' for the XArray Rust support.
Patches will go to the new Rust XArray tree and then via the
Rust subsystem tree for now.
- Allow 'ForeignOwnable' to carry information about the pointed-to
type. This helps asserting alignment requirements for the
pointer passed to the foreign language.
- 'container_of!': retain pointer mut-ness and add a compile-time
check of the type of the first parameter ('$field_ptr').
- Support optional message in 'static_assert!'.
- Add C FFI types (e.g. 'c_int') to the prelude.
- 'str' module: simplify KUnit tests 'format!' macro, convert
'rusttest' tests into KUnit, take advantage of the '-> Result'
support in KUnit '#[test]'s.
- 'list' module: add examples for 'List', fix path of
'assert_pinned!' (so far unused macro rule).
- 'workqueue' module: remove 'HasWork::OFFSET'.
- 'page' module: add 'inline' attribute.
'macros' crate:
- 'module' macro: place 'cleanup_module()' in '.exit.text' section.
'pin-init' crate:
- Add 'Wrapper<T>' trait for creating pin-initializers for wrapper
structs with a structurally pinned value such as 'UnsafeCell<T>' or
'MaybeUninit<T>'.
- Add 'MaybeZeroable' derive macro to try to derive 'Zeroable', but
not error if not all fields implement it. This is needed to derive
'Zeroable' for all bindgen-generated structs.
- Add 'unsafe fn cast_[pin_]init()' functions to unsafely change the
initialized type of an initializer. These are utilized by the
'Wrapper<T>' implementations.
- Add support for visibility in 'Zeroable' derive macro.
- Add support for 'union's in 'Zeroable' derive macro.
- Upstream dev news: streamline CI, fix some bugs. Add new workflows
to check if the user-space version and the one in the kernel tree
have diverged. Use the issues tab [1] to track them, which should
help folks report and diagnose issues w.r.t. 'pin-init' better.
Linus Torvalds [Thu, 5 Jun 2025 02:23:37 +0000 (19:23 -0700)]
Merge tag '6.16-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server updates from Steve French:
"Four smb3 server fixes:
- Fix for special character handling when mounting with "posix"
- Fix for mounts from Mac for fs that don't provide unique inode
numbers
- Two cleanup patches (e.g. for crypto calls)"
* tag '6.16-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: allow a filename to contain special characters on SMB3.1.1 posix extension
ksmbd: provide zero as a unique ID to the Mac client
ksmbd: remove unnecessary softdep on crc32
ksmbd: use SHA-256 library API instead of crypto_shash API
Linus Torvalds [Thu, 5 Jun 2025 02:14:24 +0000 (19:14 -0700)]
Merge tag 'bcachefs-2025-06-04' of git://evilpiepirate.org/bcachefs
Pull more bcachefs updates from Kent Overstreet:
"More bcachefs updates:
- More stack usage improvements (~600 bytes)
- Define CLASS()es for some commonly used types, and convert most
rcu_read_lock() uses to the new lock guards
- New introspection:
- Superblock error counters are now available in sysfs:
previously, they were only visible with 'show-super', which
doesn't provide a live view
- New tracepoint, error_throw(), which is called any time we
return an error and start to unwind
- Repair
- check_fix_ptrs() can now repair btree node roots
- We can now repair when we've somehow ended up with the journal
using a superblock bucket
- Revert some leftovers from the aborted directory i_size feature,
and add repair code: some userspace programs (e.g. sshfs) were
getting confused
It seems in 6.15 there's a bug where i_nlink on the vfs inode has been
getting incorrectly set to 0, with some unfortunate results;
list_journal analysis showed bch2_inode_rm() being called (by
bch2_evict_inode()) when it clearly should not have been.
- bch2_inode_rm() now runs "should we be deleting this inode?" checks
that were previously only run when deleting unlinked inodes in
recovery
- check_subvol() was treating a dangling subvol (pointing to a
missing root inode) like a dangling dirent, and deleting it. This
was the really unfortunate one: check_subvol() will now recreate
the root inode if necessary
This took longer to debug than it should have, and we lost several
filesystems unnecessarily, because users have been ignoring the
release notes and blindly running 'fsck -y'. Debugging required
reconstructing what happened through analyzing the journal, when
ideally someone would have noticed 'hey, fsck is asking me if I want
to repair this: it usually doesn't, maybe I should run this in dry run
mode and check what's going on?'
As a reminder, fsck errors are being marked as autofix once we've
verified, in real world usage, that they're working correctly; blindly
running 'fsck -y' on an experimental filesystem is playing with fire
Up to this incident we've had an excellent track record of not losing
data, so let's try to learn from this one
This is a community effort, I wouldn't be able to get this done
without the help of all the people QAing and providing excellent bug
reports and feedback based on real world usage. But please don't
ignore advice and expect me to pick up the pieces
If an error isn't marked as autofix, and it /is/ happening in the
wild, that's also something I need to know about so we can check it
out and add it to the autofix list if repair looks good. I haven't
been getting those reports, and I should be; since we don't have any
sort of telemetry yet I am absolutely dependent on user reports
Now I'll be spending the weekend working on new repair code to see if
I can get a filesystem back for a user who didn't have backups"
* tag 'bcachefs-2025-06-04' of git://evilpiepirate.org/bcachefs: (69 commits)
bcachefs: add cond_resched() to handle_overwrites()
bcachefs: Make journal read log message a bit quieter
bcachefs: Fix subvol to missing root repair
bcachefs: Run may_delete_deleted_inode() checks in bch2_inode_rm()
bcachefs: delete dead code from may_delete_deleted_inode()
bcachefs: Add flags to subvolume_to_text()
bcachefs: Fix oops in btree_node_seq_matches()
bcachefs: Fix dirent_casefold_mismatch repair
bcachefs: Fix bch2_fsck_rename_dirent() for casefold
bcachefs: Redo bch2_dirent_init_name()
bcachefs: Fix -Wc23-extensions in bch2_check_dirents()
bcachefs: Run check_dirents second time if required
bcachefs: Run snapshot deletion out of system_long_wq
bcachefs: Make check_key_has_snapshot safer
bcachefs: BCH_RECOVERY_PASS_NO_RATELIMIT
bcachefs: bch2_require_recovery_pass()
bcachefs: bch_err_throw()
bcachefs: Repair code for directory i_size
bcachefs: Kill un-reverted directory i_size code
bcachefs: Delete redundant fsck_err()
...
Kent Overstreet [Tue, 3 Jun 2025 13:31:58 +0000 (09:31 -0400)]
bcachefs: Make journal read log message a bit quieter
Users seem to be assuming that the 'dropped unflushed entries' message
at the end of journal read indicates some sort of problem, when it does
not - we expect there to be entries in the journal that weren't
committed; it's purely informational, so that we can correlate journal
sequence numbers elsewhere when debugging.
Shorten the log message a bit to hopefully make this clearer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 2 Jun 2025 23:48:27 +0000 (19:48 -0400)]
bcachefs: Fix subvol to missing root repair
We had a bug where the root inode of a subvolume was erroneously deleted:
bch2_evict_inode() called bch2_inode_rm(), meaning the VFS inode's
i_nlink was somehow set to 0 when it shouldn't have - the inode in the
btree indicated it clearly was not unlinked.
This has been addressed with additional safety checks in
bch2_inode_rm() - pulling in the safety checks we already were doing
when deleting unlinked inodes in recovery - but the really disastrous
bug was in check_subvol(), which on finding a dangling subvol (subvol
with a missing root inode) would delete the subvolume.
I assume this bug dates from early check_directory_structure() code,
which originally handled subvolumes and normal paths - the idea being
that still live contents of the subvolume would get reattached
somewhere.
But that's incorrect, and disastrously so; deleting a subvolume triggers
deleting the snapshot ID it points to, deleting the entire contents.
The correct way to repair is to recreate the root inode if it's missing;
then any contents will get reattached under that subvolume's lost+found.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 2 Jun 2025 13:26:20 +0000 (09:26 -0400)]
bcachefs: Fix oops in btree_node_seq_matches()
btree_update_nodes_written() needs to wait on in-flight writes to old
nodes before marking them as freed. But it has no reason to pin those
old nodes in memory, so some trickiness ensues.
The update we're completing deleted references to those nodes from the
btree, so we know if they've been evicted they can't be pulled back in.
We just have to check if the nodes we have pointers to are still those
old nodes, and haven't been reused.
To do that we check the node's "sequence number" (actually a random 64
bit cookie), but that lives in the node's data buffer. 'struct btree'
can't be freed until filesystem shutdown (as they're quite small), but
the data buffers can be freed or swapped around.
Commit 1f88c3567495, which was fixing a KMSAN warning, assumed that we
could safely do this locklessly with just a READ_ONCE() - if we've got a
non-null ptr it would be safe to read from.
But that's not true if the data buffer is a vmalloc allocation, so we
need to restore the locking that commit deleted (or alternatively RCU
free those data buffers, but there's no other reason for that).
Fixes: 1f88c3567495 ("bcachefs: Fix a KMSAN splat in btree_update_nodes_written()")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 31 May 2025 04:11:52 +0000 (00:11 -0400)]
bcachefs: Fix dirent_casefold_mismatch repair
Instead of simply recreating a mis-casefolded dirent, use the str_hash
repair code, which will rename it if necessary - the dirent might have
been created again with the correct casefolding.
Factor out bch2_str_hash_repair_key() from __bch2_str_hash_check_key()
for the new path to use, and export bch2_dirent_create_key() as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>