Currently, for rqspinlock usage, the implementation of
smp_cond_load_acquire (and thus atomic_cond_read_acquire) is susceptible
to stalls on arm64, because it does not guarantee that the conditional
expression will be repeatedly invoked if the address being loaded from
is not written to by other CPUs. When support for event streams (which
unblock stuck WFE-based loops every ~100us) is absent, we may end up
being stuck forever.
This causes a problem for us, as we need to repeatedly invoke the
RES_CHECK_TIMEOUT in the spin loop to break out when the timeout
expires.
Let us import the smp_cond_load_acquire_timewait implementation Ankur is
proposing in [0], and then fall back to it once it is merged.
While we rely on the implementation to amortize the cost of sampling
check_timeout for us, this amortization will not happen when event
stream support is unavailable. This is not the common case, and it would
be difficult to fit our logic into the time_expr_ns >= time_limit_ns
comparison, hence just let it be.
Introduce policy macro RES_CHECK_TIMEOUT which can be used to detect
when the timeout has expired for the slow path to return an error. It
depends on being passed two variables initialized to 0: ts, ret. The
'ts' parameter is of type rqspinlock_timeout.
This macro resolves to the (ret) expression so that it can be used in
statements like smp_cond_load_acquire to break the waiting loop
condition.
The 'spin' member is used to amortize the cost of checking time by
dispatching to the implementation every 64k iterations. The
'timeout_end' member is used to keep track of the timestamp that denotes
the end of the waiting period. The 'ret' parameter denotes the status of
the timeout, and can be checked in the slow path to detect timeouts
after waiting loops.
The 'duration' member is used to store the timeout duration for each
waiting loop. The default timeout value defined in the header
(RES_DEF_TIMEOUT) is 0.25 seconds.
This macro will be used as a condition for waiting loops in the slow
path. Since each waiting loop applies a fresh timeout using the same
rqspinlock_timeout, we add a new RES_RESET_TIMEOUT as well to ensure the
values can be easily reinitialized to the default state.
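A minimal sketch of the shape these macros could take, assuming a
check_timeout() helper that lazily computes ts.timeout_end from
ts.duration and returns -ETIMEDOUT once the deadline passes (names and
details are illustrative, not the exact implementation):

  struct rqspinlock_timeout {
          u64 timeout_end;
          u64 duration;
          u16 spin;
  };

  #define RES_DEF_TIMEOUT (NSEC_PER_SEC / 4)      /* 0.25 seconds */

  /* ts and ret must be zero-initialized by the caller; the u16 'spin'
   * counter wraps every 64k iterations, which is when we actually
   * sample the clock via check_timeout().
   */
  #define RES_CHECK_TIMEOUT(ts, ret)                            \
          ({                                                    \
                  if (!((ts).spin++))                           \
                          (ret) = check_timeout(&(ts));         \
                  (ret);                                        \
          })

  #define RES_RESET_TIMEOUT(ts)                                 \
          ({                                                    \
                  (ts).timeout_end = 0;                         \
                  (ts).duration = RES_DEF_TIMEOUT;              \
          })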
Changes to rqspinlock in subsequent commits will be algorithmic
modifications that will no longer match the implementations of paravirt
spin lock and virt_spin_lock support. These future changes include
measures for terminating waiting loops in the slow path after a certain
point. While using a fair lock like qspinlock directly inside virtual
machines leads to suboptimal performance under certain conditions, we
cannot use the existing virtualization support before we make it
resilient as well. Therefore, drop it for now.
Note that we need to drop qspinlock_stat.h, as it's only relevant in
case of CONFIG_PARAVIRT_SPINLOCKS=y, but we need to keep lock_events.h
in the includes, which was indirectly pulled in before.
This header contains the public declarations usable in the rest of the
kernel for rqspinlock.
Let's also add a type alias, rqspinlock_t, for qspinlock to ensure
consistent use of the new lock type. We want to remove the dependence on
the qspinlock type in later patches as we need to provide a test-and-set
fallback, hence begin abstracting it away from now on.
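A minimal sketch of what the alias amounts to (illustrative, not
necessarily the exact declaration):

  typedef struct qspinlock rqspinlock_t;

Later patches can then swap the underlying definition (e.g. for the
test-and-set fallback) without touching users of rqspinlock_t.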
locking: Copy out qspinlock.c to kernel/bpf/rqspinlock.c
In preparation for introducing a new lock implementation, Resilient
Queued Spin Lock, or rqspinlock, we first begin our modifications by
using the existing qspinlock.c code as the base. Simply copy the code to
a new file and rename functions and variables from 'queued' to
'resilient_queued'.
Since we place the file in kernel/bpf, the includes need to be relative.
This helps each subsequent commit clearly show how and where the code is
being changed. The only change after a literal copy in this commit is
renaming the functions where necessary, and renaming qnodes to rqnodes.
Let's also use EXPORT_SYMBOL_GPL for the rqspinlock slowpath.
locking: Allow obtaining result of arch_mcs_spin_lock_contended
To support upcoming changes that require inspecting the return value
once the conditional waiting loop in arch_mcs_spin_lock_contended
terminates, modify the macro to preserve the result of
smp_cond_load_acquire. This enables checking the return value as needed,
which will help disambiguate the MCS node’s locked state in future
patches.
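Roughly, this amounts to turning the statement-style macro into an
expression so the acquired value propagates to the caller (a sketch based
on the existing smp_cond_load_acquire-based definition):

  /* Before: the result of the load is discarded */
  #define arch_mcs_spin_lock_contended(l)                                \
  do {                                                                   \
          smp_cond_load_acquire(l, VAL);                                 \
  } while (0)

  /* After: the macro evaluates to the value returned by the load */
  #define arch_mcs_spin_lock_contended(l)                                \
          smp_cond_load_acquire(l, VAL)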
locking: Move common qspinlock helpers to a private header
Move qspinlock helper functions that encode, decode tail word, set and
clear the pending and locked bits, and other miscellaneous definitions
and macros to a private header. To this end, create a qspinlock.h header
file in kernel/locking. Subsequent commits will introduce a modified
qspinlock slow path function, thus moving shared code to a private
header will help minimize unnecessary code duplication.
locking: Move MCS struct definition to public header
Move the definition of the struct mcs_spinlock from the private
mcs_spinlock.h header in kernel/locking to the mcs_spinlock.h
asm-generic header, since we will need to reference it from the
qspinlock.h header in subsequent commits.
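For reference, the structure being moved looks roughly like this (as
found in kernel/locking/mcs_spinlock.h, shown here only for context):

  struct mcs_spinlock {
          struct mcs_spinlock *next;
          int locked;     /* 1 if lock acquired */
          int count;      /* nesting count, see qspinlock.c */
  };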
Emil Tsalapatis [Tue, 18 Mar 2025 03:07:53 +0000 (23:07 -0400)]
bpf: Make perf_event_read_output accessible in all program types.
The perf_event_read_event_output helper is currently only available to
tracing programs, but is useful for other BPF programs like sched_ext
schedulers. When the helper is available, provide its bpf_func_proto
directly from the bpf base_proto.
====================
bpftool: Using the right format specifiers
This patch adds the -Wformat-signedness compiler flag to detect and
prevent format string errors, where signed or unsigned types are
mismatched with format specifiers. Additionally, it fixes some format
string errors that were not fully addressed by the previous patch [1].
Jiayuan Chen [Tue, 11 Mar 2025 11:28:09 +0000 (19:28 +0800)]
bpftool: Using the right format specifiers
Fixed some format specifier errors, such as using %d for int and %u for
unsigned int, as well as for other byte-length types.
Type casts are performed using the type derived from the data itself;
for example, if a value is originally an int, it is cast to unsigned int
when an unsigned format is required.
Jiayuan Chen [Tue, 11 Mar 2025 11:28:08 +0000 (19:28 +0800)]
bpftool: Add -Wformat-signedness flag to detect format errors
This commit adds the -Wformat-signedness compiler flag to detect and
prevent printf format errors, where signed or unsigned types are
mismatched with format specifiers. This helps to catch potential issues at
compile-time, ensuring that our code is more robust and reliable. With
this flag, the compiler will now warn about incorrect format strings, such
as using %d with unsigned types or %u with signed types.
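As an illustration of the class of bug the flag catches (standalone
example, not code taken from bpftool):

  #include <stdio.h>

  int main(void)
  {
          unsigned int nr_maps = 42;
          int err = -22;

          printf("%d maps\n", nr_maps);   /* warns: %d with unsigned int */
          printf("%u maps\n", nr_maps);   /* OK */
          printf("%u\n", err);            /* warns: %u with signed int */
          printf("%d\n", err);            /* OK */
          return 0;
  }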
====================
Support freplace prog from user namespace
From: Mykyta Yatsenko <yatsenko@meta.com>
Freplace programs can't be loaded from a user namespace, as
bpf_program__set_attach_target() requires searching for the target
prog's BTF, which is gated behind CAP_SYS_ADMIN.
This patch set enables this use case by:
1. Relaxing the capability check in bpf's BPF_BTF_GET_FD_BY_ID: check
for CAP_BPF instead of CAP_SYS_ADMIN, and support a BPF token in the
attr argument.
2. Passing the BPF token around libbpf from
bpf_program__set_attach_target() to the bpf syscall, where the
capability check is done.
3. Validating positive/negative scenarios in selftests.
This patch set is enabled by the recent libbpf change[1], that
introduced bpf_object__prepare() API. Calling bpf_object__prepare() for
freplace program before bpf_program__set_attach_target() initializes BPF
token, which is then passed to bpf syscall by libbpf.
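A rough usage sketch of the resulting flow (error handling trimmed; the
object file, program names, and token path are made up for
illustration):

  #include <bpf/libbpf.h>

  static int load_freplace_from_userns(int target_prog_fd)
  {
          LIBBPF_OPTS(bpf_object_open_opts, opts,
                      .bpf_token_path = "/sys/fs/bpf/token");
          struct bpf_object *obj;
          struct bpf_program *prog;

          obj = bpf_object__open_file("freplace_prog.bpf.o", &opts);
          if (!obj)
                  return -1;
          /* prepare creates the BPF token (and maps) without loading programs */
          if (bpf_object__prepare(obj))
                  return -1;
          prog = bpf_object__find_program_by_name(obj, "freplace_func");
          /* The target BTF lookup can now go through BPF_BTF_GET_FD_BY_ID
           * with the token, so CAP_SYS_ADMIN is no longer required.
           */
          if (bpf_program__set_attach_target(prog, target_prog_fd, "target_func"))
                  return -1;
          return bpf_object__load(obj);
  }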
Mykyta Yatsenko [Mon, 17 Mar 2025 17:40:39 +0000 (17:40 +0000)]
selftests/bpf: Test freplace from user namespace
Add selftests to verify that it is possible to load an freplace program
from a user namespace if the BPF token is initialized by
bpf_object__prepare before calling bpf_program__set_attach_target.
A negative test is added as well.
The type of the priv_prog is changed to xdp, as kprobe did not work on
aarch64 and s390x.
Mykyta Yatsenko [Mon, 17 Mar 2025 17:40:38 +0000 (17:40 +0000)]
libbpf: Pass BPF token from find_prog_btf_id to BPF_BTF_GET_FD_BY_ID
Pass BPF token from bpf_program__set_attach_target to
BPF_BTF_GET_FD_BY_ID bpf command.
When an freplace program attaches to a target program, it needs to look
up the BTF of the target; this may require a BPF token if, for example,
it is running from a user namespace.
Mykyta Yatsenko [Mon, 17 Mar 2025 17:40:37 +0000 (17:40 +0000)]
bpf: Return prog btf_id without capable check
Return the prog's btf_id from bpf_prog_get_info_by_fd regardless of the
capable check. This patch enables the scenario where an freplace program
running from a user namespace needs to query the target prog's BTF.
Mykyta Yatsenko [Mon, 17 Mar 2025 17:40:36 +0000 (17:40 +0000)]
bpf: BPF token support for BPF_BTF_GET_FD_BY_ID
Currently BPF_BTF_GET_FD_BY_ID requires CAP_SYS_ADMIN, which does not
allow running it from a user namespace. This creates a problem when an
freplace program running from a user namespace needs to query the target
program's BTF.
This patch relaxes the capability check from CAP_SYS_ADMIN to CAP_BPF
and adds support for a BPF token that can be passed in the attributes to
the syscall.
The kernel test robot reported a "call without frame pointer save/setup"
warning in objtool. This makes stack traces unreliable with
CONFIG_UNWINDER_FRAME_POINTER=y, though they still work with
CONFIG_UNWINDER_ORC=y. Fix this by creating a stack frame for the
function.
Fixes: 2fb761823ead ("bpf, x86: Add x86 JIT support for timed may_goto") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202503071350.QOhsHVaW-lkp@intel.com/ Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250315013039.1625048-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Hou Tao [Sat, 15 Mar 2025 15:09:30 +0000 (23:09 +0800)]
bpf: Check map->record at the beginning of check_and_free_fields()
When there are no special fields in the map value, there is no need to
invoke bpf_obj_free_fields(). Therefore, check the validity of
map->record in advance.
After the change, the benchmark result of the per-cpu update case in
map_perf_test increased by 40% under a 16-CPU VM.
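The shape of the change is roughly the following early return (a sketch;
the exact guard in the hashtab code may differ):

  static void check_and_free_fields(struct bpf_htab *htab,
                                    struct htab_elem *elem)
  {
          /* No special fields (kptrs, timers, ...) in the map value:
           * nothing for bpf_obj_free_fields() to do, so skip the
           * per-element work entirely.
           */
          if (IS_ERR_OR_NULL(htab->map.record))
                  return;

          /* ... existing per-cpu / per-element freeing logic ... */
  }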
selftests/bpf: Fix sockopt selftest failure on powerpc
The SO_RCVLOWAT option is defined as 18 in the selftest header,
which matches the generic definition. However, on powerpc,
SO_RCVLOWAT is defined as 16. This discrepancy causes
sol_socket_sockopt() to fail with the default switch case on powerpc.
This commit fixes the issue by defining SO_RCVLOWAT as 16 for powerpc.
Viktor Malik [Thu, 13 Mar 2025 12:28:52 +0000 (13:28 +0100)]
selftests/bpf: Fix string read in strncmp benchmark
The strncmp benchmark uses the bpf_strncmp helper and a hand-written
loop to compare two strings. The values of the strings are filled from
userspace. One of the strings is non-const (in .bss) while the other is
const (in .rodata) since that is the requirement of bpf_strncmp.
The problem is that in the hand-written loop, Clang optimizes the reads
from the const string to always return 0 which breaks the benchmark.
Use barrier_var to prevent the optimization.
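One way the fix can look in the BPF program (an illustrative sketch, not
the exact selftest diff; 'str', 'target', and 'sz' are placeholder
names): making the .rodata pointer opaque with barrier_var prevents
clang from folding the per-byte loads into a constant 0.

  static __always_inline int cmp_strings(const char *str, const char *target,
                                         unsigned int sz)
  {
          const char *rodata_str = target;   /* 'target' lives in .rodata */
          unsigned int i;

          barrier_var(rodata_str);           /* hide the pointer from clang */
          for (i = 0; i < sz; i++) {
                  if (str[i] != rodata_str[i])
                          return 1;
          }
          return 0;
  }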
The effect can be seen on the strncmp-no-helper variant.
Before this change:
# ./bench strncmp-no-helper
Setting up benchmark 'strncmp-no-helper'...
Benchmark 'strncmp-no-helper' started.
Iter 0 (112.309us): hits 0.000M/s ( 0.000M/prod), drops 0.000M/s, total operations 0.000M/s
Iter 1 (-23.238us): hits 0.000M/s ( 0.000M/prod), drops 0.000M/s, total operations 0.000M/s
Iter 2 ( 58.994us): hits 0.000M/s ( 0.000M/prod), drops 0.000M/s, total operations 0.000M/s
Iter 3 (-30.466us): hits 0.000M/s ( 0.000M/prod), drops 0.000M/s, total operations 0.000M/s
Iter 4 ( 29.996us): hits 0.000M/s ( 0.000M/prod), drops 0.000M/s, total operations 0.000M/s
Iter 5 ( 16.949us): hits 0.000M/s ( 0.000M/prod), drops 0.000M/s, total operations 0.000M/s
Iter 6 (-60.035us): hits 0.000M/s ( 0.000M/prod), drops 0.000M/s, total operations 0.000M/s
Summary: hits 0.000 ± 0.000M/s ( 0.000M/prod), drops 0.000 ± 0.000M/s, total operations 0.000 ± 0.000M/s
After this change:
# ./bench strncmp-no-helper
Setting up benchmark 'strncmp-no-helper'...
Benchmark 'strncmp-no-helper' started.
Iter 0 ( 77.711us): hits 5.534M/s ( 5.534M/prod), drops 0.000M/s, total operations 5.534M/s
Iter 1 ( 11.215us): hits 6.006M/s ( 6.006M/prod), drops 0.000M/s, total operations 6.006M/s
Iter 2 (-14.253us): hits 5.931M/s ( 5.931M/prod), drops 0.000M/s, total operations 5.931M/s
Iter 3 ( 59.087us): hits 6.005M/s ( 6.005M/prod), drops 0.000M/s, total operations 6.005M/s
Iter 4 (-21.379us): hits 6.010M/s ( 6.010M/prod), drops 0.000M/s, total operations 6.010M/s
Iter 5 (-20.310us): hits 5.861M/s ( 5.861M/prod), drops 0.000M/s, total operations 5.861M/s
Iter 6 ( 53.937us): hits 6.004M/s ( 6.004M/prod), drops 0.000M/s, total operations 6.004M/s
Summary: hits 5.969 ± 0.061M/s ( 5.969M/prod), drops 0.000 ± 0.000M/s, total operations 5.969 ± 0.061M/s
Fixes: 9c42652f8be3 ("selftests/bpf: Add benchmark for bpf_strncmp() helper") Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Viktor Malik <vmalik@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20250313122852.1365202-1-vmalik@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: Fix arena_spin_lock compilation on PowerPC
Venkat reported a compilation error for BPF selftests on PowerPC [0].
The crux of the error is the following message:
In file included from progs/arena_spin_lock.c:7:
/root/bpf-next/tools/testing/selftests/bpf/bpf_arena_spin_lock.h:122:8:
error: member reference base type '__attribute__((address_space(1)))
u32' (aka '__attribute__((address_space(1))) unsigned int') is not a
structure or union
122 | old = atomic_read(&lock->val);
This is because PowerPC overrides the qspinlock type changing the
lock->val member's type from atomic_t to u32.
To remedy this, import the asm-generic version in the arena spin lock
header, name it __qspinlock (since it's aliased to arena_spinlock_t, the
actual name hardly matters), and adjust the selftest to not depend on
the type in vmlinux.h.
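The imported definition is essentially the asm-generic layout, abridged
here to the little-endian variant:

  struct __qspinlock {
          union {
                  atomic_t val;
                  struct {
                          u8 locked;
                          u8 pending;
                  };
                  struct {
                          u16 locked_pending;
                          u16 tail;
                  };
          };
  };

  typedef struct __qspinlock arena_spinlock_t;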
Arnd Bergmann [Mon, 10 Mar 2025 13:49:16 +0000 (14:49 +0100)]
bpf: preload: Add MODULE_DESCRIPTION
Modpost complains when extra warnings are enabled:
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/bpf/preload/bpf_preload.o
Add a description from the Kconfig help text.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250310134920.4123633-1-arnd@kernel.org
----
Not sure if that description actually fits what the module does. If not,
please add a different description instead.
Sewon Nam [Tue, 11 Mar 2025 03:12:37 +0000 (12:12 +0900)]
bpf: bpftool: Setting error code in do_loader()
The error code is not being set in do_loader() when
bpf_object__open_file() fails. This means the command's exit status code
will be successful, even though the operation failed. So make sure to
return the correct error code. To maintain consistency with other
locations where bpf_object__open_file() is called, return -1.
====================
While trying to implement an eBPF gatekeeper program, we ran into an
issue where the LSM hooks are missing some relevant data.
Certain subcommands passed to the bpf() syscall can be invoked from
either the kernel or userspace. Additionally, some fields in the
bpf_attr struct contain pointers, and depending on where the
subcommand was invoked, they could point to either user or kernel
memory. One example of this is the bpf_prog_load subcommand and its
fd_array. This data is made available and used by the verifier but not
made available to the LSM subsystem. This patchset simply exposes that
information to applicable LSM hooks.
Change list:
- v6 -> v7
- use gettid/pid in lieu of getpid/tgid in test condition
- v5 -> v6
- fix regression caused by is_kernel renaming
- simplify test logic
- v4 -> v5
- merge v4 selftest breakout patch back into a single patch
- change "is_kernel" to "kernel"
- add selftest using new kernel flag
- v3 -> v4
- split out selftest changes into a separate patch
- v2 -> v3
- reorder params so that the new boolean flag is the last param
- fixup function signatures in bpf selftests
- v1 -> v2
- Pass a boolean flag in lieu of bpfptr_t
Chen Ni [Mon, 10 Mar 2025 03:20:45 +0000 (11:20 +0800)]
selftests/bpf: Convert comma to semicolon
Replace comma between expressions with semicolons.
Using a ',' in place of a ';' can have unintended side effects.
Although that is not the case here, it seems best to use ';'
unless ',' is intended.
Found by inspection.
No functional change intended.
Compile tested only.
Blaise Boscaccy [Mon, 10 Mar 2025 22:17:12 +0000 (15:17 -0700)]
selftests/bpf: Add a kernel flag test for LSM bpf hook
This test exercises the kernel flag added to security_bpf by
effectively blocking light-skeletons from loading while allowing
normal skeletons to function as-is. Since this should work with any
arbitrary BPF program, an existing program from LSKELS_EXTRA was
used as a test payload.
Anton Protopopov [Mon, 10 Mar 2025 14:51:12 +0000 (14:51 +0000)]
selftests/bpf: Fix selection of static vs. dynamic LLVM
The Makefile uses the exit code of the `llvm-config --link-static --libs`
command to choose between statically-linked and dynamically-linked LLVMs.
The stdout and stderr of that command are redirected to /dev/null.
To redirect the output the "&>" construction is used, which might not be
supported by /bin/sh, which is executed by make for $(shell ...) commands.
On such systems the test will fail even if static LLVM is actually
supported. Replace "&>" by ">/dev/null 2>&1" to fix this.
Fixes: 2a9d30fac818 ("selftests/bpf: Support dynamically linking LLVM if static is not available") Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/bpf/20250310145112.1261241-1-aspsk@isovalent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Blaise Boscaccy [Mon, 10 Mar 2025 22:17:11 +0000 (15:17 -0700)]
security: Propagate caller information in bpf hooks
Certain bpf syscall subcommands are available for usage from both
userspace and the kernel. LSM modules or eBPF gatekeeper programs may
need to take a different course of action depending on whether or not
a BPF syscall originated from the kernel or userspace.
Additionally, some of the bpf_attr struct fields contain pointers to
arbitrary memory. Currently the functionality to determine whether or
not a pointer refers to kernel memory or userspace memory is exposed
to the bpf verifier, but that information is missing from various LSM
hooks.
Here we augment the LSM hooks to provide this data, by simply passing
a boolean flag indicating whether or not the call originated in the
kernel, in any hook that contains a bpf_attr struct that corresponds
to a subcommand that may be called from the kernel.
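Concretely, the affected hook prototypes gain a trailing boolean, along
the lines of the following sketch (the shape only, not the complete list
of hooks):

  /* kernel == true when the subcommand (and thus any bpf_attr pointers)
   * originates from the kernel rather than from userspace
   */
  int security_bpf(int cmd, union bpf_attr *attr, unsigned int size,
                   bool kernel);
  int security_bpf_prog_load(struct bpf_prog *prog, union bpf_attr *attr,
                             struct bpf_token *token, bool kernel);
  int security_bpf_map_create(struct bpf_map *map, union bpf_attr *attr,
                              struct bpf_token *token, bool kernel);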
====================
bpf: introduce helper for populating bpf_cpumask
Some BPF programs like scx schedulers have their own internal CPU mask
types, which they must transform into struct bpf_cpumask instances
before passing them to scheduling-related kfuncs. There is currently no
way to efficiently populate the bitfield of a bpf_cpumask from BPF
memory, and programs must use multiple bpf_cpumask_[set, clear] calls to
do so.
Introduce a kfunc helper to populate the bitfield of a bpf_cpumask from valid
BPF memory with a single call.
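A hedged usage sketch from the BPF program side (the kfunc name comes
from this cover letter; its exact signature and the surrounding
declarations are assumptions):

  /* Fragment of a BPF program; includes and kfunc declarations omitted. */
  static int populate_example(void)
  {
          u64 raw_mask[2] = { 0x5, 0x0 };  /* bitmap in BPF memory, long-aligned */
          struct bpf_cpumask *mask;

          mask = bpf_cpumask_create();
          if (!mask)
                  return 0;
          /* one call instead of a loop of bpf_cpumask_set_cpu() */
          if (bpf_cpumask_populate((struct cpumask *)mask, raw_mask,
                                   sizeof(raw_mask)))
                  bpf_printk("populate failed (bad size or alignment)");
          bpf_cpumask_release(mask);
          return 0;
  }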
Addressed feedback by Hou Tao:
* Removed RUN_TESTS invocation causing tests to run twice
* Added is_test_task guard to new selftests
* Removed extraneous __success attribute from existing selftests
Addressed feedback by Hou Tao:
* Readded the tests in tools/selftests/bpf/prog_tests/cpumask.c; it
turns out the selftest entries were not duplicates.
* Removed stray whitespace in selftest.
* Add a patch adding the missing selftest to prog_tests/cpumask.c
* Explicitly annotate all cpumask selftests with __success
The last patch could very well be its own cleanup patch, but I rolled it into
this series because it came up in the discussion. If the last patch in the
series has any issues I'd be fine with applying the first 3 patches and dealing
with it separately.
* Removed new tests from tools/selftests/bpf/prog_tests/cpumask.c because
they were being run twice.
Addressed feedback by Alexei Starovoitov:
* Added missing return value in function kdoc
* Added an additional patch fixing some missing kdoc fields in
kernel/bpf/cpumask.c
Addressed feedback by Tejun Heo:
* Renamed the kfunc to bpf_cpumask_populate to avoid confusion
w/ bitmap_fill()
Addressed feedback by Alexei Starovoitov:
* Added back patch descriptions dropped from v1->v2
* Elide the alignment check for archs with efficient
unaligned accesses
Addressed feedback by Hou Tao:
* Add check that the input buffer is aligned to sizeof(long)
* Adjust input buffer size check to use bitmap_size()
* Add selftest for checking the bit pattern of the bpf_cpumask
* Moved all selftests into existing files
Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
====================
Emil Tsalapatis [Sun, 9 Mar 2025 23:04:27 +0000 (19:04 -0400)]
selftests: bpf: fix duplicate selftests in cpumask_success.
The BPF cpumask selftests are currently run twice in
test_progs/cpumask.c, once by traversing cpumask_success_testcases, and
once by invoking RUN_TESTS(cpumask_success). Remove the invocation of
RUN_TESTS to properly run the selftests only once.
Now that the tests are run only through cpumask_success_testcases, add
to it the missing test_refcount_null_tracking testcase. Also remove the
__success annotation from it, since it is now loaded and invoked by the
runner.
====================
This patch series continues the work to migrate the script tests into
prog_tests.
test_lwt_seg6local.sh tests some bpf_lwt_* helpers. It contains only one
test that uses a network topology quite different from the ones found in
the other prog_tests/lwt_*.c files, so I add a new
prog_tests/lwt_seg6local.c file.
While working on the migration I noticed that some routes present in the
script weren't needed so PATCH 1 deletes them and then PATCH 2 migrates
the test into the test_progs framework.
====================
selftests/bpf: lwt_seg6local: Move test to test_progs
test_lwt_seg6local.sh isn't used by the BPF CI.
Add a new file in the test_progs framework to migrate the tests done by
test_lwt_seg6local.sh. It uses the same network topology and the same BPF
programs located in progs/test_lwt_seg6local.c.
Use the network helpers instead of `nc` to exchange the final packet.
Remove test_lwt_seg6local.sh and its Makefile entry.
Amery Hung [Wed, 5 Mar 2025 18:20:57 +0000 (10:20 -0800)]
selftests/bpf: Fix dangling stdout seen by traffic monitor thread
Traffic monitor thread may see dangling stdout as the main thread closes
and reassigns stdout without protection. This happens when the main thread
finishes one subtest and moves to another one in the same netns_new()
scope.
The issue can be reproduced by running test_progs repeatedly with traffic
monitor enabled:
for ((i=1;i<=100;i++)); do
./test_progs -a flow_dissector_skb* -m '*'
done
For restoring stdout in crash_handler(), since it does not really care
about closing stdout, simply flush stdout and restore it to the original
one.
Then, fix the issue by consolidating stdio_restore_cleanup() and
stdio_restore(), and protecting the use/close/assignment of stdout with
a lock. The locking in the main thread is always performed regardless of
whether traffic monitor is running or not, for simplicity. It won't have
any side effect.
Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://patch.msgid.link/20250305182057.2802606-3-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Some routes in fb00:: are initialized during setup, even though they
aren't needed by the test as the UDP packets will travel through the
lightweight tunnels.
Amery Hung [Wed, 5 Mar 2025 18:20:55 +0000 (10:20 -0800)]
selftests/bpf: Clean up call sites of stdio_restore()
reset_affinity() and save_ns() are only called in run_one_test(). There is
no need to call stdio_restore() in reset_affinity() and save_ns() if
stdio_restore() is moved right after a test finishes in run_one_test().
Also remove an unnecessary check of env.stdout_saved in crash_handler()
by moving env.stdout_saved assignment to the beginning of main().
Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://patch.msgid.link/20250305182057.2802606-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: Move test_lwt_ip_encap to test_progs
test_lwt_ip_encap.sh isn't used by the BPF CI.
Add a new file in the test_progs framework to migrate the tests done by
test_lwt_ip_encap.sh. It uses the same network topology and the same BPF
programs located in progs/test_lwt_ip_encap.c.
Rework the GSO part to avoid using nc and dd.
Remove test_lwt_ip_encap.sh and its Makefile entry.
This set provides an implementation of queued spin lock for arena.
There is no support for resiliency and recovering from deadlocks yet.
We will wait for the rqspinlock patch set to land before incorporating
support.
One minor change compared to the qspinlock algorithm in the kernel is
that we don't have the trylock fallback when nesting count exceeds 4.
The maximum number of supported CPUs is 1024, but this can be increased
in the future if necessary.
The API supports returning an error, so resiliency support can be added
in the future. Callers are still expected to check for and handle any
potential errors.
Errors are returned when the spin loops time out, when the number of
CPUs is greater than 1024, or in the unsupported extreme edge case of an
NMI interrupting an NMI interrupting a HardIRQ interrupting a SoftIRQ
interrupting a task, with all of them in the slow path simultaneously.
* Add better comment and document LLVM bug for __unqual_typeof.
* Switch to precise counting in the selftest and simplify test.
* Add comment about return value handling.
* Reduce size for 100k to 50k to cap test runtime.
* Drop extra corruption handling case in decode_tail.
* Stick to 1, 1k, 100k critical section sizes.
* Fix unqual_typeof to not cast away arena tag for pointers.
* Remove hack to skip first qnode.
* Choose 100 as repeat count, 1000 is too much for 100k size.
* Use pthread_barrier in test.
* Rename to arena_spin_lock
* Introduce cond_break_label macro to jump to label from cond_break.
* Drop trylock fallback when nesting count exceeds 4.
* Fix bug in try_cmpxchg implementation.
* Add tests with critical sections of varying lengths.
* Add comments for _Generic trick to drop __arena tag.
* Fix bug due to qnodes being placed on first page, leading to CPU 0's
node being indistinguishable from NULL.
Implement queued spin lock algorithm as BPF program for lock words
living in BPF arena.
The algorithm is copied from kernel/locking/qspinlock.c and adapted for
BPF use.
We first implement abstract helpers for portable atomics and
acquire/release load instructions, relying on the presence of X86_64 and
implementation details of its JIT to elide expensive barriers, and
falling back to slow but correct implementations elsewhere. When support
for acquire/release load/stores lands, we can improve this state.
Then, the qspinlock algorithm is adapted to remove dependence on
multi-word atomics due to lack of support in BPF ISA. For instance,
xchg_tail cannot use a 16-bit xchg and needs to be implemented as a
32-bit try_cmpxchg loop.
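For illustration, the replacement follows the kernel's generic 32-bit
fallback pattern, roughly as below (a sketch; the BPF version
additionally needs the cond_break/cond_break_label annotation discussed
just below to satisfy the verifier):

  static __always_inline u32 xchg_tail(arena_spinlock_t __arena *lock, u32 tail)
  {
          u32 old, new;

          old = atomic_read(&lock->val);
          do {
                  /* keep the locked+pending bytes, replace only the tail */
                  new = (old & _Q_LOCKED_PENDING_MASK) | tail;
          } while (!atomic_try_cmpxchg_relaxed(&lock->val, &old, new));

          return old;
  }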
Loops which are seemingly infinite from verifier PoV are annotated with
cond_break_label macro to return an error. Only 1024 NR_CPUs are
supported.
Note that the slow path is a global function, hence the verifier doesn't
know the return value's precision. The recommended way of usage is to
always test against zero for success, and not ret < 0 for error, as the
verifier would assume ret > 0 has not been accounted for. Add comments
in the function documentation about this quirk.
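A usage sketch reflecting that recommendation (helper names follow this
series, but treat the exact API and signatures as assumptions):

  /* 'lock' points at an arena_spinlock_t living in the BPF arena */
  int ret;

  ret = arena_spin_lock(lock);
  if (ret)                        /* test against zero, not ret < 0 */
          return ret;
  /* ... critical section ... */
  arena_spin_unlock(lock);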
Add a new cond_break_label macro that jumps to the specified label when
the cond_break termination check fires, and allows us to better handle
the uncontrolled termination of the loop.
The may_goto instruction does not use any registers, but in
compute_insn_live_regs() it was treated as a regular conditional jump of
kind BPF_K with r0 as the source register, thus unnecessarily marking r0
as used.
====================
bpf: simple DFA-based live registers analysis
This patch-set introduces a simple live registers DFA analysis.
Analysis is done as a separate step before main verification pass.
Results are stored in the env->insn_aux_data for each instruction.
The change helps with iterator/callback based loops handling,
as regular register liveness marks are not finalized while
loops are processed. See veristat results in patch #2.
Note: for regular subprogram calls, the analysis conservatively assumes
that r1-r5 are used, and r0 is used at each 'exit' instruction.
Experiments show that adding logic handling these cases precisely has
no impact on verification performance.
The patch set was tested by disabling the current register parentage
chain liveness computation, using DFA-based liveness for registers
while assuming all stack slots as live. See discussion in [1].
Changes v2 -> v3:
- added support for BPF_LOAD_ACQ, BPF_STORE_REL atomics (Alexei);
- correct use marks for r0 for BPF_CMPXCHG.
Changes v1 -> v2:
- added a refactoring commit extracting utility functions:
jmp_offset(), verbose_insn() (Alexei);
- added a refactoring commit extracting utility function
get_call_summary() in order to share helper/kfunc related code with
mark_fastcall_pattern_for_call() (Alexei);
- comment in the compute_insn_live_regs() extended (Alexei).
Changes RFC -> v1:
- parameter count for helpers and kfuncs is taken into account;
- copy_verifier_state() bugfix had been merged as a separate
patch-set and is no longer a part of this patch set.
====================
Introduce load-acquire and store-release BPF instructions
This patchset adds kernel support for BPF load-acquire and store-release
instructions (for background, please see [1]), including core/verifier
and arm64/x86-64 JIT compiler changes, as well as selftests. riscv64 is
also planned to be supported. The corresponding LLVM changes can be
found at:
https://github.com/llvm/llvm-project/pull/108636
The first 3 patches from v4 have already been applied:
- [bpf-next,v4,01/10] bpf/verifier: Factor out atomic_ptr_type_ok()
https://git.kernel.org/bpf/bpf-next/c/b2d9ef71d4c9
- [bpf-next,v4,02/10] bpf/verifier: Factor out check_atomic_rmw()
https://git.kernel.org/bpf/bpf-next/c/d430c46c7580
- [bpf-next,v4,03/10] bpf/verifier: Factor out check_load_mem() and check_store_reg()
https://git.kernel.org/bpf/bpf-next/c/d38ad248fb7a
Please refer to the LLVM PR and individual kernel patches for details.
Thanks!
o (kernel test robot) for 32-bit arches: make the verifier reject
64-bit load-acquires/store-releases, and fix
build error in interpreter changes
* tested ARCH=arc build following instructions from kernel test
robot
o (Alexei) drop Documentation/ patch (v4 10/10) for now
o (Alexei) change encoding to BPF_LOAD_ACQ=0x100, BPF_STORE_REL=0x110
o add Acked-by: tags from Ilya and Eduard
o make new selftests depend on:
* __clang_major__ >= 18, and
* ENABLE_ATOMICS_TESTS is defined (currently this means -mcpu=v3 or
v4), and
* JIT supports load_acq/store_rel (currently only arm64)
o work around llvm-17 CI job failure by conditionally defining
__arena_global variables as 64-bit if __clang_major__ < 18, to make
sure .addr_space.1 has no holes
o add Google copyright notice in new files
o (Eduard) for x86 and s390, make
bpf_jit_supports_insn(..., /*in_arena=*/true) return false
for load_acq/store_rel
o add Eduard's Acked-by: tag
o (Eduard) extract LDX and non-ATOMIC STX handling into helpers, see
PATCH v2 3/9
o allow unpriv programs to store-release pointers to stack
o (Alexei) make it clearer in the interpreter code (PATCH v2 4/9) that
only W and DW are supported for atomic RMW
o test misaligned load_acq/store_rel
o (Eduard) other selftests/ changes:
* test load_acq/store_rel with !atomic_ptr_type_ok() pointers:
- PTR_TO_CTX, for is_ctx_reg()
- PTR_TO_PACKET, for is_pkt_reg()
- PTR_TO_FLOW_KEYS, for is_flow_key_reg()
- PTR_TO_SOCKET, for is_sk_reg()
* drop atomics/ tests
* delete unnecessary 'pid' checks from arena_atomics/ tests
* avoid depending on __BPF_FEATURE_LOAD_ACQ_STORE_REL, use
__imm_insn() and inline asm macros instead
o 1-2/8: minor verifier.c refactoring patches
o 3/8: core/verifier changes
* (Eduard) handle load-acquire properly in backtrack_insn()
* (Eduard) avoid skipping checks (e.g.,
bpf_jit_supports_insn()) for load-acquires
* track the value stored by store-releases, just like how
non-atomic STX instructions are handled
* (Eduard) add missing link in commit message
* (Eduard) always print 'r' for disasm.c changes
o 4/8: arm64/insn: avoid treating load_acq/store_rel as
load_ex/store_ex
o 5/8: arm64/insn: add load_acq/store_rel
* (Xu) include Should-Be-One (SBO) bits in "mask" and "value",
to avoid setting fixed bits during runtime (JIT-compile
time)
o 6/8: arm64 JIT compiler changes
* (Xu) use emit_a64_add_i() for "pointer + offset" to optimize
code emission
o 7/8: selftests
* (Eduard) avoid adding new tests to the 'test_verifier' runner
* add more tests, e.g., checking mark_precise logic
o 8/8: instruction-set.rst changes
bpf: use register liveness information for func_states_equal
Liveness analysis DFA computes a set of registers live before each
instruction. Leverage this information to skip comparison of dead
registers in func_states_equal(). This helps with convergance of
iterator processing loops, as bpf_reg_state->live marks can't be used
when loops are processed.
This has a certain performance impact for selftests; here is a veristat
listing using `-f "insns_pct>5" -f "!insns<200"`
The last two tests are added to check if backtrack_insn() handles the
new instructions correctly.
Additionally, the last test also makes sure that the verifier
"remembers" the value (in src_reg) we store-release into e.g. a stack
slot. For example, if we take a look at the test program:
At #1, if the verifier doesn't remember that we wrote 8 to the stack,
then later at #4 we would be adding an unbounded scalar value to the
stack pointer, which would cause the program to be rejected:
VERIFIER LOG:
=============
...
math between fp pointer and register with unbounded min value is not allowed
For easier CI integration, instead of using built-ins like
__atomic_{load,store}_n() which depend on the new
__BPF_FEATURE_LOAD_ACQ_STORE_REL pre-defined macro, manually craft
load-acquire/store-release instructions using __imm_insn(), as suggested
by Eduard.
All new tests depend on:
(1) Clang major version >= 18, and
(2) ENABLE_ATOMICS_TESTS is defined (currently implies -mcpu=v3 or
v4), and
(3) JIT supports load-acquire/store-release (currently arm64 and
x86-64)
That 1-byte hole in the .addr_space.1 ELF section caused clang-17 to
crash:
fatal error: error in backend: unable to write nop sequence of 1 bytes
To work around such llvm-17 CI job failures, conditionally define
__arena_global variables as 64-bit if __clang_major__ < 18, to make sure
.addr_space.1 has no holes. Ideally we should avoid compiling this file
using clang-17 at all (arena tests depend on
__BPF_FEATURE_ADDR_SPACE_CAST, and are skipped for llvm-17 anyway), but
that is a separate topic.
Compute may-live registers before each instruction in the program.
The register is live before the instruction I if it is read by I or
some instruction S following I during program execution and is not
overwritten between I and S.
This information would be used in the next patch as a hint in
func_states_equal().
Use a simple algorithm described in [1] to compute this information:
- define the following:
- I.use : a set of all registers read by instruction I;
- I.def : a set of all registers written by instruction I;
- I.in : a set of all registers that may be alive before I execution;
- I.out : a set of all registers that may be alive after I execution;
- I.successors : a set of instructions S that might immediately
follow I for some program execution;
- associate separate empty sets 'I.in' and 'I.out' with each instruction;
- visit each instruction in a postorder and update corresponding
'I.in' and 'I.out' sets as follows:
I.out = U [S.in for S in I.successors]
I.in = (I.out / I.def) U I.use
(where U stands for set union, / stands for set difference)
- repeat the computation while I.{in,out} changes for any instruction.
On implementation side keep things as simple, as possible:
- check_cfg() already marks instructions EXPLORED in post-order,
modify it to save the index of each EXPLORED instruction in a vector;
- represent I.{in,out,use,def} as bitmasks;
- don't split the program into basic blocks and don't maintain the
work queue, instead:
- do fixed-point computation by visiting each instruction;
- maintain a simple 'changed' flag if I.{in,out} for any instruction
change;
Measurements show that even such a simplistic implementation does not
add measurable verification time overhead (for selftests, at least).
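A compact sketch of this fixed-point computation over register bitmasks
(illustrative only; the in-kernel version operates on
env->insn_aux_data and env->cfg.insn_postorder):

  #include <stdbool.h>

  typedef unsigned short regmask_t;   /* one bit per register r0-r10 */

  struct insn_live {
          regmask_t use, def, in, out;
  };

  static void compute_live_regs(struct insn_live *live, const int *postorder,
                                int insn_cnt,
                                int (*successors)(int insn, int *succ))
  {
          bool changed = true;

          while (changed) {
                  changed = false;
                  /* visiting in postorder makes the fixed point converge fast */
                  for (int i = 0; i < insn_cnt; i++) {
                          int idx = postorder[i], succ[2];
                          int n = successors(idx, succ);
                          regmask_t out = 0, in;

                          for (int j = 0; j < n; j++)
                                  out |= live[succ[j]].in;
                          in = (out & ~live[idx].def) | live[idx].use;
                          if (in != live[idx].in || out != live[idx].out) {
                                  live[idx].in = in;
                                  live[idx].out = out;
                                  changed = true;
                          }
                  }
          }
  }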
Note on check_cfg() ex_insn_beg/ex_done change:
To avoid out-of-bounds access to the env->cfg.insn_postorder array, it
should be guaranteed that an instruction transitions to the EXPLORED
state only once. Previously this was not the case for incorrect programs
with direct calls to exception callbacks.
The 'align' selftest needs adjustment to skip computed insn/live
registers printout. Otherwise it matches lines from the live registers
printout.
Peilin Ye [Tue, 4 Mar 2025 01:06:40 +0000 (01:06 +0000)]
bpf, x86: Support load-acquire and store-release instructions
Recently we introduced BPF load-acquire (BPF_LOAD_ACQ) and store-release
(BPF_STORE_REL) instructions. For x86-64, simply implement them as
regular BPF_LDX/BPF_STX loads and stores. The verifier always rejects
misaligned load-acquires/store-releases (even if BPF_F_ANY_ALIGNMENT is
set), so emitted MOV* instructions are guaranteed to be atomic.
Arena accesses are supported. 8- and 16-bit load-acquires are
zero-extending (i.e., MOVZBQ, MOVZWQ).
Rename emit_atomic{,_index}() to emit_atomic_rmw{,_index}() to make it
clear that they only handle read-modify-write atomics, and extend their
@atomic_op parameter from u8 to u32, since we are starting to use more
than the lowest 8 bits of the 'imm' field.
Refactor mark_fastcall_pattern_for_call() to extract a utility
function get_call_summary(). For a helper or kfunc call this function
fills the following information: {num_params, is_void, fastcall}.
This function will be used in the next patch in order to get the number
of parameters of a helper or kfunc call.
Peilin Ye [Tue, 4 Mar 2025 01:06:33 +0000 (01:06 +0000)]
bpf, arm64: Support load-acquire and store-release instructions
Support BPF load-acquire (BPF_LOAD_ACQ) and store-release
(BPF_STORE_REL) instructions in the arm64 JIT compiler. For example
(assuming little-endian):
Arena accesses are supported.
bpf_jit_supports_insn(..., /*in_arena=*/true) always returns true for
BPF_LOAD_ACQ and BPF_STORE_REL instructions, as they don't depend on
ARM64_HAS_LSE_ATOMICS.
bpf: jmp_offset() and verbose_insn() utility functions
Extract two utility functions:
- One BPF jump instruction uses .imm field to encode jump offset,
while the rest use .off. Encapsulate this detail as jmp_offset()
function.
- Avoid duplicating instruction printing callback definitions by
defining a verbose_insn() function, which disassembles an
instruction into the verifier log while hiding this detail.
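For instance, jmp_offset() can hide the one special case in a few lines
(a sketch consistent with the description above):

  static int jmp_offset(const struct bpf_insn *insn)
  {
          /* gotol (BPF_JMP32 | BPF_JA) keeps its jump offset in .imm;
           * every other jump uses .off
           */
          if (insn->code == (BPF_JMP32 | BPF_JA))
                  return insn->imm;
          return insn->off;
  }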
Peilin Ye [Tue, 4 Mar 2025 01:06:19 +0000 (01:06 +0000)]
arm64: insn: Add BIT(23) to {load,store}_ex's mask
We are planning to add load-acquire (LDAR{,B,H}) and store-release
(STLR{,B,H}) instructions to insn.{c,h}; add BIT(23) to mask of load_ex
and store_ex to prevent aarch64_insn_is_{load,store}_ex() from returning
false-positives for load-acquire and store-release instructions.
Reference: Arm Architecture Reference Manual (ARM DDI 0487K.a,
ID032224),
This series replaces the current implementation of cond_break, which
uses the may_goto instruction and counts 8 million iterations per stack
frame, with an implementation based on sampling time locally on the CPU.
This is done to permit a longer time for a given loop per program
invocation. The accounting is still done per stack frame, but the count
is instead used to amortize the cost of the logic that samples and
checks the time spent since the start.
This is needed for expressing more complicated algorithms (spin locks,
waiting loops, etc.) in BPF programs without false positive expiration
of the loop. For instance, the plan is to make use of this for
implementing spin locks for BPF arena [0].
For the loop as follows:
for (int i = 0;; i++) {}
Testing on a bare-metal Sapphire Rapids Intel server yields the following
table (taking an average of 25 runs).
Here, count is used to amortize the time sampling and checking logic.
Obviously, this is the limit of an empty loop. Given the complexity of
the loop body, the time spent in the loop can be longer. Cancellations
will address the task of imposing an upper bound on program runtime.
* Address comments from Alexei
* Use kernel comment style for new code.
* Remove p->count == 0 check in bpf_check_timed_may_goto.
* Add comments on AX as argument/retval calling convention.
* Add comments describing how the counting logic works.
* Use BPF_EMIT_CALL instead of open-coding instruction encoding.
* Change if ax != 1 goto pc+X condition to if ax != 0 goto pc+X.
====================
A "load-acquire" is a BPF_STX | BPF_ATOMIC instruction with the 'imm'
field set to BPF_LOAD_ACQ (0x100).
Similarly, a "store-release" is a BPF_STX | BPF_ATOMIC instruction with
the 'imm' field set to BPF_STORE_REL (0x110).
Unlike existing atomic read-modify-write operations that only support
BPF_W (32-bit) and BPF_DW (64-bit) size modifiers, load-acquires and
store-releases also support BPF_B (8-bit) and BPF_H (16-bit). As an
exception, however, 64-bit load-acquires/store-releases are not
supported on 32-bit architectures (to fix a build error reported by the
kernel test robot).
An 8- or 16-bit load-acquire zero-extends the value before writing it to
a 32-bit register, just like ARM64 instruction LDARH and friends.
Similar to existing atomic read-modify-write operations, misaligned
load-acquires/store-releases are not allowed (even if
BPF_F_ANY_ALIGNMENT is set).
As an example, consider the following 64-bit load-acquire BPF
instruction (assuming little-endian):
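One encoding consistent with these rules, written as a struct bpf_insn
initializer (the register choices r0/r1 and offset 0 are only for
illustration):

  /* r0 = load_acquire((u64 *)(r1 + 0)) */
  struct bpf_insn insn = {
          .code    = BPF_STX | BPF_ATOMIC | BPF_DW,   /* 0xdb */
          .dst_reg = BPF_REG_0,                       /* value lands here */
          .src_reg = BPF_REG_1,                       /* address operand  */
          .off     = 0,
          .imm     = BPF_LOAD_ACQ,                    /* 0x100 */
  };
  /* little-endian byte sequence: db 10 00 00 00 01 00 00 */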
In arch/{arm64,s390,x86}/net/bpf_jit_comp.c, have
bpf_jit_supports_insn(..., /*in_arena=*/true) return false for the new
instructions, until the corresponding JIT compiler supports them in
arena.
We are introducing a new function in the libbpf API, bpf_object__prepare,
which provides more granular control over the process of loading a
bpf_object. bpf_object__prepare performs ELF processing, relocations,
prepares final state of BPF program instructions (accessible with
bpf_program__insns()), creates and potentially pins maps, and stops short
of loading BPF programs.
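For instance, a tool that only wants the finalized instructions could do
roughly the following (error handling trimmed; the file and output
format are illustrative):

  #include <stdio.h>
  #include <bpf/libbpf.h>

  static void dump_insn_counts(const char *path)
  {
          struct bpf_object *obj = bpf_object__open_file(path, NULL);
          struct bpf_program *prog;

          if (!obj)
                  return;
          /* ELF processing, relocations, map creation, but no program load */
          if (bpf_object__prepare(obj))
                  goto out;
          bpf_object__for_each_program(prog, obj)
                  printf("%s: %zu insns\n", bpf_program__name(prog),
                         bpf_program__insn_cnt(prog));
  out:
          bpf_object__close(obj);
  }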
There are a couple of anticipated use cases for this API:
* Use a BPF token for freplace programs that might need to look up the
BTF of other programs (BPF token creation can't be moved to the open
step, as the open step is a "no privilege assumption" step so that tools
like bpftool can generate skeletons, discover the structure of a BPF
object, etc).
* Stopping at prepare gives users finalized BPF program
instructions (with subprogs appended, everything relocated and
finalized, etc). And that property can be taken advantage of by
veristat (and similar tools) that might want to process one program at
a time, but would like to avoid relatively slow ELF parsing and
processing; and even BPF selftests itself (RUN_TESTS part of it at
least) would benefit from this by eliminating waste of re-processing
ELF many times.
====================
Implement the arch_bpf_timed_may_goto function using inline assembly to
have control over which registers are spilled, and use our special
protocol of using BPF_REG_AX as an argument into the function, and as
the return value when going back.
Emit call depth accounting for the call made from this stub, and ensure
we don't have naked returns (when rethunk mitigations are enabled) by
falling back to the RET macro (instead of retq). After popping all saved
registers, the return address into the BPF program should be on top of
the stack.
Since the JIT support is now enabled, ensure selftests which are
checking the produced may_goto sequences do not break by adjusting them.
Make sure we still test the old may_goto sequence on other
architectures, while testing the new sequence on x86_64.
Implement support in the verifier for replacing may_goto implementation
from a counter-based approach to one which samples time on the local CPU
to have a bigger loop bound.
We implement it by maintaining 16-bytes per-stack frame, and using 8
bytes for maintaining the count for amortizing time sampling, and 8
bytes for the starting timestamp. To minimize overhead, we need to avoid
spilling and filling of registers around this sequence, so we push this
cost into the time sampling function 'arch_bpf_timed_may_goto'. This is
a JIT-specific wrapper around bpf_check_timed_may_goto which returns us
the count to store into the stack through BPF_REG_AX. All caller-saved
registers (r0-r5) are guaranteed to remain untouched.
The loop can be broken by returning a count of 0; otherwise, we dispatch
into the function when the count drops to 0, and the runtime chooses
either to refresh it (by returning BPF_MAX_TIMED_LOOPS as the count) or
to return 0 and abort the loop on the next iteration.
Since the check for 0 is done right after loading the count from the
stack, all subsequent cond_break sequences, whether in the same loop or
in later loops in the program, should immediately break as well.
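A sketch of the runtime helper's logic as described above (the structure
layout and exact constants are assumptions):

  struct bpf_timed_may_goto {
          u64 count;
          u64 timestamp;
  };

  u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *p)
  {
          u64 time = ktime_get_mono_fast_ns();

          /* First invocation for this stack frame: record the start time
           * and hand back a fresh amortization count.
           */
          if (!p->timestamp) {
                  p->timestamp = time;
                  return BPF_MAX_TIMED_LOOPS;
          }
          /* Time budget (currently 250 ms) exhausted: returning 0 makes this
           * and all later cond_break checks in the frame break the loop.
           */
          if (time - p->timestamp >= 250 * NSEC_PER_MSEC)
                  return 0;
          /* Otherwise keep looping with a refreshed count. */
          return BPF_MAX_TIMED_LOOPS;
  }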
We pass in the stack_depth of the count (and thus the timestamp, by
adding 8 to it) to the arch_bpf_timed_may_goto call so that it can be
passed in to bpf_check_timed_may_goto as an argument after r1 is saved,
by adding the offset to r10/fp. This adjustment will be arch specific,
and the next patch will introduce support for x86.
Note that depending on loop complexity, time spent in the loop can be
more than the current limit (250 ms), but imposing an upper bound on
program runtime is an orthogonal problem which will be addressed when
program cancellations are supported.
The current time afforded by cond_break may not be enough for cases
where BPF programs want to implement locking algorithms inline, and use
cond_break as a promise to the verifier that they will eventually
terminate.
Below are some benchmarking numbers on the time taken per-iteration for
an empty loop that counts the number of iterations until cond_break
fires. For comparison, we compare it against bpf_for/bpf_repeat which is
another way to achieve the same number of spins (BPF_MAX_LOOPS). The
hardware used for benchmarking was a Sapphire Rapids Intel server with
performance governor enabled, mitigations were enabled.
Mykyta Yatsenko [Mon, 3 Mar 2025 13:57:51 +0000 (13:57 +0000)]
libbpf: Split bpf object load into prepare/load
Introduce bpf_object__prepare API: additional intermediate preparation
step that performs ELF processing, relocations, prepares final state of
BPF program instructions (accessible with bpf_program__insns()), creates
and (potentially) pins maps, and stops short of loading BPF programs.
We anticipate a few use cases for this API, such as:
* Use prepare to initialize bpf_token, without loading freplace
programs, unlocking possibility to lookup BTF of other programs.
* Execute prepare to obtain finalized BPF program instructions without
loading programs, enabling tools like veristat to process one program at
a time, without incurring cost of ELF parsing and processing.
Mykyta Yatsenko [Mon, 3 Mar 2025 13:57:50 +0000 (13:57 +0000)]
libbpf: Introduce more granular state for bpf_object
We are going to split bpf_object loading into 2 stages: preparation and
loading. This will increase flexibility when working with bpf_object
and unlock some optimizations and use cases.
This patch replaces the boolean 'loaded' flag with a more finely-grained
state for bpf_object.
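An illustrative shape for that state (the exact enumerator names are an
assumption):

  enum bpf_object_state {
          OBJ_OPEN,       /* after bpf_object__open*()   */
          OBJ_PREPARED,   /* after bpf_object__prepare() */
          OBJ_LOADED,     /* after bpf_object__load()    */
  };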
Breno Leitao [Fri, 28 Feb 2025 18:43:34 +0000 (10:43 -0800)]
net: filter: Avoid shadowing variable in bpf_convert_ctx_access()
Rename the local variable 'off' to 'offset' to avoid shadowing the existing
'off' variable that is declared as an `int` in the outer scope of
bpf_convert_ctx_access().
This fixes a compiler warning:
net/core/filter.c:9679:8: warning: declaration shadows a local variable [-Wshadow]
Mykyta Yatsenko [Mon, 3 Mar 2025 13:57:49 +0000 (13:57 +0000)]
libbpf: Use map_is_created helper in map setters
Refactoring: use map_is_created helper in map setters that need to check
the state of the map. This helps to reduce the number of places that
depend explicitly on the loaded flag, simplifying refactoring in the
next patch of this set.
====================
selftests/bpf: Migrate test_tunnel.sh to test_progs
Hi all,
This patch series continues the work to migrate the *.sh tests into
prog_tests framework.
The test_tunnel.sh script has already been partly migrated to
test_progs in prog_tests/test_tunnel.c so I add my work to it.
PATCH 1 & 2 create some helpers to avoid code duplication and ease the
migration in the following patches.
PATCH 3 to 9 migrate the tests of gre, ip6gre, erspan, ip6erspan,
geneve, ip6geneve and ip6tnl tunnels.
PATCH 10 removes test_tunnel.sh
====================
All tests from test_tunnel.sh have been migrated into test_progs.
The last test remaining in the script is test_ipip(), which is already
covered in the test_progs framework by the NONE case of
test_ipip_tunnel().
Remove the test_tunnel.sh script and its Makefile entry.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-10-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: test_tunnel: Move ip6tnl tunnel tests to test_progs
ip6tnl tunnels are tested in the test_tunnel.sh but not in the test_progs
framework.
Add a new test in test_progs to test ip6tnl tunnels. It uses the same
network topology and the same BPF programs as the script.
Remove test_ipip6() and test_ip6ip6() from the script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-9-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: test_tunnel: Move ip6geneve tunnel test to test_progs
ip6geneve tunnels are tested in the test_tunnel.sh but not in the
test_progs framework.
Add a new test in test_progs to test ip6geneve tunnels. It uses the same
network topology and the same BPF programs as the script.
Remove test_ip6geneve() from the script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-8-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: test_tunnel: Move geneve tunnel test to test_progs
geneve tunnels are tested in the test_tunnel.sh but not in the test_progs
framework.
Add a new test in test_progs to test geneve tunnels. It uses the same
network topology and the same BPF programs as the script.
Remove test_geneve() from the script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-7-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: test_tunnel: Move ip6erspan tunnel test to test_progs
ip6erspan tunnels are tested in the test_tunnel.sh but not in the
test_progs framework.
Add a new test in test_progs to test ip6erspan tunnels. It uses the same
network topology and the same BPF programs as the script.
Remove test_ip6erspan() from the script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-6-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: test_tunnel: Move erspan tunnel tests to test_progs
erspan tunnels are tested in the test_tunnel.sh but not in the test_progs
framework.
Add a new test in test_progs to test erspan tunnels. It uses the same
network topology and the same BPF programs as the script.
Remove test_erspan() from the script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-5-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: test_tunnel: Move ip6gre tunnel test to test_progs
ip6gre tunnels are tested in the test_tunnel.sh but not in the test_progs
framework.
Add a new test in test_progs to test ip6gre tunnels. It uses the same
network topology and the same BPF programs as the script. Disable the
IPv6 DAD feature because it can take a lot of time and cause some tests
to fail depending on the environment they're run on.
Remove test_ip6gre() and test_ip6gretap() from the script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-4-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: test_tunnel: Move gre tunnel test to test_progs
gre tunnels are tested in the test_tunnel.sh but not in the test_progs
framework.
Add a new test in test_progs to test gre tunnels. It uses the same
network topology and the same BPF programs as the script.
Remove test_gre() and test_gre_no_tunnel_key() from the script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-3-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
====================
veristat: @files-list.txt notation for object files list
A few small veristat improvements:
- It is possible to hit the command line parameter count limit,
e.g. when running veristat for all object files generated for
test_progs. This patch-set adds an option to read the object files list
from a file.
- Correct usage of strerror() function.
- Avoid printing log lines to CSV output.
Changelog:
- v1 -> v2:
- replace strerror(errno) with strerror(-err) in patch #2 (Andrii)
All tests use more or less the same ping commands as final validation.
Also test_ping()'s return value is checked with ASSERT_OK() while this
check is already done by the SYS() macro inside test_ping().
Create helpers around test_ping() and use them in the tests to avoid code
duplication.
Remove the unnecessary ASSERT_OK() from the tests.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250303-tunnels-v2-2-8329f38f0678@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Peilin Ye [Mon, 3 Mar 2025 05:37:25 +0000 (05:37 +0000)]
bpf: Factor out check_load_mem() and check_store_reg()
Extract BPF_LDX and most non-ATOMIC BPF_STX instruction handling logic
in do_check() into helper functions to be used later. While we are
here, make that comment about "reserved fields" more specific.
veristat: Report program type guess results to sdterr
In order not to pollute CSV output, e.g.:
$ ./veristat -o csv exceptions_ext.bpf.o > test.csv
Using guessed program type 'sched_cls' for exceptions_ext.bpf.o/extension...
Using guessed program type 'sched_cls' for exceptions_ext.bpf.o/throwing_extension...
A fair amount of code duplication is present among tests to attach BPF
programs.
Create generic_attach* helpers that attach BPF programs to a given
interface.
Use ASSERT_OK_FD() instead of ASSERT_GE() to check the validity of the fds.
Use these helpers in all the available tests.
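One possible shape for such a helper, using libbpf's tc hook API and the
selftests ASSERT_OK_FD() macro (a sketch; the helper names and the attach
mechanism used in the actual patch may differ):

  #include <errno.h>
  #include <bpf/libbpf.h>

  /* Attach 'prog' to the ingress hook of the interface 'ifindex'. */
  static int generic_attach_ingress(struct bpf_program *prog, int ifindex)
  {
          LIBBPF_OPTS(bpf_tc_hook, hook,
                      .ifindex = ifindex,
                      .attach_point = BPF_TC_INGRESS);
          LIBBPF_OPTS(bpf_tc_opts, opts);
          int err, prog_fd;

          prog_fd = bpf_program__fd(prog);
          if (!ASSERT_OK_FD(prog_fd, "bpf_program__fd"))
                  return -1;
          opts.prog_fd = prog_fd;

          /* create the clsact qdisc if it is not there yet */
          err = bpf_tc_hook_create(&hook);
          if (err && err != -EEXIST)
                  return err;
          return bpf_tc_attach(&hook, &opts);
  }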
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250303-tunnels-v2-1-8329f38f0678@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Peilin Ye [Mon, 3 Mar 2025 05:37:19 +0000 (05:37 +0000)]
bpf: Factor out check_atomic_rmw()
Currently, check_atomic() only handles atomic read-modify-write (RMW)
instructions. Since we are planning to introduce other types of atomic
instructions (i.e., atomic load/store), extract the existing RMW
handling logic into its own function named check_atomic_rmw().
Remove the @insn_idx parameter as it is not really necessary. Use
'env->insn_idx' instead, as in other places in verifier.c.
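After the extraction, the dispatch in check_atomic() boils down to roughly
the following shape (a sketch, not the verbatim verifier code):

  static int check_atomic(struct bpf_verifier_env *env, struct bpf_insn *insn)
  {
          switch (insn->imm) {
          case BPF_ADD:
          case BPF_ADD | BPF_FETCH:
          case BPF_AND:
          case BPF_AND | BPF_FETCH:
          case BPF_OR:
          case BPF_OR | BPF_FETCH:
          case BPF_XOR:
          case BPF_XOR | BPF_FETCH:
          case BPF_XCHG:
          case BPF_CMPXCHG:
                  /* former body of check_atomic() lives here now */
                  return check_atomic_rmw(env, insn);
          default:
                  verbose(env, "BPF_ATOMIC uses invalid atomic opcode %02x\n",
                          insn->imm);
                  return -EINVAL;
          }
  }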
Eric Dumazet [Sat, 1 Mar 2025 19:13:15 +0000 (19:13 +0000)]
bpf: no longer acquire map_idr_lock in bpf_map_inc_not_zero()
bpf_sk_storage_clone() is the only caller of bpf_map_inc_not_zero()
and is holding rcu_read_lock().
map_idr_lock does not add any protection here; dropping it simply removes
the cost for passive TCP flows.
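With the lock gone, the function reduces to roughly this shape (a sketch of
the result, not the verbatim kernel code):

  /* Callers hold rcu_read_lock(), which keeps 'map' alive while we try to
   * grab a reference, so an inc-not-zero on the refcount is sufficient and
   * map_idr_lock adds nothing.
   */
  struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map)
  {
          if (!atomic64_inc_not_zero(&map->refcnt))
                  return ERR_PTR(-ENOENT);
          return map;
  }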
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kui-Feng Lee <kuifeng@meta.com>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://lore.kernel.org/r/20250301191315.1532629-1-edumazet@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
====================
Global subprogs in RCU/{preempt,irq}-disabled sections
Small change to allow non-sleepable global subprogs in
RCU, preempt-disabled, and irq-disabled sections. For
now, we don't lift the limitation for locks as it requires
more analysis, and will do this once resilient spin locks
land.
This surfaced a bug where sleepable global subprogs were
allowed in RCU read sections, which has been fixed. Tests
have been added to cover various cases.
* Rename subprog_info[i].sleepable to might_sleep, which more
accurately reflects the nature of the bit: 'sleepable' describes whether
a given context is allowed to sleep, while might_sleep captures whether
the subprog may actually do so.
* Disallow extensions that might sleep from attaching to targets that don't
sleep, since they'd then be permitted to be called in atomic contexts. (Eduard)
* Add tests for mixing non-sleepable and sleepable global function
calls, and extensions attaching to non-sleepable global functions. (Eduard)
* Rename changes_pkt_data -> summarization
====================
selftests/bpf: Add tests for extending sleepable global subprogs
Add tests for freplace behavior with the combination of sleepable
and non-sleepable global subprogs. The changes_pkt_data selftest
did all the hard work, so simply rename it and include new support
for more summarization tests covering the might_sleep bit.
selftests/bpf: Test sleepable global subprogs in atomic contexts
Add tests for rejecting sleepable and accepting non-sleepable global
function calls in atomic contexts. For spin locks, we still reject
all global function calls. Once resilient spin locks land, we will
carefully lift this restriction in cases where we deem it safe.
Yonghong Song [Mon, 24 Feb 2025 23:01:16 +0000 (15:01 -0800)]
bpf: Allow pre-ordering for bpf cgroup progs
Currently, for bpf progs in a cgroup hierarchy, the effective prog array
is computed from bottom cgroup to upper cgroups (post-ordering). For
example, the following cgroup hierarchy
root cgroup: p1, p2
subcgroup: p3, p4
with BPF_F_ALLOW_MULTI set at both cgroup levels.
The effective cgroup array ordering looks like
p3 p4 p1 p2
and at run time, progs will execute based on that order.
But in some cases, it is desirable to have the root prog execute earlier than
the children's progs (pre-ordering). For example,
- prog p1 intends to collect original pkt dest addresses.
- prog p3 will modify original pkt dest addresses to a proxy address for
security reasons.
The end result is that prog p1 gets the proxy address, which is not what it
wants. Putting p1 in every child cgroup is not desirable either, as it
will duplicate itself in many child cgroups. And this is exactly a use case
we are encountering at Meta.
To fix this issue, let us introduce a flag BPF_F_PREORDER. If the flag
is specified at attachment time, the prog has higher priority and the
ordering with that flag will be from top to bottom (pre-ordering).
Going back to the above example,
root cgroup: p1, p2
subcgroup: p3, p4
Let us say p2 and p4 are marked with BPF_F_PREORDER. The final
effective array ordering will be
p2 p4 p3 p1
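From user space, the flag is simply OR-ed into the attach flags; a minimal
sketch using libbpf (the cgroup fd, attach type, and error handling are
illustrative):

  #include <bpf/bpf.h>

  /* Attach 'prog_fd' to 'cgroup_fd' as a pre-ordered multi prog: it runs
   * before post-ordered progs, and before pre-ordered progs attached to
   * child cgroups.
   */
  static int attach_preorder(int cgroup_fd, int prog_fd)
  {
          LIBBPF_OPTS(bpf_prog_attach_opts, opts,
                      .flags = BPF_F_ALLOW_MULTI | BPF_F_PREORDER);

          return bpf_prog_attach_opts(prog_fd, cgroup_fd,
                                      BPF_CGROUP_INET_EGRESS, &opts);
  }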
The verifier currently does not permit global subprog calls when a lock
is held, preemption is disabled, or when IRQs are disabled. This is
because we don't know whether the global subprog calls sleepable
functions or not.
In case of locks, there's an additional reason: functions called by the
global subprog may hold additional locks etc. The verifier won't know
while verifying the global subprog whether it was called in a context
where a spin lock is already held by the program.
Summarize the sleepable nature of a global subprog, just like
changes_pkt_data, and then allow calls to non-sleepable global subprogs
from atomic context.
While making this change, I noticed that RCU read sections had no
protection against sleepable global subprog calls; include them in the
checks and fix this while we're at it.
Care needs to be taken to not allow global subprog calls when a regular
bpf_spin_lock is held. Once resilient spin locks land, we may want to
relax this check, but not for now.
Also make sure extensions freplacing global functions cannot do so
when the target is non-sleepable but the extension might sleep. The other
combination is ok.
Tests are included in the next patch to handle all special conditions.
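As an illustration (a hand-written sketch, not one of the selftests added in
the next patch), a program like the following is now accepted, while swapping
the subprog body for a sleepable helper or kfunc call would make the verifier
reject the call inside the preempt-disabled region:

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>

  void bpf_preempt_disable(void) __ksym;
  void bpf_preempt_enable(void) __ksym;

  /* Global (non-static, noinline) subprog without sleepable helpers or
   * kfuncs in its body, so it is summarized as !might_sleep.
   */
  __noinline int global_accumulate(int x)
  {
          return x * 2 + 1;
  }

  SEC("tc")
  int prog(struct __sk_buff *ctx)
  {
          int v;

          bpf_preempt_disable();
          /* Previously rejected outright; allowed now that the callee is
           * known not to sleep.
           */
          v = global_accumulate(ctx->len);
          bpf_preempt_enable();
          return v & 1;
  }

  char _license[] SEC("license") = "GPL";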
====================
Optimize bpf selftest to increase CI success rate
1. Optimized some selftests that bind to static ports to avoid port
conflicts when running test_progs -j.
2. Optimized the retry logic for test_maps.
Some Failed CI:
https://github.com/kernel-patches/bpf/actions/runs/13275542359/job/37064974076
https://github.com/kernel-patches/bpf/actions/runs/13549227497/job/37868926343
https://github.com/kernel-patches/bpf/actions/runs/13548089029/job/37865812030
https://github.com/kernel-patches/bpf/actions/runs/13553536268/job/37883329296
(Perhaps it's due to the large number of pull requests requiring CI runs?)
====================
Jiayuan Chen [Thu, 27 Feb 2025 14:26:46 +0000 (22:26 +0800)]
selftests/bpf: Fixes for test_maps test
BPF CI has failed 3 times in the last 24 hours. Add a retry for ENOMEM,
following the same approach as:
commit 2f553b032cad ("selftsets/bpf: Retry map update for non-preallocated per-cpu map")
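The retry is along these lines (an illustrative sketch, not the verbatim
test_maps change; helper name and retry parameters are assumptions):

  #include <errno.h>
  #include <unistd.h>
  #include <bpf/bpf.h>

  /* Memory pressure during a parallel test run can make updates to a
   * non-preallocated map fail transiently with ENOMEM; retry a bounded
   * number of times before failing the test.
   */
  static int map_update_retriable(int map_fd, const void *key,
                                  const void *value, __u64 flags,
                                  int attempts)
  {
          int err = -EINVAL;

          while (attempts--) {
                  err = bpf_map_update_elem(map_fd, key, value, flags);
                  if (!err || errno != ENOMEM)
                          break;
                  usleep(1);
          }
          return err;
  }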
Introduce a new kfunc, bpf_dynptr_copy, which enables copying of
data from one dynptr to another. This functionality may be useful in
scenarios such as capturing XDP data to a ring buffer.
The patch set is split into 3 patches:
1. Refactor bpf_dynptr_read and bpf_dynptr_write by extracting code into
static functions, which allows calling them without compiler warnings
2. Introduce bpf_dynptr_copy
3. Add tests for bpf_dynptr_copy
v2->v3:
* Implemented bpf_memcmp in dynptr_success.c test, as __builtin_memcmp
was not inlined on GCC-BPF.
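For reference, a usage sketch of bpf_dynptr_copy from an XDP program that
captures the start of a packet into a ring buffer (the kfunc declarations,
map size, and copy length are illustrative; the signature follows the one
proposed in this series):

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>

  extern int bpf_dynptr_from_xdp(struct xdp_md *x, __u64 flags,
                                 struct bpf_dynptr *ptr) __ksym;
  extern int bpf_dynptr_copy(struct bpf_dynptr *dst_ptr, __u32 dst_off,
                             struct bpf_dynptr *src_ptr, __u32 src_off,
                             __u32 size) __ksym;

  struct {
          __uint(type, BPF_MAP_TYPE_RINGBUF);
          __uint(max_entries, 4096);
  } ringbuf SEC(".maps");

  SEC("xdp")
  int capture(struct xdp_md *ctx)
  {
          struct bpf_dynptr src, dst;

          /* Wrap the packet and a ringbuf reservation as dynptrs, then copy
           * the first 64 bytes of packet data into the reserved sample.
           */
          bpf_dynptr_from_xdp(ctx, 0, &src);
          if (bpf_ringbuf_reserve_dynptr(&ringbuf, 64, 0, &dst)) {
                  bpf_ringbuf_discard_dynptr(&dst, 0);
                  return XDP_PASS;
          }
          if (bpf_dynptr_copy(&dst, 0, &src, 0, 64))
                  bpf_ringbuf_discard_dynptr(&dst, 0);
          else
                  bpf_ringbuf_submit_dynptr(&dst, 0);
          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";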
====================