====================
Remove task and cgroup local storage percpu counters
* Motivation *
The goal of this patchset is to make bpf syscalls and helpers updating
task and cgroup local storage more robust by removing percpu counters
in them. Task local storage and cgroup storage each employ a percpu
counter to prevent deadlock caused by recursion. Since the underlying
bpf local storage takes spinlocks in various operations, bpf programs
running recursively may try to take a spinlock which is already taken.
For example, when a tracing bpf program invoked recursively during
bpf_task_storage_get(..., F_CREATE) tries to call
bpf_task_storage_get(..., F_CREATE) again, it will cause an AA deadlock
if the percpu variable is not in place.
However, sometimes the percpu counter may cause bpf syscalls or helpers
to return errors spuriously when another thread is also updating
the local storage or the local storage map. Ideally, the two threads
could have taken turns to take the locks and perform their jobs
respectively. However, due to the percpu counter, the syscalls and
helpers can return -EBUSY even when neither of them runs recursively
inside the other. All it takes for this to happen is for the two threads
to run on the same CPU. This happened when BPF-CI ran the selftest of
task local data. Since CI runs the test on a VM with 2 CPUs,
bpf_task_storage_get(..., F_CREATE) can easily fail.
This failure mode is not good for users as they need to add retry logic
in user space or bpf programs to avoid it. Even with retries, there is
no guaranteed upper bound on the number of attempts before a call
succeeds. Therefore, this patchset seeks to remove the percpu counter
and make the related bpf syscalls and helpers more reliable, while still
making sure recursion deadlock will not happen, with the help of the
resilient queued spinlock (rqspinlock).
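For illustration, today a user-space caller has to wrap updates in a
retry loop along these lines (a sketch only; map_fd, key and value are
placeholders), with no bound on how many iterations it may take:

        int err;

        do {
                err = bpf_map_update_elem(map_fd, &key, &value, BPF_ANY);
        } while (err == -EBUSY);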
* Implementation *
To remove the percpu counter without introducing deadlock,
bpf_local_storage is refactored by changing the locks from raw_spin_lock
to rqspinlock, which avoids deadlock through deadlock detection and a
timeout mechanism.
The refactor basically replaces the locks with rqspinlock and propagates
errors returned by the locking function to BPF helpers or syscalls.
bpf_selem_unlink_nofail() is introduced to handle rqspinlock errors
in two lock acquiring functions that cannot fail,
bpf_local_storage_destroy() and bpf_local_storage_map_free()
(i.e., local storage is being freed by the subsystem or the map is
being freed). The high-level idea is to use a bitfield and atomic
operations to track who is referencing an selem when any of the locks
cannot be acquired. Additional care is needed to make sure special
fields are freed and owner memory is uncharged safely and correctly.
For readers not familiar with local storage, the last section briefly
describes the locks and structures of local storage. It also lists the
abbreviations used in the rest of the letter.
* Test *
Task and cgroup local storage selftests have already covered deadlock
caused by recursion. Patch 14 updates the expected result of task local
storage selftests as task local storage bpf helpers can now run on the
same CPU as they don't cause deadlock.
The benchmark is a microbenchmark stress-testing how fast local storage
can be created. After switching to rqspinlock and
bpf_selem_unlink_nofail(), socket local storage creation speed has a
~5% gain. For task local storage, the number remains the same.
Patch 1-4 convert local storage internal helpers to failable.
Patch 5 changes the locks to rqspinlock and propagates the error
returned from raw_res_spin_lock_irqsave() to BPF helpers and syscalls.
Patch 6-8 remove percpu counters in task and cgroup local storage.
Patch 9-11 address the unlikely rqspinlock errors by switching to
bpf_selem_unlink_nofail() in map_free() and destroy().
Patch 12-17 update selftests.
* Appendix: local storage internal *
There are two locks in bpf_local_storage due to the ownership model
described below. A map value, which consists of a pointer to the map
and the data, is a bpf_local_storage_data (sdata) stored in a
bpf_local_storage_elem (selem). A selem belongs to a bpf_local_storage
and a bpf_local_storage_map at the same time.
bpf_local_storage::lock (local_storage->lock in short) protects the list
in a bpf_local_storage and bpf_local_storage_map_bucket::lock (b->lock)
protects the hash bucket in a bpf_local_storage_map.
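Roughly, the relevant pieces of the structures look like this
(simplified; most fields omitted):

        struct bpf_local_storage_map_bucket {
                struct hlist_head list;   /* selems hashed into this bucket */
                raw_spinlock_t lock;      /* b->lock, becomes rqspinlock_t */
        };

        struct bpf_local_storage {        /* owned by a task, cgroup, socket, ... */
                struct hlist_head list;   /* selems belonging to this owner */
                void *owner;
                raw_spinlock_t lock;      /* local_storage->lock, becomes rqspinlock_t */
        };

        struct bpf_local_storage_elem {   /* selem */
                struct hlist_node map_node;  /* linked into b->list */
                struct hlist_node snode;     /* linked into local_storage->list */
                struct bpf_local_storage __rcu *local_storage;
                struct bpf_local_storage_data sdata;  /* smap pointer + map value */
        };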
v6 -> v7
- Minor comment and commit msg tweaks
- Patch 9: Remove unused "owner" (kernel test robot)
- Patch 13: Update comments in task_ls_recursion.c (AI) Link: https://lore.kernel.org/bpf/20260205070208.186382-1-ameryhung@gmail.com/
v5 -> v6
- Redo benchmark
- Patch 9: Remove storage->smap as it is not used any more
- Patch 17: Remove storage->smap check in selftests
- Patch 10, 11: Pass reuse_now = true to bpf_selem_free() and
bpf_local_storage_free() to allow faster memory reclaim (Martin)
- Patch 10: Use a bitfield instead of a refcount to track selem state to
be more precise, which removes the possibility of map_free() missing an
selem (Martin)
- Patch 10: Allow map_free() to free local_storage and drop
the change in bpf_local_storage_map_update() (Martin)
- Patch 11: Simplify destroy() by not deferring work as an owner is
unlikely to have so many maps that they stall RCU (Martin) Link: https://lore.kernel.org/bpf/20260201175050.468601-1-ameryhung@gmail.com/
v4 -> v5
- Patch 1: Fix incorrect bucket calculation (AI)
- Patch 3: Fix memory leak in bpf_sk_storage_clone() (AI)
- Patch 5: Fix memory leak in bpf_local_storage_update() (AI)
- Fix typo/comment/commit msg (AI)
- Patch 10: Replace smp_rmb() with smp_mb(). smp_rmb does not imply
acquire semantics Link: https://lore.kernel.org/bpf/20260131050920.2574084-1-ameryhung@gmail.com/
v3 -> v4
- Add performance numbers
- Avoid stale element when calling bpf_local_storage_map_free()
by allowing it to unlink selem from local_storage->list and uncharge
memory. Block destroy() from returning when pending map_free()
are uncharging
- Fix an -EAGAIN bug in bpf_local_storage_update() as map_free() now
does not free local storage
- Fix possible double-free of selem by ensuring an selem is only
processed once for each caller (Kumar)
- Fix possible infinite loop in bpf_selem_unlink_nofail() when
iterating b->list by replacing while loop with
hlist_for_each_entry_rcu
- Fix unsafe iteration in destroy() by iterating local_storage->list
using hlist_for_each_entry_rcu
- Fix UAF due to clearing storage_owner after destroy(). Flip the order
to fix it
- Misc clean-up suggested by Martin Link: https://lore.kernel.org/bpf/20251218175628.1460321-1-ameryhung@gmail.com/
v2 -> v3
- Rebase to bpf-next where BPF memory allocator is replaced with
kmalloc_nolock()
- Revert to selecting bucket based on selem
- Introduce bpf_selem_unlink_lockless() to allow unlinking and
freeing selem without taking locks Link: https://lore.kernel.org/bpf/20251002225356.1505480-1-ameryhung@gmail.com/
v1 -> v2
- Rebase to bpf-next
- Select bucket based on local_storage instead of selem (Martin)
- Simplify bpf_selem_unlink (Martin)
- Change handling of rqspinlock errors in bpf_local_storage_destroy()
and bpf_local_storage_map_free(). Retry instead of WARN_ON. Link: https://lore.kernel.org/bpf/20250729182550.185356-1-ameryhung@gmail.com/
====================
Amery Hung [Thu, 5 Feb 2026 22:29:15 +0000 (14:29 -0800)]
selftests/bpf: Fix outdated test on storage->smap
bpf_local_storage_free() already does not rely on local_storage->smap
since switching to kmalloc_nolock(). As local_storage->smap is removed,
fix the outdated test by dropping the local_storage->smap check. Keep
the second map in the task local storage map test to check that multiple
elements can be added to the storage, similar to the sk storage test.
Amery Hung [Thu, 5 Feb 2026 22:29:14 +0000 (14:29 -0800)]
selftests/bpf: Choose another percpu variable in bpf for btf_dump test
bpf_cgrp_storage_busy has been removed. Use bpf_bprintf_nest_level
instead. This percpu variable is also in the bpf subsystem, so if it is
removed in the future, BPF-CI will catch this type of CI-breaking
change.
Remove a test in test_maps that checks if the updating of the percpu
counter in task local storage map is preemption safe as the percpu
counter is now removed.
Amery Hung [Thu, 5 Feb 2026 22:29:11 +0000 (14:29 -0800)]
selftests/bpf: Update task_local_storage/recursion test
Update the expected result of the selftest as the recursion restriction
on task local storage syscalls and helpers has been relaxed. Now that
the percpu counter is removed, the task local storage helpers,
bpf_task_storage_get() and bpf_task_storage_delete(), can run on the
same CPU at the same time as long as they do not cause deadlock.
Note that since there is no percpu counter preventing recursion in
task local storage helpers, bpf_trampoline now catches the recursion
of on_update as reported by recursion_misses.
Amery Hung [Thu, 5 Feb 2026 22:29:10 +0000 (14:29 -0800)]
selftests/bpf: Update sk_storage_omem_uncharge test
Check sk_omem_alloc when the caller of bpf_local_storage_destroy()
returns. bpf_local_storage_destroy() now returns the memory to uncharge
to the caller instead of uncharging it directly. Therefore, in the
sk_storage_omem_uncharge test, check sk_omem_alloc when
bpf_sk_storage_free() returns instead of when
bpf_local_storage_destroy() returns.
Amery Hung [Thu, 5 Feb 2026 22:29:09 +0000 (14:29 -0800)]
bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy}
Take care of rqspinlock error in bpf_local_storage_{map_free, destroy}()
properly by switching to bpf_selem_unlink_nofail().
Both functions iterate their own RCU-protected list of selems and call
bpf_selem_unlink_nofail(). In map_free(), to prevent an infinite loop when
both map_free() and destroy() fail to remove a selem from b->list
(extremely unlikely), switch to hlist_for_each_entry_rcu(). In destroy(),
also switch to hlist_for_each_entry_rcu() since we no longer iterate
local_storage->list under local_storage->lock.
bpf_selem_unlink() now becomes dedicated to the helper and syscall
paths, so reuse_now should always be false. Remove the argument and
hardcode the value.
Acked-by: Alexei Starovoitov <ast@kernel.org> Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-12-ameryhung@gmail.com
Amery Hung [Thu, 5 Feb 2026 22:29:08 +0000 (14:29 -0800)]
bpf: Support lockless unlink when freeing map or local storage
Introduce bpf_selem_unlink_nofail() to properly handle errors returned
from rqspinlock in bpf_local_storage_map_free() and
bpf_local_storage_destroy(), where the operation must succeed.
The idea of bpf_selem_unlink_nofail() is to allow an selem to be
partially linked and use atomic operations on a bit field, selem->state,
to determine when and who can free the selem if any unlink under lock
fails. An selem is initially fully linked to a map and a local storage.
Under normal circumstances, bpf_selem_unlink_nofail() will be able to
grab the locks and unlink a selem from the map and the local storage in
sequence, just like bpf_selem_unlink(), and then free it after an RCU
grace period.
However, if any of the lock attempts fails, it will only clear
SDATA(selem)->smap or selem->local_storage and set SELEM_MAP_UNLINKED or
SELEM_STORAGE_UNLINKED, depending on the caller. Then, after both
map_free() and destroy() have seen the selem and the state becomes
SELEM_UNLINKED, only one of the two racing callers can succeed in
cmpxchg-ing the state from SELEM_UNLINKED to SELEM_TOFREE, ensuring no
double free or memory leak.
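A rough sketch of that freeing decision; only the flag names come from
this patch, the values and helper below are illustrative assumptions:

        /* illustrative values, not the actual definitions */
        #define SELEM_MAP_UNLINKED      (1 << 0)
        #define SELEM_STORAGE_UNLINKED  (1 << 1)
        #define SELEM_UNLINKED  (SELEM_MAP_UNLINKED | SELEM_STORAGE_UNLINKED)
        #define SELEM_TOFREE            (1 << 2)

        /* Each of map_free()/destroy() sets its half when its lock attempt
         * fails.  Only after both halves are set can exactly one of the two
         * racing callers win the cmpxchg and take over freeing the selem.
         */
        static bool selem_try_take_free(atomic_t *state, int my_flag)
        {
                int old = atomic_fetch_or(my_flag, state);

                if ((old | my_flag) != SELEM_UNLINKED)
                        return false;   /* the other side is not done yet */
                return atomic_cmpxchg(state, SELEM_UNLINKED, SELEM_TOFREE) ==
                       SELEM_UNLINKED;
        }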
To make sure bpf_obj_free_fields() is done only once and when map is
still present, it is called when unlinking an selem from b->list under
b->lock.
To make sure uncharging memory is done only when the owner is still
present in map_free(), block destroy() from returning until there is no
pending map_free().
Since smap may not be valid in destroy(), bpf_selem_unlink_nofail()
skips bpf_selem_unlink_storage_nolock_misc() when called from destroy().
This is okay as bpf_local_storage_destroy() will return the remaining
amount of memory charge tracked by mem_charge to the owner to uncharge.
It is also safe to skip clearing local_storage->owner and owner_storage
as the owner is being freed and no users or bpf programs should be able
to reference the owner and use the local_storage.
Finally, accesses to selem, SDATA(selem)->smap and selem->local_storage
are racy. Callers protect these fields with RCU.
Acked-by: Alexei Starovoitov <ast@kernel.org> Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-11-ameryhung@gmail.com
Amery Hung [Thu, 5 Feb 2026 22:29:07 +0000 (14:29 -0800)]
bpf: Prepare for bpf_selem_unlink_nofail()
The next patch will introduce bpf_selem_unlink_nofail() to handle
rqspinlock errors. bpf_selem_unlink_nofail() will allow an selem to be
partially unlinked from the map or the local storage. Save the memory
allocation method in the selem so that the selem can later be freed
correctly even when SDATA(selem)->smap is NULL.
In addition, keep track of memory charge to the owner in local storage
so that later bpf_selem_unlink_nofail() can return the correct memory
charge to the owner. Updating local_storage->mem_charge is protected by
local_storage->lock.
Finally, extract miscellaneous tasks performed when unlinking an selem
from local_storage into bpf_selem_unlink_storage_nolock_misc(). It will
be reused by bpf_selem_unlink_nofail().
This patch also takes the chance to remove local_storage->smap, which
is no longer used since commit f484f4a3e058 ("bpf: Replace bpf memory
allocator with kmalloc_nolock() in local storage").
Amery Hung [Thu, 5 Feb 2026 22:29:06 +0000 (14:29 -0800)]
bpf: Remove unused percpu counter from bpf_local_storage_map_free
Percpu locks have been removed from cgroup and task local storage. Now
that no local storage uses percpu variables as locks to prevent
recursion, there is no need to pass them to bpf_local_storage_map_free().
Remove the argument from the function.
Amery Hung [Thu, 5 Feb 2026 22:29:05 +0000 (14:29 -0800)]
bpf: Remove cgroup local storage percpu counter
The percpu counter in cgroup local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.
Amery Hung [Thu, 5 Feb 2026 22:29:04 +0000 (14:29 -0800)]
bpf: Remove task local storage percpu counter
The percpu counter in task local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.
Since the percpu counter is removed, merge back bpf_task_storage_get()
and bpf_task_storage_get_recur(). This will allow the bpf syscalls and
helpers to run concurrently on the same CPU, removing the spurious
-EBUSY error. bpf_task_storage_get(..., F_CREATE) will now always
succeed, given enough free memory, unless it is called recursively.
Amery Hung [Thu, 5 Feb 2026 22:29:03 +0000 (14:29 -0800)]
bpf: Change local_storage->lock and b->lock to rqspinlock
Change bpf_local_storage::lock and bpf_local_storage_map_bucket::lock
from raw_spin_lock to rqspinlock.
Finally, propagate errors from raw_res_spin_lock_irqsave() to syscall
return or BPF helper return.
In bpf_local_storage_destroy(), ignore the return value of
raw_res_spin_lock_irqsave() for now. A later patch will handle errors
correctly in bpf_local_storage_destroy() so that it can unlink selems
even when failing to acquire locks.
For __bpf_local_storage_map_cache(), instead of handling the error,
skip updating the cache.
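The resulting pattern at the call sites looks roughly like this (a
sketch; the exact error handling differs per caller):

        unsigned long flags;
        int err;

        err = raw_res_spin_lock_irqsave(&local_storage->lock, flags);
        if (err)
                return err;     /* propagated to the BPF helper or syscall */

        /* ... link/unlink selems under the lock ... */

        raw_res_spin_unlock_irqrestore(&local_storage->lock, flags);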
Amery Hung [Thu, 5 Feb 2026 22:29:02 +0000 (14:29 -0800)]
bpf: Convert bpf_selem_unlink to failable
To prepare changing both bpf_local_storage_map_bucket::lock and
bpf_local_storage::lock to rqspinlock, convert bpf_selem_unlink() to
failable. It still always succeeds and returns 0 until the change
happens. No functional change.
Open code bpf_selem_unlink_storage() in the only caller,
bpf_selem_unlink(), since unlink_map and unlink_storage must be done
together after all the necessary locks are acquired.
For bpf_local_storage_map_free(), ignore the return from
bpf_selem_unlink() for now. A later patch will allow it to unlink selems
even when failing to acquire locks.
Amery Hung [Thu, 5 Feb 2026 22:29:01 +0000 (14:29 -0800)]
bpf: Convert bpf_selem_link_map to failable
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_link_map() to failable. It still always succeeds and
returns 0 until the change happens. No functional change.
Amery Hung [Thu, 5 Feb 2026 22:29:00 +0000 (14:29 -0800)]
bpf: Convert bpf_selem_unlink_map to failable
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_unlink_map() to failable. It still always succeeds and
returns 0 for now.
Since some operations updating local storage cannot fail in the middle,
open-code bpf_selem_unlink_map() to take the b->lock before the
operation. There are two such locations:
- bpf_local_storage_alloc()
The first selem will be unlinked from smap if the cmpxchg of
owner_storage_ptr fails, and this unlink should not fail. Therefore,
hold b->lock from linking until the allocation completes. Helpers that
assume b->lock is held by the callers are introduced:
bpf_selem_link_map_nolock() and bpf_selem_unlink_map_nolock().
- bpf_local_storage_update()
The three step update process: link_map(new_selem),
link_storage(new_selem), and unlink_map(old_selem) should not fail in
the middle.
In bpf_selem_unlink(), bpf_selem_unlink_map() and
bpf_selem_unlink_storage() should either both succeed or fail as a whole
instead of failing in the middle. So, return early if unlink_map() fails.
Remove the selem_linked_to_map_lockless() check, as an selem in the
common paths (not bpf_local_storage_map_free() or
bpf_local_storage_destroy()) will be unlinked under b->lock and
local_storage->lock, and therefore no other thread can unlink the selem
from the map at the same time.
In bpf_local_storage_destroy(), ignore the return of
bpf_selem_unlink_map() for now. A later patch will allow
bpf_local_storage_destroy() to unlink selems even when failing to
acquire locks.
Note that while this patch removes all callers of selem_linked_to_map(),
a later patch that introduces bpf_selem_unlink_nofail() will use it
again.
Amery Hung [Thu, 5 Feb 2026 22:28:59 +0000 (14:28 -0800)]
bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage
A later bpf_local_storage refactor will acquire all locks before
performing any update. To reduce the number of locks that need to be
taken in bpf_local_storage_map_update(), determine the bucket based on
the local_storage an selem belongs to instead of the selem pointer.
Currently, when a new selem needs to be created to replace the old selem
in bpf_local_storage_map_update(), the locks of both buckets need to be
acquired to prevent racing. This can be simplified if the two selems
belong to the same bucket so that only one bucket needs to be locked.
Therefore, instead of hashing the selem pointer, hash the local_storage
pointer the selem belongs to.
Performance-wise, this is slightly better as an update now requires
locking only one bucket. It should not change the level of contention on
a bucket, as the pointers to the local storages that selems in a map
belong to are just as unique as the pointers to the selems themselves.
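In effect, the bucket selection helper changes along these lines (a
sketch patterned after the existing select_bucket()):

        static struct bpf_local_storage_map_bucket *
        select_bucket(struct bpf_local_storage_map *smap,
                      struct bpf_local_storage *local_storage)
        {
                /* was: &smap->buckets[hash_ptr(selem, smap->bucket_log)] */
                return &smap->buckets[hash_ptr(local_storage, smap->bucket_log)];
        }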
====================
Fix some corner cases in xskxceiver
While working on XDP and AF_XDP support for ixgbevf driver,
I came across two distinct problems that caused tests to fail
when they shouldn't have.
====================
Larysa Zaremba [Tue, 3 Feb 2026 15:50:58 +0000 (16:50 +0100)]
selftests/xsk: fix number of Tx frags in invalid packet
The issue occurs in TOO_MANY_FRAGS test case when xdp_zc_max_segs is set to
an odd number.
TOO_MANY_FRAGS test case contains an invalid packet consisting of
(xdp_zc_max_segs) frags. Every frag, even the last one, has the XDP_PKT_CONTD
flag set. This packet is expected to be dropped. After that, there is a
valid linear packet, which is expected to be received back.
When xdp_zc_max_segs is an odd number, the last packet cannot be
received if packet forwarding between the Rx and Tx interfaces relies on
the ethernet header, e.g. checks for ETH_P_LOOPBACK. If all traffic is
simply looped, the packet is malformed instead.
It turns out the sending function processes multiple invalid frags as if
they were in 2-frag packets. So once the invalid mbuf packet contains an
odd number of those, the valid packet that follows gets paired with the
previous invalid descriptor and hence does not get an ethernet header
generated, so it is either dropped or malformed.
Make invalid packets in verbatim mode always have only a single frag. For
such packets, number of frags is otherwise meaningless, as descriptor flags
are pre-configured in verbatim mode and packet data is not generated for
invalid descriptors.
Fixes: 697604492b64 ("selftests/xsk: add invalid descriptor test for multi-buffer") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20260203155103.2305816-3-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Larysa Zaremba [Tue, 3 Feb 2026 15:50:57 +0000 (16:50 +0100)]
selftests/xsk: properly handle batch ending in the middle of a packet
The referenced commit reduced the scope of the variable pkt, so now it
has to be reinitialized via pkt_stream_get_next_rx_pkt(), which also
increments some counters. When a packet is interrupted by the batch
ending, the pkt stream therefore proceeds to the next packet while the
xsk ring still contains the previous one, which results in a pkt_nb
mismatch.
Decrement the affected counters when a packet is interrupted.
Fixes: 8913e653e9b8 ("selftests/xsk: Iterate over all the sockets in the receive pkts function") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20260203155103.2305816-2-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf: Prevent reentrance into call_rcu_tasks_trace()
call_rcu_tasks_trace() is not safe from in_nmi() and not reentrant.
To prevent deadlock on raw_spin_lock_rcu_node(rtpcp) or memory corruption,
defer to irq_work when IRQs are disabled. call_rcu_tasks_generic()
protects itself with local_irq_save().
Note when bpf_async_cb->refcnt drops to zero it's safe to reuse
bpf_async_cb->worker for a different irq_work callback, since
bpf_async_schedule_op() -> irq_work_queue(&cb->worker);
is only called when refcnt >= 1.
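The shape of the fix is roughly the following (a sketch; the rcu_head
field name and the callback are assumptions drawn from neighboring
commits):

        if (irqs_disabled()) {
                /* call_rcu_tasks_trace() is not reentrant and not safe from
                 * in_nmi(), so hand the free-side work off via irq_work */
                irq_work_queue(&cb->worker);
        } else {
                call_rcu_tasks_trace(&cb->rcu, bpf_async_cb_rcu_tasks_trace_free);
        }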
KP Singh [Thu, 5 Feb 2026 07:07:55 +0000 (08:07 +0100)]
bpf: Require frozen map for calculating map hash
Currently, bpf_map_get_info_by_fd calculates and caches the hash of the
map regardless of the map's frozen state.
This leads to a TOCTOU bug where userspace can call
BPF_OBJ_GET_INFO_BY_FD to cache the hash and then modify the map
contents before freezing.
Therefore, a trusted loader can be tricked into verifying the stale hash
while loading the modified contents.
Fix this by returning -EPERM if the map is not frozen when the hash is
requested. This ensures the hash is only generated for the final,
immutable state of the map.
Fixes: ea2e6467ac36 ("bpf: Return hashes of maps in BPF_OBJ_GET_INFO_BY_FD") Reported-by: Toshi Piazza <toshi.piazza@microsoft.com> Signed-off-by: KP Singh <kpsingh@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260205070755.695776-1-kpsingh@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
KP Singh [Thu, 5 Feb 2026 06:38:07 +0000 (07:38 +0100)]
bpf: Limit bpf program signature size
Practical BPF signatures are significantly smaller than
KMALLOC_MAX_CACHE_SIZE.
Allowing larger sizes opens the door for abuse by passing excessive
size values and forcing the kernel into expensive allocation paths (via
kmalloc_large or vmalloc).
Fixes: 349271568303 ("bpf: Implement signature verification for BPF programs") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: KP Singh <kpsingh@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260205063807.690823-1-kpsingh@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
====================
Fix for bpf_wq retry loop during free
Small fix and improvement to ensure cancel_work() can handle the case
where wq callback is running, and doesn't lead to call_rcu_tasks_trace()
repeatedly after failing cancel_work, if wq callback is not pending.
====================
bpf: Reset prog callback in bpf_async_cancel_and_free()
Replace the prog and callback in bpf_async_cb after removing visibility
of bpf_async_cb in bpf_async_cancel_and_free() to increase the chance
that scheduled async callbacks short-circuit execution and exit early
without starting an RCU tasks trace section. This reduces the overall
time spent running the wq selftest.
bpf: Check for running wq callback when freeing bpf_async_cb
When freeing a bpf_async_cb in bpf_async_cb_rcu_tasks_trace_free(), if
the wq callback is not scheduled, cancel_work() currently returns false
and leads to a retry of the RCU tasks trace grace period. If the
callback is never scheduled, we keep retrying indefinitely and never put
the prog reference.
Since the only race we care about here is against a potentially running
wq callback in the first grace period, which should finish by the second
grace period, check the work_busy() result to detect a running wq
callback when the work is not pending; otherwise, free the object
immediately without retrying.
Reasoning behind the check and its correctness against a racing wq
callback invocation: cancel_work() is supposed to be synchronized, hence
calling it first and getting false means that the work is definitely not
pending. At this point, either the work is not scheduled at all, is
already running, or we race and it has already finished by the time we
check for it using work_busy(). In case it is running, we synchronize
using pool->lock to check the current work running there; if we match,
we extend the wait by another grace period using retry = true.
Otherwise, either the work already finished running or was never
scheduled, so we can free the bpf_async_cb right away.
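Put together, the free path looks roughly like this (a sketch; cb->work
and bpf_async_cb_free() are stand-in names, not necessarily the actual
identifiers):

        bool retry = false;

        if (!cancel_work(&cb->work)) {
                /* Not pending: the callback either never ran, has already
                 * finished, or is running right now.  Only the last case
                 * needs one more RCU tasks trace grace period.
                 */
                if (work_busy(&cb->work) & WORK_BUSY_RUNNING)
                        retry = true;
        }
        if (!retry)
                bpf_async_cb_free(cb);  /* stand-in for the final free */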
Fixes: 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260205003853.527571-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
V3: https://lore.kernel.org/all/20260203222643.994713-1-puranjay@kernel.org/
Changes in v3->v4:
- Add a call to reg_bounds_sync() in sync_linked_regs() to sync bounds after alu op.
- Add a __sink(path[0]); in the C selftest so compiler doesn't say "error: variable 'path' set but
not used"
V2: https://lore.kernel.org/all/20260113152529.3217648-1-puranjay@kernel.org/
Changes in v2->v3:
- Added another selftest showing a real usage pattern
- Rebased on bpf-next/master
v1: https://lore.kernel.org/bpf/20260107203941.1063754-1-puranjay@kernel.org/
Changes in v1->v2:
- Add support for alu32 operations in linked register tracking (Alexei)
- Squash the selftest fix with the first patch (Eduard)
- Add more selftests to detect edge cases
This series extends the BPF verifier's linked register tracking to handle
negative offsets, BPF_SUB operations, and alu32 operations, enabling better bounds propagation for
common arithmetic patterns.
The verifier previously only tracked positive constant deltas between linked
registers using BPF_ADD. This meant patterns using negative offsets or
subtraction couldn't benefit from bounds propagation:
void alu32_negative_offset(void)
{
        volatile char path[5];
        volatile int offset = bpf_get_prandom_u32();
        int off = offset;
FB Progs
File Program Verdict (A) Verdict (B) Verdict (DIFF) Insns (A) Insns (B) Insns (DIFF)
------------ ---------------- ----------- ----------- -------------- --------- --------- -----------------
bpf232.bpf.o layered_dump success success MATCH 1151 1218 +67 (+5.82%)
bpf257.bpf.o layered_runnable success success MATCH 5743 6143 +400 (+6.97%)
bpf252.bpf.o layered_runnable success success MATCH 5677 6075 +398 (+7.01%)
bpf227.bpf.o layered_dump success success MATCH 915 982 +67 (+7.32%)
bpf239.bpf.o layered_runnable success success MATCH 5459 5861 +402 (+7.36%)
bpf246.bpf.o layered_runnable success success MATCH 5562 6008 +446 (+8.02%)
bpf229.bpf.o layered_runnable success success MATCH 2559 3011 +452 (+17.66%)
bpf231.bpf.o layered_runnable success success MATCH 2559 3011 +452 (+17.66%)
bpf234.bpf.o layered_runnable success success MATCH 2549 3001 +452 (+17.73%)
bpf019.bpf.o do_sendmsg success success MATCH 124823 153523 +28700 (+22.99%)
bpf019.bpf.o do_parse success success MATCH 124809 153509 +28700 (+23.00%)
bpf227.bpf.o layered_runnable success success MATCH 1915 2356 +441 (+23.03%)
bpf228.bpf.o layered_runnable success success MATCH 1700 2152 +452 (+26.59%)
bpf232.bpf.o layered_runnable success success MATCH 1499 1951 +452 (+30.15%)
bpf312.bpf.o mount_exit success success MATCH 19253 62883 +43630 (+226.61%)
bpf312.bpf.o umount_exit success success MATCH 19253 62883 +43630 (+226.61%)
bpf311.bpf.o mount_exit success success MATCH 19226 62863 +43637 (+226.97%)
bpf311.bpf.o umount_exit success success MATCH 19226 62863 +43637 (+226.97%)
The above four programs have specific patterns that make the verifier explore a lot more states:
for (; depth < MAX_DIR_DEPTH; depth++) {
        const unsigned char* name = BPF_CORE_READ(dentry, d_name.name);
        if (offset >= MAX_PATH_LEN - MAX_DIR_LEN) {
                return depth;
        }
        int len = bpf_probe_read_kernel_str(&path[offset], MAX_DIR_LEN, name);
        offset += len;
        if (len == MAX_DIR_LEN) {
                if (offset - 2 < MAX_PATH_LEN) { // <---- (a)
                        path[offset - 2] = '.';
                }
                if (offset - 3 < MAX_PATH_LEN) { // <---- (b)
                        path[offset - 3] = '.';
                }
                if (offset - 4 < MAX_PATH_LEN) { // <---- (c)
                        path[offset - 4] = '.';
                }
        }
}
When at some depth == N false branches of conditions (a), (b) and (c) are scheduled for
verification, constraints for offset at depth == N+1 are:
1. offset >= MAX_PATH_LEN + 2
2. offset >= MAX_PATH_LEN + 3
3. offset >= MAX_PATH_LEN + 4 (visited before others)
And after offset += len it becomes:
1. offset >= MAX_PATH_LEN - 4093
2. offset >= MAX_PATH_LEN - 4092
3. offset >= MAX_PATH_LEN - 4091 (visited before others)
Because of the DFS states exploration logic, the states above are visited in order 3, 2, 1; 3 is not
a subset of 2 and 1 is not a subset of 2, so pruning logic does not kick in.
Previously this was not a problem, because the range for offset was not propagated through the
statements (a), (b), (c).
As the root cause of this regression is understood, this is not a blocker for this change.
Puranjay Mohan [Wed, 4 Feb 2026 15:17:37 +0000 (07:17 -0800)]
bpf: Support negative offsets, BPF_SUB, and alu32 for linked register tracking
Previously, the verifier only tracked positive constant deltas between
linked registers using BPF_ADD. This limitation meant patterns like:
r1 = r0;
r1 += -4;
if r1 s>= 0 goto l0_%=; // r1 >= 0 implies r0 >= 4
// verifier couldn't propagate bounds back to r0
if r0 != 0 goto l0_%=;
r0 /= 0; // Verifier thinks this is reachable
l0_%=:
A similar limitation exists for 32-bit registers.
With this change, the verifier can now track negative deltas in reg->off,
enabling bounds propagation for the above pattern.
For alu32, we make sure the destination register has the upper 32 bits
as 0s before creating the link. BPF_ADD_CONST is split into
BPF_ADD_CONST64 and BPF_ADD_CONST32; the latter is used in the alu32
case, and sync_linked_regs() uses it to zero-extend the result if
known_reg has this flag.
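In C, the kind of pattern this enables looks roughly like the following
(a hypothetical fragment, not the actual selftest):

        char path[4];
        unsigned int off = bpf_get_prandom_u32() & 0xff; /* off in [0, 255] */
        int tmp = off;                                   /* tmp linked to off */

        tmp -= 4;                /* negative constant delta, alu32 */
        if (tmp >= 0)
                return 0;
        /* here tmp < 0, and through the link the verifier learns off < 4 */
        path[off] = '\0';        /* provably within path[4] */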
====================
bpf: Add bitwise tracking for BPF_END
Add bitwise tracking (tnum analysis) for BPF_END (`bswap(16|32|64)`,
`be(16|32|64)`, `le(16|32|64)`) operations. Please see commit log of
1/2 for more details.
v3:
- Resend to fix a version control error in v2.
- The rest of the changes are identical to v2.
v2 (incorrect): https://lore.kernel.org/bpf/20260204091146.52447-1-ziye@zju.edu.cn/
- Refactored selftests using BSWAP_RANGE_TEST macro to eliminate code
duplication and improve maintainability. (Eduard)
- Simplified test names. (Eduard)
- Reduced excessive comments in test cases. (Eduard)
- Added more comments to explain BPF_END's special handling of zext_32_to_64.
Tianci Cao [Wed, 4 Feb 2026 11:15:03 +0000 (19:15 +0800)]
selftests/bpf: Add tests for BPF_END bitwise tracking
Now BPF_END has bitwise tracking support. This patch adds selftests to
cover various cases of BPF_END (`bswap(16|32|64)`, `be(16|32|64)`,
`le(16|32|64)`) with bitwise propagation.
This patch is based on the existing `verifier_bswap.c`, and adds several
types of new tests:
1. Unconditional byte swap operations:
- bswap16/bswap32/bswap64 with unknown bytes
2. Endian conversion operations (architecture-aware):
- be16/be32/be64: convert to big-endian
* on little-endian: do swap
* on big-endian: truncation (16/32-bit) or no-op (64-bit)
- le16/le32/le64: convert to little-endian
* on big-endian: do swap
* on little-endian: truncation (16/32-bit) or no-op (64-bit)
Each test simulates realistic networking scenarios where a value is
masked with unknown bits (e.g., var_off=(0x0; 0x3f00), range=[0,0x3f00]),
then byte-swapped, and the verifier must prove the result stays within
expected bounds.
Specifically, these selftests are based on dead code elimination:
If the BPF verifier can precisely track bits through byte swap
operations, it can prune the trap path (invalid memory access) that
should be unreachable, allowing the program to pass verification.
If bitwise tracking is incorrect, the verifier cannot prove the trap
is unreachable, causing verification failure.
The tests use preprocessor conditionals (#ifdef __BYTE_ORDER__) to
verify correct behavior on both little-endian and big-endian
architectures, and require Clang 18+ for bswap instruction support.
Tianci Cao [Wed, 4 Feb 2026 11:15:02 +0000 (19:15 +0800)]
bpf: Add bitwise tracking for BPF_END
This patch implements bitwise tracking (tnum analysis) for the BPF_END
(byte swap) operations.
Currently, the BPF verifier does not track values for BPF_END operations,
treating the result as completely unknown. This limits the verifier's
ability to prove safety of programs that perform endianness conversions,
which are common in networking code.
For example, the following code pattern for port number validation:
int test(struct pt_regs *ctx) {
        __u64 x = bpf_get_prandom_u32();
        x &= 0x3f00;    // Range: [0, 0x3f00], var_off: (0x0; 0x3f00)
        x = bswap16(x); // Should swap to range [0, 0x3f], var_off: (0x0; 0x3f)
        if (x > 0x3f) goto trap;
        return 0;
trap:
        return *(u64 *)NULL; // Should be unreachable
}
Without this patch, even though the verifier knows `x` has certain bits
set, after bswap16, it loses all tracking information and treats port
as having a completely unknown value [0, 65535].
According to the BPF instruction set[1], there are 3 kinds of BPF_END:
1. `bswap(16|32|64)`: opcode=0xd7 (BPF_END | BPF_ALU64 | BPF_TO_LE)
- do unconditional swap
2. `le(16|32|64)`: opcode=0xd4 (BPF_END | BPF_ALU | BPF_TO_LE)
- on big-endian: do swap
- on little-endian: truncation (16/32-bit) or no-op (64-bit)
3. `be(16|32|64)`: opcode=0xdc (BPF_END | BPF_ALU | BPF_TO_BE)
- on little-endian: do swap
- on big-endian: truncation (16/32-bit) or no-op (64-bit)
Since BPF_END operations are inherently bit-wise permutations, tnum
(bitwise tracking) offers the most efficient and precise mechanism
for value analysis. By implementing `tnum_bswap16`, `tnum_bswap32`,
and `tnum_bswap64`, we can derive exact `var_off` values concisely,
directly reflecting the bit-level changes.
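As an illustration, the 16-bit case can be as small as the following
(a sketch; the actual implementation may differ in details):

        struct tnum tnum_bswap16(struct tnum a)
        {
                /* A byte swap is a pure bit permutation, so swapping the
                 * known value and the unknown-bits mask independently is
                 * exact; the result is then truncated to 16 bits.
                 */
                return TNUM(swab16((u16)a.value), swab16((u16)a.mask));
        }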
Here is the overview of changes:
1. In `tnum_bswap(16|32|64)` (kernel/bpf/tnum.c):
Call `swab(16|32|64)` function on the value and mask of `var_off`, and
do truncation for 16/32-bit cases.
2. In `adjust_scalar_min_max_vals` (kernel/bpf/verifier.c):
Call helper function `scalar_byte_swap`.
- Only do byte swap when
* alu64 (unconditional swap) OR
* switching between big-endian and little-endian machines.
- If a byte swap is needed:
* Firstly call `tnum_bswap(16|32|64)` to update `var_off`.
* Then reset the bound since byte swap scrambles the range.
- For 16/32-bit cases, truncate dst register to match the swapped size.
This enables better verification of networking code that frequently uses
byte swaps for protocol processing, reducing false positive rejections.
bpf: Add a recursion check to prevent loops in bpf_timer
Do not schedule timer/wq operation on a cpu that is in irq_work
callback that is processing async_cmds queue.
Otherwise the following loop is possible:
bpf_timer_start() -> bpf_async_schedule_op() -> irq_work_queue().
irqrestore -> bpf_async_irq_worker() -> tracepoint -> bpf_timer_start().
bpf: Tighten conditions when timer/wq can be called synchronously
Though hrtimer_start/cancel() inline all of the smaller helpers in
hrtimer.c and only call timerqueue_add/del() from lib/timerqueue.c, where
everything is neither traceable nor kprobe-able (because all files in
lib/ are not traceable), there are tracepoints within hrtimer that are
called with locks held. Therefore, prevent the deadlock by tightening the
conditions under which timer/wq operations can be called synchronously.
hrtimer/wq are using raw_spin_lock_irqsave(), so irqs_disabled() is enough.
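As a sketch (helper and variable names here are assumptions based on the
surrounding commits, not the actual code):

        if (irqs_disabled())
                /* hrtimer/wq internals take raw_spin_lock_irqsave() and
                 * contain tracepoints, so defer the operation to irq_work */
                bpf_async_schedule_op(cb, op);
        else
                hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);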
bpf: Use sk_is_inet() and sk_is_unix() in __cgroup_bpf_run_filter_sock_addr().
sk->sk_family should be read with READ_ONCE() in
__cgroup_bpf_run_filter_sock_addr() due to IPV6_ADDRFORM.
Also, the comment there is a bit stale since commit 859051dd165e
("bpf: Implement cgroup sockaddr hooks for unix sockets"), and the
kdoc has the same comment.
Let's use sk_is_inet() and sk_is_unix() and remove the comment.
====================
bpf: Avoid locks in bpf_timer and bpf_wq
From: Alexei Starovoitov <ast@kernel.org>
This series reworks implementation of BPF timer and workqueue APIs to
make them usable from any context.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Changes in v9:
- Different approach for patches 1 and 3:
- s/EBUSY/ENOENT/ when refcnt==0 to match existing
- drop latch, use refcnt and kmalloc_nolock() instead
- address race between timer/wq_start and delete_elem, add a test
- Link to v8: https://lore.kernel.org/bpf/20260127-timer_nolock-v8-0-5a29a9571059@meta.com/
Changes in v8:
- Return -EBUSY in bpf_async_read_op() if last_seq is failed to be set
- In bpf_async_cancel_and_free() drop bpf_async_cb ref after calling bpf_async_process()
- Link to v7: https://lore.kernel.org/r/20260122-timer_nolock-v7-0-04a45c55c2e2@meta.com
Changes in v7:
- Addressed Andrii's review points from the previous version - nothing
very significant.
- Added NMI stress tests for bpf_timer - hit few verifier failing checks
and removed them.
- Address sparse warning in the bpf_async_update_prog_callback()
- Link to v6: https://lore.kernel.org/r/20260120-timer_nolock-v6-0-670ffdd787b4@meta.com
Changes in v6:
- Reworked destruction and refcnt use:
- On cancel_and_free() set last_seq to BPF_ASYNC_DESTROY value, drop
map's reference
- In irq work callback, atomically switch DESTROY to DESTROYED, cancel
timer/wq
- Free bpf_async_cb on refcnt going to 0.
- Link to v5: https://lore.kernel.org/r/20260115-timer_nolock-v5-0-15e3aef2703d@meta.com
Changes in v5:
- Extracted lock-free algorithm for updating cb->prog and
cb->callback_fn into a function bpf_async_update_prog_callback(),
added a new commit and introduces this function and uses it in
__bpf_async_set_callback(), bpf_timer_cancel() and
bpf_async_cancel_and_free().
This allows to move the change into the separate commit without breaking
correctness.
- Handle NULL prog in bpf_async_update_prog_callback().
- Link to v4: https://lore.kernel.org/r/20260114-timer_nolock-v4-0-fa6355f51fa7@meta.com
Changes in v4:
- Handle irq_work_queue failures in both schedule and cancel_and_free
paths: introduced bpf_async_refcnt_dec_cleanup() that decrements refcnt
and makes sure if last reference is put, there is at least one irq_work
scheduled to execute final cleanup.
- Additional refcnt inc/dec in set_callback() + rcu lock to make sure
cleanup is not running at the same time as set_callback().
- Added READ_ONCE where it was needed.
- Squash 'bpf: Refactor __bpf_async_set_callback()' commit into 'bpf:
Add lock-free cell for NMI-safe
async operations'
- Removed mpmc_cell, use seqcount_latch_t instead.
- Link to v3: https://lore.kernel.org/r/20260107-timer_nolock-v3-0-740d3ec3e5f9@meta.com
Changes in v3:
- Major rework
- Introduce mpmc_cell, allowing concurrent writes and reads
- Implement irq_work deferring
- Adding selftests
- Introduces bpf_timer_cancel_async kfunc
- Link to v2: https://lore.kernel.org/r/20251105-timer_nolock-v2-0-32698db08bfa@meta.com
Changes in v2:
- Move refcnt initialization and put (from cancel_and_free())
from patch 5 into the patch 4, so that patch 4 has more clear and full
implementation and use of refcnt
- Link to v1: https://lore.kernel.org/r/20251031-timer_nolock-v1-0-b064ae403bfb@meta.com
====================
Mykyta Yatsenko [Sun, 1 Feb 2026 02:54:01 +0000 (18:54 -0800)]
selftests/bpf: Add timer stress test in NMI context
Add stress tests for BPF timers that run in NMI context using perf_event
programs attached to PERF_COUNT_HW_CPU_CYCLES.
The tests cover three scenarios:
- nmi_race: Tests concurrent timer start and async cancel operations
- nmi_update: Tests updating a map element (effectively deleting and
inserting new for array map) from within a timer callback
- nmi_cancel: Tests timer self-cancellation attempt.
A common test_common() helper is used to share timer setup logic across
all test modes.
The tests spawn multiple threads in a child process to generate
perf events, which trigger the BPF programs in NMI context. Hit counters
verify that the NMI code paths were actually exercised.
Refactor bpf_timer and bpf_wq to allow calling them from any context:
- add refcnt to bpf_async_cb
- map_delete_elem or map_free will drop refcnt to zero
via bpf_async_cancel_and_free()
- once refcnt is zero timer/wq_start is not allowed to make sure
that callback cannot rearm itself
- if in_hardirq, defer start/cancel operations to irq_work
Add a new bpf_stream_print_stack kfunc for printing a BPF program stack
into a BPF stream. Update the verifier to allow the new kfunc to be
called with BPF spinlocks held, along with bpf_stream_vprintk.
Patchset spun out of the larger libarena + ASAN patchset.
(https://lore.kernel.org/bpf/20260127181610.86376-1-emil@etsalapatis.com/)
Changeset:
- Update bpf_stream_print_stack to take stream_id arg (Kumar)
- Added selftest for the bpf_stream_print_stack
- Add selftest for calling the streams kfuncs under lock
v2->v1: (https://lore.kernel.org/bpf/20260202193311.446717-1-emil@etsalapatis.com/)
- Updated Signed-off-by to be consistent with past submissions
- Updated From email to be consistent with Signed-off-by
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
====================
Emil Tsalapatis [Tue, 3 Feb 2026 18:04:23 +0000 (13:04 -0500)]
bpf: Allow BPF stream kfuncs while holding a lock
The BPF stream kfuncs bpf_stream_vprintk and bpf_stream_print_stack
do not sleep and so are safe to call while holding a lock. Amend
the verifier to allow that.
Add a new kfunc called bpf_stream_print_stack to be used by programs
that need to print out their current BPF stack. The kfunc is essentially
a wrapper around the existing bpf_stream_dump_stack functionality used
to generate stack traces for error events like may_goto violations and
BPF-side arena page faults.
====================
bpf: Improve state pruning for scalar registers
V2: https://lore.kernel.org/all/20260203022229.1630849-1-puranjay@kernel.org/
Changes in V3:
- Fix spelling mistakes in commit logs (AI)
- Fix an incorrect comment in the selftest added in patch 5 (AI)
- Improve the title of patch 5
V1: https://lore.kernel.org/all/20260202104414.3103323-1-puranjay@kernel.org/
Changes in V2:
- Collected acked by Eduard
- Removed some unnecessary comments
- Added a selftest for id=0 equivalence in Patch 5
This series improves BPF verifier state pruning by relaxing scalar ID
equivalence requirements. Scalar register IDs are used to track
relationships between registers for bounds propagation. However, once
an ID becomes "singular" (only one register/stack slot carries it), it
can no longer participate in bounds propagation and becomes stale.
These stale IDs can prevent pruning of otherwise equivalent states.
The series addresses this in four patches:
Patch 1: Assign IDs on stack fills to ensure stack slots have IDs
before being read into registers, preparing for the singular ID
clearing in patch 2.
Patch 2: Clear IDs that appear only once before caching, as they cannot
contribute to bounds propagation.
Patch 3: Relax maybe_widen_reg() to only compare value-tracking fields
(bounds, tnum, var_off) rather than also requiring ID matches. Two
scalars with identical value constraints but different IDs represent
the same abstract value and don't need widening.
Patch 4: Relax scalar ID equivalence in state comparison by treating
rold->id == 0 as "independent". If the old state didn't rely on ID
relationships for a register, any linking in the current state only
adds constraints and is safe to accept for pruning.
Patch 5: Add a selftest to show the exact case being handled by Patch 4
I ran veristat on BPF programs from sched_ext, meta's internal programs,
and on selftest programs, showing programs with insn diff > 5%:
Puranjay Mohan [Tue, 3 Feb 2026 16:51:00 +0000 (08:51 -0800)]
bpf: Relax scalar id equivalence for state pruning
Scalar register IDs are used by the verifier to track relationships
between registers and enable bounds propagation across those
relationships. Once an ID becomes singular (i.e. only a single
register/stack slot carries it), it can no longer contribute to bounds
propagation and effectively becomes stale. The previous commit makes the
verifier clear such ids before caching the state.
When comparing the current and cached states for pruning, these stale
IDs can cause technically equivalent states to be considered different
and thus prevent pruning.
For example, in the selftest added in the next commit, two registers,
r6 and r7, are not linked to any other registers and get cached with
id=0; in the current state, they are both linked to each other with
id=A. Before this commit, check_scalar_ids() would give temporary ids to
r6 and r7 (say tid1 and tid2), and then check_ids() would map tid1->A,
and when it saw tid2->A, it would not consider these states equivalent.
Relax scalar ID equivalence by treating rold->id == 0 as "independent":
if the old state did not rely on any ID relationships for a register,
then any ID/linking present in the current state only adds constraints
and is always safe to accept for pruning. Implement this by returning
true immediately in check_scalar_ids() when old_id == 0.
Maintain correctness for the opposite direction (old_id != 0 && cur_id
== 0) by still allocating a temporary ID for cur_id == 0. This avoids
incorrectly allowing multiple independent current registers (id==0) to
satisfy a single linked old ID during mapping.
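A sketch of the relaxed comparison (the pre-existing temporary-ID
allocation is kept for the cur_id == 0 direction):

        static bool check_scalar_ids(u32 old_id, u32 cur_id, struct bpf_idmap *idmap)
        {
                /* rold->id == 0: the old state did not rely on any linking,
                 * so whatever linking the current state has only adds
                 * constraints and is safe for pruning.
                 */
                if (!old_id)
                        return true;

                cur_id = cur_id ? cur_id : ++idmap->tmp_id_gen;
                return check_ids(old_id, cur_id, idmap);
        }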
Puranjay Mohan [Tue, 3 Feb 2026 16:50:59 +0000 (08:50 -0800)]
bpf: Relax maybe_widen_reg() constraints
The maybe_widen_reg() function widens imprecise scalar registers to
unknown when their values differ between the cached and current states.
Previously, it used regs_exact() which also compared register IDs via
check_ids(), requiring registers to have matching IDs (or mapped IDs) to
be considered exact.
For scalar widening purposes, what matters is whether the value tracking
(bounds, tnum, var_off) is the same, not whether the IDs match. Two
scalars with identical value constraints but different IDs represent the
same abstract value and don't need to be widened.
Introduce scalars_exact_for_widen() that only compares the
value-tracking portion of bpf_reg_state (fields before 'id'). This
allows the verifier to preserve more scalar value information during
state merging when IDs differ but actual tracked values are identical,
reducing unnecessary widening and potentially improving verification
precision.
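A sketch of the new helper, mirroring how regs_exact() compares
registers minus the ID check:

        /* compare only the value-tracking fields, laid out before 'id'
         * in struct bpf_reg_state */
        static bool scalars_exact_for_widen(const struct bpf_reg_state *rold,
                                            const struct bpf_reg_state *rcur)
        {
                return memcmp(rold, rcur, offsetof(struct bpf_reg_state, id)) == 0;
        }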
Puranjay Mohan [Tue, 3 Feb 2026 16:50:58 +0000 (08:50 -0800)]
bpf: Clear singular ids for scalars in is_state_visited()
The verifier assigns ids to scalar registers/stack slots when they are
linked through a mov or stack spill/fill instruction. These ids are
later used to propagate newly found bounds from one register to all
registers that share the same id. The verifier also compares the ids of
these registers in current state and cached state when making pruning
decisions.
When an ID becomes singular (i.e., only a single register or stack slot
has that ID), it can no longer participate in bounds propagation. During
comparisons between current and cached states for pruning decisions,
however, such stale IDs can prevent pruning of otherwise equivalent
states.
Find and clear all singular ids before caching a state in
is_state_visited(). struct bpf_idset, which is currently unused, has been
repurposed for this use case.
Puranjay Mohan [Tue, 3 Feb 2026 16:50:57 +0000 (08:50 -0800)]
bpf: Let the verifier assign ids on stack fills
The next commit will allow clearing of scalar ids if no other
register/stack slot has that id. This is because if only one register
has a unique id, it can't participate in bounds propagation and is
equivalent to having no id.
But if the id of a stack slot is cleared by clear_singular_ids() in the
next commit, reading that stack slot into a register will not establish
a link because the stack slot's id is cleared.
This can happen in a situation where a register is spilled and later
loses its id due to a multiply operation (for example) and then the
stack slot's id becomes singular and can be cleared.
Make sure that scalar stack slots have an id before we read them into a
register.
Jiri Olsa [Mon, 2 Feb 2026 07:58:49 +0000 (08:58 +0100)]
ftrace: Fix direct_functions leak in update_ftrace_direct_del
Alexei reported memory leak in update_ftrace_direct_del.
We miss cleanup of the replaced direct_functions in the
success path in update_ftrace_direct_del, adding that.
Changes:
v4 -> v5:
* Address comment from Alexei:
* Rename helper bpf_link_prog_session_cookie() to
bpf_prog_calls_session_cookie().
* v4: https://lore.kernel.org/bpf/20260129154953.66915-1-leon.hwang@linux.dev/
v3 -> v4:
* Add a log when !bpf_jit_supports_fsession() in patch #1 (per AI).
* v3: https://lore.kernel.org/bpf/20260129142536.48637-1-leon.hwang@linux.dev/
v2 -> v3:
* Fix typo in subject and patch message of patch #1 (per AI and Chris).
* Collect Acked-by, and Tested-by from Puranjay, thanks.
* v2: https://lore.kernel.org/bpf/20260128150112.8873-1-leon.hwang@linux.dev/
Leon Hwang [Sat, 31 Jan 2026 14:49:49 +0000 (22:49 +0800)]
bpf, arm64: Add fsession support
Implement fsession support in the arm64 BPF JIT trampoline.
Extend the trampoline stack layout to store function metadata and
session cookies, and pass the appropriate metadata to fentry and
fexit programs. This mirrors the existing x86 behavior and enables
session cookies on arm64.
Paul Chaignon [Sat, 31 Jan 2026 16:08:37 +0000 (17:08 +0100)]
bpf: Fix bpf_xdp_store_bytes proto for read-only arg
While making some maps in Cilium read-only from the BPF side, we noticed
that the bpf_xdp_store_bytes proto is incorrect. In particular, the
verifier was throwing the following error:
nat comes from a BPF_F_RDONLY_PROG map, so R3 is a PTR_TO_MAP_VALUE.
The verifier checks the helper's memory access to R3 in
check_mem_size_reg, as it reaches ARG_CONST_SIZE argument. The third
argument has expected type ARG_PTR_TO_UNINIT_MEM, which includes the
MEM_WRITE flag. The verifier thus checks for a BPF_WRITE access on R3.
Given R3 points to a read-only map, the check fails.
Conversely, ARG_PTR_TO_UNINIT_MEM can also lead to the helper reading
from uninitialized memory.
This patch simply fixes the expected argument type to match that of
bpf_skb_store_bytes.
====================
The BPF verifier validates pointers to special map fields (timers,
workqueues, task_work) through separate functions that share nearly
identical logic. This creates code duplication because of the
inconsistent data structure layout in struct bpf_call_arg_meta and
struct bpf_kfunc_call_arg_meta.
This series contains 2 commits:
1. Introduces struct bpf_map_desc to provide a unified representation
for map pointer and uid tracking. Previously, bpf_call_arg_meta used
separate map_ptr and map_uid fields while bpf_kfunc_call_arg_meta used an
anonymous inline struct. This inconsistency made it harder to share
validation code between the two paths.
2. Consolidates the validation logic for BPF_TIMER, BPF_WORKQUEUE, and
BPF_TASK_WORK field types into a single check_map_field_pointer()
function. This eliminates process_wq_func() and process_task_work_func()
entirely, and simplifies process_timer_func() to just the PREEMPT_RT
check before calling the unified validation. The result is fewer
lines of code with clearer structure for future maintenance.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Changes in v2:
- Added Signed-off-by to the top commit.
- Link to v1: https://lore.kernel.org/r/20260129-verif_special_fields-v1-0-d310b7f146c8@meta.com
====================
Mykyta Yatsenko [Fri, 30 Jan 2026 20:42:10 +0000 (20:42 +0000)]
bpf: Introduce struct bpf_map_desc in verifier
Introduce struct bpf_map_desc to hold bpf_map pointer and map uid. Use
this struct in both bpf_call_arg_meta and bpf_kfunc_call_arg_meta
instead of having different representations:
- bpf_call_arg_meta had separate map_ptr and map_uid fields
- bpf_kfunc_call_arg_meta had an anonymous inline struct
This unifies the map fields layout across both metadata structures,
making the code more consistent and preparing for further refactoring of
map field pointer validation.
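The new struct is essentially just the pair described above (a sketch;
the field names are assumptions):

        struct bpf_map_desc {
                struct bpf_map *ptr;  /* the map the argument points into */
                u32 uid;              /* map_uid, distinguishes inner maps */
        };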
====================
x86/fgraph,bpf: Fix ORC stack unwind from kprobe_multi
hi,
Mahe reported missing function from stack trace on top of kprobe multi
program. It turned out the latest fix [1] needs some more fixing.
v2 changes:
- keep the unwind same as for kprobes, attached function
is part of entry probe stacktrace, not kretprobe [Steven]
- several change in trigger bench [Andrii]
- added selftests for standard kprobes and fentry/fexit probes [Andrii]
Note I'll try to add similar stacktrace adjustment for fentry/fexit
in separate patchset to not complicate this change.
Jiri Olsa [Mon, 26 Jan 2026 21:18:33 +0000 (22:18 +0100)]
x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs path
Mahe reported missing function from stack trace on top of kprobe
multi program. The missing function is the very first one in the
stacktrace, the one that the bpf program is attached to.
The reason is that the previous change (the Fixes commit) fixed
stack unwind for tracepoint, but removed attached function address
from the stack trace on top of kprobe multi programs, which I also
overlooked in the related test (check following patch).
The tracepoint and kprobe_multi have different stack setup, but use
same unwind path. I think it's better to keep the previous change,
which fixed tracepoint unwind and instead change the kprobe multi
unwind as explained below.
The bpf program stack unwind calls perf_callchain_kernel for the kernel
portion and it follows two unwind paths based on the X86_EFLAGS_FIXED
bit in pt_regs.flags.
When the bit is set, we unwind from the stack represented by the pt_regs
argument; otherwise we unwind the currently executed stack up to the
'first_frame' boundary.
The 'first_frame' value is taken from regs.rsp value, but ftrace_caller
and ftrace_regs_caller (ftrace trampoline) functions set the regs.rsp
to the previous stack frame, so we skip the attached function entry.
If we switch kprobe_multi unwind to use the X86_EFLAGS_FIXED bit,
we set the start of the unwind to the attached function address.
As another benefit we also cut extra unwind cycles needed to reach
the 'first_frame' boundary.
The speedup can be measured with trigger bench for kprobe_multi
program and stacktrace support.
The return probe skips the attached function, because it's no longer
on the stack at the point of the unwind and this way is the same how
standard kretprobe works.
Jiri Olsa [Mon, 26 Jan 2026 21:18:32 +0000 (22:18 +0100)]
x86/fgraph: Fix return_to_handler regs.rsp value
The previous change (the Fixes commit) messed up the rsp register value:
it is already adjusted by FRAME_SIZE, but we need the original rsp value.
This change does not affect the current fprobe kernel unwind, i.e. the
!perf_hw_regs path in perf_callchain_kernel:
	if (perf_hw_regs(regs)) {
		if (perf_callchain_store(entry, regs->ip))
			return;
		unwind_start(&state, current, regs, NULL);
	} else {
		unwind_start(&state, current, NULL, (void *)regs->sp);
	}
which uses pt_regs.sp as the first_frame boundary (the FRAME_SIZE shift
makes no difference, the unwind still stops at the right frame).
This change fixes the other path, where we want to unwind directly from
the pt_regs sp/fp/ip state, which is coming in the following change.
Fixes: 20a0bc10272f ("x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/bpf/20260126211837.472802-2-jolsa@kernel.org
Changwoo Min [Fri, 30 Jan 2026 02:18:43 +0000 (11:18 +0900)]
selftests/bpf: Make bpf get_preempt_count() work for v6.14+ kernels
Recent x86 kernels export __preempt_count as a ksym, while some old kernels
between v6.1 and v6.14 expose the preemption counter via
pcpu_hot.preempt_count. The existing selftest helper unconditionally
dereferenced __preempt_count, which breaks BPF program loading on such old
kernels.
Make the x86 preemption count lookup version-agnostic by:
- Marking __preempt_count and pcpu_hot as weak ksyms.
- Introducing a BTF-described pcpu_hot___local layout with
preserve_access_index.
- Selecting the appropriate access path at runtime using ksym availability
and bpf_ksym_exists() and bpf_core_field_exists().
This allows a single BPF binary to run correctly across kernel versions
(e.g., v6.18 vs. v6.13) without relying on compile-time version checks.
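For reference, a version-agnostic helper can look roughly like the sketch
below (not the selftest verbatim; the flavor struct and access paths just
follow the description above):

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* CO-RE "flavor" of the old layout, matched against the kernel's
 * struct pcpu_hot when it exists. */
struct pcpu_hot___local {
	int preempt_count;
} __attribute__((preserve_access_index));

extern int __preempt_count __ksym __weak;
extern struct pcpu_hot___local pcpu_hot __ksym __weak;

static __always_inline int get_preempt_count(void)
{
	/* Newer kernels: the per-CPU __preempt_count ksym is available. */
	if (bpf_ksym_exists(&__preempt_count))
		return *(int *)bpf_this_cpu_ptr(&__preempt_count);

	/* Older kernels: read pcpu_hot.preempt_count via CO-RE. */
	if (bpf_core_field_exists(struct pcpu_hot___local, preempt_count)) {
		struct pcpu_hot___local *hot = bpf_this_cpu_ptr(&pcpu_hot);

		return hot->preempt_count;
	}

	return 0;
}
```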
====================
This patchset allows sleepable programs to use tail calls.
At the moment we need a separate sleepable uprobe program
to retrieve user space data and pass it to a complex program with
tail calls. It'd be great if the program with tail calls could
be sleepable and do the data retrieval directly.
Jiri Olsa [Fri, 30 Jan 2026 08:12:08 +0000 (09:12 +0100)]
selftests/bpf: Add test for sleepable program tailcalls
Adding test that makes sure we can't mix sleepable and non-sleepable
bpf programs in the BPF_MAP_TYPE_PROG_ARRAY map and that we can do
tail call in the sleepable program.
Jiri Olsa [Fri, 30 Jan 2026 08:12:07 +0000 (09:12 +0100)]
bpf: Allow sleepable programs to use tail calls
Allowing sleepable programs to use tail calls.
Making sure we can't mix sleepable and non-sleepable bpf programs
in tail call map (BPF_MAP_TYPE_PROG_ARRAY) and allowing it to be
used in sleepable programs.
Sleepable programs can be preempted and sleep, which might bring a
new source of race conditions, but both direct and indirect tail
calls should not be affected.
Direct tail calls work by patching a direct jump to the callee into the
bpf caller program, so no problem there. We atomically switch from a nop
to a jump instruction.
Indirect tail call reads the callee from the map and then jumps to
it. The callee bpf program can't disappear (be released) from the
caller, because it is executed under rcu lock (rcu_read_lock_trace).
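A minimal sketch of what this enables (attach target and program body are
placeholders, not taken from the patchset):

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

char buf[64];
const volatile __u64 user_ptr;	/* set by the loader before attach */

/* Sleepable uprobe: may fault while copying user memory, then tail-calls
 * into a prog array that holds only sleepable programs. */
SEC("uprobe.s")
int entry(struct pt_regs *ctx)
{
	bpf_copy_from_user(buf, sizeof(buf), (void *)user_ptr);
	bpf_tail_call(ctx, &jmp_table, 0);
	return 0;
}

char _license[] SEC("license") = "GPL";
```

Filling jmp_table with a non-sleepable program (or the other way around)
is expected to be rejected, per the mixing check described above.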
====================
bpf: Fix verifier_bug_if to account for BPF_CALL
This fixes the verifier_bug_if() that runs on nospec_result to not trigger
for BPF_CALL (bug reported by Hu, Mei, and Mu). See patch 1 for a full
description and patch 2 for a test (based on the PoC from the report).
While working on this I noticed two other problems:
- nospec_result is currently ignored for BPF_CALL during patching, but it
may be required if we assume the CPU may speculate into/out of functions.
- Both the instruction patching for nospec and nospec_result erase the
  instruction aux information even though it might be better to keep it.
  For nospec_result it may be fine, as it is currently only applied to store
  instructions (except for when we decide to change the point above), but
  nospec may be set for arbitrary instructions, and if these require
  rewrites they break.
I assume these issues are better fixed separately, thus I decided to
exclude them from this series.
====================
Luis Gerhorst [Tue, 27 Jan 2026 11:59:11 +0000 (12:59 +0100)]
bpf: Fix verifier_bug_if to account for BPF_CALL
The BPF verifier assumes `insn_aux->nospec_result` is only set for
direct memory writes (e.g., `*(u32*)(r1+off) = r2`). However, the
assertion fails to account for helper calls (e.g.,
`bpf_skb_load_bytes_relative`) that perform writes to stack memory. Make
the check more precise to resolve this.
The problem is that `BPF_CALL` instructions have `BPF_CLASS(insn->code)
== BPF_JMP`, which triggers the warning check:
- Helpers like `bpf_skb_load_bytes_relative` write to stack memory
- `check_helper_call()` loops through `meta.access_size`, calling
`check_mem_access(..., BPF_WRITE)`
- `check_stack_write()` sets `insn_aux->nospec_result = 1`
- Since `BPF_CALL` is encoded as `BPF_JMP | BPF_CALL`, the warning fires
Execution flow:
```
1. Drop capabilities → Enable Spectre mitigation
2. Load BPF program
└─> do_check()
├─> check_cond_jmp_op() → Marks dead branch as speculative
│ └─> push_stack(..., speculative=true)
├─> pop_stack() → state->speculative = 1
├─> check_helper_call() → Processes helper in dead branch
│ └─> check_mem_access(..., BPF_WRITE)
│ └─> insn_aux->nospec_result = 1
└─> Checks: state->speculative && insn_aux->nospec_result
└─> BPF_CLASS(insn->code) == BPF_JMP → WARNING
```
To fix the assert, it would be nice to be able to reuse
bpf_insn_successors() here, but bpf_insn_successors()->cnt is not
exactly what we want as it may also be 1 for BPF_JA. Instead, we could
check opcode_info.can_jump, but then we would have to share the table
between the functions. This would mean moving the table out of the
function and adding bpf_opcode_info(). As the verifier_bug_if() only
runs for insns with nospec_result set, the impact on verification time
would likely still be negligible. However, I assume sharing
bpf_opcode_info() between liveness.c and verifier.c will not be worth
it. It seems only adjust_jmp_off() could also be simplified using it,
and there imm/off is touched as well. Thus it is maybe better to rely on
exact opcode/class matching there.
Therefore, to avoid this sharing only for a verifier_bug_if(), just
check the opcode. This should now cover all opcodes for which can_jump
in bpf_insn_successors() is true.
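A hedged sketch of that opcode check, as one could write it (illustration
only, not the actual hunk):

```c
/* Illustration: a helper call (BPF_JMP | BPF_CALL) may legitimately set
 * nospec_result via its stack writes, so only genuine jump opcodes
 * should trip the assertion. */
static bool insn_is_jump(const struct bpf_insn *insn)
{
	u8 class = BPF_CLASS(insn->code);

	if (class != BPF_JMP && class != BPF_JMP32)
		return false;

	return BPF_OP(insn->code) != BPF_CALL;
}
```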
Parts of the description and example are taken from the bug report.
Fixes: dadb59104c64 ("bpf: Fix aux usage after do_check_insn()") Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/7678017d-b760-4053-a2d8-a6879b0dbeeb@hust.edu.cn/ Link: https://lore.kernel.org/r/20260127115912.3026761-2-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ihor Solodrai [Wed, 28 Jan 2026 21:12:55 +0000 (13:12 -0800)]
bpftool: Fix dependencies for static build
When building selftests/bpf with EXTRA_LDFLAGS=-static the following
error happens:
LINK /ws/linux/tools/testing/selftests/bpf/tools/build/bpftool/bootstrap/bpftool
/usr/bin/x86_64-linux-gnu-ld.bfd: /usr/lib/gcc/x86_64-linux-gnu/15/../../../x86_64-linux-gnu/libcrypto.a(libcrypto-lib-dso_dlfcn.o): in function `dlfcn_globallookup':
[...]
/usr/bin/x86_64-linux-gnu-ld.bfd: /usr/lib/gcc/x86_64-linux-gnu/15/../../../x86_64-linux-gnu/libcrypto.a(libcrypto-lib-c_zlib.o): in function `zlib_oneshot_expand_block':
(.text+0xc64): undefined reference to `uncompress'
/usr/bin/x86_64-linux-gnu-ld.bfd: /usr/lib/gcc/x86_64-linux-gnu/15/../../../x86_64-linux-gnu/libcrypto.a(libcrypto-lib-c_zlib.o): in function `zlib_oneshot_compress_block':
(.text+0xce4): undefined reference to `compress'
collect2: error: ld returned 1 exit status
make[1]: *** [Makefile:252: /ws/linux/tools/testing/selftests/bpf/tools/build/bpftool/bootstrap/bpftool] Error 1
make: *** [Makefile:327: /ws/linux/tools/testing/selftests/bpf/tools/sbin/bpftool] Error 2
make: *** Waiting for unfinished jobs....
This is caused by the wrong order of dependencies in the Makefile. Fix it.
Mykyta Yatsenko [Wed, 28 Jan 2026 19:05:51 +0000 (19:05 +0000)]
selftests/bpf: Remove xxd util dependency
The verification signature header generation requires converting a
binary certificate to a C array. Previously this only worked with
xxd (part of vim-common package).
As xxd may not be available on some systems building selftests, it makes
sense to substitute it with more common utils (hexdump, wc, sed) to
generate equivalent C array output.
Tested by generating header with both xxd and hexdump and comparing
them.
The speedup seems to be related to the fact that with a single ftrace_ops
object we don't call ftrace_shutdown anymore (we use ftrace_update_ops
instead) and we skip the synchronize_rcu calls (~100ms each) at the end of
that function.
v6 changes:
- rename add_hash_entry_direct to add_ftrace_hash_entry_direct [Steven]
- factor hash_add/hash_sub [Steven]
- add kerneldoc header for update_ftrace_direct_* functions [Steven]
- few assorted smaller fixes [Steven]
- added missing direct_ops wrappers for !CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
case [Steven]
v5 changes:
- do not export ftrace_hash object [Steven]
- fix update_ftrace_direct_add new_filter_hash leak [ci]
v4 changes:
- rebased on top of bpf-next/master (with jmp attach changes)
added patch 1 to deal with that
- added extra checks for update_ftrace_direct_del/mod to address
the ci bot review
v3 changes:
- rebased on top of bpf-next/master
- fixed update_ftrace_direct_del cleanup path
- added missing inline to update_ftrace_direct_* stubs
v2 changes:
- rebased on top of bpf-next/master plus Song's livepatch fixes [1]
- renamed the API functions [2] [Steven]
- do not export the new api [Steven]
- kept the original direct interface:
  I'm not sure if we want to melt both *_ftrace_direct and the new interface
  into a single one. It's a bit different in semantics (hence the name change
  as Steven suggested [2]) and the changes are not that big, so we can easily
  keep both APIs.
v1 changes:
- make the change x86 specific, after discussing with Mark options for
arm64 [Mark]
Jiri Olsa [Tue, 30 Dec 2025 14:50:10 +0000 (15:50 +0100)]
bpf,x86: Use single ftrace_ops for direct calls
Using a single ftrace_ops object for direct call updates instead of
allocating an ftrace_ops object for each trampoline.
With a single ftrace_ops object we can use the update_ftrace_direct_* API
that allows multiple ip site updates on a single ftrace_ops object.
Adding HAVE_SINGLE_FTRACE_DIRECT_OPS config option to be enabled on
each arch that supports this.
At the moment we can enable this only on x86, because arm relies
on the ftrace_ops object representing just a single trampoline image (stored
in ftrace_ops::direct_call). Archs that do not support this will continue
to use the *_ftrace_direct API.
Jiri Olsa [Tue, 30 Dec 2025 14:50:07 +0000 (15:50 +0100)]
ftrace: Add update_ftrace_direct_mod function
Adding the update_ftrace_direct_mod function that modifies all entries
(ip -> direct) provided in the hash argument on the direct ftrace ops and
updates its attachments.
The difference to the current modify_ftrace_direct is:
- a hash argument that allows modifying multiple ip -> direct
  entries at once
This change will allow us to have simple ftrace_ops for all bpf
direct interface users in following changes.
Jiri Olsa [Tue, 30 Dec 2025 14:50:06 +0000 (15:50 +0100)]
ftrace: Add update_ftrace_direct_del function
Adding the update_ftrace_direct_del function that removes all entries
(ip -> addr) provided in the hash argument from the direct ftrace ops and
updates its attachments.
The difference to the current unregister_ftrace_direct is:
- a hash argument that allows unregistering multiple ip -> direct
  entries at once
- we can call update_ftrace_direct_del multiple times on the
  same ftrace_ops object, because we do not need to unregister
  all entries at once; we can do it gradually with the help of
  the ftrace_update_ops function
This change will allow us to have simple ftrace_ops for all bpf
direct interface users in following changes.
Jiri Olsa [Tue, 30 Dec 2025 14:50:05 +0000 (15:50 +0100)]
ftrace: Add update_ftrace_direct_add function
Adding the update_ftrace_direct_add function that adds all entries
(ip -> addr) provided in the hash argument to the direct ftrace ops
and updates its attachments.
The difference to the current register_ftrace_direct is:
- a hash argument that allows registering multiple ip -> direct
  entries at once
- we can call update_ftrace_direct_add multiple times on the
  same ftrace_ops object, because after the first registration with
  register_ftrace_function_nolock, it uses ftrace_update_ops to
  update the ftrace_ops object
This change will allow us to have simple ftrace_ops for all bpf
direct interface users in following changes.
Jiri Olsa [Tue, 30 Dec 2025 14:50:02 +0000 (15:50 +0100)]
ftrace,bpf: Remove FTRACE_OPS_FL_JMP ftrace_ops flag
At the moment we allow the jmp attach only for ftrace_ops that
have FTRACE_OPS_FL_JMP set. This conflicts with the following changes
where we use a single ftrace_ops object for all direct call sites,
so all could be attached via either call or jmp.
We already limit the jmp attach support with a config option and a bit
(LSB) set on the trampoline address. It turns out that's actually
enough to limit the jmp attach per architecture and only to chosen
addresses (with the LSB bit set).
Each user of register_ftrace_direct or modify_ftrace_direct can set
the trampoline bit (LSB) to indicate it has to be attached by jmp.
The bpf trampoline generation code uses trampoline flags to generate
jmp-attach specific code and the ftrace inner code uses the trampoline
bit (LSB) to handle the return from a jmp attachment, so there's no harm
in removing the FTRACE_OPS_FL_JMP flag.
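For illustration, a caller would request the jmp attach roughly like this
(a sketch based on the description above; the wrapper itself is hypothetical):

```c
/* Sketch: tag the trampoline address with its LSB to request jmp attach;
 * plain (even) addresses keep the regular call attach. */
static int attach_trampoline(struct ftrace_ops *ops, unsigned long tr_addr,
			     bool attach_by_jmp)
{
	if (attach_by_jmp)
		tr_addr |= 1UL;	/* LSB = attach via jmp */

	return register_ftrace_direct(ops, tr_addr);
}
```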
The fexit/fmodret performance stays the same (did not drop),
current code: