====================
Remove task and cgroup local storage percpu counters
* Motivation *
The goal of this patchset is to make bpf syscalls and helpers updating
task and cgroup local storage more robust by removing percpu counters
in them. Task local storage and cgroup storage each employ a percpu
counter to prevent deadlock caused by recursion. Since the underlying
bpf local storage takes spinlocks in various operations, bpf programs
running recursively may try to take a spinlock which is already taken.
For example, when a tracing bpf program invoked recursively during
bpf_task_storage_get(..., F_CREATE) tries to call
bpf_task_storage_get(..., F_CREATE) again, it will cause an AA deadlock
if the percpu variable is not in place.
However, sometimes the percpu counter may cause bpf syscalls or helpers
to return errors spuriously when another thread is also updating
the local storage or the local storage map. Ideally, the two threads
could have taken turns to take the locks and perform their jobs
respectively. However, due to the percpu counter, the syscalls and
helpers can return -EBUSY even when neither of them runs recursively
inside the other. All it takes for this to happen is for the two threads
to run on the same CPU. This happened when BPF-CI ran the selftest of
task local data. Since CI runs the test on a VM with 2 CPUs,
bpf_task_storage_get(..., F_CREATE) can easily fail.
This failure mode is not good for users as they need to add retry logic
in user space or bpf programs to avoid it. Even with retries, there is
no guaranteed upper bound on the number of attempts before a call
succeeds. Therefore, this patchset seeks to remove the percpu counter
and make the related bpf syscalls and helpers more reliable, while still
making sure recursion deadlock will not happen, with the help of the
resilient queued spinlock (rqspinlock).
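For illustration, today a user-space caller has to wrap updates in a
retry loop along these lines (a sketch only; map_fd, key and value are
placeholders), with no bound on how many iterations it may take:

        int err;

        do {
                err = bpf_map_update_elem(map_fd, &key, &value, BPF_ANY);
        } while (err == -EBUSY);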
* Implementation *
To remove the percpu counter without introducing deadlock,
bpf_local_storage is refactored by changing the locks from raw_spin_lock
to rqspinlock, which avoids deadlock through deadlock detection and a
timeout mechanism.
The refactor basically replaces the locks with rqspinlock and propagates
errors returned by the locking function to BPF helpers or syscalls.
bpf_selem_unlink_nofail() is introduced to handle rqspinlock errors
in two lock acquiring functions that cannot fail,
bpf_local_storage_destroy() and bpf_local_storage_map_free()
(i.e., local storage is being freed by the subsystem or the map is
being freed). The high-level idea is to use a bitfield and atomic
operations to track who is referencing an selem when any of the locks
cannot be acquired. Additional care is needed to make sure special
fields are freed and owner memory is uncharged safely and correctly.
For readers not familiar with local storage, the last section briefly
describes the locks and structures of local storage. It also lists the
abbreviations used in the rest of the letter.
* Test *
Task and cgroup local storage selftests have already covered deadlock
caused by recursion. Patch 14 updates the expected result of task local
storage selftests as task local storage bpf helpers can now run on the
same CPU as they don't cause deadlock.
The benchmark is a microbenchmark stress-testing how fast local storage
can be created. After switching to rqspinlock and
bpf_selem_unlink_nofail(), socket local storage creation speed has a
~5% gain. For task local storage, the number remains the same.
Patch 1-4 convert local storage internal helpers to failable.
Patch 5 changes the locks to rqspinlock and propagates the error
returned from raw_res_spin_lock_irqsave() to BPF helpers and syscalls.
Patch 6-8 remove percpu counters in task and cgroup local storage.
Patch 9-11 address the unlikely rqspinlock errors by switching to
bpf_selem_unlink_nofail() in map_free() and destroy().
Patch 12-17 update selftests.
* Appendix: local storage internal *
There are two locks in bpf_local_storage due to the ownership model
described below. A map value, which consists of a pointer to the map
and the data, is a bpf_local_storage_data (sdata) stored in a
bpf_local_storage_elem (selem). A selem belongs to a bpf_local_storage
and a bpf_local_storage_map at the same time.
bpf_local_storage::lock (local_storage->lock in short) protects the list
in a bpf_local_storage and bpf_local_storage_map_bucket::lock (b->lock)
protects the hash bucket in a bpf_local_storage_map.
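Roughly, the relevant pieces of the structures look like this
(simplified; most fields omitted):

        struct bpf_local_storage_map_bucket {
                struct hlist_head list;   /* selems hashed into this bucket */
                raw_spinlock_t lock;      /* b->lock, becomes rqspinlock_t */
        };

        struct bpf_local_storage {        /* owned by a task, cgroup, socket, ... */
                struct hlist_head list;   /* selems belonging to this owner */
                void *owner;
                raw_spinlock_t lock;      /* local_storage->lock, becomes rqspinlock_t */
        };

        struct bpf_local_storage_elem {   /* selem */
                struct hlist_node map_node;  /* linked into b->list */
                struct hlist_node snode;     /* linked into local_storage->list */
                struct bpf_local_storage __rcu *local_storage;
                struct bpf_local_storage_data sdata;  /* smap pointer + map value */
        };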
v6 -> v7
- Minor comment and commit msg tweaks
- Patch 9: Remove unused "owner" (kernel test robot)
- Patch 13: Update comments in task_ls_recursion.c (AI) Link: https://lore.kernel.org/bpf/20260205070208.186382-1-ameryhung@gmail.com/
v5 -> v6
- Redo benchmark
- Patch 9: Remove storage->smap as it is not used any more
- Patch 17: Remove storage->smap check in selftests
- Patch 10, 11: Pass reuse_now = true to bpf_selem_free() and
bpf_local_storage_free() to allow faster memory reclaim (Martin)
- Patch 10: Use a bitfield instead of a refcount to track selem state to
be more precise, which removes the possibility of map_free() missing an
selem (Martin)
- Patch 10: Allow map_free() to free local_storage and drop
the change in bpf_local_storage_map_update() (Martin)
- Patch 11: Simplify destroy() by not deferring work as an owner is
unlikely to have so many maps that they stall RCU (Martin) Link: https://lore.kernel.org/bpf/20260201175050.468601-1-ameryhung@gmail.com/
v4 -> v5
- Patch 1: Fix incorrect bucket calculation (AI)
- Patch 3: Fix memory leak in bpf_sk_storage_clone() (AI)
- Patch 5: Fix memory leak in bpf_local_storage_update() (AI)
- Fix typo/comment/commit msg (AI)
- Patch 10: Replace smp_rmb() with smp_mb(). smp_rmb does not imply
acquire semantics Link: https://lore.kernel.org/bpf/20260131050920.2574084-1-ameryhung@gmail.com/
v3 -> v4
- Add performance numbers
- Avoid stale element when calling bpf_local_storage_map_free()
by allowing it to unlink selem from local_storage->list and uncharge
memory. Block destroy() from returning when pending map_free()
are uncharging
- Fix an -EAGAIN bug in bpf_local_storage_update() as map_free() now
does not free local storage
- Fix possible double-free of selem by ensuring an selem is only
processed once for each caller (Kumar)
- Fix possible infinite loop in bpf_selem_unlink_nofail() when
iterating b->list by replacing while loop with
hlist_for_each_entry_rcu
- Fix unsafe iteration in destroy() by iterating local_storage->list
using hlist_for_each_entry_rcu
- Fix UAF due to clearing storage_owner after destroy(). Flip the order
to fix it
- Misc clean-up suggested by Martin Link: https://lore.kernel.org/bpf/20251218175628.1460321-1-ameryhung@gmail.com/
v2 -> v3
- Rebase to bpf-next where BPF memory allocator is replaced with
kmalloc_nolock()
- Revert to selecting bucket based on selem
- Introduce bpf_selem_unlink_lockless() to allow unlinking and
freeing selem without taking locks Link: https://lore.kernel.org/bpf/20251002225356.1505480-1-ameryhung@gmail.com/
v1 -> v2
- Rebase to bpf-next
- Select bucket based on local_storage instead of selem (Martin)
- Simplify bpf_selem_unlink (Martin)
- Change handling of rqspinlock errors in bpf_local_storage_destroy()
and bpf_local_storage_map_free(). Retry instead of WARN_ON. Link: https://lore.kernel.org/bpf/20250729182550.185356-1-ameryhung@gmail.com/
====================
Amery Hung [Thu, 5 Feb 2026 22:29:15 +0000 (14:29 -0800)]
selftests/bpf: Fix outdated test on storage->smap
bpf_local_storage_free() already does not rely on local_storage->smap
since switching to kmalloc_nolock(). As local_storage->smap is removed,
fix the outdated test by dropping the local_storage->smap check. Keep
the second map in the task local storage map test to check that multiple
elements can be added to the storage, similar to the sk storage test.
Amery Hung [Thu, 5 Feb 2026 22:29:14 +0000 (14:29 -0800)]
selftests/bpf: Choose another percpu variable in bpf for btf_dump test
bpf_cgrp_storage_busy has been removed. Use bpf_bprintf_nest_level
instead. This percpu variable is also in the bpf subsystem, so if it is
removed in the future, BPF-CI will catch this type of CI-breaking
change.
Remove a test in test_maps that checks if the updating of the percpu
counter in task local storage map is preemption safe as the percpu
counter is now removed.
Amery Hung [Thu, 5 Feb 2026 22:29:11 +0000 (14:29 -0800)]
selftests/bpf: Update task_local_storage/recursion test
Update the expected result of the selftest as the recursion restriction
on task local storage syscalls and helpers has been relaxed. Now that
the percpu counter is removed, the task local storage helpers,
bpf_task_storage_get() and bpf_task_storage_delete(), can run on the
same CPU at the same time as long as they do not cause deadlock.
Note that since there is no percpu counter preventing recursion in
task local storage helpers, bpf_trampoline now catches the recursion
of on_update as reported by recursion_misses.
Amery Hung [Thu, 5 Feb 2026 22:29:10 +0000 (14:29 -0800)]
selftests/bpf: Update sk_storage_omem_uncharge test
Check sk_omem_alloc when the caller of bpf_local_storage_destroy()
returns. bpf_local_storage_destroy() now returns the memory to uncharge
to the caller instead of uncharging it directly. Therefore, in the
sk_storage_omem_uncharge test, check sk_omem_alloc when
bpf_sk_storage_free() returns instead of when
bpf_local_storage_destroy() returns.
Amery Hung [Thu, 5 Feb 2026 22:29:09 +0000 (14:29 -0800)]
bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy}
Take care of rqspinlock error in bpf_local_storage_{map_free, destroy}()
properly by switching to bpf_selem_unlink_nofail().
Both functions iterate their own RCU-protected list of selems and call
bpf_selem_unlink_nofail(). In map_free(), to prevent an infinite loop when
both map_free() and destroy() fail to remove a selem from b->list
(extremely unlikely), switch to hlist_for_each_entry_rcu(). In destroy(),
also switch to hlist_for_each_entry_rcu() since we no longer iterate
local_storage->list under local_storage->lock.
bpf_selem_unlink() now becomes dedicated to the helper and syscall
paths, so reuse_now should always be false. Remove the argument and
hardcode the value.
Acked-by: Alexei Starovoitov <ast@kernel.org> Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-12-ameryhung@gmail.com
Amery Hung [Thu, 5 Feb 2026 22:29:08 +0000 (14:29 -0800)]
bpf: Support lockless unlink when freeing map or local storage
Introduce bpf_selem_unlink_nofail() to properly handle errors returned
from rqspinlock in bpf_local_storage_map_free() and
bpf_local_storage_destroy(), where the operation must succeed.
The idea of bpf_selem_unlink_nofail() is to allow an selem to be
partially linked and use atomic operations on a bit field, selem->state,
to determine when and who can free the selem if any unlink under lock
fails. An selem is initially fully linked to a map and a local storage.
Under normal circumstances, bpf_selem_unlink_nofail() will be able to
grab the locks and unlink a selem from the map and the local storage in
sequence, just like bpf_selem_unlink(), and then free it after an RCU
grace period.
However, if any of the lock attempts fails, it will only clear
SDATA(selem)->smap or selem->local_storage and set SELEM_MAP_UNLINKED or
SELEM_STORAGE_UNLINKED, depending on the caller. Then, after both
map_free() and destroy() have seen the selem and the state becomes
SELEM_UNLINKED, only one of the two racing callers can succeed in
cmpxchg-ing the state from SELEM_UNLINKED to SELEM_TOFREE, ensuring no
double free or memory leak.
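A rough sketch of that freeing decision; only the flag names come from
this patch, the values and helper below are illustrative assumptions:

        /* illustrative values, not the actual definitions */
        #define SELEM_MAP_UNLINKED      (1 << 0)
        #define SELEM_STORAGE_UNLINKED  (1 << 1)
        #define SELEM_UNLINKED  (SELEM_MAP_UNLINKED | SELEM_STORAGE_UNLINKED)
        #define SELEM_TOFREE            (1 << 2)

        /* Each of map_free()/destroy() sets its half when its lock attempt
         * fails.  Only after both halves are set can exactly one of the two
         * racing callers win the cmpxchg and take over freeing the selem.
         */
        static bool selem_try_take_free(atomic_t *state, int my_flag)
        {
                int old = atomic_fetch_or(my_flag, state);

                if ((old | my_flag) != SELEM_UNLINKED)
                        return false;   /* the other side is not done yet */
                return atomic_cmpxchg(state, SELEM_UNLINKED, SELEM_TOFREE) ==
                       SELEM_UNLINKED;
        }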
To make sure bpf_obj_free_fields() is done only once and when map is
still present, it is called when unlinking an selem from b->list under
b->lock.
To make sure uncharging memory is done only when the owner is still
present in map_free(), block destroy() from returning until there is no
pending map_free().
Since smap may not be valid in destroy(), bpf_selem_unlink_nofail()
skips bpf_selem_unlink_storage_nolock_misc() when called from destroy().
This is okay as bpf_local_storage_destroy() will return the remaining
amount of memory charge tracked by mem_charge to the owner to uncharge.
It is also safe to skip clearing local_storage->owner and owner_storage
as the owner is being freed and no users or bpf programs should be able
to reference the owner and use the local_storage.
Finally, accesses to selem, SDATA(selem)->smap and selem->local_storage
are racy. Callers protect these fields with RCU.
Acked-by: Alexei Starovoitov <ast@kernel.org> Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-11-ameryhung@gmail.com
Amery Hung [Thu, 5 Feb 2026 22:29:07 +0000 (14:29 -0800)]
bpf: Prepare for bpf_selem_unlink_nofail()
The next patch will introduce bpf_selem_unlink_nofail() to handle
rqspinlock errors. bpf_selem_unlink_nofail() will allow an selem to be
partially unlinked from the map or the local storage. Save the memory
allocation method in the selem so that the selem can later be freed
correctly even when SDATA(selem)->smap is NULL.
In addition, keep track of memory charge to the owner in local storage
so that later bpf_selem_unlink_nofail() can return the correct memory
charge to the owner. Updating local_storage->mem_charge is protected by
local_storage->lock.
Finally, extract miscellaneous tasks performed when unlinking an selem
from local_storage into bpf_selem_unlink_storage_nolock_misc(). It will
be reused by bpf_selem_unlink_nofail().
This patch also takes the chance to remove local_storage->smap, which
is no longer used since commit f484f4a3e058 ("bpf: Replace bpf memory
allocator with kmalloc_nolock() in local storage").
Amery Hung [Thu, 5 Feb 2026 22:29:06 +0000 (14:29 -0800)]
bpf: Remove unused percpu counter from bpf_local_storage_map_free
Percpu locks have been removed from cgroup and task local storage. Now
that no local storage uses percpu variables as locks to prevent
recursion, there is no need to pass them to bpf_local_storage_map_free().
Remove the argument from the function.
Amery Hung [Thu, 5 Feb 2026 22:29:05 +0000 (14:29 -0800)]
bpf: Remove cgroup local storage percpu counter
The percpu counter in cgroup local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.
Amery Hung [Thu, 5 Feb 2026 22:29:04 +0000 (14:29 -0800)]
bpf: Remove task local storage percpu counter
The percpu counter in task local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.
Since the percpu counter is removed, merge back bpf_task_storage_get()
and bpf_task_storage_get_recur(). This will allow the bpf syscalls and
helpers to run concurrently on the same CPU, removing the spurious
-EBUSY error. bpf_task_storage_get(..., F_CREATE) will now always
succeed, given enough free memory, unless it is called recursively.
Amery Hung [Thu, 5 Feb 2026 22:29:03 +0000 (14:29 -0800)]
bpf: Change local_storage->lock and b->lock to rqspinlock
Change bpf_local_storage::lock and bpf_local_storage_map_bucket::lock
from raw_spin_lock to rqspinlock.
Finally, propagate errors from raw_res_spin_lock_irqsave() to syscall
return or BPF helper return.
In bpf_local_storage_destroy(), ignore the return value of
raw_res_spin_lock_irqsave() for now. A later patch will handle errors
correctly in bpf_local_storage_destroy() so that it can unlink selems
even when failing to acquire locks.
For __bpf_local_storage_map_cache(), instead of handling the error,
skip updating the cache.
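The resulting pattern at the call sites looks roughly like this (a
sketch; the exact error handling differs per caller):

        unsigned long flags;
        int err;

        err = raw_res_spin_lock_irqsave(&local_storage->lock, flags);
        if (err)
                return err;     /* propagated to the BPF helper or syscall */

        /* ... link/unlink selems under the lock ... */

        raw_res_spin_unlock_irqrestore(&local_storage->lock, flags);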
Amery Hung [Thu, 5 Feb 2026 22:29:02 +0000 (14:29 -0800)]
bpf: Convert bpf_selem_unlink to failable
To prepare changing both bpf_local_storage_map_bucket::lock and
bpf_local_storage::lock to rqspinlock, convert bpf_selem_unlink() to
failable. It still always succeeds and returns 0 until the change
happens. No functional change.
Open code bpf_selem_unlink_storage() in the only caller,
bpf_selem_unlink(), since unlink_map and unlink_storage must be done
together after all the necessary locks are acquired.
For bpf_local_storage_map_free(), ignore the return from
bpf_selem_unlink() for now. A later patch will allow it to unlink selems
even when failing to acquire locks.
Amery Hung [Thu, 5 Feb 2026 22:29:01 +0000 (14:29 -0800)]
bpf: Convert bpf_selem_link_map to failable
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_link_map() to failable. It still always succeeds and
returns 0 until the change happens. No functional change.
Amery Hung [Thu, 5 Feb 2026 22:29:00 +0000 (14:29 -0800)]
bpf: Convert bpf_selem_unlink_map to failable
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_unlink_map() to failable. It still always succeeds and
returns 0 for now.
Since some operations updating local storage cannot fail in the middle,
open-code bpf_selem_unlink_map() to take the b->lock before the
operation. There are two such locations:
- bpf_local_storage_alloc()
The first selem will be unlinked from smap if the cmpxchg of
owner_storage_ptr fails, and this unlink should not fail. Therefore,
hold b->lock from linking until the allocation completes. Helpers that
assume b->lock is held by the callers are introduced:
bpf_selem_link_map_nolock() and bpf_selem_unlink_map_nolock().
- bpf_local_storage_update()
The three step update process: link_map(new_selem),
link_storage(new_selem), and unlink_map(old_selem) should not fail in
the middle.
In bpf_selem_unlink(), bpf_selem_unlink_map() and
bpf_selem_unlink_storage() should either both succeed or fail as a whole
instead of failing in the middle. So, return early if unlink_map() fails.
Remove the selem_linked_to_map_lockless() check, as an selem in the
common paths (not bpf_local_storage_map_free() or
bpf_local_storage_destroy()) will be unlinked under b->lock and
local_storage->lock, and therefore no other thread can unlink the selem
from the map at the same time.
In bpf_local_storage_destroy(), ignore the return of
bpf_selem_unlink_map() for now. A later patch will allow
bpf_local_storage_destroy() to unlink selems even when failing to
acquire locks.
Note that while this patch removes all callers of selem_linked_to_map(),
a later patch that introduces bpf_selem_unlink_nofail() will use it
again.
Amery Hung [Thu, 5 Feb 2026 22:28:59 +0000 (14:28 -0800)]
bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage
A later bpf_local_storage refactor will acquire all locks before
performing any update. To reduce the number of locks that need to be
taken in bpf_local_storage_map_update(), determine the bucket based on
the local_storage an selem belongs to instead of the selem pointer.
Currently, when a new selem needs to be created to replace the old selem
in bpf_local_storage_map_update(), the locks of both buckets need to be
acquired to prevent racing. This can be simplified if the two selems
belong to the same bucket so that only one bucket needs to be locked.
Therefore, instead of hashing the selem pointer, hash the local_storage
pointer the selem belongs to.
Performance-wise, this is slightly better as an update now requires
locking only one bucket. It should not change the level of contention on
a bucket, as the pointers to the local storages that selems in a map
belong to are just as unique as the pointers to the selems themselves.
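In effect, the bucket selection helper changes along these lines (a
sketch patterned after the existing select_bucket()):

        static struct bpf_local_storage_map_bucket *
        select_bucket(struct bpf_local_storage_map *smap,
                      struct bpf_local_storage *local_storage)
        {
                /* was: &smap->buckets[hash_ptr(selem, smap->bucket_log)] */
                return &smap->buckets[hash_ptr(local_storage, smap->bucket_log)];
        }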
====================
Fix some corner cases in xskxceiver
While working on XDP and AF_XDP support for ixgbevf driver,
I came across two distinct problems that caused tests to fail
when they shouldn't have.
====================
Larysa Zaremba [Tue, 3 Feb 2026 15:50:58 +0000 (16:50 +0100)]
selftests/xsk: fix number of Tx frags in invalid packet
The issue occurs in TOO_MANY_FRAGS test case when xdp_zc_max_segs is set to
an odd number.
TOO_MANY_FRAGS test case contains an invalid packet consisting of
(xdp_zc_max_segs) frags. Every frag, even the last one, has the XDP_PKT_CONTD
flag set. This packet is expected to be dropped. After that, there is a
valid linear packet, which is expected to be received back.
When xdp_zc_max_segs is an odd number, the last packet cannot be
received if packet forwarding between the Rx and Tx interfaces relies on
the ethernet header, e.g. checks for ETH_P_LOOPBACK. If all traffic is
simply looped, the packet is malformed instead.
It turns out the sending function processes multiple invalid frags as if
they were in 2-frag packets. So once the invalid mbuf packet contains an
odd number of those, the valid packet that follows gets paired with the
previous invalid descriptor and hence does not get an ethernet header
generated, so it is either dropped or malformed.
Make invalid packets in verbatim mode always have only a single frag. For
such packets, number of frags is otherwise meaningless, as descriptor flags
are pre-configured in verbatim mode and packet data is not generated for
invalid descriptors.
Fixes: 697604492b64 ("selftests/xsk: add invalid descriptor test for multi-buffer") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20260203155103.2305816-3-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Larysa Zaremba [Tue, 3 Feb 2026 15:50:57 +0000 (16:50 +0100)]
selftests/xsk: properly handle batch ending in the middle of a packet
The referenced commit reduced the scope of the variable pkt, so now it
has to be reinitialized via pkt_stream_get_next_rx_pkt(), which also
increments some counters. When a packet is interrupted by the batch
ending, the pkt stream therefore proceeds to the next packet while the
xsk ring still contains the previous one, which results in a pkt_nb
mismatch.
Decrement the affected counters when a packet is interrupted.
Fixes: 8913e653e9b8 ("selftests/xsk: Iterate over all the sockets in the receive pkts function") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20260203155103.2305816-2-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf: Prevent reentrance into call_rcu_tasks_trace()
call_rcu_tasks_trace() is not safe from in_nmi() and not reentrant.
To prevent deadlock on raw_spin_lock_rcu_node(rtpcp) or memory corruption,
defer to irq_work when IRQs are disabled. call_rcu_tasks_generic()
protects itself with local_irq_save().
Note when bpf_async_cb->refcnt drops to zero it's safe to reuse
bpf_async_cb->worker for a different irq_work callback, since
bpf_async_schedule_op() -> irq_work_queue(&cb->worker);
is only called when refcnt >= 1.
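The shape of the fix is roughly the following (a sketch; the rcu_head
field name and the callback are assumptions drawn from neighboring
commits):

        if (irqs_disabled()) {
                /* call_rcu_tasks_trace() is not reentrant and not safe from
                 * in_nmi(), so hand the free-side work off via irq_work */
                irq_work_queue(&cb->worker);
        } else {
                call_rcu_tasks_trace(&cb->rcu, bpf_async_cb_rcu_tasks_trace_free);
        }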
KP Singh [Thu, 5 Feb 2026 07:07:55 +0000 (08:07 +0100)]
bpf: Require frozen map for calculating map hash
Currently, bpf_map_get_info_by_fd calculates and caches the hash of the
map regardless of the map's frozen state.
This leads to a TOCTOU bug where userspace can call
BPF_OBJ_GET_INFO_BY_FD to cache the hash and then modify the map
contents before freezing.
Therefore, a trusted loader can be tricked into verifying the stale hash
while loading the modified contents.
Fix this by returning -EPERM if the map is not frozen when the hash is
requested. This ensures the hash is only generated for the final,
immutable state of the map.
Fixes: ea2e6467ac36 ("bpf: Return hashes of maps in BPF_OBJ_GET_INFO_BY_FD") Reported-by: Toshi Piazza <toshi.piazza@microsoft.com> Signed-off-by: KP Singh <kpsingh@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260205070755.695776-1-kpsingh@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
KP Singh [Thu, 5 Feb 2026 06:38:07 +0000 (07:38 +0100)]
bpf: Limit bpf program signature size
Practical BPF signatures are significantly smaller than
KMALLOC_MAX_CACHE_SIZE.
Allowing larger sizes opens the door for abuse by passing excessive
size values and forcing the kernel into expensive allocation paths (via
kmalloc_large or vmalloc).
Fixes: 349271568303 ("bpf: Implement signature verification for BPF programs") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: KP Singh <kpsingh@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260205063807.690823-1-kpsingh@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
====================
Fix for bpf_wq retry loop during free
Small fix and improvement to ensure cancel_work() can handle the case
where wq callback is running, and doesn't lead to call_rcu_tasks_trace()
repeatedly after failing cancel_work, if wq callback is not pending.
====================
bpf: Reset prog callback in bpf_async_cancel_and_free()
Replace the prog and callback in bpf_async_cb after removing visibility
of bpf_async_cb in bpf_async_cancel_and_free() to increase the chance
that scheduled async callbacks short-circuit execution and exit early
without starting an RCU tasks trace section. This reduces the overall
time spent running the wq selftest.
bpf: Check for running wq callback when freeing bpf_async_cb
When freeing a bpf_async_cb in bpf_async_cb_rcu_tasks_trace_free(), if
the wq callback is not scheduled, cancel_work() currently returns false
and leads to a retry of the RCU tasks trace grace period. If the
callback is never scheduled, we keep retrying indefinitely and never put
the prog reference.
Since the only race we care about here is against a potentially running
wq callback in the first grace period, which should finish by the second
grace period, check the work_busy() result to detect a running wq
callback when the work is not pending; otherwise, free the object
immediately without retrying.
Reasoning behind the check and its correctness against a racing wq
callback invocation: cancel_work() is supposed to be synchronized, hence
calling it first and getting false means that the work is definitely not
pending. At this point, either the work is not scheduled at all, is
already running, or we race and it has already finished by the time we
check for it using work_busy(). In case it is running, we synchronize
using pool->lock to check the current work running there; if we match,
we extend the wait by another grace period using retry = true.
Otherwise, either the work already finished running or was never
scheduled, so we can free the bpf_async_cb right away.
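Put together, the free path looks roughly like this (a sketch; cb->work
and bpf_async_cb_free() are stand-in names, not necessarily the actual
identifiers):

        bool retry = false;

        if (!cancel_work(&cb->work)) {
                /* Not pending: the callback either never ran, has already
                 * finished, or is running right now.  Only the last case
                 * needs one more RCU tasks trace grace period.
                 */
                if (work_busy(&cb->work) & WORK_BUSY_RUNNING)
                        retry = true;
        }
        if (!retry)
                bpf_async_cb_free(cb);  /* stand-in for the final free */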
Fixes: 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260205003853.527571-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
V3: https://lore.kernel.org/all/20260203222643.994713-1-puranjay@kernel.org/
Changes in v3->v4:
- Add a call to reg_bounds_sync() in sync_linked_regs() to sync bounds after alu op.
- Add a __sink(path[0]); in the C selftest so compiler doesn't say "error: variable 'path' set but
not used"
V2: https://lore.kernel.org/all/20260113152529.3217648-1-puranjay@kernel.org/
Changes in v2->v3:
- Added another selftest showing a real usage pattern
- Rebased on bpf-next/master
v1: https://lore.kernel.org/bpf/20260107203941.1063754-1-puranjay@kernel.org/
Changes in v1->v2:
- Add support for alu32 operations in linked register tracking (Alexei)
- Squash the selftest fix with the first patch (Eduard)
- Add more selftests to detect edge cases
This series extends the BPF verifier's linked register tracking to handle
negative offsets, BPF_SUB operations, and alu32 operations, enabling better bounds propagation for
common arithmetic patterns.
The verifier previously only tracked positive constant deltas between linked
registers using BPF_ADD. This meant patterns using negative offsets or
subtraction couldn't benefit from bounds propagation:
void alu32_negative_offset(void)
{
        volatile char path[5];
        volatile int offset = bpf_get_prandom_u32();
        int off = offset;
FB Progs
File Program Verdict (A) Verdict (B) Verdict (DIFF) Insns (A) Insns (B) Insns (DIFF)
------------ ---------------- ----------- ----------- -------------- --------- --------- -----------------
bpf232.bpf.o layered_dump success success MATCH 1151 1218 +67 (+5.82%)
bpf257.bpf.o layered_runnable success success MATCH 5743 6143 +400 (+6.97%)
bpf252.bpf.o layered_runnable success success MATCH 5677 6075 +398 (+7.01%)
bpf227.bpf.o layered_dump success success MATCH 915 982 +67 (+7.32%)
bpf239.bpf.o layered_runnable success success MATCH 5459 5861 +402 (+7.36%)
bpf246.bpf.o layered_runnable success success MATCH 5562 6008 +446 (+8.02%)
bpf229.bpf.o layered_runnable success success MATCH 2559 3011 +452 (+17.66%)
bpf231.bpf.o layered_runnable success success MATCH 2559 3011 +452 (+17.66%)
bpf234.bpf.o layered_runnable success success MATCH 2549 3001 +452 (+17.73%)
bpf019.bpf.o do_sendmsg success success MATCH 124823 153523 +28700 (+22.99%)
bpf019.bpf.o do_parse success success MATCH 124809 153509 +28700 (+23.00%)
bpf227.bpf.o layered_runnable success success MATCH 1915 2356 +441 (+23.03%)
bpf228.bpf.o layered_runnable success success MATCH 1700 2152 +452 (+26.59%)
bpf232.bpf.o layered_runnable success success MATCH 1499 1951 +452 (+30.15%)
bpf312.bpf.o mount_exit success success MATCH 19253 62883 +43630 (+226.61%)
bpf312.bpf.o umount_exit success success MATCH 19253 62883 +43630 (+226.61%)
bpf311.bpf.o mount_exit success success MATCH 19226 62863 +43637 (+226.97%)
bpf311.bpf.o umount_exit success success MATCH 19226 62863 +43637 (+226.97%)
The above four programs have specific patterns that make the verifier explore a lot more states:
for (; depth < MAX_DIR_DEPTH; depth++) {
        const unsigned char* name = BPF_CORE_READ(dentry, d_name.name);
        if (offset >= MAX_PATH_LEN - MAX_DIR_LEN) {
                return depth;
        }
        int len = bpf_probe_read_kernel_str(&path[offset], MAX_DIR_LEN, name);
        offset += len;
        if (len == MAX_DIR_LEN) {
                if (offset - 2 < MAX_PATH_LEN) { // <---- (a)
                        path[offset - 2] = '.';
                }
                if (offset - 3 < MAX_PATH_LEN) { // <---- (b)
                        path[offset - 3] = '.';
                }
                if (offset - 4 < MAX_PATH_LEN) { // <---- (c)
                        path[offset - 4] = '.';
                }
        }
}
When at some depth == N false branches of conditions (a), (b) and (c) are scheduled for
verification, constraints for offset at depth == N+1 are:
1. offset >= MAX_PATH_LEN + 2
2. offset >= MAX_PATH_LEN + 3
3. offset >= MAX_PATH_LEN + 4 (visited before others)
And after offset += len it becomes:
1. offset >= MAX_PATH_LEN - 4093
2. offset >= MAX_PATH_LEN - 4092
3. offset >= MAX_PATH_LEN - 4091 (visited before others)
Because of the DFS states exploration logic, the states above are visited in order 3, 2, 1; 3 is not
a subset of 2 and 1 is not a subset of 2, so pruning logic does not kick in.
Previously this was not a problem, because the range for offset was not propagated through the
statements (a), (b), (c).
As the root cause of this regression is understood, this is not a blocker for this change.
Puranjay Mohan [Wed, 4 Feb 2026 15:17:37 +0000 (07:17 -0800)]
bpf: Support negative offsets, BPF_SUB, and alu32 for linked register tracking
Previously, the verifier only tracked positive constant deltas between
linked registers using BPF_ADD. This limitation meant patterns like:
r1 = r0;
r1 += -4;
if r1 s>= 0 goto l0_%=; // r1 >= 0 implies r0 >= 4
// verifier couldn't propagate bounds back to r0
if r0 != 0 goto l0_%=;
r0 /= 0; // Verifier thinks this is reachable
l0_%=:
A similar limitation exists for 32-bit registers.
With this change, the verifier can now track negative deltas in reg->off,
enabling bounds propagation for the above pattern.
For alu32, we make sure the destination register has the upper 32 bits
as 0s before creating the link. BPF_ADD_CONST is split into
BPF_ADD_CONST64 and BPF_ADD_CONST32; the latter is used in the alu32
case, and sync_linked_regs() uses it to zero-extend the result if
known_reg has this flag.
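In C, the kind of pattern this enables looks roughly like the following
(a hypothetical fragment, not the actual selftest):

        char path[4];
        unsigned int off = bpf_get_prandom_u32() & 0xff; /* off in [0, 255] */
        int tmp = off;                                   /* tmp linked to off */

        tmp -= 4;                /* negative constant delta, alu32 */
        if (tmp >= 0)
                return 0;
        /* here tmp < 0, and through the link the verifier learns off < 4 */
        path[off] = '\0';        /* provably within path[4] */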
====================
bpf: Add bitwise tracking for BPF_END
Add bitwise tracking (tnum analysis) for BPF_END (`bswap(16|32|64)`,
`be(16|32|64)`, `le(16|32|64)`) operations. Please see commit log of
1/2 for more details.
v3:
- Resend to fix a version control error in v2.
- The rest of the changes are identical to v2.
v2 (incorrect): https://lore.kernel.org/bpf/20260204091146.52447-1-ziye@zju.edu.cn/
- Refactored selftests using BSWAP_RANGE_TEST macro to eliminate code
duplication and improve maintainability. (Eduard)
- Simplified test names. (Eduard)
- Reduced excessive comments in test cases. (Eduard)
- Added more comments to explain BPF_END's special handling of zext_32_to_64.
Tianci Cao [Wed, 4 Feb 2026 11:15:03 +0000 (19:15 +0800)]
selftests/bpf: Add tests for BPF_END bitwise tracking
Now BPF_END has bitwise tracking support. This patch adds selftests to
cover various cases of BPF_END (`bswap(16|32|64)`, `be(16|32|64)`,
`le(16|32|64)`) with bitwise propagation.
This patch is based on the existing `verifier_bswap.c`, and adds several
types of new tests:
1. Unconditional byte swap operations:
- bswap16/bswap32/bswap64 with unknown bytes
2. Endian conversion operations (architecture-aware):
- be16/be32/be64: convert to big-endian
* on little-endian: do swap
* on big-endian: truncation (16/32-bit) or no-op (64-bit)
- le16/le32/le64: convert to little-endian
* on big-endian: do swap
* on little-endian: truncation (16/32-bit) or no-op (64-bit)
Each test simulates realistic networking scenarios where a value is
masked with unknown bits (e.g., var_off=(0x0; 0x3f00), range=[0,0x3f00]),
then byte-swapped, and the verifier must prove the result stays within
expected bounds.
Specifically, these selftests are based on dead code elimination:
If the BPF verifier can precisely track bits through byte swap
operations, it can prune the trap path (invalid memory access) that
should be unreachable, allowing the program to pass verification.
If bitwise tracking is incorrect, the verifier cannot prove the trap
is unreachable, causing verification failure.
The tests use preprocessor conditionals (#ifdef __BYTE_ORDER__) to
verify correct behavior on both little-endian and big-endian
architectures, and require Clang 18+ for bswap instruction support.
Tianci Cao [Wed, 4 Feb 2026 11:15:02 +0000 (19:15 +0800)]
bpf: Add bitwise tracking for BPF_END
This patch implements bitwise tracking (tnum analysis) for the BPF_END
(byte swap) operations.
Currently, the BPF verifier does not track values for BPF_END operations,
treating the result as completely unknown. This limits the verifier's
ability to prove safety of programs that perform endianness conversions,
which are common in networking code.
For example, the following code pattern for port number validation:
int test(struct pt_regs *ctx) {
        __u64 x = bpf_get_prandom_u32();
        x &= 0x3f00;    // Range: [0, 0x3f00], var_off: (0x0; 0x3f00)
        x = bswap16(x); // Should swap to range [0, 0x3f], var_off: (0x0; 0x3f)
        if (x > 0x3f) goto trap;
        return 0;
trap:
        return *(u64 *)NULL; // Should be unreachable
}
Without this patch, even though the verifier knows `x` has certain bits
set, after bswap16, it loses all tracking information and treats port
as having a completely unknown value [0, 65535].
According to the BPF instruction set[1], there are 3 kinds of BPF_END:
1. `bswap(16|32|64)`: opcode=0xd7 (BPF_END | BPF_ALU64 | BPF_TO_LE)
- do unconditional swap
2. `le(16|32|64)`: opcode=0xd4 (BPF_END | BPF_ALU | BPF_TO_LE)
- on big-endian: do swap
- on little-endian: truncation (16/32-bit) or no-op (64-bit)
3. `be(16|32|64)`: opcode=0xdc (BPF_END | BPF_ALU | BPF_TO_BE)
- on little-endian: do swap
- on big-endian: truncation (16/32-bit) or no-op (64-bit)
Since BPF_END operations are inherently bit-wise permutations, tnum
(bitwise tracking) offers the most efficient and precise mechanism
for value analysis. By implementing `tnum_bswap16`, `tnum_bswap32`,
and `tnum_bswap64`, we can derive exact `var_off` values concisely,
directly reflecting the bit-level changes.
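As an illustration, the 16-bit case can be as small as the following
(a sketch; the actual implementation may differ in details):

        struct tnum tnum_bswap16(struct tnum a)
        {
                /* A byte swap is a pure bit permutation, so swapping the
                 * known value and the unknown-bits mask independently is
                 * exact; the result is then truncated to 16 bits.
                 */
                return TNUM(swab16((u16)a.value), swab16((u16)a.mask));
        }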
Here is the overview of changes:
1. In `tnum_bswap(16|32|64)` (kernel/bpf/tnum.c):
Call `swab(16|32|64)` function on the value and mask of `var_off`, and
do truncation for 16/32-bit cases.
2. In `adjust_scalar_min_max_vals` (kernel/bpf/verifier.c):
Call helper function `scalar_byte_swap`.
- Only do byte swap when
* alu64 (unconditional swap) OR
* switching between big-endian and little-endian machines.
- If a byte swap is needed:
* Firstly call `tnum_bswap(16|32|64)` to update `var_off`.
* Then reset the bound since byte swap scrambles the range.
- For 16/32-bit cases, truncate dst register to match the swapped size.
This enables better verification of networking code that frequently uses
byte swaps for protocol processing, reducing false positive rejections.
bpf: Add a recursion check to prevent loops in bpf_timer
Do not schedule timer/wq operation on a cpu that is in irq_work
callback that is processing async_cmds queue.
Otherwise the following loop is possible:
bpf_timer_start() -> bpf_async_schedule_op() -> irq_work_queue().
irqrestore -> bpf_async_irq_worker() -> tracepoint -> bpf_timer_start().
bpf: Tighten conditions when timer/wq can be called synchronously
Though hrtimer_start/cancel() inline all of the smaller helpers in
hrtimer.c and only call timerqueue_add/del() from lib/timerqueue.c, where
everything is neither traceable nor kprobe-able (because all files in
lib/ are not traceable), there are tracepoints within hrtimer that are
called with locks held. Therefore, prevent the deadlock by tightening the
conditions under which timer/wq operations can be called synchronously.
hrtimer/wq are using raw_spin_lock_irqsave(), so irqs_disabled() is enough.
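As a sketch (helper and variable names here are assumptions based on the
surrounding commits, not the actual code):

        if (irqs_disabled())
                /* hrtimer/wq internals take raw_spin_lock_irqsave() and
                 * contain tracepoints, so defer the operation to irq_work */
                bpf_async_schedule_op(cb, op);
        else
                hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);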
bpf: Use sk_is_inet() and sk_is_unix() in __cgroup_bpf_run_filter_sock_addr().
sk->sk_family should be read with READ_ONCE() in
__cgroup_bpf_run_filter_sock_addr() due to IPV6_ADDRFORM.
Also, the comment there is a bit stale since commit 859051dd165e
("bpf: Implement cgroup sockaddr hooks for unix sockets"), and the
kdoc has the same comment.
Let's use sk_is_inet() and sk_is_unix() and remove the comment.
====================
bpf: Avoid locks in bpf_timer and bpf_wq
From: Alexei Starovoitov <ast@kernel.org>
This series reworks implementation of BPF timer and workqueue APIs to
make them usable from any context.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Changes in v9:
- Different approach for patches 1 and 3:
- s/EBUSY/ENOENT/ when refcnt==0 to match existing
- drop latch, use refcnt and kmalloc_nolock() instead
- address race between timer/wq_start and delete_elem, add a test
- Link to v8: https://lore.kernel.org/bpf/20260127-timer_nolock-v8-0-5a29a9571059@meta.com/
Changes in v8:
- Return -EBUSY in bpf_async_read_op() if last_seq is failed to be set
- In bpf_async_cancel_and_free() drop bpf_async_cb ref after calling bpf_async_process()
- Link to v7: https://lore.kernel.org/r/20260122-timer_nolock-v7-0-04a45c55c2e2@meta.com
Changes in v7:
- Addressed Andrii's review points from the previous version - nothing
very significant.
- Added NMI stress tests for bpf_timer - hit few verifier failing checks
and removed them.
- Address sparse warning in the bpf_async_update_prog_callback()
- Link to v6: https://lore.kernel.org/r/20260120-timer_nolock-v6-0-670ffdd787b4@meta.com
Changes in v6:
- Reworked destruction and refcnt use:
- On cancel_and_free() set last_seq to BPF_ASYNC_DESTROY value, drop
map's reference
- In irq work callback, atomically switch DESTROY to DESTROYED, cancel
timer/wq
- Free bpf_async_cb on refcnt going to 0.
- Link to v5: https://lore.kernel.org/r/20260115-timer_nolock-v5-0-15e3aef2703d@meta.com
Changes in v5:
- Extracted lock-free algorithm for updating cb->prog and
cb->callback_fn into a function bpf_async_update_prog_callback(),
added a new commit and introduces this function and uses it in
__bpf_async_set_callback(), bpf_timer_cancel() and
bpf_async_cancel_and_free().
This allows to move the change into the separate commit without breaking
correctness.
- Handle NULL prog in bpf_async_update_prog_callback().
- Link to v4: https://lore.kernel.org/r/20260114-timer_nolock-v4-0-fa6355f51fa7@meta.com
Changes in v4:
- Handle irq_work_queue failures in both schedule and cancel_and_free
paths: introduced bpf_async_refcnt_dec_cleanup() that decrements refcnt
and makes sure if last reference is put, there is at least one irq_work
scheduled to execute final cleanup.
- Additional refcnt inc/dec in set_callback() + rcu lock to make sure
cleanup is not running at the same time as set_callback().
- Added READ_ONCE where it was needed.
- Squash 'bpf: Refactor __bpf_async_set_callback()' commit into 'bpf:
Add lock-free cell for NMI-safe
async operations'
- Removed mpmc_cell, use seqcount_latch_t instead.
- Link to v3: https://lore.kernel.org/r/20260107-timer_nolock-v3-0-740d3ec3e5f9@meta.com
Changes in v3:
- Major rework
- Introduce mpmc_cell, allowing concurrent writes and reads
- Implement irq_work deferring
- Adding selftests
- Introduces bpf_timer_cancel_async kfunc
- Link to v2: https://lore.kernel.org/r/20251105-timer_nolock-v2-0-32698db08bfa@meta.com
Changes in v2:
- Move refcnt initialization and put (from cancel_and_free())
from patch 5 into the patch 4, so that patch 4 has more clear and full
implementation and use of refcnt
- Link to v1: https://lore.kernel.org/r/20251031-timer_nolock-v1-0-b064ae403bfb@meta.com
====================
Mykyta Yatsenko [Sun, 1 Feb 2026 02:54:01 +0000 (18:54 -0800)]
selftests/bpf: Add timer stress test in NMI context
Add stress tests for BPF timers that run in NMI context using perf_event
programs attached to PERF_COUNT_HW_CPU_CYCLES.
The tests cover three scenarios:
- nmi_race: Tests concurrent timer start and async cancel operations
- nmi_update: Tests updating a map element (effectively deleting and
inserting new for array map) from within a timer callback
- nmi_cancel: Tests timer self-cancellation attempt.
A common test_common() helper is used to share timer setup logic across
all test modes.
The tests spawn multiple threads in a child process to generate
perf events, which trigger the BPF programs in NMI context. Hit counters
verify that the NMI code paths were actually exercised.
Refactor bpf_timer and bpf_wq to allow calling them from any context:
- add refcnt to bpf_async_cb
- map_delete_elem or map_free will drop refcnt to zero
via bpf_async_cancel_and_free()
- once refcnt is zero timer/wq_start is not allowed to make sure
that callback cannot rearm itself
- if in_hardirq, defer start/cancel operations to irq_work
Add a new bpf_stream_print_stack kfunc for printing a BPF program stack
into a BPF stream. Update the verifier to allow the new kfunc to be
called with BPF spinlocks held, along with bpf_stream_vprintk.
Patchset spun out of the larger libarena + ASAN patchset.
(https://lore.kernel.org/bpf/20260127181610.86376-1-emil@etsalapatis.com/)
Changeset:
- Update bpf_stream_print_stack to take stream_id arg (Kumar)
- Added selftest for the bpf_stream_print_stack
- Add selftest for calling the streams kfuncs under lock
v2->v1: (https://lore.kernel.org/bpf/20260202193311.446717-1-emil@etsalapatis.com/)
- Updated Signed-off-by to be consistent with past submissions
- Updated From email to be consistent with Signed-off-by
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
====================
Emil Tsalapatis [Tue, 3 Feb 2026 18:04:23 +0000 (13:04 -0500)]
bpf: Allow BPF stream kfuncs while holding a lock
The BPF stream kfuncs bpf_stream_vprintk and bpf_stream_print_stack
do not sleep and so are safe to call while holding a lock. Amend
the verifier to allow that.
Add a new kfunc called bpf_stream_print_stack to be used by programs
that need to print out their current BPF stack. The kfunc is essentially
a wrapper around the existing bpf_stream_dump_stack functionality used
to generate stack traces for error events like may_goto violations and
BPF-side arena page faults.
====================
bpf: Improve state pruning for scalar registers
V2: https://lore.kernel.org/all/20260203022229.1630849-1-puranjay@kernel.org/
Changes in V3:
- Fix spelling mistakes in commit logs (AI)
- Fix an incorrect comment in the selftest added in patch 5 (AI)
- Improve the title of patch 5
V1: https://lore.kernel.org/all/20260202104414.3103323-1-puranjay@kernel.org/
Changes in V2:
- Collected acked by Eduard
- Removed some unnecessary comments
- Added a selftest for id=0 equivalence in Patch 5
This series improves BPF verifier state pruning by relaxing scalar ID
equivalence requirements. Scalar register IDs are used to track
relationships between registers for bounds propagation. However, once
an ID becomes "singular" (only one register/stack slot carries it), it
can no longer participate in bounds propagation and becomes stale.
These stale IDs can prevent pruning of otherwise equivalent states.
The series addresses this in four patches:
Patch 1: Assign IDs on stack fills to ensure stack slots have IDs
before being read into registers, preparing for the singular ID
clearing in patch 2.
Patch 2: Clear IDs that appear only once before caching, as they cannot
contribute to bounds propagation.
Patch 3: Relax maybe_widen_reg() to only compare value-tracking fields
(bounds, tnum, var_off) rather than also requiring ID matches. Two
scalars with identical value constraints but different IDs represent
the same abstract value and don't need widening.
Patch 4: Relax scalar ID equivalence in state comparison by treating
rold->id == 0 as "independent". If the old state didn't rely on ID
relationships for a register, any linking in the current state only
adds constraints and is safe to accept for pruning.
Patch 5: Add a selftest to show the exact case being handled by Patch 4
I ran veristat on BPF programs from sched_ext, meta's internal programs,
and on selftest programs, showing programs with insn diff > 5%:
Puranjay Mohan [Tue, 3 Feb 2026 16:51:00 +0000 (08:51 -0800)]
bpf: Relax scalar id equivalence for state pruning
Scalar register IDs are used by the verifier to track relationships
between registers and enable bounds propagation across those
relationships. Once an ID becomes singular (i.e. only a single
register/stack slot carries it), it can no longer contribute to bounds
propagation and effectively becomes stale. The previous commit makes the
verifier clear such ids before caching the state.
When comparing the current and cached states for pruning, these stale
IDs can cause technically equivalent states to be considered different
and thus prevent pruning.
For example, in the selftest added in the next commit, two registers,
r6 and r7, are not linked to any other registers and get cached with
id=0; in the current state, they are both linked to each other with
id=A. Before this commit, check_scalar_ids() would give temporary ids to
r6 and r7 (say tid1 and tid2), and then check_ids() would map tid1->A,
and when it saw tid2->A, it would not consider these states equivalent.
Relax scalar ID equivalence by treating rold->id == 0 as "independent":
if the old state did not rely on any ID relationships for a register,
then any ID/linking present in the current state only adds constraints
and is always safe to accept for pruning. Implement this by returning
true immediately in check_scalar_ids() when old_id == 0.
Maintain correctness for the opposite direction (old_id != 0 && cur_id
== 0) by still allocating a temporary ID for cur_id == 0. This avoids
incorrectly allowing multiple independent current registers (id==0) to
satisfy a single linked old ID during mapping.
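A sketch of the relaxed comparison (the pre-existing temporary-ID
allocation is kept for the cur_id == 0 direction):

        static bool check_scalar_ids(u32 old_id, u32 cur_id, struct bpf_idmap *idmap)
        {
                /* rold->id == 0: the old state did not rely on any linking,
                 * so whatever linking the current state has only adds
                 * constraints and is safe for pruning.
                 */
                if (!old_id)
                        return true;

                cur_id = cur_id ? cur_id : ++idmap->tmp_id_gen;
                return check_ids(old_id, cur_id, idmap);
        }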
Puranjay Mohan [Tue, 3 Feb 2026 16:50:59 +0000 (08:50 -0800)]
bpf: Relax maybe_widen_reg() constraints
The maybe_widen_reg() function widens imprecise scalar registers to
unknown when their values differ between the cached and current states.
Previously, it used regs_exact() which also compared register IDs via
check_ids(), requiring registers to have matching IDs (or mapped IDs) to
be considered exact.
For scalar widening purposes, what matters is whether the value tracking
(bounds, tnum, var_off) is the same, not whether the IDs match. Two
scalars with identical value constraints but different IDs represent the
same abstract value and don't need to be widened.
Introduce scalars_exact_for_widen() that only compares the
value-tracking portion of bpf_reg_state (fields before 'id'). This
allows the verifier to preserve more scalar value information during
state merging when IDs differ but actual tracked values are identical,
reducing unnecessary widening and potentially improving verification
precision.
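A sketch of the new helper, mirroring how regs_exact() compares
registers minus the ID check:

        /* compare only the value-tracking fields, laid out before 'id'
         * in struct bpf_reg_state */
        static bool scalars_exact_for_widen(const struct bpf_reg_state *rold,
                                            const struct bpf_reg_state *rcur)
        {
                return memcmp(rold, rcur, offsetof(struct bpf_reg_state, id)) == 0;
        }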
Puranjay Mohan [Tue, 3 Feb 2026 16:50:58 +0000 (08:50 -0800)]
bpf: Clear singular ids for scalars in is_state_visited()
The verifier assigns ids to scalar registers/stack slots when they are
linked through a mov or stack spill/fill instruction. These ids are
later used to propagate newly found bounds from one register to all
registers that share the same id. The verifier also compares the ids of
these registers in current state and cached state when making pruning
decisions.
When an ID becomes singular (i.e., only a single register or stack slot
has that ID), it can no longer participate in bounds propagation. During
comparisons between current and cached states for pruning decisions,
however, such stale IDs can prevent pruning of otherwise equivalent
states.
Find and clear all singular ids before caching a state in
is_state_visited(). struct bpf_idset, which is currently unused, has been
repurposed for this use case.
Puranjay Mohan [Tue, 3 Feb 2026 16:50:57 +0000 (08:50 -0800)]
bpf: Let the verifier assign ids on stack fills
The next commit will allow clearing of scalar ids if no other
register/stack slot has that id. This is because if only one register
has a unique id, it can't participate in bounds propagation and is
equivalent to having no id.
But if the id of a stack slot is cleared by clear_singular_ids() in the
next commit, reading that stack slot into a register will not establish
a link because the stack slot's id is cleared.
This can happen in a situation where a register is spilled and later
loses its id due to a multiply operation (for example) and then the
stack slot's id becomes singular and can be cleared.
Make sure that scalar stack slots have an id before we read them into a
register.
Jiri Olsa [Mon, 2 Feb 2026 07:58:49 +0000 (08:58 +0100)]
ftrace: Fix direct_functions leak in update_ftrace_direct_del
Alexei reported memory leak in update_ftrace_direct_del.
We miss cleanup of the replaced direct_functions in the
success path in update_ftrace_direct_del, adding that.
Changes:
v4 -> v5:
* Address comment from Alexei:
* Rename helper bpf_link_prog_session_cookie() to
bpf_prog_calls_session_cookie().
* v4: https://lore.kernel.org/bpf/20260129154953.66915-1-leon.hwang@linux.dev/
v3 -> v4:
* Add a log when !bpf_jit_supports_fsession() in patch #1 (per AI).
* v3: https://lore.kernel.org/bpf/20260129142536.48637-1-leon.hwang@linux.dev/
v2 -> v3:
* Fix typo in subject and patch message of patch #1 (per AI and Chris).
* Collect Acked-by, and Tested-by from Puranjay, thanks.
* v2: https://lore.kernel.org/bpf/20260128150112.8873-1-leon.hwang@linux.dev/
Leon Hwang [Sat, 31 Jan 2026 14:49:49 +0000 (22:49 +0800)]
bpf, arm64: Add fsession support
Implement fsession support in the arm64 BPF JIT trampoline.
Extend the trampoline stack layout to store function metadata and
session cookies, and pass the appropriate metadata to fentry and
fexit programs. This mirrors the existing x86 behavior and enables
session cookies on arm64.
Paul Chaignon [Sat, 31 Jan 2026 16:08:37 +0000 (17:08 +0100)]
bpf: Fix bpf_xdp_store_bytes proto for read-only arg
While making some maps in Cilium read-only from the BPF side, we noticed
that the bpf_xdp_store_bytes proto is incorrect. In particular, the
verifier was throwing the following error:
nat comes from a BPF_F_RDONLY_PROG map, so R3 is a PTR_TO_MAP_VALUE.
The verifier checks the helper's memory access to R3 in
check_mem_size_reg, as it reaches ARG_CONST_SIZE argument. The third
argument has expected type ARG_PTR_TO_UNINIT_MEM, which includes the
MEM_WRITE flag. The verifier thus checks for a BPF_WRITE access on R3.
Given R3 points to a read-only map, the check fails.
Conversely, ARG_PTR_TO_UNINIT_MEM can also lead to the helper reading
from uninitialized memory.
This patch simply fixes the expected argument type to match that of
bpf_skb_store_bytes.
====================
The BPF verifier validates pointers to special map fields (timers,
workqueues, task_work) through separate functions that share nearly
identical logic. This creates code duplication because of the
inconsistent data structure layout in struct bpf_call_arg_meta and
struct bpf_kfunc_call_arg_meta.
This series contains 2 commits:
1. Introduces struct bpf_map_desc to provide a unified representation
for map pointer and uid tracking. Previously, bpf_call_arg_meta used
separate map_ptr and map_uid fields while bpf_kfunc_call_arg_meta used an
anonymous inline struct. This inconsistency made it harder to share
validation code between the two paths.
2. Consolidates the validation logic for BPF_TIMER, BPF_WORKQUEUE, and
BPF_TASK_WORK field types into a single check_map_field_pointer()
function. This eliminates process_wq_func() and process_task_work_func()
entirely, and simplifies process_timer_func() to just the PREEMPT_RT
check before calling the unified validation. The result is fewer
lines of code with clearer structure for future maintenance.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Changes in v2:
- Added Signed-off-by to the top commit.
- Link to v1: https://lore.kernel.org/r/20260129-verif_special_fields-v1-0-d310b7f146c8@meta.com
====================
Mykyta Yatsenko [Fri, 30 Jan 2026 20:42:10 +0000 (20:42 +0000)]
bpf: Introduce struct bpf_map_desc in verifier
Introduce struct bpf_map_desc to hold bpf_map pointer and map uid. Use
this struct in both bpf_call_arg_meta and bpf_kfunc_call_arg_meta
instead of having different representations:
- bpf_call_arg_meta had separate map_ptr and map_uid fields
- bpf_kfunc_call_arg_meta had an anonymous inline struct
This unifies the map fields layout across both metadata structures,
making the code more consistent and preparing for further refactoring of
map field pointer validation.
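The new struct is essentially just the pair described above (a sketch;
the field names are assumptions):

        struct bpf_map_desc {
                struct bpf_map *ptr;  /* the map the argument points into */
                u32 uid;              /* map_uid, distinguishes inner maps */
        };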
====================
x86/fgraph,bpf: Fix ORC stack unwind from kprobe_multi
hi,
Mahe reported missing function from stack trace on top of kprobe multi
program. It turned out the latest fix [1] needs some more fixing.
v2 changes:
- keep the unwind same as for kprobes, attached function
is part of entry probe stacktrace, not kretprobe [Steven]
- several change in trigger bench [Andrii]
- added selftests for standard kprobes and fentry/fexit probes [Andrii]
Note I'll try to add similar stacktrace adjustment for fentry/fexit
in separate patchset to not complicate this change.
Jiri Olsa [Mon, 26 Jan 2026 21:18:33 +0000 (22:18 +0100)]
x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs path
Mahe reported missing function from stack trace on top of kprobe
multi program. The missing function is the very first one in the
stacktrace, the one that the bpf program is attached to.
The reason is that the previous change (the Fixes commit) fixed
stack unwind for tracepoint, but removed attached function address
from the stack trace on top of kprobe multi programs, which I also
overlooked in the related test (check following patch).
The tracepoint and kprobe_multi have different stack setup, but use
same unwind path. I think it's better to keep the previous change,
which fixed tracepoint unwind and instead change the kprobe multi
unwind as explained below.
The bpf program stack unwind calls perf_callchain_kernel for the kernel
portion and it follows two unwind paths based on the X86_EFLAGS_FIXED
bit in pt_regs.flags.
When the bit is set, we unwind from the stack represented by the pt_regs
argument; otherwise we unwind the currently executed stack up to the
'first_frame' boundary.
The 'first_frame' value is taken from regs.rsp value, but ftrace_caller
and ftrace_regs_caller (ftrace trampoline) functions set the regs.rsp
to the previous stack frame, so we skip the attached function entry.
If we switch kprobe_multi unwind to use the X86_EFLAGS_FIXED bit,
we set the start of the unwind to the attached function address.
As another benefit we also cut extra unwind cycles needed to reach
the 'first_frame' boundary.
The speedup can be measured with trigger bench for kprobe_multi
program and stacktrace support.
The return probe skips the attached function, because it's no longer
on the stack at the point of the unwind and this way is the same how
standard kretprobe works.
Jiri Olsa [Mon, 26 Jan 2026 21:18:32 +0000 (22:18 +0100)]
x86/fgraph: Fix return_to_handler regs.rsp value
The previous change (the Fixes commit) messed up the rsp register value:
it is already adjusted by FRAME_SIZE, but we need the original rsp value.
This change does not affect the current fprobe kernel unwind, i.e. the
!perf_hw_regs path in perf_callchain_kernel:
	if (perf_hw_regs(regs)) {
		if (perf_callchain_store(entry, regs->ip))
			return;
		unwind_start(&state, current, regs, NULL);
	} else {
		unwind_start(&state, current, NULL, (void *)regs->sp);
	}
which uses pt_regs.sp as the first_frame boundary (the FRAME_SIZE shift
makes no difference, the unwind still stops at the right frame).
This change fixes the other path, where we want to unwind directly from
the pt_regs sp/fp/ip state, which is coming in the following change.
Fixes: 20a0bc10272f ("x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/bpf/20260126211837.472802-2-jolsa@kernel.org
Changwoo Min [Fri, 30 Jan 2026 02:18:43 +0000 (11:18 +0900)]
selftests/bpf: Make bpf get_preempt_count() work for v6.14+ kernels
Recent x86 kernels export __preempt_count as a ksym, while some old kernels
between v6.1 and v6.14 expose the preemption counter via
pcpu_hot.preempt_count. The existing selftest helper unconditionally
dereferenced __preempt_count, which breaks BPF program loading on such old
kernels.
Make the x86 preemption count lookup version-agnostic by:
- Marking __preempt_count and pcpu_hot as weak ksyms.
- Introducing a BTF-described pcpu_hot___local layout with
preserve_access_index.
- Selecting the appropriate access path at runtime using ksym availability
and bpf_ksym_exists() and bpf_core_field_exists().
This allows a single BPF binary to run correctly across kernel versions
(e.g., v6.18 vs. v6.13) without relying on compile-time version checks.
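For reference, a version-agnostic helper can look roughly like the sketch
below (not the selftest verbatim; the flavor struct and access paths just
follow the description above):

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* CO-RE "flavor" of the old layout, matched against the kernel's
 * struct pcpu_hot when it exists. */
struct pcpu_hot___local {
	int preempt_count;
} __attribute__((preserve_access_index));

extern int __preempt_count __ksym __weak;
extern struct pcpu_hot___local pcpu_hot __ksym __weak;

static __always_inline int get_preempt_count(void)
{
	/* Newer kernels: the per-CPU __preempt_count ksym is available. */
	if (bpf_ksym_exists(&__preempt_count))
		return *(int *)bpf_this_cpu_ptr(&__preempt_count);

	/* Older kernels: read pcpu_hot.preempt_count via CO-RE. */
	if (bpf_core_field_exists(struct pcpu_hot___local, preempt_count)) {
		struct pcpu_hot___local *hot = bpf_this_cpu_ptr(&pcpu_hot);

		return hot->preempt_count;
	}

	return 0;
}
```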
====================
This patchset allows sleepable programs to use tail calls.
At the moment we need a separate sleepable uprobe program
to retrieve user space data and pass it to a complex program with
tail calls. It'd be great if the program with tail calls could
be sleepable and do the data retrieval directly.
Jiri Olsa [Fri, 30 Jan 2026 08:12:08 +0000 (09:12 +0100)]
selftests/bpf: Add test for sleepable program tailcalls
Adding test that makes sure we can't mix sleepable and non-sleepable
bpf programs in the BPF_MAP_TYPE_PROG_ARRAY map and that we can do
tail call in the sleepable program.
Jiri Olsa [Fri, 30 Jan 2026 08:12:07 +0000 (09:12 +0100)]
bpf: Allow sleepable programs to use tail calls
Allowing sleepable programs to use tail calls.
Making sure we can't mix sleepable and non-sleepable bpf programs
in tail call map (BPF_MAP_TYPE_PROG_ARRAY) and allowing it to be
used in sleepable programs.
Sleepable programs can be preempted and sleep, which might bring a
new source of race conditions, but both direct and indirect tail
calls should not be affected.
Direct tail calls work by patching a direct jump to the callee into the
bpf caller program, so no problem there. We atomically switch from a nop
to a jump instruction.
Indirect tail call reads the callee from the map and then jumps to
it. The callee bpf program can't disappear (be released) from the
caller, because it is executed under rcu lock (rcu_read_lock_trace).
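A minimal sketch of what this enables (attach target and program body are
placeholders, not taken from the patchset):

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

char buf[64];
const volatile __u64 user_ptr;	/* set by the loader before attach */

/* Sleepable uprobe: may fault while copying user memory, then tail-calls
 * into a prog array that holds only sleepable programs. */
SEC("uprobe.s")
int entry(struct pt_regs *ctx)
{
	bpf_copy_from_user(buf, sizeof(buf), (void *)user_ptr);
	bpf_tail_call(ctx, &jmp_table, 0);
	return 0;
}

char _license[] SEC("license") = "GPL";
```

Filling jmp_table with a non-sleepable program (or the other way around)
is expected to be rejected, per the mixing check described above.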
====================
bpf: Fix verifier_bug_if to account for BPF_CALL
This fixes the verifier_bug_if() that runs on nospec_result to not trigger
for BPF_CALL (bug reported by Hu, Mei, and Mu). See patch 1 for a full
description and patch 2 for a test (based on the PoC from the report).
While working on this I noticed two other problems:
- nospec_result is currently ignored for BPF_CALL during patching, but it
may be required if we assume the CPU may speculate into/out of functions.
- Both the instruction patching for nospec and nospec_result erase the
  instruction aux information even though it might be better to keep it.
  For nospec_result it may be fine, as it is currently only applied to store
  instructions (except for when we decide to change the point above), but
  nospec may be set for arbitrary instructions, and if these require
  rewrites they break.
I assume these issues are better fixed separately, thus I decided to
exclude them from this series.
====================
Luis Gerhorst [Tue, 27 Jan 2026 11:59:11 +0000 (12:59 +0100)]
bpf: Fix verifier_bug_if to account for BPF_CALL
The BPF verifier assumes `insn_aux->nospec_result` is only set for
direct memory writes (e.g., `*(u32*)(r1+off) = r2`). However, the
assertion fails to account for helper calls (e.g.,
`bpf_skb_load_bytes_relative`) that perform writes to stack memory. Make
the check more precise to resolve this.
The problem is that `BPF_CALL` instructions have `BPF_CLASS(insn->code)
== BPF_JMP`, which triggers the warning check:
- Helpers like `bpf_skb_load_bytes_relative` write to stack memory
- `check_helper_call()` loops through `meta.access_size`, calling
`check_mem_access(..., BPF_WRITE)`
- `check_stack_write()` sets `insn_aux->nospec_result = 1`
- Since `BPF_CALL` is encoded as `BPF_JMP | BPF_CALL`, the warning fires
Execution flow:
```
1. Drop capabilities → Enable Spectre mitigation
2. Load BPF program
└─> do_check()
├─> check_cond_jmp_op() → Marks dead branch as speculative
│ └─> push_stack(..., speculative=true)
├─> pop_stack() → state->speculative = 1
├─> check_helper_call() → Processes helper in dead branch
│ └─> check_mem_access(..., BPF_WRITE)
│ └─> insn_aux->nospec_result = 1
└─> Checks: state->speculative && insn_aux->nospec_result
└─> BPF_CLASS(insn->code) == BPF_JMP → WARNING
```
To fix the assert, it would be nice to be able to reuse
bpf_insn_successors() here, but bpf_insn_successors()->cnt is not
exactly what we want as it may also be 1 for BPF_JA. Instead, we could
check opcode_info.can_jump, but then we would have to share the table
between the functions. This would mean moving the table out of the
function and adding bpf_opcode_info(). As the verifier_bug_if() only
runs for insns with nospec_result set, the impact on verification time
would likely still be negligible. However, I assume sharing
bpf_opcode_info() between liveness.c and verifier.c will not be worth
it. It seems only adjust_jmp_off() could also be simplified using it,
and there imm/off is touched as well. Thus it is maybe better to rely on
exact opcode/class matching there.
Therefore, to avoid this sharing only for a verifier_bug_if(), just
check the opcode. This should now cover all opcodes for which can_jump
in bpf_insn_successors() is true.
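A hedged sketch of that opcode check, as one could write it (illustration
only, not the actual hunk):

```c
/* Illustration: a helper call (BPF_JMP | BPF_CALL) may legitimately set
 * nospec_result via its stack writes, so only genuine jump opcodes
 * should trip the assertion. */
static bool insn_is_jump(const struct bpf_insn *insn)
{
	u8 class = BPF_CLASS(insn->code);

	if (class != BPF_JMP && class != BPF_JMP32)
		return false;

	return BPF_OP(insn->code) != BPF_CALL;
}
```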
Parts of the description and example are taken from the bug report.
Fixes: dadb59104c64 ("bpf: Fix aux usage after do_check_insn()") Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/7678017d-b760-4053-a2d8-a6879b0dbeeb@hust.edu.cn/ Link: https://lore.kernel.org/r/20260127115912.3026761-2-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ihor Solodrai [Wed, 28 Jan 2026 21:12:55 +0000 (13:12 -0800)]
bpftool: Fix dependencies for static build
When building selftests/bpf with EXTRA_LDFLAGS=-static the following
error happens:
LINK /ws/linux/tools/testing/selftests/bpf/tools/build/bpftool/bootstrap/bpftool
/usr/bin/x86_64-linux-gnu-ld.bfd: /usr/lib/gcc/x86_64-linux-gnu/15/../../../x86_64-linux-gnu/libcrypto.a(libcrypto-lib-dso_dlfcn.o): in function `dlfcn_globallookup':
[...]
/usr/bin/x86_64-linux-gnu-ld.bfd: /usr/lib/gcc/x86_64-linux-gnu/15/../../../x86_64-linux-gnu/libcrypto.a(libcrypto-lib-c_zlib.o): in function `zlib_oneshot_expand_block':
(.text+0xc64): undefined reference to `uncompress'
/usr/bin/x86_64-linux-gnu-ld.bfd: /usr/lib/gcc/x86_64-linux-gnu/15/../../../x86_64-linux-gnu/libcrypto.a(libcrypto-lib-c_zlib.o): in function `zlib_oneshot_compress_block':
(.text+0xce4): undefined reference to `compress'
collect2: error: ld returned 1 exit status
make[1]: *** [Makefile:252: /ws/linux/tools/testing/selftests/bpf/tools/build/bpftool/bootstrap/bpftool] Error 1
make: *** [Makefile:327: /ws/linux/tools/testing/selftests/bpf/tools/sbin/bpftool] Error 2
make: *** Waiting for unfinished jobs....
This is caused by the wrong order of dependencies in the Makefile. Fix it.
Mykyta Yatsenko [Wed, 28 Jan 2026 19:05:51 +0000 (19:05 +0000)]
selftests/bpf: Remove xxd util dependency
The verification signature header generation requires converting a
binary certificate to a C array. Previously this only worked with
xxd (part of vim-common package).
As xxd may not be available on some systems building selftests, it makes
sense to substitute it with more common utils (hexdump, wc, sed) to
generate equivalent C array output.
Tested by generating header with both xxd and hexdump and comparing
them.
The speedup seems to be related to the fact that with a single ftrace_ops
object we don't call ftrace_shutdown anymore (we use ftrace_update_ops
instead) and we skip the synchronize_rcu calls (~100ms each) at the end of
that function.
v6 changes:
- rename add_hash_entry_direct to add_ftrace_hash_entry_direct [Steven]
- factor hash_add/hash_sub [Steven]
- add kerneldoc header for update_ftrace_direct_* functions [Steven]
- few assorted smaller fixes [Steven]
- added missing direct_ops wrappers for !CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
case [Steven]
v5 changes:
- do not export ftrace_hash object [Steven]
- fix update_ftrace_direct_add new_filter_hash leak [ci]
v4 changes:
- rebased on top of bpf-next/master (with jmp attach changes)
added patch 1 to deal with that
- added extra checks for update_ftrace_direct_del/mod to address
the ci bot review
v3 changes:
- rebased on top of bpf-next/master
- fixed update_ftrace_direct_del cleanup path
- added missing inline to update_ftrace_direct_* stubs
v2 changes:
- rebased on top of bpf-next/master plus Song's livepatch fixes [1]
- renamed the API functions [2] [Steven]
- do not export the new api [Steven]
- kept the original direct interface:
  I'm not sure if we want to melt both *_ftrace_direct and the new interface
  into a single one. It's a bit different in semantics (hence the name change
  as Steven suggested [2]) and the changes are not that big, so we can easily
  keep both APIs.
v1 changes:
- make the change x86 specific, after discussing with Mark options for
arm64 [Mark]
Jiri Olsa [Tue, 30 Dec 2025 14:50:10 +0000 (15:50 +0100)]
bpf,x86: Use single ftrace_ops for direct calls
Using a single ftrace_ops object for direct call updates instead of
allocating an ftrace_ops object for each trampoline.
With a single ftrace_ops object we can use the update_ftrace_direct_* API
that allows multiple ip site updates on a single ftrace_ops object.
Adding HAVE_SINGLE_FTRACE_DIRECT_OPS config option to be enabled on
each arch that supports this.
At the moment we can enable this only on x86, because arm relies
on the ftrace_ops object representing just a single trampoline image (stored
in ftrace_ops::direct_call). Archs that do not support this will continue
to use the *_ftrace_direct API.
Jiri Olsa [Tue, 30 Dec 2025 14:50:07 +0000 (15:50 +0100)]
ftrace: Add update_ftrace_direct_mod function
Adding the update_ftrace_direct_mod function that modifies all entries
(ip -> direct) provided in the hash argument on the direct ftrace ops and
updates its attachments.
The difference to the current modify_ftrace_direct is:
- a hash argument that allows modifying multiple ip -> direct
  entries at once
This change will allow us to have simple ftrace_ops for all bpf
direct interface users in following changes.
Jiri Olsa [Tue, 30 Dec 2025 14:50:06 +0000 (15:50 +0100)]
ftrace: Add update_ftrace_direct_del function
Adding the update_ftrace_direct_del function that removes all entries
(ip -> addr) provided in the hash argument from the direct ftrace ops and
updates its attachments.
The difference to the current unregister_ftrace_direct is:
- a hash argument that allows unregistering multiple ip -> direct
  entries at once
- we can call update_ftrace_direct_del multiple times on the
  same ftrace_ops object, because we do not need to unregister
  all entries at once; we can do it gradually with the help of
  the ftrace_update_ops function
This change will allow us to have simple ftrace_ops for all bpf
direct interface users in following changes.
Jiri Olsa [Tue, 30 Dec 2025 14:50:05 +0000 (15:50 +0100)]
ftrace: Add update_ftrace_direct_add function
Adding the update_ftrace_direct_add function that adds all entries
(ip -> addr) provided in the hash argument to the direct ftrace ops
and updates its attachments.
The difference to the current register_ftrace_direct is:
- a hash argument that allows registering multiple ip -> direct
  entries at once
- we can call update_ftrace_direct_add multiple times on the
  same ftrace_ops object, because after the first registration with
  register_ftrace_function_nolock, it uses ftrace_update_ops to
  update the ftrace_ops object
This change will allow us to have simple ftrace_ops for all bpf
direct interface users in following changes.
Jiri Olsa [Tue, 30 Dec 2025 14:50:02 +0000 (15:50 +0100)]
ftrace,bpf: Remove FTRACE_OPS_FL_JMP ftrace_ops flag
At the moment we allow the jmp attach only for ftrace_ops that
have FTRACE_OPS_FL_JMP set. This conflicts with the following changes
where we use a single ftrace_ops object for all direct call sites,
so all could be attached via either call or jmp.
We already limit the jmp attach support with a config option and a bit
(LSB) set on the trampoline address. It turns out that's actually
enough to limit the jmp attach per architecture and only to chosen
addresses (with the LSB bit set).
Each user of register_ftrace_direct or modify_ftrace_direct can set
the trampoline bit (LSB) to indicate it has to be attached by jmp.
The bpf trampoline generation code uses trampoline flags to generate
jmp-attach specific code and the ftrace inner code uses the trampoline
bit (LSB) to handle the return from a jmp attachment, so there's no harm
in removing the FTRACE_OPS_FL_JMP flag.
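For illustration, a caller would request the jmp attach roughly like this
(a sketch based on the description above; the wrapper itself is hypothetical):

```c
/* Sketch: tag the trampoline address with its LSB to request jmp attach;
 * plain (even) addresses keep the regular call attach. */
static int attach_trampoline(struct ftrace_ops *ops, unsigned long tr_addr,
			     bool attach_by_jmp)
{
	if (attach_by_jmp)
		tr_addr |= 1UL;	/* LSB = attach via jmp */

	return register_ftrace_direct(ops, tr_addr);
}
```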
The fexit/fmodret performance stays the same (did not drop),
current code: