9 days ago  bpf: make kprobe_multi_link_prog_run always_inline
Menglong Dong [Wed, 26 Nov 2025 08:52:46 +0000 (16:52 +0800)] 
bpf: make kprobe_multi_link_prog_run always_inline

Make kprobe_multi_link_prog_run() always inline to obtain better
performance. Before this patch, the bench performance is:

./bench trig-kprobe-multi
Setting up benchmark 'trig-kprobe-multi'...
Benchmark 'trig-kprobe-multi' started.
Iter   0 ( 95.485us): hits   62.462M/s ( 62.462M/prod), [...]
Iter   1 (-80.054us): hits   62.486M/s ( 62.486M/prod), [...]
Iter   2 ( 13.572us): hits   62.287M/s ( 62.287M/prod), [...]
Iter   3 ( 76.961us): hits   62.293M/s ( 62.293M/prod), [...]
Iter   4 (-77.698us): hits   62.394M/s ( 62.394M/prod), [...]
Iter   5 (-13.399us): hits   62.319M/s ( 62.319M/prod), [...]
Iter   6 ( 77.573us): hits   62.250M/s ( 62.250M/prod), [...]
Summary: hits   62.338 ± 0.083M/s ( 62.338M/prod)

And after this patch, the performance is:

Iter   0 (454.148us): hits   66.900M/s ( 66.900M/prod), [...]
Iter   1 (-435.540us): hits   68.925M/s ( 68.925M/prod), [...]
Iter   2 (  8.223us): hits   68.795M/s ( 68.795M/prod), [...]
Iter   3 (-12.347us): hits   68.880M/s ( 68.880M/prod), [...]
Iter   4 (  2.291us): hits   68.767M/s ( 68.767M/prod), [...]
Iter   5 ( -1.446us): hits   68.756M/s ( 68.756M/prod), [...]
Iter   6 ( 13.882us): hits   68.657M/s ( 68.657M/prod), [...]
Summary: hits   68.792 ± 0.087M/s ( 68.792M/prod)

As we can see, the performance of kprobe-multi increases from 62M/s to
68M/s.
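
A minimal sketch of the kind of change involved, assuming the helper
still lives in kernel/trace/bpf_trace.c; the parameter list below is
abbreviated and illustrative, not the exact upstream signature:

  /* Force inlining of the hot per-probe helper so the compiler cannot
   * emit it as an out-of-line call on the kprobe-multi fast path.
   */
  static __always_inline int
  kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link *link,
                             unsigned long entry_ip,
                             struct ftrace_regs *fregs)
  {
          /* ... set up the run context and run link->link.prog ... */
          return 0;
  }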

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20251126085246.309942-1-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
9 days ago  Merge branch 'selftests-bpf-convert-test_tc_edt-sh-into-test_progs'
Alexei Starovoitov [Sat, 29 Nov 2025 17:37:41 +0000 (09:37 -0800)] 
Merge branch 'selftests-bpf-convert-test_tc_edt-sh-into-test_progs'

Alexis Lothoré says:

====================
selftests/bpf: convert test_tc_edt.sh into test_progs

Hello,
this is a (late) v2 to my first attempt to convert the test_tc_edt
script to test_progs. This new version is way simpler, thanks to
Martin's suggestion about properly using the existing network_helpers
rather than reinventing the wheel. It also fixes a small bug in the
measured effective rate.

The converted test roughly follows the original script logic, with two
veths in two namespaces, a TCP connection between a client and a server,
and the client pushing a specific amount of data. Time is recorded
before and after the transmission to compute the effective rate.

There are two knobs driving the robustness of the test in CI:
- the amount of pushed data (the higher, the more precise the
  effective rate)
- the tolerated error margin

The original test was configured with a 20s duration and a 1% error
margin. The new test is configured with 1MB of data being pushed and a
2% error margin, to:
- make the duration tolerable in CI
- while keeping enough margin for rate measurement fluctuations depending
  on the CI machines' load

This has been run multiple times locally to ensure that those values are
sane, and once in CI before sending the series, but I suggest letting it
live a few days in CI to see how it really behaves.

Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
Changes in v2:
- drop custom client/server management
- update bpf program now that server pushes data
- fix effective rate computation
- Link to v1: https://lore.kernel.org/r/20251031-tc_edt-v1-0-5d34a5823144@bootlin.com

---
Alexis Lothoré (eBPF Foundation) (4):
      selftests/bpf: rename test_tc_edt.bpf.c section to expose program type
      selftests/bpf: integrate test_tc_edt into test_progs
      selftests/bpf: remove test_tc_edt.sh
      selftests/bpf: do not hardcode target rate in test_tc_edt BPF program

 tools/testing/selftests/bpf/Makefile               |   2 -
 .../testing/selftests/bpf/prog_tests/test_tc_edt.c | 145 +++++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_tc_edt.c    |  11 +-
 tools/testing/selftests/bpf/test_tc_edt.sh         | 100 --------------
 4 files changed, 151 insertions(+), 107 deletions(-)
---
base-commit: 233a075a1b27070af76d64541cf001340ecff917
change-id: 20251030-tc_edt-3ea8e8d3d14e

Best regards,
====================

Link: https://patch.msgid.link/20251128-tc_edt-v2-0-26db48373e73@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
9 days ago  selftests/bpf: do not hardcode target rate in test_tc_edt BPF program
Alexis Lothoré (eBPF Foundation) [Fri, 28 Nov 2025 22:27:21 +0000 (23:27 +0100)] 
selftests/bpf: do not hardcode target rate in test_tc_edt BPF program

test_tc_edt currently defines the target rate in both the userspace and
BPF parts. This value could be defined once, in the userspace part, if
userspace is able to configure the BPF program before starting the test.

Add a target_rate variable in the BPF part, and make the userspace part
set it to the desired rate before attaching the shaping program.
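
A sketch of what this wiring typically looks like with a libbpf
skeleton; the variable name target_rate comes from the patch
description, while the skeleton accessors and the 5 MB/s constant are
illustrative:

  /* BPF side (progs/test_tc_edt.c): a global userspace sets before load. */
  volatile const __u64 target_rate; /* bytes per second */

  /* Userspace side: configure the rate, then load and attach. */
  struct test_tc_edt *skel = test_tc_edt__open();

  if (!ASSERT_OK_PTR(skel, "open"))
          return;
  skel->rodata->target_rate = 5 * 1000 * 1000;
  err = test_tc_edt__load(skel);
  /* ... attach the shaping program and run the transfer ... */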

Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
Link: https://lore.kernel.org/r/20251128-tc_edt-v2-4-26db48373e73@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
9 days ago  selftests/bpf: remove test_tc_edt.sh
Alexis Lothoré (eBPF Foundation) [Fri, 28 Nov 2025 22:27:20 +0000 (23:27 +0100)] 
selftests/bpf: remove test_tc_edt.sh

Now that test_tc_edt has been integrated in test_progs, remove the
legacy shell script.

Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
Link: https://lore.kernel.org/r/20251128-tc_edt-v2-3-26db48373e73@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
9 days ago  selftests/bpf: integrate test_tc_edt into test_progs
Alexis Lothoré (eBPF Foundation) [Fri, 28 Nov 2025 22:27:19 +0000 (23:27 +0100)] 
selftests/bpf: integrate test_tc_edt into test_progs

test_tc_edt.sh uses a pair of veths and a BPF program attached to the TX
veth to shape the traffic to 5MBps. It then checks that the number of
received bytes (at the interface level), compared to the TX duration,
indeed matches 5Mbps.

Convert this test script to the test_progs framework:
- keep the double veth setup, isolated in two namespaces
- run a small TCP server, and connect a client to it
- push a pre-configured number of bytes, and measure how much time was
  needed to push them
- ensure that the resulting rate is within a 2% error margin around the
  target rate

This two percent margin, while tight, is hopefully large enough to keep
the test from being too flaky in CI, while the conversion also turns the
test into a small example of BPF-based shaping.
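
A rough sketch of the final measurement step described above; the
constants and the helper name below are illustrative rather than the
exact ones used in prog_tests/test_tc_edt.c:

  #define TARGET_RATE     (5 * 1000 * 1000)       /* bytes per second */
  #define ERR_PERCENT     2

  /* Derive the effective rate from the bytes pushed and the elapsed
   * time, and require it to stay within +/- 2% of the target.
   */
  static void check_rate(__u64 bytes, __u64 start_ns, __u64 end_ns)
  {
          double elapsed = (end_ns - start_ns) / 1e9;
          double rate = bytes / elapsed;
          double margin = TARGET_RATE * ERR_PERCENT / 100.0;

          ASSERT_GE(rate, TARGET_RATE - margin, "rate above lower bound");
          ASSERT_LE(rate, TARGET_RATE + margin, "rate below upper bound");
  }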

Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
Link: https://lore.kernel.org/r/20251128-tc_edt-v2-2-26db48373e73@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
9 days ago  selftests/bpf: rename test_tc_edt.bpf.c section to expose program type
Alexis Lothoré (eBPF Foundation) [Fri, 28 Nov 2025 22:27:18 +0000 (23:27 +0100)] 
selftests/bpf: rename test_tc_edt.bpf.c section to expose program type

The test_tc_edt BPF program uses a custom section name, which works fine
when manually loading it with tc, but prevents it from being loaded with
libbpf.

Update the program section name to "tc" to be able to manipulate it with
a libbpf-based C test.

Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
Link: https://lore.kernel.org/r/20251128-tc_edt-v2-1-26db48373e73@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
9 days ago  Merge branch 'limited-queueing-in-nmi-for-rqspinlock'
Alexei Starovoitov [Sat, 29 Nov 2025 17:35:36 +0000 (09:35 -0800)] 
Merge branch 'limited-queueing-in-nmi-for-rqspinlock'

Kumar Kartikeya Dwivedi says:

====================
Limited queueing in NMI for rqspinlock

Ritesh reported that he was frequently seeing timeouts in cases which
should have been covered by the AA heuristics. This led to the discovery
of multiple gaps in the current code that could lead to timeouts that
the AA heuristics could have prevented. More details and the
investigation are available in the original threads. [0][1]

This set restores the ability for NMI waiters to queue in the slow path,
and reduces the cases where they would attempt to trylock. However, such
queueing must not happen when interrupting waiters which the NMI itself
depends upon for forward progress; in those cases the trylock fallback
remains, but with a single attempt to avoid aimless attempts to acquire
the lock.

It also closes a possible window in the lock fast path and the unlock
path where NMIs landing between cmpxchg and entry creation, or entry
deletion and unlock would miss the detection of an AA scenario and end
up timing out.

This virtually eliminates timeouts in all the cases where the existing
heuristics can prevent them and quickly recover from a deadlock. More
details are available in the commit logs for each patch.

  [0]: https://lore.kernel.org/bpf/CAH6OuBTjG+N=+GGwcpOUbeDN563oz4iVcU3rbse68egp9wj9_A@mail.gmail.com
  [1]: https://lore.kernel.org/bpf/20251125203253.3287019-1-memxor@gmail.com
====================

Link: https://patch.msgid.link/20251128232802.1031906-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
9 days ago  selftests/bpf: Add success stats to rqspinlock stress test
Kumar Kartikeya Dwivedi [Fri, 28 Nov 2025 23:28:02 +0000 (23:28 +0000)] 
selftests/bpf: Add success stats to rqspinlock stress test

Add stats to observe the success and failure rate of lock acquisition
attempts in various contexts.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
9 days ago  rqspinlock: Precede non-head waiter queueing with AA check
Kumar Kartikeya Dwivedi [Fri, 28 Nov 2025 23:28:01 +0000 (23:28 +0000)] 
rqspinlock: Precede non-head waiter queueing with AA check

While previous commits sufficiently address the deadlocks, there are
still scenarios where queueing of waiters in NMIs can exacerbate the
possibility of timeouts.

Consider the case below:

CPU 0
<NMI>
res_spin_lock(A) -> becomes non-head waiter
</NMI>
lock owner in CS or pending waiter spinning

CPU 1
res_spin_lock(A) -> head waiter spinning on owner/pending bits

In such a scenario, the non-head waiter in the NMI on CPU 0 will not
poll for deadlocks or timeouts, since it simply queues behind the
previous waiter (the head on CPU 1), and it will also not enter the
trylock fallback, since no rqspinlock queue waiter is active on CPU 0.
The transaction initiated by the head waiter on CPU 1 will therefore
time out, signalling the NMI and ending the cyclic dependency, but it
will cost 250 ms of time.

Instead, the NMI on CPU 0 could simply check for the presence of an AA
deadlock and only proceed with queueing on success. Add such a check
right before any form of queueing is initiated.
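
A sketch of where the added check sits in the slow path
(resilient_queued_spin_lock_slowpath()); check_deadlock_AA() is the
existing helper, while the error label shown here is illustrative:

  /*
   * Before touching the MCS node and xchg_tail(), bail out if this CPU
   * already holds the lock we are about to queue on: queueing behind
   * another waiter would otherwise hide the AA deadlock until the
   * 250 ms timeout fires.
   */
  ret = check_deadlock_AA(lock, mask, &ts);
  if (ret)
          goto err_release_entry; /* propagate -EDEADLK to the caller */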

The reason the AA deadlock check is not used in conjunction with
in_nmi() is that a similar case could occur due to a reentrant path
in the owner's critical section, and unconditionally checking for AA
before entering the queueing path avoids expensive timeouts. Non-NMI
reentrancy only happens at controlled points in the slow path (with
specific tracepoints which do not impede the forward progress of a
waiter loop), or in the owner CS, while NMIs can land anywhere.

While this check is only needed for non-head waiter queueing, checking
whether we are head or not is racy without xchg_tail, and after that
point, we are already queued, hence for simplicity we must invoke the
check unconditionally.

Note that a more contrived case could still be constructed by using two
locks, and interrupting the progress of the respective owners by
non-head waiters of the other lock, in an ABBA fashion, which would
still not be covered by the current set of checks and conditions. It
would still lead to a timeout though, and not a deadlock. An ABBA check
cannot happen optimistically before the queueing, since it can be racy,
and needs to happen continuously during the waiting period, which
would then require an unlinking step for queued NMI/reentrant waiters.
This is beyond the scope of this patch.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
9 days ago  rqspinlock: Disable spinning for trylock fallback
Kumar Kartikeya Dwivedi [Fri, 28 Nov 2025 23:28:00 +0000 (23:28 +0000)] 
rqspinlock: Disable spinning for trylock fallback

The original trylock fallback was inherited from qspinlock, and then
reused for the reentrant NMIs while the slow path is active. However,
under contention, it is very unlikely for the trylock to succeed in
taking the lock. In addition, a trylock also has no fairness guarantees,
and thus is prone to starvation issues under extreme scenarios.

The original qspinlock had no choice in terms of returning an error to
the caller; if the node count was breached, it had to fall back to trylock
to attempt to take the lock. In case of rqspinlock, we do have the
option of returning to the user. Thus, simply attempt the trylock once,
and instead of spinning, return an error in case the lock cannot be
taken.
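
A sketch of the resulting fallback, with the old spin-until-timeout
loop replaced by a single attempt; the error value and labels are
illustrative:

  /*
   * One attempt instead of spinning: under contention the lock word is
   * rarely 0, so retrying until the 250 ms timeout only burns time.
   */
  if (queued_spin_trylock(lock))
          goto release;
  ret = -EBUSY;
  goto err_release_node;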

This ends up significantly reducing the time spent in the trylock
fallback, since we no longer wait for the timeout duration aimlessly
trying to acquire the lock when there's a high probability that, under
contention, it won't be available to us anyway.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
9 days ago  rqspinlock: Use trylock fallback when per-CPU rqnode is busy
Kumar Kartikeya Dwivedi [Fri, 28 Nov 2025 23:27:59 +0000 (23:27 +0000)] 
rqspinlock: Use trylock fallback when per-CPU rqnode is busy

Instead of always deferring to the trylock fallback in NMIs, only do so
when an rqspinlock waiter is queued on the current CPU. This is detected
by noticing a non-zero node index. This allows NMI waiters to join the
waiter queue if they aren't interrupting an existing rqspinlock waiter,
and increases the chances of fairly obtaining the lock, performing
deadlock detection as the head, and not being starved while attempting
the trylock.
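
A sketch of the narrowed condition in the slow path; the per-CPU node
bookkeeping follows the existing rqspinlock code, but the exact
expression and label are illustrative:

  node = this_cpu_ptr(&rqnodes[0].mcs);
  idx = node->count;

  /*
   * Divert to the trylock fallback only when the node array is
   * exhausted, or when an NMI interrupted a context that already has
   * an rqspinlock waiter queued on this CPU (non-zero node index).
   */
  if (unlikely(idx >= _Q_MAX_NODES || (in_nmi() && idx)))
          goto trylock_fallback;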

The trylock path in particular is unlikely to succeed under contention,
as it relies on the lock word becoming 0, which indicates no contention.
This means that the most likely result for NMIs attempting a trylock is
a timeout under contention if they don't hit an AA or ABBA case.

The core problem being addressed through the fixed commit was removing
the dependency edge between an NMI queue waiter and the queue waiter it
is interrupting. Whenever a circular dependency forms, and with no way
to break it (as non-head waiters don't poll for deadlocks or timeouts),
we would enter into a deadlock. A trylock breaks such an edge either by
probing for deadlocks or, failing that, by terminating the waiting loop
using a timeout.

By excluding queueing on CPUs where the node index is non-zero for NMIs,
this sort of dependency is broken. The CPU enters the trylock path for
those cases, and falls back to deadlock checks and timeouts. However, in
the other cases, where it doesn't interrupt a CPU that is queued on the
lock in the slow path, it can join the queue as a normal waiter, and
avoid trylock-associated starvation and subsequent timeouts.

There are a few remaining cases here that matter: the NMI can still
preempt the owner in its critical section, and if it queues as a
non-head waiter, it can end up impeding the progress of the owner. While
this won't deadlock, since the head waiter will eventually signal the
NMI waiter to stop (due to a timeout), it can still lead to long
timeouts. These gaps will be addressed in subsequent commits.

Note that while the node count detection approach is less conservative
than simply deferring NMIs to trylock, it is going to return errors
where attempts to lock B in NMI happen while waiters for lock A are in a
lower context on the same CPU. However, this only occurs when the lower
context is queued in the slow path, and the NMI attempt can proceed
without failure in all other cases. To continue to prevent AA deadlocks
(or ABBA in a similar NMI interrupting lower context pattern), we'd need
a more fleshed out algorithm to unlink NMI waiters after they queue and
detect such cases. However, all that complexity isn't appealing yet to
reduce the failure rate in the small window inside the slow path.

It is important to note that reentrancy in the slow path can also happen
through trace_contention_{begin,end}, but in those cases, unlike an NMI,
the forward progress of the head waiter (or the predecessor in general)
is not being blocked.

Fixes: 0d80e7f951be ("rqspinlock: Choose trylock fallback for NMI waiters")
Reported-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
9 days ago  rqspinlock: Perform AA checks immediately
Kumar Kartikeya Dwivedi [Fri, 28 Nov 2025 23:27:58 +0000 (23:27 +0000)] 
rqspinlock: Perform AA checks immediately

Currently, while we enter the check_timeout call immediately due to the
way the ts.spin is initialized, we still invoke the AA and ABBA checks
in the second invocation, and only initialize the timestamp in the first
one. Since each iteration is done with at least a 1ms delay, this can
delay the detection of AA deadlocks by up to a millisecond.

Rework check_timeout() to avoid this. First, call check_deadlock_AA()
while initializing the timestamps for the wait period. This also means
that we only do it once per waiting period, instead of every invocation.
Finally, drop check_deadlock() and call check_deadlock_ABBA() directly.

To save an unnecessary ktime_get_mono_fast_ns() call in case of an AA
deadlock, sample the time only if check_deadlock_AA() returns 0.
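
A condensed sketch of the reworked flow; the field and helper names
follow the existing rqspinlock code, but the body is simplified (the
real code also rate-limits how often the checks run):

  static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
                                    struct rqspinlock_timeout *ts)
  {
          u64 time;

          if (!ts->timeout_end) {
                  /* First call of this waiting period: AA check once... */
                  int ret = check_deadlock_AA(lock, mask, ts);

                  if (ret)
                          return ret;
                  /* ...and sample the clock only if no AA deadlock exists. */
                  time = ktime_get_mono_fast_ns();
                  ts->timeout_end = time + ts->duration;
                  return 0;
          }

          /* Subsequent calls: ABBA check plus the overall timeout. */
          if (ktime_get_mono_fast_ns() > ts->timeout_end)
                  return -ETIMEDOUT;
          return check_deadlock_ABBA(lock, mask, ts);
  }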

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
9 days ago  rqspinlock: Enclose lock/unlock within lock entry acquisitions
Kumar Kartikeya Dwivedi [Fri, 28 Nov 2025 23:27:57 +0000 (23:27 +0000)] 
rqspinlock: Enclose lock/unlock within lock entry acquisitions

Ritesh reported that timeouts occurred frequently for rqspinlock despite
reentrancy on the same lock on the same CPU in [0]. This patch closes
one of the races leading to this behavior, and reduces the frequency of
timeouts.

We currently have a tiny window between the fast-path cmpxchg and the
grabbing of the lock entry where an NMI could land, attempt the same
lock that was just acquired, and end up timing out. This is not ideal.
Instead, move the lock entry acquisition from the fast path to before
the cmpxchg, and remove the grabbing of the lock entry in the slow path,
assuming it was already taken by the fast path. The TAS fallback is
invoked directly without being preceded by the typical fast path,
therefore we must continue to grab the deadlock detection entry in that
case.

Case on lock leading to missed AA:

cmpxchg lock A
<NMI>
... rqspinlock acquisition of A
... timeout
</NMI>
grab_held_lock_entry(A)

There is a similar case when unlocking the lock. If the NMI lands
between the WRITE_ONCE and smp_store_release, it is possible that we end
up in a situation where the NMI fails to diagnose the AA condition,
leading to a timeout.

Case on unlock leading to missed AA:

WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)
<NMI>
... rqspinlock acquisition of A
... timeout
</NMI>
smp_store_release(A->locked, 0)

The patch changes the order on unlock to smp_store_release() followed
by the WRITE_ONCE() of NULL. This avoids the missed AA detection described
above, but may lead to a false positive if the NMI lands between these
two statements, which is acceptable (and preferred over a timeout).
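
A sketch of the resulting unlock order (names follow the existing
rqspinlock header, with the bounds handling of the held-locks table
elided):

  static __always_inline void res_spin_unlock(rqspinlock_t *lock)
  {
          struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);

          /*
           * Release the lock first, then clear the deadlock-detection
           * entry. An NMI landing in between still finds the lock in the
           * held-locks table: a false-positive AA report, preferred over
           * a missed AA and a 250 ms timeout.
           */
          smp_store_release(&lock->locked, 0);
          WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
          this_cpu_dec(rqspinlock_held_locks.cnt);
  }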

The original intention of the reverse order on unlock was to prevent the
following possible misdiagnosis of an ABBA scenario:

grab entry A
lock A
grab entry B
lock B
unlock B
   smp_store_release(B->locked, 0)
grab entry B
lock B
grab entry A
lock A
! <detect ABBA>
   WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)

If the store release were after the WRITE_ONCE, the other CPU would
not observe B in the table of the CPU unlocking the lock B. However,
since the threads are obviously participating in an ABBA deadlock, it
is no longer appealing to use the order above since it may lead to a
250 ms timeout due to missed AA detection.

  [0]: https://lore.kernel.org/bpf/CAH6OuBTjG+N=+GGwcpOUbeDN563oz4iVcU3rbse68egp9wj9_A@mail.gmail.com

Fixes: 0d80e7f951be ("rqspinlock: Choose trylock fallback for NMI waiters")
Reported-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 days ago  bpf: Remove runqslower tool
Hoyeon Lee [Wed, 26 Nov 2025 09:38:11 +0000 (18:38 +0900)] 
bpf: Remove runqslower tool

runqslower was added in commit 9c01546d26d2 "tools/bpf: Add runqslower
tool to tools/bpf" as a BCC port to showcase early BPF CO-RE + libbpf
workflows. runqslower continues to live in BCC (libbpf-tools), so there
is no need to keep building and maintaining it.

Drop tools/bpf/runqslower and remove all build hooks in tools/bpf and
selftests accordingly.

Signed-off-by: Hoyeon Lee <hoyeon.lee@suse.com>
Link: https://lore.kernel.org/r/20251126093821.373291-1-hoyeon.lee@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 days ago  selftests/bpf: Remove usage of lsm/file_alloc_security in selftest
Amery Hung [Wed, 26 Nov 2025 20:29:27 +0000 (12:29 -0800)] 
selftests/bpf: Remove usage of lsm/file_alloc_security in selftest

The file_alloc_security hook is disabled. Use other LSM hooks in the
selftests instead.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20251126202927.2584874-2-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 days ago  bpf: Disable file_alloc_security hook
Amery Hung [Wed, 26 Nov 2025 20:29:26 +0000 (12:29 -0800)] 
bpf: Disable file_alloc_security hook

A use-after-free bug may be triggered by calling bpf_inode_storage_get()
in a BPF LSM program hooked to file_alloc_security. Disable the hook to
prevent this from happening.

The cause of the bug is shown in the trace below. In alloc_file(), a
file struct is first allocated through kmem_cache_alloc(). Then,
the file_alloc_security hook is invoked. Since the zero initialization
and assignment of f->f_inode happen after this LSM hook, a BPF program
may get a dangling inode pointer by walking the file struct.

  alloc_file()
  -> alloc_empty_file()
     -> f = kmem_cache_alloc()
     -> init_file()
        -> security_file_alloc() // f->f_inode not init-ed yet!
     -> f->f_inode = NULL;
  -> file_init_path()
     -> f->f_inode = path->dentry->d_inode

Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/1d2d1968.47cd3.19ab9528e94.Coremail.kaiyanm@hust.edu.cn/
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20251126202927.2584874-1-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 days ago  Merge branch 'a-pair-of-follow-ups-for-indirect-jumps'
Alexei Starovoitov [Fri, 28 Nov 2025 23:15:43 +0000 (15:15 -0800)] 
Merge branch 'a-pair-of-follow-ups-for-indirect-jumps'

Anton Protopopov says:

====================
A pair of follow ups for indirect jumps

Two fixes suggested by Alexei in [1]. Resending as a series,
as the second patch depends on the first.

  [1] https://lore.kernel.org/bpf/CAADnVQK3piReoo1ja=9hgz7aJ60Y_Jjur_JMOaYV8-Mn_VyE4A@mail.gmail.com/#R
====================

Link: https://patch.msgid.link/20251128063224.1305482-1-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 days ago  bpf: check for insn arrays in check_ptr_alignment
Anton Protopopov [Fri, 28 Nov 2025 06:32:24 +0000 (06:32 +0000)] 
bpf: check for insn arrays in check_ptr_alignment

Do not abuse the strict_alignment_once flag, and check if the map is
an instruction array inside the check_ptr_alignment() function.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20251128063224.1305482-3-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 days ago  bpf: force BPF_F_RDONLY_PROG on insn array creation
Anton Protopopov [Fri, 28 Nov 2025 06:32:23 +0000 (06:32 +0000)] 
bpf: force BPF_F_RDONLY_PROG on insn array creation

The original implementation added a hack to check_mem_access()
to prevent programs from writing into insn arrays. To get rid
of this hack, enforce BPF_F_RDONLY_PROG on map creation.

Also fix the corresponding selftest, as the error message changes
with this patch.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20251128063224.1305482-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 days ago  bpf: Fix exclusive map memory leak
Edward Adam Davis [Sun, 16 Nov 2025 14:58:13 +0000 (22:58 +0800)] 
bpf: Fix exclusive map memory leak

When excl_prog_hash is 0 and excl_prog_hash_size is non-zero, the map also
needs to be freed. Otherwise, the map memory will not be reclaimed, just
like the memory leak problem reported by syzbot [1].

syzbot reported:
BUG: memory leak
  backtrace (crc 7b9fb9b4):
    map_create+0x322/0x11e0 kernel/bpf/syscall.c:1512
    __sys_bpf+0x3556/0x3610 kernel/bpf/syscall.c:6131
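
A sketch of the shape of the fix in map_create() (kernel/bpf/syscall.c);
the error label is whatever the existing cleanup path uses and is shown
here only for illustration:

  if (attr->excl_prog_hash) {
          /* ... copy and validate the hash ... */
  } else if (attr->excl_prog_hash_size) {
          /* Previously returned -EINVAL directly, leaking the map. */
          err = -EINVAL;
          goto free_map;
  }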

Fixes: baefdbdf6812 ("bpf: Implement exclusive map creation")
Reported-by: syzbot+cf08c551fecea9fd1320@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=cf08c551fecea9fd1320
Tested-by: syzbot+cf08c551fecea9fd1320@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/tencent_3F226F882CE56DCC94ACE90EED1ECCFC780A@qq.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 days ago  Merge branch 'general-enhancements-to-rqspinlock-stress-test'
Alexei Starovoitov [Tue, 25 Nov 2025 23:30:14 +0000 (15:30 -0800)] 
Merge branch 'general-enhancements-to-rqspinlock-stress-test'

Kumar Kartikeya Dwivedi says:

====================
General enhancements to rqspinlock stress test

Three enhancements, details in the commit messages.

First, the CPU requirements are 2 for AA, 3 for ABBA, and 4 for ABBCCA,
hence relax the check during module initialization. Second, add a
per-CPU histogram to capture lock acquisition times to record which
buckets these acquisitions fall into for the normal task context and NMI
context.  Anything below 10ms is not printed in detail, but above that
displays the full breakdown for each context. Finally, make the delay of
the NMI and task contexts configurable, set to 10 and 20 ms respectively
by default.
====================

Link: https://patch.msgid.link/20251125020749.2421610-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 days ago  selftests/bpf: Make CS length configurable for rqspinlock stress test
Kumar Kartikeya Dwivedi [Tue, 25 Nov 2025 02:07:49 +0000 (02:07 +0000)] 
selftests/bpf: Make CS length configurable for rqspinlock stress test

Allow users to configure the critical section delay for both task/normal
and NMI contexts, and set them to 20ms and 10ms by default, as before.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251125020749.2421610-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 days ago  selftests/bpf: Add lock wait time stats to rqspinlock stress test
Kumar Kartikeya Dwivedi [Tue, 25 Nov 2025 02:07:48 +0000 (02:07 +0000)] 
selftests/bpf: Add lock wait time stats to rqspinlock stress test

Add statistics per-CPU broken down by context and various timing windows
for the time taken to acquire an rqspinlock. Cases where all
acquisitions fit into the 10ms window are skipped when printing;
otherwise the full breakdown is displayed when printing the summary.
This allows capturing precisely the number of times outlier attempts
happened for a given lock in a given context.

A critical detail is that time is captured regardless of success or
failure, which is important to capture events for failed but long
waiting timeout attempts.

Output:

[   64.279459] rqspinlock acquisition latency histogram (ms):
[   64.279472]  cpu1: total 528426 (normal 526559, nmi 1867)
[   64.279477]    0-1ms: total 524697 (normal 524697, nmi 0)
[   64.279480]    2-2ms: total 3652 (normal 1811, nmi 1841)
[   64.279482]    3-3ms: total 66 (normal 47, nmi 19)
[   64.279485]    4-4ms: total 2 (normal 1, nmi 1)
[   64.279487]    5-5ms: total 1 (normal 1, nmi 0)
[   64.279489]    6-6ms: total 1 (normal 0, nmi 1)
[   64.279490]    101-150ms: total 1 (normal 0, nmi 1)
[   64.279492]    >= 251ms: total 6 (normal 2, nmi 4)
...

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251125020749.2421610-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 days ago  selftests/bpf: Relax CPU requirements for rqspinlock stress test
Kumar Kartikeya Dwivedi [Tue, 25 Nov 2025 02:07:47 +0000 (02:07 +0000)] 
selftests/bpf: Relax CPU requirements for rqspinlock stress test

Only require 2 CPUs for AA, 3 for ABBA, 4 for ABBCCA, which is
calculated nicely by adding to the mode enum. Enables running single CPU
AA tests.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251125020749.2421610-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 days ago  bpf: Introduce internal bpf_map_check_op_flags helper function
Leon Hwang [Tue, 25 Nov 2025 14:58:50 +0000 (22:58 +0800)] 
bpf: Introduce internal bpf_map_check_op_flags helper function

Introduce it to unify map flags checking for the lookup_elem,
update_elem, lookup_batch and update_batch APIs.
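
A sketch of what such a helper can look like; the signature and the
exact flag set are illustrative, the point being a single place that
validates the per-operation flags and the BPF_F_LOCK/spin-lock pairing:

  static int bpf_map_check_op_flags(struct bpf_map *map, u64 flags,
                                    u64 allowed_flags)
  {
          if (flags & ~allowed_flags)
                  return -EINVAL;

          if ((flags & BPF_F_LOCK) &&
              !btf_record_has_field(map->record, BPF_SPIN_LOCK))
                  return -EINVAL;

          return 0;
  }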

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20251125145857.98134-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 days ago  libbpf: Fix some incorrect @param descriptions in the comment of libbpf.h
Jianyun Gao [Tue, 18 Nov 2025 03:30:24 +0000 (11:30 +0800)] 
libbpf: Fix some incorrect @param descriptions in the comment of libbpf.h

Fix up some missing or incorrect @param descriptions for libbpf public APIs
in libbpf.h.

Signed-off-by: Jianyun Gao <jianyungao89@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20251118033025.11804-1-jianyungao89@gmail.com
13 days ago  selftests/bpf: Call bpf_get_numa_node_id() in trigger_count()
Menglong Dong [Sun, 16 Nov 2025 01:42:42 +0000 (09:42 +0800)] 
selftests/bpf: Call bpf_get_numa_node_id() in trigger_count()

The bench test "trig-kernel-count" can be used as a baseline comparison
for fentry and other benchmarks, and the call to bpf_get_numa_node_id()
should be considered part of that baseline. So, let's call it in
trigger_count(). Meanwhile, rename trigger_count() to
trigger_kernel_count() to make it easier to understand.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20251116014242.151110-1-dongml2@chinatelecom.cn
13 days ago  docs: bpf: map_array: Specify BPF_MAP_TYPE_PERCPU_ARRAY value size limit
Alex Tran [Sat, 15 Nov 2025 06:35:31 +0000 (22:35 -0800)] 
docs: bpf: map_array: Specify BPF_MAP_TYPE_PERCPU_ARRAY value size limit

Specify the value size limit for BPF_MAP_TYPE_PERCPU_ARRAY, which
is PCPU_MIN_UNIT_SIZE (32 KiB). In the percpu allocator (mm: percpu),
any request with a size greater than PCPU_MIN_UNIT_SIZE is rejected.

Signed-off-by: Alex Tran <alex.t.tran@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20251115063531.2302903-1-alex.t.tran@gmail.com
2 weeks ago  selftests/bpf: Fix htab_update/reenter_update selftest failure
Saket Kumar Bhaskar [Mon, 17 Nov 2025 06:07:52 +0000 (11:37 +0530)] 
selftests/bpf: Fix htab_update/reenter_update selftest failure

Since commit 31158ad02ddb ("rqspinlock: Add deadlock detection
and recovery") the updated path on re-entrancy now reports deadlock
via -EDEADLK instead of the previous -EBUSY.

Also, the way reentrancy was exercised (via fentry/lookup_elem_raw)
has been fragile because lookup_elem_raw may be inlined
(find_kernel_btf_id() will return -ESRCH).

To fix this, fentry is attached to bpf_obj_free_fields() instead of
lookup_elem_raw() and:

- The htab map is made to use a BTF-described struct val with a
  struct bpf_timer so that check_and_free_fields() reliably calls
  bpf_obj_free_fields() on element replacement.

- The selftest is updated to do two updates to the same key (insert +
  replace) in prog_test.

- The selftest is updated to align the expected errno with the
  kernel’s current behavior.

Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Link: https://lore.kernel.org/r/20251117060752.129648-1-skb99@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  Merge branch 'ease-bpf-signing-build-requirements'
Alexei Starovoitov [Mon, 24 Nov 2025 18:00:16 +0000 (10:00 -0800)] 
Merge branch 'ease-bpf-signing-build-requirements'

Alan Maguire says:

====================
Ease BPF signing build requirements

This series makes it easier to build bpftool and selftests with
signing support, removing the reliance on openssl >= v3 (openssl v1 is
now supported) to build bpftool and no longer requiring the latest xxd
to build the verification cert header in selftests.

Changes since v1 [1]:

- Updated patch 2 to add symlink test_progs_verification_cert to .gitignore,
  EXTRA_CLEANFILES (AI review bot)
- Added acks to patch 1 (Song, Quentin)

[1] https://lore.kernel.org/bpf/20251114222249.30122-1-alan.maguire@oracle.com/
====================

Link: https://patch.msgid.link/20251120084754.640405-1-alan.maguire@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  selftests/bpf: Allow selftests to build with older xxd
Alan Maguire [Thu, 20 Nov 2025 08:47:54 +0000 (08:47 +0000)] 
selftests/bpf: Allow selftests to build with older xxd

Currently selftests require xxd with the "-n <name>" option
which allows the user to specify a name not derived from
the input object path.  Instead of relying on this newer
feature, older xxd can be used if we link our desired name
("test_progs_verification_cert") to the input object.

Many distros ship xxd in vim-common package and do not have
the latest xxd with -n support.

Fixes: b720903e2b14d ("selftests/bpf: Enable signature verification for some lskel tests")
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/r/20251120084754.640405-3-alan.maguire@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  bpftool: Allow bpftool to build with openssl < 3
Alan Maguire [Thu, 20 Nov 2025 08:47:53 +0000 (08:47 +0000)] 
bpftool: Allow bpftool to build with openssl < 3

ERR_get_error_all()[1] is an openssl v3 API, so to make the code
compatible with openssl v1, utilize ERR_get_error_line_data()
instead. Since openssl is already a build requirement for
the kernel (minimum requirement openssl 1.0.0), this will
allow bpftool to compile where openssl v3 is not available.
Signing-related BPF selftests pass with openssl v1.

[1] https://docs.openssl.org/3.4/man3/ERR_get_error/

Fixes: 40863f4d6ef2 ("bpftool: Add support for signing BPF programs")
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Quentin Monnet <qmo@kernel.org>
Link: https://lore.kernel.org/r/20251120084754.640405-2-alan.maguire@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  Merge branch 'bpf-trampoline-support-jmp-mode'
Alexei Starovoitov [Mon, 24 Nov 2025 17:45:27 +0000 (09:45 -0800)] 
Merge branch 'bpf-trampoline-support-jmp-mode'

Menglong Dong says:

====================
bpf trampoline support "jmp" mode

For now, the bpf trampoline is called by the "call" instruction. However,
it breaks the RSB and introduces extra overhead on the x86_64 arch.

For example, if we hook the function "foo" with fexit, the call and return
logic will be like this:
  call foo -> call trampoline -> call foo-body ->
  return foo-body -> return foo

As we can see above, there are 3 calls but only 2 returns, which breaks
the RSB balance. We can insert a pseudo "return" here, but it's not the
best choice, as it will still cause one RSB miss:
  call foo -> call trampoline -> call foo-body ->
  return foo-body -> return dummy -> return foo

The "return dummy" doesn't pair the "call trampoline", which can also
cause the RSB miss.

Therefore, we introduce the "jmp" mode for bpf trampoline, as advised by
Alexei in [1]. And the logic will become this:
  call foo -> jmp trampoline -> call foo-body ->
  return foo-body -> return foo

As we can see above, the RSB is totally balanced after this series.

In this series, we introduce the FTRACE_OPS_FL_JMP for ftrace to make it
use the "jmp" instruction instead of "call".

And we also do some adjustment to bpf_arch_text_poke() to allow us to specify
the old and new poke_type.

For the BPF_TRAMP_F_SHARE_IPMODIFY case, we will fall back to the "call"
mode, as it needs to get the function address from the stack, which is not
supported in "jmp" mode.

Before this series, we have the following performance with the bpf
benchmark:

  $ cd tools/testing/selftests/bpf
  $ ./benchs/run_bench_trigger.sh
  usermode-count :  890.171 ± 1.522M/s
  kernel-count   :  409.184 ± 0.330M/s
  syscall-count  :   26.792 ± 0.010M/s
  fentry         :  171.242 ± 0.322M/s
  fexit          :   80.544 ± 0.045M/s
  fmodret        :   78.301 ± 0.065M/s
  rawtp          :  192.906 ± 0.900M/s
  tp             :   81.883 ± 0.209M/s
  kprobe         :   52.029 ± 0.113M/s
  kprobe-multi   :   62.237 ± 0.060M/s
  kprobe-multi-all:    4.761 ± 0.014M/s
  kretprobe      :   23.779 ± 0.046M/s
  kretprobe-multi:   29.134 ± 0.012M/s
  kretprobe-multi-all:    3.822 ± 0.003M/s

And after this series, we have the following performance:

  usermode-count :  890.443 ± 0.307M/s
  kernel-count   :  416.139 ± 0.055M/s
  syscall-count  :   31.037 ± 0.813M/s
  fentry         :  169.549 ± 0.519M/s
  fexit          :  136.540 ± 0.518M/s
  fmodret        :  159.248 ± 0.188M/s
  rawtp          :  194.475 ± 0.144M/s
  tp             :   84.505 ± 0.041M/s
  kprobe         :   59.951 ± 0.071M/s
  kprobe-multi   :   63.153 ± 0.177M/s
  kprobe-multi-all:    4.699 ± 0.012M/s
  kretprobe      :   23.740 ± 0.015M/s
  kretprobe-multi:   29.301 ± 0.022M/s
  kretprobe-multi-all:    3.869 ± 0.005M/s

As we can see above, the performance of fexit increases from 80.544M/s to
136.540M/s, and "fmodret" increases from 78.301M/s to 159.248M/s.

Link: https://lore.kernel.org/bpf/20251117034906.32036-1-dongml2@chinatelecom.cn/
Changes since v2:
* reject if the addr is already "jmp" in register_ftrace_direct() and
  __modify_ftrace_direct() in the 1st patch.
* fix compile error in powerpc in the 5th patch.
* changes in the 6th patch:
  - fix the compile error by wrapping the write to tr->fops->flags with
    CONFIG_DYNAMIC_FTRACE_WITH_JMP
  - reset BPF_TRAMP_F_SKIP_FRAME when the second try of modify_fentry in
    bpf_trampoline_update()

Link: https://lore.kernel.org/bpf/20251114092450.172024-1-dongml2@chinatelecom.cn/
Changes since v1:
* change the bool parameter that we add to save_args() to "u32 flags"
* rename bpf_trampoline_need_jmp() to bpf_trampoline_use_jmp()
* add new function parameter to bpf_arch_text_poke instead of introduce
  bpf_arch_text_poke_type()
* rename bpf_text_poke to bpf_trampoline_update_fentry
* remove the BPF_TRAMP_F_JMPED and check the current mode with the origin
  flags instead.

Link: https://lore.kernel.org/bpf/CAADnVQLX54sVi1oaHrkSiLqjJaJdm3TQjoVrgU-LZimK6iDcSA@mail.gmail.com/[1]
====================

Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20251118123639.688444-1-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  bpf: implement "jmp" mode for trampoline
Menglong Dong [Tue, 18 Nov 2025 12:36:34 +0000 (20:36 +0800)] 
bpf: implement "jmp" mode for trampoline

Implement the "jmp" mode for the bpf trampoline. For the ftrace_managed
case, we only need to set FTRACE_OPS_FL_JMP on tr->fops if "jmp"
is needed.

For the bpf poke case, we will check the origin poke type with the
"origin_flags", and current poke type with "tr->flags". The function
bpf_trampoline_update_fentry() is introduced to do the job.

The "jmp" mode will only be enabled with CONFIG_DYNAMIC_FTRACE_WITH_JMP
enabled and BPF_TRAMP_F_SHARE_IPMODIFY is not set. With
BPF_TRAMP_F_SHARE_IPMODIFY, we need to get the origin call ip from the
stack, so we can't use the "jmp" mode.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20251118123639.688444-7-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  bpf: specify the old and new poke_type for bpf_arch_text_poke
Menglong Dong [Tue, 18 Nov 2025 12:36:33 +0000 (20:36 +0800)] 
bpf: specify the old and new poke_type for bpf_arch_text_poke

In the origin logic, the bpf_arch_text_poke() assume that the old and new
instructions have the same opcode. However, they can have different opcode
if we want to replace a "call" insn with a "jmp" insn.

Therefore, add the new function parameter "old_t" along with the "new_t",
which are used to indicate the old and new poke type. Meanwhile, adjust
the implement of bpf_arch_text_poke() for all the archs.

"BPF_MOD_NOP" is added to make the code more readable. In
bpf_arch_text_poke(), we still check if the new and old address is NULL to
determine if nop insn should be used, which I think is more safe.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20251118123639.688444-6-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  bpf,x86: adjust the "jmp" mode for bpf trampoline
Menglong Dong [Tue, 18 Nov 2025 12:36:32 +0000 (20:36 +0800)] 
bpf,x86: adjust the "jmp" mode for bpf trampoline

In the origin call case, if BPF_TRAMP_F_SKIP_FRAME is not set, it means
that the trampoline is not reached with a "call" but with a "jmp".

Introduce the function bpf_trampoline_use_jmp() to check if the trampoline
is in "jmp" mode.

Do some adjustment of the "jmp" mode for x86_64. The main adjustment
that we make is for the stack parameter passing case, as the stack
alignment logic changes in the "jmp" mode without the "rip" on the
stack. What's more, the location of the parameters on the stack also
changes.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20251118123639.688444-5-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  bpf: fix the usage of BPF_TRAMP_F_SKIP_FRAME
Menglong Dong [Tue, 18 Nov 2025 12:36:31 +0000 (20:36 +0800)] 
bpf: fix the usage of BPF_TRAMP_F_SKIP_FRAME

Some places calculate the origin_call by checking if
BPF_TRAMP_F_SKIP_FRAME is set. However, they should use
BPF_TRAMP_F_ORIG_STACK for this purpose. Just fix them.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20251118123639.688444-4-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  x86/ftrace: Implement DYNAMIC_FTRACE_WITH_JMP
Menglong Dong [Tue, 18 Nov 2025 12:36:30 +0000 (20:36 +0800)] 
x86/ftrace: Implement DYNAMIC_FTRACE_WITH_JMP

Implement the DYNAMIC_FTRACE_WITH_JMP for x86_64. In ftrace_call_replace,
we will use JMP32_INSN_OPCODE instead of CALL_INSN_OPCODE if the address
should use "jmp".

Meanwhile, adjust the direct call in the ftrace_regs_caller. The RSB is
balanced in the "jmp" mode. Take the function "foo" for example:

 original_caller:
 call foo -> foo:
         call fentry -> fentry:
                 [do ftrace callbacks ]
                 move tramp_addr to stack
                 RET -> tramp_addr
                         tramp_addr:
                         [..]
                         call foo_body -> foo_body:
                                 [..]
                                 RET -> back to tramp_addr
                         [..]
                         RET -> back to original_caller

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20251118123639.688444-3-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  ftrace: Introduce FTRACE_OPS_FL_JMP
Menglong Dong [Tue, 18 Nov 2025 12:36:29 +0000 (20:36 +0800)] 
ftrace: Introduce FTRACE_OPS_FL_JMP

For now, the "nop" will be replaced with a "call" instruction when a
function is hooked by ftrace. However, sometimes the "call" can break
the RSB and introduce extra overhead. Therefore, introduce the flag
FTRACE_OPS_FL_JMP, which indicates that the ftrace_ops should be called
with a "jmp" instead of a "call". For now, it is only used by the direct
call case.

When a direct ftrace_ops is marked with FTRACE_OPS_FL_JMP, the last bit of
the ops->direct_call will be set to 1. Therefore, we can tell if we should
use "jmp" for the callback in ftrace_call_replace().

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20251118123639.688444-2-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  bpf: cleanup aux->used_maps after jit
Anton Protopopov [Mon, 24 Nov 2025 15:15:15 +0000 (15:15 +0000)] 
bpf: cleanup aux->used_maps after jit

In commit b4ce5923e780 ("bpf, x86: add new map type: instructions array")
env->used_maps was copied to func[i]->aux->used_maps before jitting.
Clear these fields out after jitting so that pointers to freed memory
(env->used_maps is freed later) are not kept in a live data structure.

The reason why the copies were initially added is explained in
https://lore.kernel.org/bpf/20251105090410.1250500-1-a.s.protopopov@gmail.com

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Fixes: b4ce5923e780 ("bpf, x86: add new map type: instructions array")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20251124151515.2543403-1-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  Merge branch 'bpf-nested-rcu-critical-sections'
Alexei Starovoitov [Sat, 22 Nov 2025 02:35:00 +0000 (18:35 -0800)] 
Merge branch 'bpf-nested-rcu-critical-sections'

Puranjay Mohan says:

====================
bpf: Nested rcu critical sections

v1: https://lore.kernel.org/bpf/20250916113622.19540-1-puranjay@kernel.org/
Changes in v1->v2:
- Move the addition of new tests to a separate patch (Alexei)
- Avoid incrementing active_rcu_locks at two places (Eduard)

Support nested rcu critical sections by making the boolean flag
active_rcu_lock a counter and using it to manage the rcu critical
section state. bpf_rcu_read_lock() increments this counter and
bpf_rcu_read_unlock() decrements it, MEM_RCU -> PTR_UNTRUSTED transition
happens when active_rcu_locks drops to 0.
====================

Link: https://patch.msgid.link/20251117200411.25563-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  selftests: bpf: Add tests for unbalanced rcu_read_lock
Puranjay Mohan [Mon, 17 Nov 2025 20:04:10 +0000 (20:04 +0000)] 
selftests: bpf: Add tests for unbalanced rcu_read_lock

As verifier now supports nested rcu critical sections, add new test
cases to make sure unbalanced usage of rcu_read_lock()/unlock() is
rejected.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251117200411.25563-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  bpf: support nested rcu critical sections
Puranjay Mohan [Mon, 17 Nov 2025 20:04:09 +0000 (20:04 +0000)] 
bpf: support nested rcu critical sections

Currently, nested rcu critical sections are rejected by the verifier and
the rcu_lock state is managed by a boolean variable. Add support for nested
rcu critical sections by making active_rcu_locks a counter similar to
active_preempt_locks. bpf_rcu_read_lock() increments this counter and
bpf_rcu_read_unlock() decrements it, MEM_RCU -> PTR_UNTRUSTED transition
happens when active_rcu_locks drops to 0.
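
A sketch of the counter semantics; upstream this logic lives in the
verifier's kfunc handling, and the function below is illustrative:

  static int process_rcu_kfunc(struct bpf_verifier_env *env, bool is_lock)
  {
          struct bpf_verifier_state *st = env->cur_state;

          if (is_lock) {
                  st->active_rcu_locks++;
                  return 0;
          }

          if (!st->active_rcu_locks) {
                  verbose(env, "unmatched rcu read unlock\n");
                  return -EINVAL;
          }

          /* Only the outermost unlock demotes MEM_RCU pointers. */
          if (--st->active_rcu_locks == 0) {
                  /* walk all registers and stack slots, converting
                   * MEM_RCU -> PTR_UNTRUSTED (walk helper elided)
                   */
          }
          return 0;
  }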

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251117200411.25563-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  bpf: test the correct stack liveness of tail calls
Eduard Zingerman [Wed, 19 Nov 2025 16:03:55 +0000 (17:03 +0100)] 
bpf: test the correct stack liveness of tail calls

A new test is added: caller_stack_write_tail_call tests that the live
stack is correctly tracked for a tail call.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin Teichmann <martin.teichmann@xfel.eu>
Link: https://lore.kernel.org/r/20251119160355.1160932-5-martin.teichmann@xfel.eu
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  bpf: correct stack liveness for tail calls
Eduard Zingerman [Wed, 19 Nov 2025 16:03:54 +0000 (17:03 +0100)] 
bpf: correct stack liveness for tail calls

This updates bpf_insn_successors() to reflect that control flow might
jump over the instructions between a tail call and the function exit.
Otherwise, the verifier might assume that some writes to the parent
stack always happen, which is not the case.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin Teichmann <martin.teichmann@xfel.eu>
Link: https://lore.kernel.org/r/20251119160355.1160932-4-martin.teichmann@xfel.eu
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  bpf: test the proper verification of tail calls
Martin Teichmann [Wed, 19 Nov 2025 16:03:53 +0000 (17:03 +0100)] 
bpf: test the proper verification of tail calls

Three tests are added:

- invalidate_pkt_pointers_by_tail_call checks that one can use the
  packet pointer after a tail call. This was originally possible
  and also poses no problems, but was made impossible by 1a4607ffba35.

- invalidate_pkt_pointers_by_static_tail_call tests a corner case
  found by Eduard Zingerman during the discussion of the original fix,
  which was broken in that fix.

- subprog_result_tail_call tests that precision propagation works
  correctly across tail calls. This did not work before.

Signed-off-by: Martin Teichmann <martin.teichmann@xfel.eu>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251119160355.1160932-3-martin.teichmann@xfel.eu
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  bpf: properly verify tail call behavior
Martin Teichmann [Wed, 19 Nov 2025 16:03:52 +0000 (17:03 +0100)] 
bpf: properly verify tail call behavior

A successful ebpf tail call does not return to the caller, but to the
caller-of-the-caller, often just finishing the ebpf program altogether.

Any restrictions that the verifier needs to take into account - notably
the fact that the tail call might have modified packet pointers - are to
be checked on the caller-of-the-caller. Checking it on the caller made
the verifier refuse perfectly fine programs that would use the packet
pointers after a tail call, which is no problem as this code is only
executed if the tail call was unsuccessful, i.e. nothing happened.

This patch simulates the behavior of a tail call in the verifier. A
conditional jump to the code after the tail call is added for the case
of an unsuccessful tail call, and a return to the caller is simulated for
a successful tail call.

For the successful case we assume that the tail call returns an int,
as tail calls are currently only allowed in functions that return an
int. We always assume that the tail call modified the packet pointers,
as we do not know what the tail call did.

For the unsuccessful case we know nothing happened, so we do not need to
add new constraints.

This approach also allows checking other problems that may occur with
tail calls; namely, we are now able to check that precision is properly
propagated into subprograms using tail calls, as well as checking the
live slots in such a subprogram.

Fixes: 1a4607ffba35 ("bpf: consider that tail calls invalidate packet pointers")
Link: https://lore.kernel.org/bpf/20251029105828.1488347-1-martin.teichmann@xfel.eu/
Signed-off-by: Martin Teichmann <martin.teichmann@xfel.eu>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251119160355.1160932-2-martin.teichmann@xfel.eu
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  bpf: Add a check to make static analysers happy
Anton Protopopov [Wed, 19 Nov 2025 11:25:17 +0000 (11:25 +0000)] 
bpf: Add a check to make static analysers happy

In [1] Dan Carpenter reported that the following code makes the
Smatch static analyser unhappy:

        17904       value = map->ops->map_lookup_elem(map, &i);
        17905       if (!value)
        17906               return -EINVAL;
    --> 17907       items[i - start] = value->xlated_off;

The analyser assumes that the `value` variable may contain an error
and thus it should be properly checked before the dereference.
In practice this will never happen, as array maps do not return
error values from map_lookup_elem, but to make Smatch and other
possible analysers happy this patch adds a formal check.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/bpf/aR2BN1Ix--8tmVrN@stanley.mountain/ [1]
Fixes: 493d9e0d6083 ("bpf, x86: add support for indirect jumps")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20251119112517.1091793-1-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  selftests/bpf: Update test_tag to use sha256
Xing Guo [Fri, 21 Nov 2025 06:14:58 +0000 (14:14 +0800)] 
selftests/bpf: Update test_tag to use sha256

commit 603b44162325 ("bpf: Update the bpf_prog_calc_tag to use SHA256")
changed the digest of prog_tag to SHA256 but forgot to update the tests
correspondingly. Fix it.

Fixes: 603b44162325 ("bpf: Update the bpf_prog_calc_tag to use SHA256")
Signed-off-by: Xing Guo <higuoxing@gmail.com>
Link: https://lore.kernel.org/r/20251121061458.3145167-1-higuoxing@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks ago  selftests/bpf: Improve reliability of test_perf_branches_no_hw()
Matt Bobrowski [Wed, 19 Nov 2025 14:35:40 +0000 (14:35 +0000)] 
selftests/bpf: Improve reliability of test_perf_branches_no_hw()

Currently, test_perf_branches_no_hw() relies on the busy loop within
test_perf_branches_common() being slow enough to allow at least one
perf event sample tick to occur before starting to tear down the
backing perf event BPF program. With a relatively small fixed
iteration count of 1,000,000, this is not guaranteed on modern fast
CPUs, resulting in the test run to subsequently fail with the
following:

bpf_testmod.ko is already unloaded.
Loading bpf_testmod.ko...
Successfully loaded bpf_testmod.ko.
test_perf_branches_common:PASS:test_perf_branches_load 0 nsec
test_perf_branches_common:PASS:attach_perf_event 0 nsec
test_perf_branches_common:PASS:set_affinity 0 nsec
check_good_sample:PASS:output not valid 0 nsec
check_good_sample:PASS:read_branches_size 0 nsec
check_good_sample:PASS:read_branches_stack 0 nsec
check_good_sample:PASS:read_branches_stack 0 nsec
check_good_sample:PASS:read_branches_global 0 nsec
check_good_sample:PASS:read_branches_global 0 nsec
check_good_sample:PASS:read_branches_size 0 nsec
test_perf_branches_no_hw:PASS:perf_event_open 0 nsec
test_perf_branches_common:PASS:test_perf_branches_load 0 nsec
test_perf_branches_common:PASS:attach_perf_event 0 nsec
test_perf_branches_common:PASS:set_affinity 0 nsec
check_bad_sample:FAIL:output not valid no valid sample from prog
Summary: 0/1 PASSED, 0 SKIPPED, 1 FAILED
Successfully unloaded bpf_testmod.ko.

On a modern CPU (i.e. one with a 3.5 GHz clock rate), executing 1
million increments of a volatile integer can take significantly less
than 1 millisecond. If the spin loop completes and the perf event BPF
program is detached before the first 1 ms sampling interval elapses,
the perf event will never fire. Fix this by bumping the loop iteration
count a little within test_perf_branches_common(), and by adding
another loop termination condition that is directly influenced by the
backing perf event BPF program executing. Notably, a conscious
decision was made not to adjust the sample_freq value, as that is not
a reliable way to fix the problem; it would still leave the race
window open.
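
The shape of the fix is roughly the following (a sketch; the iteration
budget and the prog_ran flag are placeholders for whatever counter and
flag the test actually uses, with skel being the test skeleton):

        /* spin until the budget is exhausted or the BPF program has run */
        for (i = 0; i < 100000000 && !skel->bss->prog_ran; i++)
                j++;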

Fixes: 67306f84ca78c ("selftests/bpf: Add bpf_read_branch_records() selftest")
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20251119143540.2911424-1-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks agoselftests/bpf: skip test_perf_branches_hw() on unsupported platforms
Matt Bobrowski [Thu, 20 Nov 2025 14:20:59 +0000 (14:20 +0000)] 
selftests/bpf: skip test_perf_branches_hw() on unsupported platforms

Gracefully skip the test_perf_branches_hw subtest on platforms that
do not support LBR or require specialized perf event attributes
to enable branch sampling.

For example, AMD's Milan (Zen 3) supports BRS rather than traditional
LBR. This requires specific configurations (attr.type = PERF_TYPE_RAW,
attr.config = RETIRED_TAKEN_BRANCH_INSTRUCTIONS) that differ from the
generic setup used within this test. Notably, there is probably
little value in special-casing perf event configurations for selected
microarchitectures.
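
A graceful skip along these lines is what is meant (a sketch; treating
these particular errno values as "unsupported" is an assumption):

        pfd = syscall(__NR_perf_event_open, &attr, -1 /* pid */, 0 /* cpu */,
                      -1 /* group_fd */, PERF_FLAG_FD_CLOEXEC);
        if (pfd < 0 && (errno == ENOENT || errno == EOPNOTSUPP)) {
                test__skip();
                return;
        }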

Fixes: 67306f84ca78c ("selftests/bpf: Add bpf_read_branch_records() selftest")
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20251120142059.2836181-1-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks agoMerge branch 'bpf-arm64-indirect-jumps'
Alexei Starovoitov [Sat, 22 Nov 2025 00:40:22 +0000 (16:40 -0800)] 
Merge branch 'bpf-arm64-indirect-jumps'

Puranjay Mohan says:

====================
bpf: arm64: Indirect jumps

Changes in v1->v2:
v1: https://lore.kernel.org/all/20251117004656.33292-1-puranjay@kernel.org/
- Dropped patch 3 that was ignoring relocations for .jumptables. LLVM
  has been fixed to not emit relocations for .jumptables, so this patch
  is not needed.
- Added Reviewed-by: Anton Protopopov <a.s.protopopov@gmail.com>

This set adds support for indirect jumps to the arm64 JIT. It
involves calling bpf_prog_update_insn_ptrs() to support the
instruction array map. The second piece is supporting the
BPF_JMP|BPF_X|BPF_JA, SRC=0, DST=Rx, off=0, imm=0 instruction, which
is trivial to implement on arm64.

The final patch enables selftests on arm64:

 [root@localhost bpf]# ./test_progs-cpuv4 -a "*gotox*"
 #20/1    bpf_gotox/one-switch:OK
 #20/2    bpf_gotox/one-switch-non-zero-sec-offset:OK
 #20/3    bpf_gotox/two-switches:OK
 #20/4    bpf_gotox/big-jump-table:OK
 #20/5    bpf_gotox/static-global:OK
 #20/6    bpf_gotox/nonstatic-global:OK
 #20/7    bpf_gotox/other-sec:OK
 #20/8    bpf_gotox/static-global-other-sec:OK
 #20/9    bpf_gotox/nonstatic-global-other-sec:OK
 #20/10   bpf_gotox/one-jump-two-maps:OK
 #20/11   bpf_gotox/one-map-two-jumps:OK
 #20      bpf_gotox:OK
 #537/1   verifier_gotox/jump_table_ok:OK
 #537/2   verifier_gotox/jump_table_reserved_field_src_reg:OK
 #537/3   verifier_gotox/jump_table_reserved_field_non_zero_off:OK
 #537/4   verifier_gotox/jump_table_reserved_field_non_zero_imm:OK
 #537/5   verifier_gotox/jump_table_no_jump_table:OK
 #537/6   verifier_gotox/jump_table_incorrect_dst_reg_type:OK
 #537/7   verifier_gotox/jump_table_invalid_read_size_u32:OK
 #537/8   verifier_gotox/jump_table_invalid_read_size_u16:OK
 #537/9   verifier_gotox/jump_table_invalid_read_size_u8:OK
 #537/10  verifier_gotox/jump_table_misaligned_access:OK
 #537/11  verifier_gotox/jump_table_invalid_mem_acceess_pos:OK
 #537/12  verifier_gotox/jump_table_invalid_mem_acceess_neg:OK
 #537/13  verifier_gotox/jump_table_add_sub_ok:OK
 #537/14  verifier_gotox/jump_table_no_writes:OK
 #537/15  verifier_gotox/jump_table_use_reg_r0:OK
 #537/16  verifier_gotox/jump_table_use_reg_r1:OK
 #537/17  verifier_gotox/jump_table_use_reg_r2:OK
 #537/18  verifier_gotox/jump_table_use_reg_r3:OK
 #537/19  verifier_gotox/jump_table_use_reg_r4:OK
 #537/20  verifier_gotox/jump_table_use_reg_r5:OK
 #537/21  verifier_gotox/jump_table_use_reg_r6:OK
 #537/22  verifier_gotox/jump_table_use_reg_r7:OK
 #537/23  verifier_gotox/jump_table_use_reg_r8:OK
 #537/24  verifier_gotox/jump_table_use_reg_r9:OK
 #537/25  verifier_gotox/jump_table_outside_subprog:OK
 #537/26  verifier_gotox/jump_table_contains_non_unique_values:OK
 #537     verifier_gotox:OK
 Summary: 2/37 PASSED, 0 SKIPPED, 0 FAILED
====================

Link: https://patch.msgid.link/20251117130732.11107-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks agoselftests: bpf: Enable gotox tests from arm64
Puranjay Mohan [Mon, 17 Nov 2025 13:07:31 +0000 (13:07 +0000)] 
selftests: bpf: Enable gotox tests from arm64

arm64 JIT now supports gotox instruction and jumptables, so run tests in
verifier_gotox.c for arm64.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20251117130732.11107-4-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks agobpf: arm64: Add support for indirect jumps
Puranjay Mohan [Mon, 17 Nov 2025 13:07:30 +0000 (13:07 +0000)] 
bpf: arm64: Add support for indirect jumps

Add support for a new instruction

BPF_JMP|BPF_X|BPF_JA, SRC=0, DST=Rx, off=0, imm=0

which does an indirect jump to a location stored in Rx. The register
Rx should have type PTR_TO_INSN. This new type ensures that the Rx
register contains a value (or a range of values) loaded from a valid
jump table, i.e. a map of type instruction array.

ARM64 JIT supports indirect jumps to all registers through the A64_BR()
macro, use it to implement this new instruction.
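
In the JIT's instruction switch the new case is conceptually as simple
as the following (a simplified sketch; dst is the arm64 register mapped
to the BPF dst_reg):

        case BPF_JMP | BPF_X | BPF_JA:
                emit(A64_BR(dst), ctx);
                break;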

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20251117130732.11107-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks agobpf: arm64: Add support for instructions array
Puranjay Mohan [Mon, 17 Nov 2025 13:07:29 +0000 (13:07 +0000)] 
bpf: arm64: Add support for instructions array

Add support for the instructions array map type in the arm64 JIT by
calling bpf_prog_update_insn_ptrs() with the offsets that map
xlated_offset to the jited_offset in the final image. The arm64 JIT
already has this offset array, which was being used for
bpf_prog_fill_jited_linfo() and can be used directly for
bpf_prog_update_insn_ptrs().

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20251117130732.11107-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks agoMerge branch 'selftests-bpf-networking-test-cleanups'
Martin KaFai Lau [Fri, 21 Nov 2025 17:54:50 +0000 (09:54 -0800)] 
Merge branch 'selftests-bpf-networking-test-cleanups'

Hoyeon Lee says:

====================
selftests/bpf: networking test cleanups

This series finishes the sockaddr_storage migration in the networking
selftests by removing the remaining open-coded IPv4/IPv6 wrappers
(addr_port/tuple in cls_redirect, sa46 in select_reuseport). The tests
now use sockaddr_storage directly. No other custom socket-address
wrappers remain after this series, so the churn stops here and behavior
is unchanged.
====================

Link: https://patch.msgid.link/20251121081332.2309838-1-hoyeon.lee@suse.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 weeks agoselftests/bpf: Use sockaddr_storage instead of sa46 in select_reuseport test
Hoyeon Lee [Fri, 21 Nov 2025 08:13:32 +0000 (17:13 +0900)] 
selftests/bpf: Use sockaddr_storage instead of sa46 in select_reuseport test

The select_reuseport selftest uses a custom sa46 union to represent
IPv4 and IPv6 addresses. This custom wrapper requires extra manual
handling for address family and field extraction.

Replace sa46 with sockaddr_storage and update the helper functions to
operate on native socket structures. This simplifies the code and
removes unnecessary custom address-handling logic. No functional
changes intended.
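
With sockaddr_storage, filling an address becomes plain socket-API
code, along these lines (illustrative only, not the actual test code):

        struct sockaddr_storage ss = {};

        if (family == AF_INET6) {
                struct sockaddr_in6 *a6 = (struct sockaddr_in6 *)&ss;

                a6->sin6_family = AF_INET6;
                a6->sin6_port = htons(port);
        } else {
                struct sockaddr_in *a4 = (struct sockaddr_in *)&ss;

                a4->sin_family = AF_INET;
                a4->sin_port = htons(port);
        }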

Reviewed-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Hoyeon Lee <hoyeon.lee@suse.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251121081332.2309838-3-hoyeon.lee@suse.com
2 weeks agoselftests/bpf: Use sockaddr_storage directly in cls_redirect test
Hoyeon Lee [Fri, 21 Nov 2025 08:13:31 +0000 (17:13 +0900)] 
selftests/bpf: Use sockaddr_storage directly in cls_redirect test

The cls_redirect test uses a custom addr_port/tuple wrapper to represent
IPv4/IPv6 addresses and ports. This custom wrapper requires extra
conversion logic and specific helpers such as fill_addr_port(), which
are no longer necessary when using standard socket address structures.

This commit replaces addr_port/tuple with the standard sockaddr_storage
so the test handles address families and ports using native socket types.
It removes the custom helper, eliminates redundant casts, and simplifies
the setup helpers without functional changes. set_up_conn() and
build_input() now take src/dst sockaddr_storage directly.

Reviewed-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Hoyeon Lee <hoyeon.lee@suse.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251121081332.2309838-2-hoyeon.lee@suse.com
2 weeks agobpf: Document cfi_stubs and owner fields in struct bpf_struct_ops
Nirbhay Sharma [Thu, 20 Nov 2025 20:46:21 +0000 (02:16 +0530)] 
bpf: Document cfi_stubs and owner fields in struct bpf_struct_ops

Add missing kernel-doc documentation for the cfi_stubs and owner
fields in struct bpf_struct_ops to fix the following warnings:

  Warning: include/linux/bpf.h:1931 struct member 'cfi_stubs' not
  described in 'bpf_struct_ops'
  Warning: include/linux/bpf.h:1931 struct member 'owner' not
  described in 'bpf_struct_ops'

The cfi_stubs field was added in commit 2cd3e3772e41 ("x86/cfi,bpf:
Fix bpf_struct_ops CFI") to provide CFI stub functions for trampolines,
and the owner field is used for module reference counting.
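
The added kernel-doc member lines would look roughly like this (the
exact wording in the patch may differ):

  * @cfi_stubs: CFI stub functions used when creating trampolines
  * @owner: owning module, used for module reference counting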

Signed-off-by: Nirbhay Sharma <nirbhay.lkd@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251120204620.59571-2-nirbhay.lkd@gmail.com
2 weeks agoselftests/bpf: Use ASSERT_STRNEQ to factor in long slab cache names
Matt Bobrowski [Tue, 18 Nov 2025 07:37:34 +0000 (07:37 +0000)] 
selftests/bpf: Use ASSERT_STRNEQ to factor in long slab cache names

subtest_kmem_cache_iter_check_slabinfo() fundamentally compares slab
cache names parsed out from /proc/slabinfo against those stored within
struct kmem_cache_result. The current problem is that the slab cache
name within struct kmem_cache_result is stored within a bounded
fixed-length array (sized to SLAB_NAME_MAX(32)), whereas the name
parsed out from /proc/slabinfo is not. This means that using
ASSERT_STREQ() can lead to test failures, particularly when dealing
with slab cache names longer than SLAB_NAME_MAX(32) bytes. Notably,
kmem_cache_create() allows callers to create slab caches with
arbitrarily sized names via its __name argument, so exceeding the
SLAB_NAME_MAX(32) limit currently in place can certainly happen.

Make subtest_kmem_cache_iter_check_slabinfo() more reliable by only
checking up to sizeof(struct kmem_cache_result.name) - 1 using
ASSERT_STRNEQ().
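
In other words, the assertion becomes bounded by the destination
buffer, roughly as follows (a sketch assuming the parsed name is in
'name' and the expected result struct is 'r'):

        ASSERT_STRNEQ(name, r->name, sizeof(r->name) - 1, "kmem_cache_name");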

Fixes: a496d0cdc84d ("selftests/bpf: Add a test for kmem_cache_iter")
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://patch.msgid.link/20251118073734.4188710-1-mattbobrowski@google.com
2 weeks agoMerge branch 'replace-bpf-memory-allocator-with-kmalloc_nolock-in-local-storage'
Alexei Starovoitov [Wed, 19 Nov 2025 00:20:25 +0000 (16:20 -0800)] 
Merge branch 'replace-bpf-memory-allocator-with-kmalloc_nolock-in-local-storage'

Amery Hung says:

====================
Replace BPF memory allocator with kmalloc_nolock() in local storage

This patchset tries to simplify bpf_local_storage.c by adopting
kmalloc_nolock(). This removes memory preallocation and reduces the
dependency on smap in bpf_selem_free() and bpf_local_storage_free().
The latter will simplify a future refactor that replaces
local_storage->lock and b->lock [1].

RFC v1 tried to switch to kmalloc_nolock() unconditionally. However,
as there is substantial performance loss in socket local storage due to
1) defer_free() in kfree_nolock() and 2) no kfree_rcu() batching,
replacing kzalloc() is postponed until necessary improvements in mm
land.

Benchmark

./bench -p 1 local-storage-create --storage-type <socket,task> \
  --batch-size <16,32,64>

The benchmark is a microbenchmark stress-testing how fast local storage
can be created. For task local storage, switching from BPF memory
allocator to kmalloc_nolock() yields a small amount of improvement. For
socket local storage, it remains roughly the same as nothing has changed.

Socket local storage
memory alloc     batch  creation speed              creation speed diff
---------------  ----   ------------------                         ----
kzalloc           16    144.149 ± 0.642k/s  3.10 kmallocs/create
(before)          32    144.379 ± 1.070k/s  3.08 kmallocs/create
                  64    144.491 ± 0.818k/s  3.13 kmallocs/create

kzalloc           16    146.180 ± 1.403k/s  3.10 kmallocs/create  +1.4%
(not changed)     32    146.245 ± 1.272k/s  3.10 kmallocs/create  +1.3%
                  64    145.012 ± 1.545k/s  3.10 kmallocs/create  +0.4%

Task local storage
memory alloc     batch  creation speed              creation speed diff
---------------  ----   ------------------                         ----
BPF memory        16     24.668 ± 0.121k/s  2.54 kmallocs/create
allocator         32     22.899 ± 0.097k/s  2.67 kmallocs/create
(before)          64     22.559 ± 0.076k/s  2.56 kmallocs/create

kmalloc_nolock    16     25.796 ± 0.059k/s  2.52 kmallocs/create  +4.6%
(after)           32     23.412 ± 0.069k/s  2.50 kmallocs/create  +2.2%
                  64     23.717 ± 0.108k/s  2.60 kmallocs/create  +5.1%

[1] https://lore.kernel.org/bpf/20251002225356.1505480-1-ameryhung@gmail.com/

v1 -> v2
  - Only replace BPF memory allocator with kmalloc_nolock()
Link: https://lore.kernel.org/bpf/20251112175939.2365295-1-ameryhung@gmail.com/
====================

Link: https://patch.msgid.link/20251114201329.3275875-1-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks agobpf: Replace bpf memory allocator with kmalloc_nolock() in local storage
Amery Hung [Fri, 14 Nov 2025 20:13:26 +0000 (12:13 -0800)] 
bpf: Replace bpf memory allocator with kmalloc_nolock() in local storage

Replace bpf memory allocator with kmalloc_nolock() to reduce memory
wastage due to preallocation.

In bpf_selem_free(), an selem now needs to wait for an RCU grace period
before being freed when reuse_now == true. Therefore, rcu_barrier()
should always be called in bpf_local_storage_map_free().

In bpf_local_storage_free(), since smap->storage_ma is no longer needed
to return the memory, the function is now independent of smap.

Remove the outdated comment in bpf_local_storage_alloc(). We already
free selem after an RCU grace period in bpf_local_storage_update() when
bpf_local_storage_alloc() failed the cmpxchg since commit c0d63f309186
("bpf: Add bpf_selem_free()").

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20251114201329.3275875-5-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks agobpf: Save memory allocation info in bpf_local_storage
Amery Hung [Fri, 14 Nov 2025 20:13:25 +0000 (12:13 -0800)] 
bpf: Save memory allocation info in bpf_local_storage

Save the memory allocation method used for bpf_local_storage in the
struct explicitly so that we don't need to go through the hassle of
finding it out. When a later patch replaces the BPF memory allocator
with kmalloc_nolock(), bpf_local_storage_free() will no longer need
smap->storage_ma to return the memory, completely removing the
dependency on smap in bpf_local_storage_free().

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20251114201329.3275875-4-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks agobpf: Remove smap argument from bpf_selem_free()
Amery Hung [Fri, 14 Nov 2025 20:13:24 +0000 (12:13 -0800)] 
bpf: Remove smap argument from bpf_selem_free()

Since selem already saves a pointer to smap, use it instead of an
additional argument in bpf_selem_free(). This requires moving the
SDATA(selem)->smap assignment from bpf_selem_link_map() to
bpf_selem_alloc() since bpf_selem_free() may be called without the
selem being linked to smap in bpf_local_storage_update().

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20251114201329.3275875-3-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks agobpf: Always charge/uncharge memory when allocating/unlinking storage elements
Amery Hung [Fri, 14 Nov 2025 20:13:23 +0000 (12:13 -0800)] 
bpf: Always charge/uncharge memory when allocating/unlinking storage elements

Since commit a96a44aba556 ("bpf: bpf_sk_storage: Fix invalid wait
context lockdep report"), {charge,uncharge}_mem are always true when
allocating a bpf_local_storage_elem or unlinking a bpf_local_storage_elem
from local storage, so drop these arguments. No functional change.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20251114201329.3275875-2-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 weeks agoselftests/bpf: Replace TCP CC string comparisons with bpf_strncmp
Hoyeon Lee [Sat, 15 Nov 2025 22:55:39 +0000 (07:55 +0900)] 
selftests/bpf: Replace TCP CC string comparisons with bpf_strncmp

The connect4_prog and bpf_iter_setsockopt tests duplicate the same
open-coded TCP congestion control string comparison logic. Since
bpf_strncmp() provides the same functionality, use it instead to
avoid repeated open-coded loops.

This change applies only to functional BPF tests and does not affect
the verifier performance benchmarks (veristat.cfg). No functional
changes intended.
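
For reference, the bpf_strncmp()-based comparison looks roughly like
this from a BPF program (illustrative, not the exact test code):

        char cc[TCP_CA_NAME_MAX] = {};
        bool is_cubic;

        if (bpf_getsockopt(ctx, SOL_TCP, TCP_CONGESTION, cc, sizeof(cc)))
                return 1;
        is_cubic = !bpf_strncmp(cc, sizeof("cubic") - 1, "cubic");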

Reviewed-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Hoyeon Lee <hoyeon.lee@suse.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251115225550.1086693-5-hoyeon.lee@suse.com
2 weeks agoselftests/bpf: Move common TCP helpers into bpf_tracing_net.h
Hoyeon Lee [Sat, 15 Nov 2025 22:55:38 +0000 (07:55 +0900)] 
selftests/bpf: Move common TCP helpers into bpf_tracing_net.h

Some BPF selftests contain identical copies of the min(), max(),
before(), and after() helpers. These repeated snippets are the same
across the tests and do not need to be defined separately.

Move these helpers into bpf_tracing_net.h so they can be shared by
TCP related BPF programs. This removes repeated code and keeps the
helpers in a single place.
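
The shared helpers follow the usual TCP sequence-number arithmetic;
their definitions are roughly (a sketch, the header may define them
slightly differently):

        #define before(seq1, seq2)      ((__s32)((seq1) - (seq2)) < 0)
        #define after(seq2, seq1)       before(seq1, seq2)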

Reviewed-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Hoyeon Lee <hoyeon.lee@suse.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251115225550.1086693-4-hoyeon.lee@suse.com
2 weeks agobpf: Fix invalid prog->stats access when update_effective_progs fails
Pu Lehui [Sat, 15 Nov 2025 10:23:43 +0000 (10:23 +0000)] 
bpf: Fix invalid prog->stats access when update_effective_progs fails

Syzkaller triggers an invalid memory access issue following fault
injection in update_effective_progs. The issue can be described as
follows:

__cgroup_bpf_detach
  update_effective_progs
    compute_effective_progs
      bpf_prog_array_alloc <-- fault inject
  purge_effective_progs
    /* change to dummy_bpf_prog */
    array->items[index] = &dummy_bpf_prog.prog

---softirq start---
__do_softirq
  ...
    __cgroup_bpf_run_filter_skb
      __bpf_prog_run_save_cb
        bpf_prog_run
          stats = this_cpu_ptr(prog->stats)
          /* invalid memory access */
          flags = u64_stats_update_begin_irqsave(&stats->syncp)
---softirq end---

  static_branch_dec(&cgroup_bpf_enabled_key[atype])

The reason is that fault injection caused update_effective_progs to fail,
and purge_effective_progs then changed the original prog into
dummy_bpf_prog.prog. A softirq then came in, and accessing the members
of dummy_bpf_prog.prog from the softirq triggered an invalid memory
access.

To fix it, skip updating stats when stats is NULL.
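
The guard being described is essentially the following (a sketch around
the existing stats-update sequence; 'duration' stands for the measured
run time):

        if (prog->stats) {
                stats = this_cpu_ptr(prog->stats);
                flags = u64_stats_update_begin_irqsave(&stats->syncp);
                u64_stats_inc(&stats->cnt);
                u64_stats_add(&stats->nsecs, duration);
                u64_stats_update_end_irqrestore(&stats->syncp, flags);
        }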

Fixes: 492ecee892c2 ("bpf: enable program stats")
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Link: https://lore.kernel.org/r/20251115102343.2200727-1-pulehui@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 weeks agobpf: don't skip other information if xlated_prog_insns is skipped
Altgelt, Max (Nextron) [Tue, 4 Nov 2025 14:26:56 +0000 (14:26 +0000)] 
bpf: don't skip other information if xlated_prog_insns is skipped

If xlated_prog_insns should not be exposed, other information
(such as func_info) still can and should be filled in.
Therefore, instead of directly terminating in this case,
continue with the normal flow.

Signed-off-by: Max Altgelt <max.altgelt@nextron-systems.com>
Link: https://lore.kernel.org/r/efd00fcec5e3e247af551632726e2a90c105fbd8.camel@nextron-systems.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 weeks agoselftests/bpf: Test bpf_skb_check_mtu(BPF_MTU_CHK_SEGS) when transport_header is...
Martin KaFai Lau [Wed, 12 Nov 2025 23:23:31 +0000 (15:23 -0800)] 
selftests/bpf: Test bpf_skb_check_mtu(BPF_MTU_CHK_SEGS) when transport_header is not set

Add a test to check that bpf_skb_check_mtu(BPF_MTU_CHK_SEGS) is
rejected (-EINVAL) if skb->transport_header is not set. The test
needs to lower the MTU of the loopback device. Thus, take this
opportunity to run the test in a netns by adding "ns_" to the test
name. The "serial_" prefix can then be removed.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20251112232331.1566074-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 weeks agobpf: Check skb->transport_header is set in bpf_skb_check_mtu
Martin KaFai Lau [Wed, 12 Nov 2025 23:23:30 +0000 (15:23 -0800)] 
bpf: Check skb->transport_header is set in bpf_skb_check_mtu

The bpf_skb_check_mtu helper needs to use skb->transport_header when
the BPF_MTU_CHK_SEGS flag is used:

bpf_skb_check_mtu(skb, ifindex, &mtu_len, 0, BPF_MTU_CHK_SEGS)

The transport_header is not always set. There is a WARN_ON_ONCE
report when CONFIG_DEBUG_NET is enabled + skb->gso_size is set +
bpf_prog_test_run is used:

WARNING: CPU: 1 PID: 2216 at ./include/linux/skbuff.h:3071
 skb_gso_validate_network_len
 bpf_skb_check_mtu
 bpf_prog_3920e25740a41171_tc_chk_segs_flag # A test in the next patch
 bpf_test_run
 bpf_prog_test_run_skb

For a normal ingress skb (not test_run), skb_reset_transport_header
is performed, but there is a plan to avoid setting it, as described in
commit 2170a1f09148 ("net: no longer reset transport_header in __netif_receive_skb_core()").

This patch fixes the bpf helper by checking
skb_transport_header_was_set(). The check is done just before
skb->transport_header is used, to avoid breaking the existing bpf prog.
The WARN_ON_ONCE is limited to bpf_prog_test_run, so targeting bpf-next.
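
Conceptually the helper now bails out early on the BPF_MTU_CHK_SEGS
path, along these lines (a simplified sketch of the described check):

        if (flags & BPF_MTU_CHK_SEGS) {
                if (!skb_transport_header_was_set(skb))
                        return -EINVAL;
                /* existing skb_gso_validate_network_len() logic follows */
        }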

Fixes: 34b2021cc616 ("bpf: Add BPF-helper for MTU checking")
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20251112232331.1566074-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 weeks agobpf: verifier: Move desc->imm setup to sort_kfunc_descs_by_imm_off()
Puranjay Mohan [Fri, 14 Nov 2025 15:40:22 +0000 (15:40 +0000)] 
bpf: verifier: Move desc->imm setup to sort_kfunc_descs_by_imm_off()

Metadata about a kfunc call is added to the kfunc_tab in
add_kfunc_call() but the call instruction itself could get removed by
opt_remove_dead_code() later if it is not reachable.

If the call instruction is removed, specialize_kfunc() is never called
for it and the desc->imm in the kfunc_tab is never initialized for this
kfunc call. In this case, the sort_kfunc_descs_by_imm_off(env->prog)
call in do_misc_fixups() doesn't sort the table correctly.
This is a problem for s390 as its JIT uses this table to find the
addresses of kfuncs, and if the table is not sorted properly, the JIT
may fail to find addresses for valid kfunc calls.

This was exposed by:

commit d869d56ca848 ("bpf: verifier: refactor kfunc specialization")

as before this commit, desc->imm was initialised in add_kfunc_call()
which happens before dead code elimination.

Move desc->imm setup down to sort_kfunc_descs_by_imm_off(), this fixes
the problem and also saves us from having the same logic in
add_kfunc_call() and specialize_kfunc().

Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251114154023.12801-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 weeks agoselftests/bpf: Align kfuncs renamed in bpf tree
Mykyta Yatsenko [Wed, 5 Nov 2025 13:21:05 +0000 (13:21 +0000)] 
selftests/bpf: Align kfuncs renamed in bpf tree

bpf_task_work_schedule_resume() and bpf_task_work_schedule_signal() have
been renamed in the bpf tree to bpf_task_work_schedule_resume_impl() and
bpf_task_work_schedule_signal_impl() respectively.
There are a few uses of these kfuncs in selftests that are not in the
bpf tree, so when we port [1] into bpf-next, those BPF programs will
not compile.
This patch aligns those remaining call sites with the kfunc renaming.
It should go on top of [1] when applied to bpf-next.

1: https://lore.kernel.org/all/20251104-implv2-v3-0-4772b9ae0e06@meta.com/

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20251105132105.597344-1-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 weeks agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 6.18-rc5+
Alexei Starovoitov [Sat, 15 Nov 2025 01:43:41 +0000 (17:43 -0800)] 
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 6.18-rc5+

Cross-merge BPF and other fixes after downstream PR.

Minor conflict in kernel/bpf/helpers.c

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 weeks agoMerge branch 'libbpf-fix-btf-dedup-to-support-recursive-typedef'
Andrii Nakryiko [Sat, 15 Nov 2025 01:07:21 +0000 (17:07 -0800)] 
Merge branch 'libbpf-fix-btf-dedup-to-support-recursive-typedef'

Paul Houssel says:

====================
libbpf: fix BTF dedup to support recursive typedef

Pahole fails to encode BTF for some Go projects (e.g. Kubernetes and
Podman) due to recursive type definitions that create reference loops
not representable in C. These recursive typedefs trigger a failure in
the BTF deduplication algorithm.

This patch extends btf_dedup_struct_types() to properly handle potential
recursion for BTF_KIND_TYPEDEF, similar to how recursion is already
handled for BTF_KIND_STRUCT. This allows pahole to successfully
generate BTF for Go binaries using recursive types without impacting
existing C-based workflows.

Changes in v4: fix typo found by Claude-based CI

Changes in v3:
  1. Patch 1: Adjusted the comment of btf_dedup_ref_type() to refer to
  typedef as well.
  2. Patch 2: Update of the "dedup: recursive typedef" test to include a
  duplicated version of the types to make sure deduplication still happens
  in this case.

Changes in v2:
  1. Patch 1: Refactored code to prevent copying existing logic. Instead of
  adding a new function we modify the existing btf_dedup_struct_type()
  function to handle the BTF_KIND_TYPEDEF case. Calls to btf_hash_struct()
  and btf_shallow_equal_struct() are replaced with calls to functions that
  select btf_hash_struct() / btf_hash_typedef() based on the type.
  2. Patch 2: Added tests

v3: https://lore.kernel.org/lkml/cover.1763024337.git.paul.houssel@orange.com/

v2: https://lore.kernel.org/lkml/cover.1762956564.git.paul.houssel@orange.com/

v1: https://lore.kernel.org/lkml/20251107153408.159342-1-paulhoussel2@gmail.com/
====================

Link: https://patch.msgid.link/cover.1763037045.git.paul.houssel@orange.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
3 weeks agoselftests/bpf: Add BTF dedup tests for recursive typedef definitions
Paul Houssel [Thu, 13 Nov 2025 12:39:51 +0000 (13:39 +0100)] 
selftests/bpf: Add BTF dedup tests for recursive typedef definitions

Add several ./test_progs tests:
    1.  btf/dedup:recursive typedef ensures that deduplication no
longer fails on recursive typedefs.
    2.  btf/dedup:typedef ensures that typedefs are deduplicated correctly
just as they were before this patch.

Signed-off-by: Paul Houssel <paul.houssel@orange.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/9fac2f744089f6090257d4c881914b79f6cd6c6a.1763037045.git.paul.houssel@orange.com
3 weeks agolibbpf: Fix BTF dedup to support recursive typedef definitions
Paul Houssel [Thu, 13 Nov 2025 12:39:50 +0000 (13:39 +0100)] 
libbpf: Fix BTF dedup to support recursive typedef definitions

Handle recursive typedefs in BTF deduplication

Pahole fails to encode BTF for some Go projects (e.g. Kubernetes and
Podman) due to recursive type definitions that create reference loops
not representable in C. These recursive typedefs trigger a failure in
the BTF deduplication algorithm.

This patch extends btf_dedup_ref_type() to properly handle potential
recursion for BTF_KIND_TYPEDEF, similar to how recursion is already
handled for BTF_KIND_STRUCT. This allows pahole to successfully
generate BTF for Go binaries using recursive types without impacting
existing C-based workflows.

Suggested-by: Tristan d'Audibert <tristan.daudibert@gmail.com>
Co-developed-by: Martin Horth <martin.horth@telecom-sudparis.eu>
Co-developed-by: Ouail Derghal <ouail.derghal@imt-atlantique.fr>
Co-developed-by: Guilhem Jazeron <guilhem.jazeron@inria.fr>
Co-developed-by: Ludovic Paillat <ludovic.paillat@inria.fr>
Co-developed-by: Robin Theveniaut <robin.theveniaut@irit.fr>
Signed-off-by: Martin Horth <martin.horth@telecom-sudparis.eu>
Signed-off-by: Ouail Derghal <ouail.derghal@imt-atlantique.fr>
Signed-off-by: Guilhem Jazeron <guilhem.jazeron@inria.fr>
Signed-off-by: Ludovic Paillat <ludovic.paillat@inria.fr>
Signed-off-by: Robin Theveniaut <robin.theveniaut@irit.fr>
Signed-off-by: Paul Houssel <paul.houssel@orange.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/bf00857b1e06f282aac12f6834de7396a7547ba6.1763037045.git.paul.houssel@orange.com
3 weeks agoselftests/bpf: Fix failure paths in send_signal test
Alexei Starovoitov [Thu, 13 Nov 2025 17:11:53 +0000 (09:11 -0800)] 
selftests/bpf: Fix failure paths in send_signal test

When test_send_signal_kern__open_and_load() fails, the parent closes the
pipe, which causes ASSERT_EQ(read(pipe_p2c...)) to fail, but the child
continues and enters an infinite loop while the parent is stuck in
wait(NULL). Other error paths have a similar issue, so kill the child
before waiting on it.
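
The error-path pattern is therefore roughly (a sketch):

        kill(pid, SIGKILL);
        wait(NULL);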

The bug was discovered while compiling all of selftests with -O1 instead of -O2
which caused progs/test_send_signal_kern.c to fail to load.

Fixes: ab8b7f0cb358 ("tools/bpf: Add self tests for bpf_send_signal_thread()")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20251113171153.2583-1-alexei.starovoitov@gmail.com
3 weeks agoMerge tag 'pci-v6.18-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Linus Torvalds [Fri, 14 Nov 2025 23:45:31 +0000 (15:45 -0800)] 
Merge tag 'pci-v6.18-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull pci fixes from Bjorn Helgaas:

 - Cache the ASPM L0s/L1 Supported bits early so quirks can override
   them if necessary (Bjorn Helgaas)

 - Add quirks for PA Semi and Freescale Root Ports and a HiSilicon Wi-Fi
   device that are reported to have broken L0s and L1 (Shawn Lin, Bjorn
   Helgaas)

* tag 'pci-v6.18-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
  PCI/ASPM: Avoid L0s and L1 on Hi1105 [19e5:1105] Wi-Fi
  PCI/ASPM: Avoid L0s and L1 on PA Semi [1959:a002] Root Ports
  PCI/ASPM: Avoid L0s and L1 on Freescale [1957:0451] Root Ports
  PCI/ASPM: Convert quirks to override advertised link states
  PCI/ASPM: Add pcie_aspm_remove_cap() to override advertised link states
  PCI/ASPM: Cache L0s/L1 Supported so advertised link states can be overridden

3 weeks agoMerge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Linus Torvalds [Fri, 14 Nov 2025 23:39:39 +0000 (15:39 -0800)] 
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix interaction between livepatch and BPF fexit programs (Song Liu)
   With Steven and Masami acks.

 - Fix stack ORC unwind from BPF kprobe_multi (Jiri Olsa)
   With Steven and Masami acks.

 - Fix out of bounds access in widen_imprecise_scalars() in the verifier
   (Eduard Zingerman)

 - Fix conflicts between MPTCP and BPF sockmap (Jiayuan Chen)

 - Fix net_sched storage collision with BPF data_meta/data_end (Eric
   Dumazet)

 - Add _impl suffix to BPF kfuncs with implicit args to avoid breaking
   them in bpf-next when KF_IMPLICIT_ARGS is added (Mykyta Yatsenko)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Test widen_imprecise_scalars() with different stack depth
  bpf: account for current allocated stack depth in widen_imprecise_scalars()
  bpf: Add bpf_prog_run_data_pointers()
  selftests/bpf: Add mptcp test with sockmap
  mptcp: Fix proto fallback detection with BPF
  mptcp: Disallow MPTCP subflows from sockmap
  selftests/bpf: Add stacktrace ips test for raw_tp
  selftests/bpf: Add stacktrace ips test for kprobe_multi/kretprobe_multi
  x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe
  Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()"
  bpf: add _impl suffix for bpf_stream_vprintk() kfunc
  bpf:add _impl suffix for bpf_task_work_schedule* kfuncs
  selftests/bpf: Add tests for livepatch + bpf trampoline
  ftrace: bpf: Fix IPMODIFY + DIRECT in modify_ftrace_direct()
  ftrace: Fix BPF fexit with livepatch

3 weeks agoMerge tag 'rust-fixes-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda...
Linus Torvalds [Fri, 14 Nov 2025 23:36:15 +0000 (15:36 -0800)] 
Merge tag 'rust-fixes-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux

Pull Rust fix from Miguel Ojeda:

 - Fix a Rust 1.91.0 build issue due to 'bindings.o' not containing
   DWARF debug information anymore by teaching gendwarfksyms to skip
   object files without exports

* tag 'rust-fixes-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
  gendwarfksyms: Skip files with no exports

3 weeks agoselftests/bpf: Convert glob_match() to bpf arena
Alexei Starovoitov [Tue, 11 Nov 2025 03:29:31 +0000 (19:29 -0800)] 
selftests/bpf: Convert glob_match() to bpf arena

Increase arena test coverage.
Convert glob_match() to bpf arena in two steps:
1.
Copy paste lib/glob.c into bpf_arena_strsearch.h
Copy paste lib/globtests.c into progs/arena_strsearch.c

2.
Add __arena to pointers
Add __arg_arena to global functions that accept arena pointers
Add cond_break to loops

The test also serves as a good example of what's possible
with bpf arena and how existing algorithms can be converted.
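
Step 2 boils down to a mechanical pattern like the following (an
illustrative fragment, not the converted glob code itself):

        static int str_len(const char __arena *s)
        {
                int n = 0;

                while (*s) {
                        cond_break;     /* keep the loop bounded for the verifier */
                        s++;
                        n++;
                }
                return n;
        }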

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20251111032931.21430-1-alexei.starovoitov@gmail.com
3 weeks agoMerge tag 'nfs-for-6.18-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Linus Torvalds [Fri, 14 Nov 2025 21:44:23 +0000 (13:44 -0800)] 
Merge tag 'nfs-for-6.18-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:

 - Various fixes when using NFS with TLS

 - Localio direct-IO fixes

 - Fix error handling in nfs_atomic_open_v23()

 - Fix sysfs memory leak when nfs_client kobject add fails

 - Fix an incorrect parameter when calling nfs4_call_sync()

 - Fix a failing LTP test when using delegated timestamps

* tag 'nfs-for-6.18-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Fix LTP test failures when timestamps are delegated
  NFSv4: Fix an incorrect parameter when calling nfs4_call_sync()
  NFS: sysfs: fix leak when nfs_client kobject add fails
  NFSv2/v3: Fix error handling in nfs_atomic_open_v23()
  nfs/localio: do not issue misaligned DIO out-of-order
  nfs/localio: Ensure DIO WRITE's IO on stable storage upon completion
  nfs/localio: backfill missing partial read support for misaligned DIO
  nfs/localio: add refcounting for each iocb IO associated with NFS pgio header
  nfs/localio: remove unecessary ENOTBLK handling in DIO WRITE support
  NFS: Check the TLS certificate fields in nfs_match_client()
  pnfs: Set transport security policy to RPC_XPRTSEC_NONE unless using TLS
  pnfs: Fix TLS logic in _nfs4_pnfs_v4_ds_connect()
  pnfs: Fix TLS logic in _nfs4_pnfs_v3_ds_connect()

3 weeks agoMerge tag 'drm-fixes-2025-11-15' of https://gitlab.freedesktop.org/drm/kernel
Linus Torvalds [Fri, 14 Nov 2025 21:39:15 +0000 (13:39 -0800)] 
Merge tag 'drm-fixes-2025-11-15' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
 "Weekly fixes, amdgpu and vmwgfx making up the most of it, along with
  panthor and i915/xe.

  Seems about right for this time of development, nothing major
  outstanding.

  client:
   - Fix description of module parameter

  panthor:
   - Flush writes before mapping buffers

  vmwgfx:
   - Improve command validation
   - Improve ref counting
   - Fix cursor-plane support

  amdgpu:
   - Disallow P2P DMA for GC 12 DCC surfaces
   - ctx error handling fix
   - UserQ fixes
   - VRR fix
   - ISP fix
   - JPEG 5.0.1 fix

  amdkfd:
   - Save area check fix
   - Fix GPU mappings for APU after prefetch

  i915:
   - Fix PSR's pipe to vblank conversion
   - Disable Panel Replay on MST links

  xe:
   - New HW workarounds affecting PTL and WCL platforms

* tag 'drm-fixes-2025-11-15' of https://gitlab.freedesktop.org/drm/kernel:
  drm/client: fix MODULE_PARM_DESC string for "active"
  drm/i915/dp_mst: Disable Panel Replay
  drm/amdkfd: Fix GPU mappings for APU after prefetch
  drm/amdkfd: relax checks for over allocation of save area
  drm/amdgpu/jpeg: Add parse_cs for JPEG5_0_1
  drm/amd/amdgpu: Ensure isp_kernel_buffer_alloc() creates a new BO
  drm/amd/display: Allow VRR params change if unsynced with the stream
  drm/amdgpu: fix lock warning in amdgpu_userq_fence_driver_process
  drm/amdgpu: jump to the correct label on failure
  drm/amdgpu: disable peer-to-peer access for DCC-enabled GC12 VRAM surfaces
  drm/xe/xe3lpg: Extend Wa_15016589081 for xe3lpg
  drm/xe/xe3: Extend wa_14023061436
  drm/xe/xe3: Add WA_14024681466 for Xe3_LPG
  drm/i915/psr: fix pipe to vblank conversion
  drm/panthor: Flush shmem writes before mapping buffers CPU-uncached
  drm/vmwgfx: Restore Guest-Backed only cursor plane support
  drm/vmwgfx: Use kref in vmw_bo_dirty
  drm/vmwgfx: Validate command header size against SVGA_CMD_MAX_DATASIZE

3 weeks agoMerge tag 'mmc-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Linus Torvalds [Fri, 14 Nov 2025 21:34:36 +0000 (13:34 -0800)] 
Merge tag 'mmc-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc

Pull MMC fixes from Ulf Hansson:

 - dw_mmc-rockchip: Fix internal phase calculation

 - pxamci: Simplify and fix ->probe() error handling

 - sdhci-of-dwcmshc: Fix strbin signal delay

 - wmt-sdmmc: Fix compile test default

* tag 'mmc-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: dw_mmc-rockchip: Fix wrong internal phase calculate
  mmc: pxamci: Simplify pxamci_probe() error handling using devm APIs
  mmc: sdhci-of-dwcmshc: Change DLL_STRBIN_TAPNUM_DEFAULT to 0x4
  mmc: wmt-sdmmc: fix compile test default

3 weeks agobpf: Handle return value of ftrace_set_filter_ip in register_fentry
Menglong Dong [Mon, 10 Nov 2025 12:07:05 +0000 (20:07 +0800)] 
bpf: Handle return value of ftrace_set_filter_ip in register_fentry

The error returned by ftrace_set_filter_ip() in register_fentry() is
not handled properly. Just fix it.

Fixes: 00963a2e75a8 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)")
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20251110120705.1553694-1-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 weeks agoMerge tag 'pmdomain-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh...
Linus Torvalds [Fri, 14 Nov 2025 21:29:15 +0000 (13:29 -0800)] 
Merge tag 'pmdomain-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm

Pull pmdomain fixes from Ulf Hansson:

 - imx: Fix reference count leak in ->remove()

 - samsung: Rework legacy splash-screen handover workaround

 - samsung: Fix potential memleak during ->probe()

 - arm: Fix genpd leak on provider registration failure for scmi

* tag 'pmdomain-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
  pmdomain: imx: Fix reference count leak in imx_gpc_remove
  pmdomain: samsung: Rework legacy splash-screen handover workaround
  pmdomain: arm: scmi: Fix genpd leak on provider registration failure
  pmdomain: samsung: plug potential memleak during probe

3 weeks agobpf: Add missing checks to avoid verbose verifier log
Eduard Zingerman [Fri, 14 Nov 2025 20:05:42 +0000 (12:05 -0800)] 
bpf: Add missing checks to avoid verbose verifier log

There are a few places where log level is not checked before calling
"verbose()". This forces programs working only at
BPF_LOG_LEVEL_STATS (e.g. veristat) to allocate unnecessarily large
log buffers. Add missing checks.

Reported-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251114200542.912386-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 weeks agoMerge tag 'cxl-fixes-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Linus Torvalds [Fri, 14 Nov 2025 21:25:00 +0000 (13:25 -0800)] 
Merge tag 'cxl-fixes-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull cxl fixes from Dave Jiang:

 - Fix incorrect device handle check for Generic Initiator

 - Fix offset calculation for extended linear cache poison injection

 - Fix lockdep warning for hmem_register_resource()

* tag 'cxl-fixes-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  acpi/hmat: Fix lockdep warning for hmem_register_resource()
  cxl: Adjust offset calculation for poison injection
  acpi,srat: Fix incorrect device handle check for Generic Initiator

3 weeks agobpf: Prevent nesting overflow in bpf_try_get_buffers
Sahil Chandna [Fri, 14 Nov 2025 06:49:22 +0000 (12:19 +0530)] 
bpf: Prevent nesting overflow in bpf_try_get_buffers

bpf_try_get_buffers() returns one of multiple per-CPU buffers based on a
per-CPU nesting counter. This mechanism expects that buffers are not
endlessly acquired before being returned. migrate_disable() ensures that a
task remains on the same CPU, but it does not prevent the task from being
preempted by another task on that CPU.

Without preemption disabled, a task may be preempted while holding a
buffer, allowing another task to run on the same CPU and acquire an
additional buffer. Several such preemptions can cause the per-CPU
nesting counter to exceed MAX_BPRINTF_NEST_LEVEL and trigger the warning in
bpf_try_get_buffers(). Adding preempt_disable()/preempt_enable() around
buffer acquisition and release prevents this task preemption and
preserves the intended bounded nesting behavior.
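
The intended bounded-nesting behavior relies on a pattern like this
(a generic sketch; 'nest_level' and 'bufs' are illustrative per-CPU
variables, not the actual kernel symbols):

        preempt_disable();
        level = this_cpu_inc_return(nest_level);
        if (level > MAX_BPRINTF_NEST_LEVEL) {
                this_cpu_dec(nest_level);
                preempt_enable();
                return -EBUSY;
        }
        buf = this_cpu_ptr(&bufs[level - 1]);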

Reported-by: syzbot+b0cff308140f79a9c4cb@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68f6a4c8.050a0220.1be48.0011.GAE@google.com/
Fixes: 4223bf833c849 ("bpf: Remove preempt_disable in bpf_try_get_buffers")
Suggested-by: Yonghong Song <yonghong.song@linux.dev>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sahil Chandna <chandna.sahil@gmail.com>
Link: https://lore.kernel.org/r/20251114064922.11650-1-chandna.sahil@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 weeks agoMerge tag 'spi-fix-v6.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/brooni...
Linus Torvalds [Fri, 14 Nov 2025 21:04:35 +0000 (13:04 -0800)] 
Merge tag 'spi-fix-v6.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
 "A few standard fixes here, plus one more interesting one from Hans
which addresses an issue where a change in when we request GPIOs on
ACPI systems caused us to stop doing pinmuxing and leave things
  floating that we'd really rather not have floating"

* tag 'spi-fix-v6.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: Add TODO comment about ACPI GPIO setup
  spi: xilinx: increase number of retries before declaring stall
  spi: imx: keep dma request disabled before dma transfer setup
  spi: Try to get ACPI GPIO IRQ earlier

3 weeks agoMerge tag 'regulator-fix-v6.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 14 Nov 2025 21:01:23 +0000 (13:01 -0800)] 
Merge tag 'regulator-fix-v6.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Pull regulator fix from Mark Brown:
 "One simple fix for a GPIO descriptor leak in the probe error handling
  for the fixed regulator"

* tag 'regulator-fix-v6.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: fixed: fix GPIO descriptor leak on register failure

3 weeks agoMerge tag 'sound-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai...
Linus Torvalds [Fri, 14 Nov 2025 20:50:08 +0000 (12:50 -0800)] 
Merge tag 'sound-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "A collection of small fixes. All changes are device-specific, and
  nothing stands out.

   - A regression fix for HD-audio HDMI probe

   - USB-audio hardening patches for issues spotted by fuzzers

   - ASoC fixes for TAS278x, SoundWire and Cirrus

   - Usual HD-audio and USB-audio quirks"

* tag 'sound-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: usb-audio: Add native DSD quirks for PureAudio DAC series
  ASoC: rsnd: fix OF node reference leak in rsnd_ssiu_probe()
  ALSA: hda/tas2781: Correct the wrong project ID
  ALSA: usb-audio: Fix NULL pointer dereference in snd_usb_mixer_controls_badd
  ASoC: SDCA: bug fix while parsing mipi-sdca-control-cn-list
  ALSA: usb-audio: Fix potential overflow of PCM transfer buffer
  ALSA: hda/tas2781: Add new quirk for HP new projects
  ASoC: tas2781: fix getting the wrong device number
  ASoC: codecs: va-macro: fix resource leak in probe error path
  ASoC: tas2783A: Fix issues in firmware parsing
  ASoC: sdw_utils: fix device reference leak in is_sdca_endpoint_present()
  ASoC: cs4271: Fix regulator leak on probe failure
  ALSA: hda/hdmi: Fix breakage at probing nvhdmi-mcp driver
  ASoC: da7213: Use component driver suspend/resume
  ALSA: usb-audio: add min_mute quirk for SteelSeries Arctis
  ASoC: doc: cs35l56: Update firmware filename description for B0 silicon

3 weeks agoMerge tag 'block-6.18-20251114' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 14 Nov 2025 18:18:45 +0000 (10:18 -0800)] 
Merge tag 'block-6.18-20251114' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixlet from Jens Axboe:
 "Been sitting on this one for a week or two, planning on sending it out
  when there were other block changes for 6.18. But as that hasn't
  materialized in the second week of sitting on it, let's flush it out.

  A previous commit updated my git tree locations, but one was missed as
  it was already set to the git.kernel.org one. But the git location swap
  also renamed the actual tree from linux-block to just linux, let's get
  that last one updated too"

* tag 'block-6.18-20251114' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  MAINTAINERS: correct git location for block layer tree

3 weeks agoMerge tag 'io_uring-6.18-20251113' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 14 Nov 2025 17:57:30 +0000 (09:57 -0800)] 
Merge tag 'io_uring-6.18-20251113' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fixes from Jens Axboe:

 - Use the actual segments in a request when for bvec based buffers

 - Fix an odd case where the iovec might get leaked for a read/write
   request, if it was newly allocated, overflowed the alloc cache, and
   hit an early error

 - Minor tweak to the query API added in this release, returning the
   number of available entries

* tag 'io_uring-6.18-20251113' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/rsrc: don't use blk_rq_nr_phys_segments() as number of bvecs
  io_uring/query: return number of available queries
  io_uring/rw: ensure allocated iovec gets cleared for early failure

3 weeks agoselftests/bpf: Test widen_imprecise_scalars() with different stack depth
Eduard Zingerman [Fri, 14 Nov 2025 02:57:30 +0000 (18:57 -0800)] 
selftests/bpf: Test widen_imprecise_scalars() with different stack depth

A test case for a situation when widen_imprecise_scalars() is called
with old->allocated_stack > cur->allocated_stack. Test structure:

    def widening_stack_size_bug():
      r1 = 0
      for r6 in 0..1:
        iterator_with_diff_stack_depth(r1)
        r1 = 42

    def iterator_with_diff_stack_depth(r1):
      if r1 != 42:
        use 128 bytes of stack
      iterator based loop

iterator_with_diff_stack_depth() is verified with r1 == 0 first and
r1 == 42 next, causing stack usage of 128 bytes on the first visit and
8 bytes on the second. Such an arrangement triggered a KASAN error in
widen_imprecise_scalars().

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251114025730.772723-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 weeks agobpf: account for current allocated stack depth in widen_imprecise_scalars()
Eduard Zingerman [Fri, 14 Nov 2025 02:57:29 +0000 (18:57 -0800)] 
bpf: account for current allocated stack depth in widen_imprecise_scalars()

The usage pattern for widen_imprecise_scalars() looks as follows:

    prev_st = find_prev_entry(env, ...);
    queued_st = push_stack(...);
    widen_imprecise_scalars(env, prev_st, queued_st);

Where prev_st is an ancestor of the queued_st in the explored states
tree. This ancestor is not guaranteed to have same allocated stack
depth as queued_st. E.g. in the following case:

    def main():
      for i in 1..2:
        foo(i)        // same callsite, different param

    def foo(i):
      if i == 1:
        use 128 bytes of stack
      iterator based loop

Here, for the second 'foo' call, prev_st->allocated_stack is 128,
while queued_st->allocated_stack is much smaller.
widen_imprecise_scalars() needs to take this into account and avoid
accessing bpf_verifier_state->frame[*]->stack out of bounds.
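
In other words, the widening loop has to be bounded by the smaller of
the two stacks, roughly (a sketch, not the literal patch):

        /* never index past what the current state actually allocated */
        stack_depth = min(old->frame[fr]->allocated_stack,
                          cur->frame[fr]->allocated_stack);
        for (spi = 0; spi < stack_depth / BPF_REG_SIZE; spi++)
                /* widen the corresponding stack slot */;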

Fixes: 2793a8b015f7 ("bpf: exact states comparison for iterator convergence checks")
Reported-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251114025730.772723-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 weeks agobpf: Add bpf_prog_run_data_pointers()
Eric Dumazet [Wed, 12 Nov 2025 12:55:16 +0000 (12:55 +0000)] 
bpf: Add bpf_prog_run_data_pointers()

syzbot found that cls_bpf_classify() is able to change
tc_skb_cb(skb)->drop_reason triggering a warning in sk_skb_reason_drop().

WARNING: CPU: 0 PID: 5965 at net/core/skbuff.c:1192 __sk_skb_reason_drop net/core/skbuff.c:1189 [inline]
WARNING: CPU: 0 PID: 5965 at net/core/skbuff.c:1192 sk_skb_reason_drop+0x76/0x170 net/core/skbuff.c:1214

struct tc_skb_cb has been added in commit ec624fe740b4 ("net/sched:
Extend qdisc control block with tc control block"), which added a wrong
interaction with db58ba459202 ("bpf: wire in data and data_end for
cls_act_bpf").

drop_reason was added later.

Add bpf_prog_run_data_pointers() helper to save/restore the net_sched
storage colliding with BPF data_meta/data_end.

Fixes: ec624fe740b4 ("net/sched: Extend qdisc control block with tc control block")
Reported-by: syzbot <syzkaller@googlegroups.com>
Closes: https://lore.kernel.org/netdev/6913437c.a70a0220.22f260.013b.GAE@google.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251112125516.1563021-1-edumazet@google.com
3 weeks agoMerge tag 'v6.18-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Linus Torvalds [Fri, 14 Nov 2025 16:32:58 +0000 (08:32 -0800)] 
Merge tag 'v6.18-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fix from Herbert Xu:

 - Fix device reference leak in hisilicon

* tag 'v6.18-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: hisilicon/qm - Fix device reference leak in qm_get_qos_value

3 weeks agoMerge tag 'v6.18-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Fri, 14 Nov 2025 16:30:48 +0000 (08:30 -0800)] 
Merge tag 'v6.18-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - Multichannel reconnect channel selection fix

 - Fix for smbdirect (RDMA) disconnect bug

 - Fix for incorrect username length check

 - Fix memory leak in mount parm processing

* tag 'v6.18-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: let smbd_disconnect_rdma_connection() turn CREATED into DISCONNECTED
  smb: fix invalid username check in smb3_fs_context_parse_param()
  cifs: client: fix memory leak in smb3_fs_context_parse_param
  smb: client: fix cifs_pick_channel when channel needs reconnect