8 weeks ago  bpf: Replace kvfree with kfree for kzalloc memory
Feng Yang [Wed, 27 Aug 2025 03:28:12 +0000 (11:28 +0800)] 
bpf: Replace kvfree with kfree for kzalloc memory

These pointers are allocated by kzalloc. Therefore, replace kvfree() with
kfree() to avoid the unnecessary is_vmalloc_addr() check in kvfree(). This is
the remaining unmodified part from [1].
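
A minimal sketch of the pattern (hypothetical struct, for illustration):

  struct foo *p = kzalloc(sizeof(*p), GFP_KERNEL);  /* slab memory, never vmalloc */

  if (!p)
          return -ENOMEM;
  /* ... */
  kfree(p);  /* kvfree(p) works too, but pays an is_vmalloc_addr() check first */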

Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250811123949.552885-1-rongqianfeng@vivo.com
Link: https://lore.kernel.org/bpf/20250827032812.498216-1-yangfeng59949@163.com
8 weeks ago  bpftool: Add CET-aware symbol matching for x86_64 architectures
Yuan Chen [Fri, 29 Aug 2025 06:11:07 +0000 (07:11 +0100)] 
bpftool: Add CET-aware symbol matching for x86_64 architectures

Adjust symbol matching logic to account for Control-flow Enforcement
Technology (CET) on x86_64 systems. CET prefixes functions with
a 4-byte 'endbr' instruction, shifting the actual hook entry point to
symbol + 4.
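
A hedged sketch of the idea (illustrative, not the exact bpftool code),
assuming the function's first bytes are available in `text`:

  /* endbr64 is encoded as f3 0f 1e fa on x86_64 */
  static const unsigned char endbr64[4] = { 0xf3, 0x0f, 0x1e, 0xfa };

  static unsigned long cet_hook_addr(unsigned long sym_addr,
                                     const unsigned char *text)
  {
          if (!memcmp(text, endbr64, sizeof(endbr64)))
                  return sym_addr + 4;  /* entry point sits past the endbr */
          return sym_addr;
  }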

Signed-off-by: Yuan Chen <chenyuan@kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Quentin Monnet <qmo@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20250829061107.23905-3-chenyuan_fl@163.com
8 weeks ago  bpftool: Refactor kernel config reading into common helper
Yuan Chen [Fri, 29 Aug 2025 06:11:06 +0000 (07:11 +0100)] 
bpftool: Refactor kernel config reading into common helper

Extract the kernel configuration file parsing logic from feature.c into
a new read_kernel_config() function in common.c. This includes:

1. Moving the config file handling and option parsing code
2. Adding required headers and struct definition
3. Keeping all existing functionality

The refactoring enables sharing this logic with other components while
maintaining current behavior. This will be used by subsequent patches
that need to check kernel config options.
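
A rough sketch of what such a helper might look like (names, return
convention, and the plain-text-only parsing are assumptions; the real
feature.c code also handles gzipped configs):

  struct config_option { const char *name; char *value; };

  static int read_kernel_config(const char *path,
                                struct config_option *opts, size_t n)
  {
          char line[256];
          FILE *f = fopen(path, "r");

          if (!f)
                  return -1;
          while (fgets(line, sizeof(line), f)) {
                  for (size_t i = 0; i < n; i++) {
                          size_t len = strlen(opts[i].name);

                          /* value keeps the trailing newline here;
                           * real code would trim it
                           */
                          if (!strncmp(line, opts[i].name, len) &&
                              line[len] == '=')
                                  opts[i].value = strdup(line + len + 1);
                  }
          }
          fclose(f);
          return 0;
  }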

Signed-off-by: Yuan Chen <chenyuan@kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Quentin Monnet <qmo@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20250829061107.23905-2-chenyuan_fl@163.com
2 months ago  selftests/bpf: Fix bpf_prog_detach2 usage in test_lirc_mode2
Ricardo B. Marlière [Thu, 28 Aug 2025 13:12:33 +0000 (10:12 -0300)] 
selftests/bpf: Fix bpf_prog_detach2 usage in test_lirc_mode2

Commit e9fc3ce99b34 ("libbpf: Streamline error reporting for high-level
APIs") redefined how bpf_prog_detach2() reports errors. Therefore, adapt
its usage in test_lirc_mode2_user.c.
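
As a sketch of what changed: since that commit, high-level libbpf APIs report
errors as a direct negative errno return (rather than -1 with errno set), so
the call site is checked roughly like this (illustrative):

  int err = bpf_prog_detach2(progfd, lircfd, BPF_LIRC_MODE2);

  if (err) {
          /* err is a negative errno code, not -1 */
          fprintf(stderr, "bpf_prog_detach2 failed: %d\n", err);
          return 1;
  }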

Signed-off-by: Ricardo B. Marlière <rbm@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250828-selftests-bpf-v1-1-c7811cd8b98c@suse.com
2 months ago  selftests/bpf: Add LPM trie microbenchmarks
Matt Fleming [Wed, 27 Aug 2025 14:01:49 +0000 (15:01 +0100)] 
selftests/bpf: Add LPM trie microbenchmarks

Add benchmarks for the standard set of operations: LOOKUP, INSERT,
UPDATE, DELETE. Also include benchmarks to measure the overhead of the
bench framework itself (NOOP) as well as the overhead of generating keys
(BASELINE). Lastly, this includes a benchmark for FREE (trie_free())
which is known to have terrible performance for maps with many entries.

Benchmarks operate on tries without gaps in the key range, i.e. each
test begins or ends with a trie with valid keys in the range [0,
nr_entries). This is intended to cause maximum branching when traversing
the trie.

LOOKUP, UPDATE, DELETE, and FREE fill a BPF LPM trie from userspace
using bpf_map_update_batch() and run the corresponding benchmark
operation via bpf_loop(). INSERT starts with an empty map and fills it
kernel-side from bpf_loop(). FREE records the time to free a filled LPM
trie by attaching and destroying a BPF prog. NOOP measures the overhead
of the test harness by running an empty function with bpf_loop().
BASELINE is similar to NOOP except that the function generates a key.
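
For illustration, the userspace-side fill might look roughly like this
(simplified; error handling and the actual bench plumbing are omitted, the
key layout mirrors struct bpf_lpm_trie_key, and 8-byte values are assumed):

  struct trie_key {
          __u32 prefixlen;
          __u32 data;
  };

  static void fill_trie(int map_fd, __u32 nr_entries)
  {
          struct trie_key *keys = calloc(nr_entries, sizeof(*keys));
          __u64 *vals = calloc(nr_entries, sizeof(*vals));
          __u32 count = nr_entries;

          for (__u32 i = 0; i < nr_entries; i++) {
                  keys[i].prefixlen = 32;
                  keys[i].data = i;  /* sequential keys in [0, nr_entries) */
          }
          bpf_map_update_batch(map_fd, keys, vals, &count, NULL);
          free(keys);
          free(vals);
  }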

Each operation runs 10,000 times using bpf_loop(). Note that this value
is intentionally independent of the number of entries in the LPM trie so
that the stability of the results isn't affected by the number of
entries.

For those benchmarks that need to reset the LPM trie once it's full
(INSERT) or empty (DELETE), throughput and latency results are scaled by
the fraction of a second the operation actually ran to ignore any time
spent reinitialising the trie.

By default, benchmarks run using sequential keys in the range [0,
nr_entries). BASELINE, LOOKUP, and UPDATE can use random keys via the
--random parameter, but beware that there is a runtime cost involved in
generating random keys. Other benchmarks are prohibited from using
random keys because doing so can skew the results, e.g. when inserting an
existing key or deleting a missing one.

All measurements are recorded from within the kernel to eliminate
syscall overhead. Most benchmarks run an XDP program to generate stats
but FREE needs to collect latencies using fentry/fexit on
map_free_deferred() because it's not possible to use fentry directly on
lpm_trie.c since commit c83508da5620 ("bpf: Avoid deadlock caused by
nested kprobe and fentry bpf programs") and there's no way to
create/destroy a map from within an XDP program.

Here is example output from an AMD EPYC 9684X 96-Core machine for each
of the benchmarks, using a trie with 10K entries and a 32-bit prefix
length:

  $ ./bench lpm-trie-$op \
      --prefix_len=32 \
      --producers=1 \
      --nr_entries=10000

     noop: throughput   74.417 ± 0.032 M ops/s ( 74.417M ops/prod), latency   13.438 ns/op
 baseline: throughput   70.107 ± 0.171 M ops/s ( 70.107M ops/prod), latency   14.264 ns/op
   lookup: throughput    8.467 ± 0.047 M ops/s (  8.467M ops/prod), latency  118.109 ns/op
   insert: throughput    2.440 ± 0.015 M ops/s (  2.440M ops/prod), latency  409.290 ns/op
   update: throughput    2.806 ± 0.042 M ops/s (  2.806M ops/prod), latency  356.322 ns/op
   delete: throughput    4.625 ± 0.011 M ops/s (  4.625M ops/prod), latency  215.613 ns/op
     free: throughput    0.578 ± 0.006 K ops/s (  0.578K ops/prod), latency    1.730 ms/op

And the same benchmarks using random keys:

  $ ./bench lpm-trie-$op \
      --prefix_len=32 \
      --producers=1 \
      --nr_entries=10000 \
      --random

     noop: throughput   74.259 ± 0.335 M ops/s ( 74.259M ops/prod), latency   13.466 ns/op
 baseline: throughput   35.150 ± 0.144 M ops/s ( 35.150M ops/prod), latency   28.450 ns/op
   lookup: throughput    7.119 ± 0.048 M ops/s (  7.119M ops/prod), latency  140.469 ns/op
   insert: N/A
   update: throughput    2.736 ± 0.012 M ops/s (  2.736M ops/prod), latency  365.523 ns/op
   delete: N/A
     free: N/A

Signed-off-by: Matt Fleming <mfleming@cloudflare.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20250827140149.1001557-1-matt@readmodwrite.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  Merge branch 'bpf-arm64-support-for-timed-may_goto'
Alexei Starovoitov [Thu, 28 Aug 2025 00:16:22 +0000 (17:16 -0700)] 
Merge branch 'bpf-arm64-support-for-timed-may_goto'

Puranjay Mohan says:

====================
bpf, arm64: support for timed may_goto

Changes in v2->v3:
v2: https://lore.kernel.org/all/20250809204833.44803-1-puranjay@kernel.org/
- Rebased on bpf-next/master
- Added Acked-by: tags from Xu and Kumar

Changes in v1->v2:
v1: https://lore.kernel.org/bpf/20250724125443.26182-1-puranjay@kernel.org/
- Added comment in arch_bpf_timed_may_goto() about BPF_REG_FP setup (Xu
  Kuohai)

This set adds support for the timed may_goto instruction on arm64.
The timed may_goto instruction is implemented by the verifier by
reserving two 8-byte slots in the program stack and then calling
arch_bpf_timed_may_goto() in a loop with the stack offset of these two
slots in BPF_REG_AX. The verifier expects the function to put a timestamp
in the first slot; the count returned in BPF_REG_AX is put into the
second slot by a store instruction emitted by the verifier.

arch_bpf_timed_may_goto() is special as it receives the parameter in
BPF_REG_AX and is expected to return the result in BPF_REG_AX as well.
It can't clobber any caller-saved registers because the verifier doesn't
save anything before emitting the call.

So, arch_bpf_timed_may_goto() is implemented in assembly so that the exact
registers that are stored/restored can be controlled (the BPF caller-saved
registers here); it also needs to take care of moving the argument and
return value between BPF_REG_AX and arm64 R0.

So, arch_bpf_timed_may_goto() acts as a trampoline to call
bpf_check_timed_may_goto(), which does the main logic of placing the
timestamp and returning the count.
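
Conceptually, the bookkeeping behaves like the following pseudo-C (a hedged
sketch of the mechanism, not the kernel's actual code; now_ns(),
REFILL_COUNT, and BUDGET_NS are hypothetical):

  struct timed_slots {
          __u64 timestamp;  /* first slot, filled in via arch_bpf_timed_may_goto() */
          __u64 count;      /* second slot, stored from BPF_REG_AX by the verifier */
  };

  static __u64 check_timed_may_goto(struct timed_slots *s)
  {
          if (!s->timestamp) {
                  s->timestamp = now_ns();  /* first call: start the clock */
                  return REFILL_COUNT;
          }
          if (now_ns() - s->timestamp > BUDGET_NS)
                  return 0;                 /* time is up: exit the loop */
          return REFILL_COUNT;              /* otherwise grant more iterations */
  }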

All tests that use the may_goto instruction pass after changing some of
them in patch 2:

 #404     stream_errors:OK
 [...]
 #406/2   stream_success/stream_cond_break:OK
 [...]
 #494/23  verifier_bpf_fastcall/may_goto_interaction_x86_64:SKIP
 #494/24  verifier_bpf_fastcall/may_goto_interaction_arm64:OK
 [...]
 #539/1   verifier_may_goto_1/may_goto 0:OK
 #539/2   verifier_may_goto_1/batch 2 of may_goto 0:OK
 #539/3   verifier_may_goto_1/may_goto batch with offsets 2/1/0:OK
 #539/4   verifier_may_goto_1/may_goto batch with offsets 2/0:OK
 #539     verifier_may_goto_1:OK
 #540/1   verifier_may_goto_2/C code with may_goto 0:OK
 #540     verifier_may_goto_2:OK
 Summary: 7/16 PASSED, 25 SKIPPED, 0 FAILED
====================

Link: https://patch.msgid.link/20250827113245.52629-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  selftests/bpf: Enable timed may_goto tests for arm64
Puranjay Mohan [Wed, 27 Aug 2025 11:32:44 +0000 (11:32 +0000)] 
selftests/bpf: Enable timed may_goto tests for arm64

As the arm64 JIT now supports the timed may_goto instruction, make sure all
relevant tests run on this architecture. Some tests only needed to be
enabled, while others required modifications to work properly on arm64.

 $ ./test_progs -a "stream*","*may_goto*",verifier_bpf_fastcall

 #404     stream_errors:OK
 [...]
 #406/2   stream_success/stream_cond_break:OK
 [...]
 #494/23  verifier_bpf_fastcall/may_goto_interaction_x86_64:SKIP
 #494/24  verifier_bpf_fastcall/may_goto_interaction_arm64:OK
 [...]
 #539/1   verifier_may_goto_1/may_goto 0:OK
 #539/2   verifier_may_goto_1/batch 2 of may_goto 0:OK
 #539/3   verifier_may_goto_1/may_goto batch with offsets 2/1/0:OK
 #539/4   verifier_may_goto_1/may_goto batch with offsets 2/0:OK
 #539     verifier_may_goto_1:OK
 #540/1   verifier_may_goto_2/C code with may_goto 0:OK
 #540     verifier_may_goto_2:OK
 Summary: 7/16 PASSED, 25 SKIPPED, 0 FAILED

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20250827113245.52629-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  bpf, arm64: Add JIT support for timed may_goto
Puranjay Mohan [Wed, 27 Aug 2025 11:32:43 +0000 (11:32 +0000)] 
bpf, arm64: Add JIT support for timed may_goto

When the verifier sees a timed may_goto instruction, it emits a call to
arch_bpf_timed_may_goto() with a stack offset in BPF_REG_AX (arm64 r9)
and expects a count value to be returned in the same register. The
verifier doesn't save or restore any registers before emitting this
call.

arch_bpf_timed_may_goto() should act as a trampoline to call
bpf_check_timed_may_goto() with AAPCS64 calling convention.

To support this custom calling convention, implement
arch_bpf_timed_may_goto() in assembly: make sure the BPF caller-saved
registers are saved and restored, call bpf_check_timed_may_goto() with
the arm64 calling convention, where the first argument and the return
value are both in x0, then put the result back into BPF_REG_AX before
returning.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20250827113245.52629-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  Merge branch 'libbpf-fix-usdt-sib-argument-handling-causing-unrecognized-register...
Andrii Nakryiko [Wed, 27 Aug 2025 22:41:05 +0000 (15:41 -0700)] 
Merge branch 'libbpf-fix-usdt-sib-argument-handling-causing-unrecognized-register-error'

Jiawei Zhao says:

====================
libbpf: fix USDT SIB argument handling causing unrecognized register error

When using GCC on x86-64 to compile a USDT prog with -O1 or higher
optimization, the compiler will generate SIB addressing mode for global
arrays, e.g. "1@-96(%rbp,%rax,8)".

The current USDT implementation in libbpf cannot parse this format,
causing `bpf_program__attach_usdt()` to fail with -ENOENT
(unrecognized register).

This patch series adds support for SIB addressing mode in USDT probes.
The main changes include:
- add correct handling logic for SIB-addressed arguments in
  `parse_usdt_arg`.
- add a usdt_o2 test case to cover SIB addressing mode.

Testing shows that the SIB probe correctly generates an 8@(%rcx,%rax,8)
argument spec and passes all validation checks.

The modification history of this patch series:
Change since v1:
- refactor the code to make it more readable
- modify the commit message to explain why and how

Change since v2:
- fix the `scale` uninitialized error

Change since v3:
- force -O2 optimization for usdt.test.o to generate SIB addressing usdt
  and pass all test cases.

Change since v4:
- split the patch into two parts, one for the fix and the other for the
  test

Change since v5:
- Only enable optimization for x86 architecture to generate SIB addressing
  usdt argument spec.

Change since v6:
- Add a usdt_o2 test case to cover SIB addressing mode.
- Reinstate the usdt.c test case.

Change since v7:
- Refactor modifications to __bpf_usdt_arg_spec to avoid increasing its size,
  achieving better compatibility
- Fix some minor code style issues
- Refactor the usdt_o2 test case, removing semaphore and adding GCC attribute
  to force -O2 optimization

Change since v8:
- Refactor the usdt_o2 test case, using assembly to force SIB addressing mode.

Change since v9:
- Only enable the usdt_o2 test case on x86_64 and i386 architectures since the
  SIB addressing mode is only supported on x86_64 and i386.

Change since v10:
- Replace `__attribute__((optimize("O2")))` with `#pragma GCC optimize("O1")`
  to fix the issue where the optimized compilation condition works improperly.
- Renamed the usdt_o2 test case and relevant file names to usdt_o1, since
  O1-level optimization is enough to generate a SIB-addressing USDT argument
  spec.

Change since v11:
- Replace `STAP_PROBE1` with `STAP_PROBE_ASM`
- Use bit fields instead of bit shifting operations
- Merge the usdt_o1 test case into the usdt test case

Change since v12:
- This patch set is the same as v12, just with a new version number.

Change since v13 (resolve some review comments):
- https://lore.kernel.org/bpf/CAEf4BzZWd2zUC=U6uGJFF3EMZ7zWGLweQAG3CJWTeHy-5yFEPw@mail.gmail.com/
- https://lore.kernel.org/bpf/CAEf4Bzbs3hV_Q47+d93tTX13WkrpkpOb4=U04mZCjHyZg4aVdw@mail.gmail.com/

Change since v14:
- fix a typo in __bpf_usdt_arg_spec

Change since v15 (resolve some review comments):
- https://lore.kernel.org/bpf/CAEf4BzaxuYijEfQMDFZ+CQdjxLuDZiesUXNA-SiopS+5+VxRaA@mail.gmail.com/
- https://lore.kernel.org/bpf/CAEf4BzaHi5kpuJ6OVvDU62LT5g0qHbWYMfb_FBQ3iuvvUF9fag@mail.gmail.com/
- https://lore.kernel.org/bpf/d438bf3a-a9c9-4d34-b814-63f2e9bb3a85@linux.dev/
====================

Link: https://patch.msgid.link/20250827053128.1301287-1-phoenix500526@163.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2 months ago  selftests/bpf: Enrich subtest_basic_usdt case in selftests to cover SIB handling...
Jiawei Zhao [Wed, 27 Aug 2025 05:31:28 +0000 (05:31 +0000)] 
selftests/bpf: Enrich subtest_basic_usdt case in selftests to cover SIB handling logic

When using GCC on x86-64 to compile a USDT prog with -O1 or higher
optimization, the compiler will generate SIB addressing mode for global
arrays, e.g. "1@-96(%rbp,%rax,8)".

In this patch:
- enrich the subtest_basic_usdt test case to cover the SIB-addressing USDT
  argument spec handling logic

Signed-off-by: Jiawei Zhao <phoenix500526@163.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250827053128.1301287-3-phoenix500526@163.com
2 months ago  libbpf: Fix USDT SIB argument handling causing unrecognized register error
Jiawei Zhao [Wed, 27 Aug 2025 05:31:27 +0000 (05:31 +0000)] 
libbpf: Fix USDT SIB argument handling causing unrecognized register error

On x86-64, USDT arguments can be specified using Scale-Index-Base (SIB)
addressing, e.g. "1@-96(%rbp,%rax,8)". The current USDT implementation
in libbpf cannot parse this format, causing `bpf_program__attach_usdt()`
to fail with -ENOENT (unrecognized register).

This patch fixes this by implementing the necessary changes:
- add correct handling for SIB-addressed arguments in `bpf_usdt_arg`.
- add adaptive support to `__bpf_usdt_arg_type` and
  `__bpf_usdt_arg_spec` to represent SIB addressing parameters.
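
For reference, a hedged parsing sketch for this argument shape (not libbpf's
actual code; `arg` holds the spec string):

  /* e.g. arg = "1@-96(%rbp,%rax,8)": size@offset(%base,%index,scale) */
  int arg_sz, off, scale;
  char base[16], idx[16];

  if (sscanf(arg, "%d@%d(%%%15[^,],%%%15[^,],%d)",
             &arg_sz, &off, base, idx, &scale) == 5) {
          /* effective address: base_reg + index_reg * scale + off */
  }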

Signed-off-by: Jiawei Zhao <phoenix500526@163.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250827053128.1301287-2-phoenix500526@163.com
2 months ago  selftests/bpf: Fix typos and grammar in test sources
Shubham Sharma [Tue, 26 Aug 2025 12:57:46 +0000 (18:27 +0530)] 
selftests/bpf: Fix typos and grammar in test sources

Fix spelling typos and grammar errors in BPF selftests source code.

Signed-off-by: Shubham Sharma <slopixelz@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250826125746.17983-1-slopixelz@gmail.com
2 months ago  bpf: Add selftest to check the verifier's abstract multiplication
Nandakumar Edamana [Tue, 26 Aug 2025 03:45:24 +0000 (09:15 +0530)] 
bpf: Add selftest to check the verifier's abstract multiplication

Add a new selftest to test the abstract multiplication technique(s) used
by the verifier, following the recent improvement in tnum
multiplication (tnum_mul). One of the newly added programs,
verifier_mul/mul_precise, results in a false positive with the old
tnum_mul, while the program passes with the latest one.

Signed-off-by: Nandakumar Edamana <nandakumar@nandakumar.co.in>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250826034524.2159515-2-nandakumar@nandakumar.co.in
2 months ago  bpf: Improve the general precision of tnum_mul
Nandakumar Edamana [Tue, 26 Aug 2025 03:45:23 +0000 (09:15 +0530)] 
bpf: Improve the general precision of tnum_mul

Drop the value-mask decomposition technique and adopt straightforward
long-multiplication with a twist: when LSB(a) is uncertain, find the
two partial products (for LSB(a) = known 0 and LSB(a) = known 1) and
take a union.

Experiments show that applying this technique in long multiplication
improves the precision in a significant number of cases (at the cost
of losing precision in a relatively lower number of cases).
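
The resulting algorithm looks roughly like the following self-contained
sketch (types and helpers paraphrased from the kernel's tnum code):

  struct tnum { __u64 value; __u64 mask; };
  #define TNUM(v, m) ((struct tnum){ .value = (v), .mask = (m) })

  static struct tnum tnum_add(struct tnum a, struct tnum b)
  {
          __u64 sm = a.mask + b.mask;
          __u64 sv = a.value + b.value;
          __u64 chi = (sm + sv) ^ sv;
          __u64 mu = chi | a.mask | b.mask;

          return TNUM(sv & ~mu, mu);
  }

  /* smallest tnum covering both a and b */
  static struct tnum tnum_union(struct tnum a, struct tnum b)
  {
          __u64 mu = a.mask | b.mask | (a.value ^ b.value);

          return TNUM(a.value & ~mu, mu);
  }

  struct tnum tnum_mul(struct tnum a, struct tnum b)
  {
          struct tnum acc = TNUM(0, 0);

          while (a.value || a.mask) {
                  if (a.value & 1)       /* LSB(a) known 1: add b */
                          acc = tnum_add(acc, b);
                  else if (a.mask & 1)   /* LSB(a) unknown: union of both cases */
                          acc = tnum_union(acc, tnum_add(acc, b));
                  /* LSB(a) known 0 contributes nothing */
                  a = TNUM(a.value >> 1, a.mask >> 1);
                  b = TNUM(b.value << 1, b.mask << 1);
          }
          return acc;
  }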

Signed-off-by: Nandakumar Edamana <nandakumar@nandakumar.co.in>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Reviewed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250826034524.2159515-1-nandakumar@nandakumar.co.in
2 months ago  Merge branch 's390-bpf-add-s390-jit-support-for-timed-may_goto'
Alexei Starovoitov [Tue, 26 Aug 2025 23:51:52 +0000 (16:51 -0700)] 
Merge branch 's390-bpf-add-s390-jit-support-for-timed-may_goto'

Ilya Leoshkevich says:

====================
s390/bpf: Add s390 JIT support for timed may_goto

v1: https://lore.kernel.org/bpf/20250821103256.291412-1-iii@linux.ibm.com/
v1 -> v2: Fix test_stream_errors (caught by CI).

This series adds timed may_goto implementation to the s390x JIT.
Patch 1 is the implementation itself, patches 2-5 are the associated
test changes.
====================

Link: https://patch.msgid.link/20250821113339.292434-1-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  selftests/bpf: Remove may_goto tests from DENYLIST.s390x
Ilya Leoshkevich [Thu, 21 Aug 2025 11:25:59 +0000 (13:25 +0200)] 
selftests/bpf: Remove may_goto tests from DENYLIST.s390x

The may_goto instruction is now fully supported on s390x, including the
timed implementation, so remove the respective tests from the denylist.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20250821113339.292434-6-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  selftests/bpf: Enable timed may_goto verifier tests on s390x
Ilya Leoshkevich [Thu, 21 Aug 2025 11:25:58 +0000 (13:25 +0200)] 
selftests/bpf: Enable timed may_goto verifier tests on s390x

Now that the timed may_goto implementation is available on s390x,
enable the respective verifier tests.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20250821113339.292434-5-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  selftests/bpf: Add __arch_s390x macro
Ilya Leoshkevich [Thu, 21 Aug 2025 11:25:57 +0000 (13:25 +0200)] 
selftests/bpf: Add __arch_s390x macro

Make it possible to limit certain tests to s390x, just like it's
already done for x86_64, arm64, and riscv64.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20250821113339.292434-4-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  selftests/bpf: Add a missing newline to the "bad arch spec" message
Ilya Leoshkevich [Thu, 21 Aug 2025 11:25:56 +0000 (13:25 +0200)] 
selftests/bpf: Add a missing newline to the "bad arch spec" message

Fix error messages like this one:

  parse_test_spec:FAIL:569 bad arch spec: 's390x'process_subtest:FAIL:1153 Can't parse test spec for program 'may_goto_simple'

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20250821113339.292434-3-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  s390/bpf: Add s390 JIT support for timed may_goto
Ilya Leoshkevich [Thu, 21 Aug 2025 11:25:55 +0000 (13:25 +0200)] 
s390/bpf: Add s390 JIT support for timed may_goto

The verifier provides an architecture-independent implementation of the
may_goto instruction, which is currently used on s390x, but it has a
downside: there is no way to prevent progs using it from running for a
very long time.

The solution to this problem is an alternative timed implementation,
which requires architecture-specific bits. Its availability is signaled
to the verifier by bpf_jit_supports_timed_may_goto() returning true.

The verifier then emits a call to arch_bpf_timed_may_goto() using a
non-standard calling convention. This function must act as a trampoline
for bpf_check_timed_may_goto().

Implement bpf_jit_supports_timed_may_goto(), account for the special
calling convention in the BPF_CALL implementation, and implement
arch_bpf_timed_may_goto().

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20250821113339.292434-2-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  selftests/bpf: Remove entries from config.{arch} already present in config
Tiezhu Yang [Tue, 26 Aug 2025 06:50:57 +0000 (14:50 +0800)] 
selftests/bpf: Remove entries from config.{arch} already present in config

`config.{arch}` had entries already present in `config`.

When generating the config used by vmtest, the `config` file is
concatenated with the `config.{arch}` one, which duplicates those
entries, so remove the duplicates.

Use the following command to get the differences:

  $ comm -1 -2  <(sort tools/testing/selftests/bpf/config.x86_64) <(sort tools/testing/selftests/bpf/config)
  $ comm -1 -2  <(sort tools/testing/selftests/bpf/config.aarch64) <(sort tools/testing/selftests/bpf/config)
  $ comm -1 -2  <(sort tools/testing/selftests/bpf/config.riscv64) <(sort tools/testing/selftests/bpf/config)
  $ comm -1 -2  <(sort tools/testing/selftests/bpf/config.ppc64el) <(sort tools/testing/selftests/bpf/config)
  $ comm -1 -2  <(sort tools/testing/selftests/bpf/config.s390x) <(sort tools/testing/selftests/bpf/config)

This is similar to commit 7a42af4b94f1 ("selftests/bpf: Remove entries
from config.s390x already present in config").

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250826065057.11415-1-yangtiezhu@loongson.cn
2 months ago  Merge branch 'bpf-introduce-and-use-rcu_read_lock_dont_migrate'
Alexei Starovoitov [Tue, 26 Aug 2025 01:52:17 +0000 (18:52 -0700)] 
Merge branch 'bpf-introduce-and-use-rcu_read_lock_dont_migrate'

Menglong Dong says:

====================
bpf: introduce and use rcu_read_lock_dont_migrate

migrate_disable() and rcu_read_lock() are used together in many cases in
bpf. However, when PREEMPT_RCU is not enabled, rcu_read_lock() will
disable preemption, which implies migrate_disable(), so we don't need to
call the latter in this case.

In this series, we introduce rcu_read_lock_dont_migrate() and
rcu_read_unlock_migrate(), which call migrate_disable() and
migrate_enable() only when PREEMPT_RCU is enabled, and use
rcu_read_lock_dont_migrate() in the bpf subsystem.

Changes since V2:
* make rcu_read_lock_dont_migrate() more compatible by using IS_ENABLED()

Changes since V1:
* introduce rcu_read_lock_dont_migrate() instead of
  rcu_migrate_disable() + rcu_read_lock()
====================

Link: https://patch.msgid.link/20250821090609.42508-1-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  bpf: use rcu_read_lock_dont_migrate() for trampoline.c
Menglong Dong [Thu, 21 Aug 2025 09:06:09 +0000 (17:06 +0800)] 
bpf: use rcu_read_lock_dont_migrate() for trampoline.c

Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
trampoline.c to obtain better performance when PREEMPT_RCU is not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-8-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  bpf: use rcu_read_lock_dont_migrate() for bpf_prog_run_array_cg()
Menglong Dong [Thu, 21 Aug 2025 09:06:08 +0000 (17:06 +0800)] 
bpf: use rcu_read_lock_dont_migrate() for bpf_prog_run_array_cg()

Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
bpf_prog_run_array_cg to obtain better performance when PREEMPT_RCU is
not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-7-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  bpf: use rcu_read_lock_dont_migrate() for bpf_task_storage_free()
Menglong Dong [Thu, 21 Aug 2025 09:06:07 +0000 (17:06 +0800)] 
bpf: use rcu_read_lock_dont_migrate() for bpf_task_storage_free()

Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
bpf_task_storage_free to obtain better performance when PREEMPT_RCU is
not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-6-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  bpf: use rcu_read_lock_dont_migrate() for bpf_iter_run_prog()
Menglong Dong [Thu, 21 Aug 2025 09:06:06 +0000 (17:06 +0800)] 
bpf: use rcu_read_lock_dont_migrate() for bpf_iter_run_prog()

Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
bpf_iter_run_prog to obtain better performance when PREEMPT_RCU is
not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-5-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  bpf: use rcu_read_lock_dont_migrate() for bpf_inode_storage_free()
Menglong Dong [Thu, 21 Aug 2025 09:06:05 +0000 (17:06 +0800)] 
bpf: use rcu_read_lock_dont_migrate() for bpf_inode_storage_free()

Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
bpf_inode_storage_free to obtain better performance when PREEMPT_RCU is
not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-4-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  bpf: use rcu_read_lock_dont_migrate() for bpf_cgrp_storage_free()
Menglong Dong [Thu, 21 Aug 2025 09:06:04 +0000 (17:06 +0800)] 
bpf: use rcu_read_lock_dont_migrate() for bpf_cgrp_storage_free()

Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
bpf_cgrp_storage_free to obtain better performance when PREEMPT_RCU is
not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-3-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  rcu: add rcu_read_lock_dont_migrate()
Menglong Dong [Thu, 21 Aug 2025 09:06:03 +0000 (17:06 +0800)] 
rcu: add rcu_read_lock_dont_migrate()

migrate_disable() is called to disable migration in the kernel, and it is
often used together with rcu_read_lock().

However, with PREEMPT_RCU disabled, it's unnecessary, as rcu_read_lock()
will always disable preemption, which will also disable migration.

Introduce rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate(),
which enable and disable migration only when PREEMPT_RCU is enabled.
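
A sketch of the resulting helpers (modulo the kernel's exact annotations):

  static inline void rcu_read_lock_dont_migrate(void)
  {
          if (IS_ENABLED(CONFIG_PREEMPT_RCU))
                  migrate_disable();
          rcu_read_lock();
  }

  static inline void rcu_read_unlock_migrate(void)
  {
          rcu_read_unlock();
          if (IS_ENABLED(CONFIG_PREEMPT_RCU))
                  migrate_enable();
  }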

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20250821090609.42508-2-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months ago  bpf: Remove preempt_disable in bpf_try_get_buffers
Tao Chen [Tue, 19 Aug 2025 12:56:38 +0000 (20:56 +0800)] 
bpf: Remove preempt_disable in bpf_try_get_buffers

Now BPF programs run with migration disabled, so it is safe
to use this_cpu_inc_return(bpf_bprintf_nest_level).

Fixes: d9c9e4db186a ("bpf: Factorize bpf_trace_printk and bpf_seq_printf")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250819125638.2544715-1-chen.dylane@linux.dev
2 months ago  bpf: Use sha1() instead of sha1_transform() in bpf_prog_calc_tag()
Eric Biggers [Mon, 11 Aug 2025 20:16:15 +0000 (13:16 -0700)] 
bpf: Use sha1() instead of sha1_transform() in bpf_prog_calc_tag()

Now that there's a proper SHA-1 library API, just use that instead of
the low-level SHA-1 compression function.  This eliminates the need for
bpf_prog_calc_tag() to implement the SHA-1 padding itself.  No
functional change; the computed tags remain the same.
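
A hedged sketch of the resulting call (assuming a one-shot helper of the
form sha1(data, len, out); buffer names are illustrative):

  u8 digest[SHA1_DIGEST_SIZE];

  sha1(prog_bytes, prog_len, digest);  /* padding handled by the library */
  memcpy(prog->tag, digest, sizeof(prog->tag));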

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250811201615.564461-1-ebiggers@kernel.org
2 months ago  selftests/bpf: Tests for is_scalar_branch_taken tnum logic
Paul Chaignon [Wed, 20 Aug 2025 13:19:13 +0000 (15:19 +0200)] 
selftests/bpf: Tests for is_scalar_branch_taken tnum logic

This patch adds tests for the new jeq and jne logic in
is_scalar_branch_taken. The following shows the first test failing
before the previous patch is applied. Once the previous patch is
applied, the verifier can use the tnum values to deduce that instruction
7 is dead code.

  0: call bpf_get_prandom_u32#7  ; R0_w=scalar()
  1: w0 = w0                     ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  2: r0 >>= 30                   ; R0_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=3,var_off=(0x0; 0x3))
  3: r0 <<= 30                   ; R0_w=scalar(smin=0,smax=umax=umax32=0xc0000000,smax32=0x40000000,var_off=(0x0; 0xc0000000))
  4: r1 = r0                     ; R0_w=scalar(id=1,smin=0,smax=umax=umax32=0xc0000000,smax32=0x40000000,var_off=(0x0; 0xc0000000)) R1_w=scalar(id=1,smin=0,smax=umax=umax32=0xc0000000,smax32=0x40000000,var_off=(0x0; 0xc0000000))
  5: r1 += 1024                  ; R1_w=scalar(smin=umin=umin32=1024,smax=umax=umax32=0xc0000400,smin32=0x80000400,smax32=0x40000400,var_off=(0x400; 0xc0000000))
  6: if r1 != r0 goto pc+1       ; R0_w=scalar(id=1,smin=umin=umin32=1024,smax=umax=umax32=0xc0000000,smin32=0x80000400,smax32=0x40000000,var_off=(0x400; 0xc0000000)) R1_w=scalar(smin=umin=umin32=1024,smax=umax=umax32=0xc0000000,smin32=0x80000400,smax32=0x40000400,var_off=(0x400; 0xc0000000))
  7: r10 = 0
  frame pointer is read only

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/550004f935e2553bdb2fb1f09cbde7d0452112d0.1755694148.git.paul.chaignon@gmail.com
2 months ago  bpf: Use tnums for JEQ/JNE is_branch_taken logic
Paul Chaignon [Wed, 20 Aug 2025 13:18:06 +0000 (15:18 +0200)] 
bpf: Use tnums for JEQ/JNE is_branch_taken logic

In the following toy program (reg states minimized for readability), R0
and R1 always have different values at instruction 6. This is obvious
when reading the program but cannot be guessed from ranges alone as
they overlap (R0 in [0; 0xc0000000], R1 in [1024; 0xc0000400]).

  0: call bpf_get_prandom_u32#7  ; R0_w=scalar()
  1: w0 = w0                     ; R0_w=scalar(var_off=(0x0; 0xffffffff))
  2: r0 >>= 30                   ; R0_w=scalar(var_off=(0x0; 0x3))
  3: r0 <<= 30                   ; R0_w=scalar(var_off=(0x0; 0xc0000000))
  4: r1 = r0                     ; R1_w=scalar(var_off=(0x0; 0xc0000000))
  5: r1 += 1024                  ; R1_w=scalar(var_off=(0x400; 0xc0000000))
  6: if r1 != r0 goto pc+1

Looking at tnums, however, we can deduce that R1 is always different from
R0 because their tnums don't agree on known bits. This patch uses this
logic to improve is_scalar_branch_taken in the BPF_JEQ and BPF_JNE cases.
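
The core check can be sketched as follows (hypothetical helper name, not
necessarily the verifier's exact code):

  /* true if a and b disagree on a bit that is known in both, in which
   * case the two registers can never hold equal values
   */
  static bool tnum_known_disagree(struct tnum a, struct tnum b)
  {
          __u64 known_both = ~(a.mask | b.mask);

          return !!((a.value ^ b.value) & known_both);
  }

In the program above, bit 10 (0x400) is known to be 0 in R0 but known to be
1 in R1, so the branch at instruction 6 is always taken.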

This change has a tiny impact on complexity, which was measured with
the Cilium complexity CI test. That test covers 72 programs with
various build and load time configurations for a total of 970 test
cases. For 80% of test cases, the patch has no impact. On the other
test cases, the patch decreases complexity by only 0.08% on average. In
the best case, the verifier needs to walk 3% less instructions and, in
the worst case, 1.5% more. Overall, the patch has a small positive
impact, especially for our largest programs.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/be3ee70b6e489c49881cb1646114b1d861b5c334.1755694147.git.paul.chaignon@gmail.com
2 months ago  selftests/bpf: Use vmlinux.h for BPF programs
Hengqi Chen [Thu, 21 Aug 2025 03:02:54 +0000 (03:02 +0000)] 
selftests/bpf: Use vmlinux.h for BPF programs

Some of the BPF test progs still use linux/libc headers.
Let's use vmlinux.h instead, like the rest of the test progs.
This will also ease cross-compiling.

Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250821030254.398826-1-hengqi.chen@gmail.com
2 months ago  libbpf: Add documentation to version and error API functions
Cryolitia PukNgae [Wed, 20 Aug 2025 09:22:42 +0000 (17:22 +0800)] 
libbpf: Add documentation to version and error API functions

Add documentation for the following API functions:

- libbpf_major_version()
- libbpf_minor_version()
- libbpf_version_string()
- libbpf_strerror()

Signed-off-by: Cryolitia PukNgae <cryolitia@uniontech.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250820-libbpf-doc-1-v1-1-13841f25a134@uniontech.com
2 months ago  libbpf: Export bpf_object__prepare symbol
Mykyta Yatsenko [Tue, 19 Aug 2025 21:51:19 +0000 (22:51 +0100)] 
libbpf: Export bpf_object__prepare symbol

Add missing LIBBPF_API macro for bpf_object__prepare function to enable
its export. libbpf.map had bpf_object__prepare already listed.

Fixes: 1315c28ed809 ("libbpf: Split bpf object load into prepare/load")
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250819215119.37795-1-mykyta.yatsenko5@gmail.com
2 months ago  s390/bpf: Use direct calls and jumps where possible
Ilya Leoshkevich [Tue, 19 Aug 2025 10:20:38 +0000 (12:20 +0200)] 
s390/bpf: Use direct calls and jumps where possible

After the V!=R rework (commit c98d2ecae08f ("s390/mm: Uncouple physical
vs virtual address spaces")), all kernel code and related data are
allocated within a 4G region, making it possible to use relative
addressing in BPF code more extensively.

Convert as many indirect calls and jumps to direct calls as possible,
namely:

* BPF_CALL
* __bpf_tramp_enter()
* __bpf_tramp_exit()
* __bpf_prog_enter()
* __bpf_prog_exit()
* fentry
* fmod_ret
* fexit
* BPF_TRAMP_F_CALL_ORIG without BPF_TRAMP_F_ORIG_STACK
* Trampoline returns without BPF_TRAMP_F_SKIP_FRAME and
  BPF_TRAMP_F_ORIG_STACK

The following indirect calls and jumps remain:

* Prog returns
* Trampoline returns with BPF_TRAMP_F_SKIP_FRAME or
  BPF_TRAMP_F_ORIG_STACK
* BPF_TAIL_CALL
* BPF_TRAMP_F_CALL_ORIG with BPF_TRAMP_F_ORIG_STACK

As a result, only one usage of call_r1() remains, so inline it.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250819102116.252203-1-iii@linux.ibm.com
2 months ago  bpftool: Add kernel.kptr_restrict hint for no instructions
Vincent Li [Mon, 18 Aug 2025 16:51:13 +0000 (09:51 -0700)] 
bpftool: Add kernel.kptr_restrict hint for no instructions

From bpftool's GitHub repository issue [0]: when a Linux distribution
has kernel.kptr_restrict set to 2, bpftool prog dump jited returns
"no instructions returned". This message can be puzzling to bpftool
users who are not familiar with kernel BPF internals, so add a small
hint to check the kernel.kptr_restrict setting, similar to the
DUMP_XLATED case. Besides kernel.kptr_restrict, no instructions may
also be returned when the JIT is disabled.

Signed-off-by: Vincent Li <vincent.mc.li@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://github.com/libbpf/bpftool/issues/184
Link: https://lore.kernel.org/bpf/20250818165113.15982-1-vincent.mc.li@gmail.com
2 months ago  Merge branch 'bpf-next/skb-meta-dynptr' into 'bpf-next/master'
Martin KaFai Lau [Tue, 19 Aug 2025 00:52:13 +0000 (17:52 -0700)] 
Merge branch 'bpf-next/skb-meta-dynptr' into 'bpf-next/master'

Merge 'skb-meta-dynptr' branch into 'master' branch. No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months ago  Merge branch 'add-a-dynptr-type-for-skb-metadata-for-tc-bpf'
Martin KaFai Lau [Mon, 18 Aug 2025 17:29:43 +0000 (10:29 -0700)] 
Merge branch 'add-a-dynptr-type-for-skb-metadata-for-tc-bpf'

Jakub Sitnicki says:

====================
Add a dynptr type for skb metadata for TC BPF

TL;DR
-----

This is the first step in an effort which aims to enable skb metadata
access for all BPF programs which operate on an skb context.

By skb metadata we mean the custom metadata area which can be allocated
from an XDP program with the bpf_xdp_adjust_meta helper [1]. Network stack
code accesses it using the skb_metadata_* helpers.

Changelog
---------
Changes in v7:
- Make dynptr read-only for cloned skbs for now. (Martin)
- Extend tests for skb clones to cover writes to metadata.
- Drop Jesse's review stamp for patch 2 due to an update.
- Link to v6: https://lore.kernel.org/r/20250804-skb-metadata-thru-dynptr-v6-0-05da400bfa4b@cloudflare.com

Changes in v6:
- Enable CONFIG_NET_ACT_MIRRED for bpf selftests to fix CI failure
- Switch from u32 to matchall classifier, which bpf selftests already use
- Link to v5: https://lore.kernel.org/r/20250731-skb-metadata-thru-dynptr-v5-0-f02f6b5688dc@cloudflare.com

Changes in v5:
- Invalidate skb payload and metadata slices on write to metadata. (Martin)
- Drop redundant bounds check in bpf_skb_meta_*(). (Martin)
- Check for unexpected flags in __bpf_dynptr_write(). (Martin)
- Fold bpf_skb_meta_{load,store}_bytes() into callers.
- Add a test for metadata access when an skb clone has been modified.
- Drop Eduard's Ack for patch 3. Patch updated.
- Keep Eduard's Ack for patches 4-8.
- Add Jesse's stamp from an internal review.
- Link to v4: https://lore.kernel.org/r/20250723-skb-metadata-thru-dynptr-v4-0-a0fed48bcd37@cloudflare.com

Changes in v4:
- Kill bpf_dynptr_from_skb_meta_rdonly. Not needed for now. (Martin)
- Add a test to cover passing OOB offsets to dynptr ops. (Eduard)
- Factor out bounds checks from bpf_dynptr_{read,write,slice}. (Eduard)
- Squash patches:
      bpf: Enable read access to skb metadata with bpf_dynptr_read
      bpf: Enable write access to skb metadata with bpf_dynptr_write
      bpf: Enable read-write access to skb metadata with dynptr slice
- Kept Eduard's Acks for v3 on unchanged patches.
- Link to v3: https://lore.kernel.org/r/20250721-skb-metadata-thru-dynptr-v3-0-e92be5534174@cloudflare.com

Changes in v3:
- Add a kfunc set for skb metadata access. Limited to TC BPF. (Martin)
- Drop patches related to skb metadata access outside of TC BPF:
      net: Clear skb metadata on handover from device to protocol
      selftests/bpf: Cover lack of access to skb metadata at ip layer
      selftests/bpf: Count successful bpf program runs
- Link to v2: https://lore.kernel.org/r/20250716-skb-metadata-thru-dynptr-v2-0-5f580447e1df@cloudflare.com

Changes in v2:
- Switch to a dedicated dynptr type for skb metadata (Andrii)
- Add verifier test coverage since we now touch its code
- Add missing test coverage for bpf_dynptr_adjust and access at an offset
- Link to v1: https://lore.kernel.org/r/20250630-skb-metadata-thru-dynptr-v1-0-f17da13625d8@cloudflare.com

Overview
--------

Today, the skb metadata is accessible only by BPF TC ingress programs
through the __sk_buff->data_meta pointer. We propose a three-step plan to
make skb metadata available to all other BPF programs which operate on skb
objects:

 1) Add a dynptr type for skb metadata (this patch set)

    This is a preparatory step, but it also stands on its own. Here we
    enable access to the skb metadata through a bpf_dynptr, the same way we
    can already access the skb payload today.

    As the next step (2), we want to relocate the metadata as skb travels
    through the network stack in order to persist it. That will require a
    safe way to access the metadata area irrespective of its location.

    This is where the dynptr [2] comes into play. It solves exactly that
    problem. A dynptr to skb metadata can be backed by a memory area that
    resides in a different location depending on the code path.

 2) Persist skb metadata past the TC hook (future)

    Having the metadata in front of the packet headers as the skb travels
    through the network stack is problematic - see the discussion of
    alternative approaches below. Hence, we plan to relocate it as
    necessary past the TC hook.

    Where to relocate it? We don't know yet. There are a couple of
    options: (i) move it to the top of skb headroom, or (ii) allocate
    dedicated memory for it.  They are not mutually exclusive. The right
    solution might be a mix.

    When to relocate it? That is also an open question. It could be done
    during device to protocol handover or lazily when headers get pushed or
    headroom gets resized.

 3) skb dynptr for sockops, sk_lookup, etc. (future)

    There are BPF program types that don't operate on the __sk_buff context, but
    either have, or could have, access to the skb itself. As a final touch,
    we want to provide a way to create an skb metadata dynptr for these
    program types.

TIMTOWDI
--------

Alternative approaches which we considered:

* Keep the metadata always in front of skb->data

We think it is a bad idea for two reasons, outlined below. Nevertheless we
are open to it, if necessary.

 1) Performance concerns

    It would require the network stack to move the metadata on each header
    pull/push - see skb_reorder_vlan_header() [3] for an example. While
    doable, there is an expected performance overhead.

 2) Potential for bugs

    In addition to updating skb_push/pull and pskb_expand_head, we would
    need to audit any code paths which operate on the skb->data pointer
    directly without going through the helpers. This creates a "known
    unknown" risk.

* Design a new custom metadata area from scratch

We have tried that in Arthur's patch set [4]. One of the outcomes of the
discussion there was that we don't want to have two places to store custom
metadata. Hence the change of approach to make the existing custom metadata
area work.

-jkbs

[1] https://docs.ebpf.io/linux/helper-function/bpf_xdp_adjust_meta/
[2] https://docs.ebpf.io/linux/concepts/dynptrs/
[3] https://elixir.bootlin.com/linux/v6.16-rc6/source/net/core/skbuff.c#L6211
[4] https://lore.kernel.org/all/20250422-afabre-traits-010-rfc2-v2-0-92bcc6b146c9@arthurfabre.com/
====================

Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-0-8a39e636e0fb@cloudflare.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months ago  selftests/bpf: Cover metadata access from a modified skb clone
Jakub Sitnicki [Thu, 14 Aug 2025 09:59:35 +0000 (11:59 +0200)] 
selftests/bpf: Cover metadata access from a modified skb clone

Demonstrate that, when processing an skb clone, the metadata gets truncated
if the program contains a direct write to either the payload or the
metadata, due to an implicit unclone in the prologue, and that otherwise the
dynptr to the metadata is limited to being read-only.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-9-8a39e636e0fb@cloudflare.com
2 months ago  selftests/bpf: Cover read/write to skb metadata at an offset
Jakub Sitnicki [Thu, 14 Aug 2025 09:59:34 +0000 (11:59 +0200)] 
selftests/bpf: Cover read/write to skb metadata at an offset

Exercise r/w access to skb metadata through an offset-adjusted dynptr,
read/write helper with an offset argument, and a slice starting at an
offset.

Also check for the expected errors when the offset is out of bounds.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-8-8a39e636e0fb@cloudflare.com
2 months ago  selftests/bpf: Cover write access to skb metadata via dynptr
Jakub Sitnicki [Thu, 14 Aug 2025 09:59:33 +0000 (11:59 +0200)] 
selftests/bpf: Cover write access to skb metadata via dynptr

Add tests that exercise writes to skb metadata in two ways:
1. indirectly, using bpf_dynptr_write helper,
2. directly, using a read-write dynptr slice.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-7-8a39e636e0fb@cloudflare.com
2 months ago  selftests/bpf: Cover read access to skb metadata via dynptr
Jakub Sitnicki [Thu, 14 Aug 2025 09:59:32 +0000 (11:59 +0200)] 
selftests/bpf: Cover read access to skb metadata via dynptr

Exercise reading from SKB metadata area in two new ways:
1. indirectly, with bpf_dynptr_read(), and
2. directly, with bpf_dynptr_slice().

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-6-8a39e636e0fb@cloudflare.com
2 months ago  selftests/bpf: Parametrize test_xdp_context_tuntap
Jakub Sitnicki [Thu, 14 Aug 2025 09:59:31 +0000 (11:59 +0200)] 
selftests/bpf: Parametrize test_xdp_context_tuntap

We want to add more test cases to cover different ways to access the
metadata area. Prepare for that by pulling up the skeleton management.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-5-8a39e636e0fb@cloudflare.com
2 months ago  selftests/bpf: Pass just bpf_map to xdp_context_test helper
Jakub Sitnicki [Thu, 14 Aug 2025 09:59:30 +0000 (11:59 +0200)] 
selftests/bpf: Pass just bpf_map to xdp_context_test helper

Prepare for parametrizing the xdp_context tests. The assert_test_result
helper doesn't need the whole skeleton. Pass just what it needs.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-4-8a39e636e0fb@cloudflare.com
2 months ago  selftests/bpf: Cover verifier checks for skb_meta dynptr type
Jakub Sitnicki [Thu, 14 Aug 2025 09:59:29 +0000 (11:59 +0200)] 
selftests/bpf: Cover verifier checks for skb_meta dynptr type

The dynptr for skb metadata behaves the same way as the dynptr for skb data,
with one exception: writes to the skb_meta dynptr don't invalidate existing
skb and skb_meta slices.

Duplicate those skb dynptr tests which we can (the bpf_dynptr_from_skb_meta
kfunc can be called only from TC BPF) to cover the skb_meta dynptr verifier
checks.

Also add a couple of new tests (skb_data_valid_*), specific to the skb_meta
dynptr, to ensure we don't invalidate the slices in the mentioned case.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-3-8a39e636e0fb@cloudflare.com
2 months ago  bpf: Enable read/write access to skb metadata through a dynptr
Jakub Sitnicki [Thu, 14 Aug 2025 09:59:28 +0000 (11:59 +0200)] 
bpf: Enable read/write access to skb metadata through a dynptr

Now that we can create a dynptr to skb metadata, make reads to the metadata
area possible with bpf_dynptr_read() or through a bpf_dynptr_slice(), and
make writes to the metadata area possible with bpf_dynptr_write() or
through a bpf_dynptr_slice_rdwr().

Note that for cloned skbs which share data with the original, we limit the
skb metadata dynptr to be read-only since we don't unclone on a
bpf_dynptr_write to metadata.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-2-8a39e636e0fb@cloudflare.com
2 months ago  bpf: Add dynptr type for skb metadata
Jakub Sitnicki [Thu, 14 Aug 2025 09:59:27 +0000 (11:59 +0200)] 
bpf: Add dynptr type for skb metadata

Add a dynptr type, similar to skb dynptr, but for the skb metadata access.

The dynptr provides an alternative to __sk_buff->data_meta for accessing
the custom metadata area allocated using the bpf_xdp_adjust_meta() helper.

More importantly, it abstracts away where the storage for the custom
metadata lives, which opens up the way to persist the metadata by
relocating it as the skb travels through the network stack layers.

Writes to skb metadata invalidate any existing skb payload and metadata
slices. While this is more restrictive than needed at the moment, it leaves
the door open to reallocating the metadata on writes, and should be only a
minor inconvenience to the users.

Only the program types which can access __sk_buff->data_meta today are
allowed to create a dynptr for skb metadata at the moment. We need to
modify the network stack to persist the metadata across layers before
opening up access to other BPF hooks.

Once more BPF hooks gain access to skb_meta dynptr, we will also need to
add a read-only variant of the helper similar to
bpf_dynptr_from_skb_rdonly.

skb_meta dynptr ops are stubbed out and implemented by subsequent changes.
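
For illustration, a TC program using the new dynptr type might look like
this (a sketch assuming the kfunc mirrors bpf_dynptr_from_skb's signature
and the usual vmlinux.h/bpf_helpers.h includes; MAGIC is a made-up value):

  SEC("tc")
  int use_meta(struct __sk_buff *skb)
  {
          struct bpf_dynptr meta;
          __u32 tag = 0;

          if (bpf_dynptr_from_skb_meta(skb, 0, &meta))
                  return TC_ACT_OK;
          bpf_dynptr_read(&tag, sizeof(tag), &meta, 0, 0);
          return tag == MAGIC ? TC_ACT_SHOT : TC_ACT_OK;
  }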

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-1-8a39e636e0fb@cloudflare.com
2 months ago  bpf: Add a verbose message when the BTF limit is reached
Anton Protopopov [Sat, 16 Aug 2025 15:15:54 +0000 (15:15 +0000)] 
bpf: Add a verbose message when the BTF limit is reached

When a BPF program which is being loaded reaches the map limit
(MAX_USED_MAPS) or the BTF limit (MAX_USED_BTFS), -E2BIG is
returned. However, in the former case there is an accompanying
verifier verbose message, and in the latter case there is not.
Add a verbose message to make the behaviour symmetrical.
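
The new message can mirror the existing one for maps, roughly (illustrative
wording, not necessarily the exact patch):

  if (env->used_btf_cnt >= MAX_USED_BTFS) {
          verbose(env,
                  "The total number of btfs per program has reached the limit of %u\n",
                  MAX_USED_BTFS);
          return -E2BIG;
  }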

Reported-by: Kevin Sheldrake <kevin.sheldrake@isovalent.com>
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250816151554.902995-1-a.s.protopopov@gmail.com
2 months ago  bpf: Replace get_next_cpu() with cpumask_next_wrap()
Fushuai Wang [Mon, 18 Aug 2025 03:23:44 +0000 (11:23 +0800)] 
bpf: Replace get_next_cpu() with cpumask_next_wrap()

The get_next_cpu() function was used in only one place, to find
the next possible CPU, so it can be replaced by cpumask_next_wrap().

Signed-off-by: Fushuai Wang <wangfushuai@baidu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250818032344.23229-1-wangfushuai@baidu.com
2 months ago  selftests/bpf: Clobber a lot of registers in tailcall_bpf2bpf_hierarchy tests
Ilya Leoshkevich [Wed, 13 Aug 2025 12:06:31 +0000 (14:06 +0200)] 
selftests/bpf: Clobber a lot of registers in tailcall_bpf2bpf_hierarchy tests

Clobbering a lot of registers and stack slots helps expose tail call
counter overwrite bugs in JITs.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250813121016.163375-5-iii@linux.ibm.com
2 months ago  s390/bpf: Write back tail call counter for BPF_TRAMP_F_CALL_ORIG
Ilya Leoshkevich [Wed, 13 Aug 2025 12:06:30 +0000 (14:06 +0200)] 
s390/bpf: Write back tail call counter for BPF_TRAMP_F_CALL_ORIG

The tailcall_bpf2bpf_hierarchy_fentry test hangs on s390. Its call
graph is as follows:

  entry()
    subprog_tail()
      trampoline()
        fentry()
        the rest of subprog_tail()  # via BPF_TRAMP_F_CALL_ORIG
        return to entry()

The problem is that the rest of subprog_tail() increments the tail call
counter, but the trampoline discards the incremented value. This
results in an astronomically large number of tail calls.

Fix by making the trampoline write the incremented tail call counter
back.

Fixes: 528eb2cb87bc ("s390/bpf: Implement arch_prepare_bpf_trampoline()")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250813121016.163375-4-iii@linux.ibm.com
2 months ago  s390/bpf: Write back tail call counter for BPF_PSEUDO_CALL
Ilya Leoshkevich [Wed, 13 Aug 2025 12:06:29 +0000 (14:06 +0200)] 
s390/bpf: Write back tail call counter for BPF_PSEUDO_CALL

The tailcall_bpf2bpf_hierarchy_1 test hangs on s390. Its call graph is
as follows:

  entry()
    subprog_tail()
      bpf_tail_call_static(0) -> entry + tail_call_start
    subprog_tail()
      bpf_tail_call_static(0) -> entry + tail_call_start

entry() copies its tail call counter to the subprog_tail()'s frame,
which then increments it. However, the incremented result is discarded,
leading to an astronomically large number of tail calls.

Fix by writing the incremented counter back to the entry()'s frame.

Fixes: dd691e847d28 ("s390/bpf: Implement bpf_jit_supports_subprog_tailcalls()")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250813121016.163375-3-iii@linux.ibm.com
2 months agos390/bpf: Do not write tail call counter into helper and kfunc frames
Ilya Leoshkevich [Wed, 13 Aug 2025 12:06:28 +0000 (14:06 +0200)] 
s390/bpf: Do not write tail call counter into helper and kfunc frames

Only BPF functions make use of the tail call counter; helpers and
kfuncs ignore and most likely also clobber it. Writing it into these
functions' frames is pointless and misleading, so do not do it.

Fixes: dd691e847d28 ("s390/bpf: Implement bpf_jit_supports_subprog_tailcalls()")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250813121016.163375-2-iii@linux.ibm.com
2 months agoMerge branch 'libbpf-fix-reuse-of-devmap'
Andrii Nakryiko [Fri, 15 Aug 2025 23:50:15 +0000 (16:50 -0700)] 
Merge branch 'libbpf-fix-reuse-of-devmap'

Yureka Lilian says:

====================
libbpf: fix reuse of DEVMAP

changes in v3:

- instead of setting BPF_F_RDONLY_PROG on both sides, just
  clear BPF_F_RDONLY_PROG in map_info.map_flags as suggested
  by Andrii Nakryiko

- in the test, use ASSERT_* instead of CHECK
- shorten the test by using open_and_load from the skel
- in the test, drop NULL check before unloading/destroying bpf objs

- start the commit messages with "libbpf" and "selftests/bpf"
  respectively instead of just "bpf"

changes in v2:

- preserve compatibility with older kernels
- add a basic selftest covering the re-use of DEVMAP maps
====================

Link: https://patch.msgid.link/20250814180113.1245565-2-yuka@yuka.dev
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2 months agoselftests/bpf: Add test for DEVMAP reuse
Yureka Lilian [Thu, 14 Aug 2025 18:01:13 +0000 (20:01 +0200)] 
selftests/bpf: Add test for DEVMAP reuse

The test covers basic re-use of a pinned DEVMAP map,
with both matching and mismatching parameters.

Signed-off-by: Yureka Lilian <yuka@yuka.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250814180113.1245565-4-yuka@yuka.dev
2 months agolibbpf: Fix reuse of DEVMAP
Yureka Lilian [Thu, 14 Aug 2025 18:01:12 +0000 (20:01 +0200)] 
libbpf: Fix reuse of DEVMAP

Previously, re-using pinned DEVMAP maps would always fail, because
get_map_info on a DEVMAP always returns flags with BPF_F_RDONLY_PROG set,
but BPF_F_RDONLY_PROG being set on a map during creation is invalid.

Thus, ignore the BPF_F_RDONLY_PROG flag in the flags returned from
get_map_info when checking for compatibility with an existing DEVMAP.

The same problem is handled in a third-party eBPF library:
- https://github.com/cilium/ebpf/issues/925
- https://github.com/cilium/ebpf/pull/930
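
A sketch of the adjusted compatibility check (bpf_map_info and
BPF_F_RDONLY_PROG are real; the surrounding helper is hypothetical):

    __u32 map_flags = map_info.map_flags;

    /* The kernel reports BPF_F_RDONLY_PROG for DEVMAPs even though the
     * flag is invalid at map creation time, so ignore it when comparing
     * against the pinned map's flags.
     */
    if (map_info.type == BPF_MAP_TYPE_DEVMAP ||
        map_info.type == BPF_MAP_TYPE_DEVMAP_HASH)
            map_flags &= ~BPF_F_RDONLY_PROG;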

Fixes: 0cdbb4b09a06 ("devmap: Allow map lookups from eBPF")
Signed-off-by: Yureka Lilian <yuka@yuka.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250814180113.1245565-3-yuka@yuka.dev
2 months agobpf: Remove migrate_disable in kprobe_multi_link_prog_run
Tao Chen [Thu, 14 Aug 2025 12:14:29 +0000 (20:14 +0800)] 
bpf: Remove migrate_disable in kprobe_multi_link_prog_run

The graph tracer framework ensures we won't migrate:
kprobe_multi_link_prog_run() is called all the way from the graph
tracer, which disables preemption in function_graph_enter_regs(). As
Jiri and Yonghong suggested, there is no need to use migrate_disable(),
so some overhead can be avoided. Also add a cant_sleep() check for
__this_cpu_inc_return().
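
A sketch of the resulting prologue (cant_sleep() and bpf_prog_active
are real kernel symbols; the surrounding function body is elided):

    cant_sleep();
    if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
            bpf_prog_inc_misses_counter(link->link.prog);
            goto out;
    }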

Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250814121430.2347454-1-chen.dylane@linux.dev
2 months agobpf/selftests: Fix test_tcpnotify_user
Matt Bobrowski [Fri, 15 Aug 2025 12:12:14 +0000 (12:12 +0000)] 
bpf/selftests: Fix test_tcpnotify_user

Based on a bisect, it appears that commit 7ee988770326 ("timers:
Implement the hierarchical pull model") has somehow inadvertently
broken BPF selftest test_tcpnotify_user. The error that is being
generated by this test is as follows:

FAILED: Wrong stats Expected 10 calls, got 8

It looks like the change allows timer functions to be run on CPUs
different from the one they are armed on. The test had pinned itself
to CPU 0, and in the past the retransmit attempts also occurred on CPU
0. The test had set the max_entries attribute for
BPF_MAP_TYPE_PERF_EVENT_ARRAY to 2 and was calling
bpf_perf_event_output() with BPF_F_CURRENT_CPU, so the entry was
likely to be in range. With the change to allow timers to run on other
CPUs, the current CPU tasked with performing the retransmit might be
bumped and in turn fall out of range, as the event will be filtered
out via __bpf_perf_event_output() using:

    if (unlikely(index >= array->map.max_entries))
            return -E2BIG;

A possible change would be to explicitly set the max_entries attribute
for perf_event_map in test_tcpnotify_kern.c to a value that's at least
as large as the number of CPUs. As it turns out, however, if the field
is left unset, then libbpf will determine the number of CPUs available
on the underlying system and update the max_entries attribute accordingly
in map_set_def_max_entries().
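
A sketch of such a map definition in libbpf's BTF-map syntax (the map
name matches the test; the rest is illustrative):

    struct {
            __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
            __uint(key_size, sizeof(int));
            __uint(value_size, sizeof(__u32));
            /* max_entries left unset: libbpf sizes it to the number
             * of CPUs in map_set_def_max_entries()
             */
    } perf_event_map SEC(".maps");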

A further problem with the test is that it has a thread that keeps
running right up until the program exits. The main thread cleans up some
libbpf data structures while the other thread continues to use them,
which inevitably triggers a SIGSEGV. This can be dealt with by
telling the thread to run only for as long as necessary and doing a
pthread_join() on it before exiting the program.
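
A minimal userspace sketch of that shutdown ordering (the names are
assumptions, not the test's actual symbols):

    #include <pthread.h>
    #include <stdatomic.h>

    static atomic_bool stop;

    static void *poller(void *arg)
    {
            while (!atomic_load(&stop))
                    ; /* poll the perf buffer here */
            return NULL;
    }

    int main(void)
    {
            pthread_t tid;

            pthread_create(&tid, NULL, poller, NULL);
            /* ... run the test ... */
            atomic_store(&stop, true);
            pthread_join(tid, NULL); /* join before destroying libbpf objects */
            return 0;
    }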

Finally, I don't think binding the process to CPU 0 is meaningful for
this test any more, so get rid of that.

Fixes: 435f90a338ae ("selftests/bpf: add a test case for sock_ops perf-event notification")
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/aJ8kHhwgATmA3rLf@google.com
2 months agoselftests/bpf: Enable arena atomics tests for RV64
Pu Lehui [Sat, 19 Jul 2025 09:17:30 +0000 (09:17 +0000)] 
selftests/bpf: Enable arena atomics tests for RV64

Enable arena atomics tests for RV64.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20250719091730.2660197-11-pulehui@huaweicloud.com
2 months agoriscv, bpf: Add support for arena atomics on RV64
Pu Lehui [Sat, 19 Jul 2025 09:17:29 +0000 (09:17 +0000)] 
riscv, bpf: Add support for arena atomics on RV64

Add arena atomics support for RMW atomics and for load-acquire and
store-release instructions. Non-Zacas cmpxchg, which would be
implemented via a loop, is not currently supported because it requires
more complex extable and loop logic.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20250719091730.2660197-10-pulehui@huaweicloud.com
2 months agoriscv, bpf: Add ex_insn_off and ex_jmp_off for exception table handling
Pu Lehui [Sat, 19 Jul 2025 09:17:28 +0000 (09:17 +0000)] 
riscv, bpf: Add ex_insn_off and ex_jmp_off for exception table handling

Add ex_insn_off and ex_jmp_off fields to struct rv_jit_context so that
add_exception_handler() does not need to be immediately followed by the
instruction that gets the exception table entry. ex_insn_off indicates
the offset of the instruction that gets the exception table entry, and
ex_jmp_off indicates the offset to jump to in order to skip over the
faulting instruction. This prepares for adding exception table entries
to atomic instructions later, because some atomic instructions need to
perform zext or other operations.
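
A sketch of the two new fields (struct rv_jit_context is real; the
other members are elided and the types are an assumption):

    struct rv_jit_context {
            /* ... existing JIT state ... */
            int ex_insn_off;    /* offset of the insn that gets an extable entry */
            int ex_jmp_off;     /* offset to continue at, past the faulting insn */
    };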

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20250719091730.2660197-9-pulehui@huaweicloud.com
2 months agoriscv, bpf: Optimize cmpxchg insn with Zacas support
Pu Lehui [Sat, 19 Jul 2025 09:17:27 +0000 (09:17 +0000)] 
riscv, bpf: Optimize cmpxchg insn with Zacas support

Optimize the cmpxchg instruction with the amocas.w and amocas.d
Zacas instructions.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20250719091730.2660197-8-pulehui@huaweicloud.com
2 months agoriscv, bpf: Add Zacas instructions
Pu Lehui [Sat, 19 Jul 2025 09:17:26 +0000 (09:17 +0000)] 
riscv, bpf: Add Zacas instructions

Add the Zacas instructions introduced by [0] to reduce code size and
improve performance of the RV64 JIT.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://github.com/riscvarchive/riscv-zacas/releases/download/v1.0/riscv-zacas.pdf
Link: https://lore.kernel.org/bpf/20250719091730.2660197-7-pulehui@huaweicloud.com
2 months agoriscv, bpf: Add rv_ext_enabled macro for runtime extension detection
Pu Lehui [Sat, 19 Jul 2025 09:17:25 +0000 (09:17 +0000)] 
riscv, bpf: Add rv_ext_enabled macro for runtime extension detection

Add the rv_ext_enabled macro to check at runtime whether an ISA
extension is enabled.
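
A sketch of what such a macro can look like (the exact definition is an
assumption; riscv_has_extension_likely() is a real kernel helper):

    #define rv_ext_enabled(ext)                                \
            (IS_ENABLED(CONFIG_RISCV_ISA_##ext) &&             \
             riscv_has_extension_likely(RISCV_ISA_EXT_##ext))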

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20250719091730.2660197-6-pulehui@huaweicloud.com
2 months agoriscv: Separate toolchain support dependency from RISCV_ISA_ZACAS
Pu Lehui [Sat, 19 Jul 2025 09:17:24 +0000 (09:17 +0000)] 
riscv: Separate toolchain support dependency from RISCV_ISA_ZACAS

RV64 BPF is going to support Zacas instructions. Let's separate the
toolchain support dependency from RISCV_ISA_ZACAS.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20250719091730.2660197-5-pulehui@huaweicloud.com
2 months agoriscv, bpf: Extract emit_ldx() helper
Pu Lehui [Sat, 19 Jul 2025 09:17:23 +0000 (09:17 +0000)] 
riscv, bpf: Extract emit_ldx() helper

There's a lot of redundant code related to load-into-register
operations; let's extract an emit_ldx() helper to make the code more
compact.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20250719091730.2660197-4-pulehui@huaweicloud.com
2 months agoriscv, bpf: Extract emit_st() helper
Pu Lehui [Sat, 19 Jul 2025 09:17:22 +0000 (09:17 +0000)] 
riscv, bpf: Extract emit_st() helper

There's a lot of redundant code related to store-from-immediate
operations; let's extract an emit_st() helper to make the code more
compact.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20250719091730.2660197-3-pulehui@huaweicloud.com
2 months agoriscv, bpf: Extract emit_stx() helper
Pu Lehui [Sat, 19 Jul 2025 09:17:21 +0000 (09:17 +0000)] 
riscv, bpf: Extract emit_stx() helper

There's a lot of redundant code related to store-from-register
operations; let's extract an emit_stx() helper to make the code more
compact.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20250719091730.2660197-2-pulehui@huaweicloud.com
2 months agoselftests/bpf: Copy test_kmods when installing selftest
Amery Hung [Tue, 12 Aug 2025 17:50:39 +0000 (10:50 -0700)] 
selftests/bpf: Copy test_kmods when installing selftest

Commit d6212d82bf26 ("selftests/bpf: Consolidate kernel modules into
common directory") consolidated the Makefile of test_kmods. However,
since it removed test_kmods from TEST_GEN_PROGS_EXTENDED, the kernel
modules required by bpf selftests are now missing from kselftest_install
when running "make install". Fix it by adding test_kmods to
TEST_GEN_FILES.

Fixes: d6212d82bf26 ("selftests/bpf: Consolidate kernel modules into common directory")
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250812175039.2323570-1-ameryhung@gmail.com
2 months agobpf: Don't use %pK through printk
Thomas Weißschuh [Mon, 11 Aug 2025 12:08:04 +0000 (14:08 +0200)] 
bpf: Don't use %pK through printk

In the past %pK was preferable to %p as it would not leak raw pointer
values into the kernel log.
Since commit ad67b74d2469 ("printk: hash addresses printed with %p")
the regular %p has been improved to avoid this issue.
Furthermore, restricted pointers ("%pK") were never meant to be used
through printk(). They can still unintentionally leak raw pointers or
acquire sleeping locks in atomic contexts.

Switch to the regular pointer formatting which is safer and
easier to reason about.
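
A hypothetical call site showing the switch (%p output has been hashed
since commit ad67b74d2469):

    /* before: restricted pointer through printk() */
    printk(KERN_INFO "prog at %pK\n", prog->bpf_func);
    /* after: regular hashed pointer */
    printk(KERN_INFO "prog at %p\n", prog->bpf_func);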

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250811-restricted-pointers-bpf-v1-1-a1d7cc3cb9e7@linutronix.de
2 months agobpf: Replace kvfree with kfree for kzalloc memory
Qianfeng Rong [Mon, 11 Aug 2025 12:39:49 +0000 (20:39 +0800)] 
bpf: Replace kvfree with kfree for kzalloc memory

The 'backedge' pointer is allocated with kzalloc(), which returns
physically contiguous memory. Using kvfree() to deallocate such
memory is functionally safe but semantically incorrect.

Replace kvfree() with kfree() to avoid unnecessary is_vmalloc_addr()
check in kvfree().
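
The pairing rule in a sketch (the allocation site is illustrative):

    backedge = kzalloc(sizeof(*backedge), GFP_KERNEL);
    /* ... use backedge ... */
    kfree(backedge);    /* matches kzalloc(); kvfree() is for kvmalloc() */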

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250811123949.552885-1-rongqianfeng@vivo.com
2 months agobpf: Tidy verifier bug message
Paul Chaignon [Mon, 11 Aug 2025 18:58:20 +0000 (20:58 +0200)] 
bpf: Tidy verifier bug message

Yonghong noticed that error messages for potential verifier bugs often
have a '(1)' at the end. This is happening because verifier_bug_if(cond,
env, fmt, args...) prints "(" #cond ")\n" as part of the message and
verifier_bug() is defined as:

  #define verifier_bug(env, fmt, args...) verifier_bug_if(1, env, fmt, ##args)

Hence, verifier_bug() always ends up displaying '(1)'. This small patch
fixes it by having verifier_bug_if conditionally call verifier_bug
instead of the other way around.
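
A simplified sketch of the fixed macros (not the exact kernel
definitions):

    #define verifier_bug(env, fmt, args...)                                \
            WARN_ONCE(1, "verifier bug: " fmt "\n", ##args)

    #define verifier_bug_if(cond, env, fmt, args...)                       \
            ({                                                             \
                    bool __cond = (cond);                                  \
                    if (unlikely(__cond))                                  \
                            verifier_bug(env, fmt " (" #cond ")", ##args); \
                    __cond;                                                \
            })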

Fixes: 1cb0f56d9618 ("bpf: WARN_ONCE on verifier bugs")
Reported-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/aJo9THBrzo8jFXsh@mail.gmail.com
2 months agobpf: Remove redundant __GFP_NOWARN
Qianfeng Rong [Mon, 4 Aug 2025 12:27:30 +0000 (20:27 +0800)] 
bpf: Remove redundant __GFP_NOWARN

Commit 16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT")
made GFP_NOWAIT implicitly include __GFP_NOWARN.

Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT
(e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean
up these redundant flags across subsystems.

No functional changes.
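
A before/after sketch at a hypothetical allocation site:

    /* before */
    ptr = kmalloc(size, GFP_NOWAIT | __GFP_NOWARN);
    /* after: GFP_NOWAIT already implies __GFP_NOWARN */
    ptr = kmalloc(size, GFP_NOWAIT);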

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20250804122731.460158-1-rongqianfeng@vivo.com
2 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Martin KaFai Lau [Mon, 11 Aug 2025 23:15:41 +0000 (16:15 -0700)] 
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Cross merge bpf/master after 6.17-rc1.

No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agoLinux 6.17-rc1 v6.17-rc1
Linus Torvalds [Sun, 10 Aug 2025 16:41:16 +0000 (19:41 +0300)] 
Linux 6.17-rc1

2 months agoMerge tag 'turbostat-2025.09.09' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 10 Aug 2025 06:02:36 +0000 (09:02 +0300)] 
Merge tag 'turbostat-2025.09.09' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux

Pull turbostat updates from Len Brown:
 "tools/power turbostat: version 2025.09.09

   - Probe and display L3 Cache topology

   - Add ability to average an added counter (useful for pre-integrated
     "counters", such as Watts)

   - Break the limit of 64 built-in counters

   - Assorted bug fixes and minor feature tweaks"

* tag 'turbostat-2025.09.09' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
  tools/power turbostat: version 2025.09.09
  tools/power turbostat: Handle non-root legacy-uncore sysfs permissions
  tools/power turbostat: standardize PER_THREAD_PARAMS
  tools/power turbostat: Fix DMR support
  tools/power turbostat: add format "average" for external attributes
  tools/power turbostat: delete GET_PKG()
  tools/power turbostat: probe and display L3 cache topology
  tools/power turbostat: Support more than 64 built-in-counters
  tools/power turbostat.8: Document Totl%C0, Any%C0, GFX%C0, CPUGFX% columns
  tools/power turbostat: Fix bogus SysWatt for forked program
  tools/power turbostat: Handle cap_get_proc() ENOSYS
  tools/power turbostat: Fix build with musl
  tools/power turbostat: verify arguments to params --show and --hide
  tools/power turbostat: regression fix: --show C1E%

2 months agoMerge tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 10 Aug 2025 05:51:37 +0000 (08:51 +0300)] 
Merge tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull smp fixes from Borislav Petkov:

 - Remove an obsolete comment and fix spelling

* tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu: Remove obsolete comment from takedown_cpu()
  smp: Fix spelling in on_each_cpu_cond_mask()'s doc-comment

2 months agoMerge tag 'irq_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 10 Aug 2025 05:46:47 +0000 (08:46 +0300)] 
Merge tag 'irq_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Borislav Petkov:

 - Fix a wrong ioremap size in mvebu-gicp

 - Remove yet another compile-test case for a driver which needs an
   additional dependency

 - Fix a lock inversion scenario in the IRQ unit test suite

 - Remove an impossible flag situation in gic-v5

 - Do not iounmap resources in gic-v5 which are managed by devm

 - Make sure stale, left-over interrupts in mvebu-gicp are cleared on
   driver init

 - Fix a reference counting mishap in msi-lib

 - Fix a dereference-before-null-ptr-check case in the riscv-imsic
   irqchip driver

* tag 'irq_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/mvebu-gicp: Use resource_size() for ioremap()
  irqchip: Build IMX_MU_MSI only on ARM
  genirq/test: Resolve irq lock inversion warnings
  irqchip/gic-v5: Remove IRQD_RESEND_WHEN_IN_PROGRESS for ITS IRQs
  irqchip/gic-v5: iwb: Fix iounmap probe failure path
  irqchip/mvebu-gicp: Clear pending interrupts on init
  irqchip/msi-lib: Fix fwnode refcount in msi_lib_irq_domain_select()
  irqchip/riscv-imsic: Don't dereference before NULL pointer check

2 months agoMerge tag 'x86_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 10 Aug 2025 05:15:32 +0000 (08:15 +0300)] 
Merge tag 'x86_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Fix an interrupt vector setup race which leads to a non-functioning
   device

 - Add new Intel CPU models *and* a family: 0x12. Finally. Yippie! :-)

* tag 'x86_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/irq: Plug vector setup race
  x86/cpu: Add new Intel CPU model numbers for Wildcatlake and Novalake

2 months agoMerge tag 'locking_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 10 Aug 2025 05:11:39 +0000 (08:11 +0300)] 
Merge tag 'locking_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Borislav Petkov:

 - Prevent a futex hash leak due to different mm lifetimes

* tag 'locking_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Move futex cleanup to __mmdrop()

2 months agotools/power turbostat: version 2025.09.09
Len Brown [Sun, 10 Aug 2025 01:08:26 +0000 (21:08 -0400)] 
tools/power turbostat: version 2025.09.09

Probe and display L3 Cache topology
Add ability to average an added counter
(useful for pre-integrated "counters", such as Watts)
Break the limit of 64 built-in counters.
Assorted bug fixes and minor feature tweaks

Signed-off-by: Len Brown <len.brown@intel.com>
2 months agotools/power turbostat: Handle non-root legacy-uncore sysfs permissions
Len Brown [Sat, 9 Aug 2025 20:31:31 +0000 (16:31 -0400)] 
tools/power turbostat: Handle non-root legacy-uncore sysfs permissions

/sys/devices/system/cpu/intel_uncore_frequency/package_X_die_Y/
may be readable by all, but
/sys/devices/system/cpu/intel_uncore_frequency/package_X_die_Y/current_freq_khz
may be readable only by root.

Non-root turbostat users see complaints in this scenario.

Fail probe of the interface if we can't read current_freq_khz.

Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Original-patch-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2 months agotools/power turbostat: standardize PER_THREAD_PARAMS
Len Brown [Fri, 8 Aug 2025 23:30:07 +0000 (19:30 -0400)] 
tools/power turbostat: standardize PER_THREAD_PARAMS

Use a macro for PER_THREAD_PARAMS to make adding another one later
clearer.

No functional change.

Signed-off-by: Len Brown <len.brown@intel.com>
2 months agotools/power turbostat: Fix DMR support
Zhang Rui [Wed, 11 Jun 2025 06:50:26 +0000 (14:50 +0800)] 
tools/power turbostat: Fix DMR support

Together with the RAPL MSRs, more MSRs are gone on DMR, including the
PLR (Perf Limit Reasons) and IRTL (Package C-state Interrupt Response
Time Limit) MSRs. The configurable TDP info should also be retrieved
from the TPMI-based Intel Speed Select Technology feature.

Remove the access of these MSRs for DMR. Improve the DMR platform
feature table to make it more readable at the same time.

Fixes: 83075bd59de2 ("tools/power turbostat: Add initial support for DMR")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2 months agotools/power turbostat: add format "average" for external attributes
Michael Hebenstreit [Fri, 8 Aug 2025 19:57:53 +0000 (15:57 -0400)] 
tools/power turbostat: add format "average" for external attributes

External attributes with format "raw" are not printed in summary lines
for nodes/packages (or with option -S). The new format "average"
behaves like "raw" but also adds the summary data.

Signed-off-by: Michael Hebenstreit <michael.hebenstreit@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2 months agotools/power turbostat: delete GET_PKG()
Len Brown [Tue, 22 Jul 2025 04:17:04 +0000 (00:17 -0400)] 
tools/power turbostat: delete GET_PKG()

pkg_base[pkg_id] is a simple array of structure pointers;
let the compiler treat it that way.

Signed-off-by: Len Brown <len.brown@intel.com>
2 months agotools/power turbostat: probe and display L3 cache topology
Len Brown [Tue, 15 Jul 2025 03:33:55 +0000 (23:33 -0400)] 
tools/power turbostat: probe and display L3 cache topology

Signed-off-by: Len Brown <len.brown@intel.com>
2 months agotools/power turbostat: Support more than 64 built-in-counters
Len Brown [Sat, 12 Jul 2025 20:16:56 +0000 (16:16 -0400)] 
tools/power turbostat: Support more than 64 built-in-counters

We have outgrown the ability to use a 64-bit memory location
to inventory every possible built-in counter.
Leverage the CPU_SET(3) macros to break this barrier.

Also, break the Joules & Watts counters into two,
since we can no longer 'or' them together...

Signed-off-by: Len Brown <len.brown@intel.com>
2 months agotools/power turbostat.8: Document Totl%C0, Any%C0, GFX%C0, CPUGFX% columns
Len Brown [Mon, 23 Jun 2025 20:24:25 +0000 (13:24 -0700)] 
tools/power turbostat.8: Document Totl%C0, Any%C0, GFX%C0, CPUGFX% columns

Explain the meaning of the Totl%C0, Any%C0, GFX%C0, CPUGFX% columns.

Signed-off-by: Len Brown <len.brown@intel.com>
2 months agoMerge tag 'tty-6.16-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Linus Torvalds [Sat, 9 Aug 2025 15:12:23 +0000 (18:12 +0300)] 
Merge tag 'tty-6.16-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull TTY fix from Greg KH:
 "Here is a single revert of one of the previous patches that went in
  the last tty/serial merge that is breaking userspace on some platforms
  (specifically powerpc, and probably a few others).

  It accidentally changed the ioctl values of some tty ioctls, which
  breaks xorg.

  The revert has been in linux-next all this week with no reported
  issues"

* tag 'tty-6.16-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  Revert "tty: vt: use _IO() to define ioctl numbers"

2 months agoMerge tag 'efi-next-for-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Linus Torvalds [Sat, 9 Aug 2025 15:10:01 +0000 (18:10 +0300)] 
Merge tag 'efi-next-for-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI updates from Ard Biesheuvel:

 - Expose the OVMF firmware debug log via sysfs

 - Lower the default log level for the EFI stub to avoid corrupting any
   splash screens with unimportant diagnostic output

* tag 'efi-next-for-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efi: add API doc entry for ovmf_debug_log
  efistub: Lower default log level
  efi: add ovmf debug log driver

2 months agoMerge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Linus Torvalds [Sat, 9 Aug 2025 06:03:21 +0000 (09:03 +0300)] 
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix memory leak of bpf_scc_info objects (Eduard Zingerman)

 - Fix a regression in the 'perf' tool caused by moving UID filtering to
   BPF (Ilya Leoshkevich)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  perf bpf-filter: Enable events manually
  libbpf: Add the ability to suppress perf event enablement
  bpf: Fix memory leak of bpf_scc_info objects

2 months agoMerge tag 'block-6.17-20250808' of git://git.kernel.dk/linux
Linus Torvalds [Sat, 9 Aug 2025 05:47:28 +0000 (08:47 +0300)] 
Merge tag 'block-6.17-20250808' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - MD pull request via Yu:
      - mddev null-ptr-dereference fix, by Erkun
      - md-cluster fail to remove the faulty disk regression fix, by
        Heming
      - minor cleanup, by Li Nan and Jinchao
      - mdadm lifetime regression fix reported by syzkaller, by Yu Kuai

 - MD pull request via Christoph
      - add support for getting the FDP feature in fabrics passthru path
        (Nitesh Shetty)
      - add capability to connect to an administrative controller
        (Kamaljit Singh)
      - fix a leak on sgl setup error (Keith Busch)
      - initialize discovery subsys after debugfs is initialized
        (Mohamed Khalfella)
      - fix various comment typos (Bjorn Helgaas)
      - remove unneeded semicolons (Jiapeng Chong)

 - nvmet debugfs ordering issue fix

 - Fix UAF in the tag_set in zloop

 - Ensure sbitmap shallow depth covers entire set

 - Reduce lock roundtrips in io context lookup

 - Move scheduler tags alloc/free out of elevator and freeze lock, to
   fix some lockdep found issues

 - Improve robustness of queue limits checking

 - Fix a regression with IO priorities, if no io context exists

* tag 'block-6.17-20250808' of git://git.kernel.dk/linux: (26 commits)
  lib/sbitmap: make sbitmap_get_shallow() internal
  lib/sbitmap: convert shallow_depth from one word to the whole sbitmap
  nvmet: exit debugfs after discovery subsystem exits
  block, bfq: Reorder struct bfq_iocq_bfqq_data
  md: make rdev_addable usable for rcu mode
  md/raid1: remove struct pool_info and related code
  md/raid1: change r1conf->r1bio_pool to a pointer type
  block: ensure discard_granularity is zero when discard is not supported
  zloop: fix KASAN use-after-free of tag set
  block: Fix default IO priority if there is no IO context
  nvme: fix various comment typos
  nvme-auth: remove unneeded semicolon
  nvme-pci: fix leak on sgl setup error
  nvmet: initialize discovery subsys after debugfs is initialized
  nvme: add capability to connect to an administrative controller
  nvmet: add support for FDP in fabrics passthru path
  md: rename recovery_cp to resync_offset
  md/md-cluster: handle REMOVE message earlier
  md: fix create on open mddev lifetime regression
  block: fix potential deadlock while running nr_hw_queue update
  ...

2 months agoMerge tag 'io_uring-6.17-20250808' of git://git.kernel.dk/linux
Linus Torvalds [Sat, 9 Aug 2025 05:45:08 +0000 (08:45 +0300)] 
Merge tag 'io_uring-6.17-20250808' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Allow vectorized payloads for send/send-zc - like sendmsg, but
   without the hassle of a msghdr.

 - Fix for an integer wrap that should go to stable, spotted by syzbot.
   Nothing alarming here, as you need to be root to hit this.
   Nevertheless, it should get fixed.

   FWIW, kudos to the syzbot crew for having much nicer reproducers now,
   and with nicely annotated source code as well. This is particularly
   useful as syzbot uses the raw interface rather than liburing;
   historically it's been difficult to turn a syzbot reproducer into a
   meaningful test case. With the recent changes, that's not true anymore!

* tag 'io_uring-6.17-20250808' of git://git.kernel.dk/linux:
  io_uring/memmap: cast nr_pages to size_t before shifting
  io_uring/net: Allow to do vectorized send

2 months agoMerge tag 'spi-fix-v6.17-merge-window' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 9 Aug 2025 05:43:24 +0000 (08:43 +0300)] 
Merge tag 'spi-fix-v6.17-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
 "There's one fix here for an issue with the CS42L43 where we were
  allocating a single property for client devices as just that property
  rather than as a null-terminated array of properties, as we are supposed to.

  We also have an update to the MAINTAINERS file for some Renesas
  devices"

* tag 'spi-fix-v6.17-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: cs42l43: Property entry should be a null-terminated array
  MAINTAINERS: Add entries for the RZ/V2H(P) RSPI

2 months agoMerge tag 'regulator-fix-v6.17-merge-window' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Sat, 9 Aug 2025 05:41:53 +0000 (08:41 +0300)] 
Merge tag 'regulator-fix-v6.17-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Pull regulator fix from Mark Brown:
 "This fixes an issue with the newly added code for handling large
  voltage changes on regulators which require that individual voltage
  changes cover a limited range: the check for convergence was broken"

* tag 'regulator-fix-v6.17-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: core: correct convergence check in regulator_set_voltage()

2 months agoMerge tag 'regmap-fix-v6.17-merge-window' of git://git.kernel.org/pub/scm/linux/kerne...
Linus Torvalds [Sat, 9 Aug 2025 05:40:28 +0000 (08:40 +0300)] 
Merge tag 'regmap-fix-v6.17-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap

Pull regmap fixes from Mark Brown:
 "These patches fix a lockdep issue Russell King reported with nested
  regmap-irqs (unusual since regmap is generally for devices on slow
  buses so devices don't get nested), plus add a missing mutex free
  which I noticed while implementing a fix for that issue"

* tag 'regmap-fix-v6.17-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
  regmap: irq: Avoid lockdep warnings with nested regmap-irq chips
  regmap: irq: Free the regmap-irq mutex

2 months agoMerge tag 'pci-v6.17-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Linus Torvalds [Sat, 9 Aug 2025 05:39:05 +0000 (08:39 +0300)] 
Merge tag 'pci-v6.17-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull pci fix from Bjorn Helgaas:

 - Fix vmd MSI interrupt domain restructure that caused crash early in
   boot (Nam Cao)

* tag 'pci-v6.17-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
  PCI: vmd: Fix wrong kfree() in vmd_msi_free()