Tejun Heo says:
====================
This makes BPF arena memory directly dereferenceable from kernel code
(struct_ops callbacks, kfuncs). Each arena gets a per-arena scratch page
that an arch fault hook installs into empty PTEs on kernel-side faults,
after KFENCE. The faulting instruction retries and the violation is reported
through the program's BPF stream.
v4:
- Patch 1: note that the strict-zero cmpxchg is narrower than pte_none() in
inline comments on both x86 and arm64. (Andrea)
- Patch 2: stub bpf_arena_handle_page_fault() for !CONFIG_BPF_SYSCALL via a
new include/linux/bpf_defs.h. (lkp)
- Patch 7: scx_arena_alloc() retries via a loop instead of a single retry on
pool growth. (Andrea)
- Picked up Reviewed-by tags from Emil and Andrea.
v3: https://lore.kernel.org/r/
20260520235052.
4180316-1-tj@kernel.org
v2: https://lore.kernel.org/r/
20260517211232.
1670594-1-tj@kernel.org
v1 (RFC): https://lore.kernel.org/r/
20260427105109.
2554518-1-tj@kernel.org
Motivation
----------
sched_ext's ops_cid.set_cmask() hands the BPF scheduler a struct scx_cmask
*. The kernel translates a kernel cpumask to a cmask, but it had no way to
write into the arena, so the cmask lived in kernel memory and was passed as
a trusted pointer. BPF cmask helpers all operate on arena cmasks though, so
the BPF side had to word-by-word probe-read the kernel cmask into an arena
cmask via cmask_copy_from_kernel() before any helper could touch it. It
works, but is clumsy.
The shape isn't unique to set_cmask. Sub-scheduler support is on the way and
more sched_ext callbacks will want to pass structured data to BPF. Anywhere
a kfunc or struct_ops callback wants to hand a struct to a BPF program,
arena residence is the natural answer.
Approach
--------
Each arena gets a per-arena scratch page. Arenas stay sparsely mapped as
today - PTEs are populated only for allocated pages. A new arch fault hook
(bpf_arena_handle_page_fault) is wired into x86 page_fault_oops() and arm64
__do_kernel_fault(), after KFENCE. When a kernel-side access faults inside
an arena's kern_vm range, the helper walks the stack to find the BPF program
responsible, range-checks the fault address against prog->aux->arena, and
atomically installs the scratch page into the empty PTE via the new
ptep_try_set() wrapper. The kernel instruction retries and reads/writes the
scratch page. Free paths and map destruction treat scratch as non-owned.
Real allocation refuses to overwrite scratch (apply_range_set_cb returns
-EBUSY). A scratched address stays dead until map destroy, since its
presence means the BPF program has already malfunctioned.
The mechanism is default behavior - no UAPI flag.
What this preserves
-------------------
All the debugging properties of today's sparse-PTE design are preserved:
* BPF programs still fault on unmapped arena accesses. The fault semantics
(instruction retry with rdst = 0) and the violation report through
bpf_streams are unchanged for prog-side accesses.
* The first kernel-side touch of an unmapped address is reported via
bpf_streams the same way as a prog-side fault, with the stack walk
attributing it to the originating prog.
* User-side fault on a never-scratched address still lazy-allocates a real
page (or returns SIGSEGV under BPF_F_SEGV_ON_FAULT). User-side fault on a
scratched address SIGSEGVs.
What changes for the kernel-side caller is just that an unmapped deref no
longer oopses - it retries through the scratch page and emits a violation
report. The same shape today's BPF instruction faults have.
Patches 1-2 (atomic PTE install + arena scratch-page recovery)
--------------------------------------------------------------
mm: Add ptep_try_set() for lockless empty-slot installs
bpf: Recover arena kernel faults with scratch page
Patches 3-5 (helpers used by struct_ops registration)
-----------------------------------------------------
bpf: Add sleepable variant of bpf_arena_alloc_pages for kernel callers
bpf: Add bpf_struct_ops_for_each_prog()
bpf/arena: Add bpf_arena_map_kern_vm_start() and bpf_prog_arena()
====================
Link: https://lore.kernel.org/bpf/20260522172219.1423324-1-tj@kernel.org/
Signed-off-by: Alexei Starovoitov <ast@kernel.org>