KP Singh [Mon, 1 Jun 2026 15:02:47 +0000 (17:02 +0200)]
selftests/bpf: Adjust verifier_map_ptr for the map's excl field
Adding the u32 excl field at offset 32 of struct bpf_map right after the
sha[SHA256_DIGEST_SIZE] hash shifts the ops pointer from offset 32 to 40.
Therefore, fix up the test case.
Daniel Borkmann [Mon, 1 Jun 2026 15:02:46 +0000 (17:02 +0200)]
libbpf: Skip max_entries override on signed loaders
bpf_gen__map_create() lets the host-supplied loader ctx override a
map's max_entries at runtime (map_desc[idx].max_entries, when non-zero).
This is how the light skeleton sizes maps to the target machine, but
it happens after emit_signature_match() and is covered by neither the
signed loader instructions nor the hashed blob.
For a signed loader this means an untrusted host can re-dimension the
program's maps, outside what the signature attests to. Gate the override
on gen_hash so signed loaders use the signer-provided max_entries baked
into the blob.
Daniel Borkmann [Mon, 1 Jun 2026 15:02:45 +0000 (17:02 +0200)]
libbpf: Skip initial_value override on signed loaders
bpf_gen__map_update_elem() emits code that, when the host-supplied
loader ctx provides a non-NULL map_desc[idx].initial_value, overwrites
the blob value with bytes read from the host (bpf_copy_from_user /
bpf_probe_read_kernel) before the BPF_MAP_UPDATE_ELEM that populates
the program's .data/.rodata/.bss maps.
This override runs after emit_signature_match() has validated map->sha[],
and initial_value is part of neither the signed loader instructions nor
the hashed data blob. For a signed loader this lets an untrusted host
substitute global-variable contents into a program whose code carries
a valid signature, thus weakening what the signature attests to.
The blob already contains the signer-provided value (added via add_data()
and covered by the embedded, signed hash), so simply skip emitting the
override for signed loaders (gen_hash). Runtime initialization stays
available for the unsigned light-skeleton path as before. The jump
offsets within the override block are internal to it, so guarding the
whole block leaves them unchanged.
KP Singh [Mon, 1 Jun 2026 15:02:44 +0000 (17:02 +0200)]
libbpf: Reject non-exclusive metadata maps in the signed loader
The loader verifies map->sha against the metadata hash in its
instructions. map->sha is calculated when BPF_OBJ_GET_INFO_BY_FD is
called on the frozen map.
While the map is frozen, the /signed loader/ must also ensure the map
is exclusive, as, without exclusivity (which a hostile host could just
omit when loading the loader), another BPF program with map access can
mutate the contents afterwards, so the check passes on stale data.
With the extra check as part of the signed loader, it now refuses to
move on with map->sha validation if the host set it up wrongly.
Fixes: fb2b0e290147 ("libbpf: Update light skeleton for signing") Signed-off-by: KP Singh <kpsingh@kernel.org> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260601150248.394863-4-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Mon, 1 Jun 2026 15:02:43 +0000 (17:02 +0200)]
bpf: Drop redundant hash_buf from map_get_hash operation
bpf_map_get_info_by_fd() is the only caller of the ->map_get_hash
and always invokes it with hash_buf == map->sha and hash_buf_size
of SHA256_DIGEST_SIZE. array_map_get_hash() in turn lets sha256()
write the digest directly into that buffer (map->sha) and then
performs a trailing memcpy(), which evaluates to memcpy(map->sha,
map->sha, 32): a redundant self-copy. The hash_buf_size argument
was never used at all. Simplify this a bit, no functional change.
Daniel Borkmann [Mon, 1 Jun 2026 15:02:42 +0000 (17:02 +0200)]
bpf: Reject exclusive maps as inner maps in map-in-map
An exclusive map (created with excl_prog_hash) is bound to a single
program by hash: check_map_prog_compatibility() refuses to load any
program whose digest does not match map->excl_prog_sha. That check
only runs for maps a program references directly, i.e. its used_maps.
A map reached at runtime through a map-of-maps is never in used_maps,
and bpf_map_meta_equal() does not consider excl_prog_sha, so an
exclusive map can be inserted into a non-exclusive outer map and
then looked up and mutated by an unrelated program, bypassing the
exclusivity guarantee.
For the signed loader this defeats the metadata map exclusivity check
added in the signed loader: the cached map->sha[] is validated against
the signed hash while another program on a hostile host rewrites the
frozen map's contents through the outer map.
This patchset cleans up dynptr handling, refactors object relationship
tracking in the verifier by introducing parent_id and folding ref_obj_id
into id, and fixes dynptr use-after-free bugs where file/skb dynptrs
are not invalidated when the parent referenced object is freed.
* Motivation *
In BPF qdisc programs, an skb can be freed through kfuncs. However,
since dynptr does not track the parent referenced object (e.g., skb),
the verifier does not invalidate the dynptr after the skb is freed,
resulting in use-after-free. The same issue also affects file dynptr.
The figure below shows the current state of object tracking. The
verifier tracks objects using three fields: id for nullness tracking,
ref_obj_id for lifetime tracking, and dynptr_id for tracking the parent
dynptr of a slice (PTR_TO_MEM only). While dynptr_id links slices to
their parent dynptr, there is no field that links a dynptr back to its
parent skb. When the skb is freed via release_reference(ref_obj_id=1),
only objects with ref_obj_id=1 are invalidated. Since skb dynptr is
non-referenced (ref_obj_id=0), the dynptr and its derived slices remain
accessible.
Current: object (id, ref_obj_id, dynptr_id)
id = unique id of the object (for nullness tracking)
ref_obj_id = id of the referenced object (for lifetime tracking)
dynptr_id = id of the parent dynptr (only for PTR_TO_MEM slices)
skb (0,1,0)
^^
! No link from dynptr to skb !
|+------------------------------+
| bpf_dynptr_clone |
dynptr A (2,0,0) dynptr C (4,0,0)
^ ^
bpf_dynptr_slice | |
| |
slice B (3,0,2) slice D (5,0,4)
* Why not simply use ref_obj_id to track the parent? *
A natural first approach is to link dynptr to its parent by sharing
the parent's ref_obj_id and propagating it to slices. Now, releasing
the skb via release_reference(ref_obj_id=1) correctly invalidates all
derived objects.
Attempted fix: share parent's ref_obj_id
skb (0,1,0)
^^
||
|+------------------------------+
| bpf_dynptr_clone |
dynptr A (2,1,0) dynptr C (4,1,0)
^ ^
bpf_dynptr_slice | |
| |
slice B (3,1,2) slice D (5,1,4)
However, this approach does not generalize to all dynptr types.
Referenced dynptrs such as file dynptr acquire their own ref_obj_id to
track the dynptr's lifetime. Since ref_obj_id is already used for the
dynptr's own reference, it cannot also be used to point to the parent
file object. While it is possible to add specialized handling for
individual dynptr types [0], it adds complexity and does not generalize.
An alternative approach is to avoid introducing a new field and instead
repurpose ref_obj_id as parent_id by folding lifetime tracking into id
[1]. In this design, each object is represented as (id, ref_obj_id)
where id is used for both nullness and lifetime tracking, and ref_obj_id
tracks the parent object's id.
Attempted: object (id, ref_obj_id)
id = id of the object (for nullness and lifetime tracking)
ref_obj_id = id of the parent object
' = id is referenced
skb (1',0)
^^
||
bpf_dynptr_from_skb |+------------------------------+
| bpf_dynptr_clone(A, C) |
dynptr A (2,1') dynptr C (4,1')
^ ^
bpf_dynptr_slice | |
| |
slice B (3,2) slice D (5,4)
However, this design cannot express the relationship between referenced
socket pointers and their casted counterparts. After pointer casting,
the original and casted pointers need the same lifetime (same ref_obj_id
in the current design) but different nullness (different id). The casted
pointer may be NULL even if the original is valid. With id serving as
the only field for both nullness and lifetime, and ref_obj_id repurposed
as parent, there is no way to express "different identity, same
lifetime."
Referenced socket pointer (expressed using current design):
C = ptr_casting_function(A)
ptr A (1,1,0) ptr C (2,1,0)
^ ^
| |
ptr C may be NULL even if ptr A is valid
but they have the same lifetime
* New Design: parent_id with branch splitting and intermediate reference *
The patchset folds ref_obj_id into id and adds parent_id to
bpf_reg_state (patch 5). A child object's parent_id points to the
parent object's id. This replaces the PTR_TO_MEM-specific dynptr_id.
Whether a register is referenced is determined by checking if its id
appears in the reference array via reg_is_referenced() rather than
reading a dedicated ref_obj_id field.
Pointer casting:
The challenge with pointer casting is that a cast result may be NULL
even when the source is valid, requiring distinct identity but shared
lifetime. This is solved using branch splitting: when a helper like
bpf_sk_fullsock() is called with a referenced pointer, the verifier
pushes an explicit NULL branch and assigns the cast result the same id
as the source. Since the cast may return NULL for a non-NULL input, the
NULL case is explored as a separate verifier branch. This allows
releasing any of the original or cast pointers to invalidate all others,
while avoiding the need for a separate tracking mechanism.
Referenced dynptrs:
The challenge with referenced dynptrs is that clones of a referenced
dynptr have the same lifetime but different identities. When a
referenced dynptr is overwritten, only slices derived from it will be
invalidated. To solve this, the verifier creates an intermediate
reference. This reference serves as a shared lifetime anchor for the
dynptr and all its clones. All clones share the same parent_id but get
unique ids for independent slice tracking. Releasing a referenced dynptr
releases the intermediate reference, which in turn invalidates all
clones and their derived slices. If the parent object is released while
the intermediate reference still exists, it is reported as a leaked
reference.
Release cascading:
When releasing an object, release_reference() performs a stack-based DFS
to invalidate all descendants. It walks the object tree via parent_id
links, invalidating registers and dynptr stack slots. Child references
encountered during traversal are reported as leaked references.
parent_id is also added to bpf_reference_state to enable intermediate
reference. When acquiring a reference, a parent_id can be specified to
link the new reference to an existing one (e.g., file dynptr's
intermediate reference has parent_id linking to the file's reference).
Final: object (id, parent_id)
id = unique id of the object (for nullness and lifetime
tracking)
parent_id = id of the parent object (for object relationship
tracking)
I = intermediate reference serving as lifetime anchor in
acquired_refs
' = id is referenced (appears in reference array)
skb (1',0)
^^
||
bpf_dynptr_from_skb |+------------------------------+
| bpf_dynptr_clone(A, C) |
dynptr A (2,1') dynptr C (4,1')
^ ^
bpf_dynptr_slice | |
| |
slice B (3,2) slice D (5,4)
* Preserving reg->id after null-check *
For parent_id tracking to work, child objects need to refer to the
parent's id. This requires two preparatory changes: assigning reg->id
when reading referenced kptrs from program context (patch 3), and
preserving reg->id of pointer objects after null-check (patch 4).
Previously, null-check would clear reg->id, making it impossible for
children to reference the parent afterward. The latter causes a slight
increase in verified states for some programs. One selftest object
sees +19 states (+5.01%). For Meta BPF objects, the increase is
also minor, with the largest being +34 states (+3.63%).
* Object relationship in different scenarios (for reference) *
The figures below show how the final design handles all four
combinations of referenced/non-referenced dynptr with
referenced/non-referenced parent.
(1) Non-referenced dynptr with referenced parent (e.g., skb in Qdisc):
skb (1',0)
^^
||
bpf_dynptr_from_skb |+------------------------------+
| bpf_dynptr_clone(A, C) |
dynptr A (2,1') dynptr C (4,1')
dynptr A and C live independently
(2) Non-referenced dynptr with non-referenced parent (e.g., skb in TC,
always valid):
bpf_dynptr_from_skb
bpf_dynptr_clone(A, C)
dynptr A (1,0) dynptr C (2,0)
dynptr A and C live independently
(3) Referenced dynptr with referenced parent:
file (1',0)
^
bpf_dynptr_from_file |
I (2',1') <-- intermediate reference
^^
||
|+-------------------------------+
| bpf_dynptr_clone(A, C) |
dynptr A (3,2') dynptr C (4,2')
dynptr A and C have the same lifetime
Releasing either dynptr releases I, invalidating both.
Releasing file (1') detects I as a leaked reference.
(4) Referenced dynptr with non-referenced parent:
bpf_ringbuf_reserve_dynptr
I (1',0) <-- intermediate reference
^^
||
|+--------------------------------+
| bpf_dynptr_clone(A, C) |
dynptr A (2,1') dynptr C (3,1')
v5 -> v6
- Squash "bpf: Fold ref_obj_id into id and introduce virtual references"
(v5 patch 9) into "bpf: Refactor object relationship tracking and
fix dynptr UAF bug" (now patch 5). ref_obj_id is removed in the same
patch that introduces parent_id, eliminating the intermediate state
where both coexist (Eduard)
- Drop virtual references for pointer casting. Instead, cast results
reuse the source pointer's id and use branch splitting to explore
the NULL case as a separate verifier branch. This avoids adding
virtual reference infrastructure for a case that can be handled more
simply (Eduard, Andrii)
- Address nit from Eduard Link: https://lore.kernel.org/bpf/20260519181314.2731658-1-ameryhung@gmail.com/
v4 -> v5
- Add patch 9 folding ref_obj_id into id and introducing virtual
references for pointer casting and referenced dynptr clones (Eduard, Andrii)
- Add patch 10 fixing dynptr ref counting to scan all call frames
instead of only the current frame (Eduard)
- Add utility function validate_ref_obj() (Eduard) Link: https://lore.kernel.org/bpf/20260506142709.2298255-1-ameryhung@gmail.com/
v3 -> v4
- Add patch 1 clean up mark_stack_slot_obj_read() and callers
(to address v3 ignoring err returned from mark_dynptr_read) (Andrii)
- Fix release_reference() and move the logic allowing destroying a
referenced object when refcnt > 1 from
destroy_if_stack_slots_dynptr() to release_reference() (Mykyta)
- Add patch 7 introducing ref_obj_desc and unifying ref_obj handling
(to address Eduard's concern about unclear meta->{id,ref_obj_id}
initialization/use and confusing function arguments of
process_dynptr_func())
- Add patch 8 unifying release_regno handling so that bpf_kptr_xchg
also use release_reference() Link: https://lore.kernel.org/bpf/20260421221016.2967924-1-ameryhung@gmail.com/
v2 -> v3
- Rebase to bpf-next/master
- Update veristat numbers
- Update commit msg to explain multiple dropped checks (Mykyta, Andrii)
- Reuse idmap as idstack in release_reference() and check for
duplicate id (Mykyta, Andrii)
- Change to use RUN_TEST for qdisc dynptr selftest (Eduard) Link: https://lore.kernel.org/bpf/20260307064439.3247440-1-ameryhung@gmail.com/
v1 -> v2
- Redesign: Use object (id, ref_obj_id, parent_id) instead of
(id, ref_obj_id) as it cannot express ptr casting without
introducing specialized code to handle the case
- Use stack-based DFS to release objects to avoid recursion (Andrii)
- Keep reg->id after null check
- Add dynptr cleanup
- Fix dynptr kfunc arg type determination
- Add a file dynptr UAF selftest Link: https://lore.kernel.org/bpf/20260202214817.2853236-1-ameryhung@gmail.com/
---
====================
Amery Hung [Fri, 29 May 2026 01:49:35 +0000 (18:49 -0700)]
selftests/bpf: Test using file dynptr after the reference on file is dropped
File dynptr and slice should be invalidated when the parent file's
reference is dropped in the program. Without the verifier tracking
dyntpr's parent referenced object, the dynptr would continute to be
incorrectly used even if the underlying file is being tear down or gone.
Amery Hung [Fri, 29 May 2026 01:49:34 +0000 (18:49 -0700)]
selftests/bpf: Test using slice after invalidating dynptr clone
The parent object of a cloned dynptr is skb not the original dynptr.
Invalidate the original dynptr should not prevent the program from
using the slice derived from the cloned dynptr.
Amery Hung [Fri, 29 May 2026 01:49:32 +0000 (18:49 -0700)]
bpf: Fix dynptr ref counting to scan all call frames
When checking whether a referenced dynptr can be overwritten,
destroy_if_dynptr_stack_slot only counted sibling dynptrs in the
current call frame. If a clone sharing the same virtual ref parent
existed in a different frame (e.g., passed to a subprog), it would
not be counted, causing the verifier to incorrectly reject the
overwrite with "cannot overwrite referenced dynptr".
Fix by extracting the counting into dynptr_ref_cnt() which uses
bpf_for_each_reg_in_vstate_mask() to scan dynptr stack slots across
all call frames.
Fixes: 017f5c4ef73c ("bpf: Allow overwriting referenced dynptr when refcnt > 1") Reported-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20260529014936.2811085-10-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Amery Hung [Fri, 29 May 2026 01:49:31 +0000 (18:49 -0700)]
bpf: Unify release handling for helpers and kfuncs
Introduce release_reg() to consolidate the release logic shared by both
helpers and kfuncs: dynptr release, kptr_xchg percpu-to-RCU conversion,
regular reference release, and NULL pass-through. NULL pass-through is
only allowed if the prototype indicates the argument may be null.
Determine release_regno from the function prototype/metadata before
argument checking, rather than discovering it dynamically during
argument processing. For helpers, scan the arg_type array in
check_func_proto() via check_proto_release_reg(). For kfuncs, set
release_regno to BPF_REG_1 in bpf_fetch_kfunc_arg_meta() when
KF_RELEASE is set. In the future when we start adding decl_tag to
kfunc arguments, we can just look at the function prototype instead
of a release_regno.
Extract ref_convert_alloc_rcu_protected() and
invalidate_rcu_protected_refs() to make it more clear what the code is
doing. For ref_convert_alloc_rcu_protected(), it pre-converts
MEM_ALLOC | MEM_PERCPU registers to MEM_RCU (clearing id so they
survive), then calls release_reference() to invalidate the remaining
registers and release the reference state.
Add KF_RELEASE to bpf_dynptr_file_discard() so its release_regno is set
via fetch_kfunc_meta rather than being assigned manually in the dynptr
argument processing. Set arg_type to ARG_PTR_TO_DYNPTR for
KF_ARG_PTR_TO_DYNPTR so that check_func_arg_reg_off() correctly allows
non-zero stack offsets for dynptr release arguments same as helper.
Amery Hung [Fri, 29 May 2026 01:49:30 +0000 (18:49 -0700)]
bpf: Unify referenced object tracking in verifier
Helpers and kfuncs independently tracked referenced object metadata
using standalone id fields in their respective arg_meta structs.
This led to duplicated logic and inconsistent error handling between the
two paths.
Introduce struct ref_obj_desc to consolidate id and parent_id along with
a count of how many arguments carry a reference. Add update_ref_obj() to
populate it from a bpf_reg_state, replacing open-coded assignments in
check_func_arg(), check_kfunc_args(), and process_iter_arg(). Add
validate_ref_obj() to check for ambiguous ref_obj before using it.
For ref_obj releasing helpers and kfuncs, keep checking it before
calling update_ref_obj() for now. A later patch will make these
functions not depending on ref_obj. For other users of ref_obj, move the
checks to the use locations. For helper, this means moving the checks
inside helper_multiple_ref_obj_use() to use locations.
is_acquire_function() is dropped as ref_obj is never used.
Pass ref_obj_desc into process_dynptr_func()/mark_stack_slots_dynptr()
instead of a bare parent_id to make it less confusing.
Drop the selftest introduced in 7ec899ac90a2 ("selftests/bpf: Negative
test case for ref_obj_id in args") since the verifier no longer
complains about ambiguous ref_obj if it is not used.
Amery Hung [Fri, 29 May 2026 01:49:29 +0000 (18:49 -0700)]
bpf: Remove redundant dynptr arg check for helper
unmark_stack_slots_dynptr() already makes sure that CONST_PTR_TO_DYNPTR
cannot be released. process_dynptr_func() also prevents passing
uninitialized dynptr to helpers expecting initialized dynptr. Now that
unmark_stack_slots_dynptr() also reports error returned from
release_reference(), there should be no reason to keep these redundant
checks.
Amery Hung [Fri, 29 May 2026 01:49:28 +0000 (18:49 -0700)]
bpf: Refactor object relationship tracking and fix dynptr UAF bug
Refactor object relationship tracking in the verifier and fix a dynptr
use-after-free bug where file/skb dynptrs are not invalidated when the
parent referenced object is freed.
Add parent_id to bpf_reg_state to precisely track child-parent
relationships. A child object's parent_id points to the parent object's
id. This replaces the PTR_TO_MEM-specific dynptr_id.
Remove ref_obj_id from bpf_reg_state by folding its role into the
existing id field. Previously, id tracked pointer identity for null
checking while ref_obj_id tracked the owning reference for lifetime
management. These are now unified: acquire helpers and kfuncs set id
to the acquired reference id, and release paths use id directly.
Add reg_is_referenced() which checks if a register is referenced by
looking up its id in the reference array. This replaces all former
ref_obj_id checks.
For release_reference(), invalidating an object now also invalidates
all descendants by traversing the object tree. This is done using
stack-based DFS to avoid recursive call chains of release_reference() ->
unmark_stack_slots_dynptr() -> release_reference(). Referenced objects
encountered during tree traversal are reported as leaked references.
Add parent_id to bpf_reference_state to enable hierarchical reference
tracking. When acquiring a reference, a parent_id can be specified to
link the new reference to an existing one (e.g., referenced dynptrs
acquire a reference with parent_id linking to the parent object's
reference).
Pointer casting:
For pointer casting helpers (bpf_sk_fullsock, bpf_tcp_sock), instead of
propagating ref_obj_id, the cast result reuses the same reference id as
the source pointer. Since the cast may return NULL for a non-NULL input,
the NULL case is explored as a separate verifier branch. This allows
releasing any of the original or cast pointers to invalidate all others.
Referenced dynptrs:
When constructing a referenced dynptr, acquire a intermediate reference
with parent_id linking to the parent referenced object. The dynptr and
all clones share the same parent_id (pointing to the intermediate ref)
but get unique ids for independent slice tracking. Releasing a
referenced dynptr releases the parent reference, which in turn
invalidates all clones and their derived slices.
Owning to non-owning reference conversion:
After converting owning to non-owning by clearing id (e.g.,
object(id=1) -> object(id=0)), the verifier releases the reference
state via release_reference_nomark().
Note that the error message "reference has not been acquired before" in
the helper and kfunc release paths is removed. This message was already
unreachable. The verifier only calls release_reference() after
confirming the reference is valid, so the condition could never trigger
in practice.
Amery Hung [Fri, 29 May 2026 01:49:27 +0000 (18:49 -0700)]
bpf: Preserve reg->id of pointer objects after null-check
Preserve reg->id of pointer objects after null-checking the register so
that children objects derived from it can still refer to it in the new
object relationship tracking mechanism introduced in a later patch. This
change incurs a slight increase in the number of states in one selftest
bpf object, rbtree_search.bpf.o. For Meta bpf objects, the increase of
states is also negligible.
Selftest BPF objects with insns_diff > 0
Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF)
------------------------ --------- --------- -------------- ---------- ---------- -------------
rbtree_search 6820 7326 +506 (+7.42%) 379 398 +19 (+5.01%)
Looking into rbtree_search, the reason for such increase is that the
verifier has to explore the main loop shown below for one more iteration
until state pruning decides the current state is safe.
long rbtree_search(void *ctx)
{
...
bpf_spin_lock(&glock0);
rb_n = bpf_rbtree_root(&groot0);
while (can_loop) {
if (!rb_n) {
bpf_spin_unlock(&glock0);
return __LINE__;
}
n = rb_entry(rb_n, struct node_data, r0);
if (lookup_key == n->key0)
break;
if (nr_gc < NR_NODES)
gc_ns[nr_gc++] = rb_n;
if (lookup_key < n->key0)
rb_n = bpf_rbtree_left(&groot0, rb_n);
else
rb_n = bpf_rbtree_right(&groot0, rb_n);
}
...
}
Below is what the verifier sees at the start of each iteration
(65: may_goto) after preserving id of rb_n. Without id of rb_n, the
verifier stops exploring the loop at iter 16.
rb_n gc_ns[15]
iter 15 257 257
iter 16 290 257 rb_n: idmap add 257->290
gc_ns[15]: check 257 != 290 --> state not equal
iter 17 325 257 rb_n: idmap add 290->325
gc_ns[15]: idmap add 257->257 --> state safe
Amery Hung [Fri, 29 May 2026 01:49:26 +0000 (18:49 -0700)]
bpf: Assign reg->id when getting referenced kptr from ctx
Assign reg->id when getting referenced kptr from read program context
to be consistent with R0 of KF_ACQUIRE kfunc. skb dynptr will track the
referenced skb in qdisc programs using a new field reg->parent_id in
a later patch.
Amery Hung [Fri, 29 May 2026 01:49:25 +0000 (18:49 -0700)]
bpf: Unify dynptr handling in the verifier
Simplify dynptr checking for helper and kfunc by unifying it. Remember
the initialized dynptr (i.e.,g !(arg_type |= MEM_UNINIT)) pass to a
dynptr kfunc during process_dynptr_func() so that we can easily
retrieve the information for verification later. By saving it in
meta->dynptr, there is no need to call dynptr helpers such as
dynptr_id(), dynptr_ref_obj_id() and dynptr_type() in check_func_arg().
Remove and open code the helpers in process_dynptr_func() when
saving id, ref_obj_id, and type.
Besides, since dynptr ref_obj_id information is now pass around in
meta->bpf_dynptr_desc, drop the check in helper_multiple_ref_obj_use.
Amery Hung [Fri, 29 May 2026 01:49:24 +0000 (18:49 -0700)]
bpf: Simplify mark_stack_slot_obj_read() and callers
Rename mark_stack_slot_obj_read() as mark_stack_slots_scratched() and
directly call it from functions processing iter, dynptr and irq_flag.
Commit 6762e3a0bce5 ("bpf: simplify liveness to use (callsite, depth)
keyed func_instances") has removed the dynamic liveness component in
mark_stack_slot_obj_read(). The function effectively only marks stack
slots as scratched and always succeed. Therefore, return void, drop the
unused bpf_reg_state argument and rename it to
mark_stack_slots_scratched() to reflect what it does now.
In addition, to prepare for unifying dynptr handling, dynptr_get_spi()
will be moved out of mark_dynptr_read(). As mark_dynptr_read() would join
mark_iter_read() as a thin wrapper of mark_stack_slots_scratched(), just
open code these helpers.
Paul Moore [Sat, 23 May 2026 16:00:26 +0000 (12:00 -0400)]
bpf: Fix security_bpf_prog_load() error handling
If security_bpf_prog_load() fails there is no need to call into
security_bpf_prog_free() as the LSM will handle the cleanup of any partial
LSM state before returning to the caller with an error. Thankfully this
isn't an issue with any of the existing code as the LSMs which currently
provide BPF hook callback implementations don't allocate any internal
state, but this is something we want to fix for potential future users.
Taegu Ha [Thu, 28 May 2026 06:21:55 +0000 (15:21 +0900)]
bpf: reject overlarge global subprog argument sizes
Global subprogram argument checking derives generic pointer sizes from BTF
and passes the resolved size to check_mem_reg() as a u32. The access-size
validation path then uses a signed int, and stack pointers negate the value
before calling check_helper_mem_access().
This creates a wrap when BTF describes a pointee size larger than S32_MAX.
For example, a global subprogram argument of type:
int (*p)[0x3fffffff]
has a BTF-resolved pointee size of 0xfffffffc bytes. At a call site the
caller can pass a pointer to a 4-byte stack slot at fp-4. The current
PTR_TO_STACK path computes:
size = -(int)mem_size
so 0xfffffffc becomes -4 as a signed int and the negation validates only
a 4-byte stack range. That range is covered by the caller's stack slot,
so the call is accepted.
The callee is then verified independently with R1 as PTR_TO_MEM and
mem_size 0xfffffffc. A small instruction such as:
r0 = *(u32 *)(r1 + 4)
is accepted as being inside that BTF-described memory region. At run time,
however, the actual argument value is still fp-4, so r1 + 4 addresses fp+0,
outside the 4-byte object that the caller provided.
Reject sizes that cannot be represented by the verifier's signed
access-size API before the stack-specific negation. Add a verifier
regression test for the oversized BTF argument.
Patch 1 fixes a redundant MOV in the arm64 JIT's
emit_stack_arg_store_imm() and clarifies the stack layout comments. This
is not a bug fix but an improvement.
Patch 2 bumps the stack argument tests from 6-8 args to at least 10 so
they actually exercise the native stack on arm64, where x0-x7 cover the
first 8 arguments.
====================
Puranjay Mohan [Thu, 28 May 2026 16:17:48 +0000 (09:17 -0700)]
selftests/bpf: Use at least 10 args in stack argument tests
On arm64, the first 8 arguments are passed in registers (x0-x7), so
tests with 8 or fewer arguments never exercise the native stack argument
path in the JIT. Increase argument counts to at least 10 across all
BPF-to-BPF subprog and kfunc stack argument tests so that at least 2
arguments land on the arm64 stack.
For the two-callees test, bump foo1 from 8 to 10 and foo2 from 10 to 12
args to preserve the different-stack-depth flavor of the test.
The bpf_kfunc_call_stack_arg_mem kfunc is left unchanged at 7 args to
avoid breaking the precision backtracking test which relies on hardcoded
verifier log instruction indices.
Puranjay Mohan [Thu, 28 May 2026 16:17:47 +0000 (09:17 -0700)]
bpf, arm64: Fix redundant MOV and clarify stack arg comments
emit_stack_arg_store_imm() materializes the immediate into tmp and
then moves tmp to the target register (x5-x7). Emit the immediate
directly into the target register to avoid the redundant MOV.
While here, qualify the bare "FP" in the stack-layout ASCII art as
"A64_FP" so it is not confused with BPF_FP, and note that incoming
stack arguments sit above the FP/LR pair pushed by the callee
prologue.
Daniel Borkmann [Fri, 29 May 2026 16:28:29 +0000 (18:28 +0200)]
libbpf: Skip endianness swap when loader generation failed
bpf_gen__prog_load() byte-swaps the program insns and the {func,line}_info
and CO-RE relo blobs in place for cross-endian targets. The blob offsets
come from add_data(), which returns 0 on failure: realloc_data_buf() either
frees and NULLs gen->data_start (realloc OOM) or returns early on an
already-latched gen->error, leaving a stale, possibly too-small buffer.
Neither bswap site checked for this. With gen->swapped_endian set and a
failed generation, "gen->data_start + off" becomes NULL + 0. Guard the
same way via !gen->error so they are skipped once generation has failed.
Fixes: 8ca3323dce43 ("libbpf: Support creating light skeleton of either endianness") Reported-by: sashiko <sashiko@sashiko.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260529162829.315921-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Fri, 29 May 2026 09:41:18 +0000 (11:41 +0200)]
libbpf: Also reset {insn,data}_cur on realloc failure
realloc_insn_buf() as well as realloc_data_buf() free and NULL
gen->insn_start / gen->data_start on -ENOMEM but leave gen->insn_cur /
gen->data_cur pointing into the old, freed buffer. Just reset the
cursors to NULL alongside the base pointers so the freed state is
coherent.
Daniel Borkmann [Fri, 29 May 2026 09:41:17 +0000 (11:41 +0200)]
libbpf: Skip hash computation when loader generation failed
bpf_gen__finish() calls compute_sha_update_offsets() gated only on
the gen_hash option, without first consulting gen->error. On a failed
generation this is buggy: a failed realloc_data_buf() sets gen->data_start
to NULL (leaving gen->data_cur dangling), so compute_sha_update_offsets()
runs libbpf_sha256() over a NULL buffer with a bogus length; a failed
realloc_insn_buf() likewise sets gen->insn_start to NULL and the hash
immediates get patched through that NULL base.
The computed program is discarded in either case, since the following
"if (!gen->error)" block does not publish opts->insns once an error is
set. Thus, skip the hash pass when generation has already failed.
Fixes: ea923080c145 ("libbpf: Embed and verify the metadata hash in the loader") Reported-by: sashiko <sashiko@sashiko.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260529094119.307264-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Fri, 29 May 2026 09:41:16 +0000 (11:41 +0200)]
libbpf: Drop redundant self-loop in emit_check_err
When the cleanup-label jump offset does not fit in s16, emit_check_err()
sets gen->error = -ERANGE and then emits a BPF_JMP_IMM(BPF_JA, 0, 0, -1)
self-loop.
The latter emit() is dead: gen->error is assigned on the preceding line,
and emit() then bails out early in realloc_insn_buf() the moment gen->error
is set, so the jump is never written into the instruction stream.
gen->error alone already marks the generation as failed. This is a follow-up
to 7dd62566e0d1 ("libbpf: fix off-by-one in emit_signature_match jump offset")
which removed the jump in emit_signature_match() but not in other locations.
====================
bpf: Align syscall writeback behavior with user-declared size
This series fixes an out-of-bounds write vulnerability in BPF_PROG_QUERY
while maintaining backward compatibility for older userspace applications.
BPF_PROG_QUERY unconditionally writes back the 'query.revision' field
to userspace. If userspace passes a smaller 'bpf_attr' structure (e.g. 40
bytes, which was the cgroup query layout before 'query.revision' was
added), the kernel performs an out-of-bounds write.
We address this by propagating the user-provided 'uattr_size' down to
the cgroup query handlers and conditionally skipping the write-back of
'query.revision' if the buffer is too small. This allows legacy cgroup
queries to succeed safely.
tcx and netkit queries are left unchanged since they were introduced in
the same merge window as 'query.revision' and have no legacy callers.
Finally, we add a selftest to verify these boundary behaviors.
Changes since v2:
- Propagate uattr_size to __cgroup_bpf_query() and conditionally write
revision (instead of unconditionally rejecting smaller sizes in front-gate).
- Update BPF selftests to verify that cgroup queries succeed with
OLD_QUERY_SIZE without writing revision, and succeed with FULL_QUERY_SIZE.
- Remove early size checks in the front-gate to keep the patch minimal.
Changes since v1:
- Simplify the kernel fix to checking the size only in bpf_prog_query().
- Revert all other subsystem query plumbing changes.
- Update BPF selftest to target BPF_CGROUP_INET_INGRESS cgroup query, and
add verification for attr size boundaries.
====================
Yuyang Huang [Sun, 31 May 2026 07:56:00 +0000 (15:56 +0800)]
selftests/bpf: add verification for BPF_PROG_QUERY attr size boundaries
Add a new selftest to verify that the BPF syscall (specifically
BPF_PROG_QUERY) correctly handles different user-declared attribute sizes.
Specifically, verify that:
- For cgroup queries, a query with a size that covers 'prog_cnt' but is
smaller than 'revision' (OLD_QUERY_SIZE) succeeds, but does not write
to 'revision' (verifying backward compatibility).
- A query with full size (FULL_QUERY_SIZE) succeeds and writes both
'prog_cnt' and 'revision'.
Fixes: 120933984460 ("bpf: Implement mprog API on top of existing cgroup progs") Cc: Maciej Żenczykowski <maze@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Yuyang Huang <yuyanghuang@google.com> Link: https://lore.kernel.org/r/20260531075600.4058207-3-yuyanghuang@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yuyang Huang [Sun, 31 May 2026 07:55:59 +0000 (15:55 +0800)]
bpf: fix BPF_PROG_QUERY OOB write and cgroup backward compat
BPF_PROG_QUERY writes back the 'query.revision' field unconditionally to
userspace. If userspace passes a smaller 'bpf_attr' structure (e.g. 40
bytes, which was the layout before the addition of 'query.revision'),
the kernel performs an out-of-bounds write.
Fix this by propagating the user-provided attribute size 'uattr_size'
down to the cgroup query handlers, and conditionally skipping writing
the revision field to userspace when the provided buffer size is
insufficient.
query.revision in bpf_mprog_query is structurally identical to the
cgroup case: a late tail field, written unconditionally.
But the backward-compat hazard is not the same.
The min-historical-size test is per command, and bpf_mprog_query only
serves attach types that were born with revision in the struct:
tcx, netkit, the revision field, and bpf_mprog_query itself all landed in
the same v6.6 merge window (053c8e1f235d added the mprog query API +
revision; tcx in e420bed02507, netkit in 35dfaad7188c). There has never
been a tcx/netkit BPF_PROG_QUERY userspace that doesn't know about
revision. So for these commands the minimum legitimate struct already
covers offset 56-64 — no old binary can be broken here.
Contrast with cgroup: BPF_PROG_QUERY on cgroup attach types shipped in
2017; revision write-back was bolted on years later (120933984460). That
path has a real population of pre-revision callers.
Fixes: 120933984460 ("bpf: Implement mprog API on top of existing cgroup progs") Cc: Maciej Żenczykowski <maze@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Yuyang Huang <yuyanghuang@google.com> Link: https://lore.kernel.org/r/20260531075600.4058207-2-yuyanghuang@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf: replace pop/push emptiness check with bpf_list_empty()
Simplify fq_flows_is_empty() by replacing the pop/push based emptiness
check with a direct call to bpf_list_empty().
This avoids unnecessary list mutation and simplifies the code while
preserving correctness.
Leon Hwang [Thu, 21 May 2026 14:29:09 +0000 (22:29 +0800)]
bpf: Fix race between bpf_map_new_fd() and close_fd()
Because there is time gap between bpf_map_new_fd() and close_fd(), a
concurrent thread is able to close the new fd and opens a new, unrelated
file with the exact same fd number. Thereafter, this close_fd() might
inadvertently close the unrelated file.
To avoid such regression, do finalize log before security_bpf_map_create().
However, in order to achieve it, move bpf_get_file_flag(),
security_bpf_map_create(), bpf_map_alloc_id(), and bpf_map_new_fd() from
__map_create() to map_create(). And, rename __map_create() to
map_create_alloc() meanwhile.
Then, in order to reuse the map and token when all checks pass in
map_create_alloc(), pass "struct bpf_map **" and "struct bpf_token **" to
map_create_alloc().
The series introduces stack_map_get_build_id_offset_sleepable(),
fixing a gap with parsing build_id in sleepable context in stackmap.c
In particular, this fixes a deadlock in
stack_map_get_build_id_offset() doing a blocking __kernel_read(),
which happens since commit 777a8560fd29 ("lib/buildid: use
__kernel_read() for sleepable context").
See previous revisions for more details.
---
v6->v7:
* Addressed feedback from Andrii (mostly patch #2):
* implement proper CONFIG_PER_VMA_LOCK=n support, following a
VMA locking pattern similar to one used in PROCMAP_QUERY
* change the contract of stack_map_lock_vma(): if a non-NULL VMA
is returned, then a read lock is held
* remove now unnecessary vma_locked flag
* and various other nits
* Add vma_is_anonymous() checks where appropriate (AIs)
v6: https://lore.kernel.org/bpf/20260521225022.2695755-1-ihor.solodrai@linux.dev/
v5->v6:
* Misc refactoring (Andrii):
* add stack_map_build_id_set_valid() helper
* simplify control flow in stack_map_get_build_id_offset_sleepable()
v5: https://lore.kernel.org/bpf/20260515005244.1333013-1-ihor.solodrai@linux.dev/
v4->v5:
* Add comments explaining mmap_read_trylock() (Shakeel)
* Rebase on bpf-next (Alexei)
v4: https://lore.kernel.org/bpf/20260514184727.1067141-1-ihor.solodrai@linux.dev/
v3->v4:
* Change Fixes tag in patch #2 (AI)
* Nit in caching implementation (Mykyta)
v3: https://lore.kernel.org/bpf/20260512032906.2670326-1-ihor.solodrai@linux.dev/
v2->v3:
* Split patch #2 in two: stack_map_get_build_id_offset_sleepable()
implementation, and then introduce caching
* Drop taking mmap_lock if CONFIG_PER_VMA_LOCK=n, fall back to raw
IPs instead
* Cache vm_{start,end} in addition to prev_file (Mykyta)
v2: https://lore.kernel.org/bpf/20260409010604.1439087-1-ihor.solodrai@linux.dev/
v1->v2:
* Addressed feedback from Puranjay:
* split out a small refactoring patch
* use mmap_read_trylock()
* take into account CONFIG_PER_VMA_LOCK
* replace find_vma() with vma_lookup()
* cache prev_build_id to avoid re-parsing the same file
* Snapshot vm_pgoff and vm_start before unlocking (AI)
* To avoid repetitive unlocking statements, introduce struct
stack_map_vma_lock to hold relevant lock state info and add an
unlock helper
v1: https://lore.kernel.org/bpf/20260407223003.720428-1-ihor.solodrai@linux.dev/
Ihor Solodrai [Mon, 25 May 2026 22:39:48 +0000 (15:39 -0700)]
bpf: Cache build IDs in sleepable stackmap path
Stack traces often contain adjacent IPs from the same VMA or from
different VMAs backed by the same ELF file. Cache the last successfully
parsed build id together with the resolved VMA range and backing file
so the sleepable build id path can avoid repeated VMA locking and file
parsing in common cases.
Ihor Solodrai [Mon, 25 May 2026 22:39:47 +0000 (15:39 -0700)]
bpf: Avoid faultable build ID reads under mm locks
Sleepable build ID parsing can block in __kernel_read() [1], so the
stackmap sleepable path must not call it while holding mmap_lock or a
per-VMA read lock.
The issue and the fix are conceptually similar to a recent procfs
patch [2]. A similar VMA locking pattern has already been used in
PROCMAP_QUERY [3].
Resolve each covered VMA with a stable read-side reference, preferring
lock_vma_under_rcu() and falling back to mmap_read_trylock() only long
enough to acquire the VMA read lock. Take a reference to the backing
file, drop the VMA lock, and then parse the build ID through
(sleepable) build_id_parse_file().
We have to use mmap_read_trylock() (and give up on failure) in this
context because taking mmap_read_lock() is generally unsafe on code
paths reachable from BPF programs [4], and may lead to deadlocks.
Ihor Solodrai [Mon, 25 May 2026 22:39:46 +0000 (15:39 -0700)]
bpf: Factor out stack_map build ID helpers
Factor out helpers from stack_map_get_build_id_offset() in
preparation for adding a sleepable build ID resolution path:
stack_map_build_id_set_ip(), stack_map_build_id_offset(), and
stack_map_build_id_set_valid().
While here, refactor stack_map_get_build_id_offset():
* use continue-driven control flow in the main loop and remove
build_id_valid label
* update prev_vma and prev_build_id on the fall-back-to-IP branch so
the cache reflects the actual VMA seen on the previous IP [1]
* guard fetch_build_id() with vma_is_anonymous() [2] to skip parse
attempts that would otherwise fail the ELF magic check
Tiezhu Yang [Tue, 26 May 2026 06:39:36 +0000 (14:39 +0800)]
libbpf: Add __NR_bpf definition for LoongArch
LoongArch uses the generic syscall table, where __NR_bpf is defined
as 280 in include/uapi/asm-generic/unistd.h.
To align with other architectures, add the __NR_bpf definition for
LoongArch to avoid a potential compilation failure: "error __NR_bpf
not defined. libbpf does not support your arch."
This is a follow up patch of:
commit b0c47807d31d ("bpf: Add sparc support to tools and samples.")
commit bad1926dd2f6 ("bpf, s390: fix build for libbpf and selftest suite")
commit ca31ca8247e2 ("tools/bpf: fix perf build error with uClibc (seen on ARC)")
commit e32cb12ff52a ("bpf, mips: Fix build errors about __NR_bpf undeclared")
Carlos Llamas [Sat, 23 May 2026 16:27:21 +0000 (16:27 +0000)]
libbpf: Fix UAF in strset__add_str()
strset_add_str_mem() might reallocate the strset data buffer in order to
accommodate the provided string 's'. However, if 's' points to a string
already present in the buffer, it becomes dangling after the realloc.
This leads to a use-after-free when attempting to memcpy() the string
into the new buffer.
One scenario that triggers this problematic path is when resolve_btfids
attempts to patch kfunc prototypes using existing BTF parameter names:
| resolve_btfids: function bpf_list_push_back_impl already exists in BTF
| Segmentation fault (core dumped)
Compiling resolve_btfids with fsanitize=address generates a detailed
report of the UAF:
| =================================================================
| ERROR: AddressSanitizer: heap-use-after-free on address 0x7f4c4a500bd4
| ==1507892==ERROR: AddressSanitizer: heap-use-after-free on address 0x7f4c4a500bd4 at pc 0x55d25155a2a8 bp 0x7ffcef879060 sp 0x7ffcef878818
| READ of size 5 at 0x7f4c4a500bd4 thread T0
| #0 0x55d25155a2a7 in memcpy (tools/bpf/resolve_btfids/resolve_btfids+0xcf2a7)
| #1 0x55d2515d708e in strset__add_str tools/lib/bpf/strset.c:162:2
| #2 0x55d2515c730b in btf__add_str tools/lib/bpf/btf.c:2109:8
| #3 0x55d2515c9020 in btf__add_func_param tools/lib/bpf/btf.c:3108:14
| #4 0x55d25159f0b5 in process_kfunc_with_implicit_args tools/bpf/resolve_btfids/main.c:1196:9
| #5 0x55d25159e004 in btf2btf tools/bpf/resolve_btfids/main.c:1229:9
| #6 0x55d25159cee7 in main tools/bpf/resolve_btfids/main.c:1535:6
| #7 0x7f4c78e29f76 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
| #8 0x7f4c78e2a026 in __libc_start_main csu/../csu/libc-start.c:360:3
| #9 0x55d2514bb860 in _start (tools/bpf/resolve_btfids/resolve_btfids+0x30860)
|
| 0x7f4c4a500bd4 is located 13268 bytes inside of 2829000-byte region [0x7f4c4a4fd800,0x7f4c4a7b02c8)
| freed by thread T0 here:
| #0 0x55d25155b700 in realloc (tools/bpf/resolve_btfids/resolve_btfids+0xd0700)
| #1 0x55d2515c426c in libbpf_reallocarray tools/lib/bpf/./libbpf_internal.h:220:9
| #2 0x55d2515c426c in libbpf_add_mem tools/lib/bpf/btf.c:224:13
|
| previously allocated by thread T0 here:
| #0 0x55d25155b2e3 in malloc (tools/bpf/resolve_btfids/resolve_btfids+0xd02e3)
| #1 0x55d2515d6e7d in strset__new tools/lib/bpf/strset.c:58:20
While resolve_btfids could be refactored to avoid this call path, let's
instead fix this issue at the source in strset__add_str() and avoid
similar scenarios.
Let's check if set->strs_data was reallocated and whether 's' points to
an internal string within the old strset buffer. In such case, 's' is
reconstructed to point to the new buffer.
While already here, also fix strset__find_str() which suffers from the
same problem by factoring out the common operations into a new helper
function strset_str_append().
Siddharth Nayyar [Wed, 20 May 2026 09:40:44 +0000 (09:40 +0000)]
bpftool: Fix typo in struct_ops map FD generation for light skeleton
When generating light skeletons for BPF programs containing struct_ops
maps, bpftool incorrectly outputs a stray literal 't' instead of a tab
character for the map file descriptor member in the links structure.
This causes a compilation error when the generated light skeleton is
used.
Correct the format string by replacing 't' with '\t'.
parse_vma_segs() in tools/lib/bpf/usdt.c parses /proc/<pid>/maps
with two widthless scansets, "%s" into mode[16] and "%[^\n]"
into line[4096]. A VMA name in maps is not limited to that local
buffer; a deeply nested backing path can produce a maps record long
enough to overflow the stack buffer.
Bound both scansets to the declared buffer sizes ("%15s" for mode[16]
and "%4095[^\n]" for line[4096]) and drain any residue past line[4094]
with "%*[^\n]" before the trailing "\n". Without the drain, the residue
of an over-long record would stay in the stream and break the next
"%zx-%zx" parse, so the loop would exit early and silently skip later
maps records.
Also stop using sscanf(..., "%s") to peel the /proc/<pid>/root prefix
from lib_path. Parse the pid and prefix length with "%n", check for the
following slash, and copy the remainder with libbpf_strlcpy(). That
removes a second unbounded stack write and preserves paths containing
spaces.
Fixes: 74cc6311cec9 ("libbpf: Add USDT notes parsing and resolution logic") Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/bpf/20260522201353.1454653-1-michael.bommarito@gmail.com
Tejun Heo [Wed, 27 May 2026 19:26:32 +0000 (09:26 -1000)]
bpf: Fix bpf_arena_handle_page_fault() redefinition without CONFIG_BPF_SYSCALL
On configs with CONFIG_BPF=y but CONFIG_BPF_SYSCALL=n (e.g. arm
multi_v7_defconfig), kernel/bpf/core.c defines a __weak
bpf_arena_handle_page_fault() while bpf_defs.h already supplies a static
inline stub for it, causing a redefinition error. Build the __weak
definition only under CONFIG_BPF_SYSCALL, matching the bpf_defs.h
declaration and the CONFIG_BPF_SYSCALL-gated strong definition in arena.c.
Fixes: dc11a4dba246 ("bpf: Recover arena kernel faults with scratch page") Reported-by: Mark Brown <broonie@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20260527192632.2109419-1-tj@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
====================
This makes BPF arena memory directly dereferenceable from kernel code
(struct_ops callbacks, kfuncs). Each arena gets a per-arena scratch page
that an arch fault hook installs into empty PTEs on kernel-side faults,
after KFENCE. The faulting instruction retries and the violation is reported
through the program's BPF stream.
v4:
- Patch 1: note that the strict-zero cmpxchg is narrower than pte_none() in
inline comments on both x86 and arm64. (Andrea)
- Patch 2: stub bpf_arena_handle_page_fault() for !CONFIG_BPF_SYSCALL via a
new include/linux/bpf_defs.h. (lkp)
- Patch 7: scx_arena_alloc() retries via a loop instead of a single retry on
pool growth. (Andrea)
- Picked up Reviewed-by tags from Emil and Andrea.
sched_ext's ops_cid.set_cmask() hands the BPF scheduler a struct scx_cmask
*. The kernel translates a kernel cpumask to a cmask, but it had no way to
write into the arena, so the cmask lived in kernel memory and was passed as
a trusted pointer. BPF cmask helpers all operate on arena cmasks though, so
the BPF side had to word-by-word probe-read the kernel cmask into an arena
cmask via cmask_copy_from_kernel() before any helper could touch it. It
works, but is clumsy.
The shape isn't unique to set_cmask. Sub-scheduler support is on the way and
more sched_ext callbacks will want to pass structured data to BPF. Anywhere
a kfunc or struct_ops callback wants to hand a struct to a BPF program,
arena residence is the natural answer.
Approach
--------
Each arena gets a per-arena scratch page. Arenas stay sparsely mapped as
today - PTEs are populated only for allocated pages. A new arch fault hook
(bpf_arena_handle_page_fault) is wired into x86 page_fault_oops() and arm64
__do_kernel_fault(), after KFENCE. When a kernel-side access faults inside
an arena's kern_vm range, the helper walks the stack to find the BPF program
responsible, range-checks the fault address against prog->aux->arena, and
atomically installs the scratch page into the empty PTE via the new
ptep_try_set() wrapper. The kernel instruction retries and reads/writes the
scratch page. Free paths and map destruction treat scratch as non-owned.
Real allocation refuses to overwrite scratch (apply_range_set_cb returns
-EBUSY). A scratched address stays dead until map destroy, since its
presence means the BPF program has already malfunctioned.
The mechanism is default behavior - no UAPI flag.
What this preserves
-------------------
All the debugging properties of today's sparse-PTE design are preserved:
* BPF programs still fault on unmapped arena accesses. The fault semantics
(instruction retry with rdst = 0) and the violation report through
bpf_streams are unchanged for prog-side accesses.
* The first kernel-side touch of an unmapped address is reported via
bpf_streams the same way as a prog-side fault, with the stack walk
attributing it to the originating prog.
* User-side fault on a never-scratched address still lazy-allocates a real
page (or returns SIGSEGV under BPF_F_SEGV_ON_FAULT). User-side fault on a
scratched address SIGSEGVs.
What changes for the kernel-side caller is just that an unmapped deref no
longer oopses - it retries through the scratch page and emits a violation
report. The same shape today's BPF instruction faults have.
Linus Torvalds [Sun, 24 May 2026 19:50:36 +0000 (12:50 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"arm64:
- Fix ITS EventID sanitisation when restoring an interrupt
translation table.
- Fix PPI memory leak when failing to initialise a vcpu.
- Correctly return an error when the validation of a hypervisor trace
descriptor fails, and limit this validation to protected mode only.
RISC-V:
- Fix invalid HVA warning in steal-time recording
- Return SBI_ERR_FAILURE to guest upon OOM in pmu_event_info() and
pmu_snapshot_set_shmem()
- Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler
- Fix sign extension of value for MMIO loads
s390:
- Fix bugs in vSIE (nested virtualization) and UCONTROL, caused by
the page table rewrite.
x86:
- Apply erratum #1235 workaround (disable AVIC IPI virtualization) on
Hygon Family 18h, just like on AMD Family 17h.
- When KVM_CAP_X86_APIC_BUS_CYCLES_NS is queried on a specific VM,
return the VM's configured APIC bus frequency instead of the
default. This is less confusing (read: not wrong) and makes it
easier to fill in CPUID information that communicates the APIC bus
frequency to the guest.
Selftests:
- Do not include glibc-internal <bits/endian.h>; it worked by chance
and broke building KVM selftests with musl"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SVM: Disable AVIC IPI virtualization on Hygon Family 18h (erratum #1235)
KVM: selftests: Verify that KVM returns the configured APIC cycle length
KVM: x86: Return the VM's configured APIC bus frequency when queried
KVM: selftests: elf: Include <endian.h> instead of <bits/endian.h>
KVM: s390: Properly reset zero bit in PGSTE
KVM: s390: vsie: Fix redundant rmap entries
KVM: s390: vsie: Fix unshadowing logic
KVM: s390: Fix leaking kvm_s390_mmu_cache in case of errors
KVM: s390: vsie: Fix memory leak when unshadowing
KVM: arm64: Fix nVHE/pKVM hyp tracing error on invalid desc
KVM: arm64: vgic: Free private_irqs when init fails after allocation
KVM: arm64: vgic-its: Reject restored DTE with out-of-range num_eventid_bits
RISC-V: KVM: Fix sign extension for MMIO loads
RISC-V: KVM: Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler
riscv: kvm: return SBI_ERR_FAILURE for pmu_event_info() when OOM
riscv: kvm: return SBI_ERR_FAILURE for pmu_snapshot_set_shmem() when OOM
RISC-V: KVM: Fix invalid HVA warning in steal-time recording
Linus Torvalds [Sun, 24 May 2026 18:00:45 +0000 (11:00 -0700)]
Merge tag 'x86-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
- On SEV guests, handle set_memory_{encrypted,decrypted}() failures
more conservatively by assuming that all affected pages are
unencrypted (Carlos López)
- Disable broadcast TLB flush when PCID is disabled (Tom Lendacky)
- Fix VMX vs. hrtimer_rearm_deferred() regression (Peter Zijlstra)
- Move IRQ/NMI dispatch code from KVM into x86 core, to prepare for a
KVM x2apic fix (Peter Zijlstra)
* tag 'x86-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
virt: sev-guest: Explicitly leak pages in unknown state
x86/mm: Disable broadcast TLB flush when PCID is disabled
x86/kvm/vmx: Fix VMX vs hrtimer_rearm_deferred()
x86/kvm/vmx: Move IRQ/NMI dispatch from KVM into x86 core
x86/vdso: Fix incorrect size in munmap() on map_vdso() failure
Linus Torvalds [Sun, 24 May 2026 17:55:21 +0000 (10:55 -0700)]
Merge tag 'irq-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irqchip driver fixes from Ingo Molnar:
- Fix the hardware probing error path of the renesas-rzt2h
irqchip driver
- Fix the exynos-combiner irqchip driver on -rt kernels
by turning the IRQ controller spinlock into a raw spinlock
* tag 'irq-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/renesas-rzt2h: Use pm_runtime_put_sync() in probe error path
irqchip/exynos-combiner: Switch to raw_spinlock
Linus Torvalds [Sun, 24 May 2026 17:48:55 +0000 (10:48 -0700)]
Merge tag 'core-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull debugobjects fix from Ingo Molnar::
- Fix debugobjects regression on -rt kernels: don't fill the pool
(which uses a coarse lock) if ->pi_blocked_on, because that messes up
the priority inheritance of callers
* tag 'core-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
debugobjects: Do not fill_pool() if pi_blocked_on
Linus Torvalds [Sun, 24 May 2026 17:37:55 +0000 (10:37 -0700)]
Merge tag 'hwmon-for-v7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
- adm1266: Various fixes from Abdurrahman Hussain
The fixed issues were reported by Sashiko as part of a code review of
a functional change in the driver.
- lenovo-ec-sensors: Convert to devm_request_region() to fix
release_region cleanup, and fix EC "MCHP" signature validation logic,
from Kean Ren
* tag 'hwmon-for-v7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (pmbus/adm1266) serialize sequencer_state debugfs read with pmbus_lock
hwmon: (pmbus/adm1266) serialize NVMEM blackbox read with pmbus_lock
hwmon: (pmbus/adm1266) serialize GPIO PMBus accesses with pmbus_lock
hwmon: (pmbus/adm1266) register the nvmem device after pmbus_do_probe()
hwmon: (pmbus/adm1266) register the gpio_chip after pmbus_do_probe()
hwmon: (pmbus/adm1266) reject short block-read responses in the GPIO accessors
hwmon: (pmbus/adm1266) don't clobber GPIO bits before PDIO read in get_multiple
hwmon: (pmbus/adm1266) cap PDIO scan in get_multiple at ADM1266_PDIO_NR
hwmon: (pmbus/adm1266) bounce blackbox records through a protocol-sized buffer
hwmon: (pmbus/adm1266) include adapter number in GPIO line label
hwmon: (pmbus/adm1266) include PEC byte in pmbus_block_xfer read buffer
hwmon: (pmbus/adm1266) reject implausible blackbox record_count
hwmon: (pmbus/adm1266) widen blackbox-info buffer to I2C_SMBUS_BLOCK_MAX
hwmon: (pmbus/adm1266) seed timestamp from the real-time clock
hwmon: (lenovo-ec-sensors): Fix EC "MCHP" signature validation logic
hwmon: (lenovo-ec-sensors): Convert to devm_request_region()
drm/msm: Restore second parameter name in purge() and evict()
After commit 3392291fc509 ("drm/msm: Fix shrinker deadlock"), all
supported versions of clang warn (or error with CONFIG_WERROR=y):
drivers/gpu/drm/msm/msm_gem_shrinker.c:105:58: error: omitting the parameter name in a function definition is a C23 extension [-Werror,-Wc23-extensions]
105 | purge(struct drm_gem_object *obj, struct ww_acquire_ctx *)
| ^
drivers/gpu/drm/msm/msm_gem_shrinker.c:117:58: error: omitting the parameter name in a function definition is a C23 extension [-Werror,-Wc23-extensions]
117 | evict(struct drm_gem_object *obj, struct ww_acquire_ctx *)
| ^
2 errors generated.
With older but supported versions of GCC, this is an unconditional hard error:
drivers/gpu/drm/msm/msm_gem_shrinker.c: In function 'purge':
drivers/gpu/drm/msm/msm_gem_shrinker.c:105:35: error: parameter name omitted
purge(struct drm_gem_object *obj, struct ww_acquire_ctx *)
^~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/msm/msm_gem_shrinker.c: In function 'evict':
drivers/gpu/drm/msm/msm_gem_shrinker.c:117:35: error: parameter name omitted
evict(struct drm_gem_object *obj, struct ww_acquire_ctx *)
^~~~~~~~~~~~~~~~~~~~~~~
Restore the parameter name to clear up the warnings, renaming it
"unused" to make it clear it is only needed to satisfy the prototype of
drm_gem_lru_scan().
Linus Torvalds [Sun, 24 May 2026 16:53:17 +0000 (09:53 -0700)]
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix bpf_throw() and global subprog combination (Kumar Kartikeya
Dwivedi)
- Fix out of bounds access in BPF interpreter (Yazhou Tang)
- Fix potential out of bounds access in inner per-cpu array map
(Guannan Wang)
- Reject NULL data/sig in bpf_verify_pkcs7_signature (KP Singh)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
libbpf: fix off-by-one in emit_signature_match jump offset
bpf: Reject NULL data/sig in bpf_verify_pkcs7_signature
selftests/bpf: Cover global subprog exception leaks
bpf: Check global subprog exception paths
bpf: make bpf_session_is_return() reference optional
bpf: Use array_map_meta_equal for percpu array inner map replacement
selftests/bpf: Add test for large offset bpf-to-bpf call
bpf: Fix s16 truncation for large bpf-to-bpf call offsets
bpf: Fix out-of-bounds read in bpf_patch_call_args()
Linus Torvalds [Sat, 23 May 2026 23:59:02 +0000 (16:59 -0700)]
Merge tag 'v7.1-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- fix for creating tmpfiles
- fix durable reconnect error path
- validate SID in security descriptor when inheriting DACL
* tag 'v7.1-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
smb/server: promote S_DEL_ON_CLS to S_DEL_PENDING when close
ksmbd: validate SID in parent security descriptor during ACL inheritance
ksmbd: fix durable reconnect error path file lifetime
Linus Torvalds [Sat, 23 May 2026 23:54:48 +0000 (16:54 -0700)]
Merge tag 'for-7.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A batch of fixes to simple quotas:
- add conditional rescheduling point not dependent on the lock during
inode iterations to avoid delays with PREEMPT_NONE enabled
- fix subvolume deletion so it does not break the squota invariants
- properly handle enabling squota, tracking extents in the initial
transaction
- catch and warn about underflows, clamp to zero to avoid further
problems
And one fix to inode size handling:
- fix handling of preallocated extents beyond i_size when not using
the no-holes feature"
* tag 'for-7.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: swallow btrfs_record_squota_delta() ENOENT
btrfs: clamp to avoid squota underflow
btrfs: fix squota accounting during enable generation
btrfs: check for subvolume before deleting squota qgroup
btrfs: always drop root->inodes lock before cond_resched()
btrfs: mark file extent range dirty after converting prealloc extents
Linus Torvalds [Sat, 23 May 2026 23:51:22 +0000 (16:51 -0700)]
Merge tag 'xfs-fixes-7.1-rc5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fix from Carlos Maiolino:
"A single fix for a race in xfs buffer cache which may lead to
filesystem shutdown due to inconsistent metadata if the buffer
lookup happens to find an old dead buffer still in the cache"
* tag 'xfs-fixes-7.1-rc5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix a buffer lookup against removal race
Linus Torvalds [Sat, 23 May 2026 16:21:08 +0000 (09:21 -0700)]
Merge tag 'nios2_updates_for_v7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux
Pull nios2 fixes from Dinh Nguyen:
- Implement _THIS_IP_ for inline asm
- Add Simon Schuster as a maintainer and mark the NIOS2 as Supported
* tag 'nios2_updates_for_v7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux:
nios2: Implement _THIS_IP_ using inline asm
MAINTAINERS: arch/nios2: Add Simon Schuster as co-maintainer
Linus Torvalds [Sat, 23 May 2026 16:13:00 +0000 (09:13 -0700)]
Merge tag 'loongarch-fixes-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
"Rework KASLR to avoid initrd overlap, remove some unused code to avoid
a build warning, fix some bugs in kprobes and KVM"
* tag 'loongarch-fixes-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: KVM: Move some variable declarations to paravirt.h
LoongArch: kprobes: Fix handling of fatal unrecoverable recursions
LoongArch: kprobes: Use larch_insn_text_copy() to patch instructions
LoongArch: Remove unused code to avoid build warning
LoongArch: Avoid initrd overlap during kernel relocation
LoongArch: Skip relocation-time KASLR if already applied
efi/loongarch: Randomize kernel preferred address for KASLR
KP Singh [Fri, 22 May 2026 21:53:36 +0000 (23:53 +0200)]
libbpf: fix off-by-one in emit_signature_match jump offset
The offset for the cleanup-label jump is computed before the MOV R7
instruction is emitted, but the JMP lands after it. Account for the
extra insn in the offset calculation (-2 instead of -1). Drop the
redundant self-loop in the else branch; gen->error = -ERANGE already
marks the generation as failed.
Linus Torvalds [Sat, 23 May 2026 14:49:05 +0000 (07:49 -0700)]
Merge tag 'driver-core-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core fixes from Danilo Krummrich:
- Remove the software node on platform device release(); without this,
the software node remains registered after the device is gone and a
subsequent platform_device_register_full() reusing the same node
fails with -EBUSY
- In sysfs_update_group(), do not remove a pre-existing directory when
create_files() fails; the previous code would silently destroy a
sysfs group that the caller did not create
- Set fwnode->secondary to NULL in fwnode_init() to avoid dereferencing
uninitialized memory (e.g. in dev_to_swnode()) when the firmware node
is allocated on the stack or via a non-zeroing allocator
* tag 'driver-core-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
device property: set fwnode->secondary to NULL in fwnode_init()
sysfs: don't remove existing directory on update failure
driver core: platform: remove software node on release()
Linus Torvalds [Sat, 23 May 2026 14:17:27 +0000 (07:17 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
- syzbot triggred crash in rxe due to concurrent plug/unplug
- Possible non-zero'd memory exposed to userspace in bnxt_re
- Malicous 'magic packet' with SIW causes a buffer overflow
- Tighten the new uAPI validation code to not crash in debugging prints
and have the right module dependencies in drivers
- mana was missing the max_msg_sz report to userspace
- UAF in rtrs on an error path
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/rtrs: Fix use-after-free in path file creation cleanup
RDMA/mana_ib: Report max_msg_sz in mana_ib_query_port
RDMA/core: Do not read wild stack memory in uverbs_get_handler_fn()
RDMA/core: Move the _ib_copy_validate_udata* functions to ib_core_uverbs
RDMA/siw: Reject MPA FPDU length underflow before signed receive math
RDMA/bnxt_re: zero shared page before exposing to userspace
selftests/rdma: explicitly skip tests when required modules are missing
RDMA/nldev: Add mutual exclusion in nldev_dellink()
Tejun Heo [Fri, 22 May 2026 17:22:16 +0000 (07:22 -1000)]
bpf/arena: Add bpf_arena_map_kern_vm_start() and bpf_prog_arena()
struct bpf_arena is opaque to callers outside arena.c. Add two helpers
for struct_ops subsystems that need to reach into an arena:
bpf_arena_map_kern_vm_start(struct bpf_map *map)
returns @map's kern_vm_start. A sched_ext follow-up needs this
to translate kern_va <-> uaddr.
bpf_prog_arena(struct bpf_prog *prog)
returns the bpf_map of the arena referenced by @prog (NULL if
@prog references no arena). The verifier enforces at most one
arena per program. Used by struct_ops callers that auto-discover
an arena from a member prog and need to take a map reference.
Tejun Heo [Fri, 22 May 2026 17:22:15 +0000 (07:22 -1000)]
bpf: Add bpf_struct_ops_for_each_prog()
Add a helper that walks the member progs of the struct_ops map
containing a given @kdata vmtable. struct_ops ->reg() callbacks (and
similar) sometimes need to inspect the loaded BPF programs, e.g. to
discover maps they reference via prog->aux->used_maps.
The implementation mirrors bpf_struct_ops_id(): container_of @kdata
to recover the bpf_struct_ops_map, then iterate st_map->links[i]->prog
for i in [0, funcs_cnt). Same access pattern, no new locking - by the
time ->reg() fires st_map is fully populated and stable.
A sched_ext follow-up walks the member progs of a cid-form scheduler's
struct_ops map, reads prog->aux->arena directly, and requires all member
progs to reference exactly one arena, without requiring the BPF program
to call a registration kfunc.
Tejun Heo [Fri, 22 May 2026 17:22:14 +0000 (07:22 -1000)]
bpf: Add sleepable variant of bpf_arena_alloc_pages for kernel callers
The existing kernel-side export of bpf_arena_alloc_pages is _non_sleepable
only - it's used by the verifier to inline the kfunc when the call site is
non-sleepable. There is no sleepable equivalent for kernel callers. The
kfunc bpf_arena_alloc_pages itself is BPF-only.
sched_ext needs sleepable kernel-side allocs for its arena pool init/grow
paths. Add bpf_arena_alloc_pages_sleepable() mirroring the _non_sleepable
wrapper but passing sleepable=true to arena_alloc_pages().
bpf: Recover arena kernel faults with scratch page
BPF arena usage is becoming more prevalent, but kernel <-> BPF communication
over arena memory is awkward today. Data has to be staged through a trusted
kernel pointer with extra code and copying on the BPF side. While reads
through arena pointers can use a fault-safe helper, writes don't have a good
solution. The in-line alternative would need instruction emulation or asm
fixup labels.
Enable direct kernel-side reads and writes within GUARD_SZ / 2 of any
handed-in arena pointer, without bounds checking. A per-arena scratch page
is installed by the arch fault path into empty arena kernel PTEs - x86 from
page_fault_oops() for not-present faults, arm64 from __do_kernel_fault() for
translation faults, both after the existing exception-table and KFENCE
handling. The faulting instruction retries and the access is also reported
through the program's BPF stream, preserving error reporting.
bpf_prog_find_from_stack() resolves the current BPF program (and its arena)
from the kernel stack - no new bpf_run_ctx state is added. Recovery covers
the 4 GiB arena plus the upper half-guard (GUARD_SZ / 2). The lower
half-guard is excluded because well-behaved kfuncs only access forward from
arena pointers. The kfunc-author contract - access at most GUARD_SZ / 2 past
a handed-in pointer - is documented in Documentation/bpf/kfuncs.rst.
The install is lock-free via ptep_try_set(). On race-loss the winning
installer's PTE is already valid, so the access retry succeeds. The arena
clear path uses ptep_get_and_clear() so installer and clearer race through
atomic accessors. No flush_tlb_kernel_range() afterwards. Stale "not mapped"
entries just cause one extra re-fault, cheaper than a global IPI on every
install.
Scratch exists only to keep the kernel from oopsing on an in-line arena
access. Its presence at a PTE means the BPF program has already
malfunctioned, and the violation is reported through the program's BPF
stream. The only requirement for behavior on a scratched PTE is that the
kernel doesn't crash. In particular, any user-side access through such a PTE
may segfault. The shared scratch page is freed once during map destruction.
BPF instruction faults continue to use the existing JIT exception-table
path. This patch changes only the kernel-text fault path. No UAPI flag is
added. The new behavior is the default.
v2: Use ptep_get_and_clear() in apply_range_clear_cb(). (David)
v3: Stub bpf_arena_handle_page_fault() for !CONFIG_BPF_SYSCALL. (lkp)
Tejun Heo [Fri, 22 May 2026 17:22:12 +0000 (07:22 -1000)]
mm: Add ptep_try_set() for lockless empty-slot installs
Add ptep_try_set(ptep, new_pte): atomically set *ptep to new_pte iff it is
currently pte_none(). Returns true on success, false if the slot was already
populated or the arch has no implementation.
The intended caller is the upcoming bpf_arena kernel-side fault recovery
path. The install runs from a page fault that can be nested under locks
held by the faulting kernel caller (e.g. a BPF program holding
raw_res_spin_lock_irqsave on its arena's spinlock), so trylock-and-retry
would A-A deadlock. Lock-free cmpxchg is the only viable option, which
constrains this helper to special kernel page tables where concurrent
writers cooperate via atomic accessors.
The generic version in <linux/pgtable.h> returns false. x86 and arm64
override with try_cmpxchg-based implementations on the underlying pteval.
Other architectures get the false stub - the callers there already fall
through to oops.
v2: Rename to ptep_try_set(). Tighten kerneldoc. (David, Alexei)
v3: Note that strict-zero cmpxchg is narrower than pte_none(). (Andrea)
Tina Zhang [Fri, 22 May 2026 04:00:14 +0000 (12:00 +0800)]
KVM: SVM: Disable AVIC IPI virtualization on Hygon Family 18h (erratum #1235)
Hygon Family 18h CPUs are derived from AMD Family 17h (Zen1) silicon and
share the same erratum #1235: hardware may read a stale IsRunning=1 bit
during ICR write emulation and silently fail to generate an
AVIC_IPI_FAILURE_TARGET_NOT_RUNNING VM-Exit on the sending vCPU.
The absence of the VM-Exit causes KVM to miss the required wakeup of
blocking target vCPUs, leading to hung vCPUs and unbounded delays in
guest execution.
Extend the existing AMD Family 17h erratum #1235 workaround to also cover
Hygon Family 18h. With IPI virtualization disabled, KVM never sets
IsRunning=1 in the Physical ID table, so every non-self IPI generates a
VM-Exit and is correctly emulated.
Fixes: 8de4a1c8164e ("KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235") Cc: <stable@vger.kernel.org> Signed-off-by: Tina Zhang <zhang_wei@open-hieco.net>
Message-ID: <20260522040014.3380201-1-zhang_wei@open-hieco.net>
KVM: selftests: Verify that KVM returns the configured APIC cycle length
Add checks in the APIC bus clock test to verify that querying
KVM_CAP_X86_APIC_BUS_CYCLES_NS on the VM after changing the frequency
returns the VM's actual APIC cycle length, not KVM's default. For
giggles, verify that KVM still returns its default frequency for the
system-scoped check.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260522173526.3539407-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Return the VM's configured APIC bus frequency when queried
When KVM_CAP_X86_APIC_BUS_CYCLES_NS is queried on a specific VM, return the
VM's configured APIC bus frequency, not KVM's default. Aside from the fact
that returning the default frequency is blatantly wrong if userspace has
changed the frequency, returning the configured frequency means userspace
can blindly trust the result, e.g. when filling PV CPUID information that
communicates the APIC bus frequency to the guest.
Fixes: 6fef518594bc ("KVM: x86: Add a capability to configure bus frequency for APIC timer") Reported-by: David Woodhouse <dwmw2@infradead.org> Closes: https://lore.kernel.org/all/ab84153e33fbe7c25667f595c56b310d4d5a93ef.camel@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260522173526.3539407-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Sat, 23 May 2026 08:04:35 +0000 (10:04 +0200)]
Merge tag 'kvm-riscv-fixes-7.1-1' of https://github.com/kvm-riscv/linux into HEAD
KVM/riscv fixes for 7.1, take #1
- Fix invalid HVA warning in steal-time recording
- Return SBI_ERR_FAILURE to guest upon OOM in pmu_event_info()
and pmu_snapshot_set_shmem()
- Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler
- Fix sign extension of value for MMIO loads
Linus Torvalds [Fri, 22 May 2026 23:43:33 +0000 (16:43 -0700)]
Merge tag 'sched_ext-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Spurious WARN in ops_dequeue() racing with concurrent dispatch
- Self-deadlock between scheduler disable and a concurrent sub-sched
enable
* tag 'sched_ext-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue()
sched_ext: Fix deadlock between scx_root_disable() and concurrent forks
Linus Torvalds [Fri, 22 May 2026 23:28:47 +0000 (16:28 -0700)]
Merge tag 'cgroup-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"Two rstat fixes:
- Out-of-bounds access in the css_rstat_updated() BPF kfunc when
called with an unchecked user-supplied cpu
- Over-strict NMI guard after the recent switch to try_cmpxchg left
sparc and ppc64 unable to queue rstat updates from NMI"
* tag 'cgroup-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rstat: relax NMI guard after switch to try_cmpxchg
cgroup/rstat: validate cpu before css_rstat_cpu() access
Linus Torvalds [Fri, 22 May 2026 23:15:32 +0000 (16:15 -0700)]
Merge tag 'drm-fixes-2026-05-23' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Regular fixes pull, amdgpu/xe being the usual, with bonus msm content
to bulk things out, otherwise it has the usual scattered changes, with
amdxdna dropping a badly thought out userspace api.
gem:
- clean up LRU locking
msm:
- Core:
- Fixed bindings for SM8650, SM8750 and Eliza
- Don't use UTS_RELEASE directly
- Fix typo in clock-names property
- DPU:
- Fixed CWB description on Kaanapali
- Fixed scanline strides for YUV UBWC formats
- Stopped DSI register dumping to access past the end of region
- DSI:
- Fix dumping unaligned regions
- GPU:
- Fix GMEM_BASE for a6xx gen3
- Fix userspace reachable crash on a2xx-a4xx
- Fix sysprof_active for counter collection with IFPC enabled GPUs
- Fix shrinker lockdep
xe:
- SRIOV related fixes
- Fix leak and double-free
- Multi-cast register fixes
- Multi-queue fix
i915:
- Fix joiner color pipeline selection [display]
- Fix readback for target_rr in Adaptive Sync SDP [dp]
- Apply Intel DPCD workaround when SDP on prior line used [psr]
amdxdna:
- remove mmap and export for ubuf
bridge:
- chipone-icn6211: managed bridge cleanup
- lt66121: acquire reset GPIO
- megachips: fix clean up on failed IRQ requests
v3d:
- fix UAF in error code paths
- release GEM-object ref on free'd jobs
virtio:
- use uninterruptible resv locking in plane updates
mediatek:
- fix sparse warnings"
* tag 'drm-fixes-2026-05-23' of https://gitlab.freedesktop.org/drm/kernel: (78 commits)
drm/xe/oa: Fix exec_queue leak on width check in stream open
drm/virtio: use uninterruptible resv lock for plane updates
drm/amdgpu: fix handling in amdgpu_userq_create
drm/radeon/evergreen_cs: Add missing NULL prefix check in surface check
drm/amdgpu: userq_va_mapped should remain true once done
drm/amdgpu: avoid integer overflow in VA range check
drm/amd/ras: Fix UMC error address allocation leak
drm/amdgpu: unmap all user mappings of framebuffer and doorbell before mode1 reset
drm/amd/display: Validate payload length and link_index in dc_process_dmub_aux_transfer_async
drm/amd/display: Validate GPIO pin LUT table size before iterating
drm/amd/display: Fix integer overflow in bios_get_image()
drm/amdkfd: Check bounds for allocate_sdma_queue restore_sdma_id
drm/amdgpu: use atomic operation to achieve lockless serialization
drm/amdkfd: Check bounds on allocate_doorbell
drm/amdgpu/vce3: Fix VCE 3 firmware size and offsets
drm/amdgpu/vce2: Fix VCE 2 firmware size and offsets
drm/amdgpu/vce1: Stop using amdgpu_vce_resume
drm/amdgpu/vce1: Fix VCE 1 firmware size and offsets
drm/amdgpu/vce1: Don't repeat GTT MGR node allocation
drm/amdgpu/vce1: Check if VRAM address is lower than GART.
...
Linus Torvalds [Fri, 22 May 2026 23:08:06 +0000 (16:08 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Small fixes, two in drivers and the remaining a sign conversion probem
in sd with no user visible consequences (non-zero is error)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: target: tcm_loop: Fix NULL ptr dereference
scsi: isci: Fix use-after-free in device removal path
scsi: sd: Fix return code handling in sd_spinup_disk()
- hp-wmi:
- Add thermal support for Omen 16-c0xxx (board 8902)
- intel/vsec:
- Fix enable_cnt imbalance due to PCIe error recovery
- surface/aggregator_registry:
- Remove battery & AC nodes on Surface Laptop 7 to avoid duplicated
devices
- uniwill-laptop:
- Handle uninitialized and invalid charging threshold values
- Accept charging threshold of 0 through power supply sysfs ABI and
clamp it to 1
- Make 'force' parameter to work also when device descriptor is
found
- Do not enable charging limit despite the 'force' parameter to
avoid permanent damage to battery
* tag 'platform-drivers-x86-v7.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (35 commits)
platform/x86: bitland-mifs-wmi: add CONFIG_LEDS_CLASS dependency
platform/x86: wireless-hotkey: Check ACPI_COMPANION() against NULL
platform/x86: toshiba_haps: Check ACPI_COMPANION() against NULL
platform/x86: toshiba_bluetooth: Check ACPI_COMPANION() against NULL
platform/x86: toshiba_acpi: Check ACPI_COMPANION() against NULL
platform/x86: system76: Check ACPI_COMPANION() against NULL
platform/x86: sony-laptop: Check ACPI_COMPANION() against NULL
platform/x86: panasonic-laptop: Check ACPI_COMPANION() against NULL
platform/x86: lg-laptop: Check ACPI_COMPANION() against NULL
platform/x86: intel/smartconnect: Check ACPI_HANDLE() against NULL
platform/x86: intel/rst: Check ACPI_COMPANION() against NULL
platform/x86: fujitsu-tablet: Check ACPI_COMPANION() against NULL
platform/x86: fujitsu: Check ACPI_COMPANION() against NULL
platform/x86: eeepc-laptop: Check ACPI_COMPANION() against NULL
platform/x86: dell/dell-rbtn: Check ACPI_COMPANION() against NULL
platform/x86: asus-laptop: Check ACPI_COMPANION() against NULL
platform/x86: acer-wireless: Check ACPI_COMPANION() against NULL
platform/x86: asus-armoury: add support for GU605CP
platform/x86: asus-armoury: add support for FA401EA
platform/x86: asus-armoury: add support for G614FR
...
Linus Torvalds [Fri, 22 May 2026 20:19:41 +0000 (13:19 -0700)]
Merge tag 'spi-fix-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"Another batch of driver fixes from Johan fixing error handling paths,
plus another from Felix. We also have a new device ID added in the DT
bindings for SpacemiT K3"
* tag 'spi-fix-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: dt-bindings: fsl-qspi: support SpacemiT K3
spi: ti-qspi: fix use-after-free after DMA setup failure
spi: sprd: fix error pointer deref after DMA setup failure
spi: qup: fix error pointer deref after DMA setup failure
spi: mtk-snfi: Fix resource leak in mtk_snand_read_page_cache()
spi: ep93xx: fix error pointer deref after DMA setup failure
Linus Torvalds [Fri, 22 May 2026 20:17:29 +0000 (13:17 -0700)]
Merge tag 'regulator-fix-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"A couple of fixes here, one very minor Kconfig fix and a fix for a
nasty issue with error reporting in the tps65219 driver"
* tag 'regulator-fix-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: tps65219: fix irq_data.rdev not being assigned
regulator: Kconfig: fix a typo in help
Linus Torvalds [Fri, 22 May 2026 19:28:47 +0000 (12:28 -0700)]
Merge tag 'gpio-fixes-for-v7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- propagate the error code from regulator_enable() in resume path in
gpio-pca953x
- take the device lock when calling device_is_bound() in virtual GPIO
drivers
- fix software node leak in remove path in gpio-aggregator
- fix a potential use-after-free in gpio-aggregator
- harden the GPIO character device uAPI: check that line config
attributes are correctly zeroed
* tag 'gpio-fixes-for-v7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: virtuser: lock device when calling device_is_bound()
gpio: aggregator: lock device when calling device_is_bound()
gpio: sim: lock device when calling device_is_bound()
gpio: aggregator: remove the software node when deactivating the aggregator
gpio: aggregator: fix a potential use-after-free
gpio: cdev: check if uAPI v2 config attributes are correctly zeroed
gpio: pca953x: propagate regulator_enable() error from resume
Linus Torvalds [Fri, 22 May 2026 19:22:22 +0000 (12:22 -0700)]
Merge tag 'sound-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"As expected, we still continue receiving lots of small fixes.
One major change is about HD-audio pending IRQ handling, but this
would influence only on odd machines or slow VMs. There are a few
other fixes for the core part, but most of them are not-too-serious
UAF fixes, while the rest are mostly device-specific fixes and quirks.
ALSA Core:
- Fix for PCM silencing with bogus iov_iter
- Fixes for past-the-end iterators in timer and seq
- Serialization of UMP output teardown
- Rate-limit ELD parsing errors
HD-audio:
- Fixes for IRQ work handling and SSID matching
- Various Realtek quirks for HP and ASUS laptops, including LED fixes
ASoC:
- Intel: ACPI match table updates for PTL, NVL, and ARL platforms
- Cirrus Logic: Fixes for cs-amp-lib and cs35l56 codecs
- Various platform fixes for AMD, FSL SAI, TI OMAP, and Qualcomm
- DT-binding fix for MediaTek
Others:
- USB ua101: Reject too-short USB descriptors
- Scarlett2: Fix for flash writes
- ASIHPI: Fix for potential OOB access
- AMD SPI: Fix for bus number in ACPI probe
MAINTAINERS:
- Updates for SOF and TI maintainers"
* tag 'sound-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (47 commits)
ASoC: codecs: pcm512x: fix null-ptr dereference in pcm512x_overclock_xxx_put()
ASoC: Intel: soc-acpi-intel-ptl-match: Remove unnecessary cs42l43 match
ASoC: soc-acpi-intel-ptl-match: Make Chrome matches conditional
ASoC: Intel: soc-acpi: Add entry for sof_es8336 in NVL match table.
ASoC: Intel: sof_sdw: Add support for nvlrvp in NVL platform
ASoC: cs-amp-lib: Fix typo in error message: write -> read
ASoC: cs-amp-lib: Fix missing dput() after debugfs_lookup()
ASoC: cs-amp-lib: Fix wrong sizeof() in _cs_amp_set_efi_calibration_data()
ASoC: cs35l56: Fix flushing of IRQ work in cs35l56_sdw_remove()
MAINTAINERS: ASoC: Intel/SOF: Remove Ranjani Sridharan as maintainer
ALSA: seq: Serialize UMP output teardown with event_input
ALSA: scarlett2: Allow flash writes ending at segment boundary
ALSA: hda/realtek: Add LED quirk for HP ProBook 430 G6
ALSA: hda/intel: Make sure to cancel irq-pending work at closing PCM stream
ALSA: hda: Move irq pending work into hda-intel stream
ASoC: soc-utils: Add missing va_end in snd_soc_ret()
ALSA: ua101: Reject too-short USB descriptors
ALSA: hda/realtek: Fix mute and mic-mute LEDs for HP 16 Piston OmniBook X
ALSA: seq: avoid past-the-end iterator in snd_seq_create_port()
ALSA: timer: avoid past-the-end iterator in snd_timer_dev_register()
...
Linus Torvalds [Fri, 22 May 2026 19:06:23 +0000 (12:06 -0700)]
Merge tag 'block-7.1-20260522' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Fix memory leak for peer-to-peer addresses
- Fix dma map leaks on resource errors
- Another bio integrity fix, fixing a recent regression
- Fix for an issue with the request pre-allocation and caching when IO
is queued, where if a bio split occurred and ended up blocking, the
list could be corrupted
* tag 'block-7.1-20260522' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
block: avoid use-after-free in disk_free_zone_resources()
blk-mq: pop cached request if it is usable
nvme-pci: fix dma mapping leak on data setup error
nvme-pci: fix dma_vecs leak on p2p memory
bio-integrity-fs: pass data iter to bio_integrity_verify()
Linus Torvalds [Fri, 22 May 2026 18:53:28 +0000 (11:53 -0700)]
Merge tag 'io_uring-7.1-20260522' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
- Fix for an issue with IORING_OP_NOP and using injection results
- Fix for an issue in IORING_OP_WAITID, where the info state was
assumed cleared by the lower level syscall handler, but for some
cases it is not. Just clear the data upfront, so that non-initialized
data isn't copied back to userspace
- Fix for a lockdep reported issue, where IORING_OP_BIND enters file
create and hence hits mnt_want_write(), which creates a three part
lockdep cycle between the super lock, io_uring's uring_lock, and the
cred mutex
- Fix a regression introduced in this cycle with how linked timeouts
are deleted
- Ensure that the ->opcode nospec indexing on the opcode issue side
covers all the cases
* tag 'io_uring-7.1-20260522' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/nop: pass all errors to userspace
io_uring/timeout: splice timed out link in timeout handler
io_uring: propagate array_index_nospec opcode into req->opcode
io_uring/waitid: clear waitid info before copying it to userspace
io_uring/net: punt IORING_OP_BIND async if it needs file create
Linus Torvalds [Fri, 22 May 2026 17:52:26 +0000 (10:52 -0700)]
Merge tag 'v7.1-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Fix missing lock
- Fix dentry in use after unmounting
- cifs.upcall security fix
- require CAP_NET_ADMIN for swn netlink
- change allocation in DUP_CTX_STR to GFP_KERNEL
- minor smbdirect debug fix
- handle_read_data() folio fix
* tag 'v7.1-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: change allocation requirements in DUP_CTX_STR macro
smb: client: require net admin for CIFS SWN netlink
smb: smbdirect: divide, not multiply, milliseconds by 1000
cifs: Fix busy dentry used after unmounting
smb: client: use data_len for SMB2 READ encrypted folioq copy
smb: client: reject userspace cifs.spnego descriptions
smb: client: protect tc_count increment in smb2_find_smb_sess_tcon_unlocked()
Linus Torvalds [Fri, 22 May 2026 14:13:13 +0000 (07:13 -0700)]
Merge tag 'pm-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix maximum frequency computation in the intel_pstate driver for
two processor models, update its documentation and fix issues related
to the dynamic EPP support (added during the current development
cycle) in the amd-pstate driver:
- Fix maximum frequency computation in the intel_pstate driver for
Raptor Lake-E and Bartlett Lake that are SMP platforms derived from
hybrid ones (Rafael Wysocki, Henry Tseng)
- Fix the description of asymmetric packing with SMT in the
intel_pstate driver documentation (Ricardo Neri)
- Fix multiple amd-pstate driver issues related to dynamic EPP
support added recently, including making it opt-in only (K Prateek
Nayak, Mario Limonciello)"
* tag 'pm-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq/amd-pstate: Drop Kconfig option for dynamic EPP
cpufreq: intel_pstate: Use HYBRID_SCALING_FACTOR_ADL for Bartlett Lake
cpufreq: intel_pstate: Use correct scaling factor on Raptor Lake-E
Documentation: intel_pstate: Fix description of asymmetric packing with SMT
cpufreq/amd-pstate-ut: Drop policy reference before driver switch
cpufreq/amd-pstate: Use "epp_default_dc" as default when dynamic_epp is disabled
cpufreq/amd-pstate: Reorder notifier unregistration and floor perf reset
cpufreq/amd-pstate: Allow writes to dynamic_epp when state isn't modified
cpufreq/amd-pstate: Return -ENOMEM on failure to allocate profile_name
cpufreq/amd-pstate: Grab "amd_pstate_driver_lock" when toggling dynamic_epp
Linus Torvalds [Fri, 22 May 2026 14:06:21 +0000 (07:06 -0700)]
Merge tag 'acpi-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI support fix from Rafael Wysocki:
"Unbreak system wakeup on critical battery status in the ACPI battery
driver inadvertently broken during the 7.0 development cycle"
* tag 'acpi-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: battery: Fix system wakeup on critical battery status
Damien Le Moal [Fri, 22 May 2026 11:56:22 +0000 (20:56 +0900)]
block: avoid use-after-free in disk_free_zone_resources()
The function disk_update_zone_resources() may call
disk_free_zone_resources() in case of error, and following this,
blk_revalidate_disk_zones() will again calls disk_free_zone_resources() if
disk_update_zone_resources() failed. If a zone worker thread is being used
(which is the default for a rotational media zoned device),
disk_free_zone_resources() will try to stop the zone worker thread twice
because disk->zone_wplugs_worker is not reset to NULL when the worker
thread is stopped the first time.
In disk_free_zone_resources(), fix this by correctly clearing
disk->zone_wplugs_worker to NULL when the worker thread is stopped.
And while at it, since disk_free_zone_resources() is always called after a
failed call to disk_update_zone_resources(), remove the unnecessary call
to disk_free_zone_resources() in disk_update_zone_resources().
Fixes: 1365b6904fd0 ("block: allow submitting all zone writes from a single context") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260522115622.588535-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 22 May 2026 13:53:11 +0000 (06:53 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Handle probe on hinted conditional branch instructions.
BC.cond instructions can be simulated in the same way as B.cond
instructions, so extend the decode mask for B.cond to cover BC.cond
- Flush the walk cache when unsharing PMD tables. Recent changes to
huge_pmd_unshare() introduced mmu_gather::unshared_tables but the
arm64 code was still treating the TLB flushing as only targeting leaf
entries (TLBI VALE1IS).
Fix it by using non-leaf-only instructions (TLBI VAE1IS) when
tlb->unshared_tables is set
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: tlb: Flush walk cache when unsharing PMD tables
arm64: probes: Handle probes on hinted conditional branch instructions
Linus Torvalds [Fri, 22 May 2026 13:40:31 +0000 (06:40 -0700)]
Merge tag 's390-7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:
- Fix PAI NNPA mismatch between counting and recording, where sampling
reports twice the value
- Fix loss of PAI counter increments during recording on systems with
many CPUs under heavy load, while counting is not affected
- On some supported machines, CHSC cannot access memory outside the DMA
zone, causing CHSC command failures. Restore GFP_DMA flag when
allocating memory for CHSC control blocks
- Align the numbering scheme for higher-level topology structures like
socket, book, drawer with other hardware identifiers e.g. in sysfs,
procfs and tools like lscpu
* tag 's390-7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/topology: Use zero-based numbering for containing entities
s390/cio: Restore GFP_DMA for CHSC allocation
s390/pai: Fix missing PAI counter increments under heavy load
s390/pai: Disable duplicate read of kernel PAI counter value
Linus Torvalds [Fri, 22 May 2026 13:23:56 +0000 (06:23 -0700)]
Merge tag 'slab-for-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fix from Vlastimil Babka:
- Stable fix for a missing cpus_read_lock in one of the cpu sheaves
flushing paths (Qing Wang)
* tag 'slab-for-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slub: hold cpus_read_lock around flush_rcu_sheaves_on_cache()
Linus Torvalds [Fri, 22 May 2026 13:16:00 +0000 (06:16 -0700)]
Merge tag 'dma-mapping-7.1-2026-05-22' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
"Two minor updates for the DMA-mapping code, mainly fixing some rare
corner cases (Petr Tesarik, Jianpeng Chang)"
* tag 'dma-mapping-7.1-2026-05-22' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-mapping: move dma_map_resource() sanity check into debug code
dma-direct: fix use of max_pfn