misc: fastrpc: fix use-after-free of fastrpc_user in workqueue context
There is a race between fastrpc_device_release() and the workqueue
that processes DSP responses. When the user closes the file descriptor,
fastrpc_device_release() frees the fastrpc_user structure. Concurrently,
an in-flight DSP invocation can complete and fastrpc_rpmsg_callback()
schedules context cleanup via schedule_work(&ctx->put_work). If the
workqueue runs fastrpc_context_free() in parallel with or after
fastrpc_device_release() has freed the user structure, it dereferences
the freed fastrpc_user. Depending on the state of the context at the
time of the race, any one of the following accesses can be hit:
1. fastrpc_buf_free() calls fastrpc_ipa_to_dma_addr(buf->fl->cctx, ...)
to strip the SID bits from the stored IOVA before passing the
physical address to dma_free_coherent().
2. fastrpc_free_map() reads map->fl->cctx->vmperms[0].vmid to
reconstruct the source permission bitmask needed for the
qcom_scm_assign_mem() call that returns memory from the DSP VM
back to HLOS.
3. fastrpc_free_map() acquires map->fl->lock to safely remove the
map node from the fl->maps list.
The resulting use-after-free manifests as:
pc : fastrpc_buf_free+0x38/0x80 [fastrpc]
lr : fastrpc_context_free+0xa8/0x1b0 [fastrpc]
fastrpc_context_free+0xa8/0x1b0 [fastrpc]
fastrpc_context_put_wq+0x78/0xa0 [fastrpc]
process_one_work+0x180/0x450
worker_thread+0x26c/0x388
Add kref-based reference counting to fastrpc_user. Have each invoke
context take a reference on the user at allocation time and release it
when the context is freed. Release the initial reference in
fastrpc_device_release() at file close. Move the teardown of the user
structure — freeing pending contexts, maps, mmaps, and the channel
context reference — into the kref release callback fastrpc_user_free(),
so that it runs only when the last reference is dropped, regardless of
whether that happens at device close or after the final in-flight
context completes.
Fixes: 6cffd79504ce ("misc: fastrpc: Add support for dmabuf exporter")
Cc: stable@kernel.org
Signed-off-by: Anandu Krishnan E <anandu.e@oss.qualcomm.com>
Signed-off-by: Srinivas Kandagatla <srini@kernel.org>
Link: https://patch.msgid.link/20260530204528.116920-2-srini@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
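The lifetime rule described in the fastrpc fix can be illustrated with a minimal user-space sketch. It is not the driver code: struct user, struct ctx and every function name below are hypothetical stand-ins, and a plain int replaces the kernel's struct kref (kref_init(), kref_get(), kref_put()), so the sketch is single-threaded only.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct fastrpc_user; not the driver's layout. */
struct user {
	int refcount;		/* plain int: this sketch is single-threaded */
	int *freed_flag;	/* set to 1 by the release step, for observation */
};

/* Hypothetical stand-in for an invoke context that points back at its user. */
struct ctx {
	struct user *owner;
};

static struct user *user_alloc(int *freed_flag)
{
	struct user *u = malloc(sizeof(*u));

	if (!u)
		return NULL;
	u->refcount = 1;	/* initial reference, owned by the open file */
	u->freed_flag = freed_flag;
	return u;
}

static void user_get(struct user *u)
{
	assert(u->refcount > 0);
	u->refcount++;
}

/* Drop one reference; tear the user down only when the last one is gone. */
static void user_put(struct user *u)
{
	assert(u->refcount > 0);
	if (--u->refcount == 0) {
		*u->freed_flag = 1;
		free(u);
	}
}

/* Each context pins its user for as long as the context exists. */
static struct ctx *ctx_alloc(struct user *u)
{
	struct ctx *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	user_get(u);
	c->owner = u;
	return c;
}

/* Freeing the context may be the point where the user is finally released. */
static void ctx_free(struct ctx *c)
{
	struct user *u = c->owner;

	free(c);
	user_put(u);
}
```

With this shape, the "file close" step is just user_put() on the initial reference. If a context is still in flight, the user survives until ctx_free() drops the last reference, whichever order the two happen in.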
DaeMyung Kang [Sat, 30 May 2026 14:35:12 +0000 (23:35 +0900)]
ntfs: validate resident volume name values on lookup
The shared lookup-time attribute validator now has a safe caller path for
$VOLUME_NAME corruption: ntfs_write_volume_label() no longer treats
lookup errors as an absent label, and the mount path reinitializes its
search context before continuing to $VOLUME_INFORMATION.
Add $VOLUME_NAME-specific resident value validation. A volume name is
stored as a UTF-16LE string, so reject odd byte lengths, and reject
values longer than the NTFS volume label limit. Empty labels remain
valid.
Also reject non-resident $VOLUME_NAME records. $VOLUME_NAME is required
to be resident, like $FILE_NAME; a crafted non-resident record would
otherwise pass lookup and ntfs_write_volume_label() would remove it as if
it were a normal resident attribute.
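The $VOLUME_NAME rules above can be sketched as a small predicate. This is an illustration only: the function name is hypothetical, and SKETCH_MAX_LABEL_UNITS is a placeholder, since the commit text refers to "the NTFS volume label limit" without stating its value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed label limit in UTF-16 code units. The value 128 is a
 * placeholder for this sketch, not taken from the patch.
 */
#define SKETCH_MAX_LABEL_UNITS 128u

/* Returns true when a $VOLUME_NAME value passes the checks listed above. */
static bool volume_name_value_ok(bool resident, uint32_t value_len_bytes)
{
	/* $VOLUME_NAME must be resident. */
	if (!resident)
		return false;
	/* UTF-16LE uses two bytes per code unit, so odd lengths are corrupt. */
	if (value_len_bytes % sizeof(uint16_t))
		return false;
	/* Reject labels longer than the (assumed) limit. */
	if (value_len_bytes / sizeof(uint16_t) > SKETCH_MAX_LABEL_UNITS)
		return false;
	/* A zero-length value is an empty label and remains valid. */
	return true;
}
```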
DaeMyung Kang [Sat, 30 May 2026 14:35:11 +0000 (23:35 +0900)]
ntfs: reinit search context before volume information lookup
On mount the volume inode is searched for $VOLUME_NAME and then, reusing
the same search context, for $VOLUME_INFORMATION. The $VOLUME_NAME lookup
is optional and its result is otherwise ignored.
Once lookup-time validation can reject a corrupt $VOLUME_NAME with -EIO,
the search context is left in an undefined state: ntfs_attr_find()
documents that on an actual error @ctx->attr is undefined. Continuing the
$VOLUME_INFORMATION search from that context is not contractually valid.
Reinitialize the search context before the $VOLUME_INFORMATION lookup so
it always starts from a well-defined state regardless of the
$VOLUME_NAME lookup outcome.
DaeMyung Kang [Sat, 30 May 2026 14:35:10 +0000 (23:35 +0900)]
ntfs: do not replace volume name after lookup errors
ntfs_write_volume_label() removes an existing $VOLUME_NAME attribute and
then adds the replacement. The old code only distinguished lookup success
from all other results, so any lookup error was treated like an absent
label and the add path still ran.
That is unsafe once lookup-time validation rejects corrupt $VOLUME_NAME
records with -EIO: the corrupt record would remain in place and a second
$VOLUME_NAME record could be appended next to it.
Only add the replacement after the old label was removed successfully or
after lookup returned -ENOENT. Propagate all other lookup errors, and
also stop if removing the old attribute fails.
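The control flow described above can be sketched as follows. struct fake_volume and the fake_* helpers are hypothetical test doubles, not driver functions; they only model "lookup", "remove" and "add" results so the error handling can be exercised in user space.

```c
#include <errno.h>

/* Hypothetical model of a volume: canned results plus a label count. */
struct fake_volume {
	int lookup_result;	/* 0 = found, -ENOENT = absent, other = error */
	int remove_result;	/* 0 = removed, negative = failure */
	int labels;		/* number of $VOLUME_NAME records present */
};

static int fake_lookup(struct fake_volume *v)
{
	return v->lookup_result;
}

static int fake_remove(struct fake_volume *v)
{
	if (v->remove_result)
		return v->remove_result;
	v->labels--;
	return 0;
}

static int fake_add(struct fake_volume *v)
{
	v->labels++;
	return 0;
}

/* Add the replacement only after a clean removal or a genuine -ENOENT. */
static int replace_label(struct fake_volume *v)
{
	int err = fake_lookup(v);

	if (err == 0) {
		err = fake_remove(v);
		if (err)
			return err;	/* old label still present: stop */
	} else if (err != -ENOENT) {
		return err;		/* e.g. -EIO from lookup-time validation */
	}
	return fake_add(v);
}
```

In this model a lookup failure such as -EIO leaves the label count untouched, so a second record is never appended next to a corrupt one.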
DaeMyung Kang [Sat, 30 May 2026 14:35:09 +0000 (23:35 +0900)]
ntfs: validate attribute values on lookup
ntfs_attr_find() and ntfs_external_attr_find() check that generic
resident attribute values fit in their attribute records and that
fixed-size resident values are large enough. For variable-length resident
formats, however, the fixed part is not enough: embedded length fields
can still point callers past the resident value.
A crafted image can set a small resident $FILE_NAME value_length while
leaving file_name_length large. Callers then trust file_name_length and
read past the resident value when converting or comparing the name. This
was reproduced with a crafted image under KASAN as a slab-out-of-bounds
read from the kmalloc-1k MFT record copy. The stack included
ntfs_lookup(), ntfs_iget(), ntfs_read_locked_inode(), ntfs_attr_name_get(),
ntfs_ucstonls(), and utf16s_to_utf8s().
Add a shared attribute value validator and use it before a lookup path
can return an attribute, including the AT_UNUSED enumeration case where
callers inspect returned attributes directly. The helper validates
resident value bounds, minimum resident value sizes, variable-length
$FILE_NAME fields, and non-resident mapping-pairs metadata that was
previously checked separately in both lookup paths.
This also preserves the intended resident @val matching semantics in the
external attribute lookup path. The old duplicated validation block
overwrote the actual resident value length with the type-specific minimum
length before comparing @val, so variable-length resident values could
fail to match even when the bytes were identical. Keep the comparison on
the actual value length, and make ntfs_attrlist_entry_add() compare
resident attributes with lowest_vcn zero instead of reading the
non-resident union member after a successful resident match.
Reject non-resident $FILE_NAME records too: the format requires
$FILE_NAME to be resident and callers treat returned records as resident.
Cc: stable@vger.kernel.org # v7.1
Fixes: 6ceb4cc81ef3 ("ntfs: add bound checking to ntfs_attr_find")
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
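The variable-length $FILE_NAME check can be sketched as below. The structure is a simplified, hypothetical layout (an opaque fixed part followed by the name) and the function name is invented for illustration; the point is only that the embedded file_name_length must be checked against the resident value_length before any caller trusts it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical layout: opaque fixed fields, then the name. */
struct sketch_file_name {
	uint8_t fixed[64];		/* placeholder for the fixed-size fields */
	uint8_t file_name_length;	/* name length in UTF-16 code units */
	uint8_t file_name_type;
	uint16_t file_name[];		/* name follows the fixed part */
};

/* True when the name described by file_name_length fits in value_length. */
static bool file_name_value_ok(const struct sketch_file_name *fn,
			       uint32_t value_length)
{
	size_t fixed = offsetof(struct sketch_file_name, file_name);

	/* The fixed part must fit before file_name_length may be read. */
	if (value_length < fixed)
		return false;
	/* The embedded length must not point past the resident value. */
	return (size_t)fn->file_name_length * sizeof(uint16_t) <=
	       value_length - fixed;
}
```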
The NTFS mapping-pairs parser accumulates relative LCN deltas in a
signed integer. A corrupted attribute can drive that addition past
the representable range.
One corrupt runlist shape sets the accumulated LCN to S64_MAX and
then adds a delta of 1 in the next mapping-pairs entry.
Signed overflow is undefined and can turn an invalid runlist into a
different set of physical clusters.
Check the LCN addition for overflow before storing the next run.
Cc: stable@vger.kernel.org # v7.1
Assisted-by: Codex:gpt-5.5-cyber-preview
Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
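A portable user-space sketch of the overflow check is shown below. The function name is hypothetical and the patch may well use a different form (the kernel provides check_add_overflow() in <linux/overflow.h>); the sketch only shows how the S64_MAX + 1 case from the description is rejected before the addition is performed.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Add a relative delta to the accumulated LCN.
 * Returns false, leaving *out untouched, if the sum is not representable.
 */
static bool lcn_add_checked(int64_t lcn, int64_t delta, int64_t *out)
{
	/* Test the bounds first so the signed addition itself never overflows. */
	if (delta > 0 && lcn > INT64_MAX - delta)
		return false;
	if (delta < 0 && lcn < INT64_MIN - delta)
		return false;
	*out = lcn + delta;
	return true;
}
```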
Marco Crivellari [Thu, 14 May 2026 13:54:08 +0000 (15:54 +0200)]
ntfs: Add WQ_PERCPU to alloc_workqueue users
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.
In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.
Hyunchul Lee [Tue, 2 Jun 2026 04:53:24 +0000 (13:53 +0900)]
ntfs: serialize volume label accesses
Protect vol->volume_label with a mutex and snapshot the label before
copy_to_user(). This prevents a use-after-free when FS_IOC_SETFSLABEL
replaces vol->volume_label while FS_IOC_GETFSLABEL reads it
concurrently.
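The "lock, snapshot, then copy out" pattern can be sketched in user space with a pthread mutex. All names are hypothetical, and the caller-owned buffer stands in for the snapshot that would later be handed to copy_to_user(); this is not the driver's code.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the volume: a lock and a heap-allocated label. */
struct sketch_volume {
	pthread_mutex_t label_lock;
	char *label;		/* replaced wholesale by the setter */
};

/* Setter: swap in the new label under the lock, free the old one after. */
static int sketch_set_label(struct sketch_volume *vol, const char *new_label)
{
	size_t len = strlen(new_label) + 1;
	char *copy = malloc(len);
	char *old;

	if (!copy)
		return -1;
	memcpy(copy, new_label, len);

	pthread_mutex_lock(&vol->label_lock);
	old = vol->label;
	vol->label = copy;
	pthread_mutex_unlock(&vol->label_lock);

	free(old);
	return 0;
}

/* Getter: copy the label into a caller-owned buffer while holding the lock. */
static void sketch_get_label(struct sketch_volume *vol, char *out, size_t size)
{
	pthread_mutex_lock(&vol->label_lock);
	snprintf(out, size, "%s", vol->label ? vol->label : "");
	pthread_mutex_unlock(&vol->label_lock);
	/* From here on the caller uses only its private snapshot. */
}
```

Because the reader never touches vol->label outside the lock, a concurrent setter can free the old string without invalidating anything the reader still holds.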
The two bounds checks validating that mapping pair data bytes fit within
the attribute use strict greater-than (>), which allows a one-byte
out-of-bounds read when the data extends exactly to attr_end:
	b = *buf & 0xf;
	if (b) {
		if (unlikely(buf + b > attr_end))	// off-by-one
			goto io_error;
		for (deltaxcn = (s8)buf[b--]; b; b--)
			deltaxcn = (deltaxcn << 8) + buf[b];
	}
When buf + b == attr_end, the check evaluates to false and buf[b] reads
one byte past the valid attribute boundary. The same pattern appears in
the LCN delta bytes check.
Fix both checks to use >= so that buf[b] at exactly attr_end is
correctly rejected as out of bounds.
Cc: stable@vger.kernel.org # v7.1
Signed-off-by: Ron de Bruijn <rmbruijn@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
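The corrected bound can be exercised with a stand-alone sketch of the delta decoder. decode_delta() is a hypothetical name and the function is simplified (it caps the length at 8 bytes and treats a zero length as delta 0, which the real parser may handle differently); the bound is written as a length comparison, which is equivalent to buf + b >= attr_end without forming an out-of-range pointer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode the signed little-endian delta that follows the header byte at
 * buf. attr_end points one past the last valid byte of the attribute.
 */
static bool decode_delta(const uint8_t *buf, const uint8_t *attr_end,
			 int64_t *delta)
{
	unsigned int b;
	int64_t v;

	if (buf >= attr_end)
		return false;
	b = *buf & 0xf;
	if (b > 8)
		return false;	/* sketch limit: keep the value within 64 bits */
	if (!b) {
		*delta = 0;	/* simplification for this sketch */
		return true;
	}
	/* buf[b] is the last byte read, so it must lie below attr_end. */
	if (b >= (size_t)(attr_end - buf))
		return false;
	v = (int8_t)buf[b];		/* most significant byte carries the sign */
	for (b--; b; b--)
		v = v * 256 + buf[b];	/* multiply instead of shifting a signed value */
	*delta = v;
	return true;
}
```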
Hyunchul Lee [Thu, 28 May 2026 02:15:35 +0000 (11:15 +0900)]
ntfs: not change 0-byte $DATA attribute to non-resident
When ntfs_resident_attr_resize() cannot grow a resident attribute in
place, it retries after converting other resident attributes to
non-resident to free space in the MFT record.
Do not select zero-length resident $DATA attributes for this conversion.
fsck treats 0-byte non-resident $DATA attributes as corruption.
Zhao Zhang [Tue, 2 Jun 2026 08:43:33 +0000 (16:43 +0800)]
bpf: Reject fragmented frames in devmap
Devmap broadcast redirects clone the packet for all but the last
destination.
For native XDP, that clone path copies only the linear xdp_frame data,
while fragmented frames keep skb_shared_info in tailroom outside the
linear area. Cloning such a frame leaves XDP_FLAGS_HAS_FRAGS set but
without valid frag metadata, and the later free path can interpret
uninitialized tail data as skb_shared_info, leading to an out-of-bounds
access during frame return.
Reject fragmented native XDP frames in dev_map_enqueue_clone().
Add the same restriction to the generic XDP clone path in
dev_map_redirect_clone(). Generic XDP represents fragmented packets as
nonlinear skbs, and rejecting them here keeps clone-based broadcast
support aligned between native and generic XDP.
DaeMyung Kang [Sun, 24 May 2026 05:42:37 +0000 (14:42 +0900)]
ntfs: free link name from ntfs_name_cache
ntfs_link() converts the new link name with ntfs_nlstoucs() using
NTFS_MAX_NAME_LEN. In this case ntfs_nlstoucs() allocates the result
from ntfs_name_cache, and its contract requires callers to release the
buffer with kmem_cache_free(ntfs_name_cache, ...).
All other ntfs_nlstoucs() callers in namei.c do that, but ntfs_link()
uses kfree(), which mismatches the allocator for successfully converted
names.
The conversion failure path reaches the common out label with uname ==
NULL. That was harmless for kfree(), but kmem_cache_free() does not
provide the same NULL contract. Return directly on conversion failure
and free successful conversions with ntfs_name_cache.
DaeMyung Kang [Sun, 17 May 2026 03:44:47 +0000 (12:44 +0900)]
ntfs: remove unsupported quota handling
The ntfs driver does not implement quota accounting. It creates
new inodes with the NTFS 1.2 $STANDARD_INFORMATION layout and does
not maintain the NTFS 3.x owner_id/quota_charged fields or the
$Quota usage records that Windows would need for meaningful quota
accounting.
The only runtime quota path left in the driver is the remount-rw
code that tries to mark $Quota/$Q out of date, plus the mount-time
code that loads $Quota and its $Q index solely to support that
marker.
Since the driver does not maintain the per-file quota metadata,
setting QUOTA_FLAG_OUT_OF_DATE does not make the quota state
meaningful, and failures in this unsupported path can unnecessarily
block remount-rw or force a mount read-only.
Remove the quota marker, the $Quota/$Q loading state, and the
unused quota volume flag. Keep the on-disk quota layout definitions
in layout.h so the documented NTFS structures remain available.
Hyunchul Lee [Sat, 23 May 2026 04:14:23 +0000 (13:14 +0900)]
ntfs: add bounds check before accessing EA entries
In ntfs_ea_lookup() and ntfs_listxattr(), verify that there is enough
space in the EA entry before accessing its next_entry_offset field.
Hyunchul Lee [Sat, 23 May 2026 04:14:22 +0000 (13:14 +0900)]
ntfs: validate index entries on reading
Validate index entries immediately after reading an index root or index
block from disk. This eliminates repeated checks in lookup and readdir,
and reduces the risk of missing checks in those paths.
Hyunchul Lee [Sat, 23 May 2026 04:14:21 +0000 (13:14 +0900)]
ntfs: centalize $INDEX_ROOT header validation
Add a dedicated helper to perform stricter validation of $INDEX_ROOT and
use it for both directory inodes and named index inodes. This keeps the
root size and header geometry checks consistent across both read paths.
Hyunchul Lee [Sat, 23 May 2026 04:14:20 +0000 (13:14 +0900)]
ntfs: validate index block header more strictly
Modify ntfs_index_block_inconsisent() to perform stricter validation of
INDEX_HEADER geometry in INDX blocks, and update
ntfs_lookup_inode_by_name() to use that function to validate INDX
blocks.
Bjorn Andersson [Sat, 30 May 2026 20:44:21 +0000 (21:44 +0100)]
slimbus: qcom-ngd-ctrl: Avoid ABBA on tx_lock/ctrl->lock
During the SSR/PDR down notification the tx_lock is taken with the
intent to provide synchronization with active DMA transfers.
But during this period qcom_slim_ngd_down() is invoked, which ends up in
slim_report_absent(), which takes the slim_controller lock. In multiple
other codepaths these two locks are taken in the opposite order (i.e.
slim_controller then tx_lock).
The result is a lockdep splat, and a possible deadlock:
rprocctl/449 is trying to acquire lock: ffff00009793e620 (&ctrl->lock){+.+.}-{4:4}, at: slim_report_absent (drivers/slimbus/core.c:322) slimbus
but task is already holding lock: ffff00009793fb50 (&ctrl->tx_lock){+.+.}-{4:4}, at: qcom_slim_ngd_ssr_pdr_notify (drivers/slimbus/qcom-ngd-ctrl.c:1475) slim_qcom_ngd_ctrl
The assumption is that the comment refers to the desire to not call
qcom_slim_ngd_exit_dma() while we have an ongoing DMA TX transaction.
But any such transaction is initiated and completed within a single
qcom_slim_ngd_xfer_msg().
Prior to calling qcom_slim_ngd_exit_dma() the slim_controller is torn
down, all child devices are notified that the slimbus is gone and the
child devices are removed.
Stop taking the tx_lock in qcom_slim_ngd_ssr_pdr_notify() to avoid the
deadlock.
Bjorn Andersson [Sat, 30 May 2026 20:44:19 +0000 (21:44 +0100)]
slimbus: qcom-ngd-ctrl: Initialize controller resources in controller
The work structs and work queue are controller resources, create and
destroy them in the controller context. Creating them as part of the
child device's probe path seems to be okay now that the controller's
probe has been updated, but if for some reason the child does not probe
successfully a SSR or PDR notification will schedule_work() on an
uninitialized "ngd_up_work".
Move the initialization of these controller resources to the controller
probe function to avoid any issues, and to clarify the ownership.
Bjorn Andersson [Sat, 30 May 2026 20:44:18 +0000 (21:44 +0100)]
slimbus: qcom-ngd-ctrl: Register callbacks after creating the ngd
When the remoteproc starts in parallel with the NGD driver being probed,
or the remoteproc is already up when the PDR lookup is being registered,
or in the theoretical event that we get an interrupt from the hardware,
these callbacks will operate on uninitialized data. This results in
problems booting the affected boards.
One such example can be seen in the following fault, where
qcom_slim_ngd_ssr_pdr_notify() schedules work on the NULL ngd_up_work.
qcom_slim_ngd_ctrl_probe() first registers the SSR callback then
allocates the PDR context, as such the error path needs to come in
opposite order to allow us to unroll each step.
Bjorn Andersson [Sat, 30 May 2026 20:44:15 +0000 (21:44 +0100)]
slimbus: qcom-ngd-ctrl: Fix up platform_driver registration
Device drivers should not invoke platform_driver_register()/unregister()
in their probe and remove paths. They should further not rely on
platform_driver_unregister() as their only means of "deleting" their
child devices.
Introduce a helper to unregister the child device and move the
platform_driver_register()/unregister() to module_init()/exit().
Platform devices created with platform_device_alloc() call
platform_device_release() when the last reference to the device's
kobject is dropped. This function calls of_node_put() unconditionally.
This works fine for devices created with platform_device_register_full()
but users of the split approach (platform_device_alloc() +
platform_device_add()) must bump the reference of the of_node they
assign manually. Add the missing call to of_node_get().
DaeMyung Kang [Fri, 22 May 2026 14:20:48 +0000 (23:20 +0900)]
ntfs: avoid heap allocation for free-cluster readahead state
get_nr_free_clusters() allocates a temporary file_ra_state before it
publishes the precomputed free cluster count, sets NVolFreeClusterKnown(),
and wakes vol->free_waitq. If that allocation fails, the worker returns
without setting the flag or waking waiters, so callers waiting for the free
count can block indefinitely.
The readahead state is only used synchronously while scanning the bitmap.
Keep it on the stack and pass it by address to the readahead helper. This
eliminates the early allocation failure path instead of adding a special
case that publishes a conservative count and wakes the waitqueue.
Zero-initialize the on-stack state because file_ra_state_init() only sets
ra_pages and prev_pos.
Apply the same treatment to __get_nr_free_mft_records(), which scans the
MFT bitmap with the same short-lived readahead state.
Hyunchul Lee [Thu, 21 May 2026 05:37:03 +0000 (14:37 +0900)]
ntfs: skip extent mft records in writeback to prevent deadlock
Fix the ABBA deadlock between extent_lock and extent mrec_lock, triggered
by xfstests generic/113, that has occurred since commit 6994acf33bae
("ntfs: use base mft_no when looking up base inode for extent record").
Path B (MFT folio writeback):
VFS writeback of $MFT dirty folios
-> ntfs_mft_writepages()
-> ntfs_write_mft_block()
-> ntfs_may_write_mft_record()
-> holds one extent mrec_lock from a previous iteration
-> tries to acquire another base inode extent_lock
Removing all extent_lock and extent mrec_lock acquisition from the MFT
folio writeback path eliminates the ABBA lock ordering.
Path B is always redundant for extent records because:
1. mark_mft_record_dirty(ext_ni) does NOT dirty the MFT folio.
It only sets NInoDirty(ext_ni) and marks the base VFS inode dirty
via __mark_inode_dirty(I_DIRTY_DATASYNC), which triggers Path A.
Therefore, normal extent modifications never create a situation where
the MFT folio is dirty and Path B is not scheduled.
2. The MFT folio only gets dirtied via ntfs_mft_mark_dirty() inside
ntfs_mft_record_alloc(). But all identified callers in attrib.c
(ntfs_attr_add, ntfs_attr_record_move_away,
ntfs_attr_make_non_resident, ntfs_attr_record_resize) follow through
with mark_mft_record_dirty(), which triggers Path A to write the
complete record.
3. ntfs_evict_big_inode() calls ntfs_commit_inode() before freeing extent
inodes, ensuring all dirty extents are flushed via Path A before the
base inode leaves the icache.
DaeMyung Kang [Thu, 21 May 2026 10:17:51 +0000 (19:17 +0900)]
ntfs: only alias volume $UpCase to default on exact match
load_and_init_upcase() currently aliases vol->upcase to the global
default upcase whenever the shared prefix matches, and then truncates
vol->upcase_len to that shorter prefix. The result is correct only by
accident: upcase[] accesses in name collation are gated by upcase_len,
so the prefix-equality alias produces the same fold output as keeping
the volume's own shorter table.
Still, prefix equality is not equality: the volume table is logically
distinct from the default and should not be replaced by it unless they
are byte-for-byte identical. Use memcmp() to compare the complete table
in one expression and drop the now-redundant upcase_len rewrite.
No user-visible change is expected for compliant volumes whose $UpCase
has exactly default_upcase_len entries; shorter volume tables are no
longer aliased to the default.
Cc: stable@vger.kernel.org # v7.1
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
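The difference between prefix equality and full equality can be shown with a short sketch. The function name and parameters are hypothetical; the comparison itself follows the description: alias to the default only when both the length and every byte match.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * True only when the volume's upcase table is identical to the default:
 * same number of entries and byte-for-byte equal contents.
 */
static bool upcase_matches_default(const uint16_t *vol, size_t vol_len,
				   const uint16_t *def, size_t def_len)
{
	/* A shorter table that merely shares a prefix is not a match. */
	if (vol_len != def_len)
		return false;
	return memcmp(vol, def, def_len * sizeof(*def)) == 0;
}
```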
DaeMyung Kang [Thu, 21 May 2026 10:17:49 +0000 (19:17 +0900)]
ntfs: free volume-wide resources on fill_super failure
ntfs_fill_super()'s err_out_now path frees only the volume struct via
kfree(vol), leaving several vol-owned allocations behind on every mount
failure:
- vol->nls_map, loaded by ntfs_init_fs_context() via
load_nls_default() (or replaced by an explicit nls= option in
ntfs_parse_param()), is never unload_nls()'d.
- vol->volume_label, allocated by load_system_files() through
ntfs_ucstonls() once the $Volume name attribute has been parsed, is
not released by load_system_files()'s own error labels nor by the
fill_super() inline cleanup that only runs on d_make_root()
failure. Any later failure inside load_system_files() leaks it.
- vol->lcn_empty_bits_per_page was kvfree()'d in
unl_upcase_iput_tmp_ino_err_out_now without clearing the pointer,
so it could not be folded into a single common cleanup.
Because the failure paths never call ntfs_volume_free() and never reach
the d_make_root() inline cleanup block (it sits above the label and is
jumped over by the load_system_files() / kvmalloc failure gotos), these
resources accumulate per failed mount attempt with no chance of
recovery short of unloading the module. This is a silent leak: the
inodes loaded prior to failure remain hashed but generic_shutdown_super()
skips evict_inodes() when sb->s_root is unset, so no CHECK_DATA_CORRUPTION
warning is emitted either.
Move the per-volume frees down to err_out_now and drop the
lcn_empty_bits_per_page kvfree() from the upper label so the cleanup is
performed exactly once on every failure path. Using unconditional
kvfree() / kfree() / unload_nls() is safe because they all accept NULL
and the upper labels that previously freed nls_map (the d_make_root()
inline cleanup) already clear the pointer.
Cc: stable@vger.kernel.org # v7.1
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
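The "one common cleanup label with NULL-safe frees" shape can be sketched in user space. Everything here is a hypothetical model: struct sketch_vol, the allocation counter and the fail_step parameter exist only so that every failure path can be checked for leaks. For brevity the sketch also releases everything on success, which a real mount would not do.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the per-volume allocations. */
struct sketch_vol {
	void *nls_map;
	void *volume_label;
	void *empty_bits;
};

static int live_allocs;	/* outstanding allocations, for leak checking */

static void *sketch_alloc(size_t n)
{
	void *p = calloc(1, n);

	if (p)
		live_allocs++;
	return p;
}

/* NULL-safe, in the same spirit as kfree() and kvfree(). */
static void sketch_free(void *p)
{
	if (p)
		live_allocs--;
	free(p);
}

/* Allocate one resource, then optionally simulate a later failure. */
static int sketch_step(void **slot, int inject_failure)
{
	*slot = sketch_alloc(16);
	if (!*slot)
		return -ENOMEM;
	return inject_failure ? -EIO : 0;
}

static int sketch_fill_super(int fail_step)
{
	struct sketch_vol *vol = sketch_alloc(sizeof(*vol));
	int err;

	if (!vol)
		return -ENOMEM;

	err = sketch_step(&vol->nls_map, fail_step == 1);
	if (err)
		goto err_out_now;
	err = sketch_step(&vol->volume_label, fail_step == 2);
	if (err)
		goto err_out_now;
	err = sketch_step(&vol->empty_bits, fail_step == 3);

err_out_now:
	/* Single cleanup point: pointers never set are still NULL. */
	sketch_free(vol->empty_bits);
	sketch_free(vol->volume_label);
	sketch_free(vol->nls_map);
	sketch_free(vol);
	return err;
}
```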
nvmem: core: fix use-after-free bugs in error paths
Fix several instances of error paths in which we call
__nvmem_device_put() - which may end up freeing the underlying memory
and other resources - and then keep on using the nvmem structure. Always
put the reference to the nvmem device as the last step before returning
the error code.
Merge tag 'svc_fixes_for_v7.1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux into char-misc-linus
Dinh writes:
firmware: stratix10-svc and stratix10-rsu: fixes for v7.1
- Return -EOPNOTSUPP when ATF async is not supported
- Do not fail the SVC driver probe when asynchronous ops are not
supported by older ATF.
- Fix a NULL pointer dereference on a timeout in rsu_send_msg()
* tag 'svc_fixes_for_v7.1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux:
firmware: stratix10-rsu: Fix NULL deref on rsu_send_msg() timeout in probe
firmware: stratix10-svc: Don't fail probe when async ops unsupported
firmware: stratix10-svc: Return -EOPNOTSUPP when ATF async unsupported
Song Chen [Wed, 3 Jun 2026 09:19:10 +0000 (17:19 +0800)]
bpf: Reject registration of duplicated kfunc
Search btf_vmlinux and btf_modules for a duplicate kfunc before a
kernel module registers one.
If the new kfunc would shadow an existing kfunc, report it with
pr_err() and reject the module load.
Emil Tsalapatis [Thu, 4 Jun 2026 18:42:52 +0000 (14:42 -0400)]
MAINTAINERS: BPF: Add self as reviewer and run parse_maintainers.pl
Add myself as a reviewer for the BPF subsystem. While at it, run
./scripts/parse_maintainers.pl --order and reorder the BPF-related
entries in the file accordingly.
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260604184252.9917-1-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch series introduces BPF_MAP_TYPE_RHASH, a new hash map type that
leverages the kernel's rhashtable to provide a resizable hash map for BPF.
The existing BPF_MAP_TYPE_HASH uses a fixed number of buckets determined at
map creation time. While this works well for many use cases, it presents
challenges when:
1. The number of elements is unknown at creation time
2. The element count varies significantly during runtime
3. Memory efficiency is important (over-provisioning wastes memory,
under-provisioning hurts performance)
BPF_MAP_TYPE_RHASH addresses these issues by using rhashtable, which
automatically grows and shrinks based on load factor.
The implementation wraps the kernel's rhashtable with BPF map operations:
- Uses bpf_mem_alloc for RCU-safe memory management
- Supports all standard map operations (lookup, update, delete, get_next_key)
- Supports batch operations (lookup_batch, lookup_and_delete_batch)
- Supports BPF iterators for traversal
- Supports BPF_F_LOCK for spin locks in values
- Requires BPF_F_NO_PREALLOC flag (elements allocated on demand)
- In-place updates for improved performance.
- max_entries serves as a hard limit, not bucket count
- Uses bit_spin_lock() + local_irq_save() for bucket locking, similar to
the existing BPF hashmap's raw_spin_lock_irqsave(); insertions and
deletions may fail.
- Iteration is best-effort: if resizes, insertions, or deletions take place
concurrently, an iteration may visit the same element multiple times or
skip elements.
- Locks out insertions while running the special-fields destructor to
guarantee its completion.
The series includes comprehensive tests:
- Basic operations in test_maps (lookup, update, delete, get_next_key)
- BPF program tests for lookup/update/delete semantics
- Seq file tests
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
Update implementation
---------------------
The current implementation of BPF_MAP_TYPE_RHASH does not provide
the same strong guarantees on value consistency under concurrent
reads/writes as BPF_MAP_TYPE_HASH.
BPF_MAP_TYPE_HASH allocates a new element and atomically swaps the
pointer. BPF_MAP_TYPE_RHASH does memcpy in place with no lock held.
rhash trades consistency for speed: concurrent readers can observe
partially updated data. Two concurrent writers to the same key can
also interleave, producing mixed values. This is similar to the arraymap
update implementation, including the handling of special fields.
As a solution, users may use BPF_F_LOCK to guarantee consistent reads
and write serialization.
Pattern:
* Small map (1K): htab wins for 8 / 32 byte keys by 10-20 %
because the preallocated bucket array fits in L1. Equalises
at 256 byte keys.
* Large map (1M): rhtab wins everywhere, up to 4x at high load
factor with 8 byte keys.
* Higher load factor amplifies rhtab's lead: rhtab grows the
bucket array; htab stays at user-declared max.
2. FULL UPDATE (M events/sec per producer, -p 7)
htab per-producer:
20.33 22.02 19.27 23.61 24.18 23.17 21.07
mean 21.94 range 19.27 - 24.18
rhtab per-producer:
133.51 129.47 74.52 129.29 102.26 129.98 107.64
mean 115.24 range 74.52 - 133.51
speedup (mean): 5.25x (+425 %)
In-place memcpy avoids the per-update alloc + RCU pointer swap
that htab pays.
3. MEMORY (overwrite, -p 8, no --preallocated)
value_size | htab ops/s | rhtab ops/s | htab mem | rhtab mem
-----------+-------------+-------------+----------+----------
32 B | 122.87 k/s | 133.04 k/s | 2.47 MiB | 2.49 MiB
4096 B | 64.43 k/s | 65.38 k/s | 6.74 MiB | 6.44 MiB
rhtab/htab : +8 % ops, +0.8 % mem (32 B)
+1 % ops, -4 % mem (4096 B)
SUMMARY
* Small / well-fitting map: htab is faster (cache-friendly
fixed bucket array), but only by ~10-20 %.
* Large / high-load-factor map: rhtab is dramatically faster
(1.2x to 4x) because rhashtable resizes to keep the load
factor sane while htab stays stuck at user-declared max.
* Update-heavy workloads: rhtab is ~5x faster per producer
via in-place memcpy.
* Memory benchmark: effectively on par
---
Changes in v7:
- rhashtable_next_key: move into lib/rhashtable.c, drop params argument
(Herbert).
- rhashtable_next_key: kdoc clarifies that behavior on tables with
duplicate keys is undefined (sashiko).
- rhashtable: include Herbert's "Use irq work for shrinking" patch so
__rhashtable_remove_fast_one() can fire the shrink path from NMI
context (Herbert).
- hashtab: fix u32 multiply overflow in __rhtab_map_lookup_and_delete_batch
copy_to_user; cast total to size_t before multiplying by key_size /
value_size (sashiko, bot+bpf-ci).
- hashtab: allow kptr/refcount fields in rhtab values (same model as
array map).
- Link to v6: https://patch.msgid.link/20260602-rhash-v6-0-1bfd35a4184f@meta.com
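The u32 multiply overflow mentioned in the v7 notes can be reproduced with a tiny sketch. The function names are hypothetical; the changelog says the fix casts to size_t, while this sketch widens to uint64_t so that it behaves the same on 32-bit hosts.

```c
#include <stdint.h>

/* Buggy shape: both operands are 32-bit, so the product wraps mod 2^32. */
static uint32_t copy_bytes_u32(uint32_t total, uint32_t elem_size)
{
	return total * elem_size;
}

/* Fixed shape: widen one operand before multiplying. */
static uint64_t copy_bytes_wide(uint32_t total, uint32_t elem_size)
{
	return (uint64_t)total * elem_size;
}
```

With total = 2^20 and elem_size = 2^12 the true product is 2^32, which the 32-bit version truncates to 0.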
Changes in v6:
- rhashtable_next_key: advance past duplicate keys in the main bucket
chain to avoid an infinite loop when there are duplicate keys
(sashiko).
- rhashtable_next_key: return ERR_PTR(-EOPNOTSUPP) on rhltable (sashiko).
- rhashtable: selftest pre-sizes the table to avoid concurrent rehash
triggering spurious failures (sashiko).
- hashtab: real rhtab_map_mem_usage in the basic commit; move
bpf_map_free_internal_structs from rhtab_free_elem into the
special-fields commit where it does meaningful work (bot+bpf-ci).
- bpf_iter (seq_file): switch to rhashtable_walk_* for stronger
coverage under concurrent rehash; get_next_key and batch keep
rhashtable_next_key (sashiko).
- iter ops: rhtab_map_get_next_key adds IS_ERR check
before dereferencing the element pointer (sashiko).
- iter ops: bpf_each_rhash_elem removes cond_resched() (sashiko).
- iter ops: batch returns -EAGAIN (not -ENOENT) on cursor delete,
so userspace can distinguish lost cursor from end-of-iteration
and restart from NULL (sashiko).
- Link to v5: https://patch.msgid.link/20260528-rhash-v5-0-7205191b6c57@meta.com
Changes in v5:
- rhashtable_next_key: add kdoc WARNING to highlight lack of rehash
detection and unbounded iteration (Herbert).
- rhashtable: selftest now checks IS_ERR() before PTR_ERR comparison
on the missing-key path (bot+bpf-ci).
- hashtab: drop dead stub bodies and unused map_ops registrations
from the basic commit; iteration commit adds bodies, structs, and
registrations together. .map_get_next_key keeps a stub registration
in the basic commit because the syscall dispatcher does not
NULL-check it; iteration commit replaces the stub body with the
real implementation (bot+bpf-ci).
- hashtab: fix batch cursor advancement. v4 stashed the lookahead
element key but then resumed via next_key(cursor), skipping that
element across batch boundaries and orphaning it on
lookup_and_delete_batch. v5 stashes the lookahead key and looks
it up directly on the next batch entry (bot+bpf-ci, sashiko v3).
- hashtab: document torn-read race in rhtab_map_update_existing,
matching arraymap semantics (bot+bpf-ci).
- Link to v4: https://patch.msgid.link/20260513-rhash-v4-0-dd3d541ccb0b@meta.com
Changes in v4:
- rhashtable: introduce rhashtable_next_key(), drop walker-based
iteration for BPF (also drops earlier rhashtable_walk_enter_from()
proposal).
- map_extra: presize hint via lower 32 bits (nelem_hint), capped at
U16_MAX.
- Automatic shrinking enabled (was missing despite being advertised).
- Reject key_size > U16_MAX (rhashtable_params.key_len is u16).
- Replace irqs_disabled() guard with bpf_disable_instrumentation around
bucket-lock paths: closes same-CPU NMI tracing recursion without
rejecting legitimate IRQ-context callers.
- lookup_and_delete reordered: unlink before copy to avoid populating
user buffer on concurrent-unlink -ENOENT.
- update_existing reordered: copy then free_fields, matching arraymap.
- Word-sized key fast path (sizeof(long) bytes), inlined hashfn/cmpfn
via static-const rhashtable_params; works on both 32-bit and 64-bit.
- check_and_init_map_value() on insert (zero special-field bytes from
recycled bpf_mem_alloc memory; previously bpf_spin_lock could read
garbage and qspinlock would deadlock).
- BPF_SPIN_LOCK / BPF_RES_SPIN_LOCK allowlist moved to the special-
fields commit so each commit is bisect-safe.
- Link to v3: https://patch.msgid.link/20260424-rhash-v3-0-d0fa0ce4379b@meta.com
Changes in v3:
- Squash all commits implementing basic functions into one (Alexei)
- Remove selftests that were not necessary (Alexei)
- Resize detection for kernel full iterations, error out on resize (Alexei)
- Remove second lookup in get_next_key() (Emil)
- __acquires(RCU)/__releases(RCU) on seq_start/seq_stop (Emil)
- Use bpf_map_check_op_flags() where it makes sense (Leon)
- Benchmarks refresh, experiment with alternative hash functions
- Rely on iterator invalidation during rehash to handle table resizes:
fail on resize where we fully iterate over the table inside the kernel,
don't fail on resize where iteration goes through userspace. Exception:
rhtab_map_free_internal_structs() is safe to iterate fully in the
kernel, with no risk of an infinite loop, because no user holds a reference.
- Handle special fields during in-place updates (Emil, sashiko)
- Link to v2: https://lore.kernel.org/all/20260408-rhash-v2-0-3b3675da1f6e@meta.com/
Changes in v2:
- Added benchmarks
- Reworked all functions that walk the rhashtable, use walk API, instead
of directly accessing tbl and future_tbl
- Added rhashtable_walk_enter_from() into rhashtable to support O(1)
iteration continuations
- Link to v1: https://lore.kernel.org/r/20260205-rhash-v1-0-30dd6d63c462@meta.com
Pattern:
* Small map (1K): htab wins for 8 / 32 byte keys by 10-20%
* Large map (1M): rhtab wins everywhere, up to 4x at high load
factor with 8 byte keys.
* Higher load factor amplifies rhtab's lead: rhtab grows the
bucket array; htab stays at user-declared max.
2. FULL UPDATE (M events/sec per producer)
htab per-producer:
20.33 22.02 19.27 23.61 24.18 23.17 21.07
mean 21.94 range 19.27 - 24.18
rhtab per-producer:
133.51 129.47 74.52 129.29 102.26 129.98 107.64
mean 115.24 range 74.52 - 133.51
speedup (mean): 5.25x (+425 %)
In-place memcpy avoids the per-update alloc + RCU pointer swap
that htab pays.
3. MEMORY
value_size | htab ops/s | rhtab ops/s | htab mem | rhtab mem
-----------+-------------+-------------+----------+----------
32 B | 122.87 k/s | 133.04 k/s | 2.47 MiB | 2.49 MiB
4096 B | 64.43 k/s | 65.38 k/s | 6.74 MiB | 6.44 MiB
rhtab/htab : +8 % ops, +0.8 % mem (32 B)
+1 % ops, -4 % mem (4096 B)
Throughput effectively tied
SUMMARY
* Small / well-fitting map: htab is faster (cache-friendly
fixed bucket array), but only by ~10-20 %.
* Large / high-load-factor map: rhtab is dramatically faster
(1.2x to 4x) because rhashtable resizes to keep the load
factor sane while htab stays stuck at user-declared max.
* Update-heavy workloads: rhtab is ~5x faster per producer
via in-place memcpy.
* Memory benchmark: effectively on par.
Mykyta Yatsenko [Fri, 5 Jun 2026 11:41:26 +0000 (04:41 -0700)]
selftests/bpf: Add basic tests for resizable hash map
Test basic map operations (lookup, update, delete) for
BPF_MAP_TYPE_RHASH including boundary conditions like duplicate
key insertion and deletion of nonexistent keys.
Mykyta Yatsenko [Fri, 5 Jun 2026 11:41:24 +0000 (04:41 -0700)]
bpf: Optimize word-sized keys for resizable hashtable
Specialize the lookup/update/delete paths for keys whose size matches
sizeof(long) (4 bytes on 32-bit, 8 bytes on 64-bit). A static-const
rhashtable_params lets the compiler inline a custom XOR-fold hashfn and
a single-word equality cmpfn, eliminating the indirect jhash dispatch.
The same hashfn and cmpfn are installed into rhashtable's stored params
at rhashtable_init time, so the rehash worker, slow-path inserts, and
rhashtable_next_key() all agree with the inlined fast paths.
The seq_file BPF iterator uses rhashtable_walk_* and is unaffected.
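To make the word-sized fast path concrete, here is a minimal userspace sketch of an XOR-fold hash and a single-word comparison. The names demo_word_hash() and demo_word_cmp() and the exact fold are assumptions for illustration; the commit message does not show the real hashfn.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical XOR-fold: mix the upper half of the key into the lower half. */
static inline uint32_t demo_word_hash(const void *key, uint32_t seed)
{
	unsigned long k;
	uint64_t w;

	memcpy(&k, key, sizeof(k));	/* safe for unaligned keys */
	w = k;
	w ^= w >> 32;			/* no-op when long is 32 bits wide */
	return (uint32_t)w ^ seed;
}

/* Single-word equality; returns 0 on match, following the memcmp convention. */
static inline int demo_word_cmp(const void *a, const void *b)
{
	unsigned long x, y;

	memcpy(&x, a, sizeof(x));
	memcpy(&y, b, sizeof(y));
	return x != y;
}
```

Because both helpers take only the key pointer and are small, a compiler can inline them at each call site when they are reachable through a compile-time constant parameter block, which is the effect the commit describes.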
Mykyta Yatsenko [Fri, 5 Jun 2026 11:41:23 +0000 (04:41 -0700)]
bpf: Allow special fields in resizable hashtab
Add support for timers, workqueues, task work, spin locks and kptrs.
Without this, users needing deferred callbacks, BPF_F_LOCK, or
refcounted kernel pointers in a dynamically-sized map have no option -
fixed-size htab is the only map supporting these field types.
Resizable hashtab should offer the same capability.
kptr semantics under in-place updates are identical to array map.
Properly clean up BTF record fields on element delete and map
teardown by wiring up bpf_obj_free_fields through a memory allocator
destructor, matching the pattern used by htab for non-prealloc maps.
Mykyta Yatsenko [Fri, 5 Jun 2026 11:41:22 +0000 (04:41 -0700)]
bpf: Implement iteration ops for resizable hashtab
Implement get_next_key, batch lookup/lookup-and-delete, for_each_map_elem
callback, and the seq_file BPF iterator for BPF_MAP_TYPE_RHASH.
get_next_key() and batch use rhashtable_next_key() — stateless,
matches the syscall UAPI shape (no kernel-side iterator state).
get_next_key falls back to the first key when prev_key was
concurrently deleted (matches htab semantics). Batch reports
cursor loss as -EAGAIN so userspace can distinguish it from
end-of-iteration (-ENOENT) and restart from NULL.
The seq_file BPF iterator uses rhashtable_walk_* instead. It runs
only from read() syscall context, so the walker's spin_lock is
safe, and seq_file's per-fd state lets the walker handle rehash
correctly (retry on -EAGAIN) for stronger coverage than the
stateless API can provide.
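The -EAGAIN versus -ENOENT distinction can be sketched from the caller's side. The loop below is a hedged userspace model, not libbpf code: demo_step_fn stands in for one batch call, demo_fake_step() is a toy backend that loses its cursor once, and the restart bound is an assumption.

```c
#include <errno.h>

/* One batch step: 0 = more data, -ENOENT = end, -EAGAIN = cursor lost.
 * have_cursor == 0 models passing a NULL cursor (start from the beginning). */
typedef int (*demo_step_fn)(void *ctx, int have_cursor);

/* Returns the number of restarts on success, or a negative errno. */
static int demo_drain(demo_step_fn step, void *ctx, int max_restarts)
{
	int restarts = 0;
	int have_cursor = 0;

	for (;;) {
		int err = step(ctx, have_cursor);

		if (err == 0) {
			have_cursor = 1;	/* continue from the returned cursor */
			continue;
		}
		if (err == -ENOENT)
			return restarts;	/* clean end of iteration */
		if (err == -EAGAIN && restarts < max_restarts) {
			restarts++;
			have_cursor = 0;	/* cursor lost: restart from NULL */
			continue;
		}
		return err;
	}
}

/* Toy backend: `total` elements, loses the cursor once at `lose_once_at`. */
struct demo_fake { int pos; int total; int lose_once_at; };

static int demo_fake_step(void *ctx, int have_cursor)
{
	struct demo_fake *f = ctx;

	if (!have_cursor)
		f->pos = 0;
	if (f->lose_once_at && f->pos == f->lose_once_at) {
		f->lose_once_at = 0;
		return -EAGAIN;
	}
	if (f->pos >= f->total)
		return -ENOENT;
	f->pos++;
	return 0;
}
```

A restart from NULL revisits elements that were already returned, so a real consumer would need to tolerate or filter duplicates.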
Mykyta Yatsenko [Fri, 5 Jun 2026 11:41:21 +0000 (04:41 -0700)]
bpf: Implement resizable hashmap basic functions
Use rhashtable_lookup_likely() for lookups, rhashtable_remove_fast()
for deletes, and rhashtable_lookup_get_insert_fast() for inserts.
Updates modify values in place under RCU rather than allocating a
new element and swapping the pointer (as regular htab does). This
trades read consistency for performance: concurrent readers may
see partial updates. BPF_F_LOCK support and special-field
handling (timers, kptrs, etc.) follow in a later commit.
Initialize rhashtable with bpf_mem_alloc element cache. Require
BPF_F_NO_PREALLOC. Limit max_entries to 2^31. Free elements via
rhashtable_free_and_destroy().
Mykyta Yatsenko [Fri, 5 Jun 2026 11:41:18 +0000 (04:41 -0700)]
rhashtable: Add rhashtable_next_key() API
Introduce a simpler iteration mechanism for rhashtable that lets
the caller continue from an arbitrary position by supplying the
previous key, without the per-iterator state of the
rhashtable_walk_* API.
Caller holds RCU; passes NULL prev_key for the first element or
the previously returned key to advance. Walks tbl->future_tbl
chain so in-flight rehashes are observed.
Best-effort: in case of concurrent resize, provides no guarantees:
- may produce duplicate elements
- may skip any amount of elements
- termination of the loop is not guaranteed in case of
sustained rehash. Callers are advised to bound loop externally
or avoid inserting new elements during such loop.
Returns ERR_PTR(-ENOENT) if prev_key is not found.
Behavior on tables with duplicate keys is undefined.
rhltable is not supported — returns ERR_PTR(-EOPNOTSUPP).
Inochi Amaoto [Thu, 28 May 2026 11:38:39 +0000 (19:38 +0800)]
RISC-V: KVM: Enhance the logging check for mmu mapping
When the dirty ring is enabled, the dirty bitmap is disabled, and the
logging check is always false because the RISC-V architecture does not select
"NEED_KVM_DIRTY_RING_WITH_BITMAP". Although the dirty log is still recorded,
since the write path already tries to add it, the logic of the
logging check is broken and some side effects occur.
Enhance the logging check for mmu mapping so it can check both the dirty
ring and the dirty bitmap.
netfilter: conntrack: call nf_ct_gre_keymap_destroy() if master helper is pptp
For GRE flows, validate that the ct master helper (if any) is pptp
before calling nf_ct_gre_keymap_destroy(), so the helper data area
can be accessed safely. Note that only the pptp helper provides a
.destroy callback.
This infrastructure is not used anymore after moving ct timeout and
helper to use datapath refcount to track object use.
Revert commit c56716c69ce1 ("netfilter: extensions: introduce extension
genid count"). That patch disables all ct extensions (leading to NULL)
for unconfirmed conntracks, when only ct helper and ct timeout were
targeted. There is also code that dereferences the ct extension
without checking for NULL, which could lead to a crash.
netfilter: nf_conntrack_helper: add refcounting from datapath
This patch adds a new ->ct_refcnt field to struct nf_conntrack_helper
which is bumped when the helper is used by the ct helper extension. Drop
this reference count when the conntrack entry is released. This is a
packet path refcount which ensures that struct nf_conntrack_helper
remains in place for tricky scenarios where a packet sits in nfqueue, or
elsewhere, with a conntrack that refers to this helper.
For simplicity, this leaves a single refcount for helper objects in
place and removes the existing control-plane refcount that ensured
the helper does not go away while it is used by a ruleset.
On helper removal, the help callback is set to NULL to disable it from
packet path and, after rcu grace period, existing expectations are
removed. Update ctnetlink to disable access to .to_nlattr and
.from_nlattr if the helper is going away.
Remove nf_queue_nf_hook_drop() since it has proven not to be effective:
packets with unconfirmed conntracks can still be in flight and end up
sitting in nfqueue.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nf_conntrack_pptp: move GRE specific cleanup to GRE tracker
Move the GRE specific cleanup to nf_conntrack_proto_gre.c to ensure that
the .destroy callback for the pptp helper is still reachable by existing
conntrack entries while pptp module is being removed.
This is a preparation patch, no functional changes are intended.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
wifi: mac80211: Add 802.3 multicast encapsulation offload support
mac80211 converts 802.3 multicast packets to 802.11 format
before driver TX, even when Ethernet encapsulation offload
is enabled. This prevents drivers that support multicast
Ethernet encapsulation offload from receiving frames in
native 802.3 format.
Introduce the IEEE80211_OFFLOAD_ENCAP_MCAST flag to bypass the
802.11 encapsulation step and pass the multicast packet to the
driver in 802.3 format. Drivers that support multicast Ethernet
encapsulation offload can advertise this flag.
Disable multicast encapsulation offload in MLO case for drivers not
advertising MLO_MCAST_MULTI_LINK_TX support for AP mode and for
3-address AP_VLAN multicast packets.
wifi: mac80211: Add multicast to unicast support for 802.3 path
mac80211 already supports multicast-to-unicast conversion for
native 802.11 TX paths, but this handling is missing for the
802.3 transmit path. As a result, the packet is never converted to
unicast; it is passed directly to the 802.11 TX path because the
destination address is multicast.
Extend ieee80211_subif_start_xmit_8023() to honor the
multicast_to_unicast setting by cloning and converting multicast
Ethernet frames into per-station unicast transmissions, following
the same behavior as the native 802.11 TX path and allowing them
to take the 802.3 path.
wifi: mac80211: Add sta pointer sanity check in ieee80211_8023_xmit()
Currently ieee80211_8023_xmit() accesses the sta pointer without any
sanity check, assuming that only unicast packets for an authorized
station are processed. But the sta pointer could become NULL when
a framework to support 802.3 offload for multicast packets is
added in the follow-up patches. Add a sta pointer sanity
check to avoid an invalid pointer access.
This aligns with some of the subordinate functions called by
ieee80211_8023_xmit() that already NULL-check 'sta' such as
ieee80211_select_queue() and ieee80211_aggr_check().
Nicolin Chen [Mon, 1 Jun 2026 20:42:38 +0000 (13:42 -0700)]
iommufd/selftest: Cover invalid read counts on vEVENTQ FD
The vEVENTQ file descriptor must reject reads whose buffer cannot hold
even one event record. Add selftest coverage that exercises both the
empty-queue path (the upfront size check) and the non-empty path (the
in-loop check that fires only after an event is fetched).
For iommufd_veventq_fops_read():
- count == 0 and count < sizeof(header) on an empty vEVENTQ both
return -EINVAL.
- count == 0 and count == sizeof(header) on a non-empty vEVENTQ
(event has trailing payload) both return -EINVAL.
Nicolin Chen [Mon, 1 Jun 2026 20:42:37 +0000 (13:42 -0700)]
iommufd: Avoid partial fault group delivery in iommufd_fault_fops_read()
The cookie returned by xa_alloc() in iommufd_fault_fops_read() is per fault
group, but the inner copy_to_user() runs per fault inside the group. If a
copy fails mid-group, xa_erase clears the cookie and the group is restored
to the deliver list, yet done is not rolled back. The function returns the
partial byte count, with the successfully copied faults sitting at offsets
below done carrying the now-erased cookie. The next read() then re-fetches
the group, allocates a fresh cookie, and re-delivers every fault including
the ones already copied; userspace sees duplicates carrying the new cookie,
and a stale cookie that can never be responded to.
Use a local group_done variable that tracks the per-group progress inside
the inner loop, and only commit done = group_done after the inner loop has
finished successfully. On a copy_to_user failure the outer break skips the
commit, so done remains at its prior start-of-group baseline; the partial
bytes already written past done are undefined to userspace per the read(2)
contract, and the next read re-delivers the whole group atomically.
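The commit-on-success pattern can be illustrated with a small userspace model. Everything here is hypothetical scaffolding: demo_copy() stands in for copy_to_user() and fails once the destination offset reaches fault_off, and the record size is arbitrary. Only the shape of the loop follows the description above.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_REC 8	/* bytes per record in this toy model */

struct demo_group {
	size_t nr;
	char rec[2][DEMO_REC];
};

/* Stand-in for copy_to_user(): fails at or beyond `fault_off`. */
static int demo_copy(char *dst, size_t off, const char *src, size_t fault_off)
{
	if (off >= fault_off)
		return -1;
	memcpy(dst + off, src, DEMO_REC);
	return 0;
}

/* Returns the number of bytes delivered in whole groups only. */
static size_t demo_read(char *buf, size_t count,
			const struct demo_group *groups, size_t nr_groups,
			size_t fault_off)
{
	size_t done = 0;

	for (size_t g = 0; g < nr_groups; g++) {
		size_t group_done = done;	/* per-group progress */
		int rc = 0;

		if (groups[g].nr * DEMO_REC > count - done)
			break;
		for (size_t i = 0; i < groups[g].nr; i++) {
			rc = demo_copy(buf, group_done, groups[g].rec[i], fault_off);
			if (rc)
				break;
			group_done += DEMO_REC;
		}
		if (rc)
			break;		/* leave the outer loop too; done is not advanced */
		done = group_done;	/* commit only after the whole group copied */
	}
	return done;
}
```

With two groups of two records each, a failure on the fourth record makes the call report 16 bytes rather than 24, so the second group is redelivered as a whole on the next read.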
Nicolin Chen [Mon, 1 Jun 2026 20:42:36 +0000 (13:42 -0700)]
iommufd: Break the loop on failure in iommufd_fault_fops_read()
On a copy_to_user() failure inside the inner list_for_each_entry, only the
inner loop breaks; the outer while re-fetches the just-restored fault group
and retries the failing copy_to_user() forever, spinning the reader at 100%
CPU with fault->mutex held.
Check rc after the inner loop and break the outer while as well.
Oliver Upton [Tue, 2 Jun 2026 16:59:01 +0000 (09:59 -0700)]
KVM: arm64: Correctly identify executable PTEs at stage-2
KVM invalidates the I-cache before installing an executable PTE on
implementations without DIC. Unfortunately, support for FEAT_XNX
broke this check as KVM_PTE_LEAF_ATTR_HI_S2_XN was expanded to a
bitfield.
Fix it by reusing kvm_pgtable_stage2_pte_prot() and testing the abstract
permission bits instead.
Fixes: 2608563b466b ("KVM: arm64: Add support for FEAT_XNX stage-2 permissions")
Reported-by: Sashiko (gemini/gemini-3.1-pro-preview)
Signed-off-by: Oliver Upton <oupton@kernel.org>
Reviewed-by: Wei-Lin Chang <weilin.chang@arm.com>
Link: https://patch.msgid.link/20260602165901.52800-3-oupton@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Oliver Upton [Tue, 2 Jun 2026 16:59:00 +0000 (09:59 -0700)]
KVM: arm64: nv: Fix handling of XN[0] when !FEAT_XNX
XN has already been extracted from its bitfield position so using
FIELD_PREP() on the mask that clears XN[0] is completely broken, having
the effect of unconditionally granting execute permissions...
Fix the obvious mistake by manipulating the right bit.
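The class of bug described here is easy to reproduce outside the kernel. The sketch below is a hypothetical reconstruction, not the actual KVM code: the DEMO_ macros imitate FIELD_PREP() for a 2-bit field at bit 53, and the two helpers contrast masking an already-extracted value with a re-shifted mask versus a plain bit mask.

```c
#include <stdint.h>

#define DEMO_XN_SHIFT	53
#define DEMO_XN_MASK	(UINT64_C(3) << DEMO_XN_SHIFT)	/* hypothetical 2-bit XN field */

/* Minimal imitation of FIELD_PREP(): place val into the field position. */
#define DEMO_FIELD_PREP(val)	(((uint64_t)(val) << DEMO_XN_SHIFT) & DEMO_XN_MASK)

/*
 * Buggy form: xn is already extracted (0..3), but the mask is shifted back
 * up to bit 53, so the AND always yields 0, which reads as "executable".
 */
static uint64_t demo_clear_xn0_buggy(uint64_t xn)
{
	return xn & DEMO_FIELD_PREP(0x2);
}

/* Fixed form: operate on the extracted value; keep XN[1], drop XN[0]. */
static uint64_t demo_clear_xn0_fixed(uint64_t xn)
{
	return xn & 0x2;
}
```

In this model the buggy helper returns 0 for every input, while the fixed helper preserves the upper bit of the extracted field.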
Cc: stable@vger.kernel.org
Fixes: d93febe2ed2e ("KVM: arm64: nv: Forward FEAT_XNX permissions to the shadow stage-2")
Reviewed-by: Wei-Lin Chang <weilin.chang@arm.com>
Signed-off-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20260602165901.52800-2-oupton@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Dmitry Ilvokhin [Fri, 5 Jun 2026 10:06:22 +0000 (03:06 -0700)]
cleanup: Specify nonnull argument index
The guard constructors were annotated with an empty __nonnull_args(),
relying on __nonnull__() marking every pointer parameter as non-NULL.
Sparse cannot parse the empty argument list.
Both constructors take the lock pointer as their first parameter, so
specify the index explicitly: __nonnull_args(1).
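As a rough illustration of an explicit index, the sketch below uses the GCC/Clang nonnull attribute directly. DEMO_NONNULL_ARGS() and demo_guard_ctor() are made-up names; the kernel's own macro and guard constructors are not reproduced here.

```c
/* Hypothetical wrapper that always takes an explicit argument index list. */
#define DEMO_NONNULL_ARGS(...)	__attribute__((nonnull(__VA_ARGS__)))

struct demo_lock {
	int held;
};

/* The lock pointer is the first parameter, so index 1 is named explicitly. */
static inline DEMO_NONNULL_ARGS(1)
struct demo_lock *demo_guard_ctor(struct demo_lock *lock)
{
	lock->held = 1;
	return lock;
}
```

Naming the index keeps the annotation parseable by tools that do not accept an empty argument list, at the cost of having to update it if the parameter order changes.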
Reported-by: Dan Carpenter <error27@gmail.com>
Closes: https://lore.kernel.org/all/aiJi0WcYE8FZt-jO@stanley.mountain/
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/aiKpH3cLBEj3TF2Q@shell.ilvokhin.com
Andy Shevchenko [Thu, 4 Jun 2026 09:52:02 +0000 (11:52 +0200)]
fs/read_write: Do not export __kernel_write() to the entire world
Since we have EXPORT_SYMBOL_FOR_MODULES(), we may narrow
the __kernel_write() export to the only module that really needs it.
With that done, update the respective comment.
David Woodhouse [Thu, 4 Jun 2026 09:35:18 +0000 (10:35 +0100)]
ptp: vmclock: Use hw_cycles from snapshot for precise TSC pairing
When the system clocksource is kvmclock or Hyper-V (not the TSC directly),
vmclock_get_crosststamp() falls through to a separate get_cycles() call,
losing the atomic pairing between the system time snapshot and the TSC
reading.
Now that ktime_get_snapshot_id() populates hw_cycles with the underlying
TSC value for derived clocksources, use it when available. This gives a
perfect (system_time, tsc) pairing for the device time calculation.
The SUPPORT_KVMCLOCK wrapper is still needed to convert the TSC into
kvmclock nanoseconds for system_counter->cycles, because otherwise
get_device_system_crosststamp() can't interpret the result against the
system clock.
David Woodhouse [Thu, 4 Jun 2026 09:35:17 +0000 (10:35 +0100)]
x86/kvmclock: Implement read_snapshot() for kvmclock clocksource
Implement the read_snapshot() callback for the kvmclock clocksource. This
returns the kvmclock nanosecond value (for timekeeping) while also
providing the raw TSC value that was used to compute it.
The TSC is read inside the pvclock seqlock-protected region, ensuring the
raw TSC and derived kvmclock value are atomically paired.
This enables ktime_get_snapshot_id() to provide the raw TSC to consumers
like the vmclock PTP driver, which currently has to do a separate call to
get_cycles() to obtain a value at *approximately* the same time, to feed
through the vmclock calculation.
David Woodhouse [Thu, 4 Jun 2026 09:35:16 +0000 (10:35 +0100)]
clocksource/hyperv: Implement read_snapshot() for TSC page clocksource
Implement the read_snapshot() callback for the Hyper-V TSC page
clocksource. This returns the derived 10MHz reference time (for timekeeping)
while also providing the raw TSC value that was used to compute it.
When the TSC page is valid, hv_read_tsc_page_tsc() atomically captures both
values from a single RDTSC inside the sequence-counter protected read. When
the TSC page is invalid (sequence == 0), the hw_csid and hw_cycles are set
to zero indicating no value is available.
This enables ktime_get_snapshot_id() to provide the raw TSC to consumers
like KVM's master clock when running nested guests under Hyper-V.
pwm: th1520: Remove requirement for mul_u64_u64_div_u64_roundup
The cycle register is always u32, so cycles_to_ns() can take a u32
instead of a u64. With that narrowing, cycles * NSEC_PER_SEC is at most
u32::MAX * 1e9 (~4.3e18), which fits in u64 without overflow. The
saturating arithmetic is therefore no longer needed, and the ceiling
division can use Rust's u64::div_ceil() directly instead of the
open-coded numerator/denominator form.
This also drops the TODO referring to a future
mul_u64_u64_div_u64_roundup kernel helper, which is no longer required.
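The overflow argument can be checked with a short C rendering of the same arithmetic (the driver itself is Rust). demo_cycles_to_ns() and the rate_hz parameter are illustrative names, and rate_hz is assumed to be nonzero.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_SEC	UINT64_C(1000000000)

/* Ceiling division of cycles * 1e9 by the clock rate, with no saturation. */
static uint64_t demo_cycles_to_ns(uint32_t cycles, uint64_t rate_hz)
{
	/* At most UINT32_MAX * 1e9, about 4.3e18, which is below 2^64. */
	uint64_t num = (uint64_t)cycles * DEMO_NSEC_PER_SEC;

	/* Same result as a div_ceil(): round up when there is a remainder. */
	return num / rate_hz + (num % rate_hz != 0);
}
```

Computing the ceiling from the quotient and remainder avoids the usual `num + rate_hz - 1` form, so the bound on the numerator is the only overflow consideration.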
Shengming Hu [Thu, 4 Jun 2026 12:27:32 +0000 (20:27 +0800)]
mm/slub: preserve original size in _kmalloc_nolock_noprof retry path
_kmalloc_nolock_noprof() retries from the next kmalloc bucket when the
initial allocation fails. The retry currently reuses `size` as the
bucket selector and overwrites it with s->object_size + 1.
That value is later passed as the original allocation size to
__slab_alloc_node(), slab_post_alloc_hook() and kasan_kmalloc(). On a
successful retry this makes KASAN/slub-debug observe the retry bucket
selector rather than the caller requested size, potentially widening the
valid kmalloc range and hiding overflows.
Keep the caller requested size separately as orig_size and pass it to
the allocation/debug/KASAN paths. Continue using `size` as the retry cache
selector.
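A toy model shows why the two sizes need to stay separate. The bucket table, demo_alloc(), and the fail_mask parameter are invented for this sketch and only imitate the retry shape described above: one retry from the next bucket, with the caller's size recorded for the debug paths.

```c
#include <stddef.h>

static const size_t demo_buckets[] = { 8, 16, 32, 64 };

struct demo_result {
	size_t bucket_size;	/* cache that served the request */
	size_t recorded_size;	/* size reported to the debug/KASAN-like layer */
};

/* Index of the smallest bucket that fits `size`, or -1 if none does. */
static int demo_bucket_for(size_t size)
{
	for (int i = 0; i < (int)(sizeof(demo_buckets) / sizeof(demo_buckets[0])); i++)
		if (demo_buckets[i] >= size)
			return i;
	return -1;
}

/* fail_mask bit i set means bucket i cannot satisfy the allocation. */
static int demo_alloc(size_t size, unsigned int fail_mask, struct demo_result *res)
{
	size_t orig_size = size;	/* what the caller actually asked for */

	for (int attempt = 0; attempt < 2; attempt++) {
		int idx = demo_bucket_for(size);

		if (idx < 0)
			return -1;
		if (!(fail_mask & (1u << idx))) {
			res->bucket_size = demo_buckets[idx];
			res->recorded_size = orig_size;	/* not the retry selector */
			return 0;
		}
		size = demo_buckets[idx] + 1;	/* selector for the next bucket */
	}
	return -1;
}
```

For a 10-byte request whose 16-byte bucket fails, the retry lands in the 32-byte bucket but the recorded size stays 10, so an overflow past byte 10 remains detectable.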
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock()")
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Hao Li <hao.li@linux.dev>
Link: https://patch.msgid.link/202606042027323804pk3MRY42Jy7y42OHAhQZ@zte.com.cn
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Namjae Jeon [Wed, 3 Jun 2026 14:40:31 +0000 (23:40 +0900)]
iomap: Add IOMAP_F_ZERO_TAIL flag to trace event strings
Add IOMAP_F_ZERO_TAIL to the flag string mapping in iomap trace
events. This allows the new flag to be properly displayed in
ftrace output when iomap operations use it.
chachapoly_create() still accepts the compatibility poly1305 parameter
in the template name, but it assumes the second template argument is
always present and immediately passes it to strcmp().
When the argument is missing, crypto_attr_alg_name() returns an error
pointer. Check for that before comparing the name so malformed template
instantiations fail with an error instead of dereferencing the error
pointer in strcmp().
This matches the surrounding Crypto API template pattern where
crypto_attr_alg_name() results are validated before string-specific use.
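The validate-before-use pattern looks roughly like the following userspace sketch. The demo_ helpers imitate the kernel's error-pointer convention, demo_attr_name() is a hypothetical stand-in for crypto_attr_alg_name(), and returning -EINVAL for a non-matching name is an assumption made for the example.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_ERRNO	4095

/* Userspace imitation of the error-pointer encoding. */
static inline const char *demo_err_ptr(long err)
{
	return (const char *)(intptr_t)err;
}

static inline int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

static inline long demo_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* Hypothetical lookup: a missing argument yields an error pointer. */
static const char *demo_attr_name(const char *arg)
{
	return arg ? arg : demo_err_ptr(-ENOENT);
}

static int demo_check_compat(const char *arg)
{
	const char *name = demo_attr_name(arg);

	if (demo_is_err(name))		/* must come before any string use */
		return (int)demo_ptr_err(name);
	if (strcmp(name, "poly1305") != 0)
		return -EINVAL;
	return 0;
}
```

Without the demo_is_err() check, the missing-argument case would hand a small negative value disguised as a pointer straight to strcmp().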
Junyuan Wang [Tue, 26 May 2026 09:28:39 +0000 (09:28 +0000)]
crypto: qat - add KPT support for GEN6 devices
Add support for Intel Key Protection Technology (KPT) on QAT GEN6
devices.
KPT protects private keys from exposure by keeping them wrapped
(encrypted) while in use, in-flight, and at rest. Keys remain in wrapped
form and are not exposed in plaintext in host memory. This feature
operates outside of the Linux crypto framework and kernel keyring.
Extend the firmware admin interface to enable and configure KPT. During
device initialisation, if KPT is enabled, the driver sends an admin
message to firmware to enable KPT mode and configure parameters such as
the maximum number of SWK (Symmetric Wrapping Key) slots and the SWK
time-to-live (TTL).
Expose KPT configuration via a new sysfs attribute group, "qat_kpt", and
add ABI documentation.
Co-developed-by: Nitesh Venkatesh <nitesh.venkatesh@intel.com>
Signed-off-by: Nitesh Venkatesh <nitesh.venkatesh@intel.com>
Signed-off-by: Junyuan Wang <junyuan.wang@intel.com>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Reviewed-by: Ahsan Atta <ahsan.atta@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Ruijie Li [Mon, 25 May 2026 11:45:21 +0000 (19:45 +0800)]
crypto: pcrypt - restore callback for non-parallel fallback
pcrypt installs pcrypt_aead_done() on the child AEAD request before
trying to submit it through padata. If padata_do_parallel() returns
-EBUSY, pcrypt falls back to calling the child AEAD directly.
That fallback must not keep the padata completion callback. Otherwise
an asynchronous completion runs pcrypt_aead_done() even though the
request was never enrolled in padata.
Restore the original request callback and callback data before calling
the child AEAD directly. This keeps the fallback path aligned with a
direct AEAD request while leaving the parallel path unchanged.
Fixes: 662f2f13e66d ("crypto: pcrypt - Call crypto layer directly when padata_do_parallel() return -EBUSY")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Assisted-by: Codex:gpt-5.4
Signed-off-by: Ruijie Li <ruijieli51@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The Inline Crypto Engine found in Hawi SoC is compatible with the common
baseline IP 'qcom,inline-crypto-engine'. Hence, document the compatible as
such.
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Merge patch series "vfs infrastructure for fs-verity support for XFS with post EOF merkle tree"
Christian Brauner <brauner@kernel.org> says:
This brings in the vfs infrastructure required to implement fs-verity
support for XFS.
* patches from https://patch.msgid.link/20260520123722.405752-1-aalbersh@kernel.org:
iomap: introduce iomap_fsverity_write() for writing fsverity metadata
iomap: teach iomap to read files with fsverity
iomap: introduce IOMAP_F_FSVERITY and teach writeback to handle fsverity
fsverity: generate and store zero-block hash
iomap: introduce iomap_fsverity_write() for writing fsverity metadata
This is just a wrapper around iomap_file_buffered_write() to create
necessary iterator over metadata.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Link: https://patch.msgid.link/20260520123722.405752-10-aalbersh@kernel.org
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Obtain fsverity info for folios with file data and fsverity metadata.
The filesystem can pass vi down to the ioend and then to fsverity for
verification. This differs from the other filesystems that support
fsverity (ext4, f2fs, btrfs), which don't need fsverity_info for
reading fsverity metadata. When reading the merkle tree, iomap requires
fsverity info to synthesize hashes for zeroed data blocks.
fsverity metadata has two kinds of holes - ones in merkle tree and one
after fsverity descriptor.
Merkle tree holes are blocks full of hashes of zeroed data blocks. These
are not stored on the disk but synthesized on the fly. This saves a bit
of space for sparse files. Because of this, iomap also needs to look up
fsverity_info for folios with fsverity metadata. ->vi has a hash of the
zeroed data block which will be used to fill the merkle tree block.
The hole past the descriptor is interpreted as the end of the metadata
region. As we don't have EOF here, we use this hole as an indication that
the rest of the folio is empty. This patch marks the rest of the folio
beyond the fsverity descriptor as uptodate.
For file data, fsverity needs to verify consistency of the whole file
against the root hash, hashes of holes are included in the merkle tree.
Verify them too.
Issue reading of fsverity merkle tree on the fsverity inodes. This way
metadata will be available at I/O completion time.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Link: https://patch.msgid.link/20260520123722.405752-9-aalbersh@kernel.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
iomap: introduce IOMAP_F_FSVERITY and teach writeback to handle fsverity
This flag indicates that I/O is for fsverity metadata.
In the write path, skip the i_size check and i_size updates, as metadata is
past EOF. In writeback, don't update i_size, and continue writeback even
if the folio is beyond EOF. In the read path, don't zero fsverity folios;
again, they are past EOF.
The iomap_block_needs_zeroing() is also called from write path. For
folios of larger order we don't want to zero out pages in the folio as
these could contain other merkle tree blocks. For fsverity, filesystem
will request to read PAGE_SIZE memory regions. For data folios, iomap
will zero the rest of the folio for anything which is beyond EOF. We
don't want this for fsverity folios.
Christian Brauner <brauner@kernel.org> says:
Changed IOMAP_F_FSVERITY from (1U << 10) to (1U << 11) to avoid colliding
with IOMAP_F_ZERO_TAIL, which already uses (1U << 10).
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Link: https://patch.msgid.link/20260520123722.405752-8-aalbersh@kernel.org
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Compute the hash of one filesystem block's worth of zeros. A filesystem
implementation can decide to elide merkle tree blocks containing only
this hash and synthesize the contents at read time.
Let's pretend that there's a file containing 131 data blocks and whose
merkle tree looks roughly like this:
If data[0-128] are sparse holes, then leaf0 will contain a repeating
sequence of @zero_digest. Therefore, leaf0 need not be written to disk
because its contents can be synthesized.
A subsequent xfs patch will use this to reduce the size of the merkle
tree when dealing with sparse gold master disk images and the like.
Note that this works only on the first-level (data holes). fsverity
doesn't store/generate zero_digest for any higher levels.
Add a helper to pre-fill folio with hashes of empty blocks. This will be
used by iomap to synthesize blocks full of zero hashes on the fly.
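Synthesizing such a block amounts to repeating one digest across a buffer. The sketch below is a userspace approximation with made-up names; the in-kernel helper operates on folios and its signature is not shown here. digest_size is assumed to be nonzero.

```c
#include <stddef.h>
#include <string.h>

/*
 * Fill `block` with back-to-back copies of `zero_digest` and return how
 * many digests were written. Trailing bytes that cannot hold a full
 * digest are left untouched.
 */
static size_t demo_fill_zero_hashes(unsigned char *block, size_t block_size,
				    const unsigned char *zero_digest,
				    size_t digest_size)
{
	size_t n = 0;

	for (size_t off = 0; off + digest_size <= block_size; off += digest_size) {
		memcpy(block + off, zero_digest, digest_size);
		n++;
	}
	return n;
}
```

For example, a 4096-byte block with a 32-byte digest holds 128 copies, which is why a leaf covering only holes carries no information beyond the digest itself.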
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Link: https://patch.msgid.link/20260520123722.405752-5-aalbersh@kernel.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Adapt all existing helpers to use a modified version of
nf_ct_helper_init(), to dynamically allocate struct nf_conntrack_helper.
Allocate expect_policy[] built-in into the helper to ensure this area is
reachable after helper removal since a follow up patch adds refcount to
track use of the nf_conntrack_helper structure from packet path so it
remains around until last reference from ct helper extension is dropped.
Export __nf_conntrack_helper_register() which allows to register
nfnetlink_cthelper dynamically allocated helper. Adapt nfnetlink_cthelper
to use the built-in expect_policy[].
This is a preparation patch to add packet path refcounting to helpers.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Clément Léger [Thu, 4 Jun 2026 16:07:13 +0000 (09:07 -0700)]
io_uring/net: inherit IORING_CQE_F_BUF_MORE across bundle recv retries
When a bundle recv retries inside io_recv_finish(), the merge logic ORs
the saved cflags from the previous iteration with the cflags returned by
the new iteration:
cflags = req->cqe.flags | (cflags & CQE_F_MASK);
Bits listed in CQE_F_MASK are inherited from the new iteration, and all
other bits (notably IORING_CQE_F_BUFFER and the buffer ID) come from the
saved cflags. Before this change CQE_F_MASK covered only
IORING_CQE_F_SOCK_NONEMPTY and IORING_CQE_F_MORE.
When using provided buffer rings in incremental mode (IOU_PBUF_RING_INC)
together with bundle recv, io_kbuf_inc_commit() can leave the head ring
entry partially consumed. __io_put_kbufs() then sets
IORING_CQE_F_BUF_MORE on the returned cflags so userspace knows the
buffer ID will be reused for subsequent completions.
Because IORING_CQE_F_BUF_MORE was not in CQE_F_MASK, the merge above
silently dropped it whenever the final retry iteration partially
consumed the buffer, and the subsequent req->cqe.flags = cflags &
~CQE_F_MASK save would have left a stale IORING_CQE_F_BUF_MORE in the
carried-over cflags had one been present. Userspace would then
wrongly advance its ring head past an entry the kernel still uses.
Add IORING_CQE_F_BUF_MORE to CQE_F_MASK so it is both inherited from the
new iteration into the user-visible CQE and stripped from the saved
cflags between iterations.
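The effect of the mask change can be sketched in plain C. This is a
userspace illustration only, not the kernel code: the flag values are
written out locally (they are meant to mirror the uapi header but should
be treated as illustrative), and merge_cflags(), save_cflags(), MASK_OLD
and MASK_NEW are hypothetical names.

```c
#include <stdint.h>

/* Illustrative flag bits; assumed to mirror include/uapi/linux/io_uring.h. */
#define CQE_F_BUFFER        (1U << 0)
#define CQE_F_MORE          (1U << 1)
#define CQE_F_SOCK_NONEMPTY (1U << 2)
#define CQE_F_BUF_MORE      (1U << 4)

/* Mask before and after the change described above (hypothetical names). */
#define MASK_OLD (CQE_F_SOCK_NONEMPTY | CQE_F_MORE)
#define MASK_NEW (MASK_OLD | CQE_F_BUF_MORE)

/* Merge step: masked bits come from the new iteration, the rest from saved. */
static uint32_t merge_cflags(uint32_t saved, uint32_t fresh, uint32_t mask)
{
    return saved | (fresh & mask);
}

/* Save step: strip the masked bits before carrying cflags to the next retry. */
static uint32_t save_cflags(uint32_t cflags, uint32_t mask)
{
    return cflags & ~mask;
}
```

With MASK_OLD, a fresh CQE_F_BUF_MORE never reaches the merged result;
with MASK_NEW it does, and save_cflags() also clears it from the
carried-over value so a stale bit cannot leak into a later iteration.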
Wyatt Feng [Tue, 2 Jun 2026 16:46:27 +0000 (00:46 +0800)]
xfrm: espintcp: do not reuse an in-progress partial send
espintcp keeps a single in-flight transmit in ctx->partial.
Before building a new sk_msg, espintcp_sendmsg() first tries to flush
that state through espintcp_push_msgs().
For blocking callers, espintcp_push_msgs() may return success even when
the previous partial send is still pending. espintcp_sendmsg() would
then reinitialize emsg->skmsg and reuse ctx->partial while the old
transfer still owns that state.
Do not rebuild the send message when ctx->partial is still in progress.
If espintcp_push_msgs() returns with emsg->len still set, fail the new
send instead of overwriting the live partial state.
This is a memory-safety fix: reusing the live partial-send state can
leave a stale offset attached to a new sk_msg and lead to an out-of-
bounds read in the send path.
tcp_sendmsg_locked() already handles waiting for send buffer memory, so
the fix here is just to preserve espintcp's one-message-at-a-time
transmit state.
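A minimal userspace model of the guard, assuming a simplified state
struct. struct tx_state, push_partial() and start_send() are hypothetical
stand-ins for ctx->partial, espintcp_push_msgs() and espintcp_sendmsg(),
and the -EAGAIN return value is an assumption; the commit message only
says that the new send fails.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of the single in-flight transmit slot. */
struct tx_state {
    size_t len;    /* bytes still to send; non-zero means "in progress" */
    size_t offset; /* how far into the current message we have sent */
};

/*
 * Stand-in for the flush step: sends at most `room` bytes and, like the
 * blocking-caller case described above, reports success even if data
 * is still pending.
 */
static int push_partial(struct tx_state *tx, size_t room)
{
    size_t n = tx->len < room ? tx->len : room;

    tx->offset += n;
    tx->len -= n;
    return 0;
}

/* Start a new message only if the previous one has fully drained. */
static int start_send(struct tx_state *tx, size_t msg_len, size_t room)
{
    int err = push_partial(tx, room);

    if (err)
        return err;
    if (tx->len)
        return -EAGAIN; /* assumed errno: refuse to overwrite live state */

    tx->len = msg_len;
    tx->offset = 0;
    return 0;
}
```

The key point is the tx->len check after the flush: without it, the new
message would be installed while the old offset is still meaningful.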
Jens Axboe [Fri, 5 Jun 2026 11:18:58 +0000 (05:18 -0600)]
Merge tag 'nvme-7.2-2026-06-04' of git://git.infradead.org/nvme into for-7.2/block
Pull NVMe updates from Keith:
"- Per-controller timeouts
- Multipath telemetry
- Namespace format validation
- Various other fixes"
* tag 'nvme-7.2-2026-06-04' of git://git.infradead.org/nvme: (34 commits)
nvme: export controller reconnect event count via sysfs
nvme: export controller reset event count via sysfs
nvme: export I/O failure count when no path is available via sysfs
nvme: export I/O requeue count when no path is usable via sysfs
nvme: export command error counters via sysfs
nvme: export multipath failover count via sysfs
nvme: export command retry count via sysfs
nvme: add diag attribute group under sysfs
nvme-tcp: lockdep: use dynamic lockdep keys per socket instance
nvme-tcp: move nvme_tcp_reclassify_socket()
nvme: validate FDP configuration descriptor sizes
nvmet-auth: validate reply message payload bounds against transfer length
nvme: refresh multipath head zoned limits from path limits
nvme: fix FDP fdpcidx bounds check
nvme-tcp: Use WQ_PERCPU explicitly if wq_unbound is false.
nvmet: fix pre-auth out-of-bounds heap read in Discovery Get Log Page
nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
nvme-multipath: require exact iopolicy names for module parameter
nvme-multipath: pass NS head to nvme_mpath_revalidate_paths()
nvme-pci: fix out-of-bounds access in nvme_setup_descriptor_pools
...
Cássio Gabriel [Fri, 5 Jun 2026 04:14:40 +0000 (01:14 -0300)]
ALSA: usb-audio: qcom: Initialize offload control return value
snd_usb_offload_create_ctl() returns ret after walking the USB PCM list,
but ret is only assigned after a playback stream passes the endpoint and
PCM-index filters.
If all playback streams are skipped, for example because there is no
playback endpoint or because all PCM indexes exceed the 0xff control
range, the function returns an uninitialized stack value.
Initialize ret to 0 so the no-control-created path returns deterministic
success, while preserving the existing negative error return when
snd_ctl_add() fails.
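The pattern can be sketched outside the kernel as follows. This is a
simplified model, not the driver code: struct stream, create_ctls() and
the add_ok()/add_fail() callbacks are hypothetical, with the callback
standing in for snd_ctl_add().

```c
#include <stddef.h>

/* Hypothetical, simplified view of one USB PCM stream. */
struct stream {
    int has_playback_ep; /* non-zero if a playback endpoint exists */
    int pcm_idx;         /* PCM index; must fit the 0xff control range */
};

/* Stand-ins for snd_ctl_add(): one succeeds, one fails with -12. */
static int add_ok(int pcm_idx)   { (void)pcm_idx; return 0; }
static int add_fail(int pcm_idx) { (void)pcm_idx; return -12; }

static int create_ctls(const struct stream *s, size_t n,
                       int (*add_ctl)(int pcm_idx))
{
    int ret = 0; /* without this, the all-skipped path returns garbage */
    size_t i;

    for (i = 0; i < n; i++) {
        if (!s[i].has_playback_ep || s[i].pcm_idx > 0xff)
            continue; /* filtered out: ret is not touched */

        ret = add_ctl(s[i].pcm_idx);
        if (ret < 0)
            break; /* keep the existing negative error return */
    }
    return ret;
}
```

When every stream is filtered out the loop body never assigns ret, so
the initializer is the only thing that determines the return value.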
netfilter: cttimeout: detach dataplane timeout policy and repurpose refcount
Add a refcount for struct nf_ct_timeout, which the ct extension uses to
set the custom ct timeout policy. The refcount tells us that the ct
timeout is in use by a conntrack entry. When the last conntrack entry
drops its reference, the ct timeout is released.
Remove the refcount for control plane which controls if the ruleset
refers to the timeout policy. After this update, it is possible to
remove the ct timeout policy from nfnetlink_cttimeout immediately.
This is for simplicity not to handle two refcounts on a single object.
Remove nf_queue_nf_hook_drop(): a packet sitting in nfqueue will just
hold a reference to the nf_ct_timeout object until packet is reinjected,
since this is part of the ct extension, this will be released by the
time the conntrack is freed.
nf_ct_untimeout() is still called to clean up on a best-effort basis:
the ct timeout on existing entries gets removed when the ct timeout goes
away, but as long as the iptables ruleset still refers to the ct timeout
through a template, new conntracks may keep attaching it and extend its
lifetime until the rule is removed.
nf_ct_untimeout() is no longer called from the module removal path: it
is unlikely to find timeouts there given that the module refcount is
bumped, and the new refcount already tracks ct timeout policy use, so
the policy is released when unused.
Fixes: 50978462300f ("netfilter: add cttimeout infrastructure for fine timeout tuning") Fixes: 7e0b2b57f01d ("netfilter: nft_ct: add ct timeout support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: synproxy: protect nf_ct_seqadj_init() with conntrack lock
nf_ct_seqadj_init() is called without holding the ct lock. This can race
with nf_ct_seq_adjust() when a connection is in CLOSE state due to an
RST or connection reopening. In addition for SYN_RECV state, concurrent
processing of packets can trigger nf_ct_seq_adjust() too. These
situations create a read/write data race.
As synproxy is the only user of nf_ct_seqadj_init() at the moment, fix
this by holding ct->lock inside nf_ct_seqadj_init() until all is done.
Fixes: 48b1de4c110a ("netfilter: add SYNPROXY core/target") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: synproxy: fix unaligned memory access in timestamp adjustment
Use get_unaligned_be32() and put_unaligned_be32() to safely read and
write the timestamp fields. This prevents performance degradation due to
unaligned memory access or even a crash on strict alignment
architectures.
This follows the implementation of timestamp parsing in the networking
stack at tcp_parse_options() and synproxy_parse_options().
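For readers outside the kernel, the idea behind these helpers can be
sketched with byte-wise accesses. load_be32(), store_be32() and
adjust_ts() are hypothetical userspace equivalents written for
illustration; they are not the kernel's implementation.

```c
#include <stdint.h>

/* Read a big-endian 32-bit value from any address, one byte at a time. */
static uint32_t load_be32(const void *p)
{
    const uint8_t *b = p;

    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Write a big-endian 32-bit value to any address, one byte at a time. */
static void store_be32(uint32_t v, void *p)
{
    uint8_t *b = p;

    b[0] = (uint8_t)(v >> 24);
    b[1] = (uint8_t)(v >> 16);
    b[2] = (uint8_t)(v >> 8);
    b[3] = (uint8_t)v;
}

/* Add an offset to a timestamp field that may sit at an odd address. */
static void adjust_ts(uint8_t *ts, uint32_t off)
{
    store_be32(load_be32(ts) + off, ts);
}
```

Because no 32-bit pointer is ever dereferenced, the code behaves the
same whether the field is 4-byte aligned or not.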
Fixes: 48b1de4c110a ("netfilter: add SYNPROXY core/target") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
RFC 9293 does not mention anything about duplicated options, and each
networking stack handles them in its own way. Currently, the Linux
kernel processes options sequentially and, in case of duplicated
timestamp options, the value from the latest one overrides the others.
As SYNPROXY is modifying only the first timestamp option found, a packet
can reach the backend server and it might parse the wrong timestamp
value. Let's just continue parsing the following options and in case a
duplicated timestamp is found, adjust it too.
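A rough userspace sketch of "keep walking after the first match". It is
not the synproxy code: adjust_timestamps() and the rd_be32()/wr_be32()
helpers are hypothetical, and for simplicity only the TSval field is
adjusted, whereas the real code chooses the field based on direction.

```c
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_EOL        0
#define TCPOPT_NOP        1
#define TCPOPT_TIMESTAMP  8
#define TCPOLEN_TIMESTAMP 10

static uint32_t rd_be32(const uint8_t *b)
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

static void wr_be32(uint8_t *b, uint32_t v)
{
    b[0] = (uint8_t)(v >> 24);
    b[1] = (uint8_t)(v >> 16);
    b[2] = (uint8_t)(v >> 8);
    b[3] = (uint8_t)v;
}

/* Adjust TSval in every timestamp option; returns how many were touched. */
static int adjust_timestamps(uint8_t *opt, size_t len, uint32_t tsoff)
{
    int adjusted = 0;
    size_t i = 0;

    while (i < len) {
        uint8_t kind = opt[i];
        uint8_t olen;

        if (kind == TCPOPT_EOL)
            break;
        if (kind == TCPOPT_NOP) {
            i++;
            continue;
        }
        if (i + 1 >= len)
            break; /* no room for a length byte */
        olen = opt[i + 1];
        if (olen < 2 || i + olen > len)
            break; /* malformed or truncated option */

        if (kind == TCPOPT_TIMESTAMP && olen == TCPOLEN_TIMESTAMP) {
            uint8_t *tsval = opt + i + 2;

            wr_be32(tsval, rd_be32(tsval) + tsoff);
            adjusted++; /* no early return: a duplicate may follow */
        }
        i += olen;
    }
    return adjusted;
}
```

Returning after the first match would leave a later duplicate untouched,
which is the value a last-one-wins receiver would actually use.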
Fixes: 48b1de4c110a ("netfilter: add SYNPROXY core/target") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>