KVM: SEV: Move sev_free_vcpu() down below sev_es_unmap_ghcb()
Relocate sev_free_vcpu() down in sev.c so that it's definition comes after
sev_es_unmap_ghcb(). This will allow sharing unmap functionality between
the two functions without needing a forward declaration (or weird placement
of the common code).
No functional change intended.
Cc: stable@vger.kernel.org Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-16-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: Don't WARN if memory is dirtied without a vCPU when the VM is dying
When marking a page dirty, complain about not having a running/loaded vCPU
if and only if the VM is still alive, i.e. its refcount is non-zero. This
will allow fixing a memory leak for x86 SEV-ES guests without hitting what
is effectively a false positive on the WARN.
For some SEV-ES VM-Exits, KVM keeps a writable mapping of a guest page
across an exit to userspace, and typically unmaps the page on the next
KVM_RUN. But if userspace never calls KVM_RUN after such an exit, then KVM
needs to unmap the page when the vCPU is destroyed, which in turn triggers
the WARN about not having a running vCPU.
Alternatively, SEV-ES could temporarily load the vCPU to suppress the WARN,
as is done in nested_vmx_free_vcpu() (but for completely unrelated reasons;
suppressing WARN from nested_put_vmcs12_pages() is pure happenstance). But
loading a vCPU during destruction is gross (ideally nVMX code would be
cleaned up), risks complicating the SEV-ES code (KVM would need to ensure
the temporarily load()+put() only runs when the vCPU isn't already loaded),
and is ultimately pointless.
The motivation for the WARN is to guard against KVM dirtying guest memory
without pushing the corresponding GFN to the active vCPU's dirty ring, e.g.
to ensure userspace doesn't miss a dirty page. But for the VM's refcount
to reach zero, there can't be _any_ userspace mappings to the dirty ring,
as mapping the dirty ring requires doing mmap() on the vCPU FD. I.e. if
userspace had a valid mapping for the dirty ring, then the vCPU file and
thus the owning VM would still be alive. And so since userspace can't
possibly reach the dirty ring, whether or not KVM technically "misses" a
push to the dirty ring is irrelevant.
Reported-by: Michael Roth <michael.roth@amd.com> Cc: stable@vger.kernel.org Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-15-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SEV: Read start/end indices of PSC requests exactly once per #VMGEXIT
Rework Page State Change (PSC) handling to read the guest-provided start
and end indices exactly once, at the beginning of the request. Re-reading
the indices is "fine", _if_ the guest is well-behaved. KVM _should_ be
safe against concurrent guest modification of the indices, but there is
zero reason to introduce unnecessary risk.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-14-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SEV: Add an anonymous "psc" struct to track current PSC metadata
Add a "psc" struct to vcpu_sev_es_state to avoid having to prefix all of
the fields with "psc_".
Take advantage of the code churn to opportunistically rename local
variables to "guest_psc" to make it more obvious that the buffer is guest
data, and more importantly, guest accessible!
Opportunistically rename inflight => batch_size as well, because there can
really only be one operation in-flight (per-vCPU), i.e. "inflight" _looks_
like a boolean, but in actuality is an integer tracking how many pages are
being handled by the current operation.
No functional change intended.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-13-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SEV: Make it more obvious when KVM is writing back the current PSC index
Increment the guest-visible "cur_entry" index outside of the for-loop
when processing Page State Change entries, and add a comment to make it
more obvious which code is operating on trusted data, and which code is
touching guest-accessible data.
No functional change intended.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-12-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
check_rstbl() validates restart table structure, but does not constrain
per-entry lcns_follow values relative to the entry size. A malformed
filesystem image can provide an oversized lcns_follow value, causing
the conversion memmove() to access memory beyond the bounds of the
allocated restart table buffer.
The same field is later used to bound iteration over page_lcns[],
so validating lcns_follow during conversion also prevents downstream
out-of-bounds access from the same malformed metadata.
Compute the maximum valid lcns_follow from the already-validated
restart table entry size and reject entries that exceed this bound.
Reuse the existing t16/t32 scratch variables already declared in
log_replay() to avoid introducing new declarations.
Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal") Cc: stable@vger.kernel.org Signed-off-by: Pavitra Jha <jhapavitra98@gmail.com>
[almaz.alexandrovich@paragon-software.com: fixed the conflicts] Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
fs/ntfs3: bound copy_lcns dp->page_lcns[] index in analysis pass
In log_replay()'s analysis pass, after find_dp() returns a
valid DIR_PAGE_ENTRY for the (target_attr, target_vcn) tuple,
the copy_lcns block walks lrh->lcns_follow further entries:
t16 = le16_to_cpu(lrh->lcns_follow);
for (i = 0; i < t16; i++) {
size_t j = (size_t)(le64_to_cpu(lrh->target_vcn) -
le64_to_cpu(dp->vcn));
dp->page_lcns[j + i] = lrh->page_lcns[i];
}
find_dp() only validates that target_vcn falls within
[dp->vcn, dp->vcn + dp->lcns_follow), i.e., that the FIRST
cluster is covered. The walk through the further entries is
not bounded against dp->lcns_follow. For a malformed LRH
where target_vcn = dp->vcn + dp->lcns_follow - 1 and
lrh->lcns_follow > 1, the i > 0 writes overflow the dp's
allocated page_lcns[] array.
Add the missing j + lrh->lcns_follow <= dp->lcns_follow guard.
Reproduced under UML+KASAN on mainline 8d90b09e6741 as a
slab-out-of-bounds write of size 8 from log_replay+0x68d4 on
the mount path.
This is distinct from Pavitra Jha's 2026-05-02 patch
("fs/ntfs3: validate lcns_follow in log_replay conversion",
<20260502154252.164586-1-jhapavitra98@gmail.com>) which
addresses the separate version-0 dirty-page-table conversion
path's memmove(&dp->vcn, ...) call. The two fixes are
complementary; both should land.
Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
[almaz.alexandrovich@paragon-software.com: clang-formatted the changes,
fixed conflicts] Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
In do_action()'s DeleteIndexEntryAllocation case, e->size comes
from an on-disk INDEX_BUFFER entry. When e->size makes
e + e->size point past hdr + hdr->used,
PtrOffset(e1, Add2Ptr(hdr, used)) returns a negative ptrdiff_t
that is silently cast to a quasi-infinite size_t when passed
to memmove(). The memmove then walks past the destination
buffer.
The sibling DeleteIndexEntryRoot case at fslog.c:3540-3543
already carries the corresponding guard:
Apply the same shape to the allocation-path case. Also reject
esize == 0: memmove(e, e, ...) is a no-op and leaves
hdr->used unchanged, hiding a malformed entry from the
existing check_index_header() walk.
Reproduced under UML+KASAN on mainline 8d90b09e6741 by
mounting a crafted NTFS image: the unguarded memmove takes a
length of 0xffffffffffffff00 and the kernel oopses in
memmove+0x81/0x1a0 on the do_action+0x36a2 frame.
Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
[almaz.alexandrovich@paragon-software.com: clang-formatted the changes] Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
fs/ntfs3: bound attr_off in UpdateResidentValue against data_off
In do_action()'s UpdateResidentValue case (fslog.c:3307),
lrh->attr_off and lrh->redo_len come from the on-disk LRH.
When they satisfy aoff + dlen < attr->res.data_off, the
assignment
underflows to ~4 GiB (e.g. 0xFFFFFFF9 when aoff=0x10, dlen=1,
data_off=0x18). Subsequent code that reads attr->res.data_size
to walk the resident attribute payload would then read up to
4 GiB past the 1024-byte MFT record allocation.
The existing mi_enum_attr() defense in fs/ntfs3/record.c:287
catches the corrupted data_size on the next attribute walk
and fails the mount, but only on the path that walks all
attributes. A read site that picks an attribute by name and
reads its data_size without re-validating is not covered.
Validate aoff against data_off and asize at the source.
Reproduced under UML+KASAN on mainline 8d90b09e6741 via
pr_warn-only probe: with aoff=0x10 and data_off=0x18, the
post-assignment data_size is 0xfffffff9 (mount then fails
at -22 from mi_enum_attr).
Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
[almaz.alexandrovich@paragon-software.com: clang-formatted the changes] Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Guan-Chun Wu [Sun, 31 May 2026 08:00:19 +0000 (16:00 +0800)]
ext4: improve str2hashbuf by processing 4-byte chunks and removing function pointers
The original byte-by-byte implementation with modulo checks is less
efficient. Refactor str2hashbuf_unsigned() and str2hashbuf_signed()
to process input in explicit 4-byte chunks instead of using a
modulus-based loop to emit words byte by byte.
Additionally, the use of function pointers for selecting the appropriate
str2hashbuf implementation has been removed. Instead, the functions are
directly invoked based on the hash type, eliminating the overhead of
dynamic function calls.
Performance test (x86_64, Intel Core i7-10700 @ 2.90GHz, average over 10000
runs, using kernel module for testing):
Guan-Chun Wu [Sun, 31 May 2026 08:00:18 +0000 (16:00 +0800)]
ext4: add Kunit coverage for directory hash computation
Introduce Kunit tests for fs/ext4/hash.c to verify ext4fs_dirhash()
across the legacy, half-MD4, and TEA hash variants.
The tests cover empty, seeded hashing, and non-ASCII name handling.
They also verify error paths, including invalid hash versions and
SipHash without a configured key, and check that the signed and
unsigned hash variants differ on non-ASCII input as expected.
When CONFIG_UNICODE is enabled, the tests further verify casefolded-name
hashing and the fallback behavior for invalid input.
Li RongQing [Wed, 3 Jun 2026 12:37:08 +0000 (20:37 +0800)]
dma-debug: fix physical address retrieval in debug_dma_sync_sg_for_device
In debug_dma_sync_sg_for_device(), when iterating over a scatterlist,
the debug entry population mistakenly uses the head of the scatterlist
'sg' to fetch the physical address via sg_phys(), instead of using the
current iterator variable 's'.
This causes dma-debug to track the physical address of the very first
scatterlist entry for all subsequent entries in the list.
Fix this by passing the correct loop iterator 's' to sg_phys()
Li Chen [Fri, 15 May 2026 09:18:27 +0000 (17:18 +0800)]
ext4: fast commit: export snapshot stats in fc_info
Snapshot-based fast commit can fall back when the commit-time snapshot
cannot be built (e.g. extent status cache misses). It is useful to
quantify the updates-locked window and to see why snapshotting failed.
Add best-effort snapshot counters to the ext4 superblock and extend
/proc/fs/ext4/<sb_id>/fc_info to report the number of snapshotted
inodes and ranges, snapshot failure reasons, and the average/max time
spent with journal updates locked.
Li Chen [Fri, 15 May 2026 09:18:26 +0000 (17:18 +0800)]
ext4: fast commit: add lock_updates tracepoint
Commit-time fast commit snapshots run under jbd2_journal_lock_updates(),
so it is useful to quantify the time spent with updates locked and to
understand why snapshotting can fail.
Add a new tracepoint, ext4_fc_lock_updates, reporting the time spent in
the updates-locked window along with the number of snapshotted inodes
and ranges. Record the first snapshot failure reason in a stable snap_err
field for tooling.
Li Chen [Fri, 15 May 2026 09:18:25 +0000 (17:18 +0800)]
ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots
Commit-time snapshots run under jbd2_journal_lock_updates(), so the work
done there must stay bounded.
The snapshot path still used ext4_map_blocks() to build data ranges. This
can take i_data_sem and pulls the mapping code into the snapshot logic.
Build inode data range snapshots from the extent status tree instead.
The extent status tree is a cache, not an authoritative source. If the
needed information is missing or unstable (e.g. delayed allocation), treat
the transaction as fast commit ineligible and fall back to full commit.
Also cap the number of inodes and ranges snapshotted per fast commit and
allocate range records from a dedicated slab cache. The inode pointer
array is allocated outside the updates-locked window.
Testing: QEMU/KVM guest, virtio-pmem + dax, ext4 -O fast_commit, mounted
dax,noatime. Ran python3 500x {4K write + fsync}, fallocate 256M, and
python3 500x {creat + fsync(dir)} without lockdep splats or errors.
Li Chen [Fri, 15 May 2026 09:18:24 +0000 (17:18 +0800)]
ext4: fast commit: avoid self-deadlock in inode snapshotting
ext4_fc_snapshot_inodes() used igrab()/iput() to pin inodes while building
commit-time snapshots. With ext4_fc_del() waiting for
EXT4_STATE_FC_COMMITTING, iput() can trigger
ext4_clear_inode()->ext4_fc_del() in the commit thread and deadlock waiting
for the fast commit to finish.
ext4_fc_del() also has to re-check EXT4_STATE_FC_COMMITTING after
waiting on EXT4_STATE_FC_FLUSHING_DATA. The commit thread clears
FLUSHING_DATA before it sets COMMITTING, so a waiter woken from the
flush wait must not delete the inode based on an old COMMITTING
check.
Avoid taking extra references. Collect inode pointers under s_fc_lock and
rely on EXT4_STATE_FC_COMMITTING to pin inodes until ext4_fc_cleanup()
clears the bit.
Also set EXT4_STATE_FC_COMMITTING for create-only inodes referenced
from the dentry update queue, and wake up waiters when ext4_fc_cleanup()
clears the bit.
Li Chen [Fri, 15 May 2026 09:18:23 +0000 (17:18 +0800)]
ext4: fast commit: avoid waiting for FC_COMMITTING
ext4_fc_track_inode() can be called while holding i_data_sem (e.g.
fallocate). Waiting for EXT4_STATE_FC_COMMITTING in that case risks an
ABBA deadlock: i_data_sem -> wait(FC_COMMITTING) vs FC_COMMITTING ->
wait(i_data_sem) in the commit task.
Now that fast commit snapshots inode state at commit time, updates during
log writing do not need to block. Drop the wait and lockdep assertion in
ext4_fc_track_inode(), and make ext4_fc_del() wait for FC_COMMITTING so an
inode cannot be removed while the commit thread is still using it.
When an inode is modified during a fast commit, mark it with
EXT4_STATE_FC_REQUEUE so cleanup keeps it queued for the next fast commit.
This is needed because jbd2_fc_end_commit() invokes the cleanup callback
with tid == 0, so tid-based requeue logic would requeue every inode.
Testing: tracepoint ext4:ext4_fc_commit_stop with two fsyncs in the same
transaction. nblks is the number of journal blocks written for that fast
commit. Before this change, the second fsync still wrote almost the same
fast commit log (nblks 10->9), because tid == 0 in jbd2_fc_end_commit()
caused the tid-based requeue logic to keep all inodes queued. After this
change, only inodes modified during the commit are requeued, and the
second fsync wrote a nearly empty fast commit (nblks 10->1).
Li Chen [Fri, 15 May 2026 09:18:22 +0000 (17:18 +0800)]
ext4: lockdep: handle i_data_sem subclassing for special inodes
Fast commit can hold s_fc_lock while writing journal blocks. Mapping the
journal inode can take its i_data_sem. Normal inode update paths can take a
data inode i_data_sem and then s_fc_lock, which makes lockdep report a
circular dependency.
lockdep treats all i_data_sem instances as one lock class and cannot
distinguish the journal inode i_data_sem from a regular inode i_data_sem.
The journal inode is not tracked by fast commit and no FC waiters ever
depend on it, so this is not a real ABBA deadlock. Assign the journal inode
a dedicated i_data_sem lockdep subclass to avoid the false positive.
Inode cache objects can be recycled, so also reset i_data_sem to
I_DATA_SEM_NORMAL when allocating an ext4 inode. Otherwise a new inode may
inherit an old subclass (journal/quota/ea) and trigger lockdep warnings.
Li Chen [Fri, 15 May 2026 09:18:21 +0000 (17:18 +0800)]
ext4: fast commit: snapshot inode state before writing log
Fast commit writes inode metadata and data range updates after unlocking
journal updates. New handles can start at that point, so the log writing
path must not look at live inode state.
Add a commit-time per-inode snapshot and populate it while journal updates
are locked and existing handles are drained. Store the snapshot behind
ext4_inode_info->i_fc_snap so ext4_inode_info only grows by one pointer.
The snapshot contains a copy of the on-disk inode plus the data range
records needed for fast commit TLVs.
Snapshotting runs under jbd2_journal_lock_updates(). Avoid triggering I/O
there by using ext4_get_inode_loc_noio() and falling back to full commit
if the inode table block is not present or not uptodate.
Log writing then only serializes the snapshot, so it no longer needs to
call ext4_map_blocks() and take i_data_sem under s_fc_lock. The snapshot
is installed and freed under s_fc_lock and is released from fast commit
cleanup and inode eviction.
Junrui Luo [Wed, 13 May 2026 09:28:40 +0000 (17:28 +0800)]
jbd2: fix integer underflow in jbd2_journal_initialize_fast_commit()
jbd2_journal_initialize_fast_commit() validates journal capacity by
checking (journal->j_last - num_fc_blks < JBD2_MIN_JOURNAL_BLOCKS).
Both j_last and num_fc_blks are unsigned, so when num_fc_blks exceeds
j_last the subtraction wraps to a large value, bypassing the bounds
check.
The resulting underflow corrupts j_last, j_fc_first, and j_free,
leading to journal abort.
Fix by checking num_fc_blks against j_last before the subtraction,
returning -EFSCORRUPTED.
Fixes: 6866d7b3f2bb ("ext4 / jbd2: add fast commit initialization") Reported-by: Yuhao Jiang <danisjiang@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Fixes: e029c5f27987 ("ext4: make num of fast commit blocks configurable") Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Fixes: e029c5f279872 ("ext4: make num of fast commit blocks configurable") Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/SYBPR01MB7881663C927DE9D7BBF4D1DFAF062@SYBPR01MB7881.ausprd01.prod.outlook.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Li Chen [Wed, 13 May 2026 08:58:17 +0000 (16:58 +0800)]
ext4: fix fast commit wait/wake bit mapping on 64-bit
On 64-bit, ext4 dynamic inode states live in the upper half of i_flags,
and ext4_test_inode_state() applies the corresponding +32 offset.
The fast-commit wait and wake paths open-coded the wait key with the raw
EXT4_STATE_* value. Add small helpers for the state wait word and bit,
and use them for the FC_COMMITTING and FC_FLUSHING_DATA waits so the wait
key follows the same mapping as the state helpers.
Fixes: 857d32f26181 ("ext4: rework fast commit commit path") Reported-by: Sashiko AI review <sashiko-bot@kernel.org> Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260513085818.552432-1-me@linux.beauty Signed-off-by: Theodore Ts'o <tytso@mit.edu>
However, h_transaction may legitimately be NULL for an aborted handle.
The is_handle_aborted() helper in include/linux/jbd2.h explicitly
treats !h_transaction as one of the aborted states:
if (handle->h_aborted || !handle->h_transaction)
return 1;
Every other entry point in fs/jbd2/transaction.c
(jbd2_journal_get_{write,undo,create}_access, jbd2_journal_extend,
jbd2_journal_restart, jbd2_journal_stop, etc.) guards against this
with an is_handle_aborted() check before any dereference of
h_transaction. jbd2_journal_dirty_metadata() was missing this guard.
This is reachable from ocfs2's xattr code. ocfs2_xa_set() intentionally
falls through to ocfs2_xa_journal_dirty() even after
ocfs2_xa_prepare_entry() fails, on the assumption that the buffer
needs to be journaled to record any partial modifications (see the
comment above the out_dirty label in fs/ocfs2/xattr.c). If the failure
was caused by the journal being aborted -- e.g. an underlying I/O
error during a sub-operation such as __ocfs2_remove_xattr_range() --
the handle's h_transaction has been cleared by the abort path, and
the unconditional deref in jbd2_journal_dirty_metadata() becomes a
NULL deref.
Reproduced by syzbot with a crafted ocfs2 image where I/O against the
loop device backing the mount is sabotaged via LOOP_SET_STATUS64
between two setxattr() calls, causing the second setxattr (which
truncates an external xattr value) to abort the journal mid-flight:
Oops: general protection fault, probably for non-canonical
address 0xdffffc0000000000
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: jbd2_journal_dirty_metadata+0x4a/0xd30 fs/jbd2/transaction.c:1520
Call Trace:
ocfs2_journal_dirty+0x130/0x700 fs/ocfs2/journal.c:831
ocfs2_xa_journal_dirty fs/ocfs2/xattr.c:1483 [inline]
ocfs2_xa_set+0x15e3/0x2ec0 fs/ocfs2/xattr.c:2294
ocfs2_xattr_block_set+0x3e0/0x33c0 fs/ocfs2/xattr.c:3016
__ocfs2_xattr_set_handle+0x6b3/0xf50 fs/ocfs2/xattr.c:3418
ocfs2_xattr_set+0xf3f/0x13e0 fs/ocfs2/xattr.c:3681
__vfs_setxattr+0x43c/0x480 fs/xattr.c:218
...
Fix by adding the standard is_handle_aborted() guard at the top of
jbd2_journal_dirty_metadata() and returning -EROFS, matching the
pattern used by every other entry point in this file.
ocfs2_journal_dirty() already handles a non-zero return from
jbd2_journal_dirty_metadata() correctly.
Reported-by: syzbot+98f651460e558a21baae@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=98f651460e558a21baae Tested-by: syzbot+98f651460e558a21baae@syzkaller.appspotmail.com Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://patch.msgid.link/20260507050605.50081-1-kartikey406@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Avinash Bhatt [Sun, 31 May 2026 10:53:08 +0000 (13:53 +0300)]
wifi: iwlwifi: mld: add KUnit tests for link grading
Add tests for the link grading algorithm covering per-bandwidth
grading tables, channel load calculation, 6 GHz RSSI adjustments
including duplicated beacon and PSD/EIRP compensation, and
puncturing penalty.
Shahar Tzarfati [Sun, 31 May 2026 10:53:06 +0000 (13:53 +0300)]
wifi: iwlwifi: mld: drop TLC config cmd v4/v5 compat code
FW core102 bumped TLC_MNG_CONFIG_CMD_API_S from version 5 to
version 6. The v4 and v5 compatibility paths in
iwl_mld_send_tlc_cmd() are no longer reachable on any supported
firmware.
Miri Korenblit [Sun, 31 May 2026 10:53:05 +0000 (13:53 +0300)]
wifi: iwlwifi: mvm: remove __must_check annotation from command sending
We don't acually need to always check the return value. For example, if
we send a command to remove an object - we can assume success
(if it fails it is probably because the fw is dead, and then it doesn't
have the object anyway).
Miri Korenblit [Sun, 31 May 2026 10:53:04 +0000 (13:53 +0300)]
wifi: iwlwifi: trans: export the maximum supported hcmd size
Export the maximum allowed host command payload size to the op-modes.
Note that this information was available to the op-modes also before
this change, this just adds a clear macro.
Shahar Tzarfati [Sun, 31 May 2026 10:53:03 +0000 (13:53 +0300)]
wifi: iwlwifi: stop supporting core101
BZ, DR and SC no longer need to accept core101 firmware.
Raise the minimum supported firmware core from 101 to 102 so
these families only match supported core102 and newer images.
Shahar Tzarfati [Sun, 31 May 2026 10:53:02 +0000 (13:53 +0300)]
wifi: iwlwifi: remove orphaned DC2DC config enum
FW core102 removed both DC2DC_CONFIG_CMD_API_S and
DC2DC_CONFIG_CMD_RSP_API_S. The only driver-side artifact is
enum iwl_dc2dc_config_id in fw/api/config.h, which has no
callers in any .c file across all driver paths (mld/mvm/xvt).
Ever since the TFD queue size is no longer limited to 256 entries,
this code has been wrong, and might erroneously not detect a move
if it was by a multiple of 256. Not a big deal, but fix it while
I see it.
Our binding handling for P2P-Device can run into the following
scenario, as observed by our testing:
- a station interface is connected on some channel
- the P2P-Device does a remain-on-channel (ROC) on that channel
- the ROC ends, and the P2P-Device is removed from the binding,
but the phy_ctxt pointer is left around as a PHY cache so we
don't need to recalibrate to the channel again and again in
case it's not shared
- a binding update by the station interface, even a removal,
will re-add the P2P-Device to the binding
- the P2P-Device is removed, which removes the PHY context, but
it's still in the binding so the firmware crashes
Since the P2P device is removed from the binding and only re-
added by unrelated code, but we want to keep the phy_ctxt around
as a cache for future ROC usage, fix it by adding a boolean that
indicates whether or not the P2P-Device should be added to the
binding, and handle that in the binding iterator. That way, the
station interface cannot re-add the P2P-Device to the binding
when that isn't active.
Add KUnit tests to verify RSSI adjustment for 6 GHz duplicated
beacons across different operational bandwidths and validate
detection of the duplicated beacon bit.
Johannes Berg [Wed, 27 May 2026 20:05:09 +0000 (23:05 +0300)]
wifi: iwlwifi: mld: don't WARN on WoWLAN suspend w/o netdetect
Clearly, from a user perspective, it must be valid to configure
WoWLAN and then suspend while not connected to a network. Since
mac80211 doesn't distinguish these cases and simply calls the
driver to suspend whenever WoWLAN is configured, the driver has
to cleanly handle the case where it's called for WoWLAN, it's
not connected but there's also no netdetect configured.
Remove the WARN_ON() and keep returning 1 to disconnect and
then suspend.
Shahar Tzarfati [Wed, 27 May 2026 20:05:08 +0000 (23:05 +0300)]
wifi: iwlwifi: cfg: Revert "wifi: iwlwifi: cfg: move the MODULE_FIRMWARE to the per-rf file"
IWL_BZ_UCODE_CORE_MAX is undefined in cfg/rf-fm.c, this
causes __stringify(core) to turn it into the literal
token text, so MODULE_FIRMWARE entries are generated as
"iwlwifi...-cIWL_BZ_UCODE_CORE_MAX.ucode",
instead of the actual number.
wifi: iwlwifi: mld: set fast-balance scan for active EMLSR
While associated to MLD AP with active EMLSR, set all scan
operations as fast-balance scans. The only exception is when a
fragmented scan is planned (high traffic or low latency), in
which case the fragmented scan is preserved.
Miri Korenblit [Wed, 27 May 2026 20:05:05 +0000 (23:05 +0300)]
wifi: iwlwifi: mld: always allow mimo in NAN
The mimo field of the sta command is badly named. It really carries the
initial SMPS value as it is in the association request of the client
station (when we are the AP).
In NAN we don't have this information, just mark SMPS as disabled.
Given that we now use v3 rates with FW index throughout,
_to_hwrate() is confusing, since the hardware still uses
the PLCP value, the driver just doesn't see that now (as
it talks to firmware, not hardware.)
Rename this to iwl_mvm_rate_idx_to_fw_idx() to more
clearly indicate what it's doing.
Moriya Itzchaki [Wed, 27 May 2026 20:05:02 +0000 (23:05 +0300)]
wifi: iwlwifi: fix STEP_URM register address for SC devices
The CNVI_PMU_STEP_FLOW register address differs between device families.
For SC and newer devices, the register is at 0xA2D688,
while for BZ devices it's at 0xA2D588.
Miri Korenblit [Wed, 27 May 2026 20:05:01 +0000 (23:05 +0300)]
wifi: iwlwifi: mld: fix smatch warning
We dereference the mld_sta pointer before checking for NULL.
But we do check the sta pointer, and sta != NULL means mld_sta != NULL,
so there is no real issue.
Fix it anyway to silence the warning.
Miri Korenblit [Wed, 27 May 2026 20:04:59 +0000 (23:04 +0300)]
wifi: iwlwifi: remove stale comment
iwl_pcie_set_hw_ready still returns the return value of iwl_poll_bits,
but the latter one no longer returns the time elapsed until success, now it
returns either success or failure.
Remove the comment entirely.
Johannes Berg [Wed, 27 May 2026 20:04:58 +0000 (23:04 +0300)]
wifi: iwlwifi: fw: cut down NIC wakeups during dump
Currently, the dump code attempts to dump any number of
memories and register banks, as defined by the firmware.
Especially when the device is failing, this can lead to
excessive time spent attempting to acquire NIC access
over and over again.
Improve the code to only attempt to acquire NIC access
once or twice, but using the new memory dump functions
that may drop the spinlock etc. Mark all dump regions
that require NIC access, and skip them if we couldn't
obtain that.
In order to avoid CPU latency due to the increased time
holding the spinlock (and possibly disabling softirqs),
drop locks and call cond_resched() after each section
(if holding NIC access) but don't release HW NIC access.
AX231 is a device that is based on AX211 that doesn't support 6E and
its bandwidth is limited to 80 MHz.
Just reuse the radio config from AX203 which has the exact same
characteristics.
It has a specific subdevice ID to allow the driver to differentiate
between AX211 and AX231.
Maxwell Doose [Sun, 31 May 2026 23:38:28 +0000 (18:38 -0500)]
iio: chemical: scd30: Replace manual locking with RAII locking
scd30_core.c currently uses manual mutex_lock() and mutex_unlock()
calls. Replace them with the newer guard(mutex)() for cleaner RAII
patterns and to improve maintainability.
Add new helper function scd30_trigger_handler_helper() containing
the critical section for scd30_trigger_handler().
In addition, small refactor to replace "?:" operator with regular
if/else returns.
Heiko Carstens [Tue, 26 May 2026 05:57:02 +0000 (07:57 +0200)]
s390/percpu: Provide arch_this_cpu_write() implementation
Provide an s390 specific implementation of arch_this_cpu_write()
instead of the generic variant. The generic variant uses a quite
expensive raw_local_irq_save() / raw_local_irq_restore() pair.
Get rid of this by providing an own variant which makes use of the new
percpu code section infrastructure.
With this the text size of the kernel image is reduced by ~1k (defconfig).
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Heiko Carstens [Tue, 26 May 2026 05:57:01 +0000 (07:57 +0200)]
s390/percpu: Provide arch_this_cpu_read() implementation
Provide an s390 specific implementation of arch_this_cpu_read() instead
of the generic variant. The generic variant uses preempt_disable() /
preempt_enable() pair and READ_ONCE().
Get rid of the preempt_disable() / preempt_enable() pairs by providing an
own variant which makes use of the new percpu code section infrastructure.
With this the text size of the kernel image is reduced by ~1k
(defconfig). Also 87 generated preempt_schedule_notrace() function
calls within the kernel image (modules not counted) are removed.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Heiko Carstens [Tue, 26 May 2026 05:57:00 +0000 (07:57 +0200)]
s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
Convert arch_this_cpu_[and|or]() to make use of the new percpu code
section infrastructure.
There is no user of this_cpu_and() and only one user of this_cpu_or()
within the kernel. Therefore this conversion has hardly any effect,
and also removes only preempt_schedule_notrace() function call.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Heiko Carstens [Tue, 26 May 2026 05:56:59 +0000 (07:56 +0200)]
s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
Convert arch_this_cpu_add_return() to make use of the new percpu code
section infrastructure.
With this the text size of the kernel image is reduced by ~4k
(defconfig). Also 66 generated preempt_schedule_notrace() function
calls within the kernel image (modules not counted) are removed.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Heiko Carstens [Tue, 26 May 2026 05:56:58 +0000 (07:56 +0200)]
s390/percpu: Use new percpu code section for arch_this_cpu_add()
Convert arch_this_cpu_add() to make use of the new percpu code section
infrastructure.
With this the text size of the kernel image is reduced by ~76kb
(defconfig). Also more than 5300 generated preempt_schedule_notrace()
function calls within the kernel image (modules not counted) are removed.
With:
DEFINE_PER_CPU(long, foo);
void bar(long a) { this_cpu_add(foo, a); }
Heiko Carstens [Tue, 26 May 2026 05:56:56 +0000 (07:56 +0200)]
s390/percpu: Infrastructure for more efficient this_cpu operations
With the intended removal of PREEMPT_NONE this_cpu operations based on
atomic instructions, guarded with preempt_disable()/preempt_enable() pairs
become more expensive: the preempt_disable() / preempt_enable() pairs are
not optimized away anymore during compile time.
In particular the conditional call to preempt_schedule_notrace() after
preempt_enable() adds additional code and register pressure.
E.g. this simple C code sequence
DEFINE_PER_CPU(long, foo);
long bar(long a) { return this_cpu_add_return(foo, a); }
Even though the above example is more or less the worst case, since the
branch to preempt_schedule_notrace() requires a stackframe, which
otherwise wouldn't be necessary, there is also the conditional jnhe branch
instruction.
Get rid of the conditional branch with the following code sequence:
The general idea is that this_cpu operations based on atomic instructions
are guarded with mviy instructions:
- The first mviy instruction writes the register number, which contains
the percpu address variable to lowcore. This also indicates that a
percpu code section is executed.
- The first instruction following the mviy instruction must be the ag
instruction which adds the percpu offset to the percpu address register.
- Afterwards the atomic percpu operation follows.
- Then a second mviy instruction writes a zero to lowcore, which indicates
the end of the percpu code section.
- In case of an interrupt/exception/nmi the register number which was
written to lowcore is copied to the exception frame (pt_regs), and a zero
is written to lowcore.
- On return to the previous context it is checked if a percpu code section
was executed (saved register number not zero), and if the process was
migrated to a different cpu. If the percpu offset was already added to
the percpu address register (instruction address does _not_ point to the
ag instruction) the content of the percpu address register is adjusted so
it points to percpu variable of the new cpu.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
s390/zcrypt: Replace get_zeroed_page() with kzalloc()
zcrypt_rng_device_add() allocates a buffer for the software random
number generator data cache.
This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().
Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
s390/trng: Replace __get_free_page() with kmalloc()
trng_read() allocates a temporary staging buffer for CPACF TRNG
random data before copying it to userspace.
This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of __get_free_page() with kmalloc() and free_page() with
kfree().
s390/qeth: Replace get_zeroed_page() with kzalloc()
qeth_get_trap_id() allocates a temporary buffer for STSI system
information queries used to build trap identification strings.
This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().
s390/hvc_iucv: Replace get_zeroed_page() with kzalloc()
hvc_iucv_alloc() allocates a send staging buffer for accumulating
outbound terminal characters before they are copied into a separate
IUCV message buffer for transmission to the hypervisor. The staging
buffer itself is never passed to any IUCV function.
This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().
s390/dasd: Replace get_zeroed_page() with kzalloc()
DASD driver uses get_zeroed_page() to allocate pages for the Extended Error
Reporting software ring buffer and for a scratch buffer for formatting
sense dump diagnostic text.
These buffers can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().
s390/con3270: Replace __get_free_page() with kmalloc()
con3270_alloc_view() allocates a staging buffer used to assemble
3270 datastream content before it is copied into channel program
requests.
This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of __get_free_page() with kmalloc() and free_page() with
kfree().
John Hubbard [Wed, 3 Jun 2026 07:30:22 +0000 (16:30 +0900)]
gpu: nova-core: Hopper/Blackwell: select FSP Chain of Trust version
The FSP Chain of Trust handshake is versioned: Hopper speaks version 1
and Blackwell speaks version 2. Provide the version through the FSP HAL
so the boot message carries the value FSP expects, and so chipsets that
do not use FSP need not express a version at all.
FSP exchanges are request/response: the driver sends an MCTP/NVDM
message and must match the reply against the request before acting on
it. Add the synchronous send-and-wait path that validates the response
transport and message headers and confirms the reply corresponds to the
request that was sent.
John Hubbard [Wed, 3 Jun 2026 07:30:20 +0000 (16:30 +0900)]
gpu: nova-core: add MCTP/NVDM protocol types for firmware communication
Add the MCTP (Management Component Transport Protocol) and NVDM (NVIDIA
Data Model) wire-format types used for communication between the kernel
driver and GPU firmware processors.
This includes typed MCTP transport headers, NVDM message headers, and
NVDM message type identifiers. Both the FSP boot path and the upcoming
GSP RPC message queue share this protocol layer.
FSP communication uses a pair of non-circular queues in the FSP
falcon's EMEM, one for messages from the driver to FSP and one for
replies, with the driver polling for response data. Add the queue
registers and the low-level helpers used by the higher-level FSP
message layer.
Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Eliot Courtney <ecourtney@nvidia.com> Link: https://patch.msgid.link/20260603-b4-blackwell-v13-2-d9f3a06939e0@nvidia.com
[acourbot: align register fields names with OpenRM.]
[acourbot: represent registers as arrays of 8 instances, as per OpenRM.] Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Add external memory (EMEM) read/write operations to the GPU's FSP falcon
engine. These operations use Falcon PIO (Programmed I/O) to communicate
with the FSP through indirect memory access.
Jinyu Tang [Sun, 17 May 2026 15:34:27 +0000 (23:34 +0800)]
KVM: riscv: Fast-path dirty logging write faults
With dirty logging enabled, guest writes often fault on an existing 4K
G-stage leaf that was write-protected only for dirty tracking. The slow
path still performs the full fault handling flow and takes mmu_lock for
write, even though the page-table shape does not change.
x86 handles the analogous case in its fast page fault path by atomically
making a writable SPTE writable again when the fault is only a
write-protection fault. Add the same style of fast path for RISC-V. If a
write fault hits an existing 4K leaf in a writable dirty-log memslot,
mark the page dirty and atomically set the PTE writable and dirty under
the read side of mmu_lock.
The dirty bitmap is updated before the PTE becomes writable again. The
PTE D bit is also set so systems that trap on a clear D bit do not fall
back to the slow path for a writable but clean PTE.
When a fault hits an existing G-stage leaf with the same PFN, KVM only
needs to update the PTE permissions. This path will be used by read-side
fault handling, so it must not overwrite a concurrent PTE update.
Use the cmpxchg helper when relaxing permissions on an existing leaf,
following the same concurrency model used by x86 for atomic SPTE
permission updates. Retry if another CPU changed the PTE first, and use
cpu_relax() while spinning.
Jinyu Tang [Sun, 17 May 2026 15:34:25 +0000 (23:34 +0800)]
KVM: riscv: Add a G-stage PTE cmpxchg helper
Permission-only G-stage PTE updates can run in parallel once they are
moved to the read side of mmu_lock. Plain set_pte() is not enough for
that case because another CPU may update the same PTE first.
x86 handles the same class of SPTE races with cmpxchg-based updates in
its fast page fault and TDP MMU paths. Add a small RISC-V helper for
atomic G-stage PTE updates. The helper reports contention to the caller
and flushes the target range only when the PTE value actually changes.
Jinyu Tang [Sun, 17 May 2026 15:34:24 +0000 (23:34 +0800)]
KVM: riscv: Use an rwlock for mmu_lock
RISC-V KVM currently uses a spinlock for mmu_lock. That serializes all
G-stage MMU operations, including permission-only updates that do not
allocate or free page-table pages.
Use KVM's rwlock form of mmu_lock, as x86 and arm64 already do. Keep the
existing map, unmap and teardown paths on the write side. This prepares
RISC-V for read-side handling of G-stage permission updates.
Yosry Ahmed [Thu, 28 May 2026 23:10:52 +0000 (16:10 -0700)]
KVM: selftests: Add a test for gPAT handling in L2
When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled, verify that KVM
correctly virtualizes the host PAT MSR and the guest PAT register for
nested SVM guests.
With nested NPT disabled:
* L1 and L2 share the same PAT
* The vmcb12.g_pat is ignored
With nested NPT enabled:
* An invalid g_pat in vmcb12 causes VMEXIT_INVALID
* RDMSR(IA32_PAT) from L2 returns the value of the guest PAT register
* WRMSR(IA32_PAT) from L2 is reflected in vmcb12's g_pat on VMEXIT
* RDMSR(IA32_PAT) from L1 returns the value of the host PAT MSR
Verify that save/restore with the vCPU in guest mode behaves as expected in
both cases, e.g. preserves both hPAT and gPAT when NPT is enabled.
Originally-by: Jim Mattson <jmattson@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org>
[sean: use even fancier macro shenanigans] Link: https://patch.msgid.link/20260528231052.404737-1-seanjc@google.com
[sean: avoid use of goto, print skips] Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: selftests: Add guest_memfd regression test signed offset+size bug
Add a regression (and proof-of-bug) testcase to ensure KVM rejects an
offset+size that would result in a negative value when computed as a signed
64-bit value. KVM had a flaw where it would allow binding a memslot to a
guest_memfd instance even with a wildly out-of-range offset, if the offset
and size were both positive values, but the combined offset+size was
negative.
Use "0x7fffffffffffffffull - page_size", i.e. "INT64_MAX - page_size", for
the offset as the size of the guest_memfd file must be at least page_size
(KVM requires memslots and gmem files to be host page-size aligned). I.e.
"INT64_MAX - page_size + size" is guaranteed to generate an offset+size
that is negative when converted to a signed 64-bit value *and* honors KVM's
alignment requirements.
KVM: selftests: Expand the guest_memfd test macros to allow passing the VM
Expand the gmem test macros to allow passing the VM to testcases, without
needing to plumb the VM into _every_ testcase, as the vast majority of
testcases only need the fd and size.
KVM: guest_memfd: Treat memslot binding offset+size as unsigned values
When binding a memslot to a guest_memfd file, treat the offset and size as
unsigned values to fix a bug where the sum of the two can result in a false
negative when checking for overflow against the size of the file. Passing
unsigned values also avoids relying on somewhat obscure checks in other
flows for safety, and tracks the offset and size as they are intended to be
tracked, as unsigned values.
On 64-bit kernels, the number of pages a memslot contains and thus the size
(and offset) of its guest_memfd binding are unsigned 64-bit values. Taking
the offset+size as an loff_t instead of a uoff_t inadvertently converts
the unsigned value to a signed value if the offset and/or size is massive.
Locally storing the offset and size as signed values is benign in and of
itself (though even that is *extremely* difficult to discern), but
operating on their sum is not.
For the offset, KVM explicitly checks against a negative value, which might
seem like a bug as KVM could incorrectly reject a legitimate binding, but
that's not actually the case as KVM_CREATE_GUEST_MEMFD takes a signed value
for its size, i.e. a would-be-negative offset is also greater than the
maximum possible size of any guest_memfd file.
Regarding the size, while KVM lacks an explicit check for a negative value,
i.e. seemingly has a flawed overflow check, KVM restricts the number of
pages in a single memslot to the largest positive signed 32-bit value:
if (id < KVM_USER_MEM_SLOTS &&
(mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
return -EINVAL;
and so that maximum "size" will ever be is 0x7fffffff000.
The sum of the two is, however, problematic. While the size is restricted
by KVM's memslot logic, the offset is not, i.e. the offset is completely
unchecked until the "offset + size > i_size_read(inode)" check. If the
offset is the (nearly) largest possible _positive_ value, then adding size
to the offset can result in a signed, negative 64-bit value. When compared
against the size of the file (guaranteed to be positive), the negative sum
is always smaller, and KVM incorrectly allows the absurd offset.
Opportunistically add missing includes in kvm_mm.h (instead of relying on
its parents).
Fixes: a7800aa80ea4 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory") Cc: stable@vger.kernel.org Cc: Ackerley Tng <ackerleytng@google.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Reviewed-by: Ackerley Tng <ackerleytng@google.com> Link: https://patch.msgid.link/20260602170921.1304394-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Jason Gunthorpe [Mon, 1 Jun 2026 16:52:31 +0000 (13:52 -0300)]
RDMA/umem: Fix truncation for block sizes >= 4G
When the iommu is used the linearization of the mapping can give a single
block that is very large split across multiple SG entries.
When __rdma_block_iter_next() reassembles the split SG entries it is
overflowing the 32 bit stack values and computed the wrong DMA addresses
for blocks after the truncation.
KVM: x86: Move async #PF helpers to x86.h (as inlines)
Move kvm_pv_async_pf_enabled() and kvm_async_pf_hash_reset() to x86.h in
anticipation of extracting the majority of register and MSR specific code
out of x86.c.
Move update_cr8_intercept() to lapic.c so that it's globally visible
in anticipation of extracting most of the register-specific code out of
x86.c and into a new compilation unit. Opportunistically prefix the
helper kvm_lapic_ to make its role/scope more obvious.
KVM: x86: Harden is_64_bit_hypercall() against bugs on 32-bit kernels
Unconditionally return %false for is_64_bit_hypercall() on 32-bit kernels
to guard against incorrectly setting guest_state_protected, and because
in a (very) hypothetical world where 32-bit KVM supports protected guests,
assuming a hypercall was made in 64-bit mode is flat out wrong.
KVM: nSVM: Use kvm_rax_read() now that it's mode-aware
Now that kvm_rax_read() truncates the output value to 32 bits if the
vCPU isn't in 64-bit mode, use it instead of the more verbose (and very
technically slower) kvm_register_read().
Note! VMLOAD, VMSAVE, and VMRUN emulation are still technically buggy,
as they can use EAX (versus RAX) in 64-bit mode via an operand size
prefix. Don't bother trying to handle that case, as it would require
decoding the code stream, which would open an entirely different can of
worms, and in practice no sane guest would shove garbage into RAX[63:32]
and then execute VMLOAD/VMSAVE/VMRUN with just EAX.
Drop the non-raw, mode-aware kvm_<reg>_write() helpers as there is no
usage in KVM, and in all likelihood there will never be usage in KVM as
use of hardcoded registers in instructions is uncommon, and *modifying*
hardcoded registers is practically unheard of. While there are a few
instructions that modify registers in mode-aware ways, e.g. REP string
and some ENCLS varieties, the odds of KVM needing to emulate such
instructions (outside of the fully emulator) are vanishingly small.
Drop kvm_<reg>_write() to prevent incorrect usage; _if_ a new instruction
comes along that needs to modify a hardcoded register, this can be
reverted.
KVM: x86: Add mode-aware versions of kvm_<reg>_{read,write}() helpers
Make kvm_<reg>_{read,write}() mode-aware (where the value is truncated to
32 bits if the vCPU isn't in 64-bit mode), and convert all the intentional
"raw" accesses to kvm_<reg>_{read,write}_raw() versions. To avoid
confusion and bikeshedding over whether or not explicit 32-bit accesses
should use the "raw" or mode-aware variants, add and use "e" versions, e.g.
for things like RDMSR, WRMSR, and CPUID, where the instruction uses only
bits 31:0, regardless of mode.
No functional change intended (all use of "e" versions is for cases where
the value is already truncated due to bouncing through a u32).
Cc: Binbin Wu <binbin.wu@linux.intel.com> Cc: Kai Huang <kai.huang@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20260529222223.870923-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: x86: Move inlined GPR, CR, and DR helpers from x86.h to regs.h
Move inlined General Purpose Register, Control Register, and Debug
Register helpers from x86.h to the aptly named regs.h, to help trim
down x86.h (and x86.c in the future).
Move *very* select EFER functionality as well, but leave behind the bulk of
EFER handling and all other MSR handling. There is more than enough MSR
code to carve out msrs.{c,h} in the future. Give is_long_bit_mode()
special treatment as it's more along the lines of a CR4 bit check, but just
happens to be accessed through an MSR interface. And more importantly,
because giving regs.h access to is_long_bit_mode() greatly simplifies
dependency chains.
Rename kvm_cache_regs.h to simply regs.h, as the "cache" nomenclature is
already a lie (the file deals with state/registers that aren't cached per
se), and so that more code/functionality can be landed in the header
without making it a truly horrible misnomer.
Deliberately drop the kvm_ prefix/namespace to align with other "local"
headers, and to further differentiate regs.h from the public/global
arch/x86/include/asm/kvm_vcpu_regs.h, which sadly needs to stay in asm/
so that the number of registers can be referenced by kvm_vcpu_arch.
KVM: x86: Trace hypercall register *after* truncating values for 32-bit
When tracing hypercalls, invoke the tracepoint *after* truncating the
register values for 32-bit guests so as not to record unused garbage (in
the extremely unlikely scenario that the guest left garbage in a register
after transitioning from 64-bit mode to 32-bit mode).
Fixes: 229456fc34b1 ("KVM: convert custom marker based tracing to event traces") Reviewed-by: Yosry Ahmed <yosry@kernel.org> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260529222223.870923-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: VMX: Read 32-bit GPR values for ENCLS instructions outside of 64-bit mode
When getting register values for ENCLS emulation, use kvm_register_read()
instead of kvm_<reg>_read() so that bits 63:32 of the register are dropped
if the guest is in 32-bit mode.
Note, the misleading/surprising behavior of kvm_<reg>_read() being "raw"
variants under the hood will be addressed once all non-benign bugs are
fixed.
Fixes: 70210c044b4e ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions") Fixes: b6f084ca5538 ("KVM: VMX: Add ENCLS[EINIT] handler to support SGX Launch Control (LC)") Acked-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260529222223.870923-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: x86/xen: Don't truncate RAX when handling hypercall from protected guest
Don't truncate RAX when handling a Xen hypercall for a guest with protected
state, as KVM's ABI is to assume the guest is in 64-bit for such cases
(the guest leaving garbage in 63:32 after a transition to 32-bit mode is
far less likely than 63:32 being necessary to complete the hypercall).
Fixes: b5aead0064f3 ("KVM: x86: Assume a 64-bit hypercall for guests with protected state") Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://patch.msgid.link/20260529222223.870923-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>