git.ipfire.org Git - thirdparty/linux.git/log

KVM: SEV: Decouple the need to sync the GHCB SA from the need to free the SA

Decouple synchronizing the GHCB SA from freeing/unpinning the SA, so that
the free/unpin path can be reused when freeing a vCPU.

Opportunistically add a WARN to harden KVM against stomping over (and thus
leaking) an already-allocated scratch area.

Cc: stable@vger.kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-17-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-17-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: SEV: Move sev_free_vcpu() down below sev_es_unmap_ghcb()

Relocate sev_free_vcpu() down in sev.c so that it's definition comes after
sev_es_unmap_ghcb(). This will allow sharing unmap functionality between
the two functions without needing a forward declaration (or weird placement
of the common code).

No functional change intended.

Cc: stable@vger.kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-16-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: Don't WARN if memory is dirtied without a vCPU when the VM is dying

When marking a page dirty, complain about not having a running/loaded vCPU
if and only if the VM is still alive, i.e. its refcount is non-zero.  This
will allow fixing a memory leak for x86 SEV-ES guests without hitting what
is effectively a false positive on the WARN.

For some SEV-ES VM-Exits, KVM keeps a writable mapping of a guest page
across an exit to userspace, and typically unmaps the page on the next
KVM_RUN.  But if userspace never calls KVM_RUN after such an exit, then KVM
needs to unmap the page when the vCPU is destroyed, which in turn triggers
the WARN about not having a running vCPU.

Alternatively, SEV-ES could temporarily load the vCPU to suppress the WARN,
as is done in nested_vmx_free_vcpu() (but for completely unrelated reasons;
suppressing WARN from nested_put_vmcs12_pages() is pure happenstance).  But
loading a vCPU during destruction is gross (ideally nVMX code would be
cleaned up), risks complicating the SEV-ES code (KVM would need to ensure
the temporarily load()+put() only runs when the vCPU isn't already loaded),
and is ultimately pointless.

The motivation for the WARN is to guard against KVM dirtying guest memory
without pushing the corresponding GFN to the active vCPU's dirty ring, e.g.
to ensure userspace doesn't miss a dirty page.  But for the VM's refcount
to reach zero, there can't be _any_ userspace mappings to the dirty ring,
as mapping the dirty ring requires doing mmap() on the vCPU FD.  I.e. if
userspace had a valid mapping for the dirty ring, then the vCPU file and
thus the owning VM would still be alive.  And so since userspace can't
possibly reach the dirty ring, whether or not KVM technically "misses" a
push to the dirty ring is irrelevant.

Reported-by: Michael Roth <michael.roth@amd.com>
Cc: stable@vger.kernel.org
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-15-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: SEV: Read start/end indices of PSC requests exactly once per #VMGEXIT

Rework Page State Change (PSC) handling to read the guest-provided start
and end indices exactly once, at the beginning of the request. Re-reading
the indices is "fine", _if_ the guest is well-behaved. KVM _should_ be
safe against concurrent guest modification of the indices, but there is
zero reason to introduce unnecessary risk.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-14-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: SEV: Add an anonymous "psc" struct to track current PSC metadata

Add a "psc" struct to vcpu_sev_es_state to avoid having to prefix all of
the fields with "psc_".

Take advantage of the code churn to opportunistically rename local
variables to "guest_psc" to make it more obvious that the buffer is guest
data, and more importantly, guest accessible!

Opportunistically rename inflight => batch_size as well, because there can
really only be one operation in-flight (per-vCPU), i.e. "inflight" _looks_
like a boolean, but in actuality is an integer tracking how many pages are
being handled by the current operation.

No functional change intended.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-13-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: SEV: Make it more obvious when KVM is writing back the current PSC index

Increment the guest-visible "cur_entry" index outside of the for-loop
when processing Page State Change entries, and add a comment to make it
more obvious which code is operating on trusted data, and which code is
touching guest-accessible data.

No functional change intended.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-12-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

fs/ntfs3: validate lcns_follow in log_replay conversion

log_replay() converts DIR_PAGE_ENTRY_32 records into DIR_PAGE_ENTRY
records when replaying version 0 restart tables.

During this conversion, the memmove() length is derived directly from
the on-disk lcns_follow field:

memmove(&dp->vcn, &dp0->vcn_low,
2 * sizeof(u64) +
le32_to_cpu(dp->lcns_follow) * sizeof(u64));

check_rstbl() validates restart table structure, but does not constrain
per-entry lcns_follow values relative to the entry size. A malformed
filesystem image can provide an oversized lcns_follow value, causing
the conversion memmove() to access memory beyond the bounds of the
allocated restart table buffer.

The same field is later used to bound iteration over page_lcns[],
so validating lcns_follow during conversion also prevents downstream
out-of-bounds access from the same malformed metadata.

Compute the maximum valid lcns_follow from the already-validated
restart table entry size and reject entries that exceed this bound.
Reuse the existing t16/t32 scratch variables already declared in
log_replay() to avoid introducing new declarations.

Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal")
Cc: stable@vger.kernel.org
Signed-off-by: Pavitra Jha <jhapavitra98@gmail.com>
[almaz.alexandrovich@paragon-software.com: fixed the conflicts]
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: bound copy_lcns dp->page_lcns[] index in analysis pass

In log_replay()'s analysis pass, after find_dp() returns a
valid DIR_PAGE_ENTRY for the (target_attr, target_vcn) tuple,
the copy_lcns block walks lrh->lcns_follow further entries:

t16 = le16_to_cpu(lrh->lcns_follow);
for (i = 0; i < t16; i++) {
    size_t j = (size_t)(le64_to_cpu(lrh->target_vcn) -
                        le64_to_cpu(dp->vcn));
    dp->page_lcns[j + i] = lrh->page_lcns[i];
}

find_dp() only validates that target_vcn falls within
[dp->vcn, dp->vcn + dp->lcns_follow), i.e., that the FIRST
cluster is covered.  The walk through the further entries is
not bounded against dp->lcns_follow.  For a malformed LRH
where target_vcn = dp->vcn + dp->lcns_follow - 1 and
lrh->lcns_follow > 1, the i > 0 writes overflow the dp's
allocated page_lcns[] array.

Add the missing j + lrh->lcns_follow <= dp->lcns_follow guard.

Reproduced under UML+KASAN on mainline 8d90b09e6741 as a
slab-out-of-bounds write of size 8 from log_replay+0x68d4 on
the mount path.

This is distinct from Pavitra Jha's 2026-05-02 patch
("fs/ntfs3: validate lcns_follow in log_replay conversion",
<20260502154252.164586-1-jhapavitra98@gmail.com>) which
addresses the separate version-0 dirty-page-table conversion
path's memmove(&dp->vcn, ...) call.  The two fixes are
complementary; both should land.

Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
[almaz.alexandrovich@paragon-software.com: clang-formatted the changes,
fixed conflicts]
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: bound DeleteIndexEntryAllocation memmove length

In do_action()'s DeleteIndexEntryAllocation case, e->size comes
from an on-disk INDEX_BUFFER entry.  When e->size makes
e + e->size point past hdr + hdr->used,
PtrOffset(e1, Add2Ptr(hdr, used)) returns a negative ptrdiff_t
that is silently cast to a quasi-infinite size_t when passed
to memmove().  The memmove then walks past the destination
buffer.

The sibling DeleteIndexEntryRoot case at fslog.c:3540-3543
already carries the corresponding guard:

if (PtrOffset(e1, Add2Ptr(hdr, used)) < esize ||
    Add2Ptr(e, esize) > Add2Ptr(lrh, rec_len) ||
    used + esize > le32_to_cpu(hdr->total)) {
goto dirty_vol;
}

Apply the same shape to the allocation-path case.  Also reject
esize == 0: memmove(e, e, ...) is a no-op and leaves
hdr->used unchanged, hiding a malformed entry from the
existing check_index_header() walk.

Reproduced under UML+KASAN on mainline 8d90b09e6741 by
mounting a crafted NTFS image: the unguarded memmove takes a
length of 0xffffffffffffff00 and the kernel oopses in
memmove+0x81/0x1a0 on the do_action+0x36a2 frame.

Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
[almaz.alexandrovich@paragon-software.com: clang-formatted the changes]
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: bound attr_off in UpdateResidentValue against data_off

In do_action()'s UpdateResidentValue case (fslog.c:3307),
lrh->attr_off and lrh->redo_len come from the on-disk LRH.
When they satisfy aoff + dlen < attr->res.data_off, the
assignment

attr->res.data_size = cpu_to_le32(aoff + dlen - data_off);

underflows to ~4 GiB (e.g. 0xFFFFFFF9 when aoff=0x10, dlen=1,
data_off=0x18). Subsequent code that reads attr->res.data_size
to walk the resident attribute payload would then read up to
4 GiB past the 1024-byte MFT record allocation.

The existing mi_enum_attr() defense in fs/ntfs3/record.c:287
catches the corrupted data_size on the next attribute walk
and fails the mount, but only on the path that walks all
attributes. A read site that picks an attribute by name and
reads its data_size without re-validating is not covered.
Validate aoff against data_off and asize at the source.

Reproduced under UML+KASAN on mainline 8d90b09e6741 via
pr_warn-only probe: with aoff=0x10 and data_off=0x18, the
post-assignment data_size is 0xfffffff9 (mount then fails
at -22 from mi_enum_attr).

Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
[almaz.alexandrovich@paragon-software.com: clang-formatted the changes]
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

ext4: improve str2hashbuf by processing 4-byte chunks and removing function pointers

The original byte-by-byte implementation with modulo checks is less
efficient. Refactor str2hashbuf_unsigned() and str2hashbuf_signed()
to process input in explicit 4-byte chunks instead of using a
modulus-based loop to emit words byte by byte.

Additionally, the use of function pointers for selecting the appropriate
str2hashbuf implementation has been removed. Instead, the functions are
directly invoked based on the hash type, eliminating the overhead of
dynamic function calls.

Performance test (x86_64, Intel Core i7-10700 @ 2.90GHz, average over 10000
runs, using kernel module for testing):

    len | orig_s | new_s | orig_u | new_u
    ----+--------+-------+--------+-------
      1 |   70   |   71  |   63   |   63
      8 |   68   |   64  |   64   |   62
     32 |   75   |   70  |   75   |   63
     64 |   96   |   71  |  100   |   68
    255 |  192   |  108  |  187   |   84

This change improves performance, especially for larger input sizes.

Signed-off-by: Guan-Chun Wu <409411716@gms.tku.edu.tw>
Link: https://patch.msgid.link/20260531080019.3794809-3-409411716@gms.tku.edu.tw
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: add Kunit coverage for directory hash computation

Introduce Kunit tests for fs/ext4/hash.c to verify ext4fs_dirhash()
across the legacy, half-MD4, and TEA hash variants.

The tests cover empty, seeded hashing, and non-ASCII name handling.
They also verify error paths, including invalid hash versions and
SipHash without a configured key, and check that the signed and
unsigned hash variants differ on non-ASCII input as expected.

When CONFIG_UNICODE is enabled, the tests further verify casefolded-name
hashing and the fallback behavior for invalid input.

Co-developed-by: Chen Hao Yu <edward062254@gmail.com>
Signed-off-by: Chen Hao Yu <edward062254@gmail.com>
Signed-off-by: Guan-Chun Wu <409411716@gms.tku.edu.tw>
Link: https://patch.msgid.link/20260531080019.3794809-2-409411716@gms.tku.edu.tw
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

dma-debug: fix physical address retrieval in debug_dma_sync_sg_for_device

In debug_dma_sync_sg_for_device(), when iterating over a scatterlist,
the debug entry population mistakenly uses the head of the scatterlist
'sg' to fetch the physical address via sg_phys(), instead of using the
current iterator variable 's'.

This causes dma-debug to track the physical address of the very first
scatterlist entry for all subsequent entries in the list.

Fix this by passing the correct loop iterator 's' to sg_phys()

Fixes: 9d4f645a1fd49ee ("dma-debug: store a phys_addr_t in struct dma_debug_entry")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260603123708.1665-1-lirongqing@baidu.com

ext4: fast commit: export snapshot stats in fc_info

Snapshot-based fast commit can fall back when the commit-time snapshot
cannot be built (e.g. extent status cache misses). It is useful to
quantify the updates-locked window and to see why snapshotting failed.

Add best-effort snapshot counters to the ext4 superblock and extend
/proc/fs/ext4/<sb_id>/fc_info to report the number of snapshotted
inodes and ranges, snapshot failure reasons, and the average/max time
spent with journal updates locked.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Link: https://patch.msgid.link/20260515091829.194810-8-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: fast commit: add lock_updates tracepoint

Commit-time fast commit snapshots run under jbd2_journal_lock_updates(),
so it is useful to quantify the time spent with updates locked and to
understand why snapshotting can fail.

Add a new tracepoint, ext4_fc_lock_updates, reporting the time spent in
the updates-locked window along with the number of snapshotted inodes
and ranges. Record the first snapshot failure reason in a stable snap_err
field for tooling.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20260515091829.194810-7-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots

Commit-time snapshots run under jbd2_journal_lock_updates(), so the work
done there must stay bounded.

The snapshot path still used ext4_map_blocks() to build data ranges. This
can take i_data_sem and pulls the mapping code into the snapshot logic.
Build inode data range snapshots from the extent status tree instead.

The extent status tree is a cache, not an authoritative source. If the
needed information is missing or unstable (e.g. delayed allocation), treat
the transaction as fast commit ineligible and fall back to full commit.

Also cap the number of inodes and ranges snapshotted per fast commit and
allocate range records from a dedicated slab cache. The inode pointer
array is allocated outside the updates-locked window.

Testing: QEMU/KVM guest, virtio-pmem + dax, ext4 -O fast_commit, mounted
dax,noatime. Ran python3 500x {4K write + fsync}, fallocate 256M, and
python3 500x {creat + fsync(dir)} without lockdep splats or errors.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Link: https://patch.msgid.link/20260515091829.194810-6-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: fast commit: avoid self-deadlock in inode snapshotting

ext4_fc_snapshot_inodes() used igrab()/iput() to pin inodes while building
commit-time snapshots. With ext4_fc_del() waiting for
EXT4_STATE_FC_COMMITTING, iput() can trigger
ext4_clear_inode()->ext4_fc_del() in the commit thread and deadlock waiting
for the fast commit to finish.

ext4_fc_del() also has to re-check EXT4_STATE_FC_COMMITTING after
waiting on EXT4_STATE_FC_FLUSHING_DATA. The commit thread clears
FLUSHING_DATA before it sets COMMITTING, so a waiter woken from the
flush wait must not delete the inode based on an old COMMITTING
check.

Avoid taking extra references. Collect inode pointers under s_fc_lock and
rely on EXT4_STATE_FC_COMMITTING to pin inodes until ext4_fc_cleanup()
clears the bit.

Also set EXT4_STATE_FC_COMMITTING for create-only inodes referenced
from the dentry update queue, and wake up waiters when ext4_fc_cleanup()
clears the bit.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Link: https://patch.msgid.link/20260515091829.194810-5-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: fast commit: avoid waiting for FC_COMMITTING

ext4_fc_track_inode() can be called while holding i_data_sem (e.g.
fallocate). Waiting for EXT4_STATE_FC_COMMITTING in that case risks an
ABBA deadlock: i_data_sem -> wait(FC_COMMITTING) vs FC_COMMITTING ->
wait(i_data_sem) in the commit task.

Now that fast commit snapshots inode state at commit time, updates during
log writing do not need to block. Drop the wait and lockdep assertion in
ext4_fc_track_inode(), and make ext4_fc_del() wait for FC_COMMITTING so an
inode cannot be removed while the commit thread is still using it.

When an inode is modified during a fast commit, mark it with
EXT4_STATE_FC_REQUEUE so cleanup keeps it queued for the next fast commit.
This is needed because jbd2_fc_end_commit() invokes the cleanup callback
with tid == 0, so tid-based requeue logic would requeue every inode.

Testing: tracepoint ext4:ext4_fc_commit_stop with two fsyncs in the same
transaction. nblks is the number of journal blocks written for that fast
commit. Before this change, the second fsync still wrote almost the same
fast commit log (nblks 10->9), because tid == 0 in jbd2_fc_end_commit()
caused the tid-based requeue logic to keep all inodes queued. After this
change, only inodes modified during the commit are requeued, and the
second fsync wrote a nearly empty fast commit (nblks 10->1).

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Link: https://patch.msgid.link/20260515091829.194810-4-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: lockdep: handle i_data_sem subclassing for special inodes

Fast commit can hold s_fc_lock while writing journal blocks. Mapping the
journal inode can take its i_data_sem. Normal inode update paths can take a
data inode i_data_sem and then s_fc_lock, which makes lockdep report a
circular dependency.

lockdep treats all i_data_sem instances as one lock class and cannot
distinguish the journal inode i_data_sem from a regular inode i_data_sem.
The journal inode is not tracked by fast commit and no FC waiters ever
depend on it, so this is not a real ABBA deadlock. Assign the journal inode
a dedicated i_data_sem lockdep subclass to avoid the false positive.

Inode cache objects can be recycled, so also reset i_data_sem to
I_DATA_SEM_NORMAL when allocating an ext4 inode. Otherwise a new inode may
inherit an old subclass (journal/quota/ea) and trigger lockdep warnings.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Link: https://patch.msgid.link/20260515091829.194810-3-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: fast commit: snapshot inode state before writing log

Fast commit writes inode metadata and data range updates after unlocking
journal updates. New handles can start at that point, so the log writing
path must not look at live inode state.

Add a commit-time per-inode snapshot and populate it while journal updates
are locked and existing handles are drained. Store the snapshot behind
ext4_inode_info->i_fc_snap so ext4_inode_info only grows by one pointer.
The snapshot contains a copy of the on-disk inode plus the data range
records needed for fast commit TLVs.

Snapshotting runs under jbd2_journal_lock_updates(). Avoid triggering I/O
there by using ext4_get_inode_loc_noio() and falling back to full commit
if the inode table block is not present or not uptodate.

Log writing then only serializes the snapshot, so it no longer needs to
call ext4_map_blocks() and take i_data_sem under s_fc_lock. The snapshot
is installed and freed under s_fc_lock and is released from fast commit
cleanup and inode eviction.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Link: https://patch.msgid.link/20260515091829.194810-2-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

jbd2: fix integer underflow in jbd2_journal_initialize_fast_commit()

jbd2_journal_initialize_fast_commit() validates journal capacity by
checking (journal->j_last - num_fc_blks < JBD2_MIN_JOURNAL_BLOCKS).
Both j_last and num_fc_blks are unsigned, so when num_fc_blks exceeds
j_last the subtraction wraps to a large value, bypassing the bounds
check.

The resulting underflow corrupts j_last, j_fc_first, and j_free,
leading to journal abort.

Fix by checking num_fc_blks against j_last before the subtraction,
returning -EFSCORRUPTED.

Fixes: 6866d7b3f2bb ("ext4 / jbd2: add fast commit initialization")
Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
Fixes: e029c5f27987 ("ext4: make num of fast commit blocks configurable")
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Fixes: e029c5f279872 ("ext4: make num of fast commit blocks configurable")
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/SYBPR01MB7881663C927DE9D7BBF4D1DFAF062@SYBPR01MB7881.ausprd01.prod.outlook.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: fix fast commit wait/wake bit mapping on 64-bit

On 64-bit, ext4 dynamic inode states live in the upper half of i_flags,
and ext4_test_inode_state() applies the corresponding +32 offset.

The fast-commit wait and wake paths open-coded the wait key with the raw
EXT4_STATE_* value. Add small helpers for the state wait word and bit,
and use them for the FC_COMMITTING and FC_FLUSHING_DATA waits so the wait
key follows the same mapping as the state helpers.

Fixes: 857d32f26181 ("ext4: rework fast commit commit path")
Reported-by: Sashiko AI review <sashiko-bot@kernel.org>
Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260513085818.552432-1-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

jbd2: check for aborted handle in jbd2_journal_dirty_metadata()

jbd2_journal_dirty_metadata() unconditionally dereferences
handle->h_transaction at function entry to obtain the journal pointer:

transaction_t *transaction = handle->h_transaction;
journal_t *journal = transaction->t_journal;

However, h_transaction may legitimately be NULL for an aborted handle.
The is_handle_aborted() helper in include/linux/jbd2.h explicitly
treats !h_transaction as one of the aborted states:

if (handle->h_aborted || !handle->h_transaction)
return 1;

Every other entry point in fs/jbd2/transaction.c
(jbd2_journal_get_{write,undo,create}_access, jbd2_journal_extend,
jbd2_journal_restart, jbd2_journal_stop, etc.) guards against this
with an is_handle_aborted() check before any dereference of
h_transaction. jbd2_journal_dirty_metadata() was missing this guard.

This is reachable from ocfs2's xattr code. ocfs2_xa_set() intentionally
falls through to ocfs2_xa_journal_dirty() even after
ocfs2_xa_prepare_entry() fails, on the assumption that the buffer
needs to be journaled to record any partial modifications (see the
comment above the out_dirty label in fs/ocfs2/xattr.c). If the failure
was caused by the journal being aborted -- e.g. an underlying I/O
error during a sub-operation such as __ocfs2_remove_xattr_range() --
the handle's h_transaction has been cleared by the abort path, and
the unconditional deref in jbd2_journal_dirty_metadata() becomes a
NULL deref.

Reproduced by syzbot with a crafted ocfs2 image where I/O against the
loop device backing the mount is sabotaged via LOOP_SET_STATUS64
between two setxattr() calls, causing the second setxattr (which
truncates an external xattr value) to abort the journal mid-flight:

  Oops: general protection fault, probably for non-canonical
        address 0xdffffc0000000000
  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
  RIP: jbd2_journal_dirty_metadata+0x4a/0xd30 fs/jbd2/transaction.c:1520
  Call Trace:
   ocfs2_journal_dirty+0x130/0x700 fs/ocfs2/journal.c:831
   ocfs2_xa_journal_dirty fs/ocfs2/xattr.c:1483 [inline]
   ocfs2_xa_set+0x15e3/0x2ec0 fs/ocfs2/xattr.c:2294
   ocfs2_xattr_block_set+0x3e0/0x33c0 fs/ocfs2/xattr.c:3016
   __ocfs2_xattr_set_handle+0x6b3/0xf50 fs/ocfs2/xattr.c:3418
   ocfs2_xattr_set+0xf3f/0x13e0 fs/ocfs2/xattr.c:3681
   __vfs_setxattr+0x43c/0x480 fs/xattr.c:218
   ...

Fix by adding the standard is_handle_aborted() guard at the top of
jbd2_journal_dirty_metadata() and returning -EROFS, matching the
pattern used by every other entry point in this file.
ocfs2_journal_dirty() already handles a non-zero return from
jbd2_journal_dirty_metadata() correctly.

Reported-by: syzbot+98f651460e558a21baae@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=98f651460e558a21baae
Tested-by: syzbot+98f651460e558a21baae@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://patch.msgid.link/20260507050605.50081-1-kartikey406@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

wifi: iwlwifi: bump maximum core version for BZ/SC/DR to 106

Start supporting Core 106 FW on these devices.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Link: https://patch.msgid.link/20260531135036.4ec96e57a17b.I1eea0a221656b2f03839964734d9a3624530b964@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: mld: add KUnit tests for link grading

Add tests for the link grading algorithm covering per-bandwidth
grading tables, channel load calculation, 6 GHz RSSI adjustments
including duplicated beacon and PSD/EIRP compensation, and
puncturing penalty.

Signed-off-by: Avinash Bhatt <avinash.bhatt@intel.com>
Link: https://patch.msgid.link/20260531135036.a4251e5665a0.I811b35680115e7de0ffd75b6b7a1c91ad361c97c@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: mld: add KUnit tests for PSD/EIRP RSSI adjustment

Add tests for PSD/EIRP RSSI adjustment which compensates measurements
when APs use PSD-based power scaling with bandwidth.

Tests cover all power types, bandwidths, and limiting scenarios.

Signed-off-by: Avinash Bhatt <avinash.bhatt@intel.com>
Link: https://patch.msgid.link/20260531135036.a18b8d0acd62.I68dfcc17359ab8a5abdc84e1e21db4ad1671af41@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: mld: drop TLC config cmd v4/v5 compat code

FW core102 bumped TLC_MNG_CONFIG_CMD_API_S from version 5 to
version 6. The v4 and v5 compatibility paths in
iwl_mld_send_tlc_cmd() are no longer reachable on any supported
firmware.

Signed-off-by: Shahar Tzarfati <shahar.tzarfati@intel.com>
Link: https://patch.msgid.link/20260531135036.c0e2dbfd0569.I44f8eb4d985bb9590b65b77e9a3dd157e4bd5e79@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: mvm: remove __must_check annotation from command sending

We don't acually need to always check the return value. For example, if
we send a command to remove an object - we can assume success
(if it fails it is probably because the fw is dead, and then it doesn't
have the object anyway).

Remove the annotations.

Link: https://patch.msgid.link/20260531135036.434473c7b29a.I455e0c3f93c25635df708da7d3216c183dbdbbbb@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: trans: export the maximum supported hcmd size

Export the maximum allowed host command payload size to the op-modes.
Note that this information was available to the op-modes also before
this change, this just adds a clear macro.

Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20260531135036.2e6b15bcaf50.I027e150e5f25ef2431ab4e212175dc00ca5e8abd@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: stop supporting core101

BZ, DR and SC no longer need to accept core101 firmware.
Raise the minimum supported firmware core from 101 to 102 so
these families only match supported core102 and newer images.

Signed-off-by: Shahar Tzarfati <shahar.tzarfati@intel.com>
Link: https://patch.msgid.link/20260531135036.4ece89be11a9.If00f9c7e011ec75219d28a38ca2077a926afc70e@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: remove orphaned DC2DC config enum

FW core102 removed both DC2DC_CONFIG_CMD_API_S and
DC2DC_CONFIG_CMD_RSP_API_S. The only driver-side artifact is
enum iwl_dc2dc_config_id in fw/api/config.h, which has no
callers in any .c file across all driver paths (mld/mvm/xvt).

Remove the dead definition.

Signed-off-by: Shahar Tzarfati <shahar.tzarfati@intel.com>
Link: https://patch.msgid.link/20260531135036.487ceed62714.I13cf8cc214c68899379112e8e52f0cd38dc7b6f8@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: fix a typo

We use 512 A-MSDUs in an A-MPDU, not 612. Fix the typo.

Link: https://patch.msgid.link/20260531135036.62a394741a04.I2fd9e1d5dc4d467426c9061df2796ff8ba0129d4@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: pcie: fix write pointer move detection

Ever since the TFD queue size is no longer limited to 256 entries,
this code has been wrong, and might erroneously not detect a move
if it was by a multiple of 256. Not a big deal, but fix it while
I see it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20260531135036.87ffbeab298e.I4fae41383b6756bccbed250985e0521b68a40d0c@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: mld: Require HT support for NAN

NAN cannot be supported if HT is not supported, so check that
HT is supported before declaring that NAN is supported.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Link: https://patch.msgid.link/20260527230313.6274b222e849.If215f00f0cdb5eefb2507f8d0fb5734a65ce945f@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: mvm: fix P2P-Device binding handling

Our binding handling for P2P-Device can run into the following
scenario, as observed by our testing:

- a station interface is connected on some channel
- the P2P-Device does a remain-on-channel (ROC) on that channel
- the ROC ends, and the P2P-Device is removed from the binding,
   but the phy_ctxt pointer is left around as a PHY cache so we
   don't need to recalibrate to the channel again and again in
   case it's not shared
- a binding update by the station interface, even a removal,
   will re-add the P2P-Device to the binding
- the P2P-Device is removed, which removes the PHY context, but
   it's still in the binding so the firmware crashes

Since the P2P device is removed from the binding and only re-
added by unrelated code, but we want to keep the phy_ctxt around
as a cache for future ROC usage, fix it by adding a boolean that
indicates whether or not the P2P-Device should be added to the
binding, and handle that in the binding iterator. That way, the
station interface cannot re-add the P2P-Device to the binding
when that isn't active.

Assisted-by: Github Copilot:claude-opus-4-6
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20260527230313.07f94335ae06.I384238b0859343c4a9a9dda20682be1aad89cc9d@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: mld: add KUnit tests for duplicated beacon RSSI adjustment

Add KUnit tests to verify RSSI adjustment for 6 GHz duplicated
beacons across different operational bandwidths and validate
detection of the duplicated beacon bit.

Signed-off-by: Avinash Bhatt <avinash.bhatt@intel.com>
Link: https://patch.msgid.link/20260527230313.a3500c44f5e8.Icba6ee1158e9f563a91b482b8cdd3f51ddace468@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: mld: don't WARN on WoWLAN suspend w/o netdetect

Clearly, from a user perspective, it must be valid to configure
WoWLAN and then suspend while not connected to a network. Since
mac80211 doesn't distinguish these cases and simply calls the
driver to suspend whenever WoWLAN is configured, the driver has
to cleanly handle the case where it's called for WoWLAN, it's
not connected but there's also no netdetect configured.

Remove the WARN_ON() and keep returning 1 to disconnect and
then suspend.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Link: https://patch.msgid.link/20260527230313.19720967372b.Iff30814510a26f9f609f98eeea3111c50c1afb31@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: cfg: Revert "wifi: iwlwifi: cfg: move the MODULE_FIRMWARE to the per-rf file"

IWL_BZ_UCODE_CORE_MAX is undefined in cfg/rf-fm.c, this
causes __stringify(core) to turn it into the literal
token text, so MODULE_FIRMWARE entries are generated as
"iwlwifi...-cIWL_BZ_UCODE_CORE_MAX.ucode",
instead of the actual number.

This reverts the commit below.

Signed-off-by: Shahar Tzarfati <shahar.tzarfati@intel.com>
Link: https://patch.msgid.link/20260527230313.a10bc3359dca.I446a1340c635f07aff3efaba5317635e010c156f@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: mld: set fast-balance scan for active EMLSR

While associated to MLD AP with active EMLSR, set all scan
operations as fast-balance scans. The only exception is when a
fragmented scan is planned (high traffic or low latency), in
which case the fragmented scan is preserved.

Signed-off-by: Pagadala Yesu Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com>
Link: https://patch.msgid.link/20260527230313.32d278842b0e.Ia3d73e4085eefc4d3921e93de4107b2d6a6f922e@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: mld: support FW TLV for NAN max channel switch time

Add a new FW TLV (IWL_UCODE_TLV_FW_NAN_MAX_CHAN_SWITCH_TIME) that
allows the firmware to specify the NAN maximum channel switch time
in microseconds.

When the TLV is present, use its value for the NAN device capability.
Otherwise, fall back to the default of 4 milliseconds.

Signed-off-by: Israel Kozitz <israel.kozitz@intel.com>
Link: https://patch.msgid.link/20260527230313.e8ae1a3adacd.I15b933407ca3974a65047b63b4f9b00bed3520fb@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: mld: always allow mimo in NAN

The mimo field of the sta command is badly named. It really carries the
initial SMPS value as it is in the association request of the client
station (when we are the AP).

In NAN we don't have this information, just mark SMPS as disabled.

Link: https://patch.msgid.link/20260527230313.abd136be474e.I9eb663d953b482236345ffbcb611f28facea83c1@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: move iwl_fw_rate_idx_to_plcp() to mvm

It's only needed by mvm, so there's no need to have it in
iwlwifi and export it, just move it to mvm itself.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20260527230313.87769f13c7d7.I3875d768694b9484317a3253f479a2a2100244f4@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: mvm: rename iwl_mvm_mac80211_idx_to_hwrate()

Given that we now use v3 rates with FW index throughout,
_to_hwrate() is confusing, since the hardware still uses
the PLCP value, the driver just doesn't see that now (as
it talks to firmware, not hardware.)

Rename this to iwl_mvm_rate_idx_to_fw_idx() to more
clearly indicate what it's doing.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20260527230313.a60c8aea5b6c.I6af48d5d9748e184eed9d3437d312291cab61d7f@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: fix STEP_URM register address for SC devices

The CNVI_PMU_STEP_FLOW register address differs between device families.
For SC and newer devices, the register is at 0xA2D688,
while for BZ devices it's at 0xA2D588.

Signed-off-by: Moriya Itzchaki <moriya.itzchaki@intel.com>
Link: https://patch.msgid.link/20260527230313.f0c115c4f74e.I3c66b2e39a97f754e853ac7e7dba8e433523619e@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: mld: fix smatch warning

We dereference the mld_sta pointer before checking for NULL.
But we do check the sta pointer, and sta != NULL means mld_sta != NULL,
so there is no real issue.
Fix it anyway to silence the warning.

Link: https://patch.msgid.link/20260527200512.506707-2-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: remove mvm prefix from marker command

This command is sent in other opmodes as well. Remove the mvm prefix.

Link: https://patch.msgid.link/20260527230313.290e4d9db14a.Ia4edc64dacc8e298ab7817ab5c37843e92698b8d@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: remove stale comment

iwl_pcie_set_hw_ready still returns the return value of iwl_poll_bits,
but the latter one no longer returns the time elapsed until success, now it
returns either success or failure.
Remove the comment entirely.

Link: https://patch.msgid.link/20260527230313.ae42da7924ec.I1a92266621dc0033afa80f022d4c45e91674fedb@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: fw: cut down NIC wakeups during dump

Currently, the dump code attempts to dump any number of
memories and register banks, as defined by the firmware.
Especially when the device is failing, this can lead to
excessive time spent attempting to acquire NIC access
over and over again.

Improve the code to only attempt to acquire NIC access
once or twice, but using the new memory dump functions
that may drop the spinlock etc. Mark all dump regions
that require NIC access, and skip them if we couldn't
obtain that.

In order to avoid CPU latency due to the increased time
holding the spinlock (and possibly disabling softirqs),
drop locks and call cond_resched() after each section
(if holding NIC access) but don't release HW NIC access.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20260527230313.bec886142cc8.I41f2eaf2403b38147504d5dab0a7414de2699adc@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

wifi: iwlwifi: add support for AX231

AX231 is a device that is based on AX211 that doesn't support 6E and
its bandwidth is limited to 80 MHz.
Just reuse the radio config from AX203 which has the exact same
characteristics.
It has a specific subdevice ID to allow the driver to differentiate
between AX211 and AX231.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20260512082114.0685ed313987.Ibcfa24e196ac778405d2843f0984b66ca167704e@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>

iio: chemical: scd30: Replace manual locking with RAII locking

scd30_core.c currently uses manual mutex_lock() and mutex_unlock()
calls. Replace them with the newer guard(mutex)() for cleaner RAII
patterns and to improve maintainability.

Add new helper function scd30_trigger_handler_helper() containing
the critical section for scd30_trigger_handler().

In addition, small refactor to replace "?:" operator with regular
if/else returns.

Signed-off-by: Maxwell Doose <m32285159@gmail.com>
Reviewed-by: Joshua Crofts <joshua.crofts1@gmail.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>

ASoC: amd: remove unused machine

Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> says:

This patch-set removes unused machine

Link: https://patch.msgid.link/877bogce4k.wl-kuninori.morimoto.gx@renesas.com

ASoC: amd: ps-mach: remove unused machine

Not used, remove it.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Reviewed-by: Vijendar Mukunda <Vijendar.Mukunda@amd.com>
Link: https://patch.msgid.link/8733z4ce3i.wl-kuninori.morimoto.gx@renesas.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: amd: acp6x-mach: remove unused machine

Not used, remove it.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Reviewed-by: Vijendar Mukunda <Vijendar.Mukunda@amd.com>
Link: https://patch.msgid.link/874ijkce3m.wl-kuninori.morimoto.gx@renesas.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: amd: acp3x-rn: remove unused machine

Not used, remove it.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Reviewed-by: Vijendar Mukunda <Vijendar.Mukunda@amd.com>
Link: https://patch.msgid.link/875x40ce3s.wl-kuninori.morimoto.gx@renesas.com
Signed-off-by: Mark Brown <broonie@kernel.org>

s390/percpu: Provide arch_this_cpu_write() implementation

Provide an s390 specific implementation of arch_this_cpu_write()
instead of the generic variant. The generic variant uses a quite
expensive raw_local_irq_save() / raw_local_irq_restore() pair.

Get rid of this by providing an own variant which makes use of the new
percpu code section infrastructure.

With this the text size of the kernel image is reduced by ~1k (defconfig).

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/percpu: Provide arch_this_cpu_read() implementation

Provide an s390 specific implementation of arch_this_cpu_read() instead
of the generic variant. The generic variant uses preempt_disable() /
preempt_enable() pair and READ_ONCE().

Get rid of the preempt_disable() / preempt_enable() pairs by providing an
own variant which makes use of the new percpu code section infrastructure.

With this the text size of the kernel image is reduced by ~1k
(defconfig). Also 87 generated preempt_schedule_notrace() function
calls within the kernel image (modules not counted) are removed.

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()

Convert arch_this_cpu_[and|or]() to make use of the new percpu code
section infrastructure.

There is no user of this_cpu_and() and only one user of this_cpu_or()
within the kernel. Therefore this conversion has hardly any effect,
and also removes only preempt_schedule_notrace() function call.

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/percpu: Use new percpu code section for arch_this_cpu_add_return()

Convert arch_this_cpu_add_return() to make use of the new percpu code
section infrastructure.

With this the text size of the kernel image is reduced by ~4k
(defconfig). Also 66 generated preempt_schedule_notrace() function
calls within the kernel image (modules not counted) are removed.

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/percpu: Use new percpu code section for arch_this_cpu_add()

Convert arch_this_cpu_add() to make use of the new percpu code section
infrastructure.

With this the text size of the kernel image is reduced by ~76kb
(defconfig). Also more than 5300 generated preempt_schedule_notrace()
function calls within the kernel image (modules not counted) are removed.

With:

DEFINE_PER_CPU(long, foo);
void bar(long a) { this_cpu_add(foo, a); }

Old arch_this_cpu_add() looks like this:

00000000000000c0 <bar>:
  c0:   c0 04 00 00 00 00       jgnop   c0 <bar>
  c6:   eb 01 03 a8 00 6a       asi     936,1
  cc:   c4 18 00 00 00 00       lgrl    %r1,cc <bar+0xc>
                        ce: R_390_GOTENT        foo+0x2
  d2:   e3 10 03 b8 00 08       ag      %r1,952
  d8:   eb 22 10 00 00 e8       laag    %r2,%r2,0(%r1)
  de:   eb ff 03 a8 00 6e       alsi    936,-1
  e4:   a7 a4 00 05             jhe     ee <bar+0x2e>
  e8:   c0 f4 00 00 00 00       jg      e8 <bar+0x28>
                        ea: R_390_PC32DBL       __s390_indirect_jump_r14+0x2
  ee:   c0 f4 00 00 00 00       jg      ee <bar+0x2e>
                        f0: R_390_PLT32DBL      preempt_schedule_notrace+0x2

New arch_this_cpu_add() looks like this:

00000000000000c0 <bar>:
  c0:   c0 04 00 00 00 00       jgnop   c0 <bar>
  c6:   c4 38 00 00 00 00       lgrl    %r3,c6 <bar+0x6>
                        c8: R_390_GOTENT        foo+0x2
  cc:   b9 04 00 43             lgr     %r4,%r3
  d0:   eb 00 43 c0 00 52       mviy    960(%r0),4
  d6:   e3 40 03 b8 00 08       ag      %r4,952
  dc:   eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
  e2:   eb 00 03 c0 00 52       mviy    960,0
  e8:   c0 f4 00 00 00 00       jg      e8 <bar+0x28>
                        ea: R_390_PC32DBL       __s390_indirect_jump_r14+0x2

Note that the conditional function call is removed.

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/percpu: Add missing do { } while (0) constructs

Add missing do { } while (0) constructs in order to avoid potential
build failures.

Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca%40linux.ibm.com
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/percpu: Infrastructure for more efficient this_cpu operations

With the intended removal of PREEMPT_NONE this_cpu operations based on
atomic instructions, guarded with preempt_disable()/preempt_enable() pairs
become more expensive: the preempt_disable() / preempt_enable() pairs are
not optimized away anymore during compile time.

In particular the conditional call to preempt_schedule_notrace() after
preempt_enable() adds additional code and register pressure.

E.g. this simple C code sequence

DEFINE_PER_CPU(long, foo);
long bar(long a) { return this_cpu_add_return(foo, a); }

generates this code:

  11a976:       eb af f0 68 00 24       stmg    %r10,%r15,104(%r15)
  11a97c:       b9 04 00 ef             lgr     %r14,%r15
  11a980:       b9 04 00 b2             lgr     %r11,%r2
  11a984:       e3 f0 ff c8 ff 71       lay     %r15,-56(%r15)
  11a98a:       e3 e0 f0 98 00 24       stg     %r14,152(%r15)
  11a990:       eb 01 03 a8 00 6a       asi     936,1            <- __preempt_count_add(1)
  11a996:       c0 10 00 d2 ac b5       larl    %r1,1b70300      <- address of percpu var
  11a9a0:       e3 10 23 b8 00 08       ag      %r1,952          <- add percpu offset
  11a9a6:       eb ab 10 00 00 e8       laag    %r10,%r11,0(%r1) <- atomic op
  11a9ac:       eb ff 03 a8 00 6e       alsi    936,-1           <- __preempt_count_dec_and_test()
  11a9b2:       a7 54 00 05             jnhe    11a9bc <bar+0x4c>
  11a9b6:       c0 e5 00 76 d1 bd       brasl   %r14,ff4d30 <preempt_schedule_notrace>
  11a9bc:       b9 e8 b0 2a             agrk    %r2,%r10,%r11
  11a9c0:       eb af f0 a0 00 04       lmg     %r10,%r15,160(%r15)
  11a9c6        07 fe                   br      %r14

Even though the above example is more or less the worst case, since the
branch to preempt_schedule_notrace() requires a stackframe, which
otherwise wouldn't be necessary, there is also the conditional jnhe branch
instruction.

Get rid of the conditional branch with the following code sequence:

  11a8e6:       c0 30 00 d0 c5 0d       larl    %r3,1b33300
  11a8ec:       b9 04 00 43             lgr     %r4,%r3
  11a8f0:       eb 00 43 c0 00 52       mviy    960,4
  11a8f6:       e3 40 03 b8 00 08       ag      %r4,952
  11a8fc:       eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
  11a902:       eb 00 03 c0 00 52       mviy    960,0
  11a908:       b9 08 00 25             agr     %r2,%r5
  11a90c        07 fe                   br      %r14

The general idea is that this_cpu operations based on atomic instructions
are guarded with mviy instructions:

- The first mviy instruction writes the register number, which contains
  the percpu address variable to lowcore. This also indicates that a
  percpu code section is executed.

- The first instruction following the mviy instruction must be the ag
  instruction which adds the percpu offset to the percpu address register.

- Afterwards the atomic percpu operation follows.

- Then a second mviy instruction writes a zero to lowcore, which indicates
  the end of the percpu code section.

- In case of an interrupt/exception/nmi the register number which was
  written to lowcore is copied to the exception frame (pt_regs), and a zero
  is written to lowcore.

- On return to the previous context it is checked if a percpu code section
  was executed (saved register number not zero), and if the process was
  migrated to a different cpu. If the percpu offset was already added to
  the percpu address register (instruction address does _not_ point to the
  ag instruction) the content of the percpu address register is adjusted so
  it points to percpu variable of the new cpu.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/zcrypt: Replace get_zeroed_page() with kzalloc()

zcrypt_rng_device_add() allocates a buffer for the software random
number generator data cache.

This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.

kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.

Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.

For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.

Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().

Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/trng: Replace __get_free_page() with kmalloc()

trng_read() allocates a temporary staging buffer for CPACF TRNG
random data before copying it to userspace.

This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.

kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.

Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.

For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.

Replace use of __get_free_page() with kmalloc() and free_page() with
kfree().

Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/qeth: Replace get_zeroed_page() with kzalloc()

qeth_get_trap_id() allocates a temporary buffer for STSI system
information queries used to build trap identification strings.

This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.

kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.

Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.

For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.

Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().

Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Acked-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/hvc_iucv: Replace get_zeroed_page() with kzalloc()

hvc_iucv_alloc() allocates a send staging buffer for accumulating
outbound terminal characters before they are copied into a separate
IUCV message buffer for transmission to the hypervisor. The staging
buffer itself is never passed to any IUCV function.

This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.

kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.

Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.

For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.

Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().

Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/dasd: Replace get_zeroed_page() with kzalloc()

DASD driver uses get_zeroed_page() to allocate pages for the Extended Error
Reporting software ring buffer and for a scratch buffer for formatting
sense dump diagnostic text.

These buffers can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.

kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.

Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.

For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.

Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().

Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/con3270: Replace __get_free_page() with kmalloc()

con3270_alloc_view() allocates a staging buffer used to assemble
3270 datastream content before it is copied into channel program
requests.

This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.

kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.

Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.

For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.

Replace use of __get_free_page() with kmalloc() and free_page() with
kfree().

Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/fpu: Move GR_NUM / VX_NUM macros to separate header file

Move GR_NUM / VX_NUM macros to separate insn-common-asm.h header file
so they can be reused for non-fpu insn constructs.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/fpu: Shorten GR_NUM / VX_NUM macros

Use the ".irp" directive to get rid of all the repeated ".ifc" usages
in fpu-insn-asm.h.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/ap/zcrypt: Rearrange fields within AP and zcrypt structs

Rearrange some fields within AP and zcrypt structs to reduce
memory consumption and unused holes with the help of pahole
analysis of the code.

Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Finn Callies <fcallies@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

gpu: nova-core: Hopper/Blackwell: select FSP Chain of Trust version

The FSP Chain of Trust handshake is versioned: Hopper speaks version 1
and Blackwell speaks version 2. Provide the version through the FSP HAL
so the boot message carries the value FSP expects, and so chipsets that
do not use FSP need not express a version at all.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260603-b4-blackwell-v13-5-d9f3a06939e0@nvidia.com
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>

gpu: nova-core: Hopper/Blackwell: add FSP send/receive messaging

FSP exchanges are request/response: the driver sends an MCTP/NVDM
message and must match the reply against the request before acting on
it. Add the synchronous send-and-wait path that validates the response
transport and message headers and confirms the reply corresponds to the
request that was sent.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260603-b4-blackwell-v13-4-d9f3a06939e0@nvidia.com
[acourbot: make `MessageToFsp` private.]
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>

gpu: nova-core: add MCTP/NVDM protocol types for firmware communication

Add the MCTP (Management Component Transport Protocol) and NVDM (NVIDIA
Data Model) wire-format types used for communication between the kernel
driver and GPU firmware processors.

This includes typed MCTP transport headers, NVDM message headers, and
NVDM message type identifiers. Both the FSP boot path and the upcoming
GSP RPC message queue share this protocol layer.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260603-b4-blackwell-v13-3-d9f3a06939e0@nvidia.com
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>

gpu: nova-core: Hopper/Blackwell: add FSP message infrastructure

FSP communication uses a pair of non-circular queues in the FSP
falcon's EMEM, one for messages from the driver to FSP and one for
replies, with the driver polling for response data. Add the queue
registers and the low-level helpers used by the higher-level FSP
message layer.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260603-b4-blackwell-v13-2-d9f3a06939e0@nvidia.com
[acourbot: align register fields names with OpenRM.]
[acourbot: represent registers as arrays of 8 instances, as per OpenRM.]
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>

gpu: nova-core: Hopper/Blackwell: add FSP falcon EMEM operations

Add external memory (EMEM) read/write operations to the GPU's FSP falcon
engine. These operations use Falcon PIO (Programmed I/O) to communicate
with the FSP through indirect memory access.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260603-b4-blackwell-v13-1-d9f3a06939e0@nvidia.com
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>

selftests: livepatch: set LC_ALL=C to fix locale-dependent test failure

When executing the command
"make -C tools/testing/selftests TARGETS=livepatch run_tests",
the following error message was reported.

TEST: livepatch interaction with ftrace_enabled sysctl ... not ok
...
livepatch: sysctlo
: setting key "kernel.ftrace_enabled": Device or resource busy
livepatch: sysctl: setting key "kernel.ftrace_enabled": 设备或资源忙
...
ERROR: livepatch kselftest(s) failed
not ok 5 selftests: livepatch: test-ftrace.sh # exit=1

To fix it, set LC_ALL=C.

Signed-off-by: Qiang Ma <maqianga@uniontech.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Petr Mladek <pmladek@suse.com>
Link: https://patch.msgid.link/20260527095929.1504032-1-maqianga@uniontech.com
Signed-off-by: Petr Mladek <pmladek@suse.com>

KVM: riscv: Fast-path dirty logging write faults

With dirty logging enabled, guest writes often fault on an existing 4K
G-stage leaf that was write-protected only for dirty tracking. The slow
path still performs the full fault handling flow and takes mmu_lock for
write, even though the page-table shape does not change.

x86 handles the analogous case in its fast page fault path by atomically
making a writable SPTE writable again when the fault is only a
write-protection fault. Add the same style of fast path for RISC-V. If a
write fault hits an existing 4K leaf in a writable dirty-log memslot,
mark the page dirty and atomically set the PTE writable and dirty under
the read side of mmu_lock.

The dirty bitmap is updated before the PTE becomes writable again. The
PTE D bit is also set so systems that trap on a clear D bit do not fall
back to the slow path for a writable but clean PTE.

Signed-off-by: Jinyu Tang <tjytimi@163.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260517153427.94889-6-tjytimi@163.com
Signed-off-by: Anup Patel <anup@brainfault.org>

KVM: riscv: Update G-stage PTE permissions atomically

When a fault hits an existing G-stage leaf with the same PFN, KVM only
needs to update the PTE permissions. This path will be used by read-side
fault handling, so it must not overwrite a concurrent PTE update.

Use the cmpxchg helper when relaxing permissions on an existing leaf,
following the same concurrency model used by x86 for atomic SPTE
permission updates. Retry if another CPU changed the PTE first, and use
cpu_relax() while spinning.

Signed-off-by: Jinyu Tang <tjytimi@163.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260517153427.94889-5-tjytimi@163.com
Signed-off-by: Anup Patel <anup@brainfault.org>

KVM: riscv: Add a G-stage PTE cmpxchg helper

Permission-only G-stage PTE updates can run in parallel once they are
moved to the read side of mmu_lock. Plain set_pte() is not enough for
that case because another CPU may update the same PTE first.

x86 handles the same class of SPTE races with cmpxchg-based updates in
its fast page fault and TDP MMU paths. Add a small RISC-V helper for
atomic G-stage PTE updates. The helper reports contention to the caller
and flushes the target range only when the PTE value actually changes.

Signed-off-by: Jinyu Tang <tjytimi@163.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260517153427.94889-4-tjytimi@163.com
Signed-off-by: Anup Patel <anup@brainfault.org>

KVM: riscv: Use an rwlock for mmu_lock

RISC-V KVM currently uses a spinlock for mmu_lock. That serializes all
G-stage MMU operations, including permission-only updates that do not
allocate or free page-table pages.

Use KVM's rwlock form of mmu_lock, as x86 and arm64 already do. Keep the
existing map, unmap and teardown paths on the write side. This prepares
RISC-V for read-side handling of G-stage permission updates.

Signed-off-by: Jinyu Tang <tjytimi@163.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260517153427.94889-3-tjytimi@163.com
Signed-off-by: Anup Patel <anup@brainfault.org>

KVM: riscv: Rely on common MMU notifier locking

The common KVM invalidation paths call kvm_unmap_gfn_range() with
mmu_lock already held for write.

For the standard MMU notifier path, the call chain is:

  kvm_mmu_notifier_invalidate_range_start()
    kvm_handle_hva_range()
      kvm_unmap_gfn_range()

kvm_mmu_notifier_invalidate_range_start() leaves range.lockless clear.
kvm_handle_hva_range() therefore takes KVM_MMU_LOCK(kvm) before invoking
the handler.

The guest_memfd path has the same locking contract:

  __kvm_gmem_invalidate_begin()
    kvm_mmu_unmap_gfn_range()
      kvm_unmap_gfn_range()

__kvm_gmem_invalidate_begin() explicitly takes KVM_MMU_LOCK(kvm) before
calling kvm_mmu_unmap_gfn_range().

So remove the local trylock and make the common locking contract explicit
with lockdep_assert_held_write() like x86.

Signed-off-by: Jinyu Tang <tjytimi@163.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260517153427.94889-2-tjytimi@163.com
Signed-off-by: Anup Patel <anup@brainfault.org>

KVM: selftests: Add a test for gPAT handling in L2

When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled, verify that KVM
correctly virtualizes the host PAT MSR and the guest PAT register for
nested SVM guests.

With nested NPT disabled:
* L1 and L2 share the same PAT
* The vmcb12.g_pat is ignored

With nested NPT enabled:
* An invalid g_pat in vmcb12 causes VMEXIT_INVALID
* RDMSR(IA32_PAT) from L2 returns the value of the guest PAT register
* WRMSR(IA32_PAT) from L2 is reflected in vmcb12's g_pat on VMEXIT
* RDMSR(IA32_PAT) from L1 returns the value of the host PAT MSR

Verify that save/restore with the vCPU in guest mode behaves as expected in
both cases, e.g. preserves both hPAT and gPAT when NPT is enabled.

Originally-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
[sean: use even fancier macro shenanigans]
Link: https://patch.msgid.link/20260528231052.404737-1-seanjc@google.com
[sean: avoid use of goto, print skips]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Add guest_memfd regression test signed offset+size bug

Add a regression (and proof-of-bug) testcase to ensure KVM rejects an
offset+size that would result in a negative value when computed as a signed
64-bit value. KVM had a flaw where it would allow binding a memslot to a
guest_memfd instance even with a wildly out-of-range offset, if the offset
and size were both positive values, but the combined offset+size was
negative.

Use "0x7fffffffffffffffull - page_size", i.e. "INT64_MAX - page_size", for
the offset as the size of the guest_memfd file must be at least page_size
(KVM requires memslots and gmem files to be host page-size aligned). I.e.
"INT64_MAX - page_size + size" is guaranteed to generate an offset+size
that is negative when converted to a signed 64-bit value *and* honors KVM's
alignment requirements.

Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260602170921.1304394-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Expand the guest_memfd test macros to allow passing the VM

Expand the gmem test macros to allow passing the VM to testcases, without
needing to plumb the VM into _every_ testcase, as the vast majority of
testcases only need the fd and size.

No functional change intended.

Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260602170921.1304394-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: guest_memfd: Treat memslot binding offset+size as unsigned values

When binding a memslot to a guest_memfd file, treat the offset and size as
unsigned values to fix a bug where the sum of the two can result in a false
negative when checking for overflow against the size of the file.  Passing
unsigned values also avoids relying on somewhat obscure checks in other
flows for safety, and tracks the offset and size as they are intended to be
tracked, as unsigned values.

On 64-bit kernels, the number of pages a memslot contains and thus the size
(and offset) of its guest_memfd binding are unsigned 64-bit values.  Taking
the offset+size as an loff_t instead of a uoff_t inadvertently converts
the unsigned value to a signed value if the offset and/or size is massive.

Locally storing the offset and size as signed values is benign in and of
itself (though even that is *extremely* difficult to discern), but
operating on their sum is not.

For the offset, KVM explicitly checks against a negative value, which might
seem like a bug as KVM could incorrectly reject a legitimate binding, but
that's not actually the case as KVM_CREATE_GUEST_MEMFD takes a signed value
for its size, i.e. a would-be-negative offset is also greater than the
maximum possible size of any guest_memfd file.

Regarding the size, while KVM lacks an explicit check for a negative value,
i.e. seemingly has a flawed overflow check, KVM restricts the number of
pages in a single memslot to the largest positive signed 32-bit value:

        if (id < KVM_USER_MEM_SLOTS &&
            (mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
                return -EINVAL;

and so that maximum "size" will ever be is 0x7fffffff000.

The sum of the two is, however, problematic.  While the size is restricted
by KVM's memslot logic, the offset is not, i.e. the offset is completely
unchecked until the "offset + size > i_size_read(inode)" check.  If the
offset is the (nearly) largest possible _positive_ value, then adding size
to the offset can result in a signed, negative 64-bit value.  When compared
against the size of the file (guaranteed to be positive), the negative sum
is always smaller, and KVM incorrectly allows the absurd offset.

Opportunistically add missing includes in kvm_mm.h (instead of relying on
its parents).

Fixes: a7800aa80ea4 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Cc: stable@vger.kernel.org
Cc: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260602170921.1304394-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Remove defunct kvm_load_segment_descriptor() declaration.

Remove a dead kvm_load_segment_descriptor() declaration, no functional
change intended.

Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260529222223.870923-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

RDMA/umem: Fix truncation for block sizes >= 4G

When the iommu is used the linearization of the mapping can give a single
block that is very large split across multiple SG entries.

When __rdma_block_iter_next() reassembles the split SG entries it is
overflowing the 32 bit stack values and computed the wrong DMA addresses
for blocks after the truncation.

Use the right types to hold DMA addresses.

Link: https://patch.msgid.link/r/1-v1-88303e9e509f+f7-ib_umem_types_jgg@nvidia.com
Cc: stable@vger.kernel.org
Fixes: a808273a495c ("RDMA/verbs: Add a DMA iterator to return aligned contiguous memory blocks")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

KVM: x86: Drop defunct vcpu_tsc_khz() declaration

Remove a dead vcpu_tsc_khz() declaration. No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260529222223.870923-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Move async #PF helpers to x86.h (as inlines)

Move kvm_pv_async_pf_enabled() and kvm_async_pf_hash_reset() to x86.h in
anticipation of extracting the majority of register and MSR specific code
out of x86.c.

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260529222223.870923-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Move update_cr8_intercept() to lapic.c

Move update_cr8_intercept() to lapic.c so that it's globally visible
in anticipation of extracting most of the register-specific code out of
x86.c and into a new compilation unit. Opportunistically prefix the
helper kvm_lapic_ to make its role/scope more obvious.

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260529222223.870923-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Harden is_64_bit_hypercall() against bugs on 32-bit kernels

Unconditionally return %false for is_64_bit_hypercall() on 32-bit kernels
to guard against incorrectly setting guest_state_protected, and because
in a (very) hypothetical world where 32-bit KVM supports protected guests,
assuming a hypercall was made in 64-bit mode is flat out wrong.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260529222223.870923-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

Revert "KVM: VMX: Read 32-bit GPR values for ENCLS instructions outside of 64-bit mode"

Now that kvm_<reg>_read() are mode aware, i.e. are functionally equivalent
to kvm_register_read(), revert aback to the less verbose versions.

No functional change intended.

This reverts commit 60919eccf6764c71cef31a1afeaa1a36b8e5ab85.

Acked-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260529222223.870923-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: nSVM: Use kvm_rax_read() now that it's mode-aware

Now that kvm_rax_read() truncates the output value to 32 bits if the
vCPU isn't in 64-bit mode, use it instead of the more verbose (and very
technically slower) kvm_register_read().

Note! VMLOAD, VMSAVE, and VMRUN emulation are still technically buggy,
as they can use EAX (versus RAX) in 64-bit mode via an operand size
prefix. Don't bother trying to handle that case, as it would require
decoding the code stream, which would open an entirely different can of
worms, and in practice no sane guest would shove garbage into RAX[63:32]
and then execute VMLOAD/VMSAVE/VMRUN with just EAX.

No functional change intended.

Cc: Yosry Ahmed <yosry@kernel.org>
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260529222223.870923-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Drop non-raw kvm_<reg>_write() helpers

Drop the non-raw, mode-aware kvm_<reg>_write() helpers as there is no
usage in KVM, and in all likelihood there will never be usage in KVM as
use of hardcoded registers in instructions is uncommon, and *modifying*
hardcoded registers is practically unheard of. While there are a few
instructions that modify registers in mode-aware ways, e.g. REP string
and some ENCLS varieties, the odds of KVM needing to emulate such
instructions (outside of the fully emulator) are vanishingly small.

Drop kvm_<reg>_write() to prevent incorrect usage; _if_ a new instruction
comes along that needs to modify a hardcoded register, this can be
reverted.

No functional change intended.

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260529222223.870923-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Add mode-aware versions of kvm_<reg>_{read,write}() helpers

Make kvm_<reg>_{read,write}() mode-aware (where the value is truncated to
32 bits if the vCPU isn't in 64-bit mode), and convert all the intentional
"raw" accesses to kvm_<reg>_{read,write}_raw() versions. To avoid
confusion and bikeshedding over whether or not explicit 32-bit accesses
should use the "raw" or mode-aware variants, add and use "e" versions, e.g.
for things like RDMSR, WRMSR, and CPUID, where the instruction uses only
bits 31:0, regardless of mode.

No functional change intended (all use of "e" versions is for cases where
the value is already truncated due to bouncing through a u32).

Cc: Binbin Wu <binbin.wu@linux.intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20260529222223.870923-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Move inlined GPR, CR, and DR helpers from x86.h to regs.h

Move inlined General Purpose Register, Control Register, and Debug
Register helpers from x86.h to the aptly named regs.h, to help trim
down x86.h (and x86.c in the future).

Move *very* select EFER functionality as well, but leave behind the bulk of
EFER handling and all other MSR handling.  There is more than enough MSR
code to carve out msrs.{c,h} in the future.  Give is_long_bit_mode()
special treatment as it's more along the lines of a CR4 bit check, but just
happens to be accessed through an MSR interface.  And more importantly,
because giving regs.h access to is_long_bit_mode() greatly simplifies
dependency chains.

No functional change intended.

Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260529222223.870923-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Rename kvm_cache_regs.h => regs.h

Rename kvm_cache_regs.h to simply regs.h, as the "cache" nomenclature is
already a lie (the file deals with state/registers that aren't cached per
se), and so that more code/functionality can be landed in the header
without making it a truly horrible misnomer.

Deliberately drop the kvm_ prefix/namespace to align with other "local"
headers, and to further differentiate regs.h from the public/global
arch/x86/include/asm/kvm_vcpu_regs.h, which sadly needs to stay in asm/
so that the number of registers can be referenced by kvm_vcpu_arch.

No functional change intended.

Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260529222223.870923-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Trace hypercall register *after* truncating values for 32-bit

When tracing hypercalls, invoke the tracepoint *after* truncating the
register values for 32-bit guests so as not to record unused garbage (in
the extremely unlikely scenario that the guest left garbage in a register
after transitioning from 64-bit mode to 32-bit mode).

Fixes: 229456fc34b1 ("KVM: convert custom marker based tracing to event traces")
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260529222223.870923-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: VMX: Read 32-bit GPR values for ENCLS instructions outside of 64-bit mode

When getting register values for ENCLS emulation, use kvm_register_read()
instead of kvm_<reg>_read() so that bits 63:32 of the register are dropped
if the guest is in 32-bit mode.

Note, the misleading/surprising behavior of kvm_<reg>_read() being "raw"
variants under the hood will be addressed once all non-benign bugs are
fixed.

Fixes: 70210c044b4e ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions")
Fixes: b6f084ca5538 ("KVM: VMX: Add ENCLS[EINIT] handler to support SGX Launch Control (LC)")
Acked-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260529222223.870923-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86/xen: Don't truncate RAX when handling hypercall from protected guest

Don't truncate RAX when handling a Xen hypercall for a guest with protected
state, as KVM's ABI is to assume the guest is in 64-bit for such cases
(the guest leaving garbage in 63:32 after a transition to 32-bit mode is
far less likely than 63:32 being necessary to complete the hypercall).

Fixes: b5aead0064f3 ("KVM: x86: Assume a 64-bit hypercall for guests with protected state")
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://patch.msgid.link/20260529222223.870923-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>