The commit 4d38328eb442d ("tracing: Fix synth event printk format for str
fields") replaced "%.*s" with "%s" but missed removing the number size of
the dynamic and static strings. The commit e1a453a57bc7 ("tracing: Do not
add length to print format in synthetic events") fixed the dynamic part
but did not fix the static part. That is, with the commands:
For the cases where user includes a non-zero value in 'token_uuid_ptr'
field of 'struct vfio_device_bind_iommufd', the copy_struct_from_user()
in vfio_df_ioctl_bind_iommufd() fails with -E2BIG. For the 'minsz' passed,
copy_struct_from_user() expects the newly introduced field to be zero-ed,
which would be incorrect in this case.
Fix this by passing the actual size of the kernel struct. If working
with a newer userspace, copy_struct_from_user() would copy the
'token_uuid_ptr' field, and if working with an old userspace, it would
zero out this field, thus still retaining backward compatibility.
Fixes: 86624ba3b522 ("vfio/pci: Do vf_token checks for VFIO_DEVICE_BIND_IOMMUFD") Cc: stable@vger.kernel.org Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20251031170603.2260022-2-rananta@google.com Signed-off-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
pci_get_device() will increase the reference count for the returned
pci_dev, and also decrease the reference count for the input parameter
from if it is not NULL.
If we break the loop in with 'vf_pdev' not NULL. We
need to call pci_dev_put() to decrease the reference count.
Found via static anlaysis and this is similar to commit c508eb042d97
("perf/x86/intel/uncore: Fix reference count leak in sad_cfg_iio_topology()")
Fixes: 8b6c724cdab8 ("virtio: vdpa: vDPA driver for Marvell OCTEON DPU devices") Cc: stable@vger.kernel.org Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20251027060737.33815-1-linmq006@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The write pointer of zones that are in the full condition is always
invalid. Reflect that fact by setting the write pointer of full zones
to ULLONG_MAX.
Fixes: eb0570c7df23 ("block: new zoned loop block device driver") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
zloop_rw() will fail any regular write operation that targets a full
sequential zone. The check for this is indirect and achieved by checking
the write pointer alignment of the write operation. But this check is
ineffective for zone append operations since these are alwasy
automatically directed at a zone write pointer.
Prevent zone append operations from being executed in a full zone with
an explicit check of the zone condition.
Fixes: eb0570c7df23 ("block: new zoned loop block device driver") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 9a7c987fb92b ("crypto: arm64/ghash - Use API partial block
handling") made ghash_finup() pass the wrong buffer to
ghash_do_simd_update(). As a result, ghash-neon now produces incorrect
outputs when the message length isn't divisible by 16 bytes. Fix this.
(I didn't notice this earlier because this code is reached only on CPUs
that support NEON but not PMULL. I haven't yet found a way to get
qemu-system-aarch64 to emulate that configuration.)
Fixes: 9a7c987fb92b ("crypto: arm64/ghash - Use API partial block handling") Cc: stable@vger.kernel.org Reported-by: Diederik de Haas <diederik@cknow-tech.com> Closes: https://lore.kernel.org/linux-crypto/DETXT7QI62KE.F3CGH2VWX1SC@cknow-tech.com/ Tested-by: Diederik de Haas <diederik@cknow-tech.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Link: https://lore.kernel.org/r/20251209223417.112294-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As kcalloc() may fail, check its return value to avoid a NULL pointer
dereference when passing the buffer to rng->read(). On allocation
failure, log the error and return since test_len() returns void.
Fixes: 2be0d806e25e ("crypto: caam - add a test for the RNG") Cc: stable@vger.kernel.org Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Several crypto user API contexts and requests allocated with
sock_kmalloc() were left uninitialized, relying on callers to
set fields explicitly. This resulted in the use of uninitialized
data in certain error paths or when new fields are added in the
future.
The ACVP patches also contain two user-space interface files:
algif_kpp.c and algif_akcipher.c. These too rely on proper
initialization of their context structures.
A particular issue has been observed with the newly added
'inflight' variable introduced in af_alg_ctx by commit:
Because the context is not memset to zero after allocation,
the inflight variable has contained garbage values. As a result,
af_alg_alloc_areq() has incorrectly returned -EBUSY randomly when
the garbage value was interpreted as true:
The check directly tests ctx->inflight without explicitly
comparing against true/false. Since inflight is only ever set to
true or false later, an uninitialized value has triggered
-EBUSY failures. Zero-initializing memory allocated with
sock_kmalloc() ensures inflight and other fields start in a known
state, removing random issues caused by uninitialized data.
Commit b8d3404058a6 ("dt-bindings: PCI: qcom,pcie-sm8550: Move SM8550 to
dedicated schema") move the device schema to separate file, but it
missed a "if:not:...then:" clause in the original binding which was
requiring power-domains and resets for this particular chip.
Commit 88c9b3af4e31 ("dt-bindings: PCI: qcom,pcie-sm8450: Move SM8450 to
dedicated schema") move the device schema to separate file, but it
missed a "if:not:...then:" clause in the original binding which was
requiring power-domains and resets for this particular chip.
Commit 2278b8b54773 ("dt-bindings: PCI: qcom,pcie-sm8350: Move SM8350 to
dedicated schema") move the device schema to separate file, but it
missed a "if:not:...then:" clause in the original binding which was
requiring power-domains and resets for this particular chip.
Commit 4891b66185c1 ("dt-bindings: PCI: qcom,pcie-sm8250: Move SM8250 to
dedicated schema") move the device schema to separate file, but it
missed a "if:not:...then:" clause in the original binding which was
requiring power-domains and resets for this particular chip.
Commit 51bc04d5b49d ("dt-bindings: PCI: qcom,pcie-sm8150: Move SM8150 to
dedicated schema") move the device schema to separate file, but it
missed a "if:not:...then:" clause in the original binding which was
requiring power-domains and resets for this particular chip.
Commit c007a5505504 ("dt-bindings: PCI: qcom,pcie-sc8280xp: Move
SC8280XP to dedicated schema") move the device schema to separate file,
but it missed a "if:not:...then:" clause in the original binding which
was requiring power-domains and resets for this particular chip.
Commit 756485bfbb85 ("dt-bindings: PCI: qcom,pcie-sc7280: Move SC7280 to
dedicated schema") move the device schema to separate file, but it
missed a "if:not:...then:" clause in the original binding which was
requiring power-domains and resets for this particular chip.
The manuals of 2K2000 says both BIT_CTRL_MODE and BYTE_CTRL_MODE are
supported but the latter is recommended. Also on 2K3000, per the ACPI
DSDT the GPIO controller is compatible with 2K2000, but it fails to
operate GPIOs 62 and 63 (and maybe others) using BIT_CTRL_MODE.
Using BYTE_CTRL_MODE also makes those 2K3000 GPIOs work.
Dell Precision 7780 often wakes up on its own from suspend. Sometimes
wake up happens immediately (i. e. within 7 seconds), sometimes it happens
after, say, 30 minutes.
With z16 a new flag 'search boot program' was introduced for
list-directed IPL (SCSI, NVMe, ECKD DASD). If this flag is set,
e.g. via selecting the "Automatic" value for the "Boot program
selector" control on an HMC load panel, it is copied to the reipl
structure from the initial ipl structure. When a user now sets a
boot prog via sysfs, the flag is not cleared and the bootloader
will again automatically select the boot program, ignoring user
configuration.
To avoid that, clear the SBP flag when a bootprog sysfs file is
written.
On x86-64, this_cpu_cmpxchg() uses CMPXCHG without LOCK prefix which
means it is only safe for the local CPU and not for multiple CPUs.
Recently the commit 36df6e3dbd7e ("cgroup: make css_rstat_updated nmi
safe") make css_rstat_updated lockless and uses lockless list to allow
reentrancy. Since css_rstat_updated can invoked from process context,
IRQ and NMI, it uses this_cpu_cmpxchg() to select the winner which will
inset the lockless lnode into the global per-cpu lockless list.
However the commit missed one case where lockless node of a cgroup can
be accessed and modified by another CPU doing the flushing. Basically
llist_del_first_init() in css_process_update_tree().
On a cursory look, it can be questioned how css_process_update_tree()
can see a lockless node in global lockless list where the updater is at
this_cpu_cmpxchg() and before llist_add() call in css_rstat_updated().
This can indeed happen in the presence of IRQs/NMI.
Consider this scenario: Updater for cgroup stat C on CPU A in process
context is after llist_on_list() check and before this_cpu_cmpxchg() in
css_rstat_updated() where it get interrupted by IRQ/NMI. In the IRQ/NMI
context, a new updater calls css_rstat_updated() for same cgroup C and
successfully inserts rstatc_pcpu->lnode.
Now concurrently CPU B is running the flusher and it calls
llist_del_first_init() for CPU A and got rstatc_pcpu->lnode of cgroup C
which was added by the IRQ/NMI updater.
Now imagine CPU B calling init_llist_node() on cgroup C's
rstatc_pcpu->lnode of CPU A and on CPU A, the process context updater
calling this_cpu_cmpxchg(rstatc_pcpu->lnode) concurrently.
The CMPXCNG without LOCK on CPU A is not safe and thus we need LOCK
prefix.
In Meta's fleet running the kernel with the commit 36df6e3dbd7e, we are
observing on some machines the memcg stats are getting skewed by more
than the actual memory on the system. On close inspection, we noticed
that lockless node for a workload for specific CPU was in the bad state
and thus all the updates on that CPU for that cgroup was being lost.
To confirm if this skew was indeed due to this CMPXCHG without LOCK in
css_rstat_updated(), we created a repro (using AI) at [1] which shows
that CMPXCHG without LOCK creates almost the same lnode corruption as
seem in Meta's fleet and with LOCK CMPXCHG the issue does not
reproduces.
We can't log a conflicting inode if it's a directory and it was moved
from one parent directory to another parent directory in the current
transaction, as this can result an attempt to have a directory with
two hard links during log replay, one for the old parent directory and
another for the new parent directory.
The following scenario triggers that issue:
1) We have directories "dir1" and "dir2" created in a past transaction.
Directory "dir1" has inode A as its parent directory;
2) We move "dir1" to some other directory;
3) We create a file with the name "dir1" in directory inode A;
4) We fsync the new file. This results in logging the inode of the new file
and the inode for the directory "dir1" that was previously moved in the
current transaction. So the log tree has the INODE_REF item for the
new location of "dir1";
5) We move the new file to some other directory. This results in updating
the log tree to included the new INODE_REF for the new location of the
file and removes the INODE_REF for the old location. This happens
during the rename when we call btrfs_log_new_name();
6) We fsync the file, and that persists the log tree changes done in the
previous step (btrfs_log_new_name() only updates the log tree in
memory);
7) We have a power failure;
8) Next time the fs is mounted, log replay happens and when processing
the inode for directory "dir1" we find a new INODE_REF and add that
link, but we don't remove the old link of the inode since we have
not logged the old parent directory of the directory inode "dir1".
As a result after log replay finishes when we trigger writeback of the
subvolume tree's extent buffers, the tree check will detect that we have
a directory a hard link count of 2 and we get a mount failure.
The errors and stack traces reported in dmesg/syslog are like this:
Fix this by never logging a conflicting inode that is a directory and was
moved in the current transaction (its last_unlink_trans equals the current
transaction) and instead fallback to a transaction commit.
If SMT is disabled or a partial SMT state is enabled, when a new kernel
image is loaded for kexec, on reboot the following warning is observed:
kexec: Waking offline cpu 228.
WARNING: CPU: 0 PID: 9062 at arch/powerpc/kexec/core_64.c:223 kexec_prepare_cpus+0x1b0/0x1bc
[snip]
NIP kexec_prepare_cpus+0x1b0/0x1bc
LR kexec_prepare_cpus+0x1a0/0x1bc
Call Trace:
kexec_prepare_cpus+0x1a0/0x1bc (unreliable)
default_machine_kexec+0x160/0x19c
machine_kexec+0x80/0x88
kernel_kexec+0xd0/0x118
__do_sys_reboot+0x210/0x2c4
system_call_exception+0x124/0x320
system_call_vectored_common+0x15c/0x2ec
This occurs as add_cpu() fails due to cpu_bootable() returning false for
CPUs that fail the cpu_smt_thread_allowed() check or non primary
threads if SMT is disabled.
Fix the issue by enabling SMT and resetting the number of SMT threads to
the number of threads per core, before attempting to wake up all present
CPUs.
A zero length gss_token results in pages == 0 and in_token->pages[0]
is NULL. The code unconditionally evaluates
page_address(in_token->pages[0]) for the initial memcpy, which can
dereference NULL even when the copy length is 0. Guard the first
memcpy so it only runs when length > 0.
svc_rdma_copy_inline_range added rc_curpage (page index) to the page
base instead of the byte offset rc_pageoff. Use rc_pageoff so copies
land within the current page.
Clang is not happy about set but (in some cases) unused variable:
fs/nfsd/export.c:1027:17: error: variable 'inode' set but not used [-Werror,-Wunused-but-set-variable]
since it's used as a parameter to dprintk() which might be configured
a no-op. To avoid uglifying code with the specific ifdeffery just mark
the variable __maybe_unused.
The commit [1], which introduced this behaviour, is quite old and hence
the Fixes tag points to the first of the Git era.
svc_rdma_copy_inline_range indexed rqstp->rq_pages[rc_curpage] without
verifying rc_curpage stays within the allocated page array. Add guards
before the first use and after advancing to a new page.
Fixes: d7cc73972661 ("svcrdma: support multiple Read chunks per RPC") Cc: stable@vger.kernel.org Signed-off-by: Joshua Rogers <linux@joshua.hu> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> The bit vector that would set all REQUIRED and RECOMMENDED
> attributes that are supported by the EXCLUSIVE4_1 method of file
> creation via the OPEN operation. The scope of this attribute
> applies to all objects with a matching fsid.
There's nothing in RFC 8881 that states that suppattr_exclcreat is
or is not allowed to contain bits for attributes that are clear in
the reported supported_attrs bitmask. But it doesn't make sense for
an NFS server to indicate that it /doesn't/ implement an attribute,
but then also indicate that clients /are/ allowed to set that
attribute using OPEN(create) with EXCLUSIVE4_1.
The FATTR4_WORD2_TIME_DELEG attributes are also not to be allowed
for OPEN(create) with EXCLUSIVE4_1. It doesn't make sense to set
a delegated timestamp on a new file.
Fixes: 7e13f4f8d27d ("nfsd: handle delegated timestamps in SETATTR") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
By default, these devices use the quirk ALC294_FIXUP_ASUS_SPK. Not
using it causes the headphone jack to stop working. Therefore,
introduce a new quirk ALC287_FIXUP_TXNW2781_I2C_ASUS that binds
to the TAS amplifier while using that quirk.
Cc: stable@kernel.org Fixes: 18a4895370a7 ("ALSA: hda/realtek: Add match for ASUS Xbox Ally projects") Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> Link: https://patch.msgid.link/20251216211714.1116898-1-lkml@antheas.dev Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With the introduction of 8-bit formats the DMIC blob lookup also needs to
be modified to prefer the 32-bit blob when 8-bit format is used on FE.
At the same time we also need to make sure that in case 8-bit format is
used, but only 16-bit blob is available for DMIC then we will not try to
look for 8-bit blob (which is invalid) as fallback, but for a 16-bit one.
Fixes: c04c2e829649 ("ASoC: SOF: ipc4-topology: Add support for 8-bit formats") Cc: stable@vger.kernel.org Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com> Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Link: https://patch.msgid.link/20251215120648.4827-2-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
An NFSv4 client that sets an ACL with a named principal during file
creation retrieves the ACL afterwards, and finds that it is only a
default ACL (based on the mode bits) and not the ACL that was
requested during file creation. This violates RFC 8881 section
6.4.1.3: "the ACL attribute is set as given".
The issue occurs in nfsd_create_setattr(), which calls
nfsd_attrs_valid() to determine whether to call nfsd_setattr().
However, nfsd_attrs_valid() checks only for iattr changes and
security labels, but not POSIX ACLs. When only an ACL is present,
the function returns false, nfsd_setattr() is skipped, and the
POSIX ACL is never applied to the inode.
Subsequently, when the client retrieves the ACL, the server finds
no POSIX ACL on the inode and returns one generated from the file's
mode bits rather than returning the originally-specified ACL.
Reported-by: Aurélien Couderc <aurelien.couderc2002@gmail.com> Fixes: c0cbe70742f4 ("NFSD: add posix ACLs to struct nfsd_attrs") Cc: Roland Mainz <roland.mainz@nrubsig.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> The bit vector that would set all REQUIRED and RECOMMENDED
> attributes that are supported by the EXCLUSIVE4_1 method of file
> creation via the OPEN operation. The scope of this attribute
> applies to all objects with a matching fsid.
There's nothing in RFC 8881 that states that suppattr_exclcreat is
or is not allowed to contain bits for attributes that are clear in
the reported supported_attrs bitmask. But it doesn't make sense for
an NFS server to indicate that it /doesn't/ implement an attribute,
but then also indicate that clients /are/ allowed to set that
attribute using OPEN(create) with EXCLUSIVE4_1.
Ensure that the SECURITY_LABEL and ACL bits are not set in the
suppattr_exclcreat bitmask when they are also not set in the
supported_attrs bitmask.
Fixes: 8c18f2052e75 ("nfsd41: SUPPATTR_EXCLCREAT attribute") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
handshake_req_submit() replaces sk->sk_destruct but never restores it when
submission fails before the request is hashed. handshake_sk_destruct() then
returns early and the original destructor never runs, leaking the socket.
Restore sk_destruct on the error path.
Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: caoping <caoping@cmss.chinamobile.com> Link: https://patch.msgid.link/20251204091058.1545151-1-caoping@cmss.chinamobile.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
inotify/fanotify do not allow users with no read access to a file to
subscribe to events (e.g. IN_ACCESS/IN_MODIFY), but they do allow the
same user to subscribe for watching events on children when the user
has access to the parent directory (e.g. /dev).
Users with no read access to a file but with read access to its parent
directory can still stat the file and see if it was accessed/modified
via atime/mtime change.
The same is not true for special files (e.g. /dev/null). Users will not
generally observe atime/mtime changes when other users read/write to
special files, only when someone sets atime/mtime via utimensat().
Align fsnotify events with this stat behavior and do not generate
ACCESS/MODIFY events to parent watchers on read/write of special files.
The events are still generated to parent watchers on utimensat(). This
closes some side-channels that could be possibly used for information
exfiltration [1].
The local variable 'val' was never clamped to -75000 or 180000 because
the return value of clamp_val() was not used. Fix this by assigning the
clamped value back to 'val', and use clamp() instead of clamp_val().
Cc: stable@vger.kernel.org Fixes: a557a92e6881 ("net: phy: marvell-88q2xxx: add support for temperature sensor") Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Dimitri Fedrau <dima.fedrau@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20251202172743.453055-3-thorsten.blum@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Wake-on-Lan does currently not work for r8169 in DASH mode, e.g. the
ASUS Pro WS X570-ACE with RTL8168fp/RTL8117.
Fix by not returning early in rtl_prepare_power_down when dash_enabled.
While this fixes WoL, it still kills the OOB RTL8117 remote management
BMC connection. Fix by not calling rtl8168_driver_stop if WoL is enabled.
Fixes: 065c27c184d6 ("r8169: phy power ops") Signed-off-by: René Rebe <rene@exactco.de> Cc: stable@vger.kernel.org Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://patch.msgid.link/20251202.194137.1647877804487085954.rene@exactco.de Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we exec a new task we forget to flush the set of locked GCS mode bits.
Since we do flush the rest of the state this means that if GCS is locked
the new task will be unable to enable GCS, it will be locked as being
disabled. Add the expected flush.
Fixes: fc84bc5378a8 ("arm64/gcs: Context switch GCS state for EL0") Cc: <stable@vger.kernel.org> # 6.13.x Reported-by: Yury Khrustalev <Yury.Khrustalev@arm.com> Signed-off-by: Mark Brown <broonie@kernel.org> Tested-by: Yury Khrustalev <yury.khrustalev@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 89d9cec3b1e9 ("PM: runtime: Clear power.needs_force_resume in
pm_runtime_reinit()") added provisional clearing of power.needs_force_resume
to pm_runtime_reinit(), but it is done unconditionally which is a
mistake because pm_runtime_reinit() may race with driver probing
and removal [1].
To address this, notice that power.needs_force_resume should never
be set when runtime PM is enabled and so it only needs to be cleared
when runtime PM is disabled, and update pm_runtime_init() to only
clear that flag when runtime PM is disabled.
Fixes: 89d9cec3b1e9 ("PM: runtime: Clear power.needs_force_resume in pm_runtime_reinit()") Reported-by: Ed Tsai <ed.tsai@mediatek.com> Closes: https://lore.kernel.org/linux-pm/20251215122154.3180001-1-ed.tsai@mediatek.com/ [1] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: 6.17+ <stable@vger.kernel.org> # 6.17+ Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/12807571.O9o76ZdvQC@rafael.j.wysocki Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Synthetic events currently do not have a function to register perf events.
This leads to calling the tracepoint register functions with a NULL
function pointer which triggers:
Instead, have the code return -ENODEV, which doesn't warn and has perf
error out with:
# perf record -e synthetic:futex_wait
Error:
The sys_perf_event_open() syscall returned with 19 (No such device) for event (synthetic:futex_wait).
"dmesg | grep -i perf" may provide additional information.
Ideally perf should support synthetic events, but for now just fix the
warning. The support can come later.
Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://patch.msgid.link/20251216182440.147e4453@gandalf.local.home Fixes: 4b147936fa509 ("tracing: Add support for 'synthetic' events") Reported-by: Ian Rogers <irogers@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Garbage collection assumes all zones contain the full amount of blocks.
Mkfs already ensures this happens, but make the kernel check it as well
to avoid getting into trouble due to fuzzers or mkfs bugs.
Fixes: 2167eaabe2fa ("xfs: define the zoned on-disk format") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: stable@vger.kernel.org # v6.15 Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The xchk_setup_xattr_buf function can allocate a new value buffer, which
means that any reference to ab->value before the call could become a
dangling pointer. Fix this by moving an assignment to after the buffer
setup.
Cc: stable@vger.kernel.org # v6.10 Fixes: e47dcf113ae348 ("xfs: repair extended attributes") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The grofs code for zoned RT subvolums already tries to check for zone
alignment, but gets it wrong by using the old instead of the new mount
structure.
Fixes: 01b71e64bb87 ("xfs: support growfs on zoned file systems") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: stable@vger.kernel.org # v6.15 Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
xfs_attr_item.c: In function ‘xfs_attr_recover_work’:
xfs_attr_item.c:785:9: warning: ‘ip’ may be used uninitialized [-Wmaybe-uninitialized]
785 | xfs_trans_ijoin(tp, ip, 0);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
xfs_attr_item.c:740:42: note: ‘ip’ was declared here
740 | struct xfs_inode *ip;
| ^~
I think this is bogus since xfs_attri_recover_work either returns a real
pointer having initialized ip or an ERR_PTR having not touched it, but
the tools are smarter than me so let's just null-init the variable
anyway.
Cc: stable@vger.kernel.org # v6.8 Fixes: e70fb328d52772 ("xfs: recreate work items when recovering intent items") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In commit 0297cdc12a87 ("KVM: selftests: Add option to rseq test to
override /dev/cpu_dma_latency"), a 'break' is missed before the option
'l' in the argument parsing loop, which leads to an unexpected core
dump in atoi_paranoid(). It tries to get the latency from non-existent
argument.
host$ ./rseq_test -u
Random seed: 0x6b8b4567
Segmentation fault (core dumped)
Add a 'break' before the option 'l' in the argument parsing loop to avoid
the unexpected core dump.
Fixes: 0297cdc12a87 ("KVM: selftests: Add option to rseq test to override /dev/cpu_dma_latency") Cc: stable@vger.kernel.org # v6.15+ Signed-off-by: Gavin Shan <gshan@redhat.com> Link: https://patch.msgid.link/20251124050427.1924591-1-gshan@redhat.com
[sean: describe code change in shortlog] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When handling KVM_SET_CPUID{,2}, do runtime CPUID updates on the vCPU's
current CPUID (and caps) prior to swapping in the incoming CPUID state so
that KVM doesn't lose pending updates if the incoming CPUID is rejected,
and to prevent a false failure on the equality check.
Note, runtime updates are unconditionally performed on the incoming/new
CPUID (and associated caps), i.e. clearing the dirty flag won't negatively
affect the new CPUID.
Fixes: 93da6af3ae56 ("KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation") Reported-by: Igor Mammedov <imammedo@redhat.com> Closes: https://lore.kernel.org/all/20251128123202.68424a95@imammedo Cc: stable@vger.kernel.org Acked-by: Igor Mammedov <imammedo@redhat.com> Tested-by: Igor Mammedov <imammedo@redhat.com> Link: https://patch.msgid.link/20251202015049.1167490-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Explicitly clear exit_code_hi in the VMCB when synthesizing "normal"
nested VM-Exits, as the full exit code is a 64-bit value (spoiler alert),
and all exit codes for non-failing VMRUN use only bits 31:0.
Cc: Jim Mattson <jmattson@google.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: stable@vger.kernel.org Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251113225621.1688428-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Set exit_code_hi to -1u as a temporary band-aid to fix a long-standing
(effectively since KVM's inception) bug where KVM treats the exit code as
a 32-bit value, when in reality it's a 64-bit value. Per the APM, offset
0x70 is a single 64-bit value:
070h 63:0 EXITCODE
And a sane reading of the error values defined in "Table C-1. SVM Intercept
Codes" is that negative values use the full 64 bits:
–1 VMEXIT_INVALID Invalid guest state in VMCB.
–2 VMEXIT_BUSYBUSY bit was set in the VMSA
–3 VMEXIT_IDLE_REQUIREDThe sibling thread is not in an idle state
-4 VMEXIT_INVALID_PMC Invalid PMC state
And that interpretation is confirmed by testing on Milan and Turin (by
setting bits in CR0[63:32] to generate VMEXIT_INVALID on VMRUN).
Furthermore, Xen has treated exitcode as a 64-bit value since HVM support
was adding in 2006 (see Xen commit d1bd157fbc ("Big merge the HVM
full-virtualisation abstractions.")).
Cc: Jim Mattson <jmattson@google.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: stable@vger.kernel.org Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251113225621.1688428-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If an APICv status updated was pended while L2 was active, immediately
refresh vmcs01's controls instead of pending KVM_REQ_APICV_UPDATE as
kvm_vcpu_update_apicv() only calls into vendor code if a change is
necessary.
E.g. if APICv is inhibited, and then activated while L2 is running:
Set all user-return MSRs to their post-TD-exit value when preparing to run
a TDX vCPU to ensure the value that KVM expects to be loaded after running
the vCPU is indeed the value that's loaded in hardware. If the TDX-Module
doesn't actually enter the guest, i.e. doesn't do VM-Enter, then it won't
"restore" VMM state, i.e. won't clobber user-return MSRs to their expected
post-run values, in which case simply updating KVM's "cached" value will
effectively corrupt the cache due to hardware still holding the original
value.
In theory, KVM could conditionally update the current user-return value if
and only if tdh_vp_enter() succeeds, but in practice "success" doesn't
guarantee the TDX-Module actually entered the guest, e.g. if the TDX-Module
synthesizes an EPT Violation because it suspects a zero-step attack.
Force-load the expected values instead of trying to decipher whether or
not the TDX-Module restored/clobbered MSRs, as the risk doesn't justify
the benefits. Effectively avoiding four WRMSRs once per run loop (even if
the vCPU is scheduled out, user-return MSRs only need to be reloaded if
the CPU exits to userspace or runs a non-TDX vCPU) is likely in the noise
when amortized over all entries, given the cost of running a TDX vCPU.
E.g. the cost of the WRMSRs is somewhere between ~300 and ~500 cycles,
whereas the cost of a _single_ roundtrip to/from a TDX guest is thousands
of cycles.
Fixes: e0b4f31a3c65 ("KVM: TDX: restore user ret MSRs") Cc: stable@vger.kernel.org Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Xiaoyao Li <xiaoyao.li@intel.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://patch.msgid.link/20251030191528.3380553-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mark the VMCB_PERM_MAP bit as dirty in nested_vmcb02_prepare_control()
on every nested VMRUN.
If L1 changes MSR interception (INTERCEPT_MSR_PROT) between two VMRUN
instructions on the same L1 vCPU, the msrpm_base_pa in the associated
vmcb02 will change, and the VMCB_PERM_MAP clean bit should be cleared.
Fixes: 4bb170a5430b ("KVM: nSVM: do not mark all VMCB02 fields dirty on nested vmexit") Reported-by: Matteo Rizzo <matteorizzo@google.com> Cc: stable@vger.kernel.org Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250922162935.621409-2-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In kvm_vcpu_on_spin(), the loop counter 'i' is incorrectly written to
last_boosted_vcpu instead of the actual vCPU index 'idx'. This causes
last_boosted_vcpu to store the loop iteration count rather than the
vCPU index, leading to incorrect round-robin behavior in subsequent
directed yield operations.
Fix this by using 'idx' instead of 'i' in the assignment.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20251110033232.12538-7-kernellwp@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When emulating L2 instructions, svm_check_intercept() checks whether a
write to CR0 should trigger a synthesized #VMEXIT with
SVM_EXIT_CR0_SEL_WRITE. For MOV-to-CR0, SVM_EXIT_CR0_SEL_WRITE is only
triggered if any bit other than CR0.MP and CR0.TS is updated. However,
according to the APM (24593—Rev. 3.42—March 2024, Table 15-7):
The LMSW instruction treats the selective CR0-write
intercept as a non-selective intercept (i.e., it intercepts
regardless of the value being written).
Skip checking the changed bits for x86_intercept_lmsw and always inject
SVM_EXIT_CR0_SEL_WRITE.
Mark the VMCB_NPT bit as dirty in nested_vmcb02_prepare_save()
on every nested VMRUN.
If L1 changes the PAT MSR between two VMRUN instructions on the same
L1 vCPU, the g_pat field in the associated vmcb02 will change, and the
VMCB_NPT clean bit should be cleared.
Fixes: 4bb170a5430b ("KVM: nSVM: do not mark all VMCB02 fields dirty on nested vmexit") Cc: stable@vger.kernel.org Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250922162935.621409-3-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When emulating L2 instructions, svm_check_intercept() checks whether a
write to CR0 should trigger a synthesized #VMEXIT with
SVM_EXIT_CR0_SEL_WRITE. However, it does not check whether L1 enabled
the intercept for SVM_EXIT_WRITE_CR0, which has higher priority
according to the APM (24593—Rev. 3.42—March 2024, Table 15-7):
When both selective and non-selective CR0-write intercepts are active at
the same time, the non-selective intercept takes priority. With respect
to exceptions, the priority of this intercept is the same as the generic
CR0-write intercept.
Make sure L1 does NOT intercept SVM_EXIT_WRITE_CR0 before checking if
SVM_EXIT_CR0_SEL_WRITE needs to be injected.
Opportunistically tweak the "not CR0" logic to explicitly bail early so
that it's more obvious that only CR0 has a selective intercept, and that
modifying icpt_info.exit_code is functionally necessary so that the call
to nested_svm_exit_handled() checks the correct exit code.
When advancing the target expiration for the guest's APIC timer in periodic
mode, set the expiration to "now" if the target expiration is in the past
(similar to what is done in update_target_expiration()). Blindly adding
the period to the previous target expiration can result in KVM generating
a practically unbounded number of hrtimer IRQs due to programming an
expired timer over and over. In extreme scenarios, e.g. if userspace
pauses/suspends a VM for an extended duration, this can even cause hard
lockups in the host.
Currently, the bug only affects Intel CPUs when using the hypervisor timer
(HV timer), a.k.a. the VMX preemption timer. Unlike the software timer,
a.k.a. hrtimer, which KVM keeps running even on exits to userspace, the
HV timer only runs while the guest is active. As a result, if the vCPU
does not run for an extended duration, there will be a huge gap between
the target expiration and the current time the vCPU resumes running.
Because the target expiration is incremented by only one period on each
timer expiration, this leads to a series of timer expirations occurring
rapidly after the vCPU/VM resumes.
More critically, when the vCPU first triggers a periodic HV timer
expiration after resuming, advancing the expiration by only one period
will result in a target expiration in the past. As a result, the delta
may be calculated as a negative value. When the delta is converted into
an absolute value (tscdeadline is an unsigned u64), the resulting value
can overflow what the HV timer is capable of programming. I.e. the large
value will exceed the VMX Preemption Timer's maximum bit width of
cpu_preemption_timer_multi + 32, and thus cause KVM to switch from the
HV timer to the software timer (hrtimers).
After switching to the software timer, periodic timer expiration callbacks
may be executed consecutively within a single clock interrupt handler,
because hrtimers honors KVM's request for an expiration in the past and
immediately re-invokes KVM's callback after reprogramming. And because
the interrupt handler runs with IRQs disabled, restarting KVM's hrtimer
over and over until the target expiration is advanced to "now" can result
in a hard lockup.
E.g. the following hard lockup was triggered in the host when running a
Windows VM (only relevant because it used the APIC timer in periodic mode)
after resuming the VM from a long suspend (in the host).
Moreover, if the suspend duration of the virtual machine is not long enough
to trigger a hard lockup in this scenario, since commit 98c25ead5eda
("KVM: VMX: Move preemption timer <=> hrtimer dance to common x86"), KVM
will continue using the software timer until the guest reprograms the APIC
timer in some way. Since the periodic timer does not require frequent APIC
timer register programming, the guest may continue to use the software
timer in perpetuity.
Fixes: d8f2f498d9ed ("x86/kvm: fix LAPIC timer drift when guest uses periodic mode") Cc: stable@vger.kernel.org Signed-off-by: fuqiang wang <fuqiang.wng@gmail.com>
[sean: massage comments and changelog] Link: https://patch.msgid.link/20251113205114.1647493-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When restarting an hrtimer to emulate a the guest's APIC timer in periodic
mode, explicitly set the expiration using the target expiration computed
by advance_periodic_target_expiration() instead of adding the period to
the existing timer. This will allow making adjustments to the expiration,
e.g. to deal with expirations far in the past, without having to implement
the same logic in both advance_periodic_target_expiration() and
apic_timer_fn().
Cc: stable@vger.kernel.org Signed-off-by: fuqiang wang <fuqiang.wng@gmail.com>
[sean: split to separate patch, write changelog] Link: https://patch.msgid.link/20251113205114.1647493-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
WARN and don't restart the hrtimer if KVM's callback runs with the guest's
APIC timer in periodic mode but with a period of '0', as not advancing the
hrtimer's deadline would put the CPU into an infinite loop of hrtimer
events. Observing a period of '0' should be impossible, even when the
hrtimer is running on a different CPU than the vCPU, as KVM is supposed to
cancel the hrtimer before changing (or zeroing) the period, e.g. when
switching from periodic to one-shot.
Since Linux v6.7, booting using BootX on an Old World PowerMac produces
an early crash. Stan Johnson writes, "the symptoms are that the screen
goes blank and the backlight stays on, and the system freezes (Linux
doesn't boot)."
Further testing revealed that the failure can be avoided by disabling
CONFIG_BOOTX_TEXT. Bisection revealed that the regression was caused by
a change to the font bitmap pointer that's used when btext_init() begins
painting characters on the display, early in the boot process.
Christophe Leroy explains, "before kernel text is relocated to its final
location ... data is addressed with an offset which is added to the
Global Offset Table (GOT) entries at the start of bootx_init()
by function reloc_got2(). But the pointers that are located inside a
structure are not referenced in the GOT and are therefore not updated by
reloc_got2(). It is therefore needed to apply the offset manually by using
PTRRELOC() macro."
If the osdmap is (maliciously) corrupted such that the encoded length
of ceph_pg_pool envelope is less than what is expected for a particular
encoding version, out-of-bounds reads may ensue because the only bounds
check that is there is based on that length value.
This patch adds explicit bounds checks for each field that is decoded
or skipped.
In chacha_zvkb, avoid using the s0 register, which is the frame pointer,
by reallocating KEY0 to t5. This makes stack traces available if e.g. a
crash happens in chacha_zvkb.
No frame pointer maintenance is otherwise required since this is a leaf
function.
In dm-pcache, in order to ensure crash-consistency, a dual-copy scheme
is used to alternately update metadata, and there is a slot index that
records the current slot. However, in the write path the current
implementation writes directly to the current slot indexed by slot
index, and then advances the slot — which ends up overwriting the
existing slot, violating the crash-consistency guarantee.
This patch fixes that behavior, preventing metadata from being
overwritten incorrectly.
In addition, this patch add a missing pmem_wmb() after memcpy_flushcache().
The ASP chip is a very old variant of the GSP chip and is used e.g. in
HP 730 workstations. When trying to reprogram the affinity it will crash
with a HPMC as the relevant registers don't seem to be at the usual
location. Let's avoid the crash by checking the sversion. Also note,
that reprogramming isn't necessary either, as the HP730 is a just a
single-CPU machine.
__scs_magic() needs a 'void *' variable, but a 'struct task_struct *' is
given. 'task_scs(tsk)' is the starting address of the task's shadow call
stack, and '__scs_magic(task_scs(tsk))' is the end address of the task's
shadow call stack. Here should be '__scs_magic(task_scs(tsk))'.
The user-visible effect of this bug is that when CONFIG_DEBUG_STACK_USAGE
is enabled, the shadow call stack usage checking function
(scs_check_usage) would scan an incorrect memory range. This could lead
to:
1. **Inaccurate stack usage reporting**: The function would calculate
wrong usage statistics for the shadow call stack, potentially showing
incorrect value in kmsg.
2. **Potential kernel crash**: If the value of __scs_magic(tsk)is
greater than that of __scs_magic(task_scs(tsk)), the for loop may
access unmapped memory, potentially causing a kernel panic. However,
this scenario is unlikely because task_struct is allocated via the slab
allocator (which typically returns lower addresses), while the shadow
call stack returned by task_scs(tsk) is allocated via vmalloc(which
typically returns higher addresses).
However, since this is purely a debugging feature
(CONFIG_DEBUG_STACK_USAGE), normal production systems should be not
unaffected. The bug only impacts developers and testers who are actively
debugging stack usage with this configuration enabled.
Link: https://lkml.kernel.org/r/20251011082222.12965-1-zhichi.lin@vivo.com Fixes: 5bbaf9d1fcb9 ("scs: Add support for stack usage debugging") Signed-off-by: Jiyuan Xie <xiejiyuan@vivo.com> Signed-off-by: Zhichi Lin <zhichi.lin@vivo.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Will Deacon <will@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yee Lee <yee.lee@mediatek.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add sanity check in iris_vb2_stop_streaming. If inst->state is
already IRIS_INST_ERROR, we should skip the stream_off operation
because it would still send packets to the firmware.
In iris_kill_session, inst->state is set to IRIS_INST_ERROR and
session_close is executed, which will kfree(inst_hfi_gen2->packet).
If stop_streaming is called afterward, it will cause a crash.
Fix an interaction between SMM and PV asynchronous #PFs where an #SMI can
cause KVM to drop an async #PF ready event, and thus result in guest tasks
becoming permanently stuck due to the task that encountered the #PF never
being resumed. Specifically, don't clear the completion queue when paging
is disabled, and re-check for completed async #PFs if/when paging is
enabled.
Prior to commit 2635b5c4a0e4 ("KVM: x86: interrupt based APF 'page ready'
event delivery"), flushing the APF queue without notifying the guest of
completed APF requests when paging is disabled was "necessary", in that
delivering a #PF to the guest when paging is disabled would likely confuse
and/or crash the guest. And presumably the original async #PF development
assumed that a guest would only disable paging when there was no intent to
ever re-enable paging.
That assumption fails in several scenarios, most visibly on an emulated
SMI, as entering SMM always disables CR0.PG (i.e. initially runs with
paging disabled). When the SMM handler eventually executes RSM, the
interrupted paging-enabled is restored, and the async #PF event is lost.
Similarly, invoking firmware, e.g. via EFI runtime calls, might require a
transition through paging modes and thus also disable paging with valid
entries in the competion queue.
To avoid dropping completion events, drop the "clear" entirely, and handle
paging-enable transitions in the same way KVM already handles APIC
enable/disable events: if a vCPU's APIC is disabled, APF completion events
are not kept pending and not injected while APIC is disabled. Once a
vCPU's APIC is re-enabled, KVM raises KVM_REQ_APF_READY so that the vCPU
recognizes any pending pending #APF ready events.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251015033258.50974-4-mlevitsk@redhat.com
[sean: rework changelog to call out #PF injection, drop "real mode"
references, expand the code comment] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
syzbot reported a kernel BUG in ocfs2_find_victim_chain() because the
`cl_next_free_rec` field of the allocation chain list (next free slot in
the chain list) is 0, triggring the BUG_ON(!cl->cl_next_free_rec)
condition in ocfs2_find_victim_chain() and panicking the kernel.
To fix this, an if condition is introduced in ocfs2_claim_suballoc_bits(),
just before calling ocfs2_find_victim_chain(), the code block in it being
executed when either of the following conditions is true:
1. `cl_next_free_rec` is equal to 0, indicating that there are no free
chains in the allocation chain list
2. `cl_next_free_rec` is greater than `cl_count` (the total number of
chains in the allocation chain list)
Either of them being true is indicative of the fact that there are no
chains left for usage.
This is addressed using ocfs2_error(), which prints
the error log for debugging purposes, rather than panicking the kernel.
Link: https://lkml.kernel.org/r/20251201130711.143900-1-activprithvi@gmail.com Signed-off-by: Prithvi Tambewagh <activprithvi@gmail.com> Reported-by: syzbot+96d38c6e1655c1420a72@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=96d38c6e1655c1420a72 Tested-by: syzbot+96d38c6e1655c1420a72@syzkaller.appspotmail.com Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: Heming Zhao <heming.zhao@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
vidtv_channel_si_init() creates a temporary list (program, service, event)
and ownership of the memory itself is transferred to the PAT/SDT/EIT
tables through vidtv_psi_pat_program_assign(),
vidtv_psi_sdt_service_assign(), vidtv_psi_eit_event_assign().
The problem here is that the local pointer where the memory ownership
transfer was completed is not initialized to NULL. This causes the
vidtv_psi_pmt_create_sec_for_each_pat_entry() function to fail, and
in the flow that jumps to free_eit, the memory that was freed by
vidtv_psi_*_table_destroy() can be accessed again by
vidtv_psi_*_event_destroy() due to the uninitialized local pointer, so it
is freed once again.
Therefore, to prevent use-after-free and double-free vulnerability,
local pointers must be initialized to NULL when transferring memory
ownership.
Cc: <stable@vger.kernel.org> Reported-by: syzbot+1d9c0edea5907af239e0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1d9c0edea5907af239e0 Fixes: 3be8037960bc ("media: vidtv: add error checks") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com> Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When CONFIG_SLUB_TINY is enabled, kfree_nolock() calls kasan_slab_free()
before defer_free(). On ARM64 with MTE (Memory Tagging Extension),
kasan_slab_free() poisons the memory and changes the tag from the
original (e.g., 0xf3) to a poison tag (0xfe).
When defer_free() then tries to write to the freed object to build the
deferred free list via llist_add(), the pointer still has the old tag,
causing a tag mismatch and triggering a KASAN use-after-free report:
BUG: KASAN: slab-use-after-free in defer_free+0x3c/0xbc mm/slub.c:6537
Write at addr f3f000000854f020 by task kworker/u8:6/983
Pointer tag: [f3], memory tag: [fe]
Fix this by calling kasan_reset_tag() before accessing the freed memory.
This is safe because defer_free() is part of the allocator itself and is
expected to manipulate freed memory for bookkeeping purposes.
Reject attempts to disable KVM_MEM_GUEST_MEMFD on a memslot that was
initially created with a guest_memfd binding, as KVM doesn't support
toggling KVM_MEM_GUEST_MEMFD on existing memslots. KVM prevents enabling
KVM_MEM_GUEST_MEMFD, but doesn't prevent clearing the flag.
Failure to reject the new memslot results in a use-after-free due to KVM
not unbinding from the guest_memfd instance. Unbinding on a FLAGS_ONLY
change is easy enough, and can/will be done as a hardening measure (in
anticipation of KVM supporting dirty logging on guest_memfd at some point),
but fixing the use-after-free would only address the immediate symptom.
==================================================================
BUG: KASAN: slab-use-after-free in kvm_gmem_release+0x362/0x400 [kvm]
Write of size 8 at addr ffff8881111ae908 by task repro/745
Allocated by task 745 on cpu 6 at 9.746971s:
kasan_save_stack+0x20/0x40
kasan_save_track+0x13/0x50
__kasan_kmalloc+0x77/0x90
kvm_set_memory_region.part.0+0x652/0x1110 [kvm]
kvm_vm_ioctl+0x14b0/0x3290 [kvm]
__x64_sys_ioctl+0x129/0x1a0
do_syscall_64+0x5b/0x900
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Freed by task 745 on cpu 6 at 9.747467s:
kasan_save_stack+0x20/0x40
kasan_save_track+0x13/0x50
__kasan_save_free_info+0x37/0x50
__kasan_slab_free+0x3b/0x60
kfree+0xf5/0x440
kvm_set_memslot+0x3c2/0x1160 [kvm]
kvm_set_memory_region.part.0+0x86a/0x1110 [kvm]
kvm_vm_ioctl+0x14b0/0x3290 [kvm]
__x64_sys_ioctl+0x129/0x1a0
do_syscall_64+0x5b/0x900
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Reported-by: Alexander Potapenko <glider@google.com> Fixes: a7800aa80ea4 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory") Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251202020334.1171351-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 1d2da79708cb ("pinctrl: renesas: rzg2l: Avoid configuring ISEL in
gpio_irq_{en,dis}able*()") dropped the configuration of ISEL from
struct irq_chip::{irq_enable, irq_disable} APIs and moved it to
struct gpio_chip::irq::{child_to_parent_hwirq,
child_irq_domain_ops::free} APIs to fix spurious IRQs.
After commit 1d2da79708cb ("pinctrl: renesas: rzg2l: Avoid configuring ISEL
in gpio_irq_{en,dis}able*()"), ISEL was no longer configured properly on
resume. This is because the pinctrl resume code used
struct irq_chip::irq_enable (called from rzg2l_gpio_irq_restore()) to
reconfigure the wakeup interrupts. Some drivers (e.g. Ethernet) may also
reconfigure non-wakeup interrupts on resume through their own code,
eventually calling struct irq_chip::irq_enable.
Fix this by adding ISEL configuration back into the
struct irq_chip::irq_enable API and on resume path for wakeup interrupts.
As struct irq_chip::irq_enable needs now to lock to update the ISEL,
convert the struct rzg2l_pinctrl::lock to a raw spinlock and replace the
locking API calls with the raw variants. Otherwise the lockdep reports
invalid wait context when probing the adv7511 module on RZ/G2L:
[ BUG: Invalid wait context ] 6.17.0-rc5-next-20250911-00001-gfcfac22533c9 #18 Not tainted
-----------------------------
(udev-worker)/165 is trying to lock: ffff00000e3664a8 (&pctrl->lock){....}-{3:3}, at: rzg2l_gpio_irq_enable+0x38/0x78
other info that might help us debug this:
context-{5:5}
3 locks held by (udev-worker)/165:
#0: ffff00000e890108 (&dev->mutex){....}-{4:4}, at: __driver_attach+0x90/0x1ac
#1: ffff000011c07240 (request_class){+.+.}-{4:4}, at: __setup_irq+0xb4/0x6dc
#2: ffff000011c070c8 (lock_class){....}-{2:2}, at: __setup_irq+0xdc/0x6dc
stack backtrace:
CPU: 1 UID: 0 PID: 165 Comm: (udev-worker) Not tainted 6.17.0-rc5-next-20250911-00001-gfcfac22533c9 #18 PREEMPT
Hardware name: Renesas SMARC EVK based on r9a07g044l2 (DT)
Call trace:
show_stack+0x18/0x24 (C)
dump_stack_lvl+0x90/0xd0
dump_stack+0x18/0x24
__lock_acquire+0xa14/0x20b4
lock_acquire+0x1c8/0x354
_raw_spin_lock_irqsave+0x60/0x88
rzg2l_gpio_irq_enable+0x38/0x78
irq_enable+0x40/0x8c
__irq_startup+0x78/0xa4
irq_startup+0x108/0x16c
__setup_irq+0x3c0/0x6dc
request_threaded_irq+0xec/0x1ac
devm_request_threaded_irq+0x80/0x134
adv7511_probe+0x928/0x9a4 [adv7511]
i2c_device_probe+0x22c/0x3dc
really_probe+0xbc/0x2a0
__driver_probe_device+0x78/0x12c
driver_probe_device+0x40/0x164
__driver_attach+0x9c/0x1ac
bus_for_each_dev+0x74/0xd0
driver_attach+0x24/0x30
bus_add_driver+0xe4/0x208
driver_register+0x60/0x128
i2c_register_driver+0x48/0xd0
adv7511_init+0x5c/0x1000 [adv7511]
do_one_initcall+0x64/0x30c
do_init_module+0x58/0x23c
load_module+0x1bcc/0x1d40
init_module_from_file+0x88/0xc4
idempotent_init_module+0x188/0x27c
__arm64_sys_finit_module+0x68/0xac
invoke_syscall+0x48/0x110
el0_svc_common.constprop.0+0xc0/0xe0
do_el0_svc+0x1c/0x28
el0_svc+0x4c/0x160
el0t_64_sync_handler+0xa0/0xe4
el0t_64_sync+0x198/0x19c
Having ISEL configuration back into the struct irq_chip::irq_enable API
should be safe with respect to spurious IRQs, as in the probe case IRQs
are enabled anyway in struct gpio_chip::irq::child_to_parent_hwirq. No
spurious IRQs were detected on suspend/resume, boot, ethernet link
insert/remove tests (executed on RZ/G3S). Boot, ethernet link
insert/remove tests were also executed successfully on RZ/G2L.
KASAN reports a global-out-of-bounds access when running these nfit
tests: clear.sh, pmem-errors.sh, pfn-meta-errors.sh, btt-errors.sh,
daxdev-errors.sh, and inject-error.sh.
[] BUG: KASAN: global-out-of-bounds in nfit_test_ctl+0x769f/0x7840 [nfit_test]
[] Read of size 4 at addr ffffffffc03ea01c by task ndctl/1215
[] The buggy address belongs to the variable:
[] handle+0x1c/0x1df4 [nfit_test]
nfit_test_search_spa() uses handle[nvdimm->id] to retrieve a device
handle and triggers a KASAN error when it reads past the end of the
handle array. It should not be indexing the handle array at all.
The correct device handle is stored in per-DIMM test data. Each DIMM
has a struct nfit_mem that embeds a struct acpi_nfit_memdev that
describes the NFIT device handle. Use that device handle here.
With below scripts, it will trigger panic in f2fs:
mkfs.f2fs -f /dev/vdd
mount /dev/vdd /mnt/f2fs
touch /mnt/f2fs/foo
sync
echo 111 >> /mnt/f2fs/foo
f2fs_io fsync /mnt/f2fs/foo
f2fs_io shutdown 2 /mnt/f2fs
umount /mnt/f2fs
mount -o ro,norecovery /dev/vdd /mnt/f2fs
or
mount -o ro,disable_roll_forward /dev/vdd /mnt/f2fs
F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 0
F2FS-fs (vdd): Mounted with checkpoint version = 7f5c361f
F2FS-fs (vdd): Stopped filesystem due to reason: 0
F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 1
Filesystem f2fs get_tree() didn't set fc->root, returned 1
------------[ cut here ]------------
kernel BUG at fs/super.c:1761!
Oops: invalid opcode: 0000 [#1] SMP PTI
CPU: 3 UID: 0 PID: 722 Comm: mount Not tainted 6.18.0-rc2+ #721 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:vfs_get_tree.cold+0x18/0x1a
Call Trace:
<TASK>
fc_mount+0x13/0xa0
path_mount+0x34e/0xc50
__x64_sys_mount+0x121/0x150
do_syscall_64+0x84/0x800
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fa6cc126cfe
The root cause is we missed to handle error number returned from
f2fs_recover_fsync_data() when mounting image w/ ro,norecovery or
ro,disable_roll_forward mount option, result in returning a positive
error number to vfs_get_tree(), fix it.
Cc: stable@kernel.org Fixes: 6781eabba1bd ("f2fs: give -EINVAL for norecovery and rw mount") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
w/ LFS mode, in get_left_section_blocks(), we should not account the
blocks which were used before and now are invalided, otherwise those
blocks will be counted as freed one in has_curseg_enough_space(), result
in missing to trigger GC in time.
Cc: stable@kernel.org Fixes: 249ad438e1d9 ("f2fs: add a method for calculating the remaining blocks in the current segment in LFS mode.") Fixes: bf34c93d2645 ("f2fs: check curseg space before foreground GC") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mkfs.f2fs -f /dev/vdd
mount /dev/vdd /mnt/f2fs
touch /mnt/f2fs/foo
sync # avoid CP_UMOUNT_FLAG in last f2fs_checkpoint.ckpt_flags
touch /mnt/f2fs/bar
f2fs_io fsync /mnt/f2fs/bar
f2fs_io shutdown 2 /mnt/f2fs
umount /mnt/f2fs
blockdev --setro /dev/vdd
mount /dev/vdd /mnt/f2fs
mount: /mnt/f2fs: WARNING: source write-protected, mounted read-only.
For the case if we create and fsync a new inode before sudden power-cut,
without norecovery or disable_roll_forward mount option, the following
mount will succeed w/o recovering last fsynced inode.
The problem here is that we only check inode_list list after
find_fsync_dnodes() in f2fs_recover_fsync_data() to find out whether
there is recoverable data in the iamge, but there is a missed case, if
last fsynced inode is not existing in last checkpoint, then, we will
fail to get its inode due to nat of inode node is not existing in last
checkpoint, so the inode won't be linked in inode_list.
Let's detect such case in dyrun mode to fix this issue.
After this change, mount will fail as expected below:
mount: /mnt/f2fs: cannot mount /dev/vdd read-only.
dmesg(1) may have more information after failed mount system call.
demsg:
F2FS-fs (vdd): Need to recover fsync data, but write access unavailable, please try mount w/ disable_roll_forward or norecovery
Cc: stable@kernel.org Fixes: 6781eabba1bd ("f2fs: give -EINVAL for norecovery and rw mount") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The one_time_gc field in struct victim_sel_policy is conditionally
initialized but unconditionally read, leading to undefined behavior
that triggers UBSAN warnings.
In f2fs_get_victim() at fs/f2fs/gc.c:774, the victim_sel_policy
structure is declared without initialization:
struct victim_sel_policy p;
The field p.one_time_gc is only assigned when the 'one_time' parameter
is true (line 789):
if (one_time) {
p.one_time_gc = one_time;
...
}
However, this field is unconditionally read in subsequent get_gc_cost()
at line 395:
if (p->one_time_gc && (valid_thresh_ratio < 100) && ...)
When one_time is false, p.one_time_gc contains uninitialized stack
memory. Hence p.one_time_gc is an invalid bool value.
UBSAN detects this invalid bool value:
UBSAN: invalid-load in fs/f2fs/gc.c:395:7
load of value 77 is not a valid value for type '_Bool'
CPU: 3 UID: 0 PID: 1297 Comm: f2fs_gc-252:16 Not tainted 6.18.0-rc3
#5 PREEMPT(voluntary)
Hardware name: OpenStack Foundation OpenStack Nova,
BIOS 1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x70/0x90
dump_stack+0x14/0x20
__ubsan_handle_load_invalid_value+0xb3/0xf0
? dl_server_update+0x2e/0x40
? update_curr+0x147/0x170
f2fs_get_victim.cold+0x66/0x134 [f2fs]
? sched_balance_newidle+0x2ca/0x470
? finish_task_switch.isra.0+0x8d/0x2a0
f2fs_gc+0x2ba/0x8e0 [f2fs]
? _raw_spin_unlock_irqrestore+0x12/0x40
? __timer_delete_sync+0x80/0xe0
? timer_delete_sync+0x14/0x20
? schedule_timeout+0x82/0x100
gc_thread_func+0x38b/0x860 [f2fs]
? gc_thread_func+0x38b/0x860 [f2fs]
? __pfx_autoremove_wake_function+0x10/0x10
kthread+0x10b/0x220
? __pfx_gc_thread_func+0x10/0x10 [f2fs]
? _raw_spin_unlock_irq+0x12/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork+0x11a/0x160
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
This issue is reliably reproducible with the following steps on a
100GB SSD /dev/vdb:
The uninitialized value causes incorrect GC victim selection, leading
to unpredictable garbage collection behavior.
Fix by zero-initializing the entire victim_sel_policy structure to
ensure all fields have defined values.
Fixes: e791d00bd06c ("f2fs: add valid block ratio not to do excessive GC for one time GC") Cc: stable@kernel.org Signed-off-by: Xiaole He <hexiaole1994@126.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The age extent cache uses last_blocks (derived from
allocated_data_blocks) to determine data age. However, there's a
conflict between the deletion
marker (last_blocks=0) and legitimate last_blocks=0 cases when
allocated_data_blocks overflows to 0 after reaching ULLONG_MAX.
In this case, valid extents are incorrectly skipped due to the
"if (!tei->last_blocks)" check in __update_extent_tree_range().
This patch fixes the issue by:
1. Reserving ULLONG_MAX as an invalid/deletion marker
2. Limiting allocated_data_blocks to range [0, ULLONG_MAX-1]
3. Using F2FS_EXTENT_AGE_INVALID for deletion scenarios
4. Adjusting overflow age calculation from ULLONG_MAX to (ULLONG_MAX-1)
Reproducer (using a patched kernel with allocated_data_blocks
initialized to ULLONG_MAX - 3 for quick testing):
Step 1: Mount and check initial state
# dd if=/dev/zero of=/tmp/test.img bs=1M count=100
# mkfs.f2fs -f /tmp/test.img
# mkdir -p /mnt/f2fs_test
# mount -t f2fs -o loop,age_extent_cache /tmp/test.img /mnt/f2fs_test
# cat /sys/kernel/debug/f2fs/status | grep -A 4 "Block Age"
Allocated Data Blocks: 18446744073709551612 # ULLONG_MAX - 3
Inner Struct Count: tree: 1(0), node: 0
Step 2: Create files and write data to trigger overflow
# touch /mnt/f2fs_test/{1,2,3,4}.txt; sync
# cat /sys/kernel/debug/f2fs/status | grep -A 4 "Block Age"
Allocated Data Blocks: 18446744073709551613 # ULLONG_MAX - 2
Inner Struct Count: tree: 5(0), node: 1
Step 3: Trigger the bug - next write should create node but gets skipped
# dd if=/dev/urandom of=/mnt/f2fs_test/4.txt bs=4K count=1; sync
# cat /sys/kernel/debug/f2fs/status | grep -A 4 "Block Age"
Allocated Data Blocks: 1
Inner Struct Count: tree: 5(0), node: 4
Expected: node: 5 (new extent node for 4.txt)
Actual: node: 4 (extent insertion was incorrectly skipped due to
last_blocks = allocated_data_blocks = 0 in __get_new_block_age)
After this fix, the extent node is correctly inserted and node count
becomes 5 as expected.
The bug can be reproduced w/ below scripts:
- mount /dev/vdb /mnt1
- mount /dev/vdc /mnt2
- umount /mnt1
- mounnt /dev/vdb /mnt1
The reason is if we created two slab caches, named f2fs_xattr_entry-7:3
and f2fs_xattr_entry-7:7, and they have the same slab size. Actually,
slab system will only create one slab cache core structure which has
slab name of "f2fs_xattr_entry-7:3", and two slab caches share the same
structure and cache address.
So, if we destroy f2fs_xattr_entry-7:3 cache w/ cache address, it will
decrease reference count of slab cache, rather than release slab cache
entirely, since there is one more user has referenced the cache.
Then, if we try to create slab cache w/ name "f2fs_xattr_entry-7:3" again,
slab system will find that there is existed cache which has the same name
and trigger the warning.
Let's changes to use global inline_xattr_slab instead of per-sb slab cache
for fixing.
Fixes: a999150f4fe3 ("f2fs: use kmem_cache pool during inline xattr lookups") Cc: stable@kernel.org Reported-by: Hong Yun <yhong@link.cuhk.edu.hk> Tested-by: Hong Yun <yhong@link.cuhk.edu.hk> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
F2FS can mount filesystems with corrupted directory depth values that
get runtime-clamped to MAX_DIR_HASH_DEPTH. When RENAME_WHITEOUT
operations are performed on such directories, f2fs_rename performs
directory modifications (updating target entry and deleting source
entry) before attempting to add the whiteout entry via f2fs_add_link.
If f2fs_add_link fails due to the corrupted directory structure, the
function returns an error to VFS, but the partial directory
modifications have already been committed to disk. VFS assumes the
entire rename operation failed and does not update the dentry cache,
leaving stale mappings.
In the error path, VFS does not call d_move() to update the dentry
cache. This results in new_dentry still pointing to the old inode
(new_inode) which has already had its i_nlink decremented to zero.
The stale cache causes subsequent operations to incorrectly reference
the freed inode.
This causes subsequent operations to use cached dentry information that
no longer matches the on-disk state. When a second rename targets the
same entry, VFS attempts to decrement i_nlink on the stale inode, which
may already have i_nlink=0, triggering a WARNING in drop_nlink().
Example sequence:
1. First rename (RENAME_WHITEOUT): file2 → file1
- f2fs updates file1 entry on disk (points to inode 8)
- f2fs deletes file2 entry on disk
- f2fs_add_link(whiteout) fails (corrupted directory)
- Returns error to VFS
- VFS does not call d_move() due to error
- VFS cache still has: file1 → inode 7 (stale!)
- inode 7 has i_nlink=0 (already decremented)
2. Second rename: file3 → file1
- VFS uses stale cache: file1 → inode 7
- Tries to drop_nlink on inode 7 (i_nlink already 0)
- WARNING in drop_nlink()
Fix this by explicitly invalidating old_dentry and new_dentry when
f2fs_add_link fails during whiteout creation. This forces VFS to
refresh from disk on subsequent operations, ensuring cache consistency
even when the rename partially succeeds.
Reproducer:
1. Mount F2FS image with corrupted i_current_depth
2. renameat2(file2, file1, RENAME_WHITEOUT)
3. renameat2(file3, file1, 0)
4. System triggers WARNING in drop_nlink()
In order to avoid such deadlock, we need to avoid grabbing sb_lock in
f2fs_handle_error(), so, let's use asynchronous method instead:
- remove f2fs_handle_error() implementation
- rename f2fs_handle_error_async() to f2fs_handle_error()
- spread f2fs_handle_error()
Let's change as below to fix this issue:
- introduce a new atomic type variable .writeback in structure f2fs_inode_info
to track the number of threads which calling f2fs_write_cache_pages().
- use .i_sem lock to protect .writeback update.
- check .writeback before update compression context in f2fs_setflags_common()
to avoid race w/ ->writepages.
Fixes: 4c8ff7095bef ("f2fs: support data compression") Cc: stable@kernel.org Reported-by: Bai, Shuangpeng <sjb7183@psu.edu> Tested-by: Bai, Shuangpeng <sjb7183@psu.edu> Closes: https://lore.kernel.org/lkml/44D8F7B3-68AD-425F-9915-65D27591F93F@psu.edu Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Xfstests generic/335, generic/336 sometimes crash with the following message:
F2FS-fs (dm-0): detect filesystem reference count leak during umount, type: 9, count: 1
------------[ cut here ]------------
kernel BUG at fs/f2fs/super.c:1939!
Oops: invalid opcode: 0000 [#1] SMP NOPTI
CPU: 1 UID: 0 PID: 609351 Comm: umount Tainted: G W 6.17.0-rc5-xfstests-g9dd1835ecda5 #1 PREEMPT(none)
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:f2fs_put_super+0x3b3/0x3c0
Call Trace:
<TASK>
generic_shutdown_super+0x7e/0x190
kill_block_super+0x1a/0x40
kill_f2fs_super+0x9d/0x190
deactivate_locked_super+0x30/0xb0
cleanup_mnt+0xba/0x150
task_work_run+0x5c/0xa0
exit_to_user_mode_loop+0xb7/0xc0
do_syscall_64+0x1ae/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
---[ end trace 0000000000000000 ]---
It appears that sometimes it is possible that f2fs_put_super() is called before
all node page reads are completed.
Adding a call to f2fs_wait_on_all_pages() for F2FS_RD_NODE fixes the problem.
Cc: stable@kernel.org Fixes: 20872584b8c0b ("f2fs: fix to drop all dirty meta/node pages during umount()") Signed-off-by: Jan Prusakowski <jprusakowski@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If UFS resume fails, the event history is updated in ufshcd_resume(), but
there is no code anywhere to record UFS suspend. Therefore, add code to
record UFS suspend error event history.
Fixes: dd11376b9f1b ("scsi: ufs: Split the drivers/scsi/ufs directory") Cc: stable@vger.kernel.org Signed-off-by: Seunghwan Baek <sh8267.baek@samsung.com> Reviewed-by: Peter Wang <peter.wang@mediatek.com> Link: https://patch.msgid.link/20251210063854.1483899-2-sh8267.baek@samsung.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The driver was not reading the MAX_REQ_PER_REPLY_QUEUE_LIMIT IOCFacts
flag, so the reply-queue-full handling was never enabled, even on
firmware that supports it. Reading this flag enables the feature and
prevents reply queue overflow.
Fixes: f08b24d82749 ("scsi: mpi3mr: Avoid reply queue full condition") Cc: stable@vger.kernel.org Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Link: https://patch.msgid.link/20251211002929.22071-1-chandrakanth.patil@broadcom.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The asd_pci_remove() function fails to synchronize with pending tasklets
before freeing the asd_ha structure, leading to a potential
use-after-free vulnerability.
When a device removal is triggered (via hot-unplug or module unload),
race condition can occur.
The fix adds tasklet_kill() before freeing the asd_ha structure,
ensuring all scheduled tasklets complete before cleanup proceeds.
Reported-by: Yuhao Jiang <danisjiang@gmail.com> Reported-by: Junrui Luo <moonafterrain@outlook.com> Fixes: 2908d778ab3e ("[SCSI] aic94xx: new driver") Cc: stable@vger.kernel.org Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Link: https://patch.msgid.link/ME2PR01MB3156AB7DCACA206C845FC7E8AFFDA@ME2PR01MB3156.ausprd01.prod.outlook.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>