Chuck Lever [Mon, 27 Apr 2026 13:50:46 +0000 (09:50 -0400)]
SUNRPC: Add crypto/krb5 enctype lookup to krb5_ctx
Each krb5_ctx currently points to a gss_krb5_enctype, the
rpcsec_gss_krb5 module's own enctype descriptor. To begin
using the common crypto/krb5 library, store a pointer to the
corresponding struct krb5_enctype (from <crypto/krb5.h>) as
well.
The lookup is performed in gss_import_v2_context() immediately
after the existing gss_krb5_lookup_enctype() call. If
crypto_krb5_find_enctype() cannot find a matching enctype the
context import fails, ensuring the module never operates with
a partially-initialized krb5_ctx.
Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Chuck Lever [Mon, 27 Apr 2026 13:50:45 +0000 (09:50 -0400)]
SUNRPC: Add Kconfig dependency on CRYPTO_KRB5
The rpcsec_gss_krb5 module currently contains its own Kerberos 5
crypto implementation (key derivation, encryption, checksumming)
that duplicates functionality available in the common crypto/krb5
library. As a first step toward migrating to that library, add a
Kconfig select so that building rpcsec_gss_krb5 pulls in the
common Kerberos 5 crypto support.
The per-enctype Kconfig options (AES_SHA1, CAMELLIA, AES_SHA2)
remain: they continue to gate which encryption types are offered
by the GSS mechanism. The individual crypto algorithm selects
they carry become redundant once the migration is complete, since
CRYPTO_KRB5 already selects all needed ciphers and hashes.
Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Chuck Lever [Mon, 20 Apr 2026 15:38:30 +0000 (11:38 -0400)]
NFSD: Increase the default max_block_size to 4MB
Commit 8a81f16de64f ("NFSD: Add a "default" block size") introduced
NFSSVC_DEFBLKSIZE at 1MB, well below the 4MB NFSSVC_MAXBLKSIZE
ceiling, with the stated intent that a later change would raise the
default.
Raising the default reduces per-RPC overhead on fast networks by
amortizing header processing and scheduling costs across larger
payloads. The halving loop in nfsd_get_default_max_blksize()
constrains the returned value to 1/4096 of available RAM, so the
new 4MB default takes effect only on systems with at least 16GB of
RAM. Smaller machines continue to receive the same computed value
as before. Administrators can still override the computed value
through /proc/fs/nfsd/max_block_size.
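The halving behaviour described above can be sketched in user-space C. This is only an illustration, not the kernel function: the sketch_ names are hypothetical, the 8KB floor is an assumption, and the RAM size is passed in rather than read from the system.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_DEFBLKSIZE (4u * 1024 * 1024)  /* the new 4MB default */
#define SKETCH_MINBLKSIZE (8u * 1024)         /* assumed floor, not from the commit */

/*
 * Halve the default until it is no larger than 1/4096 of RAM,
 * stopping once the assumed floor is reached.
 */
uint32_t sketch_default_max_blksize(uint64_t ram_bytes)
{
	uint64_t target = ram_bytes >> 12;   /* 1/4096 of available RAM */
	uint32_t ret = SKETCH_DEFBLKSIZE;

	while (ret > target && ret >= 2 * SKETCH_MINBLKSIZE)
		ret /= 2;
	return ret;
}
```

Under these assumptions a 16GB input yields the full 4MB, while a 4GB input is halved twice and yields 1MB.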
On systems where the new default takes effect,
svc_sock_setbufsize() sizes each service socket's send and receive
buffers as nreqs * max_mesg * 2. Quadrupling max_mesg therefore
quadruples the per-socket buffer reservation at a fixed thread
count, which operators tuning large thread pools should account
for.
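As a rough illustration of the arithmetic, the formula quoted above can be written as a small helper. The function name and the sample values are hypothetical, and max_mesg is treated here as equal to the block size, which is a simplification.

```c
#include <assert.h>
#include <stdint.h>

/* Per-socket buffer reservation using the quoted formula: nreqs * max_mesg * 2. */
uint64_t sketch_sock_bufsize(uint32_t nreqs, uint32_t max_mesg)
{
	return (uint64_t)nreqs * max_mesg * 2;
}
```

With a hypothetical nreqs of 8, a 1MB max_mesg reserves 16MB per socket and a 4MB max_mesg reserves 64MB.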
Note well: Your NFS client implementation must support large read
and write size settings to benefit from this change.
Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Roland Mainz <roland.mainz@nrubsig.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Chuck Lever [Sun, 19 Apr 2026 18:53:07 +0000 (14:53 -0400)]
NFSD: Close cached file handles when revoking export state
When NFSD_CMD_UNLOCK_EXPORT revokes NFSv4 state for an export path,
GC-managed nfsd_file entries for files under that path may remain
in the file cache. These cached handles hold the underlying
filesystem busy, preventing a subsequent unmount.
Add nfsd_file_close_export(), which walks the nfsd_file hash table
and closes GC-eligible entries whose underlying file resides on the
same filesystem and is a descendant of the export path. Because
nfsd_file entries do not carry an export reference, the ancestry
check uses is_subdir() on the file's dentry. False positives --
closing a cached handle that did not originate from the target
export -- are harmless; the handle is simply reopened on the next
access.
The handler calls nfsd_file_close_export() before revoking NFSv4
state, mirroring the order used by NFSD_CMD_UNLOCK_FILESYSTEM
(which cancels copies and releases NLM locks before revoking
state). Both calls run under nfsd_mutex.
Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Chuck Lever [Sun, 19 Apr 2026 18:53:06 +0000 (14:53 -0400)]
NFSD: Add NFSD_CMD_UNLOCK_EXPORT netlink command
When a filesystem is exported to NFS clients, NFSv4 state
(opens, locks, delegations, layouts) holds references that
prevent the underlying filesystem from being unmounted.
NFSD_CMD_UNLOCK_FILESYSTEM addresses this at superblock
granularity, but administrators unexporting a single path on a
shared filesystem (e.g., one of several exports on the same device)
need finer control.
Add NFSD_CMD_UNLOCK_EXPORT, which revokes NFSv4 state acquired
through exports of a specific path. Matching is by path identity
(dentry + vfsmount) via the sc_export field on each nfs4_stid,
so multiple svc_export objects for the same path -- one per
auth_domain -- are handled correctly without requiring the caller
to name a specific client.
The command takes a single "path" attribute. Userspace (exportfs
-u) sends this after removing the last client for a given path,
enabling the underlying filesystem to be unmounted. When multiple
clients share an export path, individual unexports do not trigger
state revocation; only the final one does.
Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Chuck Lever [Sun, 19 Apr 2026 18:53:05 +0000 (14:53 -0400)]
NFSD: Track svc_export in nfs4_stid
Add an sc_export field to struct nfs4_stid so that each stateid
records the export under which it was acquired. The export
reference is taken via exp_get() at stateid creation and released
via exp_put() in nfs4_put_stid().
Open stateids record the export from current_fh->fh_export.
Lock stateids and delegations inherit the export from their
parent open stateid. Layout stateids inherit from their
parent stateid. Directory delegations record the export from
cstate->current_fh.
A subsequent commit uses sc_export to scope state revocation to a
specific export, avoiding the need to walk inode dentry aliases at
revocation time.
Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add NFSD_CMD_UNLOCK_FILESYSTEM as a dedicated netlink command for
revoking NFS state under a filesystem path, providing a netlink
equivalent of /proc/fs/nfsd/unlock_fs.
The command requires a "path" string attribute containing the
filesystem path whose state should be released. The handler
resolves the path to its superblock, then cancels async copies,
releases NLM locks, and revokes NFSv4 state on that superblock.
Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Chuck Lever [Sun, 19 Apr 2026 18:53:02 +0000 (14:53 -0400)]
NFSD: Add NFSD_CMD_UNLOCK_IP netlink command
The existing write_unlock_ip procfs interface releases NLM file
locks held by a specific client IP address, but procfs provides
no structured way to extend that operation to other scopes such as
revoking NFSv4 state.
Add NFSD_CMD_UNLOCK_IP as a dedicated netlink command for
releasing NLM locks by client address. The command accepts a
binary sockaddr_in or sockaddr_in6 in its address attribute.
The handler validates the address family and length, then calls
nlmsvc_unlock_all_by_ip() to release matching NLM locks. Because
lockd is a single global instance, that call operates across
all network namespaces regardless of which namespace the caller
inhabits.
A separate netlink command for filesystem-scoped unlock is added in
a subsequent commit.
The nfsd_ctl_unlock_ip tracepoint is updated from string-based
address logging to __sockaddr, which stores the binary sockaddr
and formats it with %pISpc. This affects both the new netlink path
and the existing procfs write_unlock_ip path, giving consistent
structured output in both cases.
Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Chuck Lever [Sun, 19 Apr 2026 18:53:01 +0000 (14:53 -0400)]
NFSD: Extract revoke_one_stid() utility function
The per-stateid revocation logic in nfsd4_revoke_states() handles
four stateid types in a deeply nested switch. Extract two helpers:
revoke_ol_stid() performs admin-revocation of an open or lock
stateid with st_mutex already held: marks the stateid as
SC_STATUS_ADMIN_REVOKED, closes POSIX locks for lock stateids,
and releases file access.
revoke_one_stid() dispatches by sc_type, acquires st_mutex with
the appropriate lockdep class for open and lock stateids, and
handles delegation unhash and layout close inline.
No functional change. Preparation for adding export-scoped state
revocation which reuses revoke_one_stid().
Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Chuck Lever [Sun, 19 Apr 2026 18:53:00 +0000 (14:53 -0400)]
NFSD: Handle layout stid in nfsd4_drop_revoked_stid()
nfsd4_drop_revoked_stid() has no SC_TYPE_LAYOUT case, so when a
client sends FREE_STATEID for an admin-revoked layout stid, the
default branch releases cl_lock and returns without unhashing or
releasing the stid. The stid remains in the IDR and on the
per-client list until the client is destroyed.
Remove the layout stid from the per-client list and call
nfs4_put_stid() to drop the creation reference. When the
refcount reaches zero, nfsd4_free_layout_stateid() handles the
remaining cleanup: cancelling the fence worker, removing from
the per-file list, and freeing the slab object.
Fixes: 1e33e1414bec ("nfsd: allow layout state to be admin-revoked.") Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Jason Gunthorpe [Mon, 8 Jun 2026 18:10:04 +0000 (15:10 -0300)]
iommu/dma: Do not try to iommu_map a 0 length region in swiotlb
iommu_dma_iova_link_swiotlb() processes an unaligned mapping in three
parts: the head, the middle, and the trailer. If the middle is empty because
there are no aligned pages, it calls down to iommu_map() with a size of 0,
which the iommupt implementation rejects as illegal.
The error unwind then starts from the wrong spot, corrupting the mapping,
so the eventual destruction triggers a WARN_ON.
Check for a 0 length and skip the mapping in that case, and use offset
rather than 0 as the starting point of the unlink.
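The head/middle/trailer split can be sketched in user-space C to show how the middle ends up empty. This is a simplified model, not the kernel code: the struct and function names are hypothetical, and the granule is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_split {
	size_t head;    /* unaligned bytes before the first granule boundary */
	size_t middle;  /* whole aligned granules; may be zero */
	size_t tail;    /* unaligned bytes after the last granule boundary */
};

/* Split [addr, addr + size) around granule boundaries (granule: power of two). */
struct sketch_split sketch_split_range(uint64_t addr, size_t size, size_t granule)
{
	struct sketch_split s = { 0, 0, 0 };
	size_t off = (size_t)(addr & (granule - 1));

	if (off) {
		s.head = granule - off;
		if (s.head > size)
			s.head = size;
	}
	s.tail = (size - s.head) & (granule - 1);
	s.middle = size - s.head - s.tail;
	return s;
}

/* Count the map calls a caller would issue when it skips empty parts. */
int sketch_count_map_calls(struct sketch_split s)
{
	return (s.head != 0) + (s.middle != 0) + (s.tail != 0);
}
```

For example, a 0x1000-byte range starting at 0x1800 with a 0x1000 granule has a 0x800-byte head, a 0x800-byte tail and a zero-length middle, so a caller that skips empty parts issues two map calls rather than three.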
This is frequently triggered by some kinds of Thunderbolt NVMe
drives that force SWIOTLB use for unaligned memory. NVMe seems to
pass in oddly aligned buffers for the passthrough commands from smartctl
that hit this condition.
Cc: stable@vger.kernel.org Fixes: 433a76207dcf ("dma-mapping: Implement link/unlink ranges API") Reported-by: Mark Lord <mlord@pobox.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/0-v1-8536728bc89f+469-swiotlb_warn_jgg@nvidia.com
====================
bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries
trie_lookup_elem() annotates its rcu_dereference_check() walks with only
rcu_read_lock_bh_held(), so a sleepable BPF program that touches an LPM
trie (e.g. a sleepable LSM hook calling bpf_map_lookup_elem()) trips a
"suspicious RCU usage" lockdep splat on debug kernels: it holds only
rcu_read_lock_trace(), which that annotation does not accept.
Patch 1 relaxes the rcu_dereference annotations in the trie walks so they
no longer trip lockdep from the Tasks Trace context, including the
trie_update_elem()/trie_delete_elem() writer walks (protected by
trie->lock). Patch 2 adds BPF_MAP_TYPE_LPM_TRIE to the verifier's
sleepable map whitelist so sleepable programs can reference an LPM trie
directly, not just as the inner map of a map-of-maps. LPM trie nodes are
reclaimed via bpf_mem_cache_free_rcu(), which chains a regular RCU grace
period into a Tasks Trace grace period before freeing -- the same
discipline BPF_MAP_TYPE_HASH relies on for sleepable access.
Changes since v1:
- Split into a 2-patch series.
- Patch 1 now also converts the trie_update_elem()/trie_delete_elem()
walks from rcu_dereference() to rcu_dereference_protected(*p, 1),
addressing review feedback that v1 only fixed the lookup path and left
the same splat on the writer paths.
- New patch 2 adds the verifier whitelist entry so the fix is actually
reachable for directly-referenced LPM tries.
- Retitled v1 ("Allow lookups from sleepable BPF programs").
Vlad Poenaru [Tue, 9 Jun 2026 13:55:58 +0000 (06:55 -0700)]
bpf: Allow sleepable programs to use LPM trie maps directly
The previous change relaxed the rcu_dereference annotations in
lpm_trie.c so the trie walks no longer trip lockdep when reached from a
sleepable BPF program holding only rcu_read_lock_trace(). By itself
that only helps tries reached as the inner map of a map-of-maps, or
from the classic-RCU syscall path: a sleepable program that references
an LPM trie directly is still rejected at load time by
check_map_prog_compatibility(), whose sleepable whitelist omits
BPF_MAP_TYPE_LPM_TRIE:
Sleepable programs can only use array, hash, ringbuf and local storage maps
LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed
with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period
into a Tasks Trace grace period before the node -- and the value
embedded in it that trie_lookup_elem() returns to the program -- is
released. That is the same reclaim discipline BPF_MAP_TYPE_HASH relies
on for sleepable access, so a value handed to a sleepable reader cannot
be freed while the program is still running under rcu_read_lock_trace().
The writer paths take trie->lock across the walk and never relied on the
RCU read-side lock to keep nodes alive.
Add BPF_MAP_TYPE_LPM_TRIE to the sleepable map whitelist so these
programs can use LPM tries directly.
Vlad Poenaru [Tue, 9 Jun 2026 13:55:57 +0000 (06:55 -0700)]
bpf: Allow LPM map access from sleepable BPF programs
trie_lookup_elem() annotates its rcu_dereference_check() walks with
only rcu_read_lock_bh_held(). Because rcu_dereference_check(p, c)
resolves to "c || rcu_read_lock_held()", this passes for XDP/NAPI and
classic RCU readers but fails for sleepable BPF programs, which enter
via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace().
trie_update_elem() and trie_delete_elem() have the same problem in a
different form: they walk the trie with plain rcu_dereference(), which
asserts rcu_read_lock_held() unconditionally. Both are reachable from
sleepable BPF programs via the bpf_map_update_elem / bpf_map_delete_elem
helpers, and from the syscall path under classic rcu_read_lock(). In
the writer paths the trie is actually protected by trie->lock (an
rqspinlock taken across the walk); we never relied on the RCU read-side
lock to keep nodes alive there.
A sleepable LSM hook that ends up touching an LPM trie therefore
triggers lockdep on debug kernels:
=============================
WARNING: suspicious RCU usage
7.1.0-... Tainted: G E
-----------------------------
kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage!
1 lock held by net_tests/540:
#0: (rcu_tasks_trace_srcu_struct){....}-{0:0},
at: __bpf_prog_enter_sleepable+0x26/0x280
Call Trace:
dump_stack_lvl
lockdep_rcu_suspicious
trie_lookup_elem
bpf_prog_..._enforce_security_socket_connect
bpf_trampoline_...
security_socket_connect
__sys_connect
do_syscall_64
This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize
against the trie's reclaim path -- but it spams the console once per
distinct callsite on every debug kernel running a sleepable BPF LSM
that touches an LPM trie, which is increasingly common.
For the lookup path, switch the rcu_dereference_check() annotation
from rcu_read_lock_bh_held() to bpf_rcu_lock_held(), which accepts all
three contexts (classic, BH, Tasks Trace). Other map types already
follow this convention.
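The difference between the two lookup annotations can be modelled with plain flags. This is a user-space model only: the struct and function names are hypothetical, and the booleans stand in for the kernel's lock-held predicates.

```c
#include <assert.h>
#include <stdbool.h>

/* Which read-side protections the caller holds, modelled as flags. */
struct sketch_rcu_ctx {
	bool classic;  /* stands in for rcu_read_lock() */
	bool bh;       /* stands in for rcu_read_lock_bh() */
	bool trace;    /* stands in for rcu_read_lock_trace() */
};

/* Old annotation: the BH condition, plus the implicit classic check. */
bool sketch_old_check(struct sketch_rcu_ctx c)
{
	return c.bh || c.classic;
}

/* New annotation: the condition also accepts Tasks Trace RCU. */
bool sketch_new_check(struct sketch_rcu_ctx c)
{
	return c.classic || c.bh || c.trace;
}
```

A caller holding only the Tasks Trace lock fails the old check and passes the new one, which matches the sleepable-program case described above.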
For trie_update_elem() and trie_delete_elem(), annotate the walks as
rcu_dereference_protected(*p, 1) -- matching trie_free() in the same
file -- since trie->lock is held across the walk. rqspinlock has no
lockdep_map, so the predicate degenerates to '1' rather than
lockdep_is_held(&trie->lock); the protection is real but not
machine-verifiable. trie_get_next_key() also uses bare
rcu_dereference() but is reachable only from the BPF syscall, which
holds classic rcu_read_lock() before dispatching, so it is left
untouched.
Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context") Cc: stable@vger.kernel.org Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260609135558.193287-2-vlad.wing@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Don't just overwrite the original pointer passed to krealloc()
with its return value without checking the latter:
MEM = krealloc(MEM, SZ, GFP);
If krealloc() returns NULL, that erases the pointer
to the still-allocated memory and hence leaks that memory.
Instead, use a temporary variable, check it's not NULL
and only then assign it to the original pointer:
TMP = krealloc(MEM, SZ, GFP);
if (!TMP) return;
MEM = TMP;
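The same pattern applies to realloc() in user space, which also leaves the old block allocated when it fails. A minimal sketch, with a hypothetical struct and function name:

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_buf {
	char *mem;
	size_t size;
};

/*
 * Resize b->mem to new_size bytes (new_size must be nonzero).
 * On failure the old allocation is left in place and -1 is returned.
 */
int sketch_buf_resize(struct sketch_buf *b, size_t new_size)
{
	char *tmp = realloc(b->mem, new_size);

	if (!tmp)
		return -1;      /* b->mem is still valid and still owned by b */
	b->mem = tmp;
	b->size = new_size;
	return 0;
}
```

Because the original pointer is only overwritten after the NULL check, the caller can still free or keep using the old buffer when the resize fails.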
Praveen Talari [Wed, 20 May 2026 07:14:29 +0000 (12:44 +0530)]
i2c: qcom-geni: Use pm_runtime_force_{suspend,resume} helpers
The driver carries custom system suspend/resume handling that manually
tracks a suspended state and conditionally calls geni_i2c_runtime_suspend()
from the noirq suspend path, then adjusts runtime PM state by hand. This
duplicates PM core behavior and adds unnecessary complexity.
Drop the manual state tracking and switch to pm_runtime_force_suspend()
and pm_runtime_force_resume() for system sleep. These helpers already
perform the required checks, call the runtime PM callbacks when needed,
and keep runtime PM state transitions consistent.
Emil Tsalapatis [Tue, 9 Jun 2026 06:36:30 +0000 (02:36 -0400)]
selftests/bpf: Avoid spurious spmc parallel selftest errors in libarena
The libarena parallel spmc selftest is nondeterministic by design.
As a result, it depends to some extent on the relative timing of the
producer and consumer threads. This allows two kinds of spurious
failures, which this patch addresses.
1) Spurious timeouts. The test proceeds in phases, and threads use a
common counter as a barrier to avoid proceeding to the next phase
until all threads are ready to do so. If a thread takes too long to
reach the barrier, the already waiting threads may time out.
Increase the current timeout. The timeout's value is a balance
between the maximum amount of time spent on the test and the
possibility of spurious failures. Right now the timeout is too short.
Err on the side of caution and significantly increase it to avoid
spurious failures.
2) Spurious resize failures. Some selftests require the spmc queue to
resize itself. This in turn requires the producer side to be
materially faster than the consumer side so that the queue gets full
enough for a resize. However, in the benchmark the spmc queue's producer
is outnumbered 3:1. To offset this, we add busy waits on the consumer
side. However, we still see occasional failures due to the queue
never resizing.
Minimize the possibility of this in two ways. First, remove one of
the consumers. The 2 remaining consumers still exercise the "race between
consumers" scenario. Second, increase the busy wait duration to
decrease the rate at which the consumers act on the queue.
While at it, also replace a stray invalid error value "153" with EINVAL.
Fixes: 42998f819256 ("selftests/bpf: libarena: parallel test harness and spmc parallel selftest") Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260609063630.10245-1-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Chen-Yu Tsai [Tue, 9 Jun 2026 08:36:27 +0000 (16:36 +0800)]
regulator: mt6359: Fix vbbck default internal supply name
This issue was pointed out by Sashiko.
vbbck is fed internally from vio18. For the MT6359, the default supply
name was incorrectly set as "VIO18", instead of the supply's default
"VIO18". In practice this still works, but it causes the regulator
description copy and replace to always happen. For the MT6359P the
name is correct.
Fix the supply name for MT6359 so that both instances are the same and
correct. Also copy the comment about the internal supply from the MT6359
list to the MT6359P list.
Fixes: 10be8fc1d534 ("regulator: mt6359: Add regulator supply names") Signed-off-by: Chen-Yu Tsai <wenst@chromium.org> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Link: https://patch.msgid.link/20260609083630.1600070-1-wenst@chromium.org Signed-off-by: Mark Brown <broonie@kernel.org>
Jason Gunthorpe [Fri, 5 Jun 2026 11:53:35 +0000 (08:53 -0300)]
IB/mlx4: Fill in the access_flags if IB_MR_REREG_ACCESS is not specified
Sashiko noticed that mlx4 used whatever arbitrary access flags were provided
when IB_MR_REREG_ACCESS was not set. Since IB_MR_REREG_TRANS needs
access_flags, it used those arbitrary values, so rereg does not work sensibly
if userspace provides only IB_MR_REREG_TRANS.
Keep track of the MR's current access flags and use them if the user
does not specify new ones.
Also fix up some confusion around mmr.access: it holds the HW access flags,
so a convert_access() call was missing. Nothing reads this field by the time
rereg_mr can happen, however.
cxl/region: Avoid variable shadowing in region attach paths
A couple of symbol declarations shadow earlier variables in the region
attach paths. Shadowing makes it harder to tell which object is being
referenced and can obscure future bugs.
Reuse the existing 'cxld' variable in cxl_port_attach_region() and
rename the endpoint decoder iterator in cxl_region_attach() to avoid
shadowing the function parameter.
Cássio Gabriel [Tue, 9 Jun 2026 12:03:56 +0000 (09:03 -0300)]
ASoC: sma1307: Fix uevent string leaks in fault worker
sma1307_check_fault_worker() stores dynamically allocated uevent strings in
envp[0]. Several fault conditions are checked in sequence, so a later fault
can overwrite envp[0] before the final kfree() and leak the previous
allocation.
The same flow can leave an OT1 volume entry in envp[1] while envp[0]
has been overwritten by a later non-OT1 fault, causing an inconsistent
uevent payload.
Use static STATUS strings and a stack buffer for the optional VOLUME entry.
This removes the allocations from the worker and keeps VOLUME tied only
to the OT1 events that produce it.
Peter Ujfalusi <peter.ujfalusi@linux.intel.com> says:
This series hardens SOF kcontrol data paths for both IPC3 and IPC4 by
fixing size-handling bugs in put/get/update flows and tightening bounds
checks around firmware/user-provided payload lengths.
The changes include:
Fix TOCTOU-style size misuse in IPC3/IPC4 bytes put paths by validating and
using the incoming payload size.
Add notification/update payload size validation before parsing control data.
Use overflow-checked arithmetic when computing expected IPC3 control sizes.
Ensure update/copy bounds are validated against actual allocation limits.
Fix IPC3 bytes_ext bounds checks to account for struct header offset, closing
a heap overflow/over-read issue from unprivileged userspace TLV access.
Overall, the series makes control payload processing robust against malformed or
inconsistent sizes and prevents out-of-bounds accesses.
Peter Ujfalusi [Tue, 9 Jun 2026 08:34:58 +0000 (11:34 +0300)]
ASoC: SOF: ipc3-control: Fix heap overflow in bytes_ext put/get
The ipc_control_data buffer is allocated as kzalloc(max_size), where
max_size covers the entire struct sof_ipc_ctrl_data including its
flexible array payload. However, the bounds checks in bytes_ext_put
and _bytes_ext_get compared user data lengths against max_size
directly, ignoring that cdata->data sits at an offset of
sizeof(struct sof_ipc_ctrl_data) bytes into the allocation.
This allowed writing up to sizeof(struct sof_ipc_ctrl_data) bytes past
the end of the heap buffer from unprivileged userspace via the ALSA TLV
kcontrol interface, and similarly allowed over-reading adjacent heap
data on the get path.
Fix all bounds checks to subtract sizeof(*cdata) from max_size so they
reflect the actual space available at the cdata->data offset. Also fix
the error-path restore in bytes_ext_put which wrote to cdata->data
instead of cdata, causing the same overflow.
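The off-by-header problem can be illustrated with a small user-space sketch. The struct below is a hypothetical stand-in, not the real struct sof_ipc_ctrl_data; only the idea of subtracting the header size from max_size comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header followed by a flexible array payload. */
struct sketch_ctrl {
	uint32_t cmd;
	uint32_t num_elems;
	uint8_t data[];
};

/* Payload bytes that fit when the whole allocation is max_size bytes. */
size_t sketch_payload_space(size_t max_size)
{
	if (max_size < sizeof(struct sketch_ctrl))
		return 0;
	return max_size - sizeof(struct sketch_ctrl);
}

/* Nonzero when a payload of len bytes fits at the data[] offset. */
int sketch_payload_fits(size_t max_size, size_t len)
{
	return len <= sketch_payload_space(max_size);
}
```

Comparing len against max_size directly would accept a payload as large as the whole allocation, which overruns the buffer by the size of the header.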
Fixes: 67ec2a091630 ("ASoC: SOF: Add bytes_ext control IPC ops for IPC3") Cc: stable@vger.kernel.org Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Liam Girdwood <liam.r.girdwood@intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Link: https://patch.msgid.link/20260609083458.31193-7-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
Peter Ujfalusi [Tue, 9 Jun 2026 08:34:57 +0000 (11:34 +0300)]
ASoC: SOF: ipc3-control: Fix TOCTOU in bytes_put and bytes_get
In sof_ipc3_bytes_put(), the size used for the memcpy is derived from
the old data->size already in the buffer, not the incoming new data's
size field. If the new data has a different size, the copy length is
wrong: it may truncate valid data or copy stale bytes.
Similarly, sof_ipc3_bytes_get() checks data->size against max_size
without accounting for the sizeof(struct sof_ipc_ctrl_data) offset
of the flex array within the allocation.
Fix bytes_put to validate and use the incoming data's sof_abi_hdr.size
from ucontrol before copying. Fix bytes_get to subtract sizeof(*cdata)
from the bounds check to match the actual available space.
Fixes: 544ac8858f24 ("ASoC: SOF: Add bytes_get/put control IPC ops for IPC3") Cc: stable@vger.kernel.org Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Liam Girdwood <liam.r.girdwood@intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Link: https://patch.msgid.link/20260609083458.31193-6-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
Peter Ujfalusi [Tue, 9 Jun 2026 08:34:56 +0000 (11:34 +0300)]
ASoC: SOF: ipc3-control: Validate size in snd_sof_update_control
In snd_sof_update_control(), firmware-provided cdata->num_elems is
checked against local_cdata->data->size but never against the actual
allocation size. If local_cdata->data->size was previously set to an
inconsistent value, the memcpy could write past the allocated buffer.
Add a bounds check to ensure num_elems fits within the available space
in the ipc_control_data allocation before copying.
Fixes: 10f461d79c2d ("ASoC: SOF: Add IPC3 topology control ops") Cc: stable@vger.kernel.org Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Liam Girdwood <liam.r.girdwood@intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Link: https://patch.msgid.link/20260609083458.31193-5-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
Peter Ujfalusi [Tue, 9 Jun 2026 08:34:55 +0000 (11:34 +0300)]
ASoC: SOF: ipc3-control: Use overflow checks in control_update size calc
In sof_ipc3_control_update(), the expected_size calculation uses
firmware-provided cdata->num_elems in arithmetic that could overflow
on 32-bit platforms, wrapping to a small value. This would allow the
cdata->rhdr.hdr.size comparison to pass with mismatched sizes,
potentially leading to out-of-bounds access in snd_sof_update_control.
Use check_mul_overflow() and check_add_overflow() to detect and reject
overflowed size calculations.
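A user-space analogue of the overflow-checked calculation can be written with the GCC/Clang builtins. The function and parameter names are hypothetical, and 32-bit types are used so that the wrap-around case is the same on every platform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Compute hdr_size + num_elems * elem_size in 32 bits.
 * Returns false, leaving *out untouched, if any step overflows.
 */
bool sketch_expected_size(uint32_t hdr_size, uint32_t num_elems,
			  uint32_t elem_size, uint32_t *out)
{
	uint32_t payload, total;

	if (__builtin_mul_overflow(num_elems, elem_size, &payload))
		return false;
	if (__builtin_add_overflow(hdr_size, payload, &total))
		return false;
	*out = total;
	return true;
}
```

Without the checks, 0x40000000 elements of 4 bytes each would wrap to a payload of 0, and the size comparison would be made against the header size alone.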
Fixes: 10f461d79c2d ("ASoC: SOF: Add IPC3 topology control ops") Cc: stable@vger.kernel.org Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Liam Girdwood <liam.r.girdwood@intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Link: https://patch.msgid.link/20260609083458.31193-4-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
Peter Ujfalusi [Tue, 9 Jun 2026 08:34:53 +0000 (11:34 +0300)]
ASoC: SOF: ipc4-control: Fix TOCTOU in sof_ipc4_bytes_put
In sof_ipc4_bytes_put(), the copy size is derived from the old
data->size in the buffer rather than the incoming new data's size
field from ucontrol. If the new data has a different size, the copy
uses the wrong length: it may truncate valid data or copy stale bytes.
Fix by validating and using the incoming data's sof_abi_hdr.size from
ucontrol before copying.
Fixes: a062c8899fed ("ASoC: SOF: ipc4-control: Add support for bytes control get and put") Cc: stable@vger.kernel.org Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Liam Girdwood <liam.r.girdwood@intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Link: https://patch.msgid.link/20260609083458.31193-2-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
s390/ap: Fix locking issue in SE bind and associate sysfs functions
Revisit and reorganize the locking and lock coverage of the
aq->lock spinlock as used in the two sysfs functions
se_bind_store() and se_associate_store().
A kernel run reported a possible deadlock situation, caused by
holding the spinlock (aq->lock) while triggering a uevent.
The fix rearranges the code protected by the spinlock by excluding
the uevent invocation, which does not require protection.
Additionally, the start of the protected region is moved earlier
to cover more lines, ensuring a consistent view of the AP queue
state between reading and updating its struct fields.
=====================================================
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
7.1.0-20260601.rc6.git12.516b5dbd4d4a.300.fc44.s390x+debug #1 Not tainted
-----------------------------------------------------
setupseguest.sh/11034 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire: 000001c991f498e8 (fs_reclaim){+.+.}-{0:0}, at: __kmalloc_cache_noprof+0x5a/0x6d0
and this task is already holding: 000000c4a1a12378 (&aq->lock){+.-.}-{2:2}, at: se_bind_store+0x96/0x3a0
which would create a new lock dependency:
(&aq->lock){+.-.}-{2:2} -> (fs_reclaim){+.+.}-{0:0}
but this new dependency connects a SOFTIRQ-irq-safe lock:
(&aq->lock){+.-.}-{2:2}
... which became SOFTIRQ-irq-safe at:
__lock_acquire+0x5ae/0x15a0
lock_acquire+0x14c/0x400
_raw_spin_lock_bh+0x58/0xb0
ap_tasklet_fn+0x72/0xd0
tasklet_action_common+0x174/0x1b0
handle_softirqs+0x180/0x5c0
irq_exit_rcu+0x196/0x200
do_ext_irq+0x12a/0x4d0
ext_int_handler+0xc6/0xf0
folio_zero_user+0x1c6/0x240
folio_zero_user+0x182/0x240
vma_alloc_anon_folio_pmd+0xa0/0x1d0
__do_huge_pmd_anonymous_page+0x3a/0x200
__handle_mm_fault+0x56c/0x590
handle_mm_fault+0xa2/0x370
do_exception+0x292/0x590
__do_pgm_check+0x136/0x3e0
pgm_check_handler+0x114/0x160
to a SOFTIRQ-irq-unsafe lock:
(fs_reclaim){+.+.}-{0:0}
... which became SOFTIRQ-irq-unsafe at:
...
__lock_acquire+0x5ae/0x15a0
lock_acquire+0x14c/0x400
__fs_reclaim_acquire+0x44/0x50
fs_reclaim_acquire+0xbe/0x100
fs_reclaim_correct_nesting+0x20/0x70
dotest+0x5e/0x148
locking_selftest+0x2854/0x2a88
start_kernel+0x3b2/0x4f0
startup_continue+0x2e/0x40
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
local_irq_disable();
lock(&aq->lock);
lock(fs_reclaim);
<Interrupt>
lock(&aq->lock);
*** DEADLOCK ***
4 locks held by setupseguest.sh/11034:
#0: 000000c485d01440 (sb_writers#4){.+.+}-{0:0}, at: vfs_write+0x2fc/0x380
#1: 000000c4d2283288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x12a/0x270
#2: 000000c4a1830e48 (kn->active#172){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x1e/0x270
#3: 000000c4a1a12378 (&aq->lock){+.-.}-{2:2}, at: se_bind_store+0x96/0x3a0
the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
-> (&aq->lock){+.-.}-{2:2} {
HARDIRQ-ON-W at:
__lock_acquire+0x5ae/0x15a0
lock_acquire+0x14c/0x400
_raw_spin_lock_bh+0x58/0xb0
ap_queue_init_state+0x2e/0x50
ap_scan_domains+0x5d6/0x620
ap_scan_adapter+0x4c0/0x810
ap_scan_bus+0x70/0x350
ap_scan_bus_wq_callback+0x56/0x80
process_one_work+0x2ba/0x820
worker_thread+0x21a/0x400
kthread+0x164/0x190
__ret_from_fork+0x4c/0x340
ret_from_fork+0xa/0x30
IN-SOFTIRQ-W at:
__lock_acquire+0x5ae/0x15a0
lock_acquire+0x14c/0x400
_raw_spin_lock_bh+0x58/0xb0
ap_tasklet_fn+0x72/0xd0
tasklet_action_common+0x174/0x1b0
handle_softirqs+0x180/0x5c0
irq_exit_rcu+0x196/0x200
do_ext_irq+0x12a/0x4d0
ext_int_handler+0xc6/0xf0
folio_zero_user+0x1c6/0x240
folio_zero_user+0x182/0x240
vma_alloc_anon_folio_pmd+0xa0/0x1d0
__do_huge_pmd_anonymous_page+0x3a/0x200
__handle_mm_fault+0x56c/0x590
handle_mm_fault+0xa2/0x370
do_exception+0x292/0x590
__do_pgm_check+0x136/0x3e0
pgm_check_handler+0x114/0x160
INITIAL USE at:
__lock_acquire+0x5ae/0x15a0
lock_acquire+0x14c/0x400
_raw_spin_lock_bh+0x58/0xb0
ap_queue_init_state+0x2e/0x50
ap_scan_domains+0x5d6/0x620
ap_scan_adapter+0x4c0/0x810
ap_scan_bus+0x70/0x350
ap_scan_bus_wq_callback+0x56/0x80
process_one_work+0x2ba/0x820
worker_thread+0x21a/0x400
kthread+0x164/0x190
__ret_from_fork+0x4c/0x340
ret_from_fork+0xa/0x30
}
... key at: [<000001c9936e8aa0>] __key.7+0x0/0x10
the dependencies between the lock to be acquired
and SOFTIRQ-irq-unsafe lock:
-> (fs_reclaim){+.+.}-{0:0} {
HARDIRQ-ON-W at:
__lock_acquire+0x5ae/0x15a0
lock_acquire+0x14c/0x400
__fs_reclaim_acquire+0x44/0x50
fs_reclaim_acquire+0xbe/0x100
fs_reclaim_correct_nesting+0x20/0x70
dotest+0x5e/0x148
locking_selftest+0x2854/0x2a88
start_kernel+0x3b2/0x4f0
startup_continue+0x2e/0x40
SOFTIRQ-ON-W at:
__lock_acquire+0x5ae/0x15a0
lock_acquire+0x14c/0x400
__fs_reclaim_acquire+0x44/0x50
fs_reclaim_acquire+0xbe/0x100
fs_reclaim_correct_nesting+0x20/0x70
dotest+0x5e/0x148
locking_selftest+0x2854/0x2a88
start_kernel+0x3b2/0x4f0
startup_continue+0x2e/0x40
INITIAL USE at:
__lock_acquire+0x5ae/0x15a0
lock_acquire+0x14c/0x400
__fs_reclaim_acquire+0x44/0x50
fs_reclaim_acquire+0xbe/0x100
fs_reclaim_correct_nesting+0x20/0x70
dotest+0x5e/0x148
locking_selftest+0x2854/0x2a88
start_kernel+0x3b2/0x4f0
startup_continue+0x2e/0x40
}
... key at: [<000001c991f498e8>] __fs_reclaim_map+0x0/0x30
... acquired at:
check_prev_add+0x178/0xf40
__lock_acquire+0x12aa/0x15a0
lock_acquire+0x14c/0x400
__fs_reclaim_acquire+0x44/0x50
fs_reclaim_acquire+0xbe/0x100
__kmalloc_cache_noprof+0x5a/0x6d0
kobject_uevent_env+0xd4/0x420
ap_send_se_bind_uevent+0x48/0x70
se_bind_store+0x146/0x3a0
kernfs_fop_write_iter+0x18c/0x270
vfs_write+0x23c/0x380
ksys_write+0x88/0x120
__do_syscall+0x170/0x750
system_call+0x72/0x90
stack backtrace:
CPU: 6 UID: 0 PID: 11034 Comm: setupseguest.sh Not tainted 7.1.0-20260601.rc6.git2.516b5dbd4d4a.300.fc44.s390x+debug #1 PREEMPT
Hardware name: IBM 9175 ME1 701 (KVM/Linux)
Call Trace:
[<000001c98ffa0a7e>] dump_stack_lvl+0xae/0x108
[<000001c9900a6d7a>] print_bad_irq_dependency+0x47a/0x480
[<000001c9900a7184>] check_irq_usage+0x404/0x4c0
[<000001c9900a73b8>] check_prev_add+0x178/0xf40
[<000001c9900aaf1a>] __lock_acquire+0x12aa/0x15a0
[<000001c9900ab35c>] lock_acquire+0x14c/0x400
[<000001c9903be454>] __fs_reclaim_acquire+0x44/0x50
[<000001c9903be51e>] fs_reclaim_acquire+0xbe/0x100
[<000001c9903cf4ca>] __kmalloc_cache_noprof+0x5a/0x6d0
[<000001c9910ca9d4>] kobject_uevent_env+0xd4/0x420
[<000001c990d84098>] ap_send_se_bind_uevent+0x48/0x70
[<000001c990d87416>] se_bind_store+0x146/0x3a0
[<000001c99057da7c>] kernfs_fop_write_iter+0x18c/0x270
[<000001c99047712c>] vfs_write+0x23c/0x380
[<000001c990477438>] ksys_write+0x88/0x120
[<000001c9910f64e0>] __do_syscall+0x170/0x750
[<000001c99110a412>] system_call+0x72/0x90
INFO: lockdep is turned off.
Fixes: 4179c3984227 ("s390/ap: Implement SE bind and associate uevents") Reported-by: Ingo Franzki <ifranzki@linux.ibm.com> Suggested-by: Finn Callies <fcallies@linux.ibm.com> Reviewed-by: Finn Callies <fcallies@linux.ibm.com> Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Firmware will set dsp_ack to 1 when firmware sends response for the IPC
command issued by host. Similarly dsp_msg flag will be updated to 1.
During ACP D0 entry, the value read from the sof_dsp_ack_write scratch
flag can be uninitialized. A non-zero garbage value is treated as a
pending DSP IPC ack before SOF_FW_BOOT_COMPLETE, causing a spurious
"IPC reply before FW_BOOT_COMPLETE" log.
====================
net: ethtool: let ops locked drivers run without rtnl_lock
With the ethtool_get_link_ksettings() situation hopefully ironed out
in the previous series (commit 6a5d837f0ce2), let's return to the main
part of the series.
We have been slowly moving towards removing the rtnl_lock dependency
in driver ops since the concept of "ops-locked" drivers was
introduced last year. Since last year we take the netdev instance
lock before invoking any ndo or ethtool op of "ops-locked" drivers.
We dipped our toes into rtnl_lock-less ops with the queue binding API.
Queue stats, NAPI, and other netdev-netlink objects are also queried
without holding rtnl_lock already. It's time to take the next logical
step and lift the requirement from ethtool ops.
The direct motivation for this patchset is that ethtool ops often
involve communicating with device FW, and may take a long time
to complete. Aggressive polling of device state on machines
with 10+ NICs has been shown to significantly increase rtnl_lock
pressure.
There's a handful of areas which still need rtnl_lock (see below).
I decided to convert everything to rtnl_lock-less by default, and
add a set of flags which let the drivers request rtnl_lock to still
be taken. I don't love this, but I'm worried that opt-in would be
even more confusing.
Known issues / exclusions:
- qdiscs - qdisc configuration currently assumes rtnl_lock, this
is mostly impacting set_channels callback. qdisc config is probably
the easiest one of the exclusions to tackle, it's fairly self-contained.
- features - even tho feature changes are (correctly) plumbed to
the driver thru ndos they are part of ethtool uAPI. ethtool itself
calls netdev_features_change() if it spots a device feature change
when comparing before vs after the callback. Some drivers also call
netdev_features_change() directly in response to various changes,
e.g. setting priv flags.
Since features have to propagate to upper and lower devices anything
that touches features is quite hard to move from under rtnl_lock.
- phylink - phylink and SFP depend on rtnl_lock today, I suspect
that this is purely for historic reasons. I started poking at
it and don't really see a need for a global lock. But accessing
the netdev instance lock from the SFP entry points will require
some attention from the phylink folks.
- phydev - similar to phylink, looks quite doable. But no ops-locked
driver currently has a phydev (fbnic only uses phylink) so phydev
related paths retain an ASSERT_RTNL() for now.
Tested on mlx5, bnxt and fbnic.
====================
Jakub Kicinski [Fri, 5 Jun 2026 00:29:11 +0000 (17:29 -0700)]
net: ethtool: optionally skip rtnl_lock on IOCTL path
Convert the IOCTL path similarly to how we converted Netlink.
The device lookup gets a little hairy. We could take rtnl_lock
unconditionally and drop it before calling the driver (this would
avoid the reference + liveness check). But I think being able
to make progress even if rtnl is dead-locked is quite useful.
First extra concern is handling features. List all the cmds which
modify features and always take rtnl_lock. We could fold this list
into ethtool_ioctl_needs_rtnl() but seems cleaner to keep
ethtool_ioctl_needs_rtnl() driver-related. If a driver changed
features and we were not holding rtnl_lock - warn about it.
It can only happen on buggy ops locked drivers (buggy because
they should have set appropriate "I need rtnl for op X" bit).
Second wrinkle is the PHY ID hack which drops the locks while
sleeping. Convert its static "busy" variable which used to
be protected by rtnl_lock to a field in struct ethtool_netdev_state.
This feature is about identifying an adapter or a port within
a system, so being able to blink multiple LEDs at the same
time is likely not very useful in practice. But it's the simplest
fix, we can add a mutex if someone thinks a system should only
be ID'ing one port at a time.
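The per-device "busy" flag described above can be pictured with a minimal userspace sketch. The struct, field, and function names below are hypothetical stand-ins (the text only says the flag moves into struct ethtool_netdev_state), and the sketch assumes the caller already holds the lock that protects the device.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-netdev ethtool state named in the text. */
struct ethtool_state_sketch {
	bool phys_id_busy;	/* assumed to be protected by the device's own lock */
};

/* Claim the identify operation for one device; the caller holds its lock. */
static int phys_id_begin(struct ethtool_state_sketch *st)
{
	assert(st != NULL);
	if (st->phys_id_busy)
		return -EBUSY;	/* this device is already being identified */
	st->phys_id_busy = true;
	return 0;
}

/* Release the claim once identification has finished. */
static void phys_id_end(struct ethtool_state_sketch *st)
{
	assert(st != NULL);
	st->phys_id_busy = false;
}
```

Because the flag lives in each device's state, a second request on the same device is refused while two different devices can be identified at the same time, which matches the trade-off the text accepts.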
Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-12-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 5 Jun 2026 00:29:10 +0000 (17:29 -0700)]
net: ethtool: ioctl: concentrate the locking
Add another layer of helper functions to make upcoming locking
changes easier. Otherwise we'd need a pretty complex goto
structure. netdev instance lock is now taken slightly sooner
but that should not be an issue since rtnl_lock is already held,
anyway.
Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-11-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 5 Jun 2026 00:29:09 +0000 (17:29 -0700)]
net: ethtool: optionally skip rtnl_lock in RSS context handlers
Skip rtnl_lock in RSS context handlers if device is ops-locked.
Fairly trivial conversion. bnxt needed rtnl_lock for changing
the main context but looks like additional contexts are fine
without it.
Note (for review bots?) that ethnl_ops_begin() checks whether
the device is still registered.
Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-10-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 5 Jun 2026 00:29:08 +0000 (17:29 -0700)]
net: ethtool: optionally skip rtnl_lock in ethnl_act_module_fw_flash()
Module firmware flashing reads SFF-8024 identifier bytes via
.get_module_eeprom_by_page(). Other than that it modifies
a bit in the netdev->ethtool struct. Both should be ops-locked
at this point.
Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 5 Jun 2026 00:29:07 +0000 (17:29 -0700)]
net: ethtool: optionally skip rtnl_lock in ethnl_tsinfo_dumpit()
ethnl_tsinfo_dumpit() iterates netdevs and per-netdev PHY topology
calling ops->get_ts_info(). Switch to the "ops compat locking"
helpers which take either rtnl_lock or instance lock, depending
on what the device needs.
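As a rough illustration of "compat" locking, the sketch below picks one of two locks per device. It is a userspace model with pthread mutexes; the names are hypothetical and the real helpers may be shaped differently.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for the global rtnl_lock. */
static pthread_mutex_t big_lock_sketch = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical device: only the fields needed for the locking decision. */
struct compat_dev_sketch {
	bool ops_locked;		/* driver opted into instance locking */
	pthread_mutex_t instance_lock;	/* stand-in for the netdev instance lock */
};

/* Take whichever lock protects this device's ops and return it. */
static pthread_mutex_t *compat_lock(struct compat_dev_sketch *dev)
{
	pthread_mutex_t *lock;

	assert(dev);
	lock = dev->ops_locked ? &dev->instance_lock : &big_lock_sketch;
	pthread_mutex_lock(lock);
	return lock;
}

/* Release the lock returned by compat_lock(). */
static void compat_unlock(pthread_mutex_t *lock)
{
	pthread_mutex_unlock(lock);
}
```

Returning the lock that was taken keeps the unlock side from having to repeat the decision.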
Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 5 Jun 2026 00:29:06 +0000 (17:29 -0700)]
net: ethtool: optionally skip rtnl_lock in cable test handlers
Skip rtnl_lock in cable test handlers. This is really a noop since
no ops locked device supports these.
Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 5 Jun 2026 00:29:05 +0000 (17:29 -0700)]
net: ethtool: optionally skip rtnl_lock on Netlink path for SET ops
Make ethtool not take rtnl_lock for SET commands when operation
is performed on an ops-locked driver. cfg/cfg_pending are now
ops-locked, since only ethtool modifies them.
Some SET driver callbacks will still need rtnl_lock, most notably
those which may end up calling netdev_update_features() or the qdisc
layer (via netif_set_real_num_tx_queues()). Let drivers selectively
opt back into the rtnl_lock with a new bitfield in ops.
We need two helpers since Netlink and ioctl cmds have different
values. Keep the helpers side by side in common.h to make sure
they get updated together, even tho they will only get called
from ioctl.c and netlink.c.
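The opt-back-in bitfield can be sketched as follows. The command enum, the rtnl_cmds field, and the helper name are all hypothetical; the text only states that ops gain a bitfield and that a helper decides whether a command needs rtnl_lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical command numbers, only for illustration. */
enum cmd_sketch {
	CMD_SKETCH_SET_CHANNELS,
	CMD_SKETCH_SET_COALESCE,
};

/* Hypothetical ops: bit n set means command n still wants rtnl_lock. */
struct rtnl_ops_sketch {
	uint32_t rtnl_cmds;
};

struct rtnl_dev_sketch {
	bool ops_locked;
	const struct rtnl_ops_sketch *ops;
};

/* Decide whether rtnl_lock must be taken before calling into the driver. */
static bool cmd_needs_rtnl(const struct rtnl_dev_sketch *dev, unsigned int cmd)
{
	assert(cmd < 32);
	if (!dev->ops_locked)
		return true;	/* legacy drivers always run under rtnl_lock */
	return (dev->ops->rtnl_cmds >> cmd) & 1u;
}
```

A driver whose set_channels path reaches the qdisc layer would set the corresponding bit and leave the others clear.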
SET commands which don't use ethnl_default_set_doit() are converted
by subsequent commits.
Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 5 Jun 2026 00:29:04 +0000 (17:29 -0700)]
net: ethtool: optionally skip rtnl_lock on Netlink path for GET ops
ethnl_default_doit() and ethnl_default_dump_one() are both used
exclusively for GET callbacks (former to get info for a single
device or get global strings). ops-locked devices don't need
rtnl_lock for GET callbacks, stop taking it.
Introduce an opt-out mechanism for devices which use phylink (fbnic)
since phylink currently depends on rtnl_lock protection. Subsequent
patches will add more exceptions, anyway. Practically the new helpers
for judging if command needs rtnl_lock could also call
netdev_need_ops_lock() but I find that it makes the code in the callers
slightly less obvious.
Add a helper for IOCTLs already, even tho it's unused so that
we can keep them in sync as the series progresses.
This is the first user-visible step of moving ethtool ops out
from under rtnl. Subsequent patches do the same for SET ops,
as well as the ioctl path.
Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 5 Jun 2026 00:29:03 +0000 (17:29 -0700)]
net: ethtool: make dev->hwprov ops-protected
dev->hwprov tracks the active hwtstamp provider for the device.
Make it ops protected (instance lock if the netdev driver opts
into holding instance lock around callbacks, otherwise rtnl_lock).
hwprov is written and read in:
- drivers/net/phy/phy_device.c
phydev and ops protection don't currently mix, add a comment
- net/ethtool/
as of now holds both rtnl lock and ops lock, this one will
soon only hold one lock or the other
read in:
- net/core/dev_ioctl.c
holds both rtnl lock and ops lock
- net/core/timestamping.c
RCU reader
The new netdev_ops_lock_dereference() helper does not have
"compat" in the name. The name would be quite long and I think
in this case it should be obvious that we need _a_ lock.
netdev_lock_dereference() already exists and means dev->lock
is always expected.
Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
phydev <> netdev linking and lifecycle depends on rtnl_lock.
We want to switch to instance locks for most ethtool ops.
Let's add an assert that ops locked devices don't use phydev
today. If one does we can either opt the phy ops out of
being purely ops locked, or do deeper surgery to make phy
locking ops-compatible. I don't think there's any fundamental
challenge to make that work.
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ethnl_bcast_seq is a global counter stamped into the nlmsg_seq field
of every multicast notification, allowing userspace to detect dropped
messages. Today the ordering is achieved by using rtnl_lock().
Moving forward we will want ethtool ops to run under just the netdev
instance lock so to establish ordering we need a separate lock
for notifications. With the netdev instance locks operations on
different devices may bypass each other but the expectation is
that it should not matter. What we need to prevent is:
- notification IDs getting out of order
- operations on one device getting out of order
For simplicity defer allocating the ID of the notification until right
before the notification is delivered. This removes the need for
special handling in ethnl_rss_create_send_ntf().
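One way to picture the ordering rule is a dedicated lock that covers both the ID allocation and the delivery. This is a userspace sketch with hypothetical names and a pthread mutex in place of whatever lock the patch actually adds.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

struct ntf_sketch {
	uint32_t seq;	/* would end up in nlmsg_seq */
};

typedef void (*deliver_fn_sketch)(const struct ntf_sketch *ntf, void *ctx);

static pthread_mutex_t ntf_lock_sketch = PTHREAD_MUTEX_INITIALIZER;
static uint32_t bcast_seq_sketch;	/* global counter, guarded by the lock */

/* Stamp the ID and deliver under one lock so IDs leave in order. */
static void ntf_send(struct ntf_sketch *ntf, deliver_fn_sketch deliver, void *ctx)
{
	assert(ntf && deliver);
	pthread_mutex_lock(&ntf_lock_sketch);
	ntf->seq = ++bcast_seq_sketch;	/* allocated right before delivery */
	deliver(ntf, ctx);
	pthread_mutex_unlock(&ntf_lock_sketch);
}

/* Example receiver: remember the last sequence number seen. */
static void record_last_seq(const struct ntf_sketch *ntf, void *ctx)
{
	*(uint32_t *)ctx = ntf->seq;
}
```

Since the message is built before the ID exists, nothing earlier in the path needs to reserve or patch a sequence number.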
Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
igc: skip RX timestamp header for frame preemption verification
When RX hardware timestamping is enabled, a 16-byte inline timestamp header
is added to the start of the packet buffer, causing FPE handshake
verification to fail.
Because an incorrect packet buffer is passed to igc_fpe_handle_mpacket(),
the mem_is_zero() check inspects the timestamp metadata instead of the
actual mPacket payload. As a result, valid Verify/Response mPackets can be
missed when inline RX timestamps are present.
Pass pktbuf + pkt_offset to igc_fpe_handle_mpacket() so it inspects the
actual mPacket payload instead of the timestamp header.
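The effect of the missing offset can be shown with a small userspace model. all_zero() stands in for the kernel's mem_is_zero(), and the helper name and the 60-byte payload length used in the usage note are illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TS_HDR_LEN_SKETCH 16	/* inline RX timestamp header size, from the text */

/* Userspace stand-in for the kernel's mem_is_zero(). */
static bool all_zero(const uint8_t *p, size_t len)
{
	assert(p != NULL);
	for (size_t i = 0; i < len; i++)
		if (p[i])
			return false;
	return true;
}

/* Offset of the real frame data, depending on whether a timestamp was prepended. */
static size_t payload_offset(bool has_inline_ts)
{
	return has_inline_ts ? TS_HDR_LEN_SKETCH : 0;
}
```

With a nonzero timestamp in front of an all-zero payload, all_zero(buf, 60) is false while all_zero(buf + payload_offset(true), 60) is true, which is the difference between missing and recognizing the mPacket.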
Fixes: 5422570c0010 ("igc: add support for frame preemption verification") Co-developed-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com> Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com> Signed-off-by: KhaiWenTan <khai.wen.tan@linux.intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Larysa Zaremba [Mon, 18 May 2026 11:15:04 +0000 (13:15 +0200)]
ixgbe: do not configure xps for XDP queues
netif_set_xps_queue() should not be called for an XDP Tx queue, since such
queues are not netdev-exposed. On systems with number of CPUs >=64, on E610
adapter, netdev is configured with maximum number of queue pairs being 63
(due to MSI-X assignment), but configuring XDP results in 64 XDP queues.
So, during XDP program load, when netif_set_xps_queue() is called for the
last XDP queue, we get a WARNING with a call trace and KASAN report
afterwards (if enabled).
[ 2012.701094] BUG: KASAN: slab-out-of-bounds in __netif_set_xps_queue+0x1ac5/0x1e40
[ 2012.701100] Write of size 4 at addr ffff88888d43cff8 by task xdpsock/103668
Skip XPS configuration for XDP Tx queues.
Fixes: 33fdc82f0883 ("ixgbe: add support for XDP_TX action") Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Patryk Holda <patryk.holda@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Przemyslaw Korba [Mon, 25 May 2026 08:38:03 +0000 (10:38 +0200)]
idpf: add padding to PTP virtchnl structures
Add padding to virtchnl2 PTP structures to match the Control Plane
expected message sizes:
* virtchnl2_ptp_get_dev_clk_time: 8 -> 16 bytes
* virtchnl2_ptp_set_dev_clk_time: 8 -> 16 bytes
* virtchnl2_ptp_get_cross_time: 16 -> 24 bytes
The FW expects the above sizes and PTP negotiation fails due to the
mismatch. Previously neither the FW nor the driver checked message/reply
sizes strictly, so the problem appeared only after recent validation
improvements.
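The size changes listed above can be pinned down with compile-time checks. The struct layouts below are illustrative guesses (field names and an 8-byte trailing pad are assumptions); only the resulting sizes come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: one 64-bit time value plus padding, 16 bytes total. */
struct ptp_dev_clk_time_sketch {
	uint64_t dev_time_ns;
	uint8_t pad[8];
};

/* Hypothetical layout: two 64-bit time values plus padding, 24 bytes total. */
struct ptp_cross_time_sketch {
	uint64_t sys_time_ns;
	uint64_t dev_time_ns;
	uint8_t pad[8];
};

/* Fail the build if a message no longer has the size the peer expects. */
static_assert(sizeof(struct ptp_dev_clk_time_sketch) == 16,
	      "get/set dev clk time message must be 16 bytes");
static_assert(sizeof(struct ptp_cross_time_sketch) == 24,
	      "get cross time message must be 24 bytes");
```

Checks of this kind turn a runtime negotiation failure into a build failure when a structure drifts from the agreed wire size.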
Reproduction steps:
ptp4l -i <pf> -m
Observe: failed to open /dev/ptp0: Permission denied
Fixes: bf27283ba594 ("virtchnl: add PTP virtchnl definitions") Cc: stable@vger.kernel.org Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Przemyslaw Korba <przemyslaw.korba@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Terry Bowman [Fri, 5 Jun 2026 18:06:10 +0000 (13:06 -0500)]
cxl: Fix CXL_HEADERLOG_SIZE to match RAS Capability size
The CXL r4.0 8.2.4.17.7 RAS Capability Structure has total length 0x58
bytes (CXL_RAS_CAPABILITY_LENGTH); the Header Log occupies the trailing
64 bytes at offset 0x18. CXL_HEADERLOG_SIZE was defined as SZ_512,
eight times the actual on-device size.
header_log_copy() reads CXL_HEADERLOG_SIZE_U32 (128) dwords from the
RAS capability iomap, overrunning the 88-byte mapping by 448 bytes.
The cxl_aer_uncorrectable_error trace event memcpy()s CXL_HEADERLOG_SIZE
(512) bytes from its source. For the CPER caller the source is
struct cxl_ras_capability_regs::header_log[16] (64 bytes) embedded in a
stack-local cxl_cper_prot_err_work_data, so the memcpy reads 448 bytes
of kernel stack into the trace event ring buffer where userspace can
read it via tracefs.
Set CXL_HEADERLOG_SIZE to 64 and derive CXL_HEADERLOG_SIZE_U32 from it,
bringing all iomap readers into agreement on 16 dwords. Userspace tools
such as rasdaemon have grown a dependency on the buggy 512-byte (128 u32)
header_log layout in the cxl_aer_uncorrectable_error trace event. Add
CXL_HEADERLOG_TRACE_SIZE_U32 = 128 and use it for the trace event
__array and its memcpy to preserve that ABI. Both callers now pass a
zero-filled u32[CXL_HEADERLOG_TRACE_SIZE_U32] staging buffer with only
the first CXL_HEADERLOG_SIZE_U32 (16) entries populated from hardware;
the remaining 112 u32s are zero-padded, keeping the 512-byte trace ring
buffer layout intact.
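The size arithmetic and the zero-padded staging buffer can be sketched in userspace as follows. The macro and function names carry a _SKETCH suffix or are otherwise hypothetical; the numeric values are the ones given in the text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RAS_CAP_LEN_SKETCH		0x58	/* RAS capability length */
#define RAS_HEADER_LOG_OFF_SKETCH	0x18	/* header log offset */
#define HEADERLOG_SIZE_SKETCH		64	/* on-device header log size */
#define HEADERLOG_SIZE_U32_SKETCH	(HEADERLOG_SIZE_SKETCH / sizeof(uint32_t))
#define HEADERLOG_TRACE_SIZE_U32_SKETCH	128	/* layout userspace depends on */

/* The header log really is the trailing 64 bytes of the capability. */
static_assert(RAS_HEADER_LOG_OFF_SKETCH + HEADERLOG_SIZE_SKETCH == RAS_CAP_LEN_SKETCH,
	      "header log must end at the capability end");
/* Reading 512 bytes from the header log offset overran the mapping by 448. */
static_assert(RAS_HEADER_LOG_OFF_SKETCH + 512 - RAS_CAP_LEN_SKETCH == 448,
	      "old overrun size");

/* Fill the 512-byte trace layout: 16 dwords from hardware, the rest zero. */
static void fill_trace_header_log(uint32_t dst[HEADERLOG_TRACE_SIZE_U32_SKETCH],
				  const uint32_t hw[HEADERLOG_SIZE_U32_SKETCH])
{
	memset(dst, 0, HEADERLOG_TRACE_SIZE_U32_SKETCH * sizeof(uint32_t));
	memcpy(dst, hw, HEADERLOG_SIZE_SKETCH);
}
```

The trace consumer still sees 128 dwords, but only the first 16 can carry device data.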
[ dj: Replaced 64 with SZ_64 per RichardC ]
Fixes: 36f257e3b0ba ("acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors") Fixes: 2905cb5236cb ("cxl/pci: Add (hopeful) error handling support") Cc: stable@vger.kernel.org Reported-by: Sashiko Signed-off-by: Terry Bowman <terry.bowman@amd.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com> Reviewed-by: Richard Cheng <icheng@nvidia.com> Link: https://patch.msgid.link/20260605180610.2249458-1-terry.bowman@amd.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Arnd Bergmann [Tue, 9 Jun 2026 16:41:32 +0000 (18:41 +0200)]
Merge tag 'mvebu-arm-7.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/gclement/mvebu into soc/arm
mvebu arm for 7.2 (part 1)
Orion5x: Replace machine_is_mss2() with of_machine_is_compatible() in mss2_pci_init()
mvebu_v5_defconfig: Remove stale MACH_LINKSTATION_LSCHL reference
Armada 370: Simplify of_node_put calls and drop redundant NULL checks
* tag 'mvebu-arm-7.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/gclement/mvebu:
ARM: orion5x: update board check in mss2_pci_init() to use the DT
arm: mvebu_v5_defconfig: remove stale MACH_LINKSTATION_LSCHL reference
ARM: mvebu: simplify of_node_put calls
ARM: mvebu: drop unnecessary NULL check
Ryan Chen [Tue, 9 Jun 2026 02:47:20 +0000 (10:47 +0800)]
arm64: dts: aspeed: Add initial AST27xx SoC device tree
Add initial device tree support for the ASPEED AST27xx family, the
8th-generation Baseboard Management Controller (BMC) SoCs.
AST27xx SOC Family
- https://www.aspeedtech.com/server_ast2700/
- https://www.aspeedtech.com/server_ast2720/
- https://www.aspeedtech.com/server_ast2750/
The AST27xx features a dual-SoC architecture consisting of two dies,
referred to as SoC0 and SoC1, interconnected through an internal
proprietary bus. Both SoCs share the same address decoding scheme,
while each maintains independent clock and reset domains.
- SoC0 (CPU die): contains a quad-core Cortex-A35 cluster and two
Cortex-M4 cores, along with high-speed peripherals.
- SoC1 (I/O die): includes the BootMCU (responsible for system
boot) and its own clock/reset domains, along with low-speed peripherals.
The device tree describes the SoC0 and SoC1 domains and their peripheral
layouts.
Filipe Manana [Fri, 5 Jun 2026 15:15:37 +0000 (16:15 +0100)]
btrfs: fix use-after-free after relocation failure with concurrent COW
If we get a failure during relocation, before we update all the extent
buffers that have file extent items pointing to extents from the block
group being relocated, we can trigger a use-after-free on the reloc
control structure (fs_info->reloc_control) if we have a concurrent task
that is COWing a subvolume leaf.
This happens like this:
1) Relocation of data block group X starts;
2) Relocation changes its state to UPDATE_DATA_PTRS;
3) A task doing a rename for example, COWs leaf A from a subvolume tree
and ends up at btrfs_reloc_cow_block() and extracts fs_info->reloc_ctl
into a local variable, which it then passes to replace_file_extents();
4) The relocation task gets an error and under the label 'out_put_bg' in
btrfs_relocate_block_group() calls free_reloc_control(), which frees
the reloc control structure that the rename task is using;
5) The rename task triggers a use-after-free on the reloc control
structure that was just freed.
Syzbot reported this recently, with the following stack trace:
1) Making the reloc control structure ref counted;
2) Make every place that accesses fs_info->reloc_ctl outside the relocation
code, which at the moment is only replace_file_extents() and
btrfs_init_reloc_root(), get a reference count on the structure.
There's also btrfs_update_reloc_root() that is called outside the
relocation code, but this case is safe because it's only called in
the transaction commit path while under the fs_info->reloc_mutex
protection, but nevertheless grab a reference to make the code more
consistent and avoid false alerts from AI reviews;
3) Add a spinlock to protect fs_info->reloc_ctl, since we can not take the
fs_info->reloc_mutex as that would cause a deadlock since that lock is
taken in the transaction commit path. That spinlock is taken before
setting fs_info->reloc_ctl to an allocated structure, setting it to
NULL and reading fs_info->reloc_ctl;
4) Make sure the structure is freed only when its reference count drops to
zero.
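The get/put pattern in the list above can be sketched in userspace like this. A pthread mutex stands in for the new spinlock, C11 atomics stand in for the kernel refcount, and every name is hypothetical; it is a model of the idea, not the btrfs code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct reloc_ctl_sketch {
	atomic_int refs;
};

struct fs_info_sketch {
	pthread_mutex_t ctl_lock;		/* stands in for the new spinlock */
	struct reloc_ctl_sketch *reloc_ctl;
};

/* Allocate a control structure and publish it; the publisher owns one reference. */
static struct reloc_ctl_sketch *reloc_ctl_publish(struct fs_info_sketch *fs)
{
	struct reloc_ctl_sketch *rc = malloc(sizeof(*rc));

	if (!rc)
		return NULL;
	atomic_init(&rc->refs, 1);
	pthread_mutex_lock(&fs->ctl_lock);
	fs->reloc_ctl = rc;
	pthread_mutex_unlock(&fs->ctl_lock);
	return rc;
}

/* Reader side: take a reference while the pointer is known to be valid. */
static struct reloc_ctl_sketch *reloc_ctl_get(struct fs_info_sketch *fs)
{
	struct reloc_ctl_sketch *rc;

	pthread_mutex_lock(&fs->ctl_lock);
	rc = fs->reloc_ctl;
	if (rc)
		atomic_fetch_add(&rc->refs, 1);
	pthread_mutex_unlock(&fs->ctl_lock);
	return rc;
}

/* Drop a reference; returns true if this call freed the structure. */
static bool reloc_ctl_put(struct reloc_ctl_sketch *rc)
{
	assert(atomic_load(&rc->refs) > 0);
	if (atomic_fetch_sub(&rc->refs, 1) != 1)
		return false;
	free(rc);
	return true;
}

/* Relocation side: unpublish first, then drop the publisher's reference. */
static bool reloc_ctl_unset(struct fs_info_sketch *fs)
{
	struct reloc_ctl_sketch *rc;

	pthread_mutex_lock(&fs->ctl_lock);
	rc = fs->reloc_ctl;
	fs->reloc_ctl = NULL;
	pthread_mutex_unlock(&fs->ctl_lock);
	return rc ? reloc_ctl_put(rc) : false;
}
```

In the failing sequence from the text, the COWing task would hold its own reference from reloc_ctl_get(), so the error path's reloc_ctl_unset() no longer frees memory that task is still using.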
Filipe Manana [Fri, 5 Jun 2026 16:25:50 +0000 (17:25 +0100)]
btrfs: move WARN_ON on unexpected error in __add_tree_block()
There's no point in having the WARN_ON(1) inside the if statement for the
unexpected error. Move it into the if statement's condition, which brings
a couple benefits:
1) It marks the branch as unlikely, hinting the compiler to generate
better code;
2) The WARN_ON() stack trace is now printed before the dumped leaf and
error message rather than after them, where it could hide that more
important information in case we get a truncated dmesg/syslog.
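A userspace approximation of the pattern, assuming a warn_on_sketch() helper that mimics how WARN_ON() returns its condition. The helper, the function around it, and the messages are hypothetical; __builtin_expect() is a GCC/Clang builtin used here to model the "unlikely" hint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Print a warning when cond is true and hand the condition back to the caller. */
static bool warn_on_sketch(bool cond, const char *what)
{
	if (__builtin_expect(cond, 0))
		fprintf(stderr, "WARNING: %s\n", what);
	return cond;
}

/* The warning sits in the condition, so it is emitted before the details. */
static int check_item_sketch(int ret)
{
	assert(ret <= 0);	/* model: 0 on success, negative errno on error */
	if (warn_on_sketch(ret != 0, "unexpected error")) {
		fprintf(stderr, "leaf dump and error message, ret=%d\n", ret);
		return ret;
	}
	return 0;
}
```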
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 5 Jun 2026 16:07:08 +0000 (17:07 +0100)]
btrfs: move locking into btrfs_get_reloc_bg_bytenr()
It does not make sense for the single caller to have the responsibility
to lock the relocation mutex before calling the function and then have
the function assert the lock is held. As this is a function in
relocation.c, move the locking details into it.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Weiming Shi [Sun, 7 Jun 2026 05:25:13 +0000 (22:25 -0700)]
btrfs: lzo: reject compressed segment that overflows the compressed input
lzo_decompress_bio() validates each on-disk segment length seg_len only
against the workspace cbuf size, not against the compressed input size
(compressed_len, the total folio bytes of the bio). A crafted extent can
carry a segment whose seg_len passes the cbuf check but runs past the end
of the bio, so copy_compressed_segment() walks off the last folio:
get_current_folio() then returns the NULL folio from bio_next_folio(), and
with CONFIG_BTRFS_ASSERT disabled (default) folio_size(NULL) faults.
BUG: KASAN: null-ptr-deref in lzo_decompress_bio (fs/btrfs/lzo.c:383)
Read of size 8 at addr 0000000000000000 by task kworker/u8:1/29
Workqueue: btrfs-endio simple_end_io_work
kasan_report (mm/kasan/report.c:590)
lzo_decompress_bio (fs/btrfs/lzo.c:383)
end_bbio_compressed_read (fs/btrfs/compression.c:1065)
btrfs_bio_end_io (fs/btrfs/bio.c:135)
btrfs_check_read_bio (fs/btrfs/bio.c:180 fs/btrfs/bio.c:285)
simple_end_io_work
process_one_work
worker_thread
Reject any segment whose payload would extend beyond compressed_len before
copying it, treating it as corruption like the other on-disk validation
failures in this function.
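The added bound can be expressed as a small standalone check. The function name, the parameter names other than seg_len and compressed_len, and the -EIO return value are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Validate one compressed segment.
 * cur_in:         offset of the segment payload within the compressed input
 * seg_len:        on-disk length of the segment payload
 * cbuf_size:      size of the workspace buffer the payload is copied into
 * compressed_len: total bytes available in the compressed input
 */
static int check_segment_sketch(uint32_t cur_in, uint32_t seg_len,
				uint32_t cbuf_size, uint32_t compressed_len)
{
	/* Pre-existing style of check: the payload must fit the workspace. */
	if (seg_len > cbuf_size)
		return -EIO;
	/* Added check: the payload must not run past the end of the input. */
	if (cur_in > compressed_len || seg_len > compressed_len - cur_in)
		return -EIO;
	return 0;
}
```

The second test is written as a subtraction on the already validated remainder so that cur_in + seg_len cannot wrap around.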
Reported-by: Xiang Mei <xmei5@asu.edu> Fixes: a6e66e6f8c1b ("btrfs: rework lzo_decompress_bio() to make it subpage compatible") Assisted-by: Claude:claude-opus-4-8 Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 4 Jun 2026 00:29:48 +0000 (09:59 +0930)]
btrfs: retry faulting in the pages after a zero sized short direct write
Currently btrfs_direct_write() will not try to fault in the pages, but
directly fall back to buffered writes, if the first page of the buffer
can not be faulted in.
For example, during generic/362 with nodatasum mount option, there is a
write at file offset 0, length PAGE_SIZE, and the page is not faulted in.
Then we go through the following call chain and directly fall back to buffered
IO:
btrfs_direct_write()
|- btrfs_dio_write()
|- __iomap_dio_rw()
| |- iomap_iter()
| | |- btrfs_dio_iomap_begin()
| | Now an ordered extent is allocated for the 4K write.
| |
| |- iomi.status = iomap_dio_iter()
| | Where iomap_dio_iter() returned -EFAULT.
| |
| |- ret = iomap_iter()
| | |- btrfs_dio_iomap_end()
| | | | return -ENOTBLK
| | |- return -ENOTBLK
| |- if (ret == -ENOTBLK) { ret = 0; }
| Now the return value is reset to 0.
|
|- ret = iomap_dio_complete()
| Since no byte is submitted, @ret is now zero.
|
|- if (iov_iter_count() > 0 && (ret == -EFAULT || ret > 0))
| @ret is zero, thus not meeting the above retry condition
|
|- Fallback to buffered
Just slightly loosen the condition to allow retry faulting in pages after
a zero sized short write.
Unlike the issues behind the previous two bug fixes, this one does not cause
any real bug; it only reduces the chance of doing zero-copy direct IO.
Thus it doesn't really require stable-CC nor fixes-tag.
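The difference between the old and the loosened retry condition can be sketched as two predicates. The old one mirrors the condition quoted in the call chain above; the new one (accepting ret == 0) is an assumption about what "slightly loosen" means, and both function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Old rule: retry only after -EFAULT or a partial write of at least one byte. */
static bool should_retry_old_sketch(size_t bytes_left, long ret)
{
	return bytes_left > 0 && (ret == -EFAULT || ret > 0);
}

/* Assumed new rule: a zero-sized short write also qualifies for a retry. */
static bool should_retry_new_sketch(size_t bytes_left, long ret)
{
	return bytes_left > 0 && (ret == -EFAULT || ret >= 0);
}
```

For the generic/362 case in the text (ret == 0 with a full page still pending) the old predicate is false and the assumed new one is true.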
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 4 Jun 2026 00:29:47 +0000 (09:59 +0930)]
btrfs: fix incorrect buffered IO fallback for append direct writes
[BUG]
With the previous bug of short direct writes fixed, test case
generic/362 (*) still fails with the following error with nodatasum
mount option:
generic/362 0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
- output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
--- tests/generic/362.out 2024-08-24 15:31:37.200000000 +0930
+++ /home/adam/xfstests/results//generic/362.out.bad 2026-05-27 10:13:09.072485767 +0930
@@ -1,2 +1,3 @@
QA output created by 362
+Wrong file size after first write, got 8192 expected 4096
Silence is golden
...
*: If the test case has been executed before with default data checksum,
the failure will not reproduce. Need the following fix to make it
reliably reproducible:
https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/
[CAUSE]
Inside btrfs_dio_iomap_begin() for a direct write, we increase the isize
if it's beyond the current isize.
But if the direct io finished short, we do not revert the isize to the
previous value nor to the short write end.
Then if we need to fall back to buffered writes, and the write has
IOCB_APPEND flag, then the buffered write will be positioned at the
incorrect isize.
The call chain looks like this:
btrfs_direct_write(pos=0, length=4K)
|- __iomap_dio_rw()
| |- iomap_iter()
| | |- btrfs_dio_iomap_begin()
| | |- btrfs_get_blocks_direct_write()
| | |- i_size_write()
| | Which updates the isize to the write end (4K).
| |
| |- iomap_dio_iter()
| | Failed with -EFAULT on the first page.
| |
| |- iomap_iter()
| | |- btrfs_dio_iomap_end()
| | Detects a short write, return -ENOTBLK
| |- if (ret == -ENOTBLK) { ret = 0;}
| Which resets the return value.
|
|- ret = iomap_dio_complete()
| Which returns 0.
|
|- btrfs_buffered_write(iocb, from);
|- generic_write_checks()
|- iocb->ki_pos = i_size_read()
Which is still the new size (4K), rather than the original
isize 0.
[FIX]
Introduce the following btrfs_dio_data members:
- old_isize
- updated_isize
If the direct write has enlarged the isize.
Then if we got a short write, and btrfs_dio_data::updated_isize is set,
revert to the correct isize based on old_isize and current file
position.
And here we call i_size_write() without holding an extent lock, which is
a very special case that we're safe to do:
- Only a single writer can be enlarging isize
Enlarging isize will take the exclusive inode lock.
- Buffered readers need to wait for the OE we're holding
Buffered readers will lock extent and wait for OE of the folio range.
Sometimes we can skip the OE wait, but since all page cache is
invalidated, the OE wait can not be skipped.
But I do not think this is the most elegant solution, nor does it cover
all cases. E.g. if the bio is submitted but IO failed, we are unable to do
the revert.
I believe the more elegant one would be to extend the EXTENT_DIO_LOCKED
lifespan for direct writes, so that we can update the isize when a
write beyond EOF finished successfully.
However that change is too huge for a small bug fix.
So only implement the minimal partial fix for now.
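The revert described in the [FIX] section can be modeled roughly as follows. This is a userspace sketch: struct dio_data_model and reverted_isize() are hypothetical names, and taking the larger of old_isize and the short write end is an assumption about how "old_isize and current file position" are combined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two new btrfs_dio_data members. */
struct dio_data_model {
	uint64_t old_isize;	/* isize before the direct write enlarged it */
	bool updated_isize;	/* set only if the direct write enlarged isize */
};

/*
 * Return the isize to restore after a short write that ended at @short_end.
 * Assumption: the file never shrinks below old_isize, and bytes that were
 * really written stay covered.
 */
uint64_t reverted_isize(const struct dio_data_model *d, uint64_t cur_isize,
			uint64_t short_end)
{
	if (!d->updated_isize)
		return cur_isize;
	return short_end > d->old_isize ? short_end : d->old_isize;
}
```

For the generic/362 case above (old isize 0, nothing written), the model restores isize to 0, so an IOCB_APPEND buffered fallback starts at offset 0 again.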
[REASON FOR NO FIXES TAG]
The bug is again very old, before commit f85781fb505e ("btrfs: switch to
iomap for direct IO") we are already increasing isize without a
proper rollback for short writes.
Thus only a CC to stable.
CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
generic/362 0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
--- tests/generic/362.out 2024-08-24 15:31:37.200000000 +0930
+++ /home/adam/xfstests/results//generic/362.out.bad 2026-05-27 10:21:17.574771567 +0930
@@ -1,2 +1,3 @@
QA output created by 362
+First write failed: Input/output error
Silence is golden
...
*: If the test case has been executed before with default data checksum,
the failure will not reproduce. Need the following fix to make it
reliably reproducible:
https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/
[CAUSE]
Inside __iomap_dio_rw(), the -EFAULT/-ENOTBLK error is not directly returned.
Thus we never got an error pointer from __iomap_dio_rw().
The call chain looks like this:
btrfs_direct_write()
|- btrfs_dio_write()
|- __iomap_dio_rw()
| |- iomap_iter()
| | |- btrfs_dio_iomap_begin()
| | Now an ordered extent is allocated for the 4K write.
| |
| |- iomi.status = iomap_dio_iter()
| | Where iomap_dio_iter() returned -EFAULT.
| |
| |- ret = iomap_iter()
| | |- btrfs_dio_iomap_end()
| | | |- btrfs_finish_ordered_extent(uptodate = false)
| | | | |- can_finish_ordered_extent()
| | | | |- btrfs_mark_ordered_extent_error()
| | | | |- mapping_set_error()
| | | | Now the address space is marked error.
| | | | return -ENOTBLK
| | |- return -ENOTBLK
| |- if (ret == -ENOTBLK) { ret = 0; }
| Now the return value is reset to 0.
| Thus no error pointer will be returned.
|
|- ret = iomap_dio_complete()
| Since no byte is submitted, @ret is 0.
|
|- Fallback to buffered IO
| And the buffered write finished without error
|
|- filemap_fdatawait_range()
|- filemap_check_errors()
The previous error is recorded, thus an error is returned
However the buffered write is properly submitted and finished, the error
is from the btrfs_finish_ordered_extent() call with @uptodate = false.
[FIX]
When a short dio write happens, any range that is submitted will have
btrfs_extract_ordered_extent() called on it, thus the submitted range
will always have an OE just covering the submitted range.
The remaining OE range is never submitted, thus it should be treated
as truncated, not as an error. That way we can properly reclaim it and
not insert an unnecessary file extent item, without marking the mapping
as error.
Extract a helper, btrfs_mark_ordered_extent_truncated(), and utilize
that helper to mark the direct IO ordered extent as truncated, so it
won't cause failure for the later buffered fallback.
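The difference between the two ways of finishing the unsubmitted OE range can be pictured with a toy model. All structs and function names below are made up for illustration; they only mimic the effect described above (an error recorded on the mapping versus a truncated mark).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model: these are not the kernel's definitions. */
struct mapping_model { int err; };
struct oe_model { bool ioerr; bool truncated; };

/* Old behavior: finishing with uptodate == false records -EIO on the mapping. */
void finish_oe_failed(struct oe_model *oe, struct mapping_model *m)
{
	oe->ioerr = true;
	m->err = -EIO;
}

/* New behavior (sketch): the unsubmitted tail is only marked truncated. */
void mark_oe_truncated(struct oe_model *oe)
{
	oe->truncated = true;
}

/* Stand-in for the error check done after the buffered fallback. */
int check_errors(const struct mapping_model *m)
{
	return m->err;
}
```

In the model, only the old path makes the later error check fail, even though the buffered write itself succeeded.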
[REASON FOR NO FIXES TAG]
The bug itself is pretty old, at commit f85781fb505e ("btrfs: switch to
iomap for direct IO") we're already passing @uptodate=false finishing
the OE.
But at that time OE with IOERR won't call mapping_set_error(), so it's
not exposed.
Later commit d61bec08b904 ("btrfs: mark ordered extent and inode with
error if we fail to finish") finally exposed the bug, but that commit
is doing a correct job, not the root cause.
Anyway the bug is very old, dating back to 5.1x days, thus only CC to
stable.
CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 2 Jun 2026 13:42:17 +0000 (14:42 +0100)]
btrfs: use verbose assertions in backref.c
While debugging a relocation issue I hit an assertion in backref.c but it
was not super useful, since it could not tell what was the unexpected
value that triggered the assertion. The stack trace was this:
Qu Wenruo [Tue, 2 Jun 2026 05:26:49 +0000 (14:56 +0930)]
btrfs: print a message when a missing device re-appears
There is a bug report that fstrim crashed, and that crash was eventually
pinned down to a missing device which re-appeared and screwed up callers
that only check BTRFS_DEV_STATE_MISSING, but not
BTRFS_DEV_STATE_WRITEABLE nor device->bdev.
A missing device re-appearing can be very tricky, as for now it will
result in a device without WRITEABLE or MISSING flag, and still no bdev
pointer.
As the first step to enhance handling of such re-appearing missing
devices, add a dmesg output when a missing device re-appears.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
assertion failed: device->bdev, in extent-tree.c:6630 (devid=2 path=/dev/sdd dev_state=0x82)
Which means the device->bdev is NULL, and the dev_state is
BTRFS_DEV_STATE_IN_FS_METADATA | BTRFS_DEV_STATE_ITEM_FOUND, without
BTRFS_DEV_STATE_WRITEABLE flag set.
[CAUSE]
The pc points to the following call chain:
So the NULL pointer dereference is caused by device->bdev being NULL.
This looks impossible at a quick glance, as just before calling
btrfs_trim_free_extents_throttle(), we have skipped any device that has
BTRFS_DEV_STATE_MISSING flag set.
However in this particular case, there is a window where the missing
device is later re-scanned, causing btrfs to remove the
BTRFS_DEV_STATE_MISSING flag:
btrfs_control_ioctl()
|- btrfs_scan_one_device()
|- device_list_add()
|- rcu_assign_pointer(device->name, name);
| This updates the missing device's path to the new good path.
|
|- clear_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)
This removes the BTRFS_DEV_STATE_MISSING flag.
This allows the missing device to re-appear and clear the
BTRFS_DEV_STATE_MISSING flag. However the device still does not have
the BTRFS_DEV_STATE_WRITEABLE flag set, nor is its bdev pointer updated.
The bdev pointer remains NULL, triggering the crash later.
[FIX]
This is a big de-synchronization between BTRFS_DEV_STATE_MISSING and
device->bdev pointer, and shows a gap in btrfs's re-appearing-device
handling.
The proper handling of re-appearing devices will need quite some extra
work, which is out of the scope of this small fix.
Thankfully the regular bbio submission path has already handled it well
by checking if the device->bdev is NULL before submitting.
So here we just fix the crash by checking if the device is writeable and
has a bdev pointer before calling bdev_max_discard_sectors().
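A minimal userspace sketch of the stricter check, using the dev_state=0x82 value from the assertion output above. The struct, the helper names and the bit positions are illustrative assumptions, chosen so that 0x82 decodes to IN_FS_METADATA | ITEM_FOUND as stated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bit positions picked to match the 0x82 report; treat as illustrative. */
#define DEV_STATE_WRITEABLE		(1UL << 0)
#define DEV_STATE_IN_FS_METADATA	(1UL << 1)
#define DEV_STATE_MISSING		(1UL << 2)
#define DEV_STATE_ITEM_FOUND		(1UL << 7)

struct device_model {
	unsigned long dev_state;
	void *bdev;	/* NULL while no block device is attached */
};

/* Old check: only MISSING is tested, so a re-appeared device slips through. */
bool can_trim_old(const struct device_model *dev)
{
	return !(dev->dev_state & DEV_STATE_MISSING);
}

/* Sketch of the stricter check: writeable and with a bdev pointer. */
bool can_trim_new(const struct device_model *dev)
{
	return (dev->dev_state & DEV_STATE_WRITEABLE) && dev->bdev != NULL;
}
```

A device in state 0x82 with a NULL bdev passes the old check but is skipped by the new one.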
Reported-by: Su Yue <glass.su@suse.com> Link: https://lore.kernel.org/linux-btrfs/wlwir19t.fsf@damenly.org/ Fixes: 499f377f49f0 ("btrfs: iterate over unused chunk space in FITRIM") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 1 Jun 2026 09:45:14 +0000 (10:45 +0100)]
btrfs: return real error after lookup failure in btrfs_ioctl_default_subvol()
If we fail to lookup the dir item, we are always returning -ENOENT but
that may not be the reason for the failure, as btrfs_lookup_dir_item() can
return many different errors, such as -EIO or -ENOMEM for example.
Fix this by returning the real error, and also fixup the silly error
message, including the id of the directory and the error.
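The change in error handling can be sketched in userspace as below. The err_ptr()/is_err()/ptr_err() helpers imitate the kernel's ERR_PTR convention, and the sketch assumes the lookup yields NULL for "not found" and an error pointer for other failures; the function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO_MODEL 4095

/* Userspace imitations of the kernel's error pointer helpers. */
void *err_ptr(long err) { return (void *)(intptr_t)err; }
int is_err(const void *p) { return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO_MODEL; }
long ptr_err(const void *p) { return (long)(intptr_t)p; }

/* Old behavior: every failure collapses to -ENOENT. */
long lookup_result_old(const void *di)
{
	return (is_err(di) || di == NULL) ? -ENOENT : 0;
}

/* New behavior (sketch): keep the real error, -ENOENT only for "not found". */
long lookup_result_new(const void *di)
{
	if (is_err(di))
		return ptr_err(di);
	return di == NULL ? -ENOENT : 0;
}
```

With this model, an -EIO or -ENOMEM from the lookup reaches the caller unchanged instead of being reported as -ENOENT.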
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Ben Maurer [Fri, 29 May 2026 21:23:46 +0000 (14:23 -0700)]
btrfs: use lockless read in nr_cached_objects shrinker callback
Under heavy memcg-driven slab reclaim with many memcgs and CPUs,
shrink_slab_memcg() invokes the per-superblock count callback once per
(memcg, NUMA node) tuple. For btrfs that callback reaches
percpu_counter_sum_positive() on fs_info->evictable_extent_maps, which
takes the percpu_counter's raw spinlock with IRQs disabled and walks
every online CPU. With hundreds of memcgs driving reclaim on a host with
dozens of CPUs, this counter lock becomes a global serialization point:
profiles show CPU pinned in the spin_lock_irqsave acquire under
__percpu_counter_sum, with cross-CPU IPIs hitting csd_lock_wait_toolong
while waiting for spinning vCPUs.
The shrinker count is advisory -- super_cache_count() already notes
"counts can change between super_cache_count and super_cache_scan, so we
really don't need locks here." Use percpu_counter_read_positive(), which
is lockless. Worst-case skew is bounded by batch * num_online_cpus (a
few thousand), negligible compared to the millions of extent maps a busy
filesystem accumulates and well within the noise that the shrinker
already tolerates.
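The skew bound can be illustrated with a toy counter. This is only a model of the idea (a shared count plus per-CPU deltas that are folded in once they reach a batch), with made-up names and sizes; it is not the kernel's percpu_counter implementation and has no locking.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_MODEL 4
#define BATCH_MODEL 32

/* Toy percpu counter: a shared count plus per-CPU deltas not yet folded in. */
struct pcpu_counter_model {
	int64_t count;
	int32_t delta[NR_CPUS_MODEL];
};

/* Add on one CPU; fold into the shared count once the delta reaches the batch. */
void counter_add(struct pcpu_counter_model *c, int cpu, int32_t v)
{
	c->delta[cpu] += v;
	if (c->delta[cpu] >= BATCH_MODEL || c->delta[cpu] <= -BATCH_MODEL) {
		c->count += c->delta[cpu];
		c->delta[cpu] = 0;
	}
}

/* Exact sum: walks every CPU (the real one also takes a lock to do so). */
int64_t counter_sum(const struct pcpu_counter_model *c)
{
	int64_t sum = c->count;

	for (int i = 0; i < NR_CPUS_MODEL; i++)
		sum += c->delta[i];
	return sum;
}

/* Lockless read: shared count only, clamped at zero. */
int64_t counter_read_positive(const struct pcpu_counter_model *c)
{
	return c->count > 0 ? c->count : 0;
}
```

In the model, the lockless read can lag the exact sum by less than BATCH_MODEL * NR_CPUS_MODEL, which mirrors the batch * num_online_cpus bound mentioned above.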
Tested-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Ben Maurer <bmaurer@meta.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 27 May 2026 11:16:52 +0000 (13:16 +0200)]
btrfs: use shifts for sectorsize and nodesize
Convert more multiplications of sectorsize or nodesize to use the
shifts. The remaining cases are multiplications by constants that the
compiler can optimize by itself, and in tests.
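A small sketch of the equivalence, assuming the size is a power of two and its log2 is stored next to it. struct fs_info_model and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of fs_info: the size and its log2 kept side by side. */
struct fs_info_model {
	uint32_t sectorsize;
	uint32_t sectorsize_bits;
};

/* Multiplication by a runtime value, which the compiler can not turn into a shift. */
uint64_t blocks_to_bytes_mul(const struct fs_info_model *fs, uint64_t nr)
{
	return nr * fs->sectorsize;
}

/* Same result using the precomputed shift. */
uint64_t blocks_to_bytes_shift(const struct fs_info_model *fs, uint64_t nr)
{
	return nr << fs->sectorsize_bits;
}
```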
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 26 May 2026 13:44:30 +0000 (14:44 +0100)]
btrfs: fix deadlock cloning inline extent when using flushoncommit
In commit b48c980b6a7e ("btrfs: fix deadlock between reflink and
transaction commit when using flushoncommit") a deadlock was fixed
between reflinks and transaction commits when the fs is mounted with the
flushoncommit option. This happened when we had to copy an inline extent's
data to the destination file. However the issue was fixed only for the
case where the destination offset is 0, it missed the case when the offset
is greater than zero.
Fix this by ensuring we get i_size update whenever we copied an inline
extent's data into the destination file.
Reported-by: syzbot+c7443384724bb0f9e913@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6a150a09.820a0220.e7972.0006.GAE@google.com/ Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Rik van Riel [Tue, 26 May 2026 22:37:39 +0000 (18:37 -0400)]
btrfs: allocate eb-attached btree pages as movable
Extent buffer pages allocated by alloc_extent_buffer() are attached to
btree_inode->i_mapping (the buffer_tree path), reach the LRU, and are
served by the btree_migrate_folio aops in fs/btrfs/disk-io.c. They are
migratable in practice once their owning extent buffer hits refs == 1,
which happens naturally. The buddy allocator classifies them by GFP,
however, and bare GFP_NOFS lands them in MIGRATE_UNMOVABLE pageblocks.
The result: every btree_inode page we read in pins an unmovable pageblock
from the page-superblock allocator's perspective, even though the page
itself can be moved.
Have each caller of btrfs_alloc_page_array, btrfs_alloc_folio_array,
and alloc_eb_folio_array pass in the full GFP mask directly, instead
of having the functions calculate it from boolean flags.
The alloc_extent_buffer call site passes GFP_NOFS | __GFP_NOFAIL |
__GFP_MOVABLE. All other call sites pass plain GFP_NOFS.
Three categories of caller stay on bare GFP_NOFS, deliberately:
- alloc_dummy_extent_buffer / btrfs_clone_extent_buffer: the
resulting eb is EXTENT_BUFFER_UNMAPPED, folio->mapping stays NULL,
the folios never enter LRU, never get migrate_folio aops. Tagging
them __GFP_MOVABLE would violate the page allocator's migrability
contract and they would defeat compaction in MOVABLE pageblocks
where isolate_migratepages_block skips non-LRU non-movable_ops
pages outright.
- btrfs_alloc_page_array callers in fs/btrfs/raid56.c (stripe
pages), fs/btrfs/inode.c (encoded reads), fs/btrfs/ioctl.c (io_uring
encoded reads), fs/btrfs/relocation.c (relocation buffers): same
contract violation. raid56 stripe_pages additionally persist in
the stripe cache (RBIO_CACHE_SIZE=1024) well beyond a single I/O,
so they are not transient enough to hand-wave the contract.
- btrfs_alloc_folio_array caller in fs/btrfs/scrub.c (stripe
folios): same -- stripe->folios[] are private buffers freed via
folio_put in release_scrub_stripe.
This change targets the dominant fragmentation source observed on the
page-superblock series: ~28 GB of btree_inode pages parked across
many tainted superpageblocks on a 250 GB test system with btrfs root,
preventing 1 GiB hugepage allocation from those regions. With the
movable hint, those pages now land in MOVABLE pageblocks where the
existing background defragger drains them through the standard
PB_has_movable gate, no LRU-sample fallback needed.
Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: David Sterba <dsterba@suse.com>
Daan De Meyer [Thu, 21 May 2026 07:51:13 +0000 (07:51 +0000)]
btrfs: add 32-bit compat ioctl for BTRFS_IOC_GET_SUBVOL_INFO
On 64-bit kernels with 32-bit userspace, struct btrfs_ioctl_timespec is
laid out as 16 bytes (8B sec + 4B nsec + 4B trailing padding) instead of
the 12 bytes a 32-bit userspace expects, because the surrounding struct
is not packed. As a result, struct btrfs_ioctl_get_subvol_info_args has
a different size and layout in 32-bit userspace than in the 64-bit
kernel, and BTRFS_IOC_GET_SUBVOL_INFO returns garbage to 32-bit callers.
Mirror what was done for BTRFS_IOC_SET_RECEIVED_SUBVOL: add a packed
btrfs_ioctl_get_subvol_info_args_32 with btrfs_ioctl_timespec_32 fields,
define BTRFS_IOC_GET_SUBVOL_INFO_32 with that struct as the size
argument, factor the existing handler into a shared _btrfs_ioctl_get_
subvol_info() helper, and add btrfs_ioctl_get_subvol_info_32() which
fills the kernel struct and translates field-by-field into the 32-bit
struct before copy_to_user().
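The layout difference can be reproduced in userspace. The two structs below are illustrative models with made-up names, not the UAPI definitions, and the packed attribute is the GCC/Clang spelling.

```c
#include <assert.h>
#include <stdint.h>

/* Natural layout: padded up to the alignment of the 64-bit member. */
struct timespec_native_model {
	uint64_t sec;
	uint32_t nsec;
};

/* Packed layout, without trailing padding (GCC/Clang attribute). */
struct timespec_32_model {
	uint64_t sec;
	uint32_t nsec;
} __attribute__((packed));
```

On an ABI where uint64_t is 8-byte aligned, sizeof() of the first struct is 16 while the packed one is 12, which is the mismatch the compat ioctl has to translate.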
Signed-off-by: Daan De Meyer <daan@amutable.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The f_fsid was originally derived from fs_devices->fsid and the
subvolume root ID. However, when temp_fsid is active, fs_devices->fsid
is randomized, making the standard derivation inconsistent.
Since metadata_uuid is optional, it is not a reliable alternative. This
patch instead retrieves the on-disk UUID from fs_info->super_copy->fsid.
To prevent f_fsid collisions between original and cloned filesystems,
this implementation hashes the dev_t for single-device btrfs filesystems
to ensure uniqueness. This is limited to single-device filesystems as
cloned mounts are currently only supported for that configuration. Note
that f_fsid will change if the device is replaced.
Additionally, since the kernel cannot distinguish between the original
and the cloned filesystem, this new f_fsid derivation is applied to
both.
[MINOR PROBLEM]
When mounting a filesystem with a valid DEV_STATS item, we will always
update the DEV_STATS again in the next transaction commit, even if there
is no change to the values.
[CAUSE]
During the mount, btrfs_device_init_dev_stats() will read out the
on-disk DEV_STATS item for each device.
Then it calls btrfs_dev_stat_set() to update the in-memory structure.
However btrfs_dev_stat_set() does not only set the dev stats value, but
also increases device->dev_stats_ccnt.
That member determines if we should update the device item at the next
transaction commit. Since we have called btrfs_dev_stat_set() for each
dev stats member, dev_stats_ccnt will be non-zero and we will update
the dev stats item even if it doesn't change at all.
[FIX]
Instead of using btrfs_dev_stat_set() for valid on-disk DEV_STATS
values, directly call atomic_set() to set the in-memory values.
For other call sites, we still want to use btrfs_dev_stat_set() so that
we will force updating/creating the dev stats item.
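The effect on the change counter can be modeled in userspace with C11 atomics. The struct and function names are hypothetical; stat_set() only mirrors the behavior described above (set the value and bump the counter).

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_STATS_MODEL 5

/* Toy device: stat values plus the change counter that triggers a write. */
struct dev_model {
	atomic_int stats[NR_STATS_MODEL];
	atomic_int stats_ccnt;
};

/* Mirrors the described setter: set the value and bump the counter. */
void stat_set(struct dev_model *dev, int idx, int val)
{
	atomic_store(&dev->stats[idx], val);
	atomic_fetch_add(&dev->stats_ccnt, 1);
}

/* Sketch of the fix: load on-disk values without touching the counter. */
void stat_init_from_disk(struct dev_model *dev, int idx, int val)
{
	atomic_store(&dev->stats[idx], val);
}

/* The on-disk item is rewritten only when the counter is non-zero. */
int needs_item_update(struct dev_model *dev)
{
	return atomic_load(&dev->stats_ccnt) != 0;
}
```

Loading all values through stat_init_from_disk() leaves the counter at zero, so the model skips the redundant update at the next commit.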
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: always update/create the dev stats item when adding a new device
[MINOR PROBLEM]
When adding a new btrfs device, the corresponding DEV_STATS item creation
can only be triggered by a mount cycle if no other error is
triggered:
[CAUSE]
Btrfs only updates the DEV_STATS item when the device->dev_stats_ccnt
counter is not 0.
This is to reduce COW for the device tree. However that dev_stats_ccnt is
only increased at the following call sites:
- btrfs_dev_stat_inc()
This happens when some IO error happened.
- btrfs_dev_stat_read_and_reset()
This happens for GET_DEV_STATS ioctl with BTRFS_DEV_STATS_RESET flag.
- btrfs_dev_stat_set()
This happens inside btrfs_device_init_dev_stats().
So when a new device is added, its dev_stats_ccnt is just initialized to
0, and btrfs won't create nor update the corresponding DEV_STATS item at
all.
[ENHANCEMENT]
When a new device is added, also increase the dev_stats_ccnt by one.
This includes both device add ioctl and dev-replace.
This will force btrfs to create a new DEV_STATS item or update the
existing one with the correct values.
This not only creates the DEV_STATS item early, but also prevents
an old DEV_STATS item left over from older kernels from causing false
alerts for the newly added device.
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: remove the dev stats item when removing a device
[MINOR BUG]
The following script will cause DEV_STATS item to be left after the
corresponding device is removed:
# mkfs.btrfs -f $dev1
# mount $dev1 $mnt
# btrfs dev add $dev2 $mnt
# umount $mnt
## Without real errors, only at mount time btrfs will update
## dev->dev_stats_ccnt, thus we need a mount cycle to create the
## DEV_STATS item for the new device.
# mount $dev1 $mnt
# touch $mnt/foobar
# sync
# btrfs dev remove $dev2 $mnt
# umount $mnt
This will leave the DEV_STATS item for devid 2 in the device
tree:
This is not a huge problem, but if the existing DEV_STATS contains
errors, and a new device is added into the fs taking the old devid, then
after a mount cycle, the new device will suddenly inherit old errors
which can give false alerts.
[CAUSE]
Btrfs has never had the ability to delete DEV_STATS items.
It either creates a new one through update_dev_stat_item(), or reads an
existing one through btrfs_device_init_dev_stats().
However update_dev_stat_item() is only called lazily: if a new device is
created and there is no new update to dev stats, then it will skip the
update of the on-disk item.
So if the old DEV_STATS item exists and a new device is added, and no
errors during the remaining operations, the old DEV_STATS will not be
updated.
Then at the next mount cycle, btrfs_device_init_dev_stats() is called at
mount time, which will read out the old records, causing false alerts to
the newly added device.
[FIX]
Manually remove the DEV_STATS item during btrfs_rm_device().
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: remove the dev stats item for replace target device
[MINOR PROBLEM]
When a running dev-replace hits some error for the target device (devid
0), there will be a DEV_STATS with error records created at the next
transaction commit.
Unfortunately that item will never be deleted.
This means at the next dev-replace, if the replace is interrupted, then
at the next mount, the target device will suddenly inherit the old error
records from that DEV_STATS item, which can give some false alerts on
that device.
This shouldn't affect end users that much, as it requires all the
following conditions to be met, which is pretty rare:
- The initial dev-replace hits some error on the target device
E.g. write errors, but those errors themselves are already a big problem
for a running replace.
This is required to create the DEV_STATS item in the first place.
- The next replace is interrupted
This is required to allow btrfs to read from the old records.
[CAUSE]
Btrfs just never deletes the DEV_STATS after a replace is finished.
[FIX]
Remove the DEV_STATS item for devid 0 after the replace is finished.
This is not going to completely fix the error, as we still have other
error paths, e.g. the fs somehow flips RO and can not start a new
transaction for the DEV_STATS item removal.
But those corner cases will be addressed by later patches which provide
a more generic fix to DEV_STATS related problems.
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Teng Liu [Wed, 13 May 2026 11:35:44 +0000 (13:35 +0200)]
btrfs: validate data reloc tree file extent item members
get_new_location() uses BUG_ON() to crash the kernel if the file extent
item it looks up has any of offset, compression, encryption, or
other_encoding set non-zero. The data reloc inode is only written by
relocation's own paths and the four fields are always 0 in what the
kernel writes:
- insert_prealloc_file_extent() memsets the stack item to zero and
only fills in type, disk_bytenr, disk_num_bytes and num_bytes, so
offset/compression/encryption/other_encoding stay 0.
- insert_ordered_extent_file_extent() copies oe->compress_type into
the file extent's compression field, but the data reloc inode is
created with BTRFS_INODE_NOCOMPRESS so compress_type is always 0;
encryption and other_encoding are reserved-and-zero in btrfs.
A non-zero value here means the leaf decoded from disk does not match
what the kernel wrote, i.e. on-disk corruption. A malformed image
reaches this code via balance and panics the kernel.
A previous attempt to enforce all four constraints in tree-checker's
check_extent_data_item() was merged as commit 7d0ee95979e9 ("btrfs:
validate data reloc tree file extent item members in tree-checker")
and then reverted by commit 1c034697fcaa after btrfs/061 produced
false positives on arm64 with 64K pages. The reason: relocation
writeback legitimately produces REG file_extent_items with offset != 0
in the data reloc tree. When an ordered extent covers only the back
portion of an underlying PREALLOC (num_bytes < ram_bytes on the input
file_extent), insert_ordered_extent_file_extent() inserts a REG with
offset = oe->offset
num_bytes = oe->num_bytes
ram_bytes preserved from the original PREALLOC,
and this item can reach disk if a transaction commit fires while it
is present in the leaf.
The four fields belong in different layers:
- compression, encryption and other_encoding are universal
invariants for every item in the data reloc tree, regardless of
cluster geometry. Enforce them in tree-checker's
check_extent_data_item() so a corrupt leaf is rejected at read
time.
- offset is only an invariant at the cluster-boundary keys that
get_new_location() searches (the key is computed as
src_disk_bytenr - reloc_block_group_start). Partial-PREALLOC
writebacks legitimately place REG items at non-boundary keys with
offset != 0; tree-checker cannot reject these. The cluster-
boundary item is always written by either
insert_prealloc_file_extent() (offset=0 by memset) or by the
front portion of a partial writeback (offset=0 by construction),
so a non-zero offset there is corruption.
Enforce the universal invariants in check_extent_data_item() with a
file_extent_err() rejection. Convert the BUG_ON() in
get_new_location() to a -EUCLEAN return paired with btrfs_print_leaf()
and btrfs_err() so the offending leaf is logged. The caller in
replace_file_extents() already handles non-zero returns from
get_new_location() by breaking out of the loop without aborting the
transaction.
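The BUG_ON() conversion can be sketched as below. This is a userspace model: struct file_extent_model and get_new_location_model() are made-up names, only the offset check is shown, and the logging of the leaf is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, for platforms whose errno.h lacks it */
#endif

/* Only the fields needed for the cluster-boundary check; a toy stand-in. */
struct file_extent_model {
	uint64_t offset;
	uint64_t disk_bytenr;
};

/*
 * Sketch of the converted check: instead of BUG_ON(offset != 0), report
 * corruption to the caller and leave *new_bytenr untouched.
 */
int get_new_location_model(const struct file_extent_model *fi,
			   uint64_t *new_bytenr)
{
	if (fi->offset != 0)
		return -EUCLEAN;
	*new_bytenr = fi->disk_bytenr;
	return 0;
}
```

A caller that already breaks out of its loop on a non-zero return needs no further change to survive a corrupted item.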
Suggested-by: Qu Wenruo <wqu@suse.com> Suggested-by: David Sterba <dsterba@suse.com> Reported-by: syzbot+3e20d8f3d41bac5dc9a2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=3e20d8f3d41bac5dc9a2 Signed-off-by: Teng Liu <27rabbitlt@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: annotate lockless read of defrag_bytes in should_nocow()
should_nocow() reads inode->defrag_bytes without holding inode->lock,
while btrfs_set_delalloc_extent() and btrfs_clear_delalloc_extent()
update it under that spinlock.
This is a data race. The read is a quick check used to decide whether
to fall back to COW for a NOCOW inode: if defrag_bytes is non-zero and
the range is tagged EXTENT_DEFRAG, we force COW so that defragmentation
can rewrite the extent. Reading a stale value is harmless because:
- A missed increment may skip COW once, but the defrag pass will
redo the extent later.
- A stale non-zero may force an unnecessary COW, which is a minor
efficiency loss, not a correctness issue.
On 64-bit platforms an aligned u64 load is naturally atomic so tearing
cannot happen. On 32-bit platforms u64 may tear, but we only test for
zero vs non-zero, so the heuristic stays correct regardless. Use
data_race() annotation.
Fixes: 47059d930f0e ("Btrfs: make defragment work with nodatacow option") Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
[ Use data_race() instead of READ_ONCE() ] Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: zoned: always set max_active_zones for zoned devices
When a block device does not report a maximum number of open or active
zones, we currently assign BTRFS_DEFAULT_MAX_ACTIVE_ZONES (128) to
the internal limit, if the device has more than
BTRFS_DEFAULT_MAX_ACTIVE_ZONES zones.
But if the device has fewer than BTRFS_DEFAULT_MAX_ACTIVE_ZONES zones, the
internal max_active_zones limit will stay at 0, even if the device has
zone resource limits. Furthermore, if the device has a total number of
zones that is less than BTRFS_DEFAULT_MAX_ACTIVE_ZONES, max_active_zones
should be set to at most the number of zones.
Also move the max_active_zones calculation and setting into a dedicated
helper, to shrink btrfs_get_dev_zone_info().
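One possible reading of the intended limit is sketched below. pick_max_active_zones() is a hypothetical helper and the clamping order is an assumption; the real helper may take more inputs into account.

```c
#include <assert.h>

#define DEFAULT_MAX_ACTIVE_ZONES_MODEL 128u

/*
 * @reported: limit reported by the device, 0 if it reports none.
 * @nr_zones: total number of zones on the device.
 * Assumed policy: use the reported limit if there is one, else the default,
 * and never exceed the number of zones.
 */
unsigned int pick_max_active_zones(unsigned int reported, unsigned int nr_zones)
{
	unsigned int max = reported ? reported : DEFAULT_MAX_ACTIVE_ZONES_MODEL;

	return max < nr_zones ? max : nr_zones;
}
```

Under this policy a device with 64 zones and no reported limit ends up with a limit of 64 rather than 0.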
Fixes: 04147d8394e8 ("btrfs: zoned: limit active zones to max_open_zones") Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
Revert "btrfs: fix the file offset calculation inside btrfs_decompress_buf2page()"
It seems that af566bdaff54 was tested against a tree which did not
contain commit 12851bd921d4 ("fs: Turn page_offset() into a wrapper
around folio_pos()"). Unfortunately it has a bug of its own; on 32-bit
systems, shifting by PAGE_SHIFT will overflow on files larger than 4GiB.
Since page_offset() is now fixed, just revert af566bdaff54.
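The 32-bit overflow can be shown with a small userspace model, where uint32_t stands in for a 32-bit unsigned long page index. The function names and PAGE_SHIFT_MODEL (4K pages) are assumptions for the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_MODEL 12

/* Models a 32-bit page index shifted without widening first. */
uint32_t offset_32bit_shift(uint32_t index)
{
	return index << PAGE_SHIFT_MODEL;
}

/* Widen to 64 bits before shifting, as a file offset computation needs. */
uint64_t offset_widened(uint32_t index)
{
	return (uint64_t)index << PAGE_SHIFT_MODEL;
}
```

For the page at the 4GiB boundary (index 1 << 20), the unwidened shift wraps to 0 while the widened one gives the expected 4GiB offset.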
Fixes: af566bdaff54 (btrfs: fix the file offset calculation inside btrfs_decompress_buf2page()) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Tested-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: zoned: fix deadlock waiting for ticket during data relocation
When performing data relocation on a zoned filesystem, BTRFS can deadlock
in handle_reserve_tickets(). The relocation process is waiting on a space
reservation ticket that can never be fulfilled, because the relocation
itself is the operation responsible for freeing up that space.
Fix this by introducing a new flush state,
BTRFS_RESERVE_FLUSH_ZONED_RELOCATION, specifically for data chunk
allocation during zoned relocation. Like
BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE, this state uses
priority_reclaim_data_space() instead of the normal flushing path, which
avoids re-entering the relocation code and breaks the deadlock cycle.
In btrfs_alloc_data_chunk_ondemand(), select this new flush state when the
inode belongs to a data relocation root on a zoned filesystem.
Fixes: e2a7fd22378f ("btrfs: zoned: add zone reclaim flush state for DATA space_info") Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
When searching for a data relocation block-group on mount,
btrfs_zoned_reserve_data_reloc_bg() is looking for the first empty DATA
block-group. But it first checks if the block-group is empty and if yes
continues the search, and then checks if it is the first DATA block-group.
There is actually no point in looking for the second empty DATA block
group as new DATA allocations will just allocate a new chunk for it. Pick
the first DATA block-group without any allocations done and set it as
relocation block-group.
At first, commit 694ce5e143d6 ("btrfs: zoned: reserve data_reloc
block group on mount") introduced the functionality. At that time, we
took the second unused (used == 0) block group, as the first one might be a
block group used for normal data. Later, commit daa0fde32235 ("btrfs:
zoned: fix data relocation block group reservation") switched to looking
for an empty block group (alloc_offset == 0). At this point, there is no
reason to take the second one anymore. So, this commit fixes an issue
in commit daa0fde32235.
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
Add device tree compatible string for AST2700 based boards
("aspeed,ast2700-evb" and "aspeed,ast2700") to the Aspeed SoC
board bindings. This allows proper schema validation and
enables support for AST2700 platforms.