sockptr: fix usize check in copy_struct_from_sockptr() for user pointers
copy_struct_from_user will never hit the check_zeroed_user() call
and will never return -E2BIG if new userspace passed new bits in a
larger structure than the current kernel structure.
As far as I can there are no critical/related uapi changes in
- include/net/bluetooth/bluetooth.h and net/bluetooth/sco.c
after the use of copy_struct_from_sockptr in v6.13-rc3
- include/uapi/linux/tcp.h and net/ipv4/tcp_ao.c
after the use of copy_struct_from_sockptr in v6.6-rc1
So that new callers will get the correct behavior from the start.
Fixes: 4954f17ddefc ("net/tcp: Introduce TCP_AO setsockopt()s") Fixes: ef84703a911f ("net/tcp: Add TCP-AO getsockopt()s") Fixes: faadfaba5e01 ("net/tcp: Add TCP_AO_REPAIR") Fixes: 3e643e4efa1e ("Bluetooth: Improve setsockopt() handling of malformed user input") Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Dmitry Safonov <dima@arista.com> Cc: Francesco Ruggeri <fruggeri@arista.com> Cc: Salam Noureddine <noureddine@arista.com> Cc: David Ahern <dsahern@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Michal Luczaj <mhal@rbox.co> Cc: David Wei <dw@davidwei.uk> Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Xin Long <lucien.xin@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Kuniyuki Iwashima <kuniyu@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Simon Horman <horms@kernel.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Christian Brauner <brauner@kernel.org> CC: Kees Cook <keescook@chromium.org> Cc: netdev@vger.kernel.org Cc: linux-bluetooth@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Link: https://patch.msgid.link/cfaedbc33ae9d36adaabf04fa79424f30ff1efdd.1775576651.git.metze@samba.org Reviewed-by: Aleksa Sarai <aleksa@amutable.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
uaccess: fix ignored_trailing logic in copy_struct_to_user()
Currently all callers pass ignored_trailing=NULL, but I have
code that will make use of.
Now it actually behaves like documented:
* If @usize < @ksize, then the kernel is trying to pass userspace a newer
struct than it supports. Thus we only copy the interoperable portions
(@usize) and ignore the rest (but @ignored_trailing is set to %true if
any of the trailing (@ksize - @usize) bytes are non-zero).
Fixes: 424a55a4a908 ("uaccess: add copy_struct_to_user helper") Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Dmitry Safonov <dima@arista.com> Cc: Francesco Ruggeri <fruggeri@arista.com> Cc: Salam Noureddine <noureddine@arista.com> Cc: David Ahern <dsahern@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Michal Luczaj <mhal@rbox.co> Cc: David Wei <dw@davidwei.uk> Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Xin Long <lucien.xin@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Kuniyuki Iwashima <kuniyu@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Simon Horman <horms@kernel.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Christian Brauner <brauner@kernel.org> CC: Kees Cook <keescook@chromium.org> Cc: netdev@vger.kernel.org Cc: linux-bluetooth@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Link: https://patch.msgid.link/71f69442410c1186ed8ce6d5b4b9d4a5a70edbad.1775576651.git.metze@samba.org Reviewed-by: Aleksa Sarai <aleksa@amutable.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
Rosen Penev [Sat, 9 May 2026 00:34:38 +0000 (17:34 -0700)]
gpio: zevio: allow COMPILE_TEST builds
The ZEVIO GPIO driver uses generic platform, MMIO, and gpiolib interfaces.
Allow it to build with COMPILE_TEST so it gets coverage on non-ARM
platforms.
Drop the ARM-specific IOMEM() casts around the register pointer. The
pointer is already __iomem, so readl() and writel() can use it directly.
Tested with:
make LLVM=1 ARCH=loongarch drivers/gpio/gpio-zevio.o
fprobe: Fix unregister_fprobe() to wait for RCU grace period
Commit 4346ba1604093 ("fprobe: Rewrite fprobe on function-graph tracer")
changed fprobe to register struct fprobe to an rcu-hlist, but it forgot
to wait for RCU GP. Thus there can be use-after-free if the fprobe is
released right after unregistering. This can be happened on fprobe
event and sample module code.
To fix this issue, add synchronize_rcu() in unregister_fprobe().
Note that BPF is OK because fprobe is used as a part of
bpf_kprobe_multi_link. This unregisters its fprobe in
bpf_kprobe_multi_link_release() and it is deallocated via
bpf_kprobe_multi_link_dealloc(), which is invoked from
bpf_link_defer_dealloc_rcu_gp() RCU callback.
For BPF, this also introduced unregister_fprobe_async() which does
NOT wait for RCU grace priod.
thunderbolt: property: Cap recursion depth in __tb_property_parse_dir()
A DIRECTORY entry's value field is used as the dir_offset for a
recursive call into __tb_property_parse_dir() with no depth counter.
A crafted peer that chains DIRECTORY entries into a back-reference
loop drives the parser until the kernel stack is exhausted and the
guard page fires. Any untrusted XDomain peer (cable, dock, in-line
inspector, adjacent host) that reaches the PROPERTIES_REQUEST
control-plane exchange can trigger this without authentication.
Thread a depth counter through tb_property_parse() and
__tb_property_parse_dir(), and reject blocks that exceed
TB_PROPERTY_MAX_DEPTH = 8. That is comfortably larger than any
observed legitimate XDomain layout.
Operators who do not need XDomain host-to-host discovery can disable
the path entirely with thunderbolt.xdomain=0 on the kernel command
line.
Fixes: cdae7c07e3e3 ("thunderbolt: Add support for XDomain properties") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4-6 Assisted-by: Codex:gpt-5-4 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
thunderbolt: property: Reject dir_len < 4 to prevent size_t underflow
On the non-root path, __tb_property_parse_dir() takes dir_len from
entry->length (u16 widened to size_t). Two distinct OOB conditions
follow when entry->length < 4:
1. The non-root path begins with kmemdup(&block[dir_offset],
sizeof(*dir->uuid), ...) which always reads 4 dwords from
dir_offset. tb_property_entry_valid() only enforces
dir_offset + entry->length <= block_len, so a crafted entry
with dir_offset close to the end of the property block and
entry->length in 0..3 passes that gate but lets the UUID copy
run off the block (e.g. dir_offset = 497, dir_len = 3 in a
500-dword block reads block[497..501]).
2. After the kmemdup, content_len = dir_len - 4 underflows size_t
to ~SIZE_MAX, nentries becomes SIZE_MAX / 4, and the entry
walk runs OOB on each iteration until an entry fails
validation or the kernel oopses on an unmapped page.
Reject dir_len < 4 on the non-root path *before* the UUID kmemdup,
which closes both holes.
Also move INIT_LIST_HEAD(&dir->properties) up to immediately after
the dir allocation so the new error-return path (and the existing
uuid-alloc failure path) calling tb_property_free_dir() sees a
walkable list rather than the zero-initialized NULL next/prev that
list_for_each_entry_safe() would oops on.
Fixes: cdae7c07e3e3 ("thunderbolt: Add support for XDomain properties") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4-6 Assisted-by: Codex:gpt-5-4 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
thunderbolt: property: Reject u32 wrap in tb_property_entry_valid()
entry->value is u32 and entry->length is u16; the sum is performed in
u32 and wraps. A malicious XDomain peer can pick
value = 0xffffff00, length = 0x100 so the sum 0x100000000 wraps to 0
and passes the > block_len check. tb_property_parse() then passes
entry->value to parse_dwdata() as a dword offset into the property
block, reading attacker-directed memory far past the allocation.
For TEXT-typed entries with the "deviceid" or "vendorid" keys this
lands in xd->device_name / xd->vendor_name and is readable back via
the per-XDomain device_name / vendor_name sysfs attributes; the leak
is NUL-bounded (kstrdup() stops at the first zero byte) and
untargeted (the attacker picks a delta, not an absolute address).
DATA-typed entries are parsed into property->value.data but not
generically surfaced to userspace.
Use check_add_overflow() so a wrapped sum is rejected.
Fixes: cdae7c07e3e3 ("thunderbolt: Add support for XDomain properties") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4-6 Assisted-by: Codex:gpt-5-4 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Matt Bobrowski [Thu, 30 Apr 2026 07:38:36 +0000 (07:38 +0000)]
bpf: fix crash in bpf_[set|remove]_dentry_xattr for negative dentries
bpf_set_dentry_xattr and bpf_remove_dentry_xattr BPF kfuncs attempt to
lock the inode of the supplied dentry without checking if it is
NULL. If a negative dentry is passed (e.g. from
security_inode_create), d_inode(dentry) returns NULL, and
inode_lock(inode) will cause a NULL pointer dereference.
Trivially fix this by adding a NULL check for inode before attempting
to lock it, returning -EINVAL if it is NULL.
Additionally, drop WARN_ON(!inode) in bpf_xattr_read_permission() and
bpf_xattr_write_permission(). These warnings could be triggered by
passing a negative dentry to bpf_get_dentry_xattr() or the _locked
variants of the xattr kfuncs, potentially causing a Denial of Service
on systems with panic_on_warn enabled. Instead, simply return -EINVAL.
Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn> Closes: https://lore.kernel.org/bpf/1587cbf4-1293-4e25-ad24-c970836a1686@std.uestc.edu.cn/ Fixes: 56467292794b ("bpf: fs/xattr: Add BPF kfuncs to set and remove xattrs") Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Link: https://patch.msgid.link/20260430073836.2894001-1-mattbobrowski@google.com Signed-off-by: Christian Brauner <brauner@kernel.org>
Merge patch series "cleanup block-style layouts exports"
Chuck Lever <cel@kernel.org> says:
This series cleanups the exportfs support for block-style layouts that
provide direct block device access. This is preparation for supporting
exportfs of more than a single device per file system.
* patches from https://patch.msgid.link/20260423181854.743150-1-cel@kernel.org:
exportfs,nfsd: rework checking for layout-based block device access support
exportfs: don't pass struct iattr to ->commit_blocks
exportfs: split out the ops for layout-based block device access
nfsd/blocklayout: always ignore loca_time_modify
exportfs,nfsd: rework checking for layout-based block device access support
Currently NFSD hard codes checking support for block-style layouts.
Lift the checks into a file system-helper and provide a exportfs-level
helper to implement the typical checks.
This prepares for supporting block layout export of multiple devices
per file system.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260423181854.743150-5-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
exportfs: don't pass struct iattr to ->commit_blocks
The only thing ->commit_blocks really needs is the new size, with a magic
-1 placeholder 0 for "do not change the size" because it only ever
extends the size.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260423181854.743150-4-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
exportfs: split out the ops for layout-based block device access
The support to grant layouts for direct block device access works
at a very different layer than the rest of exports. Split the methods
for it into a separate struct, and move that into a separate header
to better split things out. The pointer to the new operation vector
is kept in export_operations to avoid bloating the super_block.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260423181854.743150-3-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
RFC 8881 Section 18.42 makes it clear that the client provided timestamp
is a "may" condition, and clients that want to force a specific timestamp
should send a separate SETATTR in the compound.
Since commit b82f92d5dd1a ("fs: have setattr_copy handle multigrain
timestamps appropriately") the ia_mtime value is ignored by file
systems using multi-grain timestamps like XFS, which is the only
file system supporting blocklayout exports right now, so make that
explicit in NFSD as well.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260423181854.743150-2-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
selftests/pid_namespace: compute pid_max test limits dynamically
The pid_max kselftest hardcodes pid_max values of 400 and 500, but the
kernel enforces a minimum of PIDS_PER_CPU_MIN * num_possible_cpus().
On machines with many possible CPUs (e.g. nr_cpu_ids=128 yields a
minimum of 1024), writing 400 or 500 to /proc/sys/kernel/pid_max
returns EINVAL and all three tests fail.
Compute these limits the same way as the kernel does and set outer_limit
and inner_limit dynamically based on the result. Original test semantics
are preserved (outer < inner, nested namespace capped by parent).
Signed-off-by: Bjoern Doebel <doebel@amazon.com> Link: https://patch.msgid.link/20260422201151.3830506-1-doebel@amazon.com Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Assisted-by: Kiro:claude-opus-4.6 Signed-off-by: Christian Brauner <brauner@kernel.org>
Andrea Righi [Mon, 11 May 2026 08:31:30 +0000 (10:31 +0200)]
sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths
scx_alloc_and_add_sched() can fail after @sch has been assigned to
ops->priv. In those cases @sch is torn down (either via kfree() through
the err_free_* chain or via kobject_put() -> scx_kobj_release() -> RCU
work), but @ops->priv is left pointing at the about-to-be-freed pointer.
With the recent -EBUSY gate in scx_root_enable_workfn() and
scx_sub_enable_workfn() that rejects an attach when @ops->priv is still
non-NULL, see commit bbf30b383cf6 ("sched_ext: Fix ops->priv clobber on
concurrent attach/detach"), a dangling @ops->priv permanently locks the
kdata out: every future attach attempt sees a stale binding and returns
-EBUSY even though no scheduler is actually attached.
Clear @ops->priv on the post-assign failure paths so that the kdata
returns to its pre-attach state when the function returns ERR_PTR().
ceph: add ceph_has_realms_with_quotas() check to ceph_quota_update_statfs()
When MDS rejects a session, remove_session_caps() ->
__ceph_remove_cap() -> ceph_change_snap_realm() clears
i_snap_realm for every inode that loses its last cap.
The realm is restored once caps are re-granted after
reconnect. It is not a real error and this patch changes
pr_err_ratelimited_client() on doutc().
Every quota methods ceph_quota_is_max_files_exceeded(),
ceph_quota_is_max_bytes_exceeded(),
ceph_quota_is_max_bytes_approaching() calls
ceph_has_realms_with_quotas() check. This patch adds
the missing ceph_has_realms_with_quotas() call into
ceph_quota_update_statfs().
[ idryomov: add braces around both arms of multiline ifs ]
libceph: Fix potential out-of-bounds access in __ceph_x_decrypt()
In __ceph_x_decrypt(), a part of the buffer p is interpreted as a
ceph_x_encrypt_header, and the magic field of this struct is accessed.
This happens without any guarantee that the buffer is large enough to
hold this struct. The function parameter ciphertext_len represents the
length of the ciphertext to decrypt and is guaranteed to be at most the
remaining size of the allocated buffer p. However, this value is not
necessarily greater than sizeof(ceph_x_encrypt_header). E.g., a message
frame of type FRAME_TAG_AUTH_REPLY_MORE, that is just as long to hold
the ciphertext at its end with a ciphertext_len of 8 or less, can
trigger an out-of-bounds memory access when accessing hdr->magic.
This patch fixes the issue by adding a check to ensure that the
decrypted plaintext in the buffer is large enough to represent at least
the ceph_x_encrypt_header.
Commit d93231a6bc8a ("ceph: prevent a client from exceeding the MDS
maximum xattr size") moved the required_blob_size computation to before
the __build_xattrs() call, introducing a race.
__build_xattrs() releases and reacquires i_ceph_lock during execution.
In that window, handle_cap_grant() may update i_xattrs.blob with a
newer MDS-provided blob and bump i_xattrs.version. When
__build_xattrs() detects that index_version < version, it destroys and
rebuilds the entire xattr rb-tree from the new blob, potentially
increasing count, names_size, and vals_size.
The prealloc_blob size check that follows still uses the stale
required_blob_size computed before the rebuild, so it passes even when
prealloc_blob is too small for the now-larger tree. After __set_xattr()
adds one more xattr on top, __ceph_build_xattrs_blob() is called from
the cap flush path and hits:
Fix this by recomputing required_blob_size after __build_xattrs()
returns, using the current tree state. Also re-validate against
m_max_xattr_size to fall back to the sync path if the rebuilt tree now
exceeds the MDS limit.
Cc: stable@vger.kernel.org Fixes: d93231a6bc8a ("ceph: prevent a client from exceeding the MDS maximum xattr size") Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The old_blob in __ceph_setxattr() can store
ci->i_xattrs.prealloc_blob value during the retry.
However, it is never called the ceph_buffer_put()
for the old_blob object. This patch fixes the issue of
the buffer leak.
libceph: Fix unnecessarily high ceph_decode_need() for uniform bucket
In crush_decode_uniform_bucket(), the item_weight field of the bucket
is set. This is a single field of type u32 since the uniform bucket uses
the same weight for all items. The value in ceph_decode_need() is set to
(1+b->h.size) * sizeof(u32), which is higher than actually needed.
This patch removes the call to ceph_decode_need() with the unnecessarily
high value and switches the subsequent operation from ceph_decode_32()
to ceph_decode_32_safe(), which already includes the correct bounds
check.
libceph: Fix potential out-of-bounds access in crush_decode()
A message of type CEPH_MSG_OSD_MAP containing a crush map with at least
one bucket has two fields holding the bucket algorithm. If the values
in these two fields differ, an out-of-bounds access can occur. This is
the case because the first algorithm field (alg) is used to allocate
the correct amount of memory for a bucket of this type, while the second
algorithm field inside the bucket (b->alg) is used in the subsequent
processing.
This patch fixes the issue by adding a check that compares alg and
b->alg and aborts the processing in case they differ. Furthermore,
b->alg is set to 0 in this case, because the destruction of the crush
map also uses this field to determine the bucket type, which can again
result in an out-of-bounds access when trying to free the memory pointed
to by the fields of the bucket. To correctly free the memory allocated
for the bucket in such a case, the corresponding call to kfree is moved
from the algorithm-specific crush_destroy_bucket functions to the
generic crush_destroy_bucket().
Zhenzhong Duan [Sat, 9 May 2026 02:43:46 +0000 (10:43 +0800)]
iommu/vt-d: Avoid NULL pointer dereference or refcount corruption
Commit 60f030f7418d ("iommu/vt-d: Avoid use of NULL after WARN_ON_ONCE")
fixed a NULL pointer dereference in an unlikely situation partly.
If dev_pasid is not found in the dev_pasids list, it remains NULL.
However, the teardown operations are executed unconditionally, this lead
to a NULL pointer dereference or refcount corruption.
If the domain was never attached to this IOMMU, info will be NULL, which
would cause an immediate dereference when checking --info->refcnt.
Even if info is not NULL, decrementing the refcount without having removed
a valid PASID might unbalance the count. This could lead to premature
dropping of the refcount to 0, potentially causing a use-after-free for the
remaining active devices sharing the domain.
Fix it by returning early if dev_pasid is NULL, before executing the
teardown operations.
Issue found by AI review and suggested by Kevin Tian.
https://sashiko.dev/#/patchset/20260421031347.1408890-1-zhenzhong.duan%40intel.com
Fixes: 60f030f7418d ("iommu/vt-d: Avoid use of NULL after WARN_ON_ONCE") Cc: stable@vger.kernel.org Suggested-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20260422033538.95000-1-zhenzhong.duan@intel.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The global static blocked domain is a dummy domain without corresponding
dmar_domain structure, accessing beyond iommu_domain structure triggers
oops easily. Fix it by return early in domain_remove_dev_pasid() like
identity domain.
Fixes: 7d0c9da6c150 ("iommu/vt-d: Add set_dev_pasid callback for dma domain") Cc: stable@vger.kernel.org Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20260421031347.1408890-1-zhenzhong.duan@intel.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Naval Alcalá [Sat, 9 May 2026 02:43:44 +0000 (10:43 +0800)]
iommu/vt-d: Disable DMAR for Intel Q35 IGFX
Intel Q35 integrated graphics (8086:29b2) exhibits broken DMAR
behaviour similar to other G4x/GM45 devices for which DMAR is
already disabled via quirks.
When DMAR is enabled, the system may hard lock up during boot or
early device initialization, requiring a reset.
Add the missing PCI ID to the existing quirk list to disable
DMAR for this device.
Guopeng Zhang [Sat, 9 May 2026 10:20:30 +0000 (18:20 +0800)]
cgroup/cpuset: Reset DL migration state on can_attach() failure
cpuset_can_attach() accumulates temporary SCHED_DEADLINE migration
state in the destination cpuset while walking the taskset.
If a later task_can_attach() or security_task_setscheduler() check
fails, cgroup_migrate_execute() treats cpuset as the failing subsystem
and does not call cpuset_cancel_attach() for it. The partially
accumulated state is then left behind and can be consumed by a later
attach, corrupting cpuset DL task accounting and pending DL bandwidth
accounting.
Reset the pending DL migration state from the common error exit when
ret is non-zero. Successful can_attach() keeps the state for
cpuset_attach() or cpuset_cancel_attach().
Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails") Cc: stable@vger.kernel.org # v6.10+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Waiman Long <longman@redhat.com>
iommu: Warn on premature unblock during DMA aliased sibling reset
When two aliased siblings are in the same iommu_group, they might share the
same RID. The reset functions don't support this case, though it is unclear
whether there is a real case of having an ATS capable device on a PCI/PCI-X
bus.
Theoretically, however, if two aliased devices are resetting concurrently,
one might be unblocked prematurely in the middle of the reset by the other
sibling who completes the reset first.
This isn't a regression from this series but it's better to spit a warning,
so we can know if such use case is common enough for us to make subsequent
patches for its coverage.
iommu: Fix WARN_ON in __iommu_group_set_domain_nofail() due to reset
In __iommu_group_set_domain_internal(), concurrent domain attachments are
rejected when any device in the group is recovering. This is necessary to
fence concurrent attachments to a multi-device group where devices might
share the same RID due to PCI DMA alias quirks, but triggers the WARN_ON in
__iommu_group_set_domain_nofail().
Other IOMMU_SET_DOMAIN_MUST_SUCCEED callers in detach/teardown paths, such
as __iommu_group_set_core_domain and __iommu_release_dma_ownership, should
not be rejected, as the domain would be freed anyway in these nofail paths
while group->domain is still pointing to it. So pci_dev_reset_iommu_done()
could trigger a UAF when re-attaching group->domain.
Honor the IOMMU_SET_DOMAIN_MUST_SUCCEED flag, allowing the callers through
the group->recovery_cnt fence, so as to update the group->domain pointer.
Instead add a gdev->blocked check in the device iteration loop, to prevent
any concurrent per-device detachment.
iommu: Fix ATS invalidation timeouts during __iommu_remove_group_pasid()
If a device is blocked, its PASID domains are already detached. Repeating
iommu_remove_dev_pasid() is unnecessary and might trigger ATS invalidation
timeouts.
Skip the iommu_remove_dev_pasid() call upon gdev->blocked.
Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
internally while both are calling pci_dev_reset_iommu_prepare/done().
As pci_dev_reset_iommu_prepare() doesn't support re-entry, the inner call
will trigger a WARN_ON and return -EBUSY, resulting in failing the entire
device reset.
On the other hand, removing the outer calls in the PCI callers is unsafe.
As pointed out by Kevin, device-specific quirks like reset_hinic_vf_dev()
execute custom firmware waits after their inner pcie_flr() completes. If
the IOMMU protection relies solely on the inner reset, the IOMMU will be
unblocked prematurely while the device is still resetting.
Instead, fix this by making pci_dev_reset_iommu_prepare/done() reentrant.
Introduce gdev->reset_depth to handle the re-entries on the same device.
iommu: Replace per-group resetting_domain with per-gdev blocked flag
The core tracks device resetting states with a per-group resetting_domain,
while a reset is actually per group-device. Such a mismatch might lead to
confusion and even difficulty to untangle per-gdev handling requirement.
Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
internally while both are calling pci_dev_reset_iommu_prepare/done(). And
the solution requires the core to track at the group_device level as well.
Introduce a 'blocked' flag to struct group_device, to allow a multi-device
group to isolate concurrent device resets independently.
As the reset routine is per gdev, it cannot clear group->resetting_domain
without iterating over the device list to ensure no other device is being
reset. Simplify it by replacing the resetting_domain with a 'recovery_cnt'
in the struct iommu_group.
No functional change. But this is essential to apply following bug fixes.
iommu: Fix NULL group->domain dereference in pci_dev_reset_iommu_done()
Local sashiko review pointed it out that group->domain could be NULL when
a default domain fails to allocate during the first probe, which can crash
at domain->ops->attach_dev dereference in __iommu_attach_device() invoked
by pci_dev_reset_iommu_done().
pci_dev_reset_iommu_prepare() is fine as an old_domain pointer can be NULL.
Skip the re-attach in pci_dev_reset_iommu_done() to fix the bug.
iommu/amd: Bounds-check devid in __rlookup_amd_iommu()
iommu_device_register() walks every device on the PCI bus via
bus_for_each_dev() and calls amd_iommu_probe_device() for each. The
inlined check_device() path computes the device's sbdf, calls
rlookup_amd_iommu() to find the owning IOMMU, and only afterwards
verifies devid <= pci_seg->last_bdf. __rlookup_amd_iommu() indexes
rlookup_table[devid] with no bounds check of its own, so for a PCI
device whose BDF is not described by the IVRS, the lookup reads past
the end of the allocation before the caller's bounds check can run.
This was harmless before commit e874c666b15b ("iommu/amd: Change
rlookup, irq_lookup, and alias to use kvalloc()"): the table was a
zeroed page-order allocation, so the over-read returned NULL and the
caller's NULL check skipped the device. After that commit the table is
a tight kvcalloc() and the over-read returns adjacent slab contents,
which check_device() then dereferences as a struct amd_iommu *,
causing a boot-time GPF.
Seen on Google Compute Engine ct6e VMs, where the virtualized IVRS
describes only the four TPU endpoints 00:04.0-07.0; the gVNIC at
00:08.0 (devid 0x40) indexes 56 bytes past the 456-byte allocation,
into the adjacent kmalloc-512 slab object:
pci 0000:00:04.0: Adding to iommu group 0
pci 0000:00:05.0: Adding to iommu group 1
pci 0000:00:06.0: Adding to iommu group 2
pci 0000:00:07.0: Adding to iommu group 3
Oops: general protection fault, probably for non-canonical address 0x3a64695f78746382: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.18.22 #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 12/06/2025
RIP: 0010:amd_iommu_probe_device+0x54/0x3a0
Call Trace:
__iommu_probe_device+0x107/0x520
probe_iommu_group+0x29/0x50
bus_for_each_dev+0x7e/0xe0
iommu_device_register+0xc9/0x240
iommu_go_to_state+0x9c0/0x1c60
amd_iommu_init+0x14/0x40
pci_iommu_init+0x16/0x60
do_one_initcall+0x47/0x2f0
Guard the array access in __rlookup_amd_iommu(). With the fix applied
on 6.18.22, the gVNIC at 00:08.0 is skipped cleanly and the VM boots.
Fixes: e874c666b15b ("iommu/amd: Change rlookup, irq_lookup, and alias to use kvalloc()") Cc: stable@vger.kernel.org Reported-by: Ziyuan Chen <zc@anthropic.com> Tested-by: Ziyuan Chen <zc@anthropic.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Assisted-by: Claude:unspecified Signed-off-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
gpio: add GPIO controller found on Waveshare DSI TOUCH panels
The Waveshare DSI TOUCH family of panels has separate on-board GPIO
controller, which controls power supplies to the panel and the touch
screen and provides reset pins for both the panel and the touchscreen.
Also it provides a simple PWM controller for panel backlight. Add
support for this GPIO controller.
The Waveshare DSI TOUCH family of panels has separate on-board GPIO
controller, which controls power supplies to the panel and the touch
screen and provides reset pins for both the panel and the touchscreen.
Also it provides a simple PWM controller for panel backlight.
Add bindings for these GPIO controllers. As overall integration might be
not very obvious (and it differs significantly from the bindings used by
the original drivers), provide complete example with the on-board
regulators and the DSI panel.
Chen-Yu Tsai [Thu, 7 May 2026 05:29:41 +0000 (13:29 +0800)]
power: sequencing: print power sequencing device parent in debugfs
The debugfs summary currently shows the power sequencing device's name.
This is not really helpful since the device name is always "pwrseq.N".
Also print the parent device's name. This would likely be the device
node name from the device tree, something like "nvme-connector". This
would make it much easier for the developer to associate the summary
with a certain device.
Eder Zulian [Fri, 10 Apr 2026 12:55:50 +0000 (14:55 +0200)]
iommu/amd: Remove latent out-of-bounds access in IOMMU debugfs
In iommu_mmio_write() and iommu_capability_write(), the variables
dbg_mmio_offset and dbg_cap_offset are declared as int. However, they
are populated using kstrtou32_from_user(). If a user provides a
sufficiently large value, it can become a negative integer.
Prior to this patch, the AMD IOMMU debugfs implementation was already
protected by different mechanisms.
1. #define OFS_IN_SZ 8 ensures the user string <= 8 bytes, so
e.g. 0xffffffff isn't a valid input.
if (cnt > OFS_IN_SZ)
return -EINVAL;
2. Implicit type promotion in iommu_mmio_write(), dbg_mmio_offset is int
and iommu->mmio_phys_end is u64
if (dbg_mmio_offset > iommu->mmio_phys_end - sizeof(u64))
return -EINVAL;
3. The show handlers would currently catch the negative number and
refuse to perform the read.
Replace kstrtou32_from_user() with kstrtos32_from_user() to parse the
input, and check for negative values to explicitly prevent out-of-bounds
memory accesses directly in iommu_mmio_write() and
iommu_capability_write().
Signed-off-by: Eder Zulian <ezulian@redhat.com> Fixes: 7a4ee419e8c1 ("iommu/amd: Add debugfs support to dump IOMMU MMIO registers") Cc: stable@vger.kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Andrea Righi [Mon, 11 May 2026 06:18:12 +0000 (08:18 +0200)]
sched_ext: Fix ops->priv clobber on concurrent attach/detach
Under heavy concurrent attach/detach operations, scx_claim_exit() can
trigger a NULL pointer dereference. This can be reproduced running the
reload_loop kselftests inside a virtme-ng session:
T1 acquires scx_enable_mutex inside scx_root_disable()'s mutex_unlock
window and starts a fresh attach on the same kdata, assigning sch_a800
to @ops->priv. T2 then continues out of scx_disable()/flush_disable_work
and clobbers @ops->priv to NULL, leaking sch_a800; the bpf_link is gone
but state stays SCX_ENABLED, so all future attaches fail with -EBUSY
permanently. The next bpf_scx_unreg() on that kdata then reads NULL
@ops->priv and dereferences it in scx_claim_exit().
Make @ops->priv the lifecycle binding: in scx_root_enable_workfn() and
scx_sub_enable_workfn(), after the existing state check and still under
scx_enable_mutex, refuse with -EBUSY if @ops->priv is non-NULL. This
rejects an attempt to reuse a kdata that is still bound to a previous
scheduler instance, closing the race without changing the unreg side.
platform/chrome: cros_kbd_led_backlight: Drop CONFIG_MFD_CROS_EC_DEV ifdeffery
The ifdeffery is unnecessary, as the compiler can already optimize away
all of the mfd-specific code based on the IS_ENABLED() in
keyboard_led_is_mfd_device().
platform/chrome: cros_kbd_led_backlight: Pass keyboard_led as parameter
Make the code simpler to read by passing the 'struct keyboard_led' as
a parameter to the 'init' callbacks instead of relying on the platform
device driver data.
A race condition exists between the probe of cros-ec-sysfs and
cros-ec-sensorhub.
The `kb_wake_angle` attribute should only be visible if the sensor hub
detects two or more accelerometers. If cros_ec_sysfs_probe() runs
before cros_ec_sensorhub_register() completes sensor enumeration, the
sysfs attributes are created while `has_kb_wake_angle` is still false,
hiding `kb_wake_angle` incorrectly.
Store the created attribute group pointer in `ec_dev->group`. When
the sensor hub completes sensor enumeration, it checks for this group
and calls sysfs_update_group() to notify the sysfs core to re-evaluate
attribute visibility. This ensures the `kb_wake_angle` attribute
visibility is correctly updated regardless of the driver probe order.
Andrea Righi [Sun, 10 May 2026 17:52:11 +0000 (19:52 +0200)]
selftests/sched_ext: Fix build error in dequeue selftest
Building the dequeue selftest with newer compilers (e.g., gcc 16)
triggers the following error:
dequeue.c:28:22: error: variable 'sum' set but not used
The 'volatile' qualifier prevents the writes from being optimized away,
but does not silence the unused variable 'sum' is indeed only written
and never read.
Consume 'sum' via an empty asm() with a register input constraint. This
forces the compiler to keep the accumulated value (preserving the CPU
stress loop) and avoiding the build error.
Fixes: 658ad2259b3e ("selftests/sched_ext: Add test to validate ops.dequeue() semantics") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
cg_read_strcmp() allocated a buffer sized to strlen(expected) + 1,
then passed it to read_text() which calls read(fd, buf, size-1).
When comparing against an empty string (""), strlen("") = 0 gives a
1-byte buffer, and read() is asked to read 0 bytes. The file content
is never actually read, so strcmp("", buf) always returns 0 regardless
of the real content. This caused cg_test_proc_killed() to always
report the cgroup as empty immediately, making OOM tests pass without
verifying that processes were killed.
Signed-off-by: Hongfu Li <lihongfu@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>
Guopeng Zhang [Mon, 11 May 2026 01:31:50 +0000 (09:31 +0800)]
cgroup/dmem: Return -ENOMEM on failed pool preallocation
get_cg_pool_unlocked() handles allocation failures under dmemcg_lock by
dropping the lock, preallocating a pool with GFP_KERNEL, and retrying the
locked lookup and creation path.
If the fallback allocation fails too, pool remains NULL. Since the loop
condition is while (!pool), the function can keep retrying instead of
propagating the allocation failure to the caller.
Set pool to ERR_PTR(-ENOMEM) when the fallback allocation fails so the
loop exits through the existing common return path. The callers already
handle ERR_PTR() from get_cg_pool_unlocked(), so this restores the
expected error path.
Mark Brown [Mon, 11 May 2026 01:04:02 +0000 (10:04 +0900)]
ASoC: sdw_utils: make RT712/RT721 CODEC_MIC be optional
Bard Liao <yung-chuan.liao@linux.intel.com> says:
The RT712 and RT721 codec mic are optional and are not used on some
products. Add a quirk to make it optional and skip the codec mic DAI
when it is not present in DisCo table.
Mac Chiang [Fri, 8 May 2026 09:32:23 +0000 (17:32 +0800)]
ASoC: sdw_utils: Add quirk to ignore RT712 CODEC_MIC
Some devices do not use CODEC_MIC but use the host PCH_DMIC
instead. Add a quirk to skip the CODEC_MIC DAI when it is not present
in disco table, ensuring the correct capture device is used.
If CODEC_MIC is present, it continues to be used as default.
Jang Pyohwan [Sat, 9 May 2026 08:53:10 +0000 (17:53 +0900)]
ASoC: Intel: soc-acpi: add LG Gram 16Z90U RT713 + single RT1320 quirk
Add a SoundWire machine table entry for the LG Gram Pro 2026
(16Z90U-KU7BK), which has an unusual configuration:
sdw:0:1:025d:1320:01 single stereo RT1320 SmartAmp on link 1
sdw:0:3:025d:0713:01 RT713 jack/headset codec on link 3
Existing rt713-rt1320 boards have two RT1320 amps on different links
("link_mask = BIT(1) | BIT(2) | BIT(3)"). The LG Gram uses a single
stereo RT1320 chip, so the new entry uses "link_mask = BIT(1) |
BIT(3)" with the existing rt1320_1_group2_adr structure, leaving the
two-channel routing to the topology.
The RT713 on this board does not expose a SMART_MIC function in
ACPI, so the .machine_check callback used by the existing entries
(snd_soc_acpi_intel_sdca_is_device_rt712_vb) would reject this
board. Drop machine_check for the new entry; speaker output and
the headset jack do not depend on the SMART_MIC presence check.
The corresponding topology source has been submitted to the SOF
project at https://github.com/thesofproject/sof/pull/10760 . The
generated sof-ptl-rt713-l3-rt1320-l1-2ch.tplg and
nhlt-sof-ptl-rt713-l3-rt1320-l1.bin will follow in linux-firmware
once that lands.
Tested on Ubuntu 26.04 with kernel 7.0.0-15: speaker (RT1320
stereo), headphone jack with auto-routing, headset mic, and the
internal NHLT DMIC array all work via the UCM HiFi profile.
Gary C Wang [Fri, 8 May 2026 10:42:38 +0000 (18:42 +0800)]
ASoC: soc-acpi-intel-arl-match: add rt712_l0_rt1320_l3 support
Add support for using the rt712 multi-function codec on link 0 and the
rt1320 amplifier on link 3 on ARL platforms.
Signed-off-by: Gary C Wang <gary.c.wang@intel.com> Co-developed-by: Mac Chiang <mac.chiang@intel.com> Signed-off-by: Mac Chiang <mac.chiang@intel.com> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com> Link: https://patch.msgid.link/20260508104239.1247525-3-yung-chuan.liao@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
Mark Brown [Mon, 11 May 2026 00:55:56 +0000 (09:55 +0900)]
spi: switch to managed controller allocation (part 2/3)
Johan Hovold <johan@kernel.org> says:
In preparation for fixing the SPI controller API so that it no longer
drops a reference when deregistering (non-managed) controllers (cf.
[1]), this series converts drivers using non-managed registration to use
managed allocation.
Included is also a related cleanup of a ti-qspi error path.
This second set will be followed by a third set of 12 patches for
drivers using managed registration.
That leaves us with 18 drivers using non-managed allocation, which is
few enough to be able to fix the API in tree-wide change.
spi: amd: Set correct bus number in ACPI probe path
On platforms where the HID2 SPI controller (AMDI0063) is enumerated via
ACPI instead of PCI, amd_spi_probe() unconditionally sets bus_num to 0,
while the PCI probe path assigns bus_num 2 for HID2 controller.
Align the ACPI probe path to use the same bus number so that userspace
and SPI client drivers see a consistent bus assignment regardless of the
enumeration method.
Fixes: b644c2776652 ("spi: spi_amd: Add PCI-based driver for AMD HID2 SPI controller") Cc: stable@vger.kernel.org # v6.16+ Signed-off-by: Krishnamoorthi M <krishnamoorthi.m@amd.com> Link: https://patch.msgid.link/20260507180051.4158674-1-krishnamoorthi.m@amd.com Signed-off-by: Mark Brown <broonie@kernel.org>
A phandle-array is really a matrix and needs constraints on the number
of elements for both the inner and outer dimensions. Add the missing
inner constraints.
Fixes: 472d77bdc511 ("ASoC: dt-bindings: mediatek,mt8173-rt5650-rt5514: convert to DT schema") Signed-off-by: Rob Herring (Arm) <robh@kernel.org> Link: https://patch.msgid.link/20260508182438.1757394-1-robh@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org>
Peter Ujfalusi [Tue, 5 May 2026 16:47:44 +0000 (19:47 +0300)]
MAINTAINERS: ASoC/ti: Remove myself and add Sen Wang as maintainer
As I cannot spend adequate time to fulfill my role as maintainer for the
TI ASoC drivers, it is for the better if I resign and hand over the role
to Sen Wang.
Clippy 1.97 introduces new `useless_borrows_in_formatting` warning which
fires on the examples as we have `&*expr` where the format macro takes
reference already. Remove the extra borrow.
Gary Guo [Tue, 28 Apr 2026 13:10:59 +0000 (14:10 +0100)]
rust: pin-init: internal: turn `PhantomPinned` error into warnings
The `PhantomPinned` detection is just a lint, and is emitted as an error
because there is no `compile_warning!()` macro, and
`proc-macro-diagnostics` is not stable.
Use of `#[deprecated = ""]` attribute to approximate custom proc-macro
warnings. A new line is added before message for visual clarity.
An example warning with this trick looks like this:
warning: use of deprecated function `_::warn`:
The field `pin` of type `PhantomPinned` only has an effect if it has the `#[pin]` attribute
--> test.rs:9:5
|
9 | pin: marker::PhantomPinned,
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^
rust: pin-init: internal: adjust license identifier of `zeroable.rs`
The pin-init crate has been licensed under `Apache-2.0 OR MIT` since the
beginning. I introduced in commit 071cedc84e90 ("rust: add derive macro for
`Zeroable`") `zeroable.rs` with incompatible GPL-2.0 SPDX identifier. The
file has not been modified by other authors, so relicense it under the
above license.
rust: pin-init: internal: add missing where clause to projection types
`#[pin_data]` failed to propagate the struct's `where` clause to the
generated projection struct. As a result, bounds written in a `where`
clause could be dropped during expansion, causing type errors when
fields depended on those bounds.
Fix this by adding the missing `where` clause to the generated
projection struct.
Reported-by: Andreas Hindborg <a.hindborg@kernel.org> Closes: https://rust-for-linux.zulipchat.com/#narrow/channel/561532-pin-init/topic/generic.20bounds.20and.20.60.23.5Bpin_data.5D.60/with/578381591 Signed-off-by: Mohamad Alsadhan <mo@sdhn.cc> Reviewed-by: Gary Guo <gary@garyguo.net>
[ Reworded commit message - Gary ] Link: https://patch.msgid.link/20260428-pin-init-sync-v1-5-07f9bd3859fb@garyguo.net Signed-off-by: Gary Guo <gary@garyguo.net>
rust: pin-init: extend `impl_zeroable_option` macro to handle generics
Improve impl_zeroable_option macro to handle generic impls for types
like `&T`, `&mut T`, `NonNull<T>`, and others (for which `Option<T>`
is guaranteed to be zeroable) with similar approach to
`impl_zeroable`.
Also, update old declarations to use generics e.g. `NonZeroU8` to
`NonZero<u8>`.
rust: pin-init: cleanup `Zeroable` and `ZeroableOptions`
Place definitions and implementations (incl. macro invocations) of
the `Zeroable` trait first in the relevant section of `src/lib.rs`,
followed by the ZeroableOption trait and its implementations.
Rename `impl_non_zero_int_zeroable_option` to `impl_zeroable_option`
for consistency.
This commit should not introduce any functional changes.
Gary Guo [Tue, 28 Apr 2026 13:10:51 +0000 (14:10 +0100)]
rust: pin-init: bump minimum Rust version to 1.82
Following the kernel minimum version bump in commit f32fb9c58a5b ("rust:
bump Rust minimum supported version to 1.85.0 (Debian Trixie)"), bump
pin-init's minimum Rust version to 1.82.
This removes the `lint_reasons` feature which is stabilized in 1.81 and the
`raw_ref_ops` and `new_uninit` features which are stabilized in 1.82.
Given we do not use any features that are stabilized in 1.82..=1.85 range,
and pin-init crate is useful for other projects which may have their own
MSRV requirements, the minimum version is not straightly bumped to 1.85.
Tejun Heo [Sun, 10 May 2026 20:08:16 +0000 (10:08 -1000)]
sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths
scx_fail_parent() leaves cgroup tasks at (state=NONE, sched=parent,
sched_class=ext) until the parent itself is torn down by the scx_error() it
raised. When the later root_disable iterates them, two paths trip on NONE.
scx_disable_and_exit_task() re-enters the wrapper at NONE: the inner switch
returns early but the trailing scx_set_task_sched(p, NULL) clobbers the
parent sched left by scx_fail_parent(), and scx_set_task_state(p, NONE)
wastes a write on an already-NONE task. switched_from_scx() then calls
scx_disable_task(), which WARNs on non-ENABLED state and writes state=READY,
producing a NONE -> READY transition the validation matrix rejects.
Treat NONE as "nothing to do" in both paths. Add a NONE early-return at the
top of scx_disable_and_exit_task() and a parallel NONE check in
switched_from_scx() next to task_dead_and_done().
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
Tejun Heo [Sun, 10 May 2026 20:08:16 +0000 (10:08 -1000)]
sched_ext: Close sub-sched init race with post-init DEAD recheck
scx_sub_enable_workfn()'s init pass and scx_sub_disable() migration both
drop the rq lock to call __scx_init_task() against the other sched. A
TASK_DEAD @p can fall through sched_ext_dead() in that window.
sched_ext_dead() runs ops.exit_task() on the sched @p was attached to, not
on the sched whose init just completed, so the new allocation leaks.
Reuse the DEAD signal set by sched_ext_dead(). After __scx_init_task()
returns, take task_rq_lock(p) and check for DEAD; on hit, call
scx_sub_init_cancel_task() against the sub sched the init ran for and drop
@p; on miss, proceed as before.
Tejun Heo [Sun, 10 May 2026 20:08:16 +0000 (10:08 -1000)]
sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
scx_root_enable_workfn() drops the iter rq lock for ops.init_task() and a
TASK_DEAD @p can fall through sched_ext_dead() in that window. The race hits
when sched_ext_dead() observes SCX_TASK_INIT (the intermediate state before
@p->scx.sched is published) and dereferences NULL via SCX_HAS_OP(NULL,
exit_task), or observes SCX_TASK_NONE during the unlocked init window and
skips cleanup so exit_task() never runs.
Add SCX_TASK_INIT_BEGIN. The enable path writes NONE -> INIT_BEGIN under the
iter rq lock, then takes the rq lock again after init to walk INIT_BEGIN ->
INIT -> READY. sched_ext_dead() that wins the rq-lock race observes
INIT_BEGIN and sets DEAD without calling into ops; the post-init recheck
unwinds via scx_sub_init_cancel_task().
scx_fork() runs single-threaded against sched_ext_dead() (the task is not on
scx_tasks until scx_post_fork() adds it) so its INIT_BEGIN -> INIT walk
needs no rq-lock pairing; it rolls back to NONE on ops.init_task() failure.
The validation matrix grows the INIT_BEGIN row and the INIT_BEGIN -> DEAD
edge; INIT now requires INIT_BEGIN as the predecessor. scx_sub_disable()'s
migration writes INIT_BEGIN as a synthetic predecessor to satisfy the
tightened verification.
The sub-sched paths still race with sched_ext_dead() during the unlocked
init window. This will be fixed by the next patch.
Tejun Heo [Sun, 10 May 2026 20:08:16 +0000 (10:08 -1000)]
sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
SCX_TASK_OFF_TASKS marked tasks already through sched_ext_dead() so cgroup
task iteration would skip them. This can be expressed better with a task
state. Replace the flag with SCX_TASK_DEAD.
scx_disable_and_exit_task() resets state to NONE on its way out, so
sched_ext_dead() now sets DEAD after the wrapper returns. The validation
matrix grows NONE -> DEAD, warns on DEAD -> NONE, and tightens READY's
predecessor to INIT or ENABLED so the new DEAD value cannot silently
transition to READY.
Prepares for the following enable vs dead race fix.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
Tejun Heo [Sun, 10 May 2026 20:08:16 +0000 (10:08 -1000)]
sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
Prepare for the SCX_TASK_INIT_BEGIN/DEAD work that follows by collapsing the
scx_init_task() helper. Move the SCX_TASK_RESET_RUNNABLE_AT setting into
scx_set_task_state() on the INIT transition (it was set unconditionally at
every INIT site through the scx_init_task() helper), inline scx_init_task()
into scx_fork() and scx_root_enable_workfn(), and drop the helper.
As a side effect, scx_sub_disable() migration sequence now also sets
RESET_RUNNABLE_AT (it previously wrote INIT directly without going through
scx_init_task()). The flag triggers a runnable_at reset on the next
set_task_runnable(), which is harmless on a task that has just been moved
between scheds.
On root-enable, p->scx.flags is written without the task's rq lock. The task
isn't visible to scx yet, and a follow-up patch restores the lock-held
write.
v2: Note p->scx.flags rq-lock relaxation on root-enable path. (Andrea)
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
Tejun Heo [Sun, 10 May 2026 20:08:16 +0000 (10:08 -1000)]
sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
Cleanups in preparation for the state-machine work that follows:
- Convert three sub-sched call sites that open-code
rcu_assign_pointer(p->scx.sched, ...) to scx_set_task_sched().
- Move scx_get_task_state()/scx_set_task_state() above the SCX task iter
section so scx_task_iter_next_locked() can use them without a forward
declaration.
No functional change.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>