Wei-Lin Chang [Tue, 5 May 2026 14:47:35 +0000 (15:47 +0100)]
KVM: arm64: nv: Consider the DS bit when translating TCR_EL2
When running an nVHE L1, TCR_EL2 is mapped to TCR_EL1. Writes to the
register are trapped and written to TCR_EL1 after a translation.
Booting an nVHE L1 with 52-bit VA isn't working because the translation
was ignoring the DS bit set by the guest, hence causing repeating level
0 faults. Add it in the translation function.
James Morse [Tue, 5 May 2026 16:52:03 +0000 (17:52 +0100)]
KVM: arm64: Work around C1-Pro erratum 4193714 for protected guests
C1-Pro cores with SME have an erratum where TLBI+DSB does not complete
all outstanding SME accesses. Instead a DSB needs to be executed on the
affected CPUs. The implication is that pages cannot be unmapped from the
host Stage 2 and then provided to a protected guest or to the
hypervisor. Host SME accesses may still complete after this point.
This erratum breaks pKVM's guarantees, and the workaround is hard to
implement as EL2 and EL1 share a security state meaning EL1 can mask
IPIs sent by EL2, leading to interrupt blackouts.
Instead, do this in EL3. This has the advantage of a separate security
state, meaning lower EL cannot mask the IPI. It is also simpler for EL3
to know about CPUs that are off or in PSCI's CPU_SUSPEND.
Add the needed hook to host_stage2_set_owner_metadata_locked(). This
covers the cases where the host loses access to a page:
__pkvm_host_donate_guest()
__pkvm_guest_unshare_host()
host_stage2_set_owner_locked() when owner_id == PKVM_ID_HYP
Since pKVM relies on the firmware call for correctness, check for the
firmware counterpart during protected KVM initialisation and fail the
pKVM initialisation if it is missing.
Signed-off-by: James Morse <james.morse@arm.com> Co-developed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oupton@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Lorenzo Pieralisi <lpieralisi@kernel.org> Cc: Sudeep Holla <sudeep.holla@kernel.org> Link: https://patch.msgid.link/20260505165205.2690919-1-catalin.marinas@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
Eric Biggers [Thu, 30 Apr 2026 09:52:42 +0000 (02:52 -0700)]
block: export blk-crypto symbols required by dm-inlinecrypt
bio_crypt_set_ctx(), blk_crypto_init_key(), and
blk_crypto_start_using_key() are needed to use inline encryption; see
Documentation/block/inline-encryption.rst. Export them so that
dm-inlinecrypt can use them. The only reason these weren't exported
before was that inline encryption was previously used only by fs/crypto/
which is built-in code.
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Vincent Guittot [Sun, 3 May 2026 10:45:03 +0000 (12:45 +0200)]
sched/fair: Fix wakeup_preempt_fair() for not waking up task
Make sure to only call pick_next_entity() on an non-empty cfs_rq.
The assumption that p is always enqueued and not delayed, is only true for
wakeup. If p was moved while delayed, pick_next_entity() will dequeue it and
the cfs might become empty. Test if there are still queued tasks before trying
again to determine if p could be the next one to be picked.
There are at least 2 cases:
When cfs becomes idle, it tries to pull tasks but if those pulled tasks are
delayed, they will be dequeued when attached to cfs. attach_tasks() ->
attach_task() -> wakeup_preempt(rq, p, 0);
A misfit task running on cfs A triggers a load balance to be pulled on a better
cpu, the load balance on cfs B starts an active load balance to pulled the
running misfit task. If there is a delayed dequeue task on cfs A, it can be
pulled instead of the previously running misfit task. attach_one_task() ->
attach_task() -> wakeup_preempt(rq, p, 0);
Zhan Xusheng [Fri, 1 May 2026 10:40:06 +0000 (12:40 +0200)]
sched/fair: Fix overflow in vruntime_eligible()
Zhan Xusheng reported running into sporadic a s64 mult overflow in
vruntime_eligible().
When constructing a worst case scenario:
If you have cgroups, then you can have an entity of weight 2 (per
calc_group_shares()), and its vlag should then be bounded by: (slice+TICK_NSEC)
* NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().
The other extreme is 100*NICE_0_LOAD, thus you get:
And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100, which will
overflow s64.
Zhan suggested using __builtin_mul_overflow(), however after staring at
compiler output for various architectures using godbolt, it seems that using an
__int128 multiplication often results in better code.
Specifically, a number of architectures already compute the __int128 product to
determine the overflow. Eg. arm64 already has the 'smulh' instruction used. By
explicitly doing an __int128 multiply, it will emit the 'mul; smulh' pattern,
which modern cores can fuse (armv8-a clang-22.1.0). x86_64 has less branches
(no OF handling).
Since Linux has ARCH_SUPPORTS_INT128 to gate __int128 usage, also provide the
__builtin_mul_overflow() variant as a fallback.
[peterz: Changelog and __int128 bits] Fixes: 556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()") Reported-by: Zhan Xusheng <zhanxusheng1024@gmail.com> Closes: https://patch.msgid.link/20260415145742.10359-1-zhanxusheng%40xiaomi.com Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260505103155.GN3102924%40noisy.programming.kicks-ass.net
Due to the incompatibility with TCMalloc the RSEQ optimizations and
extended features (time slice extensions) have been disabled and made
run-time conditional.
The original RSEQ implementation, which TCMalloc depends on, registers a 32
byte region (ORIG_RSEG_SIZE). This region has a 32 byte alignment
requirement.
The extension safe newer variant exposes the kernel RSEQ feature size via
getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment requirement via
getauxval(AT_RSEQ_ALIGN). The alignment requirement is that the registered
RSEQ region is aligned to the next power of two of the feature size. The
kernel currently has a feature size of 33 bytes, which means the alignment
requirement is 64 bytes.
The TCMalloc RSEQ region is embedded into a cache line aligned data
structure starting at offset 32 bytes so that bytes 28-31 and the
cpu_id_start field at bytes 32-35 form a 64-bit little endian pointer with
the top-most bit (63 set) to check whether the kernel has overwritten
cpu_id_start with an actual CPU id value, which is guaranteed to not have
the top most bit set.
As this is part of their performance tuned magic, it's a pretty safe
assumption, that TCMalloc won't use a larger RSEQ size.
This allows the kernel to declare that registrations with a size greater
than the original size of 32 bytes, which is the cases since time slice
extensions got introduced, as RSEQ ABI v2 with the following differences to
the original behaviour:
1) Unconditional updates of the user read only fields (CPU, node, MMCID)
are removed. Those fields are only updated on registration, task
migration and MMCID changes.
2) Unconditional evaluation of the criticial section pointer is
removed. It's only evaluated when user space was interrupted and was
scheduled out or before delivering a signal in the interrupted
context.
3) The read/only requirement of the ID fields is enforced. When the
kernel detects that userspace manipulated the fields, the process is
terminated. This ensures that multiple entities (libraries) can
utilize RSEQ without interfering.
4) Todays extended RSEQ feature (time slice extensions) and future
extensions are only enabled in the v2 enabled mode.
Registrations with the original size of 32 bytes operate in backwards
compatible legacy mode without performance improvements and extended
features.
Unfortunately that also affects users of older GLIBC versions which
register the original size of 32 bytes and do not evaluate the kernel
required size in the auxiliary vector AT_RSEQ_FEATURE_SIZE.
That's the result of the lack of enforcement in the original implementation
and the unwillingness of a single entity to cooperate with the larger
ecosystem for many years.
Implement the required registration changes by restructuring the spaghetti
code and adding the size/version check. Also add documentation about the
differences of legacy and optimized RSEQ V2 mode.
Thanks to Mathieu for pointing out the ORIG_RSEQ_SIZE constraints!
Fixes: d6200245c75e ("rseq: Allow registering RSEQ with slice extension") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Link: https://patch.msgid.link/20260428224427.927160119%40kernel.org Cc: stable@vger.kernel.org
Thomas Gleixner [Sun, 26 Apr 2026 14:21:02 +0000 (16:21 +0200)]
rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode
The optimized RSEQ V2 mode requires that user space adheres to the ABI
specification and does not modify the read-only fields cpu_id_start,
cpu_id, node_id and mm_cid behind the kernel's back.
While the kernel does not rely on these fields, the adherence to this is a
fundamental prerequisite to allow multiple entities, e.g. libraries, in an
application to utilize the full potential of RSEQ without stepping on each
other toes.
Validate this adherence on every update of these fields. If the kernel
detects that user space modified the fields, the application is force
terminated.
Fixes: d6200245c75e ("rseq: Allow registering RSEQ with slice extension") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Link: https://patch.msgid.link/20260428224427.845230956%40kernel.org Cc: stable@vger.kernel.org
Thomas Gleixner [Sun, 26 Apr 2026 15:51:07 +0000 (17:51 +0200)]
selftests/rseq: Validate legacy behavior
The RSEQ legacy mode behavior requires that the ID fields in the rseq
region are unconditionally updated on every context switch and before
signal delivery even if not required by the ABI specification.
To ensure that this behavior is preserved for legacy users in the future,
add a test which validates that with a sleep() and a signal sent to self.
Provide a run script which prevents GLIBC from registering a RSEQ region,
so that the test can register it's own legacy sized region.
Fixes: 566d8015f7ee ("rseq: Avoid CPU/MM CID updates when no event pending") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Link: https://patch.msgid.link/20260428224427.764705536%40kernel.org Cc: stable@vger.kernel.org
Colin Walters [Tue, 5 May 2026 22:42:57 +0000 (15:42 -0700)]
ovl: fix verity lazy-load guard broken by fsverity_active() semantic change
Commit f77f281b6118 ("fsverity: use a hashtable to find the
fsverity_info") made fsverity_active() check whether the inode has the
verity flag, rather than whether the inode's fsverity_info is loaded.
This broke ovl_ensure_verity_loaded(), which wants to load the
fsverity_info for any verity inodes that haven't had it loaded yet.
Therefore, to check that the fsverity_info hasn't yet been loaded, use
fsverity_get_info(inode) == NULL instead of !fsverity_active(inode).
Also, since fsverity_get_info() now involves a hash table lookup, put
the more lightweight IS_VERITY() flag check first.
Cássio Gabriel [Tue, 5 May 2026 11:18:17 +0000 (08:18 -0300)]
ALSA: hda/tas2781: Cancel async firmware request at unbind
TAS2781 HDA I2C and SPI queue RCA firmware loading from component
bind with request_firmware_nowait(). The firmware loader keeps the
callback module pinned and holds a device reference, but the callback
still uses driver-private HDA state.
Component unbind removes controls and DSP state immediately. Later
device removal tears down the TAS2781 private data, including
codec_lock. If the async firmware callback runs after unbind has
started, it can operate on state that is being torn down.
Cancel or synchronize the async firmware request before removing
controls and DSP state. A queued callback is cancelled, and an
already-running callback is allowed to finish before unbind continues.
Cássio Gabriel [Tue, 5 May 2026 11:18:16 +0000 (08:18 -0300)]
firmware_loader: Add cancel helper for async requests
request_firmware_nowait() keeps the callback module pinned and holds
a device reference until the firmware work completes.
Callers still have no way to cancel or synchronize the queued callback
before tearing down their driver-private state.
Track scheduled async firmware work in an internal list and add
request_firmware_nowait_cancel(). The helper cancels work matching the
device, callback context and callback function. It cancels work that has
not started yet and waits for an already-running callback to return. If
the request has already completed, it is a no-op.
Keep the existing request_firmware_nowait() lifetime model manual. A
devres-managed variant can be layered on top separately if needed.
025f89b01ed8 ("drm/i915: replace select with dependency for visible
DEBUG_OBJECTS") breaks the build in certain scenarios, presumably
because config DRM_I915_DEBUG selects DRM_I915_SW_FENCE_DEBUG_OBJECTS
without looking at its dependencies, allowing
DRM_I915_SW_FENCE_DEBUG_OBJECTS=y and DEBUG_OBJECTS=n.
Fixes: 025f89b01ed8 ("drm/i915: replace select with dependency for visible DEBUG_OBJECTS") Cc: Julian Braha <julianbraha@gmail.com> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Link: https://patch.msgid.link/20260506101957.202271-1-jani.nikula@intel.com Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Jakub Kicinski [Wed, 6 May 2026 14:29:32 +0000 (07:29 -0700)]
Merge tag 'wireless-next-2026-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Lots of new content in cfg80211/mac80211, notably
- more NAN work, mostly complete now (also hwsim)
- more UHR work (e.g. non-primary channel access),
this will continue for a while
- FTM ranging APIs
* tag 'wireless-next-2026-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (70 commits)
wifi: mac80211: explicitly disable FTM responder on AP stop
wifi: iwlwifi: don't blindly start the responder upon BSS_CHANGED_FTM_RESPONDER
wifi: mac80211_hwsim: claim HT STBC capability
wifi: mac80211_hwsim: enable NAN_DATA interface simulation support
wifi: mac80211_hwsim: Support Tx of multicast data on NAN
wifi: mac80211_hwsim: Do not declare support for NDPE
wifi: mac80211_hwsim: Declare support for secure NAN
wifi: mac80211_hwsim: add NAN data path TX/RX support
wifi: mac80211_hwsim: set HAS_RATE_CONTROL when using NAN
wifi: mac80211_hwsim: implement NAN schedule callbacks
wifi: mac80211_hwsim: add NAN PHY capabilities
wifi: mac80211_hwsim: add NAN_DATA interface limits
wifi: mac80211_hwsim: implement NAN synchronization
wifi: mac80211_hwsim: protect tsf_offset using a spinlock
wifi: mac80211_hwsim: only RX on NAN when active on a slot
wifi: mac80211_hwsim: select NAN TX channel based on current TSF
wifi: mac80211_hwsim: limit TX of frames to the NAN DW
wifi: cfg80211: don't allow NAN DATA on multi radio devices
wifi: mac80211: check AP using NPCA has NPCA capability
wifi: mac80211: don't parse full UHR operation from beacons
...
====================
Jakub Kicinski [Wed, 6 May 2026 14:29:31 +0000 (07:29 -0700)]
Merge tag 'wireless-2026-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Quite a number of fixes now:
- mac80211
- remove HT NSS validation to work with broken APs
(with a kunit fix now)
- remove 'static' that could cause races
- check station link lookup before further processing
- fix use-after-free due to delete in list iteration
- remove AP station on assoc failures to fix crashes
- ath12k
- fix OF node refcount imbalance
- fix queue flush ("REO update") in MLO
- fix RCU assert
- ath12k:
- fix Kconfig with POWER_SEQUENCING
- fix WMI buffer leaks on error conditions
- don't use uninitialized stack data when processing RSSI events
- fix logic for determining the peer ID in the RX path
- ath5k: fix a potential stack buffer overwrite
- rsi: fix thread lifetime race
- brcmfmac: fix potential UAF
- nl80211:
- stricter permissions/checks for PMK and netns
- fix netlink policy vs. code type confusion
- cw1200: revert a broken locking change
- various fixes to not trust values from firmware
* tag 'wireless-2026-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (25 commits)
wifi: nl80211: re-check wiphy netns in nl80211_prepare_wdev_dump() continuation
wifi: nl80211: require CAP_NET_ADMIN over the target netns in SET_WIPHY_NETNS
wifi: nl80211: fix NL80211_PMSR_FTM_REQ_ATTR_FTMS_PER_BURST usage
wifi: mac80211: remove station if connection prep fails
wifi: mac80211: use safe list iteration in radar detect work
wifi: libertas: notify firmware load wait on disconnect
wifi: ath5k: do not access array OOB
wifi: ath12k: fix peer_id usage in normal RX path
wifi: ath12k: initialize RSSI dBm conversion event state
wifi: ath12k: fix leak in some ath12k_wmi_xxx() functions
wifi: cw1200: Revert "Fix locking in error paths"
wifi: mac80211: tests: mark HT check strict
wifi: rsi: fix kthread lifetime race between self-exit and external-stop
wifi: mac80211: drop stray 'static' from fast-RX rx_result
wifi: mac80211: check ieee80211_rx_data_set_link return in pubsta MLO path
wifi: nl80211: require admin perm on SET_PMK / DEL_PMK
wifi: libertas: fix integer underflow in process_cmdrequest()
wifi: b43legacy: enforce bounds check on firmware key index in RX path
wifi: b43: enforce bounds check on firmware key index in b43_rx()
wifi: brcmfmac: Fix potential use-after-free issue when stopping watchdog task
...
====================
Linus Torvalds [Wed, 6 May 2026 14:27:30 +0000 (07:27 -0700)]
Merge tag 'efi-fixes-for-v7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
- Fix issues in EFI graceful recovery on x86 introduced by changes to
the kernel mode FPU APIs
- I-cache coherency fixes for the LoongArch EFI stub
- Locking fix for EFI pstore
- Code tweak for efivarfs
* tag 'efi-fixes-for-v7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
x86/efi: Restore IRQ state in EFI page fault handler
x86/efi: Fix graceful fault handling after FPU softirq changes
efi/libstub: Synchronize instruction cache after kernel relocation
efi/loongarch: Implement efi_cache_sync_image()
efi/libstub: Move efi_relocate_kernel() into its only remaining user
efi: pstore: Drop efivar lock when efi_pstore_open() returns with an error
efivarfs: use QSTR() in efivarfs_alloc_dentry
Takashi Iwai [Wed, 6 May 2026 14:10:00 +0000 (16:10 +0200)]
Merge tag 'asoc-fix-v7.1-rc2' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v7.1
Another batch of fixes, plus a couple of quirks (mostly AMD ones, as has
been the case recently). All driver changes, including fixes for the
KUnit tests for the Cirrus drivers that could cause memory corruption.
Ahmed S. Darwish [Fri, 27 Mar 2026 02:15:18 +0000 (03:15 +0100)]
treewide: Explicitly include the x86 CPUID headers
Modify all CPUID call sites which implicitly include any of the CPUID
headers to explicitly include them instead.
For KVM's reverse_cpuid.h, just include <asm/cpuid/types.h> since it
references the CPUID_EAX..EDX symbols without using the CPUID APIs.
Note, this allows removing the inclusion of <asm/cpuid/api.h> from within
<asm/processor.h> next. That allows the CPUID API headers to include
<asm/processor.h> without introducing a circular dependency.
ASoC: cs35l56: Don't use devres to unregister component
Manually call snd_soc_unregister_component() from cs35l56_remove()
instead of using devres cleanup. This ensures that the component is
destroyed before cs35l56_remove() starts cleanup of anything the
component code could be using.
Devres cleanup happens after the driver remove() callback, so if
snd_soc_register_component() is used, it will not be destroyed until
after cs35l56_remove() has returned. But there is some cleanup that
must be done in cs35l56_remove(), or wrapped in a custom devres
cleanup handler to ensure correct ordering. The simplest option is
to call snd_soc_unregister_component() at the start of cs35l56_remove().
Fixes: e49611252900 ("ASoC: cs35l56: Add driver for Cirrus Logic CS35L56") Closes: https://sashiko.dev/#/patchset/20260501103002.2843735-1-rf%40opensource.cirrus.com Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Link: https://patch.msgid.link/20260505161124.3621000-2-rf@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
Sheetal [Wed, 6 May 2026 10:20:32 +0000 (10:20 +0000)]
ASoC: tegra: Add per-stream Mixer Fade controls
Add per-stream fade controls for the Tegra mixer to allow
independently configuring target gain and fade duration for each of
the 10 input streams (RX1 through RX10).
The following controls are added per stream:
"RXn Fade Duration" - fade duration in samples (N3 parameter)
"RXn Fade Gain" - target gain level for fade
A strobe control commits all pending fade configurations:
"Fade Switch" - 1 = apply staged gain/duration and start fades
0 = disable sample count for all fading streams
A read-only status control reports per-stream fade state:
"Fade Status" - per-stream state for all 10 RX inputs
0 = idle, 1 = active
selftests: livepatch: Check if stack_order sysfs attribute exists
The commit 3dae09de4061 ("livepatch: Add stack_order sysfs attribute"),
merged in v6.14, introduced a new sysfs attribute.
In order to run the selftests on older kernels, check if given kernel
has support for the attribute. If the attribute is not supported, skip
the checks.
selftests: livepatch: Check if replace sysfs attribute exists
The commit adb68ed26a3e ("livepatch: Add "replace" sysfs attribute"),
merged in v6.11, introduced a new sysfs attribute.
In order to run the selftests on older kernels, check if given kernel
has support for the attribute. If the attribute is not supported, skip
the checks.
While at it, create a local variable to hold the module name to be
tested, instead of overwriting MOD_LIVEPATCH.
selftests: livepatch: Check if patched sysfs attribute exists
The commit bb26cfd9e77e
("livepatch: add sysfs entry "patched" for each klp_object") was merged
in v6.1, introducing a new sysfs attribute.
In order to run the selftests on older kernels, check if given kernel
has support for the attribute. If the attribute is not supported, skip
the checks.
Along with this change, use MOD_LIVEPATCH2 variable instead of
reassigning a new value to MOD_LIVEPATCH, and also use the variable
names in the check_result, to avoid using the module names.
selftests: livepatch: Replace true/false module parameter by y/n
Older kernels don't support true/false for boolean module parameters
because they lack commit 0d6ea3ac94ca
("lib/kstrtox.c: add "false"/"true" support to kstrtobool()"). Replace
true/false by y/n so the test module can be loaded on older kernels.
selftests: livepatch: Check for ARCH_HAS_SYSCALL_WRAPPER config
Older kernels that lack CONFIG_ARCH_HAS_SYSCALL_WRAPPER config don't
have any prefixes for their syscalls. The same applies to current
powerpc and loongarch, covering all currently supported architectures
that support livepatch.
The other supported architectures have specific prefixes, so error out
when a new architecture adds livepatch support with wrappers but didn't
update the test to include it.
Breno Leitao [Tue, 5 May 2026 16:02:13 +0000 (09:02 -0700)]
arm64/fpsimd: ptrace: zero target's fpsimd_state, not the tracer's
sve_set_common() is the backend for PTRACE_SETREGSET(NT_ARM_SVE) and
PTRACE_SETREGSET(NT_ARM_SSVE). Every write in the function operates on
the tracee (target) - except a single memset that uses current instead,
zeroing the tracer's saved V0-V31 / FPSR / FPCR shadow on every ptrace
SETREGSET call.
The memset is meant to give the tracee a defined zero register image
before the user-supplied payload is copied in (for partial writes,
header-only writes, and FPSIMD<->SVE format switches). Aiming it at
current both denies the tracee that clean slate and silently corrupts
the tracer.
The corruption of the tracer's saved FPSIMD state is not always
observable. Where the tracer's state is live on a CPU, this may be
reused without loading the corrupted state from memory, and will
eventually be written back over the corrupted state. Where the tracer's
state is saved in SVE_PT_REGS_SVE format, only the FPSR and FPCR are
clobbered, and the effective copy of the vectors is in the task's
sve_state.
Reproducible on an arm64 kernel with SVE: a single-threaded tracer that
loads a known pattern into V0-V31, issues PTRACE_SETREGSET(NT_ARM_SVE)
on a child, and reads V0-V31 back observes them all zeroed within tens
of thousands of iterations when a sibling thread keeps stealing the
FPSIMD CPU binding.
Michal Piekos [Tue, 28 Apr 2026 16:26:58 +0000 (18:26 +0200)]
dt-bindings: timer: allwinner,sun5i-a13-hstimer: add H616 and D1
D1 is similar to existing sun5i, but with different register offsets.
H616 uses same offsets as D1.
Add allwinner,sun20i-d1-hstimer
Add allwinner,sun50i-h616-hstimer with fallback to
allwinner,sun20i-d1-hstimer
Extend schema condition for interrupts to cover D1 compatible variant.
Maoyi Xie [Mon, 4 May 2026 15:37:55 +0000 (23:37 +0800)]
io_uring/wait: honour caller's time namespace for IORING_ENTER_ABS_TIMER
io_uring_enter() with IORING_ENTER_ABS_TIMER takes an absolute
timespec from the caller via ext_arg->ts. It arms an ABS mode
hrtimer in __io_cqring_wait_schedule(). The conversion path in
io_uring/wait.c parses ext_arg->ts inline rather than going
through io_parse_user_time(). It therefore does not pick up the
time namespace conversion added by the previous patch.
Apply timens_ktime_to_host() to the parsed time on the
IORING_ENTER_ABS_TIMER branch. This mirrors the IORING_TIMEOUT_ABS
fix in io_parse_user_time(). Use ctx->clockid as the clock id.
ctx->clockid is set either at ring creation or via
IORING_REGISTER_CLOCK.
timens_ktime_to_host() is a no-op for clocks not affected by time
namespaces. It is also a no-op for callers in the initial time
namespace. The fast path is unchanged.
Reproducer: in unshare --user --time, with a -10s monotonic
offset, call io_uring_enter with min_complete=1,
IORING_ENTER_ABS_TIMER, and ts = now + 1s. The call returns
-ETIME after <1ms instead of after the expected ~1s.
Maoyi Xie [Mon, 4 May 2026 15:37:54 +0000 (23:37 +0800)]
io_uring/timeout: honour caller's time namespace for IORING_TIMEOUT_ABS
io_uring's IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT accept a
timespec from the caller via io_parse_user_time(). With
IORING_TIMEOUT_ABS, the timestamp is an absolute deadline on the
selected clock. The clock is CLOCK_MONOTONIC by default.
CLOCK_BOOTTIME and CLOCK_REALTIME are also selectable.
A submitter inside a CLONE_NEWTIME time namespace observes
CLOCK_MONOTONIC and CLOCK_BOOTTIME shifted by the namespace's
offsets relative to the host. Every other ABS timer interface in
the kernel converts the caller's absolute time to host view via
timens_ktime_to_host() before arming an hrtimer:
io_parse_user_time() does not. As a result, an absolute timeout
submitted from within a time namespace is interpreted in host
view. That is generally a different point in time. It may already
be in the past, causing the timer to fire immediately, or far in
the future, causing the timer not to fire when expected.
Reproducer: in unshare --user --time, with a -10s monotonic
offset, submit IORING_OP_TIMEOUT with IORING_TIMEOUT_ABS and
deadline = now + 1s. The CQE is delivered after <1ms instead of
the expected ~1s.
Apply timens_ktime_to_host() to the parsed time when
IORING_TIMEOUT_ABS is set. Split the existing clock id resolver
in io_timeout_get_clock() into a flags only helper
io_flags_to_clock(), so io_parse_user_time() can resolve the
clock without a struct io_timeout_data.
timens_ktime_to_host() is a no-op for clocks not affected by time
namespaces, e.g. CLOCK_REALTIME. It is also a no-op for callers
in the initial time namespace. The fast path is unchanged.
SQPOLL is also covered. The SQPOLL kernel thread is created via
create_io_thread() with CLONE_THREAD and no CLONE_NEW* flag.
copy_namespaces() therefore shares the submitter's nsproxy by
reference. Inside the SQPOLL kthread, current->nsproxy->time_ns
is the submitter's time_ns. timens_ktime_to_host() resolves
correctly.
Ming Lei [Wed, 6 May 2026 08:22:38 +0000 (16:22 +0800)]
ublk: validate physical_bs_shift, io_min_shift and io_opt_shift
ublk_validate_params() checks logical_bs_shift is within
[9, PAGE_SHIFT] but has no upper bound for physical_bs_shift,
io_min_shift, or io_opt_shift. A malicious userspace can set any
of these to a large value (e.g., 44), causing undefined behavior
from `1 << shift` in ublk_ctrl_start_dev() since the result is
stored in 32-bit unsigned int.
Cap all three at ilog2(SZ_256M) (28). 256M is big enough to cover
all practical block sizes, and originates from the maximum physical
block size possible in NVMe (lba_size * (1 + npwg), where npwg is
16-bit).
Also zero out ub->params with memset() when copy_from_user() fails
or ublk_validate_params() returns error, so that no stale or partial
params survive for a subsequent START_DEV to consume.
drm/bridge: microchip-lvds: fix bus format mismatch with VESA displays
The LVDS controller was hardcoded to JEIDA mapping, which leads to
distorted output on panels expecting VESA mapping.
Update the driver to dynamically select the appropriate mapping and
pixel size based on the panel's advertised media bus format. This
ensures compatibility with both JEIDA and VESA displays.
Signed-off-by: Sandeep Sheriker M <sandeep.sheriker@microchip.com> Signed-off-by: Dharma Balasubiramani <dharma.b@microchip.com> Reviewed-by: Maxime Ripard <mripard@kernel.org> Link: https://patch.msgid.link/20250625-microchip-lvds-v6-3-7ce91f89d35a@microchip.com Signed-off-by: Manikandan Muralidharan <manikandan.m@microchip.com>
drm/bridge: microchip-lvds: Remove unused drm_panel and redundant port node lookup
Drop the unused drm_panel field from the mchp_lvds structure, and remove
the unnecessary port device node lookup, as devm_drm_of_get_bridge()
already performs the required checks internally.
Johannes Berg [Wed, 6 May 2026 09:32:32 +0000 (11:32 +0200)]
wifi: mac80211_hwsim: claim HT STBC capability
This is already claimed for VHT and HE, so it doesn't really
make sense to not claim it for HT, and this causes sigma-dut
failures since it assumes VHT support implies HT support.
Daniel Gabay [Wed, 6 May 2026 03:44:31 +0000 (06:44 +0300)]
wifi: mac80211_hwsim: enable NAN_DATA interface simulation support
Enable NAN_DATA interface simulation support by adding it to the
supported interface types. This completes the NAN Data Path
simulation introduced in the previous patches.
Ilan Peer [Wed, 6 May 2026 03:44:33 +0000 (06:44 +0300)]
wifi: mac80211_hwsim: Support Tx of multicast data on NAN
Add support for transmitting multicast data frames. These
frames can be transmitted when all the peer NDI stations
on the interface are available at the current slot.
Daniel Gabay [Wed, 6 May 2026 03:44:29 +0000 (06:44 +0300)]
wifi: mac80211_hwsim: add NAN data path TX/RX support
Implement TX and RX path handling for NAN Data Path (NDP) frames,
enabling data communication between NAN peers during scheduled
availability windows.
TX path:
- Select TX channel based on current time slot: use DW channel
during Discovery Windows, or FAW channel from local
schedule during Further Availability Windows.
- Verify peer availability before transmission by checking committed
DW schedule or FAW of the peer schedule.
RX path:
- Extend NAN receive filtering to handle NAN_DATA interface frames.
- Accept incoming frames during FAW slots when channel matches local
schedule.
Daniel Gabay [Wed, 6 May 2026 03:44:28 +0000 (06:44 +0300)]
wifi: mac80211_hwsim: set HAS_RATE_CONTROL when using NAN
- NAN switches between bands/channels per its schedule, so mac80211
rate control can't work, set HAS_RATE_CONTROL instead.
- Skip rate control checks for NAN interfaces in
mac80211_hwsim_sta_rc_update() as it's not relevant.
- Move set_rts_threshold stub to HWSIM_COMMON_OPS and return 0 instead
of -EOPNOTSUPP to prevent failures in non-MLO tests that set RTS
threshold (hwsim ignores the use_rts instruction from mac80211
anyway).
Daniel Gabay [Wed, 6 May 2026 03:44:27 +0000 (06:44 +0300)]
wifi: mac80211_hwsim: implement NAN schedule callbacks
Implement mac80211 schedule callbacks for NAN Data Path support:
- Track local schedule via BSS_CHANGED_NAN_LOCAL_SCHED, caching
the channel for each 16TU time slot.
- Copy peer schedule to driver-private storage in
nan_peer_sched_changed callback for use in TX availability
decisions.
Daniel Gabay [Wed, 6 May 2026 03:44:26 +0000 (06:44 +0300)]
wifi: mac80211_hwsim: add NAN PHY capabilities
Add static HT, VHT and HE PHY capabilities to the NAN capabilities
structure. These are based on the existing band capability structures
and initialization in mac80211_hwsim.
The NAN PHY capabilities are used by mac80211 and nl80211 to
advertise device capabilities for NAN data interfaces.
Benjamin Berg [Wed, 6 May 2026 03:44:24 +0000 (06:44 +0300)]
wifi: mac80211_hwsim: implement NAN synchronization
Add all the handling to do NAN synchronization on 2.4 GHz including
sending out beacons. With this, the mac80211_hwsim NAN device also works
when used in conjunction with an external medium simulation.
Note that the TSF sync is not ideal in case of an external medium
simulation. This is because the mactime for received frames needs to be
estimated and the simulation may not update the timestamp of beacons
to the actual time that the frame was transmitted.
The implementation has an initial short phase where it scans for
clusters. This facilitates cluster joining and avoids creating a new
cluster immediately, which would result in two cluster join
notifications. It does not scan otherwise and will only see another
cluster appearing if a discovery beacon happens to be sent during the
2.4 GHz discovery window (DW).
Benjamin Berg [Wed, 6 May 2026 03:44:23 +0000 (06:44 +0300)]
wifi: mac80211_hwsim: protect tsf_offset using a spinlock
To implement NAN synchronization in hwsim, the TSF needs to be adjusted
regularly from the RX path. Add a spinlock so that this can be done in a
safe manner.
Benjamin Berg [Wed, 6 May 2026 03:44:22 +0000 (06:44 +0300)]
wifi: mac80211_hwsim: only RX on NAN when active on a slot
This moves the NAN receive into the main code and changes it so that
frame RX only happens when the device is active on the channel. This
limits RX to the DW slots as there is currently no datapath.
With this the globally stored channel is obsolete, remove it.
Benjamin Berg [Wed, 6 May 2026 03:44:20 +0000 (06:44 +0300)]
wifi: mac80211_hwsim: limit TX of frames to the NAN DW
Frames submitted on the NAN device interface should only be transmitted
during one of the discovery windows (DWs). It is assumed that software
submits frames from the DW end notifications for the next DW period.
Simulate this behaviour by checking that we are currently in a DW before
transmitting from ieee80211_hwsim_wake_tx_queue. As frames will be
queued up at the start of a DW, wake the management TX queue every time
a DW is started. Do so with a randomized offset just to avoid every
client transmitting at the same time.
Miri Korenblit [Tue, 5 May 2026 16:46:13 +0000 (19:46 +0300)]
wifi: cfg80211: don't allow NAN DATA on multi radio devices
The support for NAN DATA was added for single radio devices only. For
example, checking the interface combinations is done for a single radio.
Prevent registration with NAN DATA interface type for multi radio
devices.
Maoyi Xie [Wed, 6 May 2026 06:48:54 +0000 (14:48 +0800)]
wifi: nl80211: re-check wiphy netns in nl80211_prepare_wdev_dump() continuation
NL80211_CMD_GET_SCAN is implemented as a multi-call dumpit. The first
invocation of nl80211_prepare_wdev_dump() validates the requested wdev
against the caller's netns via __cfg80211_wdev_from_attrs(). Subsequent
invocations look up the same wiphy by its global index and do not check
that the wiphy is still in the caller's netns.
Add the same filter to the continuation path. If the wiphy's netns no
longer matches the caller's, return -ENODEV and the netlink dump
machinery terminates the walk cleanly.
Maoyi Xie [Wed, 6 May 2026 06:48:53 +0000 (14:48 +0800)]
wifi: nl80211: require CAP_NET_ADMIN over the target netns in SET_WIPHY_NETNS
NL80211_CMD_SET_WIPHY_NETNS dispatches with GENL_UNS_ADMIN_PERM, which
verifies that the caller has CAP_NET_ADMIN for the source netns. It
doesn't verify that the caller has CAP_NET_ADMIN over the target netns
selected by NL80211_ATTR_NETNS_FD or NL80211_ATTR_PID.
This diverges from the convention enforced in
net/core/rtnetlink.c::rtnl_get_net_ns_capable():
/* For now, the caller is required to have CAP_NET_ADMIN in
* the user namespace owning the target net ns.
*/
if (!sk_ns_capable(sk, net->user_ns, CAP_NET_ADMIN))
return ERR_PTR(-EACCES);
A user with CAP_NET_ADMIN in their own user namespace can therefore
push a wiphy into an arbitrary netns (including init_net) over which
they have no privilege.
Mirror the rtnetlink convention by requiring CAP_NET_ADMIN in the
target netns before calling cfg80211_switch_netns().
This is documented as a u8 and has a policy of NLA_U8, but uses
nla_get_u32() which means it's completely broken on big-endian.
Fix it to use nla_get_u8().
Johannes Berg [Tue, 5 May 2026 13:15:34 +0000 (15:15 +0200)]
wifi: mac80211: remove station if connection prep fails
If connection preparation fails for MLO connections, then the
interface is completely reset to non-MLD. In this case, we must
not keep the station since it's related to the link of the vif
being removed. Delete an existing station. Any "new_sta" is
already being removed, so that doesn't need changes.
This fixes a use-after-free/double-free in debugfs if that's
enabled, because a vif going from MLD (and to MLD, but that's
not relevant here) recreates its entire debugfs.
Add BB diagnostic to track potential abnormal conditions. Currently,
five diagnostic metrics are monitored: 1) Hang detection monitors
consecutive absence of TX and RX activity. 2) PD maximum triggers
when PD stays at its maximum threshold for a period. 3) No RX
occurs when no CCA activity is detected over multiple consecutive
cycles. 4) High FA indicates a high false alarm ratio, reflecting
severe environmental interference. 5) EDCCA alerts when high EDCCA
ratio, signaling a potential TX hang.
These metrics are exposed via debugfs diag_bb.
Output:
[PHY 0]
Diag bitmap = 0x0
Event{Hang, PD MAX, No RX, High FA, High EDCCA Ratio} = {0, 0, 0, 0, 0}
consecutive_no_tx_cnt=0, consecutive_no_rx_cnt=0
Expand RX packet statistics including coding type, spatial
diversity, and beamforming. These statistics are accumulated
per PHY and displayed in bb_info debugfs.
wifi: rtw89: debug: extend bb_info with TX status and PER
Enhance bb_info debugfs by adding TX status information to aid
debugging and performance analysis.
A snapshot of TX-related registers, including PPDU type and subtype,
bandwidth, TX power, STBC, etc. The information is collected per
PHY during track_work and displayed via bb_info debugfs.
gpio: 74x164: support lines-initial-states for boot-time output state
74HC595 and 74LVC594 chains retain their output state from the first
serial write onwards. Today the driver always kicks that first write
from a zero-initialised buffer, so every output comes up low until user
space issues a write. Boards that rely on the chain to drive signals
whose power-on state matters (active-low indicators, reset lines, etc.)
have no way to express the desired initial pattern via DT.
Read the optional lines-initial-states bitmask, recently documented for
this binding, into chip->buffer before the first
__gen_74x164_write_config() so the chain comes up in a known state on
the very first SPI transaction. Bit N maps to GPIO line N (matching the
nxp,pcf8575 convention); on this output-only device, bit=0 drives the
line low and bit=1 drives it high. Property absence keeps the existing
zeroing behaviour intact.
The 74HC595 and 74LVC594 shift registers latch their outputs until the
first serial write, so boards that depend on a specific power-on pattern
(for example active-low indicators, reset lines, or other signals that
must come up non-zero) have no way to express that today: the Linux
driver always writes zeros from its zero-initialised buffer during
probe.
Document support for the existing lines-initial-states bitmask, already
defined for nxp,pcf8575, so the same convention covers this output-only
device. Bit N corresponds to GPIO line N. Because the 74HC595/74LVC594
family is push-pull output only (no input mode, no high-impedance state
under software control), bit=0 drives the line low and bit=1 drives it
high; this differs from nxp,pcf8575, where the 0/1 polarity reflects the
quasi-bidirectional nature of that part.
The bitmask covers up to 32 lines, which fits the typical 1-4 chip
cascades that appear in tree. Should longer chains require seeding in
the future, the property can be extended to a uint32-array without
breaking the bit-N-equals-line-N convention.
Cássio Gabriel [Wed, 6 May 2026 03:34:47 +0000 (00:34 -0300)]
ALSA: core: Serialize deferred fasync state checks
snd_fasync_helper() updates fasync->on under snd_fasync_lock, and
snd_fasync_work_fn() now also evaluates fasync->on under the same
lock. snd_kill_fasync() still tests the flag before taking the lock,
leaving an unsynchronized read against FASYNC enable/disable updates.
Move the enabled-state check into the locked section.
Also clear fasync->on under snd_fasync_lock in snd_fasync_free()
before unlinking the pending entry. Together with the locked sender-side
check, this publishes teardown before flushing the deferred work and
prevents a racing sender from requeueing the entry after free has
started.
Fixes: ef34a0ae7a26 ("ALSA: core: Add async signal helpers") Fixes: 8146cd333d23 ("ALSA: core: Fix potential data race at fasync handling") Cc: stable@vger.kernel.org Signed-off-by: Cássio Gabriel <cassiogabrielcontato@gmail.com> Link: https://patch.msgid.link/20260506-alsa-core-fasync-on-lock-v1-1-ea48c77d6ca4@gmail.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
PMAC (Pseudo MAC) is a circuit within the baseband that can report
various packet-related counters through registers, such as TX ON,
TX EN, CCA, FA, CRC, etc. The driver periodically collects per
PHY PMAC counters in track_work and exposes them through the
bb_info debugfs for easier debugging.
wifi: rtw89: debug: bb_info entry including TX rate count for WiFi 7 chips
Enhance TX performance visibility for WiFi 7 chips by introducing TX
rate count tracking. This is critical for debugging and validation.
Additionally, introduce a new debugfs bb_info to enable and provide
baseband status.
Usage of bb_info debugfs:
$ echo enable 1 > bb_info // Start logging BB statistics information
$ echo mac_id 0 > bb_info // Specify mac_id for TX rate count tracking
Previously, RX statistics such as beacon RSSI and packet
counters were shared across all PHYs. To support MLO,
extend the statistics to be maintained per PHY.
Update the debugfs output for phy_info and beacon_info
to include a "[PHY X]" label for better clarity.
wifi: rtw89: mlo: rearrange MLSR link decision flow
The original MLSR link decision refers to RSSI, but it should be
based on the premise of an existing link. Otherwise, make a link
decision to select a new link from any available band.
The deprecated UNIVERSAL_DEV_PM_OPS() macro uses the provided callbacks
for both runtime PM and system sleep. This causes the DSI clocks to be
disabled twice: once during runtime suspend and again during system
suspend, resulting in a WARN message from the clock framework when
attempting to disable already-disabled clocks.
To address this issue, replace UNIVERSAL_DEV_PM_OPS() with
RUNTIME_PM_OPS(). Bridge and panel drivers should only deal with runtime
PM, as the DRM framework manages system-wide power transitions through
the bridge enable() and disable() hooks.
Rodrigo Faria [Tue, 5 May 2026 18:55:18 +0000 (19:55 +0100)]
ALSA: hda/realtek: Add mute LED fixup for HP Pavilion 15-cs1xxx
Add a new fixup for the mute LED on the HP Pavilion 15-cs1xxx series
using the VREF on NID 0x1b.
The BIOS on these models (tested up to F.32) incorrectly reports
the mute LED on NID 0x18 via DMI OEM strings, which lacks VREF
capabilities. This fixup overrides the LED pin to the correct
NID 0x1b.
Cássio Gabriel [Wed, 6 May 2026 03:15:48 +0000 (00:15 -0300)]
ALSA: seq: Fix UMP group 16 filtering
The sequencer UAPI defines group_filter as an unsigned int bitmap.
Bit 0 filters groupless messages and bits 1-16 filter UMP groups 1-16.
The internal snd_seq_client storage is only unsigned short, so bit 16
is truncated when userspace sets the filter. The same truncation affects
the automatic UMP client filter used to avoid delivery to inactive
groups, so events for group 16 cannot be filtered.
Store the internal bitmap as unsigned int and keep both userspace-provided
and automatically generated values limited to the defined UAPI bits.
Bitterblue Smith [Wed, 29 Apr 2026 12:02:48 +0000 (15:02 +0300)]
wifi: rtl8xxxu: Detect the maximum supported channel width
Some devices malfunction when connected to a network with 40 MHz channel
width, because they don't support that.
RTL8188FU, RTL8192FU, and RTL8710BU (RTL8188GU) have a way to signal
this (and some other capabilities) to the driver. Get this information
from the hardware and advertise 40 MHz support only when the hardware
can handle it. We assume the other chips can always handle it.
RTL8710BU needs a different way to retrieve this information, which will
be implemented some other time.
alarmtimer: Remove stale return description from alarm_handle_timer()
alarm_handle_timer() was converted from returning enum alarmtimer_restart
to void, but the kernel-doc "Return:" line was not removed. Remove the
stale description.
John Stultz [Tue, 28 Apr 2026 17:39:46 +0000 (17:39 +0000)]
selftests/posix_timers: Use CLOCK_THREAD_CPUTIME_ID for ITIMER_PROF measurements
It was reported that the posix_timers test was at times seeing failures
with ITIMER_PROF timers, specifically in cases where the RCU_SOFTIRQ was
taking up significant amounts of time.
Analysis showed that as the time in softirq isn't included in the task
stime + utime accounting used to trigger the SIGPROF so delays from softirq
work could cause it to appear that the signal was incorrectly delayed.
Contributing to this is that the test uses gettimeofday() to measure
itimers, which also means any scheduling delay can also cause failures (as
the task may not be running the entire time).
To fix this, convert all the itimer measurements to use clock_gettime(),
tweaking the logic to use nsecs instead of usecs. Then for ITIMER_PROF
timers, utilize the CLOCK_THREAD_CPUTIME_ID clockid so that it is similarly
measuring the time the task was running.
Systems with heterogeneous CPU capacities, such as big.LITTLE, have
reported power issues since the introduction of the new timer migration
code.
Timers migrate from small capacity CPUs to big ones, degrading their
target residency and thus overall power consumption.
Solve this with splitting hierarchies per CPU capacity. For example in
a big.LITTLE machine, split a single hierarchy in two: one for big
capacity CPUs and another one for small capacity CPUs. This way global
timers only migrate across CPUs of the same capacity.
For simplicity purpose, split hierarchies keep the same number of
possible levels as if there were a single hierarchy, even though the
CPUs are distributed between multiple hierarchies. This could be a
problem on NUMA systems with heterogeneous CPU capacities (provided that
ever exists yet) where useless intermediate nodes may be created.
Solving this properly will imply on boot to know in advance how many
capacities are available and the number of CPUs for each of them.
Reported-by: Sehee Jeong <sehee1.jeong@samsung.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260423165354.95152-5-frederic@kernel.org
When a new root is created, the old root is connected to it and
propagates up its own assumed to be active state, since the hotplug
control CPU is itself active and part of the old root.
However with per-capacity hierarchies, this assumption won't be true
anymore because the hotplug control CPU calling the timer migration
prepare callback may not belong to the same hierarchy as the booting
CPU.
To solve this, track the available CPUs per hierarchies so that the
root connection can be offlined to safe CPUs.
timers/migration: Abstract out hierarchy to prepare for CPU capacity awareness
In order to prepare for separating out CPUs from different capacities in
distinct hierarchies, create a hierarchy structure that group setup
must rely upon.
timers/migration: Fix another hotplug activation race
The hotplug control CPU is assumed to be active in the hierarchy but
that doesn't imply that the root is active. If the current CPU is not
the one that activated the current hierarchy, and the CPU performing
this duty is still halfway through the tree, the root may still be
observed inactive. And this can break the activation of a new root as in
the following scenario:
1) Initially, the whole system has 64 CPUs and only CPU 63 is awake.
[GRP1:0]
active
/ | \
/ | \
[GRP0:0] [...] [GRP0:7]
idle idle active
/ | \ |
CPU 0 CPU 1 ... CPU 63
idle idle active
2) CPU 63 goes idle _but_ due to a #VMEXIT it hasn't yet reached the
[GRP1:0]->parent dereference (that would be NULL and stop the walk)
in __walk_groups_from().
[GRP1:0]
idle
/ | \
/ | \
[GRP0:0] [...] [GRP0:7]
idle idle idle
/ | \ |
CPU 0 CPU 1 ... CPU 63
idle idle idle
3) CPU 1 wakes up, activates GRP0:0 but didn't yet manage to propagate
up to GRP1:0 due to yet another #VMEXIT.
[GRP1:0]
idle
/ | \
/ | \
[GRP0:0] [...] [GRP0:7]
active idle idle
/ | \ |
CPU 0 CPU 1 ... CPU 63
idle active idle
3) CPU 0 wakes up and doesn't need to walk above GRP0:0 as it's CPU 1
role.
[GRP1:0]
idle
/ | \
/ | \
[GRP0:0] [...] [GRP0:7]
active idle idle
/ | \ |
CPU 0 CPU 1 ... CPU 63
active active idle
4) CPU 0 boots CPU 64. It creates a new root for it.
[GRP2:0]
idle
/ \
/ \
[GRP1:0] [GRP1:1]
idle idle
/ | \ \
/ | \ \
[GRP0:0] [...] [GRP0:7] [GRP0:8]
active idle idle idle
/ | \ | |
CPU 0 CPU 1 ... CPU 63 CPU 64
active active idle offline
5) CPU 0 activates the new root, but note that GRP1:0 is still idle,
waiting for CPU 1 to resume from #VMEXIT and activate it.
[GRP2:0]
active
/ \
/ \
[GRP1:0] [GRP1:1]
idle idle
/ | \ \
/ | \ \
[GRP0:0] [...] [GRP0:7] [GRP0:8]
active idle idle idle
/ | \ | |
CPU 0 CPU 1 ... CPU 63 CPU 64
active active idle offline
6) CPU 63 resumes after #VMEXIT and sees the new GRP1:0 parent.
Therefore it propagates the stale inactive state of GRP1:0 up to
GRP2:0.
[GRP2:0]
idle
/ \
/ \
[GRP1:0] [GRP1:1]
idle idle
/ | \ \
/ | \ \
[GRP0:0] [...] [GRP0:7] [GRP0:8]
active idle idle idle
/ | \ | |
CPU 0 CPU 1 ... CPU 63 CPU 64
active active idle offline
7) CPU 1 resumes after #VMEXIT and finally activates GRP1:0. But it
doesn't observe its parent link because no ordering enforced that.
Therefore GRP2:0 is spuriously left idle.
[GRP2:0]
idle
/ \
/ \
[GRP1:0] [GRP1:1]
active idle
/ | \ \
/ | \ \
[GRP0:0] [...] [GRP0:7] [GRP0:8]
active idle idle idle
/ | \ | |
CPU 0 CPU 1 ... CPU 63 CPU 64
active active idle offline
Such races are highly theoretical and the problem would solve itself
once the old root ever becomes idle again. But it still leaves a taste
of discomfort.
Fix it with enforcing a fully ordered atomic read of the old root state
before propagating the activate state up to the new root. It has a two
directions ordering effect:
* Acquire + release of the latest old root state: If the hotplug control
CPU is not the one that woke up the old root, make sure to acquire its
active state and propagate it upwards through the ordered chain of
activation (the acquire pairs with the cmpxchg() in tmigr_active_up()
and subsequent releases will pair with atomic_read_acquire() and
smp_mb__after_atomic() in tmigr_inactive_up()).
* Release: If the hotplug control CPU is not the one that must wake up
the old root, but the CPU covering that is lagging behind its duty,
publish the links from the old root to the new parents. This way the
lagging CPU will propagate the active state itself.
x86/cpu, x86/platform, watchdog: Remove CONFIG_X86_RDC321X support
This depends on M486 CPU support, which has been removed.
Note that we still keep the RDC321X MFD, watchdog and GPIO
drivers, because apparently there were 586/686 CPUs offered with the
RDC321X, according to Arnd Bergmann:
| "the [RDC321X] product line is still actively developed by RDC
| and DM&P, and I suspect that some of the drivers are still used
| on 586tsc-class (vortex86dx, vortex86mx) and 686-class
| (vortex86dx3, vortex86ex) SoCs that do run modern kernels and
| get updates."
For this reason, update the watchdog driver and offer it on
the broader 32-bit landscape, which has been COMPILE_TEST=y
build-tested previously already:
- depends on X86_RDC321X || COMPILE_TEST
+ depends on X86_32 || COMPILE_TEST
The MFD and GPIO drivers were already independent of CONFIG_X86_RDC321X.