Thomas Gleixner [Wed, 19 Nov 2025 17:26:49 +0000 (18:26 +0100)]
sched/mmcid: Cacheline align MM CID storage
Both the per CPU storage and the data in mm_struct are heavily used in
context switch. As they can end up next to other frequently modified data,
they are subject to false sharing.
Thomas Gleixner [Wed, 19 Nov 2025 17:26:45 +0000 (18:26 +0100)]
sched/mmcid: Revert the complex CID management
The CID management is a complex beast, which affects both scheduling and
task migration. The compaction mechanism forces random tasks of a process
into task work on exit to user space causing latency spikes.
Revert back to the initial simple bitmap allocating mechanics, which are
known to have scalability issues as that allows to gradually build up a
replacement functionality in a reviewable way.
Thomas Gleixner [Mon, 27 Oct 2025 08:45:26 +0000 (09:45 +0100)]
rseq: Switch to TIF_RSEQ if supported
TIF_NOTIFY_RESUME is a multiplexing TIF bit, which is suboptimal especially
with the RSEQ fast path depending on it, but not really handling it.
Define a separate TIF_RSEQ in the generic TIF space and enable the full
separation of fast and slow path for architectures which utilize that.
That avoids the hassle with invocations of resume_user_mode_work() from
hypervisors, which clear TIF_NOTIFY_RESUME. It makes the therefore required
re-evaluation at the end of vcpu_run() a NOOP on architectures which
utilize the generic TIF space and have a separate TIF_RSEQ.
The hypervisor TIF handling does not include the separate TIF_RSEQ as there
is no point in doing so. The guest does neither know nor care about the VMM
host applications RSEQ state. That state is only relevant when the ioctl()
returns to user space.
The fastpath implementation still utilizes TIF_NOTIFY_RESUME for failure
handling, but this only happens within exit_to_user_mode_loop(), so
arguably the hypervisor ioctl() code is long done when this happens.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.903622031@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:45:24 +0000 (09:45 +0100)]
rseq: Split up rseq_exit_to_user_mode()
Separate the interrupt and syscall exit handling. Syscall exit does not
require to clear the user_irq bit as it can't be set. On interrupt exit it
can be set when the interrupt did not result in a scheduling event and
therefore the return path did not invoke the TIF work handling, which would
have cleared it.
The debug check for the event state is also not really required even when
debug mode is enabled via the static key. Debug mode is largely aiding user
space by enabling a larger amount of validation checks, which cause a
segfault when a malformed critical section is detected. In production mode
the critical section handling takes the content mostly as is and lets user
space keep the pieces when it screwed up.
On kernel changes in that area the state check is useful, but that can be
done when lockdep is enabled, which is anyway a required test scenario for
fundamental changes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.842785700@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:45:21 +0000 (09:45 +0100)]
entry: Split up exit_to_user_mode_prepare()
exit_to_user_mode_prepare() is used for both interrupts and syscalls, but
there is extra rseq work, which is only required for in the interrupt exit
case.
Split up the function and provide wrappers for syscalls and interrupts,
which allows to separate the rseq exit work in the next step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.782234789@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:45:19 +0000 (09:45 +0100)]
rseq: Switch to fast path processing on exit to user
Now that all bits and pieces are in place, hook the RSEQ handling fast path
function into exit_to_user_mode_prepare() after the TIF work bits have been
handled. If case of fast path failure, TIF_NOTIFY_RESUME has been raised
and the caller needs to take another turn through the TIF handling slow
path.
This only works for architectures which use the generic entry code.
Architectures who still have their own incomplete hacks are not supported
and won't be.
The 'cs cleared' and 'cs fixup' percentages are not relative to the exit to
user invocations, they are relative to the actual 'cs check' invocations.
While some of this could have been avoided in the original code, like the
obvious clearing of CS when it's already clear, the main problem of going
through TIF_NOTIFY_RESUME cannot be solved. In some workloads the RSEQ
notify handler is invoked more than once before going out to user
space. Doing this once when everything has stabilized is the only solution
to avoid this.
The initial attempt to completely decouple it from the TIF work turned out
to be suboptimal for workloads, which do a lot of quick and short system
calls. Even if the fast path decision is only 4 instructions (including a
conditional branch), this adds up quickly and becomes measurable when the
rate for actually having to handle rseq is in the low single digit
percentage range of user/kernel transitions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.701201365@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:45:17 +0000 (09:45 +0100)]
rseq: Implement fast path for exit to user
Implement the actual logic for handling RSEQ updates in a fast path after
handling the TIF work and at the point where the task is actually returning
to user space.
This is the right point to do that because at this point the CPU and the MM
CID are stable and cannot longer change due to yet another reschedule.
That happens when the task is handling it via TIF_NOTIFY_RESUME in
resume_user_mode_work(), which is invoked from the exit to user mode work
loop.
The function is invoked after the TIF work is handled and runs with
interrupts disabled, which means it cannot resolve page faults. It
therefore disables page faults and in case the access to the user space
memory faults, it:
- notes the fail in the event struct
- raises TIF_NOTIFY_RESUME
- returns false to the caller
The caller has to go back to the TIF work, which runs with interrupts
enabled and therefore can resolve the page faults. This happens mostly on
fork() when the memory is marked COW.
If the user memory inspection finds invalid data, the function returns
false as well and sets the fatal flag in the event struct along with
TIF_NOTIFY_RESUME. The slow path notify handler has to evaluate that flag
and terminate the task with SIGSEGV as documented.
The initial decision to invoke any of this is based on one flags in the
event struct: @sched_switch. The decision is in pseudo ASM:
So for the common case where the task was not scheduled out, this really
boils down to three instructions before going out if the compiler is not
completely stupid (and yes, some of them are).
If the condition is true, then it checks, whether CPU ID or MM CID have
changed. If so, then the CPU/MM IDs have to be updated and are thereby
cached for the next round. The update unconditionally retrieves the user
space critical section address to spare another user*begin/end() pair. If
that's not zero and tsk::event::user_irq is set, then the critical section
is analyzed and acted upon. If either zero or the entry came via syscall
the critical section analysis is skipped.
If the comparison is false then the critical section has to be analyzed
because the event flag is then only true when entry from user was by
interrupt.
This is provided without the actual hookup to let reviewers focus on the
implementation details. The hookup happens in the next step.
Note: As with quite some other optimizations this depends on the generic
entry infrastructure and is not enabled to be sucked into random
architecture implementations.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.638929615@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:45:14 +0000 (09:45 +0100)]
rseq: Optimize event setting
After removing the various condition bits earlier it turns out that one
extra information is needed to avoid setting event::sched_switch and
TIF_NOTIFY_RESUME unconditionally on every context switch.
The update of the RSEQ user space memory is only required, when either
the task was interrupted in user space and schedules
or
the CPU or MM CID changes in schedule() independent of the entry mode
Right now only the interrupt from user information is available.
Add an event flag, which is set when the CPU or MM CID or both change.
Evaluate this event in the scheduler to decide whether the sched_switch
event and the TIF bit need to be set.
It's an extra conditional in context_switch(), but the downside of
unconditionally handling RSEQ after a context switch to user is way more
significant. The utilized boolean logic minimizes this to a single
conditional branch.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.578058898@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:45:12 +0000 (09:45 +0100)]
rseq: Rework the TIF_NOTIFY handler
Replace the whole logic with a new implementation, which is shared with
signal delivery and the upcoming exit fast path.
Contrary to the original implementation, this ignores invocations from
KVM/IO-uring, which invoke resume_user_mode_work() with the @regs argument
set to NULL.
The original implementation updated the CPU/Node/MM CID fields, but that
was just a side effect, which was addressing the problem that this
invocation cleared TIF_NOTIFY_RESUME, which in turn could cause an update
on return to user space to be lost.
This problem has been addressed differently, so that it's not longer
required to do that update before entering the guest.
That might be considered a user visible change, when the hosts thread TLS
memory is mapped into the guest, but as this was never intentionally
supported, this abuse of kernel internal implementation details is not
considered an ABI break.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.517640811@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:45:10 +0000 (09:45 +0100)]
rseq: Separate the signal delivery path
Completely separate the signal delivery path from the notify handler as
they have different semantics versus the event handling.
The signal delivery only needs to ensure that the interrupted user context
was not in a critical section or the section is aborted before it switches
to the signal frame context. The signal frame context does not have the
original instruction pointer anymore, so that can't be handled on exit to
user space.
No point in updating the CPU/CID ids as they might change again before the
task returns to user space for real.
The fast path optimization, which checks for the 'entry from user via
interrupt' condition is only available for architectures which use the
generic entry code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.455429038@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:45:08 +0000 (09:45 +0100)]
rseq: Provide and use rseq_set_ids()
Provide a new and straight forward implementation to set the IDs (CPU ID,
Node ID and MM CID), which can be later inlined into the fast path.
It does all operations in one scoped_user_rw_access() section and retrieves
also the critical section member (rseq::cs_rseq) from user space to avoid
another user..begin/end() pair. This is in preparation for optimizing the
fast path to avoid extra work when not required.
On rseq registration set the CPU ID fields to RSEQ_CPU_ID_UNINITIALIZED and
node and MM CID to zero. That's the same as the kernel internal reset
values. That makes the debug validation in the exit code work correctly on
the first exit to user space.
Use it to replace the whole related zoo in rseq.c
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.393972266@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:45:02 +0000 (09:45 +0100)]
rseq: Make exit debugging static branch based
Disconnect it from the config switch and use the static debug branch. This
is a temporary measure for validating the rework. At the end this check
needs to be hidden behind lockdep as it has nothing to do with the other
debug infrastructure, which mainly aids user space debugging by enabling a
zoo of checks which terminate misbehaving tasks instead of letting them
keep the hard to diagnose pieces.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.272660745@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:44:57 +0000 (09:44 +0100)]
rseq: Provide and use rseq_update_user_cs()
Provide a straight forward implementation to check for and eventually
clear/fixup critical sections in user space.
The non-debug version does only the minimal sanity checks and aims for
efficiency.
There are two attack vectors, which are checked for:
1) An abort IP which is in the kernel address space. That would cause at
least x86 to return to kernel space via IRET.
2) A rogue critical section descriptor with an abort IP pointing to some
arbitrary address, which is not preceded by the RSEQ signature.
If the section descriptors are invalid then the resulting misbehaviour of
the user space application is not the kernels problem.
The kernel provides a run-time switchable debug slow path, which implements
the full zoo of checks including termination of the task when one of the
gazillion conditions is not met.
Replace the zoo in rseq.c with it and invoke it from the TIF_NOTIFY_RESUME
handler. Move the remainders into the CONFIG_DEBUG_RSEQ section, which will
be replaced and removed in a subsequent step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.151465632@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:44:52 +0000 (09:44 +0100)]
rseq: Expose lightweight statistics in debugfs
Analyzing the call frequency without actually using tracing is helpful for
analysis of this infrastructure. The overhead is minimal as it just
increments a per CPU counter associated to each operation.
The debugfs readout provides a racy sum of all counters.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.027916598@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:44:50 +0000 (09:44 +0100)]
rseq: Provide tracepoint wrappers for inline code
Provide tracepoint wrappers for the upcoming RSEQ exit to user space inline
fast path, so that the header can be safely included by code which defines
actual trace points.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.967114316@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:44:48 +0000 (09:44 +0100)]
rseq: Record interrupt from user space
For RSEQ the only relevant reason to inspect and eventually fixup (abort)
user space critical sections is when user space was interrupted and the
task was scheduled out.
If the user to kernel entry was from a syscall no fixup is required. If
user space invokes a syscall from a critical section it can keep the
pieces as documented.
This is only supported on architectures which utilize the generic entry
code. If your architecture does not use it, bad luck.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.905067101@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:44:45 +0000 (09:44 +0100)]
rseq: Cache CPU ID and MM CID values
In preparation for rewriting RSEQ exit to user space handling provide
storage to cache the CPU ID and MM CID values which were written to user
space. That prepares for a quick check, which avoids the update when
nothing changed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.841964081@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:44:33 +0000 (09:44 +0100)]
rseq: Introduce struct rseq_data
In preparation for a major rewrite of this code, provide a data structure
for rseq management.
Put all the rseq related data into it (except for the debug part), which
allows to simplify fork/execve by using memset() and memcpy() instead of
adding new fields to initialize over and over.
Create a storage struct for event management as well and put the
sched_switch event and a indicator for RSEQ on a task into it as a
start. That uses a union, which allows to mask and clear the whole lot
efficiently.
The indicators are explicitly not a bit field. Bit fields generate abysmal
code.
The boolean members are defined as u8 as that actually guarantees that it
fits. There seem to be strange architecture ABIs which need more than 8
bits for a boolean.
The has_rseq member is redundant vs. task::rseq, but it turns out that
boolean operations and quick checks on the union generate better code than
fiddling with separate entities and data types.
This struct will be extended over time to carry more information.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.527086690@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:44:28 +0000 (09:44 +0100)]
rseq, virt: Retrigger RSEQ after vcpu_run()
Hypervisors invoke resume_user_mode_work() before entering the guest, which
clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user
space context available to them, so the rseq notify handler skips
inspecting the critical section, but updates the CPU/MM CID values
unconditionally so that the eventual pending rseq event is not lost on the
way to user space.
This is a pointless exercise as the task might be rescheduled before
actually returning to user space and it creates unnecessary work in the
vcpu_run() loops.
It's way more efficient to ignore that invocation based on @regs == NULL
and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from the
vcpu_run() loop before returning from the ioctl().
This ensures that a pending RSEQ update is not lost and the IDs are updated
before returning to user space.
Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns into
a NOOP.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20251027084306.399495855@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:44:26 +0000 (09:44 +0100)]
rseq: Simplify the event notification
Since commit 0190e4198e47 ("rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_*
flags") the bits in task::rseq_event_mask are meaningless and just extra
work in terms of setting them individually.
Aside of that the only relevant point where an event has to be raised is
context switch. Neither the CPU nor MM CID can change without going through
a context switch.
Collapse them all into a single boolean which simplifies the code a lot and
remove the pointless invocations which have been sprinkled all over the
place for no value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.336978188@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:44:20 +0000 (09:44 +0100)]
rseq: Move algorithm comment to top
Move the comment which documents the RSEQ algorithm to the top of the file,
so it does not create horrible diffs later when the actual implementation
is fed into the mincer.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.149519580@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:44:16 +0000 (09:44 +0100)]
rseq: Avoid pointless evaluation in __rseq_notify_resume()
The RSEQ critical section mechanism only clears the event mask when a
critical section is registered, otherwise it is stale and collects
bits.
That means once a critical section is installed the first invocation of
that code when TIF_NOTIFY_RESUME is set will abort the critical section,
even when the TIF bit was not raised by the rseq preempt/migrate/signal
helpers.
This also has a performance implication because TIF_NOTIFY_RESUME is a
multiplexing TIF bit, which is utilized by quite some infrastructure. That
means every invocation of __rseq_notify_resume() goes unconditionally
through the heavy lifting of user space access and consistency checks even
if there is no reason to do so.
Keeping the stale event mask around when exiting to user space also
prevents it from being utilized by the upcoming time slice extension
mechanism.
Avoid this by reading and clearing the event mask before doing the user
space critical section access with interrupts or preemption disabled, which
ensures that the read and clear operation is CPU local atomic versus
scheduling and the membarrier IPI.
This is correct as after re-enabling interrupts/preemption any relevant
event will set the bit again and raise TIF_NOTIFY_RESUME, which makes the
user space exit code take another round of TIF bit clearing.
If the event mask was non-zero, invoke the slow path. On debug kernels the
slow path is invoked unconditionally and the result of the event mask
evaluation is handed in.
Add a exit path check after the TIF bit loop, which validates on debug
kernels that the event mask is zero before exiting to user space.
While at it reword the convoluted comment why the pt_regs pointer can be
NULL under certain circumstances.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.022571576@linutronix.de
Thomas Gleixner [Mon, 27 Oct 2025 08:44:00 +0000 (09:44 +0100)]
futex: Convert to get/put_user_inline()
Replace the open coded implementation with the new get/put_user_inline()
helpers. This might be replaced by a regular get/put_user(), but that needs
a proper performance evaluation.
This got worse with the recent addition of masked user access, which
optimizes the speculation prevention:
if (can_do_masked_user_access())
from = masked_user_read_access_begin((from));
else if (!user_read_access_begin(from, sizeof(*from)))
return -EFAULT;
unsafe_get_user(val, from, Efault);
user_read_access_end();
return 0;
Efault:
user_read_access_end();
return -EFAULT;
There have been issues with using the wrong user_*_access_end() variant in
the error path and other typical Copy&Pasta problems, e.g. using the wrong
fault label in the user accessor which ends up using the wrong accesss end
variant.
These patterns beg for scopes with automatic cleanup. The resulting outcome
is:
scoped_user_read_access(from, Efault)
unsafe_get_user(val, from, Efault);
return 0;
Efault:
return -EFAULT;
The scope guarantees the proper cleanup for the access mode is invoked both
in the success and the failure (fault) path.
The scoped_user_$MODE_access() macros are implemented as self terminating
nested for() loops. Thanks to Andrew Cooper for pointing me at them. The
scope can therefore be left with 'break', 'goto' and 'return'. Even
'continue' "works" due to the self termination mechanism. Both GCC and
clang optimize all the convoluted macro maze out and the above results with
clang in:
The fault labels for the scoped*() macros and the fault labels for the
actual user space accessors can be shared and must be placed outside of the
scope.
If masked user access is enabled on an architecture, then the pointer
handed in to scoped_user_$MODE_access() can be modified to point to a
guaranteed faulting user address. This modification is only scope local as
the pointer is aliased inside the scope. When the scope is left the alias
is not longer in effect. IOW the original pointer value is preserved so it
can be used e.g. for fixup or diagnostic purposes in the fault path.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027083745.546420421@linutronix.de
Thomas Gleixner [Fri, 31 Oct 2025 09:37:09 +0000 (10:37 +0100)]
arm64: uaccess: Use unsafe wrappers for ASM GOTO
Clang propagates a provided label, which is outside of a cleanup scope to
ASM GOTO despite the fact that __raw_get_mem() has a local label for that
purpose:
"error: cannot jump from this asm goto statement to one of its possible targets"
Using the unsafe wrapper with the extra local label indirection cures that.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
It ends up leaking the pagefault disable counter in the fault path. clang
at least fails the build.
Rename unsafe_*_user() to arch_unsafe_*_user() which makes the generic
uaccess header wrap it with a local label that makes both compilers emit
correct code. Same for the kernel_nofault() variants.
It ends up leaking the pagefault disable counter in the fault path. clang
at least fails the build.
Rename unsafe_*_user() to arch_unsafe_*_user() which makes the generic
uaccess header wrap it with a local label that makes both compilers emit
correct code. Same for the kernel_nofault() variants.
It ends up leaking the pagefault disable counter in the fault path. clang
at least fails the build.
Rename unsafe_*_user() to arch_unsafe_*_user() which makes the generic
uaccess header wrap it with a local label that makes both compilers emit
correct code. Same for the kernel_nofault() variants.
When CONFIG_CPU_SPECTRE=n then get_user() is missing the 8 byte ASM variant
for no real good reason. This prevents using get_user(u64) in generic code.
Implement it as a sequence of two 4-byte reads with LE/BE awareness and
make the unsigned long (or long long) type for the intermediate variable to
read into dependend on the the target type.
The __long_type() macro and idea was lifted from PowerPC. Thanks to
Christophe for pointing it out.
Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509120155.pFgwfeUD-lkp@intel.com/ Link: https://patch.msgid.link/20251027083745.168468637@linutronix.de
Linus Torvalds [Sat, 1 Nov 2025 17:49:12 +0000 (10:49 -0700)]
Merge tag 'regulator-fix-v6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fix from Mark Brown:
"A simple fix for a missed part of an API conversion in the bd718x7
driver"
* tag 'regulator-fix-v6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: bd718x7: Fix voltages scaled by resistor divider
Linus Torvalds [Sat, 1 Nov 2025 17:45:39 +0000 (10:45 -0700)]
Merge tag 'regmap-fix-v6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap fixes from Mark Brown:
"One documentation fix and a fix for a problem with the slimbus regmap
which was uncovered by some changes in one of the drivers"
* tag 'regmap-fix-v6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: irq: Correct documentation of wake_invert flag
regmap: slimbus: fix bus_context pointer in regmap init calls
Linus Torvalds [Sat, 1 Nov 2025 17:20:07 +0000 (10:20 -0700)]
Merge tag 'x86-urgent-2025-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Limit AMD microcode Entrysign sha256 signature checking to
known CPU generations
- Disable AMD RDSEED32 on certain Zen5 CPUs that have a
microcode version before when the microcode-based fix was
issued for the AMD-SB-7055 erratum
- Fix FPU AMD XFD state synchronization on signal delivery
- Fix (work around) a SSE4a-disassembly related build failure
on X86_NATIVE_CPU=y builds
- Extend the AMD Zen6 model space with a new range of models
- Fix <asm/intel-family.h> CPU model comments
- Fix the CONFIG_CFI=y and CONFIG_LTO_CLANG_FULL=y build, which
was unhappy due to missing kCFI type annotations of clear_page()
variants
* tag 'x86-urgent-2025-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Ensure clear_page() variants always have __kcfi_typeid_ symbols
x86/cpu: Add/fix core comments for {Panther,Nova} Lake
x86/CPU/AMD: Extend Zen6 model range
x86/build: Disable SSE4a
x86/fpu: Ensure XFD state on signal delivery
x86/CPU/AMD: Add RDSEED fix for Zen5
x86/microcode/AMD: Limit Entrysign signature checking to known generations
Linus Torvalds [Sat, 1 Nov 2025 17:17:40 +0000 (10:17 -0700)]
Merge tag 'perf-urgent-2025-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event fixes from Ingo Molnar:
"Miscellaneous fixes and CPU model updates:
- Fix an out-of-bounds access on non-hybrid platforms in the Intel
PMU DS code, reported by KASAN
- Add WildcatLake PMU and uncore support: it's identical to the
PantherLake version"
* tag 'perf-urgent-2025-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Add uncore PMU support for Wildcat Lake
perf/x86/intel: Add PMU support for WildcatLake
perf/x86/intel: Fix KASAN global-out-of-bounds warning
Linus Torvalds [Sat, 1 Nov 2025 17:07:35 +0000 (10:07 -0700)]
Merge tag 'objtool-urgent-2025-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fix from Ingo Molnar:
"Fix objtool warning when faced with raw STAC/CLAC instructions"
* tag 'objtool-urgent-2025-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix skip_alt_group() for non-alternative STAC/CLAC
Linus Torvalds [Sat, 1 Nov 2025 17:04:35 +0000 (10:04 -0700)]
Merge tag 'xfs-fixes-6.18-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
"Just a single bug fix (and documentation for the issue)"
* tag 'xfs-fixes-6.18-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: document another racy GC case in xfs_zoned_map_extent
xfs: prevent gc from picking the same zone twice
Linus Torvalds [Sat, 1 Nov 2025 17:00:53 +0000 (10:00 -0700)]
Merge tag 'kbuild-fixes-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux
Pull Kbuild fixes from Nathan Chancellor:
- Formally adopt Kconfig in MAINTAINERS
- Fix install-extmod-build for more O= paths
- Align end of .modinfo to fix Authenticode calculation in EDK2
- Restore dynamic check for '-fsanitize=kernel-memory' in
CONFIG_HAVE_KMSAN_COMPILER to ensure backend target has support
for it
- Initialize locale in menuconfig and nconfig to fix UTF-8 terminals
that may not support VT100 ACS by default like PuTTY
* tag 'kbuild-fixes-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
kconfig/nconf: Initialize the default locale at startup
kconfig/mconf: Initialize the default locale at startup
KMSAN: Restore dynamic check for '-fsanitize=kernel-memory'
kbuild: align modinfo section for Secureboot Authenticode EDK2 compat
kbuild: install-extmod-build: Fix when given dir outside the build dir
MAINTAINERS: Update Kconfig section
Josh Poimboeuf [Wed, 29 Oct 2025 19:54:08 +0000 (12:54 -0700)]
objtool: Fix skip_alt_group() for non-alternative STAC/CLAC
If an insn->alt points to a STAC/CLAC instruction, skip_alt_group()
assumes it's part of an alternative ("alt group") as opposed to some
other kind of "alt" such as an exception fixup.
While that assumption may hold true in the current code base, Linus has
an out-of-tree patch which breaks that assumption by replacing the
STAC/CLAC alternatives with raw STAC/CLAC instructions.
Make skip_alt_group() more robust by making sure it's actually an alt
group before continuing.
Jakub Horký [Tue, 14 Oct 2025 14:44:06 +0000 (16:44 +0200)]
kconfig/nconf: Initialize the default locale at startup
Fix bug where make nconfig doesn't initialize the default locale, which
causes ncurses menu borders to be displayed incorrectly (lqqqqk) in
UTF-8 terminals that don't support VT100 ACS by default, such as PuTTY.
Jakub Horký [Tue, 14 Oct 2025 15:49:32 +0000 (17:49 +0200)]
kconfig/mconf: Initialize the default locale at startup
Fix bug where make menuconfig doesn't initialize the default locale, which
causes ncurses menu borders to be displayed incorrectly (lqqqqk) in
UTF-8 terminals that don't support VT100 ACS by default, such as PuTTY.
- Reject negative head_room in __bpf_skb_change_head (Daniel Borkmann)
- Conditionally include dynptr copy kfuncs (Malin Jonsson)
- Sync pending IRQ work before freeing BPF ring buffer (Noorain Eqbal)
- Do not audit capability check in x86 do_jit() (Ondrej Mosnacek)
- Fix arm64 JIT of BPF_ST insn when it writes into arena memory
(Puranjay Mohan)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf/arm64: Fix BPF_ST into arena memory
bpf: Make migrate_disable always inline to avoid partial inlining
bpf: Reject negative head_room in __bpf_skb_change_head
bpf: Conditionally include dynptr copy kfuncs
libbpf: Fix powerpc's stack register definition in bpf_tracing.h
bpf: Do not audit capability check in do_jit()
bpf: Sync pending IRQ work before freeing ring buffer
x86/mm: Ensure clear_page() variants always have __kcfi_typeid_ symbols
When building with CONFIG_CFI=y and CONFIG_LTO_CLANG_FULL=y, there is a series
of errors from the various versions of clear_page() not having __kcfi_typeid_
symbols.
$ cat kernel/configs/repro.config
CONFIG_CFI=y
# CONFIG_LTO_NONE is not set
CONFIG_LTO_CLANG_FULL=y
$ make -skj"$(nproc)" ARCH=x86_64 LLVM=1 clean defconfig repro.config bzImage
ld.lld: error: undefined symbol: __kcfi_typeid_clear_page_rep
>>> referenced by ld-temp.o
>>> vmlinux.o:(__cfi_clear_page_rep)
With full LTO, it is possible for LLVM to realize that these functions never
have their address taken (as they are only used within an alternative, which
will make them a direct call) across the whole kernel and either drop or skip
generating their kCFI type identification symbols.
clear_page_{rep,orig,erms}() are defined in clear_page_64.S with
SYM_TYPED_FUNC_START as a result of
2981557cb040 ("x86,kcfi: Fix EXPORT_SYMBOL vs kCFI"),
as exported functions are free to be called indirectly thus need kCFI type
identifiers.
Use KCFI_REFERENCE with these clear_page() functions to force LLVM to see
these functions as address-taken and generate then keep the kCFI type
identifiers.
Linus Torvalds [Fri, 31 Oct 2025 21:47:02 +0000 (14:47 -0700)]
Merge tag 'drm-fixes-2025-10-31' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Simona Vetter:
"Looks like stochastics conspired to make this one a bit bigger, but
nothing scary at all. Also first examples of the new Link: tags, yay!
* tag 'drm-fixes-2025-10-31' of https://gitlab.freedesktop.org/drm/kernel: (44 commits)
drm/ast: Clear preserved bits from register output value
drm/imx: parallel-display: add the bridge before attaching it
drm/imx: parallel-display: convert to devm_drm_bridge_alloc() API
drm/panel: kingdisplay-kd097d04: Disable EoTp
drm/panel: sitronix-st7789v: fix sync flags for t28cp45tn89
drm/xe: Do not wake device during a GT reset
drm/xe: Fix uninitialized return value from xe_validation_guard()
drm/msm/dpu: Fix adjusted mode clock check for 3d merge
drm/msm/dpu: Disable broken YUV on QSEED2 hardware
drm/msm/dpu: Require linear modifier for writeback framebuffers
drm/msm/dpu: Fix pixel extension sub-sampling
drm/msm/dpu: Disable scaling for unsupported scaler types
drm/msm/dpu: Propagate error from dpu_assign_plane_resources
drm/msm/dpu: Fix allocation of RGB SSPPs without scaling
drm/msm: dsi: fix PLL init in bonded mode
drm/i915/dmc: Clear HRR EVT_CTL/HTP to zero on ADL-S
drm/amd/display: Fix incorrect return of vblank enable on unconfigured crtc
drm/amd/display: Add HDR workaround for a specific eDP
drm/amdgpu: fix SPDX header on cyan_skillfish_reg_init.c
drm/amdgpu: fix SPDX header on irqsrcs_vcn_5_0.h
...
Linus Torvalds [Fri, 31 Oct 2025 21:24:32 +0000 (14:24 -0700)]
Merge tag 'pci-v6.18-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fixes from Bjorn Helgaas:
- Restore custom qcom ASPM enablement code so L1 PM Substates are
enabled as they were in v6.17 even though the PCI core now enables
just L0s and L1 by default (Bjorn Helgaas)
- Size prefetchable bridge windows only when they actually exist, to
avoid a WARN_ON() regression (Ilpo Järvinen)
* tag 'pci-v6.18-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI: Do not size non-existing prefetchable window
Revert "PCI: qcom: Remove custom ASPM enablement code"
Linus Torvalds [Fri, 31 Oct 2025 21:20:09 +0000 (14:20 -0700)]
Merge tag 'vfio-v6.18-rc4' of https://github.com/awilliam/linux-vfio
Pull VFIO fixes from Alex Williamson:
- Fix overflows in vfio type1 backend for mappings at the end of the
64-bit address space, resulting in leaked pinned memory.
New selftest support included to avoid such issues in the future
(Alex Mastro)
* tag 'vfio-v6.18-rc4' of https://github.com/awilliam/linux-vfio:
vfio: selftests: add end of address space DMA map/unmap tests
vfio: selftests: update DMA map/unmap helpers to support more test kinds
vfio/type1: handle DMA map/unmap up to the addressable limit
vfio/type1: move iova increment to unmap_unpin_*() caller
vfio/type1: sanitize for overflow using check_*_overflow()
Ilpo Järvinen [Mon, 27 Oct 2025 13:24:23 +0000 (15:24 +0200)]
PCI: Do not size non-existing prefetchable window
pbus_size_mem() should only be called for bridge windows that exist but
__pci_bus_size_bridges() may point 'pref' to a resource that does not exist
(has zero flags) in case of non-root buses.
When prefetchable bridge window does not exist, the same non-prefetchable
bridge window is sized more than once which may result in duplicating
entries into the realloc_head list. Duplicated entries are shown in this
log and trigger a WARN_ON() because realloc_head had residual entries after
the resource assignment algorithm:
pci 0000:00:03.0: [11ab:6820] type 01 class 0x060400 PCIe Root Port
pci 0000:00:03.0: PCI bridge to [bus 00]
pci 0000:00:03.0: bridge window [io 0x0000-0x0fff]
pci 0000:00:03.0: bridge window [mem 0x00000000-0x000fffff]
pci 0000:00:03.0: bridge window [mem 0x00200000-0x003fffff] to [bus 02] add_size 200000 add_align 200000
pci 0000:00:03.0: bridge window [mem 0x00200000-0x003fffff] to [bus 02] add_size 200000 add_align 200000
pci 0000:00:03.0: bridge window [mem 0xe0000000-0xe03fffff]: assigned
pci 0000:00:03.0: PCI bridge to [bus 02]
pci 0000:00:03.0: bridge window [mem 0xe0000000-0xe03fffff]
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at drivers/pci/setup-bus.c:2373 pci_assign_unassigned_root_bus_resources+0x1bc/0x234
Check resource flags of 'pref' and only size the prefetchable window if the
resource has the IORESOURCE_PREFETCH flag.
Fixes: ae88d0b9c57f ("PCI: Use pbus_select_window_for_type() during mem window sizing") Reported-by: Klaus Kudielka <klaus.kudielka@gmail.com> Closes: https://lore.kernel.org/r/51e8cf1c62b8318882257d6b5a9de7fdaaecc343.camel@gmail.com/ Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Klaus Kudielka <klaus.kudielka@gmail.com> Link: https://patch.msgid.link/20251027132423.8841-1-ilpo.jarvinen@linux.intel.com
Prior to a729c1664619 ("PCI: qcom: Remove custom ASPM enablement code"),
the qcom controller driver enabled ASPM, including L0s, L1, and L1 PM
Substates, for all devices powered on at the time the controller driver
enumerates them.
ASPM was *not* enabled for devices powered on later by pwrctrl (unless the
kernel was built with PCIEASPM_POWERSAVE or PCIEASPM_POWER_SUPERSAVE, or
the user enabled ASPM via module parameter or sysfs).
After f3ac2ff14834 ("PCI/ASPM: Enable all ClockPM and ASPM states for
devicetree platforms"), the PCI core enabled all ASPM states for all
devices whether powered on initially or by pwrctrl, so a729c1664619 was
unnecessary and reverted.
But f3ac2ff14834 was too aggressive and broke platforms that didn't support
CLKREQ# or required device-specific configuration for L1 Substates, so df5192d9bb0e ("PCI/ASPM: Enable only L0s and L1 for devicetree platforms")
enabled only L0s and L1.
On Qualcomm platforms, this left L1 Substates disabled, which was a
regression. Revert a729c1664619 so L1 Substates will be enabled on devices
that are initially powered on. Devices powered on by pwrctrl will be
addressed later.
Fixes: df5192d9bb0e ("PCI/ASPM: Enable only L0s and L1 for devicetree platforms") Reported-by: Johan Hovold <johan@kernel.org> Closes: https://lore.kernel.org/lkml/aPuXZlaawFmmsLmX@hovoldconsulting.com/ Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Johan Hovold <johan@kernel.org> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org> Link: https://patch.msgid.link/20251024210514.1365996-1-helgaas@kernel.org
Linus Torvalds [Fri, 31 Oct 2025 19:57:19 +0000 (12:57 -0700)]
Merge tag 'block-6.18-20251031' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- Fix blk-crypto reporting EIO when EINVAL is the correct error code
- Two bug fixes for the block zone support
- NVME pull request via Keith:
- Target side authentication fixup
- Peer-to-peer metadata fixup
- null_blk DMA alignment fix
* tag 'block-6.18-20251031' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
null_blk: set dma alignment to logical block size
blk-crypto: use BLK_STS_INVAL for alignment errors
block: make REQ_OP_ZONE_OPEN a write operation
block: fix op_is_zone_mgmt() to handle REQ_OP_ZONE_RESET_ALL
nvme-pci: use blk_map_iter for p2p metadata
nvmet-auth: update sc_c in host response
Linus Torvalds [Fri, 31 Oct 2025 19:50:35 +0000 (12:50 -0700)]
Merge tag 's390-6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Heiko Carstens:
- Use correct locking in zPCI event code to avoid deadlock
- Get rid of irqs_registered flag in zpci_dev structure and restore IRQ
unconditionally for zPCI devices. This fixes sit uations where the
flag was not correctly updated
- Disable (revert) ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP for s390 again.
The optimized hugetlb vmemmap code modifies kernel page tables in a
way which does not work on s390 and leads to reproducible kernel
crashes due to stale TLB entries. This needs to be addressed with
some larger changes. For now simply disable the feature
- Update defconfigs
* tag 's390-6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: Disable ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
s390/mm: Fix memory leak in add_marker() when kvrealloc() fails
s390/pci: Restore IRQ unconditionally for the zPCI device
s390: Update defconfigs
s390/pci: Avoid deadlock between PCI error recovery and mlx5 crdump
Puranjay Mohan [Thu, 30 Oct 2025 12:17:14 +0000 (12:17 +0000)]
bpf/arm64: Fix BPF_ST into arena memory
The arm64 JIT supports BPF_ST with BPF_PROBE_MEM32 (arena) by using the
tmp2 register to hold the dst + arena_vm_base value and using tmp2 as the
new dst register. But this is broken because in case is_lsi_offset()
returns false the tmp2 will be clobbered by emit_a64_mov_i(1, tmp2, off,
ctx); and hence the emitted store instruction will be of the form:
strb w10, [x11, x11]
Fix this by using the third temporary register to hold the dst +
arena_vm_base.
Two functions with identical names but different addresses are
considered ambiguous and removed by "pahole" from vmlinux BTF.
Later resolve_btfids warns since it cannot find them.
Commit 378b7708194f ("sched: Make migrate_{en,dis}able() inline") made
them inlineable in most places, but in vmlinux built with llvm 21 and 22
there are four symbols for migrate_{enable,disable}:
three static functions and one global function.
Fix the issue by marking migrate_{enable,disable} as always inline.
The alternative is to mark them as notrace/nokprobe which is more
drastic. Only bpf programs are prevented from attaching to these
functions. The rest of the tracing shouldn't be affected.
[note: Peter ok-ed the patch, Alexei rewrote commit log]
Fixes: 378b7708194f ("sched: Make migrate_{en,dis}able() inline") Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Menglong Dong <menglong.dong@linux.dev> Link: https://lore.kernel.org/r/20251029183646.3811774-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Linus Torvalds [Fri, 31 Oct 2025 16:34:21 +0000 (09:34 -0700)]
Merge tag '6.18-rc3-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- fix potential UAF in statfs
- DFS fix for expired referrals
- fix minor modinfo typo
- small improvement to reconnect for smbdirect
* tag '6.18-rc3-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: call smbd_destroy() in the same splace as kernel_sock_shutdown()/sock_release()
smb: client: handle lack of IPC in dfs_cache_refresh()
smb: client: fix potential cfid UAF in smb2_query_info_compound
cifs: fix typo in enable_gcm_256 module parameter
Hans Holmberg [Fri, 31 Oct 2025 09:48:26 +0000 (10:48 +0100)]
null_blk: set dma alignment to logical block size
This driver assumes that bio vectors are memory aligned to the logical
block size, so set the queue limit to reflect that.
Unless we set up the limit based on the logical block size, we will go
out of page bounds in copy_to_nullb / copy_from_nullb.
Apparently this wasn't noticed so far because none of the tests generate
such buffers, but since commit 851c4c96db00 ("xfs: implement
XFS_IOC_DIOINFO in terms of vfs_getattr") xfstests generates unaligned
I/O, which now lead to memory corruption when using null_blk devices
with 4k block size.
Fixes: bf8d08532bc1 ("iomap: add support for dma aligned direct-io") Fixes: b1a000d3b8ec ("block: relax direct io memory alignment") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 31 Oct 2025 14:29:09 +0000 (07:29 -0700)]
Merge tag 'sound-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of small fixes. It became slightly bigger than usual due
to timing issues (holidays, etc), but all changes are rather
device-specific fixes, so not really worrisome.
- ASoC Cirrus codec fixes for AMD
- Various fixes for ASoC Intel AVS, Qualcomm, SoundWire, FSL,
Mediatek, Renesas
- A few HD-audio quirks, and USB-audio regression fixes for Presonus"
* tag 'sound-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (24 commits)
ALSA: hda/realtek: Enable mic on Vaio RPL
ASoC: dt-bindings: pm4125-sdw: correct number of soundwire ports
ASoC: renesas: rz-ssi: Use proper dma_buffer_pos after resume
ASoC: soc_sdw_utils: remove cs42l43 component_name
ASoC: fsl_sai: Fix sync error in consumer mode
ASoC: Fix build for sdw_utils
ALSA: hda/realtek: Fix mute led for HP Victus 15-fa1xxx (MB 8C2D)
ASoC: rt721: fix prepare clock stop failed
ALSA: usb-audio: don't log messages meant for 1810c when initializing 1824c
ASoC: mediatek: Fix double pm_runtime_disable in remove functions
ASoC: fsl_micfil: correct the endian format for DSD
ASoC: fsl_sai: fix bit order for DSD format
ASoC: Intel: avs: Use snd_codec format when initializing probe
ASoC: Intel: avs: Disable periods-elapsed work when closing PCM
ASoC: Intel: avs: Unprepare a stream when XRUN occurs
ASoC: sdw_utils: add name_prefix for rt1321 part id
ASoC: qdsp6: q6asm: do not sleep while atomic
ASoC: Intel: soc-acpi-intel-ptl-match: Remove cs42l43 match from sdw link3
ASOC: max98090/91: fix for filter configuration: AHPF removed DMIC2_HPF added
ASoC: amd: acp: Add ACP7.0 match entries for cs35l56 and cs42l43
...
Linus Torvalds [Fri, 31 Oct 2025 14:25:10 +0000 (07:25 -0700)]
Merge tag 'v6.18-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
- Fix double free in aspeed
- Fix req->nbytes clobbering in s390/phmac
* tag 'v6.18-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: aspeed - fix double free caused by devm
crypto: s390/phmac - Do not modify the req->nbytes value
Linus Torvalds [Fri, 31 Oct 2025 14:08:47 +0000 (07:08 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"ufs driver plus two core fixes.
One core fix makes the unit attention counters atomic (just in case
multiple commands detect them) and the other is fixing a merge window
regression caused by changes in the block tree"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: core: Fix the unit attention counter implementation
scsi: ufs: core: Declare tx_lanes witout initialization
scsi: ufs: core: Initialize value of an attribute returned by uic cmd
scsi: ufs: core: Fix error handler host_sem issue
scsi: core: Fix a regression triggered by scsi_host_busy()
xfs: document another racy GC case in xfs_zoned_map_extent
Besides blocks being invalidated, there is another case when the original
mapping could have changed between querying the rmap for GC and calling
xfs_zoned_map_extent. Document it there as it took us quite some time
to figure out what is going on while developing the multiple-GC
protection fix.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
When we are picking a zone for gc it might already be in the pipeline
which can lead to us moving the same data twice resulting in in write
amplification and a very unfortunate case where we keep on garbage
collecting the zone we just filled with migrated data stopping all
forward progress.
Fix this by introducing a count of on-going GC operations on a zone, and
skip any zone with ongoing GC when picking a new victim.
Fixes: 080d01c41 ("xfs: implement zoned garbage collection") Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Linus Torvalds [Fri, 31 Oct 2025 02:48:13 +0000 (19:48 -0700)]
Merge tag 'linux_kselftest-fixes-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
"Fix build warning in cachestat found during clang build and add
tmpshmcstat to .gitignore"
* tag 'linux_kselftest-fixes-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests: cachestat: Fix warning on declaration under label
selftests/cachestat: add tmpshmcstat file to .gitignore
Linus Torvalds [Fri, 31 Oct 2025 02:11:27 +0000 (19:11 -0700)]
Merge tag 'linux_kselftest-kunit-fixes-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kunit fixes from Shuah Khan:
"Fix log overwrite in param_tests and fixes incorrect cast of priv
pointer in test_dev_action().
Update email address for Rae Moar in MAINTAINERS KUnit entry"
* tag 'linux_kselftest-kunit-fixes-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
MAINTAINERS: Update KUnit email address for Rae Moar
kunit: prevent log overwrite in param_tests
kunit: test_dev_action: Correctly cast 'priv' pointer to long*
Linus Torvalds [Fri, 31 Oct 2025 02:05:46 +0000 (19:05 -0700)]
Merge tag 'acpi-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix three ACPI driver issues and add version checks to two ACPI
table parsers:
- Call input_free_device() on failing input device registration as
necessary (and mentioned in the input subsystem documentation) in
the ACPI button driver (Kaushlendra Kumar)
- Fix use-after-free in acpi_video_switch_brightness() by canceling a
delayed work during tear-down (Yuhao Jiang)
- Use platform device for devres-related actions in the ACPI fan
driver to allow device-managed resources to be cleaned up properly
(Armin Wolf)
- Add version checks to the MRRM and SPCR table parsers (Tony Luck
and Punit Agrawal)"
* tag 'acpi-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: SPCR: Check for table version when using precise baudrate
ACPI: MRRM: Check revision of MRRM table
ACPI: fan: Use platform device for devres-related actions
ACPI: fan: Use ACPI handle when retrieving _FST
ACPI: video: Fix use-after-free in acpi_video_switch_brightness()
ACPI: button: Call input_free_device() on failing input device registration
Linus Torvalds [Fri, 31 Oct 2025 02:02:16 +0000 (19:02 -0700)]
Merge tag 'pm-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix three regressions, two recent ones and one introduced during
the 6.17 development cycle:
- Add an exit latency check to the menu cpuidle governor in the case
when it considers using a real idle state instead of a polling one
to address a performance regression (Rafael Wysocki)
- Revert an attempted cleanup of a system suspend code path that
introduced a regression elsewhere (Samuel Wu)
- Allow pm_restrict_gfp_mask() to be called multiple times in a row
and adjust pm_restore_gfp_mask() accordingly to avoid having to
play nasty games with these calls during hibernation (Rafael
Wysocki)"
* tag 'pm-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: Allow pm_restrict_gfp_mask() stacking
cpuidle: governors: menu: Select polling state in some more cases
Revert "PM: sleep: Make pm_wakeup_clear() call more clear"
- fbcon: Fix slab-use-after-free in fb_mode_is_equal (Quanmin Yan)
- fb.h: Fix typo in "vertical" (Piyush Choudhary)
* tag 'fbdev-for-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
fbdev: atyfb: Check if pll_ops->init_pll failed
fbcon: Set fb_display[i]->mode to NULL when the mode is released
fbdev: bitblit: bound-check glyph index in bit_putcs*
fbdev: pvr2fb: Fix leftover reference to ONCHIP_NR_DMA_CHANNELS
fbdev: valkyriefb: Fix reference count leak in valkyriefb_init
video: fb: Fix typo in comment in fb.h
drm/ast: Clear preserved bits from register output value
Preserve the I/O register bits in __ast_write8_i_masked() as specified
by preserve_mask. Accidentally OR-ing the output value into these will
overwrite the register's previous settings.
Fixes display output on the AST2300, where the screen can go blank at
boot. The driver's original commit 312fec1405dd ("drm: Initial KMS
driver for AST (ASpeed Technologies) 2000 series (v2)") already added
the broken code. Commit 6f719373b943 ("drm/ast: Blank with VGACR17 sync
enable, always clear VGACRB6 sync off") triggered the bug.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reported-by: Peter Schneider <pschneider1968@googlemail.com> Closes: https://lore.kernel.org/dri-devel/a40caf8e-58ad-4f9c-af7f-54f6f69c29bb@googlemail.com/ Tested-by: Peter Schneider <pschneider1968@googlemail.com> Reviewed-by: Jocelyn Falempe <jfalempe@redhat.com> Fixes: 6f719373b943 ("drm/ast: Blank with VGACR17 sync enable, always clear VGACRB6 sync off") Fixes: 312fec1405dd ("drm: Initial KMS driver for AST (ASpeed Technologies) 2000 series (v2)") Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Nick Bowler <nbowler@draconx.ca> Cc: Douglas Anderson <dianders@chromium.org> Cc: Dave Airlie <airlied@redhat.com> Cc: Jocelyn Falempe <jfalempe@redhat.com> Cc: dri-devel@lists.freedesktop.org Cc: <stable@vger.kernel.org> # v3.5+ Link: https://patch.msgid.link/20251024073626.129032-1-tzimmermann@suse.de
Merge branches 'acpi-button', 'acpi-video' and 'acpi-fan'
Merge ACPI button, ACPI backlight (video), and ACPI fan driver fixes for
6.18-rc4:
- Call input_free_device() on failing input device registration as
necessary (and mentioned in the input subsystem documentation) in the
ACPI button driver (Kaushlendra Kumar)
- Fix use-after-free in acpi_video_switch_brightness() by canceling
a delayed work during tear-down (Yuhao Jiang)
- Use platform device for devres-related actions in the ACPI fan driver
to allow device-managed resources to be cleaned up properly (Armin
Wolf)
Merge a cpuidle fix and two fixes related to system sleep for 6.18-rc4:
- Add an exit latency check to the menu cpuidle governor in the case
when it considers using a real idle state instead of a polling one to
address a performance regression (Rafael Wysocki)
- Revert an attempted cleanup of a system suspend code path that
introduced a regression elsewhere (Samuel Wu)
- Allow pm_restrict_gfp_mask() to be called multiple times in a row
and adjust pm_restore_gfp_mask() accordingly to avoid having to play
nasty games with these calls during hibernation (Rafael Wysocki)
* pm-cpuidle:
cpuidle: governors: menu: Select polling state in some more cases
* pm-sleep:
PM: sleep: Allow pm_restrict_gfp_mask() stacking
Revert "PM: sleep: Make pm_wakeup_clear() call more clear"
Heiko Carstens [Thu, 30 Oct 2025 14:55:05 +0000 (15:55 +0100)]
s390: Disable ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
As reported by Luiz Capitulino enabling HVO on s390 leads to reproducible
crashes. The problem is that kernel page tables are modified without
flushing corresponding TLB entries.
Even if it looks like the empty flush_tlb_all() implementation on s390 is
the problem, it is actually a different problem: on s390 it is not allowed
to replace an active/valid page table entry with another valid page table
entry without the detour over an invalid entry. A direct replacement may
lead to random crashes and/or data corruption.
In order to invalidate an entry special instructions have to be used
(e.g. ipte or idte). Alternatively there are also special instructions
available which allow to replace a valid entry with a different valid
entry (e.g. crdte or cspg).
Given that the HVO code currently does not provide the hooks to allow for
an implementation which is compliant with the s390 architecture
requirements, disable ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP again, which is
basically a revert of the original patch which enabled it.
Luca Ceresoli [Tue, 14 Oct 2025 11:30:51 +0000 (13:30 +0200)]
drm/imx: parallel-display: convert to devm_drm_bridge_alloc() API
This is the new API for allocating DRM bridges.
This conversion was missed during the initial conversion of all bridges to
the new API. Thus all kernels with commit 94d50c1a2ca3 ("drm/bridge:
get/put the bridge reference in drm_bridge_attach/detach()") and using this
driver now warn due to drm_bridge_attach() incrementing the refcount, which
is not initialized without using devm_drm_bridge_alloc() for allocation.
To make the conversion simple and straightforward without messing up with
the drmm_simple_encoder_alloc(), move the struct drm_bridge from struct
imx_parallel_display_encoder to struct imx_parallel_display.
Also remove the 'struct imx_parallel_display *pd' from struct
imx_parallel_display_encoder, not needed anymore.
Fixes: 94d50c1a2ca3 ("drm/bridge: get/put the bridge reference in drm_bridge_attach/detach()") Reported-by: Ernest Van Hoecke <ernestvanhoecke@gmail.com> Closes: https://lore.kernel.org/all/hlf4wdopapxnh4rekl5s3kvoi6egaga3lrjfbx6r223ar3txri@3ik53xw5idyh/ Signed-off-by: Luca Ceresoli <luca.ceresoli@bootlin.com> Reviewed-by: Louis Chauvet <louis.chauvet@bootlin.com> Tested-by: Ernest Van Hoecke <ernest.vanhoecke@toradex.com> Link: https://patch.msgid.link/20251014-drm-bridge-alloc-imx-ipuv3-v1-1-a1bb1dcbff50@bootlin.com Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
Takashi Iwai [Thu, 30 Oct 2025 12:08:08 +0000 (13:08 +0100)]
Merge tag 'asoc-fix-v6.18-rc2' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v6.18
A bigger batch of fixes than I'd like, things built up due to holidays
and some last minute issues which caused me to hold off on sending a pul
request. None of these are super remarkable, and there's a few new
device IDs in here too including a relatively big block of AMD devices.
The Cirrus Logic CS530x support subject line is actually a fix that was
on the start of that series and got pulled in here, I forgot to fix the
subject up when merging.