Peter Zijlstra [Mon, 19 Jan 2026 10:23:57 +0000 (11:23 +0100)]
rseq: Allow registering RSEQ with slice extension
Since glibc cares about the number of syscalls required to initialize a new
thread, allow initializing rseq with slice extension on. This avoids having to
do another prctl().
Thomas Gleixner [Mon, 15 Dec 2025 16:52:34 +0000 (17:52 +0100)]
selftests/rseq: Implement time slice extension test
Provide an initial test case to evaluate the functionality. This needs to be
extended to cover the ABI violations and expose the race condition between
observing granted and arriving in rseq_slice_yield().
Thomas Gleixner [Mon, 15 Dec 2025 16:52:28 +0000 (17:52 +0100)]
rseq: Implement rseq_grant_slice_extension()
Provide the actual decision function, which decides whether a time slice
extension is granted in the exit to user mode path when NEED_RESCHED is
evaluated.
The decision is made in two stages. First an inline quick check to avoid
going into the actual decision function. This checks whether:
#1 the functionality is enabled
#2 the exit is a return from interrupt to user mode
#3 any TIF bit, which causes extra work is set. That includes TIF_RSEQ,
which means the task was already scheduled out.
The slow path, which implements the actual user space ABI, is invoked
when:
A) #1 is true, #2 is true and #3 is false
It checks whether user space requested a slice extension by setting
the request bit in the rseq slice_ctrl field. If so, it grants the
extension and stores the slice expiry time, so that the actual exit
code can double check whether the slice is already exhausted before
going back.
B) #1 - #3 are true _and_ a slice extension was granted in a previous
loop iteration
In this case the grant is revoked.
In case that the user space access faults or invalid state is detected, the
task is terminated with SIGSEGV.
Thomas Gleixner [Mon, 15 Dec 2025 16:52:26 +0000 (17:52 +0100)]
rseq: Reset slice extension when scheduled
When a time slice extension was granted in the need_resched() check on exit
to user space, the task can still be scheduled out in one of the other
pending work items. When it gets scheduled back in, and need_resched() is
not set, then the stale grant would be preserved, which is just wrong.
RSEQ already keeps track of that and sets TIF_RSEQ, which invokes the
critical section and ID update mechanisms.
Utilize them and clear the user space slice control member of struct rseq
unconditionally within the existing user access sections. That's just an
unconditional store more in that path.
Thomas Gleixner [Mon, 15 Dec 2025 16:52:22 +0000 (17:52 +0100)]
rseq: Implement time slice extension enforcement timer
If a time slice extension is granted and the reschedule delayed, the kernel
has to ensure that user space cannot abuse the extension and exceed the
maximum granted time.
It was suggested to implement this via the existing hrtick() timer in the
scheduler, but that turned out to be problematic for several reasons:
1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled
independently of CONFIG_HIGHRES_TIMERS
2) HRTICK usage in the scheduler can be runtime disabled or is only used
for certain aspects of scheduling.
3) The function is calling into the scheduler code and that might have
unexpected consequences when this is invoked due to a time slice
enforcement expiry. Especially when the task managed to clear the
grant via sched_yield(0).
It would be possible to address #2 and #3 by storing state in the
scheduler, but that is extra complexity and fragility for no value.
Implement a dedicated per CPU hrtimer instead, which is solely used for the
purpose of time slice enforcement.
The timer is armed when an extension was granted right before actually
returning to user mode in rseq_exit_to_user_mode_restart().
It is disarmed, when the task relinquishes the CPU. This is expensive as
the timer is probably the first expiring timer on the CPU, which means it
has to reprogram the hardware. But that's less expensive than going through
a full hrtimer interrupt cycle for nothing.
Thomas Gleixner [Mon, 15 Dec 2025 16:52:19 +0000 (17:52 +0100)]
rseq: Implement syscall entry work for time slice extensions
The kernel sets SYSCALL_WORK_RSEQ_SLICE when it grants a time slice
extension. This allows to handle the rseq_slice_yield() syscall, which is
used by user space to relinquish the CPU after finishing the critical
section for which it requested an extension.
In case the kernel state is still GRANTED, the kernel resets both kernel
and user space state with a set of sanity checks. If the kernel state is
already cleared, then this raced against the timer or some other interrupt
and just clears the work bit.
Doing it in syscall entry work allows to catch misbehaving user space,
which issues an arbitrary syscall, i.e. not rseq_slice_yield(), from the
critical section. Contrary to the initial strict requirement to use
rseq_slice_yield() arbitrary syscalls are not considered a violation of the
ABI contract anymore to allow onion architecture applications, which cannot
control the code inside a critical section, to utilize this as well.
If the code detects inconsistent user space that result in a SIGSEGV for
the application.
If the grant was still active and the task was not preempted yet, the work
code reschedules immediately before continuing through the syscall.
Thomas Gleixner [Mon, 15 Dec 2025 16:52:15 +0000 (17:52 +0100)]
rseq: Implement sys_rseq_slice_yield()
Provide a new syscall which has the only purpose to yield the CPU after the
kernel granted a time slice extension.
sched_yield() is not suitable for that because it unconditionally
schedules, but the end of the time slice extension is not required to
schedule when the task was already preempted. This also allows to have a
strict check for termination to catch user space invoking random syscalls
including sched_yield() from a time slice extension region.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20251215155708.929634896@linutronix.de
Thomas Gleixner [Mon, 15 Dec 2025 16:52:12 +0000 (17:52 +0100)]
rseq: Add prctl() to enable time slice extensions
Implement a prctl() so that tasks can enable the time slice extension
mechanism. This fails, when time slice extensions are disabled at compile
time or on the kernel command line and when no rseq pointer is registered
in the kernel.
That allows to implement a single trivial check in the exit to user mode
hotpath, to decide whether the whole mechanism needs to be invoked.
Shrikanth Hegde [Thu, 15 Jan 2026 07:35:24 +0000 (13:05 +0530)]
sched/fair: Remove nohz.nr_cpus and use weight of cpumask instead
nohz.nr_cpus was observed as contended cacheline when running
enterprise workload on large systems.
Fundamental scalability challenge with nohz.idle_cpus_mask
and nohz.nr_cpus is the following:
(1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
(or nohz.idle_cpu_mask) and nohz.has_blocked to see whether there's
any nohz balancing work to do, in every scheduler tick.
(2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
(through nohz_balancer_kick() via sched_tick()) modify (write)
nohz.nr_cpus (and/or nohz.idle_cpu_mask) and nohz.has_blocked.
The characteristic frequencies are the following:
(1) nohz_balancer_kick() happens at scheduler (busy)tick frequency
on CPU(which has not gone idle). This is a relatively constant
frequency in the ~1 kHz range or lower.
(2) happens at idle enter/exit frequency on every CPU that goes to idle.
This is workload dependent, but can easily be hundreds of kHz for
IO-bound loads and high CPU counts. Ie. can be orders of magnitude
higher than (1), in which case a cachemiss at every invocation of (1)
is almost inevitable. idle exit will trigger (1) on the CPU
which is coming out of idle.
There's two types of costs from these functions:
(A) scheduler tick cost via (1): this happens on busy CPUs too, and is
thus a primary scalability cost. But the rate here is constant and
typically much lower than (B), hence the absolute benefit to workload
scalability will be lower as well.
(B) idle cost via (2): going-to-idle and coming-from-idle costs are
secondary concerns, because they impact power efficiency more than
they impact scalability. But in terms of absolute cost this scales
up with nr_cpus as well, and a much faster rate, and thus may also
approach and negatively impact system limits like
memory bus/fabric bandwidth.
Note that nohz.idle_cpus_mask and nohz.nr_cpus may appear to reside in the
same cacheline, however under CONFIG_CPUMASK_OFFSTACK=y the backing storage
for nohz.idle_cpus_mask will be elsewhere. With CPUMASK_OFFSTACK=n,
the nohz.idle_cpus_mask and rest of nohz fields are in different cachelines
under typical NR_CPUS=512/2048. This implies two separate cachelines
being dirtied upon idle entry / exit.
nohz.nr_cpus can be derived from the mask itself. Its usage doesn't warrant
a functionally correct value. This means one less cacheline being dirtied in
idle entry/exit path which helps to save some bus bandwidth w.r.t to those
nohz functions(approx 50%). This in turn helps to improve enterprise
workload throughput.
On system with 480 CPUs, running "hackbench 40 process 10000 loops"
(Avg of 3 runs)
baseline:
0.81% hackbench [k] nohz_balance_exit_idle
0.21% hackbench [k] nohz_balancer_kick
0.09% swapper [k] nohz_run_idle_balance
Shrikanth Hegde [Thu, 15 Jan 2026 07:35:22 +0000 (13:05 +0530)]
sched/fair: Move checking for nohz cpus after time check
Current code does.
- Read nohz.nr_cpus
- Check if the time has passed to do NOHZ idle balance
Instead do this.
- Check if the time has passed to do NOHZ idle balance
- Read nohz.nr_cpus
This will skip the read most of the time in normal system usage.
i.e when there are nohz.nr_cpus (system is not 100% busy).
Note that when there are no idle CPUs(100% busy), even if the flag gets
set to NOHZ_STATS_KICK | NOHZ_NEXT_KICK, find_new_ilb will fail and
there will be no NOHZ idle balance. In such cases there will be a very
narrow window where, kick_ilb will be called un-necessarily.
However current functionality is still retained.
Note: This patch doesn't solve any cacheline overheads. No improvement
in performance apart from saving a few cycles of reading nohz.nr_cpus
Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260115073524.376643-2-sshegde@linux.ibm.com
Zhan Xusheng [Wed, 14 Jan 2026 09:00:35 +0000 (17:00 +0800)]
sched/fair: Fix math notation errors in avg_vruntime comment
The avg_vruntime comment contains a couple of mathematical notation
issues:
- The summation over w_i * (V - v_i) is written in an ambiguous form
- The delta term refers to v instead of v0, which is inconsistent
with the code and preceding explanation
Fix these to make the comment mathematically correct and consistent
with the implementation.
Gabriele Monaco [Mon, 12 Jan 2026 14:04:13 +0000 (15:04 +0100)]
sched: Fix build for modules using set_tsk_need_resched()
Commit adcc3bfa8806 ("sched: Adapt sched tracepoints for RV task model")
added a tracepoint to the need_resched action that can be triggered also
by set_tsk_need_resched.
This function was previously accessible from out-of-tree modules but
it's no longer available because the __trace_set_need_resched() symbol
is not exported (together with the tracepoint itself, which was exported
in a separate patch) and building such modules fails.
Export __trace_set_need_resched to modules to fix those build issues.
Fixes: adcc3bfa8806 ("sched: Adapt sched tracepoints for RV task model") Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://patch.msgid.link/20260112140413.362202-1-gmonaco@redhat.com
Gabriele Monaco [Fri, 5 Dec 2025 13:16:16 +0000 (14:16 +0100)]
sched: Export hidden tracepoints to modules
The tracepoints sched_entry, sched_exit and sched_set_need_resched
are not exported to tracefs as trace events, this allows only kernel
code to access them. Helper modules like [1] can be used to still have
the tracepoints available to ftrace for debugging purposes, but they do
rely on the tracepoints being exported.
Export the 3 not exported tracepoints.
Note that sched_set_state is already exported as the macro is called
from modules.
[1] - https://github.com/qais-yousef/sched_tp.git
Fixes: adcc3bfa8806 ("sched: Adapt sched tracepoints for RV task model") Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://patch.msgid.link/20251205131621.135513-9-gmonaco@redhat.com
Peter Zijlstra [Thu, 18 Dec 2025 14:25:10 +0000 (15:25 +0100)]
sched: Further restrict the preemption modes
The introduction of PREEMPT_LAZY was for multiple reasons:
- PREEMPT_RT suffered from over-scheduling, hurting performance compared to
!PREEMPT_RT.
- the introduction of (more) features that rely on preemption; like
folio_zero_user() which can do large memset() without preemption checks.
(Xen already had a horrible hack to deal with long running hypercalls)
- the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo
cult or in response to poor to replicate workloads.
By moving to a model that is fundamentally preemptable these things become
managable and avoid needing to introduce more horrible hacks.
Since this is a requirement; limit PREEMPT_NONE to architectures that do not
support preemption at all. Further limit PREEMPT_VOLUNTARY to those
architectures that do not yet have PREEMPT_LAZY support (with the eventual goal
to make this the empty set and completely remove voluntary preemption and
cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.)
This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
x86) with only two preemption models: full and lazy.
While Lazy has been the recommended setting for a while, not all distributions
have managed to make the switch yet. Force things along. Keep the patch minimal
in case of hard to address regressions that might pop up.
Blake Jones [Tue, 2 Dec 2025 02:37:43 +0000 (18:37 -0800)]
sched: Reorder some fields in struct rq
This colocates some hot fields in "struct rq" to be on the same cache line
as others that are often accessed at the same time or in similar ways.
Using data from a Google-internal fleet-scale profiler, I found three
distinct groups of hot fields in struct rq:
- (1) The runqueue lock: __lock.
- (2) Those accessed from hot code in pick_next_task_fair():
nr_running, nr_numa_running, nr_preferred_running,
ttwu_pending, cpu_capacity, curr, idle.
- (3) Those accessed from some other hot codepaths, e.g.
update_curr(), update_rq_clock(), and scheduler_tick():
clock_task, clock_pelt, clock, lost_idle_time,
clock_update_flags, clock_pelt_idle, clock_idle.
The cycles spent on accessing these different groups of fields broke down
roughly as follows:
- 50% on group (1) (the runqueue lock, always read-write)
- 39% on group (2) (load:store ratio around 38:1)
- 8% on group (3) (load:store ratio around 5:1)
- 3% on all the other fields
Most of the fields in group (3) are already in a cache line grouping; this
patch just adds "clock" and "clock_update_flags" to that group. The fields
in group (2) are scattered across several cache lines; the main effect of
this patch is to group them together, on a single line at the beginning of
the structure. A few other less performance-critical fields (nr_switches,
numa_migrate_on, has_blocked_load, nohz_csd, last_blocked_load_update_tick)
were also reordered to reduce holes in the data structure.
Since the runqueue lock is acquired from so many different contexts, and is
basically always accessed using an atomic operation, putting it on either
of the cache lines for groups (2) or (3) would slow down accesses to those
fields dramatically, since those groups are read-mostly accesses.
To test this, I wrote a focused load test that would put load on the
pick_next_task_fair() path. A parent process would fork many child
processes, and each child would nanosleep() for 1 msec many times in a
loop. The load test was monitored with "perf", and I looked at the amount
of cycles that were spent with sched_balance_rq() on the stack. The test
was reliably spending ~5% of all of its cycles there. I ran it 100 times
on a pair of 2-socket Intel Haswell machines (72 vCPUs per machine) - one
running the tip of sched/core, the other running this change - using 360
child processes and 8192 1-msec sleeps per child. The mean cycle count
dropped from 5.14B to 4.91B, or a *4.6% decrease* in relevant scheduler
cycles.
Given that this change reduces cache misses in a very hot kernel codepath,
there's likely to be additional application performance improvement due to
reduced cache conflicts from kernel data structures.
On a Power11 system with 128-byte cache lines, my test showed a ~5%
decrease in relevant scheduler cycles, along with a slight increase in user
time - both positive indicators. This data comes from
https://lore.kernel.org/lkml/affdc6b1-9980-44d1-89db-d90730c1e384@linux.ibm.com/
This is the case even though the additional "____cacheline_aligned" that
puts the runqueue lock on the next cache line adds an additional 64 bytes
of padding on those machines. This patch does not change the size of
"struct rq" on machines with 64-byte cache lines.
I also ran "hackbench" to try to test this change, but it didn't show
conclusive results. Looking at a CPU cycle profile of the hackbench run,
it was spending 95% of its cycles inside __alloc_skb(), __kfree_skb(), or
kmem_cache_free() - almost all of which was spent updating memcg counters
or contending on the list_lock in kmem_cache_node. And it spent less than
0.5% of its cycles inside either schedule() or try_to_wake_up(). So it's
not surprising that it didn't show useful results here.
The "__no_randomize_layout" was added to reflect the fact that performance
of code that references this data structure is unusually sensitive to
placement of its members.
Signed-off-by: Blake Jones <blakejones@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Reviewed-by: Josh Don <joshdon@google.com> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Link: https://patch.msgid.link/20251202023743.1524247-1-blakejones@google.com
Peter Zijlstra [Fri, 19 Dec 2025 08:04:45 +0000 (09:04 +0100)]
sched/fair: Fix sched_avg fold
After the robot reported a regression wrt commit: 089d84203ad4 ("sched/fair:
Fold the sched_avg update"), Shrikanth noted that two spots missed a factor
se_weight().
Fixes: 089d84203ad4 ("sched/fair: Fold the sched_avg update") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202512181208.753b9f6e-lkp@intel.com Debugged-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251218102020.GO3707891@noisy.programming.kicks-ass.net
Peter Zijlstra [Wed, 17 Dec 2025 10:24:11 +0000 (11:24 +0100)]
sched: Fix faulty assertion in sched_change_end()
Commit 47efe2ddccb1f ("sched/core: Add assertions to QUEUE_CLASS") added an
assert to sched_change_end() verifying that a class demotion would result in a
reschedule.
As it turns out; rt_mutex_setprio() does not force a resched on class
demontion. Furthermore, this is only relevant to running tasks.
Change the warning into a reschedule and make sure to only do so for running
tasks.
Fixes: 47efe2ddccb1f ("sched/core: Add assertions to QUEUE_CLASS") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251216141725.GW3707837@noisy.programming.kicks-ass.net
Peter Zijlstra [Wed, 10 Dec 2025 08:06:50 +0000 (09:06 +0100)]
sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()
Change sched_class::wakeup_preempt() to also get called for
cross-class wakeups, specifically those where the woken task
is of a higher class than the previous highest class.
In order to do this, track the current highest class of the runqueue
in rq::next_class and have wakeup_preempt() track this upwards for
each new wakeup. Additionally have schedule() re-set the value on
pick.
Ingo Molnar [Tue, 2 Dec 2025 09:35:06 +0000 (10:35 +0100)]
sched/fair: Sort out 'blocked_load*' namespace noise
There's three layers of logic in the scheduler that
deal with 'has_blocked' (load) handling of the NOHZ code:
(1) nohz.has_blocked,
(2) rq->has_blocked_load, deal with NOHZ idle balancing,
(3) and cfs_rq_has_blocked(), which is part of the layer
that is passing the SMP load-balancing signal to the
NOHZ layers.
The 'has_blocked' and 'has_blocked_load' names are used
in a mixed fashion, sometimes within the same function.
Standardize on 'has_blocked_load' to make it all easy
to read and easy to grep.
No change in functionality.
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/aS6yvxyc3JfMxxQW@gmail.com
In the vruntime_cmp() and vruntime_op() macros use Use __builtin_strcmp(),
because of __HAVE_ARCH_STRCMP might turn off the compiler optimizations
we rely on here to catch usage bugs.
Ingo Molnar [Tue, 2 Dec 2025 15:09:23 +0000 (16:09 +0100)]
sched/fair: Rename cfs_rq::avg_vruntime to ::sum_w_vruntime, and helper functions
The ::avg_vruntime field is a misnomer: it says it's an
'average vruntime', but in reality it's the momentary sum
of the weighted vruntimes of all queued tasks, which is
at least a division away from being an average.
This is clear from comments about the math of fair scheduling:
* \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
This confusion is increased by the cfs_avg_vruntime() function,
which does perform the division and returns a true average.
The sum of all weighted vruntimes should be named thusly,
so rename the field to ::sum_w_vruntime. (As arguably
::sum_weighted_vruntime would be a bit of a mouthful.)
Understanding the scheduler is hard enough already, without
extra layers of obfuscated naming. ;-)
Ingo Molnar [Wed, 26 Nov 2025 11:09:16 +0000 (12:09 +0100)]
sched/fair: Rename cfs_rq::avg_load to cfs_rq::sum_weight
The ::avg_load field is a long-standing misnomer: it says it's an
'average load', but in reality it's the momentary sum of the load
of all currently runnable tasks. We'd have to also perform a
division by nr_running (or use time-decay) to arrive at any sort
of average value.
This is clear from comments about the math of fair scheduling:
* \Sum w_i := cfs_rq->avg_load
The sum of all weights is ... the sum of all weights, not
the average of all weights.
To make it doubly confusing, there's also an ::avg_load
in the load-balancing struct sg_lb_stats, which *is* a
true average.
The second part of the field's name is a minor misnomer
as well: it says 'load', and it is indeed a load_weight
structure as it shares code with the load-balancer - but
it's only in an SMP load-balancing context where
load = weight, in the fair scheduling context the primary
purpose is the weighting of different nice levels.
So rename the field to ::sum_weight instead, which makes
the terminology of the EEVDF math match up with our
implementation of it:
Peter Zijlstra [Fri, 28 Nov 2025 12:32:21 +0000 (13:32 +0100)]
sched/fair: Switch to rcu_dereference_all()
With the {rcu,sched,bh} RCU flavours being unified, it doesn't really
make sense to check for just the rcu one. Switch to the _all family of
verification which includes all 3 of the listed flavours.
Notably, this will enable us to remove some superfluous
rcu_read_lock() regions when we know they are inside preempt/IRQ
disabled regions.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Wed, 12 Nov 2025 15:08:23 +0000 (16:08 +0100)]
sched/fair: Avoid rq->lock bouncing in sched_balance_newidle()
While poking at this code recently I noted we do a pointless
unlock+lock cycle in sched_balance_newidle(). We drop the rq->lock (so
we can balance) but then instantly grab the same rq->lock again in
sched_balance_update_blocked_averages().
Peter Zijlstra [Thu, 27 Nov 2025 15:39:44 +0000 (16:39 +0100)]
sched/fair: Fold the sched_avg update
Nine (and a half) instances of the same pattern is just silly, fold the lot.
Notably, the half instance in enqueue_load_avg() is right after setting
cfs_rq->avg.load_sum to cfs_rq->avg.load_avg * get_pelt_divider(&cfs_rq->avg).
Since get_pelt_divisor() >= PELT_MIN_DIVIDER, this ends up being a no-op
change.
Linus Torvalds [Sun, 14 Dec 2025 03:35:35 +0000 (15:35 +1200)]
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"The only core fix is in doc; all the others are in drivers, with the
biggest impacts in libsas being the rollback on error handling and in
ufs coming from a couple of error handling fixes, one causing a crash
if it's activated before scanning and the other fixing W-LUN
resumption"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: qcom: Fix confusing cleanup.h syntax
scsi: libsas: Add rollback handling when an error occurs
scsi: device_handler: Return error pointer in scsi_dh_attached_handler_name()
scsi: ufs: core: Fix a deadlock in the frequency scaling code
scsi: ufs: core: Fix an error handler crash
scsi: Revert "scsi: libsas: Fix exp-attached device scan after probe failure scanned in again after probe failed"
scsi: ufs: core: Fix RPMB link error by reversing Kconfig dependencies
scsi: qla4xxx: Use time conversion macros
scsi: qla2xxx: Enable/disable IRQD_NO_BALANCING during reset
scsi: ipr: Enable/disable IRQD_NO_BALANCING during reset
scsi: imm: Fix use-after-free bug caused by unfinished delayed work
scsi: target: sbp: Remove KMSG_COMPONENT macro
scsi: core: Correct documentation for scsi_device_quiesce()
scsi: mpi3mr: Prevent duplicate SAS/SATA device entries in channel 1
scsi: target: Reset t_task_cdb pointer in error case
scsi: ufs: core: Fix EH failure after W-LUN resume error
Linus Torvalds [Sun, 14 Dec 2025 03:24:10 +0000 (15:24 +1200)]
Merge tag 'ceph-for-6.19-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"We have a patch that adds an initial set of tracepoints to the MDS
client from Max, a fix that hardens osdmap parsing code from myself
(marked for stable) and a few assorted fixups"
* tag 'ceph-for-6.19-rc1' of https://github.com/ceph/ceph-client:
rbd: stop selecting CRC32, CRYPTO, and CRYPTO_AES
ceph: stop selecting CRC32, CRYPTO, and CRYPTO_AES
libceph: make decode_pool() more resilient against corrupted osdmaps
libceph: Amend checking to fix `make W=1` build breakage
ceph: Amend checking to fix `make W=1` build breakage
ceph: add trace points to the MDS client
libceph: fix log output race condition in OSD client
Linus Torvalds [Sat, 13 Dec 2025 18:12:46 +0000 (06:12 +1200)]
Merge tag 'smp-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug fix from Ingo Molnar:
- Fix CPU hotplug callbacks to disable interrupts on UP kernels
* tag 'smp-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu: Make atomic hotplug callbacks run with interrupts disabled on UP
Linus Torvalds [Sat, 13 Dec 2025 18:10:35 +0000 (06:10 +1200)]
Merge tag 'perf-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event fixes from Ingo Molnar:
- Fix NULL pointer dereference crash in the Intel PMU driver
- Fix missing read event generation on task exit
- Fix AMD uncore driver init error handling
- Fix whitespace noise
* tag 'perf-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix NULL event dereference crash in handle_pmi_common()
perf/core: Fix missing read event generation on task exit
perf/x86/amd/uncore: Fix the return value of amd_uncore_df_event_init() on error
perf/uprobes: Remove <space><Tab> whitespace noise
Linus Torvalds [Sat, 13 Dec 2025 18:07:09 +0000 (06:07 +1200)]
Merge tag 'irq-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
- Fix error code in the irqchip/mchp-eic driver
- Fix setup_percpu_irq() affinity assumptions
- Remove the unused irq_domain_add_tree() function
* tag 'irq-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/mchp-eic: Fix error code in mchp_eic_domain_alloc()
irqdomain: Delete irq_domain_add_tree()
genirq: Allow NULL affinity for setup_percpu_irq()
Linus Torvalds [Sat, 13 Dec 2025 18:04:16 +0000 (06:04 +1200)]
Merge tag 'core-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc core fixes from Ingo Molnar:
- Improve bug reporting
- Suppress W=1 format warning
- Improve rseq scalability on Clang builds
* tag 'core-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq: Always inline rseq_debug_syscall_return()
bug: Hush suggest-attribute=format for __warn_printf()
bug: Let report_bug_entry() provide the correct bugaddr
Linus Torvalds [Sat, 13 Dec 2025 08:55:12 +0000 (20:55 +1200)]
Merge tag 'mm-nonmm-stable-2025-12-11-11-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc updates from Andrew Morton:
"There are no significant series in this small merge. Please see the
individual changelogs for details"
[ Editor's note: it's mainly ocfs2 and a couple of random fixes ]
* tag 'mm-nonmm-stable-2025-12-11-11-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: memfd_luo: add CONFIG_SHMEM dependency
mm: shmem: avoid build warning for CONFIG_SHMEM=n
ocfs2: fix memory leak in ocfs2_merge_rec_left()
ocfs2: invalidate inode if i_mode is zero after block read
ocfs2: avoid -Wflex-array-member-not-at-end warning
ocfs2: convert remaining read-only checks to ocfs2_emergency_state
ocfs2: add ocfs2_emergency_state helper and apply to setattr
checkpatch: add uninitialized pointer with __free attribute check
args: fix documentation to reflect the correct numbers
ocfs2: fix kernel BUG in ocfs2_find_victim_chain
liveupdate: luo_core: fix redundant bound check in luo_ioctl()
ocfs2: validate inline xattr size and entry count in ocfs2_xattr_ibody_list
fs/fat: remove unnecessary wrapper fat_max_cache()
ocfs2: replace deprecated strcpy with strscpy
ocfs2: check tl_used after reading it from trancate log inode
liveupdate: luo_file: don't use invalid list iterator
Brown paper bag time. This is a silly oversight where I missed to drop
the error condition checking to ensure we clean up on early error
returns. I have an internal unit testset coming up for this which will
catch all such issues going forward.
Reported-by: Chris Mason <clm@fb.com> Reported-by: Jeff Layton <jlayton@kernel.org> Fixes: 011703a9acd7 ("file: add FD_{ADD,PREPARE}()") Signed-off-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 13 Dec 2025 07:57:41 +0000 (19:57 +1200)]
x86/hv: Add gitignore entry for generated header file
Commit 7bfe3b8ea6e3 ("Drivers: hv: Introduce mshv_vtl driver") added a
new generated header file for the offsets into the mshv_vtl_cpu_context
structure to be used by the low-level assembly code. But it didn't add
the .gitignore file to go with it, so 'git status' and friends will
mention it.
Let's add the gitignore file before somebody thinks that generated
header should be committed.
Linus Torvalds [Sat, 13 Dec 2025 05:39:28 +0000 (17:39 +1200)]
Merge tag 'drm-fixes-2025-12-13' of https://gitlab.freedesktop.org/drm/kernel
Pull more drm fixes from Dave Airlie:
"These are the enqueued fixes that ended up in our fixes branch,
nouveau mostly, along with some small fixes in other places.
plane:
- Handle IS_ERR vs NULL in drm_plane_create_hotspot_properties()
ttm:
- fix devcoredump for evicted bos
panel:
- Fix stack usage warning in novatek-nt35560
nouveau:
- alloc fwsec sb at boot to avoid s/r problems
- fix strcpy usage
- fix i2c encoder crash
bridge:
- Ignore spurious PLL_UNLOCK bit in ti-sn65dsi83
mgag200:
- Fix bigendian handling in mgag200
tilcdc:
- Fix probe failure in tilcdc"
* tag 'drm-fixes-2025-12-13' of https://gitlab.freedesktop.org/drm/kernel:
drm/mgag200: Fix big-endian support
drm/tilcdc: Fix removal actions in case of failed probe
drm/ttm: Avoid NULL pointer deref for evicted BOs
drm: nouveau: Replace sprintf() with sysfs_emit()
drm/nouveau: fix circular dep oops from vendored i2c encoder
drm/nouveau: refactor deprecated strcpy
drm/plane: Fix IS_ERR() vs NULL check in drm_plane_create_hotspot_properties()
drm/bridge: ti-sn65dsi83: ignore PLL_UNLOCK errors
drm/nouveau/gsp: Allocate fwsec-sb at boot
drm/panel: novatek-nt35560: avoid on-stack device structure
i915:
- Fix format string truncation warning
- FIx runtime PM reference during fbdev BO creation
panthor:
- fix UAF
renesas:
- fix sync flag handling"
* tag 'drm-next-2025-12-13' of https://gitlab.freedesktop.org/drm/kernel:
Revert "drm/amd/display: Fix pbn to kbps Conversion"
drm/amd: Fix unbind/rebind for VCN 4.0.5
drm/i915: Fix format string truncation warning
drm/i915/fbdev: Hold runtime PM ref during fbdev BO creation
drm/amd/display: Improve HDMI info retrieval
drm/amdkfd: bump minimum vgpr size for gfx1151
drm/amd/display: shrink struct members
drm/amdkfd: Export the cwsr_size and ctl_stack_size to userspace
drm/amd/display: Refactor dml_core_mode_support to reduce stack frame
drm/amdgpu: don't attach the tlb fence for SI
drm/amd/display: Use GFP_ATOMIC in dc_create_plane_state()
drm/amdkfd: Trap handler support for expert scheduling mode
drm/amdkfd: Use huge page size to check split svm range alignment
drm/rcar-du: dsi: Handle both DRM_MODE_FLAG_N.SYNC and !DRM_MODE_FLAG_P.SYNC
drm/gem-shmem: revert the 8-byte alignment constraint
drm/gem-dma: revert the 8-byte alignment constraint
drm/panthor: Prevent potential UAF in group creation
Linus Torvalds [Sat, 13 Dec 2025 05:15:16 +0000 (17:15 +1200)]
Merge tag 'i3c/for-6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux
Pull further i3c update from Alexandre Belloni:
"We are removing a legacy API callback and having this sooner rather
than later will help ensuring no one introduces a new driver using it.
I've also added patches removing the "__free(...) = NULL" pattern
because I'm sure we won't avoid people sending those following the
mailing list discussion..."
* tag 'i3c/for-6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux:
i3c: adi: Fix confusing cleanup.h syntax
i3c: master: Fix confusing cleanup.h syntax
i3c: master: cleanup callback .priv_xfers()
i3c: master: switch to use new callback .i3c_xfers() from .priv_xfers()
Linus Torvalds [Sat, 13 Dec 2025 05:09:06 +0000 (17:09 +1200)]
Merge tag 'rtc-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"Subsystem:
- stop setting max_user_freq from the individual drivers as this has
not been hardware related for a while
New drivers:
- Andes ATCRTC100
- Apple SMC
- Nvidia VRS
Linus Torvalds [Sat, 13 Dec 2025 04:36:57 +0000 (16:36 +1200)]
Merge tag 'gpio-fixes-for-v6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio updates from Bartosz Golaszewski:
- fix spinlock op type after conversion to lock guards
- fix a memory leak in error path in gpio-regmap
- Kconfig fixes in GPIO drivers
- add a GPIO ACPI quirk for Dell Precision 7780
- set of fixes for shared GPIO management
* tag 'gpio-fixes-for-v6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: shared: make locking more fine-grained
gpio: shared: fix auxiliary device cleanup order
gpio: shared: check if a reference is populated before cleaning its resources
gpio: shared: fix NULL-pointer dereference in teardown path
gpio: shared: ignore disabled nodes when traversing the device-tree
gpiolib: acpi: Add quirk for Dell Precision 7780
gpio: tb10x: fix OF_GPIO dependency
gpio: qixis: select CONFIG_REGMAP_MMIO
gpio: regmap: Fix memleak in error path in gpio_regmap_register()
gpio: mmio: fix bad guard conversion
Linus Torvalds [Sat, 13 Dec 2025 04:09:10 +0000 (16:09 +1200)]
Merge tag 'sound-fix-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"The only slightly large change is the enablement of CIX HD-audio
controller, which took a bit time to be cooked up, while most of other
changes are device-specific small trivial fixes:
- Default disablement of the kconfig for decades old pre-release
alsa-lib PCM API; it's only the default config value change, so it
can't lead to any regressions for the existing setups
- Support for CIX HD-audio controller
- A few ASoC ACP fixes
- Fixes for ASoC cirrus, bcm, wcd, qcom, ak platforms
- Trivial hardening for FireWire and USB-audio
- HD-audio Intel binding fix and quirks"
* tag 'sound-fix-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (30 commits)
ALSA: hda/tas2781: Add new quirk for HP new project
ALSA: hda: cix-ipbloq: Use modern PM ops
ALSA: hda: intel-dsp-config: Prefer legacy driver as fallback
ASoC: amd: acp: update tdm channels for specific DAI
ASoC: cs35l56: Fix incorrect select SND_SOC_CS35L56_CAL_SYSFS_COMMON
ALSA: firewire-motu: add bounds check in put_user loop for DSP events
ASoC: cs35l41: Always return 0 when a subsystem ID is found
ALSA: uapi: Fix typo in asound.h comment
ALSA: Do not build obsolete API
ALSA: hda: add CIX IPBLOQ HDA controller support
ALSA: hda/core: add addr_offset field for bus address translation
ALSA: hda: dt-bindings: add CIX IPBLOQ HDA controller support
ALSA: hda/realtek: Add support for ASUS UM3406GA
ALSA: hda/realtek: Add support for HP Turbine Laptops
ALSA: usb-audio: Initialize status1 to fix uninitialized symbol errors
ALSA: firewire-motu: fix buffer overflow in hwdep read for DSP events
ALSA: hda: cs35l41: Fix NULL pointer dereference in cs35l41_hda_read_acpi()
ASoC: cros_ec_codec: Remove unnecessary selection of CRYPTO
ASoc: qcom: q6afe: fix bad guard conversion
ASoC: rockchip: Fix Wvoid-pointer-to-enum-cast warning (again)
...
Dave Airlie [Sat, 13 Dec 2025 00:54:28 +0000 (10:54 +1000)]
Merge tag 'drm-misc-fixes-2025-12-10' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
drm-misc-fixes for v6.19-rc1:
- Fix stack usage warning in novatek-nt35560.
- Fix s/r, i2c issues in nouveau and update string handling.
- Ignore spurious PLL_UNLOCK bit in ti-sn65dsi83.
- Handle IS_ERR vs NULL in drm_plane_create_hotspot_properties().
- Fix devcoredump crash on reading evicted bo's.
- Fix bigendian handling in mgag200.
- Fix probe failure in tilcdc.
Initializing automatic __free variables to NULL without need (e.g.
branches with different allocations), followed by actual allocation is
in contrary to explicit coding rules guiding cleanup.h:
"Given that the "__free(...) = NULL" pattern for variables defined at
the top of the function poses this potential interdependency problem the
recommendation is to always define and assign variables in one statement
and not group variable definitions at the top of the function when
__free() is used."
Code does not have a bug, but is less readable and uses discouraged
coding practice, so fix that by moving declaration to the place of
assignment.
Initializing automatic __free variables to NULL without need (e.g.
branches with different allocations), followed by actual allocation is
in contrary to explicit coding rules guiding cleanup.h:
"Given that the "__free(...) = NULL" pattern for variables defined at
the top of the function poses this potential interdependency problem the
recommendation is to always define and assign variables in one statement
and not group variable definitions at the top of the function when
__free() is used."
Code does not have a bug, but is less readable and uses discouraged
coding practice, so fix that by moving declaration to the place of
assignment.
Not that other existing usage of __free() in this context is a corret
exception initialized to NULL, because the actual allocation is branched
in if().
Linus Torvalds [Fri, 12 Dec 2025 17:44:03 +0000 (05:44 +1200)]
Merge tag 'loongarch-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch updates from Huacai Chen:
- Add basic LoongArch32 support
Note: Build infrastructures of LoongArch32 are not enabled yet,
because we need to adjust irqchip drivers and wait for GNU toolchain
be upstream first.
- Select HAVE_ARCH_BITREVERSE in Kconfig
- Fix build and boot for CONFIG_RANDSTRUCT
- Correct the calculation logic of thread_count
- Some bug fixes and other small changes
* tag 'loongarch-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: (22 commits)
LoongArch: Adjust default config files for 32BIT/64BIT
LoongArch: Adjust VDSO/VSYSCALL for 32BIT/64BIT
LoongArch: Adjust misc routines for 32BIT/64BIT
LoongArch: Adjust user accessors for 32BIT/64BIT
LoongArch: Adjust system call for 32BIT/64BIT
LoongArch: Adjust module loader for 32BIT/64BIT
LoongArch: Adjust time routines for 32BIT/64BIT
LoongArch: Adjust process management for 32BIT/64BIT
LoongArch: Adjust memory management for 32BIT/64BIT
LoongArch: Adjust boot & setup for 32BIT/64BIT
LoongArch: Adjust common macro definitions for 32BIT/64BIT
LoongArch: Add adaptive CSR accessors for 32BIT/64BIT
LoongArch: Add atomic operations for 32BIT/64BIT
LoongArch: Add new PCI ID for pci_fixup_vgadev()
LoongArch: Add and use some macros for AVEC
LoongArch: Correct the calculation logic of thread_count
LoongArch: Use unsigned long for _end and _text
LoongArch: Use __pmd()/__pte() for swap entry conversions
LoongArch: Fix arch_dup_task_struct() for CONFIG_RANDSTRUCT
LoongArch: Fix build errors for CONFIG_RANDSTRUCT
...
- Fix chacha-riscv64-zvkb.S to not use frame pointer for data"
* tag 'libcrypto-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
crypto: arm64/ghash - Fix incorrect output from ghash-neon
crypto/arm64: sm4/xts - Merge ksimd scopes to reduce stack bloat
crypto/arm64: aes/xts - Use single ksimd scope to reduce stack bloat
lib/crypto: blake2s: Replace manual unrolling with unrolled_full
lib/crypto: blake2b: Roll up BLAKE2b round loop on 32-bit
lib/crypto: riscv: Depend on RISCV_EFFICIENT_VECTOR_UNALIGNED_ACCESS
lib/crypto: riscv/chacha: Avoid s0/fp register
Linus Torvalds [Fri, 12 Dec 2025 10:04:18 +0000 (22:04 +1200)]
Merge tag 'block-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- Always initialize DMA state, fixing a potentially nasty issue on the
block side
- btrfs zoned write fix with cached zone reports
- Fix corruption issues in bcache with chained bio's, and further make
it clear that the chained IO handler is simply a marker, it's not
code meant to be executed
- Kill old code dealing with synchronous IO polling in the block layer,
that has been dead for a long time. Only async polling is supported
these days
- Fix a lockdep issue in tag_set management, moving it to RCU
- Fix an issue with ublks bio_vec iteration
- Don't unconditionally enforce blocking issue of ublk control
commands, allow some of them with non-blocking issue as they
do not block
* tag 'block-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
blk-mq-dma: always initialize dma state
blk-mq: delete task running check in blk_hctx_poll()
block: fix cached zone reports on devices with native zone append
block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
ublk: don't mutate struct bio_vec in iteration
block: prohibit calls to bio_chain_endio
bcache: fix improper use of bi_end_io
ublk: allow non-blocking ctrl cmds in IO_URING_F_NONBLOCK issue
Linus Torvalds [Fri, 12 Dec 2025 10:01:32 +0000 (22:01 +1200)]
Merge tag 'io_uring-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fix from Jens Axboe:
"Single fix for io_uring headed to stable, fixing an issue introduced
with the min_wait support earlier this year, where SQPOLL didn't get
correctly woken if an event arrived once the event waiting has
finished the min_wait portion.
As we already have regression tests for this added and people
reporting new failures there, let's get this one flushed out
so it can bubble back down to stable as well"
* tag 'io_uring-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring: fix min_wait wakeups for SQPOLL
Linus Torvalds [Fri, 12 Dec 2025 09:59:19 +0000 (21:59 +1200)]
Merge tag 'v6.19-rc-smb3-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- minor cleanup
- minor update to comment to avoid confusion about fs type
* tag 'v6.19-rc-smb3-server-fixes' of git://git.samba.org/ksmbd:
smb/server: add comment to FileSystemName of FileFsAttributeInformation
smb/server: remove unused nterr.h
smb/server: rename include guard in smb_common.h
Linus Torvalds [Fri, 12 Dec 2025 09:52:42 +0000 (21:52 +1200)]
Merge tag 'nfs-for-6.19-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Bugfixes:
- Fix 'nlink' attribute update races when unlinking a file
- Add missing initialisers for the directory verifier in various
places
- Don't regress the NFSv4 open state due to misordered racing replies
- Ensure the NFSv4.x callback server uses the correct transport
connection
- Fix potential use-after-free races when shutting down the NFSv4.x
callback server
- Fix a pNFS layout commit crash
- Assorted fixes to ensure correct propagation of mount options when
the client crosses a filesystem boundary and triggers the VFS
automount code
- More localio fixes
Features and cleanups:
- Add initial support for basic directory delegations
- SunRPC back channel code cleanups"
* tag 'nfs-for-6.19-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (24 commits)
NFSv4: Handle NFS4ERR_NOTSUPP errors for directory delegations
nfs/localio: remove 61 byte hole from needless ____cacheline_aligned
nfs/localio: remove alignment size checking in nfs_is_local_dio_possible
NFS: Fix up the automount fs_context to use the correct cred
NFS: Fix inheritance of the block sizes when automounting
NFS: Automounted filesystems should inherit ro,noexec,nodev,sync flags
Revert "nfs: ignore SB_RDONLY when mounting nfs"
Revert "nfs: clear SB_RDONLY before getting superblock"
Revert "nfs: ignore SB_RDONLY when remounting nfs"
NFS: Add a module option to disable directory delegations
NFS: Shortcut lookup revalidations if we have a directory delegation
NFS: Request a directory delegation during RENAME
NFS: Request a directory delegation on ACCESS, CREATE, and UNLINK
NFS: Add support for sending GDD_GETATTR
NFSv4/pNFS: Clear NFS_INO_LAYOUTCOMMIT in pnfs_mark_layout_stateid_invalid
NFSv4.1: protect destroying and nullifying bc_serv structure
SUNRPC: new helper function for stopping backchannel server
SUNRPC: cleanup common code in backchannel request
NFSv4.1: pass transport for callback shutdown
NFSv4: ensure the open stateid seqid doesn't go backwards
...
Brendan Jackman [Sun, 7 Dec 2025 03:53:18 +0000 (03:53 +0000)]
bug: Hush suggest-attribute=format for __warn_printf()
Recent additions to this function cause GCC 14.3.0 to get excited
(W=1) and suggest a missing attribute:
lib/bug.c: In function '__warn_printf':
lib/bug.c:187:25: error: function '__warn_printf' be a candidate for 'gnu_printf' format attribute [-Werror=suggest-attribute=format]
187 | vprintk(fmt, *args);
| ^~~~~~~
Disable the diagnostic locally, following the pattern used for stuff
like va_format().
Heiko Carstens [Mon, 8 Dec 2025 20:06:58 +0000 (21:06 +0100)]
bug: Let report_bug_entry() provide the correct bugaddr
report_bug_entry() always provides zero for bugaddr but could easily
extract the correct address from the provided bug_entry. Just do that to
have proper warning messages.
E.g. adding an artificial:
void foo(void) { WARN_ONCE(1, "bar"); }
function generates this warning message:
WARNING: arch/s390/kernel/setup.c:1017 at 0x0, CPU#0: swapper/0/0
^^^
With the correct bug address this changes to:
WARNING: arch/s390/kernel/setup.c:1017 at foo+0x1c/0x40, CPU#0: swapper/0/0
^^^^^^^^^^^^^
Evan Li [Fri, 12 Dec 2025 08:49:43 +0000 (16:49 +0800)]
perf/x86/intel: Fix NULL event dereference crash in handle_pmi_common()
handle_pmi_common() may observe an active bit set in cpuc->active_mask
while the corresponding cpuc->events[] entry has already been cleared,
which leads to a NULL pointer dereference.
This can happen when interrupt throttling stops all events in a group
while PEBS processing is still in progress. perf_event_overflow() can
trigger perf_event_throttle_group(), which stops the group and clears
the cpuc->events[] entry, but the active bit may still be set when
handle_pmi_common() iterates over the events.
The following recent fix:
7e772a93eb61 ("perf/x86: Fix NULL event access and potential PEBS record loss")
moved the cpuc->events[] clearing from x86_pmu_stop() to x86_pmu_del() and
relied on cpuc->active_mask/pebs_enabled checks. However,
handle_pmi_common() can still encounter a NULL cpuc->events[] entry
despite the active bit being set.
Add an explicit NULL check on the event pointer before using it,
to cover this legitimate scenario and avoid the NULL dereference crash.
Fixes: 7e772a93eb61 ("perf/x86: Fix NULL event access and potential PEBS record loss") Reported-by: kitta <kitta@linux.alibaba.com> Co-developed-by: kitta <kitta@linux.alibaba.com> Signed-off-by: Evan Li <evan.li@linux.alibaba.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251212084943.2124787-1-evan.li@linux.alibaba.com Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220855
When building without CONFIG_PM_SLEEP, there are several warnings (or
errors with CONFIG_WERROR=y / W=e) from the cix-ipbloq driver:
sound/hda/controllers/cix-ipbloq.c:378:12: error: 'cix_ipbloq_hda_runtime_resume' defined but not used [-Werror=unused-function]
378 | static int cix_ipbloq_hda_runtime_resume(struct device *dev)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sound/hda/controllers/cix-ipbloq.c:362:12: error: 'cix_ipbloq_hda_runtime_suspend' defined but not used [-Werror=unused-function]
362 | static int cix_ipbloq_hda_runtime_suspend(struct device *dev)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sound/hda/controllers/cix-ipbloq.c:349:12: error: 'cix_ipbloq_hda_resume' defined but not used [-Werror=unused-function]
349 | static int cix_ipbloq_hda_resume(struct device *dev)
| ^~~~~~~~~~~~~~~~~~~~~
sound/hda/controllers/cix-ipbloq.c:336:12: error: 'cix_ipbloq_hda_suspend' defined but not used [-Werror=unused-function]
336 | static int cix_ipbloq_hda_suspend(struct device *dev)
| ^~~~~~~~~~~~~~~~~~~~~~
When CONFIG_PM and CONFIG_PM_SLEEP are unset, SET_SYSTEM_SLEEP_PM_OPS()
and SET_RUNTIME_PM_OPS() evaluate to nothing, so these functions appear
unused to the compiler in this configuration.
Use the modern SYSTEM_SLEEP_PM_OPS and RUNTIME_PM_OPS macros to resolve
these warnings, which is what they are intended to do. Additionally,
wrap &cix_ipbloq_hda_pm in pm_ptr() to ensure the compiler can drop the
entire structure when CONFIG_PM is unset.
Linus Torvalds [Thu, 11 Dec 2025 03:13:29 +0000 (12:13 +0900)]
Merge tag 'for-6.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mikulas Patocka:
- convert crypto_shash users to direct crypto library use with simpler
and faster code and reduced stack usage (Eric Biggers):
- the dm-verity SHA-256 conversion also teaches it to do two-way
interleaved hashing for added performance
- dm-crypt MD5 conversion (used for Loop-AES compatibility)
- added document for for takeover/reshape raid1 -> raid5 examples (Heinz Mauelshagen)
- fix dm-vdo kerneldoc warnings (Matthew Sakai)
- various random fixes and cleanups
* tag 'for-6.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits)
dm pcache: fix segment info indexing
dm pcache: fix cache info indexing
dm-pcache: advance slot index before writing slot
dm raid: add documentation for takeover/reshape raid1 -> raid5 table line examples
dm log-writes: Add missing set_freezable() for freezable kthread
dm-raid: fix possible NULL dereference with undefined raid type
dm-snapshot: fix 'scheduling while atomic' on real-time kernels
dm: ignore discard return value
MAINTAINERS: add Benjamin Marzinski as a device mapper maintainer
dm-mpath: Simplify the setup_scsi_dh code
dm vdo: fix kerneldoc warnings
dm-bufio: align write boundary on physical block size
dm-crypt: enable DM_TARGET_ATOMIC_WRITES
dm: test for REQ_ATOMIC in dm_accept_partial_bio()
dm-verity: remove useless mempool
dm-verity: disable recursive forward error correction
dm-ebs: Mark full buffer dirty even on partial write
dm mpath: enable DM_TARGET_ATOMIC_WRITES
dm verity fec: Expose corrected block count via status
dm: Don't warn if IMA_DISABLE_HTABLE is not enabled
...
Linus Torvalds [Thu, 11 Dec 2025 00:57:08 +0000 (09:57 +0900)]
Merge tag 'spi-fix-v6.19-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A few small fixes for SPI that came in during the merge window,
nothing too exciting here"
* tag 'spi-fix-v6.19-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: microchip-core: Fix an error handling path in mchp_corespi_probe()
spi: cadence-qspi: Fix runtime PM imbalance in probe
Linus Torvalds [Thu, 11 Dec 2025 00:54:59 +0000 (09:54 +0900)]
Merge tag 'regulator-fix-v6.19-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"A few fixes that came in during the merge window, nothing too
exciting - the one core fix improves error propagation from gpiolib
which hopefully shouldn't actually happen but is safer"
* tag 'regulator-fix-v6.19-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: spacemit: Align input supply name with the DT binding
regulator: fixed: Rely on the core freeing the enable GPIO
regulator: check the return value of gpiod_set_value_cansleep()
Arnd Bergmann [Thu, 4 Dec 2025 10:01:58 +0000 (11:01 +0100)]
mm: memfd_luo: add CONFIG_SHMEM dependency
The new memfd code fails to link without SHMEM:
aarch64-linux-ld: mm/memfd_luo.o: in function `memfd_luo_retrieve_folios':
memfd_luo.c:(.text.memfd_luo_retrieve_folios+0xdc): undefined reference to `shmem_add_to_page_cache'
memfd_luo.c:(.text.memfd_luo_retrieve_folios+0x11c): undefined reference to `shmem_inode_acct_blocks'
memfd_luo.c:(.text.memfd_luo_retrieve_folios+0x134): undefined reference to `shmem_recalc_inode'
Add a Kconfig dependency to disallow that configuration.
Arnd Bergmann [Thu, 4 Dec 2025 10:28:59 +0000 (11:28 +0100)]
mm: shmem: avoid build warning for CONFIG_SHMEM=n
The newly added 'flags' variable is unused and causes a warning if
CONFIG_SHMEM is disabled, since the shmem_acct_size() macro it is passed
into does nothing:
mm/shmem.c: In function '__shmem_file_setup':
mm/shmem.c:5816:23: error: unused variable 'flags' [-Werror=unused-variable]
5816 | unsigned long flags = (vm_flags & VM_NORESERVE) ? SHMEM_F_NORESERVE : 0;
| ^~~~~
Replace the two macros with equivalent inline functions to get the
argument checking.
Link: https://lkml.kernel.org/r/20251204102905.1048000-1-arnd@kernel.org Fixes: 6ff1610ced56 ("mm: shmem: use SHMEM_F_* flags instead of VM_* flags") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Christian Brauner <brauner@kernel.org> Cc: guoweikang <guoweikang.kernel@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ocfs2: invalidate inode if i_mode is zero after block read
A panic occurs in ocfs2_unlink due to WARN_ON(inode->i_nlink == 0) when
handling a corrupted inode with i_mode=0 and i_nlink=0 in memory.
This "zombie" inode is created because ocfs2_read_locked_inode proceeds
even after ocfs2_validate_inode_block successfully validates a block that
structurally looks okay (passes checksum, signature etc.) but contains
semantically invalid data (specifically i_mode=0). The current validation
function doesn't check for i_mode being zero.
This results in an in-memory inode with i_mode=0 being added to the VFS
cache, which later triggers the panic during unlink.
Prevent this by adding an explicit check for (i_mode == 0, i_nlink == 0,
non-orphan) within ocfs2_validate_inode_block. If the check is true,
return -EFSCORRUPTED to signal corruption. This causes the caller
(ocfs2_read_locked_inode) to invoke make_bad_inode(), correctly preventing
the zombie inode from entering the cache.
Link: https://lkml.kernel.org/r/20251202224507.53452-2-eraykrdg1@gmail.com Co-developed-by: Albin Babu Varghese <albinbabuvarghese20@gmail.com> Signed-off-by: Albin Babu Varghese <albinbabuvarghese20@gmail.com> Signed-off-by: Ahmet Eray Karadag <eraykrdg1@gmail.com> Reported-by: syzbot+55c40ae8a0e5f3659f2b@syzkaller.appspotmail.com Fixes: https://syzkaller.appspot.com/bug?extid=55c40ae8a0e5f3659f2b Link: https://lore.kernel.org/all/20251022222752.46758-2-eraykrdg1@gmail.com/T/ Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: David Hunter <david.hunter.linux@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Use the new TRAILING_OVERLAP() helper to fix the following warning:
fs/ocfs2/xattr.c:52:41: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
This helper creates a union between a flexible-array member (FAM) and a
set of MEMBERS that would otherwise follow it.
This overlays the trailing MEMBER struct ocfs2_extent_rec er; onto the FAM
struct ocfs2_xattr_value_root::xr_list.l_recs[], while keeping the FAM and
the start of MEMBER aligned.
The static_assert() ensures this alignment remains, and it's intentionally
placed inmediately after the related structure --no blank line in between.
Link: https://lkml.kernel.org/r/aRKm_7aN7Smc3J5L@kspp Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ocfs2: convert remaining read-only checks to ocfs2_emergency_state
Now that the centralized `ocfs2_emergency_state()` helper is available,
refactor remaining filesystem-wide checks for `ocfs2_is_soft_readonly` and
`ocfs2_is_hard_readonly` to use this new function.
To ensure strict consistency with the previous behavior and guarantee no
functional changes, the call sites continue to explicitly return -EROFS
when the emergency state is detected. This standardizes the check logic
while preserving the existing error handling flow.
Link: https://lkml.kernel.org/r/3421641b54ad6b6e4ffca052351b518eacc1bd08.1764728893.git.eraykrdg1@gmail.com Co-developed-by: Albin Babu Varghese <albinbabuvarghese20@gmail.com> Signed-off-by: Albin Babu Varghese <albinbabuvarghese20@gmail.com> Signed-off-by: Ahmet Eray Karadag <eraykrdg1@gmail.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: David Hunter <david.hunter.linux@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ocfs2: add ocfs2_emergency_state helper and apply to setattr
Patch series "ocfs2: Refactor read-only checks to use
ocfs2_emergency_state", v4.
Following the fix for the `make_bad_inode` validation failure (syzbot ID: b93b65ee321c97861072), this separate series introduces a new helper
function, `ocfs2_emergency_state()`, to improve and centralize read-only
and error state checking.
This is modeled after the `ext4_emergency_state()` pattern, providing a
single, unified location for checking all filesystem-level emergency
conditions. This makes the code cleaner and ensures that any future
checks (e.g., for fatal error states) can be added in one place.
This series is structured as follows:
1. The first patch introduces the `ocfs2_emergency_state()` helper
(currently checking for -EROFS) and applies it to `ocfs2_setattr`
to provide a "fail-fast" mechanism, as suggested by Albin
Babu Varghese.
2. The second patch completes the refactoring by converting all
remaining read-only checks throughout OCFS2 to use this new helper.
This patch (of 2):
To centralize error checking, follow the pattern of other filesystems like
ext4 (which uses `ext4_emergency_state()`), and prepare for future
enhancements, this patch introduces a new helper function:
`ocfs2_emergency_state()`.
The purpose of this helper is to provide a single, unified location for
checking all filesystem-level emergency conditions. In this initial
implementation, the function only checks for the existing hard and soft
read-only modes, returning -EROFS if either is set.
This provides a foundation where future checks (e.g., for fatal error
states returning -EIO, or shutdown states) can be easily added in one
place.
This patch also adds this new check to the beginning of `ocfs2_setattr()`.
This ensures that operations like `ftruncate` (which triggered the
original BUG) fail-fast with -EROFS when the filesystem is already in a
read-only state.
Link: https://lkml.kernel.org/r/cover.1764728893.git.eraykrdg1@gmail.com Link: https://lkml.kernel.org/r/e9e975bcaaff8dbc155b70fbc1b2798a2e36e96f.1764728893.git.eraykrdg1@gmail.com Co-developed-by: Albin Babu Varghese <albinbabuvarghese20@gmail.com> Signed-off-by: Albin Babu Varghese <albinbabuvarghese20@gmail.com> Signed-off-by: Ahmet Eray Karadag <eraykrdg1@gmail.com> Suggested-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: David Hunter <david.hunter.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>