Michal Koutný [Mon, 23 Mar 2026 12:39:38 +0000 (13:39 +0100)]
sched/rt: Move group schedulability check to sched_rt_global_validate()
The sched_rt_global_constraints() function is a remnant that used to set
up global RT throttling but that is no more since commit 5f6bd380c7bdb
("sched/rt: Remove default bandwidth control") and the function ended up
only doing schedulability check.
Move the check into the validation function where it fits better.
(The order of validations sched_dl_global_validate() and
sched_rt_global_validate() shouldn't matter.)
Michal Koutný [Mon, 23 Mar 2026 12:39:37 +0000 (13:39 +0100)]
sched/rt: Skip group schedulable check with rt_group_sched=0
The warning from the commit 87f1fb77d87a6 ("sched: Add RT_GROUP WARN
checks for non-root task_groups") is wrong -- it assumes that only
task_groups with rt_rq are traversed, however, the schedulability check
would iterate all task_groups even when rt_group_sched=0 is disabled at
boot time but some non-root task_groups exist.
The schedulability check is supposed to validate:
a) that children don't overcommit its parent,
b) no RT task group overcommits global RT limit.
but with rt_group_sched=0 there is no (non-trivial) hierarchy of RT groups,
therefore skip the validation altogether. Otherwise, writes to the
global sched_rt_runtime_us knob will be rejected with incorrect
validation error.
This fix is immaterial with CONFIG_RT_GROUP_SCHED=n.
Peter Zijlstra [Sat, 4 Apr 2026 10:22:44 +0000 (12:22 +0200)]
sched/deadline: Use revised wakeup rule for dl_server
John noted that commit 115135422562 ("sched/deadline: Fix 'stuck' dl_server")
unfixed the issue from commit a3a70caf7906 ("sched/deadline: Fix dl_server
behaviour").
The issue in commit 115135422562 was for wakeups of the server after the
deadline; in which case you *have* to start a new period. The case for a3a70caf7906 is wakeups before the deadline.
Now, because the server is effectively running a least-laxity policy, it means
that any wakeup during the runnable phase means dl_entity_overflow() will be
true. This means we need to adjust the runtime to allow it to still run until
the existing deadline expires.
Use the revised wakeup rule for dl_defer entities.
Fixes: 115135422562 ("sched/deadline: Fix 'stuck' dl_server") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260404102244.GB22575@noisy.programming.kicks-ass.net
thermal: core: Suspend thermal zones later and resume them earlier
To avoid some undesirable interactions between thermal zone suspend
and resume with user space that is running when those operations are
carried out, move them closer to the suspend and resume of devices,
respectively, by updating dpm_prepare() to carry out thermal zone
suspend and dpm_complete() to start thermal zone resume (that will
continue asynchronously).
This also makes the code easier to follow by removing one, arguably
redundant, level of indirection represented by the thermal PM notifier.
Define thermal_class as a static structure to simplify thermal_init()
and to simplify thermal class availability checks that will need to
be carried out during the suspend and resume of thermal zones after
subsequent changes.
thermal: core: Drop redundant check from thermal_zone_device_update()
Since __thermal_zone_device_update() checks if tz->state is
TZ_STATE_READY and bails out immediately otherwise, it is not
necessary to check the thermal_zone_is_present() return value in
thermal_zone_device_update(). Namely, tz->state is equal to
TZ_STATE_FLAG_INIT initially and that flag is only cleared in
thermal_zone_init_complete() after adding tz to the list of thermal
zones, and thermal_zone_exit() sets TZ_STATE_FLAG_EXIT in tz->state
while removing tz from that list. Thus tz->state is not TZ_STATE_READY
when tz is not in the list and the check mentioned above is redundant.
Accordingly, drop the redundant thermal_zone_is_present() check from
thermal_zone_device_update() and drop the former altogether because it
has no more users.
thermal: core: Free thermal zone ID later during removal
The thermal zone removal ordering is different from the thermal zone
registration rollback path ordering and the former is arguably
problematic because freeing a thermal zone ID prematurely may cause
it to be used during the registration of another thermal zone which
may fail as a result.
Prevent that from occurring by changing the thermal zone removal
ordering to reflect the thermal zone registration rollback path
ordering.
Also more the ida_destroy() call from thermal_zone_device_unregister()
to thermal_release() for consistency.
thermal: core: Fix thermal zone governor cleanup issues
If thermal_zone_device_register_with_trips() fails after adding
a thermal governor to the thermal zone being registered, the
governor is not removed from it as appropriate which may lead to
a memory leak.
In turn, thermal_zone_device_unregister() calls thermal_set_governor()
without acquiring the thermal zone lock beforehand which may race with
a governor update via sysfs and may lead to a use-after-free in that
case.
Address these issues by adding two thermal_set_governor() calls, one to
thermal_release() to remove the governor from the given thermal zone,
and one to the thermal zone registration error path to cover failures
preceding the thermal zone device registration.
Fixes: e33df1d2f3a0 ("thermal: let governors have private data for each thermal zone") Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/5092923.31r3eYUQgx@rafael.j.wysocki
pmdomain: qcom: rpmhpd: Add power domains for Hawi SoC
Add the RPMh power domains required for the Hawi SoC. This includes
new definitions for domains supplying specific hardware components:
- DCX: supplies VDD_DISP
- GBX: supplies VDD_GFX_BX
Reviewed-by: Taniya Das <taniya.das@oss.qualcomm.com> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Signed-off-by: Fenglin Wu <fenglin.wu@oss.qualcomm.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Merge the immutable branch dt into next, to allow the updated DT bindings
to be tested together with the pmdomain changes that are targeted for the
next release.
dt-bindings: power: qcom,rpmhpd: Add RPMh power domain for Hawi SoC
Document the RPMh power domain for Hawi SoC, and add definitions for
the new power domains which present in Hawi SoC:
- RPMHPD_DCX (Display Core X): supplies VDD_DISP for the display
subsystem
- RPMHPD_GBX (Graphics Box): supplies VDD_GFX_BX for the GPU/graphics
subsystem
Also, add constants for new power domain levels that supported in Hawi
SoC, including: LOW_SVS_D3_0, LOW_SVS_D1_0, LOW_SVS_D0_0, SVS_L2_0,
TURBO_L1_0/1/2, TURBO_L1_0/1/2.
Mark Rutland [Tue, 7 Apr 2026 13:16:45 +0000 (14:16 +0100)]
entry: Split preemption from irqentry_exit_to_kernel_mode()
Some architecture-specific work needs to be performed between the state
management for exception entry/exit and the "real" work to handle the
exception. For example, arm64 needs to manipulate a number of exception
masking bits, with different exceptions requiring different masking.
Generally this can all be hidden in the architecture code, but for arm64
the current structure of irqentry_exit_to_kernel_mode() makes this
particularly difficult to handle in a way that is correct, maintainable,
and efficient.
The gory details are described in the thread surrounding:
* Currently, irqentry_exit_to_kernel_mode() handles both involuntary
preemption AND state management necessary for exception return.
* When scheduling (including involuntary preemption), arm64 needs to
have all arm64-specific exceptions unmasked, though regular interrupts
must be masked.
* Prior to the state management for exception return, arm64 needs to
mask a number of arm64-specific exceptions, and perform some work with
these exceptions masked (with RCU watching, etc).
While in theory it is possible to handle this with a new arch_*() hook
called somewhere under irqentry_exit_to_kernel_mode(), this is fragile
and complicated, and doesn't match the flow used for exception return to
user mode, which has a separate 'prepare' step (where preemption can
occur) prior to the state management.
To solve this, refactor irqentry_exit_to_kernel_mode() to match the
style of {irqentry,syscall}_exit_to_user_mode(), moving preemption logic
into a new irqentry_exit_to_kernel_mode_preempt() function, and moving
state management in a new irqentry_exit_to_kernel_mode_after_preempt()
function. The existing irqentry_exit_to_kernel_mode() is left as a
caller of both of these, avoiding the need to modify existing callers.
There should be no functional change as a result of this change.
[ tglx: Updated kernel doc ]
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260407131650.3813777-6-mark.rutland@arm.com
Mark Rutland [Tue, 7 Apr 2026 13:16:44 +0000 (14:16 +0100)]
entry: Split kernel mode logic from irqentry_{enter,exit}()
The generic irqentry code has entry/exit functions specifically for
exceptions taken from user mode, but doesn't have entry/exit functions
specifically for exceptions taken from kernel mode.
It would be helpful to have separate entry/exit functions specifically
for exceptions taken from kernel mode. This would make the structure of
the entry code more consistent, and would make it easier for
architectures to manage logic specific to exceptions taken from kernel
mode.
Move the logic specific to kernel mode out of irqentry_enter() and
irqentry_exit() into new irqentry_enter_from_kernel_mode() and
irqentry_exit_to_kernel_mode() functions. These are marked
__always_inline and placed in irq-entry-common.h, as with
irqentry_enter_from_user_mode() and irqentry_exit_to_user_mode(), so
that they can be inlined into architecture-specific wrappers. The
existing out-of-line irqentry_enter() and irqentry_exit() functions
retained as callers of the new functions.
The lockdep assertion from irqentry_exit() is moved into
irqentry_exit_to_user_mode() and irqentry_exit_to_kernel_mode(). This
was previously missing from irqentry_exit_to_user_mode() when called
directly, and any new lockdep assertion failure relating from this
change is a latent bug.
Aside from the lockdep change noted above, there should be no functional
change as a result of this change.
[ tglx: Updated kernel doc ]
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260407131650.3813777-5-mark.rutland@arm.com
Mark Rutland [Tue, 7 Apr 2026 13:16:43 +0000 (14:16 +0100)]
entry: Move irqentry_enter() prototype later
Subsequent patches will rework the irqentry_*() functions. The end
result (and the intermediate diffs) will be much clearer if the
prototype for the irqentry_enter() function is moved later, immediately
before the prototype of the irqentry_exit() function.
Move the prototype later.
This is purely a move; there should be no functional change as a result
of this change.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260407131650.3813777-4-mark.rutland@arm.com
local_irq_enable_exit_to_user() and local_irq_disable_exit_to_user() are
never overridden by architecture code, and are always equivalent to
local_irq_enable() and local_irq_disable().
These functions were added on the assumption that arm64 would override
them to manage 'DAIF' exception masking, as described by Thomas Gleixner
in these threads:
In practice arm64 did not need to override either. Prior to moving to
the generic irqentry code, arm64's management of DAIF was reworked in
commit:
97d935faacde ("arm64: Unmask Debug + SError in do_notify_resume()")
Since that commit, arm64 only masks interrupts during the 'prepare' step
when returning to user mode, and masks other DAIF exceptions later.
Within arm64_exit_to_user_mode(), the arm64 entry code is as follows:
Mark Rutland [Tue, 7 Apr 2026 13:16:41 +0000 (14:16 +0100)]
entry: Fix stale comment for irqentry_enter()
The kerneldoc comment for irqentry_enter() refers to idtentry_exit(),
which is an accidental holdover from the x86 entry code that the generic
irqentry code was based on.
Correct this to refer to irqentry_exit().
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260407131650.3813777-2-mark.rutland@arm.com
- Restore the state of the standard black-and-white and 16-color Linux
logos (no longer auto-enabled since commit 994fcd4b107d747b
("video/logo: don't select LOGO_LINUX_MONO and LOGO_LINUX_VGA16 by
default")).
Miguel Ojeda [Wed, 8 Apr 2026 08:44:46 +0000 (10:44 +0200)]
Merge tag 'pin-init-v7.1' of https://github.com/Rust-for-Linux/linux into rust-next
Pull pin-init updates from Benno Lossin:
- Replace the 'Zeroable' impls for 'Option<NonZero*>' with impls of
'ZeroableOption' for 'NonZero*'.
- Improve feature gate handling for unstable features.
- Declutter the documentation of implementations of 'Zeroable' for
tuples.
- Replace uses of 'addr_of[_mut]!' with '&raw [mut]'.
* tag 'pin-init-v7.1' of https://github.com/Rust-for-Linux/linux:
rust: pin-init: replace `addr_of_mut!` with `&raw mut`
rust: pin-init: implement ZeroableOption for NonZero* integer types
rust: pin-init: doc: de-clutter documentation with fake-variadics
rust: pin-init: properly document let binding workaround
rust: pin-init: build: simplify use of nightly features
Miguel Ojeda [Wed, 8 Apr 2026 07:46:01 +0000 (09:46 +0200)]
Merge tag 'rust-timekeeping-for-v7.1' of https://github.com/Rust-for-Linux/linux into rust-next
Pull timekeeping updates from Andreas Hindborg:
- Expand the example section in the 'HrTimer' documentation.
- Mark the 'ClockSource' trait as unsafe to ensure valid values for
'ktime_get()'.
- Add 'Delta::from_nanos()'.
This is a back merge since the pull request has a newer base -- we will
avoid that in the future.
And, given it is a back merge, it happens to resolve the "subtle" conflict
around '--remap-path-{prefix,scope}' that I discussed in linux-next [1],
plus a few other common conflicts. The result matches what we did for
next-20260407.
The actual diffstat (i.e. using a temporary merge of upstream first) is:
Delta 410 uses snd_akm4xxx_reset() both around DFS changes and from
its PM callbacks, but the AK4529 case in this helper is still left
unimplemented and never drives the codec reset path.
The AK4529 datasheet documents register 09h.RSTN as an internal
timing reset. Clearing RSTN powers down the ADC and DAC blocks, but
does not reinitialize the register map. That matches the existing
ak4xxx helper model, which already keeps the desired codec state in
the software register cache.
Implement AK4529 reset handling by clearing 09h.RSTN on state == 1,
then replaying the cached register image and setting RSTN back to 1
on state == 0.
This restores cached Delta 410 mixer state after resume and gives
the AK4529 DFS-change path a real codec reset sequence.
Eric Biggers [Wed, 8 Apr 2026 03:06:51 +0000 (20:06 -0700)]
crypto: Remove michael_mic from crypto_shash API
Remove the "michael_mic" crypto_shash algorithm, since it's no longer
used. Its only users were wireless drivers, which have now been
converted to use the michael_mic() function instead.
It makes sense that no other users ever appeared: Michael MIC is an
insecure algorithm that is specific to WPA TKIP, which itself was an
interim security solution to replace the broken WEP standard.
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Link: https://patch.msgid.link/20260408030651.80336-7-ebiggers@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Cássio Gabriel [Tue, 7 Apr 2026 15:35:43 +0000 (12:35 -0300)]
ALSA: interwave: add ISA and PnP suspend and resume callbacks
interwave still leaves both its ISA and PnP PM callbacks disabled even
though the shared GUS suspend and resume path now exists.
This board needs InterWave-specific glue around the shared GUS PM path.
The attached WSS codec has its own register image that must be saved and
restored across suspend, the InterWave-specific GF1 compatibility,
decode, MPU401, and emulation settings must be rewritten after the
shared GF1 resume path reinitializes the chip, and the probe-detected
InterWave memory layout must be restored without rerunning the
destructive DRAM/ROM detection path.
Track the optional STB TEA6330T bus at probe time, restore its cached
mixer state after resume, add resume-safe helpers for the InterWave
register and memory-configuration state, and wire both the ISA and PnP
front-ends up to the shared GUS PM helpers.
The resume path intentionally restores only the cached hardware setup.
It does not attempt to preserve sample RAM contents across suspend.
Cássio Gabriel [Tue, 7 Apr 2026 15:35:42 +0000 (12:35 -0300)]
ALSA: tea6330t: add mixer state restore helper
The InterWave STB variant uses a TEA6330T mixer on its private
I2C bus. The mixer state is cached in software, but there is no
helper to push that register image back to hardware after system
resume.
Add a small restore helper that reapplies the cached TEA6330T
register image to the device so board drivers can restore the
external mixer state as part of their PM resume path.
Take snd_i2c_lock() around the full device lookup and restore
sequence so the bus device list traversal is also protected.
Move the remaining standalone snd_tea6330t_detect() EXPORT_SYMBOL()
declaration next to its function definition so tea6330t.c follows the
usual layout.
um: drivers: call kernel_strrchr() explicitly in cow_user.c
Building ARCH=um on glibc >= 2.43 fails:
arch/um/drivers/cow_user.c: error: implicit declaration of
function 'strrchr' [-Wimplicit-function-declaration]
glibc 2.43's C23 const-preserving strrchr() macro does not survive
UML's global -Dstrrchr=kernel_strrchr remap from arch/um/Makefile.
Call kernel_strrchr() directly in cow_user.c so the source no longer
depends on the -D rewrite.
Fixes: 2c51a4bc0233 ("um: fix strrchr() problems") Suggested-by: Johannes Berg <johannes@sipsolutions.net> Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4-6 Assisted-by: Codex:gpt-5-4 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Link: https://patch.msgid.link/20260408070102.2325572-1-michael.bommarito@gmail.com
[remove unnecessary 'extern'] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently the probe does not check whether getting the regmap succeeded.
This can cause crash when regmap is used, if it wasn't successfully
obtained. Failing to get the regmap is unlikely, especially since this
driver is expected to be kicked by the MFD driver only after registering
the regmap - but it is still better to handle this gracefully.
Eric Biggers [Wed, 8 Apr 2026 03:06:49 +0000 (20:06 -0700)]
wifi: ath12k: Use michael_mic() from cfg80211
Just use the michael_mic() function from cfg80211 instead of a local
implementation of it using the crypto_shash API.
Note: when the kernel is booted with fips=1,
crypto_alloc_shash("michael_mic", 0, 0) always returned
ERR_PTR(-ENOENT), because Michael MIC is not a "FIPS allowed" algorithm.
For now, just preserve that behavior exactly, to ensure that TKIP is not
allowed to be used in FIPS mode. This logic actually seems to disable
the entire driver in FIPS mode and not just TKIP, but that was the
existing behavior. Supporting this driver in FIPS mode, if anyone
actually needs it there, should be a separate commit.
Eric Biggers [Wed, 8 Apr 2026 03:06:48 +0000 (20:06 -0700)]
wifi: ath11k: Use michael_mic() from cfg80211
Just use the michael_mic() function from cfg80211 instead of a local
implementation of it using the crypto_shash API.
Note: when the kernel is booted with fips=1,
crypto_alloc_shash("michael_mic", 0, 0) always returned
ERR_PTR(-ENOENT), because Michael MIC is not a "FIPS allowed" algorithm.
For now, just preserve that behavior exactly, to ensure that TKIP is not
allowed to be used in FIPS mode. This logic actually seems to disable
the entire driver in FIPS mode and not just TKIP, but that was the
existing behavior. Supporting this driver in FIPS mode, if anyone
actually needs it there, should be a separate commit.
Eric Biggers [Wed, 8 Apr 2026 03:06:47 +0000 (20:06 -0700)]
wifi: mac80211, cfg80211: Export michael_mic() and move it to cfg80211
Export michael_mic() so that the ath11k and ath12k drivers can call it.
In addition, move it from mac80211 to cfg80211 so that the ipw2x00
drivers, which depend on cfg80211 but not mac80211, can also call it.
Currently these drivers have their own local implementations of
michael_mic() based on crypto_shash, which is redundant and inefficient.
By consolidating all the Michael MIC code into cfg80211, we'll be able
to remove the duplicate Michael MIC code in the crypto/ directory.
David Laight [Thu, 26 Mar 2026 20:18:19 +0000 (20:18 +0000)]
netfilter: nf_conntrack_h323: Correct indentation when H323_TRACE defined
The trace lines are indented using PRINT("%*.s", xx, " ").
Userspace will treat this as "%*.0s" and will output no characters
when 'xx' is zero, the kernel treats it as "%*s" and will output
a single ' ' - which is probably what is intended.
Change all the formats to "%*s" removing the default precision.
This gives a single space indent when level is zero.
Signed-off-by: David Laight <david.laight.linux@gmail.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Florian Westphal <fw@strlen.de>
netfilter: nft_meta: add double-tagged vlan and pppoe support
Currently:
add rule netdev x y ip saddr 1.1.1.1
does not work with neither double-tagged vlan nor pppoe packets. This is
because the network and transport header offset are not pointing to the
IP and transport protocol headers in the stack.
This patch expands NFT_META_PROTOCOL and NFT_META_L4PROTO to parse
double-tagged vlan and pppoe packets so matching network and transport
header fields becomes possible with the existing userspace generated
bytecode. Note that this parser only supports double-tagged vlan which
is composed of vlan offload + vlan header in the skb payload area for
simplicity.
NFT_META_PROTOCOL is used by bridge and netdev family as an implicit
dependency in the bytecode to match on network header fields.
Similarly, there is also NFT_META_L4PROTO, which is also used as an
implicit dependency when matching on the transport protocol header
fields.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
Florian Westphal [Tue, 17 Mar 2026 14:50:08 +0000 (15:50 +0100)]
netfilter: nft_set_pipapo: increment data in one step
Since commit e807b13cb3e3 ("nft_set_pipapo: Generalise group size for buckets")
there is no longer a need to increment the data pointer in two steps.
Switch to a single invocation of NFT_PIPAPO_GROUPS_PADDED_SIZE() helper,
like the avx2 implementation.
Florian Westphal [Fri, 13 Mar 2026 12:12:30 +0000 (13:12 +0100)]
netfilter: nf_tables: add netlink policy based cap on registers
Should have no effect in practice; all of these use the
nft_parse_register_load/store apis which is mandatory anyway due
to the need to further validate the register load/store, e.g.
that the size argument doesn't result in out-of-bounds load/store.
OTOH this is a simple method to reject obviously wrong input
at earlier stage.
netfilter: add more netlink-based policy range checks
These spots either already check the attribute range manually
before use or the consuming functions tolerate unexpected values.
Nevertheless, add more range checks via netlink policy so we gain
more users and avoid possible re-use in other places that might
not have the required manual checks. This also improves error
reporting: netlink core can generate extack errors.
Sun Jian [Tue, 3 Mar 2026 10:15:25 +0000 (18:15 +0800)]
netfilter: use function typedefs for __rcu NAT helper hook pointers
After commit 07919126ecfc ("netfilter: annotate NAT helper hook pointers
with __rcu"), sparse can warn about type/address-space mismatches when
RCU-dereferencing NAT helper hook function pointers.
The hooks are __rcu-annotated and accessed via rcu_dereference(), but the
combination of complex function pointer declarators and the WRITE_ONCE()
machinery used by RCU_INIT_POINTER()/rcu_assign_pointer() can confuse
sparse and trigger false positives.
Introduce typedefs for the NAT helper function types, so __rcu applies to
a simple "fn_t __rcu *" pointer form. Also replace local typeof(hook)
variables with "fn_t *" to avoid propagating __rcu address space into
temporaries.
No functional change intended.
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202603022359.3dGE9fwI-lkp@intel.com/ Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
Input: uinput - fix circular locking dependency with ff-core
A lockdep circular locking dependency warning can be triggered
reproducibly when using a force-feedback gamepad with uinput (for
example, playing ELDEN RING under Wine with a Flydigi Vader 5
controller):
The cycle is caused by four lock acquisition paths:
1. ff upload: input_ff_upload() holds ff->mutex and calls
uinput_dev_upload_effect() -> uinput_request_submit() ->
uinput_request_send(), which acquires udev->mutex.
2. device create: uinput_ioctl_handler() holds udev->mutex and calls
uinput_create_device() -> input_register_device(), which acquires
input_mutex.
3. device register: input_register_device() holds input_mutex and
calls kbd_connect() -> input_register_handle(), which acquires
dev->mutex.
4. evdev release: evdev_release() calls input_flush_device() under
dev->mutex, which calls input_ff_flush() acquiring ff->mutex.
Fix this by introducing a new state_lock spinlock to protect
udev->state and udev->dev access in uinput_request_send() instead of
acquiring udev->mutex. The function only needs to atomically check
device state and queue an input event into the ring buffer via
uinput_dev_event() -- both operations are safe under a spinlock
(ktime_get_ts64() and wake_up_interruptible() do not sleep). This
breaks the ff->mutex -> udev->mutex link since a spinlock is a leaf in
the lock ordering and cannot form cycles with mutexes.
To keep state transitions visible to uinput_request_send(), protect
writes to udev->state in uinput_create_device() and
uinput_destroy_device() with the same state_lock spinlock.
Additionally, move init_completion(&request->done) from
uinput_request_send() to uinput_request_submit() before
uinput_request_reserve_slot(). Once the slot is allocated,
uinput_flush_requests() may call complete() on it at any time from
the destroy path, so the completion must be initialised before the
request becomes visible.
powerpc32/bpf: fix loading fsession func metadata using PPC_LI32
PPC_RAW_LI32 is not a valid macro in the PowerPC BPF JIT. Use PPC_LI32,
which correctly handles immediate loads for large values.
Fixes the build error introduced when adding fsession support on ppc32.
====================
seg6: fix dst_cache sharing in seg6 lwtunnel
The seg6 lwtunnel encap uses a single per-route dst_cache shared
between seg6_input_core() and seg6_output_core(). These two paths
can perform the post-encap SID lookup in different routing contexts
(e.g., ip rules matching on the ingress interface, or VRF table
separation). Whichever path runs first populates the cache, and the
other reuses it blindly, bypassing its own lookup.
Patch 1 fixes this by splitting the cache into cache_input and
cache_output. Patch 2 adds a selftest that validates the isolation.
====================
Andrea Mayer [Sat, 4 Apr 2026 00:44:05 +0000 (02:44 +0200)]
selftests: seg6: add test for dst_cache isolation in seg6 lwtunnel
Add a selftest that verifies the dst_cache in seg6 lwtunnel is not
shared between the input (forwarding) and output (locally generated)
paths.
The test creates three namespaces (ns_src, ns_router, ns_dst)
connected in a line. An SRv6 encap route on ns_router encapsulates
traffic destined to cafe::1 with SID fc00::100. The SID is
reachable only for forwarded traffic (from ns_src) via an ip rule
matching the ingress interface (iif veth-r0 lookup 100), and
blackholed in the main table.
The test verifies that:
1. A packet generated locally on ns_router does not reach
ns_dst with an empty cache, since the SID is blackholed;
2. A forwarded packet from ns_src populates the input cache
from table 100 and reaches ns_dst;
3. A packet generated locally on ns_router still does not
reach ns_dst after the input cache is populated,
confirming the output path does not reuse the input
cache entry.
Both the forwarded and local packets are pinned to the same CPU
with taskset, since dst_cache is per-cpu.
Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260404004405.4057-3-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andrea Mayer [Sat, 4 Apr 2026 00:44:04 +0000 (02:44 +0200)]
seg6: separate dst_cache for input and output paths in seg6 lwtunnel
The seg6 lwtunnel uses a single dst_cache per encap route, shared
between seg6_input_core() and seg6_output_core(). These two paths
can perform the post-encap SID lookup in different routing contexts
(e.g., ip rules matching on the ingress interface, or VRF table
separation). Whichever path runs first populates the cache, and the
other reuses it blindly, bypassing its own lookup.
Fix this by splitting the cache into cache_input and cache_output,
so each path maintains its own cached dst independently.
Fixes: 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels") Cc: stable@vger.kernel.org Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260404004405.4057-2-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Golle [Sun, 5 Apr 2026 21:29:19 +0000 (22:29 +0100)]
selftests: net: bridge_vlan_mcast: wait for h1 before querier check
The querier-interval test adds h1 (currently a slave of the VRF created
by simple_if_init) to a temporary bridge br1 acting as an outside IGMP
querier. The kernel VRF driver (drivers/net/vrf.c) calls cycle_netdev()
on every slave add and remove, toggling the interface admin-down then up.
Phylink takes the PHY down during the admin-down half of that cycle.
Since h1 and swp1 are cable-connected, swp1 also loses its link may need
several seconds to re-negotiate.
Use setup_wait_dev $h1 0 which waits for h1 to return to UP state, so the
test can rely on the link being back up at this point.
Jakub Kicinski [Sat, 4 Apr 2026 00:19:38 +0000 (17:19 -0700)]
net: avoid nul-deref trying to bind mp to incapable device
Sashiko points out that we use qops in __net_mp_open_rxq()
but never validate they are null. This was introduced when
check was moved from netdev_rx_queue_restart().
Look at ops directly instead of the locking config.
qops imply netdev_need_ops_lock(). We used netdev_need_ops_lock()
initially to signify that the real_num_rx_queues check below
is safe without rtnl_lock, but I'm not sure if this is actually
clear to most people, anyway.
Fixes: da7772a2b4ad ("net: move mp->rx_page_size validation to __net_mp_open_rxq()") Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20260404001938.2425670-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 4 Apr 2026 23:01:03 +0000 (16:01 -0700)]
selftests: drv-net: adjust to socat changes
socat v1.8.1.0 now defaults to shut-null, it sends an extra
0-length UDP packet when sender disconnects. This breaks
our tests which expect the exact packet sequence.
Add shut-none which was the old default where necessary.
Johan Alvarado [Mon, 6 Apr 2026 07:44:25 +0000 (07:44 +0000)]
net: stmmac: dwmac-motorcomm: fix eFUSE MAC address read failure
This patch fixes an issue where reading the MAC address from the eFUSE
fails due to a race condition.
The root cause was identified by comparing the driver's behavior with a
custom U-Boot port. In U-Boot, the MAC address was read successfully
every time because the driver was loaded later in the boot process, giving
the hardware ample time to initialize. In Linux, reading the eFUSE
immediately returns all zeros, resulting in a fallback to a random MAC address.
Hardware cold-boot testing revealed that the eFUSE controller requires a
short settling time to load its internal data. Adding a 2000-5000us
delay after the reset ensures the hardware is fully ready, allowing the
native MAC address to be read consistently.
Fixes: 02ff155ea281 ("net: stmmac: Add glue driver for Motorcomm YT6801 ethernet controller") Reported-by: Georg Gottleuber <ggo@tuxedocomputers.com> Closes: https://lore.kernel.org/24cfefff-1233-4745-8c47-812b502d5d19@tuxedocomputers.com Signed-off-by: Johan Alvarado <contact@c127.dev> Reviewed-by: Yao Zi <me@ziyao.cc> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/fc5992a4-9532-49c3-8ec1-c2f8c5b84ca1@smtp-relay.sendinblue.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Allow referenced dynptr to be overwritten when siblings exists
The patchset conditionally allow a referenced dynptr to be overwritten
when its siblings (original dynptr or dynptr clone) exist. Do it before
the verifier relation tracking refactor to mimimize verifier changes at
a time.
====================
Test overwriting referenced dynptr and clones to make sure it is only
allow when there is at least one other dynptr with the same ref_obj_id.
Also make sure slice is still invalidated after the dynptr's stack slot
is destroyed.
bpf: Allow overwriting referenced dynptr when refcnt > 1
The verifier currently does not allow overwriting a referenced dynptr's
stack slot to prevent resource leak. This is because referenced dynptr
holds additional resources that requires calling specific helpers to
release. This limitation can be relaxed when there are multiple copies
of the same dynptr. Whether it is the orignial dynptr or one of its
clones, as long as there exists at least one other dynptr with the same
ref_obj_id (to be used to release the reference), its stack slot should
be allowed to be overwritten.
Daniel Borkmann [Tue, 7 Apr 2026 19:24:19 +0000 (21:24 +0200)]
bpf: Clear delta when clearing reg id for non-{add,sub} ops
When a non-{add,sub} alu op such as xor is performed on a scalar
register that previously had a BPF_ADD_CONST delta, the else path
in adjust_reg_min_max_vals() only clears dst_reg->id but leaves
dst_reg->delta unchanged.
This stale delta can propagate via assign_scalar_id_before_mov()
when the register is later used in a mov. It gets a fresh id but
keeps the stale delta from the old (now-cleared) BPF_ADD_CONST.
This stale delta can later propagate leading to a verifier-vs-
runtime value mismatch.
The clear_id label already correctly clears both delta and id.
Make the else path consistent by also zeroing the delta when id
is cleared. More generally, this introduces a helper clear_scalar_id()
which internally takes care of zeroing. There are various other
locations in the verifier where only the id is cleared. By using
the helper we catch all current and future locations.
Fixes: 98d7ca374ba4 ("bpf: Track delta between "linked" registers.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260407192421.508817-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Tue, 7 Apr 2026 19:24:18 +0000 (21:24 +0200)]
bpf: Fix linked reg delta tracking when src_reg == dst_reg
Consider the case of rX += rX where src_reg and dst_reg are pointers to
the same bpf_reg_state in adjust_reg_min_max_vals(). The latter first
modifies the dst_reg in-place, and later in the delta tracking, the
subsequent is_reg_const(src_reg)/reg_const_value(src_reg) reads the
post-{add,sub} value instead of the original source.
This is problematic since it sets an incorrect delta, which sync_linked_regs()
then propagates to linked registers, thus creating a verifier-vs-runtime
mismatch. Fix it by just skipping this corner case.
Fixes: 98d7ca374ba4 ("bpf: Track delta between "linked" registers.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260407192421.508817-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
John Pavlick [Mon, 6 Apr 2026 13:23:33 +0000 (13:23 +0000)]
net: sfp: add quirks for Hisense and HSGQ GPON ONT SFP modules
Several GPON ONT SFP sticks based on Realtek RTL960x report
1000BASE-LX at 1300MBd in their EEPROM but can operate at 2500base-X.
On hosts capable of 2500base-X (e.g. Banana Pi R3 / MT7986), the
kernel negotiates only 1G because it trusts the incorrect EEPROM data.
Each quirk advertises 2500base-X and ignores TX_FAULT during the
module's ~40s Linux boot time.
Tested on Banana Pi R3 (MT7986) with OpenWrt 25.12.1, confirmed
2.5Gbps link and full throughput with flow offloading.
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Suggested-by: Marcin Nita <marcin.nita@leolabs.pl> Signed-off-by: John Pavlick <jspavlick@posteo.net> Link: https://patch.msgid.link/20260406132321.72563-1-jspavlick@posteo.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, ath12k AHB (in IPQ5332) uses SCM calls to authenticate the
firmware image to bring up userpd. From IPQ5424 onwards, Q6 firmware can
directly communicate with the Trusted Management Engine - Lite (TME-L),
eliminating the need for SCM calls for userpd bring-up.
Hence, to enable IPQ5424 device support, use qcom_mdt_load_no_init() and
skip the SCM call as Q6 will directly authenticate the userpd firmware.
wifi: ath12k: add ath12k_hw_version_map entry for IPQ5424
Add a new ath12k_hw_version_map entry for the AHB based WiFi 7 device
IPQ5424.
Reuse most of the ath12k_hw_version_map fields such as hal_ops,
hal_desc_sz, tcl_to_wbm_rbm_map, and hal_params from IPQ5332. The
register addresses differ on IPQ5424, hence set hw_regs temporarily
to NULL and populated it in a subsequent patch.
Add ath12k_hw_params for the ath12k AHB-based WiFi 7 device IPQ5424.
The WiFi device IPQ5424 is similar to IPQ5332. Most of the hardware
parameters like hw_ops, wmi_init, ring_mask, etc., are the same between
IPQ5424 and IPQ5332, hence use these same parameters for IPQ5424.
Some parameters are specific to IPQ5424; initially set these to
0 or NULL, and populate them in subsequent patches.
Baochen Qiang [Wed, 25 Mar 2026 03:05:01 +0000 (11:05 +0800)]
wifi: ath10k: fix station lookup failure during disconnect
Recent commit [1] moved station statistics collection to an earlier stage
of the disconnect flow. With this change in place, ath10k fails to resolve
the station entry when handling a peer stats event triggered during
disconnect, resulting in log messages such as:
wlp58s0: deauthenticating from 74:1a:e0:e7:b4:c8 by local choice (Reason: 3=DEAUTH_LEAVING)
ath10k_pci 0000:3a:00.0: not found station for peer stats
ath10k_pci 0000:3a:00.0: failed to parse stats info tlv: -22
The failure occurs because ath10k relies on ieee80211_find_sta_by_ifaddr()
for station lookup. That function uses local->sta_hash, but by the time
the peer stats request is triggered during disconnect, mac80211 has
already removed the station from that hash table, leading to lookup
failure.
Before commit [1], this issue was not visible because the transition from
IEEE80211_STA_NONE to IEEE80211_STA_NOTEXIST prevented ath10k from sending
a peer stats request at all: ath10k_mac_sta_get_peer_stats_info() would
fail early to find the peer and skip requesting statistics.
Fix this by switching the lookup path to ath10k_peer_find(), which queries
ath10k's internal peer table. At the point where the firmware emits the
peer stats event, the peer entry is still present in the driver's list,
ensuring lookup succeeds.
wifi: ath12k: Create symlink for each radio in a wiphy
In single-wiphy design, when more than one radio is registered as a
single-wiphy in the mac80211 layer, the following warnings are seen:
1. debugfs: File 'ath12k' in directory 'phy0' already present!
2. debugfs: File 'simulate_fw_crash' in directory 'pci-0000:57:00.0' already present!
debugfs: File 'device_dp_stats' in directory 'pci-01777777777777777777777:57:00.0' already present!
When more than one radio is registered as a single-wiphy, symlinks for
all the radios are created in the same debugfs directory:
/sys/kernel/debug/ieee80211/phyX/ath12k, resulting in warning 1. When a
symlink is created for the first radio, since the 'ath12k' directory is
not present, it will be created and no warning will be thrown. But when
symlink is created for more than one radio, since the 'ath12k'
directory was already created for symlink for radio 1, a warning is
thrown complaining that 'ath12k' directory is already present. To resolve
warning 1, create symlink for each radio in separate debugfs directories.
For the first radio, the symlink will always be the 'ath12k' directory.
This ensures that the existing directory structure is retained for
single-wiphy and multi-wiphy architectures. In single-wiphy architecture
with multiple radios, create symlink in separate debugfs directories
introduced by mac80211.
Existing debugfs directory in single-wiphy architecture:
/sys/kernel/debug/ieee80211/phyX/ath12k is a symlink to
/sys/kernel/debug/ath12k/pci-0001:01:00.0/macY
Proposed debugfs directory in single-wiphy architecture with one radio:
/sys/kernel/debug/ieee80211/phyX/ath12k is a symlink to
/sys/kernel/debug/ath12k/pci-0001:01:00.0/mac0
Proposed debugfs directory in single-wiphy architecture with more than
one radio:
/sys/kernel/debug/ieee80211/phyX/radio0/ath12k is a symlink to
/sys/kernel/debug/ath12k/pci-0001:01:00.0/mac0 and
/sys/kernel/debug/ieee80211/phyX/radioY/ath12k is a symlink to
/sys/kernel/debug/ath12k/pci-0001:01:00.0/macY
Where X is phy index and Y is radio index, seen in
'iw phyX info | grep Idx'. Two symlinks for the first radio are to ensure
compatibility with the existing design. Add radio_idx inside ar, to track
the radio index in probing order.
API ath12k_debugfs_pdev_create() that creates SoC entries is called more
than once when hardware group starts up, resulting in warning 2. To
resolve this warning, remove all other calls to this API and add one
inside the ath12k_core_pdev_create(). This API carries all pdev-specific
initializations and can conveniently hold a call to
ath12k_debugfs_pdev_create().
Avula Sri Charan [Mon, 30 Mar 2026 04:07:32 +0000 (09:37 +0530)]
wifi: ath12k: Skip adding inactive partner vdev info
Currently, a vdev that is created is considered active for partner link
population. In case of an MLD station, non-associated link vdevs can be
created but not started. Yet, they are added as partner links. This leads
to the creation of stale FW partner entries which accumulate and cause
assertions.
To resolve this issue, check if the vdev is started and operating on a
chosen frequency, i.e., arvif->is_started, instead of checking if the vdev
is created, i.e., arvif->is_created. This determines if the vdev is active
or not and skips adding it as a partner link if it's inactive.
Add support to request channel change stats from the firmware through
HTT stats type 76. These stats give channel switch details like the
channel that the radio changed to, its center frequency, time taken
for the switch, chainmask details, etc.
wifi: ath12k: Rename hw_link_id to radio_idx in ath12k_ah_to_ar()
ath12k_ah_to_ar() is returning radio from the given hardware based on the
radio index passed. But, the variable that radio index is received at is
wrongly named 'hw_link_id', which points to the hardware link index that
comes from the firmware. This affects readability.
Resolve this by renaming 'hw_link_id' to 'radio_idx'.
====================
tracing: Fix kprobe attachment when module shadows vmlinux symbol
When a kernel module exports a symbol with the same name as an existing
vmlinux symbol, kprobe attachment fails with -EADDRNOTAVAIL because
number_of_same_symbols() counts matches across both vmlinux and all
loaded modules, returning a count greater than 1.
This series takes a different approach from v1-v4, which implemented a
libbpf-side fallback parsing /proc/kallsyms and retrying with the
absolute address. That approach was rejected (Andrii Nakryiko, Ihor
Solodrai) because ambiguous symbol resolution does not belong in libbpf.
Following Ihor's suggestion, this series fixes the root cause in the
kernel: when an unqualified symbol name is given and the symbol is found
in vmlinux, prefer the vmlinux symbol and do not scan loaded modules.
This makes the skeleton auto-attach path work transparently with no
libbpf changes needed.
Patch 1: Kernel fix - return vmlinux-only count from
number_of_same_symbols() when the symbol is found in vmlinux,
preventing module shadows from causing -EADDRNOTAVAIL.
Patch 2: Selftests using bpf_fentry_shadow_test which exists in both
vmlinux and bpf_testmod - tests unqualified (vmlinux) and
MOD:SYM (module) attachment across all four attach modes, plus
kprobe_multi with the duplicate symbol.
Changes since v6 [1]:
- Fix comment style: use /* on its own line instead of networking-style
/* text on opener line (Alexei Starovoitov).
selftests/bpf: Add tests for kprobe attachment with duplicate symbols
bpf_fentry_shadow_test exists in both vmlinux (net/bpf/test_run.c) and
bpf_testmod (bpf_testmod.c), creating a duplicate symbol condition when
bpf_testmod is loaded. Add subtests that verify kprobe behavior with
this duplicate symbol:
In attach_probe:
- dup-sym-{default,legacy,perf,link}: unqualified attach succeeds
across all four modes, preferring vmlinux over module shadow.
- MOD:SYM qualification attaches to the module version.
In kprobe_multi_test:
- dup_sym: kprobe_multi attach with kprobe and kretprobe succeeds.
bpf_fentry_shadow_test is not invoked via test_run, so tests verify
attach and detach succeed without triggering the probe.
bpf: Prefer vmlinux symbols over module symbols for unqualified kprobes
When an unqualified kprobe target exists in both vmlinux and a loaded
module, number_of_same_symbols() returns a count greater than 1,
causing kprobe attachment to fail with -EADDRNOTAVAIL even though the
vmlinux symbol is unambiguous.
When no module qualifier is given and the symbol is found in vmlinux,
return the vmlinux-only count without scanning loaded modules. This
preserves the existing behavior for all other cases:
- Symbol only in a module: vmlinux count is 0, falls through to module
scan as before.
- Symbol qualified with MOD:SYM: mod != NULL, unchanged path.
- Symbol ambiguous within vmlinux itself: count > 1 is returned as-is.
Fixes: 926fe783c8a6 ("tracing/kprobes: Fix symbol counting logic by looking at modules as well") Fixes: 9d8616034f16 ("tracing/kprobes: Add symbol counting check when module loads") Suggested-by: Ihor Solodrai <ihor.solodrai@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com> Link: https://lore.kernel.org/r/20260407203912.1787502-2-andrey.grodzovsky@crowdstrike.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: add test for nullable PTR_TO_BUF access
Add iter_buf_null_fail with two tests and a test runner:
- iter_buf_null_deref: verifier must reject direct dereference of
ctx->key (PTR_TO_BUF | PTR_MAYBE_NULL) without a null check
- iter_buf_null_check_ok: verifier must accept dereference after
an explicit null check
Dave Airlie [Tue, 24 Feb 2026 02:06:23 +0000 (12:06 +1000)]
ttm/pool: track allocated_pages per numa node.
This gets the memory sizes from the nodes and stores the limit
as 50% of those. I think eventually we should drop the limits
once we have memcg aware shrinking, but this should be more NUMA
friendly, and I think seems like what people would prefer to
happen on NUMA aware systems.
Cc: Christian Koenig <christian.koenig@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Dave Airlie [Tue, 24 Feb 2026 02:06:22 +0000 (12:06 +1000)]
ttm/pool: make pool shrinker NUMA aware (v2)
This enable NUMA awareness for the shrinker on the
ttm pools.
Cc: Christian Koenig <christian.koenig@amd.com> Cc: Dave Chinner <david@fromorbit.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Dave Airlie [Tue, 24 Feb 2026 02:06:20 +0000 (12:06 +1000)]
ttm/pool: port to list_lru. (v2)
This is an initial port of the TTM pools for
write combined and uncached pages to use the list_lru.
This makes the pool's more NUMA aware and avoids
needing separate NUMA pools (later commit enables this).
Cc: Christian Koenig <christian.koenig@amd.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Dave Airlie [Tue, 24 Feb 2026 02:06:19 +0000 (12:06 +1000)]
drm/ttm: use gpu mm stats to track gpu memory allocations. (v4)
This uses the newly introduced per-node gpu tracking stats,
to track GPU memory allocated via TTM and reclaimable memory in
the TTM page pools.
These stats will be useful later for system information and
later when mem cgroups are integrated.
Cc: Christian Koenig <christian.koenig@amd.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: linux-mm@kvack.org Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Dave Airlie [Tue, 24 Feb 2026 02:06:18 +0000 (12:06 +1000)]
mm: add gpu active/reclaim per-node stat counters (v2)
While discussing memcg intergration with gpu memory allocations,
it was pointed out that there was no numa/system counters for
GPU memory allocations.
With more integrated memory GPU server systems turning up, and
more requirements for memory tracking it seems we should start
closing the gap.
Add two counters to track GPU per-node system memory allocations.
The first is currently allocated to GPU objects, and the second
is for memory that is stored in GPU page pools that can be reclaimed,
by the shrinker.
Cc: Christian Koenig <christian.koenig@amd.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: linux-mm@kvack.org Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
When a CREATE returns STATUS_STOPPED_ON_SYMLINK, smb2_check_message()
returns success without any length validation, leaving the symlink
parsers as the only defense against an untrusted server.
symlink_data() walks SMB 3.1.1 error contexts with the loop test "p <
end", but reads p->ErrorId at offset 4 and p->ErrorDataLength at offset
0. When the server-controlled ErrorDataLength advances p to within 1-7
bytes of end, the next iteration will read past it. When the matching
context is found, sym->SymLinkErrorTag is read at offset 4 from
p->ErrorContextData with no check that the symlink header itself fits.
smb2_parse_symlink_response() then bounds-checks the substitute name
using SMB2_SYMLINK_STRUCT_SIZE as the offset of PathBuffer from
iov_base. That value is computed as sizeof(smb2_err_rsp) +
sizeof(smb2_symlink_err_rsp), which is correct only when
ErrorContextCount == 0.
With at least one error context the symlink data sits 8 bytes deeper,
and each skipped non-matching context shifts it further by 8 +
ALIGN(ErrorDataLength, 8). The check is too short, allowing the
substitute name read to run past iov_len. The out-of-bound heap bytes
are UTF-16-decoded into the symlink target and returned to userspace via
readlink(2).
Fix this all up by making the loops test require the full context header
to fit, rejecting sym if its header runs past end, and bound the
substitute name against the actual position of sym->PathBuffer rather
than a fixed offset.
Because sub_offs and sub_len are 16bits, the pointer math will not
overflow here with the new greater-than.
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com> Cc: Shyam Prasad N <sprasad@microsoft.com> Cc: Tom Talpey <tom@talpey.com> Cc: Bharath SM <bharathsm@microsoft.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Cc: stable <stable@kernel.org> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Assisted-by: gregkh_clanker_t1000 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Steve French <stfrench@microsoft.com>
smb: client: fix off-by-8 bounds check in check_wsl_eas()
The bounds check uses (u8 *)ea + nlen + 1 + vlen as the end of the EA
name and value, but ea_data sits at offset sizeof(struct
smb2_file_full_ea_info) = 8 from ea, not at offset 0. The strncmp()
later reads ea->ea_data[0..nlen-1] and the value bytes follow at
ea_data[nlen+1..nlen+vlen], so the actual end is ea->ea_data + nlen + 1
+ vlen. Isn't pointer math fun?
The earlier check (u8 *)ea > end - sizeof(*ea) only guarantees the
8-byte header is in bounds, but since the last EA is placed within 8
bytes of the end of the response, the name and value bytes are read past
the end of iov.
Fix this mess all up by using ea->ea_data as the base for the bounds
check.
An "untrusted" server can use this to leak up to 8 bytes of kernel heap
into the EA name comparison and influence which WSL xattr the data is
interpreted as.
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com> Cc: Shyam Prasad N <sprasad@microsoft.com> Cc: Tom Talpey <tom@talpey.com> Cc: Bharath SM <bharathsm@microsoft.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Cc: stable <stable@kernel.org> Assisted-by: gregkh_clanker_t1000 Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Steve French <stfrench@microsoft.com>
gfs2: prevent NULL pointer dereference during unmount
When flushing out outstanding glock work during an unmount, gfs2_log_flush()
can be called when sdp->sd_jdesc has already been deallocated and sdp->sd_jdesc
is NULL. Commit 35264909e9d1 ("gfs2: Fix NULL pointer dereference in
gfs2_log_flush") added a check for that to gfs2_log_flush() itself, but it
missed the sdp->sd_jdesc dereference in gfs2_log_release(). Fix that.
Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Closes: https://lore.kernel.org/r/202604071139.HNJiCaAi-lkp@intel.com/ Fixes: 35264909e9d1 ("gfs2: Fix NULL pointer dereference in gfs2_log_flush") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
During an unmount, wait for potential withdraw to complete before calling
gfs2_make_fs_ro(). This will allow gfs2_make_fs_ro() to skip much of its work.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>