Jessica Zhang [Sun, 24 May 2026 10:33:30 +0000 (13:33 +0300)]
drm/msm/dp: Fix the ISR_* enum values
The ISR_HPD_* enum should represent values that can be read from the
REG_DP_DP_HPD_INT_STATUS register. Swap ISR_HPD_IO_GLITCH_COUNT and
ISR_HPD_REPLUG_COUNT to map them correctly to register values.
While we are at it, correct the spelling for ISR_HPD_REPLUG_COUNT.
Monish Chunara [Fri, 8 May 2026 10:15:44 +0000 (15:45 +0530)]
dt-bindings: mmc: sdhci-msm: Document the Shikra compatible
Document the Shikra-specific SDHCI compatible in the sdhci-msm binding.
Use "qcom,sdhci-msm-v5" as the fallback compatible for the MSM SDHCI v5
controller used on Shikra.
Michael Riesch [Fri, 22 May 2026 21:23:11 +0000 (23:23 +0200)]
arm64: dts: rockchip: add vicap node to rk3588
Add the device tree node for the RK3588 Video Capture (VICAP) unit.
Signed-off-by: Michael Riesch <michael.riesch@collabora.com>
[converted reg values in vicap ports to hexadecimal, to have them align
with the port@X values, and be less confusing]
Link: https://patch.msgid.link/20260522-rk3588-vicap-v5-5-d1d1f5265c56@collabora.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
This is a simple helper which replaces page_folio(bvec->bv_page).
Minor improvement in readability, but the real motivation is to reduce
the number of references to bvec->bv_page so that it can be changed
with less work.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@linux.dev>
Link: https://patch.msgid.link/20260528175905.1102280-2-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The change to format propagation in the BRx broke configuration of the
DRM pipeline. Revert it to fix the regression.
The original commit was meant to fix a v4l2-compliance failure, with no
known userspace applications being affected besides test tools. Reverting
is the simplest option; a more comprehensive fix can be developed (and
tested more thoroughly) later.
The change to format initialization, along with the change to format
propagation in the BRx in commit 937f3e6b51f1 ("media: renesas: vsp1:
brx: Fix format propagation"), broke configuration of the DRM pipeline.
Revert it to fix the regression.
The original commit was meant to fix a v4l2-compliance failure, with no
known userspace applications being affected besides test tools. Reverting
is the simplest option; a more comprehensive fix can be developed (and
tested more thoroughly) later.
Fixes: 133ac42af0a1 ("media: renesas: vsp1: Initialize format on all pads")
Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> # On RZ/T2H
Reviewed-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Link: https://patch.msgid.link/20260506215650.1897177-2-laurent.pinchart+renesas@ideasonboard.com
Signed-off-by: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
Rik van Riel [Tue, 26 May 2026 19:43:29 +0000 (12:43 -0700)]
sched/fair: Use rq_clock() in update_tg_load_avg() rate-limit
update_tg_load_avg() is called once per leaf cfs_rq from the
__update_blocked_fair() walk that runs inside the NOHZ idle-balance
softirq, and again from update_load_avg() with UPDATE_TG. Its first
operation after the trivial early-outs is unconditionally:
	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return;
Jakub ran into a system where nohz_idle_balance() was taking 75%
of a CPU (which is handling network traffic and doing many irq_exit_cpu
calls), with 35% of that CPU spent in update_load_avg, and 17% of the
CPU in sched_clock_cpu(), reading the TSC.
In a quick synthetic test, it looks like this patch reduces the
CPU use of sched_balance_update_blocked_averages by about 20%.
Switch the rate-limit to read rq_clock(rq_of(cfs_rq)) instead.
This eliminates the rdtsc, and uses a fairly fresh timestamp,
because all callers of update_tg_load_avg() and clear_tg_load_avg()
hold rq->lock and have called update_rq_clock(rq) within microseconds:
  caller                                    pre-state
  __update_blocked_fair                     encloser did update_rq_clock(rq)
  update_load_avg's three UPDATE_TG sites   under rq->lock after enqueue/dequeue/update_curr
  attach_/detach_entity_cfs_rq              preceded by update_load_avg(...)
  clear_tg_load_avg via offline path        rq_clock_start_loop_update(rq) upfront
so rq->clock is fresh at every call. Since cfs_rqs are per-CPU
per-task_group, cfs_rq->last_update_tg_load_avg is always compared
against the same rq's clock; no cross-rq drift.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude (Anthropic)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260527110250.6a91718d@fangorn
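The rate-limit described in this commit can be modelled outside the kernel. The following is a minimal user-space sketch, not the kernel code: struct model_rq, struct model_cfs_rq and tg_load_update_due() are hypothetical names, and the clock field stands in for the value that update_rq_clock() has already refreshed under rq->lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Hypothetical stand-in for a runqueue: 'clock' models the timestamp
 * that the caller already refreshed while holding the runqueue lock. */
struct model_rq {
	uint64_t clock;
};

/* Hypothetical stand-in for a cfs_rq with its rate-limit timestamp. */
struct model_cfs_rq {
	struct model_rq *rq;
	uint64_t last_update_tg_load_avg;
};

/* Returns true when the update should run and false when it is
 * rate-limited. Reads the cached clock instead of a hardware counter. */
static bool tg_load_update_due(struct model_cfs_rq *cfs_rq)
{
	uint64_t now = cfs_rq->rq->clock;

	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return false;

	cfs_rq->last_update_tg_load_avg = now;
	return true;
}
```

Because each model_cfs_rq only ever compares against its own model_rq, the sketch mirrors the "no cross-rq drift" argument made above.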
Andrea Righi [Tue, 26 May 2026 16:42:49 +0000 (18:42 +0200)]
selftests/sched_ext: Validate dl_server attach/detach in total_bw test
Extend the total_bw selftest to validate the fair/ext dl_server
auto-attach/detach operations.
After the existing consistency checks, the test now doubles the
fair_server's runtime on every CPU via debugfs and verifies that:
1. total_bw grew after the customization (proves fair_server was
attached and apply_params() honored the dl_bw_attached flag),
2. with the minimal BPF scheduler loaded, total_bw drops back to the
baseline value (proves fair_server was detached and ext_server was
attached at its own default runtime),
3. after unload total_bw matches the doubled value from step 1 (proves
fair_server was re-attached with the runtime customization preserved
across the load/unload cycle).
Commit cd959a3562050d ("sched_ext: Add a DL server for sched_ext tasks")
introduced an ext_server deadline server to protect sched_ext tasks from
fair/RT starvation, mirroring the existing fair_server.
Currently, both servers reserve their 50ms/1000ms bandwidth at boot,
regardless of whether a BPF scheduler is loaded. Unused bandwidth is
still reclaimed at runtime by other classes, but the static reservation
prevents the RT class from implicitly using that headroom when one of
the two classes is guaranteed to be empty.
A sysadmin can work around this by writing
/sys/kernel/debug/sched/{fair,ext}_server/cpu*/runtime, but that
requires manual action and not all systems expose debugfs.
A better approach is to make server bandwidth reservations dynamic: only
the scheduling policy that is currently active should register its
reservation, while the inactive one should not artificially hold
capacity (keeping both reservations only when the BPF scheduler is
running in partial mode):
+---------------------------------------------+-------------+------------+
| BPF scheduler state | fair server | ext server |
+---------------------------------------------+-------------+------------+
| not loaded (default boot) | reserved | none |
| loaded full mode (!SCX_OPS_SWITCH_PARTIAL) | none | reserved |
| loaded partial mode (SCX_OPS_SWITCH_PARTIAL)| reserved | reserved |
+---------------------------------------------+-------------+------------+
To achieve this, introduce an "attached/detached" state for each
deadline server, so the kernel can decide whether a server's bandwidth
should be accounted in global bandwidth tracking.
At boot, the system starts with only the fair server contributing to
bandwidth accounting. When a BPF scheduler is enabled, the ext server is
attached and may replace or complement the fair server depending on
whether full or partial mode is used. When sched_ext is disabled, the
system restores the previous deadline bandwidth values and behavior.
The transition logic ensures that switching between scheduling modes is
consistent and reversible, without losing runtime configuration or
requiring manual intervention.
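The reservation table above can be expressed as a small decision function. This is only an illustrative sketch: enum scx_state, struct dl_reservation and dl_servers_for() are hypothetical names that do not appear in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the three BPF scheduler states in the table. */
enum scx_state { SCX_NOT_LOADED, SCX_FULL, SCX_PARTIAL };

/* Which deadline servers contribute to global bandwidth accounting. */
struct dl_reservation {
	bool fair_attached;
	bool ext_attached;
};

static struct dl_reservation dl_servers_for(enum scx_state state)
{
	struct dl_reservation r = { false, false };

	switch (state) {
	case SCX_NOT_LOADED:	/* default boot: only the fair server */
		r.fair_attached = true;
		break;
	case SCX_FULL:		/* full mode: only the ext server */
		r.ext_attached = true;
		break;
	case SCX_PARTIAL:	/* partial mode: both servers */
		r.fair_attached = true;
		r.ext_attached = true;
		break;
	}
	return r;
}
```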
Andrea Righi [Tue, 26 May 2026 10:05:02 +0000 (12:05 +0200)]
sched/deadline: Reject debugfs dl_server writes for offline CPUs
Writing runtime or period via the per-CPU dl_server debugfs files
(/sys/kernel/debug/sched/{fair,ext}_server/cpu*/{runtime,period}) on an
offline CPU can trigger two distinct kernel issues:
1) Division by zero:
   Both __dl_sub() and __dl_add() divide by cpus internally, which can be
   0 once the CPU has been removed from any active root-domain span (this
   has been latent since the debugfs interface was introduced).
2) WARN_ON_ONCE in dl_server_start():
   WARNING: kernel/sched/deadline.c:1805 at dl_server_start+0x232/0x270
   Commit ee6e44dfe6e5 ("sched/deadline: Stop dl_server before CPU goes
   offline") added this check to catch enqueueing the server on an
   offline rq.
There's no meaningful semantics for re-configuring the per-CPU dl_server
bandwidth while the CPU is offline, so simply reject the write with
-EBUSY so userspace gets a clear error.
Closes: https://lore.kernel.org/all/20260526092228.3B6891F00A3A@smtp.kernel.org/
Fixes: d741f297bcea ("sched/fair: Fair server interface")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: abaci-kreproducer <abaci@linux.alibaba.com>
Link: https://patch.msgid.link/20260526100502.575774-1-arighi@nvidia.com
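The shape of the fix, rejecting the write early instead of reaching the bandwidth arithmetic, might look roughly like this in a user-space model. struct model_cpu and model_server_write_runtime() are hypothetical; the real handler and its locking are not shown in this log.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state; the kernel tracks this differently. */
struct model_cpu {
	bool online;
	uint64_t runtime_ns;
};

/* Returns 0 on success, or -EBUSY when the CPU is offline so that the
 * caller never reaches code that would divide by a zero CPU count. */
static int model_server_write_runtime(struct model_cpu *cpu,
				      uint64_t runtime_ns)
{
	if (!cpu->online)
		return -EBUSY;

	cpu->runtime_ns = runtime_ns;
	return 0;
}
```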
On powerpc, cpu_coregroup_mask is available only when the underlying
hardware supports coregroups. Shared LPARs, QEMU guests, POWER9 and
similar systems do not support them. In such cases llc_mask was being
referenced while it was NULL, leading to a panic.
On powerpc, the LLC is at the SMT core level, so the assumption that the
coregroup (MC) domain points to the LLC is wrong. Provide a way for
architectures to say where their LLC is if it is not at the MC domain.
slab->partial is assigned by get_obj("partial") and then immediately
overwritten by get_obj_and_str("partial", &t). Remove the first
redundant assignment.
Xuewen Wang [Mon, 18 May 2026 06:21:58 +0000 (14:21 +0800)]
tools/mm/slabinfo: remove dead assignment in get_obj_and_str()
The assignment `x = NULL` sets the local parameter variable instead of
`*x`, which is a no-op since `*x` was already set to NULL on the line
above. Remove the dead assignment.
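The difference between assigning to a pointer parameter and assigning through it can be shown in a few lines. get_value_and_str() below is a hypothetical reduction, not the slabinfo function itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of a helper that hands a string back via 'x'. */
static unsigned long get_value_and_str(char **x)
{
	*x = NULL;	/* visible to the caller: clears the caller's pointer */
	x = NULL;	/* dead: only changes this function's local copy */
	(void)x;	/* silence unused-value warnings in this sketch */
	return 0;
}
```

Removing the second assignment changes nothing for the caller, which is why it can be dropped safely.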
The disable trace path in slab_debug() had a logic error where it would
set trace=1 instead of trace=0. This made trace functionality permanently
enabled once turned on for any slab cache.
Tvrtko Ursulin [Fri, 22 May 2026 09:01:29 +0000 (10:01 +0100)]
drm/sched: Fix clang build warning in kunit tests
Initializing compile time constant struct or arrays from another such
variable is a gcc extension, while clang strictly requires a compile time
constant literal.
As reported by LKP:
>> drivers/gpu/drm/scheduler/tests/tests_scheduler.c:675:10: error: initializer element is not a compile-time constant
drm_sched_scheduler_two_clients_attr),
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/kunit/test.h:224:13: note: expanded from macro 'KUNIT_CASE_PARAM_ATTR'
.attr = attributes, .module_name = KBUILD_MODNAME}
^~~~~~~~~~
1 error generated.
vim +675 drivers/gpu/drm/scheduler/tests/tests_scheduler.c
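A reduced illustration of the portability issue, under the assumption that the initializer is expressed as a macro expanding to a literal. struct attrs, struct test_case and SLOW_ATTRS are hypothetical names, and this is not necessarily how the kunit tests were changed.

```c
#include <assert.h>

struct attrs {
	int speed;
};

struct test_case {
	const char *name;
	struct attrs attr;
};

/* A macro expands to a literal initializer, which is a constant
 * expression for every compiler. */
#define SLOW_ATTRS { .speed = 1 }

/*
 * Non-portable variant, shown only as a comment: initializing a static
 * object from another const variable is not a constant expression in
 * standard C, even where a compiler accepts it as an extension.
 *
 *   static const struct attrs slow_attrs = { .speed = 1 };
 *   static const struct test_case cases[] = {
 *           { .name = "two_clients", .attr = slow_attrs },
 *   };
 */
static const struct test_case cases[] = {
	{ .name = "two_clients", .attr = SLOW_ATTRS },
};
```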
Mark Brown [Thu, 28 May 2026 23:01:44 +0000 (00:01 +0100)]
KVM: arm64: Correctly cap ZCR_EL2 provided by a guest hypervisor
ZCR_EL2 can be updated by a VHE guest hypervisor either using ZCR_EL2
(which traps) or ZCR_EL1 (which does not trap). KVM handles both in
different ways:
- on ZCR_EL2 trap, ZCR_EL2.LEN is immediately capped at the VM's own
VL limit. This has the potential to break existing SW that relies
on the full LEN field to be stateful.
- on ZCR_EL1 access, we do absolutely nothing.
On restoring the SVE context for an L2 guest, we directly restore the
guest hypervisor's view of ZCR_EL2 into the physical ZCR_EL2. If the
guest's view of the register was updated using the ZCR_EL2 accessor,
the value has already been sanitised (with the caveat mentioned above).
But if the guest used ZCR_EL1, the raw value is written into the HW,
and the L2 guest can now access VLs that it shouldn't.
Fix all the above by moving the VL capping to the restore points,
ensuring that:
- the HW is always programmed with a capped value, irrespective of
the accessor being used,
- the ZCR_EL2.LEN field is always completely stateful, irrespective
of the accessor being used.
Additionally, move ZCR_EL2 to be a sanitised register, ensuring that
only the LEN field is actually stateful. This requires some creative
construction of the RES0 mask, as the sysreg generation script does
not yet generate RAZ/WI fields.
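The idea of keeping the stored LEN value fully stateful while capping only what reaches the hardware can be sketched as two small helpers. zcr_sanitise(), zcr_hw_value() and the vm_max_len parameter are hypothetical names for illustration; the mask assumes LEN occupies the low four bits of ZCR_ELx.

```c
#include <assert.h>
#include <stdint.h>

#define ZCR_LEN_MASK 0xfULL	/* assumed: LEN is bits [3:0] of ZCR_ELx */

/* Guest-visible view: only the LEN field is stateful, but it is kept
 * in full and never capped at write time. */
static uint64_t zcr_sanitise(uint64_t raw)
{
	return raw & ZCR_LEN_MASK;
}

/* Value programmed into hardware at restore time: the guest's LEN
 * capped at the VM limit, regardless of which accessor wrote it. */
static uint64_t zcr_hw_value(uint64_t guest_zcr, uint64_t vm_max_len)
{
	uint64_t len = guest_zcr & ZCR_LEN_MASK;

	return len < vm_max_len ? len : vm_max_len;
}
```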
Jori Koolstra [Thu, 28 May 2026 17:58:47 +0000 (17:58 +0000)]
vfs: replace ints with enum last_type for LAST_XXX
Several functions in namei.c take an "int *type" parameter, such as
filename_parentat(). To know what values this can take you have to find
the anonymous enum that defines the LAST_XXX values. Define an enum
last_type to make this type explicit.
Jori Koolstra [Thu, 28 May 2026 17:58:46 +0000 (17:58 +0000)]
vfs: make LAST_XXX private to fs/namei.c
The only user of LAST_XXX outside of fs/namei.c is fs/smb/server/vfs.c;
ksmbd_vfs_path_lookup() calls vfs_path_parent_lookup() and expects a
LAST_NORM last type (or it will be ENOENT). ksmbd_vfs_rename() also calls
vfs_path_parent_lookup() but forgets the LAST_NORM check.
It does not really make sense to have vfs_path_parent_lookup() expose
the last_type because it is only needed to ensure it is LAST_NORM. So
let's do this check in vfs_path_parent_lookup() instead and keep the
LAST_XXX internal to fs/namei.c. This changes the ENOENT errno in
ksmbd_vfs_path_lookup() to EINVAL, which matches better with how this is
handled by callers of filename_parentat().
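Taken together, the two vfs changes amount to a named enum plus a check that lives inside the lookup helper. The sketch below is a user-space model: classify_last() and model_path_parent_lookup() are hypothetical, and the enumerator list is an assumption about what the LAST_XXX values are.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Named type instead of a bare int; enumerators assumed for this sketch. */
enum last_type { LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT };

/* Hypothetical classifier for the last path component. */
static enum last_type classify_last(const char *name)
{
	if (name[0] == '\0')
		return LAST_ROOT;
	if (strcmp(name, ".") == 0)
		return LAST_DOT;
	if (strcmp(name, "..") == 0)
		return LAST_DOTDOT;
	return LAST_NORM;
}

/* The helper performs the LAST_NORM check itself, so callers cannot
 * forget it and never see the last_type at all. */
static int model_path_parent_lookup(const char *last)
{
	if (classify_last(last) != LAST_NORM)
		return -EINVAL;
	return 0;
}
```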
Tomas Glozar [Thu, 14 May 2026 07:30:38 +0000 (09:30 +0200)]
rtla: Document tests in README
RTLA tests are not documented anywhere. Mention both runtime and unit
tests in the README, with instructions on how to run them and a list of
dependencies and required system configuration.
gpu: nova-core: gsp: shuffle boot code a bit to keep chipset-specific parts close
Some parts of the GSP boot process are chip-specific actions, whereas
others (like sending the initial post-boot messages) deal directly with
the working GSP.
Reorganize the boot code a bit so the chipset-specific parts are clumped
together, which will make their extraction into a HAL easier.
John Hubbard [Sat, 11 Apr 2026 02:49:35 +0000 (19:49 -0700)]
gpu: nova-core: refactor SEC2 booter loading into BooterFirmware::run()
Move the SEC2 reset/load/boot sequence into a BooterFirmware::run()
method. This is mostly refactoring, with no significant behavior change,
done in preparation for adding an alternative FSP boot path.
Suggested-by: Danilo Krummrich <dakr@kernel.org>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260521-nova-unload-v6-4-65f581c812c9@nvidia.com
[acourbot: fix typo in commit message.]
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
gpu: nova-core: do not import firmware commands into GSP command module
Importing all the firmware commands like we did is a bit confusing, as
the layer of a command type (fw or GSP) cannot be inferred from looking
at its name alone. Furthermore it makes it impossible to create commands
that have the same name as their firmware command.
Thus, stop importing all commands and refer to them from the `fw` module
instead.
gpu: nova-core: remove unneeded get_gsp_info proxy function
This function was useful before the generic command-queue send methods
got merged, but it is just boilerplate now. Replace it with the correct
sequence to queue the `GetGspStaticInfo` command directly.
Rajat Gupta [Thu, 21 May 2026 05:11:21 +0000 (22:11 -0700)]
drm: prevent integer overflows in dumb buffer creation helpers
Fix integer overflow issues in the dumb buffer creation path:
1. drm_mode_create_dumb() does not bound width, height, or bpp
before passing them to driver callbacks. Downstream helpers
(e.g. drm_gem_dma_dumb_create_internal) perform pitch/size
alignment in u32 arithmetic that can overflow for extreme
values. Add hard limits: width and height < 8192, bpp <= 32.
No legitimate software rendering use case exceeds these.
2. drm_mode_align_dumb() uses roundup(pitch, hw_pitch_align)
without checking for overflow. If pitch is near U32_MAX,
roundup() wraps to a small value, making subsequent
check_mul_overflow() pass with a much smaller pitch than
intended. Add an overflow check after roundup.
3. drm_mode_align_dumb() uses ALIGN(size, hw_size_align) which
only works correctly for power-of-two alignment values.
Replace with roundup() which works for any alignment.
Suggested-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Rajat Gupta <rajat.gupta@oss.qualcomm.com>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
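Points 2 and 3 can be reproduced with plain integer arithmetic. The helpers below are a user-space sketch with hypothetical names (roundup_u32_checked(), align_pow2()); they are not the DRM helpers themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round 'value' up to a multiple of 'align' (any non-zero alignment).
 * Returns false if 'align' is zero or the result does not fit in 32 bits. */
static bool roundup_u32_checked(uint32_t value, uint32_t align, uint32_t *out)
{
	uint64_t rounded;

	if (align == 0)
		return false;

	/* Do the arithmetic in 64 bits so a wrap can be detected. */
	rounded = ((uint64_t)value + align - 1) / align * align;
	if (rounded > UINT32_MAX)
		return false;

	*out = (uint32_t)rounded;
	return true;
}

/* Mask-based alignment: only correct when 'align' is a power of two. */
static uint32_t align_pow2(uint32_t value, uint32_t align)
{
	return (value + align - 1) & ~(align - 1);
}
```

With an alignment of 24, align_pow2(100, 24) yields 104, which is not a multiple of 24, while the division-based helper yields 120.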
Yixun Lan [Wed, 20 May 2026 23:45:28 +0000 (23:45 +0000)]
riscv: dts: spacemit: k3: Initial support for CoM260-IFX board
The K3 CoM260-IFX board combines a 260-pin "Gold Finger" computer
module with a carrier board. The module integrates the K3 SoC, LPDDR5,
UFS storage, Gigabit Ethernet, a Micro SD card and a PMIC chip. The board
offers a comprehensive array of interfaces, including MIPI-DSI, MIPI-CSI,
DisplayPort, SDIO, SPI, I2S, I2C, CAN-FD, PWM, UART, USB, PCIe, and GMAC.
Add initial support for enabling the serial UART and Ethernet.
The SpacemiT K3 CoM260-IFX board combines a 69.6 × 45 mm compute module
with a reference carrier board.
The module integrates up to 32GB LPDDR5 memory, UFS storage, Micro SD
card slot and includes interfaces such as dual MIPI CSI-2 connectors,
M.2 expansion, USB 3.0, Gigabit Ethernet, DisplayPort, and a 40-pin
expansion header.
The carrier board is intended as a general-purpose development platform
for CoM260 module and exposes interfaces for all of storage, display,
networking, and camera connectivity.
crypto: af_alg - Document that it is *always* slower
Without support for zero-copy or off-CPU offloads, AF_ALG is always
slower than software cryptography. Its only advantage is that it might
save code size. However, this is largely mitigated by lightweight
userspace cryptographic libraries.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
crypto: af_alg - Drop support for off-CPU cryptography
AF_ALG is deprecated and exposed to unprivileged userspace. Only
use the least buggy algorithm implementations: the pure software ones.
This removes one of the main advantages of AF_ALG, which is the
ability to use it with off-CPU accelerators. However, using off-CPU
accelerators has huge overheads, both in performance and attack surface.
I have yet to see real-world, performance-critical workloads where using
an accelerator via AF_ALG is actually a win over doing cryptography in
userspace.
If using an off-CPU accelerator really does turn out to be a win, a new
API should be developed that is actually a good fit for it.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The only user of msg->msg_iocb was AF_ALG, but that's deprecated.
It can be removed entirely at the cost of only supporting synchronous
operations. This doesn't break userspace, which will silently block
(for a bounded amount of time) in io_submit instead of operating
asynchronously.
This also makes struct msghdr smaller, helping every other caller of
sendmsg().
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
crypto: ccp/tsm - Enable the root port after the endpoint
PCIe r7.0, chapter "6.33.8 Other IDE Rules", mandates that if selective
IDE is enabled for config requests, a stream must be enabled on the
endpoint before enabling it on the root port:
===
For Selective IDE, the Stream must not be used until it has been enabled in
both Partner Ports. For cases where one of the Partner Ports is a Root Port
and Selective IDE for Configuration Requests is enabled, the other
Partner Port must be enabled prior to the Root Port. For other scenarios,
the mechanisms to satisfy this requirement are implementation-specific.
===
Do what the spec says.
Fixes: 4be423572da1 ("crypto/ccp: Implement SEV-TIO PCIe IDE (phase1)")
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Ahsan Atta [Wed, 20 May 2026 12:51:50 +0000 (13:51 +0100)]
crypto: qat - use pci logging variants for PCI-specific messages
Replace dev_err(&pdev->dev, ...), dev_info(&pdev->dev, ...) and
dev_dbg(&pdev->dev, ...) with pci_err(), pci_info() and pci_dbg()
where the log message relates to a PCI subsystem operation such as
device enable, BAR mapping, PCI region requests, PCI state
save/restore, and SR-IOV management.
Messages about driver-level logic (NUMA topology, device matching,
accelerator units, capabilities, configuration, DMA) are intentionally
left as dev_err() even when a struct pci_dev pointer is in scope,
since those concern the device or driver rather than the PCI bus.
No functional change.
Suggested-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Signed-off-by: Ahsan Atta <ahsan.atta@intel.com>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Ahsan Atta [Wed, 20 May 2026 12:41:55 +0000 (13:41 +0100)]
crypto: qat - protect service table iterations with service_lock
The service_table list is protected by service_lock when entries are
added or removed (in adf_service_add() and adf_service_remove()), but
several functions iterate over the list without holding this lock.
A concurrent adf_service_register() or adf_service_unregister() call
could modify the list during traversal, leading to list corruption or
a use-after-free.
Fix this by holding service_lock across all list_for_each_entry()
iterations of service_table in adf_dev_init(), adf_dev_start(),
adf_dev_stop(), adf_dev_shutdown(), adf_dev_restarting_notify(),
adf_dev_restarted_notify(), and adf_error_notifier().
The lock ordering is safe: callers of the static helpers (adf_dev_up()
and adf_dev_down()) acquire state_lock before service_lock, and no
event_hld callback or service_lock holder ever acquires state_lock in
the reverse order.
Cc: stable@vger.kernel.org
Fixes: d8cba25d2c68 ("crypto: qat - Intel(R) QAT driver framework")
Signed-off-by: Ahsan Atta <ahsan.atta@intel.com>
Co-developed-by: Maksim Lukoshkov <maksim.lukoshkov@intel.com>
Signed-off-by: Maksim Lukoshkov <maksim.lukoshkov@intel.com>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Ahsan Atta [Wed, 20 May 2026 12:33:00 +0000 (13:33 +0100)]
crypto: qat - fix restarting state leak on allocation failure
In adf_dev_aer_schedule_reset(), ADF_STATUS_RESTARTING is set before
allocating reset_data. If the allocation fails, the function returns
-ENOMEM without queuing reset work, so nothing ever clears the bit.
This leaves the device permanently stuck in the restarting state,
causing all subsequent reset attempts to be silently skipped.
Fix this by using test_and_set_bit() to atomically claim the
RESTARTING state, preventing duplicate reset scheduling races under
concurrent fatal error reporting. If the subsequent allocation fails,
clear the bit to restore clean state so future reset attempts can
proceed.
Cc: stable@vger.kernel.org
Fixes: d8cba25d2c68 ("crypto: qat - Intel(R) QAT driver framework")
Signed-off-by: Ahsan Atta <ahsan.atta@intel.com>
Co-developed-by: Maksim Lukoshkov <maksim.lukoshkov@intel.com>
Signed-off-by: Maksim Lukoshkov <maksim.lukoshkov@intel.com>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
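The claim-then-undo pattern described in this commit can be approximated in user space with C11 atomics, where atomic_fetch_or() plays the role of test_and_set_bit(). struct model_dev, STATUS_RESTARTING and the fail_alloc test hook are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define STATUS_RESTARTING (1u << 0)	/* hypothetical bit position */

struct model_dev {
	atomic_uint status;
};

/* 'fail_alloc' is a test hook that simulates an allocation failure. */
static int model_schedule_reset(struct model_dev *dev, bool fail_alloc)
{
	void *reset_data;

	/* Atomically claim the state; a concurrent caller sees the bit
	 * already set and backs off instead of scheduling a second reset. */
	if (atomic_fetch_or(&dev->status, STATUS_RESTARTING) & STATUS_RESTARTING)
		return 0;

	reset_data = fail_alloc ? NULL : malloc(16);
	if (!reset_data) {
		/* Undo the claim so later reset attempts can proceed. */
		atomic_fetch_and(&dev->status, ~STATUS_RESTARTING);
		return -ENOMEM;
	}

	/* A real driver would queue reset work here; the worker would own
	 * reset_data and clear the bit once the reset has finished. */
	free(reset_data);
	return 0;
}
```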
Thorsten Blum [Wed, 20 May 2026 10:00:30 +0000 (12:00 +0200)]
crypto: octeontx - use strscpy_pad in ucode_load_store
Instead of zero-initializing the temporary buffer and then copying into
it with strscpy(), use strscpy_pad() to copy the string and zero-pad any
trailing bytes. Drop the explicit size argument to further simplify the
code since strscpy_pad() can determine it automatically when the
destination buffer has a fixed length.
Also use strscpy_pad() to check for string truncation instead of the
hard-coded OTX_CPT_UCODE_NAME_LENGTH.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
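For readers unfamiliar with the semantics being relied on, here is a user-space model of a copy that NUL-terminates, zero-pads the tail, and reports truncation with -E2BIG. model_strscpy_pad() is a hypothetical reimplementation for illustration and takes an explicit size, unlike the two-argument form the commit switches to.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Copies 'src' into 'dst' (capacity 'size'), always NUL-terminating and
 * zero-filling the unused tail. Returns the copied length, or -E2BIG
 * if 'src' did not fit or 'size' is zero. */
static long model_strscpy_pad(char *dst, const char *src, size_t size)
{
	size_t len = 0;

	if (size == 0)
		return -E2BIG;

	/* Measure at most 'size' bytes of the source. */
	while (len < size && src[len] != '\0')
		len++;

	if (len == size) {
		/* Truncated: keep size - 1 bytes plus the terminator. */
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -E2BIG;
	}

	memcpy(dst, src, len);
	memset(dst + len, 0, size - len);	/* terminator and padding */
	return (long)len;
}
```

Checking the return value for -E2BIG replaces a separate comparison against a hard-coded maximum name length.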
Arnd Bergmann [Wed, 20 May 2026 07:38:44 +0000 (09:38 +0200)]
crypto: s390 - add select CRYPTO_AEAD for aes
The aes driver registers both skcipher and aead algorithms,
but when aead is not enabled this causes a link failure:
s390-linux-ld: arch/s390/crypto/aes_s390.o: in function `aes_s390_fini':
arch/s390/crypto/aes_s390.c:969:(.text+0x115e): undefined reference to `crypto_unregister_aead'
s390-linux-ld: arch/s390/crypto/aes_s390.o: in function `aes_s390_init':
arch/s390/crypto/aes_s390.c:1028:(.init.text+0x294): undefined reference to `crypto_register_aead'
Add the missing 'select' statement.
Fixes: bf7fa038707c ("s390/crypto: add s390 platform specific aes gcm support.")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
crypto: atmel-ecc - Use named initializers for struct i2c_device_id
While less compact, named initializers make it easier to see which
struct members are assigned which values without having to look up the
declaration of the struct. They are also more robust against changes to
the struct definition.
This patch doesn't modify the compiled array, only its representation in
source form benefits. The former was confirmed with x86 and arm64
builds.
Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
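The equivalence between the two spellings can be checked on a miniature table. struct model_device_id is a hypothetical stand-in (the real struct i2c_device_id may differ), and the device name is only illustrative.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical miniature of an ID-table entry. */
struct model_device_id {
	char name[20];
	unsigned long driver_data;
};

/* Positional form: the reader must know the member order. */
static const struct model_device_id positional[] = {
	{ "atecc508a", 0 },
	{ "", 0 },		/* sentinel */
};

/* Named form: each value is labelled and survives member reordering. */
static const struct model_device_id named[] = {
	{ .name = "atecc508a", .driver_data = 0 },
	{ .name = "" },		/* sentinel */
};
```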
crypto: atmel-sha204a - Use named initializers for struct i2c_device_id
While less compact, named initializers make it easier to see which
struct members are assigned which values without having to look up the
declaration of the struct. They are also more robust against changes to
the struct definition.
This patch doesn't modify the compiled array, only its representation in
source form benefits. The former was confirmed with x86 and arm64
builds.
For consistency also assign .driver_data for the array item that the
driver relies on i2c_get_match_data() returning NULL for.
Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The driver binds to i2c devices only and thus in the absence of an
assignment for .data in the of_device_id array i2c_get_match_data()
falls back to .driver_data from the i2c_device_id array. So only provide
&atsha204_quality once to reduce duplication.
Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
ecrdsa_exit_tfm() is empty, and sig_alg .exit is optional. The
corresponding .init callback is not set either, so there is nothing to
release in .exit.
Herbert Xu [Tue, 19 May 2026 04:22:18 +0000 (12:22 +0800)]
crypto: tegra - Fix dma_free_coherent size error
When freeing a coherent DMA buffer, the size must match the value
that was used during the allocation.
Unfortunately the size field in the tegra driver gets overwritten
by this point so it no longer matches and creates a warning.
Fix this by saving a copy of the size on the stack.
Note that the ccm function actually mixes up the inbuf and outbuf
sizes, but it doesn't matter because the two sizes are actually
equal.
Fixes: 1cb328da4e8f ("crypto: tegra - Do not use fixed size buffers")
Reported-by: Patrick Talbert <ptalbert@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Vladislav Dronov <vdronov@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Zongyu Wu [Mon, 18 May 2026 14:29:56 +0000 (22:29 +0800)]
crypto: hisilicon/qm - support doorbell enable control
The driver notifies the hardware of new tasks through doorbells.
Currently, doorbells are enabled by default. A process that sends
doorbells while the hardware is being reset can cause the hardware to
process them and trigger new errors. For example, doorbells may still
be sent from a virtual machine while the physical machine is resetting
the device.
Therefore, the driver disables doorbells while the hardware is
unavailable and enables them again once hardware initialization has
completed. Any task sent during the unavailability period returns an
error.
The hardware supports the PF disabling doorbells for all functions,
while a VF can only disable its own doorbell. When the PF is reset, it
disables doorbells for all functions. When a VF is reset, it only
disables its own doorbell and does not affect tasks on other functions.
Signed-off-by: Zongyu Wu <wuzongyu1@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Weili Qian [Mon, 18 May 2026 14:29:55 +0000 (22:29 +0800)]
crypto: hisilicon - mask all error type when removing driver
Each bit in the error interrupt register corresponds to a specific
error type. A bit value of 0 enables the interrupt, and a bit value
of 1 disables the interrupt. Currently, the disable path incorrectly
enables interrupt types that were not enabled before. Therefore, when
disabling interrupts, write 1 to all bits directly.
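One way such a bug can arise with an inverted-sense mask register is sketched below. This is an assumption about the general pattern, not a reconstruction of the driver code: the 32-bit width, ERR_MASK_ALL and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Register semantics from the text: bit = 0 enables an interrupt,
 * bit = 1 masks it. The width is an assumption of this sketch. */
#define ERR_MASK_ALL 0xffffffffu

/* Problematic pattern: masking only the types the driver had enabled
 * leaves every other bit at 0, which turns those interrupts on. */
static uint32_t disable_value_partial(uint32_t enabled_types)
{
	return enabled_types;
}

/* Safe pattern: mask everything, independent of earlier state. */
static uint32_t disable_value_all(void)
{
	return ERR_MASK_ALL;
}

/* Set of interrupt types that remain enabled for a register value. */
static uint32_t still_enabled(uint32_t reg)
{
	return ~reg;
}
```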
Weili Qian [Mon, 18 May 2026 14:29:54 +0000 (22:29 +0800)]
crypto: hisilicon/qm - disable error report before flr
Before a function level reset (FLR), the driver first disables device
error reporting and then waits for the device reset to complete. However,
when the error is recovered, the error bits are enabled again, which
undoes the disable. Change the sequence so that the driver first checks
that no error is pending, then disables error reporting, and only then
performs the FLR.
Zhushuai Yin [Mon, 18 May 2026 14:29:53 +0000 (22:29 +0800)]
crypto: hisilicon/qm - support function-level error reset
When executing operations on crypto devices, hardware errors
are inevitable. For certain errors, a full device reset is
required to recover. However, in certain cases, only a
specific function may fail, while other functions can still
operate normally. A system-wide RAS reset in such cases would
unnecessarily impact functioning components.
This patch introduces function-level granularity handling,
enabling targeted resets of only the error-reporting
functions without affecting other operational functions.
Zhushuai Yin [Mon, 18 May 2026 14:29:52 +0000 (22:29 +0800)]
crypto: hisilicon/qm - place the interrupt status interface after the PM usage counter
The PM usage counter interface is needed to avoid accessing the memory of
a suspended device, but it involves sleeping operations and therefore
cannot be called from the interrupt top half. Move the interface that
reads the interrupt status in the RAS reset flow, which currently runs
in interrupt context, to the bottom half.
Zhushuai Yin [Mon, 18 May 2026 14:29:51 +0000 (22:29 +0800)]
crypto: hisilicon/qm - allow VF devices to query hardware isolation status
Previously, a VF device could not obtain the isolation status and
isolation threshold of the device. Fix this so that the accelerator
driver can query the device isolation status and threshold via the VF
device using the fault query sysfs interface under uacce. Note that
only the PF device supports isolation policy configuration, while the
VF device is limited to read-only query operations.
Gao Xiang [Fri, 22 May 2026 08:27:16 +0000 (16:27 +0800)]
erofs: fix use-after-free on sbi->sync_decompress
z_erofs_decompress_kickoff() can race with filesystem unmount, causing
a use-after-free on sbi->sync_decompress.
When I/O completes, z_erofs_endio() calls z_erofs_decompress_kickoff()
to queue z_erofs_decompressqueue_work() asynchronously. Then, after all
folios are unlocked, the unmount workflow can proceed and sbi can be freed
before sbi->sync_decompress is accessed.
Breno Leitao [Wed, 6 May 2026 12:58:25 +0000 (05:58 -0700)]
selftests/mm: add kmemleak verbose dedup test
Add a regression test for the per-scan verbose dedup added in the
preceding commit. The test loads samples/kmemleak's helper module
(CONFIG_SAMPLE_KMEMLEAK=m) to generate orphan allocations, several of
which share an allocation backtrace, runs four kmemleak scans with verbose
printing enabled, then walks dmesg looking for two "unreferenced object"
reports within a single scan that share an identical backtrace - which
would mean dedup failed to collapse them.
The test is intentionally permissive on detection but strict on
regressions:
- PASS when no duplicates are observed, regardless of whether the
dedup summary line ("... and N more object(s) with the same
backtrace") was actually emitted. Per-CPU chunk reuse, slab
freelist pointers, kernel stack residue and
CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN can all keep most of the orphans
"still referenced" or reported across many separate scans, so the
dedup path may have nothing to fold within one scan. That is not a
regression.
- PASS reports whether dedup actually fired, so a passing run on a
well-behaved environment is still informative.
- FAIL when two same-backtrace reports land in a single scan (clear
dedup regression).
- FAIL when kmemleak's own per-scan tally counts leaks but the
verbose path emits zero "unreferenced object" lines - that catches
a regression in the verbose printer itself, which would otherwise
pass the duplicate check trivially.
- SKIP when kmemleak is absent, disabled at runtime, or the helper
module is not built.
The dmesg parser anchors stack-frame matching to the indentation kmemleak
uses for them (4+ spaces under "kmemleak: ") so unrelated kmemleak
warnings landing between reports do not get lumped into the backtrace key
and mask a duplicate.
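The duplicate check described above can be sketched roughly as follows. This
is an illustration in Python, not the selftest itself: the function name
find_duplicate_backtraces(), the use of the "new suspected memory leaks"
tally line as the scan delimiter, and the exact frame indentation rule are
assumptions made for the example.

```python
import re

PREFIX = "kmemleak: "
# Assumed rule: stack frames carry 4+ spaces after the "kmemleak: " prefix.
FRAME_RE = re.compile(r"^ {4,}\S")


def find_duplicate_backtraces(lines):
    """Return backtrace keys reported more than once within one scan.

    Assumption: a line containing "new suspected memory leaks" ends a scan.
    """
    duplicates = []
    seen = set()      # backtrace keys already reported in the current scan
    frames = None     # frames of the report being parsed, or None

    def close_report():
        nonlocal frames
        if frames:
            key = tuple(frames)
            if key in seen:
                duplicates.append(key)
            seen.add(key)
        frames = None

    for line in lines:
        pos = line.find(PREFIX)
        if pos < 0:
            continue                          # not a kmemleak line
        body = line[pos + len(PREFIX):]
        if body.startswith("unreferenced object"):
            close_report()
            frames = []
        elif "new suspected memory leaks" in body:
            close_report()
            seen.clear()                      # the next scan starts fresh
        elif frames is not None and FRAME_RE.match(body):
            frames.append(body.strip())
        # Any other kmemleak line is ignored and never enters the key.
    close_report()
    return duplicates
```

Because only lines matching the frame indentation enter the key, an
unrelated kmemleak warning between two frames does not change it.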
Link: https://lore.kernel.org/20260506-kmemleak_dedup-v3-2-2d36aafc34da@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Breno Leitao [Wed, 6 May 2026 12:58:24 +0000 (05:58 -0700)]
mm/kmemleak: dedupe verbose scan output by allocation backtrace
Patch series "mm/kmemleak: dedupe verbose scan output", v3.
I am starting to run kmemleak with verbose mode enabled at some "probe
points" across my employer's fleet so that suspected leaks land in
dmesg without needing a separate read of /sys/kernel/debug/kmemleak.
The downside is that workloads which leak many objects from a single
allocation site flood the console with byte-for-byte identical backtraces.
Hundreds of duplicates per scan are common, drowning out distinct leaks
and unrelated kernel messages, while adding no signal beyond the first
occurrence.
This series collapses those duplicates inside kmemleak itself. Each
unique stackdepot trace_handle prints once per scan, followed by a short
summary line when more than one object shares it:
kmemleak: unreferenced object 0xff110001083beb00 (size 192):
kmemleak: comm "modprobe", pid 974, jiffies 4294754196
kmemleak: ...
kmemleak: backtrace (crc 6f361828):
kmemleak: __kmalloc_cache_noprof+0x1af/0x650
kmemleak: ...
kmemleak: ... and 71 more object(s) with the same backtrace
The "N new suspected memory leaks" tally and the contents of
/sys/kernel/debug/kmemleak are unchanged - the per-object detail is still
available on demand, only the verbose (dmesg) output is collapsed.
Patch 1 is the kmemleak change.
Patch 2 adds a selftest that loads samples/kmemleak's kmemleak-test module
(CONFIG_SAMPLE_KMEMLEAK) to generate ten leaks sharing one call site and
checks that the printed count is strictly less than the reported leak
total. I am not sure whether Patch 2 is useful; if it is not, it is easy to
discard.
This patch (of 2):
In kmemleak's verbose mode, every unreferenced object found during a scan
is logged with its full header, hex dump and 16-frame backtrace.
Workloads that leak many objects from a single allocation site flood dmesg
with byte-for-byte identical backtraces, drowning out distinct leaks and
other kernel messages.
Dedupe within each scan using stackdepot's trace_handle as the key: for
every leaked object with a recorded stack trace, look up the
representative kmemleak_object in a per-scan xarray keyed by trace_handle.
The first sighting stores the object pointer (with a get_object()
reference) and sets object->dup_count to 1; later sightings just bump
dup_count on the representative. After the scan, walk the xarray once and
emit each unique backtrace, followed by a single summary line when more
than one object shares it.
Leaks whose trace_handle is 0 (early-boot allocations tracked before
kmemleak_init() set up object_cache, or stack_depot_save() failures under
memory pressure) cannot be deduped, so they are still printed inline via
the same locked OBJECT_ALLOCATED-checked helper. The contents of
/sys/kernel/debug/kmemleak are unchanged - only the verbose console output
is collapsed.
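The per-scan bookkeeping described above can be modelled in a few lines. The
following is a simplified user-space sketch in Python, not the kernel code:
a dict stands in for the xarray, the LeakedObject class and report_scan()
are hypothetical names, and locking, reference counting and the hex dump are
left out entirely.

```python
from dataclasses import dataclass


@dataclass
class LeakedObject:
    pointer: int
    size: int
    trace_handle: int    # 0 means no recorded stack trace
    dup_count: int = 0


def report_scan(leaks):
    """Return the verbose lines one scan would print, deduped by trace_handle."""
    lines = []
    representatives = {}     # stands in for the per-scan xarray

    for obj in leaks:
        if obj.trace_handle == 0:
            # No key to dedupe on, so this leak is printed inline.
            lines.append(f"unreferenced object {obj.pointer:#x} (size {obj.size})")
            continue
        rep = representatives.get(obj.trace_handle)
        if rep is None:
            obj.dup_count = 1                    # first sighting
            representatives[obj.trace_handle] = obj
        else:
            rep.dup_count += 1                   # later sighting

    # After the scan: one report per unique backtrace, plus a summary line
    # when more than one object shares it.
    for rep in representatives.values():
        lines.append(f"unreferenced object {rep.pointer:#x} (size {rep.size})")
        if rep.dup_count > 1:
            lines.append(
                f"... and {rep.dup_count - 1} more object(s) with the same backtrace"
            )
    return lines
```

With 72 objects sharing one trace_handle, the sketch emits a single report
followed by "... and 71 more object(s) with the same backtrace".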
Safety notes:
- The xarray store happens outside object->lock: object->lock is a
raw spinlock, while xa_store() may grab xa_node slab locks at a
higher wait-context level which lockdep flags as invalid.
trace_handle is captured under object->lock (which serialises with
kmemleak_update_trace()'s writer), so it is safe to use after
dropping the lock.
- get_object() pins the kmemleak_object metadata across
rcu_read_unlock(), but the underlying tracked allocation can still
be freed concurrently. The deferred print path therefore re-acquires
object->lock and re-checks OBJECT_ALLOCATED via print_leak_locked()
before touching object->pointer; __delete_object() clears that flag
under the same lock before the user memory goes away. The same
helper is used by the trace_handle == 0 and xa_store() failure
fallbacks, so every printer in the new path has identical safety
guarantees.
- If get_object() fails after we set OBJECT_REPORTED, the object is
already being torn down (use_count hit zero); the leak count is
still accurate but the verbose line is dropped, which is correct:
the memory was freed concurrently and is no longer a leak.
- If xa_store() fails to allocate an xa_node under memory pressure,
we fall back to printing inline via print_leak_locked() instead of
silently dropping the leak.
- The hex dump is skipped for coalesced entries (dup_count > 1):
bytes would differ across objects sharing a backtrace anyway, and
skipping it removes the only remaining read of object->pointer's
contents in the deferred path. The representative's reported size
may also differ from the coalesced objects' sizes; the printed
trace_handle reflects the representative's current value rather
than the value used as the dedup key, which is normally - but not
strictly - identical.
Link: https://lore.kernel.org/20260506-kmemleak_dedup-v3-0-2d36aafc34da@debian.org Link: https://lore.kernel.org/20260506-kmemleak_dedup-v3-1-2d36aafc34da@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zijiang Huang [Wed, 6 May 2026 13:09:19 +0000 (21:09 +0800)]
mm/swap: add cond_resched() in swap_reclaim_full_clusters to prevent softlockup
We hit a real softlockup in an internal stress test environment. The
workload was LTP memory/swap stress on a large arm64 machine, with 320
CPUs, about 1TB memory and an 8.6GB swap device. The system was under
heavy load and the swap device had a large number of full clusters. The
softlockup was triggered during a stress test after about 3 days.
So, add periodic cond_resched() calls during large full_clusters
reclaim operations to prevent softlockup issues.
Link: https://lore.kernel.org/20260506130919.2298807-1-kerayhuang@tencent.com Fixes: 5168a68eb78f ("mm, swap: avoid over reclaim of full clusters") Signed-off-by: Zijiang Huang <kerayhuang@tencent.com> Reviewed-by: Kairui Song <kasong@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Reviewed-by: albinwyang <albinwyang@tencent.com> Reviewed-by: Baoquan He <baoquan.he@linux.dev> Acked-by: Chris Li <chrisl@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Youngjun Park <youngjun.park@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 2 May 2026 02:05:03 +0000 (19:05 -0700)]
mm/damon/stat: add a parameter for reading kdamond pid
Patch series "mm/damon/stat: add kdamond_pid parameter".
DAMON_STAT doesn't provide the pid of its kdamond, unlike DAMON_RECLAIM
and DAMON_LRU_SORT. This makes user-space management of DAMON_STAT
unnecessarily complicated. Provide the information via a new parameter,
namely kdamond_pid, and document it.
This patch (of 2):
Knowing the pid of the kdamonds can help user-space management including
monitoring of DAMON's system resource consumption. To make it easier,
DAMON_SYSFS, DAMON_RECLAIM and DAMON_LRU_SORT provide the pid information.
DAMON_STAT is not providing it, though. Expose the pid of DAMON_STAT
kdamond via a new read-only module parameter, namely kdamond_pid. This
also makes the usage of DAMON modules more standardized, because DAMON_RECLAIM
and DAMON_LRU_SORT also provide the information via their read-only
parameters of the same name.
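From user space, reading the new parameter could look like the sketch below.
The sysfs path is an assumption based on the usual
/sys/module/<name>/parameters layout, and read_kdamond_pid() is a
hypothetical helper; neither is taken from this patch.

```python
from pathlib import Path

# Assumed location, following the usual /sys/module/<name>/parameters layout.
DEFAULT_PARAM = Path("/sys/module/damon_stat/parameters/kdamond_pid")


def read_kdamond_pid(param_path=DEFAULT_PARAM):
    """Return the parameter value as an int, or None if it cannot be read."""
    try:
        return int(Path(param_path).read_text().strip())
    except (OSError, ValueError):
        # Missing file (module not loaded or parameter absent) or odd content.
        return None
```

A monitoring script could pass the returned pid to tools such as ps or top
to watch the kdamond's resource consumption.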
Link: https://lore.kernel.org/20260502020505.80822-1-sj@kernel.org Link: https://lore.kernel.org/20260502020505.80822-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon/reclaim: support monitoring intervals auto-tuning".
The monitoring intervals auto-tuning feature of DAMON has proven to be
useful in multiple environments. Add a new DAMON_RECLAIM parameter for
supporting the feature, and update the document for the new parameter.
This patch (of 2):
DAMON's monitoring intervals auto-tuning feature has proven to be useful
in multiple environments. DAMON_RECLAIM is still asking users to do the
manual tuning of the intervals. Add a module parameter for utilizing the
auto-tuning feature with the suggested default setup.
Note that use of the auto-tuning overrides the manually entered monitoring
intervals. Also, note that 'min_age' will be dynamically changed in
proportion to the auto-tuned intervals. It is recommended to set 'min_age'
short enough and to use it together with coldness threshold auto-tuning
features such as 'quota_mem_pressure_us'.
Link: https://lore.kernel.org/20260501011740.81988-1-sj@kernel.org Link: https://lore.kernel.org/20260501011740.81988-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li Wang [Fri, 1 May 2026 02:20:58 +0000 (10:20 +0800)]
selftests/cgroup: include slab in test_percpu_basic memory check
test_percpu_basic() currently compares memory.current against only
memory.stat:percpu after creating 1000 child cgroups.
Observed failure:
#./test_kmem
ok 1 test_kmem_basic
ok 2 test_kmem_memcg_deletion
ok 3 test_kmem_proc_kpagecgroup
ok 4 test_kmem_kernel_stacks
ok 5 test_kmem_dead_cgroups
memory.current 11530240
percpu 8440000
not ok 6 test_percpu_basic
That assumption is too strict: child cgroup creation also allocates
slab-backed metadata, so memory.current is expected to be larger than
percpu alone. One visible path is:
These kernfs allocations are charged as slab and show up in
memory.stat:slab.
Update the check to compare memory.current against (percpu + slab)
within MAX_VMSTAT_ERROR, and print slab/delta in the failure message to
improve diagnostics.
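The relaxed comparison can be illustrated with the numbers from the failure
log above. This is a sketch in Python rather than the selftest's C code;
the slab value and the error bound used here are made up for illustration,
since the log does not show them.

```python
def percpu_check(current, percpu, slab, max_error):
    """Return (passed, delta) for the relaxed test_percpu_basic comparison."""
    delta = current - (percpu + slab)
    return abs(delta) <= max_error, delta


# Values from the failure log; the old check ignored slab entirely.
current, percpu = 11530240, 8440000
old_delta = current - percpu       # 3090240 bytes left unexplained

# Hypothetical slab reading and error bound, for illustration only.
slab = 3000000
max_error = 4096 * 64 * 4
passed, delta = percpu_check(current, percpu, slab, max_error)
print(f"old delta {old_delta}, new delta {delta}, passed {passed}")
```

Once slab is part of the expected total, the remaining delta is small
enough to fall within the error bound in this example.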
Link: https://lore.kernel.org/20260501022058.18024-3-li.wang@linux.dev Signed-off-by: Li Wang <li.wang@linux.dev> Reviewed-by: Waiman Long <longman@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Sayali Patil <sayalip@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li Wang [Fri, 1 May 2026 02:20:57 +0000 (10:20 +0800)]
selftests/cgroup: fix hardcoded page size in test_percpu_basic
Patch series "selftests/cgroup: Fix false positive failures in
test_percpu_basic", v2.
This patch series addresses two separate issues that cause false
positive failures in the test_percpu_basic test within the cgroup
kmem selftests.
The first issue stems from a hardcoded assumption about the system
page size, which breaks the test on architectures with larger page
sizes.
The second issue is an overly strict memory check that fails to
account for the slab metadata allocated during cgroup creation.
This patch (of 2):
MAX_VMSTAT_ERROR uses a hardcoded page size of 4096, which assumes 4K
pages. This causes test_percpu_basic to fail on systems where the kernel
is configured with a larger page size, such as aarch64 systems using 16K
or 64K pages, where the maximum permissible discrepancy between
memory.current and percpu charges is proportionally larger.
Replace the hardcoded 4096 with sysconf(_SC_PAGESIZE) to correctly derive
the page size at runtime regardless of the underlying architecture or
kernel configuration.
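The effect of the page size on the bound can be sketched as below, in Python
for brevity. The factor of 64 pages per CPU and the helper name
max_vmstat_error() are assumptions for the example; only the switch from a
constant 4096 to a runtime page size query comes from this patch.

```python
import os


def max_vmstat_error(page_size, nr_cpus, pages_per_cpu=64):
    """Permitted discrepancy in bytes; pages_per_cpu=64 is an assumed factor."""
    return page_size * pages_per_cpu * nr_cpus


page_size = os.sysconf("SC_PAGESIZE")    # queried at runtime, not hardcoded
nr_cpus = os.cpu_count() or 1

print("bound with runtime page size:", max_vmstat_error(page_size, nr_cpus))
print("bound with hardcoded 4096:   ", max_vmstat_error(4096, nr_cpus))
```

On a 64K-page kernel the first bound is 16 times the second, which is why
the hardcoded value made the check too strict there.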
Link: https://lore.kernel.org/20260501022058.18024-1-li.wang@linux.dev Link: https://lore.kernel.org/20260501022058.18024-2-li.wang@linux.dev Signed-off-by: Li Wang <li.wang@linux.dev> Acked-by: Waiman Long <longman@redhat.com> Reviewed-by: Sayali Patil <sayalip@linux.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/filemap: do not count FAULT_FLAG_TRIED retries as mmap hits
A fault that starts synchronous mmap readahead can return VM_FAULT_RETRY
after dropping mmap_lock. The retry may then map the folio brought in by
that same miss.
Do not let this retry decrement mmap_miss. The retry still maps the folio
from the page cache; it just does not count as a useful mmap readahead
hit.
Link: https://lore.kernel.org/tencent_22E6B8849EC1141FE7773C64467E6F1E2C09@qq.com Signed-off-by: fujunjie <fujunjie1@qq.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Vishal Moola <vishal.moola@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/filemap: count only the faulting address as a mmap hit
Patch series "mm/filemap: tighten mmap_miss hit accounting", v3.
mmap_miss is increased when synchronous mmap readahead is needed, and
decreased when filemap_map_pages() maps folios that are already in the
page cache. The decrease side can over-credit hits in two cases:
- fault-around installs nearby PTEs even though the fault only proves
that the faulting address was accessed;
- after synchronous mmap readahead returns VM_FAULT_RETRY, the retry
can find the folio brought in by the same miss and immediately
cancel that miss.
Current evidence comes from a local KVM/data-disk microbenchmark using
mmap_miss_probe, with an 8 GiB guest, 2 vCPUs, 8192 KiB read_ahead_kb,
cold page cache before each run, 1% of the file accessed, and medians of 3
runs.
mmap_miss_probe mmap()s a prepared file with MADV_NORMAL and then touches
one byte at selected base-page offsets. The access order is random,
sequential, or a fixed page stride. The harness drops caches before each
run and samples /proc/vmstat around that access loop.
The 20 GiB case below is a larger-than-memory file case in an 8 GiB guest.
No separate memory hog was used. The 4 GiB case uses the same 8 GiB
guest but keeps the file fit-in-memory.
Each case used a fresh temporary qcow2 data disk, seen by the guest as
/dev/vda, formatted as ext4 and mounted at /mnt/mmap-matrix.
Each result is "pgpgin GiB / elapsed seconds". "pgpgin GiB" is the delta
of the guest /proc/vmstat pgpgin counter, converted from KiB to GiB; it is
used here as an approximate block input counter, not as resident memory or
exact application IO. "Elapsed seconds" is the wall-clock runtime of the
whole mmap_miss_probe access pass, not per-access latency.
For the 20 GiB larger-than-memory case:
workload    before                 after
random      223.377 GiB/101.293s   1.010 GiB/4.790s
stride1021  204.214 GiB/97.557s    204.208 GiB/108.086s
stride2053  409.584 GiB/193.700s   0.970 GiB/3.685s
stride4099  406.452 GiB/134.241s   0.975 GiB/3.499s
sequential  0.212 GiB/0.050s       0.212 GiB/0.057s
For the 4 GiB fit-in-memory case:
workload    before                 after
random      3.987 GiB/1.960s       0.980 GiB/1.221s
stride1021  4.002 GiB/1.838s       4.002 GiB/1.851s
stride2053  3.991 GiB/1.835s       0.811 GiB/0.985s
stride4099  4.001 GiB/1.836s       0.819 GiB/1.037s
sequential  0.056 GiB/0.013s       0.056 GiB/0.018s
The 20 GiB setup also has an ablation. P1 is only the faulting-address
hit accounting change. P2-only is only the FAULT_FLAG_TRIED retry
filter. P1+P2 is the combined accounting change:
This does not claim to solve every sparse pattern. The stride1021 rows
are intentionally shown as a boundary: with 8192 KiB read_ahead_kb,
file->f_ra.ra_pages is 2048 base pages, and synchronous mmap read-around
uses a 2048-page window centered around the fault, roughly [index - 1024,
index + 1023]. stride1021 is 1021 * 4 KiB = 4084 KiB, so the next access
lands inside the previous read-around window. About every other access
can be a real faulting-address page-cache hit, and the other half can each
read about 8 MiB. For about 52k accesses in the 20 GiB/1% run, half of
them times 8 MiB is about 205 GiB, matching the observed 204 GiB.
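The stride1021 arithmetic above can be rechecked with a short Python
calculation. It assumes 4 KiB base pages and one access per touched page,
as in the description; the variable names are only for this example.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3
PAGE = 4 * KIB

ra_pages = (8192 * KIB) // PAGE                  # 2048 base pages
# Read-around window relative to the faulting index: [-1024, +1023].
window_lo, window_hi = -(ra_pages // 2), ra_pages // 2 - 1

stride = 1021                                    # pages between accesses
stride_kib = stride * PAGE // KIB                # 4084 KiB
first_hit = window_lo <= stride <= window_hi         # next access: in window
second_hit = window_lo <= 2 * stride <= window_hi    # the one after: outside

accesses = (20 * GIB // PAGE) // 100             # 1% of the 20 GiB file
misses = accesses // 2                           # about every other access
read_gib = misses * ra_pages * PAGE / GIB

print(f"{accesses} accesses, {misses} misses, about {read_gib:.1f} GiB read")
```

The result is close to 205 GiB, in line with the roughly 204 GiB observed
for stride1021 both before and after the change.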
This patch (of 2):
filemap_map_pages() reduces file->f_ra.mmap_miss when fault-around maps
folios that are already present in the page cache. That hit accounting is
too generous because fault-around can install PTEs around the faulting
address even though the fault only proves that the faulting address was
accessed.
Move the mmap_miss update back into filemap_map_pages(), drop the
mmap_miss argument from the helper functions, and decrement mmap_miss only
when the helper return value shows that the faulting address was mapped.
Keep the existing workingset-folio behavior unchanged.
mm: use zone lock guard in set_migratetype_isolate()
Use spinlock_irqsave scoped lock guard in set_migratetype_isolate() to
replace the explicit lock/unlock pattern with automatic scope-based
cleanup. The scoped variant is used to keep dump_page() outside the
locked section to avoid a lockdep splat.
Link: https://lore.kernel.org/6883351ad7f74d20875fff30e0e3214a089cea97.1777462630.git.d@ilvokhin.com Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: use zone lock guard in unreserve_highatomic_pageblock()
Use spinlock_irqsave zone lock guard in unreserve_highatomic_pageblock()
to replace the explicit lock/unlock pattern with automatic scope-based
cleanup.
Link: https://lore.kernel.org/69db814cd178915cb5615334a29304678f960963.1777462630.git.d@ilvokhin.com Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: use zone lock guard in unset_migratetype_isolate()
Use spinlock_irqsave zone lock guard in unset_migratetype_isolate() to
replace the explicit lock/unlock and goto pattern with automatic
scope-based cleanup.
Link: https://lore.kernel.org/815c0905ea77828ed32bf56ff0a6d3c6548eb3a2.1777462630.git.d@ilvokhin.com Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: use zone lock guard in reserve_highatomic_pageblock()
Patch series "mm: use spinlock guards for zone lock", v3.
This series uses spinlock guard for zone lock across several mm functions
to replace explicit lock/unlock patterns with automatic scope-based
cleanup.
This simplifies the control flow by removing 'flags' variables, goto
labels, and redundant unlock calls.
Patches are ordered by decreasing value. The first six patches simplify
the control flow by removing gotos, multiple unlock paths, or 'ret'
variables. The last two are simpler lock/unlock pair conversions that
only remove 'flags' and can be dropped if considered unnecessary churn.
Binary size increase is +39 bytes, with Peter Zijlstra's fix for guards
[1] applied. This is due to the compiler not being able to deduplicate
epilogue and eliminate redundant NULL check. See discussion [2] for more
details. I proposed a patch [3] that fixes this, but until it is merged
we need to assume +39 bytes will stay (though it is compiler dependent).
This patch (of 8):
Use the spinlock_irqsave zone lock guard in reserve_highatomic_pageblock()
to replace the explicit lock/unlock and goto out_unlock pattern with
automatic scope-based cleanup.
SeongJae Park [Wed, 29 Apr 2026 15:03:06 +0000 (08:03 -0700)]
Docs/ABI/damon: mark schemes/<S>/filters/ deprecated
Now the 'filters/' directory is deprecated. Update ABI document to also
announce the fact. Also update the descriptions of the files to be based
on the 'core_filters/' directory, to make the old descriptions ready to be
removed when the time arrives.
Link: https://lore.kernel.org/20260429150309.82282-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Wed, 29 Apr 2026 15:03:05 +0000 (08:03 -0700)]
Docs/admin-guide/mm/damon/usage: mark scheme filters sysfs dir as deprecated
Patch series "mm/damon/sysfs: document filters/ directory as deprecated".
Commit ab71d2d30121 ("mm/damon/sysfs-schemes: let
damon_sysfs_scheme_set_filters() be used for different named directories")
introduced alternatives to the 'filters/' directory, namely the
'core_filters/' and 'ops_filters/' directories. Now the alternatives are well stabilized and
ready for all users. All filters/ directory use cases are expected to be
able to be migrated to the alternatives. An LTS kernel having the
alternatives, namely 6.18.y, is also released. Existence of filters/
directory is only confusing.
It would be better not to remove the directory immediately, though. There
could be users that need time before migrating to the alternatives. There
might be unexpected use cases that the alternatives cannot support. Doing
the deprecation step by step across multiple years like DAMON debugfs
deprecation would be safer. Start the deprecation changes by announcing
the deprecation on the documents.
Every year, one more action toward completely removing the directory will
follow, as was done for the DAMON debugfs deprecation. The following yearly
actions are currently expected. In 2027, deprecation warning kernel messages will
be printed once, for use of filters/ directory. In 2028, filters/
directory will be renamed to filters_DEPRECATED/. In 2029,
filters_DEPRECATED/ directory will be removed.
This patch (of 2):
The alternatives of 'filters/' directory, namely 'core_filters/' and
'ops_filters/', can fully support all the features 'filters/' directory
can do, and provide better user experience. Having 'filters/' directory
is only confusing to users. Announce it as deprecated on the usage
document.
Link: https://lore.kernel.org/20260429150309.82282-1-sj@kernel.org Link: https://lore.kernel.org/20260429150309.82282-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/khugepaged: return -EAGAIN for SCAN_PAGE_HAS_PRIVATE in MADV_COLLAPSE
MADV_COLLAPSE uses errno values to provide actionable feedback to
userspace. Temporary resource constraints are mapped to -EAGAIN so the
caller may retry, while intrinsic failures of the specified range are
mapped to -EINVAL.
collapse_file() returns SCAN_PAGE_HAS_PRIVATE when filemap_release_folio()
fails while isolating file-backed folios for collapse. This currently
falls through the default case in madvise_collapse_errno() and is reported
to userspace as -EINVAL.
However, filemap_release_folio() failure commonly reflects temporary folio
state rather than a permanently uncollapsible range.
For example, ext4 returns false when a folio still has dirty journalled
data, btrfs returns false for dirty or writeback folios before extent
state release, and NFS may return false while reclaiming
filesystem-private folio state.
In such cases, retrying MADV_COLLAPSE after writeback, reclaim or journal
progress may succeed. This matches the existing -EAGAIN handling for
SCAN_PAGE_DIRTY_OR_WRITEBACK and other transient collapse failures more
closely than -EINVAL.
Therefore, map SCAN_PAGE_HAS_PRIVATE to -EAGAIN so userspace receives
retryable feedback for this temporary failure path.
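On the userspace side, the errno split described above suggests a retry loop
along these lines. This is a sketch in Python: collapse_with_retry() and the
`collapse` callable are hypothetical, and the attempt count and delay are
arbitrary values chosen for the example.

```python
import errno
import time


def collapse_with_retry(collapse, attempts=5, delay=0.01):
    """Call collapse() until it succeeds, retrying only on EAGAIN.

    `collapse` is a hypothetical callable that issues MADV_COLLAPSE on some
    range and raises OSError on failure.
    """
    for _ in range(attempts):
        try:
            collapse()
            return True
        except OSError as err:
            if err.errno != errno.EAGAIN:
                raise                # e.g. EINVAL: retrying will not help
            time.sleep(delay)        # let writeback, reclaim or the journal progress
    return False
```

With SCAN_PAGE_HAS_PRIVATE reported as -EAGAIN, a caller written this way
retries instead of giving up on the range.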
Link: https://lore.kernel.org/20260429140434.439456-1-agarwal.vineet2006@gmail.com Signed-off-by: Vineet Agarwal <agarwal.vineet2006@gmail.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
selftests/mm: khugepaged: initialize file contents via mmap
file_setup_area() currently allocates anonymous memory, fills it, and
writes it into the backing file used for collapse testing.
Instead of copying data through write(), resize the file with ftruncate(),
map it directly with MAP_SHARED, and initialize the mapped area in place.
This simplifies the setup path and avoids the need for explicit partial
write handling.
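The ftruncate() plus MAP_SHARED approach can be sketched as follows. The
selftest itself is written in C; this Python version only illustrates the
same sequence of calls, and setup_file_area() and its fill byte are
hypothetical names and values.

```python
import mmap
import os


def setup_file_area(path, size, fill=0x5A):
    """Size the file with ftruncate() and fill it through a shared mapping."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, size)                   # resize, no data copied
        with mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as area:
            area[:] = bytes([fill]) * size       # initialize in place
            area.flush()
    finally:
        os.close(fd)
```

There is no write() loop here, so there is no short-write case to handle.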
Link: https://lore.kernel.org/20260429115816.98824-1-agarwal.vineet2006@gmail.com Signed-off-by: Vineet Agarwal <agarwal.vineet2006@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Tested-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Wed, 29 Apr 2026 04:12:25 +0000 (21:12 -0700)]
mm/damon/lru_sort: cover all system rams
DAMON_LRU_SORT allows users to set the physical address range to monitor
and do the work on. When users don't explicitly set the range, the
biggest System RAM resource of the system is selected as the monitoring
target address range. The intention was to reduce the overhead from
monitoring non-System RAM areas because monitoring non-System RAM may be
meaningless. However, because of the sampling based access check and
adaptive regions adjustment, the overhead should be negligible. It makes
more sense to just cover all system rams of the system. Do so.
Link: https://lore.kernel.org/20260429041232.90257-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Wed, 29 Apr 2026 04:12:24 +0000 (21:12 -0700)]
mm/damon/reclaim: cover all system rams
DAMON_RECLAIM allows users to set the physical address range to monitor
and do the work on. When users don't explicitly set the range, the
biggest System RAM resource of the system is selected as the monitoring
target address range. The intention was to reduce the overhead from
monitoring non-System RAM areas because monitoring of non-System RAM may
be meaningless. However, because of the sampling based access check and
adaptive regions adjustment, the overhead should be negligible. It makes
more sense to just cover all system rams of the system. Do so.
Link: https://lore.kernel.org/20260429041232.90257-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon/reclaim,lru_sort: monitor all system rams by
default".
DAMON_RECLAIM and DAMON_LRU_SORT set the biggest 'System RAM' resource of
the system as the default monitoring target address range. The main
intention behind the design is to minimize the overhead coming from
monitoring of non-System RAM areas.
This could result in an odd setup when there are multiple discrete System
RAM resources of considerable size. For example, suppose there are
multiple System RAM resources, each 500 GiB in size. In this case, only
the first 500 GiB will be set as the monitoring region by default. This
is particularly common on NUMA systems. Hence the modules allow users to
set the monitoring target address range using the module parameters if
the default setup doesn't work for them. In other words, the current
design trades ease of setup for lower overhead.
However, because DAMON utilizes the sampling based access check and the
adaptive regions adjustment mechanisms, the overhead from the monitoring
of non-System RAM areas should be negligible in most setups. Meanwhile,
the setup complexity is causing real headaches for users who need to run
those modules on various types of systems. That is, the current tradeoff
is not a good deal.
Set the physical address range that can cover all System RAM areas of the
system as the default monitoring regions for DAMON_RECLAIM and
DAMON_LRU_SORT.
Technically speaking, this changes documented behavior. However, it is
unlikely that any real use case depends on the old default behavior. If
the old default was working reasonably for users, this change will only
add a negligible amount of monitoring overhead. If it didn't work, the
users are probably already setting the monitoring regions manually, and
they will not be affected by this change.
Patches Sequence
================
Patch 1 introduces a new core function that will be used for the new
default monitoring target region setup. Patches 2 and 3 update
DAMON_RECLAIM and DAMON_LRU_SORT, respectively, to use the new function
instead of the old one. Patch 4 removes the old core function, as it has
no remaining users. Patch 5 updates DAMON_STAT to use the new function
instead of its own nearly duplicate implementation of the functionality.
Finally, patches 6 and 7 update the DAMON_RECLAIM and DAMON_LRU_SORT user
documentation, respectively, for the new behavior.
This patch (of 7):
damon_set_region_biggest_system_ram_default() sets the monitoring target
region as the caller requested. If the caller didn't specify the region,
it finds the biggest System RAM of the system and sets it as the target
region. When the system has more than one System RAM resource of
considerable size, this default target setup makes no sense.
Introduce a variant, namely damon_set_region_system_rams_default(). It
sets a physical address range that covers all System RAM resources as the
default target region.
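The difference between the two defaults can be sketched with a small userspace model. This is only an illustration, not the kernel implementation: the function names, the `Range` type, and the memory layout are hypothetical, and the real functions walk the kernel's resource tree rather than a Python list.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # [start, end) physical addresses (hypothetical type)

GIB = 1 << 30


def biggest_system_ram(resources: List[Range]) -> Optional[Range]:
    """Model of the old default: the single largest System RAM resource."""
    if not resources:
        return None
    # On ties, max() returns the first largest entry it encounters.
    return max(resources, key=lambda r: r[1] - r[0])


def covering_system_rams(resources: List[Range]) -> Optional[Range]:
    """Model of the new default: lowest start to highest end of all resources."""
    if not resources:
        return None
    return (min(s for s, _ in resources), max(e for _, e in resources))


# Hypothetical layout: two 500 GiB System RAM resources with a hole between.
rams = [(4 * GIB, 504 * GIB), (1024 * GIB, 1524 * GIB)]
old_default = biggest_system_ram(rams)
new_default = covering_system_rams(rams)
```

In this model the old default leaves the second 500 GiB resource unmonitored, while the new default spans both resources and the hole between them. The series argues that monitoring such a hole adds only negligible overhead.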
Link: https://lore.kernel.org/20260429041232.90257-1-sj@kernel.org Link: https://lore.kernel.org/20260429041232.90257-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: skip KASAN tagging for page-allocated page tables
Page tables are always accessed via the linear mapping with a match-all
tag, so HW-tag KASAN never checks them. For page-allocated tables (PTEs,
PGDs, etc.), avoid the tag setup and poisoning overhead by using
__GFP_SKIP_KASAN. SLUB-backed page tables are unchanged for now: they
aren't widely used and would require more SLUB-related skip logic, so
that is left for later.
Link: https://lore.kernel.org/20260429102704.680174-4-dev.jain@arm.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ben Segall <bsegall@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kasan: skip HW tagging for all kernel thread stacks
HW-tag KASAN never checks kernel stacks because stack pointers carry the
match-all tag, so setting/poisoning tags is pure overhead.
- Add __GFP_SKIP_KASAN to THREADINFO_GFP so every stack allocator that
  uses it skips tagging (fork path plus arch users).
- Add __GFP_SKIP_KASAN to GFP_VMAP_STACK for the fork-specific vmap
  stacks.
- When reusing cached vmap stacks, skip kasan_unpoison_range() if HW tags
  are enabled.
Software KASAN is unchanged; this only affects tag-based KASAN.
Link: https://lore.kernel.org/20260429102704.680174-3-dev.jain@arm.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "kasan: hw_tags: Disable tagging for stack and page-tables",
v4.
Stacks and page tables are always accessed with the match-all tag, so
assigning a new random tag at every allocation and setting an invalid tag
at deallocation just adds overhead without improving detection.
With __GFP_SKIP_KASAN, the page keeps its poison tag in the hardware and
KASAN_TAG_KERNEL (the match-all tag) is stored in the page flags. The
benefit is that the 256 tag-setting instructions per 4 kB page are not
needed at allocation and deallocation time.
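The figure of 256 can be reproduced with simple arithmetic. The sketch below assumes that one tag-setting instruction covers one 16-byte tag granule, which is the Arm MTE granule size. The constant and function names are illustrative only.

```python
PAGE_SIZE = 4 * 1024  # 4 kB page, as in the text
TAG_GRANULE = 16      # assumption: bytes tagged by one tag-setting instruction


def tag_stores_per_page(page_size: int = PAGE_SIZE,
                        granule: int = TAG_GRANULE) -> int:
    """Number of tag-setting instructions needed to tag one whole page."""
    return page_size // granule


stores = tag_stores_per_page()  # 4096 // 16 == 256
```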
Thus match-all pointers still work, while non-match tags (other than
poison tag) still fault.
__GFP_SKIP_KASAN only skips tagging in KASAN_HW_TAGS mode, so coverage is
unchanged.
Benchmark:
The benchmark has two modes. In thread mode, the child process forks
and creates N threads. In pgtable mode, the parent maps and faults a
specified memory size and then forks repeatedly with children exiting
immediately.
Thread benchmark:
2000 iterations, 2000 threads: 2.575 s → 2.229 s (~13.4% faster)
Pgtable benchmark:
2048 MB, 2000 iterations: 19.08 s → 17.62 s (~7.6% faster)
This patch (of 3):
For allocations that will be accessed only with match-all pointers (e.g.,
kernel stacks), setting tags is wasted work. If the caller already set
__GFP_SKIP_KASAN, skip tag setting of vmalloc pages.
Before this patch, __GFP_SKIP_KASAN wasn't used with the vmalloc APIs, so
it wasn't checked there. Now it is checked and acted upon. Other KASAN
modes are unchanged because __GFP_SKIP_KASAN is ignored for them in the
page allocator, and vmalloc ignores the flag for them too.
This is a preparatory patch for optimizing kernel stack allocations.
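The rule described above, where the flag takes effect only in hardware tag-based mode, can be modeled as a small decision function. This is a hedged sketch: `KasanMode`, `GFP_SKIP_KASAN`, its bit value, and `skip_kasan()` are hypothetical stand-ins and do not mirror the kernel's actual definitions or call sites.

```python
from enum import Enum


class KasanMode(Enum):
    GENERIC = "generic"
    SW_TAGS = "sw_tags"
    HW_TAGS = "hw_tags"


GFP_SKIP_KASAN = 1 << 0  # hypothetical bit value, for illustration only


def skip_kasan(gfp_flags: int, mode: KasanMode) -> bool:
    """Return True only when the caller asked to skip and HW tags are in use."""
    return mode is KasanMode.HW_TAGS and bool(gfp_flags & GFP_SKIP_KASAN)


# The flag is honored in HW_TAGS mode and ignored in the other modes.
decisions = {mode: skip_kasan(GFP_SKIP_KASAN, mode) for mode in KasanMode}
```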
Link: https://lore.kernel.org/20260429102704.680174-1-dev.jain@arm.com Link: https://lore.kernel.org/20260429102704.680174-2-dev.jain@arm.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com> Co-developed-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Co-developed-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/memcontrol: hoist pstatc_pcpu assignment out of CPU loop
In mem_cgroup_alloc(), the assignment of pstatc_pcpu is invariant with
respect to the for_each_possible_cpu() loop: both the 'parent' pointer and
'parent->vmstats_percpu' remain constant throughout all iterations.
The original code redundantly re-evaluated the 'if (parent)' condition and
reassigned pstatc_pcpu on every CPU iteration, then repeated the same
ternary check 'parent ? pstatc_pcpu : NULL' when storing into
statc->parent_pcpu.
Move the single conditional assignment of pstatc_pcpu to before the loop,
resolving both the loop-invariant placement issue and the duplicated null
check. On systems with a large number of possible CPUs, this eliminates
repeated branch evaluation.
No functional change intended.
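The transformation can be illustrated with a userspace model of the loop. This is a sketch under stated assumptions: the `Memcg` class and both functions are hypothetical stand-ins for the C code in mem_cgroup_alloc(), and a plain list models the per-CPU storage.

```python
from typing import List, Optional


class Memcg:
    """Hypothetical stand-in for a memory cgroup and its per-CPU stats."""

    def __init__(self, parent: Optional["Memcg"] = None, nr_cpus: int = 4):
        self.parent = parent
        self.vmstats_percpu = object()  # models the per-CPU base pointer
        self.parent_pcpu: List[Optional[object]] = [None] * nr_cpus


def link_before(memcg: Memcg, nr_cpus: int) -> None:
    """Original shape: the invariant check and assignment run per CPU."""
    pstatc_pcpu = None
    for cpu in range(nr_cpus):
        if memcg.parent is not None:
            pstatc_pcpu = memcg.parent.vmstats_percpu
        memcg.parent_pcpu[cpu] = (
            pstatc_pcpu if memcg.parent is not None else None
        )


def link_after(memcg: Memcg, nr_cpus: int) -> None:
    """Hoisted shape: one conditional assignment before the loop."""
    pstatc_pcpu = (
        memcg.parent.vmstats_percpu if memcg.parent is not None else None
    )
    for cpu in range(nr_cpus):
        memcg.parent_pcpu[cpu] = pstatc_pcpu
```

Both functions leave every per-CPU slot with the same value, for a root cgroup as well as for a child.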
Link: https://lore.kernel.org/20260429084216.186238-1-hui.zhu@linux.dev Signed-off-by: Hui Zhu <zhuhui@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shivank Garg [Tue, 24 Mar 2026 19:07:09 +0000 (19:07 +0000)]
mm/migrate: rename PAGE_ migration flags to FOLIO_
These flags only track folio-specific state during migration and are not
used for movable_ops pages. Rename the enum values and the old_page_state
variable to match.
No functional change.
Link: https://lore.kernel.org/20260324190706.964555-4-shivankg@amd.com Signed-off-by: Shivank Garg <shivankg@amd.com> Suggested-by: David Hildenbrand <david@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 27 Apr 2026 15:12:29 +0000 (08:12 -0700)]
selftests/damon/sysfs.py: pause DAMON before dumping status
The sysfs.py test commits DAMON parameters, dumps the internal DAMON
state, and uses the dumped state to check whether the parameters were
committed as expected. While the dumping is ongoing, DAMON is alive. It
can make internal changes, including addition and removal of regions.
This can cause a race that results in false test results. Pause DAMON
execution during the state dumping to avoid such races.
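The race and its fix can be sketched with a toy model. Everything here is hypothetical: `FakeMonitor`, `paused()`, and `dump_regions()` are stand-ins invented for illustration and are not part of sysfs.py, _damon_sysfs, or the DAMON sysfs interface.

```python
import contextlib
import random


class FakeMonitor:
    """Hypothetical stand-in for a running monitor that reshapes its regions."""

    def __init__(self, regions):
        self.regions = list(regions)
        self.paused = False

    def tick(self, rng):
        """One monitoring step: split a random region unless paused."""
        if self.paused or not self.regions:
            return
        i = rng.randrange(len(self.regions))
        start, end = self.regions[i]
        if end - start >= 2:
            mid = (start + end) // 2
            self.regions[i:i + 1] = [(start, mid), (mid, end)]


@contextlib.contextmanager
def paused(monitor):
    """Pause the monitor for the duration of the block, then resume it."""
    monitor.paused = True
    try:
        yield monitor
    finally:
        monitor.paused = False


def dump_regions(monitor, rng):
    """Dump region by region; the monitor gets a chance to run between reads."""
    dumped = []
    i = 0
    while i < len(monitor.regions):
        dumped.append(monitor.regions[i])
        monitor.tick(rng)
        i += 1
    return dumped


rng = random.Random(0)
monitor = FakeMonitor([(0, 100), (100, 200)])
with paused(monitor):
    snapshot = dump_regions(monitor, rng)
```

With the monitor paused, the dump matches the committed regions. Without the pause, `tick()` may split regions between reads, so the dump may not correspond to any single state of the monitor.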
Link: https://lore.kernel.org/20260427151231.113429-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 27 Apr 2026 15:12:26 +0000 (08:12 -0700)]
selftests/damon/_damon_sysfs: support pause file staging
_damon_sysfs, the Python module for controlling the DAMON sysfs interface
in tests, does not support the newly added pause file. Add support for
the file so that the feature can be tested and used in the future.
Link: https://lore.kernel.org/20260427151231.113429-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>