David Lee [Mon, 25 May 2026 08:26:36 +0000 (16:26 +0800)]
wifi: rtw89: usb: skip ACPI capability check for USB devices
Skip the ACPI capability check for all USB devices by default,
allowing them to use their default configurations.
For USB dongles, customers will manage their own compliance and
certification. This initial patch focuses on the generic USB skip
infrastructure; specific customer certifications and localized
configurations will be handled by quirks afterward.
Add the specific USB device ID which adapter tested fully functional on
Fedora 44 with kernel 7.0.8-200.fc44.x86_64 and linux-firmware 20260410-1.fc44.
Dave Airlie [Wed, 3 Jun 2026 06:28:42 +0000 (16:28 +1000)]
Merge tag 'drm-intel-gt-next-2026-05-29' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-next
Cross-subsystem Changes:
- Backmerge of drm-next to pull in a commit to revert
Driver Changes:
- Avoid skipping already signaled fence after reset (Sebastian)
- Fix potential UAF in TTM object purge (Janusz)
- Fix refcount underflow in intel_engine_park_heartbeat (Sebastian)
- Drop check for changed VM in EXECBUF (Joonas)
- Revert the "else vma = NULL" patch for being superseded (Joonas)
- Selfest improvements (Janusz, Krzysztof)
Sven Eckelmann [Sun, 31 May 2026 13:03:25 +0000 (15:03 +0200)]
batman-adv: tt: directly retrieve wifi flags of net_device
batadv_tt_local_add() tries to retrieve the wifi flags of an interface to
mark the TT entry as wifi client for the AP isolation feature. In the past,
it was necessary to look up the batadv_hard_iface because the wifi_flags
were stored inside this struct. But with the batadv_wifi_net_devices
rhashtable, it is preferred to directly retrieve the wifi_flags instead of
the indirect route via batadv_hard_iface - which at the end only provides
the net_device (which we used to find the batadv_hard_iface).
This will also be essential when the global batadv_hardif_list is removed
and each lookup via batadv_hardif_get_by_netdev() will require the RTNL
lock.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Tue, 2 Jun 2026 15:20:57 +0000 (17:20 +0200)]
batman-adv: tt: sync local and global tvlv preparation return values
The batadv_tt_prepare_tvlv_local_data() and
batadv_tt_prepare_tvlv_global_data() functions are supposed to work the
same - just with different sources for the TT entries. But both handled the
return values completely different. The global variant only made sure that
the *tt_len parameter was set to 0 and didn't care about the actual return
value of the function.
The local function never made sure that the *tt_len value was set to some
value when the operation failed. The callers were handling these
differences and made sure that they didn't access the incorrectly
initialized variable.
Sync both function as good as possible to avoid problems with new code
which might not be aware of these differences in the behavior.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
batadv_v_elp_start_timer() enqeues a delayed work. The time when it starts
is randomly chosen between (elp_interval - BATADV_JITTER) and
(elp_interval + BATADV_JITTER). The configured elp_interval must therefore
be larger or equal to BATADV_JITTER to avoid that it causes an underflow of
the unsigned integer. If this would happen, then a "fast" ELP interval
would turn into a "day long" delay.
At the same time, it must not be larger than the maximum value the variable
can store.
Sven Eckelmann [Tue, 26 May 2026 19:50:51 +0000 (21:50 +0200)]
batman-adv: bla: annotate lasttime access with READ/WRITE_ONCE
The lasttime field for claim, backbone_gw, and loopdetect tracks the
jiffies value of the most recent activity and is used to detect timeouts.
These accesses are not consistently protected by a lock, so
READ_ONCE/WRITE_ONCE must be used to prevent data races caused by compiler
optimizations.
Rosen Penev [Wed, 3 Jun 2026 03:17:58 +0000 (20:17 -0700)]
dma: map_benchmark: turn dma_sg_map_param buf into a flexible array
The buf pointer was kmalloc_array()'d immediately after the parent
struct allocation, with the count (granule, validated to 1..1024 by
the ioctl) trivially available beforehand. Move buf to the struct
tail as a flexible array member and fold the two allocations into a
single kzalloc_flex(), dropping the kfree(params->buf) in both the
prepare error path and unprepare.
Sven Eckelmann [Tue, 12 May 2026 17:37:05 +0000 (19:37 +0200)]
batman-adv: tp_meter: consolidate congestion control variables
Congestion control variables in batadv_tp_sender were scattered across the
struct without clear grouping, making it difficult to reason about which
fields require cwnd_lock (now "cc_lock") protection. These should be
combined in a structure to make it more easily visible which variable
should be read/modified with the cc_lock held.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Tue, 12 May 2026 17:37:05 +0000 (19:37 +0200)]
batman-adv: tp_meter: use locking for all congestion control variables
Some variables used atomic_t for concurrent access while others relied on
cwnd_lock, leading to an inconsistent locking model. This can be simplified
by:
* keeping all congestion control decisions inside the cc_lock
* variables which can be accessed without a lock must use
READ_ONCE/WRITE_ONE
This is only possible, by extracting the congestion control logic from
batadv_tp_recv_ack() into a new helper batadv_tp_handle_ack(). Its
decisions are returned as a batadv_tp_ack_reaction enum value and then
applied by the caller. This separates the algorithm (deciding what to do)
from the mechanism (actually doing it).
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Sat, 9 May 2026 16:33:16 +0000 (18:33 +0200)]
batman-adv: tp_meter: split vars into sender and receiver types
The monolithic batadv_tp_vars struct holds fields for both sender and
receiver roles, distinguished only by a runtime enum role. This makes it
easy to accidentally access a field intended for the opposite role, since
neither the compiler nor the type system provide any guard against such
mistakes. The role check also adds unnecessary branching in several code
paths.
Introduce batadv_tp_vars_common to hold fields shared across both roles,
then derive two separate types (sender/receiver) from it. The functions can
operate on them without any ambiguity about the available fields. This also
reduces the memory footprint of receiver sessions, which no longer carry
the substantial sender-only fields.
Care must be taken to prevent concurrent TP sessions between the same two
peers in opposite directions, since sender and receiver sessions are now
tracked in separate lists and a lookup in one list no longer detects a
session in the other.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Tue, 2 Jun 2026 14:51:42 +0000 (16:51 +0200)]
batman-adv: tp_meter: add only finished tp_vars to lists
When the receiver variables (aka "session") are initialized, then they are
added to the list of sessions before the timer is set up. A RCU protected
reader could therefore find the entry and run mod_setup before
batadv_tp_init_recv() finished the timer initialization.
The same is true for batadv_tp_start(), which must first initialize the
finish_work and the test_length to avoid a similar problem.
Cc: stable@kernel.org Fixes: 33a3bb4a3345 ("batman-adv: throughput meter implementation") Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Mon, 1 Jun 2026 09:47:29 +0000 (11:47 +0200)]
batman-adv: tp_meter: handle seqno wrap-around for fast recovery detection
The recover variable and the last_sent sequence number are initialized on
purpose as a really high value which will wrap-around after the first 2000
bytes. The fast recovery precondition must therefore not use simple integer
comparisons but use helpers which are aware of the sequence number
wrap-arounds.
Cc: stable@kernel.org Fixes: 33a3bb4a3345 ("batman-adv: throughput meter implementation") Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Mon, 1 Jun 2026 09:00:23 +0000 (11:00 +0200)]
batman-adv: tp_meter: fix fast recovery precondition
The fast recovery precondition checks if the recover (initialized to
BATADV_TP_FIRST_SEQ) is bigger than the received ack. But since recover is
only updated when this check is successful, it will never enter the fast
recovery mode.
According to RFC6582 Section 3.2 step 2, the check should actually be
different:
> When the third duplicate ACK is received, the TCP sender first
> checks the value of recover to see if the Cumulative
> Acknowledgment field covers more than recover
The precondition must therefore check if recover is smaller than the
received ack - basically swapping the operands of the current check.
Cc: stable@kernel.org Fixes: 33a3bb4a3345 ("batman-adv: throughput meter implementation") Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Sun, 31 May 2026 10:38:05 +0000 (12:38 +0200)]
batman-adv: tp_meter: avoid divide-by-zero for dec_cwnd
The cwnd is always MSS <= cwnd <= 0x20000000. But the calculation in
batadv_tp_update_cwnd() assumes unsigned 32 bit arithmetics.
((mss * 8) ** 2) / (cwnd * 8)
In case cwnd is actually 0x20000000, it will be shifted by 3 bit to the
left end up at 0x100000000 or U32_MAX + 1. It will therefore wrap around
and be 0 - resulting in:
((mss * 8) ** 2) / 0
This is of course invalid and cannot be calculated. The calculation should
must be simplified to avoid this overflow:
(mss ** 2) * 8 / cwnd
It will keep the precision enhancement from the scaling (by 8) but avoid
the overflow in the divisor.
In theory, there could still be an overflow in the dividend. It is at the
moment fixed to BATADV_TP_PLEN in batadv_tp_recv_ack() - so it is not an
imminent problem. But allowing it to use the whole u32 bit range, would
mean that it can still use up to 67 bits. To keep this calculation safe for
32 bit arithmetic, mss must never use more than floor((32 - 3) / 2) bits -
or in other words: must never be larger than 16383.
Cc: stable@kernel.org Fixes: 33a3bb4a3345 ("batman-adv: throughput meter implementation") Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Thu, 28 May 2026 19:14:39 +0000 (21:14 +0200)]
batman-adv: tp_meter: avoid window underflow
In batadv_tp_avail(), win_left is calculated with 32-bit unsigned
arithmetic: win_left = win_limit - tp_vars->last_sent;
During Fast Recovery, cwnd is inflated and last_sent advances rapidly. When
Fast Recovery ends, cwnd drops abruptly back to ss_threshold. If the newly
shrunk win_limit is less than last_sent, the unsigned subtraction will
underflow, wrapping to a massive positive value. Instead of returning that
the window is full (unavailable), it returns that the sender can continue
sending.
To handle this situation, it must be checked whether the windows end
sequence number (win_limit) has to be compared with the last sent sequence
number. If it would be before the last sent sequence number, then more acks
are needed before the transmission can be started again.
Cc: stable@kernel.org Fixes: 33a3bb4a3345 ("batman-adv: throughput meter implementation") Signed-off-by: Sven Eckelmann <sven@narfation.org>
When batadv_tp_update_cwnd() is called, dec_cwnd is increased. But dec_cwnd
is only initialixed (to 0) when a duplicate Ack was received or when cwnd
is below the ss_threshold.
Just initialize the cwnd during the initialization to avoid any potential
access of uninitialized data.
Cc: stable@kernel.org Fixes: 33a3bb4a3345 ("batman-adv: throughput meter implementation") Signed-off-by: Sven Eckelmann <sven@narfation.org>
When an ack with a sequence number equal to the last_acked is received, the
dup_acks counter is increased to decide whether fast retransmit should be
performed. Only when the sequence numbers are not equal, the dup_acks is
set to the initial value (0).
But if the initial packet would have the sequence number
BATADV_TP_FIRST_SEQ, dup_acks would not be initialized and atomic_inc would
operate on an undefined starting value. It is therefore required to have it
explicitly initialized during the start of the sender session.
Cc: stable@kernel.org Fixes: 33a3bb4a3345 ("batman-adv: throughput meter implementation") Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Thu, 28 May 2026 19:14:39 +0000 (21:14 +0200)]
batman-adv: tp_meter: keep unacked list in ascending ordered
When batadv_tp_handle_out_of_order inserts a new entry in the list of
unacked (out of order) packets, it searches from the entry with the newest
sequence number towards oldest sequence number. If an entry is found which
is older than the newly entry, the new entry has to be added after the
found one to keep the ascending order.
But for this operation list_add_tail() was used. But this function adds an
entry _before_ another one. As result, the list would contain a lot of
swapped sequence numbers. The consumer of this list
(batadv_tp_ack_unordered()) would then fail to correctly ack packets.
Cc: stable@kernel.org Fixes: 33a3bb4a3345 ("batman-adv: throughput meter implementation") Signed-off-by: Sven Eckelmann <sven@narfation.org>
Dave Airlie [Wed, 3 Jun 2026 04:31:52 +0000 (14:31 +1000)]
Merge tag 'drm-misc-next-2026-05-28' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next
drm-misc-next for v7.2-rc1:
UAPI Changes:
- amdxdna: Revert read-only user-pointer BO mappings.
- panthor: Add eviction and reclaim info to fdinfo.
Cross-subsystem Changes:
- Convert DMA-buf system and cma heap allocators to module.
Core Changes:
- Cleanup driver misuses of drm/exec.
Driver Changes:
- Add LG LP129WT232166, AM-1280800W8TZQW-T00H, NEC NL6448BC33-70C,
Riverdi RVT70HSLNWCA0 and RVT101HVLNWC00 panels.
- Add support for RZ/T2H SoC to renesas.
- Add cursor plane support to verisilicon.
- Support DVI outputs in ite-it66121 bridge.
- Assorted bugfixes, docbook updates and improvements to ivpu, tegra,
host1x, nouveau.
- Add DSC quirk for ASUS DC301 USB-C dock.
- Use drm client buffer for tegra framebuffer.
- Add support for GA100 to nouveau.
Jiayuan Chen [Fri, 29 May 2026 15:22:18 +0000 (23:22 +0800)]
ipv6: anycast: insert aca into global hash under idev->lock
syzbot reported a splat [1]: a slab-use-after-free in
ipv6_chk_acast_addr(), which walks the global inet6_acaddr_lst[] hash
under RCU and dereferences a struct ifacaddr6 that has already been
freed while still linked in the hash, so a later reader walks into a
dangling node.
In __ipv6_dev_ac_inc() the aca is allocated with refcount 1, then
aca_get() bumps it to 2 to keep it alive across the unlocked region.
It is published to idev->ac_list under idev->lock, but
ipv6_add_acaddr_hash() runs after write_unlock_bh(). A concurrent
teardown (ipv6_ac_destroy_dev() from addrconf_ifdown(), under RTNL)
can slip into that window:
CPU0 __ipv6_dev_ac_inc CPU1 ipv6_ac_destroy_dev (RTNL)
------------------------------ ------------------------------------
aca_alloc() refcnt 1
aca_get() refcnt 2
write_lock_bh(idev->lock)
add aca to ac_list
write_unlock_bh(idev->lock)
write_lock_bh(idev->lock)
pull aca off ac_list
write_unlock_bh(idev->lock)
ipv6_del_acaddr_hash(aca)
hlist_del_init_rcu() is a no-op,
aca is not in the hash yet
aca_put() refcnt 2->1
ipv6_add_acaddr_hash(aca)
aca now inserted into the hash
aca_put() refcnt 1->0
call_rcu(aca_free_rcu) -> kfree(aca)
The hash removal becomes a no-op because the insertion has not
happened yet, so once CPU0 inserts and drops the last reference, the
aca is freed while still linked in inet6_acaddr_lst[], and readers
dereference freed memory after the slab slot is reused.
This window opened once RTNL stopped serializing the join path against
device teardown. Move ipv6_add_acaddr_hash() inside the idev->lock
section so the ac_list and hash insertions are atomic with respect to
teardown: a racing remover now either misses the aca entirely or finds
it in both lists.
acaddr_hash_lock is now nested under idev->lock, which is acquired in
softirq context, so switch all acaddr_hash_lock sites to spin_lock_bh()
to avoid the irq lock inversion reported in [2].
longlong yan [Mon, 1 Jun 2026 01:39:27 +0000 (09:39 +0800)]
selftests/net: bind_bhash: fix memory leak in bind_socket
The getaddrinfo() call in bind_socket() dynamically allocates memory
for the result linked list that must be freed with freeaddrinfo().
However, none of the code paths after a successful getaddrinfo() call
free this memory, causing a leak in every invocation of bind_socket().
Nazim Amirul [Fri, 29 May 2026 06:46:59 +0000 (23:46 -0700)]
net: stmmac: Improve Tx timer arm logic further
Calling hrtimer_start() on an already-active txtimer is unnecessary
and expensive. Skip the restart if the timer is already active by
adding an hrtimer_active() check before hrtimer_start().
Previously, each packet reset the timer to tx_coal_timer in the future,
acting as a sliding window that delayed NAPI under burst traffic. With
this change, an already-active timer is left to fire sooner, scheduling
NAPI within tx_coal_timer of the first packet and freeing TX descriptors
earlier.
There is no race concern: hrtimer_start() is internally serialized and
safe to call on an active timer. In the event of a race between
hrtimer_active() and hrtimer_start(), the worst case is calling
hrtimer_start() on an already-active timer, which is identical to the
pre-patch behaviour.
Performance on Cyclone V with dwmac-socfpga (iperf3 -u -b 0 -l 64):
Before: ~45200 pps
After: ~52300 pps (~15% improvement)
Additionally, ~10% improvement in UDP throughput observed on Agilex5,
with hrtimer CPU usage reduced from ~8% to ~0.6%.
Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260529064659.32287-1-muhammad.nazim.amirul.nazle.asmade@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 3 Jun 2026 02:13:20 +0000 (19:13 -0700)]
Merge branch 'dpaa2-switch-various-improvements'
Ioana Ciornei says:
====================
dpaa2-switch: various improvements
This patch set is a comprised of improvements and fixes for
long-standing bugs which were only caught by sashiko while reviewing the
LAG support patches for the dpaa2-switch:
https://lore.kernel.org/all/20260512131554.952971-1-ioana.ciornei@nxp.com/
In order to not just add to the already big set, I am submitting these
before any v3 of the LAG support patches.
The individual patches tackle FDB and VLAN management in the
dpaa2-switch driver as well as removal of some duplicated code. The
error path of the dpaa2_switch_rx() is also improved.
====================
Ioana Ciornei [Thu, 28 May 2026 17:34:52 +0000 (20:34 +0300)]
dpaa2-switch: fix handling of NAPI on the remove path
All the NAPI instances for a DPSW device are attached to the first
switch port's net_device but shared by all ports. The NAPI instances get
disabled only once the last port goes down.
This causes an issue on the .remove() path where each port is
unregistered and freed one at a time, causing the NAPI instances to be
deleted even though they are not disabled.
In order to avoid this, split up the unregister_netdev() calls from the
free_netdev() so that we make sure all ports go down before we attempt
a deletion of NAPI instances. Also, make the netif_napi_del() explicit
as it is on the .probe() path.
Ioana Ciornei [Thu, 28 May 2026 17:34:51 +0000 (20:34 +0300)]
dpaa2-switch: support VLAN flag changes on existing VIDs
The switchdev core notifies the driver on VLAN flag changes on existing
VIDs through the changed field of the switchdev_obj_port_vlan structure.
Without this patch, the driver was erroring out from the start if the
same VID was inserted twice, from its perspective, even though the
second call was actually a flag change.
$ bridge vlan add dev eth2 vid 30 untagged
$ bridge vlan add dev eth2 vid 30
[ 458.589534] fsl_dpaa2_switch dpsw.0 eth2: VLAN 30 already configured
This patch fixes the above behavior by, first of all, removing the
checks and return of errors on a VLAN already being installed. The patch
also moves the sequence of code which checks if there is space for a new
VLAN so that the verification is being done only for VLANs not know by
the switch and not flag changes.
A new parameter is added to the dpaa2_switch_port_add_vlan() function so
that we pass the vlan->changed necessary information. Based on this new
parameter and the flags value, the untagged flag will be added or
removed from a VLAN installed on a port. The same thing is also extended
to the PVID configuration.
Ioana Ciornei [Thu, 28 May 2026 17:34:50 +0000 (20:34 +0300)]
dpaa2-switch: remove duplicated check for the maximum number of VLANs
The check for the maximum number of VLANs is exactly duplicated twice in
the dpaa2_switch_port_vlans_add() function. Remove one of the instances
so that we do not have dead code.
Ioana Ciornei [Thu, 28 May 2026 17:34:49 +0000 (20:34 +0300)]
dpaa2-switch: fix the error path in dpaa2_switch_rx()
In case of an error in dpaa2_switch_rx(), the dpaa2_switch_free_fd()
function is called in order to free the FD. This is incorrect since the
dpaa2_switch_free_fd() is intended to be used on Tx frame descriptors,
meaning that it expects in the software annotation area of the FD data
to find a valid skb pointer on which to call dev_kfree_skb().
Fix this by extracting the dma_unmap_page() from
dpaa2_switch_build_linear_skb() directly into the dpaa2_switch_rx()
function. This allows us to directly use free_pages() in case of an
error before an SKB was created and kfree_skb() afterwards.
Ioana Ciornei [Thu, 28 May 2026 17:34:48 +0000 (20:34 +0300)]
dpaa2-switch: rework FDB management on the bridge leave path
On bridge leave, the dpaa2_switch_port_set_fdb() function always
allocates a new FDB for the port which is becoming standalone. In case
no FDB is found, then the port leaving a bridge will continue to use the
current one.
The above logic does not cover the case in which there are multiple
bridges which have ports from the same DPSW instance. In this case, when
the last port leaves bridge #1, it finds an unused FDB to switch to, but
the old FDB is not marked as unused. Since the number of FDBs is equal
to the number of DPSW interfaces, this will eventually lead to multiple
ports sharing the same FDB.
Fix this by changing how we are managing the FDBs on the leave path.
Instead of directly allocating a new FDB, first verify if the current
port is the last one to leave a bridge. If this is the case, then
continue to use the current FDB and only allocate another FDB if there
are other ports remaining in the bridge.
The SG2042 MSI controller has one 32-bit doorbell register, and each bit
corresponds to an interrupt. At a glance, it seems that the MSI
controller can support 32 interrupts; however the PCI MSI capability
only supports 16-bit messages, which makes the high 16 interrupts
unusable in such way.
Reduce the MSI count to 16 to prevent producing MSI message values that
cannot fit 16-bit integers.
Signed-off-by: Icenowy Zheng <zhengxingda@iscas.ac.cn> Reviewed-by: Chen Wang <unicorn_wang@outlook.com> Tested-by: Chen Wang <unicorn_wang@outlook.com> on Pioneerbox. Link: https://patch.msgid.link/20260407160143.1182430-1-zhengxingda@iscas.ac.cn Signed-off-by: Inochi Amaoto <inochiama@gmail.com> Signed-off-by: Chen Wang <unicorn_wang@outlook.com>
riscv: dts: sophgo: sg2042: use hex for CPU unit address
Previous the CPU unit address cpu of sg2042 use decimal, it is
not following the general convention for unit addresses of the
OF. Convent the unit address to hex to resolve this problem.
The introduces a small change for the CPU node name, but it should
affect nothing since there is no direct full-path reference to
these CPU nodes.
Fixes: ae5bac370ed4 ("riscv: dts: sophgo: Add initial device tree of Sophgo SRD3-10") Tested-by: Chen Wang <unicorn_wang@outlook.com> # Pioneerbox. Reviewed-by: Guo Ren <guoren@kernel.org> Reviewed-by: Chen Wang <unicorn_wang@outlook.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Tested-by: Chen Wang <unicorn_wang@outlook.com> on Pioneerbox. Link: https://patch.msgid.link/20260426013449.694435-3-inochiama@gmail.com Signed-off-by: Inochi Amaoto <inochiama@gmail.com> Signed-off-by: Chen Wang <unicorn_wang@outlook.com>
riscv: dts: sophgo: sg2044: use hex for CPU unit address
Previous the CPU unit address cpu of sg2044 use decimal, it is
not following the general convention for unit addresses of the
OF. Convent the unit address to hex to resolve this problem.
The introduces a small change for the CPU node name, but it should
nothing since there is no direct full-path reference to these
CPU nodes.
kbuild: Remove unnecessary 'T' modifier in cmd_ar_builtin_fixup
In cmd_ar_builtin_fixup, the 'T' modifier was added to '$(AR) mPi' to
work around a bug in llvm-ar that caused thin archives to be silently
converted to full archives [1]. Since commit 20c098928356 ("kbuild: Bump
minimum version of LLVM for building the kernel to 15.0.0"), all
supported versions of llvm-ar have this issue fixed, so the 'T' modifier
and comment can be removed.
Thorsten Blum [Sun, 17 May 2026 17:26:17 +0000 (19:26 +0200)]
n64cart: use strscpy in n64cart_probe
strcpy() has been deprecated [1] because it performs no bounds checking
on the destination buffer, which can lead to buffer overflows. While the
current code works correctly, replace strcpy() with the safer strscpy()
to follow secure coding best practices.
io_uring/net: Avoid msghdr on op_connect/op_bind async data
Both IORING_OP_CONNECT and IORING_OP_BIND reuse the msghdr object just
to store the sockaddr. Beyond allocating a much larger object than
needed, msghdr can also wrap an iovec, which will be recycled
unnecessarily. This uses the sockaddr directly.
Zeyu WANG [Tue, 2 Jun 2026 17:09:09 +0000 (01:09 +0800)]
Input: atkbd - add DMI quirk for Lenovo Yoga Air 14 (83QK)
The Lenovo Yoga Air 14 (83QK) laptop keyboard becomes unresponsive
after the standard atkbd init sequence. Controlled testing on the
actual hardware shows the F5 (ATKBD_CMD_RESET_DIS / deactivate)
command specifically corrupts the EC state, causing zero IRQ1
interrupts after init.
Skipping only the deactivate command (while keeping F4 ENABLE)
resolves the issue completely: both keystroke input and CapsLock
LED toggle work correctly. The reverse test - skipping only F4
while keeping F5 - makes the problem worse (zero keystroke
interrupts), confirming F5 is the sole culprit.
Add a DMI quirk entry for LENOVO/83QK using the existing
atkbd_deactivate_fixup callback, consistent with the existing
entries for LG Electronics and HONOR FMB-P that address the
same EC F5 deactivate issue.
Dave Airlie [Tue, 2 Jun 2026 23:09:11 +0000 (09:09 +1000)]
Merge tag 'drm-intel-next-2026-05-28' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-next
Xe related:
- Fix Xe oops in suspend/shutdown when display was disabled (Jani)
Display in general:
- More general refactor towards display separation (Jani)
- Preparation for fix Adaptive-Sync SDP for PR with Link ON + Auxless-ALPM (Ankit)
- PSR related fixes and improvements (Jouni)
- Use polling when irqs are unavailable (Michal)
- Split bandwidth params into platform- and display-IP-specific structs (Gustavo)
- Revert "drm/i915/backlight: Remove try_vesa_interface" (Suraj)
- Casf & scaler refactoring (Michal)
- Add support for pipe background color (Maarten)
- General clean-ups (Maarten)
- Sanitize DP link capability change handling (Imre)
- Multiple BW QGV fixes (Ville)
Dave Airlie [Tue, 2 Jun 2026 22:58:56 +0000 (08:58 +1000)]
Merge tag 'drm-xe-next-2026-05-28' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next
Driver Changes:
- drm/xe: Move xe_uc_fw_abi.h to abi/ (Michal Wajdeczko)
- drm/xe: Restore IDLEDLY regiter on engine reset (Balasubramani Vivekanandan)
- drm/xe/pm: Do early initialization in init_early() (Michal Wajdeczko)
- drm/xe/pm: Don't access device in init_early() (Michal Wajdeczko)
- drm/xe: Separate early xe_device initialization (Michal Wajdeczko)
- drm/xe: Move xe->info.devid|revid initialization (Michal Wajdeczko)
- drm/xe: Move xe->info.force_execlist initialization (Michal Wajdeczko)
- drm/xe: Drop unused param from xe_device_create() (Michal Wajdeczko)
- drm/xe: Use raw device ID to find sub-platform descriptor (Michal Wajdeczko)
- drm/xe: Assign queue name in time for drm_sched_init (Tvrtko Ursulin)
- drm/xe/rtp: Implement a structured parser for rule matching (Gustavo Sousa)
- drm/xe/rtp: Fully parse the ruleset (Gustavo Sousa)
- drm/xe/rtp: Extract rule_match_item() (Gustavo Sousa)
- drm/xe/rtp: Do not break parsing when missing context (Gustavo Sousa)
- drm/xe/rtp: Don't short-circuit to false in or-yes case (Gustavo Sousa)
- drm/xe/rtp: Drop rule matching cases from rtp_to_sr_cases and rtp_cases (Gustavo Sousa)
- drm/xe/rtp: Write kunit test cases specific for rule matching (Gustavo Sousa)
Link: https://lore.kernel.org/20260520051751.74396-1-leon.hwang@linux.dev Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leon Hwang <leon.hwang@linux.dev> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Tue, 12 May 2026 15:15:23 +0000 (23:15 +0800)]
mm/damon/vaddr: attempt per-vma lock during page table walk
Currently, DAMON virtual address operations use mmap_read_lock during page
table walks, which can cause unnecessary contention under high
concurrency.
Introduce damon_va_walk_page_range() to first attempt acquiring a per-vma
lock. If the VMA is found and the range is fully contained within it, the
page table walk proceeds with the per-vma lock instead of mmap_read_lock.
This optimization is expected to be particularly effective for
damon_va_young() and damon_va_mkold(), which are frequently called and
typically operate within a single VMA.
Kaitao Cheng [Thu, 14 May 2026 08:57:54 +0000 (16:57 +0800)]
mm/memory-failure: use zone_pcp_disable() for poison handling
__page_handle_poison() used drain_all_pages() instead of
zone_pcp_disable() because dissolve_free_hugetlb_folio() could restore HVO
vmemmap pages and decrement hugetlb_optimize_vmemmap_key. That static key
update took cpu_hotplug_lock through static_key_slow_dec(), while
zone_pcp_disable() holds pcp_batch_high_lock. CPU hotplug takes the locks
in the opposite order through page_alloc_cpu_online/dead(), so the
combination could deadlock.
That dependency no longer exists. Commit da3e2d1ca43d ("mm/hugetlb:
remove hugetlb_optimize_vmemmap_key static key") removed the HVO static
key and the static_branch_dec() from hugetlb_vmemmap_restore_folio(). The
dissolve_free_hugetlb_folio() path no longer reaches
static_key_slow_dec().
Use zone_pcp_disable() again while dissolving the hugetlb folio and taking
the target page off the buddy allocator. This prevents the drained PCP
lists from being refilled before take_page_off_buddy() runs, making the
page isolation deterministic.
Link: https://lore.kernel.org/20260514085754.84097-1-kaitao.cheng@linux.dev Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Georgi Djakov [Thu, 14 May 2026 09:26:57 +0000 (02:26 -0700)]
drivers/base/memory: set mem->altmap after successful device registration
If __add_memory_block() fails at xa_store() (under memory pressure for
example), device_unregister() is called, which eventually triggers
memory_block_release() with mem->altmap still set, causing a
WARN_ON(mem->altmap). This was triggered by modifying virtio-mem driver.
Fix this by delaying the assignment of mem->altmap until after
__add_memory_block() has succeeded.
Link: https://lore.kernel.org/20260514092657.3057141-1-georgi.djakov@oss.qualcomm.com Fixes: 1a8c64e11043 ("mm/memory_hotplug: embed vmem_altmap details in memory block") Signed-off-by: Georgi Djakov <georgi.djakov@oss.qualcomm.com> Acked-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Richard Cheng <icheng@nvidia.com> Cc: David Hildenbrand <david@kernel.org> Cc: Georgi Djakov <djakov@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shivam Kalra [Tue, 19 May 2026 12:12:18 +0000 (17:42 +0530)]
lib/test_vmalloc: add vrealloc test case
Introduce a new test case "vrealloc_test" that exercises the vrealloc()
shrink and in-place grow paths:
- Grow beyond allocated pages (triggers full reallocation).
- Shrink crossing a page boundary (frees tail pages).
- Shrink within the same page (no page freeing).
- Grow within the already allocated page count (in-place).
Data integrity is validated after each realloc step by checking that the
first byte of the original allocation is preserved.
The test is gated behind run_test_mask bit 12 (id 4096).
Shivam Kalra [Tue, 19 May 2026 12:12:17 +0000 (17:42 +0530)]
mm/vmalloc: free unused pages on vrealloc() shrink
When vrealloc() shrinks an allocation and the new size crosses a page
boundary, unmap and free the tail pages that are no longer needed. This
reclaims physical memory that was previously wasted for the lifetime of
the allocation.
The heuristic is simple: always free when at least one full page becomes
unused. Huge page allocations (page_order > 0) are skipped, as partial
freeing would require splitting. Allocations with VM_FLUSH_RESET_PERMS
are also skipped, as their direct-map permissions must be reset before
pages are returned to the page allocator, which is handled by
vm_reset_perms() during vfree().
Additionally, allocations with VM_USERMAP are skipped because
remap_vmalloc_range_partial() validates mapping requests against the
unchanged vm->size; freeing tail pages would cause vmalloc_to_page() to
return NULL for the unmapped range.
To protect concurrent readers, the shrink path uses Node lock to
synchronize before freeing the pages.
Finally, we notify kmemleak of the reduced allocation size using
kmemleak_free_part() to prevent the kmemleak scanner from faulting on the
newly unmapped virtual addresses.
The virtual address reservation (vm->size / vmap_area) is intentionally
kept unchanged, preserving the address for potential future grow-in-place
support.
Shivam Kalra [Tue, 19 May 2026 12:12:16 +0000 (17:42 +0530)]
mm/vmalloc: use physical page count in vread_iter() for VM_ALLOC areas
For VM_ALLOC areas in vread_iter(), derive the vm area size from
vm->nr_pages rather than get_vm_area_size().
Only VM_ALLOC areas are subject to vrealloc() shrinking, which frees pages
without reducing the virtual reservation size. Switch to using
vm->nr_pages for VM_ALLOC areas so the reader remains correct once shrink
support is added. Other mapping types (vmap, ioremap) do not initialize
nr_pages and will continue using get_vm_area_size().
Shivam Kalra [Tue, 19 May 2026 12:12:15 +0000 (17:42 +0530)]
mm/vmalloc: use physical page count for vrealloc() grow-in-place check
Update the grow-in-place check in vrealloc() to compare the requested size
against the actual physical page count (vm->nr_pages) rather than the
virtual area size (alloced_size, derived from get_vm_area_size()).
Currently both values are equivalent, but the upcoming vrealloc() shrink
functionality will free pages without reducing the virtual reservation
size. After such a shrink, the old alloced_size-based comparison would
incorrectly allow a grow-in-place operation to succeed and attempt to
access freed pages. Switch to vm->nr_pages now so the check remains
correct once shrink support is added.
Shivam Kalra [Tue, 19 May 2026 12:12:14 +0000 (17:42 +0530)]
mm/vmalloc: extract vm_area_free_pages() helper from vfree()
Patch series "mm/vmalloc: free unused pages on vrealloc() shrink", v14.
This series implements the TODO in vrealloc() to unmap and free unused
pages when shrinking across a page boundary.
Problem:
When vrealloc() shrinks an allocation, it updates bookkeeping
(requested_size, KASAN shadow) but does not free the underlying physical
pages. This wastes memory for the lifetime of the allocation.
Solution:
- Patch 1: Extracts a vm_area_free_pages(vm, start_idx, end_idx) helper
from vfree() that frees a range of pages with memcg and nr_vmalloc_pages
accounting. Freed page pointers are set to NULL to prevent stale
references.
- Patch 2: Update the grow-in-place check in vrealloc() to compare the
requested size against the actual physical page count (vm->nr_pages)
rather than the virtual area sizes. This is a prerequisite for shrinking.
- Patch 3: For VM_ALLOC areas in vread_iter(), derive the vm area size
from vm->nr_pages rather than get_vm_area_size(), which would
overestimate the mapped range after a shrink. Other mapping types
(vmap, ioremap) don't set nr_pages and keep using get_vm_area_size().
- Patch 4: Uses the helper to free tail pages when vrealloc() shrinks
across a page boundary.
- Patch 5: Adds a vrealloc test case to lib/test_vmalloc that exercises
grow-realloc, shrink-across-boundary, shrink-within-page, and
grow-in-place paths.
The virtual address reservation is kept intact to preserve the range for
potential future grow-in-place support. A concrete user is the Rust
binder driver's KVVec::shrink_to [1], which performs explicit vrealloc()
shrinks for memory reclamation.
This patch (of 5):
Extract page freeing and NR_VMALLOC stat accounting from vfree() into a
reusable vm_area_free_pages() helper. The helper operates on a range
[start_idx, end_idx) of pages from a vm_struct, making it suitable for
both full free (vfree) and partial free (upcoming vrealloc shrink).
Freed page pointers in vm->pages[] are set to NULL to prevent stale
references when the vm_struct outlives the free (as in vrealloc shrink).
SeongJae Park [Mon, 18 May 2026 23:41:13 +0000 (16:41 -0700)]
mm/damon/sysfs-schemes: move memcg_path_to_id() to sysfs-common
The next commit will need to find the memcg id from the user-passed path
to the memory cgroup, from sysfs.c. memcg_path_to_id() is doing that, but
defined in sysfs-schemes.c as a static function. Move the function to
sysfs-common.c and mark it as non-static, so that the next commit can
reuse the function.
Link: https://lore.kernel.org/20260518234119.97569-26-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 18 May 2026 23:41:12 +0000 (16:41 -0700)]
mm/damon/sysfs: add filters/<F>/path file
Introduce a new DAMON sysfs file for letting users setup the target memory
cgroup of the belonging memory cgroup attribute monitoring. The file is
named 'path', located under the probe filter directory. Users can set the
target memory cgroup by writing the path to the memory cgroup from the
cgroup mount point to the file.
Link: https://lore.kernel.org/20260518234119.97569-25-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 18 May 2026 23:41:10 +0000 (16:41 -0700)]
mm/damon/core: introduce DAMON_FILTER_TYPE_MEMCG
Belonging memory cgoup is another data attribute that can be useful to
monitor. Introduce a new DAMON filter type, namely
DAMON_FILTER_TYPE_MEMCG, for monitoring of this attribute.
Link: https://lore.kernel.org/20260518234119.97569-23-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 18 May 2026 23:40:55 +0000 (16:40 -0700)]
mm/damon/core: do data attributes monitoring
Implement the data attributes monitoring execution. Update kdamond to
invoke the probes application callback, and reset the aggregated number of
per-region per-probe positive samples for every aggregation interval.
Link: https://lore.kernel.org/20260518234119.97569-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 18 May 2026 23:40:54 +0000 (16:40 -0700)]
mm/damon/core: introduce damon_ops->apply_probes
Extend damon_operations struct with a new callback, namely apply_probes.
The callback will be invoked for data attributes monitoring. More
specifically, the callback will apply damon_probe objects to each region
and update the per-region per-probe counters for the number of encountered
probe-positive samples.
Link: https://lore.kernel.org/20260518234119.97569-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 18 May 2026 23:40:53 +0000 (16:40 -0700)]
mm/damon/core: introduce damon_region->probe_hits
Add an array for the per-region per-probe positive samples count. For
simple and efficient implementation, add a limit to the number of data
probes and set the array to support only the limited number of counters.
Link: https://lore.kernel.org/20260518234119.97569-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 18 May 2026 23:40:51 +0000 (16:40 -0700)]
mm/damon/core: introduce damon_filter
Define a data structure for constructing damon_probe's attributes check,
namely damon_filter. It is very similar to damos_filter but works only
for monitoring purposes. Also embed that into damon_probe, implement
essential handling of the link, with fundamental helpers.
Link: https://lore.kernel.org/20260518234119.97569-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 18 May 2026 23:40:50 +0000 (16:40 -0700)]
mm/damon/core: embed damon_probe objects in damon_ctx
Let damon_probe objects be able to be installed on a given damon_ctx, by
adding a linked list header for storing the objects. Add initialization
and cleanup of the new field with helper functions, too.
Link: https://lore.kernel.org/20260518234119.97569-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 18 May 2026 23:40:49 +0000 (16:40 -0700)]
mm/damon/core: introduce struct damon_probe
Patch series "mm/damon: introduce data attributes monitoring".
TL; DR
======
Extend DAMON for monitoring general data attributes other than accesses.
The short term motivation is lightweight page type (e.g., belonging
cgroup) aware monitoring. In long term, this will help extending DAMON
for multiple access events capture primitives (e.g., page faults and PMU)
and eventually pivotting DAMON to a "Data Attributes Monitoring and
Operations eNgine" in long term.
Background: High Cost of Page Level Properties Monitoring
=========================================================
DAMON is initially introduced as a Data Access MONitor. It has been
extended for not only access monitoring but also data access-aware system
operations (DAMOS). But still the monitoring part is only for data
accesses.
Data access patterns is good information, but some users need more
holistic views. Particularly, users want to show the access pattern
information together with the types of the memory. For example, users who
work for making huge pages efficiently want to know how much of
DAMON-found hot/cold regions are backed by huge pages. Users who run
multiple workloads with different cgroups want to know how much of
DAMON-found hot/cold regions belong to specific cgroups.
For the user demand, we developed a DAMOS extension for page level
properties based monitoring [1], which has landed on 6.14. Using the
feature, users can inform the page level data properties that they are
interested in, in a flexible format that uses DAMOS filters. Then, DAMON
applies the filters to each folio of the entire DAMON region and lets
users know how many bytes of memory in each DAMON region passed the given
filters.
This gives page level detailed and deterministic information to users.
But, because the operation is done at page level, the overhead is
proportional to the memory size. It was useful for test or debugging
purposes on a small number of machines. But it was obviously too heavy to
be enabled always on all machines running the real user workloads. For
real world workloads, it was recommended to use the feature with
user-space controlled sampling approaches. For example, users could do
the page level monitoring only once per hour, on randomly selected one
percent of machines of their fleet. If the runtime and the size of the
fleet is long and big enough, it should provide statistically meaningful
data.
But users are too busy to implement such controls on their own.
Data Attributes Monitoring
==========================
Extend DAMON to monitor not only data accesses, but also general data
attributes. Do the extension while keeping the main promise of DAMON, the
bounded and best-effort minimum overhead.
Allow users to specify what data attributes in addition to the data access
they want to monitor. Users can install one 'data probe' per data
attribute of their interest for this purpose. The 'data probe' should be
able to be applied to any memory, and determine if the given memory has
the appropriate data attribute. E.g., if memory of physical address 42
belongs to cgroup A. Each 'data probe' is configured with filters that
are very similar to the DAMOS filters.
When DAMON checks if each sampling address memory of each region is
accessed since the last check, it applies data probes if registered. Same
to the number of access check-positive samples accounting (nr_accesses),
it accounts the number of each data probe-positive samples in another
per-region counters array, namely 'probe_hits'. When DAMON resets
nr_accesses every aggregation interval, it resets 'probe_hits' together.
Users can read 'probe_hits' just before the values are reset. In this
way, users can know how many hot/cold memory regions have data attributes
of their interest. E.g., 30 percent of this system's hot memory is
belonging to cgroup A, and 80 percent of the cgroup A-belonging hot memory
is backed by huge pages.
Patches Sequence
================
First eight patches implement the core feature, interface and the working
support. Patch 1 introduces data probe data structure, namely
damon_probe. Patch 2 extends damon_ctx for installing data probes. Patch
3 introduces another data structure for filters of each data probe, namely
damon_filter. Patch 4 updates damon_ctx commit function to handle the
probes. Patch 5 extends damon_region for the per-region per-probe
positive samples counter, namely probe_hits. Patch 6 extends
damon_operations for applying probes on the underlying DAMON operations
implementation. Patch 7 updates kdamond_fn() to invoke the probes
applying callback. Patch 8 finally implements the probes support on paddr
ops.
Ten changes for user interface (patches 9-18) come next. Patches 9-13
implements sysfs directories and files for setting data probes, namely
probes directory, probe directory, filters directory, filter directory and
filter directory internal files, respectively. Patch 14 connects the user
inputs that are made via the sysfs files to DAMON core. Following three
patches (patches 15-17) implement sysfs directories and files for showing
the probe_hits to users, namely probes directory, probe directory and hits
files, respectively. Patch 18 introduces a new tracepoint for showing the
probe_hits via tracefs.
Patch 19 adds a selftest for the sysfs files.
Patches 20 and 21 documents the design and usage of the new feature,
respectively.
Seven additional patches (patches 22-28) for monitoring belonging memory
cgroup follow. Depending on the feedback, this part might be separated to
another series in future. Patch 22 defines the DAMON filter type for the
new attribute, namely DAMON_FILTER_TYPE_MEMCG. Patch 23 add the support
on paddr ops. Patch 24 updates the sysfs interface for setup of the
target memcg. Patch 25 move code for easy reuse of the filter target
memcg setup. Patch 26 connects the user input to the core layer.
Finally, patches 27 and 28 update the design and usage documents for the
memcg attribute monitoring support.
Discussion
==========
This allows the page properties monitoring with overhead that is low
enough to be enabled always on real world workloads. Because the sampling
time for access check is reused for data attributes check, the
upper-bounded and best-effort minimum overhead of DAMON is kept. Because
the sampling memory for access check is reused for data attributes check,
additional overhead is minimum.
Still DAMOS-based page level properties monitoring should be useful,
because it provides a deterministic page level information. When in doubt
of the sampling based information, running DAMOS-based one together and
comparing the results would be useful, for debugging and tuning.
Future Works: Mid Term
========================
This version of implementation is limiting the maximum number of data
probes to four. I will try to find a way to remove the limit in future.
I personally think it should be enough for common use cases, though, and
therefore not giving high priority at the moment.
Future Works: Long Term
=======================
There are user requests for extending DAMON with detailed access
information, for example, per-CPUs/threads/read/writes monitoring. For
that, I was working [2] on extending DAMON to use page fault events as
another access check primitives, and making the infrastructure flexible
for future use of yet another access check primitive. Actually there is
another ongoing work [3] for extending DAMON with PMU events. The
motivation of the work is reducing the overhead, though.
In my work [2], I was introducing a new interface for access sampling
primitives control. Now I think this data probe interface can be used for
that, too. That is, data access becomes just one type of data attribute.
Also, pg_idle-confirmed access, page fault-confirmed access, and PMU
event-confirmed access will be different types of data attributes.
The regions adjustment mechanism is currently working based on the access
information. That's because DAMON is designed for data access monitoring.
That is, data access information is the primary interest, and therefore
DAMON adjusts regions in a way that can best-present the information.
Once data access becomes just one of data attributes, there is no reason
to think data access that special. There might be some users not
interested in access at all but want to know the location of memory of
specific type. Data probes interface will allow doing that. Further, we
could extend the interface to let users set any data attribute as the
'primary' attribute. Then, DAMON will split and merge regions in a way
that can best-present the 'primary' attributes.
DAMOS will also be extended, to specify targets based on not only the data
access pattern, but all user-registered data attributes. From this stage,
we may be able to call DAMON as a "Data Attributes Monitoring and
Operations eNgine".
This patch (of 28):
Introduce a data structure for data attribute probe. It is just a linked
list header at this step. It will be extended in a way that it can
determine if a given memory has a specific data attribute.
Ye Liu [Fri, 15 May 2026 02:01:43 +0000 (10:01 +0800)]
mm/memory-failure: remove hugetlb output parameter from try_memory_failure_hugetlb()
Use -ENOENT return value to distinguish "not a hugetlb page" from "hugetlb
handled", instead of carrying an extra output parameter.
Link: https://lore.kernel.org/20260515020144.164941-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Suggested-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:51 +0000 (23:39 +0800)]
mm, swap: merge zeromap into swap table
By allocating one additional bit in the swap table entry's flags field
alongside the count, we can store the zeromap inline
For 64 bit systems, zeromap will store in the swap table, avoiding zeromap
allocation. It reduces the allocated memory. That is the happy path.
For certain 32-bit archs, there might not be enough bits in the swap table
to contain both PFN and flags. Therefore, conditionally let each cluster
have a zeromap field at build time, and use that instead. If the swapfile
cluster is not fully used, it will still save memory for zeromap. The
empty cluster does not allocate a zeromap. In the worst case, all cluster
are fully populated. We will use memory similar to the previous zeromap
implementation.
A few macros were moved to different headers for build time struct
definition.
[akpm@linux-foundation.org: swap_cluster_alloc_table(): remove unused local `ret]
[akpm@linux-foundation.org: fix unused label `err_free'] Link: https://lore.kernel.org/20260517-swap-table-p4-v5-12-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Youngjun Park <youngjun.park@lge.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:50 +0000 (23:39 +0800)]
mm/memcg: remove no longer used swap cgroup array
Now all swap cgroup records are stored in the swap cluster directly, the
static array is no longer needed.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-11-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:49 +0000 (23:39 +0800)]
mm/memcg, swap: store cgroup id in cluster table directly
Drop the usage of the swap_cgroup_ctrl, and use the dynamic cluster table
instead.
The per-cluster memcg table is 1024 / 512 bytes on most archs, and does
not need RCU protection: the cgroup data is only read and written under
the cluster lock. That keeps things simple, lets the allocation use plain
kmalloc with immediate kfree (no deferred free), and keeps fragmentation
acceptable.
[akpm@linux-foundation.org: memcgv1: don't compile swap functions when CONFIG_SWAP=n] Link: https://lore.kernel.org/202605281711.bSeZlErK-lkp@intel.com
[akpm@linux-foundation.org: fix CONFIG_SWAP=n build] Link: https://lore.kernel.org/20260517-swap-table-p4-v5-10-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:48 +0000 (23:39 +0800)]
mm, swap: consolidate cluster allocation helpers
Swap cluster table management is spread across several narrow helpers. As
a result, the allocation and fallback sequences are open-coded in multiple
places.
A few more per-cluster tables will be added soon, so avoid duplicating
these sequences per table type. Fold the existing pairs into
cluster-oriented helpers, and rename for consistency.
No functional change, only a few sanity checks are slightly adjusted.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-9-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:47 +0000 (23:39 +0800)]
mm, swap: delay and unify memcg lookup and charging for swapin
Instead of checking the cgroup private ID during page table walk in
swap_pte_batch(), move the memcg lookup into __swap_cache_add_check()
under the cluster lock.
The first pre-alloc check is speculative and skips the memcg check since
the post-alloc stable check ensures all slots covered by the folio belong
to the same memcg. It is very rare for contiguous and aligned entries
across a contiguous region of a page table of the same process or shmem
mapping to belong to different memcgs.
This also prepares for recording the memcg info in the cluster's table.
Also make the order check and fallback more compact.
There should be no user-observable behavior change.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-8-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:46 +0000 (23:39 +0800)]
mm, swap: support flexible batch freeing of slots in different memcgs
Instead of requiring the caller to ensure all slots are in the same memcg,
make the function handle different memcgs at once.
This is both a micro optimization and required for removing the memcg
lookup in the page table layer, so it can be unified at the swap layer.
We are not removing the memcg lookup in the page table in this commit. It
has to be done after the memcg lookup is deferred to the swap layer.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-7-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:45 +0000 (23:39 +0800)]
mm/memcg, swap: tidy up cgroup v1 memsw swap helpers
The cgroup v1 swap helpers always operate on swap cache folios whose swap
entry is stable: the folio is locked and in the swap cache. There is no
need to pass the swap entry or page count as separate parameters when they
can be derived from the folio itself.
Simplify the redundant parameters and add sanity checks to document the
required preconditions.
Also rename memcg1_swapout to __memcg1_swapout to indicate it requires
special calling context: the folio must be isolated and dying, and the
call must be made with interrupts disabled.
No functional change.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-6-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:44 +0000 (23:39 +0800)]
mm, swap: unify large folio allocation
Now that direct large order allocation is supported in the swap cache,
both anon and shmem can use it instead of implementing their own methods.
This unifies the fallback and swap cache check, which also reduces the
TOCTOU race window of swap cache state: previously, high order swapin
required checking swap cache states first, then allocating and falling
back separately. Now all these steps happen in the same compact loop.
Order fallback and statistics are also unified, callers just need to check
and pass the acceptable order bitmask.
There is basically no behavior change. This only makes things more
unified and prepares for later commits. Cgroup and zero map checks can
also be moved into the compact loop, further reducing race windows and
redundancy
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-5-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:43 +0000 (23:39 +0800)]
mm, swap: add support for stable large allocation in swap cache directly
To make it possible to allocate large folios directly in swap cache,
provide a new infrastructure helper to handle the swap cache status check,
allocation, and order fallback in the swap cache layer
The new helper replaces the existing swap_cache_alloc_folio. Based on
this, all the separate swap folio allocation that is being done by anon /
shmem before is converted to use this helper directly, unifying folio
allocation for anon, shmem, and readahead.
This slightly consolidates how allocation is synchronized, making it more
stable and less prone to errors. The slot-count and cache-conflict check
is now always performed with the cluster lock held before allocation, and
repeated under the same lock right before cache insertion. This double
check produces a stable result compared to the previous anon and shmem
mTHP allocation implementation, avoids the false-negative conflict checks
that the lockless path can return — large allocations no longer have to
be unwound because the range turned out to be occupied — and aborts
early for already-freed slots, which helps ordinary swapin and especially
readahead, with only a marginal increase in cluster-lock contention (the
lock is very lightly contended and stays local in the first place).
Hence, callers of swap_cache_alloc_folio() no longer need to check the
swap slot count or swap cache status themselves.
And now whoever first successfully allocates a folio in the swap cache
will be the one who charges it and performs the swap-in. The race window
of swapping is also reduced since the loop is much more compact.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-4-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:42 +0000 (23:39 +0800)]
mm/huge_memory: move THP gfp limit helper into header
Shmem has some special requirements for THP GFP and has to limit it in
certain zones or provide a more lenient fallback.
We'll use this helper for generic swap THP allocation, which needs to
support shmem. For a typical GFP_HIGHUSER_MOVABLE swap-in, this helper is
basically a no-op. But it's necessary for certain shmem users, mostly
drivers.
No feature change.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-3-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:41 +0000 (23:39 +0800)]
mm, swap: move common swap cache operations into standalone helpers
Move a few swap cache checking, adding, and deletion operations into
standalone helpers to be used later. And while at it, add proper kernel
doc.
No feature or behavior change.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-2-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This series unifies the allocation and charging of anon and shmem swap in
folios, provides better synchronization, consolidates the metadata
management, hence dropping the static array and map, and improves the
performance. The static metadata overhead is now close to zero, and
workload performance is slightly improved.
For example, mounting a 1TB swap device saves about 512MB of memory:
Before:
free -m
total used free shared buff/cache available
Mem: 1464 805 346 1 382 658
Swap: 1048575 0 1048575
After:
free -m
total used free shared buff/cache available
Mem: 1464 277 899 1 356 1187
Swap: 1048575 0 1048575
Memory usage is ~512M lower, and we now have a close to 0 static overhead.
It was about 2 bytes per slot before, now roughly 0.09375 bytes per slot
(48 bytes ci info per cluster, which is 512 slots).
Performance test is also looking good, testing Redis in a 2G VM using 6G
ZRAM as swap:
This series also reduces memory thrashing, I no longer see any: "Huh
VM_FAULT_OOM leaked out to the #PF handler. Retrying PF", it was shown
several times during stress testing before this series when under great
pressure:
Instead of trying to return the existing folio if the entry is already
cached in swap_cache_alloc_folio, simply return an error pointer if the
allocation failed, and drop the output argument that indicates what kind
of folio is actually returned.
And a proper wrapper swap_cache_read_folio that decouples and handles the
actual requirement - read in the folio, or return the already read folio
in cache. This is what async swapin and readahead actually required.
As for zswap swap out, the caller just needs to abort if the allocation
fails because the entry is gone or already cached, so removing simplifies
the return argument, making it cleaner.
No feature change.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com Link: https://lore.kernel.org/20260517-swap-table-p4-v5-1-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Cc: Youngjun Park <youngjun.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: swap_cgroup: fix NULL deref in lookup_swap_cgroup_id on swapless host
lookup_swap_cgroup_id() passes swap_cgroup_ctrl[type].map to
__swap_cgroup_id_lookup() without checking that the type was ever
registered via swap_cgroup_swapon(). On a swapless host every ctrl->map
is NULL, so __swap_cgroup_id_lookup() dereferences NULL + a scaled
swp_offset().
Since commit bea67dcc5eea ("mm: attempt to batch free swap entries for
zap_pte_range()"), zap_pte_range() -> swap_pte_batch() calls
lookup_swap_cgroup_id() on any non-present, non-none PTE that decodes as a
real swap entry, without first validating it against swap_info[]. A
single PTE corrupted into a type-0 swap entry takes the host down at
process exit.
We hit this in production on a swapless 6.12.58 host: ~1s of
"get_swap_device: Bad swap file entry 3f800204222bb" (do_swap_page() being
correctly defensive about the same entry) followed by
BUG: unable to handle page fault for address: 000003f800204220
RIP: 0010:lookup_swap_cgroup_id+0x2b/0x60
Call Trace:
swap_pte_batch+0xbf/0x230
zap_pte_range+0x4c8/0x780
unmap_page_range+0x190/0x3e0
exit_mmap+0xd9/0x3c0
do_exit+0x20c/0x4b0
syzbot has reported the identical stack.
The source of the PTE corruption is a separate bug; this change makes the
teardown path as robust as the fault path already is. Every other caller
of lookup_swap_cgroup_id() is downstream of a get_swap_device() that has
already validated the entry, so the new branch is cold.
Link: https://lore.kernel.org/20260504-swap-cgroup-fix-7-0-v1-1-f53ff41ee553@linux.dev Fixes: bea67dcc5eea ("mm: attempt to batch free swap entries for zap_pte_range()") Signed-off-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev> Reported-by: syzbot+e12bd9ca48157add237a@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/69859728.050a0220.3b3015.0033.GAE@google.com Assisted-by: Claude:unspecified Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Brendan Jackman [Tue, 19 May 2026 14:17:58 +0000 (14:17 +0000)]
mm/page_alloc: document that alloc_pages_nolock() uses RCU
The allocator interacts with cgroups which rely on RCU. RCU does not work
everywhere, so the "any context" claim is slightly overstated here.
This should already be enforced by objtool, since this function is not
marked noinstr the x86 build should fail if you call it from a place where
RCU is not watching. But, expecting readers to make that connection for
themselves seems a bit cruel (I don't think there is even any
documentation of what noinstr means at all, let alone the connection with
RCU).
Note this is not claiming that any cgroup code called from the allocator
would actually break if this restriction was violated, it could very well
be that there's no real way for the allocator to act on a cgroup that can
disappear concurrently. But, since it's likely nobody has verified this
one way or another, better to just be safe and declare that RCU is
required. Allocating from an RCU-unsafe context seems a bit crazy anyway.
Link: https://lore.kernel.org/20260519-nolock-rcu-comment-v1-1-4a630c8794e5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Suggested-by: Junaid Shahid <junaids@google.com> Acked-by: Harry Yoo (Oracle) <harry@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>