git.ipfire.org Git - thirdparty/kernel/linux.git/log
Linus Torvalds [Sun, 19 Apr 2026 21:45:37 +0000 (14:45 -0700)] 
Merge tag 'mm-hotfixes-stable-2026-04-19-00-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM fixes from Andrew Morton:
 "7 hotfixes. 6 are cc:stable and all are for MM. Please see the
  individual changelogs for details"

* tag 'mm-hotfixes-stable-2026-04-19-00-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/damon/core: disallow non-power of two min_region_sz on damon_start()
  mm/vmalloc: take vmap_purge_lock in shrinker
  mm: call ->free_folio() directly in folio_unmap_invalidate()
  mm: blk-cgroup: fix use-after-free in cgwb_release_workfn()
  mm/zone_device: do not touch device folio after calling ->folio_free()
  mm/damon/core: disallow time-quota setting zero esz
  mm/mempolicy: fix weighted interleave auto sysfs name

Linus Torvalds [Sun, 19 Apr 2026 19:58:08 +0000 (12:58 -0700)] 
Merge tag 'driver-core-7.1-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core

Pull driver core fixes from Danilo Krummrich:

 - Prevent a device from being probed before device_add() has finished
   initializing it; gate probe with a "ready_to_probe" device flag to
   avoid races with concurrent driver_register() calls

 - Fix a kernel-doc warning for DEV_FLAG_COUNT introduced by the above

 - Return -ENOTCONN from software_node_get_reference_args() when a
   referenced software node is known but not yet registered, allowing
   callers to defer probe

 - In sysfs_group_attrs_change_owner(), also check is_visible_const();
   missed when the const variant was introduced
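
The probe-gating fix in the first bullet can be sketched in userspace C. The "ready_to_probe" flag is the one named in the changelog; the struct and helpers below are illustrative stand-ins, not the driver core's actual code:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for struct device; only the new flag matters. */
struct device_sim {
	bool ready_to_probe;	/* set once device_add() has finished */
	bool bound;
};

/* Sketch of the gate: a driver_register() racing with device_add()
 * now sees the flag unset and defers instead of probing a
 * half-initialized device. */
static int try_probe(struct device_sim *dev)
{
	if (!dev->ready_to_probe)
		return -1;	/* defer until the device is ready */
	dev->bound = true;
	return 0;
}

static void device_add_sim(struct device_sim *dev)
{
	/* ... internal initialization completes here ... */
	dev->ready_to_probe = true;	/* only now may drivers bind */
}
```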

* tag 'driver-core-7.1-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
  driver core: Add kernel-doc for DEV_FLAG_COUNT enum value
  sysfs: attribute_group: Respect is_visible_const() when changing owner
  software node: return -ENOTCONN when referenced swnode is not registered yet
  driver core: Don't let a device probe until it's ready

Linus Torvalds [Sun, 19 Apr 2026 15:51:32 +0000 (08:51 -0700)] 
Merge tag 'staging-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging driver updates from Greg KH:
 "Here is the "big" set of staging driver changes for 7.1-rc1.

  Nothing major in here at all, just lots of little cleanups for the
  staging drivers, driven by new developers getting their feet wet in
  kernel development. "Largest" thing in here is the change of some of
  the octeon variable types into proper kernel ones.

  Full details are in the shortlog.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'staging-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (154 commits)
  staging: rtl8723bs: remove redundant & parentheses
  staging: most: dim2: replace BUG_ON() in poison_channel()
  staging: most: dim2: replace BUG_ON() in enqueue()
  staging: most: dim2: replace BUG_ON() in configure_channel()
  staging: most: dim2: replace BUG_ON() in service_done_flag()
  staging: most: dim2: replace BUG_ON() in try_start_dim_transfer()
  staging: rtl8723bs: remove unused RTL8188E antenna selection macros
  staging: rtl8723bs: remove redundant blank lines in basic_types.h
  staging: rtl8723bs: wrap complex macros with parentheses
  staging: rtl8723bs: remove unused WRITEEF/READEF byte macros
  staging: rtl8723bs: rename camelCase variable
  staging: greybus: audio: fix error message for BTN_3 button
  staging: rtl8723bs: rename variables to snake_case
  staging: rtl8723bs: fix spelling in comment
  staging: rtl8723bs: cleanup return in sdio_init()
  staging: rtl8723bs: use direct returns in sdio_dvobj_init()
  staging: rtl8723bs: remove unused arg at odm_interface.h
  greybus: raw: fix use-after-free if write is called after disconnect
  greybus: raw: fix use-after-free on cdev close
  staging: rtl8723bs: fix logical continuations in xmit_linux.c
  ...

Linus Torvalds [Sun, 19 Apr 2026 15:47:40 +0000 (08:47 -0700)] 
Merge tag 'usb-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB / Thunderbolt updates from Greg KH:
 "Here is the big set of USB and Thunderbolt changes for 7.1-rc1.

  Lots of little things in here, nothing major, just constant
  improvements, updates, and new features. Highlights are:

   - new USB power supply driver support.

     These changes did touch outside of drivers/usb/ but got acks from
     the relevant maintainers for them.

   - dts file updates and conversions

   - string function conversions into "safer" ones

   - new device quirks

   - xhci driver updates

   - usb gadget driver minor fixes

   - typec driver additions and updates

   - small number of thunderbolt driver changes

   - dwc3 driver updates and additions of new hardware support

   - other minor driver updates

  All of these have been in the linux-next tree for a while with no
  reported issues"

* tag 'usb-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (176 commits)
  usb: dwc3: starfive: Add JHB100 USB 2.0 DRD controller
  dt-bindings: usb: dwc3: add support for StarFive JHB100
  dt-bindings: usb: atmel,at91sam9rl-udc: convert to DT schema
  dt-bindings: usb: atmel,at91rm9200-udc: convert to DT schema
  dt-bindings: usb: generic-ehci: fix schema structure and add at91sam9g45 constraints
  dt-bindings: usb: generic-ohci: add AT91RM9200 OHCI binding support
  arm: dts: at91: remove unused #address-cells/#size-cells from sam9x60 udc node
  drivers/usb/host: Fix spelling error 'seperate' -> 'separate'
  usbip: tools: add hint when no exported devices are found
  USB: serial: iuu_phoenix: fix iuutool author name
  usb: gadget: f_ncm: validate minimum block_len in ncm_unwrap_ntb()
  usb: gadget: f_phonet: fix skb frags[] overflow in pn_rx_complete()
  usb: gadget: f_hid: Add missing error code
  usb: typec: cros_ec_ucsi: Load driver from OF and ACPI definitions
  dt-bindings: chrome: Add cros-ec-ucsi compatibility to typec binding
  USB: of: Simplify with scoped for each OF child loop
  usbip: validate number_of_packets in usbip_pack_ret_submit()
  usb: gadget: renesas_usb3: validate endpoint index in standard request handlers
  usb: core: config: reverse the size check of the SSP isoc endpoint descriptor
  usb: typec: ucsi: Set usb mode on partner change
  ...

Linus Torvalds [Sun, 19 Apr 2026 15:44:41 +0000 (08:44 -0700)] 
Merge tag 'tty-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial updates from Greg KH:
 "Here is the set of tty and serial driver changes for 7.1-rc1.

  Not much here this cycle, biggest thing is the removal of an old
  driver that never got any actual hardware support (esp32), and the
  second try at moving the tty ports to their own workqueues (the first
  try was in 7.0-rc1 but was reverted due to problems).

  Otherwise it's just a small set of driver updates and some vt modifier
  key enhancements.

  All have been in linux-next for a while with no reported issues"

* tag 'tty-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (35 commits)
  tty: serial: ip22zilog: Fix section mispatch warning
  hvc/xen: Check console connection flag
  serial: sh-sci: Add support for RZ/G3L RSCI
  dt-bindings: serial: renesas,rsci: Document RZ/G3L SoC
  tty: atmel_serial: update outdated reference to atmel_tasklet_func()
  serial: xilinx_uartps: Drop unused include
  serial: qcom-geni: drop stray newline format specifier
  serial: 8250: loongson: Enable building on MIPS Loongson64
  dt-bindings: serial: 8250: Add Loongson 3A4000 uart compatible
  serial: 8250_fintek: Add support for F81214E
  tty: tty_port: add workqueue to flip TTY buffer
  vt: support ITU-T T.416 color subparameters
  serial: qcom-geni: Fix RTS behavior with flow control
  tty: serial: imx: keep dma request disabled before dma transfer setup
  tty: serial: 8250: Add SystemBase Multi I/O cards
  serial: pic32_uart: allow driver to be compiled on all architectures with COMPILE_TEST
  serial: tegra: remove Kconfig dependency on APB DMA controller
  dt-bindings: serial: amlogic,meson-uart: Add compatible string for A9
  dt-bindings: serial: atmel,at91-usart: add microchip,lan9691-usart
  serial: auart: check clk_enable() return in console write
  ...

Linus Torvalds [Sun, 19 Apr 2026 15:01:17 +0000 (08:01 -0700)] 
Merge tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more MM updates from Andrew Morton:

 - "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song)

   Address the longstanding "dying memcg problem": a situation wherein a
   no-longer-used memory control group hangs around for an extended
   period, pointlessly consuming memory

 - "fix unexpected type conversions and potential overflows" (Qi Zheng)

   Fix a couple of potential 32-bit/64-bit issues which were identified
   during review of the "Eliminate Dying Memory Cgroup" series

 - "kho: history: track previous kernel version and kexec boot count"
   (Breno Leitao)

   Use Kexec Handover (KHO) to pass the previous kernel's version string
   and the number of kexec reboots since the last cold boot to the next
   kernel, and print it at boot time

 - "liveupdate: prevent double preservation" (Pasha Tatashin)

   Teach LUO to avoid managing the same file across different active
   sessions

 - "liveupdate: Fix module unloading and unregister API" (Pasha
   Tatashin)

   Address an issue with how LUO handles module reference counting and
   unregistration during module unloading

 - "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar)

   Simplify and clean up the zswap crypto compression handling and
   improve the lifecycle management of zswap pool's per-CPU acomp_ctx
   resources

 - "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race"
   (SeongJae Park)

   Address unlikely but possible leaks and deadlocks in damon_call() and
   damon_walk()

 - "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park)

   Fix a couple of root-only wild pointer dereferences

 - "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race"
   (SeongJae Park)

   Update the DAMON documentation to warn operators about potential
   races which can occur if the commit_inputs parameter is altered at
   the wrong time

 - "Minor hmm_test fixes and cleanups" (Alistair Popple)

   Bugfixes and a cleanup for the HMM kernel selftests

 - "Modify memfd_luo code" (Chenghao Duan)

   Cleanups, simplifications and speedups to the memfd_luo code

 - "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport)

   Support for userfaultfd in guest_memfd

 - "selftests/mm: skip several tests when thp is not available" (Chunyu
   Hu)

   Fix several issues in the selftests code which were causing breakage
   when the tests were run on CONFIG_THP=n kernels

 - "mm/mprotect: micro-optimization work" (Pedro Falcato)

   A couple of nice speedups for mprotect()

 - "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav)

   Document upcoming changes in the maintenance of KHO, LUO, memfd_luo,
   kexec, crash, kdump and probably other kexec-based things - they are
   being moved out of mm.git and into a new git tree

* tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits)
  MAINTAINERS: add page cache reviewer
  mm/vmscan: avoid false-positive -Wuninitialized warning
  MAINTAINERS: update Dave's kdump reviewer email address
  MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE
  MAINTAINERS: drop include/linux/kho/abi/ from KHO
  MAINTAINERS: update KHO and LIVE UPDATE maintainers
  MAINTAINERS: update kexec/kdump maintainers entries
  mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd()
  selftests: mm: skip charge_reserved_hugetlb without killall
  userfaultfd: allow registration of ranges below mmap_min_addr
  mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update
  mm/hugetlb: fix early boot crash on parameters without '=' separator
  zram: reject unrecognized type= values in recompress_store()
  docs: proc: document ProtectionKey in smaps
  mm/mprotect: special-case small folios when applying permissions
  mm/mprotect: move softleaf code out of the main function
  mm: remove '!root_reclaim' checking in should_abort_scan()
  mm/sparse: fix comment for section map alignment
  mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete()
  selftests/mm: transhuge_stress: skip the test when thp not available
  ...

SeongJae Park [Sat, 11 Apr 2026 21:36:36 +0000 (14:36 -0700)] 
mm/damon/core: disallow non-power of two min_region_sz on damon_start()

Commit d8f867fa0825 ("mm/damon: add damon_ctx->min_sz_region") introduced
a bug that allows unaligned DAMON region address ranges.  Commit
c80f46ac228b ("mm/damon/core: disallow non-power of two min_region_sz")
fixed it, but only for damon_commit_ctx() use case.  Still, DAMON sysfs
interface can emit non-power of two min_region_sz via damon_start().  Fix
the path by adding the is_power_of_2() check on damon_start().
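
The added check is the standard power-of-two test (exactly one bit set). A userspace sketch of the validation described above, with a hypothetical wrapper standing in for the damon_start() path:

```c
#include <assert.h>
#include <stdbool.h>

/* Same test the kernel's is_power_of_2() performs: exactly one bit set.
 * Note that 0 is rejected. */
static bool is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical stand-in for the check added on the damon_start() path;
 * the real code rejects the context with an error code instead. */
static int check_min_region_sz(unsigned long min_region_sz)
{
	return is_power_of_2(min_region_sz) ? 0 : -1;
}
```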

The issue was discovered by sashiko [1].

Link: https://lore.kernel.org/20260411213638.77768-1-sj@kernel.org
Link: https://lore.kernel.org/20260403155530.64647-1-sj@kernel.org
Fixes: d8f867fa0825 ("mm/damon: add damon_ctx->min_sz_region")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.18.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Uladzislau Rezki (Sony) [Mon, 13 Apr 2026 19:26:46 +0000 (21:26 +0200)] 
mm/vmalloc: take vmap_purge_lock in shrinker

decay_va_pool_node() can be invoked concurrently from two paths:
__purge_vmap_area_lazy() when pools are being purged, and the shrinker via
vmap_node_shrink_scan().

However, decay_va_pool_node() is not safe to run concurrently, and the
shrinker path currently lacks serialization, leading to races and possible
leaks.

Protect decay_va_pool_node() by taking vmap_purge_lock in the shrinker
path to ensure serialization with purge users.
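
A minimal single-threaded sketch of the serialization rule: the non-reentrant decay work asserts the lock is held, and the shrinker path now takes it. The names mirror the changelog, but the lock here is a plain flag rather than the kernel mutex:

```c
#include <assert.h>
#include <stdbool.h>

static bool vmap_purge_lock_held;	/* stand-in for the kernel mutex */
static int pool_entries = 8;

static void mutex_lock_sim(void)
{
	assert(!vmap_purge_lock_held);
	vmap_purge_lock_held = true;
}

static void mutex_unlock_sim(void)
{
	vmap_purge_lock_held = false;
}

/* Not safe to run concurrently: callers must hold vmap_purge_lock. */
static void decay_va_pool_node_sim(void)
{
	assert(vmap_purge_lock_held);
	if (pool_entries > 0)
		pool_entries--;
}

/* Sketch of the fixed shrinker path: take the lock before decaying,
 * matching what __purge_vmap_area_lazy() already does. */
static void vmap_node_shrink_scan_sim(void)
{
	mutex_lock_sim();
	decay_va_pool_node_sim();
	mutex_unlock_sim();
}
```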

Link: https://lore.kernel.org/20260413192646.14683-1-urezki@gmail.com
Fixes: 7679ba6b36db ("mm: vmalloc: add a shrinker to drain vmap pools")
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <baoquan.he@linux.dev>
Cc: chenyichong <chenyichong@uniontech.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 13 Apr 2026 18:43:11 +0000 (19:43 +0100)] 
mm: call ->free_folio() directly in folio_unmap_invalidate()

We can only call filemap_free_folio() if we have a reference to (or hold a
lock on) the mapping.  Otherwise, we've already removed the folio from the
mapping so it no longer pins the mapping and the mapping can be removed,
causing a use-after-free when accessing mapping->a_ops.

Follow the same pattern as __remove_mapping() and load the free_folio
function pointer before dropping the lock on the mapping.  That lets us
make filemap_free_folio() static as this was the only caller outside
filemap.c.
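
The pattern is: read the ->free_folio pointer while the mapping is still safely held, then call through the saved pointer afterwards. A hedged userspace sketch with made-up types, not the kernel's definitions:

```c
#include <assert.h>
#include <stddef.h>

struct folio_sim { int id; };
struct a_ops_sim { void (*free_folio)(struct folio_sim *); };
struct mapping_sim { const struct a_ops_sim *a_ops; };

static int last_freed_id;
static void fs_free_folio(struct folio_sim *f) { last_freed_id = f->id; }

/* Sketch of the fix: capture the callback before the point where the
 * mapping may go away, so mapping->a_ops is never dereferenced late. */
static void unmap_invalidate_sim(struct mapping_sim *mapping,
				 struct folio_sim *folio)
{
	void (*free_folio)(struct folio_sim *) = mapping->a_ops->free_folio;

	/* ... folio removed from mapping; in the buggy version the
	 * mapping could be freed before a late a_ops lookup here ... */

	if (free_folio)
		free_folio(folio);	/* no access to *mapping anymore */
}
```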

Link: https://lore.kernel.org/20260413184314.3419945-1-willy@infradead.org
Fixes: fb7d3bc41493 ("mm/filemap: drop streaming/uncached pages when writeback completes")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Google Big Sleep <big-sleep-vuln-reports+bigsleep-501448199@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Breno Leitao [Mon, 13 Apr 2026 10:09:19 +0000 (03:09 -0700)] 
mm: blk-cgroup: fix use-after-free in cgwb_release_workfn()

cgwb_release_workfn() calls css_put(wb->blkcg_css) and then later accesses
wb->blkcg_css again via blkcg_unpin_online().  If css_put() drops the last
reference, the blkcg can be freed asynchronously (css_free_rwork_fn ->
blkcg_css_free -> kfree) before blkcg_unpin_online() dereferences the
pointer to access blkcg->online_pin, resulting in a use-after-free:

  BUG: KASAN: slab-use-after-free in blkcg_unpin_online (./include/linux/instrumented.h:112 ./include/linux/atomic/atomic-instrumented.h:400 ./include/linux/refcount.h:389 ./include/linux/refcount.h:432 ./include/linux/refcount.h:450 block/blk-cgroup.c:1367)
  Write of size 4 at addr ff11000117aa6160 by task kworker/71:1/531
   Workqueue: cgwb_release cgwb_release_workfn
   Call Trace:
    <TASK>
     blkcg_unpin_online (./include/linux/instrumented.h:112 ./include/linux/atomic/atomic-instrumented.h:400 ./include/linux/refcount.h:389 ./include/linux/refcount.h:432 ./include/linux/refcount.h:450 block/blk-cgroup.c:1367)
     cgwb_release_workfn (mm/backing-dev.c:629)
     process_scheduled_works (kernel/workqueue.c:3278 kernel/workqueue.c:3385)

   Freed by task 1016:
    kfree (./include/linux/kasan.h:235 mm/slub.c:2689 mm/slub.c:6246 mm/slub.c:6561)
    css_free_rwork_fn (kernel/cgroup/cgroup.c:5542)
    process_scheduled_works (kernel/workqueue.c:3302 kernel/workqueue.c:3385)

** Stack based on commit 66672af7a095 ("Add linux-next specific files
for 20260410")

I am seeing this crash sporadically in Meta fleet across multiple kernel
versions.  A full reproducer is available at:
https://github.com/leitao/debug/blob/main/reproducers/repro_blkcg_uaf.sh

(The race window is narrow.  To make it easily reproducible, inject a
msleep(100) between css_put() and blkcg_unpin_online() in
cgwb_release_workfn().  With that delay and a KASAN-enabled kernel, the
reproducer triggers the splat reliably in less than a second.)

Fix this by moving blkcg_unpin_online() before css_put(), so the
cgwb's CSS reference keeps the blkcg alive while blkcg_unpin_online()
accesses it.
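
In userspace terms, the fix is a plain reordering: touch the refcounted object before dropping the reference that keeps it alive. The names below are illustrative stand-ins for the kernel objects:

```c
#include <assert.h>
#include <stdlib.h>

struct blkcg_sim { int refcnt; int online_pin; };

/* Drop a reference; the object is freed when the count hits zero,
 * which is what the asynchronous css free path does in the kernel. */
static void css_put_sim(struct blkcg_sim **p)
{
	if (--(*p)->refcnt == 0) {
		free(*p);
		*p = NULL;
	}
}

static void blkcg_unpin_online_sim(struct blkcg_sim *b)
{
	b->online_pin--;	/* would be a UAF on a freed object */
}

/* Fixed ordering from the patch: unpin first, put last, so the cgwb's
 * reference still pins the blkcg while it is being accessed. */
static void cgwb_release_workfn_sim(struct blkcg_sim **blkcg)
{
	blkcg_unpin_online_sim(*blkcg);
	css_put_sim(blkcg);
}
```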

Link: https://lore.kernel.org/20260413-blkcg-v1-1-35b72622d16c@debian.org
Fixes: 59b57717fff8 ("blkcg: delay blkg destruction until after writeback has finished")
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: JP Kobryn <inwardvessel@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Brost [Fri, 10 Apr 2026 23:03:46 +0000 (16:03 -0700)] 
mm/zone_device: do not touch device folio after calling ->folio_free()

The contents of a device folio can immediately change after calling
->folio_free(), as the folio may be reallocated by a driver with a
different order.  Instead of touching the folio again to extract the
pgmap, use the local stack variable when calling percpu_ref_put_many().
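
The same "cache it in a local first" discipline, sketched in userspace with illustrative types: the driver callback may scribble over the folio, so the pgmap pointer is read before the callback and only the local is used afterwards.

```c
#include <assert.h>
#include <string.h>

struct pgmap_sim { int refs; };
struct folio_sim { struct pgmap_sim *pgmap; };

/* Driver callback that may immediately reuse the folio, e.g. for an
 * allocation with a different order, rewriting its fields. */
static void driver_folio_free(struct folio_sim *f)
{
	memset(f, 0, sizeof(*f));	/* folio contents change */
}

static void percpu_ref_put_many_sim(struct pgmap_sim *p, int nr)
{
	p->refs -= nr;
}

/* Sketch of the fix: read folio->pgmap into a local BEFORE the
 * callback, then use only the local afterwards. */
static void free_zone_device_folio_sim(struct folio_sim *folio, int nr)
{
	struct pgmap_sim *pgmap = folio->pgmap;

	driver_folio_free(folio);
	percpu_ref_put_many_sim(pgmap, nr);	/* folio not touched again */
}
```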

Link: https://lore.kernel.org/20260410230346.4009855-1-matthew.brost@intel.com
Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Balbir Singh <balbirs@nvidia.com>
Reviewed-by: Vishal Moola <vishal.moola@gmail.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Tue, 7 Apr 2026 00:31:52 +0000 (17:31 -0700)] 
mm/damon/core: disallow time-quota setting zero esz

When the throughput of a DAMOS scheme is very slow, DAMOS time quota can
make the effective size quota smaller than damon_ctx->min_region_sz.  In
that case, damos_apply_scheme() will skip applying the action, because the
action is tried at region level, which requires >=min_region_sz size.
That is, the quota is effectively exceeded for the quota charge window.

Because no action will be applied, the total_charged_sz and
total_charged_ns are also not updated.  damos_set_effective_quota() will
try to update the effective size quota before starting the next charge
window.  However, because the total_charged_sz and total_charged_ns have
not updated, the throughput and effective size quota are also not changed.
Since effective size quota can only be decreased, other effective size
quota update factors including DAMOS quota goals and size quota cannot
make any change, either.

As a result, the scheme is unexpectedly deactivated until the user notices
and mitigates the situation.  Users can mitigate it by changing the time
quota online or re-installing the scheme.  While the mitigation is somewhat
straightforward, noticing the situation in the first place is challenging,
because DAMON does not provide good observability for it.  Even if such
observability were provided, the additional monitoring and mitigation would
be cumbersome and not aligned with the intention of the time quota, which
was meant to help reduce the user's administration overhead.

Fix the problem by ensuring the time-quota-modified effective size quota
is always at least min_region_sz.
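
The fix reduces to a lower-bound clamp on the effective size quota. A one-function sketch with illustrative names, not the DAMON source:

```c
#include <assert.h>

/* Sketch: the time-quota-derived effective size quota must never fall
 * below min_region_sz, or no region can ever be acted on and quota
 * charging stalls as described above. */
static unsigned long effective_sz_quota(unsigned long time_derived_esz,
					unsigned long min_region_sz)
{
	return time_derived_esz < min_region_sz ? min_region_sz
						: time_derived_esz;
}
```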

The issue was discovered [1] by sashiko.

Link: https://lore.kernel.org/20260407003153.79589-1-sj@kernel.org
Link: https://lore.kernel.org/20260405192504.110014-1-sj@kernel.org
Fixes: 1cd243030059 ("mm/damon/schemes: implement time quota")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 5.16.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joshua Hahn [Tue, 7 Apr 2026 14:14:14 +0000 (07:14 -0700)] 
mm/mempolicy: fix weighted interleave auto sysfs name

The __ATTR macro is a utility that makes defining kobj_attributes easier
by stringfying the name, verifying the mode, and setting the show/store
fields in a single initializer.  It takes a raw token as the first value,
rather than a string, so that __ATTR family macros like __ATTR_RW can
token-paste it for inferring the _show / _store function names.

Commit e341f9c3c841 ("mm/mempolicy: Weighted Interleave Auto-tuning") used
the __ATTR macro to define the "auto" sysfs attribute for weighted
interleave.  A few months later, commit 2fb6915fa22d ("compiler_types.h:
add "auto" as a macro for "__auto_type"") introduced a #define macro that
expands auto into __auto_type.

This caused the "auto" token passed into __ATTR to be expanded into
__auto_type, and the sysfs entry to be displayed as __auto_type as well.

Expand out the __ATTR macro and directly pass a string "auto" instead of
the raw token 'auto' to prevent it from being expanded out.  Also bypass
the VERIFY_OCTAL_PERMISSIONS check by triple checking that 0664 is indeed
the intended permissions for this sysfs file.

Before:
$ ls /sys/kernel/mm/mempolicy/weighted_interleave
__auto_type  node0

After:
$ ls /sys/kernel/mm/mempolicy/weighted_interleave/
auto  node0
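
The mechanism can be reproduced in plain C: a two-level stringify, like the kernel's __stringify() used inside __ATTR, macro-expands its argument before the # operator applies, while a direct # does not. Simplified macros below, not the kernel's definitions:

```c
#include <assert.h>
#include <string.h>

#define auto __auto_type	/* the colliding macro from 2fb6915fa22d */

#define STR_DIRECT(x) #x	/* '#' suppresses expansion of x */
#define STR_1(x) #x
#define STR(x) STR_1(x)		/* two-level: x is expanded first */
```

With the two-level form, STR(auto) yields "__auto_type", which is why the sysfs name changed; passing the string literal "auto" directly, as the fix does, sidesteps expansion entirely.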

Link: https://lore.kernel.org/20260407141415.3080960-1-joshua.hahnjy@gmail.com
Fixes: 2fb6915fa22d ("compiler_types.h: add "auto" as a macro for "__auto_type"")
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Rakie Kim <rakie.kim@sk.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Linus Torvalds [Sat, 18 Apr 2026 23:59:09 +0000 (16:59 -0700)] 
Merge tag 'pinctrl-v7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control updates from Linus Walleij:
 "Core changes:

   - Perform basic checks on pin config properties so as not to allow
     directly contradictory settings such as setting a pin to more than
     one bias or drive mode

   - Handle input-threshold-voltage-microvolt property

   - Introduce pinctrl_gpio_get_config() handling in the core for SCMI
     GPIO using pin control

  New drivers:

   - GPIO-by-pin control driver (also appearing in the GPIO pull
     request) fulfilling a promise on a comment from Grant Likely many
     years ago: "can't GPIO just be a front-end for pin control?" it
     turns out it can, if and only if you design something new from
     scratch, such as SCMI

   - Broadcom BCM7038 as a pinctrl-single delegate

   - Mobileye EyeQ6Lplus OLB pin controller

   - Qualcomm Eliza and Hawi families TLMM pin controllers

   - Qualcomm SDM670 and Milos family LPASS LPI pin controllers

   - Qualcomm IPQ5210 pin controller

   - Realtek RTD1625 pin controller support

   - Rockchip RV1103B pin controller support

   - Texas Instruments AM62L as a pinctrl-single delegate

  Improvements:

   - Set config implementation for the Spacemit K1 pin controller"

* tag 'pinctrl-v7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (84 commits)
  pinctrl: qcom: Add Hawi pinctrl driver
  dt-bindings: pinctrl: qcom: Describe Hawi TLMM block
  dt-bindings: pinctrl: pinctrl-max77620: convert to DT schema
  pinctrl: single: Add bcm7038-padconf compatible matching
  dt-bindings: pinctrl: pinctrl-single: Add brcm,bcm7038-padconf
  dt-bindings: pinctrl: apple,pinctrl: Add t8122 compatible
  pinctrl: qcom: sdm670-lpass-lpi: label variables as static
  pinctrl: sophgo: pinctrl-sg2044: Fix wrong module description
  pinctrl: sophgo: pinctrl-sg2042: Fix wrong module description
  pinctrl: qcom: add sdm670 lpi tlmm
  dt-bindings: pinctrl: qcom: Add SDM670 LPASS LPI pinctrl
  dt-bindings: qcom: lpass-lpi-common: add reserved GPIOs property
  pinctrl: qcom: Introduce IPQ5210 TLMM driver
  dt-bindings: pinctrl: qcom: add IPQ5210 pinctrl
  pinctrl: qcom: Drop redundant intr_target_reg on modern SoCs
  pinctrl: qcom: eliza: Fix interrupt target bit
  pinctrl: core: Don't use "proxy" headers
  pinctrl: amd: Support new ACPI ID AMDI0033
  pinctrl: renesas: rzg2l: Drop superfluous blank line
  pinctrl: renesas: rzg2l: Fix save/restore of {IOLH,IEN,PUPD,SMT} registers
  ...

Linus Torvalds [Sat, 18 Apr 2026 23:48:13 +0000 (16:48 -0700)] 
Merge tag 'i3c/for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux

Pull i3c updates from Alexandre Belloni:
 "Subsystem:
   - add sysfs option to rescan bus via entdaa
   - fix error code handling for send_ccc_cmd

  Drivers:
   - mipi-i3c-hci-pci: Intel Nova Lake-H I3C support"

* tag 'i3c/for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux: (22 commits)
  i3c: mipi-i3c-hci: fix IBI payload length calculation for final status
  i3c: master: adi: Fix error propagation for CCCs
  i3c: master: Fix error codes at send_ccc_cmd
  i3c: master: Move bus_init error suppression
  i3c: master: Move entdaa error suppression
  i3c: master: Move rstdaa error suppression
  i3c: dw: Simplify xfer cleanup with __free(kfree)
  i3c: dw: Fix memory leak in dw_i3c_master_i3c_xfers()
  i3c: master: renesas: Use __free(kfree) for xfer cleanup in renesas_i3c_send_ccc_cmd()
  i3c: master: renesas: Fix memory leak in renesas_i3c_i3c_xfers()
  i3c: master: dw-i3c: Balance PM runtime usage count on probe failure
  i3c: master: dw-i3c: Fix missing reset assertion in remove() callback
  i3c: mipi-i3c-hci-pci: Enable IBI while runtime suspended for Intel controllers
  i3c: mipi-i3c-hci-pci: Add optional ability to manage child runtime PM
  i3c: mipi-i3c-hci: Allow parent to manage runtime PM
  i3c: mipi-i3c-hci: Add quirk to allow IBI while runtime suspended
  i3c: mipi-i3c-hci-pci: Set d3hot_delay to 0 for Intel controllers
  i3c: fix missing newline in dev_err messages
  i3c: master: use kzalloc_flex
  i3c: mipi-i3c-hci-pci: Add support for Intel Nova Lake-H I3C
  ...

Linus Torvalds [Sat, 18 Apr 2026 18:37:36 +0000 (11:37 -0700)] 
Merge tag 'parisc-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

Pull parisc architecture updates from Helge Deller:

 - A fix to make modules on 32-bit parisc architecture work again

 - Drop ip_fast_csum() inline assembly to avoid unaligned memory
   accesses

 - Allow building the kernel without the 32-bit VDSO

 - Reference leak fix in error path in LED driver

* tag 'parisc-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: led: fix reference leak on failed device registration
  module.lds.S: Fix modules on 32-bit parisc architecture
  parisc: Allow to build without VDSO32
  parisc: Include 32-bit VDSO only when building for 32-bit or compat mode
  parisc: Allow to disable COMPAT mode on 64-bit kernel
  parisc: Fix default stack size when COMPAT=n
  parisc: Fix signal code to depend on CONFIG_COMPAT instead of CONFIG_64BIT
  parisc: is_compat_task() shall return false for COMPAT=n
  parisc: Avoid compat syscalls when COMPAT=n
  parisc: _llseek syscall is only available for 32-bit userspace
  parisc: Drop ip_fast_csum() inline assembly implementation
  parisc: update outdated comments for renamed ccio_alloc_consistent()

Linus Torvalds [Sat, 18 Apr 2026 18:29:14 +0000 (11:29 -0700)] 
Merge tag 'memblock-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock updates from Mike Rapoport:

 - Improve debuggability of reserve_mem kernel parameter handling with
   printouts in case of failure and debugfs info showing what was
   actually reserved

 - Make memblock_free_late() and free_reserved_area() use the same core
   logic for freeing the memory to buddy and ensure it takes care of
   updating memblock arrays when ARCH_KEEP_MEMBLOCK is enabled.

* tag 'memblock-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  x86/alternative: delay freeing of smp_locks section
  memblock: warn when freeing reserved memory before memory map is initialized
  memblock, treewide: make memblock_free() handle late freeing
  memblock: make free_reserved_area() update memblock if ARCH_KEEP_MEMBLOCK=y
  memblock: extract page freeing from free_reserved_area() into a helper
  memblock: make free_reserved_area() more robust
  mm: move free_reserved_area() to mm/memblock.c
  powerpc: opal-core: pair alloc_pages_exact() with free_pages_exact()
  powerpc: fadump: pair alloc_pages_exact() with free_pages_exact()
  memblock: reserve_mem: fix end caclulation in reserve_mem_release_by_name()
  memblock: move reserve_bootmem_range() to memblock.c and make it static
  memblock: Add reserve_mem debugfs info
  memblock: Print out errors on reserve_mem parser

34 hours agoMerge tag 'i2c-for-7.1-rc1-part1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 18 Apr 2026 16:44:22 +0000 (09:44 -0700)] 
Merge tag 'i2c-for-7.1-rc1-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c updates from Wolfram Sang:
 "The biggest news in this pull request is that it will start the last
  cycle of me handling the I2C subsystem. From 7.2 on, I will pass
  maintainership to Andi Shyti who has been maintaining the I2C drivers
  for a while now and who has done a great job in doing so.

  We will use this cycle for a hopefully smooth transition. Thanks must
  go to Andi for stepping up! I will still be around for guidance.

  Updates:
   - generic cleanups in npcm7xx, qcom-cci, xiic and designware DT
     bindings
   - atr: use kzalloc_flex for alias pool allocation
   - ixp4xx: convert bindings to DT schema
   - ocores: use read_poll_timeout_atomic() for polling waits
   - qcom-geni: skip extra TX DMA TRE for single read messages
   - s3c24xx: validate SMBus block length before using it
   - spacemit: refactor xfer path and add K1 PIO support
   - tegra: identify DVC and VI with SoC data variants
   - tegra: support SoC-specific register offsets
   - xiic: switch to devres and generic fw properties
   - xiic: skip input clock setup on non-OF systems
   - various minor improvements in other drivers

  rtl9300:
   - add per-SoC callbacks and clock support for RTL9607C
   - add support for new 50 kHz and 2.5 MHz bus speeds
   - general refactoring in preparation for RTL9607C support

  New support:
   - DesignWare GOOG5000 (ACPI HID)
   - Intel Nova Lake (ACPI ID)
   - Realtek RTL9607C
   - SpacemiT K3 binding
   - Tegra410 register layout support"

* tag 'i2c-for-7.1-rc1-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (40 commits)
  i2c: usbio: Add ACPI device-id for NVL platforms
  i2c: qcom-geni: Avoid extra TX DMA TRE for single read message in GPI mode
  i2c: atr: use kzalloc_flex
  i2c: spacemit: introduce pio for k1
  i2c: spacemit: move i2c_xfer_msg()
  i2c: xiic: skip input clock setup on non-OF systems
  i2c: xiic: use numbered adapter registration
  i2c: xiic: cosmetic: use resource format specifier in debug log
  i2c: xiic: cosmetic cleanup
  i2c: xiic: switch to generic device property accessors
  i2c: xiic: remove duplicate error message
  i2c: xiic: switch to devres managed APIs
  i2c: rtl9300: add RTL9607C i2c controller support
  i2c: rtl9300: introduce new function properties to driver data
  i2c: rtl9300: introduce clk struct for upcoming rtl9607 support
  dt-bindings: i2c: realtek,rtl9301-i2c: extend for clocks and RTL9607C support
  i2c: rtl9300: introduce a property for 8 bit width reg address
  i2c: rtl9300: introduce F_BUSY to the reg_fields struct
  i2c: rtl9300: introduce max length property to driver data
  i2c: rtl9300: split data_reg into read and write reg
  ...

34 hours agoMerge tag 'for-linus-7.1-1' of https://github.com/cminyard/linux-ipmi
Linus Torvalds [Sat, 18 Apr 2026 16:33:54 +0000 (09:33 -0700)] 
Merge tag 'for-linus-7.1-1' of https://github.com/cminyard/linux-ipmi

Pull ipmi updates from Corey Minyard:
 "Small updates and fixes (mostly to the BMC software):

   - Fix one issue in the host side driver where a kthread can be left
     running on a specific memory allocation failure at probe time

   - Replace system_wq with system_percpu_wq so system_wq can eventually
     go away"

* tag 'for-linus-7.1-1' of https://github.com/cminyard/linux-ipmi:
  ipmi:ssif: Clean up kthread on errors
  ipmi:ssif: Remove unnecessary indention
  ipmi: ssif_bmc: Fix KUnit test link failure when KUNIT=m
  ipmi: ssif_bmc: add unit test for state machine
  ipmi: ssif_bmc: change log level to dbg in irq callback
  ipmi: ssif_bmc: fix message desynchronization after truncated response
  ipmi: ssif_bmc: fix missing check for copy_to_user() partial failure
  ipmi: ssif_bmc: cancel response timer on remove
  ipmi: Replace use of system_wq with system_percpu_wq

34 hours agoMerge tag 'perf-tools-for-v7.1-2026-04-17' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Sat, 18 Apr 2026 16:24:56 +0000 (09:24 -0700)] 
Merge tag 'perf-tools-for-v7.1-2026-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools

Pull perf tools updates from Namhyung Kim:
 "perf report:

   - Add 'comm_nodigit' sort key to combine similar threads that only
     have different numbers in the comm. In the following example, the
     'comm_nodigit' key will combine samples from all threads starting
     with "bpfrb/" into a single entry "bpfrb/<N>".

        $ perf report -s comm_nodigit,comm -H
        ...
        #
        #    Overhead  CommandNoDigit / Command
        # ...........  ........................
        #
            20.30%     swapper
               20.30%     swapper
            13.37%     chrome
               13.37%     chrome
            10.07%     bpfrb/<N>
                7.47%     bpfrb/0
                0.70%     bpfrb/1
                0.47%     bpfrb/3
                0.46%     bpfrb/2
                0.25%     bpfrb/4
                0.23%     bpfrb/5
                0.20%     bpfrb/6
                0.14%     bpfrb/10
                0.07%     bpfrb/7

   - Support flat layout for symfs. The --symfs option is to specify the
     location of debugging symbol files. The default 'hierarchy' layout
     would search the symbol file using the same path of the original
     file under the symfs root. The new 'flat' layout would search only
     in the root directory.

   - Update 'simd' sort key for ARM SIMD flags to cover ASE/SME and more
     predicate flags.

  perf stat:

   - Add --pmu-filter option to select specific PMUs. This would be
     useful when you measure metrics from multiple instances of uncore
     PMUs with similar names.

        # perf stat -M cpa_p0_avg_bw
         Performance counter stats for 'system wide':

            19,417,779,115      hisi_sicl0_cpa0/cpa_cycles/      #     0.00 cpa_p0_avg_bw
                         0      hisi_sicl0_cpa0/cpa_p0_wr_dat/
                         0      hisi_sicl0_cpa0/cpa_p0_rd_dat_64b/
                         0      hisi_sicl0_cpa0/cpa_p0_rd_dat_32b/
            19,417,751,103      hisi_sicl10_cpa0/cpa_cycles/     #     0.00 cpa_p0_avg_bw
                         0      hisi_sicl10_cpa0/cpa_p0_wr_dat/
                         0      hisi_sicl10_cpa0/cpa_p0_rd_dat_64b/
                         0      hisi_sicl10_cpa0/cpa_p0_rd_dat_32b/
            19,417,730,679      hisi_sicl2_cpa0/cpa_cycles/      #     0.31 cpa_p0_avg_bw
                75,635,749      hisi_sicl2_cpa0/cpa_p0_wr_dat/
                18,520,640      hisi_sicl2_cpa0/cpa_p0_rd_dat_64b/
                         0      hisi_sicl2_cpa0/cpa_p0_rd_dat_32b/
            19,417,674,227      hisi_sicl8_cpa0/cpa_cycles/      #     0.00 cpa_p0_avg_bw
                         0      hisi_sicl8_cpa0/cpa_p0_wr_dat/
                         0      hisi_sicl8_cpa0/cpa_p0_rd_dat_64b/
                         0      hisi_sicl8_cpa0/cpa_p0_rd_dat_32b/

              19.417734480 seconds time elapsed

     With --pmu-filter, users can select only hisi_sicl2_cpa0 PMU.

        # perf stat --pmu-filter hisi_sicl2_cpa0 -M cpa_p0_avg_bw
         Performance counter stats for 'system wide':

             6,234,093,559      cpa_cycles                       #     0.60 cpa_p0_avg_bw
                50,548,465      cpa_p0_wr_dat
                 7,552,182      cpa_p0_rd_dat_64b
                         0      cpa_p0_rd_dat_32b

               6.234139320 seconds time elapsed

  Data type profiling:

   - Quality improvements by tracking register state more precisely

   - Ensure array members get the type

   - Handle more cases for global variables

  Vendor event/metric updates:

   - Update various Intel events and metrics

   - Add NVIDIA Tegra 410 Olympus events

  Internal changes:

   - Verify perf.data header for maliciously crafted files

   - Update perf test to cover more usages and make them robust

   - Move a couple of copied kernel headers so they do not annoy the objtool build

   - Fix a bug in map sorting in name order

   - Remove some unused code

  Misc:

   - Fix module symbol resolution with non-zero text address

   - Add -t/--threads option to `perf bench mem mmap`

   - Track duration of exit*() syscall by `perf trace -s`

   - Add core.addr2line-timeout and core.addr2line-disable-warn config
     items"

* tag 'perf-tools-for-v7.1-2026-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (131 commits)
  perf loongarch: Fix build failure with CONFIG_LIBDW_DWARF_UNWIND
  perf annotate: Use jump__delete when freeing LoongArch jumps
  perf test: Fixes for check branch stack sampling
  perf test: Fix inet_pton probe failure and unroll call graph
  perf build: fix "argument list too long" in second location
  perf header: Add sanity checks to HEADER_BPF_BTF processing
  perf header: Sanity check HEADER_BPF_PROG_INFO
  perf header: Sanity check HEADER_PMU_CAPS
  perf header: Sanity check HEADER_HYBRID_TOPOLOGY
  perf header: Sanity check HEADER_CACHE
  perf header: Sanity check HEADER_GROUP_DESC
  perf header: Sanity check HEADER_PMU_MAPPINGS
  perf header: Sanity check HEADER_MEM_TOPOLOGY
  perf header: Sanity check HEADER_NUMA_TOPOLOGY
  perf header: Sanity check HEADER_CPU_TOPOLOGY
  perf header: Sanity check HEADER_NRCPUS and HEADER_CPU_DOMAIN_INFO
  perf header: Bump up the max number of command line args allowed
  perf header: Validate nr_domains when reading HEADER_CPU_DOMAIN_INFO
  perf sample: Fix documentation typo
  perf arm_spe: Improve SIMD flags setting
  ...

43 hours agoMAINTAINERS: add page cache reviewer
Jan Kara [Wed, 15 Apr 2026 17:40:40 +0000 (19:40 +0200)] 
MAINTAINERS: add page cache reviewer

Add myself as a page cache reviewer since I tend to review changes in
these areas anyway.

[akpm@linux-foundation.org: add linux-mm@kvack.org]
Link: https://lore.kernel.org/20260415174039.13016-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Lorenzo Stoakes <ljs@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agomm/vmscan: avoid false-positive -Wuninitialized warning
Arnd Bergmann [Tue, 14 Apr 2026 06:51:58 +0000 (08:51 +0200)] 
mm/vmscan: avoid false-positive -Wuninitialized warning

When the -fsanitize=bounds sanitizer is enabled, gcc-16 sometimes runs
into a corner case in the read_ctrl_pos() pos function, where it sees
possible undefined behavior from the 'tier' index overflowing, presumably
in the case that this was called with a negative tier:

In function 'get_tier_idx',
    inlined from 'isolate_folios' at mm/vmscan.c:4671:14:
mm/vmscan.c: In function 'isolate_folios':
mm/vmscan.c:4645:29: error: 'pv.refaulted' is used uninitialized [-Werror=uninitialized]

Part of the problem seems to be that read_ctrl_pos() has unusual calling
conventions since commit 37a260870f2c ("mm/mglru: rework type selection")
where passing MAX_NR_TIERS makes it accumulate all tiers but passing a
smaller positive number makes it read a single tier instead.

Shut up the warning by adding a fake initialization to the two instances
of this variable that can run into that corner case.

Link: https://lore.kernel.org/all/CAJHvVcjtFW86o5FoQC8MMEXCHAC0FviggaQsd5EmiCHP+1fBpg@mail.gmail.com/
Link: https://lore.kernel.org/20260414065206.3236176-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Koichiro Den <koichiro.den@canonical.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agoMAINTAINERS: update Dave's kdump reviewer email address
Dave Young [Wed, 15 Apr 2026 03:29:26 +0000 (11:29 +0800)] 
MAINTAINERS: update Dave's kdump reviewer email address

Use my personal email address since my Red Hat work will stop soon

Link: https://lore.kernel.org/ad8GFhh3SI1wb7IC@darkstar.users.ipa.redhat.com
Signed-off-by: Dave Young <ruirui.yang@linux.dev>
Acked-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agoMAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE
Pratyush Yadav (Google) [Tue, 14 Apr 2026 12:17:20 +0000 (12:17 +0000)] 
MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE

The directory does not exist any more.

Link: https://lore.kernel.org/20260414121752.1912847-4-pratyush@kernel.org
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agoMAINTAINERS: drop include/linux/kho/abi/ from KHO
Pratyush Yadav (Google) [Tue, 14 Apr 2026 12:17:19 +0000 (12:17 +0000)] 
MAINTAINERS: drop include/linux/kho/abi/ from KHO

The KHO entry already includes include/linux/kho.  Listing its
subdirectory is redundant.

Link: https://lore.kernel.org/20260414121752.1912847-3-pratyush@kernel.org
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agoMAINTAINERS: update KHO and LIVE UPDATE maintainers
Pratyush Yadav (Google) [Tue, 14 Apr 2026 12:17:18 +0000 (12:17 +0000)] 
MAINTAINERS: update KHO and LIVE UPDATE maintainers

Patch series "MAINTAINERS: update KHO and LIVE UPDATE entries".

This series contains some updates for the Kexec Handover (KHO) and Live
update entries.  Patch 1 updates the maintainers list and adds the
liveupdate tree.  Patches 2 and 3 clean up stale files in the list.

This patch (of 3):

I have been helping out with reviewing and developing KHO.  I would also
like to help maintain it.  Change my entry from R to M for KHO and live
update.  Alex has been inactive for a while, so to avoid over-crowding the
KHO entry and to keep the information up-to-date, move his entry from M to
R.

We also now have a tree for KHO and live update at liveupdate/linux.git
where we plan to start maintaining those subsystems and start queuing the
patches.  List that in the entries as well.

Link: https://lore.kernel.org/20260414121752.1912847-1-pratyush@kernel.org
Link: https://lore.kernel.org/20260414121752.1912847-2-pratyush@kernel.org
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Reviewed-by: Alexander Graf <graf@amazon.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agoMAINTAINERS: update kexec/kdump maintainers entries
Pasha Tatashin [Mon, 13 Apr 2026 12:11:46 +0000 (12:11 +0000)] 
MAINTAINERS: update kexec/kdump maintainers entries

Update KEXEC and KDUMP maintainer entries by adding the live update group
maintainers.  Remove Vivek Goyal due to inactivity to keep the MAINTAINERS
file up-to-date, and add Vivek to the CREDITS file to recognize their
contributions.

Link: https://lore.kernel.org/20260413121146.49215-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Pratyush Yadav <pratyush@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Diego Viola <diego.viola@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Martin Kepplinger <martink@posteo.de>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agomm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd()
Davidlohr Bueso [Thu, 12 Feb 2026 01:46:11 +0000 (17:46 -0800)] 
mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd()

The softleaf_is_migration() check is unreachable as entries that are not
device_private are filtered out.  Similarly, the PTE-level equivalent in
migrate_vma_collect_pmd() skips migration entries.

This dead branch also contained a double spin_unlock(ptl) bug.

Link: https://lore.kernel.org/20260212014611.416695-1-dave@stgolabs.net
Fixes: a30b48bf1b244 ("mm/migrate_device: implement THP migration of zone device pages")
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Suggested-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Balbir Singh <balbirs@nvidia.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agoselftests: mm: skip charge_reserved_hugetlb without killall
Cao Ruichuang [Fri, 10 Apr 2026 04:41:39 +0000 (12:41 +0800)] 
selftests: mm: skip charge_reserved_hugetlb without killall

charge_reserved_hugetlb.sh tears down background writers with killall from
psmisc.  Minimal Ubuntu images do not always provide that tool, so the
selftest fails in cleanup for an environment reason rather than for the
hugetlb behavior it is trying to cover.

Skip the test when killall is unavailable, similar to the existing root
check, so these environments report the dependency clearly instead of
failing the test.

Link: https://lore.kernel.org/20260410044139.67480-1-create0818@163.com
Signed-off-by: Cao Ruichuang <create0818@163.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agouserfaultfd: allow registration of ranges below mmap_min_addr
Denis M. Karpov [Thu, 9 Apr 2026 10:33:45 +0000 (13:33 +0300)] 
userfaultfd: allow registration of ranges below mmap_min_addr

The current implementation of validate_range() in fs/userfaultfd.c
performs a hard check against mmap_min_addr.  This is redundant because
UFFDIO_REGISTER operates on memory ranges that must already be backed by a
VMA.

Enforcing mmap_min_addr or capability checks again in userfaultfd is
unnecessary and prevents applications like binary compilers from using
UFFD for valid memory regions mapped by application.

Remove the redundant check for mmap_min_addr.

We started using UFFD instead of the classic mprotect approach in the
binary translator to track application writes.  During development, we
encountered this bug.  The translator cannot control where the translated
application chooses to map its memory and if the app requires a
low-address area, UFFD fails, whereas mprotect would work just fine.  I
believe this is a genuine logic bug rather than an improvement, and I
would appreciate including the fix in stable.

Link: https://lore.kernel.org/20260409103345.15044-1-komlomal@gmail.com
Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Denis M. Karpov <komlomal@gmail.com>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Acked-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agomm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update
Breno Leitao [Thu, 9 Apr 2026 12:26:36 +0000 (05:26 -0700)] 
mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update

vmstat_shepherd uses delayed_work_pending() to check whether vmstat_update
is already scheduled for a given CPU before queuing it.  However,
delayed_work_pending() only tests WORK_STRUCT_PENDING_BIT, which is
cleared the moment a worker thread picks up the work to execute it.

This means that while vmstat_update is actively running on a CPU,
delayed_work_pending() returns false.  If need_update() also returns true
at that point (per-cpu counters not yet zeroed mid-flush), the shepherd
queues a second invocation with delay=0, causing vmstat_update to run
again immediately after finishing.

On a 72-CPU system this race is readily observable: before the fix, many
CPUs show invocation gaps well below 500 jiffies (the minimum
round_jiffies_relative() can produce), with the most extreme cases
reaching 0 jiffies—vmstat_update called twice within the same jiffy.

Fix this by replacing delayed_work_pending() with work_busy(), which
returns non-zero for both WORK_BUSY_PENDING (timer armed or work queued)
and WORK_BUSY_RUNNING (work currently executing).  The shepherd now
correctly skips a CPU in all busy states.

After the fix, all sub-jiffy and most sub-100-jiffie gaps disappear.  The
remaining early invocations have gaps in the 700–999 jiffie range,
attributable to round_jiffies_relative() aligning to a nearer
jiffie-second boundary rather than to this race.

Each spurious vmstat_update invocation has a measurable side effect:
refresh_cpu_vm_stats() calls decay_pcp_high() for every zone, which drains
idle per-CPU pages back to the buddy allocator via free_pcppages_bulk(),
taking the zone spinlock each time.  Eliminating the double-scheduling
therefore reduces zone lock contention directly.  On a 72-CPU stress-ng
workload measured with perf lock contention:

  free_pcppages_bulk contention count:  ~55% reduction
  free_pcppages_bulk total wait time:   ~57% reduction
  free_pcppages_bulk max wait time:     ~47% reduction

Note: work_busy() is inherently racy—between the check and the
subsequent queue_delayed_work_on() call, vmstat_update can finish
execution, leaving the work neither pending nor running.  In that narrow
window the shepherd can still queue a second invocation.  After the fix,
this residual race is rare and produces only occasional small gaps, a
significant improvement over the systematic double-scheduling seen with
delayed_work_pending().

Link: https://lore.kernel.org/20260409-vmstat-v2-1-e9d9a6db08ad@debian.org
Fixes: 7b8da4c7f07774 ("vmstat: get rid of the ugly cpu_stat_off variable")
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agomm/hugetlb: fix early boot crash on parameters without '=' separator
Thorsten Blum [Thu, 9 Apr 2026 10:54:40 +0000 (12:54 +0200)] 
mm/hugetlb: fix early boot crash on parameters without '=' separator

If hugepages, hugepagesz, or default_hugepagesz are specified on the
kernel command line without the '=' separator, early parameter parsing
passes NULL to hugetlb_add_param(), which dereferences it in strlen() and
can crash the system during early boot.

Reject NULL values in hugetlb_add_param() and return -EINVAL instead.

Link: https://lore.kernel.org/20260409105437.108686-4-thorsten.blum@linux.dev
Fixes: 5b47c02967ab ("mm/hugetlb: convert cmdline parameters from setup to early")
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agozram: reject unrecognized type= values in recompress_store()
Andrew Stellman [Tue, 7 Apr 2026 15:30:27 +0000 (11:30 -0400)] 
zram: reject unrecognized type= values in recompress_store()

recompress_store() parses the type= parameter with three if statements
checking for "idle", "huge", and "huge_idle".  An unrecognized value
silently falls through with mode left at 0, causing the recompression pass
to run with no slot filter, processing all slots instead of the
intended subset.

Add a !mode check after the type parsing block to return -EINVAL for
unrecognized values, consistent with the function's other parameter
validation.

Link: https://lore.kernel.org/20260407153027.42425-1-astellman@stellman-greene.com
Signed-off-by: Andrew Stellman <astellman@stellman-greene.com>
Suggested-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agodocs: proc: document ProtectionKey in smaps
Kevin Brodsky [Tue, 7 Apr 2026 12:51:33 +0000 (13:51 +0100)] 
docs: proc: document ProtectionKey in smaps

The ProtectionKey entry was added in v4.9; back then it was x86-specific,
but it now lives in generic code and applies to all architectures
supporting pkeys (currently x86, power, arm64).

Time to document it: add a paragraph to proc.rst about the ProtectionKey
entry.

[akpm@linux-foundation.org: s/system/hardware/, per review discussion]
[akpm@linux-foundation.org: s/hardware/CPU/]
Link: https://lore.kernel.org/20260407125133.564182-1-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reported-by: Yury Khrustalev <yury.khrustalev@arm.com>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agomm/mprotect: special-case small folios when applying permissions
Pedro Falcato [Thu, 2 Apr 2026 14:16:28 +0000 (15:16 +0100)] 
mm/mprotect: special-case small folios when applying permissions

The common order-0 case is important enough to want its own branch, and
avoids the hairy, large loop logic that the CPU does not seem to handle
particularly well.

While at it, encourage the compiler to inline batch PTE logic and resolve
constant branches by adding __always_inline strategically.

Link: https://lore.kernel.org/20260402141628.3367596-3-pfalcato@suse.de
Signed-off-by: Pedro Falcato <pfalcato@suse.de>
Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Tested-by: Luke Yang <luyang@redhat.com>
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jiri Hladky <jhladky@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agomm/mprotect: move softleaf code out of the main function
Pedro Falcato [Thu, 2 Apr 2026 14:16:27 +0000 (15:16 +0100)] 
mm/mprotect: move softleaf code out of the main function

Patch series "mm/mprotect: micro-optimization work", v3.

Micro-optimize the change_protection functionality and the
change_pte_range() routine.  This set of functions works in an incredibly
tight loop, and even small inefficiencies are incredibly evident when spun
hundreds, thousands or hundreds of thousands of times.

There was an attempt to keep the batching functionality as much as
possible, which introduced some part of the slowness, but not all of it.
Removing it for !arm64 architectures would speed mprotect() up even
further, but could easily pessimize cases where large folios are mapped
(which is not as rare as it seems, particularly when it comes to the page
cache these days).

The micro-benchmark used for the tests was [0] (built with
google/benchmark and g++ -O2 -lbenchmark repro.cpp)

This resulted in the following (first entry is baseline):

---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
mprotect_bench      85967 ns        85967 ns         6935
mprotect_bench      70684 ns        70684 ns         9887

After the patchset we can observe an ~18% speedup in mprotect.  Wonderful
for the elusive mprotect-based workloads!

Testing & more ideas welcome.  I suspect there is plenty of improvement
possible but it would require more time than what I have on my hands right
now.  The entire inlined function (which inlines into change_protection())
is gigantic - I'm not surprised this is so finicky.

Note: per my profiling, the next _big_ bottleneck here is
modify_prot_start_ptes, exactly on the xchg() done by x86.
ptep_get_and_clear() is _expensive_.  I don't think there's a properly
safe way to go about it since we do depend on the D bit quite a lot.  This
might not be such an issue on other architectures.

Luke Yang reported [1]:

: On average, we see improvements ranging from a minimum of 5% to a
: maximum of 55%, with most improvements showing around a 25% speed up in
: the libmicro/mprot_tw4m micro benchmark.

This patch (of 2):

Move softleaf change_pte_range code into a separate function.  This makes
the change_pte_range() function a good bit smaller, and lessens cognitive
load when reading through the function.

Link: https://lore.kernel.org/20260402141628.3367596-1-pfalcato@suse.de
Link: https://lore.kernel.org/20260402141628.3367596-2-pfalcato@suse.de
Link: https://lore.kernel.org/all/aY8-XuFZ7zCvXulB@luyang-thinkpadp1gen7.toromso.csb/
Link: https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933
Link: https://lore.kernel.org/CAL2CeBxT4jtJ+LxYb6=BNxNMGinpgD_HYH5gGxOP-45Q2OncqQ@mail.gmail.com
Signed-off-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Tested-by: Luke Yang <luyang@redhat.com>
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jiri Hladky <jhladky@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agomm: remove '!root_reclaim' checking in should_abort_scan()
Zhaoyang Huang [Thu, 12 Feb 2026 03:21:11 +0000 (11:21 +0800)] 
mm: remove '!root_reclaim' checking in should_abort_scan()

Android systems usually use the memory.reclaim interface to implement user
space memory management, which expects that the requested reclaim target
and the actually reclaimed amount of memory do not diverge by too much.
With the current MGLRU implementation there is, however, no bail out when
the reclaim target is reached, and this can lead to an excessive reclaim
that scales with the reclaim hierarchy size.  For example, we can get
nr_reclaimed=394/nr_to_reclaim=32 for a proactive reclaim under a common
1-N cgroup hierarchy.

This defect arose from the goal of keeping fairness among memcgs: in the
try_to_free_mem_cgroup_pages -> shrink_node_memcgs -> shrink_lruvec ->
lru_gen_shrink_lruvec -> try_to_shrink_lruvec path, the !root_reclaim(sc)
check existed for reclaim fairness, which was necessary before commit
b82b530740b9 ("mm: vmscan: restore incremental cgroup iteration") because
fairness depended on attempted proportional reclaim from every memcg
under the target memcg.  After commit b82b530740b9, however, there is no
longer a need to visit every memcg to ensure fairness.  Let's have
try_to_shrink_lruvec() bail out once nr_reclaimed reaches the target.
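
The resulting check can be sketched in userspace C; the struct and
function names mirror vmscan.c, but the code below is an illustrative
stand-in, not the kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the kernel's scan_control. */
struct scan_control_like {
	unsigned long nr_reclaimed;
	unsigned long nr_to_reclaim;
};

/*
 * With the '!root_reclaim' condition removed, memcg (non-root) reclaim
 * also bails out as soon as the requested target has been met, instead
 * of continuing through the whole hierarchy.
 */
static bool should_abort_scan_sketch(const struct scan_control_like *sc)
{
	return sc->nr_reclaimed >= sc->nr_to_reclaim;
}
```

With the check in place, the nr_reclaimed=394/nr_to_reclaim=32 overshoot
above would be cut short at the first lruvec that crosses the target.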

Link: https://lore.kernel.org/20260318011558.1696310-1-zhaoyang.huang@unisoc.com
Link: https://lore.kernel.org/20260212032111.408865-1-zhaoyang.huang@unisoc.com
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Suggested-by: T.J.Mercier <tjmercier@google.com>
Reviewed-by: T.J. Mercier <tjmercier@google.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Qi Zheng <qi.zheng@linux.dev>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Kairui Song <kasong@tencent.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agomm/sparse: fix comment for section map alignment
Muchun Song [Thu, 2 Apr 2026 10:23:20 +0000 (18:23 +0800)] 
mm/sparse: fix comment for section map alignment

The comment in mmzone.h currently details exhaustive per-architecture
bit-width lists and explains alignment using min(PAGE_SHIFT,
PFN_SECTION_SHIFT).  Such details risk falling out of date over time and
may inadvertently be left un-updated.

We always expect a single section to cover full pages.  Therefore, we can
safely assume that PFN_SECTION_SHIFT is large enough to accommodate
SECTION_MAP_LAST_BIT.  We use BUILD_BUG_ON() to ensure this.

Update the comment to accurately reflect this consensus, making it clear
that we rely on a single section covering full pages.
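
The relationship the BUILD_BUG_ON() encodes can be sketched as follows;
the macro values are hypothetical placeholders, since PFN_SECTION_SHIFT
depends on the architecture and config:

```c
#include <assert.h>

/* Compile-time assertion in the kernel's style: a negative array size
 * makes the build fail when the condition is true. */
#define BUILD_BUG_ON_SKETCH(cond) \
	extern char build_bug_on_sketch[(cond) ? -1 : 1]

/* Hypothetical placeholder values: the number of flag bits packed into
 * the low bits of section_mem_map, and the section's PFN shift. */
#define SECTION_MAP_LAST_BIT_SKETCH 5
#define PFN_SECTION_SHIFT_SKETCH    15

/* A single section covering full pages guarantees enough low-bit
 * alignment to hold the flag bits. */
BUILD_BUG_ON_SKETCH(SECTION_MAP_LAST_BIT_SKETCH > PFN_SECTION_SHIFT_SKETCH);
```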

Link: https://lore.kernel.org/20260402102320.3617578-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Petr Tesarik <ptesarik@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agomm/page_io: use sio->len for PSWPIN accounting in sio_read_complete()
David Carlier [Thu, 2 Apr 2026 06:14:07 +0000 (07:14 +0100)] 
mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete()

sio_read_complete() uses sio->pages to account global PSWPIN vm events,
but sio->pages tracks the number of bvec entries (folios), not base pages.

While large folios cannot currently reach this path (SWP_FS_OPS and
SWP_SYNCHRONOUS_IO are mutually exclusive, and mTHP swap-in allocation is
gated on SWP_SYNCHRONOUS_IO), the accounting is semantically inconsistent
with the per-memcg path which correctly uses folio_nr_pages().

Use sio->len >> PAGE_SHIFT instead, which gives the correct base page
count since sio->len is accumulated via folio_size(folio).
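
The corrected accounting can be illustrated with a userspace sketch;
`struct sio_like` is a hypothetical stand-in for the kernel's swap_iocb:

```c
#include <assert.h>

#define PAGE_SHIFT_SKETCH 12

/* Hypothetical stand-in for the kernel's swap_iocb: 'pages' counts bvec
 * entries (one per folio), while 'len' accumulates folio_size() bytes. */
struct sio_like {
	unsigned long len;
	int pages;
};

/* Per the fix: derive the PSWPIN count from the total byte length, which
 * yields base pages, instead of from the folio/bvec count. */
static unsigned long pswpin_count(const struct sio_like *sio)
{
	return sio->len >> PAGE_SHIFT_SKETCH;
}
```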

Link: https://lore.kernel.org/20260402061408.36119-1-devnexen@gmail.com
Signed-off-by: David Carlier <devnexen@gmail.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: NeilBrown <neil@brown.name>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agoselftests/mm: transhuge_stress: skip the test when thp not available
Chunyu Hu [Thu, 2 Apr 2026 01:45:43 +0000 (09:45 +0800)] 
selftests/mm: transhuge_stress: skip the test when thp not available

The test requires thp; skip it when thp is not available to avoid a
false positive.

Tested with thp disabled kernel.
Before the fix:
  # --------------------------------
  # running ./transhuge-stress -d 20
  # --------------------------------
  # TAP version 13
  # 1..1
  # transhuge-stress: allocate 1453 transhuge pages, using 2907 MiB virtual memory and 11 MiB of ram
  # Bail out! MADV_HUGEPAGE# Planned tests != run tests (1 != 0)
  # # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
  # [FAIL]
  not ok 60 transhuge-stress -d 20 # exit=1

After the fix:
  # --------------------------------
  # running ./transhuge-stress -d 20
  # --------------------------------
  # TAP version 13
  # 1..0 # SKIP Transparent Hugepages not available
  # [SKIP]
  ok 5 transhuge-stress -d 20 # SKIP

Link: https://lore.kernel.org/20260402014543.1671131-7-chuhu@redhat.com
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Li Wang <liwang@redhat.com>
Cc: Nico Pache <npache@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agoselftests/mm: split_huge_page_test: skip the test when thp is not available
Chunyu Hu [Thu, 2 Apr 2026 01:45:42 +0000 (09:45 +0800)] 
selftests/mm: split_huge_page_test: skip the test when thp is not available

When thp is not enabled in some kernel configs, such as the realtime
kernel, the test reports a failure.  Fix the false positive by skipping
the test directly when thp is not enabled.

Tested with thp disabled kernel:
Before The fix:
  # --------------------------------------------------
  # running ./split_huge_page_test /tmp/xfs_dir_Ywup9p
  # --------------------------------------------------
  # TAP version 13
  # Bail out! Reading PMD pagesize failed
  # # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
  # [FAIL]
  not ok 61 split_huge_page_test /tmp/xfs_dir_Ywup9p # exit=1

After the fix:
  # --------------------------------------------------
  # running ./split_huge_page_test /tmp/xfs_dir_YHPUPl
  # --------------------------------------------------
  # TAP version 13
  # 1..0 # SKIP Transparent Hugepages not available
  # [SKIP]
  ok 6 split_huge_page_test /tmp/xfs_dir_YHPUPl # SKIP

Link: https://lore.kernel.org/20260402014543.1671131-6-chuhu@redhat.com
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Li Wang <liwang@redhat.com>
Cc: Nico Pache <npache@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agoselftests/mm/vm_util: robust write_file()
Chunyu Hu [Thu, 2 Apr 2026 01:45:41 +0000 (09:45 +0800)] 
selftests/mm/vm_util: robust write_file()

Add three more checks for buflen and numwritten.  buflen must be at
least two, meaning at least one character plus the terminating NUL.  The
error case is now detected by checking numwritten < 0 instead of
numwritten < 1, and the truncated-write case is checked as well.  The
test will exit if any of these conditions isn't met.

Additionally, add more print information when a write failure occurs or a
truncated write happens, providing clearer diagnostics.
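
A userspace sketch of the hardened helper follows, under the assumption
that the checks look roughly like this; names, messages, and the exact
exit path are illustrative, not the selftest's literal code:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hedged sketch: validate the buffer, write it, and exit loudly on
 * failure or truncation. */
static void write_file_sketch(const char *path, const char *buf,
			      size_t buflen)
{
	ssize_t numwritten;
	int fd;

	/* At least one character plus the terminating NUL. */
	if (buflen < 2) {
		fprintf(stderr, "%s: buffer too short (%zu)\n", path, buflen);
		exit(EXIT_FAILURE);
	}

	fd = open(path, O_WRONLY);
	if (fd < 0) {
		perror(path);
		exit(EXIT_FAILURE);
	}

	numwritten = write(fd, buf, buflen - 1);
	close(fd);

	/* Error case: check < 0 rather than < 1, so truncation is
	 * reported separately from outright failure. */
	if (numwritten < 0) {
		fprintf(stderr, "%s: write failed\n", path);
		exit(EXIT_FAILURE);
	}
	if ((size_t)numwritten != buflen - 1) {
		fprintf(stderr, "%s: truncated write, %zd of %zu bytes\n",
			path, numwritten, buflen - 1);
		exit(EXIT_FAILURE);
	}
}
```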

Link: https://lore.kernel.org/20260402014543.1671131-5-chuhu@redhat.com
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agoselftests/mm: move write_file helper to vm_util
Chunyu Hu [Thu, 2 Apr 2026 01:45:40 +0000 (09:45 +0800)] 
selftests/mm: move write_file helper to vm_util

thp_settings provides a write_file() helper for safely writing to a
file, exiting when the write fails.  It's a very low-level helper and
many sub tests need one, not only the thp tests.

split_huge_page_test also defines a write_file() locally.  The two have
minor differences in return type and the exit API used, and there would
be conflicts if split_huge_page_test wanted to include thp_settings.h
because of the differing prototypes, making it less convenient.

It would be possible to merge the two, although some tests don't use the
kselftest infrastructure.  Using ksft_exit_msg() to exit would also work
in my testing, as the counters are all zero.  The output would look
like:

  TAP version 13
  1..62
  Bail out! /proc/sys/vm/drop_caches1 open failed: No such file or directory
  # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0

So here we just keep the version in split_huge_page_test and move it
into vm_util.  This makes it easier to maintain, and users can include
just vm_util.h when they don't need the thp settings helpers.  Keep the
void return prototype: the function exits on any error, so a return
value is unnecessary, and this simplifies callers like write_num() and
write_string().

Link: https://lore.kernel.org/20260402014543.1671131-4-chuhu@redhat.com
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agoselftests/mm: soft-dirty: skip two tests when thp is not available
Chunyu Hu [Thu, 2 Apr 2026 01:45:39 +0000 (09:45 +0800)] 
selftests/mm: soft-dirty: skip two tests when thp is not available

The test_hugepage test contains two sub tests.  If only one skip is
reported when thp is not available, there will be an error in the log
because the test count doesn't match the test plan.  Change it to skip
two tests by running ksft_test_result_skip() twice in this case.
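
The pattern can be sketched as below; `ksft_test_result_skip_stub` and
`test_hugepage_stub` are hypothetical stand-ins for the kselftest
helpers, showing only why two SKIP lines keep the TAP plan balanced:

```c
#include <assert.h>
#include <stdio.h>

/* Each reported result, including a SKIP, counts against the TAP plan. */
static int results_reported;

static void ksft_test_result_skip_stub(const char *msg)
{
	results_reported++;
	printf("ok %d # SKIP %s\n", results_reported, msg);
}

/* A test function that covers two planned results must report two
 * SKIPs, otherwise "Planned tests != run tests" is flagged. */
static void test_hugepage_stub(int thp_available)
{
	if (!thp_available) {
		ksft_test_result_skip_stub("Test test_hugepage huge page allocation");
		ksft_test_result_skip_stub("Test test_hugepage huge page dirty bit");
		return;
	}
	/* ... the real huge page allocation and dirty bit checks ... */
}
```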

Without the fix (run test on thp disabled kernel):
  ./run_vmtests.sh -t soft_dirty
  # --------------------
  # running ./soft-dirty
  # --------------------
  # TAP version 13
  # 1..19
  # ok 1 Test test_simple
  # ok 2 Test test_vma_reuse dirty bit of allocated page
  # ok 3 Test test_vma_reuse dirty bit of reused address page
  # ok 4 # SKIP Transparent Hugepages not available
  # ok 5 Test test_mprotect-anon dirty bit of new written page
  # ok 6 Test test_mprotect-anon soft-dirty clear after clear_refs
  # ok 7 Test test_mprotect-anon soft-dirty clear after marking RO
  # ok 8 Test test_mprotect-anon soft-dirty clear after marking RW
  # ok 9 Test test_mprotect-anon soft-dirty after rewritten
  # ok 10 Test test_mprotect-file dirty bit of new written page
  # ok 11 Test test_mprotect-file soft-dirty clear after clear_refs
  # ok 12 Test test_mprotect-file soft-dirty clear after marking RO
  # ok 13 Test test_mprotect-file soft-dirty clear after marking RW
  # ok 14 Test test_mprotect-file soft-dirty after rewritten
  # ok 15 Test test_merge-anon soft-dirty after remap merge 1st pg
  # ok 16 Test test_merge-anon soft-dirty after remap merge 2nd pg
  # ok 17 Test test_merge-anon soft-dirty after mprotect merge 1st pg
  # ok 18 Test test_merge-anon soft-dirty after mprotect merge 2nd pg
  # # 1 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
  # # Planned tests != run tests (19 != 18)
  # # Totals: pass:17 fail:0 xfail:0 xpass:0 skip:1 error:0
  # [FAIL]
  not ok 52 soft-dirty # exit=1

With the fix (run test on thp disabled kernel):
  ./run_vmtests.sh -t soft_dirty
  # --------------------
  # running ./soft-dirty
  # --------------------
  # TAP version 13
  # 1..19
  # ok 1 Test test_simple
  # ok 2 Test test_vma_reuse dirty bit of allocated page
  # ok 3 Test test_vma_reuse dirty bit of reused address page
  # # Transparent Hugepages not available
  # ok 4 # SKIP Test test_hugepage huge page allocation
  # ok 5 # SKIP Test test_hugepage huge page dirty bit
  # ok 6 Test test_mprotect-anon dirty bit of new written page
  # ok 7 Test test_mprotect-anon soft-dirty clear after clear_refs
  # ok 8 Test test_mprotect-anon soft-dirty clear after marking RO
  # ok 9 Test test_mprotect-anon soft-dirty clear after marking RW
  # ok 10 Test test_mprotect-anon soft-dirty after rewritten
  # ok 11 Test test_mprotect-file dirty bit of new written page
  # ok 12 Test test_mprotect-file soft-dirty clear after clear_refs
  # ok 13 Test test_mprotect-file soft-dirty clear after marking RO
  # ok 14 Test test_mprotect-file soft-dirty clear after marking RW
  # ok 15 Test test_mprotect-file soft-dirty after rewritten
  # ok 16 Test test_merge-anon soft-dirty after remap merge 1st pg
  # ok 17 Test test_merge-anon soft-dirty after remap merge 2nd pg
  # ok 18 Test test_merge-anon soft-dirty after mprotect merge 1st pg
  # ok 19 Test test_merge-anon soft-dirty after mprotect merge 2nd pg
  # # 2 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
  # # Totals: pass:17 fail:0 xfail:0 xpass:0 skip:2 error:0
  # [PASS]
  ok 1 soft-dirty
  hwpoison_inject
  # SUMMARY: PASS=1 SKIP=0 FAIL=0
  1..1

Link: https://lore.kernel.org/20260402014543.1671131-3-chuhu@redhat.com
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Li Wang <liwang@redhat.com>
Cc: Nico Pache <npache@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agoselftests/mm/guard-regions: skip collapse test when thp not enabled
Chunyu Hu [Thu, 2 Apr 2026 01:45:38 +0000 (09:45 +0800)] 
selftests/mm/guard-regions: skip collapse test when thp not enabled

Patch series "selftests/mm: skip several tests when thp is not available",
v8.

There are several tests that require transparent hugepages; when run on
a thp-disabled kernel such as a realtime kernel, they produce false
negatives.  Mark those tests as skipped when thp is not available.

This patch (of 6):

When thp is not available, just skip the collapse tests to avoid the
false negative.

Without the change, run with a thp disabled kernel:
  ./run_vmtests.sh -t madv_guard -n 1
  <snip/>
  #  RUN           guard_regions.anon.collapse ...
  # guard-regions.c:2217:collapse:Expected madvise(ptr, size, MADV_NOHUGEPAGE) (-1) == 0 (0)
  # collapse: Test terminated by assertion
  #          FAIL  guard_regions.anon.collapse
  not ok 2 guard_regions.anon.collapse
  <snip/>
  #  RUN           guard_regions.shmem.collapse ...
  # guard-regions.c:2217:collapse:Expected madvise(ptr, size, MADV_NOHUGEPAGE) (-1) == 0 (0)
  # collapse: Test terminated by assertion
  #          FAIL  guard_regions.shmem.collapse
  not ok 32 guard_regions.shmem.collapse
  <snip/>
  #  RUN           guard_regions.file.collapse ...
  # guard-regions.c:2217:collapse:Expected madvise(ptr, size, MADV_NOHUGEPAGE) (-1) == 0 (0)
  # collapse: Test terminated by assertion
  #          FAIL  guard_regions.file.collapse
  not ok 62 guard_regions.file.collapse
  <snip/>
  # FAILED: 87 / 90 tests passed.
  # 17 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
  # Totals: pass:70 fail:3 xfail:0 xpass:0 skip:17 error:0

With this change, run with thp disabled kernel:
  ./run_vmtests.sh -t madv_guard -n 1
  <snip/>
  #  RUN           guard_regions.anon.collapse ...
  #      SKIP      Transparent Hugepages not available
  #            OK  guard_regions.anon.collapse
  ok 2 guard_regions.anon.collapse # SKIP Transparent Hugepages not available
  <snip/>
  #  RUN           guard_regions.file.collapse ...
  #      SKIP      Transparent Hugepages not available
  #            OK  guard_regions.file.collapse
  ok 62 guard_regions.file.collapse # SKIP Transparent Hugepages not available
  <snip/>
  #  RUN           guard_regions.shmem.collapse ...
  #      SKIP      Transparent Hugepages not available
  #            OK  guard_regions.shmem.collapse
  ok 32 guard_regions.shmem.collapse # SKIP Transparent Hugepages not available
  <snip/>
  # PASSED: 90 / 90 tests passed.
  # 20 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
  # Totals: pass:70 fail:0 xfail:0 xpass:0 skip:20 error:0

Link: https://lore.kernel.org/20260402014543.1671131-1-chuhu@redhat.com
Link: https://lore.kernel.org/20260402014543.1671131-2-chuhu@redhat.com
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Li Wang <liwang@redhat.com>
Cc: Nico Pache <npache@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agouserfaultfd: mfill_atomic(): remove retry logic
Mike Rapoport (Microsoft) [Thu, 2 Apr 2026 04:11:52 +0000 (07:11 +0300)] 
userfaultfd: mfill_atomic(): remove retry logic

Since __mfill_atomic_pte() handles the retry for both anonymous and
shmem, there is no need to retry copying the data from userspace in the
loop in mfill_atomic().

Drop the retry logic from mfill_atomic().

[rppt@kernel.org: remove safety measure of not returning ENOENT from _copy]
Link: https://lore.kernel.org/ac5zcDUY8CFHr6Lw@kernel.org
Link: https://lore.kernel.org/20260402041156.1377214-12-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrei Vagin <avagin@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand (Arm) <david@kernel.org>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nikita Kalyazin <kalyazin@amazon.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Carlier <devnexen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agoshmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
Mike Rapoport (Microsoft) [Thu, 2 Apr 2026 04:11:51 +0000 (07:11 +0300)] 
shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops

Add filemap_add() and filemap_remove() methods to vm_uffd_ops and use them
in __mfill_atomic_pte() to add shmem folios to page cache and remove them
in case of error.

Implement these methods in shmem along with vm_uffd_ops->alloc_folio() and
drop shmem_mfill_atomic_pte().

Since userfaultfd no longer references any functions from shmem, drop
the include of linux/shmem_fs.h from mm/userfaultfd.c.

mfill_atomic_install_pte() is not used anywhere outside of
mm/userfaultfd.c, so make it static.

Link: https://lore.kernel.org/20260402041156.1377214-11-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: James Houghton <jthoughton@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrei Vagin <avagin@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand (Arm) <david@kernel.org>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nikita Kalyazin <kalyazin@amazon.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Carlier <devnexen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agouserfaultfd: introduce vm_uffd_ops->alloc_folio()
Mike Rapoport (Microsoft) [Thu, 2 Apr 2026 04:11:50 +0000 (07:11 +0300)] 
userfaultfd: introduce vm_uffd_ops->alloc_folio()

and use it to refactor mfill_atomic_pte_zeroed_folio() and
mfill_atomic_pte_copy().

mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy() perform
almost identical actions:
* allocate a folio
* update folio contents (either copy from userspace or fill with zeros)
* update page tables with the new folio

Split out a __mfill_atomic_pte() helper that handles both cases and uses
the newly introduced vm_uffd_ops->alloc_folio() to allocate the folio.

Pass the ops structure from the callers to __mfill_atomic_pte() to later
allow using anon_uffd_ops for MAP_PRIVATE mappings of file-backed VMAs.

Note, that the new ops method is called alloc_folio() rather than
folio_alloc() to avoid clash with alloc_tag macro folio_alloc().

Link: https://lore.kernel.org/20260402041156.1377214-10-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: James Houghton <jthoughton@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrei Vagin <avagin@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand (Arm) <david@kernel.org>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nikita Kalyazin <kalyazin@amazon.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Carlier <devnexen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agoshmem, userfaultfd: use a VMA callback to handle UFFDIO_CONTINUE
Mike Rapoport (Microsoft) [Thu, 2 Apr 2026 04:11:49 +0000 (07:11 +0300)] 
shmem, userfaultfd: use a VMA callback to handle UFFDIO_CONTINUE

When userspace resolves a page fault in a shmem VMA with UFFDIO_CONTINUE
it needs to get a folio that already exists in the pagecache backing that
VMA.

Instead of using shmem_get_folio() for that, add a get_folio_noalloc()
method to 'struct vm_uffd_ops' that will return a folio if it exists in
the VMA's pagecache at the given pgoff.

Implement get_folio_noalloc() method for shmem and slightly refactor
userfaultfd's mfill_get_vma() and mfill_atomic_pte_continue() to support
this new API.

Link: https://lore.kernel.org/20260402041156.1377214-9-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: James Houghton <jthoughton@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrei Vagin <avagin@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand (Arm) <david@kernel.org>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nikita Kalyazin <kalyazin@amazon.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Carlier <devnexen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agouserfaultfd: introduce vm_uffd_ops
Mike Rapoport (Microsoft) [Thu, 2 Apr 2026 04:11:48 +0000 (07:11 +0300)] 
userfaultfd: introduce vm_uffd_ops

Current userfaultfd implementation works only with memory managed by core
MM: anonymous, shmem and hugetlb.

First, there is no fundamental reason to limit userfaultfd support only to
the core memory types and userfaults can be handled similarly to regular
page faults provided a VMA owner implements appropriate callbacks.

Second, historically various code paths were conditioned on
vma_is_anonymous(), vma_is_shmem() and is_vm_hugetlb_page() and some of
these conditions can be expressed as operations implemented by a
particular memory type.

Introduce vm_uffd_ops extension to vm_operations_struct that will delegate
memory type specific operations to a VMA owner.

Operations for anonymous memory are handled internally in userfaultfd
using anon_uffd_ops, which is implicitly assigned to anonymous VMAs.

Start with a single operation, ->can_userfault() that will verify that a
VMA meets requirements for userfaultfd support at registration time.

Implement that method for anonymous, shmem and hugetlb and move relevant
parts of vma_can_userfault() into the new callbacks.
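
A rough userspace sketch of the shape of such an ops table follows; all
type, field, and flag names here are assumptions for illustration, not
the patch's actual definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for a VMA. */
struct vma_like {
	unsigned long vm_flags;
};

/* A per-memory-type ops table lets the userfaultfd core delegate the
 * registration-time check instead of testing vma_is_anonymous() and
 * friends inline. */
struct vm_uffd_ops_like {
	bool (*can_userfault)(struct vma_like *vma, unsigned long vm_flags);
};

#define VM_DROPPABLE_SKETCH (1UL << 0)

static bool anon_can_userfault_sketch(struct vma_like *vma,
				      unsigned long vm_flags)
{
	(void)vm_flags;
	/* e.g. droppable anonymous mappings cannot be registered. */
	return !(vma->vm_flags & VM_DROPPABLE_SKETCH);
}

/* Implicitly used for anonymous VMAs, which have no vm_ops of their own. */
static const struct vm_uffd_ops_like anon_uffd_ops_sketch = {
	.can_userfault = anon_can_userfault_sketch,
};
```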

[rppt@kernel.org: relocate VM_DROPPABLE test, per Tal]
Link: https://lore.kernel.org/adffgfM5ANxtPIEF@kernel.org
Link: https://lore.kernel.org/20260402041156.1377214-8-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrei Vagin <avagin@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand (Arm) <david@kernel.org>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nikita Kalyazin <kalyazin@amazon.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Carlier <devnexen@gmail.com>
Cc: Tal Zussman <tz2294@columbia.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agouserfaultfd: move vma_can_userfault out of line
Mike Rapoport (Microsoft) [Thu, 2 Apr 2026 04:11:47 +0000 (07:11 +0300)] 
userfaultfd: move vma_can_userfault out of line

vma_can_userfault() has grown pretty big and it's not called on
performance critical path.

Move it out of line.

No functional changes.

Link: https://lore.kernel.org/20260402041156.1377214-7-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrei Vagin <avagin@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nikita Kalyazin <kalyazin@amazon.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Carlier <devnexen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agouserfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy()
Mike Rapoport (Microsoft) [Thu, 2 Apr 2026 04:11:46 +0000 (07:11 +0300)] 
userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy()

Implementation of UFFDIO_COPY for anonymous memory might fail to copy data
from userspace buffer when the destination VMA is locked (either with
mm_lock or with per-VMA lock).

In that case, mfill_atomic() releases the locks, retries copying the data
with locks dropped and then re-locks the destination VMA and
re-establishes PMD.

Since this retry-reget dance is only relevant for UFFDIO_COPY and it never
happens for other UFFDIO_ operations, make it a part of
mfill_atomic_pte_copy() that actually implements UFFDIO_COPY for anonymous
memory.

As a temporary safety measure to avoid breaking bisection,
mfill_atomic_pte_copy() makes sure to never return -ENOENT so that the
loop in mfill_atomic() won't retry copying outside of mmap_lock.  This
is removed once the shmem implementation is updated and the loop in
mfill_atomic() is adjusted.

[akpm@linux-foundation.org: update mfill_copy_folio_retry()]
Link: https://lore.kernel.org/20260316173829.1126728-1-avagin@google.com
Link: https://lore.kernel.org/20260306171815.3160826-6-rppt@kernel.org
Link: https://lore.kernel.org/20260402041156.1377214-6-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand (Arm) <david@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nikita Kalyazin <kalyazin@amazon.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Carlier <devnexen@gmail.com>
Cc: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agouserfaultfd: introduce mfill_get_vma() and mfill_put_vma()
Mike Rapoport (Microsoft) [Thu, 2 Apr 2026 04:11:45 +0000 (07:11 +0300)] 
userfaultfd: introduce mfill_get_vma() and mfill_put_vma()

Split the code that finds, locks and verifies VMA from mfill_atomic() into
a helper function.

This function will be used later during refactoring of
mfill_atomic_pte_copy().

Add a counterpart mfill_put_vma() helper that unlocks the VMA and releases
map_changing_lock.

[avagin@google.com: fix lock leak in mfill_get_vma()]
Link: https://lore.kernel.org/20260316173829.1126728-1-avagin@google.com
Link: https://lore.kernel.org/20260402041156.1377214-5-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Andrei Vagin <avagin@google.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand (Arm) <david@kernel.org>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nikita Kalyazin <kalyazin@amazon.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Carlier <devnexen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agouserfaultfd: introduce mfill_establish_pmd() helper
Mike Rapoport (Microsoft) [Thu, 2 Apr 2026 04:11:44 +0000 (07:11 +0300)] 
userfaultfd: introduce mfill_establish_pmd() helper

There is a lengthy code chunk in mfill_atomic() that establishes the PMD
for UFFDIO operations.  This code may be called twice: the first time
when the copy is performed with VMA/mm locks held, and again when the
copy is retried with the locks dropped.

Move the code that establishes a PMD into a helper function so it can be
reused later during refactoring of mfill_atomic_pte_copy().

Link: https://lore.kernel.org/20260402041156.1377214-4-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrei Vagin <avagin@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand (Arm) <david@kernel.org>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nikita Kalyazin <kalyazin@amazon.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Carlier <devnexen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agouserfaultfd: introduce struct mfill_state
Mike Rapoport (Microsoft) [Thu, 2 Apr 2026 04:11:43 +0000 (07:11 +0300)] 
userfaultfd: introduce struct mfill_state

mfill_atomic() passes a lot of parameters down to its callees.

Aggregate them all into mfill_state structure and pass this structure to
functions that implement various UFFDIO_ commands.

Tracking the state in a structure will allow moving the code that retries
copying of data for UFFDIO_COPY into mfill_atomic_pte_copy() and make the
loop in mfill_atomic() identical for all UFFDIO operations on PTE-mapped
memory.
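
The aggregation pattern the patch applies can be sketched as below; the struct and field names here are illustrative stand-ins, not the actual mfill_state layout: callees take one state pointer rather than a long scalar argument list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical aggregate of the parameters mfill_atomic() used to
 * pass individually to its callees. */
struct mfill_state_demo {
	unsigned long dst_addr;	/* destination being filled */
	unsigned long src_addr;	/* source of the copy */
	size_t len;		/* remaining length */
	int flags;		/* UFFDIO_* mode flags */
};

/* A callee advances the whole state in one place instead of each
 * caller updating four scalars. */
size_t mfill_advance(struct mfill_state_demo *st, size_t step)
{
	st->dst_addr += step;
	st->src_addr += step;
	st->len -= step;
	return st->len;
}
```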

The mfill_state definition is deliberately local to mm/userfaultfd.c,
hence shmem_mfill_atomic_pte() is not updated.

[harry.yoo@oracle.com: properly initialize mfill_state.len to fix
                       folio_add_new_anon_rmap() WARN]
Link: https://lore.kernel.org/abehBY7QakYF9bK4@hyeyoo
Link: https://lore.kernel.org/20260402041156.1377214-3-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrei Vagin <avagin@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nikita Kalyazin <kalyazin@amazon.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Carlier <devnexen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agouserfaultfd: introduce mfill_copy_folio_locked() helper
Mike Rapoport (Microsoft) [Thu, 2 Apr 2026 04:11:42 +0000 (07:11 +0300)] 
userfaultfd: introduce mfill_copy_folio_locked() helper

Patch series "mm, kvm: allow uffd support in guest_memfd", v4.

These patches enable support for userfaultfd in guest_memfd.

As the groundwork I refactored userfaultfd handling of PTE-based memory
types (anonymous and shmem) and converted them to use vm_uffd_ops for
allocating a folio or getting an existing folio from the page cache.
shmem also implements callbacks that add a folio to the page cache after
the data passed in UFFDIO_COPY was copied and remove the folio from the
page cache if page table update fails.

In order for guest_memfd to notify userspace about page faults, there are
new VM_FAULT_UFFD_MINOR and VM_FAULT_UFFD_MISSING codes that a ->fault()
handler can return to inform the page fault handler that it needs to call
handle_userfault() to complete the fault.

Nikita helped to plumb these new goodies into guest_memfd and provided
basic tests to verify that guest_memfd works with userfaultfd.  The
handling of UFFDIO_MISSING in guest_memfd requires the ability to remove
a folio from the page cache; the best way I could find was exporting
filemap_remove_folio() to KVM.

I deliberately left hugetlb out, at least for the most part.  hugetlb
handles VMA acquisition and, more importantly, establishment of the
parent page table entry differently than PTE-based memory types.  This is
a different abstraction level than what vm_uffd_ops provides, and people
objected to exposing such low-level APIs as a part of VMA operations.

Also, refactoring hugetlb is not needed to enable uffd in guest_memfd,
and I prefer to delay it until the dust settles after the changes in this
set.

This patch (of 4):

Split copying of data when locks held from mfill_atomic_pte_copy() into a
helper function mfill_copy_folio_locked().

This improves code readability and makes the complex
mfill_atomic_pte_copy() function easier to comprehend.

No functional change.

Link: https://lore.kernel.org/20260402041156.1377214-1-rppt@kernel.org
Link: https://lore.kernel.org/20260402041156.1377214-2-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrei Vagin <avagin@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Nikita Kalyazin <kalyazin@amazon.com>
Cc: David Carlier <devnexen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agomm/memfd_luo: remove folio from page cache when accounting fails
Chenghao Duan [Thu, 26 Mar 2026 08:47:26 +0000 (16:47 +0800)] 
mm/memfd_luo: remove folio from page cache when accounting fails

In memfd_luo_retrieve_folios(), when shmem_inode_acct_blocks() fails
after successfully adding the folio to the page cache, the code jumps
to unlock_folio without removing the folio from the page cache.

While the folio eventually will be freed when the file is released by
memfd_luo_retrieve(), it is a good idea to directly remove a folio that
was not fully added to the file.  This avoids the possibility of
accounting mismatches in shmem or filemap core.

Fix by adding a remove_from_cache label that calls
filemap_remove_folio() before unlocking, matching the error handling
pattern in shmem_alloc_and_add_folio().
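
The label-based unwinding pattern the fix adds can be sketched generically; malloc()/free() here are illustrative stand-ins for the folio allocation and the page-cache add/remove in the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* On a late failure, jump to a label that undoes the earlier step
 * before the common exit path runs. */
int demo_retrieve(bool acct_fails)
{
	int err = 0;
	char *folio = malloc(64);	/* "allocate folio" */

	if (!folio)
		return -1;
	/* "add to page cache" always succeeds in this sketch */
	if (acct_fails) {		/* "accounting" step fails */
		err = -1;
		goto remove_from_cache;
	}
	free(folio);
	return 0;

remove_from_cache:
	/* undo the page-cache add before unlocking/releasing */
	free(folio);
	return err;
}
```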

This issue was identified by AI review:
https://sashiko.dev/#/patchset/20260323110747.193569-1-duanchenghao@kylinos.cn

[pratyush@kernel.org: changelog alterations]
Link: https://lore.kernel.org/2vxzzf3lfujq.fsf@kernel.org
Link: https://lore.kernel.org/20260326084727.118437-7-duanchenghao@kylinos.cn
Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Haoran Jiang <jianghaoran@kylinos.cn>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agomm/memfd_luo: fix physical address conversion in put_folios cleanup
Chenghao Duan [Thu, 26 Mar 2026 08:47:25 +0000 (16:47 +0800)] 
mm/memfd_luo: fix physical address conversion in put_folios cleanup

In memfd_luo_retrieve_folios()'s put_folios cleanup path:

1. kho_restore_folio() expects a phys_addr_t (physical address) but
   receives a raw PFN (pfolio->pfn). This causes kho_restore_page() to
   check the wrong physical address (pfn << PAGE_SHIFT instead of the
   actual physical address).

2. This loop lacks the !pfolio->pfn check that exists in the main
   retrieval loop and memfd_luo_discard_folios(), which could
   incorrectly process sparse file holes where pfn=0.

Fix by converting PFN to physical address with PFN_PHYS() and adding
the !pfolio->pfn check, matching the pattern used elsewhere in this file.
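
The conversion in question is a simple shift; the macro below is a userspace sketch (the names and a 4 KiB page, i.e. PAGE_SHIFT of 12, are assumptions for illustration, not the kernel definitions). Passing the raw PFN where a physical address is expected is off by exactly this shift.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;

#define DEMO_PAGE_SHIFT 12	/* assuming 4 KiB pages */

/* PFN -> physical address, as PFN_PHYS() does in the kernel. */
#define DEMO_PFN_PHYS(pfn) ((phys_addr_t)(pfn) << DEMO_PAGE_SHIFT)
```

A pfn of 0 would additionally need to be skipped first, mirroring the !pfolio->pfn check for sparse file holes.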

This issue was identified by AI review:
https://sashiko.dev/#/patchset/20260323110747.193569-1-duanchenghao@kylinos.cn

Link: https://lore.kernel.org/20260326084727.118437-6-duanchenghao@kylinos.cn
Fixes: b3749f174d68 ("mm: memfd_luo: allow preserving memfd")
Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Haoran Jiang <jianghaoran@kylinos.cn>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agomm/memfd_luo: use i_size_write() to set inode size during retrieve
Chenghao Duan [Thu, 26 Mar 2026 08:47:24 +0000 (16:47 +0800)] 
mm/memfd_luo: use i_size_write() to set inode size during retrieve

Use i_size_write() instead of directly assigning to inode->i_size when
restoring the memfd size in memfd_luo_retrieve(), to keep code
consistency.

No functional change intended.

Link: https://lore.kernel.org/20260326084727.118437-5-duanchenghao@kylinos.cn
Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Haoran Jiang <jianghaoran@kylinos.cn>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agomm/memfd_luo: remove unnecessary memset in zero-size memfd path
Chenghao Duan [Thu, 26 Mar 2026 08:47:23 +0000 (16:47 +0800)] 
mm/memfd_luo: remove unnecessary memset in zero-size memfd path

The memset(kho_vmalloc, 0, sizeof(*kho_vmalloc)) call in the zero-size
file handling path is unnecessary because the allocation of the ser
structure already uses the __GFP_ZERO flag, ensuring the memory is already
zero-initialized.
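
The userspace analogue of this is calloc(): like a __GFP_ZERO allocation, it already returns zeroed memory, so a follow-up memset(p, 0, n) does no additional work. A small check, with illustrative names:

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 1 if a fresh calloc() allocation of n bytes is entirely
 * zero, 0 if any byte is set, -1 on allocation failure. */
int demo_is_zeroed(size_t n)
{
	unsigned char *p = calloc(1, n);

	if (!p)
		return -1;
	for (size_t i = 0; i < n; i++) {
		if (p[i]) {
			free(p);
			return 0;
		}
	}
	free(p);
	return 1;
}
```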

Link: https://lore.kernel.org/20260326084727.118437-4-duanchenghao@kylinos.cn
Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Haoran Jiang <jianghaoran@kylinos.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agomm/memfd_luo: optimize shmem_recalc_inode calls in retrieve path
Chenghao Duan [Thu, 26 Mar 2026 08:47:22 +0000 (16:47 +0800)] 
mm/memfd_luo: optimize shmem_recalc_inode calls in retrieve path

Move shmem_recalc_inode() out of the loop in memfd_luo_retrieve_folios()
to improve performance when restoring large memfds.

Currently, shmem_recalc_inode() is called for each folio during restore,
which is O(n) expensive operations.  This patch collects the number of
successfully added folios and calls shmem_recalc_inode() once after the
loop completes, reducing complexity to O(1).

Additionally, fix the error path to also call shmem_recalc_inode() for the
folios that were successfully added before the error occurred.
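
The batching change can be sketched as follows; recalc() is an illustrative stand-in for shmem_recalc_inode(), and the counter exists only to make the O(n) vs O(1) call count visible.

```c
#include <assert.h>

int recalc_calls;

/* Stand-in for the expensive shared-state update. */
static void recalc(long delta)
{
	(void)delta;
	recalc_calls++;
}

/* Count successes locally in the loop, then update shared accounting
 * once after the loop completes. */
long demo_restore_batched(int nr_folios)
{
	long restored = 0;

	recalc_calls = 0;
	for (int i = 0; i < nr_folios; i++)
		restored++;	/* "add folio to page cache" */
	recalc(restored);	/* one call, not nr_folios calls */
	return restored;
}
```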

Link: https://lore.kernel.org/20260326084727.118437-3-duanchenghao@kylinos.cn
Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Haoran Jiang <jianghaoran@kylinos.cn>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agomm/memfd: use folio_nr_pages() for shmem inode accounting
Chenghao Duan [Thu, 26 Mar 2026 08:47:21 +0000 (16:47 +0800)] 
mm/memfd: use folio_nr_pages() for shmem inode accounting

Patch series "Modify memfd_luo code", v3.

I found several modifiable points while reading the code.

This patch (of 6):

memfd_luo_retrieve_folios() called shmem_inode_acct_blocks() and
shmem_recalc_inode() with hardcoded 1 instead of the actual folio page
count.  memfd may use large folios (THP/hugepages), causing quota/limit
under-accounting and incorrect stat output.

Fix by using folio_nr_pages(folio) for both functions.
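
The size of the under-accounting is easy to see in a sketch; assuming 4 KiB base pages, a 2 MiB THP folio spans 512 pages, so charging a hardcoded 1 per folio undercounts by 511 pages per large folio. Names below are illustrative.

```c
#include <assert.h>

#define DEMO_PAGES_PER_THP 512	/* 2 MiB / 4 KiB, an assumed config */

/* Charge pages_per_folio (what folio_nr_pages() would return) for each
 * folio, instead of a hardcoded 1. */
long demo_charge(int nr_folios, int pages_per_folio)
{
	long charged = 0;

	for (int i = 0; i < nr_folios; i++)
		charged += pages_per_folio;
	return charged;
}
```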

Issue found by AI review and suggested by Pratyush Yadav <pratyush@kernel.org>.
https://sashiko.dev/#/patchset/20260319012845.29570-1-duanchenghao%40kylinos.cn

Link: https://lore.kernel.org/20260326084727.118437-1-duanchenghao@kylinos.cn
Link: https://lore.kernel.org/20260326084727.118437-2-duanchenghao@kylinos.cn
Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn>
Suggested-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Haoran Jiang <jianghaoran@kylinos.cn>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agomm/sparse: fix preinited section_mem_map clobbering on failure path
Muchun Song [Tue, 31 Mar 2026 11:37:24 +0000 (19:37 +0800)] 
mm/sparse: fix preinited section_mem_map clobbering on failure path

sparse_init_nid() is careful to leave alone every section whose vmemmap
has already been set up by sparse_vmemmap_init_nid_early(); it only clears
section_mem_map for the rest:

        if (!preinited_vmemmap_section(ms))
                ms->section_mem_map = 0;

A leftover line after that conditional block

        ms->section_mem_map = 0;

was supposed to be deleted but was missed in the failure path, causing the
field to be overwritten for all sections when memory allocation fails,
effectively destroying the pre-initialization check.

Drop the stray assignment so that preinited sections retain their
already valid state.

With the stray assignment, those pre-inited sections (backing HugeTLB
pages) are not activated.  However, such failures are extremely rare, so
I don't see any major userspace issues.

Link: https://lore.kernel.org/20260331113724.2080833-1-songmuchun@bytedance.com
Fixes: d65917c42373 ("mm/sparse: allow for alternate vmemmap section init at boot")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Donet Tom <donettom@linux.ibm.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agozram: do not forget to endio for partial discard requests
Sergey Senozhatsky [Tue, 31 Mar 2026 07:42:44 +0000 (16:42 +0900)] 
zram: do not forget to endio for partial discard requests

As reported by Qu Wenruo and Avinesh Kumar, the following

 getconf PAGESIZE
 65536
 blkdiscard -p 4k /dev/zram0

takes literally forever to complete.  zram doesn't support partial
discards and just returns immediately w/o doing any discard work in such
cases.  The problem is that we forget to endio on our way out, so
blkdiscard sleeps forever in submit_bio_wait().  Fix this by jumping to
end_bio label, which does bio_endio().

Link: https://lore.kernel.org/20260331074255.777019-1-senozhatsky@chromium.org
Fixes: 0120dd6e4e20 ("zram: make zram_bio_discard more self-contained")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reported-by: Qu Wenruo <wqu@suse.com>
Closes: https://lore.kernel.org/linux-block/92361cd3-fb8b-482e-bc89-15ff1acb9a59@suse.com
Tested-by: Qu Wenruo <wqu@suse.com>
Reported-by: Avinesh Kumar <avinesh.kumar@suse.com>
Closes: https://bugzilla.suse.com/show_bug.cgi?id=1256530
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agolib: test_hmm: implement a device release method
Alistair Popple [Tue, 31 Mar 2026 06:34:45 +0000 (17:34 +1100)] 
lib: test_hmm: implement a device release method

Unloading the HMM test module produces the following warning:

[ 3782.224783] ------------[ cut here ]------------
[ 3782.226323] Device 'hmm_dmirror0' does not have a release() function, it is broken and must be fixed. See Documentation/core-api/kobject.rst.
[ 3782.230570] WARNING: drivers/base/core.c:2567 at device_release+0x185/0x210, CPU#20: rmmod/1924
[ 3782.233949] Modules linked in: test_hmm(-) nvidia_uvm(O) nvidia(O)
[ 3782.236321] CPU: 20 UID: 0 PID: 1924 Comm: rmmod Tainted: G           O        7.0.0-rc1+ #374 PREEMPT(full)
[ 3782.240226] Tainted: [O]=OOT_MODULE
[ 3782.241639] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[ 3782.246193] RIP: 0010:device_release+0x185/0x210
[ 3782.247860] Code: 00 00 fc ff df 48 8d 7b 50 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 86 00 00 00 48 8b 73 50 48 85 f6 74 11 48 8d 3d db 25 29 03 <67> 48 0f b9 3a e9 0d ff ff ff 48 b8 00 00 00 00 00 fc ff df 48 89
[ 3782.254211] RSP: 0018:ffff888126577d98 EFLAGS: 00010246
[ 3782.256054] RAX: dffffc0000000000 RBX: ffffffffc2b70310 RCX: ffffffff8fe61ba1
[ 3782.258512] RDX: 1ffffffff856e062 RSI: ffff88811341eea0 RDI: ffffffff91bbacb0
[ 3782.261041] RBP: ffff888111475000 R08: 0000000000000001 R09: fffffbfff856e069
[ 3782.263471] R10: ffffffffc2b7034b R11: 00000000ffffffff R12: 0000000000000000
[ 3782.265983] R13: dffffc0000000000 R14: ffff88811341eea0 R15: 0000000000000000
[ 3782.268443] FS:  00007fd5a3689040(0000) GS:ffff88842c8d0000(0000) knlGS:0000000000000000
[ 3782.271236] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3782.273251] CR2: 00007fd5a36d2c10 CR3: 00000001242b8000 CR4: 00000000000006f0
[ 3782.275362] Call Trace:
[ 3782.276071]  <TASK>
[ 3782.276678]  kobject_put+0x146/0x270
[ 3782.277731]  hmm_dmirror_exit+0x7a/0x130 [test_hmm]
[ 3782.279135]  __do_sys_delete_module+0x341/0x510
[ 3782.280438]  ? module_flags+0x300/0x300
[ 3782.281547]  do_syscall_64+0x111/0x670
[ 3782.282620]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 3782.284091] RIP: 0033:0x7fd5a3793b37
[ 3782.285303] Code: 73 01 c3 48 8b 0d c9 82 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 82 0c 00 f7 d8 64 89 01 48
[ 3782.290708] RSP: 002b:00007ffd68b7dc68 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 3782.292817] RAX: ffffffffffffffda RBX: 000055e3c0d1c770 RCX: 00007fd5a3793b37
[ 3782.294735] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055e3c0d1c7d8
[ 3782.296661] RBP: 0000000000000000 R08: 1999999999999999 R09: 0000000000000000
[ 3782.298622] R10: 00007fd5a3806ac0 R11: 0000000000000206 R12: 00007ffd68b7deb0
[ 3782.300576] R13: 00007ffd68b7e781 R14: 000055e3c0d1b2a0 R15: 00007ffd68b7deb8
[ 3782.301963]  </TASK>
[ 3782.302371] irq event stamp: 5019
[ 3782.302987] hardirqs last  enabled at (5027): [<ffffffff8cf1f062>] __up_console_sem+0x52/0x60
[ 3782.304507] hardirqs last disabled at (5036): [<ffffffff8cf1f047>] __up_console_sem+0x37/0x60
[ 3782.306086] softirqs last  enabled at (4940): [<ffffffff8cd9a4b0>] __irq_exit_rcu+0xc0/0xf0
[ 3782.307567] softirqs last disabled at (4929): [<ffffffff8cd9a4b0>] __irq_exit_rcu+0xc0/0xf0
[ 3782.309105] ---[ end trace 0000000000000000 ]---

This is because the test module doesn't have a device.release method.  In
this case one probably isn't needed for correctness - the device structs
are in a static array so don't need freeing when the final reference goes
away.

However some device state is freed on exit, so to ensure this happens at
the right time and to silence the warning move the deinitialisation to a
release method and assign that as the device release callback.  Whilst
here also fix a minor error handling bug where cdev_device_del() wasn't
being called if allocation failed.

Link: https://lore.kernel.org/20260331063445.3551404-4-apopple@nvidia.com
Fixes: 6a760f58c792 ("mm/hmm/test: use char dev with struct device to get device node")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Balbir Singh <balbirs@nvidia.com>
Tested-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agoselftests/mm: hmm-tests: don't hardcode THP size to 2MB
Alistair Popple [Tue, 31 Mar 2026 06:34:44 +0000 (17:34 +1100)] 
selftests/mm: hmm-tests: don't hardcode THP size to 2MB

Several HMM tests hardcode TWOMEG as the THP size.  This is wrong on
architectures where the PMD size is not 2MB, such as arm64 with 64K base
pages, where PMD-sized THP is 512MB.  Fix this by using
read_pmd_pagesize() from vm_util instead.

While here also replace the custom file_read_ulong() helper used to
parse the default hugetlbfs page size from /proc/meminfo with the
existing default_huge_page_size() from vm_util.

Link: https://lore.kernel.org/20260331063445.3551404-3-apopple@nvidia.com
Link: https://lore.kernel.org/linux-mm/8bd0396a-8997-4d2e-a13f-5aac033083d7@linux.dev/
Fixes: fee9f6d1b8df ("mm/hmm/test: add selftests for HMM")
Fixes: 519071529d2a ("selftests/mm/hmm-tests: new tests for zone device THP migration")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reported-by: Zenghui Yu <zenghui.yu@linux.dev>
Closes: https://lore.kernel.org/linux-mm/8bd0396a-8997-4d2e-a13f-5aac033083d7@linux.dev/
Reviewed-by: Balbir Singh <balbirs@nvidia.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agolib: test_hmm: evict device pages on file close to avoid use-after-free
Alistair Popple [Tue, 31 Mar 2026 06:34:43 +0000 (17:34 +1100)] 
lib: test_hmm: evict device pages on file close to avoid use-after-free

Patch series "Minor hmm_test fixes and cleanups".

Two bugfixes and a cleanup for the HMM kernel selftests.  These were
mostly reported by Zenghui Yu, with special thanks to Lorenzo for
analysing and pointing out the problems.

This patch (of 3):

When dmirror_fops_release() is called it frees the dmirror struct but
doesn't migrate device private pages back to system memory first.  This
leaves those pages with a dangling zone_device_data pointer to the freed
dmirror.

If a subsequent fault occurs on those pages (eg.  during coredump) the
dmirror_devmem_fault() callback dereferences the stale pointer causing a
kernel panic.  This was reported [1] when running mm/ksft_hmm.sh on arm64,
where a test failure triggered SIGABRT and the resulting coredump walked
the VMAs faulting in the stale device private pages.

Fix this by calling dmirror_device_evict_chunk() for each devmem chunk in
dmirror_fops_release() to migrate all device private pages back to system
memory before freeing the dmirror struct.  The function is moved earlier
in the file to avoid a forward declaration.

Link: https://lore.kernel.org/20260331063445.3551404-1-apopple@nvidia.com
Link: https://lore.kernel.org/20260331063445.3551404-2-apopple@nvidia.com
Fixes: b2ef9f5a5cb3 ("mm/hmm/test: add selftest driver for HMM")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reported-by: Zenghui Yu <zenghui.yu@linux.dev>
Closes: https://lore.kernel.org/linux-mm/8bd0396a-8997-4d2e-a13f-5aac033083d7@linux.dev/
Reviewed-by: Balbir Singh <balbirs@nvidia.com>
Tested-by: Zenghui Yu <zenghui.yu@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zenghui Yu <zenghui.yu@linux.dev>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agoselftests/mm: skip hugetlb_dio tests when DIO alignment is incompatible
Li Wang [Wed, 1 Apr 2026 09:05:20 +0000 (17:05 +0800)] 
selftests/mm: skip hugetlb_dio tests when DIO alignment is incompatible

hugetlb_dio test uses sub-page offsets (pagesize / 2) to verify that
hugepages used as DIO user buffers are correctly unpinned at completion.

However, on filesystems with a logical block size larger than half the
page size (e.g., 4K-sector block devices), these unaligned DIO writes are
rejected with -EINVAL, causing the test to fail unexpectedly.

Add get_dio_alignment() to query the filesystem's required DIO alignment
via statx(STATX_DIOALIGN) and skip individual test cases whose file offset
or write size is not a multiple of that alignment.  Aligned cases continue
to run so the core coverage is preserved.

While here, open the temporary file once in main() and share the fd across
all test cases instead of reopening it in each invocation.
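
The alignment test the patch introduces boils down to a divisibility check against the alignment reported by statx(2) with STATX_DIOALIGN (available since Linux 6.1, via the stx_dio_offset_align field). A hedged sketch of just the check, with an illustrative name; treating an alignment of 0 ("not reported") as unrestricted is an assumption of this sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* An offset or size is DIO-compatible if it is a multiple of the
 * filesystem's reported DIO alignment; 0 means no alignment was
 * reported, which this sketch treats as unrestricted. */
bool demo_dio_compatible(uint64_t value, uint32_t dio_align)
{
	if (dio_align == 0)
		return true;
	return (value % dio_align) == 0;
}
```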

=== Reproduce Steps ===

  # dd if=/dev/zero of=/tmp/test.img bs=1M count=512
  # losetup --sector-size 4096 /dev/loop0 /tmp/test.img
  # mkfs.xfs /dev/loop0
  # mkdir -p /mnt/dio_test
  # mount /dev/loop0 /mnt/dio_test

  // Modify test to open /mnt/dio_test and rebuild it:
  -       fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664);
  +       fd = open("/mnt/dio_test", O_TMPFILE | O_RDWR | O_DIRECT, 0664);

  # getconf PAGESIZE
  4096

  # echo 100 >/proc/sys/vm/nr_hugepages

  # ./hugetlb_dio
  TAP version 13
  1..4
  # No. Free pages before allocation : 100
  # No. Free pages after munmap : 100
  ok 1 free huge pages from 0-12288
  Bail out! Error writing to file
  : Invalid argument (22)
  # Planned tests != run tests (4 != 1)
  # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

Link: https://lore.kernel.org/20260401090520.24018-1-liwang@redhat.com
Signed-off-by: Li Wang <liwang@redhat.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
43 hours agotools/testing/selftests: add merge test for partial msealed range
Lorenzo Stoakes (Oracle) [Tue, 31 Mar 2026 07:36:27 +0000 (08:36 +0100)] 
tools/testing/selftests: add merge test for partial msealed range

Commit 2697dd8ae721 ("mm/mseal: update VMA end correctly on merge") fixed
an issue in the loop which iterates through VMAs applying mseal, which was
triggered by mseal()'ing a range of VMAs where the second was mseal()'d
and the first mergeable with it, once mseal()'d.

Add a regression test to assert that this behaviour is correct.  We place
it in the merge selftests as this is strictly an issue with merging (via a
vma_modify() invocation).

It also asserts that mseal()'d ranges are correctly merged as you'd
expect.

The test is implemented such that it is skipped if mseal() is not
available on the system.

[rppt@kernel.org: fix inclusions, to fix handle_uprobe_upon_merged_vma()]
Link: https://lore.kernel.org/ac_mCIUQWRAbuH8F@kernel.org
[ljs@kernel.org: simplifications per Pedro]
Link: https://lore.kernel.org/1c9c922d-5cb5-4cff-9273-b737cdb57ca1@lucifer.local
Link: https://lore.kernel.org/20260331073627.50010-1-ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Signed-off-by: Mike Rapoport <rppt@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours agomm/mempolicy: fix memory leaks in weighted_interleave_auto_store()
Jackie Liu [Wed, 1 Apr 2026 00:57:02 +0000 (08:57 +0800)] 
mm/mempolicy: fix memory leaks in weighted_interleave_auto_store()

weighted_interleave_auto_store() fetches old_wi_state inside the if
(!input) block only.  This causes two memory leaks:

1. When a user writes "false" and the current mode is already manual,
   the function returns early without freeing the freshly allocated
   new_wi_state.

2. When a user writes "true", old_wi_state stays NULL because the
   fetch is skipped entirely. The old state is then overwritten by
   rcu_assign_pointer() but never freed, since the cleanup path is
   gated on old_wi_state being non-NULL. A user can trigger this
   repeatedly by writing "1" in a loop.

Fix both leaks by moving the old_wi_state fetch before the input check,
making it unconditional.  This also allows a unified early return for both
"true" and "false" when the requested mode matches the current mode.

Link: https://lore.kernel.org/20260401005702.7096-1-liu.yun@linux.dev
Link: https://sashiko.dev/#/patchset/20260331100740.84906-1-liu.yun@linux.dev
Fixes: e341f9c3c841 ("mm/mempolicy: Weighted Interleave Auto-tuning")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: Donet Tom <donettom@linux.ibm.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: <stable@vger.kernel.org> # v6.16+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago  Docs/admin-guide/mm/damon/lru_sort: warn commit_inputs vs param updates race
SeongJae Park [Sun, 29 Mar 2026 15:30:50 +0000 (08:30 -0700)] 
Docs/admin-guide/mm/damon/lru_sort: warn commit_inputs vs param updates race

DAMON_LRU_SORT handles commit_inputs request inside kdamond thread,
reading the module parameters.  If the user updates the module
parameters while the kdamond thread is reading those, races can happen.
To avoid this, the commit_inputs parameter shows whether the request is
still in progress, assuming users wouldn't update parameters in the
middle of the work.  Some users might overlook that.  Add a warning
about the behavior.

The issue was discovered in [1] by sashiko.

Link: https://lore.kernel.org/20260329153052.46657-3-sj@kernel.org
Link: https://lore.kernel.org/20260319161620.189392-2-objecting@objecting.org
Fixes: 6acfcd0d7524 ("Docs/admin-guide/damon: add a document for DAMON_LRU_SORT")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.0.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago  Docs/admin-guide/mm/damon/reclaim: warn commit_inputs vs param updates race
SeongJae Park [Sun, 29 Mar 2026 15:30:49 +0000 (08:30 -0700)] 
Docs/admin-guide/mm/damon/reclaim: warn commit_inputs vs param updates race

Patch series "Docs/admin-guide/mm/damon: warn commit_inputs vs other
params race".

Writing 'Y' to the commit_inputs parameter of DAMON_RECLAIM and
DAMON_LRU_SORT, and writing other parameters before the commit_inputs
request is completely processed can cause race conditions.  While the
consequence can be bad, the documentation is not clearly describing that.
Add clear warnings.

The issue was discovered [1,2] by sashiko.

This patch (of 2):

DAMON_RECLAIM handles commit_inputs request inside kdamond thread,
reading the module parameters.  If the user updates the module
parameters while the kdamond thread is reading those, races can happen.
To avoid this, the commit_inputs parameter shows whether the request is
still in progress, assuming users wouldn't update parameters in the
middle of the work.  Some users might overlook that.  Add a warning
about the behavior.

The issue was discovered in [1] by sashiko.

Link: https://lore.kernel.org/20260329153052.46657-2-sj@kernel.org
Link: https://lore.kernel.org/20260319161620.189392-3-objecting@objecting.org
Link: https://lore.kernel.org/20260319161620.189392-2-objecting@objecting.org
Fixes: 81a84182c343 ("Docs/admin-guide/mm/damon/reclaim: document 'commit_inputs' parameter")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 5.19.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago  mm/damon/core: use time_in_range_open() for damos quota window start
SeongJae Park [Sun, 29 Mar 2026 15:23:05 +0000 (08:23 -0700)] 
mm/damon/core: use time_in_range_open() for damos quota window start

damos_adjust_quota() uses time_after_eq() to check whether it is time to start a
new quota charge window, comparing the current jiffies and the scheduled
next charge window start time.  If it is, the next charge window start
time is updated and the new charge window starts.

The time check and next window start time update is skipped while the
scheme is deactivated by the watermarks.  Let's suppose the deactivation
is kept more than LONG_MAX jiffies (assuming CONFIG_HZ of 250, more than
99 days in 32 bit systems and more than one billion years in 64 bit
systems), resulting in having the jiffies larger than the next charge
window start time + LONG_MAX.  Then, the time_after_eq() call can return
false until another LONG_MAX jiffies have passed.

This means the scheme can continue working after being reactivated by the
watermarks.  But, soon, the quota will be exceeded and the scheme will
again effectively stop working until the next charge window starts.
Because the current charge window is extended to up to LONG_MAX jiffies,
however, it will look like it stopped unexpectedly and indefinitely, from
the user's perspective.

Fix this by using !time_in_range_open() instead.

The issue was discovered [1] by sashiko.

Link: https://lore.kernel.org/20260329152306.45796-1-sj@kernel.org
Link: https://lore.kernel.org/20260324040722.57944-1-sj@kernel.org
Fixes: ee801b7dd782 ("mm/damon/schemes: activate schemes based on a watermarks mechanism")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 5.16.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago  mm/damon/core: validate damos_quota_goal->nid for node_memcg_{used,free}_bp
SeongJae Park [Sun, 29 Mar 2026 04:39:00 +0000 (21:39 -0700)] 
mm/damon/core: validate damos_quota_goal->nid for node_memcg_{used,free}_bp

Users can set damos_quota_goal->nid with arbitrary value for
node_memcg_{used,free}_bp.  But DAMON core is using those for
NODE_DATA() without a validation of the value.  This can result in an
out-of-bounds memory access.  The issue can actually be triggered using
the DAMON user-space tool (damo), like below.

    $ sudo mkdir /sys/fs/cgroup/foo
    $ sudo ./damo start --damos_action stat --damos_quota_interval 1s \
            --damos_quota_goal node_memcg_used_bp 50% -1 /foo
    $ sudo dmesg
    [...]
    [  524.181426] Unable to handle kernel paging request at virtual address 0000000000002c00

Fix this issue by adding the validation of the given node id.  If an
invalid node id is given, it returns 0% for used memory ratio, and 100%
for free memory ratio.

Link: https://lore.kernel.org/20260329043902.46163-3-sj@kernel.org
Fixes: b74a120bcf50 ("mm/damon/core: implement DAMOS_QUOTA_NODE_MEMCG_USED_BP")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.19.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago  mm/damon/core: validate damos_quota_goal->nid for node_mem_{used,free}_bp
SeongJae Park [Sun, 29 Mar 2026 04:38:59 +0000 (21:38 -0700)] 
mm/damon/core: validate damos_quota_goal->nid for node_mem_{used,free}_bp

Patch series "mm/damon/core: validate damos_quota_goal->nid".

node_mem[cg]_{used,free}_bp DAMOS quota goals receive the node id.  The
node id is used for si_meminfo_node() and NODE_DATA() without proper
validation.  As a result, privileged users can trigger an out of bounds
memory access using DAMON_SYSFS.  Fix the issues.

The issue was originally reported [1] with a fix by another author.  The
original author announced [2] that they would stop working on it,
including the fix that was still in the review stage.  Hence I'm
restarting this.

This patch (of 2):

Users can set damos_quota_goal->nid with arbitrary value for
node_mem_{used,free}_bp.  But DAMON core is using those for
si_meminfo_node() without a validation of the value.  This can result in
an out-of-bounds memory access.  The issue can actually be triggered
using the DAMON user-space tool (damo), like below.

    $ sudo ./damo start --damos_action stat \
     --damos_quota_goal node_mem_used_bp 50% -1 \
     --damos_quota_interval 1s
    $ sudo dmesg
    [...]
    [   65.565986] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000098

Fix this issue by adding the validation of the given node id.  If an invalid
node id is given, it returns 0% for used memory ratio, and 100% for free
memory ratio.

Link: https://lore.kernel.org/20260329043902.46163-2-sj@kernel.org
Link: https://lore.kernel.org/20260325073034.140353-1-objecting@objecting.org
Link: https://lore.kernel.org/20260327040924.68553-1-sj@kernel.org
Fixes: 0e1c773b501f ("mm/damon/core: introduce damos quota goal metrics for memory node utilization")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.16.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago  mm/damon/stat: fix memory leak on damon_start() failure in damon_stat_start()
Jackie Liu [Tue, 31 Mar 2026 10:15:53 +0000 (18:15 +0800)] 
mm/damon/stat: fix memory leak on damon_start() failure in damon_stat_start()

Destroy the DAMON context and reset the global pointer when damon_start()
fails.  Otherwise, the context allocated by damon_stat_build_ctx() is
leaked, and the stale damon_stat_context pointer will be overwritten on
the next enable attempt, making the old allocation permanently
unreachable.

Link: https://lore.kernel.org/20260331101553.88422-1-liu.yun@linux.dev
Fixes: 369c415e6073 ("mm/damon: introduce DAMON_STAT module")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.17.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago  mm/damon/core: fix damos_walk() vs kdamond_fn() exit race
SeongJae Park [Fri, 27 Mar 2026 23:33:15 +0000 (16:33 -0700)] 
mm/damon/core: fix damos_walk() vs kdamond_fn() exit race

When the kdamond_fn() main loop is finished, the function cancels the
remaining damos_walk() request and unsets damon_ctx->kdamond so that API
callers and API functions themselves can see the context is terminated.
damos_walk() adds the caller's request to the queue first.  After that,
it checks whether the kdamond of the damon_ctx is still running
(damon_ctx->kdamond is set).  Only if the kdamond is running does
damos_walk() start waiting for the kdamond's handling of the newly added
request.

The damos_walk() requests registration and damon_ctx->kdamond unset are
protected by different mutexes, though.  Hence, damos_walk() could race
with damon_ctx->kdamond unset, and result in deadlocks.

For example, let's suppose the kdamond successfully finished the
damos_walk() request cancelling.  Right after that, damos_walk() is
called for the context.  It registers the new request, and sees the
context as still running, because the damon_ctx->kdamond unset is not
yet done.  Hence the
damos_walk() caller starts waiting for the handling of the request.
However, the kdamond is already on the termination steps, so it never
handles the new request.  As a result, the damos_walk() caller thread
infinitely waits.

Fix this by introducing another damon_ctx field, namely
walk_control_obsolete.  It is protected by the
damon_ctx->walk_control_lock, which protects damos_walk() request
registration.  Initialize (unset) it in kdamond_fn() before letting
damon_start() return and set it just before the cancelling of the
remaining damos_walk() request is executed.  damos_walk() reads the
obsolete field under the lock and avoids adding a new request.

After this change, only requests that are guaranteed to be handled or
cancelled are registered.  Hence the after-registration DAMON context
termination check is no longer needed.  Remove it together.

The issue is found by sashiko [1].

Link: https://lore.kernel.org/20260327233319.3528-3-sj@kernel.org
Link: https://lore.kernel.org/20260325141956.87144-1-sj@kernel.org
Fixes: bf0eaba0ff9c ("mm/damon/core: implement damos_walk()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.14.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago  mm/damon/core: fix damon_call() vs kdamond_fn() exit race
SeongJae Park [Fri, 27 Mar 2026 23:33:14 +0000 (16:33 -0700)] 
mm/damon/core: fix damon_call() vs kdamond_fn() exit race

Patch series "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit
race".

damon_call() and damos_walk() can leak memory and/or deadlock when they
race with kdamond terminations.  Fix those.

This patch (of 2):

When the kdamond_fn() main loop is finished, the function cancels all
remaining damon_call() requests and unsets damon_ctx->kdamond so that
API callers and API functions themselves can know the context is
terminated.  damon_call() adds the caller's request to the queue first.
After that, it checks whether the kdamond of the damon_ctx is still
running (damon_ctx->kdamond is set).  Only if the kdamond is running
does damon_call() start waiting for the kdamond's handling of the newly
added request.

The damon_call() requests registration and damon_ctx->kdamond unset are
protected by different mutexes, though.  Hence, damon_call() could race
with damon_ctx->kdamond unset, and result in deadlocks.

For example, let's suppose the kdamond successfully finished the
damon_call() requests cancelling.  Right after that, damon_call() is
called for the context.  It registers the new request, and sees the
context as still running, because the damon_ctx->kdamond unset is not
yet done.  Hence the damon_call() caller starts waiting for the handling
of the request.  However, the kdamond is already on the termination
steps, so it never handles the new request.  As a result, the
damon_call() caller thread waits infinitely.

Fix this by introducing another damon_ctx field, namely
call_controls_obsolete.  It is protected by the
damon_ctx->call_controls_lock, which protects damon_call() requests
registration.  Initialize (unset) it in kdamond_fn() before letting
damon_start() return and set it just before the cancelling of remaining
damon_call() requests is executed.  damon_call() reads the obsolete field
under the lock and avoids adding a new request.

After this change, only requests that are guaranteed to be handled or
cancelled are registered.  Hence the after-registration DAMON context
termination check is no longer needed.  Remove it together.

Note that the deadlock will not happen when damon_call() is called for a
repeat mode request.  In this case, damon_call() returns instead of
waiting for the handling, once the request registration succeeds and the
kdamond appears to be running.  However, if the request also has
dealloc_on_cancel, the request memory would be leaked.

The issue is found by sashiko [1].

Link: https://lore.kernel.org/20260327233319.3528-1-sj@kernel.org
Link: https://lore.kernel.org/20260327233319.3528-2-sj@kernel.org
Link: https://lore.kernel.org/20260325141956.87144-1-sj@kernel.org
Fixes: 42b7491af14c ("mm/damon/core: introduce damon_call()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.14.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago  mm: zswap: tie per-CPU acomp_ctx lifetime to the pool
Kanchana P. Sridhar [Tue, 31 Mar 2026 18:33:51 +0000 (11:33 -0700)] 
mm: zswap: tie per-CPU acomp_ctx lifetime to the pool

Currently, per-CPU acomp_ctx are allocated on pool creation and/or CPU
hotplug, and destroyed on pool destruction or CPU hotunplug.  This
complicates the lifetime management to save memory while a CPU is
offlined, which is not very common.

Simplify lifetime management by allocating per-CPU acomp_ctx once on pool
creation (or CPU hotplug for CPUs onlined later), and keeping them
allocated until the pool is destroyed.

Refactor cleanup code from zswap_cpu_comp_dead() into acomp_ctx_free() to
be used elsewhere.

The main benefit of using the CPU hotplug multi state instance startup
callback to allocate the acomp_ctx resources is that it prevents the cores
from being offlined until the multi state instance addition call returns.

  From Documentation/core-api/cpu_hotplug.rst:

    "The node list add/remove operations and the callback invocations are
     serialized against CPU hotplug operations."

Furthermore, zswap_[de]compress() cannot contend with
zswap_cpu_comp_prepare() because:

  - During pool creation/deletion, the pool is not in the zswap_pools
    list.

  - During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed
    out. zswap_cpu_comp_prepare() will be run on a control CPU,
    since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section of "enum
    cpuhp_state".

  In both these cases, any recursions into zswap reclaim from
  zswap_cpu_comp_prepare() will be handled by the old pool.

The above two observations enable the following simplifications:

 1) zswap_cpu_comp_prepare():

    a) acomp_ctx mutex locking:

       If the process gets migrated while zswap_cpu_comp_prepare() is
       running, it will complete on the new CPU. In case of failures, we
       pass the acomp_ctx pointer obtained at the start of
       zswap_cpu_comp_prepare() to acomp_ctx_free(), which again, can
       only undergo migration. There appear to be no contention
       scenarios that might cause inconsistent values of acomp_ctx's
       members. Hence, it seems there is no need for
       mutex_lock(&acomp_ctx->mutex) in zswap_cpu_comp_prepare().

    b) acomp_ctx mutex initialization:

       Since the pool is not yet on zswap_pools list, we don't need to
       initialize the per-CPU acomp_ctx mutex in
       zswap_pool_create(). This has been restored to occur in
       zswap_cpu_comp_prepare().

    c) Subsequent CPU offline-online transitions:

       zswap_cpu_comp_prepare() checks upfront if acomp_ctx->acomp is
       valid. If so, it returns success. This should handle any CPU
       hotplug online-offline transitions after pool creation is done.

 2) CPU offline vis-a-vis zswap ops:

    Let's suppose the process is migrated to another CPU before the
    current CPU is dysfunctional. If zswap_[de]compress() holds the
    acomp_ctx->mutex lock of the offlined CPU, that mutex will be
    released once it completes on the new CPU. Since there is no
    teardown callback, there is no possibility of UAF.

 3) Pool creation/deletion and process migration to another CPU:

    During pool creation/deletion, the pool is not in the zswap_pools
    list. Hence it cannot contend with zswap ops on that CPU. However,
    the process can get migrated.

    a) Pool creation --> zswap_cpu_comp_prepare()
                                --> process migrated:
                                    * Old CPU offline: no-op.
                                    * zswap_cpu_comp_prepare() continues
                                      to run on the new CPU to finish
                                      allocating acomp_ctx resources for
                                      the offlined CPU.

    b) Pool deletion --> acomp_ctx_free()
                                --> process migrated:
                                    * Old CPU offline: no-op.
                                    * acomp_ctx_free() continues
                                      to run on the new CPU to finish
                                      de-allocating acomp_ctx resources
                                      for the offlined CPU.

 4) Pool deletion vis-a-vis CPU onlining:

    The call to cpuhp_state_remove_instance() cannot race with
    zswap_cpu_comp_prepare() because of hotplug synchronization.

The current acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock() are deleted.
Instead, zswap_[de]compress() directly call
mutex_[un]lock(&acomp_ctx->mutex).

The per-CPU memory cost of not deleting the acomp_ctx resources upon CPU
offlining, and only deleting them when the pool is destroyed, is 8.28 KB
on x86_64.  This cost is only paid when a CPU is offlined, until it is
onlined again.

Link: https://lore.kernel.org/20260331183351.29844-3-kanchanapsridhar2026@gmail.com
Co-developed-by: Kanchana P. Sridhar <kanchanapsridhar2026@gmail.com>
Signed-off-by: Kanchana P. Sridhar <kanchanapsridhar2026@gmail.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Acked-by: Yosry Ahmed <yosry@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago  mm: zswap: remove redundant checks in zswap_cpu_comp_dead()
Kanchana P. Sridhar [Tue, 31 Mar 2026 18:33:50 +0000 (11:33 -0700)] 
mm: zswap: remove redundant checks in zswap_cpu_comp_dead()

Patch series "zswap pool per-CPU acomp_ctx simplifications", v3.

This patchset first removes redundant checks on the acomp_ctx and its
"req" member in zswap_cpu_comp_dead().

Next, it persists the zswap pool's per-CPU acomp_ctx resources to last
until the pool is destroyed.  It then simplifies the per-CPU acomp_ctx
mutex locking in zswap_compress()/zswap_decompress().

Code comments are added after allocation, and before the checks that
deallocate the per-CPU acomp_ctx's members, based on the expected crypto
API return values and the zswap changes this patchset makes.

Patch 2 is an independent submission of patch 23 from [1], to
facilitate merging.

This patch (of 2):

There are presently redundant checks on the per-CPU acomp_ctx and its
"req" member in zswap_cpu_comp_dead(): redundant because they are
inconsistent with zswap_pool_create()'s handling of failure to allocate
the acomp_ctx, and with the expected NULL return value from the
acomp_request_alloc() API when it fails to allocate an acomp_req.

Fix these by converting them to NULL checks.

Add comments in zswap_cpu_comp_prepare() clarifying the expected return
values of the crypto_alloc_acomp_node() and acomp_request_alloc() APIs.

Link: https://lore.kernel.org/20260331183351.29844-2-kanchanapsridhar2026@gmail.com
Link: https://patchwork.kernel.org/project/linux-mm/list/?series=1046677
Signed-off-by: Kanchana P. Sridhar <kanchanapsridhar2026@gmail.com>
Suggested-by: Yosry Ahmed <yosry@kernel.org>
Acked-by: Yosry Ahmed <yosry@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago  mm/alloc_tag: clear codetag for pages allocated before page_ext initialization
Hao Ge [Tue, 31 Mar 2026 08:13:12 +0000 (16:13 +0800)] 
mm/alloc_tag: clear codetag for pages allocated before page_ext initialization

Due to initialization ordering, page_ext is allocated and initialized
relatively late during boot.  Some pages have already been allocated and
freed before page_ext becomes available, leaving their codetag
uninitialized.

A clear example is in init_section_page_ext(): alloc_page_ext() calls
kmemleak_alloc().  If the slab cache has no free objects, it falls back to
the buddy allocator to allocate memory.  However, at this point page_ext
is not yet fully initialized, so these newly allocated pages have no
codetag set.  These pages may later be reclaimed by KASAN, which causes
the warning to trigger when they are freed because their codetag ref is
still empty.

Use a global array to track pages allocated before page_ext is fully
initialized.  The array size is fixed at 8192 entries, and a warning is
emitted if this limit is exceeded.  When page_ext initialization
completes, set their codetag to empty to avoid warnings when they are
freed later.

This warning is only observed with CONFIG_MEM_ALLOC_PROFILING_DEBUG=Y and
mem_profiling_compressed disabled:

[    9.582133] ------------[ cut here ]------------
[    9.582137] alloc_tag was not set
[    9.582139] WARNING: ./include/linux/alloc_tag.h:164 at __pgalloc_tag_sub+0x40f/0x550, CPU#5: systemd/1
[    9.582190] CPU: 5 UID: 0 PID: 1 Comm: systemd Not tainted 7.0.0-rc4 #1 PREEMPT(lazy)
[    9.582192] Hardware name: Red Hat KVM, BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    9.582194] RIP: 0010:__pgalloc_tag_sub+0x40f/0x550
[    9.582196] Code: 00 00 4c 29 e5 48 8b 05 1f 88 56 05 48 8d 4c ad 00 48 8d 2c c8 e9 87 fd ff ff 0f 0b 0f 0b e9 f3 fe ff ff 48 8d 3d 61 2f ed 03 <67> 48 0f b9 3a e9 b3 fd ff ff 0f 0b eb e4 e8 5e cd 14 02 4c 89 c7
[    9.582197] RSP: 0018:ffffc9000001f940 EFLAGS: 00010246
[    9.582200] RAX: dffffc0000000000 RBX: 1ffff92000003f2b RCX: 1ffff110200d806c
[    9.582201] RDX: ffff8881006c0360 RSI: 0000000000000004 RDI: ffffffff9bc7b460
[    9.582202] RBP: 0000000000000000 R08: 0000000000000000 R09: fffffbfff3a62324
[    9.582203] R10: ffffffff9d311923 R11: 0000000000000000 R12: ffffea0004001b00
[    9.582204] R13: 0000000000002000 R14: ffffea0000000000 R15: ffff8881006c0360
[    9.582206] FS:  00007ffbbcf2d940(0000) GS:ffff888450479000(0000) knlGS:0000000000000000
[    9.582208] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.582210] CR2: 000055ee3aa260d0 CR3: 0000000148b67005 CR4: 0000000000770ef0
[    9.582211] PKRU: 55555554
[    9.582212] Call Trace:
[    9.582213]  <TASK>
[    9.582214]  ? __pfx___pgalloc_tag_sub+0x10/0x10
[    9.582216]  ? check_bytes_and_report+0x68/0x140
[    9.582219]  __free_frozen_pages+0x2e4/0x1150
[    9.582221]  ? __free_slab+0xc2/0x2b0
[    9.582224]  qlist_free_all+0x4c/0xf0
[    9.582227]  kasan_quarantine_reduce+0x15d/0x180
[    9.582229]  __kasan_slab_alloc+0x69/0x90
[    9.582232]  kmem_cache_alloc_noprof+0x14a/0x500
[    9.582234]  do_getname+0x96/0x310
[    9.582237]  do_readlinkat+0x91/0x2f0
[    9.582239]  ? __pfx_do_readlinkat+0x10/0x10
[    9.582240]  ? get_random_bytes_user+0x1df/0x2c0
[    9.582244]  __x64_sys_readlinkat+0x96/0x100
[    9.582246]  do_syscall_64+0xce/0x650
[    9.582250]  ? __x64_sys_getrandom+0x13a/0x1e0
[    9.582252]  ? __pfx___x64_sys_getrandom+0x10/0x10
[    9.582254]  ? do_syscall_64+0x114/0x650
[    9.582255]  ? ksys_read+0xfc/0x1d0
[    9.582258]  ? __pfx_ksys_read+0x10/0x10
[    9.582260]  ? do_syscall_64+0x114/0x650
[    9.582262]  ? do_syscall_64+0x114/0x650
[    9.582264]  ? __pfx_fput_close_sync+0x10/0x10
[    9.582266]  ? file_close_fd_locked+0x178/0x2a0
[    9.582268]  ? __x64_sys_faccessat2+0x96/0x100
[    9.582269]  ? __x64_sys_close+0x7d/0xd0
[    9.582271]  ? do_syscall_64+0x114/0x650
[    9.582273]  ? do_syscall_64+0x114/0x650
[    9.582275]  ? clear_bhb_loop+0x50/0xa0
[    9.582277]  ? clear_bhb_loop+0x50/0xa0
[    9.582279]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[    9.582280] RIP: 0033:0x7ffbbda345ee
[    9.582282] Code: 0f 1f 40 00 48 8b 15 29 38 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 0f 1f 40 00 f3 0f 1e fa 49 89 ca b8 0b 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fa 37 0d 00 f7 d8 64 89 01 48
[    9.582284] RSP: 002b:00007ffe2ad8de58 EFLAGS: 00000202 ORIG_RAX: 000000000000010b
[    9.582286] RAX: ffffffffffffffda RBX: 000055ee3aa25570 RCX: 00007ffbbda345ee
[    9.582287] RDX: 000055ee3aa25570 RSI: 00007ffe2ad8dee0 RDI: 00000000ffffff9c
[    9.582288] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000001001
[    9.582289] R10: 0000000000001000 R11: 0000000000000202 R12: 0000000000000033
[    9.582290] R13: 00007ffe2ad8dee0 R14: 00000000ffffff9c R15: 00007ffe2ad8deb0
[    9.582292]  </TASK>
[    9.582293] ---[ end trace 0000000000000000 ]---

Link: https://lore.kernel.org/20260331081312.123719-1-hao.ge@linux.dev
Fixes: dcfe378c81f72 ("lib: introduce support for page allocation tagging")
Signed-off-by: Hao Ge <hao.ge@linux.dev>
Suggested-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago  mm/vmscan: prevent MGLRU reclaim from pinning address space
Suren Baghdasaryan [Sun, 22 Mar 2026 07:08:43 +0000 (00:08 -0700)] 
mm/vmscan: prevent MGLRU reclaim from pinning address space

When shrinking lruvec, MGLRU pins address space before walking it.  This
is excessive since all it needs for walking the page range is a stable
mm_struct to be able to take and release mmap_read_lock and a stable
mm->mm_mt tree to walk.  This address space pinning results in delays when
releasing the memory of a dying process.  This also prevents mm reapers
(both in-kernel oom-reaper and userspace process_mrelease()) from doing
their job during MGLRU scan because they check task_will_free_mem() which
will yield a negative result due to the elevated mm->mm_users.

This affects the system in the sense that if the MM of the killed
process is being reclaimed by kswapd then reapers won't be able to reap
it.  Even the process itself (which might have higher-priority than
kswapd) will not free its memory until kswapd drops the last reference.
IOW, we delay freeing the memory because kswapd is reclaiming it.  In
Android the visible result for us is that process_mrelease() (userspace
reaper) skips MM in such cases and we see process memory not released
for an unusually long time (secs).

Replace unnecessary address space pinning with mm_struct pinning by
replacing mmget/mmput with mmgrab/mmdrop calls.  mm_mt is contained within
mm_struct itself, therefore it won't be freed as long as mm_struct is
stable and it won't change during the walk because mmap_read_lock is being
held.

Link: https://lore.kernel.org/20260322070843.941997-1-surenb@google.com
Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago  liveupdate: defer file handler module refcounting to active sessions
Pasha Tatashin [Fri, 27 Mar 2026 03:33:34 +0000 (03:33 +0000)] 
liveupdate: defer file handler module refcounting to active sessions

Stop pinning modules indefinitely upon file handler registration.
Instead, dynamically increment the module reference count only when a live
update session actively uses the file handler (e.g., during preservation
or deserialization), and release it when the session ends.

This allows modules providing live update handlers to be gracefully
unloaded when no live update is in progress.

Link: https://lore.kernel.org/20260327033335.696621-11-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago  liveupdate: make unregister functions return void
Pasha Tatashin [Fri, 27 Mar 2026 03:33:33 +0000 (03:33 +0000)] 
liveupdate: make unregister functions return void

Change liveupdate_unregister_file_handler and liveupdate_unregister_flb to
return void instead of an error code.  This follows the design principle
that unregistration during module unload should not fail, as the unload
cannot be stopped at that point.

Link: https://lore.kernel.org/20260327033335.696621-10-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago  liveupdate: remove liveupdate_test_unregister()
Pasha Tatashin [Fri, 27 Mar 2026 03:33:32 +0000 (03:33 +0000)] 
liveupdate: remove liveupdate_test_unregister()

Now that file handler unregistration automatically unregisters all
associated FLBs, the liveupdate_test_unregister() function
is no longer needed.  Remove it along with its usages and declarations.

Link: https://lore.kernel.org/20260327033335.696621-9-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago liveupdate: auto unregister FLBs on file handler unregistration
Pasha Tatashin [Fri, 27 Mar 2026 03:33:31 +0000 (03:33 +0000)] 
liveupdate: auto unregister FLBs on file handler unregistration

To ensure that unregistration is always successful and doesn't leave
dangling resources, introduce auto-unregistration of FLBs: when a file
handler is unregistered, all FLBs associated with it are automatically
unregistered.

Introduce a new helper luo_flb_unregister_all() which unregisters all FLBs
linked to the given file handler.

Link: https://lore.kernel.org/20260327033335.696621-8-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago liveupdate: remove luo_session_quiesce()
Pasha Tatashin [Fri, 27 Mar 2026 03:33:30 +0000 (03:33 +0000)] 
liveupdate: remove luo_session_quiesce()

Now that FLB module references are handled dynamically during active
sessions, we can safely remove the luo_session_quiesce() and
luo_session_resume() mechanism.

Link: https://lore.kernel.org/20260327033335.696621-7-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago liveupdate: defer FLB module refcounting to active sessions
Pasha Tatashin [Fri, 27 Mar 2026 03:33:29 +0000 (03:33 +0000)] 
liveupdate: defer FLB module refcounting to active sessions

Stop pinning modules indefinitely upon FLB registration.  Instead,
dynamically take a module reference when the FLB is actively used in a
session (e.g., during preserve and retrieve) and release it when the
session concludes.

This allows modules providing FLB operations to be cleanly unloaded when
not in active use by the live update orchestrator.

Link: https://lore.kernel.org/20260327033335.696621-6-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago liveupdate: protect FLB lists with luo_register_rwlock
Pasha Tatashin [Fri, 27 Mar 2026 03:33:28 +0000 (03:33 +0000)] 
liveupdate: protect FLB lists with luo_register_rwlock

Because liveupdate FLB objects will soon stop holding persistent module
references while registered, list traversals must be protected against
concurrent module unloading.

To provide this protection, utilize the global luo_register_rwlock.  It
protects the global registry of FLBs and the handler's specific list of
FLB dependencies.

Read locks are used during concurrent list traversals (e.g., during
preservation and serialization).  Write locks are taken during
registration and unregistration.

Link: https://lore.kernel.org/20260327033335.696621-5-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago liveupdate: protect file handler list with rwsem
Pasha Tatashin [Fri, 27 Mar 2026 03:33:27 +0000 (03:33 +0000)] 
liveupdate: protect file handler list with rwsem

Because liveupdate file handlers will no longer hold a module reference
while registered, access to the handler list must be protected against
concurrent module unloading.

Utilize the global luo_register_rwlock to protect the global registry of
file handlers.  Read locks are taken during list traversals in
luo_preserve_file() and luo_file_deserialize().  Write locks are taken
during registration and unregistration.

Link: https://lore.kernel.org/20260327033335.696621-4-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago liveupdate: synchronize lazy initialization of FLB private state
Pasha Tatashin [Fri, 27 Mar 2026 03:33:26 +0000 (03:33 +0000)] 
liveupdate: synchronize lazy initialization of FLB private state

The luo_flb_get_private() function, which is responsible for lazily
initializing the private state of FLB objects, can be called concurrently
from multiple threads.  This creates a data race on the 'initialized' flag
and can lead to multiple executions of mutex_init() and INIT_LIST_HEAD()
on the same memory.

Introduce a static spinlock (luo_flb_init_lock) local to the function to
synchronize the initialization path.  Use smp_load_acquire() and
smp_store_release() for memory ordering between the fast path and the slow
path.
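The acquire/release double-checked pattern described above can be sketched in userspace with C11 atomics and a pthread mutex in place of the kernel spinlock (flb_private_init() and the data field are illustrative; the real luo_flb_get_private() differs):

```c
#include <pthread.h>
#include <stdatomic.h>

struct flb_private {
	atomic_int initialized;	/* 0 until fully set up */
	int data;		/* stands in for the mutex and list head */
};

static pthread_mutex_t luo_flb_init_lock = PTHREAD_MUTEX_INITIALIZER;

static void flb_private_init(struct flb_private *p)
{
	/* Fast path: the acquire load pairs with the release store below,
	 * so a reader that sees initialized == 1 also sees ->data. */
	if (atomic_load_explicit(&p->initialized, memory_order_acquire))
		return;

	/* Slow path: serialize so mutex_init()/INIT_LIST_HEAD() run once. */
	pthread_mutex_lock(&luo_flb_init_lock);
	if (!atomic_load_explicit(&p->initialized, memory_order_relaxed)) {
		p->data = 42;	/* kernel: mutex_init(), INIT_LIST_HEAD() */
		atomic_store_explicit(&p->initialized, 1,
				      memory_order_release);
	}
	pthread_mutex_unlock(&luo_flb_init_lock);
}
```

Re-checking under the lock is what prevents two racing threads from both taking the slow path and initializing the same memory twice.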

Link: https://lore.kernel.org/20260327033335.696621-3-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago liveupdate: safely print untrusted strings
Pasha Tatashin [Fri, 27 Mar 2026 03:33:25 +0000 (03:33 +0000)] 
liveupdate: safely print untrusted strings

Patch series "liveupdate: Fix module unloading and unregister API", v3.

This patch series addresses an issue with how LUO handles module reference
counting and unregistration during a module unload (e.g., via rmmod).

Currently, modules that register live update file handlers are pinned for
the entire duration they are registered.  This prevents the modules from
being unloaded gracefully, even when no live update session is in
progress.

Furthermore, if a module is forcefully unloaded, the unregistration
functions return an error (e.g.  -EBUSY) if a session is active, which is
ignored by the kernel's module unload path, leaving dangling pointers in
the LUO global lists.

To resolve these issues, this series introduces the following changes:
1. Adds a global read-write semaphore (luo_register_rwlock) to protect
   the registration lists for both file handlers and FLBs.
2. Reduces the scope of module reference counting for file handlers and
   FLBs. Instead of pinning modules indefinitely upon registration,
   references are now taken only when they are actively used in a live
   update session (e.g., during preservation, retrieval, or
   deserialization).
3. Removes the global luo_session_quiesce() mechanism since module
   unload behavior now handles active sessions implicitly.
4. Introduces auto-unregistration of FLBs during file handler
   unregistration to prevent leaving dangling resources.
5. Changes the unregistration functions to return void instead of
   an error code.
6. Fixes a data race in luo_flb_get_private() by introducing a spinlock
   for thread-safe lazy initialization.
7. Strengthens security by using %.*s when printing untrusted deserialized
   compatible strings and session names to prevent out-of-bounds reads.

This patch (of 10):

Deserialized strings from KHO data (such as file handler compatible
strings and session names) are provided by the previous kernel and might
not be null-terminated if the data is corrupted or maliciously crafted.

When printing these strings in error messages, use the %.*s format
specifier with the maximum buffer size to prevent out-of-bounds reads into
adjacent kernel memory.
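A plain C demonstration of the fix: a %.*s precision bounds the read even when the source buffer is not null-terminated (print_untrusted() is an illustrative helper, not the patched kernel code; cap mirrors the deserialized field's fixed buffer size):

```c
#include <stdio.h>

/* Format a possibly-unterminated name into out without reading past
 * the name's buffer: printf stops at cap characters or at a NUL,
 * whichever comes first. */
static void print_untrusted(char *out, size_t outsz,
			    const char *name, int cap)
{
	snprintf(out, outsz, "session '%.*s'", cap, name);
}
```

With a bare %s, a corrupted or malicious blob lacking the terminator would let the format string walk into adjacent memory.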

Link: https://lore.kernel.org/20260327033335.696621-1-pasha.tatashin@soleen.com
Link: https://lore.kernel.org/20260327033335.696621-2-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU
Baolin Wang [Fri, 27 Mar 2026 10:21:08 +0000 (18:21 +0800)] 
mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU

balance_dirty_pages() does not throttle dirty folios on cgroup v1.  See
commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
on traditional hierarchies").

Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no
longer attempt to write back filesystem folios through reclaim.

On large memory systems, the flusher may not be able to write back quickly
enough.  Consequently, MGLRU will encounter many folios that are already
under writeback.  Since we cannot reclaim these dirty folios, the system
may run out of memory and trigger the OOM killer.

Hence, for cgroup v1, throttle reclaim after waking up the flusher,
similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty pages
throttling on cgroup v1"), to avoid unnecessary OOM kills.

The following test case easily reproduces the OOM issue.  With this
patch applied, the test passes successfully.

$mkdir /sys/fs/cgroup/memory/test
$echo 256M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
$echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
$dd if=/dev/zero of=/mnt/data.bin bs=1M count=800

Link: https://lore.kernel.org/3445af0f09e8ca945492e052e82594f8c4f2e2f6.1774606060.git.baolin.wang@linux.alibaba.com
Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Kairui Song <kasong@tencent.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago selftests: liveupdate: add test for double preservation
Pasha Tatashin [Thu, 26 Mar 2026 16:39:43 +0000 (16:39 +0000)] 
selftests: liveupdate: add test for double preservation

Verify that a file can only be preserved once across all active sessions.
Attempting to preserve it a second time, whether in the same or a
different session, should fail with EBUSY.

Link: https://lore.kernel.org/20260326163943.574070-4-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago memfd: implement get_id for memfd_luo
Pasha Tatashin [Thu, 26 Mar 2026 16:39:42 +0000 (16:39 +0000)] 
memfd: implement get_id for memfd_luo

Memfds are identified by their underlying inode.  Implement get_id for
memfd_luo to return the inode pointer.  This prevents the same memfd
from being managed twice by LUO when the same inode is pointed to by
multiple file objects.

Link: https://lore.kernel.org/20260326163943.574070-3-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago liveupdate: prevent double management of files
Pasha Tatashin [Thu, 26 Mar 2026 16:39:41 +0000 (16:39 +0000)] 
liveupdate: prevent double management of files

Patch series "liveupdate: prevent double preservation", v4.

Currently, LUO does not prevent the same file from being managed twice
across different active sessions.

Because LUO preserves files of fundamentally different types (memfd, and
the upcoming vfiofd [1], iommufd [2], guest_memfd, and possibly
kvmfd/cpufd), there is no common private data and no cheap way to
guarantee that the same file is not preserved twice, short of keying on
the inode or using a slower, more expensive structure such as a
hashtable.

This patch (of 4):

Currently, LUO does not prevent the same file from being managed twice
across different active sessions.

Use a global xarray luo_preserved_files to keep track of file identifiers
being preserved by LUO.  Update luo_preserve_file() to check and insert
the file identifier into this xarray when it is preserved, and erase it in
luo_file_unpreserve_files() when it is released.

To allow handlers to define what constitutes a "unique" file (e.g.,
different struct file objects pointing to the same hardware resource), add
a get_id() callback to struct liveupdate_file_ops.  If not provided, the
default identifier is the struct file pointer itself.

This ensures that the same file (or resource) cannot be managed by
multiple sessions.  If another session attempts to preserve an already
managed file, it will now fail with -EBUSY.
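The duplicate check can be sketched as a userspace analogue of the xarray (a fixed-size table and the names preserve_id()/unpreserve_id() are illustrative, not LUO's code; the kernel side would use xa_insert()/xa_erase() keyed by the handler-provided identifier):

```c
#include <errno.h>
#include <stddef.h>

#define MAX_PRESERVED 16

static const void *preserved[MAX_PRESERVED];

/* Default identifier is the struct file pointer itself; handlers with
 * a get_id() callback would pass e.g. the inode instead. */
static int preserve_id(const void *id)
{
	size_t i, slot = MAX_PRESERVED;

	for (i = 0; i < MAX_PRESERVED; i++) {
		if (preserved[i] == id)
			return -EBUSY;	/* already managed by a session */
		if (!preserved[i] && slot == MAX_PRESERVED)
			slot = i;
	}
	if (slot == MAX_PRESERVED)
		return -ENOMEM;
	preserved[slot] = id;
	return 0;
}

static void unpreserve_id(const void *id)
{
	size_t i;

	for (i = 0; i < MAX_PRESERVED; i++)
		if (preserved[i] == id)
			preserved[i] = NULL;
}
```

A second preserve of the same identifier, from any session, fails with -EBUSY until the first preservation is released.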

Link: https://lore.kernel.org/20260326163943.574070-1-pasha.tatashin@soleen.com
Link: https://lore.kernel.org/20260326163943.574070-2-pasha.tatashin@soleen.com
Link: https://lore.kernel.org/all/20260129212510.967611-1-dmatlack@google.com
Link: https://lore.kernel.org/all/20260203220948.2176157-1-skhawaja@google.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago kho: document kexec-metadata tracking feature
Breno Leitao [Mon, 16 Mar 2026 11:54:36 +0000 (04:54 -0700)] 
kho: document kexec-metadata tracking feature

Add documentation for the kexec-metadata feature that tracks the previous
kernel version and kexec boot count across kexec reboots.  This helps
diagnose bugs that only reproduce when kexecing from specific kernel
versions.

Link: https://lore.kernel.org/20260316-kho-v9-6-ed6dcd951988@debian.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago kho: kexec-metadata: track previous kernel chain
Breno Leitao [Mon, 16 Mar 2026 11:54:35 +0000 (04:54 -0700)] 
kho: kexec-metadata: track previous kernel chain

Use Kexec Handover (KHO) to pass the previous kernel's version string and
the number of kexec reboots since the last cold boot to the next kernel,
and print it at boot time.

Example output:
    [    0.000000] KHO: exec from: 6.19.0-rc4-next-20260107 (count 1)

Motivation
==========

Bugs that only reproduce when kexecing from specific kernel versions are
difficult to diagnose.  These issues occur when a buggy kernel kexecs into
a new kernel, with the bug manifesting only in the second kernel.

Recent examples include the following commits:

 * commit eb2266312507 ("x86/boot: Fix page table access in
   5-level to 4-level paging transition")
 * commit 77d48d39e991 ("efistub/tpm: Use ACPI reclaim memory
   for event log to avoid corruption")
 * commit 64b45dd46e15 ("x86/efi: skip memattr table on kexec
   boot")

As kexec-based reboots become more common, these version-dependent bugs
are appearing more frequently.  At scale, correlating crashes to the
previous kernel version is challenging, especially when issues only occur
in specific transition scenarios.

Implementation
==============

The kexec metadata is stored as a plain C struct (struct
kho_kexec_metadata) rather than FDT format, for simplicity and direct
field access.  It is registered via kho_add_subtree() as a separate
subtree, keeping it independent from the core KHO ABI.  This design
choice:

 - Keeps the core KHO ABI minimal and stable
 - Allows the metadata format to evolve independently
 - Avoids requiring version bumps for all KHO consumers (LUO, etc.)
   when the metadata format changes

The struct kho_kexec_metadata contains two fields:
 - previous_release: The kernel version that initiated the kexec
 - kexec_count: Number of kexec boots since last cold boot

On cold boot, kexec_count starts at 0 and increments with each kexec.  The
count helps identify issues that only manifest after multiple consecutive
kexec reboots.
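The two-field layout and the count's behavior can be sketched as follows (a simplified illustration; the real struct kho_kexec_metadata and its update path in the kernel may differ, and kho_metadata_update() is a hypothetical helper name):

```c
#include <string.h>

struct kho_kexec_metadata {
	char previous_release[64];	/* version of the kexecing kernel */
	unsigned int kexec_count;	/* kexec boots since last cold boot */
};

/* On each kexec, record the current kernel's version string and bump
 * the count carried over from the previous kernel (0 on cold boot). */
static void kho_metadata_update(struct kho_kexec_metadata *md,
				const char *release)
{
	strncpy(md->previous_release, release,
		sizeof(md->previous_release) - 1);
	md->previous_release[sizeof(md->previous_release) - 1] = '\0';
	md->kexec_count++;
}
```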

[leitao@debian.org: call kho_kexec_metadata_init() for both boot paths]
Link: https://lore.kernel.org/all/20260309-kho-v8-5-c3abcf4ac750@debian.org/
Link: https://lore.kernel.org/20260409-kho_fix_merge_issue-v1-1-710c84ceaa85@debian.org
Link: https://lore.kernel.org/20260316-kho-v9-5-ed6dcd951988@debian.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago kho: fix kho_in_debugfs_init() to handle non-FDT blobs
Breno Leitao [Mon, 16 Mar 2026 11:54:34 +0000 (04:54 -0700)] 
kho: fix kho_in_debugfs_init() to handle non-FDT blobs

kho_in_debugfs_init() calls fdt_totalsize() to determine blob sizes, which
assumes all blobs are FDTs.  This breaks for non-FDT blobs like struct
kho_kexec_metadata.

Fix this by reading the "blob-size" property from the FDT (persisted by
kho_add_subtree()) instead of calling fdt_totalsize().  Also rename local
variables from fdt_phys/sub_fdt to blob_phys/blob for consistency with the
non-FDT-specific naming.

Link: https://lore.kernel.org/20260316-kho-v9-4-ed6dcd951988@debian.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
44 hours ago kho: persist blob size in KHO FDT
Breno Leitao [Mon, 16 Mar 2026 11:54:33 +0000 (04:54 -0700)] 
kho: persist blob size in KHO FDT

kho_add_subtree() accepts a size parameter but only forwards it to
debugfs.  The size is not persisted in the KHO FDT, so it is lost across
kexec.  This makes it impossible for the incoming kernel to determine the
blob size without understanding the blob format.

Store the blob size as a "blob-size" property in the KHO FDT alongside the
"preserved-data" physical address.  This allows the receiving kernel to
recover the size for any blob regardless of format.

Also extend kho_retrieve_subtree() with an optional size output parameter
so callers can learn the blob size without needing to understand the blob
format.  Update all callers to pass NULL for the new parameter.

Link: https://lore.kernel.org/20260316-kho-v9-3-ed6dcd951988@debian.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>