]> git.ipfire.org Git - thirdparty/linux.git/log
thirdparty/linux.git
8 days agos390/ap: Fix locking issue in SE bind and associate sysfs functions
Harald Freudenberger [Wed, 3 Jun 2026 13:04:56 +0000 (15:04 +0200)] 
s390/ap: Fix locking issue in SE bind and associate sysfs functions

Revisit and reorganize the locking and lock coverage of the
ap->lock spinlock as used in the two sysfs functions
se_bind_store() and se_associate_store().

A kernel run reported a possible deadlock situation, caused by
holding the spinlock (ap->lock) while triggering a uevent.
The fix rearranges the code protected by the spinlock by excluding
the uevent invocation, which does not require protection.

Additionally, the start of the protected region is moved earlier
to cover more lines, ensuring a consistent view of the AP queue
state between reading and updating its struct fields.

=====================================================
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
7.1.0-20260601.rc6.git12.516b5dbd4d4a.300.fc44.s390x+debug #1 Not tainted
-----------------------------------------------------
setupseguest.sh/11034 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire:
000001c991f498e8 (fs_reclaim){+.+.}-{0:0}, at: __kmalloc_cache_noprof+0x5a/0x6d0
and this task is already holding:
000000c4a1a12378 (&aq->lock){+.-.}-{2:2}, at: se_bind_store+0x96/0x3a0
which would create a new lock dependency:
 (&aq->lock){+.-.}-{2:2} -> (fs_reclaim){+.+.}-{0:0}
but this new dependency connects a SOFTIRQ-irq-safe lock:
 (&aq->lock){+.-.}-{2:2}
... which became SOFTIRQ-irq-safe at:
  __lock_acquire+0x5ae/0x15a0
  lock_acquire+0x14c/0x400
  _raw_spin_lock_bh+0x58/0xb0
  ap_tasklet_fn+0x72/0xd0
  tasklet_action_common+0x174/0x1b0
  handle_softirqs+0x180/0x5c0
  irq_exit_rcu+0x196/0x200
  do_ext_irq+0x12a/0x4d0
  ext_int_handler+0xc6/0xf0
  folio_zero_user+0x1c6/0x240
  folio_zero_user+0x182/0x240
  vma_alloc_anon_folio_pmd+0xa0/0x1d0
  __do_huge_pmd_anonymous_page+0x3a/0x200
  __handle_mm_fault+0x56c/0x590
  handle_mm_fault+0xa2/0x370
  do_exception+0x292/0x590
  __do_pgm_check+0x136/0x3e0
  pgm_check_handler+0x114/0x160
to a SOFTIRQ-irq-unsafe lock:
 (fs_reclaim){+.+.}-{0:0}
... which became SOFTIRQ-irq-unsafe at:
...
  __lock_acquire+0x5ae/0x15a0
  lock_acquire+0x14c/0x400
  __fs_reclaim_acquire+0x44/0x50
  fs_reclaim_acquire+0xbe/0x100
  fs_reclaim_correct_nesting+0x20/0x70
  dotest+0x5e/0x148
  locking_selftest+0x2854/0x2a88
  start_kernel+0x3b2/0x4f0
  startup_continue+0x2e/0x40
other info that might help us debug this:
 Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
       local_irq_disable();
       lock(&aq->lock);
       lock(fs_reclaim);
  <Interrupt>
    lock(&aq->lock);
 *** DEADLOCK ***
4 locks held by setupseguest.sh/11034:
 #0: 000000c485d01440 (sb_writers#4){.+.+}-{0:0}, at: vfs_write+0x2fc/0x380
 #1: 000000c4d2283288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x12a0x270
 #2: 000000c4a1830e48 (kn->active#172){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x1e/0x270
 #3: 000000c4a1a12378 (&aq->lock){+.-.}-{2:2}, at: se_bind_store+0x96/0x3a0
the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
-> (&aq->lock){+.-.}-{2:2} {
   HARDIRQ-ON-W at:
    __lock_acquire+0x5ae/0x15a0
    lock_acquire+0x14c/0x400
    _raw_spin_lock_bh+0x58/0xb0
    ap_queue_init_state+0x2e/0x50
    ap_scan_domains+0x5d6/0x620
    ap_scan_adapter+0x4c0/0x810
    ap_scan_bus+0x70/0x350
    ap_scan_bus_wq_callback+0x56/0x80
    process_one_work+0x2ba/0x820
    worker_thread+0x21a/0x400
    kthread+0x164/0x190
    __ret_from_fork+0x4c/0x340
    ret_from_fork+0xa/0x30
   IN-SOFTIRQ-W at:
    __lock_acquire+0x5ae/0x15a0
    lock_acquire+0x14c/0x400
    _raw_spin_lock_bh+0x58/0xb0
    ap_tasklet_fn+0x72/0xd0
    tasklet_action_common+0x174/0x1b0
    handle_softirqs+0x180/0x5c0
    irq_exit_rcu+0x196/0x200
    do_ext_irq+0x12a/0x4d0
    ext_int_handler+0xc6/0xf0
    folio_zero_user+0x1c6/0x240
    folio_zero_user+0x182/0x240
    vma_alloc_anon_folio_pmd+0xa0/0x1d0
    __do_huge_pmd_anonymous_page+0x3a/0x200
    __handle_mm_fault+0x56c/0x590
    handle_mm_fault+0xa2/0x370
    do_exception+0x292/0x590
    __do_pgm_check+0x136/0x3e0
    pgm_check_handler+0x114/0x160
   INITIAL USE at:
   __lock_acquire+0x5ae/0x15a0
   lock_acquire+0x14c/0x400
   _raw_spin_lock_bh+0x58/0xb0
   ap_queue_init_state+0x2e/0x50
   ap_scan_domains+0x5d6/0x620
   ap_scan_adapter+0x4c0/0x810
   ap_scan_bus+0x70/0x350
   ap_scan_bus_wq_callback+0x56/0x80
   process_one_work+0x2ba/0x820
   worker_thread+0x21a/0x400
   kthread+0x164/0x190
   __ret_from_fork+0x4c/0x340
   ret_from_fork+0xa/0x30
 }
 ... key      at: [<000001c9936e8aa0>] __key.7+0x0/0x10
the dependencies between the lock to be acquired
 and SOFTIRQ-irq-unsafe lock:
-> (fs_reclaim){+.+.}-{0:0} {
   HARDIRQ-ON-W at:
    __lock_acquire+0x5ae/0x15a0
    lock_acquire+0x14c/0x400
    __fs_reclaim_acquire+0x44/0x50
    fs_reclaim_acquire+0xbe/0x100
    fs_reclaim_correct_nesting+0x20/0x70
    dotest+0x5e/0x148
    locking_selftest+0x2854/0x2a88
    start_kernel+0x3b2/0x4f0
    startup_continue+0x2e/0x40
   SOFTIRQ-ON-W at:
    __lock_acquire+0x5ae/0x15a0
    lock_acquire+0x14c/0x400
    __fs_reclaim_acquire+0x44/0x50
    fs_reclaim_acquire+0xbe/0x100
    fs_reclaim_correct_nesting+0x20/0x70
    dotest+0x5e/0x148
    locking_selftest+0x2854/0x2a88
    start_kernel+0x3b2/0x4f0
    startup_continue+0x2e/0x40
   INITIAL USE at:
   __lock_acquire+0x5ae/0x15a0
   lock_acquire+0x14c/0x400
   __fs_reclaim_acquire+0x44/0x50
   fs_reclaim_acquire+0xbe/0x100
   fs_reclaim_correct_nesting+0x20/0x70
   dotest+0x5e/0x148
   locking_selftest+0x2854/0x2a88
   start_kernel+0x3b2/0x4f0
   startup_continue+0x2e/0x40
 }
 ... key      at: [<000001c991f498e8>] __fs_reclaim_map+0x0/0x30
 ... acquired at:
   check_prev_add+0x178/0xf40
   __lock_acquire+0x12aa/0x15a0
   lock_acquire+0x14c/0x400
   __fs_reclaim_acquire+0x44/0x50
   fs_reclaim_acquire+0xbe/0x100
   __kmalloc_cache_noprof+0x5a/0x6d0
   kobject_uevent_env+0xd4/0x420
   ap_send_se_bind_uevent+0x48/0x70
   se_bind_store+0x146/0x3a0
   kernfs_fop_write_iter+0x18c/0x270
   vfs_write+0x23c/0x380
   ksys_write+0x88/0x120
   __do_syscall+0x170/0x750
   system_call+0x72/0x90
stack backtrace:
CPU: 6 UID: 0 PID: 11034 Comm: setupseguest.sh Not tainted 7.1.0-20260601.rc6.git2.516b5dbd4d4a.300.fc44.s390x+debug #1 PREEMPT
Hardware name: IBM 9175 ME1 701 (KVM/Linux)
Call Trace:
 [<000001c98ffa0a7e>] dump_stack_lvl+0xae/0x108
 [<000001c9900a6d7a>] print_bad_irq_dependency+0x47a/0x480
 [<000001c9900a7184>] check_irq_usage+0x404/0x4c0
 [<000001c9900a73b8>] check_prev_add+0x178/0xf40
 [<000001c9900aaf1a>] __lock_acquire+0x12aa/0x15a0
 [<000001c9900ab35c>] lock_acquire+0x14c/0x400
 [<000001c9903be454>] __fs_reclaim_acquire+0x44/0x50
 [<000001c9903be51e>] fs_reclaim_acquire+0xbe/0x100
 [<000001c9903cf4ca>] __kmalloc_cache_noprof+0x5a/0x6d0
 [<000001c9910ca9d4>] kobject_uevent_env+0xd4/0x420
 [<000001c990d84098>] ap_send_se_bind_uevent+0x48/0x70
 [<000001c990d87416>] se_bind_store+0x146/0x3a0
 [<000001c99057da7c>] kernfs_fop_write_iter+0x18c/0x270
 [<000001c99047712c>] vfs_write+0x23c/0x380
 [<000001c990477438>] ksys_write+0x88/0x120
 [<000001c9910f64e0>] __do_syscall+0x170/0x750
 [<000001c99110a412>] system_call+0x72/0x90
INFO: lockdep is turned off.

Fixes: 4179c3984227 ("s390/ap: Implement SE bind and associate uevents")
Reported-by: Ingo Franzki <ifranzki@linux.ibm.com>
Suggested-by: Finn Callies <fcallies@linux.ibm.com>
Reviewed-by: Finn Callies <fcallies@linux.ibm.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
8 days agoASoC: SOF: amd: set ipc flags to zero
Vijendar Mukunda [Tue, 9 Jun 2026 16:08:45 +0000 (21:38 +0530)] 
ASoC: SOF: amd: set ipc flags to zero

As per design, set IPC conf structure flags to zero during acp init
sequence.

Link: https://github.com/thesofproject/linux/pull/5642
Signed-off-by: Vijendar Mukunda <Vijendar.Mukunda@amd.com>
Tested-by: Umang Jain <uajain@igalia.com>
Link: https://patch.msgid.link/20260609160938.3717513-2-Vijendar.Mukunda@amd.com
Signed-off-by: Mark Brown <broonie@kernel.org>
8 days agoASoC: SOF: amd: fix for ipc flags check
Vijendar Mukunda [Tue, 9 Jun 2026 16:08:44 +0000 (21:38 +0530)] 
ASoC: SOF: amd: fix for ipc flags check

Firmware will set dsp_ack to 1 when firmware sends response for the IPC
command issued by host. Similarly dsp_msg flag will be updated to 1.

During ACP D0 entry, the value read from the sof_dsp_ack_write scratch
flag can be uninitialized. A non-zero garbage value is treated as a
pending DSP IPC ack before SOF_FW_BOOT_COMPLETE, causing a spurious
"IPC reply before FW_BOOT_COMPLETE" log.

Fix the condition checks for ipc flags.

Fixes: 738a2b5e2cc9 ("ASoC: SOF: amd: Add IPC support for ACP IP block")
Link: https://github.com/thesofproject/linux/pull/5642
Signed-off-by: Vijendar Mukunda <Vijendar.Mukunda@amd.com>
Tested-by: Umang Jain <uajain@igalia.com>
Link: https://patch.msgid.link/20260609160938.3717513-1-Vijendar.Mukunda@amd.com
Signed-off-by: Mark Brown <broonie@kernel.org>
8 days agoMerge branch 'net-ethtool-let-ops-locked-drivers-run-without-rtnl_lock'
Jakub Kicinski [Tue, 9 Jun 2026 17:13:08 +0000 (10:13 -0700)] 
Merge branch 'net-ethtool-let-ops-locked-drivers-run-without-rtnl_lock'

Jakub Kicinski says:

====================
net: ethtool: let ops locked drivers run without rtnl_lock

With the ethtool_get_link_ksettings() situation hopefully ironed out
the previous series (commit 6a5d837f0ce2) let's return to the main
part of the series.

We have been slowly moving towards removing the rtnl_lock dependency
in driver ops since the concept of "ops-locked" drivers have been
introduced last year. Since last year will take the netdev instance
lock before invoking any ndo or ethtool op of "ops-locked" drivers.

We dipped our toes into rtnl_lock-less ops with the queue binding API.
Queue stats, NAPI, and other netdev-netlink objects are also queried
without holding rtnl_lock already. It's time to take the next logical
step and lift the requirement from ethtool ops.

The direct motivation for this patchset is that ethtool ops often
involve communicating with device FW, and may take a long time
to complete. Aggressive polling of device state on machines
with 10+ NICs have been shown to significantly increase rtnl_lock
pressure.

There's a handful of areas which still need rtnl_lock (see below).
I decided to convert everything to rtnl_lock-less by default, and
add a set of flags which let the drivers request rtnl_lock to still
be taken. I don't love this, but I'm worried that opt-in would be
even more confusing.

Known issues / exclusions:
 - qdiscs - qdisc configuration currently assumes rtnl_lock, this
   is mostly impacting set_channels callback. qdisc config is probably
   the easiest one of the exclusions to tackle, it's fairly self-contained.
 - features - even tho feature changes are (correctly) plumbed to
   the driver thru ndos they are part of ethtool uAPI. ethtool itself
   calls netdev_features_change() if it has spotted device feature change
   before vs after to the callback. Some drivers also call
   netdev_features_change() directly in response to various changes,
   e.g. setting priv flags.
   Since features have to propagate to upper and lower devices anything
   that touches features is quite hard to move from under rtnl_lock.
 - phylink - phylink and SFP depend on rtnl_lock today, I suspect
   that this is purely for historic reasons. I started poking at
   it and don't really see a need for a global lock. But accessing
   the netdev instance lock from the SFP entry points will require
   some attention from the phylink folks.
 - phydev - similar to phylink, looks quite doable. But no ops-locked
   driver currently has a phydev (fbnic only uses phylink) so phydev
   related paths retain a ASSERT_RTNL() for now.

Tested on mlx5, bnxt and fbnic.
====================

Link: https://patch.msgid.link/20260605002912.3456868-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 days agodocs: net: ethtool: document ops-locked drivers and op_needs_rtnl
Jakub Kicinski [Fri, 5 Jun 2026 00:29:12 +0000 (17:29 -0700)] 
docs: net: ethtool: document ops-locked drivers and op_needs_rtnl

Catch up various bits of documentation after the locking changes.

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-13-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 days agonet: ethtool: optionally skip rtnl_lock on IOCTL path
Jakub Kicinski [Fri, 5 Jun 2026 00:29:11 +0000 (17:29 -0700)] 
net: ethtool: optionally skip rtnl_lock on IOCTL path

Convert the IOCTL path similarly to how we converted Netlink.
The device lookup gets a little hairy. We could take rtnl_lock
unconditionally and drop it before calling the driver (this would
avoid the reference + liveness check). But I think being able
to make progress even if rtnl is dead-locked is quite useful.

First extra concern is handling features. List all the cmds which
modify features and always take rtnl_lock. We could fold this list
into ethtool_ioctl_needs_rtnl() but seems cleaner to keep
ethtool_ioctl_needs_rtnl() driver-related. If a driver changed
features and we were not holding rtnl_lock - warn about it.
It can only happen on buggy ops locked drivers (buggy because
they should have set appropriate "I need rtnl for op X" bit).

Second wrinkle is the PHY ID hack which drops the locks while
sleeping. Convert its static "busy" variable which used to
be protected by rtnl_lock to a field in struct ethtool_netdev_state.
This feature is about identifying an adapter or a port within
a system, so being able to blink multiple LEDs at the same
time is likely not very useful in practice. But it's the simplest
fix, we can add a mutex if someone thinks a system should only
be ID'ing one port at a time.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-12-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 days agonet: ethtool: ioctl: concentrate the locking
Jakub Kicinski [Fri, 5 Jun 2026 00:29:10 +0000 (17:29 -0700)] 
net: ethtool: ioctl: concentrate the locking

Add another layer of helper functions to make upcoming locking
changes easier. Otherwise we'd need a pretty complex goto
structure. netdev instance lock is now taken slightly sooner
but that should not be an issue since rtnl_lock is already held,
anyway.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 days agonet: ethtool: optionally skip rtnl_lock in RSS context handlers
Jakub Kicinski [Fri, 5 Jun 2026 00:29:09 +0000 (17:29 -0700)] 
net: ethtool: optionally skip rtnl_lock in RSS context handlers

Skip rtnl_lock in RSS context handlers if device is ops-locked.
Fairly trivial conversion. bnxt needed rtnl_lock for changing
the main context but looks like additional contexts are fine
without it.

Note (for review bots?) that ethnl_ops_begin() checks whether
the device is still registered.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 days agonet: ethtool: optionally skip rtnl_lock in ethnl_act_module_fw_flash()
Jakub Kicinski [Fri, 5 Jun 2026 00:29:08 +0000 (17:29 -0700)] 
net: ethtool: optionally skip rtnl_lock in ethnl_act_module_fw_flash()

Module firmware flashing reads SFF-8024 identifier bytes via
.get_module_eeprom_by_page(). Other than that it modifies
a bit in the netdev->ethtool struct. Both should be ops-locked
at this point.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 days agonet: ethtool: optionally skip rtnl_lock in ethnl_tsinfo_dumpit()
Jakub Kicinski [Fri, 5 Jun 2026 00:29:07 +0000 (17:29 -0700)] 
net: ethtool: optionally skip rtnl_lock in ethnl_tsinfo_dumpit()

ethnl_tsinfo_dumpit() iterates netdevs and per-netdev PHY topology
calling ops->get_ts_info(). Switch to the "ops compat locking"
helpers which take either rtnl_lock or instance lock, depending
on what the device needs.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 days agonet: ethtool: optionally skip rtnl_lock in cable test handlers
Jakub Kicinski [Fri, 5 Jun 2026 00:29:06 +0000 (17:29 -0700)] 
net: ethtool: optionally skip rtnl_lock in cable test handlers

Skip rtnl_lock in cable test handlers. This is really a noop since
no ops locked device supports these.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 days agonet: ethtool: optionally skip rtnl_lock on Netlink path for SET ops
Jakub Kicinski [Fri, 5 Jun 2026 00:29:05 +0000 (17:29 -0700)] 
net: ethtool: optionally skip rtnl_lock on Netlink path for SET ops

Make ethtool not take rtnl_lock for SET commands when operation
is performed on an ops-locked driver. cfg/cfg_pending are now
ops-locked, since only ethtool modifies them.

Some SET driver callbacks will still need rtnl_lock, most notably
those which may end up calling netdev_update_features() or the qdisc
layer (via netif_set_real_num_tx_queues()). Let drivers selectively
opt back into the rtnl_lock with a new bitfield in ops.

We need two helpers since Netlink and ioctl cmds have different
values. Keep the helpers side by side in common.h to make sure
they get updated together, even tho they will only get called
from ioctl.c and netlink.c.

SET commands which don't use ethnl_default_set_doit() are converted
by subsequent commits.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 days agonet: ethtool: optionally skip rtnl_lock on Netlink path for GET ops
Jakub Kicinski [Fri, 5 Jun 2026 00:29:04 +0000 (17:29 -0700)] 
net: ethtool: optionally skip rtnl_lock on Netlink path for GET ops

ethnl_default_doit() and ethnl_default_dump_one() are both used
exclusively for GET callbacks (former to get info for a single
device or get global strings). ops-locked devices don't need
rtnl_lock for GET callbacks, stop taking it.

Introduce an opt-out mechanism for devices which use phylink (fbnic)
since phylink currently depends on rtnl_lock protection. Subsequent
patches will add more exceptions, anyway. Practically the new helpers
for judging if command needs rtnl_lock could also call
netdev_need_ops_lock() but I find that it makes the code in the callers
slightly less obvious.

Add a helper for IOCTLs already, even tho it's unused so that
we can keep them in sync as the series progresses.

This is the first user-visible step of moving ethtool ops out
from under rtnl. Subsequent patches do the same for SET ops,
as well as the ioctl path.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 days agonet: ethtool: make dev->hwprov ops-protected
Jakub Kicinski [Fri, 5 Jun 2026 00:29:03 +0000 (17:29 -0700)] 
net: ethtool: make dev->hwprov ops-protected

dev->hwprov tracks the active hwtstamp provider for the device.
Make it ops protected (instance lock if the netdev driver opts
into holding instance lock around callbacks, otherwise rtnl_lock).

hwprov is written and read in:
 - drivers/net/phy/phy_device.c
    phydev and ops protection don't currently mix, add a comment
 - net/ethtool/
    as of now holds both rtnl lock and ops lock, this one will
    soon only hold one lock or the other

read in:
 - net/core/dev_ioctl.c
    holds both rtnl lock and ops lock
 - net/core/timestamping.c
    RCU reader

The new netdev_ops_lock_dereference() helper does not have
"compat" in the name. The name would be quite long and I think
in this case it should be obvious that we need _a_ lock.
netdev_lock_dereference() already exists and means dev->lock
is always expected.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 days agonet: ethtool: relax ethnl_req_get_phydev() locking assertion
Jakub Kicinski [Fri, 5 Jun 2026 00:29:02 +0000 (17:29 -0700)] 
net: ethtool: relax ethnl_req_get_phydev() locking assertion

phydev <> netdev linking and lifecycle depends on rtnl_lock.
We want to switch to instance locks for most ethtool ops.
Let's add an assert that ops locked devices don't use phydev
today. If one does we can either opt the phy ops out of
being purely ops locked, or do deeper surgery to make phy
locking ops-compatible. I don't think there's any fundamental
challenge to make that work.

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 days agonet: ethtool: serialize broadcast notification sequence allocation
Jakub Kicinski [Fri, 5 Jun 2026 00:29:01 +0000 (17:29 -0700)] 
net: ethtool: serialize broadcast notification sequence allocation

ethnl_bcast_seq is a global counter stamped into the nlmsg_seq field
of every multicast notification, allowing userspace to detect dropped
messages. Today the ordering is achieved by using rtnl_lock().

Moving forward we will want ethtool ops to run under just the netdev
instance lock so to establish ordering we need a separate lock
for notifications. With the netdev instance locks operations on
different devices may bypass each other but the expectation is
that it should not matter. What we need to prevent is:
 - notification IDs getting out of order
 - operations on one device getting out of order

For simplicity defer allocating the ID of the notification right
before the notification is delivered. This removes the need for
special handling in ethnl_rss_create_send_ntf().

Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 days agoigc: skip RX timestamp header for frame preemption verification
KhaiWenTan [Fri, 24 Apr 2026 07:59:07 +0000 (15:59 +0800)] 
igc: skip RX timestamp header for frame preemption verification

When RX hardware timestamping is enabled, a 16-byte inline timestamp header
is added to the start of the packet buffer, causing FPE handshake
verification to fail.

Because an incorrect packet buffer is passed to igc_fpe_handle_mpacket(),
the mem_is_zero() check inspects the timestamp metadata instead of the
actual mPacket payload. As a result, valid Verify/Response mPackets can be
missed when inline RX timestamps are present.

Pass pktbuf + pkt_offset to igc_fpe_handle_mpacket() so it inspects the
actual mPacket payload instead of the timestamp header.

Fixes: 5422570c0010 ("igc: add support for frame preemption verification")
Co-developed-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Signed-off-by: KhaiWenTan <khai.wen.tan@linux.intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
8 days agoixgbe: do not configure xps for XDP queues
Larysa Zaremba [Mon, 18 May 2026 11:15:04 +0000 (13:15 +0200)] 
ixgbe: do not configure xps for XDP queues

netif_set_xps_queue() should not be called for an XDP Tx queue, since such
queues are not netdev-exposed. On systems with number of CPUs >=64, on E610
adapter, netdev is configured with maximum number queue pairs being 63
(due to MSI-X assignment), but configuring XDP results in 64 XDP queues.

So, during XDP program load, when netif_set_xps_queue() is called for the
last XDP queue, we get a WARNING with a call trace and KASAN report
afterwards (if enabled).

[ 2012.699800] WARNING: net/core/dev.c:2854 at __netif_set_xps_queue+0x116a/0x1e40, CPU#36: xdpsock/103668
[...]
[ 2012.700029] RIP: 0010:__netif_set_xps_queue+0x116a/0x1e40
[ 2012.700035] Code: b6 34 06 48 89 f8 83 e0 07 83 c0 01 40 38 f0 7c 09 40 84 f6 0f 85 03 0a 00 00 0f b7 44 24 40 66 43 89 44 6a 18 e9 01 fb ff ff <0f> 0b e9 f2 ee ff ff 44 8b 44 24 44 45 85 c0 74 50 4d 85 e4 0f 84
[ 2012.700040] RSP: 0018:ffff8882369aeb28 EFLAGS: 00010246
[ 2012.700046] RAX: 0000000000000000 RBX: 000000000000003f RCX: 0000000000000000
[ 2012.700050] RDX: 1ffff1111da3d891 RSI: ffff888120e34250 RDI: ffff8888ed1ec488
[ 2012.700054] RBP: ffff888913281560 R08: 0000000000000000 R09: ffff8888ed1ec000
[ 2012.700058] R10: ffff8888a2e83180 R11: 0000000000000000 R12: 0000000000007fa8
[ 2012.700061] R13: 000000000000003f R14: ffff888120e34854 R15: ffff8889132817c8
[ 2012.700065] FS:  00007fc8ea9ff740(0000) GS:ffff88884cefe000(0000) knlGS:0000000000000000
[ 2012.700069] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2012.700073] CR2: 00007f81c8000020 CR3: 00000002299f8006 CR4: 00000000007726f0
[ 2012.700077] PKRU: 55555554
[ 2012.700080] Call Trace:
[ 2012.700084]  <TASK>
[ 2012.700087]  ? ktime_get+0x61/0x150
[ 2012.700097]  ? usleep_range_state+0x133/0x1b0
[ 2012.700108]  ? __pfx_usleep_range_state+0x10/0x10
[ 2012.700114]  netif_set_xps_queue+0x31/0x50
[ 2012.700119]  ixgbe_configure_tx_ring+0x472/0x920 [ixgbe]
[...]
[ 2012.700486]  ixgbe_xdp+0x38f/0x750 [ixgbe]

[...]

[ 2012.701094] BUG: KASAN: slab-out-of-bounds in __netif_set_xps_queue+0x1ac5/0x1e40
[ 2012.701100] Write of size 4 at addr ffff88888d43cff8 by task xdpsock/103668

Skip XPS configuration for XDP Tx queues.

Fixes: 33fdc82f0883 ("ixgbe: add support for XDP_TX action")
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Patryk Holda <patryk.holda@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
8 days agoidpf: add padding to PTP virtchnl structures
Przemyslaw Korba [Mon, 25 May 2026 08:38:03 +0000 (10:38 +0200)] 
idpf: add padding to PTP virtchnl structures

Add padding to virtchnl2 PTP structures to match the Control Plane
expected message sizes:
* virtchnl2_ptp_get_dev_clk_time: 8 -> 16 bytes
* virtchnl2_ptp_set_dev_clk_time: 8 -> 16 bytes
* virtchnl2_ptp_get_cross_time: 16 -> 24 bytes

The FW expects the above sizes and PTP negotiation fails due to the
mismatch. Previously neither the FW nor the driver checked message/reply
sizes strictly, so the problem appeared only after recent validation
improvements.

reproduction steps:
ptp4l -i <pf> -m
Observe: failed to open /dev/ptp0: Permission denied

Fixes: bf27283ba594 ("virtchnl: add PTP virtchnl definitions")
Cc: stable@vger.kernel.org
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Przemyslaw Korba <przemyslaw.korba@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
8 days agoMerge tag 'microchip-dt64-7.2' of https://git.kernel.org/pub/scm/linux/kernel/git...
Arnd Bergmann [Tue, 9 Jun 2026 16:46:40 +0000 (18:46 +0200)] 
Merge tag 'microchip-dt64-7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/at91/linux into soc/dt

Microchip ARM64 device tree updates for v7.2

This update includes:
- the device tree node for the OTP controller on LAN9691 SoC

* tag 'microchip-dt64-7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/at91/linux:
  arm64: dts: microchip: lan969x: add OTP node

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
8 days agoMerge tag 'at91-dt-7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/at91/linux...
Arnd Bergmann [Tue, 9 Jun 2026 16:45:32 +0000 (18:45 +0200)] 
Merge tag 'at91-dt-7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/at91/linux into soc/dt

Microchip AT91 device tree updates for v7.2

This update includes:
- I3C controller device tree node for SAMA7D65 SoC

* tag 'at91-dt-7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/at91/linux:
  ARM: dts: microchip: add I3C controller

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
8 days agoMerge tag 'omap-for-v7.2/dt-signed' of git://git.kernel.org/pub/scm/linux/kernel...
Arnd Bergmann [Tue, 9 Jun 2026 16:43:36 +0000 (18:43 +0200)] 
Merge tag 'omap-for-v7.2/dt-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap into soc/dt

ARM: dts: OMAP updates for v7.2

Minor DTS fixes & updates

* tag 'omap-for-v7.2/dt-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap:
  ARM: dts: dm8168-evm: Set stdout-path to uart3
  arch: arm: dts: cpcap-mapphone: Add audio-codec jack detection interrupts
  arch: arm: dts: cpcap-mapphone: Set VAUDIO regulator always-on
  ARM: dts: ti/omap: omap4-epson-embt2ws: fix typo in iio device property
  ARM: dts: am335x-sl50: Fix audio bitclock and frame master endpoint

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
8 days agoMerge tag 'mvebu-arm-7.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/gclement...
Arnd Bergmann [Tue, 9 Jun 2026 16:41:32 +0000 (18:41 +0200)] 
Merge tag 'mvebu-arm-7.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/gclement/mvebu into soc/arm

mvebu arm for 7.2 (part 1)

Orion5x: Replace machine_is_mss2() with of_machine_is_compatible() in mss2_pci_init()

mvebu_v5_defconfig: Remove stale MACH_LINKSTATION_LSCHL reference

Armada 370: Simplify of_node_put calls and drop redundant NULL checks

* tag 'mvebu-arm-7.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/gclement/mvebu:
  ARM: orion5x: update board check in mss2_pci_init() to use the DT
  arm: mvebu_v5_defconfig: remove stale MACH_LINKSTATION_LSCHL reference
  ARM: mvebu: simplify of_node_put calls
  ARM: mvebu: drop unnecessary NULL check

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
8 days agoarm64: dts: aspeed: Add initial AST27xx SoC device tree
Ryan Chen [Tue, 9 Jun 2026 02:47:20 +0000 (10:47 +0800)] 
arm64: dts: aspeed: Add initial AST27xx SoC device tree

Add initial device tree support for the ASPEED AST27xx family, the
8th-generation Baseboard Management Controller (BMC) SoCs.

AST27xx SOC Family
 - https://www.aspeedtech.com/server_ast2700/
 - https://www.aspeedtech.com/server_ast2720/
 - https://www.aspeedtech.com/server_ast2750/

The AST27xx features a dual-SoC architecture consisting of two dies,
referred to as SoC0 and SoC1 - interconnected through an internal
proprietary bus. Both SoCs share the same address decoding scheme,
while each maintains independent clock and reset domains.

- SoC0 (CPU die): contains a quad-core Cortex-A35 cluster and two
  Cortex-M4 cores, along with high-speed peripherals.
- SoC1 (I/O die): includes the BootMCU (responsible for system
  boot) and its own clock/reset domains low-speed peripherals.

The device tree describes the SoC0 and SoC1 domains and their peripheral
layouts.

Signed-off-by: Ryan Chen <ryan_chen@aspeedtech.com>
Link: https://lore.kernel.org/r/20260609-upstream_ast2700-v9-3-f631752f0cb1@aspeedtech.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
8 days agobtrfs: fix use-after-free after relocation failure with concurrent COW
Filipe Manana [Fri, 5 Jun 2026 15:15:37 +0000 (16:15 +0100)] 
btrfs: fix use-after-free after relocation failure with concurrent COW

If we get a failure during relocation, before we update all the extent
buffers that have file extent items pointing to extents from the block
group being relocated, we can trigger a user-after-free on the reloc
control structure (fs_info->reloc_control) if we have a concurrent task
that is COWing a subvolume leaf.

This happens like this:

1) Relocation of data block group X starts;

2) Relocation changes its state to UPDATE_DATA_PTRS;

3) A task doing a rename for example, COWs leaf A from a subvolume tree
   and ends up at btrfs_reloc_cow_block() and extracts fs_info->reloc_ctl
   into a local variable, which then passes to replace_file_extents();

4) The relocation task gets an error and under the label 'out_put_bg' in
   btrfs_relocate_block_group() calls free_reloc_control(), which frees
   the reloc control structure that the rename task is using;

5) The rename task triggers a use-after-free on the reloc control
   structure that was just freed.

Syzbot reported this recently, with the following stack trace:

   [   88.389822][ T5325] BTRFS error (device loop0 state A): Transaction aborted (error -5)
   [   88.389842][ T5325] BTRFS: error (device loop0 state A) in cleanup_transaction:2067: errno=-5 IO failure
   [   88.389864][ T5325] BTRFS info (device loop0 state EA): forced readonly
   [   88.392277][ T5324] BTRFS: error (device loop0 state EA) in btrfs_sync_log:3572: errno=-5 IO failure
   [   88.396630][ T5325] BTRFS info (device loop0 state EA): balance: ended with status: -5
   [   88.400135][ T5346] ==================================================================
   [   88.400148][ T5346] BUG: KASAN: slab-use-after-free in replace_file_extents+0x85f/0x1590
   [   88.400288][ T5346] Read of size 8 at addr ffff888012312010 by task syz.0.0/5346
   [   88.400299][ T5346]
   [   88.400306][ T5346] CPU: 0 UID: 0 PID: 5346 Comm: syz.0.0 Not tainted syzkaller #0 PREEMPT(full)
   [   88.400319][ T5346] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
   [   88.400325][ T5346] Call Trace:
   [   88.400331][ T5346]  <TASK>
   [   88.400336][ T5346]  dump_stack_lvl+0xe8/0x150
   [   88.400351][ T5346]  print_address_description+0x55/0x1e0
   [   88.400364][ T5346]  ? replace_file_extents+0x85f/0x1590
   [   88.400378][ T5346]  print_report+0x58/0x70
   [   88.400389][ T5346]  kasan_report+0x117/0x150
   [   88.400405][ T5346]  ? replace_file_extents+0x85f/0x1590
   [   88.400420][ T5346]  replace_file_extents+0x85f/0x1590
   [   88.400440][ T5346]  ? __pfx_replace_file_extents+0x10/0x10
   [   88.400452][ T5346]  ? update_ref_for_cow+0xa71/0x1270
   [   88.400473][ T5346]  btrfs_force_cow_block+0xa4d/0x2450
   [   88.400492][ T5346]  ? __pfx_btrfs_force_cow_block+0x10/0x10
   [   88.400508][ T5346]  ? __pfx_btrfs_get_32+0x10/0x10
   [   88.400523][ T5346]  btrfs_cow_block+0x3c4/0xa90
   [   88.400542][ T5346]  push_leaf_left+0x2ac/0x4a0
   [   88.400561][ T5346]  split_leaf+0xd16/0x12e0
   [   88.400574][ T5346]  ? btrfs_bin_search+0x924/0xc70
   [   88.400592][ T5346]  ? __pfx_split_leaf+0x10/0x10
   [   88.400602][ T5346]  ? leaf_space_used+0x177/0x1e0
   [   88.400618][ T5346]  ? btrfs_leaf_free_space+0x14a/0x2f0
   [   88.400634][ T5346]  btrfs_search_slot+0x2641/0x2d20
   [   88.400654][ T5346]  ? __pfx_btrfs_search_slot+0x10/0x10
   [   88.400669][ T5346]  ? rcu_is_watching+0x15/0xb0
   [   88.400681][ T5346]  ? trace_kmem_cache_alloc+0x29/0xe0
   [   88.400694][ T5346]  btrfs_insert_empty_items+0x9c/0x190
   [   88.400711][ T5346]  btrfs_insert_inode_ref+0x229/0xcb0
   [   88.400724][ T5346]  ? __pfx_btrfs_insert_inode_ref+0x10/0x10
   [   88.400736][ T5346]  ? __pfx_btrfs_qgroup_convert_reserved_meta+0x10/0x10
   [   88.400751][ T5346]  ? btrfs_record_root_in_trans+0x124/0x180
   [   88.400767][ T5346]  ? start_transaction+0x8a0/0x1820
   [   88.400778][ T5346]  ? btrfs_set_inode_index+0x5e/0x100
   [   88.400787][ T5346]  btrfs_rename2+0x17bb/0x40d0
   [   88.400800][ T5346]  ? check_noncircular+0xda/0x150
   [   88.400814][ T5346]  ? add_lock_to_list+0xc7/0x100
   [   88.400828][ T5346]  ? __pfx_btrfs_rename2+0x10/0x10
   [   88.400842][ T5346]  ? lockdep_hardirqs_on+0x7a/0x110
   [   88.400901][ T5346]  ? lock_acquire+0x221/0x350
   [   88.400915][ T5346]  ? down_write_nested+0x174/0x210
   [   88.400931][ T5346]  ? __pfx_down_write_nested+0x10/0x10
   [   88.400941][ T5346]  ? do_raw_spin_unlock+0x4d/0x210
   [   88.400952][ T5346]  ? try_break_deleg+0x5b/0x180
   [   88.400963][ T5346]  ? __pfx_btrfs_rename2+0x10/0x10
   [   88.400973][ T5346]  vfs_rename+0xa96/0xeb0
   [   88.400992][ T5346]  ? __pfx_vfs_rename+0x10/0x10
   [   88.401010][ T5346]  ovl_fill_super+0x46b7/0x5e20
   [   88.401030][ T5346]  ? __pfx_ovl_fill_super+0x10/0x10
   [   88.401042][ T5346]  ? xas_create+0x1902/0x1b90
   [   88.401060][ T5346]  ? __pfx___mutex_trylock_common+0x10/0x10
   [   88.401076][ T5346]  ? trace_contention_end+0x3d/0x140
   [   88.401094][ T5346]  ? shrinker_register+0x124/0x230
   [   88.401111][ T5346]  ? __mutex_unlock_slowpath+0x1be/0x6f0
   [   88.401127][ T5346]  ? shrinker_register+0x61/0x230
   [   88.401143][ T5346]  ? __pfx___mutex_lock+0x10/0x10
   [   88.401158][ T5346]  ? __pfx___mutex_unlock_slowpath+0x10/0x10
   [   88.401177][ T5346]  ? __raw_spin_lock_init+0x45/0x100
   [   88.401196][ T5346]  ? sget_fc+0x962/0xa40
   [   88.401208][ T5346]  ? __pfx_set_anon_super_fc+0x10/0x10
   [   88.401222][ T5346]  ? __pfx_ovl_fill_super+0x10/0x10
   [   88.401241][ T5346]  get_tree_nodev+0xbb/0x150
   [   88.401257][ T5346]  vfs_get_tree+0x92/0x2a0
   [   88.401272][ T5346]  do_new_mount+0x341/0xd30
   [   88.401283][ T5346]  ? apparmor_capable+0x126/0x170
   [   88.401301][ T5346]  ? __pfx_do_new_mount+0x10/0x10
   [   88.401311][ T5346]  ? ns_capable+0x89/0xe0
   [   88.401322][ T5346]  ? path_mount+0x690/0x10e0
   [   88.401333][ T5346]  ? user_path_at+0xd4/0x160
   [   88.401346][ T5346]  __se_sys_mount+0x31d/0x420
   [   88.401358][ T5346]  ? __pfx___se_sys_mount+0x10/0x10
   [   88.401370][ T5346]  ? __x64_sys_mount+0x20/0xc0
   [   88.401381][ T5346]  ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [   88.401391][ T5346]  do_syscall_64+0x15f/0xf80
   [   88.401403][ T5346]  ? trace_irq_disable+0x3b/0x140
   [   88.401413][ T5346]  ? clear_bhb_loop+0x40/0x90
   [   88.401421][ T5346]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [   88.401429][ T5346] RIP: 0033:0x7fa1ff79ce59
   [   88.401436][ T5346] Code: ff c3 66 (...)
   [   88.401443][ T5346] RSP: 002b:00007fa2005affe8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
   [   88.401456][ T5346] RAX: ffffffffffffffda RBX: 00007fa1ffa16180 RCX: 00007fa1ff79ce59
   [   88.401464][ T5346] RDX: 0000200000000100 RSI: 0000200000002240 RDI: 0000000000000000
   [   88.401474][ T5346] RBP: 00007fa1ff832d6f R08: 0000200000000440 R09: 0000000000000000
   [   88.401481][ T5346] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   [   88.401488][ T5346] R13: 00007fa1ffa16218 R14: 00007fa1ffa16180 R15: 00007ffc734fba78
   [   88.401500][ T5346]  </TASK>
   [   88.401506][ T5346]
   [   88.401510][ T5346] Allocated by task 5325:
   [   88.401516][ T5346]  kasan_save_track+0x3e/0x80
   [   88.401529][ T5346]  __kasan_kmalloc+0x93/0xb0
   [   88.401542][ T5346]  __kmalloc_cache_noprof+0x31c/0x660
   [   88.401554][ T5346]  btrfs_relocate_block_group+0x217/0xc40
   [   88.401568][ T5346]  btrfs_relocate_chunk+0x115/0x820
   [   88.401577][ T5346]  __btrfs_balance+0x1db0/0x2ae0
   [   88.401587][ T5346]  btrfs_balance+0xaf3/0x11b0
   [   88.401596][ T5346]  btrfs_ioctl_balance+0x3d3/0x610
   [   88.401612][ T5346]  __se_sys_ioctl+0xfc/0x170
   [   88.401626][ T5346]  do_syscall_64+0x15f/0xf80
   [   88.401640][ T5346]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [   88.401650][ T5346]
   [   88.401653][ T5346] Freed by task 5325:
   [   88.401659][ T5346]  kasan_save_track+0x3e/0x80
   [   88.401671][ T5346]  kasan_save_free_info+0x46/0x50
   [   88.401680][ T5346]  __kasan_slab_free+0x5c/0x80
   [   88.401692][ T5346]  kfree+0x1c5/0x640
   [   88.401703][ T5346]  btrfs_relocate_block_group+0x95d/0xc40
   [   88.401715][ T5346]  btrfs_relocate_chunk+0x115/0x820
   [   88.401724][ T5346]  __btrfs_balance+0x1db0/0x2ae0
   [   88.401733][ T5346]  btrfs_balance+0xaf3/0x11b0
   [   88.401742][ T5346]  btrfs_ioctl_balance+0x3d3/0x610
   [   88.401757][ T5346]  __se_sys_ioctl+0xfc/0x170
   [   88.401770][ T5346]  do_syscall_64+0x15f/0xf80
   [   88.401785][ T5346]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [   88.401795][ T5346]
   [   88.401798][ T5346] The buggy address belongs to the object at ffff888012312000
   [   88.401798][ T5346]  which belongs to the cache kmalloc-2k of size 2048
   [   88.401807][ T5346] The buggy address is located 16 bytes inside of
   [   88.401807][ T5346]  freed 2048-byte region [ffff888012312000ffff888012312800)
   [   88.401819][ T5346]
   [   88.401822][ T5346] The buggy address belongs to the physical page:
   [   88.401829][ T5346] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12310
   [   88.401840][ T5346] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
   [   88.401849][ T5346] flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
   [   88.401860][ T5346] page_type: f5(slab)
   [   88.401871][ T5346] raw: 00fff00000000040 ffff88801ac42000 dead000000000100 dead000000000122
   [   88.401881][ T5346] raw: 0000000000000000 0000000800080008 00000000f5000000 0000000000000000
   [   88.401892][ T5346] head: 00fff00000000040 ffff88801ac42000 dead000000000100 dead000000000122
   [   88.401902][ T5346] head: 0000000000000000 0000000800080008 00000000f5000000 0000000000000000
   [   88.401913][ T5346] head: 00fff00000000003 fffffffffffffe01 00000000ffffffff 00000000ffffffff
   [   88.401923][ T5346] head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
   [   88.401929][ T5346] page dumped because: kasan: bad access detected
   [   88.401935][ T5346] page_owner tracks the page as allocated
   [   88.401941][ T5346] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 9, tgid 9 (kworker/0:0), ts 83905464494, free_ts 83674944822
   [   88.401961][ T5346]  post_alloc_hook+0x231/0x280
   [   88.401975][ T5346]  get_page_from_freelist+0x24ba/0x2540
   [   88.401990][ T5346]  __alloc_frozen_pages_noprof+0x18d/0x380
   [   88.402004][ T5346]  allocate_slab+0x77/0x660
   [   88.402019][ T5346]  refill_objects+0x339/0x3d0
   [   88.402033][ T5346]  __pcs_replace_empty_main+0x321/0x720
   [   88.402043][ T5346]  __kmalloc_node_track_caller_noprof+0x572/0x7b0
   [   88.402055][ T5346]  __alloc_skb+0x2c1/0x7d0
   [   88.402067][ T5346]  mld_newpack+0x14c/0xc90
   [   88.402080][ T5346]  add_grhead+0x5a/0x2a0
   [   88.402093][ T5346]  add_grec+0x1452/0x1740
   [   88.402105][ T5346]  mld_ifc_work+0x6e6/0xe70
   [   88.402116][ T5346]  process_scheduled_works+0xb5d/0x1860
   [   88.402127][ T5346]  worker_thread+0xa53/0xfc0
   [   88.402138][ T5346]  kthread+0x389/0x470
   [   88.402150][ T5346]  ret_from_fork+0x514/0xb70
   [   88.402161][ T5346] page last free pid 5282 tgid 5282 stack trace:
   [   88.402168][ T5346]  __free_frozen_pages+0xbc7/0xd30
   [   88.402180][ T5346]  __slab_free+0x274/0x2c0
   [   88.402191][ T5346]  qlist_free_all+0x99/0x100
   [   88.402201][ T5346]  kasan_quarantine_reduce+0x148/0x160
   [   88.402211][ T5346]  __kasan_slab_alloc+0x22/0x80
   [   88.402221][ T5346]  __kmalloc_cache_noprof+0x2ba/0x660
   [   88.402231][ T5346]  kernfs_fop_open+0x3f0/0xda0
   [   88.402253][ T5346]  do_dentry_open+0x785/0x14e0
   [   88.402262][ T5346]  vfs_open+0x3b/0x340
   [   88.402270][ T5346]  path_openat+0x2e08/0x3860
   [   88.402281][ T5346]  do_file_open+0x23e/0x4a0
   [   88.402292][ T5346]  do_sys_openat2+0x113/0x200
   [   88.402300][ T5346]  __x64_sys_openat+0x138/0x170
   [   88.402309][ T5346]  do_syscall_64+0x15f/0xf80
   [   88.402326][ T5346]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [   88.402336][ T5346]
   [   88.402339][ T5346] Memory state around the buggy address:
   [   88.402345][ T5346]  ffff888012311f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
   [   88.402352][ T5346]  ffff888012311f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
   [   88.402359][ T5346] >ffff888012312000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   [   88.402365][ T5346]                          ^
   [   88.402370][ T5346]  ffff888012312080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   [   88.402380][ T5346]  ffff888012312100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   [   88.402385][ T5346] ==================================================================

Fix this by:

1) Making the reloc control structure ref counted;

2) Make revery place that access fs_info->reloc_ctl outside the relocation
   code, which at the moment it's only replace_file_extents() and
   btrfs_init_reloc_root(), get a reference count on the structure.
   There's also btrfs_update_reloc_root() that is called outside the
   relocation code, but this case is safe because it's only called in
   the transaction commit path while under the fs_info->reloc_mutex
   protection, but nevertheless grab a reference to make the code more
   consistent and avoid false alerts from AI reviews;

3) Add a spinlock to protect fs_info->reloc_ctl, since we can not take the
   fs_info->reloc_mutex as that would cause a deadlock since that lock is
   taken in the transaction commit path. That spinlock is taken before
   setting fs_info->reloc_ctl to an allocated structure, setting it to
   NULL and reading fs_info->reloc_ctl;

4) Make sure the structure is freed only when its reference count drops to
   zero.

Reported-by: syzbot+0eea49bba18051dea35e@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a1df323.bb0696ed.125a22.000a.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: move WARN_ON on unexpected error in __add_tree_block()
Filipe Manana [Fri, 5 Jun 2026 16:25:50 +0000 (17:25 +0100)] 
btrfs: move WARN_ON on unexpected error in __add_tree_block()

There's no point in having the WARN_ON(1) inside the if statement for the
unexpected error. Move it into the if statement's condition, which brings
a couple benefits:

1) It marks the branch as unlikely, hinting the compiler to generate
   better code;

2) The WARN_ON() produces a stack trace after the dumped leaf and error
   message which can hide that more important information in case we get
   a truncated dmesg/syslog.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: move locking into btrfs_get_reloc_bg_bytenr()
Filipe Manana [Fri, 5 Jun 2026 16:07:08 +0000 (17:07 +0100)] 
btrfs: move locking into btrfs_get_reloc_bg_bytenr()

It does not make sense for the single caller to have the responsability
to lock the relocation mutex before calling the function and then have
the function to assert the lock is held. As this is a function in
relocation.c, move the locking details into it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: lzo: reject compressed segment that overflows the compressed input
Weiming Shi [Sun, 7 Jun 2026 05:25:13 +0000 (22:25 -0700)] 
btrfs: lzo: reject compressed segment that overflows the compressed input

lzo_decompress_bio() validates each on-disk segment length seg_len only
against the workspace cbuf size, not against the compressed input size
(compressed_len, the total folio bytes of the bio).  A crafted extent can
carry a segment whose seg_len passes the cbuf check but runs past the end
of the bio, so copy_compressed_segment() walks off the last folio:
get_current_folio() then returns the NULL folio from bio_next_folio(), and
with CONFIG_BTRFS_ASSERT disabled (default) folio_size(NULL) faults.

 BUG: KASAN: null-ptr-deref in lzo_decompress_bio (fs/btrfs/lzo.c:383)
 Read of size 8 at addr 0000000000000000 by task kworker/u8:1/29
 Workqueue: btrfs-endio simple_end_io_work
  kasan_report (mm/kasan/report.c:590)
  lzo_decompress_bio (fs/btrfs/lzo.c:383)
  end_bbio_compressed_read (fs/btrfs/compression.c:1065)
  btrfs_bio_end_io (fs/btrfs/bio.c:135)
  btrfs_check_read_bio (fs/btrfs/bio.c:180 fs/btrfs/bio.c:285)
  simple_end_io_work
  process_one_work
  worker_thread

Reject any segment whose payload would extend beyond compressed_len before
copying it, treating it as corruption like the other on-disk validation
failures in this function.

Reported-by: Xiang Mei <xmei5@asu.edu>
Fixes: a6e66e6f8c1b ("btrfs: rework lzo_decompress_bio() to make it subpage compatible")
Assisted-by: Claude:claude-opus-4-8
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: retry faulting in the pages after a zero sized short direct write
Qu Wenruo [Thu, 4 Jun 2026 00:29:48 +0000 (09:59 +0930)] 
btrfs: retry faulting in the pages after a zero sized short direct write

Currently btrfs_direct_write() will not try to fault in the pages, but
directly fall back to buffered writes, if the first page of the buffer
can not be faulted in.

For example, during generic/362 with nodatasum mount option, there is a
write at file offset 0, length PAGE_SIZE, and the page is not faulted in.
Then we go the following callchain and directly fall back to buffered
IO:

 btrfs_direct_write()
 |- btrfs_dio_write()
 |-  __iomap_dio_rw()
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_begin()
 |  |     Now an ordered extent is allocated for the 4K write.
 |  |
 |  |- iomi.status = iomap_dio_iter()
 |  |  Where iomap_dio_iter() returned -EFAULT.
 |  |
 |  |- ret = iomap_iter()
 |  |  |- btrfs_dio_iomap_end()
 |  |  |  | return -ENOTBLK
 |  |  |- return -ENOTBLK
 |  |- if (ret == -ENOTBLK) { ret = 0; }
 |     Now the return value is reset to 0.
 |
 |- ret = iomap_dio_complete()
 |  Since no byte is submitted, @ret is now zero.
 |
 |- if (iov_iter_count() > 0 && (ret == -EFAULT || ret > 0))
 |  @ret is zero, thus not meeting the above retry condition
 |
 |- Fallback to buffered

Just slightly loosen the condition to allow retry faulting in pages after
a zero sized short write.

Unlike the previous two bug fixes, this one is not really cause any real
bug, but only reducing the chance to do zero-copy direct IO.
Thus it doesn't really require stable-CC nor fixes-tag.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: fix incorrect buffered IO fallback for append direct writes
Qu Wenruo [Thu, 4 Jun 2026 00:29:47 +0000 (09:59 +0930)] 
btrfs: fix incorrect buffered IO fallback for append direct writes

[BUG]
With the previous bug of short direct writes fixed, test case
generic/362 (*) still fails with the following error with nodatasum
mount option:

 generic/362  0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
 - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
    --- tests/generic/362.out 2024-08-24 15:31:37.200000000 +0930
    +++ /home/adam/xfstests/results//generic/362.out.bad 2026-05-27 10:13:09.072485767 +0930
    @@ -1,2 +1,3 @@
     QA output created by 362
    +Wrong file size after first write, got 8192 expected 4096
     Silence is golden
    ...

*: If the test case has been executed before with default data checksum,
the failure will not reproduce. Need the following fix to make it
reliably reproducible:
https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/

[CAUSE]
Inside btrfs_dio_iomap_begin() for a direct write, we increase the isize
if it's beyond the current isize.

But if the direct io finished short, we do not revert the isize to the
previous value nor to the short write end.

Then if we need to fall back to buffered writes, and the write has
IOCB_APPEND flag, then the buffered write will be positioned at the
incorrect isize.

The call chain looks like this:

 btrfs_direct_write(pos=0, length=4K)
 |- __iomap_dio_rw()
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_begin()
 |  |     |- btrfs_get_blocks_direct_write()
 |  |        |- i_size_write()
 |  |           Which updates the isize to the write end (4K).
 |  |
 |  |- iomap_dio_iter()
 |  |  Failed with -EFAULT on the first page.
 |  |
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_end()
 |  |     Detects a short write, return -ENOTBLK
 |  |- if (ret == -ENOTBLK) { ret = 0;}
 |     Which resets the return value.
 |
 |- ret = iomap_dio_complet()
 |  Which returns 0.
 |
 |- btrfs_buffered_write(iocb, from);
    |- generic_write_checks()
       |- iocb->ki_pos = i_size_read()
          Which is still the new size (4K), other than the original
  isize 0.

[FIX]
Introduce the following btrfs_dio_data members:

- old_isize

- updated_isize
  If the direct write has enlarged the isize.

Then if we got a short write, and btrfs_dio_data::updated_isize is set,
revert to the correct isize based on old_isize and current file
position.

And here we call i_size_write() without holding an extent lock, which is
a very special case that we're safe to do:

 - Only a single writer can be enlarging isize
   Enlarging isize will take the exclusive inode lock.

 - Buffered readers need to wait for the OE we're holding
   Buffered readers will lock extent and wait for OE of the folio range.
   Sometimes we can skip the OE wait, but since all page cache is
   invalidated, the OE wait can not be skipped.

But I do not think this is the most elegant solution, nor covers all
cases. E.g. if the bio is submitted but IO failed, we are unable to do
the revert.

I believe the more elegant one would be extend the EXTENT_DIO_LOCKED
lifespan for direct writes, so that we can update the isize when a
write beyond EOF finished successfully.

However that change is too huge for a small bug fix.
So only implement the minimal partial fix for now.

[REASON FOR NO FIXES TAG]
The bug is again very old, before commit f85781fb505e ("btrfs: switch to
iomap for direct IO") we are already increasing isize without a
proper rollback for short writes.

Thus only a CC to stable.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: fix false IO failure after falling back to buffered write
Qu Wenruo [Thu, 4 Jun 2026 00:29:46 +0000 (09:59 +0930)] 
btrfs: fix false IO failure after falling back to buffered write

[BUG]
The test case generic/362 will fail with "nodatasum" mount option (*):

 MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch

 generic/362  0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
    --- tests/generic/362.out 2024-08-24 15:31:37.200000000 +0930
    +++ /home/adam/xfstests/results//generic/362.out.bad 2026-05-27 10:21:17.574771567 +0930
    @@ -1,2 +1,3 @@
     QA output created by 362
    +First write failed: Input/output error
     Silence is golden
    ...

*: If the test case has been executed before with default data checksum,
the failure will not reproduce. Need the following fix to make it
reliably reproducible:
https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/

[CAUSE]
Inside __iomap_dio_rw(), the -EFAULT/-ENOTBLK error is not directly returned.
Thus we never got an error pointer from __iomap_dio_rw().

The call chain looks like this:

 btrfs_direct_write()
 |- btrfs_dio_write()
 |-  __iomap_dio_rw()
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_begin()
 |  |     Now an ordered extent is allocated for the 4K write.
 |  |
 |  |- iomi.status = iomap_dio_iter()
 |  |  Where iomap_dio_iter() returned -EFAULT.
 |  |
 |  |- ret = iomap_iter()
 |  |  |- btrfs_dio_iomap_end()
 |  |  |  |- btrfs_finish_ordered_extent(uptodate = false)
 |  |  |  |  |- can_finish_ordered_extent()
 |  |  |  |     |- btrfs_mark_ordered_extent_error()
 |  |  |  |        |- mapping_set_error()
 |  |  |  |           Now the address space is marked error.
 |  |  |  | return -ENOTBLK
 |  |  |- return -ENOTBLK
 |  |- if (ret == -ENOTBLK) { ret = 0; }
 |     Now the return value is reset to 0.
 |     Thus no error pointer will be returned.
 |
 |- ret = iomap_dio_complete()
 |  Since no byte is submitted, @ret is 0.
 |
 |- Fallback to buffered IO
 |  And the buffered write finished without error
 |
 |- filemap_fdatawait_range()
    |- filemap_check_errors()
       The previous error is recorded, thus an error is returned

However the buffered write is properly submitted and finished, the error
is from the btrfs_finish_ordered_extent() call with @uptodate = false.

[FIX]
When a short dio write happened, any range that is submitted will have
btrfs_extract_ordered_extent() to be called, thus the submitted range
will always have an OE just covering the submitted range.

The remaining OE range is never submitted, thus they should be treated
as truncated, not an error. So that we can properly reclaim and not
insert an unnecessary file extent item, without marking the mapping as
error.

Extract a helper, btrfs_mark_ordered_extent_truncated(), and utilize
that helper to mark the direct IO ordered extent as truncated, so it
won't cause failure for the later buffered fallback.

[REASON FOR NO FIXES TAG]
The bug itself is pretty old, at commit f85781fb505e ("btrfs: switch to
iomap for direct IO") we're already passing @uptodate=false finishing
the OE.
But at that time OE with IOERR won't call mapping_set_error(), so it's
not exposed.
Later commit d61bec08b904 ("btrfs: mark ordered extent and inode with
error if we fail to finish") finally exposed the bug, but that commit
is doing a correct job, not the root cause.

Anyway the bug is very old, dating back to 5.1x days, thus only CC to
stable.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: use verbose assertions in backref.c
Filipe Manana [Tue, 2 Jun 2026 13:42:17 +0000 (14:42 +0100)] 
btrfs: use verbose assertions in backref.c

While debugging a relocation issue I hit an assertion in backref.c but it
was not super useful, since it could not tell what was the unexpected
value that triggered the assertion. The stack trace was this:

  [583246.338097] assertion failed: !cache->nr_nodes, in fs/btrfs/backref.c:3158
  [583246.339588] ------------[ cut here ]------------
  [583246.340573] kernel BUG at fs/btrfs/backref.c:3158!
  [583246.342075] Oops: invalid opcode: 0000 [#1] SMP PTI
  [583246.343294] CPU: 5 UID: 0 PID: 677957 Comm: btrfs Not tainted 7.1.0-rc4-btrfs-next-234+ #1 PREEMPT(full)
  [583246.345715] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
  [583246.348694] RIP: 0010:btrfs_backref_release_cache.cold+0x61/0x84 [btrfs]
  [583246.350759] Code: 90 d5 7c (...)
  [583246.354923] RSP: 0018:ffffd4fc88c93ad8 EFLAGS: 00010246
  [583246.355982] RAX: 000000000000003e RBX: ffff8dec90d97020 RCX: 0000000000000000
  [583246.357459] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000ffffffff
  [583246.359517] RBP: ffff8dec8eeb78c0 R08: 0000000000000000 R09: 3fffffffffefffff
  [583246.361180] R10: ffffd4fc88c93970 R11: 0000000000000003 R12: ffff8decd21f3470
  [583246.363184] R13: 00000000fffffffe R14: ffff8decd21f3000 R15: ffff8decd21f3000
  [583246.364666] FS:  00007f9a51751400(0000) GS:ffff8df3f4255000(0000) knlGS:0000000000000000
  [583246.366287] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [583246.367443] CR2: 00007f9a518ed8f5 CR3: 00000004467c8002 CR4: 0000000000370ef0
  [583246.368969] Call Trace:
  [583246.369541]  <TASK>
  [583246.370040]  relocate_block_group+0xf2/0x520 [btrfs]
  [583246.371243]  btrfs_relocate_block_group+0x9a9/0x22e0 [btrfs]
  [583246.372443]  ? preempt_count_add+0x47/0xa0
  [583247.532978]  ? btrfs_tree_read_lock_nested+0x19/0x90 [btrfs]
  [583247.534520]  ? mutex_lock+0x1a/0x40
  [583247.602233]  ? btrfs_scrub_pause+0x2e/0x120 [btrfs]
  [583247.603543]  btrfs_relocate_chunk+0x3b/0x1a0 [btrfs]
  [583247.604893]  btrfs_balance+0x9d5/0x1920 [btrfs]
  [583247.606189]  ? preempt_count_add+0x69/0xa0
  [583247.607030]  btrfs_ioctl+0x260c/0x2a20 [btrfs]
  [583247.608015]  ? __memcg_slab_free_hook+0x156/0x1a0
  [583247.636971]  __x64_sys_ioctl+0x92/0xe0
  [583247.679247]  do_syscall_64+0x60/0xf20
  [583247.753297]  ? clear_bhb_loop+0x60/0xb0
  [583247.756321]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [583247.787018] RIP: 0033:0x7f9a5186a8db
  [583247.787787] Code: 00 48 89 (...)
  [583247.791410] RSP: 002b:00007fff2ffa6ac0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [583247.792897] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f9a5186a8db
  [583247.794319] RDX: 00007fff2ffa6bb0 RSI: 00000000c4009420 RDI: 0000000000000003
  [583247.795714] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  [583247.797149] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff2ffa903f
  [583247.798685] R13: 00007fff2ffa6bb0 R14: 0000000000000002 R15: 0000000000000002
  [583247.800136]  </TASK>

So update all simple assertions in backref.c to print out the values when
they aren't testing simple boolean conditions.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: print a message when a missing device re-appears
Qu Wenruo [Tue, 2 Jun 2026 05:26:49 +0000 (14:56 +0930)] 
btrfs: print a message when a missing device re-appears

There is a bug report that fstrim crashed, and that crash is eventually
pinned down to a missing device which re-appeared and screwed up callers
that only checks BTRFS_DEV_STATE_MISSING, but not
BTRFS_DEV_STATE_WRITEABLE nor device->bdev.

A missing device re-appearing can be very tricky, as for now it will
result in a device without WRITEABLE or MISSING flag, and still no bdev
pointer.

As the first step to enhance handling of such re-appearing missing
devices, add a dmesg output when a missing device re-appeared.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: do not trim a device which is not writeable
Qu Wenruo [Tue, 2 Jun 2026 04:04:46 +0000 (13:34 +0930)] 
btrfs: do not trim a device which is not writeable

[BUG]
There is a bug report that btrfs/242 can randomly fail with the
following NULL pointer dereference:

  run fstests btrfs/242 at 2026-06-01 10:25:08
  BTRFS: device fsid d4d7f234-487c-4787-88e4-47a8b68c9874 devid 1 transid 9 /dev/sdc (8:32) scanned by mount (122609)
  BTRFS info (device sdc): first mount of filesystem d4d7f234-487c-4787-88e4-47a8b68c9874
  BTRFS info (device sdc): using crc32c checksum algorithm
  BTRFS warning (device sdc): devid 2 uuid fbe72d72-3272-482d-80fb-ab88ed398192 is missing
  BTRFS warning (device sdc): devid 2 uuid fbe72d72-3272-482d-80fb-ab88ed398192 is missing
  BTRFS info (device sdc): allowing degraded mounts
  BTRFS info (device sdc): turning on async discard
  BTRFS info (device sdc): enabling free space tree
  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
  user pgtable: 4k pages, 48-bit VAs, pgdp=000000013fd6b000
  CPU: 4 UID: 0 PID: 122625 Comm: fstrim Not tainted 7.0.10-2-default #1 PREEMPT(full) openSUSE Tumbleweed e9a5f6b24978fba3bf015a992f865837fdfff3dd
  Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250812-19.fc42 08/12/2025
  pstate: 01400005 (nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
  pc : btrfs_trim_fs+0x34c/0xa00 [btrfs]
  lr : btrfs_trim_fs+0x1f0/0xa00 [btrfs]
  Call trace:
   btrfs_trim_fs+0x34c/0xa00 [btrfs f02c1d570ceea621c69d302ba75dd61868083840] (P)
   btrfs_ioctl_fitrim+0xe8/0x178 [btrfs f02c1d570ceea621c69d302ba75dd61868083840]
   btrfs_ioctl+0xdd4/0x2bd8 [btrfs f02c1d570ceea621c69d302ba75dd61868083840]
   __arm64_sys_ioctl+0xac/0x108
   invoke_syscall.constprop.0+0x5c/0xd0
   el0_svc_common.constprop.0+0x40/0xf0
   do_el0_svc+0x24/0x40
   el0_svc+0x40/0x1d0
   el0t_64_sync_handler+0xa0/0xe8
   el0t_64_sync+0x1b0/0x1b8
  Code: 17ffff83 f94017e0 f9002be0 f9402ea0 (f9400c00)
  ---[ end trace 0000000000000000  ]---

Also the reporter is very kind to test the following ASSERT() added to
btrfs_trim_free_extents_throttle():

ASSERT(device->bdev,
       "devid=%llu path=%s dev_state=0x%lx\n",
       device->devid, btrfs_dev_name(device), device->dev_state);

And it shows the following output:

  assertion failed: device->bdev, in extent-tree.c:6630 (devid=2 path=/dev/sdd dev_state=0x82)

Which means the device->bdev is NULL, and the dev_state is
BTRFS_DEV_STATE_IN_FS_METADATA | BTRFS_DEV_STATE_ITEM_FOUND, without
BTRFS_DEV_STATE_WRITEABLE flag set.

[CAUSE]
The pc points to the following call chain:

  btrfs_trim_fs()
  |- btrfs_trim_free_extents()
     |- btrfs_trim_free_extents_throttle()
        |- bdev_max_discard_sectors(device->bdev)

So the NULL pointer dereference is caused by device->bdev being NULL.

This looks impossible by a quick glance, as just before calling
btrfs_trim_free_extents_throttle(), we have skipped any device that has
BTRFS_DEV_STATE_MISSING flag set.

However in this particular case, there is a window where the missing
device is later re-scanned, causing btrfs to remove the
BTRFS_DEV_STATE_MISSING flag:

  btrfs_control_ioctl()
  |- btrfs_scan_one_device()
     |- device_list_add()
        |- rcu_assign_pointer(device->name, name);
        |  This updates the missing device's path to the new good path.
        |
        |- clear_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)
           This removes the BTRFS_DEV_STATE_MISSING flag.

This allows the missing device to re-appear and clear the
BTRFS_DEV_STATE_MISSING flag.  However the device still does not have
the BTRFS_DEV_STATE_WRITEABLE flag set, nor is its bdev pointer updated.

The bdev pointer remains NULL, triggering the crash later.

[FIX]
This is a big de-synchronization between BTRFS_DEV_STATE_MISSING and
device->bdev pointer, and shows a gap in btrfs's re-appearing-device
handling.

The proper handling of re-appearing device will need quite some extra
work, which is out of the context of this small fix.

Thankfully the regular bbio submission path has already handled it well
by checking if the device->bdev is NULL before submitting.

So here we just fix the crash by checking if the device is writeable and
has a bdev pointer before calling bdev_max_discard_sectors().

Reported-by: Su Yue <glass.su@suse.com>
Link: https://lore.kernel.org/linux-btrfs/wlwir19t.fsf@damenly.org/
Fixes: 499f377f49f0 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: return real error after lookup failure in btrfs_ioctl_default_subvol()
Filipe Manana [Mon, 1 Jun 2026 09:45:14 +0000 (10:45 +0100)] 
btrfs: return real error after lookup failure in btrfs_ioctl_default_subvol()

If we fail to lookup the dir item, we are always returning -ENOENT but
that may not be the reason for the failure, as btrfs_lookup_dir_item() can
return many different errors, such as -EIO or -ENOMEM for example.
Fix this by returning the real error, and also fixup the silly error
message, including the id of the directory and the error.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: use mapping shared locking for reading super block
Filipe Manana [Sun, 31 May 2026 10:36:06 +0000 (11:36 +0100)] 
btrfs: use mapping shared locking for reading super block

There's no need to exclusively lock the mapping, shared locking is enough
to protect from a concurrent set block size operation (BLKBSZSET ioctl).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: use lockless read in nr_cached_objects shrinker callback
Ben Maurer [Fri, 29 May 2026 21:23:46 +0000 (14:23 -0700)] 
btrfs: use lockless read in nr_cached_objects shrinker callback

Under heavy memcg-driven slab reclaim with many memcgs and CPUs,
shrink_slab_memcg() invokes the per-superblock count callback once per
(memcg, NUMA node) tuple. For btrfs that callback reaches
percpu_counter_sum_positive() on fs_info->evictable_extent_maps, which
takes the percpu_counter's raw spinlock with IRQs disabled and walks
every online CPU. With hundreds of memcgs driving reclaim on a host with
dozens of CPUs, this counter lock becomes a global serialization point:
profiles show CPU pinned in the spin_lock_irqsave acquire under
__percpu_counter_sum, with cross-CPU IPIs hitting csd_lock_wait_toolong
while waiting for spinning vCPUs.

The shrinker count is advisory -- super_cache_count() already notes
"counts can change between super_cache_count and super_cache_scan, so we
really don't need locks here." Use percpu_counter_read_positive(), which
is lockless. Worst-case skew is bounded by batch * num_online_cpus (a
few thousand), negligible compared to the millions of extent maps a busy
filesystem accumulates and well within the noise that the shrinker
already tolerates.

Tested-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Ben Maurer <bmaurer@meta.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: switch local indicator variables to bools
David Sterba [Tue, 26 May 2026 11:33:21 +0000 (13:33 +0200)] 
btrfs: switch local indicator variables to bools

For all local indicator variables do simple switch to bool, done on all
files.

Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: send: pass bool for pending_move and refs_processed parameters
David Sterba [Tue, 26 May 2026 11:29:49 +0000 (13:29 +0200)] 
btrfs: send: pass bool for pending_move and refs_processed parameters

We're passing simple indicators as int, switch them to bool types.

Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: use shifts for sectorsize and nodesize
David Sterba [Wed, 27 May 2026 11:16:52 +0000 (13:16 +0200)] 
btrfs: use shifts for sectorsize and nodesize

Convert more multiplications of sectorsize or nodesize to use the
shifts. The remaining cases are multiplications by constants that
compiler can optimize by itself, and in tests.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: fix deadlock cloning inline extent when using flushoncommit
Filipe Manana [Tue, 26 May 2026 13:44:30 +0000 (14:44 +0100)] 
btrfs: fix deadlock cloning inline extent when using flushoncommit

In commit b48c980b6a7e ("btrfs: fix deadlock between reflink and
transaction commit when using flushoncommit") a deadlock was fixed
between reflinks and transaction commits when the fs is mounted with the
flushoncommit option. This happened when we had to copy an inline extent's
data to the destination file. However the issue was fixed only for the
case where the destination offset is 0, it missed the case when the offset
is greater than zero.

Fix this by ensuring we get i_size update whenever we copied an inline
extent's data into the destination file.

Syzbot reported this with the following trace:

   INFO: task kworker/u8:3:57 blocked for more than 143 seconds.
         Not tainted syzkaller #0
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:kworker/u8:3    state:D stack:21600 pid:57    tgid:57    ppid:2      task_flags:0x4208160 flags:0x00080000
   Workqueue: writeback wb_workfn (flush-btrfs-129)
   Call Trace:
    <TASK>
    context_switch kernel/sched/core.c:5402 [inline]
    __schedule+0x16f9/0x5500 kernel/sched/core.c:7204
    __schedule_loop kernel/sched/core.c:7283 [inline]
    schedule+0x164/0x360 kernel/sched/core.c:7298
    wait_extent_bit fs/btrfs/extent-io-tree.c:905 [inline]
    btrfs_lock_extent_bits+0x59c/0x700 fs/btrfs/extent-io-tree.c:2008
    btrfs_lock_extent fs/btrfs/extent-io-tree.h:152 [inline]
    btrfs_invalidate_folio+0x440/0xc00 fs/btrfs/inode.c:7718
    extent_writepage fs/btrfs/extent_io.c:1848 [inline]
    extent_write_cache_pages fs/btrfs/extent_io.c:2552 [inline]
    btrfs_writepages+0x12f3/0x2410 fs/btrfs/extent_io.c:2684
    do_writepages+0x32e/0x550 mm/page-writeback.c:2571
    __writeback_single_inode+0x133/0x10e0 fs/fs-writeback.c:1764
    writeback_sb_inodes+0x97f/0x1980 fs/fs-writeback.c:2056
    wb_writeback+0x445/0xb00 fs/fs-writeback.c:2241
    wb_do_writeback fs/fs-writeback.c:2388 [inline]
    wb_workfn+0x3fd/0xf20 fs/fs-writeback.c:2428
    process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
    process_scheduled_works kernel/workqueue.c:3401 [inline]
    worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
    kthread+0x388/0x470 kernel/kthread.c:436
    ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
    ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
    </TASK>
   INFO: task syz.0.145:8523 blocked for more than 143 seconds.
         Not tainted syzkaller #0
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:syz.0.145       state:D stack:22752 pid:8523  tgid:8522  ppid:5850   task_flags:0x400140 flags:0x00080002
   Call Trace:
    <TASK>
    context_switch kernel/sched/core.c:5402 [inline]
    __schedule+0x16f9/0x5500 kernel/sched/core.c:7204
    __schedule_loop kernel/sched/core.c:7283 [inline]
    schedule+0x164/0x360 kernel/sched/core.c:7298
    wb_wait_for_completion+0x3e8/0x790 fs/fs-writeback.c:227
    __writeback_inodes_sb_nr+0x24c/0x2d0 fs/fs-writeback.c:2847
    try_to_writeback_inodes_sb+0x9a/0xc0 fs/fs-writeback.c:2895
    btrfs_start_delalloc_flush fs/btrfs/transaction.c:2182 [inline]
    btrfs_commit_transaction+0x813/0x2fc0 fs/btrfs/transaction.c:2371
    btrfs_sync_file+0xdf4/0x1230 fs/btrfs/file.c:1822
    generic_write_sync include/linux/fs.h:2663 [inline]
    btrfs_do_write_iter+0x6a9/0x840 fs/btrfs/file.c:1473
    new_sync_write fs/read_write.c:595 [inline]
    vfs_write+0x629/0xba0 fs/read_write.c:688
    ksys_write+0x156/0x270 fs/read_write.c:740
    do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
    do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f5a0bdece59
   RSP: 002b:00007f5a0b446028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
   RAX: ffffffffffffffda RBX: 00007f5a0c065fa0 RCX: 00007f5a0bdece59
   RDX: 000000000000029f RSI: 0000200000000200 RDI: 0000000000000004
   RBP: 00007f5a0be82d6f R08: 0000000000000000 R09: 0000000000000000
   R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   R13: 00007f5a0c066038 R14: 00007f5a0c065fa0 R15: 00007ffe149206b8
    </TASK>
   INFO: task syz.0.145:8539 blocked for more than 143 seconds.
         Not tainted syzkaller #0
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:syz.0.145       state:D stack:23704 pid:8539  tgid:8522  ppid:5850   task_flags:0x400140 flags:0x00080002
   Call Trace:
    <TASK>
    context_switch kernel/sched/core.c:5402 [inline]
    __schedule+0x16f9/0x5500 kernel/sched/core.c:7204
    __schedule_loop kernel/sched/core.c:7283 [inline]
    schedule+0x164/0x360 kernel/sched/core.c:7298
    wait_current_trans+0x39f/0x590 fs/btrfs/transaction.c:536
    start_transaction+0xbd8/0x1820 fs/btrfs/transaction.c:716
    clone_copy_inline_extent fs/btrfs/reflink.c:299 [inline]
    btrfs_clone+0x1316/0x2540 fs/btrfs/reflink.c:574
    btrfs_clone_files+0x271/0x3f0 fs/btrfs/reflink.c:795
    btrfs_remap_file_range+0x76b/0x1320 fs/btrfs/reflink.c:948
    vfs_clone_file_range+0x435/0x7b0 fs/remap_range.c:403
    ioctl_file_clone fs/ioctl.c:239 [inline]
    ioctl_file_clone_range fs/ioctl.c:257 [inline]
    do_vfs_ioctl+0xe15/0x1540 fs/ioctl.c:544
    __do_sys_ioctl fs/ioctl.c:595 [inline]
    __se_sys_ioctl+0x82/0x170 fs/ioctl.c:583
    do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
    do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f5a0bdece59
   RSP: 002b:00007f5a0b425028 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   RAX: ffffffffffffffda RBX: 00007f5a0c066090 RCX: 00007f5a0bdece59
   RDX: 00002000000000c0 RSI: 000000004020940d RDI: 0000000000000004
   RBP: 00007f5a0be82d6f R08: 0000000000000000 R09: 0000000000000000
   R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   R13: 00007f5a0c066128 R14: 00007f5a0c066090 R15: 00007ffe149206b8
    </TASK>

Reported-by: syzbot+c7443384724bb0f9e913@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a150a09.820a0220.e7972.0006.GAE@google.com/
Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: allocate eb-attached btree pages as movable
Rik van Riel [Tue, 26 May 2026 22:37:39 +0000 (18:37 -0400)] 
btrfs: allocate eb-attached btree pages as movable

Extent buffer pages allocated by alloc_extent_buffer() are attached to
btree_inode->i_mapping (the buffer_tree path), reach the LRU, and are
served by the btree_migrate_folio aops in fs/btrfs/disk-io.c. They are
migratable in practice once their owning extent buffer hits refs == 1,
which happens naturally. The buddy allocator classifies them by GFP,
however, and bare GFP_NOFS lands them in MIGRATE_UNMOVABLE pageblocks.

The result: every btree_inode page we read in pins an unmovable pageblock
from the page-superblock allocator's perspective, even though the page
itself can be moved.

Have each caller of btrfs_alloc_page_array, btrfs_alloc_folio_array,
and alloc_eb_folio_array pass in the full GFP mask directly, instead
of having the functions calculate it from boolean flags.

The alloc_extent_buffer call site passes GFP_NOFS | __GFP_NOFAIL |
__GFP_MOVABLE. All other call sites pass plain GFP_NOFS.

Three categories of caller stay on bare GFP_NOFS, deliberately:

  - alloc_dummy_extent_buffer / btrfs_clone_extent_buffer: the
    resulting eb is EXTENT_BUFFER_UNMAPPED, folio->mapping stays NULL,
    the folios never enter LRU, never get migrate_folio aops. Tagging
    them __GFP_MOVABLE would violate the page allocator's migrability
    contract and they would defeat compaction in MOVABLE pageblocks
    where isolate_migratepages_block skips non-LRU non-movable_ops
    pages outright.

  - btrfs_alloc_page_array callers in fs/btrfs/raid56.c (stripe
    pages), fs/btrfs/inode.c (encoded reads), fs/btrfs/ioctl.c (io_uring
    encoded reads), fs/btrfs/relocation.c (relocation buffers): same
    contract violation. raid56 stripe_pages additionally persist in
    the stripe cache (RBIO_CACHE_SIZE=1024) well beyond a single I/O,
    so they are not transient enough to hand-wave the contract.

  - btrfs_alloc_folio_array caller in fs/btrfs/scrub.c (stripe
    folios): same -- stripe->folios[] are private buffers freed via
    folio_put in release_scrub_stripe.

This change targets the dominant fragmentation source observed on the
page-superblock series: ~28 GB of btree_inode pages parked across
many tainted superpageblocks on a 250 GB test system with btrfs root,
preventing 1 GiB hugepage allocation from those regions. With the
movable hint, those pages now land in MOVABLE pageblocks where the
existing background defragger drains them through the standard
PB_has_movable gate, no LRU-sample fallback needed.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: add 32-bit compat ioctl for BTRFS_IOC_GET_SUBVOL_INFO
Daan De Meyer [Thu, 21 May 2026 07:51:13 +0000 (07:51 +0000)] 
btrfs: add 32-bit compat ioctl for BTRFS_IOC_GET_SUBVOL_INFO

On 64-bit kernels with 32-bit userspace, struct btrfs_ioctl_timespec is
laid out as 16 bytes (8B sec + 4B nsec + 4B trailing padding) instead of
the 12 bytes a 32-bit userspace expects, because the surrounding struct
is not packed. As a result, struct btrfs_ioctl_get_subvol_info_args has
a different size and layout in 32-bit userspace than in the 64-bit
kernel, and BTRFS_IOC_GET_SUBVOL_INFO returns garbage to 32-bit callers.

Mirror what was done for BTRFS_IOC_SET_RECEIVED_SUBVOL: add a packed
btrfs_ioctl_get_subvol_info_args_32 with btrfs_ioctl_timespec_32 fields,
define BTRFS_IOC_GET_SUBVOL_INFO_32 with that struct as the size
argument, factor the existing handler into a shared _btrfs_ioctl_get_
subvol_info() helper, and add btrfs_ioctl_get_subvol_info_32() which
fills the kernel struct and translates field-by-field into the 32-bit
struct before copy_to_user().

Signed-off-by: Daan De Meyer <daan@amutable.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: derive f_fsid from on-disk fsid and dev_t
Anand Jain [Mon, 27 Apr 2026 10:18:04 +0000 (18:18 +0800)] 
btrfs: derive f_fsid from on-disk fsid and dev_t

The f_fsid was originally derived from fs_devices->fsid and the
subvolume root ID. However, when temp_fsid is active, fs_devices->fsid
is randomized, making the standard derivation inconsistent.

Since metadata_uuid is optional, it is not a reliable alternative.  This
patch instead retrieves the on-disk UUID from fs_info->super_copy->fsid.

To prevent f_fsid collisions between original and cloned filesystems,
this implementation hashes the dev_t for single-device btrfs filesystems
to ensure uniqueness. This is limited to single-device filesystems as
cloned mounts are currently only supported for that configuration. Note
that f_fsid will change if the device is replaced.

Additionally, since the kernel cannot distinguish between the original
and the cloned filesystem, this new f_fsid derivation is applied to
both.

Link: https://lore.kernel.org/linux-btrfs/cover.1772095546.git.asj@kernel.org/
Link: https://lore.kernel.org/linux-btrfs/cover.1774092915.git.asj@kernel.org/
Signed-off-by: Anand Jain <asj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: use on-disk uuid for s_uuid in temp_fsid mounts
Anand Jain [Mon, 27 Apr 2026 10:18:03 +0000 (18:18 +0800)] 
btrfs: use on-disk uuid for s_uuid in temp_fsid mounts

When mounting a cloned filesystem with a temporary fsuuid (temp_fsid),
layered modules like overlayfs require a persistent identifier.

While internal in-memory fs_devices->fsid must remain unique to
the kernel module, let s_uuid carry the original on-disk UUID.

Signed-off-by: Anand Jain <asj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: avoid unnecessary dev stats updates
Qu Wenruo [Tue, 7 Apr 2026 09:34:01 +0000 (19:04 +0930)] 
btrfs: avoid unnecessary dev stats updates

[MINOR PROBLEM]
When mounting a filesystem with a valid DEV_STATS item, we will always
update the DEV_STATS again in the next transaction commit, even if there
is no change the values.

[CAUSE]
During the mount, btrfs_device_init_dev_stats() will read out the
on-disk DEV_STATS item for each device.
Then it calls btrfs_dev_stat_set() to update the in-memory structure.

However btrfs_dev_stat_set() does not only set the dev stats value, but
also increase device->dev_stats_ccnt.

That member determines if we should update the device item at the next
transaction commit. Since we have called btrfs_dev_stat_set() for each
dev status member, dev_stats_ccnt will be non-zero and we will update
the dev stats item even it doesn't change at all.

[FIX]
Instead of using btrfs_dev_stat_set() for valid on-disk DEV_STATUS
values, directly call atomic_set() to set the in-memory values.

For other call sites, we still want to use btrfs_dev_stat_set() so that
we will force updating/creating the dev stats item.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: always update/create the dev stats item when adding a new device
Qu Wenruo [Tue, 7 Apr 2026 09:34:00 +0000 (19:04 +0930)] 
btrfs: always update/create the dev stats item when adding a new device

[MINOR PROBLEM]
When adding a new btrfs device, the corresponding DEV_STATS item creation
can only triggered by a mount cycle if there is no other error
triggered:

  # mkfs.btrfs -f $dev1 $mnt
  # mount $dev1 $mnt
  # btrfs dev add $dev2 $mnt
  # sync
  # btrfs ins dump-tree -t dev $dev1
  device tree key (DEV_TREE ROOT_ITEM 0)
  leaf 30588928 items 6 free space 15853 generation 9 owner DEV_TREE
         item 0 key (DEV_STATS PERSISTENT_ITEM 1) itemoff 16243 itemsize 40 <<<
          persistent item objectid DEV_STATS offset 1
          device stats
          write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
         item 1 key (1 DEV_EXTENT 13631488) itemoff 16195 itemsize 48

Only after a mount cycle and a new transaction, the DEV_STATS for devid
2 can show up:

  # umount $mnt
  # mount $dev1 $mnt
  # touch $mnt
  # sync
  # btrfs ins dump-tree -t dev $dev1
  device tree key (DEV_TREE ROOT_ITEM 0)
  leaf 30605312 items 7 free space 15788 generation 10 owner DEV_TREE
         item 0 key (DEV_STATS PERSISTENT_ITEM 1) itemoff 16243 itemsize 40
          persistent item objectid DEV_STATS offset 1
          device stats
          write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
         item 1 key (DEV_STATS PERSISTENT_ITEM 2) itemoff 16203 itemsize 40
          persistent item objectid DEV_STATS offset 2
          device stats
          write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0

[CAUSE]
Btrfs only updates the DEV_STATS item when the device->dev_stats_ccnt
counter is not 0.

This is to reduce COW for the device tree. However that dev_stats_ccnt is
only increased at the following call sites:

- btrfs_dev_stat_inc()
  This happens when some IO error happened.

- btrfs_dev_stat_read_and_reset()
  This happens for GET_DEV_STATS ioctl with BTRFS_DEV_STATS_RESET flag.

- btrfs_dev_stat_set()
  This happens inside btrfs_device_init_dev_stats().

So when a new device is added, its dev_stats_ccnt is just initialized to
0, and btrfs won't create nor update the corresponding DEV_STATS item at
all.

[ENHANCEMENT]
When a new device is added, also increase the dev_stats_ccnt by one.
This includes both device add ioctl and dev-replace.

This will force btrfs to create a new DEV_STATS item or update the
existing one with the correct values.

This not only makes the DEV_STATS creation early, but also prevents
old DEV_STATS left from older kernels to cause false alerts for the
newly added device.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: remove the dev stats item when removing a device
Qu Wenruo [Tue, 7 Apr 2026 09:33:59 +0000 (19:03 +0930)] 
btrfs: remove the dev stats item when removing a device

[MINOR BUG]
The following script will cause DEV_STATS item to be left after the
corresponding device is removed:

  # mkfs.btrfs -f $dev1
  # mount $dev1 $mnt
  # btrfs dev add $dev2 $mnt
  # umount $mnt

  ## Without real errors, only at mount time btrfs will update
  ## dev->dev_stats_ccnt, thus we need a mount cycle to create the
  ## DEV_STATS item for the new device.

  # mount $dev1 $mnt
  # touch $mnt/foobar
  # sync
  # btrfs dev remove $dev2 $mnt
  # umount $mnt

This will result the DEV_STATS item for devid 2 still left in device
tree:

  device tree key (DEV_TREE ROOT_ITEM 0)
  leaf 31064064 items 7 free space 15788 generation 18 owner DEV_TREE
  leaf 31064064 flags 0x1(WRITTEN) backref revision 1
  fs uuid 4bd853ed-f6ef-45fd-bbf1-1c3a2d9987cb
  chunk uuid b496eab1-ec23-46b5-81c1-2f1b3503ca07
         item 0 key (DEV_STATS PERSISTENT_ITEM 1) itemoff 16243 itemsize 40
          persistent item objectid DEV_STATS offset 1
          device stats
          write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
         item 1 key (DEV_STATS PERSISTENT_ITEM 2) itemoff 16203 itemsize 40
          persistent item objectid DEV_STATS offset 2
          device stats
          write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0

This is not a huge problem, but if the existing DEV_STATS contains
errors, and a new device is added into the fs taking the old devid, then
after a mount cycle, the new device will suddenly inherit old errors
which can give false alerts.

[CAUSE]
Btrfs never has the ability to delete DEV_STATS items.

It either create a new one through update_dev_stat_item(), or read an
existing one through btrfs_device_init_dev_stats().

However update_dev_stat_item() is only called lazily, if a new device is
created and no new update to dev stats, then it will skip the update of
the on-disk item.

So if the old DEV_STATS item exists and a new device is added, and no
errors during the remaining operations, the old DEV_STATS will not be
updated.

Then at the next mount cycle, btrfs_device_init_dev_stats() is called at
mount time, which will read out the old records, causing false alerts to
the newly added device.

[FIX]
Manually remove the DEV_STATS item during btrfs_rm_device().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: remove the dev stats item for replace target device
Qu Wenruo [Tue, 7 Apr 2026 09:33:58 +0000 (19:03 +0930)] 
btrfs: remove the dev stats item for replace target device

[MINOR PROBLEM]
When a running dev-replace hits some error for the target device (devid
0), there will be a DEV_STATS with error records created at the next
transaction commit.

Unfortunately that item will never to be deleted.

This means at the next dev-replace, if the replace is interrupted, then
at the next mount, the target device will suddenly inherit the old error
records from that DEV_STATS item, which can give some false alerts on
that device.

This shouldn't affect end users that much, as it requires all the
following conditions to be met, which is pretty rare:

- The initial dev-replace hits some error on the target device
  E.g. write errors, but those errors itself is already a big problem
  for a running replace.

  This is required to create the DEV_STATS item in the first place.

- The next replace is interrupted
  This is required to allow btrfs to read from the old records.

[CAUSE]
Btrfs just never deletes the DEV_STATS after a replace is finished.

[FIX]
Remove the DEV_STATS item for devid 0 after the replace is finished.

This is not going to completely fix the error, as we still have other
error paths, e.g. by somehow the fs flips RO and can not start a new
transaction for the DEV_STATS item removal.

But those corner cases will be addressed by later patches which provide
a more generic fix to DEV_STATS related problems.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: validate data reloc tree file extent item members
Teng Liu [Wed, 13 May 2026 11:35:44 +0000 (13:35 +0200)] 
btrfs: validate data reloc tree file extent item members

get_new_location() uses BUG_ON() to crash the kernel if the file extent
item it looks up has any of offset, compression, encryption, or
other_encoding set non-zero. The data reloc inode is only written by
relocation's own paths and the four fields are always 0 in what the
kernel writes:

  - insert_prealloc_file_extent() memsets the stack item to zero and
    only fills in type, disk_bytenr, disk_num_bytes and num_bytes, so
    offset/compression/encryption/other_encoding stay 0.
  - insert_ordered_extent_file_extent() copies oe->compress_type into
    the file extent's compression field, but the data reloc inode is
    created with BTRFS_INODE_NOCOMPRESS so compress_type is always 0;
    encryption and other_encoding are reserved-and-zero in btrfs.

A non-zero value here means the leaf decoded from disk does not match
what the kernel wrote, i.e. on-disk corruption. A malformed image
reaches this code via balance and panics the kernel.

A previous attempt to enforce all four constraints in tree-checker's
check_extent_data_item() was merged as commit 7d0ee95979e9 ("btrfs:
validate data reloc tree file extent item members in tree-checker")
and then reverted by commit 1c034697fcaa after btrfs/061 produced
false positives on arm64 with 64K pages. The reason: relocation
writeback legitimately produces REG file_extent_items with offset != 0
in the data reloc tree. When an ordered extent covers only the back
portion of an underlying PREALLOC (num_bytes < ram_bytes on the input
file_extent), insert_ordered_extent_file_extent() inserts a REG with

  offset    = oe->offset
  num_bytes = oe->num_bytes
  ram_bytes preserved from the original PREALLOC,

and this item can reach disk if a transaction commit fires while it
is present in the leaf.

The four fields belong in different layers:

  - compression, encryption and other_encoding are universal
    invariants for every item in the data reloc tree, regardless of
    cluster geometry. Enforce them in tree-checker's
    check_extent_data_item() so a corrupt leaf is rejected at read
    time.

  - offset is only an invariant at the cluster-boundary keys that
    get_new_location() searches (the key is computed as
    src_disk_bytenr - reloc_block_group_start). Partial-PREALLOC
    writebacks legitimately place REG items at non-boundary keys with
    offset != 0; tree-checker cannot reject these. The cluster-
    boundary item is always written by either
    insert_prealloc_file_extent() (offset=0 by memset) or by the
    front portion of a partial writeback (offset=0 by construction),
    so a non-zero offset there is corruption.

Enforce the universal invariants in check_extent_data_item() with a
file_extent_err() rejection. Convert the BUG_ON() in
get_new_location() to a -EUCLEAN return paired with btrfs_print_leaf()
and btrfs_err() so the offending leaf is logged. The caller in
replace_file_extents() already handles non-zero returns from
get_new_location() by breaking out of the loop without aborting the
transaction.

Suggested-by: Qu Wenruo <wqu@suse.com>
Suggested-by: David Sterba <dsterba@suse.com>
Reported-by: syzbot+3e20d8f3d41bac5dc9a2@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=3e20d8f3d41bac5dc9a2
Signed-off-by: Teng Liu <27rabbitlt@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: annotate lockless read of defrag_bytes in should_nocow()
Cen Zhang [Wed, 1 Apr 2026 02:21:53 +0000 (10:21 +0800)] 
btrfs: annotate lockless read of defrag_bytes in should_nocow()

should_nocow() reads inode->defrag_bytes without holding inode->lock,
while btrfs_set_delalloc_extent() and btrfs_clear_delalloc_extent()
update it under that spinlock.

This is a data race.  The read is a quick check used to decide whether
to fall back to COW for a NOCOW inode: if defrag_bytes is non-zero and
the range is tagged EXTENT_DEFRAG, we force COW so that defragmentation
can rewrite the extent.  Reading a stale value is harmless because:

  - A missed increment may skip COW once, but the defrag pass will
    redo the extent later.
  - A stale non-zero may force an unnecessary COW, which is a minor
    efficiency loss, not a correctness issue.

On 64-bit platforms an aligned u64 load is naturally atomic so tearing
cannot happen.  On 32-bit platforms u64 may tear, but we only test for
zero vs non-zero, so the heuristic stays correct regardless.  Use
data_race() annotation.

Fixes: 47059d930f0e ("Btrfs: make defragment work with nodatacow option")
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
[ Use data_race() instead of READ_ONCXE() ]
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: send: switch struct fs_path to auto freeing
David Sterba [Sun, 24 May 2026 10:56:49 +0000 (12:56 +0200)] 
btrfs: send: switch struct fs_path to auto freeing

The fs_path can use the auto freeing pattern and it's completely
contained in send. Define the freeing wrapper and add the cleanup
attributes.

Almost all conversions are straightforward, replacing goto with direct
return.

Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: add message format for qgroupid
David Sterba [Sat, 23 May 2026 16:33:41 +0000 (18:33 +0200)] 
btrfs: add message format for qgroupid

The qgroupid has a specific format, add common format specifier, similar
to what we have for checksums and keys.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: zoned: always set max_active_zones for zoned devices
Johannes Thumshirn [Fri, 22 May 2026 09:22:12 +0000 (11:22 +0200)] 
btrfs: zoned: always set max_active_zones for zoned devices

When a block device does not report a maximum number of open or active
zones,  currently assign BTRFS_DEFAULT_MAX_ACTIVE_ZONES (128) to
the internal limit, if the device has more than
BTRFS_DEFAULT_MAX_ACTIVE_ZONES zones.

But if the device has less than BTRFS_DEFAULT_MAX_ACTIVE_ZONES the
internal max_active_zones limit will stay at 0, even if the device has
zone resource limits. Furthermore, if the device has a total number of
zones that is less than BTRFS_DEFAULT_MAX_ACTIVE_ZONE, max_active_zones
should be set to at most the number of zones.

Also move the max_active_zone calculation and setting into a dedicated
helper, to shrink btrfs_get_dev_zone_info().

Fixes: 04147d8394e8 ("btrfs: zoned: limit active zones to max_open_zones")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: use bvec_phys() in compressed_bio_last_folio()
Matthew Wilcox (Oracle) [Fri, 22 May 2026 18:14:09 +0000 (19:14 +0100)] 
btrfs: use bvec_phys() in compressed_bio_last_folio()

This is open-coded bvec_phys(), also remove direct use of bv_page.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Tested-by: Boris Burkov <boris@bur.io>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: replace __free_page with folio_put() in attach_eb_folio_to_filemap()
Matthew Wilcox (Oracle) [Fri, 22 May 2026 18:14:08 +0000 (19:14 +0100)] 
btrfs: replace __free_page with folio_put() in attach_eb_folio_to_filemap()

Calling __free_page() on folio_page() happens to work today, but
won't always.  Besides, it's far simpler to call folio_put().

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Boris Burkov <boris@bur.io>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agoRevert "btrfs: fix the file offset calculation inside btrfs_decompress_buf2page()"
Matthew Wilcox (Oracle) [Fri, 22 May 2026 18:14:07 +0000 (19:14 +0100)] 
Revert "btrfs: fix the file offset calculation inside btrfs_decompress_buf2page()"

It seems that af566bdaff54 was tested against a tree which did not
contain commit 12851bd921d4 ("fs: Turn page_offset() into a wrapper
around folio_pos()).  Unfortunately it has a bug of its own; on 32-bit
systems, shifting by PAGE_SHIFT will overflow on files larger than 4GiB.
Since page_offset() is now fixed, just revert af566bdaff54.

Fixes: af566bdaff54 (btrfs: fix the file offset calculation inside btrfs_decompress_buf2page())
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Tested-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: zoned: fix deadlock waiting for ticket during data relocation
Johannes Thumshirn [Fri, 22 May 2026 09:02:47 +0000 (11:02 +0200)] 
btrfs: zoned: fix deadlock waiting for ticket during data relocation

When performing data relocation on a zoned filesystem, BTRFS can deadlock
in handle_reserve_tickets(). The relocation process is waiting on a space
reservation ticket that can never be fulfilled, because the relocation
itself is the operation responsible for freeing up that space.

Fix this by introducing a new flush state,
BTRFS_RESERVE_FLUSH_ZONED_RELOCATION, specifically for data chunk
allocation during zoned relocation. Like
BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE, this state uses
priority_reclaim_data_space() instead of the normal flushing path, which
avoids re-entering the relocation code and breaking the deadlock cycle.

In btrfs_alloc_data_chunk_ondemand(), select this new flush state when the
inode belongs to a data relocation root on a zoned filesystem.

Fixes: e2a7fd22378f ("btrfs: zoned: add zone reclaim flush state for DATA space_info")
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: zoned: don't account data relocation space-info in statfs free space
Johannes Thumshirn [Fri, 22 May 2026 09:02:46 +0000 (11:02 +0200)] 
btrfs: zoned: don't account data relocation space-info in statfs free space

Don't account the free space in a data relocation space-info sub-group as
usable free space in statfs.

This is misleading as no user allocations can be made in this space-info
sub-group. It is only a target for relocation.

Fixes: f92ee31e031c ("btrfs: introduce btrfs_space_info sub-group")
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agobtrfs: zoned: always set data_relocation_bg
Johannes Thumshirn [Fri, 22 May 2026 09:02:45 +0000 (11:02 +0200)] 
btrfs: zoned: always set data_relocation_bg

When searching for a data relocation block-group on mount,
btrfs_zoned_reserve_data_reloc_bg() is looking for the first empty DATA
block-group. But it first checks if the block-group is empty and if yes
continues the search, and then checks if it is the first DATA block-group.

There is actually no point in looking for the second empty DATA block
group as new DATA allocations will just allocate a new chunk for it. Pick
the first DATA block-group without any allocations done and set it as
relocation block-group.

At first, the commit 694ce5e143d6 ("btrfs: zoned: reserve data_reloc
block group on mount") introduced the functionality. At that time, we
took second unused (used == 0) block group, as the first one might be a
block group used for normal data.  Later, commit daa0fde32235 ("btrfs:
zoned: fix data relocation block group reservation") switched to look
for an empty block group (alloc_offset == 0). At this point, there is no
reason taking the second one anymore. So, this commit is fixing an issue
in commit daa0fde32235.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days agoarm64: Kconfig: Add ASPEED SoC family Kconfig support
Ryan Chen [Tue, 9 Jun 2026 02:47:19 +0000 (10:47 +0800)] 
arm64: Kconfig: Add ASPEED SoC family Kconfig support

Add support for ASPEED SoC family like ast27XX 8th
generation ASPEED BMCs.

Signed-off-by: Ryan Chen <ryan_chen@aspeedtech.com>
Link: https://lore.kernel.org/r/20260609-upstream_ast2700-v9-2-f631752f0cb1@aspeedtech.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
8 days agodt-bindings: arm: aspeed: Add AST2700 board compatible
Ryan Chen [Tue, 9 Jun 2026 02:47:18 +0000 (10:47 +0800)] 
dt-bindings: arm: aspeed: Add AST2700 board compatible

Add device tree compatible string for AST2700 based boards
("aspeed,ast2700-evb" and "aspeed,ast2700") to the Aspeed SoC
board bindings. This allows proper schema validation and
enables support for AST2700 platforms.

Signed-off-by: Ryan Chen <ryan_chen@aspeedtech.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20260609-upstream_ast2700-v9-1-f631752f0cb1@aspeedtech.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
8 days agoMerge tag 'riscv-dt-for-v7.2' of https://git.kernel.org/pub/scm/linux/kernel/git...
Arnd Bergmann [Tue, 9 Jun 2026 16:14:00 +0000 (18:14 +0200)] 
Merge tag 'riscv-dt-for-v7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux into soc/dt

Microchip RISC-V devicetrees for v7.2

This time around, there's nothing other than patches for Microchip
boards. All of this is low priority fixes and cleanup, centred on the
pic64gx and beaglev-fire boards. There is no new support added.

Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
* tag 'riscv-dt-for-v7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux:
  riscv: dts: microchip: remove redudant enabling of syscontroller
  riscv: dts: microchip: fix pic64gx gpio interrupt-cells
  riscv: dts: microchip: add gpio line names on beaglev-fire
  riscv: dts: microchip: add adc interrupt on beaglev-fire
  riscv: dts: microchip: clean up beaglev-fire regulator node names
  riscv: dts: microchip: remove gpio hogs from beaglev-fire
  riscv: dts: microchip: gpio controllers on mpfs need 2 interrupt cells
  riscv: dts: microchip: sort pic64gx i2c nodes alphanumerically
  riscv: dts: microchip: update pic64gx gpio interrupts to better match the SoC
  riscv: dts: microchip: add tsu clock to macb on pic64gx

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
8 days agoASoC: cs35l56: Increase pm_runtime autosuspend delay
Richard Fitzgerald [Tue, 9 Jun 2026 12:29:46 +0000 (13:29 +0100)] 
ASoC: cs35l56: Increase pm_runtime autosuspend delay

Increase the pm_runtime autosuspend delay to be longer than the
timeout of the firmware's own inactivity timer.

There is no point attempting to pm_runtime suspend any sooner than
the firmware idle timeout because it would only mean the driver has
to poll waiting for the firmware idle.

Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com>
Link: https://patch.msgid.link/20260609122946.288103-1-rf@opensource.cirrus.com
Signed-off-by: Mark Brown <broonie@kernel.org>
8 days agoblock: propagate in_flight to whole disk on partition I/O
Tang Yizhou [Tue, 26 May 2026 02:15:55 +0000 (10:15 +0800)] 
block: propagate in_flight to whole disk on partition I/O

Now when I/O is submitted to a partition, the per-CPU in_flight[]
counter is incremented only on the partition's block_device, not on the
underlying whole disk. This leads to a problem which can be shown by a
fio test:

lsblk
  NAME     MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
  mydev    252:1    0   20G  0 disk
  └─mydev1 259:0    0   10G  0 part

iostat -xp 1
  Device       r/s        rkB/s      ... aqu-sz   %util
  mydev    128153.00  512612.00      ...  13.22   72.20
  mydev1   128154.00  512616.00      ...  13.22  100.00

%util is different between mydev and mydev1, which is unexpected.

This is the cumulative effect of a series of patches. The root cause is
commit e016b78201a2 ("block: return just one value from part_in_flight"),
which deleted the branch in part_in_flight() that aggregated the whole-disk
in_flight count on top of the partition's. Then the second commit is
commit 10ec5e86f9b8 ("block: merge part_{inc,dev}_in_flight into their
only callers"), which folded the whole-disk in_flight accounting into
generic_start_io_acct() and generic_end_io_acct(). Those two helpers
were then removed by commit e722fff238bb ("block: remove
generic_{start,end}_io_acct"), and from that point on the whole disk's
in_flight is no longer accounted at all.

In update_io_ticks(), if calling bdev_count_inflight() finds that the
inflight value of the whole device is 0, the accumulation of io_ticks will
be skipped, causing the reported util% value to be underestimated.

Fix it by restoring the whole-disk in_flight accounting.

Fixes: e016b78201a2 ("block: return just one value from part_in_flight")
Suggested-by: Leon Hwang <leon.huangfu@shopee.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260526021555.359500-1-yizhou.tang@shopee.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 days agoMerge tag 'v7.2-rockchip-dts64-2' of https://git.kernel.org/pub/scm/linux/kernel...
Arnd Bergmann [Tue, 9 Jun 2026 16:12:24 +0000 (18:12 +0200)] 
Merge tag 'v7.2-rockchip-dts64-2' of https://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip into soc/dt

We've got basic camera support on RK3588!

New peripherals RK3588 vicap plus camera addition to some boards,
RK3528 USB and USB enablement on some boards, RGA3 support on RK3588.
Missing EL2 virtual timer interrupt added to RK3588.

Some more added peripherals for the Khadas Edge 2L board.

* tag 'v7.2-rockchip-dts64-2' of https://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip:
  arm64: dts: rockchip: Fix vcc_sdio regulator max voltage on Pinebook Pro
  arm64: dts: rockchip: Enable USB 2.0 ports on NanoPi Zero2
  arm64: dts: rockchip: Enable USB 2.0 ports on ArmSoM Sige1
  arm64: dts: rockchip: Enable USB ports on Radxa ROCK 2A/2F
  arm64: dts: rockchip: Enable USB 2.0 ports on Radxa E20C
  arm64: dts: rockchip: Add USB nodes for RK3528
  arm64: dts: rockchip: enable adc button for Radxa E25
  arm64: dts: rockchip: Add Bluetooth support for Khadas Edge 2L
  arm64: dts: rockchip: Enable USB for Khadas Edge 2L
  arm64: dts: rockchip: Disable removed devices from rk3399-nanopi-r4s
  arm64: dts: rockchip: Fix EEPROM compatible on rk3399-nanopi-r4s-enterprise
  arm64: dts: rockchip: add radxa camera 4k on rock 5b+ cam1
  arm64: dts: rockchip: add radxa camera 4k on rock 5b+ cam0
  arm64: dts: rockchip: add vicap node to rk3588
  arm64: dts: rockchip: Add EL2 virtual timer interrupt
  arm64: dts: rockchip: add rga3 dt nodes to rk3588

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
8 days agoPM: QoS: Fix misc device registration unwind
Yuho Choi [Mon, 8 Jun 2026 17:07:48 +0000 (13:07 -0400)] 
PM: QoS: Fix misc device registration unwind

cpu_latency_qos_init() registers cpu_dma_latency first and, when
CONFIG_PM_QOS_CPU_SYSTEM_WAKEUP is enabled, registers cpu_wakeup_latency
afterwards. The second registration overwrites the first return value.

As a result, a failure to register cpu_dma_latency can be masked if the
second registration succeeds. Conversely, if cpu_dma_latency succeeds and
cpu_wakeup_latency fails, the function returns an error while leaving the
first misc device registered.

Return immediately on the first registration failure and deregister
cpu_dma_latency if the second registration fails.

Fixes: a4e6512a79d8 ("PM: QoS: Introduce a CPU system wakeup QoS limit")
Signed-off-by: Yuho Choi <dbgh9129@gmail.com>
Link: https://patch.msgid.link/20260608170748.82273-1-dbgh9129@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
8 days agovirtio-blk: clamp zone report to the report buffer capacity
Michael Bommarito [Sun, 7 Jun 2026 12:48:34 +0000 (08:48 -0400)] 
virtio-blk: clamp zone report to the report buffer capacity

virtblk_report_zones() trusts the device-reported number of zones when
walking the report buffer:

nz = min_t(u64, virtio64_to_cpu(vblk->vdev, report->nr_zones),
   nr_zones);
...
for (i = 0; i < nz && zone_idx < nr_zones; i++) {
ret = virtblk_parse_zone(vblk, &report->zones[i], ...);

The buffer is allocated by virtblk_alloc_report_buffer(), whose size is
capped by the queue's max hardware sectors and max segments and can
therefore hold fewer descriptors than nr_zones. nz is bounded only by
the device-supplied report->nr_zones and the requested nr_zones, never
by the buffer's descriptor capacity. At probe time the request count is
unbounded (blk_revalidate_disk_zones() calls report_zones() with
nr_zones == UINT_MAX), so the device-supplied report->nr_zones is the
sole gate: a device that reports more zones than fit in the buffer
drives the loop to read report->zones[i] past the end of the allocation.

A malicious or buggy virtio-blk device that reports an inflated nr_zones
triggers this during zone revalidation at probe. KASAN reports a
vmalloc-out-of-bounds read in virtblk_report_zones() against the report
buffer allocated a few lines earlier.

Clamp nz to the number of descriptors that actually fit in the report
buffer.

Fixes: 95bfec41bd3d ("virtio-blk: add support for zoned block devices")
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://patch.msgid.link/20260607124834.3059944-1-michael.bommarito@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 days agoMerge tag 'hisi-arm64-dt-for-7.2' of https://github.com/hisilicon/linux-hisi into...
Arnd Bergmann [Tue, 9 Jun 2026 16:01:42 +0000 (18:01 +0200)] 
Merge tag 'hisi-arm64-dt-for-7.2' of https://github.com/hisilicon/linux-hisi into soc/dt

ARM64: DT: HiSilicon ARM64 DT updates for v7.2

- Move role-switch endpoint into connector on hi3660-hikey960

* tag 'hisi-arm64-dt-for-7.2' of https://github.com/hisilicon/linux-hisi:
  arm64: dts: hisilicon: hi3660-hikey960: move role-switch endpoint into connector

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
8 days agoMerge tag 'imx-dt-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/frank.li...
Arnd Bergmann [Tue, 9 Jun 2026 16:00:12 +0000 (18:00 +0200)] 
Merge tag 'imx-dt-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/frank.li/linux into soc/dt

i.MX ARM device tree changes for 7.2:

DT Binding Cleanup:
- Replaced undocumented compatible strings with proper ones:
  * edt,edt-ft5x06 -> edt,edt-ft5206
  * marvell,88E1510 -> ethernet-phy-ieee802.3-c22
  * karo,imx6qdl-tx6-sgtl5000 -> simple-audio-card
- Fixed incorrect VAR-SOM-MX6UL references (corrected to VAR-SOM-MX6)
- Added missing required properties:
  * #phy-cells for usb-nop-xceiv
  * #io-channel-cells to ADC nodes
  * bus-type for ov5642/ov5640 cameras
  * ti,deskew = <0> for ti,tfp410
- Added missing supply properties (power-supply, vdd-supply, dvdd-supply, avdd-supply)
- Removed redundant/empty properties (bus-width for video-mux, empty clock-names)
- Fixed boolean property warnings and non-existent property references
- Converted TS-4800 watchdog to DT schema
- Renamed wdt nodes to watchdog for consistency

New Features Added:
- PCIe Root Port nodes and PERST property for imx6qdl, imx6sx, and imx7d
- OV5645 camera support for imx7d-pico-pi
- LVDS display panel support for imx6ul-var-som
- WiFi and Bluetooth support for VAR-SOM boards
- nvmem-layout support for imx7
- New bus bindings: fsl,aipi-bus and fsl,emi-bus
- New board binding: variscite,var-som-imx6ull

Code Refactoring:
- VAR-SOM-MX6UL/ULL: factored out common parts for CPU variants
- Separated audio, ethernet (ENET1/ENET2), and SD card support into reusable components

* tag 'imx-dt-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/frank.li/linux: (35 commits)
  dt-bindings: soc: imx: Add fsl,aipi-bus and fsl,emi-bus
  ARM: dts: freescale: add bootph-all to i.MX7ULP watchdog nodes
  ARM: dts: imx7: add nvmem-layout
  ARM: dts: imx7d-pico-pi: add OV5645 camera support
  ARM: dts: imx6-display5: replace marvell,88E1510 with ethernet-phy-ieee802.3-c22
  ARM: dts: imx: replace undocumented compatible string edt,edt-ft5x06 with edt,edt-ft5206
  ARM: dts: imx6qdl-tx6: remove undocumented karo,imx6qdl-tx6-sgtl5000 and keep only simple-audio-card
  ARM: dts: imx: Add bus-type for ov5642/ov5640
  ARM: dts: imx: remove redundant bus-width for video-mux
  ARM: dts: imx: add (power|vdd)-supply for related node
  ARM: dts: imx53-ppd: add '#phy-cells' for usb-nop-xceiv
  ARM: dts: imx53-qsb: add dvdd and avdd supply for panel sii,43wvf1g
  ARM: dts: imx: add ti,deskew = <0> for ti,tfp410
  ARM: dts: imx7d: Add Root Port node and PERST property
  ARM: dts: imx6sx: Add Root Port node and PERST property
  ARM: dts: imx6qdl: Add Root Port node and PERST property
  ARM: dts: nxp: imx51-ts4800: Rename wdt node to watchdog
  dt-bindings: watchdog: Convert TS-4800 to DT schema
  ARM: dts: imx6ul: add #io-channel-cells to ADC
  ARM: dts: imx25: remove empty clock-names for nand-controller@bb000000
  ...

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
8 days agoMerge tag 'apple-soc-dt-7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/sven...
Arnd Bergmann [Tue, 9 Jun 2026 15:59:29 +0000 (17:59 +0200)] 
Merge tag 'apple-soc-dt-7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/sven/linux into soc/dt

Apple SoC DT update for 7.2

Add minimal device trees for all t8122 / base M3 based devices and some
required new compatibles to the dt-bindings. These are enough to boot
Linux on these devices to a simple serial console but future work is
required to make these machines useful for end users.

Signed-off-by: Sven Peter <sven@kernel.org>
* tag 'apple-soc-dt-7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/sven/linux:
  arm64: dts: apple: Initial t8122 (M3) device trees
  dt-bindings: arm: apple: Add M3 based devices
  dt-bindings: pwm: apple,s5l-fpwm: Add t8122 compatible
  dt-bindings: power: apple,pmgr-pwrstate: Add t8122 compatible
  dt-bindings: arm: apple: apple,pmgr: Add t8122 compatible

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
8 days agoMerge tag 'ti-k3-dt-for-v7.2' of https://git.kernel.org/pub/scm/linux/kernel/git...
Arnd Bergmann [Tue, 9 Jun 2026 15:56:58 +0000 (17:56 +0200)] 
Merge tag 'ti-k3-dt-for-v7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/ti/linux into soc/dt

TI K3 device tree updates for v7.2

SoC Specific Features and Fixes:

J722S:
- Use ti,j7200-padconf compatible for pad configuration
- Add MCU and wakeup domain peripherals specific to J722S
- Use J722S-specific compatibles for WIZ, gmii-sel and CPSW3G nodes

Board Specific Features and Fixes:

AM62x (Toradex Verdin):
- Add display overlays: DSI-to-HDMI adapter, DSI-to-LVDS adapter with 10.1"
  panel, capacitive touch displays in 7" DSI, 10.1" DSI and 10.1" LVDS
  configurations, and Mezzanine with 10.1" LVDS display
- Add NAU8822 Bridge Tied Load audio support
- Add OV5640 CSI camera overlays
- Add Verdin Mezzanine CAN overlay
- Reserve UART_4 for Cortex-M4F use

AM625:
- Add support for TQ-Systems TQMa62xx SoM and MBa62xx carrier board,
  including new dt-bindings compatible strings

AM62x (phyBOARD-Lyra):
- Add DT overlay for Lincoln LCD185-101CT OLDI panel

AM62-LP SK:
- Add system-power-controller node

AM62A7 SK:
- Add bootph-all tag to vqmmc regulator for proper boot sequencing

J721S2-som:
- Add bootph-pre-ram property to PMIC-B for proper early boot sequencing

* tag 'ti-k3-dt-for-v7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/ti/linux:
  arm64: dts: ti: k3-am62-verdin: Add Mezzanine with Toradex Display 10.1" LVDS
  arm64: dts: ti: k3-am62-verdin: Add Toradex Verdin Mezzanine CAN
  arm64: dts: ti: k3-am62-verdin: Add Toradex OV5640 CSI Cameras
  arm64: dts: ti: k3-am62-verdin: Reserve UART_4 for Cortex-M4F
  arm64: dts: ti: k3-am62-verdin: Add NAU8822 Bridge Tied Load
  arm64: dts: ti: k3-am62-verdin: Add Toradex Capacitive Touch Display 7" DSI
  arm64: dts: ti: k3-am62-verdin: Add Toradex Capacitive Touch Display 10.1" DSI
  arm64: dts: ti: k3-am62-verdin: Add Toradex Capacitive Touch Display 10.1" LVDS
  arm64: dts: ti: k3-am62-verdin: Add Toradex DSI to LVDS adapter with 10.1" display
  arm64: dts: ti: k3-j722s-main: use J722S compatibles for WIZ, gmii-sel and CPSW3G
  arm64: dts: ti: k3-am62-verdin: Add DSI to HDMI adapter overlay
  arm64: dts: ti: Add TQ-Systems TQMa62xx SoM and MBa62xx carrier board Device Trees
  dt-bindings: arm: ti: Add compatible for AM625-based TQMa62xx SOM family and carrier board
  arm64: dts: ti: am62-phyboard-lyra: Add DT overlay for Lincoln LCD185-101CT panel
  arm64: dts: ti: k3-j721s2-som-p0: add bootph-pre-ram property to PMIC-B
  arm64: dts: ti: k3-j722s: Add wakeup domain peripherals specific to J722S
  arm64: dts: ti: k3-j722s: Add mcu domain peripherals specific to J722S
  arm64: dts: ti: k3-j722s: Use ti,j7200-padconf compatible
  arm64: dts: ti: k3-am62-lp-sk: Add system-power-controller
  arm64: dts: ti: k3-am62a7-sk: Add bootph-all tag to vqmmc

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
8 days agoMerge tag 'arm-soc/for-7.2/devicetree' of https://github.com/Broadcom/stblinux into...
Arnd Bergmann [Tue, 9 Jun 2026 15:56:03 +0000 (17:56 +0200)] 
Merge tag 'arm-soc/for-7.2/devicetree' of https://github.com/Broadcom/stblinux into soc/dt

This pull request contains Broadcom ARM-based SoCs Device Tree changes
for 7.2, please pull the following:

- Rosen moves the Meraki MX6X pinctrl configuration to the PWM node
  where it belongs, also fixes the USB3 GPIO for the R6300v2 and
  EA6500v2 routers

- Jinseok fixes a gpio label for the bcm2711 systems (Raspberry Pi 4)

* tag 'arm-soc/for-7.2/devicetree' of https://github.com/Broadcom/stblinux:
  arm: dts: bcm2711: Fix typo in gpio-line-names
  ARM: dts: BCM5301X: EA6500v2: fix USB3
  ARM: dts: BCM5301X: R6300v2: fix USB3
  ARM: dts: NSP: Move MX6X pinctrl config to PWM node

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
8 days agoMerge tag 'tegra-for-7.2-arm64-dt' of git://git.kernel.org/pub/scm/linux/kernel/git...
Arnd Bergmann [Tue, 9 Jun 2026 15:44:42 +0000 (17:44 +0200)] 
Merge tag 'tegra-for-7.2-arm64-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux into soc/dt

arm64: tegra: Changes for v7.2-rc1

This contains a fix for GPIO on Jetson AGX Thor and enables the PCIe
controllers used on this platform.

Smaug (a.k.a. Pixel C) devices can now be properly powered off via the
PMIC.

Miscellaneous improvements are made across a number of different boards,
such as SMMU enablement on Tegra194 platforms and some better bootloader
interoperation on Chromium-based devices.

Finally, various minor fixes to device tree properties round off this
set of changes.

* tag 'tegra-for-7.2-arm64-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
  arm64: tegra: Enable SMMU on Tegra194 display controllers
  Revert "arm64: tegra: Disable ISO SMMU for Tegra194"
  arm64: tegra: Add #{address,size}-cells to Chromium-based /firmware
  arm64: tegra: Fix aspm-l1-entry-delay-ns L1 latency cells
  arm64: tegra: Mark MAX77620 as system power controller on Smaug
  arm64: tegra: Enable PCIe for Jetson AGX Thor
  arm64: tegra: Fix address of Tegra264 main GPIO controller
  arm64: tegra: Add aspm-l1-entry-delay-ns to PCIe nodes
  arm64: tegra: Fix Tegra234 MGBE PTP clock

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
8 days agoMerge branch 'bpf-enforce-btf-pointer-write-checks-for-global-args'
Kumar Kartikeya Dwivedi [Tue, 9 Jun 2026 15:39:46 +0000 (17:39 +0200)] 
Merge branch 'bpf-enforce-btf-pointer-write-checks-for-global-args'

Nuoqi Gui says:

====================
bpf: Enforce BTF pointer write checks for global args

check_mem_reg() verifies both read and write access when a caller passes
memory into a global subprogram. For PTR_TO_BTF_ID callers,
check_helper_mem_access() currently always checks the access as BPF_READ.

That lets a tracing program pass a task_struct field pointer to a global
subprogram argument typed as writable memory. The direct field store is rejected
with "only read is supported", but the callee is validated with a generic
writable PTR_TO_MEM argument and can store through it.

Forward the requested access type into the PTR_TO_BTF_ID helper-access path and
add verifier coverage for the global-subprogram argument case.

Validation (tested on bpf-next 8496d9020ff3):

  Without this series:
    direct BTF field store rejected with "only read is supported";
    global-subprogram candidate loaded, attached, and runtime-confirmed.

  With this series applied:
    direct BTF field store rejected with "only read is supported";
    global-subprogram candidate rejected with "only read is supported".

Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn>
---
====================

Link: https://patch.msgid.link/20260609-f01-04-btf-writable-arg-v1-0-f449cd970669@mails.tsinghua.edu.cn
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
8 days agoselftests/bpf: Cover writable BTF field global subprog args
Nuoqi Gui [Tue, 9 Jun 2026 14:43:51 +0000 (22:43 +0800)] 
selftests/bpf: Cover writable BTF field global subprog args

Add a verifier test for passing a BTF-backed task_struct field pointer to a
global subprogram argument typed as writable memory.

The direct field store is already rejected.
The global subprogram path should be rejected too.
The callee must not lose the BTF pointer's read-only provenance.
It must not validate the argument as ordinary writable memory.

Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn>
Link: https://lore.kernel.org/bpf/20260609-f01-04-btf-writable-arg-v1-2-f449cd970669@mails.tsinghua.edu.cn
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
8 days agobpf: Enforce write checks for BTF pointer helper access
Nuoqi Gui [Tue, 9 Jun 2026 14:43:50 +0000 (22:43 +0800)] 
bpf: Enforce write checks for BTF pointer helper access

check_mem_reg() verifies both read and write access for global subprogram
memory arguments. When the caller register is PTR_TO_BTF_ID,
check_helper_mem_access() currently forwards the access to
check_ptr_to_btf_access() as BPF_READ regardless of the requested access
type.

This lets a BTF-backed kernel object field pointer pass the caller-side
writable memory check for a global subprogram argument. The callee is then
validated with a generic writable PTR_TO_MEM argument and can store through
it, even though an equivalent direct BTF field store is rejected with "only
read is supported".

Forward the requested access type to check_ptr_to_btf_access().
This enforces existing BTF write restrictions for global subprogram memory
arguments as well.

Fixes: 3e30be4288b3 ("bpf: Allow helpers access trusted PTR_TO_BTF_ID.")
Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn>
Link: https://lore.kernel.org/bpf/20260609-f01-04-btf-writable-arg-v1-1-f449cd970669@mails.tsinghua.edu.cn
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
8 days agogpio: gpio-ltc4283: Add support for the LTC4283 Swap Controller
Nuno Sá [Sat, 2 May 2026 09:56:54 +0000 (10:56 +0100)] 
gpio: gpio-ltc4283: Add support for the LTC4283 Swap Controller

The LTC4283 device has up to 8 pins that can be configured as GPIOs.

Note that PGIO pins are not set as GPIOs by default so if they are
configured to be used as GPIOs we need to make sure to initialize them
to a sane default. They are set as inputs by default.

Acked-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Linus Walleij <linusw@kernel.org>
Signed-off-by: Nuno Sá <nuno.sa@analog.com>
Link: https://lore.kernel.org/r/20260502-ltc4283-support-v13-3-1c206542e652@analog.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: ltc4283: Add support for the LTC4283 Swap Controller
Nuno Sá [Sat, 2 May 2026 09:56:53 +0000 (10:56 +0100)] 
hwmon: ltc4283: Add support for the LTC4283 Swap Controller

Support the LTC4283 Hot Swap Controller. The device features programmable
current limit with foldback and independently adjustable inrush current to
optimize the MOSFET safe operating area (SOA). The SOA timer limits MOSFET
temperature rise for reliable protection against overstresses.

An I2C interface and onboard ADC allow monitoring of board current,
voltage, power, energy, and fault status.

Signed-off-by: Nuno Sá <nuno.sa@analog.com>
Link: https://lore.kernel.org/r/20260502-ltc4283-support-v13-2-1c206542e652@analog.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agotimers/migration: Temporarily disable per capacity hierarchies
Frederic Weisbecker [Tue, 9 Jun 2026 12:33:56 +0000 (14:33 +0200)] 
timers/migration: Temporarily disable per capacity hierarchies

Some workloads with different CPU capacities consume more power with
timer migration than before. The recently introduced per capacity
hierarchies were supposed to alleviate this problem. However it appears
to also regress other types of workloads, especially when plenty of
capacities live together in the same machine.

Disable the feature until a reasonable solution is found.

Fixes: 098cbaad8e57 ("timers/migration: Split per-capacity hierarchies")
Reported-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260609123356.28449-1-frederic@kernel.org
Closes: https://lore.kernel.org/all/3b79338f-6cfc-4722-8062-9103db2c8ad1@arm.com
8 days agoMerge tag 'mm-hotfixes-stable-2026-06-08-20-51' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Tue, 9 Jun 2026 15:24:25 +0000 (08:24 -0700)] 
Merge tag 'mm-hotfixes-stable-2026-06-08-20-51' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "11 hotfixes. 9 are for MM. 8 are cc:stable and the remaining 3 address
  post-7.1 issues or aren't considered suitable for backporting.

  Thre's a two-patch series "mm/damon/{reclaim,lru_sort}: handle ctx
  allocation failures" from SeongJae Park which fixes a couple of DAMON
  -ENOMEM bloopers. The rest are singletons - please see the individual
  changelogs for details"

* tag 'mm-hotfixes-stable-2026-06-08-20-51' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/mincore: handle non-swap entries before !CONFIG_SWAP guard
  arm64: mm: call pagetable dtor when freeing hot-removed page tables
  mm/list_lru: drain before clearing xarray entry on reparent
  mm/huge_memory: use correct flags for device private PMD entry
  mm/damon/lru_sort: handle ctx allocation failure
  mm/damon/reclaim: handle ctx allocation failure
  zram: fix use-after-free in zram_bvec_write_partial()
  MAINTAINERS: update Baoquan He's email address
  tools headers UAPI: sync linux/taskstats.h for procacct.c
  mm/cma_sysfs: skip inactive CMA areas in sysfs
  ipc/shm: serialize orphan cleanup with shm_nattch updates

8 days agodt-bindings: hwmon: Document the LTC4283 Swap Controller
Nuno Sá [Sat, 2 May 2026 09:56:52 +0000 (10:56 +0100)] 
dt-bindings: hwmon: Document the LTC4283 Swap Controller

The LTC4283 is a negative voltage hot swap controller that drives an
external N-channel MOSFET to allow a board to be safely inserted and
removed from a live backplane.

Special note for the "adi,vpower-drns-enable" property. It allows to choose
between the attenuated MOSFET drain voltage or the attenuated input
voltage at the RTNS pin (effectively choosing between input or output
power). This is a system level decision not really intended to change at
runtime and hence is being added as a Firmware property.

Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Nuno Sá <nuno.sa@analog.com>
Link: https://lore.kernel.org/r/20260502-ltc4283-support-v13-1-1c206542e652@analog.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: (pmbus/xdp720) Fix driver issues xdp720/730
Ashish Yadav [Tue, 9 Jun 2026 07:22:31 +0000 (12:52 +0530)] 
hwmon: (pmbus/xdp720) Fix driver issues xdp720/730

Fix driver issues:
- Add the missing regulator and property files in include
- Declare XDP720_DEFAULT_RIMON as unsigned constant
- Declare struct pmbus_driver_info xdp720_info as constant

Signed-off-by: Ashish Yadav <ashish.yadav@infineon.com>
Link: https://lore.kernel.org/r/20260609072231.15486-4-Ashish.Yadav@infineon.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: (pmbus/xdp720) Add support for efuse xdp730
Ashish Yadav [Tue, 9 Jun 2026 07:22:30 +0000 (12:52 +0530)] 
hwmon: (pmbus/xdp720) Add support for efuse xdp730

Adds support for the Infineon XDP730 Digital eFuse Controller by
updating the existing XDP720 driver.

Signed-off-by: Ashish Yadav <ashish.yadav@infineon.com>
Link: https://lore.kernel.org/r/20260609072231.15486-3-Ashish.Yadav@infineon.com
[groeck: Fixed conflicts in xdp720_id declaration]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agodt-bindings: hwmon/pmbus: Add Infineon xdp730
Ashish Yadav [Tue, 9 Jun 2026 07:22:29 +0000 (12:52 +0530)] 
dt-bindings: hwmon/pmbus: Add Infineon xdp730

Add documentation for the device tree binding of the XDP730 eFuse.
Rename node to efuse to accurately reflect its hardware function.

Signed-off-by: Ashish Yadav <ashish.yadav@infineon.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20260609072231.15486-2-Ashish.Yadav@infineon.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: (adt7462) Add of_match_table to support devicetree
Kory Maincent [Mon, 8 Jun 2026 15:23:43 +0000 (17:23 +0200)] 
hwmon: (adt7462) Add of_match_table to support devicetree

Add of_match_table to add support of devicetree probing.

Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
[rgantois: Removed of_match_ptr().]
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Link: https://lore.kernel.org/r/20260608-adt7462-bindings-v2-1-272982c40325@bootlin.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: (asus-ec-sensors) add ROG MAXIMUS Z790 EXTREME
Brian Downey [Mon, 8 Jun 2026 06:08:41 +0000 (08:08 +0200)] 
hwmon: (asus-ec-sensors) add ROG MAXIMUS Z790 EXTREME

Add support for ROG MAXIMUS Z790 EXTREME

Signed-off-by: Brian Downey <bdowne01@gmail.com>
Signed-off-by: Eugene Shalygin <eugene.shalygin@gmail.com>
Link: https://lore.kernel.org/r/20260608060855.40469-1-eugene.shalygin@gmail.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: (pmbus/max20860a) Add driver for Analog Devices MAX20860A
Syed Arif [Mon, 1 Jun 2026 18:45:36 +0000 (18:45 +0000)] 
hwmon: (pmbus/max20860a) Add driver for Analog Devices MAX20860A

Add a PMBus driver for the Analog Devices MAX20860A step-down DC-DC
switching regulator. The MAX20860A provides monitoring of input/output
voltage, output current, and temperature via the PMBus interface using
linear data format. Optional regulator support is available via
CONFIG_SENSORS_MAX20860A_REGULATOR.

Signed-off-by: Syed Arif <arif.syed@hpe.com>
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260601184516.919488-3-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agodt-bindings: hwmon: pmbus: Add Analog Devices MAX20860A
Sanman Pradhan [Mon, 1 Jun 2026 18:45:30 +0000 (18:45 +0000)] 
dt-bindings: hwmon: pmbus: Add Analog Devices MAX20860A

Add devicetree binding documentation for the Analog Devices MAX20860A
step-down DC-DC switching regulator with PMBus interface.

Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20260601184516.919488-2-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: (pmbus) Add support for Flex BMR316, BMR321, BMR350 and BMR351
Daniel Nilsson [Wed, 3 Jun 2026 08:57:12 +0000 (10:57 +0200)] 
hwmon: (pmbus) Add support for Flex BMR316, BMR321, BMR350 and BMR351

Add support for BMR316, BMR321, BMR350 and BMR351 DC/DC converter
modules from Flex to the pmbus driver.

Signed-off-by: Daniel Nilsson <linux@erq.se>
Link: https://lore.kernel.org/r/20260603085712.659432-2-linux@erq.se
[groeck: Resolved conflicts (explicit struct members in pmbus_id)]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: Use named initializers for platform_device_id arrays
Uwe Kleine-König (The Capable Hub) [Wed, 27 May 2026 15:15:53 +0000 (17:15 +0200)] 
hwmon: Use named initializers for platform_device_id arrays

Named initializers are better readable and more robust to changes of the
struct definition. This robustness is relevant for a planned change to
struct platform_device_id replacing .driver_data by an anonymous unit.

While touching these arrays unify usage of commas.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Link: https://lore.kernel.org/r/25d38df8db42d69f33fa30267c9fd5ea058223d0.1779894738.git.u.kleine-koenig@baylibre.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: (cros_ec) Drop unused assignment of platform_device_id driver data
Uwe Kleine-König (The Capable Hub) [Wed, 27 May 2026 15:15:52 +0000 (17:15 +0200)] 
hwmon: (cros_ec) Drop unused assignment of platform_device_id driver data

The driver explicitly set the .driver_data member of struct
platform_device_id to zero without relying on that value. Drop this
unused assignments.

While touching this array unify spacing and use named initializers for
.name.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org>
Acked-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/972c9998054c7944f63266819d6fb08b36edb5c5.1779894738.git.u.kleine-koenig@baylibre.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: (it87) Clamp negative values to zero in set_fan()
Nikita Zhandarovich [Fri, 29 May 2026 14:18:36 +0000 (17:18 +0300)] 
hwmon: (it87) Clamp negative values to zero in set_fan()

set_fan() parses user input with kstrtol() and passes the resulting
value to FAN16_TO_REG() on chips with 16-bit fan support.

Negative fan speeds are not meaningful and should be rejected before
conversion. Worst scenario, one may be able to abuse undefined
behaviour of signed overflow to possibly induce rpm * 2 == 0 in
FAN16_TO_REG(), thus causing a division by zero.

Instead, clamp val < 0 to zero and keep the conversion in its valid
input domain, avoiding unsafe arithmetic in the register conversion
path.

Found by Linux Verification Center (linuxtesting.org) with static
analysis tool SVACE.

Fixes: 17d648bf5786 ("it87: Add support for the IT8716F")
Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru>
Link: https://lore.kernel.org/r/20260529141839.1639287-1-n.zhandarovich@fintech.ru
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: (asus-ec-sensors) add ROG STRIX B850-E GAMING WIFI
Eugene Shalygin [Sun, 7 Jun 2026 12:36:16 +0000 (14:36 +0200)] 
hwmon: (asus-ec-sensors) add ROG STRIX B850-E GAMING WIFI

The board has a similar sensor configuration to the
ROG STRIX B850-I GAMING WIFI, but includes an additional
T-Sensor header. The patch was provided via GitHub [1].

[1] https://github.com/zeule/asus-ec-sensors/pull/105

Signed-off-by: Eugene Shalygin <eugene.shalygin@gmail.com>
Link: https://lore.kernel.org/r/20260607123626.100630-1-eugene.shalygin@gmail.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: (asus-ec-sensors) add ROG STRIX B650E-E GAMING WIFI
Veronika Kossmann [Sun, 7 Jun 2026 11:06:10 +0000 (13:06 +0200)] 
hwmon: (asus-ec-sensors) add ROG STRIX B650E-E GAMING WIFI

Add support for ROG STRIX B650E-E GAMING WIFI

Signed-off-by: Veronika Kossmann <nanodesuu@gmail.com>
Co-developed-by: Oleg Tsvetkov <oleg-tsv@yandex.ru>
Signed-off-by: Oleg Tsvetkov <oleg-tsv@yandex.ru>
Signed-off-by: Eugene Shalygin <eugene.shalygin@gmail.com>
Link: https://lore.kernel.org/r/20260607110702.84599-2-eugene.shalygin@gmail.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: (nct6683) Add support for ASRock Z890 Pro-A
Reiner Pröls [Thu, 21 May 2026 21:26:11 +0000 (23:26 +0200)] 
hwmon: (nct6683) Add support for ASRock Z890 Pro-A

Add the ASRock Z890 Pro-A customer ID to the list of supported
boards for the NCT6683 hardware monitoring driver.

MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Reiner Pröls <reiner.proels@gmail.com>
Link: https://lore.kernel.org/r/20260521212632.223724-1-Reiner.Proels@gmail.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: (adt7475) Add explicit header include
Flaviu Nistor [Fri, 22 May 2026 05:23:52 +0000 (08:23 +0300)] 
hwmon: (adt7475) Add explicit header include

Since device_property_read_string() and similar functions defined in
linux/property.h are used in the driver add explicit include for
linux/mod_devicetable.h and linux/property.h rather than having implicit
inclusions.
Removed of_match_ptr() improving non-Device Tree compatibility of the
driver and drop unnecessary __maybe_unused.
Header linux/of.h can't be removed yet since macro is_of_node() is used.

Signed-off-by: Flaviu Nistor <flaviu.nistor@gmail.com>
Link: https://lore.kernel.org/r/20260522052352.12139-1-flaviu.nistor@gmail.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: (lm63) expose PWM frequency and LUT hysteresis as writable
Jan-Henrik Bruhn [Sat, 23 May 2026 13:36:17 +0000 (15:36 +0200)] 
hwmon: (lm63) expose PWM frequency and LUT hysteresis as writable

The driver caches the PWM frequency register and the CONFIG_FAN slow-clock
select bit, but never lets userspace pick a different output frequency.
Add a pwm1_freq sysfs attribute that selects the closest SCS + PFR
combination for the requested value in Hz, gated by manual mode like
set_pwm1(). PFR is clamped to 31 so that 2*PFR fits in the chip's 6-bit
PWM register (matching the existing scaling assumption in show_pwm1).

The hardware LUT hysteresis register is shared by all LUT entries, so
the per-point pwm1_auto_pointN_temp_hyst attributes can't be made RW
without N-to-1 cross-attribute side effects. Following the max31760
precedent, expose a single chip-wide pwm1_auto_point_temp_hyst attribute
holding the hysteresis amount in millidegrees; the per-point attributes
stay RO and continue to show the resulting absolute trip-down
temperature for each entry.

This was tested on a Linksys LGS328MPC switch hardware where the fan
would not spin with the default PWM Frequency, which is why this change
is required.

Signed-off-by: Jan-Henrik Bruhn <kernel@jhbruhn.de>
Link: https://lore.kernel.org/r/20260523133617.3439102-1-kernel@jhbruhn.de
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: (pmbus/adm1266) add rtc debugfs entry
Abdurrahman Hussain [Wed, 20 May 2026 22:42:42 +0000 (15:42 -0700)] 
hwmon: (pmbus/adm1266) add rtc debugfs entry

The driver seeds the chip's SET_RTC register once at probe with
ktime_get_real_seconds().  Over a long uptime the chip's internal
seconds counter drifts away from the host's wall-clock time, so the
timestamp embedded in each blackbox record stops being meaningful
in wall-clock terms.  The datasheet recommends that the host
periodically resynchronise the counter to address this; today the
driver has no userspace-facing knob for that.

Expose SET_RTC via an rtc debugfs file alongside the other adm1266
debugfs entries:

  read  -- returns the chip's current SET_RTC seconds counter, so
           userspace can observe how far the chip has drifted from
           host wall-clock without writing anything.
  write -- the kernel re-reads ktime_get_real_seconds() itself and
           pushes it to the chip.  The write payload is ignored;
           userspace does not get to supply its own timestamp
           value, so there is no way for it to push a wrong time
           into the chip.

A small userspace agent (chrony hook, systemd-timesyncd dispatch
script, or a periodic cron job) can write to this file to keep the
chip's counter aligned with wall-clock across long uptimes.

Both the read and write paths take pmbus_lock to serialise against
the pmbus_core's own PAGE+register sequences and against the other
adm1266 debugfs accessors that already run under the same lock.

While at it, drop the now-redundant adm1266_set_rtc() probe-time
helper.  The new adm1266_rtc_set() callback does exactly the same
byte-packing and write; probe just calls adm1266_rtc_set(client, 0)
(the ignored @val argument) after pmbus_do_probe() so the
pmbus_lock acquired by the new helper has a live mutex to take.

Signed-off-by: Abdurrahman Hussain <abdurrahman@nexthop.ai>
Assisted-by: Claude-Code:claude-opus-4-7
Assisted-by: sashiko:gemini-3.1-pro-preview
Link: https://lore.kernel.org/r/20260520-adm1266-v5-3-c72ef1fac1ea@nexthop.ai
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
8 days agohwmon: (pmbus/adm1266) add powerup_counter debugfs entry
Abdurrahman Hussain [Wed, 20 May 2026 22:42:41 +0000 (15:42 -0700)] 
hwmon: (pmbus/adm1266) add powerup_counter debugfs entry

The ADM1266 maintains a 16-bit non-volatile POWERUP_COUNTER register
(0xE4, datasheet Rev. D, Table 93) that increments on every power
cycle and cannot be reset by the host. Each blackbox record already
embeds the counter at record time, so the standalone live value is
primarily useful for matching a captured record back to the boot it
came from when correlating logs.

Expose it as a read-only debugfs file alongside sequencer_state. The
block-read returns two payload bytes in little-endian order.

Take pmbus_lock around the block-read so the access serialises with
any pmbus_core sequence that sets PAGE on the device. Without it, a
PAGE write from another thread could interleave between a PAGE set
and a paged read elsewhere in the driver and corrupt either side's
view of the device state machine.

Signed-off-by: Abdurrahman Hussain <abdurrahman@nexthop.ai>
Assisted-by: Claude-Code:claude-opus-4-7
Assisted-by: sashiko:gemini-3.1-pro-preview
Link: https://lore.kernel.org/r/20260520-adm1266-v5-2-c72ef1fac1ea@nexthop.ai
Signed-off-by: Guenter Roeck <linux@roeck-us.net>