git.ipfire.org Git - thirdparty/kernel/linux.git/log

dibs: Improve DIBS prompts and help texts

The Kconfig prompts for the DIBS options read:

DIBS support (DIBS) [N/m/y/?]
intra-OS shortcut with dibs loopback (DIBS_LO) [N/y/?]

Clarify the DIBS prompt by expanding the acronym.
Capitalize the first character of the DIBS_LO prompt.
While at it, join the first two lines of the DIBS help text into a real
sentence.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/9093259c43e5d1996965d1522562444419196a19.1778838921.git.geert+renesas@glider.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

x86: Remove unnecessary architecture-specific <asm/device.h>

arch/x86/include/asm/device.h is identical to
include/asm-generic/device.h, and therefore the x86-specific version
is unnecessary. Remove it.

[ dhansen: Minor note: It looks like if asm/foo.h does not exist that
   the build system generates one that does a #include
   <asm-generic/foo.h>. Thus, all that needs to be done is
   remove the arch-specific one. ]

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20260517025713.97791-1-enelsonmoore@gmail.com

Merge tag 'batadv-net-pullrequest-20260515' of https://git.open-mesh.org/batadv

Simon Wunderlich says:

====================
Here are various batman-adv bugfixes:

- fix tp_meter counter underflow during shutdown, by Luxiao Xu

- fix tp_meter tp_vars reference leak in receiver shutdown,
   by Sven Eckelmann

- fix various translation table integer handling issues,
   by Sven Eckelmann (3 patches)

- fix various translation table counter issues,
   by Sven Eckelmann (3 patches)

- fix fragment reassembly length accounting, by Ruide Cao

- clear current gateway during teardown, by Ruijie Li

- handle forward allocation error in DAT, by Sven Eckelmann

- tp_meter: avoid use of uninitialized sender variables in tp_meter,
   by Sven Eckelmann

- disallow unicast fragment in fragment, by Sven Eckelmann

- directly shut down tp_meter timer on cleanup, by Sven Eckelmann

* tag 'batadv-net-pullrequest-20260515' of https://git.open-mesh.org/batadv:
  batman-adv: tp_meter: directly shut down timer on cleanup
  batman-adv: frag: disallow unicast fragment in fragment
  batman-adv: tp_meter: avoid use of uninit sender vars
  batman-adv: dat: handle forward allocation error
  batman-adv: clear current gateway during teardown
  batman-adv: fix fragment reassembly length accounting
  batman-adv: tt: prevent TVLV entry number overflow
  batman-adv: tt: avoid empty VLAN responses
  batman-adv: tt: fix TOCTOU race for reported vlans
  batman-adv: tt: fix negative last_changeset_len
  batman-adv: tt: fix negative tt_buff_len
  batman-adv: tt: reject oversized local TVLV buffers
  batman-adv: tp_meter: fix tp_vars reference leak in receiver shutdown
  batman-adv: fix tp_meter counter underflow during shutdown
====================

Link: https://patch.msgid.link/20260515095540.325586-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'for-net-2026-05-14' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

- af_bluetooth: serialize accept_q access
- L2CAP: ecred_reconfigure: send packed pdu, not stack pointer
- btmtk: accept too short WMT FUNC_CTRL events
- hci_qca: Convert timeout from jiffies to ms

* tag 'for-net-2026-05-14' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: hci_qca: Convert timeout from jiffies to ms
  Bluetooth: L2CAP: ecred_reconfigure: send packed pdu, not stack pointer
  Bluetooth: btmtk: accept too short WMT FUNC_CTRL events
  Bluetooth: serialize accept_q access
====================

Link: https://patch.msgid.link/20260514172340.1515042-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

openvswitch: vport: fix race between linking and the device notifier

Sashiko reports that it is technically possible that we got the device
reference, but by the time we're linking it to the OVS datapath, it
may be already in the process of being deleted.  In this case if the
notifier wins the race for RTNL, it will see that the device is not
yet in the OVS datapath (ovs_netdev_get_vport() will fail in the
dp_device_event()) and will do nothing.  Then the ovs_netdev_link()
will take the RTNL and link the unregistering device to OVS datapath.

Eventually, netdev_wait_allrefs_any() will re-broadcast the event and
the device will be properly detached, but it will take at least a
second before that happens, so it's not something we should rely on.

Let's avoid linking the non-registered device in the first place.

Note: As per documentation, RTNL doesn't protect the reg_state, but
it actually does for all the state transitions we care about here,
so it should not be necessary to use READ_ONCE or taking the instance
lock.  We can still do that, but we have a few more places even in
this file where the reg_state is accessed without those while under
RTNL, and many more places like this across the kernel code, so it
might make more sense to change all of them in a more centralized
fashion in the future, if necessary.

Fixes: ccb1352e76cf ("net: Add Open vSwitch kernel components.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://patch.msgid.link/20260514184702.2461435-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: qualcomm: rmnet: fix endpoint use-after-free in rmnet_dellink()

rmnet_dellink() removes the endpoint from the hash table with
hlist_del_init_rcu() and then immediately frees it with kfree(). However,
RCU readers on the receive path (rmnet_rx_handler ->
__rmnet_map_ingress_handler) may still hold a reference to the endpoint and
dereference ep->egress_dev after the memory has been freed. The endpoint is
a kmalloc-32 object, and the stale read at offset 8 corresponds to the
egress_dev pointer.

  BUG: unable to handle page fault for address: ffffffffde942eef
  Oops: 0002 [#1] SMP NOPTI
  CPU: 1 UID: 0 PID: 137 Comm: poc_write Not tainted 7.0.0+ #4 PREEMPTLAZY
  RIP: 0010:rmnet_vnd_rx_fixup (rmnet_vnd.c:27)
  Call Trace:
   <TASK>
   __rmnet_map_ingress_handler (rmnet_handlers.c:48 rmnet_handlers.c:101)
   rmnet_rx_handler (rmnet_handlers.c:129 rmnet_handlers.c:235)
   __netif_receive_skb_core.constprop.0 (net/core/dev.c:6096)
   __netif_receive_skb_one_core (net/core/dev.c:6208)
   netif_receive_skb (net/core/dev.c:6467)
   tun_get_user (drivers/net/tun.c:1955)
   tun_chr_write_iter (drivers/net/tun.c:2003)
   vfs_write (fs/read_write.c:688)
   ksys_write (fs/read_write.c:740)
   </TASK>

Add an rcu_head field to struct rmnet_endpoint and replace kfree() with
kfree_rcu() so the endpoint memory remains valid through the RCU grace
period. Also remove the rmnet_vnd_dellink() call and inline only the
nr_rmnet_devs decrement, since rmnet_vnd_dellink() would set
ep->egress_dev to NULL during the grace period, creating a data race
with lockless readers.

Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Link: https://patch.msgid.link/20260514122511.3083479-2-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: appletalk: fix NULL pointer dereference in aarp_send_ddp()

aarp_send_ddp() calls atalk_find_dev_addr(dev) in the LocalTalk fast
path without checking for NULL. When the device has no AppleTalk
interface configured (dev->atalk_ptr == NULL), this leads to a NULL
pointer dereference at the at->s_net access.

KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:aarp_send_ddp (net/appletalk/aarp.c:552 (discriminator 2))
Call Trace:
  <TASK>
  atalk_sendmsg (net/appletalk/ddp.c:1715)
  __sys_sendto (net/socket.c:2265 (discriminator 1))
  __x64_sys_sendto (net/socket.c:2272)
  do_syscall_64 (arch/x86/entry/syscall_64.c:94)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)

Add a NULL check consistent with the other callers of
atalk_find_dev_addr().

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Link: https://patch.msgid.link/20260514123806.3085961-3-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netdevsim: psp: reset spi on key rotation and check for exhaustion on alloc

The PSP spec states that the lower 31b of the SPI need to be
non-zero. Though not in the spec, I think it is reasonable to reset
the lower 31b of the spi space after a key rotation, and to also
decline to generate session keys when the lower 31b saturate.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260515-spi-handle-v1-1-debf8cb467cb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mlx5-frag-buffer-improvements'

Tariq Toukan says:

====================
net/mlx5: frag buffer improvements

This series adds observability for mlx5 fragment buffer DMA pools and
improves the default NUMA placement policy for fragment buffer
allocations.

Patch 1 adds a debugfs interface exposing per-node DMA pool usage
statistics for mlx5_frag_buf allocations, helping with debugging and
visibility into pool utilization.

Patch 2 improves locality of default fragment buffer allocations by
using numa_mem_id() when no explicit NUMA node is requested, allowing
allocations to prefer the current CPU's local memory node.

Together, these changes improve both introspection and memory locality
behavior of mlx5 fragment buffer allocations.
====================

Link: https://patch.msgid.link/20260514104925.337570-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: add debugfs stats for frag buf dma pools

Add a debugfs file exposing per-node DMA pool usage for mlx5_frag_buf
allocations.

  # cat /sys/kernel/debug/mlx5/<dev>/frag_buf_dma_pools
  node  block_size  used_blocks  allocated_blocks
     0        4096            0                 0
     0        8192            0                 0
     0       16384            0                 0
     0       32768            0                 0
     0       65536            0                 0
     1        4096            0                 0
     1        8192            0                 0
     1       16384            0                 0
     1       32768            0                 0
     1       65536            0                 0

Signed-off-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260514104925.337570-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: use numa_mem_id() for default frag buf allocations

Use the current CPU's local memory node when callers do not request a
specific NUMA node for mlx5_frag_buf allocations.

Signed-off-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260514104925.337570-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: xsk: Fix unlocked writing to ICOSQ

During napi poll, when the affinity changes and there's still XSK work
to be done, we trigger an ICOSQ interrupt on the new CPU. However, this
triggering on the ICOSQ is done unprotected.

There are 2 such races:

A) mlx5e_trigger_irq() is called while mlx5e_xsk_alloc_rx_mpwqe() is
running from a different CPU due to affinity change. This can happen
because IRQ triggering is done after napi_complete_done(). At this point
the NAPI can be scheduled on a different CPU. Like this:

  CPU A (old affinity, NAPI tail)    CPU B (new affinity, fresh NAPI)
  -------------------------------    --------------------------------
  napi_complete_done()  clears SCHED
  mlx5e_cq_arm(...)
                                     napi_schedule_prep() sets SCHED
                                     mlx5e_napi_poll()
                                       mlx5e_xsk_alloc_rx_mpwqe()
                                         mlx5e_icosq_sync_lock() // noop
                                         memcpy 640 B UMR body
                                         advance sq->pc by 10
  mlx5e_trigger_irq(&c->icosq)
    wqe_info[pi] = {NOP, 1}
    mlx5e_post_nop() advances sq->pc

B) mlx5e_trigger_irq() is called on the ICOSQ when
mlx5e_trigger_napi_icosq() is running.

The obvious fix would be to lock the ICOSQ. But ICOSQ has an optimized
locking scheme that doesn't work for this scenario. Kick the async ICOSQ
instead which is always locked.

This issue was noticed in the wild with the following splat:

  netdevice: ge-0-0-1: Bad OP in ICOSQ CQE: 0xd
  WARNING: drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:826 [...]
  [...]
  Call Trace:
   <IRQ>
   mlx5e_napi_poll+0x11d/0x7f0 [mlx5_core]
   __napi_poll+0x30/0x200
   ? skb_defer_free_flush+0x9c/0xc0
   net_rx_action+0x2fe/0x3f0
   handle_softirqs+0xd8/0x340
   __irq_exit_rcu+0xbc/0xe0
   common_interrupt+0x85/0xa0
   </IRQ>
   <TASK>
   asm_common_interrupt+0x26/0x40
  [...]
  ---[ end trace 0000000000000000 ]---
  mlx5_core 0000:08:00.0 ge-0-0-1: Error cqe on cqn 0x548, ci 0x2022, qn 0x8f4,
  opcode 0xd, syndrome 0x2, vendor syndrome 0x68
  00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00000030: 00 00 00 00 01 00 68 02 01 00 08 f4 de 14 59 d2
  WQE DUMP: WQ size 16384 WQ cur size 0, WQE index 0x1e14, len: 64
  00000000: 00 00 00 01 d9 ed 80 02 00 00 00 01 d9 ed 90 02
  00000010: 00 00 00 01 d9 ed a0 02 00 00 00 01 d9 ed b0 02
  00000020: 00 00 00 01 d9 ed c0 02 00 00 00 01 d9 ed d0 02
  00000030: 00 00 00 01 d9 ed e0 02 00 00 00 01 d9 ed f0 02
  mlx5_core 0000:08:00.0 ge-0-0-1: Error cqe on cqn 0x548, ci 0x2023, qn 0x8f4,
  opcode 0xd, syndrome 0x5, vendor syndrome 0xf9
  00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00000030: 00 00 00 00 01 00 f9 05 01 00 08 f4 de 15 cf d2

Fixes: db05815b36cb ("net/mlx5e: Add XSK zero-copy support")
Reported-by: Paul Saab <ps@mu.org>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260513064613.334602-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

PCI: imx6: Parse 'reset-gpios' in Root Port nodes

The current DT binding for pci-imx6 specifies the 'reset-gpios' property
in the host bridge node. However, the PERST# signal logically belongs to
individual Root Ports rather than the host bridge itself. This becomes
important when supporting PCIe Key E connector and the PCI power control
framework for pci-imx6 driver, which requires properties to be specified
in Root Port nodes.

Parse 'reset-gpios' from Root Port nodes and the PCIe bridge nodes under
the Root Port using the common helper pci_host_common_parse_ports(), and
update the reset GPIO handling to use the parsed port list from
bridge->ports. To maintain DT backwards compatibility, fall back to the
legacy method of parsing the host bridge node if the reset property is not
present in the Root Port nodes.

Since now the reset GPIO is obtained with GPIOD_ASIS flag, it may be in
input mode, so use gpiod_direction_output() instead of
gpiod_set_value_cansleep() to ensure the reset GPIO is properly
configured as output before setting its value.

Signed-off-by: Sherry Sun <sherry.sun@nxp.com>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Richard Zhu <hongxing.zhu@nxp.com>
Link: https://patch.msgid.link/20260422093549.407022-5-sherry.sun@nxp.com

accel/amdxdna: Add expandable device heap support

Introduce an expandable device heap to avoid allocating a large heap
upfront. Start with a smaller initial heap and grow it on demand.
Return -EAGAIN when BO allocation fails due to insufficient heap space,
allowing userspace to trigger heap expansion via a heap BO creation
IOCTL and retry the allocation.

Manage heap chunks using an xarray. On expansion, register new chunks
with the firmware via MSG_OP_ADD_HOST_BUFFER.

Since heap shrinking is not supported by the firmware, release all heap
chunks on device close.

Co-developed-by: Wendy Liang <wendy.liang@amd.com>
Signed-off-by: Wendy Liang <wendy.liang@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260515161922.744647-1-lizhi.hou@amd.com

PCI: imx6: Assert PERST# before enabling regulators

The PCIe endpoint may start responding or driving signals as soon as its
supply is enabled, even before the reference clock is stable.  Asserting
PERST# before enabling the regulator ensures that the endpoint remains in
reset throughout the entire power-up sequence, until both power and refclk
are known to be stable and link initialization can safely begin.

Currently, the driver enables the vpcie3v3aux regulator in imx_pcie_probe()
before PERST# is asserted in imx_pcie_host_init(), which may cause PCIe
endpoint undefined behavior during early power-up. However, there is no
issue so far because PERST# is requested as GPIOD_OUT_HIGH in
imx_pcie_probe(), which guarantees that PERST# is asserted before enabling
the vpcie3v3aux regulator.

This prepares for an upcoming changes that will parse the reset property
using the new Root Port binding, which will use GPIOD_ASIS when requesting
the reset GPIO. With GPIOD_ASIS, the GPIO state is not guaranteed, so
explicit sequencing is required.

Fix the power sequencing by:

  1. Moving vpcie3v3aux regulator enable from probe to
     imx_pcie_host_init(), where it can be properly sequenced with PERST#.

  2. Moving imx_pcie_assert_perst() before regulator and clock enable to
     ensure correct ordering.

Signed-off-by: Sherry Sun <sherry.sun@nxp.com>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Richard Zhu <hongxing.zhu@nxp.com>
Link: https://patch.msgid.link/20260422093549.407022-4-sherry.sun@nxp.com

PCI: host-generic: Add common helpers for parsing Root Port properties

Introduce generic helper functions to parse Root Port device tree nodes and
extract common properties like reset GPIOs. This allows multiple PCI host
controller drivers to share the same parsing logic.

Define struct pci_host_port to hold common Root Port properties (currently
only list of PERST# GPIO descriptors) and add pci_host_common_parse_ports()
to parse Root Port nodes from device tree.

Also add the 'ports' list to struct pci_host_bridge to better maintain
parsed Root Port information.

Signed-off-by: Sherry Sun <sherry.sun@nxp.com>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20260422093549.407022-3-sherry.sun@nxp.com

drm/v3d: Release indirect CSD GEM reference on CPU job free

v3d_get_cpu_indirect_csd_params() takes a reference to the indirect BO via
drm_gem_object_lookup() and stashes it in cpu_job->indirect_csd.indirect,
but nothing on the CPU job teardown path ever drops that reference.

Drop the extra reference in v3d_cpu_job_free(). The NULL check covers ioctl
errors before the lookup ran and CPU job types other than
V3D_CPU_JOB_TYPE_INDIRECT_CSD, which leave the field zero-initialised.

Cc: stable@vger.kernel.org
Fixes: 18b8413b25b7 ("drm/v3d: Create a CPU job extension for a indirect CSD job")
Assisted-by: Claude:claude-opus-4.7
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Link: https://patch.msgid.link/20260515-v3d-cpu-job-leaks-v1-2-7f147cbbf935@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>

drm/v3d: Fix use-after-free of CPU job query arrays on error path

The CPU job ioctl's fail label calls kvfree() on cpu_job's timestamp and
performance query arrays after v3d_job_cleanup(), which drops the job's
last reference and frees cpu_job. Reading cpu_job at that point is a
use-after-free. Also, on the early v3d_job_init() failure path, it is a
NULL dereference, since v3d_job_deallocate() zeroes the local pointer.

In the success path, the arrays are released from the scheduler's
.free_job callback, but on the error path, they are freed manually, as
the job was never pushed to the scheduler. While the success path deals
with this correctly, the fail path doesn't.

On top of that, the manual kvfree() calls only free the array storage;
they don't drm_syncobj_put() the per-query syncobjs that
v3d_timestamp_query_info_free() and v3d_performance_query_info_free()
release on the success path. So the same fail path that triggers the
use-after-free also leaks one syncobj reference per query.

Unify the CPU job teardown into the CPU job's kref destructor, mirroring
v3d_render_job_free(). The scheduler's .free_job slot reverts to the
generic v3d_sched_job_free() and the fail label drops the manual
kvfree() calls, leaving a single teardown path that is reached from both
the scheduler and the ioctl error path. That removes the use-after-free,
the NULL dereference, and the syncobj leak by construction.

Cc: stable@vger.kernel.org
Fixes: 9ba0ff3e083f ("drm/v3d: Create a CPU job extension for the timestamp query job")
Assisted-by: Claude:claude-opus-4.7
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Link: https://patch.msgid.link/20260515-v3d-cpu-job-leaks-v1-1-7f147cbbf935@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>

dt-bindings: PCI: fsl,imx6q-pcie: Add reset GPIO in Root Port node

Update fsl,imx6q-pcie.yaml to include the standard reset-gpios property for
the Root Port node.

The reset-gpios property is already defined in pci-bus-common.yaml for
PERST#, so use it instead of the local reset-gpio property. Keep the
existing reset-gpio property in the bridge node for backward compatibility,
but mark it as deprecated.

Signed-off-by: Sherry Sun <sherry.sun@nxp.com>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20260422093549.407022-2-sherry.sun@nxp.com

PCI: imx6: Fix IMX6SX_GPR12_PCIE_TEST_POWERDOWN handling

The IMX6SX_GPR12_PCIE_TEST_POWERDOWN bit does not control the PCIe
reference clock on i.MX6SX. Instead, it is part of i.MX6SX PCIe core
reset sequence.

Move the IMX6SX_GPR12_PCIE_TEST_POWERDOWN assertion/deassertion into
the core reset functions to properly reflect its purpose. Remove the
.enable_ref_clk() callback for i.MX6SX since it was incorrectly
manipulating this bit.

Fixes: e3c06cd063d6 ("PCI: imx6: Add initial imx6sx support")
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260319090844.444987-1-hongxing.zhu@nxp.com

PCI/pwrctrl: Lock device when calling device_is_bound()

The kerneldoc for device_is_bound() states that it must be called with
the device lock taken. Synchronize the two calls in pwrctrl core.

Fixes: b35cf3b6aa1e ("PCI/pwrctrl: Add APIs to power on/off pwrctrl devices")
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
Link: https://patch.msgid.link/20260518100700.47581-1-bartosz.golaszewski@oss.qualcomm.com

PCI: Drop unnecessary retries when restoring BARs

In 2012, commit 26f41062f28d ("PCI: check for pci bar restore completion
and retry") amended pci_restore_state() to attempt BAR restoration up to 10
times.  This was necessary because back in the day, only a 100 msec delay
was observed after pcie_flr() carried out a Function Level Reset.  The
retries ensured that BARs were restored even if devices needed more time to
come out of reset.

In 2016, commit 5adecf817dd6 ("PCI: Wait for up to 1000ms after FLR reset")
extended the delay to 1 sec.  Commit a2758b6b8fdb ("PCI: Rename
pci_flr_wait() to pci_dev_wait() and make it generic") subsequently
extended it further to 60 sec.

The lengthened delay makes it unnecessary to retry BAR restoration, so
drop it.

Reported-by: Bjorn Helgaas <bhelgaas@google.com>
Closes: https://lore.kernel.org/r/20260416225745.GA41850@bhelgaas/
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/785c98b50a7a00d0698848c75d51b8f5669ad18f.1777814679.git.lukas@wunner.de

PCI: Wait for device readiness after D3hot -> D0uninitialized transition

For a device that advertises No_Soft_Reset == 0, a transition from D3hot to
D0uninitialized is a soft reset, and the resulting internal device state is
undefined.

Per PCIe r7.0, sec 2.3.1, a transition from D3hot to D0uninitialized
mandates a minimum 10 ms delay before accessing the device. Following this
delay, the device is permitted to respond to initial configuration requests
with a Request Retry Status (RRS) completion status if it needs more time
to initialize.

Call pci_dev_wait() after pci_power_up() performs a D3hot->D0uninitialized
transition to ensure the device is ready to accept config accesses, as is
done after the similar transition in pci_pm_reset().

If the device is already ready, this is essentially a no-op except for one
additional config read.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Link: https://patch.msgid.link/20260518191220.636213-3-bhelgaas@google.com

PCI: Log device readiness timeouts as errors

pci_dev_wait() waits for a device to be Configuration-Ready after a reset,
such as a Function-Level Reset (FLR), a soft reset during a D3hot->
D0uninitialized transition when No_Soft_Reset == 0), or a power-up sequence
from D3cold->D0uninitialized.

If pci_dev_wait() returns success, the device is guaranteed to respond to
configuration requests with Successful Completion status. If it times out,
device is completely non-responsive.

Upgrade the log level from pci_warn() to pci_err() to reflect this failure
state.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Link: https://patch.msgid.link/20260518191220.636213-2-bhelgaas@google.com

drm/amd/display: Fix eDP receiver ready status check in T7 sequence

[Why]
Some eDP panels return sinkstatus as 0x5, causing the original sinkstatus == 1
check to never match and resulting in unnecessary polling delay. The
equality check is too restrictive and doesn't properly validate the
specific status bit that indicates receiver readiness.

[How]
Replace direct value comparison with proper bitmask check using
DP_RECEIVE_PORT_0_STATUS constant.

Reviewed-by: Wenjing Liu <wenjing.liu@amd.com>
Signed-off-by: Sung-huai Wang <Danny.Wang@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Revert "drm/amd/display: dmub_cmd.h: add missing kernel-doc for enums"

[Why & How]
This reverts commit f71d0b68ec58b781f4f44ea642846bedce075e85.

1. Auto-generated Header: The file 'dmub_cmd.h' is an auto-generated header
   managed in an external repository (dmu_stg). Manual changes made directly in
   this repository will be overwritten and lost during the next automated weekly
   synchronization.

2. Tooling Compatibility: This header is governed by internal AMD firmware
   standards which require Doxygen formatting for cross-team documentation.
   Moving to kernel-doc syntax may break internal documentation pipelines.

3. Suppressing Warnings: Current 'make htmldocs' and 'make W=1' builds
   do not actively scan 'dmub_cmd.h' for kernel-doc compliance, thus no warnings
   are triggered during standard compilation. To address warnings generated when
   manually running './scripts/kernel-doc', we have added a notice at the file
   header indicating that this is an auto-generated file that does not strictly
   follow kernel-doc formatting. This ensures that any future linting tools or
   manual checks recognize the formatting as intentional.

Acked-by: Harry Wentland <harry.wentland@amd.com>
Acked-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: James Lin <pinglei.lin@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Add some missing code for dcn42

[why & how]
Some DCN4.2 related code is missing from upstream

Fixes: e56e3cff2a1b ("drm/amd/display: Sync dcn42 with DC 3.2.373")
Acked-by: ChiaHsuan Chung <ChiaHsuan.Chung@amd.com>
Reviewed-by: Roman Li <Roman.Li@amd.com>
Signed-off-by: James Lin <pinglei.lin@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd: Reduce code duplication in runtime PM

[Why]
amdgpu_pmops_runtime_suspend() runs almost the same code that
amdgpu_pmops_runtime_idle() runs. That is there is pointless code
duplication.

[How]
Move amdgpu_pmops_runtime_idle() up, extract common code and then
call from both functions. No intended functional changes.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: Check bounds for allocate_sdma_queue restore_sdma_id

allocate_sdma_queue has an option where the sdma queue id can be
specified (used by CRIU). We weren't bounds-checking that
value.

Confirm it's less than the maximum number of queues.

Signed-off-by: David Francis <David.Francis@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: use atomic operation to achieve lockless serialization

In amdgpu_seq64_alloc there is a possibility that two difference cores
from two separate NODES can try to and could get the same free slot.
So this fixes that race here using atomic test_and_set clear operations.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/vce3: Fix VCE 3 firmware size and offsets

The VCPU BO contains the actual FW at an offset, but
it was not calculated into the VCPU BO size.
Subtract this from the FW size to make sure there is
no out of bounds access.

This may fix VM faults when using VCE 3.

Cc: John Olender <john.olender@gmail.com>
Fixes: e98226221467 ("drm/amdgpu: recalculate VCE firmware BO size")
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: Check bounds on allocate_doorbell

allocated_doorbell has an option to set the doorbell id
to a specific value (used by CRIU). This value was not
bounds checked.

Check to confirm it's less than KFD_MAX_NUM_OF_QUEUES_PER_PROCESS.

Signed-off-by: David Francis <David.Francis@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/vce2: Fix VCE 2 firmware size and offsets

The VCPU BO contains the actual FW at an offset, but
it was not calculated into the VCPU BO size.
Subtract this from the FW size to make sure there is
no out of bounds access.

Additionally, increase the VCE_V2_0_DATA_SIZE to
have extra space after the VCE handles.

Also increase the data size used for each VCE handle.
The FW needs 23744 bytes, use 24K to be safe.

This fixes VM faults when using VCE 2.

Cc: John Olender <john.olender@gmail.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/4802
Fixes: e98226221467 ("drm/amdgpu: recalculate VCE firmware BO size")
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/vce1: Stop using amdgpu_vce_resume

The VCE1 firmware works slightly differently and is already
loaded by vce_v1_0_load_fw(). It doesn't actually need to
call amdgpu_vce_resume().

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/vce1: Fix VCE 1 firmware size and offsets

The VCPU BO contains the actual FW at an offset, but
it was not calculated into the VCPU BO size.
Subtract this from the FW size to make sure there is
no out of bounds access.

Make sure the stack and data offsets are aligned to
the 32K TLB size.

Check that the FW microcode actually fits in the
space that is reserved for it.

Fixes: d4a640d4b9f3 ("drm/amdgpu/vce1: Implement VCE1 IP block (v2)")
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/vce1: Don't repeat GTT MGR node allocation

Only allocate entries from the GTT manager when the
VCE GTT node is not allocated yet. This prevents the
possibility of allocating them multiple times, which
causes issues during GPU reset and suspend/resume.

Fixes: 71aec08f80e7 ("amdgpu/vce: use amdgpu_gtt_mgr_alloc_entries")
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/vce1: Check if VRAM address is lower than GART.

Previously, I had assumed this was not possible
so it was OK to not handle it, but now we got a report
from a user who has a board that is configured this way.

When the VCPU BO is already located in a low 32-bit address
in VRAM (eg. when VRAM is mapped to the low address space),
don't do the workaround.

Fixes: 71aec08f80e7 ("amdgpu/vce: use amdgpu_gtt_mgr_alloc_entries")
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/vce1: Remove superfluous address check

The same thing is already checked a few lines above.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/vce1: Check that the GPU address is < 128 MiB

When ensuring the low 32-bit address, make sure it is
less than 128 MiB, otherwise the VCE seems to fail to initialize.
This seems to be an undocumented limitation of the firmware
validation mechanism. Note that in case of VCE1 the BAR
address is zero and we can't change it also due to the
firmware validator.

When programming the mmVCE_VCPU_CACHE_OFFSETn registers,
don't AND them with a mask. This is incorrect because
the register mask is actually 0x0fffffff and useless because
we already ensure the addresses are below the limit.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Align amdgpu_gtt_mgr entries to TLB size on Tahiti (v2)

The TLB is organized in groups of 8 entries, each one is 4K.
On Tahiti, the HW requires these GART entries to be 32K-aligned.

This fixes a VCE 1 firmware validation failure that can happen
after suspend/resume since we use amdgpu_gtt_mgr for VCE 1.

v2:
- Change variable declaration order
- Add comment about "V bit HW bug"

Fixes: 698fa62f56aa ("drm/amdgpu: Add helper to alloc GART entries")
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: Fix OOB memory exposure in get_wave_state()

The get_wave_state() function for v9 trusts cp_hqd_cntl_stack_size and
cp_hqd_cntl_stack_offset values read directly from the MQD, which are
written by GPU microcode and fully attacker-controlled on the
CRIU-restore path (via AMDKFD_IOC_RESTORE_PROCESS with H3).

this leads to an unbounded copy_to_user() that can leak adjacent
GTT/kernel memory. If offset > size, integer underflow produces a ~4 GiB
read length, if size is set to 1 MiB against a 4 KiB allocation, we leak
1 MiB of adjacent kernel memory (other queues' MQDs, ring buffers, KASLR
pointers).

Fix by clamping both cp_hqd_cntl_stack_size to the actual allocated
buffer size (q->ctl_stack_size) and cp_hqd_cntl_stack_offset to the
clamped size before performing arithmetic and copy_to_user().

This ensures we never read beyond the allocated kernel BO regardless of
attacker-supplied MQD field values.

Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/pm: fix memleak of dpm_policies on smu v15

In smu_v15_0_fini_smc_tables, dpm_policies was not freed or NULLed, causing a memory leak.
Add kfree() and NULL assignment to properly release memory and avoid dangling pointers.

Fixes: 2beedc3a92b7 ("drm/amd/pm: Add initial support for smu v15_0_8");
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: Enable SDMA queue reset on gfx v12.1

After suspend/resume sdma_gang is supported on MES 12.1, SDMA queue reset
is supported too.

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Michael Chen<michael.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Support MES suspend_all_sdma_gangs

suspend_all_sdma_gangs is supported in new MES firmware for gfx 12.1

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Michael Chen<michael.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: fix OOB risk parsing virt RAS batch trace replies on the VF

amdgpu_virt_ras_get_batch_records() indexed batchs[] and records[]
from ras_cmd_batch_trace_record_rsp copied out of shared memory without
fully bounding the cache window or per-batch offset/trace_num. A
tampered or corrupted buffer could set real_batch_num past the array,
make a naive start_batch_id + real_batch_num comparison wrap in
uint64_t, or point offset+trace_num past records[].

Add amdgpu_virt_ras_check_batch_cached() for a subtraction-based window
with a real_batch_num cap, re-run it after GET_BATCH_TRACE_RECORD, and
use an explicit batch index into batchs[]. Consolidate batch_id,
trace_num, and offset+trace_num checks; on any failure memset the cache
and return -EIO so the next call refetches.

Signed-off-by: Chenglei Xie <Chenglei.Xie@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Add guest driver CUID support

v3:
improve the coding style.

v2:
use debugfs_create_x64 and debugfs_create_x8 to create node.

v1:
1. Add guest driver CUID support
2. Do not expose vf index(variable "fcn_idx") to customers,
replace the fcn_idx with pad.
Only expose the unitid to customers.

background:
Change fcn_idx to pad, VF index won't expose to guest vm.
Introduce a new unitid field as the VF identifier to replace the VF index:
1).unitid is assigned by the host driver
2).It is delivered to the guest via the pf2vf message
3).The application or umd can retrieve united from the sysfs node

Signed-off-by: chong li <chongli2@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Fix discovery offset check under VF

Discovery table may be kept at offset 0 by host driver. Remove the
validation check.

Fixes: 01bdc7e219c4 ("drm/amdgpu: New interface to get IP discovery binary v3")
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Ellen Pan <yunru.pan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: remove va cursors for all mappings

va_cursor struct needs to be cleaned even if the mapping
has been removed already.

Also simplify it by make it a void function as return value
check isn't needed as its called during tear down.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: reject non-user addresses early in GEM_USERPTR ioctl

amdgpu_gem_userptr_ioctl() currently accepts any value of args->addr
and only discovers an out-of-range pointer much later, inside
amdgpu_gem_object_create() and the HMM mirror registration path.
Userspace can drive that path with kernel-side virtual addresses;
the get_user_pages() layer rejects them, but only after the driver
has already allocated a GEM object and started wiring up notifier
state that then has to be torn down on failure.

Add an access_ok() guard at the top of the ioctl, right after the
existing page-alignment check and before flag validation, so any
address that does not lie within the calling task's user address
range is rejected with -EFAULT before any allocation occurs. No
legitimate ROCm/HSA userspace passes kernel-mode pointers through
this interface, so this is defense-in-depth rather than a behaviour
change for valid callers; -EFAULT matches the convention already
used by other uaccess-style rejections in the kernel.

Also add an explicit #include <linux/uaccess.h>; access_ok() is
otherwise only available transitively through other headers in
this translation unit.

Signed-off-by: Amir Shetaia <Amir.Shetaia@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

watchdog: ziirave_wdt: Use named initializers for struct i2c_device_id

While being less compact, using named initializers allows to more easily
see which members of the structs are assigned which value without having
to lookup the declaration of the struct. And it's also more robust
against changes to the struct definition.

This patch doesn't modify the compiled arrays, only their representation
in source form benefits. The former was confirmed with x86 and arm64
builds.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Link: https://lore.kernel.org/r/20260518171901.904094-2-u.kleine-koenig@baylibre.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>

drm/amdgpu/vpe: Force collaborate sync after TRAP

VPE1 could possibly hang and fail to power off at the end of commands in
collaboration mode. This workaround adds a COLLAB_SYNC after TRAP to
force instances synchronized to avoid VPE1 fail to power off.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Alan liu <haoping.liu@amd.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5171
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/userq: update the vm task info during signal ioctl

Pagefaults does not have process information correctly populated
as vm->task is not set during vm_init but should be updated while
real submission. So setting that up during signal_ioctl to get
the correct submission process details.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/userq: cancel reset work while tear down in progress

While tear down of a userq_mgr is happening when all the queues
are free we should cancel any reset work if pending before exiting.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Remove UML build exclusion from Kconfig

The depends on !UML was added in commit dffe68131707 ("amdgpu: Avoid
building on UML") to work around build failures with allyesconfig on
UML. The original errors were:

- smu7_hwmgr.c: incompatible pointer type 'struct cpuinfo_um *' vs
'struct cpuinfo_x86 *' in intel_core_rkl_chk()
- kfd_topology.c: 'struct cpuinfo_um' has no member named 'apicid'

Both issues have since been resolved independently:
- intel_core_rkl_chk() has been removed entirely.
- kfd_topology.c now uses a proper #ifdef CONFIG_X86_64 guard.
- All other cpuinfo_x86/cpu_data() references in the driver are
guarded by #if IS_ENABLED(CONFIG_X86) or #ifdef CONFIG_X86_64.

Removing this exclusion allows CONFIG_DRM_AMDGPU to be selected on UML,
which in turn enables running KUnit tests (such as amdgpu_dm_crc_test)
under UML without needing a full hardware-capable kernel build.

Reviewed-by: Alex Hung <alex.hung@amd.com>
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: rework userq reset work handling

It is illegal to schedule reset work from another reset work!

Fix this by scheduling the userq reset work directly on the work queue
of the reset domain.

Not fully tested, I leave that to the IGT test cases.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/userq: pin mqd and fw object bo to avoid eviction

mqd and fw objects are queue core objects which should remain
valid and never be unmapped and evicted for user queues to work
properly.

During eviction if these buffers are evicted the hw continue to
use the invalid addresses and caused page faults and system hung.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/userq: use drm_exec in amdgpu_userq_fence_read_wptr

To access the bo from vm mapping first lock the root bo and
then the object bo of the mapping to make sure both locks
are taken safely.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/xe/guc: Use xe_device_is_l2_flush_optimized()

We encapsulate the logic to check if the platform has L2 flush
optimization feature in xe_device_is_l2_flush_optimized(), but
guc_ctl_feature_flags() is using an open-coded version of that same type
of check. Fix that by replacing the open-coded check with
xe_device_is_l2_flush_optimized().

Reviewed-by: Himanshu Girotra <himanshu.girotra@intel.com>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Link: https://patch.msgid.link/20260513-guc-l2-flush-opt-use-xe_device_is_l2_flush_optimized-v1-1-36fa866d6ed8@intel.com
Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com>

HID: core: Fix size_t specifier in hid_report_raw_event()

When building for 32-bit platforms, for which 'size_t' is
'unsigned int', there are warnings around using the incorrect format
specifier to print bsize in hid_report_raw_event():

  drivers/hid/hid-core.c:2054:29: error: format specifies type 'long' but the argument has type 'size_t' (aka 'unsigned int') [-Werror,-Wformat]
   2053 |                 hid_warn_ratelimited(hid, "Event data for report %d is incorrect (%d vs %ld)\n",
        |                                                                                         ~~~
        |                                                                                         %zu
   2054 |                                      report->id, csize, bsize);
        |                                                         ^~~~~
  drivers/hid/hid-core.c:2076:29: error: format specifies type 'long' but the argument has type 'size_t' (aka 'unsigned int') [-Werror,-Wformat]
   2075 |                 hid_warn_ratelimited(hid, "Event data for report %d was too short (%d vs %ld)\n",
        |                                                                                          ~~~
        |                                                                                          %zu
   2076 |                                      report->id, rsize, bsize);
        |                                                         ^~~~~

Use the proper 'size_t' format specifier, '%zu', to clear up the
warnings.

Cc: stable@vger.kernel.org
Fixes: 2c85c61d1332 ("HID: pass the buffer size to hid_report_raw_event")
Reported-by: Miguel Ojeda <ojeda@kernel.org>
Closes: https://lore.kernel.org/20260516020430.110135-1-ojeda@kernel.org/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'perf-upstream'

sched/cache: Fix stale preferred_llc for a new task

On fork without CLONE_VM, the child gets a new mm,
the parent's preferred_llc value is stale for the
child.

Fix this by resetting the task's preferred_llc to -1.

This bug was reported by sashiko.

Fixes: 47d8696b95f7 ("sched/cache: Assign preferred LLC ID to processes")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/0ec7309d0e24ede97656754d1505b7490403d966.1778703694.git.tim.c.chen@linux.intel.com

sched/cache: Fix has_multi_llcs iff at least one partition has multiple LLCs

sched_cache_present is a global static key, but build_sched_domains()
is called per partition from the "Build new domains" loop in
partition_sched_domains_locked(). Each call unconditionally sets the
key based solely on the has_multi_llcs local variable for that partition.

The call to the last partition set the value even when there
are previous partitions with multiple LLCs.

If partition A (multi-LLC) is built first, the key is enabled. Then
when partition B (single-LLC) is built, the key is disabled. The
multi-LLC partition A is still active but the key is now off.

Fix it by doing a similar thing as sched_energy_present: check the
multi-LLCs during the iteration over all the partitions rather than
checking it on a single partition.

This bug was reported by sashiko.

Fixes: d59f4fd1d303 ("sched/cache: Enable cache aware scheduling for multi LLCs NUMA node")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/c541af2547d54509fbfd3b3a1e8072e2e5c7ff68.1778703694.git.tim.c.chen@linux.intel.com

sched/cache: Fix cache aware scheduling enabling for multi LLCs system

If there are multiple LLCs in the system, cache aware scheduling
should be enabled. However, there is a corner case where, if there
is a single NUMA node and a single LLC per node, cache aware
scheduling will be turned on in the current implementation -
because at this moment, the parent domain has not yet been
degenerated, and it is possible that the current domain has the
same cpu span as its parent. There is no need to turn cache aware
scheduling on in this scenario.

Fix it by iterating the parent domains to find a domain that is
a superset of the current sd_llc, so that later, after the duplicated
parent domains have been degenerated, cache aware scheduling will
take effect.

For example, the expected behavior would be:
2 sockets, 1 LLC per socket: MC span=0-3, PKG span=0-7, has_multi_llcs=true
1 socket, 2 LLCs per socket: MC span=0-3, PKG span=0-7, has_multi_llcs=true
2 sockets, 2 LLCs per socket: MC span=0-3, PKG span=0-7, has_multi_llcs=true
1 socket, 1 LLC per socket: MC span=0-3, PKG span=0-3, has_multi_llcs=false

This bug was reported by sashiko.

Fixes: d59f4fd1d303 ("sched/cache: Enable cache aware scheduling for multi LLCs NUMA node")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/6328a8a7f40925cec2a712d81ee58128a4c4444a.1778703694.git.tim.c.chen@linux.intel.com

sched/cache: Fix race condition during sched domain rebuild

sched_cache_active_set_unlocked() checks hardware support without
locks:
static void sched_cache_active_set(bool locked)
{
        /* hardware does not support */
        if (!static_branch_likely(&sched_cache_present)) {
                _sched_cache_active_set(false, locked);
                return;
        }
    ...
If build_sched_domains() runs concurrently during CPU hotplug,
it can disable sched_cache_present under sched_domains_mutex
and the CPU hotplug lock. If a debugfs write thread evaluates
sched_cache_present as true right before that, and then blocks
or gets preempted, it might proceed to enable sched_cache_active
after the hardware support has been marked as absent. Make it
safer by acquiring cpus_read_lock() and sched_domains_mutex_lock()
when the user changes sched_cache_active via debugfs.

This bug was reported by sashiko.

Fixes: 067a31358143 ("sched/cache: Allow the user space to turn on and off cache aware scheduling")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/9afddf439687f04bb56b46625bd9f153eb8abad5.1778703694.git.tim.c.chen@linux.intel.com

sched/cache: Fix checking active load balance by only considering the CFS task

The currently running task cur may not be a CFS task, such as
an RT or Deadline task. For non-CFS tasks, the task_util(cur)
utilization average is not maintained, so this might pass a
stale or meaningless value to can_migrate_llc().

Check if the task is CFS before getting its task_util().

This bug was reported by sashiko.

Fixes: 714059f79ff0 ("sched/cache: Handle moving single tasks to/from their preferred LLC")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/f9161133cf040d286dca11344a112c5ef2a5253d.1778703694.git.tim.c.chen@linux.intel.com

sched/cache: Fix unpaired account_llc_enqueue/dequeue

There is a race condition that, after a task is enqueued
on a runqueue, task_llc(p) may change due to CPU hotplug,
because the llc_id is dynamically allocated and adjusted
at runtime.
Therefore, checking task_llc(p) to determine whether the
task is being dequeued from its preferred LLC is unreliable
and can cause inconsistent values.

To fix this problem, record whether p is enqueued on its
preferred LLC, in order to pair with account_llc_dequeue()
to maintain a consistent nr_pref_llc_running per runqueue.

This bug was reported by sashiko, and the solution was once
suggested by Prateek.

Fixes: 46afe3af7ead ("sched/cache: Track LLC-preferred tasks per runqueue")
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/0c8c6a1571d66792a4d2ff0103ba3cc13e059046.1778703694.git.tim.c.chen@linux.intel.com

sched/cache: Annotate lockless accesses to mm->sc_stat.cpu

mm->sc_stat.cpu is written by task_cache_work() and could be read
locklessly by several functions on other CPUs. Use READ_ONCE and
WRITE_ONCE on mm->sc_stat.cpu access and write to prevent inconsistent
values from compiler optimizations when there are multiple accesses.

For example in get_pref_llc(), if the writer updated the field between
two compiler-generated loads, the validation (e.g., cpu != -1) and
subsequent use (e.g., llc_id(cpu)) could operate on different values,
allowing a negative CPU ID to be used as an index.

Leave plain write in mm_init_sched(), where the mm is not
yet visible to other CPUs.

This bug was reported by sashiko.

Fixes: 47d8696b95f7 ("sched/cache: Assign preferred LLC ID to processes")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/63ea494f12efcf265d7134400a06cd75d7f2c310.1778703694.git.tim.c.chen@linux.intel.com

sched/cache: Fix potential NULL mm pointer access

A concurrent task exit might cause a NULL pointer dereference
in account_mm_sched(). Use the locally cached mm pointer instead,
since the active_mm reference guarantees the structure remains
allocated. Meanwhile, skip the kernel thread because it has
nothing to do with cache aware scheduling.

This bug was reported by sashiko and Vern.

Fixes: df0d98475954 ("sched/cache: Introduce infrastructure for cache-aware load balancing")
Reported-by: Vern Hao <haoxing990@gmail.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/09cf7ee3-6e27-4505-9692-4b4a4707c8b2@gmail.com/
Link: https://patch.msgid.link/066d8cfa45d4822bf4367e788c50377c66bbcc82.1778703694.git.tim.c.chen@linux.intel.com

sched/cache: Fix rcu warning when accessing sd_llc domain

rcu_dereference_all() should be used to access the
sd_llc domain under RCU protection.

This bug was reported by sashiko.

Fixes: df0d98475954 ("sched/cache: Introduce infrastructure for cache-aware load balancing")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/2dc49455e861215d8059a1c877953f0b95990038.1778703694.git.tim.c.chen@linux.intel.com

sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling

Introduce a set of debugfs knobs to control how aggressively the
cache aware scheduling does the task aggregation.

(1) aggr_tolerance
With sched_cache enabled, the scheduler uses a process's footprint
as a proxy for its LLC footprint to determine if aggregating tasks
on the preferred LLC could cause cache contention. If the footprint
exceeds the LLC size, aggregation is skipped. Since the kernel
cannot efficiently track per-task cache usage (resctrl is
user-space only), userspace can provide a more accurate hint.

Introduce /sys/kernel/debug/sched/llc_balancing/aggr_tolerance to
let users control how strictly footprint limits aggregation. Values
range from 0 to 100:
  - 0: Cache-aware scheduling is disabled.
  - 1: Strict; tasks with footprint larger than LLC size are skipped.
  - >=100: Aggressive; tasks are aggregated regardless of footprint.
For example, with a 32MB L3 cache:

  - aggr_tolerance=1 -> tasks with footprint > 32MB are skipped.
  - aggr_tolerance=99 -> tasks with footprint > 784GB are skipped
    (784GB = (1 + (99 - 1) * 256) * 32MB).
Similarly, /sys/kernel/debug/sched/llc_balancing/aggr_tolerance also
controls how strictly the number of active threads is considered when
doing cache aware load balance. The number of SMTs is also considered.
High SMT counts reduce the aggregation capacity, preventing excessive
task aggregation on SMT-heavy systems like Power10/Power11.

Yangyu suggested introducing separate aggregation controls for the
number of active threads and memory footprint checks. Since there are
plans to add per-process/task group controls, fine-grained tunables are
deferred to that implementation.

(2) epoch_period, epoch_affinity_timeout,
    imb_pct, overaggr_pct are also turned into tunables.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Suggested-by: Tingyin Duan <tingyin.duan@gmail.com>
Suggested-by: Jianyong Wu <jianyong.wu@outlook.com>
Suggested-by: Yangyu Chen <cyy@cyyself.name>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Link: https://patch.msgid.link/1c62cc060ba2b33d7b1f0ed98b3390128edbae93.1778703694.git.tim.c.chen@linux.intel.com

sched/cache: Avoid cache-aware scheduling for memory-heavy processes

Prateek and Tingyin reported that memory-intensive workloads (such as
stream) can saturate memory bandwidth and caches on the preferred LLC
when sched_cache aggregates too many threads.

To mitigate this, estimate a process's memory footprint by comparing
its NUMA balancing fault statistics to the size of the LLC. If the
footprint exceeds the LLC size, skip cache-aware scheduling.

Note that footprint is only an approximation of the memory footprint,
since the kernel lacks suitable metrics to estimate the real working
set. If a user-provided hint is available in the future, it would be
more accurate. A later patch will allow users to provide a hint to
adjust this threshold.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Vern Hao <vernhao@tencent.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Link: https://patch.msgid.link/95cf64a385bcc12f18dcebe9d59e8d3ba8bb318f.1778703694.git.tim.c.chen@linux.intel.com

sched/cache: Calculate the LLC size and store it in sched_domain

Cache aware scheduling needs to know the LLC size that a process
can use, so as to avoid memory-intensive tasks from being
over-aggregated on a single LLC.

Introduce a preparation patch to add get_effective_llc_bytes() to
get the LLC size that a CPU can use. The function can be further
enhanced by subtracting the LLC cache ways reserved by resctrl
(CAT in Intel RDT, etc).

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Link: https://patch.msgid.link/37afee09ff608034da0ce149e72d33b6f4698edf.1778703694.git.tim.c.chen@linux.intel.com

sched/cache: Skip cache-aware scheduling for single-threaded processes

For a single thread, the current wakeup path tends to place it
on the same LLC where it was previously running with cache-hot
data. There is no need to enable cache-aware scheduling for
single-threaded processes for the following reasons:

1. Cache-aware scheduling primarily benefits multi-threaded
   processes where threads share data. Single-threaded processes
   typically have no inter-thread data sharing and thus gain little.

2. Enabling it incurs the additional overhead of tracking the
   thread's residency in the LLCs.

3. Bypassing single-threaded processes avoids excessive
   concentration of such tasks on a single LLC.

Nevertheless, this check can be omitted if users explicitly
provide hints for such single-threaded workloads where different
processes have shared memory, e.g., via prctl() or other interfaces
to be added in the future.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Link: https://patch.msgid.link/8a59a13aa58fdb48e410ecb2aabd97fe3ea5d256.1778703694.git.tim.c.chen@linux.intel.com

sched/cache: Disable cache aware scheduling for processes with high thread counts

A performance regression was observed by Prateek when running hackbench
with many threads per process (high fd count). To avoid this, processes
with a large number of active threads are excluded from cache-aware
scheduling.

With sched_cache enabled, record the number of active threads in each
process during the periodic task_cache_work(). While iterating over
CPUs, if the currently running task belongs to the same process as the
task that launched task_cache_work(), increment the active thread count.

If the number of active threads within the process exceeds the number
of Cores (divided by the SMT number) in the LLC, do not enable
cache-aware scheduling. However, on systems with a smaller number of
CPUs within 1 LLC, like Power10/Power11 with SMT4 and an LLC size of 4,
this check effectively disables cache-aware scheduling for any process.
One possible solution suggested by Peter is to use an LLC-mask instead
of a single LLC value for preference. Once there are a 'few' LLCs as
preference, this constraint becomes a little easier. It could be an
enhancement in the future.

For users who wish to perform task aggregation regardless, a debugfs knob
is provided for tuning in a subsequent change.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Link: https://patch.msgid.link/d076cd21a8e6c6341d1e2d927e118db770ebb650.1778703694.git.tim.c.chen@linux.intel.com

sched/cache: Allow only 1 thread of the process to calculate the LLC occupancy

Scanning online CPUs to calculate the occupancy might be
time-consuming. Only allow 1 thread of the process to scan
the CPUs at the same time, which is similar to what
NUMA balance does in task_numa_work().

Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/5672b52e588b855b01e5a1a17822f7c6c7237a3d.1778703694.git.tim.c.chen@linux.intel.com

drm/xe/multi_queue: Fix secondary queue error case

If xe_lrc_create() fails, the secondary queue added to the
multi-queue group list is not removed before freeing the
queue. Fix error path handling for secondary queues by
removing it from the multi-queue group list at the right
place.

Reported-by: Sebastian Österlund <sebastian.osterlund@intel.com>
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/7979
Fixes: d716a5088c88 ("drm/xe/multi_queue: Handle tearing down of a multi queue")
Cc: stable@vger.kernel.org # v7.0+
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patch.msgid.link/20260518191639.320890-2-niranjana.vishwanathapura@intel.com

cgroup/rstat: validate cpu before css_rstat_cpu() access

css_rstat_updated() is exposed as a BPF kfunc and accepts a
caller-provided cpu argument. The function uses cpu for per-cpu rstat
lookups without checking whether it refers to a valid possible CPU.

A BPF iter/cgroup program with CAP_BPF and CAP_PERFMON can pass an
invalid cpu value. On an unfixed UBSCAN_BOUNDS test kernel, cpu ==
0x7fffffff triggers:

  UBSAN: array-index-out-of-bounds in kernel/cgroup/rstat.c:31:9
  index 2147483647 is out of range for type 'long unsigned int [64]'
  Call Trace:
    css_rstat_updated
    bpf_iter_run_prog
    cgroup_iter_seq_show
    bpf_seq_read

Add cpu validation to the BPF-facing css_rstat_updated() kfunc and
move the common implementation to __css_rstat_updated() for in-kernel
callers.

Fixes: a319185be9f5 ("cgroup: bpf: enable bpf programs to integrate with rstat")
Signed-off-by: Qing Ming <a0yami@mailbox.org>
Signed-off-by: Tejun Heo <tj@kernel.org>

srcu: Don't queue workqueue handlers to never-online CPUs

While an srcu_struct structure is in the midst of switching from CPU-0
to all-CPUs state, it can attempt to invoke callbacks for CPUs that
have never been online.  Worse yet, it can attempt in invoke callbacks
for CPUs that never will be online, even including imaginary CPUs not in
cpu_possible_mask.  This can cause hangs on s390, which is not set up to
deal with workqueue handlers being scheduled on such CPUs.  This commit
therefore causes Tree SRCU to refrain from queueing workqueue handlers
on CPUs that have not yet (and might never) come online.

Because callbacks are not invoked on CPUs that have not been
online, it is an error to invoke call_srcu(), synchronize_srcu(), or
synchronize_srcu_expedited() on a CPU that is not yet fully online.
However, it turns out to be less code to redirect the callbacks
from too-early invocations of call_srcu() than to warn about such
invocations.  This commit therefore also redirects callbacks queued on
not-yet-fully-online CPUs to the boot CPU.

Reported-by: Vasily Gorbik <gor@linux.ibm.com>
Fixes: 61bbcfb50514 ("srcu: Push srcu_node allocation to GP when non-preemptible")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Samir <samir@linux.ibm.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Boqun Feng <boqun@kernel.org>

cgroup/rdma: Drop unnecessary READ_ONCE() on event counters

All accesses to the event counters are serialized by rdmacg_mutex,
making the READ_ONCE() annotations unnecessary. Remove them.

Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>

mailbox: add list of used channels to debugfs

During development, it is useful to see which mailboxes are currently
obtained. Use a seq-file in debugfs to list the currently registered
controllers and their used channels. Example output from a Renesas R-Car
X5H based system:

# cat /sys/kernel/debug/mailbox/mailbox_summary

189e0000.system-controller:
   0: c1000000.mailbox_test_send_to_recv
   1: c1000100.mailbox_test_recv_to_send
128: c1000100.mailbox_test_recv_to_send
129: c1000000.mailbox_test_send_to_recv
189e1000.system-controller:
   4: scmi_dev.1
   5: scmi_dev.2

Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>

mailbox: don't free the channel if the startup callback failed

If the optional startup() callbacks fails, we need to clear some states.
Currently, this is done by freeing the channel. This does, however, more
than needed which creates problems. Namely, it is calling the shutdown()
callback. This is totally not intuitive. No user expects that shutdown()
is called when startup() fails, similar to remove() not being called
when probe() fails. Currently, quite some mailbox users register the IRQ
in startup() and free them in shutdown(). These drivers will get a WARN
about freeing an already free IRQ. Other subtle issues could arise from
this unexpected behaviour.

To solve this problem, introduce a helper which does the minimal cleanup
and use it in both, in free_channel() and after startup() failed.

Link: https://sashiko.dev/#/patchset/20260402112709.13002-1-wsa%2Brenesas%40sang-engineering.com
Fixes: 2b6d83e2b8b7 ("mailbox: Introduce framework for mailbox")
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>

mailbox: Make mbox_send_message() return error code when tx fails

When the mailbox controller failed transmitting message, the error code
was only passed to the client's tx done handler and not to
mbox_send_message() in blocking mode. For this reason, the function could
return a false success. This commit resolves the issue by introducing the
tx status and checking it before mbox_send_message() returns.

This commit works with the premise that the multi-threads' access to a
channel in blocking mode is serialized by clients, not by the mailbox
APIs, since the current mbox_send_message() in blocking mode does not
support multi-threads.

Signed-off-by: Joonwon Kang <joonwonkang@google.com>
Reviewed-by: Sudeep Holla <sudeep.holla@kernel.org>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>

mailbox: Clarify multi-thread is not supported in blocking mode

Unlike in non-blocking mode, multi-thread has not been supported in
blocking mode. This commit is to prevent clients from having wrong
assumption by explicitly specifying this fact to the API doc.

Signed-off-by: Joonwon Kang <joonwonkang@google.com>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>

mailbox: mtk-adsp: fix UAF during device teardown

When the SOF audio driver fails to initialize (e.g. firmware boot
timeout), its devres unwind frees the snd_sof_dev object that the
mailbox client (mtk-adsp-ipc) reaches via chan->cl->rx_callback.
The mtk-adsp-mailbox shutdown clears the mailbox command registers
but leaves the IRQ line unmasked, so a late interrupt can still
queue a threaded handler after mbox_free_channel() had cleared
chan->cl, and mbox_chan_received_data() would then trigger UAF:

  BUG: KASAN: slab-use-after-free in sof_ipc3_validate_fw_version
   sof_ipc3_validate_fw_version
   sof_ipc3_do_rx_work
   sof_ipc3_rx_msg
   mt8196_dsp_handle_request
   mtk_adsp_ipc_recv
   mbox_chan_received_data
   mtk_adsp_mbox_isr
   irq_thread_fn
  Freed by task ...:
   kfree
   devres_release_all
   really_probe
   ... (sof-audio-of-mt8196 probe failure)

The crash was observed roughly three seconds after the failed probe.

disable_irq() in shutdown and enable_irq() in startup. disable_irq()
also waits for any in-flight interrupts, so by the time
mbox_free_channel() proceeds to clear chan->cl no rx_callback can run.

In addition, request the IRQ with IRQF_NO_AUTOEN so it stays masked
between probe and the first client bind — otherwise an early interrupt
can crash on chan->cl == NULL in mbox_chan_received_data().

Fixes: af2dfa96c52d ("mailbox: mediatek: add support for adsp mailbox controller")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>

mailbox: qcom: Unify user-visible "Qualcomm" name

Various names for Qualcomm as a company are used in user-visible config
options: QCOM, Qualcomm and Qualcomm Technologies. Switch to unified
"Qualcomm" so it will be easier for users to identify the options when
for example running menuconfig.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>

mailbox: exynos: Drop unused register definitions

Leaving these dead definitions in place hides which registers are
actually being used by the hardware, making the driver harder to read
and maintain. Remove them to clean up the file.

Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Reviewed-by: Alexey Klimov <alexey.klimov@linaro.org>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>

dt-bindings: mailbox: qcom: Add IPCC support for Hawi Platform

Document the Inter-Processor Communication Controller on the Qualcomm
Hawi Platform, which will be used to route interrupts across various
subsystems found on the SoC.

Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>

dt-bindings: mailbox: qcom,cpucp-mbox: Add Hawi compatible

Document CPU Control Processor (CPUCP) mailbox controller for Qualcomm
Hawi SoCs. It is software compatible with X1E80100 CPUCP mailbox
controller hence fallback to it.

Signed-off-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>

dt-bindings: mailbox: qcom: Add Shikra APCS compatible

Add compatible for the Qualcomm Shikra APCS block.

Signed-off-by: Komal Bajaj <komal.bajaj@oss.qualcomm.com>
Signed-off-by: Sneh Mankad <sneh.mankad@oss.qualcomm.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>

mailbox: qcom-cpucp: Add support for Nord CPUCP mailbox controller

The Nord SoC CPUCP mailbox supports 16 IPC channels, compared to 3 on
x1e80100. The existing driver hardcodes the channel count via a
compile-time constant (APSS_CPUCP_IPC_CHAN_SUPPORTED), making it
impossible to support hardware with a different number of channels.

Introduce a qcom_cpucp_mbox_data per-hardware configuration struct that
carries the channel count, and retrieve it via of_device_get_match_data()
at probe time. Switch the channel array from a fixed-size member to a
dynamically allocated buffer sized from the hardware data. Update the
x1e80100 entry to supply its own data struct, and add a new Nord entry
with num_chans = 16.

Signed-off-by: Deepti Jaggi <deepti.jaggi@oss.qualcomm.com>
Signed-off-by: Shawn Guo <shengchao.guo@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>

dt-bindings: mailbox: qcom: Document Nord CPUCP mailbox controller

Document CPUSS Control Processor (CPUCP) mailbox controller for Qualcomm
Nord SoC, which is compatible with X1E80100 CPUCP, even though it supports
more IPC channels.

Signed-off-by: Deepti Jaggi <deepti.jaggi@oss.qualcomm.com>
Signed-off-by: Shawn Guo <shengchao.guo@oss.qualcomm.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>

mailbox: mpfs: fix check for syscon presence in mpfs_mbox_inbox_isr()

mpfs_mbox_inbox_isr() writes to the sysreg scb syscon, not the control
scb syscon, but checks for the presence of the latter. Ultimately this
makes little difference because if one syscon is present, both will be.

Fixes: a4123ffab9ece ("mailbox: mpfs: support new, syscon based, devicetree configuration")
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>

drm/dp/mst: fix OOB reads on 2-byte fields in sideband reply parsers

Three sideband reply parsers read 16-bit fields as:

  val = (raw->msg[idx] << 8) | (raw->msg[idx+1]);

and check bounds only after the fact. When idx == raw->curlen,
raw->msg[idx+1] reads one byte past the received message data into
the following struct fields (curchunk_len, curchunk_idx, curlen).

Affected functions:
- drm_dp_sideband_parse_enum_path_resources_ack()
   full_payload_bw_number and avail_payload_bw_number fields
- drm_dp_sideband_parse_allocate_payload_ack()
   allocated_pbn field
- drm_dp_sideband_parse_query_payload_ack()
   allocated_pbn field

Fix by using a single combined check (idx + 2 > curlen) before each
2-byte read. Since the check is strictly tighter than idx > curlen,
no separate step is needed.

Fixes: ad7f8a1f9ced ("drm/helper: add Displayport multi-stream helper (v0.6)")
Cc: <stable@vger.kernel.org> # v3.17+
Signed-off-by: Ashutosh Desai <ashutoshdesai993@gmail.com>
Reviewed-by: Lyude Paul <lyude@redhat.com>
[added fixes tag]
Signed-off-by: Lyude Paul <lyude@redhat.com>
Link: https://patch.msgid.link/20260510203128.2884846-1-ashutoshdesai993@gmail.com

drm/dp/mst: fix OOB reads in remote DPCD/I2C sideband reply parsers

drm_dp_sideband_parse_remote_dpcd_read() reads num_bytes from the raw
message and then unconditionally does:

memcpy(bytes, &raw->msg[idx], num_bytes);

without checking that idx + num_bytes <= raw->curlen. raw->msg[] is
256 bytes; if a malicious or misbehaving MST hub sets num_bytes larger
than the remaining payload, the memcpy reads past the received data
into whatever follows in raw->msg[].

drm_dp_sideband_parse_remote_i2c_read_ack() has the same flaw (noted
with a /* TODO check */ comment since the code was introduced).

Fix both functions by using a single combined check
(idx + num_bytes > curlen) before each memcpy. Since num_bytes is u8,
it is always >= 0, so this strictly subsumes the simpler idx > curlen
form and no separate step is needed.

Fixes: ad7f8a1f9ced ("drm/helper: add Displayport multi-stream helper (v0.6)")
Cc: <stable@vger.kernel.org> # v3.17+
Signed-off-by: Ashutosh Desai <ashutoshdesai993@gmail.com>
Reviewed-by: Lyude Paul <lyude@redhat.com>
[added missing fixes tag]
Signed-off-by: Lyude Paul <lyude@redhat.com>
Link: https://patch.msgid.link/20260510201733.2882224-1-ashutoshdesai993@gmail.com

PCI: mediatek-gen3: Do full device power down on removal

When power control for downstream devices was introduced in the
mediatek-gen3 PCIe controller driver, only the power to the downstream
devices was cut when the controller driver is removed. This matched
existing behavior, but in hindsight a proper power down sequence should
have been followed.

Call mtk_pcie_devices_power_down() on driver removal so that in addition
to removing power from the downstream devices, PERST# is asserted.

Fixes: 1a152e21940a ("PCI: mediatek-gen3: Integrate new pwrctrl API")
Signed-off-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20260505105918.1823170-1-wenst@chromium.org

PCI: mediatek-gen3: Add a .shutdown() callback to control PERST# signal

Add a .shutdown() callback to control the timing of PERST# and power during
system shutdown to ensure that PERST# is asserted before power to the
connector is removed, as required by PCIe CEM r6.0, sec 2.2.

Signed-off-by: Jian Yang <jian.yang@mediatek.com>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20260413071401.1151-3-jian.yang@mediatek.com

PCI: mediatek-gen3: Fix PERST# control timing during system startup

Some MediaTek chips stop generating REFCLK if the PCIE_PHY_RSTB signal of
PCIe controller is asserted at the start of mtk_pcie_devices_power_up().

But the driver deasserts PCIE_PHY_RSTB together with PCIE_PE_RSTB signal
that is used to deassert PERST#. This violates PCIe CEM r6.0, sec 2.11.2,
which mandates waiting for 100ms (PCIE_T_PVPERL_MS) after power becomes
stable.

Move the MAC, PHY and BRG reset deassert code above the PCIE_T_PVPERL_MS
delay and leave the PCIE_PE_RSTB deassertion after the delay.

Add the 10ms delay mentioned in the MediaTek datasheet after asserting
PCIE_BRG_RSTB and before accessing the PCIE_RST_CTRL_REG register.

Signed-off-by: Jian Yang <jian.yang@mediatek.com>
[mani: commit log and comments rewording]
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20260413071401.1151-2-jian.yang@mediatek.com

drm/panel-edp: Add panel for Surface Pro 12in

Add an entry for the BOE NE120DRM-N28 panel,
used in the Microsoft Surface Pro 12-inch.

The values chosen were tested to be working fine
for wake from sleep and hibernation.

Panel edid:

00 ff ff ff ff ff ff 00 09 e5 c9 0c a0 06 00 07
0a 22 01 04 a5 19 11 78 07 9f 15 a6 55 4c 9b 25
0e 50 54 00 00 00 01 01 01 01 01 01 01 01 01 01
01 01 01 01 01 01 62 53 94 a0 80 b8 2e 50 18 10
3a 00 fe a9 00 00 00 1a 13 7d 94 a0 80 b8 2e 50
18 10 3a 00 fe a9 00 00 00 1a 00 00 00 fd 00 18
5a 5b 88 20 01 0a 20 20 20 20 20 20 00 00 00 fc
00 4e 45 31 32 30 44 52 4d 2d 4e 32 38 0a 00 0a

Signed-off-by: Harrison Vanderbyl <harrison.vanderbyl@gmail.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://patch.msgid.link/9e749a3a483e4a3c684eac3ee6a4b241c94a0362.1778822464.git.harrison.vanderbyl@gmail.com

ASoC: Add support for GPIOs driven amplifiers

Herve Codina <herve.codina@bootlin.com> says:

On some embedded system boards, audio amplifiers are designed using
discrete components such as op-amp, several resistors and switches to
either adjust the gain (switching resistors) or fully switch the
audio signal path (mute and/or bypass features).

Those switches are usually driven by simple GPIOs.

This kind of amplifiers are not handled in ASoC and the fallback is to
let the user-space handle those GPIOs out of the ALSA world.

In order to have those kind of amplifiers fully integrated in the audio
stack, this series introduces the audio-gpio-amp to handle them.

This new ASoC component allows to have the amplifiers seen as ASoC
auxiliarty devices and so it allows to control them through audio mixer
controls.

In order to ease the review, I choose to split modifications related
to the merge of the gpio-audio-amp part into the simple-amplfier driver
in several commits.

Link: https://patch.msgid.link/20260513081702.317117-1-herve.codina@bootlin.com

MAINTAINERS: Add the ASoC gpio audio amplifier entry

After contributing the component, add myself as the maintainer for the
ASoC gpio audio amplifier component.

Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Link: https://patch.msgid.link/20260513081702.317117-18-herve.codina@bootlin.com
Signed-off-by: Mark Brown <broonie@kernel.org>