]> git.ipfire.org Git - thirdparty/linux.git/log
thirdparty/linux.git
2 weeks agoMerge tag 'for-net-2026-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluet...
Jakub Kicinski [Thu, 4 Jun 2026 02:07:46 +0000 (19:07 -0700)] 
Merge tag 'for-net-2026-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - hci_core: fix memory leak in error path of hci_alloc_dev()
 - hci_sync: reject oversized Broadcast Announcement prepend
 - MGMT: Fix backward compatibility with userspace
 - MGMT: validate advertising TLV before type checks
 - L2CAP: reject BR/EDR signaling packets over MTUsig
 - RFCOMM: validate skb length in MCC handlers
 - RFCOMM: hold listener socket in rfcomm_connect_ind()
 - ISO: Fix not releasing hdev reference on iso_conn_big_sync
 - ISO: Fix a use-after-free of the hci_conn pointer
 - ISO: Fix data-race on iso_pi fields in hci_get_route calls
 - SCO: Fix data-race on sco_pi fields in sco_connect
 - BNEP: reject short frames before parsing

* tag 'for-net-2026-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: MGMT: Fix backward compatibility with userspace
  Bluetooth: SCO: Fix data-race on sco_pi fields in sco_connect
  Bluetooth: ISO: Fix data-race on iso_pi fields in hci_get_route calls
  Bluetooth: ISO: Fix a use-after-free of the hci_conn pointer
  Bluetooth: ISO: Fix not releasing hdev reference on iso_conn_big_sync
  Bluetooth: fix memory leak in error path of hci_alloc_dev()
  Bluetooth: bnep: reject short frames before parsing
  Bluetooth: hci_sync: reject oversized Broadcast Announcement prepend
  Bluetooth: L2CAP: reject BR/EDR signaling packets over MTUsig
  Bluetooth: RFCOMM: validate skb length in MCC handlers
  Bluetooth: MGMT: validate advertising TLV before type checks
  Bluetooth: RFCOMM: hold listener socket in rfcomm_connect_ind()
====================

Link: https://patch.msgid.link/20260603162714.342496-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agoMerge tag 'wireless-2026-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git...
Jakub Kicinski [Thu, 4 Jun 2026 02:07:34 +0000 (19:07 -0700)] 
Merge tag 'wireless-2026-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Things are finally quieting down:
 - iwlwifi:
   - FW reset handshake removal for older devices
   - NIC access fix in fast resume
   - avoid too large command for some BIOSes
   - fix TX power constraints in AP mode
 - cfg80211:
   - fix netlink parse overflow
   - fix potential 6 GHz scan memory leak
   - enforce HE/EHT consistency to avoid mac80211 crash
 - mac80211: guard radiotap antenna parsing

* tag 'wireless-2026-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: cfg80211: enforce HE/EHT cap/oper consistency
  wifi: fix leak if split 6 GHz scanning fails
  wifi: mac80211: limit injected antenna index in ieee80211_parse_tx_radiotap
  wifi: nl80211: reject oversized EMA RNR lists
  wifi: iwlwifi: pcie: simplify the resume flow if fast resume is not used
  wifi: iwlwifi: mvm: avoid oversized UATS command copy
  wifi: iwlwifi: mld: send tx power constraints before link activation
  wifi: iwlwifi: mvm: don't support the reset handshake for old firmwares
====================

Link: https://patch.msgid.link/20260603113208.171874-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agox86/cpuid: Update bitfields to x86-cpuid-db v3.1
Maciej Wieczor-Retman [Wed, 3 Jun 2026 17:10:57 +0000 (17:10 +0000)] 
x86/cpuid: Update bitfields to x86-cpuid-db v3.1

Update leaf_types.h to version 3.1, as generated by x86-cpuid-db.

Summary of the v3.1 changes:

* Fix a few typos that were found during the kernel CPUID data model
  review. Also include fixes found using an LLM agent review, from Ahmed.

* Rename thrd_director_nclasses to hw_feedback_nclasses as it's the
  name used in Intel SDM.

See https://gitlab.com/x86-cpuid.org/x86-cpuid-db/-/blob/v3.1/CHANGELOG.rst
for more info.

Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/9653d8690ec7093c8190b12d1fa8c689c4da50fe.1780506200.git.m.wieczorretman@pm.me
2 weeks agoMerge branch 'mptcp-misc-fixes-for-v7-1-rc7'
Jakub Kicinski [Thu, 4 Jun 2026 02:04:46 +0000 (19:04 -0700)] 
Merge branch 'mptcp-misc-fixes-for-v7-1-rc7'

Matthieu Baerts says:

====================
mptcp: misc fixes for v7.1-rc7

Here are various unrelated fixes:

- Patch 1: fix missing wakeups when multiple threads are reading from
  the same fd. A fix for v5.7.

- Patch 2: fix retransmission loop when MPTCP checksum is enabled. A fix
  for v5.14.

- Patch 3: fix a TOCTOU race while computing rcv_wnd. A fix for v5.11.

- Patch 4: allow subflows receive window to shrink if needed. A fix for
  v5.19.

- Patches 5-6: avoid 'extra_subflows' to underflow with the userspace
  PM. A fix for v5.19.

- Patch 7: report errors if one subflow cannot set SO_TIMESTAMPING. A
  fix for v5.14.

- Patch 8: try to set TCP_MAXSEG on all subflows, before reporting
  errors, if any. A fix for v6.17.

- Patch 9: check desc->count in read_sock, to act as expected. A fix
  for v7.0.

- Patch 10: fix an uninit value in mptcp_established_options, reported
  by syzbot. A fix for v7.1-rc1.

- Patch 11: fix a similar issue than the previous patch, exposed by the
  same modification from v7.1-rc1, but was already causing issues since
  v5.15.
====================

Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-0-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agomptcp: change mptcp_established_options() to return opt_size
Eric Dumazet [Tue, 2 Jun 2026 12:51:38 +0000 (12:51 +0000)] 
mptcp: change mptcp_established_options() to return opt_size

Instead of passing opt_size address to mptcp_established_options(),
change this function to return it by value.

This removes the need for an expensive stack canary in
tcp_established_options() when CONFIG_STACKPROTECTOR_STRONG=y.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-92 (-92)
Function                                     old     new   delta
tcp_options_write.isra                      1423    1407     -16
mptcp_established_options                   2746    2720     -26
tcp_established_options                      553     503     -50
Total: Before=22110750, After=22110658, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602125138.2317015-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agomptcp: add-addr: always drop other suboptions
Matthieu Baerts (NGI0) [Tue, 2 Jun 2026 12:14:18 +0000 (22:14 +1000)] 
mptcp: add-addr: always drop other suboptions

When an ADD_ADDR needs to be sent, it could be prepared if there is
enough remaining space and even if the packet is not a pure ACK. But it
would be dropped soon after.

Indeed, in mptcp_pm_add_addr_signal(), there is enough space to fit a
DSS of 20 octets and an ADD_ADDR echo containing an IPv4 address on 8
octets for example. In this case, the packet would be prepared, the
MPTCP_ADD_ADDR_ECHO bit would be removed from pm->addr_signal, but the
option would be silently dropped in mptcp_established_options_add_addr()
not to override DSS info in the union from 'struct mptcp_out_options',
and also because mptcp_write_options() will enforce mutually exclusion
with DSS.

Instead, don't even try to send an ADD_ADDR if it is not a pure ACK.
Retry for each new packet until a pure-ACK is emitted. That's fine to do
that, because each time an ADD_ADDR (echo) is scheduled, a pure ACK is
queued.

This also simplifies the code, and the skb checks can be done earlier,
before the lock.

Note: also, since commit 6d0060f600ad ("mptcp: Write MPTCP DSS headers
to outgoing data packets"), opts->ahmac would not have been set to 0
when other suboptions were not dropped, and when sending an ADD_ADDR
echo. That would have resulted in sending an ADD_ADDR using garbage
info, where there was not enough space, instead of an echo one without
the ADD_ADDR HMAC.

Fixes: 1bff1e43a30e ("mptcp: optimize out option generation")
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-11-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agomptcp: fix uninit-value in mptcp_established_options
Paolo Abeni [Tue, 2 Jun 2026 12:14:17 +0000 (22:14 +1000)] 
mptcp: fix uninit-value in mptcp_established_options

syzbot reported the following uninit splat:

  BUG: KMSAN: uninit-value in mptcp_write_data_fin net/mptcp/options.c:542 [inline]
  BUG: KMSAN: uninit-value in mptcp_established_options_dss net/mptcp/options.c:590 [inline]
  BUG: KMSAN: uninit-value in mptcp_established_options+0x112f/0x3530 net/mptcp/options.c:874
   mptcp_write_data_fin net/mptcp/options.c:542 [inline]
   mptcp_established_options_dss net/mptcp/options.c:590 [inline]
   mptcp_established_options+0x112f/0x3530 net/mptcp/options.c:874
   tcp_established_options+0x312/0xcc0 net/ipv4/tcp_output.c:1192
   __tcp_transmit_skb+0x5dc/0x5fe0 net/ipv4/tcp_output.c:1575
   __tcp_send_ack+0x967/0xad0 net/ipv4/tcp_output.c:4499
   tcp_send_ack+0x3d/0x60 net/ipv4/tcp_output.c:4505
   mptcp_subflow_shutdown+0x164/0x690 net/mptcp/protocol.c:3137
   mptcp_check_send_data_fin+0x31b/0x3d0 net/mptcp/protocol.c:3218
   __mptcp_wr_shutdown net/mptcp/protocol.c:3234 [inline]
   __mptcp_close+0x860/0x1360 net/mptcp/protocol.c:3313
   mptcp_close+0x42/0x260 net/mptcp/protocol.c:3367
   inet_release+0x1ee/0x2a0 net/ipv4/af_inet.c:442
   __sock_release net/socket.c:722 [inline]
   sock_close+0xd6/0x2f0 net/socket.c:1514
   __fput+0x60e/0x1010 fs/file_table.c:510
   ____fput+0x25/0x30 fs/file_table.c:538
   task_work_run+0x208/0x2b0 kernel/task_work.c:233
   resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
   __exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
   exit_to_user_mode_loop+0x306/0x1b60 kernel/entry/common.c:98
   __exit_to_user_mode_prepare include/linux/irq-entry-common.h:207 [inline]
   syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:238 [inline]
   syscall_exit_to_user_mode include/linux/entry-common.h:318 [inline]
   __do_fast_syscall_32+0x2c7/0x460 arch/x86/entry/syscall_32.c:310
   do_fast_syscall_32+0x37/0x80 arch/x86/entry/syscall_32.c:332
   do_SYSENTER_32+0x1f/0x30 arch/x86/entry/syscall_32.c:370
   entry_SYSENTER_compat_after_hwframe+0x84/0x8e

  Local variable opts created at:
   __tcp_transmit_skb+0x4d/0x5fe0 net/ipv4/tcp_output.c:1536
   __tcp_send_ack+0x967/0xad0 net/ipv4/tcp_output.c:4499

The output path currently omits initializing the mptcp extension
`use_map` flag in a few corner cases.

Address the issue always zeroing all the extensions flags before
eventually initializing the individual bits. To that extent, introduce
and use a struct_group to avoid multiple bitwise operations.

Fixes: cfcceb7a39fc ("tcp: shrink per-packet memset in __tcp_transmit_skb()")
Cc: stable@vger.kernel.org
Reported-by: syzbot+ff020673c5e3d94d9478@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=ff020673c5e3d94d9478
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-10-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agomptcp: check desc->count in read_sock
Gang Yan [Tue, 2 Jun 2026 12:14:16 +0000 (22:14 +1000)] 
mptcp: check desc->count in read_sock

__tcp_read_sock() checks desc->count after each skb is consumed and
breaks the loop when it reaches 0. The MPTCP variant lacks this check.

This is a functional bug, other subsystems also rely on this check:
TLS strparser sets desc->count to 0 once a full TLS record is assembled
and depends on this break to stop reading.

Add the same desc->count check to __mptcp_read_sock(), mirroring
__tcp_read_sock().

Fixes: 250d9766a984 ("mptcp: implement .read_sock")
Cc: stable@vger.kernel.org
Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-9-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agomptcp: sockopt: set sockopt on all subflows
Matthieu Baerts (NGI0) [Tue, 2 Jun 2026 12:14:15 +0000 (22:14 +1000)] 
mptcp: sockopt: set sockopt on all subflows

The mptcp_setsockopt_all_sf(), currently used only with TCP_MAXSEG,
stopped when one subflow returned an error.

Even if it is not wrong, this is different from the other helpers trying
to set the option on all subflows, and then returning an error if at
least one of them had an issue.

Follow this behaviour, for a question of uniformity.

Fixes: 51c5fd09e1b4 ("mptcp: add TCP_MAXSEG sockopt support")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-8-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agomptcp: sockopt: check timestamping ret value
Matthieu Baerts (NGI0) [Tue, 2 Jun 2026 12:14:14 +0000 (22:14 +1000)] 
mptcp: sockopt: check timestamping ret value

sock_set_timestamping() can fail for different reasons. The returned
value should then be checked.

If sock_set_timestamping() fails for at least one subflow, the first
error is now reported to the userspace, similar to what is done with
other socket options.

Fixes: 9061f24bf82e ("mptcp: sockopt: propagate timestamp request to subflows")
Cc: stable@vger.kernel.org
Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Closes: https://lore.kernel.org/willemdebruijn.kernel.178a41a53d041@gmail.com
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-7-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agoselftests: mptcp: add test for extra_subflows underflow on userspace PM
Tao Cui [Tue, 2 Jun 2026 12:14:13 +0000 (22:14 +1000)] 
selftests: mptcp: add test for extra_subflows underflow on userspace PM

Add a test to verify that when userspace PM fails to create a subflow
(e.g. using an unreachable address), the extra_subflows counter is not
decremented below zero.

Fixes: 77e4b94a3de6 ("mptcp: update userspace pm infos")
Cc: stable@vger.kernel.org
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-6-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agomptcp: pm: fix extra_subflows underflow on userspace PM subflow creation
Tao Cui [Tue, 2 Jun 2026 12:14:12 +0000 (22:14 +1000)] 
mptcp: pm: fix extra_subflows underflow on userspace PM subflow creation

The userspace PM increments extra_subflows after __mptcp_subflow_connect()
succeeds, but __mptcp_subflow_connect() calls mptcp_pm_close_subflow()
on failure to roll back the pre-increment done by the kernel PM's fill_*()
helpers. Because the userspace PM hasn't incremented yet at that point,
this decrement is spurious and causes extra_subflows to underflow.

Fix it by aligning the userspace PM with the kernel PM: increment
extra_subflows before calling __mptcp_subflow_connect(), so the existing
error path in subflow.c correctly rolls it back on failure. Also simplify
the error handling by taking pm.lock only when needed for cleanup.

Fixes: 77e4b94a3de6 ("mptcp: update userspace pm infos")
Cc: stable@vger.kernel.org
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-5-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agomptcp: allow subflow rcv wnd to shrink
Paolo Abeni [Tue, 2 Jun 2026 12:14:11 +0000 (22:14 +1000)] 
mptcp: allow subflow rcv wnd to shrink

In MPTCP connection, the `window` field in the TCP header refers to the
MPTCP-level rcv_nxt and it's right edge should not move backward. Such
constraint is enforced at DSS option generation time.

At the same time, the TCP stack ensures independently that the TCP-level
rcv wnd right's edge does not move backward. That in turn causes artificial
inflating of the MPTCP rcv window when the incoming data is acked at the
TCP level and is OoO in the MPTCP sequence space (or lands in the backlog).

As a consequence, the incoming traffic can exceed the receiver rcvbuf size
even when the sender is not misbehaving.

Prevent such scenario forcibly allowing the TCP subflow to shrink the
TCP-level rcv wnd regardless of the current netns setting.

Fixes: f3589be0c420 ("mptcp: never shrink offered window")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-4-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agomptcp: close TOCTOU race while computing rcv_wnd
Paolo Abeni [Tue, 2 Jun 2026 12:14:10 +0000 (22:14 +1000)] 
mptcp: close TOCTOU race while computing rcv_wnd

The MPTCP output path access locklessly the MPTCP-level ack_seq
in multiple times, using possibly different values for the data_ack
in the DSS option and to compute the announced rcv wnd for the same
packet.

Refactor the cote to avoid inconsistencies which may confuse the
peer. Also ensure that the MPTCP level rcv wnd is updated only when
the egress packet actually contains a DSS ack.

Fixes: fa3fe2b15031 ("mptcp: track window announced to peer")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-3-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agomptcp: fix retransmission loop when csum is enabled
Paolo Abeni [Tue, 2 Jun 2026 12:14:09 +0000 (22:14 +1000)] 
mptcp: fix retransmission loop when csum is enabled

Sashiko noted that retransmission with csum enabled can actually
transmit new data, but currently the relevant code does not update
accordingly snd_nxt.

The may cause incoming ack drop and an endless retransmission loop.

Address the issue incrementing snd_nxt as needed.

Fixes: 4e14867d5e91 ("mptcp: tune re-injections for csum enabled mode")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-2-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agomptcp: fix missing wakeups in edge scenarios
Paolo Abeni [Tue, 2 Jun 2026 12:14:08 +0000 (22:14 +1000)] 
mptcp: fix missing wakeups in edge scenarios

The mptcp_recvmsg() can fill MPTCP socket receive queue via
mptcp_move_skbs(), but currently does not try to wakeup any listener,
because the same process is going to check the receive queue soon.

When multiple threads are reading from the same fd, the above can
cause stall. Add the missing wakeup.

Fixes: 6771bfd9ee24 ("mptcp: update mptcp ack sequence from work queue")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-1-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agodrivers/of: fdt: Make ibm,phandle logic only happen on pseries
Daniel Palmer [Wed, 3 Jun 2026 15:18:09 +0000 (00:18 +0900)] 
drivers/of: fdt: Make ibm,phandle logic only happen on pseries

The "ibm,phandle" thing only seems to be needed on pseries
machines but everyone gets it so they get a string and a little
bit of useless code.

In __of_attach_node() the pseries specific part uses
IS_ENABLED(CONFIG_PPC_PSERIES) so do that here too.

Signed-off-by: Daniel Palmer <daniel@thingy.jp>
Link: https://patch.msgid.link/20260603151809.3256280-1-daniel@thingy.jp
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2 weeks agotools/x86/kcpuid: Update bitfields to x86-cpuid-db v3.1
Maciej Wieczor-Retman [Wed, 3 Jun 2026 17:10:49 +0000 (17:10 +0000)] 
tools/x86/kcpuid: Update bitfields to x86-cpuid-db v3.1

Update kcpuid's CSV file to version 3.1, as generated by x86-cpuid-db.

Summary of the v3.1 changes:

* Fix a few typos that were found during the kernel CPUID data model
  review. Also include fixes found using an LLM agent review.

* Rename thrd_director_nclasses to hw_feedback_nclasses as it's the
  name used in Intel SDM.

See https://gitlab.com/x86-cpuid.org/x86-cpuid-db/-/blob/v3.1/CHANGELOG.rst
for more info.

Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/cbe9ff395b3269e112ff7ca414d726ffd7bf0787.1780506200.git.m.wieczorretman@pm.me
2 weeks agoptp: vclock: Switch from RCU to SRCU
Kurt Kanzenbach [Fri, 29 May 2026 17:11:47 +0000 (19:11 +0200)] 
ptp: vclock: Switch from RCU to SRCU

The usage of PTP vClocks leads immediately to the following issues with
ptp4l with LOCKDEP and DEBUG_ATOMIC_SLEEP enabled: "BUG: sleeping function
called from invalid context".

ptp_convert_timestamp() acquires a mutex_t within a RCU read section.  This
is illegal, because acquiring a mutex_t can result in voluntary scheduling
request which is not allowed within a RCU read section.

Replace the RCU usage with SRCU where sleeping is allowed.

Reported-by: Florian Zeitz <florian.zeitz@schettke.com>
Closes: https://lore.kernel.org/all/00a8cce8-410e-4038-98af-49be6d93d7bd@schettke.com/
Fixes: 67d93ffc0f3c ("ptp: vclock: use mutex to fix "sleep on atomic" bug")
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20260529-vclock_rcu-v2-1-02a5531fab92@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agoipv4: raw: remove six obsolete EXPORT_SYMBOL_GPL()
Eric Dumazet [Tue, 2 Jun 2026 16:50:36 +0000 (16:50 +0000)] 
ipv4: raw: remove six obsolete EXPORT_SYMBOL_GPL()

IPv6 can not be a module anymore, we no longer need to export:

- raw_hash_sk()
- raw_unhash_sk()
- raw_abort()
- raw_seq_start()
- raw_seq_next()
- raw_seq_stop()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260602165036.2712408-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agoipv4: restrict IPOPT_SSRR and IPOPT_LSRR options
Eric Dumazet [Tue, 2 Jun 2026 16:15:47 +0000 (16:15 +0000)] 
ipv4: restrict IPOPT_SSRR and IPOPT_LSRR options

This patch restricts setting Loose Source and Record Route (LSRR)
and Strict Source and Record Route (SSRR) IP options to users
with CAP_NET_RAW capability.

This prevents unprivileged applications from forcing packets to route
through attacker-controlled nodes to leak TCP ISN and possibly other
protocol information.

While LSRR and SSRR are commonly filtered in many network environments,
they may still be supported and forwarded along some network paths.

RFC 7126 (Recommendations on Filtering of IPv4 Packets Containing
IPv4 Options) recommend to drop these options in 4.3 and 4.4.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Tamir Shahar <tamirthesis@gmail.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260602161547.2642155-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agoMerge branch 'af_unix-fix-inq_len-update-issue'
Jakub Kicinski [Thu, 4 Jun 2026 01:52:28 +0000 (18:52 -0700)] 
Merge branch 'af_unix-fix-inq_len-update-issue'

Jianyu Li says:

====================
af_unix: Fix inq_len update issue

From: Jianyu Li <jianyu.li@mediatek.com>

This series fix the problem that inq_len is inconsistent with
actual remaining byte count when only part of a skb is consumed.
====================

Link: https://patch.msgid.link/20260601113640.231897-1-jianyu.li@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agoaf_unix: Add test for SCM_INQ on partial read
Jianyu Li [Mon, 1 Jun 2026 11:36:40 +0000 (19:36 +0800)] 
af_unix: Add test for SCM_INQ on partial read

Add test to verify that when a skb is partially consumed,
unix_inq_len() return correct remaining byte count.

Before:

  #  RUN           scm_inq.stream.partial_read ...
  # scm_inq.c:165:partial_read:Expected remain (512) == *(int *)CMSG_DATA(cmsg) (768)
  # partial_read: Test terminated by assertion
  #          FAIL  scm_inq.stream.partial_read
  not ok 2 scm_inq.stream.partial_read

After:

  #  RUN           scm_inq.stream.partial_read ...
  #            OK  scm_inq.stream.partial_read
  ok 2 scm_inq.stream.partial_read

Signed-off-by: Jianyu Li <jianyu.li@mediatek.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260601113640.231897-3-jianyu.li@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agoaf_unix: Fix inq_len update problem in partial read
Jianyu Li [Mon, 1 Jun 2026 11:36:39 +0000 (19:36 +0800)] 
af_unix: Fix inq_len update problem in partial read

Currently inq_len is updated only when the whole skb is consumed.
If only part of the data is read, following SIOCINQ query would
get value greater than what actually left.

This change update inq_len timely in unix_stream_read_generic(),
and adjust unix_stream_read_skb() accordingly to prevent
repetitive update.

Fixes: f4e1fb04c123 ("af_unix: Use cached value for SOCK_STREAM in unix_inq_len().")
Signed-off-by: Jianyu Li <jianyu.li@mediatek.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260601113640.231897-2-jianyu.li@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agosched_ext: Make scx_bpf_kick_cid() return s32
Tejun Heo [Thu, 4 Jun 2026 01:46:56 +0000 (15:46 -1000)] 
sched_ext: Make scx_bpf_kick_cid() return s32

Switch scx_bpf_kick_cid() from void to s32 so future cap enforcement can
surface failures. cid interface is introduced in this cycle and has no
external users, so the ABI change is safe. Subsequent patches will add
-EPERM returns when the calling sub-sched lacks the required cap on the
target cid.

v2: Return scx_cid_to_cpu()'s errno instead of -EINVAL. (Andrea)

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2 weeks agosched_ext: Add scx_cmask_test() and scx_cmask_for_each_cid()
Tejun Heo [Thu, 4 Jun 2026 01:46:56 +0000 (15:46 -1000)] 
sched_ext: Add scx_cmask_test() and scx_cmask_for_each_cid()

Add single-bit test and iterator over set cids in an scx_cmask.

v2: Bound scx_cmask_for_each_cid() to the active span. (sashiko AI)

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2 weeks agotools/sched_ext: Order single-cid cmask helpers as (cid, mask)
Tejun Heo [Thu, 4 Jun 2026 01:46:56 +0000 (15:46 -1000)] 
tools/sched_ext: Order single-cid cmask helpers as (cid, mask)

The BPF arena single-cid cmask helpers take the cmask first and the cid
second. Reorder them to (cid, mask) to match the kernel-side helpers and
the test_bit(nr, addr), cpumask_test_cpu(cpu, mask) convention. Range and
iteration helpers keep (mask, start).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2 weeks agosched_ext: Order single-cid cmask helpers as (cid, mask)
Tejun Heo [Thu, 4 Jun 2026 01:46:56 +0000 (15:46 -1000)] 
sched_ext: Order single-cid cmask helpers as (cid, mask)

__scx_cmask_set(), __scx_cmask_contains() and __scx_cmask_word() take the
cmask first and the cid second. The kernel's bit and cpumask predicates put
the index first: test_bit(nr, addr), cpumask_test_cpu(cpu, mask). Reorder
the cmask helpers to (cid, mask) for consistency, ahead of new single-cid
ops added next. Mask-level ops (and/or/andnot/copy/subset/intersects) keep
(dst, src).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2 weeks agoMerge tag 'amd-drm-fixes-7.1-2026-06-03' of https://gitlab.freedesktop.org/agd5f...
Dave Airlie [Thu, 4 Jun 2026 01:15:28 +0000 (11:15 +1000)] 
Merge tag 'amd-drm-fixes-7.1-2026-06-03' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes

amd-drm-fixes-7.1-2026-06-03:

amdgpu:
- BT.2020 fix for DCE
- DC bounds checking fixes
- SDMA 7.1 fix
- UserQ fixes
- SI fix
- SMU 13 fixes
- SMU 14 fixes
- GC 12.1 fix
- Userptr fix
- GC 10.1 fix
- GART fix for non-4K pages

amdkfd:
- UAF race fix
- Fix a potential NULL pointer dereference
- GC 11 buffer overflow fix for SDMA

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260604011351.2373027-1-alexander.deucher@amd.com
2 weeks agoocteontx2-af: Fix initialization of mcam's entry2target_pffunc field
Suman Ghosh [Fri, 29 May 2026 11:37:05 +0000 (17:07 +0530)] 
octeontx2-af: Fix initialization of mcam's entry2target_pffunc field

NPC mcam entry stores a mapping between mcam entry and target pcifunc.
During initialization of this field, API kmalloc_array has been used which
caused some junk values to array. Whereas, the array is expected to be
initialized by 0. This patch fixes the same by using kcalloc instead of
kmalloc_array.

Fixes: 55307fcb9258 ("octeontx2-af: Add mbox messages to install and delete MCAM rules")
Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1780054625-17090-1-git-send-email-sbhatta@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agoocteontx2-pf: Fix NDC sync operation errors
Geetha sowjanya [Fri, 29 May 2026 11:37:57 +0000 (17:07 +0530)] 
octeontx2-pf: Fix NDC sync operation errors

On system reboot "rvu_nicpf 0002:03:00.0: NDC sync operation failed"
error messages are shown, even if the operations is successful.
This is due to wrong if error check in ndc_syc() function.

Fixes: 42c45ac1419c ("octeontx2-af: Sync NIX and NPA contexts from NDC to LLC/DRAM")
Signed-off-by: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1780054677-17249-1-git-send-email-sbhatta@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agoappletalk: aarp: zero-initialize aarp_entry to prevent heap info leak
Yizhou Zhao [Fri, 29 May 2026 10:50:16 +0000 (18:50 +0800)] 
appletalk: aarp: zero-initialize aarp_entry to prevent heap info leak

aarp_alloc() allocates struct aarp_entry without zeroing it, but only
initializes refcnt and packet_queue.  When an unresolved AARP entry is
created, hwaddr[ETH_ALEN] is left uninitialized.

aarp_seq_show() later prints this field with %pM when users read
/proc/net/atalk/arp.  This can expose 6 bytes of stale heap data for
each unresolved entry.

Fix this by zero-initializing struct aarp_entry at allocation time.

Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn>
Reported-by: Ao Wang <wangao@seu.edu.cn>
Reported-by: Xuewei Feng <fengxw06@126.com>
Reported-by: Qi Li <qli01@tsinghua.edu.cn>
Reported-by: Ke Xu <xuke@tsinghua.edu.cn>
Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260529105017.81531-1-zhaoyz24@mails.tsinghua.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agoMerge branch 'geneve-allow-binding-udp-socket-to-a-specific-address'
Jakub Kicinski [Thu, 4 Jun 2026 01:07:44 +0000 (18:07 -0700)] 
Merge branch 'geneve-allow-binding-udp-socket-to-a-specific-address'

Kuniyuki Iwashima says:

====================
geneve: Allow binding UDP socket to a specific address.

By default, a GENEVE device bind()s its underlying UDP socket(s) to
the IPv4 or IPv6 wildcard address because there is no way to specify
a specific local IP address to bind() to.

This prevents deploying multiple GENEVE devices on a multi-homed host
where each device should be isolated and bound to a different local IP
address on the same UDP port.

This series introduces two options to specify local IPv4 or IPv6
addresses for a GENEVE device.
====================

Link: https://patch.msgid.link/20260602190436.139591-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agogeneve: Introduce IFLA_GENEVE_LOCAL and IFLA_GENEVE_LOCAL6.
Kuniyuki Iwashima [Tue, 2 Jun 2026 19:03:45 +0000 (19:03 +0000)] 
geneve: Introduce IFLA_GENEVE_LOCAL and IFLA_GENEVE_LOCAL6.

By default, a GENEVE device bind()s its underlying UDP socket(s) to
the IPv4 or IPv6 wildcard address because there is no way to specify
a specific local IP address to bind() to.

This prevents deploying multiple GENEVE devices on a multi-homed host
where each device should be isolated and bound to a different local IP
address on the same UDP port.

Let's introduce new options, IFLA_GENEVE_LOCAL and IFLA_GENEVE_LOCAL6,
to allow specifying a local IPv4/IPv6 address for the backend UDP
socket.

By default, when collect metadata mode (IFLA_GENEVE_COLLECT_METADATA)
is enabled, both IPv4 and IPv6 sockets are created.  However, if a
source address is specified via the new attributes, only a single
socket corresponding to that specific address family is created.

Accordingly, geneve_find_sock() and geneve_find_dev() are updated to
take the source address into account, ensuring that multiple devices
and sockets configured with different source addresses can coexist
without conflict.

In addition, the source address is validated in geneve_xmit_skb()
and geneve6_xmit_skb(), so the BPF prog must set it in bpf_tunnel_key.

With this change, multiple GENEVE devices can be successfully created
and bound to their respective local IP addresses:

  (*) "local" is the keyword for IFLA_GENEVE_LOCAL / IFLA_GENEVE_LOCAL6

  # for i in $(seq 1 2);
  do
          ip link add geneve4_${i} type geneve local 192.168.0.${i} external
          ip addr add 192.168.0.${i}/24 dev geneve4_${i}
          ip link set geneve4_${i} up

          ip link add geneve6_${i} type geneve local 2001:9292::${i} external
          ip addr add 2001:9292::${i}/64 dev geneve6_${i} nodad
          ip link set geneve6_${i} up
  done

  # ip -d l | grep geneve
  9: geneve4_1: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
      geneve external id 0 local 192.168.0.1 ...
  10: geneve6_1: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
      geneve external id 0 local 2001:9292::1 ...
  11: geneve4_2: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
      geneve external id 0 local 192.168.0.2 ...
  12: geneve6_2: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
      geneve external id 0 local 2001:9292::2 ...

  # ss -ua | grep geneve
  UNCONN 0      0         192.168.0.2:geneve      0.0.0.0:*
  UNCONN 0      0         192.168.0.1:geneve      0.0.0.0:*
  UNCONN 0      0      [2001:9292::2]:geneve            *:*
  UNCONN 0      0      [2001:9292::1]:geneve            *:*

Note that even if the local address is explicitly configured with
the wildcard address, kernel does not dump it except for devices with
IFLA_GENEVE_COLLECT_METADATA.  This is consistent with the behaviour
of is_tnl_info_zero(), which treats the wildcard remote address as not
configured.

  ## ynl example.
  # ./tools/net/ynl/pyynl/cli.py \
    --spec ./Documentation/netlink/specs/rt-link.yaml \
    --do newlink --create \
    --json '{"ifname": "geneve0",
             "linkinfo": {"kind":"geneve",
                  "data": {"local": "0.0.0.0",
           "collect-metadata": true}}}'

  # ./tools/net/ynl/pyynl/cli.py \
    --spec ./Documentation/netlink/specs/rt-link.yaml \
    --do getlink \
    --json '{"ifname": "geneve0"}' --output-json | \
    jq .linkinfo.data.local
  "0.0.0.0"

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260602190436.139591-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agogeneve: Add dualstack flag to struct geneve_config.
Kuniyuki Iwashima [Tue, 2 Jun 2026 19:03:44 +0000 (19:03 +0000)] 
geneve: Add dualstack flag to struct geneve_config.

When collect metadata mode (IFLA_GENEVE_COLLECT_METADATA) is
enabled, the GENEVE device creates both IPv4 and IPv6 sockets
and bind()s them to wildcard addresses.

The next patch allows creating only one socket bound to a
specific address even when the collect metadata mode is
enabled.

Then, we need a flag to distinguish dualstack GENEVE devices
to detect local address conflict.

Let's add the dualstack flag to struct geneve_config.

IFLA_GENEVE_COLLECT_METADATA processing is moved up in
geneve_nl2info() for the next patch to overwrite dualstack
to false while keeping collect_md true.

Note that IFLA_GENEVE_REMOTE and IFLA_GENEVE_REMOTE6 does not
set cfg->dualstack to false since is_tnl_info_zero() ignores
the wildcard remote address:

  # ip link add geneve0 type geneve external remote 0.0.0.1
  Error: Device is externally controlled, so attributes (VNI, Port, and so on) must not be specified.

  # ip link add geneve0 type geneve external remote 0.0.0.0
  # ss -ua | grep geneve
  UNCONN 0      0            0.0.0.0:geneve      0.0.0.0:*
  UNCONN 0      0                  *:geneve            *:*

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260602190436.139591-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agogeneve: Pass struct geneve_dev to geneve_find_sock().
Kuniyuki Iwashima [Tue, 2 Jun 2026 19:03:43 +0000 (19:03 +0000)] 
geneve: Pass struct geneve_dev to geneve_find_sock().

This is a prep patch to make a subsequent patch clean.

We will need to access geneve_dev->cfg.info.key.u.{ipv4,ipv6}.src
in geneve_find_sock() later and extend conditions there.

Let's pass down struct geneve from geneve_sock_add() to
geneve_find_sock() and flatten the conditional logic.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260602190436.139591-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agogeneve: Pass struct geneve_dev to geneve_create_sock().
Kuniyuki Iwashima [Tue, 2 Jun 2026 19:03:42 +0000 (19:03 +0000)] 
geneve: Pass struct geneve_dev to geneve_create_sock().

This is a prep patch to make a subsequent patch clean.

We will need to access geneve_dev->cfg.info.key.u.{ipv4,ipv6}.src
in geneve_create_sock() later.

Let's pass down struct geneve_dev from geneve_sock_add() to
geneve_create_sock() instead of individual config fields.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260602190436.139591-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agogeneve: Reuse ipv6_addr_type() result in geneve_nl2info().
Kuniyuki Iwashima [Tue, 2 Jun 2026 19:03:41 +0000 (19:03 +0000)] 
geneve: Reuse ipv6_addr_type() result in geneve_nl2info().

geneve_nl2info() calls ipv6_addr_type() to check if the remote
IPv6 address is link-local.

Then, it also calls ipv6_addr_is_multicast() for the same address.

Let's not call ipv6_addr_is_multicast() and reuse ipv6_addr_type().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260602190436.139591-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agonet: sfp: initialize i2c_block_size at adapter configure time
Jonas Jelonek [Thu, 28 May 2026 20:52:40 +0000 (20:52 +0000)] 
net: sfp: initialize i2c_block_size at adapter configure time

sfp->i2c_block_size is only assigned in sfp_sm_mod_probe(), which runs
from the state machine timer after SFP_F_PRESENT has been set. Between
those two points, sfp_module_eeprom() (the ethtool -m callback) gates
only on SFP_F_PRESENT and can be entered with i2c_block_size still at
its kzalloc'd value of 0.

On a pure-I2C adapter, sfp_i2c_read() then issues an i2c_transfer()
with msgs[1].len = 0 inside a loop that subtracts this_len from len
each iteration; on adapters that succeed a zero-length read the loop
never advances, spinning while holding rtnl_lock.

This was previously addressed by initializing i2c_block_size in
sfp_alloc() (commit 813c2dd78618), but the initialization was dropped
when i2c_block_size was split from i2c_max_block_size.

Initialize sfp->i2c_block_size from sfp->i2c_max_block_size in
sfp_i2c_configure(), so the field is valid as soon as the adapter is
known. sfp_sm_mod_probe() still reassigns it on each module insertion
to recover from a per-module clamp to 1 (sfp_id_needs_byte_io).

Fixes: 7662abf4db94 ("net: phy: sfp: Add support for SMBus module access")
Cc: stable@vger.kernel.org
Signed-off-by: Jonas Jelonek <jelonek.jonas@gmail.com>
Link: https://patch.msgid.link/20260528205242.971410-2-jelonek.jonas@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agoxsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
Jason Xing [Sat, 30 May 2026 04:26:30 +0000 (12:26 +0800)] 
xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()

The TX metadata area resides in the UMEM buffer which is memory-mapped
and concurrently writable by userspace. In xsk_skb_metadata(),
csum_start and csum_offset are read from shared memory for bounds
validation, then read again for skb assignment. A malicious userspace
application can race to overwrite these values between the two reads,
bypassing the bounds check and causing out-of-bounds memory access
during checksum computation in the transmit path.

Fix this by reading csum_start and csum_offset into local variables
once, then using the local copies for both validation and assignment.

Note that other metadata fields (flags, launch_time) and the cached
csum fields may be mutually inconsistent due to concurrent userspace
writes, but this is benign: the only security-critical invariant is
that each field's validated value is the same one used, which local
caching guarantees.

Closes: https://lore.kernel.org/all/20260503200927.73EA1C2BCB4@smtp.kernel.org/
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Fixes: 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support")
Link: https://patch.msgid.link/20260530042630.80626-1-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agonet: b44: use ethtool_puts
Rosen Penev [Sun, 31 May 2026 00:03:34 +0000 (17:03 -0700)] 
net: b44: use ethtool_puts

There's a subtle error with the memcpy here, where b44_gstrings should
not be dereferenced. Dereferening causes the following error with W=1:

In file included from drivers/net/ethernet/broadcom/b44.c:17:
In file included from ./include/linux/module.h:18:
In file included from ./include/linux/kmod.h:9:
In file included from ./include/linux/umh.h:4:
In file included from ./include/linux/gfp.h:7:
In file included from ./include/linux/mmzone.h:8:
In file included from ./include/linux/spinlock.h:56:
In file included from ./include/linux/preempt.h:79:
In file included from ./arch/powerpc/include/asm/preempt.h:5:
In file included from ./include/asm-generic/preempt.h:5:
In file included from ./include/linux/thread_info.h:23:
In file included from ./arch/powerpc/include/asm/current.h:13:
In file included from ./arch/powerpc/include/asm/paca.h:16:
In file included from ./include/linux/string.h:386:
./include/linux/fortify-string.h:578:4: error: call to
'__read_overflow2_field' declared with 'warning' attribute: detected read
beyond size of field (2nd parameter); maybe use>
  578 |                        __read_overflow2_field(q_size_field, size);
      |                         ^

Instead of fixing the memcpy, use ethtool_puts, which is the proper
helper for printing ethtool gstrings.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20260531000334.388351-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agoMerge branch 'net-mlx5-add-switchdev-mode-support-for-socket-direct-single-netdev...
Jakub Kicinski [Thu, 4 Jun 2026 00:42:33 +0000 (17:42 -0700)] 
Merge branch 'net-mlx5-add-switchdev-mode-support-for-socket-direct-single-netdev-part-1-2'

Tariq Toukan says:

====================
net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 1/2

This series enables Socket Direct single netdev to operate in switchdev
mode with shared FDB. SD single netdev combines multiple PCI functions
behind a single netdev interface. To support switchdev offloads, these
functions must participate in virtual LAG (shared FDB).

Design

Rather than introducing a separate LAG instance for SD, this series
integrates SD secondary devices into the existing LAG structure
(priv.lag) created at probe time. Each lag_func entry carries a
group_id field that identifies its SD group membership (0 means not
part of any SD group). An xarray mark (XA_MARK_PORT) distinguishes
physical port entries from SD secondaries, enabling a single unified
iterator that filters by group:

  - MLX5_LAG_FILTER_PORTS: iterate port-level entries only (existing
    behavior, used by bonding, FW LAG commands, v2p_map)
  - MLX5_LAG_FILTER_ALL: iterate all devices including SD secondaries
    (used by MPESW shared FDB across all devices)
  - specific group_id: iterate only devices in that SD group (used by
    per-group SD shared FDB operations)

Existing callers use mlx5_ldev_for_each() which maps to
MLX5_LAG_FILTER_PORTS, preserving current behavior for non-SD
configurations.

Lifecycle and ownership

The SD LAG lifecycle is tied to the SD group, not to bonding events:

1. At PCI probe, mlx5_lag_add_mdev() creates the LAG structure
   (priv.lag) for each LAG-capable PF. e.g.: SD primary devices

2. During mlx5_sd_init(), after the SD group is fully formed (primary
   and secondaries paired), sd_lag_init() registers the secondary
   devices into the primary's existing priv.lag by calling
   mlx5_ldev_add_mdev() with the SD group_id. The primary's lag_func
   also gets its group_id set. No separate LAG instance is created.

3. After all the devices in SD group transition to switchdev,
   mlx5_lag_shared_fdb_create() is invoked with the group_id to create
   a software-only shared FDB scoped to that SD group. This sets
   sd_fdb_active on all lag_func entries in the group. No FW LAG
   commands are issued since SD devices share the same physical port.

4. If MPESW (multi-port eswitch) is enabled on top of SD groups, the
   per-group SD shared FDB is torn down first, then MPESW shared FDB is
   created spanning all devices (ports + SD secondaries) using
   MLX5_LAG_FILTER_ALL. On MPESW disable, per-group SD shared FDB is
   restored.

5. On SD teardown (mlx5_sd_cleanup or device unbind), sd_lag_cleanup()
   removes secondaries from priv.lag and clears the primary's group_id.
   The LAG structure itself is not destroyed.

The sd_fdb_active flag is set on all lag_func entries in a group (not
just the primary), so any device can detect the SD shared FDB state
during lag_disable_change teardown without needing to look up peer
entries.

SD shared FDB is a pure software construct -- unlike regular LAG modes
(ROCE, SRIOV, MPESW), it does not issue FW create_lag/destroy_lag
commands. The software vport LAG for SD is implemented via eswitch
egress ACL bounce rules, managed by the IB layer through
mlx5_eth_lag_init(). And the software LAG demux is implemented via
steering rules that utilize new destination, VHCA_RX.

Patches

Infrastructure (patches 1, 5-6):
  - Factor out shared FDB code into a dedicated file
  - Extend lag_func with group_id and sd_fdb_active fields;
    add XA_MARK_PORT and unified iterator with group_id filter
  - Extend shared FDB API with group_id parameter

E-Switch preparation (patches 2-3):
  - Align eswitch disable sequence ordering
  - Move devcom init from TC to eswitch layer

SD group management (patches 4, 7-9):
  - Replace peer count check with direct peer lookup
  - Register SD secondaries in the existing LAG at SD init time
  - Block RoCE and VF LAG for SD devices
  - Block multipath LAG for SD devices

Switchdev integration (patch 10):
  - Keep netdev resources local in switchdev mode

Steering (patches 11-12):
  - Track peer flow slots with bitmap for selective peer flow deletion
  - Enable TC flow steering for SD LAG

Enablement (patch 13):
  - Verify unique vhca_id count for cross-VHCA RQT
====================

Link: https://patch.msgid.link/20260531113954.395443-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agonet/mlx5e: Verify unique vhca_id count instead of range
Shay Drory [Sun, 31 May 2026 11:39:53 +0000 (14:39 +0300)] 
net/mlx5e: Verify unique vhca_id count instead of range

Change verify_num_vhca_ids() to count the number of unique vhca_ids
and verify this count doesn't exceed max_num_vhca_id, rather than
validating individual vhca_id values are within a specific range.

The previous implementation checked if each vhca_id was in the range
[0, max_num_vhca_id - 1], which is overly restrictive. The hardware
capability max_rqt_vhca_id represents the maximum number of unique
vhca_ids that can be used, not a range constraint on individual IDs.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-14-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agonet/mlx5e: TC, enable steering for SD LAG
Shay Drory [Sun, 31 May 2026 11:39:52 +0000 (14:39 +0300)] 
net/mlx5e: TC, enable steering for SD LAG

Enable TC flow steering for SD LAG mode by extending multiport
eligibility checks and peer flow handling.

SD LAG operates similarly to MPESW for TC offloads - flows on
secondary devices need peer flow creation on the primary, and
multiport forwarding rules are eligible when either MPESW or SD LAG
is active.

Add mlx5_lag_is_sd() helper to query SD LAG mode, and
mlx5_sd_is_primary() to identify the primary device. Redirect uplink
priv/proto_dev queries to the primary device's eswitch in SD
configurations.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-13-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agonet/mlx5e: TC, track peer flow slots with bitmap
Shay Drory [Sun, 31 May 2026 11:39:51 +0000 (14:39 +0300)] 
net/mlx5e: TC, track peer flow slots with bitmap

With SD devices joining the LAG, peer flows are not created for all
devcom peers - SD devices skip peers that belong to a different SD
group. However, the delete path iterated all devcom peers
unconditionally, attempting to delete from slots that were never
populated.

Track which peer slots are populated using a bitmap in mlx5e_tc_flow.
The delete path now iterates only set bits, matching exactly the slots
that were set up during flow creation.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-12-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agonet/mlx5: SD, keep netdev resources on same PF in switchdev mode
Shay Drory [Sun, 31 May 2026 11:39:50 +0000 (14:39 +0300)] 
net/mlx5: SD, keep netdev resources on same PF in switchdev mode

In SD switchdev mode, network device resources such as channels and
completion vectors must remain on the same PF rather than being
distributed across SD group members.

Modify mlx5_sd_ch_ix_get_dev_ix() to return 0 and
mlx5_sd_ch_ix_get_vec_ix() to return the channel index directly when
in switchdev mode, keeping resources local to the requesting PF.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-11-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agonet/mlx5: LAG, block multipath LAG for SD devices
Shay Drory [Sun, 31 May 2026 11:39:49 +0000 (14:39 +0300)] 
net/mlx5: LAG, block multipath LAG for SD devices

SD devices are not compatible with multipath LAG since they use
dedicated SD LAG for cross-socket connectivity. Add an SD check
to the multipath prereq validation to prevent multipath LAG
activation on SD-configured ports.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-10-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agonet/mlx5: LAG, block RoCE and VF LAG for SD devices
Shay Drory [Sun, 31 May 2026 11:39:48 +0000 (14:39 +0300)] 
net/mlx5: LAG, block RoCE and VF LAG for SD devices

Socket Direct devices manage their own LAG via SD LAG infrastructure.
Block the standard netdev-event-driven LAG path (RoCE LAG and VF LAG)
for SD devices to prevent conflicting LAG configurations.

Expose mlx5_sd_is_supported() as a public helper that encapsulates all
SD eligibility checks. Use it in mlx5_lag_dev_alloc() to skip netdev
notifier registration for SD-capable devices at alloc time. Some sd
code is reordered to expose the new function, no logic is changed.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agonet/mlx5: SD, introduce Socket Direct LAG
Shay Drory [Sun, 31 May 2026 11:39:47 +0000 (14:39 +0300)] 
net/mlx5: SD, introduce Socket Direct LAG

Register SD secondary devices with the existing LAG structure by
adding them to the primary's ldev xarray with a shared group_id.
This ties the SD LAG lifecycle to the SD group lifecycle.

Add sd_lag_state debugfs entry for LAG state visibility. To avoid
race between this entry and LAG deletion, have debugfs creation
and deletion done last on SD init and first on SD cleanup.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agonet/mlx5: LAG, extend shared FDB API with group_id filter
Shay Drory [Sun, 31 May 2026 11:39:46 +0000 (14:39 +0300)] 
net/mlx5: LAG, extend shared FDB API with group_id filter

Add a group_id parameter to mlx5_lag_shared_fdb_create() and
mlx5_lag_shared_fdb_destroy() to scope shared FDB operations to a
specific SD group.

When group_id is U32_MAX, the functions operate on all LAG devices. When
group_id is non-zero, they operate only on devices in that SD group
without issuing FW LAG commands, since SD LAG is a pure software
construct.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agonet/mlx5: LAG, prepare for SD device integration
Shay Drory [Sun, 31 May 2026 11:39:45 +0000 (14:39 +0300)] 
net/mlx5: LAG, prepare for SD device integration

Socket Direct (SD) secondaries devices will participate in LAG, even
though they are silent. SD secondary devices share the same physical
port as their primary but are separate PCI functions that need to be
tracked alongside regular LAG ports.

Extend lag_func with a group_id field to identify SD group membership
and introduce a unified iterator that can filter by group. Add APIs
for registering SD secondary devices in an existing LAG.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agonet/mlx5: LAG, replace peer count check with direct peer lookup
Shay Drory [Sun, 31 May 2026 11:39:44 +0000 (14:39 +0300)] 
net/mlx5: LAG, replace peer count check with direct peer lookup

Replace mlx5_eswitch_get_npeers() count-based check with a new
mlx5_eswitch_is_peer() function that directly verifies the peer
relationship between two eswitches.

This change prepares for SD LAG support, which is a virtual LAG that
does not have num_lag_ports capability and cannot use the count-based
peer validation.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agonet/mlx5: E-Switch, move devcom init from TC to eswitch layer
Shay Drory [Sun, 31 May 2026 11:39:43 +0000 (14:39 +0300)] 
net/mlx5: E-Switch, move devcom init from TC to eswitch layer

Move the E-swtich devcom component management from TC layer to ESW
layer.

This refactoring places devcom lifecycle management at the appropriate
layer and prepares for SD LAG which needs devcom registration
independent of the TC/representor initialization.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agonet/mlx5: E-Switch, align disable sequence with switchdev-to-legacy transition
Shay Drory [Sun, 31 May 2026 11:39:42 +0000 (14:39 +0300)] 
net/mlx5: E-Switch, align disable sequence with switchdev-to-legacy transition

This patch align the eswitch disable sequence with the
switchdev-to-legacy mode transition, where eswitch must be disabled
before device detachment. The consistent ordering is required for proper
SD LAG cleanup which depends on eswitch state during teardown.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agonet/mlx5: LAG, factor out shared FDB code into dedicated file
Shay Drory [Sun, 31 May 2026 11:39:41 +0000 (14:39 +0300)] 
net/mlx5: LAG, factor out shared FDB code into dedicated file

Refactor shared FDB LAG logic into a new lag/shared_fdb.c file to
improve code organization and enable reuse. Move shared FDB specific
functions from lag.c and introduce consolidated APIs:
- mlx5_lag_shared_fdb_create() handles LAG activation with shared FDB
- mlx5_lag_shared_fdb_destroy() handles LAG deactivation with shared FDB

Update mlx5_do_bond(), mlx5_disable_lag() and mpesw.c to use the new
APIs, which simplifies the shared FDB code paths.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 weeks agomm/mincore: handle non-swap entries before !CONFIG_SWAP guard
Usama Arif [Tue, 2 Jun 2026 17:22:47 +0000 (10:22 -0700)] 
mm/mincore: handle non-swap entries before !CONFIG_SWAP guard

mincore_swap() also fields migration/hwpoison entries (and shmem
swapin-error entries), which can exist on !CONFIG_SWAP builds when
CONFIG_MIGRATION or CONFIG_MEMORY_FAILURE is enabled.  The
!IS_ENABLED(CONFIG_SWAP) guard ran before the non-swap-entry early return,
so mincore_pte_range() can spuriously WARN and report these pages
nonresident on !CONFIG_SWAP kernels.

Move the guard below the non-swap-entry check so only true swap entries
trip the WARN, and migration/hwpoison entries take the existing "uptodate
/ non-shmem" path.

Link: https://lore.kernel.org/20260602172247.279421-1-usama.arif@linux.dev
Fixes: 1f2052755c15 ("mm/mincore: use a helper for checking the swap cache")
Signed-off-by: Usama Arif <usama.arif@linux.dev>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Baoquan He <baoquan.he@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 weeks agoarm64: mm: call pagetable dtor when freeing hot-removed page tables
Alistair Popple [Thu, 21 May 2026 03:27:30 +0000 (13:27 +1000)] 
arm64: mm: call pagetable dtor when freeing hot-removed page tables

Since 5e8eb9aeeda3 ("arm64: mm: always call PTE/PMD ctor in
__create_pgd_mapping()") page-table allocation on ARM64 always calls
pagetable_{pte,pmd,pud,p4d}_ctor().  This sets the page_type to
PGTY_table, increments NR_PAGETABLE and possible allocates a PTL.  However
the matching pagetable_dtor() calls were never added.

With DEBUG_VM enabled on kernel versions prior to v6.17 without
2dfcd1608f3a9 ("mm/page_alloc: let page freeing clear any set page type")
this leads to the following warning when freeing these pages due to
page->page_type sharing page->_mapcount:

  BUG: Bad page state in process ... pfn:284fbb
  page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x284fbb
  flags: 0x17fffc000000000(node=0|zone=2|lastcpupid=0x1ffff)
  page_type: f2(table)
  page dumped because: nonzero mapcount
  Call trace:
   bad_page+0x13c/0x160
   __free_frozen_pages+0x6cc/0x860
   ___free_pages+0xf4/0x180
   free_pages+0x54/0x80
   free_hotplug_page_range.part.0+0x58/0x90
   free_empty_tables+0x438/0x500
   __remove_pgd_mapping.constprop.0+0x60/0xa8
   arch_remove_memory+0x48/0x80
   try_remove_memory+0x158/0x1d8
   offline_and_remove_memory+0x138/0x180

It can also lead to leaking the ptl allocation if ALLOC_SPLIT_PTLOCKS is
defined and incorrect NR_PAGETABLE stats.  Fix this by calling
pagetable_dtor() in free_hotplug_pgtable_page() prior to freeing the page
to undo the effects of calling pagetable_*_ctor().

Link: https://lore.kernel.org/20260521032730.2104017-1-apopple@nvidia.com
Fixes: 5e8eb9aeeda3 ("arm64: mm: always call PTE/PMD ctor in __create_pgd_mapping()")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 weeks agomm/list_lru: drain before clearing xarray entry on reparent
Shakeel Butt [Mon, 1 Jun 2026 16:15:01 +0000 (09:15 -0700)] 
mm/list_lru: drain before clearing xarray entry on reparent

memcg_reparent_list_lrus() clears the dying memcg's xarray entry with
xas_store(&xas, NULL) before reparenting its per-node lists into the
parent.  This opens a window where a concurrent list_lru_del() arriving
for the dying memcg sees xa_load() == NULL, walks to the parent in
lock_list_lru_of_memcg(), takes the parent's per-node lock, and calls
list_del_init() on an item still physically linked on the dying memcg's
list.

If another in-flight thread holds the dying memcg's per-node lock at the
same moment (another list_lru_del, or a list_lru_walk_one running an
isolate callback), both threads modify ->next/->prev pointers on the same
physical list under different locks.  Adjacent items can corrupt each
other's links.

Fix it by reversing the order: reparent each per-node list and mark the
child's list lru dead and then clear the xarray entry.  Any concurrent
list_lru op that finds the still-set xarray entry either takes the dying
memcg's per-node lock (synchronizing with the drain) or sees LONG_MIN and
walks to the parent, where the items now live.

Link: https://lore.kernel.org/20260601161501.1444829-1-shakeel.butt@linux.dev
Fixes: fb56fdf8b9a2 ("mm/list_lru: split the lock to per-cgroup scope")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reported-by: Chris Mason <clm@fb.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 weeks agomm/huge_memory: use correct flags for device private PMD entry
Lorenzo Stoakes [Mon, 1 Jun 2026 08:30:44 +0000 (09:30 +0100)] 
mm/huge_memory: use correct flags for device private PMD entry

Commit 65edfda6f3f2 ("mm/rmap: extend rmap and migration support
device-private entries") updated set_pmd_migration_entry() to use
pmdp_huge_get_and_clear() in the softleaf case, but made no further
adjustments to the function itself.

Therefore this function continues to incorrectly use pmd_write(),
pmd_soft_dirty() and pmd_uffd_wp() to determine whether the installed
migration entry should be marked writable, softdirty or uffd-wp
respectively.

Whilst all are incorrect, the most problematic of these is pmd_write(), as
this can lead to corrupted rmap state.

On x86-64 _PAGE_SWP_SOFT_DIRTY is aliased to _PAGE_RW.  So calling
pmd_write() on a softleaf will return the softdirty state encoded in the
entry, assuming CONFIG_MEM_SOFT_DIRTY was enabled.

This was observed when running the hmm.hmm_device_private.anon_write_child
selftest:

1. The test faults in a range then migrates it such that a device-private
   THP range is established.

2. The parent then migrates it to a device-private writable PMD entry whose
   folio is entirely AnonExclusive with entire_mapcount=1, softdirty set
   (accidentally correct write state).

3. The parent forks and the PMD entries are set to device-private read only
   entries, entire_mapcount=2, softdirty still set.

4. [BUG] The child writes to the range then migrates to RAM - intending to
   install non-writable migration entries - but replacing parent and child
   PMD mappings with WRITABLE entries due to misinterpreting the softdirty
   bit.

5. In remove_migration_pmd(), if !softleaf_is_migration_read(entry) we
   set the RMAP_EXCLUSIVE flag when calling folio_add_anon_rmap_pmd() for
   both parent and child, which are therefore AnonExclusive.

6. [SPLAT] Child sets migrated folio entire_mapcount=1, parent sets
   entire_mapcount=2 and we end up with an AnonExclusive folio with
   entire_mapcount=2! Assert fires in __folio_add_anon_rmap():

VM_WARN_ON_FOLIO(folio_test_large(folio) &&
 folio_entire_mapcount(folio) > 1 &&
 PageAnonExclusive(cur_page), folio)

This patch fixes the issue by correctly referencing the softleaf entry
fields for writable, softdirty and uffd-wp in set_pmd_migration_entry().

It also only updates A/D flags if the entry is present as these are
otherwise not meaningful for a softleaf entry.

This patch also flips the if (!present) { ...  } else { ...  } logic in
set_pmd_migration_entry() so it is easier to understand, and adds some
comments to make things clearer.

I was able to bisect this to commit 775465fd26a3 ("lib/test_hmm: add zone
device private THP test infrastructure") which first exposes this bug as
it was the commit that permitted test_hmm to generate the test.

However commit 65edfda6f3f2 ("mm/rmap: extend rmap and migration support
device-private entries") is the commit that actually enabled this
behaviour.

Link: https://lore.kernel.org/20260601083044.57132-1-ljs@kernel.org
Fixes: 65edfda6f3f2 ("mm/rmap: extend rmap and migration support device-private entries")
Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Balbir Singh <balbirs@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Oscar Salvador (SUSE) <osalvador@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 weeks agomm/damon/lru_sort: handle ctx allocation failure
SeongJae Park [Fri, 29 May 2026 00:01:03 +0000 (17:01 -0700)] 
mm/damon/lru_sort: handle ctx allocation failure

DAMON_LRU_SORT allocates the damon_ctx object for its kdamond in its init
function.  damon_lru_sort_enabled_store() wrongly assumes the allocation
will always succeed once tried.  If the damon_ctx allocation was failed,
therefore, code execution reaches to damon_commit_ctx() while 'ctx' is
NULL.  As a result, it dereferences the NULL 'ctx' pointer.  Avoid the
NULL dereference by returning -ENOMEM if 'ctx' is NULL.

Link: https://lore.kernel.org/20260529000104.7006-3-sj@kernel.org
Fixes: c4a8e662c839 ("mm/damon/lru_sort: use damon_initialized()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.18.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 weeks agomm/damon/reclaim: handle ctx allocation failure
SeongJae Park [Fri, 29 May 2026 00:01:02 +0000 (17:01 -0700)] 
mm/damon/reclaim: handle ctx allocation failure

Patch series "mm/damon/{reclaim,lru_sort}: handle ctx allocation failures".

DAMON_RECLAIM and DAMON_LRU_SORT could dereference NULL pointers if their
damon_ctx object allocations fail.  The bugs are expected to happen
infrequently because the allocations are arguably too small to fail on
common setups.  But theoretically they are possible and the consequences
are bad.  Fix those.

The issues were discovered [1] by Sashiko.

This patch (of 2):

DAMON_RECLAIM allocates the damon_ctx object for its kdamond in its init
function.  damon_reclaim_enabled_store() wrongly assumes the allocation
will always succeed once tried.  If the damon_ctx allocation was failed,
therefore, code execution reaches to damon_commit_ctx() while 'ctx' is
NULL.  As a result, it dereferences the NULL 'ctx' pointer.  Avoid the
NULL dereference by returning -ENOMEM if 'ctx' is NULL.

Link: https://lore.kernel.org/20260529000104.7006-2-sj@kernel.org
Link: https://lore.kernel.org/20260419014800.877-1-sj@kernel.org
Fixes: 3f7a914ab9a5 ("mm/damon/reclaim: use damon_initialized()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.18.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 weeks agozram: fix use-after-free in zram_bvec_write_partial()
Cunlong Li [Thu, 28 May 2026 02:48:44 +0000 (10:48 +0800)] 
zram: fix use-after-free in zram_bvec_write_partial()

zram_read_page() picks the sync or async backing device read path based on
whether the parent bio is NULL.  zram_bvec_write_partial() passes its
parent bio down, so for ZRAM_WB slots the read is dispatched
asynchronously and zram_read_page() returns 0 while the bio is still in
flight.  The caller then runs memcpy_from_bvec(), zram_write_page() and
__free_page() on the buffer, leaving the async read to write into a freed
page.

zram_bvec_read_partial() was switched to NULL in commit 4e3c87b9421d
("zram: fix synchronous reads") for the same reason; the write_partial
counterpart was missed.

Link: https://lore.kernel.org/20260528-zram-v3-1-cab86eef8764@gmail.com
Fixes: 8e654f8fbff5 ("zram: read page from backing device")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Cunlong Li <shenxiaogll@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 weeks agoMAINTAINERS: update Baoquan He's email address
Baoquan He [Thu, 28 May 2026 13:14:54 +0000 (21:14 +0800)] 
MAINTAINERS: update Baoquan He's email address

I will switch to use @linux.dev mailbox, update all entries in
MAINTAINERS.  And map the address in .mailmap.

Link: https://lore.kernel.org/20260528131454.1996752-1-baoquan.he@linux.dev
Signed-off-by: Baoquan He <baoquan.he@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 weeks agotools headers UAPI: sync linux/taskstats.h for procacct.c
Wang Yaxin [Wed, 27 May 2026 13:35:58 +0000 (21:35 +0800)] 
tools headers UAPI: sync linux/taskstats.h for procacct.c

After commit 9b93f7e32774 ("tools/getdelays: use the static UAPI headers
from tools/include/uapi"), the Makefile was changed to use
-I../include/uapi/ instead of -I../../usr/include to ensure tools always
use the up-to-date UAPI headers.

However, only linux/taskstats.h was added to tools/include/uapi/ in commit
e5bbb35a07b3 ("tools headers UAPI: sync linux/taskstats.h"), but
linux/acct.h was missing.

This causes procacct.c to fail to compile with:

procacct.c:234:37: error: 'AGROUP' undeclared (first use in this function)

gcc -I../include/uapi/    getdelays.c   -o getdelays
gcc -I../include/uapi/    procacct.c   -o procacct
procacct.c: In function `print_procacct':
procacct.c:234:37: error: `AGROUP' undeclared (first use in this function)
did you mean `NOGROUP'?
  234 |  , t->version >= 12 ? (t->ac_flag & AGROUP ? 'P' : 'T') : '?'
      |                                     ^~~~~~
      |                                     NOGROUP
procacct.c:234:37: note: each undeclared ident

because procacct.c uses the AGROUP macro defined in linux/acct.h.

Add the missing linux/acct.h to complete the static UAPI header set.

Link: https://lore.kernel.org/20260527213558929EhiHHy9EDTMjmg3uuDOMi@zte.com.cn
Fixes: 9b93f7e32774 ("tools/getdelays: use the static UAPI headers from tools/include/uapi")
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Reviewed-by: Thomas Weißschuh <linux@weissschuh.net>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 weeks agomm/cma_sysfs: skip inactive CMA areas in sysfs
Kaitao Cheng [Fri, 22 May 2026 13:14:34 +0000 (21:14 +0800)] 
mm/cma_sysfs: skip inactive CMA areas in sysfs

cma_activate_area() can fail after a CMA area has already been added to
cma_areas[].  In that case the area is left in the global array, but it
does not reach the point where CMA_ACTIVATED is set.

cma_sysfs_init() currently walks all cma_area_count entries and creates
sysfs files for every area, including ones that failed activation.  These
areas are not usable CMA areas and should not be exposed to userspace as
valid CMA regions.

If such an inactive area is exposed, userspace sees a CMA directory whose
read-only accounting files report zeros.  total_pages and available_pages
report zero because the failed activation path clears cma->count and
cma->available_count, while the allocation and release counters also stay
at zero because the area cannot service CMA allocations.  This makes the
failed area look like a valid but empty CMA region and can mislead tests,
monitoring, and diagnostics.

Skip CMA areas that did not reach CMA_ACTIVATED when creating the sysfs
objects.  Since inactive entries can now be skipped, make the error unwind
tolerate entries that never had cma_kobj initialized.

Link: https://lore.kernel.org/20260524140420.61864-1-kaitao.cheng@linux.dev
Link: https://lore.kernel.org/20260522131434.78532-1-kaitao.cheng@linux.dev
Fixes: 43ca106fa8ec ("mm: cma: support sysfs")
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
Reported-by: David Hildenbrand (Arm) <david@kernel.org>
Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
Suggested-by: Muchun Song <songmuchun@bytedance.com>
Reported-by: Muchun Song <songmuchun@bytedance.com>
Closes: https://lore.kernel.org/linux-mm/55481a8b-dcfc-4bef-ba59-aa0b43dca88b@kernel.org/
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dmitry Osipenko <digetx@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 weeks agoipc/shm: serialize orphan cleanup with shm_nattch updates
Yilin Zhu [Thu, 30 Apr 2026 05:21:34 +0000 (13:21 +0800)] 
ipc/shm: serialize orphan cleanup with shm_nattch updates

shm_destroy_orphaned() walks the shm idr under shm_ids(ns).rwsem, but that
does not serialize all fields tested by shm_may_destroy().  In particular,
shm_nattch is updated while holding shm_perm.lock, and attach paths can do
that without holding the rwsem.

Do not decide that an orphaned segment is unused before taking the object
lock.  Move the shm_may_destroy() check under shm_perm.lock, matching the
other destroy paths, and unlock the segment when it no longer qualifies
for removal.

Link: https://lore.kernel.org/9d97cc1031de2d0bace0edf3a668818aa2f4eca6.1777410234.git.zylzyl2333@gmail.com
Fixes: 4c677e2eefdb ("shm: optimize locking and ipc_namespace getting")
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Yilin Zhu <zylzyl2333@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jeongjun Park <aha310510@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Serge Hallyn <sergeh@kernel.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 weeks agobpftool: Use libbpf error code for flow dissector query
Woojin Ji [Wed, 3 Jun 2026 00:33:39 +0000 (09:33 +0900)] 
bpftool: Use libbpf error code for flow dissector query

bpf_prog_query() returns a negative errno on failure.
query_flow_dissector() currently closes the namespace fd and then reads
errno to decide whether -EINVAL means that the running kernel does not
support flow dissector queries.

That errno check controls behavior, not just diagnostics: -EINVAL is
handled as a non-fatal old-kernel case, while any other error makes bpftool
net fail.

The namespace fd is opened read-only, so close() is not expected to
commonly fail in normal use. Still, the BPF_PROG_QUERY error is already
available in err, and reading errno after an intervening close() is
fragile. If close() does change errno, the compatibility branch may be
based on close()'s error instead of the BPF_PROG_QUERY result.

This was reproduced with an LD_PRELOAD fault injector that forced
BPF_PROG_QUERY for BPF_FLOW_DISSECTOR to fail with EINVAL and then
forced close() on the netns fd to fail with EIO. The unpatched bpftool
reported "can't query prog: Input/output error". With this change, the
same injected failure is handled as the intended non-fatal EINVAL
compatibility case.

Use the libbpf-returned error code instead. Keep the existing errno reset
in the non-fatal path to preserve batch mode behavior. The success path
is unchanged.

Fixes: 7f0c57fec80f ("bpftool: show flow_dissector attachment status")
Signed-off-by: Woojin Ji <random6.xyz@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Leon Hwang <leon.hwang@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Quentin Monnet <qmo@kernel.org>
Link: https://lore.kernel.org/bpf/20260603003339.33791-1-random6.xyz@gmail.com
Assisted-by: ChatGPT:gpt-5.5
2 weeks agoMerge tag 'drm-msm-next-2026-05-30' of https://gitlab.freedesktop.org/drm/msm into...
Dave Airlie [Wed, 3 Jun 2026 20:41:21 +0000 (06:41 +1000)] 
Merge tag 'drm-msm-next-2026-05-30' of https://gitlab.freedesktop.org/drm/msm into drm-next

Changes for v7.2

Core:
- Fixed documentation for msm_gem_shrinker functions
- IFPC related enablement/fixes for gen8
- PERFCNTR_CONFIG ioctl support

GPU
- Reworked handling of UBWC configuration
- a810 suppport

MDSS:
- Added Milos platform support
- Reworked handling of UBWC configuration

DisplayPort:
- Reworked HPD handling, preparing for the MST support

DPU:
- Added Milos platform support
- Reworked handling of UBWC configuration

DSI:
- Added Milos platform support

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rob Clark <rob.clark@oss.qualcomm.com>
Link: https://patch.msgid.link/CACSVV00DXZcvFH2-C3fouve5DGs0DGa-vvsJPuaRmUZZVNKOfg@mail.gmail.com
2 weeks agoirqchip/loongarch-ir: Add IR (interrupt redirection) irqchip support
Tianyang Zhang [Wed, 13 May 2026 01:28:35 +0000 (09:28 +0800)] 
irqchip/loongarch-ir: Add IR (interrupt redirection) irqchip support

The main function of the redirect interrupt controller is to manage
the redirected-interrupt table, which consists of many redirected entries.

When MSI interrupts are requested, the driver creates a corresponding
redirected entry that describes the target CPU/vector number and the
operating mode of the interrupt. The redirected interrupt module has an
independent cache, and during the interrupt routing process, it will
prioritize the redirected entries that hit the cache. The irqchip driver
can invalidate certain entry caches via a command queue.

Co-developed-by: Liupu Wang <wangliupu@loongson.cn>
Signed-off-by: Liupu Wang <wangliupu@loongson.cn>
Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Huacai Chen <chenhuacai@loongson.cn>
Link: https://patch.msgid.link/20260513012839.2856463-5-zhangtianyang@loongson.cn
2 weeks agoirqchip/loongarch-avec: Return IRQ_SET_MASK_OK_DONE when keep affinity
Tianyang Zhang [Wed, 13 May 2026 01:28:34 +0000 (09:28 +0800)] 
irqchip/loongarch-avec: Return IRQ_SET_MASK_OK_DONE when keep affinity

Interrupt redirection support requires a new redirect domain, which will
appear as a child domain of avecintc domain. For each interrupt source,
avecintc domain only provides the CPU/interrupt vectors, while redirect
domain provides other operations to synchronize the interrupt affinity
information among multiple cores.

When modifying the affinity of an interrupt associated with the redirect
domain, if the avecintc domain detects that the actual interrupt affinity
hasn't been changed, then the redirect domain doesn't need to perform any
operations.

To achieve the above purpose, in avecintc_set_affinity() when the current
affinity remains valid, then return value is set to IRQ_SET_MASK_OK_DONE.

This doesn't introduce any compatibility issues, even if the new return
value causing msi_domain_set_affinity() to no longer perform the call to
irq_chip_write_msi_msg():

  1) When both avecintc and redirect exist in the system, the msg_address
     and msg_data no longer change after the allocation phase, so it does
     not actually require updating the MSI message info.

  2) When only avecintc exists in the system, the irq_domain_activate_irq()
     interface will be responsible for the initial configuration of the MSI
     message info, which is unconditional. After that, if unnecessary,
     there is no modification to the MSI message info.

Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Huacai Chen <chenhuacai@loongson.cn>
Link: https://patch.msgid.link/20260513012839.2856463-4-zhangtianyang@loongson.cn
2 weeks agoirqchip/loongarch-avec: Prepare for interrupt redirection support
Tianyang Zhang [Wed, 13 May 2026 01:28:33 +0000 (09:28 +0800)] 
irqchip/loongarch-avec: Prepare for interrupt redirection support

Interrupt redirection support requires a new interrupt chip, which needs
to share data structures, constants and functions with the AVECINTC code.

So move them to the header file and make the required functions public.

Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Huacai Chen <chenhuacai@loongson.cn>
Link: https://patch.msgid.link/20260513012839.2856463-3-zhangtianyang@loongson.cn
2 weeks agoDocs/LoongArch: Add advanced extended IRQ model
Tianyang Zhang [Wed, 13 May 2026 01:28:32 +0000 (09:28 +0800)] 
Docs/LoongArch: Add advanced extended IRQ model

Introduce a new advanced extended interrupt model with redirect interrupt
controllers. When the redirect interrupt controller is enabled, the routing
target of MSI interrupts is no longer a specific CPU and vector number, but
a specific redirect entry. The actual CPU and vector number used are
described by the redirect entry.

Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Huacai Chen <chenhuacai@loongson.cn>
Link: https://patch.msgid.link/20260513012839.2856463-2-zhangtianyang@loongson.cn
2 weeks agoarm64: patching: replace min_t with min in __text_poke
Thorsten Blum [Sun, 31 May 2026 22:08:16 +0000 (00:08 +0200)] 
arm64: patching: replace min_t with min in __text_poke

Use the simpler min() macro since both values are unsigned and
compatible.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Will Deacon <will@kernel.org>
2 weeks agolocking/rtmutex: Skip remove_waiter() when waiter is not enqueued
Davidlohr Bueso [Thu, 7 May 2026 11:29:13 +0000 (04:29 -0700)] 
locking/rtmutex: Skip remove_waiter() when waiter is not enqueued

syzbot triggered the following splat in remove_waiter() via
FUTEX_CMP_REQUEUE_PI:

  KASAN: null-ptr-deref in range [0x0000000000000a88-0x0000000000000a8f]
   class_raw_spinlock_constructor
   remove_waiter+0x159/0x1200 kernel/locking/rtmutex.c:1561
   rt_mutex_start_proxy_lock+0x103/0x120
   futex_requeue+0x10e4/0x20d0
   __x64_sys_futex+0x34f/0x4d0

task_blocks_on_rt_mutex() does not arm the waiter upon deadlock detection,
leaving waiter->task nil, where 3bfdc63936dd ("rtmutex: Use waiter::task instead
of current in remove_waiter()") made this fatal.

Furthermore, rt_mutex_start_proxy_lock() should not be calling into remove_waiter()
upon a successfully grabbing the rtmutex. 1a1fb985f2e2 ("futex: Handle early deadlock
return correctly"), moved the remove_waiter() out of __rt_mutex_start_proxy_lock()
(where 'ret' was only ever 0 or < 0) into the wrapper. Tighten this check to
account for try_to_take_rt_mutex().

Fixes: 3bfdc63936dd ("rtmutex: Use waiter::task instead of current in remove_waiter()")
Reported-by: syzbot+78147abe6c524f183ee9@syzkaller.appspotmail.com
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Closes: https://lore.kernel.org/all/69f114ac.050a0220.ac8b.0003.GAE@google.com/
Link: https://patch.msgid.link/20260507112913.1019537-1-dave@stgolabs.net
2 weeks agogpu: nova-core: move lifetime to `Bar0`
Gary Guo [Tue, 2 Jun 2026 17:04:07 +0000 (18:04 +0100)] 
gpu: nova-core: move lifetime to `Bar0`

Currently Nova code uses `&'a Bar0` a lot. This is `&'a Mmio`, where `Mmio`
represents an owned MMIO region; this type only exists as a target for
`Deref` so `Bar` and `IoMem` can share code and should be avoided to be
named directly. The upcoming I/O projection series would make `Io` trait
much simpler to implement, and thus the owned MMIO type would be removed
in favour of direct `Io` implementation on `Bar` and `IoMem`.

Add lifetime parameter to `Bar0<'a>` and change it to be alias of `&'a
pci::Bar<'a, ..>`. This also prepares Nova core so that when I/O projection
series land, this could be changed to using a MMIO view type directly which
avoids double indirection.

Signed-off-by: Gary Guo <gary@garyguo.net>
Acked-by: Alexandre Courbot <acourbot@nvidia.com>
Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260602170416.2268531-1-gary@kernel.org
[ Rebase onto latest drm-rust-next (Blackwell enablement). - Danilo ]
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2 weeks agoKVM: arm64: nv: Restart stage-1 walk if stage-2 desc update fails
Oliver Upton [Tue, 2 Jun 2026 23:54:50 +0000 (16:54 -0700)] 
KVM: arm64: nv: Restart stage-1 walk if stage-2 desc update fails

kvm_walk_nested_s2() returns -EAGAIN as an indication that an underlying
descriptor update fails due to a race. The expectation is that the
caller restart translation, yet walk_s1() actually synthesizes an abort.

Propagate the -EAGAIN return out of walk_s1(), relying on callers to
restart the translation fetch.

Fixes: e4c7dfac2f1a ("KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW")
Signed-off-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20260602235450.103057-6-oupton@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2 weeks agoKVM: arm64: Restart instruction upon race in __kvm_at_s12()
Oliver Upton [Tue, 2 Jun 2026 23:54:49 +0000 (16:54 -0700)] 
KVM: arm64: Restart instruction upon race in __kvm_at_s12()

__kvm_at_s*() are expected to return -EAGAIN if the page table walk
raced with a concurrent update to a page table descriptor, which is
interpreted as a signal to restart the trapping instruction.

While this mostly works, __kvm_at_s12() silently eats the return from
__kvm_at_s1e01() and consumes an uninitialized PAR value. Propagate the
nonzero return instead.

Fixes: 92c6443222ca ("KVM: arm64: Propagate PTW errors up to AT emulation")
Signed-off-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20260602235450.103057-5-oupton@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2 weeks agoKVM: arm64: nv: Inject SEA TTW when desc update can't write to GPA
Oliver Upton [Tue, 2 Jun 2026 23:54:48 +0000 (16:54 -0700)] 
KVM: arm64: nv: Inject SEA TTW when desc update can't write to GPA

Similar to the handling of descriptor reads, inject an SEA during TTW
when the descriptor access fails for reasons other than a race, such as
a read-only memslot or a bad HVA.

Fixes: bff8aa213dee ("KVM: arm64: Implement HW access flag management in stage-1 SW PTW")
Fixes: e4c7dfac2f1a ("KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW")
Signed-off-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20260602235450.103057-4-oupton@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2 weeks agoKVM: arm64: nv: Fully update VNCR fixmap state in kvm_translate_vncr()
Oliver Upton [Tue, 2 Jun 2026 23:54:47 +0000 (16:54 -0700)] 
KVM: arm64: nv: Fully update VNCR fixmap state in kvm_translate_vncr()

kvm_translate_vncr() first invalidates the pseudo-TLB entry and
corresponding fixmap in anticipation of installing a new translation.

While the fixmap invalidation does clear the mapping from host stage-1,
it does not clear the L1_VNCR_MAPPED flag. Depending on the state of the
VNCR TLB at vcpu_put(), this could potentially precipitate a BUG_ON() if
vt->cpu is reset.

Share a helper with kvm_vcpu_put_hw_mmu(), ensuring that KVM's view of
the VNCR fixmap is in sync with the state of the VNCR TLB. Give it a
slightly verbose name to make it obvious that it is meant to be used
local to a CPU, unlike other VNCR TLB maintenance.

Fixes: 069a05e535496 ("KVM: arm64: nv: Handle VNCR_EL2-triggered faults")
Signed-off-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20260602235450.103057-3-oupton@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2 weeks agoKVM: arm64: Don't leak PFN when kvm_translate_vncr() races MMU notifier
Oliver Upton [Tue, 2 Jun 2026 23:54:46 +0000 (16:54 -0700)] 
KVM: arm64: Don't leak PFN when kvm_translate_vncr() races MMU notifier

In the case that kvm_translate_vncr() races with an MMU notifier the
early return does not release a reference on the faulted in PFN. Add
the necessary call to kvm_release_faultin_page() for the unused PFN.

Cc: stable@vger.kernel.org
Fixes: 069a05e535496 ("KVM: arm64: nv: Handle VNCR_EL2-triggered faults")
Reported-by: Sashiko (local):gemini-3.1-pro
Signed-off-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20260602235450.103057-2-oupton@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2 weeks agoarm64: cpufeature: Expose ID_AA64ISAR2_EL1.ATS1A to KVM
Marc Zyngier [Tue, 2 Jun 2026 15:54:29 +0000 (16:54 +0100)] 
arm64: cpufeature: Expose ID_AA64ISAR2_EL1.ATS1A to KVM

KVM needs to know if the HW implements FEAT_ATS1A in order to correctly
sanitise HFGITR_EL2.ATS1E1A, which otherwise defaults to RES0 and
AT S1E1A traps are handled as UNDEF.

Solves this by exposing ID_AA64ISAR2_EL1.ATS1A to the rest of the kernel.

Fixes: ff987ffc0c18c ("KVM: arm64: nv: Add support for FEAT_ATS1A")
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20260602155430.2088142-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2 weeks agoKVM: arm64: Wire AT S1E1A in the system instruction handling table
Marc Zyngier [Tue, 2 Jun 2026 15:54:28 +0000 (16:54 +0100)] 
KVM: arm64: Wire AT S1E1A in the system instruction handling table

Despite having handling code for AT S1E1A, the instruction was
never plugged into the system instruction table, leading to an
exception being injected in the guest.

If the guest is Linux and using the __kvm_at() helper, the exception
is actually handled in the helper, and KVM continues more or less
silently by reentering the guest. Not exactly what you'd expect.

Fix this by plugging the emulation code where required.

Fixes: ff987ffc0c18c ("KVM: arm64: nv: Add support for FEAT_ATS1A")
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20260602155430.2088142-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2 weeks agoKVM: arm64: Key CPTR_EL2.E0POE propagation on FEAT_S1POE
Marc Zyngier [Tue, 2 Jun 2026 15:54:27 +0000 (16:54 +0100)] 
KVM: arm64: Key CPTR_EL2.E0POE propagation on FEAT_S1POE

We propagate CPTR_EL2.E0POE from a L1 into the L0 configuration, but
we key this on the L1 guest supporting FEAT_S2POE. This is obviously
wrong, as this bit is solely concerned with Stage-1 translation.

Fix this by making the update depend on FEAT_S1POE.

Fixes: cd931bd6093cb ("KVM: arm64: nv: Add additional trap setup for CPTR_EL2")
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20260602155430.2088142-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2 weeks agoperf/arm-cmn: Fix DVM node events
Robin Murphy [Fri, 29 May 2026 14:33:45 +0000 (15:33 +0100)] 
perf/arm-cmn: Fix DVM node events

The new DVM node events added in CMN-700 also apply to CMN S3; fix
the model encoding so that we can expose the aliases and handle
occupancy filtering on newer CMNs too.

Cc: stable@vger.kernel.org
Fixes: 0dc2f4963f7e ("perf/arm-cmn: Support CMN S3")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2 weeks agodrm/amd/pm: smu_v14_0_0: use SoftMin for gfxclk in set_soft_freq_limited_range
Priya Hosur [Thu, 7 May 2026 08:01:37 +0000 (13:31 +0530)] 
drm/amd/pm: smu_v14_0_0: use SoftMin for gfxclk in set_soft_freq_limited_range

In smu_v14_0_0_set_soft_freq_limited_range(), the gfxclk floor is
programmed via SetHardMinGfxClk together with SetSoftMaxGfxClk. Under
power_dpm_force_performance_level=high this pins HardMin to peak gfxclk.

In PMFW arbitration HardMin has higher priority than SoftMax, so the
firmware thermal/PPT throttler cannot clamp gfxclk via SoftMax once
HardMin is set to peak. Replace SetHardMinGfxClk with SetSoftMinGfxclk
so the driver still requests peak performance but the firmware
throttler retains the ability to clamp gfxclk under thermal/PPT
pressure. SoftMax handling is unchanged and no other clock domains
are affected.

Signed-off-by: Priya Hosur <Priya.Hosur@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 3ea273267fd29cbf6d83ee72329f59eb5042605b)
Cc: stable@vger.kernel.org
2 weeks agodrm/amdgpu: Fix incorrect VRAM GART mappings on non-4K page size systems
Donet Tom [Wed, 27 May 2026 13:19:31 +0000 (18:49 +0530)] 
drm/amdgpu: Fix incorrect VRAM GART mappings on non-4K page size systems

When mapping VRAM pages into the GART page table,
amdgpu_gart_map_vram_range() assumes that the system page size is the
same as the GPU page size.

On systems with non-4K page sizes, multiple GPU pages can exist within
a single CPU page. As a result, the mappings are created incorrectly
because fewer page table entries are programmed than required.

Fix this by programming the mappings correctly for non-4K page size
systems.

Fixes: 237d623ae659 ("drm/amdgpu/gart: Add helper to bind VRAM pages (v2)")
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a8f0bc22388f74e0cf4ed8b7d1846c580eaf44cc)
Cc: stable@vger.kernel.org
2 weeks agodrm/amdgpu/userq: move wptr_obj cleanup in mqd_destroy
Sunil Khatri [Mon, 25 May 2026 04:26:23 +0000 (09:56 +0530)] 
drm/amdgpu/userq: move wptr_obj cleanup in mqd_destroy

In case when queue_create fails and mqd has already been
allocated and hence wptr_obj is not cleaned up.

So moving that cleanup part to mqd_destroy so it takes
care of all the cases of clean up and during tear down of
the queue.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 43355f62cd2ef5386c2693df537c232ea0f2ce6c)

2 weeks agodrm/amdgpu: improve the userq seq BO free bit lookup
Prike Liang [Tue, 26 May 2026 02:25:26 +0000 (10:25 +0800)] 
drm/amdgpu: improve the userq seq BO free bit lookup

Use find_next_zero_bit() to locate the next free seq slot bit
instead of the current walk, for more efficient bitmap scanning.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit ff905a9b6228de9eedd0db71ecb1bdde91fb898d)

2 weeks agodrm/amdgpu/userq: remove the vital queue unmap logging
Sunil Khatri [Mon, 25 May 2026 07:48:00 +0000 (13:18 +0530)] 
drm/amdgpu/userq: remove the vital queue unmap logging

Mesa userqueues free does not wait for the free to complete and go ahead
in unmapping the vital bos while kernel is still in queue free and
corresponding cleanup.

So ideally we don't need the logging for that and hence remove the warn
message as this is expected behaviour and functionally, we are making
sure to wait for the required fences before unmap.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 758a868043dcb07eca923bc451c16da3e73dc47c)

2 weeks agodrm/amdkfd: Fix buffer overflow in SDMA queue checkpoint/restore on GFX11
Andrew Martin [Thu, 28 May 2026 16:54:39 +0000 (12:54 -0400)] 
drm/amdkfd: Fix buffer overflow in SDMA queue checkpoint/restore on GFX11

The v11 MQD manager incorrectly assigned the CP-compute variants of
checkpoint_mqd/restore_mqd for KFD_MQD_TYPE_SDMA queues. These functions
use sizeof(struct v11_compute_mqd) (2048 bytes) instead of sizeof(struct
v11_sdma_mqd) (512 bytes), causing a 1536-byte overflow.

During CRIU checkpoint of an SDMA queue on Navi3x:
- checkpoint_mqd() reads 2048 bytes from a 512-byte SDMA MQD buffer,
  leaking 1536 bytes of adjacent GTT memory to userspace

During CRIU restore:
- restore_mqd() writes 2048 bytes into a 512-byte SDMA MQD buffer,
  corrupting 1536 bytes of adjacent GTT memory (often the ring buffer
  or neighboring MQDs)

This is a copy-paste regression unique to v11. All other ASIC backends
(cik, vi, v9, v10, v12) correctly use the SDMA-specific variants.

Add checkpoint_mqd_sdma() and restore_mqd_sdma() functions that properly
handle the smaller v11_sdma_mqd structure, matching the pattern used in
other MQD managers.

Fixes: cc009e613de6 ("drm/amdkfd: Add KFD support for soc21 v3")
Assisted-by: Claude:Sonnet 4-5
Signed-off-by: Andrew Martin <andrew.martin@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 6fa41db7ffdec97d62433adf03b7b9b759af8c2c)
Cc: stable@vger.kernel.org
2 weeks agodrm/amdkfd: fix NULL dereference in get_queue_ids()
Muhammad Bilal [Sat, 23 May 2026 16:56:46 +0000 (16:56 +0000)] 
drm/amdkfd: fix NULL dereference in get_queue_ids()

When usr_queue_id_array is NULL and num_queues is non-zero,
get_queue_ids() returns NULL. The callers check only IS_ERR() on the
return value; since IS_ERR(NULL) == false the check passes, and
suspend_queues() calls q_array_invalidate() which immediately
dereferences NULL while iterating num_queues times.

Userspace can trigger this via kfd_ioctl_set_debug_trap() by supplying
num_queues > 0 with a zero queue_array_ptr, causing a kernel panic.

A NULL usr_queue_id_array with num_queues == 0 is a legitimate no-op
(q_array_invalidate never executes, and resume_queues already guards
all queue_ids dereferences behind a NULL check). Return ERR_PTR(-EINVAL)
only when num_queues is non-zero and the pointer is absent; both callers
already propagate IS_ERR() returns correctly to userspace.

Fixes: a70a93fa568b ("drm/amdkfd: add debug suspend and resume process queues operation")
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f165a82cdf503884bb1797771c61b2fcc72113d4)
Cc: stable@vger.kernel.org
2 weeks agodrm/amdgpu: set noretry=1 as default for GFX 10.1.x (Navi10/12/14)
Vitaly Prosyak [Fri, 29 May 2026 17:50:38 +0000 (13:50 -0400)] 
drm/amdgpu: set noretry=1 as default for GFX 10.1.x (Navi10/12/14)

Problem:
While developing the amd_close_race IGT test (which intentionally triggers
execute permission faults by removing VM_PAGE_EXECUTABLE from GPU page table
entries), we discovered that on Navi10 (GFX 10.1.x) these faults produce
zero diagnostic output. The GPU simply hangs silently for ~10s until the
scheduler timeout fires. There is no way to distinguish an execute
permission fault from any other type of GPU hang.

Root cause:
GFX 10.1.x defaults to noretry=0, which sets
RETRY_PERMISSION_OR_INVALID_PAGE_FAULT=1 in the GFXHUB UTCL2 registers
(gfxhub_v2_0.c line 313). With this bit set, permission faults (valid PTE,
wrong R/W/X bits) are handled entirely within the UTCL1/UTCL2 hardware
loop: UTCL2 returns an XNACK to UTCL1, and UTCL1 re-requests the
translation indefinitely, expecting software to eventually fix the
permission bits (as happens in SVM/HMM recovery). No interrupt of any kind
reaches the IH ring.

This is different from invalid-page faults (V=0) which DO generate a retry
fault interrupt that the driver can escalate to a no-retry fault. Permission
faults with valid PTEs loop silently forever in hardware.

GFX 10.3+ already defaults to noretry=1, which makes permission faults
generate immediate L2 protection fault interrupts. GFX 10.1.x was
inadvertently left out of this default.

Fix:
Change the noretry=1 threshold from IP_VERSION(10, 3, 0) to
IP_VERSION(10, 1, 0) in amdgpu_gmc_noretry_set(). This is a one-line
change that aligns GFX 10.1.x behavior with GFX 10.3+ and all newer
generations.

With noretry=1, the existing non-retry fault handler
(gmc_v10_0_process_interrupt) already decodes and prints the full
GCVM_L2_PROTECTION_FAULT_STATUS register including PERMISSION_FAULTS,
faulting address, VMID, PASID, and process name. No additional logging
code is needed — the fix is purely routing permission faults to the
existing, fully-capable non-retry interrupt handler.

v2: Dropped GFX10-specific logging from gmc_v10_0.c and
kfd_int_process_v10.c (Felix Kuehling). v1 added logging in the retry
fault handler, but with noretry=1 permission faults take the non-retry
path — the v1 retry handler code was dead and would never execute.

Tested on Navi10 (GFX 10.1.10):
- Execute permission faults now produce immediate, clear output:
    [gfxhub] page fault (src_id:0 ring:64 vmid:4 pasid:592)
     Process amd_close_race pid 13380 thread amd_close_race pid 13384
      in page at address 0x40001000 from client 0x1b (UTCL2)
    GCVM_L2_PROTECTION_FAULT_STATUS:0x00700881
         PERMISSION_FAULTS: 0x8
- No regressions with properly-mapped GPU workloads

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit eb21edd24c40d81066753f8ac6f23bce15745395)
Cc: stable@vger.kernel.org
2 weeks agodrm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed
Timur Kristóf [Mon, 25 May 2026 11:45:02 +0000 (13:45 +0200)] 
drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed

When the fault stop mode isn't AMDGPU_VM_FAULT_STOP_ALWAYS,
these bits should be programmed to 0.

Program CRASH_ON_NO_RETRY_FAULT and CRASH_ON_RETRY_FAULT
always, to make sure to clear the bits when we don't want
to crash.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit d0cd99e73090700b7a942b98a3327ec966597d0a)

2 weeks agodrm/amdgpu: fix waiting for all submissions for userptrs
Christian König [Wed, 18 Feb 2026 12:05:46 +0000 (13:05 +0100)] 
drm/amdgpu: fix waiting for all submissions for userptrs

Wait for all submissions when userptrs need to be invalidated by the MMU
notifier, not just the one the userptr was involved into.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Tested-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 91250893cbaa25c86872deca95a540d08de1f91e)
Cc: stable@vger.kernel.org
2 weeks agodrm/amdgpu: drm/amdgpu: Set correct DMA mask for gfx12.1
Harish Kasiviswanathan [Tue, 12 May 2026 14:57:49 +0000 (10:57 -0400)] 
drm/amdgpu: drm/amdgpu: Set correct DMA mask for gfx12.1

Set correct DMA mask for gfx12

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a2ef14ee2593b48242b8d90f229f71c1710529da)

2 weeks agodrm/amdgpu: Use asic specific pte_addr_mask
Harish Kasiviswanathan [Tue, 28 Apr 2026 21:45:06 +0000 (17:45 -0400)] 
drm/amdgpu: Use asic specific pte_addr_mask

For PTE creation use asic specific physical page base address mask

v2: Change variable name from pa_mask to pte_addr_mask

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 2ea989885941a6e5607ef86dbe309e90b7191f21)

2 weeks agodrm/amd/pm: zero unused SMU argument registers
Yang Wang [Mon, 11 May 2026 08:33:37 +0000 (16:33 +0800)] 
drm/amd/pm: zero unused SMU argument registers

SMU messages may use fewer arguments than the available argument registers,
the previous code only wrote used registers and left the rest unchanged,
so stale values from a prior message could persist.

Write all argument registers for each message and zero the unused tail
to keep command arguments deterministic and avoid unintended carry-over.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit e03b635f61f77ebd5107ef82f48e3221cb695856)

2 weeks agodrm/amd/pm: mark metrics.energy_accumulator is invalid for smu 14.0.2
Yang Wang [Fri, 29 May 2026 03:47:31 +0000 (11:47 +0800)] 
drm/amd/pm: mark metrics.energy_accumulator is invalid for smu 14.0.2

EnergyAccumulator is unsupported on SMU 14.0.2, mark it invalid.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 646b05043eeed04b51c14aad22a400a8250af4b7)
Cc: stable@vger.kernel.org
2 weeks agodrm/amd/pm: fix smu13 power limit default/cap calculation
Yang Wang [Tue, 19 May 2026 03:18:12 +0000 (11:18 +0800)] 
drm/amd/pm: fix smu13 power limit default/cap calculation

smu_v13_0_0_get_power_limit() and smu_v13_0_7_get_power_limit() mix
runtime power_limit with PP table limits when reporting default/min/max.

When current power limit query succeeds, default_power_limit was set to the
runtime value instead of the PP table default, and min/max could be derived
from inconsistent bases (MsgLimits/runtime), leading to incorrect cap info.

Use SocketPowerLimitAc/Dc as the PP default base (pp_limit), keep
current_power_limit as runtime value, and derive min/max from pp_limit with
OD percentages.

Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5227
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1eaf26db95901ca70737503a89b831dd763c8453)
Cc: stable@vger.kernel.org
2 weeks agodrm/amd/pm: apply SMU 13.0.10 workaround during MP1 unload
Yang Wang [Thu, 21 May 2026 14:36:37 +0000 (22:36 +0800)] 
drm/amd/pm: apply SMU 13.0.10 workaround during MP1 unload

On SMU v13.0.10, sending PrepareMp1ForUnload with the default
parameter may leave the device in an inaccessible state. This can
affect runtime power management and partial PnP flows.
e.g: kexec, driver unload, boco/d3cold.

Pass the required workaround parameter 0x55, when preparing MP1 for
unload on SMU v13.0.10, keep the existing behavior for other SMU
versions.

Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5133
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 4e8ee1afeedb8d24dd22cdd5ae9f98a6d76ebe4b)
Cc: stable@vger.kernel.org