Johan Hovold [Fri, 24 Apr 2026 10:28:47 +0000 (12:28 +0200)]
MIPS: ip22-gio: fix device reference leak in probe
The gio probe function needlessly takes a device reference which is
never released and therefore prevents unbound gio devices from being
freed.
Fixes: e84de0c61905 ("MIPS: GIO bus support for SGI IP22/28")
Cc: stable@vger.kernel.org # 3.3
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Johan Hovold [Fri, 24 Apr 2026 10:28:46 +0000 (12:28 +0200)]
MIPS: ip22-gio: fix gio device memory leak
The gio device release callback was never wired up so gio devices are
not freed when the last reference is dropped.
Fixes: e84de0c61905 ("MIPS: GIO bus support for SGI IP22/28")
Cc: stable@vger.kernel.org # 3.3
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Takashi Iwai [Tue, 26 May 2026 15:28:41 +0000 (17:28 +0200)]
ALSA: seq: oss: Fix UAF at handling events with embedded SysEx data
The OSS sequencer processes the input MIDI bytes into a sequencer
event to be dispatched later (in snd_seq_oss_midi_putc() called from
snd_seq_oss_process_event()). For SysEx data, the event
record contains a data.ext.ptr pointer to the original SysEx bytes, and
the referred data is copied into the pool afterwards at dispatching.
The problem is that, if the sequencer port gets closed concurrently
before the dispatch, the OSS sequencer core also releases the
resources (in snd_seq_oss_midi_check_exit_port()), while the pending
event may hold a stale pointer, eventually leading to a UAF at a later
dispatch.
Fortunately, there is already a refcounting mechanism (snd_use_lock_t)
for the OSS MIDI device access, and for addressing the issue above, we
just need to extend the refcount until the event gets dispatched.
This patch extends snd_seq_oss_process_event() to give back the
refcount object, which is in turn released after calling the sequencer
dispatcher with the given event in the caller side.
According to the original report, KASAN reports as below:
Cássio Gabriel [Tue, 26 May 2026 12:48:27 +0000 (09:48 -0300)]
ALSA: xen-front: Connect event channel after stream prepare
The request channel must be connected from ALSA .open(), because hw-rule
queries and the stream open request use it. The event channel is
different: XENSND_EVT_CUR_POS handling uses ALSA runtime buffer and
period geometry, and the corresponding Xen stream parameters are not
submitted to the backend until .prepare() sends XENSND_OP_OPEN.
Currently .open() connects both channels. A backend current-position
event, or a stale event queued for an earlier stream instance, can
therefore reach xen_snd_front_alsa_handle_cur_pos() before
runtime->buffer_size and runtime->period_size are valid.
Add a per-channel connection helper, connect only the request channel in
.open(), connect the event channel after a successful stream prepare,
and disconnect it before stream close/free. Re-check the event-channel
state after taking ring_io_lock so disconnecting the event channel
synchronizes against a threaded IRQ that passed the initial lockless
state test. Keep defensive runtime geometry checks in the position
handler.
Cássio Gabriel [Tue, 26 May 2026 12:48:26 +0000 (09:48 -0300)]
ALSA: xen-front: Reset event channel state on stream clear
xen_snd_front_evtchnl_pair_clear() resets evt_next_id for both
channels. That is correct for the request channel, where evt_next_id is
used to allocate the next request id. It is wrong for the event channel:
incoming events are validated against evt_id, and evt_id is incremented
by evtchnl_interrupt_evt().
This leaves the expected event id from the previous stream instance. A
backend that restarts event ids for a reopened stream can then have valid
current-position events dropped until the stale frontend id catches up.
Reset evt_id for the event channel. Also advance the event-page consumer
to the current producer while clearing the stream, so obsolete events
queued for the previous stream instance are not delivered to the next
ALSA runtime.
Lianqin Hu [Wed, 27 May 2026 03:33:08 +0000 (03:33 +0000)]
ALSA: usb-audio: Add iface reset and delay quirk for TAE1160 USB Audio
Setting up the interface on suspend/resume fails on this card.
Adding a reset and delay quirk eliminates this problem.
usb 1-1: new full-speed USB device number 2 using xhci-hcd
usb 1-1: New USB device found, idVendor=25aa, idProduct=600b
usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 1-1: Product: TAE1159
usb 1-1: Manufacturer: Generic
usb 1-1: SerialNumber: 20210726905926
Jakub Pisarczyk [Tue, 26 May 2026 20:18:30 +0000 (22:18 +0200)]
ALSA: hda/cs420x: Add CS4208 fixup for iMac16,1
The 21.5" Retina 4K iMac (Late 2015, DMI product name "iMac16,1") ships
with a Cirrus Logic CS4208 codec wired to an external speaker amplifier
enabled through codec GPIO0 -- the same arrangement as the late-2013
MacBookPro 11,x. Without a matching entry in cs4208_mac_fixup_tbl[] the
fixup picker logs:
snd_hda_codec_cs420x hdaudioC1D0: CS4208: picked fixup for codec SSID 106b:0000
i.e. an empty fixup name, GPIO0 stays low, the external amp is never
powered up, and the internal speakers are silent on a stock kernel.
The codec SSID reported by hardware is 0x106b:0x7f00. Reusing CS4208_MBP11
(GPIO0 + SPDIF switch fixup) makes the internal speakers and S/PDIF
output work out of the box, removing the need for users to set
`options snd_hda_intel model=mbp11` via /etc/modprobe.d/.
Tested on iMac16,1 (kernel 6.17.0): four internal drivers
(Left tweeter, Left woofer, Right tweeter, Right woofer, exposed as the
4 channels of the analog-surround-40 ALSA profile) produce audio after
the fixup is applied.
The affected IOEXP nodes are missing interrupt pin configuration in
the device tree, causing the interrupt line to remain asserted and
resulting in repeated unhandled IRQ events.
Add the required interrupt-related properties for the affected IOEXP
devices to ensure proper interrupt handling and prevent the IRQ from
being disabled.
[arj: Drop markdown code-block fence, favour indentation]
Mike Hsieh [Fri, 22 May 2026 10:07:59 +0000 (18:07 +0800)]
ARM: dts: aspeed: clemente: Remove IOB NIC TMP421 nodes
Remove the TMP421 sensor entry from the DTS, as it is no longer the
primary telemetry source.
Accessing the CX8 NIC via I2C while it is powered off causes voltage
leakage on the bus, leading to EEPROM corruption on shared I2C devices.
Removing this node prevents the BMC from initiating traffic to the NIC
during initialization, protecting the integrity of the shared bus.
Signed-off-by: Mike Hsieh <mike.quanta.115@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
ARM: dts: aspeed: Enable networking for Asus Kommando IPMI Card
Adds the DT nodes needed for ethernet support for Asus Kommando, with
phy mode set to rgmii-id.
When this DT was originally added, the phy mode was set to rgmii (which
was incorrect). It was suggested to remove networking support from the
DT till the Aspeed networking driver was patched so that the correct phy
mode could be used.
The discussion in [1] mentions that u-boot was inserting clk delays that
weren't needed, which resulted in needing to set the phy mode in linux
to rgmii incorrectly. The solution suggested there was to patch u-boot to
no longer insert these clk delays and use rgmii-id as the phy mode for
any future DTs added to linux.
This DT was tested (on the OpenBMC u-boot fork [2]) with a u-boot DT
modified to insert clk delays of 0 (instead of patching u-boot itself).
[3] adds a u-boot DT for this device (without networking) and describes
how to patch it to add networking support. If this patched DT is used,
then networking works with rgmii-id phy mode in both u-boot and linux.
Haoxiang Li [Mon, 25 May 2026 08:26:11 +0000 (16:26 +0800)]
net: thunderx: fix PTP device ref leak in nicvf_probe()
cavium_ptp_get() acquires a reference to the PTP PCI device
through pci_get_device(). If any initialization step fails
after cavium_ptp_get(), the PTP PCI device reference is leaked.
Add a common error path to release the PTP reference before
returning from probe failures.
Qi Tang [Sat, 23 May 2026 14:32:45 +0000 (22:32 +0800)]
ipv6: validate extension header length before copying to cmsg
ip6_datagram_recv_specific_ctl() builds IPV6_{HOPOPTS,DSTOPTS,RTHDR}
cmsgs (and their IPV6_2292* legacy counterparts) by trusting the
on-wire hdrlen byte (ptr[1]) when computing the put_cmsg() length.
The length was validated only at parse time (ipv6_parse_hopopts(),
etc.). An nftables payload-write expression can rewrite hdrlen after
parsing and before the skb reaches recvmsg; the write itself is
in-bounds but put_cmsg() then reads up to ((hdrlen+1) << 3) = 2048
bytes from an 8-byte header. nftables is reachable from an
unprivileged user namespace, so this is an unprivileged
slab-out-of-bounds read:
BUG: KASAN: slab-out-of-bounds in put_cmsg+0x3ac/0x540
put_cmsg+0x3ac/0x540
udpv6_recvmsg+0xca0/0x1250
sock_recvmsg+0xdf/0x190
____sys_recvmsg+0x1b1/0x620
Add ipv6_get_exthdr_len() which validates that at least two bytes
are accessible before reading the hdrlen field, then checks the
computed length against skb_tail_pointer(skb), returning 0 on
failure. Extension headers are kept in the linear skb area by
pskb_may_pull() during input, so skb_tail_pointer() is the correct
bound.
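The validation order described above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel helper: exthdr_len_checked() is a hypothetical name, and the `tail` argument stands in for skb_tail_pointer(skb).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace analogue of the helper described above.
 * `ptr` points at an extension header; `tail` is one past the last
 * valid byte of the buffer (standing in for skb_tail_pointer()).
 * Returns the header length in bytes, or 0 if it does not fit.
 */
size_t exthdr_len_checked(const uint8_t *ptr, const uint8_t *tail)
{
	size_t len;

	/* At least two bytes are needed to read the hdrlen field at ptr[1]. */
	if (ptr > tail || (size_t)(tail - ptr) < 2)
		return 0;

	/* Standard (non-AH) formula: length in 8-byte units, minus one. */
	len = ((size_t)ptr[1] + 1) << 3;
	if (len > (size_t)(tail - ptr))
		return 0;

	return len;
}
```

A caller would treat a return value of 0 as a corrupted header and skip building the cmsg.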
Use ipv6_get_exthdr_len() at all non-AH call sites: the five
standalone cmsg blocks (HbH, 2292HbH, 2292DSTOPTS x2, 2292RTHDR)
and the three standard cases in the extension-header walk loop
(DSTOPTS, ROUTING, default). AH retains an inline bounds check
because its length formula differs ((ptr[1]+2)<<2).
The walk loop also gets a pre-read bounds check at the top to
validate ptr before any case accesses ptr[0] or ptr[1].
When the walk loop detects a corrupted header, return from the
function instead of continuing to process later socket options.
Cc: stable@vger.kernel.org
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Qi Tang <tpluszz77@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260523143245.2281415-1-tpluszz77@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Luka Gejak [Sat, 23 May 2026 13:04:20 +0000 (15:04 +0200)]
net: hsr: require valid EOT supervision TLV
Supervision frames are only valid if terminated with a zero-length EOT
TLV. The current check fails to reject non-EOT entries as the terminal
TLV, potentially allowing malformed supervision traffic.
Fix this by strictly requiring the terminal TLV to be HSR_TLV_EOT with
a length of zero.
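A minimal sketch of the strict terminal check, assuming a simplified two-byte TLV header. The struct layout, the function name, and the value TLV_EOT = 0 are assumptions made for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical TLV layout; TLV_EOT = 0 is an assumption of this sketch. */
#define TLV_EOT 0

struct tlv {
	uint8_t type;
	uint8_t length;
};

/* Accept the terminal TLV only if it is EOT and carries no payload. */
bool terminal_tlv_valid(const struct tlv *t)
{
	return t->type == TLV_EOT && t->length == 0;
}
```

Both conditions must hold: a non-EOT type with length zero and an EOT type with a non-zero length are each rejected.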
Sean Shen [Tue, 26 May 2026 13:07:16 +0000 (22:07 +0900)]
ksmbd: fix FSCTL permission bypass by adding a permission check for FSCTL_SET_SPARSE
FSCTL_SET_SPARSE in fsctl_set_sparse() modifies the file's sparse
attribute and saves it through xattr without any permission checks.
This exposes two issues:
1) A client on a read-only share can change the sparse attribute
on files it opened, even though the share is read-only.
Other FSCTL write operations already check
test_tree_conn_flag(work->tcon, KSMBD_TREE_CONN_FLAG_WRITABLE),
but FSCTL_SET_SPARSE does not.
2) Even on writable shares, clients without FILE_WRITE_DATA or
FILE_WRITE_ATTRIBUTES access should not modify the sparse
attribute. Similar handle-level checks exist in other functions
but are missing here.
Add both share-level writable check and per-handle access check.
Use goto out on error to avoid leaking file references.
Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3")
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: Sean Shen <grayhat@foxmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
ksmbd: release ksmbd_inode ref via ksmbd_inode_put on lookup paths
ksmbd_query_inode_status() and ksmbd_lookup_fd_inode() both take a
reference on a ksmbd_inode via __ksmbd_inode_lookup() (which performs
atomic_inc_not_zero()) and later release it using a bare
atomic_dec(&ci->m_count). Unlike ksmbd_inode_put(), a bare
atomic_dec() does not check whether the reference count has reached
zero, so if the caller happens to drop the last reference, the
ksmbd_inode is leaked: it stays in the global inode hash table with
m_count == 0, future __ksmbd_inode_lookup() calls reject it via
atomic_inc_not_zero(), and ksmbd_inode_free() is never invoked.
In ksmbd_lookup_fd_inode() the matched-fp path (which now also uses
ksmbd_inode_put()) cannot currently reach m_count == 0 because the
matched ksmbd_file holds its own reference on ci, but converting it to
the proper API keeps the three call sites consistent and avoids
future regressions if the locking changes.
Because ksmbd_inode_put() may free the ksmbd_inode if this drops the
last reference, the call must happen after up_read(&ci->m_lock) on the
two affected paths in ksmbd_lookup_fd_inode(). On the no-match path
this is a pure reordering; on the matched path ksmbd_fp_get() is
moved above the unlock so that the returned ksmbd_file is pinned
before the inode reference is released.
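The difference between a bare decrement and a "put" can be illustrated with C11 atomics. This is a userspace sketch only; ref_dec_bare() and ref_put() are hypothetical names and do not reproduce the ksmbd helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bare decrement: the caller never learns that the count hit zero. */
void ref_dec_bare(atomic_int *ref)
{
	atomic_fetch_sub(ref, 1);
}

/*
 * "put" style: returns true when this call dropped the last reference,
 * so the caller knows it must unhash and free the object.
 */
bool ref_put(atomic_int *ref)
{
	return atomic_fetch_sub(ref, 1) == 1;
}
```

Because a put may free the object, it has to come after the last access to it, which is why the description above moves the call past up_read().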
Signed-off-by: Aleksandr Golovnya <cofedish@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Ali Ganiyev [Mon, 25 May 2026 01:23:47 +0000 (10:23 +0900)]
ksmbd: OOB read regression in smb_check_perm_dacl() ACE-walk loops
Commit d07b26f39246 ("ksmbd: require minimum ACE size in
smb_check_perm_dacl()") introduced a transposed bounds check:
if (offsetof(struct smb_ace, sid) + aces_size < CIFS_SID_BASE_SIZE)
Since offsetof(..sid) is 8 and CIFS_SID_BASE_SIZE is 8, this evaluates
to `aces_size < 0`. Because `aces_size` is always non-negative, this
check becomes dead code and never breaks the loop.
Worse, that commit removed the old 4-byte guard, meaning the loop now
reads `ace->size` (offset 2) even when `aces_size` is 0-3 bytes. This
re-opens a 2-byte heap out-of-bounds (OOB) read past the pntsd allocation
during subsequent SMB2_CREATE operations.
Fix this by properly transposing the comparison to require at least
16 bytes (8-byte offset + 8-byte SID base), matching the correct form
used in smb_inherit_dacl().
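The arithmetic can be checked in isolation. The sketch below is a standalone illustration, not the ksmbd code: ACE_SID_OFFSET and SID_BASE_SIZE are hypothetical stand-ins for offsetof(struct smb_ace, sid) and CIFS_SID_BASE_SIZE, both taken as 8 per the description above, and the exact shape of the corrected comparison is an assumption based on the "at least 16 bytes" wording.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; both values are 8 per the commit description. */
#define ACE_SID_OFFSET 8
#define SID_BASE_SIZE  8

/* Transposed check: 8 + aces_size < 8 never holds for aces_size >= 0. */
bool broken_check_rejects(int aces_size)
{
	return ACE_SID_OFFSET + aces_size < SID_BASE_SIZE;
}

/* Assumed corrected form: require room for the ACE header plus a SID base. */
bool fixed_check_rejects(int aces_size)
{
	return aces_size < ACE_SID_OFFSET + SID_BASE_SIZE;
}
```

With 0 to 3 bytes remaining, the transposed form lets the loop continue, while the assumed corrected form stops it for anything under 16 bytes.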
Fixes: d07b26f39246 ("ksmbd: require minimum ACE size in smb_check_perm_dacl()")
Cc: stable@vger.kernel.org
Signed-off-by: Ali Ganiyev <ali.qaniyev@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Jakub Kicinski [Wed, 27 May 2026 01:32:34 +0000 (18:32 -0700)]
Merge tag 'nfc-7.1-rc6' of https://codeberg.org/linux-nfc/linux
David Heidelberg says:
====================
nfc pull request for net:
Code improvements
- llcp: Fix use-after-free in llcp_sock_release()
- llcp: Fix use-after-free race in nfc_llcp_recv_cc()
- hci: fix out-of-bounds read in HCP header parsing
Regression fixes:
- nxp-nci: i2c: use rising-edge IRQ on ACPI systems
Signed-off-by: David Heidelberg <david@ixit.cz>
* tag 'nfc-7.1-rc6' of https://codeberg.org/linux-nfc/linux:
nfc: nxp-nci: i2c: use rising-edge IRQ on ACPI systems
nfc: hci: fix out-of-bounds read in HCP header parsing
nfc: llcp: Fix use-after-free race in nfc_llcp_recv_cc()
nfc: llcp: Fix use-after-free in llcp_sock_release()
====================
Eric Dumazet [Mon, 25 May 2026 20:36:42 +0000 (20:36 +0000)]
vxlan: do not reuse cached ip_hdr() value after skb_tunnel_check_pmtu()
skb_tunnel_check_pmtu() can change skb->head.
Reusing old_iph after skb_tunnel_check_pmtu() can cause a UAF.
Use instead ip_hdr(skb) as done in drivers/net/bareudp.c
and drivers/net/geneve.c.
Found by Sashiko.
Fixes: 4cb47a8644cc ("tunnels: PMTU discovery support for directly bridged IP packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Link: https://patch.msgid.link/20260525203642.2389723-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Mon, 25 May 2026 20:13:35 +0000 (20:13 +0000)]
tunnels: load network headers after skb_cow() in iptunnel_pmtud_build_icmp[v6]()
Sashiko found that iptunnel_pmtud_build_icmp() and
iptunnel_pmtud_build_icmpv6() were caching ip_hdr() and ipv6_hdr()
before an skb_cow() call which can reallocate skb->head.
Fix this possible UAF by initializing the local variables
after the skb_cow() call.
Remove skb_reset_network_header() calls which were not needed.
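The general pattern, deriving a header pointer only after any call that may move the buffer, can be sketched in userspace with realloc() playing the role of skb_cow(). The struct and function names here are hypothetical and only loosely modelled on the skb layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical buffer with a header at a known offset from `head`. */
struct buf {
	unsigned char *head;
	size_t len;
	size_t hdr_off;
};

/* Always derive the header pointer from the current head. */
unsigned char *buf_hdr(const struct buf *b)
{
	return b->head + b->hdr_off;
}

/* Grow the buffer; like skb_cow(), this may move `head`. 0 on success. */
int buf_grow(struct buf *b, size_t extra)
{
	unsigned char *p = realloc(b->head, b->len + extra);

	if (!p)
		return -1;
	b->head = p;
	b->len += extra;
	return 0;
}
```

A pointer obtained from buf_hdr() before buf_grow() may dangle afterwards; calling buf_hdr() again after the grow always yields a valid pointer.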
Fixes: 4cb47a8644cc ("tunnels: PMTU discovery support for directly bridged IP packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Link: https://patch.msgid.link/20260525201335.2361845-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 27 May 2026 01:07:28 +0000 (18:07 -0700)]
Merge tag 'nf-next-26-05-25' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter: updates for net-next
The following patchset contains Netfilter fixes and small enhancements:
1) Disable 32-bit x_tables compatibility (32bit binaries on 64bit
kernel) interface in user namespaces. This is 'last warning'
before this is removed for good.
2) Add a configuration toggle for netfilter GCOV profiling. Provide
dedicated toggles for ipset and ipvs.
3) Remove modular support for nfnetlink and restrict it to built-in only.
From Pablo Neira Ayuso.
4) Use per-rule hash initval in nf_conncount. This avoids unnecessary
lock contention with short keys (e.g. conntrack zones) in different
namespaces.
5) Use nf_ct_exp_net() in ctnetlink expectation dumps.
From Pratham Gupta.
6) Remove a dead conditional in nft_set_rbtree.
7) Fix conntrack helper policy updates to apply per-class values correctly.
From David Carlier.
8) Fix an off-by-one OOB read in nf_conntrack_irc:parse_dcc(). Use strict
less-than comparison in the newline search loop to respect the
exclusive-end pointer convention. From Muhammad Bilal.
9) Fix typos in nf_conntrack_proto_tcp comments. From Avinash Duduskar.
10) Restore performance optimization in nft_set_pipapo_avx2 by passing
the next map index. Refactor lookup logic for clarity and add a
DEBUG_NET check to document this.
11) Avoid (harmless) u16 overflow in nf_conntrack_ftp when parsing FTP PORT
and EPRT commands. Ignore commands where single octet exceeds 255.
From Giuseppe Caruso.
Patch 12, which removes incorrect (and obviously unused) code from
nft_byteorder was kept back to avoid a net -> net-next merge conflict.
* tag 'nf-next-26-05-25' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_conntrack_ftp: avoid u16 overflows
netfilter: nft_set_pipapo_avx2: restore performance optimization
netfilter: nf_conntrack_proto_tcp: fix typos in comments
netfilter: nf_conntrack_irc: fix parse_dcc() off-by-one OOB read
netfilter: nfnl_cthelper: apply per-class values when updating policies
netfilter: nft_set_rbtree: remove dead conditional
netfilter: ctnetlink: use nf_ct_exp_net() in expectation dump
netfilter: nf_conncount: use per-rule hash initval
netfilter: allow nfnetlink built-in only
netfilter: add option for GCOV profiling
netfilter: x_tables: disable 32bit compat interface in user namespaces
====================
netconsole: Constify struct configfs_item_operations and configfs_group_operations
'struct configfs_item_operations' and 'configfs_group_operations' are not
modified in this driver.
Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.
On a x86_64, with allmodconfig, as an example:
Before:
======
   text    data     bss     dec     hex filename
  64259   24272     608   89139   15c33 drivers/net/netconsole.o
After:
=====
   text    data     bss     dec     hex filename
  64579   23952     608   89139   15c33 drivers/net/netconsole.o
Maoyi Xie [Mon, 25 May 2026 07:17:59 +0000 (15:17 +0800)]
mlxsw: spectrum_fid: use a dedicated list head pointer for sorted insert
mlxsw_sp_fid_port_vid_list_add() inserts into a list sorted by
local_port. It walks the list to find the first entry with a
larger local_port, then inserts the new entry before it:
If the loop falls through (the new local_port is the largest),
tmp_port_vid runs off the end of the list. &tmp_port_vid->list
then ends up at the list head itself (container_of() offsets
cancel), and list_add_tail() inserts at the tail. So the code
works today.
It is fragile though. Anyone who later adds a read of another
field of tmp_port_vid will hit memory outside the list head.
Track the insertion point with a dedicated list_head pointer.
Initialise insert_before to &fid->port_vid_list, set it to
&tmp_port_vid->list only on early break, and pass insert_before
to list_add_tail(). The cursor is no longer touched after the
loop. Behaviour is unchanged.
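The insertion scheme can be sketched with a minimal userspace list. The list helpers below are simplified stand-ins for the kernel's <linux/list.h>, and struct port_vid and sorted_insert() are hypothetical names; only the insert_before idea is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's circular list. */
struct list_head {
	struct list_head *prev, *next;
};

struct port_vid {
	struct list_head list;	/* first member, so a cast recovers the entry */
	int local_port;
};

void list_init(struct list_head *h)
{
	h->prev = h->next = h;
}

/* Insert `new` immediately before `pos`; before the head means at the tail. */
void list_add_before(struct list_head *new, struct list_head *pos)
{
	new->prev = pos->prev;
	new->next = pos;
	pos->prev->next = new;
	pos->prev = new;
}

void sorted_insert(struct list_head *head, struct port_vid *pv)
{
	struct list_head *insert_before = head;	/* default: append at tail */
	struct list_head *cur;

	for (cur = head->next; cur != head; cur = cur->next) {
		if (((struct port_vid *)cur)->local_port > pv->local_port) {
			insert_before = cur;	/* set only on early break */
			break;
		}
	}
	/* The loop cursor is not used past this point. */
	list_add_before(&pv->list, insert_before);
}
```

The entry type is only ever read inside the loop, where the cursor is known to point at a real element, so no later access can land on the list head.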
Wei Fang [Sun, 24 May 2026 07:03:10 +0000 (15:03 +0800)]
net: dsa: netc: fix unmet Kconfig dependencies for NET_DSA_NETC_SWITCH
NET_DSA_NETC_SWITCH selects NXP_NTMP, NXP_NETC_LIB and FSL_ENETC_MDIO,
but these symbols depend on NET_VENDOR_FREESCALE which may not be
enabled. This results in Kconfig warnings and linker errors like:
undefined reference to `ntmp_bpt_update_entry'
undefined reference to `ntmp_fdbt_search_port_entry'
undefined reference to `ntmp_free_cbdr'
undefined reference to `enetc_hw_alloc'
...
Therefore, add "depends on NET_VENDOR_FREESCALE" to NET_DSA_NETC_SWITCH,
ensuring that the selected symbols NXP_NTMP, NXP_NETC_LIB and
FSL_ENETC_MDIO, which all depend on NET_VENDOR_FREESCALE, can only be
selected when that dependency is already satisfied.
Lucien.Jheng [Sun, 24 May 2026 06:39:15 +0000 (14:39 +0800)]
net: phy: air_en8811h: add AN8811HB MCU assert/deassert support
AN8811HB needs an MCU soft-reset cycle before firmware loading begins.
Assert the MCU (hold it in reset) and immediately deassert (release)
via a dedicated PBUS register pair (0x5cf9f8 / 0x5cf9fc), accessed
through a registered mdio_device at PHY-addr+8.
Add __air_pbus_reg_write() as a low-level helper taking a struct
mdio_device *, create and register the PBUS mdio_device in
an8811hb_probe() and store it in priv->pbusdev, then implement
an8811hb_mcu_assert() / _deassert() on top of it. Add
an8811hb_remove() to unregister the PBUS device on teardown. Wire
both calls into an8811hb_load_firmware() and en8811h_restart_mcu()
so every firmware load or MCU restart on AN8811HB correctly sequences
the reset control registers.
ipv6: addrconf: fix temp address generation after prefix deprecation
When a router temporarily deprecates an IPv6 prefix (either by sending a
Router Advertisement with Preferred Lifetime = 0 or by letting the
lifetime expire) and later restores it, the kernel permanently loses its
ability to generate temporary privacy addresses (RFC 8981) for that
prefix.
This happens because the address worker attempts to generate a
replacement temporary address when the current one nears expiration. As
the base prefix is deprecated already, the generation fails after
marking the temporary address as already having spawned a replacement
(ifp->regen_count++).
When the router eventually restores the prefix, the temporary address
becomes active again. However, once it naturally expires, the address
worker sees that this temporary address has already tried to generate a
replacement and skips the regeneration.
Fix the issue by resetting the regen_count check of the latest temp
address generated for the prefix updated by the incoming RA.
l2tp: use refcount_inc_not_zero in l2tp_session_get_by_ifname
A reader in l2tp_session_get_by_ifname() can return a pointer to a
session whose refcount has reached zero. The getter takes its
reference with plain refcount_inc(), but every other session getter
in the same file (l2tp_v2_session_get, l2tp_v3_session_get, and the
corresponding _get_next variants) uses refcount_inc_not_zero()
because the IDR/RCU lookup can race with refcount_dec_and_test() ->
l2tp_session_free() -> kfree_rcu(). The ifname getter is the only
outlier; the inconsistency was raised on-list after 979c017803c4
("l2tp: use list_del_rcu in l2tp_session_unhash").
A reader inside rcu_read_lock_bh() that matches session->ifname can
be preempted between the strcmp() and the refcount_inc(). If the
last reference drops on another CPU in that window, the reader's
refcount_inc() runs on a counter that has reached zero. refcount_t
catches the addition-on-zero, prints "refcount_t: addition on 0;
use-after-free", saturates the counter, and returns the saturated
pointer to the caller. Session memory is held live by the in-flight
RCU read section, but the kfree_rcu() callback queued from
l2tp_session_free() will free it once the grace period closes; a
caller that dereferences the returned session past that point hits
a slab-use-after-free. On PREEMPT_RT local_bh_disable() is a per-CPU
sleeping lock and the preemption window is real; on stock PREEMPT
kernels local_bh_disable() is a preempt_count increment that closes
the cross-CPU race in practice (see below).
Use refcount_inc_not_zero() and continue the list walk on failure,
matching the other session getters in the file. The ifname getter
is the only session getter in net/l2tp/ that still uses the bare
refcount_inc() pattern; this change restores file-internal
consistency. The success path is unchanged.
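The "increment unless zero" semantics can be illustrated in userspace with C11 atomics. This is a sketch of the idea only; ref_inc_not_zero() is a hypothetical name and is not the kernel's refcount_t implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the count is still non-zero.
 * Returns false when the object is already being torn down.
 */
bool ref_inc_not_zero(atomic_uint *ref)
{
	unsigned int old = atomic_load(ref);

	while (old != 0) {
		/* On failure, `old` is reloaded with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;
}
```

A lookup loop built on this would skip the entry and keep walking when the call returns false, rather than handing out a pointer to an object whose last reference is gone.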
Fixes: abe7a1a7d0b6 ("l2tp: improve tunnel/session refcount helpers")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: James Chapman <jchapman@katalix.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260523023423.2568972-1-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Martin Karsten [Sat, 23 May 2026 01:22:20 +0000 (21:22 -0400)]
net: napi: Skip last poll when arming gro timer in busy poll
Skip the extra call to napi->poll(), if the gro timer is armed at the
end of busy polling. This removes the need for having a separate
__busy_poll_stop() routine and its code is moved directly into the
relevant places in busy_poll_stop(). Remove obsolete comment about
ndo_busy_poll_stop().
This is a follow-up to commit 58e2330bd455 ("net: napi: Avoid gro timer
misfiring at end of busypoll"), which has deferred arming the gro timer
to the end of __busy_poll_stop() to eliminate a race condition between
a short timer and long poll that could leave the queue stuck with
interrupts disabled and no timer armed.
Co-developed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Link: https://patch.msgid.link/20260523012247.1574691-1-mkarsten@uwaterloo.ca
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tim Bird [Fri, 22 May 2026 22:55:08 +0000 (16:55 -0600)]
llc: Add SPDX id lines to some llc source files
Most of the llc source files are missing SPDX-License-Identifier
lines. Add appropriate IDs to these files, and remove other license
info from the header. In one case, leave the existing id line
and just remove the license reference text.
Andreas Hindborg [Mon, 27 Apr 2026 08:11:35 +0000 (10:11 +0200)]
rust: module_param: use `pr_warn_once!` for null pointer warning
Replace `pr_warn!` and the accompanying TODO with `pr_warn_once!`, now that
the macro is available.
[ Note: Adarsh Das independently authored an identical patch on the
rust-for-linux list, but it missed the modules tree. ]
Suggested-by: Adarsh Das <adarshdas950@gmail.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Gary Guo <gary@garyguo.net>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
====================
Add OVS packet family YNL spec and unicast notification support
This series adds a YAML netlink spec for the OVS_PACKET_FAMILY genetlink
family and a bind-only ntf_bind() helper for receiving unicast
notifications.
====================
Minxi Hou [Fri, 22 May 2026 17:41:54 +0000 (01:41 +0800)]
tools: ynl: add unicast notification receive support
Add ntf_bind() method to YnlFamily for binding the netlink
socket without joining a multicast group. This enables receiving
unicast notifications through the existing poll_ntf/check_ntf
path.
The OVS packet family sends MISS and ACTION upcalls via
genlmsg_unicast() to a per-vport PID rather than through a
multicast group. The existing ntf_subscribe() couples bind()
with setsockopt(ADD_MEMBERSHIP), which does not fit the unicast
case. ntf_bind() provides the bind-only alternative, with the
address defaulting to (0, 0) but exposed as an explicit argument.
Minxi Hou [Fri, 22 May 2026 17:41:53 +0000 (01:41 +0800)]
netlink: specs: add OVS packet family specification
Add YAML netlink spec for the OVS_PACKET_FAMILY (ovs_packet).
This completes the set of OVS genetlink family specs (ovs_datapath,
ovs_flow, ovs_vport already exist).
The spec defines three operations: MISS (event), ACTION (event),
and EXECUTE (do). MISS and ACTION are kernel-to-userspace upcalls
sent via genlmsg_unicast(); EXECUTE is the only registered genl
operation.
Key, actions, and egress-tun-key attributes are typed as binary
rather than nest because the nested attribute definitions belong
to the ovs_flow spec and cross-spec references are not supported
by the YNL framework.
Alice Ryhl [Wed, 8 Apr 2026 08:32:17 +0000 (08:32 +0000)]
rust: kasan: add support for Software Tag-Based KASAN
This adds support for Software Tag-Based KASAN (KASAN_SW_TAGS) when
CONFIG_RUST is enabled. This requires that rustc includes support for
the kernel-hwaddress sanitizer, which is available since 1.96.0 [1].
Unlike with clang, we need to pass -Zsanitizer-recover in addition to
-Zsanitizer because the option is not implied automatically.
The kasan makefile uses different names for the flags depending on
whether CC is clang or gcc, but as we require that CC is clang when
using KASAN, we do not need to try to handle mixed gcc/llvm builds when
Rust is enabled.
Alice Ryhl [Wed, 8 Apr 2026 08:32:16 +0000 (08:32 +0000)]
rust: kasan: KASAN+RUST requires clang
Kernel KASAN involves passing various llvm/gcc specific arguments to
the C and Rust compiler. Since these arguments differ between llvm and
gcc, it's not safe to mix an llvm-based rustc with a gcc build when
kasan is enabled.
Alice Ryhl [Tue, 31 Mar 2026 10:57:49 +0000 (10:57 +0000)]
kbuild: rust: add AutoFDO support
This patch enables AutoFDO build support for Rust code within the Linux
kernel. This allows Rust code to be profiled and optimized based on the
profile.
The RUSTFLAGS variable was suffixed with *_AUTOFDO_CLANG to match the
naming of the config option, which is called CONFIG_AUTOFDO_CLANG.
This implementation has been verified in Android, first by inspecting
the object files and confirming that they look correct. After that,
it was verified as below:
1. Running the binderAddInts benchmark [1] with Rust Binder built as
rust_binder.ko module, using a Pixel 9 Pro.
2. Collecting a profile on a Pixel 10 Pro XL using the app-launch
benchmark, which starts different apps many times, on a device with
Rust Binder as a built-in kernel module. (C Binder was not present on
the device.)
3. Using the collected profile, run the binderAddInts benchmark again
with Rust Binder built both as a rust_binder.ko module, and as a
built-in kernel module.
4. In both cases, Rust Binder without AutoFDO was approximately 13%
slower than the AutoFDO optimized version. Built-in vs .ko did not
make a measurable performance difference.
All of the above was verified in conjunction with my helpers inlining
series [2], which confirmed that this worked correctly for helpers too
once [3] was fixed in the helpers inlining series.
Chaitanya Sabnis [Tue, 26 May 2026 10:22:40 +0000 (15:52 +0530)]
i2c: davinci: fix division by zero on missing clock-frequency
When the 'clock-frequency' property is missing from the device tree,
the driver falls back to DAVINCI_I2C_DEFAULT_BUS_FREQ. However, this
macro was defined in kHz (100), whereas the device tree property is
expected in Hz.
The probe function divided the fallback value by 1000, causing
integer truncation that resulted in dev->bus_freq = 0. This triggered
a deterministic division-by-zero kernel panic when calculating clock
dividers later in the probe sequence.
Fix this by redefining DAVINCI_I2C_DEFAULT_BUS_FREQ in Hz (100000)
to match the expected device tree property unit, allowing the existing
division logic to work correctly for both cases.
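The truncation is easy to reproduce in isolation. The following is a minimal sketch, not the driver code: the helper `ex_bus_freq_khz()` and both constants are hypothetical names, and the only assumption taken from the text is that the probe path divides a value in Hz by 1000.

```c
/* Hypothetical stand-in for the probe-time conversion from Hz to kHz. */
static unsigned int ex_bus_freq_khz(unsigned int freq_hz)
{
	return freq_hz / 1000;	/* integer division truncates toward zero */
}

/* Old fallback (100, already in kHz) versus the fixed one (100000 Hz). */
static const unsigned int ex_old_default = 100;
static const unsigned int ex_new_default = 100000;
```

With the old fallback the result is 0, which is the value that later ends up as a divisor; with the new fallback the same division yields 100.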
Ricardo Robaina [Wed, 13 May 2026 21:47:59 +0000 (18:47 -0300)]
audit: fix removal of dangling executable rules
When an audited executable is deleted from the disk, its dentry
becomes negative. Any later attempt to delete the associated audit
rule will lead to audit_alloc_mark() encountering this negative
dentry and immediately aborting, returning -ENOENT.
This early abort prevents the subsystem from allocating the temporary
fsnotify mark needed to construct the search key, meaning the kernel
cannot find the existing rule in its own lists to delete it. This
leaves a dangling rule in memory, resulting in the following error
while attempting to delete the rule:
# ./audit-dupe-exe-deadlock.sh
No rules
Error deleting rule (No such file or directory)
There was an error while processing parameters
# auditctl -l
-a always,exit -S all -F exe=/tmp/file -F path=/tmp/file -F key=dr
# auditctl -D
Error deleting rule (No such file or directory)
There was an error while processing parameters
This patch fixes this issue by removing the d_really_is_negative()
check. By doing so, a dummy mark can be successfully generated for
the deleted path, which allows the audit subsystem to properly match
and flush the dangling rule.
Cc: stable@kernel.org Fixes: 76a53de6f7ff ("VFS/audit: introduce kern_path_parent() for audit") Acked-by: Waiman Long <longman@redhat.com> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Ricardo Robaina <rrobaina@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
Vishal Annapurve [Fri, 22 May 2026 15:15:34 +0000 (15:15 +0000)]
KVM: x86: Treat KVM's virtual PMU as disabled for TDX VMs
Introduce a "protected PMU" concept, and use it to disable KVM's virtual
PMU for TDX VMs, as the PMU state for TDX VMs is virtualized by the TDX
Module[1], i.e. _can't_ be emulated/virtualized by KVM, and KVM doesn't yet
support enabling/exposing PMU functionality for/to TDX VMs. For now,
simply treat the PMU as disabled, as it's not clear what all needs to be
changed, e.g. KVM needs to do at least:
1) Configure TD_PARAMS to allow guests to use performance monitoring.
2) Restrict the TD to a subset of the PEBS counters if supported.
3) Limit the TD to setting up certain perfmon events using basic/enhanced
event filtering.
Explicitly disallow enabling the PMU via KVM_CAP_PMU_CAPABILITY for VMs
with a protected PMU to prevent userspace from circumventing KVM's
protections.
Jani Nikula [Wed, 13 May 2026 07:58:40 +0000 (10:58 +0300)]
drm/i915/display: stop passing i to for_each_pipe_crtc_modeset_{enable, disable}()
Refactor for_each_pipe_crtc_modeset_{enable,disable}() and their
underlying for_each_crtc_in_masks{,_reverse}() helpers to utilize
__UNIQUE_ID() to avoid having to pass the for loop variable to them.
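The general technique can be sketched outside the kernel. This is an illustration only, not the i915 macros: every `ex_`/`EX_` name is hypothetical, and it assumes a compiler that provides `__COUNTER__` (GCC and Clang do). The unique name is generated once in the outer macro and passed down, so the inner macro uses the same hidden identifier throughout.

```c
/* Build a fresh identifier per expansion; __COUNTER__ is a compiler extension. */
#define EX_CONCAT_(a, b) a##b
#define EX_CONCAT(a, b) EX_CONCAT_(a, b)
#define EX_UNIQUE_ID(prefix) EX_CONCAT(prefix, __COUNTER__)

/* Inner helper: "it" is the hidden loop variable chosen by the wrapper. */
#define ex_for_each_set_bit_it(bit, mask, it)				\
	for (unsigned int it = 0; it < 32 && ((bit) = it, 1); it++)	\
		if (!((mask) & (1u << it))) {} else

/* Public form: the caller no longer passes a loop variable. */
#define ex_for_each_set_bit(bit, mask)					\
	ex_for_each_set_bit_it(bit, mask, EX_UNIQUE_ID(ex_i_))

/* Example user: sum the indices of all set bits in mask. */
static unsigned int ex_sum_set_bits(unsigned int mask)
{
	unsigned int bit, sum = 0;

	ex_for_each_set_bit(bit, mask)
		sum += bit;
	return sum;
}
```

Because each expansion gets its own identifier, nested uses of the macro do not shadow one another's iterators.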
Jani Nikula [Wed, 13 May 2026 07:58:38 +0000 (10:58 +0300)]
drm/i915/display: pass struct intel_display to all for_each_intel_crtc*() macros
Now that the for_each_intel_crtc*() iterator macros primarily use
display->pipe_list for iteration, it's more convenient to pass struct
intel_display to them directly instead of struct drm_device. Make it so.
Jani Nikula [Wed, 13 May 2026 07:58:37 +0000 (10:58 +0300)]
drm/i915/display: always pass display->drm to for_each_intel_crtc*()
In preparation for always passing struct intel_display to
for_each_intel_crtc*() family of iterators, start off by unifying their
usage to always having struct intel_display *display around, and passing
display->drm to them.
Jani Nikula [Wed, 13 May 2026 07:58:36 +0000 (10:58 +0300)]
drm/i915/display: switch from drm_for_each_crtc() to for_each_intel_crtc()
intel_has_pending_fb_unpin() has the last direct user of
drm_for_each_crtc() in i915. Switch to for_each_intel_crtc() to ensure
pipe order iteration in all cases.
Jani Nikula [Mon, 25 May 2026 11:05:53 +0000 (14:05 +0300)]
drm/{i915, xe}: move xe_display_flush_cleanup_work() to i915 display
xe_display_flush_cleanup_work() is a bit of an oddball function in xe
display code. There shouldn't be anything this specific or xe
specific. While I'm not sure what the correct refactor for the function
should be, move it to shared display code for starters, next to the
eerily similar but slightly different intel_has_pending_fb_unpin() that
is only called from i915 core.
The main goal here is to unblock some refactors on
for_each_intel_crtc().
Kevin Cheng [Fri, 22 May 2026 23:27:01 +0000 (16:27 -0700)]
KVM: selftests: Add nested page fault injection test
Add a test that exercises nested page fault injection during L2
execution. L2 executes I/O string instructions (OUTSB/INSB) that access
memory restricted in L1's nested page tables (NPT/EPT), triggering a
nested page fault that L0 must inject to L1.
The test supports both AMD SVM (NPF) and Intel VMX (EPT violation) and
verifies that:
- The exit reason is an NPF/EPT violation
- The access type and permission bits are correct
- The faulting GPA is correct
Four test cases are implemented:
- Unmap the final data page (final translation fault, OUTSB read)
- Unmap a PT page (page walk fault, OUTSB read)
- Write-protect the final data page (protection violation, INSB write)
- Write-protect a PT page (protection violation on A/D update, OUTSB
read)
When injecting an EPT Violation into L2 in response to a fault detected
while emulating an L2 GVA access, synthesize the GVA_IS_VALID and
GVA_TRANSLATED bits using information provided by the walker, instead of
pulling the bits from vmcs02.EXIT_QUALIFICATION. The information in
vmcs02.EXIT_QUALIFICATION is valid/correct if and only if the fault being
injected into L1 is the direct result of an EPT Violation VM-Exit from L2.
E.g. if KVM is emulating an I/O instruction and the memory operand's
translation through L1's EPT fails, using vmcs02.EXIT_QUALIFICATION is
wrong as the semantics for EXIT_QUALIFICATION would be for an I/O exit,
not an EPT Violation exit.
Opportunistically clean up the formatting for creating the mask of bits
to pull from vmcs02.EXIT_QUALIFICATION.
Kevin Cheng [Fri, 22 May 2026 23:26:59 +0000 (16:26 -0700)]
KVM: SVM: Fix nested NPF injection of PFERR_GUEST_{PAGE,FINAL}_MASK bits
Fix KVM's generation of PFERR_GUEST_{PAGE,FINAL}_MASK bits when injecting a
Nested Page Fault into L1. Currently, KVM blindly stuffs GUEST_FINAL into
L1, which is blatantly wrong given that KVM obviously generates NPFs for
page table accesses.
There are two paths that trigger NPF injection: hardware NPF exits (from
L2) and emulation-triggered faults, i.e. when KVM detects a NPF as part of
emulating an L2 GVA access. For the hardware case, use the bits verbatim
from the VMCB, as KVM is simply forwarding a NPF to L1. For the emulation
case, propagate the GUEST_{PAGE,FINAL} bits from the access field (which
were recently added for MBEC+GMET support).
To differentiate between the two cases, add "hardware_nested_page_fault"
to "struct x86_exception", and set it when injecting a NPF in response to
an NPF exit from L2.
To help guard against future goofs, assert that exactly one of GUEST_PAGE
or GUEST_FINAL is set when injecting a NPF. Unlike VMX, there are no
(known) cases where hardware doesn't set either bit, and KVM should always
set one or the other when emulating a GVA access.
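The "exactly one bit" invariant can be expressed as a simple predicate. This is a sketch under stated assumptions, not KVM's assertion: the `EX_` names are hypothetical, and the bit positions (32 for GUEST_FINAL, 33 for GUEST_PAGE) follow the values quoted elsewhere in this log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirrors of the two NPF information bits. */
#define EX_GUEST_FINAL (1ULL << 32)
#define EX_GUEST_PAGE  (1ULL << 33)

/* True only when exactly one of the two bits is set. */
static bool ex_exactly_one_guest_bit(uint64_t error_code)
{
	bool final = error_code & EX_GUEST_FINAL;
	bool page = error_code & EX_GUEST_PAGE;

	return final != page;
}
```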
nvme-multipath: enable PCI P2PDMA for multipath devices
NVMe multipath does not expose BLK_FEAT_PCI_P2PDMA on the head disk
even when all underlying controllers support it.
Set BLK_FEAT_PCI_P2PDMA unconditionally in nvme_mpath_alloc_disk()
alongside the other features. nvme_update_ns_info_block() already
calls queue_limits_stack_bdev() to stack each path's limits onto the
head disk, which routes through blk_stack_limits(). The core now
clears BLK_FEAT_PCI_P2PDMA automatically if any path (e.g., FC) does
not support it, consistent with how BLK_FEAT_NOWAIT and BLK_FEAT_POLL
are handled.
md: propagate BLK_FEAT_PCI_P2PDMA from member devices to RAID device
MD RAID does not propagate BLK_FEAT_PCI_P2PDMA from member devices to
the RAID device, preventing peer-to-peer DMA through the RAID layer even
when all underlying devices support it.
Enable BLK_FEAT_PCI_P2PDMA unconditionally in raid0, raid1 and raid10
personalities during queue limits setup. blk_stack_limits() clears it
automatically if any member device lacks support, consistent with how
BLK_FEAT_NOWAIT and BLK_FEAT_POLL are handled in the block core.
Parity RAID personalities (raid4/5/6) are excluded because they require
CPU access to data pages for parity computation, which is incompatible
with P2P mappings.
Tested with RAID0/1/10 arrays containing multiple NVMe devices with
P2PDMA support, confirming that peer-to-peer transfers work correctly
through the RAID layer.
block: clear BLK_FEAT_PCI_P2PDMA in blk_stack_limits() for non-supporting devices
BLK_FEAT_NOWAIT and BLK_FEAT_POLL are cleared in blk_stack_limits()
when an underlying device does not support them. Apply the same
treatment to BLK_FEAT_PCI_P2PDMA: stacking drivers set it
unconditionally and rely on the core to clear it whenever a
non-supporting member device is stacked.
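The "set unconditionally, clear on stacking" pattern can be sketched as follows. This is an illustrative model rather than the block layer code: the flag values and `ex_stack_features()` are hypothetical, and the real blk_stack_limits() handles many more limits.

```c
#include <stdint.h>

/* Hypothetical feature bits standing in for the BLK_FEAT_* flags. */
#define EX_FEAT_NOWAIT     (1u << 0)
#define EX_FEAT_POLL       (1u << 1)
#define EX_FEAT_PCI_P2PDMA (1u << 2)

/* Stack one member device's features onto the top device's features. */
static uint32_t ex_stack_features(uint32_t top, uint32_t bottom)
{
	const uint32_t inherit = EX_FEAT_NOWAIT | EX_FEAT_POLL |
				 EX_FEAT_PCI_P2PDMA;

	/* Clear every inheritable feature that the member device lacks. */
	return top & ~(inherit & ~bottom);
}
```

A stacking driver would start with all three bits set and call the helper once per member; a single member without P2PDMA support is enough to clear the bit on the top device.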
KVM: x86: Tell ->inject_page_fault() whether or a fault came from hardware
When injecting a page fault (including nested TDP faults into L1), tell the
injection routine whether or not the fault originated in hardware, i.e. if
KVM is effectively forwarding a fault it intercepted. For nested TDP fault
injection, KVM needs to grab PAGE_WALK vs. GUEST_FINAL information from the
VMCB/VMCS, _if_ the fault originated in hardware.
Note, simply checking whether or not the original exit was due to a #NPF or
EPT Violation isn't sufficient/correct, as the fault being synthesized for
L1 may or may not be the "same" fault that triggered a VM-Exit from L2.
E.g. if access to emulated MMIO in L2 hits a !PRESENT fault (EPT Violation
or #NPF), e.g. because MMIO caching is disabled or it's the first time the
GPA has been accessed by L2, then KVM will enter the emulator. If
emulating the MMIO instruction then hits a nested TDP fault, e.g. because
L2 was accessing MMIO with a MOVSQ (memory-to-memory move), or because L1
has since unmapped the code stream, then the TDP fault synthesized to L1
will not be the same emulated fault that triggered the VM-Exit.
No functional change intended (nothing uses the new param, yet...).
Yan Zhao [Thu, 30 Apr 2026 01:50:01 +0000 (09:50 +0800)]
x86/tdx: Drop exported function tdx_quirk_reset_page()
KVM invokes tdx_quirk_reset_page() to reset TDX control pages (including
S-EPT pages, TDR page, etc.), as all those pages are allocated by KVM TDX
and thus always have struct page.
However, it's also reasonable for KVM to reset those TDX control pages via
tdx_quirk_reset_paddr() directly, eliminating the need to export two
parallel APIs. Keeping tdx_quirk_reset_page() as a one-line helper in the
header file is also unnecessary.
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Acked-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Ackerley Tng <ackerleytng@google.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20260430015001.24242-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
x86/tdx: Use PFN directly for unmapping guest private memory
Remove struct page assumptions/constraints in APIs for unmapping guest
private memory and have them take physical address directly.
Having core TDX make assumptions that guest private memory must be backed
by struct page (and/or folio) will create subtle dependencies on how
KVM/guest_memfd allocates/manages memory (e.g., whether it uses memory
allocated from core MM, if the memory is refcounted, or if the folio is
split) that are easily avoided [1].
KVM's MMUs work with PFNs. This is very much an intentional design choice.
It ensures that the KVM MMUs remain flexible and are not too tightly tied
to the regular CPU MMUs and the kernel code around them. Using
"struct page" for TDX guest memory is not a good fit anywhere near the KVM
MMU code [2].
Therefore, for unmapping guest private memory: export
tdx_quirk_reset_paddr() for direct KVM invocation, and convert the SEAMCALL
wrapper API tdh_phymem_page_wbinvd_hkid() to take PFN as input (thus
updating mk_keyed_paddr() and tdh_phymem_page_wbinvd_tdr()).
Intentionally have KVM pass PAGE_SIZE (rather than KVM_HPAGE_SIZE(level))
to tdx_quirk_reset_paddr() in tdx_sept_remove_private_spte() to avoid
mixing in huge page changes. The KVM_BUG_ON() check for !PG_LEVEL_4K in
tdx_sept_remove_private_spte() justifies using PAGE_SIZE.
Do not convert tdx_reclaim_page() to use PFN as input since it currently
does not remove guest private memory.
Use "kvm_pfn_t pfn" for type safety. Using this KVM type is appropriate
since APIs tdh_phymem_page_wbinvd_hkid() and tdx_quirk_reset_paddr() are
exported to KVM only.
[Yan: Use kvm_pfn_t, exclude tdx_reclaim_page(), use tdx_quirk_reset_paddr()]
x86/tdx: Use PFN directly for mapping guest private memory
Remove struct page assumptions/constraints in the SEAMCALL wrapper APIs for
mapping guest private memory and have them take PFN directly.
Having core TDX make assumptions that guest private memory must be backed
by struct page (and/or folio) will create subtle dependencies on how
KVM/guest_memfd allocates/manages memory (e.g., whether it uses memory
allocated from core MM, if the memory is refcounted, or if the folio is
split) that are easily avoided [1].
KVM's MMUs work with PFNs. This is very much an intentional design choice.
It ensures that the KVM MMUs remain flexible and are not too tied to the
regular CPU MMUs and the kernel code around them. Using 'struct page' for
TDX guest memory is not a good fit anywhere near the KVM MMU code [2].
Use "kvm_pfn_t pfn" for type safety. Using this KVM type is appropriate
since APIs tdh_mem_page_add() and tdh_mem_page_aug() are exported to KVM
only.
Keith Busch [Tue, 26 May 2026 15:35:31 +0000 (08:35 -0700)]
blk-mq: reinsert cached request to the list
A previous commit removed an optimization out of caution for a scenario
that turns out not to be real: all the "queue_exit" goto's are safe to
reinsert the request into the cached_rq's plug list as they are either
from a non-blocking path, or a successful merge that already holds the
queue reference. This optimization is most needed for small sequential
workloads that successfully merge into larger requests.
Fixes: dc278e9bf2b9 ("blk-mq: pop cached request if it is usable") Suggested-by: Ming Lei <tom.leiming@gmail.com> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/20260526153531.2365935-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Li Ming [Wed, 20 May 2026 12:14:57 +0000 (20:14 +0800)]
cxl/test: Update mock dev array before calling platform_device_add()
The CXL test environment sometimes hits the following error:
cxl_mem mem9: endpoint7 failed probe
All mock memdevs are platform firmware devices added by the cxl_test module,
and the cxl_test module also provides a platform device driver that creates
a memdev device in the CXL subsystem for each of them. The cxl_test module
uses the cxl_rcd/mem_single/mem arrays to store the different types of mock
memdevs. CXL drivers call the registered mock functions for a memdev by
checking whether it is in one of these arrays.
When the cxl_test module adds these mock memdevs, it always calls
platform_device_add() before adding them to a suitable mock memdev
array. However, there is a small window in which CXL drivers can run the
mock lookup for an added memdev before it has been added to a mock memdev
array. In the above case, the cxl endpoint driver concludes that the added
memdev is not a mock memdev and calls devm_cxl_endpoint_decoders_setup()
for it rather than mock_endpoint_decoders_setup().
An appropriate solution is to add a new mock device to a mock
device array before calling platform_device_add() for it. This
guarantees that the new mock device is visible to the CXL subsystem.
This patch introduces a new helper called cxl_mock_platform_device_add()
to handle the issue, and uses it for all mock device additions.
Fixes: 3a2b97b3210b ("cxl/test: Improve init-order fidelity relative to real-world systems") Signed-off-by: Li Ming <ming.li@zohomail.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Link: https://patch.msgid.link/20260520121457.234404-1-ming.li@zohomail.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
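The ordering fix can be modelled in a few lines. This is a simplified sketch, not the cxl_test code: all `ex_` names are hypothetical, `ex_device_add()` merely stands in for platform_device_add() triggering a probe, and error rollback of the array entry is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

#define EX_MAX_MOCK 4

static const void *ex_mock_devs[EX_MAX_MOCK];
static size_t ex_nr_mock;
static bool ex_probe_saw_mock;

/* Lookup that a driver would use to decide whether a device is a mock. */
static bool ex_is_mock(const void *dev)
{
	for (size_t i = 0; i < ex_nr_mock; i++)
		if (ex_mock_devs[i] == dev)
			return true;
	return false;
}

/* Stand-in for platform_device_add(): probing may run before it returns. */
static void ex_device_add(const void *dev)
{
	ex_probe_saw_mock = ex_is_mock(dev);
}

static int ex_mock_device_add(const void *dev)
{
	if (ex_nr_mock >= EX_MAX_MOCK)
		return -1;
	ex_mock_devs[ex_nr_mock++] = dev;	/* publish to the array first */
	ex_device_add(dev);			/* only then allow probing */
	return 0;
}
```

Swapping the two statements in `ex_mock_device_add()` reproduces the window described above: the probe stand-in would run its lookup before the array entry exists.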
Linus Torvalds [Tue, 26 May 2026 20:49:13 +0000 (13:49 -0700)]
Merge tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Regressions:
- Tighten bounds checking for sunrpc cache hash tables
- Don't report key material in the ftrace log
Stable fix:
- Fix lockd's implementation of the NLM TEST procedure"
* tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
lockd: fix TEST handling when not all permissions are available.
NFSD: Report whether fh_key was actually updated
sunrpc: prevent out-of-bounds read in __cache_seq_start()
Petr Pavlu [Fri, 27 Mar 2026 07:59:03 +0000 (08:59 +0100)]
module, riscv: force sh_addr=0 for arch-specific sections
When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
compress worse, and may cause tools to misbehave [1].
Force sh_addr=0 for all riscv-specific module sections.
Petr Pavlu [Fri, 27 Mar 2026 07:59:02 +0000 (08:59 +0100)]
module, m68k: force sh_addr=0 for arch-specific sections
When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
compress worse, and may cause tools to misbehave [1].
Force sh_addr=0 for all m68k-specific module sections.
Petr Pavlu [Fri, 27 Mar 2026 07:59:01 +0000 (08:59 +0100)]
module, arm64: force sh_addr=0 for arch-specific sections
When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
compress worse, and may cause tools to misbehave [1].
Force sh_addr=0 for all arm64-specific module sections.
Petr Pavlu [Fri, 27 Mar 2026 07:59:00 +0000 (08:59 +0100)]
module, arm: force sh_addr=0 for arch-specific sections
When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
compress worse, and may cause tools to misbehave [1].
Force sh_addr=0 for all arm-specific module sections.
Linus Torvalds [Tue, 26 May 2026 20:37:26 +0000 (13:37 -0700)]
Merge tag 'linux_kselftest-kunit-fixes-7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kunit fix from Shuah Khan:
"Fix a use-after-free in kunit debugfs when using kunit.filter when the
executor frees dynamically allocated resources after running boot-time
tests. This resulted in fatal hardware exception due to invalidation
of capability flags on the reclaimed memory on some architectures such
as CHERI RISC-V that support the feature, and silent memory corruption
on others.
The fix for this couples the lifetime of the filtered suite memory
allocation to the lifetime of the kunit subsystem and its associated
VFS nodes. Ownership of the boot-time suite_set is now transferred to
a global tracker ('kunit_boot_suites'), and the memory is cleanly
released in kunit_exit() during module teardown"
* tag 'linux_kselftest-kunit-fixes-7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: fix use-after-free in debugfs when using kunit.filter
Borislav Petkov [Wed, 13 May 2026 20:06:01 +0000 (22:06 +0200)]
x86/microcode: Do not access MSR_IA32_PLATFORM_ID when running as a guest
Patch in Fixes: causes the usual:
unchecked MSR access error: RDMSR from 0x17 at ... (intel_get_platform_id)
Call Trace:
early_init_intel
early_cpu_init
setup_arch
_printk
start_kernel
x86_64_start_reservations
x86_64_start_kernel
common_startup_64
because the kernel is booted in a guest.
In order to avoid it, this MSR access needs to be prevented when running
virtualized. That is usually done by checking X86_FEATURE_HYPERVISOR but
for this particular case it is too early yet.
The platform ID needs to be read as early as when microcode is loaded on
the BSP:
and by that time, CPUID leafs haven't been parsed yet.
The microcode loader already has logic to check early whether the kernel
is running virtualized so make that globally available to arch/x86/. The
query whether running virtualized is getting more and more prominent in
recent times so might as well make it an arch-global var which the rest
of the code can use.
Fixes: d8630b67ca1ed ("x86/cpu: Add platform ID to CPU info structure") Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Tested-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/all/20260430020953.1405535-1-binbin.wu@linux.intel.com
Mostafa Saleh [Tue, 26 May 2026 12:53:17 +0000 (12:53 +0000)]
irqchip/gic-v4: Don't advertise VLPIs if no ITS is probed
When accidentally setting “kvm-arm.vgic_v4_enable=1” on a system that has
no MSI controller device tree node and GICv4, it results in a panic as
“gic_domain” is NULL and the kernel attempts to access it.
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
Mem abort info:
ESR = 0x0000000096000006
Tvrtko Ursulin [Sat, 23 May 2026 10:34:18 +0000 (11:34 +0100)]
drm/xe: Assign queue name in time for drm_sched_init
Currently the queue name is only assigned after the drm scheduler instance
has been created. This loses information with all logging or debug
workqueue facilities so let's reorder things a bit so the name gets
assigned in time.
To be able to assign a GuC ID early we split the allocation into
reservation and publish phases.
First, with the submission state lock held, we reserve the ID in the GuC
ID manager, which serves as an authoritative source of truth. Then we can
drop the lock and reserve entries in the exec queue lookup XArray. This
can be lockless since the NULL entries are invisible both to the kernel
and userspace. Only after the queue has been fully created we replace the
reserved entries with the queue pointer, which can be done locklessly for
single width queues.
Kevin Cheng [Fri, 22 May 2026 23:26:57 +0000 (16:26 -0700)]
KVM: x86: Widen x86_exception's error_code to 64 bits
Widen the error_code field in struct x86_exception from u16 to u64 to
accommodate AMD's NPF error code, which defines information bits above
bit 31, e.g. PFERR_GUEST_FINAL_MASK (bit 32), and PFERR_GUEST_PAGE_MASK
(bit 33).
Retain the u16 type for the local errcode variable in walk_addr_generic
as the walker synthesizes conventional #PF error codes that are
architecturally limited to bits 15:0.
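Why the narrow field is insufficient can be shown with a small sketch. It is illustrative only: the `ex_`/`EX_` names are hypothetical stand-ins, and the bit position is the one stated above (bit 32).

```c
#include <stdint.h>

/* Hypothetical mirror of the bit-32 flag named in the text. */
#define EX_PFERR_GUEST_FINAL_MASK (1ULL << 32)

/* Storing through a 16-bit field keeps only bits 15:0. */
static uint64_t ex_store_u16(uint64_t error_code)
{
	uint16_t field = (uint16_t)error_code;

	return field;
}

/* A 64-bit field preserves the high information bits. */
static uint64_t ex_store_u64(uint64_t error_code)
{
	uint64_t field = error_code;

	return field;
}
```

Conventional #PF bits in the low 16 bits survive either way, which is consistent with the walker keeping a u16 local.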
Piotr Zarycki [Sat, 23 May 2026 11:18:57 +0000 (13:18 +0200)]
KVM: selftests: hyperv_features: test write of 1 to HV_X64_MSR_RESET
Writing 1 to HV_X64_MSR_RESET triggers a real vCPU reset; the test
was writing 0 because the host loop was not prepared to handle the
resulting KVM_EXIT_SYSTEM_EVENT. Add the missing handling and write
1 to actually exercise the reset path.
KVM: selftests: Randomize dirty_log_test's delay before reaping the bitmap
In the dirty log test, randomize the delay before the initial call to get
the dirty log bitmap for a given iteration, so that the amount of memory
dirtied by the guest varies from iteration to iteration, and so that the
user can effectively control the duration (by increasing the interval).
Always waiting 1ms effectively hides a KVM RISC-V bug as the test reaps the
dirty bitmap before the guest has a chance to trigger the problematic flow
in KVM.
KVM: selftests: Add and use kvm_free_fd() to harden against fd goofs
Add a kvm_free_fd() macro to close and invalidate a file descriptor, and
use it through the core infrastructure to harden against goofs where a
selftest attempts to reuse a closed file descriptor.
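A close-and-invalidate macro might look like the sketch below. The actual kvm_free_fd() definition is not shown in this log, so `ex_free_fd()` is a hypothetical approximation: it assumes that "invalidate" means storing -1, and it ignores the return value of close().

```c
#include <unistd.h>

/*
 * Hypothetical approximation: close the descriptor if it looks valid,
 * then overwrite the variable so that any later reuse fails loudly
 * instead of touching an unrelated, recycled descriptor.
 */
#define ex_free_fd(fd)			\
do {					\
	if ((fd) >= 0)			\
		close(fd);		\
	(fd) = -1;			\
} while (0)
```

The argument must be an lvalue of a signed type, since the macro both tests it against zero and assigns to it.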
KVM: selftests: Cast guest_memfd fd to a signed int when checking for >= 0
When conditionally closing a memory region's guest_memfd file descriptor,
cast the field to a signed int so that negative values are correctly
detected. Because selftests reuse "struct kvm_userspace_memory_region2"
instead of providing custom storage, they pick up the kernel uAPI's __u32
definition of the file descriptor, not the more common "int" definition,
e.g. that's used for userspace_mem_region.fd.
Fixes: bb2968ad6c33 ("KVM: selftests: Add support for creating private memslots") Reported-by: Bibo Mao <maobibo@loongson.cn> Closes: https://lore.kernel.org/all/20260508015013.4108345-1-maobibo@loongson.cn Reviewed-by: Bibo Mao <maobibo@loongson.cn> Reviewed-by: Ackerley Tng <ackerleytng@google.com> Link: https://patch.msgid.link/20260522171535.3525890-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
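The pitfall can be illustrated in a few lines. This is a sketch, not the selftest code: `ex_fd_is_open()` is a hypothetical helper, and `uint32_t` stands in for the uAPI's __u32. Converting an out-of-range unsigned value to `int` is implementation-defined in older C standards; the sketch assumes the two's-complement behaviour that GCC and Clang provide.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Comparing the unsigned field directly against zero would always be
 * true, so a stored -1 (0xffffffff) would look like an open descriptor.
 * Casting to a signed int first makes the sentinel negative again.
 */
static bool ex_fd_is_open(uint32_t fd_field)
{
	return (int)fd_field >= 0;
}
```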
Zongyao Chen [Fri, 22 May 2026 17:21:50 +0000 (10:21 -0700)]
KVM: selftests: Test guest_memfd binding overlap without GPA overlap
The guest_memfd binding overlap test recreates the deleted slot with GPA
ranges that overlap the still-live slot. KVM rejects those attempts from
the generic memslot overlap check before reaching kvm_gmem_bind(), so the
test can pass even if guest_memfd binding overlap detection is broken.
Recreate the slot at its original, non-overlapping GPA and use guest_memfd
offsets that overlap the front and back halves of the other slot's binding.
Expand the guest_memfd so the back-half case remains within the file size.
Zongyao Chen [Fri, 22 May 2026 17:21:49 +0000 (10:21 -0700)]
KVM: guest_memfd: Return -EEXIST for overlapping bindings
KVM_SET_USER_MEMORY_REGION2 rejects guest_memfd ranges that overlap an
existing binding, but kvm_gmem_bind() currently reports the failure through
its generic -EINVAL path. That makes binding conflicts indistinguishable
from malformed guest_memfd parameters.
Return -EEXIST when the target guest_memfd range is already bound, matching
the errno used for overlapping GPA memslots and making the two types of
range conflicts report the same class of error to userspace.
Note, returning -EINVAL was definitely not intentional, as guest_memfd
support was accompanied by a selftest to verify that attempting to create
overlapping bindings fails with -EEXIST. Except the selftest was also
flawed in that it unintentionally overlapped memslot GPAs, and so failed
on KVM's common memslot checks before reaching guest_memfd.
Fixes: a7800aa80ea4 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory") Signed-off-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com> Reviewed-by: Ackerley Tng <ackerleytng@google.com> Tested-by: Ackerley Tng <ackerleytng@google.com>
[sean: call out that the original intent was to return -EEXIST] Link: https://patch.msgid.link/20260522172151.3530267-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
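The distinction between the two error classes can be sketched with a toy model. This is not kvm_gmem_bind(): the single-slot `struct ex_binding` and `ex_bind()` are hypothetical, and real guest_memfd bindings track many ranges per file.

```c
#include <errno.h>
#include <stdint.h>

/* Toy model: at most one bound range, [start, end), per file. */
struct ex_binding {
	uint64_t start;
	uint64_t end;
	int used;
};

static int ex_bind(struct ex_binding *b, uint64_t offset, uint64_t size)
{
	uint64_t end = offset + size;

	/* Malformed parameters: empty range or arithmetic wraparound. */
	if (!size || end < offset)
		return -EINVAL;

	/* Well-formed but conflicting with an existing binding. */
	if (b->used && offset < b->end && b->start < end)
		return -EEXIST;

	b->start = offset;
	b->end = end;
	b->used = 1;
	return 0;
}
```

With distinct errnos, a caller can tell "my parameters are wrong" apart from "that range is already taken".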
Thomas Weißschuh [Mon, 25 May 2026 08:27:16 +0000 (10:27 +0200)]
selftests/nolibc: use mutable buffer for execve() argv string
The existing code would trigger a warning under -Wwrite-strings which is
about to be enabled. Use a mutable buffer instead. While in this
specific case, casting away the 'const' would be fine, let's avoid casts
which are not really necessary.
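The pattern can be sketched as follows. The names and the path are hypothetical and are not taken from the nolibc test; the point is only that the argv entry refers to a writable array rather than to a string literal, so no cast is needed under -Wwrite-strings.

```c
#include <stddef.h>

/* A mutable array initialized from a literal, not a pointer to the literal. */
static char ex_arg0[] = "/bin/true";

/* execve() takes char *const argv[]; this needs no cast and no const drop. */
static char *const ex_argv[] = { ex_arg0, NULL };
```

By contrast, `char *const argv[] = { "/bin/true", NULL };` is what triggers the warning, because -Wwrite-strings gives string literals the type `const char[]`.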
Since the QPIC-SPI-NAND flash controller present in ipq5210 is the same
as the one found in ipq9574, document the ipq5210 compatible with
ipq9574 as the fallback.
Aravind Anilraj [Sun, 29 Mar 2026 07:06:42 +0000 (03:06 -0400)]
thermal: intel: int340x: Check return value of ptc_create_groups()
proc_thermal_ptc_add() ignores the return value of ptc_create_groups()
causing the driver to silently continue even if sysfs group creation
fails.
The thermal control interface would be unavailable with no indication
of failure.
Check the return value and on failure clean up any sysfs groups that
were successfully created before the error, then propagate the error to
the caller which already handles it correctly via goto err_rem_rapl.
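The check-and-unwind pattern described here can be sketched generically. This is not the int340x driver code: every `ex_` name is hypothetical, the "groups" are plain flags, and the unwinding is folded into one function for brevity.

```c
#include <errno.h>

#define EX_NR_GROUPS 3

static int ex_created[EX_NR_GROUPS];
static int ex_fail_at = -1;	/* index at which creation fails; -1 = never */

static int ex_create_group(int i)
{
	if (i == ex_fail_at)
		return -ENOMEM;
	ex_created[i] = 1;
	return 0;
}

static void ex_remove_group(int i)
{
	ex_created[i] = 0;
}

/* Create all groups, or none: undo earlier successes on any failure. */
static int ex_create_groups(void)
{
	int i, ret;

	for (i = 0; i < EX_NR_GROUPS; i++) {
		ret = ex_create_group(i);
		if (ret)
			goto err_unwind;
	}
	return 0;

err_unwind:
	while (--i >= 0)
		ex_remove_group(i);
	return ret;
}
```

The caller then only has to test the return value; it never sees a partially created set.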
Aravind Anilraj [Sun, 29 Mar 2026 07:06:41 +0000 (03:06 -0400)]
thermal: intel: int340x: Fix potential shift overflow in ptc_mmio_write()
The value parameter is u32 but is shifted into a u64 register value
without casting first. If the shift pushes bits beyond bit 31, they
are lost. Cast value to u64 before shifting to ensure all bits are
preserved.
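The difference between the two forms can be sketched in plain C. The helper names are hypothetical and this is not ptc_mmio_write(); `uint32_t`/`uint64_t` stand in for the kernel's u32/u64, and the shift amount is assumed to be below 32 so that the narrow form is well defined.

```c
#include <stdint.h>

/* Shift performed in 32 bits: bits moved past bit 31 are discarded. */
static uint64_t ex_shift_narrow(uint32_t value, unsigned int shift)
{
	return value << shift;
}

/* Widen first, then shift: the high bits survive. */
static uint64_t ex_shift_wide(uint32_t value, unsigned int shift)
{
	return (uint64_t)value << shift;
}
```

For example, shifting 0xff left by 28 keeps only 0xf0000000 in the narrow form, while the widened form yields 0xff0000000.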