Michael Tretter [Thu, 18 Dec 2025 09:23:49 +0000 (10:23 +0100)]
media: staging: imx-csi: move media_pipeline to video device
The imx-media driver has a single imx_media_device. Attaching the
media_pipeline to the imx_media_device prevents the execution of multiple
media pipelines on the device. This should be possible as long as the
media_pipelines don't use the same pads or pads that cannot be configured while
the other media pipeline is streaming.
Move the media_pipeline to the imx_media_video_dev to be able to construct
media pipelines per imx capture device.
If different media pipelines in the media device conflict, the validation
will fail. Thus, the pipeline will fail to start and signal an error to
user space.
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de>
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
Bitterblue Smith [Wed, 20 May 2026 14:44:35 +0000 (17:44 +0300)]
wifi: rtw89: usb: Support switching to USB 3 mode
The Realtek wifi 6/7 devices which support USB 3 are weird: when first
plugged in, they pretend to be USB 2. The driver needs to send some
commands to the device, which make it disappear and come back as a
USB 3 device.
Implement the required commands in rtw89.
Add a new function rtw89_usb_write32_quiet() to avoid the warnings
when writing to R_{AX,BE}_PAD_CTRL2. Even though the write succeeds,
usb_control_msg() returns -EPROTO, probably because the USB device
disappears immediately. This results in some confusing warnings in
the kernel log.
When a USB 3 device is plugged into a USB 2 port, rtw89 will try to
switch it to USB 3 mode only once. The device will disappear and come
back still in USB 2 mode, of course.
Tested with RTL8832AU, RTL8832BU, RTL8832CU, and RTL8912AU.
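The idea of a "quiet" register write can be sketched in userspace C as below. This is only an illustration under stated assumptions: fake_control_msg() is a hypothetical stand-in for usb_control_msg(), write32() and write32_quiet() are made-up names rather than the driver's functions, and treating only -EPROTO as expected is an assumption about how such a helper might behave.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

int warnings; /* number of warnings the sketch has logged */

/* Hypothetical stand-in for a USB control transfer: returns simulated_ret. */
int fake_control_msg(uint32_t addr, uint32_t val, int simulated_ret)
{
	(void)addr;
	(void)val;
	return simulated_ret;
}

/* Regular write: every failure is logged. */
int write32(uint32_t addr, uint32_t val, int simulated_ret)
{
	int ret = fake_control_msg(addr, val, simulated_ret);

	if (ret < 0) {
		warnings++;
		fprintf(stderr, "write32 0x%08x failed: %d\n",
			(unsigned int)addr, ret);
	}
	return ret;
}

/* Quiet write (assumed behaviour): -EPROTO is expected because the
 * device drops off the bus right after the write, so it is not logged. */
int write32_quiet(uint32_t addr, uint32_t val, int simulated_ret)
{
	int ret = fake_control_msg(addr, val, simulated_ret);

	if (ret == -EPROTO)
		return 0;
	if (ret < 0) {
		warnings++;
		fprintf(stderr, "write32_quiet 0x%08x failed: %d\n",
			(unsigned int)addr, ret);
	}
	return ret;
}
```

Other errors still surface through the quiet variant, so a genuinely failed write would remain visible in the log.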
Jacopo Mondi [Mon, 4 May 2026 12:43:14 +0000 (14:43 +0200)]
media: rcar-vin: Drop min_queued_buffers
The R-Car VIN driver already uses a scratch buffer to sustain capture
operations in absence of a frame buffer provided by userspace.
There is no reason to require 4 buffers queued at all times for the
driver to operate. Drop min_queued_buffers from the VIN driver to allow
single-frame capture operations.
Zong-Zhe Yang [Wed, 20 May 2026 12:38:23 +0000 (20:38 +0800)]
wifi: rtw89: 8922d: configure TX shape settings
By default, BB enables triangular spectrum by a series of register
settings. According to band and regulation, RF parameters determine whether
TX shape needs to be restricted or not. So now, clear the corresponding
settings when no restriction is needed.
Jani Nikula [Tue, 26 May 2026 12:55:59 +0000 (15:55 +0300)]
drm/i915/power: drop resume parameter from intel_display_power_init_hw()
intel_power_domains_resume() calling intel_display_power_init_hw() with
the resume parameter is an internal implementation detail. Hide it
inside intel_display_power.c, and provide a clean external interface
without the parameter.
Zong-Zhe Yang [Wed, 20 May 2026 12:38:21 +0000 (20:38 +0800)]
wifi: rtw89: Wi-Fi 7 configure TX power limit for large MRU
Support of Large MRU (Multiple Resource Unit) starts from RTL8922D_CID7090,
i.e. RTL8922A and the RTL8922D-VS variant do not support it. It comes with new
corresponding control registers. So, configure them.
Ping-Ke Shih [Wed, 20 May 2026 12:38:18 +0000 (20:38 +0800)]
wifi: rtw89: 8922d: refactor digital power compensation to support new format
Because base settings of digital power compensation can be shared across
all bands, the settings are divided into two parts -- base and individual
values per band. Refactor the code so it can be reused with the new format.
Zong-Zhe Yang [Wed, 20 May 2026 12:38:17 +0000 (20:38 +0800)]
wifi: rtw89: fw: load TX power track element according to AID
RF parameters have different TX power track tables for different AIDs.
FW elements may include multiple TX power track tables for different
AIDs. So, load the corresponding one.
ARM: omap1: enable real software node lookup of GPIOs on Nokia 770
Currently the board file for Nokia 770 creates dummy software nodes not
attached in any way to the actual GPIO controller devices and relies on the
fact that GPIOLIB matches the swnode's name against the GPIO chip's label
during software node lookup. This behavior is wrong and we want to remove it.
To that end, we need to first convert all existing users to creating
actual fwnode links.
Create real software nodes for GPIO controllers on OMAP16xx and
reference them from the software nodes in the nokia board file.
ARM: omap1: use platform_device_register_full() for GPIO devices on OMAP 16xx
Ahead of changes attaching GPIO controller's software nodes referenced
from the Nokia 770 board files to their target devices, switch the
method for registering the platform devices to the
platform_device_register_full() variant. This is done to leverage the
new swnode field of struct platform_device_info which automates the
software node's registration and assignment.
Dmitry Torokhov [Tue, 26 May 2026 16:40:37 +0000 (18:40 +0200)]
MIPS: alchemy: db1300: switch to static device properties
Convert "5way switch" gpio-keys device and smsc911x ethernet controller
to use static device properties instead of bespoke platform data
structures for configuration.
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
[Bartosz: use platform_device_info::swnode]
Tested-by: Manuel Lauss <manuel.lauss@gmail.com>
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Dmitry Torokhov [Tue, 26 May 2026 16:40:36 +0000 (18:40 +0200)]
MIPS: alchemy: gpr: switch to static device properties
Convert I2C-gpio device and GPIO-connected LEDs on GPR board to software
nodes/properties, so that support for platform data can be removed from
gpio-leds driver (which will rely purely on generic device properties
for configuration).
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
[Bartosz: use platform_device_info::swnode]
Tested-by: Manuel Lauss <manuel.lauss@gmail.com>
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Dmitry Torokhov [Tue, 26 May 2026 16:40:35 +0000 (18:40 +0200)]
MIPS: alchemy: db1000: use nodes attached to GPIO chips in properties
GPIO subsystem is switching the way it locates GPIO chip instances for
GPIO references in software nodes by doing identity matching instead of
matching on node names. Switch to using software nodes attached to gpio
chips instead of using freestanding software nodes.
Also stop supplying platform data for the spi-gpio controller since
spi-gpio driver can derive number of chipselect lines from device
properties.
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Tested-by: Manuel Lauss <manuel.lauss@gmail.com>
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Dmitry Torokhov [Tue, 26 May 2026 16:40:34 +0000 (18:40 +0200)]
MIPS: alchemy: mtx1: attach software nodes to GPIO chips
GPIO subsystem is switching the way it locates GPIO chip instances for
GPIO references in software nodes from matching on node names to
identity matching, which necessitates assigning firmware nodes
(software nodes) to GPIO chips.
Move the node definitions for alchemy-gpio1 and alchemy-gpio2 to
arch/mips/alchemy/common/gpiolib.c, register them there, and attach
them to gpio_chip instances. Adjust MTX1 board file to use these nodes.
Note that because nodes need to be registered before they can be used in
PROPERTY_ENTRY_GPIO() we have to do the registration at
postcore_initcall level, otherwise (due to the link order) MTX1 board
initialization code will run first.
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
[Bartosz: use platform_device_info::swnode]
Tested-by: Manuel Lauss <manuel.lauss@gmail.com>
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
MIPS: alchemy: provide visible function prototypes to board files
Board files under arch/mips/alchemy/ define functions called from
db1xxx.c but their prototypes are only in that .c file instead of being
declared in a common header. This causes several build warnings about
missing prototypes. Provide these prototypes in a new header and include
it where necessary.
Tested-by: Manuel Lauss <manuel.lauss@gmail.com>
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Pull in the platform.h header into platform.c to fix the following
warning:
arch/mips/alchemy/devboards/platform.c:68:12: warning: no previous prototype for ‘db1x_register_pcmcia_socket’ [-Wmissing-prototypes]
68 | int __init db1x_register_pcmcia_socket(phys_addr_t pcmcia_attr_start,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/mips/alchemy/devboards/platform.c:152:12: warning: no previous prototype for ‘db1x_register_norflash’ [-Wmissing-prototypes]
152 | int __init db1x_register_norflash(unsigned long size, int width,
| ^~~~~~~~~~~~~~~~~~~~~~
Tested-by: Manuel Lauss <manuel.lauss@gmail.com>
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Johan Hovold [Fri, 24 Apr 2026 10:28:47 +0000 (12:28 +0200)]
MIPS: ip22-gio: fix device reference leak in probe
The gio probe function needlessly takes a device reference which is
never released and therefore prevents unbound gio devices from being
freed.
Fixes: e84de0c61905 ("MIPS: GIO bus support for SGI IP22/28")
Cc: stable@vger.kernel.org # 3.3
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Johan Hovold [Fri, 24 Apr 2026 10:28:46 +0000 (12:28 +0200)]
MIPS: ip22-gio: fix gio device memory leak
The gio device release callback was never wired up so gio devices are
not freed when the last reference is dropped.
Fixes: e84de0c61905 ("MIPS: GIO bus support for SGI IP22/28")
Cc: stable@vger.kernel.org # 3.3
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
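Both ip22-gio leaks come down to how a reference-counted object is freed. The following userspace sketch illustrates the two failure modes; struct obj and the obj_*() helpers are hypothetical and only loosely modelled on the device reference counting described above, not on the actual gio code.

```c
#include <stdlib.h>

struct obj {
	int refcount;
	/* Called when the last reference is dropped; may be NULL. */
	void (*release)(struct obj *obj);
};

int freed_count; /* how many objects the release callback has freed */

void obj_release(struct obj *obj)
{
	freed_count++;
	free(obj);
}

/* Allocate an object holding one initial reference. */
struct obj *obj_alloc(void (*release)(struct obj *obj))
{
	struct obj *obj = calloc(1, sizeof(*obj));

	if (!obj)
		return NULL;
	obj->refcount = 1;
	obj->release = release;
	return obj;
}

void obj_get(struct obj *obj)
{
	obj->refcount++;
}

void obj_put(struct obj *obj)
{
	/* Without a release callback nothing frees the object: a leak. */
	if (--obj->refcount == 0 && obj->release)
		obj->release(obj);
}
```

An extra obj_get() without a matching obj_put() keeps the count above zero forever, and a missing release callback means the object is never freed even when the count does reach zero.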
Takashi Iwai [Tue, 26 May 2026 15:28:41 +0000 (17:28 +0200)]
ALSA: seq: oss: Fix UAF at handling events with embedded SysEx data
The OSS sequencer processes the input MIDI bytes into a sequencer
event to be dispatched later (in snd_seq_oss_midi_putc() called from
snd_seq_oss_process_event()). When it's a SysEx data, the event
record contains data.ext.ptr pointer to the original SysEx bytes, and
the referred data is copied into the pool afterwards at dispatching.
The problem is that, if the sequencer port gets closed concurrently
before the dispatch, the OSS sequencer core also releases the
resources (in snd_seq_oss_midi_check_exit_port()), while the pending
event may hold a stale pointer, eventually leading to a UAF at a later
dispatch.
Fortunately, there is already a refcounting mechanism (snd_use_lock_t)
for the OSS MIDI device access, and for addressing the issue above, we
just need to extend the refcount until the event gets dispatched.
This patch extends snd_seq_oss_process_event() to give back the
refcount object, which is in turn released after calling the sequencer
dispatcher with the given event in the caller side.
According to the original report, KASAN reports as below:
Cássio Gabriel [Tue, 26 May 2026 12:48:27 +0000 (09:48 -0300)]
ALSA: xen-front: Connect event channel after stream prepare
The request channel must be connected from ALSA .open(), because hw-rule
queries and the stream open request use it. The event channel is
different: XENSND_EVT_CUR_POS handling uses ALSA runtime buffer and
period geometry, and the corresponding Xen stream parameters are not
submitted to the backend until .prepare() sends XENSND_OP_OPEN.
Currently .open() connects both channels. A backend current-position
event, or a stale event queued for an earlier stream instance, can
therefore reach xen_snd_front_alsa_handle_cur_pos() before
runtime->buffer_size and runtime->period_size are valid.
Add a per-channel connection helper, connect only the request channel in
.open(), connect the event channel after a successful stream prepare,
and disconnect it before stream close/free. Re-check the event-channel
state after taking ring_io_lock so disconnecting the event channel
synchronizes against a threaded IRQ that passed the initial lockless
state test. Keep defensive runtime geometry checks in the position
handler.
Cássio Gabriel [Tue, 26 May 2026 12:48:26 +0000 (09:48 -0300)]
ALSA: xen-front: Reset event channel state on stream clear
xen_snd_front_evtchnl_pair_clear() resets evt_next_id for both
channels. That is correct for the request channel, where evt_next_id is
used to allocate the next request id. It is wrong for the event channel:
incoming events are validated against evt_id, and evt_id is incremented
by evtchnl_interrupt_evt().
This leaves the expected event id from the previous stream instance. A
backend that restarts event ids for a reopened stream can then have valid
current-position events dropped until the stale frontend id catches up.
Reset evt_id for the event channel. Also advance the event-page consumer
to the current producer while clearing the stream, so obsolete events
queued for the previous stream instance are not delivered to the next
ALSA runtime.
Lianqin Hu [Wed, 27 May 2026 03:33:08 +0000 (03:33 +0000)]
ALSA: usb-audio: Add iface reset and delay quirk for TAE1160 USB Audio
Setting up the interface fails on this card when suspending/resuming.
Adding a reset and delay quirk will eliminate this problem.
usb 1-1: new full-speed USB device number 2 using xhci-hcd
usb 1-1: New USB device found, idVendor=25aa, idProduct=600b
usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 1-1: Product: TAE1159
usb 1-1: Manufacturer: Generic
usb 1-1: SerialNumber: 20210726905926
Jakub Pisarczyk [Tue, 26 May 2026 20:18:30 +0000 (22:18 +0200)]
ALSA: hda/cs420x: Add CS4208 fixup for iMac16,1
The 21.5" Retina 4K iMac (Late 2015, DMI product name "iMac16,1") ships
with a Cirrus Logic CS4208 codec wired to an external speaker amplifier
enabled through codec GPIO0 -- the same arrangement as the late-2013
MacBookPro 11,x. Without a matching entry in cs4208_mac_fixup_tbl[] the
fixup picker logs:
snd_hda_codec_cs420x hdaudioC1D0: CS4208: picked fixup for codec SSID 106b:0000
i.e. an empty fixup name, GPIO0 stays low, the external amp is never
powered up, and the internal speakers are silent on a stock kernel.
The codec SSID reported by hardware is 0x106b:0x7f00. Reusing CS4208_MBP11
(GPIO0 + SPDIF switch fixup) makes the internal speakers and S/PDIF
output work out of the box, removing the need for users to set
`options snd_hda_intel model=mbp11` via /etc/modprobe.d/.
Tested on iMac16,1 (kernel 6.17.0): four internal drivers
(Left tweeter, Left woofer, Right tweeter, Right woofer, exposed as the
4 channels of the analog-surround-40 ALSA profile) produce audio after
the fixup is applied.
The affected IOEXP nodes are missing interrupt pin configuration in
the device tree, causing the interrupt line to remain asserted and
resulting in repeated unhandled IRQ events.
Add the required interrupt-related properties for the affected IOEXP
devices to ensure proper interrupt handling and prevent the IRQ from
being disabled.
[arj: Drop markdown code-block fence, favour indentation]
Mike Hsieh [Fri, 22 May 2026 10:07:59 +0000 (18:07 +0800)]
ARM: dts: aspeed: clemente: Remove IOB NIC TMP421 nodes
Remove the TMP421 sensor entry from the DTS, as it is no longer the
primary telemetry source.
Accessing the CX8 NIC via I2C while it is powered off causes voltage
leakage on the bus, leading to EEPROM corruption on shared I2C devices.
Removing this node prevents the BMC from initiating traffic to the NIC
during initialization, protecting the integrity of the shared bus.
Signed-off-by: Mike Hsieh <mike.quanta.115@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
ARM: dts: aspeed: Enable networking for Asus Kommando IPMI Card
Adds the DT nodes needed for ethernet support for Asus Kommando, with
phy mode set to rgmii-id.
When this DT was originally added, the phy mode was set to rgmii (which
was incorrect). It was suggested to remove networking support from the
DT till the Aspeed networking driver was patched so that the correct phy
mode could be used.
The discussion in [1] mentions that u-boot was inserting clk delays that
weren't needed, which resulted in needing to set the phy mode in linux
to rgmii incorrectly. The solution suggested there was to patch u-boot to
no longer insert these clk delays and use rgmii-id as the phy mode for
any future DTs added to linux.
This DT was tested (on the OpenBMC u-boot fork [2]) with a u-boot DT
modified to insert clk delays of 0 (instead of patching u-boot itself).
[3] adds a u-boot DT for this device (without networking) and describes
how to patch it to add networking support. If this patched DT is used,
then networking works with rgmii-id phy mode in both u-boot and linux.
Haoxiang Li [Mon, 25 May 2026 08:26:11 +0000 (16:26 +0800)]
net: thunderx: fix PTP device ref leak in nicvf_probe()
cavium_ptp_get() acquires a reference to the PTP PCI device
through pci_get_device(). If any initialization step fails
after cavium_ptp_get(), the PTP PCI device reference is leaked.
Add a common error path to release the PTP reference before
returning from probe failures.
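A common error path of this kind is usually written with a goto label. The sketch below is a hypothetical userspace model: ptp_get(), ptp_put() and probe() are illustrative names, not the driver's functions, and the two failing steps are placeholders for whatever initialization can fail.

```c
#include <errno.h>
#include <stdbool.h>

int ptp_refs; /* outstanding references to the hypothetical PTP device */

void ptp_get(void)
{
	ptp_refs++;
}

void ptp_put(void)
{
	ptp_refs--;
}

int probe(bool step1_fails, bool step2_fails)
{
	int err;

	ptp_get();

	if (step1_fails) {
		err = -ENOMEM;
		goto err_put_ptp;
	}
	if (step2_fails) {
		err = -EIO;
		goto err_put_ptp;
	}
	return 0; /* success: the reference is kept until removal */

err_put_ptp:
	ptp_put(); /* every failure after ptp_get() ends up here */
	return err;
}
```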
Qi Tang [Sat, 23 May 2026 14:32:45 +0000 (22:32 +0800)]
ipv6: validate extension header length before copying to cmsg
ip6_datagram_recv_specific_ctl() builds IPV6_{HOPOPTS,DSTOPTS,RTHDR}
cmsgs (and their IPV6_2292* legacy counterparts) by trusting the
on-wire hdrlen byte (ptr[1]) when computing the put_cmsg() length.
The length was validated only at parse time (ipv6_parse_hopopts(),
etc.). An nftables payload-write expression can rewrite hdrlen after
parsing and before the skb reaches recvmsg; the write itself is
in-bounds but put_cmsg() then reads up to ((hdrlen+1) << 3) = 2040
bytes from an 8-byte header. nftables is reachable from an
unprivileged user namespace, so this is an unprivileged
slab-out-of-bounds read:
BUG: KASAN: slab-out-of-bounds in put_cmsg+0x3ac/0x540
put_cmsg+0x3ac/0x540
udpv6_recvmsg+0xca0/0x1250
sock_recvmsg+0xdf/0x190
____sys_recvmsg+0x1b1/0x620
Add ipv6_get_exthdr_len() which validates that at least two bytes
are accessible before reading the hdrlen field, then checks the
computed length against skb_tail_pointer(skb), returning 0 on
failure. Extension headers are kept in the linear skb area by
pskb_may_pull() during input, so skb_tail_pointer() is the correct
bound.
Use ipv6_get_exthdr_len() at all non-AH call sites: the five
standalone cmsg blocks (HbH, 2292HbH, 2292DSTOPTS x2, 2292RTHDR)
and the three standard cases in the extension-header walk loop
(DSTOPTS, ROUTING, default). AH retains an inline bounds check
because its length formula differs ((ptr[1]+2)<<2).
The walk loop also gets a pre-read bounds check at the top to
validate ptr before any case accesses ptr[0] or ptr[1].
When the walk loop detects a corrupted header, return from the
function instead of continuing to process later socket options.
Cc: stable@vger.kernel.org
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Qi Tang <tpluszz77@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260523143245.2281415-1-tpluszz77@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
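The length validation described above can be sketched in userspace C roughly as follows. exthdr_len_checked() is a hypothetical analogue that works on plain byte pointers instead of an skb; it is not the kernel's ipv6_get_exthdr_len(), and the `tail` argument stands in for skb_tail_pointer(skb).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the length in bytes of a non-AH extension header starting at
 * ptr, or 0 if the header does not fit in the bytes before tail.
 * ptr[0] is the next-header field, ptr[1] is the hdrlen field.
 */
size_t exthdr_len_checked(const uint8_t *ptr, const uint8_t *tail)
{
	size_t avail;
	size_t len;

	if (ptr > tail)
		return 0;
	avail = (size_t)(tail - ptr);

	/* The hdrlen byte itself must be readable. */
	if (avail < 2)
		return 0;

	/* hdrlen counts 8-byte units beyond the first 8 bytes. */
	len = ((size_t)ptr[1] + 1) << 3;
	if (len > avail)
		return 0;

	return len;
}
```

A caller would skip the cmsg (or stop walking headers) whenever the helper returns 0, instead of passing an unchecked length on.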
Luka Gejak [Sat, 23 May 2026 13:04:20 +0000 (15:04 +0200)]
net: hsr: require valid EOT supervision TLV
Supervision frames are only valid if terminated with a zero-length EOT
TLV. The current check fails to reject non-EOT entries as the terminal
TLV, potentially allowing malformed supervision traffic.
Fix this by strictly requiring the terminal TLV to be HSR_TLV_EOT with
a length of zero.
Sean Shen [Tue, 26 May 2026 13:07:16 +0000 (22:07 +0900)]
ksmbd: fix FSCTL permission bypass by adding a permission check for FSCTL_SET_SPARSE
FSCTL_SET_SPARSE in fsctl_set_sparse() modifies the file's sparse
attribute and saves it through xattr without any permission checks.
This exposes two issues:
1) A client on a read-only share can change the sparse attribute
on files it opened, even though the share is read-only.
Other FSCTL write operations already check
test_tree_conn_flag(work->tcon, KSMBD_TREE_CONN_FLAG_WRITABLE),
but FSCTL_SET_SPARSE does not.
2) Even on writable shares, clients without FILE_WRITE_DATA or
FILE_WRITE_ATTRIBUTES access should not modify the sparse
attribute. Similar handle-level checks exist in other functions
but are missing here.
Add both share-level writable check and per-handle access check.
Use goto out on error to avoid leaking file references.
Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3")
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: Sean Shen <grayhat@foxmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
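The two-level check (share first, then handle) might look like the sketch below. The structures, the set_sparse() helper and the access-mask bit values are illustrative assumptions for a userspace model, not ksmbd's actual types or definitions.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative access-mask bits for the sketch. */
#define ACC_WRITE_DATA       0x002u
#define ACC_WRITE_ATTRIBUTES 0x100u

struct share {
	bool writable;
};

struct handle {
	uint32_t daccess; /* access rights granted when the file was opened */
	bool sparse;
};

int set_sparse(const struct share *sh, struct handle *h, bool sparse)
{
	/* Share-level check: read-only shares reject any modification. */
	if (!sh->writable)
		return -EACCES;

	/* Handle-level check: one of the two write rights is required. */
	if (!(h->daccess & (ACC_WRITE_DATA | ACC_WRITE_ATTRIBUTES)))
		return -EACCES;

	h->sparse = sparse;
	return 0;
}
```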
ksmbd: release ksmbd_inode ref via ksmbd_inode_put on lookup paths
ksmbd_query_inode_status() and ksmbd_lookup_fd_inode() both take a
reference on a ksmbd_inode via __ksmbd_inode_lookup() (which performs
atomic_inc_not_zero()) and later release it using a bare
atomic_dec(&ci->m_count). Unlike ksmbd_inode_put(), a bare
atomic_dec() does not check whether the reference count has reached
zero, so if the caller happens to drop the last reference, the
ksmbd_inode is leaked: it stays in the global inode hash table with
m_count == 0, future __ksmbd_inode_lookup() calls reject it via
atomic_inc_not_zero(), and ksmbd_inode_free() is never invoked.
In ksmbd_lookup_fd_inode() the matched-fp path (which now also uses
ksmbd_inode_put()) cannot currently reach m_count == 0 because the
matched ksmbd_file holds its own reference on ci, but converting it to
the proper API keeps the three call sites consistent and avoids
future regressions if the locking changes.
Because ksmbd_inode_put() may free the ksmbd_inode if this drops the
last reference, the call must happen after up_read(&ci->m_lock) on the
two affected paths in ksmbd_lookup_fd_inode(). On the no-match path
this is a pure reordering; on the matched path ksmbd_fp_get() is
moved above the unlock so that the returned ksmbd_file is pinned
before the inode reference is released.
Signed-off-by: Aleksandr Golovnya <cofedish@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
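The difference between a bare decrement and a proper put can be shown with C11 atomics. In this sketch struct node, node_dec() and node_put() are hypothetical; the `hashed` and `freed` flags merely stand in for hash-table membership and for the real free.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct node {
	atomic_int count;
	bool hashed; /* still present in the lookup table */
	bool freed;  /* stands in for actually freeing the object */
};

/* Bare decrement: never notices that the count reached zero. */
void node_dec(struct node *n)
{
	atomic_fetch_sub(&n->count, 1);
}

/* Proper put: whoever drops the last reference unhashes and frees. */
void node_put(struct node *n)
{
	/* atomic_fetch_sub() returns the value before the decrement. */
	if (atomic_fetch_sub(&n->count, 1) == 1) {
		n->hashed = false;
		n->freed = true;
	}
}
```

With node_dec() an object can end up hashed with a count of zero, which is the stuck state the commit message describes.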
Ali Ganiyev [Mon, 25 May 2026 01:23:47 +0000 (10:23 +0900)]
ksmbd: OOB read regression in smb_check_perm_dacl() ACE-walk loops
Commit d07b26f39246 ("ksmbd: require minimum ACE size in
smb_check_perm_dacl()") introduced a transposed bounds check:
if (offsetof(struct smb_ace, sid) + aces_size < CIFS_SID_BASE_SIZE)
Since offsetof(..sid) is 8 and CIFS_SID_BASE_SIZE is 8, this evaluates
to `aces_size < 0`. Because `aces_size` is always non-negative, this
check becomes dead code and never breaks the loop.
Worse, that commit removed the old 4-byte guard, meaning the loop now
reads `ace->size` (offset 2) even when `aces_size` is 0-3 bytes. This
re-opens a 2-byte heap out-of-bounds (OOB) read past the pntsd allocation
during subsequent SMB2_CREATE operations.
Fix this by properly transposing the comparison to require at least
16 bytes (8-byte offset + 8-byte SID base), matching the correct form
used in smb_inherit_dacl().
Fixes: d07b26f39246 ("ksmbd: require minimum ACE size in smb_check_perm_dacl()")
Cc: stable@vger.kernel.org
Signed-off-by: Ali Ganiyev <ali.qaniyev@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
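The arithmetic behind the dead check can be reproduced in a few lines. struct ace and struct sid below are simplified stand-ins that only preserve the layout facts quoted above (size at offset 2, SID at offset 8, 8-byte SID base); the two helper names are hypothetical, and the corrected form is written from the description rather than copied from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SID_BASE_SIZE 8u

struct sid {
	uint8_t base[SID_BASE_SIZE];
};

struct ace {
	uint8_t type;
	uint8_t flags;
	uint16_t size;       /* offset 2 */
	uint32_t access_req;
	struct sid sid;      /* offset 8 */
};

/* Transposed form: 8 + aces_size < 8 never holds for a valid size. */
bool too_small_transposed(size_t aces_size)
{
	return offsetof(struct ace, sid) + aces_size < SID_BASE_SIZE;
}

/* Intended form: at least 8 + 8 = 16 bytes must remain. */
bool too_small_fixed(size_t aces_size)
{
	return aces_size < offsetof(struct ace, sid) + SID_BASE_SIZE;
}
```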
Jakub Kicinski [Wed, 27 May 2026 01:32:34 +0000 (18:32 -0700)]
Merge tag 'nfc-7.1-rc6' of https://codeberg.org/linux-nfc/linux
David Heidelberg says:
====================
nfc pull request for net:
Code improvements
- llcp: Fix use-after-free in llcp_sock_release()
- llcp: Fix use-after-free race in nfc_llcp_recv_cc()
- hci: fix out-of-bounds read in HCP header parsing
Regression fixes:
- nxp-nci: i2c: use rising-edge IRQ on ACPI systems
Signed-off-by: David Heidelberg <david@ixit.cz>
* tag 'nfc-7.1-rc6' of https://codeberg.org/linux-nfc/linux:
nfc: nxp-nci: i2c: use rising-edge IRQ on ACPI systems
nfc: hci: fix out-of-bounds read in HCP header parsing
nfc: llcp: Fix use-after-free race in nfc_llcp_recv_cc()
nfc: llcp: Fix use-after-free in llcp_sock_release()
====================
Eric Dumazet [Mon, 25 May 2026 20:36:42 +0000 (20:36 +0000)]
vxlan: do not reuse cached ip_hdr() value after skb_tunnel_check_pmtu()
skb_tunnel_check_pmtu() can change skb->head.
Reusing old_iph after skb_tunnel_check_pmtu() can cause a UAF.
Use instead ip_hdr(skb) as done in drivers/net/bareudp.c
and drivers/net/geneve.c.
Found by Sashiko.
Fixes: 4cb47a8644cc ("tunnels: PMTU discovery support for directly bridged IP packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Link: https://patch.msgid.link/20260525203642.2389723-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Mon, 25 May 2026 20:13:35 +0000 (20:13 +0000)]
tunnels: load network headers after skb_cow() in iptunnel_pmtud_build_icmp[v6]()
Sashiko found that iptunnel_pmtud_build_icmp() and
iptunnel_pmtud_build_icmpv6() were caching ip_hdr() and ipv6_hdr()
before an skb_cow() call which can reallocate skb->head.
Fix this possible UAF by initializing the local variables
after the skb_cow() call.
Remove skb_reset_network_header() calls which were not needed.
Fixes: 4cb47a8644cc ("tunnels: PMTU discovery support for directly bridged IP packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Link: https://patch.msgid.link/20260525201335.2361845-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
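The pattern behind both tunnel fixes is the same: a header pointer derived from a buffer must be loaded after any call that can move that buffer. A minimal userspace sketch, assuming a hypothetical struct pkt in place of an skb and pkt_grow() in place of a helper that may reallocate skb->head:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct pkt {
	uint8_t *head;  /* start of the buffer, may move */
	size_t len;
	size_t hdr_off; /* header offset relative to head */
};

/* Always derives the header pointer from the current head. */
uint8_t *pkt_hdr(const struct pkt *p)
{
	return p->head + p->hdr_off;
}

/* Grows the buffer by moving it to a new allocation. */
int pkt_grow(struct pkt *p, size_t extra)
{
	uint8_t *n = malloc(p->len + extra);

	if (!n)
		return -1;
	memcpy(n, p->head, p->len);
	free(p->head); /* any pointer into the old buffer is now stale */
	p->head = n;
	p->len += extra;
	return 0;
}

int pkt_grow_and_read_first_byte(struct pkt *p, size_t extra, uint8_t *out)
{
	if (pkt_grow(p, extra))
		return -1;
	/* Header pointer is loaded only after the buffer may have moved. */
	*out = pkt_hdr(p)[0];
	return 0;
}
```

Caching pkt_hdr(p) in a local before calling pkt_grow() and dereferencing it afterwards would be the userspace equivalent of the UAF described above.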
Jakub Kicinski [Wed, 27 May 2026 01:07:28 +0000 (18:07 -0700)]
Merge tag 'nf-next-26-05-25' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter: updates for net-next
The following patchset contains Netfilter fixes and small enhancements:
1) Disable 32-bit x_tables compatibility (32bit binaries on 64bit
kernel) interface in user namespaces. This is 'last warning'
before this is removed for good.
2) Add a configuration toggle for netfilter GCOV profiling. Provide
dedicated toggles for ipset and ipvs.
3) Remove modular support for nfnetlink and restrict it to built-in only.
From Pablo Neira Ayuso.
4) Use per-rule hash initval in nf_conncount. This avoids unnecessary
lock contention with short keys (e.g. conntrack zones) in different
namespaces.
5) Use nf_ct_exp_net() in ctnetlink expectation dumps.
From Pratham Gupta.
6) Remove a dead conditional in nft_set_rbtree.
7) Fix conntrack helper policy updates to apply per-class values correctly.
From David Carlier.
8) Fix an off-by-one OOB read in nf_conntrack_irc:parse_dcc(). Use strict
less-than comparison in the newline search loop to respect the
exclusive-end pointer convention. From Muhammad Bilal.
9) Fix typos in nf_conntrack_proto_tcp comments. From Avinash Duduskar.
10) Restore performance optimization in nft_set_pipapo_avx2 by passing
the next map index. Refactor lookup logic for clarity and add a
DEBUG_NET check to document this.
11) Avoid (harmless) u16 overflow in nf_conntrack_ftp when parsing FTP PORT
and EPRT commands. Ignore commands where a single octet exceeds 255.
From Giuseppe Caruso.
Patch 12, which removes incorrect (and obviously unused) code from
nft_byteorder was kept back to avoid a net -> net-next merge conflict.
* tag 'nf-next-26-05-25' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_conntrack_ftp: avoid u16 overflows
netfilter: nft_set_pipapo_avx2: restore performance optimization
netfilter: nf_conntrack_proto_tcp: fix typos in comments
netfilter: nf_conntrack_irc: fix parse_dcc() off-by-one OOB read
netfilter: nfnl_cthelper: apply per-class values when updating policies
netfilter: nft_set_rbtree: remove dead conditional
netfilter: ctnetlink: use nf_ct_exp_net() in expectation dump
netfilter: nf_conncount: use per-rule hash initval
netfilter: allow nfnetlink built-in only
netfilter: add option for GCOV profiling
netfilter: x_tables: disable 32bit compat interface in user namespaces
====================
netconsole: Constify struct configfs_item_operations and configfs_group_operations
'struct configfs_item_operations' and 'configfs_group_operations' are not
modified in this driver.
Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.
On x86_64, with allmodconfig, as an example:
Before:
======
   text    data     bss     dec     hex filename
  64259   24272     608   89139   15c33 drivers/net/netconsole.o
After:
=====
   text    data     bss     dec     hex filename
  64579   23952     608   89139   15c33 drivers/net/netconsole.o
Maoyi Xie [Mon, 25 May 2026 07:17:59 +0000 (15:17 +0800)]
mlxsw: spectrum_fid: use a dedicated list head pointer for sorted insert
mlxsw_sp_fid_port_vid_list_add() inserts into a list sorted by
local_port. It walks the list to find the first entry with a
larger local_port, then inserts the new entry before it:
If the loop falls through (the new local_port is the largest),
tmp_port_vid runs off the end of the list. &tmp_port_vid->list
then ends up at the list head itself (container_of() offsets
cancel), and list_add_tail() inserts at the tail. So the code
works today.
It is fragile though. Anyone who later adds a read of another
field of tmp_port_vid will hit memory outside the list head.
Track the insertion point with a dedicated list_head pointer.
Initialise insert_before to &fid->port_vid_list, set it to
&tmp_port_vid->list only on early break, and pass insert_before
to list_add_tail(). The cursor is no longer touched after the
loop. Behaviour is unchanged.
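The insertion scheme described above can be sketched with a minimal userspace list. The list helpers here are simplified re-implementations for illustration, and struct port_vid and sorted_add() are hypothetical names; only the use of a separate insert_before pointer mirrors the description.

```c
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry right before pos; before the head means at the tail. */
void list_add_before(struct list_head *entry, struct list_head *pos)
{
	entry->prev = pos->prev;
	entry->next = pos;
	pos->prev->next = entry;
	pos->prev = entry;
}

struct port_vid {
	struct list_head list;
	int local_port;
};

void sorted_add(struct list_head *head, struct port_vid *pv)
{
	struct list_head *insert_before = head; /* default: append */
	struct list_head *cur;

	for (cur = head->next; cur != head; cur = cur->next) {
		struct port_vid *tmp =
			container_of(cur, struct port_vid, list);

		if (tmp->local_port > pv->local_port) {
			insert_before = cur; /* set only on early break */
			break;
		}
	}
	/* The loop cursor is not used after the loop. */
	list_add_before(&pv->list, insert_before);
}
```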
Wei Fang [Sun, 24 May 2026 07:03:10 +0000 (15:03 +0800)]
net: dsa: netc: fix unmet Kconfig dependencies for NET_DSA_NETC_SWITCH
NET_DSA_NETC_SWITCH selects NXP_NTMP, NXP_NETC_LIB and FSL_ENETC_MDIO,
but these symbols depend on NET_VENDOR_FREESCALE which may not be
enabled. This results in Kconfig warnings and linker errors like:
undefined reference to `ntmp_bpt_update_entry'
undefined reference to `ntmp_fdbt_search_port_entry'
undefined reference to `ntmp_free_cbdr'
undefined reference to `enetc_hw_alloc'
...
Therefore, add "depends on NET_VENDOR_FREESCALE" to NET_DSA_NETC_SWITCH,
ensuring that the selected symbols NXP_NTMP, NXP_NETC_LIB and
FSL_ENETC_MDIO, which all depend on NET_VENDOR_FREESCALE, can only be
selected when that dependency is already satisfied.
Lucien.Jheng [Sun, 24 May 2026 06:39:15 +0000 (14:39 +0800)]
net: phy: air_en8811h: add AN8811HB MCU assert/deassert support
AN8811HB needs a MCU soft-reset cycle before firmware loading begins.
Assert the MCU (hold it in reset) and immediately deassert (release)
via a dedicated PBUS register pair (0x5cf9f8 / 0x5cf9fc), accessed
through a registered mdio_device at PHY-addr+8.
Add __air_pbus_reg_write() as a low-level helper taking a struct
mdio_device *, create and register the PBUS mdio_device in
an8811hb_probe() and store it in priv->pbusdev, then implement
an8811hb_mcu_assert() / _deassert() on top of it. Add
an8811hb_remove() to unregister the PBUS device on teardown. Wire
both calls into an8811hb_load_firmware() and en8811h_restart_mcu()
so every firmware load or MCU restart on AN8811HB correctly sequences
the reset control registers.
ipv6: addrconf: fix temp address generation after prefix deprecation
When a router temporarily deprecates an IPv6 prefix (either by sending a
Router Advertisement with Preferred Lifetime = 0 or by letting the
lifetime expire) and later restores it, the kernel permanently loses its
ability to generate temporary privacy addresses (RFC 8981) for that
prefix.
This happens because the address worker attempts to generate a
replacement temporary address when the current one nears expiration. As
the base prefix is deprecated already, the generation fails after
marking the temporary address as already having spawned a replacement
(ifp->regen_count++).
When the router eventually restores the prefix, the temporary address
becomes active again. However, once it naturally expires, the address
worker sees that this temporary address has already tried to generate a
replacement and skips the regeneration.
Fix the issue by resetting the regen_count of the latest temporary
address generated for the prefix updated by the incoming RA.
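The sequence above can be followed in a toy Python model. This is not the addrconf code: the class, the function names and the `fixed` switch are hypothetical, and only the regen_count behaviour described in the message is modelled.

```python
from dataclasses import dataclass


@dataclass
class TempAddr:
    regen_count: int = 0


def worker_try_regen(addr, prefix_deprecated):
    """Return True if a replacement temporary address was generated."""
    if addr.regen_count > 0:
        return False  # already marked as having spawned a replacement
    addr.regen_count += 1  # marked before generation is attempted
    return not prefix_deprecated  # generation fails on a deprecated prefix


def on_router_advertisement(addr, fixed):
    """The RA restores the prefix; with the fix, the stale marker is cleared."""
    if fixed:
        addr.regen_count = 0
```

Without the reset, one failed attempt during the deprecated window leaves the marker set for good, so later regeneration is always skipped.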
l2tp: use refcount_inc_not_zero in l2tp_session_get_by_ifname
A reader in l2tp_session_get_by_ifname() can return a pointer to a
session whose refcount has reached zero. The getter takes its
reference with plain refcount_inc(), but every other session getter
in the same file (l2tp_v2_session_get, l2tp_v3_session_get, and the
corresponding _get_next variants) uses refcount_inc_not_zero()
because the IDR/RCU lookup can race with refcount_dec_and_test() ->
l2tp_session_free() -> kfree_rcu(). The ifname getter is the only
outlier; the inconsistency was raised on-list after 979c017803c4
("l2tp: use list_del_rcu in l2tp_session_unhash").
A reader inside rcu_read_lock_bh() that matches session->ifname can
be preempted between the strcmp() and the refcount_inc(). If the
last reference drops on another CPU in that window, the reader's
refcount_inc() runs on a counter that has reached zero. refcount_t
catches the addition-on-zero, prints "refcount_t: addition on 0;
use-after-free", and saturates the counter; the getter still returns
the session pointer to the caller. Session memory is held live by the in-flight
RCU read section, but the kfree_rcu() callback queued from
l2tp_session_free() will free it once the grace period closes; a
caller that dereferences the returned session past that point hits
a slab-use-after-free. On PREEMPT_RT local_bh_disable() is a per-CPU
sleeping lock and the preemption window is real; on stock PREEMPT
kernels local_bh_disable() is a preempt_count increment that closes
the cross-CPU race in practice (see below).
Use refcount_inc_not_zero() and continue the list walk on failure,
matching the other session getters in the file. The ifname getter
is the only session getter in net/l2tp/ that still uses the bare
refcount_inc() pattern; this change restores file-internal
consistency. The success path is unchanged.
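A single-threaded Python sketch of the getter pattern, for illustration only. The classes and the function below are hypothetical stand-ins, not the l2tp code, and the model does not reproduce the atomicity or RCU behaviour of the real helpers. It only shows why a lookup should skip an entry whose count has already reached zero.

```python
class Refcount:
    def __init__(self, value=1):
        self.value = value

    def inc_not_zero(self):
        """Take a reference only if the object is still live."""
        if self.value == 0:
            return False
        self.value += 1
        return True


class Session:
    def __init__(self, ifname, refs=1):
        self.ifname = ifname
        self.ref = Refcount(refs)


def session_get_by_ifname(sessions, ifname):
    for session in sessions:
        if session.ifname != ifname:
            continue
        if not session.ref.inc_not_zero():
            continue  # being freed: skip it and keep walking the list
        return session
    return None
```

A plain increment would hand back the dying session here, while the conditional increment falls through to the next match or returns nothing.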
Fixes: abe7a1a7d0b6 ("l2tp: improve tunnel/session refcount helpers") Cc: stable@vger.kernel.org Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Reviewed-by: James Chapman <jchapman@katalix.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260523023423.2568972-1-michael.bommarito@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Martin Karsten [Sat, 23 May 2026 01:22:20 +0000 (21:22 -0400)]
net: napi: Skip last poll when arming gro timer in busy poll
Skip the extra call to napi->poll(), if the gro timer is armed at the
end of busy polling. This removes the need for having a separate
__busy_poll_stop() routine and its code is moved directly into the
relevant places in busy_poll_stop(). Remove obsolete comment about
ndo_busy_poll_stop().
This is a follow-up to commit 58e2330bd455 ("net: napi: Avoid gro timer
misfiring at end of busypoll"), which has deferred arming the gro timer
to the end of __busy_poll_stop() to eliminate a race condition between
a short timer and long poll that could leave the queue stuck with
interrupts disabled and no timer armed.
Co-developed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca> Link: https://patch.msgid.link/20260523012247.1574691-1-mkarsten@uwaterloo.ca Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tim Bird [Fri, 22 May 2026 22:55:08 +0000 (16:55 -0600)]
llc: Add SPDX id lines to some llc source files
Most of the llc source files are missing SPDX-License-Identifier
lines. Add appropriate IDs to these files, and remove other license
info from the header. In one case, leave the existing id line
and just remove the license reference text.
Andreas Hindborg [Mon, 27 Apr 2026 08:11:35 +0000 (10:11 +0200)]
rust: module_param: use `pr_warn_once!` for null pointer warning
Replace `pr_warn!` and the accompanying TODO with `pr_warn_once!`, now that
the macro is available.
[ Note: Adarsh Das independently authored an identical patch on the
rust-for-linux list, but it missed the modules tree. ]
Suggested-by: Adarsh Das <adarshdas950@gmail.com> Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Gary Guo <gary@garyguo.net> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
====================
Add OVS packet family YNL spec and unicast notification support
This series adds a YAML netlink spec for the OVS_PACKET_FAMILY genetlink
family and a bind-only ntf_bind() helper for receiving unicast
notifications.
====================
Minxi Hou [Fri, 22 May 2026 17:41:54 +0000 (01:41 +0800)]
tools: ynl: add unicast notification receive support
Add ntf_bind() method to YnlFamily for binding the netlink
socket without joining a multicast group. This enables receiving
unicast notifications through the existing poll_ntf/check_ntf
path.
The OVS packet family sends MISS and ACTION upcalls via
genlmsg_unicast() to a per-vport PID rather than through a
multicast group. The existing ntf_subscribe() couples bind()
with setsockopt(ADD_MEMBERSHIP), which does not fit the unicast
case. ntf_bind() provides the bind-only alternative, with the
address defaulting to (0, 0) but exposed as an explicit argument.
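The difference between the two paths can be sketched as free functions. These are simplified stand-ins, not the YnlFamily methods themselves: the function signatures and the `sock` parameter are assumptions, and the two constants are the values from the Linux UAPI headers.

```python
SOL_NETLINK = 270            # from <linux/socket.h>
NETLINK_ADD_MEMBERSHIP = 1   # from <linux/netlink.h>


def ntf_bind(sock, addr=(0, 0)):
    """Bind only: enough to receive unicast notifications."""
    sock.bind(addr)


def ntf_subscribe(sock, mcast_id):
    """Bind and also join a multicast group."""
    sock.bind((0, 0))
    sock.setsockopt(SOL_NETLINK, NETLINK_ADD_MEMBERSHIP, mcast_id)
```

Any object with bind() and setsockopt() works as `sock`, for example an AF_NETLINK socket, where the address tuple is (pid, groups).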
Minxi Hou [Fri, 22 May 2026 17:41:53 +0000 (01:41 +0800)]
netlink: specs: add OVS packet family specification
Add YAML netlink spec for the OVS_PACKET_FAMILY (ovs_packet).
This completes the set of OVS genetlink family specs (ovs_datapath,
ovs_flow, ovs_vport already exist).
The spec defines three operations: MISS (event), ACTION (event),
and EXECUTE (do). MISS and ACTION are kernel-to-userspace upcalls
sent via genlmsg_unicast(); EXECUTE is the only registered genl
operation.
Key, actions, and egress-tun-key attributes are typed as binary
rather than nest because the nested attribute definitions belong
to the ovs_flow spec and cross-spec references are not supported
by the YNL framework.
Alice Ryhl [Wed, 8 Apr 2026 08:32:17 +0000 (08:32 +0000)]
rust: kasan: add support for Software Tag-Based KASAN
This adds support for Software Tag-Based KASAN (KASAN_SW_TAGS) when
CONFIG_RUST is enabled. This requires that rustc includes support for
the kernel-hwaddress sanitizer, which is available since 1.96.0 [1].
Unlike with clang, we need to pass -Zsanitizer-recover in addition to
-Zsanitizer because the option is not implied automatically.
The kasan makefile uses different names for the flags depending on
whether CC is clang or gcc, but as we require that CC is clang when
using KASAN, we do not need to try to handle mixed gcc/llvm builds when
Rust is enabled.
Alice Ryhl [Wed, 8 Apr 2026 08:32:16 +0000 (08:32 +0000)]
rust: kasan: KASAN+RUST requires clang
Kernel KASAN involves passing various llvm/gcc specific arguments to
the C and Rust compiler. Since these arguments differ between llvm and
gcc, it's not safe to mix an llvm-based rustc with a gcc build when
kasan is enabled.
Alice Ryhl [Tue, 31 Mar 2026 10:57:49 +0000 (10:57 +0000)]
kbuild: rust: add AutoFDO support
This patch enables AutoFDO build support for Rust code within the Linux
kernel. This allows Rust code to be profiled and optimized based on the
profile.
The RUSTFLAGS variable was suffixed with *_AUTOFDO_CLANG to match the
naming of the config option, which is called CONFIG_AUTOFDO_CLANG.
This implementation has been verified in Android, first by inspecting
the object files and confirming that they look correct. After that,
it was verified as below:
1. Running the binderAddInts benchmark [1] with Rust Binder built as
rust_binder.ko module, using a Pixel 9 Pro.
2. Collecting a profile on a Pixel 10 Pro XL using the app-launch
benchmark, which starts different apps many times, on a device with
Rust Binder as a built-in kernel module. (C Binder was not present on
the device.)
3. Using the collected profile, run the binderAddInts benchmark again
with Rust Binder built both as a rust_binder.ko module, and as a
built-in kernel module.
4. In both cases, Rust Binder without AutoFDO was approximately 13%
slower than the AutoFDO optimized version. Built-in vs .ko did not
make a measurable performance difference.
All of the above was verified in conjunction with my helpers inlining
series [2], which confirmed that this worked correctly for helpers too
once [3] was fixed in the helpers inlining series.
Chaitanya Sabnis [Tue, 26 May 2026 10:22:40 +0000 (15:52 +0530)]
i2c: davinci: fix division by zero on missing clock-frequency
When the 'clock-frequency' property is missing from the device tree,
the driver falls back to DAVINCI_I2C_DEFAULT_BUS_FREQ. However, this
macro was defined in kHz (100), whereas the device tree property is
expected in Hz.
The probe function divided the fallback value by 1000, causing
integer truncation that resulted in dev->bus_freq = 0. This triggered
a deterministic division-by-zero kernel panic when calculating clock
dividers later in the probe sequence.
Fix this by redefining DAVINCI_I2C_DEFAULT_BUS_FREQ in Hz (100000)
to match the expected device tree property unit, allowing the existing
division logic to work correctly for both cases.
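The truncation is easy to reproduce with integer arithmetic. The Python sketch below mirrors the description rather than the driver source: both function names are hypothetical, and clock_divider() is only a stand-in for the later divider calculation.

```python
def bus_freq_khz(clock_frequency_hz=None, default=100):
    """Take the device tree value or the default, then divide by 1000."""
    value = clock_frequency_hz if clock_frequency_hz is not None else default
    return value // 1000  # integer division, as in C


def clock_divider(input_clock_khz, bus_freq):
    """Hypothetical stand-in for the later divider calculation."""
    return input_clock_khz // bus_freq
```

With the old default of 100 the first function returns 0 and the second raises ZeroDivisionError. With 100000 the result is 100, the same unit a device tree value such as 400000 Hz produces after the division.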
Ricardo Robaina [Wed, 13 May 2026 21:47:59 +0000 (18:47 -0300)]
audit: fix removal of dangling executable rules
When an audited executable is deleted from the disk, its dentry
becomes negative. Any later attempt to delete the associated audit
rule will lead to audit_alloc_mark() encountering this negative
dentry and immediately aborting, returning -ENOENT.
This early abort prevents the subsystem from allocating the temporary
fsnotify mark needed to construct the search key, meaning the kernel
cannot find the existing rule in its own lists to delete it. This
leaves a dangling rule in memory, resulting in the following error
while attempting to delete the rule:
# ./audit-dupe-exe-deadlock.sh
No rules
Error deleting rule (No such file or directory)
There was an error while processing parameters
# auditctl -l
-a always,exit -S all -F exe=/tmp/file -F path=/tmp/file -F key=dr
# auditctl -D
Error deleting rule (No such file or directory)
There was an error while processing parameters
This patch fixes this issue by removing the d_really_is_negative()
check. By doing so, a dummy mark can be successfully generated for
the deleted path, which allows the audit subsystem to properly match
and flush the dangling rule.
Cc: stable@kernel.org Fixes: 76a53de6f7ff ("VFS/audit: introduce kern_path_parent() for audit") Acked-by: Waiman Long <longman@redhat.com> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Ricardo Robaina <rrobaina@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
Vishal Annapurve [Fri, 22 May 2026 15:15:34 +0000 (15:15 +0000)]
KVM: x86: Treat KVM's virtual PMU as disabled for TDX VMs
Introduce a "protected PMU" concept, and use it to disable KVM's virtual
PMU for TDX VMs, as the PMU state for TDX VMs is virtualized by the TDX
Module[1], i.e. _can't_ be emulated/virtualized by KVM, and KVM doesn't yet
support enabling/exposing PMU functionality for/to TDX VMs. For now,
simply treat the PMU as disabled, as it's not clear what all needs to be
changed, e.g. KVM needs to do at least:
1) Configure TD_PARAMS to allow guests to use performance monitoring.
2) Restrict the TD to a subset of the PEBS counters if supported.
3) Limit the TD to setting up only certain perfmon events using
basic/enhanced event filtering.
Explicitly disallow enabling the PMU via KVM_CAP_PMU_CAPABILITY for VMs
with a protected PMU to prevent userspace from circumventing KVM's
protections.
Jani Nikula [Wed, 13 May 2026 07:58:40 +0000 (10:58 +0300)]
drm/i915/display: stop passing i to for_each_pipe_crtc_modeset_{enable, disable}()
Refactor for_each_pipe_crtc_modeset_{enable,disable}() and their
underlying for_each_crtc_in_masks{,_reverse}() helpers to utilize
__UNIQUE_ID() to avoid having to pass the for loop variable to them.
Jani Nikula [Wed, 13 May 2026 07:58:38 +0000 (10:58 +0300)]
drm/i915/display: pass struct intel_display to all for_each_intel_crtc*() macros
Now that the for_each_intel_crtc*() iterator macros primarily use
display->pipe_list for iteration, it's more convenient to pass struct
intel_display to them directly instead of struct drm_device. Make it so.
Jani Nikula [Wed, 13 May 2026 07:58:37 +0000 (10:58 +0300)]
drm/i915/display: always pass display->drm to for_each_intel_crtc*()
In preparation for always passing struct intel_display to
for_each_intel_crtc*() family of iterators, start off by unifying their
usage to always having struct intel_display *display around, and passing
display->drm to them.
Jani Nikula [Wed, 13 May 2026 07:58:36 +0000 (10:58 +0300)]
drm/i915/display: switch from drm_for_each_crtc() to for_each_intel_crtc()
intel_has_pending_fb_unpin() has the last direct user of
drm_for_each_crtc() in i915. Switch to for_each_intel_crtc() to ensure
pipe order iteration in all cases.
Jani Nikula [Mon, 25 May 2026 11:05:53 +0000 (14:05 +0300)]
drm/{i915, xe}: move xe_display_flush_cleanup_work() to i915 display
xe_display_flush_cleanup_work() is a bit of an oddball function in xe
display code. There shouldn't be anything this specific or xe
specific. While I'm not sure what the correct refactor for the function
should be, move it to shared display code for starters, next to the
eerily similar but slightly different intel_has_pending_fb_unpin() that
is only called from i915 core.
The main goal here is to unblock some refactors on
for_each_intel_crtc().
Kevin Cheng [Fri, 22 May 2026 23:27:01 +0000 (16:27 -0700)]
KVM: selftests: Add nested page fault injection test
Add a test that exercises nested page fault injection during L2
execution. L2 executes I/O string instructions (OUTSB/INSB) that access
memory restricted in L1's nested page tables (NPT/EPT), triggering a
nested page fault that L0 must inject to L1.
The test supports both AMD SVM (NPF) and Intel VMX (EPT violation) and
verifies that:
- The exit reason is an NPF/EPT violation
- The access type and permission bits are correct
- The faulting GPA is correct
Four test cases are implemented:
- Unmap the final data page (final translation fault, OUTSB read)
- Unmap a PT page (page walk fault, OUTSB read)
- Write-protect the final data page (protection violation, INSB write)
- Write-protect a PT page (protection violation on A/D update, OUTSB
read)
When injecting an EPT Violation into L1 in response to a fault detected
while emulating an L2 GVA access, synthesize the GVA_IS_VALID and
GVA_TRANSLATED bits using information provided by the walker, instead of
pulling the bits from vmcs02.EXIT_QUALIFICATION. The information in
vmcs02.EXIT_QUALIFICATION is valid/correct if and only if the fault being
injected into L1 is the direct result of an EPT Violation VM-Exit from L2.
E.g. if KVM is emulating an I/O instruction and the memory operand's
translation through L1's EPT fails, using vmcs02.EXIT_QUALIFICATION is
wrong as the semantics for EXIT_QUALIFICATION would be for an I/O exit,
not an EPT Violation exit.
Opportunistically clean up the formatting for creating the mask of bits
to pull from vmcs02.EXIT_QUALIFICATION.
Kevin Cheng [Fri, 22 May 2026 23:26:59 +0000 (16:26 -0700)]
KVM: SVM: Fix nested NPF injection of PFERR_GUEST_{PAGE,FINAL}_MASK bits
Fix KVM's generation of PFERR_GUEST_{PAGE,FINAL}_MASK bits when injecting a
Nested Page Fault into L1. Currently, KVM blindly stuffs GUEST_FINAL into
L1, which is blatantly wrong given that KVM obviously generates NPFs for
page table accesses.
There are two paths that trigger NPF injection: hardware NPF exits (from
L2) and emulation-triggered faults, i.e. when KVM detects a NPF as part of
emulating an L2 GVA access. For the hardware case, use the bits verbatim
from the VMCB, as KVM is simply forwarding a NPF to L1. For the emulation
case, propagate the GUEST_{PAGE,FINAL} bits from the access field (which
were recently added for MBEC+GMET support).
To differentiate between the two cases, add "hardware_nested_page_fault"
to "struct x86_exception", and set it when injecting a NPF in response to
an NPF exit from L2.
To help guard against future goofs, assert that exactly one of GUEST_PAGE
or GUEST_FINAL is set when injecting a NPF. Unlike VMX, there are no
(known) cases where hardware doesn't set either bit, and KVM should always
set one or the other when emulating a GVA access.
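The "exactly one of the two bits" check can be written as a small bit test. In this Python sketch the bit positions (32 for GUEST_FINAL, 33 for GUEST_PAGE) are assumed and should be checked against the kernel headers, and the function name is hypothetical.

```python
PFERR_GUEST_FINAL_MASK = 1 << 32  # assumed bit position
PFERR_GUEST_PAGE_MASK = 1 << 33   # assumed bit position


def exactly_one_guest_bit(error_code):
    """True if exactly one of GUEST_PAGE or GUEST_FINAL is set."""
    bits = error_code & (PFERR_GUEST_FINAL_MASK | PFERR_GUEST_PAGE_MASK)
    # Non-zero and a power of two means exactly one bit is set.
    return bits != 0 and (bits & (bits - 1)) == 0
```

Other error-code bits are masked off first, so they do not affect the result.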
nvme-multipath: enable PCI P2PDMA for multipath devices
NVMe multipath does not expose BLK_FEAT_PCI_P2PDMA on the head disk
even when all underlying controllers support it.
Set BLK_FEAT_PCI_P2PDMA unconditionally in nvme_mpath_alloc_disk()
alongside the other features. nvme_update_ns_info_block() already
calls queue_limits_stack_bdev() to stack each path's limits onto the
head disk, which routes through blk_stack_limits(). The core now
clears BLK_FEAT_PCI_P2PDMA automatically if any path (e.g., FC) does
not support it, consistent with how BLK_FEAT_NOWAIT and BLK_FEAT_POLL
are handled.
md: propagate BLK_FEAT_PCI_P2PDMA from member devices to RAID device
MD RAID does not propagate BLK_FEAT_PCI_P2PDMA from member devices to
the RAID device, preventing peer-to-peer DMA through the RAID layer even
when all underlying devices support it.
Enable BLK_FEAT_PCI_P2PDMA unconditionally in raid0, raid1 and raid10
personalities during queue limits setup. blk_stack_limits() clears it
automatically if any member device lacks support, consistent with how
BLK_FEAT_NOWAIT and BLK_FEAT_POLL are handled in the block core.
Parity RAID personalities (raid4/5/6) are excluded because they require
CPU access to data pages for parity computation, which is incompatible
with P2P mappings.
Tested with RAID0/1/10 arrays containing multiple NVMe devices with
P2PDMA support, confirming that peer-to-peer transfers work correctly
through the RAID layer.
block: clear BLK_FEAT_PCI_P2PDMA in blk_stack_limits() for non-supporting devices
BLK_FEAT_NOWAIT and BLK_FEAT_POLL are cleared in blk_stack_limits()
when an underlying device does not support them. Apply the same
treatment to BLK_FEAT_PCI_P2PDMA: stacking drivers set it
unconditionally and rely on the core to clear it whenever a
non-supporting member device is stacked.
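The set-then-clear scheme shared by the three P2PDMA changes above can be modelled with a few bit flags. This is a Python sketch, not blk_stack_limits() itself: the bit values and function names are hypothetical, and only the features named in the messages are modelled.

```python
BLK_FEAT_NOWAIT = 1 << 0      # hypothetical bit values
BLK_FEAT_POLL = 1 << 1
BLK_FEAT_PCI_P2PDMA = 1 << 2
STACK_CLEARED = BLK_FEAT_NOWAIT | BLK_FEAT_POLL | BLK_FEAT_PCI_P2PDMA


def stack_features(top, member):
    """Clear every stack-cleared feature on top that the member lacks."""
    return top & ~(STACK_CLEARED & ~member)


def stacked_device_features(members):
    """Start with everything set, then stack each member's features."""
    features = STACK_CLEARED  # the stacking driver sets them unconditionally
    for member in members:
        features = stack_features(features, member)
    return features
```

A single member without P2PDMA support, such as the FC path mentioned above, is enough to clear the flag on the stacked device.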
KVM: x86: Tell ->inject_page_fault() whether or not a fault came from hardware
When injecting a page fault (including nested TDP faults into L1), tell the
injection routine whether or not the fault originated in hardware, i.e. if
KVM is effectively forwarding a fault it intercepted. For nested TDP fault
injection, KVM needs to grab PAGE_WALK vs. GUEST_FINAL information from the
VMCB/VMCS, _if_ the fault originated in hardware.
Note, simply checking whether or not the original exit was due to a #NPF or
EPT Violation isn't sufficient/correct, as the fault being synthesized for
L1 may or may not be the "same" fault that triggered a VM-Exit from L2.
E.g. if access to emulated MMIO in L2 hits a !PRESENT fault (EPT Violation
or #NPF), e.g. because MMIO caching is disabled or it's the first time the
GPA has been accessed by L2, then KVM will enter the emulator. If
emulating the MMIO instruction then hits a nested TDP fault, e.g. because
L2 was accessing MMIO with a MOVSQ (memory-to-memory move), or because L1
has since unmapped the code stream, then the TDP fault synthesized to L1
will not be the same fault that triggered the VM-Exit.
No functional change intended (nothing uses the new param, yet...).
Yan Zhao [Thu, 30 Apr 2026 01:50:01 +0000 (09:50 +0800)]
x86/tdx: Drop exported function tdx_quirk_reset_page()
KVM invokes tdx_quirk_reset_page() to reset TDX control pages (including
S-EPT pages, TDR page, etc.), as all those pages are allocated by KVM TDX
and thus always have struct page.
However, it's also reasonable for KVM to reset those TDX control pages via
tdx_quirk_reset_paddr() directly, eliminating the need to export two
parallel APIs. Keeping tdx_quirk_reset_page() as a one-line helper in the
header file is also unnecessary.
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Acked-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Ackerley Tng <ackerleytng@google.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20260430015001.24242-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
x86/tdx: Use PFN directly for unmapping guest private memory
Remove struct page assumptions/constraints in APIs for unmapping guest
private memory and have them take physical address directly.
Having core TDX make assumptions that guest private memory must be backed
by struct page (and/or folio) will create subtle dependencies on how
KVM/guest_memfd allocates/manages memory (e.g., whether it uses memory
allocated from core MM, if the memory is refcounted, or if the folio is
split) that are easily avoided [1].
KVM's MMUs work with PFNs. This is very much an intentional design choice.
It ensures that the KVM MMUs remain flexible and are not too tightly tied
to the regular CPU MMUs and the kernel code around them. Using
"struct page" for TDX guest memory is not a good fit anywhere near the KVM
MMU code [2].
Therefore, for unmapping guest private memory: export
tdx_quirk_reset_paddr() for direct KVM invocation, and convert the SEAMCALL
wrapper API tdh_phymem_page_wbinvd_hkid() to take PFN as input (thus
updating mk_keyed_paddr() and tdh_phymem_page_wbinvd_tdr()).
Intentionally have KVM pass PAGE_SIZE (rather than KVM_HPAGE_SIZE(level))
to tdx_quirk_reset_paddr() in tdx_sept_remove_private_spte() to avoid
mixing in huge page changes. The KVM_BUG_ON() check for !PG_LEVEL_4K in
tdx_sept_remove_private_spte() justifies using PAGE_SIZE.
Do not convert tdx_reclaim_page() to use PFN as input since it currently
does not remove guest private memory.
Use "kvm_pfn_t pfn" for type safety. Using this KVM type is appropriate
since APIs tdh_phymem_page_wbinvd_hkid() and tdx_quirk_reset_paddr() are
exported to KVM only.
[Yan: Use kvm_pfn_t, exclude tdx_reclaim_page(), use tdx_quirk_reset_paddr()]
x86/tdx: Use PFN directly for mapping guest private memory
Remove struct page assumptions/constraints in the SEAMCALL wrapper APIs for
mapping guest private memory and have them take PFN directly.
Having core TDX make assumptions that guest private memory must be backed
by struct page (and/or folio) will create subtle dependencies on how
KVM/guest_memfd allocates/manages memory (e.g., whether it uses memory
allocated from core MM, if the memory is refcounted, or if the folio is
split) that are easily avoided [1].
KVM's MMUs work with PFNs. This is very much an intentional design choice.
It ensures that the KVM MMUs remain flexible and are not too tightly tied
to the regular CPU MMUs and the kernel code around them. Using 'struct page'
for TDX guest memory is not a good fit anywhere near the KVM MMU code [2].
Use "kvm_pfn_t pfn" for type safety. Using this KVM type is appropriate
since APIs tdh_mem_page_add() and tdh_mem_page_aug() are exported to KVM
only.