net: ethernet: ravb: Suspend and resume the transmission flow
The current driver does not follow the latest datasheet and does not
suspend the flow when stopping DMA and resume it when starting. Update
the driver to do so.
Jakub Kicinski [Fri, 3 Apr 2026 23:02:31 +0000 (16:02 -0700)]
Merge branch 'net-stmmac-fix-tegra234-mgbe-clock'
Jon Hunter says:
====================
net: stmmac: Fix Tegra234 MGBE clock
The name of the PTP ref clock for the Tegra234 MGBE ethernet controller
does not match the generic name in the stmmac platform driver. Despite
this basic ethernet is functional on the Tegra234 platforms that use
this driver and as far as I know, we have not tested PTP support with
this driver. Hence, the risk of breaking any functionality is low.
The previous attempt to fix this in the stmmac platform driver, by
supporting the Tegra234 PTP clock name, was rejected [0]. The preference
from the netdev maintainers is to fix this in the DT binding for
Tegra234.
This series fixes this by correcting the device-tree binding to align
with the generic name for the PTP clock. I understand that this is
breaking the ABI for this device, which we should never do, but this
is a last resort for getting this fixed. I am open to any better ideas
to fix this. Please note that we still maintain backward compatibility
in the driver to allow older device-trees to work, but we don't
advertise this via the binding, because I did not see any value in doing
so.
====================
Jon Hunter [Wed, 1 Apr 2026 10:29:40 +0000 (11:29 +0100)]
dt-bindings: net: Fix Tegra234 MGBE PTP clock
The PTP clock for the Tegra234 MGBE device is incorrectly named
'ptp-ref' and should be 'ptp_ref'. This is causing the following
warning to be observed on Tegra234 platforms that use this device:
Although this constitutes an ABI breakage in the binding for this
device, PTP support has clearly never worked and so fix this now
so we can correct the device-tree for this device. Note that the
MGBE driver still supports the legacy 'ptp-ref' clock name and so
older/existing device-trees will still work, but given that this
is not the correct name, there is no point to advertise this in the
binding.
Fixes: 189c2e5c7669 ("dt-bindings: net: Add Tegra234 MGBE") Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Link: https://patch.msgid.link/20260401102941.17466-3-jonathanh@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jon Hunter [Wed, 1 Apr 2026 10:29:39 +0000 (11:29 +0100)]
net: stmmac: Fix PTP ref clock for Tegra234
Since commit 030ce919e114 ("net: stmmac: make sure that ptp_rate is not
0 before configuring timestamping") was added the following error is
observed on Tegra234:
It turns out that the Tegra234 device-tree binding defines the PTP ref
clock name as 'ptp-ref' and not 'ptp_ref' and the above commit now
exposes this and that the PTP clock is not configured correctly.
In order to update device-tree to use the correct 'ptp_ref' name, update
the Tegra MGBE driver to use 'ptp_ref' by default and fallback to using
'ptp-ref' if this clock name is present.
Fixes: d8ca113724e7 ("net: stmmac: tegra: Add MGBE support") Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260401102941.17466-2-jonathanh@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
nfc: s3fwrn5: allocate rx skb before consuming bytes
s3fwrn82_uart_read() reports the number of accepted bytes to the serdev
core. The current code consumes bytes into recv_skb and may already
deliver a complete frame before allocating a fresh receive buffer.
If that alloc_skb() fails, the callback returns 0 even though it has
already consumed bytes, and it leaves recv_skb as NULL for the next
receive callback. That breaks the receive_buf() accounting contract and
can also lead to a NULL dereference on the next skb_put_u8().
Allocate the receive skb lazily before consuming the next byte instead.
If allocation fails, return the number of bytes already accepted.
AMD defines Extended Interrupt Local Vector Table (EILVT) registers to allow
for additional interrupt sources. While the APIC registers for those are
unique to AMD, the format of those registers follows the standard LVT
registers. Drop EILVT-specific macros in favor of the standard APIC
LVT macros.
Drop unused APIC_EILVT_NR_AMD_K8 and APIC_EILVT_LVTOFF while at it.
In configurations with multiple tunnel layers and MPLS lwtunnel routing, a
single tunnel hop can increment the counter beyond this limit. This causes
packets to be dropped with the "Dead loop on virtual device" message even
when a routing loop doesn't exist.
Increase IP_TUNNEL_RECURSION_LIMIT from 4 to 5 to handle this use-case.
====================
net: macb: Remove dedicated IRQ handler for WoL
During debugging of a suspend/resume issue, I observed that the macb driver
employs a dedicated IRQ handler for Wake-on-LAN (WoL) support. To my knowledge,
no other Ethernet driver adopts this approach. This implementation unnecessarily
complicates the suspend/resume process without providing any clear benefit.
Instead, we can easily modify the existing IRQ handler to manage WoL events,
avoiding any overhead in the TX/RX hot path.
The net throughput shows no significant difference following these changes.
The following data(net throughput and execution time of macb_interrupt) were
collected from my AMD Zynqmp board using the commands:
taskset -c 1,2,3 iperf3 -c 192.168.3.4 -t 60 -Z -P 3 -R
cat /sys/kernel/debug/tracing/trace_stat/function0
Kevin Hao [Thu, 2 Apr 2026 13:41:25 +0000 (21:41 +0800)]
net: macb: Remove dedicated IRQ handler for WoL
In the current implementation, the suspend/resume path frees the
existing IRQ handler and sets up a dedicated WoL IRQ handler, then
restores the original handler upon resume. This approach is not used
by any other Ethernet driver and unnecessarily complicates the
suspend/resume process. After adjusting the IRQ handler in the previous
patches, we can now handle WoL interrupts without introducing any
overhead in the TX/RX hot path. Therefore, the dedicated WoL IRQ
handler is removed.
I have verified WoL functionality on my AMD ZynqMP board using the
following steps:
root@amd-zynqmp:~# ifconfig end0 192.168.3.3
root@amd-zynqmp:~# ethtool -s end0 wol a
root@amd-zynqmp:~# echo mem >/sys/power/state
Kevin Hao [Thu, 2 Apr 2026 13:41:24 +0000 (21:41 +0800)]
net: macb: Factor out the handling of non-hot IRQ events into a separate function
In the current code, the IRQ handler checks each IRQ event sequentially.
Since most IRQ events are related to TX/RX operations, while other
events occur infrequently, this approach introduces unnecessary overhead
in the hot path for TX/RX processing. This patch reduces such overhead
by extracting the handling of all non-TX/RX events into a new function
and consolidating these events under a new flag. As a result, only a
single check is required to determine whether any non-TX/RX events have
occurred. If such events exist, the handler jumps to the new function.
This optimization reduces four conditional checks to one and prevents
the instruction cache from being polluted with rarely used code in the
hot path.
Kevin Hao [Thu, 2 Apr 2026 13:41:23 +0000 (21:41 +0800)]
net: macb: Introduce macb_queue_isr_clear() helper function
The current implementation includes several occurrences of the
following pattern:
if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
queue_writel(queue, ISR, value);
Introduces a helper function to consolidate these repeated code
segments. No functional changes are made.
Kevin Hao [Thu, 2 Apr 2026 13:41:22 +0000 (21:41 +0800)]
net: macb: Replace open-coded implementation with napi_schedule()
The driver currently duplicates the logic of napi_schedule() primarily
to include additional debug information. However, these debug details
are not essential for a specific driver and can be effectively obtained
through existing tracepoints in the networking core, such as
/sys/kernel/tracing/events/napi/napi_poll. Therefore, this patch
replaces the open-coded implementation with napi_schedule() to
simplify the driver's code.
Danilo Krummrich [Tue, 24 Mar 2026 00:59:14 +0000 (01:59 +0100)]
s390/ap: use generic driver_override infrastructure
When the AP masks are updated via apmask_store() or aqmask_store(),
ap_bus_revise_bindings() is called after ap_attr_mutex has been
released.
This calls __ap_revise_reserved(), which accesses the driver_override
field without holding any lock, racing against a concurrent
driver_override_store() that may free the old string, resulting in a
potential UAF.
Fix this by using the driver-core driver_override infrastructure, which
protects all accesses with an internal spinlock.
Note that unlike most other buses, the AP bus does not check
driver_override in its match() callback; the override is checked in
ap_device_probe() and __ap_revise_reserved() instead.
Also note that we do not enable the driver_override feature of struct
bus_type, as AP - in contrast to most other buses - passes "" to
sysfs_emit() when the driver_override pointer is NULL. Thus, printing
"\n" instead of "(null)\n".
Additionally, AP has a custom counter that is modified in the
corresponding custom driver_override_store().
Fixes: d38a87d7c064 ("s390/ap: Support driver_override for AP queue devices") Tested-by: Holger Dengler <dengler@linux.ibm.com> Reviewed-by: Holger Dengler <dengler@linux.ibm.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Link: https://patch.msgid.link/20260324005919.2408620-11-dakr@kernel.org Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Danilo Krummrich [Tue, 24 Mar 2026 00:59:13 +0000 (01:59 +0100)]
s390/cio: use generic driver_override infrastructure
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held, thus accessing the
driver_override field without a lock, which can cause a UAF.
Fix this by using the driver-core driver_override infrastructure taking
care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Danilo Krummrich [Tue, 24 Mar 2026 00:59:12 +0000 (01:59 +0100)]
vdpa: use generic driver_override infrastructure
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held, thus accessing the
driver_override field without a lock, which can cause a UAF.
Fix this by using the driver-core driver_override infrastructure taking
care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Yiqi Sun [Thu, 2 Apr 2026 07:04:19 +0000 (15:04 +0800)]
ipv4: icmp: fix null-ptr-deref in icmp_build_probe()
ipv6_stub->ipv6_dev_find() may return ERR_PTR(-EAFNOSUPPORT) when the
IPv6 stack is not active (CONFIG_IPV6=m and not loaded), and passing
this error pointer to dev_hold() will cause a kernel crash with
null-ptr-deref.
Instead, silently discard the request. RFC 8335 does not appear to
define a specific response for the case where an IPv6 interface
identifier is syntactically valid but the implementation cannot perform
the lookup at runtime, and silently dropping the request may safer than
misreporting "No Such Interface".
Danilo Krummrich [Tue, 24 Mar 2026 00:59:10 +0000 (01:59 +0100)]
platform/wmi: use generic driver_override infrastructure
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held, thus accessing the
driver_override field without a lock, which can cause a UAF.
Fix this by using the driver-core driver_override infrastructure taking
care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Danilo Krummrich [Tue, 24 Mar 2026 00:59:09 +0000 (01:59 +0100)]
PCI: use generic driver_override infrastructure
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held, thus accessing the
driver_override field without a lock, which can cause a UAF.
Fix this by using the driver-core driver_override infrastructure taking
care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ Reported-by: Gui-Dong Han <hanguidong02@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789 Fixes: 782a985d7af2 ("PCI: Introduce new device binding path using pci_dev.driver_override") Acked-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Alex Williamson <alex@shazbot.org> Tested-by: Gui-Dong Han <hanguidong02@gmail.com> Reviewed-by: Gui-Dong Han <hanguidong02@gmail.com> Link: https://patch.msgid.link/20260324005919.2408620-6-dakr@kernel.org Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Qingfang Deng [Thu, 2 Apr 2026 05:00:50 +0000 (13:00 +0800)]
ppp: update Kconfig help message
Both links of the PPPoE section are no longer valid, and the CVS version
is no longer relevant.
- Replace the TLDP URL with the pppd project homepage.
- Update pppd version requirement for PPPoE.
- Update RP-PPPoE project homepage, and clarify that it's only needed
for server mode.
ipv4: nexthop: allocate skb dynamically in rtm_get_nexthop()
When querying a nexthop object via RTM_GETNEXTHOP, the kernel currently
allocates a fixed-size skb using NLMSG_GOODSIZE. While sufficient for
single nexthops and small Equal-Cost Multi-Path groups, this fixed
allocation fails for large nexthop groups like 512 nexthops.
Fix this by allocating the size dynamically using nh_nlmsg_size() and
using nlmsg_new(), this is consistent with nexthop_notify() behavior. In
addition, adjust nh_nlmsg_size_grp() so it calculates the size needed
based on flags passed. While at it, also add the size of NHA_FDB for
nexthop group size calculation as it was missing too.
This cannot be reproduced via iproute2 as the group size is currently
limited and the command fails as follows:
addattr_l ERROR: message exceeded bound of 1048
Fixes: 430a049190de ("nexthop: Add support for nexthop groups") Reported-by: Yiming Qian <yimingqian591@gmail.com> Closes: https://lore.kernel.org/netdev/CAL_bE8Li2h4KO+AQFXW4S6Yb_u5X4oSKnkywW+LPFjuErhqELA@mail.gmail.com/ Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260402072613.25262-2-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv4: nexthop: avoid duplicate NHA_HW_STATS_ENABLE on nexthop group dump
Currently NHA_HW_STATS_ENABLE is included twice everytime a dump of
nexthop group is performed with NHA_OP_FLAG_DUMP_STATS. As all the stats
querying were moved to nla_put_nh_group_stats(), leave only that
instance of the attribute querying.
Fixes: 5072ae00aea4 ("net: nexthop: Expose nexthop group HW stats to user space") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260402072613.25262-1-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
driver core: make software nodes available earlier
Software nodes are currently initialized in a function registered as
a postcore_initcall(). However, some devices may want to register
software nodes earlier than that (or also in a postcore_initcall() where
they're at the mercy of the link order). Move the initialization to
driver_init() making swnode available much earlier as well as making
their initialization time deterministic.
Suggested-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/20260402-nokia770-gpio-swnodes-v5-3-d730db3dd299@oss.qualcomm.com
[ Fix typo in the commit message: "s/merci/mercy/". - Danilo ] Signed-off-by: Danilo Krummrich <dakr@kernel.org>
net: qualcomm: qca_uart: report the consumed byte on RX skb allocation failure
qca_tty_receive() consumes each input byte before checking whether a
completed frame needs a fresh receive skb. When the current byte completes
a frame, the driver delivers that frame and then allocates a new skb for
the next one.
If that allocation fails, the current code returns i even though data[i]
has already been consumed and may already have completed the delivered
frame. Since serdev interprets the return value as the number of accepted
bytes, this under-reports progress by one byte and can replay the final
byte of the completed frame into a fresh parser state on the next call.
Return i + 1 in that failure path so the accepted-byte count matches the
actual receive-state progress.
Fixes: dfc768fbe618 ("net: qualcomm: add QCA7000 UART driver") Cc: stable@vger.kernel.org Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn> Reviewed-by: Stefan Wahren <wahrenst@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260402071207.4036-1-pengpeng@iscas.ac.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Oleh Konko [Thu, 2 Apr 2026 09:48:57 +0000 (09:48 +0000)]
tipc: fix bc_ackers underflow on duplicate GRP_ACK_MSG
The GRP_ACK_MSG handler in tipc_group_proto_rcv() currently decrements
bc_ackers on every inbound group ACK, even when the same member has
already acknowledged the current broadcast round.
Because bc_ackers is a u16, a duplicate ACK received after the last
legitimate ACK wraps the counter to 65535. Once wrapped,
tipc_group_bc_cong() keeps reporting congestion and later group
broadcasts on the affected socket stay blocked until the group is
recreated.
Fix this by ignoring duplicate or stale ACKs before touching bc_acked or
bc_ackers. This makes repeated GRP_ACK_MSG handling idempotent and
prevents the underflow path.
Fixes: 2f487712b893 ("tipc: guarantee that group broadcast doesn't bypass group unicast") Cc: stable@vger.kernel.org Signed-off-by: Oleh Konko <security@1seal.org> Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/41a4833f368641218e444fdcff822039.security@1seal.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rtnetlink: add missing netlink_ns_capable() check for peer netns
rtnl_newlink() lacks a CAP_NET_ADMIN capability check on the peer
network namespace when creating paired devices (veth, vxcan,
netkit). This allows an unprivileged user with a user namespace
to create interfaces in arbitrary network namespaces, including
init_net.
Add a netlink_ns_capable() check for CAP_NET_ADMIN in the peer
namespace before allowing device creation to proceed.
Fixes: 81adee47dfb6 ("net: Support specifying the network namespace upon device creation.") Signed-off-by: Nikolaos Gkarlis <nickgarlis@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260402181432.4126920-1-nickgarlis@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
selftests: drv-net: gro: more test cases
Add a few more test cases for GRO.
First 4 patches are unchanged from v1.
Patches 5 and 6 are new. Willem pointed out that the defines are
duplicated and all these imprecise defines have been annoying me
for a while so I decided to clean them up.
With the defines cleaned up and now more precise patch 7 (was 5)
no longer has to play any games with the MTU for ip6ip6.
The last patch now sends 3 segments as requested.
====================
Jakub Kicinski [Thu, 2 Apr 2026 20:59:59 +0000 (13:59 -0700)]
selftests: drv-net: gro: test ip6ip6
We explicitly test ipip encap. Let's add ip6ip6, too. Having
just ipip seems like favoring IPv4 which we should not do :)
Testing all combinations is left for future work, not sure
it's actually worth it.
Jakub Kicinski [Thu, 2 Apr 2026 20:59:58 +0000 (13:59 -0700)]
selftests: drv-net: gro: make large packet math more precise
When constructing the packets for large_* test cases we use
a static value for packet count and MSS. It works okay for
ipv4 vs ipv6 but the gap between ipv4 and ip6ip6 is going to
be quite significant.
Make the defines calculate the worst case values, those
are only used for sizing stack arrays. Create helpers for
calculating precise values based on the exact test case.
Jakub Kicinski [Thu, 2 Apr 2026 20:59:57 +0000 (13:59 -0700)]
selftests: drv-net: gro: remove TOTAL_HDR_LEN
Willem points out TOTAL_HDR_LEN is identical to MAX_HDR_LEN.
This seems to have been the case ever since the test was added.
Replace the uses of TOTAL_HDR_LEN with MAX_HDR_LEN, MAX seems
more common for what this value is.
Jakub Kicinski [Thu, 2 Apr 2026 20:59:56 +0000 (13:59 -0700)]
selftests: drv-net: gro: prepare for ip6ip6 support
Try to use already calculated offsets and not depend on the ipip
flag as much. This patch should not change any functionality,
it's just a cleanup to make ip6ip6 support easier.
Jakub Kicinski [Thu, 2 Apr 2026 20:59:55 +0000 (13:59 -0700)]
selftests: drv-net: gro: always wait for FIN in the capacity test
The new capacity/order test exits as soon as it sees the expected
packet sequence. This may allow the "flushing" FIN packet to spill
over to the next test. Let's always wait for the FIN before exiting.
Jakub Kicinski [Thu, 2 Apr 2026 20:59:54 +0000 (13:59 -0700)]
selftests: drv-net: gro: add 1 byte payload test
Small IPv4 packets get padded to 60B, this may break / confuse
some buggy implementations. Add a test to coalesce a 1B payload.
Keep this separate from the lrg_sml test because I suspect some
implementations may not handle this case (treat padded frames
as ineligible for coalescing).
Jakub Kicinski [Thu, 2 Apr 2026 20:59:53 +0000 (13:59 -0700)]
selftests: drv-net: gro: add data burst test case
Add a test trying to induce a GRO context timeout followed
by another sequence of packets for the same flow. The second
burst arrives 100ms after the first one so any implementation
(SW or HW) must time out waiting at that point. We expect both
bursts to be aggregated successfully but separately.
Richard Zhu [Tue, 31 Mar 2026 08:52:52 +0000 (16:52 +0800)]
PCI: imx6: Keep Root Port MSI capability with iMSI-RX to work around hardware bug
On NXP i.MX7D, i.MX8MM, and i.MX8MQ chipsets, MSIs from the endpoints won't
be received by the iMSI-RX MSI controller if the Root Port MSI capability
is disabled.
Even though the Root Port MSIs won't be received by the iMSI-RX controller
due to design, these chipsets have some weird hardware bug that prevents
the endpoint MSIs from reaching when the Root Port MSI capability is
disabled.
Hence, introduce a new flag, 'dw_pcie_rp::keep_rp_msi_en', set it for the
above mentioned SoCs, and always keep the Root Port MSI capability when
this flag is set.
Note that by keeping Root Port MSI capability, Root Port MSIs such as AER,
PME and others won't be received by default. So users need to use
workarounds such as passing 'pcie_pme=nomsi' cmdline param.
Fixes: f5cd8a929c825 ("PCI: dwc: Remove MSI/MSIX capability for Root Port if iMSI-RX is used as MSI controller") Suggested-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
[mani: commit log] Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
[bhelgaas: fix typos] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260331085252.1243108-1-hongxing.zhu@nxp.com
bridge: guard local VLAN-0 FDB helpers against NULL vlan group
When CONFIG_BRIDGE_VLAN_FILTERING is not set, br_vlan_group() and
nbp_vlan_group() return NULL (br_private.h stub definitions). The
BR_BOOLOPT_FDB_LOCAL_VLAN_0 toggle code is compiled unconditionally and
reaches br_fdb_delete_locals_per_vlan_port() and
br_fdb_insert_locals_per_vlan_port(), where the NULL vlan group pointer
is dereferenced via list_for_each_entry(v, &vg->vlan_list, vlist).
The observed crash is in the delete path, triggered when creating a
bridge with IFLA_BR_MULTI_BOOLOPT containing BR_BOOLOPT_FDB_LOCAL_VLAN_0
via RTM_NEWLINK. The insert helper has the same bug pattern.
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000056: 0000 [#1] KASAN NOPTI
KASAN: null-ptr-deref in range [0x00000000000002b0-0x00000000000002b7]
RIP: 0010:br_fdb_delete_locals_per_vlan+0x2b9/0x310
Call Trace:
br_fdb_toggle_local_vlan_0+0x452/0x4c0
br_toggle_fdb_local_vlan_0+0x31/0x80 net/bridge/br.c:276
br_boolopt_toggle net/bridge/br.c:313
br_boolopt_multi_toggle net/bridge/br.c:364
br_changelink net/bridge/br_netlink.c:1542
br_dev_newlink net/bridge/br_netlink.c:1575
Add NULL checks for the vlan group pointer in both helpers, returning
early when there are no VLANs to iterate. This matches the existing
pattern used by other bridge FDB functions such as br_fdb_add() and
br_fdb_delete().
Fixes: 21446c06b441 ("net: bridge: Introduce UAPI for BR_BOOLOPT_FDB_LOCAL_VLAN_0") Signed-off-by: Zijing Yin <yzjaurora@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402140153.3925663-1-yzjaurora@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: airoha: Fix memory leak in airoha_qdma_rx_process()
If an error occurs on the subsequents buffers belonging to the
non-linear part of the skb (e.g. due to an error in the payload length
reported by the NIC or if we consumed all the available fragments for
the skb), the page_pool fragment will not be linked to the skb so it will
not return to the pool in the airoha_qdma_rx_process() error path. Fix the
memory leak partially reverting commit 'd6d2b0e1538d ("net: airoha: Fix
page recycling in airoha_qdma_rx_process()")' and always running
page_pool_put_full_page routine in the airoha_qdma_rx_process() error
path.
When CONFIG_FIXED_PHY is in a loadable module, the fec driver cannot be
built-in any more:
x86_64-linux-ld: vmlinux.o: in function `fec_enet_mii_probe':
fec_main.c:(.text+0xc4f367): undefined reference to `fixed_phy_unregister'
x86_64-linux-ld: vmlinux.o: in function `fec_enet_close':
fec_main.c:(.text+0xc59591): undefined reference to `fixed_phy_unregister'
x86_64-linux-ld: vmlinux.o: in function `fec_enet_mii_probe.cold':
Select the fixed phy support on all targets to make this build
correctly, not just on coldfire.
Notat that Essentially the stub helpers in include/linux/phy_fixed.h
cannot be used correctly because of this build time dependency,
and we could just remove them to hit the build failure more often
when a driver uses them without the 'select FIXED_PHY'.
Fixes: dc86b621e1b4 ("net: fec: register a fixed phy using fixed_phy_register_100fd if needed") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260402141048.2713445-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gerd Bayer [Mon, 30 Mar 2026 13:09:45 +0000 (15:09 +0200)]
PCI: Enable AtomicOps only if Root Port supports them
When inspecting the config space of a Connect-X physical function in an
s390 system after it was initialized by the mlx5_core device driver, we
found the function to be enabled to request AtomicOps despite the Root Port
lacking support for completing them:
On s390 and many virtualized guests, the Endpoint is visible but the Root
Port is not. In this case, pci_enable_atomic_ops_to_root() previously
enabled AtomicOps in the Endpoint even though it can't tell whether the
Root Port supports them as a completer.
Change pci_enable_atomic_ops_to_root() to fail if there's no Root Port or
the Root Port doesn't support AtomicOps.
Fixes: 430a23689dea ("PCI: Add pci_enable_atomic_ops_to_root()") Reported-by: Alexander Schmidt <alexs@linux.ibm.com> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
[bhelgaas: commit log, check RP first to simplify flow] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/20260330-fix_pciatops-v7-2-f601818417e8@linux.ibm.com
Gerd Bayer [Mon, 30 Mar 2026 13:09:44 +0000 (15:09 +0200)]
PCI: Do not enable AtomicOps by RCiEPs
Since Root Complex Integrated Endpoints (RCiEPs) attach to a bus that has
no bridge device describing the Root Port, the capability to complete
AtomicOps requests cannot be determined with PCIe methods.
Change default of pci_enable_atomic_ops_to_root() to not enable AtomicOps
requests on RCiEPs.
As far as we know, there are no RCiEPs that need AtomicOps (see Link
below).
The clocks for qcom-ethqos return a rate of zero as firmware manages
their rate. According to hardware documentation, the clock which is
fed to the slave AHB interface can range between 50 to 100MHz for
non-RGMII and 30 to 75MHz for boards with a RGMII interfaces.
Currently, stmmac uses an undefined divisor value. Instead, use
STMMAC_CSR_60_100M which will mean we meet IEEE 802.3 specification
since this will generate:
Unfortunately, this divisor makes the upper bound of both ranges exeed
the IEEE 802.3 specification, and thus we can not use it without knowing
for certain what the current CSR clock rate actually is.
So, STMMAC_CSR_60_100M is the best fit for all boards based on the
information provided thus far.
stmmac: cleanup dead dependencies on STMMAC_PLATFORM and STMMAC_ETH in Kconfig
There are already 'if STMMAC_ETH' and 'STMMAC_PLATFORM'
conditions wrapping these config options, making the
'depends on' statements duplicate dependencies (dead code).
I propose leaving the outer 'if STMMAC_PLATFORM...endif' and
'if STMMAC_ETH...endif' conditions, and removing the
individual 'depends on' statements.
This dead code was found by kconfirm, a static analysis tool for Kconfig.
tcf_csum_act() walks nested VLAN headers directly from skb->data when an
skb still carries in-payload VLAN tags. The current code reads
vlan->h_vlan_encapsulated_proto and then pulls VLAN_HLEN bytes without
first ensuring that the full VLAN header is present in the linear area.
If only part of an inner VLAN header is linearized, accessing
h_vlan_encapsulated_proto reads past the linear area, and the following
skb_pull(VLAN_HLEN) may violate skb invariants.
Fix this by requiring pskb_may_pull(skb, VLAN_HLEN) before accessing and
pulling each nested VLAN header. If the header still is not fully
available, drop the packet through the existing error path.
Fixes: 2ecba2d1e45b ("net: sched: act_csum: Fix csum calc for tagged packets") Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Suggested-by: Xin Liu <bird@lzu.edu.cn> Tested-by: Ren Wei <enjou1224z@gmail.com> Signed-off-by: Ruide Cao <caoruide123@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/22df2fcb49f410203eafa5d97963dd36089f4ecf.1774892775.git.caoruide123@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Miguel Ojeda [Tue, 31 Mar 2026 20:58:49 +0000 (22:58 +0200)]
rust: macros: simplify `format!` arguments
Clippy in Rust 1.88.0 (only) reported [1] up to the previous commit:
warning: variables can be used directly in the `format!` string
--> rust/macros/module.rs:112:23
|
112 | let content = format!("{param}:{content}", param = param, content = content);
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#uninlined_format_args
= note: `-W clippy::uninlined-format-args` implied by `-W clippy::all`
= help: to override `-W clippy::all` add `#[allow(clippy::uninlined_format_args)]`
help: change this to
|
112 - let content = format!("{param}:{content}", param = param, content = content);
112 + let content = format!("{param}:{content}");
The reason it only triggers in that version is that the lint was moved
from `pedantic` to `style` in Rust 1.88.0 and then back to `pedantic`
in Rust 1.89.0 [2][3].
In this case, the suggestion is fair and a pure simplification, thus
just apply it.
In addition, do the same for another place in the file that Clippy does
not report because it is multi-line.
Paul Moore [Thu, 1 Jan 2026 22:19:18 +0000 (17:19 -0500)]
selinux: fix overlayfs mmap() and mprotect() access checks
The existing SELinux security model for overlayfs is to allow access if
the current task is able to access the top level file (the "user" file)
and the mounter's credentials are sufficient to access the lower
level file (the "backing" file). Unfortunately, the current code does
not properly enforce these access controls for both mmap() and mprotect()
operations on overlayfs filesystems.
This patch makes use of the newly created security_mmap_backing_file()
LSM hook to provide the missing backing file enforcement for mmap()
operations, and leverages the backing file API and new LSM blob to
provide the necessary information to properly enforce the mprotect()
access controls.
Cc: stable@vger.kernel.org Acked-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
Paul Moore [Fri, 19 Dec 2025 18:18:22 +0000 (13:18 -0500)]
lsm: add backing_file LSM hooks
Stacked filesystems such as overlayfs do not currently provide the
necessary mechanisms for LSMs to properly enforce access controls on the
mmap() and mprotect() operations. In order to resolve this gap, a LSM
security blob is being added to the backing_file struct and the following
new LSM hooks are being created:
The first two hooks are to manage the lifecycle of the LSM security blob
in the backing_file struct, while the third provides a new mmap() access
control point for the underlying backing file. It is also expected that
LSMs will likely want to update their security_file_mprotect() callback
to address issues with their mprotect() controls, but that does not
require a change to the security_file_mprotect() LSM hook.
There are a three other small changes to support these new LSM hooks:
* Pass the user file associated with a backing file down to
alloc_empty_backing_file() so it can be included in the
security_backing_file_alloc() hook.
* Add getter and setter functions for the backing_file struct LSM blob
as the backing_file struct remains private to fs/file_table.c.
* Constify the file struct field in the LSM common_audit_data struct to
better support LSMs that need to pass a const file struct pointer into
the common LSM audit code.
Thanks to Arnd Bergmann for identifying the missing EXPORT_SYMBOL_GPL()
and supplying a fixup.
Cc: stable@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-unionfs@vger.kernel.org Cc: linux-erofs@lists.ozlabs.org Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Paul Moore <paul@paul-moore.com>
Amir Goldstein [Mon, 30 Mar 2026 08:27:51 +0000 (10:27 +0200)]
fs: prepare for adding LSM blob to backing_file
In preparation to adding LSM blob to backing_file struct, factor out
helpers init_backing_file() and backing_file_free().
Cc: stable@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-unionfs@vger.kernel.org Cc: linux-erofs@lists.ozlabs.org Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Serge Hallyn <serge@hallyn.com>
[PM: use the term "LSM blob", fix comment style to match file] Signed-off-by: Paul Moore <paul@paul-moore.com>
Dave Jiang [Fri, 3 Apr 2026 19:30:57 +0000 (12:30 -0700)]
Merge branch 'for-7.1/cxl-region-refactor' into cxl-for-next
Refactor CXL core/region code to make region code more manageable by
splitting out DAX and PMEM code from RAM handling code.
cxl/core: use cleanup.h for devm_cxl_add_dax_region
cxl/core/region: move dax region device logic into region_dax.c
cxl/core/region: move pmem region driver logic into region_pmem.c
Dave Jiang [Fri, 3 Apr 2026 19:21:27 +0000 (12:21 -0700)]
Merge branch 'for-7.1/dax-hmem' into cxl-for-next
The series addresses conflicts between HMEM and CXL when handling Soft
Reserved memory ranges. CXL will try best effort in claiming the Soft
Reserved memory region that are CXL regions. If fails, it will punt
back to HMEM.
tools/testing/cxl: Test dax_hmem takeover of CXL regions
tools/testing/cxl: Simulate auto-assembly failure
dax/hmem: Parent dax_hmem devices
dax/hmem: Fix singleton confusion between dax_hmem_work and hmem devices
dax/hmem: Reduce visibility of dax_cxl coordination symbols
cxl/region: Constify cxl_region_resource_contains()
cxl/region: Limit visibility of cxl_region_contains_resource()
dax/cxl: Fix HMEM dependencies
cxl/region: Fix use-after-free from auto assembly failure
dax/hmem, cxl: Defer and resolve Soft Reserved ownership
cxl/region: Add helper to check Soft Reserved containment by CXL regions
dax: Track all dax_region allocations under a global resource tree
dax/cxl, hmem: Initialize hmem early and defer dax_cxl binding
dax/hmem: Gate Soft Reserved deferral on DEV_DAX_CXL
dax/hmem: Request cxl_acpi and cxl_pci before walking Soft Reserved ranges
dax/hmem: Factor HMEM registration into __hmem_register_device()
dax/bus: Use dax_region_put() in alloc_dax_region() error path
Dave Jiang [Fri, 3 Apr 2026 19:18:23 +0000 (12:18 -0700)]
Merge branch 'for-7.1/cxl-type2-support' into cxl-for-next
Prep patches for CXL type2 accelerator basic support
cxl/region: Factor out interleave granularity setup
cxl/region: Factor out interleave ways setup
cxl: Make region type based on endpoint type
cxl/pci: Remove redundant cxl_pci_find_port() call
cxl: Move pci generic code from cxl_pci to core/cxl_pci
cxl: export internal structs for external Type2 drivers
cxl: support Type2 when initializing cxl_dev_state
Merge tag 'sched_ext-for-7.0-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
"These are late but both fix subtle yet critical problems and the blast
radius is limited strictly to sched_ext.
- Fix stale direct dispatch state in ddsp_dsq_id which can cause
spurious warnings in mark_direct_dispatch() on task wakeup
- Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU
configs which can lead to incorrectly dispatching migration-
disabled tasks to remote CPUs"
* tag 'sched_ext-for-7.0-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Fix stale direct dispatch state in ddsp_dsq_id
sched_ext: Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU
Alison Schofield [Sat, 14 Mar 2026 06:19:50 +0000 (23:19 -0700)]
tools/testing/cxl: Enable replay of user regions as auto regions
The cxl_test module currently hard-codes auto regions in the mock
topology, limiting coverage of the driver's region auto-assembly
logic.
Teach cxl_test to replay previously committed decoder programming
across a cxl_acpi unbind/bind cycle. Decoder programming is recorded
in a registry keyed by a stable port identity and decoder id. The
registry is updated on decoder commit and reset events and consulted
during enumeration to restore previously enabled decoders.
This allows regions created through the user interface to be replayed
during enumeration and treated as auto-discovered regions, enabling
testing of region auto-assembly using configurations created in the
cxl_test topology.
The NDCTL CXL unit test, cxl-region-replay.sh, demonstrates the usage.
Co-developed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Co-developed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Alison Schofield <alison.schofield@intel.com> Link: https://patch.msgid.link/20260314061952.2221030-1-alison.schofield@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Pengjie Zhang [Tue, 16 Dec 2025 03:11:53 +0000 (11:11 +0800)]
PM / devfreq: use _visible attribute to replace create/remove_sysfs_files()
Previously, non-generic attributes (polling_interval, timer) used separate
create/delete logic, leading to race conditions during concurrent access in
creation/deletion. Multi-threaded operations also caused inconsistencies
between governor capabilities and attribute states.
1.Use is_visible + sysfs_update_group() to unify management of these
attributes, eliminating creation/deletion races.
2.Add locks and validation to these attributes, ensuring consistency
between current governor capabilities and attribute operations in
multi-threaded environments.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Jie Zhan <zhanjie9@hisilicon.com> Signed-off-by: Pengjie Zhang <zhangpengjie2@huawei.com> Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com> Link: https://www.spinics.net/lists/kernel/msg5967745.html
Since the sensor supports different sampling intervals via
bits CR0 and CR1 from the CONFIG register, add support in
order for the conversion rate to be changed from user space.
Default is 4 conv/sec.
Sergio Melas [Fri, 27 Mar 2026 22:16:02 +0000 (23:16 +0100)]
hwmon: (yogafan) Add support for Lenovo Yoga/Legion fan monitoring
This driver provides fan speed monitoring for Lenovo Yoga, Legion, and
IdeaPad laptops by interfacing with the Embedded Controller (EC) via ACPI.
To address low-resolution sampling in Lenovo EC firmware, a Rate-Limited
Lag (RLLag) filter is implemented. The filter ensures a consistent physical
curve regardless of userspace polling frequency.
Hardware identification is performed via DMI-based quirk tables, which
map specific ACPI object paths and register widths (8-bit vs 16-bit)
deterministically.
7e0ffb72de8a ("sched_ext: Fix stale direct dispatch state in
ddsp_dsq_id")
which clears ddsp state at individual call sites instead of
dispatch_enqueue(), and sub-sched related code reorg and API updates on
for-7.1. Resolved by applying the ddsp fix with for-7.1's signatures.
Software nodes depend on kernel_kobj which is initialized pretty late
into the boot process - as a core_initcall(). Ahead of moving the
software node initialization to driver_init() we must first make
kernel_kobj available before it.
Make ksysfs_init() visible in a new header - ksysfs.h - and call it in
do_basic_setup() right before driver_init().
Merge tag 'spi-fix-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A small collection of fixes, mostly probe/remove issues that are the
result of Felix Gu going and auditing those areas, plus one error
handling fix for the Cadence QSPI driver"
* tag 'spi-fix-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: cadence-qspi: Fix exec_mem_op error handling
spi: amlogic: spifc-a4: unregister ECC engine on probe failure and remove() callback
spi: stm32-ospi: Fix DMA channel leak on stm32_ospi_dma_setup() failure
spi: stm32-ospi: Fix reset control leak on probe error
spi: stm32-ospi: Fix resource leak in remove() callback
Andrea Righi [Fri, 3 Apr 2026 06:57:20 +0000 (08:57 +0200)]
sched_ext: Fix stale direct dispatch state in ddsp_dsq_id
@p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) triggering a
spurious warning in mark_direct_dispatch() when the next wakeup's
ops.select_cpu() calls scx_bpf_dsq_insert(), such as:
WARNING: kernel/sched/ext.c:1273 at scx_dsq_insert_commit+0xcd/0x140
The root cause is that ddsp_dsq_id was only cleared in dispatch_enqueue(),
which is not reached in all paths that consume or cancel a direct dispatch
verdict.
Fix it by clearing it at the right places:
- direct_dispatch(): cache the direct dispatch state in local variables
and clear it before dispatch_enqueue() on the synchronous path. For
the deferred path, the direct dispatch state must remain set until
process_ddsp_deferred_locals() consumes them.
- process_ddsp_deferred_locals(): cache the dispatch state in local
variables and clear it before calling dispatch_to_local_dsq(), which
may migrate the task to another rq.
- do_enqueue_task(): clear the dispatch state on the enqueue path
(local/global/bypass fallbacks), where the direct dispatch verdict is
ignored.
- dequeue_task_scx(): clear the dispatch state after dispatch_dequeue()
to handle both the deferred dispatch cancellation and the holding_cpu
race, covering all cases where a pending direct dispatch is
cancelled.
- scx_disable_task(): clear the direct dispatch state when
transitioning a task out of the current scheduler. Waking tasks may
have had the direct dispatch state set by the outgoing scheduler's
ops.select_cpu() and then been queued on a wake_list via
ttwu_queue_wakelist(), when SCX_OPS_ALLOW_QUEUED_WAKEUP is set. Such
tasks are not on the runqueue and are not iterated by scx_bypass(),
so their direct dispatch state won't be cleared. Without this clear,
any subsequent SCX scheduler that tries to direct dispatch the task
will trigger the WARN_ON_ONCE() in mark_direct_dispatch().
Fixes: 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches") Cc: stable@vger.kernel.org # v6.12+ Cc: Daniel Hodges <hodgesd@meta.com> Cc: Patrick Somaru <patsomaru@meta.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Merge tag 'pm-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a potential NULL pointer dereference in the energy model
netlink interface and a potential double free in an error path in
the common cpufreq governor management code:
- Fix a NULL pointer dereference in the energy model netlink
interface that may occur if a given perf domain ID is not
recognized (Changwoo Min)
- Avoid double free in the cpufreq_dbs_governor_init() error
path when kobject_init_and_add() fails (Guangshuo Li)"
* tag 'pm-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: governor: fix double free in cpufreq_dbs_governor_init() error path
PM: EM: Fix NULL pointer dereference when perf domain ID is not found
Merge tag 'thermal-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control fixes from Rafael Wysocki:
"Address potential races between thermal zone removal and system
resume that may lead to a use-after-free (in two different ways)
and a potential use-after-free in the thermal zone unregistration
path (Rafael Wysocki)"
* tag 'thermal-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: core: Fix thermal zone device registration error path
thermal: core: Address thermal zone removal races with resume
KVM: SEV: Disallow LAUNCH_FINISH if vCPUs are actively being created
Reject LAUNCH_FINISH for SEV-ES and SNP VMs if KVM is actively creating
one or more vCPUs, as KVM needs to process and encrypt each vCPU's VMSA.
Letting userspace create vCPUs while LAUNCH_FINISH is in-progress is
"fine", at least in the current code base, as kvm_for_each_vcpu() operates
on online_vcpus, LAUNCH_FINISH (all SEV+ sub-ioctls) holds kvm->mutex, and
fully onlining a vCPU in kvm_vm_ioctl_create_vcpu() is done under
kvm->mutex. I.e. there's no difference between an in-progress vCPU and a
vCPU that is created entirely after LAUNCH_FINISH.
However, given that concurrent LAUNCH_FINISH and vCPU creation can't
possibly work (for any reasonable definition of "work"), since userspace
can't guarantee whether a particular vCPU will be encrypted or not,
disallow the combination as a hardening measure, to reduce the probability
of introducing bugs in the future, and to avoid having to reason about the
safety of future changes related to LAUNCH_FINISH.
KVM: SEV: Protect *all* of sev_mem_enc_register_region() with kvm->lock
Take and hold kvm->lock for before checking sev_guest() in
sev_mem_enc_register_region(), as sev_guest() isn't stable unless kvm->lock
is held (or KVM can guarantee KVM_SEV_INIT{2} has completed and can't
rollack state). If KVM_SEV_INIT{2} fails, KVM can end up trying to add to
a not-yet-initialized sev->regions_list, e.g. triggering a #GP
Opportunistically use guard() to avoid having to define a new error label
and goto usage.
Fixes: 1e80fdc09d12 ("KVM: SVM: Pin guest memory when SEV is active") Cc: stable@vger.kernel.org Reported-by: Alexander Potapenko <glider@google.com> Tested-by: Alexander Potapenko <glider@google.com> Link: https://patch.msgid.link/20260310234829.2608037-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: SEV: Reject attempts to sync VMSA of an already-launched/encrypted vCPU
Reject synchronizing vCPU state to its associated VMSA if the vCPU has
already been launched, i.e. if the VMSA has already been encrypted. On a
host with SNP enabled, accessing guest-private memory generates an RMP #PF
and panics the host.
Note, the KVM flaw has been present since commit ad73109ae7ec ("KVM: SVM:
Provide support to launch and run an SEV-ES guest"), but has only been
actively dangerous for the host since SNP support was added. With SEV-ES,
KVM would "just" clobber guest state, which is totally fine from a host
kernel perspective since userspace can clobber guest state any time before
sev_launch_update_vmsa().
KVM: selftests: Remove duplicate LAUNCH_UPDATE_VMSA call in SEV-ES migrate test
Drop the explicit KVM_SEV_LAUNCH_UPDATE_VMSA call when creating an SEV-ES
VM in the SEV migration test, as sev_vm_create() automatically updates the
VMSA pages for SEV-ES guests. The only reason the duplicate call doesn't
cause visible problems is because the test doesn't actually try to run the
vCPUs. That will change when KVM adds a check to prevent userspace from
re-launching a VMSA (which corrupts the VMSA page due to KVM writing
encrypted private memory).
KVM: SEV: Use kvzalloc_objs() when pinning userpages
Use kvzalloc_objs() instead of sev_pin_memory()'s open coded (rough)
equivalent to harden the code and
Note! This sanity check in __kvmalloc_node_noprof()
/* Don't even allow crazy sizes */
if (unlikely(size > INT_MAX)) {
WARN_ON_ONCE(!(flags & __GFP_NOWARN));
return NULL;
}
will artificially limit the maximum size of any single pinned region to
just under 1TiB. While there do appear to be providers that support SEV
VMs with more than 1TiB of _total_ memory, it's unlikely any KVM-based
providers pin 1TiB in a single request.
Allocate with NOWARN so that fuzzers can't trip the WARN_ON_ONCE() when
they inevitably run on systems with copious amounts of RAM, i.e. when they
can get by KVM's "total_npages > totalram_pages()" restriction.
Note #2, KVM's usage of vmalloc()+kmalloc() instead of kvmalloc() predates
commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls") by 4+
years (see commit 89c505809052 ("KVM: SVM: Add support for
KVM_SEV_LAUNCH_UPDATE_DATA command"). I.e. the open coded behavior wasn't
intended to avoid the aforementioned sanity check. The implementation
appears to be pure oversight at the time the code was written, as it showed
up in v3[1] of the early RFCs, whereas as v2[2] simply used kmalloc().
KVM: SEV: Disallow pinning more pages than exist in the system
Explicitly disallow pinning more pages for an SEV VM than exist in the
system to defend against absurd userspace requests without relying on
somewhat arbitrary kernel functionality to prevent truly stupid KVM
behavior. E.g. even with the INT_MAX check, userspace can request that
KVM pin nearly 8TiB of memory, regardless of how much RAM exists in the
system.
Opportunistically rename "locked" to a more descriptive "total_npages".
KVM: SEV: Drop useless sanity checks in sev_mem_enc_register_region()
Drop sev_mem_enc_register_region()'s sanity checks on the incoming address
and size, as SEV is 64-bit only, making ULONG_MAX a 64-bit, all-ones value,
and thus making it impossible for kvm_enc_region.{addr,size} to be greater
than ULONG_MAX.
Note, sev_pin_memory() verifies the incoming address is non-NULL (which
isn't strictly required, but whatever), and that addr+size don't wrap to
zero (which _is_ needed and what really needs to be guarded against).
Note #2, pin_user_pages_fast() guards against the end address walking into
kernel address space, so lack of an access_ok() check is also safe (maybe
not ideal, but safe).
No functional change intended (the generated code is literally the same,
i.e. the compiler was smart enough to know the checks were useless).
Note, the checks in sev_mem_enc_register_region() that presumably exist to
verify the incoming address+size are completely worthless, as both "addr"
and "size" are u64s and SEV is 64-bit only, i.e. they _can't_ be greater
than ULONG_MAX. That wart will be cleaned up in the near future.
if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
return -EINVAL;
Opportunistically add a comment to explain why the code calculates the
number of pages the "hard" way, e.g. instead of just shifting @ulen.
Fixes: 78824fabc72e ("KVM: SVM: fix svn_pin_memory()'s use of get_user_pages_fast()") Cc: stable@vger.kernel.org Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Tested-by: Liam Merwick <liam.merwick@oracle.com> Link: https://patch.msgid.link/20260313003302.3136111-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Koichiro Den [Fri, 20 Mar 2026 14:01:39 +0000 (23:01 +0900)]
misc: pci_endpoint_test: Use -EINVAL for small subrange size
The sub_size check ensures that each subrange is large enough for 32-bit
accesses. Subranges smaller than sizeof(u32) do not satisfy this
assumption, so this is a local sanity check rather than a resource
exhaustion case.
KVM: x86: Suppress WARNs on nested_run_pending after userspace exit
To end an ongoing game of whack-a-mole between KVM and syzkaller, WARN on
illegally cancelling a pending nested VM-Enter if and only if userspace
has NOT gained control of the vCPU since the nested run was initiated. As
proven time and time again by syzkaller, userspace can clobber vCPU state
so as to force a VM-Exit that violates KVM's architectural modelling of
VMRUN/VMLAUNCH/VMRESUME.
To detect that userspace has gained control, while minimizing the risk of
operating on stale data, convert nested_run_pending from a pure boolean to
a tri-state of sorts, where '0' is still "not pending", '1' is "pending",
and '2' is "pending but untrusted". Then on KVM_RUN, if the flag is in
the "trusted pending" state, move it to "untrusted pending".
Note, moving the state to "untrusted" even if KVM_RUN is ultimately
rejected is a-ok, because for the "untrusted" state to matter, KVM must
get past kvm_x86_vcpu_pre_run() at some point for the vCPU.
Merge tag 'gpio-fixes-for-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix kerneldocs for gpio-timberdale and gpio-nomadik
- clear the "requested" flag in error path in gpiod_request_commit()
- call of_xlate() if provided when setting up shared GPIOs
- handle pins shared by child firmware nodes of consumer devices
- fix return value check in gpio-qixis-fpga
- fix suspend on gpio-mxc
- fix gpio-microchip DT bindings
* tag 'gpio-fixes-for-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
dt-bindings: gpio: fix microchip #interrupt-cells
gpio: shared: shorten the critical section in gpiochip_setup_shared()
gpio: mxc: map Both Edge pad wakeup to Rising Edge
gpio: qixis-fpga: Fix error handling for devm_regmap_init_mmio()
gpio: shared: handle pins shared by child nodes of devices
gpio: shared: call gpio_chip::of_xlate() if set
gpiolib: clear requested flag if line is invalid
gpio: nomadik: repair some kernel-doc comments
gpio: timberdale: repair kernel-doc comments
gpio: Fix resource leaks on errors in gpiochip_add_data_with_key()
Yosry Ahmed [Thu, 12 Mar 2026 23:48:22 +0000 (16:48 -0700)]
KVM: x86: Move nested_run_pending to kvm_vcpu_arch
Move nested_run_pending field present in both svm_nested_state and
nested_vmx to the common kvm_vcpu_arch. This allows for common code to
use without plumbing it through per-vendor helpers.
nested_run_pending remains zero-initialized, as the entire kvm_vcpu
struct is, and all further accesses are done through vcpu->arch instead
of svm->nested or vmx->nested.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org>
[sean: expand the commend in the field declaration] Link: https://patch.msgid.link/20260312234823.3120658-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
v1->v2:
. fixed bugs spotted by Eduard, Mykyta, claude and gemini
. fixed selftests that were failing in unpriv
. gemini(sashiko) found several precision improvements in patch 6,
but they made no difference in real programs.
v1: https://lore.kernel.org/bpf/20260401021635.34636-1-alexei.starovoitov@gmail.com/
First 6 prep patches for static stack liveness.
. do src/dst_reg validation early and remove defensive checks
. sort subprog in topo order. We wanted to do this long ago
to process global subprogs this way and in other cases.
. Add constant folding pass that computes map_ptr, subprog_idx,
loads from readonly maps, and other constants that fit into 32-bit
. Use these constants to eliminate dead code. Replace predicted
conditional branches with "jmp always". That reduces JIT prog size.
. Add two helpers that return access size from their arguments.
====================
bpf: Add helper and kfunc stack access size resolution
The static stack liveness analysis needs to know how many bytes a
helper or kfunc accesses through a stack pointer argument, so it can
precisely mark the affected stack slots as stack 'def' or 'use'.
Add bpf_helper_stack_access_bytes() and bpf_kfunc_stack_access_bytes()
which resolve the access size for a given call argument.
bpf: Add bpf_compute_const_regs() and bpf_prune_dead_branches() passes
Add two passes before the main verifier pass:
bpf_compute_const_regs() is a forward dataflow analysis that tracks
register values in R0-R9 across the program using fixed-point
iteration in reverse postorder. Each register is tracked with
a six-state lattice:
At merge points, if two paths produce the same state and value for
a register, it stays; otherwise it becomes UNKNOWN.
The analysis handles:
- MOV, ADD, SUB, AND with immediate or register operands
- LD_IMM64 for plain constants, map FDs, map values, and subprogs
- LDX from read-only maps: constant-folds the load by reading the
map value directly via bpf_map_direct_read()
Results that fit in 32 bits are stored per-instruction in
insn_aux_data and bitmasks.
bpf_prune_dead_branches() uses the computed constants to evaluate
conditional branches. When both operands of a conditional jump are
known constants, the branch outcome is determined statically and the
instruction is rewritten to an unconditional jump.
The CFG postorder is then recomputed to reflect new control flow.
This eliminates dead edges so that subsequent liveness analysis
doesn't propagate through dead code.
Also add runtime sanity check to validate that precomputed
constants match the verifier's tracked state.
selftests/bpf: Add tests for subprog topological ordering
Add few tests for topo sort:
- linear chain: main -> A -> B
- diamond: main -> A, main -> B, A -> C, B -> C
- mixed global/static: main -> global -> static leaf
- shared callee: main -> leaf, main -> global -> leaf
- duplicate calls: main calls same subprog twice
- no calls: single subprog
bpf: Sort subprogs in topological order after check_cfg()
Add a pass that sorts subprogs in topological order so that iterating
subprog_topo_order[] walks leaf subprogs first, then their callers.
This is computed as a DFS post-order traversal of the CFG.
The pass runs after check_cfg() to ensure the CFG has been validated
before traversing and after postorder has been computed to avoid
walking dead code.
Merge tag 'drm-fixes-2026-04-03' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Hopefully no Easter eggs in this bunch of fixes. Usual stuff across
the amd/intel with some misc bits. Thanks to Thorsten and Alex for
making sure a regression fix that was hanging around in process land
finally made it in, that is probably the biggest change in here.
core:
- revert unplug/framebuffer fix as it caused problems
- compat ioctl speculation fix
i915:
- Fix for #12045: Huawei Matebook E (DRR-WXX): Persistent Black
Screen on Boot with i915 and Gen11: Modesetting and Backlight
Control Malfunction
- Fix for #15826: i915: Raptor Lake-P [UHD Graphics] display
flicker/corruption on eDP panel
- Use crtc_state->enhanced_framing properly on ivb/hsw CPU eDP
xe:
- uapi: Accept canonical GPU addresses in xe_vm_madvise_ioctl
- Disallow writes to read-only VMAs
- PXP fixes
- Disable garbage collector work item on SVM close
- void memory allocations in xe_device_declare_wedged
qaic:
- hang fix
ast:
- initialisation fix"
* tag 'drm-fixes-2026-04-03' of https://gitlab.freedesktop.org/drm/kernel: (28 commits)
drm/amd/display: Wire up dcn10_dio_construct() for all pre-DCN401 generations
drm/ioc32: stop speculation on the drm_compat_ioctl path
drm/sysfb: Fix efidrm error handling and memory type mismatch
drm/i915/dp: Use crtc_state->enhanced_framing properly on ivb/hsw CPU eDP
drm/i915/cdclk: Do the full CDCLK dance for min_voltage_level changes
drm/amdkfd: Fix queue preemption/eviction failures by aligning control stack size to GPU page size
drm/amdgpu: Fix wait after reset sequence in S4
drm/amd/display: Fix NULL pointer dereference in dcn401_init_hw()
drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 64KB
drm/amdgpu/userq: fix memory leak in MQD creation error paths
drm/amd: Fix MQD and control stack alignment for non-4K
drm/amdkfd: Align expected_queue_size to PAGE_SIZE
drm/amdgpu: fix the idr allocation flags
drm/amdgpu: validate doorbell_offset in user queue creation
drm/amdgpu/pm: drop SMU driver if version not matched messages
drm/xe: Avoid memory allocations in xe_device_declare_wedged()
drm/xe: Disable garbage collector work item on SVM close
drm/xe/pxp: Don't allow PXP on older PTL GSC FWs
drm/xe/pxp: Clear restart flag in pxp_start after jumping back
drm/xe/pxp: Remove incorrect handling of impossible state during suspend
...