Florian Westphal [Sat, 28 Mar 2026 22:00:31 +0000 (23:00 +0100)]
netfilter: xt_physdev: reject empty or not-nul terminated device names
Reject names that lack a \0 character and reject the empty string as
well. iptables allows this but it fails to re-parse iptables-save output
that contains such rules.
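As a rough userspace sketch of the check described above (not the actual
xt_physdev code; devname_ok() is a hypothetical helper and the buffer size
is chosen by the caller), a fixed-size name is accepted only if it is
non-empty and has a \0 inside the buffer:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper for illustration only. */
static bool devname_ok(const char *buf, size_t size)
{
	assert(buf != NULL);

	if (size == 0 || buf[0] == '\0')
		return false;			/* empty string */

	/* A terminator must exist within the fixed-size buffer. */
	return memchr(buf, '\0', size) != NULL;
}
```

A 4-byte buffer holding 'e','t','h','0' with no terminator is rejected,
while "eth0" stored in an 8-byte buffer is accepted.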
Add /proc/net/ip_vs_status to show current state of IPVS.
The motivation for this new /proc interface is to provide the output
for the users to help them decide when to tune the load factor for
hash tables, which is possible with the new sysctl knobs coming in
followup patch.
The output also includes information for the kthreads used for stats.
- Minor updates to driver and dt-bindings for Tegra (Thierry Reding,
Rosen Penev)
- Add MAINTAINERS entry for CPPC driver (Viresh Kumar)
- Add support for new features: CPPC performance priority, Dynamic EPP,
Raw EPP, and new unit tests for them to amd-pstate (Gautham Shenoy,
Mario Limonciello)
- Fix sysfs files being present when HW missing and broken/outdated
documentation in the amd-pstate driver (Ninad Naik, Gautham Shenoy)
- Pass the policy to cpufreq_driver->adjust_perf() to avoid using
cpufreq_cpu_get() in the .adjust_perf() callback in amd-pstate which
leads to a scheduling-while-atomic bug (K Prateek Nayak)
- Clean up dead code in Kconfig for cpufreq (Julian Braha)
- Remove max_freq_req update for pre-existing cpufreq policy and add a
boost_freq_req QoS request to save the boost constraint instead of
overwriting the last scaling_max_freq constraint (Pierre Gondois)
- Embed cpufreq QoS freq_req objects in cpufreq policy so they all
are allocated in one go along with the policy to simplify lifetime
rules and avoid error handling issues (Viresh Kumar)
- Use DMI max speed when CPPC is unavailable in the acpi-cpufreq
scaling driver (Henry Tseng)
- Switch policy_is_shared() in cpufreq to using cpumask_nth() instead
of cpumask_weight() because the former is more efficient (Yury Norov)
- Use sysfs_emit() in sysfs show functions for cpufreq governor
attributes (Thorsten Blum)
- Update intel_pstate to stop returning an error when "off" is written
to its status sysfs attribute while the driver is already off (Fabio
De Francesco)
- Include current frequency in the debug message printed by
__cpufreq_driver_target() (Pengjie Zhang)
* pm-cpufreq: (38 commits)
cpufreq/amd-pstate: Add POWER_SUPPLY select for dynamic EPP
MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer
cpufreq: Pass the policy to cpufreq_driver->adjust_perf()
cpufreq/amd-pstate: Pass the policy to amd_pstate_update()
cpufreq/amd-pstate-ut: Add a unit test for raw EPP
cpufreq/amd-pstate: Add support for raw EPP writes
cpufreq/amd-pstate: Add support for platform profile class
cpufreq/amd-pstate: add kernel command line to override dynamic epp
cpufreq/amd-pstate: Add dynamic energy performance preference
Documentation: amd-pstate: fix dead links in the reference section
cpufreq/amd-pstate: Cache the max frequency in cpudata
Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count}
Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file
Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file
amd-pstate-ut: Add a testcase to validate the visibility of driver attributes
amd-pstate-ut: Add module parameter to select testcases
amd-pstate: Introduce a tracepoint trace_amd_pstate_cppc_req2()
amd-pstate: Add sysfs support for floor_freq and floor_count
amd-pstate: Add support for CPPC_REQ2 and FLOOR_PERF
x86/cpufeatures: Add AMD CPPC Performance Priority feature.
...
Pengpeng Hou [Tue, 10 Mar 2026 08:08:00 +0000 (08:08 +0000)]
xen/grant-table: guard gnttab_suspend/resume with CONFIG_HIBERNATE_CALLBACKS
In current linux.git, gnttab_suspend() and gnttab_resume() are defined
and declared unconditionally. However, their only in-tree callers reside
in drivers/xen/manage.c, which are guarded by CONFIG_HIBERNATE_CALLBACKS.
Match the helper scope to their callers by wrapping the definitions in
CONFIG_HIBERNATE_CALLBACKS and providing no-op stubs in the header. This
fixes the config-scope mismatch and reduces the code footprint when
hibernation callbacks are disabled.
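The header side of this pattern could look roughly like the sketch below.
DEMO_HIBERNATE_CALLBACKS and the demo_* names are hypothetical stand-ins
for the real config symbol and helpers; the point is only that callers
compile unchanged whether or not the option is set:

```c
#include <assert.h>

#ifdef DEMO_HIBERNATE_CALLBACKS
/* Real definitions would live in the .c file under the same guard. */
int demo_suspend(void);
int demo_resume(void);
#else
/* No-op stubs: nothing to do when the callbacks are compiled out. */
static inline int demo_suspend(void) { return 0; }
static inline int demo_resume(void) { return 0; }
#endif

/* A caller does not need its own #ifdef. */
static int demo_suspend_cycle(void)
{
	int err = demo_suspend();

	if (err)
		return err;
	return demo_resume();
}
```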
Jason Andryuk [Wed, 18 Mar 2026 23:53:26 +0000 (19:53 -0400)]
hvc/xen: Check console connection flag
When the console out buffer is filled, __write_console() will return 0
as it cannot send any data. domU_write_console() will then spin in
`while (len)` as len doesn't decrement until xenconsoled attaches. This
would block a domU and nullify the parallelism of Hyperlaunch until dom0
userspace starts xenconsoled, which empties the buffer.
Xen 4.21 added a connection field to the xen console page. This is set
to XENCONSOLE_DISCONNECTED (1) when a domain is built, and xenconsoled
will set it to XENCONSOLE_CONNECTED (0) when it connects.
Update the hvc_xen driver to check the field. When the field is
disconnected, drop the write with -ENOTCONN. We only drop the write
when the field is XENCONSOLE_DISCONNECTED (1) to try for maximum
compatibility. The Xen toolstack has historically zero initialized the
console, so it should see XENCONSOLE_CONNECTED (0) by default. If an
implementation used uninitialized memory, only checking for
XENCONSOLE_DISCONNECTED could have the lowest chance of not connecting.
This lets the hyperlaunched domU boot without stalling. Once dom0
starts xenconsoled, xl console can be used to access the domU's hvc0.
Partially sync console.h from xen.git to bring in the new field.
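A simplified model of the write-side check, assuming a made-up
struct demo_console in place of the shared console page (only the two
XENCONSOLE_* values are taken from the text above):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define XENCONSOLE_CONNECTED    0
#define XENCONSOLE_DISCONNECTED 1

/* Hypothetical stand-in for the shared console page. */
struct demo_console {
	uint8_t connection;
	size_t free_space;	/* room left in the out buffer */
};

/* Returns bytes accepted, or -ENOTCONN if the backend is known absent. */
static int demo_write(struct demo_console *c, size_t len)
{
	assert(c != NULL);

	/* Only the explicit DISCONNECTED value drops the write. */
	if (c->connection == XENCONSOLE_DISCONNECTED)
		return -ENOTCONN;

	size_t n = len < c->free_space ? len : c->free_space;
	c->free_space -= n;
	return (int)n;
}
```

Any value other than XENCONSOLE_DISCONNECTED, including garbage from
uninitialized memory, is treated as connected in this sketch, which is
the compatibility choice the message describes.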
Kexin Sun [Sat, 21 Mar 2026 11:00:39 +0000 (19:00 +0800)]
xen/swiotlb: fix stale reference to swiotlb_unmap_page()
Commit af85de5a9f00 ("xen: swiotlb: Switch to physical
address mapping callbacks") renamed xen_swiotlb_unmap_page()
to xen_swiotlb_unmap_phys(). The comment in
xen_swiotlb_unmap_sg() had already been missing the xen_
prefix (reading swiotlb_unmap_page()), and the rename only
changed _page to _phys without correcting this, leaving it
as swiotlb_unmap_phys(). Fix the reference to use the
correct function name xen_swiotlb_unmap_phys().
xen/manage: unwind partial shutdown watcher setup on error
setup_shutdown_watcher() registers shutdown_watch first, then the sysrq
watch, and finally publishes the supported feature-* nodes in xenstore.
If sysrq watch registration fails, or xenbus_printf() fails after one or
more feature nodes were created, the function returns immediately without
undoing the earlier setup.
This leaves the system in a partially initialized state, with registered
watches and/or stale xenstore entries despite the function reporting
failure.
Unwind the partial setup before returning an error by unregistering any
watches that were already registered and removing feature nodes that were
already published.
selftests/sched_ext: Fix wrong DSQ ID in peek_dsq error message
The error path after scx_bpf_create_dsq(real_dsq_id, ...) was reporting
test_dsq_id instead of real_dsq_id in the error message, which would
mislead debugging.
- Remove EROFS_MAP_ENCODED since it was always set together with
EROFS_MAP_MAPPED for compressed extents and checked redundantly;
- Replace the EROFS_MAP_FULL_MAPPED flag with the opposite
EROFS_MAP_PARTIAL_MAPPED flag so that extents are implicitly
fully mapped initially to simplify the logic;
- Make fragment extents independent of EROFS_MAP_MAPPED since
they are not directly allocated on disk; thus fragment extents
are no longer twisted with mapped extents.
ARM: xen: validate hypervisor compatible before parsing its version
fdt_find_hyper_node() reads the raw compatible property and then derives
hyper_node.version from a prefix match before later printing it with %s.
Flat DT properties are external boot input, and this path does not prove
that the first compatible entry is NUL-terminated within the returned
property length.
Keep the existing flat-DT lookup path, but verify that the first
compatible entry terminates within the returned property length before
deriving the version suffix from it.
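A minimal sketch of such a bounded check, using a hypothetical
demo_version() helper rather than the real fdt_find_hyper_node() code;
the prefix is whatever string the caller matches against:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the suffix after prefix, or NULL if the first entry is not
 * NUL-terminated within len bytes or does not start with prefix.
 */
static const char *demo_version(const char *prop, size_t len,
				const char *prefix)
{
	assert(prefix != NULL);
	size_t plen = strlen(prefix);

	/* Refuse to parse an entry that runs past the property length. */
	if (!prop || !memchr(prop, '\0', len))
		return NULL;
	if (strncmp(prop, prefix, plen) != 0)
		return NULL;
	return prop + plen;
}
```

Only after memchr() has found the terminator is the string handed to
functions that expect NUL-terminated input.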
Kuba Piecuch [Thu, 9 Apr 2026 16:57:44 +0000 (16:57 +0000)]
sched_ext: Documentation: improve accuracy of task lifecycle pseudo-code
* Add ops.quiescent() and ops.runnable() to the sched_change path.
When a queued task has one of its scheduling properties changed
(e.g. nice, affinity), it goes through dequeue() -> quiescent() ->
(property change callback, e.g. ops.set_weight()) -> runnable() ->
enqueue().
* Change && to || in ops.enqueue() condition. We want to enqueue tasks
that have a non-zero slice and are not in any DSQ.
* Call ops.dispatch() and ops.dequeue() only for tasks that have had
ops.enqueue() called. This is to account for tasks direct-dispatched
from ops.select_cpu().
* Add a note explaining that the pseudo-code provides a simplified view
of the task lifecycle and list some examples of cases that the
pseudo-code does not account for.
Fixes: a4f61f0a1afd ("sched_ext: Documentation: Add ops.dequeue() to task lifecycle")
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Neeraj Soni [Fri, 10 Apr 2026 06:58:33 +0000 (12:28 +0530)]
mmc: sdhci-msm: Fix the wrapped key handling
Inline Crypto Engine (ICE) supports wrapped key generation. While
registering crypto profile the supported key types are queried from ICE
driver. So the explicit check for RAW key is not needed.
Fixes: fd78e2b582a0 ("mmc: sdhci-msm: Add support for wrapped keys")
Signed-off-by: Neeraj Soni <neeraj.soni@oss.qualcomm.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
gpio: tegra: return -ENOMEM on allocation failure in probe
devm_kzalloc() failure in tegra_gpio_probe() returns -ENODEV, which
indicates "no such device". The correct error code for a memory
allocation failure is -ENOMEM.
Hangbin Liu [Wed, 8 Apr 2026 07:19:05 +0000 (15:19 +0800)]
tools: ynl: tests: fix leading space on Makefile target
The ../generated/protos.a rule had a spurious leading space before the
target name. In make, target rules must start at column 0; only recipe
lines are indented with a tab. The extra space caused make to misparse
the rule.
Remove the leading space to match the style of the adjacent
../lib/ynl.a rule.
People (do people still write code or is it all AI?) seem to not
get that ksft_run() can only be called once. If we call it
multiple times KTAP parsers will likely cut off after the first
batch has finished.
fbnic_up() calls netif_tx_start_all_queues(), which only clears
__QUEUE_STATE_DRV_XOFF. If qdisc backlog has accumulated on any TX
queue before the reconfiguration (e.g. ring resize via ethtool -G),
start does not call __netif_schedule() to kick the qdisc, so the
pending backlog is never drained and the queue stalls.
Switch to netif_tx_wake_all_queues(), which clears DRV_XOFF and also
calls __netif_schedule() on every queue, ensuring any backlog that
built up before the down/up cycle is promptly dequeued.
net: airoha: Add dma_rmb() and READ_ONCE() in airoha_qdma_rx_process()
Add missing dma_rmb() in airoha_qdma_rx_process routine to make sure the
DMA read operations are completed when the NIC reports the processing on
the current descriptor is done. Moreover, add missing READ_ONCE() in
airoha_qdma_rx_process() for DMA descriptor control fields in order to
avoid any compiler reordering.
net: txgbe: fix RTNL assertion warning when remove module
For the copper NIC with external PHY, the driver called
phylink_connect_phy() during probe and phylink_disconnect_phy() during
remove. It caused an RTNL assertion warning in phylink_disconnect_phy()
upon module remove.
To fix this, add rtnl_lock() and rtnl_unlock() around the
phylink_disconnect_phy() in remove function.
Fixes: 02b2a6f91b90 ("net: txgbe: support copper NIC with external PHY")
Cc: stable@vger.kernel.org
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/8B47A5872884147D+20260407094041.4646-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 10 Apr 2026 03:19:47 +0000 (20:19 -0700)]
Merge branch 'net-bcmgenet-fix-queue-lock-up'
Justin Chen says:
====================
net: bcmgenet: fix queue lock up
We have been seeing reports of logs like this.
[ 41.761198] bcmgenet 1001300000.ethernet eth0: NETDEV WATCHDOG: CPU: 0: transmit queue 2 timed out 10039 ms
[ 43.745198] bcmgenet 1001300000.ethernet eth0: NETDEV WATCHDOG: CPU: 0: transmit queue 2 timed out 12023 ms
[ 45.729198] bcmgenet 1001300000.ethernet eth0: NETDEV WATCHDOG: CPU: 0: transmit queue 2 timed out 14007 ms
We have two issues: the persistent queue timeouts and the eventual
lock up of the entire transmit.
We address the lock up issue first. The queue timeouts are due to
a fundamental design issue, not a bug per se. Timeouts still persist,
but we should no longer lock up.
====================
The bcmgenet_timeout handler tries to take down all tx queues when
a single queue times out. This is overzealous and causes many race
conditions with queues that are still chugging along. Instead, let's
only restart the timed-out queue.
While reclaiming the tx queue we fast forward the write pointer to
drop any data in flight. These dropped frames are not added back
to the pool of free bds. We also need to tell the netdev that we
are dropping said data.
net: bcmgenet: fix off-by-one in bcmgenet_put_txcb
The write_ptr points to the next open tx_cb. We want to return the
tx_cb that gets rewound, so we must rewind the pointer first then
return the tx_cb that it points to. That way the txcb can be correctly
cleaned up.
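The ordering can be shown on a toy ring. The struct and function names
below are hypothetical and only mirror the idea (write_ptr is the next
open slot), not the driver's data structures:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_RING_SIZE 4

struct demo_ring {
	int cbs[DEMO_RING_SIZE];	/* stand-ins for tx control blocks */
	unsigned int write_ptr;		/* index of the next open slot */
};

/* Hand out the slot at write_ptr, then advance. */
static int *demo_get_txcb(struct demo_ring *r)
{
	assert(r != NULL);
	int *cb = &r->cbs[r->write_ptr];

	r->write_ptr = (r->write_ptr + 1) % DEMO_RING_SIZE;
	return cb;
}

/* Rewind first, then return the slot the pointer now refers to. */
static int *demo_put_txcb(struct demo_ring *r)
{
	assert(r != NULL);
	r->write_ptr = (r->write_ptr + DEMO_RING_SIZE - 1) % DEMO_RING_SIZE;
	return &r->cbs[r->write_ptr];
}
```

Returning the slot before rewinding would hand back the still-open slot
rather than the one most recently given out, which is the off-by-one
described above.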
Fixes: 876dbadd53a7 ("net: bcmgenet: Fix unmapping of fragments in bcmgenet_xmit()")
Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260406175756.134567-2-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kevin Hao [Tue, 7 Apr 2026 00:45:39 +0000 (08:45 +0800)]
net: macb: Use napi_schedule_irqoff() in IRQ handler
For non-PREEMPT_RT kernels, the IRQ handler runs with interrupts
disabled, allowing the use of napi_schedule_irqoff() to save a pair of
local_irq_{save,restore} operations. For PREEMPT_RT kernels,
napi_schedule_irqoff() behaves identically to napi_schedule().
Qingfang Deng [Tue, 7 Apr 2026 09:40:56 +0000 (17:40 +0800)]
ppp: consolidate refcount decrements
ppp_destroy_{channel,interface} are always called after
refcount_dec_and_test().
To reduce boilerplate code, consolidate the decrements by moving them
into the two functions. To reflect this change in semantics, rename the
functions to ppp_release_*.
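In userspace terms the consolidated shape is roughly the following
sketch, with C11 atomics standing in for refcount_t and hypothetical
demo_* names:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_obj {
	atomic_int refcnt;
	bool destroyed;		/* stand-in for the real teardown */
};

/* Drop one reference; tear down only when the count reaches zero. */
static void demo_release(struct demo_obj *o)
{
	assert(o != NULL);

	/* atomic_fetch_sub() returns the value before the decrement. */
	if (atomic_fetch_sub(&o->refcnt, 1) != 1)
		return;		/* other holders remain */

	o->destroyed = true;
}
```

Callers then just call demo_release() instead of repeating the
decrement-and-test before every destroy call.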
Marek Vasut [Sun, 5 Apr 2026 23:29:58 +0000 (01:29 +0200)]
net: phy: realtek: Add property to enable SSC
Add support for spread spectrum clocking (SSC) on RTL8211F(D)(I)-CG,
RTL8211FS(I)(-VS)-CG, RTL8211FG(I)(-VS)-CG PHYs. The implementation
follows EMI improvement application note Rev. 1.2 for these PHYs.
The current implementation enables SSC for both RXC and SYSCLK clock
signals. Introduce DT properties 'realtek,clkout-ssc-enable',
'realtek,rxc-ssc-enable' and 'realtek,sysclk-ssc-enable' which control
SSC enablement on the CLKOUT, RXC and SYSCLK signals, respectively.
Document support for spread spectrum clocking (SSC) on RTL8211F(D)(I)-CG,
RTL8211FS(I)(-VS)-CG, RTL8211FG(I)(-VS)-CG PHYs. Introduce DT properties
'realtek,clkout-ssc-enable', 'realtek,rxc-ssc-enable' and
'realtek,sysclk-ssc-enable' which control SSC enablement on the CLKOUT,
RXC and SYSCLK signals, respectively. These clocks are not exposed via
the clock API, therefore the assigned-clock-sscs property does not
apply.
====================
macsec: Add support for VLAN filtering in offload mode
This short series adds support for VLANs in MACsec devices when offload
mode is enabled. This allows VLAN netdevs on top of MACsec netdevs to
function, which accidentally used to be the case in the past, but was
broken. This series adds back proper support.
As part of this, the existing nsim-only MACsec offload tests were
translated to Python so they can run against real HW and new
traffic-based tests were added for VLAN filter propagation, since
there's currently no uAPI to check VLAN filters.
====================
VLAN-filtering is done through two netdev features
(NETIF_F_HW_VLAN_CTAG_FILTER and NETIF_F_HW_VLAN_STAG_FILTER) and two
netdev ops (ndo_vlan_rx_add_vid and ndo_vlan_rx_kill_vid).
Implement these and advertise the features if the lower device supports
them. This allows proper VLAN filtering to work on top of MACsec
devices, when the lower device is capable of VLAN filtering.
As a concrete example, having this chain of interfaces now works:
vlan_filtering_capable_dev(1) -> macsec_dev(2) -> macsec_vlan_dev(3)
Before the mentioned commit this used to accidentally work because the
MACsec device (and thus the lower device) was put in promiscuous mode
and the VLAN filter was not used. But after commit [1] correctly made
the macsec driver expose the IFF_UNICAST_FLT flag, promiscuous mode was
no longer used and VLAN filters on dev 1 kicked in. Without support in
dev 2 for propagating VLAN filters down, the register_vlan_dev ->
vlan_vid_add -> __vlan_vid_add -> vlan_add_rx_filter_info call from dev
3 is silently eaten (because vlan_hw_filter_capable returns false and
vlan_add_rx_filter_info silently succeeds).
For MACsec, VLAN filters are only relevant for offload, otherwise
the VLANs are encrypted and the lower devices don't care about them. So
VLAN filters are only passed on to lower devices in offload mode.
Flipping between offload modes now needs to offload/unoffload the
filters with vlan_{get,drop}_rx_*_filter_info().
To avoid the back-and-forth filter updating during rollback, the setting
of macsec->offload is moved after the add/del secy ops. This is safe
since none of the code called from those requires macsec->offload.
In case adding the filters fails, the added ones are rolled back and an
error is returned to the operation toggling the offload state.
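The rollback on a failed filter add could be modelled as below. This is
only a sketch: struct demo_lower, demo_add_vid() and the reject_vid knob
are invented for illustration and do not correspond to the macsec or
8021q code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_VID 8

/* Hypothetical lower device: a filter table that rejects one VID. */
struct demo_lower {
	bool filter[DEMO_MAX_VID];
	int reject_vid;		/* -1 to accept everything */
};

static int demo_add_vid(struct demo_lower *l, int vid)
{
	if (vid == l->reject_vid)
		return -1;
	l->filter[vid] = true;
	return 0;
}

/* Push every VID down; on failure remove the ones already added. */
static int demo_offload_filters(struct demo_lower *l, const int *vids, int n)
{
	assert(l != NULL && vids != NULL);

	for (int i = 0; i < n; i++) {
		if (demo_add_vid(l, vids[i]) == 0)
			continue;
		while (--i >= 0)
			l->filter[vids[i]] = false;
		return -1;
	}
	return 0;
}
```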
selftests: Add MACsec VLAN propagation traffic test
Add VLAN filter propagation tests through offloaded MACsec devices via
actual traffic.
The tests create MACsec tunnels with matching SAs on both endpoints,
stack VLANs on top, and verify connectivity with ping. Covered:
- Offloaded MACsec with VLAN (filters propagate to HW)
- Software MACsec with VLAN (no HW filter propagation)
- Offload on/off toggle and verifying traffic still works
On netdevsim this makes use of the VLAN filter debugfs file to actually
validate that filters are applied/removed correctly.
On real hardware the traffic should validate actual VLAN filter
propagation.
selftests: Migrate nsim-only MACsec tests to Python
Move MACsec offload API and ethtool feature tests from
tools/testing/selftests/drivers/net/netdevsim/macsec-offload.sh to
tools/testing/selftests/drivers/net/macsec.py using the NetDrvEnv
framework so tests can run against both netdevsim (default) and real
hardware (NETIF=ethX). As some real hardware requires MACsec to use
encryption, add that to the tests.
Netdevsim-specific limit checks (max SecY, max RX SC) were moved into
separate test cases to avoid failures on real hardware.
IFA_F_PERMANENT addresses require the allocation of a bunch of percpu
pointers, currently in atomic scope.
Similar to commit 51454ea42c1a ("ipv6: fix locking issues with loops
over idev->addr_list"), move fixup_permanent_addr() outside the
&idev->lock scope, and do the allocations with GFP_KERNEL. With such
change fixup_permanent_addr() is invoked with the BH enabled, and the
ifp lock acquired there needs the BH variant.
Note that we don't need to acquire a reference to the permanent
addresses before releasing the mentioned write lock, because
addrconf_permanent_addr() runs under RTNL and ifa removal always happens
under RTNL, too.
Also the PERMANENT flag is constant in the relevant scope, as it can be
cleared only by inet6_addr_modify() under the RTNL lock.
David Carlier [Tue, 7 Apr 2026 15:07:58 +0000 (16:07 +0100)]
net: use get_random_u{16,32,64}() where appropriate
Use the typed random integer helpers instead of
get_random_bytes() when filling a single integer variable.
The helpers return the value directly, require no pointer
or size argument, and better express intent.
Skipped sites writing into __be16 (netdevsim) and __le64
(ceph) fields where a direct assignment would trigger
sparse endianness warnings.
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260407150758.5889-1-devnexen@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 8 Apr 2026 22:12:51 +0000 (15:12 -0700)]
net: remove the netif_get_rx_queue_lease_locked() helpers
The netif_get_rx_queue_lease_locked() API hides the locking
and the descent onto the leased queue, making the code
harder to follow (at least to me). Remove the API and open
code the descent a bit. Most of the code now looks like:
  if (!leased)
          return __helper(x);
  hw_rxq = ..
  netdev_lock(hw_rxq->dev);
  ret = __helper(x);
  netdev_unlock(hw_rxq->dev);
  return ret;
Of course if we have more code paths that need the wrapping
we may need to revisit. For now, IMHO, having to know what
netif_get_rx_queue_lease_locked() does is not worth the 20LoC
it saves.
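For readers who want to see the open-coded shape compile, here is a toy
model. The demo_* types and the flag used as a "lock" are assumptions
made for illustration; the real code uses netdev_lock() and the rx queue
structures:

```c
#include <assert.h>
#include <stddef.h>

struct demo_dev {
	int locked;		/* toy lock: 1 while held */
	int configured;
};

struct demo_rxq {
	struct demo_dev *dev;
	struct demo_rxq *lease;	/* NULL if the queue is not leased */
};

static void demo_lock(struct demo_dev *d)   { assert(!d->locked); d->locked = 1; }
static void demo_unlock(struct demo_dev *d) { assert(d->locked);  d->locked = 0; }

/* Core helper: the caller must hold the lock of q->dev. */
static int demo_helper(struct demo_rxq *q)
{
	assert(q->dev->locked);
	q->dev->configured++;
	return 0;
}

/* Entry point: the caller already holds the lock of rxq->dev. */
static int demo_op(struct demo_rxq *rxq)
{
	struct demo_rxq *hw_rxq = rxq->lease;
	int ret;

	if (!hw_rxq)
		return demo_helper(rxq);

	/* Descend onto the leased queue under its own device lock. */
	demo_lock(hw_rxq->dev);
	ret = demo_helper(hw_rxq);
	demo_unlock(hw_rxq->dev);
	return ret;
}
```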
====================
netkit: Support for io_uring zero-copy and AF_XDP
Containers use virtual netdevs to route traffic from a physical netdev
in the host namespace. They do not have access to the physical netdev
in the host and thus can't use memory providers or AF_XDP that require
reconfiguring/restarting queues in the physical netdev.
This patchset adds the concept of queue leasing to virtual netdevs that
allow containers to use memory providers and AF_XDP at native speed.
Leased queues are bound to a real queue in a physical netdev and act
as a proxy.
Memory providers and AF_XDP operations take an ifindex and queue id,
so containers would pass in an ifindex for a virtual netdev and a queue
id of a leased queue, which then gets proxied to the underlying real
queue.
We have implemented support for this concept in netkit and tested the
latter against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504
(bnxt_en) 100G NICs. For more details see the individual patches.
====================
David Wei [Thu, 2 Apr 2026 23:10:31 +0000 (01:10 +0200)]
selftests/net: Add queue leasing tests with netkit
Add extensive selftests for netkit queue leasing, using io_uring zero
copy test binary inside of a netns with netkit. This checks that memory
providers can be bound against virtual queues in a netkit within a
netns that are leasing from a physical netdev in the default netns.
Also add various test cases around corner cases for the queue creation
itself as well as queue info dumping and teardown in case of netkit in
device pair and single mode.
Daniel Borkmann [Thu, 2 Apr 2026 23:10:30 +0000 (01:10 +0200)]
netkit: Add xsk support for af_xdp applications
Enable support for AF_XDP applications to operate on a netkit device.
The goal is that AF_XDP applications can natively consume AF_XDP
from network namespaces. The use-case from Cilium side is to support
Kubernetes KubeVirt VMs through QEMU's AF_XDP backend. KubeVirt is a
virtual machine management add-on for Kubernetes which aims to provide
a common ground for virtualization. KubeVirt spawns the VMs inside
Kubernetes Pods which reside in their own network namespace just like
regular Pods.
Raw QEMU AF_XDP backend example with eth0 being a physical device with
16 queues where netkit is bound to the last queue (for multi-queue RSS
context can be used if supported by the driver):
# ethtool -X eth0 start 0 equal 15
# ethtool -X eth0 start 15 equal 1 context new
# ethtool --config-ntuple eth0 flow-type ether \
src 00:00:00:00:00:00 \
src-mask ff:ff:ff:ff:ff:ff \
dst $mac dst-mask 00:00:00:00:00:00 \
proto 0 proto-mask 0xffff action 15
[ ... setup BPF/XDP prog on eth0 to steer into shared xsk map ... ]
# ip netns add foo
# ip link add numrxqueues 2 nk type netkit single
# ynl --family netdev --output-json --do queue-create \
--json "{"ifindex": $(ifindex nk), "type": "rx", \
"lease": { "ifindex": $(ifindex eth0), \
"queue": { "type": "rx", "id": 15 } } }"
{'id': 1}
# ip link set nk netns foo
# ip netns exec foo ip link set lo up
# ip netns exec foo ip link set nk up
# ip netns exec foo qemu-system-x86_64 \
-kernel $kernel \
-drive file=${image_name},index=0,media=disk,format=raw \
-append "root=/dev/sda rw console=ttyS0" \
-cpu host \
-m $memory \
-enable-kvm \
-device virtio-net-pci,netdev=net0,mac=$mac \
-netdev af-xdp,ifname=nk,id=net0,mode=native,queues=1,start-queue=1,inhibit=on,map-path=$dir/xsks_map \
-nographic
We have tested the above against a dual-port Nvidia ConnectX-6 (mlx5)
100G NIC with successful network connectivity out of QEMU. An earlier
iteration of this work was presented at LSF/MM/BPF [0] and more
recently at LPC [1].
For getting to a first starting point to connect all things with
KubeVirt, bind mounting the xsk map from Cilium into the VM launcher
Pod, which acts as a regular Kubernetes Pod, while not perfect, is not
a big problem given it is out of reach of the application sitting
inside the VM (and some of the control plane aspects are baked in
the launcher Pod already), so the isolation barrier is still the VM.
Eventually the goal is to have an XDP/XSK redirect extension where
there is no need to have the xsk map, and the BPF program can just
derive the target xsk through the queue where traffic was received
on.
The exposure through netkit is because Cilium should not act as a
proxy handing out xsk sockets. Existing applications expect a netdev
from kernel side and should not need to rewrite just to implement
against a CNI's protocol. Also, all the memory should not be accounted
against Cilium but rather the application Pod itself which is consuming
AF_XDP. Further, on up/downgrades we expect the data plane to be
completely decoupled from the control plane; if Cilium owned the
sockets, that would be disruptive. Another use-case which opens up and
is regularly requested by users would be to have DPDK applications on
top of AF_XDP in regular Kubernetes Pods.
Daniel Borkmann [Thu, 2 Apr 2026 23:10:29 +0000 (01:10 +0200)]
netkit: Add netkit notifier to check for unregistering devices
Add a netdevice notifier in netkit to watch for NETDEV_UNREGISTER events.
If the target device is indeed NETREG_UNREGISTERING and previously leased
a queue to a netkit device, then collect the related netkit devices and
batch-unregister_netdevice_many() them.
If this were not done, then the netkit device would hold a reference on
the physical device preventing it from going away. However, in case of
both io_uring zero-copy as well as AF_XDP this situation is handled
gracefully and the allocated resources are torn down.
In the case where mentioned infra is used through netkit, the applications
have a reference on netkit, and netkit in turn holds a reference on the
physical device. In order to have netkit release the reference on the
physical device, we need such watcher to then unregister the netkit ones.
This is generally quite similar to the dependency handling in case of
tunnels (e.g. vxlan bound to an underlying netdev) where the tunnel device
gets removed along with the physical device.
# ip a
[...]
4: enp10s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether e8:eb:d3:a3:43:f6 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.2/24 scope global enp10s0f0np0
valid_lft forever preferred_lft forever
[...]
8: nk@NONE: <BROADCAST,MULTICAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
[...]
# ip a
[...]
[ both enp10s0f0np0 and nk gone ]
[...]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-13-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Wei [Thu, 2 Apr 2026 23:10:28 +0000 (01:10 +0200)]
netkit: Implement rtnl_link_ops->alloc and ndo_queue_create
Implement rtnl_link_ops->alloc that allows the number of rx queues to be
set when netkit is created. By default, netkit has only a single rxq (and
single txq). The number of queues is deliberately not allowed to be changed
via ethtool -L and is fixed for the lifetime of a netkit instance.
For netkit device creation, a numrxqueues value larger than one can be
specified. These rxqs are leasable to real rxqs in physical netdevs:
ip link add type netkit peer numrxqueues 64 # for device pair
ip link add numrxqueues 64 type netkit single # for single device
The limit of numrxqueues for netkit is currently set to 1024, which allows
leasing multiple real rxqs from physical netdevs.
The implementation of ndo_queue_create() adds a new rxq during the queue
lease operation. We allow queues to be created either in single device mode
or for the case of dual device mode for the netkit peer device which gets
placed into the target network namespace. For dual device mode the lease
against the primary device does not make sense for the targeted use cases,
and therefore gets rejected.
We also need to add a lockdep class for netkit, such that lockdep does
not trip over us, similar to what was done in commit 0bef512012b1 ("net: add
netdev_lockdep_set_classes() to virtual drivers").
This is also the last missing bit to netkit for supporting io_uring with
zero-copy mode [0]. Up until this point it was not possible to consume the
latter out of containers or Kubernetes Pods where applications are in their
own network namespace.
io_uring example with eth0 being a physical device with 16 queues where
netkit is bound to the last queue, iou-zcrx.c is binary from selftests;
ethtool configuration (tcp-data-split, hds_thresh, RSS, flow steering)
is done on the physical device by the control plane; here, flow steering
to that queue is based on the service VIP:port of the server utilizing
io_uring:
# ethtool -X eth0 start 0 equal 15
# ethtool -X eth0 start 15 equal 1 context new
# ethtool --config-ntuple eth0 flow-type tcp4 dst-ip 1.2.3.4 dst-port 5000 action 15
# ip netns add foo
# ip link add type netkit peer numrxqueues 2
# ynl --family netdev --output-json --do queue-create \
--json "{"ifindex": $(ifindex nk0), "type": "rx", \
"lease": { "ifindex": $(ifindex eth0), \
"queue": { "type": "rx", "id": 15 } } }"
{'id': 1}
# ip link set nk0 netns foo
# ip link set nk1 up
# ip netns exec foo ip link set lo up
# ip netns exec foo ip link set nk0 up
# ip netns exec foo ip addr add 1.2.3.4/32 dev nk0
[ ... setup routing etc to get external traffic into the netns ... ]
# ip netns exec foo ./iou-zcrx -s -p 5000 -i nk0 -q 1
For Cilium, the plan is to open up support for the various memory providers
for regular Kubernetes Pods when Cilium is configured with netkit datapath
mode.
Daniel Borkmann [Thu, 2 Apr 2026 23:10:27 +0000 (01:10 +0200)]
netkit: Add single device mode for netkit
Add a single device mode for netkit instead of netkit pairs. The primary
target for the paired devices is to connect network namespaces, of course,
and support has been implemented in projects like Cilium [0]. For the rxq
leasing the plan is to support two main scenarios related to single device
mode:
* For the use-case of io_uring zero-copy, the control plane can either
set up a netkit pair where the peer device can perform rxq leasing which
is then tied to the lifetime of the peer device, or the control plane
can use a regular netkit pair to connect the hostns to a Pod/container
and dynamically add/remove rxq leasing through a single device without
having to interrupt the device pair. In the case of io_uring, the memory
pool is used as skb non-linear pages, and thus the skb will go its way
through the regular stack into netkit. Things like the netkit policy when
no BPF is attached or skb scrubbing etc apply as-is in case the paired
devices are used, or if the backend memory is tied to the single device
and traffic goes through a paired device.
* For the use-case of AF_XDP, the control plane needs to use netkit in the
single device mode. The single device mode currently enforces only a
pass policy when no BPF is attached, and does not yet support BPF link
attachments for AF_XDP. skbs sent to that device get dropped at the
moment. Given AF_XDP operates at a lower layer of the stack, tying this
to the netkit pair did not make sense. In future, the plan is to allow
BPF at the XDP layer which can: i) process traffic coming from the AF_XDP
application (e.g. QEMU with AF_XDP backend) to filter egress traffic or
to push selected egress traffic up to the single netkit device to the
local stack (e.g. DHCP requests), and ii) vice versa, push skbs sent to
the single netkit device into the AF_XDP application (e.g. DHCP replies). Also,
the control-plane can dynamically manage rxq leasing for the single
netkit device without having to interrupt (e.g. down/up cycle) the main
netkit pair for the Pod which has traffic going in and out.
Daniel Borkmann [Thu, 2 Apr 2026 23:10:26 +0000 (01:10 +0200)]
xsk: Proxy pool management for leased queues
Similarly to the netif_mp_{open,close}_rxq handling for leased queues, proxy
the xsk_{reg,clear}_pool_at_qid via netif_get_rx_queue_lease_locked such
that in case a virtual netdev picked a leased rxq, the request gets through
to the real rxq in the physical netdev. The proxying is only relevant for
queue_id < dev->real_num_rx_queues since right now it's only supported for
rxqs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-10-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann [Thu, 2 Apr 2026 23:10:25 +0000 (01:10 +0200)]
xsk: Extend xsk_rcv_check validation
xsk_rcv_check tests for inbound packets to see whether they match
the bound AF_XDP socket. Refactor the test into a small helper
xsk_dev_queue_valid and move the validation against xs->dev and
xs->queue_id there.
The fast-path case stays in place and allows for quick return in
xsk_dev_queue_valid. If it fails, the validation is extended to
check whether the AF_XDP socket is bound against a leased queue,
and if so, the test is redone.
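The two-step check described above can be modelled in userspace roughly as follows. This is only a sketch: the `Socket` class, the `leases` mapping and the function name `dev_queue_valid` are hypothetical stand-ins and do not reflect the kernel's actual data structures or the signature of xsk_dev_queue_valid.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Socket:
    """Hypothetical stand-in for an AF_XDP socket binding (xs->dev, xs->queue_id)."""
    dev: str
    queue_id: int


# Hypothetical lease table: (virtual dev, virtual queue) -> (physical dev, physical queue).
Leases = Dict[Tuple[str, int], Tuple[str, int]]


def dev_queue_valid(xs: Socket, rx_dev: str, rx_queue: int, leases: Leases) -> bool:
    # Fast path: the packet arrived on exactly the device/queue the socket is bound to.
    if (xs.dev, xs.queue_id) == (rx_dev, rx_queue):
        return True
    # Slow path: the socket may be bound to a virtual queue that leases a
    # physical one, so redo the comparison against the lease target.
    target = leases.get((xs.dev, xs.queue_id))
    return target == (rx_dev, rx_queue)
```

With a lease from ("nk0", 1) to ("eth0", 15), a socket bound to nk0 queue 1 accepts packets received on eth0 queue 15, while the common non-leased case returns after the first comparison.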
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-9-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Wei [Thu, 2 Apr 2026 23:10:24 +0000 (01:10 +0200)]
net: Proxy netdev_queue_get_dma_dev for leased queues
Extend netdev_queue_get_dma_dev to return the physical device of the
real rxq for DMA in case the queue was leased. This allows memory
providers like io_uring zero-copy or devmem to bind to the physically
leased rxq via virtual devices such as netkit.
Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-8-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Wei [Thu, 2 Apr 2026 23:10:23 +0000 (01:10 +0200)]
net: Proxy netif_mp_{open,close}_rxq for leased queues
When a process in a container wants to setup a memory provider, it will
use the virtual netdev and a leased rxq, and call netif_mp_{open,close}_rxq
to try and restart the queue. At this point, proxy the queue restart on
the real rxq in the physical netdev.
For memory providers (io_uring zero-copy rx and devmem), it causes the
real rxq in the physical netdev to be filled from a memory provider that
has DMA mapped memory from a process within a container.
Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-7-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann [Thu, 2 Apr 2026 23:10:22 +0000 (01:10 +0200)]
net: Slightly simplify net_mp_{open,close}_rxq
net_mp_open_rxq is currently not used in the tree as all callers are
using __net_mp_open_rxq directly, and net_mp_close_rxq is only used
once while all other locations use __net_mp_close_rxq.
Consolidate into a single API, netif_mp_{open,close}_rxq, using the
netif_ prefix to indicate that the caller is responsible for locking.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-6-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann [Thu, 2 Apr 2026 23:10:21 +0000 (01:10 +0200)]
net, ethtool: Disallow leased real rxqs to be resized
Similar to AF_XDP, do not allow queues in a physical netdev to be resized
by ethtool -L when they are leased. Cover channel resize paths (both
netlink and ioctl) to reject resizing when the queues would be affected.
Given we need to have different checks for RX vs TX, detangle the code into
a two-loop version rather than the range of new_combined + min(new_rx, new_tx)
to old_combined + max(old_rx, old_tx).
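A minimal sketch of the two-loop idea, assuming that RX queue indices run over combined + rx and TX queue indices over combined + tx. The function name, the `leased_rx` set and the `busy_tx` set (TX queues that some other user, such as AF_XDP, has claimed) are hypothetical; the kernel's actual predicates are not shown in this log.

```python
def resize_allowed(old, new, leased_rx, busy_tx=frozenset()):
    """old and new are (combined, rx, tx) channel counts.

    Returns False if the resize would remove an RX queue that is leased
    or a TX queue that is marked busy.
    """
    old_c, old_rx, old_tx = old
    new_c, new_rx, new_tx = new
    # Loop 1: RX queues that would disappear after the resize.
    for q in range(new_c + new_rx, old_c + old_rx):
        if q in leased_rx:
            return False
    # Loop 2: TX queues that would disappear, checked with a separate predicate.
    for q in range(new_c + new_tx, old_c + old_tx):
        if q in busy_tx:
            return False
    return True
```

Growing the channel count, or shrinking it without touching a leased index, yields empty or harmless ranges and is accepted.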
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-5-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann [Thu, 2 Apr 2026 23:10:20 +0000 (01:10 +0200)]
net: Add lease info to queue-get response
Populate nested lease info to the queue-get response that returns the
ifindex, queue id with type and optionally netns id if the device
resides in a different netns.
Example with ynl client when using AF_XDP via queue leasing:
# ip a
[...]
4: enp10s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:24 qdisc mq state UP group default qlen 1000
link/ether e8:eb:d3:a3:43:f6 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.2/24 scope global enp10s0f0np0
valid_lft forever preferred_lft forever
inet6 fe80::eaeb:d3ff:fea3:43f6/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
[...]
# ip netns exec foo ip a
[...]
8: nk@NONE: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
inet6 fe80::200:ff:fe00:0/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
[...]
# ip netns exec foo ethtool -i nk
driver: netkit
[...]
# ip netns exec foo ls /sys/class/net/nk/queues/
rx-0 rx-1 tx-0
Note that the caller of netdev_nl_queue_fill_one() holds the netdevice
lock. For the queue-get we do not lock both devices. When queues get
{un,}leased, both devices are locked, thus if __netif_get_rx_queue_lease()
returns a lease pointer, it points to a valid device. The netns-id is
fetched via peernet2id_alloc() similarly as done in OVS.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-4-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Note that the netdevice locking order is always from the virtual to
the physical device.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-3-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann [Thu, 2 Apr 2026 23:10:18 +0000 (01:10 +0200)]
net: Add queue-create operation
Add a ynl netdev family operation called queue-create that creates a
new queue on a netdevice:
  name: queue-create
  attribute-set: queue
  flags: [admin-perm]
  do:
    request:
      attributes:
        - ifindex
        - type
        - lease
    reply: &queue-create-op
      attributes:
        - id
This is a generic operation such that it can be extended for various
use cases in future. Right now it is mandatory to specify ifindex,
the queue type which is enforced to rx and a lease. The newly created
queue id is returned to the caller.
A queue from a virtual device can have a lease which refers to another
queue from a physical device. This is useful for memory providers
and AF_XDP operations which take an ifindex and queue id to allow
applications to bind against virtual devices in containers. The lease
couples both queues together and allows to proxy the operations from
a virtual device in a container to the physical device.
In future, the nested lease attribute can be lifted and made optional
for other use-cases such as dynamic queue creation for physical
netdevs. The lack of lease and the specification of the physical
device as an ifindex will imply that we need a real queue to be
allocated. Similarly, the queue type enforcement to rx can then be
lifted as well to support tx.
An early implementation had only driver-specific integration [0], but
in order for other virtual devices to reuse, it makes sense to have
this as a generic API in core net.
For leasing queues, the virtual netdev must have real_num_rx_queues
less than num_rx_queues at the time of calling queue-create. The
queue-type must be rx as only rx queues are supported for leasing
for now. We also enforce that the queue-create ifindex must point
to a virtual device, and that the nested lease attribute's ifindex
must point to a physical device. The nested lease attribute set
contains a netns-id attribute which is optional and can specify a
netns-id relative to the caller's netns. It requires cap_net_admin
and if the netns-id attribute is not specified, the lease ifindex
will be retrieved from the current netns. Also, it is modeled as
an s32 type similarly as done elsewhere in the stack.
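The preconditions listed above can be summarised in a small userspace model. This is a sketch under stated assumptions: the `Dev` class, the function name, the error strings and the order of the checks are all hypothetical, and the optional netns-id lookup is left out.

```python
from dataclasses import dataclass


@dataclass
class Dev:
    """Hypothetical device model; field names mirror the commit message."""
    ifindex: int
    is_virtual: bool
    num_rx_queues: int = 1
    real_num_rx_queues: int = 1


def check_queue_create(devs, ifindex, qtype, lease_ifindex):
    """Return None if the request would be accepted, else the violated rule."""
    dev = devs.get(ifindex)
    phys = devs.get(lease_ifindex)
    if dev is None or phys is None:
        return "no such device"
    if qtype != "rx":
        return "only rx queues can be leased"
    if not dev.is_virtual:
        return "ifindex must point to a virtual device"
    if phys.is_virtual:
        return "lease ifindex must point to a physical device"
    # A spare slot is needed: real_num_rx_queues must be below num_rx_queues.
    if dev.real_num_rx_queues >= dev.num_rx_queues:
        return "no spare rx queue on the virtual device"
    return None
```

For a netkit device created with `numrxqueues 2` and one active rx queue, the model accepts a single lease; once real_num_rx_queues reaches num_rx_queues, further requests are refused.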
Ming Lei [Thu, 9 Apr 2026 13:30:19 +0000 (21:30 +0800)]
MAINTAINERS: update ublk driver maintainer email
Update the ublk userspace block driver maintainer email address
from ming.lei@redhat.com to tom.leiming@gmail.com as the original
email will become invalid.
Ming Lei [Thu, 9 Apr 2026 13:30:18 +0000 (21:30 +0800)]
Documentation: ublk: address review comments for SHMEM_ZC docs
- Use "physical pages" instead of "page frame numbers (PFNs)" for
clarity
- Remove "without any per-I/O overhead" claim from zero-copy
description
- Add scatter/gather limitation: each I/O's data must be contiguous
within a single registered buffer
Ming Lei [Thu, 9 Apr 2026 13:30:17 +0000 (21:30 +0800)]
ublk: allow buffer registration before device is started
Before START_DEV, there is no disk, no queue, no I/O dispatch, so
the maple tree can be safely modified under ub->mutex alone without
freezing the queue.
Add ublk_lock_buf_tree()/ublk_unlock_buf_tree() helpers that take
ub->mutex first, then freeze the queue if device is started. This
ordering (mutex -> freeze) is safe because ublk_stop_dev_unlocked()
already holds ub->mutex when calling del_gendisk() which freezes
the queue.
Ming Lei [Thu, 9 Apr 2026 13:30:16 +0000 (21:30 +0800)]
ublk: replace xarray with IDA for shmem buffer index allocation
Remove struct ublk_buf which only contained nr_pages that was never
read after registration. Use IDA for pure index allocation instead
of xarray. Make __ublk_ctrl_unreg_buf() return int so the caller
can detect invalid index without a separate lookup.
Simplify ublk_buf_cleanup() to walk the maple tree directly and
unpin all pages in one pass, instead of iterating the xarray by
buffer index.
Ming Lei [Thu, 9 Apr 2026 13:30:14 +0000 (21:30 +0800)]
ublk: verify all pages in multi-page bvec fall within registered range
rq_for_each_bvec() yields multi-page bvecs where bv_page is only the
first page. ublk_try_buf_match() only validated the start PFN against
the maple tree, but a bvec can span multiple pages past the end of a
registered range.
Use mas_walk() instead of mtree_load() to obtain the range boundaries
stored in the maple tree, and check that the bvec's end PFN does not
exceed the range. Also remove base_pfn from struct ublk_buf_range
since mas.index already provides the range start PFN.
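The difference between checking only the start PFN and checking the whole bvec can be illustrated with a small model. This is a sketch, not the driver code: `RangeMap` is a hypothetical sorted-interval stand-in for the maple tree, and its `walk()` only mimics the idea of returning the boundaries of the range that contains an index.

```python
import bisect


class RangeMap:
    """Sorted, non-overlapping inclusive PFN ranges; a stand-in for the maple tree."""

    def __init__(self):
        self._starts = []
        self._ends = []

    def insert(self, first, last):
        i = bisect.bisect_left(self._starts, first)
        self._starts.insert(i, first)
        self._ends.insert(i, last)

    def walk(self, pfn):
        """Return (first, last) of the range containing pfn, or None."""
        i = bisect.bisect_right(self._starts, pfn) - 1
        if i >= 0 and self._ends[i] >= pfn:
            return self._starts[i], self._ends[i]
        return None


def bvec_in_registered_range(ranges, start_pfn, nr_pages):
    hit = ranges.walk(start_pfn)
    if hit is None:
        return False
    _first, last = hit
    # The last page of the bvec must not run past the end of the range.
    return start_pfn + nr_pages - 1 <= last
```

With a registered range of PFNs 100 to 109, a three-page bvec starting at PFN 108 passes a start-only check but is rejected here because its last page is PFN 110.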
Ming Lei [Thu, 9 Apr 2026 13:30:13 +0000 (21:30 +0800)]
ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support
The __u32 len field cannot represent a 4GB buffer (0x100000000
overflows to 0). Change it to __u64 so buffers up to 4GB can be
registered. Add a reserved field for alignment and validate it
is zero.
The kernel enforces a default max of 4GB (UBLK_SHMEM_BUF_SIZE_MAX)
which may be increased in future.
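The truncation is easy to reproduce with plain integer masking. The helper names below are illustrative only and are not part of the ublk UAPI.

```python
FOUR_GB = 1 << 32  # 0x100000000


def as_u32(value):
    """Keep only the low 32 bits, as storing into a __u32 field would."""
    return value & 0xFFFFFFFF


def as_u64(value):
    """Keep only the low 64 bits, as storing into a __u64 field would."""
    return value & 0xFFFFFFFFFFFFFFFF
```

Here `as_u32(FOUR_GB)` evaluates to 0 while `as_u64(FOUR_GB)` preserves the full length, which is why the wider field is needed to describe a buffer of exactly 4GB.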
Hyungjung Joo [Fri, 13 Mar 2026 13:29:43 +0000 (22:29 +0900)]
affs: bound hash_pos before table lookup in affs_readdir
affs_readdir() decodes ctx->pos into hash_pos and chain_pos and then
dereferences AFFS_HEAD(dir_bh)->table[hash_pos] before validating
that hash_pos is within the runtime table bound. Treat out-of-range
positions as end-of-directory before the first table lookup.
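A sketch of the bounds check, assuming that positions 0 and 1 are reserved for "." and ".." and that the remainder packs hash_pos into the upper bits and chain_pos into the low 16 bits. That encoding is an assumption made for illustration, and the function names are hypothetical.

```python
def decode_pos(pos):
    """Split a directory position into (hash_pos, chain_pos); assumed encoding."""
    idx = pos - 2  # assumes pos >= 2, i.e. past "." and ".."
    return idx >> 16, idx & 0xFFFF


def first_lookup(table, pos):
    """Return the hash table entry for pos, or None for end-of-directory."""
    hash_pos, _chain_pos = decode_pos(pos)
    # Validate before indexing: an out-of-range position means there is
    # nothing left to read, not a reason to touch memory past the table.
    if hash_pos >= len(table):
        return None
    return table[hash_pos]
```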
Signed-off-by: Hyungjung Joo <jhj140711@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'kbuild-fixes-7.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux
Pull Kbuild fixes from Nathan Chancellor:
- Make modules-cpio-pkg respect INSTALL_MOD_PATH so that it can be
used with distribution initramfs files that have a merged /usr,
such as Fedora
- Silence an instance of -Wunused-but-set-global, a strengthening
of -Wunused-but-set-variable in tip of tree Clang, in modpost,
as the variable for extra warnings is currently unused
* tag 'kbuild-fixes-7.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
modpost: Declare extra_warn with unused attribute
kbuild: modules-cpio-pkg: Respect INSTALL_MOD_PATH
drm/ttm/tests: Remove checks from ttm_pool_free_no_dma_alloc
On !x86, the pool type is never initialised, and the pages are freed
back to the system.
The test broke on the list_lru rewrite, but I'm not sure how it was
supposed to work previously. In the meantime CI is broken so reverting
for now.
Fixes: 444e2a19d7fd ("ttm/pool: port to list_lru. (v2)")
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dave Airlie <airlied@redhat.com>
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patch.msgid.link/20260409142658.1511941-2-dev@lankhorst.se
Matthew Auld [Thu, 9 Apr 2026 12:15:09 +0000 (13:15 +0100)]
drm/ttm/tests: fix lru_count ASSERT
On pool init we should expect the lru_count for each node to be zeroed
as per __list_lru_init -> init_one_lru, but here we are asserting the
opposite.
Currently our CI is blowing up with:
[10:23:33] # ttm_device_init_pools: ASSERTION FAILED at drivers/gpu/drm/ttm/tests/ttm_device_test.c:178
[10:23:33] Expected !list_lru_count(&pt.pages) to be false, but is true
[10:23:33] [FAILED] DMA allocations, DMA32 required
[10:23:33] [PASSED] No DMA allocations, DMA32 required
[10:23:33] # ttm_device_init_pools: ASSERTION FAILED at drivers/gpu/drm/ttm/tests/ttm_device_test.c:178
[10:23:33] Expected !list_lru_count(&pt.pages) to be false, but is true
Fixes: 444e2a19d7fd ("ttm/pool: port to list_lru. (v2)")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Dave Airlie <airlied@redhat.com>
Reviewed-by: Ryszard Knop <ryszard.knop@intel.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patch.msgid.link/20260409121512.81298-3-matthew.auld@intel.com
bpf: Fix use-after-free in offloaded map/prog info fill
When querying info for an offloaded BPF map or program,
bpf_map_offload_info_fill_ns() and bpf_prog_offload_info_fill_ns()
obtain the network namespace with get_net(dev_net(offmap->netdev)).
However, the associated netdev's netns may be racing with teardown
during netns destruction. If the netns refcount has already reached 0,
get_net() performs a refcount_t increment on 0, triggering:
refcount_t: addition on 0; use-after-free.
Although rtnl_lock and bpf_devs_lock ensure the netdev pointer remains
valid, they cannot prevent the netns refcount from reaching zero.
Fix this by using maybe_get_net() instead of get_net(). maybe_get_net()
uses refcount_inc_not_zero() and returns NULL if the refcount is already
zero, which causes ns_get_path_cb() to fail and the caller to return
-ENOENT -- the correct behavior when the netns is being destroyed.
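The contrast between an unconditional and a conditional reference grab can be modelled in userspace. This is a sketch and not the kernel's refcount_t: the `RefCount` class is hypothetical, uses a lock instead of atomics, and raises an exception where the kernel would emit the warning quoted above.

```python
import threading


class RefCount:
    """Lock-based userspace model of a reference count."""

    def __init__(self, initial=1):
        self._lock = threading.Lock()
        self._count = initial

    def inc(self):
        """Unconditional get, in the spirit of get_net(): 0 -> 1 is a bug."""
        with self._lock:
            if self._count == 0:
                raise RuntimeError("addition on 0; use-after-free")
            self._count += 1

    def inc_not_zero(self):
        """Conditional get, in the spirit of maybe_get_net(): fails on 0."""
        with self._lock:
            if self._count == 0:
                return False
            self._count += 1
            return True

    def dec(self):
        """Drop one reference; returns True when the count reaches zero."""
        with self._lock:
            self._count -= 1
            return self._count == 0
```

Once the last reference has been dropped, `inc_not_zero()` reports failure so the caller can bail out with an error, whereas `inc()` would resurrect an object that is already being torn down.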
Fixes: 675fc275a3a2d ("bpf: offload: report device information for offloaded programs")
Fixes: 52775b33bb507 ("bpf: offload: report device information about offloaded maps")
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/f0aa3678-79c9-47ae-9e8c-02a3d1df160a@hust.edu.cn/
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260409023733.168050-1-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
MAINTAINERS: Remove Salil Mehta as HiSilicon HNS3/HNS Ethernet maintainer
Closing this chapter and a long wonderful journey with my team, I sign off one
last time with my Huawei email address. Remove my maintainer entry for the
HiSilicon HNS and HNS3 10G/100G Ethernet drivers, and add a CREDITS entry for
my co-authorship and maintenance contributions to these drivers.
Cross-merge networking fixes after downstream PR (net-7.0-rc8).
Conflicts:
net/ipv6/seg6_iptunnel.c
c3812651b522f ("seg6: separate dst_cache for input and output paths in seg6 lwtunnel")
78723a62b969a ("seg6: add per-route tunnel source address")
https://lore.kernel.org/adZhwtOYfo-0ImSa@sirena.org.uk
net/ipv4/icmp.c
fde29fd934932 ("ipv4: icmp: fix null-ptr-deref in icmp_build_probe()")
d98adfbdd5c01 ("ipv4: drop ipv6_stub usage and use direct function calls")
https://lore.kernel.org/adO3dccqnr6j-BL9@sirena.org.uk
Daniel Borkmann [Thu, 9 Apr 2026 15:50:16 +0000 (17:50 +0200)]
selftests/bpf: Add test for stale pkt range after scalar arithmetic
Extend the verifier_direct_packet_access BPF selftests to exercise the
verifier code paths which ensure that the pkt range is cleared after
add/sub alu with a known scalar. The tests reject the invalid access.
# LDLIBS=-static PKG_CONFIG='pkg-config --static' ./vmtest.sh -- ./test_progs -t verifier_direct
[...]
#592/35 verifier_direct_packet_access/direct packet access: pkt_range cleared after sub with known scalar:OK
#592/36 verifier_direct_packet_access/direct packet access: pkt_range cleared after add with known scalar:OK
#592/37 verifier_direct_packet_access/direct packet access: test3:OK
#592/38 verifier_direct_packet_access/direct packet access: test3 @unpriv:OK
#592/39 verifier_direct_packet_access/direct packet access: test34 (non-linear, cgroup_skb/ingress, too short eth):OK
#592/40 verifier_direct_packet_access/direct packet access: test35 (non-linear, cgroup_skb/ingress, too short 1):OK
#592/41 verifier_direct_packet_access/direct packet access: test36 (non-linear, cgroup_skb/ingress, long enough):OK
#592 verifier_direct_packet_access:OK
[...]
Summary: 2/47 PASSED, 0 SKIPPED, 0 FAILED
Daniel Borkmann [Thu, 9 Apr 2026 15:50:15 +0000 (17:50 +0200)]
bpf: Drop pkt_end markers on arithmetic to prevent is_pkt_ptr_branch_taken
When a pkt pointer acquires AT_PKT_END or BEYOND_PKT_END range from
a comparison, and then, known-constant arithmetic is performed,
adjust_ptr_min_max_vals() copies the stale range via dst_reg->raw =
ptr_reg->raw without clearing the negative reg->range sentinel values.
This lets is_pkt_ptr_branch_taken() choose one branch direction and
skip going through the other. Fix this by clearing negative pkt range
values (that is, AT_PKT_END and BEYOND_PKT_END) after arithmetic on
pkt pointers. This ensures is_pkt_ptr_branch_taken() returns unknown
and both branches are properly verified.
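A toy model of the stale-marker problem, heavily simplified relative to the verifier. The sentinel values -1 and -2, the `PktReg` class and both function names are assumptions for illustration; only the idea that negative range values act as markers comes from the text above.

```python
from dataclasses import dataclass, replace

# Assumed sentinel values; the commit only states that they are negative.
AT_PKT_END = -1
BEYOND_PKT_END = -2


@dataclass(frozen=True)
class PktReg:
    """Hypothetical packet-pointer register: constant offset plus range marker."""
    off: int
    pkt_range: int


def add_const(reg, delta, clear_markers=True):
    """Model pointer arithmetic with a known constant."""
    new_range = reg.pkt_range
    if clear_markers and new_range < 0:
        # The fix: a marker learned from a comparison says nothing about
        # the pointer after it has moved, so forget it.
        new_range = 0
    return replace(reg, off=reg.off + delta, pkt_range=new_range)


def ge_pkt_end_known(reg):
    """Return True if 'pkt >= pkt_end' is considered decided, else None."""
    if reg.pkt_range in (AT_PKT_END, BEYOND_PKT_END):
        return True
    return None
```

Without clearing, a pointer marked AT_PKT_END and then moved back by 8 bytes is still treated as being at the packet end, so only one branch would be explored. With clearing, the result is unknown and both branches are checked.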
Fixes: 6d94e741a8ff ("bpf: Support for pointers beyond pkt_end.")
Reported-by: STAR Labs SG <info@starlabs.sg>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260409155016.536608-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Merge ACPI core driver updates and assorted driver updates
related to ACPI support for 7.1-rc1:
- Clean up the ACPI AC and ACPI PAD (processor aggregator device)
drivers (Rafael Wysocki)
- Rework checking for duplicate video bus devices and consolidate
pnp.bus_id workarounds handling in the ACPI video bus driver (Rafael
Wysocki)
- Update the ACPI core device drivers to stop setting acpi_device_name()
unnecessarily (Rafael Wysocki)
- Rearrange code using acpi_device_class() in the ACPI core device
drivers and update them to stop setting acpi_device_class()
unnecessarily (Rafael Wysocki)
- Define ACPI_AC_CLASS in one place (Rafael Wysocki)
- Convert the ni903x_wdt watchdog driver and the xen ACPI PAD driver to
bind to platform devices instead of ACPI devices (Rafael Wysocki)
* acpi-driver:
watchdog: ni903x_wdt: Convert to a platform driver
ACPI: PAD: xen: Convert to a platform driver
ACPI: AC: Define ACPI_AC_CLASS in one place
ACPI: driver: Do not set acpi_device_class() unnecessarily
ACPI: driver: Avoid using pnp.device_class for netlink handling
ACPI: event: Redefine acpi_notifier_call_chain()
ACPI: driver: Do not set acpi_device_name() unnecessarily
ACPI: video: Consolidate pnp.bus_id workarounds handling
ACPI: video: Rework checking for duplicate video bus devices
driver core: auxiliary bus: Introduce dev_is_auxiliary()
ACPI: PAD: Rearrange notify handler installation and removal
ACPI: AC: Get rid of unnecessary declarations
regmap: debugfs: fix race condition in dummy name allocation
Use IDA instead of a simple counter for generating unique dummy names.
The previous implementation used dummy_index++ which is not atomic,
leading to potential duplicate names when multiple threads call
regmap_debugfs_init() concurrently with name="dummy".
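The idea of handing out indices from an allocator instead of a bare counter can be sketched as follows. `IdAllocator` is a hypothetical lock-protected smallest-free-ID allocator standing in for the kernel IDA, and the "dummy<N>" name format is assumed for illustration.

```python
import threading


class IdAllocator:
    """Smallest-free-ID allocator guarded by a lock; a stand-in for an IDA."""

    def __init__(self):
        self._lock = threading.Lock()
        self._used = set()

    def alloc(self):
        with self._lock:
            n = 0
            while n in self._used:
                n += 1
            self._used.add(n)
            return n

    def free(self, n):
        with self._lock:
            self._used.discard(n)


def dummy_name(ida):
    """Build a unique placeholder name from a freshly allocated index."""
    return f"dummy{ida.alloc()}"
```

Because allocation and bookkeeping happen under one lock, concurrent callers cannot observe the same index, and a freed index becomes available for reuse, which a monotonically increasing counter would not offer.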
Merge ACPI Time and Alarm Device (TAD) driver updates for 7.1-rc1:
- Clean up the ACPI TAD driver in various ways and add an RTC class
device interface, including both the RTC setting/reading and alarm
timer support, to it (Rafael Wysocki)
* acpi-tad:
ACPI: TAD: Add alarm support to the RTC class device interface
ACPI: TAD: Split acpi_tad_rtc_read_time()
ACPI: TAD: Relocate two functions
ACPI: TAD: Split three functions to untangle runtime PM handling
ACPI: TAD: Use DC wakeup only if AC wakeup is supported
ACPI: TAD: Use dev_groups in struct device_driver
ACPI: TAD: Update the driver description comment
ACPI: TAD: Add RTC class device interface
ACPI: TAD: Clear unused RT data in acpi_tad_set_real_time()
ACPI: TAD: Rearrange RT data validation checking
ACPI: TAD: Use __free() for cleanup in time_store()
ACPI: TAD: Support RTC without wakeup
ACPI: TAD: Create one attribute group
Merge updates related to the CMOS RTC driver and x86/ACPI CMOS RTC
support for 7.1-rc1:
- Add ACPI support to the platform device interface in the CMOS RTC
driver, make the ACPI core device enumeration code create a platform
device for the CMOS RTC, and drop CMOS RTC PNP device support (Rafael
Wysocki)
- Consolidate the x86-specific CMOS RTC handling with the ACPI TAD
driver and clean up the CMOS RTC ACPI address space handler (Rafael
Wysocki)
- Enable ACPI alarm in the CMOS RTC driver if advertised in ACPI FADT
and allow that driver to work without a dedicated IRQ if the ACPI
alarm is used (Rafael Wysocki)
* acpi-cmos-rtc:
rtc: cmos: Do not require IRQ if ACPI alarm is used
rtc: cmos: Enable ACPI alarm if advertised in ACPI FADT
ACPI: TAD/x86: cmos_rtc: Consolidate address space handler setup
rtc: cmos: Drop PNP device support
x86: rtc: Drop PNP device check
ACPI: PNP: Drop CMOS RTC PNP device support
ACPI: x86/rtc-cmos: Use platform device for driver binding
ACPI: x86: cmos_rtc: Create a CMOS RTC platform device
ACPI: x86: cmos_rtc: Improve coordination with ACPI TAD driver
ACPI: x86: cmos_rtc: Clean up address space handler driver
Merge ACPI processor driver updates and ACPI CPPC library updates for
7.1-rc1:
- Address multiple assorted issues and clean up the code in the ACPI
processor idle driver (Huisong Li)
- Replace strlcat() in the ACPI processor idle driver with a better
alternative (Andy Shevchenko)
- Rearrange and clean up acpi_processor_errata_piix4() (Rafael Wysocki)
- Move reference performance to capabilities and fix an uninitialized
variable in the ACPI CPPC library (Pengjie Zhang)
- Add support for the Performance Limited Register to the ACPI CPPC
library (Sumit Gupta)
- Add cppc_get_perf() API to read performance controls, extend
cppc_set_epp_perf() for FFH/SystemMemory, and make the ACPI CPPC
library warn on missing mandatory DESIRED_PERF register (Sumit Gupta)
- Modify the cpufreq CPPC driver to update MIN_PERF/MAX_PERF in target
callbacks to allow it to control performance bounds via standard
scaling_min_freq and scaling_max_freq sysfs attributes and add sysfs
documentation for the Performance Limited Register to it (Sumit Gupta)
* acpi-processor:
ACPI: processor: idle: Reset cpuidle on C-state list changes
cpuidle: Extract and export no-lock variants of cpuidle_unregister_device()
ACPI: processor: idle: Fix NULL pointer dereference in hotplug path
ACPI: processor: idle: Reset power_setup_done flag on initialization failure
ACPI: processor: Rearrange and clean up acpi_processor_errata_piix4()
ACPI: processor: idle: Replace strlcat() with better alternative
ACPI: processor: idle: Remove redundant static variable and rename cstate check function
ACPI: processor: idle: Move max_cstate update out of the loop
ACPI: processor: idle: Remove redundant cstate check in acpi_processor_power_init
ACPI: processor: idle: Add missing bounds check in flatten_lpi_states()
* acpi-cppc:
ACPI: CPPC: Check cpc_read() return values consistently
ACPI: CPPC: Fix uninitialized ref variable in cppc_get_perf_caps()
ACPI: CPPC: Move reference performance to capabilities
cpufreq: CPPC: Add sysfs documentation for perf_limited
ACPI: CPPC: add APIs and sysfs interface for perf_limited
cpufreq: cppc: Update MIN_PERF/MAX_PERF in target callbacks
cpufreq: CPPC: Update cached perf_ctrls on sysfs write
ACPI: CPPC: Extend cppc_set_epp_perf() for FFH/SystemMemory
ACPI: CPPC: Warn on missing mandatory DESIRED_PERF register
ACPI: CPPC: Add cppc_get_perf() API to read performance controls
Mark Brown [Thu, 9 Apr 2026 19:19:36 +0000 (20:19 +0100)]
regulator: fix OF node imbalance on reuse
Johan Hovold <johan@kernel.org> says:
These drivers reuse the OF node of their parent multi-function device
but fail to take another reference to balance the one dropped by the
platform bus code when unbinding the MFD and deregistering the child
devices.
Fix this by using the intended helper for reusing OF nodes.
Note that the first two patches will cause a trivial conflict with Doug's
series adding accessor functions for struct device flags which has now been
merged to the driver-core tree:
Johan Hovold [Wed, 8 Apr 2026 07:30:55 +0000 (09:30 +0200)]
regulator: bd9571mwv: fix OF node reference imbalance
The driver reuses the OF node of the parent multi-function device but
fails to take another reference to balance the one dropped by the
platform bus code when unbinding the MFD and deregistering the child
devices.
Fix this by using the intended helper for reusing OF nodes.
Fixes: e85c5a153fe2 ("regulator: Add ROHM BD9571MWV-M PMIC regulator driver")
Cc: stable@vger.kernel.org # 4.12
Cc: Marek Vasut <marek.vasut@gmail.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260408073055.5183-8-johan@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Johan Hovold [Wed, 8 Apr 2026 07:30:54 +0000 (09:30 +0200)]
regulator: act8945a: fix OF node reference imbalance
The driver reuses the OF node of the parent multi-function device but
fails to take another reference to balance the one dropped by the
platform bus code when unbinding the MFD and deregistering the child
devices.
Fix this by using the intended helper for reusing OF nodes.
Fixes: 38c09961048b ("regulator: act8945a: add regulator driver for ACT8945A")
Cc: stable@vger.kernel.org # 4.6
Cc: Wenyou Yang <wenyou.yang@atmel.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260408073055.5183-7-johan@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Johan Hovold [Wed, 8 Apr 2026 07:30:53 +0000 (09:30 +0200)]
regulator: s2dos05: fix OF node reference imbalance
The driver reuses the OF node of the parent multi-function device but
fails to take another reference to balance the one dropped by the
platform bus code when unbinding the MFD and deregistering the child
devices.
Fix this by using the intended helper for reusing OF nodes.
Fixes: bb2441402392 ("regulator: add s2dos05 regulator support")
Cc: stable@vger.kernel.org # 6.18
Cc: Dzmitry Sankouski <dsankouski@gmail.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260408073055.5183-6-johan@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Johan Hovold [Wed, 8 Apr 2026 07:30:52 +0000 (09:30 +0200)]
regulator: mt6357: fix OF node reference imbalance
The driver reuses the OF node of the parent multi-function device but
fails to take another reference to balance the one dropped by the
platform bus code when unbinding the MFD and deregistering the child
devices.
Fix this by using the intended helper for reusing OF nodes.
Fixes: dafc7cde23dc ("regulator: add mt6357 regulator")
Cc: stable@vger.kernel.org # 6.2
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260408073055.5183-5-johan@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Johan Hovold [Wed, 8 Apr 2026 07:30:51 +0000 (09:30 +0200)]
regulator: max77650: fix OF node reference imbalance
The driver reuses the OF node of the parent multi-function device but
fails to take another reference to balance the one dropped by the
platform bus code when unbinding the MFD and deregistering the child
devices.
Fix this by using the intended helper for reusing OF nodes.
Fixes: bcc61f1c44fd ("regulator: max77650: add regulator support")
Cc: stable@vger.kernel.org # 5.1
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260408073055.5183-4-johan@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Johan Hovold [Wed, 8 Apr 2026 07:30:50 +0000 (09:30 +0200)]
regulator: rk808: fix OF node reference imbalance
The driver reuses the OF node of the parent multi-function device but
fails to take another reference to balance the one dropped by the
platform bus code when unbinding the MFD and deregistering the child
devices.
Fix this by using the intended helper for reusing OF nodes.
Fixes: 647e57351f8e ("regulator: rk808: reduce 'struct rk808' usage")
Cc: stable@vger.kernel.org # 6.2
Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260408073055.5183-3-johan@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Johan Hovold [Wed, 8 Apr 2026 07:30:49 +0000 (09:30 +0200)]
regulator: bq257xx: fix OF node reference imbalance
The driver reuses the OF node of the parent multi-function device but
fails to take another reference to balance the one dropped by the
platform bus code when unbinding the MFD and deregistering the child
devices.
Fix this by using the intended helper for reusing OF nodes.
Fixes: 981dd162b635 ("regulator: bq257xx: Add bq257xx boost regulator driver")
Cc: stable@vger.kernel.org # 6.18
Cc: Chris Morgan <macromorgan@hotmail.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260408073055.5183-2-johan@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>