git.ipfire.org Git - thirdparty/kernel/linux.git/log
2 days ago Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox...
Jakub Kicinski [Sun, 12 Apr 2026 21:34:27 +0000 (14:34 -0700)] 
Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Tariq Toukan says:

====================
mlx5-next updates 2026-04-09

* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5: Add icm_mng_function_id_mode cap bit
  net/mlx5: Rename MLX5_PF page counter type to MLX5_SELF
  net/mlx5: Add vhca_id_type bit to alias context
  mlx5: Remove redundant iseg base
====================

Link: https://patch.msgid.link/20260409110431.154894-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago vsock: fix buffer size clamping order
Norbert Szetei [Thu, 9 Apr 2026 16:34:12 +0000 (18:34 +0200)] 
vsock: fix buffer size clamping order

In vsock_update_buffer_size(), the buffer size was being clamped to the
maximum first, and then to the minimum. If a user sets a minimum buffer
size larger than the maximum, the minimum check overrides the maximum
check, inverting the constraint.

This breaks the intended socket memory boundaries by allowing the
vsk->buffer_size to grow beyond the configured vsk->buffer_max_size.

Fix this by checking the minimum first, and then the maximum. This
ensures the buffer size never exceeds the buffer_max_size.
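The effect of the ordering can be seen in a small standalone sketch. The struct and function names below are hypothetical stand-ins, not the kernel's; only the two clamp orders are taken from the description above.

```c
#include <stdint.h>

/* Hypothetical stand-in for the limits named in the commit message. */
struct buf_limits {
	uint64_t buffer_min_size;
	uint64_t buffer_max_size;
};

/* Old order: clamp to max, then to min. If min > max, min wins. */
static uint64_t clamp_max_then_min(const struct buf_limits *l, uint64_t val)
{
	if (val > l->buffer_max_size)
		val = l->buffer_max_size;
	if (val < l->buffer_min_size)
		val = l->buffer_min_size;
	return val;
}

/* Fixed order: clamp to min, then to max. The max is always honoured. */
static uint64_t clamp_min_then_max(const struct buf_limits *l, uint64_t val)
{
	if (val < l->buffer_min_size)
		val = l->buffer_min_size;
	if (val > l->buffer_max_size)
		val = l->buffer_max_size;
	return val;
}
```

With a minimum of 4096 and a maximum of 1024, a request for 2048 ends up as 4096 under the old order and as 1024 under the fixed order.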

Fixes: b9f2b0ffde0c ("vsock: handle buffer_size sockopts in the core")
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Norbert Szetei <norbert@doyensec.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/180118C5-8BCF-4A63-A305-4EE53A34AB9C@doyensec.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago Merge branch 'net-reduce-sk_filter-and-friends-bloat'
Jakub Kicinski [Sun, 12 Apr 2026 21:30:27 +0000 (14:30 -0700)] 
Merge branch 'net-reduce-sk_filter-and-friends-bloat'

Eric Dumazet says:

====================
net: reduce sk_filter() (and friends) bloat

Some functions return an error by value, and a drop_reason
by an output parameter. This extra parameter can force stack canaries.

A drop_reason is enough and more efficient.
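As a rough illustration of the two API shapes, the following sketch contrasts them. The enum values and function names are made up for this example and are not the kernel's. With the out-parameter form the caller has to take the address of a local variable, which, depending on compiler and stack-protector settings, can make the calling function eligible for a canary.

```c
#include <stddef.h>

/* Hypothetical, reduced set of reasons; 0 means "not dropped". */
enum drop_reason {
	DROP_NOT_DROPPED = 0,
	DROP_SOCKET_FILTER,
	DROP_NOMEM,
};

/* Before: error code by value, reason through an output parameter. */
static int filter_old(size_t len, size_t cap, enum drop_reason *reason)
{
	if (len > cap) {
		*reason = DROP_SOCKET_FILTER;
		return -1;
	}
	*reason = DROP_NOT_DROPPED;
	return 0;
}

/* After: the reason is the return value; nonzero means "dropped". */
static enum drop_reason filter_new(size_t len, size_t cap)
{
	return len > cap ? DROP_SOCKET_FILTER : DROP_NOT_DROPPED;
}
```

A caller of the second form can test the result directly (`if (filter_new(len, cap)) ...`) without keeping an addressable local for the reason.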

This series reduces bloat by 678 bytes on x86_64:

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.final
add/remove: 0/0 grow/shrink: 3/18 up/down: 79/-757 (-678)
Function                                     old     new   delta
vsock_queue_rcv_skb                           50      79     +29
ipmr_cache_report                           1290    1315     +25
ip6mr_cache_report                          1322    1347     +25
tcp_v6_rcv                                  3169    3167      -2
packet_rcv_spkt                              329     327      -2
unix_dgram_sendmsg                          1731    1726      -5
netlink_unicast                              957     945     -12
netlink_dump                                1372    1359     -13
sk_filter_trim_cap                           889     858     -31
netlink_broadcast_filtered                  1633    1595     -38
tcp_v4_rcv                                  3152    3111     -41
raw_rcv_skb                                  122      80     -42
ping_queue_rcv_skb                           109      61     -48
ping_rcv                                     215     162     -53
rawv6_rcv_skb                                278     224     -54
__sk_receive_skb                             690     632     -58
raw_rcv                                      591     527     -64
udpv6_queue_rcv_one_skb                      935     869     -66
udp_queue_rcv_one_skb                        919     853     -66
tun_net_xmit                                1146    1074     -72
sock_queue_rcv_skb_reason                    166      76     -90
Total: Before=29722890, After=29722212, chg -0.00%

Future conversions from sock_queue_rcv_skb() to sock_queue_rcv_skb_reason()
can be done later.
====================

Link: https://patch.msgid.link/20260409145625.2306224-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago net: change sk_filter_trim_cap() to return a drop_reason by value
Eric Dumazet [Thu, 9 Apr 2026 14:56:24 +0000 (14:56 +0000)] 
net: change sk_filter_trim_cap() to return a drop_reason by value

The current return value can be replaced with the drop_reason,
reducing kernel bloat:

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/11 up/down: 32/-603 (-571)
Function                                     old     new   delta
tcp_v6_rcv                                  3135    3167     +32
unix_dgram_sendmsg                          1731    1726      -5
netlink_unicast                              957     945     -12
netlink_dump                                1372    1359     -13
sk_filter_trim_cap                           882     858     -24
tcp_v4_rcv                                  3143    3111     -32
__pfx_tcp_filter                              32       -     -32
netlink_broadcast_filtered                  1633    1595     -38
sock_queue_rcv_skb_reason                    126      76     -50
tun_net_xmit                                1127    1074     -53
__sk_receive_skb                             690     632     -58
udpv6_queue_rcv_one_skb                      935     869     -66
udp_queue_rcv_one_skb                        919     853     -66
tcp_filter                                   154       -    -154
Total: Before=29722783, After=29722212, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260409145625.2306224-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago tcp: change tcp_filter() to return the reason by value
Eric Dumazet [Thu, 9 Apr 2026 14:56:23 +0000 (14:56 +0000)] 
tcp: change tcp_filter() to return the reason by value

sk_filter_trim_cap() will soon return the reason by value;
do the same for tcp_filter().

Note:

tcp_filter() is no longer inlined. The following patch will inline it again.

$ scripts/bloat-o-meter -t vmlinux.4 vmlinux.5
add/remove: 2/0 grow/shrink: 0/2 up/down: 186/-43 (143)
Function                                     old     new   delta
tcp_filter                                     -     154    +154
__pfx_tcp_filter                               -      32     +32
tcp_v4_rcv                                  3152    3143      -9
tcp_v6_rcv                                  3169    3135     -34
Total: Before=29722640, After=29722783, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260409145625.2306224-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago net: change sk_filter_reason() to return the reason by value
Eric Dumazet [Thu, 9 Apr 2026 14:56:22 +0000 (14:56 +0000)] 
net: change sk_filter_reason() to return the reason by value

sk_filter_trim_cap() will soon return the reason by value;
do the same for sk_filter_reason().

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-21 (-21)
Function                                     old     new   delta
sock_queue_rcv_skb_reason                    128     126      -2
tun_net_xmit                                1146    1127     -19
Total: Before=29722661, After=29722640, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260409145625.2306224-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago net: always set reason in sk_filter_trim_cap()
Eric Dumazet [Thu, 9 Apr 2026 14:56:21 +0000 (14:56 +0000)] 
net: always set reason in sk_filter_trim_cap()

sk_filter_trim_cap() will soon return the drop reason by value.

Make sure *reason is cleared when no error is returned,
to ease this conversion.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-7 (-7)
Function                                     old     new   delta
sk_filter_trim_cap                           889     882      -7
Total: Before=29722668, After=29722661, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260409145625.2306224-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago net: change sock_queue_rcv_skb_reason() to return a drop_reason
Eric Dumazet [Thu, 9 Apr 2026 14:56:20 +0000 (14:56 +0000)] 
net: change sock_queue_rcv_skb_reason() to return a drop_reason

Change sock_queue_rcv_skb_reason() to return the drop_reason directly
instead of using a reference.

This is part of an effort to remove stack canaries and reduce bloat.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 3/7 up/down: 79/-301 (-222)
Function                                     old     new   delta
vsock_queue_rcv_skb                           50      79     +29
ipmr_cache_report                           1290    1315     +25
ip6mr_cache_report                          1322    1347     +25
packet_rcv_spkt                              329     327      -2
sock_queue_rcv_skb_reason                    166     128     -38
raw_rcv_skb                                  122      80     -42
ping_queue_rcv_skb                           109      61     -48
ping_rcv                                     215     162     -53
rawv6_rcv_skb                                278     224     -54
raw_rcv                                      591     527     -64
Total: Before=29722890, After=29722668, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260409145625.2306224-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago Merge branch 'add-support-for-pic64-hpsc-hx-mdio-controller'
Jakub Kicinski [Sun, 12 Apr 2026 21:19:25 +0000 (14:19 -0700)] 
Merge branch 'add-support-for-pic64-hpsc-hx-mdio-controller'

Charles Perry says:

====================
Add support for PIC64-HPSC/HX MDIO controller

This series adds a driver for the two MDIO controllers of PIC64-HPSC/HX.
The hardware supports C22 and C45 but only C22 is implemented for now.

This MDIO hardware is based on a Microsemi design supported in Linux by
mdio-mscc-miim.c. However, the register interface of pic64hpsc is completely
different, hence the need for a separate driver.

The documentation recommends an input clock of 156.25MHz and a prescaler of
39, which yields an MDIO clock of 1.95MHz.

This was tested on Microchip HB1301 evalkit which has a VSC8574 and a
VSC8541. I've tested with bus frequencies of 0.6, 1.95 and 2.5 MHz.

This series also adds a PHY write barrier when disabling PHY interrupts as
discussed in: https://lore.kernel.org/acvUqDgepCIScs8M@shell.armlinux.org.uk
====================

Link: https://patch.msgid.link/20260408131821.1145334-1-charles.perry@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago net: phy: add a PHY write barrier when disabling interrupts
Charles Perry [Wed, 8 Apr 2026 13:18:16 +0000 (06:18 -0700)] 
net: phy: add a PHY write barrier when disabling interrupts

MDIO bus controllers are not required to wait for write transactions to
complete before returning as synchronization is often achieved by polling
status bits.

This can cause issues when disabling interrupts since an interrupt could
fire before the interrupt handler is unregistered and there's no status
bit to poll.

Add a phy_write_barrier() function and use it in phy_disable_interrupts()
to fix this issue. The write barrier just reads an MII register and
discards the value, which is enough to guarantee that previous writes have
completed.
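The idea can be modelled outside the kernel with a toy bus that posts writes. Everything below (struct, function names, the choice of register 2) is a hypothetical model for illustration, not the phylib implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an MDIO controller that posts writes. */
struct fake_mdio {
	uint16_t regs[32];
	bool     write_pending;
	uint8_t  pending_reg;
	uint16_t pending_val;
};

/* Complete any posted write; the bus is serial, so this precedes the next transaction. */
static void fake_mdio_drain(struct fake_mdio *bus)
{
	if (bus->write_pending) {
		bus->regs[bus->pending_reg] = bus->pending_val;
		bus->write_pending = false;
	}
}

/* Returns immediately; the value reaches the PHY only later. */
static void fake_mdio_write(struct fake_mdio *bus, uint8_t reg, uint16_t val)
{
	fake_mdio_drain(bus);
	bus->pending_reg = reg & 31;
	bus->pending_val = val;
	bus->write_pending = true;
}

/* A read has to wait for its result, so earlier writes have completed by then. */
static uint16_t fake_mdio_read(struct fake_mdio *bus, uint8_t reg)
{
	fake_mdio_drain(bus);
	return bus->regs[reg & 31];
}

/* Barrier as described above: read a register and discard the value. */
static void fake_write_barrier(struct fake_mdio *bus)
{
	(void)fake_mdio_read(bus, 2); /* register choice is arbitrary in this model */
}
```

In this model, a write that clears an interrupt-enable register is still pending when `fake_mdio_write()` returns, and is guaranteed to have landed once `fake_write_barrier()` returns.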

Signed-off-by: Charles Perry <charles.perry@microchip.com>
Link: https://patch.msgid.link/20260408131821.1145334-4-charles.perry@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago net: mdio: add a driver for PIC64-HPSC/HX MDIO controller
Charles Perry [Wed, 8 Apr 2026 13:18:15 +0000 (06:18 -0700)] 
net: mdio: add a driver for PIC64-HPSC/HX MDIO controller

This adds an MDIO driver for PIC64-HPSC/HX. The hardware supports C22
and C45 but only C22 is implemented in this commit.

This MDIO hardware is based on a Microsemi design supported in Linux by
mdio-mscc-miim.c. However, the register interface of pic64hpsc is
completely different, hence the need for a separate driver.

The documentation recommends an input clock of 156.25MHz and a prescaler
of 39, which yields an MDIO clock of 1.95MHz.
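The quoted figure can be reproduced with a short calculation. The divider relation used here, mdc = input_clk / (2 * (prescaler + 1)), is an assumption borrowed from the related mdio-mscc-miim driver; it is not confirmed for pic64hpsc by the text above, but it does yield 1.95 MHz for the recommended values. The function names are made up.

```c
#include <stdint.h>

/*
 * Assumed divider relation (not confirmed for pic64hpsc):
 * mdc = input_clk / (2 * (prescaler + 1)).
 */
static uint32_t mdc_hz(uint32_t input_clk_hz, uint32_t prescaler)
{
	return input_clk_hz / (2u * (prescaler + 1u));
}

/*
 * Smallest prescaler that keeps the MDIO clock at or below target_hz.
 * target_hz must be nonzero and small enough that 2 * target_hz fits in 32 bits.
 */
static uint32_t prescaler_for(uint32_t input_clk_hz, uint32_t target_hz)
{
	uint32_t div = 2u * target_hz;

	return (input_clk_hz + div - 1u) / div - 1u;
}
```

Under that assumption, a 156.25 MHz input with a prescaler of 39 gives 156250000 / 80 = 1953125 Hz, and a 2.5 MHz target maps to a prescaler of 31.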

The hardware supports an interrupt pin or a "TRIGGER" bit that can be
polled to signal transaction completion. This commit uses polling.

This was tested on Microchip HB1301 evalkit with a VSC8574 and a
VSC8541.

Signed-off-by: Charles Perry <charles.perry@microchip.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260408131821.1145334-3-charles.perry@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago dt-bindings: net: document Microchip PIC64-HPSC/HX MDIO controller
Charles Perry [Wed, 8 Apr 2026 13:18:14 +0000 (06:18 -0700)] 
dt-bindings: net: document Microchip PIC64-HPSC/HX MDIO controller

This MDIO hardware is based on a Microsemi design supported in Linux by
mdio-mscc-miim.c. However, the register interface of pic64hpsc is completely
different, hence the need for separate documentation.

The hardware supports C22 and C45.

The documentation recommends an input clock of 156.25MHz and a prescaler
of 39, which yields an MDIO clock of 1.95MHz.

The hardware supports an interrupt pin to signal transaction completion
which is not strictly needed as the software can also poll a "TRIGGER"
bit for this.

Signed-off-by: Charles Perry <charles.perry@microchip.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/20260408131821.1145334-2-charles.perry@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago net: phy: fix a return path in get_phy_c45_ids()
Charles Perry [Thu, 9 Apr 2026 13:36:54 +0000 (06:36 -0700)] 
net: phy: fix a return path in get_phy_c45_ids()

The return value of phy_c45_probe_present() is stored in "ret", not
"phy_reg"; fix this. "phy_reg" always has a positive value if we reach
this return path (since the function would have returned earlier otherwise),
which means that the original goal of the patch, not considering -ENODEV
fatal, wasn't achieved.
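The shape of the bug can be sketched as follows. The helper names are hypothetical and the error mapping is an assumed reading of the description (treat -EIO and -ENODEV from the probe as "no device", anything else as fatal); both helpers model only the error path taken when the probe result is negative.

```c
#include <errno.h>

/* Buggy shape: the probe result is in ret, but the test looks at phy_reg. */
static int map_probe_error_buggy(int ret, int phy_reg)
{
	(void)ret; /* never consulted, which is the bug */
	return (phy_reg == -EIO || phy_reg == -ENODEV) ? -ENODEV : -EIO;
}

/* Fixed shape: test the variable that actually holds the probe result. */
static int map_probe_error_fixed(int ret)
{
	return (ret == -EIO || ret == -ENODEV) ? -ENODEV : -EIO;
}
```

Because `phy_reg` is positive on this path, the buggy form always yields -EIO, so a probe that returned -ENODEV was still treated as fatal.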

Fixes: 17b447539408 ("net: phy: c45 scanning: Don't consider -ENODEV fatal")
Signed-off-by: Charles Perry <charles.perry@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20260409133654.3203336-1-charles.perry@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago dt-bindings: net: dsa: nxp,sja1105: make spi-cpol optional for sja1110
Josua Mayer [Thu, 9 Apr 2026 12:34:33 +0000 (14:34 +0200)] 
dt-bindings: net: dsa: nxp,sja1105: make spi-cpol optional for sja1110

Currently, the binding requires 'spi-cpha' for SJA1105 and 'spi-cpol'
for SJA1110.

However, the SJA1110 supports both SPI modes 0 and 2. Mode 2
(cpha=0, cpol=1) is used by the NXP LX2160 Bluebox 3.

On the SolidRun i.MX8DXL HummingBoard Telematics, mode 0 is stable,
while forcing mode 2 introduces CRC errors especially during bursts.

Drop the requirement on spi-cpol for SJA1110.

Fixes: af2eab1a8243 ("dt-bindings: net: nxp,sja1105: document spi-cpol/cpha")
Signed-off-by: Josua Mayer <josua@solid-run.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/20260409-imx8dxl-sr-som-v2-1-83ff20629ba0@solid-run.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago octeon_ep: Remove unnecessary semicolons in octep_oq_drop_rx()
Nobuhiro Iwamatsu [Thu, 9 Apr 2026 05:08:11 +0000 (14:08 +0900)] 
octeon_ep: Remove unnecessary semicolons in octep_oq_drop_rx()

Remove unnecessary semicolons in octep_oq_drop_rx().

Signed-off-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.x90@mail.toshiba>
Link: https://patch.msgid.link/1775711291-13938-1-git-send-email-nobuhiro.iwamatsu.x90@mail.toshiba
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago Merge branch 'more-fixes-for-the-ipa-driver'
Jakub Kicinski [Sun, 12 Apr 2026 20:49:34 +0000 (13:49 -0700)] 
Merge branch 'more-fixes-for-the-ipa-driver'

Luca Weiss says:

====================
More fixes for the IPA driver

Two more fixes for the Qualcomm IPA driver.
====================

Link: https://patch.msgid.link/20260409-ipa-fixes-v1-0-a817c30678ac@fairphone.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago net: ipa: Fix decoding EV_PER_EE for IPA v5.0+
Luca Weiss [Thu, 9 Apr 2026 08:13:32 +0000 (10:13 +0200)] 
net: ipa: Fix decoding EV_PER_EE for IPA v5.0+

Initially 'reg' and 'val' are assigned from HW_PARAM_2.

But since IPA v5.0+ takes EV_PER_EE from HW_PARAM_4 (instead of
NUM_EV_PER_EE from HW_PARAM_2), we not only need to re-assign 'reg' but
also read the register value of that register into 'val' so that
reg_decode() works on the correct value.

Fixes: f651334e1ef5 ("net: ipa: add HW_PARAM_4 GSI register")
Link: https://sashiko.dev/#/patchset/20260403-milos-ipa-v1-0-01e9e4e03d3e%40fairphone.com?part=2
Signed-off-by: Luca Weiss <luca.weiss@fairphone.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Link: https://patch.msgid.link/20260409-ipa-fixes-v1-2-a817c30678ac@fairphone.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago net: ipa: Fix programming of QTIME_TIMESTAMP_CFG
Luca Weiss [Thu, 9 Apr 2026 08:13:31 +0000 (10:13 +0200)] 
net: ipa: Fix programming of QTIME_TIMESTAMP_CFG

The 'val' variable gets overwritten multiple times, discarding previous
values. Looking at the git log shows these should be combined with |=
instead.
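A minimal sketch of the pattern, using an invented three-field layout (the real QTIME_TIMESTAMP_CFG fields are not given here, so the shifts, widths and names are assumptions):

```c
#include <stdint.h>

/* Hypothetical field layout, for illustration only. */
#define FIELD_A_SHIFT 0
#define FIELD_B_SHIFT 8
#define FIELD_C_SHIFT 16

/* Place an 8-bit value at the given bit offset. */
static uint32_t encode(unsigned int shift, uint32_t v)
{
	return (v & 0xffu) << shift;
}

/* Buggy: each assignment throws away the fields encoded before it. */
static uint32_t build_cfg_overwrite(uint32_t a, uint32_t b, uint32_t c)
{
	uint32_t val;

	val = encode(FIELD_A_SHIFT, a);
	val = encode(FIELD_B_SHIFT, b);
	val = encode(FIELD_C_SHIFT, c);
	return val;
}

/* Fixed: OR the fields together so all of them reach the register. */
static uint32_t build_cfg_combined(uint32_t a, uint32_t b, uint32_t c)
{
	uint32_t val;

	val = encode(FIELD_A_SHIFT, a);
	val |= encode(FIELD_B_SHIFT, b);
	val |= encode(FIELD_C_SHIFT, c);
	return val;
}
```

For inputs (1, 2, 3) the overwriting form produces 0x030000, keeping only the last field, while the combined form produces 0x030201.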

Fixes: 9265a4f0f0b4 ("net: ipa: define even more IPA register fields")
Link: https://sashiko.dev/#/patchset/20260403-milos-ipa-v1-0-01e9e4e03d3e%40fairphone.com?part=4
Signed-off-by: Luca Weiss <luca.weiss@fairphone.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Link: https://patch.msgid.link/20260409-ipa-fixes-v1-1-a817c30678ac@fairphone.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago Linux 7.0 v7.0
Linus Torvalds [Sun, 12 Apr 2026 20:48:06 +0000 (13:48 -0700)] 
Linux 7.0

2 days ago ppp: require CAP_NET_ADMIN in target netns for unattached ioctls
Taegu Ha [Thu, 9 Apr 2026 07:11:15 +0000 (16:11 +0900)] 
ppp: require CAP_NET_ADMIN in target netns for unattached ioctls

/dev/ppp open is currently authorized against file->f_cred->user_ns,
while unattached administrative ioctls operate on current->nsproxy->net_ns.

As a result, a local unprivileged user can create a new user namespace
with CLONE_NEWUSER, gain CAP_NET_ADMIN only in that new user namespace,
and still issue PPPIOCNEWUNIT, PPPIOCATTACH, or PPPIOCATTCHAN against
an inherited network namespace.

Require CAP_NET_ADMIN in the user namespace that owns the target network
namespace before handling unattached PPP administrative ioctls.

This preserves normal pppd operation in the network namespace it is
actually privileged in, while rejecting the userns-only inherited-netns
case.

Fixes: 273ec51dd7ce ("net: ppp_generic - introduce net-namespace functionality v2")
Signed-off-by: Taegu Ha <hataegu0826@gmail.com>
Link: https://patch.msgid.link/20260409071117.4354-1-hataegu0826@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago Merge patch series "bpf: Fix OOB in pcpu_init_value and add a test"
Alexei Starovoitov [Sun, 12 Apr 2026 20:34:41 +0000 (13:34 -0700)] 
Merge patch series "bpf: Fix OOB in pcpu_init_value and add a test"

xulang <xulang@uniontech.com> says:
====================

Fix OOB read when copying an element from a BPF_MAP_TYPE_CGROUP_STORAGE
map to another pcpu map with the same value_size that is not rounded
up to 8 bytes, and add a test case to reproduce the issue.

The root cause is that pcpu_init_value() uses copy_map_value_long(), which
rounds up the copy size to 8 bytes, but CGROUP_STORAGE map values are not
necessarily a multiple of 8 bytes in size (e.g., 4 bytes). This causes a
4-byte OOB read when the copy is performed.
====================

Link: https://lore.kernel.org/r/7653EEEC2BAB17DF+20260402073948.2185396-1-xulang@uniontech.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days ago selftests/bpf: Add test for cgroup storage OOB read
Lang Xu [Thu, 2 Apr 2026 07:42:36 +0000 (15:42 +0800)] 
selftests/bpf: Add test for cgroup storage OOB read

Add a test case to reproduce the out-of-bounds read issue when copying
from a cgroup storage map to a pcpu map with a value_size not rounded
up to 8 bytes.

The test creates:
1. A CGROUP_STORAGE map with 4-byte value (not 8-byte aligned)
2. A LRU_PERCPU_HASH map with 4-byte value (same size)

When a socket is created in the cgroup, the BPF program triggers
bpf_map_update_elem() which calls copy_map_value_long(). This function
rounds up the copy size to 8 bytes, but the cgroup storage buffer is
only 4 bytes, causing an OOB read (before the fix).

Signed-off-by: Lang Xu <xulang@uniontech.com>
Link: https://lore.kernel.org/r/D63BF0DBFF1EA122+20260402074236.2187154-2-xulang@uniontech.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days ago bpf: Fix OOB in pcpu_init_value
Lang Xu [Thu, 2 Apr 2026 07:42:35 +0000 (15:42 +0800)] 
bpf: Fix OOB in pcpu_init_value

An out-of-bounds read occurs when copying an element from a
BPF_MAP_TYPE_CGROUP_STORAGE map to another pcpu map with the
same value_size that is not rounded up to 8 bytes.

The issue happens when:
1. A CGROUP_STORAGE map is created with value_size not aligned to
   8 bytes (e.g., 4 bytes)
2. A pcpu map is created with the same value_size (e.g., 4 bytes)
3. An element in map 2 is updated with data from map 1

pcpu_init_value() assumes that all sources are rounded up to 8 bytes
and invokes copy_map_value_long() to copy the data. However, the
assumption does not hold, since in some cases the source may not be
rounded up to 8 bytes, e.g., CGROUP_STORAGE, skb->data.
The verifier checks exactly the size that the source claims, not
the size rounded up to 8 bytes by the kernel, so an OOB read happens when
the source has only 4 bytes while the copy size (4) is rounded up to 8.
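The size arithmetic behind the over-read can be checked with a few lines of C. The helper names are hypothetical; `copy_exact()` only illustrates a copy bounded by the declared size and is not necessarily what the patch does.

```c
#include <stddef.h>

/* Round n up to the next multiple of 8, as a "long"-granular copy would. */
static size_t round_up_8(size_t n)
{
	return (n + 7u) & ~(size_t)7u;
}

/* Bytes read past the end of a source that really holds value_size bytes. */
static size_t overread_bytes(size_t value_size)
{
	return round_up_8(value_size) - value_size;
}

/* A bounded copy reads exactly value_size bytes, whatever the alignment. */
static void copy_exact(void *dst, const void *src, size_t value_size)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	for (size_t i = 0; i < value_size; i++)
		d[i] = s[i];
}
```

For a 4-byte value the rounded copy touches 8 bytes, so 4 bytes come from beyond the source buffer; for an 8-byte value the over-read is zero.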

Fixes: d3bec0138bfb ("bpf: Zero-fill re-used per-cpu map element")
Reported-by: Kaiyan Mei <kaiyanm@hust.edu.cn>
Closes: https://lore.kernel.org/all/14e6c70c.6c121.19c0399d948.Coremail.kaiyanm@hust.edu.cn/
Link: https://lore.kernel.org/r/420FEEDDC768A4BE+20260402074236.2187154-1-xulang@uniontech.com
Signed-off-by: Lang Xu <xulang@uniontech.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days ago Merge branch 'net-rds-fix-use-after-free-in-rds-ib-for-non-init-namespaces'
Jakub Kicinski [Sun, 12 Apr 2026 20:33:21 +0000 (13:33 -0700)] 
Merge branch 'net-rds-fix-use-after-free-in-rds-ib-for-non-init-namespaces'

Allison Henderson says:

====================
net/rds: Fix use-after-free in RDS/IB for non-init namespaces

This series fixes syzbot bug da8e060735ae02c8f3d1
https://syzkaller.appspot.com/bug?extid=da8e060735ae02c8f3d1

The report finds a use-after-free bug where ib connections access an
invalid network namespace after it has been freed.  The stack is:

    rds_rdma_cm_event_handler_cmn
      rds_conn_path_drop
        rds_destroy_pending
          check_net()  <-- use-after-free

This was initially introduced in:
d5a8ac28a7ff ("RDS-TCP: Make RDS-TCP work correctly when it is set up
in a netns other than init_net").

That commit made RDS aware of the namespace by storing a net pointer in
each connection, but did not explicitly restrict IB to init_net.
The RDS/TCP transport has its own pernet exit handler
(rds_tcp_exit_net) that destroys connections when a namespace is torn
down. RDS/IB, however, does not support more than the initial namespace
and has no such handler. The initial namespace is statically allocated
and never torn down, so it always has at least one reference.

Non-init namespaces have no such persistent reference, so when their
refcounts drop to zero they are released through cleanup_net(). That
would call any registered pernet cleanup handlers, but since RDS/IB
registers none, the remaining rds_connections are left with stale c_net
pointers, which are then accessed later, causing the use-after-free.

So, the simple fix is to disallow IB connections in any namespace other
than the initial one.

Fixes are ported from UEK patches found here:

  https://github.com/oracle/linux-uek/commit/8ed9a82376b7
  Patch 1 is a prerequisite optimization to rds_ib_laddr_check() that
  avoids excessive rdma_bind_addr() calls during transport probing by
  first checking rds_ib_get_device().  This is needed because patch 2
  adds a namespace check at the top of the same function.

    UEK: 8ed9a82376b7 ("rds: ib: Optimize rds_ib_laddr_check")

  https://github.com/oracle/linux-uek/commit/bd9489a08004
  Patch 2 restricts RDS/IB to the initial network namespace.  It adds
  checks in both rds_ib_laddr_check() and rds_set_transport() to reject
  IB use from non-init namespaces with -EPROTOTYPE.  This prevents the
  use-after-free by ensuring IB connections cannot exist in namespaces
  that may be torn down.

    UEK: bd9489a08004 ("net/rds: Restrict use of RDS/IB to the initial
    network namespace")

Questions, comments and feedback appreciated!
====================

Link: https://patch.msgid.link/20260408080420.540032-1-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago net/rds: Restrict use of RDS/IB to the initial network namespace
Greg Jumper [Wed, 8 Apr 2026 08:04:20 +0000 (01:04 -0700)] 
net/rds: Restrict use of RDS/IB to the initial network namespace

Prevent using RDS/IB in network namespaces other than the initial one.
The existing RDS/IB code will not work properly in non-initial network
namespaces.

Fixes: d5a8ac28a7ff ("RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net")
Reported-by: syzbot+da8e060735ae02c8f3d1@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=da8e060735ae02c8f3d1
Signed-off-by: Greg Jumper <greg.jumper@oracle.com>
Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260408080420.540032-3-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago net/rds: Optimize rds_ib_laddr_check
Håkon Bugge [Wed, 8 Apr 2026 08:04:19 +0000 (01:04 -0700)] 
net/rds: Optimize rds_ib_laddr_check

rds_ib_laddr_check() creates a CM_ID and attempts to bind the address
in question to it. This is done in order to qualify the allegedly local
address as a usable IB/RoCE address.

In the field, ExaWatcher runs rds-ping to all ports in the fabric from
all local ports, using all active ToS'es. In a full rack system,
we have 14 cell servers and eight db servers. Typically, 6 ToS'es are
used. This implies 528 rds-ping invocations per ExaWatcher's "RDSinfo"
interval.

Adding to this, each rds-ping invocation creates eight sockets and
binds the local address to them:

socket(AF_RDS, SOCK_SEQPACKET, 0)       = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0)       = 4
bind(4, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0)       = 5
bind(5, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0)       = 6
bind(6, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0)       = 7
bind(7, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0)       = 8
bind(8, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0)       = 9
bind(9, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0)       = 10
bind(10, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.36.2")}, 16) = 0

So, at every interval at which ExaWatcher executes the rds-pings, 4224
CM_IDs are allocated, considering this full-rack system. After a CM_ID has
been allocated, rdma_bind_addr() is called, with the port number being
zero. This implies that the CMA will attempt to search for an un-used
ephemeral port. Simplified, the algorithm is to start at a random
position in the available port space, and then if needed, iterate
until an un-used port is found.

The book-keeping of used ports uses the idr system, which again uses
slab to allocate new struct idr_layer's. The size is 2092 bytes and
slab tries to reduce the wasted space. Hence, it chooses an order:3
allocation, for which 15 idr_layer structs will fit and only 1388
bytes are wasted per the 32KiB order:3 chunk.
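The figures quoted so far can be cross-checked with a short calculation. The constant and function names are invented for this sketch; the 4096-byte page size is implied by the 32KiB order:3 chunk.

```c
/* Figures quoted in the commit message. */
enum {
	PINGS_PER_INTERVAL = 528,
	SOCKETS_PER_PING   = 8,
	IDR_LAYER_BYTES    = 2092,
	PAGE_BYTES         = 4096, /* implied by 32KiB at order:3 */
	SLAB_ORDER         = 3,
};

/* One CM_ID per bound socket. */
static unsigned int cm_ids_per_interval(void)
{
	return PINGS_PER_INTERVAL * SOCKETS_PER_PING;
}

/* Size of one order:3 slab in bytes. */
static unsigned int slab_bytes(void)
{
	return PAGE_BYTES << SLAB_ORDER;
}

/* How many idr_layer structs fit in one slab. */
static unsigned int objs_per_slab(void)
{
	return slab_bytes() / IDR_LAYER_BYTES;
}

/* Bytes left over after packing as many structs as fit. */
static unsigned int wasted_bytes(void)
{
	return slab_bytes() - objs_per_slab() * IDR_LAYER_BYTES;
}
```

This gives 528 * 8 = 4224 CM_IDs per interval, a 32768-byte slab holding 15 structs (31380 bytes), and 1388 bytes left over, matching the numbers in the text.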

Although this order:3 allocation seems like a good space/speed
trade-off, it does not fit well with how it is used by the CMA. The
combination of the randomized starting point in the port space (which
has close to zero spatial locality) and the close proximity in time of
the 4224 CM_ID allocations triggered by the rds-pings creates a memory
hog for order:3 allocations.

These costly allocations may need reclaims and/or compaction. At
worst, they may fail and produce a stack trace such as (from uek4):

[<ffffffff811a72d5>] __inc_zone_page_state+0x35/0x40
[<ffffffff811c2e97>] page_add_file_rmap+0x57/0x60
[<ffffffffa37ca1df>] remove_migration_pte+0x3f/0x3c0 [ksplice_6cn872bt_vmlinux_new]
[<ffffffff811c3de8>] rmap_walk+0xd8/0x340
[<ffffffff811e8860>] remove_migration_ptes+0x40/0x50
[<ffffffff811ea83c>] migrate_pages+0x3ec/0x890
[<ffffffff811afa0d>] compact_zone+0x32d/0x9a0
[<ffffffff811b00ed>] compact_zone_order+0x6d/0x90
[<ffffffff811b03b2>] try_to_compact_pages+0x102/0x270
[<ffffffff81190e56>] __alloc_pages_direct_compact+0x46/0x100
[<ffffffff8119165b>] __alloc_pages_nodemask+0x74b/0xaa0
[<ffffffff811d8411>] alloc_pages_current+0x91/0x110
[<ffffffff811e3b0b>] new_slab+0x38b/0x480
[<ffffffffa41323c7>] __slab_alloc+0x3b7/0x4a0 [ksplice_s0dk66a8_vmlinux_new]
[<ffffffff811e42ab>] kmem_cache_alloc+0x1fb/0x250
[<ffffffff8131fdd6>] idr_layer_alloc+0x36/0x90
[<ffffffff8132029c>] idr_get_empty_slot+0x28c/0x3d0
[<ffffffff813204ad>] idr_alloc+0x4d/0xf0
[<ffffffffa051727d>] cma_alloc_port+0x4d/0xa0 [rdma_cm]
[<ffffffffa0517cbe>] rdma_bind_addr+0x2ae/0x5b0 [rdma_cm]
[<ffffffffa09d8083>] rds_ib_laddr_check+0x83/0x2c0 [ksplice_6l2xst5i_rds_rdma_new]
[<ffffffffa05f892b>] rds_trans_get_preferred+0x5b/0xa0 [rds]
[<ffffffffa05f09f2>] rds_bind+0x212/0x280 [rds]
[<ffffffff815b4016>] SYSC_bind+0xe6/0x120
[<ffffffff815b4d3e>] SyS_bind+0xe/0x10
[<ffffffff816b031a>] system_call_fastpath+0x18/0xd4

To avoid these excessive calls to rdma_bind_addr(), we optimize
rds_ib_laddr_check() by simply checking if the address in question has
been used before. The rds_rdma module keeps track of addresses
associated with IB devices, and the function rds_ib_get_device() is
used to determine if the address already has been qualified as a valid
local address. If not found, we call the legacy rds_ib_laddr_check(),
now renamed to rds_ib_laddr_check_cm().

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260408080420.540032-2-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days ago Merge branch 'net-hamradio-fix-missing-input-validation-in-bpqether-and-scc'
Jakub Kicinski [Sun, 12 Apr 2026 20:19:07 +0000 (13:19 -0700)] 
Merge branch 'net-hamradio-fix-missing-input-validation-in-bpqether-and-scc'

Mashiro Chen says:

====================
net: hamradio: fix missing input validation in bpqether and scc

This series fixes two missing input validation bugs in the hamradio
drivers. Both patches were reviewed by Joerg Reuter (hamradio
maintainer).
====================

Link: https://patch.msgid.link/20260409024927.24397-1-mashiro.chen@mailbox.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: hamradio: scc: validate bufsize in SIOCSCCSMEM ioctl
Mashiro Chen [Thu, 9 Apr 2026 02:49:27 +0000 (10:49 +0800)] 
net: hamradio: scc: validate bufsize in SIOCSCCSMEM ioctl

The SIOCSCCSMEM ioctl copies a scc_mem_config from user space and
assigns its bufsize field directly to scc->stat.bufsize without any
range validation:

  scc->stat.bufsize = memcfg.bufsize;

If a privileged user (CAP_SYS_RAWIO) sets bufsize to 0, the receive
interrupt handler later calls dev_alloc_skb(0) and immediately writes
a KISS type byte via skb_put_u8() into a zero-capacity socket buffer,
corrupting the adjacent skb_shared_info region.

Reject bufsize values smaller than 16; this is large enough to hold
at least one KISS header byte plus useful data.
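
A minimal sketch of such a range check, assuming a simplified state structure. SCC_MIN_BUFSIZE, struct scc_stat_sketch and scc_set_bufsize() are hypothetical names, and the -EINVAL return value is an assumption; the commit message only states the lower limit of 16.

```c
#include <errno.h>

#define SCC_MIN_BUFSIZE 16  /* hypothetical name for the limit of 16 */

/* Simplified stand-in for the statistics/config block holding bufsize. */
struct scc_stat_sketch {
    unsigned int bufsize;
};

/*
 * Validate the user-supplied value before storing it, so the receive
 * path never allocates a buffer too small for the KISS type byte.
 */
int scc_set_bufsize(struct scc_stat_sketch *stat, unsigned int bufsize)
{
    if (bufsize < SCC_MIN_BUFSIZE)
        return -EINVAL;     /* assumed error code */
    stat->bufsize = bufsize;
    return 0;
}
```

A rejected request leaves the previously configured bufsize untouched.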

Signed-off-by: Mashiro Chen <mashiro.chen@mailbox.org>
Acked-by: Joerg Reuter <jreuter@yaina.de>
Link: https://patch.msgid.link/20260409024927.24397-3-mashiro.chen@mailbox.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: hamradio: bpqether: validate frame length in bpq_rcv()
Mashiro Chen [Thu, 9 Apr 2026 02:49:26 +0000 (10:49 +0800)] 
net: hamradio: bpqether: validate frame length in bpq_rcv()

The BPQ length field is decoded as:

  len = skb->data[0] + skb->data[1] * 256 - 5;

If the sender sets bytes [0..1] to values whose combined value is
less than 5, len becomes negative.  A negative int passed to
skb_trim() is silently converted to a huge unsigned value, which makes
the call a no-op.  The frame is then passed up to AX.25 with
its original (untrimmed) payload, delivering garbage beyond the
declared frame boundary.

Additionally, a negative len corrupts the 64-bit rx_bytes counter
through implicit sign-extension.

Add a bounds check before pulling the length bytes: reject frames
where len is negative or exceeds the remaining skb data.
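
The arithmetic can be illustrated with a small standalone sketch. The helper names are hypothetical, and "remaining" is assumed to mean the number of payload bytes that follow the two length bytes; the actual check in bpq_rcv() may be expressed differently.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the BPQ length field exactly as quoted in the commit message. */
int bpq_decode_len(const uint8_t *data)
{
    return data[0] + data[1] * 256 - 5;
}

/* What an unsigned length parameter sees when handed a negative int. */
unsigned int bpq_len_as_unsigned(int len)
{
    return (unsigned int)len;
}

/*
 * Reject a negative length, or one that claims more data than the frame
 * holds. 'remaining' is an assumption of this sketch: the byte count
 * after the two length bytes.
 */
bool bpq_len_is_valid(int len, size_t remaining)
{
    return len >= 0 && (size_t)len <= remaining;
}
```

For example, a header of {2, 0} decodes to -3, which becomes UINT_MAX - 2 once converted to unsigned, so a trim to that length cannot shorten anything.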

Acked-by: Joerg Reuter <jreuter@yaina.de>
Signed-off-by: Mashiro Chen <mashiro.chen@mailbox.org>
Link: https://patch.msgid.link/20260409024927.24397-2-mashiro.chen@mailbox.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agoselftests/bpf: Fix reg_bounds to match new tnum-based refinement
Paul Chaignon [Wed, 8 Apr 2026 20:40:50 +0000 (22:40 +0200)] 
selftests/bpf: Fix reg_bounds to match new tnum-based refinement

Commit efc11a667878 ("bpf: Improve bounds when tnum has a single
possible value") improved the bounds refinement to detect when the tnum
and u64 range overlap in a single value (and the bounds can thus be set
to that value).

Eduard then noticed that it broke the slow-mode reg_bounds selftests
because they don't have an equivalent logic and are therefore unable to
refine the bounds as much as the verifier. The following test case
illustrates this.

  ACTUAL   TRUE1:  scalar(u64=0xffffffff00000000,u32=0,s64=0xffffffff00000000,s32=0)
  EXPECTED TRUE1:  scalar(u64=[0xfffffffe00000001; 0xffffffff00000000],u32=0,s64=[0xfffffffe00000001; 0xffffffff00000000],s32=0)
  [...]
  #323/1007 reg_bounds_gen_consts_s64_s32/(s64)[0xfffffffe00000001; 0xffffffff00000000] (s32)<op> S64_MIN:FAIL

with the verifier logs:

  [...]
  19: w0 = w6                 ; R0=scalar(smin=0,smax=umax=0xffffffff,
                                          var_off=(0x0; 0xffffffff))
                                R6=scalar(smin=0xfffffffe00000001,smax=0xffffffff00000000,
                                          umin=0xfffffffe00000001,umax=0xffffffff00000000,
                                          var_off=(0xfffffffe00000000; 0x1ffffffff))
  20: w0 = w7                 ; R0=0 R7=0x8000000000000000
  21: if w6 == w7 goto pc+3
  [...]
  from 21 to 25: [...]
  25: w0 = w6                 ; R0=0 R6=0xffffffff00000000
                              ;         ^
                              ;         unexpected refined value
  26: w0 = w7                 ; R0=0 R7=0x8000000000000000
  27: exit

When w6 == w7 is true, the verifier can deduce that R6's tnum is
equal to (0xfffffffe00000000; 0x100000000) and then use that information
to refine the bounds: the tnum only overlaps with the u64 range in
0xffffffff00000000. The reg_bounds selftest doesn't know about tnums
and therefore fails to perform the same refinement.

This issue happens when the tnum carries information that cannot be
represented in the ranges, as otherwise the selftest could reach the
same refined value using just the ranges. The tnum thus needs to
represent non-contiguous values (ex., R6's tnum above, after the
condition). The only way this can happen in the reg_bounds selftest is
at the boundary between the 32 and 64bit ranges. We therefore only need
to handle that case.

This patch fixes the selftest refinement logic by checking if the u32
and u64 ranges overlap in a single value. If so, the ranges can be set
to that value. We need to handle two cases: either they overlap in
umin64...

  u64 values
  matching u32 range:     xxx        xxx        xxx        xxx
                      |--------------------------------------|
  u64 range:          0                xxxxx                 UMAX64

or in umax64:

  u64 values
  matching u32 range:     xxx        xxx        xxx        xxx
                      |--------------------------------------|
  u64 range:          0          xxxxx                       UMAX64

To detect the first case, we decrease umax64 to the maximum value that
matches the u32 range. If that happens to be umin64, then umin64 is the
only overlap. We proceed similarly for the second case, increasing
umin64 to the minimum value that matches the u32 range.

Note this is similar to how the verifier handles the general case using
tnum, but we don't need to care about a single-value overlap in the
middle of the range. That case is not possible when comparing two
ranges.
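
The two cases can be sketched as below. This is not the selftest's code: struct range64 and the helper names are hypothetical, and the sketch assumes the u32 range does not wrap (lo32 <= hi32).

```c
#include <stdbool.h>
#include <stdint.h>

struct range64 {
    uint64_t min, max;
};

/* Largest v <= x whose low 32 bits lie in [lo32, hi32]; false if none. */
static bool largest_match_at_most(uint64_t x, uint32_t lo32, uint32_t hi32,
                                  uint64_t *v)
{
    uint64_t upper = x >> 32;
    uint32_t low = (uint32_t)x;

    if (low >= lo32) {
        *v = (upper << 32) | (low <= hi32 ? low : hi32);
        return true;
    }
    if (upper == 0)
        return false;
    *v = ((upper - 1) << 32) | hi32;
    return true;
}

/* Smallest v >= x whose low 32 bits lie in [lo32, hi32]; false if none. */
static bool smallest_match_at_least(uint64_t x, uint32_t lo32, uint32_t hi32,
                                    uint64_t *v)
{
    uint64_t upper = x >> 32;
    uint32_t low = (uint32_t)x;

    if (low <= hi32) {
        *v = (upper << 32) | (low >= lo32 ? low : lo32);
        return true;
    }
    if (upper == UINT32_MAX)
        return false;
    *v = ((upper + 1) << 32) | lo32;
    return true;
}

/* Collapse the u64 range when it meets the u32 range in exactly one value. */
bool refine_single_overlap(struct range64 *r, uint32_t lo32, uint32_t hi32)
{
    uint64_t v;

    /* First case: lowering umax64 lands on umin64. */
    if (largest_match_at_most(r->max, lo32, hi32, &v) && v == r->min) {
        r->max = r->min;
        return true;
    }
    /* Second case: raising umin64 lands on umax64. */
    if (smallest_match_at_least(r->min, lo32, hi32, &v) && v == r->max) {
        r->min = r->max;
        return true;
    }
    return false;
}
```

Applied to the failing test case above (u64 range [0xfffffffe00000001; 0xffffffff00000000] with u32 equal to 0), the second case fires and both bounds become 0xffffffff00000000, matching the verifier's refined value.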

This patch also adds two test cases reproducing this bug as part of the
normal test runs (without SLOW_TESTS=1).

Fixes: efc11a667878 ("bpf: Improve bounds when tnum has a single possible value")
Reported-by: Eduard Zingerman <eddyz87@gmail.com>
Closes: https://lore.kernel.org/bpf/4e6dd64a162b3cab3635706ae6abfdd0be4db5db.camel@gmail.com/
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/ada9UuSQi2SE2IfB@mail.gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days agonet: rose: reject truncated CLEAR_REQUEST frames in state machines
Mashiro Chen [Wed, 8 Apr 2026 17:25:51 +0000 (01:25 +0800)] 
net: rose: reject truncated CLEAR_REQUEST frames in state machines

All five ROSE state machines (states 1-5) handle ROSE_CLEAR_REQUEST
by reading the cause and diagnostic bytes directly from skb->data[3]
and skb->data[4] without verifying that the frame is long enough:

  rose_disconnect(sk, ..., skb->data[3], skb->data[4]);

The entry-point check in rose_route_frame() only enforces
ROSE_MIN_LEN (3 bytes), so a remote peer on a ROSE network can
send a syntactically valid but truncated CLEAR_REQUEST (3 or 4
bytes) while a connection is open in any state.  Processing such a
frame causes a one- or two-byte out-of-bounds read past the skb
data, leaking uninitialized heap content as the cause/diagnostic
values returned to user space via getsockopt(ROSE_GETCAUSE).

Add a single length check at the rose_process_rx_frame() dispatch
point, before any state machine is entered, to drop frames that
carry the CLEAR_REQUEST type code but are too short to contain the
required cause and diagnostic fields.
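
A minimal sketch of the dispatch-point check, operating on a raw byte buffer instead of an skb. The frame type offset (byte 2) and the 0x13 type code are assumptions of this sketch, CLEAR_REQUEST_MIN_LEN and rose_frame_len_ok() are hypothetical names, and only the 3-byte ROSE_MIN_LEN and the use of bytes 3 and 4 come from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ROSE_MIN_LEN            3
#define ROSE_CLEAR_REQUEST      0x13    /* assumed type code */
#define CLEAR_REQUEST_MIN_LEN   5       /* hypothetical: header + cause + diagnostic */

/*
 * Return true when the frame is long enough for the fields that the
 * state machines will read. CLEAR_REQUEST needs data[3] and data[4].
 */
bool rose_frame_len_ok(const uint8_t *data, size_t len)
{
    if (len < ROSE_MIN_LEN)
        return false;
    if (data[2] == ROSE_CLEAR_REQUEST && len < CLEAR_REQUEST_MIN_LEN)
        return false;
    return true;
}
```

Doing the check once before dispatch avoids repeating it in each of the five state handlers.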

Signed-off-by: Mashiro Chen <mashiro.chen@mailbox.org>
Link: https://patch.msgid.link/20260408172551.281486-1-mashiro.chen@mailbox.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agoMerge branch 'net-enetc-improve-statistics-for-v1-and-add-statistics-for-v4'
Jakub Kicinski [Sun, 12 Apr 2026 20:03:51 +0000 (13:03 -0700)] 
Merge branch 'net-enetc-improve-statistics-for-v1-and-add-statistics-for-v4'

Wei Fang says:

====================
net: enetc: improve statistics for v1 and add statistics for v4

For ENETC v1, some standardized statistics were redundantly included in
the unstructured statistics, so remove these duplicated entries.
Previously, the unstructured statistics only contained eMAC data and
did not include pMAC data; add pMAC statistics to ensure completeness.

For ENETC v4, the driver previously reported MAC statistics only for the
internal ENETC (Pseudo MAC). Extend the implementation to provide
additional statistics for both the internal ENETC and the standalone
ENETC.
====================

Link: https://patch.msgid.link/20260408055849.1314033-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: enetc: add unstructured counters for ENETC v4
Wei Fang [Wed, 8 Apr 2026 05:58:49 +0000 (13:58 +0800)] 
net: enetc: add unstructured counters for ENETC v4

Like ENETC v1, ENETC v4 also has many non-standard counters, so these
counters are added to improve statistical coverage.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260408055849.1314033-6-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: enetc: add unstructured pMAC counters for ENETC v1
Wei Fang [Wed, 8 Apr 2026 05:58:48 +0000 (13:58 +0800)] 
net: enetc: add unstructured pMAC counters for ENETC v1

The ENETC v1 has two MACs (eMAC and pMAC) to support preemption. The
existing unstructured counters include the eMAC counters, but not the
pMAC counters. So add pMAC counters to improve statistical coverage.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260408055849.1314033-5-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: enetc: remove standardized counters from enetc_pm_counters
Wei Fang [Wed, 8 Apr 2026 05:58:47 +0000 (13:58 +0800)] 
net: enetc: remove standardized counters from enetc_pm_counters

The standardized counters are already exposed via the get_pause_stats(),
get_rmon_stats(), get_eth_ctrl_stats() and get_eth_mac_stats()
interfaces. Keeping the same counters in enetc_pm_counters results in
redundant output.

Remove these standardized counters from enetc_pm_counters and rely on
the existing statistics interfaces to report them.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260408055849.1314033-4-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: enetc: show RX drop counters only for assigned RX rings
Wei Fang [Wed, 8 Apr 2026 05:58:46 +0000 (13:58 +0800)] 
net: enetc: show RX drop counters only for assigned RX rings

For ENETC v1, each SI provides 16 RBDCR registers for RX ring drop
counters, but this does not imply that an SI actually owns 16 RX rings.
The ENETC hardware supports a total of 16 RX rings, which are assigned
to 3 SIs (1 PSI and 2 VSIs), so each SI is assigned fewer than 16 RX
rings.

The current implementation always reports 16 RX drop counters per SI,
leading to redundant output for SIs with fewer RX rings. Update the
logic to display drop counters only for the RX rings that are actually
assigned to the SI.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260408055849.1314033-3-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: enetc: add support for the standardized counters
Wei Fang [Wed, 8 Apr 2026 05:58:45 +0000 (13:58 +0800)] 
net: enetc: add support for the standardized counters

ENETC v4 provides 64-bit counters for IEEE 802.3 basic and mandatory
managed objects, the IETF Management Information Database (MIB) package
(RFC2665), and Remote Network Monitoring (RMON) statistics. In addition,
some ENETCs support preemption, so these ENETCs have two MACs: MAC 0 is
the express MAC (eMAC), MAC 1 is the preemptible MAC (pMAC). Both MACs
support these statistics.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260408055849.1314033-2-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agoselftests/bpf: Add tests for non-arena/arena operations
Emil Tsalapatis [Sun, 12 Apr 2026 17:45:39 +0000 (13:45 -0400)] 
selftests/bpf: Add tests for non-arena/arena operations

Add a selftest that ensures instructions with arena source and
non-arena destination registers are accepted by the verifier.

Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20260412174546.18684-3-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days agobpf: Allow instructions with arena source and non-arena dest registers
Emil Tsalapatis [Sun, 12 Apr 2026 17:45:38 +0000 (13:45 -0400)] 
bpf: Allow instructions with arena source and non-arena dest registers

The compiler sometimes stores the result of a PTR_TO_ARENA and SCALAR
operation into the scalar register rather than the pointer register.
Relax the verifier to allow operations between a source arena register
and a destination non-arena register, marking the destination's value
as a PTR_TO_ARENA.

Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Song Liu <song@kernel.org>
Fixes: 6082b6c328b5 ("bpf: Recognize addr_space_cast instruction in the verifier.")
Link: https://lore.kernel.org/r/20260412174546.18684-2-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days agoMerge branch 'bpf-add-the-missing-fsession'
Alexei Starovoitov [Sun, 12 Apr 2026 19:42:38 +0000 (12:42 -0700)] 
Merge branch 'bpf-add-the-missing-fsession'

Menglong Dong says:

====================
bpf: add the missing fsession

Add the missing fsession attach type to the BPF docs, verifier log and
bpftool.

Changes since v2:
- replace "FENTRY/FEXIT/FSESSION" with "Tracing" in the 1st patch
- v2: https://lore.kernel.org/all/20260408062109.386083-1-dongml2@chinatelecom.cn/

Changes since v1:
- add a missing FSESSION in bpf_check_attach_target() in the 1st patch
- v1: https://lore.kernel.org/all/20260408031416.266229-1-dongml2@chinatelecom.cn/
====================

Link: https://patch.msgid.link/20260412060346.142007-1-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days agobpftool: add missing fsession to the usage and docs of bpftool
Menglong Dong [Sun, 12 Apr 2026 06:03:46 +0000 (14:03 +0800)] 
bpftool: add missing fsession to the usage and docs of bpftool

Add the fsession attach type to the usage of bpftool in do_help().
Meanwhile, add it to the bash-completion and bpftool-prog.rst too.

Acked-by: Leon Hwang <leon.hwang@linux.dev>
Acked-by: Quentin Monnet <qmo@kernel.org>
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20260412060346.142007-4-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days agodocs/bpf: add missing fsession attach type to docs
Menglong Dong [Sun, 12 Apr 2026 06:03:45 +0000 (14:03 +0800)] 
docs/bpf: add missing fsession attach type to docs

Add the fsession attach type to program_types.rst and drgn.rst.

Acked-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20260412060346.142007-3-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days agobpf: add missing fsession to the verifier log
Menglong Dong [Sun, 12 Apr 2026 06:03:44 +0000 (14:03 +0800)] 
bpf: add missing fsession to the verifier log

The fsession attach type is missing from the verifier log in
check_get_func_ip(), bpf_check_attach_target() and check_attach_btf_id().
Update them so that the verifier log is correct. Meanwhile, update the
corresponding selftests.

Acked-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20260412060346.142007-2-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days agoMerge branch 'bpf-split-verifier-c'
Alexei Starovoitov [Sun, 12 Apr 2026 19:34:31 +0000 (12:34 -0700)] 
Merge branch 'bpf-split-verifier-c'

Alexei Starovoitov says:

====================
v3->v4: Restore a few minor comments and undo a few function moves
v2->v3: Actually restore comments lost in patch 3
(instead of adding them to patch 4)
v1->v2: Restore comments lost in patch 3

verifier.c is huge. Split it into logically independent pieces.
No functional changes.
The diff is impossible to review over email.
'git show' shows minimal actual changes. Only plenty of moved lines.
Such split may cause backport headaches.
We should have split it long ago.
Even after split verifier.c is still 20k lines,
but further split is harder.
====================

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20260412152936.54262-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days agobpf: Move BTF checking logic into check_btf.c
Alexei Starovoitov [Sun, 12 Apr 2026 15:29:35 +0000 (08:29 -0700)] 
bpf: Move BTF checking logic into check_btf.c

BTF validation logic is independent from the main verifier.
Move it into check_btf.c

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260412152936.54262-7-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days agobpf: Move backtracking logic to backtrack.c
Alexei Starovoitov [Sun, 12 Apr 2026 15:29:34 +0000 (08:29 -0700)] 
bpf: Move backtracking logic to backtrack.c

Move precision propagation and backtracking logic to backtrack.c
to reduce verifier.c size.

No functional changes.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260412152936.54262-6-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days agobpf: Move state equivalence logic to states.c
Alexei Starovoitov [Sun, 12 Apr 2026 15:29:33 +0000 (08:29 -0700)] 
bpf: Move state equivalence logic to states.c

verifier.c is huge. Move is_state_visited() to states.c,
so that all state equivalence logic is in one file.

Mechanical move. No functional changes.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260412152936.54262-5-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days agobpf: Move check_cfg() into cfg.c
Alexei Starovoitov [Sun, 12 Apr 2026 15:29:32 +0000 (08:29 -0700)] 
bpf: Move check_cfg() into cfg.c

verifier.c is huge. Move check_cfg(), compute_postorder(),
compute_scc() into cfg.c

Mechanical move. No functional changes.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260412152936.54262-4-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days agobpf: Move compute_insn_live_regs() into liveness.c
Alexei Starovoitov [Sun, 12 Apr 2026 15:29:31 +0000 (08:29 -0700)] 
bpf: Move compute_insn_live_regs() into liveness.c

verifier.c is huge. Move compute_insn_live_regs() into liveness.c.

Mechanical move. No functional changes.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260412152936.54262-3-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days agobpf: Move fixup/post-processing logic from verifier.c into fixups.c
Alexei Starovoitov [Sun, 12 Apr 2026 15:29:30 +0000 (08:29 -0700)] 
bpf: Move fixup/post-processing logic from verifier.c into fixups.c

verifier.c is huge. Split fixup/post-processing logic that runs after
the verifier accepted the program into fixups.c.

Mechanical move. No functional changes.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260412152936.54262-2-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 days agogre: Count GRE packet drops
Gal Pressman [Thu, 9 Apr 2026 09:09:45 +0000 (12:09 +0300)] 
gre: Count GRE packet drops

GRE is silently dropping packets without updating statistics.

In case of drop, increment rx_dropped counter to provide visibility into
packet loss. For the case where no GRE protocol handler is registered,
use rx_nohandler.

Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20260409090945.1542440-1-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agoMerge branch 'bpf-fix-sock_ops_get_sk-same-register-oob-read-in-sock_ops-and-add...
Jakub Kicinski [Sun, 12 Apr 2026 19:28:07 +0000 (12:28 -0700)] 
Merge branch 'bpf-fix-sock_ops_get_sk-same-register-oob-read-in-sock_ops-and-add-selftest'

Jiayuan Chen says:

====================
bpf: Fix SOCK_OPS_GET_SK same-register OOB read in sock_ops and add selftest

When a BPF sock_ops program accesses ctx fields with dst_reg == src_reg,
the SOCK_OPS_GET_SK() and SOCK_OPS_GET_FIELD() macros fail to zero the
destination register in the !fullsock / !locked_tcp_sock path, leading to
OOB read (GET_SK) and kernel pointer leak (GET_FIELD).

Patch 1: Fix both macros by adding BPF_MOV64_IMM(si->dst_reg, 0) in the
!fullsock landing pad.
Patch 2: Add selftests covering same-register and different-register cases
for both GET_SK and GET_FIELD.

[1] https://lore.kernel.org/bpf/6fe1243e-149b-4d3b-99c7-fcc9e2f75787@std.uestc.edu.cn/T/#u
====================

Link: https://patch.msgid.link/20260407022720.162151-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agoselftests/bpf: Add tests for sock_ops ctx access with same src/dst register
Jiayuan Chen [Tue, 7 Apr 2026 02:26:28 +0000 (10:26 +0800)] 
selftests/bpf: Add tests for sock_ops ctx access with same src/dst register

Add selftests to verify SOCK_OPS_GET_SK() and SOCK_OPS_GET_FIELD() correctly
return NULL/zero when dst_reg == src_reg and is_fullsock == 0.

Three subtests are included:
 - get_sk: ctx->sk with same src/dst register (SOCK_OPS_GET_SK)
 - get_field: ctx->snd_cwnd with same src/dst register (SOCK_OPS_GET_FIELD)
 - get_sk_diff_reg: ctx->sk with different src/dst register (baseline)

Each BPF program uses inline asm (__naked) to force specific register
allocation, reads is_fullsock first, then loads the field using the same
(or different) register. The test triggers TCP_NEW_SYN_RECV via a TCP
handshake and checks that the result is NULL/zero when is_fullsock == 0.

Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260407022720.162151-3-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agobpf: Fix same-register dst/src OOB read and pointer leak in sock_ops
Jiayuan Chen [Tue, 7 Apr 2026 02:26:27 +0000 (10:26 +0800)] 
bpf: Fix same-register dst/src OOB read and pointer leak in sock_ops

When a BPF sock_ops program accesses ctx fields with dst_reg == src_reg,
the SOCK_OPS_GET_SK() and SOCK_OPS_GET_FIELD() macros fail to zero the
destination register in the !fullsock / !locked_tcp_sock path.

Both macros borrow a temporary register to check is_fullsock /
is_locked_tcp_sock when dst_reg == src_reg, because dst_reg holds the
ctx pointer. When the check is false (e.g., TCP_NEW_SYN_RECV state with
a request_sock), dst_reg should be zeroed but is not, leaving the stale
ctx pointer:

 - SOCK_OPS_GET_SK: dst_reg retains the ctx pointer, passes NULL checks
   as PTR_TO_SOCKET_OR_NULL, and can be used as a bogus socket pointer,
   leading to stack-out-of-bounds access in helpers like
   bpf_skc_to_tcp6_sock().

 - SOCK_OPS_GET_FIELD: dst_reg retains the ctx pointer which the
   verifier believes is a SCALAR_VALUE, leaking a kernel pointer.

Fix both macros by:
 - Changing JMP_A(1) to JMP_A(2) in the fullsock path to skip the
   added instruction.
 - Adding BPF_MOV64_IMM(si->dst_reg, 0) after the temp register
   restore in the !fullsock path, placed after the restore because
   dst_reg == src_reg means we need src_reg intact to read ctx->temp.

Fixes: fd09af010788 ("bpf: sock_ops ctx access may stomp registers in corner case")
Fixes: 84f44df664e9 ("bpf: sock_ops sk access may stomp registers when dst_reg = src_reg")
Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Closes: https://lore.kernel.org/bpf/6fe1243e-149b-4d3b-99c7-fcc9e2f75787@std.uestc.edu.cn/T/#u
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260407022720.162151-2-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agoDocumentation: core-api: real-time: correct spelling
Sukrut Heroorkar [Sat, 11 Apr 2026 15:51:19 +0000 (17:51 +0200)] 
Documentation: core-api: real-time: correct spelling

Replace the typo "excpetion" with "exception".

Signed-off-by: Sukrut Heroorkar <hsukrut3@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260411155120.233357-1-hsukrut3@gmail.com>

2 days agodoc: Add CPU Isolation documentation
Frederic Weisbecker [Thu, 2 Apr 2026 09:47:49 +0000 (11:47 +0200)] 
doc: Add CPU Isolation documentation

nohz_full was introduced in v3.10 in 2013, which means this
documentation has been overdue for 13 years.

Fortunately Paul wrote a part of the needed documentation a while ago,
especially concerning nohz_full in Documentation/timers/no_hz.rst and
also about per-CPU kthreads in
Documentation/admin-guide/kernel-per-CPU-kthreads.rst

Introduce a new page that gives an overview of CPU isolation in general.

Acked-by: Waiman Long <longman@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260402094749.18879-1-frederic@kernel.org>

2 days agoMerge tag 'edac_urgent_for_7.0' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 12 Apr 2026 18:56:07 +0000 (11:56 -0700)] 
Merge tag 'edac_urgent_for_7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC fix from Borislav Petkov:

 - Fix the error path ordering when the driver-private descriptor
   allocation fails

* tag 'edac_urgent_for_7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/mc: Fix error path ordering in edac_mc_alloc()

2 days agoNFC: digital: Bounds check NFC-A cascade depth in SDD response handler
Greg Kroah-Hartman [Thu, 9 Apr 2026 15:18:14 +0000 (17:18 +0200)] 
NFC: digital: Bounds check NFC-A cascade depth in SDD response handler

The NFC-A anti-collision cascade in digital_in_recv_sdd_res() appends 3
or 4 bytes to target->nfcid1 on each round, but the number of cascade
rounds is controlled entirely by the peer device.  The peer sets the
cascade tag in the SDD_RES (deciding 3 vs 4 bytes) and the
cascade-incomplete bit in the SEL_RES (deciding whether another round
follows).

ISO 14443-3 limits NFC-A to three cascade levels and target->nfcid1 is
sized accordingly (NFC_NFCID1_MAXSIZE = 10), but nothing in the driver
actually enforces this.  This means a malicious peer can keep the
cascade running, writing past the heap-allocated nfc_target with each
round.

Fix this by rejecting the response when the accumulated UID would exceed
the buffer.
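
The check can be sketched on a simplified target structure. struct target_sketch and append_uid_part() are hypothetical, and the -EINVAL return value is an assumption; only NFC_NFCID1_MAXSIZE = 10 and the 3-or-4-byte rounds come from the text above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NFC_NFCID1_MAXSIZE 10

/* Simplified stand-in for the nfcid1 fields of struct nfc_target. */
struct target_sketch {
    uint8_t nfcid1[NFC_NFCID1_MAXSIZE];
    uint8_t nfcid1_len;
};

/*
 * Append one cascade round (3 or 4 bytes) to the accumulated UID, but
 * refuse the round if it would not fit in the fixed-size buffer.
 */
int append_uid_part(struct target_sketch *t, const uint8_t *part, size_t size)
{
    if (size > (size_t)(NFC_NFCID1_MAXSIZE - t->nfcid1_len))
        return -EINVAL;     /* assumed error code */
    memcpy(t->nfcid1 + t->nfcid1_len, part, size);
    t->nfcid1_len = (uint8_t)(t->nfcid1_len + size);
    return 0;
}
```

Three legitimate rounds (3 + 3 + 4 bytes) fill the buffer exactly; any further round requested by the peer is rejected.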

Commit e329e71013c9 ("NFC: nci: Bounds check struct nfc_target arrays")
fixed similar missing checks against the same field on the NCI path.

Cc: Simon Horman <horms@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Thierry Escande <thierry.escande@linux.intel.com>
Cc: Samuel Ortiz <sameo@linux.intel.com>
Fixes: 2c66daecc409 ("NFC Digital: Add NFC-A technology support")
Cc: stable <stable@kernel.org>
Assisted-by: gregkh_clanker_t1000
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/2026040913-figure-seducing-bd3f@gregkh
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet_sched: fix skb memory leak in deferred qdisc drops
Fernando Fernandez Mancera [Wed, 8 Apr 2026 10:00:44 +0000 (12:00 +0200)] 
net_sched: fix skb memory leak in deferred qdisc drops

When the network stack cleans up the deferred list via qdisc_run_end(),
it operates on the root qdisc. If the root qdisc does not implement the
TCQ_F_DEQUEUE_DROPS flag, the packets queued to be freed are never freed
and get stranded on the child's local to_free list.

Fix this by making qdisc_dequeue_drop() aware of the root qdisc. It
fetches the root qdisc and checks for the TCQ_F_DEQUEUE_DROPS flag. If
the flag is present, the packet is appended directly to the root's
to_free list. Otherwise, the packet is dropped directly, as was done
before the optimization was implemented.
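
A toy model of that decision, under heavy simplification. All types and names here are hypothetical, the flag value is a placeholder, and "freeing" is modelled by setting a boolean rather than releasing an skb.

```c
#include <stdbool.h>
#include <stddef.h>

#define TCQ_F_DEQUEUE_DROPS_SKETCH 0x1  /* placeholder value, not the kernel's */

struct pkt_sketch {
    struct pkt_sketch *next;
    bool freed;
};

struct qdisc_sketch {
    unsigned int flags;
    struct pkt_sketch *to_free;     /* deferred-free list */
    struct qdisc_sketch *root;      /* root of the qdisc hierarchy */
};

/*
 * Drop a packet from a (possibly child) qdisc: defer it onto the root's
 * list only if the root will actually flush that list, otherwise free it
 * immediately so it cannot be stranded.
 */
void dequeue_drop(struct qdisc_sketch *q, struct pkt_sketch *p)
{
    struct qdisc_sketch *root = q->root;

    if (root->flags & TCQ_F_DEQUEUE_DROPS_SKETCH) {
        p->next = root->to_free;
        root->to_free = p;
    } else {
        p->freed = true;
    }
}
```

The point of the sketch is that the child's own to_free list is never used, so nothing depends on the child being flushed.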

Fixes: a6efc273ab82 ("net_sched: use qdisc_dequeue_drop() in cake, codel, fq_codel")
Reported-by: Damilola Bello <damilola@aterlo.com>
Closes: https://lore.kernel.org/netdev/CAPgFtOLaedBMU0f_BxV2bXftTJSmJr018Q5uozOo5vVo6b9tjw@mail.gmail.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260408100044.4530-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agoMerge branch 'net-phy-add-support-for-disabling-autonomous-eee'
Jakub Kicinski [Sun, 12 Apr 2026 18:33:26 +0000 (11:33 -0700)] 
Merge branch 'net-phy-add-support-for-disabling-autonomous-eee'

Nicolai Buchwitz says:

====================
net: phy: add support for disabling autonomous EEE

Some PHYs implement autonomous EEE where the PHY manages EEE
independently, preventing the MAC from controlling LPI signaling.
This conflicts with MACs that implement their own LPI control.

This series adds a .disable_autonomous_eee callback to struct phy_driver
and calls it from phy_support_eee(). When a MAC indicates it supports
EEE, the PHY's autonomous EEE is automatically disabled. The setting is
persisted across suspend/resume by re-applying it in phy_init_hw() after
soft reset, following the same pattern suggested by Russell King for PHY
tunables [1].

Patch 1 adds the phylib infrastructure.
Patch 2 implements it for Broadcom BCM54xx (AutogrEEEn).
Patch 3 converts the Realtek RTL8211F, which previously unconditionally
  disabled PHY-mode EEE in config_init.

This came up while adding EEE support to the Cadence macb driver (used
on Raspberry Pi 5 with a BCM54210PE PHY). The PHY's AutogrEEEn mode
prevented the MAC from tracking LPI state. The Realtek RTL8211F has
the same pattern, unconditionally disabling PHY-mode EEE with the
comment "Disable PHY-mode EEE so LPI is passed to the MAC".

Other BCM54xx PHYs likely have the same AutogrEEEn register layout,
but I only have access to the BCM54210PE/BCM54213PE datasheets. It
would be appreciated if Florian or others could confirm which other
BCM54xx variants share this register so we can wire them up too.

Tested on Raspberry Pi CM4 (bcmgenet + BCM54210PE),
Raspberry Pi CM5 (Cadence GEM + BCM54210PE) and
Raspberry Pi 5 (Cadence GEM + BCM54213PE).

[1] https://lore.kernel.org/netdev/acuwvoydmJusuj9x@shell.armlinux.org.uk/
====================

Link: https://patch.msgid.link/20260406-devel-autonomous-eee-v1-0-b335e7143711@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: phy: realtek: convert RTL8211F to .disable_autonomous_eee
Nicolai Buchwitz [Mon, 6 Apr 2026 07:13:09 +0000 (09:13 +0200)] 
net: phy: realtek: convert RTL8211F to .disable_autonomous_eee

The RTL8211F previously unconditionally disabled PHY-mode EEE in
config_init. Convert this to use the new .disable_autonomous_eee
callback so it is only disabled when the MAC indicates EEE support
via phy_support_eee().

This preserves PHY-autonomous EEE for MACs that do not support EEE,
while still disabling it when the MAC manages LPI.

Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260406-devel-autonomous-eee-v1-3-b335e7143711@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: phy: broadcom: implement .disable_autonomous_eee for BCM54xx
Nicolai Buchwitz [Mon, 6 Apr 2026 07:13:08 +0000 (09:13 +0200)] 
net: phy: broadcom: implement .disable_autonomous_eee for BCM54xx

Implement the .disable_autonomous_eee callback for the BCM54210E.

In AutogrEEEn mode the PHY manages EEE autonomously. Clearing the
AutogrEEEn enable bit in MII_BUF_CNTL_0 switches the PHY to Native
EEE mode.

Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20260406-devel-autonomous-eee-v1-2-b335e7143711@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: phy: add support for disabling PHY-autonomous EEE
Nicolai Buchwitz [Mon, 6 Apr 2026 07:13:07 +0000 (09:13 +0200)] 
net: phy: add support for disabling PHY-autonomous EEE

Some PHYs (e.g. Broadcom BCM54xx, Realtek RTL8211F) implement
autonomous EEE where the PHY manages LPI signaling without forwarding
it to the MAC. This conflicts with MAC drivers that implement their own
LPI control.

Add a .disable_autonomous_eee callback to struct phy_driver and call it
from phy_support_eee(). When a MAC driver indicates it supports EEE via
phy_support_eee(), the PHY's autonomous EEE is automatically disabled so
the MAC can manage LPI entry/exit.

Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260406-devel-autonomous-eee-v1-1-b335e7143711@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: airoha: Add missing RX_CPU_IDX() configuration in airoha_qdma_cleanup_rx_queue()
Lorenzo Bianconi [Wed, 8 Apr 2026 18:26:56 +0000 (20:26 +0200)] 
net: airoha: Add missing RX_CPU_IDX() configuration in airoha_qdma_cleanup_rx_queue()

When the descriptor index written in REG_RX_CPU_IDX() is equal to the one
stored in REG_RX_DMA_IDX(), the hw will stop since the QDMA RX ring is
empty.
Add the missing REG_RX_CPU_IDX() configuration to the
airoha_qdma_cleanup_rx_queue() routine during QDMA RX ring cleanup.

Fixes: 514aac359987 ("net: airoha: Add missing cleanup bits in airoha_qdma_cleanup_rx_queue()")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260408-airoha-cpu-idx-airoha_qdma_cleanup_rx_queue-v1-1-8efa64844308@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agoMerge branch 'ynl-ethtool-netlink-fix-nla_len-overflow-for-large-string-sets'
Jakub Kicinski [Sun, 12 Apr 2026 18:23:52 +0000 (11:23 -0700)] 
Merge branch 'ynl-ethtool-netlink-fix-nla_len-overflow-for-large-string-sets'

Hangbin Liu says:

====================
ynl/ethtool/netlink: fix nla_len overflow for large string sets

This series addresses a silent data corruption issue triggered when ynl
retrieves string sets from NICs with a large number of statistics entries
(e.g. mlx5_core with thousands of ETH_SS_STATS strings).

The root cause is that struct nlattr.nla_len is a __u16 (max 65535
bytes). When a NIC exports enough statistics strings, the
ETHTOOL_A_STRINGSET_STRINGS nest built by strset_fill_set() exceeds
this limit. nla_nest_end() silently truncates the length on assignment,
producing a corrupted netlink message.

Patch 1 moves ethtool.py to selftest.

Patch 2 improves the ethtool tool: it renames the doit/dumpit helpers
to do_set/do_get and converts do_get to use ynl.do() with an
explicit device header instead of a full dump with client-side filtering.

Patch 3 adds a --dbg-small-recv option to the YNL ethtool tool,
matching the same option already present in cli.py, to help debug netlink
message size issues.

Patch 4 adds a new helper nla_nest_end_safe() that checks whether nla_len
would overflow and returns -EMSGSIZE early if so.

Patch 5 uses the new helper in ethtool to make sure ethtool doesn't
reply with a corrupted netlink message.
====================

Link: https://patch.msgid.link/20260408-b4-ynl_ethtool-v2-0-7623a5e8f70b@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agoethtool: strset: check nla_len overflow
Hangbin Liu [Wed, 8 Apr 2026 07:08:53 +0000 (15:08 +0800)] 
ethtool: strset: check nla_len overflow

The netlink attribute length field nla_len is a __u16, which can only
represent values up to 65535 bytes. NICs with a large number of
statistics strings (e.g. mlx5_core with thousands of ETH_SS_STATS
entries) can produce an ETHTOOL_A_STRINGSET_STRINGS nest that exceeds
this limit.

When nla_nest_end() writes the actual nest size back to nla_len, the
value is silently truncated. This results in a corrupted netlink message
being sent to userspace: the parser reads a wrong (truncated) attribute
length and misaligns all subsequent attribute boundaries, causing decode
errors.

Fix this by using the new helper nla_nest_end_safe and error out if
the size exceeds U16_MAX.
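
The truncation and the check can be modelled in user space. This is a
sketch under assumptions, not the kernel helpers: nest_end() and
nest_end_safe() below are hypothetical Python stand-ins that operate on
a bytearray whose first two bytes play the role of nla_len.

```python
import errno
import struct

U16_MAX = 0xFFFF


def make_nest(payload_len: int) -> bytearray:
    # 4-byte attribute header (nla_len, nla_type) followed by the payload.
    return bytearray(4 + payload_len)


def nest_end(buf: bytearray, start: int) -> int:
    """Unchecked model: the length is stored modulo 2**16."""
    length = len(buf) - start
    struct.pack_into("<H", buf, start, length & U16_MAX)  # silent truncation
    return len(buf)


def nest_end_safe(buf: bytearray, start: int) -> int:
    """Checked model: refuse lengths that do not fit in a u16."""
    length = len(buf) - start
    if length > U16_MAX:
        return -errno.EMSGSIZE
    struct.pack_into("<H", buf, start, length)
    return len(buf)
```

With a 70000-byte payload the unchecked variant stores 70004 modulo
65536 in the length field, so a parser would resume reading in the
middle of the nest; the checked variant reports the error instead.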

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20260408-b4-ynl_ethtool-v2-5-7623a5e8f70b@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonetlink: add a nla_nest_end_safe() helper
Hangbin Liu [Wed, 8 Apr 2026 07:08:52 +0000 (15:08 +0800)] 
netlink: add a nla_nest_end_safe() helper

The nla_len field in struct nlattr is a __u16, which can only hold
values up to 65535. If a nested attribute grows beyond this limit,
nla_nest_end() silently truncates the length, producing a corrupted
netlink message with no indication of the problem.

Since nla_nest_end() is used everywhere and this issue rarely happens,
let's add a new helper to check the length.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20260408-b4-ynl_ethtool-v2-4-7623a5e8f70b@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agotools: ynl: ethtool: add --dbg-small-recv option
Hangbin Liu [Wed, 8 Apr 2026 07:08:51 +0000 (15:08 +0800)] 
tools: ynl: ethtool: add --dbg-small-recv option

Add a --dbg-small-recv debug option to control the recv() buffer size
used by YNL, matching the same option already present in cli.py. This
is useful when debugging how large netlink messages are received.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20260408-b4-ynl_ethtool-v2-3-7623a5e8f70b@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agotools: ynl: ethtool: use doit instead of dumpit for per-device GET
Hangbin Liu [Wed, 8 Apr 2026 07:08:50 +0000 (15:08 +0800)] 
tools: ynl: ethtool: use doit instead of dumpit for per-device GET

Rename the local helper doit() to do_set() and dumpit() to do_get() to
better reflect their purpose.

Convert do_get() to use ynl.do() with an explicit device header instead
of ynl.dump() followed by client-side filtering. This is more efficient
as the kernel only processes and returns data for the requested device,
rather than dumping all devices across the netns.
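
The difference can be sketched with a stub. FakeFamily below is a
hypothetical stand-in and not the pyynl API; the "header"/"dev-name"
layout and the operation name are assumptions made for the example. The
processed counter models how many devices the kernel side would have to
visit.

```python
class FakeFamily:
    """Hypothetical stub: counts how many devices each request touches."""

    def __init__(self, devices):
        self.devices = devices   # ifname -> attribute dict
        self.processed = 0

    def dump(self, op):
        # A dump walks every device in the namespace.
        self.processed += len(self.devices)
        return [{"header": {"dev-name": name}, **attrs}
                for name, attrs in self.devices.items()]

    def do(self, op, req):
        # A do request names one device in its header.
        name = req["header"]["dev-name"]
        self.processed += 1
        return {"header": {"dev-name": name}, **self.devices[name]}


def get_by_dump(fam, op, ifname):
    # Old approach: dump everything, then filter on the client side.
    for reply in fam.dump(op):
        if reply["header"]["dev-name"] == ifname:
            return reply
    return None


def get_by_do(fam, op, ifname):
    # New approach: ask for exactly one device.
    return fam.do(op, {"header": {"dev-name": ifname}})
```

Both helpers return the same reply for a given interface, but the dump
variant scales with the number of devices in the namespace.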

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20260408-b4-ynl_ethtool-v2-2-7623a5e8f70b@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agotools: ynl: move ethtool.py to selftest
Hangbin Liu [Wed, 8 Apr 2026 07:08:49 +0000 (15:08 +0800)] 
tools: ynl: move ethtool.py to selftest

We have converted all the samples to selftests. This script is
the last piece of random "PoC" code we still have lying around.
Let's move it to tests.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20260408-b4-ynl_ethtool-v2-1-7623a5e8f70b@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agoMerge branch 'net-mana-fix-debugfs-directory-naming-and-file-lifecycle'
Jakub Kicinski [Sun, 12 Apr 2026 18:22:56 +0000 (11:22 -0700)] 
Merge branch 'net-mana-fix-debugfs-directory-naming-and-file-lifecycle'

Erni Sri Satya Vennela says:

====================
net: mana: Fix debugfs directory naming and file lifecycle

This series fixes two pre-existing debugfs issues in the MANA driver.

Patch 1 fixes the per-device debugfs directory naming to use the unique
PCI BDF address via pci_name(), avoiding a potential NULL pointer
dereference when pdev->slot is NULL (e.g. VFIO passthrough, nested KVM)
and preventing name collisions across multiple PFs or VFs.

Patch 2 moves the current_speed debugfs file creation from
mana_probe_port() to mana_init_port() so it survives detach/attach
cycles triggered by MTU changes or XDP program changes.
====================

Link: https://patch.msgid.link/20260408081224.302308-1-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: mana: Move current_speed debugfs file to mana_init_port()
Erni Sri Satya Vennela [Wed, 8 Apr 2026 08:12:20 +0000 (01:12 -0700)] 
net: mana: Move current_speed debugfs file to mana_init_port()

Move the current_speed debugfs file creation from mana_probe_port() to
mana_init_port(). The file was previously created only during initial
probe, but mana_cleanup_port_context() removes the entire vPort debugfs
directory during detach/attach cycles. Since mana_init_port() recreates
the directory on re-attach, moving current_speed here ensures it survives
these cycles.

Fixes: 75cabb46935b ("net: mana: Add support for net_shaper_ops")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260408081224.302308-3-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: mana: Use pci_name() for debugfs directory naming
Erni Sri Satya Vennela [Wed, 8 Apr 2026 08:12:19 +0000 (01:12 -0700)] 
net: mana: Use pci_name() for debugfs directory naming

Use pci_name(pdev) for the per-device debugfs directory instead of
hardcoded "0" for PFs and pci_slot_name(pdev->slot) for VFs. The
previous approach had two issues:

1. pci_slot_name() dereferences pdev->slot, which can be NULL for VFs
   in environments like generic VFIO passthrough or nested KVM,
   causing a NULL pointer dereference.

2. Multiple PFs would all use "0", and VFs across different PCI
   domains or buses could share the same slot name, leading to
   -EEXIST errors from debugfs_create_dir().

pci_name(pdev) returns the unique BDF address, is always valid, and is
unique across the system.
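
A small model of the two naming schemes, for illustration only. PciDev
is a hypothetical record, not a kernel structure, and new_dir_name()
assumes the usual domain:bus:device.function string form.

```python
from collections import namedtuple

# Hypothetical record used only for this example.
PciDev = namedtuple("PciDev", "domain bus slot func is_pf slot_name")


def old_dir_name(dev):
    # Old scheme: fixed "0" for PFs, slot name for VFs.
    if dev.is_pf:
        return "0"
    if dev.slot_name is None:
        # Models the NULL pdev->slot case described above.
        raise ValueError("no slot information for this VF")
    return dev.slot_name


def new_dir_name(dev):
    # BDF string, distinct for every function in the system.
    return "%04x:%02x:%02x.%d" % (dev.domain, dev.bus, dev.slot, dev.func)
```

Two PFs on different buses both map to "0" under the old scheme, while
their BDF strings differ.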

Fixes: 6607c17c6c5e ("net: mana: Enable debugfs files for MANA device")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260408081224.302308-2-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonfc: llcp: add missing return after LLCP_CLOSED checks
Junxi Qian [Wed, 8 Apr 2026 08:10:06 +0000 (16:10 +0800)] 
nfc: llcp: add missing return after LLCP_CLOSED checks

In nfc_llcp_recv_hdlc() and nfc_llcp_recv_disc(), when the socket
state is LLCP_CLOSED, the code correctly calls release_sock() and
nfc_llcp_sock_put() but fails to return. Execution falls through to
the remainder of the function, which calls release_sock() and
nfc_llcp_sock_put() again. This results in a double release_sock()
and a refcount underflow via double nfc_llcp_sock_put(), leading to
a use-after-free.

Add the missing return statements after the LLCP_CLOSED branches
in both functions to prevent the fall-through.
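
The fall-through can be illustrated with a toy model. Sock and the
helper functions are hypothetical stand-ins for the socket, its lock and
its reference count; recv() only mirrors the control flow described
above, with the fixed flag selecting whether the early return is
present.

```python
LLCP_CLOSED, LLCP_CONNECTED = "closed", "connected"


class Sock:
    """Hypothetical socket model with a lock counter and a refcount."""

    def __init__(self, state):
        self.state = state
        self.refcnt = 1        # reference owned by the socket table
        self.unlock_calls = 0


def sock_get(sk):
    sk.refcnt += 1


def sock_put(sk):
    sk.refcnt -= 1


def release_sock(sk):
    sk.unlock_calls += 1


def recv(sk, fixed):
    sock_get(sk)               # lookup takes a reference
    if sk.state == LLCP_CLOSED:
        release_sock(sk)
        sock_put(sk)
        if fixed:
            return             # the missing return added by the fix
    # ... normal frame processing would happen here ...
    release_sock(sk)
    sock_put(sk)
```

Without the return, a closed socket is unlocked twice and loses the
reference that the table still relies on.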

Fixes: d646960f7986 ("NFC: Initial LLCP support")
Signed-off-by: Junxi Qian <qjx1298677004@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260408081006.3723-1-qjx1298677004@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agoMerge branch 'bng_en-add-link-management-and-statistics-support'
Jakub Kicinski [Sun, 12 Apr 2026 18:09:39 +0000 (11:09 -0700)] 
Merge branch 'bng_en-add-link-management-and-statistics-support'

Bhargava Marreddy says:

====================
bng_en: add link management and statistics support

This series enhances the bng_en driver by adding:
1. Link/PHY support
   a. Link query
   b. Async Link events
   c. Ethtool link set/get functionality
2. Hardware statistics reporting via ethtool -S

This version incorporates feedback received prior to splitting the
original series into two parts.
====================

Link: https://patch.msgid.link/20260406180420.279470-1-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agobng_en: add support for ethtool -S stats display
Bhargava Marreddy [Mon, 6 Apr 2026 18:04:20 +0000 (23:34 +0530)] 
bng_en: add support for ethtool -S stats display

Implement the legacy ethtool statistics interface (get_sset_count,
get_strings, get_ethtool_stats) to expose hardware counters not
available through standard kernel stats APIs.

Ex:
a) Per-queue ring stats
     rxq0_ucast_packets: 2
     rxq0_mcast_packets: 0
     rxq0_bcast_packets: 15
     rxq0_ucast_bytes: 120
     rxq0_mcast_bytes: 0
     rxq0_bcast_bytes: 900
     txq0_ucast_packets: 0
     txq0_mcast_packets: 0
     txq0_bcast_packets: 0
     txq0_ucast_bytes: 0
     txq0_mcast_bytes: 0
     txq0_bcast_bytes: 0

b) Per-queue TPA(LRO/GRO) stats
     rxq4_tpa_eligible_pkt: 0
     rxq4_tpa_eligible_bytes: 0
     rxq4_tpa_pkt: 0
     rxq4_tpa_bytes: 0
     rxq4_tpa_errors: 0
     rxq4_tpa_events: 0

c) Port level stats
     rxp_good_vlan_frames: 0
     rxp_mtu_err_frames: 0
     rxp_tagged_frames: 0
     rxp_double_tagged_frames: 0
     rxp_pfc_ena_frames_pri0: 0
     rxp_pfc_ena_frames_pri1: 0
     rxp_pfc_ena_frames_pri2: 0
     rxp_pfc_ena_frames_pri3: 0
     rxp_pfc_ena_frames_pri4: 0
     rxp_pfc_ena_frames_pri5: 0
     rxp_pfc_ena_frames_pri6: 0
     rxp_pfc_ena_frames_pri7: 0
     rxp_eee_lpi_events: 0
     rxp_eee_lpi_duration: 0
     rxp_runt_bytes: 0
     rxp_runt_frames: 0
     txp_good_vlan_frames: 0
     txp_jabber_frames: 0
     txp_fcs_err_frames: 0
     txp_pfc_ena_frames_pri0: 0
     txp_pfc_ena_frames_pri1: 0
     txp_pfc_ena_frames_pri2: 0
     txp_pfc_ena_frames_pri3: 0
     txp_pfc_ena_frames_pri4: 0
     txp_pfc_ena_frames_pri5: 0
     txp_pfc_ena_frames_pri6: 0
     txp_pfc_ena_frames_pri7: 0
     txp_eee_lpi_events: 0
     txp_eee_lpi_duration: 0
     txp_xthol_frames: 0

d) Per-priority stats
     rx_bytes_pri0: 4182650
     rx_bytes_pri1: 4182650
     rx_bytes_pri2: 4182650
     rx_bytes_pri3: 4182650
     rx_bytes_pri4: 4182650
     rx_bytes_pri5: 4182650
     rx_bytes_pri6: 4182650
     rx_bytes_pri7: 4182650

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Link: https://patch.msgid.link/20260406180420.279470-11-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agobng_en: implement netdev_stat_ops
Bhargava Marreddy [Mon, 6 Apr 2026 18:04:19 +0000 (23:34 +0530)] 
bng_en: implement netdev_stat_ops

Implement netdev_stat_ops to provide standardized per-queue
statistics via the Netlink API.

Below is the description of the hardware drop counters:

rx-hw-drop-overruns: Packets dropped by HW due to resource limitations
(e.g., no BDs available in the host ring).
rx-hw-drops: Total packets dropped by HW (sum of overruns and error
drops).
tx-hw-drop-errors: Packets dropped by HW because they were invalid or
malformed.
tx-hw-drops: Total packets dropped by HW (sum of resource limitations
and error drops).

The implementation was verified using the ynl tool:

./tools/net/ynl/pyynl/cli.py --spec \
Documentation/netlink/specs/netdev.yaml --dump qstats-get --json \
'{"ifindex":14, "scope":"queue"}'

[{'ifindex': 14, 'queue-id': 0, 'queue-type': 'rx', 'rx-bytes': 758,
  'rx-hw-drop-overruns': 0, 'rx-hw-drops': 0, 'rx-packets': 11},
 {'ifindex': 14, 'queue-id': 1, 'queue-type': 'rx', 'rx-bytes': 0,
  'rx-hw-drop-overruns': 0, 'rx-hw-drops': 0, 'rx-packets': 0},
 {'ifindex': 14, 'queue-id': 0, 'queue-type': 'tx', 'tx-bytes': 0,
  'tx-hw-drop-errors': 0, 'tx-hw-drops': 0, 'tx-packets': 0},
 {'ifindex': 14, 'queue-id': 1, 'queue-type': 'tx', 'tx-bytes': 0,
  'tx-hw-drop-errors': 0, 'tx-hw-drops': 0, 'tx-packets': 0},
 {'ifindex': 14, 'queue-id': 2, 'queue-type': 'tx', 'tx-bytes': 810,
  'tx-hw-drop-errors': 0, 'tx-hw-drops': 0, 'tx-packets': 10}]

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Link: https://patch.msgid.link/20260406180420.279470-10-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agobng_en: implement ndo_get_stats64
Bhargava Marreddy [Mon, 6 Apr 2026 18:04:18 +0000 (23:34 +0530)] 
bng_en: implement ndo_get_stats64

Implement the ndo_get_stats64 callback to report aggregate network
statistics. The driver gathers these by accumulating the per-ring
counters into the provided rtnl_link_stats64 structure.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Link: https://patch.msgid.link/20260406180420.279470-9-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agobng_en: periodically fetch and accumulate hardware statistics
Bhargava Marreddy [Mon, 6 Apr 2026 18:04:17 +0000 (23:34 +0530)] 
bng_en: periodically fetch and accumulate hardware statistics

Use the timer to schedule periodic stats collection via
the workqueue when the link is up. Fetch fresh counters from
hardware via DMA and accumulate them into 64-bit software
shadows, handling wrap-around for counters narrower than
64 bits.
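
Wrap-around handling of this kind is commonly done with a masked delta.
The sketch below is a generic model and not the driver code; the
function name and the 48-bit width used in the usage note are
assumptions, since the counter widths are not stated here.

```python
MASK64 = (1 << 64) - 1


def accumulate(sw_total, prev_hw, cur_hw, width_bits):
    """Add the change in a narrow hardware counter to a 64-bit total."""
    mask = (1 << width_bits) - 1
    # Masking the difference gives the right delta even if the hardware
    # counter wrapped once between the two samples.
    delta = (cur_hw - prev_hw) & mask
    return (sw_total + delta) & MASK64
```

For example, a 48-bit counter that moves from 2**48 - 5 to 3 has
advanced by 8, and that is what gets added to the software shadow.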

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rahul Gupta <rahul-rg.gupta@broadcom.com>
Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
Link: https://patch.msgid.link/20260406180420.279470-8-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agobng_en: add HW stats infra and structured ethtool ops
Bhargava Marreddy [Mon, 6 Apr 2026 18:04:16 +0000 (23:34 +0530)] 
bng_en: add HW stats infra and structured ethtool ops

Implement the hardware-level statistics foundation and modern structured
ethtool operations.

1. Infrastructure: Add HWRM firmware wrappers (FUNC_QSTATS_EXT,
   PORT_QSTATS_EXT, and PORT_QSTATS) to query ring and port counters.
2. Structured ops: Implement .get_eth_phy_stats, .get_eth_mac_stats,
   .get_eth_ctrl_stats, .get_pause_stats, and .get_rmon_stats.

Stats are initially reported as 0; accumulation logic is added
in a subsequent patch.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Link: https://patch.msgid.link/20260406180420.279470-7-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agobng_en: add support for link async events
Bhargava Marreddy [Mon, 6 Apr 2026 18:04:15 +0000 (23:34 +0530)] 
bng_en: add support for link async events

Register for firmware asynchronous events, including link-status,
link-speed, and PHY configuration changes. Upon event reception,
re-query the PHY and update ethtool settings accordingly.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
Link: https://patch.msgid.link/20260406180420.279470-6-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agobng_en: implement ethtool pauseparam operations
Bhargava Marreddy [Mon, 6 Apr 2026 18:04:14 +0000 (23:34 +0530)] 
bng_en: implement ethtool pauseparam operations

Implement .get_pauseparam and .set_pauseparam to support flow control
configuration. This allows reporting and setting of autoneg, RX pause,
and TX pause states.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Link: https://patch.msgid.link/20260406180420.279470-5-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agobng_en: add ethtool link settings, get_link, and nway_reset
Bhargava Marreddy [Mon, 6 Apr 2026 18:04:13 +0000 (23:34 +0530)] 
bng_en: add ethtool link settings, get_link, and nway_reset

Add get/set_link_ksettings, get_link, and nway_reset support.
Report supported, advertised, and link-partner speeds across NRZ,
PAM4, and PAM4-112 signaling modes. Enable lane count reporting.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Link: https://patch.msgid.link/20260406180420.279470-4-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agobng_en: query PHY capabilities and report link status
Bhargava Marreddy [Mon, 6 Apr 2026 18:04:12 +0000 (23:34 +0530)] 
bng_en: query PHY capabilities and report link status

Query PHY capabilities and supported speeds from firmware,
retrieve current link state (speed, duplex, pause, FEC),
and log the information. Seed initial link state during probe.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
Link: https://patch.msgid.link/20260406180420.279470-3-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agobng_en: add per-PF workqueue, timer, and slow-path task
Bhargava Marreddy [Mon, 6 Apr 2026 18:04:11 +0000 (23:34 +0530)] 
bng_en: add per-PF workqueue, timer, and slow-path task

Add a dedicated single-thread workqueue and a timer for each PF
to drive deferred slow-path work such as link event handling and
stats collection. The timer is stopped via timer_delete_sync()
when interrupts are disabled and restarted on open.

While the close path stops the timer to prevent new tasks from
being scheduled, the sp_task and workqueue are preserved to
maintain state continuity. Final draining and destruction of
the workqueue are handled during PCI remove.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
Link: https://patch.msgid.link/20260406180420.279470-2-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agoMerge branch 'add-tso-map-once-dma-helpers-and-bnxt-sw-uso-support'
Jakub Kicinski [Sun, 12 Apr 2026 17:54:35 +0000 (10:54 -0700)] 
Merge branch 'add-tso-map-once-dma-helpers-and-bnxt-sw-uso-support'

Joe Damato says:

====================
Add TSO map-once DMA helpers and bnxt SW USO support

Greetings:

This series extends net/tso to add a data structure and some helpers allowing
drivers to DMA map headers and packet payloads a single time. The helpers can
then be used to reference slices of shared mapping for each segment. This
helps to avoid the cost of repeated DMA mappings, especially on systems which
use an IOMMU. N per-packet DMA maps are replaced with a single map for the
entire GSO skb. As of v3, the series uses the DMA IOVA API (as suggested by
Leon [1]) and provides a fallback path when an IOMMU is not in use. The DMA
IOVA API provides even better efficiency than the v2; see below.

The added helpers are then used in bnxt to add support for software UDP
Segmentation Offloading (SW USO) for older bnxt devices which do not have
support for USO in hardware. Since the helpers are generic, other drivers
can be extended similarly.

The v2 showed a ~4x reduction in DMA mapping calls at the same wire packet
rate on production traffic with a bnxt device. The v3, however, shows a larger
reduction of ~6x at the same wire packet rate. This is thanks to Leon's
suggestion of using the DMA IOVA API [1].

Special care is taken to make bnxt ethtool operations work correctly: the ring
size cannot be reduced below a minimum threshold while USO is enabled and
growing the ring automatically re-enables USO if it was previously blocked.

This v10 contains some cosmetic changes (wrapping long lines), moves the test
to the correct directory, and attempts to fix the slot availability check
added in the v9.

I re-ran the python test and the test passed on my bnxt system. I also ran
this on a production system.
====================

Link: https://patch.msgid.link/20260408230607.2019402-1-joe@dama.to
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agoselftests: drv-net: Add USO test
Joe Damato [Wed, 8 Apr 2026 23:05:59 +0000 (16:05 -0700)] 
selftests: drv-net: Add USO test

Add a simple test for USO. Tests both ipv4 and ipv6 with several full
segments and a partial segment.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260408230607.2019402-11-joe@dama.to
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: bnxt: Dispatch to SW USO
Joe Damato [Wed, 8 Apr 2026 23:05:58 +0000 (16:05 -0700)] 
net: bnxt: Dispatch to SW USO

Wire in the SW USO path added in preceding commits when hardware USO is
not possible.

When a GSO skb with SKB_GSO_UDP_L4 arrives and the NIC lacks HW USO
capability, redirect to bnxt_sw_udp_gso_xmit() which handles software
segmentation into individual UDP frames submitted directly to the TX
ring.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260408230607.2019402-10-joe@dama.to
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: bnxt: Add SW GSO completion and teardown support
Joe Damato [Wed, 8 Apr 2026 23:05:57 +0000 (16:05 -0700)] 
net: bnxt: Add SW GSO completion and teardown support

Update __bnxt_tx_int and bnxt_free_one_tx_ring_skbs to handle SW GSO
segments:

- MID segments: adjust tx_pkts/tx_bytes accounting and skip skb free
  (the skb is shared across all segments and freed only once)

- LAST segments: call tso_dma_map_complete() to tear down the IOVA
  mapping if one was used. On the fallback path, payload DMA unmapping
  is handled by the existing per-BD dma_unmap_len walk.

Both MID and LAST completions advance tx_inline_cons to release the
segment's inline header slot back to the ring.

is_sw_gso is initialized to zero, so the new code paths are not yet run.

Add logic for feature advertisement and guardrails for ring sizing.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260408230607.2019402-9-joe@dama.to
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: bnxt: Implement software USO
Joe Damato [Wed, 8 Apr 2026 23:05:56 +0000 (16:05 -0700)] 
net: bnxt: Implement software USO

Implement bnxt_sw_udp_gso_xmit() using the core tso_dma_map API and
the pre-allocated TX inline buffer for per-segment headers.

The xmit path:
1. Calls tso_start() to initialize TSO state
2. Stack-allocates a tso_dma_map and calls tso_dma_map_init() to
   DMA-map the linear payload and all frags upfront.
3. For each segment:
   - Copies and patches headers via tso_build_hdr() into the
     pre-allocated tx_inline_buf (DMA-synced per segment)
   - Counts payload BDs via tso_dma_map_count()
   - Emits long BD (header) + ext BD + payload BDs
   - Payload BDs use tso_dma_map_next() which yields (dma_addr,
     chunk_len, mapping_len) tuples.

Header BDs set dma_unmap_len=0 since the inline buffer is pre-allocated
and unmapped only at ring teardown.

Completion state is updated by calling tso_dma_map_completion_save() for
the last segment.
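
The per-segment loop can be pictured with a small planning function.
This is a sketch, not the bnxt code: plan_segments() is a hypothetical
helper that only shows how a payload splits into full segments plus a
possible partial one, and which segment would carry the LAST marker.

```python
def plan_segments(payload_len, gso_size):
    """Return one dict per segment: offset, length and MID/LAST marker."""
    if gso_size <= 0:
        raise ValueError("gso_size must be positive")
    segs = []
    off = 0
    while off < payload_len:
        n = min(gso_size, payload_len - off)   # last segment may be short
        off += n
        segs.append({
            "offset": off - n,
            "len": n,
            "marker": "LAST" if off == payload_len else "MID",
        })
    return segs
```

A 3500-byte payload with a 1000-byte segment size yields three full
segments and one 500-byte segment, and only the final one is marked
LAST.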

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260408230607.2019402-8-joe@dama.to
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: bnxt: Add boilerplate GSO code
Joe Damato [Wed, 8 Apr 2026 23:05:55 +0000 (16:05 -0700)] 
net: bnxt: Add boilerplate GSO code

Add bnxt_gso.c and bnxt_gso.h with a stub bnxt_sw_udp_gso_xmit()
function, SW USO constants (BNXT_SW_USO_MAX_SEGS,
BNXT_SW_USO_MAX_DESCS), and the is_sw_gso field in bnxt_sw_tx_bd
with BNXT_SW_GSO_MID/LAST markers.

The full SW USO implementation will be added in a future commit.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260408230607.2019402-7-joe@dama.to
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: bnxt: Add TX inline buffer infrastructure
Joe Damato [Wed, 8 Apr 2026 23:05:54 +0000 (16:05 -0700)] 
net: bnxt: Add TX inline buffer infrastructure

Add per-ring pre-allocated inline buffer fields (tx_inline_buf,
tx_inline_dma, tx_inline_size) to bnxt_tx_ring_info and helpers to
allocate and free them. A producer and consumer (tx_inline_prod,
tx_inline_cons) are added to track which slot(s) of the inline buffer
are in-use.
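
One way to picture the producer/consumer tracking is a fixed array of
slots with two free-running indices. InlineSlots is a hypothetical
model; the slot count, slot size and method names are assumptions and
not taken from the driver.

```python
class InlineSlots:
    """Hypothetical model of header slots tracked by prod/cons indices."""

    def __init__(self, nslots, slot_size):
        self.nslots = nslots
        self.slot_size = slot_size
        self.prod = 0   # slots handed out so far
        self.cons = 0   # slots released so far

    def in_use(self):
        return self.prod - self.cons

    def available(self):
        return self.nslots - self.in_use()

    def alloc(self):
        # Return the byte offset of the next free slot, or None if full.
        if self.available() == 0:
            return None
        off = (self.prod % self.nslots) * self.slot_size
        self.prod += 1
        return off

    def release(self):
        # Called once per completed segment, in submission order.
        if self.in_use() == 0:
            raise RuntimeError("release without a matching alloc")
        self.cons += 1
```

Because slots are released in the same order they were handed out, the
two indices are enough to know which part of the buffer is busy.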

The inline buffer will be used by the SW USO path for pre-allocated,
pre-DMA-mapped per-segment header copies. In the future, this
could be extended to support TX copybreak.

Allocation helper is marked __maybe_unused in this commit because it
will be wired in later.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260408230607.2019402-6-joe@dama.to
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: bnxt: Use dma_unmap_len for TX completion unmapping
Joe Damato [Wed, 8 Apr 2026 23:05:53 +0000 (16:05 -0700)] 
net: bnxt: Use dma_unmap_len for TX completion unmapping

Store the DMA mapping length in each TX buffer descriptor via
dma_unmap_len_set at submit time, and use dma_unmap_len at completion
time.

This is a no-op for normal packets but prepares for software USO,
where header BDs set dma_unmap_len to 0 because the header buffer
is unmapped collectively rather than per-segment.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260408230607.2019402-5-joe@dama.to
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: bnxt: Add a helper for tx_bd_ext
Joe Damato [Wed, 8 Apr 2026 23:05:52 +0000 (16:05 -0700)] 
net: bnxt: Add a helper for tx_bd_ext

Factor out some code to set up tx_bd_exts into a helper function. This
helper will be used by the SW USO implementation in the following commits.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260408230607.2019402-4-joe@dama.to
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: bnxt: Export bnxt_xmit_get_cfa_action
Joe Damato [Wed, 8 Apr 2026 23:05:51 +0000 (16:05 -0700)] 
net: bnxt: Export bnxt_xmit_get_cfa_action

Export bnxt_xmit_get_cfa_action so that it can be used in future commits
which add software USO support to bnxt.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260408230607.2019402-3-joe@dama.to
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: tso: Introduce tso_dma_map and helpers
Joe Damato [Wed, 8 Apr 2026 23:05:50 +0000 (16:05 -0700)] 
net: tso: Introduce tso_dma_map and helpers

Add struct tso_dma_map, for tracking DMA addresses of mapped GSO payload
data, and tso_dma_map_completion_state to tso.h.

The tso_dma_map combines DMA mapping storage with iterator state, allowing
drivers to walk pre-mapped DMA regions linearly. It includes fields for
the DMA IOVA path (iova_state, iova_offset, total_len) and a fallback
per-region path (linear_dma, frags[], frag_idx, offset).

The tso_dma_map_completion_state makes the IOVA completion state opaque
for drivers. Drivers are expected to allocate this and use the added
helpers to update the completion state.

Add skb_frag_phys() to skbuff.h, returning the physical address
of a paged fragment's data, which is used by the tso_dma_map helpers
introduced in this commit and described below.

The added TSO DMA map helpers are:

tso_dma_map_init(): DMA-maps the linear payload region and all frags
upfront. Prefers the DMA IOVA API for a single contiguous mapping with
one IOTLB sync; falls back to per-region dma_map_phys() otherwise.
Returns 0 on success, cleans up partial mappings on failure.

tso_dma_map_cleanup(): Handles both IOVA and fallback teardown paths.

tso_dma_map_count(): counts how many descriptors the next N bytes of
payload will need. Returns 1 if IOVA is used since the mapping is
contiguous.

tso_dma_map_next(): yields the next (dma_addr, chunk_len) pair.
On the IOVA path, each segment is a single contiguous chunk. On the
fallback path, indicates when a chunk starts a new DMA mapping so the
driver can set dma_unmap_len on that descriptor for completion-time
unmapping.

tso_dma_map_completion_save(): updates the completion state. Drivers
will call this at xmit time.

tso_dma_map_complete(): tears down the mapping at completion time and
returns true if the IOVA path was used. If it was not used, this is a
no-op and returns false.
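
The fallback (per-region) iteration described above can be sketched in userspace. This is a simplified model under stated assumptions, not the kernel implementation: struct region_walker, walker_count() and walker_next() are hypothetical names, the real helpers' signatures are not shown in this commit message, and the IOVA path (where the count is always 1) is left out. Region lengths stand in for the linear payload region and the frags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical iterator over pre-mapped regions (fallback path only). */
struct region_walker {
	const size_t *region_len;	/* length of each mapped region */
	size_t nregions;
	size_t idx;			/* current region */
	size_t offset;			/* bytes already consumed in it */
};

/* Count the descriptors the next len bytes need, without advancing. */
static size_t walker_count(const struct region_walker *w, size_t len)
{
	size_t idx = w->idx, offset = w->offset, count = 0;

	while (len > 0 && idx < w->nregions) {
		size_t avail = w->region_len[idx] - offset;
		size_t chunk = len < avail ? len : avail;

		if (chunk > 0) {
			count++;
			len -= chunk;
		}
		offset += chunk;
		if (offset == w->region_len[idx]) {
			idx++;		/* region exhausted, move on */
			offset = 0;
		}
	}
	return count;
}

/*
 * Yield the next chunk of at most want bytes. *starts_mapping is set when
 * the chunk begins a new region, which is where a driver would record the
 * unmap length for completion time. Returns false when nothing is left.
 */
static bool walker_next(struct region_walker *w, size_t want,
			size_t *chunk_len, bool *starts_mapping)
{
	size_t avail;

	assert(w != NULL && chunk_len != NULL && starts_mapping != NULL);

	/* Skip regions that are fully consumed (or empty). */
	while (w->idx < w->nregions && w->offset == w->region_len[w->idx]) {
		w->idx++;
		w->offset = 0;
	}
	if (want == 0 || w->idx == w->nregions)
		return false;

	avail = w->region_len[w->idx] - w->offset;
	*chunk_len = want < avail ? want : avail;
	*starts_mapping = (w->offset == 0);
	w->offset += *chunk_len;
	return true;
}
```

In this model a segment that straddles a region boundary needs one descriptor per region it touches, and only the first chunk taken from a region is flagged as starting a new mapping.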

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260408230607.2019402-2-joe@dama.to
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agoMerge tag 'wq-for-7.0-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 12 Apr 2026 17:42:40 +0000 (10:42 -0700)] 
Merge tag 'wq-for-7.0-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fix from Tejun Heo:
 "This is a fix for a stall which triggers on ordered workqueues when
  there are multiple inactive work items during workqueue property
  changes through sysfs, which doesn't happen that frequently.

  While really late, the fix is very low risk as it just repeats an
  operation which is already being performed:

   - Fix incomplete activation of multiple inactive works when
     unplugging a pool_workqueue, where the pending_pwqs list
     wasn't being updated for subsequent works"

* tag 'wq-for-7.0-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Add pool_workqueue to pending_pwqs list when unplugging multiple inactive works

2 days agoMerge tag 'timers-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 12 Apr 2026 17:01:55 +0000 (10:01 -0700)] 
Merge tag 'timers-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Thomas Gleixner:
 "Two fixes for the time/timers subsystem:

   - Invert the inverted fastpath decision in check_tick_dependency(),
     which prevents NOHZ full from stopping the tick. That's a regression
     introduced in the 7.0 merge window.

   - Prevent an unprivileged DoS in the clockevents code, where user
     space can starve the timer interrupt by arming a timerfd or posix
     interval timer in a tight loop with an absolute expiry time in the
     past. The fix turned out to be incomplete and was amended
     yesterday to make it work on some 20-year-old AMD machines as
     well. All issues with it have been confirmed to be resolved by
     various reporters"

* tag 'timers-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clockevents: Prevent timer interrupt starvation
  tick/nohz: Fix inverted return value in check_tick_dependency() fast path

2 days agovsock/virtio: remove unnecessary call to `virtio_transport_get_ops`
Luigi Leonardi [Wed, 8 Apr 2026 15:21:02 +0000 (17:21 +0200)] 
vsock/virtio: remove unnecessary call to `virtio_transport_get_ops`

`virtio_transport_send_pkt_info` gets all the transport information
from the parameter `t_ops`. There is no need to call
`virtio_transport_get_ops()`.

Remove it.

Acked-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20260408-remove_parameter-v2-1-e00f31cf7a17@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 days agonet: skb: clean up dead code after skb_kfree_head() simplification
Jiayuan Chen [Fri, 10 Apr 2026 03:47:32 +0000 (11:47 +0800)] 
net: skb: clean up dead code after skb_kfree_head() simplification

Since commit 0f42e3f4fe2a ("net: skb: fix cross-cache free of
KFENCE-allocated skb head"), skb_kfree_head() always calls kfree()
and no longer uses end_offset to distinguish between skb_small_head_cache
and generic kmalloc caches.

Clean up the leftovers:

- Remove the unused end_offset parameter from skb_kfree_head() and
  update all callers.
- Remove the SKB_SMALL_HEAD_HEADROOM guard in __skb_unclone_keeptruesize()
  which was protecting the old skb_kfree_head() logic.
- Update the SKB_SMALL_HEAD_CACHE_SIZE comment to reflect that the
  non-power-of-2 sizing is no longer used for free-path disambiguation.

No functional change.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260410034736.297900-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>