====================
selftests: net: improve error handling in passive TFO test
This series improves error handling in the passive TFO test by (1)
fixing a broken behavior when the child processes failed (or timed out),
and (2) adding more error handling code in the test program.
The first patch fixes the behavior that the test didn't report failure
even if the server or the client process exited with non-zero status.
The second patch adds error handling code in the test program to improve
reliability of the test.
====================
====================
net: phy: realtek: simplify and reunify C22/C45 drivers
The RTL8221B PHY variants (VB-CG and VM-CG) were previously split into
separate C22 and C45 driver instances to support copper SFP modules
using the RollBall MDIO-over-I2C protocol, which only supports Clause-45
access. However, this split created significant code duplication and
complexity.
Commit 8af2136e77989 ("net: phy: realtek: add helper
RTL822X_VND2_C22_REG") exposed that RealTek PHYs map all standard
Clause-22 registers into MDIO_MMD_VEND2 at offset 0xa400.
With commit 1850ec20d6e71 ("net: phy: realtek: use paged access for
MDIO_MMD_VEND2 in C22 mode") it is now possible to access all MMD
registers transparently, regardless of whether the PHY is accessed via
C22 or C45 MDIO.
Further improve the translation logic for this register mapping, so a
single unified driver works efficiently with both access methods,
reducing code duplication.
The series also includes cleanup to remove unnecessary paged operations
on registers that aren't actually affected by page selection.
Testing was done on RTL8211F and RTL8221B-VB-CG (the latter in both
C22 and C45 modes).
====================
Only registers 0x10~0x17 are affected by the value in the page
selection register 0x1f. Hence there is no point in using paged
operations when accessing any other registers.
Simplify the driver by using the normal phy_read and phy_write
operations for registers which are anyway not affected by paging.
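As a rough illustration of the rule above, a helper could decide whether an access needs page selection at all. This is a minimal userspace sketch, not driver code; `example_reg_is_paged()` and the `EXAMPLE_` constants are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants taken from the description above. */
#define EXAMPLE_PAGE_SELECT_REG	0x1f
#define EXAMPLE_PAGED_REG_FIRST	0x10
#define EXAMPLE_PAGED_REG_LAST	0x17

/* The page selection register itself lies outside the paged window. */
static_assert(EXAMPLE_PAGED_REG_LAST < EXAMPLE_PAGE_SELECT_REG,
	      "page select register must not be inside the paged range");

/* Returns true only for registers whose content depends on register 0x1f. */
static bool example_reg_is_paged(uint8_t regnum)
{
	return regnum >= EXAMPLE_PAGED_REG_FIRST &&
	       regnum <= EXAMPLE_PAGED_REG_LAST;
}
```

Under this model, only accesses for which the helper returns true would need the paged variants; everything else can use plain reads and writes.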
Turns out that register address RTL_VND2_PHYSR (0xa434) maps to
Clause-22 register MII_RESV2. Use that to get rid of yet another magic
number, and rename access macros accordingly.
Daniel Golle [Tue, 13 Jan 2026 03:44:25 +0000 (03:44 +0000)]
net: phy: realtek: reunify C22 and C45 drivers
Reunify the split C22/C45 drivers for the RTL8221B-VB-CG 2.5Gbps and
RTL8221B-VM-CG 2.5Gbps PHYs back into a single driver.
This is possible now by using all the driver operations previously used
by the C45 driver, as transparent access to all MMDs including
MDIO_MMD_VEND2 is now possible also over Clause-22 MDIO.
The unified driver will still only use Clause-45 access on any Clause-45
capable busses while still working fine on Clause-22 busses.
Daniel Golle [Tue, 13 Jan 2026 03:44:17 +0000 (03:44 +0000)]
net: phy: realtek: simplify C22 reg access via MDIO_MMD_VEND2
RealTek 2.5GE PHYs have all standard Clause-22 registers mapped also
inside MDIO_MMD_VEND2 at offset 0xa400. This is used mainly in case the
PHY is connected to a Clause-45-only bus. For example, the RTL8221B is
frequently used in copper SFP modules which use the RollBall
MDIO-over-I2C method, which *only* supports Clause-45.
In order to support using the PHY on Clause-45-only busses, the PHY
driver has previously been split into C22-only and C45-only instances,
creating quite a bit of redundancy and confusion.
In preparation of reunifying the two driver instances, add support for
translating MDIO_MMD_VEND2 registers 0xa400 to 0xa43c back to Clause-22
registers 0 to 30 in case the PHY is accessed on a Clause-22 bus.
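The translation described above can be sketched in plain C. This is only an illustration under one assumption: the stride of two addresses per Clause-22 register is inferred from the range endpoints (0xa43c - 0xa400 = 60 = 2 * 30) and is not spelled out in the text. All `example_` and `EXAMPLE_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_VND2_C22_BASE	0xa400u	/* offset stated in the commit message */
#define EXAMPLE_VND2_C22_LAST	0xa43cu	/* corresponds to Clause-22 register 30 */

/* Assumed layout: Clause-22 register N sits at 0xa400 + 2 * N. */
static uint16_t example_c22_to_vnd2(uint8_t c22_reg)
{
	assert(c22_reg <= 30);
	return (uint16_t)(EXAMPLE_VND2_C22_BASE + 2u * c22_reg);
}

/* Translate back; fails for addresses outside the window or odd addresses. */
static bool example_vnd2_to_c22(uint16_t vnd2_reg, uint8_t *c22_reg)
{
	if (vnd2_reg < EXAMPLE_VND2_C22_BASE || vnd2_reg > EXAMPLE_VND2_C22_LAST)
		return false;
	if (vnd2_reg & 1)
		return false;

	*c22_reg = (uint8_t)((vnd2_reg - EXAMPLE_VND2_C22_BASE) / 2);
	return true;
}
```

With this assumed stride, address 0xa434 translates to register 26 (0x1a), which is the value of MII_RESV2 in the kernel's MII register definitions.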
Daniel Golle [Tue, 13 Jan 2026 03:44:00 +0000 (03:44 +0000)]
net: phy: realtek: support interrupt also for C22 variants
Now that access to MDIO_MMD_VEND2 works transparently also in Clause-22
mode, add interrupt support also for the C22 variants of the
RTL8221B-VB-CG and RTL8221B-VM-CG. This results in the C22 and C45
driver instances now having all the same features implemented.
dwmac4's transmit performance dropped by a factor of four due to an
incorrect assumption about which definitions are for what. This
highlights the need for sane register macros.
Commit 8409495bf6c9 ("net: stmmac: cores: remove many xxx_SHIFT
definitions") changed the way the txpbl value is merged into the
register:
value = readl(ioaddr + DMA_CHAN_TX_CONTROL(dwmac4_addrs, chan));
- value = value | (txpbl << DMA_BUS_MODE_PBL_SHIFT);
+ value = value | FIELD_PREP(DMA_BUS_MODE_PBL, txpbl);
The assumption here was that DMA_BUS_MODE_PBL was the mask for
DMA_BUS_MODE_PBL_SHIFT, but this turns out not to be the case.
The field is actually six bits wide, bits 21:16, and is called
TXPBL.
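To see why a mask that is too narrow silently loses the value, consider a userspace stand-in for the "shift into place, then mask" operation. This is a sketch only: `example_field_prep()` and `EXAMPLE_GENMASK()` are hypothetical helpers, not the kernel macros, and `EXAMPLE_SINGLE_BIT` is an assumed one-bit mask at bit 16 chosen purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Contiguous 32-bit mask covering bits hi..lo. */
#define EXAMPLE_GENMASK(hi, lo) \
	((~UINT32_C(0) >> (31 - (hi))) & (~UINT32_C(0) << (lo)))

#define EXAMPLE_TXPBL_MASK	EXAMPLE_GENMASK(21, 16)	/* six-bit field from the text */
#define EXAMPLE_SINGLE_BIT	EXAMPLE_GENMASK(16, 16)	/* hypothetical one-bit mask */

/* Shift val up to the lowest set bit of mask, then drop anything outside it. */
static uint32_t example_field_prep(uint32_t mask, uint32_t val)
{
	unsigned int shift = 0;

	assert(mask != 0);
	while (!((mask >> shift) & 1))
		shift++;

	return (val << shift) & mask;
}
```

In this model a burst length of 32 survives the six-bit mask but is reduced to zero by the one-bit mask, because only bit 0 of the value fits into a single-bit field.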
What's even more confusing is, there turns out to be a PBLX8
single bit in the DMA_CHAN_CONTROL register (0x1100 for channel 0),
and DMA_BUS_MODE_PBL seems to be used for that. However, this bit
et al. was listed under a comment "/* DMA SYS Bus Mode bitmap */"
which is for register 0x1004.
Fix this up by adding an appropriately named field definition under
the DMA_CHAN_TX_CONTROL() register address definition.
Move the RPBL mask definition under DMA_CHAN_RX_CONTROL(), correctly
renaming it as well.
Also move the PBL bit definition under DMA_CHAN_CONTROL(), correctly
renaming it.
- ALE_VERSION_MAJOR/MINOR are no longer used following the transition to
regmaps in commit bbfc7e2b9ebe ("net: ethernet: ti: cpsw_ale: use
regfields for ALE registers")
- ALE_VERSION_IR3 is unused since entry mask bits are no longer
hardcoded with commit b5d31f294027 ("net: ethernet: ti: ale: optimize
ale entry mask bits configuartion")
- ALE_VERSION_IR4 has never been used since its introduction in commit
ca47130a744b ("net: netcp: ale: update to support unknown vlan
controls for NU switch")
net/sched: cake: avoid separate allocation of struct cake_sched_config
Paolo pointed out that we can avoid separately allocating struct
cake_sched_config even in the non-mq case, by embedding it into struct
cake_sched_data. This reduces the complexity of the logic that swaps the
pointers and frees the old value, at the cost of adding 56 bytes to the
latter. Since cake_sched_data is already almost 17k bytes, this seems
like a reasonable tradeoff.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Fixes: bc0ce2bad36c ("net/sched: sch_cake: Factor out config variables into separate struct")
Link: https://patch.msgid.link/20260113143157.2581680-1-toke@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Shahar Shitrit [Tue, 13 Jan 2026 10:08:03 +0000 (12:08 +0200)]
docs: tls: Enhance TLS resync async process documentation
Expand the tls-offload.rst documentation to provide a more detailed
explanation of the asynchronous resync process, including the role
of struct tls_offload_resync_async in managing resync requests on
the kernel side.
Also, add documentation for helper functions
tls_offload_rx_resync_async_request_start/ _end/ _cancel.
Thomas Weißschuh [Tue, 13 Jan 2026 07:44:17 +0000 (08:44 +0100)]
uapi: add INT_MAX and INT_MIN constants
Some UAPI headers use INT_MAX and INT_MIN. Currently they include
<limits.h> for their definitions, which introduces a problematic
dependency on libc.
Add custom, namespaced definitions of INT_MAX and INT_MIN using the
same values as the regular kernel code.
These definitions are not added to uapi/linux/limits.h, as that header
will conflict with libc definitions on some platforms.
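A libc-independent definition can be derived from unsigned arithmetic alone. The sketch below uses hypothetical `EXAMPLE_` names (the names chosen by the patch may differ) and includes `<limits.h>` only to cross-check the result, which a UAPI header would of course not do.

```c
#include <assert.h>
#include <limits.h>	/* only for cross-checking the sketch */

/* Hypothetical namespaced names; all bits set, shifted right once, as int. */
#define EXAMPLE_INT_MAX	((int)(~0U >> 1))
#define EXAMPLE_INT_MIN	(-EXAMPLE_INT_MAX - 1)

static_assert(EXAMPLE_INT_MAX == INT_MAX, "sketch must match libc INT_MAX");
static_assert(EXAMPLE_INT_MIN == INT_MIN, "sketch must match libc INT_MIN");
```

Writing the minimum as `-MAX - 1` avoids negating a literal that does not fit into `int`.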
net: usb: sr9700: remove code to drive nonexistent MII
This device does not have a MII, even though the driver
contains code to drive one (because it originated as a copy of the
dm9601 driver). It also only supports 10Mbps half-duplex
operation (the DM9601 registers to set the speed/duplex mode
are read-only). Remove all MII-related code and implement
sr9700_get_link_ksettings which returns hardcoded correct
information for the link speed and duplex mode. Also add
announcement of the link status like many other Ethernet
drivers have.
====================
net: pcs: rzn1-miic: Support configurable PHY_LINK polarity
This series adds support for configuring the active level of MIIC
PHY_LINK status signals on Renesas RZ/N1 and RZ/T2H/N2H platforms.
The MIIC block provides dedicated hardware PHY_LINK signals that indicate
EtherPHY link-up and link-down status independently of whether the MAC
(GMAC) or Ethernet switch (ETHSW) is used. While GMAC-based systems
typically obtain link state via MDIO and handle it in software, the
ETHSW relies on these PHY_LINK pins for both CPU-assisted operation and
switch-only forwarding paths that do not involve the host processor.
These hardware PHY_LINK signals are particularly important for use cases
requiring fast reaction to link-down events, such as redundancy protocols
including Device Level Ring (DLR). In such scenarios, relying solely on
software-based link detection introduces latency that can negatively
impact recovery time. The ETHSW therefore exposes PHY_LINK signals to
enable immediate hardware-level detection of cable or port failures.
Some systems require the PHY_LINK signal polarity to be configured as
active low rather than the default active high. This series introduces a
new DT property to describe the required polarity and adds corresponding
driver support to program the MIIC PHY_LINK register accordingly. The
configuration is accumulated during DT parsing and applied once hardware
initialization is complete, taking into account SoC-specific differences
between RZ/N1 and RZ/T2H/N2H.
====================
Lad Prabhakar [Mon, 12 Jan 2026 17:35:55 +0000 (17:35 +0000)]
net: pcs: rzn1-miic: Add PHY_LINK active-level configuration support
Add support to configure the active level of MIIC PHY_LINK status signals
on a per-converter basis using a DT property.
MIIC provides dedicated PHY_LINK signals that indicate EtherPHY link-up and
link-down status in hardware. These signals are required regardless of
whether GMAC or ETHSW is used. With GMAC, link state is retrieved via
MDC/MDIO and handled in software, while ETHSW relies on PHY_LINK pins for
both CPU-assisted operation and switch-only data paths that do not involve
the host.
Hardware PHY_LINK signals are also critical for fast reaction to link-down
events, for example when running redundancy protocols such as Device Level
Ring (DLR), where rapid detection of cable faults is required to switch to
an alternate path without software latency.
Parse the requested polarity from DT, accumulate the configuration during
probing, and apply it to the MIIC_PHY_LINK register once hardware
initialization is complete, when the registers can be safely modified.
Handle SoC-specific bit layout differences between RZ/N1 and RZ/T2H/N2H
within the driver.
Add the renesas,miic-phy-link-active-low property to allow configuring
the active level of phy_link status signals provided by the MIIC block.
EtherPHY link-up and link-down status is required as a hardware IP
feature independent of whether GMAC or ETHSW is used. With GMAC, link
state is retrieved via MDC/MDIO and handled in software. In contrast,
ETHSW exposes dedicated PHY_LINK pins that provide this information
directly in hardware.
These PHY_LINK signals are required not only for host-controlled traffic
but also for switch-only forwarding paths where frames are exchanged
between external nodes without CPU involvement. This is particularly
important for redundancy protocols such as DLR (Device Level Ring),
which depend on fast detection of link-down events caused by cable or
port failures. Handling such events purely in software introduces
latency, which is why ETHSW provides dedicated hardware PHY_LINK pins.
Matt Johnston [Tue, 13 Jan 2026 09:01:16 +0000 (17:01 +0800)]
mctp i2c: initialise event handler read bytes
Set a 0xff value for i2c reads of an mctp-i2c device. Otherwise reads
will return "val" from the i2c bus driver. For i2c-aspeed and
i2c-npcm7xx that is a stack uninitialised u8.
Tested with "i2ctransfer -y 1 r10@0x34" where 0x34 is a mctp-i2c
instance, now it returns all 0xff.
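The idea can be modelled outside the kernel as follows. This is a hedged sketch: the enum and `example_slave_cb()` are local, hypothetical stand-ins for an I2C slave event handler, and the real driver's write handling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the slave event types; names are hypothetical. */
enum example_i2c_event {
	EXAMPLE_READ_REQUESTED,
	EXAMPLE_READ_PROCESSED,
	EXAMPLE_WRITE_REQUESTED,
	EXAMPLE_WRITE_RECEIVED,
	EXAMPLE_STOP,
};

/* Event handler sketch: always hand the bus driver a defined byte on reads. */
static int example_slave_cb(enum example_i2c_event event, uint8_t *val)
{
	assert(val != NULL);

	switch (event) {
	case EXAMPLE_READ_REQUESTED:
	case EXAMPLE_READ_PROCESSED:
		*val = 0xff;	/* never echo the caller's uninitialised byte */
		break;
	default:
		break;		/* write handling omitted in this sketch */
	}

	return 0;
}
```

Whatever the caller's buffer contained beforehand, a read event leaves 0xff in it, while other events do not touch it.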
Pavan Chebbi [Tue, 13 Jan 2026 18:34:22 +0000 (10:34 -0800)]
bnxt_en: Fix build break on non-x86 platforms
Commit c470195b989fe added .getcrosststamp() interface where
the code uses boot_cpu_has() function which is available only
in x86 platforms. This fails the build on any other platform.
Since the interface is going to be supported only on x86 anyway,
we can simply compile out the entire support on non-x86 platforms.
Cover the .getcrosststamp support under CONFIG_X86.
Fixes: c470195b989f ("bnxt_en: Add PTP .getcrosststamp() interface to get device/host times")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202601111808.WnBJCuWI-lkp@intel.com
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20260113183422.508851-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Marco Crivellari [Tue, 13 Jan 2026 15:14:33 +0000 (16:14 +0100)]
hinic3: add WQ_PERCPU to alloc_workqueue users
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.
In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.
Jan Hoffmann [Tue, 13 Jan 2026 20:55:44 +0000 (21:55 +0100)]
net: phy: realtek: fix in-band capabilities for 2.5G PHYs
It looks like the configuration of in-band AN only affects SGMII, and it
is always disabled for 2500Base-X. Adjust the reported capabilities
accordingly.
This is based on testing using OpenWrt on Zyxel XGS1010-12 rev A1 with
RTL8226-CG, and Zyxel XGS1210-12 rev B1 with RTL8221B-VB-CG. On these
devices, 2500Base-X in-band AN is known to work with some SFP modules
(containing an unknown PHY). However, with the built-in Realtek PHYs,
no auto-negotiation takes place, irrespective of the configuration of
the PHY.
Fixes: 10fbd71fc5f9b ("net: phy: realtek: implement configuring in-band an")
Signed-off-by: Jan Hoffmann <jan@3e8.eu>
Reviewed-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/20260113205557.503409-1-jan@3e8.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
No user of PHY fixups unregisters these. IOW: The fixup unregistering
functions are unused and can be removed. Remove also documentation
for these functions. Whilst at it, remove also mentioning of
phy_register_fixup() from the Documentation, as this function has been
static since ea47e70e476f ("net: phy: remove fixup-related definitions
from phy.h which are not used outside phylib").
Fixup unregistering functions were added with f38e7a32ee4f
("phy: add phy fixup unregister functions") in 2016, and last user
was removed with 6782d06a47ad ("net: usb: lan78xx: Remove KSZ9031 PHY
fixup") in 2024.
The comments describing the RX/TX headers and status response use
a combination of 0- and 1-based indexing, leading to confusion. Correct
the numbering and make it consistent. Also fix a typo "pm" for "pn".
This issue also existed in dm9601 and was fixed in commit 61189c78bda8
("dm9601: trivial comment fixes").
Heiner Kallweit [Mon, 12 Jan 2026 20:11:04 +0000 (21:11 +0100)]
net: ethernet: dnet: remove driver
This legacy platform driver was used with some Qong board.
Support for this board was removed with
commit c93197b0041d ("ARM: imx: Remove i.MX31 board files")
in 2020. So remove this now orphaned driver.
Osose Itua [Wed, 7 Jan 2026 22:16:53 +0000 (17:16 -0500)]
net: phy: adin: enable configuration of the LP Termination Register
The ADIN1200/ADIN1300 provide a control bit that selects between normal
receive termination and the lowest common mode impedance for 100BASE-TX
operation. This behavior is controlled through the Low Power Termination
register (B_100_ZPTM_EN_DIMRX).
Bit 0 of this register enables normal termination when set (this is the
default), and selects the lowest common mode impedance when cleared.
Add "adi,low-cmode-impedance" boolean property which, when present,
configures the PHY for the lowest common-mode impedance on the receive
pair for 100BASE-TX operation by clearing the B_100_ZPTM_EN_DIMRX bit.
This is suited for capacitive coupled applications and other
applications where there may be a path for high common-mode noise to
reach the PHY.
If this value is not present, the value of the bit by default is 1,
which is normal termination (zero-power termination) mode.
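The effect of the property on the register value can be sketched as a pure function. This is an illustration only: `example_lp_term_value()` is a hypothetical helper, and leaving the register untouched when the property is absent is an assumption of this sketch rather than something the text states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of the Low Power Termination register, per the description above. */
#define EXAMPLE_B_100_ZPTM_EN_DIMRX	0x0001u

/*
 * Returns the register value to write back. With the property present, the
 * enable bit is cleared to select the lowest common-mode impedance.
 * Without it, the value is returned unchanged (assumption of this sketch).
 */
static uint16_t example_lp_term_value(uint16_t reg, bool low_cmode_impedance)
{
	if (low_cmode_impedance)
		return (uint16_t)(reg & ~EXAMPLE_B_100_ZPTM_EN_DIMRX);

	return reg;
}
```

Only bit 0 is affected in either case; any other bits in the register pass through unmodified.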
Jakub Kicinski [Fri, 16 Jan 2026 03:14:28 +0000 (19:14 -0800)]
Merge tag 'phy_common_properties' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy
Vinod Koul says:
====================
phy common properties
Introduce "rx-polarity" and "tx-polarity" device tree properties
with Kunit tests (from Vladimir Oltean).
* tag 'phy_common_properties' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy:
phy: add phy_get_rx_polarity() and phy_get_tx_polarity()
dt-bindings: phy-common-props: RX and TX lane polarity inversion
dt-bindings: phy-common-props: ensure protocol-names are unique
dt-bindings: phy-common-props: create a reusable "protocol-names" definition
dt-bindings: phy: rename transmit-amplitude.yaml to phy-common-props.yaml
====================
Document that the 'len' field in ethtool_gstrings and 'n_stats' field in
ethtool_stats optionally serve dual purposes: on entry they specify the
number of items requested, and on return they indicate the number
actually returned (which is not necessarily the same).
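The in/out convention can be illustrated with a small mock. `struct example_stats_req` and `example_get_stats()` are hypothetical stand-ins, not the ethtool UAPI, and the sketch models only the case where the field is used in both directions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_MAX_STATS 8

/* Hypothetical stand-in for a request structure such as ethtool_stats. */
struct example_stats_req {
	uint32_t n_stats;	/* in: items requested, out: items returned */
	uint64_t data[EXAMPLE_MAX_STATS];
};

/*
 * Mock provider: copies no more than the caller asked for and no more than
 * it has, then writes the number actually copied back into n_stats.
 */
static void example_get_stats(struct example_stats_req *req,
			      const uint64_t *src, uint32_t available)
{
	uint32_t n, i;

	assert(req != NULL && src != NULL);

	n = req->n_stats;
	if (n > available)
		n = available;
	if (n > EXAMPLE_MAX_STATS)
		n = EXAMPLE_MAX_STATS;

	for (i = 0; i < n; i++)
		req->data[i] = src[i];

	req->n_stats = n;
}
```

A caller that requests five items from a provider holding three would therefore read back a count of three and must not look at entries beyond that.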
- ip6_tunnel: use skb_vlan_inet_prepare() in __ip6_tnl_rcv()
- bluetooth: hci_sync: enable PA sync lost event
- eth: virtio-net:
- fix the deadlock when disabling rx NAPI
- fix misalignment bug in struct virtnet_info
Previous releases - always broken:
- ipv4: ip_gre: make ipgre_header() robust
- can: fix SSP_SRC in cases when bit-rate is higher than 1 MBit.
- eth:
- mlx5e: profile change fix
- octeon_ep_vf: fix free_irq dev_id mismatch in IRQ rollback
- macvlan: fix possible UAF in macvlan_forward_source()"
* tag 'net-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
virtio_net: Fix misalignment bug in struct virtnet_info
net: can: j1939: j1939_xtp_rx_rts_session_active(): deactivate session upon receiving the second rts
can: raw: instantly reject disabled CAN frames
can: propagate CAN device capabilities via ml_priv
Revert "can: raw: instantly reject unsupported CAN frames"
net/sched: sch_qfq: do not free existing class in qfq_change_class()
selftests: drv-net: fix RPS mask handling for high CPU numbers
selftests: drv-net: fix RPS mask handling in toeplitz test
ipv6: Fix use-after-free in inet6_addr_del().
dst: fix races in rt6_uncached_list_del() and rt_del_uncached_list()
net: hv_netvsc: reject RSS hash key programming without RX indirection table
tools: ynl: render event op docs correctly
net: add net.core.qdisc_max_burst
net: airoha: Fix typo in airoha_ppe_setup_tc_block_cb definition
net: phy: motorcomm: fix duplex setting error for phy leds
net: octeon_ep_vf: fix free_irq dev_id mismatch in IRQ rollback
net/mlx5e: Restore destroying state bit after profile cleanup
net/mlx5e: Pass netdev to mlx5e_destroy_netdev instead of priv
net/mlx5e: Don't store mlx5e_priv in mlx5e_dev devlink priv
net/mlx5e: Fix crash on profile change rollback failure
...
Paolo Abeni [Thu, 15 Jan 2026 12:13:01 +0000 (13:13 +0100)]
Merge tag 'linux-can-fixes-for-6.19-20260115' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2026-01-15
this is a pull request of 4 patches for net/main, it supersedes the
"can 2026-01-14" pull request. The dev refcount leak in patch #3 is
fixed.
The first 3 patches are by Oliver Hartkopp and revert the approach to
instantly reject unsupported CAN frames introduced in
net-next-for-v6.19 and replace it by placing the needed data into the
CAN specific ml_priv.
The last patch is by Tetsuo Handa and fixes a J1939 refcount leak for
j1939_session in session deactivation upon receiving the second RTS.
* tag 'linux-can-fixes-for-6.19-20260115' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
net: can: j1939: j1939_xtp_rx_rts_session_active(): deactivate session upon receiving the second rts
can: raw: instantly reject disabled CAN frames
can: propagate CAN device capabilities via ml_priv
Revert "can: raw: instantly reject unsupported CAN frames"
====================
1) Fix inner mode lookup in tunnel mode GSO segmentation.
The protocol was taken from the wrong field.
2) Set ipv4 no_pmtu_disc flag only on output SAs. The
insertion of input SAs can fail if no_pmtu_disc
is set.
Please pull or let me know if there are problems.
ipsec-2026-01-14
* tag 'ipsec-2026-01-14' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: set ipv4 no_pmtu_disc flag only on output sa when direction is set
xfrm: Fix inner mode lookup in tunnel mode GSO segmentation
====================
====================
net/mlx5: HWS single flow counter support
This small series refactors the flow counter bulk initialization code
and extends it so that single flow counters are also usable by hardware
steering (HWS) rules.
Patches 1-2 refactor the bulk init path: first by factoring out common
flow counter bulk initialization into mlx5_fc_bulk_init(), then by
splitting the bitmap allocation into mlx5_fs_bulk_bitmap_alloc(), with
no functional changes.
Patch 3 initializes bulk data for counters allocated via
mlx5_fc_single_alloc(), so they can be safely used by HWS rules.
====================
Mark Bloch [Mon, 12 Jan 2026 09:40:24 +0000 (11:40 +0200)]
net/mlx5: fs, split bulk init
Refactor mlx5_fs_bulk_init() by moving bitmap allocation logic into a
new helper function mlx5_fs_bulk_bitmap_alloc(). This change does not
alter any logic.
Mark Bloch [Mon, 12 Jan 2026 09:40:23 +0000 (11:40 +0200)]
net/mlx5: fs, factor out flow counter bulk init
Add mlx5_fc_bulk_init() to handle bulk initialization of flow counters.
This change does not alter any logic, but refactors the code to remove
duplicate initialization logic by centralizing it in a single function.
====================
Introduce and use netif_xmit_timeout_ms() helper
This is V2, find V1 here:
https://lore.kernel.org/all/1764054776-1308696-1-git-send-email-tariqt@nvidia.com/
This series by Shahar introduces a new helper function
netif_xmit_timeout_ms() to check if a TX queue has timed out and report
the timeout duration.
It also encapsulates the check for whether the TX queue is stopped.
Replace duplicated open-coded timeout check in hns3 driver with the new
helper.
For mlx5e, refine the TX timeout recovery flow to act only on SQs whose
transmit timestamp indicates an actual timeout, as determined by the
helper. This prevents unnecessary channel reopen events caused by
attempting recovery on queues that are merely stopped but not truly
timed out.
====================
Shahar Shitrit [Mon, 12 Jan 2026 09:16:23 +0000 (11:16 +0200)]
net/mlx5e: Refine TX timeout handling to skip non-timed-out SQ
mlx5e_tx_timeout_work() is invoked when the dev_watchdog reports a
timed-out TX queue. Currently, the recovery flow is triggered for all
stopped SQs, which is not always correct — some SQs may be temporarily
stopped without actually timing out. Attempting to recover such SQs
results in no EQE being polled (since no real timeout occurred), which
the driver misinterprets as a recovery failure, unnecessarily causing
channel reopening.
Improve the logic to initiate recovery only for SQs that are both
stopped and timed out. Utilize the helper introduced in the previous
patch to determine whether the netdevice watchdog timeout period has
elapsed since the SQ’s last transmit timestamp.
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Yael Chemla <ychemla@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1768209383-1546791-4-git-send-email-tariqt@nvidia.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Shahar Shitrit [Mon, 12 Jan 2026 09:16:21 +0000 (11:16 +0200)]
net: Introduce netif_xmit_timeout_ms() helper
Introduce a new helper function netif_xmit_timeout_ms() to check
if a TX queue is stopped and has timed out and report the timeout
duration. This makes the timeout logic reusable, and will be used
in several places in subsequent patches.
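The semantics described above (zero unless the queue is both stopped and past its watchdog period, otherwise the elapsed time) can be modelled in userspace. This is a sketch under stated assumptions: `struct example_txq` and `example_xmit_timeout_ms()` are hypothetical, the model uses plain millisecond counters rather than the kernel's time base, and counter wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of a TX queue; all times in milliseconds. */
struct example_txq {
	bool stopped;			/* queue currently stopped? */
	uint64_t trans_start_ms;	/* time of the last transmit */
	uint64_t watchdog_timeo_ms;	/* watchdog period */
};

/*
 * Returns 0 if the queue is running or still within its watchdog period,
 * otherwise the time elapsed since the last transmit.
 */
static uint64_t example_xmit_timeout_ms(const struct example_txq *q,
					uint64_t now_ms)
{
	assert(q != NULL);

	if (!q->stopped)
		return 0;
	if (now_ms <= q->trans_start_ms + q->watchdog_timeo_ms)
		return 0;

	return now_ms - q->trans_start_ms;
}
```

A caller can then treat a non-zero return value as "this queue really timed out" and skip queues that are merely stopped.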
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Yael Chemla <ychemla@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1768209383-1546791-2-git-send-email-tariqt@nvidia.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
virtio_net: Fix misalignment bug in struct virtnet_info
Use the new TRAILING_OVERLAP() helper to fix a misalignment bug
along with the following warning:
drivers/net/virtio_net.c:429:46: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
This helper creates a union between a flexible-array member (FAM)
and a set of members that would otherwise follow it (in this case
`u8 rss_hash_key_data[VIRTIO_NET_RSS_MAX_KEY_SIZE];`). This
overlays the trailing members (rss_hash_key_data) onto the FAM
(hash_key_data) while keeping the FAM and the start of MEMBERS aligned.
The static_assert() ensures this alignment remains.
Notice that due to tail padding in flexible `struct
virtio_net_rss_config_trailer`, `rss_trailer.hash_key_data`
(at offset 83 in struct virtnet_info) and `rss_hash_key_data` (at
offset 84 in struct virtnet_info) are misaligned by one byte. See
below:
As a result, the RSS key passed to the device is shifted by 1
byte: the last byte is cut off, and instead a (possibly
uninitialized) byte is added at the beginning.
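The one-byte shift comes from tail padding: a member placed after a struct starts at `sizeof` of that struct, while the flexible array starts at its own `offsetof`. The sketch below measures that gap for a simplified, hypothetical trailer layout (`struct example_trailer`); it does not reproduce struct virtnet_info or the TRAILING_OVERLAP() helper itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical trailer ending in a flexible-array member. */
struct example_trailer {
	uint16_t max_tx_vq;
	uint8_t hash_key_length;
	uint8_t hash_key_data[];	/* flexible-array member (FAM) */
};

/* Where the FAM really starts versus where a following member would start. */
enum {
	EXAMPLE_FAM_OFFSET = offsetof(struct example_trailer, hash_key_data),
	EXAMPLE_NEXT_OFFSET = sizeof(struct example_trailer),
};

static_assert(EXAMPLE_FAM_OFFSET <= EXAMPLE_NEXT_OFFSET,
	      "the FAM cannot start past the end of the struct");

/* Number of tail-padding bytes separating the two positions. */
static size_t example_fam_gap(void)
{
	return (size_t)EXAMPLE_NEXT_OFFSET - (size_t)EXAMPLE_FAM_OFFSET;
}
```

On common ABIs where `uint16_t` is 2-byte aligned, the struct is padded from 3 to 4 bytes, so the gap is one byte. Overlaying the key storage onto the FAM, instead of placing it after the struct, removes that gap.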
As a last note `struct virtio_net_rss_config_hdr *rss_hdr;` is also
moved to the end, since it seems those three members should stick
around together. :)
Cc: stable@vger.kernel.org
Fixes: ed3100e90d0d ("virtio_net: Use new RSS config structs")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/aWIItWq5dV9XTTCJ@kspp
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jason Xing [Sun, 4 Jan 2026 01:21:25 +0000 (09:21 +0800)]
xsk: move cq_cached_prod_lock to avoid touching a cacheline in sending path
We (Paolo and I) noticed that touching an extra cacheline in the sending
path due to cq_cached_prod_lock hurts performance. After moving the lock
from struct xsk_buff_pool to struct xsk_queue, performance improves by
~5%, as can be observed with xdpsock.
An alternative approach [1] would be to use atomic_try_cmpxchg() to get the
same effect. Unfortunately, I don't have conclusive performance numbers to
prove that the atomic approach is better than the current patch. Its
advantage is reduced contention among multiple xsks sharing the same pool,
while its disadvantage is that the code becomes harder to maintain. The
full discussion can be found at the following link.
Jason Xing [Sun, 4 Jan 2026 01:21:24 +0000 (09:21 +0800)]
xsk: advance cq/fq check when shared umem is used
In shared umem mode with different queues or devices, an uninitialized
cq or fq is not allowed; this was previously checked in
xp_assign_dev_shared(). Move the check to the beginning so that 1) a few
memory allocations and other setup steps are avoided if cq or fq is
NULL, and 2) it serves as preparation for the next patch in the series.
Tetsuo Handa [Tue, 13 Jan 2026 15:28:47 +0000 (00:28 +0900)]
net: can: j1939: j1939_xtp_rx_rts_session_active(): deactivate session upon receiving the second rts
Since j1939_session_deactivate_activate_next() in j1939_tp_rxtimer() is
called only when the timer is enabled, we need to call
j1939_session_deactivate_activate_next() if we cancelled the timer.
Otherwise, refcount for j1939_session leaks, which will later appear as
| unregister_netdevice: waiting for vcan0 to become free. Usage count = 2.
The reverted patch was accessing CAN device internal data structures
from the network layer because it needs to know about the CAN protocol
capabilities of the CAN devices.
This data access caused build problems between the CAN network and the
CAN driver layer which introduced unwanted Kconfig dependencies and fixes.
The patches 2 & 3 implement a better approach which makes use of the
CAN specific ml_priv data which is accessible from both sides.
With this change the CAN network layer can check the required features
and the decoupling of the driver layer and network layer is restored.
Oliver Hartkopp [Fri, 9 Jan 2026 14:41:35 +0000 (15:41 +0100)]
can: raw: instantly reject disabled CAN frames
For real CAN interfaces the CAN_CTRLMODE_FD and CAN_CTRLMODE_XL control
modes indicate whether an interface can handle those CAN FD/XL frames.
In the case a CAN XL interface is configured in CANXL-only mode with
disabled error-signalling, neither CAN CC nor CAN FD frames can be sent.
The checks are now performed on CAN_RAW sockets to give instant feedback
to the user when writing unsupported CAN frames to the interface or when
the CAN interface is in read-only mode.
Oliver Hartkopp [Fri, 9 Jan 2026 14:41:34 +0000 (15:41 +0100)]
can: propagate CAN device capabilities via ml_priv
Commit 1a620a723853 ("can: raw: instantly reject unsupported CAN frames")
caused a sequence of dependency and linker fixes.
Instead of accessing CAN device internal data structures which caused the
dependency problems this patch introduces capability information into the
CAN specific ml_priv data which is accessible from both sides.
With this change the CAN network layer can check the required features and
the decoupling of the driver layer and network layer is restored.
Fixes: 1a620a723853 ("can: raw: instantly reject unsupported CAN frames")
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Vincent Mailhol <mailhol@kernel.org>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260109144135.8495-3-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The entire problem was caused by the requirement that a new network layer
feature needed to know about the protocol capabilities of the CAN devices.
Instead of accessing CAN device internal data structures which caused the
dependency problems a better approach has been developed which makes use of
CAN specific ml_priv data which is accessible from both sides.
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Vincent Mailhol <mailhol@kernel.org>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260109144135.8495-2-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Linus Torvalds [Wed, 14 Jan 2026 19:24:38 +0000 (11:24 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Only one core change (and one in doc only) the rest are drivers.
The one core fix is for some inline encrypting drives that can't
handle encryption requests on non-data commands (like error handling
ones); it saves the request level encryption parameters in the eh_save
structure so they can be cleared for error handling and restored after
it is completed"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: host: mediatek: Make read-only array scale_us static const
scsi: bfa: Update outdated comment
scsi: mpt3sas: Update maintainer list
scsi: ufs: core: Configure MCQ after link startup
scsi: core: Fix error handler encryption support
scsi: core: Correct documentation for scsi_test_unit_ready()
scsi: ufs: dt-bindings: Fix several grammar errors
Linus Torvalds [Wed, 14 Jan 2026 16:18:01 +0000 (08:18 -0800)]
Merge tag 'media/v6.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
- ov02c10: some fixes related to preserving bayer pattern and
horizontal control
- ipu-bridge: Add quirks for some Dell XPS laptops with inverted
sensors
- mali-c55: Fix version identifier logic
- rzg2l-cru: csi-2: fix RZ/V2H input sizes on some variants
* tag 'media/v6.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: ov02c10: Remove unnecessary hflip and vflip pointers
media: ipu-bridge: Add DMI quirk for Dell XPS laptops with upside down sensors
media: ov02c10: Fix the horizontal flip control
media: ov02c10: Adjust x-win/y-win when changing flipping to preserve bayer-pattern
media: ov02c10: Fix bayer-pattern change after default vflip change
media: rzg2l-cru: csi-2: Support RZ/V2H input sizes
media: uapi: mali-c55-config: Remove version identifier
media: mali-c55: Remove duplicated version check
media: Documentation: mali-c55: Use v4l2-isp version identifier
Vladimir Oltean [Sun, 11 Jan 2026 09:39:34 +0000 (11:39 +0200)]
phy: add phy_get_rx_polarity() and phy_get_tx_polarity()
Add helpers in the generic PHY folder which can be enabled with 'select
PHY_COMMON_PROPS' in Kconfig, without otherwise needing to
enable GENERIC_PHY.
These helpers need to deal with the slight messiness of the fact that
the polarity properties are arrays per protocol, and with the fact that
there is no default value mandated by the standard properties, all
default values depend on driver and protocol (PHY_POL_NORMAL may be a
good default for SGMII, whereas PHY_POL_AUTO may be a good default for
PCIe).
Push the supported mask of polarities to these helpers, to simplify
drivers such that they don't need to validate what's in the device tree
(or other firmware description).
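As a rough user-space sketch of the behaviour described above (per-protocol lookup, a caller-supplied default, and a caller-supplied mask of supported polarities), one could imagine something like the following. Everything here is hypothetical: struct sketch_pol_desc, sketch_get_polarity() and the numeric values of the PHY_POL_* constants are assumptions made for illustration, not the helpers' actual signature or the real binding header.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Numeric values assumed for illustration only. */
enum { PHY_POL_NORMAL = 0, PHY_POL_INVERT = 1, PHY_POL_AUTO = 2 };

/* Hypothetical stand-in for the firmware description: one polarity
 * value per protocol name, stored as two parallel arrays. */
struct sketch_pol_desc {
	const char *const *names;
	const unsigned int *vals;
	size_t count;
};

/* Look up the polarity for 'proto', fall back to the caller's default
 * when the protocol is not described, and reject any value that is not
 * in the caller's 'supported' bitmask. */
int sketch_get_polarity(const struct sketch_pol_desc *d, const char *proto,
			unsigned int supported, unsigned int def,
			unsigned int *out)
{
	unsigned int val = def;

	for (size_t i = 0; i < d->count; i++) {
		if (!strcmp(d->names[i], proto)) {
			val = d->vals[i];
			break;
		}
	}
	if (val >= 32 || !(supported & (1u << val)))
		return -EOPNOTSUPP;
	*out = val;
	return 0;
}
```

With this shape, a driver that cannot do automatic detection simply leaves PHY_POL_AUTO out of the mask and does not need its own validation code.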
Add a KUnit test suite to make sure that the API produces the expected
results. The fact that we use fwnode structures means we can validate
with software nodes, and as opposed to the device_property API, we can
bypass the need to have a device structure.
Vladimir Oltean [Sun, 11 Jan 2026 09:39:33 +0000 (11:39 +0200)]
dt-bindings: phy-common-props: RX and TX lane polarity inversion
Differential signaling is a technique for high-speed protocols to be
more resilient to noise. At the transmit side we have a positive and a
negative signal which are mirror images of each other. At the receiver,
if we subtract the negative signal (say of amplitude -A) from the
positive signal (say +A), we recover the original single-ended signal at
twice its original amplitude. But any noise, like one coming from EMI
from outside sources, is supposed to have an almost equal impact upon
the positive (A + E, E being for "error") and negative signal (-A + E).
So (A + E) - (-A + E) eliminates this noise, and this is what makes
differential signaling useful.
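The arithmetic above can be checked with a toy integer model. This is only a sketch: the function names are made up for illustration, and real signals are of course analog. The last function models the two legs being swapped on the board, which yields the negated signal.

```c
/* Receiver model: subtract the negative leg from the positive leg. */
int sketch_diff_receive(int pos, int neg)
{
	return pos - neg;
}

/* Drive both legs with amplitude a and add the same noise e to each. */
int sketch_diff_link(int a, int e)
{
	int pos = a + e;	/* positive leg:  A + E */
	int neg = -a + e;	/* negative leg: -A + E */

	return sketch_diff_receive(pos, neg);	/* (A + E) - (-A + E) = 2A */
}

/* Same link, but with the two legs swapped between the chips. */
int sketch_diff_link_swapped(int a, int e)
{
	int pos = -a + e;
	int neg = a + e;

	return sketch_diff_receive(pos, neg);	/* -2A: inverted polarity */
}
```

For A = 5 and E = 3, sketch_diff_link() returns 10 regardless of E, while the swapped variant returns -10.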
Except that in order to work, there must be strict requirements observed
during PCB design and layout, like the signal traces needing to have the
same length and be physically close to each other, and many others.
Sometimes it is not easy to fulfill all these requirements. A simple
case to understand is when on chip A's pins, the positive pin is on the
left and the negative is on the right, but on the chip B's pins (with
which A tries to communicate), positive is on the right and negative on
the left. The signals would need to cross, using vias and other ugly
stuff that affects signal integrity (introduces impedance
discontinuities which cause reflections, etc).
So sometimes, board designers intentionally connect differential lanes
the wrong way, and expect somebody else to invert that signal to recover
useful data. This is where RX and TX polarity inversion comes in as a
generic concept that applies to any high-speed serial protocol as long
as it uses differential signaling.
I've stopped two attempts to introduce more vendor-specific descriptions
of this only in the past month:
https://lore.kernel.org/linux-phy/20251110110536.2596490-1-horatiu.vultur@microchip.com/
https://lore.kernel.org/netdev/20251028000959.3kiac5kwo5pcl4ft@skbuf/
and in the kernel we already have merged:
- "st,px_rx_pol_inv"
- "st,pcie-tx-pol-inv"
- "st,sata-tx-pol-inv"
- "mediatek,pnswap"
- "airoha,pnswap-rx"
- "airoha,pnswap-tx"
and maybe more. So it is pretty general.
One additional element of complexity is introduced by the fact that for
some protocols, receivers can automatically detect and correct for an
inverted lane polarity (example: the PCIe LTSSM does this in the
Polling.Configuration state; the USB 3.1 Link Layer Test Specification
says that the detection and correction of the lane polarity inversion in
SuperSpeed operation shall be enabled in Polling.RxEQ.). Whereas for
other protocols (SGMII, SATA, 10GBase-R, etc etc), the polarity is all
manual and there is no detection mechanism mandated by their respective
standards.
So why would one even describe rx-polarity and tx-polarity for protocols
like PCIe, if it had to always be PHY_POL_AUTO?
Related question: why would we define the polarity as an array per
protocol? Isn't the physical PCB layout protocol-agnostic, and aren't we
describing the same physical reality from the lens of different protocols?
The answer to both questions is that multi-protocol PHYs exist
(supporting e.g. USB2 and USB3, or SATA and PCIe, or PCIe and Ethernet
over the same lane), so one would need to manually set the polarity for
SATA/Ethernet, while leaving it at auto for PCIe/USB 3.0+.
I also investigated from another angle: what if polarity inversion in
the PHY is one layer, and then the PCIe/USB3 LTSSM polarity detection is
another layer on top? Then rx-polarity = <PHY_POL_AUTO> doesn't make
sense, it can still be rx-polarity = <PHY_POL_NORMAL> or <PHY_POL_INVERT>,
and the link training state machine figures things out on top of that.
This would radically simplify the design, as the elimination of
PHY_POL_AUTO inherently means that the need for a property array per
protocol also goes away.
I don't know how things are in the general case, but at least in the 10G
and 28G Lynx SerDes blocks from NXP Layerscape devices, this isn't the
case, and there's only a single level of RX polarity inversion: in the
SerDes lane. In the case of PCIe, the controller is in charge of driving
the RDAT_INV bit autonomously, and it is read-only to software.
So the existence of this kind of SerDes lane proves the need for
PHY_POL_AUTO to be a third state.
Vladimir Oltean [Sun, 11 Jan 2026 09:39:32 +0000 (11:39 +0200)]
dt-bindings: phy-common-props: ensure protocol-names are unique
Rob Herring points out that "The default for .*-names is the entries
don't have to be unique.":
https://lore.kernel.org/linux-phy/20251204155219.GA1533839-robh@kernel.org/
Let's use uniqueItems: true to make sure the schema enforces this. It
doesn't make sense in this case to have duplicate properties for the
same SerDes protocol.
Note that this can only be done with the $defs + $ref pattern as
established by the previous commit. When the tx-p2p-microvolt-names
constraints were expressed directly under "properties", it would have
been validated by the string-array meta-schema, which does not support
the 'uniqueItems' keyword as can be seen below.
properties:tx-p2p-microvolt-names: Additional properties are not allowed ('uniqueItems' was unexpected)
from schema $id: http://devicetree.org/meta-schemas/string-array.yaml
Vladimir Oltean [Sun, 11 Jan 2026 09:39:31 +0000 (11:39 +0200)]
dt-bindings: phy-common-props: create a reusable "protocol-names" definition
Properties other than tx-p2p-microvolt-names also need to be defined
per protocol. Create a common definition to avoid copying a 55-line
property.
Vladimir Oltean [Sun, 11 Jan 2026 09:39:30 +0000 (11:39 +0200)]
dt-bindings: phy: rename transmit-amplitude.yaml to phy-common-props.yaml
I would like to add more properties similar to tx-p2p-microvolt, and I
don't think it makes sense to create one schema for each such property
(transmit-amplitude.yaml, lane-polarity.yaml, transmit-equalization.yaml
etc).
Instead, let's rename to phy-common-props.yaml, which makes it a more
adequate host schema for all the above properties.
Linus Torvalds [Wed, 14 Jan 2026 05:21:13 +0000 (21:21 -0800)]
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix incorrect usage of BPF_TRAMP_F_ORIG_STACK in riscv JIT (Menglong
Dong)
- Fix reference count leak in bpf_prog_test_run_xdp() (Tetsuo Handa)
- Fix metadata size check in bpf_test_run() (Toke Høiland-Jørgensen)
- Check that BPF insn array is not allowed as a map for const strings
(Deepanshu Kartikey)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Fix reference count leak in bpf_prog_test_run_xdp()
bpf: Reject BPF_MAP_TYPE_INSN_ARRAY in check_reg_const_str()
selftests/bpf: Update xdp_context_test_run test to check maximum metadata size
bpf, test_run: Subtract size of xdp_frame from allowed metadata size
riscv, bpf: Fix incorrect usage of BPF_TRAMP_F_ORIG_STACK
Alice Ryhl [Mon, 5 Jan 2026 10:44:06 +0000 (10:44 +0000)]
rust: bitops: fix missing _find_* functions on 32-bit ARM
On 32-bit ARM, you may encounter linker errors such as this one:
ld.lld: error: undefined symbol: _find_next_zero_bit
>>> referenced by rust_binder_main.43196037ba7bcee1-cgu.0
>>> drivers/android/binder/rust_binder_main.o:(<rust_binder_main::process::Process>::insert_or_update_handle) in archive vmlinux.a
>>> referenced by rust_binder_main.43196037ba7bcee1-cgu.0
>>> drivers/android/binder/rust_binder_main.o:(<rust_binder_main::process::Process>::insert_or_update_handle) in archive vmlinux.a
This error occurs because even though the functions are declared by
include/linux/find.h, the definition is #ifdef'd out on 32-bit ARM. This
is because arch/arm/include/asm/bitops.h contains:
And the underscore-prefixed function is conditional on #ifndef of the
non-underscore-prefixed name, but the declaration in find.h is *not*
conditional on that #ifndef.
To fix the linker error, we ensure that the symbols in question exist
when compiling Rust code. We do this by defining them in rust/helpers/
whenever the normal definition is #ifndef'd out.
Note that these helpers are somewhat unusual in that they do not have
the rust_helper_ prefix that most helpers have. Adding the rust_helper_
prefix does not compile, as 'bindings::_find_next_zero_bit()' will
result in a call to a symbol called _find_next_zero_bit as defined by
include/linux/find.h rather than a symbol with the rust_helper_ prefix.
This is because when a symbol is present in both include/ and
rust/helpers/, the one from include/ wins under the assumption that the
current configuration is one where that helper is unnecessary. This
heuristic fails for _find_next_zero_bit() because the header file always
declares it even if the symbol does not exist.
The functions still use the __rust_helper annotation. This lets the
wrapper function be inlined into Rust code even if full kernel LTO is
not used once the patch series for that feature lands.
Yury: arches are free to implement their own find_bit() functions. Most
rely on the generic implementation, but arm32 and m68k do not, so they
require custom handling. Alice confirmed it fixes the build for both.
Cc: stable@vger.kernel.org
Fixes: 6cf93a9ed39e ("rust: add bindings for bitops.h")
Reported-by: Andreas Hindborg <a.hindborg@kernel.org>
Closes: https://rust-for-linux.zulipchat.com/#narrow/channel/x/topic/x/near/561677301
Tested-by: Andreas Hindborg <a.hindborg@kernel.org>
Reviewed-by: Dirk Behme <dirk.behme@de.bosch.com>
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Dipayaan Roy [Mon, 12 Jan 2026 13:05:52 +0000 (05:05 -0800)]
net: mana: Implement ndo_tx_timeout and serialize queue resets per port.
Implement .ndo_tx_timeout for MANA so any stalled TX queue can be detected
and a device-controlled port reset for all queues can be scheduled on an
ordered workqueue. Resetting all queues on stall detection is
recommended by the hardware team.
Gal Pressman [Mon, 12 Jan 2026 17:37:14 +0000 (19:37 +0200)]
selftests: drv-net: fix RPS mask handling in toeplitz test
The toeplitz.py test passed the hex mask without "0x" prefix (e.g.,
"300" for CPUs 8,9). The toeplitz.c strtoul() call wrongly parsed this
as decimal 300 (0x12c) instead of hex 0x300.
Pass the prefixed mask to toeplitz.c, and the unprefixed one to sysfs.
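The parsing difference is easy to reproduce in isolation. The sketch below assumes the C side calls strtoul() with base 0, so that the prefix selects the radix; sketch_parse_rps_mask() is a hypothetical helper, not a function from toeplitz.c.

```c
#include <stdlib.h>

/* Hypothetical helper: parse a CPU mask string with base 0, so that
 * "0x..." is read as hex and an unprefixed number as decimal. */
unsigned long sketch_parse_rps_mask(const char *arg)
{
	return strtoul(arg, NULL, 0);
}
```

Under that assumption, "300" parses as decimal 300 (0x12c, i.e. bits 2, 3, 5 and 8), whereas "0x300" parses as 0x300 (bits 8 and 9), which is the mask for CPUs 8 and 9.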
Fixes: 9cf9aa77a1f6 ("selftests: drv-net: hw: convert the Toeplitz test to Python")
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260112173715.384843-2-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot reported use-after-free of inet6_ifaddr in
inet6_addr_del(). [0]
The cited commit accidentally moved ipv6_del_addr() for
mngtmpaddr before reading its ifp->flags for temporary
addresses in inet6_addr_del().
Let's move ipv6_del_addr() down to fix the UAF.
[0]:
BUG: KASAN: slab-use-after-free in inet6_addr_del.constprop.0+0x67a/0x6b0 net/ipv6/addrconf.c:3117
Read of size 4 at addr ffff88807b89c86c by task syz.3.1618/9593
Eric Dumazet [Mon, 12 Jan 2026 10:38:25 +0000 (10:38 +0000)]
dst: fix races in rt6_uncached_list_del() and rt_del_uncached_list()
syzbot was able to crash the kernel in rt6_uncached_list_flush_dev()
in an interesting way [1]
Crash happens in list_del_init()/INIT_LIST_HEAD() while writing
list->prev, while the prior write on list->next went well.
static inline void INIT_LIST_HEAD(struct list_head *list)
{
	WRITE_ONCE(list->next, list); // This went well
	WRITE_ONCE(list->prev, list); // Crash, @list has been freed.
}
Issue here is that rt6_uncached_list_del() did not attempt to lock
ul->lock, as list_empty(&rt->dst.rt_uncached) returned
true because the WRITE_ONCE(list->next, list) happened on the other CPU.
We might use list_del_init_careful() and list_empty_careful(),
or make sure rt6_uncached_list_del() always grabs the spinlock
whenever rt->dst.rt_uncached_list has been set.
A similar fix is needed for IPv4.
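The second option (always take the lock once the owning list pointer has been set, instead of trusting an unlocked list_empty() check) can be sketched in user space as follows. This is only an illustration under stated assumptions: struct sketch_ul, struct sketch_rt and all sketch_* functions are hypothetical, the C11 atomic_flag spinlock merely stands in for ul->lock, and the 'ul' member plays the role of rt->dst.rt_uncached_list.

```c
#include <stdatomic.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Hypothetical uncached list protected by a tiny spinlock. */
struct sketch_ul {
	atomic_flag lock;
	struct list_head head;
};

/* Hypothetical route; 'ul' is set once the route joins a list. */
struct sketch_rt {
	struct list_head node;
	struct sketch_ul *ul;
};

static void list_init(struct list_head *l) { l->next = l; l->prev = l; }

static void sketch_lock(struct sketch_ul *ul)
{
	while (atomic_flag_test_and_set(&ul->lock))
		;	/* spin until the flag was previously clear */
}

static void sketch_unlock(struct sketch_ul *ul)
{
	atomic_flag_clear(&ul->lock);
}

void sketch_ul_init(struct sketch_ul *ul)
{
	atomic_flag_clear(&ul->lock);
	list_init(&ul->head);
}

void sketch_rt_init(struct sketch_rt *rt)
{
	list_init(&rt->node);
	rt->ul = NULL;
}

void sketch_uncached_add(struct sketch_rt *rt, struct sketch_ul *ul)
{
	rt->ul = ul;
	sketch_lock(ul);
	rt->node.next = ul->head.next;
	rt->node.prev = &ul->head;
	ul->head.next->prev = &rt->node;
	ul->head.next = &rt->node;
	sketch_unlock(ul);
}

/* Decide on the owner pointer, never on an unlocked emptiness check:
 * a concurrent list_del_init() may have written ->next but not ->prev. */
void sketch_uncached_del(struct sketch_rt *rt)
{
	struct sketch_ul *ul = rt->ul;

	if (!ul)
		return;		/* never added to any list */
	sketch_lock(ul);
	rt->node.prev->next = rt->node.next;
	rt->node.next->prev = rt->node.prev;
	list_init(&rt->node);
	sketch_unlock(ul);
}
```

Because the unlink leaves the node pointing at itself, calling sketch_uncached_del() a second time is harmless; it only costs a lock round trip.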
[1]
BUG: KASAN: slab-use-after-free in INIT_LIST_HEAD include/linux/list.h:46 [inline]
BUG: KASAN: slab-use-after-free in list_del_init include/linux/list.h:296 [inline]
BUG: KASAN: slab-use-after-free in rt6_uncached_list_flush_dev net/ipv6/route.c:191 [inline]
BUG: KASAN: slab-use-after-free in rt6_disable_ip+0x633/0x730 net/ipv6/route.c:5020
Write of size 8 at addr ffff8880294cfa78 by task kworker/u8:14/3450
The buggy address belongs to the object at ffff8880294cfa00
which belongs to the cache ip6_dst_cache of size 232
The buggy address is located 120 bytes inside of
freed 232-byte region [ffff8880294cfa00, ffff8880294cfae8)
Fixes: 8d0b94afdca8 ("ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister")
Fixes: 78df76a065ae ("ipv4: take rt_uncached_lock only if needed")
Reported-by: syzbot+179fc225724092b8b2b2@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6964cdf2.050a0220.eaf7.009d.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260112103825.3810713-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
RSS configuration requires a valid RX indirection table. When the device
reports a single receive queue, rndis_filter_device_add() does not
allocate an indirection table. Accepting RSS hash key updates in this
state leads to a hang.
Fix this by gating netvsc_set_rxfh() on ndc->rx_table_sz and returning
-EOPNOTSUPP when the table is absent. This aligns set_rxfh with the device
capabilities and prevents incorrect behavior.
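The shape of such a guard can be sketched as below. This is not the driver code: struct sketch_ndc and sketch_set_rxfh() are hypothetical, and only the rx_table_sz field name and the -EOPNOTSUPP return value come from the description above.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, trimmed-down device context. */
struct sketch_ndc {
	unsigned int rx_table_sz;	/* 0 when no indirection table exists */
};

/* Refuse RSS updates up front when there is no indirection table,
 * instead of sending a request the device cannot honour. */
int sketch_set_rxfh(const struct sketch_ndc *ndc, const unsigned char *key)
{
	if (!ndc->rx_table_sz)
		return -EOPNOTSUPP;

	(void)key;	/* a real driver would program the key and table here */
	return 0;
}
```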
Fixes: 962f3fee83a4 ("netvsc: add ethtool ops to get/set RSS key")
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1768212093-1594-1-git-send-email-gargaditya@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: phy: fixed_phy: replace list of fixed PHYs with static array
Since at most 32 PHY addresses are available per MII bus, using a list
can't support more fixed PHYs. And there's no known use case for as
many as 32 fixed PHYs on a system. 8 should be plenty of fixed PHYs,
so use an array of that size instead of a list. This significantly
reduces the code size and complexity.
In addition replace heavy-weight IDA with a simple bitmap.
====================
Heiner Kallweit [Sun, 11 Jan 2026 12:43:22 +0000 (13:43 +0100)]
net: phy: fixed_phy: replace IDA with a bitmap
The array fmb_fixed_phys is small, so we can use a simple bitmap
instead of an IDA to manage dynamic allocation of fixed PHYs.
find_first_zero_bit() isn't atomic, so we need the loop to rule out
double allocation of a PHY address.
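The retry loop can be illustrated with a user-space analogue built on C11 atomics. This is a sketch under assumptions: all sketch_* names are hypothetical, atomic_fetch_or() stands in for an atomic test-and-set of a single bit, and the size of 8 is taken from the series description.

```c
#include <stdatomic.h>

#define SKETCH_NUM_FIXED_PHYS 8

static atomic_uint sketch_phy_bitmap;	/* bit n set => address n in use */

/* Plain, non-atomic scan for the first clear bit. */
static int sketch_first_zero_bit(unsigned int map)
{
	for (int n = 0; n < SKETCH_NUM_FIXED_PHYS; n++)
		if (!(map & (1u << n)))
			return n;
	return SKETCH_NUM_FIXED_PHYS;
}

/* Returns an address in [0, SKETCH_NUM_FIXED_PHYS), or -1 if all are taken. */
int sketch_phy_addr_alloc(void)
{
	for (;;) {
		int n = sketch_first_zero_bit(atomic_load(&sketch_phy_bitmap));

		if (n >= SKETCH_NUM_FIXED_PHYS)
			return -1;
		/* The scan result may be stale; only the caller that
		 * actually flips the bit from 0 to 1 owns address n. */
		if (!(atomic_fetch_or(&sketch_phy_bitmap, 1u << n) & (1u << n)))
			return n;
		/* Someone else claimed bit n in between: scan again. */
	}
}

void sketch_phy_addr_free(int n)
{
	atomic_fetch_and(&sketch_phy_bitmap, ~(1u << n));
}
```

The scan only proposes a candidate; the atomic read-modify-write decides ownership, which is why a failed claim loops back to a fresh scan rather than returning an error.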
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/d4614463-d532-41fc-92e9-ef97107aceb5@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit [Sun, 11 Jan 2026 12:41:39 +0000 (13:41 +0100)]
net: phy: fixed_phy: replace list of fixed PHYs with static array
Since at most 32 PHY addresses are available per MII bus, using a list
can't support more fixed PHYs. And there's no known use case for as
many as 32 fixed PHYs on a system. 8 should be plenty of fixed PHYs,
so use an array of that size instead of a list. This significantly
reduces the code size and complexity.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/8610d30c-eac7-4100-9008-d3b6cee6a5cd@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
ipv6: Allow for nexthop device mismatch with "onlink"
This patchset aligns IPv6 with IPv4 with respect to the "onlink" keyword
and allows IPv6 routes to be configured with a gateway address that is
resolved out of a different interface than the one specified.
Patches #1-#3 are small preparations in the existing "onlink" selftest.
Patch #4 is the actual change. See the commit message for detailed
description and motivation.
Patch #5 adds test cases for both address families, to make sure that
this use case does not regress.
====================
Ido Schimmel [Sun, 11 Jan 2026 12:08:13 +0000 (14:08 +0200)]
selftests: fib-onlink: Add test cases for nexthop device mismatch
Add test cases that verify that when the "onlink" keyword is specified,
both address families (with and without VRF) accept routes with a
gateway address that is reachable via a different interface than the one
specified.
Output without "ipv6: Allow for nexthop device mismatch with "onlink"":
Ido Schimmel [Sun, 11 Jan 2026 12:08:12 +0000 (14:08 +0200)]
ipv6: Allow for nexthop device mismatch with "onlink"
IPv4 allows for a nexthop device mismatch when the "onlink" keyword is
specified:
# ip link add name dummy1 up type dummy
# ip address add 192.0.2.1/24 dev dummy1
# ip link add name dummy2 up type dummy
# ip route add 198.51.100.0/24 nexthop via 192.0.2.2 dev dummy2
Error: Nexthop has invalid gateway.
# ip route add 198.51.100.0/24 nexthop via 192.0.2.2 dev dummy2 onlink
# echo $?
0
This seems to be consistent with the description of "onlink" in the
ip-route man page: "Pretend that the nexthop is directly attached to
this link, even if it does not match any interface prefix".
On the other hand, IPv6 rejects a nexthop device mismatch, even when
"onlink" is specified:
# ip link add name dummy1 up type dummy
# ip address add 2001:db8:1::1/64 dev dummy1
# ip link add name dummy2 up type dummy
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2
RTNETLINK answers: No route to host
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink
Error: Nexthop has invalid gateway or device mismatch.
This is intentional according to commit fc1e64e1092f ("net/ipv6: Add
support for onlink flag") which added IPv6 "onlink" support and states
that "any unicast gateway is allowed as long as the gateway is not a
local address and if it resolves it must match the given device".
The condition was later relaxed in commit 4ed591c8ab44 ("net/ipv6: Allow
onlink routes to have a device mismatch if it is the default route") to
allow for a nexthop device mismatch if the gateway address is resolved
via the default route:
# ip link add name dummy1 up type dummy
# ip route add ::/0 dev dummy1
# ip link add name dummy2 up type dummy
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2
RTNETLINK answers: No route to host
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink
# echo $?
0
While the decision to forbid a nexthop device mismatch in IPv6 seems to
be intentional, it is unclear why it was made. Especially when it
differs from IPv4 and seems to go against the intended behavior of
"onlink".
Therefore, relax the condition further and allow for a nexthop device
mismatch when "onlink" is specified:
# ip link add name dummy1 up type dummy
# ip address add 2001:db8:1::1/64 dev dummy1
# ip link add name dummy2 up type dummy
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink
# echo $?
0
The motivating use case is the fact that FRR would like to be able to
configure overlay routes of the following form:
# ip route add <host-Z> vrf <VRF> encap ip id <ID> src <VTEP-A> dst <VTEP-Z> via <VTEP-Z> dev vxlan0 onlink
Where vxlan0 is in the default VRF in which "VTEP-Z" is reachable via
one of the underlay routes (e.g., via swpX). Without this patch, the
above only works with IPv4, but not with IPv6.
Ido Schimmel [Sun, 11 Jan 2026 12:08:11 +0000 (14:08 +0200)]
selftests: fib-onlink: Add a test case for IPv4 multicast gateway
A multicast gateway address should be rejected when "onlink" is
specified, but it is only tested as part of the IPv6 tests. Add an
equivalent IPv4 test.
# ./fib-onlink-tests.sh -v
[...]
COMMAND: ip ro add table 254 169.254.101.12/32 via 233.252.0.1 dev veth1 onlink
Error: Nexthop has invalid gateway.
TEST: Invalid gw - multicast address [ OK ]
[...]
COMMAND: ip ro add table 1101 169.254.102.12/32 via 233.252.0.1 dev veth5 onlink
Error: Nexthop has invalid gateway.
The command in the test fails as expected because IPv6 forbids a nexthop
device mismatch:
# ./fib-onlink-tests.sh -v
[...]
COMMAND: ip -6 ro add table 1101 2001:db8:102::103/128 via 2001:db8:701::64 dev veth5 onlink
Error: Nexthop has invalid gateway or device mismatch.
TEST: Gateway resolves to wrong nexthop device - VRF [ OK ]
[...]
Where:
# ip route get 2001:db8:701::64 vrf lisa
2001:db8:701::64 dev veth7 table 1101 proto kernel src 2001:db8:701::1 metric 256 pref medium
This is in contrast to IPv4 where a nexthop device mismatch is allowed
when "onlink" is specified:
# ip route get 169.254.7.2 vrf lisa
169.254.7.2 dev veth7 table 1101 src 169.254.7.1 uid 0
# ip ro add table 1101 169.254.102.103/32 via 169.254.7.2 dev veth5 onlink
# echo $?
0
Remove these tests in preparation for aligning IPv6 with IPv4 and
allowing nexthop device mismatch when "onlink" is specified.
A subsequent patch will add tests that verify that both address families
allow a nexthop device mismatch with "onlink".
According to the test description, these tests fail because of a wrong
nexthop device:
# ./fib-onlink-tests.sh -v
[...]
COMMAND: ip ro add table 254 169.254.101.102/32 via 169.254.3.1 dev veth1 onlink
Error: Nexthop has invalid gateway.
TEST: Gateway resolves to wrong nexthop device [ OK ]
COMMAND: ip ro add table 1101 169.254.102.103/32 via 169.254.7.1 dev veth5 onlink
Error: Nexthop has invalid gateway.
TEST: Gateway resolves to wrong nexthop device - VRF [ OK ]
[...]
But this is incorrect. They fail because the gateway addresses are local
addresses:
# ip -4 address show
[...]
28: veth3@if27: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link-netns peer_ns-Urqh3o
inet 169.254.3.1/24 scope global veth3
[...]
32: veth7@if31: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master lisa state UP group default qlen 1000 link-netns peer_ns-Urqh3o
inet 169.254.7.1/24 scope global veth7
Therefore, using a local address that matches the nexthop device fails
as well:
# ip ro add table 254 169.254.101.102/32 via 169.254.3.1 dev veth3 onlink
Error: Nexthop has invalid gateway.
Using a gateway address with a "wrong" nexthop device is actually valid
and allowed:
# ip route get 169.254.1.2
169.254.1.2 dev veth1 src 169.254.1.1 uid 0
# ip ro add table 254 169.254.101.102/32 via 169.254.1.2 dev veth3 onlink
# echo $?
0
Remove these tests given that their output is confusing and that the
scenario that they are testing is already covered by other tests.
A subsequent patch will add tests for the nexthop device mismatch
scenario.
- This is only a first phase. It instantiates the port, and leverages
that to make the MAC <-> PHY <-> SFP usecase simpler.
- Next phase will deal with controlling the port state, as well as the
netlink uAPI for that.
- The end-goal is to enable support for complex port MUX. This
preliminary work focuses on PHY-driven ports, but this will be
extended to support muxing at the MII level (Multi-phy, or combo PHY
+ SFP as found on Turris Omnia for example).
- The naming is definitely not set in stone. I named that "phy_port",
but this may convey the false sense that this is phylib-specific.
Even the word "port" is not that great, as it already has several
different meanings in the net world (switch port, devlink port,
etc.). I used the term "connector" in the binding.
A bit of history on that work:
The end goal that I personally want to achieve is:
            + PHY - RJ45
            |
 MAC - MUX -+ PHY - RJ45
After many discussions here on netdev@, but also at netdevconf[1] and
LPC[2], there appear to be several analogous designs that exist out
there.
[1] : https://netdevconf.info/0x17/sessions/talk/improving-multi-phy-and-multi-port-interfaces.html
[2] : https://lpc.events/event/18/contributions/1964/ (video isn't the
right one)
Take the Macchiatobin, it has 2 interfaces that look like this:
 MAC - PHY -+ RJ45
            |
            + SFP - Whatever the module does
Now, looking at the Turris Omnia, we have:
 MAC - MUX -+ PHY - RJ45
            |
            + SFP - Whatever the module does
We can find more examples of this kind of design; the common part is
that we expose multiple front-facing media ports. This is what this
current work aims at supporting. As of right now, it doesn't add any
support for muxing, but this will come later on.
This first phase focuses on phy-driven ports only, but there are
already quite a few challenges. For one, we can't really autodetect how
many ports are sitting behind a PHY. That's why this series introduces a
new binding. Describing ports in DT should however be a last-resort
thing when we need to clear some ambiguity about the PHY media-side.
The only use-cases that we have today for multi-port PHYs are combo PHYs
that drive both a Copper port and an SFP (the Macchiatobin case). This
in itself is challenging and this series only addresses part of this
support, by registering a phy_port for the PHY <-> SFP connection. The
SFP module should in the end be considered as a port as well, but that's
not yet the case.
However, because now PHYs can register phy_ports for every media-side
interface they have, they can register the capabilities of their ports,
which allows making the PHY-driven SFP case much more generic.
====================
net: phy: Only rely on phy_port for PHY-driven SFP
Now that all PHY drivers that support downstream SFP have been converted
to phy_port serdes handling, we can make the generic PHY SFP handling
mandatory, thus making all phylib sfp helpers static.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260108080041.553250-14-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: qca807x: Support SFP through phy_port interface
QCA8072/8075 may be used as combo-port PHYs, with Serdes (100/1000BaseX)
and Copper interfaces. The PHY has the ability to read the configuration
it's in. If the configuration indicates the PHY is in combo mode, allow
registering up to 2 ports.
Register a dedicated set of port ops to handle the serdes port, and rely
on generic phylib SFP support for the SFP handling.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260108080041.553250-13-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>