David Yang [Thu, 11 Jun 2026 07:08:49 +0000 (15:08 +0800)]
net: dsa: hellcreek: avoid devlink resource IDs collision with PARENT_TOP
This might not cause real problems, but the hellcreek devlink resource
ID collides with the sentinel DEVLINK_RESOURCE_ID_PARENT_TOP (0). Avoid
it by keeping the real resource IDs starting at 1.
David Yang [Thu, 11 Jun 2026 07:08:48 +0000 (15:08 +0800)]
net: dsa: b53: avoid devlink resource IDs collision with PARENT_TOP
This might not cause real problems, but the b53 devlink resource ID
collides with the sentinel DEVLINK_RESOURCE_ID_PARENT_TOP (0). Avoid it
by keeping the real resource IDs starting at 1.
David Yang [Thu, 11 Jun 2026 07:08:47 +0000 (15:08 +0800)]
net: dsa: dsa_loop: avoid devlink resource IDs collision with PARENT_TOP
This might not cause real problems, but the dsa_loop devlink resource ID
collides with the sentinel DEVLINK_RESOURCE_ID_PARENT_TOP (0). Avoid it
by keeping the real resource IDs starting at 1.
====================
net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2
This is part 2. Find part 1 here:
https://lore.kernel.org/all/20260531113954.395443-1-tariqt@nvidia.com/
This series enables Socket Direct single netdev to operate in switchdev
mode with shared FDB. SD single netdev combines multiple PCI functions
behind a single netdev interface. To support switchdev offloads, these
functions must participate in virtual LAG (shared FDB).
Design
Rather than introducing a separate LAG instance for SD, this series
integrates SD secondary devices into the existing LAG structure
(priv.lag) created at probe time. Each lag_func entry carries a
group_id field that identifies its SD group membership (0 means not
part of any SD group). An xarray mark (XA_MARK_PORT) distinguishes
physical port entries from SD secondaries, enabling a single unified
iterator that filters by group:
- MLX5_LAG_FILTER_PORTS: iterate port-level entries only (existing
behavior, used by bonding, FW LAG commands, v2p_map)
- MLX5_LAG_FILTER_ALL: iterate all devices including SD secondaries
(used by MPESW shared FDB across all devices)
- specific group_id: iterate only devices in that SD group (used by
per-group SD shared FDB operations)
Existing callers use mlx5_ldev_for_each() which maps to
MLX5_LAG_FILTER_PORTS, preserving current behavior for non-SD
configurations.
Lifecycle and ownership
The SD LAG lifecycle is tied to the SD group, not to bonding events:
1. At PCI probe, mlx5_lag_add_mdev() creates the LAG structure
(priv.lag) for each LAG-capable PF. e.g.: SD primary devices
2. During mlx5_sd_init(), after the SD group is fully formed (primary
and secondaries paired), sd_lag_init() registers the secondary
devices into the primary's existing priv.lag by calling
mlx5_ldev_add_mdev() with the SD group_id. The primary's lag_func
also gets its group_id set. No separate LAG instance is created.
3. After all the devices in SD group transition to switchdev,
mlx5_lag_shared_fdb_create() is invoked with the group_id to create
a software-only shared FDB scoped to that SD group. This sets
sd_fdb_active on all lag_func entries in the group. No FW LAG
commands are issued since SD devices share the same physical port.
4. If MPESW (multi-port eswitch) is enabled on top of SD groups, the
per-group SD shared FDB is torn down first, then MPESW shared FDB is
created spanning all devices (ports + SD secondaries) using
MLX5_LAG_FILTER_ALL. On MPESW disable, per-group SD shared FDB is
restored.
5. On SD teardown (mlx5_sd_cleanup or device unbind), sd_lag_cleanup()
removes secondaries from priv.lag and clears the primary's group_id.
The LAG structure itself is not destroyed.
The sd_fdb_active flag is set on all lag_func entries in a group (not
just the primary), so any device can detect the SD shared FDB state
during lag_disable_change teardown without needing to look up peer
entries.
SD shared FDB is a pure software construct -- unlike regular LAG modes
(ROCE, SRIOV, MPESW), it does not issue FW create_lag/destroy_lag
commands. The software vport LAG for SD is implemented via eswitch
egress ACL bounce rules, managed by the IB layer through
mlx5_eth_lag_init(). And the software LAG demux is implemented via
steering rules that utilize new destination, VHCA_RX.
Devcom support (patches 2-3):
- Expose locked variant of send_event
- Add DEVCOM_CANT_FAIL for non-rollback events
SD core hardening (patches 4-6):
- Make primary/secondary role determination more robust
- Add L2 table silent mode query support
- Expand vport metadata for SD secondary devices
SD switchdev transition (patches 7-8):
- Support switchdev mode transition with shared FDB
- Notify SD on eswitch disable
LAG integration (patches 9-12):
- Store demux resources per master lag_func
- Disable both regular and SD LAG on lag_disable_change
- Introduce software vport LAG implementation
- Add MPESW over SD LAG support
Deferred init (patches 13-14):
- Tie rep load/unload to SD LAG state
- Defer vport metadata init until SD is ready
Enablement (patch 15):
- Enable SD over ECPF and allow switchdev transition
Shay Drory [Fri, 12 Jun 2026 11:39:04 +0000 (14:39 +0300)]
net/mlx5: SD, enable SD over ECPF and allow switchdev transition
Remove the restriction blocking SD on embedded CPU PFs (ECPF), enabling
SD functionality on BlueField DPUs. Remove the blocker preventing SD
devices from transitioning to switchdev mode.
The infrastructure added in earlier patches properly handles this case.
Shay Drory [Fri, 12 Jun 2026 11:39:03 +0000 (14:39 +0300)]
net/mlx5: SD, defer vport metadata init until SD is ready
Allow SD devices to transition to switchdev before the SD group is
fully up. Metadata allocation requires the SD group to be ready, so
defer it from esw_offloads_enable() until SD shared-FDB activation.
Add mlx5_esw_offloads_init_deferred_metadata() which allocates per-vport
metadata and refreshes the ingress ACLs that were previously programmed
with metadata=0. The helper is idempotent and can be called multiple
times.
Shay Drory [Fri, 12 Jun 2026 11:39:02 +0000 (14:39 +0300)]
net/mlx5: E-Switch, Tie rep load/unload to SD LAG state
On an SD device, vport representors are not functional until the SD
group is combined and shared FDB is active. Skip the initial load and
the reload paths in that window; reps are loaded as part of the SD LAG
activation flow once it becomes active.
In addition, explicitly unload representors when SD LAG is destroyed.
Shay Drory [Fri, 12 Jun 2026 11:39:01 +0000 (14:39 +0300)]
net/mlx5: LAG, add MPESW over SD LAG support
Enable MPESW LAG creation over SD LAG members, forming a composite LAG
hierarchy. This allows bonding multiple SD groups together under a
single MPESW configuration with shared FDB.
When enabling composite MPESW, the individual SD LAG shared FDB
configurations are temporarily torn down and recreated when the
composite LAG is disabled.
Shay Drory [Fri, 12 Jun 2026 11:39:00 +0000 (14:39 +0300)]
net/mlx5: LAG, introduce software vport LAG implementation
SD LAG is a virtual LAG without hardware LAG support, so it cannot use
the firmware vport LAG commands. Implement a software-based vport LAG
using egress ACL bounce rules.
Add esw_set_slave_egress_rule() to create an egress ACL rule on the
slave's manager vport that bounces traffic to the master's manager
vport. This achieves the same traffic steering as hardware vport LAG.
Redirect mlx5_cmd_create_vport_lag() and mlx5_cmd_destroy_vport_lag()
to the software implementation when operating in SD LAG mode.
In addition, adjust lag_demux creation to check SD LAG mode as well.
Shay Drory [Fri, 12 Jun 2026 11:38:59 +0000 (14:38 +0300)]
net/mlx5: LAG, disable both regular and SD LAG on lag_disable_change
Extend mlx5_lag_disable_change() to properly disable both regular LAG
and SD LAG when requested. Each LAG type uses its own devcom component
for locking.
Use mlx5_sd_get_devcom() helper to retrieve the SD devcom component,
needed for proper locking when disabling SD LAG.
Shay Drory [Fri, 12 Jun 2026 11:38:58 +0000 (14:38 +0300)]
net/mlx5: LAG, store demux resources per master lag_func
The lag demux resources (flow table, flow group, and rules xarray)
are stored on the shared ldev. With Socket Direct, multiple SD groups
each create their own demux FT/FG during their master's IB device
initialization. Since they all write to the same ldev fields, the
second group's init overwrites the first group's pointers, leaking
the first group's FT/FG.
During teardown, the cleanup uses the overwritten pointers, destroying
the wrong group's resources and leaving leaked flow tables in the LAG
namespace. These leaked tables can interfere with subsequently created
demux tables.
Move the demux resources from the shared ldev to per-master lag_func
instances. Each master device now owns its own independent demux
state. The rule_add and rule_del helpers look up the appropriate
master's lag_func via the existing filter/group infrastructure.
Shay Drory [Fri, 12 Jun 2026 11:38:57 +0000 (14:38 +0300)]
net/mlx5: E-Switch, notify SD on eswitch disable
When eswitch is disabled, notify the SD layer so it can clean up
SD-specific resources such as the TX flow table root configuration
on secondary devices.
Shay Drory [Fri, 12 Jun 2026 11:38:56 +0000 (14:38 +0300)]
net/mlx5: SD, support switchdev mode transition with shared FDB
When the eswitch transitions, propagate the change to SD: secondaries
get their TX flow table root reconfigured for the new mode, and when
all group devices move to switchdev, the per-group shared FDB is
activated.
Shared FDB activation is best-effort - failure does not block the
eswitch transition; the next transition retries.
Note: the existing mlx5_get_sd() guard that blocks switchdev for SD
devices is intentionally retained. It will be removed once all
supporting patches are in place.
Shay Drory [Fri, 12 Jun 2026 11:38:55 +0000 (14:38 +0300)]
net/mlx5: SD, expend vport metadata for SD secondary devices
In Socket Direct configurations the primary and secondary PFs share the
same native_port_num. The eswitch vport metadata encodes pf_num in its
upper bits to distinguish vports across PFs. Without SD-awareness, both
PFs generate identical metadata, causing FDB rules to steer traffic to
the wrong representor.
Add mlx5_sd_pf_num_get() which remaps the pf_num for SD devices.
Use it so each PF in an SD group produces unique vport metadata.
Shay Drory [Fri, 12 Jun 2026 11:38:54 +0000 (14:38 +0300)]
net/mlx5: SD, add L2 table silent mode query support
Add mlx5_fs_cmd_query_l2table_silent() to query the current silent mode
state from firmware. This allows detecting if firmware has already put
secondary devices into silent mode.
During SD group registration, query the silent mode of each device. If
a device is already in silent mode (set by firmware), record this in
the fw_silents_secondaries flag and use it to help determine the
primary/secondary roles.
When fw_silents_secondaries is set, skip the driver-initiated silent
mode set/unset operations since firmware manages this state. This
handles configurations where firmware persistently silences secondary
devices.
Shay Drory [Fri, 12 Jun 2026 11:38:53 +0000 (14:38 +0300)]
net/mlx5: SD, make primary/secondary role determination more robust
Refactor SD group registration to use devcom event-driven role
determination to ensure SD is marked as ready only after roles are fully
assigned and the group state is consistent, making outside accessors,
which will be added in downstream patches, safe to use without races.
The devcom events:
- SD_PRIMARY_SET event: each device compares bus numbers with peers
to determine which should be primary
- SD_SECONDARIES_SET event: secondaries register themselves with the
elected primary device
Shay Drory [Fri, 12 Jun 2026 11:38:52 +0000 (14:38 +0300)]
net/mlx5: devcom, add DEVCOM_CANT_FAIL for non-rollback events
Some devcom events are not expected to fail. Rather than attempting
a rollback that may not be meaningful, allow callers to pass
DEVCOM_CANT_FAIL as the rollback_event to indicate that the event
handler should not fail. If it does, emit a warning and stop
propagating to further peers, but skip the rollback path.
Shay Drory [Fri, 12 Jun 2026 11:38:51 +0000 (14:38 +0300)]
net/mlx5: devcom, expose locked variant of send_event
Factor mlx5_devcom_send_event() into two functions:
- mlx5_devcom_locked_send_event(): performs the dispatch (and
rollback) with comp->sem already held by the caller.
- mlx5_devcom_send_event(): unchanged wrapper that takes comp->sem,
calls the locked variant, and releases it.
This lets callers bracket multiple event broadcasts under a single
held write lock, eliminating the gap between consecutive dispatches
where peer state could change.
SD secondary devices share the primary's uplink and do not have
their own uplink representor. When reloading IB reps on secondary
devices, skip the uplink and only load VF/SF vport IB reps.
geneve: Fix off-by-one comparing with GRO_LEGACY_MAX_SIZE
GRO_LEGACY_MAX_SIZE = 65536; total_len being 65536 is too big to fit
into a u16. As can be seen in skb_gro_receive, packets bigger or equal
to gro_max_size (or GRO_LEGACY_MAX_SIZE) are dropped with -E2BIG. Apply
the same boundary to geneve_post_decap_hint to avoid writing 65536 to a
16-bit iph->tot_len field with an overflow.
Fixes: fd0dd796576e ("geneve: use GRO hint option in the RX path") Signed-off-by: Alice Mikityanska <alice@isovalent.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260611192955.604661-3-alice.kernel@fastmail.im Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Similar to commit add641e7dee3 ("sched: act_csum: don't mangle TCP and
UDP GSO packets"), UDP tunnel GSO packets going through act_csum
shouldn't have their checksum calculated at this point, because it will
be done after segmentation. Setting the checksum in act_csum modifies
skb->ip_summed and prevents inner IP csum offload from kicking in,
resulting in a packet with a bad checksum.
Add UDP tunnel GSO packets to the exceptions, and also add UDP GSO
(SKB_GSO_UDP_L4), as the same logic as in the commit mentioned above
applies to UDP GSO too.
====================
net: hns3: enhance tc flow offload support
This patchset enhances the tc flow offload support for hns3 driver:
- Patch 1: Refactor hclge_add_cls_flower() to support more actions
- Patch 2: Improve unused_tuple parameter setting for separate src/dst configuration
- Patch 3: Add support for HCLGE_FD_ACTION_SELECT_QUEUE and HCLGE_FD_ACTION_DROP_PACKET actions
- Patch 4: Add support for FLOW_DISSECTOR_KEY_IP and FLOW_DISSECTOR_KEY_ENC_KEYID dissectors
- Patch 5: Add debugfs support for dumping FD rules
- Patch 6: Move FD code to a separate file (hclge_fd.c) for better code organization
====================
Jijie Shao [Wed, 10 Jun 2026 06:06:18 +0000 (14:06 +0800)]
net: hns3: move fd code to a separate file
The hclge_main.c file has become very large,
so the fd code has been moved to a separate hclge_fd.c file.
This patch only moves the code and does not modify any functionality.
Jijie Shao [Wed, 10 Jun 2026 06:06:16 +0000 (14:06 +0800)]
net: hns3: support IP and tunnel VNI dissectors for tc flow
Currently, the driver does not support FLOW_DISSECTOR_KEY_IP and
FLOW_DISSECTOR_KEY_ENC_KEYID. But the hardware supports
ip_tos (FLOW_DISSECTOR_KEY_IP) and
outer_tun_vni (FLOW_DISSECTOR_KEY_ENC_KEYID).
This patch adds support for FLOW_DISSECTOR_KEY_IP and
FLOW_DISSECTOR_KEY_ENC_KEYID.
Additionally, since tc flow cannot effectively support
l2_user_def, l3_user_def, and l4_user_def,
this patch explicitly sets them to not be used.
Jijie Shao [Wed, 10 Jun 2026 06:06:14 +0000 (14:06 +0800)]
net: hns3: improve the unused_tuple parameter setting
Currently, when the tc tool is used to set flow table rules, the IP address
and MAC address can be configured separately, for example, src_xx or dst_xx
can be configured separately.
Therefore, the driver needs to check whether the mask is all zero in
keys, such as FLOW_DISSECTOR_KEY_IPV4_ADDRS, FLOW_DISSECTOR_KEY_IPV6_ADDRS,
and FLOW_DISSECTOR_KEY_ETH_ADDRS.
If the mask is all zero, the tuple is not configured.
In this case, the driver adds the tuple to unused_tuple.
Jijie Shao [Wed, 10 Jun 2026 06:06:13 +0000 (14:06 +0800)]
net: hns3: refactor add_cls_flower to prepare for multiple actions
Remove the tc parameter from the add_cls_flower() ops callback and
refactor action parsing to support future extensions for SELECT_QUEUE
and DROP_PACKET actions.
Changes:
* Remove the tc parameter from the add_cls_flower() callback signature.
* Extract TC-based action parsing into hclge_get_tc_flower_action().
* Move the dissector->used_keys check from hclge_parse_cls_flower() to
hclge_check_cls_flower(), and restrict ETH_ADDRS to
HCLGE_FD_MODE_DEPTH_2K_WIDTH_400B_STAGE_1 mode since hardware only
supports MAC matching there.
* Migrate error reporting from dev_err() to netlink extended ACK (extack).
The FDB management done by the dpaa2_switch_port_set_fdb() function is
hard to follow even by trained eyes. This series tries to make it easier
to read and understand it by factoring out some code blocks into helper
functions and unifying the join and leave paths in terms of FDB
management.
====================
Ioana Ciornei [Wed, 10 Jun 2026 15:09:12 +0000 (18:09 +0300)]
dpaa2-switch: unify the FDB update logic in dpaa2_switch_port_set_fdb()
For both the join and leave paths, the logic goes through the following
steps: determines which FDB should be used on a port after the current
changeupper change, populate the private port structures with the new
FDB and, if necessary, make as not used the old FDB.
Instead of having two distinct paths inside the
dpaa2_switch_port_set_fdb() for linking=true and linking=false, unify
them. This will hopefully help in making this function easier to read.
Ioana Ciornei [Wed, 10 Jun 2026 15:09:11 +0000 (18:09 +0300)]
dpaa2-switch: move FDB selection for leave path into a helper
Move the FDB selection for when a port leaves bridge into a new helper -
dpaa2_switch_fdb_for_leave(). This will hopefully make the
dpaa2_switch_port_set_fdb() function easier to read and follow. The new
helper only determines the FDB to be used, any updates into the private
port structure still gets done in the set_fdb() function.
Ioana Ciornei [Wed, 10 Jun 2026 15:09:10 +0000 (18:09 +0300)]
dpaa2-switch: move FDB selection for join path into a helper
The dpaa2_switch_port_set_fdb() function handles the setup of the FDB
for both changeupper cases: join and leave. Move the code block which
handles the join path into a new helper - dpaa2_switch_fdb_for_join() -
with the hope that the entire function will become easier to read and
extend with other use cases in the future.
This new helper just determines and returns what FDB should be used for
a specific port, the cleanup of the old FDB and the actual setup in the
per port structure remains in the dpaa2_switch_port_set_fdb() function.
Ioana Ciornei [Wed, 10 Jun 2026 15:09:09 +0000 (18:09 +0300)]
dpaa2-switch: factor out the FDB in-use check into a helper
The dpaa2_switch_port_set_fdb() function is hard to follow and
open-coding the in-use check into it makes it even harder to read.
Factor out that code block into a new helper -
dpaa2_switch_fdb_in_use_by_others().
Ioana Ciornei [Wed, 10 Jun 2026 15:09:08 +0000 (18:09 +0300)]
dpaa2-switch: change dpaa2_switch_port_set_fdb() function prototype
Since there dpaa2_switch_port_set_fdb() never fails and its return value
was never checked, change its prototype to return void.
Also, instead of determining if the DPAA2 port is joining or leaving an
upper based on the value of the 'bridge_dev' parameter, add the
'linking' parameter to explicitly specify the action.
Signed-off-by: Maximilian Pezzullo <maximilianpezzullo@gmail.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Joe Damato <joe@dama.to> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260609213559.178657-15-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Agalakov Daniil [Tue, 9 Jun 2026 21:35:54 +0000 (14:35 -0700)]
e1000e: limit endianness conversion to boundary words
[Why]
In e1000_set_eeprom(), the eeprom_buff is allocated to hold a range of
words. However, only the boundary words (the first and the last) are
populated from the EEPROM if the write request is not word-aligned.
The words in the middle of the buffer remain uninitialized because they
are intended to be completely overwritten by the new data via memcpy().
The previous implementation had a loop that performed le16_to_cpus()
on the entire buffer. This resulted in endianness conversion being
performed on uninitialized memory for all interior words.
Fix this by converting the endianness only for the boundary words
immediately after they are successfully read from the EEPROM.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Agalakov Daniil [Tue, 9 Jun 2026 21:35:53 +0000 (14:35 -0700)]
e1000: limit endianness conversion to boundary words
[Why]
In e1000_set_eeprom(), the eeprom_buff is allocated to hold a range of
words. However, only the boundary words (the first and the last) are
populated from the EEPROM if the write request is not word-aligned.
The words in the middle of the buffer remain uninitialized because they
are intended to be completely overwritten by the new data via memcpy().
The previous implementation had a loop that performed le16_to_cpus()
on the entire buffer. This resulted in endianness conversion being
performed on uninitialized memory for all interior words.
Fix this by converting the endianness only for the boundary words
immediately after they are successfully read from the EEPROM.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Matt Vollrath [Tue, 9 Jun 2026 21:35:52 +0000 (14:35 -0700)]
e1000e: Use __napi_schedule_irqoff()
The __napi_schedule_irqoff() macro is intended to bypass saving and
restoring IRQ state when scheduling is requested from an IRQ handler,
where hard interrupts are already disabled. Use this macro in all three
interrupt handlers.
This was tested on a system with an I218-V and MSI interrupts. Because
this is an optimization, I was interested in measuring the impact, so I
added ktime_get() time measurement to e1000_intr_msi and a print of the
last sample in the watchdog task. For each test case I ran a
bi-directional iperf3 to saturate the line. With some help from awk,
here are the statistics.
49 samples each, all units ns
previous: min 678 max 1265 mean 879.429 median 806 stddev 137.188
noirq: min 707 max 1165 mean 811.857 median 790 stddev 89.486
According to this informal comparison, the mean time to handle an
interrupt from start to finish is improved by about 8% under load.
Signed-off-by: Matt Vollrath <tactii@gmail.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Michal Cohen <michalx.cohen@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260609213559.178657-12-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daiki Harada [Tue, 9 Jun 2026 21:35:51 +0000 (14:35 -0700)]
igc: use napi_schedule_irqoff() instead of napi_schedule()
Replace napi_schedule() with napi_schedule_irqoff()
in the interrupt handler path in igc driver
Tested on Intel Corporation Ethernet Controller I226-V.
Suggested-by: Kohei Enju <kohei@enjuk.jp> Signed-off-by: Daiki Harada <daiky0325@gmail.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Dima Ruinskiy <dima.ruinskiy@intel.com> Tested-by: Moriya Kadosh <moriyax.kadosh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260609213559.178657-11-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
e1000e: use ktime_get_real_ns() in e1000e_systim_reset()
Replace ktime_to_ns(ktime_get_real()) with the direct equivalent
ktime_get_real_ns() in e1000e_systim_reset(). Using the combined helper
avoids the unnecessary intermediate ktime_t variable and makes the
intent clearer.
Suggested-by: Jacob Keller <jacob.e.keller@intel.com> Suggested-by: Simon Horman <horms@kernel.org> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Avigail Dahan <avigailx.dahan@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260609213559.178657-9-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
igb: use ktime_get_real helpers in igb_ptp_reset()
Replace ktime_to_ns(ktime_get_real()) with the direct equivalent
ktime_get_real_ns() and ktime_to_timespec64(ktime_get_real()) with
ktime_get_real_ts64() in igb_ptp_reset(). Using the combined helpers
makes the intent clearer.
Suggested-by: Jacob Keller <jacob.e.keller@intel.com> Suggested-by: Simon Horman <horms@kernel.org> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260609213559.178657-8-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260609213559.178657-6-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Raczynski [Tue, 9 Jun 2026 21:35:45 +0000 (14:35 -0700)]
net/intel: Replace manual array size calculation with ARRAY_SIZE
There are still places in the code where manual calculation of array size
exist, but it is good to enforce usage of single macro through the whole
code as it makes code bit more readable.
While at it, beautify condition surrounding it by reversing check and remove
unnecessary casting.
iavf: iavf_virtchnl_completion: drop duplicate ether_addr_equal() test
This is just a simple cleanup fix. Commit 35a2443d0910f ("iavf: Add
waiting for response from PF in set mac") introduced a duplicate
ether_addr_equal() check, so the current code tests the new MAC twice
against the former MAC.
Remove the outer ether_addr_equal() test, remnant of commit c5c922b3e09b
("iavf: fix MAC address setting for VFs when filter is rejected")
Signed-off-by: Corinna Vinschen <vinschen@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20260609213559.178657-4-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
idpf: Replace use of system_unbound_wq with system_dfl_wq
This patch continues the effort to refactor workqueue APIs, which has begun
with the changes introducing new workqueues and a new alloc_workqueue flag:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The point of the refactoring is to eventually alter the default behavior of
workqueues to become unbound by default so that their workload placement is
optimized by the scheduler.
Before that to happen, workqueue users must be converted to the better named
new workqueues with no intended behaviour changes:
This series extends Marvell octeontx2-af support for CN20K NPC (MCAM
debuggability, allocation policy, default-rule lifetime, optional KPU
profiles from firmware files, X2/X4 MCAM keyword handling in flows and
defaults, and dynamic CN20K NPC private state), adds a devlink mechanism
for multi-value parameters, and moves devlink_nl_param_fill() temporaries
to the heap so stack usage stays reasonable once union devlink_param_value
grows (patch 3).
Patch 1 enforces a single RVU admin-function PCI device in the kernel.
On Octeon series SoCs, hardware resources such as NPC, NIX and related
blocks are global and coordinated by the AF driver; PFs and VFs request
them through AF mailbox messages. Firmware exposes only one AF PCI
function at boot, so two AF driver instances cannot both own that state.
rvu_probe() rejects a second bind with -EBUSY, logs a warning, clears the
probe gate on early allocation failures, and aligns the driver model with
hardware so reviewers and automation can rely on exactly one bound AF.
Patch 2 improves CN20K MCAM visibility in debugfs: mcam_layout marks
enabled entries, dstats reports per-entry hit deltas (baseline updated in
software after each read; hardware counters are not cleared), and mismatch
lists enabled entries without a PF mapping.
Patch 3 allocates the per-configuration-mode union devlink_param_value
buffers and struct devlink_param_gset_ctx used by devlink_nl_param_fill()
with kcalloc()/kzalloc_obj() and funnels failures through a single cleanup
path so the netlink reply path stays safe as the union grows.
Patch 4 (Saeed) introduces DEVLINK_PARAM_TYPE_U64_ARRAY and nested
DEVLINK_ATTR_PARAM_VALUE_DATA attributes so drivers and user space can
exchange bounded u64 arrays; YAML, uapi, and netlink validation are
updated.
Patch 5 adds a runtime devlink parameter srch_order to reorder CN20K
subbank search during MCAM allocation (the param uses the u64 array type
from patch 4).
Patch 6 ties default MCAM entries to NIX LF alloc/free on CN20K, adds
NIX_LF_DONT_FREE_DFT_IDXS for PF teardown paths that must not drop default
NPC indexes while the driver still owns state, and tightens nix_lf_alloc
error propagation.
Patch 7 allows loading a custom KPU profile from /lib/firmware/kpu via
module parameter kpu_profile, with cam2 / ptype_mask wiring and helpers
that share firmware-sourced vs filesystem-sourced profile layouts.
Patch 8 makes default-rule allocation, AF flow install, and PF-side RSS,
defaults, and ethtool flows respect the active CN20K MCAM keyword width
(X2 vs X4), including X4 reference-index masking and -EOPNOTSUPP when a
flow needs X4 keys on an X2-only profile.
Patch 9 replaces file-scope npc_priv and static dstats with allocation
sized from discovered bank/subbank geometry, threads npc_priv_get()
through CN20K NPC paths, and allocates dstats via devm_kzalloc for the
debugfs helper.
Patch 1 is ordered first so later patches assume a single bound AF.
Heap-backed devlink_nl_param_fill() sits immediately before the U64 array
param work so incremental builds stay stack-safe as the union grows; the
CN20K patches keep srch_order ahead of NIX LF coordination, optional KPU
profile load from firmware files, X2/X4 handling, and the npc_priv refactor
that touches the same files heavily.
====================
octeontx2-af: npc: cn20k: Allocate npc_priv and dstats dynamically.
Replace the file-scope static npc_priv with a kcalloc'd struct filled
from hardware bank/subbank geometry at init (num_banks is no longer a
const compile-time constant; drop init_done and use a non-NULL
npc_priv pointer for liveness). Thread npc_priv_get() / pointer access
through the CN20K NPC code paths, extend teardown to kfree the root
struct on failure and in npc_cn20k_deinit, and adjust MCAM section
setup to use the discovered subbank count.
Allocate MCAM debugfs dstats via devm_kzalloc instead of a static matrix,
and use the allocated backing store consistently when computing deltas
(including the counter rollover compare).
octeontx2: cn20k: Respect NPC MCAM X2/X4 profile in flows and DFT alloc
Default CN20K NPC rule allocation now keys off the active MCAM keyword
width: use X4 with a bank-masked reference index when the silicon uses
X4 keys, and X2 with the raw index otherwise (replacing the previous
always-X2 / eidx + 1 behaviour).
In the AF flow-install path, flows that need more than 256 key bits
query the NPC profile; if the platform is fixed to X2 entries, fail
with -EOPNOTSUPP instead of requesting X4. Otherwise select X4 for the
MCAM alloc.
On the PF, cache and pass the profile kw_type from npc_get_pfl_info
through otx2_mcam_pfl_info_get(), and use it when allocating MCAM
entries for RSS/defaults and when installing ethtool flows on CN20K,
including masking the reference index for X4 slot layout.
octeontx2-af: npc: Support for custom KPU profile from filesystem
Flashing updated firmware on deployed devices is cumbersome. Provide a
mechanism to load a custom KPU (Key Parse Unit) profile directly from
the filesystem at module load time.
When the rvu_af module is loaded with the kpu_profile parameter, the
specified profile is read from /lib/firmware/kpu and programmed into
the KPU registers. Add npc_kpu_profile_cam2 for the extended cam format
used by filesystem-loaded profiles and support ptype/ptype_mask in
npc_config_kpucam when profile->from_fs is set.
Usage:
1. Copy the KPU profile file to /lib/firmware/kpu.
2. Build OCTEONTX2_AF as a module.
3. Load: insmod rvu_af.ko kpu_profile=<profile_name>
octeontx2: cn20k: Coordinate default rules with NIX LF lifecycle
Add NIX_LF_DONT_FREE_DFT_IDXS so the PF can send NIX LF free during hw
reinit or teardown without the AF freeing CN20K default NPC rule indexes
while the driver still owns that state (otx2_init_hw_resources and
otx2_free_hw_resources).
On CN20K, allocate default NPC rules from NIX LF alloc before
nix_interface_init, roll back with npc_cn20k_dft_rules_free on failure,
and free from NIX LF free when the new flag is not set. Tighten
rvu_mbox_handler_nix_lf_alloc error handling: use a single rc, propagate
qmem_alloc and other errors, and set -ENOMEM only when kcalloc fails
(remove the blanket -ENOMEM at the free_mem path).
octeontx2-af: npc: cn20k: add subbank search order control
CN20K NPC MCAM is split into 32 subbanks that are searched in a
predefined order during allocation. Lower-numbered subbanks have
higher priority than higher-numbered ones.
Add a runtime "srch_order" to control the order in which
subbanks are searched during MCAM allocation.
Saeed Mahameed [Tue, 9 Jun 2026 04:04:48 +0000 (09:34 +0530)]
devlink: Implement devlink param multi attribute nested data values
Devlink param value attribute is not defined since devlink is handling
the value validating and parsing internally, this allows us to implement
multi attribute values without breaking any policies.
Devlink param multi-attribute values are considered to be dynamically
sized arrays of u64 values, by introducing a new devlink param type
DEVLINK_PARAM_TYPE_U64_ARRAY, driver and user space can set a variable
count of u64 values into the DEVLINK_ATTR_PARAM_VALUE_DATA attribute.
Implement get/set parsing and add to the internal value structure passed
to drivers.
This is useful for devices that need to configure a list of values for
a specific configuration.
example:
$ devlink dev param show pci/... name multi-value-param
name multi-value-param type driver-specific
values:
cmode permanent value: 0,1,2,3,4,5,6,7
$ devlink dev param set pci/... name multi-value-param \
value 4,5,6,7,0,1,2,3 cmode permanent
devlink: heap-allocate param fill buffers in devlink_nl_param_fill
devlink_nl_param_fill() kept two per-configuration-mode copies of
union devlink_param_value plus a struct devlink_param_gset_ctx on the
stack while building the Netlink reply. Allocate those with kcalloc()
and kzalloc_obj() instead, and route failures through a single cleanup
path so temporary buffers are always freed.
Improve MCAM visibility and field debugging for CN20K NPC.
- Extend "mcam_layout" to show enabled (+) or disabled state per entry
so status can be verified without parsing the full "mcam_entry" dump.
- Add "dstats" debugfs entry: for enabled MCAM indices, print hit deltas
since the prior read by comparing hardware counters to a per-entry
software baseline and advancing that baseline after each read (hardware
counters are not cleared).
- Add "mismatch" debugfs entry: lists MCAM entries that are enabled
but not explicitly allocated, helping diagnose allocation/field issues.
On Octeon series SoCs, the AF is an integrated device within the SoC, and
hardware resources such as NPC, NIX and related blocks are global and
coordinated by the AF driver. Physical and virtual functions request those
resources via AF mailbox messages, so two AF driver instances cannot both
own that global state; firmware exposes only one AF PCI function at boot
and any further octeontx2-af PCI probe returns -EBUSY so software matches
the single-AF model.
====================
net/stmmac: Fixes for maximum TX/RX queues to use by driver
When contributing other changes preparing functions for new XGMAC hardware
https://lore.kernel.org/netdev/20260601162537.553512-1-j.raczynski@samsung.com/
there have been reports by Sashiko AI.
All of issues are wrong DTS configuration, but kernel needs to handle it.
====================
Jakub Raczynski [Thu, 11 Jun 2026 11:33:58 +0000 (13:33 +0200)]
net/stmmac: Apply MTL_MAX queue limit if config missing
When "snps,rx-queues-to-use" or "tx-queues-to-use" config in DTS is provided
current code will apply U8_MAX value for queues_to_use if there is input of
higher value. But actual maximum number of supported queues is set via
macro MTL_MAX_RX_QUEUES and MTL_MAX_TX_QUEUES, which currently have value of 8.
This value of U8_MAX will be capped to value provided by core in DMA
capabilities (dma_conf), but it does so only if core provides it.
This is true for XGMAC (dwxgmac2) and some GMAC (dwmac4),
but not for (dwmac1000). This capping is at later stage in stmmac_hw_init(),
and during stmmac_mtl_setup() we might parse fields outside allocated memory
if queues_to_use is over defines MTL_MAX_ values,
for example following rx_queues_cfg is array of size of MTL_MAX_RX_QUEUES.
Fix this by capping value to MTL_MAX during config parsing.
Jakub Raczynski [Thu, 11 Jun 2026 11:33:57 +0000 (13:33 +0200)]
net/stmmac: Apply TBS config only to used queues
While opening stmmac driver, there is enabling of TBS (Time-Based Scheduling)
option in dma config. Currently this is executed for all possible TX queues via
MTL_MAX_TX_QUEUES macro, but actual number of queues used might differ.
While setting this is generally harmless, since memory for MTL_MAX_TX_QUEUES
is allocated, it is incorrect, because it prepares config for unused queues.
Change this to apply tbs config only to tx_queues_to_use.
Co-developed-by: Chang-Sub Lee <cs0617.lee@samsung.com> Signed-off-by: Chang-Sub Lee <cs0617.lee@samsung.com> Signed-off-by: Jakub Raczynski <j.raczynski@samsung.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260611113358.3379518-2-j.raczynski@samsung.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lorenzo Bianconi [Thu, 11 Jun 2026 10:43:00 +0000 (12:43 +0200)]
net: airoha: better handle MIBs for GDM ports with multiple devs attached
In the context of a GDM port that can have multiple net_devices attached
(GDM3 and GDM4), the HW counters (MIBs) are global for the GDM port.
This cause duplicated stats reported to the kernel for the related
net_device.
The SoC supports a split MIB feature where each counter is tracked based
on the relevant HW channel (NBQ) to account for this scenario and
provide a way to select the related counter on accessing the MIB
registers.
Enable this feature for GDM3 and GDM4 and configure the relevant HW
channel before updating the HW stats to report correct HW counter to the
kernel for the related interface.
Move the stats struct from port to dev since HW counter are now specific
to the network device instead of the GDM port. Refactor
airoha_update_hw_stats() to take airoha_eth and airoha_gdm_port
parameters since the function operates on the entire port.
Frank Li [Wed, 10 Jun 2026 15:05:30 +0000 (11:05 -0400)]
dt-bindings: net: dsa: Convert lan9303.txt to yaml format
Convert lan9303.txt to yaml format to fix below CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx53-kp-hsc.dtb: /soc/bus@50000000/i2c@53fec000/switch@a: failed to match any schema with compatible: ['smsc,lan9303-i2c']
Additional changes:
- rename switch-phy to switch in example.
selftests: net: add test for IPv4 devconf netlink notifications
Introduce a new test, `ipv4_devconf_notify`, to verify that the kernel
sends the appropriate netlink notifications when IPv4 devconf parameters
are modified.
The test depends on the newly introduced iproute2 command:
ipv4: handle devconf post-set actions on netlink updates
When IPv4 device configuration parameters are updated via netlink, the
kernel currently only updates the value. This bypasses several
post-modification actions that occur when these same parameters are
updated via sysctl, such as flushing the routing cache or emitting
RTM_NEWNETCONF notifications.
This patch addresses the inconsistency by calling the
devinet_conf_post_set() helper inside inet_set_link_af(). If a flush is
required, we defer it until the netlink attribute parsing loop
completes.
This ensures consistent behavior and side-effects for devconf changes,
regardless of whether they are initiated via sysctl or netlink.
The logic for handling IPv4 devconf sysctls is scattered. Notification
and cache flushes are managed in devinet_conf_proc(), while a separate
ipv4_doint_and_flush() function and DEVINET_SYSCTL_FLUSHING_ENTRY macro
is used for properties that solely require a cache flush.
This patch refactors the sysctl handling by introducing a centralized
helper, devinet_conf_post_set(). This new function evaluates the changed
attribute and handles all necessary operations like triggering netlink
notifications. It returns a boolean indicating whether a routing cache
flush is required.
Note that the boolean is necessary as this function will be re-used for
netlink IPv4 devconf handling where the cache flushing must wait until
all the attributes have been processed.
Finally, this is introducing a small change in behavior for
IPV4_DEVCONF_ROUTE_LOCALNET. As commit d0daebc3d622 ("ipv4: Add
interface option to enable routing of 127.0.0.0/8") intended, the cache
flush should only be performed when ROUTE_LOCALNET changes from 1 to 0.
Unfortunately, this was not true because while implementing it the
DEVINET_SYSCTL_FLUSHING_ENTRY was used for the attribute, making the
code related to it on devinet_conf_proc() dead.
IPV4_DEVCONF_FORWARDING is still being handled separately as it requires
more operations.
Eric Dumazet [Mon, 8 Jun 2026 15:14:52 +0000 (15:14 +0000)]
tcp: refine tcp_sequence() for the FIN exception
Commit 0e24d17bd966 ("tcp: implement RFC 7323 window retraction
receiver requirements") removed the special FIN case that
was added in commit 1e3bb184e941 ("tcp: re-enable acceptance of
FIN packets when RWIN is 0").
If a peer sends a segment containing data and a FIN flag before
it learns about our window retraction and has a buggy TCP stack,
it might place the FIN one byte beyond what it thinks is the
right edge of the window (i.e., max_window_edge + 1).
The data portion (end_seq - th->fin) will end exactly at max_window_edge.
In this case, we will drop the packet if our receive queue is not empty,
even though the data was sent within the window we previously allowed.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Simon Baatz <gmbnomis@gmail.com> Link: https://patch.msgid.link/20260608151452.706822-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
dpll/ice: Add generic DPLL type and full TX reference clock control for E825
NOTE: This series is intentionally submitted on net-next (not
intel-wired-lan) as early feedback of DPLL subsystem changes is
welcomed. In the past possible approaches were discussed in [1].
This series adds TX reference clock support for E825 devices and exposes
TX clock selection and synchronization status via the Linux DPLL
subsystem.
E825 hardware contains a dedicated TX clock domain with per-port source
selection behavior that is distinct from PPS handling and from board-level
EEC distribution. TX reference clock selection is device-wide, shared
across ports, and mediated by firmware as part of link bring-up. As a
result, TX clock selection intent may differ from effective hardware
configuration, and software must verify outcome after link-up.
To support this, the series extends the DPLL core and the ice driver
incrementally. The series also introduces DPLL_TYPE_GENERIC as a broad
UAPI class for DPLL instances outside PPS/EEC categories. The intent is
to keep type naming reusable and scalable across different ASIC
topologies while preserving functional discoverability via
driver/device context and pin topology.
This follows netdev discussion guidance that UAPI type naming should avoid
location-specific or vendor-specific taxonomy, because such labels do not
scale across different ASIC designs. The function of a given DPLL instance
is already discoverable from driver/device context and pin topology, and
does not require an additional narrow type identifier in UAPI.
At the same time, a separate DPLL object is still needed for E825 TX clock
control/reporting semantics. Using DPLL_TYPE_GENERIC provides a reusable
class for devices outside PPS/EEC without overfitting UAPI naming to one
topology.
The relevant discussion is in [2].
Series content
- add a new generic DPLL type for devices outside PPS/EEC classes;
- relax DPLL pin registration rules for firmware-described shared pins
and extend pin notifications with a source identifier;
- allow dynamic state control of SyncE reference pins where hardware
supports it;
- add CPI infrastructure for PHY-side TX clock control on E825C;
- introduce a TX-clock DPLL device and TX reference clock pins
(EXT_EREF0 and SYNCE) in the ice driver;
- extend the Restart Auto-Negotiation command to carry a TX reference
clock index;
- implement hardware-backed TX reference clock switching, post-link
verification, and TX synchronization reporting.
TXCLK pins report TX reference topology only. Actual synchronization
success is reported via DPLL lock status, updated after hardware
verification: external TX references report LOCKED, while the internal
ENET/TXCO source reports UNLOCKED.
This provides reliable TX reference selection and observability on E825
devices using standard DPLL interfaces, without conflating user intent
with effective hardware behavior.
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:45 +0000 (20:30 +0200)]
ice: implement E825 TX ref clock control and TXC hardware sync status
Build on the previously introduced TXC DPLL framework and implement
full TX reference clock control and hardware-backed synchronization
status reporting for E825 devices.
E825 firmware may accept or override TX reference clock requests based
on device-wide routing constraints and link conditions. Because the
final selection becomes visible only after a link-up event, the driver
splits the observation into two complementary signals:
- TXCLK pin state reflects the requested TX reference clock
(pf->ptp.port.tx_clk_req). After a link-up, the value is reconciled
against the SERDES reference selector by
ice_txclk_update_and_notify(); if firmware or auto-negotiation
selected a different clock, tx_clk_req is overwritten so that pin
state converges to the actual hardware selection.
- TXC DPLL lock status reflects hardware synchronization:
* LOCKED when an external TX reference is in use
* UNLOCKED when falling back to ENET/TXCO, or when a requested
external reference has not (yet) been accepted by hardware.
Userspace observing only pin state therefore sees user intent, while
lock status is the authoritative indicator of whether the requested
clock is actually selected and synchronizing. This matches the DPLL
subsystem model where pin state describes topology and device lock
status describes signal quality.
TX reference selection topology:
- External references (SYNCE, EREF0) are represented as TXCLK pins
- The internal ENET/TXCO clock has no pin representation; when
selected, all TXCLK pins are reported DISCONNECTED
With this change, TX reference clocks on E825 devices can be reliably
selected, observed via standard DPLL interfaces, and monitored for
effective synchronization through TXC DPLL lock status.
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:44 +0000 (20:30 +0200)]
ice: add Tx reference clock index handling to AN restart command
Extend the Restart Auto-Negotiation (AN) AdminQ command with a new
parameter allowing software to specify the Tx reference clock index to
be used during link restart.
This patch:
- adds REFCLK field definitions to ice_aqc_restart_an
- updates ice_aq_set_link_restart_an() to take a new refclk parameter
and properly encode it into the command
- keeps legacy behavior by passing REFCLK_NOCHANGE where appropriate
This prepares the driver for configurations requiring dynamic selection
of the Tx reference clock as part of the AN flow.
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Link: https://patch.msgid.link/20260607183045.1213735-13-grzegorz.nitka@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:43 +0000 (20:30 +0200)]
ice: implement CPI support for E825C
Add full CPI (Converged PHY Interface) command handling required for
E825C devices. The CPI interface allows the driver to interact with
PHY-side control logic through the LM/PHY command registers, including
enabling/disabling/selection of PHY reference clock.
This patch introduces:
- a new CPI subsystem (ice_cpi.c / ice_cpi.h) implementing the CPI
request/acknowledge state machine, including REQ/ACK protocol,
command execution, and response handling
- helper functions for reading/writing PHY registers over Sideband
Queue
- CPI command execution API (ice_cpi_exec) and a helper for enabling or
disabling Tx reference clocks (CPI 0xF1 opcode 'Config PHY clocking')
- assurance of CPI transaction serialization into the CPI core.
CPI REQ/ACK is a multi-step handshake and must be executed
atomically per PHY. Centralize the lock in ice_cpi_exec() and
use adapter-scoped per-PHY mutexes, which match the hardware sharing
model across PFs.
- addition of the non-posted write opcode (wr_np) to SBQ
- Makefile integration to build CPI support together with the PTP stack
This provides the infrastructure necessary to support PHY-side
configuration flows on E825C and is required for advanced link control
and Tx reference clock management.
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Link: https://patch.msgid.link/20260607183045.1213735-12-grzegorz.nitka@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:42 +0000 (20:30 +0200)]
ice: introduce TXC DPLL device and TX ref clock pin framework for E825
E825 devices provide a dedicated TX clock (TXC) domain which may be
driven by multiple reference clock sources, including external board
references and port-derived SyncE. To support future TX clock control
and observability through the Linux DPLL subsystem, introduce a
separate TXC DPLL device (of DPLL_TYPE_GENERIC) and a framework for
representing TX reference clock inputs.
This change adds a new internal DPLL pin type (TXCLK) and registers
TX reference clock pins for E825-based devices:
- EXT_EREF0: a board-level external electrical reference
- SYNCE: a port-derived SyncE reference described via firmware nodes
The TXC DPLL device is created and managed alongside the existing
PPS and EEC DPLL instances. TXCLK pins are registered directly or
deferred via a notifier when backed by fwnode-described pins.
A per-pin attribute encodes the TX reference source associated with
each TXCLK pin.
At this stage, TXCLK pin state callbacks and TXC DPLL lock status
reporting are implemented as placeholders. Pin state getters always
return DISCONNECTED, and the TXC DPLL is initialized in the UNLOCKED
state. No hardware configuration or TX reference switching is
performed yet.
This patch establishes the structural groundwork required for
hardware-backed TX reference selection, verification, and
synchronization status reporting, which will be implemented in
subsequent patches.
Also signal dpll_init from the fwnode pin init error path so any
notifier worker already blocked on it can drain, avoiding a
flush_workqueue() deadlock during teardown.
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Link: https://patch.msgid.link/20260607183045.1213735-11-grzegorz.nitka@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:41 +0000 (20:30 +0200)]
dpll: allow fwnode pins to attempt state change without capability bit
Pins registered with an fwnode may have .state_on_dpll_set implemented
without advertising DPLL_PIN_CAPABILITIES_STATE_CAN_CHANGE upfront.
Requiring the bit for fwnode pins ties firmware description to driver
implementation details unnecessarily.
Relax the capability check in dpll_pin_state_set() and
dpll_pin_on_pin_state_set(): when a pin has an associated fwnode, bypass
the capability gate and let the ops layer decide, returning -EOPNOTSUPP
if .state_on_dpll_set is absent. Non-fwnode pins retain the original
strict behavior.
This is used later in the series by the SyncE_Ref output pin, which
relies on the fwnode path for state control.
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:40 +0000 (20:30 +0200)]
dpll: extend pin notifier with notification source ID
Extend the DPLL pin notification API to include a source identifier
indicating where the notification originates. This allows notifier
consumers to distinguish between notifications coming from
an associated DPLL instance, a parent pin, or the pin itself.
A new field, src_clock_id, is added to struct dpll_pin_notifier_info
and is passed through all pin-related notification paths. Callers of
dpll_pin_notify() are updated to provide a meaningful source identifier
based on their context:
- pin registration/unregistration uses the DPLL's clock_id,
- pin-on-pin operations use the parent pin's clock_id,
- pin changes use the pin's own clock_id.
As introduced in the commit ("dpll: allow registering FW-identified pin
with a different DPLL"), it is possible to share the same physical pin
via firmware description (fwnode) with DPLL objects from different
kernel modules. This means that a given pin can be registered multiple
times.
Driver such as ICE (E825 devices) rely on this mechanism when listening
for the event where a shared-fwnode pin appears, while avoiding reacting
to events triggered by their own registration logic.
This change only extends the notification metadata and does not alter
existing semantics for drivers that do not use the new field.
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Link: https://patch.msgid.link/20260607183045.1213735-9-grzegorz.nitka@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:39 +0000 (20:30 +0200)]
dpll: balance create/delete notifications in __dpll_pin_(un)register
__dpll_pin_register() emits dpll_pin_create_ntf() internally, but
__dpll_pin_unregister() left the matching delete to its callers. The
counts then diverge on dpll_pin_on_pin_register() rollback and on
dpll_pin_on_pin_unregister(), leaking stale notifications.
Emit dpll_pin_delete_ntf() inside __dpll_pin_unregister() and drop the
now-redundant call in dpll_pin_unregister().
Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base functions") Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20260607183045.1213735-8-grzegorz.nitka@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:38 +0000 (20:30 +0200)]
dpll: guard sync-pair removal on full pin unregister
__dpll_pin_unregister() wiped the global sync-pair state on every
(dpll, ops, priv, cookie) tuple removed from a pin. When a pin is
registered multiple times and only one registration is being torn
down, this dropped sync-pair pairings still in use by the surviving
registrations.
Move dpll_pin_ref_sync_pair_del() inside the xa_empty(&pin->dpll_refs)
branch so it only runs when the last registration is gone, alongside
clearing the DPLL_REGISTERED mark.
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:37 +0000 (20:30 +0200)]
dpll: emit per-dpll delete notifications in dpll_pin_on_pin_unregister()
dpll_pin_on_pin_register() emits a creation notification for every
parent->dpll_refs entry, but dpll_pin_on_pin_unregister() emitted only
one deletion notification outside the loop. When a pin is registered
against multiple parent dplls, userspace sees N creates but a single
delete and leaks per-dpll state.
Move dpll_pin_delete_ntf() into the loop and call it before
__dpll_pin_unregister() so the DPLL_REGISTERED mark is still set when
dpll_pin_available() is consulted.
Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base functions") Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20260607183045.1213735-6-grzegorz.nitka@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:36 +0000 (20:30 +0200)]
dpll: send delete notification before unregister in on-pin rollback
The rollback path in dpll_pin_on_pin_register() called
__dpll_pin_unregister() before dpll_pin_delete_ntf(). When the
unregister dropped the pin's last DPLL reference it cleared the
DPLL_REGISTERED mark in dpll_pin_xa, so the subsequent
dpll_pin_event_send() failed dpll_pin_available() and aborted with
-ENODEV. As a result userspace was never notified of the rollback
deletion and remained out of sync with the kernel.
Send the delete notification first, matching the order used by
dpll_pin_unregister() and dpll_pin_on_pin_unregister().
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:35 +0000 (20:30 +0200)]
dpll: fix stale iteration in dpll_pin_on_pin_unregister()
Neither parent->dpll_refs nor pin->dpll_refs on its own is a correct
iteration target at unregister time:
- pin->dpll_refs includes DPLLs the child was registered against
via a different parent or directly; blind unregister WARNs on
the cookie miss in dpll_xa_ref_pin_del().
- parent->dpll_refs reflects the parent's current attachments, not
those at child-register time. Another driver may have (un)reg'd
the parent against additional DPLLs in the meantime, so we miss
registrations that exist and visit DPLLs that have none.
Walk pin->dpll_refs and use dpll_pin_registration_find() to filter
to entries whose cookie is this parent. Symmetric with
dpll_pin_on_pin_register(), correct under any subsequent change to
parent->dpll_refs.
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:34 +0000 (20:30 +0200)]
dpll: allow registering FW-identified pin with a different DPLL
Relax the (module, clock_id) equality requirement when registering a
pin identified by firmware (pin->fwnode). Some platforms associate a
FW-described pin with a DPLL instance that differs from the pin's
(module, clock_id) tuple. For such pins, permit registration without
requiring the strict match. Non-FW pins still require equality.
Keep netlink pin module reporting/filtering safe for this relaxed
registration model by caching the module name in the pin object at
allocation time and using the cached string in netlink paths.
This avoids dereferencing pin->module after provider module teardown.
Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Link: https://patch.msgid.link/20260607183045.1213735-3-grzegorz.nitka@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:33 +0000 (20:30 +0200)]
dpll: add generic DPLL type
Add DPLL_TYPE_GENERIC to represent DPLL devices which do not fit the
existing PPS or EEC classes.
The UAPI type is intentionally generic. During netdev discussion,
maintainers pointed out that introducing identifiers tied to a specific
placement or single design does not scale across ASICs and vendors.
The role of a DPLL is already inferable from the spawning driver,
bus device, and pin topology, without encoding additional
purpose-specific taxonomy in the type name.
Using a generic type keeps the UAPI extensible and avoids premature
naming that may become incorrect as new hardware topologies are
exposed through the DPLL subsystem.
Expose the new type through UAPI and netlink specification as "generic".
1) Replace the open-coded manual cleanup in xfrm_add_policy() error
path with xfrm_policy_destroy() for consistency with
xfrm_policy_construct().
From Deepanshu Kartikey.
2) Limit XFRMA_TFCPAD to a sensible maximum (max IP length, 64k) since
u32 is excessive for traffic flow confidentiality padding.
From David Ahern.
3) Add a new netlink message XFRM_MSG_MIGRATE_STATE that
allows migrating individual IPsec SAs independently of
their policies. The existing XFRM_MSG_MIGRATE is tightly coupled
to policy+SA migration, lacks SPI for unique SA identification,
and cannot express reqid changes or migrate Transport mode
selectors. The new interface identifies the SA via SPI and mark,
supports reqid changes, address family changes, encap removal,
and uses an atomic create+install flow under x->lock to prevent
SN/IV reuse during AEAD SA migration.
From Antony Antony.
* tag 'ipsec-next-2026-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: add documentation for XFRM_MSG_MIGRATE_STATE
xfrm: restrict netlink attributes for XFRM_MSG_MIGRATE_STATE
xfrm: add XFRM_MSG_MIGRATE_STATE for single SA migration
xfrm: make xfrm_dev_state_add xuo parameter const
xfrm: extract address family and selector validation helpers
xfrm: refactor XFRMA_MTIMER_THRESH validation into a helper
xfrm: move encap and xuo into struct xfrm_migrate
xfrm: add error messages to state migration
xfrm: add state synchronization after migration
xfrm: check family before comparing addresses in migrate
xfrm: split xfrm_state_migrate into create and install functions
xfrm: rename reqid in xfrm_migrate
xfrm: fix NAT-related field inheritance in SA migration
xfrm: allow migration from UDP encapsulated to non-encapsulated ESP
xfrm: add extack to xfrm_init_state
xfrm: remove redundant assignments
xfrm: Reject excessive values for XFRMA_TFCPAD
xfrm: cleanup error path in xfrm_add_policy()
====================
====================
vsock: consolidate acceptq accounting into core helpers
These patches follow up on commit c05fa14db43e
("vsock/vmci: fix sk_ack_backlog leak on failed handshake")
by consolidating sk_acceptq_added() and sk_acceptq_removed() into
the core vsock helpers so transports cannot forget them.
Raf Dickson [Fri, 12 Jun 2026 04:52:16 +0000 (04:52 +0000)]
vsock: fold sk_acceptq_removed() into vsock_remove_pending()
Callers of vsock_remove_pending() must also call sk_acceptq_removed()
to keep sk_ack_backlog consistent. Move the call into
vsock_remove_pending() itself to make it automatic and prevent future
callers from forgetting it.
Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Raf Dickson <rafdog35@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260612045216.105796-5-rafdog35@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Raf Dickson [Fri, 12 Jun 2026 04:52:15 +0000 (04:52 +0000)]
vsock: fold sk_acceptq_added() into vsock_enqueue_accept()
virtio and hyperv call sk_acceptq_added() immediately before
vsock_enqueue_accept(). Move the call into vsock_enqueue_accept()
itself so callers cannot forget it and the accounting is consistent.
Suggested-by: Paolo Abeni <pabeni@redhat.com> Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Raf Dickson <rafdog35@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260612045216.105796-4-rafdog35@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Raf Dickson [Fri, 12 Jun 2026 04:52:14 +0000 (04:52 +0000)]
vsock: fold sk_acceptq_added() into vsock_add_pending()
Move sk_acceptq_added() into vsock_add_pending() so callers cannot
forget it. vmci is the only transport using the pending list and
is updated accordingly.
Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Raf Dickson <rafdog35@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260612045216.105796-3-rafdog35@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Raf Dickson [Fri, 12 Jun 2026 04:52:13 +0000 (04:52 +0000)]
vsock: introduce vsock_pending_to_accept() helper
Add vsock_pending_to_accept() to move a socket directly from the
pending list to the accept queue in a single operation, avoiding
the sock_put/sock_hold dance and the sk_acceptq_removed()/
sk_acceptq_added() pair that would otherwise be needed when
calling vsock_remove_pending() followed by vsock_enqueue_accept().
Use it in vmci_transport_recv_connecting_server() where a completed
handshake transitions the socket from pending to accept queue.
Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Raf Dickson <rafdog35@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260612045216.105796-2-rafdog35@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Raf Dickson [Fri, 12 Jun 2026 04:58:42 +0000 (04:58 +0000)]
vsock: use sk_acceptq_is_full() helper in all transports
Replace the open-coded backlog check with sk_acceptq_is_full().
The helper uses > instead of >=, which is the correct comparison
per commit 64a146513f8f ("[NET]: Revert incorrect accept queue
backlog changes."), and adds READ_ONCE() for proper memory ordering.
Florian Westphal [Fri, 12 Jun 2026 09:22:09 +0000 (11:22 +0200)]
selftests: netfilter: add phony nft_offload test
... "phony", because its not testing offloads, it tests the control
plane code. Also test error unwind via fault injection framework.
For a proper test, real hardware would be required given we'd have
check if 'previously handed off to hardware' offload commands are
properly removed again on failure or rule flush.
Florian Westphal [Fri, 12 Jun 2026 09:22:08 +0000 (11:22 +0200)]
netdevsim: tc: allow to test nf_tables offload control plane code
The actual 'offload' is phony, all commands are ignored: this is only
useful to test control plane code.
Tag the existing callback to permit error injection to test rollback/abort
code in nf_tables. This is also for fuzzers - the fault injection
framework allows probabilistic error insertion.