git.ipfire.org Git - thirdparty/kernel/linux.git/log

dpll: zl3073x: detect DPLL channel count from chip ID at runtime

Replace the five per-variant zl3073x_chip_info structures and their
exported symbol definitions with a single consolidated chip ID lookup
table. The chip variant is now detected at runtime by reading the chip
ID register from hardware and looking it up in the table, rather than
being selected at compile time via the bus driver match data.

Repurpose struct zl3073x_chip_info to hold a single chip ID, its
channel count, and a flags field. Introduce enum zl3073x_flags with
ZL3073X_FLAG_REF_PHASE_COMP_32 to replace the chip_id switch statement
in zl3073x_dev_is_ref_phase_comp_32bit(). Store a pointer to the
detected chip_info entry in struct zl3073x_dev for runtime access.

This simplifies the bus drivers by removing per-variant .data and
.driver_data references from the I2C/SPI match tables, and makes
adding support for new chip variants a single-line table addition.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Link: https://patch.msgid.link/20260227105300.710272-2-ivecera@redhat.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: mana: Trigger VF reset/recovery on health check failure due to HWC timeout

The GF stats periodic query is used as mechanism to monitor HWC health
check. If this HWC command times out, it is a strong indication that
the device/SoC is in a faulty state and requires recovery.

Today, when a timeout is detected, the driver marks
hwc_timeout_occurred, clears cached stats, and stops rescheduling the
periodic work. However, the device itself is left in the same failing
state.

Extend the timeout handling path to trigger the existing MANA VF
recovery service by queueing a GDMA_EQE_HWC_RESET_REQUEST work item.
This is expected to initiate the appropriate recovery flow by suspende
resume first and if it fails then trigger a bus rescan.

This change is intentionally limited to HWC command timeouts and does
not trigger recovery for errors reported by the SoC as a normal command
response.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/aaFShvKnwR5FY8dH@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

atm: atmdev: add function parameter names and description

kernel-doc reports function parameters not described for parameters
that are not named. Add parameter names for these functions and then
describe the function parameters in kernel-doc format.

Fixes these warnings:
Warning: include/linux/atmdev.h:316 function parameter '' not described
in 'register_atm_ioctl'
Warning: include/linux/atmdev.h:321 function parameter '' not described
in 'deregister_atm_ioctl'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20260228220845.2978547-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dccp Remove inet_hashinfo2_init_mod().

Commit c92c81df93df ("net: dccp: fix kernel crash on module load")
added inet_hashinfo2_init_mod() for DCCP.

Commit 22d6c9eebf2e ("net: Unexport shared functions for DCCP.")
removed EXPORT_SYMBOL_GPL() it but forgot to remove the function
itself.

Let's remove inet_hashinfo2_init_mod().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260301063756.1581685-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ipmr-no-rtnl-for-rtnl_family_ipmr-rtnetlink'

Kuniyuki Iwashima says:

====================
ipmr: No RTNL for RTNL_FAMILY_IPMR rtnetlink.

This series removes RTNL from ipmr rtnetlink handlers.

After this series, there are a few RTNL left in net/ipv4/ipmr.c
and such users will be converted to per-netns RTNL in another
series.

Patch 1 adds a selftest to exercise most? of the RTNL paths
in net/ipv4/ipmr.c

Patch 2 - 6 converts RTM_GETLINK / RTM_GETROUTE handlers
to RCU.

Patch 7 - 9 converts ->exit_batch() to ->exit_rtnl() to
save one RTNL in cleanup_net().

Patch 10 - 11 removes unnecessary RTNL during setup_net()
failure.

Patch 12 is a random cleanup.

Patch 13 - 15 drops RTNL for RTM_NEWROUTE and RTM_DELROUTE.
====================

Link: https://patch.msgid.link/20260228221800.1082070-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipmr: Don't hold RTNL for ipmr_rtm_route().

ipmr_mfc_add() and ipmr_mfc_delete() are already protected
by a dedicated mutex.

rtm_to_ipmr_mfcc() calls __ipmr_get_table(), __dev_get_by_index(),
amd ipmr_find_vif().

Once __dev_get_by_index() is converted to dev_get_by_index_rcu(),
we can move the other two functions under that same RCU section
and drop RTNL for ipmr_rtm_route().

Let's do that conversion and drop ASSERT_RTNL() in
mr_call_mfc_notifiers().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-16-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipmr: Add dedicated mutex for mrt->{mfc_hash,mfc_cache_list}.

We will no longer hold RTNL for ipmr_rtm_route() to modify the
MFC hash table.

Only __dev_get_by_index() in rtm_to_ipmr_mfcc() is the RTNL
dependant, otherwise, we just need protection for mrt->mfc_hash
and mrt->mfc_cache_list.

Let's add a new mutex for ipmr_mfc_add(), ipmr_mfc_delete(),
and mroute_clean_tables() (setsockopt(MRT_FLUSH or MRT_DONE)).

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-15-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipmr/ip6mr: Convert net->ipv[46].ipmr_seq to atomic_t.

We will no longer hold RTNL for ipmr_mfc_add() and ipmr_mfc_delete().

MFC entry can be loosely connected with VIF by its index for
mrt->vif_table[] (stored in mfc_parent), but the two tables are
not synchronised. i.e. Even if VIF 1 is removed, MFC for VIF 1
is not automatically removed.

The only field that the MFC/VIF interfaces share is
net->ipv[46].ipmr_seq, which is protected by RTNL.

Adding a new mutex for both just to protect a single field is overkill.

Let's convert the field to atomic_t.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-14-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipmr: Define net->ipv4.{ipmr_notifier_ops,ipmr_seq} under CONFIG_IP_MROUTE.

net->ipv4.ipmr_notifier_ops and net->ipv4.ipmr_seq are used
only in net/ipv4/ipmr.c.

Let's move these definitions under CONFIG_IP_MROUTE.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-13-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipmr: Call fib_rules_unregister() without RTNL.

fib_rules_unregister() removes ops from net->rules_ops under
spinlock, calls ops->delete() for each rule, and frees the ops.

ipmr_rules_ops_template does not have ->delete(), and any
operation does not require RTNL there.

Let's move fib_rules_unregister() from ipmr_rules_exit_rtnl()
to ipmr_net_exit().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-12-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipmr: Remove RTNL in ipmr_rules_init() and ipmr_net_init().

When ipmr_free_table() is called from ipmr_rules_init() or
ipmr_net_init(), the netns is not yet published.

Thus, no device should have been registered, and
mroute_clean_tables() will not call vif_delete(), so
unregister_netdevice_many() is unnecessary.

unregister_netdevice_many() does nothing if the list is empty,
but it requires RTNL due to the unconditional ASSERT_RTNL()
at the entry of unregister_netdevice_many_notify().

Let's remove unnecessary RTNL and ASSERT_RTNL() and instead
add WARN_ON_ONCE() in ipmr_free_table().

Note that we use a local list for the new WARN_ON_ONCE() because
dev_kill_list passed from ipmr_rules_exit_rtnl() may have some
devices when other ops->init() fails after ipmr durnig setup_net().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-11-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipmr: Convert ipmr_net_exit_batch() to ->exit_rtnl().

ipmr_net_ops uses ->exit_batch() to acquire RTNL only once
for dying network namespaces.

ipmr does not depend on the ordering of ->exit_rtnl() and
->exit_batch() of other pernet_operations (unlike fib_net_ops).

Once ipmr_free_table() is called and all devices are
queued for destruction in ->exit_rtnl(), later during
NETDEV_UNREGISTER, ipmr_device_event() will not see anything
in vif table and just do nothing.

Let's convert ipmr_net_exit_batch() to ->exit_rtnl().

Note that fib_rules_unregister() does not need RTNL and
we will remove RTNL and unregister_netdevice_many() in
ipmr_net_init().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-10-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipmr: Move unregister_netdevice_many() out of ipmr_free_table().

This is a prep commit to convert ipmr_net_exit_batch() to
->exit_rtnl().

Let's move unregister_netdevice_many() in ipmr_free_table()
to its callers.

Now ipmr_rules_exit() can do batching all tables per netns.

Note that later we will remove RTNL and unregister_netdevice_many()
in ipmr_rules_init().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-9-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipmr: Move unregister_netdevice_many() out of mroute_clean_tables().

This is a prep commit to convert ipmr_net_exit_batch() to
->exit_rtnl().

Let's move unregister_netdevice_many() in mroute_clean_tables()
to its callers.

As a bonus, mrtsock_destruct() can do batching for all tables.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipmr: Convert ipmr_rtm_dumproute() to RCU.

ipmr_rtm_dumproute() calls mr_table_dump() or mr_rtm_dumproute(),
and mr_rtm_dumproute() finally calls mr_table_dump().

mr_table_dump() calls the passed function, _ipmr_fill_mroute().

_ipmr_fill_mroute() is a wrapper of ipmr_fill_mroute() to cast
struct mr_mfc * to struct mfc_cache *.

ipmr_fill_mroute() can be already called safely under RCU.

Let's convert ipmr_rtm_dumproute() to RCU.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipmr: Convert ipmr_rtm_getroute() to RCU.

ipmr_rtm_getroute() calls __ipmr_get_table(), ipmr_cache_find(),
and ipmr_fill_mroute().

The table is not removed until netns dismantle, and net->ipv4.mr_tables
is managed with RCU list API, so __ipmr_get_table() is safe under RCU.

struct mfc_cache is freed by mr_cache_put() after RCU grace period,
so we can use ipmr_cache_find() under RCU. rcu_read_lock() around
it was just to avoid lockdep splat for rhl_for_each_entry_rcu().

ipmr_fill_mroute() calls mr_fill_mroute(), which properly uses RCU.

Let's drop RTNL for ipmr_rtm_getroute() and use RCU instead.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipmr: Use MAXVIFS in mroute_msgsize().

mroute_msgsize() calculates skb size needed for ipmr_fill_mroute().

The size differs based on mrt->maxvif.

We will drop RTNL for ipmr_rtm_getroute() and mrt->maxvif may
change under RCU.

To avoid -EMSGSIZE, let's calculate the size with the maximum
value of mrt->maxvif, MAXVIFS.

struct rtnexthop is 8 bytes and MAXVIFS is 32, so the maximum delta
is 256 bytes, which is small enough.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipmr: Convert ipmr_rtm_dumplink() to RCU.

net->ipv4.mr_tables is updated under RTNL and can be read
safely under RCU.

Once created, the multicast route tables are not removed
until netns dismantle.

ipmr_rtm_dumplink() does not need RTNL protection for
ipmr_for_each_table() and ipmr_fill_table() if RCU is held.

Even if mrt->maxvif changes concurrently, ipmr_fill_vif()
returns true to continue dumping the next table.

Let's convert it to RCU.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipmr: Annotate access to mrt->mroute_do_{pim,assert,wrvifwhole}.

These fields in struct mr_table are updated in ip_mroute_setsockopt()
under RTNL:

  * mroute_do_pim
  * mroute_do_assert
  * mroute_do_wrvifwhole

However, ip_mroute_getsockopt() does not hold RTNL and read the first
two fields locklessly, and ip_mr_forward() reads all the three under
RCU.  pim_rcv_v1() also reads mroute_do_pim locklessly.

Let's use WRITE_ONCE() and READ_ONCE() for them.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftest: net: Add basic functionality tests for ipmr.

The new test exercise paths, where RTNL is needed, to
catch lockdep splat:

  setsockopt
    MRT_INIT / MRT_DONE
    MRT_ADD_VIF / MRT_DEL_VIF
    MRT_ADD_MFC / MRT_DEL_MFC / MRT_ADD_MFC_PROXY / MRT_DEL_MFC_PROXY
    MRT_TABLE
    MRT_FLUSH

  rtnetlink
    RTM_NEWROUTE
    RTM_DELROUTE

  NETDEV_UNREGISTER

I will extend this to cover IPv6 setsockopt() later.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mpls: remove test against ipv6_stub

ipv6_stub is never NULL, let's remove this test.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260228175715.1195536-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-sparx5-clean-up-probe-remove-init-and-deinit-paths'

Daniel Machon says:

====================
net: sparx5: clean up probe/remove init and deinit paths

This series refactors the sparx5 init and deinit code out of
sparx5_start() and into probe(), adding proper per-subsystem cleanup
labels and deinit functions.

Currently, the sparx5 driver initializes most subsystems inside
sparx5_start(), which is called from probe(). This includes registering
netdevs, starting worker threads for stats and MAC table polling,
requesting PTP IRQs, and initializing VCAP. The function has grown to
handle many unrelated subsystems, and has no granular error handling —
it either succeeds entirely or returns an error, leaving cleanup to a
single catch-all label in probe().

The remove() path has a similar problem: teardown is not structured as
the reverse of initialization, and several subsystems lack proper deinit
functions. For example, the stats workqueue has no corresponding
cleanup, and the mact workqueue is destroyed without first cancelling
its delayed work.

Refactor this by moving each init function out of sparx5_start() and
into probe(), with a corresponding goto-based cleanup label. Add deinit
functions for subsystems that allocate resources, to properly cancel
work and destroy workqueues. Ensure that cleanup order in both error
paths and remove() follows the reverse of initialization order.
sparx5_start() is eliminated entirely — its hardware register setup
is renamed to sparx5_forwarding_init() and its FDMA/XTR setup is
extracted to sparx5_frame_io_init().

Before this series, most init functions live inside sparx5_start() with
no individual cleanup:

  probe():
    sparx5_start():              <- no granular error handling
      sparx5_mact_init()
      sparx_stats_init()           <- starts worker, no cleanup
      mact_queue setup             <- no cancel on teardown
      sparx5_register_netdevs()
      sparx5_register_notifier_blocks()
      sparx5_vcap_init()
    sparx5_ptp_init()

    probe() error path:
      cleanup_ports:
        sparx5_cleanup_ports()
        destroy_workqueue(mact_queue)

After this series, probe() initializes subsystems in order with
matching cleanup labels, and remove() tears down in reverse:

  probe():
    sparx5_pgid_init()
    sparx5_vlan_init()
    sparx5_board_init()
    sparx5_forwarding_init()
    sparx5_calendar_init()         -> cleanup_ports
    sparx5_qos_init()              -> cleanup_ports
    sparx5_vcap_init()             -> cleanup_ports
    sparx5_mact_init()             -> cleanup_vcap
    sparx5_stats_init()            -> cleanup_mact
    sparx5_frame_io_init()         -> cleanup_stats
    sparx5_ptp_init()              -> cleanup_frame_io
    sparx5_register_netdevs()      -> cleanup_ptp
    sparx5_register_notifier_blocks() -> cleanup_netdevs

  remove():
    sparx5_unregister_notifier_blocks()
    sparx5_unregister_netdevs()
    sparx5_ptp_deinit()
    sparx5_frame_io_deinit()
    sparx5_stats_deinit()
    sparx5_mact_deinit()
    sparx5_vcap_deinit()
    sparx5_destroy_netdevs()
====================

Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-0-10ba54ccf005@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sparx5: replace sparx5_start() with sparx5_forwarding_init()

With all subsystem initializations moved out, sparx5_start() only sets
up forwarding (UPSIDs, CPU ports, masks, PGIDs, FCS, watermarks).
Rename it to sparx5_forwarding_init() and make it void since it cannot
fail. This removes sparx5_start() entirely.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-9-10ba54ccf005@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sparx5: move FDMA/XTR initialization out of sparx5_start()

Move the Frame DMA and register-based extraction initialization out of
sparx5_start() and into a new sparx5_frame_io_init() function, called
from probe().

Also, add sparx5_frame_io_deinit() for the cleanup path.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-8-10ba54ccf005@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sparx5: move PTP IRQ handling out of sparx5_start()

Move the PTP IRQ request into sparx5_ptp_init() so all PTP setup is
done in one place.

Also move the sparx5_ptp_init() call to right before
sparx5_register_netdevs() and add a cleanup_ptp label. Update remove()
to disable the PTP IRQ and reorder ptp_deinit accordingly.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-7-10ba54ccf005@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sparx5: move remaining init functions from start() to probe()

Move sparx5_pgid_init(), sparx5_vlan_init(), and sparx5_board_init()
from sparx5_start() to probe(). These functions do not require cleanup.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-6-10ba54ccf005@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sparx5: move calendar initialization to probe

Move the calendar initialization from sparx5_start() to probe() by
creating a new sparx5_calendar_init() wrapper function that calls both
sparx5_config_auto_calendar() and sparx5_config_dsm_calendar().
Calendar initialization does not require cleanup.

Also, make the individual calendar config functions static since they
are now only called from within sparx5_calendar.c.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-5-10ba54ccf005@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sparx5: move stats initialization and add deinit function

The sparx5_stats_init() function starts a worker thread which needs to
be cleaned up. Move the initialization code to probe() and add a
deinit() function for proper teardown.

Also, rename sparx_stats_init() to sparx5_stats_init() to match the
driver naming convention.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-4-10ba54ccf005@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sparx5: move MAC table initialization and add deinit function

Consolidate all MAC table initialization from sparx5_start() into
sparx5_mact_init(), move it to probe(), and add a deinit function for
proper teardown.

Also, make sparx5_mact_pull_work() static since it is only used within
sparx5_mactable.c.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-3-10ba54ccf005@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sparx5: move VCAP initialization to probe

Move the VCAP initialization code from sparx5_start() to probe(). Add
proper error handling with a cleanup_vcap label and sparx5_vcap_deinit()
call.

Also, rename sparx5_vcap_destroy() to sparx5_vcap_deinit() to stay
consistent with the naming.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-2-10ba54ccf005@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sparx5: move netdev and notifier block registration to probe

Move netdev registration and notifier block registration from
sparx5_start() to probe(). This allows proper cleanup via goto-based
error labels in probe().

Also, remove the sparx5_cleanup_ports() helper as its functionality is now
split between sparx5_unregister_netdevs() and sparx5_destroy_netdevs()
called at appropriate points.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-1-10ba54ccf005@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-stmmac-further-cleanups'

Russell King says:

====================
net: stmmac: further cleanups

Yet another bunch of patches cleaning up the stmmac driver.

We start off by cleaning up the formatting for stmmac_mac_finish(). Then
remove a plat_dat->port_node which is redundant, followed by several
descriptor methods that aren't called.

We then remove useless dwmac4 interrupt definitions, and realise that
v4.10 definitions are the same as v4.0, so get rid of those as well.
We also remove the write-only priv->hw->xlgmac member.

Next, we change priv->extend_desc and priv->chain_mode to be a boolean
and document what each of these are doing. Also do the same for
dma_cfg->fixed_burst and dma_cfg->mixed_burst.

Then, move the initialisation of dma_cfg->atds into stmmac_hw_init()
as this is where we have all the dependencies for this known, and
simplify its initialisation. Also comment what this is doing.

Finally, move the check that priv->plat->dma_cfg is present and the
programmable burst limit is set into the driver probe rather than
checking it each time we are just about to reset the dwmac core.
It is unnecessary to keep checking this. This makes a platform glue
driver fail early when it hasn't setup everything that's required
rather than when attempting to bring the netdev up for the first time.
====================

Link: https://patch.msgid.link/aaFpZvuIzOLaNM0m@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: move DMA configuration validation to driver probe

Move the DMA configuration validation from stmmac_init_dma_engine()
to the start of the driver probe function. The platform glue is
expected to supply the DMA configuration, and a non-zero programmable
burst length (bpl).

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuY3-0000000Avnq-1Spv@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: simplify atds initialisation

atds is boolean, and there is only one place that its value is changed.
Simplify this to a boolean assignment.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuXy-0000000Avnk-10Q8@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: move initialisation of dma_cfg->atds

Move the initialisation of priv->plat->dma_cfg->atds, which indicates
that 8 32-bit word descriptors are being used for pre-v4.0 cores, after
the call to stmmac_hwif_init(), which will initialise priv->extend_desc
and priv->mode (the descriptor mode.)

We don't need to re-evaluate this in stmmac_init_dma_engine() - as the
state that it depends on only changes in stmmac_hwif_init() which is
only called in the probe path. Also, once set, no code clears this
flag.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuXt-0000000Avnc-0UYC@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: make dma_cfg mixed/fixed burst boolean

struct stmmac_dma_cfg mixed_burst/fixed_burst members are both boolean
in nature - of_property_read_bool() are used to read these from DT, and
they are only tested for non-zero values. Use bool to avoid unnecessary
padding in this structure.

Update dwmac-intel to initialise these using true rather than '1', and
remove the '0' initialisers as the struct is already zero initialised
on allocation.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuXn-0000000AvnX-4A1u@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: make chain_mode a boolean

priv->chain_mode is only tested for non-zero, so it can be a boolean.
Change its type to boolean, and add a comment describing this member.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuXi-0000000AvnR-3btC@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: make extend_desc boolean

extend_desc is a boolean, so make it so, and use "true" to assign it.
Add a comment to describe what this member does.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuXd-0000000AvnL-36K3@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: remove mac->xlgmac

mac->xlgmac is only ever written to by the dwxlgmac2_quirk() function.
Remove mac->xlgmac, and the quirk function that then becomes redundant.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuXY-0000000AvnF-2ccv@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: remove dwmac410_(enable|disable)_dma_irq

As a result of the previous cleanup, it is now obvious that there are
no differences between the dwmac4 and dwmac410 versions of the DMA
interrupt enable/disable functions.

Moreover, dwmac410_disable_dma_irq() is completely unused; instead,
dwmac4_disable_dma_irq() is used to disable the interrupts for v4.10a
cores while dwmac410_enable_dma_irq() was being used to enable these
same same interrupts.

Remove the unnecessary v4.10a functions.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuXT-0000000Avn9-29US@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: remove dwmac4 DMA_CHAN_INTR_DEFAULT_[TR]X*

Remove the DMA_CHAN_INTR_DEFAULT_[TR]X* definitions, which are aliases
of their respective DMA_CHAN_INTR_ENA_[TR]IE definitions.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuXO-0000000Avn3-1hhD@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: remove .get_tx_len()

No code calls stmmac_get_tx_len(). Remove this macro, its associated
function pointer, and all implementations.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuXJ-0000000Avmx-1B8Y@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: remove .get_tx_ls()

No code calls stmmac_get_tx_ls(). Remove this macro, its associated
function pointer, and all implementations.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuXE-0000000Avmr-0eB0@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: remove .get_tx_owner()

No code calls stmmac_get_tx_owner(). Remove the macro, its associated
function pointer, and all implementations.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuX9-0000000Avml-08Lo@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: remove plat_dat->port_node

There are repeated instances of:

fwnode = priv->plat->port_node;
if (!fwnode)
fwnode = dev_fwnode(priv->device);

However, the only place that ->port_node is set is
stmmac_probe_config_dt():

struct device_node *np = pdev->dev.of_node;
...
/* PHYLINK automatically parses the phy-handle property */
plat->port_node = of_fwnode_handle(np);

which is equivalent to dev_fwnode(&pdev->dev) and, as priv->device
will be &pdev->dev, is also equivalent to dev_fwnode(priv->device).

Thus, plat_dat->port_node doesn't provide any extra benefit over
using dev_fwnode(priv->device) directly.

There is one case where port_node is used directly, which can be
found in stmmac_pcs_setup(). This may cause a change of behaviour
as PCI drivers do not populate plat_dat->port_node, but
dev_fwnode(priv->device) may be valid. PCI-based stmmac should
be tested.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuX3-0000000Avme-3oej@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: clean up formatting in stmmac_mac_finish()

Wrap the arguments for priv->plat->mac_finish() to avoid an overly long
line.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuWy-0000000AvmY-3GWN@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-drv-net-iou-zcrx-improve-stability-and-make-the-large-chunk-test-work'

Jakub Kicinski says:

====================
selftests: drv-net: iou-zcrx: improve stability and make the large chunk test work

The iou-zcrx test hasn't been passing in NIPA, I assumed it's because
we're missing iouring changes, but it's still failing after the merge
window. Turns out there was a bug in the implementation which was fixed
separately via the iouring tree. With that out of the way the tests
are passing but flaky. Patch 1 deals with the flakiness.

While looking at this I also noticed that the large chunk test isn't
running at all. So fix and enable it (patches 2 and 3).
====================

Link: https://patch.msgid.link/20260227171305.2848240-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: iou-zcrx: allocate hugepages for large chunks test

The large chunks test needs 2MB hugepages for its mmap allocation,
but the test system may not have any pre-allocated. Ensure at least
64 hugepages are available before running the test, and restore the
original value on cleanup.

While at it strip the stdout, it has a trailing new line.

Before: ok 5 iou-zcrx.test_zcrx_large_chunks # SKIP Can't allocate huge pages
Link: https://patch.msgid.link/20260227171305.2848240-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: iou-zcrx: rework large chunks test to use common setup

Commit a32bb32d0193 ("selftests: iou-zcrx: test large chunk sizes")
and commit de7c600e2d5b ("selftests/net: parametrise iou-zcrx.py with
ksft_variants") landed at similar time. The large chunks test was
actually not included in the list of tests, so it never run.
We haven't noticed that it uses the old-style helpers
(_get_combined_channels, _get_current_settings, _set_flow_rule)
that were removed by the other commit.

Rework test_zcrx_large_chunks to reuse the single() setup function
and add it to the ksft_run cases list so it actually gets executed.

Link: https://patch.msgid.link/20260227171305.2848240-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: iou-zcrx: wait for memory provider cleanup

io_uring defers zcrx context teardown to the iou_exit workqueue.

  # ps aux | grep iou
  ...    07:58   0:00 [kworker/u19:0-iou_exit]
  ... 07:58   0:00 [kworker/u18:2-iou_exit]

When the test's receiver process exits, bkg() returns but the memory
provider may still be attached to the rx queue. The subsequent defer()
that restores tcp-data-split then fails:

  # Exception while handling defer / cleanup (callback 3 of 3)!
  # Defer Exception| net.ynl.pyynl.lib.ynl.NlError:
      Netlink error: can't disable tcp-data-split while device has
                     memory provider enabled: Invalid argument
  not ok 1 iou-zcrx.test_zcrx.single

Add a helper that polls netdev queue-get until no rx queue reports
the io-uring memory provider attribute. Register it as a defer()
just before tcp-data-split is restored as a "barrier".

Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://patch.msgid.link/20260227171305.2848240-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: remove addr_len argument of recvmsg() handlers

Use msg->msg_namelen as a place holder instead of a
temporary variable, notably in inet[6]_recvmsg().

This removes stack canaries and allows tail-calls.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux
add/remove: 0/0 grow/shrink: 2/19 up/down: 26/-532 (-506)
Function                                     old     new   delta
rawv6_recvmsg                                744     767     +23
vsock_dgram_recvmsg                           55      58      +3
vsock_connectible_recvmsg                     50      47      -3
unix_stream_recvmsg                          161     158      -3
unix_seqpacket_recvmsg                        62      59      -3
unix_dgram_recvmsg                            42      39      -3
tcp_recvmsg                                  546     543      -3
mptcp_recvmsg                               1568    1565      -3
ping_recvmsg                                 806     800      -6
tcp_bpf_recvmsg_parser                       983     974      -9
ip_recv_error                                588     576     -12
ipv6_recv_rxpmtu                             442     428     -14
udp_recvmsg                                 1243    1224     -19
ipv6_recv_error                             1046    1024     -22
udpv6_recvmsg                               1487    1461     -26
raw_recvmsg                                  465     437     -28
udp_bpf_recvmsg                             1027     984     -43
sock_common_recvmsg                          103      27     -76
inet_recvmsg                                 257     175     -82
inet6_recvmsg                                257     175     -82
tcp_bpf_recvmsg                              663     568     -95
Total: Before=25143834, After=25143328, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260227151120.1346573-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'phy-qcom-sgmii-eth-add-set_mode-and-validate-methods'

net: stmmac: qcom-ethqos: further serdes reorganisation [part]

First PHY patch of Russell's series. Vladimir will need this
to avoid a conflict with his work.

Link: https://patch.msgid.link/aaDSJAc-x2-klvHJ@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

phy: qcom-sgmii-eth: add .set_mode() and .validate() methods

qcom-sgmii-eth is an Ethernet SerDes supporting only Ethernet mode
using SGMII, 1000BASE-X and 2500BASE-X.

Add an implementation of the .set_mode() method, which can be used
instead of or as well as the .set_speed() method. The Ethernet
interface modes mentioned above all have a fixed data rate, so
setting the mode is sufficient to fully specify the operating
parameters.

Add an implementation of the .validate() method, which will be
necessary to allow discovery of the SerDes capabilities for platform
independent SerDes support in the stmmac network driver.

Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Acked-by: Vinod Koul <vkoul@kernel.org>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvkU3-0000000AuP2-0hu3@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-sched-refactor-qdisc-drop-reasons-into-dedicated-tracepoint'

Jesper Dangaard Brouer says:

====================
net: sched: refactor qdisc drop reasons into dedicated tracepoint

This series refactors qdisc drop reason handling by introducing a dedicated
enum qdisc_drop_reason and trace_qdisc_drop tracepoint, providing qdisc
layer drop diagnostics with direct qdisc context visibility.

Background:
-----------
Identifying which qdisc dropped a packet via skb_drop_reason is difficult.
Normally, the kfree_skb tracepoint caller "location" hints at the dropping
code, but qdisc drops happen at a central point (__dev_queue_xmit), making
this unusable. As a workaround, commits 5765c7f6e317 ("net_sched: sch_fq:
add three drop_reason") and a42d71e322a8 ("net_sched: sch_cake: Add drop
reasons") encoded qdisc names directly in the drop reason enums.

This series provides a cleaner solution by creating a dedicated qdisc
tracepoint that naturally includes qdisc context (handle, parent, kind).

Solution:
---------
Create a new tracepoint trace_qdisc_drop that builds on top of existing
trace_qdisc_enqueue infrastructure. It includes qdisc handle, parent,
qdisc kind (name), and device information directly.

The existing SKB_DROP_REASON_QDISC_DROP is retained for backwards
compatibility via kfree_skb_reason(). The qdisc-specific drop reasons
(QDISC_DROP_*) provide fine-grained detail via the new tracepoint.
The enum uses subsystem encoding (offset by SKB_DROP_REASON_SUBSYS_QDISC)
to catch type mismatches during debugging.

This implements the alternative approach described in:
https://lore.kernel.org/all/6be17a08-f8aa-4f91-9bd0-d9e1f0a92d90@kernel.org/
====================

Link: https://patch.msgid.link/177211325634.3011628.9343837509740374154.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sched: sch_dualpi2: use qdisc_dequeue_drop() for dequeue drops

DualPI2 drops packets during dequeue but was using kfree_skb_reason()
directly, bypassing trace_qdisc_drop. Convert to qdisc_dequeue_drop()
and add QDISC_DROP_L4S_STEP_NON_ECN to the qdisc drop reason enum.

- Set TCQ_F_DEQUEUE_DROPS flag in dualpi2_init()
- Use enum qdisc_drop_reason in drop_and_retry()
- Replace kfree_skb_reason() with qdisc_dequeue_drop()

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211351978.3011628.11267023360997620069.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sched: rename QDISC_DROP_CAKE_FLOOD to QDISC_DROP_FLOOD_PROTECTION

Rename QDISC_DROP_CAKE_FLOOD to QDISC_DROP_FLOOD_PROTECTION to use a
generic name without embedding the qdisc name. This follows the
principle that drop reasons should describe the drop mechanism rather
than being tied to a specific qdisc implementation.

The flood protection drop reason is used by qdiscs implementing
probabilistic drop algorithms (like BLUE) that detect unresponsive
flows indicating potential DoS or flood attacks. CAKE uses this via
its Cobalt AQM component.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211347537.3011628.13759059534638729639.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sched: rename QDISC_DROP_FQ_* to generic names

Rename FQ-specific drop reasons to generic names:
- QDISC_DROP_FQ_BAND_LIMIT -> QDISC_DROP_BAND_LIMIT
- QDISC_DROP_FQ_HORIZON_LIMIT -> QDISC_DROP_HORIZON_LIMIT

This follows the principle that drop reasons should describe the drop
mechanism rather than being tied to a specific qdisc implementation.
These concepts (priority band limits, timestamp horizon) could apply
to other qdiscs as well.

Remove the local macro define FQDR() and instead use the
full QDISC_DROP_* name to make it easier to navigate code.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211346902.3011628.12523261489552097455.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sched: sfq: convert to qdisc drop reasons

Convert SFQ to use the new qdisc-specific drop reason infrastructure.

This patch demonstrates how to convert a flow-based qdisc to use the
new enum qdisc_drop_reason. As part of this conversion:

- Add QDISC_DROP_MAXFLOWS for flow table exhaustion
- Rename FQ_FLOW_LIMIT to generic FLOW_LIMIT, now shared by FQ and SFQ
- Use QDISC_DROP_OVERLIMIT for sfq_drop() when overall limit exceeded
- Use QDISC_DROP_FLOW_LIMIT for per-flow depth limit exceeded

The FLOW_LIMIT reason is now a common drop reason for per-flow limits,
applicable to both FQ and SFQ qdiscs.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211345946.3011628.12770616071857185664.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sched: introduce qdisc-specific drop reason tracing

Create new enum qdisc_drop_reason and trace_qdisc_drop tracepoint
for qdisc layer drop diagnostics with direct qdisc context visibility.

The new tracepoint includes qdisc handle, parent, kind (name), and
device information. Existing SKB_DROP_REASON_QDISC_DROP is retained
for backwards compatibility via kfree_skb_reason().

Convert qdiscs with drop reasons to use the new infrastructure.

Change CAKE's cobalt_should_drop() return type from enum skb_drop_reason
to enum qdisc_drop_reason to fix implicit enum conversion warnings.
Use QDISC_DROP_UNSPEC as the 'not dropped' sentinel instead of
SKB_NOT_DROPPED_YET. Both have the same compiled value (0), so the
comparison logic remains semantically equivalent.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211345275.3011628.1974310302645218067.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'icmp-fix-icmp-error-source-address-over-xfrm-tunnel'

Antony Antony says:

====================
icmp: Fix icmp error source address over xfrm tunnel

icmp: Fix icmp error source address over xfrm tunnel

This fix, originally sent to XFRM/IPsec, has been recommended by
Steffen Klassert to submit to the net tree, since it changes ICMP
behavior.

The patch addresses a minor issue related to the IPv4 source address
of ICMP error messages. The bug only occurs when xfrm policies are
configured. It originated from an old 2011 commit:

commit 415b3334a21a ("icmp: Fix regression in nexthop resolution during
replies.")
====================

Link: https://patch.msgid.link/cover.1772101380.git.antony.antony@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: add ICMP error source address test over xfrm tunnel

Test that ICMP error messages generated by an IPsec gateway use
the correct source address (the gateway's address, not the
unreachable destination).

Signed-off-by: Antony Antony <antony.antony@secunet.com>
Link: https://patch.msgid.link/79d526f96cf2252d71550d38772876bc72c7e3c7.1772101380.git.antony.antony@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

icmp: fix ICMP error source address when xfrm policy matches

When an IPsec gateway generates an ICMP error (e.g., Destination Host
Unreachable), the source address incorrectly shows the unreachable
destination instead of the gateway's address. IPv6 behaves correctly.

Before fix:
  ping 10.1.6.3
  From 10.1.6.3 icmp_seq=1 Destination Host Unreachable
  (wrong - 10.1.6.3 is the unreachable host)

After fix:
  ping 10.1.6.3
  From 10.1.5.2 icmp_seq=1 Destination Host Unreachable
  (correct - 10.1.5.2 is the gateway)

The fix removes the memcpy that overwrote fl4 with fl4_dec after
xfrm_lookup(). A follow-up commit adds a selftest.

Fixes: 415b3334a21a ("icmp: Fix regression in nexthop resolution during replies.")
Cc: stable+noautosel@kernel.org # Avoid false positives in tests
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Acked-by: Tobias Brunner <tobias@strongswan.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/19a0156ff6e76baa323a81d710510d399a6ff63a.1772101380.git.antony.antony@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'npc-hw-block-support-for-cn20k'

Ratheesh Kannoth says:

====================
NPC HW block support for cn20k

This patchset adds comprehensive support for the CN20K NPC
architecture. CN20K introduces significant changes in MCAM layout,
parser design, KPM/KPU mapping, index management, virtual index handling,
and dynamic rule installation. The patches update the AF, PF/VF, and
common layers to correctly support these new capabilities while
preserving compatibility with previous silicon variants.

MCAM on CN20K differs from older designs: the hardware now contains
two vertical banks of depth 8192, and thirty-two horizontal subbanks of
depth 256. Each subbank can be configured as x2 or x4, enabling
256-bit or 512-bit key storage. Several allocation models are added to
support this layout, including contiguous and non-contiguous allocation
with or without reference ranges and priorities.

Parser and extraction logic are also enhanced. CN20K introduces a new
profile model where up to twenty-four extractors may be configured for
each parsing profile. A new KPM profile scheme is added, grouping
sixteen KPUs into eight KPM profiles, each formed by two KPUs.

Support is added for default index allocation for CN20K-specific
MCAM entry structures, virtual index allocation, improved defragmentation,
and TC rule installation by allowing the AF driver to determine
required x2/x4 rule width during flow install.
====================

Link: https://patch.msgid.link/20260224080009.4147301-1-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: Use common structures

CN20K and legacy silicon differ in the size of key words used
in NPC MCAM. However, SoC-specific structures are not required
for low-level functions. Remove the SoC-specific structures
and rename the macros to improve readability.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-14-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: cn20k: add debugfs support

CN20K silicon divides the NPC MCAM into banks and subbanks, with each
subbank configurable for x2 or x4 key widths. This patch adds debugfs
entries to expose subbank usage details and their configured key type.

A debugfs entry is also added to display the default MCAM indexes
allocated for each pcifunc.

Additionally, debugfs support is introduced to show the mapping between
virtual indexes and real MCAM indexes, and vice versa.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-13-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-pf: cn20k: Add TC rules support

Unlike previous silicons, MCAM entries required for TC rules in CN20K
are allocated dynamically. The key size can also be dynamic, i.e., X2 or
X4. Based on the size of the TC rule match criteria, the AF driver
allocates an X2 or X4 rule. This patch implements the required changes
for CN20K TC by requesting an MCAM entry from the AF driver on the fly
when the user installs a rule. Based on the TC rule priority added or
deleted by the user, the PF driver shifts MCAM entries accordingly. If
there is a mix of X2 and X4 rules and the user tries to install a rule
in the middle of existing rules, the PF driver detects this and rejects
the rule since X2 and X4 rules cannot be shifted in hardware.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-12-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: cn20k: Allocate MCAM entry for flow installation

In CN20K, the PF/VF driver is unaware of the NPC MCAM entry type (x2/x4)
required for a particular TC rule when the user installs rules through the
TC command. This forces the PF/VF driver to first query the AF driver for
the rule size, then allocate an entry, and finally install the flow. This
sequence requires three mailbox request/response exchanges from the PF. To
speed up the installation, the `install_flow` mailbox request message is
extended with additional fields that allow the AF driver to determine the
required NPC MCAM entry type, allocate the MCAM entry, and complete the
flow installation in a single step.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-11-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: cn20k: virtual index support

This patch adds support for virtual MCAM index allocation and
improves CN20K MCAM defragmentation handling. A new field is
introduced in the non-ref, non-contiguous MCAM allocation mailbox
request to indicate that virtual indexes should be returned instead
of physical ones. Virtual indexes allow the hardware to move mapped
MCAM entries internally, enabling defragmentation and preventing
scattered allocations across subbanks. The patch also enhances
defragmentation by treating non-ref, non-contiguous allocations as
ideal candidates for packing sparsely used regions, which can free
up subbanks for potential x2 or x4 configuration. All such
allocations are tracked and always returned as virtual indexes so
they remain stable even when entries are moved during defrag.
During defragmentation, MCAM entries may shift between subbanks,
but their virtual indexes remain unchanged. Additionally, this
update fixes an issue where entry statistics were not being
restored correctly after defragmentation.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-10-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: cn20k: Add new mailboxes for CN20K silicon

To enable enhanced MCAM capabilities for CN20K, the struct mcam_entry
has been extended to support expanded keyword requirements.
Specifically, the kw and kw_mask arrays have been increased from
a size of 7 to 8 to accommodate the additional keyword field
introduced for CN20K.

To ensure seamless integration while preserving compatibility
with existing platforms, dedicated CN20K-specific mailboxes
have been introduced that leverage the updated struct mcam_entry.
This approach allows CN20K to utilize the extended structure
without impacting current implementations.

This patch identifies the relevant mailboxes and introduces the
following CN20K-specific additions:

New mailboxes added:
1. `NPC_CN20K_MCAM_WRITE_ENTRY`
2. `NPC_CN20K_MCAM_ALLOC_AND_WRITE_ENTRY`
3. `NPC_CN20K_MCAM_READ_ENTRY`
4. `NPC_CN20K_MCAM_READ_BASE_RULE`

Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-9-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: cn20k: Prepare for new SoC

Current code pass mcam_entry structure to all low
level functions. This is not proper:

1) We need to modify all functions to support a new SoC
2) It does not look good to pass soc specific structure to
all common functions.

This patch adds a mcam meta data structure, which is populated
and passed to low level functions.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-8-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: cn20k: Use common APIs

In cn20k silicon, the register definitions and the algorithms used to
read, write, copy, and enable MCAM entries have changed. This patch
updates the common APIs to support both cn20k and previous silicon
variants.

Additionally, cn20k introduces a new algorithm for MCAM index management.
The common APIs are updated to invoke the cn20k-specific index management
routines for allocating, freeing, and retrieving default MCAM entries.

Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-7-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: cn20k: Allocate default MCAM indexes

Reserving MCAM entries in the AF driver for installing default MCAM
entries is not an efficient allocation method, as it results in
significant wastage of entries. This patch allocates MCAM indexes for
promiscuous, multicast, broadcast, and unicast traffic in descending
order of indexes (from lower to higher priority) when the NIX LF is
attached to the PF/VF.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-6-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: cn20k: MKEX profile support

In new silicon variant cn20k, a new parser profile is introduced. Instead
of having two layer-data information per key field type, a new key
extractor concept is introduced. As part of this change now a maximum of
24 extractor can be configured per packet parsing profile. For example,
LA type(ether) can have 24 unique parsing key, LC type(ip), LD
type(tcp/udp) also can have unique 24 parsing key associated.

Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-5-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: cn20k: Add default profile

Default mkex profile for cn20k silicon. This commit
changes attribute of objects to may_be_unused to
avoid compiler warning

Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-4-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: cn20k: KPM profile changes

KPU (Kangaroo Processing Unit) profiles are primarily used to set the
required packet pointers that will be used in later stages for key
generation. In the new CN20K silicon variant, a new KPM profile is
introduced alongside the existing KPU profiles.

In CN20K, a total of 16 KPUs are grouped into 8 KPM profiles. As per
the current hardware design, each KPM configuration contains a
combination of 2 KPUs:
    KPM0 = KPU0 + KPU8
    KPM1 = KPU1 + KPU9
    ...
    KPM7 = KPU7 + KPU15

This configuration enables more efficient use of KPU resources. This
patch adds support for the new KPM profile configuration.

Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-3-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: cn20k: Index management

In CN20K silicon, the MCAM is divided vertically into two banks.
Each bank has a depth of 8192.

The MCAM is divided horizontally into 32 subbanks, with each subbank
having a depth of 256.

Each subbank can accommodate either x2 keys or x4 keys. x2 keys are
256 bits in size, and x4 keys are 512 bits in size.

    Bank1                   Bank0
    |-----------------------------|
    |               |             | subbank 31 { depth 256 }
    |               |             |
    |-----------------------------|
    |               |             | subbank 30
    |               |             |
    ------------------------------
    ...............................

    |-----------------------------|
    |               |             | subbank 0
    |               |             |
    ------------------------------|

This patch implements the following allocation schemes in NPC.
The allocation API accepts reference (ref), limit, contig, priority,
and count values. For example, specifying ref=100, limit=200,
contig=1, priority=LOW, and count=20 will allocate 20 contiguous
MCAM entries between entries 100 and 200.

1. Contiguous allocation with ref, limit, and priority.
2. Non-contiguous allocation with ref, limit, and priority.
3. Non-contiguous allocation without ref.
4. Contiguous allocation without ref.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-2-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: sit: Replace deprecated strcpy with strscpy

strcpy() has been deprecated [1] because it performs no bounds checking
on the destination buffer, which can lead to buffer overflows. Replace
it with the safer strscpy(). Use the two-argument version of strscpy()
to copy 'parms->name' in ipip6_tunnel_locate().

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20260227004541.798966-3-thorsten.blum@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'gve-support-larger-ring-sizes-in-dqo-qpl-mode'

Max Yuan says:

====================
gve: Support larger ring sizes in DQO-QPL mode

This patch series updates the gve driver to improve Queue Page List
(QPL) management and enable support for larger ring sizes when using the
DQO-QPL queue format.

Previously, the driver used hardcoded multipliers to determine the
number of pages to register for QPLs (e.g., 2x ring size for RX). This
rigid approach made it difficult to support larger ring sizes without
potentially exceeding the "max_registered_pages" limit reported by the
device.

The first patch introduces a unified and flexible logic for calculating
QPL page requirements. It balances TX and RX page allocations based on
the configured ring sizes and scales the total count down proportionally
if it would otherwise exceed the device's global registration limit.

The second patch leverages this new flexibility to stop ignoring the
maximum ring size supported by the device in DQO-QPL mode. Users can now
configure ring sizes up to the device-reported maximum, as the driver
will automatically adjust the QPL size to stay within allowed memory
bounds.
====================

Link: https://patch.msgid.link/20260225182342.1049816-1-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

gve: Enable reading max ring size from the device in DQO-QPL mode

The gVNIC device indicates a device option (MODIFY_RING) to the driver,
which presents a range of ring sizes from which the user is allowed to
select. But in DQO-QPL queue format, the driver ignores the "max" of
this range and instead allows the user to configure the ring size in the
range [min, default]. This was done because increasing the ring size
could result in the number of registered pages being higher than the max
allowed by the device.

In order to support large ring sizes, stop ignoring the "max" of the
range presented in the MODIFY_RING option.

Signed-off-by: Matt Olson <maolson@google.com>
Signed-off-by: Max Yuan <maxyuan@google.com>
Reviewed-by: Jordan Rhee <jordanrhee@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Link: https://patch.msgid.link/20260225182342.1049816-3-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

gve: Update QPL page registration logic

For DQO, change QPL page registration logic to be more flexible to honor
the "max_registered_pages" parameter from the gVNIC device.

Previously the number of RX pages per QPL was hardcoded to twice the
ring size, and the number of TX pages per QPL was dictated by the device
in the DQO-QPL device option. Now [in DQO-QPL mode], the driver will
ignore the "tx_pages_per_qpl" parameter indicated in the DQO-QPL device
option and instead allocate up to (tx_queue_length / 2) pages per TX QPL
and up to (rx_queue_length * 2) pages per RX QPL while keeping the total
number of pages under the "max_registered_pages".

Merge DQO and GQI QPL page calculation logic into a unified
gve_update_num_qpl_pages function. Add rx_pages_per_qpl to the priv
struct for consumption by both DQO and GQI.

Signed-off-by: Matt Olson <maolson@google.com>
Signed-off-by: Max Yuan <maxyuan@google.com>
Reviewed-by: Jordan Rhee <jordanrhee@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Link: https://patch.msgid.link/20260225182342.1049816-2-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

keys, dns: Use kmalloc_flex to improve dns_resolver_preparse

Use kmalloc_flex() when allocating a new 'struct user_key_payload' in
dns_resolver_preparse() to replace the open-coded size arithmetic.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20260226214930.785423-3-thorsten.blum@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fix sock compilation error under CONFIG_PREEMPT_RT

When CONFIG_PREEMPT_RT is enabled, __SPIN_LOCK_UNLOCKED() expands to a
brace-enclosed initializer rather than a compound literal, which cannot
be used in assignment expressions. This causes a build failure:

net/core/sock.c:3787:29: error: expected expression before '{' token
3787 | tmp.slock = __SPIN_LOCK_UNLOCKED(tmp.slock);

Use declaration-with-initializer instead of assignment, consistent with
how __SPIN_LOCK_UNLOCKED() is used elsewhere in the kernel (e.g.
DEFINE_SPINLOCK).

Fixes: 5151ec54f586 ("net: use try_cmpxchg() in lock_sock_nested()")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228111319.79506-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-ethernet-litex-minor-improvment-for-the-codebase'

Inochi Amaoto says:

====================
net: ethernet: litex: minor improvment for the codebase

Improve the litex code for using the device managed function to register
netdev and replace all the "pdev->dev" with dev pointer instead.
====================

Link: https://patch.msgid.link/20260227003351.752934-1-inochiama@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: litex: use device pointer to simplify code.

As there is already a device pointer in the probe function, replace
all "&pdev->dev" pattern with this predefined device pointer.

Signed-off-by: Inochi Amaoto <inochiama@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260227003351.752934-3-inochiama@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: litex: use devm_register_netdev() to register netdev

Use devm_register_netdev to avoid unnecessary remove() callback in
platform_driver structure.

Signed-off-by: Inochi Amaoto <inochiama@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260227003351.752934-2-inochiama@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/handshake: Fixed grammar mistake

The word "a" was used instead of "an" which is grammatically incorrect.
Fixed by changing from "a" to "an". This improves readability of the
documentation.

Signed-off-by: Leon Kral <leon.j.kral@protonmail.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Link: https://patch.msgid.link/20260227001151.41610-1-leon.j.kral@protonmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

NFC: fix header file kernel-doc warnings

Repair some of the comments:
- use the correct enum names
- don't use "/**" for a non-kernel-doc comment

to fix these warnings:

Warning: include/uapi/linux/nfc.h:127 Excess enum value
'@NFC_EVENT_DEVICE_DEACTIVATED' description in 'nfc_commands'
Warning: include/uapi/linux/nfc.h:204 Excess enum value
'@NFC_ATTR_APDU' description in 'nfc_attrs'
Warning: include/uapi/linux/nfc.h:302 expecting prototype for Pseudo().
Prototype was for NFC_RAW_HEADER_SIZE() instead

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20260226221004.1037909-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: inline skb_add_rx_frag_netmem()

This critical helper (via skb_add_rx_frag()) is mostly used
from drivers rx fast path.

It is time to inline it, this actually saves space in vmlinux:

size vmlinux.old vmlinux
   text    data     bss     dec     hex filename
37350766        23092977        4846992 65290735        3e441ef vmlinux.old
37350600        23092977        4846992 65290569        3e44149 vmlinux

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260226041213.1892561-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: discard fragment queue earlier if there is malformed datagram

Currently the kernel IPv6 implementation is not dicarding the fragment
queue upon receiving a IPv6 fragment that is not 8 bytes aligned. It
relies on queue expiration to free the queue.

While RFC 8200 section 4.5 does not explicitly mention that the rest of
fragments must be discarded, it does not make sense to keep them. The
parameter problem message is sent regardless that. In addition, if the
sender is able to re-compose the datagram so it is 8 bytes aligned it
would qualify as a new whole datagram not fitting into the same fragment
queue.

The same situation happens if segment end is exceeding the IPv6 maximum
packet length. The sooner we can free resources the better during
reassembly, the better.

Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260225133758.4553-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8152: Add 2500baseT EEE status/configuration support

The r8152 driver supports the RTL8156, which is a 2.5Gbit Ethernet controller for
USB 3.0, for which support is added for configuring and displaying the EEE
advertisement status for 2.5GBit connections.

The patch also corrects the determination of whether EEE is active to include
the 2.5GBit connection status and make the determination dependent not on the
desired speed configuration (tp->speed), but on the actual speed used by
the controller. For consistency, this is corrected also for the RTL8152/3.

This was tested on an Edimax EU-4307 V1.0 USB-Ethernet adapter with RTL8156,
and a SECOMP Value 12.99.1115 USB-C 3.1 Ethernet converter with RTL8153.

Signed-off-by: Birger Koblitz <mail@birger-koblitz.de>
Link: https://patch.msgid.link/20260224-b4-eee2g5-v2-1-cf5c83df036e@birger-koblitz.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vmxnet3: Suppress page allocation warning for massive Rx Data ring

The vmxnet3 driver supports an Rx Data ring (rx-mini) to optimise the
processing of small packets. The size of this ring's DMA-coherent memory
allocation is determined by the product of the primary Rx ring size and
the data ring descriptor size:

sz = rq->rx_ring[0].size * rq->data_ring.desc_size;

When a user configures the maximum supported parameters via ethtool
(rx_ring[0].size = 4096, data_ring.desc_size = 2048), the required
contiguous memory allocation reaches 8 MB (8,388,608 bytes).

In environments lacking Contiguous Memory Allocator (CMA),
dma_alloc_coherent() falls back to the standard zone buddy allocator. An
8 MB allocation translates to a page order of 11, which strictly exceeds
the default MAX_PAGE_ORDER (10) on most architectures.

Consequently, __alloc_pages_noprof() catches the oversize request and
triggers a loud kernel warning stack trace:

WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)

This warning is unnecessary and alarming to system administrators because
the vmxnet3 driver already handles this allocation failure gracefully.
If dma_alloc_coherent() returns NULL, the driver safely disables the
Rx Data ring (adapter->rxdataring_enabled = false) and falls back to
standard, streaming DMA packet processing.

To resolve this, append the __GFP_NOWARN flag to the dma_alloc_coherent()
gfp_mask. This instructs the page allocator to silently fail the
allocation if it exceeds order limits or memory is too fragmented,
preventing the spurious warning stack trace.

Furthermore, enhance the subsequent netdev_err() fallback message to
include the requested allocation size. This provides critical debugging
context to the administrator (e.g., revealing that an 8 MB allocation
was attempted and failed) without making hardcoded assumptions about
the state of the system's configurations.

Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Link: https://patch.msgid.link/20260226163121.4045808-1-atomlin@atomlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: mtk_eth_soc: avoid writing to ESW registers on MT7628

The MT7628 has a fixed-link PHY and does not expose MAC control
registers. Writes to these registers only corrupt the ESW VLAN
configuration.

This patch explicitly registers no-op phylink_mac_ops for MT7628, as
after removing the invalid register accesses, the existing
phylink_mac_ops effectively become no-ops.

This code was introduced by commit 296c9120752b
("net: ethernet: mediatek: Add MT7628/88 SoC support")

Signed-off-by: Joris Vaisvila <joey@tinyisr.com>
Reviewed-by: Daniel Golle <daniel@makrotpia.org>
Reviewed-by: Stefan Roese <stefan.roese@mailbox.org>
Link: https://patch.msgid.link/20260226154547.68553-1-joey@tinyisr.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: don't select STREAM_PARSER

ktls was converted to its own stream parser in commit
84c61fe1a75b ("tls: rx: do not use the standard strparser"), but the
Kconfig dependency was left. The only part of the original strparser
that's shared with ktls are a few structs (strp_msg, sk_skb_cb) and
the strp_msg helper, those don't require building the net/strparser
code.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/cb41e513a30eeaac0b419284cc87433f049b2ee0.1771871995.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: use try_cmpxchg() in lock_sock_nested()

Add a fast path in lock_sock_nested(), to avoid acquiring
the socket spinlock only to set @owned to one:

        spin_lock_bh(&sk->sk_lock.slock);
        if (unlikely(sock_owned_by_user_nocheck(sk)))
                __lock_sock(sk);
        sk->sk_lock.owned = 1;
        spin_unlock_bh(&sk->sk_lock.slock);

On x86_64 compiler generates something quite efficient:

00000000000077c0 <lock_sock_nested>:
    77c0:       f3 0f 1e fa                 endbr64
    77c4:       e8 00 00 00 00              call   __fentry__
    77c9:       b9 01 00 00 00              mov    $0x1,%ecx
    77ce:       31 c0                       xor    %eax,%eax
    77d0:       f0 48 0f b1 8f 48 01 00 00  lock cmpxchg %rcx,0x148(%rdi)
    77d9:       75 06                       jne    slow_path
    77db:       2e e9 00 00 00 00           cs jmp __x86_return_thunk-0x4
slow_path:      ...

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20260226021215.1764237-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/hsr: update outdated comments

The function hsr_rcv() was renamed hsr_handle_frame() and moved to
net/hsr/hsr_slave.c by commit 81ba6afd6e64 ("net/hsr: Switch from
dev_add_pack() to netdev_rx_handler_register()").

Update all remaining references in the comments accordingly.

Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260225145159.2953-1-kexinsun@smail.nju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: micrel: Add support for lan9645x internal phy

LAN9645X is a family of switch chips with 5 internal copper phys. The
internal PHY is based on parts of LAN8832. This is a low-power, single
port triple-speed (10BASE-T/100BASE-TX/1000BASE-T) ethernet physical
layer transceiver (PHY) that supports transmission and reception of data
on standard CAT-5, as well as CAT-5e and CAT-6 Unshielded Twisted
Pair (UTP) cables.

Add support for the internal PHY of the lan9645x chip family.

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Jens Emil Schulz Østergaard <jensemil.schulzostergaard@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260226-phy_micrel_add_support_for_lan9645x_internal_phy-v3-1-1fe82379962b@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netmem: remove the pp fields from net_iov

Now that the pp fields in net_iov have no users, remove them from
net_iov and clean up.

Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20260224061424.11219-1-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: fix typo in function name

Corrected the typo in the function name from
`airhoa_is_lan_gdm_port` to `airoha_is_lan_gdm_port`. This change ensures
consistency in the API naming convention.

Signed-off-by: Zhengping Zhang <aquapinn@qq.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/tencent_E4FD5D6BC0131E617D848896F5F9FCED6E0A@qq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atlantic: fix reading SFP module info on some AQC100 cards

Commit 853a2944aaf3 ("net: atlantic: support reading SFP module info")
added support for reading SFP module info on AQC100-based cards. However,
it only supports reading directly from the controller's hardware
registers, and this does not seem to be supported on certain cards,
including my TRENDnet TEG-10GECSFP V3. "ethtool -m" times out when reading
certain registers, even when I increase the read poll timeout values.

The DPDK "atlantic" driver reads module info via firmware calls instead of
directly reading the hardware registers, provided that the NIC's firmware
version supports it.

This change adapts the DPDK firmware call code to the kernel driver. It
preserves the old hardware-based module read code as a fallback when the
firmware does not support it, to avoid breaking cards that are currently
working.

Tested on 2 different TRENDnet TEG-10GECSFP V3 cards, both with firmware
version 3.1.121 (current at the time of this patch). Both cards correctly
reported module info for a passive DAC cable and 2 different 10G optical
transceivers.

Signed-off-by: Tiernan Hubble <thubble@thubble.ca>
Link: https://patch.msgid.link/20260225002026.1754045-1-thubble@thubble.ca
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'support-phys-that-have-inband-autoneg-disabled-with-gem'

Charles Perry says:

====================
Support PHYs that have inband autoneg disabled with GEM

I'm testing SGMII with a VSC8574 PHY [1] and microchip HPSC SoC [2].

The link can work with or without autoneg, as long as the MAC and the PHY
are configured the same way. This doesn't work with the current MAC driver
because the MAC inband autoneg is always enabled (in the ->mac_config()
phylink_mac_ops). More precisely, the PHY driver (mscc_main.c) has
phylink's ->config_inband() implemented while the MAC ->pcs_config() ops
has an empty body.

This is based on code written by Sean Anderson [3]. Let me know if I
should add a From: or Co-developed-by: tag.

Logs with inband autoneg (managed = "in-band-status"):

  root@p64h:~# ifconfig eth1 up 10.180.59.33
  macb 40004184000.ethernet eth1: PHY 4000c21e000.mdio-mdio:02 doesn't supply possible interfaces
  macb 40004184000.ethernet eth1: PHY [4000c21e000.mdio-mdio:02] driver [Microsemi GE VSC8574 SyncE] (irq=POLL)
  macb 40004184000.ethernet eth1: phy: sgmii setting supported 00000000,00000000,00000000,000042ff advertising 00000000,00000000,00000000,000042ff
  macb 40004184000.ethernet eth1: configuring for inband/sgmii link mode
  macb 40004184000.ethernet eth1: major config, requested inband/sgmii
  macb 40004184000.ethernet eth1: interface sgmii inband modes: pcs=03 phy=03
  macb 40004184000.ethernet eth1: major config, active inband/inband,an-enabled/sgmii
  macb 40004184000.ethernet eth1: phylink_mac_config: mode=inband/sgmii/none adv=00000000,00000000,00000000,000042ff pause=00
  macb_pcs_config: PCSANADV=0x1 PCSCNTRL=0x1040
  macb_pcs_get_state: PCSSTS=0x109 PCSANLPBASE=0x1
  macb_pcs_get_state: PCSSTS=0x12d PCSANLPBASE=0x1801
  macb 40004184000.ethernet eth1: phy link down sgmii/Unknown/Unknown/none/off/nolpi
  macb_pcs_get_state: PCSSTS=0x12d PCSANLPBASE=0x1801
  macb_pcs_get_state: PCSSTS=0x12d PCSANLPBASE=0x1801
  macb 40004184000.ethernet eth1: phy link up sgmii/1Gbps/Full/none/tx/nolpi
  macb_pcs_get_state: PCSSTS=0x129 PCSANLPBASE=0x9801
  macb_pcs_get_state: PCSSTS=0x12d PCSANLPBASE=0x9801
  macb 40004184000.ethernet eth1: Link is Up - 1Gbps/Full - flow control tx

Logs without inband autoneg:

  root@p64h:~# ifconfig eth1 up 10.180.59.33
  macb 40004184000.ethernet eth1: PHY 4000c21e000.mdio-mdio:02 doesn't supply possible interfaces
  macb 40004184000.ethernet eth1: PHY [4000c21e000.mdio-mdio:02] driver [Microsemi GE VSC8574 SyncE] (irq=POLL)
  macb 40004184000.ethernet eth1: phy: sgmii setting supported 00000000,00000000,00000000,000042ff advertising 00000000,00000000,00000000,000042ff
  macb 40004184000.ethernet eth1: configuring for phy/sgmii link mode
  macb 40004184000.ethernet eth1: major config, requested phy/sgmii
  macb 40004184000.ethernet eth1: interface sgmii inband modes: pcs=03 phy=03
  macb 40004184000.ethernet eth1: major config, active phy/outband/sgmii
  macb 40004184000.ethernet eth1: phylink_mac_config: mode=phy/sgmii/none adv=00000000,00000000,00000000,00000000 pause=00
  macb_pcs_config: PCSANADV=0x1 PCSCNTRL=0x40
  macb 40004184000.ethernet eth1: phy link down sgmii/Unknown/Unknown/none/off/nolpi
  macb 40004184000.ethernet eth1: phy link up sgmii/1Gbps/Full/none/tx/nolpi
  macb 40004184000.ethernet eth1: Link is Up - 1Gbps/Full - flow control tx

The above logs are generated with an additional printk() in macb_psc_config()
and macb_pcs_get_state() and "#define DEBUG" in phylink.c.

[1]: https://www.microchip.com/en-us/product/vsc8574
[2]: https://www.microchip.com/en-us/products/microprocessors/64-bit-mpus/pic64-hpsc
[3]: https://lore.kernel.org/all/20250610233547.3588356-1-sean.anderson@linux.dev/
====================

Link: https://patch.msgid.link/20260224202854.112813-1-charles.perry@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>