
Merge branch 'net-ethtool-make-sure-__ethtool_get_link_ksettings-is-ops-locked'

Jakub Kicinski says:

====================
net: ethtool: make sure __ethtool_get_link_ksettings() is ops-locked

This is prep for the series which will make most of the ethtool ops
run without rtnl_lock. The AI bots surfaced a number of callers of
__ethtool_get_link_ksettings() which need fixing, so I decided to
send that as a smaller prep-series. Each driver changed separately
for ease of review.

Full series unlocking ethtool ops AKA v1::
https://lore.kernel.org/20260528231637.251822-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20260603012840.2254293-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethtool: make sure __ethtool_get_link_ksettings() is ops-locked

All drivers which may call *_get_link_ksettings() on ops-locked
devices from paths already holding the ops lock are ready now.
Make __ethtool_get_link_ksettings() take the ops lock, and assert
that it's held in netif_get_link_ksettings().

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260603012840.2254293-12-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
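
The split between the two helpers in the commit above can be pictured with a small userspace model. This is only a sketch under stated assumptions: struct netdev_model, its fields and the *_model function names are hypothetical stand-ins, and a boolean replaces the real mutex so that a recursive acquisition shows up as an error code instead of a hang.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a net_device; not the kernel struct. */
struct netdev_model {
	bool ops_locked;	/* driver opted in to ops locking */
	bool lock_held;		/* models the per-device instance lock */
	int speed;		/* value a driver callback would report */
};

/* Models netif_get_link_ksettings(): the caller must already hold the lock. */
int netif_get_speed_model(const struct netdev_model *dev, int *speed)
{
	assert(dev && speed);
	if (dev->ops_locked && !dev->lock_held)
		return -1;	/* the kernel would trip a lock assertion here */
	*speed = dev->speed;
	return 0;
}

/* Models __ethtool_get_link_ksettings(): takes the lock itself. */
int ethtool_get_speed_model(struct netdev_model *dev, int *speed)
{
	int err;

	assert(dev && speed);
	if (dev->ops_locked) {
		if (dev->lock_held)
			return -2;	/* a non-recursive mutex would deadlock */
		dev->lock_held = true;
	}
	err = netif_get_speed_model(dev, speed);
	if (dev->ops_locked)
		dev->lock_held = false;
	return err;
}
```

In this model a caller that already holds the lock has to use the netif_ variant, which is the conversion the preceding per-subsystem patches perform.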

scsi: fcoe: don't recurse on the netdev's ops lock

fcoe_link_speed_update() calls __ethtool_get_link_ksettings() on the
lport's netdev, which will soon take the dev's ops lock. Some notifier
callers already arrive with this lock held. Switch to
netif_get_link_ksettings() and adjust the explicit call sites to take
the netdev lock explicitly.

Within fcoe_device_notification() try to only query the link speed
from notifiers which announce link state change (UP / CHANGE).
DOWN / GOING_DOWN notifiers are slightly sketchy when it comes
to ops locking right now, and the code already special-cases
those by maintaining the local link_possible variable.

Also take the lock in bnx2fc_net_config(), even though I think
that bnx2fc call sites are largely irrelevant since it's not
an ops-locked driver.

Link: https://patch.msgid.link/20260603012840.2254293-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

leds: trigger: netdev: don't recurse on the netdev ops lock

get_device_state() calls __ethtool_get_link_ksettings() on the trigger's
netdev, which will soon take the dev's ops lock. Three of its callers
already hold that lock and one doesn't, so the function would either
deadlock or run unprotected depending on the path.

Make get_device_state() expect the dev's ops lock held and switch to
netif_get_link_ksettings():

  * netdev_trig_notify() NETDEV_UP / NETDEV_CHANGE / NETDEV_CHANGENAME
    arrive with the dev's ops lock held (per netdevices.rst).
  * set_device_name() does not hold the lock, take it explicitly.

Due to lock ordering we need to reshuffle the code in set_device_name()
a little bit. We need to find the device earlier on, so that we can
lock it before we take trigger_data->lock.

Link: https://patch.msgid.link/20260603012840.2254293-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
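
The lock ordering constraint mentioned for set_device_name() can be illustrated with a toy rank checker. This is a sketch, not the driver code: the ranks, struct lock_tracker and the *_model functions are hypothetical, and the only assumption taken from the commit message is that the device's ops lock must be taken before trigger_data->lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock ranks: a lower rank must be taken first. */
enum lock_rank { RANK_NETDEV_OPS = 1, RANK_TRIGGER_DATA = 2 };

struct lock_tracker {
	int highest_held;	/* 0 means no lock is held */
};

/* Returns false when taking `rank` would violate the ordering. */
bool tracker_acquire(struct lock_tracker *t, enum lock_rank rank)
{
	assert(t);
	if ((int)rank <= t->highest_held)
		return false;
	t->highest_held = rank;
	return true;
}

void tracker_release_all(struct lock_tracker *t)
{
	assert(t);
	t->highest_held = 0;
}

/* Reshuffled order: find and lock the device, then take the trigger lock. */
bool set_device_name_model(struct lock_tracker *t)
{
	bool ok = tracker_acquire(t, RANK_NETDEV_OPS) &&
		  tracker_acquire(t, RANK_TRIGGER_DATA);

	tracker_release_all(t);
	return ok;
}

/* Inverted order, which the reshuffle avoids. */
bool set_device_name_inverted_model(struct lock_tracker *t)
{
	bool ok = tracker_acquire(t, RANK_TRIGGER_DATA) &&
		  tracker_acquire(t, RANK_NETDEV_OPS);

	tracker_release_all(t);
	return ok;
}
```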

net: sched: don't recurse on the netdev ops lock in qdiscs

cbs_set_port_rate() and taprio_set_picos_per_byte() are reached from
two paths and both already hold the device's ops lock:

*_change(), via tc_modify_qdisc() which calls netdev_lock_ops(dev)
before dispatching to the qdisc ops.

*_dev_notifier() on NETDEV_UP / NETDEV_CHANGE, where caller
holds the ops lock across the notifier chain.

Switch to netif_get_link_ksettings() to avoid deadlock once
__ethtool_get_link_ksettings() starts taking the netdev lock.

Link: https://patch.msgid.link/20260603012840.2254293-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: don't recurse on the port's netdev ops lock

port_cost() calls __ethtool_get_link_ksettings() on the port device,
which will soon take the port's ops lock. br_port_carrier_check()
is reached via the NETDEV_CHANGE notifier from linkwatch, which
already holds the port's ops lock, so the call would deadlock.

Make port_cost() expect the port's ops lock held and switch to
netif_get_link_ksettings(). The only other caller is new_nbp(),
make sure it takes the lock explicitly.

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260603012840.2254293-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: team: don't recurse on the port's netdev ops lock

__team_port_change_send() calls __ethtool_get_link_ksettings() on
the port, which will soon take the port's ops lock. The notifier
caller already holds it while the slave-add/del callers do not,
so the function would either deadlock or run unprotected depending
on the path.

Make __team_port_change_send() expect the port's ops lock held and
switch to netif_get_link_ksettings(). team_device_event()'s NETDEV_UP /
NETDEV_CHANGE already arrive with the port's ops lock held.
team_port_add() now takes it explicitly.

Note that NETDEV_DOWN and team_port_del() will pass false as @linkup
so they will not execute netif_get_link_ksettings(). This is fortunate
as NETDEV_DOWN has somewhat mixed locking right now.

Link: https://patch.msgid.link/20260603012840.2254293-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bonding: don't recurse on the slave's netdev ops lock

bond_update_speed_duplex() calls __ethtool_get_link_ksettings() on
the slave, which will soon take the slave's ops lock. One of its
callers already holds it and the other three don't, so the function
would either deadlock or run unprotected depending on the path.

Make the helper expect the slave's ops lock held and switch to
netif_get_link_ksettings(). Wrap the three call sites that don't
already hold it:

  * bond_enslave() (rtnl held; core drops the lower's ops lock
    around ->ndo_add_slave).
  * bond_miimon_commit() (rtnl_trylock'd from the mii workqueue).
  * bond_ethtool_get_link_ksettings() (rtnl held via ethtool layer,
    bond device itself is not ops locked).

The call site which does already hold the ops lock is
bond_slave_netdev_event() via NETDEV_UP / NETDEV_CHANGE notifiers,
so it stays as-is.

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260603012840.2254293-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethtool: add netif_get_link_ksettings() for correct ops-locked use

__ethtool_get_link_ksettings() is exported and called from sysfs
and many drivers. It invokes ethtool_ops->get_link_ksettings
so by our own docs it should be holding netdev lock for ops locked
devices. Looks like commit 2bcf4772e45a ("net: ethtool:
try to protect all callback with netdev instance lock")
missed adding the ops lock here.

There's a number of callers we need to fix up so let's add the
netif_get_link_ksettings() helper first, without any actual
locking changes (this commit is a nop).

Not treating this as a fix because I don't think any driver cares
at this point, but if we want to remove the rtnl_lock protection
this will become critical.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260603012840.2254293-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: document NETDEV_CHANGENAME as ops locked

NETDEV_CHANGENAME is only emitted from netif_change_name().
netif_change_name() has two callers both of which hold netdev_lock_ops()
around the call site:
- dev_change_name()
- do_setlink()

Document NETDEV_CHANGENAME as always ops locked.

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260603012840.2254293-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethtool: cmis_cdb: hold instance lock for ops locked devices

FW module flashing was written so that the flashing happens
without holding rtnl_lock. This allows flashing multiple modules
at once. Current drivers can handle that well, but we should
let drivers depend on the netdev instance lock. Instance lock
is per netdev, and so is the module so we won't break parallel
updates.

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260603012840.2254293-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: rename netdev_ops_assert_locked()

Jakub suggests renaming the existing assert to match
the netdev_lock_ops_compat() semantics.

We want netdev_assert_locked_ops() to mean - if the driver
is ops locked - check that it's holding the device lock.

The existing helper checks for either ops lock or rtnl_lock,
which is the locking behavior of netdev_lock_ops_compat().

The reason for naming divergence is likely that
netdev_ops_assert_locked() predated the _compat() helpers.

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260603012840.2254293-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
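
One reading of the two assert semantics described above, written as plain predicates. This is a sketch: struct dev_lock_state and both function names are hypothetical, and the real helpers use lockdep assertions rather than returning a boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the locking state relevant to one device. */
struct dev_lock_state {
	bool ops_locked;		/* driver uses the per-device ops lock */
	bool instance_lock_held;
};

/* Strict form: only ops-locked devices must hold the instance lock. */
bool assert_locked_ops_model(const struct dev_lock_state *dev)
{
	assert(dev);
	return !dev->ops_locked || dev->instance_lock_held;
}

/* Compat form: instance lock for ops-locked devices, rtnl_lock otherwise. */
bool assert_locked_ops_compat_model(const struct dev_lock_state *dev,
				    bool rtnl_held)
{
	assert(dev);
	return dev->ops_locked ? dev->instance_lock_held : rtnl_held;
}
```

The two forms differ only for devices that are not ops locked: the strict form accepts them unconditionally, the compat form still requires rtnl_lock.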

net/mlx5: convert miss_list allocation to kvmalloc_array()

dr_icm_buddy_init_ste_cache() allocates the per-buddy miss_list using
the open-coded kvmalloc(n * sizeof(*p), ...) form. The neighbouring
allocations in the same function already use the kvcalloc()/
kvzalloc_objs() forms; switch this last one to kvmalloc_array() for
consistency and for the size_mul overflow check that kvmalloc_array()
performs.

The semantics are unchanged: kvmalloc_array() returns a non-zeroed
buffer, just like the previous kvmalloc() call. Existing callers of
buddy->miss_list initialise each list_head before use.

Signed-off-by: William Theesfeld <william@theesfeld.net>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260601193758.626537-1-william@theesfeld.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
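
The overflow check that motivates the conversion can be sketched in userspace C. The names alloc_array_checked() and init_all() are hypothetical, and malloc() stands in for the kernel allocator; the point is only that the multiplication is validated before it is used, and that the returned buffer is not zeroed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of an overflow-checked array allocation. */
void *alloc_array_checked(size_t n, size_t size)
{
	/* Refuse when n * size would wrap around SIZE_MAX. */
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;
	return malloc(n * size);	/* contents are not zeroed */
}

/* Callers must initialise every element, since the buffer is not zeroed. */
void init_all(int *items, size_t n)
{
	assert(items || n == 0);
	for (size_t i = 0; i < n; i++)
		items[i] = 0;
}
```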

net: axienet: Use dedicated ethtool_ops for the dmaengine path

The dmaengine path shares ethtool_ops with the legacy AXI DMA path,
including .get_coalesce/.set_coalesce that poke XAXIDMA_*_CR_OFFSET
directly. In dmaengine mode lp->dma_regs is not mapped by axienet, so
those ethtool calls touch unmapped/unrelated memory and report values
unrelated to the channel actually in use.

.get_ringparam/.set_ringparam only touch lp->rx_bd_num/lp->tx_bd_num,
fields used only by the legacy path for BD ring sizing. In dmaengine
mode the descriptor ring is owned by the dmaengine provider and these
fields are not consulted, so reporting them is misleading.

No dmaengine API exists today to query or program either coalescing
or ring size on behalf of the client, so neither can be exposed
meaningfully in dmaengine mode.

Add axienet_ethtool_dmaengine_ops without the coalesce and ringparam
hooks. Also move the ethtool_ops assignment from early probe into the
if/else alongside netdev_ops, so the legacy and dmaengine paths pick
their respective ops in one place. No functional change for the
legacy DMA path.

Signed-off-by: Suraj Gupta <suraj.gupta2@amd.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260601124454.3384601-1-suraj.gupta2@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
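
The approach of giving each mode its own ops table, and leaving unsupported hooks unset, can be sketched as follows. Everything here is a hypothetical reduction: struct ethtool_ops_model has only two hooks, and call_get_coalesce() merely imitates a caller that treats a missing hook as "not supported".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced ops table. */
struct ethtool_ops_model {
	int (*get_link)(void);
	int (*get_coalesce)(void);	/* NULL: operation not supported */
};

static int link_up(void) { return 1; }
static int legacy_coalesce(void) { return 0; }

const struct ethtool_ops_model legacy_ops = {
	.get_link = link_up,
	.get_coalesce = legacy_coalesce,
};

/* The dmaengine table simply leaves the unsupported hook unset. */
const struct ethtool_ops_model dmaengine_ops = {
	.get_link = link_up,
};

/* Pick the table in one place, next to the other per-mode choices. */
const struct ethtool_ops_model *select_ops(bool use_dmaengine)
{
	return use_dmaengine ? &dmaengine_ops : &legacy_ops;
}

/* A missing hook is reported as an error instead of touching hardware. */
int call_get_coalesce(const struct ethtool_ops_model *ops)
{
	assert(ops);
	return ops->get_coalesce ? ops->get_coalesce() : -1;
}
```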

net/sun: Fix multiple typos in comments

There are some typos in comments. While they are harmless and not visible
to users, there is no reason not to fix them. Fix only the ones that are not
register related, since register names might follow an intentional naming
convention.

Signed-off-by: Jakub Raczynski <j.raczynski@samsung.com>
Link: https://patch.msgid.link/20260601163727.554364-1-j.raczynski@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: bnxt: disable rx-copybreak by default

rx-copybreak requires an extra slab allocation. Since bnxt uses
page pool frags and HDS by default, the rx-copybreak doesn't
buy us anything. The extra pressure on slab causes overload
on pre-sheaves kernels on modern AMD platforms.

In synthetic testing on net-next this patch shows little difference
but I think copybreak is "obvious waste" at this point.

Default rx-copybreak threshold to 0 / disabled.

The "copybreak" defines are really the size bounds for the Rx header
buffer. Rename them.

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://patch.msgid.link/20260602003759.1545645-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: exthdrs: recompute network header pointer once

In ip6_parse_tlv(), recompute the network header pointer once regardless
of the option processed (Hbh or Dest), as missing recomputation for
specific options has caused issues in the past.

Signed-off-by: Justin Iurman <justin.iurman@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260602213033.12244-1-justin.iurman@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ibm: emac: fix unchecked platform_get_irq return value

platform_get_irq() returns a negative errno on failure.
Commit a598f66d9169 replaced irq_of_parse_and_map() (which returns 0
on failure) with platform_get_irq() but dropped the error check.
Without it, a negative IRQ number is passed to devm_request_irq(),
which fails with -EINVAL instead of propagating the real error
from platform_get_irq().

Add the missing error check and goto err_gone.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260601040201.103481-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
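
The difference the missing check makes can be shown with a small model. The *_model functions are hypothetical stand-ins for platform_get_irq() and devm_request_irq(); the sketch assumes only what the commit message states, namely that the lookup returns a negative errno and that the request rejects a negative IRQ with -EINVAL.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for platform_get_irq(): negative errno on failure. */
int get_irq_model(int simulated)
{
	return simulated;
}

/* Hypothetical stand-in for devm_request_irq(): rejects invalid IRQs. */
int request_irq_model(int irq)
{
	return irq < 0 ? -EINVAL : 0;
}

/* Without the check, the original error is replaced by -EINVAL. */
int probe_unchecked(int simulated)
{
	int irq = get_irq_model(simulated);

	return request_irq_model(irq);
}

/* With the check, the error from the lookup is propagated unchanged. */
int probe_checked(int simulated)
{
	int irq = get_irq_model(simulated);

	if (irq < 0)
		return irq;
	assert(irq >= 0);	/* only valid IRQ numbers reach the request */
	return request_irq_model(irq);
}
```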

mptcp: change mptcp_established_options() to return opt_size

Instead of passing opt_size address to mptcp_established_options(),
change this function to return it by value.

This removes the need for an expensive stack canary in
tcp_established_options() when CONFIG_STACKPROTECTOR_STRONG=y.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-92 (-92)
Function                                     old     new   delta
tcp_options_write.isra                      1423    1407     -16
mptcp_established_options                   2746    2720     -26
tcp_established_options                      553     503     -50
Total: Before=22110750, After=22110658, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602125138.2317015-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
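
The two calling conventions can be compared in a minimal sketch. The function names and the size value are hypothetical and the real function takes more arguments; the relevant point is that in the first form the caller has to pass the address of one of its locals, which is the kind of reference that makes a strong stack protector instrument the caller.

```c
#include <assert.h>

/* Out-parameter form: the caller must expose the address of a local. */
int options_via_pointer(int has_mptcp, unsigned int *opt_size)
{
	assert(opt_size);
	*opt_size = has_mptcp ? 20 : 0;	/* arbitrary illustrative size */
	return has_mptcp;
}

/* Return-by-value form: no local of the caller has its address taken. */
unsigned int options_by_value(int has_mptcp)
{
	return has_mptcp ? 20 : 0;
}
```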

ipv4: raw: remove six obsolete EXPORT_SYMBOL_GPL()

IPv6 can not be a module anymore, so we no longer need to export:

- raw_hash_sk()
- raw_unhash_sk()
- raw_abort()
- raw_seq_start()
- raw_seq_next()
- raw_seq_stop()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260602165036.2712408-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'geneve-allow-binding-udp-socket-to-a-specific-address'

Kuniyuki Iwashima says:

====================
geneve: Allow binding UDP socket to a specific address.

By default, a GENEVE device bind()s its underlying UDP socket(s) to
the IPv4 or IPv6 wildcard address because there is no way to specify
a specific local IP address to bind() to.

This prevents deploying multiple GENEVE devices on a multi-homed host
where each device should be isolated and bound to a different local IP
address on the same UDP port.

This series introduces two options to specify local IPv4 or IPv6
addresses for a GENEVE device.
====================

Link: https://patch.msgid.link/20260602190436.139591-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

geneve: Introduce IFLA_GENEVE_LOCAL and IFLA_GENEVE_LOCAL6.

By default, a GENEVE device bind()s its underlying UDP socket(s) to
the IPv4 or IPv6 wildcard address because there is no way to specify
a specific local IP address to bind() to.

This prevents deploying multiple GENEVE devices on a multi-homed host
where each device should be isolated and bound to a different local IP
address on the same UDP port.

Let's introduce new options, IFLA_GENEVE_LOCAL and IFLA_GENEVE_LOCAL6,
to allow specifying a local IPv4/IPv6 address for the backend UDP
socket.

By default, when collect metadata mode (IFLA_GENEVE_COLLECT_METADATA)
is enabled, both IPv4 and IPv6 sockets are created.  However, if a
source address is specified via the new attributes, only a single
socket corresponding to that specific address family is created.

Accordingly, geneve_find_sock() and geneve_find_dev() are updated to
take the source address into account, ensuring that multiple devices
and sockets configured with different source addresses can coexist
without conflict.

In addition, the source address is validated in geneve_xmit_skb()
and geneve6_xmit_skb(), so the BPF prog must set it in bpf_tunnel_key.

With this change, multiple GENEVE devices can be successfully created
and bound to their respective local IP addresses:

  (*) "local" is the keyword for IFLA_GENEVE_LOCAL / IFLA_GENEVE_LOCAL6

  # for i in $(seq 1 2);
  do
          ip link add geneve4_${i} type geneve local 192.168.0.${i} external
          ip addr add 192.168.0.${i}/24 dev geneve4_${i}
          ip link set geneve4_${i} up

          ip link add geneve6_${i} type geneve local 2001:9292::${i} external
          ip addr add 2001:9292::${i}/64 dev geneve6_${i} nodad
          ip link set geneve6_${i} up
  done

  # ip -d l | grep geneve
  9: geneve4_1: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
      geneve external id 0 local 192.168.0.1 ...
  10: geneve6_1: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
      geneve external id 0 local 2001:9292::1 ...
  11: geneve4_2: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
      geneve external id 0 local 192.168.0.2 ...
  12: geneve6_2: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
      geneve external id 0 local 2001:9292::2 ...

  # ss -ua | grep geneve
  UNCONN 0      0         192.168.0.2:geneve      0.0.0.0:*
  UNCONN 0      0         192.168.0.1:geneve      0.0.0.0:*
  UNCONN 0      0      [2001:9292::2]:geneve            *:*
  UNCONN 0      0      [2001:9292::1]:geneve            *:*

Note that even if the local address is explicitly configured with
the wildcard address, kernel does not dump it except for devices with
IFLA_GENEVE_COLLECT_METADATA.  This is consistent with the behaviour
of is_tnl_info_zero(), which treats the wildcard remote address as not
configured.

  ## ynl example.
  # ./tools/net/ynl/pyynl/cli.py \
    --spec ./Documentation/netlink/specs/rt-link.yaml \
    --do newlink --create \
    --json '{"ifname": "geneve0",
             "linkinfo": {"kind":"geneve",
                  "data": {"local": "0.0.0.0",
           "collect-metadata": true}}}'

  # ./tools/net/ynl/pyynl/cli.py \
    --spec ./Documentation/netlink/specs/rt-link.yaml \
    --do getlink \
    --json '{"ifname": "geneve0"}' --output-json | \
    jq .linkinfo.data.local
  "0.0.0.0"

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260602190436.139591-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
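
Why the lookup has to take the source address into account can be seen in a reduced model. This is a sketch with hypothetical names (struct sock_model, find_sock_model()) that handles IPv4 only and ignores everything except the local address and the port; it is not the driver's geneve_find_sock().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, IPv4-only model of a bound tunnel socket. */
struct sock_model {
	uint32_t local;		/* 0 models the wildcard address */
	uint16_t port;
};

/* Look up a socket by port and local address. */
const struct sock_model *find_sock_model(const struct sock_model *socks,
					 size_t n, uint32_t local,
					 uint16_t port)
{
	assert(socks || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (socks[i].port == port && socks[i].local == local)
			return &socks[i];
	}
	return NULL;
}
```

With the address as part of the key, two sockets on the same UDP port but different local addresses resolve to different entries instead of colliding.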

geneve: Add dualstack flag to struct geneve_config.

When collect metadata mode (IFLA_GENEVE_COLLECT_METADATA) is
enabled, the GENEVE device creates both IPv4 and IPv6 sockets
and bind()s them to wildcard addresses.

The next patch allows creating only one socket bound to a
specific address even when the collect metadata mode is
enabled.

Then, we need a flag to distinguish dualstack GENEVE devices
to detect local address conflict.

Let's add the dualstack flag to struct geneve_config.

IFLA_GENEVE_COLLECT_METADATA processing is moved up in
geneve_nl2info() for the next patch to overwrite dualstack
to false while keeping collect_md true.

Note that IFLA_GENEVE_REMOTE and IFLA_GENEVE_REMOTE6 do not
set cfg->dualstack to false since is_tnl_info_zero() ignores
the wildcard remote address:

  # ip link add geneve0 type geneve external remote 0.0.0.1
  Error: Device is externally controlled, so attributes (VNI, Port, and so on) must not be specified.

  # ip link add geneve0 type geneve external remote 0.0.0.0
  # ss -ua | grep geneve
  UNCONN 0      0            0.0.0.0:geneve      0.0.0.0:*
  UNCONN 0      0                  *:geneve            *:*

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260602190436.139591-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

geneve: Pass struct geneve_dev to geneve_find_sock().

This is a prep patch to make a subsequent patch clean.

We will need to access geneve_dev->cfg.info.key.u.{ipv4,ipv6}.src
in geneve_find_sock() later and extend conditions there.

Let's pass down struct geneve_dev from geneve_sock_add() to
geneve_find_sock() and flatten the conditional logic.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260602190436.139591-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

geneve: Pass struct geneve_dev to geneve_create_sock().

This is a prep patch to make a subsequent patch clean.

We will need to access geneve_dev->cfg.info.key.u.{ipv4,ipv6}.src
in geneve_create_sock() later.

Let's pass down struct geneve_dev from geneve_sock_add() to
geneve_create_sock() instead of individual config fields.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260602190436.139591-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

geneve: Reuse ipv6_addr_type() result in geneve_nl2info().

geneve_nl2info() calls ipv6_addr_type() to check if the remote
IPv6 address is link-local.

Then, it also calls ipv6_addr_is_multicast() for the same address.

Let's not call ipv6_addr_is_multicast() and reuse ipv6_addr_type().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260602190436.139591-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: b44: use ethtool_puts

There's a subtle error with the memcpy here, where b44_gstrings should
not be dereferenced. Dereferencing causes the following error with W=1:

In file included from drivers/net/ethernet/broadcom/b44.c:17:
In file included from ./include/linux/module.h:18:
In file included from ./include/linux/kmod.h:9:
In file included from ./include/linux/umh.h:4:
In file included from ./include/linux/gfp.h:7:
In file included from ./include/linux/mmzone.h:8:
In file included from ./include/linux/spinlock.h:56:
In file included from ./include/linux/preempt.h:79:
In file included from ./arch/powerpc/include/asm/preempt.h:5:
In file included from ./include/asm-generic/preempt.h:5:
In file included from ./include/linux/thread_info.h:23:
In file included from ./arch/powerpc/include/asm/current.h:13:
In file included from ./arch/powerpc/include/asm/paca.h:16:
In file included from ./include/linux/string.h:386:
./include/linux/fortify-string.h:578:4: error: call to
'__read_overflow2_field' declared with 'warning' attribute: detected read
beyond size of field (2nd parameter); maybe use>
578 | __read_overflow2_field(q_size_field, size);
| ^

Instead of fixing the memcpy, use ethtool_puts, which is the proper
helper for printing ethtool gstrings.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20260531000334.388351-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
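
A userspace model of a puts-style gstring helper shows why it sidesteps the memcpy() problem: each string is copied into its own fixed-size slot, so no read can run past a single source string. The name gstring_puts_model() is hypothetical, and snprintf() stands in for the kernel's string copy.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define GSTRING_LEN 32	/* matches ETH_GSTRING_LEN in the kernel UAPI */

/*
 * Copy one string into the current fixed-size slot, then advance the
 * cursor to the next slot.
 */
void gstring_puts_model(uint8_t **data, const char *str)
{
	assert(data && *data && str);
	memset(*data, 0, GSTRING_LEN);			/* clear the slot */
	snprintf((char *)*data, GSTRING_LEN, "%s", str);	/* NUL-terminated */
	*data += GSTRING_LEN;
}
```

A caller would loop over its string table and call the helper once per entry, letting the cursor advance on its own.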

Merge branch 'net-mlx5-add-switchdev-mode-support-for-socket-direct-single-netdev-part-1-2'

Tariq Toukan says:

====================
net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 1/2

This series enables Socket Direct single netdev to operate in switchdev
mode with shared FDB. SD single netdev combines multiple PCI functions
behind a single netdev interface. To support switchdev offloads, these
functions must participate in virtual LAG (shared FDB).

Design

Rather than introducing a separate LAG instance for SD, this series
integrates SD secondary devices into the existing LAG structure
(priv.lag) created at probe time. Each lag_func entry carries a
group_id field that identifies its SD group membership (0 means not
part of any SD group). An xarray mark (XA_MARK_PORT) distinguishes
physical port entries from SD secondaries, enabling a single unified
iterator that filters by group:

  - MLX5_LAG_FILTER_PORTS: iterate port-level entries only (existing
    behavior, used by bonding, FW LAG commands, v2p_map)
  - MLX5_LAG_FILTER_ALL: iterate all devices including SD secondaries
    (used by MPESW shared FDB across all devices)
  - specific group_id: iterate only devices in that SD group (used by
    per-group SD shared FDB operations)

Existing callers use mlx5_ldev_for_each() which maps to
MLX5_LAG_FILTER_PORTS, preserving current behavior for non-SD
configurations.
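
The three filter modes can be modelled with a single predicate. This is a sketch: struct lag_func_model and the function names are hypothetical, and the sentinel values chosen for FILTER_PORTS and FILTER_ALL are assumptions made for illustration, not the values of the MLX5_LAG_FILTER_* constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sentinel values are assumptions; the real constants may differ. */
#define FILTER_PORTS	0u
#define FILTER_ALL	UINT32_MAX

/* Hypothetical model of one lag_func entry. */
struct lag_func_model {
	bool is_port;		/* models the XA_MARK_PORT mark */
	uint32_t group_id;	/* 0: not part of any SD group */
};

bool lag_filter_match(const struct lag_func_model *f, uint32_t filter)
{
	assert(f);
	if (filter == FILTER_ALL)
		return true;
	if (filter == FILTER_PORTS)
		return f->is_port;
	return f->group_id == filter;	/* a specific SD group */
}

size_t lag_count_matching(const struct lag_func_model *funcs, size_t n,
			  uint32_t filter)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		count += lag_filter_match(&funcs[i], filter);
	return count;
}
```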

Lifecycle and ownership

The SD LAG lifecycle is tied to the SD group, not to bonding events:

1. At PCI probe, mlx5_lag_add_mdev() creates the LAG structure
   (priv.lag) for each LAG-capable PF, e.g. SD primary devices.

2. During mlx5_sd_init(), after the SD group is fully formed (primary
   and secondaries paired), sd_lag_init() registers the secondary
   devices into the primary's existing priv.lag by calling
   mlx5_ldev_add_mdev() with the SD group_id. The primary's lag_func
   also gets its group_id set. No separate LAG instance is created.

3. After all the devices in the SD group transition to switchdev,
   mlx5_lag_shared_fdb_create() is invoked with the group_id to create
   a software-only shared FDB scoped to that SD group. This sets
   sd_fdb_active on all lag_func entries in the group. No FW LAG
   commands are issued since SD devices share the same physical port.

4. If MPESW (multi-port eswitch) is enabled on top of SD groups, the
   per-group SD shared FDB is torn down first, then MPESW shared FDB is
   created spanning all devices (ports + SD secondaries) using
   MLX5_LAG_FILTER_ALL. On MPESW disable, per-group SD shared FDB is
   restored.

5. On SD teardown (mlx5_sd_cleanup or device unbind), sd_lag_cleanup()
   removes secondaries from priv.lag and clears the primary's group_id.
   The LAG structure itself is not destroyed.

The sd_fdb_active flag is set on all lag_func entries in a group (not
just the primary), so any device can detect the SD shared FDB state
during lag_disable_change teardown without needing to look up peer
entries.

SD shared FDB is a pure software construct -- unlike regular LAG modes
(ROCE, SRIOV, MPESW), it does not issue FW create_lag/destroy_lag
commands. The software vport LAG for SD is implemented via eswitch
egress ACL bounce rules, managed by the IB layer through
mlx5_eth_lag_init(). The software LAG demux is implemented via
steering rules that use a new destination, VHCA_RX.

Patches

Infrastructure (patches 1, 5-6):
  - Factor out shared FDB code into a dedicated file
  - Extend lag_func with group_id and sd_fdb_active fields;
    add XA_MARK_PORT and unified iterator with group_id filter
  - Extend shared FDB API with group_id parameter

E-Switch preparation (patches 2-3):
  - Align eswitch disable sequence ordering
  - Move devcom init from TC to eswitch layer

SD group management (patches 4, 7-9):
  - Replace peer count check with direct peer lookup
  - Register SD secondaries in the existing LAG at SD init time
  - Block RoCE and VF LAG for SD devices
  - Block multipath LAG for SD devices

Switchdev integration (patch 10):
  - Keep netdev resources local in switchdev mode

Steering (patches 11-12):
  - Track peer flow slots with bitmap for selective peer flow deletion
  - Enable TC flow steering for SD LAG

Enablement (patch 13):
  - Verify unique vhca_id count for cross-VHCA RQT
====================

Link: https://patch.msgid.link/20260531113954.395443-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Verify unique vhca_id count instead of range

Change verify_num_vhca_ids() to count the number of unique vhca_ids
and verify this count doesn't exceed max_num_vhca_id, rather than
validating individual vhca_id values are within a specific range.

The previous implementation checked if each vhca_id was in the range
[0, max_num_vhca_id - 1], which is overly restrictive. The hardware
capability max_rqt_vhca_id represents the maximum number of unique
vhca_ids that can be used, not a range constraint on individual IDs.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-14-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
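
The difference between the range check and the unique-count check can be shown on a small array. This is a sketch with hypothetical function names and a simple quadratic scan; it does not reproduce verify_num_vhca_ids().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Count distinct values with a simple O(n^2) scan; fine for short arrays. */
size_t count_unique_ids(const uint16_t *ids, size_t n)
{
	size_t unique = 0;

	assert(ids || n == 0);
	for (size_t i = 0; i < n; i++) {
		bool seen = false;

		for (size_t j = 0; j < i; j++) {
			if (ids[j] == ids[i]) {
				seen = true;
				break;
			}
		}
		if (!seen)
			unique++;
	}
	return unique;
}

/* New-style check: bound the number of distinct IDs. */
bool verify_by_count(const uint16_t *ids, size_t n, size_t max_num)
{
	return count_unique_ids(ids, n) <= max_num;
}

/* Old-style check: require every ID to lie in [0, max_num - 1]. */
bool verify_by_range(const uint16_t *ids, size_t n, size_t max_num)
{
	for (size_t i = 0; i < n; i++) {
		if (ids[i] >= max_num)
			return false;
	}
	return true;
}
```

For IDs such as 7 and 9 with a limit of two, the count check passes while the range check rejects the configuration.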

net/mlx5e: TC, enable steering for SD LAG

Enable TC flow steering for SD LAG mode by extending multiport
eligibility checks and peer flow handling.

SD LAG operates similarly to MPESW for TC offloads - flows on
secondary devices need peer flow creation on the primary, and
multiport forwarding rules are eligible when either MPESW or SD LAG
is active.

Add mlx5_lag_is_sd() helper to query SD LAG mode, and
mlx5_sd_is_primary() to identify the primary device. Redirect uplink
priv/proto_dev queries to the primary device's eswitch in SD
configurations.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-13-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: TC, track peer flow slots with bitmap

With SD devices joining the LAG, peer flows are not created for all
devcom peers - SD devices skip peers that belong to a different SD
group. However, the delete path iterated all devcom peers
unconditionally, attempting to delete from slots that were never
populated.

Track which peer slots are populated using a bitmap in mlx5e_tc_flow.
The delete path now iterates only set bits, matching exactly the slots
that were set up during flow creation.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-12-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
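
Tracking populated slots with a bitmap can be sketched as below. The struct, the MAX_PEERS bound and the function names are hypothetical; a single 32-bit word stands in for the kernel bitmap helpers.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PEERS 32	/* hypothetical upper bound on peer slots */

struct flow_model {
	uint32_t peer_bitmap;		/* bit i set: slot i was populated */
	int peer_flows[MAX_PEERS];	/* stand-in for per-peer flow state */
};

/* Record that a peer flow was created in slot `idx`. */
void flow_add_peer(struct flow_model *flow, unsigned int idx, int value)
{
	assert(flow && idx < MAX_PEERS);
	flow->peer_flows[idx] = value;
	flow->peer_bitmap |= UINT32_C(1) << idx;
}

/* Delete only populated slots; returns how many slots were torn down. */
unsigned int flow_del_peers(struct flow_model *flow)
{
	unsigned int deleted = 0;

	assert(flow);
	for (unsigned int idx = 0; idx < MAX_PEERS; idx++) {
		if (!(flow->peer_bitmap & (UINT32_C(1) << idx)))
			continue;	/* never populated, nothing to delete */
		flow->peer_flows[idx] = 0;
		flow->peer_bitmap &= ~(UINT32_C(1) << idx);
		deleted++;
	}
	return deleted;
}
```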

net/mlx5: SD, keep netdev resources on same PF in switchdev mode

In SD switchdev mode, network device resources such as channels and
completion vectors must remain on the same PF rather than being
distributed across SD group members.

Modify mlx5_sd_ch_ix_get_dev_ix() to return 0 and
mlx5_sd_ch_ix_get_vec_ix() to return the channel index directly when
in switchdev mode, keeping resources local to the requesting PF.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-11-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, block multipath LAG for SD devices

SD devices are not compatible with multipath LAG since they use
dedicated SD LAG for cross-socket connectivity. Add an SD check
to the multipath prereq validation to prevent multipath LAG
activation on SD-configured ports.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-10-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, block RoCE and VF LAG for SD devices

Socket Direct devices manage their own LAG via SD LAG infrastructure.
Block the standard netdev-event-driven LAG path (RoCE LAG and VF LAG)
for SD devices to prevent conflicting LAG configurations.

Expose mlx5_sd_is_supported() as a public helper that encapsulates all
SD eligibility checks. Use it in mlx5_lag_dev_alloc() to skip netdev
notifier registration for SD-capable devices at alloc time. Some SD
code is reordered to expose the new function; no logic is changed.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: SD, introduce Socket Direct LAG

Register SD secondary devices with the existing LAG structure by
adding them to the primary's ldev xarray with a shared group_id.
This ties the SD LAG lifecycle to the SD group lifecycle.

Add sd_lag_state debugfs entry for LAG state visibility. To avoid
race between this entry and LAG deletion, have debugfs creation
and deletion done last on SD init and first on SD cleanup.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, extend shared FDB API with group_id filter

Add a group_id parameter to mlx5_lag_shared_fdb_create() and
mlx5_lag_shared_fdb_destroy() to scope shared FDB operations to a
specific SD group.

When group_id is U32_MAX, the functions operate on all LAG devices. When
group_id is non-zero, they operate only on devices in that SD group
without issuing FW LAG commands, since SD LAG is a pure software
construct.
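
The filter semantics can be sketched in Python as below. This is a model under stated assumptions, not the kernel API: select_lag_devices is a hypothetical name, devices are plain (name, group_id) tuples, and giving regular LAG ports a group_id of 0 is an assumption for the example.

```python
U32_MAX = 0xFFFFFFFF


def select_lag_devices(devices, group_id):
    """Return (device names to operate on, whether to issue FW LAG commands)."""
    if group_id == U32_MAX:
        # Wildcard: operate on every LAG device, with FW commands.
        return [name for name, _ in devices], True
    # Scoped to one SD group: software-only, no FW LAG commands.
    return [name for name, gid in devices if gid == group_id], False


devices = [("pf0", 0), ("pf1", 0), ("sd0", 7), ("sd1", 7)]
all_devs = select_lag_devices(devices, U32_MAX)
sd_group = select_lag_devices(devices, 7)
print(all_devs)
print(sd_group)
```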

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, prepare for SD device integration

Socket Direct (SD) secondary devices will participate in LAG, even
though they are silent. SD secondary devices share the same physical
port as their primary but are separate PCI functions that need to be
tracked alongside regular LAG ports.

Extend lag_func with a group_id field to identify SD group membership
and introduce a unified iterator that can filter by group. Add APIs
for registering SD secondary devices in an existing LAG.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, replace peer count check with direct peer lookup

Replace mlx5_eswitch_get_npeers() count-based check with a new
mlx5_eswitch_is_peer() function that directly verifies the peer
relationship between two eswitches.

This change prepares for SD LAG support, which is a virtual LAG that
does not have num_lag_ports capability and cannot use the count-based
peer validation.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: E-Switch, move devcom init from TC to eswitch layer

Move the E-Switch devcom component management from the TC layer to the
ESW layer.

This refactoring places devcom lifecycle management at the appropriate
layer and prepares for SD LAG which needs devcom registration
independent of the TC/representor initialization.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: E-Switch, align disable sequence with switchdev-to-legacy transition

Align the eswitch disable sequence with the
switchdev-to-legacy mode transition, where eswitch must be disabled
before device detachment. The consistent ordering is required for proper
SD LAG cleanup which depends on eswitch state during teardown.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, factor out shared FDB code into dedicated file

Refactor shared FDB LAG logic into a new lag/shared_fdb.c file to
improve code organization and enable reuse. Move shared FDB specific
functions from lag.c and introduce consolidated APIs:
- mlx5_lag_shared_fdb_create() handles LAG activation with shared FDB
- mlx5_lag_shared_fdb_destroy() handles LAG deactivation with shared FDB

Update mlx5_do_bond(), mlx5_disable_lag() and mpesw.c to use the new
APIs, which simplifies the shared FDB code paths.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: bind_bhash: fix memory leak in bind_socket

The getaddrinfo() call in bind_socket() dynamically allocates memory
for the result linked list that must be freed with freeaddrinfo().
However, none of the code paths after a successful getaddrinfo() call
free this memory, causing a leak in every invocation of bind_socket().

Signed-off-by: longlong yan <yanlonglong@kylinos.cn>
Link: https://patch.msgid.link/20260601013927.1835-1-yanlonglong@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: Improve Tx timer arm logic further

Calling hrtimer_start() on an already-active txtimer is unnecessary
and expensive. Skip the restart if the timer is already active by
adding an hrtimer_active() check before hrtimer_start().

Previously, each packet reset the timer to tx_coal_timer in the future,
acting as a sliding window that delayed NAPI under burst traffic. With
this change, an already-active timer is left to fire sooner, scheduling
NAPI within tx_coal_timer of the first packet and freeing TX descriptors
earlier.

There is no race concern: hrtimer_start() is internally serialized and
safe to call on an active timer. In the event of a race between
hrtimer_active() and hrtimer_start(), the worst case is calling
hrtimer_start() on an already-active timer, which is identical to the
pre-patch behaviour.
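
The difference between the two arming policies can be modeled with a small Python simulation. This is a sketch, not the stmmac code: packet arrival times and the coalescing interval are made-up values in microseconds, and both function names are hypothetical.

```python
def fire_times_rearm_always(arrivals, coal):
    """Old policy: every packet re-arms the timer to now + coal."""
    fires, deadline = [], None
    for t in arrivals:
        if deadline is not None and t >= deadline:
            fires.append(deadline)  # timer expired before this packet
            deadline = None
        deadline = t + coal         # sliding window: always push out
    if deadline is not None:
        fires.append(deadline)
    return fires


def fire_times_skip_if_active(arrivals, coal):
    """New policy: arm only if the timer is not already active."""
    fires, deadline = [], None
    for t in arrivals:
        if deadline is not None and t >= deadline:
            fires.append(deadline)
            deadline = None
        if deadline is None:        # models the hrtimer_active() check
            deadline = t + coal
    if deadline is not None:
        fires.append(deadline)
    return fires


arrivals = [0, 400, 800, 1200, 1600]  # a burst, one packet every 400 us
old = fire_times_rearm_always(arrivals, coal=1000)
new = fire_times_skip_if_active(arrivals, coal=1000)
print(old, new)
```

In this model the old policy fires once, 1000 us after the last packet of the burst, while the new policy fires 1000 us after the first packet and again for the packets that followed.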

Performance on Cyclone V with dwmac-socfpga (iperf3 -u -b 0 -l 64):
Before: ~45200 pps
After: ~52300 pps (~15% improvement)

Additionally, ~10% improvement in UDP throughput observed on Agilex5,
with hrtimer CPU usage reduced from ~8% to ~0.6%.

Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260529064659.32287-1-muhammad.nazim.amirul.nazle.asmade@altera.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'dpaa2-switch-various-improvements'

Ioana Ciornei says:

====================
dpaa2-switch: various improvements

This patch set comprises improvements and fixes for
long-standing bugs which were only caught by sashiko while reviewing the
LAG support patches for the dpaa2-switch:
https://lore.kernel.org/all/20260512131554.952971-1-ioana.ciornei@nxp.com/

In order to not just add to the already big set, I am submitting these
before any v3 of the LAG support patches.

The individual patches tackle FDB and VLAN management in the
dpaa2-switch driver as well as removal of some duplicated code. The
error path of the dpaa2_switch_rx() is also improved.
====================

Link: https://patch.msgid.link/20260528173452.1953102-1-ioana.ciornei@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpaa2-switch: fix handling of NAPI on the remove path

All the NAPI instances for a DPSW device are attached to the first
switch port's net_device but shared by all ports. The NAPI instances get
disabled only once the last port goes down.

This causes an issue on the .remove() path where each port is
unregistered and freed one at a time, causing the NAPI instances to be
deleted even though they are not disabled.

In order to avoid this, split up the unregister_netdev() calls from the
free_netdev() so that we make sure all ports go down before we attempt
a deletion of NAPI instances. Also, make the netif_napi_del() explicit
as it is on the .probe() path.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://patch.msgid.link/20260528173452.1953102-6-ioana.ciornei@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpaa2-switch: support VLAN flag changes on existing VIDs

The switchdev core notifies the driver on VLAN flag changes on existing
VIDs through the changed field of the switchdev_obj_port_vlan structure.
Without this patch, the driver was erroring out from the start if the
same VID was inserted twice, from its perspective, even though the
second call was actually a flag change.

$ bridge vlan add dev eth2 vid 30 untagged
$ bridge vlan add dev eth2 vid 30
[ 458.589534] fsl_dpaa2_switch dpsw.0 eth2: VLAN 30 already configured

This patch fixes the above behavior by, first of all, removing the
checks and return of errors on a VLAN already being installed. The patch
also moves the sequence of code which checks if there is space for a new
VLAN so that the verification is being done only for VLANs not known by
the switch and not flag changes.

A new parameter is added to the dpaa2_switch_port_add_vlan() function so
that we pass the vlan->changed necessary information. Based on this new
parameter and the flags value, the untagged flag will be added or
removed from a VLAN installed on a port. The same thing is also extended
to the PVID configuration.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://patch.msgid.link/20260528173452.1953102-5-ioana.ciornei@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpaa2-switch: remove duplicated check for the maximum number of VLANs

The check for the maximum number of VLANs is duplicated in
the dpaa2_switch_port_vlans_add() function. Remove one of the instances
so that we do not have dead code.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://patch.msgid.link/20260528173452.1953102-4-ioana.ciornei@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpaa2-switch: fix the error path in dpaa2_switch_rx()

In case of an error in dpaa2_switch_rx(), the dpaa2_switch_free_fd()
function is called in order to free the FD. This is incorrect since the
dpaa2_switch_free_fd() is intended to be used on Tx frame descriptors,
meaning that it expects in the software annotation area of the FD data
to find a valid skb pointer on which to call dev_kfree_skb().

Fix this by extracting the dma_unmap_page() from
dpaa2_switch_build_linear_skb() directly into the dpaa2_switch_rx()
function. This allows us to directly use free_pages() in case of an
error before an SKB was created and kfree_skb() afterwards.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://patch.msgid.link/20260528173452.1953102-3-ioana.ciornei@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpaa2-switch: rework FDB management on the bridge leave path

On bridge leave, the dpaa2_switch_port_set_fdb() function always
allocates a new FDB for the port which is becoming standalone. In case
no FDB is found, then the port leaving a bridge will continue to use the
current one.

The above logic does not cover the case in which there are multiple
bridges which have ports from the same DPSW instance. In this case, when
the last port leaves bridge #1, it finds an unused FDB to switch to, but
the old FDB is not marked as unused. Since the number of FDBs is equal
to the number of DPSW interfaces, this will eventually lead to multiple
ports sharing the same FDB.

Fix this by changing how we are managing the FDBs on the leave path.
Instead of directly allocating a new FDB, first verify if the current
port is the last one to leave a bridge. If this is the case, then
continue to use the current FDB and only allocate another FDB if there
are other ports remaining in the bridge.
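
The reworked leave path can be sketched in Python as follows. This is a simplified model, not the driver: ports and FDBs are plain integers, and leave_bridge, bridge_ports, port_fdb and free_fdbs are hypothetical names.

```python
def leave_bridge(port, bridge_ports, port_fdb, free_fdbs):
    """Pick the FDB a port uses after leaving its bridge."""
    bridge_ports.discard(port)
    if not bridge_ports:
        # Last port out: keep the bridge's FDB, nothing is leaked.
        return port_fdb[port]
    if not free_fdbs:
        # No unused FDB found: keep using the current one.
        return port_fdb[port]
    # Other ports remain in the bridge: move to a private, unused FDB.
    port_fdb[port] = free_fdbs.pop()
    return port_fdb[port]


# Two ports share FDB 0 while bridged; FDB 1 is unused.
bridge_ports = {0, 1}
port_fdb = {0: 0, 1: 0}
free_fdbs = [1]
leave_bridge(0, bridge_ports, port_fdb, free_fdbs)  # port 1 still bridged
leave_bridge(1, bridge_ports, port_fdb, free_fdbs)  # last one out
print(port_fdb)
```

After both ports leave, each one owns a distinct FDB and no FDB is left marked as in use without an owner.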

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://patch.msgid.link/20260528173452.1953102-2-ioana.ciornei@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: sja1105: flower: reject cross-chip redirect

dsa_port_from_netdev() may return a valid port from a different switch
chip. Programming another chip's port index into the local hardware
causes redirection to the wrong port, or an out-of-bounds access if the
index exceeds the local chip's port count.

Apply a minimal fix that adds a check to catch this case and adjusts the
extack message. When cls->common.skip_sw is not set, the operation could
instead redirect to the upstream port and let the software or upstream
switch(es) handle the forward, but that is not addressed here.
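
A hedged Python sketch of the added check: the redirect target must belong to the local switch before its port index is programmed. DsaPort and redirect_port_index are hypothetical names, and the error strings are not the driver's extack messages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DsaPort:
    switch_id: int  # which switch chip the port belongs to
    index: int      # port index local to that chip


def redirect_port_index(local_switch_id, num_ports, target):
    """Return the port index to program, or raise if the target is unusable."""
    if target.switch_id != local_switch_id:
        raise ValueError("redirect target is on a different switch")
    if not 0 <= target.index < num_ports:
        raise ValueError("port index out of range")
    return target.index


same_chip = redirect_port_index(0, 5, DsaPort(switch_id=0, index=3))
try:
    redirect_port_index(0, 5, DsaPort(switch_id=1, index=7))
    cross_chip_rejected = False
except ValueError:
    cross_chip_rejected = True
print(same_chip, cross_chip_rejected)
```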

Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/20260530003940.2000994-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fec_mpc52xx: add missing kernel-doc for @may_sleep

Add the missing @may_sleep parameter description to the
mpc52xx_fec_stop kernel-doc comment.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20260531000042.369043-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: openvswitch: add dec_ttl action support and test

Add dec_ttl action support to the OVS kernel datapath selftest
framework:

  - Add dec_ttl nested NLA class to ovs-dpctl.py with proper
    OVS_DEC_TTL_ATTR_ACTION sub-attribute handling
  - Add parse support for dec_ttl(le_1(<inner_actions>)) action
    string, consistent with the odp-util.c format where le_1()
    holds the actions taken when TTL reaches 1
  - Add dpstr output formatting for dec_ttl actions
  - Add test_dec_ttl() to openvswitch.sh that verifies:
    * Normal TTL packets are forwarded after decrement
    * TTL=1 packets are dropped (TTL expiry)
    * Graceful skip via ksft_skip if kernel lacks dec_ttl support

The dec_ttl class uses late-binding type resolution to reference
ovsactions for its inner action list, avoiding circular references
at class definition time.

Signed-off-by: Minxi Hou <houminxi@gmail.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20260530021443.1734484-1-houminxi@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mlx5-avoid-payload-in-skb-s-linear-part-for-better-gro-processing'

Tariq Toukan says:

====================
net/mlx5: Avoid payload in skb's linear part for better GRO-processing

This is V7 of a series originally submitted by Christoph.

When LRO is enabled on the MLX, mlx5e_skb_from_cqe_mpwrq_nonlinear
copies parts of the payload to the linear part of the skb.

This triggers suboptimal processing in GRO, causing slow throughput.

This patch series addresses this by using eth_get_headlen to compute the
size of the protocol headers and only copy those bits. This results in a
significant throughput improvement (detailed results in the specific
patch).
====================

Link: https://patch.msgid.link/20260601061522.398044-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Avoid copying payload to the skb's linear part

mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
bytes from the page-pool to the skb's linear part. Those 256 bytes
include part of the payload.

When attempting to do GRO in skb_gro_receive, if headlen > data_offset
(and skb->head_frag is not set), we end up aggregating packets in the
frag_list.

This is of course not good when we are CPU-limited. It also causes a worse
skb->len/truesize ratio.

So, let's avoid copying parts of the payload to the linear part. We use
eth_get_headlen() to parse the headers and compute the length of the
protocol headers, which will be used to copy the relevant bits of the
skb's linear part.

We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking
stack needs to call pskb_may_pull() later on, we don't need to reallocate
memory.
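
To make the idea concrete, here is a Python sketch that computes how many bytes of a frame are protocol headers. It is a much-reduced stand-in for eth_get_headlen(): it only understands Ethernet + IPv4 + TCP, and the function name header_len and the synthetic frame are assumptions for the example.

```python
import struct

ETH_HLEN = 14
ETH_P_IP = 0x0800
IPPROTO_TCP = 6
RX_MAX_HEAD = 256  # mirrors MLX5E_RX_MAX_HEAD from the text above


def header_len(frame, limit=RX_MAX_HEAD):
    """Length of the Ethernet/IPv4/TCP headers, capped at `limit`."""
    if len(frame) < ETH_HLEN:
        return len(frame)
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    off = ETH_HLEN
    if ethertype != ETH_P_IP or len(frame) < off + 20:
        return min(off, limit)
    ihl = (frame[off] & 0x0F) * 4      # IPv4 header length in bytes
    if ihl < 20:
        return min(off, limit)         # malformed header: stop at Ethernet
    proto = frame[off + 9]
    off += ihl
    if proto == IPPROTO_TCP and len(frame) >= off + 20:
        off += (frame[off + 12] >> 4) * 4  # TCP data offset in bytes
    return min(off, limit, len(frame))


eth = bytes(12) + struct.pack("!H", ETH_P_IP)
ip = bytes([0x45]) + bytes(8) + bytes([IPPROTO_TCP]) + bytes(10)
tcp = bytes(12) + bytes([0x50]) + bytes(7)
frame = eth + ip + tcp + bytes(1000)

headlen = header_len(frame)
old_copy = min(RX_MAX_HEAD, len(frame))
print(headlen, old_copy - headlen)
```

For this frame only 54 bytes are headers, so a fixed 256-byte copy would pull 202 bytes of payload into the linear part.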

This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
LRO enabled):

BEFORE:
=======
(netserver pinned to core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
87380  16384 262144    60.01    32547.82

(netserver pinned to adjacent core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
87380  16384 262144    60.00    52531.67

AFTER:
======
(netserver pinned to core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
87380  16384 262144    60.00    52896.06

(netserver pinned to adjacent core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
87380  16384 262144    60.00    85094.90

Additional tests across a larger range of parameters w/ and w/o LRO, w/
and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
better performance with this patch.

For XDP, pull at most ETH_HLEN bytes into the linear area so that XDP_PASS
can also benefit from this improvement and keep things simple when
dealing with skb geometry changes from the XDP program.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260601061522.398044-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear

Doing the call to dma_sync_single_for_cpu() earlier will allow us to
adjust headlen based on the actual size of the protocol headers.

Doing this earlier means that we don't need to call
mlx5e_copy_skb_header() anymore and rather can call
skb_copy_to_linear_data() directly.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260601061522.398044-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-airoha-preliminary-patches-to-support-multiple-net_devices-connected-to-the-same-gdm-port'

Lorenzo Bianconi says:

====================
net: airoha: Preliminary patches to support multiple net_devices connected to the same GDM port

EN7581 or AN7583 SoCs support connecting multiple external SerDes (e.g.
Ethernet or USB SerDes) to GDM3 or GDM4 ports via a hw arbiter that
manages the traffic in a TDM manner. As a result multiple net_devices can
connect to the same GDM{3,4} port and there is a theoretical "1:n"
relation between GDM ports and net_devices.

           ┌─────────────────────────────────┐
           │                                 │    ┌──────┐
           │                         P1 GDM1 ├────►MT7530│
           │                                 │    └──────┘
           │                                 │      ETH0 (DSA conduit)
           │                                 │
           │              PSE/FE             │
           │                                 │
           │                                 │
           │                                 │    ┌─────┐
           │                         P0 CDM1 ├────►QDMA0│
           │  P4                     P9 GDM4 │    └─────┘
           └──┬─────────────────────────┬────┘
              │                         │
           ┌──▼──┐                 ┌────▼────┐
           │ PPE │                 │   ARB   │
           └─────┘                 └─┬─────┬─┘
                                     │     │
                                  ┌──▼──┐┌─▼───┐
                                  │ ETH ││ USB │
                                  └─────┘└─────┘
                                   ETH1   ETH2

This is a preliminary series to introduce support for multiple net_devices
connected to the same Frame Engine (FE) GDM port (GDM3 or GDM4) via an
external hw arbiter.
====================

Link: https://patch.msgid.link/20260527-airoha-eth-multi-serdes-preliminary-v1-0-ec6ed73ef7fc@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Rename airoha_set_gdm2_loopback in airoha_enable_gdm2_loopback

This is a preliminary patch in order to allow the user to select if the
configured device will be used as hw lan or wan.
Please note this patch does not introduce any logical changes, just
cosmetic ones.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260527-airoha-eth-multi-serdes-preliminary-v1-6-ec6ed73ef7fc@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Move {cpu,fwd}_tx_packets in airoha_gdm_dev struct

Since now multiple net_devices connected to different QDMA blocks can
share the same GDM port, cpu_tx_packets and fwd_tx_packets fields can
be overwritten with the value from a different QDMA block. In order to
fix the issue move cpu_tx_packets and fwd_tx_packets fields from
airoha_gdm_port struct to airoha_gdm_dev one.

Tested-by: Xuegang Lu <xuegang.lu@airoha.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260527-airoha-eth-multi-serdes-preliminary-v1-5-ec6ed73ef7fc@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Move qos_sq_bmap in airoha_gdm_dev struct

Since now multiple net_devices connected to different QDMA blocks can
share the same GDM port, qos_sq_bmap field can be overwritten with the
configuration obtained from a net_device connected to a different QDMA
block. In order to fix the issue move qos_sq_bmap field from
airoha_gdm_port struct to airoha_gdm_dev one.
Add qos_channel_map bitmap in airoha_qdma struct to track if a shared
QDMA channel is already in use by another net_device.

Tested-by: Xuegang Lu <xuegang.lu@airoha.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260527-airoha-eth-multi-serdes-preliminary-v1-4-ec6ed73ef7fc@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Rely on airoha_gdm_dev pointer in airoha_is_lan_gdm_port()

Rename airoha_is_lan_gdm_port to airoha_is_lan_gdm_dev. Moreover, rely
on airoha_gdm_dev pointer in airoha_is_lan_gdm_dev() instead of
airoha_gdm_port one.
This is a preliminary patch to support multiple net_devices connected to
the same GDM{3,4} port via an external hw arbiter.

Tested-by: Xuegang Lu <xuegang.lu@airoha.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260527-airoha-eth-multi-serdes-preliminary-v1-3-ec6ed73ef7fc@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Move airoha_qdma pointer in airoha_gdm_dev struct

Move airoha_qdma pointer from airoha_gdm_port struct to airoha_gdm_dev
one since the QDMA block used depends on the particular net_device
WAN/LAN configuration and in the current codebase net_device pointer is
associated to airoha_gdm_dev struct.
This is a preliminary patch to support multiple net_devices connected
to the same GDM{3,4} port via an external hw arbiter.

Tested-by: Xuegang Lu <xuegang.lu@airoha.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260527-airoha-eth-multi-serdes-preliminary-v1-2-ec6ed73ef7fc@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Introduce airoha_gdm_dev struct

EN7581 and AN7583 SoCs support connecting multiple external SerDes to GDM3
or GDM4 ports via a hw arbiter that manages the traffic in a TDM manner.
As a result multiple net_devices can connect to the same GDM{3,4} port
and there is a theoretical "1:n" relation between GDM port and
net_devices.
Introduce airoha_gdm_dev struct to collect net_device related info (e.g.
net_device and external phy pointer). Please note this is just a
preliminary patch and we are still supporting a single net_device for
each GDM port. Subsequent patches will add support for multiple net_devices
connected to the same GDM port.

Tested-by: Xuegang Lu <xuegang.lu@airoha.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260527-airoha-eth-multi-serdes-preliminary-v1-1-ec6ed73ef7fc@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

doc/netlink: rt-link: fix binary attributes marked as strings

These link-attrs attributes were previously marked as strings:

- wireless - struct iw_event
- protinfo - a nest of ifla6-attrs or linkinfo-brport-attrs
- cost, priority - unused

Signed-off-by: Remy D. Farley <one-d-wide@protonmail.com>
Link: https://patch.msgid.link/20260529121355.1564817-1-one-d-wide@protonmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'netdevsim-psp-fix-issues-with-stats-collection'

Daniel Zahka says:

====================
netdevsim: psp: fix issues with stats collection

It has come to my attention via a sashiko review of my net-next series
for aes-gcm in netdevsim [1] that there were preexisting issues with
netdevsim's implementation of psp statistics.

API usage issues:
1. not calling u64_stats_init() on the u64_stats_sync object during
   init
2. not serializing usage of the writer side API during stats update

Logical Bugs:
1. We were incrementing rx stats on the sending device's stats
   counters.

Fix the first set of issues by removing the u64_stats_t api entirely,
and keep track of stats with atomics. Fix the second issue by charging
events to the right netdevsim object.

[1]: https://sashiko.dev/#/patchset/20260508-nsim-psp-crypto-v1-0-4b50ed09b794%40gmail.com

  TAP version 13
  1..28
  ok 1 psp.data_basic_send_v0_ip4
  ok 2 psp.data_basic_send_v0_ip6
  ok 3 psp.data_basic_send_v1_ip4
  ok 4 psp.data_basic_send_v1_ip6
  ok 5 psp.data_basic_send_v2_ip4
  ok 6 psp.data_basic_send_v2_ip6
  ok 7 psp.data_basic_send_v3_ip4
  ok 8 psp.data_basic_send_v3_ip6
  ok 9 psp.data_mss_adjust_ip4
  ok 10 psp.data_mss_adjust_ip6
  ok 11 psp.dev_list_devices
  ok 12 psp.dev_get_device
  ok 13 psp.dev_get_device_bad
  ok 14 psp.dev_rotate
  ok 15 psp.dev_rotate_spi
  ok 16 psp.assoc_basic
  ok 17 psp.assoc_bad_dev
  ok 18 psp.assoc_sk_only_conn
  ok 19 psp.assoc_sk_only_mismatch
  ok 20 psp.assoc_sk_only_mismatch_tx
  ok 21 psp.assoc_sk_only_unconn
  ok 22 psp.assoc_version_mismatch
  ok 23 psp.assoc_twice
  ok 24 psp.data_send_bad_key
  ok 25 psp.data_send_disconnect
  ok 26 psp.data_stale_key
  ok 27 psp.removal_device_rx
  ok 28 psp.removal_device_bi
  # Totals: pass:28 fail:0 xfail:0 xpass:0 skip:0 error:0

Dump stats on both devs; tx on one should match rx on the other:
local dev:
id=5 ifindex=2 stats={'dev-id': 5, 'key-rotations': 0,
'stale-events': 0, 'rx-packets': 1226, 'rx-bytes': 39244,
'rx-auth-fail': 0, 'rx-error': 0, 'rx-bad': 0, 'tx-packets': 1931,
'tx-bytes': 2478908, 'tx-error': 0}

remote dev:
id=3 ifindex=2 stats={'dev-id': 3, 'key-rotations': 0, 'stale-events':
0, 'rx-packets': 1931, 'rx-bytes': 2478908, 'rx-auth-fail': 0,
'rx-error': 0, 'rx-bad': 0, 'tx-packets': 1226, 'tx-bytes': 39244,
'tx-error': 0}
====================

Link: https://patch.msgid.link/20260529-fix-psp-stats-v2-0-3a194eacf18e@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netdevsim: psp: use atomic64 for psp stats counters

The existing u64_stats_t-based psp counters had two preexisting api
usage bugs: u64_stats_init() was never called on the syncp object, and
the writer side of the u64_stats_update_begin()/end() api was not
serialized. Switch the counters to atomic64_t instead. Atomics need
no initialization and are inherently safe against concurrent writers,
eliminating both bugs at once.

Use atomic64_t rather than atomic_long_t so byte counters don't wrap
at 4 GiB on 32-bit builds.
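
The wrap-around arithmetic can be checked with a short Python sketch. It only models unsigned counters of a given width; add_wrapping is a hypothetical helper, not a kernel function.

```python
def add_wrapping(counter, delta, bits):
    """Add to an unsigned counter that is `bits` wide, wrapping on overflow."""
    return (counter + delta) & ((1 << bits) - 1)


FOUR_GIB = 1 << 32
start = FOUR_GIB - 100  # byte counter just below 4 GiB

after_32 = add_wrapping(start, 1500, bits=32)  # 32-bit long on 32-bit builds
after_64 = add_wrapping(start, 1500, bits=64)  # atomic64_t everywhere
print(after_32, after_64)
```

A 32-bit counter restarts near zero after one more 1500-byte packet, while the 64-bit counter keeps the true total.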

Fixes: 178f0763c5f3 ("netdevsim: implement psp device stats")
Cc: <stable+noautosel@kernel.org> # netdevsim is a test harness, it's never loaded on production systems
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20260529-fix-psp-stats-v2-2-3a194eacf18e@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netdevsim: psp: update rx stats on the peer netdevsim

nsim_do_psp() handles both tx and rx psp processing in the sending
device's nsim_start_xmit() path. The existing code has a logical bug,
where we erroneously increment rx_bytes and rx_packets on the sending
device's stats, instead of the peer device's.

Additionally, compute psp_len after psp_dev_encapsulate() and before
psp_dev_rcv(), which modifies the header region of the skb. The
existing calculation was actually correct, because psp_dev_rcv()
leaves skb_inner_transport_header pointing at the tcp header, but this
is fragile and confusing as there is no actual inner transport header
after psp_dev_rcv has removed udp encapsulation.
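
The accounting rule can be sketched in Python: since one transmit path handles both directions, tx is charged to the sender and rx to the peer. NsimStats and psp_xmit are hypothetical names and the packet lengths are made up.

```python
class NsimStats:
    """Per-device PSP counters (toy model)."""

    def __init__(self):
        self.tx_packets = self.tx_bytes = 0
        self.rx_packets = self.rx_bytes = 0


def psp_xmit(sender, peer, psp_len):
    """Charge tx to the sending device and rx to its peer."""
    sender.tx_packets += 1
    sender.tx_bytes += psp_len
    peer.rx_packets += 1   # the peer, not the sender
    peer.rx_bytes += psp_len


local, remote = NsimStats(), NsimStats()
for length in (1500, 1500, 64):
    psp_xmit(local, remote, length)
psp_xmit(remote, local, 32)
print(local.tx_bytes, remote.rx_bytes, remote.tx_bytes, local.rx_bytes)
```

With this rule, tx on one device always equals rx on the other.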

Fixes: 178f0763c5f3 ("netdevsim: implement psp device stats")
Cc: <stable+noautosel@kernel.org> # netdevsim is a test harness, it's never loaded on production systems
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20260529-fix-psp-stats-v2-1-3a194eacf18e@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netdevsim: fib: fix use-after-free of FIB data via debugfs

Writing to the netdevsim debugfs file
"netdevsim/netdevsimN/fib/nexthop_bucket_activity" enters
nsim_nexthop_bucket_activity_write(), which looks up a nexthop in
data->nexthop_ht under rtnl_lock(). If a network namespace teardown,
devlink reload or device deletion runs concurrently, nsim_fib_destroy()
frees that rhashtable (and the surrounding nsim_fib_data) while the
write is still in flight, leading to a slab-use-after-free:

  BUG: KASAN: slab-use-after-free in nsim_nexthop_bucket_activity_write+0xb9e/0xdf0
  Read of size 4 at addr ff1100001a379808 by task syz.0.11967/27894

  CPU: 0 UID: 0 PID: 27894 Comm: syz.0.11967 Not tainted 7.1.0-rc4-gf6f1bfc1980a #4
  Call Trace:
   nsim_nexthop_bucket_activity_write+0xb9e/0xdf0
   full_proxy_write+0x135/0x1a0
   vfs_write+0x2e2/0x1040
   ksys_write+0x146/0x270
   __x64_sys_write+0x76/0xb0
   do_syscall_64+0xb9/0x5b0
   entry_SYSCALL_64_after_hwframe+0x74/0x7c

  Allocated by task 15957:
   rhashtable_init_noprof+0x3ec/0x860
   nsim_fib_create+0x371/0xca0
   nsim_drv_probe+0xd60/0x15c0
   ...
   new_device_store+0x425/0x7f0

  Freed by task 24:
   rhashtable_free_and_destroy+0x10d/0x620
   nsim_fib_destroy+0xc9/0x1c0
   nsim_dev_reload_destroy+0x1e7/0x530
   nsim_dev_reload_down+0x6b/0xd0
   devlink_reload+0x1b5/0x770
   devlink_pernet_pre_exit+0x25d/0x3a0
   ops_undo_list+0x1b7/0xb90
   cleanup_net+0x47f/0x8a0

  The buggy address belongs to the object at ff1100001a379800
   which belongs to the cache kmalloc-1k of size 1024

The freed 1k object is the bucket table of data->nexthop_ht. Shortly
after, the dangling table is dereferenced again and the machine also
takes a GPF in __rht_bucket_nested() from the same call site.

The root cause is a lifetime mismatch: the debugfs files reference
nsim_fib_data (the writer dereferences data->nexthop_ht), but the
interface is not bracketed around the lifetime of that data.
nsim_fib_destroy() freed both rhashtables and only removed the debugfs
directory afterwards, and nsim_fib_create() created the debugfs files
before the rhashtables were initialized and, on the error path, freed
them before removing the files. debugfs keeps the file itself alive
across a ->write() via debugfs_file_get()/debugfs_file_put()
(fs/debugfs/file.c), but it does not keep data->nexthop_ht alive, so the
in-flight writer dereferenced freed memory. rtnl_lock() in the writer
does not help, because the teardown path does not take rtnl around
rhashtable_free_and_destroy().

Fix it by bracketing the debugfs interface around the data it exposes,
keeping nsim_fib_create() and nsim_fib_destroy() symmetric:

- In nsim_fib_destroy(), tear down the debugfs files before the data
   structures they reference. debugfs_remove_recursive() drops the
   initial active-user reference and then waits for every in-flight
   ->write() to drop its reference before returning, and rejects new
   opens (__debugfs_file_removed(), fs/debugfs/inode.c). Once it returns,
   no debugfs accessor can reach the FIB data, so the rhashtables and
   nsim_fib_data can be destroyed safely. This also covers the bool knobs
   in the same directory, which store pointers into the same
   nsim_fib_data, and the final kfree(data).

- In nsim_fib_create(), create the debugfs files after the rhashtables
   and notifiers are set up. This closes the same race on the
   error-unwind path, where a concurrent writer could otherwise observe a
   half-constructed instance or a table that the unwind has already
   freed. (With only the destroy-side change, a writer racing the create
   window instead dereferences an uninitialized data->nexthop_ht.)
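
The ordering rule above can be sketched in userspace C. This is only an
illustrative single-threaded model: struct fib_model and its "published"
flag are hypothetical stand-ins for nsim_fib_data and for the existence of
the debugfs files, and the model does not reproduce the wait for in-flight
writers that debugfs_remove_recursive() performs.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_TABLE_SIZE 16

/* Hypothetical stand-in for nsim_fib_data; not the driver's structure. */
struct fib_model {
	int *table;      /* plays the role of data->nexthop_ht */
	bool published;  /* plays the role of the debugfs files existing */
};

int fib_model_create(struct fib_model *m)
{
	m->published = false;
	m->table = calloc(MODEL_TABLE_SIZE, sizeof(*m->table));
	if (!m->table)
		return -ENOMEM;
	/* Publish last: an accessor can only ever see initialized data. */
	m->published = true;
	return 0;
}

/* Accessor, modelled on a write handler reaching the table via the file. */
int fib_model_write(struct fib_model *m, unsigned int idx, int val)
{
	if (!m->published)
		return -ENOENT;
	if (idx >= MODEL_TABLE_SIZE)
		return -EINVAL;
	m->table[idx] = val;
	return 0;
}

void fib_model_destroy(struct fib_model *m)
{
	/* Unpublish first, so no accessor can reach the table ... */
	m->published = false;
	/* ... and only then free the data the interface exposed. */
	free(m->table);
	m->table = NULL;
}
```

In this model a write attempted after fib_model_destroy() is rejected
instead of touching freed memory, which is the property the symmetric
create/destroy ordering is meant to provide.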

This is reproducible by racing, in a loop, writes to
/sys/kernel/debug/netdevsim/netdevsimN/fib/nexthop_bucket_activity
against a teardown of the same netdevsim instance -- a devlink reload
("devlink dev reload netdevsim/netdevsimN"), destroying the network
namespace it lives in, or "echo N > /sys/bus/netdevsim/del_device". It
was found with syzkaller; a syzkaller reproducer is available. A
standalone C reproducer does not trigger it reliably because the race
needs the netns-teardown/reload path.

Cc: <stable+noautosel@kernel.org> # netdevsim is a test harness, it's never loaded on production systems
Signed-off-by: Zijing Yin <yzjaurora@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260529135718.1804031-1-yzjaurora@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: tso: add new tests for ip6tnl, ipip, and sit tunnels

Add new tunnel test cases for ip6tnl, ipip, and sit. ip6tnl supports
ipv[46] as inner l3 header, and the other two tunnels only support a
single inner l3 type.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20260529-tso-tunnels-v1-1-3771ee9eaaa9@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'batadv-next-pullrequest-20260601' of https://git.open-mesh.org/batadv

Simon Wunderlich says:

====================
This batman-adv cleanup patchset includes the following patches, all by
Sven Eckelmann:

- drop batman-adv specific version

- MAINTAINERS housekeeping for batman-adv (two patches)

- add missing includes

- use atomic_xchg() for gw.reselect check

- extract netdev wifi detection information object

- replace inappropriate atomic access with (READ|WRITE)_ONCE
   (six patches)

- tt: replace open-coded overflow check with helper

- tvlv: avoid unnecessary OGM buffer reallocations

- use neigh_node's orig_node only as id

* tag 'batadv-next-pullrequest-20260601' of https://git.open-mesh.org/batadv:
  batman-adv: use neigh_node's orig_node only as id
  batman-adv: tvlv: avoid unnecessary OGM buffer reallocations
  batman-adv: tt: replace open-coded overflow check with helper
  batman-adv: replace non-atomic last_ttvn with (READ|WRITE)_ONCE
  batman-adv: replace non-atomic packet_size_max with (READ|WRITE)_ONCE
  batman-adv: replace non-atomic mesh state with (READ|WRITE)_ONCE
  batman-adv: replace non-atomic vlan config fields with (READ|WRITE)_ONCE
  batman-adv: replace non-atomic hardif config fields with (READ|WRITE)_ONCE
  batman-adv: replace non-atomic meshif config fields with (READ|WRITE)_ONCE
  batman-adv: extract netdev wifi detection information object
  batman-adv: use atomic_xchg() for gw.reselect check
  batman-adv: add missing includes
  MAINTAINERS: Don't send batman-adv patches to netdev
  MAINTAINERS: Rename batman-adv T(ree)
  batman-adv: drop batman-adv specific version
====================

Link: https://patch.msgid.link/20260601123629.707089-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

af_unix: Remove sock->state assignment.

Both struct socket and struct sock have a variable to
manage their state, sock->state and sk->sk_state.

When both are used, the former typically manages syscall
state and the latter manages the actual connection state.

AF_UNIX only uses sk->sk_state.

Let's remove the unnecessary assignments to sock->state.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260529191829.3864438-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: add socat syslog for PPPoL2TP

As done in pppoe.sh, start socat as the syslog listener. In case the
test fails, dump its log to see what's going on.

Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
Reviewed-by: Matthieu Baerts <matttbe@kernel.org>
Link: https://patch.msgid.link/20260529021146.5739-1-qingfang.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-phy-dp83822-add-optional-external-phy-clock'

Stefan Wahren says:

====================
net: phy: dp83822: Add optional external PHY clock

This small series implements support for an external PHY clock in the
dp83822 driver.
====================

Link: https://patch.msgid.link/20260528184642.33424-1-wahrenst@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: dp83822: Add optional external PHY clock

In some cases, the PHY can use an external ref clock source instead of a
crystal.

Add an optional clock in the PHY node to make sure that the clock source
is enabled, if specified, before probing.

Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260528184642.33424-3-wahrenst@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: dp83822: Improve readability in dp8382x_probe

Introduce a local pointer for the device so that devm_kzalloc() fits on
a single line. This also makes the following changes easier to read.

Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260528184642.33424-2-wahrenst@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: change bpf_skops_hdr_opt_len() signature

Some compilers do not inline bpf_skops_hdr_opt_len() from
tcp_established_options(), forcing an expensive stack canary
when CONFIG_STACKPROTECTOR_STRONG=y.

Change bpf_skops_hdr_opt_len() to return @remaining by value
to remove this stack canary from TCP fast path.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 1/1 up/down: 10/-59 (-49)
Function                                     old     new   delta
bpf_skops_hdr_opt_len                        297     307     +10
tcp_established_options                      574     515     -59
Total: Before=31456795, After=31456746, chg -0.00%
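
As a rough illustration of the signature change, the two shapes can be
compared in plain C. The helper names and the fixed 4-byte option length
are hypothetical, not the kernel functions. The relevant point is that the
out-parameter form makes the caller take the address of a local, which is
one of the conditions under which -fstack-protector-strong instruments a
function. Whether a canary is actually emitted depends on the compiler and
its inlining decisions.

```c
/* Hypothetical option length used by both helpers. */
#define MODEL_OPT_LEN 4u

/* Out-parameter form: the caller must pass the address of a local. */
void model_opt_len_by_ptr(unsigned int *remaining)
{
	if (*remaining >= MODEL_OPT_LEN)
		*remaining -= MODEL_OPT_LEN;
}

/* By-value form: the caller's local never has its address taken. */
unsigned int model_opt_len_by_value(unsigned int remaining)
{
	if (remaining >= MODEL_OPT_LEN)
		remaining -= MODEL_OPT_LEN;
	return remaining;
}

unsigned int model_caller_by_ptr(unsigned int size)
{
	unsigned int remaining = size;

	model_opt_len_by_ptr(&remaining);	/* address of a local escapes */
	return remaining;
}

unsigned int model_caller_by_value(unsigned int size)
{
	unsigned int remaining = size;

	remaining = model_opt_len_by_value(remaining);
	return remaining;
}
```

Both callers compute the same result; only the calling convention differs.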

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260601093819.469626-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: b53: hide legacy gpiolib usage on non-mips

The MIPS bcm53xx platform still uses the legacy gpiolib interfaces based
on gpio numbers, but other platforms do not.

Hide these interfaces inside of the existing #ifdef block and use the
modern interfaces in the common parts of the driver to allow building
it when the gpio_set_value() is left out of the kernel.

Reviewed-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Linus Walleij <linusw@kernel.org>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260601165716.648230-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'add-starfive-jhb100-soc-sgmii-gmac-support'

Minda Chen says:

====================
Add StarFive jhb100 soc SGMII GMAC support

jhb100 is a new StarFive RISC-V SoC for datacenter BMC (Baseboard
Management Controller) use, similar to the Aspeed 27x0.

The jhb100 minimal system upstream is in progress:
https://patchwork.kernel.org/project/linux-riscv/cover/20260508053632.818548-1-changhuang.liang@starfivetech.com/

The jhb100 GMAC still uses the DesignWare GMAC core, like JH7100 and JH7110,
and contains 2 SGMII interfaces, 1 RGMII/RMII interface and 1 RMII
interface. dwmac-starfive.c already supports the RGMII/RMII interface for
JH7100/JH7110, so SGMII support needs to be added to it for JHB100.

The SGMII serdes PHY is integrated in JHB100 and has no driver
settings.

On the JHB100 EVB board, SGMII connects to an external Motorcomm YT8531s PHY
and provides an RJ45 Ethernet port.
====================

Link: https://patch.msgid.link/20260527084108.121416-1-minda.chen@starfivetech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: starfive: Add STMMAC_FLAG_SPH_DISABLE flag

Set the flag that disables split header by default on all StarFive
SoCs.

Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260527084108.121416-5-minda.chen@starfivetech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: starfive: Add jhb100 SGMII interface

Add the jhb100 compatible and SGMII support. The jhb100 SoC contains
2 SGMII interfaces with an integrated serdes PHY. SGMII uses
split TX/RX MAC clocks and needs a 2.5M/25M/125M TX/RX clock
rate in 10M/100M/1000M speed mode.
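
The speed-to-rate relationship can be written down as a small lookup. This
is a sketch only: the function name is hypothetical, and it assumes the
2.5M/25M/125M figures are clock rates in Hz (2.5 MHz, 25 MHz, 125 MHz).

```c
/*
 * Hypothetical helper: map a link speed in Mbit/s to the TX/RX clock
 * rate in Hz, or -1 for a speed the SGMII interface does not use.
 */
long model_sgmii_clk_rate_hz(int speed_mbps)
{
	switch (speed_mbps) {
	case 10:
		return 2500000L;	/* 2.5 MHz */
	case 100:
		return 25000000L;	/* 25 MHz */
	case 1000:
		return 125000000L;	/* 125 MHz */
	default:
		return -1;
	}
}
```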

Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
Reviewed-by: Sai Krishna <saikrishnag@marvell.com>
Link: https://patch.msgid.link/20260527084108.121416-4-minda.chen@starfivetech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: starfive,jh7110-dwmac: Add jhb100 support

The jhb100 GMAC still uses the Synopsys DesignWare GMAC core.
Its hardware features are similar to jh7100.

Add the jhb100 GMAC compatible and its reset and interrupt properties.
The jhb100 dwmac has only one reset signal and one interrupt
line.

The jhb100 SGMII interface splits the tx/rx MAC clocks and requires
the clock rate to be set per 10M/100M/1000M speed. So a new rx clock
needs to be added in the code, the dts and the dt binding doc.

Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/20260527084108.121416-3-minda.chen@starfivetech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: starfive,jh7110-dwmac: Remove jh8100

Remove the jh8100 dt-bindings because jh8100 is no longer supported.
StarFive has stopped jh8100 development and will not release
it publicly.

Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260527084108.121416-2-minda.chen@starfivetech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: qrtr: fix node refcount leak on ctrl packet alloc failure

qrtr_send_resume_tx() calls qrtr_node_lookup() which takes a
reference on the returned node. If the subsequent call to
qrtr_alloc_ctrl_packet() fails due to memory allocation failure, the
function returns -ENOMEM without calling qrtr_node_release() to
release the node reference.

Add qrtr_node_release(node) before returning on the allocation failure
path to properly release the reference.
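
The leak pattern and its fix can be modelled in userspace C. All names
below are hypothetical stand-ins, not the qrtr functions: a lookup takes a
reference, and every exit path after the lookup has to drop it again.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical node with a plain integer reference count. */
struct model_node {
	int refcount;
};

struct model_node *model_node_lookup(struct model_node *node)
{
	node->refcount++;	/* lookup hands out a reference */
	return node;
}

void model_node_release(struct model_node *node)
{
	node->refcount--;
}

/* alloc_fails simulates the control packet allocation failing. */
int model_send_resume_tx(struct model_node *table_node, int alloc_fails)
{
	char buf[32];
	struct model_node *node = model_node_lookup(table_node);
	void *pkt = alloc_fails ? NULL : buf;

	if (!pkt) {
		/* The fix: drop the reference before the early return. */
		model_node_release(node);
		return -ENOMEM;
	}

	/* A real sender would enqueue the packet here. */
	model_node_release(node);
	return 0;
}
```

With the release on the error path, the reference count is unchanged after
both a failed and a successful call.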

Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
Link: https://patch.msgid.link/20260528080019.1176700-1-vulab@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: lan743x: avoid netdev-based logging before netdev registration

This patch updates the lan743x driver to prevent the use of netdev-based
logging APIs (such as netdev_dbg) before the network device has been
successfully registered. Using netdev-based logging prior to registration
results in log messages referencing "(unnamed net_device) (uninitialized)",
which can be confusing and less informative.

The driver must use netif_msg_ APIs and device-based logging (e.g. dev_dbg)
until netdev registration is complete. This ensures log entries are
associated with the correct device context and improves log clarity. After
registration, netdev-based logging APIs can be used safely.

Signed-off-by: David Thompson <davthompson@nvidia.com>
Link: https://patch.msgid.link/20260528165017.421576-1-davthompson@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: txgbe: fix phylink leak on AML init failure

Destroy the phylink instance when fixed-link setup fails.

Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://patch.msgid.link/20260528013258.129146-1-zhaochenguang@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fec_mpc52xx_phy: Add missing MODULE_DESCRIPTION()

Fixes a warning during modpost:

WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/ethernet/freescale/fec_mpc52xx_phy.o

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20260527025139.10188-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: wwan: t7xx: Add delay between MD and SAP suspend

SAP (Service Access Point) suspend occasionally times out with error
-110 (ETIMEDOUT), followed by modem port errors and complete modem
failure requiring a system reboot to recover.

Error symptoms:
  mtk_t7xx 0000:72:00.0: [PM] SAP suspend error: -110
  mtk_t7xx 0000:72:00.0: can't suspend (...returned -110)
  mtk_t7xx 0000:07:00.0: Failed to send skb: -22
  mtk_t7xx 0000:07:00.0: Write error on MBIM port, -22

The modem firmware needs time after receiving the MD (modem) suspend
request to complete internal operations before it is ready to accept
the SAP suspend request. Without this delay, if runtime PM attempts
to suspend while the firmware is busy, the SAP suspend command times
out, leaving the modem in an unrecoverable state.

Root cause and userspace interaction:
ModemManager 1.24+ includes changes that reduce the likelihood of this
issue by ensuring the modem is in a low-power state before the kernel
attempts runtime suspend. However, the kernel driver should not depend
on specific userspace behavior or ModemManager versions. Older versions
(1.20-1.22) are still widely deployed, and the kernel should be robust
regardless of userspace implementation details.

There appears to be no hardware status register or other mechanism
available to query whether the firmware is ready for SAP suspend.
A delay between the two suspend requests is the most reliable solution
found through testing.

Add a 50ms delay between MD suspend and SAP suspend. This gives the
firmware adequate time to complete internal operations without adding
significant latency to the suspend path. This makes the driver robust
across all ModemManager versions and system conditions.

Testing: 96+ hours of continuous operation with ModemManager 1.20.2
and Fibocom FM350-GL modem. Zero SAP suspend timeouts observed across
2000+ successful suspend/resume cycles. Previously failed within
24 hours with 100% reproducibility.

Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
Reviewed-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
Link: https://patch.msgid.link/20260527061451.12710-1-jtornosm@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: sfp: probe for RollBall I2C-to-MDIO bridge in mdio-i2c

The "OEM"/"SFP-10G-T" quirk entry in sfp_fixup_rollball_cc()
unconditionally forces MDIO_I2C_ROLLBALL for all modules matching that
vendor/part-number combination.  This works for modules that genuinely
implement a RollBall I2C-to-MDIO bridge, but silently breaks modules
that share the same EEPROM strings without having such a bridge.

The Realtek RTL8261BE-CG is one such module: a pure copper 10G SFP+
media converter with no I2C-to-MDIO bridge.  Its EEPROM reports
vendor="OEM", part="SFP-10G-T-I", and -- critically -- Vendor OUI
00:00:00, making OUI-based differentiation impossible.  With
MDIO_I2C_ROLLBALL forced, the module silently ACKs the unlock password
write, the MDIO bus is created, but no PHY responds; the SFP state
machine cycles through the RollBall PHY-probe retry window before
reporting no PHY.

Move the probe into i2c_mii_init_rollball() in mdio-i2c.c, where the
RollBall protocol constants are already defined.  After sending the
unlock password, issue a CMD_READ and poll for CMD_DONE up to 200 ms
(10 x 20 ms, matching the existing rollball poll tolerance).  A genuine
RollBall bridge asserts CMD_DONE within that window; modules without a
bridge never do, so i2c_mii_init_rollball() returns -ENODEV.
mdio_i2c_alloc() propagates -ENODEV to the caller to signal that no
bridge is present and PHY probing should be skipped.
sfp_sm_add_mdio_bus() catches -ENODEV and transitions
sfp->mdio_protocol to MDIO_I2C_NONE so the rest of the state machine
skips PHY probing for this module.

Any I2C-level error (NACK, timeout) during the probe is also treated as
-ENODEV: if the module does not respond at I2C address 0x51 at all,
there is certainly no RollBall bridge there, and SFP initialization
should not abort.
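
The probe logic described above can be sketched as a bounded poll loop in
userspace C. Everything here is a hypothetical model: the status callback
stands in for the I2C read of the command status, the sleep between polls
is omitted, and model_fake_status() is a fake module used only to exercise
the loop.

```c
#include <errno.h>

#define MODEL_PROBE_TRIES 10	/* 10 polls x 20 ms = 200 ms in the text */

/* Hypothetical reader: returns <0 on a bus error, else sets *done. */
typedef int (*model_read_status_fn)(void *ctx, int *done);

int model_rollball_probe(model_read_status_fn read_status, void *ctx)
{
	int i;

	for (i = 0; i < MODEL_PROBE_TRIES; i++) {
		int done = 0;

		/* Any bus-level error is treated as "no bridge present". */
		if (read_status(ctx, &done) < 0)
			return -ENODEV;
		if (done)
			return 0;
		/* A driver would sleep about 20 ms here before retrying. */
	}

	/* The command never completed within the window: no bridge. */
	return -ENODEV;
}

/*
 * Fake module: *ctx is the number of polls left until it reports done.
 * A negative value simulates a bus error such as a NACK.
 */
int model_fake_status(void *ctx, int *done)
{
	int *polls_left = ctx;

	if (*polls_left < 0)
		return -EIO;
	*done = (*polls_left == 0);
	if (*polls_left > 0)
		(*polls_left)--;
	return 0;
}
```

A module that completes within the window probes successfully, while one
that never completes or fails at the bus level yields -ENODEV.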

The probe writes are safe with respect to SFP EEPROM integrity: only
modules explicitly listed in the quirk table enter this path, and the
RollBall password unlock write to 0x51 was already issued by
i2c_mii_init_rollball() before the probe for all such modules.  Any
module without a device at 0x51 NACKs the transfer and is treated as
-ENODEV.

Add "OEM"/"SFP-10G-T-I" to the quirk table so RTL8261BE modules enter
the probe path; genuine RollBall modules continue to work as before.

Signed-off-by: Petr Wozniak <petr.wozniak@gmail.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260527053909.2118-1-petr.wozniak@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mv88e6xxx-serdes-on-mv88e6321'

Fidan Aliyeva says:

====================
mv88e6xxx: SERDES on mv88e6321

This patch series adds support for the SERDES feature of the
mv88e6321 variant of the Marvell mv88e6xxx series. mv88e6321 has 2 ports that
support high speed SERDES, but the driver lacks support for them.

mv88e6321 has a similar architecture to mv88e6352, making it
possible to reuse its pcs functions. That's why the patch series consists of
2 parts:

1. Refactor the serdes functions and pcs_init of mv88e6352 to be more
generic (patches 1-2).
2. Add the SERDES support for mv88e6321 reusing 6352's pcs functions

The final code has been tested on mv88e6321 ethernet device directly by ip
ping tests, performance tests and also verifying the switch's expected
register values.

Referred document: 88E6321/88E6320 Functional Specification
====================

Link: https://patch.msgid.link/20260528210310.1365858-1-fidan.aliyeva.ext@ericsson.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mv88e6xxx: Add SERDES Support for mv88e6321

Add serdes and pcs_ops functions for mv88e6321. On mv88e6321,
2 ports support serdes functionality: port 0 and port 1. These ports are
serdes-only ports.

Changes:

1. Add a function that returns the lane address for the port based on
cmode.
2. Reuse mv88e6352's serdes_get_regs* and pcs_init functions for mv88e6321.

Tested on mv88e6321 switch port 0.

Co-developed-by: Thomas Eckerman <thomas.eckerman.ext@ericsson.com>
Signed-off-by: Thomas Eckerman <thomas.eckerman.ext@ericsson.com>
Signed-off-by: Fidan Aliyeva <fidan.aliyeva.ext@ericsson.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260528210310.1365858-4-fidan.aliyeva.ext@ericsson.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mv88e6xxx: Refactor 6352's serdes functions

Changes:
1. Replace the mv88e6352_g2_scratch_port_has_serdes check in
mv88e6352_pcs_init with mv88e6xxx_serdes_get_lane, making it
more generic.
2. Replace the serdes checks in the mv88e6352_serdes_get_* functions with
mv88e6xxx_serdes_get_lane, making them more generic.
3. Add a lane argument to mv88e6352_serdes_read so it can be reused later for
6321.

Co-developed-by: Thomas Eckerman <thomas.eckerman.ext@ericsson.com>
Signed-off-by: Thomas Eckerman <thomas.eckerman.ext@ericsson.com>
Signed-off-by: Fidan Aliyeva <fidan.aliyeva.ext@ericsson.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260528210310.1365858-3-fidan.aliyeva.ext@ericsson.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mv88e6xxx: Add mv88e6352_serdes_get_lane

Changes:
1. Add the mv88e6352_serdes_get_lane function, which checks whether the port
supports SERDES by calling mv88e6352_g2_scratch_port_has_serdes and then
returns the address of the SERDES lane.
2. Add this function as .serdes_get_lane member to all the chip
versions which use mv88e6352_pcs_init.

Co-developed-by: Thomas Eckerman <thomas.eckerman.ext@ericsson.com>
Signed-off-by: Thomas Eckerman <thomas.eckerman.ext@ericsson.com>
Signed-off-by: Fidan Aliyeva <fidan.aliyeva.ext@ericsson.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260528210310.1365858-2-fidan.aliyeva.ext@ericsson.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mdio-realtek-rtl9300-soc-independent-command-runner'

Markus Stockhausen says:

====================
net: mdio: realtek-rtl9300: SoC independent command runner

The Realtek Otto switch platform consists of four different series:

- RTL838x aka maple   : 28 port 1G Switches
- RTL839x aka cypress : 52 port 1G Switches
- RTL930x aka longan  : 28 port 1G/2.5G/10G Switches
- RTL931x aka mango   : 56 port 1G/2.5G/10G Switches

After establishing basic groundwork for multi device support, this series
harmonizes the command handling of the MDIO driver. It is the second step
to allow easier integration of the non RTL930x SoCs into this driver.
====================

Link: https://patch.msgid.link/20260527163449.1294961-1-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: realtek-rtl9300: use command runner for read_c22()

Convert the last remaining path, read_c22(), to the new read-enabled command
runner. Do it the same way as the other implementations.

- bus calls otto_emdio_read_c22()
- this hands over to SoC specific otto_emdio_9300_read_c22()
- finally the registers are filled and the runner issued

With this cleanup, remove the obsolete helper otto_emdio_wait_ready().

Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260527163449.1294961-5-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: realtek-rtl9300: use command runner for read_c45()

Convert the read_c45() path to the new command runner. This needs the
additional helper otto_emdio_read_cmd() that can issue the command runner
and process a read operation. It basically does nothing more than

- run the command
- read the command result through the I/O register

With this in place, convert read_c45() like the already existing write
C22/C45 implementation.

- bus calls otto_emdio_read_c45()
- this hands over to SoC specific otto_emdio_9300_read_c45()
- the registers are filled
- the otto_emdio_read_cmd() is issued
- that calls the command runner

Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260527163449.1294961-4-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: realtek-rtl9300: use command runner for write_c22()

Now that the driver has a generic command runner, make use of it in the
write_c22() path. For this:

- add generic otto_emdio_write_c22() helper that will be called by bus
- convert otto_emdio_9300_write_c22() to new command runner logic

Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260527163449.1294961-3-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: realtek-rtl9300: provide generic command runner

The current bus read/write commands for C22/C45 are RTL930x specific.
Avoid duplicating those 200 lines of code for the RTL838x, RTL839x and
RTL931x targets. Instead provide a generic command runner that is SoC
independent. The implementation works as follows:

The runner will take a prepared list of the four MDIO registers. It will
feed the data into the registers. This generic write to all registers
(in other words "a little bit too much") is no issue. The hardware looks at
the command to be executed and will only take the pieces of data that
are really required. No side effects have been observed on any of the
four SoCs during the time this mechanism exists in downstream OpenWrt.

The last fed register is the C22/command register. This will be enriched
with the proper command flags from the caller. The hardware issues the
command and the runner will wait for its finalization.

Besides feeding all registers, the runner emulates the behaviour of
the old code as closely as possible:

- check defensively for a running command in advance
- use a consistent 1000us MMIO timeout. Before this commit the driver
  had different timeout values: 1000us for command preparation, 100us
  after writes and 1000us after reads.
- return -ENXIO in case of hardware failure (fail bit)
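
The runner flow described above can be modelled in userspace C. This is a
sketch under stated assumptions, not the driver code: the register layout,
the RUN and FAIL bit positions, the poll count standing in for the 1000us
timeout, and the -ETIMEDOUT return are all hypothetical, and the hardware
is replaced by a fake that finishes a command after a set number of reads.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_NREGS	4
#define MODEL_CMD_REG	3		/* fed last: the command register */
#define MODEL_CMD_RUN	(1u << 0)	/* hypothetical busy bit */
#define MODEL_CMD_FAIL	(1u << 1)	/* hypothetical fail bit */
#define MODEL_POLL_MAX	1000		/* stand-in for the timeout */

struct model_hw {
	uint32_t regs[MODEL_NREGS];
	int busy_polls;	/* reads until the fake hardware finishes */
	int fail;	/* nonzero: fake hardware reports failure */
};

/* Fake command register read: the command completes after busy_polls reads. */
uint32_t model_read_cmd(struct model_hw *hw)
{
	if ((hw->regs[MODEL_CMD_REG] & MODEL_CMD_RUN) && hw->busy_polls-- <= 0) {
		hw->regs[MODEL_CMD_REG] &= ~MODEL_CMD_RUN;
		if (hw->fail)
			hw->regs[MODEL_CMD_REG] |= MODEL_CMD_FAIL;
	}
	return hw->regs[MODEL_CMD_REG];
}

int model_wait_idle(struct model_hw *hw)
{
	int i;

	for (i = 0; i < MODEL_POLL_MAX; i++)
		if (!(model_read_cmd(hw) & MODEL_CMD_RUN))
			return 0;
	return -ETIMEDOUT;
}

int model_run_cmd(struct model_hw *hw, const uint32_t vals[MODEL_NREGS],
		  uint32_t flags)
{
	int i, ret;

	/* Check defensively for a command that is still running. */
	ret = model_wait_idle(hw);
	if (ret)
		return ret;

	/* Feed all data registers, then the command register last. */
	for (i = 0; i < MODEL_NREGS - 1; i++)
		hw->regs[i] = vals[i];
	hw->regs[MODEL_CMD_REG] = vals[MODEL_CMD_REG] | flags | MODEL_CMD_RUN;

	/* Wait for completion, then report a hardware failure as -ENXIO. */
	ret = model_wait_idle(hw);
	if (ret)
		return ret;
	return (hw->regs[MODEL_CMD_REG] & MODEL_CMD_FAIL) ? -ENXIO : 0;
}
```

A SoC specific caller in this model only prepares the four values and the
command flags, which mirrors why little is left to do in the per-SoC
functions once the runner exists.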

As a first consumer of this runner, convert the write_c45() function.
This is realized in a multi-stage approach:

- a generic otto_emdio_write_c45() will be called by the bus
- this will forward the request to the device specific writer. In this
  case otto_emdio_9300_write_c45().
- There the command data is filled in and the additional helper
  otto_emdio_write_cmd() will be called
- That adds the write flag and issues the generic command runner.

With all the above mentioned in place, there is not much left to do in
otto_emdio_9300_write_c45(). It just fills the register fields and
calls the write helper with the right command bits.

Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260527163449.1294961-2-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Remove orphaned ax25_ptr references

The AX.25 subsystem was removed in commit dd8d4bc28ad7
("net: remove ax25 and amateur radio (hamradio) subsystem"),
which removed the ax25_ptr field from struct net_device but
left behind the kdoc comment and documentation.

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20260531134837.4111349-1-costa.shul@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp_bbr: fix SPDX-License-Identifier to be GPL-2.0 OR BSD-3-Clause

Since TCP BBR congestion control was introduced in
commit 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
it has always been offered as "Dual BSD/GPL":

MODULE_LICENSE("Dual BSD/GPL");

A GPL-2.0-only SPDX header was erroneously added in the recent
commit 2ed4b46b4fc7 ("net: Add SPDX ids to some source files").

This commit revises the tcp_bbr.c SPDX-License-Identifier to note that
this file is licensed as "GPL-2.0 OR BSD-3-Clause".

Fixes: 2ed4b46b4fc7 ("net: Add SPDX ids to some source files")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Tim Bird <tim.bird@sony.com>
Link: https://patch.msgid.link/20260531183558.2337381-1-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ibm: emac: Reserve VLAN header in MJS limit

The IBM EMAC programs its Maximum Jumbo Size (MJS) drop
threshold from ndev->mtu directly. The hardware sizes the threshold
against the L2 frame minus the ethernet header, but does not
discount the 802.1Q tag, so a frame carrying a VLAN tag and a full
1500-byte payload exceeds MJS by exactly 4 bytes and is dropped.

This is normally hidden because JPSM (and therefore the MJS check)
only engages when the MTU is raised above ETH_DATA_LEN. With the
qca8k DSA tagger the conduit MTU is bumped by QCA_HDR_LEN to 1502
during dsa_conduit_setup(), which is enough to enable JPSM and
expose the off-by-VLAN-tag in the limit.

Pad MJS by VLAN_HLEN so a VLAN-tagged full-MTU frame passes.

Reported on Meraki MX60 (qca8k switch): tagged VLAN
traffic drops at 1500-byte payload, while 1496 bytes works
and untagged 1500 bytes works.
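
The numbers in the report can be checked with a small model. It is a
sketch only, and it assumes that the length compared against MJS is the
payload plus the 2-byte DSA tag (1502 - 1500) plus any 802.1Q tag, and
that a frame is dropped when this length is strictly greater than MJS.
The names are hypothetical.

```c
#include <stdbool.h>

#define MODEL_VLAN_HLEN	4	/* 802.1Q tag length */
#define MODEL_QCA_HDR	2	/* conduit MTU 1502 minus 1500 */

/* Bytes the model compares against MJS: everything after the L2 header. */
int model_measured_len(int payload, bool vlan_tagged)
{
	return payload + MODEL_QCA_HDR + (vlan_tagged ? MODEL_VLAN_HLEN : 0);
}

bool model_frame_dropped(int mjs, int payload, bool vlan_tagged)
{
	return model_measured_len(payload, vlan_tagged) > mjs;
}
```

With MJS set to the conduit MTU of 1502, the model drops a tagged 1500-byte
payload and passes a tagged 1496-byte or untagged 1500-byte payload. With
MJS padded to 1502 + 4, the tagged 1500-byte payload passes as well.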

Assisted-by: Claude:Opus-4.7
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20260526202247.13823-1-rosenp@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

batman-adv: use neigh_node's orig_node only as id

The orig_node member of struct batadv_neigh_node is no longer used in
B.A.T.M.A.N. IV. But batadv_neigh_node_create() is still storing it.
Only batadv_v_ogm_route_update() uses it to check if we route toward
it - not needing the data stored in the batadv_orig_node object itself,
but merely a pointer to identify the originator.

The field cannot hold a proper reference because that would create a
reference cycle, so it must never be dereferenced. Rename it to
orig_node_id and mark it __private to make any future attempt to
dereference it immediately noticeable.

Signed-off-by: Sven Eckelmann <sven@narfation.org>