git.ipfire.org Git - thirdparty/linux.git/log

vxlan: Use a single lock to protect the FDB table

Currently, the VXLAN driver stores FDB entries in a hash table with a
fixed number of buckets (256). Subsequent patches are going to convert
this table to rhashtable with a linked list for entry traversal, as
rhashtable is more scalable.

In preparation for this conversion, move from a per-bucket spin lock to
a single spin lock that protects the entire FDB table.

The per-bucket spin locks were introduced by commit fe1e0713bbe8
("vxlan: Use FDB_HASH_SIZE hash_locks to reduce contention") citing
"huge contention when inserting/deleting vxlan_fdbs into the fdb_head".

It is not clear from the commit message which code path was holding the
spin lock for long periods of time, but the obvious suspect is the FDB
cleanup routine (vxlan_cleanup()) that periodically traverses the entire
table in order to delete aged-out entries.

This will be solved by subsequent patches that will convert the FDB
cleanup routine to traverse the linked list of FDB entries using RCU,
only acquiring the spin lock when deleting an aged-out entry.

The change reduces the size of the VXLAN device structure from 3600
bytes to 2576 bytes.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-7-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Relocate assignment of default remote device

The default FDB entry can be associated with a net device if a physical
device (i.e., 'dev PHYS_DEV') was specified during the creation of the
VXLAN device.

The assignment of the net device pointer to 'dst->remote_dev' logically
belongs in the if block that resolves the pointer from the specified
ifindex, so move it there.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-6-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Unsplit default FDB entry creation and notification

Commit 0241b836732f ("vxlan: fix default fdb entry netlink notify
ordering during netdev create") split the creation of the default FDB
entry from its notification to avoid sending a RTM_NEWNEIGH notification
before RTM_NEWLINK.

Previous patches restructured the code so that the default FDB entry is
created after registering the VXLAN device and the notification about
the new entry immediately follows its creation.

Therefore, simplify the code and revert back to vxlan_fdb_update() which
takes care of both creating the FDB entry and notifying user space
about it.

Hold the FDB hash lock when calling vxlan_fdb_update() like it expects.
A subsequent patch will add a lockdep assertion to make sure this is
indeed the case.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-5-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Insert FDB into hash table in vxlan_fdb_create()

Commit 7c31e54aeee5 ("vxlan: do not destroy fdb if register_netdevice()
is failed") split the insertion of FDB entries into the FDB hash table
from the function where they are created.

This was done in order to work around a problem that is no longer
possible after the previous patch. Simplify the code and move the body
of vxlan_fdb_insert() back into vxlan_fdb_create().

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-4-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Simplify creation of default FDB entry

There is asymmetry in how the default FDB entry (all-zeroes) is created
and destroyed in the VXLAN driver. It is created as part of the driver's
newlink() routine, but destroyed as part of its ndo_uninit() routine.

This caused multiple problems in the past. First, commit 0241b836732f
("vxlan: fix default fdb entry netlink notify ordering during netdev
create") split the notification about the entry from its creation so
that it will not be notified to user space before the VXLAN device is
registered.

Then, commit 6db924687139 ("vxlan: Fix error path in
__vxlan_dev_create()") made the error path in __vxlan_dev_create()
asymmetric by destroying the FDB entry before unregistering the net
device. Otherwise, the FDB entry would have been freed twice: By
ndo_uninit() as part of unregister_netdevice() and by
vxlan_fdb_destroy() in the error path.

Finally, commit 7c31e54aeee5 ("vxlan: do not destroy fdb if
register_netdevice() is failed") split the insertion of the FDB entry
into the hash table from its creation, moving the insertion after the
registration of the net device. Otherwise, like before, the FDB entry
would have been freed twice: By ndo_uninit() as part of
register_netdevice()'s error path and by vxlan_fdb_destroy() in the
error path of __vxlan_dev_create().

The end result is that the code is unnecessarily complex. In addition,
the fixed size hash table cannot be converted to rhashtable as
vxlan_fdb_insert() cannot fail, which will no longer be true with
rhashtable.

Solve this by making the addition and deletion of the default FDB entry
completely symmetric. Namely, as part of newlink() routine, create the
entry, insert it into to the hash table and send a notification to user
space after the net device was registered. Note that at this stage the
net device is still administratively down and cannot transmit / receive
packets.

Move the deletion from ndo_uninit() to the dellink routine(): Flush the
default entry together with all the other entries, before unregistering
the net device.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-3-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Add RCU read-side critical sections in the Tx path

The Tx path does not run from an RCU read-side critical section which
makes the current lockless accesses to FDB entries invalid. As far as I
am aware, this has not been a problem in practice, but traces will be
generated once we transition the FDB lookup to rhashtable_lookup().

Add rcu_read_{lock,unlock}() around the handling of FDB entries in the
Tx path. Remove the RCU read-side critical section from vxlan_xmit_nh()
as now the function is always called from an RCU read-side critical
section.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-2-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-04-17

We've added 12 non-merge commits during the last 9 day(s) which contain
a total of 18 files changed, 1748 insertions(+), 19 deletions(-).

The main changes are:

1) bpf qdisc support, from Amery Hung.
   A qdisc can be implemented in bpf struct_ops programs and
   can be used the same as other existing qdiscs in the
   "tc qdisc" command.

2) Add xsk tail adjustment tests, from Tushar Vyavahare.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests/bpf: Test attaching bpf qdisc to mq and non root
  selftests/bpf: Add a bpf fq qdisc to selftest
  selftests/bpf: Add a basic fifo qdisc test
  libbpf: Support creating and destroying qdisc
  bpf: net_sched: Disable attaching bpf qdisc to non root
  bpf: net_sched: Support updating bstats
  bpf: net_sched: Add a qdisc watchdog timer
  bpf: net_sched: Add basic bpf qdisc kfuncs
  bpf: net_sched: Support implementation of Qdisc_ops in bpf
  bpf: Prepare to reuse get_ctx_arg_idx
  selftests/xsk: Add tail adjustment tests and support check
  selftests/xsk: Add packet stream replacement function
====================

Link: https://patch.msgid.link/20250417184338.3152168-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bnxt_en-update-for-net-next'

Michael Chan says:

====================
bnxt_en: Update for net-next

The first patch changes the FW message timeout threshold for a warning
message. The second patch adjusts the ethtool -w coredump length to
suppress a warning. The last 2 patches are small cleanup patches for
the bnxt_ulp RoCE auxbus code.

v1: https://lore.kernel.org/netdev/20250415174818.1088646-1-michael.chan@broadcom.com/
====================

Link: https://patch.msgid.link/20250417172448.1206107-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Remove unused macros in bnxt_ulp.h

BNXT_ROCE_ULP and BNXT_MAX_ULP are no longer used. Remove them to
clean up the code.

Reviewed-by: Shruti Parab <shruti.parab@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250417172448.1206107-5-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Remove unused field "ref_count" in struct bnxt_ulp

The "ref_count" field in struct bnxt_ulp is unused after
commit a43c26fa2e6c ("RDMA/bnxt_re: Remove the sriov config callback").
So we can just remove it now.

Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250417172448.1206107-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Report the ethtool coredump length after copying the coredump

ethtool first calls .get_dump_flags() to get the dump length.  For
coredump, the driver calls the FW to get the coredump length (L1).  The
min. of L1 and the user specified length is then passed to
.get_dump_data() (L2) to get the coredump.  The actual coredump length
retrieved by the FW (L3) during .get_dump_data() may be smaller than L1.
This length discrepancy will trigger a WARN_ON() in
ethtool_get_dump_data().

ethtool has already vzalloc'ed a buffer with size L1.  Just report
the coredump length as L2 even though the actual coredump length L3
may be smaller.  The extra zero padding does not matter.  This will
prevent the warning that may alarm the user.

For correctness, only do the final length update if there is no error.

Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Signed-off-by: Shruti Parab <shruti.parab@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250417172448.1206107-3-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Change FW message timeout warning

The firmware advertises a "hwrm_cmd_max_timeout" value to the driver
for NVRAM and coredump related functions that can take tens of seconds
to complete.  The driver polls for the operation to complete under
mutex and may trigger hung task watchdog warning if the wait is too long.
To warn the user about this, the driver currently prints a warning if
this advertised value exceeds 40 seconds:

Device requests max timeout of %d seconds, may trigger hung task watchdog

Initially, we chose 40 seconds, well below the kernel's default
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT (120 seconds) to avoid triggering
the hung task watchdog.  But 60 seconds is the timeout on most
production FW and cannot be reduced further.  Change the driver's warning
threshold to 60 seconds to avoid triggering this warning on all
production devices.  We also print the warning if the value exceeds
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT which may be set to architecture
specific defaults as low as 10 seconds.

Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250417172448.1206107-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-stmmac-socfpga-fix-init-ordering-and-cleanups'

Russell King says:

====================
net: stmmac: socfpga: fix init ordering and cleanups

This series fixes the init ordering of the socfpga probe function.
The standard rule is to do all setup before publishing any device,
and socfpga violates that. I can see no reason for this, but these
patches have not been tested on hardware.

Address this by moving the initialisation of dwmac->stmmac_rst
along with all the other dwmac initialisers - there's no reason
for this to be late as plat_dat->stmmac_rst has already been
populated.

Next, replace the call to ops->set_phy_mode() with an init function
socfpga_dwmac_init() which will then be linked in to plat_dat->init.

Then, add this to plat_dat->init, and switch to stmmac_pltfr_pm_ops
from the private ops. The runtime suspend/resume socfpga implementations
are identical to the platform ones, but misses the noirq versions
which this will add.

Before we swap the order of socfpga_dwmac_init() and
stmmac_dvr_probe(), we need to change the way the interface is
obtained, as that uses driver data and the struct net_device which
haven't been initialised. Save a pointer to plat_dat in the socfpga
private data, and use that to get the interface mode. We can then swap
the order of the init and probe functions.

Finally, convert to devm_stmmac_pltfr_probe() by moving the call
to ops->set_phy_mode() into an init function appropriately populating
plat_dat->init.
====================

Link: https://patch.msgid.link/aAE2tKlImhwKySq_@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: socfpga: convert to devm_stmmac_pltfr_probe()

Convert socfpga to use devm_stmmac_pltfr_probe() to further simplify
the probe function, wrapping the call to the set_phy_mode() method
into socfpga_dwmac_init() which can be called from the plat_dat->init()
method. Also call this from socfpga_dwmac_resume() thereby simplifying
that function.

Using the devm variant also means we can remove the call to
stmmac_pltfr_remove().

Unfortunately, we can't also convert to stmmac_pltfr_pm_ops as there is
extra work done in socfpga_dwmac_resume().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1u5Sns-001IJw-OY@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: socfpga: call set_phy_mode() before registration

Initialisation/setup after registration is a bug. This is the second
of two patches fixing this in socfpga.

The set_phy_mode() functions do various hardware setup that would
interfere with a netdev that has been published, and thus available to
be opened by the kernel/userspace.

However, set_phy_mode() relies upon the netdev having been initialised
to get at the plat_stmmacenet_data structure, which is probably why it
was placed after stmmac_drv_probe(). We can remove that need by storing
a pointer to struct plat_stmmacenet_data in struct socfpga_dwmac.

Move the call to set_phy_mode() before calling stmmac_dvr_probe().
This also simplifies the probe function as there is no need to
unregister the netdev if set_phy_mode() fails.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1u5Snn-001IJq-L0@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: socfpga: convert to stmmac_pltfr_pm_ops

Convert socfpga to use the generic stmmac_pltfr_pm_ops, which can be
achieved by adding an appropriate plat_dat->init function to do the
setup.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1u5Sni-001IJk-Gi@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: socfpga: provide init function

Both the resume and probe path needs to configure the phy mode, so
provide a common function to do this which can later be hooked into
plat_dat->init.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1u5Snd-001IJe-Cx@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: socfpga: init dwmac->stmmac_rst before registration

Initialisation/setup after registration is a bug. This is the first of
two patches fixing this in socfpga.

dwmac->stmmac_rst is initialised from the stmmac plat_dat's stmmac_rst
member, which is itself initialised by devm_stmmac_probe_config_dt().
Therefore, this can be initialised before we call stmmac_dvr_probe().
Move it there.

dwmac->stmmac_rst is used by the set_phy_mode() method.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1u5SnY-001IJY-90@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-adopting-nlmsg_payload-final-series'

Breno Leitao says:

====================
net: Adopting nlmsg_payload() (final series)

This patchset marks the final step in converting users to the new
nlmsg_payload() function. It addresses the last two files that were not
converted in previous series, specifically updating the following
functions:

neigh_valid_dump_req
rtnl_valid_dump_ifinfo_req
rtnl_valid_getlink_req
valid_fdb_get_strict
valid_bridge_getlink_req
rtnl_valid_stats_req
rtnl_mdb_valid_dump_req

I would like to extend a big thank you to Kuniyuki Iwashima for his
invaluable help and review of this effort.
====================

Link: https://patch.msgid.link/20250417-nlmsg_v3-v1-0-9b09d9d7e61d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Use nlmsg_payload in rtnetlink file

Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250417-nlmsg_v3-v1-2-9b09d9d7e61d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Use nlmsg_payload in neighbour file

Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250417-nlmsg_v3-v1-1-9b09d9d7e61d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

s390: ism: Pass string literal as format argument of dev_set_name()

GCC 14.2.0 reports that passing a non-string literal as the
format argument of dev_set_name() is potentially insecure.

drivers/s390/net/ism_drv.c: In function 'ism_probe':
drivers/s390/net/ism_drv.c:615:2: warning: format not a string literal and no format arguments [-Wformat-security]
615 | dev_set_name(&ism->dev, dev_name(&pdev->dev));
| ^~~~~~~~~~~~

It seems to me that as pdev is a PCIE device then the dev_name
call above should always return the device's BDF, e.g. 00:12.0.
That this should not contain format escape sequences. And thus
the current usage is safe.

But, it seems better to be safe than sorry. And, as a bonus, compiler
output becomes less verbose by addressing this issue.

Compile tested only.
No functional change intended.

Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250417-ism-str-fmt-v1-1-9818b029874d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Fix spelling mistakes in mlx5_core_dbg message and comments

There is a spelling mistake in a mlx5_core_dbg and two spelling mistakes
in comment blocks. Fix them.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Acked-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250418135703.542722-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: axienet: Fix spelling mistake "archecture" -> "architecture"

There is a spelling mistake in a dev_error message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://patch.msgid.link/20250418112447.533746-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: add missing header deps

Various new families and my recent work on rtnetlink missed
adding dependencies on C headers. If the system headers are
up to date or don't include a given header at all this doesn't
make a difference. But if the system headers are in place but
stale - compilation will break.

Reported-by: Kory Maincent <kory.maincent@bootlin.com>
Fixes: 29d34a4d785b ("tools: ynl: generate code for rt-addr and add a sample")
Link: https://lore.kernel.org/20250418190431.69c10431@kmaincent-XPS-13-7390
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Tested-by: Kory Maincent <kory.maincent@bootlin.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250418234942.2344036-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: add UAPI to the header guard in various network headers

fib_rule, ip6_tunnel, and a whole lot of if_* headers lack the customary
_UAPI in the header guard. Without it YNL build can't protect from in tree
and system headers both getting included. YNL doesn't need most of these
but it's annoying to have to fix them one by one.

Note that header installation strips this _UAPI prefix so this should
result in no change to the end user.

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250416200840.1338195-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

trace: tcp: Add const qualifier to skb parameter in tcp_probe event

Change the tcp_probe tracepoint to accept a const struct sk_buff
parameter instead of a non-const one. This improves type safety and
better reflects that the skb is not modified within the tracepoint
implementation.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250416-tcp_probe-v1-1-1edc3c5a1cb8@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Delete the outer () duplicated of macro SOCK_SKB_CB_OFFSET definition

For macro SOCK_SKB_CB_OFFSET definition, Delete the outer () duplicated.

Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250416-fix_net-v1-1-d544c9f3f169@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: mediatek: stop initialising plat->mac_interface

Mediatek doesn't make use of mac_interface, and none of the in-tree
DT files use the mac-mode property. Therefore, mac_interface already
follows phy_interface. Remove this unnecessary assignment.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1u4zyh-000xVE-PG@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwc-qos: use PHY clock-stop capability

Use the PHY clock-stop capability when programming the MAC LPI mode,
which allows the transmit clock to the PHY to be gated. Tested on the
Jetson Xavier NX platform.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1u4zi1-000xHh-57@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netdev: fix the locking for netdev notifications

Kuniyuki reports that the assert for netdev lock fires when
there are netdev event listeners (otherwise we skip the netlink
event generation).

Correct the locking when coming from the notifier.

The NETDEV_XDP_FEAT_CHANGE notifier is already fully locked,
it's the documentation that's incorrect.

Fixes: 99e44f39a8f7 ("netdev: depend on netdev->lock for xdp features")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Reported-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/20250410171019.62128-1-kuniyu@amazon.com
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250416030447.1077551-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: ethtool: Fix formatting of ptp_rq0_csum_complete_tail_slow

The new GCC 15 warning -Wunterminated-string-initialization reports:

In file included from drivers/net/ethernet/mellanox/mlx5/core/en.h:55,
                 from drivers/net/ethernet/mellanox/mlx5/core/en_stats.c:34:
drivers/net/ethernet/mellanox/mlx5/core/en_stats.h:57:46: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
   57 | #define MLX5E_DECLARE_PTP_RQ_STAT(type, fld) "ptp_rq%d_"#fld, offsetof(type, fld)
      |                                              ^~~~~~~~~~~
drivers/net/ethernet/mellanox/mlx5/core/en_stats.c:2279:11: note: in expansion of macro 'MLX5E_DECLARE_PTP_RQ_STAT'
2279 |         { MLX5E_DECLARE_PTP_RQ_STAT(struct mlx5e_rq_stats, csum_complete_tail_slow) },
      |           ^~~~~~~~~~~~~~~~~~~~~~~~~

This stat string is being used in ethtool_sprintf(), so it must be a
valid NUL-terminated string. Currently the string lacks the final NUL
byte (as GCC warns), but by absolute luck, the next byte in memory is a
space (decimal 32) followed by a NUL. "format" is immediately followed
by little-endian size_t:

struct counter_desc {
        char                       format[32];           /*     0    32 */
        size_t                     offset;               /*    32     8 */
};

The "offset" member is populated by the stats member offset:

#define MLX5E_DECLARE_PTP_RQ_STAT(type, fld) "ptp_rq%d_"#fld, offsetof(type, fld)

which for this struct mlx5e_rq_stats member, csum_complete_tail_slow, is
32, or space, and then the rest of the "offset" bytes are NULs.

struct mlx5e_rq_stats {
...
        u64                        csum_complete_tail_slow; /* 32     8 */

The use of vsnprintf(), within ethtool_sprintf(), reads past the end of
"format" and sees the format string as "ptp_rq%d_csum_complete_tail_slow ",
with %d getting resolved by MLX5E_PTP_CHANNEL_IX (value 0):

                       ethtool_sprintf(data, ptp_rq_stats_desc[i].format,
                                       MLX5E_PTP_CHANNEL_IX);

With an output result of "ptp_rq0_csum_complete_tail_slow", which gets
precisely truncated to 31 characters with a trailing NUL.

So, instead of accidentally getting this correct due to the NUL bytes
at the end of the size_t that happens to follow the format string, just
make the string initializer 1 byte shorter by replacing "%d" with "0",
since MLX5E_PTP_CHANNEL_IX is already hard-coded. This results in no
initializer truncation and no need to call sprintf().

Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://patch.msgid.link/20250416020109.work.297-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethtool: Adjust exactly ETH_GSTRING_LEN-long stats to use memcpy

Many drivers populate the stats buffer using C-String based APIs (e.g.
ethtool_sprintf() and ethtool_puts()), usually when building up the
list of stats individually (i.e. with a for() loop). This, however,
requires that the source strings be populated in such a way as to have
a terminating NUL byte in the source.

Other drivers populate the stats buffer directly using one big memcpy()
of an entire array of strings. No NUL termination is needed here, as the
bytes are being directly passed through. Yet others will build up the
stats buffer individually, but also use memcpy(). This, too, does not
need NUL termination of the source strings.

However, there are cases where the strings that populate the
source stats strings are exactly ETH_GSTRING_LEN long, and GCC
15's -Wunterminated-string-initialization option complains that the
trailing NUL byte has been truncated. This situation is fine only if the
driver is using the memcpy() approach. If the C-String APIs are used,
the destination string name will have its final byte truncated by the
required trailing NUL byte applied by the C-string API.

For drivers that are already using memcpy() but have initializers that
truncate the NUL terminator, mark their source strings as __nonstring to
silence the GCC warnings.

For drivers that have initializers that truncate the NUL terminator and
are using the C-String APIs, switch to memcpy() to avoid destination
string truncation and mark their source strings as __nonstring to silence
the GCC warnings. (Also introduce ethtool_cpy() as a helper to make this
an easy replacement).

Specifically the following warnings were investigated and addressed:

../drivers/net/ethernet/chelsio/cxgb/cxgb2.c:364:9: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
  364 |         "TxFramesAbortedDueToXSCollisions",
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/ethernet/freescale/enetc/enetc_ethtool.c:165:33: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
  165 |         { ENETC_PM_R1523X(0),   "MAC rx 1523 to max-octet packets" },
      |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/ethernet/freescale/enetc/enetc_ethtool.c:190:33: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
  190 |         { ENETC_PM_T1523X(0),   "MAC tx 1523 to max-octet packets" },
      |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/ethernet/google/gve/gve_ethtool.c:76:9: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
   76 |         "adminq_dcfg_device_resources_cnt", "adminq_set_driver_parameter_cnt",
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c:117:53: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
  117 |         STMMAC_STAT(ptp_rx_msg_type_pdelay_follow_up),
      |                                                     ^
../drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c:46:12: note: in definition of macro 'STMMAC_STAT'
   46 |         { #m, sizeof_field(struct stmmac_extra_stats, m),       \
      |            ^
../drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c:328:24: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
  328 |                 .str = "a_mac_control_frames_transmitted",
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c:340:24: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
  340 |                 .str = "a_pause_mac_ctrl_frames_received",
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Petr Machata <petrm@nvidia.com> # for mlxsw
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Link: https://patch.msgid.link/20250416010210.work.904-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: add RTL_GIGA_MAC_VER_LAST to facilitate adding support for new chip versions

Add a new mac_version enum value RTL_GIGA_MAC_VER_LAST. Benefit is that
when adding support for a new chip version we have to touch less code,
except something changes fundamentally.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/06991f47-2aec-4aa2-8918-2c6e79332303@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: refactor chip version detection

Refactor chip version detection and merge both configuration tables.
Apart from reducing the code by a third, this paves the way for
merging chip version handling if only difference is the firmware.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/1fea533a-dd5a-4198-a9e2-895e11083947@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-stmmac-sunxi-cleanups'

Russell King says:

====================
net: stmmac: sunxi cleanups

This series cleans up the sunxi (sun7i) code in two ways:

1. it converts to use the new set_clk_tx_rate() method, even though
   we don't use clk_tx_i. In doing so, I reformat the function to
   read better, but with no changes to the code.

2. convert from stmmac_dvr_probe() to stmmac_pltfr_probe(), and then
   to its devm variant, which allows code simplification.
====================

Link: https://patch.msgid.link/Z_5WT_jOBgubjWQg@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: sunxi: use devm_stmmac_pltfr_probe()

Using devm_stmmac_pltfr_probe() simplifies the probe function. This
will not only call plat_dat->init (sun7i_dwmac_init), but also
plat_dat->exit (sun7i_dwmac_exit) appropriately if stmmac_dvr_probe()
fails. This results in an overall simplification of the glue driver.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1u4fre-000nMr-FT@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: sunxi: use stmmac_pltfr_probe()

Rather than open-coding the calls to sun7i_gmac_init() and
sun7i_gmac_exit() in the probe function, use stmmac_pltfr_probe()
which will automatically call the plat_dat->init() and plat_dat->exit()
methods appropriately. This simplifies the code.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1u4frZ-000nMl-BB@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: sunxi: convert to set_clk_tx_rate()

Convert sunxi to use the set_clk_tx_rate() callback rather than the
fix_mac_speed() callback.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1u4frU-000nMf-6o@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.15-rc3).

No conflicts. Adjacent changes:

tools/net/ynl/pyynl/ynl_gen_c.py
4d07bbf2d456 ("tools: ynl-gen: don't declare loop iterator in place")
7e8ba0c7de2b ("tools: ynl: don't use genlmsghdr in classic netlink")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from Bluetooth, CAN and Netfilter.

  Current release - regressions:

   - two fixes for the netdev per-instance locking

   - batman-adv: fix double-hold of meshif when getting enabled

  Current release - new code bugs:

   - Bluetooth: increment TX timestamping tskey always for stream
     sockets

   - wifi: static analysis and build fixes for the new Intel sub-driver

  Previous releases - regressions:

   - net: fib_rules: fix iif / oif matching on L3 master (VRF) device

   - ipv6: add exception routes to GC list in rt6_insert_exception()

   - netfilter: conntrack: fix erroneous removal of offload bit

   - Bluetooth:
       - fix sending MGMT_EV_DEVICE_FOUND for invalid address
       - l2cap: process valid commands in too long frame
       - btnxpuart: Revert baudrate change in nxp_shutdown

  Previous releases - always broken:

   - ethtool: fix memory corruption during SFP FW flashing

   - eth:
       - hibmcge: fixes for link and MTU handling, pause frames etc
       - igc: fixes for PTM (PCIe timestamping)

   - dsa: b53: enable BPDU reception for management port

  Misc:

   - fixes for Netlink protocol schemas"

* tag 'net-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
  net: ethernet: mtk_eth_soc: revise QDMA packet scheduler settings
  net: ethernet: mtk_eth_soc: correct the max weight of the queue limit for 100Mbps
  net: ethernet: mtk_eth_soc: reapply mdc divider on reset
  net: ti: icss-iep: Fix possible NULL pointer dereference for perout request
  net: ti: icssg-prueth: Fix possible NULL pointer dereference inside emac_xmit_xdp_frame()
  net: ti: icssg-prueth: Fix kernel warning while bringing down network interface
  netfilter: conntrack: fix erronous removal of offload bit
  net: don't try to ops lock uninitialized devs
  ptp: ocp: fix start time alignment in ptp_ocp_signal_set
  net: dsa: avoid refcount warnings when ds->ops->tag_8021q_vlan_del() fails
  net: dsa: free routing table on probe failure
  net: dsa: clean up FDB, MDB, VLAN entries on unbind
  net: dsa: mv88e6xxx: fix -ENOENT when deleting VLANs and MST is unsupported
  net: dsa: mv88e6xxx: avoid unregistering devlink regions which were never registered
  net: txgbe: fix memory leak in txgbe_probe() error path
  net: bridge: switchdev: do not notify new brentries as changed
  net: b53: enable BPDU reception for management port
  netlink: specs: rt-neigh: prefix struct nfmsg members with ndm
  netlink: specs: rt-link: adjust mctp attribute naming
  netlink: specs: rtnetlink: attribute naming corrections
  ...

Merge branch 'bpf-qdisc'

Amery Hung says:

====================
bpf qdisc

Hi all,

This patchset aims to support implementing qdisc using bpf struct_ops.
This version takes a step back and only implements the minimum support
for bpf qdisc. 1) support of adding skb to bpf_list and bpf_rbtree
directly and 2) classful qdisc are deferred to future patchsets. In
addition, we only allow attaching bpf qdisc to root or mq for now.
This is to prevent accidentally breaking exisiting classful qdiscs
that rely on data in a child qdisc. This limit may be lifted in the
future after careful inspection.

* Overview *

This series supports implementing qdisc using bpf struct_ops. bpf qdisc
aims to be a flexible and easy-to-use infrastructure that allows users to
quickly experiment with different scheduling algorithms/policies. It only
requires users to implement core qdisc logic using bpf and implements the
mundane part for them. In addition, the ability to easily communicate
between qdisc and other components will also bring new opportunities for
new applications and optimizations.

* Performance of bpf qdisc *

This patchset includes two qdisc examples, bpf_fifo and bpf_fq, for
__testing__ purposes. For performance test, we compare selftests and their
kernel counterparts to give you a sense of the performance of qdisc
implemented in bpf.

The implementation of bpf_fq is fairly complex and slightly different from
fq so later we only compare the two fifo qdiscs. bpf_fq implements a
scheduling algorithm similar to fq before commit 29f834aa326e ("net_sched:
sch_fq: add 3 bands and WRR scheduling") was introduced. bpf_fifo uses a
single bpf_list as a queue instead of three queues for different
priorities in pfifo_fast. The time complexity of fifo however should be
similar since the queue selection time is negligible.

Test setup:

    client -> qdisc ------------->  server
    ~~~~~~~~~~~~~~~                 ~~~~~~
    nested VM1 @ DC1               VM2 @ DC2

Throghput: iperf3 -t 600, 5 times

      Qdisc        Average (GBits/sec)
    ----------     -------------------
    pfifo_fast       12.52 ± 0.26
    bpf_fifo         11.72 ± 0.32
    fq               10.24 ± 0.13
    bpf_fq           11.92 ± 0.64

Latency: sockperf pp --tcp -t 600, 5 times

      Qdisc        Average (usec)
    ----------     --------------
    pfifo_fast      244.58 ± 7.93
    bpf_fifo        244.92 ± 15.22
    fq              234.30 ± 19.25
    bpf_fq          221.34 ± 10.76

Looking at the two fifo qdiscs, the 6.4% drop in throughput in the bpf
implementatioin is consistent with previous observation (v8 throughput
test on a loopback device). This should be able to be mitigated by
supporting adding skb to bpf_list or bpf_rbtree directly in the future.

* Clean up skb in bpf qdisc during reset *

The current implementation relies on bpf qdisc implementors to correctly
release skbs in queues (bpf graphs or maps) in .reset, which might not be
a safe thing to do. The solution as Martin has suggested would be
supporting private data in struct_ops. This can also help simplifying
implementation of qdisc that works with mq. For examples, qdiscs in the
selftest mostly use global data. Therefore, even if user add multiple
qdisc instances under mq, they would still share the same queue.
====================

Link: https://patch.msgid.link/20250409214606.2000194-1-ameryhung@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

selftests/bpf: Test attaching bpf qdisc to mq and non root

Until we are certain that existing classful qdiscs work with bpf qdisc,
make sure we don't allow attaching a bpf qdisc to non root. Meanwhile,
attaching to mq is allowed.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409214606.2000194-11-ameryhung@gmail.com

selftests/bpf: Add a bpf fq qdisc to selftest

This test implements a more sophisticated qdisc using bpf. The bpf fair-
queueing (fq) qdisc gives each flow an equal chance to transmit data. It
also respects the timestamp of skb for rate limiting.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409214606.2000194-10-ameryhung@gmail.com

selftests/bpf: Add a basic fifo qdisc test

This selftest includes a bare minimum fifo qdisc, which simply enqueues
sk_buffs into the back of a bpf list and dequeues from the front of the
list.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409214606.2000194-9-ameryhung@gmail.com

libbpf: Support creating and destroying qdisc

Extend struct bpf_tc_hook with handle, qdisc name and a new attach type,
BPF_TC_QDISC, to allow users to add or remove any qdisc specified in
addition to clsact.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409214606.2000194-8-ameryhung@gmail.com

bpf: net_sched: Disable attaching bpf qdisc to non root

Do not allow users to attach bpf qdiscs to classful qdiscs. This is to
prevent accidentally breaking existings classful qdiscs if they rely on
some data in the child qdisc. This restriction can potentially be lifted
in the future. Note that, we still allow bpf qdisc to be attached to mq.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409214606.2000194-7-ameryhung@gmail.com

bpf: net_sched: Support updating bstats

Add a kfunc to update Qdisc bstats when an skb is dequeued. The kfunc is
only available in .dequeue programs.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409214606.2000194-6-ameryhung@gmail.com

bpf: net_sched: Add a qdisc watchdog timer

Add a watchdog timer to bpf qdisc. The watchdog can be used to schedule
the execution of qdisc through kfunc, bpf_qdisc_schedule(). It can be
useful for building traffic shaping scheduling algorithm, where the time
the next packet will be dequeued is known.

The implementation relies on struct_ops gen_prologue/epilogue to patch bpf
programs provided by users. Operator specific prologue/epilogue kfuncs
are introduced instead of watchdog kfuncs so that it is easier to extend
prologue/epilogue in the future (writing C vs BPF bytecode).

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409214606.2000194-5-ameryhung@gmail.com

bpf: net_sched: Add basic bpf qdisc kfuncs

Add basic kfuncs for working on skb in qdisc.

Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release
a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
in .enqueue where a to_free skb list is available from kernel to defer
the release. bpf_kfree_skb() should be used elsewhere. It is also used
in bpf_obj_free_fields() when cleaning up skb in maps and collections.

bpf_skb_get_hash() returns the flow hash of an skb, which can be used
to build flow-based queueing algorithms.

Finally, allow users to create read-only dynptr via bpf_dynptr_from_skb().

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409214606.2000194-4-ameryhung@gmail.com

bpf: net_sched: Support implementation of Qdisc_ops in bpf

The recent advancement in bpf such as allocated objects, bpf list and bpf
rbtree has provided powerful and flexible building blocks to realize
sophisticated packet scheduling algorithms. As struct_ops now supports
core operators in Qdisc_ops, start allowing qdisc to be implemented using
bpf struct_ops with this patch. Users can implement Qdisc_ops.{enqueue,
dequeue, init, reset, destroy} in bpf and register the qdisc dynamically
into the kernel.

Co-developed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409214606.2000194-3-ameryhung@gmail.com

bpf: Prepare to reuse get_ctx_arg_idx

Rename get_ctx_arg_idx to bpf_ctx_arg_idx, and allow others to call it.
No functional change.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409214606.2000194-2-ameryhung@gmail.com

Merge tag 'for-linus-6.15a-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fix from Juergen Gross:
"Just a single fix for the Xen multicall driver avoiding a percpu
variable referencing initdata by its initializer"

* tag 'for-linus-6.15a-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: fix multicall debug feature

Merge tag 'for-linus-fwctl' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull fwctl fixes from Jason Gunthorpe:
"Three small changes from further build testing:

   - Don't rely on the userspace uuid.h for the uapi header

   - Fix sparse warnings in pds

   - Typo in log message"

* tag 'for-linus-fwctl' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  fwctl: Fix repeated device word in log message
  pds_fwctl: Fix type and endian complaints
  fwctl/cxl: Fix uuid_t usage in uapi

Merge tag 'sound-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"A collection of small fixes. All are device-specific like quirks, new
  IDs, and other safe (or rather boring) changes"

* tag 'sound-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  firmware: cs_dsp: test_bin_error: Fix uninitialized data used as fw version
  ASoC: codecs: Add of_match_table for aw888081 driver
  ASoC: fsl: fsl_qmc_audio: Reset audio data pointers on TRIGGER_START event
  mailmap: Add entry for Srinivas Kandagatla
  MAINTAINERS: use kernel.org alias
  ASoC: cs42l43: Reset clamp override on jack removal
  ALSA: hda/realtek - Fixed ASUS platform headset Mic issue
  ALSA: hda/cirrus_scodec_test: Don't select dependencies
  ALSA: azt2320: Replace deprecated strcpy() with strscpy()
  ASoC: hdmi-codec: use RTD ID instead of DAI ID for ELD entry
  ASoC: Intel: avs: Constrain path based on BE capabilities
  ALSA: hda/tas2781: Remove unnecessary NULL check before release_firmware()
  ASoC: Intel: avs: Fix null-ptr-deref in avs_component_probe()
  ASoC: fsl_asrc_dma: get codec or cpu dai from backend
  ASoC: qcom: Fix sc7280 lpass potential buffer overflow
  ASoC: dwc: always enable/disable i2s irqs
  ASoC: Intel: sof_sdw: Add quirk for Asus Zenbook S16
  ASoC: codecs:lpass-wsa-macro: Fix logic of enabling vi channels
  ASoC: codecs:lpass-wsa-macro: Fix vi feedback rate

Merge tag 'platform-drivers-x86-v6.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform drivers fixes from Ilpo Järvinen:
"Fixes:
   - amd/pmf: Fix STT limits
   - asus-laptop: Fix an uninitialized variable
   - intel_pmc_ipc: Allow building without ACPI
   - mlxbf-bootctl: Use sysfs_emit_at() in secure_boot_fuse_state_show()
   - msi-wmi-platform: Add locking to workaround ACPI firmware bug

  New HW support:
   - alienware-wmi-wmax:
      - Extended thermal control support to:
         - Alienware Area-51m R2
         - Alienware m16 R1
         - Alienware m16 R2
         - Dell G16 7630
         - Dell G5 5505 SE
      - G-Mode support to Alienware m16 R1
   - x86-android-tablets: Add Vexia Edu Atla 10 tablet 5V data"

* tag 'platform-drivers-x86-v6.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/x86: msi-wmi-platform: Workaround a ACPI firmware bug
  platform/x86: msi-wmi-platform: Rename "data" variable
  platform/x86: alienware-wmi-wmax: Extend support to more laptops
  platform/x86: alienware-wmi-wmax: Add G-Mode support to Alienware m16 R1
  platform/x86: amd: pmf: Fix STT limits
  mlxbf-bootctl: use sysfs_emit_at() in secure_boot_fuse_state_show()
  platform/x86: x86-android-tablets: Add Vexia Edu Atla 10 tablet 5V data
  platform/x86: x86-android-tablets: Add "9v" to Vexia EDU ATLA 10 tablet symbols
  asus-laptop: Fix an uninitialized variable
  platform/x86: intel_pmc_ipc: add option to build without ACPI

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"Small drivers fixes, except for ufs which has two large updates, one
  for exposing the device level feature, which is a new addition to the
  device spec and the other reworking the exynos driver to fix coherence
  issues on some android phones"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: megaraid_sas: Driver version update to 07.734.00.00-rc1
  scsi: megaraid_sas: Block zero-length ATA VPD inquiry
  scsi: scsi_transport_srp: Replace min/max nesting with clamp()
  scsi: ufs: core: Add device level exception support
  scsi: ufs: core: Rename ufshcd_wb_presrv_usrspc_keep_vcc_on()
  scsi: smartpqi: Use is_kdump_kernel() to check for kdump
  scsi: pm80xx: Set phy_attached to zero when device is gone
  scsi: ufs: exynos: gs101: Put UFS device in reset on .suspend()
  scsi: ufs: exynos: Move phy calls to .exit() callback
  scsi: ufs: exynos: Enable PRDT pre-fetching with UFSHCD_CAP_CRYPTO
  scsi: ufs: exynos: Ensure consistent phy reference counts
  scsi: ufs: exynos: Disable iocc if dma-coherent property isn't set
  scsi: ufs: exynos: Move UFS shareability value to drvdata
  scsi: ufs: exynos: Ensure pre_link() executes before exynos_ufs_phy_init()
  scsi: iscsi: Fix missing scsi_host_put() in error path
  scsi: ufs: core: Fix a race condition related to device commands
  scsi: hisi_sas: Fix I/O errors caused by hardware port ID changes
  scsi: hisi_sas: Enable force phy when SATA disk directly connected

Merge tag 'ata-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux

Pull ata fix from Damien Le Moal:

- Fix how sense data from the sense data for successfull NCQ commands
   log page is used to fully initialize the result_tf of a completed
   command, so that the sense data returned to the scsi layer is fully
   initialized with all the device provided information (from Niklas)

* tag 'ata-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
  ata: libata-sata: Save all fields from sense data descriptor

Merge tag 'xfs-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS fixes from Carlos Maiolino:
"This mostly includes fixes and documentation for the zoned allocator
  feature merged during previous merge window, but it also adds a sysfs
  tunable for the zone garbage collector.

  There is also a fix for a regression to the RT device that we'd like
  to fix ASAP now that we're getting more users on the RT zoned
  allocator"

* tag 'xfs-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: document zoned rt specifics in admin-guide
  xfs: fix fsmap for internal zoned devices
  xfs: Fix spelling mistake "drity" -> "dirty"
  xfs: compute buffer address correctly in xmbuf_map_backing_mem
  xfs: add tunable threshold parameter for triggering zone GC
  xfs: mark xfs_buf_free as might_sleep()
  xfs: remove the leftover xfs_{set,clear}_li_failed infrastructure

Merge tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

- handle encoded read ioctl returning EAGAIN so it does not mistakenly
   free the work structure

- escape subvolume path in mount option list so it cannot be wrongly
   parsed when the path contains ","

- remove folio size assertions when writing super block to device with
   enabled large folios

* tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: remove folio order ASSERT()s in super block writeback path
  btrfs: correctly escape subvol in btrfs_show_options()
  btrfs: ioctl: don't free iov when btrfs_encoded_read() returns -EAGAIN

Merge tag 'slab-for-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab fix from Vlastimil Babka:

- Stable fix adding zero initialization of slab->obj_ext to prevent
crashes with allocation profiling (Suren Baghdasaryan)

* tag 'slab-for-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
slab: ensure slab->obj_exts is clear in a newly allocated slab page

net: ethernet: mtk_eth_soc: revise QDMA packet scheduler settings

The QDMA packet scheduler suffers from a performance issue.
Fix this by picking up changes from MediaTek's SDK which change to use
Token Bucket instead of Leaky Bucket and fix the SPEED_1000 configuration.

Fixes: 160d3a9b1929 ("net: ethernet: mtk_eth_soc: introduce MTK_NETSYS_V2 support")
Signed-off-by: Bo-Cun Chen <bc-bocun.chen@mediatek.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/18040f60f9e2f5855036b75b28c4332a2d2ebdd8.1744764277.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: mtk_eth_soc: correct the max weight of the queue limit for 100Mbps

Without this patch, the maximum weight of the queue limit will be
incorrect when linked at 100Mbps due to an apparent typo.

Fixes: f63959c7eec31 ("net: ethernet: mtk_eth_soc: implement multi-queue support for per-port queues")
Signed-off-by: Bo-Cun Chen <bc-bocun.chen@mediatek.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/74111ba0bdb13743313999ed467ce564e8189006.1744764277.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: mtk_eth_soc: reapply mdc divider on reset

In the current method, the MDC divider was reset to the default setting
of 2.5MHz after the NETSYS SER. Therefore, we need to reapply the MDC
divider configuration function in mtk_hw_init() after reset.

Fixes: c0a440031d431 ("net: ethernet: mtk_eth_soc: set MDIO bus clock frequency")
Signed-off-by: Bo-Cun Chen <bc-bocun.chen@mediatek.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/8ab7381447e6cdcb317d5b5a6ddd90a1734efcb0.1744764277.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'nf-25-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fix for net

The following batch contains one Netfilter fix for net:

1) conntrack offload bit is erroneously unset in a race scenario,
from Florian Westphal.

netfilter pull request 25-04-17

* tag 'nf-25-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: conntrack: fix erronous removal of offload bit
====================

Link: https://patch.msgid.link/20250417102847.16640-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge tag 'for-net-2025-04-16' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

- l2cap: Process valid commands in too long frame
- vhci: Avoid needless snprintf() calls

* tag 'for-net-2025-04-16' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: vhci: Avoid needless snprintf() calls
Bluetooth: l2cap: Process valid commands in too long frame
====================

Link: https://patch.msgid.link/20250416210126.2034212-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-pktgen-fix-checkpatch-code-style-errors-warnings'

Peter Seiderer says:

====================
net: pktgen: fix checkpatch code style errors/warnings

Fix checkpatch detected code style errors/warnings detected in
the file net/core/pktgen.c (remaining checkpatch checks will be addressed
in a follow up patch set).
====================

Link: https://patch.msgid.link/20250415112916.113455-1-ps.report@gmx.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: pktgen: fix code style (WARNING: Prefer strscpy over strcpy)

Fix checkpatch code style warnings:

  WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88
  #1423: FILE: net/core/pktgen.c:1423:
  +                       strcpy(pkt_dev->dst_min, buf);

  WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88
  #1444: FILE: net/core/pktgen.c:1444:
  +                       strcpy(pkt_dev->dst_max, buf);

  WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88
  #1554: FILE: net/core/pktgen.c:1554:
  +                       strcpy(pkt_dev->src_min, buf);

  WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88
  #1575: FILE: net/core/pktgen.c:1575:
  +                       strcpy(pkt_dev->src_max, buf);

  WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88
  #3231: FILE: net/core/pktgen.c:3231:
  +                       strcpy(pkt_dev->result, "Starting");

  WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88
  #3235: FILE: net/core/pktgen.c:3235:
  +                       strcpy(pkt_dev->result, "Error starting");

  WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88
  #3849: FILE: net/core/pktgen.c:3849:
  +       strcpy(pkt_dev->odevname, ifname);

While at it squash memset/strcpy pattern into single strscpy_pad call.

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250415112916.113455-4-ps.report@gmx.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: pktgen: fix code style (WARNING: please, no space before tabs)

Fix checkpatch code style warnings:

  WARNING: please, no space before tabs
  #230: FILE: net/core/pktgen.c:230:
  +#define M_NETIF_RECEIVE ^I1^I/* Inject packets into stack */$

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250415112916.113455-3-ps.report@gmx.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: pktgen: fix code style (ERROR: else should follow close brace '}')

Fix checkpatch code style errors:

  ERROR: else should follow close brace '}'
  #1317: FILE: net/core/pktgen.c:1317:
  +               }
  +               else

And checkpatch follow up code style check:

  CHECK: Unbalanced braces around else statement
  #1316: FILE: net/core/pktgen.c:1316:
  +               } else

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250415112916.113455-2-ps.report@gmx.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'mitigate-double-allocations-in-ioam6_iptunnel'

Justin Iurman says:

====================
Mitigate double allocations in ioam6_iptunnel

Commit dce525185bc9 ("net: ipv6: ioam6_iptunnel: mitigate 2-realloc
issue") fixed the double allocation issue in ioam6_iptunnel. However,
since commit 92191dd10730 ("net: ipv6: fix dst ref loops in rpl, seg6
and ioam6 lwtunnels"), the fix was left incomplete. Because the cache is
now empty when the dst_entry is the same post transformation in order to
avoid a reference loop, the double reallocation is back for such cases
(e.g., inline mode) which are valid for IOAM. This patch provides a way
to detect such cases without having a reference loop in the cache, and
so to avoid the double reallocation issue for all cases again.

v1: https://lore.kernel.org/netdev/20250410152432.30246-1-justin.iurman@uliege.be/T/#t
====================

Link: https://patch.msgid.link/20250415112554.23823-1-justin.iurman@uliege.be
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipv6: ioam6: fix double reallocation

If the dst_entry is the same post transformation (which is a valid use
case for IOAM), we don't add it to the cache to avoid a reference loop.
Instead, we use a "fake" dst_entry and add it to the cache as a signal.
When we read the cache, we compare it with our "fake" dst_entry and
therefore detect if we're in the special case.

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Link: https://patch.msgid.link/20250415112554.23823-3-justin.iurman@uliege.be
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipv6: ioam6: use consistent dst names

Be consistent and use the same terminology as other lwt users: orig_dst
is the dst_entry before the transformation, while dst is either the
dst_entry in the cache or the dst_entry after the transformation

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Link: https://patch.msgid.link/20250415112554.23823-2-justin.iurman@uliege.be
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'introducing-openvpn-data-channel-offload'

Antonio Quartulli says:

====================
Introducing OpenVPN Data Channel Offload

Notable changes since v25:
* removed netdev notifier (was only used for our own devices)
* added .dellink implementation to address what was previously
done in notifier
* removed .ndo_open and moved netif_carrier_off() call to .ndo_init
* fixed author in MODULE_AUTHOR()
* properly indented checks in ovpn.yaml
* switched from TSTATS to DSTATS
* removed obsolete comment in ovpn_socket_new()
* removed unrelated hunk in ovpn_socket_new()

The latest code can also be found at:

https://github.com/OpenVPN/ovpn-net-next

Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
====================

Link: https://patch.msgid.link/20250415-b4-ovpn-v26-0-577f6097b964@openvpn.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

testing/selftests: add test tool and scripts for ovpn module

The ovpn-cli tool can be compiled and used as selftest for the ovpn
kernel module.

[NOTE: it depends on libmedtls for decoding base64-encoded keys]

ovpn-cli implements the netlink and RTNL APIs and can thus be integrated
in any script for more automated testing.

Along with the tool, a bunch of scripts are provided that perform basic
functionality tests by means of network namespaces.
These scripts take part to the kselftest automation.

The output of the scripts, which will appear in the kselftest
reports, is a list of steps performed by the scripts plus some
output coming from the execution of `ping`, `iperf` and `ovpn-cli`
itself.
In general it is useful only in case of failure, in order to
understand which step has failed and why.

Please note: since peer sockets are tied to the userspace
process that created them (i.e. exiting the process will result
in closing the socket), every run of ovpn-cli that created
one will go to background and enter pause(), waiting for the
signal which will allow it to terminate.
Termination is accomplished at the end of each script by
issuing a killall command.

Cc: linux-kselftest@vger.kernel.org
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-23-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: add basic ethtool support

Implement support for basic ethtool functionality.

Note that ovpn is a virtual device driver, therefore
various ethtool APIs are just not meaningful and thus
not implemented.

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-22-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: notify userspace when a peer is deleted

Whenever a peer is deleted, send a notification to userspace so that it
can react accordingly.

This is most important when a peer is deleted due to ping timeout,
because it all happens in kernelspace and thus userspace has no direct
way to learn about it.

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-21-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: kill key and notify userspace in case of IV exhaustion

IV wrap-around is cryptographically dangerous for a number of ciphers,
therefore kill the key and inform userspace (via netlink) should the
IV space go exhausted.

Userspace has two ways of deciding when the key has to be renewed before
exhausting the IV space:
1) time based approach:
   after X seconds/minutes userspace generates a new key and sends it
   to the kernel. This is based on guestimate and normally default
   timer value works well.

2) packet count based approach:
   after X packets/bytes userspace generates a new key and sends it to
   the kernel. Userspace keeps track of the amount of traffic by
   periodically polling GET_PEER and fetching the VPN/LINK stats.

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-20-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: implement key add/get/del/swap via netlink

This change introduces the netlink commands needed to add, get, delete
and swap keys for a specific peer.

Userspace is expected to use these commands to create, inspect (non
sensitive data only), destroy and rotate session keys for a specific
peer.

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-19-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: implement peer add/get/dump/delete via netlink

This change introduces the netlink command needed to add, delete and
retrieve/dump known peers. Userspace is expected to use these commands
to handle known peer lifecycles.

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-18-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: add support for updating local or remote UDP endpoint

In case of UDP links, the local or remote endpoint used to communicate
with a given peer may change without a connection restart.

Add support for learning the new address in case of change.

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-17-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: implement keepalive mechanism

OpenVPN supports configuring a periodic keepalive packet.
message to allow the remote endpoint detect link failures.

This change implements the keepalive sending and timer expiring logic.

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-16-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: implement peer lookup logic

In a multi-peer scenario there are a number of situations when a
specific peer needs to be looked up.

We may want to lookup a peer by:
1. its ID
2. its VPN destination IP
3. its transport IP/port couple

For each of the above, there is a specific routing table referencing all
peers for fast look up.

Case 2. is a bit special in the sense that an outgoing packet may not be
sent to the peer VPN IP directly, but rather to a network behind it. For
this reason we first perform a nexthop lookup in the system routing
table and then we use the retrieved nexthop as peer search key.

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-15-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: implement multi-peer support

With this change an ovpn instance will be able to stay connected to
multiple remote endpoints.

This functionality is strictly required when running ovpn on an
OpenVPN server.

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-14-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: add support for MSG_NOSIGNAL in tcp_sendmsg

Userspace may want to pass the MSG_NOSIGNAL flag to
tcp_sendmsg() in order to avoid generating a SIGPIPE.

To pass this flag down the TCP stack a new skb sending API
accepting a flags argument is introduced.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-13-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

skb: implement skb_send_sock_locked_with_flags()

When sending an skb over a socket using skb_send_sock_locked(),
it is currently not possible to specify any flag to be set in
msghdr->msg_flags.

However, we may want to pass flags the user may have specified,
like MSG_NOSIGNAL.

Extend __skb_send_sock() with a new argument 'flags' and add a
new interface named skb_send_sock_locked_with_flags().

Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-12-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: implement TCP transport

With this change ovpn is allowed to communicate to peers also via TCP.
Parsing of incoming messages is implemented through the strparser API.

Note that ovpn redefines sk_prot and sk_socket->ops for the TCP socket
used to communicate with the peer.
For this reason it needs to access inet6_stream_ops, which is declared
as extern in the IPv6 module, but it is not fully exported.

Therefore this patch is also adding EXPORT_SYMBOL_GPL(inet6_stream_ops)
to net/ipv6/af_inet6.c.

Cc: David Ahern <dsahern@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-11-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: store tunnel and transport statistics

Byte/packet counters for in-tunnel and transport streams
are now initialized and updated as needed.

To be exported via netlink.

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-10-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: implement packet processing

This change implements encryption/decryption and
encapsulation/decapsulation of OpenVPN packets.

Support for generic crypto state is added along with
a wrapper for the AEAD crypto kernel API.

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-9-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: implement basic RX path (UDP)

Packets received over the socket are forwarded to the user device.

Implementation is UDP only. TCP will be added by a later patch.

Note: no decryption/decapsulation exists yet, packets are forwarded as
they arrive without much processing.

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-8-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: implement basic TX path (UDP)

Packets sent over the ovpn interface are processed and transmitted to the
connected peer, if any.

Implementation is UDP only. TCP will be added by a later patch.

Note: no crypto/encapsulation exists yet. Packets are just captured and
sent.

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-7-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: introduce the ovpn_socket object

This specific structure is used in the ovpn kernel module
to wrap and carry around a standard kernel socket.

ovpn takes ownership of passed sockets and therefore an ovpn
specific objects is attached to them for status tracking
purposes.

Initially only UDP support is introduced. TCP will come in a later
patch.

Cc: willemdebruijn.kernel@gmail.com
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-6-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: introduce the ovpn_peer object

An ovpn_peer object holds the whole status of a remote peer
(regardless whether it is a server or a client).

This includes status for crypto, tx/rx buffers, napi, etc.

Only support for one peer is introduced (P2P mode).
Multi peer support is introduced with a later patch.

Along with the ovpn_peer, also the ovpn_bind object is introcued
as the two are strictly related.
An ovpn_bind object wraps a sockaddr representing the local
coordinates being used to talk to a specific peer.

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-5-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: keep carrier always on for MP interfaces

An ovpn interface configured in MP mode will keep carrier always
on and let the user decide when to bring it administratively up and
down.

This way a MP node (i.e. a server) will keep its interface always
up and running, even when no peer is connected.

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-4-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: add basic interface creation/destruction/management routines

Add basic infrastructure for handling ovpn interfaces.

Tested-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-3-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ovpn: add basic netlink support

This commit introduces basic netlink support with family
registration/unregistration functionalities and stub pre/post-doit.

More importantly it introduces the YAML uAPI description along
with its auto-generated files:
- include/uapi/linux/ovpn.h
- drivers/net/ovpn/netlink-gen.c
- drivers/net/ovpn/netlink-gen.h

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-2-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: introduce OpenVPN Data Channel Offload (ovpn)

OpenVPN is a userspace software existing since around 2005 that allows
users to create secure tunnels.

So far OpenVPN has implemented all operations in userspace, which
implies several back and forth between kernel and user land in order to
process packets (encapsulate/decapsulate, encrypt/decrypt, rerouting..).

With `ovpn` we intend to move the fast path (data channel) entirely
in kernel space and thus improve user measured throughput over the
tunnel.

`ovpn` is implemented as a simple virtual network device driver, that
can be manipulated by means of the standard RTNL APIs. A device of kind
`ovpn` allows only IPv4/6 traffic and can be of type:
* P2P (peer-to-peer): any packet sent over the interface will be
  encapsulated and transmitted to the other side (typical OpenVPN
  client or peer-to-peer behaviour);
* P2MP (point-to-multipoint): packets sent over the interface are
  transmitted to peers based on existing routes (typical OpenVPN
  server behaviour).

After the interface has been created, OpenVPN in userspace can
configure it using a new Netlink API. Specifically it is possible
to manage peers and their keys.

The OpenVPN control channel is multiplexed over the same transport
socket by means of OP codes. Anything that is not DATA_V2 (OpenVPN
OP code for data traffic) is sent to userspace and handled there.
This way the `ovpn` codebase is kept as compact as possible while
focusing on handling data traffic only (fast path).

Any OpenVPN control feature (like cipher negotiation, TLS handshake,
rekeying, etc.) is still fully handled by the userspace process.

When userspace establishes a new connection with a peer, it first
performs the handshake and then passes the socket to the `ovpn` kernel
module, which takes ownership. From this moment on `ovpn` will handle
data traffic for the new peer.
When control packets are received on the link, they are forwarded to
userspace through the same transport socket they were received on, as
userspace is still listening to them.

Some events (like peer deletion) are sent to a Netlink multicast group.

Although it wasn't easy to convince the community, `ovpn` implements
only a limited number of the data-channel features supported by the
userspace program.

Each feature that made it to `ovpn` was attentively vetted to
avoid carrying too much legacy along with us (and to give a clear cut to
old and probalby-not-so-useful features).

Notably, only encryption using AEAD ciphers (specifically
ChaCha20Poly1305 and AES-GCM) was implemented. Supporting any other
cipher out there was not deemed useful.

Both UDP and TCP sockets are supported.

As explained above, in case of P2MP mode, OpenVPN will use the main system
routing table to decide which packet goes to which peer. This implies
that no routing table was re-implemented in the `ovpn` kernel module.

This kernel module can be enabled by selecting the CONFIG_OVPN entry
in the networking drivers section.

NOTE: this first patch introduces the very basic framework only.
Features are then added patch by patch, however, although each patch
will compile and possibly not break at runtime, only after having
applied the full set it is expected to see the ovpn module fully working.

Cc: steffen.klassert@secunet.com
Cc: antony.antony@secunet.com
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-1-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'bug-fixes-from-xdp-and-perout-series'

Meghana Malladi says:

====================
Bug fixes from XDP and perout series

This patch series consists of bug fixes from the XDP series:
1. Fixes a kernel warning that occurs when bringing down the
   network interface.
2. Resolves a potential NULL pointer dereference in the
   emac_xmit_xdp_frame() function.
3. Resolves a potential NULL pointer dereference in the
   icss_iep_perout_enable() function

v3: https://lore.kernel.org/all/20250328102403.2626974-1-m-malladi@ti.com/
====================

Link: https://patch.msgid.link/20250415090543.717991-1-m-malladi@ti.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ti: icss-iep: Fix possible NULL pointer dereference for perout request

The ICSS IEP driver tracks perout and pps enable state with flags.
Currently when disabling pps and perout signals during icss_iep_exit(),
results in NULL pointer dereference for perout.

To fix the null pointer dereference issue, the icss_iep_perout_enable_hw
function can be modified to directly clear the IEP CMP registers when
disabling PPS or PEROUT, without referencing the ptp_perout_request
structure, as its contents are irrelevant in this case.

Fixes: 9b115361248d ("net: ti: icssg-prueth: Fix clearing of IEP_CMP_CFG registers during iep_init")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/7b1c7c36-363a-4085-b26c-4f210bee1df6@stanley.mountain/
Signed-off-by: Meghana Malladi <m-malladi@ti.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250415090543.717991-4-m-malladi@ti.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ti: icssg-prueth: Fix possible NULL pointer dereference inside emac_xmit_xdp_frame()

There is an error check inside emac_xmit_xdp_frame() function which
is called when the driver wants to transmit XDP frame, to check if
the allocated tx descriptor is NULL, if true to exit and return
ICSSG_XDP_CONSUMED implying failure in transmission.

In this case trying to free a descriptor which is NULL will result
in kernel crash due to NULL pointer dereference. Fix this error handling
and increase netdev tx_dropped stats in the caller of this function
if the function returns ICSSG_XDP_CONSUMED.

Fixes: 62aa3246f462 ("net: ti: icssg-prueth: Add XDP support")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/70d8dd76-0c76-42fc-8611-9884937c82f5@stanley.mountain/
Signed-off-by: Meghana Malladi <m-malladi@ti.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250415090543.717991-3-m-malladi@ti.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>