Currently, the VXLAN driver stores FDB entries in a hash table with a
fixed number of buckets (256). Subsequent patches are going to convert
this table to rhashtable with a linked list for entry traversal, as
rhashtable is more scalable.
In preparation for this conversion, move from a per-bucket spin lock to
a single spin lock that protects the entire FDB table.
The per-bucket spin locks were introduced by commit fe1e0713bbe8
("vxlan: Use FDB_HASH_SIZE hash_locks to reduce contention") citing
"huge contention when inserting/deleting vxlan_fdbs into the fdb_head".
It is not clear from the commit message which code path was holding the
spin lock for long periods of time, but the obvious suspect is the FDB
cleanup routine (vxlan_cleanup()) that periodically traverses the entire
table in order to delete aged-out entries.
This will be solved by subsequent patches that will convert the FDB
cleanup routine to traverse the linked list of FDB entries using RCU,
only acquiring the spin lock when deleting an aged-out entry.
The change reduces the size of the VXLAN device structure from 3600
bytes to 2576 bytes.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-7-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
vxlan: Relocate assignment of default remote device
The default FDB entry can be associated with a net device if a physical
device (i.e., 'dev PHYS_DEV') was specified during the creation of the
VXLAN device.
The assignment of the net device pointer to 'dst->remote_dev' logically
belongs in the if block that resolves the pointer from the specified
ifindex, so move it there.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-6-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
vxlan: Unsplit default FDB entry creation and notification
Commit 0241b836732f ("vxlan: fix default fdb entry netlink notify
ordering during netdev create") split the creation of the default FDB
entry from its notification to avoid sending a RTM_NEWNEIGH notification
before RTM_NEWLINK.
Previous patches restructured the code so that the default FDB entry is
created after registering the VXLAN device and the notification about
the new entry immediately follows its creation.
Therefore, simplify the code and revert back to vxlan_fdb_update() which
takes care of both creating the FDB entry and notifying user space
about it.
Hold the FDB hash lock when calling vxlan_fdb_update() like it expects.
A subsequent patch will add a lockdep assertion to make sure this is
indeed the case.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-5-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
vxlan: Insert FDB into hash table in vxlan_fdb_create()
Commit 7c31e54aeee5 ("vxlan: do not destroy fdb if register_netdevice()
is failed") split the insertion of FDB entries into the FDB hash table
from the function where they are created.
This was done in order to work around a problem that is no longer
possible after the previous patch. Simplify the code and move the body
of vxlan_fdb_insert() back into vxlan_fdb_create().
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-4-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There is asymmetry in how the default FDB entry (all-zeroes) is created
and destroyed in the VXLAN driver. It is created as part of the driver's
newlink() routine, but destroyed as part of its ndo_uninit() routine.
This caused multiple problems in the past. First, commit 0241b836732f
("vxlan: fix default fdb entry netlink notify ordering during netdev
create") split the notification about the entry from its creation so
that it will not be notified to user space before the VXLAN device is
registered.
Then, commit 6db924687139 ("vxlan: Fix error path in
__vxlan_dev_create()") made the error path in __vxlan_dev_create()
asymmetric by destroying the FDB entry before unregistering the net
device. Otherwise, the FDB entry would have been freed twice: By
ndo_uninit() as part of unregister_netdevice() and by
vxlan_fdb_destroy() in the error path.
Finally, commit 7c31e54aeee5 ("vxlan: do not destroy fdb if
register_netdevice() is failed") split the insertion of the FDB entry
into the hash table from its creation, moving the insertion after the
registration of the net device. Otherwise, like before, the FDB entry
would have been freed twice: By ndo_uninit() as part of
register_netdevice()'s error path and by vxlan_fdb_destroy() in the
error path of __vxlan_dev_create().
The end result is that the code is unnecessarily complex. In addition,
the fixed size hash table cannot be converted to rhashtable as
vxlan_fdb_insert() cannot fail, which will no longer be true with
rhashtable.
Solve this by making the addition and deletion of the default FDB entry
completely symmetric. Namely, as part of newlink() routine, create the
entry, insert it into to the hash table and send a notification to user
space after the net device was registered. Note that at this stage the
net device is still administratively down and cannot transmit / receive
packets.
Move the deletion from ndo_uninit() to the dellink routine(): Flush the
default entry together with all the other entries, before unregistering
the net device.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-3-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
vxlan: Add RCU read-side critical sections in the Tx path
The Tx path does not run from an RCU read-side critical section which
makes the current lockless accesses to FDB entries invalid. As far as I
am aware, this has not been a problem in practice, but traces will be
generated once we transition the FDB lookup to rhashtable_lookup().
Add rcu_read_{lock,unlock}() around the handling of FDB entries in the
Tx path. Remove the RCU read-side critical section from vxlan_xmit_nh()
as now the function is always called from an RCU read-side critical
section.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-2-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We've added 12 non-merge commits during the last 9 day(s) which contain
a total of 18 files changed, 1748 insertions(+), 19 deletions(-).
The main changes are:
1) bpf qdisc support, from Amery Hung.
A qdisc can be implemented in bpf struct_ops programs and
can be used the same as other existing qdiscs in the
"tc qdisc" command.
2) Add xsk tail adjustment tests, from Tushar Vyavahare.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
selftests/bpf: Test attaching bpf qdisc to mq and non root
selftests/bpf: Add a bpf fq qdisc to selftest
selftests/bpf: Add a basic fifo qdisc test
libbpf: Support creating and destroying qdisc
bpf: net_sched: Disable attaching bpf qdisc to non root
bpf: net_sched: Support updating bstats
bpf: net_sched: Add a qdisc watchdog timer
bpf: net_sched: Add basic bpf qdisc kfuncs
bpf: net_sched: Support implementation of Qdisc_ops in bpf
bpf: Prepare to reuse get_ctx_arg_idx
selftests/xsk: Add tail adjustment tests and support check
selftests/xsk: Add packet stream replacement function
====================
Jakub Kicinski [Tue, 22 Apr 2025 01:50:37 +0000 (18:50 -0700)]
Merge branch 'bnxt_en-update-for-net-next'
Michael Chan says:
====================
bnxt_en: Update for net-next
The first patch changes the FW message timeout threshold for a warning
message. The second patch adjusts the ethtool -w coredump length to
suppress a warning. The last 2 patches are small cleanup patches for
the bnxt_ulp RoCE auxbus code.
Kalesh AP [Thu, 17 Apr 2025 17:24:47 +0000 (10:24 -0700)]
bnxt_en: Remove unused field "ref_count" in struct bnxt_ulp
The "ref_count" field in struct bnxt_ulp is unused after
commit a43c26fa2e6c ("RDMA/bnxt_re: Remove the sriov config callback").
So we can just remove it now.
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250417172448.1206107-4-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bnxt_en: Report the ethtool coredump length after copying the coredump
ethtool first calls .get_dump_flags() to get the dump length. For
coredump, the driver calls the FW to get the coredump length (L1). The
min. of L1 and the user specified length is then passed to
.get_dump_data() (L2) to get the coredump. The actual coredump length
retrieved by the FW (L3) during .get_dump_data() may be smaller than L1.
This length discrepancy will trigger a WARN_ON() in
ethtool_get_dump_data().
ethtool has already vzalloc'ed a buffer with size L1. Just report
the coredump length as L2 even though the actual coredump length L3
may be smaller. The extra zero padding does not matter. This will
prevent the warning that may alarm the user.
For correctness, only do the final length update if there is no error.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com> Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Signed-off-by: Shruti Parab <shruti.parab@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250417172448.1206107-3-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michael Chan [Thu, 17 Apr 2025 17:24:45 +0000 (10:24 -0700)]
bnxt_en: Change FW message timeout warning
The firmware advertises a "hwrm_cmd_max_timeout" value to the driver
for NVRAM and coredump related functions that can take tens of seconds
to complete. The driver polls for the operation to complete under
mutex and may trigger hung task watchdog warning if the wait is too long.
To warn the user about this, the driver currently prints a warning if
this advertised value exceeds 40 seconds:
Device requests max timeout of %d seconds, may trigger hung task watchdog
Initially, we chose 40 seconds, well below the kernel's default
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT (120 seconds) to avoid triggering
the hung task watchdog. But 60 seconds is the timeout on most
production FW and cannot be reduced further. Change the driver's warning
threshold to 60 seconds to avoid triggering this warning on all
production devices. We also print the warning if the value exceeds
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT which may be set to architecture
specific defaults as low as 10 seconds.
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250417172448.1206107-2-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: stmmac: socfpga: fix init ordering and cleanups
This series fixes the init ordering of the socfpga probe function.
The standard rule is to do all setup before publishing any device,
and socfpga violates that. I can see no reason for this, but these
patches have not been tested on hardware.
Address this by moving the initialisation of dwmac->stmmac_rst
along with all the other dwmac initialisers - there's no reason
for this to be late as plat_dat->stmmac_rst has already been
populated.
Next, replace the call to ops->set_phy_mode() with an init function
socfpga_dwmac_init() which will then be linked in to plat_dat->init.
Then, add this to plat_dat->init, and switch to stmmac_pltfr_pm_ops
from the private ops. The runtime suspend/resume socfpga implementations
are identical to the platform ones, but misses the noirq versions
which this will add.
Before we swap the order of socfpga_dwmac_init() and
stmmac_dvr_probe(), we need to change the way the interface is
obtained, as that uses driver data and the struct net_device which
haven't been initialised. Save a pointer to plat_dat in the socfpga
private data, and use that to get the interface mode. We can then swap
the order of the init and probe functions.
Finally, convert to devm_stmmac_pltfr_probe() by moving the call
to ops->set_phy_mode() into an init function appropriately populating
plat_dat->init.
====================
net: stmmac: socfpga: convert to devm_stmmac_pltfr_probe()
Convert socfpga to use devm_stmmac_pltfr_probe() to further simplify
the probe function, wrapping the call to the set_phy_mode() method
into socfpga_dwmac_init() which can be called from the plat_dat->init()
method. Also call this from socfpga_dwmac_resume() thereby simplifying
that function.
Using the devm variant also means we can remove the call to
stmmac_pltfr_remove().
Unfortunately, we can't also convert to stmmac_pltfr_pm_ops as there is
extra work done in socfpga_dwmac_resume().
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/E1u5Sns-001IJw-OY@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: stmmac: socfpga: call set_phy_mode() before registration
Initialisation/setup after registration is a bug. This is the second
of two patches fixing this in socfpga.
The set_phy_mode() functions do various hardware setup that would
interfere with a netdev that has been published, and thus available to
be opened by the kernel/userspace.
However, set_phy_mode() relies upon the netdev having been initialised
to get at the plat_stmmacenet_data structure, which is probably why it
was placed after stmmac_drv_probe(). We can remove that need by storing
a pointer to struct plat_stmmacenet_data in struct socfpga_dwmac.
Move the call to set_phy_mode() before calling stmmac_dvr_probe().
This also simplifies the probe function as there is no need to
unregister the netdev if set_phy_mode() fails.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/E1u5Snn-001IJq-L0@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: stmmac: socfpga: convert to stmmac_pltfr_pm_ops
Convert socfpga to use the generic stmmac_pltfr_pm_ops, which can be
achieved by adding an appropriate plat_dat->init function to do the
setup.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/E1u5Sni-001IJk-Gi@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both the resume and probe path needs to configure the phy mode, so
provide a common function to do this which can later be hooked into
plat_dat->init.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/E1u5Snd-001IJe-Cx@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: stmmac: socfpga: init dwmac->stmmac_rst before registration
Initialisation/setup after registration is a bug. This is the first of
two patches fixing this in socfpga.
dwmac->stmmac_rst is initialised from the stmmac plat_dat's stmmac_rst
member, which is itself initialised by devm_stmmac_probe_config_dt().
Therefore, this can be initialised before we call stmmac_dvr_probe().
Move it there.
dwmac->stmmac_rst is used by the set_phy_mode() method.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/E1u5SnY-001IJY-90@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patchset marks the final step in converting users to the new
nlmsg_payload() function. It addresses the last two files that were not
converted in previous series, specifically updating the following
functions:
Simon Horman [Thu, 17 Apr 2025 10:28:23 +0000 (11:28 +0100)]
s390: ism: Pass string literal as format argument of dev_set_name()
GCC 14.2.0 reports that passing a non-string literal as the
format argument of dev_set_name() is potentially insecure.
drivers/s390/net/ism_drv.c: In function 'ism_probe':
drivers/s390/net/ism_drv.c:615:2: warning: format not a string literal and no format arguments [-Wformat-security]
615 | dev_set_name(&ism->dev, dev_name(&pdev->dev));
| ^~~~~~~~~~~~
It seems to me that as pdev is a PCIE device then the dev_name
call above should always return the device's BDF, e.g. 00:12.0.
That this should not contain format escape sequences. And thus
the current usage is safe.
But, it seems better to be safe than sorry. And, as a bonus, compiler
output becomes less verbose by addressing this issue.
Compile tested only.
No functional change intended.
Jakub Kicinski [Fri, 18 Apr 2025 23:49:42 +0000 (16:49 -0700)]
tools: ynl: add missing header deps
Various new families and my recent work on rtnetlink missed
adding dependencies on C headers. If the system headers are
up to date or don't include a given header at all this doesn't
make a difference. But if the system headers are in place but
stale - compilation will break.
Jakub Kicinski [Wed, 16 Apr 2025 20:08:40 +0000 (13:08 -0700)]
net: add UAPI to the header guard in various network headers
fib_rule, ip6_tunnel, and a whole lot of if_* headers lack the customary
_UAPI in the header guard. Without it YNL build can't protect from in tree
and system headers both getting included. YNL doesn't need most of these
but it's annoying to have to fix them one by one.
Note that header installation strips this _UAPI prefix so this should
result in no change to the end user.
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250416200840.1338195-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
trace: tcp: Add const qualifier to skb parameter in tcp_probe event
Change the tcp_probe tracepoint to accept a const struct sk_buff
parameter instead of a non-const one. This improves type safety and
better reflects that the skb is not modified within the tracepoint
implementation.
Mediatek doesn't make use of mac_interface, and none of the in-tree
DT files use the mac-mode property. Therefore, mac_interface already
follows phy_interface. Remove this unnecessary assignment.
net: stmmac: dwc-qos: use PHY clock-stop capability
Use the PHY clock-stop capability when programming the MAC LPI mode,
which allows the transmit clock to the PHY to be gated. Tested on the
Jetson Xavier NX platform.
net/mlx5e: ethtool: Fix formatting of ptp_rq0_csum_complete_tail_slow
The new GCC 15 warning -Wunterminated-string-initialization reports:
In file included from drivers/net/ethernet/mellanox/mlx5/core/en.h:55,
from drivers/net/ethernet/mellanox/mlx5/core/en_stats.c:34:
drivers/net/ethernet/mellanox/mlx5/core/en_stats.h:57:46: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
57 | #define MLX5E_DECLARE_PTP_RQ_STAT(type, fld) "ptp_rq%d_"#fld, offsetof(type, fld)
| ^~~~~~~~~~~
drivers/net/ethernet/mellanox/mlx5/core/en_stats.c:2279:11: note: in expansion of macro 'MLX5E_DECLARE_PTP_RQ_STAT'
2279 | { MLX5E_DECLARE_PTP_RQ_STAT(struct mlx5e_rq_stats, csum_complete_tail_slow) },
| ^~~~~~~~~~~~~~~~~~~~~~~~~
This stat string is being used in ethtool_sprintf(), so it must be a
valid NUL-terminated string. Currently the string lacks the final NUL
byte (as GCC warns), but by absolute luck, the next byte in memory is a
space (decimal 32) followed by a NUL. "format" is immediately followed
by little-endian size_t:
The use of vsnprintf(), within ethtool_sprintf(), reads past the end of
"format" and sees the format string as "ptp_rq%d_csum_complete_tail_slow ",
with %d getting resolved by MLX5E_PTP_CHANNEL_IX (value 0):
With an output result of "ptp_rq0_csum_complete_tail_slow", which gets
precisely truncated to 31 characters with a trailing NUL.
So, instead of accidentally getting this correct due to the NUL bytes
at the end of the size_t that happens to follow the format string, just
make the string initializer 1 byte shorter by replacing "%d" with "0",
since MLX5E_PTP_CHANNEL_IX is already hard-coded. This results in no
initializer truncation and no need to call sprintf().
net: ethtool: Adjust exactly ETH_GSTRING_LEN-long stats to use memcpy
Many drivers populate the stats buffer using C-String based APIs (e.g.
ethtool_sprintf() and ethtool_puts()), usually when building up the
list of stats individually (i.e. with a for() loop). This, however,
requires that the source strings be populated in such a way as to have
a terminating NUL byte in the source.
Other drivers populate the stats buffer directly using one big memcpy()
of an entire array of strings. No NUL termination is needed here, as the
bytes are being directly passed through. Yet others will build up the
stats buffer individually, but also use memcpy(). This, too, does not
need NUL termination of the source strings.
However, there are cases where the strings that populate the
source stats strings are exactly ETH_GSTRING_LEN long, and GCC
15's -Wunterminated-string-initialization option complains that the
trailing NUL byte has been truncated. This situation is fine only if the
driver is using the memcpy() approach. If the C-String APIs are used,
the destination string name will have its final byte truncated by the
required trailing NUL byte applied by the C-string API.
For drivers that are already using memcpy() but have initializers that
truncate the NUL terminator, mark their source strings as __nonstring to
silence the GCC warnings.
For drivers that have initializers that truncate the NUL terminator and
are using the C-String APIs, switch to memcpy() to avoid destination
string truncation and mark their source strings as __nonstring to silence
the GCC warnings. (Also introduce ethtool_cpy() as a helper to make this
an easy replacement).
Specifically the following warnings were investigated and addressed:
../drivers/net/ethernet/chelsio/cxgb/cxgb2.c:364:9: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
364 | "TxFramesAbortedDueToXSCollisions",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/ethernet/freescale/enetc/enetc_ethtool.c:165:33: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
165 | { ENETC_PM_R1523X(0), "MAC rx 1523 to max-octet packets" },
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/ethernet/freescale/enetc/enetc_ethtool.c:190:33: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
190 | { ENETC_PM_T1523X(0), "MAC tx 1523 to max-octet packets" },
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/ethernet/google/gve/gve_ethtool.c:76:9: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
76 | "adminq_dcfg_device_resources_cnt", "adminq_set_driver_parameter_cnt",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c:117:53: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
117 | STMMAC_STAT(ptp_rx_msg_type_pdelay_follow_up),
| ^
../drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c:46:12: note: in definition of macro 'STMMAC_STAT'
46 | { #m, sizeof_field(struct stmmac_extra_stats, m), \
| ^
../drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c:328:24: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
328 | .str = "a_mac_control_frames_transmitted",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c:340:24: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
340 | .str = "a_pause_mac_ctrl_frames_received",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Kees Cook <kees@kernel.org> Reviewed-by: Petr Machata <petrm@nvidia.com> # for mlxsw Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Link: https://patch.msgid.link/20250416010210.work.904-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
r8169: add RTL_GIGA_MAC_VER_LAST to facilitate adding support for new chip versions
Add a new mac_version enum value RTL_GIGA_MAC_VER_LAST. Benefit is that
when adding support for a new chip version we have to touch less code,
except something changes fundamentally.
Refactor chip version detection and merge both configuration tables.
Apart from reducing the code by a third, this paves the way for
merging chip version handling if only difference is the firmware.
Jakub Kicinski [Fri, 18 Apr 2025 01:41:42 +0000 (18:41 -0700)]
Merge branch 'net-stmmac-sunxi-cleanups'
Russell King says:
====================
net: stmmac: sunxi cleanups
This series cleans up the sunxi (sun7i) code in two ways:
1. it converts to use the new set_clk_tx_rate() method, even though
we don't use clk_tx_i. In doing so, I reformat the function to
read better, but with no changes to the code.
2. convert from stmmac_dvr_probe() to stmmac_pltfr_probe(), and then
to its devm variant, which allows code simplification.
====================
Using devm_stmmac_pltfr_probe() simplifies the probe function. This
will not only call plat_dat->init (sun7i_dwmac_init), but also
plat_dat->exit (sun7i_dwmac_exit) appropriately if stmmac_dvr_probe()
fails. This results in an overall simplification of the glue driver.
Rather than open-coding the calls to sun7i_gmac_init() and
sun7i_gmac_exit() in the probe function, use stmmac_pltfr_probe()
which will automatically call the plat_dat->init() and plat_dat->exit()
methods appropriately. This simplifies the code.
- ipv6: add exception routes to GC list in rt6_insert_exception()
- netfilter: conntrack: fix erroneous removal of offload bit
- Bluetooth:
- fix sending MGMT_EV_DEVICE_FOUND for invalid address
- l2cap: process valid commands in too long frame
- btnxpuart: Revert baudrate change in nxp_shutdown
Previous releases - always broken:
- ethtool: fix memory corruption during SFP FW flashing
- eth:
- hibmcge: fixes for link and MTU handling, pause frames etc
- igc: fixes for PTM (PCIe timestamping)
- dsa: b53: enable BPDU reception for management port
Misc:
- fixes for Netlink protocol schemas"
* tag 'net-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
net: ethernet: mtk_eth_soc: revise QDMA packet scheduler settings
net: ethernet: mtk_eth_soc: correct the max weight of the queue limit for 100Mbps
net: ethernet: mtk_eth_soc: reapply mdc divider on reset
net: ti: icss-iep: Fix possible NULL pointer dereference for perout request
net: ti: icssg-prueth: Fix possible NULL pointer dereference inside emac_xmit_xdp_frame()
net: ti: icssg-prueth: Fix kernel warning while bringing down network interface
netfilter: conntrack: fix erronous removal of offload bit
net: don't try to ops lock uninitialized devs
ptp: ocp: fix start time alignment in ptp_ocp_signal_set
net: dsa: avoid refcount warnings when ds->ops->tag_8021q_vlan_del() fails
net: dsa: free routing table on probe failure
net: dsa: clean up FDB, MDB, VLAN entries on unbind
net: dsa: mv88e6xxx: fix -ENOENT when deleting VLANs and MST is unsupported
net: dsa: mv88e6xxx: avoid unregistering devlink regions which were never registered
net: txgbe: fix memory leak in txgbe_probe() error path
net: bridge: switchdev: do not notify new brentries as changed
net: b53: enable BPDU reception for management port
netlink: specs: rt-neigh: prefix struct nfmsg members with ndm
netlink: specs: rt-link: adjust mctp attribute naming
netlink: specs: rtnetlink: attribute naming corrections
...
Martin KaFai Lau [Thu, 17 Apr 2025 17:50:56 +0000 (10:50 -0700)]
Merge branch 'bpf-qdisc'
Amery Hung says:
====================
bpf qdisc
Hi all,
This patchset aims to support implementing qdisc using bpf struct_ops.
This version takes a step back and only implements the minimum support
for bpf qdisc. 1) support of adding skb to bpf_list and bpf_rbtree
directly and 2) classful qdisc are deferred to future patchsets. In
addition, we only allow attaching bpf qdisc to root or mq for now.
This is to prevent accidentally breaking exisiting classful qdiscs
that rely on data in a child qdisc. This limit may be lifted in the
future after careful inspection.
* Overview *
This series supports implementing qdisc using bpf struct_ops. bpf qdisc
aims to be a flexible and easy-to-use infrastructure that allows users to
quickly experiment with different scheduling algorithms/policies. It only
requires users to implement core qdisc logic using bpf and implements the
mundane part for them. In addition, the ability to easily communicate
between qdisc and other components will also bring new opportunities for
new applications and optimizations.
* Performance of bpf qdisc *
This patchset includes two qdisc examples, bpf_fifo and bpf_fq, for
__testing__ purposes. For performance test, we compare selftests and their
kernel counterparts to give you a sense of the performance of qdisc
implemented in bpf.
The implementation of bpf_fq is fairly complex and slightly different from
fq so later we only compare the two fifo qdiscs. bpf_fq implements a
scheduling algorithm similar to fq before commit 29f834aa326e ("net_sched:
sch_fq: add 3 bands and WRR scheduling") was introduced. bpf_fifo uses a
single bpf_list as a queue instead of three queues for different
priorities in pfifo_fast. The time complexity of fifo however should be
similar since the queue selection time is negligible.
Looking at the two fifo qdiscs, the 6.4% drop in throughput in the bpf
implementatioin is consistent with previous observation (v8 throughput
test on a loopback device). This should be able to be mitigated by
supporting adding skb to bpf_list or bpf_rbtree directly in the future.
* Clean up skb in bpf qdisc during reset *
The current implementation relies on bpf qdisc implementors to correctly
release skbs in queues (bpf graphs or maps) in .reset, which might not be
a safe thing to do. The solution as Martin has suggested would be
supporting private data in struct_ops. This can also help simplifying
implementation of qdisc that works with mq. For examples, qdiscs in the
selftest mostly use global data. Therefore, even if user add multiple
qdisc instances under mq, they would still share the same queue.
====================
selftests/bpf: Test attaching bpf qdisc to mq and non root
Until we are certain that existing classful qdiscs work with bpf qdisc,
make sure we don't allow attaching a bpf qdisc to non root. Meanwhile,
attaching to mq is allowed.
This test implements a more sophisticated qdisc using bpf. The bpf fair-
queueing (fq) qdisc gives each flow an equal chance to transmit data. It
also respects the timestamp of skb for rate limiting.
Extend struct bpf_tc_hook with handle, qdisc name and a new attach type,
BPF_TC_QDISC, to allow users to add or remove any qdisc specified in
addition to clsact.
bpf: net_sched: Disable attaching bpf qdisc to non root
Do not allow users to attach bpf qdiscs to classful qdiscs. This is to
prevent accidentally breaking existings classful qdiscs if they rely on
some data in the child qdisc. This restriction can potentially be lifted
in the future. Note that, we still allow bpf qdisc to be attached to mq.
Add a watchdog timer to bpf qdisc. The watchdog can be used to schedule
the execution of qdisc through kfunc, bpf_qdisc_schedule(). It can be
useful for building traffic shaping scheduling algorithm, where the time
the next packet will be dequeued is known.
The implementation relies on struct_ops gen_prologue/epilogue to patch bpf
programs provided by users. Operator specific prologue/epilogue kfuncs
are introduced instead of watchdog kfuncs so that it is easier to extend
prologue/epilogue in the future (writing C vs BPF bytecode).
Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release
a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
in .enqueue where a to_free skb list is available from kernel to defer
the release. bpf_kfree_skb() should be used elsewhere. It is also used
in bpf_obj_free_fields() when cleaning up skb in maps and collections.
bpf_skb_get_hash() returns the flow hash of an skb, which can be used
to build flow-based queueing algorithms.
Finally, allow users to create read-only dynptr via bpf_dynptr_from_skb().
bpf: net_sched: Support implementation of Qdisc_ops in bpf
The recent advancement in bpf such as allocated objects, bpf list and bpf
rbtree has provided powerful and flexible building blocks to realize
sophisticated packet scheduling algorithms. As struct_ops now supports
core operators in Qdisc_ops, start allowing qdisc to be implemented using
bpf struct_ops with this patch. Users can implement Qdisc_ops.{enqueue,
dequeue, init, reset, destroy} in bpf and register the qdisc dynamically
into the kernel.
Co-developed-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Amery Hung <amery.hung@bytedance.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409214606.2000194-3-ameryhung@gmail.com
Merge tag 'for-linus-fwctl' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull fwctl fixes from Jason Gunthorpe:
"Three small changes from further build testing:
- Don't rely on the userspace uuid.h for the uapi header
- Fix sparse warnings in pds
- Typo in log message"
* tag 'for-linus-fwctl' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
fwctl: Fix repeated device word in log message
pds_fwctl: Fix type and endian complaints
fwctl/cxl: Fix uuid_t usage in uapi
Merge tag 'sound-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of small fixes. All are device-specific like quirks, new
IDs, and other safe (or rather boring) changes"
* tag 'sound-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
firmware: cs_dsp: test_bin_error: Fix uninitialized data used as fw version
ASoC: codecs: Add of_match_table for aw888081 driver
ASoC: fsl: fsl_qmc_audio: Reset audio data pointers on TRIGGER_START event
mailmap: Add entry for Srinivas Kandagatla
MAINTAINERS: use kernel.org alias
ASoC: cs42l43: Reset clamp override on jack removal
ALSA: hda/realtek - Fixed ASUS platform headset Mic issue
ALSA: hda/cirrus_scodec_test: Don't select dependencies
ALSA: azt2320: Replace deprecated strcpy() with strscpy()
ASoC: hdmi-codec: use RTD ID instead of DAI ID for ELD entry
ASoC: Intel: avs: Constrain path based on BE capabilities
ALSA: hda/tas2781: Remove unnecessary NULL check before release_firmware()
ASoC: Intel: avs: Fix null-ptr-deref in avs_component_probe()
ASoC: fsl_asrc_dma: get codec or cpu dai from backend
ASoC: qcom: Fix sc7280 lpass potential buffer overflow
ASoC: dwc: always enable/disable i2s irqs
ASoC: Intel: sof_sdw: Add quirk for Asus Zenbook S16
ASoC: codecs:lpass-wsa-macro: Fix logic of enabling vi channels
ASoC: codecs:lpass-wsa-macro: Fix vi feedback rate
Merge tag 'platform-drivers-x86-v6.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform drivers fixes from Ilpo Järvinen:
"Fixes:
- amd/pmf: Fix STT limits
- asus-laptop: Fix an uninitialized variable
- intel_pmc_ipc: Allow building without ACPI
- mlxbf-bootctl: Use sysfs_emit_at() in secure_boot_fuse_state_show()
- msi-wmi-platform: Add locking to workaround ACPI firmware bug
New HW support:
- alienware-wmi-wmax:
- Extended thermal control support to:
- Alienware Area-51m R2
- Alienware m16 R1
- Alienware m16 R2
- Dell G16 7630
- Dell G5 5505 SE
- G-Mode support to Alienware m16 R1
- x86-android-tablets: Add Vexia Edu Atla 10 tablet 5V data"
* tag 'platform-drivers-x86-v6.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: msi-wmi-platform: Workaround a ACPI firmware bug
platform/x86: msi-wmi-platform: Rename "data" variable
platform/x86: alienware-wmi-wmax: Extend support to more laptops
platform/x86: alienware-wmi-wmax: Add G-Mode support to Alienware m16 R1
platform/x86: amd: pmf: Fix STT limits
mlxbf-bootctl: use sysfs_emit_at() in secure_boot_fuse_state_show()
platform/x86: x86-android-tablets: Add Vexia Edu Atla 10 tablet 5V data
platform/x86: x86-android-tablets: Add "9v" to Vexia EDU ATLA 10 tablet symbols
asus-laptop: Fix an uninitialized variable
platform/x86: intel_pmc_ipc: add option to build without ACPI
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Small drivers fixes, except for ufs which has two large updates, one
for exposing the device level feature, which is a new addition to the
device spec and the other reworking the exynos driver to fix coherence
issues on some android phones"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: megaraid_sas: Driver version update to 07.734.00.00-rc1
scsi: megaraid_sas: Block zero-length ATA VPD inquiry
scsi: scsi_transport_srp: Replace min/max nesting with clamp()
scsi: ufs: core: Add device level exception support
scsi: ufs: core: Rename ufshcd_wb_presrv_usrspc_keep_vcc_on()
scsi: smartpqi: Use is_kdump_kernel() to check for kdump
scsi: pm80xx: Set phy_attached to zero when device is gone
scsi: ufs: exynos: gs101: Put UFS device in reset on .suspend()
scsi: ufs: exynos: Move phy calls to .exit() callback
scsi: ufs: exynos: Enable PRDT pre-fetching with UFSHCD_CAP_CRYPTO
scsi: ufs: exynos: Ensure consistent phy reference counts
scsi: ufs: exynos: Disable iocc if dma-coherent property isn't set
scsi: ufs: exynos: Move UFS shareability value to drvdata
scsi: ufs: exynos: Ensure pre_link() executes before exynos_ufs_phy_init()
scsi: iscsi: Fix missing scsi_host_put() in error path
scsi: ufs: core: Fix a race condition related to device commands
scsi: hisi_sas: Fix I/O errors caused by hardware port ID changes
scsi: hisi_sas: Enable force phy when SATA disk directly connected
Merge tag 'ata-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fix from Damien Le Moal:
- Fix how sense data from the sense data for successfull NCQ commands
log page is used to fully initialize the result_tf of a completed
command, so that the sense data returned to the scsi layer is fully
initialized with all the device provided information (from Niklas)
* tag 'ata-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: libata-sata: Save all fields from sense data descriptor
Merge tag 'xfs-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS fixes from Carlos Maiolino:
"This mostly includes fixes and documentation for the zoned allocator
feature merged during previous merge window, but it also adds a sysfs
tunable for the zone garbage collector.
There is also a fix for a regression to the RT device that we'd like
to fix ASAP now that we're getting more users on the RT zoned
allocator"
* tag 'xfs-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: document zoned rt specifics in admin-guide
xfs: fix fsmap for internal zoned devices
xfs: Fix spelling mistake "drity" -> "dirty"
xfs: compute buffer address correctly in xmbuf_map_backing_mem
xfs: add tunable threshold parameter for triggering zone GC
xfs: mark xfs_buf_free as might_sleep()
xfs: remove the leftover xfs_{set,clear}_li_failed infrastructure
Merge tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- handle encoded read ioctl returning EAGAIN so it does not mistakenly
free the work structure
- escape subvolume path in mount option list so it cannot be wrongly
parsed when the path contains ","
- remove folio size assertions when writing super block to device with
enabled large folios
* tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: remove folio order ASSERT()s in super block writeback path
btrfs: correctly escape subvol in btrfs_show_options()
btrfs: ioctl: don't free iov when btrfs_encoded_read() returns -EAGAIN
Merge tag 'slab-for-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fix from Vlastimil Babka:
- Stable fix adding zero initialization of slab->obj_ext to prevent
crashes with allocation profiling (Suren Baghdasaryan)
* tag 'slab-for-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
slab: ensure slab->obj_exts is clear in a newly allocated slab page
The QDMA packet scheduler suffers from a performance issue.
Fix this by picking up changes from MediaTek's SDK which change to use
Token Bucket instead of Leaky Bucket and fix the SPEED_1000 configuration.
net: ethernet: mtk_eth_soc: reapply mdc divider on reset
In the current method, the MDC divider was reset to the default setting
of 2.5MHz after the NETSYS SER. Therefore, we need to reapply the MDC
divider configuration function in mtk_hw_init() after reset.
Paolo Abeni [Thu, 17 Apr 2025 13:20:41 +0000 (15:20 +0200)]
Merge tag 'nf-25-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fix for net
The following batch contains one Netfilter fix for net:
1) conntrack offload bit is erroneously unset in a race scenario,
from Florian Westphal.
netfilter pull request 25-04-17
* tag 'nf-25-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: conntrack: fix erronous removal of offload bit
====================
Paolo Abeni [Thu, 17 Apr 2025 11:08:41 +0000 (13:08 +0200)]
Merge tag 'for-net-2025-04-16' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- l2cap: Process valid commands in too long frame
- vhci: Avoid needless snprintf() calls
* tag 'for-net-2025-04-16' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: vhci: Avoid needless snprintf() calls
Bluetooth: l2cap: Process valid commands in too long frame
====================
Fix checkpatch detected code style errors/warnings detected in
the file net/core/pktgen.c (remaining checkpatch checks will be addressed
in a follow up patch set).
====================
====================
Mitigate double allocations in ioam6_iptunnel
Commit dce525185bc9 ("net: ipv6: ioam6_iptunnel: mitigate 2-realloc
issue") fixed the double allocation issue in ioam6_iptunnel. However,
since commit 92191dd10730 ("net: ipv6: fix dst ref loops in rpl, seg6
and ioam6 lwtunnels"), the fix was left incomplete. Because the cache is
now empty when the dst_entry is the same post transformation in order to
avoid a reference loop, the double reallocation is back for such cases
(e.g., inline mode) which are valid for IOAM. This patch provides a way
to detect such cases without having a reference loop in the cache, and
so to avoid the double reallocation issue for all cases again.
If the dst_entry is the same post transformation (which is a valid use
case for IOAM), we don't add it to the cache to avoid a reference loop.
Instead, we use a "fake" dst_entry and add it to the cache as a signal.
When we read the cache, we compare it with our "fake" dst_entry and
therefore detect if we're in the special case.
Be consistent and use the same terminology as other lwt users: orig_dst
is the dst_entry before the transformation, while dst is either the
dst_entry in the cache or the dst_entry after the transformation
====================
Introducing OpenVPN Data Channel Offload
Notable changes since v25:
* removed netdev notifier (was only used for our own devices)
* added .dellink implementation to address what was previously
done in notifier
* removed .ndo_open and moved netif_carrier_off() call to .ndo_init
* fixed author in MODULE_AUTHOR()
* properly indented checks in ovpn.yaml
* switched from TSTATS to DSTATS
* removed obsolete comment in ovpn_socket_new()
* removed unrelated hunk in ovpn_socket_new()
testing/selftests: add test tool and scripts for ovpn module
The ovpn-cli tool can be compiled and used as selftest for the ovpn
kernel module.
[NOTE: it depends on libmedtls for decoding base64-encoded keys]
ovpn-cli implements the netlink and RTNL APIs and can thus be integrated
in any script for more automated testing.
Along with the tool, a bunch of scripts are provided that perform basic
functionality tests by means of network namespaces.
These scripts take part to the kselftest automation.
The output of the scripts, which will appear in the kselftest
reports, is a list of steps performed by the scripts plus some
output coming from the execution of `ping`, `iperf` and `ovpn-cli`
itself.
In general it is useful only in case of failure, in order to
understand which step has failed and why.
Please note: since peer sockets are tied to the userspace
process that created them (i.e. exiting the process will result
in closing the socket), every run of ovpn-cli that created
one will go to background and enter pause(), waiting for the
signal which will allow it to terminate.
Termination is accomplished at the end of each script by
issuing a killall command.
Whenever a peer is deleted, send a notification to userspace so that it
can react accordingly.
This is most important when a peer is deleted due to ping timeout,
because it all happens in kernelspace and thus userspace has no direct
way to learn about it.
ovpn: kill key and notify userspace in case of IV exhaustion
IV wrap-around is cryptographically dangerous for a number of ciphers,
therefore kill the key and inform userspace (via netlink) should the
IV space go exhausted.
Userspace has two ways of deciding when the key has to be renewed before
exhausting the IV space:
1) time based approach:
after X seconds/minutes userspace generates a new key and sends it
to the kernel. This is based on guestimate and normally default
timer value works well.
2) packet count based approach:
after X packets/bytes userspace generates a new key and sends it to
the kernel. Userspace keeps track of the amount of traffic by
periodically polling GET_PEER and fetching the VPN/LINK stats.
ovpn: implement peer add/get/dump/delete via netlink
This change introduces the netlink command needed to add, delete and
retrieve/dump known peers. Userspace is expected to use these commands
to handle known peer lifecycles.
In a multi-peer scenario there are a number of situations when a
specific peer needs to be looked up.
We may want to lookup a peer by:
1. its ID
2. its VPN destination IP
3. its transport IP/port couple
For each of the above, there is a specific routing table referencing all
peers for fast look up.
Case 2. is a bit special in the sense that an outgoing packet may not be
sent to the peer VPN IP directly, but rather to a network behind it. For
this reason we first perform a nexthop lookup in the system routing
table and then we use the retrieved nexthop as peer search key.
With this change ovpn is allowed to communicate to peers also via TCP.
Parsing of incoming messages is implemented through the strparser API.
Note that ovpn redefines sk_prot and sk_socket->ops for the TCP socket
used to communicate with the peer.
For this reason it needs to access inet6_stream_ops, which is declared
as extern in the IPv6 module, but it is not fully exported.
Therefore this patch is also adding EXPORT_SYMBOL_GPL(inet6_stream_ops)
to net/ipv6/af_inet6.c.
Cc: David Ahern <dsahern@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Link: https://patch.msgid.link/20250415-b4-ovpn-v26-11-577f6097b964@openvpn.net Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
An ovpn_peer object holds the whole status of a remote peer
(regardless whether it is a server or a client).
This includes status for crypto, tx/rx buffers, napi, etc.
Only support for one peer is introduced (P2P mode).
Multi peer support is introduced with a later patch.
Along with the ovpn_peer, also the ovpn_bind object is introcued
as the two are strictly related.
An ovpn_bind object wraps a sockaddr representing the local
coordinates being used to talk to a specific peer.
Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Link: https://patch.msgid.link/20250415-b4-ovpn-v26-5-577f6097b964@openvpn.net Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This commit introduces basic netlink support with family
registration/unregistration functionalities and stub pre/post-doit.
More importantly it introduces the YAML uAPI description along
with its auto-generated files:
- include/uapi/linux/ovpn.h
- drivers/net/ovpn/netlink-gen.c
- drivers/net/ovpn/netlink-gen.h
Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Link: https://patch.msgid.link/20250415-b4-ovpn-v26-2-577f6097b964@openvpn.net Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net: introduce OpenVPN Data Channel Offload (ovpn)
OpenVPN is a userspace software existing since around 2005 that allows
users to create secure tunnels.
So far OpenVPN has implemented all operations in userspace, which
implies several back and forth between kernel and user land in order to
process packets (encapsulate/decapsulate, encrypt/decrypt, rerouting..).
With `ovpn` we intend to move the fast path (data channel) entirely
in kernel space and thus improve user measured throughput over the
tunnel.
`ovpn` is implemented as a simple virtual network device driver, that
can be manipulated by means of the standard RTNL APIs. A device of kind
`ovpn` allows only IPv4/6 traffic and can be of type:
* P2P (peer-to-peer): any packet sent over the interface will be
encapsulated and transmitted to the other side (typical OpenVPN
client or peer-to-peer behaviour);
* P2MP (point-to-multipoint): packets sent over the interface are
transmitted to peers based on existing routes (typical OpenVPN
server behaviour).
After the interface has been created, OpenVPN in userspace can
configure it using a new Netlink API. Specifically it is possible
to manage peers and their keys.
The OpenVPN control channel is multiplexed over the same transport
socket by means of OP codes. Anything that is not DATA_V2 (OpenVPN
OP code for data traffic) is sent to userspace and handled there.
This way the `ovpn` codebase is kept as compact as possible while
focusing on handling data traffic only (fast path).
Any OpenVPN control feature (like cipher negotiation, TLS handshake,
rekeying, etc.) is still fully handled by the userspace process.
When userspace establishes a new connection with a peer, it first
performs the handshake and then passes the socket to the `ovpn` kernel
module, which takes ownership. From this moment on `ovpn` will handle
data traffic for the new peer.
When control packets are received on the link, they are forwarded to
userspace through the same transport socket they were received on, as
userspace is still listening to them.
Some events (like peer deletion) are sent to a Netlink multicast group.
Although it wasn't easy to convince the community, `ovpn` implements
only a limited number of the data-channel features supported by the
userspace program.
Each feature that made it to `ovpn` was attentively vetted to
avoid carrying too much legacy along with us (and to give a clear cut to
old and probalby-not-so-useful features).
Notably, only encryption using AEAD ciphers (specifically
ChaCha20Poly1305 and AES-GCM) was implemented. Supporting any other
cipher out there was not deemed useful.
Both UDP and TCP sockets are supported.
As explained above, in case of P2MP mode, OpenVPN will use the main system
routing table to decide which packet goes to which peer. This implies
that no routing table was re-implemented in the `ovpn` kernel module.
This kernel module can be enabled by selecting the CONFIG_OVPN entry
in the networking drivers section.
NOTE: this first patch introduces the very basic framework only.
Features are then added patch by patch, however, although each patch
will compile and possibly not break at runtime, only after having
applied the full set it is expected to see the ovpn module fully working.
====================
Bug fixes from XDP and perout series
This patch series consists of bug fixes from the XDP series:
1. Fixes a kernel warning that occurs when bringing down the
network interface.
2. Resolves a potential NULL pointer dereference in the
emac_xmit_xdp_frame() function.
3. Resolves a potential NULL pointer dereference in the
icss_iep_perout_enable() function
net: ti: icss-iep: Fix possible NULL pointer dereference for perout request
The ICSS IEP driver tracks perout and pps enable state with flags.
Currently when disabling pps and perout signals during icss_iep_exit(),
results in NULL pointer dereference for perout.
To fix the null pointer dereference issue, the icss_iep_perout_enable_hw
function can be modified to directly clear the IEP CMP registers when
disabling PPS or PEROUT, without referencing the ptp_perout_request
structure, as its contents are irrelevant in this case.
Fixes: 9b115361248d ("net: ti: icssg-prueth: Fix clearing of IEP_CMP_CFG registers during iep_init") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/7b1c7c36-363a-4085-b26c-4f210bee1df6@stanley.mountain/ Signed-off-by: Meghana Malladi <m-malladi@ti.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250415090543.717991-4-m-malladi@ti.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net: ti: icssg-prueth: Fix possible NULL pointer dereference inside emac_xmit_xdp_frame()
There is an error check inside emac_xmit_xdp_frame() function which
is called when the driver wants to transmit XDP frame, to check if
the allocated tx descriptor is NULL, if true to exit and return
ICSSG_XDP_CONSUMED implying failure in transmission.
In this case trying to free a descriptor which is NULL will result
in kernel crash due to NULL pointer dereference. Fix this error handling
and increase netdev tx_dropped stats in the caller of this function
if the function returns ICSSG_XDP_CONSUMED.
Fixes: 62aa3246f462 ("net: ti: icssg-prueth: Add XDP support") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/70d8dd76-0c76-42fc-8611-9884937c82f5@stanley.mountain/ Signed-off-by: Meghana Malladi <m-malladi@ti.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Roger Quadros <rogerq@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250415090543.717991-3-m-malladi@ti.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>