Jacob Keller [Thu, 20 Nov 2025 20:20:42 +0000 (12:20 -0800)]
ice: pass pointer to ice_fetch_u64_stats_per_ring
The ice_fetch_u64_stats_per_ring function takes a pointer to the syncp from
the ring stats to synchronize reading of the packet stats. It also takes a
*copy* of the ice_q_stats fields instead of a pointer to the stats. This
completely defeats the point of using the u64_stats API. We pass the stats
by value, so they are static at the point of reading within the
u64_stats_fetch_retry loop.
Simplify the function to take a pointer to the ice_ring_stats instead of
two separate parameters. Additionally, since we never call this outside of
ice_main.c, make it a static function.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Queue and vector resources for a given vport, are stored in the
idpf_vport structure. At the time of configuration, these
resources are accessed using vport pointer. Meaning, all the
config path functions are tied to the default queue and vector
resources of the vport.
There are use cases which can make use of config path functions
to configure queue and vector resources that are not tied to any
vport. One such use case is PTP secondary mailbox creation
(it would be in a followup series). To configure queue and interrupt
resources for such cases, we can make use of the existing config
infrastructure by passing the necessary queue and vector resources info.
To achieve this, group the existing queue and vector resources into
default resource group and refactor the code to pass the resource
pointer to the config path functions.
This series also includes patches which generalizes the send virtchnl
message APIs and mailbox API that are necessary for the implementation
of PTP secondary mailbox.
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
idpf: generalize mailbox API
idpf: avoid calling get_rx_ptypes for each vport
idpf: generalize send virtchnl message API
idpf: remove vport pointer from queue sets
idpf: add rss_data field to RSS function parameters
idpf: reshuffle idpf_vport struct members to avoid holes
idpf: move some iterator declarations inside for loops
idpf: move queue resources to idpf_q_vec_rsrc structure
idpf: introduce idpf_q_vec_rsrc struct and move vector resources to it
idpf: introduce local idpf structure to store virtchnl queue chunks
====================
net: usb: sr9700: use ETH_ALEN instead of magic number
The driver hardcodes the number 6 as the number of bytes to write to
the SR_PAR register, which stores the MAC address. Use ETH_ALEN instead
to make the code clearer.
Andy Roulin and Francesco Ruggeri have apparently independently both hit an
issue with the current neighbor notification scheme. Francesco reported the
issue in [1]. In a response[2] to that report, Andy said:
neigh_update sends a rtnl notification if an update, e.g.,
nud_state change, was done but there is no guarantee of
ordering of the rtnl notifications. Consider the following
scenario:
userspace thread kernel thread
================ =============
neigh_update
write_lock_bh(n->lock)
n->nud_state = STALE
write_unlock_bh(n->lock)
neigh_notify
neigh_fill_info
read_lock_bh(n->lock)
ndm->nud_state = STALE
read_unlock_bh(n->lock)
-------------------------->
neigh:update
write_lock_bh(n->lock)
n->nud_state = REACHABLE
write_unlock_bh(n->lock)
neigh_notify
neigh_fill_info
read_lock_bh(n->lock)
ndm->nud_state = REACHABLE
read_unlock_bh(n->lock)
rtnl_nofify
RTNL REACHABLE sent
<--------
rtnl_notify
RTNL STALE sent
In this scenario, the kernel neigh is updated first to STALE and
then REACHABLE but the netlink notifications are sent out of order,
first REACHABLE and then STALE.
The solution presented in [2] was to extend the critical region to include
both the call to neigh_fill_info(), as well as rtnl_notify(). Then we have
a guarantee that whatever state was captured by neigh_fill_info(), will be
sent right away. The above scenario can thus not happen.
This is how this patchset begins: patches #1 and #2 add helper duals to
neigh_fill_info() and __neigh_notify() such that the __-prefixed function
assumes the neighbor lock is held, and the unprefixed one is a thin wrapper
that manages locking. This extends locking further than Andy's patch, but
makes for a clear code and supports the following part.
At that point, the original race is gone. But what can happen is the
following race, where the notification does not reflect the change that was
made:
Here, even though neigh_update() made a change to STALE, it later sends a
notification with a NUD of REACHABLE. The obvious solution to fix this race
is to move the notifier to the same critical section that actually makes
the change.
Sending a notification in fact involves two things: invoking the internal
notifier chain, and sending the netlink notification. The overall approach
in this patchset is to move the netlink notification to the critical
section of the change, while keeping the internal notifier intact. Since
the motion is not obviously correct, the patchset presents the change in
series of incremental steps with discussion in commit messages. Please see
details in the patches themselves.
Reproducer
==========
To consistently reproduce, I injected an mdelay before the rtnl_notify()
call. Since only one thread should delay, a bit of instrumentation was
needed to see where the call originates. The mdelay was then only issued on
the call stack rooted in the RTNL request.
Then the general idea is to issue an "ip neigh replace" to mark a neighbor
entry as failed. In parallel to that, inject an ARP burst that validates
the entry. This is all observed with an "ip monitor neigh", where one can
see either a REACHABLE->FAILED transition, or FAILED->REACHABLE, while the
actual state at the end of the sequence is always REACHABLE.
With the patchset, only FAILED->REACHABLE is ever observed in the monitor.
Alternatives
============
Another approach to solving the issue would be to have a per-neighbor queue
of notification digests, each with a set of fields necessary for formatting
a notification. In pseudocode, a neighbor update would look something like
this:
neighbor_update:
- lock
- do update
- allocate notification digest, fill partially, mark not-committed
- unlock
- critical-section-breaking stuff (probes, ARP Q, etc.)
- lock
- fill in missing details to the digest (notably neigh->probes)
- mark the digest as committed
- while (front of the digest queue is committed)
- pop it, convert to notifier, send the notification
- unlock
This adds more complexity and would imply more changes to the code, which
is why I think the approach presented in this patchset is better. But it
would allow us to retain the overall structure of the code while giving us
accurate notifications.
A third approach would be to consider the second race not very serious and
be OK with seeing a notification that does not reflect the change that
prompted it. Then a two-patch prefix of this patchset would be all that is
needed.
Petr Machata [Wed, 21 Jan 2026 16:43:42 +0000 (17:43 +0100)]
net: core: neighbour: Make another netlink notification atomically
Similarly to the issue from the previous patch, neigh_timer_handler() also
updates the neighbor separately from formatting and sending the netlink
notification message. We have not seen reports to the effect of this
causing trouble, but in theory, the same sort of issues could have come up:
neigh_timer_handler() would make changes as necessary, but before
formatting and sending a notification, is interrupted before sending by
another thread, which makes a parallel change and sends its own message.
The message send that is prompted by an earlier change thus contains
information that does not reflect the change having been made.
To solve this, the netlink notification needs to be in the same critical
section that updates the neighbor. The critical section is ended by the
neigh_probe() call which drops the lock before calling solicit. Stretching
the critical section over the solicit call is problematic, because that can
then involved all sorts of forwarding callbacks. Therefore, like in the
previous patch, split the netlink notification away from the internal one
and move it ahead of the probe call.
Petr Machata [Wed, 21 Jan 2026 16:43:41 +0000 (17:43 +0100)]
net: core: neighbour: Make one netlink notification atomically
As noted in a previous patch, one race remains in the current code. A
kernel thread might interrupt a userspace thread after the change is done,
but before formatting and sending the message. Then what we would see is
two messages with the same contents:
The solution is to send the netlink message inside the critical section
where the neighbor is changed, so that it reflects the notified-upon
neighbor state.
To that end, in __neigh_update(), move the current neigh_notify() call up
to said critical section, and convert it to __neigh_notify(), because the
lock is held. This motion crosses calls to neigh_update_managed_list(),
neigh_update_gc_list() and neigh_update_process_arp_queue(), all of which
potentially unlock and give an opportunity for the above race.
This also crosses a call to neigh_update_process_arp_queue() which calls
neigh->output(), which might be neigh_resolve_output() calls
neigh_event_send() calls neigh_event_send_probe() calls
__neigh_event_send() calls neigh_probe(), which touches neigh->probes,
an update which will now not be visible in the notification.
However, there is indication that there is no promise that these changes
will be accurately projected to notifications: fib6_table_lookup()
indirectly calls route.c's find_match() calls rt6_probe(), which looks up a
neighbor and call __neigh_set_probe_once(), which sets neigh->probes to 0,
but neither this nor the caller seems to send a notification.
Additionally, the neighbor object that the neigh_probe() mentioned above is
called on, might be the alternative neighbor looked up for the ARP queue
packet destination. If that is the case, the changed value of n1->probes is
not notified anywhere.
So at least in some circumstances, the reported number of probes needs to
be assumed to change without notification.
The netlink message needs to be send inside the critical section where the
neighbor is changed, so that it reflects the notified-upon neighbor state.
On the other hand, there is no such need in case of notifier chain: the
listeners do not assume lock, and often in fact just schedule a delayed
work to act on the neighbor later. At least one in fact also takes the
neighbor lock.
This requires that the netlink notification be done before the internal
notifier chain message is sent. That is safe to do, because the current
listeners, as well as __neigh_notify(), only read the updated neighbor
fields, and never modify them. (Apart from locking.)
The obvious idea behind the helper is to keep together the two bits that
should be done either both or neither: the internal notifier chain message,
and the netlink notification.
To make sure that the notification sent reflects the change being made, the
netlink message needs to be send inside the critical section where the
neighbor is changed. But for the notifier chain, there is no such need: the
listeners do not assume lock, and often in fact just schedule a delayed
work to act on the neighbor later. At least one in fact also takes the
neighbor lock. Therefore these two items have each different locking needs.
Now we could unlock inside the helper, but I find that error prone, and the
fact that the notification is conditional in the first place does not help
to make the call site obvious.
So in this patch, the helper is instead removed and the body, which is just
these two calls, inlined. That way we can use each notifier independently.
Petr Machata [Wed, 21 Jan 2026 16:43:38 +0000 (17:43 +0100)]
net: core: neighbour: Process ARP queue later
ARP queue processing unlocks the neighbor lock, which can allow another
thread to asynchronously perform a neighbor update and send an out of order
notification. Therefore this needs to be done after the notification is
sent.
Move it just before the end of the critical section. Since
neigh_update_process_arp_queue() unlocks, it does not form a part of the
critical section anymore but it can benefit from the lock being taken. The
intention is to eventually do the RTNL notification before this call.
This motion crosses a call to neigh_update_is_router(), which should not
influence processing of the ARP queue.
Petr Machata [Wed, 21 Jan 2026 16:43:36 +0000 (17:43 +0100)]
net: core: neighbour: Call __neigh_notify() under a lock
Andy Roulin has described an issue with the current neighbor notification
scheme as follows. This was also presented publicly at the link below.
neigh_update sends a rtnl notification if an update, e.g.,
nud_state change, was done but there is no guarantee of
ordering of the rtnl notifications. Consider the following
scenario:
userspace thread kernel thread
================ =============
neigh_update
write_lock_bh(n->lock)
n->nud_state = STALE
write_unlock_bh(n->lock)
neigh_notify
neigh_fill_info
read_lock_bh(n->lock)
ndm->nud_state = STALE
read_unlock_bh(n->lock)
-------------------------->
neigh:update
write_lock_bh(n->lock)
n->nud_state = REACHABLE
write_unlock_bh(n->lock)
neigh_notify
neigh_fill_info
read_lock_bh(n->lock)
ndm->nud_state = REACHABLE
read_unlock_bh(n->lock)
rtnl_nofify
RTNL REACHABLE sent
<--------
rtnl_notify
RTNL STALE sent
In this scenario, the kernel neigh is updated first to STALE and
then REACHABLE but the netlink notifications are sent out of order,
first REACHABLE and then STALE.
The solution is to send the netlink message inside the same critical
section that formats the message. That way both the contents and ordering
of the message reflect the same state, and we cannot see the abovementioned
out-of-order delivery.
Even with this patch, a remaining issue that the contents of the message
may not reflect the changes made to the neighbor. A kernel thread might
still interrupt a userspace thread after the change is done, but before
formatting and sending the message. Then what we would see is two messages
with the same contents. The following patches will attempt to address that
issue.
To support those future patches, convert __neigh_notify() to a helper that
assumes that the neighbor lock is already taken by having it call
__neigh_fill_info() instead of neigh_fill_info(). Add a new helper,
neigh_notify(), which takes the lock before calling __neigh_notify().
Migrate all callers to use the latter.
Petr Machata [Wed, 21 Jan 2026 16:43:35 +0000 (17:43 +0100)]
net: core: neighbour: Add a neigh_fill_info() helper for when lock not held
The netlink message needs to be formatted and sent inside the critical
section where the neighbor is changed, so that it reflects the
notified-upon neighbor state. Because it will happen inside an already
existing critical section, it has to assume that the neighbor lock is held.
Add a helper __neigh_fill_info(), which is like neigh_fill_info(), but
makes this assumption. Convert neigh_fill_info() to a wrapper around this
new helper.
Eric Dumazet [Thu, 22 Jan 2026 16:50:49 +0000 (16:50 +0000)]
ipvlan: remove ipvlan_ht_addr_lookup()
ipvlan_ht_addr_lookup() is called four times and not inlined.
Split it to ipvlan_ht_addr_lookup6() and ipvlan_ht_addr_lookup4()
and rework ipvlan_addr_lookup() to call these helpers once,
so that they are (auto)inlined.
After this change, ipvlan_addr_lookup() is faster, and we save
350 bytes of text on x86_64.
Eric Dumazet [Thu, 22 Jan 2026 04:57:18 +0000 (04:57 +0000)]
net: inline net_is_devmem_iov()
1) Inline this small helper to reduce code size and decrease cpu costs.
2) Constify its argument.
3) Move it to include/net/netmem.h, as a prereq for the following patch.
On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. memcpy() should not be considered atomic
against u64 values. Use u64_stats_copy() instead.
====================
David Yang [Tue, 20 Jan 2026 09:21:32 +0000 (17:21 +0800)]
vxlan: vnifilter: fix memcpy with u64_stats
On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. memcpy() should not be considered atomic
against u64 values. Use u64_stats_copy() instead.
David Yang [Tue, 20 Jan 2026 09:21:31 +0000 (17:21 +0800)]
macsec: fix memcpy with u64_stats
On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. memcpy() should not be considered atomic
against u64 values. Use u64_stats_copy() instead.
David Yang [Tue, 20 Jan 2026 09:21:30 +0000 (17:21 +0800)]
net: bridge: mcast: fix memcpy with u64_stats
On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. memcpy() should not be considered atomic
against u64 values. Use u64_stats_copy() instead.
David Yang [Tue, 20 Jan 2026 09:21:29 +0000 (17:21 +0800)]
u64_stats: Introduce u64_stats_copy()
The following (anti-)pattern was observed in the code tree:
do {
start = u64_stats_fetch_begin(&pstats->syncp);
memcpy(&temp, &pstats->stats, sizeof(temp));
} while (u64_stats_fetch_retry(&pstats->syncp, start));
On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing, especially for memcpy(), for which arches may
provide their highly-optimized implements.
In theory the affected code should convert to u64_stats_t, or use
READ_ONCE()/WRITE_ONCE() properly.
However since there are needs to copy chunks of statistics, instead of
writing loops at random places, we provide a safe memcpy() variant for
u64_stats.
====================
net/rds: RDS-TCP state machine and message loss improvements
This is subset 2 of the larger RDS-TCP patch series I posted last
Oct. The greater series aims to correct multiple rds-tcp issues that
can cause dropped or out of sequence messages. I've broken it down into
smaller sets to make reviews more manageable.
In this set, we correct a few RDS/TCP connection handling issues, and
message loss issues.
====================
Gerd Rausch [Thu, 22 Jan 2026 05:52:13 +0000 (22:52 -0700)]
net/rds: rds_tcp_accept_one ought to not discard messages
RDS/TCP differs from RDS/RDMA in that message acknowledgment
is done based on TCP sequence numbers:
As soon as the last byte of a message has been acknowledged
by the TCP stack of a peer, "rds_tcp_write_space()" goes on
to discard prior messages from the send queue.
Which is fine, for as long as the receiver never throws any messages away.
Unfortunately, that is *not* the case since the introduction of MPRDS:
commit 1a0e100fb2c96 "RDS: TCP: Enable multipath RDS for TCP"
A new function "rds_tcp_accept_one_path" was introduced,
which is entitled to return "NULL", if no connection path is currently
available.
Unfortunately, this happens after the "->accept()" call, and the new socket
often already contains messages, since the peer already transitioned
to "RDS_CONN_UP" on behalf of "TCP_ESTABLISHED".
That's also the case after this [1]:
commit 1a0e100fb2c96 "RDS: TCP: Force every connection to be initiated by
numerically smaller IP address"
which tried to address the situation of pending data by only transitioning
connections from a smaller IP address to "RDS_CONN_UP".
But even in those cases, and in particular if the "RDS_EXTHDR_NPATHS"
handshake has not occurred yet, and therefore we're working with
"c_npaths <= 1", "c_conn[0]" may be in a state distinct from
"RDS_CONN_DOWN", and therefore all messages on the just accepted socket
will be tossed away.
This fix changes "rds_tcp_accept_one":
* If connected from a peer with a larger IP address, the new socket
will continue to get closed right away.
With commit [1] above, there should not be any messages
in the socket receive buffer, since the peer never transitioned
to "RDS_CONN_UP".
Therefore it should be okay to not make any efforts to dispatch
the socket receive buffer.
* If connected from a peer with a smaller IP address,
we call "rds_tcp_accept_one_path" to find a free slot/"path".
If found, business goes on as usual.
If none was found, we save/stash the newly accepted socket
into "rds_tcp_accepted_sock", in order to not lose any
messages that may have arrived already.
We then return from "rds_tcp_accept_one" with "-ENOBUFS".
Later on, when a slot/"path" does become available again
(e.g. state transitioned to "RDS_CONN_DOWN",
or HS extension header was received with "c_npaths > 1")
we call "rds_tcp_conn_slots_available" that simply re-issues
a "rds_tcp_accept_one_path" worker-callback and picks
up the new socket from "rds_tcp_accepted_sock", and thereby
continuing where it left with "-ENOBUFS" last time.
Since a new slot has become available, those messages
won't be lost, since processing proceeds as if that slot
had been available the first time around.
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260122055213.83608-3-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gerd Rausch [Thu, 22 Jan 2026 05:52:12 +0000 (22:52 -0700)]
net/rds: No shortcut out of RDS_CONN_ERROR
RDS connections carry a state "rds_conn_path::cp_state"
and transitions from one state to another and are conditional
upon an expected state: "rds_conn_path_transition."
There is one exception to this conditionality, which is
"RDS_CONN_ERROR" that can be enforced by "rds_conn_path_drop"
regardless of what state the condition is currently in.
But as soon as a connection enters state "RDS_CONN_ERROR",
the connection handling code expects it to go through the
shutdown-path.
The RDS/TCP multipath changes added a shortcut out of
"RDS_CONN_ERROR" straight back to "RDS_CONN_CONNECTING"
via "rds_tcp_accept_one_path" (e.g. after "rds_tcp_state_change").
A subsequent "rds_tcp_reset_callbacks" can then transition
the state to "RDS_CONN_RESETTING" with a shutdown-worker queued.
That'll trip up "rds_conn_init_shutdown", which was
never adjusted to handle "RDS_CONN_RESETTING" and subsequently
drops the connection with the dreaded "DR_INV_CONN_STATE",
which leaves "RDS_SHUTDOWN_WORK_QUEUED" on forever.
So we do two things here:
a) Don't shortcut "RDS_CONN_ERROR", but take the longer
path through the shutdown code.
b) Add "RDS_CONN_RESETTING" to the expected states in
"rds_conn_init_shutdown" so that we won't error out
and get stuck, if we ever hit weird state transitions
like this again."
====================
net: restore the structure of driver-facing qcfg API
The goal of qcfg objects is to let us seamlessly support new use cases
without modifying all the drivers. We want to pull all the logic of
combining configuration supplied via different interfaces into the core
and present the drivers with a flat queue-by-queue configuration.
Additionally we want to separate the current effective configuration
from the user intent (default vs user setting vs memory provider setting).
Restructure the recently added code to re-introduce the pieces that
are missing compared to the old RFC:
https://lore.kernel.org/20250421222827.283737-1-kuba@kernel.org
Namely:
- the netdev_queue_config() helper
- queue config validation callback
I hopefully removed all the more "out there" parts of the RFC.
====================
Jakub Kicinski [Thu, 22 Jan 2026 00:51:12 +0000 (16:51 -0800)]
net: add queue config validation callback
I imagine (tm) that as the number of per-queue configuration
options grows some of them may conflict for certain drivers.
While the drivers can obviously do all the validation locally
doing so is fairly inconvenient as the config is fed to drivers
piecemeal via different ops (for different params and NIC-wide
vs per-queue).
Add a centralized callback for validating the queue config
in queue ops. The callback gets invoked before memory provider
is installed, and in the future should also be called when ring
params are modified.
The validation is done after each layer of configuration.
Since we can't fail MP un-binding we must make sure that
the config is valid both before and after MP overrides are
applied. This is moot for now since the set of MP and device
configs are disjoint. It will matter significantly in the future,
so adding it now so that we don't forget..
Jakub Kicinski [Thu, 22 Jan 2026 00:51:11 +0000 (16:51 -0800)]
net: use netdev_queue_config() for mp restart
We should follow the prepare/commit approach for queue configuration.
The qcfg struct should be added to dev->cfg rather than directly to
queue objects so that we can clone and discard the pending config
easily.
Remove the qcfg in struct netdev_rx_queue, and switch remaining callers
to netdev_queue_config(). netdev_queue_config() will construct the qcfg
on the fly based on device defaults and state of the queue.
ndo_default_qcfg becomes optional because having the callback itself
does not have any meaningful semantics to us.
Jakub Kicinski [Thu, 22 Jan 2026 00:51:10 +0000 (16:51 -0800)]
net: move mp->rx_page_size validation to __net_mp_open_rxq()
Move mp->rx_page_size validation where the rest of MP input
validation lives. No other caller is modifying mp params so
validation logic in queue restarts is out of place.
Jakub Kicinski [Thu, 22 Jan 2026 00:51:09 +0000 (16:51 -0800)]
net: introduce a trivial netdev_queue_config()
We may choose to extend or reimplement the logic which renders
the per-queue config. The drivers should not poke directly into
the queue state. Add a helper for drivers to use when they want
to query the config for a specific queue.
Jakub Kicinski [Thu, 22 Jan 2026 00:51:08 +0000 (16:51 -0800)]
eth: bnxt: always set the queue mgmt ops
Core provides a centralized callback for validating per-queue settings
but the callback is part of the queue management ops. Having the ops
conditionally set complicates the parts of the driver which could
otherwise lean on the core to feed it the correct settings.
Always set the queue ops, but provide no restart-related callbacks if
queue ops are not supported by the device. This should maintain current
behavior, the check in netdev_rx_queue_restart() looks both at op struct
and individual ops.
====================
selftest: Extend tun/virtio coverage for GSO over UDP tunnel
The design strategy is to extend the existing tun testing infrastructure
to support this new use-case, rather than introducing a new or parallel framework.
This allows for better integration and re-use of existing test logic.
====================
Xu Du [Wed, 21 Jan 2026 10:04:59 +0000 (18:04 +0800)]
selftest: tun: Add test for sending gso packet into tun
The test constructs a raw packet, prepends a virtio_net_hdr,
and writes the result to the TUN device. This mimics the behavior
of a vm forwarding a guest's packet to the host networking stack.
Xu Du [Wed, 21 Jan 2026 10:04:58 +0000 (18:04 +0800)]
selftest: tun: Add helpers for GSO over UDP tunnel
In preparation for testing GSO over UDP tunnels, enhance the test
infrastructure to support a more complex data path involving a TUN
device and a GENEVE udp tunnel.
This patch introduces a dedicated setup/teardown topology that creates
both a GENEVE tunnel interface and a TUN interface. The TUN device acts
as the VTEP (Virtual Tunnel Endpoint), allowing it to send and receive
virtio-net packets. This setup effectively tests the kernel's data path
for encapsulated traffic.
Note that after adding a new address to the UDP tunnel, we need to wait
a bit until the associated route is available.
Additionally, a new data structure is defined to manage test parameters.
This structure is designed to be extensible, allowing different test
data and configurations to be easily added in subsequent patches.
Xu Du [Wed, 21 Jan 2026 10:04:56 +0000 (18:04 +0800)]
selftest: tun: Introduce tuntap_helpers.h header for TUN/TAP testing
Introduce rtnetlink manipulation and packet construction helpers that
will simplify the later creation of more related test cases. This avoids
duplicating logic across different test cases.
This new header will contain:
- YNL-based netlink management utilities.
- Helpers for ip link, ip address, ip neighbor and ip route operations.
- Packet construction and manipulation helpers.
====================
geneve: introduce double tunnel GSO/GRO support
This is the [belated] incarnation of topic discussed in the last Neconf
[1].
In container orchestration in virtual environments there is a consistent
usage of double UDP tunneling - specifically geneve. Such setup lack
support of GRO and GSO for inter VM traffic.
After commit b430f6c38da6 ("Merge branch 'virtio_udp_tunnel_08_07_2025'
of https://github.com/pabeni/linux-devel") and the qemu cunter-part, VMs
are able to send/receive GSO over UDP aggregated packets.
This series introduces the missing bit for full end-to-end aggregation
in the above mentioned scenario. Specifically:
- introduces a new netdev feature set to generalize existing per device
driver GSO admission check.1
- adds GSO partial support for the geneve and vxlan drivers
- introduces and use a geneve option to assist double tunnel GRO
- adds some simple functional tests for the above.
The new device features set is not strictly needed for the following
work, but avoids the introduction of trivial `ndo_features_check` to
support GSO partial and thus possible performance regression due to the
additional indirect call. Such feature set could be leveraged by a
number of existing drivers (intel, meta and possibly wangxun) to avoid
duplicate code/tests. Such part has been omitted here to keep the series
small.
Both GSO partial support and double GRO support have some downsides.
With the first in place, GSO partial packets will traverse the network
stack 'downstream' the outer geneve UDP tunnel and will be visible by
the udp/IP/IPv6 and by netfilter. Currently only H/W NICs implement GSO
partial support and such packets are visible only via software taps.
Double UDP tunnel GRO will cook 'GSO partial' like aggregate packets,
i.e. the inner UDP encapsulation headers set will still carry the
wire-level lengths and csum, so that segmentation considering such
headers parts of a giant, constant encapsulation header will yield the
correct result.
The correct GSO packet layout is applied when the packet traverse the
outermost geneve encapsulation.
Both GSO partial and double UDP encap are disabled by default and must
be explicitly enabled via, respectively ethtool and geneve device
configuration.
Finally note that the GSO partial feature could potentially be applied
to all the other UDP tunnels, but this series limits its usage to geneve
and vxlan devices.
Paolo Abeni [Wed, 21 Jan 2026 16:11:36 +0000 (17:11 +0100)]
selftests: net: tests for add double tunneling GRO/GSO
Create a simple, netns-based topology with double, nested UDP tunnels and
perform TSO transfers on top.
Explicitly enable GSO and/or GRO and check the skb layout consistency with
different configuration allowing (or not) GSO frames to be delivered on
the other end.
The trickest part is account in a robust way the aggregated/unaggregated
packets with double encapsulation: use a classic bpf filter for it.
Paolo Abeni [Wed, 21 Jan 2026 16:11:35 +0000 (17:11 +0100)]
geneve: use GRO hint option in the RX path
At the GRO stage, when a valid hint option is found, try match the whole
nested headers and try to aggregate on the inner protocol; in case of hdr
mismatch extract the nested address and port to properly flush on a
per-inner flow basis.
On GRO completion, the (unmodified) nested headers will be considered part
of the (constant) outer geneve encap header so that plain UDP tunnel
segmentation will yield valid wire packets.
In the geneve RX path, when processing a GSO packet carrying a GRO hint
option, update the nested header length fields from the wire packet size to
the GSO-packet one. If the nested header additionally carries a checksum,
convert it to CSUM-partial.
Finally, when the RX path leverages the GRO hints, skip the additional GRO
stage done by GRO cells: otherwise the already set skb->encapsulation flag
will foul the GRO cells complete step to use touch the innermost IP header
when it should update the nested csum, corrupting the packet.
Paolo Abeni [Wed, 21 Jan 2026 16:11:34 +0000 (17:11 +0100)]
geneve: extract hint option at GRO stage
Add helpers for finding a GRO hint option in the geneve header, performing
basic sanitization of the option offsets vs the actual packet layout,
validate the option for GRO aggregation and check the nested header
checksum.
The validation helper closely mirrors similar check performed by the ipv4
and ipv6 gro callbacks, with the additional twist of accessing the
relevant network header via the GRO hint offset.
To validate the nested UDP checksum, leverage the csum completed of the
outer header, similarly to LCO, with the main difference that in this case
we have the outer checksum available.
Use the helpers to extract the hint info at the GRO stage.
Paolo Abeni [Wed, 21 Jan 2026 16:11:33 +0000 (17:11 +0100)]
geneve: add GRO hint output path
If a geneve egress packet contains nested UDP encap headers, add a geneve
option including the information necessary on the RX side to perform GRO
aggregation of the whole packets: the nested network and transport headers,
and the nested protocol type.
Use geneve option class `netdev`, already registered in the Network
Virtualization Overlay (NVO3) IANA registry:
Paolo Abeni [Wed, 21 Jan 2026 16:11:32 +0000 (17:11 +0100)]
geneve: pass the geneve device ptr to geneve_build_skb()
Instead of handing to it the geneve configuration in multiple arguments.
This already avoids some code duplication and we are going to pass soon
more arguments to such function.
Paolo Abeni [Wed, 21 Jan 2026 16:11:27 +0000 (17:11 +0100)]
net: introduce mangleid_features
Some/most devices implementing gso_partial need to disable the GSO partial
features when the IP ID can't be mangled; to that extend each of them
implements something alike the following[1]:
if (skb->encapsulation && !(features & NETIF_F_TSO_MANGLEID))
features &= ~NETIF_F_TSO;
in the ndo_features_check() op, which leads to a bit of duplicate code.
Later patch in the series will implement GSO partial support for virtual
devices, and the current status quo will require more duplicate code and
a new indirect call in the TX path for them.
Introduce the mangleid_features mask, allowing the core to disable NIC
features based on/requiring MANGLEID, without any further intervention
from the driver.
The same functionality could be alternatively implemented adding a single
boolean flag to the struct net_device, but would require an additional
checks in ndo_features_check().
Also note that [1] is incorrect if the NIC additionally implements
NETIF_F_GSO_UDP_L4, mangleid_features transparently handle even such a
case.
====================
net: convert drivers to .get_rx_ring_count (last part)
Commit 84eaf4359c36 ("net: ethtool: add get_rx_ring_count callback to
optimize RX ring queries") added specific support for GRXRINGS callback,
simplifying .get_rxnfc.
Remove the handling of GRXRINGS in .get_rxnfc() by moving it to the new
.get_rx_ring_count().
This simplifies the RX ring count retrieval and aligns the following
drivers with the new ethtool API for querying RX ring parameters.
* sfc
* ionic
* sfc/siena
* sfc/ef100
* fbnic
* mana
* nfp
* atlantic
* benet (this is v2 in fact, where v1 had some discussions that
required a v2). See link [0]
Link: https://lore.kernel.org/all/20260119094514.5b12a097@kernel.org/
This is covering the last drivers, and as soon as this lands, I will
change the ethtool framework to avoid calling .get_rx_ring_count for
ETHTOOL_GRXRINGS, simplifying the ethtool core framework.
====================
Breno Leitao [Thu, 22 Jan 2026 18:40:13 +0000 (10:40 -0800)]
net: benet: convert to use .get_rx_ring_count
Use the newly introduced .get_rx_ring_count ethtool ops callback instead
of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().
Since ETHTOOL_GRXRINGS was the only command handled by be_get_rxnfc(),
remove the function entirely.
Since the be_multi_rxq() check in be_get_rxnfc() previously blocked RSS
configuration on single-queue setups (via ethtool core validation), add
an equivalent check to be_set_rxfh() to preserve this behavior, as
suggested by Jakub.
Lukas Bulwahn [Thu, 22 Jan 2026 07:46:09 +0000 (08:46 +0100)]
MAINTAINERS: remove obsolete file entry in NETWORKING DRIVERS
Commit d8f87aa5fa0a ("net: remove HIPPI support and RoadRunner HIPPI
driver") removes the hippidevice header file, but misses that there is
still a file entry in MAINTAINERS referring to it.
Remove the obsolete file entry in NETWORKING DRIVERS.
drivers/net/wireless/ath/ath12k/mac.c
drivers/net/wireless/ath/ath12k/wifi7/hw.c 31707572108d ("wifi: ath12k: Fix wrong P2P device link id issue") c26f294fef2a ("wifi: ath12k: Move ieee80211_ops callback to the arch specific module")
https://lore.kernel.org/20260114123751.6a208818@canb.auug.org.au
Adjacent changes:
drivers/net/wireless/ath/ath12k/mac.c 8b8d6ee53dfd ("wifi: ath12k: Fix scan state stuck in ABORTING after cancel_remain_on_channel") 914c890d3b90 ("wifi: ath12k: Add framework for hardware specific ieee80211_ops registration")
Danielle Ratson [Wed, 21 Jan 2026 11:46:44 +0000 (13:46 +0200)]
selftests: net: Add kernel selftest for RFC 4884
RFC 4884 extended certain ICMP messages with a length attribute that
encodes the length of the "original datagram" field. This is needed so
that new information could be appended to these messages without
applications thinking that it is part of the "original datagram" field.
In version 5.9, the kernel was extended with two new socket options
(SOL_IP/IP_RECVERR_4884 and SOL_IPV6/IPV6_RECVERR_RFC4884) that allow
user space to retrieve this length which is basically the offset to the
ICMP Extension Structure at the end of the ICMP message. This is
required by user space applications that need to parse the information
contained in the ICMP Extension Structure. For example, the RFC 5837
extension for tracepath.
Add a selftest that verifies correct handling of the RFC 4884 length
field for both IPv4 and IPv6, with and without extension structures,
and validates that malformed extensions are correctly reported as invalid.
For each address family, the test creates:
- a raw socket used to send locally crafted ICMP error packets to the
loopback address, and
- a datagram socket used to receive the encapsulated original datagram
and associated error metadata from the kernel error queue.
ICMP packets are constructed entirely in user space rather than relying
on kernel-generated errors. This allows the test to exercise invalid
scenarios (such as corrupted checksums and incorrect length fields) and
verify that the SO_EE_RFC4884_FLAG_INVALID flag is set as expected.
Output Example:
$ ./icmp_rfc4884
Starting 18 tests from 18 test cases.
RUN rfc4884.ipv4_ext_small_payload.rfc4884 ...
OK rfc4884.ipv4_ext_small_payload.rfc4884
ok 1 rfc4884.ipv4_ext_small_payload.rfc4884
RUN rfc4884.ipv4_ext.rfc4884 ...
OK rfc4884.ipv4_ext.rfc4884
ok 2 rfc4884.ipv4_ext.rfc4884
RUN rfc4884.ipv4_ext_large_payload.rfc4884 ...
OK rfc4884.ipv4_ext_large_payload.rfc4884
ok 3 rfc4884.ipv4_ext_large_payload.rfc4884
RUN rfc4884.ipv4_no_ext_small_payload.rfc4884 ...
OK rfc4884.ipv4_no_ext_small_payload.rfc4884
ok 4 rfc4884.ipv4_no_ext_small_payload.rfc4884
RUN rfc4884.ipv4_no_ext_min_payload.rfc4884 ...
OK rfc4884.ipv4_no_ext_min_payload.rfc4884
ok 5 rfc4884.ipv4_no_ext_min_payload.rfc4884
RUN rfc4884.ipv4_no_ext_large_payload.rfc4884 ...
OK rfc4884.ipv4_no_ext_large_payload.rfc4884
ok 6 rfc4884.ipv4_no_ext_large_payload.rfc4884
RUN rfc4884.ipv4_invalid_ext_checksum.rfc4884 ...
OK rfc4884.ipv4_invalid_ext_checksum.rfc4884
ok 7 rfc4884.ipv4_invalid_ext_checksum.rfc4884
RUN rfc4884.ipv4_invalid_ext_length_small.rfc4884 ...
OK rfc4884.ipv4_invalid_ext_length_small.rfc4884
ok 8 rfc4884.ipv4_invalid_ext_length_small.rfc4884
RUN rfc4884.ipv4_invalid_ext_length_large.rfc4884 ...
OK rfc4884.ipv4_invalid_ext_length_large.rfc4884
ok 9 rfc4884.ipv4_invalid_ext_length_large.rfc4884
RUN rfc4884.ipv6_ext_small_payload.rfc4884 ...
OK rfc4884.ipv6_ext_small_payload.rfc4884
ok 10 rfc4884.ipv6_ext_small_payload.rfc4884
RUN rfc4884.ipv6_ext.rfc4884 ...
OK rfc4884.ipv6_ext.rfc4884
ok 11 rfc4884.ipv6_ext.rfc4884
RUN rfc4884.ipv6_ext_large_payload.rfc4884 ...
OK rfc4884.ipv6_ext_large_payload.rfc4884
ok 12 rfc4884.ipv6_ext_large_payload.rfc4884
RUN rfc4884.ipv6_no_ext_small_payload.rfc4884 ...
OK rfc4884.ipv6_no_ext_small_payload.rfc4884
ok 13 rfc4884.ipv6_no_ext_small_payload.rfc4884
RUN rfc4884.ipv6_no_ext_min_payload.rfc4884 ...
OK rfc4884.ipv6_no_ext_min_payload.rfc4884
ok 14 rfc4884.ipv6_no_ext_min_payload.rfc4884
RUN rfc4884.ipv6_no_ext_large_payload.rfc4884 ...
OK rfc4884.ipv6_no_ext_large_payload.rfc4884
ok 15 rfc4884.ipv6_no_ext_large_payload.rfc4884
RUN rfc4884.ipv6_invalid_ext_checksum.rfc4884 ...
OK rfc4884.ipv6_invalid_ext_checksum.rfc4884
ok 16 rfc4884.ipv6_invalid_ext_checksum.rfc4884
RUN rfc4884.ipv6_invalid_ext_length_small.rfc4884 ...
OK rfc4884.ipv6_invalid_ext_length_small.rfc4884
ok 17 rfc4884.ipv6_invalid_ext_length_small.rfc4884
RUN rfc4884.ipv6_invalid_ext_length_large.rfc4884 ...
OK rfc4884.ipv6_invalid_ext_length_large.rfc4884
ok 18 rfc4884.ipv6_invalid_ext_length_large.rfc4884
PASSED: 18 / 18 tests passed.
Totals: pass:18 fail:0 xfail:0 xpass:0 skip:0 error:0
====================
net: stmmac: dwmac: enforce preamble before SFD for i.MX8MP
This series adds a new phy_device flag PHY_F_KEEP_PREAMBLE_BEFORE_SFD
that allows a MAC driver to request to keep the preamble bytes before
the start frame delimiter (SFD) when receiving frames from the PHY.
This flag is set in the stmmac driver for the i.MX8MP SoC due to errata
(ERR050694), which causes it to drop frames without a preamble.
The Micrel KSZ9131 PHY supports keeping the preamble before SFD by
setting an undocumented flag, that was confirmed by NXP and Micrel. This
new feature has been added to the Micrel PHY driver for the KSZ9131 PHY.
====================
net: stmmac: dwmac-imx: keep preamble before sfd on i.MX8MP
The stmmac implementation used by NXP for the i.MX8MP SoC is subject to
errata ERR050694. According to this errata, when no preamble byte is
transferred before the SFD from the PHY to the MAC, the MAC will discard
the frame.
Setting the PHY_F_KEEP_PREAMBLE_BEFORE_SFD flag instructs PHYs that
support it to keep the preamble byte before the SFD. This ensures that
the MAC successfully receives frames.
As this is an issue in the MAC implementation, only enable the flag for
the i.MX8MP SoC where the errata applies but not for other SoCs using a
working stmmac implementation.
The exact wording of the errata ERR050694 from NXP:
The IEEE 802.3 standard states that, in MII/GMII modes, the byte
preceding the SFD (0xD5), SMD-S (0xE6,0x4C, 0x7F, or 0xB3), or SMD-C
(0x61, 0x52, 0x9E, or 0x2A) byte can be a non-PREAMBLE byte or there can
be no preceding preamble byte. The MAC receiver must successfully
receive a packet without any preamble(0x55) byte preceding the SFD,
SMD-S, or SMD-C byte.
However due to the defect, in configurations where frame preemption is
enabled, when preamble byte does not precede the SFD, SMD-S, or SMD-C
byte, the received packet is discarded by the MAC receiver. This is
because, the start-of-packet detection logic of the MAC receiver
incorrectly checks for a preamble byte.
NXP refers to IEEE 802.3 where in clause 35.2.3.2.2 Receive case (GMII)
they show two tables one where the preamble is preceding the SFD and one
where it is not. The text says:
The operation of 1000 Mb/s PHYs can result in shrinkage of the preamble
between transmission at the source GMII and reception at the destination
GMII. Table 35-3 depicts the case where no preamble bytes are conveyed
across the GMII. This case may not be possible with a specific PHY, but
illustrates the minimum preamble with which MAC shall be able to
operate. Table 35-4 depicts the case where the entire preamble is
conveyed across the GMII.
This workaround was tested on a Verdin iMX8MP by enforcing 10 MBit/s:
ethtool -s end0 speed 10
Without keeping the preamble, no packet were received. With keeping the
preamble, everything worked as expected.
Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260120203905.23805-4-eichest@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: micrel: add option to keep the preamble before sfd for KSZ9131
If the PHY_F_KEEP_PREAMBLE_BEFORE_SFD flag is set in the
phy_device::dev_flags field, the preamble will be kept before the start
frame delimiter (SFD) on the KSZ9131 PHY. This flag is not officially
documented by Micrel. However, information provided by NXP and Micrel
indicates that this flag ensures the PHY sends the full preamble instead
of removing it. The full discussion can be found on the NXP forum:
https://community.nxp.com/t5/i-MX-Processors/iMX8MP-eqos-not-working-for-10base-t/m-p/2151032
Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260120203905.23805-3-eichest@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: add a new phy_device flag to keep preamble before sfd
Add a new flag, PHY_F_KEEP_PREAMBLE_BEFORE_SFD, to indicate that the PHY
shall not remove the preamble before the SFD if it supports it. MACs
that do not support receiving frames without a preamble can set this
flag.
Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260120203905.23805-2-eichest@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 22 Jan 2026 22:40:38 +0000 (14:40 -0800)]
Merge tag 'nf-next-26-01-22' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter: updates for net-next
There is an issue with interval matching in nftables rbtree set type:
When userspace sends us set updates, there is a brief window where
false negative lookups may occur from the data plane. Quoting Pablos
original cover letter:
This series addresses this issue by translating the rbtree, which keeps
the intervals in order, to binary search. The array is published to
packet path through RCU. The idea is to keep using the rbtree
datastructure for control plane, which needs to deal with updates, then
generate an array using this rbtree for binary search lookups.
Patch #1 allows to call .remove in case .abort is defined, which is
needed by this new approach. Only pipapo needs to skip .remove to speed.
Patch #2 add the binary search array approach for interval matching.
Patch #3 updates .get to use the binary search array to find for
(closest or exact) interval matching.
Patch #4 removes seqcount_rwlock_t as it is not needed anymore (new in
this series).
* tag 'nf-next-26-01-22' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nft_set_rbtree: remove seqcount_rwlock_t
netfilter: nft_set_rbtree: use binary search array in get command
netfilter: nft_set_rbtree: translate rbtree to array for binary search
netfilter: nf_tables: add .abort_skip_removal flag for set types
====================
RX ptypes received from device control plane doesn't depend on vport
info, but might vary based on the queue model. When the driver requests
for ptypes, control plane fills both ptype_id_10 (used for splitq) and
ptype_id_8 (used for singleq) fields of the virtchnl2_ptype response
structure. This allows to call get_rx_ptypes once at the adapter level
instead of each vport.
Parse and store the received ptypes of both splitq and singleq in a
separate lookup table. Respective lookup table is used based on the
queue model info. As part of the changes, pull the ptype protocol
parsing code into a separate function.
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Signed-off-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
With the previous refactor of passing idpf resource pointer, all of the
virtchnl send message functions do not require full vport structure.
Those functions can be generalized to be able to use for configuring
vport independent queues.
Signed-off-by: Anton Nadezhdin <anton.nadezhdin@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Signed-off-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Joshua Hay [Thu, 13 Nov 2025 00:41:40 +0000 (16:41 -0800)]
idpf: remove vport pointer from queue sets
Replace vport pointer in queue sets struct with adapter backpointer and
vport_id as those are the primary fields necessary for virtchnl
communication. Otherwise, pass the vport pointer as a separate parameter
where available. Also move xdp_txq_offset to queue vector resource
struct since we no longer have the vport pointer.
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
idpf: add rss_data field to RSS function parameters
Retrieve rss_data field of vport just once and pass it to RSS related
functions instead of retrieving it in each function.
While at it, update s/rss/RSS in the RSS function doc comments.
Reviewed-by: Anton Nadezhdin <anton.nadezhdin@intel.com> Signed-off-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
idpf: reshuffle idpf_vport struct members to avoid holes
The previous refactor of moving queue and vector resources out of the
idpf_vport structure, created few holes. Reshuffle the existing members
to avoid holes as much as possible.
Reviewed-by: Anton Nadezhdin <anton.nadezhdin@intel.com> Signed-off-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Joshua Hay [Thu, 13 Nov 2025 00:41:37 +0000 (16:41 -0800)]
idpf: move some iterator declarations inside for loops
Move some iterator declarations to their respective for loops; use more
appropriate unsigned type.
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
idpf: introduce idpf_q_vec_rsrc struct and move vector resources to it
To group all the vector and queue resources, introduce idpf_q_vec_rsrc
structure. This helps to reuse the same config path functions by other
features. For example, PTP implementation can use the existing config
infrastructure to configure secondary mailbox by passing its queue and
vector info. It also helps to avoid any duplication of code.
Existing queue and vector resources are grouped as default resources.
This patch moves vector info to the newly introduced structure.
Following patch moves the queue resources.
While at it, declare the loop iterator for 'num_q_vectors' in loop and
use the correct type.
Include idpf_q_vec_rsrc backpointer in idpf_alloc_queue_set along with
vport.
Reviewed-by: Anton Nadezhdin <anton.nadezhdin@intel.com> Signed-off-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
idpf: introduce local idpf structure to store virtchnl queue chunks
Queue ID and register info received from device Control Plane is stored
locally in the same little endian format. As the queue chunks are
retrieved in 3 functions, lexx_to_cpu conversions are done each time.
Instead introduce a new idpf structure to store the received queue info.
It also avoids conditional check to retrieve queue chunks.
With this change, there is no need to store the queue chunks in
'req_qs_chunks' field. So remove that.
Suggested-by: Milena Olech <milena.olech@intel.com> Reviewed-by: Anton Nadezhdin <anton.nadezhdin@intel.com> Signed-off-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Linus Torvalds [Thu, 22 Jan 2026 17:32:11 +0000 (09:32 -0800)]
Merge tag 'net-6.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from CAN and wireless.
Pretty big, but hard to make up any cohesive story that would explain
it, a random collection of fixes. The two reverts of bad patches from
this release here feel like stuff that'd normally show up by rc5 or
rc6. Perhaps obvious thing to say, given the holiday timing.
That said, no active investigations / regressions. Let's see what the
next week brings.
Current release - fix to a fix:
- can: alloc_candev_mqs(): add missing default CAN capabilities
Current release - regressions:
- usbnet: fix crash due to missing BQL accounting after resume
netfilter: nft_set_rbtree: use binary search array in get command
Rework .get interface to use the binary search array, this needs a specific
lookup function to match on end intervals (<=). Packet path lookup is slight
different because match is on lesser value, not equal (ie. <).
After this patch, seqcount can be removed in a follow up patch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
netfilter: nft_set_rbtree: translate rbtree to array for binary search
The rbtree can temporarily store overlapping inactive elements during
the transaction processing, leading to false negative lookups.
To address this issue, this patch adds a .commit function that walks the
the rbtree to build a array of intervals of ordered elements. This
conversion compacts the two singleton elements that represent the start
and the end of the interval into a single interval object for space
efficient.
Binary search is O(log n), similar to rbtree lookup time, therefore,
performance number should be similar, and there is an implementation
available under lib/bsearch.c and include/linux/bsearch.h that is used
for this purpose.
This slightly increases memory consumption for this new array that
stores pointers to the start and the end of the interval.
With this patch:
# time nft -f 100k-intervals-set.nft
real 0m4.218s
user 0m3.544s
sys 0m0.400s
Without this patch:
# time nft -f 100k-intervals-set.nft
real 0m3.920s
user 0m3.547s
sys 0m0.276s
With this patch, with IPv4 intervals:
baseline rbtree (match on first field only): 15254954pps
Without this patch:
baseline rbtree (match on first field only): 10256119pps
This provides a ~50% improvement in matching intervals from packet path.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
netfilter: nf_tables: add .abort_skip_removal flag for set types
The pipapo set backend is the only user of the .abort interface so far.
To speed up pipapo abort path, removals are skipped.
The follow up patch updates the rbtree to use to build an array of
ordered elements, then use binary search. This needs a new .abort
interface but, unlike pipapo, it also need to undo/remove elements.
Add a flag and use it from the pipapo set backend.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
Hariprasad Kelam [Wed, 21 Jan 2026 09:48:19 +0000 (15:18 +0530)]
Octeontx2-af: Add proper checks for fwdata
firmware populates MAC address, link modes (supported, advertised)
and EEPROM data in shared firmware structure which kernel access
via MAC block(CGX/RPM).
Accessing fwdata, on boards booted with out MAC block leading to
kernel panics.
Fixes: 997814491cee ("Octeontx2-af: Fetch MAC channel info from firmware") Fixes: 5f21226b79fd ("Octeontx2-pf: ethtool: support multi advertise mode") Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Link: https://patch.msgid.link/20260121094819.2566786-1-hkelam@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ivan Vecera [Wed, 21 Jan 2026 13:00:11 +0000 (14:00 +0100)]
dpll: Prevent duplicate registrations
Modify the internal registration helpers dpll_xa_ref_{dpll,pin}_add()
to reject duplicate registration attempts.
Previously, if a caller attempted to register the same pin multiple
times (with the same ops, priv, and cookie) on the same device, the core
silently increments the reference count and return success. This behavior
is incorrect because if the caller makes these duplicate registrations
then for the first one dpll_pin_registration is allocated and for others
the associated dpll_pin_ref.refcount is incremented. During the first
unregistration the associated dpll_pin_registration is freed and for
others WARN is fired.
Fix this by updating the logic to return `-EEXIST` if a matching
registration is found to enforce a strict "register once" policy.
Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base functions") Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20260121130012.112606-1-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Incorrectly transmitted interrupt number instead of queue number
when using netif_queue_set_napi. Besides, move this to appropriate
code location to set napi.
Remove redundant netif_stop_subqueue beacuase it is not part of the
hinic3_send_one_skb process.