This patch series by Vlad refactors how action STEs are handled for
hardware steering.
Definitions
-----------
* STE (Steering Table Entry): a building block for steering rules.
Simple rules consist of a single STE that specifies both the match
value and what actions to do. For more complex rules we have one or
more match STEs that point to one or more action STEs. It is these
action STEs which this patch series is primarily concerned with.
* RTC (Rule Table Context): a table that contains STEs. A matcher
currently consists of a match RTC and, if necessary, an action RTC.
This patch series decouples action RTCs from matchers and moves action
RTCs to a central pool.
* Matcher: a logical container for steering rules. While the items above
describe hardware concepts, a matcher is purely a software construct.
Current situation
-----------------
As mentioned above, a matcher currently consists of a match RTC (or
more, in case of complex matchers) and zero or one action RTCs. An
action RTC is only allocated if the matcher contains sufficiently
complicated action templates, or many actions.
When adding a rule, we decide based on its action template whether it
requires action STEs. If yes, we allocate the required number of action
STEs from the matcher's action RTC.
When updating a rule, we need to prevent the rule ever being in an
invalid state. So we need to allocate and write new action STEs first,
then update the match STE to point to them, and finally release the old
action STEs. So there is a state when a rule needs double the action
STEs it normally uses.
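For illustration, the update path looks roughly like this (hypothetical
helper names, not the driver's actual functions):

  /* Make-before-break: the rule never points at freed action STEs. */
  new_stes = action_ste_alloc(matcher, rule->num_action_stes); /* 1. allocate */
  action_ste_write(new_stes, rule->actions);                   /* 2. write */
  match_ste_update(rule->match_ste, new_stes);                 /* 3. repoint */
  action_ste_free(rule->old_action_stes);                      /* 4. release */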
Thus, for a given matcher of log_sz=N, log_action_ste_sz=A, the action
RTC log_size is (N + A + 1). We need enough space to hold all the rules'
action STEs, and effectively double that space to account for the not
very common case of rules being updated. We could manage with far fewer
extra action STEs, but RTCs are allocated in powers of two. This results
in an effective action RTC utilization of 50%, outside rule update
cases.
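For example, a matcher with log_sz=10 (1024 rules) whose rules each use
log_action_ste_sz=2 (4 action STEs) needs 2^12 = 4096 action STEs in
steady state, but gets an action RTC of log_size 13, i.e. 8192 STEs, so
that in-flight updates always have headroom.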
This is further complicated when resizing matchers. To avoid updating
all the rules to point to new match STEs, we keep existing action RTCs
around as resize_data, and only free them when the matcher is freed.
Action STE pool
---------------
This patch series decouples action RTCs from matchers by creating a
per-queue pool. When a rule needs to allocate action STEs it does so
from the pool, creating a new RTC if needed. During an update, two sets of
action STEs are in use, possibly from different RTCs.
The pool is sharded per-queue to avoid lock contention. Each per-queue
pool consists of 3 elements, corresponding to rx-only, tx-only and
rx-and-tx use cases. The series takes this approach because rules that
are bidirectional require that their action STEs have the same index in
the rx- and tx-RTCs, and using a single RTC would result in
unidirectional rules wasting the STEs for the unused direction.
Pool elements, in turn, consist of a list of RTCs. The driver
progressively allocates larger RTCs as they are needed to amortize the
cost of allocation.
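A minimal sketch of the resulting layout, with assumed names (the actual
structures in the driver may differ):

  struct mlx5hws_action_ste_pool_element {
          struct list_head tables; /* RTC-backed tables, grown on demand */
          u64 last_used;           /* feeds the periodic cleanup below */
  };

  /* One pool per queue; elements for rx-only, tx-only and rx-and-tx rules. */
  struct mlx5hws_action_ste_pool {
          struct mutex lock;
          struct mlx5hws_action_ste_pool_element elems[3];
  };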
Allocation of elements (STEs) inside RTCs is modelled by an existing
mechanism, somewhat confusingly also known as a pool. The first few
patches in the series refactor this abstraction to simplify it and adapt
it to the new scheme.
Finally, this series implements periodic cleanup of unused action RTCs
as a new feature. Previously, once a matcher allocated an action RTC,
that RTC was only freed when the matcher itself was freed. This resulted
in a lot of wasted memory for matchers that had previously grown, but
were now mostly unused.
To enable cleanup, action STE pools record a timestamp of when they were
last used. A cleanup routine periodically checks all pools; if a pool's
last use was too far in the past, it is destroyed.
Benchmarks
----------
The test module creates a batch of (1 << 18) rules per queue and then
deletes them, in a loop. The rules are complex enough to require two
action STEs per rule.
Each queue is manipulated from a separate kernel workqueue, so there is
a 1:1 correspondence between threads and queues.
There are sleep statements between insert and delete batches so that
memory usage can be evaluated using `free -m`. The numbers below are the
diff between base memory usage (without the mlx5 module inserted) and
peak usage while running a test. The values are rounded to the nearest
hundred megabytes. The `queues` column lists how many queues the test
used.
Across all of the tests, insertion and deletion rates are the same
before and after these patches.
Summary of the patches
----------------------
* Patch 1: Fix matcher action template attach to avoid overrunning the
buffer and correctly report errors.
* Patches 2-7: Clean up the existing pool abstraction: clarify semantics
and use cases, simplify the API and callers.
* Patch 8: Implement the new action STE pool structure.
* Patch 9: Use the action STE pool when manipulating rules.
* Patch 10: Remove action RTC from matcher.
* Patch 11: Add logic to periodically check and free unused action RTCs.
* Patch 12: Export action STE tables in debugfs for our dump tool.
====================
Periodically check for unused action STE tables and free their
associated resources. In order to do this safely, add a per-queue lock
to synchronize the garbage collect work with regular operations on
steering rules.
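A rough sketch of how the cleanup work might be wired up, with assumed
names (hws_pool_reclaim_stale() is a hypothetical helper that frees
tables whose last use is older than a threshold):

  static void hws_action_ste_pool_cleanup(struct work_struct *work)
  {
          struct delayed_work *dwork = to_delayed_work(work);
          struct mlx5hws_context *ctx;
          int i;

          ctx = container_of(dwork, struct mlx5hws_context, ste_cleanup_dwork);

          for (i = 0; i < ctx->queues; i++) {
                  struct mlx5hws_action_ste_pool *pool = &ctx->action_ste_pool[i];

                  /* The per-queue lock keeps us from racing rule operations. */
                  mutex_lock(&pool->lock);
                  hws_pool_reclaim_stale(pool);
                  mutex_unlock(&pool->lock);
          }

          schedule_delayed_work(&ctx->ste_cleanup_dwork, STE_CLEANUP_INTERVAL);
  }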
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-12-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove the matcher action STE implementation now that the code uses
per-queue action STE pools. This also allows simplifying matcher code
because it is now only handling a single type of RTC/STE.
The matcher resize data is also going away. Matchers were saving old
action STE data because the rules still used it, but now that data lives
in the action STE pool and is no longer coupled to a matcher.
Furthermore, matchers no longer need to rehash due to action template
addition. If a new action template needs more action STEs, we simply
update the matcher's num_of_action_stes and future rules will allocate
the correct number. Existing rules are unaffected by such an operation
and can continue to use their existing action STEs.
The range action was using the matcher action STE implementation, but
there was no reason to do this other than the container fitting the
purpose. Extract that information to a separate structure.
Finally, stop dumping per-matcher information about action RTCs,
because they no longer exist. A later patch in this series will add
support for dumping action STE pools.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-11-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Implement a per-queue pool of action STEs that match STEs can link to,
regardless of matcher.
The code relies on hints to optimize whether a given rule is added to
rx-only, tx-only or both. Correspondingly, action STEs need to be added
to different RTCs for the ingress and egress paths. For rx-and-tx rules, the
current rule implementation dictates that the offsets for a given rule
must be the same in both RTCs.
To avoid wasting STEs, each action STE pool element holds 3 pools:
rx-only, tx-only, and rx-and-tx, corresponding to the possible values of
the pool optimization enum. The implementation then chooses at rule
creation / update which of these elements to allocate from.
Each element holds multiple action STE tables, which wrap an RTC, an STE
range, the logic to buddy-allocate offsets from the range, and an STC
that allows match STEs to point to this table. When allocating offsets
from an element, we iterate through available action STE tables and, if
needed, create a new table.
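A minimal sketch of one such table, with assumed field names:

  struct mlx5hws_action_ste_table {
          struct list_head list_node; /* entry in the pool element's list */
          u32 rtc_0_id;               /* RTC(s) backing the STE range */
          u32 rtc_1_id;
          struct mlx5hws_pool *pool;  /* buddy-allocates offsets in the range */
          u32 stc_id;                 /* lets match STEs point at this table */
  };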
Similar to the previous implementation, this iteration does not free any
resources. This is implemented in a subsequent patch.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-9-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The optimization to create a size-one STE range for the unused direction
was broken. The hardware prevents us from creating RTCs over unallocated
STE space, so the only reason this has worked so far is because the
optimization was never used.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-8-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove members which are now no longer used. In fact, many
`struct mlx5hws_pool_chunk` fields were not even written to beyond being
initialized, but they were used in various internals.
Also clean up some local variables which made more sense when the API was
thicker.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-6-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Refactor the pool implementation to remove unused flags and clarify its
usage. A pool represents a single range of STEs or STCs which are
allocated at pool creation time.
Pools are used under three patterns:
1. STCs are allocated one at a time from a global pool using a bitmap
based implementation.
2. Action STEs are allocated in power-of-two blocks using a buddy
algorithm.
3. Match STEs do not use allocation, since insertion into these tables
is based on hashes or direct addressing. In such cases we use a pool
only to create the STE range.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-5-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The pool implementation claimed to support multiple resources, but this
does not really make sense in context. Callers always allocate a single
STC or STE chunk of exactly the size provided.
The code that handled multiple resources was unused (and likely buggy)
due to the combination of flags passed by callers.
Simplify the pool by having it handle a single resource. As a result of
this simplification, chunks no longer contain a resource offset (there
is now only one resource per pool), and the get_base_id functions no
longer take a chunk parameter.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The procedure of attaching an action template to an existing matcher had
a few issues:
1. Attaching accidentally overran the `at` array in bwc_matcher, which
would result in memory corruption. This bug wasn't triggered, but it
is possible to trigger it by attaching action templates beyond the
initial buffer size of 8. Fix this by converting to a dynamically
sized buffer and reallocating if needed.
2. Similarly, the `at` array inside the native matcher was never
reallocated. Fix this the same as above.
3. The bwc layer treated any error in action template attach as a signal
that the matcher should be rehashed to account for a larger number of
action STEs. In reality, there are other unrelated errors that can
arise and they should be propagated up the stack. Fix this by adding a
`need_rehash` output parameter that's orthogonal to error codes, as
sketched below.
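A minimal sketch of the resulting calling convention (names assumed):

  /* Returns 0 or a negative error code. *need_rehash is reported
   * independently, so callers can tell "rehash, then retry the attach"
   * apart from real failures that must be propagated.
   */
  int mlx5hws_matcher_attach_at(struct mlx5hws_matcher *matcher,
                                struct mlx5hws_action_template *at,
                                bool *need_rehash);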
Fixes: 2111bb970c78 ("net/mlx5: HWS, added backward-compatible API handling") Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dsa: microchip: add ETS scheduler support for KSZ88x3 switches
Implement Enhanced Transmission Selection scheduler (ETS) support for
KSZ88x3 devices, which support two fixed egress scheduling modes:
Strict Priority and Weighted Fair Queuing (WFQ).
Since the switch does not allow remapping priorities to queues or
adjusting weights, this implementation only supports enabling
strict priority mode. If strict mode is not explicitly requested,
the switch falls back to its default WFQ mode.
This patch introduces KSZ88x3-specific handlers for ETS add and
delete operations and uses TXQ Split Control registers to toggle
the WFQ enable bit per queue. Corresponding macros are also added
for register access.
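For illustration, strict priority mode could then be requested from
userspace with something like the following (hypothetical interface
name, assuming four TX queues):

  tc qdisc replace dev lan1 root ets bands 4 strict 4

Requesting quanta (WFQ weights) would presumably be rejected, since the
hardware weights are fixed; omitting strict bands leaves the switch in
its default WFQ mode.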
====================
net: stmmac: remove unnecessary initialisation of 1µs TIC counter
In commit 8efbdbfa9938 ("net: stmmac: Initialize MAC_ONEUS_TIC_COUNTER
register"), code to initialise the LPI 1us counter in dwmac4's
initialisation was added, making the initialisation in glue drivers
unnecessary. This series cleans up the now redundant initialisation.
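For reference, the dwmac4 core initialisation added by that commit
programs the counter from the clock rate along these lines (a sketch,
not the exact code):

  /* Number of stmmac_clk cycles per 1us, used for the LPI/EEE timers. */
  u32 us_tic = clk_get_rate(priv->plat->stmmac_clk) / 1000000 - 1;

  writel(us_tic, ioaddr + GMAC4_MAC_ONEUS_TIC_COUNTER);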
====================
GMAC_1US_TIC_COUNTER is now no longer used, so remove the definition.
This was duplicated by GMAC4_MAC_ONEUS_TIC_COUNTER further down in the
same file.
net: stmmac: intel-plat: remove eee_usecs_rate and hardware write
Remove the write to GMAC_1US_TIC_COUNTER for two reasons:
1. during initialisation or reinitialisation of the DWMAC core, the
core is reset, which sets this register back to its default value.
Writing it prior to stmmac_dvr_probe() has no effect.
2. Since commit 8efbdbfa9938 ("net: stmmac: Initialize
MAC_ONEUS_TIC_COUNTER register"), GMAC4/5 core code will set
this register based on the rate of plat->stmmac_clk. This clock
is fetched by devm_stmmac_probe_config_dt(), and plat->clk_ptp_rate
will be set to its rate, provided a "ptp_ref" clock is not provided.
In any case, Marek's commit will set the effective value of this
register.
Therefore, dwmac-intel-plat.c writing GMAC_1US_TIC_COUNTER serves no
useful purpose and can be removed.
net: stmmac: intel: remove eee_usecs_rate and hardware write
Remove the write to GMAC_1US_TIC_COUNTER for two reasons:
1. during initialisation or reinitialisation of the DWMAC core, the
core is reset, which sets this register back to its default value.
Writing it prior to stmmac_dvr_probe() has no effect.
2. Since commit 8efbdbfa9938 ("net: stmmac: Initialize
MAC_ONEUS_TIC_COUNTER register"), GMAC4/5 core code will set
this register based on the rate of plat->stmmac_clk. This clock
is created by the same code which initialises plat->eee_usecs_rate,
which is also created to run at this same rate. Since Marek's
commit, this will set this register appropriately using the
rate of this clock.
Therefore, dwmac-intel.c writing GMAC_1US_TIC_COUNTER serves no
useful purpose and can be removed.
tegra_eqos_init() initialises the 1US TIC counter for the EEE timers.
However, the DWGMAC core is reset after this write, which clears
this register to its default.
dwmac4_core_init(), however, configures this register using the same
clock after the reset - thus that is the write which ensures that the
register is correctly configured.
Therefore, tegra_eqos_init() is not required and is removed. This also
means eqos->clk_slave can be removed.
====================
net: Convert ->exit_batch_rtnl() to ->exit_rtnl().
While converting nexthop to per-netns RTNL, there are two blockers
to using rtnl_net_dereference(), flush_all_nexthops() and
__unregister_nexthop_notifier(), both of which are called from
->exit_batch_rtnl().
Instead of spreading __rtnl_net_lock() over each ->exit_batch_rtnl(),
we should convert all ->exit_batch_rtnl() to per-net ->exit_rtnl() and
run it under __rtnl_net_lock() because all ->exit_batch_rtnl() functions
do not have anything to factor out for batching.
Patches 1 & 2 factorise the undo mechanism against ->init() into a single
function, and patch 3 adds ->exit_rtnl().
Patch 4 ~ 13 convert all ->exit_batch_rtnl() users.
Patch 14 removes ->exit_batch_rtnl().
Later, we can convert pfcp and ppp to use ->exit_rtnl().
bareudp: Convert bareudp_exit_batch_rtnl() to ->exit_rtnl().
bareudp_exit_batch_rtnl() iterates the dying netns list and performs the
same operation for each.
Let's use ->exit_rtnl().
While at it, we replace unregister_netdevice_queue() with
bareudp_dellink() for better cleanup. It unlinks the device
from net_generic(net, bareudp_net_id)->bareudp_list, but there
is no real issue as both the dev and the list are freed later.
net: Add ->exit_rtnl() hook to struct pernet_operations.
struct pernet_operations provides two batching hooks: ->exit_batch()
and ->exit_batch_rtnl().
The batching variant is beneficial if ->exit() meets any of the
following conditions:
1) ->exit() repeatedly acquires a global lock for each netns
2) ->exit() has a time-consuming operation that can be factored
out (e.g. synchronize_rcu(), smp_mb(), etc)
3) ->exit() does not need to repeat the same iterations for each
netns (e.g. inet_twsk_purge())
Currently, none of the ->exit_batch_rtnl() functions satisfy any of
the above conditions because RTNL is factored out and held by the
caller and all of these functions iterate over the dying netns list.
Also, we want to hold per-netns RTNL there but avoid spreading
__rtnl_net_lock() across multiple locations.
Let's add ->exit_rtnl() hook and run it under __rtnl_net_lock().
The following patches will convert all ->exit_batch_rtnl() users
to ->exit_rtnl().
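A sketch of what the caller side might look like (helper name assumed):
each ->exit_rtnl() runs under both RTNL and the per-netns RTNL mutex,
and queues devices on a shared kill list for one batched unregister:

  static void ops_exit_rtnl_list(const struct list_head *ops_list,
                                 const struct pernet_operations *ops,
                                 struct list_head *net_exit_list)
  {
          const struct pernet_operations *saved_ops = ops;
          LIST_HEAD(dev_kill_list);
          struct net *net;

          rtnl_lock();

          list_for_each_entry(net, net_exit_list, exit_list) {
                  __rtnl_net_lock(net);

                  ops = saved_ops;
                  list_for_each_entry_continue_reverse(ops, ops_list, list)
                          if (ops->exit_rtnl)
                                  ops->exit_rtnl(net, &dev_kill_list);

                  __rtnl_net_unlock(net);
          }

          unregister_netdevice_many(&dev_kill_list);
          rtnl_unlock();
  }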
When we roll back the changes made by struct pernet_operations.init(),
we execute mostly identical sequences in three places.
* setup_net()
* cleanup_net()
* free_exit_list()
The only difference between the first two is which list and RCU helpers
to use.
In setup_net(), an ops could fail on the way, so we need to perform a
reverse walk from its previous ops in pernet_list. OTOH, in cleanup_net(),
we iterate the full list from tail to head.
The former passes the failed ops to list_for_each_entry_continue_reverse().
It's tricky, but we can reuse it for the latter if we pass list_entry() of
the head node.
Also, synchronize_rcu() and synchronize_rcu_expedited() can be easily
switched by an argument.
Let's factorise the rollback part in setup_net() and cleanup_net().
In the next patch, ops_undo_list() will be reused for free_exit_list(),
and then two arguments (ops_list and hold_rtnl) will differ.
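A minimal sketch of the factored-out helper (names assumed, with the
hold_rtnl leg reusing the ops_exit_rtnl_list() sketched above):

  static void ops_undo_list(const struct list_head *ops_list,
                            const struct pernet_operations *ops,
                            struct list_head *net_exit_list,
                            bool expedite_rcu, bool hold_rtnl)
  {
          const struct pernet_operations *saved_ops;

          /* NULL @ops means "undo everything": start the reverse walk
           * from list_entry() of the head node, so the same loop serves
           * both the partial rollback in setup_net() and the full walk
           * in cleanup_net().
           */
          if (!ops)
                  ops = list_entry(ops_list, typeof(*ops), list);
          saved_ops = ops;

          list_for_each_entry_continue_reverse(ops, ops_list, list)
                  ops_pre_exit_list(ops, net_exit_list);

          if (expedite_rcu)
                  synchronize_rcu_expedited();
          else
                  synchronize_rcu();

          if (hold_rtnl)
                  ops_exit_rtnl_list(ops_list, saved_ops, net_exit_list);

          ops = saved_ops;
          list_for_each_entry_continue_reverse(ops, ops_list, list)
                  ops_exit_list(ops, net_exit_list);

          ops = saved_ops;
          list_for_each_entry_continue_reverse(ops, ops_list, list)
                  ops_free_list(ops, net_exit_list);
  }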
Commit 0e9c127729be ("ethtool: add interface to read Tx hardware
timestamping statistics") added documentation for timestamping
statistics, but added the detailed explanation for this method to
the get_ts_info() rather than get_ts_stats(). Move it to the correct
entry.
====================
Fix late DMA unmap crash for page pool
This series fixes the late dma_unmap crash for page pool first reported
by Yonglong Liu in [0]. It is an alternative approach to the one
submitted by Yunsheng Lin, most recently in [1]. The first commit just
wraps some tests in a helper function, in preparation of the main change
in patch 2. See the commit message of patch 2 for the details.
page_pool: Track DMA-mapped pages and unmap them when destroying the pool
When enabling DMA mapping in page_pool, pages are kept DMA mapped until
they are released from the pool, to avoid the overhead of re-mapping the
pages every time they are used. This causes resource leaks and/or
crashes when there are pages still outstanding while the device is torn
down, because page_pool will attempt an unmap through a non-existent DMA
device on the subsequent page return.
To fix this, implement a simple tracking of outstanding DMA-mapped pages
in page pool using an xarray. This was first suggested by Mina[0], and
turns out to be fairly straightforward: we simply store pointers to
pages directly in the xarray with xa_alloc() when they are first DMA
mapped, and remove them from the array on unmap. Then, when a page pool
is torn down, it can simply walk the xarray and unmap all pages still
present there before returning, which also allows us to get rid of the
get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
synchronisation is needed, as a page will only ever be unmapped once.
To avoid having to walk the entire xarray on unmap to find the page
reference, we stash the ID assigned by xa_alloc() into the page
structure itself, using the upper bits of the pp_magic field. This
requires a couple of defines to avoid conflicting with the
POINTER_POISON_DELTA define, but this is all evaluated at compile-time,
so does not affect run-time performance. The bitmap calculations in this
patch give the following number of bits for different architectures:
- 23 bits on 32-bit architectures
- 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
- 32 bits on other 64-bit architectures
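A condensed sketch of both halves of the scheme (PP_DMA_INDEX_* names
and surrounding details are assumed):

  /* On first DMA map: track the page, stash the xarray id in pp_magic. */
  err = xa_alloc(&pool->dma_mapped, &id, page, PP_DMA_INDEX_LIMIT, gfp);
  if (err)
          goto unmap;
  page->pp_magic |= (unsigned long)id << PP_DMA_INDEX_SHIFT;

  /* On unmap: only the winner of the cmpxchg unmaps, so a page is never
   * unmapped twice even if pool teardown races with a returning page.
   */
  id = (page->pp_magic >> PP_DMA_INDEX_SHIFT) & PP_DMA_INDEX_MASK;
  if (xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0) == page)
          dma_unmap_page_attrs(pool->p.dev, page_pool_get_dma_addr(page),
                               PAGE_SIZE << pool->p.order, pool->p.dma_dir,
                               DMA_ATTR_SKIP_CPU_SYNC);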
Stashing a value into the unused bits of pp_magic does have the effect
that it can make the value stored there lie outside the unmappable
range (as governed by the mmap_min_addr sysctl), for architectures that
don't define ILLEGAL_POINTER_VALUE. This means that if one of the
pointers that is aliased to the pp_magic field (such as page->lru.next)
is dereferenced while the page is owned by page_pool, that could lead to
a dereference into userspace, which is a security concern. The risk of
this is mitigated by the fact that (a) we always clear pp_magic before
releasing a page from page_pool, and (b) this would need a
use-after-free bug for struct page, which can have many other risks
since page->lru.next is used as a generic list pointer in multiple
places in the kernel. As such, with this patch we take the position that
this risk is negligible in practice. For more discussion, see[1].
Since all the tracking added in this patch is performed on DMA
map/unmap, no additional code is needed in the fast path, meaning the
performance overhead of this tracking is negligible there. A
micro-benchmark shows that the total overhead of the tracking itself is
about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]).
Since this cost is only paid on DMA map and unmap, it seems like an
acceptable cost to fix the late unmap issue. Further optimisation can
narrow the cases where this cost is paid (for instance by eliding the
tracking when DMA map/unmap is a no-op).
The extra memory needed to track the pages is neatly encapsulated inside
xarray, which uses the 'struct xa_node' structure to track items. This
structure is 576 bytes long, with slots for 64 items, meaning that a
full node incurs only 9 bytes of overhead per slot it tracks (in
practice, it probably won't be this efficient, but in any case it should
be an acceptable overhead).
page_pool: Move pp_magic check into helper functions
Since we are about to stash some more information into the pp_magic
field, let's move the magic signature checks into a pair of helper
functions so it can be changed in one place.
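A minimal sketch of such a helper (the mask name is assumed):

  /* True if @page is currently owned by a page_pool: its pp_magic field
   * carries the signature once bits reused for other purposes are masked.
   */
  static inline bool page_pool_page_is_pp(struct page *page)
  {
          return (page->pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE;
  }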
Reviewed-by: Mina Almasry <almasrymina@google.com> Tested-by: Yonglong Liu <liuyonglong@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-1-6a9ef2e0cba8@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For ice:
Mateusz and Larysa add support for LLDP packets to be received on a VF
and transmitted by a VF in switchdev mode. Additional information:
https://lore.kernel.org/intel-wired-lan/20250214085215.2846063-1-larysa.zaremba@intel.com/
Karol adds timesync support for E825C devices using the 2xNAC (Network
Acceleration Complex) configuration. 2xNAC mode is the mode in which the
IO die houses two complexes, each with its own PHY connected to it.
Martyna adds messaging to clarify filter errors when recipe space is
exhausted.
Colin Ian King adds static modifier to a const array to avoid stack
usage.
For i40e:
Kyungwook Boo changes variable declaration types to prevent possible
underflow.
For ixgbe:
Rand Deeb adjusts retry values so that retries are attempted.
For igc:
Rui Salvaterra sets VLAN offloads to be enabled as default.
For e1000e:
Piotr Wejman converts driver to use newer hardware timestamping API.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
net: e1000e: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
igc: enable HW vlan tag insertion/stripping by default
ixgbe: Fix unreachable retry logic in combined and byte I2C write functions
i40e: fix MMIO write access to an invalid page in i40e_clear_hw
ice: make const read-only array dflt_rules static
ice: improve error message for insufficient filter space
ice: enable timesync operation on 2xNAC E825 devices
ice: refactor ice_sbq_msg_dev enum
ice: remove SW side band access workaround for E825
ice: enable LLDP TX for VFs through tc
ice: support egress drop rules on PF
ice: remove headers argument from ice_tc_count_lkups
ice: receive LLDP on trusted VFs
ice: do not add LLDP-specific filter if not necessary
ice: fix check for existing switch rule
====================
Jakub Kicinski [Mon, 14 Apr 2025 22:57:12 +0000 (15:57 -0700)]
Merge branch 'cpsw-bindings-for-5000m-fixed-link'
Siddharth Vadapalli says:
====================
CPSW Bindings for 5000M Fixed-Link
This series adds 5000M as a valid speed for fixed-link mode of operation
and also updates the CPSW bindings to evaluate the fixed-link property. This
series is in the context of the following device-tree overlay which
enables USXGMII 5000M Fixed-link mode of operation with CPSW on TI's
J784S4 SoC:
https://github.com/torvalds/linux/blob/v6.15-rc1/arch/arm64/boot/dts/ti/k3-j784s4-evm-usxgmii-exp1-exp2.dtso
====================
bna: bnad_dim_timeout: Rename del_timer_sync in comment
Commit 8fa7292fee5c ("treewide: Switch/rename to timer_delete[_sync]()")
switched del_timer_sync to timer_delete_sync, but did not modify the
comment for bnad_dim_timeout(). Now fix it.
Fix checkpatch-detected code style errors/warnings in net/core/pktgen.c
(remaining checkpatch checks will be addressed in a follow-up patch
set).
====================
Jakub Kicinski [Thu, 10 Apr 2025 01:42:46 +0000 (18:42 -0700)]
net: convert dev->rtnl_link_state to a bool
netdevice reg_state was split into two 16 bit enums back in 2010
in commit a2835763e130 ("rtnetlink: handle rtnl_link netlink
notifications manually"). Since the split the fields have been
moved apart, and last year we converted reg_state to a normal
u8 in commit 4d42b37def70 ("net: convert dev->reg_state to u8").
rtnl_link_state being a 16 bitfield makes no sense. Convert it
to a single bool, it seems very unlikely after 15 years that
we'll need more values in it.
We could drop dev->rtnl_link_ops from the conditions but feels
like having it there more clearly points at the reason for this
hack.
Paolo Abeni [Wed, 9 Apr 2025 20:00:56 +0000 (22:00 +0200)]
udp: properly deal with xfrm encap and ADDRFORM
UDP GRO accounting assumes that the GRO receive callback is always
set when the UDP tunnel is enabled, but syzkaller proved otherwise,
leading to the following splat:
Address the issue by moving the accounting hook into
setup_udp_tunnel_sock() and set_xfrm_gro_udp_encap_rcv(), where
the GRO callback is actually set.
set_xfrm_gro_udp_encap_rcv() is prone to races with IPV6_ADDRFORM;
run the relevant setsockopt under the socket lock to ensure it uses
consistent values of sk_family and up->encap_type.
Refactor the GRO callback selection code, to make it clear that
the function pointer is always initialized.
Reported-by: syzbot+8c469a2260132cd095c1@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=8c469a2260132cd095c1 Fixes: 172bf009c18d ("xfrm: Support GRO for IPv4 ESP in UDP encapsulation") Fixes: 5d7f5b2f6b935 ("udp_tunnel: use static call for GRO hooks when possible") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/92bcdb6899145a9a387c8fa9e3ca656642a43634.1744228733.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: hsr: sync hw addr of slave2 according to slave1 hw addr on PRP
In order to work properly PRP requires slave1 and slave2 to share the
same MAC address. To ease the configuration process on userspace tools,
sync the slave2 MAC address with slave1. In addition, when deleting the
port from the list, restore the original MAC address.
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: David S. Miller <davem@davemloft.net>
net: phy: mediatek: add Airoha PHY ID to SoC driver
The Airoha AN7581 SoC ships with a switch based on the MT753x switch
embedded in other SoCs like the MT7581 and the MT7988. Similar to those,
it requires configuring some pins to enable the LED PHYs.
Add support for the PHY ID of the Airoha embedded switch and define a
simple probe function to toggle these pins. Also fill in the LED
functions and add a dedicated function to define LED polarity.
net: phy: mediatek: permit to compile test GE SOC PHY driver
When commit 462a3daad679 ("net: phy: mediatek: fix compile-test
dependencies") fixed the dependency, it should have also introduced
an or on COMPILE_TEST to permit this driver to be compile-tested even if
NVMEM_MTK_EFUSE wasn't selected. The driver makes use of NVMEM API that
are always compiled (return error) so the driver can actually be
compiled even without that config.
Fix and simplify the dependency condition of this kernel config.
Fixes: 462a3daad679 ("net: phy: mediatek: fix compile-test dependencies") Acked-by: Daniel Golle <daniel@makrotopia.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20250410100410.348-1-ansuelsmth@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Add L2 hw acceleration for airoha_eth driver
Introduce the capability to offload L2 traffic defining flower rules in
the PSE/PPE engine available on EN7581 SoC.
Since the hw always reports L2/L3/L4 flower rules, link all L2 rules
sharing the same L2 info (with different L3/L4 info) in the L2 subflows
list of a given L2 PPE entry.
Similar to mtk driver, introduce the capability to offload L2 traffic
defining flower rules in the PSE/PPE engine available on EN7581 SoC.
Since the hw always reports L2/L3/L4 flower rules, link all L2 rules
sharing the same L2 info (with different L3/L4 info) in the L2 subflows
list of a given L2 PPE entry.
Introduce l2_flows rhashtable in airoha_ppe struct in order to
store L2 flows committed by upper layers of the kernel. This is a
preliminary patch in order to offload L2 traffic rules.
Jakub Kicinski [Sat, 12 Apr 2025 01:58:13 +0000 (18:58 -0700)]
Merge branch 'net-retire-dccp-socket'
Kuniyuki Iwashima says:
====================
net: Retire DCCP socket.
As announced by commit b144fcaf46d4 ("dccp: Print deprecation
notice."), it's time to remove DCCP socket.
The patch 2 removes net/dccp, LSM code, doc, and etc, leaving
DCCP netfilter modules.
The patch 3 unexports shared functions for DCCP, and the patch 4
renames tcp_or_dccp_get_hashinfo() to tcp_get_hashinfo().
We can do more cleanup; for example, remove IPPROTO_TCP checks in
__inet6?_check_established(), remove __module_get() for twsk,
remove timewait_sock_ops.twsk_destructor(), etc, but it will be
more of TCP stuff, so I'll defer to a later series.
DCCP was orphaned in 2021 by commit 054c4610bd05 ("MAINTAINERS: dccp:
move Gerrit Renker to CREDITS"), which noted that the last maintainer
had been inactive for five years.
In recent years, it has become a playground for syzbot, and most changes
to DCCP have been odd bug fixes triggered by syzbot. Apart from that,
the only changes have been driven by treewide or networking API updates
or adjustments related to TCP.
Thus, in 2023, we announced we would remove DCCP in 2025 via commit b144fcaf46d4 ("dccp: Print deprecation notice.").
Since then, only one individual has contacted the netdev mailing list. [0]
There is ongoing research for Multipath DCCP. The repository is hosted
on GitHub [1], and development is not taking place through the upstream
community. While the repository is published under the GPLv2 license,
the scheduling part remains proprietary, with a LICENSE file [2] stating:
"This is not Open Source software."
The researcher mentioned a plan to address the licensing issue, upstream
the patches, and step up as a maintainer, but there has been no further
communication since then.
Maintaining DCCP for a decade without any real users has become a burden.
Therefore, it's time to remove it.
Removing DCCP will also provide significant benefits to TCP. It allows
us to freely reorganize the layout of struct inet_connection_sock, which
is currently shared with DCCP, and optimize it to reduce the number of
cachelines accessed in the TCP fast path.
Note that we keep DCCP netfilter modules as requested. [3]
net: phy: air_en8811h: Add clk provider for CKO pin
EN8811H outputs 25MHz or 50MHz clocks on CKO, selected by GPIO3.
The CKO clock operates continuously from power-up through md32 loading.
Implement a clk provider driver so we can disable the clock output in
case it isn't needed, which also helps to reduce EMF noise.
Piotr Wejman [Sat, 22 Feb 2025 12:46:29 +0000 (13:46 +0100)]
net: e1000e: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
Update the driver to use the new hardware timestamping API added in commit 66f7223039c0 ("net: add NDOs for configuring hardware timestamping").
Use Netlink extack for error reporting in e1000e_config_hwtstamp.
Align the indentation of net_device_ops.
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Signed-off-by: Piotr Wejman <wejmanpm@gmail.com> Tested-by: Avigail Dahan <avigailx.dahan@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Rui Salvaterra [Thu, 13 Mar 2025 09:35:22 +0000 (09:35 +0000)]
igc: enable HW vlan tag insertion/stripping by default
This is enabled by default in other Intel drivers I've checked (e1000, e1000e,
iavf, igb and ice). Fixes an out-of-the-box performance issue when running
OpenWrt on typical mini-PCs with igc-supported Ethernet controllers and 802.1Q
VLAN configurations, as ethtool isn't part of the default packages and sane
defaults are expected.
In my specific case, with an Intel N100-based machine with four I226-V Ethernet
controllers, my upload performance increased from under 30 Mb/s to the expected
~1 Gb/s.
Signed-off-by: Rui Salvaterra <rsalvaterra@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Rand Deeb [Thu, 6 Mar 2025 10:12:00 +0000 (13:12 +0300)]
ixgbe: Fix unreachable retry logic in combined and byte I2C write functions
The current implementation of `ixgbe_write_i2c_combined_generic_int` and
`ixgbe_write_i2c_byte_generic_int` sets `max_retry` to `1`, which makes
the condition `retry < max_retry` always evaluate to `false`. This renders
the retry mechanism ineffective, as the debug message and retry logic are
never executed.
This patch increases `max_retry` to `3` in both functions, aligning them
with the retry logic in `ixgbe_read_i2c_combined_generic_int`. This
ensures that the retry mechanism functions as intended, improving
robustness in case of I2C write failures.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Rand Deeb <rand.sec96@gmail.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Colin Ian King [Mon, 17 Mar 2025 17:20:24 +0000 (17:20 +0000)]
ice: make const read-only array dflt_rules static
Don't populate the const read-only array dflt_rules on the stack at run
time, instead make it static.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ice: improve error message for insufficient filter space
When adding a rule to switch through tc, if the operation fails
due to not enough free recipes (-ENOSPC), provide a clearer
error message: "Unable to add filter: insufficient space available."
This improves user feedback by distinguishing space limitations from
other generic failures.
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Karol Kolacinski [Thu, 20 Mar 2025 13:15:38 +0000 (14:15 +0100)]
ice: enable timesync operation on 2xNAC E825 devices
According to the E825C specification, SBQ address for ports on a single
complex is device 2 for PHY 0 and device 13 for PHY1.
For accessing ports on a dual complex E825C (so called 2xNAC mode),
the driver should use destination device 2 (referred as phy_0) for
the current complex PHY and device 13 (referred as phy_0_peer) for
peer complex PHY.
Differentiate SBQ destination device by checking if current PF port
number is on the same PHY as target port number.
Adjust 'ice_get_lane_number' function to provide unique port number for
ports from PHY1 in 'dual' mode config (by adding fixed offset for PHY1
ports). Cache this value in ice_hw struct.
Introduce ice_get_primary_hw wrapper to get access to timesync register
not available from second NAC.
Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Karol Kolacinski [Thu, 20 Mar 2025 13:15:37 +0000 (14:15 +0100)]
ice: refactor ice_sbq_msg_dev enum
Rename ice_sbq_msg_dev to ice_sbq_dev_id to reflect the meaning of this
type more precisely. This enum type describes RDA (Remote Device Access)
client ids, accessible over SB (Side Band) interface.
Rename the enum elements to make the driver namespace cleaner and
consistent with other definitions within the SB.
Remove unused 'rmn_x' entries, specific to unsupported E824 device.
Adjust clients '2' and '13' names (phy_0 and phy_0_peer respectively) to
be compliant with EAS doc. According to the specification, regardless of
the complex entity (single or dual), when accessing its own ports,
they're accessed always as 'phy_0' client. And referred as 'phy_0_peer'
when handling ports connected to the other complex.
Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Karol Kolacinski [Thu, 20 Mar 2025 13:15:36 +0000 (14:15 +0100)]
ice: remove SW side band access workaround for E825
Due to a bug in the FW/NVM autoload mechanism (wrong default
SB_REM_DEV_CTL register settings), access to the peer PHY and CGU
clients was disabled by default.
As the workaround solution, the register value was overwritten by the
driver at the probe or reset handling.
Remove workaround as it's not needed anymore. The fix in autoload
procedure has been provided with NVM 3.80 version.
NOTE: at the time the fix was provided in NVM, the E825C product was
not officially available on the market, so it's not expected this change
will cause regression when running with older driver/kernel versions.
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Larysa Zaremba [Fri, 14 Feb 2025 08:50:40 +0000 (09:50 +0100)]
ice: enable LLDP TX for VFs through tc
Only a single VSI can be in charge of sending LLDP frames, sometimes it is
beneficial to assign this function to a VF, that is possible to do with tc
capabilities in the switchdev mode. It requires first blocking the PF from
sending the LLDP frames with a following command:
tc filter add dev <ifname> egress protocol lldp flower skip_sw action drop
Then it becomes possible to configure a forward rule from a VF port
representor to uplink instead.
tc filter add dev <vf_ifname> ingress protocol lldp flower skip_sw
action mirred egress redirect dev <ifname>
Previously, LLDP exclusivity was enforced by blocking LLDP traffic for
the whole port with a single rule, which the PF bypassed. Now, at least
in switchdev mode, every separate VSI has to have its own drop rule.
Another complication is the fact that tc does not respect when the
driver refuses to delete a rule, so returning an error results in a HW
rule still present with no way to reference it through tc. This is
addressed by allowing the PF rule to be deleted at any time, while
making the VF forward rule "dormant" in that case: it is deleted from HW
but stays in tc and the driver's bookkeeping, to be restored when the
drop rule is added back to the PF.
Implement tc configuration handling which enables the user to transmit LLDP
packets from VF instead of PF.
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Larysa Zaremba [Fri, 14 Feb 2025 08:50:39 +0000 (09:50 +0100)]
ice: support egress drop rules on PF
tc clsact qdisc allows us to add offloaded egress rules with commands such
as the following one:
tc filter add dev <ifname> egress protocol lldp flower skip_sw action drop
Support the egress rule drop action when added to PF, with a few caveats:
* in switchdev mode, all PF traffic has to go uplink with an exception for
LLDP that can be delegated to a single VSI at a time
* in legacy mode, we cannot delegate LLDP functionality to another VSI, so
such packets from PF should not be blocked.
Also, simplify the rule direction logic, it was previously derived from
actions, but actually can be inherited from the tc block (and flipped in
case of port representors).
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Mateusz Pacuszka [Fri, 14 Feb 2025 08:50:37 +0000 (09:50 +0100)]
ice: receive LLDP on trusted VFs
When a trusted VF tries to configure an LLDP multicast address,
configure a rule that mirrors the traffic to this VF; untrusted VFs are
not allowed to receive LLDP at all, so the request to add an LLDP MAC
address will always fail for them.
Add a forwarding LLDP filter to a trusted VF when it tries to add an LLDP
multicast MAC address. The MAC address has to be added after enabling
trust (through restarting the LLDP service).
Signed-off-by: Mateusz Pacuszka <mateuszx.pacuszka@intel.com> Co-developed-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Larysa Zaremba [Fri, 14 Feb 2025 08:50:36 +0000 (09:50 +0100)]
ice: do not add LLDP-specific filter if not necessary
Commit 34295a3696fb ("ice: implement new LLDP filter command")
introduced the ability to use LLDP-specific filter that directs all
LLDP traffic to a single VSI. However, current goal is for all trusted VFs
to be able to see LLDP neighbors, which is impossible to do with the
special filter.
Make using the generic filter the default choice and fall back to special
one only if a generic filter cannot be added. That way setups with "NVMs
where an already existent LLDP filter is blocking the creation of a filter
to allow LLDP packets" will still be able to configure software Rx LLDP on
PF only, while all other setups would be able to forward them to VFs too.
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Mateusz Pacuszka [Fri, 14 Feb 2025 08:50:35 +0000 (09:50 +0100)]
ice: fix check for existing switch rule
In case the rule already exists and another VSI wants to subscribe to
it, a new VSI list is created and both VSIs are moved to it.
Currently, the check for an already existing VSI with the same rule is
done based on fwd_id.hw_vsi_id, which applies only to the LOOKUP_RX flag.
Change it to vsi_handle. This is a software VSI ID, but it can be applied
here, because vsi_map itself is also based on it.
Additionally, change the return status in case the VSI already exists in
the VSI map to "Already exists"; such a case should be handled by the caller.
Signed-off-by: Mateusz Pacuszka <mateuszx.pacuszka@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Some stm32 implementations need the receive clock running in suspend,
as indicated by dwmac->ops->clk_rx_enable_in_suspend. The existing
code achieved this in a rather complex way, by passing a flag around.
However, the clk API prepare/enables are counted - which means that a
clock won't be stopped as long as there are more prepare and enables
than disables and unprepares, just like a reference count.
Therefore, we can simplify this logic by calling clk_prepare_enable()
an additional time in the probe function if this flag is set, and then
balancing that at remove time.
With this, we can avoid passing a "are we suspending" and "are we
resuming" flag to various functions in the driver.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
The integrated PHYs of RTL8125/8126 have their own mechanism to access
PHY parameters, similar to what r8168g_phy_param does on earlier PHY
versions. Add a helper, rtl8125_phy_param, to simplify the code.
ipv4: remove unnecessary judgment in ip_route_output_key_hash_rcu
In the ip_route_output_key_hash_rcu function, the input fl4 member saddr
is first checked to be non-zero before entering the multicast, broadcast
and arbitrary IP address checks. However, the fact that the IP address
is not 0 already rules out the 'any' address, so remove the unnecessary
check.
====================
tools: ynl: c: basic netlink-raw support
Basic support for netlink-raw AKA classic netlink in user space C codegen.
This series is enough to read routes and addresses from the kernel
(see the samples in patches 12 and 13).
Specs need to be slightly adjusted and decorated with the c naming info.
In terms of codegen this series includes just the basic plumbing required
to skip genlmsghdr and handle request types which may technically also
be legal in genetlink-legacy but are very uncommon there.
Subsequent series will add support for:
- handling CRUD-style notifications
- code gen for array types classic netlink uses
- sub-message support
Jakub Kicinski [Thu, 10 Apr 2025 01:46:55 +0000 (18:46 -0700)]
tools: ynl-gen: consider dump ops without a do "type-consistent"
If the type of the response for do and dump is the same we don't
generate it twice. This is called "type_consistent" in the generator.
Consider operations which only have dump to also be consistent.
This removes unnecessary "_dump" from the names. There's a number
of GET ops in classic Netlink which only have dump handlers.
Make sure we output the "onesided" types; normally, if the type
is consistent, we only output it when we render the do structures.
Jakub Kicinski [Thu, 10 Apr 2025 01:46:54 +0000 (18:46 -0700)]
tools: ynl: don't use genlmsghdr in classic netlink
Make sure the codegen calls the right YNL lib helper to start
the request based on family type. Classic netlink request must
not include the genl header.
Conversely don't expect genl headers in the responses.
Jakub Kicinski [Thu, 10 Apr 2025 01:46:53 +0000 (18:46 -0700)]
tools: ynl-gen: don't consider requests with fixed hdr empty
C codegen skips generating the structs if request/reply has no attrs.
In such cases the request op takes no argument and returns an int
(rather than a response struct). In case of classic netlink a lot of
information gets passed using the fixed struct, however, so adjust
the logic to consider a request empty only if it has no attrs _and_
no fixed struct.
Jakub Kicinski [Thu, 10 Apr 2025 01:46:52 +0000 (18:46 -0700)]
tools: ynl: support creating non-genl sockets
Classic netlink has static family IDs specified in YAML,
there is no family name -> ID lookup. Support providing
the ID info to the library via the generated struct and
make library use it. Since NETLINK_ROUTE is ID 0 we need
an extra boolean to indicate classic_id is to be used.