net: renesas: rswitch: remove speed from gwca structure
This field is set but never used.
GWCA is the rswitch CPU interface module, which connects rswitch to the
host over the AXI bus. The speed of the switch ports is in no way related
to GWCA operation.
net: renesas: rswitch: do not deinit disabled ports
In rswitch_ether_port_init_all(), only enabled ports are initialized.
Then, rswitch_ether_port_deinit_all() shall also only deinitialize
enabled ports.
====================
vxlan: Support user-defined reserved bits
Currently the VXLAN header validation works by vxlan_rcv() going feature
by feature, each feature clearing the bits that it consumes. If anything
is left unparsed at the end, the packet is rejected.
Unfortunately there are machines out there that send VXLAN packets with
reserved bits set, even if they are configured to not use the
corresponding features. One such report is here[1], and we have heard
similar complaints from our customers as well.
This patchset adds an attribute that makes it configurable which bits
the user wishes to tolerate and which they consider reserved. This was
recommended in [1] as well.
A knob like that inevitably allows users to set as reserved bits that
are in fact required for the features enabled by the netdevice, such as
GPE. This is detected, and such configurations are rejected.
In patches #1..#7, the reserved bits validation code is gradually moved
away from the unparsed approach described above, to one where a given
set of valid bits is precomputed and then the packet is validated
against that.
In patch #8, this precomputed set is made configurable through a new
attribute IFLA_VXLAN_RESERVED_BITS.
Patches #9 and #10 massage the testsuite a bit, so that patch #11 can
introduce a selftest for the reserved bits feature.
The corresponding iproute2 support is available in [2].
Petr Machata [Thu, 5 Dec 2024 15:41:00 +0000 (16:41 +0100)]
selftests: forwarding: Add a selftest for the new reserved_bits UAPI
Run VXLAN packets through a gateway. Flip individual bits of the packet
and/or reserved bits of the gateway, and check that the gateway treats the
packets as expected.
Petr Machata [Thu, 5 Dec 2024 15:40:59 +0000 (16:40 +0100)]
selftests: net: lib: Add several autodefer helpers
Add ip_link_set_addr(), ip_link_set_up(), ip_addr_add() and ip_route_add()
to the suite of helpers that automatically schedule a corresponding
cleanup.
When setting a new MAC, one needs to remember the old address first. Move
mac_get() from forwarding/ to that end.
Petr Machata [Thu, 5 Dec 2024 15:40:57 +0000 (16:40 +0100)]
vxlan: Add an attribute to make VXLAN header validation configurable
The set of bits that the VXLAN netdevice currently considers reserved is
defined by the features enabled at the netdevice construction. In order to
make this configurable, add an attribute, IFLA_VXLAN_RESERVED_BITS. The
payload is a pair of big-endian u32's covering the VXLAN header. This is
validated against the set of flags used by the various enabled VXLAN
features, and attempts to override bits used by an enabled feature are
bounced.
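The rejection rule described above can be sketched as follows. This is only an illustration: the struct and function names are hypothetical, the words are held in host byte order for readability (the attribute payload itself is big-endian), and the real driver code may be organised differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the 8-byte VXLAN header as two 32-bit words. */
struct sketch_hdr_bits {
	uint32_t flags;	/* first word: flags */
	uint32_t vni;	/* second word: VNI plus trailing byte */
};

/*
 * A user-supplied reserved-bits mask is acceptable only if it does not
 * mark as reserved any bit that an enabled feature actually uses.
 */
static bool sketch_reserved_bits_valid(struct sketch_hdr_bits user_reserved,
				       struct sketch_hdr_bits used)
{
	return !(user_reserved.flags & used.flags) &&
	       !(user_reserved.vni & used.vni);
}
```

With only the VNI-present bit (0x08000000) and the 24 VNI bits in use, a mask that covers the VNI-present bit would be bounced, while one that covers only the remaining bits would be accepted.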
Petr Machata [Thu, 5 Dec 2024 15:40:56 +0000 (16:40 +0100)]
vxlan: vxlan_rcv(): Drop unparsed
The code currently validates the VXLAN header in two ways: first by
comparing it with the set of reserved bits, constructed ahead of time
during the netdevice construction; and second by gradually clearing the
bits off a separate copy of VXLAN header, "unparsed". Drop the latter
validation method.
Petr Machata [Thu, 5 Dec 2024 15:40:55 +0000 (16:40 +0100)]
vxlan: Bump error counters for header mismatches
The VXLAN driver so far has not increased the error counters for packets
that set reserved bits. It does so for other packet errors, so do it for
this case as well.
Petr Machata [Thu, 5 Dec 2024 15:40:54 +0000 (16:40 +0100)]
vxlan: Track reserved bits explicitly as part of the configuration
In order to make it possible to configure which bits in VXLAN header should
be considered reserved, introduce a new field vxlan_config::reserved_bits.
Have it cover the whole header, except for the VNI-present bit and the bits
for VNI itself, and have individual enabled features clear more bits off
reserved_bits.
(This is expressed as first constructing a used_bits set, and then
inverting it to get the reserved_bits. The set of used_bits will be useful
on its own for validation of user-set reserved_bits in a following patch.)
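A minimal sketch of the used_bits/reserved_bits construction, assuming host-order words and hypothetical names (sketch_vxlanhdr, sketch_reserved_bits, sketch_hdr_ok). The feature bits are passed in as plain masks here; how the driver collects them per feature is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_HF_VNI	0x08000000u	/* VNI-present flag */
#define SKETCH_VNI_MASK	0xffffff00u	/* 24-bit VNI in the second word */

struct sketch_vxlanhdr {
	uint32_t vx_flags;
	uint32_t vx_vni;
};

/* Start from the always-used bits, add each enabled feature's bits, invert. */
static struct sketch_vxlanhdr sketch_reserved_bits(uint32_t feature_flag_bits,
						   uint32_t feature_vni_bits)
{
	struct sketch_vxlanhdr used = {
		.vx_flags = SKETCH_HF_VNI | feature_flag_bits,
		.vx_vni = SKETCH_VNI_MASK | feature_vni_bits,
	};
	struct sketch_vxlanhdr reserved = {
		.vx_flags = ~used.vx_flags,
		.vx_vni = ~used.vx_vni,
	};

	return reserved;
}

/* A received header is rejected if it sets any bit that is still reserved. */
static bool sketch_hdr_ok(struct sketch_vxlanhdr hdr,
			  struct sketch_vxlanhdr reserved)
{
	return !(hdr.vx_flags & reserved.vx_flags) &&
	       !(hdr.vx_vni & reserved.vx_vni);
}
```

Keeping used_bits as an intermediate value means the same set can later be compared against a user-supplied mask without recomputing it.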
The patch also moves a comment relevant to the validation from the unparsed
validation site up to the new site. Logically this patch should add the new
comment, and a later patch that removes the unparsed bits would remove the
old comment. But keeping both legs in the same patch is better from the
history spelunking point of view.
Petr Machata [Thu, 5 Dec 2024 15:40:53 +0000 (16:40 +0100)]
vxlan: vxlan_rcv(): Extract vxlan_hdr(skb) to a named variable
Having a named reference to the VXLAN header is more handy than having to
conjure it anew through vxlan_hdr() on every use. Add a new variable and
convert several open-coded sites.
Additionally, convert one "unparsed" use to the new variable as well. Thus
the only "unparsed" uses that remain are the flag-clearing and the header
validity check at the end.
Petr Machata [Thu, 5 Dec 2024 15:40:52 +0000 (16:40 +0100)]
vxlan: vxlan_rcv() callees: Drop the unparsed argument
The functions vxlan_remcsum() and vxlan_parse_gbp_hdr() take both the SKB
and the unparsed VXLAN header. Now that unparsed adjustment is handled
directly by vxlan_rcv(), drop this argument, and have the function derive
it from the SKB on its own.
vxlan_parse_gpe_proto() does not take SKB, so keep the header parameter.
However const it so that it's clear that the intention is that it does not
get changed.
Petr Machata [Thu, 5 Dec 2024 15:40:51 +0000 (16:40 +0100)]
vxlan: vxlan_rcv() callees: Move clearing of unparsed flags out
In order to migrate away from the use of unparsed to detect invalid flags,
move all the code that actually clears the flags from callees directly to
vxlan_rcv().
Petr Machata [Thu, 5 Dec 2024 15:40:50 +0000 (16:40 +0100)]
vxlan: In vxlan_rcv(), access flags through the vxlan netdevice
vxlan_sock.flags is constructed from vxlan_dev.cfg.flags, as the subset of
flags (named VXLAN_F_RCV_FLAGS) that is important from the point of view of
socket sharing. Attempts to reconfigure these flags during the vxlan netdev
lifetime are also bounced. It is therefore immaterial whether we access the
flags through the vxlan_dev or through the socket.
Convert the socket accesses to netdevice accesses in this separate patch to
make the conversions that take place in the following patches more obvious.
Jakub Kicinski [Thu, 5 Dec 2024 16:59:14 +0000 (08:59 -0800)]
net: reformat kdoc return statements
kernel-doc -Wall warns about missing Return: statement for non-void
functions. We have a number of kdocs in our headers which are missing
the colon, IOW they use
* Return some value
or
* Returns some value
Having the colon makes some sense, it should help kdoc parser avoid
false positives. So add them. This is mostly done with a sed script,
and removing the unnecessary cases (mostly the comments which aren't
kdoc).
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Acked-by: Alexandra Winter <wintera@linux.ibm.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://patch.msgid.link/20241205165914.1071102-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
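For illustration, a kernel-doc comment in the form the kdoc patch above converts to, with the colon after "Return". The function itself (sketch_count_set) is a made-up example and not taken from any kernel header.

```c
#include <stddef.h>

/**
 * sketch_count_set() - count the non-zero entries in an array
 * @vals: array to scan
 * @n: number of elements in @vals
 *
 * Return: the number of non-zero entries in @vals.
 */
static size_t sketch_count_set(const int *vals, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (vals[i])
			count++;
	return count;
}
```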
In configurations with 2 or more DSA clusters, allocating unique MDIO bus
names fails because only the switch ID is used. Fix this by using a
combination of the tree ID and switch ID when needed.
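A rough sketch of the naming idea, not the actual DSA code: the function name, the "dsa-" prefix and the choice to keep the short name for tree 0 are all assumptions made for the example.

```c
#include <stdio.h>

/*
 * Build a bus name from the tree ID and switch ID. In this sketch, tree 0
 * keeps the switch-only name and other trees get a combined name, so two
 * switches with the same switch ID in different trees no longer collide.
 */
static int sketch_mdio_bus_id(char *buf, size_t len,
			      unsigned int tree_id, unsigned int switch_id)
{
	if (tree_id)
		return snprintf(buf, len, "dsa-%u.%u", tree_id, switch_id);
	return snprintf(buf, len, "dsa-%u", switch_id);
}
```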
====================
rxrpc: Implement jumbo DATA transmission and RACK-TLP
Here's a series of patches to implement two main features:
(1) The transmission of jumbo data packets whereby several DATA packets of
a particular size can be glued together into a single UDP packet,
allowing us to make use of larger MTU sizes. The basic jumbo
subpacket capacity is 1412 bytes (RXRPC_JUMBO_DATALEN) and, say, an
MTU of 8192 allows five of them to be transmitted as one.
An alternative (and possibly more efficient) way would be to
expand/shrink the capacity of each DATA packet to match the MTU and
thus save on header and tail-gap overhead, but the Rx protocol does
not provide a mechanism for splitting the data - especially as the
transported data is encrypted per-packet - and so UDP fragmentation
would be the only way to handle this.
In fact, in the future, AF_RXRPC also needs to look at shrinking the
packet size where the MTU is smaller - for instance in the case of
being carried by IPv6 over wifi where there isn't room for 1412 bytes
of data.
(2) RACK-TLP to manage packet loss and retransmission in conjunction with
the congestion control algorithm.
These allow for better data throughput and work towards being able to have
larger transmission windows.
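The jumbo arithmetic in point (1) of the feature list can be checked with a small calculation. Only the 1412-byte capacity comes from the text; the header sizes (20-byte IPv4 plus 8-byte UDP, a 28-byte Rx wire header, a 4-byte jumbo header between subpackets) are assumptions for this sketch, as is the function name.

```c
#define SKETCH_JUMBO_DATALEN	1412	/* RXRPC_JUMBO_DATALEN from the text */
#define SKETCH_JUMBO_HDR	4	/* assumed header between subpackets */
#define SKETCH_RX_HDR		28	/* assumed Rx wire header */
#define SKETCH_IP_UDP_HDR	28	/* assumed IPv4 (20) + UDP (8) */

/* How many full-size subpackets fit in one UDP packet for a given MTU. */
static unsigned int sketch_jumbo_subpackets(unsigned int mtu)
{
	unsigned int overhead = SKETCH_IP_UDP_HDR + SKETCH_RX_HDR;
	unsigned int room;

	if (mtu < overhead + SKETCH_JUMBO_DATALEN)
		return 0;
	room = mtu - overhead;
	/* n subpackets need n * DATALEN + (n - 1) * JUMBO_HDR bytes. */
	return (room + SKETCH_JUMBO_HDR) /
	       (SKETCH_JUMBO_DATALEN + SKETCH_JUMBO_HDR);
}
```

Under these assumptions an MTU of 8192 yields five subpackets, matching the figure quoted above, and an MTU of 1500 yields one.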
To this end, the following changes are also made:
(1) Use a single large array of kvec structs for the I/O thread rather
than having one per transmission buffer. We need a much bigger
collection of kvecs for ping padding.
(2) Implement path-MTU probing by sending padded PING ACK packets and
monitoring for PING RESPONSE ACKs. The pmtud value determined is used
to configure the construction of jumbo DATA packets.
(3) The transmission queue is changed from a linked list of transmission
buffer structs to a linked list of transmission-queue structs, each of
which points to either 32 or 64 transmission buffers (depending on cpu
word size) and various bits of metadata are concentrated in the queue
structs rather than the buffers to make better use of the cpu cache.
(4) SACK data is stored in the transmission-queue structures in batches of
32 or 64 making it faster to process rather than being spread amongst
all the individual packet buffers.
(5) Don't change the DF flag on the UDP socket unless we need to - and
basically only enable it for path-MTU probing.
There are also some additional bits:
(1) Fix the handling of connection aborts to poke the aborted connections.
(2) Don't set the MORE-PACKETS Rx header flag on the wire. No one
actually checks it and it is, in any case, generated inconsistently
between implementations.
(3) Request an ACK when, during call transmission, there's a stall in the
app generating the data to be transmitted.
(4) Fix attention starvation in the I/O thread by making sure we go
through all outstanding events rather than returning to the beginning
of the check cycle after any time we process an event.
(5) Don't use the skbuff timestamp in the calculation of timeouts and RTT
as we really should include local processing time in that too.
Further, getting receive skbuff timestamps may be expensive.
(6) Make RTT tracking per call with the saving of the value between calls,
even within the same connection channel. The initial call timeout
starts off large to allow the server time to set up its state before
the initial reply.
(7) Don't allocate txbuf structs for ACK packets, but rather use page
frags and MSG_SPLICE_PAGES.
(8) Use irq-disabling locks for interactions between app threads and I/O
threads so that the I/O thread doesn't get held up.
(9) Make rxrpc set the REQUEST-ACK flag on an outgoing packet when cwnd is
at RXRPC_MIN_CWND (currently 4), not at 2 which it can never reach.
(10) Add some tracing bits and pieces (including displaying the userStatus
field in an ACK header) and some more stats counters (including
different sizes of jumbo packets sent/received).
David Howells [Wed, 4 Dec 2024 07:47:07 +0000 (07:47 +0000)]
rxrpc: Implement RACK/TLP to deal with transmission stalls [RFC8985]
When an rxrpc call is in its transmission phase and is sending a lot of
packets, stalls occasionally occur that cause severe performance
degradation (eg. increasing the transmission time for a 256MiB payload from
0.7s to 2.5s over a 10G link).
rxrpc already implements TCP-style congestion control [RFC5681] and this
helps mitigate the effects, but occasionally we're missing a time event
that deals with a missing ACK, leading to a stall until the RTO expires.
Fix this by implementing RACK/TLP in rxrpc.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 4 Dec 2024 07:47:06 +0000 (07:47 +0000)]
rxrpc: Fix request for an ACK when cwnd is minimum
rxrpc_prepare_data_subpacket() sets the REQUEST-ACK flag on the outgoing
DATA packet under a number of circumstances, including, theoretically, when
the cwnd is at minimum (or less). However, the minimum in this function is
hard-coded as 2, but the actual minimum is RXRPC_MIN_CWND (which is
currently 4) and so this never occurs.
Without this, we will miss the request of some ACKs, potentially leading to
a transmission stall until a timeout occurs on one side or the other that
leads to an ACK being generated.
Fix the function to use RXRPC_MIN_CWND rather than a hard-coded number.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
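The effect of the cwnd threshold change can be seen in a reduced form. The two predicates below are hypothetical stand-ins for the one condition in rxrpc_prepare_data_subpacket(); the value 4 is the current RXRPC_MIN_CWND as stated in the commit message.

```c
#include <stdbool.h>

#define SKETCH_MIN_CWND 4	/* mirrors RXRPC_MIN_CWND per the text */

/* Before: hard-coded threshold that cwnd, never below 4, cannot reach. */
static bool sketch_want_ack_old(unsigned int cwnd)
{
	return cwnd <= 2;
}

/* After: compare against the real minimum so the test can fire. */
static bool sketch_want_ack_new(unsigned int cwnd)
{
	return cwnd <= SKETCH_MIN_CWND;
}
```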
David Howells [Wed, 4 Dec 2024 07:47:05 +0000 (07:47 +0000)]
rxrpc: Manage RTT per-call rather than per-peer
Manage the determination of RTT on a per-call (ie. per-RPC op) basis rather
than on a per-peer basis, averaging across all calls going to that peer.
The problem is that the RTT measurements from the initial packets on a call
may be off because the server may do some setting up (such as getting a
lock on a file) before accepting the rest of the data in the RPC and,
further, the RTT may be affected by server-side file operations, for
instance if a large amount of data is being written or read.
Note: When handling the FS.StoreData-type RPCs, for example, the server
uses the userStatus field in the header of ACK packets as supplementary
flow control to aid in managing this. AF_RXRPC does not yet support this,
but it should be added.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 4 Dec 2024 07:47:04 +0000 (07:47 +0000)]
rxrpc: Add a reason indicator to the tx_ack tracepoint
Record the reason for the transmission of an ACK in the rxrpc_tx_ack
tracepoint, and not just in the rxrpc_propose_ack tracepoint.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 4 Dec 2024 07:47:03 +0000 (07:47 +0000)]
rxrpc: Add a reason indicator to the tx_data tracepoint
Add an indicator to the rxrpc_tx_data tracepoint to indicate what triggered
the transmission of a particular packet. At this point, it's only normal
transmission and retransmission, plus the tracepoint is also used to record
loss injection, but in a future patch, TLP-induced (re-)transmission will
also be a thing.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 4 Dec 2024 07:47:02 +0000 (07:47 +0000)]
rxrpc: Tidy up the ACK parsing a bit
Tidy up the ACK parsing in the following ways:
(1) Put the serial number of the ACK packet into the rxrpc_ack_summary
struct and access it from there whilst parsing an ACK.
(2) Be consistent about using "if (summary.acked_serial)" rather than "if
(summary.acked_serial != 0)".
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 4 Dec 2024 07:47:01 +0000 (07:47 +0000)]
rxrpc: Use irq-disabling spinlocks between app and I/O thread
Where a spinlock is used by both the application thread and the I/O thread,
use irq-disabling locking so that an interrupt taken on the app thread
doesn't also slow down the I/O thread.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 4 Dec 2024 07:47:00 +0000 (07:47 +0000)]
rxrpc: Don't allocate a txbuf for an ACK transmission
Don't allocate an rxrpc_txbuf struct for an ACK transmission. There's now
no need as the memory to hold the ACK content is allocated with a page frag
allocator. The allocation and freeing of a txbuf is just unnecessary
overhead.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 4 Dec 2024 07:46:59 +0000 (07:46 +0000)]
rxrpc: Send jumbo DATA packets
Send jumbo DATA packets if the path-MTU probing using padded PING ACK
packets shows up sufficient capacity to do so. This allows larger chunks
of data to be sent without reducing the retryability as the subpackets in a
jumbo packet can also be retransmitted individually.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 4 Dec 2024 07:46:58 +0000 (07:46 +0000)]
rxrpc: Fix initial resend timeout
The constant for the initial resend timeout is in milliseconds, but the
variable it's assigned to is in microseconds. Fix the constant to be in
microseconds.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 4 Dec 2024 07:46:57 +0000 (07:46 +0000)]
rxrpc: Fix the calculation and use of RTO
Make the following changes to the calculation and use of RTO:
(1) Fix rxrpc_resend() to use the backed-off RTO value obtained by calling
rxrpc_get_rto_backoff() rather than extracting the value itself.
Without this, it may retransmit packets too early.
(2) The RTO value being similar to the RTT causes a lot of extraneous
resends because the RTT doesn't end up taking account of clearing out
of the receive queue on the server. Worse, responses to PING-ACKs are
made as fast as possible and so are less than the DATA-requested-ACK
RTT and so skew the RTT down.
Fix this by putting a lower bound on the RTO by adding 100ms to it and
limiting the lower end to 200ms.
Fixes: c410bf01933e ("rxrpc: Fix the excessive initial retransmission timeout")
Fixes: 37473e416234 ("rxrpc: Clean up the resend algorithm")
Signed-off-by: David Howells <dhowells@redhat.com>
Suggested-by: Simon Wilkinson <sxw@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
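One possible reading of change (2) in the RTO fix, expressed in microseconds: add 100ms to the computed RTO and do not let the result fall below 200ms. The function name and the exact order of the two steps are assumptions; the patch itself may combine them differently.

```c
#include <stdint.h>

#define SKETCH_USEC_PER_MSEC 1000u

/* Pad the computed RTO by 100ms, then apply a 200ms floor. */
static uint32_t sketch_rto_us(uint32_t computed_rto_us)
{
	uint32_t rto = computed_rto_us + 100 * SKETCH_USEC_PER_MSEC;
	uint32_t floor_us = 200 * SKETCH_USEC_PER_MSEC;

	return rto < floor_us ? floor_us : rto;
}
```

With this reading, an RTT-derived RTO of 10ms becomes 200ms and one of 150ms becomes 250ms, which keeps the retransmission timer clear of the time the server needs to drain its receive queue.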
David Howells [Wed, 4 Dec 2024 07:46:56 +0000 (07:46 +0000)]
rxrpc: Display userStatus in rxrpc_rx_ack trace
Display the userStatus field from the Rx packet header in the rxrpc_rx_ack
trace line. This is used for flow control purposes by FS.StoreData-type
kafs RPC calls.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 4 Dec 2024 07:46:55 +0000 (07:46 +0000)]
rxrpc: Adjust the rxrpc_rtt_rx tracepoint
Adjust the rxrpc_rtt_rx tracepoint in the following ways:
(1) Display the collected RTT sample in the rxrpc_rtt_rx trace.
(2) Move the division of srtt by 8 to the TP_printk() rather than doing it
before invoking the trace point.
(3) Display the min_rtt value.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 4 Dec 2024 07:46:53 +0000 (07:46 +0000)]
rxrpc: Don't use received skbuff timestamps
Don't use received skbuff timestamps, but rather set a timestamp when an
ack is processed so that the time taken to get to rxrpc_input_ack() is
included in the RTT.
The timestamp of the latest ACK received is tracked in
call->acks_latest_ts.
David Howells [Wed, 4 Dec 2024 07:46:52 +0000 (07:46 +0000)]
rxrpc: Store the DATA serial in the txqueue and use this in RTT calc
Store the serial number set on a DATA packet at the point of transmission
in the rxrpc_txqueue struct and when an ACK is received, match the
reference number in the ACK by trawling the txqueue rather than sharing an
RTT table with ACK RTT. This can be done as part of Tx queue rotation.
This means we have a lot more RTT samples available and is faster to search
with all the serial numbers packed together into a few cachelines rather
than being hung off different txbufs.
David Howells [Wed, 4 Dec 2024 07:46:51 +0000 (07:46 +0000)]
rxrpc: Use the new rxrpc_tx_queue struct to more efficiently process ACKs
With the change in the structure of the transmission buffer to store
buffers in bunches of 32 or 64 (BITS_PER_LONG) we can place sets of
per-buffer flags into the rxrpc_tx_queue struct rather than storing them in
rxrpc_tx_buf, thereby vastly increasing efficiency when assessing the SACK
table in an ACK packet.
David Howells [Wed, 4 Dec 2024 07:46:50 +0000 (07:46 +0000)]
rxrpc: Adjust names and types of congestion-related fields
Adjust some of the names of fields and constants to make them look a bit
more like the TCP congestion symbol names, such as flight_size -> in_flight
and congest_mode to ca_state.
Move the persistent congestion-related fields from the rxrpc_ack_summary
struct into the rxrpc_call struct rather than copying them out and back in
again. The rxrpc_congest tracepoint can fetch them from the call struct.
Rename the counters for soft acks and nacks to have an 's' on the front to
reflect the softness, e.g. nr_acks -> nr_sacks.
Make fields counting numbers of packets or numbers of acks u16 rather than
u8 to allow for windows of up to 8192 DATA packets in flight in future.
David Howells [Wed, 4 Dec 2024 07:46:49 +0000 (07:46 +0000)]
rxrpc: Display stats about jumbo packets transmitted and received
In /proc/net/rxrpc/stats, display statistics about the numbers of different
sizes of jumbo packets transmitted and received, showing counts for 1
subpacket (ie. a non-jumbo packet), 2 subpackets, 3, ... to 8 and then 9+.
David Howells [Wed, 4 Dec 2024 07:46:48 +0000 (07:46 +0000)]
rxrpc: Replace call->acks_first_seq with tracking of the hard ACK point
Replace the call->acks_first_seq variable (which holds ack.firstPacket from
the latest ACK packet and indicates the sequence number of the first ack
slot in the SACK table) with call->acks_hard_ack which will hold the
highest sequence hard ACK'd. This is 1 less than call->acks_first_seq, but
it fits in the same schema as the other tracking variables which hold the
sequence of a packet, not one past it.
This will fix the rxrpc_congest tracepoint's calculation of SACK window
size which shows one fewer than it should - and will occasionally go to -1.
David Howells [Wed, 4 Dec 2024 07:46:47 +0000 (07:46 +0000)]
rxrpc: call->acks_hard_ack is now the same as call->tx_bottom, so remove it
Now that packets are removed from the Tx queue in the rotation function
rather than being cleaned up later, call->acks_hard_ack now advances in
step with call->tx_bottom, so remove it.
Some of the places call->acks_hard_ack is used in the rxrpc tracepoints are
replaced by call->acks_first_seq instead as that's the peer's reported idea
of the hard-ACK point.
We need to scan the buffers in the transmission queue occasionally when
processing ACKs, but the transmission queue is currently a linked list of
transmission buffers which, when we eventually expand the Tx window to 8192
packets, will be very slow to walk.
Instead, pull the fields we need to examine a lot (last sent time,
retransmitted flag) into a new struct rxrpc_txqueue and make each one hold
an array of 32 or 64 packets.
The transmission queue is then a list of these structs, each pointing to a
contiguous set of packets. Scanning is then a lot faster as the flags and
timestamps are concentrated in the CPU dcache.
The transmission timestamps are stored as a number of microseconds from a
base ktime to reduce memory requirements. This should be fine provided we
manage to transmit an entire buffer within an hour.
This will make implementing RACK-TLP [RFC8985] easier as it will be less
costly to scan the transmission buffers.
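The queue layout described above might look roughly like the following. This is a sketch under assumptions, not the rxrpc_txqueue definition: the field names are invented, the per-slot state is reduced to two bitmaps, and the timestamp offset is assumed to be a 32-bit microsecond count (which wraps after about 71 minutes, consistent with the "within an hour" remark).

```c
#include <limits.h>
#include <stdint.h>

#define SKETCH_SLOTS (sizeof(unsigned long) * CHAR_BIT)	/* 32 or 64 */

struct sketch_txqueue {
	struct sketch_txqueue *next;		/* next batch in the Tx queue */
	int64_t xmit_ts_base_ns;		/* base time for this batch */
	unsigned long sent;			/* bit per slot: transmitted */
	unsigned long acked;			/* bit per slot: acknowledged */
	uint32_t xmit_ts_offset_us[SKETCH_SLOTS]; /* per-slot offset from base */
	void *bufs[SKETCH_SLOTS];		/* the transmission buffers */
};

/* Record that slot ix was transmitted at absolute time now_ns. */
static void sketch_mark_sent(struct sketch_txqueue *tq, unsigned int ix,
			     int64_t now_ns)
{
	tq->xmit_ts_offset_us[ix] =
		(uint32_t)((now_ns - tq->xmit_ts_base_ns) / 1000);
	tq->sent |= 1UL << ix;
}

/* Count slots sent at or before cutoff_ns that have not been acked. */
static unsigned int sketch_count_overdue(const struct sketch_txqueue *tq,
					 int64_t cutoff_ns)
{
	unsigned long pending = tq->sent & ~tq->acked;
	unsigned int ix, n = 0;

	for (ix = 0; ix < SKETCH_SLOTS; ix++) {
		int64_t sent_ns = tq->xmit_ts_base_ns +
			(int64_t)tq->xmit_ts_offset_us[ix] * 1000;

		if ((pending & (1UL << ix)) && sent_ns <= cutoff_ns)
			n++;
	}
	return n;
}
```

The scan touches only the bitmaps and the packed offset array, never the buffers themselves, which is the cache benefit the commit message describes.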
David Howells [Wed, 4 Dec 2024 07:46:45 +0000 (07:46 +0000)]
rxrpc: Don't need barrier for ->tx_bottom and ->acks_hard_ack
We don't need a barrier for the ->tx_bottom value (which indicates the
lowest sequence still in the transmission queue) and the ->acks_hard_ack
value (which tracks the DATA packets hard-ack'd by the latest ACK packet
received and thus indicates which DATA packets can now be discarded) as the
app thread doesn't use either value as a reference to memory to access.
Rather, the app thread merely uses these as a guide to how much space is
available in the transmission queue.
Change the code to use READ/WRITE_ONCE() instead.
Also, change rxrpc_check_tx_space() to use the same value for tx_bottom
throughout.
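A userspace sketch of the pattern, assuming simplified stand-ins for READ_ONCE()/WRITE_ONCE() (GNU C __typeof__) and a hypothetical call struct. The point it illustrates is the last one above: tx_bottom is sampled once and that single value is used both for the space test and for whatever is reported back to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel macros; not the kernel definitions. */
#define SKETCH_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define SKETCH_WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

struct sketch_call {
	uint32_t tx_bottom;	/* lowest sequence still queued */
	uint32_t tx_top;	/* highest sequence queued */
	uint32_t tx_winsize;	/* permitted queue depth */
};

/* Is there room to queue another packet? Sample tx_bottom exactly once. */
static bool sketch_check_tx_space(struct sketch_call *call,
				  uint32_t *tx_bottom_out)
{
	uint32_t bottom = SKETCH_READ_ONCE(call->tx_bottom);

	if (tx_bottom_out)
		*tx_bottom_out = bottom;
	return call->tx_top - bottom < call->tx_winsize;
}
```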
David Howells [Wed, 4 Dec 2024 07:46:43 +0000 (07:46 +0000)]
rxrpc: Only set DF=1 on initial DATA transmission
Change how the DF flag is managed on DATA transmissions. Set it on initial
transmission and don't set it on retransmissions. Then remove the handling
for EMSGSIZE in rxrpc_send_data_packet() and just pretend it didn't happen,
leaving it to the retransmission path to retry.
The path-MTU discovery using PING ACKs is then used to probe for the
maximum DATA size - though notification by ICMP will be used if one is
received.
David Howells [Wed, 4 Dec 2024 07:46:41 +0000 (07:46 +0000)]
rxrpc: Fix CPU time starvation in I/O thread
Starvation can happen in the rxrpc I/O thread because it goes back to the
top of the I/O loop after it does any one thing without trying to give any
other connection or call CPU time. Also, because it processes one call
packet at a time, it tries to do the retransmission loop after each ACK
without checking to see if there are other ACKs already in the queue that
can update the SACK state.
Fix this by:
(1) Add a received-packet queue on each call.
(2) Distribute packets from the master Rx queue to the individual call,
conn and error queues and 'poking' calls to add them to the attend
queue first thing in the I/O thread.
(3) Go through all the attention-seeking connections and calls before
going back to the top of the I/O thread. Each queue is extracted as a
whole and then gone through so that new additions don't insert
themselves into the queue being processed.
(4) Make the call event handler go through all the packets currently on
the call's rx_queue before transmitting and retransmitting DATA
packets.
(5) Drop the skb argument from the call event handler as this is now
replaced with the rx_queue. Instead, keep track of whether we
received a packet or an ACK for the tests that used to rely on that.
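The "extract the queue as a whole" step in (3) can be illustrated with a plain singly linked queue. Everything here is hypothetical (the types, the names and the example handler); it only shows why detaching the list first bounds the work done in one pass.

```c
#include <stddef.h>

struct sketch_item {
	struct sketch_item *next;
	int work_left;
};

struct sketch_queue {
	struct sketch_item *head;
	struct sketch_item **tail;
};

static void sketch_queue_init(struct sketch_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

static void sketch_enqueue(struct sketch_queue *q, struct sketch_item *it)
{
	it->next = NULL;
	*q->tail = it;
	q->tail = &it->next;
}

/* Example handler: an item with work left re-adds itself to the queue. */
static void sketch_handle(struct sketch_queue *q, struct sketch_item *it)
{
	if (--it->work_left > 0)
		sketch_enqueue(q, it);
}

/*
 * One pass: detach everything queued right now, then process that list.
 * Items queued while the pass runs land on the emptied queue and wait for
 * the next pass, so a busy item cannot monopolise the loop.
 */
static unsigned int sketch_run_pass(struct sketch_queue *q)
{
	struct sketch_item *it = q->head, *next;
	unsigned int handled = 0;

	sketch_queue_init(q);
	for (; it; it = next) {
		next = it->next;	/* saved before the handler requeues */
		sketch_handle(q, it);
		handled++;
	}
	return handled;
}
```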
David Howells [Wed, 4 Dec 2024 07:46:40 +0000 (07:46 +0000)]
rxrpc: Add a tracepoint to show variables pertinent to jumbo packet size
Add a tracepoint to be called right before packets are transmitted for the
first time that shows variable values that are pertinent to how many
subpackets will be added to a jumbo DATA packet.
David Howells [Wed, 4 Dec 2024 07:46:39 +0000 (07:46 +0000)]
rxrpc: Prepare to be able to send jumbo DATA packets
Prepare to be able to send jumbo DATA packets if we decide to, but
don't enable that yet. This will allow larger chunks of data to be sent
without reducing the retryability as the subpackets in a jumbo packet can
also be retransmitted individually.
David Howells [Wed, 4 Dec 2024 07:46:38 +0000 (07:46 +0000)]
rxrpc: Separate the packet length from the data length in rxrpc_txbuf
Separate the packet length from the data length (txb->len) stored in the
rxrpc_txbuf to make security calculations easier. Also store the
allocation size as that's an upper bound on the size of the security
wrapper and change a number of fields to unsigned short as the amount of
data can't exceed the capacity of a UDP packet.
Also, whilst we're at it, use kzalloc() for txbufs.
David Howells [Wed, 4 Dec 2024 07:46:37 +0000 (07:46 +0000)]
rxrpc: Implement path-MTU probing using padded PING ACKs (RFC8899)
Implement path-MTU probing (along the lines of RFC8899) by padding some of
the PING ACKs we send. PING ACKs get their own individual responses quite
apart from the acking of data (though, as ACKs, they fulfil that role
also).
The probing concentrates on packet sizes that correspond to how many
subpackets can be stuffed inside a jumbo packet as jumbo DATA packets are
just aggregations of individual DATA packets and can be split easily for
retransmission purposes.
If we want to perform probing, we advertise this by setting the maximum
number of jumbo subpackets to 0 in the ack trailer when we send an ACK and
see if the peer is also advertising the service. This is interpreted by
non-supporting Rx stacks as an indication that jumbo packets aren't
supported.
The MTU sizes advertised in the ACK trailer AF_RXRPC transmits are pegged
at a maximum of 1444 unless pmtud is supported by both sides.
David Howells [Wed, 4 Dec 2024 07:46:35 +0000 (07:46 +0000)]
rxrpc: Request an ACK on impending Tx stall
Set the REQUEST-ACK flag on the DATA packet we're about to send if we're
about to stall transmission because the app layer isn't keeping up
supplying us with data to transmit.
David Howells [Wed, 4 Dec 2024 07:46:32 +0000 (07:46 +0000)]
rxrpc: Clean up Tx header flags generation handling
Clean up the generation of the header flags when building packet headers
for transmission:
(1) Assemble the flags in a local variable rather than in the txb->flags.
(2) Do the flags masking and JUMBO-PACKET setting in one bit of code for
both the main header and the jumbo headers.
(3) Generate the REQUEST-ACK flag afresh each time. There's a possibility
we might want to do jumbo retransmission packets in future.
(4) Pass the local flags variable to the rxrpc_tx_data tracepoint rather
than the combination of the txb flags and the wire header flags (the
latter belong only to the first subpacket).
David Howells [Wed, 4 Dec 2024 07:46:30 +0000 (07:46 +0000)]
rxrpc: Fix handling of received connection abort
Fix the handling of a connection abort that we've received. Though the
abort is at the connection level, it needs propagating to the calls on that
connection. Whilst the propagation bit is performed, the calls aren't then
woken up to go and process their termination, and as no further input is
forthcoming, they just hang.
Also add some tracing for the logging of connection aborts.
Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-3-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 4 Dec 2024 07:46:29 +0000 (07:46 +0000)]
ktime: Add us_to_ktime()
Add a us_to_ktime() helper to go with ms_to_ktime() and ns_to_ktime().
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Thomas Gleixner <tglx@linutronix.de>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-2-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
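For readers outside the kernel tree, the us_to_ktime() helper family amounts to unit conversion into a signed 64-bit nanosecond count. The sketch below uses its own typedef and names in userspace; it is not the kernel's definition of us_to_ktime().

```c
#include <stdint.h>

typedef int64_t sketch_ktime_t;	/* nanoseconds, as ktime_t is in the kernel */

#define SKETCH_NSEC_PER_USEC 1000LL
#define SKETCH_NSEC_PER_MSEC 1000000LL

static inline sketch_ktime_t sketch_ns_to_ktime(uint64_t ns)
{
	return (sketch_ktime_t)ns;
}

static inline sketch_ktime_t sketch_us_to_ktime(uint64_t us)
{
	return (sketch_ktime_t)us * SKETCH_NSEC_PER_USEC;
}

static inline sketch_ktime_t sketch_ms_to_ktime(uint64_t ms)
{
	return (sketch_ktime_t)ms * SKETCH_NSEC_PER_MSEC;
}
```

Having a microsecond variant avoids open-coded multiplications at call sites that keep their timeouts in microseconds.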
====================
cn10k-ipsec: Add outbound inline ipsec support
This patch series adds outbound inline ipsec support on the Marvell
cn10k series of platforms. One crypto hardware logical function
(cpt-lf) per netdev is required for inline ipsec outbound
functionality. Software prepares and submits a crypto hardware
(CPT) instruction for outbound inline ipsec crypto mode offload.
The CPT instruction has details for encryption and authentication.
Crypto hardware encrypts, authenticates and provides the ESP packet
to network hardware logic to transmit the ipsec packet.
The first patch makes dma memory writable for in-place encryption,
the second patch moves code to a common file, and the third patch
disables backpressure on crypto (CPT) and network (NIX) hardware.
Patch four onwards enables inline outbound ipsec.
v9->v10:
- Removed unlikely() in data-path and used static_branch when at least
one SA is configured.
- Added missing READ_ONCE() as per comment on previous patch
- Removed "\n" from end of extack messages
- Poll for context write status check reduced to 100ms from 10s
v8->v9:
- Removed mutex lock to use hardware, now using hardware state
- Previous versions were supporting only 64 SAs and a bitmap was
used for same. That limitation is removed from this version.
- Replaced netdev_err with NL_SET_ERR_MSG_MOD in state add flow
as per comment in previous version
v7->v8:
- spell correction in patch 1/8 (s/sdk/skb)
v6->v7:
- skb data was mapped as device writeable but it was not ensured
that skb is writeable. This version calls skb_unshare() to make
skb data writeable (Thanks Jakub Kicinski for pointing out).
v4->v5:
- Fixed un-initialized warning and pointer check
(comment from Kalesh Anakkur Purayil)
v3->v4:
- Few error messages in data-path removed and some moved
under netif_msg_tx_err().
- Added check for crypto offload (XFRM_DEV_OFFLOAD_CRYPTO)
Thanks "Leon Romanovsky" for pointing out
- Fixed codespell error as per comment from Simon Horman
- Added some other cleanup comment from Kalesh Anakkur Purayil
v2->v3:
- Fix smatch and sparse errors (Comment from Simon Horman)
- Fix build error with W=1 (Comment from Simon Horman)
https://patchwork.kernel.org/project/netdevbpf/patch/20240513105446.297451-6-bbhushan2@marvell.com/
- Some other minor cleanup as per comment
https://www.spinics.net/lists/netdev/msg997197.html
v1->v2:
- Fix compilation error when building the driver as a module
- Use dma_wmb() instead of architecture specific barrier
- Fix couple of other compilation warnings
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Bharat Bhushan [Wed, 4 Dec 2024 05:56:57 +0000 (11:26 +0530)]
cn10k-ipsec: Process outbound ipsec crypto offload
Prepare and submit a crypto hardware (CPT) instruction for
outbound ipsec crypto offload. The CPT instruction carries the
authentication offset, IV offset and encapsulation offset
in the input packet. It also provides the SA context pointer, which has
details about algo, keys, salt etc. Crypto hardware encrypts,
authenticates and provides the ESP packet to networking hardware.
Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bharat Bhushan [Wed, 4 Dec 2024 05:56:56 +0000 (11:26 +0530)]
cn10k-ipsec: Add SA add/del support for outb ipsec crypto offload
This patch adds support to add and delete Security Association
(SA) xfrm ops. Hardware maintains SA context in memory allocated
by software. Each SA context is 128-byte aligned and the size of
each context is a multiple of 128 bytes. Add support for transport
and tunnel ipsec mode, ESP protocol, aead aes-gcm-icv16, key size
128/192/256-bits with 32bit salt.
Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
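The 128-byte alignment and size rule for SA contexts can be sketched in userspace as below. The helper names are hypothetical and the code only illustrates the rounding arithmetic, not how the driver allocates contexts.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_SA_CTX_ALIGN 128u

/* Round a raw context size up to the next multiple of 128 bytes. */
static inline size_t model_sa_ctx_size(size_t raw_size)
{
	size_t mask = (size_t)MODEL_SA_CTX_ALIGN - 1;

	return (raw_size + mask) & ~mask;
}

/* Check that an address sits on a 128-byte boundary. */
static inline int model_sa_ctx_is_aligned(uintptr_t addr)
{
	return (addr & (MODEL_SA_CTX_ALIGN - 1)) == 0;
}
```

The mask trick works because 128 is a power of two; a context that needs 129 bytes therefore occupies 256.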
Bharat Bhushan [Wed, 4 Dec 2024 05:56:55 +0000 (11:26 +0530)]
cn10k-ipsec: Init hardware for outbound ipsec crypto offload
One crypto hardware logical function (cpt-lf) per netdev is
required for outbound ipsec crypto offload. Allocate, attach
and initialize one crypto hardware function when enabling
outbound ipsec crypto offload. Crypto hardware function will
be detached and freed on disabling outbound ipsec crypto
offload.
Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bharat Bhushan [Wed, 4 Dec 2024 05:56:54 +0000 (11:26 +0530)]
octeontx2-af: Disable backpressure between CPT and NIX
NIX can assert backpressure to CPT on the NIX<=>CPT link.
Keep the backpressure disabled for now. NIX block anyways
handles backpressure asserted by MAC due to PFC or flow
control pkts.
Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bharat Bhushan [Wed, 4 Dec 2024 05:56:52 +0000 (11:26 +0530)]
octeontx2-pf: map skb data as device writeable
Crypto hardware needs write permission for in-place encrypt
or decrypt operations on skb data to support IPsec crypto
offload. This patch uses skb_unshare() to make skb data writeable
for ipsec crypto offload and maps skb fragment memory as
device read-write.
Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
====================
net: net: add negotiation of in-band capabilities (remainder)
Here are the last three patches which were not included in the non-RFC
posting, but were in the RFC posting. These add the .pcs_inband()
method to the Lynx, MTK Lynx and XPCS drivers.
====================
Stas Sergeev [Thu, 5 Dec 2024 07:36:14 +0000 (10:36 +0300)]
tun: fix group permission check
Currently tun checks the group permission even if the user has matched.
Besides going against the usual permission semantic, this has a
very interesting implication: if the tun group is not among the
supplementary groups of the tun user, then effectively no one can
access the tun device. CAP_SYS_ADMIN still can, but it's the same as
not setting the tun ownership.
This patch relaxes the group checking so that either the user match
or the group match is enough. This avoids the situation when no one
can access the device even though the ownership is properly set.
Also I simplified the logic by removing the redundant inversions:
tun_not_capable() --> !tun_capable()
Signed-off-by: Stas Sergeev <stsp2@yandex.ru>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20241205073614.294773-1-stsp2@yandex.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
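The difference between the old and the relaxed check can be modelled in userspace roughly as follows. This is a simplified sketch of the logic the message describes, for a device with both owner and group configured; the structs, field names and the single "admin" flag are hypothetical and do not mirror the driver's code.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a tun device's ownership settings. */
struct model_tun {
	bool owner_set;
	unsigned int owner;
	bool group_set;
};

/* Hypothetical view of the calling process. */
struct model_caller {
	unsigned int uid;
	bool in_tun_group;	/* caller's groups include the tun group */
	bool admin;		/* caller holds the overriding capability */
};

/* Old behaviour: a mismatch on a configured owner or group denies access. */
static bool model_old_capable(const struct model_tun *t,
			      const struct model_caller *c)
{
	bool owner_mismatch = t->owner_set && c->uid != t->owner;
	bool group_mismatch = t->group_set && !c->in_tun_group;

	return c->admin || !(owner_mismatch || group_mismatch);
}

/* Relaxed behaviour: a matching owner or a matching group is sufficient. */
static bool model_new_capable(const struct model_tun *t,
			      const struct model_caller *c)
{
	if (c->admin)
		return true;
	if (t->owner_set && c->uid == t->owner)
		return true;
	if (t->group_set && c->in_tun_group)
		return true;
	return false;
}
```

In this model, the owner of the device who is not a member of the tun group is refused by the old check and accepted by the relaxed one, which is the situation the message calls out.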
Johannes Berg [Fri, 6 Dec 2024 10:30:57 +0000 (11:30 +0100)]
tools: ynl-gen-c: don't require -o argument
Without -o the tool currently crashes, but it's not marked
as required. The only thing we can't do without it is to
generate the correct #include for user source files, but
we can put a placeholder instead.
====================
net: Convert some UDP tunnel drivers to NETDEV_PCPU_STAT_DSTATS.
VXLAN, Geneve and Bareudp use various device counters for managing
RX and TX statistics:
* VXLAN uses the device core_stats for RX and TX drops, tstats for
regular RX/TX counters and DEV_STATS_INC() for various types of
RX/TX errors.
* Geneve uses tstats for regular RX/TX counters and DEV_STATS_INC()
for everything else, including RX/TX drops.
* Bareudp was recently converted to follow VXLAN behaviour, that is,
device core_stats for RX and TX drops, tstats for regular RX/TX
counters and DEV_STATS_INC() for other counter types.
Let's consolidate statistics management around the dstats counters
instead. This avoids using core_stats in VXLAN and Bareudp, as
core_stats is supposed to be used by core networking code only (and not
in drivers). This also allows Geneve to avoid using atomic increments
when updating RX and TX drop counters, as dstats is per-cpu. Finally,
this also simplifies the code as all three modules now handle stats in
the same way and with only two different sets of counters (the per-cpu
dstats and the atomic DEV_STATS_INC()).
Patch 1 creates dstats helper functions that can be used outside of VRF
(until then, dstats was VRF-specific).
Then patches 2 to 4, convert VXLAN, Geneve and Bareudp, one by one.
====================
Guillaume Nault [Wed, 4 Dec 2024 12:11:32 +0000 (13:11 +0100)]
bareudp: Handle stats using NETDEV_PCPU_STAT_DSTATS.
Bareudp uses the TSTATS infrastructure (dev_sw_netstats_*()) for RX
packet counters. It was also recently converted to use the device core
stats (dev_core_stats_*()) for RX and TX drops (see commit 788d5d655bc9
("bareudp: Use pcpu stats to update rx_dropped counter.")).
Since core stats are to be avoided in drivers, and for consistency with
VXLAN and Geneve, let's convert packet stats handling to DSTATS, which
can handle RX/TX stats and packet drops. Statistics that don't fit
DSTATS are still updated atomically with DEV_STATS_INC().
Guillaume Nault [Wed, 4 Dec 2024 12:11:30 +0000 (13:11 +0100)]
geneve: Handle stats using NETDEV_PCPU_STAT_DSTATS.
Geneve uses the TSTATS infrastructure (dev_sw_netstats_*()) for RX
packet counters. All other counters are handled using atomic increments
with DEV_STATS_INC().
Let's convert packet stats handling to DSTATS, which has a per-cpu
counter for packet drops too, to avoid the cost of atomic increments
in these cases. Statistics that don't fit DSTATS are still updated
atomically with DEV_STATS_INC().
Guillaume Nault [Wed, 4 Dec 2024 12:11:27 +0000 (13:11 +0100)]
vxlan: Handle stats using NETDEV_PCPU_STAT_DSTATS.
VXLAN uses the TSTATS infrastructure (dev_sw_netstats_*()) for RX and
TX packet counters. It also uses the device core stats
(dev_core_stats_*()) for RX and TX drops.
Let's consolidate that using the DSTATS infrastructure, which can
handle both packet counters and packet drops. Statistics that don't
fit DSTATS are still updated atomically with DEV_STATS_INC().
While there, convert the "len" variable of vxlan_encap_bypass() to
unsigned int, to respect the types of skb->len and
dev_dstats_[rt]x_add().
Guillaume Nault [Wed, 4 Dec 2024 12:11:21 +0000 (13:11 +0100)]
vrf: Make pcpu_dstats update functions available to other modules.
Currently vrf is the only module that uses NETDEV_PCPU_STAT_DSTATS.
In order to make this kind of statistics available to other modules,
we need to define the update functions in netdevice.h.
Therefore, let's define dev_dstats_*() functions for RX and TX packet
updates (packets, bytes and drops). Use these new functions in vrf.c
instead of vrf_rx_stats() and the other manual counter updates.
While there, update the type of the "len" variables to "unsigned int",
so that they're aligned with both skb->len and the new dstats update
functions.
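To illustrate why per-cpu counters avoid atomic increments, here is a small userspace model. It assumes a fixed number of CPUs and passes the CPU index explicitly; the `model_` names are hypothetical and the kernel's synchronisation details are deliberately left out.

```c
#include <stdint.h>

#define MODEL_NR_CPUS 4

/* Hypothetical per-CPU counter block, loosely modelled on dstats. */
struct model_dstats {
	uint64_t rx_packets;
	uint64_t rx_bytes;
	uint64_t rx_drops;
	uint64_t tx_packets;
	uint64_t tx_bytes;
	uint64_t tx_drops;
};

struct model_netdev {
	struct model_dstats pcpu[MODEL_NR_CPUS];
};

/* Each CPU touches only its own slot, so a plain increment is enough. */
static void model_dstats_rx_add(struct model_netdev *dev, int cpu,
				unsigned int len)
{
	dev->pcpu[cpu].rx_packets++;
	dev->pcpu[cpu].rx_bytes += len;
}

static void model_dstats_rx_dropped(struct model_netdev *dev, int cpu)
{
	dev->pcpu[cpu].rx_drops++;
}

static void model_dstats_tx_add(struct model_netdev *dev, int cpu,
				unsigned int len)
{
	dev->pcpu[cpu].tx_packets++;
	dev->pcpu[cpu].tx_bytes += len;
}

/* Readers fold the per-CPU slots into one total. */
static struct model_dstats model_dstats_sum(const struct model_netdev *dev)
{
	struct model_dstats sum = { 0 };
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		sum.rx_packets += dev->pcpu[cpu].rx_packets;
		sum.rx_bytes += dev->pcpu[cpu].rx_bytes;
		sum.rx_drops += dev->pcpu[cpu].rx_drops;
		sum.tx_packets += dev->pcpu[cpu].tx_packets;
		sum.tx_bytes += dev->pcpu[cpu].tx_bytes;
		sum.tx_drops += dev->pcpu[cpu].tx_drops;
	}
	return sum;
}
```

The cost moves from the hot path (updates) to the slow path (reading totals), which suits counters that are written per packet and read rarely.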
Jakub Kicinski [Sat, 7 Dec 2024 01:53:29 +0000 (17:53 -0800)]
Merge branch 'lan78xx-preparations-for-phylink'
Oleksij Rempel says:
====================
lan78xx: Preparations for PHYlink
This patch set is part of the preparatory work for migrating the lan78xx
USB Ethernet driver to the PHYlink framework. During extensive testing,
I observed that resetting the USB adapter can lead to various read/write
errors. While the errors themselves are acceptable, they generate
excessive log messages, resulting in significant log spam. This set
improves error handling to reduce logging noise by addressing errors
directly and returning early when necessary.
Key highlights of this series include:
- Enhanced error handling to reduce log spam while preserving the
original error values, avoiding unnecessary overwrites.
- Improved error reporting using the `%pe` specifier for better clarity
in log messages.
- Removal of redundant and problematic PHY fixups for LAN8835 and
KSZ9031, with detailed explanations in the respective patches.
- Cleanup of code structure, including unified `goto` labels for better
readability and maintainability, even in simple editors.
====================
Oleksij Rempel [Wed, 4 Dec 2024 08:41:42 +0000 (09:41 +0100)]
net: usb: lan78xx: Improve error handling in dataport and multicast writes
Update `lan78xx_dataport_write` and `lan78xx_deferred_multicast_write`
to:
- Handle errors during register read/write operations.
- Exit immediately on errors and log them using `%pe` for clarity.
- Avoid silent failures by propagating error codes properly.
Oleksij Rempel [Wed, 4 Dec 2024 08:41:41 +0000 (09:41 +0100)]
net: usb: lan78xx: Add error handling to lan78xx_irq_bus_sync_unlock
Update `lan78xx_irq_bus_sync_unlock` to handle errors in register
read/write operations. If an error occurs, log it and exit the function
appropriately. This ensures proper handling of failures during IRQ
synchronization.
Oleksij Rempel [Wed, 4 Dec 2024 08:41:39 +0000 (09:41 +0100)]
net: usb: lan78xx: Add error handling to lan78xx_init_ltm
Convert `lan78xx_init_ltm` to return error codes and handle errors
properly. Previously, errors during the LTM initialization process were
not propagated, potentially leading to undetected issues. This patch
ensures:
- Errors in `lan78xx_read_reg` and `lan78xx_write_reg` are checked and
handled.
- Errors are logged with detailed messages using `%pe` for clarity.
- The function exits immediately on error, returning the error code.
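The pattern listed above (check every register access, log once, return the original error) can be sketched in userspace like this. The device model and function names are hypothetical; `strerror()` stands in for the kernel-only `%pe` specifier.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical device model: one register plus an injectable read error. */
struct model_regdev {
	uint32_t reg;
	int read_err;	/* 0, or a negative errno returned by reads */
};

static int model_read_reg(struct model_regdev *dev, uint32_t *val)
{
	if (dev->read_err)
		return dev->read_err;
	*val = dev->reg;
	return 0;
}

static int model_write_reg(struct model_regdev *dev, uint32_t val)
{
	dev->reg = val;
	return 0;
}

/* Set bits in the register, returning the first error unchanged. */
static int model_set_bits(struct model_regdev *dev, uint32_t bits)
{
	uint32_t val;
	int ret;

	ret = model_read_reg(dev, &val);
	if (ret < 0)
		goto err;

	ret = model_write_reg(dev, val | bits);
	if (ret < 0)
		goto err;

	return 0;

err:
	/* Kernel code would print the symbolic error name via "%pe". */
	fprintf(stderr, "register access failed: %s\n", strerror(-ret));
	return ret;
}
```

A single error label keeps the logging in one place and guarantees the caller sees the code that the failing access produced.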
Oleksij Rempel [Wed, 4 Dec 2024 08:41:38 +0000 (09:41 +0100)]
net: usb: lan78xx: Improve error handling in EEPROM and OTP operations
Refine error handling in EEPROM and OTP read/write functions by:
- Returning error values immediately upon detection.
- Avoiding overwriting correct error codes with `-EIO`.
- Preserving initial error codes as they were appropriate for specific
failures.
- Using `-ETIMEDOUT` for timeout conditions instead of `-EIO`.
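A busy-wait loop following these rules might look like the userspace sketch below. The callback type, the busy bit and the two example status sources are hypothetical and exist only to exercise the loop; real driver code would also sleep between polls.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_BUSY 0x1u

/* Hypothetical status source: returns 0 and fills *val, or a negative errno. */
typedef int (*model_read_status_fn)(void *ctx, uint32_t *val);

/* Poll until the busy bit clears; keep read errors, report timeouts as such. */
static int model_wait_not_busy(model_read_status_fn read_status, void *ctx,
			       int max_polls)
{
	uint32_t val;
	int ret;
	int i;

	for (i = 0; i < max_polls; i++) {
		ret = read_status(ctx, &val);
		if (ret < 0)
			return ret;	/* pass through, do not rewrite as -EIO */
		if (!(val & MODEL_BUSY))
			return 0;
	}
	return -ETIMEDOUT;
}

/* Example source: reports busy for the first *ctx polls, then idle. */
static int model_status_countdown(void *ctx, uint32_t *val)
{
	int *remaining = ctx;

	*val = (*remaining)-- > 0 ? MODEL_BUSY : 0;
	return 0;
}

/* Example source: every read fails with -ENODEV. */
static int model_status_broken(void *ctx, uint32_t *val)
{
	(void)ctx;
	(void)val;
	return -ENODEV;
}
```

Keeping the timeout and the read failure distinct lets the caller tell a slow device from a disconnected one.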
Oleksij Rempel [Wed, 4 Dec 2024 08:41:37 +0000 (09:41 +0100)]
net: usb: lan78xx: Fix error handling in MII read/write functions
Ensure proper error handling in `lan78xx_mdiobus_read` and
`lan78xx_mdiobus_write` by checking return values of register read/write
operations and returning errors to the caller.
Oleksij Rempel [Wed, 4 Dec 2024 08:41:36 +0000 (09:41 +0100)]
net: usb: lan78xx: Improve error reporting with %pe specifier
Replace integer error codes with the `%pe` format specifier in register
read and write error messages. This change provides human-readable error
strings, making logs more informative and debugging easier.
Oleksij Rempel [Wed, 4 Dec 2024 08:41:34 +0000 (09:41 +0100)]
net: usb: lan78xx: Remove KSZ9031 PHY fixup
Remove the KSZ9031RNX PHY fixup from the lan78xx driver. The fixup applied
specific RGMII pad skew configurations globally, but these settings violate the
RGMII specification and cause more harm than benefit.
Key issues with the fixup:
1. **Non-Compliant Timing**: The fixup's delay settings fall outside the RGMII
specification requirements of 1.5 ns to 2.0 ns:
- RX Path: Total delay of **2.16 ns** (PHY internal delay of 1.2 ns + 0.96
ns skew).
- TX Path: Total delay of **0.96 ns**, significantly below the RGMII minimum
of 1.5 ns.
2. **Redundant or Incorrect Configurations**:
- The RGMII skew registers written by the fixup do not meaningfully alter
the PHY's default behavior and fail to account for its internal delays.
- The TX_DATA pad skew was not configured, relying on power-on defaults
that are insufficient for RGMII compliance.
3. **Micrel Driver Support**: By setting `PHY_INTERFACE_MODE_RGMII_ID`, the
Micrel driver can calculate and assign appropriate skew values for the
KSZ9031 PHY. This ensures better timing configurations without relying on
external fixups.
4. **System Interference**: The fixup applied globally, reconfiguring all
KSZ9031 PHYs in the system, even those unrelated to the LAN78xx adapter.
This could lead to unintended and harmful behavior on unrelated interfaces.
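The timing arithmetic in point 1 can be checked with a few lines of C. The sketch works in picoseconds to avoid floating point and takes the 1.5 ns to 2.0 ns window and the delay figures directly from the text above; the names are hypothetical.

```c
#include <stdbool.h>

/* RGMII delay window quoted in the message, in picoseconds. */
#define MODEL_RGMII_MIN_PS 1500
#define MODEL_RGMII_MAX_PS 2000

/* Totals from the message: RX = 1.2 ns internal + 0.96 ns skew, TX = 0.96 ns. */
#define MODEL_FIXUP_RX_PS (1200 + 960)
#define MODEL_FIXUP_TX_PS 960

/* True when a total delay falls inside the quoted window. */
static inline bool model_rgmii_delay_ok(int total_ps)
{
	return total_ps >= MODEL_RGMII_MIN_PS && total_ps <= MODEL_RGMII_MAX_PS;
}
```

With these figures the RX total (2160 ps) is above the window and the TX total (960 ps) is below it, so neither path passes the check.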
While the fixup is removed, a better mechanism is still needed to dynamically
determine the optimal combination of PHY and MAC delays to fully meet RGMII
requirements without relying on Device Tree or global fixups. This would allow
for robust operation across different hardware configurations.
The Micrel driver is capable of using the interface mode value to calculate and
apply better skew values, providing a configuration much closer to the RGMII
specification than the fixup. Removing the fixup ensures better default
behavior and prevents harm to other system interfaces.
Oleksij Rempel [Wed, 4 Dec 2024 08:41:33 +0000 (09:41 +0100)]
net: usb: lan78xx: Remove LAN8835 PHY fixup
Remove the PHY fixup for the LAN8835 PHY in the lan78xx driver due to
the following reasons:
- There is no publicly available information about the LAN8835 PHY.
However, it appears to be the integrated PHY used in the LAN7800 and
LAN7850 USB Ethernet controllers. These PHYs use the GMII interface,
not RGMII as configured by the fixup.
- The correct driver for handling the LAN8835 PHY functionality is the
Microchip PHY driver (`drivers/net/phy/microchip.c`), which properly
supports these integrated PHYs.
- The PHY ID `0x0007C130` is actually used by the LAN8742A PHY, which
only supports RMII. This interface is incompatible with the LAN78xx
MAC, as the LAN7801 (the only LAN78xx version without an integrated
PHY) supports only RGMII.
- The mask applied for this fixup is overly broad, inadvertently
covering both Microchip LAN88xx PHYs and unrelated SMSC LAN8742A PHYs,
leading to potential conflicts with other devices.
- Testing has shown that removing this fixup for LAN7800 and LAN7850
does not result in any noticeable difference in functionality, as the
Microchip PHY driver (`drivers/net/phy/microchip.c`) handles all
necessary configurations for these integrated PHYs.
- Registering this fixup globally (not limited to USB devices) risks
conflicts by unintentionally modifying other interfaces whenever a
LAN7801 adapter is connected to the system.
Note that both LAN7800 and LAN7850 USB Ethernet controllers use an
integrated PHY with the ID `0x0007C132`. Additionally, the LAN7515, a
specialized part for Raspberry Pi, includes an integrated LAN7800 USB
Ethernet controller and USB hub in a multifunctional chip design, and it
also uses the same PHY ID (`0x0007C132`).
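The effect of an overly broad mask can be seen in a short sketch. The two PHY IDs come from the text above; the mask value `0xfffffff0` is an assumption made for illustration, and the match function is a hypothetical model of an ID comparison under a mask.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PHY_ID_FIXUP   0x0007C130u  /* ID the fixup was registered for */
#define MODEL_PHY_ID_LAN7800 0x0007C132u  /* integrated PHY of LAN7800/LAN7850 */
#define MODEL_MASK_BROAD     0xfffffff0u  /* assumed mask: ignores the low nibble */
#define MODEL_MASK_EXACT     0xffffffffu

/* A fixup applies when the masked IDs are equal. */
static inline bool model_phy_id_match(uint32_t id, uint32_t fixup_id,
				      uint32_t mask)
{
	return (id & mask) == (fixup_id & mask);
}
```

Under the assumed broad mask, `0x0007C132` matches a fixup registered for `0x0007C130`; with an exact mask it does not.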
Jakub Kicinski [Sat, 7 Dec 2024 01:47:34 +0000 (17:47 -0800)]
Merge branch 'net-phylib-eee-cleanups'
Russell King says:
====================
net: phylib EEE cleanups
Clean up phylib's EEE support. Patches previously posted as RFC as part
of the phylink EEE series.
Patch 1 changes the Marvell driver to use the state we store in
struct phy_device, rather than manually calling
phydev->eee_cfg.eee_enabled.
Patch 2 avoids genphy_c45_ethtool_get_eee() setting ->eee_enabled, as
we copy that from phydev->eee_cfg.eee_enabled later, and after patch 3
no one uses this after calling genphy_c45_ethtool_get_eee(). In fact,
the only caller of this function now is phy_ethtool_get_eee().
As all callers to genphy_c45_eee_is_active() now pass NULL as its
is_enabled flag, this is no longer useful. Remove the argument in
patch 3.
Patch 4 updates the phylib documentation to make it absolutely clear
that phy_ethtool_get_eee() now fills in all members of struct
ethtool_keee, which is why we now have so many buggy network drivers.
====================
All callers to genphy_c45_eee_is_active() now pass NULL as the
is_enabled argument, which means we never use the value computed
in this function. Remove the argument and clean up this function.
genphy_c45_ethtool_get_eee() is only called from phy_ethtool_get_eee(),
which then calls eeecfg_to_eee(). eeecfg_to_eee() will overwrite
keee.eee_enabled, so there's no point setting keee.eee_enabled in
genphy_c45_ethtool_get_eee(). Remove this assignment.
Joe Damato [Wed, 4 Dec 2024 16:32:39 +0000 (16:32 +0000)]
selftests: net: cleanup busy_poller.c
Fix various integer type conversions by using strtoull and a temporary
variable which is bounds checked before being cast into the
appropriate cfg_* variable for use by the test program.
While here:
- free the strdup'd cfg string for overall hygiene.
- initialize napi_id = 0 in setup_queue to avoid warnings on some
compilers.
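The parse-then-bounds-check approach described above can be sketched as follows. The function name and the choice of a 16-bit target are hypothetical; the selftest's own variables and limits may differ.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a decimal string into a uint16_t via a wide temporary.
 * Returns 0 on success, -1 on malformed input or an out-of-range value.
 */
static int model_parse_u16(const char *s, uint16_t *out)
{
	unsigned long long tmp;
	char *end;

	errno = 0;
	tmp = strtoull(s, &end, 10);
	if (errno != 0 || end == s || *end != '\0')
		return -1;

	/* Bounds check happens before the narrowing cast. */
	if (tmp > UINT16_MAX)
		return -1;

	*out = (uint16_t)tmp;
	return 0;
}
```

Checking the temporary first means an oversized argument is rejected instead of being silently truncated by the cast.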
Eric Dumazet [Wed, 4 Dec 2024 21:02:34 +0000 (21:02 +0000)]
net: tipc: remove one synchronize_net() from tipc_nametbl_stop()
tipc_exit_net() is very slow and is abused by syzbot.
tipc_nametbl_stop() is called for each netns being dismantled.
Calling synchronize_net() right before freeing tn->nametbl
is a big hammer.
Replace this with kfree_rcu().
Note that RCU is not properly used here, otherwise
tn->nametbl should be cleared before the synchronize_net()
or kfree_rcu(), or even before the cleanup loop.
We might need to fix this at some point.
Also note tipc uses other synchronize_rcu() calls,
more work is needed to make tipc_exit_net() much faster.
This is V3 of the phylink conversion for ucc_geth.
The main changes in this V3 are related to error handling in the patches
1 and 10 to report an error when the deprecated "interface" property is
found in DT. Doing so, I found and addressed some issues with the jump
labels in the error paths, impacting patches 1 and 10.
The rest of the changes are just a rebase on net-next.
Some of the V2 changes haven't been reviewed, so I stress that I'm
still uncertain about the way WoL is handled in patches 4 and 10.
Thanks,
Maxime
Link to V1: https://lore.kernel.org/netdev/20241107170255.1058124-1-maxime.chevallier@bootlin.com/
Link to V2: https://lore.kernel.org/netdev/20241114153603.307872-1-maxime.chevallier@bootlin.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>