Matthieu Baerts [Tue, 10 Jan 2023 15:36:20 +0000 (16:36 +0100)]
mptcp: add new listener events
These new events have been added in kernel commit f8c9dfbd875b ("mptcp:
add pm listener events") by Geliang Tang.
Two new MPTCP Netlink event types for PM listening socket creation and
closure have been recently added. They will be available in the future
v6.2 kernel.
They have been added because MPTCP for Linux, when not using the
in-kernel PM, depends on the userspace PM to create extra listening
sockets -- called "PM listeners" -- before announcing addresses and
ports. With the existing MPTCP Netlink events, a userspace PM can create
PM listeners at startup time, or in response to an incoming connection.
Creating sockets in response to connections is not optimal: ADD_ADDRs
can't be sent until the sockets are created and listen()ed, and if all
connections are closed then it may not be clear to the userspace PM
daemon that PM listener sockets should be cleaned up. Hence these new
events: PM listening sockets can be managed based on application
activity.
Note that the maximum event string size has to be increased by 2 to be
able to display LISTENER_CREATED without truncating it.
Also, as pointed out by Mat, this event doesn't have any "token" attribute
so this attribute is now printed only if it is available.
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/313 Cc: Geliang Tang <geliang.tang@suse.com> Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Ido Schimmel [Tue, 27 Dec 2022 11:03:18 +0000 (13:03 +0200)]
dcb: Do not leave ACKs in socket receive buffer
Originally, the dcb utility only stopped receiving messages from a
socket when it found the attribute it was looking for. Cited commit
changed that, so that the utility will also stop when seeing an ACK
(NLMSG_ERROR message), by setting the NLM_F_ACK flag on requests.
This is problematic because it means a successful request will leave an
ACK in the socket receive buffer, causing the next request to bail
before reading its response.
Fix that by not stopping when finding the required attribute in a
response. Instead, stop on the subsequent ACK.
Fixes: 84c036972659 ("dcb: unblock mnl_socket_recvfrom if not message received") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
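The receive logic described in this commit can be sketched as follows. This is a minimal illustration, not the dcb code: `scan_until_ack()` and `struct scan_result` are hypothetical names, and the sketch walks a single buffer instead of reading from a socket. The wanted message is only noted when seen; the scan stops at the NLMSG_ERROR message, so no ACK is left behind for the next request.

```c
#include <linux/netlink.h>
#include <stdbool.h>

/* Hypothetical result of scanning one receive buffer. */
struct scan_result {
	bool found;	/* a message of the wanted type was seen */
	bool acked;	/* the terminating NLMSG_ERROR was consumed */
	int error;	/* error code carried by NLMSG_ERROR (0 means ACK) */
};

/*
 * Walk all netlink messages in buf. Seeing the wanted message does not
 * end the scan; only the NLMSG_ERROR message does.
 */
static struct scan_result scan_until_ack(void *buf, int len, __u16 wanted_type)
{
	struct scan_result res = { false, false, 0 };
	struct nlmsghdr *nlh;

	for (nlh = buf; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
		if (nlh->nlmsg_type == NLMSG_ERROR) {
			struct nlmsgerr *err = NLMSG_DATA(nlh);

			res.acked = true;
			res.error = err->error;
			break;
		}
		if (nlh->nlmsg_type == wanted_type)
			res.found = true;	/* remember, but keep reading */
	}
	return res;
}
```

A caller would treat the request as complete only once `acked` is set, and would then check `found` and `error`.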
Hauke Mehrtens [Fri, 23 Dec 2022 17:03:45 +0000 (18:03 +0100)]
configure: Remove include <sys/stat.h>
The check_name_to_handle_at() function in the configure script is
including sys/stat.h. This include fails with glibc 2.36 like this:
````
In file included from /linux-5.15.84/include/uapi/linux/stat.h:5,
from /toolchain-x86_64_gcc-12.2.0_glibc/include/bits/statx.h:31,
from /toolchain-x86_64_gcc-12.2.0_glibc/include/sys/stat.h:465,
from config.YExfMc/name_to_handle_at_test.c:3:
/linux-5.15.84/include/uapi/linux/types.h:10:2: warning: #warning "Attempt to use kernel headers from user space, see https://kernelnewbies.org/KernelHeaders" [-Wcpp]
10 | #warning "Attempt to use kernel headers from user space, see https://kernelnewbies.org/KernelHeaders"
| ^~~~~~~
In file included from /linux-5.15.84/include/uapi/linux/posix_types.h:5,
from /linux-5.15.84/include/uapi/linux/types.h:14:
/linux-5.15.84/include/uapi/linux/stddef.h:5:10: fatal error: linux/compiler_types.h: No such file or directory
5 | #include <linux/compiler_types.h>
| ^~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
````
Just removing the include works, the manpage of name_to_handle_at() says
only fcntl.h is needed.
Fixes: c5b72cc56bf8 ("lib/fs: fix issue when {name,open}_to_handle_at() is not implemented") Tested-by: Heiko Thiery <heiko.thiery@gmail.com> Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Ido Schimmel [Thu, 15 Dec 2022 17:52:30 +0000 (19:52 +0200)]
bridge: mdb: Add replace support
Allow user space to replace MDB port group entries by specifying the
'NLM_F_REPLACE' flag in the netlink message header.
Examples:
# bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent source_list 192.0.2.1,192.0.2.2 filter_mode include
# bridge -d -s mdb show
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.2 permanent filter_mode include proto static 0.00
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto static 0.00
dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode include source_list 192.0.2.2/0.00,192.0.2.1/0.00 proto static 0.00
# bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent source_list 192.0.2.1,192.0.2.3 filter_mode exclude proto zebra
# bridge -d -s mdb show
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 permanent filter_mode include proto zebra blocked 0.00
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra blocked 0.00
dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude source_list 192.0.2.3/0.00,192.0.2.1/0.00 proto zebra 0.00
# bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 temp source_list 192.0.2.4,192.0.2.3 filter_mode include proto bgp
# bridge -d -s mdb show
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.4 temp filter_mode include proto bgp 0.00
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 temp filter_mode include proto bgp 0.00
dev br0 port dummy10 grp 239.1.1.1 temp filter_mode include source_list 192.0.2.4/259.44,192.0.2.3/259.44 proto bgp 0.00
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David Ahern <dsahern@kernel.org>
Ido Schimmel [Thu, 15 Dec 2022 17:52:29 +0000 (19:52 +0200)]
bridge: mdb: Add routing protocol support
Allow user space to specify the routing protocol of the MDB port group
entry by adding the 'MDBE_ATTR_RTPROT' attribute to the
'MDBA_SET_ENTRY_ATTRS' nest.
Examples:
# bridge mdb add dev br0 port dummy10 grp 239.1.1.1 permanent proto zebra
# bridge mdb add dev br0 port dummy10 grp 239.1.1.2 permanent
# bridge -d mdb show
dev br0 port dummy10 grp 239.1.1.2 permanent filter_mode exclude proto static
dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude proto zebra
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David Ahern <dsahern@kernel.org>
Ido Schimmel [Thu, 15 Dec 2022 17:52:28 +0000 (19:52 +0200)]
bridge: mdb: Add source list support
Allow user space to specify the source list of (*, G) entries by adding
the 'MDBE_ATTR_SRC_LIST' attribute to the 'MDBA_SET_ENTRY_ATTRS' nest.
Example:
# bridge mdb add dev br0 port dummy10 grp 239.1.1.1 temp source_list 198.51.100.1,198.51.100.2 filter_mode exclude
# bridge -d -s mdb show
dev br0 port dummy10 grp 239.1.1.1 src 198.51.100.2 temp filter_mode include proto static blocked 0.00
dev br0 port dummy10 grp 239.1.1.1 src 198.51.100.1 temp filter_mode include proto static blocked 0.00
dev br0 port dummy10 grp 239.1.1.1 temp filter_mode exclude source_list 198.51.100.2/0.00,198.51.100.1/0.00 proto static 256.42
Ido Schimmel [Thu, 15 Dec 2022 17:52:26 +0000 (19:52 +0200)]
bridge: mdb: Split source parsing to a separate function
Currently, the only attribute inside the 'MDBA_SET_ENTRY_ATTRS' nest is
'MDBE_ATTR_SOURCE', but subsequent patches are going to add more
attributes to the nest.
Prepare for the addition of these attributes by splitting the parsing of
individual attributes inside the nest to separate functions.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David Ahern <dsahern@kernel.org>
Ido Schimmel [Thu, 15 Dec 2022 17:52:25 +0000 (19:52 +0200)]
bridge: mdb: Use a boolean to indicate nest is required
Currently, the only attribute inside the 'MDBA_SET_ENTRY_ATTRS' nest is
'MDBE_ATTR_SOURCE', but subsequent patches are going to add more
attributes to the nest.
Prepare for the addition of these attributes by determining the
necessity of the nest from a boolean variable that is set whenever one
of these attributes is parsed. This avoids the need to have one long
condition that checks for the presence of one of the individual
attributes.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David Ahern <dsahern@kernel.org>
David Ahern [Wed, 14 Dec 2022 16:04:31 +0000 (09:04 -0700)]
Merge branch 'new-ipsec-offload-type' into next
Leon Romanovsky says:
====================
From: Leon Romanovsky <leonro@nvidia.com>
Extend ip tool to support new IPsec offload mode.
Followup of the recently accepted series to netdev.
https://lore.kernel.org/r/20221209093310.4018731-1-steffen.klassert@secunet.com
Changelog:
v1:
* Changed "full offload" to "packet offload" to be aligned with kernel names.
* Rebase to latest iproute2-next
v0: https://lore.kernel.org/all/cover.1652179360.git.leonro@nvidia.com
Leon Romanovsky [Mon, 12 Dec 2022 07:54:06 +0000 (09:54 +0200)]
xfrm: add an interface to offload policy
Extend "ip xfrm policy" to allow policy offload to a specific device.
The syntax and the code follow the already established pattern from the
state offload.
The only difference between them is that direction was already a mandatory
argument in policy configuration commands, so there is no need to add
direction handling logic like it was done for the state offload.
The syntax is as follows:
$ ip xfrm policy .... offload packet dev <if-name>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Leon Romanovsky [Mon, 12 Dec 2022 07:54:05 +0000 (09:54 +0200)]
xfrm: add packet offload mode to xfrm state
Allow users to configure xfrm states with packet offload type.
Packet offload mode:
ip xfrm state offload packet dev <if-name> dir <in|out>
Crypto offload mode:
ip xfrm state offload crypto dev <if-name> dir <in|out>
ip xfrm state offload dev <if-name> dir <in|out>
The latter variant configures crypto offload mode and is needed
to provide backward compatibility.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
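The keyword handling described in this commit can be sketched as below. The enum and function names are hypothetical and are not taken from the ip source; the sketch only shows how a missing mode keyword can fall back to crypto mode so that the older "offload dev ..." form keeps working.

```c
#include <string.h>

/* Hypothetical names used for illustration only. */
enum offload_mode { OFFLOAD_CRYPTO, OFFLOAD_PACKET };

/*
 * Inspect the word that follows "offload". Returns the selected mode and
 * stores in *consumed how many words were used for the mode keyword.
 * A missing keyword (the next word is already "dev") means crypto mode.
 */
static enum offload_mode parse_offload_mode(const char *word, int *consumed)
{
	if (strcmp(word, "packet") == 0) {
		*consumed = 1;
		return OFFLOAD_PACKET;
	}
	if (strcmp(word, "crypto") == 0) {
		*consumed = 1;
		return OFFLOAD_CRYPTO;
	}
	*consumed = 0;		/* backward compatible form: no keyword */
	return OFFLOAD_CRYPTO;
}
```

The caller would advance its argument pointer by `consumed` and then parse "dev" and "dir" as before.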
Leon Romanovsky [Mon, 12 Dec 2022 07:54:04 +0000 (09:54 +0200)]
xfrm: prepare state offload logic to set mode
The offload in xfrm state requires a device and a direction to be
provided in order to activate it. However, in the help section, device and
direction were displayed as optional.
As a preparation for the addition of packet offload, let's fix the help
section and refactor the code to be clearer.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
David Ahern [Wed, 14 Dec 2022 16:01:13 +0000 (09:01 -0700)]
Merge branch 'devlink-port-function' into next
Shay Drory says:
====================
Patch implementing new netlink attribute for devlink-port function got
merged to net-next.
https://lore.kernel.org/netdev/20221206185119.380138-1-shayd@nvidia.com/
Now there is a need to support these new attributes in the userspace
tool. Implement roce and migratable port function attributes in devlink
userspace tool. Update documentation.
Shay Drory [Sun, 11 Dec 2022 11:58:48 +0000 (13:58 +0200)]
devlink: Support setting port function migratable cap
Support port function commands to enable / disable the migratable
capability; this is used to set the port function as migratable.
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.
In order for a VM to be able to perform LM, all the VM components must
be able to perform migration, i.e. to be migratable.
In order for a VF to be migratable, the VF must be bound to a VFIO driver
with migration support.
When the migratable capability is enabled for a function of the port, the
device makes the necessary preparations for the function to be
migratable, which might include disabling features which cannot be
migrated.
Example of LM with migratable function configuration:
Set migratable of the VF's port function.
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0
vfnum 1
function:
hw_addr 00:00:00:00:00:00 migratable disable
$ devlink port function set pci/0000:06:00.0/2 migratable enable
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0
vfnum 1
function:
hw_addr 00:00:00:00:00:00 migratable enable
Shay Drory [Sun, 11 Dec 2022 11:58:47 +0000 (13:58 +0200)]
devlink: Support setting port function roce cap
Support port function commands to enable / disable RoCE, this is used to
control the port RoCE device capabilities.
When RoCE is disabled for a function of the port, the function cannot
create any RoCE specific resources (e.g. GID table).
It also reduces system memory utilization. For example, disabling RoCE
enables a VF/SF to save 1 Mbyte of system memory per function.
Example of a PCI VF port which supports a port function:
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum
0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 roce enabled
$ devlink port function set pci/0000:06:00.0/2 roce disable
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum
0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 roce disabled
If `__rtnl_talk_iov` fails then callers are not expected to free `answer`.
Currently if `NLMSG_ERROR` was received with an error then the netlink
buffer was stored in `answer`, while still returning an error.
This leak can be observed by running this snippet over time.
This triggers an `NLMSG_ERROR` because for each neighbour update, `ip`
will try to query for the name of interface 9999 in the wrong netns
(which in itself is a separate bug).
set -e
ip netns del test-a || true
ip netns add test-a
ip netns del test-b || true
ip netns add test-b
ip -n test-a netns set test-b auto
ip -n test-a link add veth_a index 9999 type veth \
peer name veth_b netns test-b
ip -n test-b link set veth_b up
ip -n test-a monitor link address prefix neigh nsid label all-nsid \
> /dev/null &
monitor_pid=$!
clean() {
kill $monitor_pid
ip netns del test-a
ip netns del test-b
}
trap clean EXIT
while true; do
ip -n test-b neigh add dev veth_b 1.2.3.4 lladdr AA:AA:AA:AA:AA:AA
ip -n test-b neigh del dev veth_b 1.2.3.4
done
Fixes: 55870dfe7f8b ("Improve batch and dump times by caching link lookups") Signed-off-by: Lahav Schlesinger <lschlesinger@drivenets.com> Signed-off-by: Gilad Naaman <gnaaman@drivenets.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
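The ownership rule stated in this commit can be sketched as follows. `talk_sketch()` is a hypothetical stand-in, not `__rtnl_talk_iov` itself: `reply` and `err` model the received buffer and the error code it carried. On failure the copy is freed inside the function and `*answer` stays NULL, so the caller has nothing to free.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for a talk function. Returns 0 on success and
 * -1 on failure. The buffer is handed to the caller only on success.
 */
static int talk_sketch(const void *reply, size_t reply_len, int err,
		       void **answer)
{
	void *buf = malloc(reply_len);

	if (answer)
		*answer = NULL;
	if (!buf)
		return -1;
	memcpy(buf, reply, reply_len);

	if (err < 0 || !answer) {
		free(buf);		/* never hand the buffer out on error */
		return err < 0 ? -1 : 0;
	}

	*answer = buf;			/* success: ownership moves to caller */
	return 0;
}
```

With this shape a caller can simply return on error, and frees `answer` only after a successful call.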
Jiri Pirko [Mon, 5 Dec 2022 12:21:57 +0000 (13:21 +0100)]
devlink: push common code to __pr_out_port_handle_start_tb()
There is common code in pr_out_port_handle_start() and
pr_out_port_handle_start_arr(). As the next patch is going to extend it
even more, push the code into a common helper.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Jiri Pirko [Mon, 5 Dec 2022 12:21:56 +0000 (13:21 +0100)]
devlink: get devlink port for ifname using RTNL get link command
Currently, when the user specifies ifname as a handle on the devlink
command line, the related devlink port is looked up in a previously taken
dump of all devlink ports on the system. There are 3 problems with that:
1) The dump iterates over all devlink instances in kernel and takes a
devlink instance lock for each.
2) Dumping all devlink ports would not scale.
3) Alternative ifnames are not exposed by devlink netlink interface.
Instead, benefit from RTNL get link command extension and get the
devlink port handle info from IFLA_DEVLINK_PORT attribute, if supported.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
David Ahern [Thu, 8 Dec 2022 16:23:56 +0000 (09:23 -0700)]
Merge branch 'pcp-prio-apptrust' into next
Daniel Machon says:
====================
This patch series makes use of the newly introduced [1] DCB_APP_SEL_PCP
selector, for PCP/DEI prioritization, and DCB_ATTR_IEEE_APP_TRUST
attribute for configuring per-selector trust and trust-order.
========================================================================
New parameter "pcp-prio" to existing "app" subcommand:
========================================================================
A new pcp-prio parameter has been added to the app subcommand, which can
be used to classify traffic based on PCP and DEI from the VLAN header.
PCP and DEI are specified in a combination of numerical and symbolic
form, where 'de' (drop-eligible) means DEI=1 and 'nd' (not-drop-eligible)
means DEI=0.
Map PCP 1 and DEI 0 to priority 1
$ dcb app add dev eth0 pcp-prio 1nd:1
Map PCP 1 and DEI 1 to priority 1
$ dcb app add dev eth0 pcp-prio 1de:1
========================================================================
New apptrust subcommand for configuring per-selector trust and trust
order:
========================================================================
This new command currently has a single parameter, which lets you
specify an ordered list of trusted selectors. The microchip sparx5
driver is already enabled to offload said list of trusted selectors. The
new command has been given the name apptrust, to indicate that the trust
covers APP table selectors only. I found that 'apptrust' was better than
plain 'trust' as the latter does not indicate the scope of what is to be
trusted.
Example:
Trust selectors dscp and pcp, in that order:
$ dcb apptrust set dev eth0 order dscp pcp
Trust selectors ethtype, stream-port and pcp, in that order
$ dcb apptrust set dev eth0 order ethtype stream-port pcp
Show the trust order
$ dcb apptrust show dev eth0 order
order: ethtype stream-port pcp
A concern was raised here [2], that 'apptrust' would not work well with
matches(), so instead strcmp() has been used to match for the new
subcommand, as suggested here [3]. Same goes with pcp-prio parameter for
dcb app.
The man page for dcb_app has been extended to cover the new pcp-prio
parameter, and a new man page for dcb_apptrust has been created.
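The matches() versus strcmp() point in this cover letter can be illustrated with a small sketch. It assumes the concern is that "app" is a prefix of "apptrust", so abbreviation matching cannot tell the two subcommands apart; the cover letter does not spell this out. `is_abbrev()` and `is_exact()` are hypothetical helpers, and `is_abbrev()` only models prefix matching in general, not the exact behaviour of matches().

```c
#include <stdbool.h>
#include <string.h>

/* Prefix matching: true when typed is a non-empty abbreviation of name. */
static bool is_abbrev(const char *typed, const char *name)
{
	size_t len = strlen(typed);

	return len > 0 && strncmp(typed, name, len) == 0;
}

/* Exact matching, the approach chosen for the new subcommand. */
static bool is_exact(const char *typed, const char *name)
{
	return strcmp(typed, name) == 0;
}
```

With prefix matching, a user typing "app" would also match "apptrust"; with exact matching only the full word selects the new subcommand.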
Daniel Machon [Mon, 5 Dec 2022 22:21:45 +0000 (23:21 +0100)]
dcb: add new subcommand for apptrust
Add new apptrust subcommand for the dcbnl apptrust extension object.
The apptrust command lets you specify a consecutive ordered list of
trusted selectors, which can be used by drivers to determine which
selectors are eligible (trusted) for packet prioritization, and in which
order.
Selectors are sent in a new nested attribute:
DCB_ATTR_IEEE_APP_TRUST_TABLE. The nest contains trusted selectors
encapsulated in either DCB_ATTR_IEEE_APP or DCB_ATTR_DCB_APP attributes,
for standard and non-standard selectors, respectively.
Example:
Trust selectors dscp and pcp, in that order
$ dcb apptrust set dev eth0 order dscp pcp
Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Daniel Machon [Mon, 5 Dec 2022 22:21:44 +0000 (23:21 +0100)]
dcb: add new pcp-prio parameter to dcb app
Add a new pcp-prio parameter to the app subcommand, which can be used to
classify traffic based on PCP and DEI from the VLAN header. PCP and DEI
are specified in a combination of numerical and symbolic form, where 'de'
(drop-eligible) means DEI=1 and 'nd' (not-drop-eligible) means DEI=0.
Map PCP 1 and DEI 0 to priority 1
$ dcb app add dev eth0 pcp-prio 1nd:1
Map PCP 1 and DEI 1 to priority 1
$ dcb app add dev eth0 pcp-prio 1de:1
Internally, PCP and DEI are encoded in the protocol field of the dcb_app
struct. Each combination of PCP and DEI maps to a priority, thus needing
a range of 0-15. A well formed dcb_app entry for PCP/DEI
prioritization, could look like:
Also, three helper functions for translating between std and non-std APP
selectors, have been added to dcb_app.c and exposed through dcb.h.
Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Jacob Keller [Mon, 5 Dec 2022 22:59:31 +0000 (14:59 -0800)]
devlink: support direct region read requests
The kernel has gained support for reading from regions without needing to
create a snapshot. To use this support, the DEVLINK_ATTR_REGION_DIRECT
attribute must be added to the command.
For the "read" command, if the user did not specify a snapshot, add the new
attribute to request a direct read. The "dump" command will still require a
snapshot. While technically a dump could be performed without a snapshot it
is not guaranteed to be atomic unless the region size is no larger than
256 bytes.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Setting a parent during creation of the node doesn't work, despite
documentation [1] clearly saying that it should.
[1] man/man8/devlink-rate.8
Example:
$ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0
Unknown option "parent"
Fix this by passing DL_OPT_PORT_FN_RATE_PARENT as an argument to
dl_argv_parse() when it gets called from cmd_port_fn_rate_add().
Fixes: 6c70aca76ef2 ("devlink: Add port func rate support") Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
devlink: Add documentation for tx_prority and tx_weight
New netlink attributes tx_priority and tx_weight were added.
Update the man page for devlink-rate to account for new attributes.
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: David Ahern <dsahern@kernel.org>
devlink: Introduce new attribute 'tx_weight' to devlink-rate
To fully utilize the hierarchical QoS algorithm, a new attribute
'tx_weight' needs to be introduced. The weight attribute allows for usage
of the Weighted Fair Queuing arbitration scheme among siblings. This
arbitration scheme can be used simultaneously with the strict priority.
Introduce ability to configure tx_weight from devlink userspace
utility. Make the new attribute optional.
Example commands:
$ devlink port function rate add pci/0000:4b:00.0/node_custom \
tx_weight 50 parent node_0
$ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 20
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: David Ahern <dsahern@kernel.org>
devlink: Introduce new attribute 'tx_priority' to devlink-rate
To fully utilize the hierarchical QoS algorithm, a new attribute
'tx_priority' needs to be introduced. The priority attribute allows for
usage of a strict priority arbiter among siblings. This arbitration scheme
attempts to schedule nodes based on their priority as long as the nodes
remain within their bandwidth limit.
Introduce ability to configure tx_priority from devlink userspace
utility. Make the new attribute optional.
Example commands:
$ devlink port function rate add pci/0000:4b:00.0/node_custom \
tx_priority 5 parent node_0
$ devlink port function rate set pci/0000:4b:00.0/2 tx_priority 5
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Leonard Crestez [Thu, 1 Dec 2022 21:41:05 +0000 (23:41 +0200)]
ip neigh: Support --json on ip neigh get
The ip neigh command supports --json for "list" but not for "get". Add
json support for the "get" command so that it's possible to fetch
information about specific neighbors without regular expressions.
Fixes: aac7f725fa46 ("ipneigh: add color and json support") Signed-off-by: Leonard Crestez <cdleonard@gmail.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Jason Wang [Tue, 29 Nov 2022 04:28:16 +0000 (12:28 +0800)]
vdpa: allow provisioning device features
This patch allows device features to be provisioned via vdpa. This
will be useful for preserving migration compatibility between source
and destination:
# vdpa dev add name dev1 mgmtdev pci/0000:02:00.0 device_features 0x300020000
# vdpa dev config show dev1
# dev1: mac 52:54:00:12:34:56 link up link_announce false mtu 65535
negotiated_features CTRL_VQ VERSION_1 ACCESS_PLATFORM
Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Tan Tee Min [Fri, 2 Dec 2022 06:25:42 +0000 (14:25 +0800)]
taprio: fix wrong for loop condition in add_tc_entries()
The for loop in add_tc_entries() mistakenly included the last entry
index+1. Fix it to correctly loop the max_sdu entry between tc=0 and
num_max_sdu_entries-1.
Fixes: b10a6509c195 ("taprio: support dumping and setting per-tc max SDU") Signed-off-by: Tan Tee Min <tee.min.tan@linux.intel.com> Signed-off-by: David Ahern <dsahern@kernel.org>
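The loop bound described in this commit can be sketched as below. `sum_max_sdu()` is a hypothetical helper, not add_tc_entries() itself; it only shows a loop that covers tc = 0 .. n - 1 and nothing beyond.

```c
/*
 * Hypothetical sketch: add up the max_sdu entries for tc = 0 .. n - 1.
 * A condition such as "tc <= n" would also read index n, one past the
 * last valid entry.
 */
static unsigned int sum_max_sdu(const unsigned int *max_sdu, int n)
{
	unsigned int sum = 0;
	int tc;

	for (tc = 0; tc < n; tc++)
		sum += max_sdu[tc];
	return sum;
}
```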
Hangbin Liu [Tue, 8 Nov 2022 12:43:44 +0000 (20:43 +0800)]
ip: fix return value for rtnl_talk failures
Since my last commit "rtnetlink: add new function rtnl_echo_talk()" we
return the kernel rtnl exit code directly, which breaks some kernel
selftest checking. As there are still a lot of tests checking -2 as the
error return value, to keep backward compatibility, let's keep using
-2 for all the rtnl return values.
Reported-by: Ido Schimmel <idosch@idosch.org> Fixes: 6c09257f1bf6 ("rtnetlink: add new function rtnl_echo_talk()") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Jiri Pirko [Wed, 9 Nov 2022 12:48:51 +0000 (13:48 +0100)]
devlink: load ifname map on demand from ifname_map_rev_lookup() as well
Commit 5cddbb274eab ("devlink: load port-ifname map on demand") changed
the ifname map to be loaded on demand from ifname_map_lookup(). However,
it didn't put this on-demand loading into ifname_map_rev_lookup() which
causes ifname_map_rev_lookup() to return -ENOENT all the time.
Fix this by triggering on-demand ifname map load
from ifname_map_rev_lookup() as well.
Fixes: 5cddbb274eab ("devlink: load port-ifname map on demand") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
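The on-demand loading described in this commit can be sketched as follows. The structure and function names are hypothetical and heavily simplified (a single map entry, a fixed stand-in for the dump); the point is only that both lookup directions go through the same load step.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, heavily simplified map with a single entry. */
struct ifname_map {
	bool loaded;
	const char *ifname;
	unsigned int port_index;
};

/* Stand-in for the dump that populates the map. */
static void map_load(struct ifname_map *m)
{
	m->ifname = "eth0";
	m->port_index = 1;
	m->loaded = true;
}

static void map_ensure_loaded(struct ifname_map *m)
{
	if (!m->loaded)
		map_load(m);
}

/* Forward lookup: ifname to port index. */
static int map_lookup(struct ifname_map *m, const char *ifname,
		      unsigned int *port_index)
{
	map_ensure_loaded(m);
	if (strcmp(m->ifname, ifname) != 0)
		return -1;
	*port_index = m->port_index;
	return 0;
}

/* Reverse lookup: port index to ifname. It needs the same load step. */
static int map_rev_lookup(struct ifname_map *m, unsigned int port_index,
			  const char **ifname)
{
	map_ensure_loaded(m);
	if (m->port_index != port_index)
		return -1;
	*ifname = m->ifname;
	return 0;
}
```

Without the `map_ensure_loaded()` call in the reverse lookup, a reverse lookup made before any forward lookup would always search an empty map.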
tc_util: Change datatype for maj to avoid overflow issue
The return value of strtoul() is unsigned long int. Hence the datatype
for maj should be defined as unsigned long to avoid an overflow issue.
Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com> Signed-off-by: Lai Peter Jun Ann <jun.ann.lai@intel.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
tc_util: Fix no error return when large parent id used
This patch fixes the issue where no error is returned
when a large parent ID value is used. The return value of
strtoul() is unsigned long int. Hence the datatype for maj and min
should be defined as unsigned long to avoid an overflow issue.
Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com> Signed-off-by: Lai Peter Jun Ann <jun.ann.lai@intel.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
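The type issue behind the two tc_util commits above can be sketched as below. `parse_major()` is a hypothetical helper, and the 16-bit limit is an assumption made for the example. Keeping the parsed value in the same type that strtoul() returns means an oversized input is still visible to the range check instead of being silently truncated first.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: parse a hexadecimal major number that must fit in
 * 16 bits. Returns 0 and stores the value on success, -1 otherwise.
 */
static int parse_major(const char *str, unsigned int *out)
{
	char *end;
	unsigned long maj;	/* same type strtoul() returns: no truncation */

	maj = strtoul(str, &end, 16);
	if (end == str || *end != '\0')
		return -1;	/* nothing parsed or trailing garbage */
	if (maj > 0xffffUL)
		return -1;	/* too large; a narrower type could hide this */
	*out = (unsigned int)maj;
	return 0;
}
```

If `maj` were a 32-bit type on a platform with a 64-bit unsigned long, an input such as "100000001" would be truncated to 1 before the check and wrongly accepted.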
Ido Schimmel [Sun, 6 Nov 2022 11:39:57 +0000 (13:39 +0200)]
man: bridge: Reword description of "locked" bridge port option
Adjust the description to mention the "no_linklocal_learn" bridge option
and make sure it is consistent between both the bridge(8) and ip-link(8)
man pages.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Andrea Claudi [Thu, 3 Nov 2022 17:39:25 +0000 (18:39 +0100)]
json: do not escape single quotes
The ECMA-404 standard does not include the single quote character among
the json escape sequences. This means single quotes do not need to be
escaped. Indeed the single quote escape produces an invalid json output:
$ ip link add "john's" type dummy
$ ip link show "john's"
9: john's: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether c6:8e:53:f6:a3:4b brd ff:ff:ff:ff:ff:ff
$ ip -j link | jq .
parse error: Invalid escape at line 1, column 765
This can be fixed by removing the single quote escape in jsonw_puts.
With this patch in place:
$ ip -j link | jq .[].ifname
"lo"
"john's"
Fixes: fcc16c2287bf ("provide common json output formatter") Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
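The escaping rule discussed in this commit can be sketched as below. `json_escape()` is a hypothetical helper, not jsonw_puts(); it handles only a few characters (quotation mark, backslash, newline, tab) and copies the single quote unchanged.

```c
#include <stddef.h>

/*
 * Hypothetical helper: copy in to out, escaping the quotation mark, the
 * backslash, newline and tab. The single quote is copied unchanged.
 * Returns 0 on success, -1 if out is too small. Other control characters
 * are not handled in this sketch.
 */
static int json_escape(const char *in, char *out, size_t outsz)
{
	size_t o = 0;

	for (; *in; in++) {
		const char *esc = NULL;

		switch (*in) {
		case '"':  esc = "\\\""; break;
		case '\\': esc = "\\\\"; break;
		case '\n': esc = "\\n";  break;
		case '\t': esc = "\\t";  break;
		default:   break;	/* includes the single quote */
		}
		if (esc) {
			if (o + 2 >= outsz)
				return -1;
			out[o++] = esc[0];
			out[o++] = esc[1];
		} else {
			if (o + 1 >= outsz)
				return -1;
			out[o++] = *in;
		}
	}
	if (o >= outsz)
		return -1;
	out[o] = '\0';
	return 0;
}
```

An interface name such as john's passes through unchanged, while a quotation mark or backslash in the input gains a leading backslash.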
Vladimir Oltean [Fri, 28 Oct 2022 11:50:53 +0000 (14:50 +0300)]
taprio: support dumping and setting per-tc max SDU
The 802.1Q queueMaxSDU table is technically implemented in Linux as
the TCA_TAPRIO_TC_ENTRY_MAX_SDU attribute of the TCA_TAPRIO_ATTR_TC_ENTRY
nest. Multiple TCA_TAPRIO_ATTR_TC_ENTRY nests may appear in the netlink
message, one per traffic class. Other configuration items that are per
traffic class are also supposed to go there.
This is done for future extensibility of the netlink interface (I have
the feeling that the struct tc_mqprio_qopt passed through
TCA_TAPRIO_ATTR_PRIOMAP is not exactly extensible, which kind of defeats
the purpose of using netlink). But otherwise, the max-sdu is parsed from
the user, and printed, just like any other fixed-size 16 element array.
I've modified the example for a fully offloaded configuration (flags 2)
to also show a max-sdu use case. The gate intervals were 0x80 (for TC 7),
0xa0 (for TCs 7 and 5) and 0xdf (for TCs 7, 6, 4, 3, 2, 1, 0).
I modified the last gate to exclude TC 7 (df -> 5f), so that TC 7 now
only interferes with TC 5.
Output after running the full offload command from the man page example
(the new attribute is "max-sdu"):
Benjamin Poirier [Wed, 26 Oct 2022 06:49:07 +0000 (15:49 +0900)]
ip-monitor: Do not error out when RTNLGRP_STATS is not available
Following commit 4e8a9914c4d4 ("ip-monitor: Include stats events in default
and "all" cases"), `ip monitor` fails to start on kernels which do not
contain linux.git commit 5fd0b838efac ("net: rtnetlink: Add UAPI toggle for
IFLA_OFFLOAD_XSTATS_L3_STATS") because the netlink group RTNLGRP_STATS
doesn't exist:
$ ip monitor
Failed to add stats group to list
When "stats" is not explicitly requested, ignore the error so that `ip
monitor` and `ip monitor all` continue to work on older kernels.
Note that the same change is not done for RTNLGRP_NEXTHOP because its value
is 32 and group numbers <= 32 are always supported; see the comment above
netlink_change_ngroups() in the kernel source. Therefore
NETLINK_ADD_MEMBERSHIP 32 does not error out even on kernels which do not
support RTNLGRP_NEXTHOP.
v2:
* Silently ignore a failure to implicitly add the stats group, instead of
printing a warning.
Reported-by: Stephen Hemminger <stephen@networkplumber.org> Fixes: 4e8a9914c4d4 ("ip-monitor: Include stats events in default and "all" cases") Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
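The error handling described in this commit can be sketched as follows. All names are hypothetical: `join_fn` stands for whatever call subscribes to a netlink group, and `join_always_fails()` models a kernel that does not know the group.

```c
#include <stdbool.h>

/* Hypothetical callback type: returns 0 on success, negative on failure. */
typedef int (*join_fn)(unsigned int group);

/* Stand-in for a kernel without the stats group. */
static int join_always_fails(unsigned int group)
{
	(void)group;
	return -1;
}

/*
 * Join the stats group. A failure is fatal only when the user asked for
 * "stats" explicitly; for the default and "all" cases it is ignored.
 */
static int join_stats_group(join_fn join, unsigned int group,
			    bool explicit_request)
{
	if (join(group) < 0 && explicit_request)
		return -1;
	return 0;
}
```

With this shape, a default or "all" invocation still starts on an older kernel, while an explicit request for stats reports the failure.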
Andrea Claudi [Sun, 23 Oct 2022 15:37:11 +0000 (17:37 +0200)]
testsuite: fix build failure
After commit 6c09257f1bf6 ("rtnetlink: add new function
rtnl_echo_talk()") "make check" results in:
$ make check
make -C testsuite
make -C iproute2 configure
make -C testsuite alltests
make -C tools
CC generate_nlmsg
/usr/bin/ld: /tmp/cc6YaGBM.o: in function `rtnl_echo_talk':
libnetlink.c:(.text+0x25bd): undefined reference to `new_json_obj'
/usr/bin/ld: libnetlink.c:(.text+0x25c7): undefined reference to `open_json_object'
/usr/bin/ld: libnetlink.c:(.text+0x25e3): undefined reference to `close_json_object'
/usr/bin/ld: libnetlink.c:(.text+0x25e8): undefined reference to `delete_json_obj'
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:6: generate_nlmsg] Error 1
make[1]: *** [Makefile:40: generate_nlmsg] Error 2
make: *** [Makefile:130: check] Error 2
This is due to json function calls included in libutil and not in
libnetlink. Fix this by adding libutil.a to the tools Makefile, and linking
against libcap as required by libutil itself.
Fixes: 6c09257f1bf6 ("rtnetlink: add new function rtnl_echo_talk()") Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Acked-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Matthieu Baerts [Mon, 17 Oct 2022 17:03:08 +0000 (19:03 +0200)]
ss: re-add TIPC query support
TIPC support was introduced in 'iproute2-master' (not -next) in
commit 5caf79a0 ("ss: Add support for TIPC socket diag in ss tool"). At
the same time, a refactoring that introduced filter_db_parse() was done
in iproute2-next; see commit 67d5fd55 ("ss: Put filter DB parsing into a
separate function").
When the two commits got merged, TIPC support was apparently dropped
by accident.
This commit simply adds the missing entry for TIPC.
Fixes: 2c62a64d ("Merge branch 'iproute2-master' into iproute2-next") Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Matthieu Baerts [Mon, 17 Oct 2022 17:03:06 +0000 (19:03 +0200)]
ss: man: add missing entries for TIPC
'ss -h' mentioned TIPC, but the man page did not.
Fixes: 5caf79a0 ("ss: Add support for TIPC socket diag in ss tool") Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Junxin Chen [Wed, 19 Oct 2022 01:20:08 +0000 (09:20 +0800)]
dcb: unblock mnl_socket_recvfrom if not message received
Currently, the dcb command sends requests to the kernel over netlink
to obtain information. However, if the kernel fails to obtain the
information or does not process the request, the dcb command hangs.
For example, if dcbnl_ops->ieee_getpfc is not implemented in the
kernel, the command "dcb pfc show dev eth1" gets stuck and subsequent
commands cannot be executed.
This patch adds the NLM_F_ACK flag to the netlink request in
mnlu_msg_prepare to ensure that the kernel responds to user requests.
With this fix, the command returns as follows:
$ dcb pfc show dev eth1
Attribute not found: Success
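As a rough illustration of the flag change described above, the following
Python sketch packs a bare netlink message header with and without
NLM_F_ACK. It is not the dcb code: build_header() is a hypothetical helper,
and the constant values are copied from linux/netlink.h and
linux/rtnetlink.h. With NLM_F_ACK set, the kernel is expected to answer with
an NLMSG_ERROR message (error code 0 on success), so a blocking receive has
something to return.

```python
import struct

# Values as defined in linux/netlink.h and linux/rtnetlink.h.
NLM_F_REQUEST = 0x01
NLM_F_ACK = 0x04
RTM_GETDCB = 78

# struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid
NLMSG_HDR = struct.Struct("=IHHII")


def build_header(msg_type, seq, payload_len=0, want_ack=True):
    """Pack a netlink header (hypothetical helper, for illustration only)."""
    flags = NLM_F_REQUEST
    if want_ack:
        # Ask the kernel to acknowledge the request even if it has
        # no data to send back.
        flags |= NLM_F_ACK
    total_len = NLMSG_HDR.size + payload_len
    return NLMSG_HDR.pack(total_len, msg_type, flags, seq, 0)


if __name__ == "__main__":
    for want_ack in (False, True):
        hdr = build_header(RTM_GETDCB, seq=1, want_ack=want_ack)
        length, msg_type, flags, seq, pid = NLMSG_HDR.unpack(hdr)
        print(f"want_ack={want_ack} len={length} type={msg_type} "
              f"flags={flags:#x}")
```

Only the flags field differs between the two headers; everything else in
the request is unchanged.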
Fixes: 67033d1c1c8a ("Add skeleton of a new tool, dcb") Signed-off-by: Junxin Chen <chenjunxin1@huawei.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Vincent Mailhol [Mon, 10 Oct 2022 14:16:38 +0000 (23:16 +0900)]
iplink_can: add missing `]' of the bitrate, dbitrate and termination arrays
The command "ip --details link show canX" misses the closing bracket
`]' of the bitrate, the dbitrate and the termination arrays. The --json
output is not impacted.
Change the first argument of close_json_array() from PRINT_JSON to
PRINT_ANY to fix the problem. The second argument was already set
correctly.
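The effect of the output type can be pictured with a small Python model.
This is a simplified sketch, not the iproute2 source: close_array(), the
OutputType enum and the sample array contents are hypothetical stand-ins
that only mirror the idea that a PRINT_JSON call is a no-op in plain-text
mode, while a PRINT_ANY call acts in both modes.

```python
import io
from enum import Flag


class OutputType(Flag):
    # Hypothetical mirror of the PRINT_* output types.
    PRINT_FP = 1
    PRINT_JSON = 2
    PRINT_ANY = 4


def close_array(out, json_mode, output_type, text):
    """Toy model: emit the array terminator for the active output mode."""
    if json_mode:
        if output_type & (OutputType.PRINT_JSON | OutputType.PRINT_ANY):
            out.write("]")  # stands in for the JSON writer closing the array
    elif output_type & (OutputType.PRINT_FP | OutputType.PRINT_ANY):
        out.write(text)  # plain-text mode prints the caller-supplied string


def render(json_mode, output_type):
    out = io.StringIO()
    out.write("[ 10000, 20000")  # hypothetical array contents
    close_array(out, json_mode, output_type, " ]")
    return out.getvalue()


if __name__ == "__main__":
    for json_mode in (False, True):
        for output_type in (OutputType.PRINT_JSON, OutputType.PRINT_ANY):
            print(json_mode, output_type.name, repr(render(json_mode, output_type)))
```

In this model only the plain-text, PRINT_JSON combination loses the
closing bracket, which matches the symptom described above.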
Fixes: 67f3c7a5cc0d ("iplink_can: use PRINT_ANY to factorize code and fix signedness") Reported-by: Marc Kleine-Budde <mkl@pengutronix.de> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Wojciech Drewek [Fri, 7 Oct 2022 07:51:01 +0000 (09:51 +0200)]
f_flower: Introduce L2TPv3 support
Add support for matching on the L2TPv3 session ID.
The session ID can be specified only when the IP protocol is
set to IPPROTO_L2TP.
L2TPv3 can be transported over IP or over UDP;
this implementation covers only L2TPv3 over IP.
IPv6 is also supported; in that case the next header
is set to IPPROTO_L2TP.
Example filter:
# tc filter add dev eth0 ingress prio 1 protocol ip \
flower \
ip_proto l2tp \
l2tpv3_sid 1234 \
skip_sw \
action drop
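To make the matched field concrete, here is a minimal Python sketch that
extracts the session ID from an L2TPv3-over-IPv4 packet. It assumes the
RFC 3931 layout, in which the L2TPv3-over-IP header begins with a 32-bit
session ID, and it is unrelated to how tc or the hardware parses packets.
l2tpv3_session_id() and make_packet() are hypothetical helpers.

```python
import struct

IPPROTO_L2TP = 115  # IANA-assigned IP protocol number for L2TP


def l2tpv3_session_id(ipv4_packet):
    """Return the L2TPv3 session ID, or None if this is not L2TPv3 over IPv4."""
    if len(ipv4_packet) < 20:
        return None
    version = ipv4_packet[0] >> 4
    header_len = (ipv4_packet[0] & 0x0F) * 4
    if version != 4 or header_len < 20:
        return None
    if ipv4_packet[9] != IPPROTO_L2TP:  # byte 9 is the IPv4 protocol field
        return None
    if len(ipv4_packet) < header_len + 4:
        return None
    # The session ID is the first 32 bits after the IP header.
    (session_id,) = struct.unpack_from("!I", ipv4_packet, header_len)
    return session_id


def make_packet(session_id, proto=IPPROTO_L2TP):
    """Build a minimal IPv4 packet (checksum left at 0) for the demo."""
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 24, 0, 0, 64, proto, 0,
        bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]),
    )
    return header + struct.pack("!I", session_id)


if __name__ == "__main__":
    print(l2tpv3_session_id(make_packet(1234)))
    print(l2tpv3_session_id(make_packet(1234, proto=17)))  # UDP: no match
```

The second call returns None because the sketch, like the filter
described above, only looks at L2TPv3 carried directly over IP.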
Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Andrea Claudi [Tue, 4 Oct 2022 14:25:03 +0000 (16:25 +0200)]
man: ss.8: fix a typo
Fixes: f76ad635f21d ("man: break long lines in man page sources") Reported-by: Prijesh Patel <prpatel@redhat.com> Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Eyal Birger [Mon, 3 Oct 2022 09:12:12 +0000 (12:12 +0300)]
ip: xfrm: support adding xfrm metadata as lwtunnel info in routes
Support for xfrm metadata as lwtunnel metadata was added in kernel
commit 2c2493b9da91 ("xfrm: lwtunnel: add lwtunnel support for xfrm
interfaces in collect_md mode").
This commit adds the corresponding support in lwt routes.
Example use (consider ipsec1 as an xfrm interface in "external" mode):
ip route add 10.1.0.0/24 dev ipsec1 encap xfrm if_id 1
Or in the context of vrf, one can also specify the "link" property:
ip route add 10.1.0.0/24 dev ipsec1 encap xfrm if_id 1 link_dev eth15
Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Ido Schimmel [Sat, 1 Oct 2022 14:35:51 +0000 (17:35 +0300)]
iplink_bridge: Add no_linklocal_learn option support
Kernel commit 70e4272b4c81 ("net: bridge: add no_linklocal_learn bool
option") added the no_linklocal_learn bridge option that can be set via
sysfs or netlink.
Add iproute2 support, allowing it to query and set the option via
netlink.
The option is useful, for example, in scenarios where we want the bridge
to be able to refresh dynamic FDB entries that were added by user space
and are pointing to locked bridge ports, but do not want the bridge to
populate its FDB from EAPOL frames used for authentication.
Example:
$ ip -j -d link show dev br0 | jq ".[][\"linkinfo\"][\"info_data\"][\"no_linklocal_learn\"]"
0
$ cat /sys/class/net/br0/bridge/no_linklocal_learn
0
# ip link set dev br0 type bridge no_linklocal_learn 1
$ ip -j -d link show dev br0 | jq ".[][\"linkinfo\"][\"info_data\"][\"no_linklocal_learn\"]"
1
$ cat /sys/class/net/br0/bridge/no_linklocal_learn
1
# ip link set dev br0 type bridge no_linklocal_learn 0
$ ip -j -d link show dev br0 | jq ".[][\"linkinfo\"][\"info_data\"][\"no_linklocal_learn\"]"
0
$ cat /sys/class/net/br0/bridge/no_linklocal_learn
0
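For scripts that prefer not to depend on jq, the same lookup can be
sketched in Python. The sample JSON below is a hypothetical, heavily
trimmed stand-in for the output of "ip -j -d link show dev br0"; only
the key path is taken from the jq expression above.

```python
import json

# Hypothetical, trimmed sample of "ip -j -d link show dev br0" output.
SAMPLE = """
[
  {
    "ifname": "br0",
    "linkinfo": {
      "info_kind": "bridge",
      "info_data": {
        "no_linklocal_learn": 0
      }
    }
  }
]
"""


def no_linklocal_learn(ip_json):
    """Return the no_linklocal_learn value of every link in the JSON text."""
    links = json.loads(ip_json)
    return [
        link["linkinfo"]["info_data"]["no_linklocal_learn"]
        for link in links
    ]


if __name__ == "__main__":
    print(no_linklocal_learn(SAMPLE))
```

In a real script the JSON text would come from running the ip command,
for example through subprocess.run(), instead of the embedded sample.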
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Running make check on iproute2 performs several checks, including man
page checks for common errors. The recent addition of linecard support
to devlink introduced this error:
Checking manpages for syntax errors...
an-old.tmac: <standard input>: line 31: 'R' is a string (producing the registered sign), not a macro.
Error in devlink-lc.8
Fixes: 4cb0bec3744a ("devlink: add support for linecard show and type set") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>