bridge: vlan: dump port only if there are any vlans
When I added support for new vlan rtm dumping, I made a mistake in the
output format when there are no vlans on the port. This patch fixes it by
not printing ports without vlan entries (similar to current situation).
Example (no vlans):
$ bridge -d vlan show
port vlan-id
Fixes: e5f87c834193 ("bridge: vlan: add support for the new rtm dump call") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Tony Ambardar [Tue, 20 Apr 2021 08:26:36 +0000 (01:26 -0700)]
ip: drop 2-char command assumption
The 'ip' utility hardcodes the assumption of being a 2-char command, where
any follow-on characters are passed as an argument:
$ ./ip-full help
Object "-full" is unknown, try "ip help".
This confusing behaviour isn't seen with 'tc' for example, and was added in
a 2005 commit without documentation. It was noticed during testing of 'ip'
variants built/packaged with different feature sets (e.g. w/o BPF support).
Mitigate the problem by redoing the command without the 2-char assumption
if the follow-on characters fail to parse as a valid command.
Fixes: 351efcde4e62 ("Update header files to 2.6.14") Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com> Signed-off-by: David Ahern <dsahern@kernel.org>
David Ahern [Thu, 22 Apr 2021 05:20:13 +0000 (05:20 +0000)]
Merge branch 'bridge-vlan' into next
Nikolay Aleksandrov says:
====================
From: Nikolay Aleksandrov <nikolay@nvidia.com>
This set extends the bridge vlan code to use the new vlan RTM calls
which allow to dump detailed per-port, per-vlan information and also to
manipulate the per-vlan options. It also allows to monitor any vlan
changes (add/del/option change). The rtm vlan dumps have an extensible
format which allows us to add new options and attributes easily, and
also to request the kernel to filter on different vlan information when
dumping. The new kernel dump code tries to use compressed vlan format as
much as possible (it includes netlink attributes for vlan start and
end) to reduce the number of generated messages and netlink traffic.
The iproute2 support is activated by using the "-d" flag when showing
vlan information, that will cause it to use the new rtm dump call and
get all the detailed information, if "-s" is also specified it will dump
per-vlan statistics as well. Obviously in that case the vlans cannot be
compressed. To change per-vlan options (currently only STP state is
supported) a new vlan command is added - "set". It can be used to set
options of bridge or port vlans and vlan ranges can be used, all of the
new vlan option code uses extack to show more understandable errors.
The set adds the first supported per-vlan option - STP state.
Man pages and usage information are updated accordingly.
Example:
$ bridge -d vlan show
port vlan-id
ens13 1 PVID Egress Untagged
state forwarding
bridge 1 PVID Egress Untagged
state forwarding
$ bridge vlan set vid 1 dev ens13 state blocking
$ bridge -d vlan show
port vlan-id
ens13 1 PVID Egress Untagged
state blocking
bridge 1 PVID Egress Untagged
state forwarding
Add support for vlan activity monitoring, we display vlan notifications on
vlan add/del/options change. The man page and help are also updated
accordingly.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
bridge: vlan: add support for the new rtm dump call
Use the new bridge vlan rtm dump helper to dump all of the available
vlan information when -details (-d) is used with vlan show. It is also
capable of dumping vlan stats if -statistics (-s) is added.
Currently this is the only interface capable of dumping per-vlan
options. The vlan dump format is compatible with current vlan show, it
uses the same helpers to dump vlan information. The new addition is one
line which will contain the per-vlan options (similar to ip -d link show
for ports). Currently only the vlan STP state is printed.
The call uses compressed vlan format by default.
Example:
$ bridge -s -d vlan show
port vlan-id
virbr1 1 PVID Egress Untagged
state forwarding
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
bridge: vlan: add option set command and state option
Add a new per-vlan option set command. It allows to manipulate vlan
options, those can be bridge-wide or per-port depending on what device
is specified. The first option that can be set is the vlan STP state,
it is identical to the bridge port STP state. The man page is also
updated accordingly.
Example:
$ bridge vlan set vid 10 dev br0 state learning
or a range:
$ bridge vlan set vid 10-20 dev swp1 state blocking
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Rename print_portstate to print_stp_state in preparation for use by vlan
code as well (per-vlan state), and export it. To be in line with the new
naming rename also port_states to stp_states as they'll be used for
vlans, too.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
This adds iproute2 support for mptcp event monitoring, e.g. creation,
establishment, address announcements from the peer, subflow establishment
and so on.
While the kernel-generated events are primarily aimed at mptcpd (e.g. for
subflow management), this is also useful for debugging.
Sabrina Dubroca [Fri, 19 Mar 2021 16:57:17 +0000 (17:57 +0100)]
ip: xfrm: add support for tfcpad
This patch adds support for setting and displaying the Traffic Flow
Confidentiality attribute for an XFRM state, which allows padding ESP
packets to a specified length.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David Ahern <dsahern@kernel.org>
David Ahern [Fri, 19 Mar 2021 15:05:29 +0000 (15:05 +0000)]
Merge branch 'nexthop-resilient-hash' into next
Petr Machata says:
====================
Support for resilient next-hop groups was recently accepted to Linux
kernel[1]. Resilient next-hop groups add a layer of indirection between the
SKB hash and the next hop. Thus the hash is used to reference a hash table
bucket, which is then used to reference a particular next hop. This allows
the system more flexibility when assigning SKB hash space to next hops.
Previously, each next hop had to be assigned a continuous range of SKB hash
space. With a hash table as an intermediate layer, it is possible to
reassign next hops with a hash table bucket granularity. In turn, this
mends issues with traffic flow redirection resulting from next hop removal
or adjustments in next-hop weights.
In this patch set, introduce support for resilient next-hop groups to
iproute2.
- Patch #1 brings include/uapi/linux/nexthop.h and /rtnetlink.h up to date.
- Patches #2 and #3 add new helpers that will be useful later.
- Patch #4 extends the ip/nexthop sub-tool to accept group type as a
command line argument, and to dispatch based on the specified type.
- Patch #5 adds the support for resilient next-hop groups.
- Patch #6 adds the support for resilient next-hop group bucket interface.
To illustrate the usage, consider the following commands:
# ip nexthop add id 1 via 192.0.2.2 dev dummy1
# ip nexthop add id 2 via 192.0.2.3 dev dummy1
# ip nexthop add id 10 group 1/2 type resilient \
buckets 8 idle_timer 60 unbalanced_timer 300
The last command creates a resilient next-hop group. It will have 8
buckets, each bucket will be considered idle when no traffic hits it for at
least 60 seconds, and if the table remains out of balance for 300 seconds,
it will be forcefully brought into balance.
And this is how the next-hop group bucket interface looks:
# ip nexthop bucket show id 10
id 10 index 0 idle_time 5.59 nhid 1
id 10 index 1 idle_time 5.59 nhid 1
id 10 index 2 idle_time 8.74 nhid 2
id 10 index 3 idle_time 8.74 nhid 2
id 10 index 4 idle_time 8.74 nhid 1
id 10 index 5 idle_time 8.74 nhid 1
id 10 index 6 idle_time 8.74 nhid 1
id 10 index 7 idle_time 8.74 nhid 1
Ido Schimmel [Wed, 17 Mar 2021 12:54:35 +0000 (13:54 +0100)]
nexthop: Add support for nexthop buckets
Add ability to dump multiple nexthop buckets and get a specific one.
Example:
# ip nexthop add id 10 group 1/2 type resilient buckets 8
# ip nexthop
id 1 via 192.0.2.2 dev dummy10 scope link
id 2 via 192.0.2.19 dev dummy20 scope link
id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0 unbalanced_time 0
# ip nexthop bucket
id 10 index 0 idle_time 28.1 nhid 2
id 10 index 1 idle_time 28.1 nhid 2
id 10 index 2 idle_time 28.1 nhid 2
id 10 index 3 idle_time 28.1 nhid 2
id 10 index 4 idle_time 28.1 nhid 1
id 10 index 5 idle_time 28.1 nhid 1
id 10 index 6 idle_time 28.1 nhid 1
id 10 index 7 idle_time 28.1 nhid 1
# ip nexthop bucket show nhid 1
id 10 index 4 idle_time 53.59 nhid 1
id 10 index 5 idle_time 53.59 nhid 1
id 10 index 6 idle_time 53.59 nhid 1
id 10 index 7 idle_time 53.59 nhid 1
# ip nexthop bucket get id 10 index 5
id 10 index 5 idle_time 81 nhid 1
# ip -j -p nexthop bucket get id 10 index 5
[ {
"id": 10,
"bucket": {
"index": 5,
"idle_time": 104.89,
"nhid": 1
},
"flags": [ ]
} ]
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Ido Schimmel [Wed, 17 Mar 2021 12:54:33 +0000 (13:54 +0100)]
nexthop: Add ability to specify group type
Next patches are going to add a 'resilient' nexthop group type, so allow
users to specify the type using the 'type' argument. Currently, only
'mpath' type is supported.
These two commands are equivalent:
# ip nexthop add id 10 group 1/2/3
# ip nexthop add id 10 group 1/2/3 type mpath
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Petr Machata [Wed, 17 Mar 2021 12:54:32 +0000 (13:54 +0100)]
nexthop: Extract a helper to parse a NH ID
NH ID extraction is a common operation, and will become more common still
with the resilient NH groups support. Add a helper that does what it
usually done and returns the parsed NH ID.
Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Tony Ambardar [Thu, 11 Mar 2021 21:47:54 +0000 (13:47 -0800)]
lib/bpf: add missing limits.h includes
Several functions in bpf_glue.c and bpf_libbpf.c rely on PATH_MAX, which is
normally included from <limits.h> in other iproute2 source files.
It fixes errors seen using gcc 10.2.0, binutils 2.35.1 and musl 1.1.24:
bpf_glue.c: In function 'get_libbpf_version':
bpf_glue.c:46:11: error: 'PATH_MAX' undeclared (first use in this function);
did you mean 'AF_MAX'?
46 | char buf[PATH_MAX], *s;
| ^~~~~~~~
| AF_MAX
Reported-by: Rui Salvaterra <rsalvaterra@gmail.com> Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Sabrina Dubroca [Tue, 9 Mar 2021 15:44:33 +0000 (16:44 +0100)]
ip: xfrm: limit the length of the security context name when printing
Security context names are not guaranteed to be NUL-terminated by the
kernel, so we can't just print them using %s directly. The length of
the string is determined by sctx->ctx_len, so we can use that to limit
what fprintf outputs.
While at it, factor that out to a separate function, since the exact
same code is used to print the security context for both policies and
states.
Fixes: b2bb289a57fe ("xfrm security context support") Reported-by: Paul Wouters <pwouters@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
q_cake: Fix incorrect printing of signed values in class statistics
The deficit returned from the kernel is signed, but was printed with a %u
specifier in the format string, leading to negative values to be printed as
high unsigned values instead. In addition, we passed a negative value to
sprint_time() even though that expects an unsigned value. Fix this by
changing the format specifier and reversing the sign of negative time
values.
Fixes: 714444c0cb26 ("Add support for CAKE qdisc") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Roi Dayan [Mon, 22 Feb 2021 12:10:30 +0000 (14:10 +0200)]
dcb: Fix compilation warning about reallocarray
In older distros we need bsd/stdlib.h but newer distro doesn't
need it. Also old distro will need libbsd-devel installed and newer
doesn't. To remove a possible dependency on libbsd-devel replace usage
of reallocarray to realloc.
dcb_app.c: In function ‘dcb_app_table_push’:
dcb_app.c:68:25: warning: implicit declaration of function ‘reallocarray’; did you mean ‘realloc’?
Fixes: 8e9bed1493f5 ("dcb: Add a subtool for the DCB APP object") Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Andrea Claudi [Mon, 22 Feb 2021 18:14:32 +0000 (19:14 +0100)]
lib/fs: Fix single return points for get_cgroup2_*
Functions get_cgroup2_id() and get_cgroup2_path() may call close() with
a negative argument.
Avoid that making the calls conditional on the file descriptors.
get_cgroup2_path() may also return NULL leaking a file descriptor.
Ensure this does not happen using a single return point.
Fixes: d5e6ee0dac64 ("ss: introduce cgroup2 cache and helper functions") Fixes: 8f1cd119b377 ("lib: fix checking of returned file handle size for cgroup") Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Andrea Claudi [Mon, 22 Feb 2021 18:14:31 +0000 (19:14 +0100)]
lib/fs: avoid double call to mkdir on make_path()
make_path() function calls mkdir two times in a row. The first one it
stores mkdir return code, and then it calls it again to check for errno.
This seems unnecessary, as we can use the return code from the first
call and check for errno if not 0.
Fixes: ac3415f5c1b1d ("lib/fs: Fix and simplify make_path()") Acked-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Andrea Claudi [Mon, 22 Feb 2021 17:43:10 +0000 (18:43 +0100)]
lib/bpf: Fix and simplify bpf_mnt_check_target()
As stated in commit ac3415f5c1b1 ("lib/fs: Fix and simplify make_path()"),
calling stat() before mkdir() is racey, because the entry might change in
between.
As the call to stat() seems to only check for target existence, we can
simply call mkdir() unconditionally and catch all errors but EEXIST.
Fixes: 95ae9a4870e7 ("bpf: fix mnt path when from env") Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Andrea Claudi [Mon, 22 Feb 2021 11:40:36 +0000 (12:40 +0100)]
lib/namespace: fix ip -all netns return code
When ip -all netns {del,exec} are called and no netns is present, ip
exit with status 0. However this does not happen if no netns has been
created since boot time: in that case, indeed, the NETNS_RUN_DIR is not
present and netns_foreach() exit with code 1.
$ ls /var/run/netns
ls: cannot access '/var/run/netns': No such file or directory
$ ip -all netns exec ip link show
$ echo $?
1
$ ip -all netns del
$ echo $?
1
$ ip netns add test
$ ip netns del test
$ ip -all netns del
$ echo $?
0
$ ls -a /var/run/netns
. ..
This leaves us in the unpleasant situation where the same command, when
no netns is present, does the same stuff (in this case, nothing), but
exit with two different statuses.
Fix this treating ENOENT in a different way from other errors, similarly
to what we already do in ipnetns.c netns_identify_pid()
Fixes: e998e118ddc3 ("lib: Exec func on each netns") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Andrea Claudi [Mon, 22 Feb 2021 20:23:01 +0000 (21:23 +0100)]
ip: lwtunnel: seg6: bail out if table ids are invalid
When table and vrftable are used in SRv6, ip should bail out if table
ids are not valid, and return a proper error message to the user.
Achieve this simply checking rtnl_rttable_a2n return value, as we
already do in the rest of iproute.
Fixes: 0486388a877a ("add support for table name in SRv6 End.DT* behaviors") Fixes: 69629b4e43c4 ("seg6: add support for vrftable attribute in SRv6 End.DT4/DT6 behaviors") Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Andrea Claudi [Mon, 22 Feb 2021 20:22:47 +0000 (21:22 +0100)]
tc: m_gate: use SPRINT_BUF when needed
sprint_time64() uses SPRINT_BSIZE-1 as a constant buffer lenght in its
implementation, however m_gate uses shorter buffers when calling it.
Fix this using SPRINT_BUF macro to get the buffer, thus getting a
SPRINT_BSIZE-long buffer.
Fixes: 07d5ee70b5b3 ("iproute2-next:tc:action: add a gate control action") Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Vladimir Oltean [Thu, 11 Feb 2021 10:45:02 +0000 (12:45 +0200)]
man8/bridge.8: be explicit that "flood" is an egress setting
Talking to varios people, it became apparent that there is a certain
ambiguity in the description of these flags. They refer to egress
flooding, which should perhaps be stated more clearly.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
The dump isn't supported for the statistics bind/unbind commands
because they operate on specific QP counters. This is different
from query commands that can operate on many objects at the same
time.
Let's check the user input and ensure that arguments are valid.
Fixes: a6d0773ebecc ("rdma: Add stat manual mode support") Signed-off-by: Ido Kalir <idok@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Thayne McCombs [Sun, 14 Feb 2021 08:09:13 +0000 (01:09 -0700)]
ss: Make leading ":" always optional for sport and dport
The sport and dport conditions in expressions were inconsistent on
whether there should be a ":" at the beginning of the port when only a
port was provided depending on the family. The link and netlink
families required a ":" to work. The vsock family required the ":"
to be absent. The inet and inet6 families work with or without a leading
":".
This makes the leading ":" optional in all cases, so if sport or dport
are used, then it works with a leading ":" or without one, as inet and
inet6 did.
Signed-off-by: Thayne McCombs <astrothayne@gmail.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Amit Cohen [Tue, 9 Feb 2021 09:12:00 +0000 (11:12 +0200)]
ip route: Print "rt_offload_failed" indication
The kernel signals when offload fails using the 'RTM_F_OFFLOAD_FAILED'
flag. Print it to help users understand the offload state of the route.
The "rt_" prefix is used in order to distinguish it from the offload state
of nexthops, similar to "rt_offload" and "rt_trap".
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Delete the vdpa device after its use:
$ vdpa dev del foo2
An example of PCI PF, VF and SF management device:
pci/0000:03.00:0
supported_classes
net
pci/0000:03.00:4
supported_classes
net
auxiliary/mlx5_core.sf.8
supported_classes
net
Parav Pandit [Wed, 10 Feb 2021 18:34:42 +0000 (20:34 +0200)]
utils: Add helper routines for indent handling
Subsequent patch needs to use 2 char indentation for nested objects.
Hence introduce a generic helpers to allocate, deallocate, increment,
decrement and to print indent block.
Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
This commit adds support for configuring HTB in offload mode. HTB
offload eliminates the single qdisc lock in the datapath and offloads
the algorithm to the NIC. The new 'offload' parameter is added to
enable this mode:
Classes are created as usual, but filters should be moved to clsact for
lock-free classification (filters attached to HTB itself are not
supported in the offload mode):
# tc filter add dev eth0 egress protocol ip flower dst_port 80
action skbedit priority 1:10
tc qdisc show and tc class show will indicate whether the offload is
enabled. Example output:
Thayne McCombs [Tue, 2 Feb 2021 03:32:10 +0000 (20:32 -0700)]
ss: always prefer family as part of host condition to default family
ss accepts an address family both with the -f option and as part of a
host condition. However, if the family in the host condition is
different than the the last -f option, then which family is actually
used depends on the order that different families are checked.
This changes parse_hostcond to check all family prefixes before parsing
the rest of the address, so that the host condition's family always has
a higher priority than the "preferred" family.
Signed-off-by: Thayne McCombs <astrothayne@gmail.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Luca Boccassi [Sun, 24 Jan 2021 17:36:58 +0000 (17:36 +0000)]
iproute: force rtm_dst_len to 32/128
Since NETLINK_GET_STRICT_CHK was enabled, the kernel rejects commands
that pass a prefix length, eg:
ip route get `1.0.0.0/1
Error: ipv4: Invalid values in header for route get request.
ip route get 0.0.0.0/0
Error: ipv4: rtm_src_len and rtm_dst_len must be 32 for IPv4
Since there's no point in setting a rtm_dst_len that we know is going
to be rejected, just force it to the right value if it's passed on
the command line. Print a warning to stderr to notify users.
Thayne McCombs [Tue, 2 Feb 2021 22:30:40 +0000 (14:30 -0800)]
ss: Add clarification about host conditions with multiple familes to man
In creating documentation for expressions I ran into an interesting case
where if you use two different familie types in the expression, such as
in `ss 'sport inet:ssh or src unix:/run/*'`, then you would only get the
results for one address family (in this case unix sockets).
The reason is that in parse_hostcond if the family is specified we
remove any previously added families from filter->families, and
preserve the "states" if any states are set. I tried changing this to
not reset the families, but ran into some issues with Invalid Argument
errors in inet_show_netlink, I think related to the state.
I can dig into that more if supporting this is useful, but I'm not sure
if these types of expressions would actually be useful in practice. Or
perhaps an error should be given if an expression contains conditions
with multiple families (besides inet and inet6)?
Anyway, for now, this patch just notes the limitation in the man page.
Signed-off-by: Thayne McCombs <astrothayne@gmail.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Edwin Peer [Tue, 26 Jan 2021 17:40:53 +0000 (09:40 -0800)]
iplink: print warning for missing VF data
The kernel might truncate VF info in IFLA_VFINFO_LIST. Compare the
expected number of VFs in IFLA_NUM_VF to how many were found in the
list and warn accordingly.
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Paolo Abeni [Mon, 25 Jan 2021 16:02:07 +0000 (17:02 +0100)]
ss: do not emit warn while dumping MPTCP on old kernels
Prior to this commit, running 'ss' on a kernel older than v5.9
bumps an error message:
RTNETLINK answers: Invalid argument
When asked to dump protocol number > 255 - that is: MPTCP - 'ss'
adds an INET_DIAG_REQ_PROTOCOL attribute, unsupported by the older
kernel.
Avoid the warning ignoring filter issues when INET_DIAG_REQ_PROTOCOL
is used.
Additionally older kernel end-up invoking tcpdiag_send(), which
in turn will try to dump DCCP socks. Bail early in such function,
as the kernel does not implement an MPTCPDIAG_GET request.
Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com> Fixes: 9c3be2c0eee0 ("ss: mptcp: add msk diag interface support") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Vladimir Oltean [Fri, 22 Jan 2021 01:22:11 +0000 (03:22 +0200)]
man: tc-taprio.8: document the full offload feature
Since this feature's introduction in commit 9c66d1564676 ("taprio: Add
support for hardware offloading") from kernel v5.4, it never got
documented in the man pages. Due to this reason, we see customer reports
of seemingly contradictory information: the community manpages claim
there is no support for full offload, nonetheless many silicon vendors
have already implemented it.
This patch documents the full offload feature (enabled by specifying
"flags 2" to the taprio qdisc) and gives one more example that tries to
illustrate some of the finer points related to the usage.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
David Ahern [Tue, 2 Feb 2021 02:45:49 +0000 (02:45 +0000)]
Merge branch 'devlink-port-mgmt' into next
Parav Pandit says:
====================
This patchset implements devlink port add, delete and function state
management commands.
An example sequence for a PCI SF:
Set the device in switchdev mode:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
View ports in switchdev mode:
$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 s=
plittable false
Add a subfunction port for PCI PF 0 with sfnumber 88:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:08:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfn=
um 0 sfnum 88 splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
Show a newly added port:
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf contro=
ller 0 pfnum 0 sfnum 88 splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
Set the function state to active:
$ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:8=
8 state active
Show the port in JSON format:
$ devlink port show pci/0000:06:00.0/32768 -jp
{
"port": {
"pci/0000:06:00.0/32768": {
"type": "eth",
"netdev": "ens2f0npf0sf88",
"flavour": "pcisf",
"controller": 0,
"pfnum": 0,
"sfnum": 88,
"splittable": false,
"function": {
"hw_addr": "00:00:00:00:88:88",
"state": "active",
"opstate": "attached"
}
}
}
}
Set the function state to active:
$ devlink port function set pci/0000:06:00.0/32768 state inactive
Delete the port after use:
$ devlink port del pci/0000:06:00.0/32768
Oliver Hartkopp [Mon, 25 Jan 2021 10:40:55 +0000 (11:40 +0100)]
iplink_can: add Classical CAN frame LEN8_DLC support
The len8_dlc element is filled by the CAN interface driver and used for CAN
frame creation by the CAN driver when the CAN_CTRLMODE_CC_LEN8_DLC flag is
supported by the driver and enabled via netlink configuration interface.
Add the command line support for cc-len8-dlc for Linux 5.11+
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David Ahern <dsahern@kernel.org>
Jarod Wilson [Fri, 15 Jan 2021 19:21:37 +0000 (14:21 -0500)]
bond: support xmit_hash_policy=vlan+srcmac
There's a new transmit hash policy being added to the bonding driver that
is a simple XOR of vlan ID and source MAC, xmit_hash_policy vlan+srcmac.
This trivial patch makes it configurable and queryable via iproute2.
Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Luca Boccassi [Sun, 17 Jan 2021 22:54:27 +0000 (22:54 +0000)]
vrf: fix ip vrf exec with libbpf
The size of bpf_insn is passed to bpf_load_program instead of the number
of elements as it expects, so ip vrf exec fails with:
$ sudo ip link add vrf-blue type vrf table 10
$ sudo ip link set dev vrf-blue up
$ sudo ip/ip vrf exec vrf-blue ls
Failed to load BPF prog: 'Invalid argument'
last insn is not an exit or jmp
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
Kernel compiled with CGROUP_BPF enabled?
https://bugs.debian.org/980046
Reported-by: Emmanuel DECAEN <Emmanuel.Decaen@xsalto.com> Signed-off-by: Luca Boccassi <bluca@debian.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Roi Dayan [Tue, 12 Jan 2021 10:33:17 +0000 (12:33 +0200)]
build: Fix link errors on some systems
Since moving get_rate() and get_size() from tc to lib, on some
systems we fail to link because of missing math lib.
Move the functions that require math lib to their own c file
and add -lm to dcb that now use those functions.
../lib/libutil.a(utils.o): In function `get_rate':
utils.c:(.text+0x10dc): undefined reference to `floor'
../lib/libutil.a(utils.o): In function `get_size':
utils.c:(.text+0x1394): undefined reference to `floor'
../lib/libutil.a(json_print.o): In function `sprint_size':
json_print.c:(.text+0x14c0): undefined reference to `rint'
json_print.c:(.text+0x14f4): undefined reference to `rint'
json_print.c:(.text+0x157c): undefined reference to `rint'
Fixes: f3be0e6366ac ("lib: Move get_rate(), get_rate64() from tc here") Fixes: 44396bdfcc0a ("lib: Move get_size() from tc here") Fixes: adbe5de96662 ("lib: Move sprint_size() from tc here, add print_size()") Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Tested-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
David Ahern [Mon, 18 Jan 2021 04:10:27 +0000 (04:10 +0000)]
Merge branch 'dcb-app-dcbx' into next
Petr Machata says:
====================
Add support to the dcb tool for the following two DCB objects:
- APP, which allows configuration of traffic prioritization rules based on
several possible packet headers.
- DCBX, which is a 1-byte bitfield of flags that configure whether the DCBX
protocol is implemented in the device or in the host, and which version
of the protocol should be used.
Patch #1 adds a new helper for finding a name of a given dsfield value.
This is useful for APP DSCP-to-priority rules, which can use human-readable
DSCP names.
Patches #2, #3 and #4 extend existing interfaces for, respectively, parsing
of the X:Y mappings, for setting a DCB object, and for getting a DCB
object.
In patch #5, support for the command line argument -N / --Numeric is
added. The APP tool later uses it to decide whether to format DSCP values
as human-readable strings or as plain numbers.
Patches #6 and #7 add the subtools themselves and their man pages.
v2:
- Two patches dropped and sent to iproute2 branch as "dcb: Fixes".
This patch set now depends on that one.
- Patch #5:
- Make it -N / --Numeric instead of -n / --no-nice-names
- Rename the flag from no_nice_names to numeric as well
- Patch #6:
- Adjust to s/no_nice_names/numeric/ from another patch.
Petr Machata [Sat, 2 Jan 2021 00:03:41 +0000 (01:03 +0100)]
dcb: Add a subtool for the DCBX object
The Linux DCBX object is a 1-byte bitfield of flags that configure whether
the DCBX protocol is implemented in the device or in the host, and which
version of the protocol should be used. Add a tool to access the per-port
Linux DCBX object.
For example:
# dcb dcbx set dev eni1np1 host ieee
# dcb dcbx show dev eni1np1
host ieee
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@kernel.org>
Petr Machata [Sat, 2 Jan 2021 00:03:39 +0000 (01:03 +0100)]
dcb: Support -N to suppress translation to human-readable names
Some DSCP values can be translated to symbolic names. That may not be
always desirable. Introduce a command-line option similar to other tools,
-N or --Numeric, to suppress this translation.
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@kernel.org>
Petr Machata [Sat, 2 Jan 2021 00:03:38 +0000 (01:03 +0100)]
dcb: Generalize dcb_get_attribute()
The function dcb_get_attribute() assumes that the caller knows the exact
size of the looked-for payload. It also assumes that the response comes
wrapped in an DCB_ATTR_IEEE nest. The former assumption does not hold for
the IEEE APP table, which has variable size. The latter one does not hold
for DCBX, which is not IEEE-nested, and also for any CEE attributes, which
would come CEE-nested.
Factor out the payload extractor from the current dcb_get_attribute() code,
and put into a helper. Then rewrite dcb_get_attribute() compatibly in terms
of the new function. Introduce dcb_get_attribute_va() as a thin wrapper for
IEEE-nested access, and dcb_get_attribute_bare() for access to attributes
that are not nested.
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@kernel.org>
Petr Machata [Sat, 2 Jan 2021 00:03:37 +0000 (01:03 +0100)]
dcb: Generalize dcb_set_attribute()
The function dcb_set_attribute() takes a fully-formed payload as an
argument. For callers that need to build a nested attribute, such as is the
case for DCB APP table, this is not great, because with libmnl, they would
need to construct a separate netlink message just to pluck out the payload
and hand it over to this function.
Currently, dcb_set_attribute() also always wraps the payload in an
DCB_ATTR_IEEE container, because that is what all the dcb subtools so far
needed. But that is not appropriate for DCBX in particular, and in fact a
handful other attributes, as well as any CEE payloads.
Instead, generalize this code by adding parameters for constructing a
custom payload and for fetching the response from a custom response
attribute. Then add dcb_set_attribute_va(), which takes a callback to
invoke in the right place for the nest to be built, and
dcb_set_attribute_bare(), which is similar to dcb_set_attribute(), but does
not encapsulate the payload in an IEEE container. Rewrite
dcb_set_attribute() compatibly in terms of the new functions.
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@kernel.org>
Petr Machata [Sat, 2 Jan 2021 00:03:36 +0000 (01:03 +0100)]
lib: Generalize parse_mapping()
The function parse_mapping() assumes the key is a number, with a single
configurable exception, which is using "all" to mean "all possible keys".
If a caller wishes to use symbolic names instead of numbers, they cannot
reuse this function.
To facilitate reuse in these situations, convert parse_mapping() into a
helper, parse_mapping_gen(), which instead of an allow-all boolean takes a
generic key-parsing callback. Rewrite parse_mapping() in terms of this
newly-added helper and add a pair of key parsers, one for just numbers,
another for numbers and the keyword "all". Publish the latter as well.
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@kernel.org>
Petr Machata [Sat, 2 Jan 2021 00:03:35 +0000 (01:03 +0100)]
lib: rt_names: Add rtnl_dsfield_get_name()
For formatting DSCP (not full dsfield), it would be handy to be able to
just get the name from the name table, and not get any of the remaining
cruft related to formatting. Add a new entry point to just fetch the
name table string uninterpreted. Use it from rtnl_dsfield_n2a().
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@kernel.org>