Commit a9c3d70d902a0473ee5c13336317006a52ce8242 broke backward compatibility
by making 'configure' error out if parameters are passed, instead of
ignoring them.
Sometimes packaging systems detect 'configure' and assume it's from
autotools, and pass a bunch of options. Eg:
Hangbin Liu [Mon, 9 Aug 2021 03:01:53 +0000 (11:01 +0800)]
ip/bond: add lacp active support
lacp_active specifies whether to send LACPDU frames periodically.
If set on, the LACPDU frames are sent along with the configured lacp_rate
setting. If set off, the LACPDU frames act as "speak when spoken to".
Presently, if a Geneve or VXLAN interface was created with 'external',
it's not possible for a user to determine e.g. the value of 'dstport'
after creation. This change fixes that by avoiding early returns.
This change partly reverts commit 00ff4b8e31af ("ip/tunnel: Be consistent
when printing tunnel collect metadata").
Signed-off-by: Ilya Dmitrichenko <errordeveloper@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David Ahern <dsahern@kernel.org>
Only one of "set", "swap" or "ecn" shall be used in a single tc-skbmod
command. Trying to use more than one of them at a time is considered
undefined behavior; pipe multiple tc-skbmod commands together instead.
"set" and "swap" only affect Ethernet packets, while "ecn" only affects
IP packets.
Depends on kernel patch "net/sched: act_skbmod: Add SKBMOD_F_ECN option
support", as well as iproute2 patch "tc/skbmod: Remove misinformation
about the swap action".
Justin Iurman [Sun, 1 Aug 2021 12:45:51 +0000 (14:45 +0200)]
New IOAM6 encap type for routes
This patch provides a new encap type for routes to insert an IOAM pre-allocated
trace:
$ ip -6 ro ad fc00::1/128 encap ioam6 trace prealloc type 0x800000 ns 1 size 12 dev eth0
where:
- "trace" and "prealloc" may appear useless, but they anticipate future
implementations of other IOAM option types.
- "type" is a bitfield (=u32) defining the IOAM pre-allocated trace type (see
the corresponding uapi).
- "ns" is an IOAM namespace ID attached to the pre-allocated trace.
- "size" is the trace pre-allocated size in bytes; must be a 4-octet multiple;
limited size (see IOAM6_TRACE_DATA_SIZE_MAX).
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David Ahern <dsahern@kernel.org>
Justin Iurman [Sun, 1 Aug 2021 12:45:50 +0000 (14:45 +0200)]
Add, show, link, remove IOAM namespaces and schemas
This patch provides support for adding, listing and removing IOAM namespaces
and schemas with iproute2. When adding an IOAM namespace, both "data" (=u32)
and "wide" (=u64) are optional. Therefore, you can either have none, one of
them, or both at the same time. When adding an IOAM schema, there is no
restriction on "DATA" except its size (see IOAM6_MAX_SCHEMA_DATA_LEN). By
default, an IOAM namespace has no active IOAM schema (meaning an IOAM namespace
is not linked to an IOAM schema), and an IOAM schema is not considered
as "active" (meaning an IOAM schema is not linked to an IOAM namespace). It is
possible to link an IOAM namespace with an IOAM schema, thanks to the last
command below (meaning the IOAM schema will be considered as "active" for the
specific IOAM namespace).
$ ip ioam
Usage: ip ioam { COMMAND | help }
ip ioam namespace show
ip ioam namespace add ID [ data DATA32 ] [ wide DATA64 ]
ip ioam namespace del ID
ip ioam schema show
ip ioam schema add ID DATA
ip ioam schema del ID
ip ioam namespace set ID schema { ID | none }
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David Ahern <dsahern@kernel.org>
ipneigh: add support to print brief output of neigh cache in tabular format
Make use of the already available brief flag and print the basic details of
the IPv4 or IPv6 neighbour cache in a tabular format for better readability
when the brief output is expected.
Jakub Kicinski [Wed, 18 Aug 2021 21:29:46 +0000 (14:29 -0700)]
ss: fix fallback to procfs for raw sockets
Jonas reports that ss -awp does not display any RAW sockets
on a Knoppix 4.4 kernel.
sockdiag_send() diverts to tcpdiag_send() to try the older
netlink interface. tcpdiag_send() works for TCP and DCCP
but not other protocols. Instead of rejecting unsupported
protocols (and missing RAW and SCTP), match on the supported ones.
Link: https://lore.kernel.org/netdev/20210815231738.7b42bad4@mmluhan/
Reported-and-tested-by: Jonas Bechtel <post@jbechtel.de>
Fixes: 41fe6c34de50 ("ss: Add inet raw sockets information gathering via netlink diag interface")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
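The switch from a deny-list to an allow-list described above can be sketched as follows. This is only an illustration, not the ss code: both helper names are hypothetical, and the assumption that the old check rejected only UDP is made purely for the example.

```c
#include <netinet/in.h>
#include <stdbool.h>

/* Deny-list style: every protocol that is not listed slips through.
 * The single UDP entry is an assumption made for illustration. */
bool legacy_diag_ok_denylist(int protocol)
{
	return protocol != IPPROTO_UDP;
}

/* Allow-list style: only the protocols the older interface is said to
 * handle (TCP and DCCP, per the commit message) are accepted. */
bool legacy_diag_ok_allowlist(int protocol)
{
	return protocol == IPPROTO_TCP || protocol == IPPROTO_DCCP;
}
```

With the allow-list, RAW and SCTP are rejected by the legacy path and can fall back to procfs, while the deny-list lets them through.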
Gokul Sivakumar [Tue, 17 Aug 2021 17:28:07 +0000 (22:58 +0530)]
man: bridge: fix the typo to change "-c[lor]" into "-c[olor]" in man page
Fixes: 3a1ca9a5b ("bridge: update man page for new color and json changes")
Signed-off-by: Gokul Sivakumar <gokulkumar792@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Gokul Sivakumar [Tue, 17 Aug 2021 17:28:06 +0000 (22:58 +0530)]
bridge: fdb: don't colorize the "dev" & "dst" keywords in "bridge -c fdb"
To be consistent with the colorized output of "ip" command and to increase
readability, stop highlighting the "dev" & "dst" keywords in the colorized
output of "bridge -c fdb" cmd.
Example: in the following "bridge -c fdb" entry, only "00:00:00:00:00:00",
"vxlan100" and "2001:db8:2::1" fields should be highlighted in color.
00:00:00:00:00:00 dev vxlan100 dst 2001:db8:2::1 self permanent
Signed-off-by: Gokul Sivakumar <gokulkumar792@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Gokul Sivakumar [Tue, 17 Aug 2021 17:28:05 +0000 (22:58 +0530)]
bridge: reorder cmd line arg parsing to let "-c" detected as "color" option
As per the man/man8/bridge.8 page, the shorthand cmd line arg "-c" can be
used to colorize the bridge cmd output. But while parsing the args in the while
loop, matches() detects "-c" as "-compressedvlans" instead of "-color", so
fix this by checking for the "-color" option before checking for
"-compressedvlans".
Signed-off-by: Gokul Sivakumar <gokulkumar792@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
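Why the order of the checks matters with abbreviation matching can be shown with a small sketch. The helper is_abbrev() and both parse functions are hypothetical stand-ins, not the bridge code; only the two option names are taken from the commit message.

```c
#include <stdbool.h>
#include <string.h>

enum opt { OPT_UNKNOWN, OPT_COLOR, OPT_COMPRESSEDVLANS };

/* True when arg is a non-empty prefix of full (abbreviation matching). */
bool is_abbrev(const char *arg, const char *full)
{
	size_t len = strlen(arg);

	return len > 0 && len <= strlen(full) && strncmp(arg, full, len) == 0;
}

/* Old order: "-c" is a prefix of "-compressedvlans", so that check wins. */
enum opt parse_opt_old(const char *arg)
{
	if (is_abbrev(arg, "-compressedvlans"))
		return OPT_COMPRESSEDVLANS;
	if (is_abbrev(arg, "-color"))
		return OPT_COLOR;
	return OPT_UNKNOWN;
}

/* New order: "-color" is tested first, so "-c" resolves to it. */
enum opt parse_opt_new(const char *arg)
{
	if (is_abbrev(arg, "-color"))
		return OPT_COLOR;
	if (is_abbrev(arg, "-compressedvlans"))
		return OPT_COMPRESSEDVLANS;
	return OPT_UNKNOWN;
}
```

In this sketch, longer abbreviations such as "-com" still reach the second option, because they are no longer a prefix of "-color".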
Hangbin Liu [Mon, 16 Aug 2021 07:49:05 +0000 (15:49 +0800)]
ip/bond: add arp_validate filter support
Add arp_validate filter support based on kernel commit 896149ff1b2c
("bonding: extend arp_validate to be able to receive unvalidated arp-only traffic")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
devlink: Show port state values in man page and in the help command
Port function state can have one of two values: active or
inactive. Update the documentation and the help command to tell
the user about these two values.
With the introduction of state, hw_addr and state are optional.
Hence mark them as optional in the man page, which also aligns it with the
help command output.
Fixes: bdfb9f1bd61a ("devlink: Support set of port function state")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Phil Sutter [Wed, 4 Aug 2021 09:18:28 +0000 (11:18 +0200)]
tc: u32: Fix key folding in sample option
In between Linux kernel 2.4 and 2.6, key folding for hash tables changed
in kernel space. When iproute2 dropped support for the older algorithm,
the wrong code was removed and kernel 2.4 folding method remained in
place. To get things functional for recent kernels again, restoring the
old code alone was not sufficient - additional byteorder fixes were
needed.
While at it, make use of ffs() and thereby align the code with how
the kernel determines the shift width.
Fixes: 267480f55383c ("Backout the 2.4 utsname hash patch.")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
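As a rough sketch of what "using ffs() to determine the shift width" means: the shift is the index of the lowest set bit of the hash mask, and a key is folded by masking it and shifting the masked bits down. The function names are hypothetical, values are assumed to be in host byte order, and the byteorder handling mentioned in the commit message is deliberately left out.

```c
#include <stdint.h>
#include <strings.h>

/* Shift width: index of the lowest set bit of the mask, 0 for an empty mask. */
unsigned int fold_shift(uint32_t hmask)
{
	return hmask ? (unsigned int)ffs((int)hmask) - 1 : 0;
}

/* Fold a key into a bucket index: apply the mask, then shift down. */
uint32_t fold_key(uint32_t key, uint32_t hmask)
{
	return (key & hmask) >> fold_shift(hmask);
}
```

For a mask of 0x00ff0000 the shift width is 16, so the key 0x12345678 folds to 0x34.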
Jacob Keller [Thu, 5 Aug 2021 23:44:59 +0000 (16:44 -0700)]
devlink: fix infinite loop on flash update for drivers without status
When processing device flash update, cmd_dev_flash function waits until
the flash process has completed. This requires the following two
conditions to both be true:
a) we've received an exit status from the child process
b) we've received the DEVLINK_CMD_FLASH_UPDATE_END *or*
we haven't received any status notifications from the driver.
The original devlink flash status monitoring code in 9b13cddfe268
("devlink: implement flash status monitoring") was written assuming that
a driver will either send no status updates, or it will send at least
one DEVLINK_CMD_FLASH_UPDATE_STATUS before DEVLINK_CMD_FLASH_UPDATE_END.
Newer versions of the kernel since commit 52cc5f3a166a ("devlink: move flash
end and begin to core devlink") in v5.10 moved handling of the
DEVLINK_CMD_FLASH_UPDATE_END into the core stack, and will send this
regardless of whether or not the driver sends any of its own status
notifications.
The handling of DEVLINK_CMD_FLASH_UPDATE_END in cmd_dev_flash_status_cb
has an additional condition that it must not be the first message.
Otherwise, it falls back to treating it like
a DEVLINK_CMD_FLASH_UPDATE_STATUS.
This is wrong because it can lead to an infinite loop if a driver does
not send any status updates.
In this case, the kernel will send DEVLINK_CMD_FLASH_UPDATE_END without
any DEVLINK_CMD_FLASH_UPDATE_STATUS. The devlink application will see
that ctx->not_first is false, and will treat this like any other status
message. Thus, ctx->not_first will be set to 1.
The loop condition to exit flash update will thus never be true, since
we will wait forever, because ctx->not_first is true, and
ctx->received_end is false.
This leads to the application appearing to process the flash update, but
it will never exit.
Fix this by simply always treating DEVLINK_CMD_FLASH_UPDATE_END the same
regardless of whether it's the first message or not.
This is obviously the correct thing to do: once we've received the
DEVLINK_CMD_FLASH_UPDATE_END the flash update must be finished. For new
kernels this is always true, because we send this message in the core
stack after the driver flash update routine finishes.
For older kernels, some drivers may not have sent any
DEVLINK_CMD_FLASH_UPDATE_STATUS or DEVLINK_CMD_FLASH_UPDATE_END. This is
handled by the while loop conditional that exits if we get a return
value from the child process without having received any status
notifications.
An argument could be made that we should exit immediately when we get
either the DEVLINK_CMD_FLASH_UPDATE_END or an exit code from the child
process. However, at a minimum it makes no sense to ever process
DEVLINK_CMD_FLASH_UPDATE_END as if it were a DEVLINK_CMD_FLASH_UPDATE_STATUS.
This is easy to test as it is triggered by the selftests for the
netdevsim driver, which has a test case for both with and without status
notifications.
Fixes: 9b13cddfe268 ("devlink: implement flash status monitoring")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
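The infinite loop can be reproduced with a small model of the state described above. This is a sketch, not the devlink code: the enum, the struct and the function names are hypothetical, the field flash_done is an assumed name for "the child process reported its exit status", and only not_first and received_end are taken from the commit message.

```c
#include <stdbool.h>

enum flash_msg { FLASH_STATUS, FLASH_END };

struct flash_ctx {
	bool not_first;    /* at least one message was handled as a status */
	bool received_end; /* the END notification was recognised */
	bool flash_done;   /* assumed name: child reported its exit status */
};

/* Keep waiting until the child has exited and either END was seen or
 * no status notification was ever received. */
bool must_keep_waiting(const struct flash_ctx *ctx)
{
	return !ctx->flash_done || (ctx->not_first && !ctx->received_end);
}

/* Old handling: an END that arrives first is treated like a status. */
void on_msg_old(struct flash_ctx *ctx, enum flash_msg msg)
{
	if (msg == FLASH_END && ctx->not_first) {
		ctx->received_end = true;
		return;
	}
	ctx->not_first = true;
}

/* Fixed handling: END always ends the update, first message or not. */
void on_msg_new(struct flash_ctx *ctx, enum flash_msg msg)
{
	if (msg == FLASH_END) {
		ctx->received_end = true;
		return;
	}
	ctx->not_first = true;
}
```

Feeding a single FLASH_END followed by the child exit into on_msg_old() leaves not_first set and received_end clear, so must_keep_waiting() stays true forever; with on_msg_new() it becomes false.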
Feng Zhou [Sun, 1 Aug 2021 06:07:09 +0000 (14:07 +0800)]
lib/bpf: Fix btf_load error lead to enable debug log
When tc is used without verbose and bpf_btf_attach fails,
the condition
"if (fd < 0 && (errno == ENOSPC || !ctx->log_size))"
makes ctx->log_size != 0. Then, in bpf_prog_attach,
ctx->log_size != 0, so the debug log is enabled.
The verifier log can be very chatty on larger programs, and
bpf_prog_attach then fails with:
"Log buffer too small to dump verifier log 16777215 bytes (9 tries)!"
A BTF load failure does not affect the prog load; the prog still works.
So when the BTF/PROG load fails, enlarge log_size and re-fail only
when verbose is set.
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Peilin Ye [Tue, 20 Jul 2021 19:21:45 +0000 (12:21 -0700)]
tc/skbmod: Remove misinformation about the swap action
Currently man 8 tc-skbmod says that "...the swap action will occur after
any smac/dmac substitutions are executed, if they are present."
This is false. In fact, trying to "set" and "swap" in a single skbmod
command causes the "set" part to be completely ignored. As an example:
$ tc filter add dev eth0 parent 1: protocol ip prio 10 \
matchall action skbmod \
set dmac AA:AA:AA:AA:AA:AA smac BB:BB:BB:BB:BB:BB \
swap mac
The above command simply does a "swap", without setting DMAC or SMAC to
AA's or BB's. The root cause of this is in the kernel, see
net/sched/act_skbmod.c:tcf_skbmod_init():
parm = nla_data(tb[TCA_SKBMOD_PARMS]);
index = parm->index;
if (parm->flags & SKBMOD_F_SWAPMAC)
lflags = SKBMOD_F_SWAPMAC;
^^^^^^^^^^^^^^^^^^^^^^^^^^
Doing a "=" instead of "|=" clears all other "set" flags when doing a
"swap". Discourage using "set" and "swap" in the same command by
documenting it as undefined behavior, and update the "SYNOPSIS" section
as well as tc -help text accordingly.
If one really needs to e.g. "set" DMAC to all AA's then "swap" DMAC and
SMAC, one should do two separate commands and "pipe" them together.
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
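The effect of "=" versus "|=" on a flag word can be demonstrated in isolation. The sketch below is not the kernel code: the EX_F_* bit values and both function names are placeholders chosen for illustration, and the "|=" variant is shown only for contrast, not as a proposed fix.

```c
#include <stdint.h>

/* Placeholder flag bits for illustration; not the uapi values. */
#define EX_F_DMAC    (UINT64_C(1) << 0)
#define EX_F_SMAC    (UINT64_C(1) << 1)
#define EX_F_SWAPMAC (UINT64_C(1) << 3)

/* Mirrors the quoted logic: "=" throws away the bits collected so far. */
uint64_t build_flags_assign(int set_dmac, int set_smac, int swap)
{
	uint64_t lflags = 0;

	if (set_dmac)
		lflags |= EX_F_DMAC;
	if (set_smac)
		lflags |= EX_F_SMAC;
	if (swap)
		lflags = EX_F_SWAPMAC;
	return lflags;
}

/* Contrast only: "|=" would keep the earlier "set" bits. */
uint64_t build_flags_or(int set_dmac, int set_smac, int swap)
{
	uint64_t lflags = 0;

	if (set_dmac)
		lflags |= EX_F_DMAC;
	if (set_smac)
		lflags |= EX_F_SMAC;
	if (swap)
		lflags |= EX_F_SWAPMAC;
	return lflags;
}
```

With all three inputs set, the first function returns only the swap bit, which matches the observation that the "set" part is ignored.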
Roi Dayan [Mon, 12 Jul 2021 12:26:53 +0000 (15:26 +0300)]
police: Fix normal output back to what it was
The JSON support fix changed the normal output.
Set it back to what it was.
Print overhead with print_size().
Print newline before ref.
Fixes: 0d5cf51e0d6c ("police: Add support for json output")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
A successful call to recvmsg() causes msg.msg_controllen to contain the length
of the received ancillary data. However, the current code in the 'ip' utility
doesn't reset this value after each recvmsg().
This means that if a call to recvmsg() doesn't have ancillary data, then
'msg.msg_controllen' will be set to 0, causing future recvmsg() which do
contain ancillary data to get MSG_CTRUNC set in msg.msg_flags.
This fixes 'ip monitor' running with the all-nsid option - With this option the
kernel passes the nsid as ancillary data. If an event on the current netns is
received while 'ip monitor' is running, then no ancillary data will be sent,
causing 'msg.msg_controllen' to be set to 0, which causes 'ip monitor' to
indefinitely print "[nsid current]" instead of the real nsid.
Fixes: 449b824ad196 ("ipmonitor: allows to monitor in several netns")
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Lahav Schlesinger <lschlesinger@drivenets.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
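A minimal sketch of the pattern, assuming a caller-owned control buffer: recvmsg() overwrites msg_controllen with the length it actually received, so the field is re-armed to the buffer size before every call. The helper name recv_rearmed() is hypothetical and this is not the code from the 'ip' utility.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Receive one message, re-arming the ancillary-data buffer first.
 * Without the reset, a previous message that carried no ancillary data
 * leaves msg_controllen at 0 and later control data gets truncated. */
ssize_t recv_rearmed(int fd, struct msghdr *msg, void *cbuf, size_t cbuf_len)
{
	msg->msg_control = cbuf;
	msg->msg_controllen = cbuf_len;
	return recvmsg(fd, msg, 0);
}
```

A receive loop would call this helper for every message instead of setting msg_control and msg_controllen once before the loop.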
libnetlink: check error handler is present before a call
Fix nullptr dereference of errhndlr from rtnl_dump_filter_arg
struct in rtnl_dump_done and rtnl_dump_error functions.
Fixes: 459ce6e3d792 ("ip route: ignore ENOENT during save if RT_TABLE_MAIN is being dumped")
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Roi Dayan <roid@nvidia.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Reported-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
ip route: ignore ENOENT during save if RT_TABLE_MAIN is being dumped
We started to use the in-kernel filtering feature, which allows fetching only
the needed tables (see iproute_dump_filter()). From the kernel side it's
implemented in net/ipv4/fib_frontend.c (inet_dump_fib), net/ipv6/ip6_fib.c
(inet6_dump_fib). The problem here is that behaviour of "ip route save"
was changed after c7e6371bc ("ip route: Add protocol, table id and device to dump request").
If filters are used, then kernel returns ENOENT error if requested table
is absent, but in newly created net namespace even RT_TABLE_MAIN table
doesn't exist. It is really allocated, for instance, after issuing
"ip l set lo up".
Reproducer is fairly simple:
$ unshare -n ip route save > dump
Error: ipv4: FIB table does not exist.
Dump terminated
Expected result here is to get empty dump file (as it was before this
change).
v2: reworked, so, now it takes into account NLMSGERR_ATTR_MSG
(see nl_dump_ext_ack_done() function). We want to suppress error messages
in stderr about absent FIB table from kernel too.
v3: reworked to make code clearer. Introduced rtnl_suppressed_errors(),
rtnl_suppress_error() helpers. User may suppress up to 3 errors (may be
easily extended by changing SUPPRESS_ERRORS_INIT macro).
v4: reworked, rtnl_dump_filter_errhndlr() was introduced. Thanks
to Stephen Hemminger for comments and suggestions
v5: space fixes, commit message reformat, empty initializers
Fixes: c7e6371bc ("ip route: Add protocol, table id and device to dump request")
Cc: David Ahern <dsahern@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
When loading BPF programs which consist of multiple executable sections via
iproute2+libbpf (configured with LIBBPF_FORCE=on), we noticed that a
wrong section can be attached to a device. E.g.:
# tc qdisc replace dev lxc_health clsact
# tc filter replace dev lxc_health ingress prio 1 \
handle 1 bpf da obj bpf_lxc.o sec from-container
# tc filter show dev lxc_health ingress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1
bpf_lxc.o:[__send_drop_notify] <-- WRONG SECTION
direct-action not_in_hw id 38 tag 7d891814eda6809e jited
After taking a closer look into load_bpf_object() in lib/bpf_libbpf.c,
we noticed that the filter used in the program iterator does not check
whether a program section name matches a requested section name
(cfg->section). This can lead to a wrong prog FD being used to attach
the program.
Fixes: 6d61a2b55799 ("lib: add libbpf support")
Signed-off-by: Martynas Pumputis <m@lambda.lt>
Acked-by: Hangbin Liu <haliu@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
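The missing check boils down to comparing each program's section name with the requested one before using its FD. The sketch below models that without libbpf: struct prog_entry and find_prog_fd() are hypothetical stand-ins, and the iteration in lib/bpf_libbpf.c is not reproduced here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a loaded program: section name plus its FD. */
struct prog_entry {
	const char *section;
	int fd;
};

/* Return the FD of the first program whose section name equals the
 * requested one, or -1 when no program matches. */
int find_prog_fd(const struct prog_entry *progs, size_t n, const char *wanted)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(progs[i].section, wanted) == 0)
			return progs[i].fd;
	}
	return -1;
}
```

Without the strcmp() test, a lookup of this shape would hand back whichever program happens to come first, which is how an unrelated section can end up attached.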
Ben Hutchings [Mon, 28 Jun 2021 23:25:59 +0000 (01:25 +0200)]
devlink: Fix printf() type mismatches on 32-bit architectures
devlink currently uses "%lu" to format values of type uint64_t,
but on 32-bit architectures uint64_t is defined as unsigned
long long and this does not work correctly.
Fix this by using the standard macro PRIu64 instead.
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
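For illustration, a portable way to format a uint64_t with the PRIu64 macro from <inttypes.h> looks like this. The wrapper format_u64() is a hypothetical helper, not a devlink function.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format a 64-bit value portably: PRIu64 expands to the conversion
 * specifier that matches uint64_t on the platform being built for. */
int format_u64(char *buf, size_t len, uint64_t value)
{
	return snprintf(buf, len, "%" PRIu64, value);
}
```

The same format string then works unchanged on 32-bit and 64-bit builds, which a hard-coded "%lu" does not.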
Ben Hutchings [Mon, 28 Jun 2021 23:24:46 +0000 (01:24 +0200)]
utils: Fix BIT() to support up to 64 bits on all architectures
devlink and vdpa use BIT() together with 64-bit flag fields. devlink
is already using bit numbers greater than 31 and so does not work
correctly on 32-bit architectures.
Fix this by making BIT() use uint64_t instead of unsigned long.
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
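A sketch of a 64-bit-safe bit macro is shown below. The name BIT64 and the helper flag_is_set() are chosen for the example only; the exact definition used by the patch is not quoted in the commit message, so this is an assumption about its general shape.

```c
#include <stdint.h>

/* The constant 1 is widened to 64 bits before the shift, so bit numbers
 * above 31 are well defined even where unsigned long is 32 bits wide. */
#define BIT64(nr) (UINT64_C(1) << (nr))

/* Check whether flag bit nr is set in a 64-bit flag field. */
int flag_is_set(uint64_t flags, unsigned int nr)
{
	return (flags & BIT64(nr)) != 0;
}
```

With an unsigned long based macro, a shift by 32 or more on a 32-bit architecture is undefined behavior, which is the failure described above.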
Roi Dayan [Tue, 22 Jun 2021 05:42:50 +0000 (08:42 +0300)]
devlink: Fix link errors on some systems
On some systems we fail to link because of the missing math lib.
Add -lm to devlink.
LINK devlink
../lib/libutil.a(utils_math.o): In function `get_rate':
utils_math.c:(.text+0xcc): undefined reference to `floor'
../lib/libutil.a(utils_math.o): In function `get_size':
utils_math.c:(.text+0x384): undefined reference to `floor'
collect2: error: ld returned 1 exit status
make[1]: *** [Makefile:16: devlink] Error 1
make: *** [Makefile:64: all] Error 2
Fixes: 6c70aca76ef2 ("devlink: Add port func rate support")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Sergey Ryazanov [Tue, 22 Jun 2021 23:52:56 +0000 (02:52 +0300)]
iplink: support for WWAN devices
The WWAN subsystem has been extended to generalize the per data channel
network interfaces management. This change implements support for WWAN
links handling and actively uses the earlier introduced ip-link
capability to specify the parent by its device name.
The WWAN interface for a new data channel should be created with a
command like this:
ip link add dev wwan0-2 parentdev wwan0 type wwan linkid 2
Where: wwan0 is the modem HW device name (should be taken from
/sys/class/wwan) and linkid is an identifier of the opened data
channel.
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Sergey Ryazanov [Tue, 22 Jun 2021 23:52:55 +0000 (02:52 +0300)]
iplink: add support for parent device
Add support for specifying a parent device (struct device) by its name
during the link creation and printing parent name in the links list.
This option will be used to create WWAN links and possibly by other
device classes that do not have a "natural parent netdev".
Add the parent device bus name printing for links list info
completeness. But do not add a corresponding command line argument, as
we do not have a use case for this attribute.
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Paolo Lungaroni [Thu, 17 Jun 2021 17:23:54 +0000 (19:23 +0200)]
seg6: add support for SRv6 End.DT46 Behavior
We introduce the new "End.DT46" action for supporting the SRv6 End.DT46
Behavior in iproute2.
The SRv6 End.DT46 Behavior, defined in RFC 8986 [1] section 4.8, can be
used to implement L3 VPNs based on Segment Routing over IPv6 networks in
multi-tenant environments and it is capable of handling both IPv4 and
IPv6 tenant traffic at the same time.
The SRv6 End.DT46 Behavior decapsulates the received packets and it
performs the IPv4 or IPv6 routing lookup in the routing table of the
tenant.
As for the End.DT4 and for the End.DT6 in VRF mode, the SRv6 End.DT46
Behavior leverages a VRF device in order to force the routing lookup into
the associated routing table using the "vrftable" attribute.
To make the End.DT46 work properly, it must be guaranteed that the
routing table used for routing lookup operations is bound to one and
only one VRF during the tunnel creation. Such constraint has to be
enforced by enabling the VRF strict_mode sysctl parameter, i.e.:
$ sysctl -wq net.vrf.strict_mode=1
Note that the same approach is used for the End.DT4 Behavior and for the
End.DT6 Behavior in VRF mode.
An SRv6 End.DT46 Behavior instance can be created as follows:
$ ip -6 route add 2001:db8::1 encap seg6local action End.DT46 vrftable 100 dev vrf100
Standard Output:
$ ip -6 route show 2001:db8::1
2001:db8::1 encap seg6local action End.DT46 vrftable 100 dev vrf100 metric 1024 pref medium
This patch updates the route.8 man page and the ip route help with the
information related to End.DT46.
Considering that the same information was missing for the SRv6 End.DT4 and
the End.DT6 Behaviors, we have also added it.
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Paolo Lungaroni <paolo.lungaroni@uniroma2.it>
Signed-off-by: David Ahern <dsahern@kernel.org>
Guillaume Nault [Fri, 11 Jun 2021 09:46:16 +0000 (11:46 +0200)]
utils: bump max args number to 512 for batch files
Large tc filters can have many arguments. For example the following
filter matches the first 7 MPLS LSEs, pops all of them, then updates
the Ethernet header and redirects the resulting packet to eth1.
David Ahern [Sat, 12 Jun 2021 04:38:34 +0000 (04:38 +0000)]
Merge branch 'devlink-rate-support' into next
Dmytro Linkin says:
====================
Series implements devlink rate commands, which are:
- Dump particular or all rate objects (JSON or non-JSON)
- Add/Delete node rate object
- Set tx rate share/max values for rate object
- Set/Unset parent rate object for other rate object
Examples:
Display all rate objects:
# devlink port function rate show
pci/0000:03:00.0/1 type leaf parent some_group
pci/0000:03:00.0/2 type leaf tx_share 12Mbit
pci/0000:03:00.0/some_group type node tx_share 1Gbps tx_max 5Gbps
Display leaf rate object bound to the 1st devlink port of the
pci/0000:03:00.0 device:
# devlink port function rate show pci/0000:03:00.0/1
pci/0000:03:00.0/1 type leaf
Display node rate object with name some_group of the pci/0000:03:00.0
device:
# devlink port function rate show pci/0000:03:00.0/some_group
pci/0000:03:00.0/some_group type node
Display leaf rate object rate values using IEC units:
# devlink -i port function rate show pci/0000:03:00.0/2
pci/0000:03:00.0/2 type leaf 11718Kibit
Display pci/0000:03:00.0/2 leaf rate object as pretty JSON output:
# devlink -jp port function rate show pci/0000:03:00.0/2
{
"rate": {
"pci/0000:03:00.0/2": {
"type": "leaf",
"tx_share": 1500000
}
}
}
Create node rate object with name "1st_group" on pci/0000:03:00.0 device:
# devlink port function rate add pci/0000:03:00.0/1st_group
Create node rate object with specified parameters:
# devlink port function rate add pci/0000:03:00.0/2nd_group \
tx_share 10Mbit tx_max 30Mbit parent 1st_group
Set parameters to the specified leaf rate object:
# devlink port function rate set pci/0000:03:00.0/1 \
tx_share 2Mbit tx_max 10Mbit
Set leaf's parent to "1st_group":
# devlink port function rate set pci/0000:03:00.0/1 parent 1st_group
Unset leaf's parent:
# devlink port function rate set pci/0000:03:00.0/1 noparent
Delete node rate object:
# devlink port function rate del pci/0000:03:00.0/2nd_group
Rate values can be specified in bits or bytes per second (bit|bps), with
any SI (k, m, g, t) or IEC (ki, mi, gi, ti) prefix. Bare number means
bits per second. Units are also printed in "show" command output, but not
necessarily the same ones that were specified with the "set" or "add" command.
The -i/--iec switch forces output in IEC units. JSON output always prints
values as bytes per sec.
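The numbers in the examples above can be checked with simple arithmetic: a tx_share of 12Mbit is 12000000 bits per second, which is 1500000 bytes per second (the JSON value) and 11718 Kibit when divided by 1024 and truncated (the -i output). The helper names below are hypothetical, and truncation toward zero is an assumption inferred from the sample output rather than taken from the devlink code.

```c
#include <stdint.h>

/* Bits per second to bytes per second, as printed in the JSON output. */
uint64_t bits_to_bytes_per_sec(uint64_t bits_per_sec)
{
	return bits_per_sec / 8;
}

/* Bits per second to whole Kibit (IEC, factor 1024), truncating. */
uint64_t bits_to_kibit(uint64_t bits_per_sec)
{
	return bits_per_sec / 1024;
}
```

Note that 12000000 / 1024 is 11718.75, so the printed IEC value loses the fractional part.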
Dmytro Linkin [Fri, 11 Jun 2021 07:25:36 +0000 (10:25 +0300)]
devlink: Add port func rate support
Implement user commands to manage devlink port func rate objects.
List all rate commands:
$ devlink port func rate help
or just
$ devlink port func rate
To list all OR particular rate object:
$ devlink port func rate show
pci/0000:03:00.0/some_group: type node
pci/0000:03:00.0/0: type leaf
pci/0000:03:00.0/1: type leaf
$ devlink port func rate show pci/0000:03:00.0/1
pci/0000:03:00.0/0: type leaf
$ devlink port func rate show pci/0000:03:00.0/some_group
pci/0000:03:00.0/some_group: type node
A rate object of type "leaf" is created by its driver, where the name is the
name of the corresponding devlink port. A rate object of type "node" represents
a rate group created by the user using commands:
$ devlink port func rate add pci/0000:03:00.0/some_group
Dmytro Linkin [Fri, 11 Jun 2021 07:25:35 +0000 (10:25 +0300)]
devlink: Add helper function to validate object handler
Every handler argument is validated in two steps. The first step, form
checking, expects the identifier to be a few words separated by slashes.
For device and region handlers it only checks whether the identifier has the
expected number of slashes.
Add a generic function to do that and make the code cleaner & consistent.
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Parav Pandit [Mon, 7 Jun 2021 19:24:06 +0000 (22:24 +0300)]
devlink: Add optional controller user input
A user can optionally provide the external controller number when creating
a devlink port for the external controller.
An example on eswitch system:
$ devlink dev eswitch set pci/0033:01:00.0 mode switchdev
$ devlink port show
pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
function:
hw_addr 00:00:00:00:00:00
$ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 external true splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
Hangbin Liu [Mon, 31 May 2021 09:47:39 +0000 (17:47 +0800)]
configure: add options ability
There are more and more global environment variables that land everywhere
in configure, which makes it hard for users to know which one does what.
Using command-line options would make it easier for users to learn or
remember the config options.
This patch converts the INCLUDE variable to a command option first. Check
whether the first argument has a '-' so that the old INCLUDE path
setting method keeps working.
Signed-off-by: Hangbin Liu <haliu@redhat.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Florian Westphal [Mon, 17 May 2021 05:10:10 +0000 (07:10 +0200)]
libgenl: make genl_add_mcast_grp set errno on error
genl_add_mcast_grp doesn't set errno in all cases.
On kernels that support mptcp but lack event support (all kernels <= 5.11)
MPTCP_PM_EV_GRP_NAME won't be found and ip will exit with
"can't subscribe to mptcp events: Success"
Set errno to a meaningful value (ENOENT) when the group name isn't found
and also cover other spots where it returns nonzero with errno unset.
Fixes: ff619e4fd370 ("mptcp: add support for event monitoring")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
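The "Success" message comes from reporting an error while errno is still 0. A lookup that fails should therefore set errno itself before returning. The sketch below shows that convention with a hypothetical group table; struct grp_entry and find_grp_id() are not part of libgenl.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: multicast group name and its numeric id. */
struct grp_entry {
	const char *name;
	int id;
};

/* Look up a group id by name. On failure return -1 and set errno to
 * ENOENT, so that a later strerror(errno) does not print "Success". */
int find_grp_id(const struct grp_entry *grps, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(grps[i].name, name) == 0)
			return grps[i].id;
	}
	errno = ENOENT;
	return -1;
}
```

Callers can then keep their existing error reporting and get "No such file or directory" instead of a misleading success text.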
Heiko Thiery [Sat, 8 May 2021 06:49:26 +0000 (08:49 +0200)]
lib/fs: fix issue when {name,open}_to_handle_at() is not implemented
Commit d5e6ee0dac64 introduced the use of the functions name_to_handle_at()
and open_by_handle_at(). But these functions are not available
e.g. in uclibc-ng < 1.0.35. For backward compatibility, check for their
availability in the configure script and, if they are absent, do a direct
syscall.
Fixes: d5e6ee0dac64 ("ss: introduce cgroup2 cache and helper functions")
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Petr Vorel <petr.vorel@gmail.com>
Signed-off-by: Heiko Thiery <heiko.thiery@gmail.com>
Reviewed-by: Petr Vorel <petr.vorel@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
David Ahern [Sun, 9 May 2021 22:50:18 +0000 (22:50 +0000)]
config.mk: Rerun configure when it is newer than config.mk
config.mk needs to be re-generated any time configure is changed.
Rename the existing make target and add a check that the config.mk
file needs to exist and must be newer than configure script.
Signed-off-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Petr Vorel <petr.vorel@gmail.com>
Tested-by: Petr Vorel <petr.vorel@gmail.com>
Jakub Kicinski [Sat, 1 May 2021 03:10:59 +0000 (20:10 -0700)]
ip: dynamically size columns when printing stats
This change makes ip -s -s output size the columns
automatically. I often find myself using json
output because the normal output is unreadable.
Even on a laptop after 2 days of uptime byte
and packet counters almost overflow their columns,
let alone a busy server.
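One way to size a column automatically is to take the widest value that will be printed in it. The sketch below is an assumption about the general approach, not the code from this change; dec_digits() and column_width() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of decimal digits needed to print v (1 for zero). */
unsigned int dec_digits(uint64_t v)
{
	unsigned int n = 1;

	while (v >= 10) {
		v /= 10;
		n++;
	}
	return n;
}

/* Width of a column: the widest value, but never less than min_width
 * (for example the length of the column header). */
unsigned int column_width(const uint64_t *vals, size_t n, unsigned int min_width)
{
	unsigned int width = min_width;

	for (size_t i = 0; i < n; i++) {
		unsigned int d = dec_digits(vals[i]);

		if (d > width)
			width = d;
	}
	return width;
}
```

The result can be passed to printf() through the "%*" width specifier so that all rows of a column line up.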
Paolo Lungaroni [Sat, 8 May 2021 15:44:58 +0000 (17:44 +0200)]
seg6: add counters support for SRv6 Behaviors
We introduce the "count" optional attribute for supporting counters in SRv6
Behaviors as defined in [1], section 6. For each SRv6 Behavior instance,
counters defined in [1] are:
- the total number of packets that have been correctly processed;
- the total amount of traffic in bytes of all packets that have been
correctly processed;
In addition, we introduce a new counter that counts the number of packets
that have NOT been properly processed (i.e. errors) by an SRv6 Behavior
instance.
Each SRv6 Behavior instance can be configured, at the time of its creation,
to make use of counters by specifying the "count" attribute as follows:
$ ip -6 route add 2001:db8::1 encap seg6local action End count dev eth0
Per-behavior counters can be shown by adding "-s" to the iproute2 command
line, i.e.:
$ ip -s -6 route show 2001:db8::1
2001:db8::1 encap seg6local action End packets 0 bytes 0 errors 0 dev eth0
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Paolo Lungaroni <paolo.lungaroni@uniroma2.it>
Signed-off-by: David Ahern <dsahern@kernel.org>
Andrea Claudi [Thu, 6 May 2021 10:42:06 +0000 (12:42 +0200)]
tc: htb: improve burst error messages
When a wrong value is provided for "burst" or "cburst" parameters, the
resulting error message is unclear and can be misleading:
$ tc class add dev dummy0 parent 1: classid 1:1 htb rate 100KBps burst errtrigger
Illegal "buffer"
The message claims an illegal "buffer" is provided, but neither the
inline help nor the man page lists "buffer" among the htb parameters, and
the only way to know that "burst", "maxburst" and "buffer" are synonyms
is to look into tc/q_htb.c.
This commit improves this by changing the error string to the parameter
name given in the user's command, clearly pointing out where the wrong
value is.
$ tc class add dev dummy0 parent 1: classid 1:1 htb rate 100KBps burst errtrigger
Illegal "burst"
$ tc class add dev dummy0 parent 1: classid 1:1 htb rate 100Kbps maxburst errtrigger
Illegal "maxburst"
Reported-by: Sebastian Mitterle <smitterl@redhat.com> Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Signed-off-by: David Ahern <dsahern@kernel.org>
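The idea can be sketched as follows. This is a hypothetical parser, not the code in tc/q_htb.c: it accepts plain decimal integers only (the real tool also understands size suffixes), and parse_size_option() and its parameters are names invented for the example. The point is that the error message is built from the keyword the user typed rather than from a fixed synonym.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical parser for a size-like option. "keyword" is the option
 * name exactly as the user typed it ("burst", "maxburst", "buffer", ...).
 * On failure the message names that keyword, not a hard-coded synonym.
 */
static int parse_size_option(const char *keyword, const char *value,
			     unsigned long *out, char *err, size_t errlen)
{
	unsigned long v;
	char *end;

	assert(keyword && value && out && err);

	errno = 0;
	v = strtoul(value, &end, 10);
	if (end == value || *end != '\0' || errno == ERANGE) {
		snprintf(err, errlen, "Illegal \"%s\"", keyword);
		return -1;
	}
	*out = v;
	return 0;
}
```

Passing the keyword down to the parser costs one extra argument, but every synonym then reports itself correctly without a per-synonym error string.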
Fix this by returning an error if the key length is longer than
TIPC_AEAD_KEYLEN_MAX.
Fixes: 24bee3bf9752 ("tipc: add new commands to set TIPC AEAD key") Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Fix this by returning an error if the provided algname is longer than
TIPC_AEAD_ALG_NAME.
Fixes: 24bee3bf9752 ("tipc: add new commands to set TIPC AEAD key") Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Signed-off-by: David Ahern <dsahern@kernel.org>
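Both length checks above follow the same pattern, which might look roughly like the sketch below. The struct, the function and EXAMPLE_KEYLEN_MAX (deliberately tiny here) are placeholders standing in for the TIPC definitions, and the assumption that the check protects a fixed-size buffer is mine, not stated in the commit text.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical limit standing in for TIPC_AEAD_KEYLEN_MAX. */
#define EXAMPLE_KEYLEN_MAX 8

struct example_key {
	unsigned int keylen;
	char key[EXAMPLE_KEYLEN_MAX];
};

/* Copy src into dst, refusing input that does not fit the buffer. */
static int example_set_key(struct example_key *dst, const char *src)
{
	size_t len;

	assert(dst && src);

	len = strlen(src);
	if (len > EXAMPLE_KEYLEN_MAX)
		return -1;	/* bail out before touching dst */

	memcpy(dst->key, src, len);
	dst->keylen = (unsigned int)len;
	return 0;
}
```

Checking the length before the copy means an over-long argument produces an error for the user and leaves the destination unchanged.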
Hoang Le [Thu, 6 May 2021 03:27:24 +0000 (10:27 +0700)]
tipc: call a sub-routine in separate socket
When receiving a result from the first query to netlink, we may execute
another query inside the callback. If this sub-routine is called on the
same socket, the result from the previous execution is discarded.
To avoid this, we perform the nested query on a separate socket.
Fixes: 202102830663 ("tipc: use the libmnl functions in lib/mnl_utils.c") Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Tyson Moore [Thu, 29 Apr 2021 18:28:47 +0000 (14:28 -0400)]
tc-cake: update docs to include LE diffserv
Linux kernel commit b8392808eb3fc28e ("sch_cake: add RFC 8622 LE PHB
support to CAKE diffserv handling") added packets with LE diffserv to
the Bulk priority tin. Update the documentation to reflect this change.
Signed-off-by: Tyson Moore <tyson@tyson.me> Signed-off-by: David Ahern <dsahern@kernel.org>
Andrea Claudi [Sat, 1 May 2021 16:39:23 +0000 (18:39 +0200)]
dcb: fix memory leak
main() dynamically allocates dcb, but when dcb_help() is called it
returns without freeing it.
Fix this by using a goto, as is already done elsewhere in the same function.
Fixes: 67033d1c1c8a ("Add skeleton of a new tool, dcb") Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Reviewed-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@kernel.org>
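The goto-based cleanup pattern mentioned above can be sketched as below. This is not the dcb source: example_ctx, example_main() and the live_contexts counter are invented for the illustration, the counter existing only so the absence of a leak can be checked.

```c
#include <assert.h>
#include <stdlib.h>

struct example_ctx {
	int verbose;
};

/* Number of contexts allocated and not yet freed (for illustration). */
static int live_contexts;

static struct example_ctx *example_ctx_alloc(void)
{
	struct example_ctx *ctx = calloc(1, sizeof(*ctx));

	if (ctx)
		live_contexts++;
	return ctx;
}

static void example_ctx_free(struct example_ctx *ctx)
{
	assert(ctx != NULL);
	free(ctx);
	live_contexts--;
}

static int example_main(int show_help)
{
	struct example_ctx *ctx;
	int ret = 0;

	ctx = example_ctx_alloc();
	if (!ctx)
		return 1;

	if (show_help) {
		/* A bare "return 0;" here would leak ctx. */
		goto out;
	}

	/* ... normal work, which may set ret ... */

out:
	example_ctx_free(ctx);
	return ret;
}
```

Routing every exit after the allocation through the same label keeps the free in one place, so a newly added early exit cannot silently skip it.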
Andrea Claudi [Sat, 1 May 2021 16:39:22 +0000 (18:39 +0200)]
dcb: fix return value on dcb_cmd_app_show
dcb_cmd_app_show() is supposed to return EINVAL if an incorrect argument
is provided.
Fixes: 8e9bed1493f5 ("dcb: Add a subtool for the DCB APP object") Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Reviewed-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@kernel.org>
Andrea Claudi [Sat, 1 May 2021 17:05:45 +0000 (19:05 +0200)]
lib: bpf_legacy: avoid to pass invalid argument to close()
In function bpf_obj_open, if bpf_fetch_prog_arg() returns an error, we
end up in the out: path with a negative value for fd, and pass it to
close().
Avoid this by checking for fd to be positive.
Fixes: 32e93fb7f66d ("{f,m}_bpf: allow for sharing maps") Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Andrea Claudi [Sat, 1 May 2021 16:44:35 +0000 (18:44 +0200)]
tc: q_ets: drop dead code from argument parsing
Checking for nbands to be at least 1 at this point is useless. Indeed:
- ets requires "bands", "quanta" or "strict" to be specified
- if "bands" is specified, nbands cannot be negative, see parse_nbands()
- if "strict" is specified, nstrict cannot be negative, see
parse_nbands()
- if "quantum" is specified, nquanta cannot be negative, see
parse_quantum()
- if "bands" is not specified, nbands is set to nstrict+nquanta
- the previous if statement takes care of the case when none of them are
specified and nbands is 0, terminating execution.
Thus nbands cannot be < 1 at this point and this code cannot be executed.
Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Signed-off-by: David Ahern <dsahern@kernel.org>
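The argument above can be restated as a small sketch. It is a simplification and not the q_ets.c code: the function name is hypothetical, and an unspecified "bands" is modelled here as nbands == 0, which is an assumption of the example.

```c
#include <assert.h>

/*
 * Returns the effective number of bands (always >= 1), or -1 when
 * none of "bands", "strict" or "quanta" was given.
 */
static int example_effective_nbands(unsigned int nbands, unsigned int nstrict,
				    unsigned int nquanta)
{
	/* Counterpart of the "previous if statement": nothing specified. */
	if (!nbands && !nstrict && !nquanta)
		return -1;

	/* "bands" not specified: derive it from the other two. */
	if (!nbands)
		nbands = nstrict + nquanta;

	/* At least one input was non-zero, so a "nbands < 1" test is dead. */
	assert(nbands >= 1);
	return (int)nbands;
}
```

Since the counts cannot be negative and the all-zero case has already returned, the sum on the fallback path is at least 1, which is why a later lower-bound check can never fire.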
mptcp: make sure flag signal is set when add addr with port
When adding an address with a port, the intent is to send an ADD_ADDR to
the remote, so the signal flag must be set.
Fixes: 42fbca91cd61 ("mptcp: add support for port based endpoint") Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David Ahern <dsahern@kernel.org>
The default behavior for source MACVLAN is to duplicate packets to
appropriate type source devices, and then do the normal destination MACVLAN
flow. This patch adds an option to skip destination MACVLAN processing if
any matching source MACVLAN device has the option set.
This allows setting up a "catch all" device for source MACVLAN: create one
or more devices with type source nodst, and one device with e.g. type vepa,
and incoming traffic will be received on exactly one device.
Signed-off-by: Jethro Beekman <kernel@jbeekman.nl> Signed-off-by: David Ahern <dsahern@kernel.org>
$ rdma res show srq
dev ibp8s0f0 srqn 0 type BASIC pdn 3 comm [ib_ipoib]
dev ibp8s0f0 srqn 4 type BASIC lqpn 125-128,130-140 pdn 9 pid 3581 comm ibv_srq_pingpon
dev ibp8s0f0 srqn 5 type BASIC lqpn 141-156 pdn 10 pid 3584 comm ibv_srq_pingpon
dev ibp8s0f0 srqn 6 type BASIC lqpn 157-172 pdn 11 pid 3590 comm ibv_srq_pingpon
dev ibp8s0f1 srqn 0 type BASIC pdn 3 comm [ib_ipoib]
dev ibp8s0f1 srqn 1 type BASIC lqpn 329-344 pdn 4 pid 3586 comm ibv_srq_pingpon
$ rdma res show srq lqpn 126-141
dev ibp8s0f0 srqn 4 type BASIC lqpn 126-128,130-140 pdn 9 pid 3581 comm ibv_srq_pingpon
dev ibp8s0f0 srqn 5 type BASIC lqpn 141 pdn 10 pid 3584 comm ibv_srq_pingpon
$ rdma res show srq lqpn 127
dev ibp8s0f0 srqn 4 type BASIC lqpn 127 pdn 9 pid 3581 comm ibv_srq_pingpon
Reviewed-by: Ido Kalir <idok@nvidia.com> Reviewed-by: Mark Zhang <markz@mellanox.com> Signed-off-by: Neta Ostrovsky <netao@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
$ rdma res show ctx
dev ibp8s0f0 ctxn 0 pid 980 comm ibv_rc_pingpong
dev ibp8s0f0 ctxn 1 pid 981 comm ibv_rc_pingpong
dev ibp8s0f0 ctxn 2 pid 992 comm ibv_rc_pingpong
dev ibp8s0f1 ctxn 0 pid 984 comm ibv_rc_pingpong
dev ibp8s0f1 ctxn 1 pid 987 comm ibv_rc_pingpong
$ rdma res show ctx dev ibp8s0f1
dev ibp8s0f1 ctxn 0 pid 984 comm ibv_rc_pingpong
dev ibp8s0f1 ctxn 1 pid 987 comm ibv_rc_pingpong
Reviewed-by: Mark Zhang <markz@mellanox.com> Reviewed-by: Ido Kalir <idok@nvidia.com> Signed-off-by: Neta Ostrovsky <netao@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Andrea Claudi [Mon, 19 Apr 2021 13:49:57 +0000 (15:49 +0200)]
lib: bpf_legacy: fix missing socket close when connect() fails
In functions bpf_{send,recv}_map_fds(), when connect() fails after a
socket has been successfully opened, we return with an error without
closing the socket.
Fix this by closing the socket if it was opened and using a single return
point in both functions.
Fixes: 6256f8c9e45f ("tc, bpf: finalize eBPF support for cls and act front-end") Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
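The single-return-point shape described above might look like the following sketch. It is not the bpf_legacy code: example_send_over_unix() is a hypothetical helper that connects a UNIX datagram socket and sends one buffer, chosen only to show that every failure path after socket() goes through the same close().

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Returns 0 on success, -1 on failure; never leaks the socket. */
static int example_send_over_unix(const char *path, const void *buf,
				  size_t len)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int fd = -1, ret = -1;

	assert(path != NULL);

	if (strlen(path) >= sizeof(addr.sun_path)) {
		errno = ENAMETOOLONG;
		goto out;
	}
	strcpy(addr.sun_path, path);

	fd = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (fd < 0)
		goto out;

	/* Previously the risky spot: failing here must still close fd. */
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto out;

	if (send(fd, buf, len, 0) < 0)
		goto out;

	ret = 0;
out:
	if (fd >= 0)
		close(fd);
	return ret;
}
```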
Andrea Claudi [Mon, 19 Apr 2021 13:49:56 +0000 (15:49 +0200)]
lib: bpf_legacy: treat 0 as a valid file descriptor
As stated in its man page, open() returns a non-negative integer as a
file descriptor. Hence, when checking whether its return value is ok, we
should include 0 as a valid value.
This fixes a covscan warning about a missing close() in this function.
Fixes: ecb05c0f997d ("bpf: improve error reporting around tail calls") Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
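In code, the distinction comes down to which comparison guards the error path. The helper below is a hypothetical example, not taken from bpf_legacy.c.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 0 if path can be opened for reading, -1 otherwise. */
static int example_can_open(const char *path)
{
	int fd;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)	/* not "fd <= 0": descriptor 0 is valid */
		return -1;

	close(fd);
	return 0;
}
```

A "fd <= 0" test would treat a successful open() that happens to return 0 (possible when standard input is closed) as a failure and skip the close(), which is the kind of leak a static analyzer flags.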