git.ipfire.org Git - thirdparty/iproute2.git/log

Merge branch 'bridge-mdb-vxlan-attr' into next

Ido Schimmel says:

====================

Add support for new VXLAN MDB attributes.

See kernel merge commit abf36703d704 ("Merge branch
'vxlan-MDB-support'") for background and motivation.

====================

Signed-off-by: David Ahern <dsahern@kernel.org>

bridge: mdb: Document the catchall MDB entries

Document the catchall MDB entries used to transmit IPv4 and IPv6
unregistered multicast packets.

In deployments where inter-subnet multicast forwarding is used, not all
the VTEPs in a tenant domain are members in all the broadcast domains.
It is therefore advantageous to transmit BULL (broadcast, unknown
unicast and link-local multicast) and unregistered IP multicast traffic
on different tunnels. If the same tunnel was used, a VTEP only
interested in IP multicast traffic would also pull all the BULL traffic
and drop it as it is not a member in the originating broadcast domain
[1].

[1] https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-irb-mcast#section-2.6

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>

bridge: mdb: Add outgoing interface support

In a similar fashion to VXLAN FDB entries, allow user space to program
and view the outgoing interface of VXLAN MDB entries. Specifically, add
support for the 'MDBE_ATTR_IFINDEX' and 'MDBA_MDB_EATTR_IFINDEX'
attributes in request and response messages, respectively.

The outgoing interface will be forced during the underlay route lookup
and is required when the underlay destination IP is multicast, as the
multicast routing tables are not consulted.

Example:

# bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 198.51.100.1 via dummy10

$ bridge -d -s mdb show
dev vxlan0 port vxlan0 grp 239.1.1.1 permanent filter_mode exclude proto static dst 198.51.100.1 via dummy10    0.00

$ bridge -d -s -j -p mdb show
[ {
         "mdb": [ {
                 "index": 10,
                 "dev": "vxlan0",
                 "port": "vxlan0",
                 "grp": "239.1.1.1",
                 "state": "permanent",
                 "filter_mode": "exclude",
                 "protocol": "static",
                 "flags": [ ],
                 "dst": "198.51.100.1",
                 "via": "dummy10",
                 "timer": "   0.00"
             } ],
         "router": {}
     } ]

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>

bridge: mdb: Add source VNI support

In a similar fashion to VXLAN FDB entries, allow user space to program
and view the source VNI of VXLAN MDB entries. Specifically, add support
for the 'MDBE_ATTR_SRC_VNI' and 'MDBA_MDB_EATTR_SRC_VNI' attributes in
request and response messages, respectively.

The source VNI is only relevant when the VXLAN device is in external
mode, where multiple VNIs can be multiplexed over a single VXLAN device.

Example:

# bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 198.51.100.1 src_vni 2222

$ bridge -d -s mdb show
dev vxlan0 port vxlan0 grp 239.1.1.1 permanent filter_mode exclude proto static dst 198.51.100.1 src_vni 2222    0.00

$ bridge -d -s -j -p mdb show
[ {
         "mdb": [ {
                 "index": 16,
                 "dev": "vxlan0",
                 "port": "vxlan0",
                 "grp": "239.1.1.1",
                 "state": "permanent",
                 "filter_mode": "exclude",
                 "protocol": "static",
                 "flags": [ ],
                 "dst": "198.51.100.1",
                 "src_vni": 2222,
                 "timer": "   0.00"
             } ],
         "router": {}
     } ]

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>

bridge: mdb: Add destination VNI support

In a similar fashion to VXLAN FDB entries, allow user space to program
and view the destination VNI of VXLAN MDB entries. Specifically, add
support for the 'MDBE_ATTR_VNI' and 'MDBA_MDB_EATTR_VNI' attributes in
request and response messages, respectively.

This is useful when ingress replication (IR) is used and the destination
VXLAN tunnel endpoint (VTEP) is not a member of the source broadcast
domain (BD). In this case, the ingress VTEP should transmit the packet
using the VNI of the Supplementary Broadcast Domain (SBD) in which all
the VTEPs are member of [1].

Example:

# bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 198.51.100.1 vni 1111

$ bridge -d -s mdb show
dev vxlan0 port vxlan0 grp 239.1.1.1 permanent filter_mode exclude proto static dst 198.51.100.1 vni 1111    0.00

$ bridge -d -s -j -p mdb show
[ {
         "mdb": [ {
                 "index": 15,
                 "dev": "vxlan0",
                 "port": "vxlan0",
                 "grp": "239.1.1.1",
                 "state": "permanent",
                 "filter_mode": "exclude",
                 "protocol": "static",
                 "flags": [ ],
                 "dst": "198.51.100.1",
                 "vni": 1111,
                 "timer": "   0.00"
             } ],
         "router": {}
     } ]

[1] https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-irb-mcast#section-3.2.2

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>

bridge: mdb: Add UDP destination port support

In a similar fashion to VXLAN FDB entries, allow user space to program
and view the UDP destination port of VXLAN MDB entries. Specifically,
add support for the 'MDBE_ATTR_DST_PORT' and 'MDBA_MDB_EATTR_DST_PORT'
attributes in request and response messages, respectively.

Use the keyword "dst_port" instead of "port" as the latter is already
used to specify the net device associated with the MDB entry.

Example:

# bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 198.51.100.1 dst_port 1234

$ bridge -d -s mdb show
dev vxlan0 port vxlan0 grp 239.1.1.1 permanent filter_mode exclude proto static dst 198.51.100.1 dst_port 1234    0.00

$ bridge -d -s -j -p mdb show
[ {
         "mdb": [ {
                 "index": 15,
                 "dev": "vxlan0",
                 "port": "vxlan0",
                 "grp": "239.1.1.1",
                 "state": "permanent",
                 "filter_mode": "exclude",
                 "protocol": "static",
                 "flags": [ ],
                 "dst": "198.51.100.1",
                 "dst_port": 1234,
                 "timer": "   0.00"
             } ],
         "router": {}
     } ]

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>

bridge: mdb: Add underlay destination IP support

Allow user space to program and view VXLAN MDB entries. Specifically,
add support for the 'MDBE_ATTR_DST' and 'MDBA_MDB_EATTR_DST' attributes
in request and response messages, respectively.

The attributes encode the IP address of the destination VXLAN tunnel
endpoint where multicast receivers for the specified multicast flow
reside.

Multiple destinations can be added for each flow.

Example:

# bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 198.51.100.1
# bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 192.0.2.1

$ bridge -d -s mdb show
dev vxlan0 port vxlan0 grp 239.1.1.1 permanent filter_mode exclude proto static dst 192.0.2.1    0.00
dev vxlan0 port vxlan0 grp 239.1.1.1 permanent filter_mode exclude proto static dst 198.51.100.1    0.00

$ bridge -d -s -j -p mdb show
[ {
         "mdb": [ {
                 "index": 15,
                 "dev": "vxlan0",
                 "port": "vxlan0",
                 "grp": "239.1.1.1",
                 "state": "permanent",
                 "filter_mode": "exclude",
                 "protocol": "static",
                 "flags": [ ],
                 "dst": "192.0.2.1",
                 "timer": "   0.00"
             },{
                 "index": 15,
                 "dev": "vxlan0",
                 "port": "vxlan0",
                 "grp": "239.1.1.1",
                 "state": "permanent",
                 "filter_mode": "exclude",
                 "protocol": "static",
                 "flags": [ ],
                 "dst": "198.51.100.1",
                 "timer": "   0.00"
             } ],
         "router": {}
     } ]

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>

Update kernel headers

Update kernel headers to commit:
fcb3a4653bc5 ("net/sched: act_api: use the correct TCA_ACT attributes in dump")

Signed-off-by: David Ahern <dsahern@kernel.org>

Merge branch 'main' into next

Signed-off-by: David Ahern <dsahern@kernel.org>

v6.2.0

tc: m_ct: add support for helper

This patch is to add the setup and dump for helper in tc ct action
in userspace, and the support in kernel was added in:

  https://lore.kernel.org/netdev/cover.1667766782.git.lucien.xin@gmail.com/

here is an example for usage:

  # ip link add dummy0 type dummy
  # tc qdisc add dev dummy0 ingress

  # tc filter add dev dummy0 ingress proto ip flower ip_proto \
    tcp dst_port 21 ct_state -trk action ct helper ipv4-tcp-ftp

  # tc filter show dev dummy0 ingress
    filter protocol ip pref 49152 flower chain 0 handle 0x1
      eth_type ipv4
      ip_proto tcp
      dst_port 21
      ct_state -trk
      not_in_hw
        action order 1: ct zone 0 helper ipv4-tcp-ftp pipe
        index 1 ref 1 bind

v1->v2:
  - add dst_port 21 in the example tc flower rule in changelog
    as Marcele noticed.
  - use snprintf to avoid possible string overflows as Stephen
    suggested in ct_print_helper().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

seg6: man: ip-link.8: add SRv6 End PSP flavor description

This patch extends the manpage by providing a brief description of the PSP
flavor for the SRv6 End behavior as defined in RFC 8986 [1].

The code/logic required to handle the "flavors" framework has already been
merged into iproute2 by commit:
04a6b456bf74 ("seg6: add support for flavors in SRv6 End* behaviors").

Some examples:
ip -6 route add 2001:db8::1 encap seg6local action End flavors psp dev eth0

Standard Output:
ip -6 route show 2001:db8::1
2001:db8::1 encap seg6local action End flavors psp dev eth0 metric 1024 pref medium

JSON Output:
ip -6 -j -p route show 2001:db8::1
[ {
"dst": "2001:db8::1",
"encap": "seg6local",
"action": "End",
"flavors": [ "psp" ],
"dev": "eth0",
"metric": 1024,
"flags": [ ],
"pref": "medium"
} ]

[1] - https://datatracker.ietf.org/doc/html/rfc8986

Signed-off-by: Paolo Lungaroni <paolo.lungaroni@uniroma2.it>
Signed-off-by: David Ahern <dsahern@kernel.org>

iplink: add gso and gro max_size attributes for ipv4

This patch adds two attributes gso/gro_ipv4_max_size in iplink for the
user space support of the BIG TCP for IPv4:

https://lore.kernel.org/netdev/de811bf3-e2d8-f727-72bc-c8a754a9d929@tessares.net/T/

Note that after this kernel patchset, "gso/gro_max_size" are used for IPv6
packets while "gso/gro_ipv4_max_size" are for IPv4 patckets. To not break
these old applications using "gso/gro_ipv4_max_size" for IPv4 GSO packets,
the new size will also be set on "gso/gro_ipv4_max_size" in kernel when
"gso/gro_max_size" changes to a value <= 65536.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

Merge remote-tracking branch 'main/main' into next

Signed-off-by: David Ahern <dsahern@kernel.org>

testsuite: fix testsuite build failure when iproute build without libcap-devel

iproute allows to build without libcap.The testsuite will fail to
compile when libcap dose not exists.It was required in 6d68d7f85d.

Fixes: 6d68d7f85d ("testsuite: fix build failure")
Signed-off-by: gaoxingwang <gaoxingwang1@huawei.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

iplink: fix the gso and gro max_size names in documentation

The option names for "ip link set" should be gso/gro_max_*
instead of max_gso/gro_*. So fix them in documentation.

Fixes: e4ba36f75201 ("iplink: add ip-link documentation")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

libnetlink.c: Fix memory leak in batch mode

During testing we noticed significant memory leak that is easily
reproducible and detectable with valgrind:

==2006284== 393,216 bytes in 12 blocks are definitely lost in loss record 5 of 5
==2006284==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==2006284==    by 0x18C73E: rtnl_recvmsg (libnetlink.c:830)
==2006284==    by 0x18CF9E: __rtnl_talk_iov (libnetlink.c:1032)
==2006284==    by 0x18D3CE: __rtnl_talk (libnetlink.c:1140)
==2006284==    by 0x18D4DE: rtnl_talk (libnetlink.c:1168)
==2006284==    by 0x11BF04: tc_filter_modify (tc_filter.c:224)
==2006284==    by 0x11DD70: do_filter (tc_filter.c:748)
==2006284==    by 0x116B06: do_cmd (tc.c:210)
==2006284==    by 0x116C7C: tc_batch_cmd (tc.c:231)
==2006284==    by 0x1796F2: do_batch (utils.c:1701)
==2006284==    by 0x116D05: batch (tc.c:246)
==2006284==    by 0x117327: main (tc.c:331)
==2006284==
==2006284== LEAK SUMMARY:
==2006284==    definitely lost: 884,736 bytes in 27 blocks

In case nlmsg_type == NLMSG_ERROR and if answer set to NULL, we
should free(buf) too.

Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

ip: fix UB in strncpy (e.g. truncated ip route output)

Fix overlapping buffers passed to strncpy which is UB. format_host_rta_r writes
to the buffer passed to it, so hostname (derived from b1) & b1 partly overlap.

This gets worse with sys-libs/glibc-2.37 where the ip route output can be truncated,
but it was UB anyway and you can see it occurring w/ glibc-2.36.

Bug: https://lore.kernel.org/netdev/0011AC38-4823-4D0A-8580-B108D08959C2@gentoo.org/T/#u
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30112
Thanks-to: Doug Freed <dwfreed@mtu.edu>
Signed-off-by: Sam James <sam@gentoo.org>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

uapi: update headers to 6.2-rc8

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

bridge: mdb: Remove double space in MDB dump

There is an extra space after the "proto" field. Remove it.

Before:

# bridge -d mdb
dev br0 port swp1 grp 239.1.1.1 permanent proto static vid 1

After:

# bridge -d mdb
dev br0 port swp1 grp 239.1.1.1 permanent proto static vid 1

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

man: man8: bridge: Describe mcast_max_groups

Add documentation for per-port and port-port-vlan option mcast_max_groups.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

bridge: Add support for mcast_n_groups, mcast_max_groups

A total of four new bridge attributes are being added to the kernel:
mcast_n_groups and mcast_max_groups, as link and vlan attributes. Add
to the bridge tool the support code to enable setting and querying
these attributes. Example usage:

# ip link add name br up type bridge vlan_filtering 1 mcast_snooping 1 \
                                      mcast_vlan_snooping 1 mcast_querier 1
# ip link set dev v1 master br
# bridge vlan add dev v1 vid 2

# bridge vlan set dev v1 vid 1 mcast_max_groups 1
# bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 1
# bridge mdb add dev br port v1 grp 230.1.2.4 temp vid 1
Error: bridge: Port-VLAN is already in 1 groups, and mcast_max_groups=1.

# bridge link set dev v1 mcast_max_groups 1
# bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 2
Error: bridge: Port is already in 1 groups, and mcast_max_groups=1.

# bridge -d link show
5: v1@v2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br [...]
     [...] mcast_n_groups 1 mcast_max_groups 1

# bridge -d vlan show
port              vlan-id
br                1 PVID Egress Untagged
                     state forwarding mcast_router 1
v1                1 PVID Egress Untagged
                     [...] mcast_n_groups 1 mcast_max_groups 1
                   2
                     [...] mcast_n_groups 0 mcast_max_groups 0

This is how the JSON dump looks like:

# bridge -j -d link show dev v1 | jq
[
   {
     "ifindex": 4,
     "link": "v2",
     "ifname": "v1",
     "flags": [
       "BROADCAST",
       "MULTICAST"
     ],
     "mtu": 1500,
     "master": "br",
     "state": "disabled",
     "priority": 32,
     "cost": 2,
     "hairpin": false,
     "guard": false,
     "root_block": false,
     "fastleave": false,
     "learning": true,
     "flood": true,
     "mcast_flood": true,
     "bcast_flood": true,
     "mcast_router": 1,
     "mcast_to_unicast": false,
     "neigh_suppress": false,
     "vlan_tunnel": false,
     "isolated": false,
     "locked": false,
     "mab": false,
     "mcast_n_groups": 0,
     "mcast_max_groups": 0
   }
]

# bridge -j -d vlan show dev v1 | jq
[
   {
     "ifname": "v1",
     "vlans": [
       {
         "vlan": 1,
         "flags": [
           "PVID",
           "Egress Untagged"
         ],
         "state": "forwarding",
         "mcast_router": 1,
         "mcast_n_groups": 0,
         "mcast_max_groups": 1
       }
     ]
   }
]

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

Update kernel headers

Update kernel headers to commit:
61d731e6538d ("Merge tag 'linux-can-next-for-6.3-20230206' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next")

Signed-off-by: David Ahern <dsahern@kernel.org>

Merge branch 'main' into next

Signed-off-by: David Ahern <dsahern@kernel.org>

ip-rule.8: Bring synopsis in line with description

Bring ip-rule.8 synopsis in line with description

The parameters "show" and "priority" were listed in the synopsis using
other aliases than in the description.

Signed-off-by: Sven Neuhaus <sven-netdev@sven.de>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

macsec: Fix Macsec packet number attribute print

Currently Macsec print routines uses a 32 bit print routine
to print out the value of the packet number (PN) attribute, a
miss use of the 32 bit print routine is causing a miss print of
only the 32 least significant bit (LSB) of an extended packet
number (XPN) which is a 64 bit attribute.

Fixes: 6ce23b7c2d79 ("macsec: add Extended Packet Number support")
Signed-off-by: Emeel Hakim <ehakim@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

tc: add new attr TCA_EXT_WARN_MSG

Currently, when the rule is not to be exclusively executed by the
hardware, extack is not passed along and offloading failures don't
get logged. Add a new attr TCA_EXT_WARN_MSG to log the extack message
so we can monitor the HW failures. e.g.

  # tc monitor
  added chain dev enp3s0f1np1 parent ffff: chain 0
  added filter dev enp3s0f1np1 ingress protocol all pref 49152 flower chain 0 handle 0x1
    ct_state +trk+new
    not_in_hw
          action order 1: gact action drop
           random type none pass val 0
           index 1 ref 1 bind 1

  mlx5_core: matching on ct_state +new isn't supported.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

Revert "tc/tc_monitor: print netlink extack message"

This reverts commit 0cc5533b ("tc/tc_monitor: print netlink extack message")
as the commit mentioned is not applied to upstream.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

Update kernel headers

Update kernel headers to commit
a7b87d2a31dc ("Merge branch 'mlxsw-add-support-of-latency-tlv'")

Signed-off-by: David Ahern <dsahern@kernel.org>

Merge remote-tracking branch 'main/main' into next
Signed-off-by: David Ahern <dsahern@kernel.org>

man: ip-link.8: Fix formatting

Signed-off-by: Stefan Pietsch <stefan+linux@shellforce.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

add space after keyword

The style standard is to use space after keywords.
Example:
if (expr)
verus
if(expr)

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

macsec: Fix Macsec replay protection

Currently when configuring macsec with replay protection,
replay protection and window gets a default value of -1,
the above is leading to passing replay protection and
replay window attributes to the kernel while replay is
explicitly set to off, leading for an invalid argument
error when configured with extended packet number (XPN).
since the default window value which is 0xFFFFFFFF is
passed to the kernel and while XPN is configured the above
value is an invalid window value.

Example:
ip link add link eth2 macsec0 type macsec sci 1 cipher
gcm-aes-xpn-128 replay off

RTNETLINK answers: Invalid argument

Fix by passing the window attribute to the kernel only if replay is on

Fixes: b26fc590ce62 ("ip: add MACsec support")
Signed-off-by: Emeel Hakim <ehakim@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

netem: add SPDX license header

The netem directory contains code to generate tables for netem.
This code came from NISTnet which was public domain.
Add appropriate license tag.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

misc: use SPDX

Use SPDX tag instead of GPL boilerplate.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

tc: use SPDX

Replace GPL boilerplate with SPDX.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

tc: replace GPL-BSD boilerplate in codel and fq

Replace legal boilerplate with SPDX instead.
These algorithms are dual licensed.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

tipc: use SPDX

Replace boilerplate GPL text with SPDX

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

testsuite: use SPDX

Replace boilerplate with SPDX tag.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

ip: use SPDX

Use SPDX instead of boilerplate text for ip and related
sub commands.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

devlink: use SPDX

Add SPDX tag instead of GPL 2.0 or later boilerplate

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

lib: replace GPL boilerplate with SPDX

Replace standard GPL 2.0 or later text with SPDX tag.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

genl: use SPDX

Replace GPL 2.0 or later boilerplate.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

bridge: use SPDX

Replace GPL 2.0 or later boilerplate text.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

man: ss: remove duplicated option name

Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

tc: remove support for rr qdisc

The Round-Robin qdisc was removed in kernel version 2.6.27.
Remove code and man page references from iproute.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

mptcp: add new listener events

These new events have been added in kernel commit f8c9dfbd875b ("mptcp:
add pm listener events") by Geliang Tang.

Two new MPTCP Netlink event types for PM listening socket creation and
closure have been recently added. They will be available in the future
v6.2 kernel.

They have been added because MPTCP for Linux, when not using the
in-kernel PM, depends on the userspace PM to create extra listening
sockets -- called "PM listeners" -- before announcing addresses and
ports. With the existing MPTCP Netlink events, a userspace PM can create
PM listeners at startup time, or in response to an incoming connection.
Creating sockets in response to connections is not optimal: ADD_ADDRs
can't be sent until the sockets are created and listen()ed, and if all
connections are closed then it may not be clear to the userspace PM
daemon that PM listener sockets should be cleaned up. Hence these new
events: PM listening sockets can be managed based on application
activity.

Note that the maximum event string size has to be increased by 2 to be
able to display LISTENER_CREATED without truncated it.

Also, as pointed by Mat, this event doesn't have any "token" attribute
so this attribute is now printed only if it is available.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/313
Cc: Geliang Tang <geliang.tang@suse.com>
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

tc/htb: add SPDX comment

The standard way is to use SPDX to refer to license,
instead of per-file boilerplate text.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

tc/htb: break long lines

Style guidelines is 100 characters

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

tc: Add JSON output to tc-class

* Add JSON formatted output to the `tc class show ...` command.
* Add JSON formatted output for the htb qdisc classes.

Signed-off-by: Max Tottenham <mtottenh@akamai.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

uapi: update vdpa.h

Upstream 6.2-rc3

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

dcb: Do not leave ACKs in socket receive buffer

Originally, the dcb utility only stopped receiving messages from a
socket when it found the attribute it was looking for. Cited commit
changed that, so that the utility will also stop when seeing an ACK
(NLMSG_ERROR message), by setting the NLM_F_ACK flag on requests.

This is problematic because it means a successful request will leave an
ACK in the socket receive buffer, causing the next request to bail
before reading its response.

Fix that by not stopping when finding the required attribute in a
response. Instead, stop on the subsequent ACK.

Fixes: 84c036972659 ("dcb: unblock mnl_socket_recvfrom if not message received")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

configure: Remove include <sys/stat.h>

The check_name_to_handle_at() function in the configure script is
including sys/stat.h. This include fails with glibc 2.36 like this:
````
In file included from /linux-5.15.84/include/uapi/linux/stat.h:5,
                 from /toolchain-x86_64_gcc-12.2.0_glibc/include/bits/statx.h:31,
                 from /toolchain-x86_64_gcc-12.2.0_glibc/include/sys/stat.h:465,
                 from config.YExfMc/name_to_handle_at_test.c:3:
/linux-5.15.84/include/uapi/linux/types.h:10:2: warning: #warning "Attempt to use kernel headers from user space, see https://kernelnewbies.org/KernelHeaders" [-Wcpp]
   10 | #warning "Attempt to use kernel headers from user space, see https://kernelnewbies.org/KernelHeaders"
      |  ^~~~~~~
In file included from /linux-5.15.84/include/uapi/linux/posix_types.h:5,
                 from /linux-5.15.84/include/uapi/linux/types.h:14:
/linux-5.15.84/include/uapi/linux/stddef.h:5:10: fatal error: linux/compiler_types.h: No such file or directory
    5 | #include <linux/compiler_types.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
````

Just removing the include works, the manpage of name_to_handle_at() says
only fcntl.h is needed.

Fixes: c5b72cc56bf8 ("lib/fs: fix issue when {name,open}_to_handle_at() is not implemented")
Tested-by: Heiko Thiery <heiko.thiery@gmail.com>
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

uapi: update headers to 6.2-rc1

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

devlink: fix mon json output for trap-policer

There is a json footer missed for trap-policer output in "devlink mon".
So add it and fix the json output.

Fixes: a66af5569337 ("devlink: Add devlink trap policer set and show commands")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

Merge branch 'bridge-new-mdb-attr' into next

Ido Schimmel says:

====================

Add support for new MDB attributes and replace command.

See kernel merge commit 8150f0cfb24f ("Merge branch
'bridge-mcast-extensions-for-evpn'") for background and motivation.

Patches #1-#2 are preparations.

Patches #3-#5 add support for new MDB attributes: Filter mode, source
list and routing protocol.

Patch #6 adds replace support.

====================

Signed-off-by: David Ahern <dsahern@kernel.org>

bridge: mdb: Add replace support

Allow user space to replace MDB port group entries by specifying the
'NLM_F_REPLACE' flag in the netlink message header.

Examples:

# bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent source_list 192.0.2.1,192.0.2.2 filter_mode include
# bridge -d -s mdb show
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.2 permanent filter_mode include proto static     0.00
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto static     0.00
dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode include source_list 192.0.2.2/0.00,192.0.2.1/0.00 proto static     0.00

# bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent source_list 192.0.2.1,192.0.2.3 filter_mode exclude proto zebra
# bridge -d -s mdb show
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 permanent filter_mode include proto zebra  blocked    0.00
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra  blocked    0.00
dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude source_list 192.0.2.3/0.00,192.0.2.1/0.00 proto zebra     0.00

# bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 temp source_list 192.0.2.4,192.0.2.3 filter_mode include proto bgp
# bridge -d -s mdb show
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.4 temp filter_mode include proto bgp     0.00
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 temp filter_mode include proto bgp     0.00
dev br0 port dummy10 grp 239.1.1.1 temp filter_mode include source_list 192.0.2.4/259.44,192.0.2.3/259.44 proto bgp     0.00

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David Ahern <dsahern@kernel.org>

bridge: mdb: Add routing protocol support

Allow user space to specify the routing protocol of the MDB port group
entry by adding the 'MDBE_ATTR_RTPROT' attribute to the
'MDBA_SET_ENTRY_ATTRS' nest.

Examples:

# bridge mdb add dev br0 port dummy10 grp 239.1.1.1 permanent proto zebra

# bridge mdb add dev br0 port dummy10 grp 239.1.1.2 permanent

# bridge -d mdb show
dev br0 port dummy10 grp 239.1.1.2 permanent filter_mode exclude proto static
dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude proto zebra

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David Ahern <dsahern@kernel.org>

bridge: mdb: Add source list support

Allow user space to specify the source list of (*, G) entries by adding
the 'MDBE_ATTR_SRC_LIST' attribute to the 'MDBA_SET_ENTRY_ATTRS' nest.

Example:

# bridge mdb add dev br0 port dummy10 grp 239.1.1.1 temp source_list 198.51.100.1,198.51.100.2 filter_mode exclude

# bridge -d -s mdb show
dev br0 port dummy10 grp 239.1.1.1 src 198.51.100.2 temp filter_mode include proto static  blocked    0.00
dev br0 port dummy10 grp 239.1.1.1 src 198.51.100.1 temp filter_mode include proto static  blocked    0.00
dev br0 port dummy10 grp 239.1.1.1 temp filter_mode exclude source_list 198.51.100.2/0.00,198.51.100.1/0.00 proto static   256.42

# bridge -j -p -d -s mdb show
[ {
         "mdb": [ {
                 "index": 10,
                 "dev": "br0",
                 "port": "dummy10",
                 "grp": "239.1.1.1",
                 "src": "198.51.100.2",
                 "state": "temp",
                 "filter_mode": "include",
                 "protocol": "static",
                 "flags": [ "blocked" ],
                 "timer": "   0.00"
             },{
                 "index": 10,
                 "dev": "br0",
                 "port": "dummy10",
                 "grp": "239.1.1.1",
                 "src": "198.51.100.1",
                 "state": "temp",
                 "filter_mode": "include",
                 "protocol": "static",
                 "flags": [ "blocked" ],
                 "timer": "   0.00"
             },{
             },{
                 "index": 10,
                 "dev": "br0",
                 "port": "dummy10",
                 "grp": "239.1.1.1",
                 "state": "temp",
                 "filter_mode": "exclude",
                 "source_list": [ {
                         "address": "198.51.100.2",
                         "timer": "0.00"
                     },{
                         "address": "198.51.100.1",
                         "timer": "0.00"
                     } ],
                 "protocol": "static",
                 "flags": [ ],
                 "timer": " 251.19"
             } ],
         "router": {}
     } ]

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David Ahern <dsahern@kernel.org>

bridge: mdb: Add filter mode support

Allow user space to specify the filter mode of (*, G) entries by adding
the 'MDBE_ATTR_GROUP_MODE' attribute to the 'MDBA_SET_ENTRY_ATTRS' nest.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David Ahern <dsahern@kernel.org>

bridge: mdb: Split source parsing to a separate function

Currently, the only attribute inside the 'MDBA_SET_ENTRY_ATTRS' nest is
'MDBE_ATTR_SOURCE', but subsequent patches are going to add more
attributes to the nest.

Prepare for the addition of these attributes by splitting the parsing of
individual attributes inside the nest to separate functions.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David Ahern <dsahern@kernel.org>

bridge: mdb: Use a boolean to indicate nest is required

Currently, the only attribute inside the 'MDBA_SET_ENTRY_ATTRS' nest is
'MDBE_ATTR_SOURCE', but subsequent patches are going to add more
attributes to the nest.

Prepare for the addition of these attributes by determining the
necessity of the nest from a boolean variable that is set whenever one
of these attributes is parsed. This avoids the need to have one long
condition that checks for the presence of one of the individual
attributes.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David Ahern <dsahern@kernel.org>

Merge branch 'main' into next

Conflicts:
devlink/devlink.c

Signed-off-by: David Ahern <dsahern@kernel.org>

fix version #

Merge branch 'new-ipsec-offload-type' into next

Leon Romanovsky says:

====================

From: Leon Romanovsky <leonro@nvidia.com>

Extend ip tool to support new IPsec offload mode.
Followup of the recently accepted series to netdev.
https://lore.kernel.org/r/20221209093310.4018731-1-steffen.klassert@secunet.com

Changelog:
v1:
* Changed "full offload" to "packet offload" to be aligned with kernel names.
* Rebase to latest iproute2-next
v0: https://lore.kernel.org/all/cover.1652179360.git.leonro@nvidia.com

====================

Signed-off-by: David Ahern <dsahern@kernel.org>

xfrm: add an interface to offload policy

Extend at "ip xfrm policy" to allow policy offload to specific device.
The syntax and the code follow already established pattern from the
state offload.

The only difference between them is that direction was already mandatory
argument in policy configuration commands, so don't need to add direction
handling logic like it was done for the state offload.

The syntax is as follows:
$ ip xfrm policy .... offload packet dev <if-name>

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

xfrm: add packet offload mode to xfrm state

Allow users to configure xfrm states with packet offload type.

Packet offload mode:
  ip xfrm state offload packet dev <if-name> dir <in|out>
Crypto offload mode:
  ip xfrm state offload crypto dev <if-name> dir <in|out>
  ip xfrm state offload dev <if-name> dir <in|out>

The latter variant configures crypto offload mode and is needed
to provide backward compatibility.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

xfrm: prepare state offload logic to set mode

The offload in xfrm state requires to provide device and direction
in order to activate it. However, in the help section, device and
direction were displayed as an optional.

As a preparation to addition of packet offload, let's fix the help
section and refactor the code to be more clear.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

Merge branch 'devlink-port-function' into next

Shay Drory says:

====================

Patch implementing new netlink attribute for devlink-port function got
merged to net-next.
https://lore.kernel.org/netdev/20221206185119.380138-1-shayd@nvidia.com/

Now there is a need to support these new attribute in the userspace
tool. Implement roce and migratable port function attributes in devlink
userspace tool. Update documentation.

====================

Signed-off-by: David Ahern <dsahern@kernel.org>

devlink: Add documentation for roce and migratable port function attributes

New port function attributes roce and migratable were added.
Update the man page for devlink-port to account for new attributes.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

devlink: Support setting port function migratable cap

Suppor port function commands to enable / disable migratable
capability, this is used to set the port function as migratable.

Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.

In order for a VM to be able to perform LM, all the VM components must
be able to perform migration. e.g.: to be migratable.
In order for VF to be migratable, VF must be bound to VFIO driver with
migration support.

When migratable capability is enable for a function of the port, the
device is making the necessary preparations for the function to be
migratable, which might include disabling features which cannot be
migrated.

Example of LM with migratable function configuration:
Set migratable of the VF's port function.

$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0
vfnum 1
    function:
        hw_addr 00:00:00:00:00:00 migratable disable

$ devlink port function set pci/0000:06:00.0/2 migratable enable

$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0
vfnum 1
    function:
        hw_addr 00:00:00:00:00:00 migratable enable

Bind VF to VFIO driver with migration support:
$ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
$ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
$ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind

Attach VF to the VM.
Start the VM.
Perform LM.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

devlink: Support setting port function roce cap

Support port function commands to enable / disable RoCE, this is used to
control the port RoCE device capabilities.

When RoCE is disabled for a function of the port, function cannot create
any RoCE specific resources (e.g GID table).
It also saves system memory utilization. For example disabling RoCE
enable a VF/SF to save 1 Mbytes of system memory per function.

Example of a PCI VF port which supports a port function:

$ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum
    0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 roce enabled

$ devlink port function set pci/0000:06:00.0/2 roce disable

$ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum
    0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 roce disabled

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

Update kernel headers

Update kernel headers to commit:
7e68dd7d07a2 ("Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next")

Signed-off-by: David Ahern <dsahern@kernel.org>

v6.1.0

man: add missing tc class show

"tc class show" is valid sub command but missing from synopsis.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

tc: make prefix const

Tcstats functions have prefix argument that can be made const.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

ip: print mpls errors on stderr

Error messages should go on stderr.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

tc: print errors on stderr

Don't mix output and errors.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

iplink: support JSON in MPLS output

The MPLS statistics did not support oneline or JSON
in current code.

Fixes: 837552b445f5 ("iplink: add support for afstats subcommand")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

ip-link: man: Document existence of netns argument in add command

ip-link-add supports netns argument just like ip-link-set. This commit
documents the existence of netns in help text and man page.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

libnetlink: Fix memory leak in __rtnl_talk_iov()

If `__rtnl_talk_iov` fails then callers are not expected to free `answer`.

Currently if `NLMSG_ERROR` was received with an error then the netlink
buffer was stored in `answer`, while still returning an error

This leak can be observed by running this snippet over time.
This triggers an `NLMSG_ERROR` because for each neighbour update, `ip`
will try to query for the name of interface 9999 in the wrong netns.
(which in itself is a separate bug)

set -e

ip netns del test-a || true
ip netns add test-a
ip netns del test-b || true
ip netns add test-b

ip -n test-a netns set test-b auto
ip -n test-a link add veth_a index 9999 type veth \
  peer name veth_b netns test-b
ip -n test-b link set veth_b up

ip -n test-a monitor link address prefix neigh nsid label all-nsid \
  > /dev/null &
monitor_pid=$!
clean() {
  kill $monitor_pid
  ip netns del test-a
  ip netns del test-b
}
trap clean EXIT

while true; do
  ip -n test-b neigh add dev veth_b 1.2.3.4 lladdr AA:AA:AA:AA:AA:AA
  ip -n test-b neigh del dev veth_b 1.2.3.4
done

Fixes: 55870dfe7f8b ("Improve batch and dump times by caching link lookups")
Signed-off-by: Lahav Schlesinger <lschlesinger@drivenets.com>
Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

devlink: update ifname map when message contains DEVLINK_ATTR_PORT_NETDEV_NAME

Recent kernels send PORT_NEW message with when ifname changes,
so benefit from that by having ifnames updated.

Whenever there is a message containing DEVLINK_ATTR_PORT_NETDEV_NAME
attribute, use it to update ifname map.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

devlink: push common code to __pr_out_port_handle_start_tb()

There is a common code in pr_out_port_handle_start() and
pr_out_port_handle_start_arr(). As the next patch is going to extend it
even more, push the code into common helper.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

devlink: get devlink port for ifname using RTNL get link command

Currently, when user specifies ifname as a handle on command line of
devlink, the related devlink port is looked-up in previously taken dump
of all devlink ports on the system. There are 3 problems with that:
1) The dump iterates over all devlink instances in kernel and takes a
devlink instance lock for each.
2) Dumping all devlink ports would not scale.
3) Alternative ifnames are not exposed by devlink netlink interface.

Instead, benefit from RTNL get link command extension and get the
devlink port handle info from IFLA_DEVLINK_PORT attribute, if supported.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

devlink: add ifname_map_add/del() helpers

Add couple of helpers to alloc/free of map object alongside with list
addition/removal.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

Merge branch 'pcp-prio-apptrust' into next

Daniel Machon says:

====================

This patch series makes use of the newly introduced [1] DCB_APP_SEL_PCP
selector, for PCP/DEI prioritization, and DCB_ATTR_IEEE_APP_TRUST
attribute for configuring per-selector trust and trust-order.

========================================================================
New parameter "pcp-prio" to existing "app" subcommand:
========================================================================

A new pcp-prio parameter has been added to the app subcommand, which can
be used to classify traffic based on PCP and DEI from the VLAN header.
PCP and DEI is specified in a combination of numerical and symbolic
form, where 'de' (drop-eligible) means DEI=1 and 'nd' (not-drop-eligible)
means DEI=0.

Map PCP 1 and DEI 0 to priority 1
$ dcb app add dev eth0 pcp-prio 1nd:1

Map PCP 1 and DEI 1 to priority 1
$ dcb app add dev eth0 pcp-prio 1de:1

========================================================================
New apptrust subcommand for configuring per-selector trust and trust
order:
========================================================================

This new command currently has a single parameter, which lets you
specify an ordered list of trusted selectors. The microchip sparx5
driver is already enabled to offload said list of trusted selectors. The
new command has been given the name apptrust, to indicate that the trust
covers APP table selectors only. I found that 'apptrust' was better than
plain 'trust' as the latter does not indicate the scope of what is to be
trusted.

Example:

Trust selectors dscp and pcp, in that order:
$ dcb apptrust set dev eth0 order dscp pcp

Trust selectors ethtype, stream-port and pcp, in that order
$ dcb apptrust set dev eth0 order ethtype stream-port pcp

Show the trust order
$ dcb apptrust show dev eth0 order order: ethtype stream-port pcp

A concern was raised here [2], that 'apptrust' would not work well with
matches(), so instead strcmp() has been used to match for the new
subcommand, as suggested here [3]. Same goes with pcp-prio parameter for
dcb app.

The man page for dcb_app has been extended to cover the new pcp-prio
parameter, and a new man page for dcb_apptrust has been created.

[1] https://lore.kernel.org/netdev/20221101094834.2726202-1-daniel.machon@microchip.com/
[2] https://lore.kernel.org/netdev/20220909080631.6941a770@hermes.local/
[3] https://lore.kernel.org/netdev/Y0fP+9C0tE7P2xyK@shredder/

====================

Signed-off-by: David Ahern <dsahern@kernel.org>

dcb: add new subcommand for apptrust

Add new apptrust subcommand for the dcbnl apptrust extension object.

The apptrust command lets you specify a consecutive ordered list of
trusted selectors, which can be used by drivers to determine which
selectors are eligible (trusted) for packet prioritization, and in which
order.

Selectors are sent in a new nested attribute:
DCB_ATTR_IEEE_APP_TRUST_TABLE. The nest contains trusted selectors
encapsulated in either DCB_ATTR_IEEE_APP or DCB_ATTR_DCB_APP attributes,
for standard and non-standard selectors, respectively.

Example:

Trust selectors dscp and pcp, in that order
$ dcb apptrust set dev eth0 order dscp pcp

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

dcb: add new pcp-prio parameter to dcb app

Add new pcp-prio parameter to the app subcommand, which can be used to
classify traffic based on PCP and DEI from the VLAN header. PCP and DEI
is specified in a combination of numerical and symbolic form, where 'de'
(drop-eligible) means DEI=1 and 'nd' (not-drop-eligible) means DEI=0.

Map PCP 1 and DEI 0 to priority 1
$ dcb app add dev eth0 pcp-prio 1nd:1

Map PCP 1 and DEI 1 to priority 1
$ dcb app add dev eth0 pcp-prio 1de:1

Internally, PCP and DEI is encoded in the protocol field of the dcb_app
struct. Each combination of PCP and DEI maps to a priority, thus needing
a range of  0-15. A well formed dcb_app entry for PCP/DEI
prioritization, could look like:

    struct dcb_app pcp = {
        .selector = DCB_APP_SEL_PCP,
.priority = 7,
        .protocol = 15
    }

For mapping PCP=7 and DEI=1 to Prio=7.

Also, three helper functions for translating between std and non-std APP
selectors, have been added to dcb_app.c and exposed through dcb.h.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

devlink: support direct region read requests

The kernel has gained support for reading from regions without needing to
create a snapshot. To use this support, the DEVLINK_ATTR_REGION_DIRECT
attribute must be added to the command.

For the "read" command, if the user did not specify a snapshot, add the new
attribute to request a direct read. The "dump" command will still require a
snapshot. While technically a dump could be performed without a snapshot it
is not guaranteed to be atomic unless the region size is no larger than
256 bytes.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

libnetlink: Fix wrong netlink header placement

The netlink header must be first in the netlink message, so move it
there.

Fixes: fee4a56f0191 ("Update kernel headers")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

tc: ct: Fix invalid pointer dereference

Using macro NEXT_ARG_FWD does not validate argc.
Use macro NEXT_ARG which validates argc while parsing args
in the same loop iteration.

Fixes: c8a494314c40 ("tc: Introduce tc ct action")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

devlink: Fix setting parent for 'rate add'

Setting a parent during creation of the node doesn't work, despite
documentation [1] clearly saying that it should.

[1] man/man8/devlink-rate.8

Example:
$ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0
Unknown option "parent"

Fix this by passing DL_OPT_PORT_FN_RATE_PARENT as an argument to
dl_argv_parse() when it gets called from cmd_port_fn_rate_add().

Fixes: 6c70aca76ef2 ("devlink: Add port func rate support")
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

devlink: Add documentation for tx_prority and tx_weight

New netlink attributes tx_priority and tx_weight were added.
Update the man page for devlink-rate to account for new attributes.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

devlink: Introduce new attribute 'tx_weight' to devlink-rate

To fully utilize hierarchical QoS algorithm new attribute 'tx_weight'
needs to be introduced. Weight attribute allows for usage of Weighted
Fair Queuing arbitration scheme among siblings. This arbitration
scheme can be used simultaneously with the strict priority.

Introduce ability to configure tx_weight from devlink userspace
utility. Make the new attribute optional.

Example commands:
$ devlink port function rate add pci/0000:4b:00.0/node_custom \
tx_weight 50 parent node_0

$ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 20

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

devlink: Introduce new attribute 'tx_priority' to devlink-rate

To fully utilize hierarchical QoS algorithm new attribute 'tx_priority'
needs to be introduced. Priority attribute allows for usage of strict
priority arbiter among siblings. This arbitration scheme attempts to
schedule nodes based on their priority as long as the nodes remain within
their bandwidth limit.

Introduce ability to configure tx_priority from devlink userspace
utility. Make the new attribute optional.

Example commands:
$ devlink port function rate add pci/0000:4b:00.0/node_custom \
tx_priority 5 parent node_0
$ devlink port function rate set pci/0000:4b:00.0/2 tx_priority 5

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

Merge branch 'main' into next

Signed-off-by: David Ahern <dsahern@kernel.org>

Update kernel headers

Update kernel headers to commit:
dbadae927287 ("tsnep: Rework RX buffer allocation")

Signed-off-by: David Ahern <dsahern@kernel.org>

testsuite: Add test for ip --json neigh get

Signed-off-by: Leonard Crestez <cdleonard@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

ip neigh: Support --json on ip neigh get

The ip neigh command supports --json for "list" but not for "get". Add
json support for the "get" command so that it's possible to fetch
information about specific neighbors without regular expressions.

Fixes: aac7f725fa46 ("ipneigh: add color and json support")
Signed-off-by: Leonard Crestez <cdleonard@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>

vdpa: allow provisioning device features

This patch allows device features to be provisioned via vdpa. This
will be useful for preserving migration compatibility between source
and destination:

# vdpa dev add name dev1 mgmtdev pci/0000:02:00.0 device_features 0x300020000
# vdpa dev config show dev1
# dev1: mac 52:54:00:12:34:56 link up link_announce false mtu 65535
negotiated_features CTRL_VQ VERSION_1 ACCESS_PLATFORM

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David Ahern <dsahern@kernel.org>