Roopa Prabhu [Fri, 26 Jun 2015 03:54:27 +0000 (23:54 -0400)]
mpls: always set type RTN_UNICAST and scope RT_SCOPE_UNIVERSE for
This patch fixes incorrect -EINVAL errors due to invalid
scope and type during mpls route deletes.
$ip -f mpls route add 100 as 200 via inet 10.1.1.2 dev swp1
$ip -f mpls route show
100 as to 200 via inet 10.1.1.2 dev swp1
$ip -f mpls route del 100 as 200 via inet 10.1.1.2 dev swp1
RTNETLINK answers: Invalid argument
$ip -f mpls route del 100
RTNETLINK answers: Invalid argument
After patch:
$ip -f mpls route show
100 as to 200 via inet 10.1.1.2 dev swp1
$ip -f mpls route del 100 as 200 via inet 10.1.1.2 dev swp1
$ip -f mpls route show
Always set type to RTN_UNICAST for mpls route add/deletes.
Also to keep things consistent with kernel set scope to
RT_SCOPE_UNIVERSE for both mpls and ipv6 routes. Both mpls and ipv6 route
deletes ignore scope.
Suggested-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Vivek Venkataraman <vivek@cumulusnetworks.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
(cherry picked from commit f638e9f7c8718cd5f94bd1f6041319e7d65a91cd)
Daniel Borkmann [Thu, 21 May 2015 22:17:01 +0000 (00:17 +0200)]
tc: bpf: add initial man page
Add a start of a man-page to the misc section as a reference and
guide on (e)BPF classifier and actions. Given that tc is only tersely
documented, this is provided in the hope that users will have an
easier getting started with tc and (e)BPF. And, that there's now more
incentive for others to also start documenting their classifier and
actions as well. ;)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com>
iproute2: misc/ss.c - fix run_ssfilter af_packet when protocol == 0
s->local.data is a pointer to a field of a non-NULL struct, and hence
cannot be NULL, thus comparing it to 0 is always false, and thus the
return is always false.
Presumably this was meant to be a check whether s->local.data[0] (which
I believe stores af_packet protocol) is 0, ie. ANY.
Change-Id: Ia232f5b06ce081e3b2fb6338f1a709cd94e03ae5
Fixes:
ss.c:1018:37: error: comparison of array 's->local.data' equal to a null pointer is always false [-Werror,-Wtautological-pointer-compare]
return s->lport == 0 && s->local.data == 0;
~~~~~~~~~^~~~ ~
1 error generated.
These if-blocks are outright dead code, because '0 > unsigned' is always
false, so only else clause triggers and regardless of which clause triggers
it only updates 'ind' which is later unconditionally written to before
being used anyway.
Otherwise we get errors from clang:
m_pedit.c:166:8: error: comparison of 0 > unsigned expression is always false [-Werror,-Wtautological-compare]
if (0 > tkey->off) {
~ ^ ~~~~~~~~~
m_pedit.c:209:8: error: comparison of 0 > unsigned expression is always false [-Werror,-Wtautological-compare]
if (0 > tkey->off) {
~ ^ ~~~~~~~~~
2 errors generated.
Fix changing tunnel remote and local address to any
If a tunnel is created with a local address, you can't change it to any.
# ip tunnel add tunl1 mode ipip remote 10.16.42.37 local 10.16.42.214 ttl 64
# ip tunnel show tunl1
tunl1: ip/ip remote 10.16.42.37 local 10.16.42.214 ttl 64
# ip tunnel change tunl1 local any
# echo $?
0
# ip tunnel show tunl1
tunl1: ip/ip remote 10.16.42.37 local 10.16.42.214 ttl 64
It happens that parse_args zeroes ip_tunnel_parm, and when creating the
tunnel, it is OK to leave it as is if the address is any. However, when
changing the tunnel, the current parameters will be read from
ip_tunnel_parm, and local and remote address won't be zeroes anymore, so
it needs to be explicitly set to any.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Daniel Borkmann [Fri, 29 May 2015 19:47:45 +0000 (21:47 +0200)]
tc: util: fix print_rate for ludicrous speeds
The for loop should only probe up to G[i]bit rates, so that we
end up with T[i]bit as the last max units[] slot for snprintf(3),
and not possibly an invalid pointer in case rate is multiple of
kilo.
Fixes: 8cecdc283743 ("tc: more user friendly rates") Reported-by: Jose R. Guzman Mosqueda <jose.r.guzman.mosqueda@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Eric Dumazet [Fri, 29 May 2015 12:37:49 +0000 (05:37 -0700)]
ss: do not bindly dump two families
ss currently dumps IPv4 sockets, then IPv6 sockets from the kernel,
even if -4 or -6 option was given. Filtering in user space then has to
drop all sockets of wrong family. Such a waste of time...
Mike Frysinger [Tue, 26 May 2015 06:51:30 +0000 (02:51 -0400)]
enable transparent LFS
Make sure we use 64-bit filesystem functions everywhere. This applies not
only to being able to read large files (which generally doesn't apply to
us), but also being able to simply stat them (as they might be using large
inodes).
Signed-off-by: Mike Frysinger <vapier@chromium.org>
There have been several instances where response from kernel
has overrun the stack buffer from the caller. Avoid future problems
by passing a size argument.
Also drop the unused peer and group arguments to rtnl_talk.
Richard Alpe [Thu, 7 May 2015 13:07:36 +0000 (15:07 +0200)]
tipc: add new TIPC configuration tool
tipc is a user-space configuration tool for TIPC (Transparent
Inter-process Communication). It utilizes the TIPC netlink API in the
kernel to fetch data or perform actions.
The tipc tool has somewhat similar syntax to the ip tool meaning that
users of the ip tool should not feel that unfamiliar with this tool.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
David Ward [Mon, 18 May 2015 15:35:13 +0000 (11:35 -0400)]
tc: gred: Adopt the term VQ in the command syntax and output
In the GRED kernel source code, both of the terms "drop parameters"
(DP) and "virtual queue" (VQ) are used to refer to the same thing.
Each "DP" is better understood as a "set of drop parameters", since
it has values for limit, min, max, avpkt, etc. This terminology can
result in confusion when creating a GRED qdisc having multiple DPs.
Netlink attributes and struct members with the DP name seem to have
been left intact for compatibility, while the term VQ was otherwise
adopted in the code, which is more intuitive.
Use the VQ term in the tc command syntax and output (but maintain
compatibility with the old syntax).
Rewrite the usage text to be concise and similar to other qdiscs.
David Ward [Mon, 18 May 2015 15:35:12 +0000 (11:35 -0400)]
tc: gred: Handle unsigned values properly in option parsing/printing
DPs, def_DP, and DP are unsigned values that are sent and received
in TCA_GRED_* netlink attributes; handle them properly when they
are parsed or printed. Use MAX_DPs as the initial value for def_DP
and DP, and fix the operator used for bounds checking them.
David Ward [Mon, 18 May 2015 15:35:10 +0000 (11:35 -0400)]
tc: gred: Print usage text if no arguments appear after "gred"
This is more helpful to the user, since the command takes two forms,
and the message that would otherwise appear about missing parameters
assumes one of those forms.
WANG Cong [Tue, 5 May 2015 22:30:20 +0000 (15:30 -0700)]
tc: fill in handle before checking argc
When deleting a specific basic filter with handle,
tc command always ignores the 'handle' option, so
tcm_handle is always 0 and kernel deletes all filters
in the selected group. This is wrong, we should respect
'handle' in cmdline.
Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
If ip rule command fails talking to kernel, exit code should be 2.
The sub-command is called by cmd loop and the exit code is negative
of return value from the command callback.
Add a new option to toggle the ability of querying the RSS configuration of a specific VF.
VF RSS information like RSS hash key may be considered sensitive on some devices where
this information is shared between VF and PF and thus its querying may be prohibited by default.
This new option allows a system administrator with privileges to modify a PF state
to control if the above VF querying is allowed or not.
For example:
To enable RSS querying of VF[0] of ethX:
>> ip link set dev ethX vf 0 query_rss on
Daniel Borkmann [Tue, 28 Apr 2015 11:37:42 +0000 (13:37 +0200)]
tc: {m, f}_ebpf: add option for dumping verifier log
Currently, only on error we get a log dump, but I found it useful when
working with eBPF to have an option to also dump the log on success.
Also spotted a typo in a header comment, which is fixed here as well.
Daniel Borkmann [Mon, 20 Apr 2015 11:48:54 +0000 (13:48 +0200)]
examples: bpf: fix ld offs to have same prog loaded on ingress/egress
Fix up the eBPF example program to match our kernel fix in a166151cbe33 ("bpf:
fix bpf helpers to use skb->mac_header relative offsets"). Tested on ingress
and egress paths.
Daniel Borkmann [Thu, 16 Apr 2015 19:20:06 +0000 (21:20 +0200)]
tc: built-in eBPF exec proxy
This work follows upon commit 6256f8c9e45f ("tc, bpf: finalize eBPF
support for cls and act front-end") and takes up the idea proposed by
Hannes Frederic Sowa to spawn a shell (or any other command) that holds
generated eBPF map file descriptors.
File descriptors, based on their id, are being fetched from the same
unix domain socket as demonstrated in the bpf_agent, the shell spawned
via execvpe(2) and the map fds passed over the environment, and thus
are made available to applications in the fashion of std{in,out,err}
for read/write access, for example in case of iproute2's examples/bpf/:
# env | grep BPF
BPF_NUM_MAPS=3
BPF_MAP1=6 <- BPF_MAP_ID_QUEUE (id 1)
BPF_MAP0=5 <- BPF_MAP_ID_PROTO (id 0)
BPF_MAP2=7 <- BPF_MAP_ID_DROPS (id 2)
The advantage (as opposed to the direct/native usage) is that now the
shell is map fd owner and applications can terminate and easily reattach
to descriptors w/o any kernel changes. Moreover, multiple applications
can easily read/write eBPF maps simultaneously.
To further allow users for experimenting with that, next step is to add
a small helper that can get along with simple data types, so that also
shell scripts can make use of bpf syscall, f.e to read/write into maps.
Generally, this allows for prepopulating maps, or any runtime altering
which could influence eBPF program behaviour (f.e. different run-time
classifications, skb modifications, ...), dumping of statistics, etc.
Reference: http://thread.gmane.org/gmane.linux.network/357471/focus=357860 Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Nicolas Dichtel [Wed, 22 Apr 2015 08:27:07 +0000 (10:27 +0200)]
mroute: remove invalid check against NLM_F_MULTI
This flag is only for the netlink protocol (multi-part messages), no reason
to reject messages without it.
Note that this flag was removed by the following kernel patches (v3.14) 65886f439ab0 ipmr: fix mfc notification flags f518338b1603 ip6mr: fix mfc notification flags
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Nicolas Dichtel [Wed, 22 Apr 2015 08:27:06 +0000 (10:27 +0200)]
libnamespaces: fix warning about syscall()
The warning was:
In file included from namespace.c:14:0:
../include/namespace.h: In function ‘setns’:
../include/namespace.h:37:2: warning: implicit declaration of function ‘syscall’ [-Wimplicit-function-declaration]
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Nicolas Dichtel [Wed, 22 Apr 2015 08:27:05 +0000 (10:27 +0200)]
tc: fix compilation warning on 32bits arch
The warning was:
m_simple.c: In function ‘parse_simple’:
m_simple.c:142:4: warning: format ‘%ld’ expects argument of type ‘long int’, but argument 3 has type ‘size_t’ [-Wformat]
Useful to be able to compile with -Werror.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Nicolas Dichtel [Wed, 15 Apr 2015 12:00:53 +0000 (14:00 +0200)]
ipxfrm: wrong nl msg sent on deleteall cmd
XFRM netlink family is independent from the route netlink family. It's wrong
to call rtnl_wilddump_request(), because it will add a 'struct ifinfomsg' into
the header and the kernel will complain (at least for xfrm state):
netlink: 24 bytes leftover after parsing attributes in process `ip'.
Reported-by: Gregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Nicolas Dichtel [Wed, 15 Apr 2015 12:23:22 +0000 (14:23 +0200)]
netns: allow to dump and monitor nsid
Two commands are added:
- ip netns list-id
- ip monitor nsid
A cache is also added to remember the association between the iproute2 netns
name (from /var/run/netns/) and the nsid.
To avoid interfering with the rth socket, a new rtnl socket (rtnsh) is used to
get nsid (we may send rtnl request during listing on rth).
Pavel Šimerda [Mon, 13 Apr 2015 14:00:58 +0000 (16:00 +0200)]
cbq: fix find syntax in example
Without modification, using the example resulted in the following error:
[root@localhost sbin]# cbq restart
find: warning: you have specified the -maxdepth option after a
non-option argument (, but options are not positional (-maxdepth affects
tests specified before it as well as those specified after it). Please
specify options before other arguments.
find: warning: you have specified the -maxdepth option after a
non-option argument (, but options are not positional (-maxdepth affects
tests specified before it as well as those specified after it). Please
specify options before other arguments.
Pavel Šimerda [Mon, 13 Apr 2015 14:00:57 +0000 (16:00 +0200)]
ip-xfrm: support 'proto any' with 'sport' and 'dport'
When creating an IPsec SA that sets 'proto any' (IPPROTO_IP) and
specifies 'sport' and 'dport' at the same time in selector, the
following error is issued:
"sport" and "dport" are invalid with proto=ip
However using IPPROTO_IP with ports is completely legal and necessary
when one wants to share the SA on both TCP and UDP. One of the
applications requiring sharing SAs is 3GPP IMS AKA authentication.
Pavel Šimerda [Mon, 13 Apr 2015 14:00:56 +0000 (16:00 +0200)]
turn Makefile more distribution friendly
Changes:
* Accept directory settings from environment.
* Remove redundant ROOTDIR variable.
* Set KERNEL_INCLUDE default to '/usr/include'.
* Use CFLAGS from environemnt.
Note: In the long term it might be better to improve the configure
script to generate those parts of the Makefile in a manner similar
to autoconf. It might be even practical to autotoolize the package.
Signed-off-by: Pavel Šimerda <psimerda@redhat.com>
Felix Fietkau [Sun, 15 Feb 2015 16:57:19 +0000 (11:57 -0500)]
tc: add support for connmark action
Add ability to add the netfilter connmark support.
Typical usage:
...lets tag outgoing icmp with mark 0x10..
iptables -tmangle -A PREROUTING -p icmp -j CONNMARK --set-mark 0x10
..add on ingress of $ETH an extractor for connmark...
tc filter add dev $ETH parent ffff: prio 4 protocol ip \
u32 match ip protocol 1 0xff \
flowid 1:1 \
action connmark continue
...if the connmark was 0x11, we police to a ridic rate of 10Kbps
tc filter add dev $ETH parent ffff: prio 5 protocol ip \
handle 0x11 fw flowid 1:1 \
action police rate 10kbit burst 10k
Other ways to use the connmark is to supply the zone, index and
branching choice. Refer to help.
Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Andy Gospodarek [Fri, 10 Apr 2015 20:50:40 +0000 (16:50 -0400)]
iproute2: unify naming for entries offloaded to hardware
The kernel now has the capability to offload FDB and FIB entries to hardware.
It is important to let users know if table entries are also offloaded to
hardware. Currently offloaded FDB entries are indicated by the existence of
the flag 'external' on the entry as of the following commit:
bridge/fdb: add flag/indication for FDB entry synced from offload device
When the patch to add support for indicating that FIB entries were also
offloaded as posted to netdev by Scott Feldman it became clear that 'external'
would not be an ideal name for routes. There could definitely be confusion
about what this might mean since many routes are to external networks -- a
collision/confusion that did not happen with FDB.
Scott Feldman asked me to check with others and build concensus around a name.
After speaking with several people about this I am proposing we refer to both
FDB and FIB entries that are currently backed by hardware (based on the work
done in rocker) with the flag 'offload' appended to the end ofthe entry.
Some people liked the string 'external,' others liked 'hardware,' but the point
is to communicate that these routes are available to something that will will
offload the forwarding normally done by the kernel. Since the term 'offload'
is used so frequently it seems appropriate to use the same language in
ip/bridge output.
The term 'offload' also seems to resonate with many of the people who have
responded on Scott's original thread or to those who I reached out to directly
and did respond to my query, so it seems we have reached consensus that it
should be the term used going forward.
v2: rebased against net-next branch
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> CC: Jamal Hadi Salim <jhs@mojatatu.com> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com> CC: Jiri Pirko <jiri@resnulli.us> CC: John W. Linville <linville@tuxdriver.com> CC: Roopa Prabhu <roopa@cumulusnetworks.com> CC: Scott Feldman <sfeldma@gmail.com> CC: Stephen Hemminger <stephen@networkplumber.org>
Daniel Borkmann [Wed, 1 Apr 2015 15:57:44 +0000 (17:57 +0200)]
tc, bpf: finalize eBPF support for cls and act front-end
This work finalizes both eBPF front-ends for the classifier and action
part in tc, it allows for custom ELF section selection, a simplified tc
command frontend (while keeping compat), reusing of common maps between
classifier and actions residing in the same object file, and exporting
of all map fds to an eBPF agent for handing off further control in user
space.
It also adds an extensive example of how eBPF can be used, and a minimal
self-contained example agent that dumps map data. The example is well
documented and hopefully provides a good starting point into programming
cls_bpf and act_bpf.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
add a new command to configure the SPD hash table:
ip xfrm policy set [ hthresh4 LBITS RBITS ] [ hthresh6 LBITS RBITS ]
and code to display the SPD hash configuration:
ip -s -s xfrm policy count
hthresh4: defines minimum local and remote IPv4 prefix lengths of
selectors to hash a policy. If prefix lengths are greater or equal
to the thresholds, then the policy is hashed, otherwise it falls back
in the policy_inexact chained list.
hthresh6: defines minimum local and remote IPv6 prefix lengths of
selectors to hash a policy, otherwise it falls back
in the policy_inexact chained list.
Example:
% ip -s -s xfrm policy count
SPD IN 0 OUT 0 FWD 0 (Sock: IN 0 OUT 0 FWD 0)
SPD buckets: count 7 Max 1048576
SPD IPv4 thresholds: local 32 remote 32
SPD IPv6 thresholds: local 128 remote 128
% ip xfrm pol set hthresh4 24 16 hthresh6 64 56
% ip -s -s xfrm policy count
SPD IN 0 OUT 0 FWD 0 (Sock: IN 0 OUT 0 FWD 0)
SPD buckets: count 7 Max 1048576
SPD IPv4 thresholds: local 24 remote 16
SPD IPv6 thresholds: local 64 remote 56
- Pull in the uapi mpls.h
- Update rtnetlink.h to include the mpls rtnetlink notification multicast group.
- Define AF_MPLS in utils.h if it is not defined from elsewhere
as is done with AF_DECnet
The address syntax for multiple mpls labels is a complete invention.
When I looked there seemed to be no wide spread convention for talking
about an mpls label stack in text for. Sometimes people did:
"{ Label1, Label2, Label3 }", sometimes people would do:
"[ label3, label2, label1 ]", and most of the time label
stacks were not explicitly shown at all.
The syntax I wound up using, so it would not have spaces and so it
would visually distinct from other kinds of addresses is.
label1/label2/label3 Where label1 is the label at the top of the label
stack and label3 is the label at the bottom on the label stack.
When there is a single label this matches what seems to be convention
with other tools. Just print out the numeric value of the mpls label.
The netlink protocol for labels uses the on the wire format for a
label stack. The ttl and traffic class are expected to be 0. Using
the on the wire format is common and what happens with other address
types. BGP when passing label stacks also uses this technique with the
exception that the ttl byte is not included making each label in a BGP
label stack 3 bytes instead of 4.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This attribute is like RTA_DST except it specifies the destination
address to place on a packet when it leaves the host. For ip based
protocols this is destination NAT and not a common part of forwarding.
For protocols like MPLS label swapping is something that typically
happens on every hop.
There is likely to be a RTA_NEWSRC at some point so RTA_NEWDST
is printed as "as to" and can be specified either as "as to"
or just "as"
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>