From: shemminger
Date: Tue, 10 Jan 2006 18:50:18 +0000 (+0000)
Subject: Add missing files.
X-Git-Tag: ss-060110~1
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=143969f24bc40e2a28e53d1008852188b8cd767a;p=thirdparty%2Fiproute2.git

Add missing files.
---

diff --git a/doc/actions/actions-general b/doc/actions/actions-general
new file mode 100644
index 000000000..bb2295d89
--- /dev/null
+++ b/doc/actions/actions-general
@@ -0,0 +1,254 @@
+
+This document is slightly dated but should give you an idea of how things
+work.
+
+What is it?
+-----------
+
+An extension to the filtering/classification architecture of Linux Traffic
+Control.
+Up to 2.6.8 the only action that could be "attached" to a filter was policing.
+i.e. you could say something like:
+
+-----
+tc filter add dev lo parent ffff: protocol ip prio 10 u32 match ip src \
+127.0.0.1/32 flowid 1:1 police mtu 4000 rate 1500kbit burst 90k
+-----
+
+which implies "if a packet is seen on the ingress of the lo device with
+a source IP address of 127.0.0.1/32 we give it a classification id of 1:1 and
+we execute a policing action which rate limits its bandwidth utilization
+to 1.5Mbps".
+
+The new extensions allow for more than just policing actions to be added.
+They are also fully backward compatible. If you have a kernel that doesn't
+understand them, then the effect is null i.e. if you have a newer tc
+but older kernel, the actions are not installed. Likewise if you
+have a newer kernel but older tc, obviously the tc will use current
+syntax which will work fine. Of course to get the required effect you need
+both newer tc and kernel. If you are reading this you have the
+right tc ;->
+
+A side effect is that we can now get stateless firewalling to work with tc.
+Essentially this is now an alternative to iptables.
+I won't go into details of my dislike for iptables at times, but
+scalability is one of the main issues; however, if you need stateful
+classification - use netfilter (for now).
+
+This stuff works on both ingress and egress qdiscs.
+
+Features
+--------
+
+1) new additional syntax and actions enabled. Note old syntax is still valid.
+
+Essentially this is still the same syntax as tc with a new construct
+"action". The syntax is of the form:
+tc filter add parent 1:0 protocol ip prio 10
+flowid 1:1 action *
+
+You can have as many actions as you want (within sensible reasoning).
+
+In the past the only real action was the policer; i.e. you could do something
+along the lines of:
+tc filter add dev lo parent ffff: protocol ip prio 10 u32 \
+match ip src 127.0.0.1/32 flowid 1:1 \
+police mtu 4000 rate 1500kbit burst 90k
+
+Although you can still use the same syntax, now you can say:
+
+tc filter add dev lo parent 1:0 protocol ip prio 10 u32 \
+match ip src 127.0.0.1/32 flowid 1:1 \
+action police mtu 4000 rate 1500kbit burst 90k
+
+" generic Actions" (gact) at the moment are:
+{ drop, pass, reclassify, continue}
+(If you have others, not listed here give me a reason and we will add them)
++drop says to drop the packet
++pass says to accept it
++reclassify requests for reclassification of the packet
++continue requests for next lookup to match
+
+2)In order to take advantage of some of the targets written by the
+iptables people, a classifier can have a packet being massaged by an
+iptable target. I have only tested with mangler targets up to now.
+(in fact anything that is not in the mangling table is disabled right now)
+
+In terms of hooks:
+*ingress is mapped to pre-routing hook
+*egress is mapped to post-routing hook
+I don't see much value in the other hooks, if you see it and email me good
+reasons, the addition is trivial.
+
+Example syntax for iptables targets usage becomes:
+tc filter add ..... u32 action ipt -j
+
+example:
+tc filter add dev lo parent ffff: protocol ip prio 8 u32 \
+match ip dst 127.0.0.8/32 flowid 1:12 \
+action ipt -j mark --set-mark 2
+
+3) A feature I call pipe
+The motivation is derived from Unix pipe mechanism but applied to packets.
+Essentially take a matching packet and pass it through
+action1 | action2 | action3 etc.
+You could do something similar to this with the tc policer and the "continue"
+operator but this rather restricts it to just the policer and requires
+multiple rules (and lookups, hence quite inefficient);
+
+as an example -- and please note that this is just an example _not_ The
+Word You've Been Waiting For (yes I have had problems giving examples
+which ended becoming dogma in documents and people modifying them a little
+to look clever);
+
+I selected the metering rates to be small so that I can show better how
+things work.
+
+The script below does the following:
+- an incoming packet from 10.0.0.21 is first given a firewall mark of 1.
+
+- It is then metered to make sure it does not exceed its allocated rate of
+1Kbps. If it doesn't exceed rate, this is where we terminate action execution.
+
+- If it does exceed its rate, its "color" changes to a mark of 2 and it is
+then passed through a second meter.
+
+-The second meter is shared across all flows on that device [I am surprised
+that this seems to be not a well known feature of the policer; Bert was telling
+me that someone was writing a qdisc just to do sharing across multiple devices;
+it must be the summer heat again; we've had someone doing that every year around
+summer -- the key to sharing is to use an operator "index" in your policer
+rules (example "index 20"). All your rules have to use the same index to
+share.]
+
+-If the second meter is exceeded the color of the flow changes further to 3.
+
+-We then pass the packet to another meter which is shared across all devices
+in the system. If this meter is exceeded we drop the packet.
+
+Note the mark can be used further up the system to do things like policy
+or more interesting things on the egress.
+
+------------------ cut here -------------------------------
+#
+# Add an ingress qdisc on eth0
+tc qdisc add dev eth0 ingress
+#
+#if you see an incoming packet from 10.0.0.21
+tc filter add dev eth0 parent ffff: protocol ip prio 1 \
+u32 match ip src 10.0.0.21/32 flowid 1:15 \
+#
+# first give it a mark of 1
+action ipt -j mark --set-mark 1 index 2 \
+#
+# then pass it through a policer which allows 1kbps; if the flow
+# doesn't exceed that rate, this is where we stop, if it exceeds we
+# pipe the packet to the next action
+action police rate 1kbit burst 9k pipe \
+#
+# which marks the packet fwmark as 2 and pipes
+action ipt -j mark --set-mark 2 \
+#
+# next attempt to borrow b/width from a meter
+# used across all flows incoming on eth0("index 30")
+# and if that is exceeded we pipe to the next action
+action police index 30 mtu 5000 rate 1kbit burst 10k pipe \
+# mark it as fwmark 3 if exceeded
+action ipt -j mark --set-mark 3 \
+# and then attempt to borrow from a meter used by all devices in the
+# system. Should this be exceeded, drop the packet on the floor.
+action police index 20 mtu 5000 rate 1kbit burst 90k drop +--------------------------------- + +Now lets see the actions installed with +"tc filter show parent ffff: dev eth0" + +-------- output ----------- +jroot# tc filter show parent ffff: dev eth0 +filter protocol ip pref 1 u32 +filter protocol ip pref 1 u32 fh 800: ht divisor 1 +filter protocol ip pref 1 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:15 + + action order 1: tablename: mangle hook: NF_IP_PRE_ROUTING + target MARK set 0x1 index 2 + + action order 2: police 1 action pipe rate 1Kbit burst 9Kb mtu 2Kb + + action order 3: tablename: mangle hook: NF_IP_PRE_ROUTING + target MARK set 0x2 index 1 + + action order 4: police 30 action pipe rate 1Kbit burst 10Kb mtu 5000b + + action order 5: tablename: mangle hook: NF_IP_PRE_ROUTING + target MARK set 0x3 index 3 + + action order 6: police 20 action drop rate 1Kbit burst 90Kb mtu 5000b + + match 0a000015/ffffffff at 12 +------------------------------- + +Note the ordering of the actions is based on the order in which we entered +them. In the future i will add explicit priorities. + +Now lets run a ping -f from 10.0.0.21 to this host; stop the ping after +you see a few lines of dots + +---- +[root@jzny hadi]# ping -f 10.0.0.22 +PING 10.0.0.22 (10.0.0.22): 56 data bytes +.................................................................................................................................................................................................................................................................................................................................................................................................................................................... 
+--- 10.0.0.22 ping statistics ---
+2248 packets transmitted, 1811 packets received, 19% packet loss
+round-trip min/avg/max = 0.7/9.3/20.1 ms
+-----------------------------
+
+Now let's take a look at the stats with "tc -s filter show parent ffff: dev eth0"
+
+--------------
+jroot# tc -s filter show parent ffff: dev eth0
+filter protocol ip pref 1 u32
+filter protocol ip pref 1 u32 fh 800: ht divisor 1
+filter protocol ip pref 1 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1
+5
+
+	action order 1: tablename: mangle hook: NF_IP_PRE_ROUTING
+	 target MARK set 0x1 index 2
+	 Sent 188832 bytes 2248 pkts (dropped 0, overlimits 0)
+
+	action order 2: police 1 action pipe rate 1Kbit burst 9Kb mtu 2Kb
+	 Sent 188832 bytes 2248 pkts (dropped 0, overlimits 2122)
+
+	action order 3: tablename: mangle hook: NF_IP_PRE_ROUTING
+	 target MARK set 0x2 index 1
+	 Sent 178248 bytes 2122 pkts (dropped 0, overlimits 0)
+
+	action order 4: police 30 action pipe rate 1Kbit burst 10Kb mtu 5000b
+	 Sent 178248 bytes 2122 pkts (dropped 0, overlimits 1945)
+
+	action order 5: tablename: mangle hook: NF_IP_PRE_ROUTING
+	 target MARK set 0x3 index 3
+	 Sent 163380 bytes 1945 pkts (dropped 0, overlimits 0)
+
+	action order 6: police 20 action drop rate 1Kbit burst 90Kb mtu 5000b
+	 Sent 163380 bytes 1945 pkts (dropped 0, overlimits 437)
+
+  match 0a000015/ffffffff at 12
+-------------------------------
+
+Neat, eh?
+
+
+Wanna write an action module?
+------------------------------
+It's easy. Either look at the code or send me email. I will document at
+some point; will also accept documentation.
+
+TODO
+----
+
+Lotsa goodies/features coming. Requests also being accepted.
+At the moment the focus has been on getting the architecture in place.
+Expect new things in the spurious time I have to work on this
+(particularly around end of year when I typically get time off
+from work).
+
diff --git a/doc/actions/dummy-README b/doc/actions/dummy-README
new file mode 100644
index 000000000..3ef9f21b1
--- /dev/null
+++ b/doc/actions/dummy-README
@@ -0,0 +1,155 @@
+
+Advantage over current IMQ; cleaner in particular in SMP;
+with a _lot_ less code.
+Old Dummy device functionality is preserved while new one only
+kicks in if you use actions.
+
+IMQ USES
+--------
+As far as I know the reasons listed below are why people use IMQ.
+It would be nice to know of anything else that I missed.
+
+1) qdiscs/policies that are per device as opposed to system wide.
+IMQ allows for sharing.
+
+2) Allows for queueing incoming traffic for shaping instead of
+dropping. I am not aware of any study that shows policing is
+worse than shaping in achieving the end goal of rate control.
+I would be interested if anyone is experimenting.
+
+3) Very interesting use: if you are serving p2p you may wanna give
+preference to your own locally originated traffic (when responses come back)
+vs someone using your system to do bittorrent. So QoSing based on state
+comes in as the solution. What people did to achieve this was stick
+the IMQ somewhere prelocal hook.
+I think this is a pretty neat feature to have in Linux in general.
+(i.e. not just for IMQ).
+But I won't go back to putting netfilter hooks in the device to satisfy
+this. I also don't think it's worth it hacking dummy some more to be
+aware of say L3 info and play ip rule tricks to achieve this.
+--> Instead the plan is to have a conntrack related action. This action will
+selectively either query/create conntrack state on incoming packets.
+Packets could then be redirected to dummy based on what happens -> eg
+on incoming packets; if we find they are of known state we could send to
+a different queue than one which didn't have existing state. This
+all however is dependent on whatever rules the admin enters.
+
+At the moment this function does not exist yet. I have decided instead
+of sitting on the patch to release it and then if there's pressure I will
+add this feature.
+
+What you can do with dummy currently with actions
+--------------------------------------------------
+
+Let's say you are policing packets from alias 192.168.200.200/32
+you don't want those to exceed 100kbps going out.
+
+tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
+match ip src 192.168.200.200/32 flowid 1:2 \
+action police rate 100kbit burst 90k drop
+
+If you run tcpdump on eth0 you will see all packets going out
+with src 192.168.200.200/32 dropped or not
+Extend the rule a little to see only the ones that made it out:
+
+tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
+match ip src 192.168.200.200/32 flowid 1:2 \
+action police rate 10kbit burst 90k drop \
+action mirred egress mirror dev dummy0
+
+Now fire tcpdump on dummy0 to see only those packets ..
+tcpdump -n -i dummy0 -x -e -t
+
+Essentially a good debugging/logging interface.
+
+If you replace mirror with redirect, those packets will be
+blackholed and will never make it out. This redirect behavior
+changes with new patch (but not the mirror).
+
+What you can do with the patch to provide functionality
+that most people use IMQ for below:
+
+--------
+export TC="/sbin/tc"
+
+$TC qdisc add dev dummy0 root handle 1: prio
+$TC qdisc add dev dummy0 parent 1:1 handle 10: sfq
+$TC qdisc add dev dummy0 parent 1:2 handle 20: tbf rate 20kbit buffer 1600 limit 3000
+$TC qdisc add dev dummy0 parent 1:3 handle 30: sfq
+$TC filter add dev dummy0 protocol ip pref 1 parent 1: handle 1 fw classid 1:1
+$TC filter add dev dummy0 protocol ip pref 2 parent 1: handle 2 fw classid 1:2
+
+ifconfig dummy0 up
+
+$TC qdisc add dev eth0 ingress
+
+# redirect all IP packets arriving in eth0 to dummy0
+# use mark 1 --> puts them onto class 1:1
+$TC filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
+match u32 0 0 flowid 1:1 \
+action ipt -j MARK --set-mark 1 \
+action mirred egress redirect dev dummy0
+
+--------
+
+
+Run A Little test:
+
+from another machine ping so that you have packets going into the box:
+-----
+[root@jzny action-tests]# ping 10.22
+PING 10.22 (10.0.0.22): 56 data bytes
+64 bytes from 10.0.0.22: icmp_seq=0 ttl=64 time=2.8 ms
+64 bytes from 10.0.0.22: icmp_seq=1 ttl=64 time=0.6 ms
+64 bytes from 10.0.0.22: icmp_seq=2 ttl=64 time=0.6 ms
+
+--- 10.22 ping statistics ---
+3 packets transmitted, 3 packets received, 0% packet loss
+round-trip min/avg/max = 0.6/1.3/2.8 ms
+[root@jzny action-tests]#
+-----
+Now look at some stats:
+
+---
+[root@jmandrake]:~# $TC -s filter show parent ffff: dev eth0
+filter protocol ip pref 10 u32
+filter protocol ip pref 10 u32 fh 800: ht divisor 1
+filter protocol ip pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1
+  match 00000000/00000000 at 0
+	action order 1: tablename: mangle hook: NF_IP_PRE_ROUTING
+	 target MARK set 0x1
+	 index 1 ref 1 bind 1 installed 4195sec used 27sec
+	 Sent 252 bytes 3 pkts (dropped 0, overlimits 0)
+
+	action order 2: mirred (Egress Redirect to device dummy0) stolen
+	 index 1 ref 1 bind 1 installed 165 sec used 27 sec
+	 Sent 252 bytes 3 pkts (dropped 0, overlimits 0)
+
+[root@jmandrake]:~# $TC -s qdisc
+qdisc sfq 30: dev dummy0 limit 128p quantum 1514b
+ Sent 0 bytes 0 pkts (dropped 0, overlimits 0)
+qdisc tbf 20: dev dummy0 rate 20Kbit burst 1575b lat 2147.5s
+ Sent 210 bytes 3 pkts (dropped 0, overlimits 0)
+qdisc sfq 10: dev dummy0 limit 128p quantum 1514b
+ Sent 294 bytes 3 pkts (dropped 0, overlimits 0)
+qdisc prio 1: dev dummy0 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
+ Sent 504 bytes 6 pkts (dropped 0, overlimits 0)
+qdisc ingress ffff: dev eth0 ----------------
+ Sent 308 bytes 5 pkts (dropped 0, overlimits 0)
+
+[root@jmandrake]:~# ifconfig dummy0
+dummy0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
+ inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
+ UP BROADCAST RUNNING NOARP MTU:1500 Metric:1
+ RX packets:6 errors:0 dropped:3 overruns:0 frame:0
+ TX packets:3 errors:0 dropped:0 overruns:0 carrier:0
+ collisions:0 txqueuelen:32
+ RX bytes:504 (504.0 b) TX bytes:252 (252.0 b)
+-----
+
+Dummy continues to behave like it always did.
+You send it any packet not originating from the actions it will drop them.
+[In this case the three dropped packets were ipv6 ndisc].
+
+cheers,
+jamal
diff --git a/ip/ipntable.c b/ip/ipntable.c
new file mode 100644
index 000000000..5655d93ce
--- /dev/null
+++ b/ip/ipntable.c
@@ -0,0 +1,657 @@
+/*
+ * Copyright (C)2006 USAGI/WIDE Project
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+/*
+ * based on ipneigh.c
+ */
+/*
+ * Authors:
+ *	Masahide NAKAMURA @USAGI
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/time.h>
+#include <time.h>
+
+#include "utils.h"
+#include "ip_common.h"
+
+static struct
+{
+	int family;
+	int index;
+#define NONE_DEV	(-1)
+	char name[1024];
+} filter;
+
+static void usage(void) __attribute__((noreturn));
+
+static void usage(void)
+{
+	fprintf(stderr,
+		"Usage: ip ntable change name NAME [ dev DEV ]\n"
+		" [ thresh1 VAL ] [ thresh2 VAL ] [ thresh3 VAL ] [ gc_int MSEC ]\n"
+		" [ PARMS ]\n"
+		"Usage: ip ntable show [ dev DEV ] [ name NAME ]\n"
+
+		"PARMS := [ base_reachable MSEC ] [ retrans MSEC ] [ gc_stale MSEC ]\n"
+		" [ delay_probe MSEC ] [ queue LEN ]\n"
+		" [ app_probes VAL ] [ ucast_probes VAL ] [ mcast_probes VAL ]\n"
+		" [ anycast_delay MSEC ] [ proxy_delay MSEC ] [ proxy_queue LEN ]\n"
+		" [ locktime MSEC ]\n"
+		);
+
+	exit(-1);
+}
+
+static int ipntable_modify(int cmd, int flags, int argc, char **argv)
+{
+	struct {
+		struct nlmsghdr n;
+		struct ndtmsg ndtm;
+		char buf[1024];
+	} req;
+	char *namep = NULL;
+	char *threshsp = NULL;
+	char *gc_intp = NULL;
+	char parms_buf[1024];
+	struct rtattr *parms_rta = (struct rtattr *)parms_buf;
+	int parms_change = 0;
+
+	memset(&req, 0, sizeof(req));
+
+	req.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ndtmsg));
+	req.n.nlmsg_flags = NLM_F_REQUEST|flags;
+	req.n.nlmsg_type = cmd;
+
+	req.ndtm.ndtm_family = preferred_family;
+	req.ndtm.ndtm_pad1 = 0;
+	req.ndtm.ndtm_pad2 = 0;
+
+	memset(&parms_buf, 0, sizeof(parms_buf));
+
+	parms_rta->rta_type = NDTA_PARMS;
+	parms_rta->rta_len = RTA_LENGTH(0);
+
+	while (argc > 0) {
+		if (strcmp(*argv, "name") == 0) {
+			int len;
+
+			NEXT_ARG();
+			if (namep)
+				duparg("NAME", *argv);
+
+			namep = *argv;
+			len = strlen(namep) + 1;
+			addattr_l(&req.n, sizeof(req), NDTA_NAME, namep, len);
+		} else if (strcmp(*argv, "thresh1") == 0) {
+			__u32 thresh1;
+
+			NEXT_ARG();
+			threshsp = *argv;
+
+			if (get_u32(&thresh1, *argv, 0))
+				invarg("\"thresh1\" value is invalid", *argv);
+
+			addattr32(&req.n, sizeof(req), NDTA_THRESH1, thresh1);
+		} else if (strcmp(*argv, "thresh2") == 0) {
+			__u32 thresh2;
+
+			NEXT_ARG();
+			threshsp = *argv;
+
+			if (get_u32(&thresh2, *argv, 0))
+				invarg("\"thresh2\" value is invalid", *argv);
+
+			addattr32(&req.n, sizeof(req), NDTA_THRESH2, thresh2);
+		} else if (strcmp(*argv, "thresh3") == 0) {
+			__u32 thresh3;
+
+			NEXT_ARG();
+			threshsp = *argv;
+
+			if (get_u32(&thresh3, *argv, 0))
+				invarg("\"thresh3\" value is invalid", *argv);
+
+			addattr32(&req.n, sizeof(req), NDTA_THRESH3, thresh3);
+		} else if (strcmp(*argv, "gc_int") == 0) {
+			__u64 gc_int;
+
+			NEXT_ARG();
+			gc_intp = *argv;
+
+			if (get_u64(&gc_int, *argv, 0))
+				invarg("\"gc_int\" value is invalid", *argv);
+
+			addattr_l(&req.n, sizeof(req), NDTA_GC_INTERVAL,
+				  &gc_int, sizeof(gc_int));
+		} else if (strcmp(*argv, "dev") == 0) {
+			__u32 ifindex;
+
+			NEXT_ARG();
+			ifindex = ll_name_to_index(*argv);
+			if (ifindex == 0) {
+				fprintf(stderr, "Cannot find device \"%s\"\n", *argv);
+				return -1;
+			}
+
+			rta_addattr32(parms_rta, sizeof(parms_buf),
+				      NDTPA_IFINDEX, ifindex);
+		} else if (strcmp(*argv, "base_reachable") == 0) {
+			__u64 breachable;
+
+			NEXT_ARG();
+
+			if (get_u64(&breachable, *argv, 0))
+				invarg("\"base_reachable\" value is invalid", *argv);
+
+			rta_addattr_l(parms_rta, sizeof(parms_buf),
+				      NDTPA_BASE_REACHABLE_TIME,
+				      &breachable, sizeof(breachable));
+			parms_change = 1;
+		} else if (strcmp(*argv, "retrans") == 0) {
+			__u64 retrans;
+
+			NEXT_ARG();
+
+			if (get_u64(&retrans, *argv, 0))
+				invarg("\"retrans\" value is invalid", *argv);
+
+			rta_addattr_l(parms_rta, sizeof(parms_buf),
+				      NDTPA_RETRANS_TIME,
+				      &retrans, sizeof(retrans));
+			parms_change = 1;
+		} else if (strcmp(*argv, "gc_stale") == 0) {
+			__u64 gc_stale;
+
+			NEXT_ARG();
+
+			if (get_u64(&gc_stale, *argv, 0))
+				invarg("\"gc_stale\" value is invalid", *argv);
+
+			rta_addattr_l(parms_rta, sizeof(parms_buf),
+				      NDTPA_GC_STALETIME,
+				      &gc_stale, sizeof(gc_stale));
+			parms_change = 1;
+		} else if (strcmp(*argv, "delay_probe") == 0) {
+			__u64 delay_probe;
+
+			NEXT_ARG();
+
+			if (get_u64(&delay_probe, *argv, 0))
+				invarg("\"delay_probe\" value is invalid", *argv);
+
+			rta_addattr_l(parms_rta, sizeof(parms_buf),
+				      NDTPA_DELAY_PROBE_TIME,
+				      &delay_probe, sizeof(delay_probe));
+			parms_change = 1;
+		} else if (strcmp(*argv, "queue") == 0) {
+			__u32 queue;
+
+			NEXT_ARG();
+
+			if (get_u32(&queue, *argv, 0))
+				invarg("\"queue\" value is invalid", *argv);
+
+			if (!parms_rta)
+				parms_rta = (struct rtattr *)&parms_buf;
+			rta_addattr32(parms_rta, sizeof(parms_buf),
+				      NDTPA_QUEUE_LEN, queue);
+			parms_change = 1;
+		} else if (strcmp(*argv, "app_probes") == 0) {
+			__u32 aprobe;
+
+			NEXT_ARG();
+
+			if (get_u32(&aprobe, *argv, 0))
+				invarg("\"app_probes\" value is invalid", *argv);
+
+			rta_addattr32(parms_rta, sizeof(parms_buf),
+				      NDTPA_APP_PROBES, aprobe);
+			parms_change = 1;
+		} else if (strcmp(*argv, "ucast_probes") == 0) {
+			__u32 uprobe;
+
+			NEXT_ARG();
+
+			if (get_u32(&uprobe, *argv, 0))
+				invarg("\"ucast_probes\" value is invalid", *argv);
+
+			rta_addattr32(parms_rta, sizeof(parms_buf),
+				      NDTPA_UCAST_PROBES, uprobe);
+			parms_change = 1;
+		} else if (strcmp(*argv, "mcast_probes") == 0) {
+			__u32 mprobe;
+
+			NEXT_ARG();
+
+			if (get_u32(&mprobe, *argv, 0))
+				invarg("\"mcast_probes\" value is invalid", *argv);
+
+			rta_addattr32(parms_rta, sizeof(parms_buf),
+				      NDTPA_MCAST_PROBES, mprobe);
+			parms_change = 1;
+		} else if (strcmp(*argv, "anycast_delay") == 0) {
+			__u64 anycast_delay;
+
+			NEXT_ARG();
+
+			if (get_u64(&anycast_delay, *argv, 0))
+				invarg("\"anycast_delay\" value is invalid", *argv);
+
+			rta_addattr_l(parms_rta, sizeof(parms_buf),
+				      NDTPA_ANYCAST_DELAY,
+				      &anycast_delay, sizeof(anycast_delay));
+			parms_change = 1;
+		} else if (strcmp(*argv, "proxy_delay") == 0) {
+			__u64 proxy_delay;
+
+			NEXT_ARG();
+
+			if (get_u64(&proxy_delay, *argv, 0))
+				invarg("\"proxy_delay\" value is invalid", *argv);
+
+			rta_addattr_l(parms_rta, sizeof(parms_buf),
+				      NDTPA_PROXY_DELAY,
+				      &proxy_delay, sizeof(proxy_delay));
+			parms_change = 1;
+		} else if (strcmp(*argv, "proxy_queue") == 0) {
+			__u32 pqueue;
+
+			NEXT_ARG();
+
+			if (get_u32(&pqueue, *argv, 0))
+				invarg("\"proxy_queue\" value is invalid", *argv);
+
+			rta_addattr32(parms_rta, sizeof(parms_buf),
+				      NDTPA_PROXY_QLEN, pqueue);
+			parms_change = 1;
+		} else if (strcmp(*argv, "locktime") == 0) {
+			__u64 locktime;
+
+			NEXT_ARG();
+
+			if (get_u64(&locktime, *argv, 0))
+				invarg("\"locktime\" value is invalid", *argv);
+
+			rta_addattr_l(parms_rta, sizeof(parms_buf),
+				      NDTPA_LOCKTIME,
+				      &locktime, sizeof(locktime));
+			parms_change = 1;
+		} else {
+			invarg("unknown", *argv);
+		}
+
+		argc--; argv++;
+	}
+
+	if (!namep)
+		missarg("NAME");
+	if (!threshsp && !gc_intp && !parms_change) {
+		fprintf(stderr, "Not enough information: changeable attributes required.\n");
+		exit(-1);
+	}
+
+	if (parms_rta->rta_len > RTA_LENGTH(0)) {
+		addattr_l(&req.n, sizeof(req), NDTA_PARMS, RTA_DATA(parms_rta),
+			  RTA_PAYLOAD(parms_rta));
+	}
+
+	if (rtnl_talk(&rth, &req.n, 0, 0, NULL, NULL, NULL) < 0)
+		exit(2);
+
+	return 0;
+}
+
+static const char *ntable_strtime_delta(__u32 msec)
+{
+	static char str[32];
+	struct timeval now;
+	time_t t;
+	struct tm *tp;
+
+	if (msec == 0)
+		goto error;
+
+	memset(&now, 0, sizeof(now));
+
+	if (gettimeofday(&now, NULL) < 0) {
+		perror("gettimeofday");
+		goto error;
+	}
+
+	t = now.tv_sec - (msec / 1000);
+	tp = localtime(&t);
+	if (!tp)
+		goto error;
+
+	strftime(str, sizeof(str), "%Y-%m-%d %T", tp);
+
+	return str;
+ error:
+	strcpy(str, "(error)");
+	return str;
+}
+
+int print_ntable(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
+{
+	FILE *fp = (FILE*)arg;
+	struct ndtmsg *ndtm = NLMSG_DATA(n);
+	int len = n->nlmsg_len;
+	struct rtattr *tb[NDTA_MAX+1];
+	struct rtattr *tpb[NDTPA_MAX+1];
+	int ret;
+
+	if (n->nlmsg_type != RTM_NEWNEIGHTBL) {
+		fprintf(stderr, "Not NEIGHTBL: %08x %08x %08x\n",
+			n->nlmsg_len, n->nlmsg_type, n->nlmsg_flags);
+		return 0;
+	}
+	len -= NLMSG_LENGTH(sizeof(*ndtm));
+	if (len < 0) {
+		fprintf(stderr, "BUG: wrong nlmsg len %d\n", len);
+		return -1;
+	}
+
+	if (preferred_family && preferred_family != ndtm->ndtm_family)
+		return 0;
+
+	parse_rtattr(tb, NDTA_MAX, NDTA_RTA(ndtm),
+		     n->nlmsg_len - NLMSG_LENGTH(sizeof(*ndtm)));
+
+	if (tb[NDTA_NAME]) {
+		char *name = RTA_DATA(tb[NDTA_NAME]);
+
+		if (strlen(filter.name) > 0 && strcmp(filter.name, name))
+			return 0;
+	}
+	if (tb[NDTA_PARMS]) {
+		parse_rtattr(tpb, NDTPA_MAX, RTA_DATA(tb[NDTA_PARMS]),
+			     RTA_PAYLOAD(tb[NDTA_PARMS]));
+
+		if (tpb[NDTPA_IFINDEX]) {
+			__u32 ifindex = *(__u32 *)RTA_DATA(tpb[NDTPA_IFINDEX]);
+
+			if (filter.index && filter.index != ifindex)
+				return 0;
+		} else {
+			if (filter.index && filter.index != NONE_DEV)
+				return 0;
+		}
+	}
+
+	if (ndtm->ndtm_family == AF_INET)
+		fprintf(fp, "inet ");
+	else if (ndtm->ndtm_family == AF_INET6)
+		fprintf(fp, "inet6 ");
+	else if (ndtm->ndtm_family == AF_DECnet)
+		fprintf(fp, "dnet ");
+	else
+		fprintf(fp, "(%d) ", ndtm->ndtm_family);
+
+	if (tb[NDTA_NAME]) {
+		char *name = RTA_DATA(tb[NDTA_NAME]);
+		fprintf(fp, "%s ", name);
+	}
+
+	fprintf(fp, "%s", _SL_);
+
+	ret = (tb[NDTA_THRESH1] || tb[NDTA_THRESH2] || tb[NDTA_THRESH3] ||
+	       tb[NDTA_GC_INTERVAL]);
+	if (ret)
+		fprintf(fp, " ");
+
+	if (tb[NDTA_THRESH1]) {
+		__u32 thresh1 = *(__u32 *)RTA_DATA(tb[NDTA_THRESH1]);
+		fprintf(fp, "thresh1 %u ", thresh1);
+	}
+	if (tb[NDTA_THRESH2]) {
+		__u32 thresh2 = *(__u32 *)RTA_DATA(tb[NDTA_THRESH2]);
+		fprintf(fp, "thresh2 %u ", thresh2);
+	}
+	if (tb[NDTA_THRESH3]) {
+		__u32 thresh3 = *(__u32 *)RTA_DATA(tb[NDTA_THRESH3]);
+		fprintf(fp, "thresh3 %u ", thresh3);
+	}
+	if (tb[NDTA_GC_INTERVAL]) {
+		__u64 gc_int = *(__u64 *)RTA_DATA(tb[NDTA_GC_INTERVAL]);
+		fprintf(fp, "gc_int %llu ", gc_int);
+	}
+
+	if (ret)
+		fprintf(fp, "%s", _SL_);
+
+	if (tb[NDTA_CONFIG] && show_stats) {
+		struct ndt_config *ndtc = RTA_DATA(tb[NDTA_CONFIG]);
+
+		fprintf(fp, " ");
+		fprintf(fp, "config ");
+
+		fprintf(fp, "key_len %u ", ndtc->ndtc_key_len);
+		fprintf(fp, "entry_size %u ", ndtc->ndtc_entry_size);
+		fprintf(fp, "entries %u ", ndtc->ndtc_entries);
+
+		fprintf(fp, "%s", _SL_);
+		fprintf(fp, " ");
+
+		fprintf(fp, "last_flush %s ",
+			ntable_strtime_delta(ndtc->ndtc_last_flush));
+		fprintf(fp, "last_rand %s ",
+			ntable_strtime_delta(ndtc->ndtc_last_rand));
+
+		fprintf(fp, "%s", _SL_);
+		fprintf(fp, " ");
+
+		fprintf(fp, "hash_rnd %u ", ndtc->ndtc_hash_rnd);
+		fprintf(fp, "hash_mask %08x ", ndtc->ndtc_hash_mask);
+
+		fprintf(fp, "hash_chain_gc %u ", ndtc->ndtc_hash_chain_gc);
+		fprintf(fp, "proxy_qlen %u ", ndtc->ndtc_proxy_qlen);
+
+		fprintf(fp, "%s", _SL_);
+	}
+
+	if (tb[NDTA_PARMS]) {
+		if (tpb[NDTPA_IFINDEX]) {
+			__u32 ifindex = *(__u32 *)RTA_DATA(tpb[NDTPA_IFINDEX]);
+
+			fprintf(fp, " ");
+			fprintf(fp, "dev %s ", ll_index_to_name(ifindex));
+			fprintf(fp, "%s", _SL_);
+		}
+
+		fprintf(fp, " ");
+
+		if (tpb[NDTPA_REFCNT]) {
+			__u32 refcnt = *(__u32 *)RTA_DATA(tpb[NDTPA_REFCNT]);
+			fprintf(fp, "refcnt %u ", refcnt);
+		}
+		if (tpb[NDTPA_REACHABLE_TIME]) {
+			__u64 reachable = *(__u64 *)RTA_DATA(tpb[NDTPA_REACHABLE_TIME]);
+			fprintf(fp, "reachable %llu ", reachable);
+		}
+		if (tpb[NDTPA_BASE_REACHABLE_TIME]) {
+			__u64 breachable = *(__u64 *)RTA_DATA(tpb[NDTPA_BASE_REACHABLE_TIME]);
+			fprintf(fp, "base_reachable %llu ", breachable);
+		}
+		if (tpb[NDTPA_RETRANS_TIME]) {
+			__u64 retrans = *(__u64 *)RTA_DATA(tpb[NDTPA_RETRANS_TIME]);
+			fprintf(fp, "retrans %llu ", retrans);
+		}
+
+		fprintf(fp, "%s", _SL_);
+
+		fprintf(fp, " ");
+
+		if (tpb[NDTPA_GC_STALETIME]) {
+			__u64 gc_stale = *(__u64 *)RTA_DATA(tpb[NDTPA_GC_STALETIME]);
+			fprintf(fp, "gc_stale %llu ", gc_stale);
+		}
+		if (tpb[NDTPA_DELAY_PROBE_TIME]) {
+			__u64 delay_probe = *(__u64 *)RTA_DATA(tpb[NDTPA_DELAY_PROBE_TIME]);
+			fprintf(fp, "delay_probe %llu ", delay_probe);
+		}
+		if (tpb[NDTPA_QUEUE_LEN]) {
+			__u32 queue = *(__u32 *)RTA_DATA(tpb[NDTPA_QUEUE_LEN]);
+			fprintf(fp, "queue %u ", queue);
+		}
+
+		fprintf(fp, "%s", _SL_);
+
+		fprintf(fp, " ");
+
+		if (tpb[NDTPA_APP_PROBES]) {
+			__u32 aprobe = *(__u32 *)RTA_DATA(tpb[NDTPA_APP_PROBES]);
+			fprintf(fp, "app_probes %u ", aprobe);
+		}
+		if (tpb[NDTPA_UCAST_PROBES]) {
+			__u32 uprobe = *(__u32 *)RTA_DATA(tpb[NDTPA_UCAST_PROBES]);
+			fprintf(fp, "ucast_probes %u ", uprobe);
+		}
+		if (tpb[NDTPA_MCAST_PROBES]) {
+			__u32 mprobe = *(__u32 *)RTA_DATA(tpb[NDTPA_MCAST_PROBES]);
+			fprintf(fp, "mcast_probes %u ", mprobe);
+		}
+
+		fprintf(fp, "%s", _SL_);
+
+		fprintf(fp, " ");
+
+		if (tpb[NDTPA_ANYCAST_DELAY]) {
+			__u64 anycast_delay = *(__u64 *)RTA_DATA(tpb[NDTPA_ANYCAST_DELAY]);
+			fprintf(fp, "anycast_delay %llu ", anycast_delay);
+		}
+		if (tpb[NDTPA_PROXY_DELAY]) {
+			__u64 proxy_delay = *(__u64 *)RTA_DATA(tpb[NDTPA_PROXY_DELAY]);
+			fprintf(fp, "proxy_delay %llu ", proxy_delay);
+		}
+		if (tpb[NDTPA_PROXY_QLEN]) {
+			__u32 pqueue = *(__u32 *)RTA_DATA(tpb[NDTPA_PROXY_QLEN]);
+			fprintf(fp, "proxy_queue %u ", pqueue);
+		}
+		if (tpb[NDTPA_LOCKTIME]) {
+			__u64 locktime = *(__u64 *)RTA_DATA(tpb[NDTPA_LOCKTIME]);
+			fprintf(fp, "locktime %llu ", locktime);
+		}
+
+		fprintf(fp, "%s", _SL_);
+	}
+
+	if (tb[NDTA_STATS] && show_stats) {
+		struct ndt_stats *ndts = RTA_DATA(tb[NDTA_STATS]);
+
+		fprintf(fp, " ");
+		fprintf(fp, "stats ");
+
+		fprintf(fp, "allocs %llu ", ndts->ndts_allocs);
+		fprintf(fp, "destroys %llu ", ndts->ndts_destroys);
+		fprintf(fp, "hash_grows %llu ", ndts->ndts_hash_grows);
+
+		fprintf(fp, "%s", _SL_);
+		fprintf(fp, " ");
+
+		fprintf(fp, "res_failed %llu ", ndts->ndts_res_failed);
+		fprintf(fp, "lookups %llu ", ndts->ndts_lookups);
+		fprintf(fp, "hits %llu ", ndts->ndts_hits);
+
+		fprintf(fp, "%s", _SL_);
+		fprintf(fp, " ");
+
+		fprintf(fp, "rcv_probes_mcast %llu ", ndts->ndts_rcv_probes_mcast);
+		fprintf(fp, "rcv_probes_ucast %llu ", ndts->ndts_rcv_probes_ucast);
+
+		fprintf(fp, "%s", _SL_);
+		fprintf(fp, " ");
+
+		fprintf(fp, "periodic_gc_runs %llu ", ndts->ndts_periodic_gc_runs);
+		fprintf(fp, "forced_gc_runs %llu ", ndts->ndts_forced_gc_runs);
+
+		fprintf(fp, "%s", _SL_);
+	}
+
+	fprintf(fp, "\n");
+
+	fflush(fp);
+	return 0;
+}
+
+void ipntable_reset_filter(void)
+{
+	memset(&filter, 0, sizeof(filter));
+}
+
+static int ipntable_show(int argc, char **argv)
+{
+	ipntable_reset_filter();
+
+	filter.family = preferred_family;
+
+	while (argc > 0) {
+		if (strcmp(*argv, "dev") == 0) {
+			NEXT_ARG();
+
+			if (strcmp("none", *argv) == 0)
+				filter.index = NONE_DEV;
+			else if ((filter.index = ll_name_to_index(*argv)) == 0)
+				invarg("\"DEV\" is invalid", *argv);
+		} else if (strcmp(*argv, "name") == 0) {
+			NEXT_ARG();
+
+			strncpy(filter.name, *argv, sizeof(filter.name) - 1);
+		} else
+			invarg("unknown", *argv);
+
+		argc--; argv++;
+	}
+
+	if (rtnl_wilddump_request(&rth, preferred_family, RTM_GETNEIGHTBL) < 0) {
+		perror("Cannot send dump request");
+		exit(1);
+	}
+
+	if (rtnl_dump_filter(&rth, print_ntable, stdout, NULL, NULL) < 0) {
+		fprintf(stderr, "Dump terminated\n");
+		exit(1);
+	}
+
+	return 0;
+}
+
+int do_ipntable(int argc, char **argv)
+{
+	ll_init_map(&rth);
+
+	if (argc > 0) {
+		if (matches(*argv, "change") == 0 ||
+		    matches(*argv, "chg") == 0)
+			return ipntable_modify(RTM_SETNEIGHTBL,
+					       NLM_F_REPLACE,
+					       argc-1, argv+1);
+		if (matches(*argv, "show") == 0 ||
+		    matches(*argv, "lst") == 0 ||
+		    matches(*argv, "list") == 0)
+			return ipntable_show(argc-1, argv+1);
+		if (matches(*argv, "help") == 0)
+			usage();
+	} else
+		return ipntable_show(0, NULL);
+
+	fprintf(stderr, "Command \"%s\" is unknown, try \"ip ntable help\".\n", *argv);
+	exit(-1);
+}