Kees Cook [Mon, 30 Oct 2017 21:05:41 +0000 (14:05 -0700)]
drivers/net: tundra: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.
Cc: "David S. Miller" <davem@davemloft.net> Cc: Philippe Reynes <tremyfr@gmail.com> Cc: "yuval.shaia@oracle.com" <yuval.shaia@oracle.com> Cc: Eric Dumazet <edumazet@google.com> Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
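The conversion pattern can be modelled in plain userspace C. The sketch below is not driver code: struct timer_list here is a stand-in, and demo_priv, demo_poll, demo_timer_setup and demo_timer_fire are hypothetical names used only for illustration. It shows how a callback that receives the timer pointer recovers its enclosing structure, which is what from_timer() does in the kernel by way of container_of().

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct timer_list (illustration only). */
struct timer_list {
    void (*function)(struct timer_list *t);
};

/* Same idea as the kernel's container_of(): recover the enclosing struct
 * from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical driver private data that embeds its timer. */
struct demo_priv {
    int ticks;
    struct timer_list poll_timer;
};

/* New-style callback: it is handed the timer pointer itself rather than an
 * opaque unsigned long cookie, and derives the private data from it. */
void demo_poll(struct timer_list *t)
{
    struct demo_priv *priv = container_of(t, struct demo_priv, poll_timer);

    priv->ticks++;
}

/* Model of timer_setup(): it only records the callback. */
void demo_timer_setup(struct timer_list *t, void (*fn)(struct timer_list *))
{
    assert(t != NULL && fn != NULL);
    t->function = fn;
}

/* Model of the timer core firing: the callback receives the timer. */
void demo_timer_fire(struct timer_list *t)
{
    t->function(t);
}
```

Calling demo_timer_setup(&priv.poll_timer, demo_poll) and then firing the timer twice leaves priv.ticks at 2, without any separate data pointer being stored alongside the timer.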
Kees Cook [Mon, 30 Oct 2017 21:05:12 +0000 (14:05 -0700)]
drivers/net: ntb_netdev: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.
Cc: Jon Mason <jdmason@kudzu.us> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Allen Hubbe <Allen.Hubbe@emc.com> Cc: linux-ntb@googlegroups.com Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Yonghong Song [Mon, 30 Oct 2017 20:50:22 +0000 (13:50 -0700)]
bpf: avoid rcu_dereference inside bpf_event_mutex lock region
During perf event attaching/detaching bpf programs,
the tp_event->prog_array change is protected by the
bpf_event_mutex lock in both attaching and detaching
functions. Although tp_event->prog_array is an rcu
pointer, rcu_dereference is not needed to access it
since the mutex lock will guarantee ordering.
Verified through "make C=2" that the sparse
locking check is still happy with the new change.
Also change the label name in perf_event_{attach,detach}_bpf_prog
from "out" to "unlock" to reflect the code action after the label.
Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bridge: add neigh_suppress to bridge port policies
Add an entry for IFLA_BRPORT_NEIGH_SUPPRESS to bridge port policies.
Fixes: 821f1b21cabb ("bridge: add new BR_NEIGH_SUPPRESS port flag to suppress arp and nd flood") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 1 Nov 2017 03:28:33 +0000 (12:28 +0900)]
Merge branch 'mvpp2-various-improvements'
Antoine Tenart says:
====================
net: mvpp2: various improvements
This series includes various patches improving the Marvell PPv2 driver.
I send them as a series to avoid any possible merge conflict.
- Patches 1 and 2 improve the initializing of the Tx and Rx FIFO.
- Patch 3 initializes the RSS table to evenly distribute the ingress
packets across multiple Rx queues based on their hashes.
- Patch 4 limits the number of TSO segments sent to the driver, to avoid
having more segments to handle than the corresponding number of
available descriptors.
- Patches 5 and 6 are cosmetic improvements.
This applies on today's net-next branch. The patches were tested
extensively (I ran iperf and http downloads in parallel, transferring
TBs of data).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Mon, 30 Oct 2017 10:23:33 +0000 (11:23 +0100)]
net: mvpp2: simplify the Tx desc set DMA logic
Two functions were always used to set the DMA addresses in Tx
descriptors, because this address is split into a base+offset in the
descriptors. A mask was used to come up with the base and offset
addresses and two functions were called, mvpp2_txdesc_dma_addr_set() and
mvpp2_txdesc_offset_set().
This patch moves the base+offset calculation logic to
mvpp2_txdesc_dma_addr_set(), and removes mvpp2_txdesc_offset_set() to
simplify things.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Mon, 30 Oct 2017 10:23:32 +0000 (11:23 +0100)]
net: mvpp2: use the aggr txq size define everywhere
Cosmetic patch using the MVPP2_AGGR_TXQ_SIZE define everywhere instead of the
size field of aggr_txq, as the size never changes and is always equal to
the MVPP2_AGGR_TXQ_SIZE define.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Mon, 30 Oct 2017 10:23:31 +0000 (11:23 +0100)]
net: mvpp2: limit TSO segments and use stop/wake thresholds
Too many TSO descriptors can be required for the default queue size,
when using small MSS values for example. Prevent this by adding a
maximum number of allowed TSO segments (300). In addition, set stop and
wake thresholds so that the queue is stopped when there's no room for one
"worst case scenario skb". Wake up the queue when the number of descriptors
in use is low enough.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
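The stop/wake scheme can be sketched as a small userspace model. Only the limit of 300 segments comes from the commit message; the fragment count, the worst-case formula, the choice of half the stop threshold as the wake threshold, and all demo_ names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_TSO_SEGS  300 /* limit quoted in the commit message */
#define DEMO_MAX_SKB_FRAGS 17  /* assumption for this sketch */
/* Assumed worst case: two descriptors per TSO segment plus the fragments. */
#define DEMO_MAX_SKB_DESCS (DEMO_MAX_TSO_SEGS * 2 + DEMO_MAX_SKB_FRAGS)

/* Hypothetical Tx queue bookkeeping. */
struct demo_txq {
    int size;           /* total descriptors in the ring */
    int count;          /* descriptors currently in use */
    int stop_threshold; /* stop when count reaches this */
    int wake_threshold; /* wake when count falls to this */
    bool stopped;
};

void demo_txq_init(struct demo_txq *q, int size)
{
    assert(size > DEMO_MAX_SKB_DESCS);
    q->size = size;
    q->count = 0;
    q->stop_threshold = size - DEMO_MAX_SKB_DESCS;
    q->wake_threshold = q->stop_threshold / 2; /* assumed hysteresis */
    q->stopped = false;
}

/* Transmit path: stop once the free space has shrunk to one
 * worst-case skb or less. */
void demo_txq_xmit(struct demo_txq *q, int ndescs)
{
    q->count += ndescs;
    if (q->count >= q->stop_threshold)
        q->stopped = true;
}

/* Completion path: wake only after enough descriptors were released. */
void demo_txq_complete(struct demo_txq *q, int ndescs)
{
    q->count -= ndescs;
    if (q->stopped && q->count <= q->wake_threshold)
        q->stopped = false;
}
```

Using two thresholds rather than one avoids the queue flapping between stopped and running when the ring hovers around a single limit.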
Antoine Tenart [Mon, 30 Oct 2017 10:23:30 +0000 (11:23 +0100)]
net: mvpp2: initialize the RSS tables
This patch initializes the RSS tables to distribute the packets evenly
(depending on the packets' RSS hashes) across the port's Rx queues. This
helps to handle packets on different CPUs and improves performance, as more
queues will be used in parallel.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
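An even distribution of this kind is commonly obtained by filling the indirection table round-robin. The following userspace sketch assumes a 32-entry table and uses hypothetical demo_ names; it is not taken from the driver.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_RSS_TABLE_ENTRIES 32 /* table size assumed for this sketch */

/* Fill the indirection table round-robin so that every Rx queue owns
 * roughly the same number of entries. */
void demo_rss_table_fill(unsigned int *table, size_t entries,
                         unsigned int nrxqs)
{
    size_t i;

    assert(table != NULL && nrxqs > 0);
    for (i = 0; i < entries; i++)
        table[i] = (unsigned int)(i % nrxqs);
}

/* Look up the Rx queue for a packet from its RSS hash. */
unsigned int demo_rss_pick_queue(const unsigned int *table, size_t entries,
                                 unsigned int hash)
{
    return table[hash % entries];
}
```

With four Rx queues each queue owns eight of the 32 entries, so flows whose hashes are spread uniformly land on the queues in equal proportion.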
Antoine Tenart [Mon, 30 Oct 2017 10:23:29 +0000 (11:23 +0100)]
net: mvpp2: initialize the Tx FIFO size
So far only the Rx FIFO size was initialized. For PPv2.2 the Tx FIFO
size can be set as well. This patch initializes the Tx FIFO size for
PPv2.2 controllers to 3K.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Mon, 30 Oct 2017 10:23:28 +0000 (11:23 +0100)]
net: mvpp2: set the Rx FIFO size depending on the port speeds for PPv2.2
The Rx FIFO size was set to the same value for all ports. This patch
sets it depending on the maximum speed a given port can handle. This
only works for PPv2.2.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Sat, 28 Oct 2017 05:05:46 +0000 (05:05 +0000)]
net: bcmgenet: Avoid calling platform_device_put() twice in bcmgenet_mii_exit()
Remove platform_device_put() call after platform_device_unregister()
from function bcmgenet_mii_exit(), otherwise, we will call
platform_device_put() twice.
Fixes: 9a4e79697009 ("net: bcmgenet: utilize generic Broadcom UniMAC MDIO controller driver") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Acked-by: Doug Berger <opendmb@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 1 Nov 2017 02:50:43 +0000 (11:50 +0900)]
Merge branch 'extack-nonfatal'
David Ahern says:
====================
net: Allow non-fatal messages to be passed in extack
There are many cases where networking subsystems throw non-fatal warning
messages that end up in dmesg / kernel log to which a user making the
change is completely oblivious. This set makes the extack facility
usable for returning such messages.
The case in point here is spectrum and adding FIB rules which causes an
offload abort. Make the use case more user friendly by letting the user
know that offload is no longer happening because of the rule change.
v2
- kept the offload abort in a work queue entry per Ido's comment
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Sat, 28 Oct 2017 00:37:14 +0000 (17:37 -0700)]
mlxsw: spectrum_router: Return extack message on abort due to fib rules
Adding a FIB rule on a spectrum platform silently aborts FIB offload:
$ ip ru add pref 99 from all to 192.168.1.1 table 10
$ dmesg -c
[ 623.144736] mlxsw_spectrum 0000:03:00.0: FIB abort triggered. Note that FIB entries are no longer being offloaded to this device.
This patch reworks FIB rule handling to return a message to the user:
$ ip ru add pref 99 from all to 8.8.8.8 table 11
Error: spectrum: FIB rules not supported. Aborting offload.
spectrum currently only checks whether the fib rule is a default rule or
an l3mdev rule, both of which it knows how to handle. For any other rule it
aborts FIB offload. Move the processing to check the rule type inline with the
user request. If the rule is an unsupported one, then a work queue entry
is used to abort the offload. Change the rule delete handling to just
return since it does nothing at the moment.
Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Sat, 28 Oct 2017 00:37:13 +0000 (17:37 -0700)]
net: Add extack to fib_notifier_info
Add extack to fib_notifier_info and plumb it through the stack to
call_fib_rule_notifiers, call_fib_entry_notifiers and
call_fib6_entry_notifiers. This allows notifier handlers to
return messages to the user.
Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 1 Nov 2017 02:47:45 +0000 (11:47 +0900)]
Merge branch 'dsa-port-parsing'
Vivien Didelot says:
====================
net: dsa: add port parsing functions
This patchset adds port parsing functions called early in the new
bindings parsing stage, which regroup all the fetching of static data
available at the port level, including the port's type, name and CPU
master interface.
This simplifies the rest of the code which does not need to dig into
device tree or platform data again in order to check a port's type or
name.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 27 Oct 2017 19:55:19 +0000 (15:55 -0400)]
net: dsa: remove name arg from slave create
Now that a slave dsa_port always has its name set, there is no need to
pass it to dsa_slave_create() anymore. Remove this argument.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 27 Oct 2017 19:55:18 +0000 (15:55 -0400)]
net: dsa: get port name at parse time
Get the optional "label" property and assign a default one directly at
parse time instead of doing it when creating the slave.
For legacy, simply assign the port name stored in cd->port_names.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 27 Oct 2017 19:55:17 +0000 (15:55 -0400)]
net: dsa: get master device at port parsing time
Fetching the master device can be done directly when a port is parsed
from device tree or pdata, instead of waiting until dsa_dst_parse.
Now that -EPROBE_DEFER is returned before we add the switch to the tree,
there is no need to check for this error after dsa_dst_parse.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 27 Oct 2017 19:55:15 +0000 (15:55 -0400)]
net: dsa: get port type at parse time
Assign a port's type at parse time instead of waiting for the tree to
be completed.
Because this is now done earlier, we can use the port's type in
dsa_port_is_* helpers instead of digging again in topology description.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 27 Oct 2017 19:55:14 +0000 (15:55 -0400)]
net: dsa: add port parse functions
Add symmetrical DSA port parsing functions for pdata and device tree,
used to parse and validate a given port node or platform data.
They don't do much for the moment but will be extended later on to
assign a port type and get device references.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 27 Oct 2017 19:55:13 +0000 (15:55 -0400)]
net: dsa: get ports within parsing code
There is no point in hiding the -EINVAL error code in ERR_PTR from a
dsa_get_ports function, simply get the "ports" node directly from within
the dsa_parse_ports_dn function.
This also has the effect of making the pdata and device tree handling code
symmetrical inside _dsa_register_switch.
At the same time, rename dsa_parse_ports_dn to dsa_parse_ports_of
because _of is a more common suffix for device tree parsing functions.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
It's done by observing that the majority of BPF programs use little to
no stack whereas the verifier always kept all 512 stack slots ready.
Instead, dynamically reallocate struct verifier state when stack
access is detected.
Runtime difference before vs after is within noise.
The number of processed instructions stays the same.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
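The grow-on-demand idea can be illustrated with a small userspace sketch. It is not the verifier's code: the structure, the growth granularity of 8 slots and every demo_ name are assumptions; only the limit of 512 slots comes from the text above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_STACK_SLOTS 512 /* upper bound quoted in the commit text */
#define DEMO_GROW_STEP       8   /* growth granularity assumed here */

/* Hypothetical per-state bookkeeping: nothing is allocated up front. */
struct demo_verifier_state {
    int allocated_slots;  /* slots currently backed by memory */
    unsigned char *stack; /* NULL until the first stack access */
};

/* Make sure the slots 1..off exist; returns 0 on success, -1 on error. */
int demo_state_grow_stack(struct demo_verifier_state *st, int off)
{
    unsigned char *p;
    int need;

    assert(st != NULL);
    if (off <= 0 || off > DEMO_MAX_STACK_SLOTS)
        return -1;

    /* Round the requirement up to the growth step. */
    need = (off + DEMO_GROW_STEP - 1) / DEMO_GROW_STEP * DEMO_GROW_STEP;
    if (need <= st->allocated_slots)
        return 0; /* already large enough */

    p = realloc(st->stack, (size_t)need);
    if (!p)
        return -1;

    /* Zero only the newly added tail; existing slots keep their contents. */
    memset(p + st->allocated_slots, 0, (size_t)(need - st->allocated_slots));
    st->stack = p;
    st->allocated_slots = need;
    return 0;
}
```

A state that never touches the stack keeps allocated_slots at 0 and stack at NULL, which is where the memory saving comes from in this model.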
David S. Miller [Wed, 1 Nov 2017 02:39:52 +0000 (11:39 +0900)]
Merge branch 'liquidio-switchdev-support'
Vijaya Mohan Guvva says:
====================
liquidio: switchdev support for LiquidIO NIC
patch1 of this patch set adds switchdev support for SRIOV capable
LiquidIO NIC, so that for every SRIOV VF on LiquidIO, a representor
netdev is created on hypervisor. It also has changes to send representor
interface configurations like admin state and MTU to LiquidIO firmware and
to retrieve HW counted VF stats for VF representor.
patch2 adds support for switchdev enable/disable from devlink
Patchset Change Log:
V2 -> V3:
* Use mac address as the physical switchID.
* Check for eswitch_mode before returning switchID
V1 -> V2:
* Name the representors "pfXvfY".
* Drop patch3 (ethtool support for switchdev ports) that was in V1
because it's not necessary.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable and disable switchdev on SRIOV capable LiquidIO NIC with devlink.
Create representor netdev for each SRIOV VF function on SRIOV enable
and do the cleanup on SRIOV disable.
Signed-off-by: Vijaya Mohan Guvva <vijaya.guvva@cavium.com> Signed-off-by: Satanand Burla <satananda.burla@cavium.com> Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@cavium.com> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Enable switchdev for SRIOV capable LiquidIO NIC. It registers
a representor netdev (with switchdev_ops) for each SRIOV VF created.
It also has changes to send representor interface configurations like
admin state and MTU to LiquidIO firmware and to retrieve HW counted
VF stats for VF representor.
Signed-off-by: Vijaya Mohan Guvva <vijaya.guvva@cavium.com> Signed-off-by: Satanand Burla <satananda.burla@cavium.com> Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@cavium.com> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 1 Nov 2017 02:37:17 +0000 (11:37 +0900)]
Merge tag 'mlx5-updates-2017-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2017-10-31 mlx5e stats groups
This series from Kamal introduces an important refactoring of mlx5e stats
handling, which groups the stats into a generic group structure that allows
controlling the behavior and stats reporting per group in a modular way.
In the first patch Kamal introduces a new data type, "mlx5e_stats_grp". This change
defines a new API to create a group of stats and simplifies the way of handling them.
This struct will define the following behavior per group:
- get_num_stats() - return the number of counters in the group.
- fill_strings() - fill counters strings within the group.
- fill_stats() - fill counters values within the group.
All other patches are straightforward refactoring per stats group,
where Kamal moves each mlx5e stats group to use the new API.
The idea is to have better flexibility and modularity for adding new counters:
all ethtool logic was made generic; it loops through the generic stats groups and
calls the groups' callbacks to figure out how and what to report back to user space.
A new file, en_stats.c, is introduced to hold all the new stat groups logic and implementation.
Static structures (counter descriptors) moved from en_stats.h to en_stats.c, which reduces
the mlx5_core binary footprint. This was originally reported and addressed by Stephen Hemminger in
("mlx5: fix space waste from ethtool descriptions"), which was waived due to this re-design.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 1 Nov 2017 02:08:18 +0000 (11:08 +0900)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
40GbE Intel Wired LAN Driver Updates 2017-10-31
This series contains updates to i40e, i40evf and net/sched.
Arnd Bergmann cleans up the power management code to resolve a build
warning.
Shannon Nelson fixes i40e to only redistribute our vectors when we did
not get the full count that we requested.
Alex reverts a previous commit because it potentially causes a memory leak
when combined with the current page recycling scheme.
Amritha enables configuring cloud filters in i40e using the tc-flower
classifier. The classification function of the filter is to match a
packet to a traffic class. cls_flower is extended to offload classid to
hardware. Hardware traffic classes are identified using classid values
reserved in the range :ffe0 - :ffef.
The cloud filters are added for a VSI and are cleaned up when the VSI is
deleted. The filters that match on L4 ports need enhanced admin queue
functions with big buffer support for extended fields in cloud filter
commands.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenda J. Butler [Tue, 31 Oct 2017 18:29:03 +0000 (14:29 -0400)]
tc-testing: better test case file error reporting
tdc.py reads a bunch of test cases in json files. When a json file
cannot be parsed, tdc just exits and does not run any tests.
This patch will cause tdc to print a message with the file name and
line number, then that file will be ignored and the rest of the tests
will be processed.
Signed-off-by: Brenda J. Butler <bjb@mojatatu.com> Acked-by: Lucas Bates <lucasb@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Brenda J. Butler [Tue, 31 Oct 2017 18:28:35 +0000 (14:28 -0400)]
tc-testing: better check if thing is list
Check if tcase[k] is an instance of a list (is a list or is derived from
list) instead of checking whether its type is exactly list.
This will be useful if the data structures change to be something
that implements list, instead of being an actual list. In that
case, this code will not have to change.
Signed-off-by: Brenda J. Butler <bjb@mojatatu.com> Acked-by: Lucas Bates <lucasb@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Brenda J. Butler [Tue, 31 Oct 2017 18:27:28 +0000 (14:27 -0400)]
tc-testing: split config file
Move the config customization into a site-local file
tdc_config_local.py, so that updates of the tdc test
software do not require hand-editing of the config.
This patch includes a template for the site-local
customization file.
In addition, this makes it easy to revert to a stock
tdc environment for testing the test framework and/or
the core tests.
Also it makes it harder for any custom config to be
submitted back to the kernel tdc.
Signed-off-by: Brenda J. Butler <bjb@mojatatu.com> Acked-by: Lucas Bates <lucasb@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Brenda J. Butler [Tue, 31 Oct 2017 18:25:46 +0000 (14:25 -0400)]
tc-testing: very simple example test cases
As part of documentation, supply some very simple test cases
to illustrate how test cases work. One test case shows
commands in the setup, command, verify and teardown stages.
Other test cases show how to have a working test case that
does not have commands in the setup, verify and/or teardown
stages.
Specifically, the command lists for setup and teardown can
be empty. The verify stage must have a command, but
it can be /bin/true. The regex must have a string (we
recommend a single space), and the count of matches must be
zero if you do not want to use the match feature of verify.
Verify will always look for a return code of success (0)
so we give /bin/true when we do not want to make a check
there.
Also, update the documentation for testcases to be more
specific in the cases of:
- accepting non-success return codes in setup and
teardown stages
- how to write the test when no setup, teardown
and/or verify are desired.
To run the example test cases:
$ sudo -E ./tdc.py -f creating-testcases/example.json -l
1f: (example) simple test to test framework
2f: (example) simple test, no need for verify
3f: (example) simple test, no need for setup or teardown (or verify)
$ sudo -E ./tdc.py -f creating-testcases/example.json
Test 1f: simple test to test framework
Test 2f: simple test, no need for verify
Test 3f: simple test, no need for setup or teardown (or verify)
All test results:
1..3
ok 1 1f simple test to test framework
ok 2 2f simple test, no need for verify
ok 3 3f simple test, no need for setup or teardown (or verify)
$
Signed-off-by: Brenda J. Butler <bjb@mojatatu.com> Acked-by: Lucas Bates <lucasb@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 1 Nov 2017 01:57:24 +0000 (10:57 +0900)]
Merge branch 'l2tp-remove-unused-code'
Guillaume Nault says:
====================
l2tp: remove unused code
Patch #1 removes the ref/deref mechanism that was originally used to
prevent ppp pseudowires from dropping their sockets. This mechanism
was error prone and isn't used anymore.
Patch #2 removes some module specific refcount debugging.
Patches #3 and #4 take care of some dead code.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Guillaume Nault [Tue, 31 Oct 2017 16:36:44 +0000 (17:36 +0100)]
l2tp: remove l2tp specific refcount debugging
With conversion to refcount_t, such manual debugging code doesn't make
sense anymore.
The tunnel part was already dropped by 54652eb12c1b ("l2tp: hold tunnel while looking up sessions in l2tp_netlink").
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
Guillaume Nault [Tue, 31 Oct 2017 16:36:42 +0000 (17:36 +0100)]
l2tp: remove ->ref() and ->deref()
The ->ref() and ->deref() callbacks are unused since PPP stopped using
them in ee40fb2e1eb5 ("l2tp: protect sock pointer of struct pppol2tp_session with RCU").
We can thus remove them from struct l2tp_session and drop the do_ref
parameter of l2tp_session_get*().
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
Kamal Heib [Wed, 23 Aug 2017 11:42:03 +0000 (14:42 +0300)]
net/mlx5e: Introduce stats group API
Currently the mlx5e driver has multiple groups of stats, each group is
used for different purposes and it may depend on hardware capabilities
or not. The problem with the current implementation is that there is no
clear API to create a new group of stats.
This change defines a new API to create a group of stats and simplifies
the way of handling them by defining a new struct "mlx5e_stats_grp" which
has the following three function pointers:
- get_num_stats() - return the number of counters in the group.
- fill_strings() - fill counters strings within the group.
- fill_stats() - fill counters values within the group.
The above function pointers are used within the ethtool callbacks while
calling "ethtool -S" from userspace. This change also switches the SW
group to use the new API.
Signed-off-by: Kamal Heib <kamalh@mellanox.com> Reviewed-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
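The three callbacks can be modelled in userspace C as follows. The signatures are simplified assumptions (the driver's callbacks take driver-private arguments), and the two example groups, their counter names and values, and all demo_ names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_GSTRING_LEN 32 /* length of one counter name in this sketch */

/* Simplified model of a stats group: three callbacks per group. The fill
 * callbacks take the next free index and return the index after their
 * last entry, so groups can be chained. */
struct demo_stats_grp {
    int (*get_num_stats)(void);
    int (*fill_strings)(char (*names)[DEMO_GSTRING_LEN], int idx);
    int (*fill_stats)(uint64_t *data, int idx);
};

/* Made-up software counters group with two counters. */
static int sw_get_num_stats(void) { return 2; }
static int sw_fill_strings(char (*names)[DEMO_GSTRING_LEN], int idx)
{
    snprintf(names[idx++], DEMO_GSTRING_LEN, "rx_packets");
    snprintf(names[idx++], DEMO_GSTRING_LEN, "tx_packets");
    return idx;
}
static int sw_fill_stats(uint64_t *data, int idx)
{
    data[idx++] = 10;
    data[idx++] = 20;
    return idx;
}

/* Made-up link counters group with one counter. */
static int link_get_num_stats(void) { return 1; }
static int link_fill_strings(char (*names)[DEMO_GSTRING_LEN], int idx)
{
    snprintf(names[idx++], DEMO_GSTRING_LEN, "link_down_events");
    return idx;
}
static int link_fill_stats(uint64_t *data, int idx)
{
    data[idx++] = 3;
    return idx;
}

static const struct demo_stats_grp demo_stats_grps[] = {
    { sw_get_num_stats, sw_fill_strings, sw_fill_stats },
    { link_get_num_stats, link_fill_strings, link_fill_stats },
};
#define DEMO_NUM_GRPS \
    ((int)(sizeof(demo_stats_grps) / sizeof(demo_stats_grps[0])))

/* Generic "ethtool -S" style helpers: they only loop over the groups. */
int demo_get_sset_count(void)
{
    int i, total = 0;

    for (i = 0; i < DEMO_NUM_GRPS; i++)
        total += demo_stats_grps[i].get_num_stats();
    return total;
}

void demo_get_strings(char (*names)[DEMO_GSTRING_LEN])
{
    int i, idx = 0;

    assert(names != NULL);
    for (i = 0; i < DEMO_NUM_GRPS; i++)
        idx = demo_stats_grps[i].fill_strings(names, idx);
}

void demo_get_stats(uint64_t *data)
{
    int i, idx = 0;

    assert(data != NULL);
    for (i = 0; i < DEMO_NUM_GRPS; i++)
        idx = demo_stats_grps[i].fill_stats(data, idx);
}
```

Adding a new group in this model means adding one entry to demo_stats_grps; the generic helpers need no change.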
Amritha Nambiar [Fri, 27 Oct 2017 09:36:01 +0000 (02:36 -0700)]
i40e: Enable cloud filters via tc-flower
This patch enables tc-flower based hardware offloads. tc flower
filter provided by the kernel is configured as driver specific
cloud filter. The patch implements functions and admin queue
commands needed to support cloud filters in the driver and
adds cloud filters to configure these tc-flower filters.
The classification function of the filter is to direct matched
packets to a traffic class. The hardware traffic class is set
based on the classid reserved in the range :ffe0 - :ffef.
Match Dst MAC and route to TC0:
prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw\
hw_tc 1
Match Dst IPv4,Dst Port and route to TC1:
prio 2 flower dst_ip 192.168.3.5/32\
ip_proto udp dst_port 25 skip_sw\
hw_tc 2
Match Dst IPv6,Dst Port and route to TC1:
prio 3 flower dst_ip fe8::200:1\
ip_proto udp dst_port 66 skip_sw\
hw_tc 2
Delete tc flower filter:
Example:
Flow Director Sideband is disabled while configuring cloud filters
via tc-flower and for as long as any cloud filter exists.
Unsupported matches when cloud filters are added using enhanced
big buffer cloud filter mode of underlying switch include:
1. source port and source IP
2. Combined MAC address and IP fields.
3. Not specifying L4 port
These filter matches can however be used to redirect traffic to
the main VSI (tc 0) which does not require the enhanced big buffer
cloud filter support.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Signed-off-by: Kiran Patil <kiran.patil@intel.com> Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Signed-off-by: Jingjing Wu <jingjing.wu@intel.com> Acked-by: Shannon Nelson <shannon.nelson@oracle.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Amritha Nambiar [Fri, 27 Oct 2017 09:35:51 +0000 (02:35 -0700)]
i40e: Admin queue definitions for cloud filters
Add new admin queue definitions and extended fields for cloud
filter support. Define big buffer for extended general fields
in Add/Remove Cloud filters command.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Signed-off-by: Kiran Patil <kiran.patil@intel.com> Signed-off-by: Jingjing Wu <jingjing.wu@intel.com> Acked-by: Shannon Nelson <shannon.nelson@oracle.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Amritha Nambiar [Fri, 27 Oct 2017 09:35:45 +0000 (02:35 -0700)]
i40e: Cloud filter mode for set_switch_config command
Add definitions for L4 filters and switch modes based on cloud filters
modes and extend the set switch config command to include the
additional cloud filter mode.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Signed-off-by: Kiran Patil <kiran.patil@intel.com> Acked-by: Shannon Nelson <shannon.nelson@oracle.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Amritha Nambiar [Fri, 27 Oct 2017 09:35:40 +0000 (02:35 -0700)]
i40e: Map TCs with the VSI seids
Add mapping of TCs with the seids of the channel VSIs. TC0
will be mapped to the main VSI seid and all other TCs are
mapped to the seid of the corresponding channel VSI.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Acked-by: Shannon Nelson <shannon.nelson@oracle.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Amritha Nambiar [Fri, 27 Oct 2017 09:35:34 +0000 (02:35 -0700)]
net: sched: Identify hardware traffic classes using classid
This patch offloads the classid to hardware and uses the classid
reserved in the range :ffe0 - :ffef to identify hardware traffic
classes reported via dev->num_tc.
tcf_result structure contains the class ID of the class to which
the packet belongs and is offloaded to hardware via flower filter.
A new helper function is introduced to represent HW traffic
classes 0 through 15 using the reserved classid values :ffe0 - :ffef.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Acked-by: Shannon Nelson <shannon.nelson@oracle.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
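The mapping between the reserved classid range and a hardware traffic class can be sketched in userspace C. The function below is a model with hypothetical demo_ names, not the helper added by the patch; it assumes the traffic class is the minor part of the classid minus 0xffe0, bounded by the number of traffic classes the device reports.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_TC_H_MIN_MASK     0x0000FFFFU /* minor part of a classid */
#define DEMO_TC_H_MIN_PRIORITY 0xFFE0U     /* start of the reserved range */

/* Map a classid whose minor number lies in :ffe0 - :ffef to a hardware
 * traffic class 0..15. Returns -1 if the classid is outside the reserved
 * range or names a class the device does not have. */
int demo_classid_to_hwtc(uint32_t classid, unsigned int num_tc)
{
    uint32_t minor = classid & DEMO_TC_H_MIN_MASK;
    uint32_t hwtc;

    assert(num_tc <= 16);
    if (minor < DEMO_TC_H_MIN_PRIORITY)
        return -1;

    hwtc = minor - DEMO_TC_H_MIN_PRIORITY;
    return hwtc < num_tc ? (int)hwtc : -1;
}
```

For a device with four traffic classes, minor number :ffe3 maps to class 3 while :ffe4 is rejected, and an ordinary classid such as 1:1 never maps to a hardware class.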
I am reverting this as I am fairly certain this can result in a memory leak
when combined with the current page recycling scheme. Specifically we end
up attempting to allocate fewer buffers than we recycled and this results
in us rewinding the next to alloc pointer which leads to leaks when we
overwrite the rx_buffer_info when processing the next frame.
Fixes: 11f29003d637 ("i40e/i40evf: bump tail only in multiples of 8") Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Tue, 10 Oct 2017 21:56:58 +0000 (14:56 -0700)]
i40e: only redistribute MSI-X vectors when needed
Whether or not there are vectors_left, we only need to redistribute
our vectors if we didn't get as many as we requested. With the current
check, the code will try to redistribute even if we did in fact get all
the vectors we requested - this can happen when we have more CPUs than
we do vectors. This restores an earlier check to be sure we only
redistribute if we didn't get the full count we requested.
Fixes: 4ce20abc645f (i40e: fix MSI-X vector redistribution if hw limit is reached) Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Arnd Bergmann [Tue, 10 Oct 2017 08:17:38 +0000 (10:17 +0200)]
i40e: mark PM functions as __maybe_unused
A cleanup of the PM code left an incorrect #ifdef in place, leading
to a harmless build warning:
drivers/net/ethernet/intel/i40e/i40e_main.c:12223:12: error: 'i40e_resume' defined but not used [-Werror=unused-function]
drivers/net/ethernet/intel/i40e/i40e_main.c:12185:12: error: 'i40e_suspend' defined but not used [-Werror=unused-function]
It's easier to use __maybe_unused attributes here, since you
can't pick the wrong one.
Fixes: 0e5d3da40055 ("i40e: use newer generic PM support instead of legacy PM callbacks") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Acked-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jakub Kicinski [Mon, 30 Oct 2017 20:46:47 +0000 (13:46 -0700)]
net: filter: remove unused variable and fix warning
The bpf_getsockopt bpf call sets the ret variable to zero and
never changes it. What's worse, in case CONFIG_INET is
not selected, the variable is completely unused, generating
a warning.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Lawrence Brakmo <brakmo@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2) In mac80211, validate user rate mask before configuring it. From
Johannes Berg.
3) Properly enforce memory limits in fair queueing code, from Toke
Hoiland-Jorgensen.
4) Fix lockdep splat in inet_csk_route_req(), from Eric Dumazet.
5) Fix TSO header allocation and management in mvpp2 driver, from Yan
Markman.
6) Don't take socket lock in BH handler in strparser code, from Tom
Herbert.
7) Don't show sockets from other namespaces in AF_UNIX code, from
Andrei Vagin.
8) Fix double free in error path of tap_open(), from Girish Moodalbail.
9) Fix TX map failure path in igb and ixgbe, from Jean-Philippe Brucker
and Alexander Duyck.
10) Fix DCB mode programming in stmmac driver, from Jose Abreu.
11) Fix err_count handling in various tunnels (ipip, ip6_gre). From Xin
Long.
12) Properly align SKB head before building SKB in tuntap, from Jason
Wang.
13) Avoid matching qdiscs with a zero handle during lookups, from Cong
Wang.
14) Fix various endianness bugs in sctp, from Xin Long.
15) Fix tc filter callback races and add selftests which trigger the
problem, from Cong Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits)
selftests: Introduce a new test case to tc testsuite
selftests: Introduce a new script to generate tc batch file
net_sched: fix call_rcu() race on act_sample module removal
net_sched: add rtnl assertion to tcf_exts_destroy()
net_sched: use tcf_queue_work() in tcindex filter
net_sched: use tcf_queue_work() in rsvp filter
net_sched: use tcf_queue_work() in route filter
net_sched: use tcf_queue_work() in u32 filter
net_sched: use tcf_queue_work() in matchall filter
net_sched: use tcf_queue_work() in fw filter
net_sched: use tcf_queue_work() in flower filter
net_sched: use tcf_queue_work() in flow filter
net_sched: use tcf_queue_work() in cgroup filter
net_sched: use tcf_queue_work() in bpf filter
net_sched: use tcf_queue_work() in basic filter
net_sched: introduce a workqueue for RCU callbacks of tc filter
sctp: fix some type cast warnings introduced since very beginning
sctp: fix a type cast warnings that causes a_rwnd gets the wrong value
sctp: fix some type cast warnings introduced by transport rhashtable
sctp: fix some type cast warnings introduced by stream reconf
...
3. Two more bugs found by Chris:
https://patchwork.ozlabs.org/patch/826696/
https://patchwork.ozlabs.org/patch/826695/
Usually RCU callbacks are simple. However, for TC filters and actions
they are complex, because at least TC actions could be destroyed
together with the TC filter in one callback. RCU callbacks are also
invoked in BH context, and without locking they can run in parallel. All of
these contribute to the cause of these nasty bugs.
Alternatively, we could also:
a) Introduce a spinlock to serialize these RCU callbacks. But as I
said in commit 1697c4bb5245 ("net_sched: carefully handle
tcf_block_put()"), it is very hard to do because of tcf_chain_dump().
Potentially we need to do a lot of work to make it possible (if not
impossible).
b) Just get rid of these RCU callbacks, because they are not
necessary at all, callers of these call_rcu() are all on slow paths
and holding RTNL lock, so blocking is allowed in their contexts.
However, David and Eric dislike adding synchronize_rcu() here.
As suggested by Paul, we could defer the work to a workqueue and
regain permission to hold RTNL without any performance
impact. However, in tcf_block_put() we could deadlock when
flushing the workqueue while holding the RTNL lock. The trick here is to
defer the work itself to the workqueue and queue it after all
other works, so that we keep the same ordering and avoid any
use-after-free. Please see the first patch for details.
Patch 1 introduces the infrastructure, patch 2~12 move each
tc filter to the new tc filter workqueue, patch 13 adds
an assertion to catch potential bugs like this, patch 14
closes another rcu callback race, patch 15 and patch 16 add
new test cases.
====================
Reported-by: Chris Mi <chrism@mellanox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Chris Mi [Fri, 27 Oct 2017 01:24:43 +0000 (18:24 -0700)]
selftests: Introduce a new test case to tc testsuite
In this patchset, we fixed a tc bug. This patch adds the test case
that reproduces the bug. To run this test case, the user should specify
an existing NIC device:
# sudo ./tdc.py -d enp4s0f0
This test case belongs to the category "flower". If the user doesn't specify
a NIC device, the test cases belonging to "flower" will not be run.
In this test case, we create 1M filters and all filters share the same
action. When destroying all filters, the kernel should not panic. It takes
about 18s to run.
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Lucas Bates <lucasb@mojatatu.com> Signed-off-by: Chris Mi <chrism@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
positional arguments:
  device                device name
  file                  batch file name
optional arguments:
  -h, --help            show this help message and exit
  -n NUMBER, --number NUMBER
                        how many lines in batch file
  -o, --skip_sw         skip_sw (offload), by default skip_hw
  -s, --share_action    all filters share the same action
  -p, --prio            all filters have different prio
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Lucas Bates <lucasb@mojatatu.com> Signed-off-by: Chris Mi <chrism@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:41 +0000 (18:24 -0700)]
net_sched: fix call_rcu() race on act_sample module removal
Similar to commit c78e1746d3ad
("net: sched: fix call_rcu() race on classifier module unloads"),
we need to wait for the in-flight RCU callback tcf_sample_cleanup_rcu().
Cc: Yotam Gigi <yotamg@mellanox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
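The race described in the act_sample commit above can be modelled in userspace. This is only a sketch under assumptions: demo_call_rcu() and demo_rcu_barrier() are hypothetical stand-ins for the kernel's call_rcu() and rcu_barrier(), and a flag stands in for the module text being unloaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_PENDING 8

typedef void (*demo_cb_t)(void *arg);

struct demo_pending {
	demo_cb_t fn;
	void *arg;
};

static struct demo_pending demo_pending[DEMO_MAX_PENDING];
static size_t demo_npending;
static bool demo_module_text_present = true;
static int demo_cleaned;

/* Model of call_rcu(): remember the callback so it can run later. */
static void demo_call_rcu(demo_cb_t fn, void *arg)
{
	assert(demo_npending < DEMO_MAX_PENDING);
	demo_pending[demo_npending].fn = fn;
	demo_pending[demo_npending].arg = arg;
	demo_npending++;
}

/* Model of rcu_barrier(): return only after every queued callback ran. */
static void demo_rcu_barrier(void)
{
	for (size_t i = 0; i < demo_npending; i++)
		demo_pending[i].fn(demo_pending[i].arg);
	demo_npending = 0;
}

/* A callback that lives in the module, so it must not run after unload. */
static void demo_cleanup_cb(void *arg)
{
	(void)arg;
	assert(demo_module_text_present);
	demo_cleaned++;
}

static void demo_module_exit(void)
{
	demo_rcu_barrier();               /* drain in-flight callbacks first */
	demo_module_text_present = false; /* only now may the code go away */
}
```

If demo_module_exit() cleared the flag before draining the queue, demo_cleanup_cb() would trip its assertion, which mirrors a callback running after its code has been removed.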
Cong Wang [Fri, 27 Oct 2017 01:24:40 +0000 (18:24 -0700)]
net_sched: add rtnl assertion to tcf_exts_destroy()
After the previous patches, it is now safe to claim that
tcf_exts_destroy() is always called with the RTNL lock held.
Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
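The idea of asserting a locking precondition at the top of a destroy function, as in the commit above, can be sketched in userspace as follows. All demo_* names are hypothetical; the sketch assumes the check counts violations instead of aborting, and a plain flag stands in for the RTNL mutex.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool demo_rtnl_held;
static int demo_assert_failures;

static void demo_rtnl_lock(void)
{
	assert(!demo_rtnl_held);
	demo_rtnl_held = true;
}

static void demo_rtnl_unlock(void)
{
	assert(demo_rtnl_held);
	demo_rtnl_held = false;
}

/* Count a violation when the caller does not hold the lock. */
#define DEMO_ASSERT_RTNL()                      \
	do {                                    \
		if (!demo_rtnl_held)            \
			demo_assert_failures++; \
	} while (0)

struct demo_exts {
	int nr_actions;
};

/* Destroy path that documents and checks its locking requirement. */
static void demo_exts_destroy(struct demo_exts *exts)
{
	assert(exts != NULL);
	DEMO_ASSERT_RTNL();
	exts->nr_actions = 0;
}
```

A caller that forgets the lock is caught at the first call rather than later as a hard-to-trace race.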
Cong Wang [Fri, 27 Oct 2017 01:24:39 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in tcindex filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:38 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in rsvp filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:37 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in route filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:36 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in u32 filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:35 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in matchall filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:34 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in fw filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:33 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in flower filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:32 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in flow filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:31 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in cgroup filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:30 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in bpf filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:29 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in basic filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:28 +0000 (18:24 -0700)]
net_sched: introduce a workqueue for RCU callbacks of tc filter
This patch introduces a dedicated workqueue for tc filters
so that each tc filter's RCU callback can defer its
action destroy work to this workqueue. The helper
tcf_queue_work() is introduced for them to use.
Because we hold the RTNL lock when calling tcf_block_put(), we
cannot simply flush works inside it. Therefore we have to
defer it again to this workqueue and make sure all in-flight RCU
callbacks have already queued their work before this one; in
other words, we ensure this is the last one to execute, to
prevent any use-after-free.
On the other hand, this makes tcf_block_put() ugly and
harder to understand. Since David and Eric strongly dislike
adding synchronize_rcu(), this is probably the only
solution that could make everyone happy.
Please also see the code comments below.
Reported-by: Chris Mi <chrism@mellanox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
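The ordering argument in the commit above can be sketched with a toy FIFO queue. This is a userspace model under assumptions, not the kernel implementation: demo_queue_work() is a hypothetical stand-in for tcf_queue_work(), and the model assumes the workqueue runs items strictly in the order they were queued.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_WORK 16

struct demo_work {
	void (*fn)(struct demo_work *w);
	int id;
};

static struct demo_work *demo_queue[DEMO_MAX_WORK];
static size_t demo_head, demo_tail;

static int demo_order[DEMO_MAX_WORK]; /* ids in execution order */
static size_t demo_nrun;
static int demo_block_freed;

/* Append a work item to the tail of a strictly FIFO queue. */
static void demo_queue_work(struct demo_work *w)
{
	assert(demo_tail < DEMO_MAX_WORK);
	demo_queue[demo_tail++] = w;
}

/* Per-filter destroy work: touches the block, so it must run first. */
static void demo_filter_destroy(struct demo_work *w)
{
	assert(!demo_block_freed); /* otherwise a use-after-free */
	demo_order[demo_nrun++] = w->id;
}

/* Final work queued on block put: frees the block itself. */
static void demo_block_put_final(struct demo_work *w)
{
	demo_order[demo_nrun++] = w->id;
	demo_block_freed = 1;
}

/* Worker loop: runs one item at a time, in queue order. */
static void demo_run_worker(void)
{
	while (demo_head < demo_tail) {
		struct demo_work *w = demo_queue[demo_head++];

		w->fn(w);
	}
}
```

Because the final item is queued after every filter's work and the queue is FIFO, the block is freed only once all earlier items have finished, without the queuing side ever having to flush while holding a lock.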
David S. Miller [Sun, 29 Oct 2017 09:39:58 +0000 (18:39 +0900)]
Merge branch 'ipvlan-private-vepa'
Mahesh Bandewar says:
====================
add 'private' and 'vepa' attributes to ipvlan modes
IPvlan has always been operating in bridge-mode for its supported modes i.e.
if the packets are destined to the adjacent neighbor dev, then IPvlan driver
will switch the packet internally without needing the packets to hit the
wire or get routed. However, there are situations where this bridge-mode is
not needed, e.g. two private processes running inside two namespaces, each
of which has one IPvlan slave but which share the master. These
processes should reach the outside world through the master device but at
the same time the bridge function should not work. Currently that's not
possible, hence the private attribute for the selected mode comes into play.
VEPA or 802.1Qbg on the other hand has limited appeal with IPvlan since IPvlan
uses the mac-address of the lower device. So packets that are destined to
the adjacent neighbor slave-dev will have same src and dest mac. When these
packets reach the external switch/router, they will send you the redirect
message which the host will have to deal with. Having said that this attribute
will have appeal in debugging as IPvlan will not switch / short-circuit
packets internally. e.g. using VEPA mode with lower-device in loopback mode
will avoid some complicated set-ups that use non-local-bind with some route
jugglery.
This patch-set implements these attributes for the existing modes that
IPvlan has. Please see individual patches for their detailed implementation.
A subsequent ip-utils patch is needed and will be sent soon.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mahesh Bandewar [Thu, 26 Oct 2017 22:09:25 +0000 (15:09 -0700)]
ipvlan: implement VEPA mode
This is very similar to the Macvlan VEPA mode; however, there are some
differences. IPvlan uses the mac-address of the lower device, so the VEPA
mode has implications of ICMP-redirects for packets destined for its
immediate neighbors sharing the same master, since the packets will have the same
source and dest mac. The external switch/router will send a redirect msg.
Having said that, this will be a useful tool in terms of debugging,
since IPvlan will not switch packets within its slaves and will rely completely
on the external entity as intended in 802.1Qbg.
Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mahesh Bandewar [Thu, 26 Oct 2017 22:09:21 +0000 (15:09 -0700)]
ipvlan: introduce 'private' attribute for all existing modes.
IPvlan has always operated in bridge mode. However, there are scenarios
where each slave should be able to talk through the master device but
not necessarily across each other. Think of an environment where each
namespace is a private and independent customer. In this scenario
the machine hosting these namespaces neither wants to tell them who
their neighbor is, nor do the individual namespaces care to talk to a neighbor
on a short-circuited network path.
This patch implements a mode that is very similar to the 'private' mode
in macvlan, where individual slaves can send and receive traffic through
the master device but cannot talk among themselves.
Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
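The forwarding behaviour of the two ipvlan attributes described above (bridge by default, 'private', and 'vepa') can be summarised as a small decision function. This is a sketch only: the flag and verdict names are hypothetical and not taken from the driver, dropping sibling traffic in private mode is an assumption, and the case where both flags are set is not modelled beyond VEPA taking precedence.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this sketch. */
#define DEMO_F_PRIVATE 0x1u
#define DEMO_F_VEPA    0x2u
#define DEMO_F_MASK    (DEMO_F_PRIVATE | DEMO_F_VEPA)

enum demo_verdict {
	DEMO_SWITCH_LOCALLY, /* bridge behaviour: hand to the sibling slave */
	DEMO_SEND_TO_WIRE,   /* leave via the master; external switch decides */
	DEMO_DROP,           /* slaves may not talk to each other */
};

/*
 * Decide what happens to a packet sent by one slave.
 * dst_is_sibling: destination is another slave on the same master.
 */
static enum demo_verdict demo_xmit_verdict(unsigned int flags,
					   bool dst_is_sibling)
{
	assert((flags & ~DEMO_F_MASK) == 0);

	if (flags & DEMO_F_VEPA)
		return DEMO_SEND_TO_WIRE; /* never short-circuit internally */
	if (!dst_is_sibling)
		return DEMO_SEND_TO_WIRE; /* normal path to the outside */
	if (flags & DEMO_F_PRIVATE)
		return DEMO_DROP;         /* assumed: sibling traffic dropped */
	return DEMO_SWITCH_LOCALLY;       /* default bridge mode */
}
```

For example, demo_xmit_verdict(DEMO_F_PRIVATE, true) yields DEMO_DROP, while the same destination with no flags set is switched locally.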
Quentin Monnet [Thu, 26 Oct 2017 21:16:05 +0000 (14:16 -0700)]
tools: bpftool: add bash completion for bpftool
Add a completion file for bash. The completion function runs bpftool
when needed, making it smart enough to help users complete ids or tags
for eBPF programs and maps currently on the system.
Update Makefile to install completion file to
/usr/share/bash-completion/completions when running `make install`.
Emacs file mode and (at the end) Vim modeline have been added, to keep
the style in use for most existing bash completion files. In this, it
differs from tools/perf/perf-completion.sh, which seems to be the only
other completion file in the kernel source repository. The same applies
to the indent style: 4-space indents, as in other completion files.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 29 Oct 2017 09:03:25 +0000 (18:03 +0900)]
Merge branch 'sctp-endianness-fixes'
Xin Long says:
====================
sctp: a bunch of fixes for some sparse warnings
As Eric noticed, when running 'make C=2 M=net/sctp/', plenty of
warnings or errors reported by sparse appear. They are all problems
with endianness and type casts.
Most of them are just warnings that cause no real issues,
while some might be bugs.
This patchset fixes them with four patches, grouped according to
how they were introduced.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
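The class of problem sparse flags in the sctp series above is a value in network byte order being mixed with host-order values. In the kernel, sparse tracks this through annotated types such as __be16. The sketch below shows the same discipline in plain userspace C, with a hypothetical demo_hdr structure and a naming convention (the _be suffix) standing in for the annotation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Hypothetical wire header; the field is big-endian on the wire. */
struct demo_hdr {
	uint16_t stream_be; /* network byte order, never used directly */
};

/* Store a host-order value: convert exactly once, at the boundary. */
static void demo_hdr_set_stream(struct demo_hdr *h, uint16_t stream_host)
{
	assert(h != NULL);
	h->stream_be = htons(stream_host);
}

/* Load a host-order value: convert back before comparing or printing. */
static uint16_t demo_hdr_get_stream(const struct demo_hdr *h)
{
	assert(h != NULL);
	return ntohs(h->stream_be);
}
```

Comparing h->stream_be against a host-order constant without ntohs() would happen to work on a big-endian machine and silently fail on a little-endian one, which is why such mismatches are worth fixing even when they currently only produce warnings.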
Xin Long [Sat, 28 Oct 2017 11:43:57 +0000 (19:43 +0800)]
sctp: fix some type cast warnings introduced since very beginning
These warnings were found by running 'make C=2 M=net/sctp/'.
They have been there since the very beginning.
Note that after this patch, there is still one warning left in
sctp_outq_flush():
sctp_chunk_fail(chunk, SCTP_ERROR_INV_STRM)
Since it has been moved to sctp_stream_outq_migrate on net-next,
to avoid extra work when merging net-next into net, I will post
the fix for it after the merge is done.
Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>