5 years agonet: dsa: sja1105: Make room for P/Q/R/S FDB operations
Vladimir Oltean [Sun, 2 Jun 2019 21:11:57 +0000 (00:11 +0300)] 
net: dsa: sja1105: Make room for P/Q/R/S FDB operations

The DSA callbacks were written with the E/T (first generation) in mind,
which is quite different.

For P/Q/R/S completely new implementations need to be provided, which
are held as function pointers in the priv->info structure.  We are
taking a slightly roundabout way for this (a function from
sja1105_main.c reads a structure defined in sja1105_spi.c that
points to a function defined in sja1105_main.c), but it is what it is.

The FDB dump callback works for both families, hence no function pointer
for that.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: dsa: sja1105: Plug in support for TCAM searches via the dynamic interface
Vladimir Oltean [Sun, 2 Jun 2019 21:11:56 +0000 (00:11 +0300)] 
net: dsa: sja1105: Plug in support for TCAM searches via the dynamic interface

Only a single dynamic configuration table of the SJA1105 P/Q/R/S
supports this operation: the FDB.

To keep the existing structure in place (sja1105_dynamic_config_read and
sja1105_dynamic_config_write) and not introduce any new function, a
convention is made for sja1105_dynamic_config_read that a negative index
argument denotes a search for the entry provided as argument.
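
A minimal sketch of that convention from the caller's side (the block
index, entry fields and the way the matched index comes back are
illustrative assumptions, not quoted from the driver):

  /* index >= 0: read the entry at that static table index      */
  /* index < 0:  search the table for the entry passed by value */
  l2_lookup.macaddr = mac;
  l2_lookup.vlanid  = vid;
  rc = sja1105_dynamic_config_read(priv, BLK_IDX_L2_LOOKUP,
                                   -1 /* search */, &l2_lookup);
  if (rc == 0)
          index = l2_lookup.index;  /* entry found by the TCAM search */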

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: dsa: sja1105: Add missing L2 Forwarding Table definitions for P/Q/R/S
Vladimir Oltean [Sun, 2 Jun 2019 21:11:55 +0000 (00:11 +0300)] 
net: dsa: sja1105: Add missing L2 Forwarding Table definitions for P/Q/R/S

This appends to the L2 Forwarding and L2 Forwarding Parameters tables
(originally added for first-generation switches) the bits that are new
in the second generation.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: dsa: sja1105: Fix bit offsets of index field from L2 lookup entries
Vladimir Oltean [Sun, 2 Jun 2019 21:11:54 +0000 (00:11 +0300)] 
net: dsa: sja1105: Fix bit offsets of index field from L2 lookup entries

This was inadvertently copied from the SJA1105 E/T structure and not
tested.  Cross-checking with the P/Q/R/S documentation (UM11040) makes
it immediately obvious what the correct bit offsets for this field are.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: dsa: sja1105: Shim declaration of struct sja1105_dyn_cmd
Vladimir Oltean [Sun, 2 Jun 2019 21:11:53 +0000 (00:11 +0300)] 
net: dsa: sja1105: Shim declaration of struct sja1105_dyn_cmd

This structure is merely an implementation detail and should be hidden
from the sja1105_dynamic_config.h header, which provides to the rest of
the driver an abstract access to the dynamic configuration interface of
the switch.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'r8169-make-firmware-handling-code-ready-to-be-factored-out'
David S. Miller [Mon, 3 Jun 2019 22:38:43 +0000 (15:38 -0700)] 
Merge branch 'r8169-make-firmware-handling-code-ready-to-be-factored-out'

Heiner Kallweit says:

====================
r8169: make firmware handling code ready to be factored out

This series contains the final steps to make firmware handling code
ready to be factored out into a separate source code file.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: add rtl_fw_request_firmware and rtl_fw_release_firmware
Heiner Kallweit [Mon, 3 Jun 2019 19:26:31 +0000 (21:26 +0200)] 
r8169: add rtl_fw_request_firmware and rtl_fw_release_firmware

Add rtl_fw_request_firmware and rtl_fw_release_firmware which will be
part of the API when factoring out the firmware handling code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: make rtl_fw_format_ok and rtl_fw_data_ok more independent
Heiner Kallweit [Mon, 3 Jun 2019 19:25:43 +0000 (21:25 +0200)] 
r8169: make rtl_fw_format_ok and rtl_fw_data_ok more independent

In preparation for factoring out the firmware handling code, avoid any
usage of struct rtl8169_private internals. As part of this we can inline
rtl_check_firmware.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: simplify rtl_fw_write_firmware
Heiner Kallweit [Mon, 3 Jun 2019 19:24:38 +0000 (21:24 +0200)] 
r8169: simplify rtl_fw_write_firmware

Similar to rtl_fw_data_ok(), we can simplify the code by moving the
index increment into the for loop header.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: add enum rtl_fw_opcode
Heiner Kallweit [Mon, 3 Jun 2019 19:23:43 +0000 (21:23 +0200)] 
r8169: add enum rtl_fw_opcode

Replace the firmware opcode defines with a proper enum. The BUG()
in rtl_fw_write_firmware() can be removed because the call to
rtl_fw_data_ok() ensures all opcodes are valid.
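
The shape of the change, sketched with hypothetical opcode names (the
driver's actual opcode list and values are not reproduced here):

  /* before: bare preprocessor defines */
  #define FW_OP_PHY_READ   0x0
  #define FW_OP_PHY_WRITE  0x8

  /* after: a typed enum; rtl_fw_data_ok() can then reject any value
   * outside the enum up front, which is what makes the BUG() in
   * rtl_fw_write_firmware() unnecessary.
   */
  enum rtl_fw_opcode {
          FW_OP_PHY_READ  = 0x0,
          FW_OP_PHY_WRITE = 0x8,
  };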

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'hns3-next'
David S. Miller [Mon, 3 Jun 2019 22:32:50 +0000 (15:32 -0700)] 
Merge branch 'hns3-next'

Huazhong Tan says:

====================
code optimizations & bugfixes for HNS3 driver

This patch-set includes code optimizations and bugfixes for the HNS3
ethernet controller driver.

[patch 1/10] removes the redundant core reset type

[patch 2/10 - 3/10] fixes two VLAN related issues

[patch 4/10] fixes a TM issue

[patch 5/10 - 10/10] includes some patches related to RAS & MSI-X error

Change log:
V1->V2: removes two patches which also need to change HNS's infiniband
driver; they will be upstreamed later together with the infiniband
changes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: delay and separate enabling of NIC and ROCE HW errors
Weihang Li [Mon, 3 Jun 2019 02:09:22 +0000 (10:09 +0800)] 
net: hns3: delay and separate enabling of NIC and ROCE HW errors

All RAS and MSI-X errors should be enabled only in the final stage of
HNS3 initialization, which means they should be enabled in
hclge_init_xxx_client_instance instead of hclge_ae_dev(). For MSI-X in
particular, if it is enabled before the vector0 IRQ is opened, there is
a chance that an MSI-X error will cause initialization of the NIC
client instance to fail. So this patch delays enabling of HW errors.
In addition, enabling of ROCE RAS is separated from NIC RAS, because it
is not reasonable to enable ROCE RAS when there is no ROCE driver.

Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: add opcode about query and clear RAS & MSI-X to special opcode
Weihang Li [Mon, 3 Jun 2019 02:09:21 +0000 (10:09 +0800)] 
net: hns3: add opcode about query and clear RAS & MSI-X to special opcode

There are four commands used to query and clear RAS and MSI-X interrupt
status. They should be contained in the array of special opcodes,
because these commands have several descriptors and the return value
must be judged in the first descriptor rather than in the last one, as
is done for other opcodes. In addition, we shouldn't set the NEXT_FLAG
of the first descriptor.

This patch fixes the above issues.

Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: remove setting bit of reset_requests when handling mac tunnel interrupts
Weihang Li [Mon, 3 Jun 2019 02:09:20 +0000 (10:09 +0800)] 
net: hns3: remove setting bit of reset_requests when handling mac tunnel interrupts

We shouldn't set the HNAE3_NONE_RESET bit of the variable that
represents a reset request while handling MSI-X errors, or it may cause
issues when a reset is triggered.

Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: add handling of two bits in MAC tunnel interrupts
Weihang Li [Mon, 3 Jun 2019 02:09:19 +0000 (10:09 +0800)] 
net: hns3: add handling of two bits in MAC tunnel interrupts

LINK_UP and LINK_DOWN are two bits of the MAC tunnel interrupts, but the
HNS3 driver previously didn't handle them. If they were enabled, the
value of these two bits would change on link down and link up, causing
the HNS3 driver to keep receiving IRQs it could not handle.

This patch adds handling of these two interrupt bits: we record and
clear them the same way as the other MAC tunnel interrupts.

Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: set ops to null when unregister ad_dev
Weihang Li [Mon, 3 Jun 2019 02:09:18 +0000 (10:09 +0800)] 
net: hns3: set ops to null when unregister ad_dev

The hclge/hclgevf and hns3 modules can be unloaded independently. When
hclge/hclgevf is unloaded first, the ops of ae_dev should be set to
NULL, otherwise it will cause a use-after-free problem.

Fixes: 38caee9d3ee8 ("net: hns3: Add support of the HNAE3 framework")
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: add a check to pointer in error_detected and slot_reset
Weihang Li [Mon, 3 Jun 2019 02:09:17 +0000 (10:09 +0800)] 
net: hns3: add a check to pointer in error_detected and slot_reset

If we add a VF without loading hclgevf.ko and a RAS error then occurs,
PCIe AER will call error_detected and slot_reset of all functions, and
will hit a NULL pointer when checking ae_dev->ops->handle_hw_ras_error.
This causes a call trace and failures in handling follow-up RAS errors.

This patch checks ae_dev and ae_dev->ops first to solve the above
issues.

Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: set the port shaper according to MAC speed
Yunsheng Lin [Mon, 3 Jun 2019 02:09:16 +0000 (10:09 +0800)] 
net: hns3: set the port shaper according to MAC speed

This patch sets the port shaper according to the MAC speed as
suggested by hardware user manual.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: fix VLAN filter restore issue after reset
Jian Shen [Mon, 3 Jun 2019 02:09:15 +0000 (10:09 +0800)] 
net: hns3: fix VLAN filter restore issue after reset

In the original code, the driver only restores the VLAN filter entries
of the PF after reset; the VLAN entries of the VF are lost in this
case.

This patch fixes it by recording the VLAN IDs for each function when a
VLAN is added, and restoring the VLAN IDs after reset.

Fixes: 681ec3999b3d ("net: hns3: fix for vlan table lost problem when resetting")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: don't configure new VLAN ID into VF VLAN table when it's full
Jian Shen [Mon, 3 Jun 2019 02:09:14 +0000 (10:09 +0800)] 
net: hns3: don't configure new VLAN ID into VF VLAN table when it's full

The VF VLAN table supports no more than 256 VLANs. When the user adds
too many VLANs, the VF VLAN table becomes full and firmware closes the
VF VLAN table for the function. Once the VF VLAN table is full and the
user keeps adding new VLANs, it's unnecessary to configure the VF VLAN
table, because the operation will always fail and print a warning
message. The worst case is adding 4K VLANs and then doing a reset: it
takes a long time to restore these VLANs, which may cause the VF reset
to fail with a timeout.

Fixes: 6c251711b37f ("net: hns3: Disable vf vlan filter when vf vlan table is full")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: remove redundant core reset
Huazhong Tan [Mon, 3 Jun 2019 02:09:13 +0000 (10:09 +0800)] 
net: hns3: remove redundant core reset

Since the core reset is similar to the global reset, this patch
removes it and replaces it with the global reset.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: fix use-after-free in kfree_skb_list
Eric Dumazet [Sun, 2 Jun 2019 18:24:18 +0000 (11:24 -0700)] 
net: fix use-after-free in kfree_skb_list

syzbot reported a nasty use-after-free [1].

Let's remove the frag_list field from structs ip_fraglist_iter
and ip6_fraglist_iter. It does not seem to be needed anyway.
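
A sketch of the affected iterator, shown only to illustrate what the
fix removes (field layout from memory, not copied from the patch):

  struct ip_fraglist_iter {
          struct sk_buff  *frag_list;     /* removed by this fix */
          struct sk_buff  *frag;
          struct iphdr    *iph;
          int             offset;
          unsigned int    hlen;
  };

ip6_fraglist_iter is trimmed the same way.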

[1] :
BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947

CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 kasan_report+0x12/0x20 mm/kasan/common.c:614
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
 kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
 ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882
 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
 dst_output include/net/dst.h:433 [inline]
 ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
 rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:671
 ___sys_sendmsg+0x803/0x920 net/socket.c:2292
 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
 __do_sys_sendmsg net/socket.c:2339 [inline]
 __se_sys_sendmsg net/socket.c:2337 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x44add9
Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9
RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c
R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf

Allocated by task 8947:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_kmalloc mm/kasan/common.c:489 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
 slab_post_alloc_hook mm/slab.h:437 [inline]
 slab_alloc_node mm/slab.c:3269 [inline]
 kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579
 __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199
 alloc_skb include/linux/skbuff.h:1058 [inline]
 __ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519
 ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688
 rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940
 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:671
 ___sys_sendmsg+0x803/0x920 net/socket.c:2292
 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
 __do_sys_sendmsg net/socket.c:2339 [inline]
 __se_sys_sendmsg net/socket.c:2337 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 8947:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
 __cache_free mm/slab.c:3432 [inline]
 kmem_cache_free+0x86/0x260 mm/slab.c:3698
 kfree_skbmem net/core/skbuff.c:625 [inline]
 kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619
 __kfree_skb net/core/skbuff.c:682 [inline]
 kfree_skb net/core/skbuff.c:699 [inline]
 kfree_skb+0xf0/0x390 net/core/skbuff.c:693
 kfree_skb_list+0x44/0x60 net/core/skbuff.c:708
 __dev_xmit_skb net/core/dev.c:3551 [inline]
 __dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850
 dev_queue_xmit+0x18/0x20 net/core/dev.c:3914
 neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532
 neigh_output include/net/neighbour.h:511 [inline]
 ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120
 ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863
 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
 dst_output include/net/dst.h:433 [inline]
 ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
 rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:671
 ___sys_sendmsg+0x803/0x920 net/socket.c:2292
 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
 __do_sys_sendmsg net/socket.c:2339 [inline]
 __se_sys_sendmsg net/socket.c:2337 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff888085a3cbc0
 which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 0 bytes inside of
 224-byte region [ffff888085a3cbc0, ffff888085a3cca0)
The buggy address belongs to the page:
page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0
raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
>ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                           ^
 ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc

Fixes: 0feca6190f88 ("net: ipv6: add skbuff fraglist splitter")
Fixes: c8b17be0b7a4 ("net: ipv4: add skbuff fraglist splitter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: use paged versions of phylib MDIO access functions
Heiner Kallweit [Sun, 2 Jun 2019 08:53:49 +0000 (10:53 +0200)] 
r8169: use paged versions of phylib MDIO access functions

Use paged versions of phylib MDIO access functions to simplify
the code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoqed: Fix build error without CONFIG_DEVLINK
YueHaibing [Sat, 1 Jun 2019 08:06:05 +0000 (16:06 +0800)] 
qed: Fix build error without CONFIG_DEVLINK

Fix a gcc build error when CONFIG_DEVLINK is not set:

drivers/net/ethernet/qlogic/qed/qed_main.o: In function `qed_remove':
qed_main.c:(.text+0x1eb4): undefined reference to `devlink_unregister'

Select DEVLINK to fix this.

Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: 24e04879abdd ("qed: Add qed devlink parameters table")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: use this_cpu_read(*X) instead of *this_cpu_ptr(X)
Eric Dumazet [Sat, 1 Jun 2019 02:17:33 +0000 (19:17 -0700)] 
tcp: use this_cpu_read(*X) instead of *this_cpu_ptr(X)

this_cpu_read(*X) is slightly faster than *this_cpu_ptr(X)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoipv4: icmp: use this_cpu_read() in icmp_sk()
Eric Dumazet [Sat, 1 Jun 2019 02:09:02 +0000 (19:09 -0700)] 
ipv4: icmp: use this_cpu_read() in icmp_sk()

this_cpu_read(*X) is faster than *this_cpu_ptr(X)
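
Roughly, the resulting accessor looks like this (a sketch, assuming the
per-cpu icmp socket layout at the time of the commit):

  static struct sock *icmp_sk(struct net *net)
  {
          /* before: return *this_cpu_ptr(net->ipv4.icmp_sk); */
          return this_cpu_read(*net->ipv4.icmp_sk);
  }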

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoipv6: use this_cpu_read() in rt6_get_pcpu_route()
Eric Dumazet [Sat, 1 Jun 2019 01:11:25 +0000 (18:11 -0700)] 
ipv6: use this_cpu_read() in rt6_get_pcpu_route()

this_cpu_read(*X) is faster than *this_cpu_ptr(X)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'Add-MT7629-ethernet-support'
David S. Miller [Mon, 3 Jun 2019 22:00:00 +0000 (15:00 -0700)] 
Merge branch 'Add-MT7629-ethernet-support'

Sean Wang says:

====================
Add MT7629 ethernet support

MT7629 includes two sets of SGMIIs used for an external switch or PHY, an
embedded switch (ESW) via GDM1, and a GePHY via GMAC2, so several patches
in the series make the code base common with the old SoCs.

Patches 1, 3 and 6 add extensions for SGMII so the hardware can be
configured for 1G, 2.5G and AN to fit the capability of the target PHY.
Patch 6 can be seen as an example of how to use these configurations so
the underlying PHY speed matches the link speed of the target PHY.

Patch 4 automatically configures the hardware path from GMACx to the
target PHY, using the devicetree topology to determine the proper value
for the corresponding MUX.

Patches 2 and 5 update MT7629 support, including the dt-binding document
and its driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoarm64: dts: mt7622: Enlarge the SGMII register range
Sean Wang [Sat, 1 Jun 2019 00:03:15 +0000 (08:03 +0800)] 
arm64: dts: mt7622: Enlarge the SGMII register range

Enlarge the SGMII register range and use 2.5G force mode by default.

Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: mediatek: Add MT7629 ethernet support
Sean Wang [Sat, 1 Jun 2019 00:03:14 +0000 (08:03 +0800)] 
net: ethernet: mediatek: Add MT7629 ethernet support

Add ethernet support to MT7629 SoC

Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: mediatek: Integrate hardware path from GMAC to PHY variants
Sean Wang [Sat, 1 Jun 2019 00:03:13 +0000 (08:03 +0800)] 
net: ethernet: mediatek: Integrate hardware path from GMAC to PHY variants

All hardware path routes on the various SoCs are now managed in a common
function, mtk_setup_hw_path, which determines the path between the GMAC
and the target PHY or switch at runtime, based on both the applied
devicetree and the capability of the target SoC.

Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: mediatek: Extend SGMII related functions
Sean Wang [Sat, 1 Jun 2019 00:03:12 +0000 (08:03 +0800)] 
net: ethernet: mediatek: Extend SGMII related functions

Add the SGMII related logic in a separate file, and also provide options
for forcing 1G, 2.5G or AN mode for the target PHY, which can be
determined from the SGMII node in DTS.

Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodt-bindings: net: mediatek: Add support for MediaTek MT7629 SoC
Sean Wang [Sat, 1 Jun 2019 00:03:11 +0000 (08:03 +0800)] 
dt-bindings: net: mediatek: Add support for MediaTek MT7629 SoC

Add binding document for the ethernet on MT7629 SoC.

Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodt-bindings: clock: mediatek: Add an extra required property to sgmiisys
Sean Wang [Sat, 1 Jun 2019 00:03:10 +0000 (08:03 +0800)] 
dt-bindings: clock: mediatek: Add an extra required property to sgmiisys

Add an extra required property, "mediatek,physpeed", to sgmiisys to
determine the link speed, matching the capability of the target PHY.

Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoipv6: icmp: use this_cpu_read() in icmpv6_sk()
Eric Dumazet [Fri, 31 May 2019 22:27:00 +0000 (15:27 -0700)] 
ipv6: icmp: use this_cpu_read() in icmpv6_sk()

In general, this_cpu_read(*X) is faster than *this_cpu_ptr(X)

Also remove the inline attribute, which is totally useless.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoflow_offload: include linux/kernel.h from flow_offload.h
Edward Cree [Fri, 31 May 2019 21:47:21 +0000 (22:47 +0100)] 
flow_offload: include linux/kernel.h from flow_offload.h

flow_stats_update() uses max_t, so ensure we have that defined.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoflow_dissector: remove unused FLOW_DISSECTOR_F_STOP_AT_L3 flag
Stanislav Fomichev [Fri, 31 May 2019 21:05:06 +0000 (14:05 -0700)] 
flow_dissector: remove unused FLOW_DISSECTOR_F_STOP_AT_L3 flag

This flag is not used by any caller, remove it.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge tag 'mlx5-updates-2019-05-31' of git://git.kernel.org/pub/scm/linux/kernel...
David S. Miller [Mon, 3 Jun 2019 20:42:56 +0000 (13:42 -0700)] 
Merge tag 'mlx5-updates-2019-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2019-05-31

This series provides some updates to mlx5 core and netdevice driver.

1) use __netdev_tx_sent_queue() to improve performance under GSO workload

2) Allow matching only enc_key_id/enc_dst_port for decapsulation action

3) Geneve support:
This patchset adds support for GENEVE tunnel encap/decap flows offload:
encapsulating layer 2 Ethernet frames within layer 4 UDP datagrams.
The driver supports 6081 destination UDP port number, which is the
default IANA-assigned port.

Encap:
  ConnectX-5 inserts the header (w/ or w/o Geneve TLV options) that is
  provided by the mlx5 driver to the outgoing packet.

Decap:
  Geneve header is matched and the packet is decapsulated.
  Notes about decap flows with Geneve TLV Options:
   - Support offloading of 32-bit options data only
   - At any given time, only one combination of class/type parameters
     can be offloaded, but the same class/type combination can have
     many different flows offloaded with different 32-bit option data
   - Options with value of 0 can't be offloaded

Managing Geneve TLV options:
  Matching (on receive) is done by ConnectX-5 flex parser.
  Geneve TLV options are managed using General Object of type
  "Geneve TLV Options".

  When the first flow with a certain class/type values is requested
  to be offloaded, the driver creates a FW object with FW command
  (Geneve TLV Options general object) and starts counting the number
  of flows using this object.

  During this time, any request with a different class/type values
  will fail to be offloaded.
  Once the refcount reaches 0, the driver destroys the TLV options
  general object, and can now offload a flow with any class/type parameters.

  Geneve TLV Options object is added to core device.
  It is currently used to manage Geneve TLV options general
  object allocation in FW and its reference counting only.

  In the future it will also be used for managing geneve ports
  by registering callbacks for ndo_udp_tunnel_add/del.

TC tunnel code refactoring:
  As a preparation for Geneve code, the TC tunnel code in mlx5
  was rearranged in a modular way, so that it would be easier
  to add future tunnels:
   - Defined tc tunnel object with the fields and callbacks that
     any tunnel must implement.
   - Define tc UDP tunnel object for UDP tunnels, such as VXLAN
   - Move each tunnel code (GRE, VXLAN) to its own separate file
   - Rewrite tc tunnel implementation in a general way, using
     only the objects and their callbacks.

4) Termination tables:
Actions in tables set with the termination flag are guaranteed to
terminate the action list. Thus, potentially looping functionality
(e.g. hairpin) can safely be executed without potential loops.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'ena-next'
David S. Miller [Mon, 3 Jun 2019 20:30:38 +0000 (13:30 -0700)] 
Merge branch 'ena-next'

Sameeh Jubran says:

====================
Extending the ena driver to support new features and enhance performance

This patchset introduces the following:

* add support for changing the inline header size (max_header_size) for applications
  with overlay and nested headers
* enable automatic fallback to polling mode for admin queue when interrupt is not
  available or missed
* add good checksum counter for Rx ethtool statistics
* update ena.txt
* some minor code clean-up
* some performance enhancements with doorbell calculations

Differences from V1:

* net: ena: add handling of llq max tx burst size (1/11):
 * fixed christmas tree issue

* net: ena: ethtool: add extra properties retrieval via get_priv_flags (2/11):
 * replaced snprintf with strlcpy
 * dropped confusing error message
 * added more details to  the commit message
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ena: use dev_info_once instead of static variable
Sameeh Jubran [Mon, 3 Jun 2019 14:43:29 +0000 (17:43 +0300)] 
net: ena: use dev_info_once instead of static variable

Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ena: add good checksum counter
Sameeh Jubran [Mon, 3 Jun 2019 14:43:28 +0000 (17:43 +0300)] 
net: ena: add good checksum counter

Add a new statistic to ethtool, specifying whether the device calculated
and validated the Rx csum.

Signed-off-by: Evgeny Shmeilin <evgeny@annapurnaLabs.com>
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ena: optimise calculations for CQ doorbell
Sameeh Jubran [Mon, 3 Jun 2019 14:43:27 +0000 (17:43 +0300)] 
net: ena: optimise calculations for CQ doorbell

This patch first checks whether a CQ doorbell is needed before
proceeding with the calculations.

Signed-off-by: Igor Chauskin <igorch@amazon.com>
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ena: add support for changing max_header_size in LLQ mode
Sameeh Jubran [Mon, 3 Jun 2019 14:43:26 +0000 (17:43 +0300)] 
net: ena: add support for changing max_header_size in LLQ mode

Up until now the driver always used a single setting for the sizes
of the different parts of the llq entry - 128 for entry size, 2 for
descriptors before header and 96 for maximum header size.

The current code makes sure that the parts of the llq entry are
compatible with each other and with the initial llq entry size given
by the device.

This commit changes this code to support any llq entry size.

Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ena: allow automatic fallback to polling mode
Sameeh Jubran [Mon, 3 Jun 2019 14:43:25 +0000 (17:43 +0300)] 
net: ena: allow automatic fallback to polling mode

Enable fallback to polling mode for the admin queue when a command
response arrives without an accompanying MSI-X interrupt.

Signed-off-by: Igor Chauskin <igorch@amazon.com>
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ena: documentation: update ena.txt
Sameeh Jubran [Mon, 3 Jun 2019 14:43:24 +0000 (17:43 +0300)] 
net: ena: documentation: update ena.txt

Small cosmetic changes to ena.txt

Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ena: add newline at the end of pr_err prints
Sameeh Jubran [Mon, 3 Jun 2019 14:43:23 +0000 (17:43 +0300)] 
net: ena: add newline at the end of pr_err prints

Some pr_err prints lacked '\n' at the end. Added it where missing.

Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ena: arrange ena_probe() function variables in reverse christmas tree
Sameeh Jubran [Mon, 3 Jun 2019 14:43:22 +0000 (17:43 +0300)] 
net: ena: arrange ena_probe() function variables in reverse christmas tree

Reverse christmas tree arrangement orders variable declarations from the
longest line to the shortest. Most of our functions abide by this
arrangement, but this one does not.

In this commit we arrange the variables of ena_probe() in reverse
christmas tree order.

Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ena: replace free_tx/rx_ids union with single free_ids field in ena_ring
Sameeh Jubran [Mon, 3 Jun 2019 14:43:21 +0000 (17:43 +0300)] 
net: ena: replace free_tx/rx_ids union with single free_ids field in ena_ring

struct ena_ring holds a union of free_rx_ids and free_tx_ids.
Both fields mean exactly the same thing and are used in exactly the
same way. Furthermore, these fields are always used with a prefix
matching the type of ring: for tx it is tx_ring->free_tx_ids, and for
rx it is rx_ring->free_rx_ids, which shows how redundant the "_tx" and
"_rx" parts are. This may even lead to confusing code such as
tx_ring->free_rx_ids, which works correctly but looks like a mess.

This commit removes the aforementioned redundancy by replacing the
free_rx/tx_ids union with a single free_ids field. It also changes a
single goto label name from err_free_tx_ids: to err_tx_free_ids: for
consistency with the new notation.
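
A sketch of the change (only the relevant members are shown; the real
struct ena_ring has many more fields, and the exact type is assumed):

  /* before */
  struct ena_ring {
          union {
                  u16 *free_tx_ids;
                  u16 *free_rx_ids;
          };
          /* ... */
  };

  /* after */
  struct ena_ring {
          u16 *free_ids;
          /* ... */
  };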

Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ena: ethtool: add extra properties retrieval via get_priv_flags
Arthur Kiyanovski [Mon, 3 Jun 2019 14:43:20 +0000 (17:43 +0300)] 
net: ena: ethtool: add extra properties retrieval via get_priv_flags

This commit adds a mechanism for exposing different device
properties via ethtool's priv_flags. The strings are provided
by the device and copied to user space through the driver.

In this commit we:

* Add the commands, structs and defines necessary for handling
  extra properties.

* Add functions for:
  - allocation/destruction of a buffer for extra properties strings;
  - retrieval of extra properties strings and flags from the network
    device.

* Handle the allocation of a buffer for extra properties strings.

* Initialize the buffer with extra properties strings from the
  network device at driver startup.

* Use ethtool's get_priv_flags to expose the extra properties of
  the ENA device.

Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ena: add handling of llq max tx burst size
Sameeh Jubran [Mon, 3 Jun 2019 14:43:19 +0000 (17:43 +0300)] 
net: ena: add handling of llq max tx burst size

There is a maximum TX burst size that the ENA device can handle.
It is exposed by the device to the driver and the driver
needs to comply with it to avoid bugs.

In this commit we:
1. Add ena_com_is_doorbell_needed(), which calculates the number of
   llq entries that will be used to hold a packet, and will return
   true if they exceed the number of allowed entries in a burst.
   If the function returns true, a doorbell needs to be invoked
   to send this packet in the next burst (see the sketch after this list).

2. Follow the available entries in the current burst:
   - Every doorbell a new burst begins
   - With each write of an llq entry, the available entries in the
     current burst are decreased by 1.

Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: dsa: mv88e6xxx: make mv88e6xxx_g1_stats_wait static
Rasmus Villemoes [Mon, 3 Jun 2019 08:04:09 +0000 (08:04 +0000)] 
net: dsa: mv88e6xxx: make mv88e6xxx_g1_stats_wait static

mv88e6xxx_g1_stats_wait has no users outside global1.c, so make it
static.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: dsa: mv88e6xxx: fix comments and macro names in mv88e6390_g1_mgmt_rsvd2cpu
Rasmus Villemoes [Mon, 3 Jun 2019 07:52:46 +0000 (07:52 +0000)] 
net: dsa: mv88e6xxx: fix comments and macro names in mv88e6390_g1_mgmt_rsvd2cpu

The macros have an extraneous '800' (after 0180C2 there should be just
six nibbles, with X representing one), while the comments have
interchanged c2 and 80 and an extra :00.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonexthop: Add entry to MAINTAINERS
David Ahern [Fri, 31 May 2019 18:44:09 +0000 (12:44 -0600)] 
nexthop: Add entry to MAINTAINERS

Add entry to MAINTAINERS file for new nexthop code.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'r8169-replace-several-function-pointers-with-direct-calls'
David S. Miller [Mon, 3 Jun 2019 01:15:38 +0000 (18:15 -0700)] 
Merge branch 'r8169-replace-several-function-pointers-with-direct-calls'

Heiner Kallweit says:

====================
r8169: replace several function pointers with direct calls

This series removes most function pointers from struct rtl8169_private
and uses direct calls instead. This simplifies the code and avoids
the penalty of indirect calls in times of retpoline.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: avoid tso csum function indirection
Heiner Kallweit [Fri, 31 May 2019 17:55:11 +0000 (19:55 +0200)] 
r8169: avoid tso csum function indirection

Replace indirect call to tso_csum with direct calls. To do this we have
to move rtl_chip_supports_csum_v2().
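
The resulting call site looks roughly like this (a sketch; the v1/v2
helper names are assumed from the description, not quoted from the
driver):

  /* instead of an indirect tp->tso_csum(tp, skb, opts) call: */
  if (rtl_chip_supports_csum_v2(tp))
          ok = rtl8169_tso_csum_v2(tp, skb, opts);
  else
          ok = rtl8169_tso_csum_v1(tp, skb, opts);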

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: remove struct jumbo_ops
Heiner Kallweit [Fri, 31 May 2019 17:54:03 +0000 (19:54 +0200)] 
r8169: remove struct jumbo_ops

The jumbo_ops are used in just one place, so we can simplify the code
and avoid the penalty of indirect calls in times of retpoline.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: remove struct mdio_ops
Heiner Kallweit [Fri, 31 May 2019 17:53:28 +0000 (19:53 +0200)] 
r8169: remove struct mdio_ops

The mdio_ops are used in just one place, so we can simplify the code
and avoid the penalty of indirect calls in times of retpoline.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: improve r8169_csum_workaround
Heiner Kallweit [Fri, 31 May 2019 17:17:15 +0000 (19:17 +0200)] 
r8169: improve r8169_csum_workaround

Use helper skb_is_gso() and simplify access to tx_dropped.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: improve eth_platform_get_mac_address
Heiner Kallweit [Fri, 31 May 2019 17:14:44 +0000 (19:14 +0200)] 
net: ethernet: improve eth_platform_get_mac_address

pci_device_to_OF_node(to_pci_dev(dev)) is the same as dev->of_node,
so we can simplify the code. In addition add an empty line before
the return statement.
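
The simplification described above is essentially (sketch):

  /* before */
  if (dev_is_pci(dev))
          np = pci_device_to_OF_node(to_pci_dev(dev));
  else
          np = dev->of_node;

  /* after: for a PCI device both expressions yield the same node */
  np = dev->of_node;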

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'ifa_list-RCU'
David S. Miller [Mon, 3 Jun 2019 01:08:47 +0000 (18:08 -0700)] 
Merge branch 'ifa_list-RCU'

Florian Westphal says:

====================
net: add rcu annotations for ifa_list

v3: fix typo in patch1 commit message
    All other patches are unchanged.
v2: remove ifa_list iteration in afs instead of conversion

Eric Dumazet reported following problem:

  It looks that unless RTNL is held, accessing ifa_list needs proper RCU
  protection.  indev->ifa_list can be changed under us by another cpu
  (which owns RTNL) [..]

  A proper rcu_dereference() with happy sparse support would require
  adding __rcu attribute.

This patch series does that: add __rcu to the ifa_list pointers.
That makes sparse complain, so the series also adds the required
rcu_assign_pointer/dereference helpers where needed.

All patches except the last one are preparation work.
Two new macros are introduced for in_ifaddr walks.

Last patch adds the __rcu annotations and the assign_pointer/dereference
helper calls.

This patch is a bit large, but I found no better way -- other
approaches (annotate-first or add helpers-first) all result in
mid-series sparse warnings.

This series is submitted vs. net-next rather than net for several
reasons:

1. It's (mostly) compile-tested only
2. 3rd patch changes behaviour wrt. secondary addresses
   (see changelog)
3. The problem exists for a very long time (2004), so it doesn't
   seem to be urgent to fix this -- rcu use to free ifa_list
   predates the git era.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ipv4: provide __rcu annotation for ifa_list
Florian Westphal [Fri, 31 May 2019 16:27:09 +0000 (18:27 +0200)] 
net: ipv4: provide __rcu annotation for ifa_list

ifa_list is protected by rcu, yet the code doesn't reflect this.

Add the __rcu annotations and fix up all places that are now reported by
sparse.

I've done this in the same commit to not add intermediate patches that
result in new warnings.

Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodrivers: use in_dev_for_each_ifa_rtnl/rcu
Florian Westphal [Fri, 31 May 2019 16:27:08 +0000 (18:27 +0200)] 
drivers: use in_dev_for_each_ifa_rtnl/rcu

Like previous patches, use the new iterator macros to avoid sparse
warnings once proper __rcu annotations are added.

Compile tested only.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: use new in_dev_ifa iterators
Florian Westphal [Fri, 31 May 2019 16:27:07 +0000 (18:27 +0200)] 
net: use new in_dev_ifa iterators

Use in_dev_for_each_ifa_rcu/rtnl instead.
This prevents sparse warnings once proper __rcu annotations are added.

Signed-off-by: Florian Westphal <fw@strlen.de>

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonetfilter: use in_dev_for_each_ifa_rcu
Florian Westphal [Fri, 31 May 2019 16:27:06 +0000 (18:27 +0200)] 
netfilter: use in_dev_for_each_ifa_rcu

Netfilter hooks are always running under rcu read lock, use
the new iterator macro so sparse won't complain once we add
proper __rcu annotations.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodevinet: use in_dev_for_each_ifa_rcu in more places
Florian Westphal [Fri, 31 May 2019 16:27:05 +0000 (18:27 +0200)] 
devinet: use in_dev_for_each_ifa_rcu in more places

This also replaces spots that used for_primary_ifa().

for_primary_ifa() aborts the loop on the first secondary address seen.

Replace it with either the rcu or rtnl variant of in_dev_for_each_ifa(),
but two places will now also consider secondary addresses too:
inet_addr_onlink() and inet_ifa_byprefix().

I do not understand why they should ignore secondary addresses.

Why would a secondary address not be considered 'on link'?
When matching a prefix, why ignore a matching secondary address?

Other places get converted as well, but gain "->flags & SECONDARY" check.
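
The conversion pattern for the places that keep skipping secondary
addresses looks roughly like this (a sketch using the iterator and flag
names referenced in this series):

  /* before */
  for_primary_ifa(in_dev) {
          /* use ifa ... */
  } endfor_ifa(in_dev);

  /* after */
  in_dev_for_each_ifa_rcu(ifa, in_dev) {
          if (ifa->ifa_flags & IFA_F_SECONDARY)
                  continue;
          /* use ifa ... */
  }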

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: inetdevice: provide replacement iterators for in_ifaddr walk
Florian Westphal [Fri, 31 May 2019 16:27:04 +0000 (18:27 +0200)] 
net: inetdevice: provide replacement iterators for in_ifaddr walk

The ifa_list is protected either by rcu or rtnl lock, but the
current iterators do not account for this.

This adds two iterators as replacement, a later patch in
the series will update them with the needed rcu/rtnl_dereference calls.

It's not done in this patch yet to avoid sparse warnings -- the fields
lack the proper __rcu annotation.
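
At this stage the two replacement iterators are plain walks, roughly
(a sketch; the later patch in the series swaps the first load for
rtnl_dereference/rcu_dereference):

  #define in_dev_for_each_ifa_rtnl(ifa, in_dev)                     \
          for (ifa = (in_dev)->ifa_list; ifa; ifa = ifa->ifa_next)

  #define in_dev_for_each_ifa_rcu(ifa, in_dev)                      \
          for (ifa = (in_dev)->ifa_list; ifa; ifa = ifa->ifa_next)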

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoafs: do not send list of client addresses
Florian Westphal [Fri, 31 May 2019 16:27:03 +0000 (18:27 +0200)] 
afs: do not send list of client addresses

David Howells says:
  I'm told that there's not really any point populating the list.
  Current OpenAFS ignores it, as does AuriStor - and IBM AFS 3.6 will
  do the right thing.
  The list is actually useless as it's the client's view of the world,
  not the servers, so if there's any NAT in the way its contents are
  invalid.  Further, it doesn't support IPv6 addresses.

  On that basis, feel free to make it an empty list and remove all the
  interface enumeration.

V1 of this patch reworked the function to use a new helper for the
ifa_list iteration to avoid sparse warnings once the proper __rcu
annotations get added in struct in_device later.

But, in light of the above, just remove afs_get_ipv4_interfaces.

Compile tested only.

Cc: David Howells <dhowells@redhat.com>
Cc: linux-afs@lists.infradead.org
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoqed: remove redundant assignment to rc
Colin Ian King [Fri, 31 May 2019 13:27:38 +0000 (14:27 +0100)] 
qed: remove redundant assignment to rc

The variable rc is assigned with a value that is never read and
it is re-assigned a new value later on.  The assignment is redundant
and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge tag 'isdn-removal' of https://git.kernel.org/pub/scm/linux/kernel/git/arnd...
David S. Miller [Mon, 3 Jun 2019 00:48:58 +0000 (17:48 -0700)] 
Merge tag 'isdn-removal' of https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground

Arnd Bergmann says:

====================
isdn: deprecate non-mISDN drivers

When isdn4linux came up in the context of another patch series, I
remembered that we had discussed removing it a while ago.

It turns out that the suggestion from Karsten Keil was to remove I4L
in 2018, after the last public ISDN networks were shut down. This has
happened now (with a very small number of exceptions), so I guess it's
time to try again.

We currently have three ISDN stacks in the kernel: the original
isdn4linux (with the hisax driver), the newer CAPI (with four drivers),
and finally the mISDN stack (supporting roughly the same hardware as
hisax).

As far as I can tell, anyone using ISDN with mainline kernel drivers in
the past few years uses mISDN, and this is typically used for voice-only
PBX installations that don't require a public network.

The older stacks support additional features for data networks, but those
typically make no sense any more if there is no network to connect to.

My proposal for this time is to kill off isdn4linux entirely, as it seems
to have been unusable for quite a while. This code has been abandoned
for many years and it does cause problems for treewide maintenance as
it tends to do everything that we try to stop doing.
Birger Harzenetter mentioned that he is still using i4l in order to
make use of the 'divert' feature that is not part of mISDN, but has
otherwise moved on to mISDN for normal operation, like apparently
everyone else.

CAPI in turn is not quite as obsolete, but two of the drivers (avm
and hysdn) don't seem to be used at all, while another one (gigaset)
will stop being maintained as Paul Bolle is no longer able to
test it after the network gets shut down in September.
All three are now moved into drivers/staging to let others speak
up in case there are remaining users.
This leaves Bluetooth CMTP as the only remaining user of CAPI, but
Marcel Holtmann wishes to keep maintaining it.

For the discussion on version 1, see [2]
Unfortunately, Karsten Keil as the maintainer has not participated in
the discussion.

      Arnd

[1] https://patchwork.kernel.org/patch/8484861/#17900371
[2] https://listserv.isdn4linux.de/pipermail/isdn4linux/2019-April/thread.html
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'mscc-ocelot-tc-flower'
David S. Miller [Sun, 2 Jun 2019 20:49:49 +0000 (13:49 -0700)] 
Merge branch 'mscc-ocelot-tc-flower'

Horatiu Vultur says:

====================
Add hw offload of TC flower on MSCC Ocelot

This patch series enables hardware offload for flower filter used in
traffic controller on MSCC Ocelot board.

v2->v3 changes:
 - remove the check for shared blocks

v1->v2 changes:
 - when declaring variables use reverse christmas tree
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: mscc: ocelot: Hardware ofload for tc flower filter
Horatiu Vultur [Fri, 31 May 2019 07:16:57 +0000 (09:16 +0200)] 
net: mscc: ocelot: Hardware ofload for tc flower filter

Hardware offload of port filtering is now supported via the tc command
using the flower filter. ACL rules are used to enable the hardware
offload.
The following keys are supported:

vlan_id
vlan_prio
dst_mac/src_mac for non IP frames
dst_ip/src_ip
dst_port/src_port

The following actions are supported:
trap
drop

These filters are supported only on ingress.

Add:
tc qdisc add dev eth3 ingress
tc filter add dev eth3 parent ffff: protocol ip flower \
    ip_proto tcp dst_port 80 action drop

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: mscc: ocelot: Add support for tcam
Horatiu Vultur [Fri, 31 May 2019 07:16:56 +0000 (09:16 +0200)] 
net: mscc: ocelot: Add support for tcam

Add ACL support using the TCAM. Using ACL it is possible to create rules
in hardware to filter/redirect frames.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: Add test cases for nexthop objects
David Ahern [Thu, 30 May 2019 19:06:36 +0000 (12:06 -0700)] 
selftests: Add test cases for nexthop objects

Add functional test cases for nexthop objects.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
David S. Miller [Sat, 1 Jun 2019 23:21:19 +0000 (16:21 -0700)] 
Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next:

1) Add UDP tunnel support for ICMP errors in IPVS.

Julian Anastasov says:

This patchset is a followup to the commit that adds UDP/GUE tunnel:
"ipvs: allow tunneling with gue encapsulation".

What we do is to put tunnel real servers in hash table (patch 1),
add function to lookup tunnels (patch 2) and use it to strip the
embedded tunnel headers from ICMP errors (patch 3).

2) Extend xt_owner to match for supplementary groups, from
   Lukasz Pawelczyk.

3) Remove unused oif field in flow_offload_tuple object, from
   Taehee Yoo.

4) Release basechain counters from workqueue to skip synchronize_rcu()
   call. From Florian Westphal.

5) Replace skb_make_writable() by skb_ensure_writable(). Patchset
   from Florian Westphal.

6) Checksum support for gue encapsulation in IPVS, from Jacky Hu.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
David S. Miller [Sat, 1 Jun 2019 04:21:18 +0000 (21:21 -0700)] 
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-05-31

The following pull-request contains BPF updates for your *net-next* tree.

Lots of exciting new features in the first PR of this development cycle!
The main changes are:

1) misc verifier improvements, from Alexei.

2) bpftool can now convert btf to valid C, from Andrii.

3) verifier can insert explicit ZEXT insn when requested by 32-bit JITs.
   This feature greatly improves BPF speed on 32-bit architectures. From Jiong.

4) cgroups will now auto-detach bpf programs. This fixes the issue of
   thousands of bpf programs getting stuck in dying cgroups. From Roman.

5) new bpf_send_signal() helper, from Yonghong.

6) cgroup inet skb programs can signal CN to the stack, from Lawrence.

7) miscellaneous cleanups, from many developers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests/bpf: measure RTT from xdp using xdping
Alan Maguire [Fri, 31 May 2019 17:47:14 +0000 (18:47 +0100)] 
selftests/bpf: measure RTT from xdp using xdping

xdping allows us to get latency estimates from XDP.  Output looks
like this:

./xdping -I eth4 192.168.55.8
Setting up XDP for eth4, please wait...
XDP setup disrupts network connectivity, hit Ctrl+C to quit

Normal ping RTT data
[Ignore final RTT; it is distorted by XDP using the reply]
PING 192.168.55.8 (192.168.55.8) from 192.168.55.7 eth4: 56(84) bytes of data.
64 bytes from 192.168.55.8: icmp_seq=1 ttl=64 time=0.302 ms
64 bytes from 192.168.55.8: icmp_seq=2 ttl=64 time=0.208 ms
64 bytes from 192.168.55.8: icmp_seq=3 ttl=64 time=0.163 ms
64 bytes from 192.168.55.8: icmp_seq=8 ttl=64 time=0.275 ms

4 packets transmitted, 4 received, 0% packet loss, time 3079ms
rtt min/avg/max/mdev = 0.163/0.237/0.302/0.054 ms

XDP RTT data:
64 bytes from 192.168.55.8: icmp_seq=5 ttl=64 time=0.02808 ms
64 bytes from 192.168.55.8: icmp_seq=6 ttl=64 time=0.02804 ms
64 bytes from 192.168.55.8: icmp_seq=7 ttl=64 time=0.02815 ms
64 bytes from 192.168.55.8: icmp_seq=8 ttl=64 time=0.02805 ms

The xdping program loads the associated xdping_kern.o BPF program
and attaches it to the specified interface.  If run in client
mode (the default), it will add a map entry keyed by the
target IP address; this map will store RTT measurements, current
sequence number etc.  Finally in client mode the ping command
is executed, and the xdping BPF program will use the last ICMP
reply, reformulate it as an ICMP request with the next sequence
number and XDP_TX it.  After the reply to that request is received
we can measure RTT and repeat until the desired number of
measurements is made.  This is why the sequence numbers in the
normal ping are 1, 2, 3 and 8.  We XDP_TX a modified version
of ICMP reply 4 and keep doing this until we get the 4 replies
we need; hence the networking stack only sees reply 8, which we
XDP_PASS upstream since we are done.

In server mode (-s), xdping simply takes ICMP requests and replies
to them in XDP rather than passing the request up to the networking
stack.  No map entry is required.

xdping can be run in native XDP mode (the default, or specified
via -N) or in skb mode (-S).

A test program test_xdping.sh exercises some of these options.

Note that native XDP does not seem to XDP_TX for veths, hence -N
is not tested.  Looking at the code, XDP_TX appears to be supported,
so I'm not sure whether that is expected.  Running xdping in
native mode on ixgbe as both client and server works fine.

Changes since v4

- close fds on cleanup (Song Liu)

Changes since v3

- fixed seq to be __be16 (Song Liu)
- fixed fd checks in xdping.c (Song Liu)

Changes since v2

- updated commit message to explain why seq number of last
  ICMP reply is 8 not 4 (Song Liu)
- updated types of seq number, raddr and eliminated csum variable
  in xdpclient/xdpserver functions as it was not needed (Song Liu)
- added XDPING_DEFAULT_COUNT definition and usage specification of
  default/max counts (Song Liu)

Changes since v1
 - moved from RFC to PATCH
 - removed unused variable in ipv4_csum() (Song Liu)
 - refactored ICMP checks into icmp_check() function called by client
   and server programs and reworked client and server programs due
   to lack of shared code (Song Liu)
 - added checks to ensure that SKB and native mode are not requested
   together (Song Liu)

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agoMerge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next...
David S. Miller [Sat, 1 Jun 2019 00:13:19 +0000 (17:13 -0700)] 
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2019-05-31

This series contains updates to the iavf driver.

Nathan Chancellor converts the use of gnu_printf to printf.

Aleksandr modifies the driver to limit the number of RSS queues to the
number of online CPUs in order to avoid creating misconfigured RSS
queues.

Gustavo A. R. Silva converts a couple of instances where sizeof() can be
replaced with struct_size().

Alice makes the remaining changes to the iavf driver to clean up all the
old "i40evf" references, including the file names that still carried the
old driver name.  There were no functional changes, just cosmetic ones,
to reduce any confusion going forward now that iavf is the virtual
function driver for both the i40e and ice drivers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobpf: doc: update answer for 32-bit subregister question
Jiong Wang [Thu, 30 May 2019 20:23:18 +0000 (21:23 +0100)] 
bpf: doc: update answer for 32-bit subregister question

There has been quite a bit of progress on the two steps mentioned in the
answer to the following question:

  Q: BPF 32-bit subregister requirements

This patch updates the answer to reflect what has been done.

v2:
 - Add missing full stop. (Song Liu)
 - Minor tweak on one sentence. (Song Liu)

v1:
 - Integrated rephrase from Quentin and Jakub

Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agoMerge branch 'map-charge-cleanup'
Alexei Starovoitov [Fri, 31 May 2019 23:52:56 +0000 (16:52 -0700)] 
Merge branch 'map-charge-cleanup'

Roman Gushchin says:

====================
During my work on memcg-based memory accounting for bpf maps
I've done some cleanups and refactorings of the existing
memlock rlimit-based code. It makes the code more robust, unifies the
size-to-pages conversion, size checks and corresponding error
codes. It also adds coverage for cgroup local storage and
socket local storage maps.

It looks like some preliminary work on the mm side might be
required to start working on the memcg-based accounting,
so I'm sending these patches as a separate patchset.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agobpf: move memory size checks to bpf_map_charge_init()
Roman Gushchin [Thu, 30 May 2019 01:03:59 +0000 (18:03 -0700)] 
bpf: move memory size checks to bpf_map_charge_init()

Most bpf map types do similar checks and the same bytes-to-pages
conversion during memory allocation and charging.

Let's unify these checks by moving them into bpf_map_charge_init().
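
As a rough sketch, the unified path boils down to the following (the exact
limit, rounding and signature in the real bpf_map_charge_init() may differ):

int bpf_map_charge_init(struct bpf_map_memory *mem, u64 size)
{
        u32 pages;

        /* common size check, previously open-coded per map type */
        if (size >= U32_MAX - PAGE_SIZE)
                return -E2BIG;

        /* common bytes-to-pages conversion */
        pages = round_up(size, PAGE_SIZE) >> PAGE_SHIFT;

        /* ... charge "pages" against RLIMIT_MEMLOCK and record
         * mem->pages / mem->user for the later uncharge ... */
        return 0;
}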

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agobpf: rework memlock-based memory accounting for maps
Roman Gushchin [Thu, 30 May 2019 01:03:58 +0000 (18:03 -0700)] 
bpf: rework memlock-based memory accounting for maps

In order to unify the existing memlock charging code with the
memcg-based memory accounting, which will be added later, let's
rework the current scheme.

Currently the following design is used:
  1) .alloc() callback optionally checks if the allocation will likely
     succeed using bpf_map_precharge_memlock()
  2) .alloc() performs actual allocations
  3) .alloc() callback calculates map cost and sets map.memory.pages
  4) map_create() calls bpf_map_init_memlock() which sets map.memory.user
     and performs actual charging; in case of failure the map is
     destroyed
  <map is in use>
  1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which
     performs uncharge and releases the user
  2) .map_free() callback releases the memory

The scheme can be simplified and made more robust:
  1) .alloc() calculates map cost and calls bpf_map_charge_init()
  2) bpf_map_charge_init() sets map.memory.user and performs actual
    charge
  3) .alloc() performs actual allocations
  <map is in use>
  1) .map_free() callback releases the memory
  2) bpf_map_charge_finish() performs uncharge and releases the user

The new scheme also allows reusing the bpf_map_charge_init()/finish()
functions for memcg-based accounting. Because charges are performed
before actual allocations and uncharges after freeing the memory,
no bogus memory pressure can be created.

In cases when the map structure is not available (e.g. it's not
created yet, or is already destroyed), an on-stack bpf_map_memory
structure is used. The charge can be transferred with the
bpf_map_charge_move() function.
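
A sketch of what a map type's .alloc() path looks like under the new scheme
(bpf_map_charge_init/finish/move are the functions introduced by this series;
the example map/element types and the cost formula are made up for
illustration):

struct example_elem {
        u64 data;
};

struct example_map {
        struct bpf_map map;
        /* ... per-type state ... */
};

static struct bpf_map *example_map_alloc(union bpf_attr *attr)
{
        struct bpf_map_memory mem;
        struct example_map *emap;
        u64 cost;
        int err;

        /* 1) compute the full map cost up front */
        cost = sizeof(*emap) +
               (u64)attr->max_entries * sizeof(struct example_elem);

        /* 2) charge it against an on-stack bpf_map_memory first ... */
        err = bpf_map_charge_init(&mem, cost);
        if (err)
                return ERR_PTR(err);

        /* 3) ... then allocate; on failure, give the charge back */
        emap = kzalloc(sizeof(*emap), GFP_USER);
        if (!emap) {
                bpf_map_charge_finish(&mem);
                return ERR_PTR(-ENOMEM);
        }

        /* hand the charge over to the map object that now exists */
        bpf_map_charge_move(&emap->map.memory, &mem);
        return &emap->map;
}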

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agobpf: group memory related fields in struct bpf_map_memory
Roman Gushchin [Thu, 30 May 2019 01:03:57 +0000 (18:03 -0700)] 
bpf: group memory related fields in struct bpf_map_memory

Group "user" and "pages" fields of bpf_map into the bpf_map_memory
structure. Later it can be extended with "memcg" and other related
information.

The main reason for such a change (besides cosmetics) is to pass
bpf_map_memory structure to charging functions before the actual
allocation of bpf_map.
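
In essence (a sketch; the real struct lives in include/linux/bpf.h and may
grow more fields, such as a memcg pointer, later):

struct bpf_map_memory {
        u32 pages;                      /* charged size, in pages */
        struct user_struct *user;       /* user the pages are charged to */
};

struct bpf_map {
        /* ... */
        struct bpf_map_memory memory;   /* replaces the open-coded user/pages */
        /* ... */
};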

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agobpf: add memlock precharge for socket local storage
Roman Gushchin [Thu, 30 May 2019 01:03:56 +0000 (18:03 -0700)] 
bpf: add memlock precharge for socket local storage

Socket local storage maps lack the memlock precharge check,
which is performed before the memory allocation for
most other bpf map types.

Let's add it in order to unify all map types.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agobpf: add memlock precharge check for cgroup_local_storage
Roman Gushchin [Thu, 30 May 2019 01:03:55 +0000 (18:03 -0700)] 
bpf: add memlock precharge check for cgroup_local_storage

Cgroup local storage maps lack the memlock precharge check,
which is performed before the memory allocation for
most other bpf map types.

Let's add it in order to unify all map types.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agoMerge branch 'propagate-cn-to-tcp'
Alexei Starovoitov [Fri, 31 May 2019 23:41:30 +0000 (16:41 -0700)] 
Merge branch 'propagate-cn-to-tcp'

Lawrence Brakmo says:

====================
This patchset adds support for propagating congestion notifications (cn)
to TCP from cgroup inet skb egress BPF programs.

Current cgroup skb BPF programs cannot trigger TCP congestion window
reductions, even when they drop a packet. This patch-set adds support
for cgroup skb BPF programs to send congestion notifications in the
return value when the packets are TCP packets. Rather than the
current 1 for keeping the packet and 0 for dropping it, they can
now return:
    NET_XMIT_SUCCESS    (0)    - continue with packet output
    NET_XMIT_DROP       (1)    - drop packet and do cn
    NET_XMIT_CN         (2)    - continue with packet output and do cn
    -EPERM                     - drop packet
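
At the BPF program level the encoding is the 0-3 scheme detailed in the
individual patches below (low bit: keep/drop, second bit: cn), which the
kernel then maps onto the NET_XMIT values listed above.  A minimal sketch of
such an egress program (the throttling condition is a placeholder; real HBM
programs track per-cgroup credit in a map):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("cgroup_skb/egress")
int hbm_like_egress(struct __sk_buff *skb)
{
        /* placeholder policy: whatever this program wants to throttle */
        int over_limit = skb->len > 1400;

        if (over_limit)
                return 2;       /* drop the packet and request cn (cwr) */
        return 1;               /* keep the packet, no cn */
}

char _license[] SEC("license") = "GPL";

Such a program must also be loaded with expected_attach_type set to
BPF_CGROUP_INET_EGRESS, as described further down.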

Finally, HBM programs are modified to collect and return more
statistics.

There has been some discussion regarding the best place to manage
bandwidths. Some believe this should be done in the qdisc where it can
also be managed with a BPF program. We believe there are advantages
for doing it with a BPF program in the cgroup/skb callback. For example,
it reduces overheads in the cases where there is one primary workload and
one or more secondary workloads, where each workload is running on its
own cgroupv2. In this scenario, we only need to throttle the secondary
workloads and there is no overhead for the primary workload since there
will be no BPF program attached to its cgroup.

Regardless, we agree that this mechanism should not penalize those that
are not using it. We tested this by doing 1 byte req/reply RPCs over
loopback. Each test consists of 30 sec of back-to-back 1 byte RPCs.
Each test was repeated 50 times with a 1 minute delay between each set
of 10. We then calculated the average RPCs/sec over the 50 tests. We
compare upstream with upstream + patchset and no BPF program as well
as upstream + patchset and a BPF program that just returns ALLOW_PKT.
Here are the results:

upstream                           80937 RPCs/sec
upstream + patches, no BPF program 80894 RPCs/sec
upstream + patches, BPF program    80634 RPCs/sec

These numbers indicate that there is no penalty for these patches.

The use of congestion notifications improves the performance of HBM when
using Cubic. Without congestion notifications, Cubic will not decrease its
cwnd and HBM will need to drop a large percentage of the packets.

The following results are obtained for rate limits of 1Gbps,
between two servers using netperf, and only one flow. We also show how
reducing the max delayed ACK timer can improve the performance when
using Cubic.

Command used was:
  ./do_hbm_test.sh -l -D --stats -N -r=<rate> [--no_cn] [dctcp] \
                   -s=<server running netserver>
  where:
     <rate>   is 1000
     --no_cn  specifies no cwr notifications
     dctcp    uses dctcp

                       Cubic                    DCTCP
Lim, DA      Mbps cwnd cred drops  Mbps cwnd cred drops
--------     ---- ---- ---- -----  ---- ---- ---- -----
  1G, 40       35  462 -320 67%     995    1 -212  0.05%
  1G, 40,cn   736    9  -78  0.07   995    1 -212  0.05
  1G,  5,cn   941    2 -189  0.13   995    1 -212  0.05

Notes:
  --no_cn has no effect with DCTCP
  Lim = rate limit
  DA = maximum delay ack timer
  cred = credit in packets
  drops = % packets dropped

v1->v2: Ensures that only BPF_CGROUP_INET_EGRESS can return values 2 and 3
        New egress values apply to all protocols, not just TCP
        Cleaned up patch 4, Update BPF_CGROUP_RUN_PROG_INET_EGRESS callers
        Removed changes to __tcp_transmit_skb (patch 5), no longer needed
        Removed sample use of EDT
v2->v3: Removed the probe timer related changes
v3->v4: Replaced preempt_enable_no_resched() by preempt_enable()
        in BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() macro
====================

Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agobpf: Add more stats to HBM
brakmo [Tue, 28 May 2019 23:59:40 +0000 (16:59 -0700)] 
bpf: Add more stats to HBM

Adds more stats to HBM, including average cwnd and rtt of all TCP
flows, the percentage of packets that are ECN CE marked, and the
distribution of return values.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agobpf: Add cn support to hbm_out_kern.c
brakmo [Tue, 28 May 2019 23:59:39 +0000 (16:59 -0700)] 
bpf: Add cn support to hbm_out_kern.c

Update hbm_out_kern.c to support returning cn notifications.
Also updates relevant files to allow disabling cn notifications.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agobpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls
brakmo [Tue, 28 May 2019 23:59:38 +0000 (16:59 -0700)] 
bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls

Update BPF_CGROUP_RUN_PROG_INET_EGRESS() callers to support returning
congestion notifications from the BPF programs.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agobpf: Update __cgroup_bpf_run_filter_skb with cn
brakmo [Tue, 28 May 2019 23:59:37 +0000 (16:59 -0700)] 
bpf: Update __cgroup_bpf_run_filter_skb with cn

For egress packets, __cgroup_bpf_run_filter_skb() will now call
BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() instead of BPF_PROG_RUN_ARRAY()
in order to propagate congestion notification (cn) requests to TCP
callers.

For egress packets, this function can return:
   NET_XMIT_SUCCESS    (0)    - continue with packet output
   NET_XMIT_DROP       (1)    - drop packet and notify TCP to call cwr
   NET_XMIT_CN         (2)    - continue with packet output and notify TCP
                                to call cwr
   -EPERM                     - drop packet

For ingress packets, this function will return -EPERM if any attached
program was found and if it returned != 1 during execution. Otherwise 0
is returned.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agobpf: cgroup inet skb programs can return 0 to 3
brakmo [Tue, 28 May 2019 23:59:36 +0000 (16:59 -0700)] 
bpf: cgroup inet skb programs can return 0 to 3

Allows cgroup inet skb programs to return values in the range [0, 3].
The second bit is used to determine whether congestion occurred and the
higher level protocol should decrease its rate, e.g. TCP would call
tcp_enter_cwr().

The bpf_prog must set expected_attach_type to BPF_CGROUP_INET_EGRESS
at load time if it uses the new return values (i.e. 2 or 3).

The expected_attach_type is currently not enforced for
BPF_PROG_TYPE_CGROUP_SKB, meaning that an existing bpf_prog with
expected_attach_type set to BPF_CGROUP_INET_EGRESS can attach to
BPF_CGROUP_INET_INGRESS.  Blindly enforcing expected_attach_type
would therefore break backward compatibility.

This patch adds an enforce_expected_attach_type bit so that the
expected_attach_type is enforced only when the program uses the new
return values.
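
As a rough user-space illustration (program setup and error handling are
elided; the fields used are the ones from the uapi bpf.h):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int load_egress_prog(const struct bpf_insn *insns, unsigned int insn_cnt)
{
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.prog_type = BPF_PROG_TYPE_CGROUP_SKB;
        /* declare the attach type up front to opt in to return values 2/3 */
        attr.expected_attach_type = BPF_CGROUP_INET_EGRESS;
        attr.insns = (__u64)(unsigned long)insns;
        attr.insn_cnt = insn_cnt;
        attr.license = (__u64)(unsigned long)"GPL";

        return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}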

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agobpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY
brakmo [Tue, 28 May 2019 23:59:35 +0000 (16:59 -0700)] 
bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY

Create new macro BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() to be used by
__cgroup_bpf_run_filter_skb for EGRESS BPF progs so BPF programs can
request cwr for TCP packets.

Current cgroup skb programs can only return 0 or 1 (0 to drop the
packet, 1 to keep it).  This macro changes the behavior so that the
low-order bit indicates whether the packet should be dropped (0) or
kept (1), and the next bit is used for congestion notification (cn).

Hence, new allowed return values of CGROUP EGRESS BPF programs are:
  0: drop packet
  1: keep packet
  2: drop packet and call cwr
  3: keep packet and call cwr

The macro then converts the program's return value to one of the NET_XMIT
values, or to -EPERM, which has the effect of dropping the packet with no cn:
  0 -> -EPERM            skb is dropped (no cn)
  1 -> NET_XMIT_SUCCESS  skb is transmitted (no cn)
  2 -> NET_XMIT_DROP     skb is dropped and cwr is called
  3 -> NET_XMIT_CN       skb is transmitted and cwr is called

Note that when more than one BPF program is called, the packet is
dropped if at least one of the programs requests it be dropped, and
there is cn if at least one program returns cn.
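
Putting the two lists together, the conversion amounts to the following
sketch, where "keep" is the AND of the programs' low-order bits and "cn" is
the OR of their second bits, accumulated over all attached programs:

static int egress_ret_to_net_xmit(bool keep, bool cn)
{
        if (keep)
                return cn ? NET_XMIT_CN : NET_XMIT_SUCCESS;
        return cn ? NET_XMIT_DROP : -EPERM;
}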

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agoxen-netback: remove redundant assignment to err
Colin Ian King [Thu, 30 May 2019 19:04:38 +0000 (20:04 +0100)] 
xen-netback: remove redundant assignment to err

The variable err is assigned the value -ENOMEM, which is never
read and it is re-assigned a new value later on.  The assignment is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonexthop: remove redundant assignment to err
Colin Ian King [Thu, 30 May 2019 15:57:54 +0000 (16:57 +0100)] 
nexthop: remove redundant assignment to err

The variable err is initialized with a value that is never read
and err is reassigned a few statements later. This initialization
is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/mlx5e: TX, Improve performance under GSO workload
Erez Alfasi [Tue, 14 May 2019 10:55:22 +0000 (13:55 +0300)] 
net/mlx5e: TX, Improve performance under GSO workload

__netdev_tx_sent_queue() was introduced by:
commit 3e59020abf0f ("net: bql: add __netdev_tx_sent_queue()")

BQL counters should be updated without flipping/caring about
BQL status, if the current skb has xmit_more set.

Using __netdev_tx_sent_queue() avoids messing with the BQL stop
flag and increases performance on GSO workloads by keeping
doorbells to the minimum required and by sparing atomic
operations.
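
The driver-side pattern is roughly the following sketch; only
__netdev_tx_sent_queue() and netdev_xmit_more() are generic netdev API,
the rest is illustrative:

static void example_tx_sent(struct net_device *dev, int qid, struct sk_buff *skb)
{
        struct netdev_queue *txq = netdev_get_tx_queue(dev, qid);

        /* true when the doorbell must be rung: either this skb carries no
         * xmit_more hint, or BQL has just stopped the queue */
        if (__netdev_tx_sent_queue(txq, skb->len, netdev_xmit_more())) {
                /* ring the hardware doorbell here */
        }
}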

Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5e: Use termination table for VLAN push actions
Oz Shlomo [Thu, 18 Apr 2019 13:45:29 +0000 (16:45 +0300)] 
net/mlx5e: Use termination table for VLAN push actions

HW does not support push VLAN action in the RX direction (packets
arriving from the wire). The FW works around this limitation by hairpinning
the packet. The hairpin workaround applies only when the push VLAN action
is specified in a termination table, assuring that there are no actions
following the hairpin.

Instantiate termination table for push VLAN actions. Re-use identical
terminating tables for increased HW cache efficiency.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5e: Geneve, Add support for encap/decap flows offload
Yevgeny Kliteynik [Thu, 4 Apr 2019 00:37:36 +0000 (03:37 +0300)] 
net/mlx5e: Geneve, Add support for encap/decap flows offload

Add HW offloading support for flows with Geneve encap/decap.

Notes about decap flows with Geneve TLV Options:
  - Support offloading of 32-bit options data only
  - At any given time, only one combination of class/type parameters
    can be offloaded, but the same class/type combination can have
    many different flows offloaded with different 32-bit option data
  - Options with value of 0 can't be offloaded

Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5e: Rearrange tc tunnel code in a modular way
Yevgeny Kliteynik [Sun, 14 Apr 2019 14:50:01 +0000 (17:50 +0300)] 
net/mlx5e: Rearrange tc tunnel code in a modular way

Rearrange tc tunnel code so that it would be easy to add future tunnels:
 - Define tc tunnel object with the fields and callbacks that any
   tunnel must implement.
 - Define tc UDP tunnel object for UDP tunnels, such as VXLAN
 - Move each tunnel code (GRE, VXLAN) to its own separate file
 - Rewrite tc tunnel implementation in a general way - using only
   the objects and their callbacks.

Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5e: Geneve, Keep tunnel info as pointer to the original struct
Yevgeny Kliteynik [Tue, 12 Feb 2019 11:31:00 +0000 (13:31 +0200)] 
net/mlx5e: Geneve, Keep tunnel info as pointer to the original struct

In the mlx5e encap entry structure, the IP tunnel info data structure is
copied by value. This approach worked until now, but it breaks when there
are encapsulation options, as in the case of Geneve.

These options are stored in the structure that is allocated adjacent to
the IP tunnel info struct, and not pointed at by any field in that struct.
Therefore, when copying the struct by value, we lose the address of the
original struct and can't get to the encapsulation options.

Fix the problem by storing the pointer to the tunnel info data instead.
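
Schematically, in the encap entry (a sketch; the real struct carries many
more fields):

struct mlx5e_encap_entry {
        /* ... */
        /* was: struct ip_tunnel_info tun_info; (copied by value) */
        struct ip_tunnel_info *tun_info;
        /* ... */
};

/* keeping a pointer means ip_tunnel_info_opts(tun_info) still reaches the
 * options that are allocated right after the original struct */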

Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Geneve, Manage Geneve TLV options
Yevgeny Kliteynik [Wed, 30 Jan 2019 15:21:55 +0000 (17:21 +0200)] 
net/mlx5: Geneve, Manage Geneve TLV options

Use Geneve TLV Options object to manage the flex parser matching
on the 32-bit options data.

When the first flow with certain class/type values is requested to
be offloaded, create a FW object with a FW command (Geneve TLV Options
general object) and start counting the number of flows using this object.

During this time, any request with different class/type values will
fail to be offloaded.
Once the refcount reaches 0, the TLV options general object is destroyed,
and a flow with any class/type parameters can then be offloaded.
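
A sketch of the refcounting described above (illustrative names, not the
actual mlx5 API); the matching put would decrement the refcount and destroy
the FW object when it reaches zero:

struct example_geneve {
        struct mutex sync_lock;
        int refcount;
        __be16 opt_class;
        u8 opt_type;
};

static int example_tlv_option_get(struct example_geneve *geneve,
                                  struct geneve_opt *opt)
{
        int err = 0;

        mutex_lock(&geneve->sync_lock);
        if (geneve->refcount &&
            (geneve->opt_class != opt->opt_class ||
             geneve->opt_type != opt->type)) {
                /* another class/type combination is already in use */
                err = -EOPNOTSUPP;
                goto out;
        }
        if (!geneve->refcount) {
                geneve->opt_class = opt->opt_class;
                geneve->opt_type = opt->type;
                /* first user: FW command creating the Geneve TLV Options
                 * general object goes here */
        }
        geneve->refcount++;
out:
        mutex_unlock(&geneve->sync_lock);
        return err;
}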

The Geneve TLV Options object is added to the core device.
It is currently used only to manage the allocation of the Geneve TLV
options general object in FW and its reference counting.
In the future it will also be used for managing Geneve ports
by registering callbacks for ndo_udp_tunnel_add/del.

Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5e: Enable setting multiple match criteria for flow group
Yevgeny Kliteynik [Wed, 30 Jan 2019 13:52:35 +0000 (15:52 +0200)] 
net/mlx5e: Enable setting multiple match criteria for flow group

When filling in flow spec match criteria, to allow previous
modifications of the match criteria, use "|=" rather than "=".

Tunnel options are parsed before the match criteria of the offloaded
flow are being set. If the flow that we're about to offload has
encapsulation options, the flow group might need to match on additional
criteria.

For Geneve, an additional flow group matching parameter should
be used - misc3. The appropriate bit in the match criteria is set
while parsing the tunnel options, so the criteria value shouldn't
be overwritten.
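
Schematically (the exact spot in the driver differs; the point is the
operator change):

/* before: overwrites criteria enabled earlier, e.g. misc3 set while
 * parsing Geneve TLV options */
spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS;

/* after: OR-in the new criteria so the previously enabled ones survive */
spec->match_criteria_enable |= MLX5_MATCH_OUTER_HEADERS;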

This is a pre-step for supporting Geneve TLV options offload.

Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>