Jozsef Kadlecsik [Fri, 30 Dec 2022 11:32:37 +0000 (12:32 +0100)]
netfilter: ipset: Rework long task execution when adding/deleting entries
When adding/deleting large number of elements in one step in ipset, it can
take a reasonable amount of time and can result in soft lockup errors. The
patch 5f7b51bf09ba ("netfilter: ipset: Limit the maximal range of
consecutive elements to add/delete") tried to fix it by limiting the max
elements to process at all. However it was not enough, it is still possible
that we get hung tasks. Lowering the limit is not reasonable, so the
approach in this patch is as follows: rely on the method used at resizing
sets and save the state when we reach a smaller internal batch limit,
unlock/lock and proceed from the saved state. Thus we can avoid long
continuous tasks and at the same time removed the limit to add/delete large
number of elements in one step.
The nfnl mutex is held during the whole operation which prevents one to issue
other ipset commands in parallel.
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Reported-by: syzbot+9204e7399656300bf271@syzkaller.appspotmail.com Fixes: 5f7b51bf09ba ("netfilter: ipset: Limit the maximal range of consecutive elements to add/delete")
Jozsef Kadlecsik [Thu, 29 Dec 2022 14:00:21 +0000 (15:00 +0100)]
netfilter: ipset: fix hash:net,port,net hang with /0 subnet
The hash:net,port,net set type supports /0 subnets. However, the patch
commit 5f7b51bf09baca8e titled "netfilter: ipset: Limit the maximal range
of consecutive elements to add/delete" did not take into account it and
resulted in an endless loop. The bug is actually older but the patch 5f7b51bf09baca8e brings it out earlier.
Handle /0 subnets properly in hash:net,port,net set types.
Reported-by: Марк Коренберг <socketpair@gmail.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Jozsef Kadlecsik [Mon, 21 Nov 2022 12:16:56 +0000 (13:16 +0100)]
netfilter: ipset: restore allowing 64 clashing elements in hash:net,iface
The patch "netfilter: ipset: enforce documented limit to prevent allocating
huge memory" was too strict and prevented to add up to 64 clashing elements
to a hash:net,iface type of set. This patch fixes the issue and now the type
behaves as documented.
Vishwanath Pai [Thu, 10 Nov 2022 21:31:26 +0000 (16:31 -0500)]
netfilter: ipset: Add support for new bitmask parameter
Add a new parameter to complement the existing 'netmask' option. The
main difference between netmask and bitmask is that bitmask takes any
arbitrary ip address as input, it does not have to be a valid netmask.
The name of the new parameter is 'bitmask'. This lets us mask out
arbitrary bits in the ip address, for example:
ipset create set1 hash:ip bitmask 255.128.255.0
ipset create set2 hash:ip,port family inet6 bitmask ffff::ff80
Vishwanath Pai [Thu, 10 Nov 2022 21:30:26 +0000 (16:30 -0500)]
netfilter: ipset: Add support for new bitmask parameter
Add a new parameter to complement the existing 'netmask' option. The
main difference between netmask and bitmask is that bitmask takes any
arbitrary ip address as input, it does not have to be a valid netmask.
The name of the new parameter is 'bitmask'. This lets us mask out
arbitrary bits in the ip address, for example:
ipset create set1 hash:ip bitmask 255.128.255.0
ipset create set2 hash:ip,port family inet6 bitmask ffff::ff80
Vishwanath Pai [Wed, 28 Sep 2022 18:26:50 +0000 (14:26 -0400)]
netfilter: ipset: regression in ip_set_hash_ip.c
This patch introduced a regression: commit 48596a8ddc46 ("netfilter:
ipset: Fix adding an IPv4 range containing more than 2^31 addresses")
The variable e.ip is passed to adtfn() function which finally adds the
ip address to the set. The patch above refactored the for loop and moved
e.ip = htonl(ip) to the end of the for loop.
What this means is that if the value of "ip" changes between the first
assignement of e.ip and the forloop, then e.ip is pointing to a
different ip address than "ip".
Test case:
$ ipset create jdtest_tmp hash:ip family inet hashsize 2048 maxelem 100000
$ ipset add jdtest_tmp 10.0.1.1/31
ipset v6.21.1: Element cannot be added to the set: it's already added
The value of ip gets updated inside the "else if (tb[IPSET_ATTR_CIDR])"
block but e.ip is still pointing to the old value.
Wolfram Sang [Mon, 7 Nov 2022 21:09:04 +0000 (22:09 +0100)]
netfilter: move from strlcpy with unused retval to strscpy
Follow the advice of the below link and prefer 'strscpy' in this
subsystem. Conversion is 1:1 because the return value is not used.
Generated by a coccinelle script.
Kees Cook [Mon, 7 Nov 2022 20:58:52 +0000 (21:58 +0100)]
netlink: Bounds-check struct nlmsgerr creation
In preparation for FORTIFY_SOURCE doing bounds-check on memcpy(),
switch from __nlmsg_put to nlmsg_put(), and explain the bounds check
for dealing with the memcpy() across a composite flexible array struct.
Avoids this future run-time warning:
memcpy: detected field-spanning write (size 32) of single field "&errmsg->msg" at net/netlink/af_netlink.c:2447 (size 16)
Cc: Jakub Kicinski <kuba@kernel.org> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Florian Westphal <fw@strlen.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: syzbot <syzkaller@googlegroups.com> Cc: netfilter-devel@vger.kernel.org Cc: coreteam@netfilter.org Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220901071336.1418572-1-keescook@chromium.org Signed-off-by: David S. Miller <davem@davemloft.net>
sched: consistently handle layer3 header accesses in the presence of VLANs
There are a couple of places in net/sched/ that check skb->protocol and act
on the value there. However, in the presence of VLAN tags, the value stored
in skb->protocol can be inconsistent based on whether VLAN acceleration is
enabled. The commit quoted in the Fixes tag below fixed the users of
skb->protocol to use a helper that will always see the VLAN ethertype.
However, most of the callers don't actually handle the VLAN ethertype, but
expect to find the IP header type in the protocol field. This means that
things like changing the ECN field, or parsing diffserv values, stops
working if there's a VLAN tag, or if there are multiple nested VLAN
tags (QinQ).
To fix this, change the helper to take an argument that indicates whether
the caller wants to skip the VLAN tags or not. When skipping VLAN tags, we
make sure to skip all of them, so behaviour is consistent even in QinQ
mode.
To make the helper usable from the ECN code, move it to if_vlan.h instead
of pkt_sched.h.
v3:
- Remove empty lines
- Move vlan variable definitions inside loop in skb_protocol()
- Also use skb_protocol() helper in IP{,6}_ECN_decapsulate() and
bpf_skb_ecn_set_ce()
v2:
- Use eth_type_vlan() helper in skb_protocol()
- Also fix code that reads skb->protocol directly
- Change a couple of 'if/else if' statements to switch constructs to avoid
calling the helper twice
Reported-by: Ilya Ponetayev <i.ponetaev@ndmsystems.com> Fixes: d8b9605d2697 ("net: sched: fix skb->protocol use in case of accelerated vlan path") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
netfilter: ipset: enforce documented limit to prevent allocating huge memory
Daniel Xu reported that the hash:net,iface type of the ipset subsystem does
not limit adding the same network with different interfaces to a set, which
can lead to huge memory usage or allocation failure.
The quick reproducer is
$ ipset create ACL.IN.ALL_PERMIT hash:net,iface hashsize 1048576 timeout 0
$ for i in $(seq 0 100); do /sbin/ipset add ACL.IN.ALL_PERMIT 0.0.0.0/0,kaf_$i timeout 0 -exist; done
The backtrace when vmalloc fails:
[Tue Oct 25 00:13:08 2022] ipset: vmalloc error: size 1073741848, exceeds total pages
<...>
[Tue Oct 25 00:13:08 2022] Call Trace:
[Tue Oct 25 00:13:08 2022] <TASK>
[Tue Oct 25 00:13:08 2022] dump_stack_lvl+0x48/0x60
[Tue Oct 25 00:13:08 2022] warn_alloc+0x155/0x180
[Tue Oct 25 00:13:08 2022] __vmalloc_node_range+0x72a/0x760
[Tue Oct 25 00:13:08 2022] ? hash_netiface4_add+0x7c0/0xb20
[Tue Oct 25 00:13:08 2022] ? __kmalloc_large_node+0x4a/0x90
[Tue Oct 25 00:13:08 2022] kvmalloc_node+0xa6/0xd0
[Tue Oct 25 00:13:08 2022] ? hash_netiface4_resize+0x99/0x710
<...>
The fix is to enforce the limit documented in the ipset(8) manpage:
> The internal restriction of the hash:net,iface set type is that the same
> network prefix cannot be stored with more than 64 different interfaces
> in a single set.
Reported-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
The parser assumes the set is an IPv4 ipset because IPSET_OPT_FAMILY is
not set.
# ipset-translate restore < ./ipset-mwan3_set_connected_ipv6.dump
add table inet global
add set inet global mwan3_connected_v6 { type ipv6_addr; flags interval; }
flush set inet global mwan3_connected_v6
ipset v7.15: Error in line 4: Syntax error: '64' is out of range 0-32
Remove ipset_xlate_type_get(), call ipset_xlate_set_get() instead to
obtain the set type and family.
Reported-by: Florian Eckert <fe@dev.tdt.de> Fixes: 325af556cd3a ("add ipset to nftables translation infrastructure") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
originally reported in
https://lists.opensuse.org/archives/list/factory@lists.opensuse.org/thread/ZIXKNQHSSCQ4ZLEGYYKLAXQ4PQ5EYFGZ/
by Larry Len Rainey
Signed-off-by: Bernhard M. Wiedemann <bwiedemann@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: ipset: Fix maximal range check in hash_ipportnet4_uadt()
Clang warns:
net/netfilter/ipset/ip_set_hash_ipportnet.c:249:29: warning: variable
'port_to' is uninitialized when used here [-Wuninitialized]
if (((u64)ip_to - ip + 1)*(port_to - port + 1) > IPSET_MAX_RANGE)
^~~~~~~
net/netfilter/ipset/ip_set_hash_ipportnet.c:167:45: note: initialize the
variable 'port_to' to silence this warning
u32 ip = 0, ip_to = 0, p = 0, port, port_to;
^
= 0
net/netfilter/ipset/ip_set_hash_ipportnet.c:249:39: warning: variable
'port' is uninitialized when used here [-Wuninitialized]
if (((u64)ip_to - ip + 1)*(port_to - port + 1) > IPSET_MAX_RANGE)
^~~~
net/netfilter/ipset/ip_set_hash_ipportnet.c:167:36: note: initialize the
variable 'port' to silence this warning
u32 ip = 0, ip_to = 0, p = 0, port, port_to;
^
= 0
2 warnings generated.
The range check was added before port and port_to are initialized.
Shuffle the check after the initialization so that the check works
properly.
Fixes: 7fb6c63025ff ("netfilter: ipset: Limit the maximal range of consecutive elements to
add/delete")
Limit the maximal range of consecutive elements to add/delete fix
Avoid possible number overflows when calculating the number of
consecutive elements. Also, compute properly the consecutive
elements in the case of hash:net* types.
Limit the maximal range of consecutive elements to add/delete
The range size of consecutive elements were not limited. Thus one
could define a huge range which may result soft lockup errors due
to the long execution time. Now the range size is limited to 2^20
entries. Reported by Brad Spengler.
ipset_parse_argv() parses, builds and send the netlink messages to the
kernel. This patch extracts the parser and wrap it around the new
ipset_parser() function.
This patch comes is preparation for the ipset to nftables translation
infrastructure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Jozsef Kadlecsik [Sat, 26 Jun 2021 20:14:38 +0000 (22:14 +0200)]
Fix patch "Parse port before trying by service name"
The patch broke parsing service names: number parsing failures
are hard errors which erase data, thus making impossible to
parse input as a string. Fix it by enabling soft (warning)
failures in the case of port number parsing.
Jozsef Kadlecsik [Wed, 17 Feb 2021 09:07:27 +0000 (10:07 +0100)]
Silence unused-but-set-variable warnings
When ipset is compiled in non-debug mode, in some environments warnings
treated as errors emitted:
session.c: In function 'build_msg':
session.c:1985:28: warning: variable 'type' set but not used
[-Wunused-but-set-variable]
const struct ipset_type *type;
^
session.c:2030:28: warning: variable 'type' set but not used
[-Wunused-but-set-variable]
const struct ipset_type *type;
^
Fix it by hiding the unused variable definitions/settings in non-debug mode.
Neutron Soutmun [Mon, 18 Jan 2021 04:58:30 +0000 (11:58 +0700)]
ipset: fix print format warning
* Use PRIx64 for portablility over various architectures.
* The format string for the 64bit number printing is incorrect,
the `%` sign is missing.
* The force types casting over the uint32_t and uint64_t are unnecessary
which warned by the compiler on different architecture.
Vasily Averin [Sun, 20 Dec 2020 12:21:13 +0000 (13:21 +0100)]
netfilter: ipset: fix shift-out-of-bounds in htable_bits()
htable_bits() can call jhash_size(32) and trigger shift-out-of-bounds
UBSAN: shift-out-of-bounds in net/netfilter/ipset/ip_set_hash_gen.h:151:6
shift exponent 32 is too large for 32-bit type 'unsigned int'
CPU: 0 PID: 8498 Comm: syz-executor519
Not tainted 5.10.0-rc7-next-20201208-syzkaller #0
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x107/0x163 lib/dump_stack.c:120
ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
__ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
htable_bits net/netfilter/ipset/ip_set_hash_gen.h:151 [inline]
hash_mac_create.cold+0x58/0x9b net/netfilter/ipset/ip_set_hash_gen.h:1524
ip_set_create+0x610/0x1380 net/netfilter/ipset/ip_set_core.c:1115
nfnetlink_rcv_msg+0xecc/0x1180 net/netfilter/nfnetlink.c:252
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
nfnetlink_rcv+0x1ac/0x420 net/netfilter/nfnetlink.c:600
netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330
netlink_sendmsg+0x907/0xe40 net/netlink/af_netlink.c:1919
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:672
____sys_sendmsg+0x6e8/0x810 net/socket.c:2345
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2432
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This patch replaces htable_bits() by simple fls(hashsize - 1) call:
it alone returns valid nbits both for round and non-round hashsizes.
It is normal to set any nbits here because it is validated inside
following htable_size() call which returns 0 for nbits>31.
Vasily Averin [Sun, 20 Dec 2020 12:17:01 +0000 (13:17 +0100)]
netfilter: ipset: fixes possible oops in mtype_resize
currently mtype_resize() can cause oops
t = ip_set_alloc(htable_size(htable_bits));
if (!t) {
ret = -ENOMEM;
goto out;
}
t->hregion = ip_set_alloc(ahash_sizeof_regions(htable_bits));
Increased htable_bits can force htable_size() to return 0.
In own turn ip_set_alloc(0) returns not 0 but ZERO_SIZE_PTR,
so follwoing access to t->hregion should trigger an OOPS.
Vasily Averin [Thu, 19 Nov 2020 13:59:51 +0000 (14:59 +0100)]
netfilter: ipset: enable memory accounting for ipset allocations
Currently netadmin inside non-trusted container can quickly allocate
whole node's memory via request of huge ipset hashtable.
Other ipset-related memory allocations should be restricted too.
The parameter defines the upper limit in any hash bucket at adding new entries
from userspace - if the limit would be exceeded, ipset doubles the hash size
and rehashes. It means the set may consume more memory but gives faster
evaluation at matching in the set.
The -exist flag was supported with the create, add and delete commands.
In order to gracefully handle the destroy command with nonexistent sets,
the -exist flag is added to destroy too.
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
netfilter: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
Lastly, fix checkpatch.pl warning
WARNING: __aligned(size) is preferred over __attribute__((aligned(size)))
in net/bridge/netfilter/ebtables.c
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Russell King [Wed, 10 Jun 2020 20:51:11 +0000 (21:51 +0100)]
netfiler: ipset: fix unaligned atomic access
When using ip_set with counters and comment, traffic causes the kernel
to panic on 32-bit ARM:
Alignment trap: not handling instruction e1b82f9f at [<bf01b0dc>]
Unhandled fault: alignment exception (0x221) at 0xea08133c
PC is at ip_set_match_extensions+0xe0/0x224 [ip_set]
The problem occurs when we try to update the 64-bit counters - the
faulting address above is not 64-bit aligned. The problem occurs
due to the way elements are allocated, for example:
If the element has a requirement for a member to be 64-bit aligned,
and set->dsize is not a multiple of 8, but is a multiple of four,
then every odd numbered elements will be misaligned - and hitting
an atomic64_add() on that element will cause the kernel to panic.
ip_set_elem_len() must return a size that is rounded to the maximum
alignment of any extension field stored in the element. This change
ensures that is the case.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Phil Sutter [Thu, 14 May 2020 11:31:21 +0000 (13:31 +0200)]
netfilter: ipset: Fix subcounter update skip
If IPSET_FLAG_SKIP_SUBCOUNTER_UPDATE is set, user requested to not
update counters in sub sets. Therefore IPSET_FLAG_SKIP_COUNTER_UPDATE
must be set, not unset.
Fixes: 6e01781d1c80e ("netfilter: ipset: set match: add support to match the counters") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Stefano Brivio [Mon, 24 Feb 2020 17:52:43 +0000 (18:52 +0100)]
ipset: Update byte and packet counters regardless of whether they match
In ip_set_match_extensions(), for sets with counters, we take care of
updating counters themselves by calling ip_set_update_counter(), and of
checking if the given comparison and values match, by calling
ip_set_match_counter() if needed.
However, if a given comparison on counters doesn't match the configured
values, that doesn't mean the set entry itself isn't matching.
This fix restores the behaviour we had before commit 4750005a85f7
("netfilter: ipset: Fix "don't update counters" mode when counters used
at the matching"), without reintroducing the issue fixed there: back
then, mtype_data_match() first updated counters in any case, and then
took care of matching on counters.
Now, if the IPSET_FLAG_SKIP_COUNTER_UPDATE flag is set,
ip_set_update_counter() will anyway skip counter updates if desired.
The issue observed is illustrated by this reproducer:
ipset create c hash:ip counters
ipset add c 192.0.2.1
iptables -I INPUT -m set --match-set c src --bytes-gt 800 -j DROP
if we now send packets from 192.0.2.1, bytes and packets counters
for the entry as shown by 'ipset list' are always zero, and, no
matter how many bytes we send, the rule will never match, because
counters themselves are not updated.
Reported-by: Mithil Mhatre <mmhatre@redhat.com> Fixes: 4750005a85f7 ("netfilter: ipset: Fix "don't update counters" mode when counters used at the matching") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Serhey Popovych [Thu, 5 Mar 2020 15:28:24 +0000 (17:28 +0200)]
ip_set: Fix compatibility with kernels between v3.3 and v4.5
These kernels does not have in their @struct netlink_dump_control method
that is used to prepare for netlink dump ->start(). This affects all
kernels that does not contain commit fc9e50f5a5a4 ("netlink: add a start
callback for starting a netlink dump").
Introduce fake value of HAVE_NETLINK_DUMP_START_ARGS equal to 7 that
never spot in the wild and set HAVE_NETLINK_DUMP_START_ARGS to 4 only
after explicit test if ->start() is available.
Introduce --update-counters-first flag for the set target
Stefano Brivio reported that the patch 'netfilter: ipset:
Fix "don't update counters" mode when counters used at the matching'
changed the semantic of when the counters are updated. Before the patch
the counters were updated regardless of the results of the counter
matches, after the patch the counters were updated only if the counter
match conditions (if specified) matched the packet. In order to handle
both ways, the --update-counters-first flag is introduced: when the flag
is specified, the counters are updated before checking the counter match
conditions. Without the flag the current evaluation path (i.e. update
only if counter conditions match) works.
Serhey Popovych [Thu, 5 Mar 2020 15:28:23 +0000 (17:28 +0200)]
ip_set: Fix build on kernels without INIT_DEFERRABLE_WORK
There was macro rename in kernel with commit 203b42f73174 ("workqueue:
make deferrable delayed_work initializer names consistent") that renames
INIT_DELAYED_WORK_DEFERRABLE() to INIT_DEFERRABLE_WORK().