netfilter: ipset: fix region locking in hash types
Region locking introduced in v5.6-rc4 contained three macros to handle
the region locks: ahash_bucket_start(), ahash_bucket_end() which gave
back the start and end hash bucket values belonging to a given region
lock and ahash_region() which should give back the region lock belonging
to a given hash bucket. The latter was incorrect which can lead to a
race condition between the garbage collector and adding new elements
when a hash type of set is defined with timeouts.
Fixes: f66ee0410b1c ("netfilter: ipset: Fix "INFO: rcu detected stall in hash_xxx" reports") Reported-by: Kota Toda <kota.toda@gmo-cybersecurity.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Jeremy Sowden [Fri, 7 Feb 2025 20:08:13 +0000 (20:08 +0000)]
bash-completion: restore fix for syntax error
There is a syntax error in a redirection:
$ bash -x utils/ipset_bash_completion/ipset
+ shopt -s extglob
utils/ipset_bash_completion/ipset: line 365: syntax error near unexpected token `('
utils/ipset_bash_completion/ipset: line 365: `done < <(PATH=${PATH}:/sbin ( command ip -o link show ) 2>/dev/null)'
Move the environment variable assignment into the sub-shell.
This fix was previously applied in commit 417ee1054fb2 ("bash-completion:
fix syntax error"), but then reverted, presumably by mistake, in commit 0378d91222c1 ("Bash completion utility updated").
Phil Sutter [Tue, 17 Dec 2024 19:56:55 +0000 (20:56 +0100)]
netfilter: ipset: Fix for recursive locking warning
With CONFIG_PROVE_LOCKING, when creating a set of type bitmap:ip, adding
it to a set of type list:set and populating it from iptables SET target
triggers a kernel warning:
| WARNING: possible recursive locking detected
| 6.12.0-rc7-01692-g5e9a28f41134-dirty #594 Not tainted
| --------------------------------------------
| ping/4018 is trying to acquire lock:
| ffff8881094a6848 (&set->lock){+.-.}-{2:2}, at: ip_set_add+0x28c/0x360 [ip_set]
|
| but task is already holding lock:
| ffff88811034c048 (&set->lock){+.-.}-{2:2}, at: ip_set_add+0x28c/0x360 [ip_set]
This is a false alarm: ipset does not allow nested list:set type, so the
loop in list_set_kadd() can never encounter the outer set itself. No
other set type supports embedded sets, so this is the only case to
consider.
To avoid the false report, create a distinct lock class for list:set
type ipset locks.
Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Phil Sutter [Fri, 29 Nov 2024 15:30:38 +0000 (16:30 +0100)]
netfilter: ipset: Hold module reference while requesting a module
User space may unload ip_set.ko while it is itself requesting a set type
backend module, leading to a kernel crash. The race condition may be
provoked by inserting an mdelay() right after the nfnl_unlock() call.
Fixes: a7b4f989a629 ("netfilter: ipset: IP set core support") Signed-off-by: Phil Sutter <phil@nwl.cc> Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Jeongjun Park [Wed, 13 Nov 2024 13:02:09 +0000 (22:02 +0900)]
netfilter: ipset: add missing range check in bitmap_ip_uadt
When tb[IPSET_ATTR_IP_TO] is not present but tb[IPSET_ATTR_CIDR] exists,
the values of ip and ip_to are slightly swapped. Therefore, the range check
for ip should be done later, but this part is missing and it seems that the
vulnerability occurs.
So we should add missing range checks and remove unnecessary range checks.
Cc: <stable@vger.kernel.org> Reported-by: syzbot+58c872f7790a4d2ac951@syzkaller.appspotmail.com Fixes: 72205fc68bd1 ("netfilter: ipset: bitmap:ip set type support") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
When destroying all sets, we are either in pernet exit phase or
are executing a "destroy all sets command" from userspace. The latter
was taken into account in ip_set_dereference() (nfnetlink mutex is held),
but the former was not. The patch adds the required check to
rcu_dereference_protected() in ip_set_dereference().
Fixes: 4e7aaa6b82d6 ("netfilter: ipset: Fix race between namespace cleanup and gc in the list:set type") Reported-by: syzbot+b62c37cdd58103293a5a@syzkaller.appspotmail.com Reported-by: syzbot+cfbe1da5fdfc39efc293@syzkaller.appspotmail.com Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202406141556.e0b6f17e-lkp@intel.com Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
To cover for this, make net_last_addr() fall back to the 'Address:'
line. Simply adding this keyword is fine as in normal output it appears
first and thus the other recognized keywords' values take precedence.
Phil Sutter [Thu, 27 Jun 2024 08:18:18 +0000 (10:18 +0200)]
tests: Reduce testsuite run-time
Where acceptable, batch add set element calls to avoid overhead of
excessive 'ipset' program spawns. On my (slow) testing VM, this patch
reduces a full run of tests/runtest.sh from ~70min down to ~11min.
This might eliminate the situation being tested: resize.sh might be such
a case so batch only 255 'ipset add' calls and continue to repeat these
batched calls 32 times in hopes that it still qualifies as the resizing
stress test tests/hash:ip.t calls it.
Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Phil Sutter [Thu, 27 Jun 2024 08:18:16 +0000 (10:18 +0200)]
lib: data: Fix for global-buffer-overflow warning by ASAN
After compiling with CFLAGS="-fsanitize=address -g", running the
testsuite triggers the following warning:
| ipmap: Range: Check syntax error: missing range/from-to: FAILED
| Failed test: ../src/ipset 2>.foo.err -N test ipmap
| =================================================================
| ==4204==ERROR: AddressSanitizer: global-buffer-overflow on address 0x55a21e77172a at pc 0x7f1ef246f2a6 bp 0x7fffed8f4f40 sp 0x7fffed8f46e8
| READ of size 32 at 0x55a21e77172a thread T0
| #0 0x7f1ef246f2a5 in __interceptor_memcpy /var/tmp/portage/sys-devel/gcc-13.2.1_p20231014/work/gcc-13-20231014/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:899
| #1 0x55a21e758bf6 in ipset_strlcpy /home/n0-1/git/ipset/lib/data.c:119
| #2 0x55a21e758bf6 in ipset_data_set /home/n0-1/git/ipset/lib/data.c:349
| #3 0x55a21e75ee2f in ipset_parse_typename /home/n0-1/git/ipset/lib/parse.c:1819
| #4 0x55a21e754119 in ipset_parser /home/n0-1/git/ipset/lib/ipset.c:1205
| #5 0x55a21e752cef in ipset_parse_argv /home/n0-1/git/ipset/lib/ipset.c:1344
| #6 0x55a21e74ea45 in main /home/n0-1/git/ipset/src/ipset.c:38
| #7 0x7f1ef224cf09 (/lib64/libc.so.6+0x23f09)
| #8 0x7f1ef224cfc4 in __libc_start_main (/lib64/libc.so.6+0x23fc4)
| #9 0x55a21e74f040 in _start (/home/n0-1/git/ipset/src/ipset+0x1d040)
|
| 0x55a21e77172a is located 54 bytes before global variable '*.LC1' defined in 'ipset_bitmap_ip.c' (0x55a21e771760) of size 19
| '*.LC1' is ascii string 'IP|IP/CIDR|FROM-TO'
| 0x55a21e77172a is located 0 bytes after global variable '*.LC0' defined in 'ipset_bitmap_ip.c' (0x55a21e771720) of size 10
| '*.LC0' is ascii string 'bitmap:ip'
Fix this by avoiding 'src' array overstep in ipset_strlcpy(): In
contrast to strncpy(), memcpy() does not respect NUL-chars in input but
stubbornly reads as many bytes as specified.
netfilter: ipset: Fix race between namespace cleanup and gc in the list:set type
Lion Ackermann reported that there is a race condition between namespace cleanup
in ipset and the garbage collection of the list:set type. The namespace
cleanup can destroy the list:set type of sets while the gc of the set type is
waiting to run in rcu cleanup. The latter uses data from the destroyed set which
thus leads use after free. The patch contains the following parts:
- When destroying all sets, first remove the garbage collectors, then wait
if needed and then destroy the sets.
- Fix the badly ordered "wait then remove gc" for the destroy a single set
case.
- Fix the missing rcu locking in the list:set type in the userspace test
case.
- Use proper RCU list handlings in the list:set type.
The patch depends on 975403cda657 (netfilter: ipset: Add list flush to cancel_gc).
Flushing list in cancel_gc drops references to other lists right away,
without waiting for RCU to destroy list. Fixes race when referenced
ipsets can't be destroyed while referring list is scheduled for destroy.
Signed-off-by: Alexander Maltsev <keltar.gw@gmail.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Pavel Tikhomirov [Wed, 22 May 2024 17:54:22 +0000 (19:54 +0200)]
netfilter: propagate net to nf_bridge_get_physindev
This is a preparation patch for replacing physindev with physinif on
nf_bridge_info structure. We will use dev_get_by_index_rcu to resolve
device, when needed, and it requires net to be available.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Jozsef Kadlecsik [Mon, 12 Feb 2024 11:51:17 +0000 (12:51 +0100)]
netfilter: ipset: Suppress false sparse warnings
Due to the code reorganization the functions in question now run by call_rcu(),
not under rcu locking and pointer access. This produces false sparse warning
which are suppressed by the patch.
netfilter: ipset: remove set destroy at ip_set module removal
The ip_set module can only be removed when all set module type modules
are already removed. A set type module can only be removed when all sets
belonging to the given type are already removed. So it is not possible
that there's any set defined at ip_set module removal.
The patch fdb8e12cc2cc ("netfilter: ipset: fix performance regression
in swap operation") missed to add the calls to gc cancellations
at the error path of create operations and at module unload. Also,
because the half of the destroy operations now executed by a
function registered by call_rcu(), neither NFNL_SUBSYS_IPSET mutex
or rcu read lock is held and therefore the checking of them results
false warnings.
treewide: Convert del_timer*() to timer_shutdown*()
Due to several bugs caused by timers being re-armed after they are
shutdown and just before they are freed, a new state of timers was added
called "shutdown". After a timer is set to this state, then it can no
longer be re-armed.
The following script was run to find all the trivial locations where
del_timer() or del_timer_sync() is called in the same function that the
object holding the timer is freed. It also ignores any locations where
the timer->function is modified between the del_timer*() and the free(),
as that is not considered a "trivial" case.
This was created by using a coccinelle script and the following
commands:
Jozsef Kadlecsik [Mon, 29 Jan 2024 11:30:23 +0000 (12:30 +0100)]
netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test v4
The patch "netfilter: ipset: fix race condition between swap/destroy
and kernel side add/del/test", commit 28628fa9 fixes a race condition.
But the synchronize_rcu() added to the swap function unnecessarily slows
it down: it can safely be moved to destroy and use call_rcu() instead.
Eric Dumazet pointed out that simply calling the destroy functions as
rcu callback does not work: sets with timeout use garbage collectors
which need cancelling at destroy which can wait. Therefore the destroy
functions are split into two: cancelling garbage collectors safely at
executing the command received by netlink and moving the remaining
part only into the rcu callback.
Link: https://lore.kernel.org/lkml/C0829B10-EAA6-4809-874E-E1E9C05A8D84@automattic.com/ Fixes: 28628fa952fe ("netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test") Reported-by: Ale Crismani <ale.crismani@automattic.com> Reported-by: David Wang <00107082@163.com> Tested-by: David Wang <00107082@163.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Jozsef Kadlecsik [Mon, 11 Dec 2023 10:30:30 +0000 (11:30 +0100)]
netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test v3
Florian Westphal pointed out that all netfilter hooks run with rcu_read_lock() held
and em_ipset.c wraps the entire ip_set_test() in rcu read lock/unlock pair.
So there's no need to extend the rcu read locked area in ipset itself.
Jozsef Kadlecsik [Thu, 19 Oct 2023 18:41:53 +0000 (20:41 +0200)]
netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test
Linkui Xiao reported that there's a race condition when ipset swap and destroy is
called, which can lead to crash in add/del/test element operations. Swap then
destroy are usual operations to replace a set with another one in a production
system. The issue can in some cases be reproduced with the script:
Swap replaces hash_ip1 with hash_ip2 and then destroy removes hash_ip2 which
is the original hash_ip1. ip_set_test was called on hash_ip1 and because destroy
removed it, hash_net_kadt crashes.
The fix is to protect both the list of the sets and the set pointers in an extended RCU
region and before calling destroy, wait to finish all started rcu_read_lock().
The first version of the patch was written by Linkui Xiao <xiaolinkui@kylinos.cn>.
netfilter: ipset: Fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP
Kyle Zeng reported that there is a race between IPSET_CMD_ADD and IPSET_CMD_SWAP
in netfilter/ip_set, which can lead to the invocation of `__ip_set_put` on a wrong
`set`, triggering the `BUG_ON(set->ref == 0);` check in it.
The race is caused by using the wrong reference counter, i.e. the ref counter instead
of ref_netlink.
netfilter: ipset: add the missing IP_SET_HASH_WITH_NET0 macro for ip_set_hash_netportnet.c
The missing IP_SET_HASH_WITH_NET0 macro in ip_set_hash_netportnet can
lead to the use of wrong `CIDR_POS(c)` for calculating array offsets,
which can lead to integer underflow. As a result, it leads to slab
out-of-bound access.
This patch adds back the IP_SET_HASH_WITH_NET0 macro to
ip_set_hash_netportnet to address the issue.
Azeem Shaikh [Tue, 13 Jun 2023 00:34:37 +0000 (00:34 +0000)]
netfilter: ipset: Replace strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
Direct replacement is safe here since return value from all
callers of STRLCPY macro were ignored.
netfilter: ipset: Add schedule point in call_ad().
syzkaller found a repro that causes Hung Task [0] with ipset. The repro
first creates an ipset and then tries to delete a large number of IPs
from the ipset concurrently:
The first deleting thread hogs a CPU with nfnl_lock(NFNL_SUBSYS_IPSET)
held, and other threads wait for it to be released.
Previously, the same issue existed in set->variant->uadt() that could run
so long under ip_set_lock(set). Commit 5e29dc36bd5e ("netfilter: ipset:
Rework long task execution when adding/deleting entries") tried to fix it,
but the issue still exists in the caller with another mutex.
While adding/deleting many IPs, we should release the CPU periodically to
prevent someone from abusing ipset to hang the system.
Note we need to increment the ipset's refcnt to prevent the ipset from
being destroyed while rescheduling.
Nowadays, carp seems to be the preferred name for protocol 112. Simply
change the expected output for lack of idea for a backwards compatible
change which is not simply using another protocol.
Phil Sutter [Sun, 5 Mar 2023 11:46:49 +0000 (12:46 +0100)]
tests: xlate: Make test input valid
Make sure ipset at least accepts the test input by running it against
plain ipset once for sanity. This exposed two issues:
* Set 'hip5' doesn't have comment support, so add the commented elements
to 'hip6' instead (likely a typo).
* Set 'bip1' range 2.0.0.1-2.1.0.1 exceeds the max allowed for bitmap
sets. Reduce it accordingly.
Fixes: 7587d1c4b5465 ("tests: add tests ipset to nftables") Signed-off-by: Phil Sutter <phil@nwl.cc>
Phil Sutter [Sun, 5 Mar 2023 11:43:23 +0000 (12:43 +0100)]
tests: xlate: Test built binary by default
Testing the host's iptables-translate by default is unintuitive. Since
the ipset-translate symlink is created upon 'make install', add a local
symlink to the repository pointing at a built binary in src/. Using this
by default is consistent with the regular testsuite.
Sam James [Sat, 28 Jan 2023 18:25:33 +0000 (19:25 +0100)]
configure.ac: fix bashisms
configure scripts need to be runnable with a POSIX-compliant /bin/sh.
On many (but not all!) systems, /bin/sh is provided by Bash, so errors
like this aren't spotted. Notably Debian defaults to /bin/sh provided
by dash which doesn't tolerate such bashisms as '=='.
This retains compatibility with bash.
Signed-off-by: Sam James <sam@gentoo.org> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Gavrilov Ilia [Sat, 28 Jan 2023 18:09:52 +0000 (19:09 +0100)]
netfilter: ipset: Fix overflow before widen in the bitmap_ip_create() function.
When first_ip is 0, last_ip is 0xFFFFFFFF, and netmask is 31, the value of
an arithmetic expression 2 << (netmask - mask_bits - 1) is subject
to overflow due to a failure casting operands to a larger data type
before performing the arithmetic.
Note that it's harmless since the value will be checked at the next step.
Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.
Jozsef Kadlecsik [Fri, 30 Dec 2022 11:32:37 +0000 (12:32 +0100)]
netfilter: ipset: Rework long task execution when adding/deleting entries
When adding/deleting large number of elements in one step in ipset, it can
take a reasonable amount of time and can result in soft lockup errors. The
patch 5f7b51bf09ba ("netfilter: ipset: Limit the maximal range of
consecutive elements to add/delete") tried to fix it by limiting the max
elements to process at all. However it was not enough, it is still possible
that we get hung tasks. Lowering the limit is not reasonable, so the
approach in this patch is as follows: rely on the method used at resizing
sets and save the state when we reach a smaller internal batch limit,
unlock/lock and proceed from the saved state. Thus we can avoid long
continuous tasks and at the same time removed the limit to add/delete large
number of elements in one step.
The nfnl mutex is held during the whole operation which prevents one to issue
other ipset commands in parallel.
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Reported-by: syzbot+9204e7399656300bf271@syzkaller.appspotmail.com Fixes: 5f7b51bf09ba ("netfilter: ipset: Limit the maximal range of consecutive elements to add/delete")
Jozsef Kadlecsik [Thu, 29 Dec 2022 14:00:21 +0000 (15:00 +0100)]
netfilter: ipset: fix hash:net,port,net hang with /0 subnet
The hash:net,port,net set type supports /0 subnets. However, the patch
commit 5f7b51bf09baca8e titled "netfilter: ipset: Limit the maximal range
of consecutive elements to add/delete" did not take into account it and
resulted in an endless loop. The bug is actually older but the patch 5f7b51bf09baca8e brings it out earlier.
Handle /0 subnets properly in hash:net,port,net set types.
Reported-by: Марк Коренберг <socketpair@gmail.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Jozsef Kadlecsik [Mon, 21 Nov 2022 12:16:56 +0000 (13:16 +0100)]
netfilter: ipset: restore allowing 64 clashing elements in hash:net,iface
The patch "netfilter: ipset: enforce documented limit to prevent allocating
huge memory" was too strict and prevented to add up to 64 clashing elements
to a hash:net,iface type of set. This patch fixes the issue and now the type
behaves as documented.
Vishwanath Pai [Thu, 10 Nov 2022 21:31:26 +0000 (16:31 -0500)]
netfilter: ipset: Add support for new bitmask parameter
Add a new parameter to complement the existing 'netmask' option. The
main difference between netmask and bitmask is that bitmask takes any
arbitrary ip address as input, it does not have to be a valid netmask.
The name of the new parameter is 'bitmask'. This lets us mask out
arbitrary bits in the ip address, for example:
ipset create set1 hash:ip bitmask 255.128.255.0
ipset create set2 hash:ip,port family inet6 bitmask ffff::ff80