Jo-Philipp Wich [Mon, 8 Aug 2022 22:18:42 +0000 (00:18 +0200)]
meta: don't use non-POSIX formats in strptime()
The current strptime() invocations in meta.c use the `%F` format which
is not specified by POSIX and thus unimplemented by some libc flavors
such as musl libc.
Replace all occurrences of `%F` with an equivalent `%Y-%m-%d` format
in order to be able to properly parse user supplied dates in such
environments.
Signed-off-by: Jo-Philipp Wich <jo@mein.io> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
src: allow anon set concatenation with ether and vlan
vlan id uses integer type (which has a length of 0).
Using it was possible, but listing would assert:
python: mergesort.c:24: concat_expr_msort_value: Assertion `ilen > 0' failed.
There are two reasons for this.
First reason is that the udata/typeof information lacks the 'vlan id'
part, because internally this is 'payload . binop(payload AND mask)'.
binop lacks an udata store. It makes little sense to store it,
'typeof' keyword expects normal match syntax.
So, when storing udata, store the left hand side of the binary
operation, i.e. the load of the 2-byte key.
With that resolved, delinerization could work, but concat_elem_expr()
would splice 12 bits off the elements value, but it should be 16 (on
a byte boundary).
evaluate: search stacked header list for matching payload dep
"ether saddr 0:1:2:3:4:6 vlan id 2" works, but reverse fails:
"vlan id 2 ether saddr 0:1:2:3:4:6" will give
Error: conflicting protocols specified: vlan vs. ether
After "proto: track full stack of seen l2 protocols, not just cumulative offset",
we have a list of all l2 headers, so search those to see if we had this
proto base in the past before rejecting this.
Reported-by: Eric Garver <eric@garver.life> Signed-off-by: Florian Westphal <fw@strlen.de>
proto: track full stack of seen l2 protocols, not just cumulative offset
For input, a cumulative size counter of all pushed l2 headers is enough,
because we have the full expression tree available to us.
For delinearization we need to track all seen l2 headers, else we lose
information that we might need at a later time.
Consider:
rule netdev nt nc set update ether saddr . vlan id
during delinearization, the vlan proto_desc replaces the ethernet one,
and by the time we try to split the concatenation apart we will search
the ether saddr offset vs. the templates for proto_vlan.
This replaces the offset with an array that stores the protocol
descriptions seen.
Then, if the payload offset is larger than our description, search the
l2 stack and adjust the offset until we're within the expected offset
boundary.
Reported-by: Eric Garver <eric@garver.life> Signed-off-by: Florian Westphal <fw@strlen.de>
After this, listing will show:
update @macset { @ll,48,48 . vlan id timeout 5s }
@ll,48,48 . vlan id @macset
The @ll, ... is due to vlan description replacing the ethernet one,
so payload decode fails to take the concatenation apart (the ethernet
header payload info is matched vs. vlan template).
cache: release pending rules when chain binding lookup fails
If the implicit chain is not in the cache, release pending rules in
ctx->list and report EINTR to let the cache core retry to populate a
consistent cache.
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1402 Fixes: c330152b7f77 ("src: support for implicit chain bindings") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Thu, 23 Jun 2022 17:56:19 +0000 (19:56 +0200)]
scanner: don't pop active flex scanner scope
Currently we can pop a flex scope that is still active, i.e. the
scanner_pop_start_cond() for the scope has not been done.
Example:
counter ipsec out ip daddr 192.168.1.2 counter name "ipsec_out"
Here, parser fails because 'daddr' is parsed as STRING, not as DADDR token.
Bug is as follows:
COUNTER changes scope to COUNTER. (COUNTER).
Next, IPSEC scope gets pushed, stack is: COUNTER, IPSEC.
Then, the 'COUNTER' scope close happens. Because active scope has changed,
we cannot pop (we would pop the 'ipsec' scope in flex).
The pop operation gets delayed accordingly.
Next, IP gets pushed, stack is: COUNTER, IPSEC, IP, plus the information
that one scope closure/pop was delayed.
Then, the IP scope is closed. Because a pop operation was delayed, we pop again,
which brings us back to COUNTER state.
This is bogus: The pop operation CANNOT be done yet, because the ipsec scope
is still open, but the existing code lacks the information to detect this.
After popping the IP scope, we must remain in IPSEC scope until bison
parser calls scanner_pop_start_cond(, IPSEC).
This adds a counter per flex scope so that we can detect this case.
In above case, after the IP scope gets closed, the "new" (previous)
scope (IPSEC) will be treated as active and its close is attempted again
on the next call to scanner_pop_start_cond().
After this patch, transition in above rule is:
push counter (COUNTER)
push IPSEC (COUNTER, IPSEC)
pop COUNTER (delayed: COUNTER, IPSEC, pending-pop for COUNTER),
push IP (COUNTER, IPSEC, IP, pending-pop for COUNTER)
pop IP (COUNTER, IPSEC, pending-pop for COUNTER)
parse DADDR (we're in IPSEC scope, its valid token)
pop IPSEC (pops all remaining scopes).
We could also resurrect the commit:
"scanner: flags: move to own scope", the test case passes with the
new scope closure logic.
Fixes: bff106c5b277 ("scanner: add support for scope nesting") Signed-off-by: Florian Westphal <fw@strlen.de>
The error in the set/map definition is correctly caught and queued, but
because the set is invalid and does not contain a key type, adding to it
causes a NULL pointer dereference of set->key within setelem_evaluate().
I don't think it's necessary to queue another error since the underlying
problem is correctly detected and reported when parsing the definition
of the set. Simply checking the validity of set->key before using it
seems to fix it, causing the error in the definition of the set to be
reported properly. The element type error isn't caught, but that seems
reasonable since the key type is invalid or unknown anyway:
$ ./nft -c -f ~/nftables.conf.crash
/home/pti/nftables.conf.crash:3:21-21: Error: set definition does not specify key
set inet filter foo {}
^
[ Add tests to cover this case --pablo ]
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1597 Signed-off-by: Peter Tirsek <peter@tirsek.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
mnl: store netlink error location for set elements
Store set element location in the per-command netlink error location
array. This allows for fine grain error reporting when adding and
deleting elements.
# nft -f test.nft
test.nft:5:4-20: Error: Could not process rule: File exists
00:01:45:09:0b:26 : drop,
^^^^^^^^^^^^^^^^^
test.nft contains a large map with one redundant entry.
Thus, users do not have to find the needle in the stack.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Revert support for limit statement, the limit statement is stateful and
it applies a ratelimit per rule, transformation for merging rules with
the limit statement needs to use anonymous sets with statements.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Do not try to merge rules with unsupported statements. This patch adds a
dummy unsupported statement which is included in the statement
collection and the rule vs statement matrix.
When looking for possible rule mergers, rules using unsupported
statements are discarded, otherwise bogus rule mergers might occur.
Note that __stmt_type_eq() already returns false for unsupported
statements.
Add a test using meta mark statement, which is not yet supported.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Do not print stateful information such as counters which are likely set
to zero.
Before this patch:
Merging:
packets.conf:10:3-29: ip protocol 4 counter drop
packets.conf:11:3-29: ip protocol 41 counter drop
packets.conf:12:3-29: ip protocol 47 counter drop
into:
ip protocol { 4, 41, 47 } counter packets 0 bytes 0 drop
^^^^^^^^^^^^^^^^^
After:
Merging:
packets.conf:10:3-29: ip protocol 4 counter drop
packets.conf:11:3-29: ip protocol 41 counter drop
packets.conf:12:3-29: ip protocol 47 counter drop
into:
ip protocol { 4, 41, 47 } counter drop
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Phil Sutter [Thu, 16 Jun 2022 08:56:12 +0000 (10:56 +0200)]
intervals: Do not sort cached set elements over and over again
When adding element(s) to a non-empty set, code merged the two lists and
sorted the result. With many individual 'add element' commands this
causes substantial overhead. Make use of the fact that
existing_set->init is sorted already, sort only the list of new elements
and use list_splice_sorted() to merge the two sorted lists.
Add set_sort_splice() and use it for set element overlap detection and
automerge.
A test case adding ~25k elements in individual commands completes in
about 1/4th of the time with this patch applied.
Joint work with Pablo.
Fixes: 3da9643fb9ff9 ("intervals: add support to automerge with kernel elements") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Robots might generate a long list of singleton element commands such as:
add element t s { 1.0.1.0/24 }
...
add element t s { 1.0.2.0/23 }
collapse them into one single command before the evaluation step, ie.
add element t s { 1.0.1.0/24, ..., 1.0.2.0/23 }
this speeds up overlap detection and set element automerge operations in
this worst case scenario.
Since 3da9643fb9ff9 ("intervals: add support to automerge with kernel
elements"), the new interval tracking relies on mergesort. The pattern
above triggers the set sorting for each element.
This patch adds a list to cmd objects that store collapsed commands.
Moreover, expressions also contain a reference to the original command,
to uncollapse the commands after the evaluation step.
These commands are uncollapsed after the evaluation step to ensure error
reporting works as expected (command and netlink message are mapped
1:1).
For the record:
- nftables versions <= 1.0.2 did not perform any kind of overlap
check for the described scenario above (because set cache only contained
elements in the kernel in this case). This is a problem for kernels < 5.7
which rely on userspace to detect overlaps.
- the overlap detection could be skipped for kernels >= 5.7.
- The extended netlink error reporting available for set elements
since 5.19-rc might allow to remove the uncollapse step, in this case,
error reporting does not rely on the netlink sequence to refer to the
command triggering the problem.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Fri, 10 Jun 2022 11:01:46 +0000 (13:01 +0200)]
Revert "scanner: flags: move to own scope"
Excess nesting of scanner scopes is very fragile and error prone:
rule `iif != lo ip daddr 127.0.0.1/8 counter limit rate 1/second log flags all prefix "nft_lo4 " drop`
fails with `Error: No symbol type information` hinting at `prefix`
Problem is that we nest via:
counter
limit
log
flags
By the time 'prefix' is scanned, state is still stuck in 'counter' due
to this nesting. Working around "prefix" isn't enough, any other
keyword, e.g. "level" in 'flags all level debug' will be parsed as 'string' too.
So, revert this.
Fixes: a16697097e2b ("scanner: flags: move to own scope") Reported-by: Christian Göttsche <cgzones@googlemail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
evaluate: reset ctx->set after set interval evaluation
Otherwise bogus error reports on set datatype mismatch might occur, such as:
Error: datatype mismatch, expected Internet protocol, expression has type IPv4 address
meta l4proto { tcp, udp } th dport 443 dnat to 10.0.0.1
~~~~~~~~~~~~ ^^^^^^^^^^^^
with an unrelated set declaration.
table ip test {
set set_with_interval {
type ipv4_addr
flags interval
}
chain prerouting {
type nat hook prerouting priority dstnat; policy accept;
meta l4proto { tcp, udp } th dport 443 dnat to 10.0.0.1
}
}
This bug has been introduced in the evaluation step.
Reported-by: Roman Petrov <nwhisper@gmail.com> Fixes: 81e36530fcac ("src: replace interval segment tree overlap and automerge)" Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use pr_gmp_debug() instead to compile with minigmp.
intervals.c: In function ‘set_delete’:
intervals.c:489:25: warning: implicit declaration of function ‘gmp_printf’; did you mean ‘gmp_vfprintf’? [-Wimplicit-function-declaration]
489 | gmp_printf("remove: [%Zx-%Zx]\n",
| ^~~~~~~~~~
| gmp_vfprintf
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netlink: swap byteorder for host-endian concat data
All data must be passed in network byte order, else matching
won't work respectively kernel will reject the interval because
it thinks that start is after end
This is needed to allow use of 'ppp*' in interval sets with
concatenations.
intervals: deletion should adjust range not yet in the kernel
Do not remove the range if it does not exists yet in the kernel, adjust it
instead. Uncovered by use-after-free error.
==276702==ERROR: AddressSanitizer: heap-use-after-free on address 0x60d00190663c at pc 0x7ff310ab526f bp 0x7fffeb76f750 sp 0x7fffeb76f748 READ of size 4 at 0x60d00190663c thread T0
#0 0x7ff310ab526e in __adjust_elem_right .../nftables/src/intervals.c:300
#1 0x7ff310ab59a7 in adjust_elem_right .../nftables/src/intervals.c:311
#2 0x7ff310ab6daf in setelem_adjust .../nftables/src/intervals.c:354
#3 0x7ff310ab783a in setelem_delete .../nftables/src/intervals.c:411
#4 0x7ff310ab80e6 in __set_delete .../nftables/src/intervals.c:451
Fixes: 3e8d934e4f72 ("intervals: support to partial deletion with automerge") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
optimize: merge nat rules with same selectors into map
Verdict and nat are mutually exclusive, no need to support for this
combination.
# cat ruleset.nft
table ip x {
chain y {
type nat hook postrouting priority srcnat; policy drop;
ip saddr 1.1.1.1 tcp dport 8000 snat to 4.4.4.4:80
ip saddr 2.2.2.2 tcp dport 8001 snat to 5.5.5.5:90
}
}
# nft -o -c -f ruleset.nft
Merging:
ruleset.nft:4:3-52: ip saddr 1.1.1.1 tcp dport 8000 snat to 4.4.4.4:80
ruleset.nft:5:3-52: ip saddr 2.2.2.2 tcp dport 8001 snat to 5.5.5.5:90
into:
snat to ip saddr . tcp dport map { 1.1.1.1 . 8000 : 4.4.4.4 . 80, 2.2.2.2 . 8001 : 5.5.5.5 . 90 }
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
intervals: set on EXPR_F_KERNEL flag for new elements in set cache
So follow up command in this batch that update the set assumes this
element is already in the kernel.
Fixes: 3da9643fb9ff ("intervals: add support to automerge with kernel elements") Fixes: 3ed9fadaab95 ("intervals: build list of elements to be added from cache") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
intervals: build list of elements to be added from cache
Loop over the set cache and add elements that have no EXPR_F_KERNEL,
meaning that these are new elements in the set that have resulted
from adjusting/split existing ranges.
This fixes several partial deletions of the same interval in one
command.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Phil Sutter [Thu, 14 Apr 2022 11:39:24 +0000 (13:39 +0200)]
intervals: Simplify element sanity checks
Since setelem_delete() assigns to 'prev' pointer only if it doesn't have
EXPR_F_REMOVE flag set, there is no need to check that flag in called
functions.
Fixes: 3e8d934e4f722 ("intervals: support to partial deletion with automerge") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
intervals: unset EXPR_F_KERNEL for adjusted elements
This element is adjusted, reset the EXPR_F_KERNEL flag, this is a new
element and the old is purged from the kernel.
The existing list of elements in the kernel is spliced to the elements
to be removed, then merge-sorted. The EXPR_F_REMOVE flag specifies that
this element represents a deletion.
The EXPR_F_REMOVE and EXPR_F_KERNEL allows to track objects: whether
element is in the kernel (EXPR_F_KERNEL), element is new (no flag) or
element represents a removal (EXPR_F_REMOVE).
Reported-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
intervals: support to partial deletion with automerge
Splice the existing set element cache with the elements to be deleted
and merge sort it. The elements to be deleted are identified by the
EXPR_F_REMOVE flag.
The set elements to be deleted is automerged in first place if the
automerge flag is set on.
There are four possible deletion scenarios:
- Exact match, eg. delete [a-b] and there is a [a-b] range in the kernel set.
- Adjust left side of range, eg. delete [a-b] from range [a-x] where x > b.
- Adjust right side of range, eg. delete [a-b] from range [x-b] where x < a.
- Split range, eg. delete [a-b] from range [x-y] where x < a and b < y.
Update nft_evaluate() to use the safe list variant since new commands
are dynamically registered to the list to update ranges.
This patch also restores the set element existence check for Linux
kernels <= 5.7.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
intervals: add support to automerge with kernel elements
Extend the interval codebase to support for merging elements in the
kernel with userspace element updates.
Add a list of elements to be purged to cmd and set objects. These
elements representing outdated intervals are deleted before adding the
updated ranges.
This routine splices the list of userspace and kernel elements, then it
mergesorts to identify overlapping and contiguous ranges. This splice
operation is undone so the set userspace cache remains consistent.
Incrementally update the elements in the cache, this allows to remove dd44081d91ce ("segtree: Fix add and delete of element in same batch").
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
src: replace interval segment tree overlap and automerge
This is a rewrite of the segtree interval codebase.
This patch now splits the original set_to_interval() function in three
routines:
- add set_automerge() to merge overlapping and contiguous ranges.
The elements, expressed either as single value, prefix and ranges are
all first normalized to ranges. This elements expressed as ranges are
mergesorted. Then, there is a linear list inspection to check for
merge candidates. This code only merges elements in the same batch,
ie. it does not merge elements in the kernela and the userspace batch.
- add set_overlap() to check for overlapping set elements. Linux
kernel >= 5.7 already checks for overlaps, older kernels still needs
this code. This code checks for two conflict types:
1) between elements in this batch.
2) between elements in this batch and kernelspace.
The elements in the kernel are temporarily merged into the list of
elements in the batch to check for this overlaps. The EXPR_F_KERNEL
flag allows us to restore the set cache after the overlap check has
been performed.
- set_to_interval() now only transforms set elements, expressed as range
e.g. [a,b], to individual set elements using the EXPR_F_INTERVAL_END
flag notation to represent e.g. [a,b+1), where b+1 has the
EXPR_F_INTERVAL_END flag set on.
More relevant updates:
- The overlap and automerge routines are now performed in the evaluation
phase.
- The userspace set object representation now stores a reference to the
existing kernel set object (in case there is already a set with this
same name in the kernel). This is required by the new overlap and
automerge approach.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add initial test case, sets with names and interfaces,
anonymous and named ones.
Check match+no-match.
netns with ppp1 and ppq veth, send packets via both interfaces.
Rule counters should have incremented on the three rules.
(that match on set that have "abcdef1" or "abcdef*" strings in them).
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
src: make interval sets work with string datatypes
Allows to interface names in interval sets:
table inet filter {
set s {
type ifname
flags interval
elements = { eth*, foo }
}
Concatenations are not yet supported, also, listing is broken,
those strings will not be printed back because the values will remain
in big-endian order. Followup patch will extend segtree to translate
this back to host byte order.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
evaluate: string prefix expression must retain original length
To make something like "eth*" work for interval sets (match
eth0, eth1, and so on...) we must treat the string as a 128 bit
integer.
Without this, segtree will do the wrong thing when applying the prefix,
because we generate the prefix based on 'eth*' as input, with a length of 3.
The correct import needs to be done on "eth\0\0\0\0\0\0\0...", i.e., if
the input buffer were an ipv6 address, it should look like "eth\0::",
not "::eth".
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>