src: support for selectors with different byteorder with interval concatenations
Assuming the following interval set with concatenation:
set test {
typeof ip saddr . meta mark
flags interval
}
then, the following rule:
ip saddr . meta mark @test
requires bytecode that swaps the byteorder for the meta mark selector in
case the set contains intervals and concatenations.
inet x y
[ meta load nfproto => reg 1 ]
[ cmp eq reg 1 0x00000002 ]
[ payload load 4b @ network header + 12 => reg 1 ]
[ meta load mark => reg 9 ]
[ byteorder reg 9 = hton(reg 9, 4, 4) ] <----- this is required !
[ lookup reg 1 set test dreg 0 ]
This patch updates byteorder_conversion() to add the unary expression
that introduces the byteorder expression.
Moreover, store the meta mark range component of the element tuple in
the set in big endian as it is required for the range comparisons. Undo
the byteorder swap in the netlink delinearize path to listing the meta
mark values accordingly.
Update tests/py to validate that byteorder expression is emitted in the
bytecode. Update tests/shell to validate insertion and listing of a
named map declaration.
A similar commit 806ab081dc9a ("netlink: swap byteorder for
host-endian concat data") already exists in the tree to handle this for
strings with prefix (e.g. eth*).
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Phil Sutter [Tue, 11 Oct 2022 16:46:55 +0000 (18:46 +0200)]
Warn for tables with compat expressions in rules
While being able to "look inside" compat expressions using nft is a nice
feature, it is also (yet another) pitfall for unaware users, deceiving
them into assuming interchangeability (or at least compatibility)
between iptables-nft and nft.
In reality, which involves 'nft list ruleset | nft -f -', any correctly
translated compat expressions will turn into native nftables ones not
understood by (the version of) iptables-nft which created them in the
first place. Other compat expressions will vanish, potentially
compromising the firewall ruleset.
Emit a warning (as comment) to give users a chance to stop and
reconsider before shooting their own foot.
monitor: missing cache and set handle initialization
This leads to a crash when adding stateful expressions to sets:
netlink.c:928:38: runtime error: member access within null pointer of type 'struct nft_ctx'
AddressSanitizer:DEADLYSIGNAL
=================================================================
==13781==ERROR: AddressSanitizer: SEGV on unknown address 0x0000000000d0 (pc 0x7fc96fc2b6b2 bp 0x7ffc0e26b080 sp 0x7ffc0e26b020 T0)
==13781==The signal is caused by a READ memory access.
==13781==Hint: address points to the zero page.
#0 0x7fc96fc2b6b2 in table_cache_find /home/pablo/devel/scm/git-netfilter/nftables/src/cache.c:456
#1 0x7fc96fd244d4 in netlink_parse_set_expr /home/pablo/devel/scm/git-netfilter/nftables/src/netlink_delinearize.c:1857
#2 0x7fc96fcf1b4d in netlink_delinearize_set /home/pablo/devel/scm/git-netfilter/nftables/src/netlink.c:928
#3 0x7fc96fd41966 in netlink_events_cache_addset /home/pablo/devel/scm/git-netfilter/nftables/src/monitor.c:649
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Unsupported meta match on layer 4 protocol sets on protocol context to
proto_unknown, handle anything coming after it as a raw expression in
payload_expr_expand().
Moreover, payload_dependency_kill() skips dependency removal if protocol
is unknown, so raw payload expression leaves meta layer 4 protocol
remains in place.
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1641 Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netlink_delinearize: complete payload expression in payload statement
Call payload_expr_complete() to complete payload expression in payload
statement, otherwise expr->payload.desc is set to proto_unknown.
Call stmt_payload_binop_postprocess() introduced by 50ca788ca4d0
("netlink: decode payload statment") if payload_expr_complete() fails to
provide a protocol description (eg. ip dscp).
Follow up patch does not allow to remove redundant payload dependency if
proto_unknown is used to deal with the raw payload expression case.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Implicit chains do not allow for incremental updates, do not display rule
handle since kernel refuses to update an implicit chain which is already
bound.
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1615 Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
expr_evaluate_shift() overrides the datatype which results in a datatype
memleak after the binop transfer that triggers a left-shift of the
constant (in the map).
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Phil Sutter [Wed, 28 Sep 2022 21:26:42 +0000 (23:26 +0200)]
monitor: Sanitize startup race condition
During startup, 'nft monitor' first fetches the current ruleset and then
keeps this cache up to date based on received events. This is racey, as
any ruleset changes in between the initial fetch and the socket opening
are not recognized.
This script demonstrates the problem:
| #!/bin/bash
|
| while true; do
| nft flush ruleset
| iptables-nft -A FORWARD
| done &
| maniploop=$!
|
| trap "kill $maniploop; kill \$!; wait" EXIT
|
| while true; do
| nft monitor rules >/dev/null &
| sleep 0.2
| kill $!
| done
If the table add event is missed, the rule add event callback fails to
deserialize the rule and calls abort().
Avoid the inconvenient program exit by returning NULL from
netlink_delinearize_rule() instead of aborting and make callers check
the return value.
Jeremy Sowden [Sun, 18 Sep 2022 17:22:12 +0000 (18:22 +0100)]
segtree: fix decomposition of unclosed intervals containing address prefixes
The code which decomposes unclosed intervals doesn't check for prefixes. This
leads to incorrect output for sets which contain these. For example,
# nft -f - <<END
table ip t {
chain c {
ip saddr 192.0.0.0/2 drop
ip saddr 10.0.0.0/8 drop
ip saddr { 192.0.0.0/2, 10.0.0.0/8 } drop
}
}
table ip6 t {
chain c {
ip6 saddr ff00::/8 drop
ip6 saddr fe80::/10 drop
ip6 saddr { ff00::/8, fe80::/10 } drop
}
}
END
# nft list table ip6 t
table ip6 t {
chain c {
ip6 saddr ff00::/8 drop
ip6 saddr fe80::/10 drop
ip6 saddr { fe80::/10, ff00::-ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff } drop
}
}
# nft list table ip t
table ip t {
chain c {
ip saddr 192.0.0.0/2 drop
ip saddr 10.0.0.0/8 drop
ip saddr { 10.0.0.0/8, 192.0.0.0-255.255.255.255 } drop
}
}
Instead of treating the final unclosed interval as a special case, reuse the
code which correctly handles closed intervals.
Add a shell test-case.
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1018156 Fixes: 86b965bdab8d ("segtree: fix decomposition of unclosed intervals") Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Florian Westphal <fw@strlen.de>
Jeremy Sowden [Sun, 18 Sep 2022 17:22:11 +0000 (18:22 +0100)]
segtree: refactor decomposition of closed intervals
Move the code in `interval_map_decompose` which adds a new closed
interval to the set into a separate function. In addition to the moving
of the code, there is one other change: `compound_expr_add` is called
once, after the main conditional, instead of being called in each
branch.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Florian Westphal <fw@strlen.de>
'rule inet dscpclassify dscp_match meta l4proto { udp } th dport { 3478 } th sport { 3478-3497, 16384-16387 } goto ct_set_ef'
works with 'nft add', but not 'nft insert', the latter yields: "BUG: unhandled op 4".
Fixes: 81e36530fcac ("src: replace interval segment tree overlap and automerge") Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
py: support variables management and fix formatting
Add nft_ctx_add_var() and nft_ctx_clear_vars() support through add_var() and
clear_vars(). Also, fix some functions documentation and drop unnecesary
comments.
In addition, modify get_dry_run() to return the previous value set. This is
needed to be consistent with the rest of the python API.
Peter Collinson [Mon, 12 Sep 2022 10:52:23 +0000 (12:52 +0200)]
py: extend python API to support libnftables API
Allows py/nftables.py to support full mapping to the libnftables API. The
changes allow python code to talk in text to the kernel rather than just
using json. The Python API can now also use dry run to test changes.
Link: https://bugzilla.netfilter.org/show_bug.cgi?id=1591 Signed-off-by: Peter Collinson <pc@hillside.co.uk> Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
498a5f0c219d added collapsing of set operations in different commands.
However, the logic is currently too relaxed. It is valid to have a
table and set with identical names on different address families.
For example:
table ip a {
set x {
type inet_service;
}
}
table ip6 a {
set x {
type inet_service;
}
}
add element ip a x { 1 }
add element ip a x { 2 }
add element ip6 a x { 2 }
The above currently results in nothing being added to the ip6 family
table due to being collapsed into the ip table add. Prior to 498a5f0c219d the set add would work. The fix is simply to check the
family in addition to the table and set names before allowing a
collapse.
[ Add testcase to tests/shell --pablo ]
Fixes: 498a5f0c219d ("rule: collapse set element commands") Signed-off-by: Derek Hageman <hageman@inthat.cloud> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since libnftnl's 212479ad2c92 ("rule, set_elem: fix printing of user
data"), userdata is missing in netlink payload printing via --debug.
Update tests/py/ip6/srh.t.payload to silence warning.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
src: allow burst 0 for byte ratelimit and use it as default
Packet-based limit burst is set to 5, as in iptables. However,
byte-based limit burst adds to the rate to calculate the bucket size,
and this is also sets this to 5 (... bytes in this case). Update it to
use zero byte burst by default instead.
This patch also updates manpage to describe how the burst value
influences the kernel module's token bucket in each of the two modes.
This documentation update is based on original text by Phil Sutter.
Adjust tests/py to silence warnings due to mismatching byte burst.
Fixes: 285baccfea46 ("src: disallow burst 0 in ratelimits") Acked-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Tue, 23 Aug 2022 10:51:52 +0000 (12:51 +0200)]
expr: update EXPR_MAX and add missing comments
WHen flagcmp and catchall expressions got added the EXPR_MAX definition
wasn't changed.
Should have no impact in practice however, this value is only checked to
prevent crash when old nft release is used to list a ruleset generated
by a newer nft release and a unknown 'typeof' expression.
v2: Pablo suggested to add EXPR_MAX into enum so its easier to spot.
Adding __EXPR_MAX + define EXPR_MAX (__EXPR_MAX - 1) causes '__EXPR_MAX
not handled in switch' warnings, hence the 'EXPR_MAX =' solution.
Phil Sutter [Tue, 30 Aug 2022 13:00:52 +0000 (15:00 +0200)]
erec: Dump locations' expressions only if set
Calling netlink_dump_expr() with a NULL pointer leads to segfault within
libnftnl. Internal ("fake") locations such as 'netlink_location' don't
have an expression assigned so expect this and skip the call. Simple
reproducer (list ruleset with netlink debugging as non-root):
| $ nft -d netlink list ruleset
Reported-by: François Rigault <frigo@amadeus.com> Signed-off-by: Phil Sutter <phil@nwl.cc>
into a concatenation of statements, the following expansion need to
be done for rule #2:
67 . "bar"
123 . "bar"
The expansion logic consists of cloning the existing concatenation being
built and then append each element in the implicit set. A list of
ongoing concatenations being built is maintained, so further expansions
are also supported.
Extend test to cover for this use-case.
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1628 Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Xiao Liang [Fri, 19 Aug 2022 02:40:23 +0000 (10:40 +0800)]
src: Don't parse string as verdict in map
In verdict map, string values are accidentally treated as verdicts.
For example:
table t {
map foo {
type ipv4_addr : verdict
elements = {
192.168.0.1 : bar
}
}
chain output {
type filter hook output priority mangle;
ip daddr vmap @foo
}
}
Though "bar" is not a valid verdict (should be "jump bar" or something),
the string is taken as the element value. Then NFTA_DATA_VALUE is sent
to the kernel instead of NFTA_DATA_VERDICT. This would be rejected by
recent kernels. On older ones (e.g. v5.4.x) that don't validate the
type, a warning can be seen when the rule is hit, because of the
corrupted verdict value:
[5120263.467627] WARNING: CPU: 12 PID: 303303 at net/netfilter/nf_tables_core.c:229 nft_do_chain+0x394/0x500 [nf_tables]
Indeed, we don't parse verdicts during evaluation, but only chain names,
which is of type string rather than verdict. For example, "jump $var" is
a verdict while "$var" is a string.
Fixes: c64457cff967 ("src: Allow goto and jump to a variable") Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
Jo-Philipp Wich [Mon, 8 Aug 2022 22:18:42 +0000 (00:18 +0200)]
meta: don't use non-POSIX formats in strptime()
The current strptime() invocations in meta.c use the `%F` format which
is not specified by POSIX and thus unimplemented by some libc flavors
such as musl libc.
Replace all occurrences of `%F` with an equivalent `%Y-%m-%d` format
in order to be able to properly parse user supplied dates in such
environments.
Signed-off-by: Jo-Philipp Wich <jo@mein.io> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
src: allow anon set concatenation with ether and vlan
vlan id uses integer type (which has a length of 0).
Using it was possible, but listing would assert:
python: mergesort.c:24: concat_expr_msort_value: Assertion `ilen > 0' failed.
There are two reasons for this.
First reason is that the udata/typeof information lacks the 'vlan id'
part, because internally this is 'payload . binop(payload AND mask)'.
binop lacks an udata store. It makes little sense to store it,
'typeof' keyword expects normal match syntax.
So, when storing udata, store the left hand side of the binary
operation, i.e. the load of the 2-byte key.
With that resolved, delinerization could work, but concat_elem_expr()
would splice 12 bits off the elements value, but it should be 16 (on
a byte boundary).
evaluate: search stacked header list for matching payload dep
"ether saddr 0:1:2:3:4:6 vlan id 2" works, but reverse fails:
"vlan id 2 ether saddr 0:1:2:3:4:6" will give
Error: conflicting protocols specified: vlan vs. ether
After "proto: track full stack of seen l2 protocols, not just cumulative offset",
we have a list of all l2 headers, so search those to see if we had this
proto base in the past before rejecting this.
Reported-by: Eric Garver <eric@garver.life> Signed-off-by: Florian Westphal <fw@strlen.de>
proto: track full stack of seen l2 protocols, not just cumulative offset
For input, a cumulative size counter of all pushed l2 headers is enough,
because we have the full expression tree available to us.
For delinearization we need to track all seen l2 headers, else we lose
information that we might need at a later time.
Consider:
rule netdev nt nc set update ether saddr . vlan id
during delinearization, the vlan proto_desc replaces the ethernet one,
and by the time we try to split the concatenation apart we will search
the ether saddr offset vs. the templates for proto_vlan.
This replaces the offset with an array that stores the protocol
descriptions seen.
Then, if the payload offset is larger than our description, search the
l2 stack and adjust the offset until we're within the expected offset
boundary.
Reported-by: Eric Garver <eric@garver.life> Signed-off-by: Florian Westphal <fw@strlen.de>
After this, listing will show:
update @macset { @ll,48,48 . vlan id timeout 5s }
@ll,48,48 . vlan id @macset
The @ll, ... is due to vlan description replacing the ethernet one,
so payload decode fails to take the concatenation apart (the ethernet
header payload info is matched vs. vlan template).
cache: release pending rules when chain binding lookup fails
If the implicit chain is not in the cache, release pending rules in
ctx->list and report EINTR to let the cache core retry to populate a
consistent cache.
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1402 Fixes: c330152b7f77 ("src: support for implicit chain bindings") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Thu, 23 Jun 2022 17:56:19 +0000 (19:56 +0200)]
scanner: don't pop active flex scanner scope
Currently we can pop a flex scope that is still active, i.e. the
scanner_pop_start_cond() for the scope has not been done.
Example:
counter ipsec out ip daddr 192.168.1.2 counter name "ipsec_out"
Here, parser fails because 'daddr' is parsed as STRING, not as DADDR token.
Bug is as follows:
COUNTER changes scope to COUNTER. (COUNTER).
Next, IPSEC scope gets pushed, stack is: COUNTER, IPSEC.
Then, the 'COUNTER' scope close happens. Because active scope has changed,
we cannot pop (we would pop the 'ipsec' scope in flex).
The pop operation gets delayed accordingly.
Next, IP gets pushed, stack is: COUNTER, IPSEC, IP, plus the information
that one scope closure/pop was delayed.
Then, the IP scope is closed. Because a pop operation was delayed, we pop again,
which brings us back to COUNTER state.
This is bogus: The pop operation CANNOT be done yet, because the ipsec scope
is still open, but the existing code lacks the information to detect this.
After popping the IP scope, we must remain in IPSEC scope until bison
parser calls scanner_pop_start_cond(, IPSEC).
This adds a counter per flex scope so that we can detect this case.
In above case, after the IP scope gets closed, the "new" (previous)
scope (IPSEC) will be treated as active and its close is attempted again
on the next call to scanner_pop_start_cond().
After this patch, transition in above rule is:
push counter (COUNTER)
push IPSEC (COUNTER, IPSEC)
pop COUNTER (delayed: COUNTER, IPSEC, pending-pop for COUNTER),
push IP (COUNTER, IPSEC, IP, pending-pop for COUNTER)
pop IP (COUNTER, IPSEC, pending-pop for COUNTER)
parse DADDR (we're in IPSEC scope, its valid token)
pop IPSEC (pops all remaining scopes).
We could also resurrect the commit:
"scanner: flags: move to own scope", the test case passes with the
new scope closure logic.
Fixes: bff106c5b277 ("scanner: add support for scope nesting") Signed-off-by: Florian Westphal <fw@strlen.de>
The error in the set/map definition is correctly caught and queued, but
because the set is invalid and does not contain a key type, adding to it
causes a NULL pointer dereference of set->key within setelem_evaluate().
I don't think it's necessary to queue another error since the underlying
problem is correctly detected and reported when parsing the definition
of the set. Simply checking the validity of set->key before using it
seems to fix it, causing the error in the definition of the set to be
reported properly. The element type error isn't caught, but that seems
reasonable since the key type is invalid or unknown anyway:
$ ./nft -c -f ~/nftables.conf.crash
/home/pti/nftables.conf.crash:3:21-21: Error: set definition does not specify key
set inet filter foo {}
^
[ Add tests to cover this case --pablo ]
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1597 Signed-off-by: Peter Tirsek <peter@tirsek.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
mnl: store netlink error location for set elements
Store set element location in the per-command netlink error location
array. This allows for fine grain error reporting when adding and
deleting elements.
# nft -f test.nft
test.nft:5:4-20: Error: Could not process rule: File exists
00:01:45:09:0b:26 : drop,
^^^^^^^^^^^^^^^^^
test.nft contains a large map with one redundant entry.
Thus, users do not have to find the needle in the stack.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Revert support for limit statement, the limit statement is stateful and
it applies a ratelimit per rule, transformation for merging rules with
the limit statement needs to use anonymous sets with statements.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Do not try to merge rules with unsupported statements. This patch adds a
dummy unsupported statement which is included in the statement
collection and the rule vs statement matrix.
When looking for possible rule mergers, rules using unsupported
statements are discarded, otherwise bogus rule mergers might occur.
Note that __stmt_type_eq() already returns false for unsupported
statements.
Add a test using meta mark statement, which is not yet supported.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Do not print stateful information such as counters which are likely set
to zero.
Before this patch:
Merging:
packets.conf:10:3-29: ip protocol 4 counter drop
packets.conf:11:3-29: ip protocol 41 counter drop
packets.conf:12:3-29: ip protocol 47 counter drop
into:
ip protocol { 4, 41, 47 } counter packets 0 bytes 0 drop
^^^^^^^^^^^^^^^^^
After:
Merging:
packets.conf:10:3-29: ip protocol 4 counter drop
packets.conf:11:3-29: ip protocol 41 counter drop
packets.conf:12:3-29: ip protocol 47 counter drop
into:
ip protocol { 4, 41, 47 } counter drop
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Phil Sutter [Thu, 16 Jun 2022 08:56:12 +0000 (10:56 +0200)]
intervals: Do not sort cached set elements over and over again
When adding element(s) to a non-empty set, code merged the two lists and
sorted the result. With many individual 'add element' commands this
causes substantial overhead. Make use of the fact that
existing_set->init is sorted already, sort only the list of new elements
and use list_splice_sorted() to merge the two sorted lists.
Add set_sort_splice() and use it for set element overlap detection and
automerge.
A test case adding ~25k elements in individual commands completes in
about 1/4th of the time with this patch applied.
Joint work with Pablo.
Fixes: 3da9643fb9ff9 ("intervals: add support to automerge with kernel elements") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Robots might generate a long list of singleton element commands such as:
add element t s { 1.0.1.0/24 }
...
add element t s { 1.0.2.0/23 }
collapse them into one single command before the evaluation step, ie.
add element t s { 1.0.1.0/24, ..., 1.0.2.0/23 }
this speeds up overlap detection and set element automerge operations in
this worst case scenario.
Since 3da9643fb9ff9 ("intervals: add support to automerge with kernel
elements"), the new interval tracking relies on mergesort. The pattern
above triggers the set sorting for each element.
This patch adds a list to cmd objects that store collapsed commands.
Moreover, expressions also contain a reference to the original command,
to uncollapse the commands after the evaluation step.
These commands are uncollapsed after the evaluation step to ensure error
reporting works as expected (command and netlink message are mapped
1:1).
For the record:
- nftables versions <= 1.0.2 did not perform any kind of overlap
check for the described scenario above (because set cache only contained
elements in the kernel in this case). This is a problem for kernels < 5.7
which rely on userspace to detect overlaps.
- the overlap detection could be skipped for kernels >= 5.7.
- The extended netlink error reporting available for set elements
since 5.19-rc might allow to remove the uncollapse step, in this case,
error reporting does not rely on the netlink sequence to refer to the
command triggering the problem.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>