A set element with a range takes four instances of struct expr:
EXPR_SET_ELEM -> EXPR_RANGE -> (2) EXPR_VALUE
where EXPR_RANGE represents two references to struct expr with constant
value.
This new EXPR_RANGE_VALUE trims it down to two expressions:
EXPR_SET_ELEM -> EXPR_RANGE_VALUE
with two direct low and high values that represent the range:
struct {
mpz_t low;
mpz_t high;
};
These two new direct values in struct expr do not modify its size.
setelem_expr_to_range() translates EXPR_RANGE to EXPR_RANGE_VALUE; this
conversion happens at a later stage.
constant_range_expr_print() translates this structure to constant values
to reuse the existing datatype_print(), which relies on singleton values.
The automerge routine has been updated to use EXPR_RANGE_VALUE.
This requires a follow up patch to rework the conversion from range
expression to singleton element to provide a noticeable memory
consumption reduction.
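As a rough illustration of the layout change described above, the sketch below models both forms with uint64_t standing in for mpz_t. The toy_* types and functions are hypothetical and are not the nftables definitions; they only show how three objects (range plus two values) collapse into one.

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-ins for struct expr; uint64_t replaces mpz_t (assumption). */
enum toy_etype { TOY_VALUE, TOY_RANGE, TOY_RANGE_VALUE };

struct toy_expr {
    enum toy_etype type;
    union {
        uint64_t value;                                  /* TOY_VALUE */
        struct { struct toy_expr *left, *right; } range; /* TOY_RANGE */
        struct { uint64_t low, high; } range_value;      /* TOY_RANGE_VALUE */
    };
};

static struct toy_expr *toy_value(uint64_t v)
{
    struct toy_expr *e = calloc(1, sizeof(*e));

    if (e) {
        e->type = TOY_VALUE;
        e->value = v;
    }
    return e;
}

/* Old layout: one range object referencing two value objects. */
static struct toy_expr *toy_range(uint64_t low, uint64_t high)
{
    struct toy_expr *e = calloc(1, sizeof(*e));

    if (!e)
        return NULL;
    e->type = TOY_RANGE;
    e->range.left = toy_value(low);
    e->range.right = toy_value(high);
    if (!e->range.left || !e->range.right) {
        free(e->range.left);
        free(e->range.right);
        free(e);
        return NULL;
    }
    return e;
}

/* New layout: reuse the range object and store both bounds directly,
 * releasing the two value objects it used to reference. */
static struct toy_expr *toy_range_to_range_value(struct toy_expr *range)
{
    uint64_t low = range->range.left->value;
    uint64_t high = range->range.right->value;

    free(range->range.left);
    free(range->range.right);
    range->type = TOY_RANGE_VALUE;
    range->range_value.low = low;
    range->range_value.high = high;
    return range;
}
```

After the conversion only one object remains per range, which is where the saving would come from once the conversion runs early enough.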
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is a test case for nft_socket cgroupv2 matching, including
support for matching inside a cgroupv2 mount space added in kernel
commit 7f3287db6543 ("netfilter: nft_socket: make cgroupsv2 matching work with namespaces").
Test is thus run twice, once in the initial namespace and once with
a changed cgroupv2 root.
In case we can't create a cgroup or the 2nd half (unshared re-run)
fails, indicate SKIP.
Jeremy Sowden [Mon, 18 Nov 2024 23:18:28 +0000 (00:18 +0100)]
src: allow binop expressions with variable right-hand operands
Hitherto, the kernel has required constant values for the `xor` and
`mask` attributes of boolean bitwise expressions. This has meant that
the right-hand operand of a boolean binop must be constant. Now the
kernel has support for AND, OR and XOR operations with right-hand
operands passed via registers, we can relax this restriction. Allow
non-constant right-hand operands if the left-hand operand is not
constant, e.g.:
ct mark & 0xffff0000 | meta mark & 0xffff
The kernel now supports performing AND, OR and XOR operations directly,
on one register and an immediate value or on two registers, so we need
to be able to generate and parse bitwise boolean expressions of this
form.
If a boolean operation has a constant RHS, we continue to send a
mask-and-xor expression to the kernel.
Add tests for {ct,meta} mark with variable RHS operands.
JSON support is also included.
This requires Linux kernel >= 6.13-rc.
[ Originally posted as patch 1/8 and 6/8 which has been collapsed and
simplified to focus on initial {ct,meta} mark support. Tests have
been extracted from 8/8 including a tests/py fix to payload output
due to incorrect output in original patchset. JSON support has been
extracted from patch 7/8 --pablo]
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
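For the constant RHS case mentioned above, the mask-and-xor form computes (x & mask) ^ xor. The sketch below shows how AND, OR and XOR with a constant can each be expressed that way. It is a userspace illustration only; the toy_* names are hypothetical and 32-bit values are assumed for brevity.

```c
#include <stdint.h>

enum toy_op { TOY_AND, TOY_OR, TOY_XOR };

struct toy_mask_xor {
    uint32_t mask;
    uint32_t xorv;
};

/* Encode "x OP c" with constant c as a mask-and-xor pair. */
static struct toy_mask_xor toy_encode_const_rhs(enum toy_op op, uint32_t c)
{
    struct toy_mask_xor mx = { .mask = UINT32_MAX, .xorv = 0 };

    switch (op) {
    case TOY_AND:       /* x & c  == (x & c) ^ 0 */
        mx.mask = c;
        break;
    case TOY_OR:        /* x | c  == (x & ~c) ^ c */
        mx.mask = ~c;
        mx.xorv = c;
        break;
    case TOY_XOR:       /* x ^ c  == (x & ~0) ^ c */
        mx.xorv = c;
        break;
    }
    return mx;
}

/* Evaluate the mask-and-xor form on a register value. */
static uint32_t toy_apply(uint32_t x, struct toy_mask_xor mx)
{
    return (x & mx.mask) ^ mx.xorv;
}
```

A variable right-hand operand has no such constant encoding, which is why it needs the register-based operations.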
Donald Yandt [Fri, 22 Nov 2024 22:04:49 +0000 (17:04 -0500)]
mnl: fix basehook comparison
When comparing two hooks, if both device names are null,
the comparison should return true, as they are considered equal.
Fixes: b8872b83eb365 ("src: mnl: prepare for listing all device netdev device hooks")
Signed-off-by: Donald Yandt <donald.yandt@gmail.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
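The rule stated in this commit (two null device names compare as equal) can be sketched as a small NULL-safe string comparison. The helper name toy_devname_equal is hypothetical and not taken from mnl.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Compare two optional device names. Two NULL names are equal,
 * a NULL name never equals a non-NULL one. */
static bool toy_devname_equal(const char *a, const char *b)
{
    if (a == NULL || b == NULL)
        return a == b;

    return strcmp(a, b) == 0;
}
```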
Florian Westphal [Fri, 25 Oct 2024 07:47:25 +0000 (09:47 +0200)]
src: allow to map key to nfqueue number
Allow to specify a numeric queue id as part of a map.
The parser side is easy, but the reverse direction (listing) is not.
'queue' is a statement, it doesn't have an expression.
Add a generic 'queue_type' datatype as a shim to the real basetype with
constant expressions, this is used only for udata build/parse, it stores
the "key" (the parser token, here "queue") as udata in kernel and can
then restore the original key.
Add a dumpfile to validate parser & output.
JSON support is missing because JSON has supported typeof only since
quite recently.
Phil Sutter [Thu, 7 Nov 2024 13:39:51 +0000 (14:39 +0100)]
tests: monitor: Become $PWD agnostic
The call to 'cd' is problematic since later the script tries to 'exec
unshare -n $0'. This is not the only problem though: Individual test
cases specified on command line are expected to be relative to the
script's directory, too. Just get rid of these nonsensical restrictions.
Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Phil Sutter [Wed, 2 Oct 2024 17:55:49 +0000 (19:55 +0200)]
tests: py: Fix for storing payload into missing file
When running a test for which no corresponding *.payload file exists,
the *.payload.got file name was incorrectly constructed due to
'payload_path' variable not being set.
Fixes: 2cfab7a3e10fc ("tests/py: Write dissenting payload into the right file")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Phil Sutter [Fri, 27 Sep 2024 22:55:34 +0000 (00:55 +0200)]
json: Support typeof in set and map types
Implement this as a special "type" property value which is an object
with sole property "typeof". The latter's value is the JSON
representation of the expression in set->key, so for concatenated
typeofs it is a concat expression.
All this is a bit clumsy right now but it works and it should be
possible to tear it down a bit for more user-friendliness in a
compatible way by either replacing the concat expression by the array it
contains or even the whole "typeof" object - the parser would just
assume any object (or objects in an array) in the "type" property value
are expressions to extract a type from.
Update json parser to collapse {add,create} element commands to reduce
memory consumption in the case of large sets defined by one element per
command:
Add CTX_F_COLLAPSED flag to report that command has been collapsed.
This patch reduces memory consumption by ~32% in this case.
Fixes: 20f1c60ac8c8 ("src: collapse set element commands from parser")
Reported-by: Eric Garver <eric@garver.life>
Tested-by: Eric Garver <eric@garver.life>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Tue, 29 Oct 2024 20:12:19 +0000 (21:12 +0100)]
tests: monitor: fix up test case breakage
Monitor test fails:
echo: running tests from file set-simple.t
echo output differs!
-add element ip t portrange { 1024-65535 }
add element ip t portrange { 100-200 }
+add element ip t portrange { 1024-65535 }
+# new generation 510 by process 129009 (nft)
I also noticed that -j mode did not work correctly; add the missing json
annotations in set-concat-interval.t while at it.
Florian Westphal [Tue, 22 Oct 2024 13:26:54 +0000 (15:26 +0200)]
tests: shell: don't rely on writable test directory
Running shell tests from a virtme-ng instance with ro mapped test dir
hangs due to runaway 'awk' reading from stdin instead of the intended
$tmpfile (variable is empty), so add quotes where needed.
0002relative_0 wants to check relative includes. It tries to create a
temporary file in the current directory, which fails as that's read-only
inside the virtme vm instance.
[ -w ! $foo ... did not catch this due to missing "".
Add quotes and return the skip retval so the test gets flagged as skipped.
0013input_descriptors_included_files_0 and 0020include_chain_0 are
switched to normal tmpfiles, there is nothing in the test that needs
relative includes.
Also, get rid of some error tests for subsequent mktemp calls for
scripts that already called 'set -e'.
src: fix extended netlink error reporting with large set elements
Large sets can expand into several netlink messages, use sequence number
and attribute offset to correlate the set element and the location.
When set element command expands into several netlink messages,
increment sequence number for each netlink message. Update struct cmd to
store the range of netlink messages that result from this command.
struct nlerr_loc remains the same size on x86_64.
# nft -f set-65535.nft
set-65535.nft:65029:22-32: Error: Could not process rule: File exists
create element x y { 1.1.254.253 }
^^^^^^^^^^^
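A minimal sketch of the correlation described above: each command records the inclusive range of netlink sequence numbers it expanded into, and an error carrying a sequence number is mapped back to the command whose range contains it. The toy_cmd layout and the field names are assumptions for illustration, not the actual struct cmd.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command record: the inclusive range of netlink
 * sequence numbers generated by this command. */
struct toy_cmd {
    uint32_t seqnum_from;
    uint32_t seqnum_to;
};

/* Return the command whose message range contains seq, or NULL. */
static const struct toy_cmd *toy_cmd_lookup(const struct toy_cmd *cmds,
                                            size_t n, uint32_t seq)
{
    for (size_t i = 0; i < n; i++) {
        if (seq >= cmds[i].seqnum_from && seq <= cmds[i].seqnum_to)
            return &cmds[i];
    }
    return NULL;
}
```

Within the matching command, the attribute offset would then narrow the error down to a single set element.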
rule: netlink attribute offset is uint32_t for struct nlerr_loc
The maximum netlink message length (nlh->nlmsg_len) is uint32_t, struct
nlerr_loc stores the offset to the netlink attribute which must be
uint32_t, not uint16_t.
While at it, remove the check for a zero netlink attribute offset in
nft_cmd_error(), which should never happen; this check was likely
there to prevent the uint16_t offset overflow.
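To make the overflow concrete, the sketch below stores the same attribute offset in a 16-bit and in a 32-bit field. The toy_loc16 and toy_loc32 structs are hypothetical stand-ins and not the layout of struct nlerr_loc.

```c
#include <stdint.h>

/* Hypothetical location records with a narrow and a wide offset. */
struct toy_loc16 { uint16_t offset; };
struct toy_loc32 { uint32_t offset; };

/* Store an offset in a 16-bit field: values above 65535 wrap around. */
static uint32_t toy_store16(uint32_t attr_offset)
{
    struct toy_loc16 loc = { .offset = (uint16_t)attr_offset };

    return loc.offset;
}

/* Store an offset in a 32-bit field: the value is preserved. */
static uint32_t toy_store32(uint32_t attr_offset)
{
    struct toy_loc32 loc = { .offset = attr_offset };

    return loc.offset;
}
```

With a 16-bit field, an offset of 65536 wraps to exactly zero, which is consistent with the guess above about why the zero-offset check existed.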
498a5f0c219d ("rule: collapse set element commands") does not help to
reduce memory consumption in the case of large sets defined by one
element per line:
add element ip x y { 1.1.1.1 }
add element ip x y { 1.1.1.2 }
...
This patch reduces memory consumption by ~75%, set elements are
collapsed into an existing cmd object wherever possible to reduce the
number of cmd objects.
This patch also adds a special case for variables for sets similar to:
be055af5c58d ("cmd: skip variable set elements when collapsing commands")
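The collapsing idea can be sketched as follows, under simplifying assumptions: commands are kept in an array, each command only carries an element count, and only adjacent commands targeting the same table and set are merged. The toy_cmd type is hypothetical, and the special case for variables mentioned above is not modelled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical "add element" command: target set plus element count. */
struct toy_cmd {
    const char *table;
    const char *set;
    unsigned int num_elems;
};

/* Merge each command into the previous one when both target the same
 * table and set. Works in place and returns the new command count. */
static size_t toy_collapse(struct toy_cmd *cmds, size_t n)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        if (out > 0 &&
            strcmp(cmds[out - 1].table, cmds[i].table) == 0 &&
            strcmp(cmds[out - 1].set, cmds[i].set) == 0) {
            /* Reuse the existing command object. */
            cmds[out - 1].num_elems += cmds[i].num_elems;
            continue;
        }
        cmds[out++] = cmds[i];
    }
    return out;
}
```

For a file with one element per line, this leaves a single command holding all elements instead of one command object per line.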
netfilter: nf_tables: set element extended ACK reporting support
which is already included in recent -stable kernels:
# cat ruleset.nft
add table ip x
add chain ip x y
add set ip x y { type ipv4_addr; }
create element ip x y { 1.1.1.1 }
create element ip x y { 1.1.1.1 }
# nft -f ruleset.nft
ruleset.nft:5:25-31: Error: Could not process rule: File exists
create element ip x y { 1.1.1.1 }
^^^^^^^
since there is no need to relate commands via sequence number anymore,
this also allows removing the uncollapse step.
Fixes: 498a5f0c219d ("rule: collapse set element commands")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Allow to specify elements that never expire in sets with global
timeout.
set x {
typeof ip saddr
timeout 1m
elements = { 1.1.1.1 timeout never,
2.2.2.2,
3.3.3.3 timeout 2m }
}
In the example above:
- 1.1.1.1 is a permanent element
- 2.2.2.2 expires after 1 minute (uses default set timeout)
- 3.3.3.3 expires after 2 minutes (uses specified timeout override)
Use internal NFT_NEVER_TIMEOUT marker as UINT64_MAX to differentiate
between using the default set timeout and timeout never if "timeout N"
is used in the set declaration. The maximum supported timeout in
milliseconds that is conveyed within a netlink attribute is 0x10c6f7a0b5ec.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
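The three cases in the example can be sketched as a small resolution function. This is an illustration only: TOY_NEVER_TIMEOUT mirrors the UINT64_MAX marker described above, while using 0 for "no per-element timeout given" and returning 0 for "never expires" are assumptions made for this sketch.

```c
#include <stdint.h>

#define TOY_NEVER_TIMEOUT  UINT64_MAX  /* marker for "timeout never" */
#define TOY_TIMEOUT_UNSET  0           /* assumption: no timeout given */

/* Resolve the timeout that applies to one element, in milliseconds.
 * A return value of 0 means the element never expires. */
static uint64_t toy_effective_timeout(uint64_t set_timeout_ms,
                                      uint64_t elem_timeout_ms)
{
    if (elem_timeout_ms == TOY_NEVER_TIMEOUT)
        return 0;                   /* permanent element */
    if (elem_timeout_ms == TOY_TIMEOUT_UNSET)
        return set_timeout_ms;      /* fall back to the set default */

    return elem_timeout_ms;         /* per-element override */
}
```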
tests: shell: more randomization for timeout parameter
Either pass no timeout argument, pass timeout+expires or omit
timeout (uses default timeout, if any).
This should not expose further kernel code to run at this time, but unlike
the existing (deterministic) element-update test case this script does
have live traffic and different set types, including rhashtable which has
async gc.
proto: use NFT_PAYLOAD_L4CSUM_PSEUDOHDR flag to mangle UDP checksum
There are two mechanisms to update the UDP checksum field:
1) _CSUM_TYPE and _CSUM_OFFSET which specify the type of checksum
(e.g. inet) and offset where it is located.
2) use NFT_PAYLOAD_L4CSUM_PSEUDOHDR flag to use layer 4 kernel
protocol parser.
The problem with 1) is that it is unconditional, that is, csum_type and
csum_offset cannot deal with a zero UDP checksum.
Use NFT_PAYLOAD_L4CSUM_PSEUDOHDR flag instead since it relies on the
layer 4 kernel parser which skips updating zero UDP checksum.
Extend test coverage for the UDP mangling with and without zero
checksum.
Fixes: e6c9174e13b2 ("proto: add checksum key information to struct proto_desc")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
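The behaviour that matters here, leaving a zero UDP checksum untouched while updating a real one, can be sketched in userspace with the RFC 1624 incremental update. This is not the kernel's code; the toy_* helpers are hypothetical and operate on a single 16-bit word.

```c
#include <stdint.h>

/* Fold a 32-bit accumulator into 16 bits (one's complement addition). */
static uint16_t toy_csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* RFC 1624 incremental update: HC' = ~(~HC + ~m + m'). */
static uint16_t toy_csum_replace16(uint16_t check, uint16_t old_word,
                                   uint16_t new_word)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~old_word;
    sum += new_word;
    return (uint16_t)~toy_csum_fold(sum);
}

/* UDP-aware update: a zero checksum means "no checksum" and is left
 * alone; a computed zero is sent as 0xffff instead. */
static uint16_t toy_udp_csum_update(uint16_t check, uint16_t old_word,
                                    uint16_t new_word)
{
    if (check == 0)
        return 0;

    check = toy_csum_replace16(check, old_word, new_word);
    return check ? check : 0xffff;
}
```

An unconditional update, as in mechanism 1), would turn a zero checksum into a non-zero value that the receiver then tries to verify.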
- Add sleep calls after setting up container topology.
- Extend TCP connect timeout to 4 seconds. Test has no listener, this is
just sending SYN packets that are rejected but it works to test the
payload mangling ruleset.
- fix incorrect logic to check for 0 matching packets through grep.
Fixes: 84da729e067a ("tests: shell: add test to cover payload transport match and mangle")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Needs a feature check file, so add one:
Add element with 1m timeout, then update expiry to 1ms.
If element still exists after 1ms, update request was ignored.
Test case checks timeouts can both be incremented and decremented,
checks error recovery (update request but transaction fails) and
that expiry is restored in addition to timeout.
Phil Sutter [Tue, 3 Sep 2024 15:43:19 +0000 (17:43 +0200)]
libnftables: Zero ctx->vars after freeing it
Leaving the invalid pointer value in place will cause a double-free when
users call nft_ctx_clear_vars() first, then nft_ctx_free(). Moreover,
nft_ctx_add_var() passes the pointer to mrealloc() and thus assumes it
to be either NULL or valid.
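The pattern behind this fix can be sketched with a toy context object. The toy_ctx type and its functions are hypothetical and only mimic the call sequence named above (clear, then add or free); they are not the libnftables implementation.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical context holding a growable array of owned strings. */
struct toy_ctx {
    char **vars;
    unsigned int num_vars;
};

/* Append a copy of var; relies on vars being NULL or a valid pointer. */
static int toy_ctx_add_var(struct toy_ctx *ctx, const char *var)
{
    size_t len = strlen(var) + 1;
    char *copy = malloc(len);
    char **tmp;

    if (!copy)
        return -1;
    memcpy(copy, var, len);

    tmp = realloc(ctx->vars, (ctx->num_vars + 1) * sizeof(*tmp));
    if (!tmp) {
        free(copy);
        return -1;
    }
    ctx->vars = tmp;
    ctx->vars[ctx->num_vars++] = copy;
    return 0;
}

/* Release all variables and reset the pointer so that a later
 * add or free does not touch freed memory. */
static void toy_ctx_clear_vars(struct toy_ctx *ctx)
{
    for (unsigned int i = 0; i < ctx->num_vars; i++)
        free(ctx->vars[i]);
    free(ctx->vars);
    ctx->vars = NULL;
    ctx->num_vars = 0;
}

static void toy_ctx_free(struct toy_ctx *ctx)
{
    toy_ctx_clear_vars(ctx);
    free(ctx);
}
```

Without the two resets in toy_ctx_clear_vars(), calling it and then toy_ctx_free() would free the same array twice.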
position refers to the rule handle, it has similar cache requirements as
replace rule command, relax cache requirements.
Commit e5382c0d08e3 ("src: Support intra-transaction rule references")
uses position.id for index support which requires a full cache, but
only in such case.
Fixes: 01e5c6f0ed03 ("src: add cache level flags")
Tested-by: Eric Garver <eric@garver.life>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need for full cache, this command relies on the rule handle which is
not validated from userspace. Cache requirements are similar to those
of add/create/delete rule commands.
This speeds up incremental updates with large rulesets.
Extend tests/coverage for rule replacement.
Fixes: 01e5c6f0ed03 ("src: add cache level flags")
Tested-by: Eric Garver <eric@garver.life>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
cache: assert filter when calling nft_cache_evaluate()
nft_cache_evaluate() always takes a non-null filter, remove superfluous
checks when calculating cache requirements via flags.
Note that filter is still optional from the netlink dump path, since this
can be called from the error path to provide hints.
Fixes: 08725a9dc14c ("cache: filter out rules by chain")
Fixes: b3ed8fd8c9f3 ("cache: missing family in cache filtering")
Fixes: 635ee1cad8aa ("cache: filter out sets and maps that are not requested")
Fixes: 3f1d3912c3a6 ("cache: filter out tables that are not requested")
Tested-by: Eric Garver <eric@garver.life>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reset command does not utilize the cache infrastructure.
This implicitly fixes a crash with anonymous sets because elements are
not fetched. I initially tried to fix it by toggling the missing cache
flags, but then ASAN reports memleaks.
Addressing these issues relies on Phil's list filtering infrastructure,
which is expanded to accommodate the filtering requirements of the
reset commands, such as 'reset table ip' where only the family is sent
to the kernel.
After this update, tests/shell reports a few inconsistencies between
reset and list commands:
- reset rules chain t c2
displays sets, but it should only list the given chain.
- reset rules table t
reset rules ip
do not list elements in the set. In both cases, a given table and
family are listed in full, so elements should be included.
The consolidation also ensures list and reset will not differ.
A few more notes:
- CMD_OBJ_TABLE is used for:
rules family table
from the parser, due to the lack of a better enum, same applies to
CMD_OBJ_CHAIN.
- CMD_OBJ_ELEMENTS still does not use the cache, but same occurs in
the CMD_GET command case which needs to be consolidated.
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1763
Fixes: 83e0f4402fb7 ("Implement 'reset {set,map,element}' commands")
Fixes: 1694df2de79f ("Implement 'reset rule' and 'reset rules' commands")
Tested-by: Eric Garver <eric@garver.life>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Only the family is set in the dump request; set the table and chain as
well, otherwise rules for the given family are fetched for each existing
table.
Fixes: afbd102211dc ("src: do not use the nft_cache_filter object from mnl.c")
Tested-by: Eric Garver <eric@garver.life>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
babc6ee8773c ("cache: populate chains on demand from error path")
Flags describe cache requirements for a given batch, accumulate flags
that are inferred from commands in this batch.
Fixes: 7df42800cf89 ("src: single cache_update() call to build cache before evaluation")
Tested-by: Eric Garver <eric@garver.life>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
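Accumulating requirements across a batch amounts to OR-ing the flags inferred from each command, so that no command's needs are overwritten by a later one. The flag names below are hypothetical placeholders, not the NFT_CACHE_* definitions.

```c
#include <stddef.h>

/* Hypothetical cache requirement bits. */
enum {
    TOY_CACHE_TABLE = 1 << 0,
    TOY_CACHE_CHAIN = 1 << 1,
    TOY_CACHE_SET   = 1 << 2,
    TOY_CACHE_RULE  = 1 << 3,
};

/* Combine the requirements of every command in the batch. */
static unsigned int toy_batch_cache_flags(const unsigned int *cmd_flags,
                                          size_t n)
{
    unsigned int flags = 0;

    for (size_t i = 0; i < n; i++)
        flags |= cmd_flags[i];  /* accumulate, never overwrite */

    return flags;
}
```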
Unconditionally reset the filter for each command in the batch; this is safer.
Fixes: 3f1d3912c3a6 ("cache: filter out tables that are not requested")
Tested-by: Eric Garver <eric@garver.life>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
parser_json: fix crash in json_parse_set_stmt_list
Due to a missing `NULL`-check, there will be a segfault for invalid statements.
Fixes: 07958ec53830 ("json: add set statement list support")
Signed-off-by: Sebastian Walz (sivizius) <sebastian.walz@secunet.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It will return a pointer to an owned string, the caller must free it.
However, `json_error` just borrows the string to format it as `%s`, but
after printing the formatted error message, the pointer to the string is
lost and thus never freed.
Fixes: 586ad210368b ("libnftables: Implement JSON parser")
Signed-off-by: Sebastian Walz (sivizius) <sebastian.walz@secunet.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
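The ownership issue described above can be sketched without any JSON library: a helper returns a heap-allocated string, the error path only borrows it for "%s" formatting, so the caller has to keep the pointer and free it afterwards. The functions toy_dump() and toy_report() are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Returns an owned string; the caller must free it. */
static char *toy_dump(int value)
{
    char *buf = malloc(32);

    if (buf)
        snprintf(buf, 32, "{\"value\": %d}", value);
    return buf;
}

/* Format an error message. The dump is only borrowed by snprintf(),
 * so keep the pointer in a local and release it once formatted. */
static int toy_report(char *out, size_t len, int value)
{
    char *dump = toy_dump(value);
    int ret;

    if (!dump)
        return -1;

    ret = snprintf(out, len, "Invalid object: %s", dump);
    free(dump);
    return ret;
}
```

Passing toy_dump(value) directly as the format argument would compile and print the same message, but the pointer would be lost once the call returns.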
parser_bison: allow 0 burst in limit rate byte mode
Unbreak restoring elements in set with rate limit that fail with:
> /dev/stdin:3618:61-61: Error: limit burst must be > 0
> elements = { 1.2.3.4 limit rate over 1000 kbytes/second timeout 1s,
There is no need for burst != 0 in limit rate byte mode.
Add tests/shell too.
Fixes: 702eff5b5b74 ("src: allow burst 0 for byte ratelimit and use it as default")
Fixes: 285baccfea46 ("src: disallow burst 0 in ratelimits")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
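The resulting check can be sketched as follows, assuming that the "burst must be > 0" restriction continues to apply to packet-based limits only. The enum and function names are hypothetical and not taken from the parser.

```c
#include <stdbool.h>
#include <stdint.h>

enum toy_limit_mode { TOY_LIMIT_PKTS, TOY_LIMIT_BYTES };

/* Validate the burst value of a rate limit: byte mode accepts 0,
 * packet mode (assumption) still requires a positive burst. */
static bool toy_limit_burst_valid(enum toy_limit_mode mode, uint64_t burst)
{
    if (mode == TOY_LIMIT_BYTES)
        return true;

    return burst > 0;
}
```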
src: drop obsolete hook argument from hook dump functions
since commit b98fee20bfe2 ("mnl: revisit hook listing"), handle.chain is
never set in this path, so 'hook' is always set to -1, so the hook arg
can be dropped.
mnl_nft_dump_nf_hooks() can call itself for the UNSPEC case, this
avoids the second switch/case to handle printing for inet/unspec.
As for the error handling, 'nft list hooks' should not print an error,
even if nothing is printed, UNLESS there was also a lowlevel (syscall)
error from the kernel.
We don't want to indicate failure just because e.g. kernel doesn't support
NFPROTO_ARP.
This also fixes a display bug, 'nft list hooks device foo' would show hooks
registered for that device as 'bridge' family instead of the expected
'netdev' family.
This was because UNSPEC handling did not query 'netdev' family and did
pass the device name to the lowlevel function. Add it, and pass NULL
device name for those families that don't support device attachment.
The lowlevel function currently always queries NFPROTO_NETDEV to handle
the 'inet' ingress case.
This is dubious, as 'inet ingress' is a pseudo-alias to netdev family
(inet itself is a pseudo-family that ends up registering for both ipv4
and ipv6 hooks).
ERR: "tests/shell/testcases/chains/jump_to_base_chain" has no "tests/shell/testcases/chains/dumps/jump_to_base_chain.{nft,nodump}" file
For all of those, add the relevant .nft dump file.
Add a 'nodump' file in case the test doesn't print anything (e.g.
because the test checks that invalid ruleset fails validation).
Some tests have a .nft but not .json-nft, this is because json lacks
some features, in particular "typeof" and anonymous/implicit chains.
ERR: "tests/shell/testcases/maps/delete_element_catchall" has no "tests/shell/testcases/maps/dumps/delete_element_catchall.{nft,nodump}" file
ERR: "tests/shell/testcases/maps/dumps/delete_elem_catchall.nft" has no test "tests/shell/testcases/maps/delete_elem_catchall"
these two are related, rename the dump file to match the script name.
- index with rule breaks, because NFT_CACHE_REFRESH is missing.
- simple set updates.
Moreover, the current process could populate the cache with objects for
listing commands (no generation ID is bumped) while another process
updates the ruleset, leading to an inconsistent cache due to the
genid + 1 check.
This optimization needs more work and more tests for -i/--interactive,
revert it.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
tests: shell: add more ruleset validation test cases
Passes fine on all tested kernel releases.
Same as existing tests, but try harder to fool the validation:
1. Add a ruleset where the jump that exceeds 16 is "broken", i.e.
c0 -> c1 ... -> c8
c9-> c1 ... -> c16
Where c0 is a base chain, with a graph that's really a linear list
from c0 to c8, and c9 to c16 is a linear list not connected to the former
or a hook point.
Then try to link them either directly via jump/goto rule or indirectly
with a verdict map.
Try both unbound map with element doing 'goto c9' and then trying to add
vmap rule to c8 (must fail, creates link).
Then try reverse: with empty map, add vmap rule to c8 (should work, no
elements...).
Then, add map element with jump or goto to c9. This should be rejected.
Try the same thing with a tproxy expression in a user-defined chain:
attempt to make it reachable from c0 (filter input), which is illegal.
/dev/stdin is a placeholder, read() from STDIN_FILENO is used to fetch
the standard input into a buffer.
Since 5c2b2b0a2ba7 ("src: error reporting with -f and read from stdin")
stdin is stored in a buffer to fix error reporting.
This patch requires: ("parser_json: use stdin buffer if available")
Fixes: 149b1c95d129 ("libnftables: refuse to open onput files other than named pipes or regular files")
Acked-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>