Jeremy Sowden [Sat, 11 Dec 2021 18:55:25 +0000 (18:55 +0000)]
evaluate: reject: support ethernet as L2 protocol for inet table
When we are evaluating a `reject` statement in the `inet` family, we may
have `ether` and `ip` or `ip6` as the L2 and L3 protocols in the
evaluation context:
The reason it fails is that the ethernet protocol numbers for IPv4 and
IPv6 (`ETH_P_IP` and `ETH_P_IPV6`) do not match `NFPROTO_IPV4` and
`NFPROTO_IPV6`. Add support for the ethernet protocol numbers.
Replace the current `BUG("unsupported family")` error message with
something more informative that tells the user to provide an explicit
reject option.
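
The mismatch is easy to see from the numeric values. The following is a minimal Python sketch, not nftables code: the four constants carry the values from the Linux headers, while `l3_family` and its error text are hypothetical and only illustrate accepting both numbering schemes.

```python
# Values as defined in the Linux headers.
NFPROTO_IPV4 = 2
NFPROTO_IPV6 = 10
ETH_P_IP = 0x0800
ETH_P_IPV6 = 0x86DD


def l3_family(protonum):
    """Hypothetical helper: map either numbering scheme to 'ip' or 'ip6'."""
    if protonum in (NFPROTO_IPV4, ETH_P_IP):
        return "ip"
    if protonum in (NFPROTO_IPV6, ETH_P_IPV6):
        return "ip6"
    # Illustrative message only: ask the user for an explicit reject option.
    raise ValueError("cannot infer reject type, specify it explicitly")
```

With only the `NFPROTO_*` values accepted, `0x0800` and `0x86DD` would fall through to the error path, which matches the failure described above.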
Jeremy Sowden [Sat, 11 Dec 2021 18:55:23 +0000 (18:55 +0000)]
proto: short-circuit loops over upper protocols
Each `struct proto_desc` contains a fixed-size array of higher layer
protocols. Only the first few are not NULL. Therefore, we can stop
iterating over the array once we reach a NULL member.
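
A rough Python model of the idea, assuming the used entries are packed at the front of the array. The array size of 16 and the `find_upper` name are illustrative assumptions, not taken from the patch.

```python
ARRAY_SIZE = 16  # assumed size of the fixed array, for illustration only


def find_upper(protocols, num):
    """Search a fixed-size list of (num, name) entries padded with None."""
    for entry in protocols:
        if entry is None:
            break  # every later slot is unused as well, so stop here
        if entry[0] == num:
            return entry[1]
    return None


table = [(6, "tcp"), (17, "udp")] + [None] * (ARRAY_SIZE - 2)
```

Here a lookup for an unknown number inspects three slots instead of all sixteen.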
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
First binop masks out unwanted parts of the 16-bit field.
Second binop needs to left-shift so that lookups in the set will work.
When decoding, the first binop is removed after the exthdr load
has been adjusted accordingly. Constant propagation adjusts the
shift-value to 0 on removal. This change then gets rid of the
shift-by-0 entirely.
After this change, 'frag frag-off @s4' input is shown as-is.
Štěpán Němec [Wed, 1 Dec 2021 11:12:00 +0000 (12:12 +0100)]
tests: shell: better parameters for the interval stack overflow test
Wider testing has shown that 128 kB stack is too low (e.g. for systems
with 64 kB page size), leading to false failures in some environments.
Based on results from a matrix of RHEL 8 and RHEL 9 systems across
x86_64, aarch64, ppc64le and s390x architectures as well as some
anecdotal testing of other Linux distros on x86_64 machines, 400 kB
seems safe: the normal nft stack (which should stay constant during
this test) on all tested systems doesn't exceed 200 kB (stays around
100 kB on typical systems with 4 kB page size), while always growing
beyond 500 kB in the failing case (nftables before baecd1cf2685) with
the increased set size.
Fixes: d8ccad2a2b73 ("tests: cover baecd1cf2685 ("segtree: Fix segfault when restoring a huge interval set")")
Signed-off-by: Štěpán Němec <snemec@redhat.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
It's always 0, so remove it.
Looks like this was intended to support variable options that have
array-like members, but so far this isn't implemented. Better to remove
the dead code and implement it properly when such support is needed.
Phil Sutter [Tue, 30 Nov 2021 15:57:54 +0000 (16:57 +0100)]
cache: Filter set list on server side
Fetch either all tables' sets at once, a specific table's sets or even a
specific set if needed instead of iterating over the list of previously
fetched tables and fetching for each, then ignoring anything returned
that doesn't match the filter.
Phil Sutter [Mon, 29 Nov 2021 15:26:44 +0000 (16:26 +0100)]
cache: Filter chain list on kernel side
When operating on a specific chain, add payload to NFT_MSG_GETCHAIN so
kernel returns only relevant data. Since ENOENT is an expected return
code, do not treat this as error.
While at it, improve the code in chain_cache_cb() a bit:
- Check chain's family first, it is a less expensive check than
comparing table names.
- Do not extract chain name of uninteresting chains.
Phil Sutter [Mon, 29 Nov 2021 14:36:45 +0000 (15:36 +0100)]
cache: Filter rule list on kernel side
Instead of fetching all existing rules in kernel's ruleset and filtering
in user space, add payload to the dump request specifying the table and
chain to filter for.
Since list_rule_cb() no longer needs the filter, pass only netlink_ctx
to the callback and drop struct rule_cache_dump_ctx.
Phil Sutter [Mon, 29 Nov 2021 14:28:33 +0000 (15:28 +0100)]
cache: Filter tables on kernel side
Instead of requesting a dump of all tables and filtering the data in
user space, construct a non-dump request if filter contains a table so
kernel returns only that single table.
This should improve nft performance in rulesets with many tables
present.
Florian Westphal [Sun, 21 Nov 2021 22:33:19 +0000 (23:33 +0100)]
exthdr: fix tcpopt_find_template to use length after mask adjustment
Unify binop handling for ipv6 extension header, ip option and tcp option
processing.
Pass the real offset and length expected, not the one used in the kernel.
This was already done for extension headers and ip options, but tcp
option parsing did not do this.
This was fine before because no existing tcp option template
had a non-byte sized member.
With the mptcp addition this isn't the case anymore: the subtype field is
only 4 bits wide, but tcp option delinearization passed 8 bits instead.
Pass the offset and mask delta, just like ip option/ipv6 exthdr.
This makes nft show 'tcp option mptcp subtype 1' instead of
'tcp option mptcp unknown & 240 == 16'.
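
The two outputs are related by simple bit arithmetic. A minimal Python sketch (the helper name `decode_bitfield` is hypothetical, and this is not the nft delinearization code) shows how a mask of 240 and a value of 16 correspond to a 4-bit field holding 1:

```python
def decode_bitfield(mask, value):
    """Shift a masked comparison down to the field's own bit position.

    Returns (field_value, field_width_in_bits).
    """
    if mask == 0:
        raise ValueError("empty mask")
    shift = (mask & -mask).bit_length() - 1  # index of the lowest set bit
    width = (mask >> shift).bit_length()     # number of bits the mask covers
    return value >> shift, width


print(decode_bitfield(240, 16))  # (1, 4)
```

A template lookup that uses the 4-bit width can find the subtype field, whereas a lookup with 8 bits cannot, which is why the old output fell back to "unknown".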
Florian Westphal [Sun, 21 Nov 2021 22:33:05 +0000 (23:33 +0100)]
scanner: add tcp flex scope
This moves tcp options not used anywhere else (e.g. in synproxy) to a
distinct scope. This also makes it possible to avoid exposing new option
keywords in the ruleset context.
Phil Sutter [Wed, 10 Mar 2021 18:46:08 +0000 (19:46 +0100)]
netlink_delinearize: Fix for escaped asterisk strings on Big Endian
The original nul-char detection was not functional on Big Endian.
Instead, go a simpler route by exporting the string and working on the
exported data to check for a nul-char and escape a trailing asterisk if
present. With the data export already happening in the caller, fold
escaped_string_wildcard_expr_alloc() into it as well.
Phil Sutter [Wed, 10 Mar 2021 13:38:37 +0000 (14:38 +0100)]
datatype: Fix size of time_type
Used by 'ct expiration', time_type is supposed to be 32bits. Passing a
64bits variable to constant_expr_alloc() causes the value to be always
zero on Big Endian.
Fixes: 0974fa84f162a ("datatype: seperate time parsing/printing from time_type")
Signed-off-by: Phil Sutter <phil@nwl.cc>
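
The effect can be reproduced outside of nftables. This Python sketch uses `struct` to model the memory layout of a 64-bit variable on each byte order and then reads only the first four bytes, as a consumer expecting 32 bits would. The sample value 3600 is arbitrary.

```python
import struct

value = 3600  # arbitrary sample value

# A 64-bit variable as laid out in memory on each byte order.
as_u64_be = struct.pack(">Q", value)
as_u64_le = struct.pack("<Q", value)

# Read only the first 4 bytes, in the matching byte order.
first4_be = struct.unpack(">I", as_u64_be[:4])[0]
first4_le = struct.unpack("<I", as_u64_le[:4])[0]

print(first4_be, first4_le)  # 0 3600
```

On Big Endian the first four bytes are the upper half of the value, which is zero for anything that fits in 32 bits.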
Phil Sutter [Wed, 10 Mar 2021 10:45:47 +0000 (11:45 +0100)]
meta: Fix hour_type size
In kernel as well as when parsing, hour_type is assumed to be 32bits.
Having the struct datatype field set to 64bits breaks Big Endian and so
does passing a 64bit value and 32 as length to constant_expr_alloc() as
it makes it import the upper 32bits. Fix this by turning 'result' into a
uint32_t and introduce a temporary uint64_t just for the call to
time_parse() which expects that.
Fixes: f8f32deda31df ("meta: Introduce new conditions 'time', 'day' and 'hour'")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Phil Sutter [Tue, 9 Mar 2021 20:24:30 +0000 (21:24 +0100)]
meta: Fix {g,u}id_type on Big Endian
Using a 64bit variable to temporarily hold the parsed value works only
on Little Endian. uid_t and gid_t (and therefore also pw->pw_uid and
gr->gr_gid) are 32bit.
To fix this, use uid_t/gid_t for the temporary variable but keep the
64bit one for numeric parsing so values exceeding 32bits are still
detected.
Fixes: e0ed4c45d9ad2 ("meta: relax restriction on UID/GID parsing")
Signed-off-by: Phil Sutter <phil@nwl.cc>
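
The split between a wide variable for parsing and a 32-bit variable for storage can be sketched as follows. This is a Python illustration under assumed names (`parse_id` is hypothetical), not the patched C code.

```python
import struct

UINT32_MAX = 0xFFFFFFFF


def parse_id(text):
    """Parse a numeric UID/GID and return its 4-byte big-endian encoding."""
    wide = int(text, 10)  # wide parse, so out-of-range input is still detected
    if wide < 0 or wide > UINT32_MAX:
        raise ValueError("value exceeds 32 bits")
    return struct.pack(">I", wide)  # store through a 32-bit type
```

Storing through the 32-bit type means the exported bytes are the ones that carry the value on either byte order.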
Phil Sutter [Thu, 17 Dec 2020 17:19:18 +0000 (18:19 +0100)]
src: Fix payload statement mask on Big Endian
The mask used to select bits to keep must be exported in the same
byteorder as the payload statement itself. Also, the length of the
exported data must match the number of bytes extracted earlier.
Phil Sutter [Thu, 17 Dec 2020 14:52:03 +0000 (15:52 +0100)]
mnl: Fix for missing info in rule dumps
Commit 0e52cab1e64ab improved error reporting by adding rule's table and
chain names to netlink message directly, prefixed by their location
info. This in turn caused netlink dumps of the rule to not contain table
and chain name anymore. Fix this by inserting the missing info before
dumping and remove it afterwards to not cause duplicated entries in
netlink message.
Phil Sutter [Wed, 17 Mar 2021 19:39:38 +0000 (20:39 +0100)]
exthdr: Fix for segfault with unknown exthdr
An unknown exthdr type with the NFT_EXTHDR_F_PRESENT flag set caused a
NULL-pointer deref. Fix this by moving the conditional exthdr.desc deref
to the top of the function and using the result in all cases.
Fixes: e02bd59c4009b ("exthdr: Implement existence check")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Phil Sutter [Thu, 4 Feb 2021 14:58:25 +0000 (15:58 +0100)]
tests/py: Avoid duplicate records in *.got files
If payloads don't contain family-specific bits, they may sit in a single
*.payload file for all tested families. In such case, nft-test.py will
consequently write dissenting payloads into a single *.got file. To
avoid the duplicate entries, check if a matching record exists already
before writing it out.
This header is not required to compile nftables with editline. Remove
it; this unbreaks compilation on several distros which have no symlink
from history.h to editline.h.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
mnl.c: In function ‘mnl_batch_talk’:
mnl.c:417:17: warning: comparison of integer expressions of different signedness: ‘unsigned int’ and ‘long int’ [-Wsign-compare]
if (rcvbufsiz < NFT_MNL_ECHO_RCVBUFF_DEFAULT)
^
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Wed, 17 Nov 2021 13:26:21 +0000 (14:26 +0100)]
monitor: do not call interval_map_decompose() for concat intervals
Without this, nft monitor will either print garbage or even segfault
when encountering a concat set because we pass expr->value to libgmp
helpers for concat (non-value) expressions.
Also, for concat case, we need to call concat_range_aggregate() helper.
Add a test case for this. Without this patch, it gives:
Fixes: 50780456a01a ("evaluate: check for missing transport protocol match in nat map with concatenations")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Check family when filtering out listing of tables and sets.
Fixes: 3f1d3912c3a6 ("cache: filter out tables that are not requested")
Fixes: 635ee1cad8aa ("cache: filter out sets and maps that are not requested")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Thu, 28 Oct 2021 15:36:06 +0000 (17:36 +0200)]
doc: update ct timeout section with the state names
The docs are too terse and did not have the list of valid timeout states.
While at it, adjust the default stream timeout of udp to 120; this is the
current kernel default.
evaluate: clone variable expression if there is more than one reference
Clone the expression that defines the variable value if there are
multiple references to it in the ruleset. This saves heap memory
consumption in case the variable defines a set with a huge number of
elements.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Do not call alloc_setelem_cache() to build the set element list in
nftnl_set. Instead, translate one single set element expression to
nftnl_set_elem object at a time and use this object to build the netlink
header.
Using a huge test set containing a 1.1 million element blocklist, this
patch reduces userspace memory consumption by 40%.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
tests: py: remove verdict from closing end interval
The kernel does not allow the NFT_SET_ELEM_INTERVAL_END flag together with
NFTA_SET_ELEM_DATA. The closing end interval represents a mismatch,
therefore, no verdict can be applied. The existing payload files show
the drop verdict when this is unset (because NF_DROP=0).
This update is required to fix payload warnings in tests/py after
libnftnl's ("set: use NFTNL_SET_ELEM_VERDICT to print verdict").
Fixes: 6671d9d137f6 ("mnl: Set NFTNL_SET_DATA_TYPE before dumping set elements")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Štěpán Němec [Fri, 5 Nov 2021 11:39:11 +0000 (12:39 +0100)]
tests: shell: $NFT needs to be invoked unquoted
The variable has to undergo word splitting, otherwise the shell tries
to find the variable value as an executable, which breaks in cases that
7c8a44b25c22 ("tests: shell: Allow wrappers to be passed as nft command")
intends to support.
Mention this in the shell tests README.
Fixes: d8ccad2a2b73 ("tests: cover baecd1cf2685 ("segtree: Fix segfault when restoring a huge interval set")")
Signed-off-by: Štěpán Němec <snemec@redhat.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
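
The difference between the quoted and unquoted forms can be modelled in Python. `shlex.split` only approximates shell word splitting for simple whitespace-separated values, and the wrapper value 'valgrind nft' is just an example.

```python
import shlex

NFT = "valgrind nft"  # example wrapper command
args = "list ruleset"

# Roughly what the shell executes for:  $NFT list ruleset
unquoted = shlex.split(NFT) + shlex.split(args)

# Roughly what the shell executes for:  "$NFT" list ruleset
quoted = [NFT] + shlex.split(args)

print(unquoted[0])  # valgrind
print(quoted[0])    # valgrind nft
```

In the quoted case the shell looks for an executable literally named 'valgrind nft', which does not exist.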
Štěpán Němec [Fri, 5 Nov 2021 11:39:10 +0000 (12:39 +0100)]
tests: shell: README: clarify test file name convention
Since commit 4d26b6dd3c4c, test file name suffix no longer reflects
expected exit code in all cases.
Move the sentence "Since they are located with `find', test files can
be put in any subdirectory." to a separate paragraph.
Fixes: 4d26b6dd3c4c ("tests: shell: change all test scripts to return 0")
Signed-off-by: Štěpán Němec <snemec@redhat.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Štěpán Němec [Fri, 5 Nov 2021 11:39:09 +0000 (12:39 +0100)]
tests: shell: README: $NFT does not have to be a path to a binary
Since commit 7c8a44b25c22, $NFT can contain an arbitrary command,
e.g. 'valgrind nft'.
Fixes: 7c8a44b25c22 ("tests: shell: Allow wrappers to be passed as nft command")
Signed-off-by: Štěpán Němec <snemec@redhat.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
evaluate: postpone transport protocol match check after nat expression evaluation
Fix bogus error report when using transport protocol as map key.
Fixes: 50780456a01a ("evaluate: check for missing transport protocol match in nat map with concatenations")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jeremy Sowden [Fri, 29 Oct 2021 20:40:08 +0000 (21:40 +0100)]
parser: add `limit_rate_pkts` and `limit_rate_bytes` rules
Factor the `N / time-unit` and `N byte-unit / time-unit` expressions
from limit expressions out into separate `limit_rate_pkts` and
`limit_rate_bytes` rules respectively.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Štěpán Němec [Wed, 20 Oct 2021 12:44:09 +0000 (14:44 +0200)]
tests: run-tests.sh: ensure non-zero exit when $failed != 0
POSIX [1] does not specify the behavior of `exit' with arguments
outside the 0-255 range, but what generally (bash, dash, zsh, OpenBSD
ksh, busybox) seems to happen is the shell exiting with status & 255
[2], which results in zero exit for certain non-zero arguments.
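
The truncation is plain bit arithmetic. The Python sketch below models it; `safe_exit_status` is a hypothetical way to avoid the problem and not necessarily what the patch does.

```python
def shell_exit_status(failed):
    """Status seen by the parent if the shell truncates to 8 bits."""
    return failed & 255


def safe_exit_status(failed):
    """Hypothetical fix: collapse any non-zero count to a non-zero status."""
    return 1 if failed != 0 else 0


print(shell_exit_status(256), safe_exit_status(256))  # 0 1
```

Any failure count that is a multiple of 256 would be reported as success without such a guard.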
Phil Sutter [Tue, 2 Nov 2021 19:53:53 +0000 (20:53 +0100)]
tests: shell: Fix bogus testsuite failure with 250Hz
The previous fix for HZ=100 was not sufficient: a kernel with HZ=250
apparently rounds the 10ms to 8ms. Do as Lukas suggests and accept the occasional
input/output asymmetry instead of continuing the hide'n'seek game.
Fixes: c9c5b5f621c37 ("tests: shell: Fix bogus testsuite failure with 100Hz")
Suggested-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
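
One plausible explanation for the 8ms is tick granularity. The Python sketch below assumes the timeout is converted to whole ticks with truncation and then back to milliseconds. That conversion is an assumption made for illustration, not something this commit states.

```python
def roundtrip_ms(ms, hz):
    """Convert milliseconds to whole ticks (truncating) and back again."""
    ticks = ms * hz // 1000
    return ticks * 1000 // hz


print(roundtrip_ms(10, 100))  # 10
print(roundtrip_ms(10, 250))  # 8
```

At HZ=250 a tick is 4ms, so 10ms cannot be represented exactly and the value read back differs from the input.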
Lukas Wunner [Wed, 11 Mar 2020 12:20:06 +0000 (13:20 +0100)]
src: Support netdev egress hook
Add userspace support for the netdev egress hook which is queued up for
v5.16-rc1, complete with documentation and tests. Usage is identical to
the ingress hook.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Štěpán Němec [Wed, 20 Oct 2021 12:42:20 +0000 (14:42 +0200)]
tests: cover baecd1cf2685 ("segtree: Fix segfault when restoring a huge interval set")
Test inspired by [1] with both the set and stack size reduced by the
same power of 2, to preserve the (pre-baecd1cf2685) segfault on one
hand, and make the test successfully complete (post-baecd1cf2685) in a
few seconds even on weaker hardware on the other.
(The reason I stopped at 128kB stack size is that with 64kB I was
getting segfaults even with baecd1cf2685 applied.)
Florian Westphal [Tue, 19 Oct 2021 12:07:25 +0000 (14:07 +0200)]
tests: shell: auto-removal of chain hook on netns removal
This is the nft equivalent of the syzbot report that led to
kernel commit 68a3765c659f8
("netfilter: nf_tables: skip netdev events generated on netns removal").
Jeremy Sowden [Thu, 7 Oct 2021 20:12:21 +0000 (21:12 +0100)]
rule: fix stateless output after listing sets containing counters
Before outputting counters in set definitions the
`NFT_CTX_OUTPUT_STATELESS` flag was set to suppress output of the
counter state and unconditionally cleared afterwards, regardless of
whether it had been originally set. Record the original set of flags
and restore it.
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=994273
Fixes: 6d80e0f15492 ("src: support for counter in set definition")
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
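
The bug pattern and its fix can be shown with a single flag bit. In this Python sketch `STATELESS` stands in for `NFT_CTX_OUTPUT_STATELESS` with an illustrative bit position, and both function names are hypothetical.

```python
STATELESS = 1 << 2  # illustrative bit, stands in for NFT_CTX_OUTPUT_STATELESS


def print_set_buggy(flags):
    flags |= STATELESS
    # ... print the set definition without counter state ...
    flags &= ~STATELESS  # clears the bit even if the caller had set it
    return flags


def print_set_fixed(flags):
    saved = flags  # remember what the caller asked for
    flags |= STATELESS
    # ... print the set definition without counter state ...
    return saved  # restore exactly the original flags
```

With the buggy variant, a caller that requested stateless output loses the flag after the first set is printed, so later output includes state again.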
Jeremy Sowden [Thu, 7 Oct 2021 20:12:20 +0000 (21:12 +0100)]
rule: remove fake stateless output of named counters
When `-s` is passed, no state is output for named quotas and counter and
quota rules, but fake zero state is output for named counters. Remove
the output of named counters to match the remaining stateful objects.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Štěpán Němec [Mon, 11 Oct 2021 11:59:04 +0000 (13:59 +0200)]
doc: libnftables-json: make the example valid libnftables JSON input
- Add missing comma between array elements.
- Fix chain 'name' property.
- Match 'op' property is mandatory.
Fixes: 2e56f533b36a ("doc: Improve example in libnftables-json(5)")
Fixes: 90d4ee087171 ("JSON: Make match op mandatory, introduce 'in' operator")
Signed-off-by: Štěpán Němec <snemec@redhat.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Set the cache flags for the nested notation too. This fixes nft -f
with two files, one that contains the set declaration and another that
adds a rule that refers to that set.
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1474
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
evaluate: check for missing transport protocol match in nat map with concatenations
Restore this error with NAT maps:
# nft add rule 'ip ipfoo c dnat to ip daddr map @y'
Error: transport protocol mapping is only valid after transport protocol match
add rule ip ipfoo c dnat to ip daddr map @y
~~~~ ^^^^^^^^^^^^^^^
Allow for transport protocol match in the map too, which is implicitly
pulling in a transport protocol dependency.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netlink: dynset: set compound expr dtype based on set key definition
"nft add rule ... add @t { ip saddr . 22 ..." will be listed as
'ip saddr . 0x16 [ invalid type]'.
This is a display bug, the compound expression created during netlink
deserialization lacks correct datatypes for the value expression.
Avoid this by setting the individual expressions' datatype.
The set key has the needed information, so walk over the types and set
them in the dynset statement.
Also add a test case.
Reported-by: Paulo Ricardo Bruck <paulobruck1@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>