git.ipfire.org Git - thirdparty/freeradius-server.git/log

Pin standard_conforming_strings on the PostgreSQL CI test database

PQescapeStringConn doubles backslashes when the connection's
std_strings flag is false, which trips the backslash assertion in
src/tests/modules/sql_postgresql/escape.unlang on test servers where
the parameter isn't being reported as on. Pin it at the database
level in postgresql-setup.sh so every connection inherits it, and
restore the backslash assertion in the test.
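
The effect of the flag can be sketched with a toy escaper. This is not
libpq code: toy_pg_escape() is a hypothetical stand-in that only mimics
the documented behaviour (single quotes are always doubled, backslashes
are doubled only when std_strings is false), and it silently truncates
when the output buffer is too small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the escaping rule, not the libpq function.
 * Returns the number of bytes written, excluding the trailing NUL.
 */
static size_t toy_pg_escape(char *out, size_t outlen, char const *in, bool std_strings)
{
	size_t used = 0;

	assert(outlen > 0);
	/* Keep room for a doubled character plus the terminating NUL. */
	for (; *in != '\0' && used + 3 <= outlen; in++) {
		bool dbl = (*in == '\'') || (*in == '\\' && !std_strings);

		if (dbl) out[used++] = *in;	/* emit the extra copy first */
		out[used++] = *in;
	}
	out[used] = '\0';
	return used;
}
```

With std_strings true a backslash passes through unchanged, which is
what the backslash assertion in the test expects; with it false the
same input grows by one byte per backslash.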

Make SQL escape functions binary safe, and check escaping functions correctly

add more CFLAGS

we can just use talloc_zero() here

which is less code and clearer

sort the libraries for consistent build order

and use $(LIBFREERADIUS_SERVER) instead of manual linkages,
which lets us build with / without TLS much more easily

POSTCLEAN is a command, not a list of files

don't pass a free function to the hash tables

both the hash and das are parented by the same dict. so we can
just rely on the talloc ordering to do the cleanups correctly.

if there's a free function which frees the hash table entry, then
we need a corresponding destructor in the 'da', which removes the
da from the hash table.

Without that, we have accidental ordering, and potential for
problems if anything inside of talloc

eap: drop attr_state from libfreeradius-eap

libfreeradius-eap is meant to be protocol-agnostic, but it autoloaded
the RADIUS State attribute and reached into the request reply_pairs
from eap_fail() to delete it. State management on Access-Reject is
process_radius's job: RESUME(access_reject) already calls
fr_state_discard() to unlink the state-tree entry. Nothing on the
reject path actually adds a State pair to reply_pairs, so the strip
in eap_fail was guarding against a case that doesn't happen.

Remove the autoload entry and the strip; if a future caller needs to
enforce "no State on Access-Reject" at the wire level, that belongs in
the process module's send-reject path, not in the EAP library.

eap: fix talloc abort on EAP-NAK for unsupported method (#5846)

eap_session_discard() looked up the request_data_t via
request_data_reference(), which leaves the entry in place with its
opaque pointer set to the eap_session we then talloc_free.  When the
session is frozen (request == NULL on the eap_session), the destructor's
own request_data_get() cleanup path is short-circuited, so the entry
survives with a dangling rd->opaque.  When the request's
session_state_ctx is later freed, the rd destructor calls
talloc_free(rd->opaque) on the freed chunk and aborts with "Bad talloc
magic value".

Switch to request_data_get() so the entry is unlinked atomically as
the eap_session is pulled out, then free the session.  No callers rely
on the entry surviving the discard.

Triggered by NAK-for-unsupported-method when ignore_unknown_eap_types
is no, but applies to any path that hits eap_failure() for a frozen
session.

Add src/tests/eapol_test/fail-aka-nak.conf as a regression test under
the new fail-<type>-* harness convention: server loads only EAP-AKA,
supplicant offers PEAP and NAKs the AKA challenge.  Pre-fix this aborts
in request_slab_deinit; post-fix the daemon stays up and the request is
cleanly rejected.
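
The difference between the two lookups can be modelled without talloc.
In this sketch every toy_* name is hypothetical, and "freeing" only
bumps a counter so that the double free can be observed safely instead
of aborting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
	int free_count;		/* how many times "free" was called on this session */
} toy_session_t;

typedef struct {
	toy_session_t *opaque;	/* session owned by this request-data entry */
} toy_entry_t;

static void toy_session_free(toy_session_t *s) { s->free_count++; }

/* Like a "reference" lookup: the entry keeps pointing at the session. */
static toy_session_t *toy_entry_reference(toy_entry_t *e) { return e->opaque; }

/* Like a "get" lookup: the session is unlinked as it is handed back. */
static toy_session_t *toy_entry_get(toy_entry_t *e)
{
	toy_session_t *s = e->opaque;

	e->opaque = NULL;
	return s;
}

/* Stand-in for the entry destructor that runs when the parent ctx is freed. */
static void toy_entry_destroy(toy_entry_t *e)
{
	if (e->opaque) toy_session_free(e->opaque);
}

/* Discard the session, then tear down the entry; returns the frees seen. */
static int toy_discard(bool use_get)
{
	toy_session_t s = { 0 };
	toy_entry_t e = { &s };
	toy_session_t *found = use_get ? toy_entry_get(&e) : toy_entry_reference(&e);

	assert(found == &s);
	toy_session_free(found);
	toy_entry_destroy(&e);
	return s.free_count;
}
```

toy_discard(false) reports two frees of the same session (the dangling
opaque case), while toy_discard(true) reports one.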

tests: support fail-<type>-* eapol_test cases that should reject cleanly

A conf file whose basename starts with `fail-` is now treated by the
harness as a negative scenario: eapol_test is expected to NOT complete
authentication, and the recipe inverts the exit-code check accordingly.
The server still has to be alive at the end (radiusd_stop checks the
PID), which is what catches an actual server crash.
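
The pass/fail rule can be written down as a small predicate. The real
harness is a make recipe, so this C version is only an illustration,
and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true when the conf basename marks a negative case. */
static bool conf_expects_failure(char const *basename)
{
	assert(basename != NULL);
	return strncmp(basename, "fail-", 5) == 0;
}

/*
 * Hypothetical helper: decide whether a case passed.
 * eapol_exit is eapol_test's exit code, server_alive is the PID check.
 */
static bool eapol_case_passed(char const *basename, int eapol_exit, bool server_alive)
{
	bool auth_ok = (eapol_exit == 0);

	if (!server_alive) return false;	/* a crash fails every case */
	return conf_expects_failure(basename) ? !auth_ok : auth_ok;
}
```

A fail-* case therefore passes only when authentication did not
complete and the server is still running.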

Use cram md5 (works on macos)

free local variables which match the current frames dictionary

instead of freeing local variables which don't match the
previous frames dictionary. There may be multiple frames with
local variables.

network: detect reservation aliasing via data_size before reset

When app_io->read() calls fr_network_listen_send_packet() internally
(e.g. ldap_sync for AD notifications), that function calls
fr_message_and_data_alloc() on the same message set while our
reservation is outstanding.

message_reserve() uses fr_ring_buffer_reserve(), which does NOT advance
write_offset.  The subsequent alloc therefore lands at the same message
ring slot as the existing reservation.  The memset inside message_reserve
zeroes our struct (clearing data, rb, etc.), then the new message commits
into that slot and fills in data_size with the actual packet size.

After app_io->read() returns 0, the previous fix unconditionally called
fr_message_and_data_reset() on cd.  In the aliased case cd now points to
the already-committed, already-dispatched message; resetting it clears
data_size to 0 and data to NULL, causing the worker to decode zero bytes and
fail to find attr_packet_type.

Fix: before resetting, check cd->m.data_size.  A non-zero value means
an alloc claimed the slot while we held the reservation.  In that case
skip the reset entirely and set s->cd = NULL so the next call gets a
fresh reservation from the now-advanced write_offset.
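
A toy model of the aliasing, assuming a fixed-size ring where reserve
leaves write_offset alone and alloc advances it. The toy_* types and
functions are hypothetical and much simpler than the real message set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_SLOTS 4

typedef struct { size_t data_size; } toy_msg_t;

typedef struct {
	toy_msg_t	slot[TOY_SLOTS];
	size_t		write_offset;
} toy_ring_t;

/* Reserve: zero and hand out the slot at write_offset, without advancing it. */
static toy_msg_t *toy_reserve(toy_ring_t *r)
{
	toy_msg_t *m = &r->slot[r->write_offset % TOY_SLOTS];

	memset(m, 0, sizeof(*m));
	return m;
}

/* Alloc: reserve, then commit by recording the size and advancing write_offset. */
static toy_msg_t *toy_alloc(toy_ring_t *r, size_t size)
{
	toy_msg_t *m = toy_reserve(r);

	assert(size > 0);
	m->data_size = size;
	r->write_offset++;
	return m;
}

/* After a read that returned 0: true if the cached reservation may be reset. */
static bool toy_reservation_still_ours(toy_msg_t const *cd)
{
	return cd->data_size == 0;
}
```

If an alloc runs while a reservation is outstanding, both pointers name
the same slot and the check returns false, so the caller drops its
cached pointer instead of resetting a committed message.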

network: fix uncommitted reservation aliasing when app_io->read returns no data

fr_message_and_data_reserve() uses fr_ring_buffer_reserve() which does not
advance write_offset. If app_io->read() returns 0 (e.g. ldap_sync, which
does its own reads internally) and the cached cd is held in s->cd, any
subsequent allocation on the same message set returns the same ring-buffer
slot and zeroes it, corrupting cd->m.data.

Fix: only cache s->cd when s->leftover > 0 (partial stream data to preserve).
With no leftover, cancel the reservation explicitly via fr_message_and_data_reset()
which clears the message fields and marks the slot FR_MESSAGE_FREE so the next
reserve can reclaim it cleanly.

Also fix the B2 commit size in fr_network_read(): data_size returned by TCP
app_io already includes the leftover bytes already in the buffer, so the old
cd->m.data_size + data_size was double-counting.

Use cram md5 (works on macos)

Create distinct reserve functions, and reserve + commit functions, for channel data and ring buffers

Always commit data before moving onto the next packet

add command to sync submodules

use public URL, not SSH for submodule

Add Ubuntu 26 to docker and crossbuild CI jobs

Add Ubuntu 26 to CI .deb job

Re-work coordinator shutdown sequence

Doing the pthread_join() in fr_coord_deregister() caused some occasional
timing issues.

free children entries after they are used

i.e. after vasprintf() is called. Otherwise the children are
freed, and the pointer which is passed to vasprintf() then
points to unused memory

parent events from the RB tree that they are inserted into

add definitions for extension libraries

because we already add our own wrappers, and we want to be sure
that they are compatible with the upstream code.

Use existing thread names

Set the OS thread name for network and worker threads via
pthread_setname_np so they appear in ps/top/htop output.

Adds a configure check for pthread_setname_np (from pthread.h).
Linux and macOS have different signatures (2-arg vs 1-arg), handled
with an __APPLE__ guard matching the existing pattern in thread.c.

Add thread names

update RB tree freeing process

There are conflicts between the behavior of "free the tree", and
the talloc destructors for a node.  For simplicity / laziness,
the "free tree walker" just walks over the tree, freeing the
node data.  It expects that the tree nodes themselves remain
active during this walk, as the tree is not rebalanced.

The free walker will free the node data, which should NOT free
the individual node.  The node may, in fact, be inside of the
block which is being freed!

We therefore free the talloc children before mangling the tree
structure, so that any talloc destructors can look at the "clean"
tree structure.

We update the various destructors for tree data to check if the
tree is being freed, and then don't try to find / remove the entry
in the tree.

We also update the allocations so that the nodes in the tree are
always parented from the tree.  That way they are cleaned up before
the tree is cleaned up.

If (as before) the tree and nodes are both parented from the same
parent, then the nodes / tree are freed in essentially random order,
and the nodes might stick around after the tree is freed

typo

rlm_ldap: fix call_env safe_for token mismatch causing double-escape of DN/filter values

LDAP_DN_CALL_ENV_ESCAPE and LDAP_FILTER_CALL_ENV_ESCAPE were using
fr_ldap_dn_box_escape and fr_ldap_filter_box_escape as their safe_for
tokens, but %ldap.dn.safe and %ldap.filter.safe mark values with
LDAP_DN_SAFE_FOR (fr_ldap_dn_escape_func) and LDAP_FILTER_SAFE_FOR
(fr_ldap_filter_escape_func) respectively.

The mismatched tokens meant pre-marked-safe values (e.g. dc=example,dc=com
passed through %ldap.dn.safe) were not recognised as safe by the call_env
escape check and got re-escaped to dc\3dexample\2cdc\3dcom, producing an
invalid DN syntax error.

Fix: move LDAP_DN_SAFE_FOR / LDAP_FILTER_SAFE_FOR before the call_env
macros and use them consistently in .safe_for and .literals_safe_for.
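
The double-escape can be reproduced with a toy tagged value. Everything
here is hypothetical: the token enum, toy_dn_escape() (which hex-escapes
only '=' and ',' and truncates on overflow), and toy_call_env_escape().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical safe_for tokens; two distinct tokens for the same scheme. */
typedef enum { TOY_UNSAFE = 0, TOY_DN_SAFE_FOR, TOY_DN_BOX_ESCAPE } toy_safe_for_t;

typedef struct {
	char		buf[128];
	toy_safe_for_t	safe_for;
} toy_box_t;

/* Hex-escape '=' and ',' only; a stand-in for a real DN escaper. */
static void toy_dn_escape(char *out, size_t outlen, char const *in)
{
	size_t used = 0;

	assert(outlen > 0);
	/* Keep room for a three byte escape plus the terminating NUL. */
	for (; *in != '\0' && used + 4 <= outlen; in++) {
		if (*in == '=' || *in == ',') {
			used += (size_t)snprintf(out + used, outlen - used, "\\%02x",
						 (unsigned int)(unsigned char)*in);
		} else {
			out[used++] = *in;
		}
	}
	out[used] = '\0';
}

/* Escape the box unless it already carries the token the caller requires. */
static void toy_call_env_escape(toy_box_t *box, toy_safe_for_t required)
{
	char tmp[sizeof(box->buf)];

	if (box->safe_for == required) return;	/* already safe: leave untouched */
	toy_dn_escape(tmp, sizeof(tmp), box->buf);
	memcpy(box->buf, tmp, strlen(tmp) + 1);
	box->safe_for = required;
}
```

A value marked TOY_DN_SAFE_FOR but checked against TOY_DN_BOX_ESCAPE is
escaped again; checked against the token it was marked with, it is left
alone.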

Also add missing radprofile attributes to profile_injection.attrs: the
filter listed only Idle-Timeout but radprofile for user "john" also sets
Session-Timeout, Acct-Interim-Interval and Framed-IP-Netmask.

ldap xlat_profile test: drop bogus notfound rcode check

%ldap.profile() is an xlat, not a module call. It returns a bool but
does not write the unlang rcode, so checking (!notfound) after the xlat
always sees whatever rcode was current from a prior statement. The bool
check immediately above already verifies the injection payload does not
match any profile; the rcode check was both wrong and redundant.

rlm_ldap: Add examples to filter.safe, filter.unescape, and uri.* xlat doc sections

Split the combined ldap.uri.escape/safe/unescape alias blurb into three
separate sections, each with a concrete example. Add examples to the
previously bare ldap.filter.safe and ldap.filter.unescape sections.

LDAP requires _two_ safety schemes, one for DNs and one for filters

- The DN safety scheme would escape '+', which is the RDN separator char. This would break instances where usernames were extracted directly from certificates, as '+' would become \2b and would not correctly be broken into its constituent RDN values.

- The existing filter schemes were not correctly applied in a number of places, meaning that if the administrator did not escape values with %ldap.uri.escape(), content from unsafe attributes could become structural elements of filters or DNs.

clean up cf_data_add() usages

* client / unlang code marked the data as "to free", even though
  the data was already parented by a talloc'd chunk.  So any call
  to decrease_ref_count() would result in a use after free

* update cf_data_free() set / destructor so that the destructor
  is set only when the data needs to be freed.  which means that
  the destructor doesn't need to check the "do_free" flag

pass correct parameters to pooled_object()

because that API is weird and confusing.

update valgrind.h path to <valgrind/valgrind.h>

more ${Q}

use our internal header, which wraps the public one

ci: switch from luajit to luajit2 (OpenResty fork) for CI dependencies

libnginx-mod-http-lua (the nginx Lua module used for rlm_rest testing)
depends on libluajit2-5.1-2 from the OpenResty luajit2 fork. This package
conflicts with libluajit-5.1-2 from the canonical luajit source, so both
package families cannot be installed simultaneously.

The CI Docker image was built with libnginx-mod-http-lua installed, so
libluajit2-5.1-2 is already present and libluajit-5.1-2 / luajit are
absent. mk-build-deps for extra-packages.debian.control was failing because
installing luajit would require removing libluajit2-5.1-2 (and thus
libnginx-mod-http-lua).

Switch extra-packages.debian.control to request libluajit2-5.1-dev and
luajit2 instead. They coexist with the already-installed libluajit2-5.1-2,
provide the same headers and soname for rlm_lua compilation, and the luajit2
interpreter binary is equivalent for any runtime use.

Remove the now-irrelevant apt preferences pin from freeradius-deps/action.yml:
the libluajit-5.1-* packages are not installed in the Docker image at all,
so there was nothing to hold back.

ci: pin all luajit binary packages to dfsg-1

The first pin only covered libluajit-5.1-2, but the Ubuntu security
rebuild produced build1 versions of all four binary packages from the
luajit source (libluajit-5.1-2, libluajit-5.1-common, libluajit-5.1-dev,
luajit). luajit and libluajit-5.1-dev carry strict (= dfsg-1) deps on
BOTH libluajit-5.1-2 and libluajit-5.1-common, so full-upgrade upgrading
libluajit-5.1-common to build1 still removed luajit as a casualty.

Extend the apt preferences pin to cover libluajit-5.1-* (glob) and luajit
explicitly, keeping the entire package set at dfsg-1 until Ubuntu ships a
coherent build1 rebuild of all four.

ci: Add Ubuntu archive mirrors and switch from OpenResty to stock nginx+lua

Add ubuntu-mirrors-setup.sh which rewrites /etc/apt/sources.list.d/ubuntu.sources
to use mirror+file: lists for the main archive, ports (arm64), and security suites.
Mirrors are ordered by proximity to Ottawa with the canonical servers as last-resort
fallbacks.  Also drops the apt connect timeout to 5s so dead hosts fail over quickly
rather than stalling for the 120s default.

Remove the ubuntu-toolchain-r/test PPA: gcc-13 and gcc-14 are both in Ubuntu 24.04
universe, so the PPA adds nothing for those versions and has no public mirrors.

Remove the OpenResty apt repo and replace the openresty package with nginx +
libnginx-mod-http-lua + lua-cjson, all from Ubuntu's own repos.  All Lua primitives
used by the rlm_rest test API are standard ngx_lua.  Update openresty-setup.sh to
detect OpenResty vs stock nginx at runtime (macOS dev vs CI), inject the lua
load_module directive when needed, and replace the OpenResty-specific
ngx.ctx.openresty_request_time_us with ngx.now() elapsed timing.

Pacify Coverity (CID 1692449)

Don't assign an instruction number twice

Ensure instructions being freed are removed from the tree

Replies that come in during a zombie period mean a home server is alive

Use unlang_interpret_force_result for dead home servers

In the case where there are no status checks and revive_interval is used
to assume a home server has come back to life.

This means that during the period when the home server is marked as
dead, the module fails immediately so failover is efficient.

A different approach is needed for dynamic home servers, since the same
instruction can be used for many different home servers.

Add all instructions to unlang_instruction_tree

So that the instruction gets correctly populated in the
unlang_thread_array entries.

Correct comment

Add AA-App-Service-Options to Nokia SR dictionary

Scheduled fuzzing: Update src/tests/fuzzer-corpus/radius.tar

Scheduled fuzzing: Update src/tests/fuzzer-corpus/cbor.tar

Scheduled fuzzing: Update src/tests/fuzzer-corpus/dhcpv6.tar

Scheduled fuzzing: Update src/tests/fuzzer-corpus/dns.tar

Scheduled fuzzing: Update src/tests/fuzzer-corpus/dhcpv4.tar

Scheduled fuzzing: Update src/tests/fuzzer-corpus/util.tar

Scheduled fuzzing: Update src/tests/fuzzer-corpus/bfd.tar

Scheduled fuzzing: Update src/tests/fuzzer-corpus/der.tar

Scheduled fuzzing: Update src/tests/fuzzer-corpus/tftp.tar

Scheduled fuzzing: Update src/tests/fuzzer-corpus/vmps.tar

Scheduled fuzzing: Update src/tests/fuzzer-corpus/tacacs.tar

atexit: skip TLS-cached pools once shutdown has freed them

`fr_atexit_thread_trigger_all()` runs each registered thread destructor
on the calling (main) thread, including ones that free `_Thread_local`
caches owned by threads outside our schedule (librdkafka's bg threads,
perl, etc.).  We can free their pool chunks but can't reset another
thread's TLS slot, so the next call from those threads dereferences a
dangling pointer and aborts with "Bad talloc magic value" - first seen
in `_kafka_log_cb` from the "Terminating instance" debug line emitted
inside `rd_kafka_destroy()` during `mod_detach`.

Add `fr_atexit_thread_local_disable_alloc()` / `..._alloc_disabled()` in
atexit.c as the single source of truth, called once from radiusd
*before* the trigger.  TLS-pool initialisers consult it before reading
their slot and fall back to `talloc_*(NULL, ...)` when set:

  - log.c:fr_log_pool_init returns NULL
  - sbuff.c:sbuff_scratch_init returns NULL

Other TLS pools registered the same way (md4/md5/hmac_*, strerror,
talloc autofree) can opt in as crashes surface; the single flag means
the fix is one extra check at the top of each initialiser.

log: bypass per-thread pool once shutdown has freed it

`fr_atexit_thread_trigger_all()` runs each registered thread destructor
on the calling (main) thread, including `_fr_log_pool_free` for every
worker that ever logged.  That frees the underlying pool chunk but
can't reset the `_Thread_local fr_log_pool` slot in any thread other
than main, so threads spawned outside our schedule (librdkafka's bg
threads, perl, etc.) keep a dangling pointer.  The next log call from
those threads (typically the "Terminating instance" debug line that
librdkafka emits inside `rd_kafka_destroy()` during `mod_detach`) hands
the dead pointer to `talloc_new` and aborts with "Bad talloc magic
value".

Add `fr_log_disable_pools()` and call it from radiusd right after the
trigger.  `fr_log_pool_init()` short-circuits to NULL once set, so the
TLS read is skipped and downstream `talloc_new(NULL)` allocates a
top-level chunk for the duration of the line.  Relaxed atomic because
the flag is a single-writer signal with no other state to synchronise
through it.
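
A minimal C11 sketch of the flag, assuming a per-thread pool that is
set up lazily. The toy_* names and the fixed 64 byte "pool" are
hypothetical; only the relaxed load/store pattern is the point.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_bool toy_pools_disabled;		/* zero-initialised: pools enabled */
static _Thread_local char toy_pool_mem[64];	/* stand-in for the pool chunk */
static _Thread_local char *toy_pool;		/* TLS slot, NULL until first use */

/* Single writer, called once during shutdown. */
static void toy_disable_pools(void)
{
	atomic_store_explicit(&toy_pools_disabled, true, memory_order_relaxed);
}

/* Returns the per-thread pool, or NULL once pools are disabled. */
static char *toy_pool_init(void)
{
	/* Check the flag first so the TLS slot is never read after shutdown. */
	if (atomic_load_explicit(&toy_pools_disabled, memory_order_relaxed)) return NULL;

	if (!toy_pool) toy_pool = toy_pool_mem;
	assert(toy_pool != NULL);
	return toy_pool;
}
```

Callers that receive NULL allocate from the top-level context instead,
which is slower but never touches freed memory.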

totp: Use an output buffer to output

rlm_kafka: Simpler, and arguably more correct way of handling kafka shutdown

Allow modules to use the main event loop for low frequency I/O and timer events

Saves spawning dedicated threads... we may want to revisit this in future

kafka: Use the correct type of sbuff for writing out sizes and time deltas

Tweaks to kafka tests

rlm_kafka: continue the wake drain loop after a cancelled pctx

_kafka_wake loops over the atomic ring draining pctx the bg cb pushed,
checks pctx->request, and if the worker-side cancel handler has already
NULLed it, frees the pctx without marking a request runnable. It was
using `return` on that branch (copy-pasted from the one-shot
kafka_delivery_notification()), which exits the whole drain loop as
soon as the first cancelled entry comes up, leaving any subsequent
live pctx stranded in the ring - their requests never get resumed and
end up cancelled by max_request_time instead.

Switch to `continue` so the rest of the ring still drains.
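
The behavioural difference, reduced to a loop over an array. The ring,
toy_pctx_t and toy_drain() are hypothetical simplifications; a
cancelled flag stands in for pctx->request having been NULLed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
	bool cancelled;		/* stands in for pctx->request == NULL */
} toy_pctx_t;

/*
 * Drain n queued entries and return how many live requests were resumed.
 * stop_on_cancel = true models the old `return`, false models `continue`.
 */
static size_t toy_drain(toy_pctx_t const *ring, size_t n, bool stop_on_cancel)
{
	size_t resumed = 0;

	assert(ring != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (ring[i].cancelled) {
			if (stop_on_cancel) return resumed;	/* old: strands the rest */
			continue;				/* fixed: keep draining */
		}
		resumed++;
	}
	return resumed;
}
```

With one cancelled entry in second place out of four, the old behaviour
resumes one request and the fixed behaviour resumes three.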

tests/multi-server: filter Status-Server from kafka-producer accept log

The kafka-producer1 container's docker healthcheck pokes the UDP
listener with a Status-Server packet every 2s.  In the radius namespace
the reply code for a successful Status-Server is Access-Accept, so the
healthcheck also runs through send Access-Accept - and thus through the
test-framework linelog - producing a stream of
  kafka-producer-accept {"User-Name": null}
entries with no User-Name attribute set (radclient status only sends a
Message-Authenticator).

Gate the linelog on Packet-Type == ::Access-Request so only proto_load
traffic lands in the listener file.  Healthcheck chatter is invisible
to the test harness from here.

tests/multi-server: enable librdkafka auto_create_topic

librdkafka producers don't force broker auto-create: they query
metadata, see "Unknown topic or partition" for a topic the broker
hasn't materialised yet, and sit on the produce call for up to 30s
waiting for metadata propagation. In the multi-server test
kafka-producer1 was hitting that window on the first burst of
proto_load traffic; every yielded request reached max_request_time
and was cancelled before any delivery report arrived, and the test
timed out with zero messages on the topic.

Setting auto_create_topic = yes on the rlm_kafka producer tells
librdkafka to request topic auto-creation as part of its metadata
fetch, so the very first PRODUCE sees the topic already live.

tests/multi-server: don't auto-restart kafka-producer1

proto_load on kafka-producer1 runs its generator to completion and then
radiusd exits normally. With restart: unless-stopped docker would bring
the container straight back up, rerun apt-get install / radiusd startup
(~30s each time), fire a fresh batch of traffic, and race the test
framework's verify phase - the listener file ended up with a mix of
in-flight and post-restart events.

Set restart: "no" so the container exits once and stays exited; the
test completes based on the consumer's summary line.

tests/multi-server: collapse log/listener failure-dump loops

Both loops called tail -200 on every file; only the listener branch
needed the extra line-type counts header. Fold them into one loop and
switch on the path for listener-specific output. Behaviour is
unchanged - the failure report still shows log tails plus listener
histograms.

just discard the data instead of saving it.

write of 0 means something other than "we saved the data"

ENETDOWN and ENETUNREACH are temporary failures

We might want to discard the data instead of saving it,
especially for UDP. Or, put the packet into a pending queue,
which can then be written later, or else timed out.
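
A sketch of the classification only, with a hypothetical helper name.
It says nothing about other errno values, and what the caller does with
a temporary failure (discard, queue, retry) is left open as above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper: true for the two network-level temporary failures. */
static bool toy_errno_is_network_transient(int err)
{
	assert(err != 0);	/* only meaningful after a failed write */
	return (err == ENETDOWN) || (err == ENETUNREACH);
}
```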

ENET* are temporary failures

tests/multi-server: dump listener files too on failure

We already tail logs/* on a failed test, but the per-suite
listener/*.txt is where the consumer (and producer linelog) writes
structured events the framework validates against. Dumping logs
alone tells us the containers ran; dumping listeners tells us
whether the pipeline actually produced the events we were waiting
for. Counts-by-prefix header makes it easy to spot 'no
kafka-consumer-received lines at all' vs 'got some but not the
expected count'.

tests/multi-server: reconnect test expects all 200 delivered, not 200/250

Over-provisioning num_messages above expected_messages masks real
losses - we want the test to catch regressions in the reconnect
path, not tolerate them. Drop the override so num_messages
defaults to expected_messages (200) and the test fails if any
message goes missing across the disconnect / reconnect cycle.

coverity: silence CID 1691836 / 1691837

CID 1691837 (NULL_RETURNS) in rlm_kafka's kafka_xlat_produce():
Coverity doesn't trust the xlat framework's required=true contract
and flags the downstream derefs of key_vb and value_vb.  Add an
fr_assert after the vars to document the invariant and silence it.

CID 1691836 (RESOURCE_LEAK) in fr_atomic_ring_push():
Coverity doesn't track atomic stores as reference publication, so
when we atomic_store_explicit() `n` into h->next and ring->head it
still considers `n` leaked once the local goes out of scope.  It
isn't - the consumer frees it via atomic_ring_entry_free() once it
advances past.  Annotate with /* coverity[leaked_storage] */.

tests/multi-server: embed proto_load in kafka-producer1, add reconnect suite

Topology simplification: move the proto_load listener directly into
kafka-producer1's virtual server, so generated Access-Requests flow
straight into `recv Access-Request` -> kafka.produce without going
over the wire.  One fewer container, one fewer RADIUS hop, and the
test still exercises exactly the produce path end-to-end.

Changes:

* environments/kafka.yml.j2
   - Drop the load-generator service.
   - Feed the proto_load profile (start_pps / max_pps / duration /
     step / parallel / num_messages) to kafka-producer1 via env vars;
     Jinja pulls them from the test's loadgen: block.
   - Re-declare TEST_PROJECT_NAME / TEST_SUBNET inline on
     kafka-producer1 because YAML's <<: anchor merge doesn't union
     nested dicts - a service-level environment: replaces the one
     inherited from x-common-config.
   - New `loadgen_num_messages` knob, defaulting to
     `expected_messages`, so tests that expect loss (reconnect) can
     generate more than the consumer will count.

* configs/freeradius/kafka-producer1/radiusd.conf.j2
   - Add `listen load { handler = load; transport = step; step { ... } }`
     inside the existing kafka-producer server.

* configs/freeradius/kafka-producer1/load-generator-packets/packet.conf
   - Default Access-Request packet skeleton proto_load sends.

* tests/kafka-produce/{short.ci,heavy}.test.yml + template.yml.j2
   - Collapse to a single state that waits for kafka-consumer-summary.
     No more two-phase load-gen orchestration; proto_load fires on
     freeradius startup and finishes long before the summary arrives.

* tests/kafka-produce-reconnect/
   - New suite exercising broker disconnect / reconnect.  Applies 100%
     packet loss on kafka-producer1's egress mid-stream (packet_loss
     action from the framework's NetworkEvents), holds for
     `outage_seconds`, then removes it.  Queued produces inside
     librdkafka drain after reconnect, request threads that yielded
     waiting on their delivery reports resume, and the consumer
     eventually sees >= expected_messages on the topic.

tests/multi-server: bump kafka-produce CI timeouts for DinD runners

Apache Kafka's JVM startup through the healthcheck takes ~20-30s on
the self-hosted CI DinD runners (vs a few seconds on local Docker
Desktop). By the time state_1 actually starts load-generation,
most of the previous 10s test_verify_timeout is already gone, so
the kafka-consumer-summary trigger emits too late and state_1
fails the validator even though the whole pipeline eventually
succeeds.

Bump to 120s total / 60s per-state / 90s consumer. Generous but
not so generous that a genuine hang would go undiagnosed.

tests/multi-server: run kafka-consumer as root

The confluentinc/cp-kcat image defaults to uid 1000 (appuser). On
the self-hosted CI runners the bind-mounted listener dir is owned
by root with mode 0755, so the consumer script can't write its
summary line and the test never observes kafka-consumer-summary.

Pin the consumer to uid 0 to sidestep the ownership mismatch.
Local Docker Desktop on macOS hides this because its bind mount
layer maps ownership loosely; on Linux DinD the permissions are
real.

tests/multi-server: switch broker from redpanda to apache kafka

Redpanda's seastar reactor aborts during init with "close() syscall
failed: Invalid argument" on the self-hosted CI runners, regardless
of:

  - redpanda image version (v26.1.6 via :latest and v24.3.15 pinned
    both fail the same way);
  - sandbox configuration (default, seccomp:unconfined +
    apparmor:unconfined, and privileged:true all hit the same error);
  - seastar tuning (--mode dev-container, explicit
    --overprovisioned / --unsafe-bypass-fsync / --reserve-memory=0M).

This is a seastar + runner-kernel interaction we can't unblock from
the compose side.

apache/kafka:3.9.1 is the official Apache Kafka Docker image, runs
the JVM implementation (not seastar), and starts cleanly in the
same DinD environment.  The wire protocol is identical so kcat on
the consumer side and rlm_kafka via librdkafka on the producer
side don't care which broker is serving.

tests/multi-server: pin redpanda to v24.3.15

`:latest` resolves to v26.1.6, which aborts during seastar reactor
init with "close() syscall failed: Invalid argument" on the
self-hosted CI runners - even with the container running privileged
(so seccomp/AppArmor/capability bounding are all off). That's a
regression in the image itself, not a sandbox problem.

Pin to v24.3.15 which starts cleanly. Bump when a newer tag is
verified to work.

tests/multi-server: run the kafka broker privileged

seccomp:unconfined + apparmor:unconfined wasn't enough to get
redpanda past seastar's reactor init on the self-hosted CI runners
(close() still failed with EINVAL on an internal fd). Replace the
narrow security_opt overrides with `privileged: true`, which turns
off seccomp + AppArmor + capability bounding + /dev restrictions in
one go - the minimum that reliably starts the broker across DinD
runner configurations. Test-only scope, compose-network-only
exposure.

tests/multi-server: also unconfine apparmor on the kafka broker

The previous seccomp:unconfined change flipped redpanda's first-stage
failure mode (perf_event_open now EACCES from the kernel sysctl,
instead of EPERM from seccomp) but the fatal close() EINVAL during
seastar reactor init still fired. On DinD runners the inner
containers inherit the default docker-default AppArmor profile in
addition to seccomp, and that profile is what's driving the EINVAL.
Opt out of both sandboxes for the test broker.

tests/multi-server: relax seccomp for the redpanda kafka broker

The self-hosted CI runners' Docker seccomp profile is stricter than
Docker Desktop's; it blocks enough of redpanda/seastar's startup
syscalls (io_uring / eventfd / perf_event_open) that the reactor
aborts during init with "close() syscall failed: Invalid argument"
and the broker container exits non-zero. The dependent
kafka-producer1 container then never starts and compose up reports
"dependency failed to start: container ... is unhealthy".

Opt the kafka service out of the seccomp sandbox - it's a test
broker on an isolated compose network, no host access implications.

Missing const

Quiet clang scan

kafka: Use a single producer handler and atomic queues to return the DRs

Jiggle functions in kafka/base.h/c

Fix fr_event_user_trigger so it works over multiple calls

Add a simple SPSC atomic, expandable queue chain

rlm_kafka: prime librdkafka's lazy globals from .onload

Call fr_kafka_init() from mod_load (paired with fr_kafka_free() in
mod_unload) so librdkafka's one-shot SSL/SASL init happens
deterministically at module load time, before any worker thread gets
to rd_kafka_new(). Ref-counted through libfreeradius-kafka so future
kafka-family modules can share the hook.

lib/kafka: expose fr_kafka_init() / fr_kafka_free()

librdkafka lazily initialises its SSL (lock callbacks on legacy
OpenSSL) and SASL globals on the first rd_kafka_new() call. In a
server that owns its own OpenSSL setup that creates a race at
thread_instantiate time, and the ordering is non-deterministic.

Give kafka-using modules a deterministic hook they can call from
their .onload: fr_kafka_init() runs the lazy paths by creating and
immediately destroying a throwaway producer, fr_kafka_free() pairs
it. Both are ref-counted against other kafka modules (same shape
as fr_openssl_init / fr_openssl_free in src/lib/tls/base.c) so a
future rlm_kafka_consumer sharing the lib doesn't double-init.
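
The ref-counting shape, sketched with hypothetical toy_* names and
counters. It assumes module load and unload are single-threaded, so the
counters are plain integers, and the one-shot work is represented by a
counter rather than a real throwaway producer.

```c
#include <assert.h>

static unsigned int toy_kafka_refs;	/* modules currently holding the library */
static unsigned int toy_global_inits;	/* times the one-shot init actually ran */

static void toy_kafka_init(void)
{
	if (toy_kafka_refs++ > 0) return;	/* globals were already primed */

	/* Real code would create and destroy a throwaway producer here. */
	toy_global_inits++;
}

static void toy_kafka_free(void)
{
	assert(toy_kafka_refs > 0);	/* every free must pair with an init */
	toy_kafka_refs--;
}
```

Two modules calling toy_kafka_init() from their load hooks trigger the
one-shot work once, and the count returns to zero after both unload.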

A no-op log callback is attached to the dummy conf so librdkafka's
"no bootstrap brokers" warning from the dummy producer doesn't leak
into the server log at startup.

rlm_kafka: wrap thread-owner fr_assert() calls in #ifndef NDEBUG

The worker_tid field on rlm_kafka_thread_t and rlm_kafka_msg_ctx_t->t
only exists under #ifndef NDEBUG (see ffad24d4d1), and the two
fr_assert() calls that reference it compile to nothing under NDEBUG.
With -Werror=unused-variable the ndebug build then failed because
the `t` unboxed at the head of _kafka_error_cb had no remaining use.

Move both assertions - plus the `t` local in _kafka_error_cb - under
the same #ifndef NDEBUG guard that protects the field itself.

rlm_kafka: recognise a null key box on the xlat path

With the framework now carrying FR_TYPE_NULL boxes through to the
xlat body, check for fr_type_is_null() in addition to the zero-length
check so an explicit `null` key, an empty '' literal, and an
attribute that expanded to nothing all resolve to "no key on the wire"
without any of them needing to coerce into a zero-length octets value.
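
The key check amounts to a two-part predicate. The box type below is a
hypothetical reduction with just a type, a pointer and a length; it is
not the server's value box.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TOY_TYPE_NULL = 0, TOY_TYPE_OCTETS } toy_type_t;

typedef struct {
	toy_type_t		type;
	unsigned char const	*data;
	size_t			len;
} toy_key_box_t;

/* True when the produce call should send no key at all. */
static bool toy_key_absent(toy_key_box_t const *key)
{
	assert(key != NULL);
	if (key->type == TOY_TYPE_NULL) return true;	/* explicit `null` */
	return key->len == 0;				/* '' or an empty expansion */
}
```

An explicit null, an empty literal and an empty expansion all report
"absent"; only a key with at least one byte is sent on the wire.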

tmpl, value, xlat: carry null through arg lists instead of casting it

Reverses the cast coercion added in 3b5165084f. An explicit `null`
should not silently become "" or zero-length octets - callers that
wrote `null` meant "no value at all", which is a different shape
from "the empty string".

value.c: fr_value_box_cast_to_{string,octets} now return a clean
fr_strerror() on FR_TYPE_NULL source instead of falling through to
the catch-all fr_assert(0).

xlat_tokenize.c: xlat_validate_function_arg skips the compile-time
cast for FR_TYPE_NULL literals so a bareword `null` survives arg
validation.

xlat_eval.c: the runtime concat and per-box cast paths both pass an
FR_TYPE_NULL source through to the xlat body unchanged, so
implementations can check fr_type_is_null() on the incoming box
and react accordingly.

tests/modules/kafka: exercise the `null` keyword on the xlat key arg

Now that tmpl parsing recognises `null` as an explicit FR_TYPE_NULL
placeholder, swap the zero-length-value produce's key from `''` to
`null` so the test doubles as regression coverage for the keyword
end-to-end (tokenize -> xlat arg list -> cast to octets -> zero-
length key -> "no key on the wire").

tmpl, value: accept `null` as an explicit keyword

Adds tmpl_afrom_null_substr so the bareword `null` is recognised at
tmpl-tokenize time and builds a TMPL_TYPE_DATA wrapping an
FR_TYPE_NULL box.  Wired in before the numeric / address / bool /
attribute branches in tmpl_afrom_substr so a dictionary attribute
named "null" can't shadow it.

FR_TYPE_NULL previously doubled as the "uninitialised box" sentinel,
which is why TMPL_VERIFY panicked when it saw one inside a
TMPL_TYPE_DATA and why fr_value_box_cast_to_{string,octets} lacked
a source case for it.  With the null keyword those encounters are
now deliberate, so:

  - Drop the "FR_TYPE_NULL inside TMPL_TYPE_DATA is uninitialised"
    assertion in tmpl_tokenize.c's TMPL_VERIFY.
  - Cast FR_TYPE_NULL to an empty string / zero-length octets box.

The result is that positional xlat arguments can carry an explicit
"no value" placeholder without the framework dropping the slot or
the type system tripping over it.

rlm_kafka: accept a key as the middle xlat argument

%kafka.produce now takes (topic, key, value) instead of (topic, value),
so xlat callers can pick a partition the same way the method form does
via a declared topic `key = ...`. Zero-length octets (the literal
empty string, or an attribute that expands to nothing) mean "no key"
on the wire - librdkafka falls back to its configured partitioner.

Updated existing xlat tests to pass an explicit '' key, and
xlat.unlang now covers the non-empty case too: produce to
freeradius-test-xlat-alt with a `"xlat-key"` key and assert it
round-trips byte-for-byte through the broker.

rlm_kafka: unbox the self-pipe uctx with talloc_get_type_abort

Follow-up to the audit pass in cb2ee227c3: _kafka_fd_readable was
still casting uctx straight to rlm_kafka_thread_t *. Bring it in
line with the other callbacks so a mismatched uctx aborts loudly at
the callsite instead of crashing deeper in rd_kafka_poll.