atexit: skip TLS-cached pools once shutdown has freed them
`fr_atexit_thread_trigger_all()` runs each registered thread destructor
on the calling (main) thread, including ones that free `_Thread_local`
caches owned by threads outside our schedule (librdkafka's bg threads,
perl, etc.). We can free their pool chunks but can't reset another
thread's TLS slot, so the next call from those threads dereferences a
dangling pointer and aborts with "Bad talloc magic value" - first seen
in `_kafka_log_cb` from the "Terminating instance" debug line emitted
inside `rd_kafka_destroy()` during `mod_detach`.
Add `fr_atexit_thread_local_disable_alloc()` / `..._alloc_disabled()` in
atexit.c as the single source of truth, called once from radiusd
*before* the trigger. TLS-pool initialisers consult it before reading
their slot and fall back to `talloc_*(NULL, ...)` when set.
Other TLS pools registered the same way (md4/md5/hmac_*, strerror,
talloc autofree) can opt in as crashes surface; the single flag means
the fix is one extra check at the top of each initialiser.
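A minimal sketch of the resulting initialiser shape, assuming the
predicate abbreviated above as `..._alloc_disabled()` is spelled
fr_atexit_thread_local_alloc_disabled() and returns a plain bool; the
pool variable, size, and fallback are illustrative, not the actual
atexit.c code:

    static _Thread_local TALLOC_CTX *tls_pool;

    static TALLOC_CTX *tls_pool_init(void)
    {
        /*
         *  Shutdown has already freed every per-thread pool.  A foreign
         *  thread's TLS slot may now hold a dangling pointer, so don't
         *  read it - hand back an uncached top-level chunk instead.
         */
        if (fr_atexit_thread_local_alloc_disabled()) {
            /*  Parented to NULL: a tiny, deliberate allocation during
             *  process shutdown beats dereferencing a freed pool.  */
            return talloc_pool(NULL, 4096);
        }

        if (!tls_pool) tls_pool = talloc_pool(NULL, 4096);
        return tls_pool;
    }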
log: bypass per-thread pool once shutdown has freed it
`fr_atexit_thread_trigger_all()` runs each registered thread destructor
on the calling (main) thread, including `_fr_log_pool_free` for every
worker that ever logged. That frees the underlying pool chunk but
can't reset the `_Thread_local fr_log_pool` slot in any thread other
than main, so threads spawned outside our schedule (librdkafka's bg
threads, perl, etc.) keep a dangling pointer. The next log call from
those threads (typically the "Terminating instance" debug line that
librdkafka emits inside `rd_kafka_destroy()` during `mod_detach`) hands
the dead pointer to `talloc_new` and aborts with "Bad talloc magic
value".
Add `fr_log_disable_pools()` and call it from radiusd right after the
trigger. `fr_log_pool_init()` short-circuits to NULL once set, so the
TLS read is skipped and downstream `talloc_new(NULL)` allocates a
top-level chunk for the duration of the line. Relaxed atomic because
the flag is a single-writer signal with no other state to synchronise
through it.
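The flag itself is small enough to sketch self-contained; the two
function names follow the commit message, the pool handling around them
is illustrative:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <talloc.h>

    static atomic_bool log_pools_disabled;
    static _Thread_local TALLOC_CTX *fr_log_pool;

    void fr_log_disable_pools(void)
    {
        /*  Single writer (main thread at shutdown); nothing else is
         *  published through this flag, so relaxed ordering suffices.  */
        atomic_store_explicit(&log_pools_disabled, true, memory_order_relaxed);
    }

    static TALLOC_CTX *fr_log_pool_init(void)
    {
        /*  Short-circuit before touching the possibly-dangling TLS slot;
         *  callers then talloc_new(NULL) for the duration of the line.  */
        if (atomic_load_explicit(&log_pools_disabled, memory_order_relaxed)) return NULL;

        if (!fr_log_pool) fr_log_pool = talloc_pool(NULL, 16384);
        return fr_log_pool;
    }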
rlm_kafka: continue the wake drain loop after a cancelled pctx
_kafka_wake loops over the atomic ring draining the pctx entries the
bg cb pushed, checks each pctx->request, and if the worker-side cancel
handler has already NULLed it, frees the pctx without marking the
request runnable. It was
using `return` on that branch (copy-pasted from the one-shot
kafka_delivery_notification()), which exits the whole drain loop as
soon as the first cancelled entry comes up, leaving any subsequent
live pctx stranded in the ring - their requests never get resumed and
end up cancelled by max_request_time instead.
Switch to `continue` so the rest of the ring still drains.
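Roughly, the fixed loop looks like this; the pop helper, the resume
call, and the cleanup are placeholders for whatever the module actually
does:

    while ((pctx = ring_pop(t->ring)) != NULL) {    /* placeholder pop helper */
        if (!pctx->request) {
            /*  Cancel handler already detached the request: free this
             *  entry and keep draining.  `return` here stranded every
             *  later live entry until max_request_time reaped them.  */
            talloc_free(pctx);
            continue;
        }

        /*  Live entry: hand control back to the yielded request.  */
        unlang_interpret_mark_runnable(pctx->request);
    }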
tests/multi-server: filter Status-Server from kafka-producer accept log
The kafka-producer1 container's docker healthcheck pokes the UDP
listener with a Status-Server packet every 2s. In the radius namespace
the reply code for a successful Status-Server is Access-Accept, so the
healthcheck also runs through send Access-Accept - and thus through the
test-framework linelog - producing a stream of
kafka-producer-accept {"User-Name": null}
entries with no User-Name attribute set (radclient status only sends a
Message-Authenticator).
Gate the linelog on Packet-Type == ::Access-Request so only proto_load
traffic lands in the listener file. Healthcheck chatter is invisible
to the test harness from here.
librdkafka producers don't force broker auto-create: they query
metadata, see "Unknown topic or partition" for a topic the broker
hasn't materialised yet, and sit on the produce call for up to 30s
waiting for metadata propagation. In the multi-server test
kafka-producer1 was hitting that window on the first burst of
proto_load traffic, every yielded request reached max_request_time
and was cancelled before any delivery report arrived, and the test
timed out with zero messages on the topic.
Setting auto_create_topic = yes on the rlm_kafka producer tells
librdkafka to request topic auto-creation as part of its metadata
fetch, so the very first PRODUCE sees the topic already live.
proto_load on kafka-producer1 runs its generator to completion and then
radiusd exits normally. With restart: unless-stopped docker would bring
the container straight back up, rerun apt-get install / radiusd startup
(~30s each time), fire a fresh batch of traffic, and race the test
framework's verify phase - the listener file ended up with a mix of
in-flight and post-restart events.
Set restart: "no" so the container exits once and stays exited; the
test completes based on the consumer's summary line.
Both loops called tail -200 on every file; only the listener branch
needed the extra line-type counts header. Fold them into one loop and
switch on the path for listener-specific output. Behaviour is
unchanged - the failure report still shows log tails plus listener
histograms.
Alan T. DeKok [Fri, 24 Apr 2026 17:27:09 +0000 (13:27 -0400)]
ENETDOWN and ENETUNREACH are temporary failures
We might want to discard the data instead of saving it,
especially for UDP. Or, put the packet into a pending queue,
which can then be written later, or else timed out.
tests/multi-server: dump listener files too on failure
We already tail logs/* on a failed test, but the per-suite
listener/*.txt is where the consumer (and producer linelog) writes
structured events the framework validates against. Dumping logs
alone tells us the containers ran; dumping listeners tells us
whether the pipeline actually produced the events we were waiting
for. Counts-by-prefix header makes it easy to spot 'no
kafka-consumer-received lines at all' vs 'got some but not the
expected count'.
tests/multi-server: reconnect test expects all 200 delivered, not 200/250
Over-provisioning num_messages above expected_messages masks real
losses - we want the test to catch regressions in the reconnect
path, not tolerate them. Drop the override so num_messages
defaults to expected_messages (200) and the test fails if any
message goes missing across the disconnect / reconnect cycle.
CID 1691837 (NULL_RETURNS) in rlm_kafka's kafka_xlat_produce():
Coverity doesn't trust the xlat framework's required=true contract
and flags the downstream derefs of key_vb and value_vb. Add an
fr_assert after the vars to document the invariant and silence it.
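The fix itself is two lines, roughly (variable names per the commit
message):

    /*  required=true on the xlat arg parser means these can never be
     *  NULL here; the asserts restate the invariant for Coverity and
     *  for human readers, just ahead of the derefs.  */
    fr_assert(key_vb != NULL);
    fr_assert(value_vb != NULL);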
CID 1691836 (RESOURCE_LEAK) in fr_atomic_ring_push():
Coverity doesn't track atomic stores as reference publication, so
when we atomic_store_explicit() `n` into h->next and ring->head it
still considers `n` leaked once the local goes out of scope. It
isn't - the consumer frees it via atomic_ring_entry_free() once it
advances past. Annotate with /* coverity[leaked_storage] */.
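In context the suppression sits right after the publishing stores,
something like this (field names per the commit message; memory
ordering and the return value are illustrative):

    atomic_store_explicit(&h->next, n, memory_order_release);
    atomic_store_explicit(&ring->head, n, memory_order_release);

    /*  `n` is now owned by the ring; the consumer frees it via
     *  atomic_ring_entry_free() once it advances past this entry.  */
    /* coverity[leaked_storage] */
    return true;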
tests/multi-server: embed proto_load in kafka-producer1, add reconnect suite
Topology simplification: move the proto_load listener directly into
kafka-producer1's virtual server, so generated Access-Requests flow
straight into `recv Access-Request` -> kafka.produce without going
over the wire. One fewer container, one fewer RADIUS hop, and the
test still exercises exactly the produce path end-to-end.
Changes:
* environments/kafka.yml.j2
- Drop the load-generator service.
- Feed the proto_load profile (start_pps / max_pps / duration /
step / parallel / num_messages) to kafka-producer1 via env vars;
Jinja pulls them from the test's loadgen: block.
- Re-declare TEST_PROJECT_NAME / TEST_SUBNET inline on
kafka-producer1 because YAML's <<: anchor merge doesn't union
nested dicts - a service-level environment: replaces the one
inherited from x-common-config.
- New `loadgen_num_messages` knob, defaulting to
`expected_messages`, so tests that expect loss (reconnect) can
generate more than the consumer will count.
* tests/kafka-produce/{short.ci,heavy}.test.yml + template.yml.j2
- Collapse to a single state that waits for kafka-consumer-summary.
No more two-phase load-gen orchestration; proto_load fires on
freeradius startup and finishes long before the summary arrives.
* tests/kafka-produce-reconnect/
- New suite exercising broker disconnect / reconnect. Applies 100%
packet loss on kafka-producer1's egress mid-stream (packet_loss
action from the framework's NetworkEvents), holds for
`outage_seconds`, then removes it. Queued produces inside
librdkafka drain after reconnect, request threads that yielded
waiting on their delivery reports resume, and the consumer
eventually sees >= expected_messages on the topic.
tests/multi-server: bump kafka-produce CI timeouts for DinD runners
Apache Kafka's JVM startup through the healthcheck takes ~20-30s on
the self-hosted CI DinD runners (vs a few seconds on local Docker
Desktop). By the time state_1 actually starts load-generation,
most of the previous 10s test_verify_timeout is already gone, so
the kafka-consumer-summary trigger emits too late and state_1
fails the validator even though the whole pipeline eventually
succeeds.
Bump to 120s total / 60s per-state / 90s consumer. Generous but
not so generous that a genuine hang would go undiagnosed.
The confluentinc/cp-kcat image defaults to uid 1000 (appuser). On
the self-hosted CI runners the bind-mounted listener dir is owned
by root with mode 0755, so the consumer script can't write its
summary line and the test never observes kafka-consumer-summary.
Pin the consumer to uid 0 to sidestep the ownership mismatch.
Local Docker Desktop on macOS hides this because its bind mount
layer maps ownership loosely; on Linux DinD the permissions are
real.
tests/multi-server: switch broker from redpanda to apache kafka
Redpanda's seastar reactor aborts during init with "close() syscall
failed: Invalid argument" on the self-hosted CI runners, regardless
of:
- redpanda image version (v26.1.6 via :latest and v24.3.15 pinned
both fail the same way);
- sandbox configuration (default, seccomp:unconfined +
apparmor:unconfined, and privileged:true all hit the same error);
- seastar tuning (--mode dev-container, explicit
--overprovisioned / --unsafe-bypass-fsync / --reserve-memory=0M).
This is a seastar + runner-kernel interaction we can't unblock from
the compose side.
apache/kafka:3.9.1 is the official Apache Kafka Docker image, runs
the JVM implementation (not seastar), and starts cleanly in the
same DinD environment. The wire protocol is identical so kcat on
the consumer side and rlm_kafka via librdkafka on the producer
side don't care which broker is serving.
`:latest` resolves to v26.1.6, which aborts during seastar reactor
init with "close() syscall failed: Invalid argument" on the
self-hosted CI runners - even with the container running privileged
(so seccomp/AppArmor/capability bounding are all off). That's a
regression in the image itself, not a sandbox problem.
Pin to v24.3.15 which starts cleanly. Bump when a newer tag is
verified to work.
tests/multi-server: run the kafka broker privileged
seccomp:unconfined + apparmor:unconfined wasn't enough to get
redpanda past seastar's reactor init on the self-hosted CI runners
(close() still failed with EINVAL on an internal fd). Replace the
narrow security_opt overrides with `privileged: true`, which turns
off seccomp + AppArmor + capability bounding + /dev restrictions in
one go - the minimum that reliably starts the broker across DinD
runner configurations. Test-only scope, compose-network-only
exposure.
tests/multi-server: also unconfine apparmor on the kafka broker
The previous seccomp:unconfined change flipped redpanda's first-stage
failure mode (perf_event_open now EACCES from the kernel sysctl,
instead of EPERM from seccomp) but the fatal close() EINVAL during
seastar reactor init still fired. On DinD runners the inner
containers inherit the default docker-default AppArmor profile in
addition to seccomp, and that profile is what's driving the EINVAL.
Opt out of both sandboxes for the test broker.
tests/multi-server: relax seccomp for the redpanda kafka broker
The self-hosted CI runners' Docker seccomp profile is stricter than
Docker Desktop's; it blocks enough of redpanda/seastar's startup
syscalls (io_uring / eventfd / perf_event_open) that the reactor
aborts during init with "close() syscall failed: Invalid argument"
and the broker container exits non-zero. The dependent
kafka-producer1 container then never starts and compose up reports
"dependency failed to start: container ... is unhealthy".
Opt the kafka service out of the seccomp sandbox - it's a test
broker on an isolated compose network, no host access implications.
rlm_kafka: prime librdkafka's lazy globals from .onload
Call fr_kafka_init() from mod_load (paired with fr_kafka_free() in
mod_unload) so librdkafka's one-shot SSL/SASL init happens
deterministically at module load time, before any worker thread gets
to rd_kafka_new(). Ref-counted through libfreeradius-kafka so future
kafka-family modules can share the hook.
librdkafka lazily initialises its SSL (lock callbacks on legacy
OpenSSL) and SASL globals on the first rd_kafka_new() call. In a
server that owns its own OpenSSL setup, that creates a race at
thread_instantiate time, and the ordering is non-deterministic.
Give kafka-using modules a deterministic hook they can call from
their .onload: fr_kafka_init() runs the lazy paths by creating and
immediately destroying a throwaway producer, fr_kafka_free() pairs
it. Both are ref-counted against other kafka modules (same shape
as fr_openssl_init / fr_openssl_free in src/lib/tls/base.c) so a
future rlm_kafka_consumer sharing the lib doesn't double-init.
A no-op log callback is attached to the dummy conf so librdkafka's
"no bootstrap brokers" warning from the dummy producer doesn't leak
into the server log at startup.
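A minimal sketch of the trick, using only public librdkafka calls; the
ref-counting against other kafka modules and the real error handling
are omitted:

    #include <librdkafka/rdkafka.h>

    static void null_log_cb(const rd_kafka_t *rk, int level,
                            const char *fac, const char *buf)
    {
        /*  Swallow the "no bootstrap brokers" warning the dummy
         *  producer would otherwise print at startup.  */
        (void) rk; (void) level; (void) fac; (void) buf;
    }

    static int kafka_prime_globals(void)
    {
        char            errstr[256];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();
        rd_kafka_t      *rk;

        rd_kafka_conf_set_log_cb(conf, null_log_cb);

        /*  rd_kafka_new() runs librdkafka's one-shot SSL/SASL init and
         *  takes ownership of conf on success.  */
        rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
        if (!rk) {
            rd_kafka_conf_destroy(conf);    /* still ours on failure */
            return -1;
        }

        rd_kafka_destroy(rk);               /* the globals stay initialised */
        return 0;
    }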
rlm_kafka: wrap thread-owner fr_assert() calls in #ifndef NDEBUG
The worker_tid field on rlm_kafka_thread_t and rlm_kafka_msg_ctx_t->t
only exists under #ifndef NDEBUG (see ffad24d4d1), and the two
fr_assert() calls that reference it compile to nothing under NDEBUG.
With -Werror=unused-variable the ndebug build then failed because
the `t` unboxed at the head of _kafka_error_cb had no remaining use.
Move both assertions - plus the `t` local in _kafka_error_cb - under
the same #ifndef NDEBUG guard that protects the field itself.
rlm_kafka: recognise a null key box on the xlat path
With the framework now carrying FR_TYPE_NULL boxes through to the
xlat body, check for fr_type_is_null() in addition to the zero-length
check so an explicit `null` key, an empty '' literal, and an
attribute that expanded to nothing all resolve to "no key on the wire"
without any of them needing to coerce into a zero-length octets value.
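The check described above reduces to something like this (variable
names illustrative):

    void const *key     = NULL;
    size_t      key_len = 0;

    /*  Explicit `null`, empty '' literal, and an attribute that expanded
     *  to nothing all mean "no key on the wire" - librdkafka then falls
     *  back to its configured partitioner.  */
    if (!fr_type_is_null(key_vb->type) && (key_vb->vb_length > 0)) {
        key     = key_vb->vb_octets;
        key_len = key_vb->vb_length;
    }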
tmpl, value, xlat: carry null through arg lists instead of casting it
Reverses the cast coercion added in 3b5165084f. An explicit `null`
should not silently become "" or zero-length octets - callers that
wrote `null` meant "no value at all", which is a different shape
from "the empty string".
value.c: fr_value_box_cast_to_{string,octets} now return a clean
fr_strerror() on FR_TYPE_NULL source instead of falling through to
the catch-all fr_assert(0).
xlat_tokenize.c: xlat_validate_function_arg skips the compile-time
cast for FR_TYPE_NULL literals so a bareword `null` survives arg
validation.
xlat_eval.c: the runtime concat and per-box cast paths both pass an
FR_TYPE_NULL source through to the xlat body unchanged, so
implementations can check fr_type_is_null() on the incoming box
and react accordingly.
tests/modules/kafka: exercise the `null` keyword on the xlat key arg
Now that tmpl parsing recognises `null` as an explicit FR_TYPE_NULL
placeholder, swap the zero-length-value produce's key from `''` to
`null` so the test doubles as regression coverage for the keyword
end-to-end (tokenize -> xlat arg list -> cast to octets -> zero-
length key -> "no key on the wire").
Adds tmpl_afrom_null_substr so the bareword `null` is recognised at
tmpl-tokenize time and builds a TMPL_TYPE_DATA wrapping an
FR_TYPE_NULL box. Wired in before the numeric / address / bool /
attribute branches in tmpl_afrom_substr so a dictionary attribute
named "null" can't shadow it.
FR_TYPE_NULL previously doubled as the "uninitialised box" sentinel,
which is why TMPL_VERIFY panicked when it saw one inside a
TMPL_TYPE_DATA and why fr_value_box_cast_to_{string,octets} lacked
a source case for it. With the null keyword those encounters are
now deliberate, so:
- Drop the "FR_TYPE_NULL inside TMPL_TYPE_DATA is uninitialised"
assertion in tmpl_tokenize.c's TMPL_VERIFY.
- Cast FR_TYPE_NULL to an empty string / zero-length octets box.
The result is that positional xlat arguments can carry an explicit
"no value" placeholder without the framework dropping the slot or
the type system tripping over it.
rlm_kafka: accept a key as the middle xlat argument
%kafka.produce now takes (topic, key, value) instead of (topic, value),
so xlat callers can pick a partition the same way the method form does
via a declared topic `key = ...`. Zero-length octets (the literal
empty string, or an attribute that expands to nothing) mean "no key"
on the wire - librdkafka falls back to its configured partitioner.
Updated existing xlat tests to pass an explicit '' key, and
xlat.unlang now covers the non-empty case too: produce to
freeradius-test-xlat-alt with a `"xlat-key"` key and assert it
round-trips byte-for-byte through the broker.
rlm_kafka: unbox the self-pipe uctx with talloc_get_type_abort
Follow-up to the audit pass in cb2ee227c3: _kafka_fd_readable was
still casting uctx straight to rlm_kafka_thread_t *. Bring it in
line with the other callbacks so a mismatched uctx aborts loudly at
the callsite instead of crashing deeper in rd_kafka_poll.
rlm_kafka: cast topic_name_cmp params to the actual struct type
The tree stores rlm_kafka_topic_t entries and only worked by casting
the void pointers to `char const **` because `name` happens to be the
first field. Spell the struct out so the dependency on field
ordering isn't encoded in the comparator.
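Before/after shape of the comparator; the struct layout and return
convention are simplified for illustration:

    #include <string.h>

    typedef struct {
        char const *name;   /* first field - what the old `char const **` cast leaned on */
        /* ... rest of the topic handle ... */
    } rlm_kafka_topic_t;

    static int topic_name_cmp(void const *one, void const *two)
    {
        rlm_kafka_topic_t const *a = one;
        rlm_kafka_topic_t const *b = two;
        int ret = strcmp(a->name, b->name);

        /*  Comparing through the struct type means the comparator no
         *  longer encodes "name must stay the first field".  */
        return (ret > 0) - (ret < 0);
    }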
rlm_kafka: use the cached log_prefix in the self-pipe error cb
_kafka_fd_error was hard-coding the literal "kafka" prefix. Pull the
thread context off the event's uctx and format against the same
"rlm_kafka (<instance>)" prefix the rest of the module uses, so
multi-instance configurations show which instance's self-pipe fell
over.
rlm_kafka: debug-build sanity check that dr / error cbs run on-thread
Capture pthread_self() into rlm_kafka_thread_t at thread_instantiate,
then assert the same tid in _kafka_delivery_report_cb and
_kafka_error_cb. librdkafka only wakes us via the main queue (which
we poll from the worker's event loop), so a cross-thread hit would
mean an event slipped a different path and the no-lock handling of
the inflight list is unsafe.
Field and assertions are #ifndef NDEBUG so release builds carry
neither the extra tid nor the check - fr_assert(_x) expands to
nothing under NDEBUG so the missing field doesn't matter.
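A sketch of the check using librdkafka's error-callback signature; the
struct contents, how the opaque is threaded through, and the callback
body are illustrative:

    typedef struct {
    #ifndef NDEBUG
        pthread_t worker_tid;   /* captured via pthread_self() in thread_instantiate */
    #endif
        /* ... producer handle, inflight list, ... */
    } rlm_kafka_thread_t;

    static void _kafka_error_cb(rd_kafka_t *rk, int err, char const *reason, void *opaque)
    {
    #ifndef NDEBUG
        rlm_kafka_thread_t *t = talloc_get_type_abort(opaque, rlm_kafka_thread_t);

        /*  Events only arrive via the main queue polled from this
         *  worker's event loop; a cross-thread hit would mean the
         *  lock-free inflight handling is no longer safe.  */
        fr_assert(pthread_equal(t->worker_tid, pthread_self()));
    #endif
        /* ... error handling ... */
    }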
rlm_kafka: tidy header includes to match the module standard
Move RCSID / USES_APPLE_DEPRECATED_API above the includes, switch
the three `#include "lib/util/..."` to `<freeradius-devel/util/...>`
to match every other module, group includes by subsystem, and pull
pthread.h in alongside the other system headers in anticipation of
the upcoming thread-ownership sanity check.
lib/kafka, rlm_kafka: drop trailing periods from Doxygen block titles
House style: the brief / title line of a /** */ block is a heading,
not a sentence, so no trailing period. Body prose sentences after the
title keep their periods as normal.
rd_kafka_produce can only fail on queue pressure or a rejected
message; kafka_produce_enqueue then signals that back as NULL. The
topic-not-declared check in the xlat fast path and the cancelled-
request check in dr_msg_cb are both paths we only take in error or
shutdown cases. Hinting the optimiser lets it lay the hot path out
straight and push the error bodies out-of-line - matches the
unlikely() idiom already used across src/lib/util/.
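The idiom in generic form (GCC/Clang builtin; FreeRADIUS's own macro in
src/lib/util/ is equivalent in spirit):

    #include <stddef.h>

    #define unlikely(_x) __builtin_expect((_x), 0)

    int enqueue_sketch(void *msg)
    {
        if (unlikely(msg == NULL)) {    /* error/shutdown-only branch */
            return -1;                  /* cold body pushed out-of-line */
        }

        /*  Hot path falls straight through with no taken branch.  */
        return 0;
    }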
rlm_kafka: unbox async opaques with talloc_get_type_abort
xctx->inst / xctx->thread / msg->_private are all void pointers that
we know at the call-site should point back at our typed talloc chunks.
Using talloc_get_type_abort turns a subtle type-system hole into a
loud abort at the exact callsite if the opaque ever gets crossed over,
instead of a mystery crash deep in the function body.
The DR callback keeps its NULL-guard first - a NULL opaque is the
documented signal for fire-and-forget produces, so it's part of the
contract, not an error worth asserting on.
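Sketch of the delivery-report callback after the change; librdkafka's
dr_msg_cb signature is real, the ctx types follow the commit messages,
and the body (including the inflight-list field) is illustrative:

    static void _kafka_delivery_report_cb(rd_kafka_t *rk,
                                          rd_kafka_message_t const *msg, void *opaque)
    {
        rlm_kafka_thread_t  *t = talloc_get_type_abort(opaque, rlm_kafka_thread_t);
        rlm_kafka_msg_ctx_t *pctx;

        /*  NULL _private is the documented fire-and-forget signal, so the
         *  guard stays ahead of any unboxing - contract, not error.  */
        if (!msg->_private) return;

        pctx = talloc_get_type_abort(msg->_private, rlm_kafka_msg_ctx_t);

        fr_dlist_remove(&t->inflight, pctx);    /* field names illustrative */
        /* ... map msg->err to an rcode and resume pctx->request ... */
    }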
rlm_kafka: bridge librdkafka log output into the server log
Register an rd_kafka_conf_set_log_cb on the shared producer conf in
mod_instantiate. Every per-thread rd_kafka_conf_dup inherits it, so
broker errors, protocol traces, and any categories enabled via the
top-level `debug` knob now feed through the server's ERROR / WARN /
INFO / DEBUG macros with a pre-rendered "rlm_kafka (<instance>)"
prefix instead of going to librdkafka's default stderr sink.
Callback runs on librdkafka's internal threads so no mctx is in
scope; we stash the prefix on rlm_kafka_t at instantiate time and
reach for it via the producer's opaque (rlm_kafka_thread_t) on each
line.
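Sketch of the bridge; rd_kafka_conf_set_log_cb(), rd_kafka_opaque(),
and the syslog-style level numbers are librdkafka API, while the prefix
field path and log macros follow the commit message:

    static void _kafka_log_cb(rd_kafka_t const *rk, int level,
                              char const *fac, char const *buf)
    {
        rlm_kafka_thread_t const *t = rd_kafka_opaque(rk);
        char const *prefix = t->inst->log_prefix;    /* field path illustrative */

        /*  librdkafka levels follow syslog: 3 = err, 4 = warning, 6 = info.  */
        if (level <= 3)      ERROR("%s - %s: %s", prefix, fac, buf);
        else if (level == 4) WARN("%s - %s: %s", prefix, fac, buf);
        else if (level <= 6) INFO("%s - %s: %s", prefix, fac, buf);
        else                 DEBUG("%s - %s: %s", prefix, fac, buf);
    }

    /*  In mod_instantiate, on the shared conf every per-thread dup inherits:  */
    rd_kafka_conf_set_log_cb(conf, _kafka_log_cb);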
tests/multi-server: put kafka-produce back in the CI short suite
The client-cache fix in b56deabae7 ("Create different client cache
lists for TCP and UDP") addresses the underlying reason the
kafka-produce test was timing out in CI - the load-generator's own
Status-Server client lookup was failing the same way as
proxy-multihop-accept's. Rename short.test.yml back to
short.ci.test.yml so it runs under test.multi-server.ci again.
tests/multi-server: drop kafka-produce from the CI short suite
The kafka-produce multi-server test runs green locally against a
docker-compose redpanda, but on the shared GitHub Actions runners the
redpanda container fails to reach a healthy state reliably, even with
Redpanda's own dev-container preset. Rather than leave a perennially
red CI job, rename short.ci.test.yml to short.test.yml so the test
runs from `make test.multi-server` but not from
`test.multi-server.ci`. Local runs and future dedicated kafka
infrastructure can still exercise it.
CI rounds 1-2 showed two different redpanda startup failures:
- In the ci.yml / ci-sanitizers.yml `services:` stanzas, seastar
exits EINVAL on the self-hosted runners because GitHub Actions
services give us no way to override the default command line.
There's no good workaround via `options:` so drop the redpanda
service container, remove the kafka_test_server passthrough, and
let the kafka module tests skip cleanly (the existing
kafka_require_test_server gate already handles the empty-server
case). The multi-server kafka-produce test still exercises the
full produce path end-to-end.
- The multi-server docker-compose had a hand-rolled list of redpanda
start flags (`--overprovisioned --smp=1 ... --unsafe-bypass-fsync`)
which still failed to boot on the shared runner. Replace it with
Redpanda's official single-node preset `--mode dev-container`,
which sets the right seastar probing and IO defaults for a
containerised CI environment, and keep only the overrides we
actually need on top (smp/memory/node-id and the advertised
listener address).
ci: pull redpanda from its own registry and give it a longer health window
Two things tripped up the first CI run on developer/arr2036:
- The ci.yml and ci-sanitizers.yml workflows pulled redpanda through
the FreeRADIUS internal docker mirror (docker.internal.networkradius.com),
but redpandadata/redpanda is not mirrored there, so every job with a
redpanda service container died at 'Initialize containers'. Pull
directly from docker.redpanda.com, matching what the multi-server
docker-compose already does.
- The multi-server redpanda service had a 60s start window and 30s of
retries. On a busy self-hosted runner that's marginal - we saw
kafka-produce-short_ci fail the compose-up health gate. Bump the
start_period to 120s and extend retries so we allow up to ~4 minutes
for the broker to come up.
tests/kafka: cover method dispatch forms, keys, binary and edge payloads
Extend the kafka test harness to exercise:
- method form with explicit name2 (kafka.produce / send / recv .topic)
- method form with implicit name2 (bare 'kafka' inside a recv section
that matches a topic named after the packet type)
- keyed produces (deterministic partitioning by key attr)
- binary payloads with embedded NULs and high-bit bytes
- edge-case xlat inputs (empty string, 16 KiB payload, embedded
control characters, UTF-8 / multibyte)
Topics referenced by the new tests are declared in module.conf so
unknown-topic handling continues to fail at parse time.
rlm_kafka: tighten error handling and xlat register ctx
A grab-bag of cleanups surfaced while refactoring the module:
- Trust MEM() for talloc-only failures in kafka_topic_thread_handles
(rb_tree_alloc, rd_kafka_topic_conf_dup).
- Duplicate topic handle is now an fr_cond_assert_msg - it cannot
happen if the declared-topic tree was built correctly.
- Drop the redundant kafka.produce xlat NULL-arg guard; the xlat
arg parser already enforces required=true.
- Register the produce xlat against mi->boot rather than mi->data -
xlat registration happens in bootstrap, before mi->data is
allocated and mprotected.
- Move xlat_arg_parser_t below the signal callback so the xlat
function and its args sit together.
- Drop the 'rlm_kafka:' prefix from logger calls; the logging layer
already tags module-originated messages.
Kafka payloads and keys are opaque byte strings on the wire. Typing
value/key as FR_TYPE_STRING worked accidentally because fr_value_box_t
carries an explicit length, but it invited NUL-termination or UTF-8
assumptions to creep in from intermediate tmpl expansion.
Switch both the per-topic call_env rules and the xlat value argument
to FR_TYPE_OCTETS, and read .vb_octets instead of .vb_strvalue on the
produce paths. As a bonus, an integer-typed key attribute now
serialises in network byte order - matching what other Kafka clients
do - so the same numeric key hashes to the same partition regardless
of producer.
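The arg-table side of the change looks roughly like this; the exact
table contents are simplified, but .vb_octets / .vb_length are the real
fr_value_box_t accessors:

    static xlat_arg_parser_t const kafka_xlat_produce_args[] = {
        { .required = true, .type = FR_TYPE_STRING },   /* topic */
        { .required = true, .type = FR_TYPE_OCTETS },   /* key   */
        { .required = true, .type = FR_TYPE_OCTETS },   /* value */
        XLAT_ARG_PARSER_TERMINATOR
    };

    /*  On the produce path: raw bytes plus an explicit length, no
     *  NUL-termination or UTF-8 assumptions.  */
    uint8_t const *payload     = value_vb->vb_octets;
    size_t         payload_len = value_vb->vb_length;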
rlm_kafka: delegate per-topic value/key parsing to call_env framework
The topic-subsection callback was walking value and key by hand with
cf_pair_find + call_env_parse_pair + call_env_parsed_add +
call_env_parsed_set_tmpl, reimplementing what call_env_parse() already
does when given a rules array pointed at a CONF_SECTION.
Replace the bespoke walker with a single static topic_env[] array and
a recursive call_env_parse() against the topic's subsection. The
framework handles pair lookup, tmpl compilation, offset writes, and
required/nullable enforcement.
Also renamed: the topic handle's rd_kafka_topic_t field and the loop
locals in kafka_topic_thread_handles (rkt/h/tc -> kt/topic_t/ktc).
Pure rename, no behaviour change.
rlm_kafka: own flush_timeout as a module-level setting
flush_timeout controls how long thread_detach waits for in-flight
messages to drain, which is a policy decision that belongs to the
module using the kafka base library rather than to the library
itself. Move the CONF_PARSER entry into rlm_kafka's module_config
and store the value on rlm_kafka_t.
rlm_kafka: implement async producer with event-triggered I/O
Replaces the stub with a full async Kafka producer.
* Per-worker rd_kafka_t: each worker thread owns its own producer handle
so delivery report callbacks always run on the worker that initiated
the produce. No cross-thread wakeups, no resume races.
* Self-pipe I/O: librdkafka's main queue is wired to a self-pipe via
rd_kafka_queue_io_event_enable(). The pipe's read end sits in the
worker's event loop; kafka_fd_readable() drains it then polls the
producer with a zero-timeout rd_kafka_poll() loop to dispatch DRs
and broker errors (see the sketch below).
* Module surface: kafka.produce method (topic/key/value/headers
call_env) plus %kafka.produce(topic, value) xlat. Both yield waiting
for the delivery report and resume with a rcode that distinguishes
transient failure (fail), permanent rejection (reject), timeout
(timeout), and success.
* Cancellation: signal handler detaches request from the in-flight
ctx rather than trying to recall the message (librdkafka has no
cancel API). dr_msg_cb sees request == NULL and silently frees.
Safe against dr_msg_cb racing because both fire on the same worker
thread.
* Shutdown: thread_detach flushes outstanding produces within the
configured flush_timeout, then drains any queued DRs before tearing
down the producer. Inflight ctxs whose DRs never arrive are walked
and their requests nulled out as a belt-and-suspenders.
The module gates off the build system automatically via the existing
src/lib/kafka/all.mk include if librdkafka isn't available.
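The librdkafka half of the self-pipe wiring reduces to a couple of
calls; the queue/IO functions are real API, the event-loop glue and the
struct fields are sketched:

    /*  FD-readable handler installed on the pipe's non-blocking read end
     *  in the worker's event loop.  */
    static void _kafka_fd_readable(fr_event_list_t *el, int fd, int flags, void *uctx)
    {
        rlm_kafka_thread_t *t = talloc_get_type_abort(uctx, rlm_kafka_thread_t);
        char                buf[64];

        while (read(fd, buf, sizeof(buf)) > 0);  /* drain the wakeup byte(s)     */
        while (rd_kafka_poll(t->rk, 0) > 0);     /* dispatch DRs / broker errors */
    }

    /*  At thread_instantiate (error handling omitted):  */
    rd_kafka_queue_t *q = rd_kafka_queue_get_main(t->rk);

    pipe(t->fds);
    /*  librdkafka writes one payload byte to fds[1] whenever the main
     *  queue becomes non-empty.  */
    rd_kafka_queue_io_event_enable(q, t->fds[1], "1", 1);
    rd_kafka_queue_destroy(q);
    /*  ...then register t->fds[0] with _kafka_fd_readable in the
     *  worker's event loop.  */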
lib/kafka: expose fr_kafka_conf_t and accessors to modules
Promote kafka_conf_from_cs() and kafka_topic_conf_from_cs() from static
inline to extern, and move the fr_kafka_conf_t / fr_kafka_topic_conf_t
typedefs into the public header so modules can parse kafka config
subsections and hand the resulting rd_kafka_conf_t to rd_kafka_new().
Also fixes a sbuff terminator construction bug in kafka_config_dflt()
where FR_SBUFF_TERM() was being applied to a runtime pointer (sizeof
produces pointer size, not string length), and adds subcs_size metadata
to the topic multi-subsection so cf_parse doesn't assert.
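The sizeof pitfall behind the FR_SBUFF_TERM() fix, illustrated
standalone (generic C, not the FreeRADIUS macro itself):

    #include <stdio.h>
    #include <string.h>

    #define BAD_LEN(_s) (sizeof(_s) - 1)    /* only correct for string literals */

    int main(void)
    {
        char const *p = "one=1,two=2";

        printf("%zu\n", BAD_LEN("one=1,two=2")); /* 11: sizeof sees the char array  */
        printf("%zu\n", BAD_LEN(p));             /* 7 on LP64: sizeof sees a pointer */
        printf("%zu\n", strlen(p));              /* 11: what a runtime pointer needs */
        return 0;
    }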
Nick Porter [Thu, 9 Apr 2026 13:05:43 +0000 (14:05 +0100)]
Only standard modules register xlats with their own name
Without this, if a virtual server (for example) has the same name as a
module that registers an xlat under its own name, then during server
shutdown removing the process module for the virtual server attempts to
unregister an xlat it doesn't own, leading to a seg fault.
Nick Porter [Thu, 9 Apr 2026 13:55:16 +0000 (14:55 +0100)]
Parent coord_pair_reg off the list of registrations
As with coord_reg, the entry component of the registration will change
as additional modules register coord_pair, which conflicts with module
instance data protection.
Nick Porter [Wed, 8 Apr 2026 10:55:46 +0000 (11:55 +0100)]
Parent coordinator registrations off the list of registrations
When more than one module registers a coordinator, the "previous"
registration changes when the new one is added to the list. If the
registration is parented off the module instance data then that gets
protected - so a seg fault happens when the second registration is
added. Parenting the registration off the list removes this issue.