apply_syscall_filter() unconditionally inserts the write() syscall
into c->syscall_filter when exec_fd or handoff_timestamp_fd is in
use, so the parent can receive the exec status / handoff timestamp
from the child. When the unit configures a SystemCallFilter=
allow-list that deliberately omits write(), the operator's policy is
widened silently, with no trace in the journal.

Emit a log_debug() before the seccomp_filter_set_add_by_name() call
when syscall_allow_list is true, so the widening is at least
observable to operators inspecting the unit's debug log.

While here, document that mutating c->syscall_filter through a
'const ExecContext *c' is intentional: apply_syscall_filter() runs
only in the post-fork child, which owns a private copy of the
address space, so the hashmap change is never observed by the
manager.

No functional change to the filter itself; write() is still added
exactly as before.
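
For illustration only, the post-fork reasoning above can be sketched as
a standalone C program. All names here (demo_context, demo_allow_write,
demo_parent_view_after_child_mutation) are hypothetical and are not
systemd code; an int behind a pointer stands in for the Hashmap behind
c->syscall_filter. The sketch assumes a plain fork() and shows two
things: a const struct pointer does not make the pointed-to member's
target const, and a write made by the child never reaches the parent.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for ExecContext; not the real systemd type. */
struct demo_context {
        int *write_allowed; /* plays the role of the syscall_filter hashmap pointer */
};

/* 'const' applies to the struct, so the pointer member cannot be reassigned
 * here, but the object it points to may still be modified without a cast. */
static void demo_allow_write(const struct demo_context *c) {
        *c->write_allowed = 1;
}

/* Returns the parent's view of the flag after the child changed its own
 * copy, or -1 if fork() or waitpid() failed. */
int demo_parent_view_after_child_mutation(void) {
        int flag = 0;
        struct demo_context ctx = { .write_allowed = &flag };
        int status;
        pid_t pid;

        pid = fork();
        if (pid < 0)
                return -1;

        if (pid == 0) {
                /* Child: mutate the private copy, then report what it sees. */
                demo_allow_write(&ctx);
                _exit(flag == 1 ? EXIT_SUCCESS : EXIT_FAILURE);
        }

        if (waitpid(pid, &status, 0) < 0)
                return -1;

        /* The child did observe its own change ... */
        assert(WIFEXITED(status) && WEXITSTATUS(status) == EXIT_SUCCESS);

        /* ... while the parent's copy is untouched. */
        return flag;
}
```

Calling demo_parent_view_after_child_mutation() in the parent is
expected to return 0 even though the child saw 1, which is the property
the new code comment relies on.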
Fixes: 84b79215ccc5 ("core: do not filter out write() if required in the very late stage")
Assisted-by: kres (claude-opus-4-7)
Signed-off-by: Chris Mason <clm@meta.com>
                 action = negative_action;
         }
-        /* Sending over exec_fd or handoff_timestamp_fd requires write() syscall. */
+        /* Sending over exec_fd or handoff_timestamp_fd requires write() syscall.
+         *
+         * Note: this mutates c->syscall_filter despite the 'const ExecContext *c' qualifier.
+         * That is intentional and safe here because apply_syscall_filter() runs only in the
+         * post-fork child, which holds a private copy of the address space; the hashmap
+         * change is never visible to the manager process. */
         if (p->exec_fd >= 0 || p->handoff_timestamp_fd >= 0) {
+                if (c->syscall_allow_list)
+                        log_debug("SystemCallFilter= allow-list in effect; adding 'write' syscall required for exec handoff.");
+
                 r = seccomp_filter_set_add_by_name(c->syscall_filter, c->syscall_allow_list, "write");
                 if (r < 0)
                         return r;