From: Chris Down <chris@chrisdown.name>
Date: Tue, 16 Jun 2026 10:50:02 +0000 (+0800)
Subject: journal: Prevent total log loss on unclean shutdown at high write rates
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=787eb710636598bd419851ea8dfd289518ac6acb;p=thirdparty%2Fsystemd.git

journal: Prevent total log loss on unclean shutdown at high write rates

In Meta production we have been considering using journald more widely
for some time. One of the blockers to doing that which I have noticed is
that often journald seems to have vastly less data after lockups/power
failures compared to plain files, which is not great when debugging
outages.

On small write rates this tends to be hard to reproduce, but when
writing thousands of messages a second, an unclean shutdown can result
in the end result being an active journal file with a header that
records an arena larger than the data that actually reached disk. What
happens is then that journalctl then discards the entire file(!),
completely ignoring that there is a huge amount of data which is
actually perfectly readable.

The reason for that is that the journal header is updated on every
append, while the file size and newly written arena contents are only
made durable on the filesystem's own schedule. After a crash, the header
can therefore describe writes which were logically completed by journald
but whose backing data or file metadata never reached disk.

Take the following example of how this can happen at high log rates:

1. journald appends objects into an mmap()ed arena, periodically growing
   the file with fallocate() in FILE_SIZE_INCREASE (8M) steps and
   advancing the header's arena_size tail pointers as it goes along.
2. The header is dirtied on every append, and its arena_size is advanced
   at each fallocate(). It is, from the kernel's perspective, an
   ordinary data page and is only made durable by the kernel's periodic
   page cache writeback on its own schedule. The file's length, by
   contrast, is metadata, made durable only when the filesystem commits
   a transaction (or on an fsync(), which journald does not issue
   between sync intervals).
3. journald marks journals NOCOW, so the header's data block is
   overwritten in place and is decoupled from the size metadata. Nothing
   orders the two with respect to each other. Writeback therefore can
   routinely persist a header whose arena_size has run ahead of the file
   length recorded on disk.
4. Power is lost. On the next boot the persisted header reflects an
   arena_size and tail pointers which have been advanced for appends.
   However their payload and the file metadata were never committed, so
   header_size + arena_size now points well past the end of the file as
   it exists on disk.
5. journal_file_verify_header() then rejects this with -ENODATA:

    if (... || header_size + arena_size > (uint64_t) f->last_stat.st_size)
            return -ENODATA;

That is correct when opening for writing, because we must not append to
a file whose recorded state we cannot trust, and the caller must rotate
it away. But the same check also runs on read only opens, where it is
actively harmful. In the case of journalctl, the entire file is skipped,
even though the data hash table, the field hash table, and the head of
the array all are present and fully intact, and the great majority of
entries are physically present. In fact, only a very small part of the
most recently written tail is missing, but everything before is
readable. This results in mistakenly rejecting the entire file as
corrupt.

This happens extremely frequently on machines with high write rates
during power cuts or lockups. In testing writing ~7500 msg/s through
journald and then cutting power, I reproduced it in ten out of ten
attempts across different machines.

In each case, the header was left claiming ~296M of arena while only
~192-208M had reached disk. In this case, journalctl reports that it has
recovered 0 of ~335000 messages. Whether a given crash trips the
condition depends on where it falls relative to the header's writeback,
but when it does, the loss today is total. After this patch the vast
majority of messages can be retrieved.

Let's fix this by keeping the rejection for writing, but for read-only
opens, let's just clamp the arena to the real file size and skip the
consistency checks on the now unreliable tail pointers. The reader will
walk the entry array chain from its intact head and stop at the
truncation point by the bounds check that already exists, so there's no
need to do any more than that there.

You might also wonder, why not address this on the write side? That
would be astronomically expensive and require an fsync() after every
fallocate().

In terms of improvements, when reading from a file affected in the way
described above, previously journalctl recovers 0 entries, and now it
can recover all actually intact entries.
---

diff --git a/src/libsystemd/sd-journal/journal-file.c b/src/libsystemd/sd-journal/journal-file.c
index bd8a4348bda..632b3bddbef 100644
--- a/src/libsystemd/sd-journal/journal-file.c
+++ b/src/libsystemd/sd-journal/journal-file.c
@@ -590,10 +590,39 @@ static int journal_file_verify_header(JournalFile *f) {
 
         arena_size = le64toh(READ_NOW(f->header->arena_size));
 
-        if (UINT64_MAX - header_size < arena_size || header_size + arena_size > (uint64_t) f->last_stat.st_size)
+        if (UINT64_MAX - header_size < arena_size)
                 return -ENODATA;
 
+        uint64_t file_size = (uint64_t) f->last_stat.st_size;
+
+        /* Probably an unclean shutdown where the header was written, but the arena data was not. On write we
+         * should ask the caller to rotate, but on read, we can still work it out with bounds checks. */
+        bool truncated = false;
+        if (header_size + arena_size > file_size) {
+                if (journal_file_writable(f))
+                        return -ENODATA;
+
+                /* This shouldn't happen given file_size is page aligned via fallocate(), but just in case
+                 * things are _really_ messed up... */
+                uint64_t available = ALIGN_DOWN_U64(file_size, sizeof(uint64_t));
+                if (header_size > available || available - header_size < offsetof(ObjectHeader, payload))
+                        return -ENODATA;
+
+                log_debug("Journal file %s claims a %" PRIu64 " byte arena but is only %" PRIu64
+                          " bytes on disk, clamping for recovery.",
+                          f->path,
+                          arena_size,
+                          file_size);
+                arena_size = available - header_size;
+                truncated = true;
+        }
+
         uint64_t tail_object_offset = le64toh(f->header->tail_object_offset);
+        if (truncated)
+                /* The tail may be in the lost region, so cap it at the last possible object header start. */
+                tail_object_offset = MIN(
+                                tail_object_offset,
+                                header_size + arena_size - offsetof(ObjectHeader, payload));
         if (!offset_is_valid(tail_object_offset, header_size, UINT64_MAX))
                 return -ENODATA;
         if (header_size + arena_size < tail_object_offset)
@@ -615,7 +644,7 @@ static int journal_file_verify_header(JournalFile *f) {
         if (!offset_is_valid(entry_array_offset, header_size, tail_object_offset))
                 return -ENODATA;
 
-        if (JOURNAL_HEADER_CONTAINS(f->header, tail_entry_array_offset)) {
+        if (!truncated && JOURNAL_HEADER_CONTAINS(f->header, tail_entry_array_offset)) {
                 uint32_t offset = le32toh(f->header->tail_entry_array_offset);
                 uint32_t n = le32toh(f->header->tail_entry_array_n_entries);
 
@@ -632,7 +661,7 @@ static int journal_file_verify_header(JournalFile *f) {
                         return -ENODATA;
         }
 
-        if (JOURNAL_HEADER_CONTAINS(f->header, tail_entry_offset)) {
+        if (!truncated && JOURNAL_HEADER_CONTAINS(f->header, tail_entry_offset)) {
                 uint64_t offset = le64toh(f->header->tail_entry_offset);
 
                 if (!offset_is_valid(offset, header_size, tail_object_offset))
@@ -664,7 +693,7 @@ static int journal_file_verify_header(JournalFile *f) {
 
         /* Verify number of objects */
         uint64_t n_objects = le64toh(f->header->n_objects);
-        if (n_objects > arena_size / offsetof(ObjectHeader, payload))
+        if (!truncated && n_objects > arena_size / offsetof(ObjectHeader, payload))
                 return -ENODATA;
 
         uint64_t n_entries = le64toh(f->header->n_entries);
@@ -4576,10 +4605,17 @@ int journal_file_get_cutoff_realtime_usec(JournalFile *f, usec_t *ret_from, usec
         }
 
         if (ret_to) {
-                if (f->header->tail_entry_realtime == 0)
+                Object *o;
+                int r;
+
+                /* The header may be stale on unclean shutdown, so don't trust it. */
+                r = journal_file_next_entry(f, 0, DIRECTION_UP, &o, NULL);
+                if (r < 0)
+                        return r;
+                if (r == 0)
                         return -ENOENT;
 
-                *ret_to = le64toh(f->header->tail_entry_realtime);
+                *ret_to = le64toh(o->entry.realtime);
         }
 
         return 1;