---
title: Journal File Format
category: Interfaces
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---

# Journal File Format

_Note that this document describes the binary on-disk format of journals only.
For interfacing with web technologies there's the [Journal JSON Format](JOURNAL_EXPORT_FORMATS#journal-json-format).
For transfer of journal data across the network there's the
[Journal Export Format](JOURNAL_EXPORT_FORMATS#journal-export-format)._

The systemd journal stores log data in a binary format with several features:

* Fully indexed by all fields
* Can store binary data, up to 2^64-1 bytes in size
* Seekable
* Primarily append-based, hence robust to corruption
* Support for in-line compression
* Support for in-line Forward Secure Sealing

This document explains the basic structure of the file format on disk.
We are making this available primarily to allow review and provide documentation.
Note that the actual implementation in the
[systemd codebase](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/)
is the only ultimately authoritative description of the format,
so if this document and the code disagree, the code is right.
That said, we'll of course try hard to keep this document up-to-date and accurate.

Instead of implementing your own reader or writer for journal files we ask you to use the
[Journal's native C API](https://www.freedesktop.org/software/systemd/man/sd-journal.html)
to access these files.
It provides you with full access to the files, and will not withhold any data.
If you find a limitation, please ping us and we might add some additional interfaces for you.

If you need access to the raw journal data in serialized stream form without the C API, our recommendation is to make use of the
[Journal Export Format](JOURNAL_EXPORT_FORMATS#journal-export-format),
which you can get via `journalctl -o export` or via `systemd-journal-gatewayd`.
The export format is much simpler to parse, but complete and accurate.
Due to its stream-based nature it is not indexed.

_Or, to put this in other words: this low-level document is probably not what you want to use as the basis of your project.
You want our [C API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) instead!
And if you really don't want the C API, then you want the
[Journal Export Format or Journal JSON Format](/JOURNAL_EXPORT_FORMATS) instead!
This document is primarily for your entertainment and education.
Thank you!_

This document assumes you have a basic understanding of the journal concepts, the properties of a journal entry and so on.
If not, please go and read up, then come back!
This is a good opportunity to read about the
[basic properties of journal entries](https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html),
in particular realize that they may include binary non-text data (though they usually don't),
and that the same field might have multiple values assigned within the same entry.

This document describes the current format as of systemd 254.
The documented format is compatible with the format used in the first versions of the journal,
but has received various compatible and incompatible additions since then.

If you are wondering why the journal file format has been created in the first place instead of adopting an existing database implementation,
please have a look [at this thread](https://lists.freedesktop.org/archives/systemd-devel/2012-October/007054.html).


## Basics

* All offsets, sizes, time values, hashes (and most other numeric values) are 32-bit/64-bit unsigned integers in LE format.
* Offsets are always relative to the beginning of the file.
* The 64-bit hash function siphash24 is used for newer journal files.
  For older files [Jenkins lookup3](https://en.wikipedia.org/wiki/Jenkins_hash_function) is used,
  more specifically `jenkins_hashlittle2()`, with the first 32-bit integer it returns used as the higher 32-bit part of the 64-bit value,
  and the second one used as the lower 32-bit part (see the sketch after this list).
* All structures are aligned to 64-bit boundaries and padded to multiples of 64 bits.
* The format is designed to be read and written via memory mapping using multiple mapped windows.
* All time values are stored in usec since the respective epoch.
* Wall clock time values are relative to the Unix time epoch, i.e. January 1st, 1970. (`CLOCK_REALTIME`)
* Monotonic time values are always stored jointly with the kernel boot ID value (i.e. `/proc/sys/kernel/random/boot_id`) they belong to.
  They tend to be relative to the start of the boot, but aren't for containers. (`CLOCK_MONOTONIC`)
* Randomized, unique 128-bit IDs are used in various locations. These are generally UUID v4 compatible, but this is not a requirement.

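To make the legacy hash concrete, here is a minimal sketch of the 64-bit combination step, assuming Bob Jenkins' lookup3 `hashlittle2()` implementation (bundled by systemd as `jenkins_hashlittle2()`) is linked in; the `jenkins_hash64()` helper name is used for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed to be provided by a bundled copy of Bob Jenkins' lookup3.c
 * (systemd ships it as jenkins_hashlittle2()). */
void jenkins_hashlittle2(const void *key, size_t length, uint32_t *pc, uint32_t *pb);

/* Combine the two 32-bit results into the 64-bit hash used by older journal
 * files: the first result forms the upper 32 bits, the second the lower 32 bits. */
static uint64_t jenkins_hash64(const void *data, size_t length) {
        uint32_t a = 0, b = 0;

        jenkins_hashlittle2(data, length, &a, &b);
        return ((uint64_t) a << 32) | (uint64_t) b;
}
```
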
## General Rules

If any kind of corruption is noticed by a writer it should immediately rotate
the file and start a new one. No further writes should be attempted to the
original file, but it should be left around so that as little data as possible
is lost.

If any kind of corruption is noticed by a reader it should try hard to handle
this gracefully, such as skipping over the corrupted data, but allowing access
to as much data around it as possible.

A reader should verify all offsets and other data as it reads them. This includes
checking for alignment and range of offsets in the file, especially before
trying to read data via a memory map.

A reader must interleave rotated and corrupted files as well as possible and
present them as a single stream to the user.

All fields marked as "reserved" must be initialized with 0 when writing and be
ignored on reading. They are currently not used but might be used later on.


## Structure

The file format's data structures are declared in
[journal-def.h](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/journal-def.h).

The file format begins with a header structure. After the header structure
object structures follow. Objects are appended to the end as time
progresses. Most data stored in these objects is not altered anymore after
having been written once, with the exception of records necessary for
indexing. When new data is appended to a file the writer first writes all new
objects to the end of the file, and then links them up at front after that's
done. Currently, seven different object types are known:

```c
enum {
        OBJECT_UNUSED,
        OBJECT_DATA,
        OBJECT_FIELD,
        OBJECT_ENTRY,
        OBJECT_DATA_HASH_TABLE,
        OBJECT_FIELD_HASH_TABLE,
        OBJECT_ENTRY_ARRAY,
        OBJECT_TAG,
        _OBJECT_TYPE_MAX
};
```

* A **DATA** object, which encapsulates the contents of one field of an entry, i.e. a string such as `_SYSTEMD_UNIT=avahi-daemon.service`, or `MESSAGE=Foobar made a booboo.` but possibly including large or binary data, and always prefixed by the field name and "=".
* A **FIELD** object, which encapsulates a field name, i.e. a string such as `_SYSTEMD_UNIT` or `MESSAGE`, without any `=` or even value.
* An **ENTRY** object, which binds several **DATA** objects together into a log entry.
* A **DATA_HASH_TABLE** object, which encapsulates a hash table for finding existing **DATA** objects.
* A **FIELD_HASH_TABLE** object, which encapsulates a hash table for finding existing **FIELD** objects.
* An **ENTRY_ARRAY** object, which encapsulates a sorted array of offsets to entries, used for seeking by binary search.
* A **TAG** object, consisting of an FSS sealing tag for all data from the beginning of the file or the last tag written (whichever is later).

## Header

The Header struct defines, well, you guessed it, the file header:

```c
_packed_ struct Header {
        uint8_t signature[8]; /* "LPKSHHRH" */
        le32_t compatible_flags;
        le32_t incompatible_flags;
        uint8_t state;
        uint8_t reserved[7];
        sd_id128_t file_id;
        sd_id128_t machine_id;
        sd_id128_t tail_entry_boot_id;
        sd_id128_t seqnum_id;
        le64_t header_size;
        le64_t arena_size;
        le64_t data_hash_table_offset;
        le64_t data_hash_table_size;
        le64_t field_hash_table_offset;
        le64_t field_hash_table_size;
        le64_t tail_object_offset;
        le64_t n_objects;
        le64_t n_entries;
        le64_t tail_entry_seqnum;
        le64_t head_entry_seqnum;
        le64_t entry_array_offset;
        le64_t head_entry_realtime;
        le64_t tail_entry_realtime;
        le64_t tail_entry_monotonic;
        /* Added in 187 */
        le64_t n_data;
        le64_t n_fields;
        /* Added in 189 */
        le64_t n_tags;
        le64_t n_entry_arrays;
        /* Added in 246 */
        le64_t data_hash_chain_depth;
        le64_t field_hash_chain_depth;
        /* Added in 252 */
        le32_t tail_entry_array_offset;
        le32_t tail_entry_array_n_entries;
        /* Added in 254 */
        le64_t tail_entry_offset;
};
```

The first 8 bytes of Journal files must contain the ASCII characters `LPKSHHRH`.

If a writer finds that the **machine_id** of a file to write to does not match
the machine it is running on it should immediately rotate the file and start a
new one.

When a journal file is first created the **file_id** is randomly and uniquely
initialized.

When a writer creates a file it shall initialize the **tail_entry_boot_id** to
the current boot ID of the system. When appending an entry it shall update the
field to the boot ID of that entry, so that it is guaranteed that the
**tail_entry_monotonic** field refers to a timestamp of the monotonic clock
associated with the boot with the ID indicated by the **tail_entry_boot_id**
field. (Compatibility note: in older versions of the journal, the field was
also supposed to be updated whenever the file was opened for any form of
writing, including when opened to mark it as archived. This behaviour has been
deemed problematic since without an associated boot ID the
**tail_entry_monotonic** field is useless. To indicate that the boot ID is
updated only on append, the HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID flag is set.
If it is not set, the **tail_entry_monotonic** field is not usable.)

The currently used part of the file is the **header_size** plus the
**arena_size** field of the header. If a writer needs to write to a file where
the actual file size on disk is smaller than the reported value it shall
immediately rotate the file and start a new one. If a writer is asked to write
to a file with a header that is shorter than its own definition of the struct
Header, it shall immediately rotate the file and start a new one.

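To make the reader-side checks from this and the preceding sections concrete, here is a minimal sketch over a memory-mapped file, assuming glibc's `le64toh()`. The field offsets (88 for **header_size**, 96 for **arena_size**) follow from the packed `Header` declaration above, and 208 bytes is the size of the original header, i.e. everything up to and including **tail_entry_monotonic**:

```c
#include <endian.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Minimal reader-side sanity checks, sketched over the mmapped file contents. */
static bool journal_header_looks_sane(const uint8_t *map, uint64_t file_size) {
        uint64_t header_size, arena_size;

        /* A valid file is at least as large as the original header and starts
         * with the signature "LPKSHHRH". */
        if (file_size < 208 || memcmp(map, "LPKSHHRH", 8) != 0)
                return false;

        memcpy(&header_size, map + 88, sizeof(header_size));   /* le64_t header_size */
        memcpy(&arena_size, map + 96, sizeof(arena_size));     /* le64_t arena_size */
        header_size = le64toh(header_size);
        arena_size = le64toh(arena_size);

        /* The header must be at least as large as the original format's header... */
        if (header_size < 208)
                return false;

        /* ...and the used part of the file (header + arena) must fit into the file. */
        if (arena_size > file_size || header_size > file_size - arena_size)
                return false;

        return true;
}
```
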
The **n_objects** field contains a counter for objects currently available in
this file. As objects are appended to the end of the file this counter is
increased.

The first object in the file starts immediately after the header. The last
object in the file is at the offset **tail_object_offset**, which may be 0 if
no object is in the file yet.

The **n_entries**, **n_data**, **n_fields**, **n_tags**, **n_entry_arrays**
fields are counters of the objects of the specific types.

**tail_entry_seqnum** and **head_entry_seqnum** contain the sequence number
(see below) of the last or first entry in the file, respectively, or 0 if no
entry has been written yet.

**tail_entry_realtime** and **head_entry_realtime** contain the wallclock
timestamp of the last or first entry in the file, respectively, or 0 if no
entry has been written yet.

**tail_entry_monotonic** is the monotonic timestamp of the last entry in the
file, referring to monotonic time of the boot identified by
**tail_entry_boot_id**, but only if the
HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID feature flag is set, see above. If it
is not set, this field might refer to a different boot than the one in the
**tail_entry_boot_id** field, for example when the file was ultimately
archived.

**data_hash_chain_depth** is a counter of the deepest chain in the data hash
table, minus one. This is updated whenever a chain is found that is longer than
the previous deepest chain found. Note that the counter is updated during hash
table lookups, as the chains are traversed. This counter is used to determine
when it is a good time to rotate the journal file, because hash collisions have
become too frequent.

Similarly, **field_hash_chain_depth** is a counter of the deepest chain in the
field hash table, minus one.

**tail_entry_array_offset** and **tail_entry_array_n_entries** allow immediate
access to the last entry array in the global entry array chain.

**tail_entry_offset** allows immediate access to the last entry in the journal
file.

## Extensibility

The format is supposed to be extensible in order to enable future additions of
features. Readers should simply skip objects of unknown types as they read
them. If a compatible feature extension is made a new bit is registered in the
header's **compatible_flags** field. If a feature extension is used that makes
the format incompatible a new bit is registered in the header's
**incompatible_flags** field. Readers should check these two bit fields: if
they find a flag they don't understand in **compatible_flags** they should
continue to read the file, but if they find one in **incompatible_flags** they
should fail, asking for an update of the software. Writers should refuse
writing if there's an unknown bit flag in either of these fields.

The file header may be extended as new features are added. The size of the file
header is stored in the header. All header fields up to **n_data** are known to
unconditionally exist in all revisions of the file format; all fields starting
with **n_data** need to be explicitly checked for via a size check, since they
were additions after the initial release.

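This size check can be expressed with a macro along the following lines (a sketch, assuming the `Header` declaration above is available as a C type and glibc's `le64toh()`; the reference implementation uses a very similar construct):

```c
#include <endian.h>
#include <stddef.h>

/* A header field is present if, and only if, the on-disk header_size fully covers it. */
#define JOURNAL_HEADER_CONTAINS(h, field) \
        (le64toh((h)->header_size) >= offsetof(struct Header, field) + sizeof((h)->field))

/* Usage example: only trust the n_data counter on new-enough files.
 *
 *     if (JOURNAL_HEADER_CONTAINS(header, n_data))
 *             n_data = le64toh(header->n_data);
 */
```
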
Currently, seven extensions flagged in these fields are known:

```c
enum {
        HEADER_INCOMPATIBLE_COMPRESSED_XZ   = 1 << 0,
        HEADER_INCOMPATIBLE_COMPRESSED_LZ4  = 1 << 1,
        HEADER_INCOMPATIBLE_KEYED_HASH      = 1 << 2,
        HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3,
        HEADER_INCOMPATIBLE_COMPACT         = 1 << 4,
};

enum {
        HEADER_COMPATIBLE_SEALED             = 1 << 0,
        HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID = 1 << 1,
};
```

HEADER_INCOMPATIBLE_COMPRESSED_XZ indicates that the file includes DATA objects
that are compressed using XZ. Similarly, HEADER_INCOMPATIBLE_COMPRESSED_LZ4
indicates that the file includes DATA objects that are compressed with the LZ4
algorithm. And HEADER_INCOMPATIBLE_COMPRESSED_ZSTD indicates that there are
objects compressed with ZSTD.

HEADER_INCOMPATIBLE_KEYED_HASH indicates that instead of the unkeyed Jenkins
hash function the keyed siphash24 hash function is used for the two hash
tables, see below.

HEADER_INCOMPATIBLE_COMPACT indicates that the journal file uses the new binary
format that uses less space on disk compared to the original format.

HEADER_COMPATIBLE_SEALED indicates that the file includes TAG objects required
for Forward Secure Sealing.

HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID indicates whether the
**tail_entry_boot_id** field is strictly updated on initial creation of the
file and whenever an entry is appended (in which case the flag is set), or also
when the file is archived (in which case it is unset). New files should always
set this flag (and thus not update the **tail_entry_boot_id** except when
creating the file and when appending an entry to it).

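Putting these flag rules together, a reader or writer might gate access roughly as follows; this is a sketch that assumes the flag constants from the enums above are in scope, and the helper names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

/* Extensions this (hypothetical) implementation understands. */
#define SUPPORTED_INCOMPATIBLE_FLAGS                           \
        ((uint32_t) (HEADER_INCOMPATIBLE_COMPRESSED_XZ   |     \
                     HEADER_INCOMPATIBLE_COMPRESSED_LZ4  |     \
                     HEADER_INCOMPATIBLE_KEYED_HASH      |     \
                     HEADER_INCOMPATIBLE_COMPRESSED_ZSTD |     \
                     HEADER_INCOMPATIBLE_COMPACT))

#define SUPPORTED_COMPATIBLE_FLAGS                             \
        ((uint32_t) (HEADER_COMPATIBLE_SEALED |                \
                     HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID))

/* Readers: unknown compatible flags are fine, unknown incompatible flags are fatal. */
static bool reader_can_use_file(uint32_t incompatible_flags) {
        return (incompatible_flags & ~SUPPORTED_INCOMPATIBLE_FLAGS) == 0;
}

/* Writers: refuse to append if either field carries an unknown flag. */
static bool writer_can_append(uint32_t compatible_flags, uint32_t incompatible_flags) {
        return (compatible_flags & ~SUPPORTED_COMPATIBLE_FLAGS) == 0 &&
               (incompatible_flags & ~SUPPORTED_INCOMPATIBLE_FLAGS) == 0;
}
```
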
## Dirty Detection

```c
enum {
        STATE_OFFLINE  = 0,
        STATE_ONLINE   = 1,
        STATE_ARCHIVED = 2,
        _STATE_MAX
};
```

If a file is opened for writing the **state** field should be set to
STATE_ONLINE. If a file is closed after writing the **state** field should be
set to STATE_OFFLINE. After a file has been rotated it should be set to
STATE_ARCHIVED. If a writer is asked to write to a file that is not in
STATE_OFFLINE it should immediately rotate the file and start a new one,
without changing the file.

Both before and after the **state** field is changed, `fdatasync()` should be
executed on the file to ensure the dirty state hits disk.


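For illustration, a writer-side state transition could look like the following sketch. The offset 16 of the **state** byte follows from the packed `Header` declaration above; the function name is illustrative:

```c
#include <stdint.h>
#include <unistd.h>

/* Sketch: mark a journal file as online/offline/archived, with the fdatasync()
 * calls before and after the change as described above. */
static int journal_file_set_state(int fd, uint8_t state) {
        /* Flush everything written so far, so the state change is ordered after it. */
        if (fdatasync(fd) < 0)
                return -1;

        /* The state byte lives at offset 16, right after the signature and the
         * two 32-bit flags fields. */
        if (pwrite(fd, &state, sizeof(state), 16) != sizeof(state))
                return -1;

        /* Make sure the new state itself hits the disk. */
        if (fdatasync(fd) < 0)
                return -1;

        return 0;
}
```
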
## Sequence Numbers

All entries carry sequence numbers that are monotonically counted up for each
entry (starting at 1) and are unique among all files which carry the same
**seqnum_id** field. This field is randomly generated when the journal daemon
creates its first file. All files generated by the same journal daemon instance
should hence carry the same **seqnum_id**. This should guarantee a monotonic
stream of sequence numbers for easy interleaving even if entries are distributed
among several files, such as the system journal and many per-user journals.


## Concurrency

The file format is designed to be usable in a simultaneous
single-writer/multiple-reader scenario. The synchronization model is very weak
in order to facilitate storage on the most basic of file systems (well, the
most basic ones that provide us with `mmap()` that is), and allow good
performance. No file locking is used. The only time disk synchronization
via `fdatasync()` should be enforced is before and after changing the **state**
field in the file header (see above). It is recommended to execute a memory
barrier after appending and initializing new objects at the end of the file,
and before linking them up in the earlier objects.

This weak synchronization model means that it is crucial that readers verify
the structural integrity of the file as they read it and handle invalid
structure gracefully. (Checking what you read is a pretty good idea out of
security considerations anyway.) This specifically includes checking offset
values, and that they point to valid objects, with valid sizes and of the type
and hash value expected. All code must be written with the fact in mind that a
file with inconsistent structure might just be inconsistent temporarily, and
might become consistent later on. Payload, on the other hand, requires less
scrutiny, as it should only be linked up (and hence visible to readers) after
it was successfully written to memory (though not necessarily to disk). On
non-local file systems it is a good idea to verify the payload hashes when
reading, in order to avoid annoyances with `mmap()` inconsistencies.

Clients intending to show a live view of the journal should use the `inotify`
API to watch for file changes. Since file writes done via `mmap()` do not
result in `inotify` events, writers shall truncate the file to its current size
after writing one or more entries, which results in inotify events being
generated. Note that this is not used as a transaction scheme (it doesn't
protect anything), but merely for triggering wakeups.

Note that inotify will not work on network file systems if reader and writer
reside on different hosts. Readers which detect they are run on journal files
on a non-local file system should hence not rely on inotify for live views but
fall back to simple time-based polling of the files (maybe recheck every 2s).


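As a sketch of this wakeup scheme, a live-view client could watch the directory containing the journal files with inotify and rescan its mapped files whenever an event arrives (the directory is passed in as an argument here, since the actual location depends on the setup):

```c
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Sketch of a live-view wakeup loop. The truncate-to-current-size trick used
 * by writers makes these events fire even though the payload is written via mmap(). */
int main(int argc, char **argv) {
        char buf[4096];
        int fd;

        fd = inotify_init1(0);
        if (fd < 0 || argc < 2)
                return 1;

        /* Watching the directory delivers events for all journal files inside it. */
        if (inotify_add_watch(fd, argv[1], IN_MODIFY | IN_CREATE | IN_DELETE | IN_MOVED_TO) < 0)
                return 1;

        for (;;) {
                ssize_t n = read(fd, buf, sizeof(buf));   /* blocks until something changes */
                if (n <= 0)
                        break;

                /* ... re-check the header counters and pick up newly linked entries ... */
                printf("journal directory changed, rescanning\n");
        }

        close(fd);
        return 0;
}
```
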
## Objects

All objects carry a common header:

```c
enum {
        OBJECT_COMPRESSED_XZ   = 1 << 0,
        OBJECT_COMPRESSED_LZ4  = 1 << 1,
        OBJECT_COMPRESSED_ZSTD = 1 << 2,
};

_packed_ struct ObjectHeader {
        uint8_t type;
        uint8_t flags;
        uint8_t reserved[6];
        le64_t size;
        uint8_t payload[];
};
```

The **type** field is one of the object types listed above. The **flags** field
currently knows three flags: OBJECT_COMPRESSED_XZ, OBJECT_COMPRESSED_LZ4 and
OBJECT_COMPRESSED_ZSTD. They are only valid for DATA objects and indicate that
the data payload is compressed with XZ/LZ4/ZSTD. If one of the
OBJECT_COMPRESSED_* flags is set for an object then the matching
HEADER_INCOMPATIBLE_COMPRESSED_XZ/HEADER_INCOMPATIBLE_COMPRESSED_LZ4/HEADER_INCOMPATIBLE_COMPRESSED_ZSTD
flag must be set for the file as well. At most one of these three bits may be
set. The **size** field encodes the size of the object including all its
headers and payload.


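Before dereferencing an object at a given offset, a reader should therefore apply checks along the following lines; this sketch works directly on the memory-mapped file, and the constants (the 16-byte header, the position of the **size** field) follow from the `ObjectHeader` declaration above:

```c
#include <endian.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define OBJECT_HEADER_SIZE 16u   /* type + flags + reserved[6] + le64_t size */

/* Sketch: is it safe to read the object starting at `offset`? */
static bool object_looks_valid(const uint8_t *map, uint64_t file_size, uint64_t offset) {
        uint64_t size;

        if (offset == 0 || (offset & 7) != 0)            /* non-zero and 64-bit aligned */
                return false;
        if (file_size < OBJECT_HEADER_SIZE || offset > file_size - OBJECT_HEADER_SIZE)
                return false;                            /* the header itself must fit */

        memcpy(&size, map + offset + 8, sizeof(size));   /* the le64_t size field */
        size = le64toh(size);

        if (size < OBJECT_HEADER_SIZE)                   /* must at least cover its header */
                return false;
        if (size > file_size - offset)                   /* must not run past the file end */
                return false;

        return true;
}
```
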
## Data Objects

```c
_packed_ struct DataObject {
        ObjectHeader object;
        le64_t hash;
        le64_t next_hash_offset;
        le64_t next_field_offset;
        le64_t entry_offset; /* the first array entry we store inline */
        le64_t entry_array_offset;
        le64_t n_entries;
        union {
                struct {
                        uint8_t payload[];
                } regular;
                struct {
                        le32_t tail_entry_array_offset;
                        le32_t tail_entry_array_n_entries;
                        uint8_t payload[];
                } compact;
        };
};
```

Data objects carry actual field data in the **payload[]** array, including a
field name, a `=` and the field data. Example:
`_SYSTEMD_UNIT=foobar.service`. The **hash** field is a hash value of the
payload. If the `HEADER_INCOMPATIBLE_KEYED_HASH` flag is set in the file header
this is the siphash24 hash value of the payload, keyed by the file ID as stored
in the **file_id** field of the file header. If the flag is not set it is the
non-keyed Jenkins hash of the payload instead. The keyed hash is preferred as
it makes the format more robust against attackers that want to trigger hash
collisions in the hash table.

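A sketch of this hash selection; `siphash24()` is assumed to follow the reference interface (16-byte key, 64-bit result) and `jenkins_hash64()` is the lookup3-based helper sketched in the Basics section:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Both hash functions are external to this sketch. */
uint64_t siphash24(const void *in, size_t inlen, const uint8_t key[16]);
uint64_t jenkins_hash64(const void *data, size_t length);

/* Compute the hash stored in a DATA (or FIELD) object for the given payload. */
static uint64_t journal_payload_hash(const void *payload, size_t size,
                                     const uint8_t file_id[16], bool keyed_hash) {
        /* keyed_hash: whether HEADER_INCOMPATIBLE_KEYED_HASH is set in the header */
        if (keyed_hash)
                return siphash24(payload, size, file_id);   /* keyed by the file ID */

        return jenkins_hash64(payload, size);               /* legacy unkeyed hash */
}
```
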
**next_hash_offset** is used to link up DATA objects in the DATA_HASH_TABLE if
a hash collision happens (in a singly linked list, with an offset of 0
indicating the end). **next_field_offset** is used to link up data objects with
the same field name from the FIELD object of the field used.

**entry_offset** is an offset to the first ENTRY object referring to this DATA
object. **entry_array_offset** is an offset to an ENTRY_ARRAY object with
offsets to other entries referencing this DATA object. Storing the offset to
the first ENTRY object in-line is an optimization given that many DATA objects
will be referenced from a single entry only (for example, `MESSAGE=` frequently
includes a practically unique string). **n_entries** is a counter of the total
number of ENTRY objects that reference this object, i.e. the sum of all
ENTRY_ARRAYs chained up from this object, plus 1.

The **payload[]** field contains the field name and data unencoded, unless
OBJECT_COMPRESSED_XZ/OBJECT_COMPRESSED_LZ4/OBJECT_COMPRESSED_ZSTD is set in the
`ObjectHeader`, in which case the payload is compressed with the indicated
compression algorithm.

If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, two extra fields are stored to
allow immediate access to the tail entry array in the DATA object's entry array
chain.

## Field Objects

```c
_packed_ struct FieldObject {
        ObjectHeader object;
        le64_t hash;
        le64_t next_hash_offset;
        le64_t head_data_offset;
        uint8_t payload[];
};
```

Field objects are used to enumerate all possible values a certain field name
can take in the entire journal file.

The **payload[]** array contains the actual field name, without '=' or any
field value. Example: `_SYSTEMD_UNIT`. The **hash** field is a hash value of
the payload. As for the DATA objects, this too is either the `.file_id` keyed
siphash24 hash of the payload, or the non-keyed Jenkins hash.

**next_hash_offset** is used to link up FIELD objects in the FIELD_HASH_TABLE
if a hash collision happens (in a singly linked list, with offset 0 indicating
the end). **head_data_offset** points to the first DATA object that shares this
field name. It is the head of a singly linked list using DATA's
**next_field_offset** offset.


## Entry Objects

```c
_packed_ struct EntryObject {
        ObjectHeader object;
        le64_t seqnum;
        le64_t realtime;
        le64_t monotonic;
        sd_id128_t boot_id;
        le64_t xor_hash;
        union {
                struct {
                        le64_t object_offset;
                        le64_t hash;
                } regular[];
                struct {
                        le32_t object_offset;
                } compact[];
        } items;
};
```

An ENTRY object binds several DATA objects together into one log entry, and
includes other metadata such as various timestamps.

The **seqnum** field contains the sequence number of the entry, **realtime**
the realtime timestamp, and **monotonic** the monotonic timestamp for the boot
identified by **boot_id**.

The **xor_hash** field contains a binary XOR of the hashes of the payload of
all DATA objects referenced by this ENTRY. This value is usable to check the
contents of the entry, being independent of the order of the DATA objects in
the array. Note that even for files that have the
`HEADER_INCOMPATIBLE_KEYED_HASH` flag set (and thus otherwise use the siphash24
hash function) the hash function used for this field, as a singular exception,
is the Jenkins lookup3 hash function. The XOR hash value is used to
quickly compare the contents of two entries, and to define a well-defined order
between two entries that otherwise have the same sequence numbers and
timestamps.

The **items[]** array contains references to all DATA objects of this entry,
plus their respective hashes (which are calculated the same way as in the DATA
objects, i.e. keyed by the file ID if `HEADER_INCOMPATIBLE_KEYED_HASH` is set).

If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, DATA object offsets are stored
as 32-bit integers instead of 64-bit, and the unused hash field per data object
is not stored anymore.

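As an illustration, here is how a reader might pull a single item out of an ENTRY object in either layout. The offset 64 of the items array follows from the `EntryObject` declaration above (16-byte object header, the seqnum/realtime/monotonic triple, a 128-bit boot ID, and the `le64_t` XOR hash); the function name is illustrative:

```c
#include <endian.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ENTRY_ITEMS_OFFSET 64u   /* object header + seqnum/realtime/monotonic + boot_id + xor_hash */

/* Sketch: return the DATA object offset stored in item `i` of the ENTRY object
 * that starts at `entry_offset`, for both on-disk layouts. */
static uint64_t entry_item_object_offset(const uint8_t *map, uint64_t entry_offset,
                                         uint64_t i, bool compact) {
        const uint8_t *items = map + entry_offset + ENTRY_ITEMS_OFFSET;

        if (compact) {              /* HEADER_INCOMPATIBLE_COMPACT: le32_t offset, no hash */
                uint32_t o;
                memcpy(&o, items + i * 4, sizeof(o));
                return le32toh(o);
        } else {                    /* regular: le64_t offset followed by le64_t hash */
                uint64_t o;
                memcpy(&o, items + i * 16, sizeof(o));
                return le64toh(o);
        }
}
```

The item count follows from the object's **size** field, with 16 bytes per item in the regular layout and 4 bytes per item in the compact layout.
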
bbcd38e4
LP
551In the file ENTRY objects are written ordered monotonically by sequence
552number. For continuous parts of the file written during the same boot
553(i.e. with the same boot_id) the monotonic timestamp is monotonic too. Modulo
554wallclock time jumps (due to incorrect clocks being corrected) the realtime
555timestamps are monotonic too.
556
557
## Hash Table Objects

```c
_packed_ struct HashItem {
        le64_t head_hash_offset;
        le64_t tail_hash_offset;
};

_packed_ struct HashTableObject {
        ObjectHeader object;
        HashItem items[];
};
```

The structure of the DATA_HASH_TABLE and FIELD_HASH_TABLE objects is
identical. They implement a simple hash table, with each cell containing
offsets to the head and tail of the singly linked list of the DATA and FIELD
objects, respectively. DATA's and FIELD's **next_hash_offset** fields are used
to chain up the objects. Empty cells have both offsets set to 0.

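A sketch of a hash table lookup over the mapped file. One detail here is an assumption rather than something this document spells out: the bucket is selected as the hash modulo the number of `HashItem` cells (consult the reference implementation for the authoritative choice). Each `HashItem` is 16 bytes:

```c
#include <endian.h>
#include <stdint.h>
#include <string.h>

#define HASH_ITEM_SIZE 16u   /* sizeof(HashItem): two le64_t offsets */

/* Sketch: return the offset of the first DATA (or FIELD) object in the bucket
 * that a given hash falls into, or 0 if the bucket is empty. The caller then
 * follows next_hash_offset, comparing hash and payload at each step. */
static uint64_t hash_table_bucket_head(const uint8_t *map,
                                       uint64_t hash_table_offset,  /* payload offset from the header */
                                       uint64_t hash_table_size,
                                       uint64_t hash) {
        uint64_t n_items, bucket, head;

        n_items = hash_table_size / HASH_ITEM_SIZE;
        if (n_items == 0)
                return 0;

        bucket = hash % n_items;                         /* assumed bucket selection */
        memcpy(&head, map + hash_table_offset + bucket * HASH_ITEM_SIZE, sizeof(head));

        return le64toh(head);                            /* head_hash_offset of the bucket */
}
```
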
Each file contains exactly one DATA_HASH_TABLE and one FIELD_HASH_TABLE
object. Their payload is directly referred to by the file header in the
**data_hash_table_offset**, **data_hash_table_size**,
**field_hash_table_offset**, **field_hash_table_size** fields. These offsets do
_not_ point to the object headers but directly to the payloads. When a new
journal file is created the two hash table objects need to be created right
away as the first two objects in the stream.

If the hash table fill level grows beyond a certain limit (taking a cue from
Java's Hashtable, for example: > 75%), the writer should rotate the file and
create a new one.

The DATA_HASH_TABLE should be sized taking into account the maximum size the
file is expected to grow to, as configured by the administrator or by disk
space considerations. The FIELD_HASH_TABLE should be sized to a fixed size; the
number of fields should be pretty static as it depends only on developers'
creativity rather than runtime parameters.


## Entry Array Objects

```c
_packed_ struct EntryArrayObject {
        ObjectHeader object;
        le64_t next_entry_array_offset;
        union {
                le64_t regular[];
                le32_t compact[];
        } items;
};
```

Entry Arrays are used to store a sorted array of offsets to entries. Entry
arrays are strictly sorted by offsets on disk, and hence by their timestamps
and sequence numbers (with some restrictions, see above).

If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, offsets are stored as 32-bit
integers instead of 64-bit.

Entry Arrays are chained up. If one entry array is full another one is
allocated and the **next_entry_array_offset** field of the old one is set to
point to it. An Entry Array with **next_entry_array_offset** set to 0 is the
last in the list. To optimize allocation and seeking, as entry arrays are
appended to a chain of entry arrays they should increase in size (double).

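Here is a sketch of walking such a chain to locate the n-th entry it references. The layout constants (16-byte object header, `le64_t` next pointer at offset 16, items starting at offset 24) follow from the declaration above; the regular 64-bit item layout is assumed, and the caller is expected to bound `n` by the relevant **n_entries** counter:

```c
#include <endian.h>
#include <stdint.h>
#include <string.h>

/* Sketch: offset of the n-th (0-based) ENTRY referenced by an entry array
 * chain starting at `entry_array_offset`, or 0 if the index is out of range. */
static uint64_t nth_entry_offset(const uint8_t *map, uint64_t entry_array_offset, uint64_t n) {
        while (entry_array_offset != 0) {
                uint64_t size, next, n_items, o;

                memcpy(&size, map + entry_array_offset + 8, sizeof(size));   /* ObjectHeader.size */
                memcpy(&next, map + entry_array_offset + 16, sizeof(next));  /* next_entry_array_offset */
                size = le64toh(size);
                next = le64toh(next);

                n_items = (size - 24) / 8;   /* number of le64_t item slots in this array */
                if (n < n_items) {
                        memcpy(&o, map + entry_array_offset + 24 + n * 8, sizeof(o));
                        return le64toh(o);   /* may be 0 for a not-yet-used slot at the tail */
                }

                n -= n_items;                /* not in this array, continue with the next one */
                entry_array_offset = next;
        }

        return 0;
}
```
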
Due to being monotonically ordered, entry arrays may be searched with a binary
search (bisection).

One chain of entry arrays links up all entries written to the journal. The
first entry array is referenced in the **entry_array_offset** field of the
header.

Each DATA object also references an entry array chain listing all entries
referencing a specific DATA object. Since many DATA objects are only referenced
by a single ENTRY the first offset of the list is stored inside the DATA object
itself; an ENTRY_ARRAY object is only needed if it is referenced by more than
one ENTRY.


## Tag Object

```c
#define TAG_LENGTH (256/8)

_packed_ struct TagObject {
        ObjectHeader object;
        le64_t seqnum;
        le64_t epoch;
        uint8_t tag[TAG_LENGTH]; /* SHA-256 HMAC */
};
```

Tag objects are used to seal off the journal against alteration. At regular
intervals a tag object is appended to the file. The tag object consists of a
SHA-256 HMAC tag that is calculated from the objects stored in the file since
the last tag was written, or from the beginning if no tag was written yet. The
key for the HMAC is calculated via the externally maintained FSPRG logic for
the epoch that is written into **epoch**. The sequence number **seqnum** is
increased with each tag. When calculating the HMAC of objects, header fields
that are volatile are excluded (skipped). More specifically, all fields that
might validly be altered to maintain a consistent file structure (such as
offsets to objects added later for the purpose of linked lists and suchlike)
after an object has been written are not protected by the tag. This means a
verifier has to independently check these fields for consistency of
structure. For the fields excluded from the HMAC please consult the source code
directly. A verifier should read the file from the beginning to the end, always
calculating the HMAC for the objects it reads. Each time a tag object is
encountered the HMAC should be verified and restarted. The tag object sequence
numbers need to increase strictly monotonically. Tag objects themselves are
partially protected by the HMAC (i.e. **seqnum** and **epoch** are included,
the tag itself is not).


## Algorithms

### Reading

Given an offset to an entry all data fields are easily found by following the
offsets in the data item array of the entry.

Listing entries without a filter is done by traversing the list of entry arrays
starting with the header's **entry_array_offset** field.

Seeking to an entry by timestamp or sequence number (without any matches) is
done via binary search in the entry arrays starting with the header's
**entry_array_offset** field. Since these arrays double in size as more are
added, the time cost of seeking is O(log(n)*log(n)) where n is the number of
entries in the file.

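As an illustration of such a seek, here is a sketch of a bisection by sequence number built on the `nth_entry_offset()` helper from the Entry Array Objects section above. It relies on the entry's **seqnum** being the first `le64_t` after the 16-byte object header and takes the header's **n_entries** counter as the search bound:

```c
#include <endian.h>
#include <stdint.h>
#include <string.h>

/* Sketch: index of the first entry whose seqnum is >= the requested one,
 * or n_entries if there is no such entry. */
static uint64_t seek_by_seqnum(const uint8_t *map, uint64_t entry_array_offset,
                               uint64_t n_entries, uint64_t seqnum) {
        uint64_t lo = 0, hi = n_entries;

        while (lo < hi) {
                uint64_t mid = lo + (hi - lo) / 2;
                uint64_t p, s;

                p = nth_entry_offset(map, entry_array_offset, mid);  /* see the Entry Array sketch */
                memcpy(&s, map + p + 16, sizeof(s));                 /* EntryObject.seqnum */

                if (le64toh(s) < seqnum)
                        lo = mid + 1;
                else
                        hi = mid;
        }

        return lo;
}
```

Each probe walks the entry array chain from its head, so the overall cost matches the O(log(n)*log(n)) bound quoted above; a real implementation would additionally cache the array it last landed in to avoid rescanning the chain.
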
When seeking or listing with one field match applied, the DATA object of the
match is first identified, and then its data entry array chain is traversed.
The time cost is the same as for seeks/listings with no match.

If multiple matches are applied, multiple chains of entry arrays should be
traversed in parallel. Since they all are strictly monotonically ordered by
offset of the entries, advancing in one can be directly applied to the others,
until an entry matching all matches is found. In the worst case seeking like
this is O(n) where n is the number of matching entries of the "loosest" match,
but in the common case it should be much more efficient, at least for the
well-known fields, where the set of possible field values tends to be closely
related. Checking whether an entry matches a number of matches is efficient
since the item array of the entry contains hashes of all data fields
referenced, and the number of data fields of an entry is generally small (<
30).

When interleaving multiple journal files seeking tends to be a frequently used
operation, but in this case it can be effectively suppressed by caching results
from previous entries.

When listing all possible values a certain field can take it is sufficient to
look up the FIELD object and follow the chain of links to all DATA objects it
includes.

### Writing

When an entry is appended to the journal, for each of its data fields the data
hash table should be checked. If the data field does not yet exist in the file,
it should be appended and added to the data hash table. When a data field's
data object is added, the field hash table should be checked for the field name
of the data field, and a field object added if necessary. After all data fields
(and recursively all field names) of the new entry are appended and linked up
in the hash tables, the entry object should be appended and linked up too.

At regular intervals a tag object should be written if sealing is enabled (see
above). Before the file is closed a tag should be written too, to seal it off.

Before writing an object, time and disk space limits should be checked and
rotation triggered if necessary.


## Optimizing Disk IO

_A few general ideas to keep in mind:_

The hash tables for looking up fields and data should quickly end up in the
memory cache and not hurt performance. All entries and entry arrays are ordered
strictly by time on disk, and hence should expose an OK access pattern on
rotating media, when read sequentially (which should be the most common case,
given the nature of log data).

The disk access patterns of the binary search for entries needed for seeking
are problematic on rotating disks. This should not be a major issue though,
since seeking should not be a frequent operation.

When reading, collecting data fields for presenting entries to the user is
problematic on rotating disks. In order to optimize these patterns the item
array of entry objects should be sorted by disk offset before
writing. Effectively, frequently used data objects should be in the memory
cache quickly. Non-frequently used data objects are likely to be located
between the previous and current entry when reading and hence should expose an
OK access pattern. Problematic are data objects that are neither frequently nor
infrequently referenced, which will cost seek time.

And that's all there is to it.

Thanks for your interest!