---
title: Journal File Format
category: Interfaces
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---

# Journal File Format

_Note that this document describes the binary on-disk format of journals only.
For interfacing with web technologies there's the [Journal JSON Format](https://systemd.io/JOURNAL_EXPORT_FORMATS#journal-json-format).
For transfer of journal data across the network there's the [Journal Export Format](https://systemd.io/JOURNAL_EXPORT_FORMATS#journal-export-format)._

The systemd journal stores log data in a binary format with several features:

* Fully indexed by all fields
* Can store binary data, up to 2^64-1 bytes in size
* Seekable
* Primarily append-based, hence robust to corruption
* Support for in-line compression
* Support for in-line Forward Secure Sealing

This document explains the basic structure of the file format on disk. We are
making this available primarily to allow review and provide documentation. Note
that the actual implementation in the [systemd
codebase](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/) is the
only ultimately authoritative description of the format, so if this document
and the code disagree, the code is right. That said, we'll of course try hard to
keep this document up-to-date and accurate.

Instead of implementing your own reader or writer for journal files we ask you
to use the [Journal's native C
API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) to access
these files. It provides you with full access to the files, and will not
withhold any data. If you find a limitation, please ping us and we might add
some additional interfaces for you.

If you need access to the raw journal data in serialized stream form without a C
API, our recommendation is to make use of the [Journal Export
Format](https://systemd.io/JOURNAL_EXPORT_FORMATS#journal-export-format), which you can
get via `journalctl -o export` or via `systemd-journal-gatewayd`. The export
format is much simpler to parse, but complete and accurate. Due to its
stream-based nature it is not indexed.

_Or, to put this in other words: this low-level document is probably not what
you want to use as the basis of your project. You want our [C
API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) instead!
And if you really don't want the C API, then you want the [Journal Export
Format or Journal JSON Format](https://systemd.io/JOURNAL_EXPORT_FORMATS) instead!
This document is primarily for your entertainment and education. Thank you!_

This document assumes you have a basic understanding of the journal concepts,
the properties of a journal entry and so on. If not, please go and read up,
then come back! This is a good opportunity to read about the [basic properties
of journal
entries](https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html),
in particular to realize that they may include binary, non-text data (though
they usually don't), and that the same field might have multiple values
assigned within the same entry.

This document describes the current format of systemd 246. The documented
format is compatible with the format used in the first versions of the journal,
but has received various compatible and incompatible additions since.

If you are wondering why the journal file format has been created in the first
place instead of adopting an existing database implementation, please have a
look [at this
thread](https://lists.freedesktop.org/archives/systemd-devel/2012-October/007054.html).


## Basics

* All offsets, sizes, time values, hashes (and most other numeric values) are 64bit unsigned integers in LE format.
* Offsets are always relative to the beginning of the file.
* The 64bit hash function siphash24 is used for newer journal files. For older files [Jenkins lookup3](https://en.wikipedia.org/wiki/Jenkins_hash_function) is used, more specifically `jenkins_hashlittle2()` with the first 32bit integer it returns used as the higher 32bit part of the 64bit value, and the second one as the lower 32bit part (see the sketch after this list).
* All structures are aligned to 64bit boundaries and padded to multiples of 64bit.
* The format is designed to be read and written via memory mapping using multiple mapped windows.
* All time values are stored in usec since the respective epoch.
* Wall clock time values are relative to the Unix time epoch, i.e. January 1st, 1970. (`CLOCK_REALTIME`)
* Monotonic time values are always stored jointly with the kernel boot ID value (i.e. `/proc/sys/kernel/random/boot_id`) they belong to. They tend to be relative to the start of the boot, but aren't for containers. (`CLOCK_MONOTONIC`)
* Randomized, unique 128bit IDs are used in various locations. These are generally UUID v4 compatible, but this is not a requirement.
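
As an illustration of the hash rule above, here is a minimal sketch of how the 64bit Jenkins value is assembled from the two 32bit results of `jenkins_hashlittle2()`; the prototype is assumed to match the lookup3 implementation shipped in the systemd sources:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed prototype of the lookup3 helper as shipped in the systemd sources;
 * it is not part of any public API. */
void jenkins_hashlittle2(const void *key, size_t length, uint32_t *pc, uint32_t *pb);

/* Combine the two 32bit results into the 64bit hash used by older journal
 * files: first result = upper 32bit, second result = lower 32bit. */
static uint64_t jenkins_hash64(const void *data, size_t size) {
        uint32_t a = 0, b = 0; /* the initial values act as the seed */

        jenkins_hashlittle2(data, size, &a, &b);
        return ((uint64_t) a << 32) | (uint64_t) b;
}
```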

## General Rules

If any kind of corruption is noticed by a writer it should immediately rotate
the file and start a new one. No further writes should be attempted to the
original file, but it should be left around so that as little data as possible
is lost.

If any kind of corruption is noticed by a reader it should try hard to handle
this gracefully, such as skipping over the corrupted data, but allowing access
to as much data around it as possible.

A reader should verify all offsets and other data as it reads them. This
includes checking for alignment and range of offsets in the file, especially
before trying to read them via a memory map.
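
For illustration, such a check might look like the following minimal sketch (a hypothetical helper, not taken from the systemd sources):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: sanity-check an object offset before dereferencing it through a
 * memory map. It must be 64bit aligned, must lie behind the file header, and
 * the smallest possible object header (16 bytes) must still fit in the file. */
static bool offset_looks_valid(uint64_t offset, uint64_t header_size, uint64_t file_size) {
        if ((offset & 7) != 0)
                return false;   /* not aligned to a 64bit boundary */
        if (offset < header_size)
                return false;   /* points into the file header */
        if (offset > file_size || file_size - offset < 16)
                return false;   /* an object header cannot fit here */
        return true;
}
```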

A reader must interleave rotated and corrupted files as well as possible and
present them as a single stream to the user.

All fields marked as "reserved" must be initialized with 0 when writing and be
ignored on reading. They are currently not used but might be used later on.


## Structure

The file format's data structures are declared in
[journal-def.h](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/journal-def.h).

The file format begins with a header structure. After the header structure,
object structures follow. Objects are appended to the end as time
progresses. Most data stored in these objects is not altered anymore after
having been written once, with the exception of records necessary for
indexing. When new data is appended to a file the writer first writes all new
objects to the end of the file, and then links them up at the front after that's
done. Currently, seven different object types are known:

```c
enum {
        OBJECT_UNUSED,
        OBJECT_DATA,
        OBJECT_FIELD,
        OBJECT_ENTRY,
        OBJECT_DATA_HASH_TABLE,
        OBJECT_FIELD_HASH_TABLE,
        OBJECT_ENTRY_ARRAY,
        OBJECT_TAG,
        _OBJECT_TYPE_MAX
};
```

* A **DATA** object, which encapsulates the contents of one field of an entry, i.e. a string such as `_SYSTEMD_UNIT=avahi-daemon.service`, or `MESSAGE=Foobar made a booboo.` but possibly including large or binary data, and always prefixed by the field name and "=".
* A **FIELD** object, which encapsulates a field name, i.e. a string such as `_SYSTEMD_UNIT` or `MESSAGE`, without any `=` or even value.
* An **ENTRY** object, which binds several **DATA** objects together into a log entry.
* A **DATA_HASH_TABLE** object, which encapsulates a hash table for finding existing **DATA** objects.
* A **FIELD_HASH_TABLE** object, which encapsulates a hash table for finding existing **FIELD** objects.
* An **ENTRY_ARRAY** object, which encapsulates a sorted array of offsets to entries, used for seeking by binary search.
* A **TAG** object, consisting of an FSS sealing tag for all data from the beginning of the file or the last tag written (whichever is later).

## Header

The Header struct defines, well, you guessed it, the file header:

```c
_packed_ struct Header {
        uint8_t signature[8]; /* "LPKSHHRH" */
        le32_t compatible_flags;
        le32_t incompatible_flags;
        uint8_t state;
        uint8_t reserved[7];
        sd_id128_t file_id;
        sd_id128_t machine_id;
        sd_id128_t boot_id;    /* last writer */
        sd_id128_t seqnum_id;
        le64_t header_size;
        le64_t arena_size;
        le64_t data_hash_table_offset;
        le64_t data_hash_table_size;
        le64_t field_hash_table_offset;
        le64_t field_hash_table_size;
        le64_t tail_object_offset;
        le64_t n_objects;
        le64_t n_entries;
        le64_t tail_entry_seqnum;
        le64_t head_entry_seqnum;
        le64_t entry_array_offset;
        le64_t head_entry_realtime;
        le64_t tail_entry_realtime;
        le64_t tail_entry_monotonic;
        /* Added in 187 */
        le64_t n_data;
        le64_t n_fields;
        /* Added in 189 */
        le64_t n_tags;
        le64_t n_entry_arrays;
        /* Added in 246 */
        le64_t data_hash_chain_depth;
        le64_t field_hash_chain_depth;
};
```

The first 8 bytes of Journal files must contain the ASCII characters `LPKSHHRH`.

If a writer finds that the **machine_id** of a file to write to does not match
the machine it is running on it should immediately rotate the file and start a
new one.

When a journal file is first created the **file_id** is randomly and uniquely
initialized.

When a writer opens a file it shall initialize the **boot_id** to the current
boot ID of the system.

The currently used part of the file is the **header_size** plus the
**arena_size** field of the header. If a writer needs to write to a file where
the actual file size on disk is smaller than the reported value it shall
immediately rotate the file and start a new one. If a writer is asked to write
to a file with a header that is shorter than its own definition of the struct
Header, it shall immediately rotate the file and start a new one.
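
The following minimal sketch illustrates these open-time checks from a writer's perspective (a hypothetical helper, not the systemd implementation; `my_machine_id` stands for the machine ID of the running system):

```c
#include <endian.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <systemd/sd-id128.h>

/* Sketch: returns true if a writer may keep appending to this file,
 * false if it should rotate and start a new one instead. */
static bool writer_may_use_file(const struct Header *h, uint64_t file_size,
                                sd_id128_t my_machine_id) {
        if (memcmp(h->signature, "LPKSHHRH", 8) != 0)
                return false;   /* not a journal file */
        if (!sd_id128_equal(h->machine_id, my_machine_id))
                return false;   /* written on a different machine */
        if (file_size < le64toh(h->header_size) + le64toh(h->arena_size))
                return false;   /* file on disk is smaller than the header claims */
        if (le64toh(h->header_size) < sizeof(struct Header))
                return false;   /* header predates our own struct Header definition */
        return true;
}
```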

The **n_objects** field contains a counter for objects currently available in
this file. As objects are appended to the end of the file this counter is
increased.

The first object in the file starts immediately after the header. The last
object in the file is at the offset **tail_object_offset**, which may be 0 if
no object is in the file yet.

The **n_entries**, **n_data**, **n_fields**, **n_tags**, **n_entry_arrays** are
counters of the objects of the specific types.

**tail_entry_seqnum** and **head_entry_seqnum** contain the sequential number
(see below) of the last or first entry in the file, respectively, or 0 if no
entry has been written yet.

**tail_entry_realtime** and **head_entry_realtime** contain the wallclock
timestamp of the last or first entry in the file, respectively, or 0 if no
entry has been written yet.

**tail_entry_monotonic** is the monotonic timestamp of the last entry in the
file, referring to monotonic time of the boot identified by **boot_id**.

**data_hash_chain_depth** is a counter of the deepest chain in the data hash
table, minus one. This is updated whenever a chain is found that is longer than
the previous deepest chain found. Note that the counter is updated during hash
table lookups, as the chains are traversed. This counter is used to determine
when it is a good time to rotate the journal file, because hash collisions have
become too frequent.

Similarly, **field_hash_chain_depth** is a counter of the deepest chain in the
field hash table, minus one.


## Extensibility

The format is supposed to be extensible in order to enable future additions of
features. Readers should simply skip objects of unknown types as they read
them. If a compatible feature extension is made, a new bit is registered in the
header's **compatible_flags** field. If a feature extension is used that makes
the format incompatible, a new bit is registered in the header's
**incompatible_flags** field. Readers should check these two bit fields: if
they find a flag they don't understand in **compatible_flags** they should
continue to read the file, but if they find one in **incompatible_flags** they
should fail, asking for an update of the software. Writers should refuse
writing if there's an unknown bit flag in either of these fields.

The file header may be extended as new features are added. The size of the file
header is stored in the header. All header fields up to **n_data** are known to
unconditionally exist in all revisions of the file format; all fields starting
with **n_data** need to be explicitly checked for via a size check, since they
were additions after the initial release.
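
For illustration, such a size-gated access might look like this minimal sketch (helper names made up for this example; `struct Header` as declared above):

```c
#include <endian.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: a header field added after the initial release may only be
 * accessed if header_size covers it completely. */
static bool header_has_field(const struct Header *h, size_t field_end) {
        return le64toh(h->header_size) >= field_end;
}

/* Example: n_data was added in systemd 187, so check before reading it. */
static uint64_t header_n_data_or_zero(const struct Header *h) {
        if (!header_has_field(h, offsetof(struct Header, n_data) + sizeof(h->n_data)))
                return 0;
        return le64toh(h->n_data);
}
```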

Currently only five extensions flagged in the flags fields are known:

```c
enum {
        HEADER_INCOMPATIBLE_COMPRESSED_XZ = 1 << 0,
        HEADER_INCOMPATIBLE_COMPRESSED_LZ4 = 1 << 1,
        HEADER_INCOMPATIBLE_KEYED_HASH = 1 << 2,
        HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3,
};

enum {
        HEADER_COMPATIBLE_SEALED = 1 << 0,
};
```

HEADER_INCOMPATIBLE_COMPRESSED_XZ indicates that the file includes DATA objects
that are compressed using XZ. Similarly, HEADER_INCOMPATIBLE_COMPRESSED_LZ4
indicates that the file includes DATA objects that are compressed with the LZ4
algorithm. And HEADER_INCOMPATIBLE_COMPRESSED_ZSTD indicates that there are
objects compressed with ZSTD.

HEADER_INCOMPATIBLE_KEYED_HASH indicates that instead of the unkeyed Jenkins
hash function the keyed siphash24 hash function is used for the two hash
tables, see below.

HEADER_COMPATIBLE_SEALED indicates that the file includes TAG objects required
for Forward Secure Sealing.
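
Expressed as a sketch (hypothetical helpers, assuming the two enums and `struct Header` shown above):

```c
#include <endian.h>
#include <stdbool.h>

#define KNOWN_HEADER_COMPATIBLE_FLAGS   (HEADER_COMPATIBLE_SEALED)
#define KNOWN_HEADER_INCOMPATIBLE_FLAGS (HEADER_INCOMPATIBLE_COMPRESSED_XZ |   \
                                         HEADER_INCOMPATIBLE_COMPRESSED_LZ4 |  \
                                         HEADER_INCOMPATIBLE_KEYED_HASH |      \
                                         HEADER_INCOMPATIBLE_COMPRESSED_ZSTD)

/* A reader may proceed as long as every incompatible flag is understood;
 * unknown compatible flags are simply ignored. */
static bool reader_may_read(const struct Header *h) {
        return (le32toh(h->incompatible_flags) & ~KNOWN_HEADER_INCOMPATIBLE_FLAGS) == 0;
}

/* A writer must understand every flag in either field before appending. */
static bool writer_may_append(const struct Header *h) {
        return (le32toh(h->incompatible_flags) & ~KNOWN_HEADER_INCOMPATIBLE_FLAGS) == 0 &&
               (le32toh(h->compatible_flags) & ~KNOWN_HEADER_COMPATIBLE_FLAGS) == 0;
}
```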


## Dirty Detection

```c
enum {
        STATE_OFFLINE = 0,
        STATE_ONLINE = 1,
        STATE_ARCHIVED = 2,
        _STATE_MAX
};
```

If a file is opened for writing the **state** field should be set to
STATE_ONLINE. If a file is closed after writing the **state** field should be
set to STATE_OFFLINE. After a file has been rotated it should be set to
STATE_ARCHIVED. If a writer is asked to write to a file that is not in
STATE_OFFLINE it should immediately rotate the file and start a new one,
without changing the file.

After and before the state field is changed, `fdatasync()` should be executed on
the file to ensure the dirty state hits disk.
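
A minimal sketch of how a writer might persist such a state transition (a hypothetical helper; the header is assumed to be memory-mapped and writable):

```c
#include <stdint.h>
#include <unistd.h>

/* Sketch: change the state byte, e.g. STATE_OFFLINE -> STATE_ONLINE, with
 * fdatasync() issued before and after so the dirty marker reliably hits disk. */
static int journal_set_state(int fd, struct Header *mapped_header, uint8_t state) {
        if (fdatasync(fd) < 0)
                return -1;

        mapped_header->state = state;

        if (fdatasync(fd) < 0)
                return -1;

        return 0;
}
```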


## Sequence Numbers

All entries carry sequence numbers that are monotonically counted up for each
entry (starting at 1) and are unique among all files which carry the same
**seqnum_id** field. This field is randomly generated when the journal daemon
creates its first file. All files generated by the same journal daemon instance
should hence carry the same seqnum_id. This should guarantee a monotonic stream
of sequential numbers for easy interleaving even if entries are distributed
among several files, such as the system journal and many per-user journals.


## Concurrency

The file format is designed to be usable in a simultaneous
single-writer/multiple-reader scenario. The synchronization model is very weak
in order to facilitate storage on the most basic of file systems (well, the
most basic ones that provide us with `mmap()` that is), and allow good
performance. No file locking is used. The only time where disk synchronization
via `fdatasync()` should be enforced is after and before changing the **state**
field in the file header (see above). It is recommended to execute a memory
barrier after appending and initializing new objects at the end of the file,
and before linking them up in the earlier objects.

This weak synchronization model means that it is crucial that readers verify
the structural integrity of the file as they read it and handle invalid
structure gracefully. (Checking what you read is a pretty good idea out of
security considerations anyway.) This specifically includes checking offset
values, and that they point to valid objects, with valid sizes and of the type
and hash value expected. All code must be written with the fact in mind that a
file with inconsistent structure might just be inconsistent temporarily, and
might become consistent later on. Payload OTOH requires less scrutiny, as it
should only be linked up (and hence visible to readers) after it was
successfully written to memory (though not necessarily to disk). On non-local
file systems it is a good idea to verify the payload hashes when reading, in
order to avoid annoyances with `mmap()` inconsistencies.

Clients intending to show a live view of the journal should use `inotify()` to
watch for file changes. Since file writes done via `mmap()` do not result in
`inotify()` events, writers shall truncate the file to its current size after
writing one or more entries, which results in inotify events being
generated. Note that this is not used as a transaction scheme (it doesn't
protect anything), but merely for triggering wakeups.
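
As a sketch, that wakeup trigger boils down to truncating the file to the size it already has (a hypothetical helper):

```c
#include <sys/stat.h>
#include <unistd.h>

/* Sketch: after appending entries through the memory map, "touch" the file so
 * that inotify watchers wake up. Truncating to the current size changes
 * nothing in the file's contents but generates an inotify event. */
static int trigger_inotify_wakeup(int fd) {
        struct stat st;

        if (fstat(fd, &st) < 0)
                return -1;

        return ftruncate(fd, st.st_size);
}
```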

Note that inotify will not work on network file systems if reader and writer
reside on different hosts. Readers which detect they are run on journal files
on a non-local file system should hence not rely on inotify for live views but
fall back to simple time-based polling of the files (maybe recheck every 2s).


## Objects

All objects carry a common header:

```c
enum {
        OBJECT_COMPRESSED_XZ = 1 << 0,
        OBJECT_COMPRESSED_LZ4 = 1 << 1,
        OBJECT_COMPRESSED_ZSTD = 1 << 2,
};

_packed_ struct ObjectHeader {
        uint8_t type;
        uint8_t flags;
        uint8_t reserved[6];
        le64_t size;
        uint8_t payload[];
};
```

The **type** field is one of the object types listed above. The **flags** field
currently knows three flags: OBJECT_COMPRESSED_XZ, OBJECT_COMPRESSED_LZ4 and
OBJECT_COMPRESSED_ZSTD. It is only valid for DATA objects and indicates that
the data payload is compressed with XZ/LZ4/ZSTD. If one of the
OBJECT_COMPRESSED_* flags is set for an object then the matching
HEADER_INCOMPATIBLE_COMPRESSED_XZ/HEADER_INCOMPATIBLE_COMPRESSED_LZ4/HEADER_INCOMPATIBLE_COMPRESSED_ZSTD
flag must be set for the file as well. At most one of these three bits may be
set. The **size** field encodes the size of the object including all its
headers and payload.
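
As an illustration, a (simplified, hypothetical) structural check a reader might apply to a mapped object header:

```c
#include <endian.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch: basic structural checks on an ObjectHeader mapped at the given
 * offset of a file of the given size. Objects of unknown (future) types are
 * not rejected here; readers are expected to simply skip them. */
static bool object_header_looks_valid(const struct ObjectHeader *o,
                                      uint64_t offset, uint64_t file_size) {
        uint64_t size = le64toh(o->size);

        if (o->type == OBJECT_UNUSED)
                return false;   /* not a real object */
        if (size < sizeof(struct ObjectHeader))
                return false;   /* must at least cover the object header itself */
        if (offset > file_size || size > file_size - offset)
                return false;   /* must fit into the file */
        if ((o->flags & (OBJECT_COMPRESSED_XZ|OBJECT_COMPRESSED_LZ4|OBJECT_COMPRESSED_ZSTD)) &&
            o->type != OBJECT_DATA)
                return false;   /* compression flags are only valid for DATA objects */
        return true;
}
```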


## Data Objects

```c
_packed_ struct DataObject {
        ObjectHeader object;
        le64_t hash;
        le64_t next_hash_offset;
        le64_t next_field_offset;
        le64_t entry_offset; /* the first array entry we store inline */
        le64_t entry_array_offset;
        le64_t n_entries;
        uint8_t payload[];
};
```

Data objects carry actual field data in the **payload[]** array, including a
field name, a `=` and the field data. Example:
`_SYSTEMD_UNIT=foobar.service`. The **hash** field is a hash value of the
payload. If the `HEADER_INCOMPATIBLE_KEYED_HASH` flag is set in the file header
this is the siphash24 hash value of the payload, keyed by the file ID as stored
in the **file_id** field of the file header. If the flag is not set it is the
non-keyed Jenkins hash of the payload instead. The keyed hash is preferred as
it makes the format more robust against attackers that want to trigger hash
collisions in the hash table.
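
As a sketch, the hash selection could be expressed like this (the prototypes are assumptions modeled on helpers in the systemd sources; `jenkins_hash64()` is the sketch from the Basics section):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <systemd/sd-id128.h>

/* Assumed prototypes, modeled on helpers in the systemd tree. */
uint64_t siphash24(const void *in, size_t inlen, const uint8_t k[16]);
uint64_t jenkins_hash64(const void *data, size_t size);

/* Sketch: hash a DATA or FIELD payload the way the file demands. If the file
 * has HEADER_INCOMPATIBLE_KEYED_HASH set, the 128bit file_id is used as the
 * siphash24 key; otherwise the unkeyed Jenkins lookup3 hash is used. */
static uint64_t journal_payload_hash(const void *payload, size_t size,
                                     bool keyed_hash, sd_id128_t file_id) {
        if (keyed_hash)
                return siphash24(payload, size, file_id.bytes);

        return jenkins_hash64(payload, size);
}
```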

**next_hash_offset** is used to link up DATA objects in the DATA_HASH_TABLE if
a hash collision happens (in a singly linked list, with an offset of 0
indicating the end). **next_field_offset** is used to link up data objects with
the same field name from the FIELD object of the field used.

**entry_offset** is an offset to the first ENTRY object referring to this DATA
object. **entry_array_offset** is an offset to an ENTRY_ARRAY object with
offsets to other entries referencing this DATA object. Storing the offset to
the first ENTRY object in-line is an optimization given that many DATA objects
will be referenced from a single entry only (for example, `MESSAGE=` frequently
includes a practically unique string). **n_entries** is a counter of the total
number of ENTRY objects that reference this object, i.e. the sum of all
ENTRY_ARRAYS chained up from this object, plus 1.

The **payload[]** field contains the field name and data unencoded, unless
OBJECT_COMPRESSED_XZ/OBJECT_COMPRESSED_LZ4/OBJECT_COMPRESSED_ZSTD is set in the
`ObjectHeader`, in which case the payload is compressed with the indicated
compression algorithm.


## Field Objects

```c
_packed_ struct FieldObject {
        ObjectHeader object;
        le64_t hash;
        le64_t next_hash_offset;
        le64_t head_data_offset;
        uint8_t payload[];
};
```

Field objects are used to enumerate all possible values a certain field name
can take in the entire journal file.

The **payload[]** array contains the actual field name, without '=' or any
field value. Example: `_SYSTEMD_UNIT`. The **hash** field is a hash value of
the payload. As for the DATA objects, this too is either the `.file_id` keyed
siphash24 hash of the payload, or the non-keyed Jenkins hash.

**next_hash_offset** is used to link up FIELD objects in the FIELD_HASH_TABLE
if a hash collision happens (in a singly linked list, an offset of 0 indicating
the end). **head_data_offset** points to the first DATA object that shares this
field name. It is the head of a singly linked list using DATA's
**next_field_offset** offset.
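
To illustrate the linked-list layout, here is a sketch of visiting every DATA object that carries a given field name (`data_object_at()` is a hypothetical helper that maps and validates the object at a file offset):

```c
#include <endian.h>
#include <stdint.h>

/* Assumed helper for this sketch: map and validate the DATA object at the
 * given file offset, returning NULL if the reference looks corrupted. */
const struct DataObject *data_object_at(uint64_t offset);

/* Sketch: walk from a FIELD object to all DATA objects sharing its field
 * name, via head_data_offset and then the next_field_offset chain. */
static void visit_field_values(const struct FieldObject *f) {
        uint64_t offset = le64toh(f->head_data_offset);

        while (offset != 0) {   /* an offset of 0 terminates the list */
                const struct DataObject *d = data_object_at(offset);
                if (!d)
                        break;  /* corrupted link: stop, keep what was read so far */

                /* d->payload contains "FIELD=value" here (possibly compressed). */

                offset = le64toh(d->next_field_offset);
        }
}
```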


## Entry Objects

```c
_packed_ struct EntryItem {
        le64_t object_offset;
        le64_t hash;
};

_packed_ struct EntryObject {
        ObjectHeader object;
        le64_t seqnum;
        le64_t realtime;
        le64_t monotonic;
        sd_id128_t boot_id;
        le64_t xor_hash;
        EntryItem items[];
};
```

An ENTRY object binds several DATA objects together into one log entry, and
includes other metadata such as various timestamps.

The **seqnum** field contains the sequence number of the entry, **realtime**
the realtime timestamp, and **monotonic** the monotonic timestamp for the boot
identified by **boot_id**.

The **xor_hash** field contains a binary XOR of the hashes of the payload of
all DATA objects referenced by this ENTRY. This value is usable to check the
contents of the entry, being independent of the order of the DATA objects in
the array. Note that even for files that have the
`HEADER_INCOMPATIBLE_KEYED_HASH` flag set (and thus use siphash24 as hash
function otherwise) the hash function used for this field is, as a singular
exception, the Jenkins lookup3 hash function. The XOR hash value is used to
quickly compare the contents of two entries, and to define a well-defined order
between two entries that otherwise have the same sequence numbers and
timestamps.
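
As a sketch, recomputing the **xor_hash** over an entry's items might look like this (hypothetical helpers; for simplicity compressed DATA payloads, which would have to be decompressed first, are ignored):

```c
#include <endian.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed helpers, see the earlier sketches. */
const struct DataObject *data_object_at(uint64_t offset);
uint64_t jenkins_hash64(const void *data, size_t size);

/* Sketch: XOR the Jenkins lookup3 hashes of all referenced DATA payloads.
 * Jenkins is used here even when the file otherwise uses keyed siphash24. */
static uint64_t entry_xor_hash(const struct EntryObject *e, size_t n_items) {
        uint64_t xor_hash = 0;

        for (size_t i = 0; i < n_items; i++) {
                const struct DataObject *d = data_object_at(le64toh(e->items[i].object_offset));
                if (!d)
                        return 0;       /* corrupted reference; a real reader would flag this */

                size_t payload_size = le64toh(d->object.size) - offsetof(struct DataObject, payload);
                xor_hash ^= jenkins_hash64(d->payload, payload_size);
        }

        return xor_hash;
}
```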

The **items[]** array contains references to all DATA objects of this entry,
plus their respective hashes (which are calculated the same way as in the DATA
objects, i.e. keyed by the file ID).

In the file ENTRY objects are written ordered monotonically by sequence
number. For continuous parts of the file written during the same boot
(i.e. with the same boot_id) the monotonic timestamp is monotonic too. Modulo
wallclock time jumps (due to incorrect clocks being corrected) the realtime
timestamps are monotonic too.


## Hash Table Objects

```c
_packed_ struct HashItem {
        le64_t head_hash_offset;
        le64_t tail_hash_offset;
};

_packed_ struct HashTableObject {
        ObjectHeader object;
        HashItem items[];
};
```

The structures of both DATA_HASH_TABLE and FIELD_HASH_TABLE objects are
identical. They implement a simple hash table, with each cell containing
offsets to the head and tail of the singly linked list of the DATA and FIELD
objects, respectively. DATA's and FIELD's next_hash_offset fields are used to
chain up the objects. Empty cells have both offsets set to 0.
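
A sketch of a DATA lookup through that table (the bucket count calculation and comparison details are assumptions of this sketch; the table pointer is obtained via the header's **data_hash_table_offset**/**data_hash_table_size** fields):

```c
#include <endian.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed helper, see the earlier sketches. */
const struct DataObject *data_object_at(uint64_t offset);

/* Sketch: find the DATA object with the given payload and hash. The payload
 * of the hash table consists of table_size / sizeof(HashItem) buckets;
 * collisions are resolved by following the next_hash_offset chain. */
static const struct DataObject *find_data(const HashItem *table, uint64_t table_size,
                                          const void *payload, size_t payload_size,
                                          uint64_t hash) {
        uint64_t n_buckets = table_size / sizeof(HashItem);
        if (n_buckets == 0)
                return NULL;

        uint64_t offset = le64toh(table[hash % n_buckets].head_hash_offset);

        while (offset != 0) {
                const struct DataObject *d = data_object_at(offset);
                if (!d)
                        return NULL;

                size_t size = le64toh(d->object.size) - offsetof(struct DataObject, payload);
                if (le64toh(d->hash) == hash &&
                    size == payload_size &&
                    memcmp(d->payload, payload, payload_size) == 0)
                        return d;

                offset = le64toh(d->next_hash_offset);
        }

        return NULL;    /* no such payload in this file */
}
```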

Each file contains exactly one DATA_HASH_TABLE and one FIELD_HASH_TABLE
object. Their payload is directly referred to by the file header in the
**data_hash_table_offset**, **data_hash_table_size**,
**field_hash_table_offset**, **field_hash_table_size** fields. These offsets do
_not_ point to the object headers but directly to the payloads. When a new
journal file is created the two hash table objects need to be created right
away as the first two objects in the stream.

If the hash table fill level increases beyond a certain limit (taking a cue
from Java's Hashtable, for example: > 75%), the writer should rotate the file
and create a new one.

The DATA_HASH_TABLE should be sized taking into account the maximum size the
file is expected to grow to, as configured by the administrator or disk space
considerations. The FIELD_HASH_TABLE should be sized to a fixed size; the
number of fields should be pretty static as it depends only on developers'
creativity rather than runtime parameters.


## Entry Array Objects


```c
_packed_ struct EntryArrayObject {
        ObjectHeader object;
        le64_t next_entry_array_offset;
        le64_t items[];
};
```

Entry Arrays are used to store a sorted array of offsets to entries. Entry
arrays are strictly sorted by offsets on disk, and hence by their timestamps
and sequence numbers (with some restrictions, see above).

Entry Arrays are chained up. If one entry array is full another one is
allocated and the **next_entry_array_offset** field of the old one is set to
point to it. An Entry Array with **next_entry_array_offset** set to 0 is the
last in the list. To optimize allocation and seeking, as entry arrays are
appended to a chain of entry arrays they should increase in size (double).

Due to being monotonically ordered, entry arrays may be searched with a binary
search (bisection).

One chain of entry arrays links up all entries written to the journal. The
first entry array is referenced in the **entry_array_offset** field of the
header.

Each DATA object also references an entry array chain listing all entries
referencing a specific DATA object. Since many DATA objects are only referenced
by a single ENTRY the first offset of the list is stored inside the DATA object
itself; an ENTRY_ARRAY object is only needed if it is referenced by more than
one ENTRY.


## Tag Object

```c
#define TAG_LENGTH (256/8)

_packed_ struct TagObject {
        ObjectHeader object;
        le64_t seqnum;
        le64_t epoch;
        uint8_t tag[TAG_LENGTH]; /* SHA-256 HMAC */
};
```

Tag objects are used to seal off the journal for alteration. At regular
intervals a tag object is appended to the file. The tag object consists of a
SHA-256 HMAC tag that is calculated from the objects stored in the file since
the last tag was written, or from the beginning if no tag was written yet. The
key for the HMAC is calculated via the externally maintained FSPRG logic for
the epoch that is written into **epoch**. The sequence number **seqnum** is
increased with each tag. When calculating the HMAC of objects, header fields
that are volatile are excluded (skipped). More specifically, all fields that
might validly be altered to maintain a consistent file structure (such as
offsets to objects added later for the purpose of linked lists and suchlike)
after an object has been written are not protected by the tag. This means a
verifier has to independently check these fields for consistency of
structure. For the fields excluded from the HMAC please consult the source code
directly. A verifier should read the file from the beginning to the end, always
calculating the HMAC for the objects it reads. Each time a tag object is
encountered the HMAC should be verified and restarted. The tag object sequence
numbers need to increase strictly monotonically. Tag objects themselves are
partially protected by the HMAC (i.e. seqnum and epoch are included, the tag
itself is not).


## Algorithms

### Reading

Given an offset to an entry all data fields are easily found by following the
offsets in the data item array of the entry.

Listing entries without filter is done by traversing the list of entry arrays
starting with the header's **entry_array_offset** field.
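
A sketch of that traversal (the per-array item count follows from the object size; the lookup helpers are hypothetical):

```c
#include <endian.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed helpers for this sketch: map and validate objects at file offsets. */
const struct EntryArrayObject *entry_array_at(uint64_t offset);
const struct EntryObject *entry_at(uint64_t offset);

/* Sketch: visit every entry in the file, in order, by walking the chain of
 * ENTRY_ARRAY objects rooted at the header's entry_array_offset field. */
static void list_all_entries(const struct Header *h) {
        uint64_t array_offset = le64toh(h->entry_array_offset);

        while (array_offset != 0) {
                const struct EntryArrayObject *a = entry_array_at(array_offset);
                if (!a)
                        break;

                /* Number of offset slots in this array, derived from the object size. */
                uint64_t n = (le64toh(a->object.size) -
                              offsetof(struct EntryArrayObject, items)) / sizeof(le64_t);

                for (uint64_t i = 0; i < n; i++) {
                        uint64_t entry_offset = le64toh(a->items[i]);
                        if (entry_offset == 0)
                                break;  /* remaining slots of the array are still unused */

                        const struct EntryObject *e = entry_at(entry_offset);
                        if (!e)
                                break;

                        /* ... present the entry ... */
                }

                array_offset = le64toh(a->next_entry_array_offset);
        }
}
```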

Seeking to an entry by timestamp or sequence number (without any matches) is
done via binary search in the entry arrays starting with the header's
**entry_array_offset** field. Since these arrays double in size as more are
added, the time cost of seeking is O(log(n)*log(n)) where n is the number of
entries in the file.

When seeking or listing with one field match applied the DATA object of the
match is first identified, and then its data entry array chain is traversed.
The time cost is the same as for seeks/listings with no match.

If multiple matches are applied, multiple chains of entry arrays should be
traversed in parallel. Since they all are strictly monotonically ordered by
offset of the entries, advancing in one can be directly applied to the others,
until an entry matching all matches is found. In the worst case seeking like
this is O(n) where n is the number of matching entries of the "loosest" match,
but in the common case it should be much more efficient, at least for the
well-known fields, where the set of possible field values tends to be closely
related. Checking whether an entry matches a number of matches is efficient
since the item array of the entry contains hashes of all data fields
referenced, and the number of data fields of an entry is generally small (<
30).

When interleaving multiple journal files seeking tends to be a frequently used
operation, but in this case it can be effectively suppressed by caching results
from previous entries.

When listing all possible values a certain field can take it is sufficient to
look up the FIELD object and follow the chain of links to all DATA objects it
includes.

### Writing

When an entry is appended to the journal, for each of its data fields the data
hash table should be checked. If the data field does not yet exist in the file,
it should be appended and added to the data hash table. When a data field's data
object is added, the field hash table should be checked for the field name of
the data field, and a field object added if necessary. After all data fields
(and recursively all field names) of the new entry are appended and linked up
in the hash tables, the entry object should be appended and linked up too.

At regular intervals a tag object should be written if sealing is enabled (see
above). Before the file is closed a tag should be written too, to seal it off.

Before writing an object, time and disk space limits should be checked and
rotation triggered if necessary.


## Optimizing Disk IO

_A few general ideas to keep in mind:_

The hash tables for looking up fields and data should quickly end up in the
memory cache and not hurt performance. All entries and entry arrays are ordered
strictly by time on disk, and hence should expose an OK access pattern on
rotating media, when read sequentially (which should be the most common case,
given the nature of log data).

The disk access patterns of the binary search for entries needed for seeking
are problematic on rotating disks. This should not be a major issue though,
since seeking should not be a frequent operation.

When reading, collecting data fields for presenting entries to the user is
problematic on rotating disks. In order to optimize these patterns the item
array of entry objects should be sorted by disk offset before
writing. Effectively, frequently used data objects should be in the memory
cache quickly. Non-frequently used data objects are likely to be located
between the previous and current entry when reading and hence should expose an
OK access pattern. Problematic are data objects that are neither frequently nor
infrequently referenced, which will cost seek time.

And that's all there is to it.

Thanks for your interest!