---
title: Journal File Format
category: Interfaces
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---

# Journal File Format

_Note that this document describes the binary on-disk format of journals only.
For interfacing with web technologies there's the [Journal JSON Format](JOURNAL_EXPORT_FORMATS.md#journal-json-format).
For transfer of journal data across the network there's the [Journal Export Format](JOURNAL_EXPORT_FORMATS.md#journal-export-format)._

The systemd journal stores log data in a binary format with several features:

* Fully indexed by all fields
* Can store binary data, up to 2^64-1 bytes in size
* Seekable
* Primarily append-based, hence robust to corruption
* Support for in-line compression
* Support for in-line Forward Secure Sealing

This document explains the basic structure of the file format on disk. We are
making this available primarily to allow review and provide documentation. Note
that the actual implementation in the [systemd
codebase](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/) is the
only ultimately authoritative description of the format, so if this document
and the code disagree, the code is right. That said, we'll of course try hard to
keep this document up-to-date and accurate.

Instead of implementing your own reader or writer for journal files we ask you
to use the [Journal's native C
API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) to access
these files. It provides you with full access to the files, and will not
withhold any data. If you find a limitation, please ping us and we might add
some additional interfaces for you.

If you need access to the raw journal data in serialized stream form without
the C API, our recommendation is to make use of the [Journal Export
Format](https://systemd.io/JOURNAL_EXPORT_FORMATS#journal-export-format), which you can
get via `journalctl -o export` or via `systemd-journal-gatewayd`. The export
format is much simpler to parse, but complete and accurate. Due to its
stream-based nature it is not indexed.

_Or, to put this in other words: this low-level document is probably not what
you want to use as base of your project. You want our [C
API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) instead!
And if you really don't want the C API, then you want the
[Journal Export Format or Journal JSON Format](JOURNAL_EXPORT_FORMATS)
instead! This document is primarily for your entertainment and education.
Thank you!_

This document assumes you have a basic understanding of the journal concepts,
the properties of a journal entry and so on. If not, please go and read up,
then come back! This is a good opportunity to read about the [basic properties
of journal
entries](https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html),
in particular realize that they may include binary non-text data (though they
usually don't), and the same field might have multiple values assigned within
the same entry.

This document describes the format as of systemd 254. The documented
format is compatible with the format used in the first versions of the journal,
but has received various compatible and incompatible additions since then.

If you are wondering why the journal file format has been created in the first
place instead of adopting an existing database implementation, please have a
look [at this
thread](https://lists.freedesktop.org/archives/systemd-devel/2012-October/007054.html).


## Basics

* All offsets, sizes, time values, hashes (and most other numeric values) are 32-bit/64-bit unsigned integers in LE format.
* Offsets are always relative to the beginning of the file.
* The 64-bit hash function siphash24 is used for newer journal files. For older files [Jenkins lookup3](https://en.wikipedia.org/wiki/Jenkins_hash_function) is used, more specifically `jenkins_hashlittle2()` with the first 32-bit integer it returns used as the higher 32-bit part of the 64-bit value, and the second one used as the lower 32-bit part (see the sketch after this list).
* All structures are aligned to 64-bit boundaries and padded to multiples of 64 bits.
* The format is designed to be read and written via memory mapping using multiple mapped windows.
* All time values are stored in usec since the respective epoch.
* Wall clock time values are relative to the Unix time epoch, i.e. January 1st, 1970. (`CLOCK_REALTIME`)
* Monotonic time values are always stored jointly with the kernel boot ID value (i.e. `/proc/sys/kernel/random/boot_id`) they belong to. They tend to be relative to the start of the boot, but aren't for containers. (`CLOCK_MONOTONIC`)
* Randomized, unique 128-bit IDs are used in various locations. These are generally UUID v4 compatible, but this is not a requirement.
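
A minimal sketch of that combination for older files, assuming Bob Jenkins'
public-domain `lookup3.c` provides the `hashlittle2()` routine (the
`jenkins_hash64()` name is used here only for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Provided by Bob Jenkins' public-domain lookup3.c, assumed to be linked in. */
void hashlittle2(const void *key, size_t length, uint32_t *pc, uint32_t *pb);

/* Combine the two 32-bit results into the 64-bit hash used by older journal
 * files: the first output forms the upper 32 bits, the second the lower. */
static uint64_t jenkins_hash64(const void *data, size_t size) {
        uint32_t a = 0, b = 0;

        hashlittle2(data, size, &a, &b);
        return ((uint64_t) a << 32) | (uint64_t) b;
}
```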

## General Rules

If any kind of corruption is noticed by a writer it should immediately rotate
the file and start a new one. No further writes should be attempted to the
original file, but it should be left around so that as little data as possible
is lost.

If any kind of corruption is noticed by a reader it should try hard to handle
this gracefully, such as skipping over the corrupted data, while allowing access
to as much data around it as possible.

A reader should verify all offsets and other data as it reads them. This includes
checking for alignment and range of offsets in the file, especially before
trying to read data via a memory map.

A reader must interleave rotated and corrupted files as well as possible and
present them as a single stream to the user.

All fields marked as "reserved" must be initialized with 0 when writing and be
ignored on reading. They are currently not used but might be used later on.


## Structure

The file format's data structures are declared in
[journal-def.h](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/journal-def.h).

The file format begins with a header structure. After the header structure,
object structures follow. Objects are appended to the end as time
progresses. Most data stored in these objects is not altered anymore after
having been written once, with the exception of records necessary for
indexing. When new data is appended to a file the writer first writes all new
objects to the end of the file, and then links them up at the front after that's
done. Currently, seven different object types are known:

```c
enum {
        OBJECT_UNUSED,
        OBJECT_DATA,
        OBJECT_FIELD,
        OBJECT_ENTRY,
        OBJECT_DATA_HASH_TABLE,
        OBJECT_FIELD_HASH_TABLE,
        OBJECT_ENTRY_ARRAY,
        OBJECT_TAG,
        _OBJECT_TYPE_MAX
};
```

* A **DATA** object, which encapsulates the contents of one field of an entry, i.e. a string such as `_SYSTEMD_UNIT=avahi-daemon.service`, or `MESSAGE=Foobar made a booboo.` but possibly including large or binary data, and always prefixed by the field name and "=".
* A **FIELD** object, which encapsulates a field name, i.e. a string such as `_SYSTEMD_UNIT` or `MESSAGE`, without any `=` or even value.
* An **ENTRY** object, which binds several **DATA** objects together into a log entry.
* A **DATA_HASH_TABLE** object, which encapsulates a hash table for finding existing **DATA** objects.
* A **FIELD_HASH_TABLE** object, which encapsulates a hash table for finding existing **FIELD** objects.
* An **ENTRY_ARRAY** object, which encapsulates a sorted array of offsets to entries, used for seeking by binary search.
* A **TAG** object, consisting of an FSS sealing tag for all data from the beginning of the file or the last tag written (whichever is later).

## Header

The Header struct defines, well, you guessed it, the file header:

```c
_packed_ struct Header {
        uint8_t signature[8]; /* "LPKSHHRH" */
        le32_t compatible_flags;
        le32_t incompatible_flags;
        uint8_t state;
        uint8_t reserved[7];
        sd_id128_t file_id;
        sd_id128_t machine_id;
        sd_id128_t tail_entry_boot_id;
        sd_id128_t seqnum_id;
        le64_t header_size;
        le64_t arena_size;
        le64_t data_hash_table_offset;
        le64_t data_hash_table_size;
        le64_t field_hash_table_offset;
        le64_t field_hash_table_size;
        le64_t tail_object_offset;
        le64_t n_objects;
        le64_t n_entries;
        le64_t tail_entry_seqnum;
        le64_t head_entry_seqnum;
        le64_t entry_array_offset;
        le64_t head_entry_realtime;
        le64_t tail_entry_realtime;
        le64_t tail_entry_monotonic;
        /* Added in 187 */
        le64_t n_data;
        le64_t n_fields;
        /* Added in 189 */
        le64_t n_tags;
        le64_t n_entry_arrays;
        /* Added in 246 */
        le64_t data_hash_chain_depth;
        le64_t field_hash_chain_depth;
        /* Added in 252 */
        le32_t tail_entry_array_offset;
        le32_t tail_entry_array_n_entries;
        /* Added in 254 */
        le64_t tail_entry_offset;
};
```

The first 8 bytes of Journal files must contain the ASCII characters `LPKSHHRH`.

If a writer finds that the **machine_id** of a file to write to does not match
the machine it is running on it should immediately rotate the file and start a
new one.

When a journal file is first created the **file_id** is randomly and uniquely
initialized.

When a writer creates a file it shall initialize the **tail_entry_boot_id** to
the current boot ID of the system. When appending an entry it shall update the
field to the boot ID of that entry, so that it is guaranteed that the
**tail_entry_monotonic** field refers to a timestamp of the monotonic clock
associated with the boot with the ID indicated by the **tail_entry_boot_id**
field. (Compatibility note: in older versions of the journal, the field was
also supposed to be updated whenever the file was opened for any form of
writing, including when opened to mark it as archived. This behaviour has been
deemed problematic since without an associated boot ID the
**tail_entry_monotonic** field is useless. To indicate that the boot ID is
updated only on append, the HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID flag is set. If
it is not set, the **tail_entry_monotonic** field is not usable).

The currently used part of the file is the **header_size** plus the
**arena_size** field of the header. If a writer needs to write to a file where
the actual file size on disk is smaller than the reported value it shall
immediately rotate the file and start a new one. If a writer is asked to write
to a file with a header that is shorter than its own definition of the struct
Header, it shall immediately rotate the file and start a new one.

The **n_objects** field contains a counter for objects currently available in
this file. As objects are appended to the end of the file this counter is
increased.

The first object in the file starts immediately after the header. The last
object in the file is at the offset **tail_object_offset**, which may be 0 if
no object is in the file yet.

The **n_entries**, **n_data**, **n_fields**, **n_tags**, **n_entry_arrays** are
counters of the objects of the specific types.

**tail_entry_seqnum** and **head_entry_seqnum** contain the sequential number
(see below) of the last or first entry in the file, respectively, or 0 if no
entry has been written yet.

**tail_entry_realtime** and **head_entry_realtime** contain the wallclock
timestamp of the last or first entry in the file, respectively, or 0 if no
entry has been written yet.

**tail_entry_monotonic** is the monotonic timestamp of the last entry in the
file, referring to monotonic time of the boot identified by
**tail_entry_boot_id**, but only if the
HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID feature flag is set, see above. If it
is not set, this field might refer to a different boot than the one in the
**tail_entry_boot_id** field, for example when the file was ultimately
archived.

**data_hash_chain_depth** is a counter of the deepest chain in the data hash
table, minus one. This is updated whenever a chain is found that is longer than
the previous deepest chain found. Note that the counter is updated during hash
table lookups, as the chains are traversed. This counter is used to determine
when it is a good time to rotate the journal file, because hash collisions
have become too frequent.

Similarly, **field_hash_chain_depth** is a counter of the deepest chain in the
field hash table, minus one.

**tail_entry_array_offset** and **tail_entry_array_n_entries** allow immediate
access to the last entry array in the global entry array chain.

**tail_entry_offset** allows immediate access to the last entry in the journal
file.

## Extensibility

The format is supposed to be extensible in order to enable future additions of
features. Readers should simply skip objects of unknown types as they read
them. If a compatible feature extension is made a new bit is registered in the
header's **compatible_flags** field. If a feature extension is used that makes
the format incompatible a new bit is registered in the header's
**incompatible_flags** field. Readers should check these two bit fields: if
they find a flag they don't understand in **compatible_flags** they should continue
to read the file, but if they find one in **incompatible_flags** they should
fail, asking for an update of the software. Writers should refuse writing if
there's an unknown bit flag in either of these fields.

The file header may be extended as new features are added. The size of the file
header is stored in the header. All header fields up to **n_data** are known to
unconditionally exist in all revisions of the file format; all fields starting
with **n_data** need to be explicitly checked for via a size check, since they
were additions after the initial release.
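
A minimal sketch of such a size check, assuming the `Header` struct shown above
is declared and `le64toh()` from `<endian.h>` is available (the
`header_has_n_data()` name is made up for illustration):

```c
#include <endian.h>   /* le64toh() */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A field added after the initial release is only present if header_size
 * covers it completely; here we test for n_data (added in systemd 187). */
static bool header_has_n_data(const struct Header *h) {
        return le64toh(h->header_size) >=
               offsetof(struct Header, n_data) + sizeof(uint64_t);
}
```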

Currently, the following extension flags are known:

```c
enum {
        HEADER_INCOMPATIBLE_COMPRESSED_XZ   = 1 << 0,
        HEADER_INCOMPATIBLE_COMPRESSED_LZ4  = 1 << 1,
        HEADER_INCOMPATIBLE_KEYED_HASH      = 1 << 2,
        HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3,
        HEADER_INCOMPATIBLE_COMPACT         = 1 << 4,
};

enum {
        HEADER_COMPATIBLE_SEALED             = 1 << 0,
        HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID = 1 << 1,
};
```

HEADER_INCOMPATIBLE_COMPRESSED_XZ indicates that the file includes DATA objects
that are compressed using XZ. Similarly, HEADER_INCOMPATIBLE_COMPRESSED_LZ4
indicates that the file includes DATA objects that are compressed with the LZ4
algorithm. And HEADER_INCOMPATIBLE_COMPRESSED_ZSTD indicates that there are
objects compressed with ZSTD.

HEADER_INCOMPATIBLE_KEYED_HASH indicates that instead of the unkeyed Jenkins
hash function the keyed siphash24 hash function is used for the two hash
tables, see below.

HEADER_INCOMPATIBLE_COMPACT indicates that the journal file uses the new binary
format that uses less space on disk compared to the original format.

HEADER_COMPATIBLE_SEALED indicates that the file includes TAG objects required
for Forward Secure Sealing.

HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID indicates whether the
**tail_entry_boot_id** field is strictly updated on initial creation of the
file and whenever an entry is appended (in which case the flag is set), or also
when the file is archived (in which case it is unset). New files should always
set this flag (and thus not update the **tail_entry_boot_id** except when
creating the file and when appending an entry to it).
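
A reader's flag check could look roughly like this (a sketch only, assuming the
two enums above are in scope; the helper name is made up):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define KNOWN_INCOMPATIBLE_FLAGS                  \
        (HEADER_INCOMPATIBLE_COMPRESSED_XZ |      \
         HEADER_INCOMPATIBLE_COMPRESSED_LZ4 |     \
         HEADER_INCOMPATIBLE_KEYED_HASH |         \
         HEADER_INCOMPATIBLE_COMPRESSED_ZSTD |    \
         HEADER_INCOMPATIBLE_COMPACT)

#define KNOWN_COMPATIBLE_FLAGS                    \
        (HEADER_COMPATIBLE_SEALED |               \
         HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID)

/* Returns true if this implementation may read the file. Unknown compatible
 * flags are merely reported, unknown incompatible flags are a hard error. */
static bool flags_are_supported(uint32_t compatible_flags, uint32_t incompatible_flags) {
        if (incompatible_flags & ~KNOWN_INCOMPATIBLE_FLAGS) {
                fprintf(stderr, "Unknown incompatible flags, refusing to read.\n");
                return false;
        }

        if (compatible_flags & ~KNOWN_COMPATIBLE_FLAGS)
                fprintf(stderr, "Unknown compatible flags, reading anyway.\n");

        return true;
}
```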

## Dirty Detection

```c
enum {
        STATE_OFFLINE = 0,
        STATE_ONLINE = 1,
        STATE_ARCHIVED = 2,
        _STATE_MAX
};
```

If a file is opened for writing the **state** field should be set to
STATE_ONLINE. If a file is closed after writing the **state** field should be
set to STATE_OFFLINE. After a file has been rotated it should be set to
STATE_ARCHIVED. If a writer is asked to write to a file that is not in
STATE_OFFLINE it should immediately rotate the file and start a new one,
without changing the file.

After and before the state field is changed, `fdatasync()` should be executed on
the file to ensure the dirty state hits disk.
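
A state transition might then be sketched like this, assuming the `Header`
struct from above is mapped writable and `fd` refers to the journal file (the
function name is illustrative):

```c
#include <stdint.h>
#include <unistd.h>

/* Flush pending writes, update the state byte in the mmap()ed header, then
 * flush again so that the new state itself reaches the disk. */
static int journal_file_set_state(struct Header *h, int fd, uint8_t state) {
        if (fdatasync(fd) < 0)  /* everything written so far must hit disk first */
                return -1;

        h->state = state;

        if (fdatasync(fd) < 0)  /* ... and then the state change itself */
                return -1;

        return 0;
}
```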


## Sequence Numbers

All entries carry sequence numbers that are monotonically counted up for each
entry (starting at 1) and are unique among all files which carry the same
**seqnum_id** field. This field is randomly generated when the journal daemon
creates its first file. All files generated by the same journal daemon instance
should hence carry the same seqnum_id. This should guarantee a monotonic stream
of sequential numbers for easy interleaving even if entries are distributed
among several files, such as the system journal and many per-user journals.


## Concurrency

The file format is designed to be usable in a simultaneous
single-writer/multiple-reader scenario. The synchronization model is very weak
in order to facilitate storage on the most basic of file systems (well, the
most basic ones that provide us with `mmap()` that is), and allow good
performance. No file locking is used. The only time where disk synchronization
via `fdatasync()` should be enforced is after and before changing the **state**
field in the file header (see above). It is recommended to execute a memory
barrier after appending and initializing new objects at the end of the file,
and before linking them up in the earlier objects.

This weak synchronization model means that it is crucial that readers verify
the structural integrity of the file as they read it and handle invalid
structure gracefully. (Checking what you read is a pretty good idea out of
security considerations anyway.) This specifically includes checking offset
values, and that they point to valid objects, with valid sizes and of the type
and hash value expected. All code must be written with the fact in mind that a
file with inconsistent structure might just be inconsistent temporarily, and
might become consistent later on. Payload OTOH requires less scrutiny, as it
should only be linked up (and hence visible to readers) after it was
successfully written to memory (though not necessarily to disk). On non-local
file systems it is a good idea to verify the payload hashes when reading, in
order to avoid annoyances with `mmap()` inconsistencies.

Clients intending to show a live view of the journal should use `inotify()` for
this to watch for file changes. Since file writes done via `mmap()` do not
result in `inotify()` events, writers shall truncate the file to its current
size after writing one or more entries, which results in inotify events being
generated. Note that this is not used as a transaction scheme (it doesn't
protect anything), but merely for triggering wakeups.
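
The wakeup trick is simple enough to sketch (this mirrors the description
above; the function name is illustrative):

```c
#include <sys/stat.h>
#include <unistd.h>

/* Truncating the file to its current size does not change its contents but
 * makes the kernel emit an inotify event for readers watching the file. */
static int journal_file_post_change(int fd) {
        struct stat st;

        if (fstat(fd, &st) < 0)
                return -1;

        return ftruncate(fd, st.st_size);
}
```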

Note that inotify will not work on network file systems if reader and writer
reside on different hosts. Readers which detect they are run on journal files
on a non-local file system should hence not rely on inotify for live views but
fall back to simple time based polling of the files (maybe recheck every 2s).


## Objects

All objects carry a common header:

```c
enum {
        OBJECT_COMPRESSED_XZ = 1 << 0,
        OBJECT_COMPRESSED_LZ4 = 1 << 1,
        OBJECT_COMPRESSED_ZSTD = 1 << 2,
};

_packed_ struct ObjectHeader {
        uint8_t type;
        uint8_t flags;
        uint8_t reserved[6];
        le64_t size;
        uint8_t payload[];
};
```

The **type** field is one of the object types listed above. The **flags** field
currently knows three flags: OBJECT_COMPRESSED_XZ, OBJECT_COMPRESSED_LZ4 and
OBJECT_COMPRESSED_ZSTD. It is only valid for DATA objects and indicates that
the data payload is compressed with XZ/LZ4/ZSTD. If one of the
OBJECT_COMPRESSED_* flags is set for an object then the matching
HEADER_INCOMPATIBLE_COMPRESSED_XZ/HEADER_INCOMPATIBLE_COMPRESSED_LZ4/HEADER_INCOMPATIBLE_COMPRESSED_ZSTD
flag must be set for the file as well. At most one of these three bits may be
set. The **size** field encodes the size of the object including all its
headers and payload.
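
Before interpreting an object a reader might sanity-check its header roughly as
follows (a sketch, assuming the `ObjectHeader` struct and the object type enum
above are in scope; the helper name is made up):

```c
#include <endian.h>
#include <stdbool.h>
#include <stdint.h>

/* 'o' points at the mapped object header found at 'offset'; 'file_size' is the
 * current size of the file. A reader that merely skips an object may tolerate
 * unknown types; this strict type check is for objects it intends to parse. */
static bool object_looks_valid(const struct ObjectHeader *o,
                               uint64_t offset, uint64_t file_size) {
        uint64_t size = le64toh(o->size);

        if ((offset & 7) != 0)                                /* 64-bit alignment */
                return false;
        if (size < sizeof(struct ObjectHeader))               /* must cover the header */
                return false;
        if (offset > file_size || size > file_size - offset)  /* must fit in the file */
                return false;
        if (o->type == OBJECT_UNUSED || o->type >= _OBJECT_TYPE_MAX)
                return false;

        return true;
}
```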


## Data Objects

```c
_packed_ struct DataObject {
        ObjectHeader object;
        le64_t hash;
        le64_t next_hash_offset;
        le64_t next_field_offset;
        le64_t entry_offset; /* the first array entry we store inline */
        le64_t entry_array_offset;
        le64_t n_entries;
        union {
                struct {
                        uint8_t payload[];
                } regular;
                struct {
                        le32_t tail_entry_array_offset;
                        le32_t tail_entry_array_n_entries;
                        uint8_t payload[];
                } compact;
        };
};
```

Data objects carry actual field data in the **payload[]** array, including a
field name, an `=` and the field data. Example:
`_SYSTEMD_UNIT=foobar.service`. The **hash** field is a hash value of the
payload. If the `HEADER_INCOMPATIBLE_KEYED_HASH` flag is set in the file header
this is the siphash24 hash value of the payload, keyed by the file ID as stored
in the **file_id** field of the file header. If the flag is not set it is the
non-keyed Jenkins hash of the payload instead. The keyed hash is preferred as
it makes the format more robust against attackers that want to trigger hash
collisions in the hash table.
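
That choice of hash function can be sketched like this, assuming a `siphash24()`
helper with the prototype shown, the `jenkins_hash64()` sketch from earlier, and
the `Header` struct in scope (the public sd_id128_t type exposes its raw bytes
via `.bytes`):

```c
#include <endian.h>
#include <stddef.h>
#include <stdint.h>

uint64_t siphash24(const void *in, size_t inlen, const uint8_t key[16]); /* assumed helper */
uint64_t jenkins_hash64(const void *data, size_t size);                  /* sketched earlier */

/* Hash a DATA payload: keyed siphash24 (keyed by the file ID) for keyed-hash
 * files, plain Jenkins lookup3 otherwise. */
static uint64_t journal_hash_data(const struct Header *h, const void *payload, size_t size) {
        if (le32toh(h->incompatible_flags) & HEADER_INCOMPATIBLE_KEYED_HASH)
                return siphash24(payload, size, h->file_id.bytes);

        return jenkins_hash64(payload, size);
}
```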

**next_hash_offset** is used to link up DATA objects in the DATA_HASH_TABLE if
a hash collision happens (in a singly linked list, with an offset of 0
indicating the end). **next_field_offset** is used to link up data objects with
the same field name from the FIELD object of the field used.

**entry_offset** is an offset to the first ENTRY object referring to this DATA
object. **entry_array_offset** is an offset to an ENTRY_ARRAY object with
offsets to other entries referencing this DATA object. Storing the offset to
the first ENTRY object in-line is an optimization given that many DATA objects
will be referenced from a single entry only (for example, `MESSAGE=` frequently
includes a practically unique string). **n_entries** is a counter of the total
number of ENTRY objects that reference this object, i.e. the sum of all
ENTRY_ARRAYS chained up from this object, plus 1.

The **payload[]** field contains the field name and data unencoded, unless
OBJECT_COMPRESSED_XZ/OBJECT_COMPRESSED_LZ4/OBJECT_COMPRESSED_ZSTD is set in the
`ObjectHeader`, in which case the payload is compressed with the indicated
compression algorithm.

If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, two extra fields are stored to
allow immediate access to the tail entry array in the DATA object's entry array
chain.

## Field Objects

```c
_packed_ struct FieldObject {
        ObjectHeader object;
        le64_t hash;
        le64_t next_hash_offset;
        le64_t head_data_offset;
        uint8_t payload[];
};
```

Field objects are used to enumerate all possible values a certain field name
can take in the entire journal file.

The **payload[]** array contains the actual field name, without '=' or any
field value. Example: `_SYSTEMD_UNIT`. The **hash** field is a hash value of
the payload. As for the DATA objects, this too is either the `.file_id` keyed
siphash24 hash of the payload, or the non-keyed Jenkins hash.

**next_hash_offset** is used to link up FIELD objects in the FIELD_HASH_TABLE
if a hash collision happens (in a singly linked list, with an offset of 0
indicating the end). **head_data_offset** points to the first DATA object that
shares this field name. It is the head of a singly linked list using DATA's
**next_field_offset** offset.


## Entry Objects

```c
_packed_ struct EntryObject {
        ObjectHeader object;
        le64_t seqnum;
        le64_t realtime;
        le64_t monotonic;
        sd_id128_t boot_id;
        le64_t xor_hash;
        union {
                struct {
                        le64_t object_offset;
                        le64_t hash;
                } regular[];
                struct {
                        le32_t object_offset;
                } compact[];
        } items;
};
```

An ENTRY object binds several DATA objects together into one log entry, and
includes other metadata such as various timestamps.

The **seqnum** field contains the sequence number of the entry, **realtime**
the realtime timestamp, and **monotonic** the monotonic timestamp for the boot
identified by **boot_id**.

The **xor_hash** field contains a binary XOR of the hashes of the payload of
all DATA objects referenced by this ENTRY. This value is usable to check the
contents of the entry, being independent of the order of the DATA objects in
the array. Note that even for files that have the
`HEADER_INCOMPATIBLE_KEYED_HASH` flag set (and thus use siphash24 as the hash
function otherwise) the hash function used for this field is, as a singular
exception, the Jenkins lookup3 hash function. The XOR hash value is used to
quickly compare the contents of two entries, and to define a well-defined order
between two entries that otherwise have the same sequence numbers and
timestamps.
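
Recomputing that value can be sketched as follows (using the `jenkins_hash64()`
sketch from earlier; the payload pointers stand in for the referenced DATA
payloads):

```c
#include <stddef.h>
#include <stdint.h>

uint64_t jenkins_hash64(const void *data, size_t size); /* sketched earlier */

/* XOR of the payload hashes of all referenced DATA objects; the result does
 * not depend on the order of the items. */
static uint64_t entry_xor_hash(const void *const *payloads, const size_t *sizes, size_t n) {
        uint64_t xor_hash = 0;

        for (size_t i = 0; i < n; i++)
                xor_hash ^= jenkins_hash64(payloads[i], sizes[i]);

        return xor_hash;
}
```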

The **items[]** array contains references to all DATA objects of this entry,
plus their respective hashes (which are calculated the same way as in the DATA
objects, i.e. keyed by the file ID).

If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, DATA object offsets are stored
as 32-bit integers instead of 64-bit and the unused hash field per data object is
not stored anymore.

In the file ENTRY objects are written ordered monotonically by sequence
number. For continuous parts of the file written during the same boot
(i.e. with the same boot_id) the monotonic timestamp is monotonic too. Modulo
wallclock time jumps (due to incorrect clocks being corrected) the realtime
timestamps are monotonic too.


## Hash Table Objects

```c
_packed_ struct HashItem {
        le64_t head_hash_offset;
        le64_t tail_hash_offset;
};

_packed_ struct HashTableObject {
        ObjectHeader object;
        HashItem items[];
};
```

The structures of DATA_HASH_TABLE and FIELD_HASH_TABLE objects are
identical. They implement a simple hash table, with each cell containing
offsets to the head and tail of the singly linked list of the DATA and FIELD
objects, respectively. DATA's and FIELD's next_hash_offset fields are used to
chain up the objects. Empty cells have both offsets set to 0.

Each file contains exactly one DATA_HASH_TABLE and one FIELD_HASH_TABLE
object. Their payload is directly referred to by the file header in the
**data_hash_table_offset**, **data_hash_table_size**,
**field_hash_table_offset**, **field_hash_table_size** fields. These offsets do
_not_ point to the object headers but directly to the payloads. When a new
journal file is created the two hash table objects need to be created right
away as the first two objects in the stream.
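
A lookup in the data hash table might be sketched like this (assuming the
structs above are in scope; `map_data_object()` is a hypothetical helper that
validates and maps the DATA object at a given offset, and the caller still has
to compare the payload itself, since different payloads can share a hash value):

```c
#include <endian.h>
#include <stddef.h>
#include <stdint.h>

const struct DataObject *map_data_object(uint64_t offset); /* hypothetical helper */

/* Pick the bucket by hash modulo the number of cells, then walk the collision
 * chain through the DATA objects' next_hash_offset links. Returns the offset
 * of the first DATA object with a matching hash, or 0 if none is found. */
static uint64_t find_data_by_hash(const struct Header *h,
                                  const struct HashItem *table, uint64_t hash) {
        uint64_t n_cells = le64toh(h->data_hash_table_size) / sizeof(struct HashItem);
        uint64_t p = le64toh(table[hash % n_cells].head_hash_offset);

        while (p != 0) {
                const struct DataObject *d = map_data_object(p);

                if (le64toh(d->hash) == hash)
                        return p;

                p = le64toh(d->next_hash_offset);
        }

        return 0;
}
```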

If the hash table fill level increases beyond a certain threshold (taking a cue
from Java's Hashtable, for example: > 75%), the writer should rotate the file
and create a new one.

The DATA_HASH_TABLE should be sized taking into account the maximum size the
file is expected to grow to, as configured by the administrator or disk space
considerations. The FIELD_HASH_TABLE should have a fixed size; the
number of fields should be pretty static as it depends only on developers'
creativity rather than runtime parameters.


## Entry Array Objects

```c
_packed_ struct EntryArrayObject {
        ObjectHeader object;
        le64_t next_entry_array_offset;
        union {
                le64_t regular[];
                le32_t compact[];
        } items;
};
```

Entry Arrays are used to store a sorted array of offsets to entries. Entry
arrays are strictly sorted by offsets on disk, and hence by their timestamps
and sequence numbers (with some restrictions, see above).

If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, offsets are stored as 32-bit
integers instead of 64-bit.

Entry Arrays are chained up. If one entry array is full another one is
allocated and the **next_entry_array_offset** field of the old one is set to
point to it. An Entry Array with **next_entry_array_offset** set to 0 is the
last in the list. To optimize allocation and seeking, as entry arrays are
appended to a chain of entry arrays they should increase in size (double).

Due to being monotonically ordered, entry arrays may be searched with a binary
search (bisection).

One chain of entry arrays links up all entries written to the journal. The
first entry array is referenced in the **entry_array_offset** field of the
header.
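
Walking that global chain (non-compact layout) might look roughly like this (a
sketch; `map_entry_array()` and `entry_array_n_items()` are hypothetical helpers
that map the object at an offset and derive the number of item slots from its
size):

```c
#include <endian.h>
#include <stdint.h>

const struct EntryArrayObject *map_entry_array(uint64_t offset);  /* hypothetical */
uint64_t entry_array_n_items(const struct EntryArrayObject *a);   /* hypothetical */

/* Visit every entry offset in the order it was written, following the chain
 * from the header's entry_array_offset field. */
static void visit_all_entries(const struct Header *h, void (*cb)(uint64_t entry_offset)) {
        uint64_t p = le64toh(h->entry_array_offset);

        while (p != 0) {
                const struct EntryArrayObject *a = map_entry_array(p);
                uint64_t n = entry_array_n_items(a);

                for (uint64_t i = 0; i < n; i++) {
                        uint64_t q = le64toh(a->items.regular[i]);

                        if (q == 0)
                                break; /* remaining slots are not filled in yet */

                        cb(q);
                }

                p = le64toh(a->next_entry_array_offset);
        }
}
```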

Each DATA object also references an entry array chain listing all entries
referencing a specific DATA object. Since many DATA objects are only referenced
by a single ENTRY, the first offset of the list is stored inside the DATA object
itself; an ENTRY_ARRAY object is only needed if the DATA object is referenced by
more than one ENTRY.


## Tag Object

```c
#define TAG_LENGTH (256/8)

_packed_ struct TagObject {
        ObjectHeader object;
        le64_t seqnum;
        le64_t epoch;
        uint8_t tag[TAG_LENGTH]; /* SHA-256 HMAC */
};
```

Tag objects are used to seal off the journal against alteration. At regular
intervals a tag object is appended to the file. The tag object consists of a
SHA-256 HMAC tag that is calculated from the objects stored in the file since
the last tag was written, or from the beginning if no tag was written yet. The
key for the HMAC is calculated via the externally maintained FSPRG logic for
the epoch that is written into **epoch**. The sequence number **seqnum** is
increased with each tag. When calculating the HMAC of objects, header fields
that are volatile are excluded (skipped). More specifically all fields that
might validly be altered to maintain a consistent file structure (such as
offsets to objects added later for the purpose of linked lists and suchlike)
after an object has been written are not protected by the tag. This means a
verifier has to independently check these fields for consistency of
structure. For the fields excluded from the HMAC please consult the source code
directly. A verifier should read the file from the beginning to the end, always
calculating the HMAC for the objects it reads. Each time a tag object is
encountered the HMAC should be verified and restarted. The tag object sequence
numbers need to increase strictly monotonically. Tag objects themselves are
partially protected by the HMAC (i.e. seqnum and epoch are included, the tag
itself is not).


## Algorithms

### Reading

Given an offset to an entry all data fields are easily found by following the
offsets in the data item array of the entry.

Listing entries without filter is done by traversing the list of entry arrays
starting with the header's **entry_array_offset** field.

Seeking to an entry by timestamp or sequence number (without any matches) is
done via binary search in the entry arrays starting with the header's
**entry_array_offset** field. Since these arrays double in size as more are
added, the time cost of seeking is O(log(n)*log(n)) where n is the number of
entries in the file.
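
Bisecting a single (non-compact) entry array for a sequence number might be
sketched as follows; seeking in the whole file walks the chain from
**entry_array_offset** and bisects within the right array (`entry_seqnum()` is a
hypothetical helper that maps the ENTRY object at an offset and returns its
**seqnum**):

```c
#include <endian.h>
#include <stdint.h>

uint64_t entry_seqnum(uint64_t entry_offset); /* hypothetical helper */

/* Find the offset of the first entry in this array whose sequence number is
 * >= seqnum; returns 0 if no such entry exists in this array. Unused (zero)
 * slots at the tail are treated as "past the end". */
static uint64_t bisect_entry_array(const struct EntryArrayObject *a,
                                   uint64_t n_items, uint64_t seqnum) {
        uint64_t lo = 0, hi = n_items;

        while (lo < hi) {
                uint64_t mid = lo + (hi - lo) / 2;
                uint64_t p = le64toh(a->items.regular[mid]);

                if (p == 0 || entry_seqnum(p) >= seqnum)
                        hi = mid;
                else
                        lo = mid + 1;
        }

        return lo < n_items ? le64toh(a->items.regular[lo]) : 0;
}
```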

When seeking or listing with one field match applied the DATA object of the
match is first identified, and then its data entry array chain is traversed. The
time cost is the same as for seeks/listings with no match.

If multiple matches are applied, multiple chains of entry arrays should be
traversed in parallel. Since they all are strictly monotonically ordered by
offset of the entries, advancing in one can be directly applied to the others,
until an entry matching all matches is found. In the worst case seeking like
this is O(n) where n is the number of matching entries of the "loosest" match,
but in the common case it should be much more efficient at least for the
well-known fields, where the set of possible field values tends to be closely
related. Checking whether an entry matches a number of matches is efficient
since the item array of the entry contains hashes of all data fields
referenced, and the number of data fields of an entry is generally small (<
30).

When interleaving multiple journal files seeking tends to be a frequently used
operation, but in this case it can be effectively suppressed by caching results
from previous entries.

When listing all possible values a certain field can take it is sufficient to
look up the FIELD object and follow the chain of links to all DATA it includes.

### Writing

When an entry is appended to the journal, for each of its data fields the data
hash table should be checked. If the data field does not yet exist in the file,
it should be appended and added to the data hash table. When a data field's data
object is added, the field hash table should be checked for the field name of
the data field, and a field object be added if necessary. After all data fields
(and recursively all field names) of the new entry are appended and linked up
in the hash tables, the entry object should be appended and linked up too.
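
In outline, and with hypothetical helpers standing in for the real hash table
lookups, object allocation and linking (this is a sketch of the flow above, not
the actual systemd implementation):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: the first also ensures the matching FIELD object
 * exists and links the new DATA object into the hash tables and field chain. */
uint64_t find_or_append_data(const void *payload, size_t size);
uint64_t append_entry_object(const uint64_t *data_offsets, size_t n);
void link_entry(uint64_t entry_offset, const uint64_t *data_offsets, size_t n);

static uint64_t append_entry(const void *const *fields, const size_t *sizes, size_t n) {
        uint64_t data_offsets[64]; /* sketch: assume a bounded number of fields */

        if (n > 64)
                return 0;

        /* First make sure every DATA object (and its FIELD object) exists. */
        for (size_t i = 0; i < n; i++)
                data_offsets[i] = find_or_append_data(fields[i], sizes[i]);

        /* Then append the ENTRY object and link it into the entry arrays. */
        uint64_t e = append_entry_object(data_offsets, n);
        link_entry(e, data_offsets, n);
        return e;
}
```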

At regular intervals a tag object should be written if sealing is enabled (see
above). Before the file is closed a tag should be written too, to seal it off.

Before writing an object, time and disk space limits should be checked and
rotation triggered if necessary.


## Optimizing Disk IO

_A few general ideas to keep in mind:_

The hash tables for looking up fields and data should quickly end up in the
memory cache and not hurt performance. All entries and entry arrays are ordered
strictly by time on disk, and hence should expose an OK access pattern on
rotating media, when read sequentially (which should be the most common case,
given the nature of log data).

The disk access patterns of the binary search for entries needed for seeking
are problematic on rotating disks. This should not be a major issue though,
since seeking should not be a frequent operation.

When reading, collecting data fields for presenting entries to the user is
problematic on rotating disks. In order to optimize these patterns the item
array of entry objects should be sorted by disk offset before
writing. Effectively, frequently used data objects should be in the memory
cache quickly. Non-frequently used data objects are likely to be located
between the previous and current entry when reading and hence should expose an
OK access pattern. Problematic are data objects that are referenced neither very
frequently nor only rarely, as they will cost seek time.

And that's all there is to it.

Thanks for your interest!