docs/JOURNAL_FILE_FORMAT.md

   1 ---
   2 title: Journal File Format
   3 category: Interfaces
   4 layout: default
   5 SPDX-License-Identifier: LGPL-2.1-or-later
   6 ---
   7
   8 # Journal File Format
   9
  10 _Note that this document describes the binary on-disk format of journals only.
  11 For interfacing with web technologies there's the [Journal JSON Format](JOURNAL_EXPORT_FORMATS#journal-json-format).
  12 For transfer of journal data across the network there's the
  13 [Journal Export Format](JOURNAL_EXPORT_FORMATS#journal-export-format)._
  14
  15 The systemd journal stores log data in a binary format with several features:
  16
  17 * Fully indexed by all fields
  18 * Can store binary data, up to 2^64-1 in size
  19 * Seekable
  20 * Primarily append-based, hence robust to corruption
  21 * Support for in-line compression
  22 * Support for in-line Forward Secure Sealing
  23
  24 This document explains the basic structure of the file format on disk.
  25 We are making this available primarily to allow review and provide documentation.
  26 Note that the actual implementation in the
  27 [systemd codebase](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/)
  28 is the only ultimately authoritative description of the format,
  29 so if this document and the code disagree, the code is right.
  30 That said we'll of course try hard to keep this document up-to-date and accurate.
  31
  32 Instead of implementing your own reader or writer for journal files we ask you to use the
  33 [Journal's native CAPI](https://www.freedesktop.org/software/systemd/man/sd-journal.html)
  34 to access these files.
  35 It provides you with full access to the files, and will not withhold any data.
  36 If you find a limitation, please ping us and we might add some additional interfaces for you.
  37
  38 If you need access to the raw journal data in serialized stream form without C API our recommendation is to make use of the
  39 [Journal Export Format](JOURNAL_EXPORT_FORMATS#journal-export-format),
  40 which you can get via `journalctl -o export` or via `systemd-journal-gatewayd`.
  41 The export format is much simpler to parse, but complete and accurate.
  42 Due to its stream-based nature it is not indexed.
  43
  44 _Or, to put this in other words: this low-level document is probably not what you want to use as base of your project.
  45 You want our [C API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) instead!
  46 And if you really don't want the C API, then you want the
  47 [Journal Export Format or Journal JSON Format](/JOURNAL_EXPORT_FORMATS) instead!
  48 This document is primarily for your entertainment and education.
  49 Thank you!_
  50
  51 This document assumes you have a basic understanding of the journal concepts, the properties of a journal entry and so on.
  52 If not, please go and read up, then come back!
  53 This is a good opportunity to read about the
  54 [basic properties of journal entries](https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html),
  55 in particular realize that they may include binary non-text data (though usually don't),
  56 and the same field might have multiple values assigned within the same entry.
  57
  58 This document describes the current format of systemd 246.
  59 The documented format is compatible with the format used in the first versions of the journal,
  60 but received various compatible and incompatible additions since.
  61
  62 If you are wondering why the journal file format has been created in the first place instead of adopting an existing database implementation,
  63 please have a look [at this thread](https://lists.freedesktop.org/archives/systemd-devel/2012-October/007054.html).
  64
  65
  66 ## Basics
  67
  68 * All offsets, sizes, time values, hashes (and most other numeric values) are 32-bit/64-bit unsigned integers in LE format.
  69 * Offsets are always relative to the beginning of the file.
  70 * The 64-bit hash function siphash24 is used for newer journal files.
  71   For older files [Jenkins lookup3](https://en.wikipedia.org/wiki/Jenkins_hash_function) is used,
  72   more specifically `jenkins_hashlittle2()` with the first 32-bit integer it returns as higher 32-bit part of the 64-bit value,
  73   and the second one uses as lower 32-bit part.
  74 * All structures are aligned to 64-bit boundaries and padded to multiples of 64-bit
  75 * The format is designed to be read and written via memory mapping using multiple mapped windows.
  76 * All time values are stored in usec since the respective epoch.
  77 * Wall clock time values are relative to the Unix time epoch, i.e. January 1st, 1970. (`CLOCK_REALTIME`)
  78 * Monotonic time values are always stored jointly with the kernel boot ID value (i.e. `/proc/sys/kernel/random/boot_id`) they belong to.
  79   They tend to be relative to the start of the boot, but aren't for containers. (`CLOCK_MONOTONIC`)
  80 * Randomized, unique 128-bit IDs are used in various locations. These are generally UUID v4 compatible, but this is not a requirement.
  81
  82 ## General Rules
  83
  84 If any kind of corruption is noticed by a writer it should immediately rotate
  85 the file and start a new one. No further writes should be attempted to the
  86 original file, but it should be left around so that as little data as possible
  87 is lost.
  88
  89 If any kind of corruption is noticed by a reader it should try hard to handle
  90 this gracefully, such as skipping over the corrupted data, but allowing access
  91 to as much data around it as possible.
  92
  93 A reader should verify all offsets and other data as it reads it. This includes
  94 checking for alignment and range of offsets in the file, especially before
  95 trying to read it via a memory map.
  96
  97 A reader must interleave rotated and corrupted files as good as possible and
  98 present them as single stream to the user.
  99
 100 All fields marked as "reserved" must be initialized with 0 when writing and be
 101 ignored on reading. They are currently not used but might be used later on.
 102
 103
 104 ## Structure
 105
 106 The file format's data structures are declared in
 107 [journal-def.h](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/journal-def.h).
 108
 109 The file format begins with a header structure. After the header structure
 110 object structures follow. Objects are appended to the end as time
 111 progresses. Most data stored in these objects is not altered anymore after
 112 having been written once, with the exception of records necessary for
 113 indexing. When new data is appended to a file the writer first writes all new
 114 objects to the end of the file, and then links them up at front after that's
 115 done. Currently, seven different object types are known:
 116
 117 ```c
 118 enum {
 119         OBJECT_UNUSED,
 120         OBJECT_DATA,
 121         OBJECT_FIELD,
 122         OBJECT_ENTRY,
 123         OBJECT_DATA_HASH_TABLE,
 124         OBJECT_FIELD_HASH_TABLE,
 125         OBJECT_ENTRY_ARRAY,
 126         OBJECT_TAG,
 127         _OBJECT_TYPE_MAX
 128 };
 129 ```
 130
 131 * A **DATA** object, which encapsulates the contents of one field of an entry, i.e. a string such as `_SYSTEMD_UNIT=avahi-daemon.service`, or `MESSAGE=Foobar made a booboo.` but possibly including large or binary data, and always prefixed by the field name and "=".
 132 * A **FIELD** object, which encapsulates a field name, i.e. a string such as `_SYSTEMD_UNIT` or `MESSAGE`, without any `=` or even value.
 133 * An **ENTRY** object, which binds several **DATA** objects together into a log entry.
 134 * A **DATA_HASH_TABLE** object, which encapsulates a hash table for finding existing **DATA** objects.
 135 * A **FIELD_HASH_TABLE** object, which encapsulates a hash table for finding existing **FIELD** objects.
 136 * An **ENTRY_ARRAY** object, which encapsulates a sorted array of offsets to entries, used for seeking by binary search.
 137 * A **TAG** object, consisting of an FSS sealing tag for all data from the beginning of the file or the last tag written (whichever is later).
 138
 139 ## Header
 140
 141 The Header struct defines, well, you guessed it, the file header:
 142
 143 ```c
 144 _packed_ struct Header {
 145         uint8_t signature[8]; /* "LPKSHHRH" */
 146         le32_t compatible_flags;
 147         le32_t incompatible_flags;
 148         uint8_t state;
 149         uint8_t reserved[7];
 150         sd_id128_t file_id;
 151         sd_id128_t machine_id;
 152         sd_id128_t tail_entry_boot_id;
 153         sd_id128_t seqnum_id;
 154         le64_t header_size;
 155         le64_t arena_size;
 156         le64_t data_hash_table_offset;
 157         le64_t data_hash_table_size;
 158         le64_t field_hash_table_offset;
 159         le64_t field_hash_table_size;
 160         le64_t tail_object_offset;
 161         le64_t n_objects;
 162         le64_t n_entries;
 163         le64_t tail_entry_seqnum;
 164         le64_t head_entry_seqnum;
 165         le64_t entry_array_offset;
 166         le64_t head_entry_realtime;
 167         le64_t tail_entry_realtime;
 168         le64_t tail_entry_monotonic;
 169         /* Added in 187 */
 170         le64_t n_data;
 171         le64_t n_fields;
 172         /* Added in 189 */
 173         le64_t n_tags;
 174         le64_t n_entry_arrays;
 175         /* Added in 246 */
 176         le64_t data_hash_chain_depth;
 177         le64_t field_hash_chain_depth;
 178         /* Added in 252 */
 179         le32_t tail_entry_array_offset;
 180         le32_t tail_entry_array_n_entries;
 181         /* Added in 254 */
 182         le64_t tail_entry_offset;
 183 };
 184 ```
 185
 186 The first 8 bytes of Journal files must contain the ASCII characters `LPKSHHRH`.
 187
 188 If a writer finds that the **machine_id** of a file to write to does not match
 189 the machine it is running on it should immediately rotate the file and start a
 190 new one.
 191
 192 When journal file is first created the **file_id** is randomly and uniquely
 193 initialized.
 194
 195 When a writer creates a file it shall initialize the **tail_entry_boot_id** to
 196 the current boot ID of the system. When appending an entry it shall update the
 197 field to the boot ID of that entry, so that it is guaranteed that the
 198 **tail_entry_monotonic** field refers to a timestamp of the monotonic clock
 199 associated with the boot with the ID indicated by the **tail_entry_boot_id**
 200 field. (Compatibility note: in older versions of the journal, the field was
 201 also supposed to be updated whenever the file was opened for any form of
 202 writing, including when opened to mark it as archived. This behaviour has been
 203 deemed problematic since without an associated boot ID the
 204 **tail_entry_monotonic** field is useless. To indicate whether the boot ID is
 205 updated only on append the JOURNAL_COMPATIBLE_TAIL_ENTRY_BOOT_ID is set. If it
 206 is not set, the **tail_entry_monotonic** field is not usable).
 207
 208 The currently used part of the file is the **header_size** plus the
 209 **arena_size** field of the header. If a writer needs to write to a file where
 210 the actual file size on disk is smaller than the reported value it shall
 211 immediately rotate the file and start a new one. If a writer is asked to write
 212 to a file with a header that is shorter than its own definition of the struct
 213 Header, it shall immediately rotate the file and start a new one.
 214
 215 The **n_objects** field contains a counter for objects currently available in
 216 this file. As objects are appended to the end of the file this counter is
 217 increased.
 218
 219 The first object in the file starts immediately after the header. The last
 220 object in the file is at the offset **tail_object_offset**, which may be 0 if
 221 no object is in the file yet.
 222
 223 The **n_entries**, **n_data**, **n_fields**, **n_tags**, **n_entry_arrays** are
 224 counters of the objects of the specific types.
 225
 226 **tail_entry_seqnum** and **head_entry_seqnum** contain the sequential number
 227 (see below) of the last or first entry in the file, respectively, or 0 if no
 228 entry has been written yet.
 229
 230 **tail_entry_realtime** and **head_entry_realtime** contain the wallclock
 231 timestamp of the last or first entry in the file, respectively, or 0 if no
 232 entry has been written yet.
 233
 234 **tail_entry_monotonic** is the monotonic timestamp of the last entry in the
 235 file, referring to monotonic time of the boot identified by
 236 **tail_entry_boot_id**, but only if the
 237 JOURNAL_COMPATIBLE_TAIL_ENTRY_BOOT_ID feature flag is set, see above. If it
 238 is not set, this field might refer to a different boot then the one in the
 239 **tail_entry_boot_id** field, for example when the file was ultimately
 240 archived.
 241
 242 **data_hash_chain_depth** is a counter of the deepest chain in the data hash
 243 table, minus one. This is updated whenever a chain is found that is longer than
 244 the previous deepest chain found. Note that the counter is updated during hash
 245 table lookups, as the chains are traversed. This counter is used to determine
 246 when it is a good time to rotate the journal file, because hash collisions
 247 became too frequent.
 248
 249 Similar, **field_hash_chain_depth** is a counter of the deepest chain in the
 250 field hash table, minus one.
 251
 252 **tail_entry_array_offset** and **tail_entry_array_n_entries** allow immediate
 253 access to the last entry array in the global entry array chain.
 254
 255 **tail_entry_offset** allow immediate access to the last entry in the journal
 256 file.
 257
 258 ## Extensibility
 259
 260 The format is supposed to be extensible in order to enable future additions of
 261 features. Readers should simply skip objects of unknown types as they read
 262 them. If a compatible feature extension is made a new bit is registered in the
 263 header's **compatible_flags** field. If a feature extension is used that makes
 264 the format incompatible a new bit is registered in the header's
 265 **incompatible_flags** field. Readers should check these two bit fields, if
 266 they find a flag they don't understand in compatible_flags they should continue
 267 to read the file, but if they find one in **incompatible_flags** they should
 268 fail, asking for an update of the software. Writers should refuse writing if
 269 there's an unknown bit flag in either of these fields.
 270
 271 The file header may be extended as new features are added. The size of the file
 272 header is stored in the header. All header fields up to **n_data** are known to
 273 unconditionally exist in all revisions of the file format, all fields starting
 274 with **n_data** needs to be explicitly checked for via a size check, since they
 275 were additions after the initial release.
 276
 277 Currently only five extensions flagged in the flags fields are known:
 278
 279 ```c
 280 enum {
 281         HEADER_INCOMPATIBLE_COMPRESSED_XZ   = 1 << 0,
 282         HEADER_INCOMPATIBLE_COMPRESSED_LZ4  = 1 << 1,
 283         HEADER_INCOMPATIBLE_KEYED_HASH      = 1 << 2,
 284         HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3,
 285         HEADER_INCOMPATIBLE_COMPACT         = 1 << 4,
 286 };
 287
 288 enum {
 289         HEADER_COMPATIBLE_SEALED             = 1 << 0,
 290         HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID = 1 << 1,
 291 };
 292 ```
 293
 294 HEADER_INCOMPATIBLE_COMPRESSED_XZ indicates that the file includes DATA objects
 295 that are compressed using XZ. Similarly, HEADER_INCOMPATIBLE_COMPRESSED_LZ4
 296 indicates that the file includes DATA objects that are compressed with the LZ4
 297 algorithm. And HEADER_INCOMPATIBLE_COMPRESSED_ZSTD indicates that there are
 298 objects compressed with ZSTD.
 299
 300 HEADER_INCOMPATIBLE_KEYED_HASH indicates that instead of the unkeyed Jenkins
 301 hash function the keyed siphash24 hash function is used for the two hash
 302 tables, see below.
 303
 304 HEADER_INCOMPATIBLE_COMPACT indicates that the journal file uses the new binary
 305 format that uses less space on disk compared to the original format.
 306
 307 HEADER_COMPATIBLE_SEALED indicates that the file includes TAG objects required
 308 for Forward Secure Sealing.
 309
 310 HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID indicates whether the
 311 **tail_entry_boot_id** field is strictly updated on initial creation of the
 312 file and whenever an entry is updated (in which case the flag is set), or also
 313 when the file is archived (in which case it is unset). New files should always
 314 set this flag (and thus not update the **tail_entry_boot_id** except when
 315 creating the file and when appending an entry to it.
 316
 317 ## Dirty Detection
 318
 319 ```c
 320 enum {
 321         STATE_OFFLINE = 0,
 322         STATE_ONLINE = 1,
 323         STATE_ARCHIVED = 2,
 324         _STATE_MAX
 325 };
 326 ```
 327
 328 If a file is opened for writing the **state** field should be set to
 329 STATE_ONLINE. If a file is closed after writing the **state** field should be
 330 set to STATE_OFFLINE. After a file has been rotated it should be set to
 331 STATE_ARCHIVED. If a writer is asked to write to a file that is not in
 332 STATE_OFFLINE it should immediately rotate the file and start a new one,
 333 without changing the file.
 334
 335 After and before the state field is changed, `fdatasync()` should be executed on
 336 the file to ensure the dirty state hits disk.
 337
 338
 339 ## Sequence Numbers
 340
 341 All entries carry sequence numbers that are monotonically counted up for each
 342 entry (starting at 1) and are unique among all files which carry the same
 343 **seqnum_id** field. This field is randomly generated when the journal daemon
 344 creates its first file. All files generated by the same journal daemon instance
 345 should hence carry the same seqnum_id. This should guarantee a monotonic stream
 346 of sequential numbers for easy interleaving even if entries are distributed
 347 among several files, such as the system journal and many per-user journals.
 348
 349
 350 ## Concurrency
 351
 352 The file format is designed to be usable in a simultaneous
 353 single-writer/multiple-reader scenario. The synchronization model is very weak
 354 in order to facilitate storage on the most basic of file systems (well, the
 355 most basic ones that provide us with `mmap()` that is), and allow good
 356 performance. No file locking is used. The only time where disk synchronization
 357 via `fdatasync()` should be enforced is after and before changing the **state**
 358 field in the file header (see below). It is recommended to execute a memory
 359 barrier after appending and initializing new objects at the end of the file,
 360 and before linking them up in the earlier objects.
 361
 362 This weak synchronization model means that it is crucial that readers verify
 363 the structural integrity of the file as they read it and handle invalid
 364 structure gracefully. (Checking what you read is a pretty good idea out of
 365 security considerations anyway.) This specifically includes checking offset
 366 values, and that they point to valid objects, with valid sizes and of the type
 367 and hash value expected. All code must be written with the fact in mind that a
 368 file with inconsistent structure might just be inconsistent temporarily, and
 369 might become consistent later on. Payload OTOH requires less scrutiny, as it
 370 should only be linked up (and hence visible to readers) after it was
 371 successfully written to memory (though not necessarily to disk). On non-local
 372 file systems it is a good idea to verify the payload hashes when reading, in
 373 order to avoid annoyances with `mmap()` inconsistencies.
 374
 375 Clients intending to show a live view of the journal should use `inotify()` for
 376 this to watch for files changes. Since file writes done via `mmap()` do not
 377 result in `inotify()` writers shall truncate the file to its current size after
 378 writing one or more entries, which results in inotify events being
 379 generated. Note that this is not used as a transaction scheme (it doesn't
 380 protect anything), but merely for triggering wakeups.
 381
 382 Note that inotify will not work on network file systems if reader and writer
 383 reside on different hosts. Readers which detect they are run on journal files
 384 on a non-local file system should hence not rely on inotify for live views but
 385 fall back to simple time based polling of the files (maybe recheck every 2s).
 386
 387
 388 ## Objects
 389
 390 All objects carry a common header:
 391
 392 ```c
 393 enum {
 394         OBJECT_COMPRESSED_XZ   = 1 << 0,
 395         OBJECT_COMPRESSED_LZ4  = 1 << 1,
 396         OBJECT_COMPRESSED_ZSTD = 1 << 2,
 397 };
 398
 399 _packed_ struct ObjectHeader {
 400         uint8_t type;
 401         uint8_t flags;
 402         uint8_t reserved[6];
 403         le64_t size;
 404         uint8_t payload[];
 405 };
 406 ```
 407
 408 The **type** field is one of the object types listed above. The **flags** field
 409 currently knows three flags: OBJECT_COMPRESSED_XZ, OBJECT_COMPRESSED_LZ4 and
 410 OBJECT_COMPRESSED_ZSTD. It is only valid for DATA objects and indicates that
 411 the data payload is compressed with XZ/LZ4/ZSTD. If one of the
 412 OBJECT_COMPRESSED_* flags is set for an object then the matching
 413 HEADER_INCOMPATIBLE_COMPRESSED_XZ/HEADER_INCOMPATIBLE_COMPRESSED_LZ4/HEADER_INCOMPATIBLE_COMPRESSED_ZSTD
 414 flag must be set for the file as well. At most one of these three bits may be
 415 set. The **size** field encodes the size of the object including all its
 416 headers and payload.
 417
 418
 419 ## Data Objects
 420
 421 ```c
 422 _packed_ struct DataObject {
 423         ObjectHeader object;
 424         le64_t hash;
 425         le64_t next_hash_offset;
 426         le64_t next_field_offset;
 427         le64_t entry_offset; /* the first array entry we store inline */
 428         le64_t entry_array_offset;
 429         le64_t n_entries;
 430         union {                                                         \
 431                 struct {                                                \
 432                         uint8_t payload[] ;                             \
 433                 } regular;                                              \
 434                 struct {                                                \
 435                         le32_t tail_entry_array_offset;                 \
 436                         le32_t tail_entry_array_n_entries;              \
 437                         uint8_t payload[];                              \
 438                 } compact;                                              \
 439         };                                                              \
 440 };
 441 ```
 442
 443 Data objects carry actual field data in the **payload[]** array, including a
 444 field name, a `=` and the field data. Example:
 445 `_SYSTEMD_UNIT=foobar.service`. The **hash** field is a hash value of the
 446 payload. If the `HEADER_INCOMPATIBLE_KEYED_HASH` flag is set in the file header
 447 this is the siphash24 hash value of the payload, keyed by the file ID as stored
 448 in the **file_id** field of the file header. If the flag is not set it is the
 449 non-keyed Jenkins hash of the payload instead. The keyed hash is preferred as
 450 it makes the format more robust against attackers that want to trigger hash
 451 collisions in the hash table.
 452
 453 **next_hash_offset** is used to link up DATA objects in the DATA_HASH_TABLE if
 454 a hash collision happens (in a singly linked list, with an offset of 0
 455 indicating the end). **next_field_offset** is used to link up data objects with
 456 the same field name from the FIELD object of the field used.
 457
 458 **entry_offset** is an offset to the first ENTRY object referring to this DATA
 459 object. **entry_array_offset** is an offset to an ENTRY_ARRAY object with
 460 offsets to other entries referencing this DATA object. Storing the offset to
 461 the first ENTRY object in-line is an optimization given that many DATA objects
 462 will be referenced from a single entry only (for example, `MESSAGE=` frequently
 463 includes a practically unique string). **n_entries** is a counter of the total
 464 number of ENTRY objects that reference this object, i.e. the sum of all
 465 ENTRY_ARRAYS chained up from this object, plus 1.
 466
 467 The **payload[]** field contains the field name and date unencoded, unless
 468 OBJECT_COMPRESSED_XZ/OBJECT_COMPRESSED_LZ4/OBJECT_COMPRESSED_ZSTD is set in the
 469 `ObjectHeader`, in which case the payload is compressed with the indicated
 470 compression algorithm.
 471
 472 If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, Two extra fields are stored to
 473 allow immediate access to the tail entry array in the DATA object's entry array
 474 chain.
 475
 476 ## Field Objects
 477
 478 ```c
 479 _packed_ struct FieldObject {
 480         ObjectHeader object;
 481         le64_t hash;
 482         le64_t next_hash_offset;
 483         le64_t head_data_offset;
 484         uint8_t payload[];
 485 };
 486 ```
 487
 488 Field objects are used to enumerate all possible values a certain field name
 489 can take in the entire journal file.
 490
 491 The **payload[]** array contains the actual field name, without '=' or any
 492 field value. Example: `_SYSTEMD_UNIT`. The **hash** field is a hash value of
 493 the payload. As for the DATA objects, this too is either the `.file_id` keyed
 494 siphash24 hash of the payload, or the non-keyed Jenkins hash.
 495
 496 **next_hash_offset** is used to link up FIELD objects in the FIELD_HASH_TABLE
 497 if a hash collision happens (in singly linked list, offset 0 indicating the
 498 end). **head_data_offset** points to the first DATA object that shares this
 499 field name. It is the head of a singly linked list using DATA's
 500 **next_field_offset** offset.
 501
 502
 503 ## Entry Objects
 504
 505 ```
 506 _packed_ struct EntryObject {
 507         ObjectHeader object;
 508         le64_t seqnum;
 509         le64_t realtime;
 510         le64_t monotonic;
 511         sd_id128_t boot_id;
 512         le64_t xor_hash;
 513         union {                                 \
 514                 struct {                        \
 515                         le64_t object_offset;   \
 516                         le64_t hash;            \
 517                 } regular[];                    \
 518                 struct {                        \
 519                         le32_t object_offset;   \
 520                 } compact[];                    \
 521         } items;                                \
 522 };
 523 ```
 524
 525 An ENTRY object binds several DATA objects together into one log entry, and
 526 includes other metadata such as various timestamps.
 527
 528 The **seqnum** field contains the sequence number of the entry, **realtime**
 529 the realtime timestamp, and **monotonic** the monotonic timestamp for the boot
 530 identified by **boot_id**.
 531
 532 The **xor_hash** field contains a binary XOR of the hashes of the payload of
 533 all DATA objects referenced by this ENTRY. This value is usable to check the
 534 contents of the entry, being independent of the order of the DATA objects in
 535 the array. Note that even for files that have the
 536 `HEADER_INCOMPATIBLE_KEYED_HASH` flag set (and thus siphash24 the otherwise
 537 used hash function) the hash function used for this field, as singular
 538 exception, is the Jenkins lookup3 hash function. The XOR hash value is used to
 539 quickly compare the contents of two entries, and to define a well-defined order
 540 between two entries that otherwise have the same sequence numbers and
 541 timestamps.
 542
 543 The **items[]** array contains references to all DATA objects of this entry,
 544 plus their respective hashes (which are calculated the same way as in the DATA
 545 objects, i.e. keyed by the file ID).
 546
 547 If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, DATA object offsets are stored
 548 as 32-bit integers instead of 64-bit and the unused hash field per data object is
 549 not stored anymore.
 550
 551 In the file ENTRY objects are written ordered monotonically by sequence
 552 number. For continuous parts of the file written during the same boot
 553 (i.e. with the same boot_id) the monotonic timestamp is monotonic too. Modulo
 554 wallclock time jumps (due to incorrect clocks being corrected) the realtime
 555 timestamps are monotonic too.
 556
 557
 558 ## Hash Table Objects
 559
 560 ```c
 561 _packed_ struct HashItem {
 562         le64_t head_hash_offset;
 563         le64_t tail_hash_offset;
 564 };
 565
 566 _packed_ struct HashTableObject {
 567         ObjectHeader object;
 568         HashItem items[];
 569 };
 570 ```
 571
 572 The structure of both DATA_HASH_TABLE and FIELD_HASH_TABLE objects are
 573 identical. They implement a simple hash table, with each cell containing
 574 offsets to the head and tail of the singly linked list of the DATA and FIELD
 575 objects, respectively. DATA's and FIELD's next_hash_offset field are used to
 576 chain up the objects. Empty cells have both offsets set to 0.
 577
 578 Each file contains exactly one DATA_HASH_TABLE and one FIELD_HASH_TABLE
 579 objects. Their payload is directly referred to by the file header in the
 580 **data_hash_table_offset**, **data_hash_table_size**,
 581 **field_hash_table_offset**, **field_hash_table_size** fields. These offsets do
 582 _not_ point to the object headers but directly to the payloads. When a new
 583 journal file is created the two hash table objects need to be created right
 584 away as first two objects in the stream.
 585
 586 If the hash table fill level is increasing over a certain fill level (Learning
 587 from Java's Hashtable for example: > 75%), the writer should rotate the file
 588 and create a new one.
 589
 590 The DATA_HASH_TABLE should be sized taking into account to the maximum size the
 591 file is expected to grow, as configured by the administrator or disk space
 592 considerations. The FIELD_HASH_TABLE should be sized to a fixed size; the
 593 number of fields should be pretty static as it depends only on developers'
 594 creativity rather than runtime parameters.
 595
 596
 597 ## Entry Array Objects
 598
 599
 600 ```c
 601 _packed_ struct EntryArrayObject {
 602         ObjectHeader object;
 603         le64_t next_entry_array_offset;
 604         union {
 605                 le64_t regular[];
 606                 le32_t compact[];
 607         } items;
 608 };
 609 ```
 610
 611 Entry Arrays are used to store a sorted array of offsets to entries. Entry
 612 arrays are strictly sorted by offsets on disk, and hence by their timestamps
 613 and sequence numbers (with some restrictions, see above).
 614
 615 If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, offsets are stored as 32-bit
 616 integers instead of 64-bit.
 617
 618 Entry Arrays are chained up. If one entry array is full another one is
 619 allocated and the **next_entry_array_offset** field of the old one pointed to
 620 it. An Entry Array with **next_entry_array_offset** set to 0 is the last in the
 621 list. To optimize allocation and seeking, as entry arrays are appended to a
 622 chain of entry arrays they should increase in size (double).
 623
 624 Due to being monotonically ordered entry arrays may be searched with a binary
 625 search (bisection).
 626
 627 One chain of entry arrays links up all entries written to the journal. The
 628 first entry array is referenced in the **entry_array_offset** field of the
 629 header.
 630
 631 Each DATA object also references an entry array chain listing all entries
 632 referencing a specific DATA object. Since many DATA objects are only referenced
 633 by a single ENTRY the first offset of the list is stored inside the DATA object
 634 itself, an ENTRY_ARRAY object is only needed if it is referenced by more than
 635 one ENTRY.
 636
 637
 638 ## Tag Object
 639
 640 ```c
 641 #define TAG_LENGTH (256/8)
 642
 643 _packed_ struct TagObject {
 644         ObjectHeader object;
 645         le64_t seqnum;
 646         le64_t epoch;
 647         uint8_t tag[TAG_LENGTH]; /* SHA-256 HMAC */
 648 };
 649 ```
 650
 651 Tag objects are used to seal off the journal for alteration. In regular
 652 intervals a tag object is appended to the file. The tag object consists of a
 653 SHA-256 HMAC tag that is calculated from the objects stored in the file since
 654 the last tag was written, or from the beginning if no tag was written yet. The
 655 key for the HMAC is calculated via the externally maintained FSPRG logic for
 656 the epoch that is written into **epoch**. The sequence number **seqnum** is
 657 increased with each tag. When calculating the HMAC of objects header fields
 658 that are volatile are excluded (skipped). More specifically all fields that
 659 might validly be altered to maintain a consistent file structure (such as
 660 offsets to objects added later for the purpose of linked lists and suchlike)
 661 after an object has been written are not protected by the tag. This means a
 662 verifier has to independently check these fields for consistency of
 663 structure. For the fields excluded from the HMAC please consult the source code
 664 directly. A verifier should read the file from the beginning to the end, always
 665 calculating the HMAC for the objects it reads. Each time a tag object is
 666 encountered the HMAC should be verified and restarted. The tag object sequence
 667 numbers need to increase strictly monotonically. Tag objects themselves are
 668 partially protected by the HMAC (i.e. seqnum and epoch is included, the tag
 669 itself not).
 670
 671
 672 ## Algorithms
 673
 674 ### Reading
 675
 676 Given an offset to an entry all data fields are easily found by following the
 677 offsets in the data item array of the entry.
 678
 679 Listing entries without filter is done by traversing the list of entry arrays
 680 starting with the headers' **entry_array_offset** field.
 681
 682 Seeking to an entry by timestamp or sequence number (without any matches) is
 683 done via binary search in the entry arrays starting with the header's
 684 **entry_array_offset** field. Since these arrays double in size as more are
 685 added the time cost of seeking is O(log(n)*log(n)) if n is the number of
 686 entries in the file.
 687
 688 When seeking or listing with one field match applied the DATA object of the
 689 match is first identified, and then its data entry array chain traversed. The
 690 time cost is the same as for seeks/listings with no match.
 691
 692 If multiple matches are applied, multiple chains of entry arrays should be
 693 traversed in parallel. Since they all are strictly monotonically ordered by
 694 offset of the entries, advancing in one can be directly applied to the others,
 695 until an entry matching all matches is found. In the worst case seeking like
 696 this is O(n) where n is the number of matching entries of the "loosest" match,
 697 but in the common case should be much more efficient at least for the
 698 well-known fields, where the set of possible field values tend to be closely
 699 related. Checking whether an entry matches a number of matches is efficient
 700 since the item array of the entry contains hashes of all data fields
 701 referenced, and the number of data fields of an entry is generally small (<
 702 30).
 703
 704 When interleaving multiple journal files seeking tends to be a frequently used
 705 operation, but in this case can be effectively suppressed by caching results
 706 from previous entries.
 707
 708 When listing all possible values a certain field can take it is sufficient to
 709 look up the FIELD object and follow the chain of links to all DATA it includes.
 710
 711 ### Writing
 712
 713 When an entry is appended to the journal, for each of its data fields the data
 714 hash table should be checked. If the data field does not yet exist in the file,
 715 it should be appended and added to the data hash table. When a data field's data
 716 object is added, the field hash table should be checked for the field name of
 717 the data field, and a field object be added if necessary. After all data fields
 718 (and recursively all field names) of the new entry are appended and linked up
 719 in the hashtables, the entry object should be appended and linked up too.
 720
 721 At regular intervals a tag object should be written if sealing is enabled (see
 722 above). Before the file is closed a tag should be written too, to seal it off.
 723
 724 Before writing an object, time and disk space limits should be checked and
 725 rotation triggered if necessary.
 726
 727
 728 ## Optimizing Disk IO
 729
 730 _A few general ideas to keep in mind:_
 731
 732 The hash tables for looking up fields and data should be quickly in the memory
 733 cache and not hurt performance. All entries and entry arrays are ordered
 734 strictly by time on disk, and hence should expose an OK access pattern on
 735 rotating media, when read sequentially (which should be the most common case,
 736 given the nature of log data).
 737
 738 The disk access patterns of the binary search for entries needed for seeking
 739 are problematic on rotating disks. This should not be a major issue though,
 740 since seeking should not be a frequent operation.
 741
 742 When reading, collecting data fields for presenting entries to the user is
 743 problematic on rotating disks. In order to optimize these patterns the item
 744 array of entry objects should be sorted by disk offset before
 745 writing. Effectively, frequently used data objects should be in the memory
 746 cache quickly. Non-frequently used data objects are likely to be located
 747 between the previous and current entry when reading and hence should expose an
 748 OK access pattern. Problematic are data objects that are neither frequently nor
 749 infrequently referenced, which will cost seek time.
 750
 751 And that's all there is to it.
 752
 753 Thanks for your interest!