docs/JOURNAL_FILE_FORMAT.md

   1 ---
   2 title: Journal File Format
   3 category: Interfaces
   4 layout: default
   5 ---
   6
   7 # Journal File Format
   8
   9 _Note that this document describes the binary on-disk format of journals
  10 only. For interfacing with web technologies there's the [Journal JSON
  11 Format](http://www.freedesktop.org/wiki/Software/systemd/json). For transfer
  12 of journal data across the network there's the [Journal Export
  13 Format](http://www.freedesktop.org/wiki/Software/systemd/export)._
  14
  15 The systemd journal stores log data in a binary format with several features:
  16
  17 * Fully indexed by all fields
  18 * Can store binary data, up to 2^64-1 in size
  19 * Seekable
  20 * Primarily append-based, hence robust to corruption
  21 * Support for in-line compression
  22 * Support for in-line Forward Secure Sealing
  23
  24 This document explains the basic structure of the file format on disk. We are
  25 making this available primarily to allow review and provide documentation. Note
  26 that the actual implementation in the [systemd
  27 codebase](https://github.com/systemd/systemd/blob/master/src/journal/) is the
  28 only ultimately authoritative description of the format, so if this document
  29 and the code disagree, the code is right. That said we'll of course try hard to
  30 keep this document up-to-date and accurate.
  31
  32 Instead of implementing your own reader or writer for journal files we ask you
  33 to use the [Journal's native C
  34 API](http://www.freedesktop.org/software/systemd/man/sd-journal.html) to access
  35 these files. It provides you with full access to the files, and will not
  36 withhold any data. If you find a limitation, please ping us and we might add
  37 some additional interfaces for you.
  38
  39 If you need access to the raw journal data in serialized stream form without C
  40 API our recommendation is to make use of the [Journal Export
  41 Format](http://www.freedesktop.org/wiki/Software/systemd/export), which you can
  42 get via "journalctl -o export" or via systemd-journal-gatewayd. The export
  43 format is much simpler to parse, but complete and accurate. Due to its
  44 stream-based nature it is not indexed.
  45
  46 _Or, to put this in other words: this low-level document is probably not what
  47 you want to use as base of your project. You want our [C
  48 API](http://www.freedesktop.org/software/systemd/man/sd-journal.html) instead!
  49 And if you really don't want the C API, then you want the [Journal Export
  50 Format](http://www.freedesktop.org/wiki/Software/systemd/export) instead! This
  51 document is primarily for your entertainment and education. Thank you!_
  52
  53 This document assumes you have a basic understanding of the journal concepts,
  54 the properties of a journal entry and so on. If not, please go and read up,
  55 then come back! This is a good opportunity to read about the [basic properties
  56 of journal
  57 entries](http://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html),
  58 in particular realize that they may include binary non-text data (though
  59 usually don't), and the same field might have multiple values assigned within
  60 the same entry.
  61
  62 This document describes the current format of systemd 246. The documented
  63 format is compatible with the format used in the first versions of the journal,
  64 but received various compatible and incompatible additions since.
  65
  66 If you are wondering why the journal file format has been created in the first
  67 place instead of adopting an existing database implementation, please have a
  68 look [at this
  69 thread](https://lists.freedesktop.org/archives/systemd-devel/2012-October/007054.html).
  70
  71
  72 ## Basics
  73
  74 * All offsets, sizes, time values, hashes (and most other numeric values) are 64bit unsigned integers in LE format.
  75 * Offsets are always relative to the beginning of the file.
  76 * The 64bit hash function siphash24 is used for newer journal files. For older files [Jenkins lookup3](https://en.wikipedia.org/wiki/Jenkins_hash_function) is used, more specifically `jenkins_hashlittle2()` with the first 32bit integer it returns as higher 32bit part of the 64bit value, and the second one uses as lower 32bit part.
  77 * All structures are aligned to 64bit boundaries and padded to multiples of 64bit
  78 * The format is designed to be read and written via memory mapping using multiple mapped windows.
  79 * All time values are stored in usec since the respective epoch.
  80 * Wall clock time values are relative to the Unix time epoch, i.e. January 1st, 1970. (`CLOCK_REALTIME`)
  81 * Monotonic time values are always stored jointly with the kernel boot ID value (i.e. `/proc/sys/kernel/random/boot_id`) they belong to. They tend to be relative to the start of the boot, but aren't for containers. (`CLOCK_MONOTONIC`)
  82 * Randomized, unique 128bit IDs are used in various locations. These are generally UUID v4 compatible, but this is not a requirement.
  83
  84 ## General Rules
  85
  86 If any kind of corruption is noticed by a writer it should immediately rotate
  87 the file and start a new one. No further writes should be attempted to the
  88 original file, but it should be left around so that as little data as possible
  89 is lost.
  90
  91 If any kind of corruption is noticed by a reader it should try hard to handle
  92 this gracefully, such as skipping over the corrupted data, but allowing access
  93 to as much data around it as possible.
  94
  95 A reader should verify all offsets and other data as it reads it. This includes
  96 checking for alignment and range of offsets in the file, especially before
  97 trying to read it via a memory map.
  98
  99 A reader must interleave rotated and corrupted files as good as possible and
 100 present them as single stream to the user.
 101
 102 All fields marked as "reserved" must be initialized with 0 when writing and be
 103 ignored on reading. They are currently not used but might be used later on.
 104
 105
 106 ## Structure
 107
 108 The file format's data structures are declared in
 109 [journal-def.h](https://github.com/systemd/systemd/blob/master/src/journal/journal-def.h).
 110
 111 The file format begins with a header structure. After the header structure
 112 object structures follow. Objects are appended to the end as time
 113 progresses. Most data stored in these objects is not altered anymore after
 114 having been written once, with the exception of records necessary for
 115 indexing. When new data is appended to a file the writer first writes all new
 116 objects to the end of the file, and then links them up at front after that's
 117 done. Currently, seven different object types are known:
 118
 119 ```c
 120 enum {
 121         OBJECT_UNUSED,
 122         OBJECT_DATA,
 123         OBJECT_FIELD,
 124         OBJECT_ENTRY,
 125         OBJECT_DATA_HASH_TABLE,
 126         OBJECT_FIELD_HASH_TABLE,
 127         OBJECT_ENTRY_ARRAY,
 128         OBJECT_TAG,
 129         _OBJECT_TYPE_MAX
 130 };
 131 ```
 132
 133 * A **DATA** object, which encapsulates the contents of one field of an entry, i.e. a string such as `_SYSTEMD_UNIT=avahi-daemon.service`, or `MESSAGE=Foobar made a booboo.` but possibly including large or binary data, and always prefixed by the field name and "=".
 134 * A **FIELD** object, which encapsulates a field name, i.e. a string such as `_SYSTEMD_UNIT` or `MESSAGE`, without any `=` or even value.
 135 * An **ENTRY** object, which binds several **DATA** objects together into a log entry.
 136 * A **DATA_HASH_TABLE** object, which encapsulates a hash table for finding existing **DATA** objects.
 137 * A **FIELD_HASH_TABLE** object, which encapsulates a hash table for finding existing **FIELD** objects.
 138 * An **ENTRY_ARRAY** object, which encapsulates a sorted array of offsets to entries, used for seeking by binary search.
 139 * A **TAG** object, consisting of an FSS sealing tag for all data from the beginning of the file or the last tag written (whichever is later).
 140
 141 ## Header
 142
 143 The Header struct defines, well, you guessed it, the file header:
 144
 145 ```c
 146 _packed_ struct Header {
 147         uint8_t signature[8]; /* "LPKSHHRH" */
 148         le32_t compatible_flags;
 149         le32_t incompatible_flags;
 150         uint8_t state;
 151         uint8_t reserved[7];
 152         sd_id128_t file_id;
 153         sd_id128_t machine_id;
 154         sd_id128_t boot_id;    /* last writer */
 155         sd_id128_t seqnum_id;
 156         le64_t header_size;
 157         le64_t arena_size;
 158         le64_t data_hash_table_offset;
 159         le64_t data_hash_table_size;
 160         le64_t field_hash_table_offset;
 161         le64_t field_hash_table_size;
 162         le64_t tail_object_offset;
 163         le64_t n_objects;
 164         le64_t n_entries;
 165         le64_t tail_entry_seqnum;
 166         le64_t head_entry_seqnum;
 167         le64_t entry_array_offset;
 168         le64_t head_entry_realtime;
 169         le64_t tail_entry_realtime;
 170         le64_t tail_entry_monotonic;
 171         /* Added in 187 */
 172         le64_t n_data;
 173         le64_t n_fields;
 174         /* Added in 189 */
 175         le64_t n_tags;
 176         le64_t n_entry_arrays;
 177         /* Added in 246 */
 178         le64_t data_hash_chain_depth;
 179         le64_t field_hash_chain_depth;
 180 };
 181 ```
 182
 183 The first 8 bytes of Journal files must contain the ASCII characters `LPKSHHRH`.
 184
 185 If a writer finds that the **machine_id** of a file to write to does not match
 186 the machine it is running on it should immediately rotate the file and start a
 187 new one.
 188
 189 When journal file is first created the **file_id** is randomly and uniquely
 190 initialized.
 191
 192 When a writer opens a file it shall initialize the **boot_id** to the current
 193 boot id of the system.
 194
 195 The currently used part of the file is the **header_size** plus the
 196 **arena_size** field of the header. If a writer needs to write to a file where
 197 the actual file size on disk is smaller than the reported value it shall
 198 immediately rotate the file and start a new one. If a writer is asked to write
 199 to a file with a header that is shorter than his own definition of the struct
 200 Header, he shall immediately rotate the file and start a new one.
 201
 202 The **n_objects** field contains a counter for objects currently available in
 203 this file. As objects are appended to the end of the file this counter is
 204 increased.
 205
 206 The first object in the file starts immediately after the header. The last
 207 object in the file is at the offset **tail_object_offset**, which may be 0 if
 208 no object is in the file yet.
 209
 210 The **n_entries**, **n_data**, **n_fields**, **n_tags**, **n_entry_arrays** are
 211 counters of the objects of the specific types.
 212
 213 **tail_entry_seqnum** and **head_entry_seqnum** contain the sequential number
 214 (see below) of the last or first entry in the file, respectively, or 0 if no
 215 entry has been written yet.
 216
 217 **tail_entry_realtime** and **head_entry_realtime** contain the wallclock
 218 timestamp of the last or first entry in the file, respectively, or 0 if no
 219 entry has been written yet.
 220
 221 **tail_entry_monotonic** is the monotonic timestamp of the last entry in the
 222 file, referring to monotonic time of the boot identified by **boot_id**.
 223
 224 **data_hash_chain_depth** is a counter of the deepest chain in the data hash
 225 table, minus one. This is updated whenever a chain is found that is longer than
 226 the previous deepest chain found. Note that the counter is updated during hash
 227 table lookups, as the chains are traversed. This counter is used to determine
 228 when it is a good time to rotate the journal file, because hash collisions
 229 became too frequent.
 230
 231 Similar, **field_hash_chain_depth** is a counter of the deepest chain in the
 232 field hash table, minus one.
 233
 234
 235 ## Extensibility
 236
 237 The format is supposed to be extensible in order to enable future additions of
 238 features. Readers should simply skip objects of unknown types as they read
 239 them. If a compatible feature extension is made a new bit is registered in the
 240 header's **compatible_flags** field. If a feature extension is used that makes
 241 the format incompatible a new bit is registered in the header's
 242 **incompatible_flags** field. Readers should check these two bit fields, if
 243 they find a flag they don't understand in compatible_flags they should continue
 244 to read the file, but if they find one in **incompatible_flags** they should
 245 fail, asking for an update of the software. Writers should refuse writing if
 246 there's an unknown bit flag in either of these fields.
 247
 248 The file header may be extended as new features are added. The size of the file
 249 header is stored in the header. All header fields up to **n_data** are known to
 250 unconditionally exist in all revisions of the file format, all fields starting
 251 with **n_data** needs to be explicitly checked for via a size check, since they
 252 were additions after the initial release.
 253
 254 Currently only five extensions flagged in the flags fields are known:
 255
 256 ```c
 257 enum {
 258         HEADER_INCOMPATIBLE_COMPRESSED_XZ   = 1 << 0,
 259         HEADER_INCOMPATIBLE_COMPRESSED_LZ4  = 1 << 1,
 260         HEADER_INCOMPATIBLE_KEYED_HASH      = 1 << 2,
 261         HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3,
 262 };
 263
 264 enum {
 265         HEADER_COMPATIBLE_SEALED = 1 << 0,
 266 };
 267 ```
 268
 269 HEADER_INCOMPATIBLE_COMPRESSED_XZ indicates that the file includes DATA objects
 270 that are compressed using XZ. Similarly, HEADER_INCOMPATIBLE_COMPRESSED_LZ4
 271 indicates that the file includes DATA objects that are compressed with the LZ4
 272 algorithm. And HEADER_INCOMPATIBLE_COMPRESSED_ZSTD indicates that there are
 273 objects compressed with ZSTD.
 274
 275 HEADER_INCOMPATIBLE_KEYED_HASH indicates that instead of the unkeyed Jenkins
 276 hash function the keyed siphash24 hash function is used for the two hash
 277 tables, see below.
 278
 279 HEADER_COMPATIBLE_SEALED indicates that the file includes TAG objects required
 280 for Forward Secure Sealing.
 281
 282
 283 ## Dirty Detection
 284
 285 ```c
 286 enum {
 287         STATE_OFFLINE = 0,
 288         STATE_ONLINE = 1,
 289         STATE_ARCHIVED = 2,
 290         _STATE_MAX
 291 };
 292 ```
 293
 294 If a file is opened for writing the **state** field should be set to
 295 STATE_ONLINE. If a file is closed after writing the **state** field should be
 296 set to STATE_OFFLINE. After a file has been rotated it should be set to
 297 STATE_ARCHIVED. If a writer is asked to write to a file that is not in
 298 STATE_OFFLINE it should immediately rotate the file and start a new one,
 299 without changing the file.
 300
 301 After and before the state field is changed `fdatasync()` should be executed on
 302 the file to ensure the dirty state hits disk.
 303
 304
 305 ## Sequence Numbers
 306
 307 All entries carry sequence numbers that are monotonically counted up for each
 308 entry (starting at 1) and are unique among all files which carry the same
 309 **seqnum_id** field. This field is randomly generated when the journal daemon
 310 creates its first file. All files generated by the same journal daemon instance
 311 should hence carry the same seqnum_id. This should guarantee a monotonic stream
 312 of sequential numbers for easy interleaving even if entries are distributed
 313 among several files, such as the system journal and many per-user journals.
 314
 315
 316 ## Concurrency
 317
 318 The file format is designed to be usable in a simultaneous
 319 single-writer/multiple-reader scenario. The synchronization model is very weak
 320 in order to facilitate storage on the most basic of file systems (well, the
 321 most basic ones that provide us with `mmap()` that is), and allow good
 322 performance. No file locking is used. The only time where disk synchronization
 323 via `fdatasync()` should be enforced is after and before changing the **state**
 324 field in the file header (see below). It is recommended to execute a memory
 325 barrier after appending and initializing new objects at the end of the file,
 326 and before linking them up in the earlier objects.
 327
 328 This weak synchronization model means that it is crucial that readers verify
 329 the structural integrity of the file as they read it and handle invalid
 330 structure gracefully. (Checking what you read is a pretty good idea out of
 331 security considerations anyway.) This specifically includes checking offset
 332 values, and that they point to valid objects, with valid sizes and of the type
 333 and hash value expected. All code must be written with the fact in mind that a
 334 file with inconsistent structure might just be inconsistent temporarily, and
 335 might become consistent later on. Payload OTOH requires less scrutiny, as it
 336 should only be linked up (and hence visible to readers) after it was
 337 successfully written to memory (though not necessarily to disk). On non-local
 338 file systems it is a good idea to verify the payload hashes when reading, in
 339 order to avoid annoyances with `mmap()` inconsistencies.
 340
 341 Clients intending to show a live view of the journal should use `inotify()` for
 342 this to watch for files changes. Since file writes done via `mmap()` do not
 343 result in `inotify()` writers shall truncate the file to its current size after
 344 writing one or more entries, which results in inotify events being
 345 generated. Note that this is not used as a transaction scheme (it doesn't
 346 protect anything), but merely for triggering wakeups.
 347
 348 Note that inotify will not work on network file systems if reader and writer
 349 reside on different hosts. Readers which detect they are run on journal files
 350 on a non-local file system should hence not rely on inotify for live views but
 351 fall back to simple time based polling of the files (maybe recheck every 2s).
 352
 353
 354 ## Objects
 355
 356 All objects carry a common header:
 357
 358 ```c
 359 enum {
 360         OBJECT_COMPRESSED_XZ   = 1 << 0,
 361         OBJECT_COMPRESSED_LZ4  = 1 << 1,
 362         OBJECT_COMPRESSED_ZSTD = 1 << 2,
 363 };
 364
 365 _packed_ struct ObjectHeader {
 366         uint8_t type;
 367         uint8_t flags;
 368         uint8_t reserved[6];
 369         le64_t size;
 370         uint8_t payload[];
 371 };
 372 ```
 373
 374 The **type** field is one of the object types listed above. The **flags** field
 375 currently knows three flags: OBJECT_COMPRESSED_XZ, OBJECT_COMPRESSED_LZ4 and
 376 OBJECT_COMPRESSED_ZSTD. It is only valid for DATA objects and indicates that
 377 the data payload is compressed with XZ/LZ4/ZSTD. If one of the
 378 OBJECT_COMPRESSED_* flags is set for an object then the matching
 379 HEADER_INCOMPATIBLE_COMPRESSED_XZ/HEADER_INCOMPATIBLE_COMPRESSED_LZ4/HEADER_INCOMPATIBLE_COMPRESSED_ZSTD
 380 flag must be set for the file as well. At most one of these three bits may be
 381 set. The **size** field encodes the size of the object including all its
 382 headers and payload.
 383
 384
 385 ## Data Objects
 386
 387 ```c
 388 _packed_ struct DataObject {
 389         ObjectHeader object;
 390         le64_t hash;
 391         le64_t next_hash_offset;
 392         le64_t next_field_offset;
 393         le64_t entry_offset; /* the first array entry we store inline */
 394         le64_t entry_array_offset;
 395         le64_t n_entries;
 396         uint8_t payload[];
 397 };
 398 ```
 399
 400 Data objects carry actual field data in the **payload[]** array, including a
 401 field name, a `=` and the field data. Example:
 402 `_SYSTEMD_UNIT=foobar.service`. The **hash** field is a hash value of the
 403 payload. If the `HEADER_INCOMPATIBLE_KEYED_HASH` flag is set in the file header
 404 this is the siphash24 hash value of the payload, keyed by the file ID as stored
 405 in the **file_id** field of the file header. If the flag is not set it is the
 406 non-keyed Jenkins hash of the payload instead. The keyed hash is preferred as
 407 it makes the format more robust against attackers that want to trigger hash
 408 collisions in the hash table.
 409
 410 **next_hash_offset** is used to link up DATA objects in the DATA_HASH_TABLE if
 411 a hash collision happens (in a singly linked list, with an offset of 0
 412 indicating the end). **next_field_offset** is used to link up data objects with
 413 the same field name from the FIELD object of the field used.
 414
 415 **entry_offset** is an offset to the first ENTRY object referring to this DATA
 416 object. **entry_array_offset** is an offset to an ENTRY_ARRAY object with
 417 offsets to other entries referencing this DATA object. Storing the offset to
 418 the first ENTRY object in-line is an optimization given that many DATA objects
 419 will be referenced from a single entry only (for example, `MESSAGE=` frequently
 420 includes a practically unique string). **n_entries** is a counter of the total
 421 number of ENTRY objects that reference this object, i.e. the sum of all
 422 ENTRY_ARRAYS chained up from this object, plus 1.
 423
 424 The **payload[]** field contains the field name and date unencoded, unless
 425 OBJECT_COMPRESSED_XZ/OBJECT_COMPRESSED_LZ4/OBJECT_COMPRESSED_ZSTD is set in the
 426 `ObjectHeader`, in which case the payload is compressed with the indicated
 427 compression algorithm.
 428
 429
 430 ## Field Objects
 431
 432 ```c
 433 _packed_ struct FieldObject {
 434         ObjectHeader object;
 435         le64_t hash;
 436         le64_t next_hash_offset;
 437         le64_t head_data_offset;
 438         uint8_t payload[];
 439 };
 440 ```
 441
 442 Field objects are used to enumerate all possible values a certain field name
 443 can take in the entire journal file.
 444
 445 The **payload[]** array contains the actual field name, without '=' or any
 446 field value. Example: `_SYSTEMD_UNIT`. The **hash** field is a hash value of
 447 the payload. As for the DATA objects, this too is either the `.file_id` keyed
 448 siphash24 hash of the payload, or the non-keyed Jenkins hash.
 449
 450 **next_hash_offset** is used to link up FIELD objects in the FIELD_HASH_TABLE
 451 if a hash collision happens (in singly linked list, offset 0 indicating the
 452 end). **head_data_offset** points to the first DATA object that shares this
 453 field name. It is the head of a singly linked list using DATA's
 454 **next_field_offset** offset.
 455
 456
 457 ## Entry Objects
 458
 459 ```
 460 _packed_ struct EntryItem {
 461         le64_t object_offset;
 462         le64_t hash;
 463 };
 464
 465 _packed_ struct EntryObject {
 466         ObjectHeader object;
 467         le64_t seqnum;
 468         le64_t realtime;
 469         le64_t monotonic;
 470         sd_id128_t boot_id;
 471         le64_t xor_hash;
 472         EntryItem items[];
 473 };
 474 ```
 475
 476 An ENTRY object binds several DATA objects together into one log entry, and
 477 includes other metadata such as various timestamps.
 478
 479 The **seqnum** field contains the sequence number of the entry, **realtime**
 480 the realtime timestamp, and **monotonic** the monotonic timestamp for the boot
 481 identified by **boot_id**.
 482
 483 The **xor_hash** field contains a binary XOR of the hashes of the payload of
 484 all DATA objects referenced by this ENTRY. This value is usable to check the
 485 contents of the entry, being independent of the order of the DATA objects in
 486 the array. Note that even for files that have the
 487 `HEADER_INCOMPATIBLE_KEYED_HASH` flag set (and thus siphash24 the otherwise
 488 used hash function) the hash function used for this field, as singular
 489 exception, is the Jenkins lookup3 hash function. The XOR hash value is used to
 490 quickly compare the contents of two entries, and to define a well-defined order
 491 between two entries that otherwise have the same sequence numbers and
 492 timestamps.
 493
 494 The **items[]** array contains references to all DATA objects of this entry,
 495 plus their respective hashes (which are calculated the same way as in the DATA
 496 objects, i.e. keyed by the file ID).
 497
 498 In the file ENTRY objects are written ordered monotonically by sequence
 499 number. For continuous parts of the file written during the same boot
 500 (i.e. with the same boot_id) the monotonic timestamp is monotonic too. Modulo
 501 wallclock time jumps (due to incorrect clocks being corrected) the realtime
 502 timestamps are monotonic too.
 503
 504
 505 ## Hash Table Objects
 506
 507 ```c
 508 _packed_ struct HashItem {
 509         le64_t head_hash_offset;
 510         le64_t tail_hash_offset;
 511 };
 512
 513 _packed_ struct HashTableObject {
 514         ObjectHeader object;
 515         HashItem items[];
 516 };
 517 ```
 518
 519 The structure of both DATA_HASH_TABLE and FIELD_HASH_TABLE objects are
 520 identical. They implement a simple hash table, which each cell containing
 521 offsets to the head and tail of the singly linked list of the DATA and FIELD
 522 objects, respectively. DATA's and FIELD's next_hash_offset field are used to
 523 chain up the objects. Empty cells have both offsets set to 0.
 524
 525 Each file contains exactly one DATA_HASH_TABLE and one FIELD_HASH_TABLE
 526 objects. Their payload is directly referred to by the file header in the
 527 **data_hash_table_offset**, **data_hash_table_size**,
 528 **field_hash_table_offset**, **field_hash_table_size** fields. These offsets do
 529 _not_ point to the object headers but directly to the payloads. When a new
 530 journal file is created the two hash table objects need to be created right
 531 away as first two objects in the stream.
 532
 533 If the hash table fill level is increasing over a certain fill level (Learning
 534 from Java's Hashtable for example: > 75%), the writer should rotate the file
 535 and create a new one.
 536
 537 The DATA_HASH_TABLE should be sized taking into account to the maximum size the
 538 file is expected to grow, as configured by the administrator or disk space
 539 considerations. The FIELD_HASH_TABLE should be sized to a fixed size; the
 540 number of fields should be pretty static as it depends only on developers'
 541 creativity rather than runtime parameters.
 542
 543
 544 ## Entry Array Objects
 545
 546
 547 ```c
 548 _packed_ struct EntryArrayObject {
 549         ObjectHeader object;
 550         le64_t next_entry_array_offset;
 551         le64_t items[];
 552 };
 553 ```
 554
 555 Entry Arrays are used to store a sorted array of offsets to entries. Entry
 556 arrays are strictly sorted by offsets on disk, and hence by their timestamps
 557 and sequence numbers (with some restrictions, see above).
 558
 559 Entry Arrays are chained up. If one entry array is full another one is
 560 allocated and the **next_entry_array_offset** field of the old one pointed to
 561 it. An Entry Array with **next_entry_array_offset** set to 0 is the last in the
 562 list. To optimize allocation and seeking, as entry arrays are appended to a
 563 chain of entry arrays they should increase in size (double).
 564
 565 Due to being monotonically ordered entry arrays may be searched with a binary
 566 search (bisection).
 567
 568 One chain of entry arrays links up all entries written to the journal. The
 569 first entry array is referenced in the **entry_array_offset** field of the
 570 header.
 571
 572 Each DATA object also references an entry array chain listing all entries
 573 referencing a specific DATA object. Since many DATA objects are only referenced
 574 by a single ENTRY the first offset of the list is stored inside the DATA object
 575 itself, an ENTRY_ARRAY object is only needed if it is referenced by more than
 576 one ENTRY.
 577
 578
 579 ## Tag Object
 580
 581 ```c
 582 #define TAG_LENGTH (256/8)
 583
 584 _packed_ struct TagObject {
 585         ObjectHeader object;
 586         le64_t seqnum;
 587         le64_t epoch;
 588         uint8_t tag[TAG_LENGTH]; /* SHA-256 HMAC */
 589 };
 590 ```
 591
 592 Tag objects are used to seal off the journal for alteration. In regular
 593 intervals a tag object is appended to the file. The tag object consists of a
 594 SHA-256 HMAC tag that is calculated from the objects stored in the file since
 595 the last tag was written, or from the beginning if no tag was written yet. The
 596 key for the HMAC is calculated via the externally maintained FSPRG logic for
 597 the epoch that is written into **epoch**. The sequence number **seqnum** is
 598 increased with each tag. When calculating the HMAC of objects header fields
 599 that are volatile are excluded (skipped). More specifically all fields that
 600 might validly be altered to maintain a consistent file structure (such as
 601 offsets to objects added later for the purpose of linked lists and suchlike)
 602 after an object has been written are not protected by the tag. This means a
 603 verifier has to independently check these fields for consistency of
 604 structure. For the fields excluded from the HMAC please consult the source code
 605 directly. A verifier should read the file from the beginning to the end, always
 606 calculating the HMAC for the objects it reads. Each time a tag object is
 607 encountered the HMAC should be verified and restarted. The tag object sequence
 608 numbers need to increase strictly monotonically. Tag objects themselves are
 609 partially protected by the HMAC (i.e. seqnum and epoch is included, the tag
 610 itself not).
 611
 612
 613 ## Algorithms
 614
 615 ### Reading
 616
 617 Given an offset to an entry all data fields are easily found by following the
 618 offsets in the data item array of the entry.
 619
 620 Listing entries without filter is done by traversing the list of entry arrays
 621 starting with the headers' **entry_array_offset** field.
 622
 623 Seeking to an entry by timestamp or sequence number (without any matches) is
 624 done via binary search in the entry arrays starting with the header's
 625 **entry_array_offset** field. Since these arrays double in size as more are
 626 added the time cost of seeking is O(log(n)*log(n)) if n is the number of
 627 entries in the file.
 628
 629 When seeking or listing with one field match applied the DATA object of the
 630 match is first identified, and then its data entry array chain traversed. The
 631 time cost is the same as for seeks/listings with no match.
 632
 633 If multiple matches are applied, multiple chains of entry arrays should be
 634 traversed in parallel. Since they all are strictly monotonically ordered by
 635 offset of the entries, advancing in one can be directly applied to the others,
 636 until an entry matching all matches is found. In the worst case seeking like
 637 this is O(n) where n is the number of matching entries of the "loosest" match,
 638 but in the common case should be much more efficient at least for the
 639 well-known fields, where the set of possible field values tend to be closely
 640 related. Checking whether an entry matches a number of matches is efficient
 641 since the item array of the entry contains hashes of all data fields
 642 referenced, and the number of data fields of an entry is generally small (<
 643 30).
 644
 645 When interleaving multiple journal files seeking tends to be a frequently used
 646 operation, but in this case can be effectively suppressed by caching results
 647 from previous entries.
 648
 649 When listing all possible values a certain field can take it is sufficient to
 650 look up the FIELD object and follow the chain of links to all DATA it includes.
 651
 652 ### Writing
 653
 654 When an entry is appended to the journal for each of its data fields the data
 655 hash table should be checked. If the data field does not yet exist in the file
 656 it should be appended and added to the data hash table. When a field data
 657 object is added the field hash table should be checked for the field name of
 658 the data field, and a field object be added if necessary. After all data fields
 659 (and recursively all field names) of the new entry are appended and linked up
 660 in the hashtables the entry object should be appended and linked up too.
 661
 662 In regular intervals a tag object should be written if sealing is enabled (see
 663 above). Before the file is closed a tag should be written too, to seal it off.
 664
 665 Before writing an object, time and disk space limits should be checked and
 666 rotation triggered if necessary.
 667
 668
 669 ## Optimizing Disk IO
 670
 671 _A few general ideas to keep in mind:_
 672
 673 The hash tables for looking up fields and data should be quickly in the memory
 674 cache and not hurt performance. All entries and entry arrays are ordered
 675 strictly by time on disk, and hence should expose an OK access pattern on
 676 rotating media, when read sequentially (which should be the most common case,
 677 given the nature of log data).
 678
 679 The disk access patterns of the binary search for entries needed for seeking
 680 are problematic on rotating disks. This should not be a major issue though,
 681 since seeking should not be a frequent operation.
 682
 683 When reading, collecting data fields for presenting entries to the user is
 684 problematic on rotating disks. In order to optimize these patterns the item
 685 array of entry objects should be sorted by disk offset before
 686 writing. Effectively, frequently used data objects should be in the memory
 687 cache quickly. Non-frequently used data objects are likely to be located
 688 between the previous and current entry when reading and hence should expose an
 689 OK access pattern. Problematic are data objects that are neither frequently nor
 690 infrequently referenced, which will cost seek time.
 691
 692 And that's all there is to it.
 693
 694 Thanks for your interest!