---
title: Journal File Format
category: Interfaces
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---

# Journal File Format

_Note that this document describes the binary on-disk format of journals only.
For interfacing with web technologies there's the [Journal JSON Format](JOURNAL_EXPORT_FORMATS#journal-json-format).
For transfer of journal data across the network there's the
[Journal Export Format](JOURNAL_EXPORT_FORMATS#journal-export-format)._

The systemd journal stores log data in a binary format with several features:

* Fully indexed by all fields
* Can store binary data, up to 2^64-1 bytes in size
* Seekable
* Primarily append-based, hence robust to corruption
* Support for in-line compression
* Support for in-line Forward Secure Sealing

This document explains the basic structure of the file format on disk.
We are making this available primarily to allow review and provide documentation.
Note that the actual implementation in the
[systemd codebase](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/)
is the only ultimately authoritative description of the format,
so if this document and the code disagree, the code is right.
That said, we'll of course try hard to keep this document up-to-date and accurate.

Instead of implementing your own reader or writer for journal files we ask you to use the
[Journal's native C API](https://www.freedesktop.org/software/systemd/man/sd-journal.html)
to access these files.
It provides you with full access to the files, and will not withhold any data.
If you find a limitation, please ping us and we might add some additional interfaces for you.

If you need access to the raw journal data in serialized stream form without a C API, our recommendation is to make use of the
[Journal Export Format](JOURNAL_EXPORT_FORMATS#journal-export-format),
which you can get via `journalctl -o export` or via `systemd-journal-gatewayd`.
The export format is much simpler to parse, but complete and accurate.
Due to its stream-based nature it is not indexed.

_Or, to put this in other words: this low-level document is probably not what you want to use as the basis of your project.
You want our [C API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) instead!
And if you really don't want the C API, then you want the
[Journal Export Format or Journal JSON Format](/JOURNAL_EXPORT_FORMATS) instead!
This document is primarily for your entertainment and education.
Thank you!_

This document assumes you have a basic understanding of the journal concepts, the properties of a journal entry and so on.
If not, please go and read up, then come back!
This is a good opportunity to read about the
[basic properties of journal entries](https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html),
in particular realize that they may include binary non-text data (though usually don't),
and the same field might have multiple values assigned within the same entry.

This document describes the current format of systemd 246.
The documented format is compatible with the format used in the first versions of the journal,
but has received various compatible and incompatible additions since.

If you are wondering why the journal file format has been created in the first place instead of adopting an existing database implementation,
please have a look [at this thread](https://lists.freedesktop.org/archives/systemd-devel/2012-October/007054.html).


## Basics

* All offsets, sizes, time values, hashes (and most other numeric values) are 32-bit/64-bit unsigned integers in LE format.
* Offsets are always relative to the beginning of the file.
* The 64-bit hash function siphash24 is used for newer journal files.
  For older files [Jenkins lookup3](https://en.wikipedia.org/wiki/Jenkins_hash_function) is used,
  more specifically `jenkins_hashlittle2()` with the first 32-bit integer it returns as the higher 32-bit part of the 64-bit value,
  and the second one as the lower 32-bit part.
* All structures are aligned to 64-bit boundaries and padded to multiples of 64 bits.
* The format is designed to be read and written via memory mapping using multiple mapped windows.
* All time values are stored in usec since the respective epoch.
* Wall clock time values are relative to the Unix time epoch, i.e. January 1st, 1970. (`CLOCK_REALTIME`)
* Monotonic time values are always stored jointly with the kernel boot ID value (i.e. `/proc/sys/kernel/random/boot_id`) they belong to.
  They tend to be relative to the start of the boot, but aren't for containers. (`CLOCK_MONOTONIC`)
* Randomized, unique 128-bit IDs are used in various locations. These are generally UUID v4 compatible, but this is not a requirement.

## General Rules

If any kind of corruption is noticed by a writer it should immediately rotate
the file and start a new one. No further writes should be attempted to the
original file, but it should be left around so that as little data as possible
is lost.

If any kind of corruption is noticed by a reader it should try hard to handle
this gracefully, such as skipping over the corrupted data, but allowing access
to as much data around it as possible.

A reader should verify all offsets and other data as it reads it. This includes
checking for alignment and range of offsets in the file, especially before
trying to read it via a memory map.

A reader must interleave rotated and corrupted files as well as possible and
present them as a single stream to the user.

All fields marked as "reserved" must be initialized with 0 when writing and be
ignored on reading. They are currently not used but might be used later on.


## Structure

The file format's data structures are declared in
[journal-def.h](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/journal-def.h).

The file format begins with a header structure. After the header structure,
object structures follow. Objects are appended to the end as time
progresses. Most data stored in these objects is not altered anymore after
having been written once, with the exception of records necessary for
indexing. When new data is appended to a file the writer first writes all new
objects to the end of the file, and then links them up at front after that's
done. Currently, seven different object types are known:

```c
enum {
        OBJECT_UNUSED,
        OBJECT_DATA,
        OBJECT_FIELD,
        OBJECT_ENTRY,
        OBJECT_DATA_HASH_TABLE,
        OBJECT_FIELD_HASH_TABLE,
        OBJECT_ENTRY_ARRAY,
        OBJECT_TAG,
        _OBJECT_TYPE_MAX
};
```

* A **DATA** object, which encapsulates the contents of one field of an entry, i.e. a string such as `_SYSTEMD_UNIT=avahi-daemon.service`, or `MESSAGE=Foobar made a booboo.` but possibly including large or binary data, and always prefixed by the field name and "=".
* A **FIELD** object, which encapsulates a field name, i.e. a string such as `_SYSTEMD_UNIT` or `MESSAGE`, without any `=` or even value.
* An **ENTRY** object, which binds several **DATA** objects together into a log entry.
* A **DATA_HASH_TABLE** object, which encapsulates a hash table for finding existing **DATA** objects.
* A **FIELD_HASH_TABLE** object, which encapsulates a hash table for finding existing **FIELD** objects.
* An **ENTRY_ARRAY** object, which encapsulates a sorted array of offsets to entries, used for seeking by binary search.
* A **TAG** object, consisting of an FSS sealing tag for all data from the beginning of the file or the last tag written (whichever is later).

## Header

The Header struct defines, well, you guessed it, the file header:

```c
_packed_ struct Header {
        uint8_t signature[8]; /* "LPKSHHRH" */
        le32_t compatible_flags;
        le32_t incompatible_flags;
        uint8_t state;
        uint8_t reserved[7];
        sd_id128_t file_id;
        sd_id128_t machine_id;
        sd_id128_t tail_entry_boot_id;
        sd_id128_t seqnum_id;
        le64_t header_size;
        le64_t arena_size;
        le64_t data_hash_table_offset;
        le64_t data_hash_table_size;
        le64_t field_hash_table_offset;
        le64_t field_hash_table_size;
        le64_t tail_object_offset;
        le64_t n_objects;
        le64_t n_entries;
        le64_t tail_entry_seqnum;
        le64_t head_entry_seqnum;
        le64_t entry_array_offset;
        le64_t head_entry_realtime;
        le64_t tail_entry_realtime;
        le64_t tail_entry_monotonic;
        /* Added in 187 */
        le64_t n_data;
        le64_t n_fields;
        /* Added in 189 */
        le64_t n_tags;
        le64_t n_entry_arrays;
        /* Added in 246 */
        le64_t data_hash_chain_depth;
        le64_t field_hash_chain_depth;
        /* Added in 252 */
        le32_t tail_entry_array_offset;
        le32_t tail_entry_array_n_entries;
        /* Added in 254 */
        le64_t tail_entry_offset;
};
```

The first 8 bytes of journal files must contain the ASCII characters `LPKSHHRH`.

If a writer finds that the **machine_id** of a file to write to does not match
the machine it is running on it should immediately rotate the file and start a
new one.

When a journal file is first created the **file_id** is randomly and uniquely
initialized.


When a writer creates a file it shall initialize the **tail_entry_boot_id** to
the current boot ID of the system. When appending an entry it shall update the
field to the boot ID of that entry, so that it is guaranteed that the
**tail_entry_monotonic** field refers to a timestamp of the monotonic clock
associated with the boot with the ID indicated by the **tail_entry_boot_id**
field. (Compatibility note: in older versions of the journal, the field was
also supposed to be updated whenever the file was opened for any form of
writing, including when opened to mark it as archived. This behaviour has been
deemed problematic since without an associated boot ID the
**tail_entry_monotonic** field is useless. To indicate that the boot ID is
updated only on append, the HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID flag is set.
If it is not set, the **tail_entry_monotonic** field is not usable.)

The currently used part of the file is the **header_size** plus the
**arena_size** field of the header. If a writer needs to write to a file where
the actual file size on disk is smaller than the reported value it shall
immediately rotate the file and start a new one. If a writer is asked to write
to a file with a header that is shorter than its own definition of the struct
Header, it shall immediately rotate the file and start a new one.

The **n_objects** field contains a counter for objects currently available in
this file. As objects are appended to the end of the file this counter is
increased.

The first object in the file starts immediately after the header. The last
object in the file is at the offset **tail_object_offset**, which may be 0 if
no object is in the file yet.

The **n_entries**, **n_data**, **n_fields**, **n_tags**, **n_entry_arrays** are
counters of the objects of the specific types.

**tail_entry_seqnum** and **head_entry_seqnum** contain the sequential number
(see below) of the last or first entry in the file, respectively, or 0 if no
entry has been written yet.

**tail_entry_realtime** and **head_entry_realtime** contain the wallclock
timestamp of the last or first entry in the file, respectively, or 0 if no
entry has been written yet.

**tail_entry_monotonic** is the monotonic timestamp of the last entry in the
file, referring to monotonic time of the boot identified by
**tail_entry_boot_id**, but only if the
HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID feature flag is set, see above. If it
is not set, this field might refer to a different boot than the one in the
**tail_entry_boot_id** field, for example when the file was ultimately
archived.

**data_hash_chain_depth** is a counter of the deepest chain in the data hash
table, minus one. This is updated whenever a chain is found that is longer than
the previous deepest chain found. Note that the counter is updated during hash
table lookups, as the chains are traversed. This counter is used to determine
when it is a good time to rotate the journal file, because hash collisions
have become too frequent.

Similarly, **field_hash_chain_depth** is a counter of the deepest chain in the
field hash table, minus one.

**tail_entry_array_offset** and **tail_entry_array_n_entries** allow immediate
access to the last entry array in the global entry array chain.

**tail_entry_offset** allows immediate access to the last entry in the journal
file.

## Extensibility

The format is supposed to be extensible in order to enable future additions of
features. Readers should simply skip objects of unknown types as they read
them. If a compatible feature extension is made a new bit is registered in the
header's **compatible_flags** field. If a feature extension is used that makes
the format incompatible a new bit is registered in the header's
**incompatible_flags** field. Readers should check these two bit fields: if
they find a flag they don't understand in **compatible_flags** they should continue
to read the file, but if they find one in **incompatible_flags** they should
fail, asking for an update of the software. Writers should refuse writing if
there's an unknown bit flag in either of these fields.

The file header may be extended as new features are added. The size of the file
header is stored in the header. All header fields up to **n_data** are known to
unconditionally exist in all revisions of the file format; all fields starting
with **n_data** need to be explicitly checked for via a size check, since they
were additions after the initial release.

Currently only seven extensions flagged in the flags fields are known:

```c
enum {
        HEADER_INCOMPATIBLE_COMPRESSED_XZ   = 1 << 0,
        HEADER_INCOMPATIBLE_COMPRESSED_LZ4  = 1 << 1,
        HEADER_INCOMPATIBLE_KEYED_HASH      = 1 << 2,
        HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3,
        HEADER_INCOMPATIBLE_COMPACT         = 1 << 4,
};

enum {
        HEADER_COMPATIBLE_SEALED             = 1 << 0,
        HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID = 1 << 1,
};
```

HEADER_INCOMPATIBLE_COMPRESSED_XZ indicates that the file includes DATA objects
that are compressed using XZ. Similarly, HEADER_INCOMPATIBLE_COMPRESSED_LZ4
indicates that the file includes DATA objects that are compressed with the LZ4
algorithm. And HEADER_INCOMPATIBLE_COMPRESSED_ZSTD indicates that there are
objects compressed with ZSTD.

HEADER_INCOMPATIBLE_KEYED_HASH indicates that instead of the unkeyed Jenkins
hash function the keyed siphash24 hash function is used for the two hash
tables, see below.

HEADER_INCOMPATIBLE_COMPACT indicates that the journal file uses the new binary
format that uses less space on disk compared to the original format.

HEADER_COMPATIBLE_SEALED indicates that the file includes TAG objects required
for Forward Secure Sealing.

HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID indicates whether the
**tail_entry_boot_id** field is strictly updated on initial creation of the
file and whenever an entry is appended (in which case the flag is set), or also
when the file is archived (in which case it is unset). New files should always
set this flag (and thus not update the **tail_entry_boot_id** except when
creating the file and when appending an entry to it).

## Dirty Detection

```c
enum {
        STATE_OFFLINE  = 0,
        STATE_ONLINE   = 1,
        STATE_ARCHIVED = 2,
        _STATE_MAX
};
```

If a file is opened for writing the **state** field should be set to
STATE_ONLINE. If a file is closed after writing the **state** field should be
set to STATE_OFFLINE. After a file has been rotated it should be set to
STATE_ARCHIVED. If a writer is asked to write to a file that is not in
STATE_OFFLINE it should immediately rotate the file and start a new one,
without changing the file.

After and before the state field is changed, `fdatasync()` should be executed on
the file to ensure the dirty state hits disk.

## Sequence Numbers

All entries carry sequence numbers that are monotonically counted up for each
entry (starting at 1) and are unique among all files which carry the same
**seqnum_id** field. This field is randomly generated when the journal daemon
creates its first file. All files generated by the same journal daemon instance
should hence carry the same **seqnum_id**. This should guarantee a monotonic stream
of sequential numbers for easy interleaving even if entries are distributed
among several files, such as the system journal and many per-user journals.


## Concurrency

The file format is designed to be usable in a simultaneous
single-writer/multiple-reader scenario. The synchronization model is very weak
in order to facilitate storage on the most basic of file systems (well, the
most basic ones that provide us with `mmap()` that is), and allow good
performance. No file locking is used. The only time where disk synchronization
via `fdatasync()` should be enforced is after and before changing the **state**
field in the file header (see above). It is recommended to execute a memory
barrier after appending and initializing new objects at the end of the file,
and before linking them up in the earlier objects.

This weak synchronization model means that it is crucial that readers verify
the structural integrity of the file as they read it and handle invalid
structure gracefully. (Checking what you read is a pretty good idea out of
security considerations anyway.) This specifically includes checking offset
values, and that they point to valid objects, with valid sizes and of the type
and hash value expected. All code must be written with the fact in mind that a
file with inconsistent structure might just be inconsistent temporarily, and
might become consistent later on. Payload OTOH requires less scrutiny, as it
should only be linked up (and hence visible to readers) after it was
successfully written to memory (though not necessarily to disk). On non-local
file systems it is a good idea to verify the payload hashes when reading, in
order to avoid annoyances with `mmap()` inconsistencies.

Clients intending to show a live view of the journal should use `inotify()` for
this to watch for file changes. Since file writes done via `mmap()` do not
result in `inotify()` events, writers shall truncate the file to its current size after
writing one or more entries, which results in inotify events being
generated. Note that this is not used as a transaction scheme (it doesn't
protect anything), but merely for triggering wakeups.

Note that inotify will not work on network file systems if reader and writer
reside on different hosts. Readers which detect they are run on journal files
on a non-local file system should hence not rely on inotify for live views but
fall back to simple time-based polling of the files (maybe recheck every 2s).


## Objects

All objects carry a common header:

```c
enum {
        OBJECT_COMPRESSED_XZ   = 1 << 0,
        OBJECT_COMPRESSED_LZ4  = 1 << 1,
        OBJECT_COMPRESSED_ZSTD = 1 << 2,
};

_packed_ struct ObjectHeader {
        uint8_t type;
        uint8_t flags;
        uint8_t reserved[6];
        le64_t size;
        uint8_t payload[];
};
```

The **type** field is one of the object types listed above. The **flags** field
currently knows three flags: OBJECT_COMPRESSED_XZ, OBJECT_COMPRESSED_LZ4 and
OBJECT_COMPRESSED_ZSTD. It is only valid for DATA objects and indicates that
the data payload is compressed with XZ/LZ4/ZSTD. If one of the
OBJECT_COMPRESSED_* flags is set for an object then the matching
HEADER_INCOMPATIBLE_COMPRESSED_XZ/HEADER_INCOMPATIBLE_COMPRESSED_LZ4/HEADER_INCOMPATIBLE_COMPRESSED_ZSTD
flag must be set for the file as well. At most one of these three bits may be
set. The **size** field encodes the size of the object including all its
headers and payload.

## Data Objects

```c
_packed_ struct DataObject {
        ObjectHeader object;
        le64_t hash;
        le64_t next_hash_offset;
        le64_t next_field_offset;
        le64_t entry_offset; /* the first array entry we store inline */
        le64_t entry_array_offset;
        le64_t n_entries;
        union {
                struct {
                        uint8_t payload[];
                } regular;
                struct {
                        le32_t tail_entry_array_offset;
                        le32_t tail_entry_array_n_entries;
                        uint8_t payload[];
                } compact;
        };
};
```

Data objects carry actual field data in the **payload[]** array, including a
field name, a `=` and the field data. Example:
`_SYSTEMD_UNIT=foobar.service`. The **hash** field is a hash value of the
payload. If the `HEADER_INCOMPATIBLE_KEYED_HASH` flag is set in the file header
this is the siphash24 hash value of the payload, keyed by the file ID as stored
in the **file_id** field of the file header. If the flag is not set it is the
non-keyed Jenkins hash of the payload instead. The keyed hash is preferred as
it makes the format more robust against attackers that want to trigger hash
collisions in the hash table.

**next_hash_offset** is used to link up DATA objects in the DATA_HASH_TABLE if
a hash collision happens (in a singly linked list, with an offset of 0
indicating the end). **next_field_offset** is used to link up data objects with
the same field name from the FIELD object of the field used.

**entry_offset** is an offset to the first ENTRY object referring to this DATA
object. **entry_array_offset** is an offset to an ENTRY_ARRAY object with
offsets to other entries referencing this DATA object. Storing the offset to
the first ENTRY object in-line is an optimization given that many DATA objects
will be referenced from a single entry only (for example, `MESSAGE=` frequently
includes a practically unique string). **n_entries** is a counter of the total
number of ENTRY objects that reference this object, i.e. the sum of all
ENTRY_ARRAYS chained up from this object, plus 1.

The **payload[]** field contains the field name and data unencoded, unless
OBJECT_COMPRESSED_XZ/OBJECT_COMPRESSED_LZ4/OBJECT_COMPRESSED_ZSTD is set in the
`ObjectHeader`, in which case the payload is compressed with the indicated
compression algorithm.

If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, two extra fields are stored to
allow immediate access to the tail entry array in the DATA object's entry array
chain.

## Field Objects

```c
_packed_ struct FieldObject {
        ObjectHeader object;
        le64_t hash;
        le64_t next_hash_offset;
        le64_t head_data_offset;
        uint8_t payload[];
};
```

Field objects are used to enumerate all possible values a certain field name
can take in the entire journal file.

The **payload[]** array contains the actual field name, without '=' or any
field value. Example: `_SYSTEMD_UNIT`. The **hash** field is a hash value of
the payload. As for the DATA objects, this too is either the `.file_id` keyed
siphash24 hash of the payload, or the non-keyed Jenkins hash.

**next_hash_offset** is used to link up FIELD objects in the FIELD_HASH_TABLE
if a hash collision happens (in a singly linked list, with offset 0 indicating the
end). **head_data_offset** points to the first DATA object that shares this
field name. It is the head of a singly linked list using DATA's
**next_field_offset** offset.


## Entry Objects

```c
_packed_ struct EntryObject {
        ObjectHeader object;
        le64_t seqnum;
        le64_t realtime;
        le64_t monotonic;
        sd_id128_t boot_id;
        le64_t xor_hash;
        union {
                struct {
                        le64_t object_offset;
                        le64_t hash;
                } regular[];
                struct {
                        le32_t object_offset;
                } compact[];
        } items;
};
```

An ENTRY object binds several DATA objects together into one log entry, and
includes other metadata such as various timestamps.

The **seqnum** field contains the sequence number of the entry, **realtime**
the realtime timestamp, and **monotonic** the monotonic timestamp for the boot
identified by **boot_id**.

The **xor_hash** field contains a binary XOR of the hashes of the payload of
all DATA objects referenced by this ENTRY. This value is usable to check the
contents of the entry, being independent of the order of the DATA objects in
the array. Note that even for files that have the
`HEADER_INCOMPATIBLE_KEYED_HASH` flag set (and thus use siphash24 as the
otherwise used hash function) the hash function used for this field, as a singular
exception, is the Jenkins lookup3 hash function. The XOR hash value is used to
quickly compare the contents of two entries, and to define a well-defined order
between two entries that otherwise have the same sequence numbers and
timestamps.

The **items[]** array contains references to all DATA objects of this entry,
plus their respective hashes (which are calculated the same way as in the DATA
objects, i.e. keyed by the file ID).

If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, DATA object offsets are stored
as 32-bit integers instead of 64-bit, and the unused hash field per data object is
not stored anymore.

In the file ENTRY objects are written ordered monotonically by sequence
number. For continuous parts of the file written during the same boot
(i.e. with the same boot_id) the monotonic timestamp is monotonic too. Modulo
wallclock time jumps (due to incorrect clocks being corrected) the realtime
timestamps are monotonic too.


## Hash Table Objects

```c
_packed_ struct HashItem {
        le64_t head_hash_offset;
        le64_t tail_hash_offset;
};

_packed_ struct HashTableObject {
        ObjectHeader object;
        HashItem items[];
};
```

572 | The structure of both DATA_HASH_TABLE and FIELD_HASH_TABLE objects are | |
e3500e9d | 573 | identical. They implement a simple hash table, with each cell containing |
bbcd38e4 LP |
574 | offsets to the head and tail of the singly linked list of the DATA and FIELD |
575 | objects, respectively. DATA's and FIELD's next_hash_offset field are used to | |
576 | chain up the objects. Empty cells have both offsets set to 0. | |

Each file contains exactly one DATA_HASH_TABLE and one FIELD_HASH_TABLE
object. Their payload is directly referred to by the file header in the
**data_hash_table_offset**, **data_hash_table_size**,
**field_hash_table_offset** and **field_hash_table_size** fields. These
offsets do _not_ point to the object headers but directly to the payloads.
When a new journal file is created, the two hash table objects need to be
created right away, as the first two objects in the stream.

If the hash table fill level grows beyond a certain threshold (taking a cue
from Java's Hashtable, for example: > 75%), the writer should rotate the file
and create a new one.
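
A minimal sketch of that check (the 75% threshold is a heuristic, not part of
the on-disk format, and the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: trigger rotation once more than 75% of the hash table cells are
 * in use. The exact threshold is a tuning choice, not mandated by the
 * file format. */
static int hash_table_needs_rotation(uint64_t n_used_cells, uint64_t n_cells) {
        return n_used_cells > n_cells * 3 / 4;
}
```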

The DATA_HASH_TABLE should be sized taking into account the maximum size the
file is expected to grow to, as configured by the administrator or by disk
space considerations. The FIELD_HASH_TABLE can be sized to a fixed size; the
number of fields should be fairly static, as it depends only on developers'
creativity rather than on runtime parameters.


## Entry Array Objects

```c
_packed_ struct EntryArrayObject {
        ObjectHeader object;
        le64_t next_entry_array_offset;
        union {
                le64_t regular[];
                le32_t compact[];
        } items;
};
```

Entry Arrays are used to store a sorted array of offsets to entries. Entry
arrays are strictly sorted by offsets on disk, and hence by their timestamps
and sequence numbers (with some restrictions, see above).

If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, offsets are stored as 32-bit
integers instead of 64-bit.

Entry Arrays are chained up. If one entry array is full, another one is
allocated and the **next_entry_array_offset** field of the old one is set to
point to it. An Entry Array with **next_entry_array_offset** set to 0 is the
last in the list. To optimize allocation and seeking, entry arrays should
increase in size (i.e. double) as they are appended to a chain.

Due to being monotonically ordered, entry arrays may be searched with a
binary search (bisection).
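
Because the arrays double in size along the chain, the array holding the i-th
entry of a chain can be found in a logarithmic number of steps before
bisecting within it. A sketch, assuming a caller-supplied capacity for the
first array (the function and parameter names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: given a chain of entry arrays with capacities first_size,
 * 2*first_size, 4*first_size, ..., compute which array in the chain holds
 * the i-th entry and at which index within that array. */
static void entry_array_locate(uint64_t i, uint64_t first_size,
                               uint64_t *ret_array, uint64_t *ret_index) {
        uint64_t array = 0, size = first_size;

        while (i >= size) {
                i -= size;   /* skip this array entirely */
                size *= 2;   /* the next array is twice as large */
                array++;
        }

        *ret_array = array;
        *ret_index = i;
}
```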

One chain of entry arrays links up all entries written to the journal. The
first entry array is referenced in the **entry_array_offset** field of the
header.

Each DATA object also references an entry array chain listing all entries
that reference that specific DATA object. Since many DATA objects are only
referenced by a single ENTRY, the first offset of the list is stored inside
the DATA object itself; an ENTRY_ARRAY object is only needed if the DATA
object is referenced by more than one ENTRY.


## Tag Object

```c
#define TAG_LENGTH (256/8)

_packed_ struct TagObject {
        ObjectHeader object;
        le64_t seqnum;
        le64_t epoch;
        uint8_t tag[TAG_LENGTH]; /* SHA-256 HMAC */
};
```

Tag objects are used to seal off the journal against alteration. At regular
intervals a tag object is appended to the file. The tag object consists of a
SHA-256 HMAC tag that is calculated over the objects stored in the file since
the last tag was written, or from the beginning if no tag has been written
yet. The key for the HMAC is calculated via the externally maintained FSPRG
logic for the epoch that is written into **epoch**. The sequence number
**seqnum** is increased with each tag. When calculating the HMAC of an
object, volatile header fields are excluded (skipped). More specifically, all
fields that might validly be altered after an object has been written, in
order to maintain a consistent file structure (such as offsets to objects
added later for the purpose of linked lists and suchlike), are not protected
by the tag. This means a verifier has to independently check these fields for
structural consistency. For the fields excluded from the HMAC, please consult
the source code directly. A verifier should read the file from beginning to
end, continuously calculating the HMAC for the objects it reads. Each time a
tag object is encountered, the HMAC should be verified and restarted. The tag
object sequence numbers need to increase strictly monotonically. Tag objects
themselves are partially protected by the HMAC (i.e. **seqnum** and **epoch**
are included, the tag itself is not).


## Algorithms

### Reading

Given the offset of an entry, all its data fields are easily found by
following the offsets in the data item array of the entry.

Listing entries without a filter is done by traversing the list of entry
arrays, starting with the header's **entry_array_offset** field.

Seeking to an entry by timestamp or sequence number (without any matches) is
done via binary search in the entry arrays, starting with the header's
**entry_array_offset** field. Since these arrays double in size as more are
added, the time cost of seeking is O(log(n)*log(n)), where n is the number of
entries in the file.

When seeking or listing with one field match applied, the DATA object of the
match is first identified, and then its entry array chain is traversed. The
time cost is the same as for seeks/listings with no match.

If multiple matches are applied, multiple chains of entry arrays should be
traversed in parallel. Since they are all strictly monotonically ordered by
the offsets of the entries, advancing in one can be directly applied to the
others, until an entry matching all matches is found. In the worst case,
seeking like this is O(n), where n is the number of matching entries of the
"loosest" match, but in the common case it should be much more efficient, at
least for the well-known fields, where the sets of possible field values tend
to be closely related. Checking whether an entry matches a number of matches
is efficient, since the item array of the entry contains hashes of all data
fields referenced, and the number of data fields of an entry is generally
small (< 30).
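
The parallel traversal described above is effectively an intersection of
strictly ascending lists of entry offsets. A sketch for two lists (the names
are illustrative; the same idea generalizes to N lists):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: intersect two strictly ascending lists of entry offsets, as done
 * conceptually when two field matches are applied at once. Advancing past
 * the smaller offset in one list never skips a common entry, since both
 * lists are sorted. Returns the number of common offsets written to out[]. */
static size_t intersect_offsets(const uint64_t *a, size_t na,
                                const uint64_t *b, size_t nb,
                                uint64_t *out) {
        size_t i = 0, j = 0, n = 0;

        while (i < na && j < nb) {
                if (a[i] == b[j]) {
                        out[n++] = a[i];
                        i++;
                        j++;
                } else if (a[i] < b[j])
                        i++;
                else
                        j++;
        }

        return n;
}
```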

When interleaving multiple journal files, seeking tends to be a frequently
used operation; in this case, however, it can be effectively suppressed by
caching results from previous entries.

When listing all possible values a certain field can take, it is sufficient
to look up the FIELD object and follow the chain of links to all DATA objects
it includes.

### Writing

When an entry is appended to the journal, for each of its data fields the
data hash table should be checked. If the data field does not yet exist in
the file, it should be appended and added to the data hash table. When a data
field's data object is added, the field hash table should be checked for the
field name of the data field, and a FIELD object added if necessary. After
all data fields (and recursively all field names) of the new entry are
appended and linked up in the hash tables, the entry object should be
appended and linked up too.

At regular intervals a tag object should be written, if sealing is enabled
(see above). Before the file is closed, a tag should be written too, to seal
it off.

Before writing an object, time and disk space limits should be checked, and
rotation triggered if necessary.


## Optimizing Disk IO

_A few general ideas to keep in mind:_

The hash tables for looking up fields and data should quickly end up in the
memory cache and not hurt performance. All entries and entry arrays are
ordered strictly by time on disk, and hence should expose an OK access
pattern on rotating media when read sequentially (which should be the most
common case, given the nature of log data).

The disk access patterns of the binary search needed for seeking are
problematic on rotating disks. This should not be a major issue though, since
seeking should not be a frequent operation.

When reading, collecting the data fields needed to present entries to the
user is problematic on rotating disks. In order to optimize these access
patterns, the item array of entry objects should be sorted by disk offset
before writing. Effectively, frequently used data objects should end up in
the memory cache quickly. Infrequently used data objects are likely to be
located between the previous and the current entry when reading, and hence
should expose an OK access pattern. Problematic are data objects that are
neither frequently nor infrequently referenced, which will cost seek time.

And that's all there is to it.

Thanks for your interest!