---
title: Journal File Format
category: Interfaces
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---

# Journal File Format

_Note that this document describes the binary on-disk format of journals only.
For interfacing with web technologies there's the [Journal JSON Format](https://systemd.io/JOURNAL_EXPORT_FORMATS#journal-json-format).
For transfer of journal data across the network there's the [Journal Export Format](https://systemd.io/JOURNAL_EXPORT_FORMATS#journal-export-format)._

The systemd journal stores log data in a binary format with several features:

* Fully indexed by all fields
* Can store binary data, up to 2^64-1 bytes in size
* Seekable
* Primarily append-based, hence robust to corruption
* Support for in-line compression
* Support for in-line Forward Secure Sealing

This document explains the basic structure of the file format on disk. We are
making this available primarily to allow review and provide documentation. Note
that the actual implementation in the [systemd
codebase](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/) is the
only ultimately authoritative description of the format, so if this document
and the code disagree, the code is right. That said, we'll of course try hard to
keep this document up-to-date and accurate.

Instead of implementing your own reader or writer for journal files we ask you
to use the [Journal's native C
API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) to access
these files. It provides you with full access to the files, and will not
withhold any data. If you find a limitation, please ping us and we might add
some additional interfaces for you.

If you need access to the raw journal data in serialized stream form without the
C API our recommendation is to make use of the [Journal Export
Format](https://systemd.io/JOURNAL_EXPORT_FORMATS#journal-export-format), which you can
get via `journalctl -o export` or via `systemd-journal-gatewayd`. The export
format is much simpler to parse, but complete and accurate. Due to its
stream-based nature it is not indexed.

_Or, to put this in other words: this low-level document is probably not what
you want to use as the basis of your project. You want our [C
API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) instead!
And if you really don't want the C API, then you want the [Journal Export
Format or Journal JSON Format](https://systemd.io/JOURNAL_EXPORT_FORMATS) instead!
This document is primarily for your entertainment and education. Thank you!_

This document assumes you have a basic understanding of the journal concepts,
the properties of a journal entry and so on. If not, please go and read up,
then come back! This is a good opportunity to read about the [basic properties
of journal
entries](https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html),
in particular realize that they may include binary non-text data (though
usually don't), and the same field might have multiple values assigned within
the same entry.

This document describes the current format of systemd 246. The documented
format is compatible with the format used in the first versions of the journal,
but has received various compatible and incompatible additions since then.

If you are wondering why the journal file format has been created in the first
place instead of adopting an existing database implementation, please have a
look [at this
thread](https://lists.freedesktop.org/archives/systemd-devel/2012-October/007054.html).


## Basics

* All offsets, sizes, time values, hashes (and most other numeric values) are 64bit unsigned integers in LE format.
* Offsets are always relative to the beginning of the file.
* The 64bit hash function siphash24 is used for newer journal files. For older files [Jenkins lookup3](https://en.wikipedia.org/wiki/Jenkins_hash_function) is used, more specifically `jenkins_hashlittle2()` with the first 32bit integer it returns used as the higher 32bit part of the 64bit value, and the second one used as the lower 32bit part.
* All structures are aligned to 64bit boundaries and padded to multiples of 64bit.
* The format is designed to be read and written via memory mapping using multiple mapped windows.
* All time values are stored in usec since the respective epoch.
* Wall clock time values are relative to the Unix time epoch, i.e. January 1st, 1970. (`CLOCK_REALTIME`)
* Monotonic time values are always stored jointly with the kernel boot ID value (i.e. `/proc/sys/kernel/random/boot_id`) they belong to. They tend to be relative to the start of the boot, but aren't for containers. (`CLOCK_MONOTONIC`)
* Randomized, unique 128bit IDs are used in various locations. These are generally UUID v4 compatible, but this is not a requirement.
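
Two of these conventions can be sketched in a few lines of C (helper names are ours and purely illustrative, not taken from the codebase): decoding a 64bit LE integer from raw bytes, and combining the two 32bit results of `jenkins_hashlittle2()` into the 64bit hash value as described above.

```c
#include <stdint.h>
#include <stddef.h>

/* Decode one of the format's 64-bit little-endian integers from raw
 * bytes, independent of host endianness. */
static uint64_t le64_decode(const uint8_t *p) {
        uint64_t v = 0;
        for (size_t i = 0; i < 8; i++)
                v |= (uint64_t) p[i] << (i * 8);
        return v;
}

/* For old (non-keyed) files: combine the two 32-bit values returned by
 * jenkins_hashlittle2() -- the first becomes the higher 32 bits of the
 * 64-bit hash, the second the lower 32 bits. */
static uint64_t jenkins64_combine(uint32_t first, uint32_t second) {
        return ((uint64_t) first << 32) | second;
}
```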

## General Rules

If any kind of corruption is noticed by a writer it should immediately rotate
the file and start a new one. No further writes should be attempted to the
original file, but it should be left around so that as little data as possible
is lost.

If any kind of corruption is noticed by a reader it should try hard to handle
this gracefully, such as skipping over the corrupted data, but allowing access
to as much data around it as possible.

A reader should verify all offsets and other data as it reads it. This includes
checking for alignment and range of offsets in the file, especially before
trying to read it via a memory map.
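
As a rough illustration (the function name is ours, not the implementation's), such an offset check boils down to:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative reader-side check: before following an offset into the
 * memory map, verify it is non-zero, 64-bit aligned, and inside the
 * mapped file. */
static bool offset_valid(uint64_t offset, uint64_t file_size) {
        if (offset == 0)          /* 0 is used to mean "no object" */
                return false;
        if ((offset & 7) != 0)    /* objects are 64-bit aligned */
                return false;
        if (offset >= file_size)  /* must lie within the file */
                return false;
        return true;
}
```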

A reader must interleave rotated and corrupted files as well as possible and
present them as a single stream to the user.

All fields marked as "reserved" must be initialized with 0 when writing and be
ignored on reading. They are currently not used but might be used later on.


## Structure

The file format's data structures are declared in
[journal-def.h](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/journal-def.h).

The file format begins with a header structure. After the header structure
object structures follow. Objects are appended to the end as time
progresses. Most data stored in these objects is not altered anymore after
having been written once, with the exception of records necessary for
indexing. When new data is appended to a file the writer first writes all new
objects to the end of the file, and then links them up at front after that's
done. Currently, seven different object types are known:

```c
enum {
        OBJECT_UNUSED,
        OBJECT_DATA,
        OBJECT_FIELD,
        OBJECT_ENTRY,
        OBJECT_DATA_HASH_TABLE,
        OBJECT_FIELD_HASH_TABLE,
        OBJECT_ENTRY_ARRAY,
        OBJECT_TAG,
        _OBJECT_TYPE_MAX
};
```

* A **DATA** object, which encapsulates the contents of one field of an entry, i.e. a string such as `_SYSTEMD_UNIT=avahi-daemon.service`, or `MESSAGE=Foobar made a booboo.` but possibly including large or binary data, and always prefixed by the field name and "=".
* A **FIELD** object, which encapsulates a field name, i.e. a string such as `_SYSTEMD_UNIT` or `MESSAGE`, without any `=` or even value.
* An **ENTRY** object, which binds several **DATA** objects together into a log entry.
* A **DATA_HASH_TABLE** object, which encapsulates a hash table for finding existing **DATA** objects.
* A **FIELD_HASH_TABLE** object, which encapsulates a hash table for finding existing **FIELD** objects.
* An **ENTRY_ARRAY** object, which encapsulates a sorted array of offsets to entries, used for seeking by binary search.
* A **TAG** object, consisting of an FSS sealing tag for all data from the beginning of the file or the last tag written (whichever is later).

## Header

The Header struct defines, well, you guessed it, the file header:

```c
_packed_ struct Header {
        uint8_t signature[8]; /* "LPKSHHRH" */
        le32_t compatible_flags;
        le32_t incompatible_flags;
        uint8_t state;
        uint8_t reserved[7];
        sd_id128_t file_id;
        sd_id128_t machine_id;
        sd_id128_t boot_id;    /* last writer */
        sd_id128_t seqnum_id;
        le64_t header_size;
        le64_t arena_size;
        le64_t data_hash_table_offset;
        le64_t data_hash_table_size;
        le64_t field_hash_table_offset;
        le64_t field_hash_table_size;
        le64_t tail_object_offset;
        le64_t n_objects;
        le64_t n_entries;
        le64_t tail_entry_seqnum;
        le64_t head_entry_seqnum;
        le64_t entry_array_offset;
        le64_t head_entry_realtime;
        le64_t tail_entry_realtime;
        le64_t tail_entry_monotonic;
        /* Added in 187 */
        le64_t n_data;
        le64_t n_fields;
        /* Added in 189 */
        le64_t n_tags;
        le64_t n_entry_arrays;
        /* Added in 246 */
        le64_t data_hash_chain_depth;
        le64_t field_hash_chain_depth;
};
```

The first 8 bytes of Journal files must contain the ASCII characters `LPKSHHRH`.

If a writer finds that the **machine_id** of a file to write to does not match
the machine it is running on it should immediately rotate the file and start a
new one.

When a journal file is first created the **file_id** is randomly and uniquely
initialized.

When a writer opens a file it shall initialize the **boot_id** to the current
boot id of the system.

The currently used part of the file is the **header_size** plus the
**arena_size** field of the header. If a writer needs to write to a file where
the actual file size on disk is smaller than the reported value it shall
immediately rotate the file and start a new one. If a writer is asked to write
to a file with a header that is shorter than its own definition of the struct
Header, it shall immediately rotate the file and start a new one.
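
The two rotation conditions above might be sketched like this (the size constant is a hand-computed stand-in for `sizeof(struct Header)` as the writer compiled it; treat it as illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* 256 stands in for sizeof(struct Header) of the writer's own struct
 * definition (hand-computed from the packed layout above). */
#define OUR_HEADER_SIZE 256

/* Illustrative writer-side checks from the paragraph above. */
static bool writer_must_rotate(uint64_t header_size, uint64_t arena_size,
                               uint64_t actual_file_size) {
        if (header_size < OUR_HEADER_SIZE)               /* header too short */
                return true;
        if (actual_file_size < header_size + arena_size) /* file truncated */
                return true;
        return false;
}
```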

The **n_objects** field contains a counter for objects currently available in
this file. As objects are appended to the end of the file this counter is
increased.

The first object in the file starts immediately after the header. The last
object in the file is at the offset **tail_object_offset**, which may be 0 if
no object is in the file yet.

The **n_entries**, **n_data**, **n_fields**, **n_tags**, **n_entry_arrays**
fields are counters of the objects of the specific types.

**tail_entry_seqnum** and **head_entry_seqnum** contain the sequential number
(see below) of the last or first entry in the file, respectively, or 0 if no
entry has been written yet.

**tail_entry_realtime** and **head_entry_realtime** contain the wallclock
timestamp of the last or first entry in the file, respectively, or 0 if no
entry has been written yet.

**tail_entry_monotonic** is the monotonic timestamp of the last entry in the
file, referring to monotonic time of the boot identified by **boot_id**.

**data_hash_chain_depth** is a counter of the deepest chain in the data hash
table, minus one. This is updated whenever a chain is found that is longer than
the previous deepest chain found. Note that the counter is updated during hash
table lookups, as the chains are traversed. This counter is used to determine
when it is a good time to rotate the journal file, because hash collisions have
become too frequent.

Similarly, **field_hash_chain_depth** is a counter of the deepest chain in the
field hash table, minus one.

bbcd38e4 LP |
233 | |
234 | ## Extensibility | |
235 | ||
236 | The format is supposed to be extensible in order to enable future additions of | |
237 | features. Readers should simply skip objects of unknown types as they read | |
238 | them. If a compatible feature extension is made a new bit is registered in the | |
22aa58ad | 239 | header's **compatible_flags** field. If a feature extension is used that makes |
bbcd38e4 | 240 | the format incompatible a new bit is registered in the header's |
22aa58ad LP |
241 | **incompatible_flags** field. Readers should check these two bit fields, if |
242 | they find a flag they don't understand in compatible_flags they should continue | |
243 | to read the file, but if they find one in **incompatible_flags** they should | |
244 | fail, asking for an update of the software. Writers should refuse writing if | |
245 | there's an unknown bit flag in either of these fields. | |

The file header may be extended as new features are added. The size of the file
header is stored in the header. All header fields up to **n_data** are known to
unconditionally exist in all revisions of the file format; all fields starting
with **n_data** need to be explicitly checked for via a size check, since they
were additions after the initial release.

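A minimal sketch of such a size check (the offset constant is written out by hand from the packed struct above and is meant as an illustration, not a normative value):

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields from n_data on exist only if header_size covers them. 208 is
 * our hand-computed offsetof(struct Header, n_data), assuming the
 * packed layout shown earlier. */
#define N_DATA_OFFSET 208

static bool header_has_n_data(uint64_t header_size) {
        return header_size >= N_DATA_OFFSET + 8; /* field fully contained */
}
```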
Currently only five extensions flagged in the flags fields are known:

```c
enum {
        HEADER_INCOMPATIBLE_COMPRESSED_XZ = 1 << 0,
        HEADER_INCOMPATIBLE_COMPRESSED_LZ4 = 1 << 1,
        HEADER_INCOMPATIBLE_KEYED_HASH = 1 << 2,
        HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3,
};

enum {
        HEADER_COMPATIBLE_SEALED = 1 << 0,
};
```

HEADER_INCOMPATIBLE_COMPRESSED_XZ indicates that the file includes DATA objects
that are compressed using XZ. Similarly, HEADER_INCOMPATIBLE_COMPRESSED_LZ4
indicates that the file includes DATA objects that are compressed with the LZ4
algorithm. And HEADER_INCOMPATIBLE_COMPRESSED_ZSTD indicates that there are
objects compressed with ZSTD.

HEADER_INCOMPATIBLE_KEYED_HASH indicates that instead of the unkeyed Jenkins
hash function the keyed siphash24 hash function is used for the two hash
tables, see below.

HEADER_COMPATIBLE_SEALED indicates that the file includes TAG objects required
for Forward Secure Sealing.
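
The reader and writer policy described in the previous section can be sketched as follows (the known-flag masks simply mirror the enums above; function names are ours):

```c
#include <stdbool.h>
#include <stdint.h>

/* Masks of all flag bits this (hypothetical) implementation knows. */
#define KNOWN_INCOMPATIBLE ((uint32_t) ((1 << 0) | (1 << 1) | (1 << 2) | (1 << 3)))
#define KNOWN_COMPATIBLE   ((uint32_t) (1 << 0))

/* Readers ignore unknown compatible flags, but must refuse the file on
 * any unknown incompatible flag. */
static bool reader_can_open(uint32_t compatible_flags,
                            uint32_t incompatible_flags) {
        (void) compatible_flags;  /* unknown bits here are fine for readers */
        return (incompatible_flags & ~KNOWN_INCOMPATIBLE) == 0;
}

/* Writers are stricter: refuse on unknown bits in either field. */
static bool writer_can_append(uint32_t compatible_flags,
                              uint32_t incompatible_flags) {
        return (compatible_flags & ~KNOWN_COMPATIBLE) == 0 &&
               (incompatible_flags & ~KNOWN_INCOMPATIBLE) == 0;
}
```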


## Dirty Detection

```c
enum {
        STATE_OFFLINE = 0,
        STATE_ONLINE = 1,
        STATE_ARCHIVED = 2,
        _STATE_MAX
};
```

If a file is opened for writing the **state** field should be set to
STATE_ONLINE. If a file is closed after writing the **state** field should be
set to STATE_OFFLINE. After a file has been rotated it should be set to
STATE_ARCHIVED. If a writer is asked to write to a file that is not in
STATE_OFFLINE it should immediately rotate the file and start a new one,
without changing the file.

After and before the state field is changed, `fdatasync()` should be executed on
the file to ensure the dirty state hits disk.
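
As an illustration of that `fdatasync()` bracketing (the state offset is hand-computed from the header struct above; `demo_fd()` is just scaffolding for trying this out and has nothing to do with the actual implementation):

```c
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Hand-computed offsetof(struct Header, state): 8 bytes signature plus
 * two 32-bit flag words. Illustrative, not a normative constant. */
#define STATE_OFFSET 16

/* Sync pending writes, update the state byte in the header, then sync
 * again so the dirty marker itself reaches the disk. */
static int journal_set_state(int fd, uint8_t state) {
        if (fdatasync(fd) < 0)
                return -1;
        if (pwrite(fd, &state, 1, STATE_OFFSET) != 1)
                return -1;
        return fdatasync(fd);
}

/* Demo helper: a throwaway file with a zeroed fake header. */
static int demo_fd(void) {
        FILE *f = tmpfile();
        uint8_t zeros[32] = {0};
        if (!f)
                return -1;
        if (fwrite(zeros, 1, sizeof zeros, f) != sizeof zeros)
                return -1;
        fflush(f);
        return fileno(f);
}
```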


## Sequence Numbers

All entries carry sequence numbers that are monotonically counted up for each
entry (starting at 1) and are unique among all files which carry the same
**seqnum_id** field. This field is randomly generated when the journal daemon
creates its first file. All files generated by the same journal daemon instance
should hence carry the same seqnum_id. This should guarantee a monotonic stream
of sequential numbers for easy interleaving even if entries are distributed
among several files, such as the system journal and many per-user journals.
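
A hedged sketch of how a reader might order entries from different files when interleaving (the struct and comparison function are ours for illustration; the real logic in sd-journal considers further fields such as monotonic timestamps and boot IDs):

```c
#include <stdint.h>
#include <string.h>

/* Minimal per-entry sort key for interleaving (illustrative). */
typedef struct {
        uint8_t  seqnum_id[16];
        uint64_t seqnum;
        uint64_t realtime;
} EntryKey;

/* Entries from files sharing a seqnum_id are strictly ordered by
 * seqnum; across different seqnum_ids fall back to wallclock time. */
static int entry_compare(const EntryKey *a, const EntryKey *b) {
        if (memcmp(a->seqnum_id, b->seqnum_id, 16) == 0) {
                if (a->seqnum < b->seqnum) return -1;
                if (a->seqnum > b->seqnum) return 1;
                return 0;
        }
        if (a->realtime < b->realtime) return -1;
        if (a->realtime > b->realtime) return 1;
        return 0;
}
```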


## Concurrency

The file format is designed to be usable in a simultaneous
single-writer/multiple-reader scenario. The synchronization model is very weak
in order to facilitate storage on the most basic of file systems (well, the
most basic ones that provide us with `mmap()` that is), and allow good
performance. No file locking is used. The only time where disk synchronization
via `fdatasync()` should be enforced is after and before changing the **state**
field in the file header (see above). It is recommended to execute a memory
barrier after appending and initializing new objects at the end of the file,
and before linking them up in the earlier objects.

This weak synchronization model means that it is crucial that readers verify
the structural integrity of the file as they read it and handle invalid
structure gracefully. (Checking what you read is a pretty good idea out of
security considerations anyway.) This specifically includes checking offset
values, and that they point to valid objects, with valid sizes and of the type
and hash value expected. All code must be written with the fact in mind that a
file with inconsistent structure might just be inconsistent temporarily, and
might become consistent later on. Payload OTOH requires less scrutiny, as it
should only be linked up (and hence visible to readers) after it was
successfully written to memory (though not necessarily to disk). On non-local
file systems it is a good idea to verify the payload hashes when reading, in
order to avoid annoyances with `mmap()` inconsistencies.

Clients intending to show a live view of the journal should use `inotify()` to
watch for file changes. Since file writes done via `mmap()` do not result in
`inotify()` events, writers shall truncate the file to its current size after
writing one or more entries, which results in inotify events being
generated. Note that this is not used as a transaction scheme (it doesn't
protect anything), but merely for triggering wakeups.

Note that inotify will not work on network file systems if reader and writer
reside on different hosts. Readers which detect they are run on journal files
on a non-local file system should hence not rely on inotify for live views but
fall back to simple time based polling of the files (maybe recheck every 2s).


## Objects

All objects carry a common header:

```c
enum {
        OBJECT_COMPRESSED_XZ = 1 << 0,
        OBJECT_COMPRESSED_LZ4 = 1 << 1,
        OBJECT_COMPRESSED_ZSTD = 1 << 2,
};

_packed_ struct ObjectHeader {
        uint8_t type;
        uint8_t flags;
        uint8_t reserved[6];
        le64_t size;
        uint8_t payload[];
};
```

The **type** field is one of the object types listed above. The **flags** field
currently knows three flags: OBJECT_COMPRESSED_XZ, OBJECT_COMPRESSED_LZ4 and
OBJECT_COMPRESSED_ZSTD. It is only valid for DATA objects and indicates that
the data payload is compressed with XZ/LZ4/ZSTD. If one of the
OBJECT_COMPRESSED_* flags is set for an object then the matching
HEADER_INCOMPATIBLE_COMPRESSED_XZ/HEADER_INCOMPATIBLE_COMPRESSED_LZ4/HEADER_INCOMPATIBLE_COMPRESSED_ZSTD
flag must be set for the file as well. At most one of these three bits may be
set. The **size** field encodes the size of the object including all its
headers and payload.
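
A sketch of the per-object checks this implies for a reader (the constants mirror the enums above; 16 bytes is the fixed part of `ObjectHeader`; the function name is ours):

```c
#include <stdbool.h>
#include <stdint.h>

#define OBJECT_DATA 1                              /* from the type enum */
#define COMPRESSION_MASK ((uint8_t) (1 | 2 | 4))   /* XZ, LZ4, ZSTD bits */

/* Illustrative validation of an object's header fields: the size must
 * cover the fixed header, compression flags are valid for DATA objects
 * only, and at most one compression bit may be set. */
static bool object_flags_valid(uint8_t type, uint8_t flags, uint64_t size) {
        uint8_t c = flags & COMPRESSION_MASK;
        if (size < 16)                      /* must cover the fixed header */
                return false;
        if (c != 0 && type != OBJECT_DATA)  /* compression: DATA only */
                return false;
        if ((uint8_t) (c & (c - 1)) != 0)   /* at most one compression bit */
                return false;
        return true;
}
```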


## Data Objects

```c
_packed_ struct DataObject {
        ObjectHeader object;
        le64_t hash;
        le64_t next_hash_offset;
        le64_t next_field_offset;
        le64_t entry_offset; /* the first array entry we store inline */
        le64_t entry_array_offset;
        le64_t n_entries;
        uint8_t payload[];
};
```

Data objects carry actual field data in the **payload[]** array, including a
field name, a `=` and the field data. Example:
`_SYSTEMD_UNIT=foobar.service`. The **hash** field is a hash value of the
payload. If the `HEADER_INCOMPATIBLE_KEYED_HASH` flag is set in the file header
this is the siphash24 hash value of the payload, keyed by the file ID as stored
in the **file_id** field of the file header. If the flag is not set it is the
non-keyed Jenkins hash of the payload instead. The keyed hash is preferred as
it makes the format more robust against attackers that want to trigger hash
collisions in the hash table.

**next_hash_offset** is used to link up DATA objects in the DATA_HASH_TABLE if
a hash collision happens (in a singly linked list, with an offset of 0
indicating the end). **next_field_offset** is used to link up data objects with
the same field name from the FIELD object of the field used.

**entry_offset** is an offset to the first ENTRY object referring to this DATA
object. **entry_array_offset** is an offset to an ENTRY_ARRAY object with
offsets to other entries referencing this DATA object. Storing the offset to
the first ENTRY object in-line is an optimization given that many DATA objects
will be referenced from a single entry only (for example, `MESSAGE=` frequently
includes a practically unique string). **n_entries** is a counter of the total
number of ENTRY objects that reference this object, i.e. the sum of all
ENTRY_ARRAYS chained up from this object, plus 1.

The **payload[]** field contains the field name and data unencoded, unless
OBJECT_COMPRESSED_XZ/OBJECT_COMPRESSED_LZ4/OBJECT_COMPRESSED_ZSTD is set in the
`ObjectHeader`, in which case the payload is compressed with the indicated
compression algorithm.


## Field Objects

```c
_packed_ struct FieldObject {
        ObjectHeader object;
        le64_t hash;
        le64_t next_hash_offset;
        le64_t head_data_offset;
        uint8_t payload[];
};
```

Field objects are used to enumerate all possible values a certain field name
can take in the entire journal file.

The **payload[]** array contains the actual field name, without '=' or any
field value. Example: `_SYSTEMD_UNIT`. The **hash** field is a hash value of
the payload. As for the DATA objects, this too is either the `.file_id` keyed
siphash24 hash of the payload, or the non-keyed Jenkins hash.

**next_hash_offset** is used to link up FIELD objects in the FIELD_HASH_TABLE
if a hash collision happens (in a singly linked list, with offset 0 indicating
the end). **head_data_offset** points to the first DATA object that shares this
field name. It is the head of a singly linked list using DATA's
**next_field_offset** offset.


## Entry Objects

```c
_packed_ struct EntryItem {
        le64_t object_offset;
        le64_t hash;
};

_packed_ struct EntryObject {
        ObjectHeader object;
        le64_t seqnum;
        le64_t realtime;
        le64_t monotonic;
        sd_id128_t boot_id;
        le64_t xor_hash;
        EntryItem items[];
};
```

An ENTRY object binds several DATA objects together into one log entry, and
includes other metadata such as various timestamps.

The **seqnum** field contains the sequence number of the entry, **realtime**
the realtime timestamp, and **monotonic** the monotonic timestamp for the boot
identified by **boot_id**.

The **xor_hash** field contains a binary XOR of the hashes of the payload of
all DATA objects referenced by this ENTRY. This value is usable to check the
contents of the entry, being independent of the order of the DATA objects in
the array. Note that even for files that have the
`HEADER_INCOMPATIBLE_KEYED_HASH` flag set (and thus otherwise use the siphash24
hash function) the hash function used for this field, as a singular exception,
is the Jenkins lookup3 hash function. The XOR hash value is used to quickly
compare the contents of two entries, and to define a well-defined order between
two entries that otherwise have the same sequence numbers and timestamps.
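
The order-independence of the XOR is easy to see in a sketch (the hash values are illustrative inputs; computing the Jenkins hashes themselves is out of scope here):

```c
#include <stdint.h>
#include <stddef.h>

/* xor_hash is the XOR of the (Jenkins lookup3) payload hashes of all
 * DATA objects the entry references; XOR makes the result independent
 * of the order of the items. */
static uint64_t entry_xor_hash(const uint64_t *data_hashes, size_t n) {
        uint64_t x = 0;
        for (size_t i = 0; i < n; i++)
                x ^= data_hashes[i];
        return x;
}
```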

The **items[]** array contains references to all DATA objects of this entry,
plus their respective hashes (which are calculated the same way as in the DATA
objects, i.e. keyed by the file ID).

In the file ENTRY objects are written ordered monotonically by sequence
number. For continuous parts of the file written during the same boot
(i.e. with the same boot_id) the monotonic timestamp is monotonic too. Modulo
wallclock time jumps (due to incorrect clocks being corrected) the realtime
timestamps are monotonic too.


## Hash Table Objects

```c
_packed_ struct HashItem {
        le64_t head_hash_offset;
        le64_t tail_hash_offset;
};

_packed_ struct HashTableObject {
        ObjectHeader object;
        HashItem items[];
};
```

The structure of both DATA_HASH_TABLE and FIELD_HASH_TABLE objects is
identical. They implement a simple hash table, with each cell containing
offsets to the head and tail of the singly linked list of the DATA and FIELD
objects, respectively. DATA's and FIELD's **next_hash_offset** fields are used
to chain up the objects. Empty cells have both offsets set to 0.

Each file contains exactly one DATA_HASH_TABLE and one FIELD_HASH_TABLE
object. Their payload is directly referred to by the file header in the
**data_hash_table_offset**, **data_hash_table_size**,
**field_hash_table_offset**, **field_hash_table_size** fields. These offsets do
_not_ point to the object headers but directly to the payloads. When a new
journal file is created the two hash table objects need to be created right
away as the first two objects in the stream.

If the hash table fill level grows beyond a certain threshold (learning from
Java's Hashtable for example: > 75%), the writer should rotate the file
and create a new one.
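
Locating the cell for a given hash can be sketched as follows (illustrative; the table size from the header is in bytes, and each cell is a `HashItem` of two 64bit offsets, i.e. 16 bytes):

```c
#include <stdint.h>

/* Map a 64-bit hash to a cell index in a hash table whose payload is
 * table_size_bytes long. */
static uint64_t hash_table_cell(uint64_t hash, uint64_t table_size_bytes) {
        uint64_t n_cells = table_size_bytes / 16; /* sizeof(HashItem) */
        return hash % n_cells;
}
```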

The DATA_HASH_TABLE should be sized taking into account the maximum size the
file is expected to grow to, as configured by the administrator or disk space
considerations. The FIELD_HASH_TABLE should be sized to a fixed size; the
number of fields should be pretty static as it depends only on developers'
creativity rather than runtime parameters.


## Entry Array Objects

```c
_packed_ struct EntryArrayObject {
        ObjectHeader object;
        le64_t next_entry_array_offset;
        le64_t items[];
};
```

Entry Arrays are used to store a sorted array of offsets to entries. Entry
arrays are strictly sorted by offsets on disk, and hence by their timestamps
and sequence numbers (with some restrictions, see above).

Entry Arrays are chained up. If one entry array is full another one is
allocated and the **next_entry_array_offset** field of the old one is set to
point to it. An Entry Array with **next_entry_array_offset** set to 0 is the
last in the list. To optimize allocation and seeking, as entry arrays are
appended to a chain of entry arrays they should increase in size (i.e. double).

Due to being monotonically ordered, entry arrays may be searched with a binary
search (bisection).
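
The shape of that bisection within one entry array can be sketched like this (illustrative: a real reader compares by timestamp or seqnum of the referenced ENTRY objects rather than by raw offset, but the search itself is the same):

```c
#include <stdint.h>
#include <stddef.h>

/* Find the index of the first item >= target in a sorted array of
 * entry offsets; returns n if no such item exists. */
static size_t entry_array_bisect(const uint64_t *items, size_t n,
                                 uint64_t target) {
        size_t lo = 0, hi = n;
        while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;
                if (items[mid] < target)
                        lo = mid + 1;
                else
                        hi = mid;
        }
        return lo;
}
```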

One chain of entry arrays links up all entries written to the journal. The
first entry array is referenced in the **entry_array_offset** field of the
header.

Each DATA object also references an entry array chain listing all entries
referencing a specific DATA object. Since many DATA objects are only referenced
by a single ENTRY the first offset of the list is stored inside the DATA object
itself; an ENTRY_ARRAY object is only needed if it is referenced by more than
one ENTRY.


## Tag Object

```c
#define TAG_LENGTH (256/8)

_packed_ struct TagObject {
        ObjectHeader object;
        le64_t seqnum;
        le64_t epoch;
        uint8_t tag[TAG_LENGTH]; /* SHA-256 HMAC */
};
```
590 | ||
Tag objects are used to seal the journal against alteration. At regular
intervals a tag object is appended to the file. The tag object consists of a
SHA-256 HMAC tag that is calculated from the objects stored in the file since
the last tag was written, or from the beginning if no tag was written yet. The
key for the HMAC is calculated via the externally maintained FSPRG logic for
the epoch that is written into **epoch**. The sequence number **seqnum** is
increased with each tag. When calculating the HMAC of an object, volatile
header fields are excluded (skipped). More specifically, all fields that might
validly be altered after an object has been written in order to maintain a
consistent file structure (such as offsets to objects added later for the
purpose of linked lists and suchlike) are not protected by the tag. This means
a verifier has to independently check these fields for consistency of
structure. For the fields excluded from the HMAC please consult the source
code directly. A verifier should read the file from the beginning to the end,
always calculating the HMAC for the objects it reads. Each time a tag object
is encountered the HMAC should be verified and restarted. The tag object
sequence numbers need to increase strictly monotonically. Tag objects
themselves are partially protected by the HMAC (i.e. **seqnum** and **epoch**
are included, but the tag itself is not).
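
The verification pass can be sketched structurally as follows. This is only a
control-flow illustration: the checksum here is a trivial placeholder for the
real FSPRG-keyed HMAC-SHA256, and the object model is reduced to a kind tag
plus one payload word:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum ObjectKind { OBJ_DATA, OBJ_TAG };

typedef struct Object {
        enum ObjectKind kind;
        uint64_t seqnum;   /* only meaningful for OBJ_TAG */
        uint64_t payload;  /* stand-in for the object's protected bytes */
        uint64_t tag;      /* for OBJ_TAG: the expected MAC value */
} Object;

/* Placeholder for the HMAC update step: NOT the real algorithm. */
static uint64_t fake_mac_update(uint64_t state, uint64_t word) {
        return state * 31 + word;
}

/* Returns 1 if every tag verifies and tag seqnums strictly increase. */
int verify_stream(const Object *objs, size_t n) {
        uint64_t state = 0, last_seqnum = 0;
        for (size_t i = 0; i < n; i++)
                if (objs[i].kind == OBJ_TAG) {
                        /* The tag's seqnum (and, in reality, its epoch) is
                         * itself hashed; the tag field is not. */
                        state = fake_mac_update(state, objs[i].seqnum);
                        if (objs[i].tag != state)
                                return 0;          /* MAC mismatch */
                        if (objs[i].seqnum <= last_seqnum)
                                return 0;          /* must strictly increase */
                        last_seqnum = objs[i].seqnum;
                        state = 0;                 /* restart after each tag */
                } else
                        state = fake_mac_update(state, objs[i].payload);
        return 1;
}
```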


## Algorithms

### Reading

Given an offset to an entry, all data fields are easily found by following the
offsets in the data item array of the entry.

Listing entries without a filter is done by traversing the list of entry
arrays starting with the header's **entry_array_offset** field.

Seeking to an entry by timestamp or sequence number (without any matches) is
done via binary search in the entry arrays starting with the header's
**entry_array_offset** field. Since these arrays double in size as more are
added, the time cost of seeking is O(log(n)*log(n)), where n is the number of
entries in the file.

When seeking or listing with one field match applied, the DATA object of the
match is first identified, and then its entry array chain is traversed. The
time cost is the same as for seeks/listings with no match.

If multiple matches are applied, multiple chains of entry arrays should be
traversed in parallel. Since they all are strictly monotonically ordered by
offset of the entries, advancing in one can be directly applied to the others,
until an entry matching all matches is found. In the worst case seeking like
this is O(n), where n is the number of matching entries of the "loosest"
match, but in the common case it should be much more efficient, at least for
the well-known fields, where the set of possible field values tends to be
closely related. Checking whether an entry matches a number of matches is
efficient since the item array of the entry contains hashes of all data fields
referenced, and the number of data fields of an entry is generally small
(< 30).
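
The parallel advancing can be sketched as a sorted-list intersection. This
in-memory model replaces the on-disk entry array chains with plain sorted
offset arrays; the names are illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each match's entry list is modeled as a sorted array of entry offsets. */
typedef struct MatchList {
        const uint64_t *offsets;
        size_t n, pos;
} MatchList;

/* Advance list m to the first offset >= candidate; return 0 if exhausted. */
static int advance_to(MatchList *m, uint64_t candidate) {
        while (m->pos < m->n && m->offsets[m->pos] < candidate)
                m->pos++;
        return m->pos < m->n;
}

/* Find the first entry offset present in every list, by repeatedly raising
 * the candidate to the largest head among the lists. Returns 1 and stores
 * the offset in *ret, or 0 if there is no common entry. */
int first_common_entry(MatchList *lists, size_t n_lists, uint64_t *ret) {
        uint64_t candidate = 0;
        for (;;) {
                int all_equal = 1;
                for (size_t i = 0; i < n_lists; i++) {
                        if (!advance_to(&lists[i], candidate))
                                return 0;  /* one list ran out: no match */
                        if (lists[i].offsets[lists[i].pos] > candidate) {
                                candidate = lists[i].offsets[lists[i].pos];
                                all_equal = 0;
                        }
                }
                if (all_equal) {
                        *ret = candidate;
                        return 1;
                }
        }
}
```

Since the candidate strictly increases whenever the lists disagree, the total
work is bounded by the length of the shortest chain being scanned plus the
positions skipped in the others, matching the worst case described above.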

When interleaving multiple journal files, seeking tends to be a frequently
used operation, but in this case it can be effectively suppressed by caching
results from previous entries.

When listing all possible values a certain field can take, it is sufficient to
look up the FIELD object and follow the chain of links to all DATA objects it
includes.

### Writing

When an entry is appended to the journal, for each of its data fields the data
hash table should be checked. If the data field does not yet exist in the
file, it should be appended and added to the data hash table. When a data
field's data object is added, the field hash table should be checked for the
field name of the data field, and a field object added if necessary. After all
data fields (and recursively all field names) of the new entry are appended
and linked up in the hash tables, the entry object should be appended and
linked up too.

At regular intervals a tag object should be written if sealing is enabled (see
above). Before the file is closed a tag should be written too, to seal it off.
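
A toy sketch of the deduplicating append path. The real file uses on-disk
hash tables whose buckets chain objects via **next_hash_offset**; for brevity
this in-memory model uses linear probing instead, and "offsets" are simply
indices into an append-only log:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 64
#define LOG_SIZE   64

static const char *log_data[LOG_SIZE];  /* appended data objects, in order */
static size_t n_log;
/* table[slot] == 0 means empty; otherwise it holds (log index + 1). */
static int table[TABLE_SIZE];

static uint64_t hash_str(const char *s) {
        uint64_t h = 5381;
        while (*s)
                h = h * 33 + (unsigned char) *s++;
        return h;
}

/* Return the log index of a "FIELD=value" data string, appending it first
 * if it is not in the file yet. */
size_t intern_data(const char *data) {
        uint64_t slot = hash_str(data) % TABLE_SIZE;
        while (table[slot] != 0) {
                if (strcmp(log_data[table[slot] - 1], data) == 0)
                        return (size_t) (table[slot] - 1);  /* already there */
                slot = (slot + 1) % TABLE_SIZE;  /* linear probing */
        }
        log_data[n_log] = data;          /* append the new data object ... */
        table[slot] = (int) n_log + 1;   /* ... and add it to the table */
        return n_log++;
}
```

An entry would then be appended last, carrying the returned offsets of its
(now deduplicated) data objects.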

Before writing an object, time and disk space limits should be checked and
rotation triggered if necessary.


## Optimizing Disk IO

_A few general ideas to keep in mind:_

The hash tables for looking up fields and data should quickly end up in the
memory cache and not hurt performance. All entries and entry arrays are
ordered strictly by time on disk, and hence should expose an OK access pattern
on rotating media, when read sequentially (which should be the most common
case, given the nature of log data).

The disk access patterns of the binary search for entries needed for seeking
are problematic on rotating disks. This should not be a major issue though,
since seeking should not be a frequent operation.

When reading, collecting data fields for presenting entries to the user is
problematic on rotating disks. In order to optimize these access patterns, the
item array of entry objects should be sorted by disk offset before writing.
Effectively, frequently used data objects should be in the memory cache
quickly. Non-frequently used data objects are likely to be located between the
previous and current entry when reading, and hence should expose an OK access
pattern. Problematic are data objects that are neither frequently nor
infrequently referenced, which will cost seek time.
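
The pre-write sort might look like the following sketch, using host-order
integers in place of the little-endian on-disk EntryItem fields:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Mirrors the (object_offset, hash) pairs of an entry's item array. */
typedef struct EntryItem {
        uint64_t object_offset;
        uint64_t hash;
} EntryItem;

static int item_cmp(const void *a, const void *b) {
        const EntryItem *x = a, *y = b;
        /* Avoids overflow that a plain subtraction of offsets could cause. */
        return (x->object_offset > y->object_offset) -
               (x->object_offset < y->object_offset);
}

/* Sort an entry's items by disk offset so that a reader collecting the data
 * objects in item order scans the file strictly forward. */
void sort_items_by_offset(EntryItem *items, size_t n) {
        qsort(items, n, sizeof(EntryItem), item_cmp);
}
```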

And that's all there is to it.

Thanks for your interest!