---
title: Journal File Format
category: Interfaces
layout: default
---

# Journal File Format

_Note that this document describes the binary on-disk format of journals
only. For interfacing with web technologies there's the [Journal JSON
Format](http://www.freedesktop.org/wiki/Software/systemd/json). For transfer
of journal data across the network there's the [Journal Export
Format](http://www.freedesktop.org/wiki/Software/systemd/export)._

The systemd journal stores log data in a binary format with several features:

* Fully indexed by all fields
* Can store binary data, up to 2^64-1 bytes in size
* Seekable
* Primarily append-based, hence robust to corruption
* Support for in-line compression
* Support for in-line Forward Secure Sealing

This document explains the basic structure of the file format on disk. We are
making this available primarily to allow review and provide documentation. Note
that the actual implementation in the [systemd
codebase](https://github.com/systemd/systemd/blob/master/src/journal/) is the
only ultimately authoritative description of the format, so if this document
and the code disagree, the code is right. That said, we'll of course try hard
to keep this document up-to-date and accurate.

Instead of implementing your own reader or writer for journal files we ask you
to use the [Journal's native C
API](http://www.freedesktop.org/software/systemd/man/sd-journal.html) to access
these files. It provides you with full access to the files, and will not
withhold any data. If you find a limitation, please ping us and we might add
some additional interfaces for you.

If you need access to the raw journal data in serialized stream form without
the C API, our recommendation is to make use of the [Journal Export
Format](http://www.freedesktop.org/wiki/Software/systemd/export), which you can
get via `journalctl -o export` or via systemd-journal-gatewayd. The export
format is much simpler to parse, but complete and accurate. Due to its
stream-based nature it is not indexed.

_Or, to put this in other words: this low-level document is probably not what
you want to use as the base of your project. You want our [C
API](http://www.freedesktop.org/software/systemd/man/sd-journal.html) instead!
And if you really don't want the C API, then you want the [Journal Export
Format](http://www.freedesktop.org/wiki/Software/systemd/export) instead! This
document is primarily for your entertainment and education. Thank you!_

This document assumes you have a basic understanding of the journal concepts,
the properties of a journal entry and so on. If not, please go and read up,
then come back! This is a good opportunity to read about the [basic properties
of journal
entries](http://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html),
in particular realize that they may include binary non-text data (though they
usually don't), and that the same field might have multiple values assigned
within the same entry.

This document describes the current format of systemd 246. The documented
format is compatible with the format used in the first versions of the
journal, but has received various compatible and incompatible additions since.

If you are wondering why the journal file format has been created in the first
place instead of adopting an existing database implementation, please have a
look [at this
thread](https://lists.freedesktop.org/archives/systemd-devel/2012-October/007054.html).

## Basics

* All offsets, sizes, time values, hashes (and most other numeric values) are 64bit unsigned integers in LE format.
* Offsets are always relative to the beginning of the file.
* The 64bit hash function siphash24 is used for newer journal files. For older files [Jenkins lookup3](https://en.wikipedia.org/wiki/Jenkins_hash_function) is used, more specifically `jenkins_hashlittle2()` with the first 32bit integer it returns used as the higher 32bit part of the 64bit value, and the second one used as the lower 32bit part.
* All structures are aligned to 64bit boundaries and padded to multiples of 64bit.
* The format is designed to be read and written via memory mapping using multiple mapped windows.
* All time values are stored in usec since the respective epoch.
  * Wall clock time values are relative to the Unix time epoch, i.e. January 1st, 1970. (`CLOCK_REALTIME`)
  * Monotonic time values are always stored jointly with the kernel boot ID value (i.e. `/proc/sys/kernel/random/boot_id`) they belong to. They tend to be relative to the start of the boot, but aren't for containers. (`CLOCK_MONOTONIC`)
* Randomized, unique 128bit IDs are used in various locations. These are generally UUID v4 compatible, but this is not a requirement.

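The alignment and padding rule above can be sketched as a single helper. This is illustrative only; the name `align64` is invented here and not part of the format:

```c
#include <stdint.h>

/* Round a size up to the next multiple of 8 bytes (64 bits), matching the
 * rule that all structures are aligned and padded to 64bit boundaries. */
static inline uint64_t align64(uint64_t n) {
        return (n + 7ULL) & ~7ULL;
}
```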
## General Rules

If any kind of corruption is noticed by a writer it should immediately rotate
the file and start a new one. No further writes should be attempted to the
original file, but it should be left around so that as little data as possible
is lost.

If any kind of corruption is noticed by a reader it should try hard to handle
this gracefully, such as skipping over the corrupted data, but allowing access
to as much data around it as possible.

A reader should verify all offsets and other data as it reads them. This
includes checking for alignment and range of offsets in the file, especially
before trying to read data via a memory map.

A reader must interleave rotated and corrupted files as well as possible and
present them as a single stream to the user.

All fields marked as "reserved" must be initialized with 0 when writing and be
ignored on reading. They are currently not used but might be used later on.

## Structure

The file format's data structures are declared in
[journal-def.h](https://github.com/systemd/systemd/blob/master/src/journal/journal-def.h).

The file format begins with a header structure. After the header structure
object structures follow. Objects are appended to the end as time
progresses. Most data stored in these objects is not altered anymore after
having been written once, with the exception of records necessary for
indexing. When new data is appended to a file the writer first writes all new
objects to the end of the file, and then links them up at front after that's
done. Currently, seven different object types are known:

```c
enum {
        OBJECT_UNUSED,
        OBJECT_DATA,
        OBJECT_FIELD,
        OBJECT_ENTRY,
        OBJECT_DATA_HASH_TABLE,
        OBJECT_FIELD_HASH_TABLE,
        OBJECT_ENTRY_ARRAY,
        OBJECT_TAG,
        _OBJECT_TYPE_MAX
};
```

* A **DATA** object, which encapsulates the contents of one field of an entry, i.e. a string such as `_SYSTEMD_UNIT=avahi-daemon.service`, or `MESSAGE=Foobar made a booboo.` but possibly including large or binary data, and always prefixed by the field name and "=".
* A **FIELD** object, which encapsulates a field name, i.e. a string such as `_SYSTEMD_UNIT` or `MESSAGE`, without any `=` or even value.
* An **ENTRY** object, which binds several **DATA** objects together into a log entry.
* A **DATA_HASH_TABLE** object, which encapsulates a hash table for finding existing **DATA** objects.
* A **FIELD_HASH_TABLE** object, which encapsulates a hash table for finding existing **FIELD** objects.
* An **ENTRY_ARRAY** object, which encapsulates a sorted array of offsets to entries, used for seeking by binary search.
* A **TAG** object, consisting of an FSS sealing tag for all data from the beginning of the file or the last tag written (whichever is later).

## Header

The Header struct defines, well, you guessed it, the file header:

```c
_packed_ struct Header {
        uint8_t signature[8]; /* "LPKSHHRH" */
        le32_t compatible_flags;
        le32_t incompatible_flags;
        uint8_t state;
        uint8_t reserved[7];
        sd_id128_t file_id;
        sd_id128_t machine_id;
        sd_id128_t boot_id;    /* last writer */
        sd_id128_t seqnum_id;
        le64_t header_size;
        le64_t arena_size;
        le64_t data_hash_table_offset;
        le64_t data_hash_table_size;
        le64_t field_hash_table_offset;
        le64_t field_hash_table_size;
        le64_t tail_object_offset;
        le64_t n_objects;
        le64_t n_entries;
        le64_t tail_entry_seqnum;
        le64_t head_entry_seqnum;
        le64_t entry_array_offset;
        le64_t head_entry_realtime;
        le64_t tail_entry_realtime;
        le64_t tail_entry_monotonic;
        /* Added in 187 */
        le64_t n_data;
        le64_t n_fields;
        /* Added in 189 */
        le64_t n_tags;
        le64_t n_entry_arrays;
        /* Added in 246 */
        le64_t data_hash_chain_depth;
        le64_t field_hash_chain_depth;
};
```

The first 8 bytes of Journal files must contain the ASCII characters `LPKSHHRH`.

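As a minimal illustration, a reader's first sanity check might look like this (the helper name is invented for this sketch):

```c
#include <string.h>

/* Return 1 if the buffer starts with the 8-byte journal file signature. */
static int journal_check_signature(const unsigned char *p) {
        return memcmp(p, "LPKSHHRH", 8) == 0;
}
```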
If a writer finds that the **machine_id** of a file to write to does not match
the machine it is running on it should immediately rotate the file and start a
new one.

When a journal file is first created the **file_id** is randomly and uniquely
initialized.

When a writer opens a file it shall initialize the **boot_id** to the current
boot ID of the system.

The currently used part of the file is the **header_size** plus the
**arena_size** field of the header. If a writer needs to write to a file where
the actual file size on disk is smaller than the reported value it shall
immediately rotate the file and start a new one. If a writer is asked to write
to a file with a header that is shorter than its own definition of the struct
Header, it shall immediately rotate the file and start a new one.

201 | ||
202 | The **n_objects** field contains a counter for objects currently available in | |
203 | this file. As objects are appended to the end of the file this counter is | |
204 | increased. | |
205 | ||
206 | The first object in the file starts immediately after the header. The last | |
207 | object in the file is at the offset **tail_object_offset**, which may be 0 if | |
208 | no object is in the file yet. | |
209 | ||
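A reader can thus walk all objects sequentially: start at **header_size** and advance by each object's size, padded to the next 64bit boundary. The sketch below uses a stripped-down object header and assumes a little-endian host; all names here are invented for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Stripped-down object header: type, flags, reserved padding, 64bit size. */
typedef struct __attribute__((packed)) {
        uint8_t  type;
        uint8_t  flags;
        uint8_t  reserved[6];
        uint64_t size; /* stored little-endian; LE host assumed here */
} MiniObjectHeader;

/* Given a mapped file and the offset of one object, return the offset of
 * the next object (size rounded up to a multiple of 8), or 0 once the end
 * of the used area (header_size + arena_size) is reached. */
static uint64_t next_object_offset(const uint8_t *file, uint64_t offset, uint64_t end) {
        MiniObjectHeader h;
        memcpy(&h, file + offset, sizeof h);
        uint64_t next = offset + ((h.size + 7ULL) & ~7ULL);
        return next < end ? next : 0;
}
```

A real reader would additionally validate that each offset is aligned, in range, and that `size` is at least the object header size.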
The **n_entries**, **n_data**, **n_fields**, **n_tags** and **n_entry_arrays**
fields are counters of the objects of the specific types.

**tail_entry_seqnum** and **head_entry_seqnum** contain the sequential number
(see below) of the last or first entry in the file, respectively, or 0 if no
entry has been written yet.

**tail_entry_realtime** and **head_entry_realtime** contain the wallclock
timestamp of the last or first entry in the file, respectively, or 0 if no
entry has been written yet.

**tail_entry_monotonic** is the monotonic timestamp of the last entry in the
file, referring to monotonic time of the boot identified by **boot_id**.

**data_hash_chain_depth** is a counter of the deepest chain in the data hash
table, minus one. It is updated whenever a chain is found that is longer than
the previous deepest chain found. Note that the counter is updated during hash
table lookups, as the chains are traversed. This counter is used to determine
when it is a good time to rotate the journal file, because hash collisions
have become too frequent.

Similarly, **field_hash_chain_depth** is a counter of the deepest chain in the
field hash table, minus one.

## Extensibility

The format is supposed to be extensible in order to enable future additions of
features. Readers should simply skip objects of unknown types as they read
them. If a compatible feature extension is made, a new bit is registered in the
header's **compatible_flags** field. If a feature extension is used that makes
the format incompatible, a new bit is registered in the header's
**incompatible_flags** field. Readers should check these two bit fields: if
they find a flag they don't understand in **compatible_flags** they should
continue to read the file, but if they find one in **incompatible_flags** they
should fail, asking for an update of the software. Writers should refuse
writing if there's an unknown bit flag in either of these fields.

The file header may be extended as new features are added. The size of the file
header is stored in the header. All header fields up to **n_data** are known to
unconditionally exist in all revisions of the file format; all fields starting
with **n_data** need to be explicitly checked for via a size check, since they
were additions after the initial release.

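The size check is simple arithmetic. In this sketch, `field_offset` and `field_size` would come from `offsetof()`/`sizeof()` applied to the reader's own Header definition; the helper name is invented here:

```c
#include <stdint.h>

/* An optional header field (n_data and later) exists only if the stored
 * header_size is large enough to cover the field completely. */
static int header_field_present(uint64_t header_size, uint64_t field_offset, uint64_t field_size) {
        return header_size >= field_offset + field_size;
}
```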
Currently only five extensions flagged in the flags fields are known:

```c
enum {
        HEADER_INCOMPATIBLE_COMPRESSED_XZ   = 1 << 0,
        HEADER_INCOMPATIBLE_COMPRESSED_LZ4  = 1 << 1,
        HEADER_INCOMPATIBLE_KEYED_HASH      = 1 << 2,
        HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3,
};

enum {
        HEADER_COMPATIBLE_SEALED = 1 << 0,
};
```

HEADER_INCOMPATIBLE_COMPRESSED_XZ indicates that the file includes DATA objects
that are compressed using XZ. Similarly, HEADER_INCOMPATIBLE_COMPRESSED_LZ4
indicates that the file includes DATA objects that are compressed with the LZ4
algorithm. And HEADER_INCOMPATIBLE_COMPRESSED_ZSTD indicates that there are
objects compressed with ZSTD.

HEADER_INCOMPATIBLE_KEYED_HASH indicates that instead of the unkeyed Jenkins
hash function the keyed siphash24 hash function is used for the two hash
tables, see below.

HEADER_COMPATIBLE_SEALED indicates that the file includes TAG objects required
for Forward Secure Sealing.

## Dirty Detection

```c
enum {
        STATE_OFFLINE = 0,
        STATE_ONLINE = 1,
        STATE_ARCHIVED = 2,
        _STATE_MAX
};
```

If a file is opened for writing the **state** field should be set to
STATE_ONLINE. If a file is closed after writing the **state** field should be
set to STATE_OFFLINE. After a file has been rotated it should be set to
STATE_ARCHIVED. If a writer is asked to write to a file that is not in
STATE_OFFLINE it should immediately rotate the file and start a new one,
without changing the file.

Before and after the state field is changed, `fdatasync()` should be executed
on the file to ensure the dirty state hits disk.

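The update could be sketched as follows, assuming direct `pwrite()` access rather than `mmap()`; the **state** byte sits at offset 16 in the header (after the 8-byte signature and the two 32bit flags fields), and the function name is invented here:

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Flush pending writes, update the state byte at header offset 16, then
 * flush again so the new state reliably hits the disk. */
static int journal_set_state(int fd, uint8_t state) {
        if (fdatasync(fd) < 0)
                return -1;
        if (pwrite(fd, &state, 1, 16) != 1)
                return -1;
        return fdatasync(fd);
}
```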

## Sequence Numbers

All entries carry sequence numbers that are monotonically counted up for each
entry (starting at 1) and are unique among all files which carry the same
**seqnum_id** field. This field is randomly generated when the journal daemon
creates its first file. All files generated by the same journal daemon instance
should hence carry the same seqnum_id. This should guarantee a monotonic stream
of sequential numbers for easy interleaving even if entries are distributed
among several files, such as the system journal and many per-user journals.

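Interleaving entries from multiple files can be sketched with a comparison function: within one **seqnum_id** the sequence number alone defines the order, while across different IDs a reader has to fall back to timestamps. The struct and names here are illustrative, not part of the on-disk format:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative reference to one entry, with the fields relevant for ordering. */
typedef struct {
        uint8_t  seqnum_id[16];
        uint64_t seqnum;
        uint64_t realtime;
} EntryRef;

/* Order two entries: by seqnum within the same seqnum_id, otherwise by
 * wallclock timestamp. Returns <0, 0 or >0 like strcmp(). */
static int entry_compare(const EntryRef *a, const EntryRef *b) {
        if (memcmp(a->seqnum_id, b->seqnum_id, 16) == 0)
                return a->seqnum < b->seqnum ? -1 : a->seqnum > b->seqnum ? 1 : 0;
        return a->realtime < b->realtime ? -1 : a->realtime > b->realtime ? 1 : 0;
}
```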

## Concurrency

The file format is designed to be usable in a simultaneous
single-writer/multiple-reader scenario. The synchronization model is very weak
in order to facilitate storage on the most basic of file systems (well, the
most basic ones that provide us with `mmap()` that is), and to allow good
performance. No file locking is used. The only time disk synchronization
via `fdatasync()` should be enforced is before and after changing the **state**
field in the file header (see above). It is recommended to execute a memory
barrier after appending and initializing new objects at the end of the file,
and before linking them up in the earlier objects.

This weak synchronization model means that it is crucial that readers verify
the structural integrity of the file as they read it and handle invalid
structure gracefully. (Checking what you read is a pretty good idea out of
security considerations anyway.) This specifically includes checking offset
values, and that they point to valid objects, with valid sizes and of the type
and hash value expected. All code must be written with the fact in mind that a
file with inconsistent structure might just be inconsistent temporarily, and
might become consistent later on. Payload OTOH requires less scrutiny, as it
should only be linked up (and hence visible to readers) after it was
successfully written to memory (though not necessarily to disk). On non-local
file systems it is a good idea to verify the payload hashes when reading, in
order to avoid annoyances with `mmap()` inconsistencies.

Clients intending to show a live view of the journal should use `inotify()`
to watch for file changes. Since file writes done via `mmap()` do not
result in `inotify()` events, writers shall truncate the file to its current
size after writing one or more entries, which results in inotify events being
generated. Note that this is not used as a transaction scheme (it doesn't
protect anything), but merely for triggering wakeups.

Note that inotify will not work on network file systems if reader and writer
reside on different hosts. Readers which detect they are run on journal files
on a non-local file system should hence not rely on inotify for live views but
fall back to simple time-based polling of the files (maybe recheck every 2s).


## Objects

All objects carry a common header:

```c
enum {
        OBJECT_COMPRESSED_XZ   = 1 << 0,
        OBJECT_COMPRESSED_LZ4  = 1 << 1,
        OBJECT_COMPRESSED_ZSTD = 1 << 2,
};

_packed_ struct ObjectHeader {
        uint8_t type;
        uint8_t flags;
        uint8_t reserved[6];
        le64_t size;
        uint8_t payload[];
};
```

The **type** field is one of the object types listed above. The **flags** field
currently knows three flags: OBJECT_COMPRESSED_XZ, OBJECT_COMPRESSED_LZ4 and
OBJECT_COMPRESSED_ZSTD. It is only valid for DATA objects and indicates that
the data payload is compressed with XZ/LZ4/ZSTD. If one of the
OBJECT_COMPRESSED_* flags is set for an object then the matching
HEADER_INCOMPATIBLE_COMPRESSED_XZ/HEADER_INCOMPATIBLE_COMPRESSED_LZ4/HEADER_INCOMPATIBLE_COMPRESSED_ZSTD
flag must be set for the file as well. At most one of these three bits may be
set. The **size** field encodes the size of the object including all its
headers and payload.


## Data Objects

```c
_packed_ struct DataObject {
        ObjectHeader object;
        le64_t hash;
        le64_t next_hash_offset;
        le64_t next_field_offset;
        le64_t entry_offset; /* the first array entry we store inline */
        le64_t entry_array_offset;
        le64_t n_entries;
        uint8_t payload[];
};
```

Data objects carry actual field data in the **payload[]** array, including a
field name, a `=` and the field data. Example:
`_SYSTEMD_UNIT=foobar.service`. The **hash** field is a hash value of the
payload. If the `HEADER_INCOMPATIBLE_KEYED_HASH` flag is set in the file header
this is the siphash24 hash value of the payload, keyed by the file ID as stored
in the **file_id** field of the file header. If the flag is not set it is the
non-keyed Jenkins hash of the payload instead. The keyed hash is preferred as
it makes the format more robust against attackers that want to trigger hash
collisions in the hash table.

**next_hash_offset** is used to link up DATA objects in the DATA_HASH_TABLE if
a hash collision happens (in a singly linked list, with an offset of 0
indicating the end). **next_field_offset** is used to link up data objects with
the same field name from the FIELD object of the field used.

**entry_offset** is an offset to the first ENTRY object referring to this DATA
object. **entry_array_offset** is an offset to an ENTRY_ARRAY object with
offsets to other entries referencing this DATA object. Storing the offset to
the first ENTRY object in-line is an optimization given that many DATA objects
will be referenced from a single entry only (for example, `MESSAGE=` frequently
includes a practically unique string). **n_entries** is a counter of the total
number of ENTRY objects that reference this object, i.e. the sum of all
ENTRY_ARRAYs chained up from this object, plus 1.

The **payload[]** field contains the field name and data unencoded, unless
OBJECT_COMPRESSED_XZ/OBJECT_COMPRESSED_LZ4/OBJECT_COMPRESSED_ZSTD is set in the
`ObjectHeader`, in which case the payload is compressed with the indicated
compression algorithm.

## Field Objects

```c
_packed_ struct FieldObject {
        ObjectHeader object;
        le64_t hash;
        le64_t next_hash_offset;
        le64_t head_data_offset;
        uint8_t payload[];
};
```

Field objects are used to enumerate all possible values a certain field name
can take in the entire journal file.

The **payload[]** array contains the actual field name, without '=' or any
field value. Example: `_SYSTEMD_UNIT`. The **hash** field is a hash value of
the payload. As for the DATA objects, this too is either the `.file_id` keyed
siphash24 hash of the payload, or the non-keyed Jenkins hash.

**next_hash_offset** is used to link up FIELD objects in the FIELD_HASH_TABLE
if a hash collision happens (in a singly linked list, with an offset of 0
indicating the end). **head_data_offset** points to the first DATA object that
shares this field name. It is the head of a singly linked list using DATA's
**next_field_offset** offset.


## Entry Objects

```c
_packed_ struct EntryItem {
        le64_t object_offset;
        le64_t hash;
};

_packed_ struct EntryObject {
        ObjectHeader object;
        le64_t seqnum;
        le64_t realtime;
        le64_t monotonic;
        sd_id128_t boot_id;
        le64_t xor_hash;
        EntryItem items[];
};
```

An ENTRY object binds several DATA objects together into one log entry, and
includes other metadata such as various timestamps.

The **seqnum** field contains the sequence number of the entry, **realtime**
the realtime timestamp, and **monotonic** the monotonic timestamp for the boot
identified by **boot_id**.

The **xor_hash** field contains a binary XOR of the hashes of the payload of
all DATA objects referenced by this ENTRY. This value is usable to check the
contents of the entry, being independent of the order of the DATA objects in
the array. Note that even for files that have the
`HEADER_INCOMPATIBLE_KEYED_HASH` flag set (and thus use the keyed siphash24
hash function elsewhere), the hash function used for this field is, as a
singular exception, the Jenkins lookup3 hash function. The XOR hash value is
used to quickly compare the contents of two entries, and to define a
well-defined order between two entries that otherwise have the same sequence
numbers and timestamps.

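The XOR combination itself is trivial, which is exactly what makes it independent of item order; a sketch (the function name is invented here):

```c
#include <stddef.h>
#include <stdint.h>

/* XOR together the payload hashes of an entry's DATA objects. The result
 * does not depend on the order of the items. */
static uint64_t entry_xor_hash(const uint64_t *item_hashes, size_t n) {
        uint64_t x = 0;
        for (size_t i = 0; i < n; i++)
                x ^= item_hashes[i];
        return x;
}
```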
The **items[]** array contains references to all DATA objects of this entry,
plus their respective hashes (which are calculated the same way as in the DATA
objects, i.e. keyed by the file ID).

In the file ENTRY objects are written ordered monotonically by sequence
number. For continuous parts of the file written during the same boot
(i.e. with the same boot_id) the monotonic timestamp is monotonic too. Modulo
wallclock time jumps (due to incorrect clocks being corrected) the realtime
timestamps are monotonic too.

## Hash Table Objects

```c
_packed_ struct HashItem {
        le64_t head_hash_offset;
        le64_t tail_hash_offset;
};

_packed_ struct HashTableObject {
        ObjectHeader object;
        HashItem items[];
};
```

The structure of both DATA_HASH_TABLE and FIELD_HASH_TABLE objects is
identical. They implement a simple hash table, with each cell containing
offsets to the head and tail of the singly linked list of the DATA and FIELD
objects, respectively. DATA's and FIELD's next_hash_offset fields are used to
chain up the objects. Empty cells have both offsets set to 0.

Each file contains exactly one DATA_HASH_TABLE and one FIELD_HASH_TABLE
object. Their payload is directly referred to by the file header in the
**data_hash_table_offset**, **data_hash_table_size**,
**field_hash_table_offset**, **field_hash_table_size** fields. These offsets do
_not_ point to the object headers but directly to the payloads. When a new
journal file is created the two hash table objects need to be created right
away as the first two objects in the stream.

If the hash table fill level rises above a certain threshold (taking a cue
from Java's Hashtable, for example: > 75%), the writer should rotate the file
and create a new one.

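Mapping a 64bit item hash to a hash table cell can be sketched as follows; each `HashItem` is 16 bytes, so the cell count is the table's payload size divided by 16 (the helper name is invented here):

```c
#include <stdint.h>

/* Pick the hash table cell for a given item hash. table_size is the hash
 * table payload size in bytes, as stored in the file header. */
static uint64_t hash_table_cell(uint64_t hash, uint64_t table_size) {
        uint64_t n_cells = table_size / 16; /* sizeof(HashItem) */
        return hash % n_cells;
}
```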
The DATA_HASH_TABLE should be sized taking into account the maximum size the
file is expected to grow to, as configured by the administrator or disk space
considerations. The FIELD_HASH_TABLE should be sized to a fixed size; the
number of fields should be pretty static as it depends only on developers'
creativity rather than runtime parameters.

## Entry Array Objects

```c
_packed_ struct EntryArrayObject {
        ObjectHeader object;
        le64_t next_entry_array_offset;
        le64_t items[];
};
```

Entry Arrays are used to store a sorted array of offsets to entries. Entry
arrays are strictly sorted by offsets on disk, and hence by their timestamps
and sequence numbers (with some restrictions, see above).

Entry Arrays are chained up. If one entry array is full another one is
allocated and the **next_entry_array_offset** field of the old one is set to
point to it. An Entry Array with **next_entry_array_offset** set to 0 is the
last in the list. To optimize allocation and seeking, entry arrays should
double in size as they are appended to a chain.

Due to being monotonically ordered, entry arrays may be searched with a binary
search (bisection).

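Within a single entry array, the bisection step can be sketched as a lower-bound search over the sorted offsets (the function name is invented here):

```c
#include <stddef.h>
#include <stdint.h>

/* Bisect a sorted array of entry offsets: return the index of the first
 * item >= target, or n if all items are smaller. */
static size_t entry_array_bisect(const uint64_t *items, size_t n, uint64_t target) {
        size_t lo = 0, hi = n;
        while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;
                if (items[mid] < target)
                        lo = mid + 1;
                else
                        hi = mid;
        }
        return lo;
}
```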
One chain of entry arrays links up all entries written to the journal. The
first entry array is referenced in the **entry_array_offset** field of the
header.

Each DATA object also references an entry array chain listing all entries
referencing that specific DATA object. Since many DATA objects are only
referenced by a single ENTRY, the first offset of the list is stored inside
the DATA object itself; an ENTRY_ARRAY object is only needed if it is
referenced by more than one ENTRY.

## Tag Object

```c
#define TAG_LENGTH (256/8)

_packed_ struct TagObject {
        ObjectHeader object;
        le64_t seqnum;
        le64_t epoch;
        uint8_t tag[TAG_LENGTH]; /* SHA-256 HMAC */
};
```

Tag objects are used to seal the journal against alteration. At regular
intervals a tag object is appended to the file. The tag object contains a
SHA-256 HMAC tag calculated over the objects stored in the file since the last
tag was written, or from the beginning if no tag has been written yet. The key
for the HMAC is derived via the externally maintained FSPRG logic for the
epoch that is written into **epoch**. The sequence number **seqnum** is
increased with each tag. When calculating the HMAC of an object, volatile
header fields are excluded (skipped). More specifically, all fields that might
validly be altered after an object has been written, in order to maintain a
consistent file structure (such as offsets to objects added later for the
purpose of linked lists and suchlike), are not protected by the tag. This
means a verifier has to independently check these fields for structural
consistency. For the exact fields excluded from the HMAC please consult the
source code directly. A verifier should read the file from beginning to end,
continuously calculating the HMAC over the objects it reads. Each time a tag
object is encountered the HMAC should be verified and then restarted. The tag
object sequence numbers must increase strictly monotonically. Tag objects
themselves are partially protected by the HMAC (i.e. **seqnum** and **epoch**
are included, the tag itself is not).


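The verification loop can be sketched as follows. This is a control-flow
illustration only: the `Hmac` stand-in is a toy checksum rather than the
SHA-256 HMAC the real format uses, and the `Obj` model is hypothetical:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Toy keyed checksum standing in for the SHA-256 HMAC, so the control
 * flow is runnable without a crypto library. */
typedef struct {
        uint64_t state;
} Hmac;

static void hmac_reset(Hmac *h, uint64_t epoch_key) {
        h->state = epoch_key;
}

static void hmac_update(Hmac *h, const void *p, size_t n) {
        const uint8_t *b = p;
        for (size_t i = 0; i < n; i++)
                h->state = (h->state << 7 | h->state >> 57) ^ b[i];
}

typedef enum { OBJ_DATA, OBJ_TAG } ObjType;

typedef struct {
        ObjType type;
        uint64_t seqnum;     /* TAG only; must increase strictly */
        uint64_t expected;   /* TAG only; the tag value being checked */
        const char *payload; /* non-volatile bytes covered by the HMAC */
} Obj;

/* Walk the file start to end: feed every object into the running HMAC;
 * on each TAG object check seqnum monotonicity, compare, and restart. */
int verify_tags(const Obj *objs, size_t n, uint64_t epoch_key) {
        Hmac h;
        uint64_t last_seqnum = 0;

        hmac_reset(&h, epoch_key);
        for (size_t i = 0; i < n; i++) {
                if (objs[i].type == OBJ_TAG) {
                        /* seqnum (and epoch) are covered, the tag is not */
                        hmac_update(&h, &objs[i].seqnum, sizeof objs[i].seqnum);
                        if (objs[i].seqnum <= last_seqnum)
                                return -1; /* not strictly monotonic */
                        if (h.state != objs[i].expected)
                                return -1; /* HMAC mismatch */
                        last_seqnum = objs[i].seqnum;
                        hmac_reset(&h, epoch_key); /* restart interval */
                } else
                        hmac_update(&h, objs[i].payload,
                                    strlen(objs[i].payload));
        }
        return 0;
}
```
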
## Algorithms

### Reading

Given the offset of an entry, all its data fields are easily found by
following the offsets in the entry's data item array.

Listing entries without a filter is done by traversing the list of entry
arrays starting with the header's **entry_array_offset** field.

Seeking to an entry by timestamp or sequence number (without any matches) is
done via binary search in the entry arrays, starting with the header's
**entry_array_offset** field. Since these arrays double in size as more are
added, the time cost of seeking is O(log(n)*log(n)), where n is the number of
entries in the file.

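To see where the O(log(n)*log(n)) bound comes from: with doubling node sizes,
locating the node holding a given global entry index takes O(log(n)) steps,
and the bisection within that node costs another O(log(n)). A sketch (the
initial node size used below is an arbitrary assumption for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* Map a global entry index to (node number, offset within node) in a
 * chain whose nodes double in size. The loop runs at most O(log n)
 * times, since the node lengths grow geometrically. */
void locate_in_chain(uint64_t idx, uint64_t first_node_len,
                     uint64_t *node, uint64_t *offset_in_node) {
        uint64_t n = 0, len = first_node_len;

        while (idx >= len) {
                idx -= len;
                len *= 2; /* each successive node doubles */
                n++;
        }
        *node = n;
        *offset_in_node = idx;
}
```
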
When seeking or listing with one field match applied, the DATA object of the
match is first identified, and then its entry array chain traversed. The time
cost is the same as for seeks/listings with no match.

If multiple matches are applied, multiple chains of entry arrays should be
traversed in parallel. Since they are all strictly monotonically ordered by
the offset of the entries, advancing in one can be directly applied to the
others, until an entry matching all matches is found. In the worst case
seeking like this is O(n), where n is the number of entries matching the
"loosest" match, but in the common case it should be much more efficient, at
least for the well-known fields, where the set of possible field values tends
to be closely related. Checking whether an entry satisfies a number of matches
is efficient, since the item array of the entry contains hashes of all
referenced data fields, and the number of data fields of an entry is generally
small (< 30).

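A sketch of this conjunctive traversal over sorted offset lists;
`next_common_entry()` is hypothetical and, for simplicity, advances the first
list one step at a time rather than leapfrogging it past the other cursors:

```c
#include <stdint.h>
#include <stddef.h>

/* Each match contributes a strictly increasing list of entry offsets;
 * advance all cursors until they agree on one offset. Returns that
 * offset, or 0 if there is no common entry. Assumes 1 <= n_lists <= 16. */
uint64_t next_common_entry(const uint64_t **lists, const size_t *lens,
                           size_t n_lists) {
        size_t pos[16] = {0}; /* one cursor per list */

        for (;;) {
                if (pos[0] >= lens[0])
                        return 0;

                /* candidate: current offset of the first list */
                uint64_t candidate = lists[0][pos[0]];
                int all_match = 1;

                for (size_t i = 1; i < n_lists; i++) {
                        /* advance list i past everything below candidate */
                        while (pos[i] < lens[i] && lists[i][pos[i]] < candidate)
                                pos[i]++;
                        if (pos[i] >= lens[i])
                                return 0; /* a list is exhausted */
                        if (lists[i][pos[i]] != candidate) {
                                all_match = 0;
                                break; /* candidate missed; try next */
                        }
                }
                if (all_match)
                        return candidate;
                pos[0]++;
        }
}
```
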
When interleaving multiple journal files seeking tends to be a frequently used
operation; in this case its cost can be effectively reduced by caching results
from previous entries.

When listing all possible values a certain field can take, it is sufficient to
look up the FIELD object and follow the chain of links to all DATA objects it
includes.

### Writing

When an entry is appended to the journal, the data hash table should be
checked for each of its data fields. If a data field does not yet exist in the
file, it should be appended and added to the data hash table. When a data
object is added, the field hash table should be checked for the field name of
the data field, and a field object added if necessary. After all data fields
(and recursively all field names) of the new entry are appended and linked up
in the hash tables, the entry object should be appended and linked up too.

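The deduplication step can be sketched like this, with a linear scan standing
in for the real data hash table; `DataTable` and `intern_data()` are
hypothetical:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define MAX_DATA 64 /* illustration only */

/* Hypothetical stand-in for the data hash table: interned payloads in
 * append order. The caller must keep the payload strings alive. */
typedef struct DataTable {
        const char *payloads[MAX_DATA];
        size_t n;
} DataTable;

/* Return the index of the (possibly newly appended) data object: look
 * the payload up first, and only append a new object on a miss. */
size_t intern_data(DataTable *t, const char *payload) {
        for (size_t i = 0; i < t->n; i++) /* hash table lookup */
                if (strcmp(t->payloads[i], payload) == 0)
                        return i;
        t->payloads[t->n] = payload;      /* miss: append new DATA */
        return t->n++;
}
```
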
At regular intervals a tag object should be written, if sealing is enabled
(see above). Before the file is closed a tag should be written too, to seal it
off.

Before writing an object, time and disk space limits should be checked and
rotation triggered if necessary.


## Optimizing Disk IO

_A few general ideas to keep in mind:_

The hash tables used for looking up fields and data should quickly end up in
the memory cache, so that lookups do not hurt performance. All entries and
entry arrays are strictly ordered by time on disk, and hence should expose an
acceptable access pattern on rotating media when read sequentially (which
should be the most common case, given the nature of log data).

The disk access patterns of the binary search for entries needed for seeking
are problematic on rotating disks. This should not be a major issue though,
since seeking should not be a frequent operation.

When reading, collecting the data fields needed to present entries to the user
is problematic on rotating disks. In order to optimize these access patterns,
the item array of entry objects should be sorted by disk offset before
writing. Effectively, frequently used data objects should end up in the memory
cache quickly. Infrequently used data objects are likely to be located between
the previous and current entry when reading and hence should expose an
acceptable access pattern. Problematic are data objects that are neither
frequently nor infrequently referenced, which will cost seek time.

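That sorting step might look like this; `EntryItem` is a simplified stand-in
for the elements of an entry's item array:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for an entry item: the offset of the referenced
 * DATA object plus its hash. */
typedef struct EntryItem {
        uint64_t object_offset;
        uint64_t hash;
} EntryItem;

static int by_offset(const void *a, const void *b) {
        const EntryItem *x = a, *y = b;
        return (x->object_offset > y->object_offset) -
               (x->object_offset < y->object_offset);
}

/* Sort an entry's item array by DATA object disk offset before writing,
 * so the reader's accesses move mostly forward through the file. */
void sort_entry_items(EntryItem *items, size_t n) {
        qsort(items, n, sizeof *items, by_offset);
}
```
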
And that's all there is to it.

Thanks for your interest!