From: Sean Purcell Date: Tue, 11 Apr 2017 20:50:05 +0000 (-0700) Subject: Add format specification X-Git-Tag: v1.3.0~1^2~47^2~11 X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=45f3bc4801ebb27dc118346a7bc761b715e478b9;p=thirdparty%2Fzstd.git Add format specification --- diff --git a/doc/zstd_seekable_compression_format.md b/doc/zstd_seekable_compression_format.md new file mode 100644 index 000000000..fd0593f9a --- /dev/null +++ b/doc/zstd_seekable_compression_format.md @@ -0,0 +1,107 @@ +# Zstandard Seekable Format + +### Notices + +Copyright (c) 2017-present Facebook, Inc. + +Permission is granted to copy and distribute this document +for any purpose and without charge, +including translations into other languages +and incorporation into compilations, +provided that the copyright notice and this notice are preserved, +and that any substantive changes or deletions from the original +are clearly marked. +Distribution of this document is unlimited. + +### Version +0.1.0 (11/04/17) + +## Introduction +This document defines a format for compressed data to be stored so that subranges of the data can be efficiently decompressed without requiring the entire document to be decompressed. +This is done by splitting up the input data into chunks, +each of which are compressed independently, +and so can be decompressed independently. +Decompression then takes advantage of a provided 'seek table', which allows the decompressor to immediately jump to the desired data. This is done in a way that is compatible with the original Zstandard format by placing the seek table in a Zstandard skippable frame. + +### Overall conventions +In this document: +- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters. +- the naming convention for identifiers is `Mixed_Case_With_Underscores` +- All numeric fields are little-endian unless specified otherwise + +## Format + +The format consists of a number of chunks (Zstandard compressed frames and skippable frames), followed by a final skippable frame at the end containing the seek table. + +### Seek Table Format +The structure of the seek table frame is as follows: + +|`Skippable_Magic_Number`|`Frame_Size`|`[Seek_Table_Entries]`|`Seek_Table_Footer`| +|------------------------|------------|----------------------|-------------------| +| 4 bytes | 4 bytes | 8-12 bytes each | 9 bytes | + +__`Skippable_Magic_Number`__ + +Value : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F. +All 16 values are valid to identify a skippable frame. +This is for compatibility with [Zstandard skippable frames]. + +__`Frame_Size`__ + +The total size of the skippable frame, not including the `Skippable_Magic_Number` or `Frame_Size`. This is for compatibility with [Zstandard skippable frames]. + +[Zstandard skippable frames]: https://github.com/facebook/zstd/blob/master/doc/zstd_compression_format.md#skippable-frames + +#### `Seek_Table_Footer` +The seek table footer format is as follows: + +|`Number_Of_Chunks`|`Seek_Table_Descriptor`|`Seekable_Magic_Number`| +|------------------|-----------------------|-----------------------| +| 4 bytes | 1 byte | 4 bytes | + +__`Number_Of_Chunks`__ + +The number of stored chunks in the data. + +__`Seek_Table_Descriptor`__ + +A bitfield describing the format of the seek table. + +| Bit number | Field name | +| ---------- | ---------- | +| 7 | `Checksum_Flag` | +| 6-2 | `Reserved_Bits` | +| 1-0 | `Unused_Bits` | + +While only `Checksum_Flag` currently exists, there are 7 other bits in this field that can be used for future changes to the format, +for example the addition of inline dictionaries. + +__`Checksum_Flag`__ + +If the checksum flag is set, each of the seek table entries contains a 4 byte checksum of the uncompressed data contained in its chunk. + +`Reserved_Bits` are not currently used but may be used in the future for breaking changes, so a compliant decoder should ensure they are set to 0. `Unused_Bits` may be used in the future for non-breaking changes, so a compliant decoder should not interpret these bits. + +#### __`Seek_Table_Entries`__ + +`Seek_Table_Entries` consists of `Number_Of_Chunks` (one for each chunk in the data, not including the seek table frame) entries of the following form, in sequence: + +|`Compressed_Size`|`Decompressed_Size`|`[Checksum]`| +|-----------------|-------------------|------------| +| 4 bytes | 4 bytes | 4 bytes | + +__`Compressed_Size`__ + +The compressed size of the chunk. +The cumulative sum of the `Compressed_Size` fields of chunks `0` to `i` gives the offset in the compressed file of chunk `i+1`. + +__`Decompressed_Size`__ + +The size of the decompressed data contained in the chunk. For skippable or otherwise empty frames, this value is 0. + +__`Checksum`__ + +Only present if `Checksum_Flag` is set in the `Seek_Table_Descriptor`. Value : the least significant 32 bits of the XXH64 digest of the uncompressed data, stored in little-endian format. + +## Version Changes +- 0.1.0: initial version diff --git a/lib/decompress/zstdseek_decompress.c b/lib/decompress/zstdseek_decompress.c index 5e303f26b..92901befa 100644 --- a/lib/decompress/zstdseek_decompress.c +++ b/lib/decompress/zstdseek_decompress.c @@ -121,6 +121,11 @@ size_t ZSTD_seekable_loadSeekTable(ZSTD_seekable_DStream* zds, const void* src, { BYTE const sfd = ip[-5]; checksumFlag = sfd >> 7; + + /* check reserved bits */ + if ((checksumFlag >> 2) & 0x1f) { + return ERROR(corruption_detected); + } } numChunks = MEM_readLE32(ip-9);