From: Yann Collet
Date: Mon, 30 Apr 2018 18:35:49 +0000 (-0700)
Subject: Clarifications of Zstandard format specification
X-Git-Tag: v1.3.5~3^2~68
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=82ad249645941096c8e1d1171ad6b4cfcfbd5188;p=thirdparty%2Fzstd.git

Clarifications of Zstandard format specification
from IETF RFC review
---
diff --git a/doc/zstd_compression_format.md b/doc/zstd_compression_format.md
index 7bf36c491..66819d136 100644
--- a/doc/zstd_compression_format.md
+++ b/doc/zstd_compression_format.md

@@ -16,7 +16,7 @@ Distribution of this document is unlimited.

### Version

-0.2.6 (19/08/17)
+0.2.7 (30/04/18)

Introduction

@@ -112,6 +112,11 @@ __`Magic_Number`__

4 Bytes, __little-endian__ format.
Value : 0xFD2FB528
+Note: This value was selected to be unlikely to appear at the beginning of an arbitrary file.
+It avoids trivial patterns (0x00, 0xFF, repeated bytes, increasing bytes, etc.),
+contains byte values outside of ASCII range,
+and doesn't map into UTF8 space.
+It reduces the chances that a text file represents this value by accident.

__`Frame_Header`__

@@ -171,8 +176,8 @@ according to the following table:

|`FCS_Field_Size`| 0 or 1 | 2 | 4 | 8 |

When `Flag_Value` is `0`, `FCS_Field_Size` depends on `Single_Segment_flag` :
-if `Single_Segment_flag` is set, `Field_Size` is 1.
-Otherwise, `Field_Size` is 0 : `Frame_Content_Size` is not provided.
+if `Single_Segment_flag` is set, `FCS_Field_Size` is 1.
+Otherwise, `FCS_Field_Size` is 0 : `Frame_Content_Size` is not provided.

__`Single_Segment_flag`__

@@ -218,11 +223,11 @@ __`Dictionary_ID_flag`__

This is a 2-bits flag (`= FHD & 3`),
telling if a dictionary ID is provided within the header.
-It also specifies the size of this field as `Field_Size`.
+It also specifies the size of this field as `DID_Field_Size`.

-|`Flag_Value`| 0 | 1 | 2 | 3 |
-| ---------- | --- | --- | --- | --- |
-|`Field_Size`| 0 | 1 | 2 | 4 |
+|`Flag_Value` | 0 | 1 | 2 | 3 |
+| -------------- | --- | --- | --- | --- |
+|`DID_Field_Size`| 0 | 1 | 2 | 4 |

#### `Window_Descriptor`

@@ -270,7 +275,8 @@ the ID of the dictionary required to properly decode the frame.

`Dictionary_ID` field is optional. When it's not present,
it's up to the decoder to make sure it uses the correct dictionary.
-Field size depends on `Dictionary_ID_flag`.
+`Dictionary_ID` field size is provided by `DID_Field_Size`.
+`DID_Field_Size` is directly derived from the value of `Dictionary_ID_flag`.
1 byte can represent an ID 0-255.
2 bytes can represent an ID 0-65535.
4 bytes can represent an ID 0-4294967295.

@@ -363,16 +369,14 @@ There are 4 block types :

__`Block_Size`__

The upper 21 bits of `Block_Header` represent the `Block_Size`.
+`Block_Size` is the size of the block excluding the header.
+A block can contain any number of bytes (even zero), up to
+`Block_Maximum_Decompressed_Size`, which is the smallest of:
+- `Window_Size`
+- 128 KB
-Block sizes must respect a few rules :
-- For `Compressed_Block`, `Block_Size` is always strictly less than decompressed size.
-- Block decompressed size is always <= `Window_Size`
-- Block decompressed size is always <= 128 KB.
-
-A block can contain any number of bytes (even empty),
-up to `Block_Maximum_Decompressed_Size`, which is the smallest of :
-- `Window_Size`
-- 128 KB
+A `Compressed_Block` has the extra restriction that `Block_Size` is always
+strictly less than the decompressed size.
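For illustration, `Block_Header` parsing reduces to reading 3 bytes in __little-endian__ order and splitting bit fields. The sketch below assumes the `Block_Header` layout defined in the full specification (bit 0 = `Last_Block`, bits 1-2 = `Block_Type`, upper 21 bits = `Block_Size`); type and function names are illustrative, not taken from the reference implementation.

```c
#include <stdint.h>

/* Sketch of Block_Header parsing, assuming the 3-byte little-endian layout :
 * bit 0 = Last_Block, bits 1-2 = Block_Type, upper 21 bits = Block_Size. */
typedef enum { Raw_Block = 0, RLE_Block = 1, Compressed_Block = 2, Reserved = 3 } Block_Type;

typedef struct {
    int last_block;       /* 1 if this is the last block of the frame */
    Block_Type block_type;
    uint32_t block_size;  /* size of the block, excluding this 3-byte header */
} Block_Header;

static Block_Header parse_block_header(const uint8_t *src)
{
    uint32_t const header = (uint32_t)src[0]
                          | ((uint32_t)src[1] << 8)
                          | ((uint32_t)src[2] << 16);  /* 24 bits, little-endian */
    Block_Header bh;
    bh.last_block = (int)(header & 1);
    bh.block_type = (Block_Type)((header >> 1) & 3);
    bh.block_size = header >> 3;                       /* upper 21 bits */
    return bh;
}
```

Validating the regenerated size against `Block_Maximum_Decompressed_Size`, as described above, is left to the caller.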
Compressed Blocks

@@ -392,8 +396,14 @@ To decode a compressed block, the following elements are necessary :

- Previous decoded data, up to a distance of `Window_Size`,
  or all previously decoded data when `Single_Segment_flag` is set.
- List of "recent offsets" from previous `Compressed_Block`.
-- Decoding tables of previous `Compressed_Block` for each symbol type
-  (literals, literals lengths, match lengths, offsets).
+- The previous Huffman tree, required by the `Treeless_Literals_Block` type.
+- Previous FSE decoding tables, required by `Repeat_Mode`,
+  for each symbol type (literals lengths, match lengths, offsets).
+
+Note that decoding tables aren't always from the previous `Compressed_Block`.
+
+- Every decoding table can come from a dictionary.
+- The Huffman tree comes from the previous `Compressed_Literals_Block`.

Literals Section
----------------

@@ -460,17 +470,20 @@ For values spanning several bytes, convention is __little-endian__.

__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :

-- Value ?0 : `Size_Format` uses 1 bit.
+`Size_Format` uses 1 _or_ 2 bits.
+Its value is : `Size_Format = (Header[0]>>2) & 3`
+
+- `Size_Format` == 00 or 10 : `Size_Format` uses 1 bit.
  `Regenerated_Size` uses 5 bits (0-31).
-  `Literals_Section_Header` has 1 byte.
+  `Literals_Section_Header` uses 1 byte.
  `Regenerated_Size = Header[0]>>3`
-- Value 01 : `Size_Format` uses 2 bits.
+- `Size_Format` == 01 : `Size_Format` uses 2 bits.
  `Regenerated_Size` uses 12 bits (0-4095).
-  `Literals_Section_Header` has 2 bytes.
+  `Literals_Section_Header` uses 2 bytes.
  `Regenerated_Size = (Header[0]>>4) + (Header[1]<<4)`
-- Value 11 : `Size_Format` uses 2 bits.
+- `Size_Format` == 11 : `Size_Format` uses 2 bits.
  `Regenerated_Size` uses 20 bits (0-1048575).
-  `Literals_Section_Header` has 3 bytes.
+  `Literals_Section_Header` uses 3 bytes.
  `Regenerated_Size = (Header[0]>>4) + (Header[1]<<4) + (Header[2]<<12)`

Only Stream1 is present for these cases.

@@ -479,18 +492,20 @@ using a long format, even if it's less efficient.

__`Size_Format` for `Compressed_Literals_Block` and `Treeless_Literals_Block`__ :

-- Value 00 : _A single stream_.
+`Size_Format` always uses 2 bits.
+
+- `Size_Format` == 00 : _A single stream_.
  Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023).
-  `Literals_Section_Header` has 3 bytes.
-- Value 01 : 4 streams.
+  `Literals_Section_Header` uses 3 bytes.
+- `Size_Format` == 01 : 4 streams.
  Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023).
-  `Literals_Section_Header` has 3 bytes.
-- Value 10 : 4 streams.
+  `Literals_Section_Header` uses 3 bytes.
+- `Size_Format` == 10 : 4 streams.
  Both `Regenerated_Size` and `Compressed_Size` use 14 bits (0-16383).
-  `Literals_Section_Header` has 4 bytes.
-- Value 11 : 4 streams.
+  `Literals_Section_Header` uses 4 bytes.
+- `Size_Format` == 11 : 4 streams.
  Both `Regenerated_Size` and `Compressed_Size` use 18 bits (0-262143).
-  `Literals_Section_Header` has 5 bytes.
+  `Literals_Section_Header` uses 5 bytes.

Both `Compressed_Size` and `Regenerated_Size` fields follow __little-endian__ convention.
Note: `Compressed_Size` __includes__ the size of the Huffman Tree description

@@ -516,7 +531,8 @@ it must be used to determine where streams begin.

`Total_Streams_Size = Compressed_Size - Huffman_Tree_Description_Size`.

For `Treeless_Literals_Block`,
-the Huffman table comes from previously compressed literals block.
+the Huffman table comes from the previously compressed literals block,
+or from a dictionary.
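For illustration, the `Size_Format` layouts above reduce to a few shifts and masks. The sketch below covers only the `Raw_Literals_Block` and `RLE_Literals_Block` cases, using the formulas quoted in the hunk above; the function name and its contract (a pointer to the first byte of `Literals_Section_Header`, block type already identified) are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Regenerated_Size extraction for Raw_Literals_Block and RLE_Literals_Block,
 * following the formulas above. `header` points at the first byte of
 * Literals_Section_Header; `header_size` receives the header length in bytes. */
static uint32_t raw_rle_regenerated_size(const uint8_t *header, size_t *header_size)
{
    uint32_t const size_format = (header[0] >> 2) & 3;
    switch (size_format) {
    case 0:
    case 2:                  /* Size_Format == 00 or 10 : uses 1 bit */
        *header_size = 1;
        return header[0] >> 3;                                  /* 5 bits  (0-31) */
    case 1:                  /* Size_Format == 01 */
        *header_size = 2;
        return (header[0] >> 4) + ((uint32_t)header[1] << 4);   /* 12 bits (0-4095) */
    default:                 /* Size_Format == 11 */
        *header_size = 3;
        return (header[0] >> 4) + ((uint32_t)header[1] << 4)
                                + ((uint32_t)header[2] << 12);  /* 20 bits (0-1048575) */
    }
}
```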
Huffman compressed data consists of either 1 or 4 Huffman-coded streams.

@@ -570,7 +586,8 @@ followed by the bitstream.

| -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |

To decode the `Sequences_Section`, it's required to know its size.
-This size is deduced from `Block_Size - Literals_Section_Size`.
+This size is deduced from the literals section size:
+`Sequences_Section_Size = Block_Size - Literals_Section_Size`.

#### `Sequences_Section_Header`

@@ -614,9 +631,11 @@ They follow the same enumeration :
  No distribution table will be present.
- `RLE_Mode` : The table description consists of a single byte.
  This code will be repeated for all sequences.
-- `Repeat_Mode` : The table used in the previous compressed block will be used again.
+- `Repeat_Mode` : The table used in the previous `Compressed_Block` will be used again,
+  or, if this is the first block, the table in the dictionary will be used.
  No distribution table will be present.
- Note: this includes RLE mode, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
+ Note that this includes `RLE_Mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
+ Note that this also includes `Predefined_Mode`.
  If this mode is used without any previous sequence table in the frame
  (or [dictionary](#dictionary-format)) to repeat, this should be treated as corruption.
- `FSE_Compressed_Mode` : standard FSE compression.

@@ -624,6 +643,8 @@ They follow the same enumeration :
  The format of this distribution table is described in [FSE Table Description](#fse-table-description).
  Note that the maximum allowed accuracy log for literals length and match length tables is 9,
  and the maximum accuracy log for the offsets table is 8.
+ `FSE_Compressed_Mode` must not be used when only one symbol is present;
+ `RLE_Mode` should be used instead (although any other mode will work).

#### The codes for literals lengths, match lengths, and offsets.

@@ -696,7 +717,7 @@ Offset codes are values ranging from `0` to `N`.

A decoder is free to limit its maximum `N` supported.
Recommendation is to support at least up to `22`.
For information, at the time of this writing,
-the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
+the reference decoder supports a maximum `N` value of `31` in 64-bits mode.

An offset code is also the number of additional bits to read in __little-endian__ fashion,
and can be translated into an `Offset_Value` using the following formulas :

@@ -856,7 +877,8 @@ so an `offset_value` of 1 means `Repeated_Offset2`,

an `offset_value` of 2 means `Repeated_Offset3`,
and an `offset_value` of 3 means `Repeated_Offset1 - 1_byte`.

-For the first block, the starting offset history is populated with the following values : 1, 4 and 8 (in order).
+For the first block, the starting offset history is populated with the following values : 1, 4 and 8 (in order),
+unless a dictionary is used, in which case they come from the dictionary.
Then each block gets its starting offset history from the ending values of the most recent `Compressed_Block`.
Note that blocks which are not `Compressed_Block` are skipped;
they do not contribute to offset history.

@@ -919,6 +941,8 @@ FSE, short for Finite State Entropy, is an entropy codec based on [ANS].

FSE encoding/decoding involves a state that is carried over between symbols,
so decoding must be done in the opposite direction as encoding.
Therefore, all FSE bitstreams are read from end to beginning.
+Note that the order of the bits in the stream is not reversed;
+we just read the elements in the reverse order they are written.

For additional details on FSE, see [Finite State Entropy].
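For illustration, here is a minimal backward bit reader. It assumes the final-bit-flag convention described in the bitstream sections of this specification (the highest set bit of the last byte marks the end of the data and is not itself part of it); the names, the bit-at-a-time extraction and the lack of bounds checking are purely illustrative.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *start;
    size_t bit_pos;   /* index of the next bit to read, counting up from bit 0 of start[0] */
} BackwardBits;

/* `size` must be >= 1. Returns 0 on success, -1 if the last byte is zero (corruption). */
static int backward_init(BackwardBits *bs, const uint8_t *src, size_t size)
{
    uint8_t const last = src[size - 1];
    int high = 7;
    if (last == 0) return -1;                      /* no final-bit-flag */
    while (((last >> high) & 1) == 0) high--;      /* locate the final-bit-flag */
    bs->start = src;
    bs->bit_pos = (size - 1) * 8 + (size_t)high;   /* the flag itself is not read */
    return 0;
}

/* Reads the nb_bits located just below the current position.
 * Bit significance is preserved; only the reading order is reversed. */
static uint32_t backward_read(BackwardBits *bs, unsigned nb_bits)
{
    uint32_t value = 0;
    bs->bit_pos -= nb_bits;
    for (unsigned i = 0; i < nb_bits; i++) {
        size_t const bit = bs->bit_pos + i;
        value |= (uint32_t)((bs->start[bit >> 3] >> (bit & 7)) & 1) << i;
    }
    return value;
}
```

Because each value keeps its bit significance and only the reading order changes, this matches the note above: the bit order is not reversed, elements are simply read in the reverse order of writing.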
@@ -943,6 +967,7 @@ The Zstandard format encodes FSE table descriptions as follows:

An FSE distribution table describes the probabilities of all symbols
from `0` to the last present one (included)
on a normalized scale of `1 << Accuracy_Log` .
+Note that there must be two or more symbols with nonzero probability.

It's a bitstream which is read forward, in __little-endian__ fashion.
It's not necessary to know its exact size,

@@ -959,24 +984,24 @@ It depends on :

  __example__ :
  Presuming an `Accuracy_Log` of 8,
  and presuming 100 probability points have already been distributed,
-  the decoder may read any value from `0` to `255 - 100 + 1 == 156` (inclusive).
-  Therefore, it must read `log2sup(156) == 8` bits.
+  the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
+  Therefore, it must read `log2sup(157) == 8` bits.

- Value decoded : small values use 1 less bit :
  __example__ :
-  Presuming values from 0 to 156 (inclusive) are possible,
-  255-156 = 99 values are remaining in an 8-bits field.
+  Presuming values from 0 to 157 (inclusive) are possible,
+  255-157 = 98 values are remaining in an 8-bits field.
  They are used this way :
-  first 99 values (hence from 0 to 98) use only 7 bits,
-  values from 99 to 156 use 8 bits.
+  first 98 values (hence from 0 to 97) use only 7 bits,
+  values from 98 to 157 use 8 bits.
  This is achieved through this scheme :

  | Value read | Value decoded | Number of bits used |
  | ---------- | ------------- | ------------------- |
- | 0 - 98 | 0 - 98 | 7 |
- | 99 - 127 | 99 - 127 | 8 |
- | 128 - 226 | 0 - 98 | 7 |
- | 227 - 255 | 128 - 156 | 8 |
+ | 0 - 97 | 0 - 97 | 7 |
+ | 98 - 127 | 98 - 127 | 8 |
+ | 128 - 225 | 0 - 97 | 7 |
+ | 226 - 255 | 128 - 157 | 8 |

Symbols probabilities are read one by one, in order.

@@ -1019,12 +1044,12 @@ and instructions to get the next state.

Symbols are scanned in their natural order for "less than 1" probabilities.
Symbols with this probability are being attributed a single cell,
-starting from the end of the table.
+starting from the end of the table and retreating.
These symbols define a full state reset, reading `Accuracy_Log` bits.

-All remaining symbols are sorted in their natural order.
+All remaining symbols are allocated in their natural order.
Starting from symbol `0` and table position `0`,
-each symbol gets attributed as many cells as its probability.
+each symbol gets allocated as many cells as its probability.
Cell allocation is spread out, not linear :
each successor position follows this rule :

@@ -1044,6 +1069,7 @@ Each state will decode the current symbol.

To get the `Number_of_Bits` and `Baseline` required for next state,
it's first necessary to sort all states in their natural order.
The lower states will need 1 more bit than higher ones.
+The process is repeated for each symbol.

__Example__ :
Presuming a symbol has a probability of 5.

@@ -1055,10 +1081,12 @@ Presuming the `Accuracy_Log` is 7, it defines 128 states.

Divided by 8, each share is 16 large.
In order to reach 8, 8-5=3 lowest states will count "double",
-taking shares twice larger,
+doubling the number of shares (32 in width),
requiring one more bit in the process.

-Numbering starts from higher states using less bits.
+Baseline is assigned starting from the higher states (those using fewer bits),
+proceeding in natural order, then wrapping around to the first state;
+each state takes a Baseline range as wide as its allocation.

| state order | 0 | 1 | 2 | 3 | 4 |
| ---------------- | ----- | ----- | ------ | ---- | ----- |
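For illustration, the width, `Number_of_Bits` and `Baseline` assignment described above fits in a few lines of C. The sketch handles a single symbol with probability >= 1 (the single-cell "less than 1" states are not covered) and uses illustrative names; run with probability 5 and `Accuracy_Log` 7, it reproduces the worked example.

```c
#include <stdio.h>

static unsigned highest_set_bit(unsigned v)   /* floor(log2(v)), v >= 1 */
{
    unsigned r = 0;
    while (v >>= 1) r++;
    return r;
}

/* Number_of_Bits and Baseline for one symbol of probability p >= 1 in a table
 * of size 1 << accuracy_log. States are indexed 0..p-1 in their natural order. */
static void assign_baselines(unsigned p, unsigned accuracy_log,
                             unsigned *nb_bits, unsigned *baseline)
{
    unsigned const log_p = highest_set_bit(2*p - 1);   /* ceil(log2(p)) */
    unsigned const pow2  = 1u << log_p;                /* 8 when p == 5 */
    unsigned const nb_double = pow2 - p;               /* lowest states counting "double" */
    unsigned const small_bits = accuracy_log - log_p;  /* bits used by the higher states */
    unsigned base = 0;

    /* The nb_double lowest states need one more bit than the others. */
    for (unsigned s = 0; s < p; s++)
        nb_bits[s] = (s < nb_double) ? small_bits + 1 : small_bits;

    /* Baseline starts at the first state using fewer bits and wraps around. */
    for (unsigned i = 0; i < p; i++) {
        unsigned const s = (nb_double + i) % p;
        baseline[s] = base;
        base += 1u << nb_bits[s];
    }
}

int main(void)
{
    unsigned nb_bits[5], baseline[5];
    assign_baselines(5, 7, nb_bits, baseline);   /* probability 5, Accuracy_Log 7 */
    for (unsigned s = 0; s < 5; s++)             /* expect : 5/32, 5/64, 5/96, 4/0, 4/16 */
        printf("state %u : %u bits, baseline %u\n", s, nb_bits[s], baseline[s]);
    return 0;
}
```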
@@ -1075,6 +1103,7 @@ See [Appendix A] for the results of this process applied to the default distributions.

[Appendix A]: #appendix-a---decoding-tables-for-predefined-codes

+
Huffman Coding
--------------
Zstandard Huffman-coded streams are read backwards,

@@ -1096,6 +1125,7 @@ The bitstream contains Huffman-coded symbols in __little-endian__ order,

with the codes defined by the method below.

### Huffman Tree Description
+
Prefix coding represents symbols from an a priori known alphabet
by bit sequences (codewords), one codeword for each symbol,
in a manner such that different symbols may be represented

@@ -1112,7 +1142,6 @@ More bits improve accuracy but cost more header size,

and require more memory or more complex decoding operations.
This specification limits maximum code length to 11 bits.
-
##### Representation

All literal values from zero (included) to last present one (excluded)

@@ -1190,7 +1219,7 @@ and last symbol's weight is not represented.

An FSE bitstream starts with a header describing the probability distribution.
It will create a Decoding Table.
-For a list of Huffman weights, the maximum accuracy log is 7 bits.
+For a list of Huffman weights, the maximum accuracy log is 6 bits.
For more description see the [FSE header description](#fse-table-description).

The Huffman header compression uses 2 states,

@@ -1330,7 +1359,8 @@ __`Content`__ : The rest of the dictionary is its content.

  As long as the amount of data decoded from this frame is less than or
  equal to `Window_Size`, sequence commands may specify offsets longer
  than the total length of decoded output so far to reference back to the
- dictionary. After the total output has surpassed `Window_Size` however,
+ dictionary, even parts of the dictionary with offsets larger than `Window_Size`.
+ After the total output has surpassed `Window_Size` however,
  this is no longer allowed and the dictionary is no longer accessible.

[compressed blocks]: #the-format-of-compressed_block

@@ -1523,6 +1553,7 @@ to crosscheck that an implementation builds its decoding tables correctly.

Version changes
---------------
+- 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell
- 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz
- 0.2.5 : minor typos and clarifications
- 0.2.4 : section restructuring, by Sean Purcell