Updated format specification to be easier to understand

author Sean Purcell <me@seanp.xyz>

Thu, 26 Jan 2017 00:41:52 +0000 (16:41 -0800)

committer Sean Purcell <me@seanp.xyz>

Thu, 26 Jan 2017 00:42:41 +0000 (16:42 -0800)
diff --git a/doc/zstd_compression_format.md b/doc/zstd_compression_format.md

index b48b39104ba5671006896c10886d90dfa78d62c3..03a970c16576701a68b9c58f6a4ac25bf1689315 100644 (file)
--- a/doc/zstd_compression_format.md
+++ b/doc/zstd_compression_format.md
@@ -57,7 +57,6 @@ Whenever it does not support a parameter defined in the compressed stream,
  it must produce a non-ambiguous error code and associated error message
  explaining which parameter is unsupported.
  
-
  Overall conventions
  -----------
  In this document:
@@ -267,7 +266,7 @@ The `Window_Descriptor` byte is optional. It is absent when `Single_Segment_flag
  In this case, the maximum back-reference distance is the content size itself,
  which can be any value from 1 to 2^64-1 bytes (16 EB).
  
-| Bit numbers |     7-3    |     0-2    |
+| Bit numbers |     7-3    |     2-0    |
  | ----------- | ---------- | ---------- |
  | Field name  | `Exponent` | `Mantissa` |
  
@@ -381,7 +380,7 @@ There are 4 block types :
    This value cannot be used with current version of this specification.
  
  Block sizes must respect a few rules :
-- In compressed mode, compressed size if always strictly `< decompressed size`.
+- In compressed mode, `compressed size` is always strictly `< decompressed size`.
  - Block decompressed size is always <= maximum back-reference distance .
  - Block decompressed size is always <= 128 KB
  
@@ -478,7 +477,7 @@ For values spanning several bytes, convention is little-endian.
  
  __`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
  
-- Value x0 : `Regenerated_Size` uses 5 bits (0-31).
+- Value X0 : `Size_Format` uses 1 bit, `Regenerated_Size` uses 5 bits (0-31).
                 `Literals_Section_Header` has 1 byte.
                 `Regenerated_Size = Header[0]>>3`
  - Value 01 : `Regenerated_Size` uses 12 bits (0-4095).
@@ -507,7 +506,8 @@ __`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block
                 `Literals_Section_Header` has 5 bytes.
  
  Both `Compressed_Size` and `Regenerated_Size` fields follow little-endian convention.
-
+Note: `Compressed_Size` __includes__ the size of the Huffman Tree description if it
+is present.
  
  #### `Huffman_Tree_Description`
  
@@ -550,23 +550,24 @@ Let's presume the following Huffman tree must be described :
  | `Number_of_Bits` |  1  |  2  |  3  |  0  |  4  |  4  |
  
  The tree depth is 4, since its smallest element uses 4 bits.
-Value `5` will not be listed, nor will values above `5`.
+Value `5` will not be listed as it can be determined from the values for 0-4,
+nor will values above `5` as they are all 0.
  Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
  Weight formula is :
  ```
  Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
  ```
-It gives the following serie of weights :
+It gives the following series of weights :
  
-| `Weight` |  4  |  3  |  2  |  0  |  1  |
-| -------- | --- | --- | --- | --- | --- |
  | literal  |  0  |  1  |  2  |  3  |  4  |
+| -------- | --- | --- | --- | --- | --- |
+| `Weight` |  4  |  3  |  2  |  0  |  1  |
  
  The decoder will do the inverse operation :
  having collected weights of literals from `0` to `4`,
  it knows the last literal, `5`, is present with a non-zero weight.
-The weight of `5` can be deducted by joining to the nearest power of 2.
-Sum of `2^(Weight-1)` (excluding 0) is :
+The weight of `5` can be determined by advancing to the next power of 2.
+The sum of `2^(Weight-1)` (excluding 0's) is :
  `8 + 4 + 2 + 0 + 1 = 15`.
  Nearest power of 2 is 16.
  Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 1`.
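As a rough illustration of the deduction described in this hunk, the following Python sketch recovers the implicit last weight and `Max_Number_of_Bits` from the explicitly listed weights. The helper names `complete_weights` and `weights_to_bit_lengths` are hypothetical and come from neither the specification nor the reference implementation. The sketch also assumes that "next power of 2" means the smallest power of 2 strictly greater than the sum.

```python
def complete_weights(listed_weights):
    """Deduce the implicit last weight from the explicitly listed ones.

    Returns (max_number_of_bits, all_weights). Hypothetical helper.
    """
    # Sum of 2^(Weight-1) over the non-zero weights.
    total = sum(1 << (w - 1) for w in listed_weights if w > 0)
    if total == 0:
        raise ValueError("at least one non-zero weight is required")
    # Assumption: advance to the smallest power of 2 strictly above the sum.
    max_number_of_bits = total.bit_length()
    remainder = (1 << max_number_of_bits) - total
    # The last symbol must fill the gap exactly: 2^(Weight-1) == remainder.
    if remainder & (remainder - 1):
        raise ValueError("remaining weight is not a power of 2")
    last_weight = remainder.bit_length()
    return max_number_of_bits, list(listed_weights) + [last_weight]


def weights_to_bit_lengths(weights, max_number_of_bits):
    """Invert the Weight formula: Number_of_Bits = Max + 1 - Weight, or 0."""
    return [max_number_of_bits + 1 - w if w else 0 for w in weights]
```

With the weights from the example (`4, 3, 2, 0, 1`), this sketch should yield `Max_Number_of_Bits = 4`, a last weight of `1`, and the `Number_of_Bits` row of the table above.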
@@ -574,23 +575,38 @@ Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 1`.
  ##### Huffman Tree header
  
  This is a single byte value (0-255),
-which tells how to decode the list of weights.
+which describes how to decode the list of weights.
  
  - if `headerByte` >= 128 : this is a direct representation,
    where each `Weight` is written directly as a 4 bits field (0-15).
+  They are encoded forward, 2 weights to a byte with the first weight taking
+  the top 4 bits and the second taking the bottom 4 bits (e.g.
+  `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc.).
    The full representation occupies `((Number_of_Symbols+1)/2)` bytes,
    meaning it uses a last full byte even if `Number_of_Symbols` is odd.
    `Number_of_Symbols = headerByte - 127`.
    Note that maximum `Number_of_Symbols` is 255-127 = 128.
-  A larger serie must necessarily use FSE compression.
+  A larger series must necessarily use FSE compression.
  
  - if `headerByte` < 128 :
-  the serie of weights is compressed by FSE.
-  The length of the FSE-compressed serie is equal to `headerByte` (0-127).
+  the series of weights is compressed by FSE.
+  The length of the FSE-compressed series is equal to `headerByte` (0-127).
  
  ##### Finite State Entropy (FSE) compression of Huffman weights
  
-The serie of weights is compressed using FSE compression.
+FSE decoding uses three operations: `Init_State`, `Decode_Symbol`, and `Update_State`.
+`Init_State` reads in the initial state value from a bitstream,
+`Decode_Symbol` outputs a symbol based on the current state,
+and `Update_State` goes to a new state based on the current state and some number of consumed bits.
+
+FSE streams must be read in reverse from the order they're encoded in,
+so bitstreams start at a certain offset and work backwards towards their base.
+
+For more on how FSE bitstreams work, see [Finite State Entropy].
+
+[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/
+
+The series of Huffman weights is compressed using FSE compression.
  It's a single bitstream with 2 interleaved states,
  sharing a single distribution table.
  
@@ -598,22 +614,27 @@ To decode an FSE bitstream, it is necessary to know its compressed size.
  Compressed size is provided by `headerByte`.
  It's also necessary to know its _maximum possible_ decompressed size,
  which is `255`, since literal values span from `0` to `255`,
-and last symbol value is not represented.
+and last symbol's weight is not represented.
  
  An FSE bitstream starts by a header, describing probabilities distribution.
  It will create a Decoding Table.
-Table must be pre-allocated, which requires to support a maximum accuracy.
+The table must be pre-allocated, so a maximum accuracy must be fixed.
  For a list of Huffman weights, maximum accuracy is 7 bits.
  
-FSE header is [described in relevant chapter](#fse-distribution-table--condensed-format),
-and so is [FSE bitstream](#bitstream).
+The FSE header format is [described in a relevant chapter](#fse-distribution-table--condensed-format),
+as is the [FSE bitstream](#bitstream).
  The main difference is that Huffman header compression uses 2 states,
  which share the same FSE distribution table.
-Bitstream contains only FSE symbols (no interleaved "raw bitfields").
-The number of symbols to decode is discovered
-by tracking bitStream overflow condition.
-When both states have overflowed the bitstream, end is reached.
+The first state (`State1`) encodes the even-indexed symbols,
+and the second (`State2`) encodes the odd-indexed symbols.
+State1 is initialized first, and then State2, and they take turns decoding
+a single symbol and updating their state.
  
+The number of symbols to decode is determined
+by tracking bitStream overflow condition:
+If updating state after decoding a symbol would require more bits than
+remain in the stream, it is assumed the extra bits are 0.  Then,
+the symbols for each of the final states are decoded and the process is complete.
  
  ##### Conversion from weights to Huffman prefix codes
  
@@ -687,9 +708,20 @@ Consequently, a last byte of `0` is not possible.
  And the final-bit-flag itself is not part of the useful bitstream.
  Hence, the last byte contains between 0 and 7 useful bits.
  
+For example, if the literal sequence "0145" was encoded using the prefix codes above,
+it would be encoded as:
+```
+00000001 01110000
+```
+
+|Symbol  |   5  |   4  |  1 | 0 | Padding |
+|--------|------|------|----|---|---------|
+|Encoding|`0000`|`0001`|`01`|`1`| `10000` |
+
  Starting from the end,
  it's possible to read the bitstream in a little-endian fashion,
-keeping track of already used bits.
+keeping track of already used bits.  Since the bitstream is encoded in reverse
+order, by starting at the end the symbols can be read in forward order.
  
  Reading the last `Max_Number_of_Bits` bits,
  it's then possible to compare extracted value to decoding table,
@@ -700,7 +732,6 @@ If a bitstream is not entirely and exactly consumed,
  hence reaching exactly its beginning position with _all_ bits consumed,
  the decoding process is considered faulty.
  
-
  ### `Sequences_Section`
  
  A compressed block is a succession of _sequences_ .
@@ -712,7 +743,7 @@ The offset gives the position to copy from,
  which can be within a previous block.
  
  When all _sequences_ are decoded,
-if there is any literal left in the _literal section_,
+if there are any literals left in the _literal section_,
  these bytes are added at the end of the block.
  
  The `Sequences_Section` regroup all symbols required to decode commands.
@@ -810,7 +841,7 @@ They define lengths from 0 to 131071 bytes.
  When `Compression_Mode` is `Predefined_Mode`,
  a predefined distribution is used for FSE compression.
  
-Below is its definition. It uses an accuracy of 6 bits (64 states).
+Its definition is below. It uses an accuracy of 6 bits (64 states).
  ```
  short literalsLength_defaultDistribution[36] =
          { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
@@ -835,12 +866,12 @@ They define lengths from 3 to 131074 bytes.
  
  | `Match_Length_Code` |  40  |  41  |  42  |  43  |  44  |  45  |  46  |  47  |
  | ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| `Baseline`          |  67  |  83  |  99  |  131 |  258 |  514 | 1026 | 2050 |
+| `Baseline`          |  67  |  83  |  99  |  131 |  259 |  515 | 1027 | 2051 |
  | `Number_of_Bits`    |   4  |   4  |   5  |   7  |   8  |   9  |  10  |  11  |
  
  | `Match_Length_Code` |  48  |  49  |  50  |  51  |  52  |
  | ------------------- | ---- | ---- | ---- | ---- | ---- |
-| `Baseline`          | 4098 | 8194 |16486 |32770 |65538 |
+| `Baseline`          | 4099 | 8195 |16387 |32771 |65539 |
  | `Number_of_Bits`    |  12  |  13  |  14  |  15  |  16  |
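One way to sanity-check the corrected `Baseline` values is to confirm that each code's range ends exactly where the next one begins. The sketch below does this in Python for the rows touched by this hunk. The names `MATCH_LENGTH_ROWS`, `ranges_are_contiguous`, and `match_length` are hypothetical, and the sketch assumes a length is simply `Baseline` plus the value of the extra bits read.

```python
# (Match_Length_Code, Baseline, Number_of_Bits) rows as corrected by this commit.
MATCH_LENGTH_ROWS = [
    (40, 67, 4), (41, 83, 4), (42, 99, 5), (43, 131, 7),
    (44, 259, 8), (45, 515, 9), (46, 1027, 10), (47, 2051, 11),
    (48, 4099, 12), (49, 8195, 13), (50, 16387, 14), (51, 32771, 15),
    (52, 65539, 16),
]


def ranges_are_contiguous(rows):
    """True when every Baseline equals the previous Baseline + 2^Number_of_Bits."""
    return all(
        base + (1 << bits) == next_base
        for (_, base, bits), (_, next_base, _) in zip(rows, rows[1:])
    )


def match_length(baseline, number_of_bits, extra_bits_value):
    """Assumed decoding rule: Baseline plus the extra bits read from the stream."""
    if not 0 <= extra_bits_value < (1 << number_of_bits):
        raise ValueError("extra bits value does not fit in Number_of_Bits")
    return baseline + extra_bits_value
```

Under these assumptions the largest representable value for code 52 is `65539 + 65535`, which agrees with the stated upper bound of 131074 bytes.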
  
  ##### Default distribution for match length codes
@@ -848,7 +879,7 @@ They define lengths from 3 to 131074 bytes.
  When `Compression_Mode` is defined as `Predefined_Mode`,
  a predefined distribution is used for FSE compression.
  
-Below is its definition. It uses an accuracy of 6 bits (64 states).
+Its definition is below. It uses an accuracy of 6 bits (64 states).
  ```
  short matchLengths_defaultDistribution[53] =
          { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
@@ -908,7 +939,7 @@ When present, they are in this order :
  - Match Lengths
  
  The content to decode depends on their respective encoding mode :
-- `Predefined_Mode` : no content. Use predefined distribution table.
+- `Predefined_Mode` : no content. Use the predefined distribution table.
  - `RLE_Mode` : 1 byte. This is the only code to use across the whole compressed block.
  - `FSE_Compressed_Mode` : A distribution table is present.
  - `Repeat_Mode` : no content. Re-use distribution from previous compressed block.
@@ -936,12 +967,12 @@ It depends on :
    __example__ :
    Presuming an `Accuracy_Log` of 8,
    and presuming 100 probabilities points have already been distributed,
-  the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included).
+  the decoder may read any value from `0` to `255 - 100 + 1 == 156` (inclusive).
    Therefore, it must read `log2sup(156) == 8` bits.
  
  - Value decoded : small values use 1 less bit :
    __example__ :
-  Presuming values from 0 to 156 (included) are possible,
+  Presuming values from 0 to 156 (inclusive) are possible,
    255-156 = 99 values are remaining in an 8-bits field.
    They are used this way :
    first 99 values (hence from 0 to 98) use only 7 bits,
@@ -967,7 +998,7 @@ For the purpose of calculating cumulated distribution, it counts as one.
  
  [next paragraph]:#fse-decoding--from-normalized-distribution-to-decoding-tables
  
-When a symbol has a probability of `zero`,
+When a symbol has a __probability__ of `zero`,
  it is followed by a 2-bits repeat flag.
  This repeat flag tells how many probabilities of zeroes follow the current one.
  It provides a number ranging from 0 to 3.
@@ -1012,6 +1043,9 @@ position &= tableSize-1;
  
  A position is skipped if already occupied,
  typically by a "less than 1" probability symbol.
+`position` does not reset between symbols; it simply iterates through
+each position in the table, switching to the next symbol when enough
+states have been allocated to the current one.
  
  The result is a list of state values.
  Each state will decode the current symbol.
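The allocation loop described in this hunk might look roughly like the Python sketch below. Two things are assumptions rather than content of this excerpt: the step `(tableSize>>1) + (tableSize>>3) + 3`, which is taken from the surrounding specification text that the diff does not show, and the convention that a "less than 1" probability is encoded as `-1` and placed at the end of the table. The function name `spread_symbols` is hypothetical.

```python
def spread_symbols(normalized_counts, accuracy_log):
    """Assign a symbol to every state of an FSE decoding table (sketch only)."""
    table_size = 1 << accuracy_log
    if sum(abs(c) for c in normalized_counts) != table_size:
        raise ValueError("probabilities must sum to the table size")
    table = [None] * table_size
    # Assumption: "less than 1" symbols (-1) take one cell each at the table's end.
    high_threshold = table_size - 1
    for symbol, count in enumerate(normalized_counts):
        if count == -1:
            table[high_threshold] = symbol
            high_threshold -= 1
    # Assumption: step formula from the spec text outside this excerpt.
    step = (table_size >> 1) + (table_size >> 3) + 3
    mask = table_size - 1
    position = 0  # never reset between symbols
    for symbol, count in enumerate(normalized_counts):
        for _ in range(max(count, 0)):
            table[position] = symbol
            position = (position + step) & mask
            # Skip cells already taken by "less than 1" symbols.
            while position > high_threshold:
                position = (position + step) & mask
    return table
```

Because the step is odd and the table size is a power of 2, the walk visits every cell exactly once before returning to position 0, so each symbol ends up with as many states as its normalized count.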
@@ -1043,7 +1077,7 @@ Numbering starts from higher states using less bits.
  | `Baseline`       |  32   |  64   |   96   |   0  |  16   |
  | range            | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
  
-Next state is determined from current state
+The next state is determined from the current state
  by reading the required `Number_of_Bits`, and adding the specified `Baseline`.
  
  
@@ -1093,15 +1127,16 @@ and then for `Literals_Length`.
  It starts by inserting the number of literals defined by `Literals_Length`,
  then continue by copying `Match_Length` bytes from `currentPos - Offset`.
  
-The next operation is to update states.
-Using rules pre-calculated in the decoding tables,
+If it is not the last sequence in the block,
+the next operation is to update states.
+Using the rules pre-calculated in the decoding tables,
  `Literals_Length_State` is updated,
  followed by `Match_Length_State`,
  and then `Offset_State`.
  
  This operation will be repeated `Number_of_Sequences` times.
  At the end, the bitstream shall be entirely consumed,
-otherwise bitstream is considered corrupted.
+otherwise the bitstream is considered corrupted.
  
  [Symbol Decoding]:#the-codes-for-literals-lengths-match-lengths-and-offsets
  
@@ -1111,13 +1146,13 @@ As seen in [Offset Codes], the first 3 values define a repeated offset and we wi
  They are sorted in recency order, with `Repeated_Offset1` meaning "most recent one".
  
  There is an exception though, when current sequence's literals length is `0`.
-In which case, repeated offsets are "pushed by one",
+In this case, repeated offsets are shifted by one,
  so `Repeated_Offset1` becomes `Repeated_Offset2`, `Repeated_Offset2` becomes `Repeated_Offset3`,
  and `Repeated_Offset3` becomes `Repeated_Offset1 - 1_byte`.
  
-On first block, offset history is populated by the following values : 1, 4 and 8 (in order).
+In the first block, the offset history is populated with the following values : 1, 4 and 8 (in order).
  
-Then each block receives its start value from previous compressed block.
+Then each block gets its starting offset history from the ending values of the most recent compressed block.
  Note that non-compressed blocks are skipped,
  they do not contribute to offset history.
  
@@ -1125,11 +1160,12 @@ they do not contribute to offset history.
  
  ###### Offset updates rules
  
-New offset take the lead in offset history,
-up to its previous place if it was already present.
+The newest offset takes the lead in offset history,
+shifting others back (up to its previous place if it was already present).
  
-It means that when `Repeated_Offset1` (most recent) is used, history is unmodified.
+This means that when `Repeated_Offset1` (most recent) is used, history is unmodified.
  When `Repeated_Offset2` is used, it's swapped with `Repeated_Offset1`.
+If any other offset is used, it becomes `Repeated_Offset1` and the rest are shifted back by one.
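The repeated-offset rules from the last two hunks can be sketched in Python as shown below. The function names `take_lead` and `resolve_repeat_offset`, and the representation of the history as a 3-element list, are assumptions made for illustration. The sketch does not validate the resulting offset (for example, a `Repeated_Offset1 - 1_byte` result of 0).

```python
def take_lead(history, offset, previous_place=None):
    """Put `offset` first, shifting others back up to its previous place.

    `previous_place` is the index the offset occupied in `history`,
    or None when it was not present (the oldest entry is then dropped).
    """
    if previous_place is None:
        kept = history[:-1]
    else:
        kept = history[:previous_place] + history[previous_place + 1:]
    return [offset] + kept


def resolve_repeat_offset(history, repeat_index, literals_length):
    """Resolve repeat code 1..3 against [Repeated_Offset1, 2, 3].

    Returns (offset, new_history). Hypothetical helper.
    """
    if repeat_index not in (1, 2, 3):
        raise ValueError("repeat_index must be 1, 2 or 3")
    r1, r2, r3 = history
    if literals_length == 0:
        # Shifted by one: 1 -> Offset2, 2 -> Offset3, 3 -> Offset1 - 1 byte.
        candidates = [(r2, 1), (r3, 2), (r1 - 1, None)]
    else:
        candidates = [(r1, 0), (r2, 1), (r3, 2)]
    offset, previous_place = candidates[repeat_index - 1]
    return offset, take_lead(history, offset, previous_place)
```

Starting from the initial history `[1, 4, 8]`, using `Repeated_Offset1` leaves the history untouched, using `Repeated_Offset2` swaps the first two entries, and a brand-new offset passed to `take_lead` pushes the oldest entry out.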
  
  
  Dictionary format
@@ -1137,6 +1173,9 @@ Dictionary format
  
  `zstd` is compatible with "raw content" dictionaries, free of any format restriction,
  except that they must be at least 8 bytes.
+These dictionaries function as if they were just the `Content` block of a formatted
+dictionary.
+
  But dictionaries created by `zstd --train` follow a format, described here.
  
  __Pre-requisites__ : a dictionary has a size,
@@ -1160,16 +1199,17 @@ _Reserved ranges :_
                - low range : 1 - 32767
                - high range : >= (2^31)
  
-__`Entropy_Tables`__ : following the same format as a [compressed blocks].
+__`Entropy_Tables`__ : following the same format as the tables in [compressed blocks].
                They are stored in following order :
                Huffman tables for literals, FSE table for offsets,
                FSE table for match lengths, and FSE table for literals lengths.
-              It's finally followed by 3 offset values, populating recent offsets,
+              It's finally followed by 3 offset values, populating recent offsets (instead of using `{1,4,8}`),
                stored in order, 4-bytes little-endian each, for a total of 12 bytes.
                Each recent offset must have a value < dictionary size.
  
  __`Content`__ : The rest of the dictionary is its content.
-              The content act as a "past" in front of data to compress or decompress.
+              The content acts as a "past" in front of data to compress or decompress,
+              so it can be referenced in sequence commands.
  
  [compressed blocks]: #the-format-of-compressed_block