updated zstd_compression_format.md

author inikep <inikep@gmail.com>

Thu, 25 Aug 2016 12:59:08 +0000 (14:59 +0200)

committer inikep <inikep@gmail.com>

Thu, 25 Aug 2016 12:59:08 +0000 (14:59 +0200)
diff --git a/programs/README.md b/programs/README.md

index 0fbb8a357e4c5398064be92e672c0295554aaf3a..9bd1e71b39cc6bdda7c974adaef3a44c4a4c9b2e 100644 (file)
--- a/programs/README.md
+++ b/programs/README.md
@@ -31,8 +31,8 @@ will rely more and more on previously decoded content to compress the rest of th
  Usage of the dictionary builder and created dictionaries with CLI:
  
  1. Create the dictionary : `zstd --train FullPathToTrainingSet/* -o dictionaryName`
-2. Compress with dictionary: `zstd FILE -D dictionaryName`
-3. Decompress with dictionary: `zstd --decompress FILE.zst -D dictionaryName`
+2. Compress with the dictionary: `zstd FILE -D dictionaryName`
+3. Decompress with the dictionary: `zstd --decompress FILE.zst -D dictionaryName`
  
  
  
diff --git a/zstd_compression_format.md b/zstd_compression_format.md

index 867a9b03302d0506339ae5da0541f6638a07737f..7143eea316ea649305994dba6af2ed38a015d81d 100644 (file)
--- a/zstd_compression_format.md
+++ b/zstd_compression_format.md
@@ -271,7 +271,7 @@ which can be any value from 1 to 2^64-1 bytes (16 EB).
  | ----------- | ---------- | ---------- |
  | Field name  | `Exponent` | `Mantissa` |
  
-Maximum distance is given by the following formulae :
+Maximum distance is given by the following formulas :
  ```
  windowLog = 10 + Exponent;
  windowBase = 1 << windowLog;
@@ -415,7 +415,7 @@ To decode a compressed block, the following elements are necessary :
    or all previous blocks when `Single_Segment_flag` is set.
  - List of "recent offsets" from previous compressed block.
  - Decoding tables of previous compressed block for each symbol type
-  (literals, litLength, matchLength, offset).
+  (literals, literals lengths, match lengths, offsets).
  
  
  ### `Literals_Section`
@@ -510,7 +510,7 @@ Both `Compressed_Size` and `Regenerated_Size` fields follow little-endian conven
  
  #### `Huffman_Tree_Description`
  
-This section is only present when `Literals_Block_Type` type is `Compressed_Block` (`2`).
+This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
  
  Prefix coding represents symbols from an a priori known alphabet
  by bit sequences (codewords), one codeword for each symbol,
@@ -532,9 +532,11 @@ This specification limits maximum code length to 11 bits.
  ##### Representation
  
  All literal values from zero (included) to last present one (excluded)
-are represented by `Weight` values, from 0 to `Max_Number_of_Bits`.
-Transformation from `Weight` to `Number_of_Bits` follows this formulae :
-`Number_of_Bits = Weight ? Max_Number_of_Bits + 1 - Weight : 0` .
+are represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
+Transformation from `Weight` to `Number_of_Bits` follows this formula :
+```
+Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
+```
  The last symbol's `Weight` is deduced from previously decoded ones,
  by completing to the nearest power of 2.
  This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
@@ -549,7 +551,10 @@ Let's presume the following Huffman tree must be described :
  The tree depth is 4, since its smallest element uses 4 bits.
  Value `5` will not be listed, nor will values above `5`.
  Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
-Weight formula is : `Weight = Number_of_Bits ? Max_Number_of_Bits + 1 - Number_of_Bits : 0`.
+Weight formula is : 
+```
+Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
+```
  It gives the following serie of weights :
  
  | `Weight` |  4  |  3  |  2  |  0  |  1  |
@@ -580,9 +585,9 @@ which tells how to decode the list of weights.
  
  - if `headerByte` < 128 :
    the serie of weights is compressed by FSE.
-  The length of the FSE-compressed serie is `headerByte` (0-127).
+  The length of the FSE-compressed serie is equal to `headerByte` (0-127).
  
-##### FSE (Finite State Entropy) compression of Huffman weights
+##### Finite State Entropy (FSE) compression of Huffman weights
  
  The serie of weights is compressed using FSE compression.
  It's a single bitstream with 2 interleaved states,
@@ -612,9 +617,10 @@ When both states have overflowed the bitstream, end is reached.
  ##### Conversion from weights to Huffman prefix codes
  
  All present symbols shall now have a `Weight` value.
-It is possible to transform weights into Number_of_Bits, using this formula :
-`Number_of_Bits = Number_of_Bits ? Max_Number_of_Bits + 1 - Weight : 0` .
-
+It is possible to transform weights into Number_of_Bits, using this formula:
+```
+Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
+```
  Symbols are sorted by `Weight`. Within same `Weight`, symbols keep natural order.
  Symbols with a `Weight` of zero are removed.
  Then, starting from lowest weight, prefix codes are distributed in order.
@@ -636,21 +642,21 @@ it gives the following distribution :
  | prefix codes     | N/A | 0000| 0001| 001 | 01  |   1  |
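
One possible way to hand out the prefix codes described in this hunk is sketched below. It is an assumption-laden illustration: `prefix_code` and `assign_prefix_codes` are hypothetical names, and the running counter is just one convenient way to distribute codes "in order" starting from the lowest weight.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: length == 0 means the symbol is absent. */
typedef struct {
    unsigned code;    /* prefix code, stored in the low `length` bits */
    unsigned length;  /* Number_of_Bits */
} prefix_code;

/* Walk weights from lowest to highest; within one weight, symbols keep
 * their natural order. Symbols with a weight of zero get no code. */
static void assign_prefix_codes(const unsigned char *weights, size_t count,
                                unsigned max_bits, prefix_code *out)
{
    unsigned next = 0;   /* next free position, counted at full tree depth */
    unsigned w;
    size_t s;
    assert(max_bits >= 1 && max_bits <= 11);
    for (s = 0; s < count; s++) {
        out[s].code = 0;
        out[s].length = 0;
    }
    for (w = 1; w <= max_bits; w++) {
        for (s = 0; s < count; s++) {
            if (weights[s] != w) continue;
            out[s].length = max_bits + 1 - w;
            out[s].code = next >> (w - 1);   /* drop the bits this code does not use */
            next += 1u << (w - 1);           /* reserve 2^(w-1) leaves of the tree */
        }
    }
}
```

For weights `4, 3, 2, 0, 1, 1` and a depth of 4 this produces the codes `1`, `01`, `001`, none, `0000`, `0001`, which matches the row above.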
  
  
-#### Literals bitstreams
+#### The content of Huffman-compressed literal stream
  
  ##### Bitstreams sizes
  
  As seen in a previous paragraph,
-there are 2 flavors of Huffman-compressed literals :
-single stream, and 4-streams.
+there are 2 types of Huffman-compressed literals :
+a single stream and 4 streams.
  
-4-streams is useful for CPU with multiple execution units and out-of-order operations.
+Encoding using 4 streams is useful for CPU with multiple execution units and out-of-order operations.
  Since each stream can be decoded independently,
  it's possible to decode them up to 4x faster than a single stream,
  presuming the CPU has enough parallelism available.
  
  For single stream, header provides both the compressed and regenerated size.
-For 4-streams though,
+For 4 streams though,
  header only provides compressed and regenerated size of all 4 streams combined.
  In order to properly decode the 4 streams,
  it's necessary to know the compressed and regenerated size of each stream.
@@ -663,8 +669,10 @@ bitstreams are preceded by 3 unsigned little-endian 16-bits values.
  Each value represents the compressed size of one stream, in order.
  The last stream size is deducted from total compressed size
  and from previously decoded stream sizes :
+
  `stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize`.
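
As a rough sketch of the size computation above: the helper below reads the three little-endian 16-bit values and derives the fourth size. The names `read_le16` and `read_stream_sizes` are invented for this example, and the error handling is an assumption rather than something the specification mandates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read an unsigned little-endian 16-bit value. */
static unsigned read_le16(const uint8_t *p)
{
    return (unsigned)p[0] | ((unsigned)p[1] << 8);
}

/* Hypothetical helper: fills sizes[0..3] with the compressed size of each
 * stream. total_csize covers the 6-byte jump table plus all 4 streams.
 * Returns 0 on success, -1 if the sizes are inconsistent. */
static int read_stream_sizes(const uint8_t *src, size_t total_csize,
                             size_t sizes[4])
{
    size_t used;
    assert(src != NULL && sizes != NULL);
    if (total_csize < 6) return -1;
    sizes[0] = read_le16(src);
    sizes[1] = read_le16(src + 2);
    sizes[2] = read_le16(src + 4);
    used = 6 + sizes[0] + sizes[1] + sizes[2];
    if (used > total_csize) return -1;   /* first 3 streams overrun the block */
    sizes[3] = total_csize - used;       /* stream4CSize */
    return 0;
}
```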
  
+
  ##### Bitstreams read and decode
  
  Each bitstream must be read _backward_,
@@ -706,23 +714,18 @@ When all _sequences_ are decoded,
  if there is any literal left in the _literal section_,
  these bytes are added at the end of the block.
  
-The _Sequences_Section_ regroup all symbols required to decode commands.
+The `Sequences_Section` regroup all symbols required to decode commands.
  There are 3 symbol types : literals lengths, offsets and match lengths.
  They are encoded together, interleaved, in a single _bitstream_.
  
-Each symbol is a _code_ in its own context,
-which specifies a baseline and a number of bits to add.
-_Codes_ are FSE compressed,
-and interleaved with raw additional bits in the same bitstream.
-
-The Sequences section starts by a header,
-followed by optional Probability tables for each symbol type,
+The `Sequences_Section` starts by a header,
+followed by optional probability tables for each symbol type,
  followed by the bitstream.
  
  | `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream |
  | -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |
  
-To decode the Sequence section, it's required to know its size.
+To decode the `Sequences_Section`, it's required to know its size.
  This size is deducted from `blockSize - literalSectionSize`.
  
  
@@ -753,8 +756,8 @@ This is a single byte, defining the compression mode of each symbol type.
  
  The last field, `Reserved`, must be all-zeroes.
  
-`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the compression mode of
-literals lengths, offsets and match lengths respectively.
+`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of
+literals lengths, offsets, and match lengths respectively.
  
  They follow the same enumeration :
  
@@ -769,9 +772,14 @@ They follow the same enumeration :
            A distribution table will be present.
            It will be described in [next part](#distribution-tables).
  
-#### Symbols decoding
+#### The codes for literals lengths, match lengths, and offsets.
  
-##### Literals Length codes
+Each symbol is a _code_ in its own context,
+which specifies `Baseline` and `Number_of_Bits` to add.
+_Codes_ are FSE compressed,
+and interleaved with raw additional bits in the same bitstream.
+
+##### Literals length codes 
  
  Literals length codes are values ranging from `0` to `35` included.
  They define lengths from 0 to 131071 bytes.
@@ -783,20 +791,20 @@ They define lengths from 0 to 131071 bytes.
  
  | `Literals_Length_Code` |  16  |  17  |  18  |  19  |  20  |  21  |  22  |  23  |
  | ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| Baseline               |  16  |  18  |  20  |  22  |  24  |  28  |  32  |  40  |
+| `Baseline`             |  16  |  18  |  20  |  22  |  24  |  28  |  32  |  40  |
  | `Number_of_Bits`       |   1  |   1  |   1  |   1  |   2  |   2  |   3  |   3  |
  
  | `Literals_Length_Code` |  24  |  25  |  26  |  27  |  28  |  29  |  30  |  31  |
  | ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| Baseline               |  48  |  64  |  128 |  256 |  512 | 1024 | 2048 | 4096 |
+| `Baseline`             |  48  |  64  |  128 |  256 |  512 | 1024 | 2048 | 4096 |
  | `Number_of_Bits`       |   4  |   6  |   7  |   8  |   9  |  10  |  11  |  12  |
  
  | `Literals_Length_Code` |  32  |  33  |  34  |  35  |
  | ---------------------- | ---- | ---- | ---- | ---- |
-| Baseline               | 8192 |16384 |32768 |65536 |
+| `Baseline`             | 8192 |16384 |32768 |65536 |
  | `Number_of_Bits`       |  13  |  14  |  15  |  16  |
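
The three tables above can be turned into a small lookup, sketched here in C. The function name is hypothetical, `extra_bits` stands for the raw bits already read from the bitstream, and the handling of codes `0` to `15` (length equal to the code, no extra bits) is taken from the part of the specification that this hunk does not show.

```c
#include <assert.h>
#include <stdint.h>

/* Baseline and Number_of_Bits for Literals_Length_Code 16..35. */
static const uint32_t ll_baseline[20] = {
    16, 18, 20, 22, 24, 28, 32, 40,
    48, 64, 128, 256, 512, 1024, 2048, 4096,
    8192, 16384, 32768, 65536
};
static const uint8_t ll_bits[20] = {
    1, 1, 1, 1, 2, 2, 3, 3,
    4, 6, 7, 8, 9, 10, 11, 12,
    13, 14, 15, 16
};

/* Hypothetical helper: turn a code plus its raw extra bits into a length. */
static uint32_t literals_length(unsigned code, uint32_t extra_bits)
{
    assert(code <= 35);
    if (code < 16) return code;   /* assumption: codes 0..15 carry no extra bits */
    assert(extra_bits < ((uint32_t)1 << ll_bits[code - 16]));
    return ll_baseline[code - 16] + extra_bits;
}
```

The largest value, code `35` with all 16 extra bits set, gives 65536 + 65535 = 131071, the upper bound quoted above.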
  
-__Default distribution__
+##### Default distribution for literals length codes
  
  When `Compression_Mode` is `Predefined_Mode`,
  a predefined distribution is used for FSE compression.
@@ -809,7 +817,7 @@ short literalsLength_defaultDistribution[36] =
           -1,-1,-1,-1 };
  ```
  
-##### Match Length codes
+##### Match length codes
  
  Match length codes are values ranging from `0` to `52` included.
  They define lengths from 3 to 131074 bytes.
@@ -821,25 +829,25 @@ They define lengths from 3 to 131074 bytes.
  
  | `Match_Length_Code` |  32  |  33  |  34  |  35  |  36  |  37  |  38  |  39  |
  | ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| Baseline            |  35  |  37  |  39  |  41  |  43  |  47  |  51  |  59  |
+| `Baseline`          |  35  |  37  |  39  |  41  |  43  |  47  |  51  |  59  |
  | `Number_of_Bits`    |   1  |   1  |   1  |   1  |   2  |   2  |   3  |   3  |
  
  | `Match_Length_Code` |  40  |  41  |  42  |  43  |  44  |  45  |  46  |  47  |
  | ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| Baseline            |  67  |  83  |  99  |  131 |  258 |  514 | 1026 | 2050 |
+| `Baseline`          |  67  |  83  |  99  |  131 |  258 |  514 | 1026 | 2050 |
  | `Number_of_Bits`    |   4  |   4  |   5  |   7  |   8  |   9  |  10  |  11  |
  
  | `Match_Length_Code` |  48  |  49  |  50  |  51  |  52  |
  | ------------------- | ---- | ---- | ---- | ---- | ---- |
-| Baseline            | 4098 | 8194 |16486 |32770 |65538 |
+| `Baseline`          | 4098 | 8194 |16386 |32770 |65538 |
  | `Number_of_Bits`    |  12  |  13  |  14  |  15  |  16  |
  
-__Default distribution__
+##### Default distribution for match length codes
  
  When `Compression_Mode` is defined as `Predefined_Mode`,
  a predefined distribution is used for FSE compression.
  
-Here is its definition. It uses an accuracy of 6 bits (64 states).
+Below is its definition. It uses an accuracy of 6 bits (64 states).
  ```
  short matchLengths_defaultDistribution[53] =
          { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
@@ -858,26 +866,27 @@ For information, at the time of this writing.
  the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
  
  An offset code is also the number of additional bits to read,
-and can be translated into an `Offset_Value` using the following formulae :
+and can be translated into an `Offset_Value` using the following formulas :
  
  ```
  Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
  if (Offset_Value > 3) offset = Offset_Value - 3;
  ```
-It means that maximum `Offset_Value` is `2^(N+1))-1` and it supports back-reference distance up to 2^(N+1))-4
+It means that maximum `Offset_Value` is `2^(N+1)-1` and it supports back-reference distance up to `2^(N+1)-4`
  but is limited by [maximum back-reference distance](#window_descriptor).
  
-Offset_Value from 1 to 3 are special : they define "repeat codes",
+`Offset_Value` from 1 to 3 are special : they define "repeat codes",
  which means one of the previous offsets will be repeated.
  They are sorted in recency order, with 1 meaning the most recent one.
  See [Repeat offsets](#repeat-offsets) paragraph.
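
The offset rule can be illustrated with the following C sketch. `decoded_offset` and `decode_offset` are names invented here, and `extra_bits` stands for the result of `readNBits(offsetCode)`, which the caller is assumed to have read already. The limit of 28 mirrors the reference decoder limit mentioned above, not a hard limit of the format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for one decoded offset code. */
typedef struct {
    int is_repeat;    /* 1 if value is a repeat code (1..3), 0 if a real offset */
    uint32_t value;   /* repeat code index, or back-reference distance */
} decoded_offset;

static decoded_offset decode_offset(unsigned offset_code, uint32_t extra_bits)
{
    decoded_offset r;
    uint32_t offset_value;
    assert(offset_code <= 28);
    assert(extra_bits < ((uint32_t)1 << offset_code));
    offset_value = ((uint32_t)1 << offset_code) + extra_bits;
    if (offset_value > 3) {
        r.is_repeat = 0;
        r.value = offset_value - 3;   /* offset = Offset_Value - 3 */
    } else {
        r.is_repeat = 1;
        r.value = offset_value;       /* 1 = most recent offset */
    }
    return r;
}
```

With `offset_code` 28 and every extra bit set, the result is 2^29 - 4 = 536,870,908, the largest distance such a decoder can express.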
  
-__Default distribution__
+
+##### Default distribution for offset codes
  
  When `Compression_Mode` is defined as `Predefined_Mode`,
  a predefined distribution is used for FSE compression.
  
-Here is its definition. It uses an accuracy of 5 bits (32 states),
+Below is its definition. It uses an accuracy of 5 bits (32 states),
  and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
  
  If any sequence in the compressed block requires an offset larger than this,
@@ -918,7 +927,7 @@ The bitstream starts by reporting on which scale it operates.
  Note that maximum `Accuracy_Log` for literal and match lengths is `9`,
  and for offsets is `8`. Higher values are considered errors.
  
-Then follow each symbol value, from `0` to last present one.
+Then follows each symbol value, from `0` to last present one.
  The number of bits used by each field is variable.
  It depends on :
  
@@ -947,11 +956,11 @@ It depends on :
  
  Symbols probabilities are read one by one, in order.
  
-Probability is obtained from Value decoded by following formulae :
+Probability is obtained from Value decoded by following formula :
  `Proba = value - 1`
  
  It means value `0` becomes negative probability `-1`.
-`-1` is a special probability, which means `less than 1`.
+`-1` is a special probability, which means "less than 1".
  Its effect on distribution table is described in [next paragraph].
  For the purpose of calculating cumulated distribution, it counts as one.
  
@@ -1006,7 +1015,7 @@ typically by a "less than 1" probability symbol.
  The result is a list of state values.
  Each state will decode the current symbol.
  
-To get the Number of bits and baseline required for next state,
+To get the `Number_of_Bits` and `Baseline` required for next state,
  it's first necessary to sort all states in their natural order.
  The lower states will need 1 more bit than higher ones.
  
@@ -1030,11 +1039,11 @@ Numbering starts from higher states using less bits.
  | width            |  32   |  32   |   32   |  16  |  16   |
  | `Number_of_Bits` |   5   |   5   |    5   |   4  |   4   |
  | range number     |   2   |   4   |    6   |   0  |   1   |
-| baseline         |  32   |  64   |   96   |   0  |  16   |
+| `Baseline`       |  32   |  64   |   96   |   0  |  16   |
  | range            | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
  
  Next state is determined from current state
-by reading the required number of bits, and adding the specified baseline.
+by reading the required `Number_of_Bits`, and adding the specified `Baseline`.
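
The table above can be reproduced with a short C sketch. It rests on assumptions drawn from the table itself: a symbol holding 5 states in a 128-state table (the widths sum to 128, i.e. an accuracy of 7 bits). `state_rule` and `state_rules` are hypothetical names.

```c
#include <assert.h>

/* Hypothetical per-state rule: how many bits to read and what to add. */
typedef struct {
    unsigned number_of_bits;
    unsigned baseline;
} state_rule;

/* Fill out[0..proba-1] for a symbol owning `proba` states, listed in
 * their natural order, in a table of (1 << accuracy_log) states. */
static void state_rules(unsigned proba, unsigned accuracy_log, state_rule *out)
{
    unsigned k = 0, n_wide, narrow, i;
    assert(proba >= 1 && proba <= (1u << accuracy_log));
    while ((1u << k) < proba) k++;        /* 2^k: smallest power of 2 >= proba */
    n_wide = (1u << k) - proba;           /* lower states that need 1 more bit */
    narrow = 1u << (accuracy_log - k);    /* width of a higher state's range */
    for (i = 0; i < proba; i++) {
        if (i < n_wide) {
            /* Lower states: double width, numbered after the higher states. */
            out[i].number_of_bits = accuracy_log - k + 1;
            out[i].baseline = (proba - n_wide) * narrow + i * 2 * narrow;
        } else {
            /* Higher states: numbering starts here, from baseline 0. */
            out[i].number_of_bits = accuracy_log - k;
            out[i].baseline = (i - n_wide) * narrow;
        }
    }
}

/* Next state = Baseline + the Number_of_Bits just read from the bitstream. */
static unsigned next_state(state_rule rule, unsigned bits_read)
{
    assert(bits_read < (1u << rule.number_of_bits));
    return rule.baseline + bits_read;
}
```

Calling `state_rules(5, 7, out)` yields baselines 32, 64, 96, 0, 16 with 5, 5, 5, 4, 4 bits, as in the table.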
  
  
  #### Bitstream
@@ -1064,16 +1073,16 @@ Reminder : always keep in mind that all values are read _backward_.
  ##### Decoding a sequence
  
  A state gives a code.
-A code provides a baseline and number of bits to add.
+A code provides `Baseline` and `Number_of_Bits` to add.
  See [Symbol Decoding] section for details on each symbol.
  
-Decoding starts by reading the number of bits required to decode offset.
-It then does the same for match length,
-and then for literals length.
+Decoding starts by reading the `Number_of_Bits` required to decode `Offset`.
+It then does the same for `Match_Length`,
+and then for `Literals_Length`.
  
-Offset / matchLength / litLength define a sequence.
-It starts by inserting the number of literals defined by `litLength`,
-then continue by copying `matchLength` bytes from `currentPos - offset`.
+`Offset`, `Match_Length`, and `Literals_Length` define a sequence.
+It starts by inserting the number of literals defined by `Literals_Length`,
+then continue by copying `Match_Length` bytes from `currentPos - Offset`.
  
  The next operation is to update states.
  Using rules pre-calculated in the decoding tables,
@@ -1085,7 +1094,7 @@ This operation will be repeated `Number_of_Sequences` times.
  At the end, the bitstream shall be entirely consumed,
  otherwise bitstream is considered corrupted.
  
-[Symbol Decoding]:#symbols-decoding
+[Symbol Decoding]:#the-codes-for-literals-lengths-match-lengths-and-offsets
  
  ##### Repeat offsets
  
@@ -1143,8 +1152,8 @@ _Reserved ranges :_
  
  __`Entropy_Tables`__ : following the same format as a [compressed blocks].
              They are stored in following order :
-            Huffman tables for literals, FSE table for offset,
-            FSE table for matchLenth, and FSE table for litLength.
+            Huffman tables for literals, FSE table for offsets,
+            FSE table for match lengths, and FSE table for literals lengths.
              It's finally followed by 3 offset values, populating recent offsets,
              stored in order, 4-bytes little-endian each, for a total of 12 bytes.