From: Yann Collet
Date: Fri, 15 Jul 2016 15:31:13 +0000 (+0200)
Subject: minor specification clarifications, suggested by @ebiggers
X-Git-Tag: v0.7.4^2~5
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=b21e9cbe8a30a25e22902905f7a5646d4890d516;p=thirdparty%2Fzstd.git

minor specification clarifications, suggested by @ebiggers
---

diff --git a/zstd_compression_format.md b/zstd_compression_format.md
index c6afeab1c..95742386f 100644
--- a/zstd_compression_format.md
+++ b/zstd_compression_format.md
@@ -506,7 +506,7 @@ Compressed and regenerated size fields follow big endian convention.
 This section is only present when literals block type is `Compressed` (`0`).
 
 Prefix coding represents symbols from an a priori known alphabet
-by bit sequences (codes), one code for each symbol,
+by bit sequences (codewords), one codeword for each symbol,
 in a manner such that different symbols may be represented
 by bit sequences of different lengths, but a parser can always
 parse an encoded string
@@ -515,14 +515,13 @@ unambiguously symbol-by-symbol.
 Given an alphabet with known symbol frequencies,
 the Huffman algorithm allows the construction of an optimal prefix code
 using the fewest bits of any possible prefix codes for that alphabet.
-Such a code is called a Huffman code.
 
 Prefix code must not exceed a maximum code length.
 More bits improve accuracy but cost more header size,
 and require more memory for decoding operations.
 
 The current format limits the maximum depth to 15 bits.
-The reference decoder goes further, by limiting it to 11 bits.
+The reference decoder goes further, by limiting it to 12 bits.
 It is recommended to remain compatible with reference decoder.
 
 
@@ -618,20 +617,19 @@ When both states have overflowed the bitstream, end is reached.
 ##### Conversion from weights to huffman prefix codes
 
 All present symbols shall now have a `weight` value.
-A `weight` directly represents a `range` of prefix codes,
-following the formulae : `range = weight ? 1 << (weight-1) : 0 ;`
 Symbols are sorted by weight.
+Symbols with a weight of zero are removed.
 Within same weight, symbols keep natural order.
 Starting from lowest weight,
-symbols are being allocated to a range of prefix codes.
-Symbols with a weight of zero are not present.
-
-It is then possible to transform weights into nbBits :
+symbols are being allocated to a `range`.
+A `weight` directly represents a `range`,
+following the formulae : `range = weight ? 1 << (weight-1) : 0 ;`
+Similarly, it is possible to transform weights into nbBits :
 `nbBits = nbBits ? maxBits + 1 - weight : 0;` .
 
 
 __Example__ :
-Let's presume the following huffman tree has been decoded :
+Let's presume the following list of weights has been decoded :
 
 | Literal | 0 | 1 | 2 | 3 | 4 | 5 |
 | ------- | --- | --- | --- | --- | --- | --- |
@@ -644,8 +642,9 @@ it gives the following distribution :
 | ------------ | --- | --- | --- | --- | --- | ---- |
 | weight | 0 | 1 | 1 | 2 | 3 | 4 |
 | range | 0 | 1 | 1 | 2 | 4 | 8 |
-| prefix codes | N/A | 0 | 1 | 2-3 | 4-7 | 8-15 |
+| table entries| N/A | 0 | 1 | 2-3 | 4-7 | 8-15 |
 | nb bits | 0 | 4 | 4 | 3 | 2 | 1 |
+| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 |
 
 
 #### Literals bitstreams
@@ -696,12 +695,12 @@ it's possible to read the bitstream in a little-endian fashion,
 keeping track of already used bits.
 
 Reading the last `maxBits` bits,
-it's then possible to compare extracted value to the prefix codes table,
+it's then possible to compare extracted value to decoding table,
 determining the symbol to decode and number of bits to discard.
 
 The process continues up to reading the required number of symbols per stream.
 If a bitstream is not entirely and exactly consumed,
-hence reaching exactly its beginning position with all bits consumed,
+hence reaching exactly its beginning position with _all_ bits consumed,
 the decoding process is considered faulty.
 
 
@@ -713,7 +712,7 @@ A literal copy command specifies a length.
 It is the number of bytes to be copied (or extracted) from the literal section.
 A match copy command specifies an offset and a length.
 The offset gives the position to copy from,
-which can stand within a previous block.
+which can be within a previous block.
 
 There are 3 symbol types, `literalLength`, `matchLength` and `offset`,
 which are encoded together, interleaved in a single _bitstream_.
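
Reviewer note on the "Conversion from weights to huffman prefix codes" hunk: the revised rules can be checked against the example table with a small script. The following is only a minimal sketch in Python, not the reference decoder. It makes three assumptions:

- The function name `assign_prefix_codes` and its return shape are hypothetical.
- `maxBits` is taken to be log2 of the sum of all ranges. The hunk does not say this, but it is consistent with the example (16 table entries, 4 bits).
- The condition in the `nbBits` formula is read as testing `weight`, whereas the patched text writes `nbBits ? ...`.

```python
def assign_prefix_codes(weights):
    """Map each symbol with a non-zero weight to
    (first_entry, last_entry, nb_bits, code_string)."""
    # range = weight ? 1 << (weight-1) : 0
    ranges = [(1 << (w - 1)) if w else 0 for w in weights]

    # Assumption: the ranges fill a table of exactly 2**max_bits entries.
    total = sum(ranges)
    if total == 0 or total & (total - 1):
        raise ValueError("sum of ranges is not a power of two")
    max_bits = total.bit_length() - 1

    # Remove zero weights, then sort by weight. sorted() is stable,
    # so symbols sharing a weight keep their natural order.
    present = sorted(
        (s for s, w in enumerate(weights) if w),
        key=lambda s: weights[s],
    )

    table = {}
    next_entry = 0
    for symbol in present:
        # Assumption: the condition tests weight (zero weights are already gone).
        nb_bits = max_bits + 1 - weights[symbol]
        first = next_entry
        last = first + ranges[symbol] - 1
        # Every entry in [first, last] shares the same top nb_bits bits;
        # those bits are the symbol's prefix code.
        prefix = first >> (max_bits - nb_bits)
        code = format(prefix, "0{}b".format(nb_bits)) if nb_bits else ""
        table[symbol] = (first, last, nb_bits, code)
        next_entry = last + 1
    return table


if __name__ == "__main__":
    # Weights from the example table in the hunk (literals 0..5).
    for symbol, row in assign_prefix_codes([0, 1, 1, 2, 3, 4]).items():
        print(symbol, row)
```

Allocating table entries from the lowest weight upward gives the longest codes to the first entries. This is why the new "prefix codes" row can be read directly off the high bits of the "table entries" row.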