From: Yann Collet
Date: Wed, 5 Sep 2018 19:25:35 +0000 (-0700)
Subject: spec clarification
X-Git-Tag: v0.0.29~18^2
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=55a8f84a2c6072e712c2389ad3afba81366738e4;p=thirdparty%2Fzstd.git

spec clarification

following #1305 comments from @ulikunitz
---

diff --git a/doc/zstd_compression_format.md b/doc/zstd_compression_format.md
index 0b79f959f..c57b58269 100644
--- a/doc/zstd_compression_format.md
+++ b/doc/zstd_compression_format.md
@@ -16,7 +16,7 @@ Distribution of this document is unlimited.
 
 ### Version
 
-0.2.8 (30/05/18)
+0.2.9 (05/09/18)
 
 
 Introduction
@@ -1192,6 +1192,8 @@ Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
 The last symbol's `Weight` is deduced from previously decoded ones,
 by completing to the nearest power of 2.
 This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
+`Max_Number_of_Bits` must be <= 11,
+otherwise the representation is considered corrupted.
 
 __Example__ :
 Let's presume the following Huffman tree must be described :
@@ -1216,12 +1218,12 @@ It gives the following series of weights :
 | `Weight` | 4 | 3 | 2 | 0 | 1 |
 
 The decoder will do the inverse operation :
-having collected weights of literals from `0` to `4`,
+having collected weights of literal symbols from `0` to `4`,
 it knows the last literal, `5`, is present with a non-zero weight.
 The weight of `5` can be determined by advancing to the next power of 2.
 The sum of `2^(Weight-1)` (excluding 0's) is :
 `8 + 4 + 2 + 0 + 1 = 15`.
-Nearest power of 2 is 16.
+Nearest larger power of 2 value is 16.
 Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 16-15 = 1`.
 
 #### Huffman Tree header
@@ -1233,18 +1235,24 @@ which describes how the series of weights is encoded.
   the series of weights is compressed using FSE (see below).
   The length of the FSE-compressed series is equal to `headerByte` (0-127).
 
-- if `headerByte` >= 128 : this is a direct representation,
-  where each `Weight` is written directly as a 4 bits field (0-15).
-  They are encoded forward, 2 weights to a byte with the first weight taking
-  the top four bits and the second taking the bottom four (e.g. the following
-  operations could be used to read the weights:
-  `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc.).
-  The full representation occupies `Ceiling(Number_of_Symbols/2)` bytes,
-  meaning it uses only full bytes even if `Number_of_Symbols` is odd.
-  `Number_of_Symbols = headerByte - 127`.
-  Note that maximum `Number_of_Symbols` is 255-127 = 128.
-  If any literal has a value > 128, raw header mode is not possible.
-  In such case, it's necessary to use FSE compression.
+- if `headerByte` >= 128 :
+  + the series of weights uses a direct representation,
+    where each `Weight` is encoded directly as a 4 bits field (0-15).
+  + They are encoded forward, 2 weights to a byte,
+    first weight taking the top four bits and second one taking the bottom four.
+    * e.g. the following operations could be used to read the weights:
+      `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc.
+  + The full representation occupies `Ceiling(Number_of_Weights/2)` bytes,
+    meaning it uses only full bytes even if `Number_of_Weights` is odd.
+  + `Number_of_Weights = headerByte - 127`.
+    * Note that maximum `Number_of_Weights` is 255-127 = 128,
+      therefore, only up to 128 `Weight` can be encoded using direct representation.
+    * Since the last non-zero `Weight` is _not_ encoded,
+      this scheme is compatible with alphabet sizes of up to 129 symbols,
+      hence including literal symbol 128.
+    * If any literal symbol > 128 has a non-zero `Weight`,
+      direct representation is not possible.
+      In such case, it's necessary to use FSE compression.
 
 
 #### Finite State Entropy (FSE) compression of Huffman weights
@@ -1621,6 +1629,7 @@ or at least provide a meaningful error code explaining for which reason it canno
 
 Version changes
 ---------------
+- 0.2.9 : clarifications for huffman weights direct representation, by Ulrich Kunitz
 - 0.2.8 : clarifications for IETF RFC discuss
 - 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell
 - 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz
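
Reviewer note (not part of the commit): the direct representation and the last-weight deduction clarified by this patch can be sketched as follows. This is a minimal illustration in Python, not the reference decoder. The function names `read_direct_weights` and `complete_weights` are hypothetical, and the check that the remainder is itself a power of 2 is an extra sanity check assumed here; the patched text only requires `Max_Number_of_Bits <= 11`.

```python
def read_direct_weights(header_byte, payload):
    """Unpack 4-bit weights, two per byte, first weight in the top four bits.

    Returns (weights, bytes_consumed). Hypothetical helper for illustration.
    """
    if header_byte < 128:
        raise ValueError("headerByte < 128 selects FSE compression")
    number_of_weights = header_byte - 127            # at most 255 - 127 = 128
    number_of_bytes = (number_of_weights + 1) // 2   # Ceiling(Number_of_Weights/2)
    if len(payload) < number_of_bytes:
        raise ValueError("truncated weight series")
    weights = []
    for i in range(number_of_weights):
        byte = payload[i // 2]
        # Even index: top nibble; odd index: bottom nibble.
        weights.append(byte >> 4 if i % 2 == 0 else byte & 0xF)
    return weights, number_of_bytes


def complete_weights(weights):
    """Deduce the implicit last Weight; return (all_weights, max_number_of_bits)."""
    total = sum(1 << (w - 1) for w in weights if w > 0)
    if total == 0:
        raise ValueError("corrupted: no non-zero weight")
    # 2**max_bits is the nearest power of 2 strictly larger than total.
    max_bits = total.bit_length()
    if max_bits > 11:
        raise ValueError("corrupted: Max_Number_of_Bits > 11")
    rest = (1 << max_bits) - total
    # Assumption: the remainder must equal 2**(Weight-1) for some Weight.
    if rest & (rest - 1):
        raise ValueError("corrupted: remainder is not a power of 2")
    last_weight = rest.bit_length()
    return weights + [last_weight], max_bits
```

With `headerByte = 132` (5 weights) and payload bytes `0x43 0x20 0x10`, the first function yields weights `4, 3, 2, 0, 1` from 3 bytes, and the second deduces `Weight[5] = 1` with `Max_Number_of_Bits = 4`, matching the example in the patched text.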