From 832f559b0b9d22f3afc6b6e11a55044b9a238db5 Mon Sep 17 00:00:00 2001 From: Yann Collet Date: Sat, 18 Feb 2023 18:16:00 -0800 Subject: [PATCH] clarify zstd specification for Huffman blocks Following detailed comments from @dweiller in #3508. --- doc/zstd_compression_format.md | 23 ++++++++++++++++------- 1 file changed, 16 insertions(+), 7 deletions(-) diff --git a/doc/zstd_compression_format.md b/doc/zstd_compression_format.md index e40677a1d..03d3a9abb 100644 --- a/doc/zstd_compression_format.md +++ b/doc/zstd_compression_format.md @@ -16,7 +16,7 @@ Distribution of this document is unlimited. ### Version -0.3.7 (2020-12-09) +0.3.8 (2023-02-18) Introduction @@ -470,6 +470,7 @@ This field uses 2 lowest bits of first byte, describing 4 different block types repeated `Regenerated_Size` times. - `Compressed_Literals_Block` - This is a standard Huffman-compressed block, starting with a Huffman tree description. + In this mode, there are at least 2 different literals represented in the Huffman tree description. See details below. - `Treeless_Literals_Block` - This is a Huffman-compressed block, using Huffman tree _from previous Huffman-compressed literals block_. @@ -566,6 +567,7 @@ or from a dictionary. ### `Huffman_Tree_Description` This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`). +The tree describes the weights of all literals symbols that can be present in the literals block, at least 2 and up to 256. The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description). The size of `Huffman_Tree_Description` is determined during decoding process, it must be used to determine where streams begin. @@ -1197,7 +1199,7 @@ Huffman Coding -------------- Zstandard Huffman-coded streams are read backwards, similar to the FSE bitstreams. -Therefore, to find the start of the bitstream, it is therefore to +Therefore, to find the start of the bitstream, it is required to know the offset of the last byte of the Huffman-coded stream. After writing the last bit containing information, the compressor @@ -1239,9 +1241,15 @@ Transformation from `Weight` to `Number_of_Bits` follows this formula : ``` Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0 ``` -The last symbol's `Weight` is deduced from previously decoded ones, -by completing to the nearest power of 2. -This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree. +When a literal value is not present, it receives a `Weight` of 0. +The least frequent symbol receives a `Weight` of 1. +Consequently, the `Weight` 1 is necessarily present. +The most frequent symbol receives a `Weight` anywhere between 1 and 11 (max). +The last symbol's `Weight` is deduced from previously retrieved Weights, +by completing to the nearest power of 2. It's necessarily non 0. +If it's not possible to reach a clean power of 2 with a single `Weight` value, +the Huffman Tree Description is considered invalid. +This final power of 2 gives `Max_Number_of_Bits`, the depth of the current tree. `Max_Number_of_Bits` must be <= 11, otherwise the representation is considered corrupted. @@ -1254,7 +1262,7 @@ Let's presume the following Huffman tree must be described : The tree depth is 4, since its longest elements uses 4 bits (longest elements are the one with smallest frequency). -Value `5` will not be listed, as it can be determined from values for 0-4, +Literal value `5` will not be listed, as it can be determined from previous values 0-4, nor will values above `5` as they are all 0. Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`. Weight formula is : @@ -1274,7 +1282,7 @@ The `Weight` of `5` can be determined by advancing to the next power of 2. The sum of `2^(Weight-1)` (excluding 0's) is : `8 + 4 + 2 + 0 + 1 = 15`. Nearest larger power of 2 value is 16. -Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 16-15 = 1`. +Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = log_2(16 - 15) + 1 = 1`. #### Huffman Tree header @@ -1683,6 +1691,7 @@ or at least provide a meaningful error code explaining for which reason it canno Version changes --------------- +- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions. - 0.3.7 : clarifications for Repeat_Offsets, matching RFC8878 - 0.3.6 : clarifications for Dictionary_ID - 0.3.5 : clarifications for Block_Maximum_Size -- 2.47.2