refactor huffman prefix code paragraph

author Yann Collet <cyan@fb.com>

Tue, 8 Oct 2024 00:15:07 +0000 (17:15 -0700)

committer Yann Collet <cyan@fb.com>

Tue, 8 Oct 2024 00:15:07 +0000 (17:15 -0700)
author Yann Collet <cyan@fb.com>
Tue, 8 Oct 2024 00:15:07 +0000 (17:15 -0700)
committer Yann Collet <cyan@fb.com>
Tue, 8 Oct 2024 00:15:07 +0000 (17:15 -0700)
diff --git a/doc/zstd_compression_format.md b/doc/zstd_compression_format.md

index 3c1ec3ad99e750ff3258e56e38f1515004b9c8b8..b4a6edf9ec5b5ccb3eda922ccf793f192d1220c2 100644 (file)
--- a/doc/zstd_compression_format.md
+++ b/doc/zstd_compression_format.md
@@ -16,7 +16,7 @@ Distribution of this document is unlimited.
  
  ### Version
  
-0.4.2 (2024-10-02)
+0.4.3 (2024-10-07)
  
  
  Introduction
@@ -1270,13 +1270,13 @@ This specification limits maximum code length to 11 bits.
  
  #### Representation
  
-All literal values from zero (included) to last present one (excluded)
+All literal symbols from zero (included) to last present one (excluded)
  are represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
  Transformation from `Weight` to `Number_of_Bits` follows this formula :
  ```
  Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
  ```
-When a literal value is not present, it receives a `Weight` of 0.
+When a literal symbol is not present, it receives a `Weight` of 0.
  The least frequent symbol receives a `Weight` of 1.
  If no literal has a `Weight` of 1, then the data is considered corrupted.
  If there are not at least two literals with non-zero `Weight`, then the data
@@ -1293,33 +1293,38 @@ otherwise the representation is considered corrupted.
  __Example__ :
  Let's presume the following Huffman tree must be described :
  
-|  literal value   |  0  |  1  |  2  |  3  |  4  |  5  |
+|  literal symbol  |  A  |  B  |  C  |  D  |  E  |  F  |
  | ---------------- | --- | --- | --- | --- | --- | --- |
  | `Number_of_Bits` |  1  |  2  |  3  |  0  |  4  |  4  |
  
  The tree depth is 4, since its longest elements uses 4 bits
-(longest elements are the one with smallest frequency).
-Literal value `5` will not be listed, as it can be determined from previous values 0-4,
-nor will values above `5` as they are all 0.
-Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
+(longest elements are the ones with smallest frequency).
+
+All symbols will now receive a `Weight` instead of `Number_of_Bits`.
  Weight formula is :
  ```
  Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
  ```
-It gives the following series of weights :
+It gives the following series of Weights :
+
+| literal symbol |  A  |  B  |  C  |  D  |  E  |  F  |
+| -------------- | --- | --- | --- | --- | --- | --- |
+|   `Weight`     |  4  |  3  |  2  |  0  |  1  |  1  |
+
+This list will be sent to the decoder, with the following modifications:
  
-| literal value |  0  |  1  |  2  |  3  |  4  |
-| ------------- | --- | --- | --- | --- | --- |
-|   `Weight`    |  4  |  3  |  2  |  0  |  1  |
+- `F` will not be listed, because it can be determined from previous symbols
+- nor will symbols above `F` as they are all 0
+- on the other hand, all symbols before `A`, starting with `\0`, will be listed, with a Weight of 0.
  
  The decoder will do the inverse operation :
-having collected weights of literal symbols from `0` to `4`,
-it knows the last literal, `5`, is present with a non-zero `Weight`.
-The `Weight` of `5` can be determined by advancing to the next power of 2.
+having collected weights of literal symbols from `A` to `E`,
+it knows the last literal, `F`, is present with a non-zero `Weight`.
+The `Weight` of `F` can be determined by advancing to the next power of 2.
  The sum of `2^(Weight-1)` (excluding 0's) is :
  `8 + 4 + 2 + 0 + 1 = 15`.
  Nearest larger power of 2 value is 16.
-Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = log_2(16 - 15) + 1 = 1`.
+Therefore, `Max_Number_of_Bits = log2(16) = 4` and `Weight[F] = log_2(16 - 15) + 1 = 1`.
  
  #### Huffman Tree header
  
@@ -1359,7 +1364,7 @@ sharing a single distribution table.
  To decode an FSE bitstream, it is necessary to know its compressed size.
  Compressed size is provided by `headerByte`.
  It's also necessary to know its _maximum possible_ decompressed size,
-which is `255`, since literal values span from `0` to `255`,
+which is `255`, since literal symbols span from `0` to `255`,
  and last symbol's `Weight` is not represented.
  
  An FSE bitstream starts by a header, describing probabilities distribution.
@@ -1395,26 +1400,27 @@ It is possible to transform weights into `Number_of_Bits`, using this formula:
  ```
  Number_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0
  ```
-Symbols are sorted by `Weight`.
-Within same `Weight`, symbols keep natural sequential order.
+In order to determine which prefix is assigned to each Symbol,
+Symbols are first sorted by `Weight`, then by natural sequential order.
  Symbols with a `Weight` of zero are removed.
-Then, starting from lowest `Weight`, prefix codes are distributed in sequential order.
+Then, starting from lowest `Weight` (hence highest `Number_of_Bits`),
+prefix codes are assigned in ascending order.
  
  __Example__ :
-Let's presume the following list of weights has been decoded :
+Let's assume the following list of weights has been decoded:
  
-| Literal  |  0  |  1  |  2  |  3  |  4  |  5  |
+| Literal  |  A  |  B  |  C  |  D  |  E  |  F  |
  | -------- | --- | --- | --- | --- | --- | --- |
  | `Weight` |  4  |  3  |  2  |  0  |  1  |  1  |
  
  Sorted by weight and then natural sequential order,
-it gives the following distribution :
+it gives the following prefix codes distribution:
  
-| Literal          |  3  |  4  |  5  |  2  |  1  |   0  |
-| ---------------- | --- | --- | --- | --- | --- | ---- |
-| `Weight`         |  0  |  1  |  1  |  2  |  3  |   4  |
-| `Number_of_Bits` |  0  |  4  |  4  |  3  |  2  |   1  |
-| prefix codes     | N/A | 0000| 0001| 001 | 01  |   1  |
+| Literal          |  D  |   E  |   F  |  C  |  B  |  A  |
+| ---------------- | --- | ---- | ---- | --- | --- | --- |
+| `Weight`         |  0  |   1  |   1  |  2  |  3  |  4  |
+| `Number_of_Bits` |  0  |   4  |   4  |  3  |  2  |  1  |
+| prefix code      | N/A | 0000 | 0001 | 001 | 01  |  1  |
  
  ### Huffman-coded Streams
  
@@ -1437,10 +1443,10 @@ it's possible to read the bitstream in a __little-endian__ fashion,
  keeping track of already used bits. Since the bitstream is encoded in reverse
  order, starting from the end read symbols in forward order.
  
-For example, if the literal sequence "0145" was encoded using above prefix code,
+For example, if the literal sequence `ABEF` was encoded using above prefix code,
  it would be encoded (in reverse order) as:
  
-|Symbol  |   5  |   4  |  1 | 0 | Padding |
+|Symbol  |   F  |   E  |  B | A | Padding |
  |--------|------|------|----|---|---------|
  |Encoding|`0000`|`0001`|`01`|`1`| `00001` |
  
@@ -1735,6 +1741,7 @@ or at least provide a meaningful error code explaining for which reason it canno
  
  Version changes
  ---------------
+- 0.4.3 : clarifications for Huffman prefix code assignment example
  - 0.4.2 : refactor FSE table construction process, inspired by Donald Pian
  - 0.4.1 : clarifications on a few error scenarios, by Eric Lasota
  - 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov
author	Yann Collet <cyan@fb.com>
	Tue, 8 Oct 2024 00:15:07 +0000 (17:15 -0700)
committer	Yann Collet <cyan@fb.com>
	Tue, 8 Oct 2024 00:15:07 +0000 (17:15 -0700)