### Version
-0.4.2 (2024-10-02)
+0.4.3 (2024-10-07)
Introduction
#### Representation
-All literal values from zero (included) to last present one (excluded)
+All literal symbols from zero (included) to last present one (excluded)
are represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
Transformation from `Weight` to `Number_of_Bits` follows this formula :
```
Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
```
-When a literal value is not present, it receives a `Weight` of 0.
+When a literal symbol is not present, it receives a `Weight` of 0.
The least frequent symbol receives a `Weight` of 1.
If no literal has a `Weight` of 1, then the data is considered corrupted.
If there are not at least two literals with non-zero `Weight`, then the data
__Example__ :
Let's presume the following Huffman tree must be described :
-| literal value | 0 | 1 | 2 | 3 | 4 | 5 |
+| literal symbol | A | B | C | D | E | F |
| ---------------- | --- | --- | --- | --- | --- | --- |
| `Number_of_Bits` | 1 | 2 | 3 | 0 | 4 | 4 |
The tree depth is 4, since its longest elements uses 4 bits
-(longest elements are the one with smallest frequency).
-Literal value `5` will not be listed, as it can be determined from previous values 0-4,
-nor will values above `5` as they are all 0.
-Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
+(longest elements are the ones with smallest frequency).
+
+All symbols will now receive a `Weight` instead of `Number_of_Bits`.
Weight formula is :
```
Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
```
-It gives the following series of weights :
+It gives the following series of Weights :
+
+| literal symbol | A | B | C | D | E | F |
+| -------------- | --- | --- | --- | --- | --- | --- |
+| `Weight` | 4 | 3 | 2 | 0 | 1 | 1 |
+
+This list will be sent to the decoder, with the following modifications:
-| literal value | 0 | 1 | 2 | 3 | 4 |
-| ------------- | --- | --- | --- | --- | --- |
-| `Weight` | 4 | 3 | 2 | 0 | 1 |
+- `F` will not be listed, because it can be determined from previous symbols
+- nor will symbols above `F` as they are all 0
+- on the other hand, all symbols before `A`, starting with `\0`, will be listed, with a Weight of 0.
The decoder will do the inverse operation :
-having collected weights of literal symbols from `0` to `4`,
-it knows the last literal, `5`, is present with a non-zero `Weight`.
-The `Weight` of `5` can be determined by advancing to the next power of 2.
+having collected weights of literal symbols from `A` to `E`,
+it knows the last literal, `F`, is present with a non-zero `Weight`.
+The `Weight` of `F` can be determined by advancing to the next power of 2.
The sum of `2^(Weight-1)` (excluding 0's) is :
`8 + 4 + 2 + 0 + 1 = 15`.
Nearest larger power of 2 value is 16.
-Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = log_2(16 - 15) + 1 = 1`.
+Therefore, `Max_Number_of_Bits = log2(16) = 4` and `Weight[F] = log_2(16 - 15) + 1 = 1`.
#### Huffman Tree header
To decode an FSE bitstream, it is necessary to know its compressed size.
Compressed size is provided by `headerByte`.
It's also necessary to know its _maximum possible_ decompressed size,
-which is `255`, since literal values span from `0` to `255`,
+which is `255`, since literal symbols span from `0` to `255`,
and last symbol's `Weight` is not represented.
An FSE bitstream starts by a header, describing probabilities distribution.
```
Number_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0
```
-Symbols are sorted by `Weight`.
-Within same `Weight`, symbols keep natural sequential order.
+In order to determine which prefix is assigned to each Symbol,
+Symbols are first sorted by `Weight`, then by natural sequential order.
Symbols with a `Weight` of zero are removed.
-Then, starting from lowest `Weight`, prefix codes are distributed in sequential order.
+Then, starting from lowest `Weight` (hence highest `Number_of_Bits`),
+prefix codes are assigned in ascending order.
__Example__ :
-Let's presume the following list of weights has been decoded :
+Let's assume the following list of weights has been decoded:
-| Literal | 0 | 1 | 2 | 3 | 4 | 5 |
+| Literal | A | B | C | D | E | F |
| -------- | --- | --- | --- | --- | --- | --- |
| `Weight` | 4 | 3 | 2 | 0 | 1 | 1 |
Sorted by weight and then natural sequential order,
-it gives the following distribution :
+it gives the following prefix codes distribution:
-| Literal | 3 | 4 | 5 | 2 | 1 | 0 |
-| ---------------- | --- | --- | --- | --- | --- | ---- |
-| `Weight` | 0 | 1 | 1 | 2 | 3 | 4 |
-| `Number_of_Bits` | 0 | 4 | 4 | 3 | 2 | 1 |
-| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 |
+| Literal | D | E | F | C | B | A |
+| ---------------- | --- | ---- | ---- | --- | --- | --- |
+| `Weight` | 0 | 1 | 1 | 2 | 3 | 4 |
+| `Number_of_Bits` | 0 | 4 | 4 | 3 | 2 | 1 |
+| prefix code | N/A | 0000 | 0001 | 001 | 01 | 1 |
### Huffman-coded Streams
keeping track of already used bits. Since the bitstream is encoded in reverse
order, starting from the end read symbols in forward order.
-For example, if the literal sequence "0145" was encoded using above prefix code,
+For example, if the literal sequence `ABEF` was encoded using above prefix code,
it would be encoded (in reverse order) as:
-|Symbol | 5 | 4 | 1 | 0 | Padding |
+|Symbol | F | E | B | A | Padding |
|--------|------|------|----|---|---------|
|Encoding|`0000`|`0001`|`01`|`1`| `00001` |
Version changes
---------------
+- 0.4.3 : clarifications for Huffman prefix code assignment example
- 0.4.2 : refactor FSE table construction process, inspired by Donald Pian
- 0.4.1 : clarifications on a few error scenarios, by Eric Lasota
- 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov