From: Stella Lau <laus@fb.com>
Date: Thu, 27 Jul 2017 18:11:35 +0000 (-0700)
Subject: Update README
X-Git-Tag: v1.3.1^2~12^2~7^2~3
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=c105f605e66bd3bd985ceeaa660e7159e4d16298;p=thirdparty%2Fzstd.git

Update README
---

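Note on `HASH_ONLY_EVERY_LOG`: as a reading aid for the README paragraph added in this patch, here is a minimal C sketch of how a default value could be derived and how hash-based ("by tag") insertion could sample positions. Both function names and the `entrySizeLog` parameter are hypothetical and are not taken from `ldm.c` or `ldm.h`. The formula is an assumption chosen only to match the README's description: a larger window raises the value, and a larger memory budget lowers it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from ldm.h): log2 of the insertion interval that
 * spreads 2^(memoryUsageLog - entrySizeLog) table entries evenly over a
 * window of 2^windowSizeLog bytes. */
static unsigned defaultHashEveryLog(unsigned windowSizeLog,
                                    unsigned memoryUsageLog,
                                    unsigned entrySizeLog) {
  unsigned entriesLog;
  assert(memoryUsageLog >= entrySizeLog);
  entriesLog = memoryUsageLog - entrySizeLog;
  /* If the table can hold an entry for every position in the window,
   * insert at every position. */
  return windowSizeLog > entriesLog ? windowSizeLog - entriesLog : 0;
}

/* Hypothetical tag test (an assumption, not the code in ldm.c): insert only
 * when the low hashEveryLog bits of the hash are all ones. For a well-mixed
 * hash this selects about one position in every 1 << hashEveryLog. */
static int shouldInsert(uint32_t hash, unsigned hashEveryLog) {
  uint32_t mask;
  assert(hashEveryLog < 32);
  mask = ((uint32_t)1 << hashEveryLog) - 1;
  return (hash & mask) == mask;
}
```

Under these assumptions, the benchmark parameters (`LDM_WINDOW_SIZE_LOG` of 28, `LDM_MEMORY_USAGE` of 23) with an assumed 8-byte entry (`entrySizeLog` of 3) give a value of 8, which corresponds to roughly one insertion every 256 positions.
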
diff --git a/contrib/long_distance_matching/README.md b/contrib/long_distance_matching/README.md
index d9cb08951..e67bba714 100644
--- a/contrib/long_distance_matching/README.md
+++ b/contrib/long_distance_matching/README.md
@@ -28,12 +28,75 @@ The parameters are as follows and must all be defined:
 - `INSERT_BY_TAG` : insert entries into the hash table as a function of the hash. This increases speed by reducing the number of hash table lookups and match comparisons. Certain hashes will never be inserted.
 - `USE_CHECKSUM`  : store a checksum with the hash table entries for faster comparison. This halves the number of entries the hash table can contain.
 
-### Compression statistics
+The optional parameter `HASH_ONLY_EVERY_LOG` is the base-2 logarithm of the inverse insertion frequency for the hash table. That is, an entry is inserted approximately once every `1 << HASH_ONLY_EVERY_LOG` positions. If this parameter is not defined, its value is computed from the window size and memory usage to approximate an even coverage of the window.
 
-Compression statistics (and the configuration) can be enabled/disabled via `COMPUTE_STATS` and `OUTPUT_CONFIGURATION` in `ldm.h`.
 
+### Benchmark
+
+Below is a comparison of various compression methods on a tar of four versions of llvm (versions `3.9.0`, `3.9.1`, `4.0.0`, `4.0.1`) with a total size of `727900160` B.
+
+| Method | Size | Ratio |
+|:---|---:|---:|
+|lrzip -p 32 -n -w 1 | `369968714` | `1.97`|
+|ldm | `209391361` | `3.48`|
+|lz4 | `189954338` | `3.83`|
+|lrzip -p 32 -l -w 1 | `163940343` | `4.44`|
+|zstd -1 | `126080293` | `5.77`|
+|lrzip -p 32 -n | `124821009` | `5.83`|
+|lrzip -p 32 -n -w 1 & zstd -1 | `120317909` | `6.05`|
+|zstd -3 -o | `115290952` | `6.31`|
+|lrzip -p 32 -g -L 9 -w 1 | `107168979` | `6.79`|
+|zstd -6 -o | `102772098` | `7.08`|
+|zstd -T16 -9 | `98040470` | `7.42`|
+|lrzip -p 32 -n -w 1 & zstd -T32 -19 | `88050289` | `8.27`|
+|zstd -T32 -19 | `83626098` | `8.70`|
+|lrzip -p 32 -n & zstd -1 | `36335117` | `20.03`|
+|ldm & zstd -6 | `32856232` | `22.15`|
+|lrzip -p 32 -g -L 9 | `32243594` | `22.58`|
+|lrzip -p 32 -n & zstd -6 | `30954572` | `23.52`|
+|lrzip -p 32 -n & zstd -T32 -19 | `26472064` | `27.50`|
+
+The method marked `ldm` was run with the following parameters:
+
+| Parameter | Value |
+|:---|---:|
+| `LDM_MEMORY_USAGE`     |   `23` |
+| `HASH_BUCKET_SIZE_LOG` |    `3` |
+| `LDM_LAG`              |    `0` |
+| `LDM_WINDOW_SIZE_LOG`  |   `28` |
+| `LDM_MIN_MATCH_LENGTH` |   `64` |
+| `INSERT_BY_TAG`        |    `1` |
+| `USE_CHECKSUM`         |    `1` |
+
+The compression speed was `220.5 MB/s`.
+
+### Parameter selection
 
+Below is a brief discussion of the effects of the parameters on the speed and compression ratio.
 
+#### Speed
 
+A major speed bottleneck is finding candidate matches and comparing them to check whether they are longer than the minimum match length. Generally:
+- The fewer matches found (or the lower the percentage of the literals matched), the slower the algorithm runs.
+- Increasing `HASH_ONLY_EVERY_LOG` results in fewer inserts and, if `INSERT_BY_TAG` is set, fewer lookups in the table. This has a large effect on both speed and compression ratio.
+- If `HASH_ONLY_EVERY_LOG` is not set, its value is calculated from `LDM_WINDOW_SIZE_LOG` and `LDM_MEMORY_USAGE`. Increasing `LDM_WINDOW_SIZE_LOG` increases `HASH_ONLY_EVERY_LOG`, and increasing `LDM_MEMORY_USAGE` decreases it.
+- `USE_CHECKSUM` generally speeds up hash table lookups.
 
+#### Compression ratio
 
+The compression ratio is highly correlated with the coverage of matches. As a long distance matcher, the algorithm was designed to "optimize" for long distance matches outside the zstd compression window. The compression ratio after recompressing the output of the long-distance matcher with zstd was a more important signal in development than the raw compression ratio itself.
+
+Generally, increasing `LDM_MEMORY_USAGE` will improve the compression ratio. However, when using the default computed value of `HASH_ONLY_EVERY_LOG`, this also increases the frequency of insertion and lookup in the table and thus may reduce speed.
+
+Below is a table showing the speed and compression ratio when compressing the llvm tar (as described above) using different settings for `LDM_MEMORY_USAGE`. The other parameters were the same as used in the benchmark above.
+
+| `LDM_MEMORY_USAGE` | Ratio | Speed (MB/s) | Ratio after zstd -6  |
+|---:| ---: | ---: | ---: |
+| `18` | `1.85` | `232.4` | `10.92` |
+| `21` | `2.79` | `233.9` | `15.92` |
+| `23` | `3.48` | `220.5` | `18.29` |
+| `25` | `4.56` | `140.8` | `19.21` |
+
+### Compression statistics
+
+Compression statistics (and the configuration) can be enabled/disabled via `COMPUTE_STATS` and `OUTPUT_CONFIGURATION` in `ldm.h`.
diff --git a/contrib/long_distance_matching/ldm.c b/contrib/long_distance_matching/ldm.c
index 25bf5c838..ff9d94d07 100644
--- a/contrib/long_distance_matching/ldm.c
+++ b/contrib/long_distance_matching/ldm.c
@@ -58,9 +58,6 @@ struct LDM_compressStats {
 
   U32 minOffset, maxOffset;
   U32 offsetHistogram[32];
-
-  U64 TMP_hashCount[1 << HASH_ONLY_EVERY_LOG];
-  U64 TMP_totalHashCount;
 };
 
 typedef struct LDM_hashTable LDM_hashTable;
@@ -398,17 +395,6 @@ void LDM_printCompressStats(const LDM_compressStats *stats) {
                    (double) stats->numMatches);
   }
   printf("\n");
-#if INSERT_BY_TAG
-/*
-  printf("Lower bit distribution\n");
-  for (i = 0; i < (1 << HASH_ONLY_EVERY_LOG); i++) {
-    printf("%5d %5llu %6.3f\n", i, stats->TMP_hashCount[i],
-           100.0 * (double) stats->TMP_hashCount[i] /
-                   (double) stats->TMP_totalHashCount);
-  }
-*/
-#endif
-
   printf("=====================\n");
 }
 
@@ -503,14 +489,6 @@ static void setNextHash(LDM_CCtx *cctx) {
       cctx->lastPosHashed[LDM_HASH_LENGTH]);
   cctx->nextPosHashed = cctx->nextIp;
 
-#if INSERT_BY_TAG
-  {
-    U32 hashEveryMask = lowerBitsFromHfHash(cctx->nextHash);
-    cctx->stats.TMP_totalHashCount++;
-    cctx->stats.TMP_hashCount[hashEveryMask]++;
-  }
-#endif
-
 #if LDM_LAG
   if (cctx->ip - cctx->ibase > LDM_LAG) {
     cctx->lagHash = updateHash(
diff --git a/contrib/long_distance_matching/ldm.h b/contrib/long_distance_matching/ldm.h
index af35130eb..456ec5aa4 100644
--- a/contrib/long_distance_matching/ldm.h
+++ b/contrib/long_distance_matching/ldm.h
@@ -71,6 +71,9 @@
 #define LDM_OFFSET_SIZE 4
 
 #define LDM_WINDOW_SIZE (1 << (LDM_WINDOW_SIZE_LOG))
+
+// TODO: Match lengths that are too small do not use the hash table efficiently.
+// There should be a minimum hash length given the hash table size.
 #define LDM_HASH_LENGTH LDM_MIN_MATCH_LENGTH
 
 typedef struct LDM_compressStats LDM_compressStats;