updated man page, providing more details for --train mode

author Yann Collet <cyan@fb.com>

Thu, 14 Apr 2022 01:51:59 +0000 (18:51 -0700)

committer Yann Collet <cyan@fb.com>

Thu, 14 Apr 2022 01:51:59 +0000 (18:51 -0700)
diff --git a/programs/zstd.1.md b/programs/zstd.1.md

index 3901b58dbdfa0dde0fff82247bbee8368e27389e..4e176f39950fd78450b4fdc2974e82790cf98c01 100644
--- a/programs/zstd.1.md
+++ b/programs/zstd.1.md
@@ -19,8 +19,8 @@ DESCRIPTION
  with command line syntax similar to `gzip (1)` and `xz (1)`.
  It is based on the **LZ77** family, with further FSE & huff0 entropy stages.
  `zstd` offers highly configurable compression speed,
-with fast modes at > 200 MB/s per core,
-and strong modes nearing lzma compression ratios.
+from fast modes at > 200 MB/s per core,
+to strong modes with excellent compression ratios.
  It also features a very fast decoder, with speeds > 500 MB/s per core.
  
  `zstd` command line syntax is generally similar to gzip,
@@ -31,13 +31,12 @@ but features the following differences :
    - When compressing a single file, `zstd` displays progress notifications
      and result summary by default.
      Use `-q` to turn them off.
-  - `zstd` does not accept input from console,
-    but it properly accepts `stdin` when it's not the console.
    - `zstd` displays a short help page when command line is an error.
      Use `-q` to turn it off.
+  - `zstd` does not accept input from console,
+    though it does accept `stdin` when it's not the console.
  
-`zstd` compresses or decompresses each _file_ according to the selected
-operation mode.
+`zstd` processes each _file_ according to the selected operation mode.
  If no _files_ are given or _file_ is `-`, `zstd` reads from standard input
  and writes the processed data to standard output.
  `zstd` will refuse to write compressed data to standard output
@@ -54,8 +53,8 @@ whose name is derived from the source _file_ name:
    get the target filename
  
  ### Concatenation with .zst files
-It is possible to concatenate `.zst` files as is.
-`zstd` will decompress such files as if they were a single `.zst` file.
+It is possible to concatenate multiple `.zst` files. `zstd` will decompress
+such agglomerated file as if it was a single `.zst` file.
  
  OPTIONS
  -------
@@ -85,8 +84,8 @@ the last one takes effect.
      Decompress.
  * `-t`, `--test`:
      Test the integrity of compressed _files_.
-    This option is equivalent to `--decompress --stdout` except that the
-    decompressed data is discarded instead of being written to standard output.
+    This option is equivalent to `--decompress --stdout > /dev/null`,
+    decompressed data is discarded and checksummed for errors.
      No files are created or removed.
  * `-b#`:
      Benchmark file(s) using compression level #
@@ -96,7 +95,7 @@ the last one takes effect.
  * `-l`, `--list`:
      Display information related to a zstd compressed file, such as size, ratio, and checksum.
      Some of these fields may not be available.
-    This command can be augmented with the `-v` modifier.
+    This command's output can be augmented with the `-v` modifier.
  
  ### Operation modifiers
  
@@ -292,10 +291,10 @@ options that intend to mimic the `gzip` behavior:
      alias to the option `-9`.
  
  
-### Restricted usage of Environment Variables
+### Interactions with Environment Variables
  
-Using environment variables to set parameters has security implications.
-Therefore, this avenue is intentionally restricted.
+Employing environment variables to set parameters has security implications.
+Therefore, this avenue is intentionally limited.
  Only `ZSTD_CLEVEL` and `ZSTD_NBTHREADS` are currently supported.
  They set the compression level and number of threads to use during compression, respectively.
  
@@ -305,8 +304,8 @@ If the value of `ZSTD_CLEVEL` is not a valid integer, it will be ignored with a
  
  `ZSTD_NBTHREADS` can be used to set the number of threads `zstd` will attempt to use during compression.
  If the value of `ZSTD_NBTHREADS` is not a valid unsigned integer, it will be ignored with a warning message.
-`ZSTD_NBTHREADS` has a default value of (`1`), and is capped at ZSTDMT_NBWORKERS_MAX==200. `zstd` must be
-compiled with multithread support for this to have any effect.
+`ZSTD_NBTHREADS` has a default value of (`1`), and is capped at ZSTDMT_NBWORKERS_MAX==200.
+`zstd` must be compiled with multithread support for this to have any effect.
  
  They can both be overridden by corresponding command line arguments:
  `-#` for compression level and `-T#` for number of compression threads.
@@ -318,27 +317,36 @@ DICTIONARY BUILDER
  which greatly improves efficiency on small files and messages.
  It's possible to train `zstd` with a set of samples,
  the result of which is saved into a file called a `dictionary`.
-Then during compression and decompression, reference the same dictionary,
+Then, during compression and decompression, reference the same dictionary,
  using command `-D dictionaryFileName`.
  Compression of small files similar to the sample set will be greatly improved.
  
  * `--train FILEs`:
      Use FILEs as training set to create a dictionary.
-    The training set should contain a lot of small files (> 100),
+    The training set should ideally contain a lot of samples (> 100),
      and weight typically 100x the target dictionary size
-    (for example, 10 MB for a 100 KB dictionary).
+    (for example, ~10 MB for a 100 KB dictionary).
      `--train` can be combined with `-r` to indicate a directory rather than listing all the files,
      which can be useful to circumvent shell expansion limits.
  
+    Since dictionary compression is mostly effective for small files,
+    the expectation is that the training set will only contain small files.
+    In the case where some samples happen to be large,
+    only the first 128 KB of these samples will be used for training.
+
      `--train` supports multithreading if `zstd` is compiled with threading support (default).
-    Additional parameters can be specified with `--train-fastcover`.
+    Additional advanced parameters can be specified with `--train-fastcover`.
      The legacy dictionary builder can be accessed with `--train-legacy`.
      The slower cover dictionary builder can be accessed with `--train-cover`.
-    Default is equivalent to `--train-fastcover=d=8,steps=4`.
-* `-o file`:
-    Dictionary saved into `file` (default name: dictionary).
+    Default `--train` is equivalent to `--train-fastcover=d=8,steps=4`.
+
+* `-o FILE`:
+    Dictionary saved into `FILE` (default name: dictionary).
  * `--maxdict=#`:
-    Limit dictionary to specified size (default: 112640).
+    Limit dictionary to specified size (default: 112640 bytes).
+    As usual, quantities are expressed in bytes by default,
+    and it's possible to employ suffixes (like `KB` or `MB`)
+    to specify larger values.
  * `-#`:
      Use `#` compression level during training (optional).
      Will generate statistics more tuned for selected compression level,
@@ -346,17 +354,37 @@ Compression of small files similar to the sample set will be greatly improved.
  * `-B#`:
      Split input files into blocks of size # (default: no split)
  * `-M#`, `--memory=#`:
-    Limit the amount of sample data loaded for training (default: 2 GB). See above for details.
+    Limit the amount of sample data loaded for training (default: 2 GB).
+    Note that the default (2 GB) is also the maximum.
+    This parameter can be useful in situations where the training set size
+    is not well controlled and could be potentially very large.
+    Since speed of the training process is directly correlated to
+    the size of the training sample set,
+    a smaller sample set leads to faster training.
+
+    In situations where the training set is larger than maximum memory,
+    the CLI will randomly select samples among the available ones,
+    up to the maximum allowed memory budget.
+    This is meant to improve dictionary relevance
+    by mitigating the potential impact of clustering,
+    such as selecting only files from the beginning of a list
+    sorted by modification date, or sorted by alphabetical order.
+    The randomization process is deterministic, so
+    training of the same list of files with the same parameters
+    will lead to the creation of the same dictionary.
+
  * `--dictID=#`:
-    A dictionary ID is a locally unique ID
-    that a decoder can use to verify it is using the right dictionary.
+    A dictionary ID is a locally unique ID.
+    The decoder will use this value to verify it is using the right dictionary.
      By default, zstd will create a 4-bytes random number ID.
-    It's possible to give a precise number instead.
-    Short numbers have an advantage : an ID < 256 will only need 1 byte in the
-    compressed frame header, and an ID < 65536 will only need 2 bytes.
-    This compares favorably to 4 bytes default.
-    However, it's up to the dictionary manager to not assign twice the same ID to
+    It's possible to provide an explicit number ID instead.
+    It's up to the dictionary manager to not assign twice the same ID to
      2 different dictionaries.
+    Note that short numbers have an advantage :
+    an ID < 256 will only need 1 byte in the compressed frame header,
+    and an ID < 65536 will only need 2 bytes.
+    This compares favorably to 4 bytes default.
+
  * `--train-cover[=k#,d=#,steps=#,split=#,shrink[=#]]`:
      Select parameters for the default dictionary builder algorithm named cover.
      If _d_ is not specified, then it tries _d_ = 6 and _d_ = 8.
@@ -421,7 +449,7 @@ Compression of small files similar to the sample set will be greatly improved.
      Use legacy dictionary builder algorithm with the given dictionary
      _selectivity_ (default: 9).
      The smaller the _selectivity_ value, the denser the dictionary,
-    improving its efficiency but reducing its possible maximum size.
+    improving its efficiency but reducing its achievable maximum size.
      `--train-legacy=s=#` is also accepted.
  
      Examples:
@@ -452,14 +480,14 @@ BENCHMARK
  ADVANCED COMPRESSION OPTIONS
  ----------------------------
  ### -B#:
-Select the size of each compression job.
+Specify the size of each compression job.
  This parameter is only available when multi-threading is enabled.
  Each compression job is run in parallel, so this value indirectly impacts the nb of active threads.
  Default job size varies depending on compression level (generally  `4 * windowSize`).
  `-B#` makes it possible to manually select a custom size.
  Note that job size must respect a minimum value which is enforced transparently.
  This minimum is either 512 KB, or `overlapSize`, whichever is largest.
-Different job sizes will lead to (slightly) different compressed frames.
+Different job sizes will lead to non-identical compressed frames.
  
  ### --zstd[=options]:
  `zstd` provides 22 predefined compression levels.
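
Note on the `--dictID=#` hunk: the man page text says an ID < 256 needs 1 byte in the compressed frame header, an ID < 65536 needs 2 bytes, and the default random ID takes 4 bytes. A minimal Python sketch of that rule follows. The function name `dict_id_field_bytes` is hypothetical, and the accepted range (1 to 2**32 - 1, excluding 0) is an assumption of this sketch, not something the diff states.

```python
def dict_id_field_bytes(dict_id: int) -> int:
    """Return the header bytes needed for dict_id, following the man page wording.

    Hypothetical helper for illustration only; it is not part of zstd.
    """
    # Assumed valid range: a non-zero ID that fits in 4 bytes.
    if not 1 <= dict_id <= 0xFFFFFFFF:
        raise ValueError("dictionary ID must be in 1..2**32-1 (assumed range)")
    if dict_id < 256:
        return 1  # "an ID < 256 will only need 1 byte"
    if dict_id < 65536:
        return 2  # "an ID < 65536 will only need 2 bytes"
    return 4      # same cost as the 4-byte default random ID


if __name__ == "__main__":
    for candidate in (42, 300, 70000):
        print(candidate, dict_id_field_bytes(candidate))
```

Under this rule, a dictionary manager that hands out small explicit IDs saves 2 or 3 bytes per frame compared to the random default, which matters most for the small messages that dictionaries target.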
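
Note on the `--maxdict=#` hunk: the added text says quantities are in bytes by default and that suffixes like `KB` or `MB` can express larger values. The Python sketch below shows one way such values could be interpreted. It assumes binary multiples (1 KB = 1024 bytes), under which the documented default of 112640 bytes equals 110 KB. The function `parse_size` and its accepted spellings are hypothetical, and the real `zstd` argument parser may behave differently.

```python
import re

# Assumption: suffixes are binary multiples (K = 2**10, M = 2**20).
_MULTIPLIERS = {"": 1, "K": 1 << 10, "M": 1 << 20}


def parse_size(text: str) -> int:
    """Convert strings such as '112640', '110KB' or '10 MB' to a byte count.

    Hypothetical helper for illustration; not the zstd CLI parser.
    """
    match = re.fullmatch(r"(\d+)\s*([KM]?)(?:i?B)?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _MULTIPLIERS[unit]


if __name__ == "__main__":
    for value in ("112640", "110KB", "10MB"):
        print(value, parse_size(value))
```

With this interpretation, the "~10 MB for a 100 KB dictionary" guideline from the `--train` hunk corresponds to roughly 100x the dictionary size in bytes.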