From dd8cb5a0f1a7581c407a7307fc526deb47e1f266 Mon Sep 17 00:00:00 2001
From: Yann Collet
Date: Fri, 10 Mar 2023 15:54:31 -0800
Subject: [PATCH] added documentation for the seekable format and notably provide additional context for the Maximum Frame Size parameter.

requested by @P-E-Meunier at https://github.com/facebook/zstd/commit/1df9f36c6c6cea08778d45a4adaf60e2433439a3#commitcomment-103856979.
---
 contrib/seekable_format/README.md       | 42 +++++++++++++++++++++++++
 contrib/seekable_format/zstd_seekable.h | 17 +++++++---
 2 files changed, 55 insertions(+), 4 deletions(-)
 create mode 100644 contrib/seekable_format/README.md

diff --git a/contrib/seekable_format/README.md b/contrib/seekable_format/README.md
new file mode 100644
index 000000000..fedf96bab
--- /dev/null
+++ b/contrib/seekable_format/README.md
@@ -0,0 +1,42 @@
+# Zstandard Seekable Format
+
+The seekable format splits compressed data into a series of independent "frames",
+each compressed individually,
+so that decompression of a section in the middle of an archive
+only requires zstd to decompress at most a frame's worth of extra data,
+instead of the entire archive.
+
+The frames are appended, so that the decompression of the entire payload
+still regenerates the original content, using any compliant zstd decoder.
+
+On top of that, the seekable format generates a jump table,
+which makes it possible to jump directly to the position of the relevant frame
+when requesting only a segment of the data.
+The jump table is simply ignored by zstd decoders unaware of the seekable format.
+
+The format is delivered with an API to create seekable archives
+and to retrieve arbitrary segments inside the archive.
+
+### Maximum Frame Size parameter
+
+When creating a seekable archive, the main parameter is the maximum frame size.
+
+At compression time, user can manually select the boundaries between segments,
+but they don't have to: long segments will be automatically split
+when larger than selected maximum frame size.
+
+Small frame sizes reduce decompression cost when requesting small segments,
+because the decoder will nonetheless have to decompress an entire frame
+to recover just a single byte from it.
+
+A good rule of thumb is to select a maximum frame size roughly equivalent
+to the access pattern when it's known.
+For example, if the application tends to request 4KB blocks,
+then it's a good idea to set a maximum frame size in the vicinity of 4 KB.
+
+But small frame sizes also reduce compression ratio,
+and increase the cost for the jump table,
+so there is a balance to find.
+
+In general, try to avoid really tiny frame sizes (<1 KB),
+which would have a large negative impact on compression ratio.
diff --git a/contrib/seekable_format/zstd_seekable.h b/contrib/seekable_format/zstd_seekable.h
index ef2957588..a0e5e3573 100644
--- a/contrib/seekable_format/zstd_seekable.h
+++ b/contrib/seekable_format/zstd_seekable.h
@@ -48,10 +48,19 @@ typedef struct ZSTD_seekTable_s ZSTD_seekTable;
 *
 *  Use ZSTD_seekable_initCStream() to initialize a ZSTD_seekable_CStream object
 *  for a new compression operation.
-*  `maxFrameSize` indicates the size at which to automatically start a new
-*  seekable frame. `maxFrameSize == 0` implies the default maximum size.
-*  `checksumFlag` indicates whether or not the seek table should include frame
-*  checksums on the uncompressed data for verification.
+*  - `maxFrameSize` indicates the size at which to automatically start a new
+*    seekable frame.
+*    `maxFrameSize == 0` implies the default maximum size.
+*    Smaller frame sizes allow faster decompression of small segments,
+*    since retrieving a single byte requires decompression of
+*    the full frame where the byte belongs.
+*    In general, size the frames to roughly correspond to
+*    the access granularity (when it's known).
+*    But small sizes also reduce compression ratio.
+*    Avoid really tiny frame sizes (< 1 KB),
+*    that would hurt compression ratio considerably.
+*  - `checksumFlag` indicates whether or not the seek table should include frame
+*    checksums on the uncompressed data for verification.
 *  @return : a size hint for input to provide for compression, or an error code
 *            checkable with ZSTD_isError()
 *
-- 
2.47.3