Update SystemZ DFLTCC README

author Ilya Leoshkevich <iii@linux.ibm.com>

Thu, 20 Aug 2020 11:34:45 +0000 (13:34 +0200)

committer Hans Kristian Rosbach <hk-github@circlestorm.org>

Sat, 22 Aug 2020 12:19:24 +0000 (14:19 +0200)
diff --git a/arch/s390/README.md b/arch/s390/README.md

index 841eb896c70ce821ca6c3722d120ebc32948718d..2ff88df4027812e7906791aaf84ead97601d0adf 100644 (file)
--- a/arch/s390/README.md
+++ b/arch/s390/README.md
@@ -1,6 +1,7 @@
-This directory contains IBM Z DEFLATE CONVERSION CALL support for
-zlib-ng. In order to enable it, the following build commands should be
-used:
+# Introduction
+
+This directory contains SystemZ deflate hardware acceleration support.
+It can be enabled using the following build commands:
  
      $ ./configure --with-dfltcc-deflate --with-dfltcc-inflate
      $ make
@@ -10,65 +11,92 @@ or
      $ cmake -DWITH_DFLTCC_DEFLATE=1 -DWITH_DFLTCC_INFLATE=1 .
      $ make
  
-When built like this, zlib-ng would compress in hardware on level 1,
-and in software on all other levels. Decompression will always happen
-in hardware. In order to enable DFLTCC compression for levels 1-6 (i.e.
-to make it used by default) one could add -DDFLTCC_LEVEL_MASK=0x7e to
-CFLAGS when building zlib-ng.
-
-Two DFLTCC compression calls produce the same results only when they
-both are made on machines of the same generation, and when the
-respective buffers have the same offset relative to the start of the
-page. Therefore care should be taken when using hardware compression
-when reproducible results are desired. In particular, zlib-ng-specific
-zng_deflateSetParams call allows setting Z_DEFLATE_REPRODUCIBLE
-parameter, which would disable DFLTCC if reproducible results are
-required.
+When built like this, zlib-ng compresses using hardware on level 1,
+and using software on all other levels. Decompression always happens
+in hardware. In order to enable hardware compression for levels 1-6
+(i.e. to have it used by default), add
+`-DDFLTCC_LEVEL_MASK=0x7e` to CFLAGS when building zlib-ng.
+
+SystemZ deflate hardware acceleration is available on [IBM z15](
+https://www.ibm.com/products/z15) and newer machines under the name [
+"Integrated Accelerator for zEnterprise Data Compression"](
+https://www.ibm.com/support/z-content-solutions/compression/). The
+programming interface to it is a machine instruction called DEFLATE
+CONVERSION CALL (DFLTCC). It is documented in Chapter 26 of [Principles
+of Operation](http://publibfp.dhe.ibm.com/epubs/pdf/a227832c.pdf). Both
+the code and the rest of this document refer to this feature simply as
+"DFLTCC".
+
+# Performance
+
+Performance figures are published [here](
+https://github.com/iii-i/zlib-ng/wiki/Performance-with-dfltcc-patch-applied-and-dfltcc-support-built-on-dfltcc-enabled-machine
+). The compression speed-up can be as high as 110x and the decompression
+speed-up can be as high as 15x.
+
+# Limitations
+
+Two DFLTCC compression calls with identical inputs are not guaranteed to
+produce identical outputs. Therefore care should be taken when using
+hardware compression when reproducible results are desired. In
+particular, the zlib-ng-specific `zng_deflateSetParams` call allows
+setting the `Z_DEFLATE_REPRODUCIBLE` parameter, which disables DFLTCC
+support for a particular stream.
  
  DFLTCC does not support every single zlib-ng feature, in particular:
  
-* inflate(Z_BLOCK) and inflate(Z_TREES)
-* inflateMark()
-* inflatePrime()
-* deflateParams() after the first deflate() call
+* `inflate(Z_BLOCK)` and `inflate(Z_TREES)`
+* `inflateMark()`
+* `inflatePrime()`
  
  When used, these functions will either switch to software, or, in case
  this is not possible, gracefully fail.
  
-All SystemZ-specific code lives in a separate file and is integrated
-with the rest of zlib-ng using hook macros, which are explained below.
+# Code structure
+
+All SystemZ-specific code lives in the `arch/s390` directory and is
+integrated with the rest of zlib-ng using hook macros.
+
+## Hook macros
  
  DFLTCC takes as arguments a parameter block, an input buffer, an output
-buffer and a window. ZALLOC_STATE, ZFREE_STATE, ZCOPY_STATE,
-ZALLOC_WINDOW and TRY_FREE_WINDOW macros encapsulate allocation details
-for the parameter block (which is allocated alongside zlib-ng state)
-and the window (which must be page-aligned).
+buffer and a window. `ZALLOC_STATE()`, `ZFREE_STATE()`, `ZCOPY_STATE()`,
+`ZALLOC_WINDOW()` and `TRY_FREE_WINDOW()` macros encapsulate allocation
+details for the parameter block (which is allocated alongside zlib-ng
+state) and the window (which must be page-aligned).
+
+While inflate software and hardware window formats match, this is not
+the case for deflate. Therefore, `deflateSetDictionary()` and
+`deflateGetDictionary()` need special handling, which is triggered using
+`DEFLATE_SET_DICTIONARY_HOOK()` and `DEFLATE_GET_DICTIONARY_HOOK()`
+macros.
  
-While for inflate software and hardware window formats match, this is
-not the case for deflate. Therefore, deflateSetDictionary and
-deflateGetDictionary need special handling, which is triggered using
-the DEFLATE_SET_DICTIONARY_HOOK and DEFLATE_GET_DICTIONARY_HOOK macros.
+`deflateResetKeep()` and `inflateResetKeep()` update the DFLTCC
+parameter block using `DEFLATE_RESET_KEEP_HOOK()` and
+`INFLATE_RESET_KEEP_HOOK()` macros.
  
-deflateResetKeep() and inflateResetKeep() update the DFLTCC parameter
-block using DEFLATE_RESET_KEEP_HOOK and INFLATE_RESET_KEEP_HOOK macros.
+`INFLATE_PRIME_HOOK()` and `INFLATE_MARK_HOOK()` macros make the
+unsupported `inflatePrime()` and `inflateMark()` calls fail gracefully.
  
-DEFLATE_PARAMS_HOOK, INFLATE_PRIME_HOOK and INFLATE_MARK_HOOK macros
-make the unsupported deflateParams(), inflatePrime() and inflateMark()
-calls fail gracefully.
+`DEFLATE_PARAMS_HOOK()` implements switching between hardware and
+software compression mid-stream using `deflateParams()`. Switching
+normally entails flushing the current block, which might not be possible
+in low memory situations. `deflateParams()` uses `DEFLATE_DONE()` hook
+in order to detect and gracefully handle such situations.
  
  The algorithm implemented in hardware has different compression ratio
-than the one implemented in software. DEFLATE_BOUND_ADJUST_COMPLEN and
-DEFLATE_NEED_CONSERVATIVE_BOUND macros make deflateBound() return the
-correct results for the hardware implementation.
+than the one implemented in software. `DEFLATE_BOUND_ADJUST_COMPLEN()`
+and `DEFLATE_NEED_CONSERVATIVE_BOUND()` macros make `deflateBound()`
+return the correct results for the hardware implementation.
  
-Actual compression and decompression are handled by DEFLATE_HOOK and
-INFLATE_TYPEDO_HOOK macros. Since inflation with DFLTCC manages the
-window on its own, calling updatewindow() is suppressed using
-INFLATE_NEED_UPDATEWINDOW() macro.
+Actual compression and decompression are handled by `DEFLATE_HOOK()` and
+`INFLATE_TYPEDO_HOOK()` macros. Since inflation with DFLTCC manages the
+window on its own, calling `updatewindow()` is suppressed using
+`INFLATE_NEED_UPDATEWINDOW()` macro.
  
  In addition to compression, DFLTCC computes CRC-32 and Adler-32
  checksums, therefore, whenever it's used, software checksumming is
-suppressed using DEFLATE_NEED_CHECKSUM and INFLATE_NEED_CHECKSUM
+suppressed using `DEFLATE_NEED_CHECKSUM()` and `INFLATE_NEED_CHECKSUM()`
  macros.
  
  While software always produces reproducible compression results, this
@@ -77,4 +105,110 @@ ability to specify whether or not reproducible compression results
  are required. While it is always possible to specify this setting
  before the compression begins, it is not always possible to do so in
  the middle of a deflate stream - the exact conditions for that are
-determined by DEFLATE_CAN_SET_REPRODUCIBLE macro.
+determined by `DEFLATE_CAN_SET_REPRODUCIBLE()` macro.
+
+## SystemZ-specific code
+
+When zlib-ng is built with DFLTCC, the hooks described above are
+converted to calls to functions, which are implemented in
+`arch/s390/dfltcc_*` files. The functions can be grouped in three broad
+categories:
+
+* Base DFLTCC support, e.g. wrapping the machine instruction -
+  `dfltcc()` and allocating aligned memory - `dfltcc_alloc_state()`.
+* Translating between software and hardware data formats, e.g.
+  `dfltcc_deflate_set_dictionary()`.
+* Translating between software and hardware state machines, e.g.
+  `dfltcc_deflate()` and `dfltcc_inflate()`.
+
+The functions from the first two categories are fairly simple, however,
+various quirks in both software and hardware state machines make the
+functions from the third category quite complicated.
+
+### `dfltcc_deflate()` function
+
+This function is called by `deflate()` and has the following
+responsibilities:
+
+* Checking whether DFLTCC can be used with the current stream. If this
+  is not the case, then it returns `0`, making `deflate()` use some
+  other function in order to compress in software. Otherwise it returns
+  `1`.
+* Block management and Huffman table generation. DFLTCC ends blocks only
+  when explicitly instructed to do so by the software. Furthermore,
+  whether to use fixed or dynamic Huffman tables must also be determined
+  by the software. Since looking at data in order to gather statistics
+  would negate performance benefits, the following approach is used: the
+  first `DFLTCC_FIRST_FHT_BLOCK_SIZE` bytes are placed into a fixed
+  block, and every next `DFLTCC_BLOCK_SIZE` bytes are placed into
+  dynamic blocks.
+* Writing EOBS. Block Closing Control bit in the parameter block
+  instructs DFLTCC to write EOBS, however, certain conditions need to be
+  met: input data length must be non-zero or Continuation Flag must be
+  set. To put this in simpler terms, DFLTCC will silently refuse to
+  write EOBS if this is the only thing that it is asked to do. Since the
+  code has to be able to emit EOBS in software anyway, in order to avoid
+  tricky corner cases Block Closing Control is never used. Whether to
+  write EOBS is instead controlled by `soft_bcc` variable.
+* Triggering block post-processing. Depending on flush mode, `deflate()`
+  must perform various additional actions when a block or a stream ends.
+  `dfltcc_deflate()` informs `deflate()` about this using
+  `block_state *result` parameter.
+* Converting software state fields into hardware parameter block fields,
+  and vice versa. For example, `wrap` and Check Value Type or `bi_valid`
+  and Sub-Byte Boundary. Certain fields cannot be translated and must
+  persist untouched in the parameter block between calls, for example,
+  Continuation Flag or Continuation State Buffer.
+* Handling flush modes and low-memory situations. These aspects are
+  quite intertwined and pervasive. The general idea here is that the
+  code must not do anything in software - whether explicitly by e.g.
+  calling `send_eobs()`, or implicitly - by returning to `deflate()`
+  with certain return and `*result` values, when Continuation Flag is
+  set.
+* Ending streams. When a new block is started and flush mode is
+  `Z_FINISH`, Block Header Final parameter block bit is used to mark
+  this block as final. However, sometimes an empty final block is
+  needed, and, unfortunately, just like with EOBS, DFLTCC will silently
+  refuse to do this. The general idea of DFLTCC implementation is to
+  rely as much as possible on the existing code. Here in order to do
+  this, the code pretends that it does not support DFLTCC, which makes
+  `deflate()` call a software compression function, which writes an
+  empty final block. Whether this is required is controlled by
+  `need_empty_block` variable.
+* Error handling. This is simply converting
+  Operation-Ending-Supplemental Code to string. Errors can only happen
+  due to things like memory corruption, and therefore they don't affect
+  the `deflate()` return code.
+
+### `dfltcc_inflate()` function
+
+This function is called by `inflate()` from the `TYPEDO` state (that is,
+when all the metadata is parsed and the stream is positioned at the type
+bits of deflate block header) and it's responsible for the following:
+
+* Falling back to software when flush mode is `Z_BLOCK` or `Z_TREES`.
+  Unfortunately, there is no way to ask DFLTCC to stop decompressing on
+  block or tree boundary.
+* `inflate()` decompression loop management. This is controlled using
+  the return value, which can be either `DFLTCC_INFLATE_BREAK` or
+  `DFLTCC_INFLATE_CONTINUE`.
+* Converting software state fields into hardware parameter block fields,
+  and vice versa. For example, `whave` and History Length or `wnext` and
+  History Offset.
+* Ending streams. This instructs `inflate()` to return `Z_STREAM_END`
+  and is controlled by `last` state field.
+* Error handling. Like deflate, error handling comprises
+  Operation-Ending-Supplemental Code to string conversion. Unlike
+  deflate, errors may happen due to bad inputs, therefore they are
+  propagated to `inflate()` by setting `mode` field to `MEM` or `BAD`.
+
+# Testing
+
+Given the complexity of the DFLTCC machine instruction, it is not clear
+whether QEMU TCG will ever support it. At the time of writing, one has
+to have access to an IBM z15+ VM or LPAR in order to test DFLTCC
+support. Since DFLTCC is a non-privileged instruction, neither special
+VM/LPAR configuration nor root access is required.
+
+Still, zlib-ng CI has a few QEMU TCG-based configurations that check
+whether fallback to software is working.
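
Note on `-DDFLTCC_LEVEL_MASK=0x7e` from the Introduction: the README does not spell out how the mask is interpreted. The sketch below assumes that bit N of the mask enables hardware compression for level N, which is consistent with `0x7e` selecting levels 1-6 and with hardware being used only on level 1 by default. The helper names are hypothetical and are not part of zlib-ng.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a zlib-ng function.
 * Assumption: bit N of the mask corresponds to compression level N.
 * Returns true when the given level would be compressed in hardware.
 */
static bool level_uses_hardware(uint16_t mask, int level)
{
    if (level < 0 || level > 9)
        return false; /* zlib compression levels are 0..9 */
    return ((mask >> level) & 1u) != 0;
}

/*
 * Hypothetical helper: build a mask that enables hardware compression
 * for every level in the inclusive range [first, last].
 */
static uint16_t level_range_mask(int first, int last)
{
    uint16_t mask = 0;

    for (int level = first; level <= last; level++)
        mask |= (uint16_t)(1u << level);
    return mask;
}
```

Under this assumption, `level_range_mask(1, 6)` yields `0x7e`, and a mask that enables only level 1 is `0x02`.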