gh-134837: Correct and improve base85 documentation for base64 and binascii modules...

author David Huggins-Daines <dhd@ecolingui.ca>

Tue, 12 May 2026 19:46:46 +0000 (15:46 -0400)

committer GitHub <noreply@github.com>

Tue, 12 May 2026 19:46:46 +0000 (22:46 +0300)
diff --git a/Doc/library/base64.rst b/Doc/library/base64.rst
index a722607b2c1f198ee5cd76853693a27ae1983061..8af40a2f8a65e3fdb2fc30c9a454b17122cd35f7 100644
--- a/Doc/library/base64.rst
+++ b/Doc/library/base64.rst
@@ -16,8 +16,10 @@
  This module provides functions for encoding binary data to printable
  ASCII characters and decoding such encodings back to binary data.
  This includes the :ref:`encodings specified in <base64-rfc-4648>`
-:rfc:`4648` (Base64, Base32 and Base16)
-and the non-standard :ref:`Base85 encodings <base64-base-85>`.
+:rfc:`4648` (Base64, Base32 and Base16), the :ref:`Base85 encoding
+<base64-base-85>` specified in `PDF 2.0
+<https://pdfa.org/resource/iso-32000-2/>`_, and non-standard variants
+of Base85 used elsewhere.
  
  There are two interfaces provided by this module.  The modern interface
  supports encoding :term:`bytes-like objects <bytes-like object>` to ASCII
@@ -284,19 +286,28 @@ POST request.
  Base85 Encodings
  -----------------
  
-Base85 encoding is not formally specified but rather a de facto standard,
-thus different systems perform the encoding differently.
+Base85 encoding is a family of algorithms which represent four bytes
+using five ASCII characters.  Originally implemented in the Unix
+``btoa(1)`` utility, a version of it was later adopted by Adobe in the
+PostScript language and is standardized in PDF 2.0 (ISO 32000-2).
+This version, in both its ``btoa`` and PDF variants, is implemented by
+:func:`a85encode`.
  
-The :func:`a85encode` and :func:`b85encode` functions in this module are two implementations of
-the de facto standard. You should call the function with the Base85
-implementation used by the software you intend to work with.
+A separate version, using a different output character set, was
+defined as an April Fool's joke in :rfc:`1924` but is now used by Git
+and other software.  This version is implemented by :func:`b85encode`.
  
-The two functions present in this module differ in how they handle the following:
+Finally, a third version, using yet another output character set
+designed for safe inclusion in programming language strings, is
+defined by ZeroMQ and implemented here by :func:`z85encode`.
  
-* Whether to include enclosing ``<~`` and ``~>`` markers
-* Whether to include newline characters
-* The set of ASCII characters used for encoding
-* Handling of null bytes
+The functions present in this module differ in how they handle the following:
+
+* Whether to include and expect enclosing ``<~`` and ``~>`` markers.
+* Whether to fold the input into multiple lines.
+* The set of ASCII characters used for encoding.
+* Compact encodings of sequences of spaces and null bytes.
+* The encoding of zero-padding bytes applied to the input.
  
  Refer to the documentation of the individual functions for more information.
  
@@ -307,18 +318,22 @@ Refer to the documentation of the individual functions for more information.
  
     *foldspaces* is an optional flag that uses the special short sequence 'y'
     instead of 4 consecutive spaces (ASCII 0x20) as supported by 'btoa'. This
-   feature is not supported by the "standard" Ascii85 encoding.
+   feature is not supported by the standard encoding used in PDF.
  
     If *wrapcol* is non-zero, insert a newline (``b'\n'``) character
     after at most every *wrapcol* characters.
     If *wrapcol* is zero (default), do not insert any newlines.
  
-   If *pad* is true, the input is padded with ``b'\0'`` so its length is a
-   multiple of 4 bytes before encoding.
-   Note that the ``btoa`` implementation always pads.
+   *pad* controls whether zero-padding applied to the end of the input
+   is fully retained in the output encoding, as done by ``btoa``,
+   producing an exact multiple of 5 bytes of output. This is not part
+   of the standard encoding used in PDF, as it does not preserve the
+   length of the data.
  
-   *adobe* controls whether the encoded byte sequence is framed with ``<~``
-   and ``~>``, which is used by the Adobe implementation.
+   *adobe* controls whether the encoded byte sequence is framed with
+   ``<~`` and ``~>``, as in a PostScript base-85 string literal.  Note
+   that while ASCII85Decode streams in PDF documents *must* be
+   terminated with ``~>``, they *must not* use a leading ``<~``.
  
     .. versionadded:: 3.4
  
@@ -330,10 +345,12 @@ Refer to the documentation of the individual functions for more information.
  
     *foldspaces* is a flag that specifies whether the 'y' short sequence
     should be accepted as shorthand for 4 consecutive spaces (ASCII 0x20).
-   This feature is not supported by the "standard" Ascii85 encoding.
+   This feature is not supported by the standard Ascii85 encoding used in
+   PDF and PostScript.
  
-   *adobe* controls whether the input sequence is in Adobe Ascii85 format
-   (i.e. is framed with <~ and ~>).
+   *adobe* controls whether the ``<~`` and ``~>`` markers are
+   present. While the leading ``<~`` is not required, the input must
+   end with ``~>``, or a :exc:`ValueError` is raised.
  
     *ignorechars* should be a :term:`bytes-like object` containing characters
     to ignore from the input.
@@ -356,8 +373,11 @@ Refer to the documentation of the individual functions for more information.
     Encode the :term:`bytes-like object` *b* using base85 (as used in e.g.
     git-style binary diffs) and return the encoded :class:`bytes`.
  
-   If *pad* is true, the input is padded with ``b'\0'`` so its length is a
-   multiple of 4 bytes before encoding.
+   The input is padded with ``b'\0'`` so its length is a multiple of 4
+   bytes before encoding.  If *pad* is true, all the resulting
+   characters are retained in the output, which will always be a
+   multiple of 5 bytes, and thus the length of the data may not be
+   preserved on decoding.
  
     If *wrapcol* is non-zero, insert a newline (``b'\n'``) character
     after at most every *wrapcol* characters.
@@ -372,8 +392,7 @@ Refer to the documentation of the individual functions for more information.
  .. function:: b85decode(b, *, ignorechars=b'', canonical=False)
  
     Decode the base85-encoded :term:`bytes-like object` or ASCII string *b* and
-   return the decoded :class:`bytes`.  Padding is implicitly removed, if
-   necessary.
+   return the decoded :class:`bytes`.
  
     *ignorechars* should be a :term:`bytes-like object` containing characters
     to ignore from the input.
@@ -392,11 +411,12 @@ Refer to the documentation of the individual functions for more information.
  .. function:: z85encode(s, pad=False, *, wrapcol=0)
  
     Encode the :term:`bytes-like object` *s* using Z85 (as used in ZeroMQ)
-   and return the encoded :class:`bytes`.  See `Z85  specification
-   <https://rfc.zeromq.org/spec/32/>`_ for more information.
+   and return the encoded :class:`bytes`.
  
-   If *pad* is true, the input is padded with ``b'\0'`` so its length is a
-   multiple of 4 bytes before encoding.
+   The input is padded with ``b'\0'`` so its length is a multiple of 4
+   bytes before encoding.  If *pad* is true, all the resulting
+   characters are retained in the output, which will always be a
+   multiple of 5 bytes, as required by the ZeroMQ standard.
  
     If *wrapcol* is non-zero, insert a newline (``b'\n'``) character
     after at most every *wrapcol* characters.
@@ -414,8 +434,7 @@ Refer to the documentation of the individual functions for more information.
  .. function:: z85decode(s, *, ignorechars=b'', canonical=False)
  
     Decode the Z85-encoded :term:`bytes-like object` or ASCII string *s* and
-   return the decoded :class:`bytes`.  See `Z85  specification
-   <https://rfc.zeromq.org/spec/32/>`_ for more information.
+   return the decoded :class:`bytes`.
  
     *ignorechars* should be a :term:`bytes-like object` containing characters
     to ignore from the input.
@@ -499,3 +518,11 @@ recommended to review the security section for any code deployed to production.
        Section 5.2, "Base64 Content-Transfer-Encoding," provides the definition of the
        base64 encoding.
  
+   `ISO 32000-2 Portable document format - Part 2: PDF 2.0 <https://pdfa.org/resource/iso-32000-2/>`_
+      Section 7.4.3, "ASCII85Decode Filter," provides the definition
+      of the Ascii85 encoding used in PDF and PostScript, including
+      the output character set and the details of data length preservation
+      using zero-padding and partial output groups.
+
+   `ZeroMQ RFC 32/Z85 <https://rfc.zeromq.org/spec/32/>`_
+      The "Formal Specification" section provides the character set used in Z85.
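The length-preservation point in the documentation changes above can be illustrated with a minimal sketch. It uses only `base64.a85encode`, `base64.b85encode` and their decoders with parameters available in released Python versions; the sample value `data` is made up for illustration and is not part of the patch.

```python
import base64

data = b"hello"  # illustrative input: 5 bytes, so the last 4-byte group is partial

# Without pad, the final output group is shortened and the length round-trips.
a85_plain = base64.a85encode(data)
b85_plain = base64.b85encode(data)

# With pad=True, the zero-padded final group is kept whole,
# so the output length is a multiple of 5.
a85_padded = base64.a85encode(data, pad=True)
b85_padded = base64.b85encode(data, pad=True)

# Decoding the padded form returns the padding bytes along with the data.
a85_roundtrip = base64.a85decode(a85_padded)
b85_roundtrip = base64.b85decode(b85_padded)
```

In this sketch the padded forms decode to the original bytes followed by three null bytes, which is why a caller using `pad=True` has to track the original length separately.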
diff --git a/Doc/library/binascii.rst b/Doc/library/binascii.rst
index 8b4ba6ae9fb2549e9b2e8a657338dbf47fcdf593..60afe9261d51facdb87ff77cd75d1acef1cd6bb3 100644
--- a/Doc/library/binascii.rst
+++ b/Doc/library/binascii.rst
@@ -133,8 +133,11 @@ The :mod:`!binascii` module defines the following functions:
     should be accepted as shorthand for 4 consecutive spaces (ASCII 0x20).
     This feature is not supported by the "standard" Ascii85 encoding.
  
-   *adobe* controls whether the input sequence is in Adobe Ascii85 format
-   (i.e. is framed with <~ and ~>).
+   *adobe* controls whether the encoded byte sequence is framed with
+   ``<~`` and ``~>``, as in a PostScript base-85 string literal.  If
+   *adobe* is true, a leading ``<~`` is optionally accepted, while a
+   trailing ``~>`` is *required*, and :exc:`binascii.Error` is raised
+   if it is not found.
  
     *ignorechars* should be a :term:`bytes-like object` containing characters
     to ignore from the input.
@@ -164,12 +167,16 @@ The :mod:`!binascii` module defines the following functions:
     after at most every *wrapcol* characters.
     If *wrapcol* is zero (default), do not insert any newlines.
  
-   If *pad* is true, the input is padded with ``b'\0'`` so its length is a
-   multiple of 4 bytes before encoding.
-   Note that the ``btoa`` implementation always pads.
+   If *pad* is true, the zero-padding applied to the end of the input
+   is fully retained in the output encoding, as done by ``btoa``,
+   producing an exact multiple of 5 bytes of output. This is not part
+   of the standard encoding used in PDF, as it does not preserve the
+   length of the data.
  
-   *adobe* controls whether the encoded byte sequence is framed with ``<~``
-   and ``~>``, which is used by the Adobe implementation.
+   *adobe* controls whether the encoded byte sequence is framed with
+   ``<~`` and ``~>``, as in a PostScript base-85 string literal.  Note
+   that while ASCII85Decode streams in PDF documents *must* be
+   terminated with ``~>``, they *must not* use a leading ``<~``.
  
     .. versionadded:: 3.15
  
@@ -213,8 +220,10 @@ The :mod:`!binascii` module defines the following functions:
     after at most every *wrapcol* characters.
     If *wrapcol* is zero (default), do not insert any newlines.
  
-   If *pad* is true, the input is padded with ``b'\0'`` so its length is a
-   multiple of 4 bytes before encoding.
+   If *pad* is true, the zero-padding applied to the end of the input
+   is retained in the output, which will always be a multiple of 5
+   bytes, and thus the length of the data may not be preserved on
+   decoding.
  
     .. versionadded:: 3.15
  
diff --git a/Lib/base64.py b/Lib/base64.py
index 4b810e08569e5ba20972d9017735f8acec514b6e..4a0e9d446edb0bcc7f3ba71ed55189b470edce3e 100644
--- a/Lib/base64.py
+++ b/Lib/base64.py
@@ -315,16 +315,20 @@ def a85encode(b, *, foldspaces=False, wrapcol=0, pad=False, adobe=False):
  
      foldspaces is an optional flag that uses the special short sequence 'y'
      instead of 4 consecutive spaces (ASCII 0x20) as supported by 'btoa'. This
-    feature is not supported by the "standard" Adobe encoding.
+    feature is not supported by the standard encoding used in PDF.
  
      If wrapcol is non-zero, insert a newline (b'\\n') character after at most
      every wrapcol characters.
  
-    pad controls whether the input is padded to a multiple of 4 before
-    encoding. Note that the btoa implementation always pads.
+    pad controls whether zero-padding applied to the end of the input
+    is fully retained in the output encoding, as done by btoa,
+    producing an exact multiple of 5 bytes of output.
+
+    adobe controls whether the encoded byte sequence is framed with <~
+    and ~>, as in a PostScript base-85 string literal.  Note that
+    while ASCII85Decode streams in PDF documents must be terminated
+    with ~>, they must not use a leading <~.
  
-    adobe controls whether the encoded byte sequence is framed with <~ and ~>,
-    which is used by the Adobe implementation.
      """
      return binascii.b2a_ascii85(b, foldspaces=foldspaces,
                                  adobe=adobe, wrapcol=wrapcol, pad=pad)
@@ -333,12 +337,14 @@ def a85decode(b, *, foldspaces=False, adobe=False, ignorechars=b' \t\n\r\v',
                canonical=False):
      """Decode the Ascii85 encoded bytes-like object or ASCII string b.
  
-    foldspaces is a flag that specifies whether the 'y' short sequence should be
-    accepted as shorthand for 4 consecutive spaces (ASCII 0x20). This feature is
-    not supported by the "standard" Adobe encoding.
+    foldspaces is a flag that specifies whether the 'y' short sequence
+    should be accepted as shorthand for 4 consecutive spaces (ASCII
+    0x20).  This feature is not supported by the standard Ascii85
+    encoding used in PDF and PostScript.
  
-    adobe controls whether the input sequence is in Adobe Ascii85 format (i.e.
-    is framed with <~ and ~>).
+    adobe controls whether the <~ and ~> markers are present. While
+    the leading <~ is not required, the input must end with ~>, or a
+    ValueError is raised.
  
      ignorechars should be a byte string containing characters to ignore from the
      input. This should only contain whitespace characters, and by default
@@ -358,8 +364,10 @@ def b85encode(b, pad=False, *, wrapcol=0):
      If wrapcol is non-zero, insert a newline (b'\\n') character after at most
      every wrapcol characters.
  
-    If pad is true, the input is padded with b'\\0' so its length is a multiple of
-    4 bytes before encoding.
+    The input is padded with b'\\0' so its length is a multiple of 4
+    bytes before encoding.  If pad is true, all the resulting
+    characters are retained in the output, which will always be a
+    multiple of 5 bytes.
      """
      return binascii.b2a_base85(b, wrapcol=wrapcol, pad=pad)
  
@@ -379,8 +387,10 @@ def z85encode(s, pad=False, *, wrapcol=0):
      If wrapcol is non-zero, insert a newline (b'\\n') character after at most
      every wrapcol characters.
  
-    If pad is true, the input is padded with b'\\0' so its length is a multiple of
-    4 bytes before encoding.
+    The input is padded with b'\\0' so its length is a multiple of 4
+    bytes before encoding.  If pad is true, all the resulting
+    characters are retained in the output, which will always be a
+    multiple of 5 bytes, as required by the ZeroMQ standard.
      """
      return binascii.b2a_base85(s, wrapcol=wrapcol, pad=pad,
                                 alphabet=binascii.Z85_ALPHABET)
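The framing and space-folding behaviour described in the docstrings above can be sketched as follows. This is an illustration, not part of the patch: it relies only on the `adobe` and `foldspaces` parameters of `base64.a85encode` and `base64.a85decode`, and the names `data`, `framed` and `pdf_style` are hypothetical.

```python
import base64

data = b"hello"  # illustrative input

# PostScript-style framing: <~ ... ~>
framed = base64.a85encode(data, adobe=True)
# PDF ASCII85Decode streams keep only the trailing ~>
pdf_style = framed[2:]

# With adobe=True the leading <~ is optional, the trailing ~> is not.
decoded_framed = base64.a85decode(framed, adobe=True)
decoded_pdf = base64.a85decode(pdf_style, adobe=True)

# Input without the trailing ~> is rejected when adobe=True.
try:
    base64.a85decode(base64.a85encode(data), adobe=True)
    missing_marker_rejected = False
except ValueError:  # binascii.Error is a subclass of ValueError
    missing_marker_rejected = True

# foldspaces: four consecutive spaces collapse to the btoa short form 'y'.
folded = base64.a85encode(b"    ", foldspaces=True)
unfolded = base64.a85decode(folded, foldspaces=True)
```

Catching `ValueError` covers both exception types mentioned in the patch, since `binascii.Error` derives from it.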
diff --git a/Modules/binascii.c b/Modules/binascii.c
index 673dca6ee134bd800583891ed6ebeed3c184535d..0e7af135a6f6ce49033915c7619535a773e23df3 100644
--- a/Modules/binascii.c
+++ b/Modules/binascii.c
@@ -1057,7 +1057,8 @@ binascii.a2b_ascii85
      foldspaces: bool = False
          Allow 'y' as a short form encoding four spaces.
      adobe: bool = False
-        Expect data to be wrapped in '<~' and '~>' as in Adobe Ascii85.
+        Expect data to be terminated with '~>' as in Adobe Ascii85, and
+        optionally accept leading '<~'.
      ignorechars: Py_buffer = b''
          A byte string containing characters to ignore from the input.
      canonical: bool = False
@@ -1069,7 +1070,7 @@ Decode Ascii85 data.
  static PyObject *
  binascii_a2b_ascii85_impl(PyObject *module, Py_buffer *data, int foldspaces,
                            int adobe, Py_buffer *ignorechars, int canonical)
-/*[clinic end generated code: output=09b35f1eac531357 input=dd050604ed30199e]*/
+/*[clinic end generated code: output=09b35f1eac531357 input=08eab2e53c62f1a8]*/
  {
      const unsigned char *ascii_data = data->buf;
      Py_ssize_t ascii_len = data->len;
@@ -1264,7 +1265,7 @@ binascii.b2a_ascii85
      wrapcol: size_t = 0
          Split result into lines of provided width.
      pad: bool = False
-        Pad input to a multiple of 4 before encoding.
+        Retain zero-padding bytes at end of output.
      adobe: bool = False
          Wrap result in '<~' and '~>' as in Adobe Ascii85.
  
@@ -1274,7 +1275,7 @@ Ascii85-encode data.
  static PyObject *
  binascii_b2a_ascii85_impl(PyObject *module, Py_buffer *data, int foldspaces,
                            size_t wrapcol, int pad, int adobe)
-/*[clinic end generated code: output=5ce8fdee843073f4 input=791da754508c7d17]*/
+/*[clinic end generated code: output=5ce8fdee843073f4 input=a77e31d63517bf19]*/
  {
      const unsigned char *bin_data = data->buf;
      Py_ssize_t bin_len = data->len;
@@ -1539,7 +1540,7 @@ binascii.b2a_base85
      /
      *
      pad: bool = False
-        Pad input to a multiple of 4 before encoding.
+        Retain zero-padding bytes at end of output.
      wrapcol: size_t = 0
      alphabet: Py_buffer(c_default="{NULL, NULL}") = BASE85_ALPHABET
  
@@ -1549,7 +1550,7 @@ Base85-code line of data.
  static PyObject *
  binascii_b2a_base85_impl(PyObject *module, Py_buffer *data, int pad,
                           size_t wrapcol, Py_buffer *alphabet)
-/*[clinic end generated code: output=98b962ed52c776a4 input=1b20b0bd6572691b]*/
+/*[clinic end generated code: output=98b962ed52c776a4 input=54886d05128d41a8]*/
  {
      const unsigned char *bin_data = data->buf;
      Py_ssize_t bin_len = data->len;
diff --git a/Modules/clinic/binascii.c.h b/Modules/clinic/binascii.c.h
index ed695758ef998c93b2c3237892176d39bd9e6070..29fa9e87de87c7a07cb3d420a128819b231a9a6a 100644
--- a/Modules/clinic/binascii.c.h
+++ b/Modules/clinic/binascii.c.h
@@ -372,7 +372,8 @@ PyDoc_STRVAR(binascii_a2b_ascii85__doc__,
  "  foldspaces\n"
  "    Allow \'y\' as a short form encoding four spaces.\n"
  "  adobe\n"
-"    Expect data to be wrapped in \'<~\' and \'~>\' as in Adobe Ascii85.\n"
+"    Expect data to be terminated with \'~>\' as in Adobe Ascii85, and\n"
+"    optionally accept leading \'<~\'.\n"
  "  ignorechars\n"
  "    A byte string containing characters to ignore from the input.\n"
  "  canonical\n"
@@ -492,7 +493,7 @@ PyDoc_STRVAR(binascii_b2a_ascii85__doc__,
  "  wrapcol\n"
  "    Split result into lines of provided width.\n"
  "  pad\n"
-"    Pad input to a multiple of 4 before encoding.\n"
+"    Retain zero-padding bytes at end of output.\n"
  "  adobe\n"
  "    Wrap result in \'<~\' and \'~>\' as in Adobe Ascii85.");
  
@@ -709,7 +710,7 @@ PyDoc_STRVAR(binascii_b2a_base85__doc__,
  "Base85-code line of data.\n"
  "\n"
  "  pad\n"
-"    Pad input to a multiple of 4 before encoding.");
+"    Retain zero-padding bytes at end of output.");
  
  #define BINASCII_B2A_BASE85_METHODDEF    \
      {"b2a_base85", _PyCFunction_CAST(binascii_b2a_base85), METH_FASTCALL|METH_KEYWORDS, binascii_b2a_base85__doc__},
@@ -1684,4 +1685,4 @@ exit:
  
      return return_value;
  }
-/*[clinic end generated code: output=b41544f39b0ef681 input=a9049054013a1b77]*/
+/*[clinic end generated code: output=42dd48f323cbb118 input=a9049054013a1b77]*/