From: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com> Date: Fri, 15 May 2026 18:36:47 +0000 (+0200) Subject: [3.13] gh-134837: Correct and improve base85 documentation for base64 module (GH... X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=refs%2Fheads%2F3.13;p=thirdparty%2FPython%2Fcpython.git [3.13] gh-134837: Correct and improve base85 documentation for base64 module (GH-145843) (GH-149743) (GH-149893) (cherry picked from commit e667d62f114b54dcba17bdfad3835b9c91fda348) (cherry picked from commit 9ad8a1b955033ebbf05456de2cfad85d1de2503d) Co-authored-by: Serhiy Storchaka Co-authored-by: David Huggins-Daines --- diff --git a/Doc/library/base64.rst b/Doc/library/base64.rst index 1c2c57448cee..0248fa2f45f2 100644 --- a/Doc/library/base64.rst +++ b/Doc/library/base64.rst @@ -16,8 +16,10 @@ This module provides functions for encoding binary data to printable ASCII characters and decoding such encodings back to binary data. This includes the :ref:`encodings specified in ` -:rfc:`4648` (Base64, Base32 and Base16) -and the non-standard :ref:`Base85 encodings `. +:rfc:`4648` (Base64, Base32 and Base16), the :ref:`Base85 encoding +` specified in `PDF 2.0 +`_, and non-standard variants +of Base85 used elsewhere. There are two interfaces provided by this module. The modern interface supports encoding :term:`bytes-like objects ` to ASCII @@ -189,19 +191,28 @@ POST request. Base85 Encodings ----------------- -Base85 encoding is not formally specified but rather a de facto standard, -thus different systems perform the encoding differently. +Base85 encoding is a family of algorithms which represent four bytes +using five ASCII characters. Originally implemented in the Unix +``btoa(1)`` utility, a version of it was later adopted by Adobe in the +PostScript language and is standardized in PDF 2.0 (ISO 32000-2). +This version, in both its ``btoa`` and PDF variants, is implemented by +:func:`a85encode`. -The :func:`a85encode` and :func:`b85encode` functions in this module are two implementations of -the de facto standard. You should call the function with the Base85 -implementation used by the software you intend to work with. +A separate version, using a different output character set, was +defined as an April Fool's joke in :rfc:`1924` but is now used by Git +and other software. This version is implemented by :func:`b85encode`. -The two functions present in this module differ in how they handle the following: +Finally, a third version, using yet another output character set +designed for safe inclusion in programming language strings, is +defined by ZeroMQ and implemented here by :func:`z85encode`. -* Whether to include enclosing ``<~`` and ``~>`` markers -* Whether to include newline characters -* The set of ASCII characters used for encoding -* Handling of null bytes +The functions present in this module differ in how they handle the following: + +* Whether to include and expect enclosing ``<~`` and ``~>`` markers. +* Whether to fold the input into multiple lines. +* The set of ASCII characters used for encoding. +* Compact encodings of sequences of spaces and null bytes. +* The encoding of zero-padding bytes applied to the input. Refer to the documentation of the individual functions for more information. @@ -212,17 +223,22 @@ Refer to the documentation of the individual functions for more information. *foldspaces* is an optional flag that uses the special short sequence 'y' instead of 4 consecutive spaces (ASCII 0x20) as supported by 'btoa'. This - feature is not supported by the "standard" Ascii85 encoding. + feature is not supported by the standard encoding used in PDF. *wrapcol* controls whether the output should have newline (``b'\n'``) characters added to it. If this is non-zero, each output line will be at most this many characters long, excluding the trailing newline. - *pad* controls whether the input is padded to a multiple of 4 - before encoding. Note that the ``btoa`` implementation always pads. + *pad* controls whether zero-padding applied to the end of the input + is fully retained in the output encoding, as done by ``btoa``, + producing an exact multiple of 5 bytes of output. This is not part + of the standard encoding used in PDF, as it does not preserve the + length of the data. - *adobe* controls whether the encoded byte sequence is framed with ``<~`` - and ``~>``, which is used by the Adobe implementation. + *adobe* controls whether the encoded byte sequence is framed with + ``<~`` and ``~>``, as in a PostScript base-85 string literal. Note + that while ASCII85Decode streams in PDF documents *must* be + terminated with ``~>``, they *must not* use a leading ``<~``. .. versionadded:: 3.4 @@ -234,10 +250,12 @@ Refer to the documentation of the individual functions for more information. *foldspaces* is a flag that specifies whether the 'y' short sequence should be accepted as shorthand for 4 consecutive spaces (ASCII 0x20). - This feature is not supported by the "standard" Ascii85 encoding. + This feature is not supported by the standard Ascii85 encoding used in + PDF and PostScript. - *adobe* controls whether the input sequence is in Adobe Ascii85 format - (i.e. is framed with <~ and ~>). + *adobe* controls whether the ``<~`` and ``~>`` markers are + present. While the leading ``<~`` is not required, the input must + end with ``~>``, or a :exc:`ValueError` is raised. *ignorechars* should be a byte string containing characters to ignore from the input. This should only contain whitespace characters, and by @@ -251,8 +269,11 @@ Refer to the documentation of the individual functions for more information. Encode the :term:`bytes-like object` *b* using base85 (as used in e.g. git-style binary diffs) and return the encoded :class:`bytes`. - If *pad* is true, the input is padded with ``b'\0'`` so its length is a - multiple of 4 bytes before encoding. + The input is padded with ``b'\0'`` so its length is a multiple of 4 + bytes before encoding. If *pad* is true, all the resulting + characters are retained in the output, which will always be a + multiple of 5 bytes, and thus the length of the data may not be + preserved on decoding. .. versionadded:: 3.4 @@ -260,8 +281,7 @@ Refer to the documentation of the individual functions for more information. .. function:: b85decode(b) Decode the base85-encoded :term:`bytes-like object` or ASCII string *b* and - return the decoded :class:`bytes`. Padding is implicitly removed, if - necessary. + return the decoded :class:`bytes`. .. versionadded:: 3.4 @@ -269,8 +289,12 @@ Refer to the documentation of the individual functions for more information. .. function:: z85encode(s) Encode the :term:`bytes-like object` *s* using Z85 (as used in ZeroMQ) - and return the encoded :class:`bytes`. See `Z85 specification - `_ for more information. + and return the encoded :class:`bytes`. + + The `ZeroMQ specification `_ + requires the length of Z85-encoded data to be a multiple of 5 + bytes. To produce compliant data frames, you must pad the input + data to this function to a multiple of 4 bytes. .. versionadded:: 3.13 @@ -278,8 +302,7 @@ Refer to the documentation of the individual functions for more information. .. function:: z85decode(s) Decode the Z85-encoded :term:`bytes-like object` or ASCII string *s* and - return the decoded :class:`bytes`. See `Z85 specification - `_ for more information. + return the decoded :class:`bytes`. .. versionadded:: 3.13 @@ -352,3 +375,11 @@ recommended to review the security section for any code deployed to production. Section 5.2, "Base64 Content-Transfer-Encoding," provides the definition of the base64 encoding. + `ISO 32000-2 Portable document format - Part 2: PDF 2.0 `_ + Section 7.4.3, "ASCII85Decode Filter," provides the definition + of the Ascii85 encoding used in PDF and PostScript, including + the output character set and the details of data length preservation + using zero-padding and partial output groups. + + `ZeroMQ RFC 32/Z85 `_ + The "Formal Specification" section provides the character set used in Z85. diff --git a/Lib/base64.py b/Lib/base64.py index 6636e06382f4..888443c472c0 100755 --- a/Lib/base64.py +++ b/Lib/base64.py @@ -328,17 +328,20 @@ def a85encode(b, *, foldspaces=False, wrapcol=0, pad=False, adobe=False): foldspaces is an optional flag that uses the special short sequence 'y' instead of 4 consecutive spaces (ASCII 0x20) as supported by 'btoa'. This - feature is not supported by the "standard" Adobe encoding. + feature is not supported by the standard encoding used in PDF. - wrapcol controls whether the output should have newline (b'\\n') characters - added to it. If this is non-zero, each output line will be at most this - many characters long, excluding the trailing newline. + If wrapcol is non-zero, insert a newline (b'\\n') character after at most + every wrapcol characters. - pad controls whether the input is padded to a multiple of 4 before - encoding. Note that the btoa implementation always pads. + pad controls whether zero-padding applied to the end of the input + is fully retained in the output encoding, as done by btoa, + producing an exact multiple of 5 bytes of output. + + adobe controls whether the encoded byte sequence is framed with <~ + and ~>, as in a PostScript base-85 string literal. Note that + while ASCII85Decode streams in PDF documents must be terminated + with ~>, they must not use a leading <~. - adobe controls whether the encoded byte sequence is framed with <~ and ~>, - which is used by the Adobe implementation. """ global _a85chars, _a85chars2 # Delay the initialization of tables to not waste memory @@ -367,12 +370,14 @@ def a85encode(b, *, foldspaces=False, wrapcol=0, pad=False, adobe=False): def a85decode(b, *, foldspaces=False, adobe=False, ignorechars=b' \t\n\r\v'): """Decode the Ascii85 encoded bytes-like object or ASCII string b. - foldspaces is a flag that specifies whether the 'y' short sequence should be - accepted as shorthand for 4 consecutive spaces (ASCII 0x20). This feature is - not supported by the "standard" Adobe encoding. + foldspaces is a flag that specifies whether the 'y' short sequence + should be accepted as shorthand for 4 consecutive spaces (ASCII + 0x20). This feature is not supported by the standard Ascii85 + encoding used in PDF and PostScript. - adobe controls whether the input sequence is in Adobe Ascii85 format (i.e. - is framed with <~ and ~>). + adobe controls whether the <~ and ~> markers are present. While + the leading <~ is not required, the input must end with ~>, or a + ValueError is raised. ignorechars should be a byte string containing characters to ignore from the input. This should only contain whitespace characters, and by default @@ -445,8 +450,10 @@ _b85dec = None def b85encode(b, pad=False): """Encode bytes-like object b in base85 format and return a bytes object. - If pad is true, the input is padded with b'\\0' so its length is a multiple of - 4 bytes before encoding. + The input is padded with b'\0' so its length is a multiple of 4 + bytes before encoding. If pad is true, all the resulting + characters are retained in the output, which will always be a + multiple of 5 bytes. """ global _b85chars, _b85chars2 # Delay the initialization of tables to not waste memory