[3.14] gh-128571: Document UTF-16/32 native byte order (GH-139974) (#140309)

author Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com>

Sat, 18 Oct 2025 18:59:37 +0000 (20:59 +0200)

committer GitHub <noreply@github.com>

Sat, 18 Oct 2025 18:59:37 +0000 (18:59 +0000)
diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst

index 5932012c535b56dac7efdb06dfbc4b7f3e8798a4..24b5a9d64b2cd24cb1db5eb0e0ba1bad1b66a1a4 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -982,17 +982,22 @@ defined in Unicode. A simple and straightforward way that can store each Unicode
  code point, is to store each code point as four consecutive bytes. There are two
  possibilities: store the bytes in big endian or in little endian order. These
  two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
-disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
-will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
-problem: bytes will always be in natural endianness. When these bytes are read
-by a CPU with a different endianness, then bytes have to be swapped though. To
-be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
-there's the so called BOM ("Byte Order Mark"). This is the Unicode character
-``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
-byte sequence. The byte swapped version of this character (``0xFFFE``) is an
-illegal character that may not appear in a Unicode text. So when the
-first character in a ``UTF-16`` or ``UTF-32`` byte sequence
-appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
+disadvantage is that if, for example, you use ``UTF-32-BE`` on a little endian
+machine you will always have to swap bytes on encoding and decoding.
+Python's ``UTF-16`` and ``UTF-32`` codecs avoid this problem by using the
+platform's native byte order when no BOM is present.
+Python follows prevailing platform
+practice, so native-endian data round-trips without redundant byte swapping,
+even though the Unicode Standard defaults to big-endian when the byte order is
+unspecified. When these bytes are read by a CPU with a different endianness,
+the bytes have to be swapped. To be able to detect the endianness of a
+``UTF-16`` or ``UTF-32`` byte sequence, a BOM ("Byte Order Mark") is used.
+This is the Unicode character ``U+FEFF``. This character can be prepended to every
+``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character
+(``0xFFFE``) is an illegal character that may not appear in a Unicode text.
+When the first character of a ``UTF-16`` or ``UTF-32`` byte sequence is
+``U+FFFE``, the bytes have to be swapped on decoding.
+
  Unfortunately the character ``U+FEFF`` had a second purpose as
  a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
  a word to be split. It can e.g. be used to give hints to a ligature algorithm.
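
The behavior documented by this change can be checked with a short script. The following is only an illustrative sketch and is not part of the commit; the variable names are made up for the example, and it relies on the documented ``codecs.BOM_UTF16`` / ``codecs.BOM_UTF32`` constants holding the BOM in the platform's native byte order.

```python
import codecs
import sys

text = "A"

# Encoding with the generic codecs writes a BOM followed by the data,
# both in the platform's native byte order.
utf16_with_bom = text.encode("utf-16")
utf32_with_bom = text.encode("utf-32")

# Pick the explicit-endianness codec names that match / oppose this machine.
little = sys.byteorder == "little"
native16 = "utf-16-le" if little else "utf-16-be"
foreign16 = "utf-16-be" if little else "utf-16-le"
foreign_bom16 = codecs.BOM_UTF16_BE if little else codecs.BOM_UTF16_LE

# No BOM present: the generic decoder assumes native byte order.
raw_native = text.encode(native16)          # explicit codecs write no BOM
decoded_native = raw_native.decode("utf-16")

# BOM in the opposite byte order: the decoder detects it and swaps bytes.
raw_foreign = foreign_bom16 + text.encode(foreign16)
decoded_foreign = raw_foreign.decode("utf-16")

print(utf16_with_bom.startswith(codecs.BOM_UTF16))
print(utf32_with_bom.startswith(codecs.BOM_UTF32))
print(decoded_native == text, decoded_foreign == text)
```

Data without a BOM that was produced on a machine with the other byte order would be misread by the generic codecs, so naming ``utf-16-le`` or ``utf-16-be`` explicitly is likely the safer choice when the byte order is known in advance.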