GH-107678: Improve Unicode handling clarity in ``library/re.rst`` (#107679)

author Adam Turner <9087854+AA-Turner@users.noreply.github.com>

Thu, 11 Jan 2024 23:56:10 +0000 (23:56 +0000)

committer GitHub <noreply@github.com>

Thu, 11 Jan 2024 23:56:10 +0000 (23:56 +0000)
diff --git a/Doc/library/re.rst b/Doc/library/re.rst

index 302f7224de4a7a082b01f23e2a9aab02be28b8fb..5bb93390aa5f79cd3bee6dcba09673bf270cfa9f 100644
--- a/Doc/library/re.rst
+++ b/Doc/library/re.rst
@@ -17,7 +17,7 @@ those found in Perl.
  Both patterns and strings to be searched can be Unicode strings (:class:`str`)
  as well as 8-bit strings (:class:`bytes`).
  However, Unicode strings and 8-bit strings cannot be mixed:
-that is, you cannot match a Unicode string with a byte pattern or
+that is, you cannot match a Unicode string with a bytes pattern or
  vice-versa; similarly, when asking for a substitution, the replacement
  string must be of the same type as both the pattern and the search string.
  
@@ -257,8 +257,7 @@ The special characters are:
     .. index:: single: \ (backslash); in regular expressions
  
     * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
-     inside a set, although the characters they match depends on whether
-     :const:`ASCII` or :const:`LOCALE` mode is in force.
+     inside a set, although the characters they match depend on the flags_ used.
  
     .. index:: single: ^ (caret); in regular expressions
  
@@ -326,18 +325,24 @@ The special characters are:
     currently supported extensions.
  
  ``(?aiLmsux)``
-   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
-   ``'s'``, ``'u'``, ``'x'``.)  The group matches the empty string; the
-   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
-   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
-   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
-   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
-   for the entire regular expression.
+   (One or more letters from the set
+   ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``.)
+   The group matches the empty string;
+   the letters set the corresponding flags for the entire regular expression:
+
+   * :const:`re.A` (ASCII-only matching)
+   * :const:`re.I` (ignore case)
+   * :const:`re.L` (locale dependent)
+   * :const:`re.M` (multi-line)
+   * :const:`re.S` (dot matches all)
+   * :const:`re.U` (Unicode matching)
+   * :const:`re.X` (verbose)
+
     (The flags are described in :ref:`contents-of-module-re`.)
     This is useful if you wish to include the flags as part of the
     regular expression, instead of passing a *flag* argument to the
-   :func:`re.compile` function.  Flags should be used first in the
-   expression string.
+   :func:`re.compile` function.
+   Flags should be used first in the expression string.
  
     .. versionchanged:: 3.11
        This construction can only be used at the start of the expression.
@@ -351,14 +356,20 @@ The special characters are:
     pattern.
  
  ``(?aiLmsux-imsx:...)``
-   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
-   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
+   (Zero or more letters from the set
+   ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``,
+   optionally followed by ``'-'`` followed by
     one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
-   The letters set or remove the corresponding flags:
-   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
-   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
-   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
-   and :const:`re.X` (verbose), for the part of the expression.
+   The letters set or remove the corresponding flags for the part of the expression:
+
+   * :const:`re.A` (ASCII-only matching)
+   * :const:`re.I` (ignore case)
+   * :const:`re.L` (locale dependent)
+   * :const:`re.M` (multi-line)
+   * :const:`re.S` (dot matches all)
+   * :const:`re.U` (Unicode matching)
+   * :const:`re.X` (verbose)
+
     (The flags are described in :ref:`contents-of-module-re`.)
  
     The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
@@ -366,7 +377,7 @@ The special characters are:
     when one of them appears in an inline group, it overrides the matching mode
     in the enclosing group.  In Unicode patterns ``(?a:...)`` switches to
     ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
-   (default).  In byte pattern ``(?L:...)`` switches to locale depending
+   (default).  In bytes patterns ``(?L:...)`` switches to locale dependent
     matching, and ``(?a:...)`` switches to ASCII-only matching (default).
     This override is only in effect for the narrow inline group, and the
     original matching mode is restored outside of the group.
@@ -529,47 +540,61 @@ character ``'$'``.
  
  ``\b``
     Matches the empty string, but only at the beginning or end of a word.
-   A word is defined as a sequence of word characters.  Note that formally,
-   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
-   (or vice versa), or between ``\w`` and the beginning/end of the string.
-   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
-   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
-
-   By default Unicode alphanumerics are the ones used in Unicode patterns, but
-   this can be changed by using the :const:`ASCII` flag.  Word boundaries are
-   determined by the current locale if the :const:`LOCALE` flag is used.
-   Inside a character range, ``\b`` represents the backspace character, for
-   compatibility with Python's string literals.
+   A word is defined as a sequence of word characters.
+   Note that formally, ``\b`` is defined as the boundary
+   between a ``\w`` and a ``\W`` character (or vice versa),
+   or between ``\w`` and the beginning or end of the string.
+   This means that ``r'\bat\b'`` matches ``'at'``, ``'at.'``, ``'(at)'``,
+   and ``'as at ay'`` but not ``'attempt'`` or ``'atlas'``.
+
+   The default word characters in Unicode (str) patterns
+   are Unicode alphanumerics and the underscore,
+   but this can be changed by using the :py:const:`~re.ASCII` flag.
+   Word boundaries are determined by the current locale
+   if the :py:const:`~re.LOCALE` flag is used.
+
+   .. note::
+
+      Inside a character range, ``\b`` represents the backspace character,
+      for compatibility with Python's string literals.
  
  .. index:: single: \B; in regular expressions
  
  ``\B``
-   Matches the empty string, but only when it is *not* at the beginning or end
-   of a word.  This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
-   ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
-   ``\B`` is just the opposite of ``\b``, so word characters in Unicode
-   patterns are Unicode alphanumerics or the underscore, although this can
-   be changed by using the :const:`ASCII` flag.  Word boundaries are
-   determined by the current locale if the :const:`LOCALE` flag is used.
+   Matches the empty string,
+   but only when it is *not* at the beginning or end of a word.
+   This means that ``r'at\B'`` matches ``'athens'``, ``'atom'``,
+   ``'attorney'``, but not ``'at'``, ``'at.'``, or ``'at!'``.
+   ``\B`` is the opposite of ``\b``,
+   so word characters in Unicode (str) patterns
+   are Unicode alphanumerics or the underscore,
+   although this can be changed by using the :py:const:`~re.ASCII` flag.
+   Word boundaries are determined by the current locale
+   if the :py:const:`~re.LOCALE` flag is used.
  
  .. index:: single: \d; in regular expressions
  
  ``\d``
     For Unicode (str) patterns:
-      Matches any Unicode decimal digit (that is, any character in
-      Unicode character category [Nd]).  This includes ``[0-9]``, and
-      also many other digit characters.  If the :const:`ASCII` flag is
-      used only ``[0-9]`` is matched.
+      Matches any Unicode decimal digit
+      (that is, any character in Unicode character category `[Nd]`__).
+      This includes ``[0-9]``, and also many other digit characters.
+
+      Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used.
+
+      __ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153
  
     For 8-bit (bytes) patterns:
-      Matches any decimal digit; this is equivalent to ``[0-9]``.
+      Matches any decimal digit in the ASCII character set;
+      this is equivalent to ``[0-9]``.
  
  .. index:: single: \D; in regular expressions
  
  ``\D``
-   Matches any character which is not a decimal digit. This is
-   the opposite of ``\d``. If the :const:`ASCII` flag is used this
-   becomes the equivalent of ``[^0-9]``.
+   Matches any character which is not a decimal digit.
+   This is the opposite of ``\d``.
+
+   Matches ``[^0-9]`` if the :py:const:`~re.ASCII` flag is used.
  
  .. index:: single: \s; in regular expressions
  
@@ -578,8 +603,9 @@ character ``'$'``.
        Matches Unicode whitespace characters (which includes
        ``[ \t\n\r\f\v]``, and also many other characters, for example the
        non-breaking spaces mandated by typography rules in many
-      languages). If the :const:`ASCII` flag is used, only
-      ``[ \t\n\r\f\v]`` is matched.
+      languages).
+
+      Matches ``[ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
  
     For 8-bit (bytes) patterns:
        Matches characters considered whitespace in the ASCII character set;
@@ -589,30 +615,39 @@ character ``'$'``.
  
  ``\S``
     Matches any character which is not a whitespace character. This is
-   the opposite of ``\s``. If the :const:`ASCII` flag is used this
-   becomes the equivalent of ``[^ \t\n\r\f\v]``.
+   the opposite of ``\s``.
+
+   Matches ``[^ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
  
  .. index:: single: \w; in regular expressions
  
  ``\w``
     For Unicode (str) patterns:
-      Matches Unicode word characters; this includes alphanumeric characters (as defined by :meth:`str.isalnum`)
+      Matches Unicode word characters;
+      this includes all Unicode alphanumeric characters
+      (as defined by :py:meth:`str.isalnum`),
        as well as the underscore (``_``).
-      If the :const:`ASCII` flag is used, only ``[a-zA-Z0-9_]`` is matched.
+
+      Matches ``[a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
  
     For 8-bit (bytes) patterns:
        Matches characters considered alphanumeric in the ASCII character set;
-      this is equivalent to ``[a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
-      used, matches characters considered alphanumeric in the current locale
-      and the underscore.
+      this is equivalent to ``[a-zA-Z0-9_]``.
+      If the :py:const:`~re.LOCALE` flag is used,
+      matches characters considered alphanumeric in the current locale and the underscore.
  
  .. index:: single: \W; in regular expressions
  
  ``\W``
-   Matches any character which is not a word character. This is
-   the opposite of ``\w``. If the :const:`ASCII` flag is used this
-   becomes the equivalent of ``[^a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
-   used, matches characters which are neither alphanumeric in the current locale
+   Matches any character which is not a word character.
+   This is the opposite of ``\w``.
+   By default, matches non-underscore (``_``) characters
+   for which :py:meth:`str.isalnum` returns ``False``.
+
+   Matches ``[^a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
+
+   If the :py:const:`~re.LOCALE` flag is used,
+   matches characters which are neither alphanumeric in the current locale
     nor the underscore.
  
  .. index:: single: \Z; in regular expressions
@@ -644,9 +679,11 @@ string literals are also accepted by the regular expression parser::
  (Note that ``\b`` is used to represent word boundaries, and means "backspace"
  only inside character classes.)
  
-``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
-patterns.  In bytes patterns they are errors.  Unknown escapes of ASCII
-letters are reserved for future use and treated as errors.
+``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are
+only recognized in Unicode (str) patterns.
+In bytes patterns they are errors.
+Unknown escapes of ASCII letters are reserved
+for future use and treated as errors.
  
  Octal escapes are included in a limited form.  If the first digit is a 0, or if
  there are three octal digits, it is considered an octal escape. Otherwise, it is
@@ -694,30 +731,37 @@ Flags
  
     Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
     perform ASCII-only matching instead of full Unicode matching.  This is only
-   meaningful for Unicode patterns, and is ignored for byte patterns.
+   meaningful for Unicode (str) patterns, and is ignored for bytes patterns.
+
     Corresponds to the inline flag ``(?a)``.
  
-   Note that for backward compatibility, the :const:`re.U` flag still
-   exists (as well as its synonym :const:`re.UNICODE` and its embedded
-   counterpart ``(?u)``), but these are redundant in Python 3 since
-   matches are Unicode by default for strings (and Unicode matching
-   isn't allowed for bytes).
+   .. note::
+
+      The :py:const:`~re.U` flag still exists for backward compatibility,
+      but is redundant in Python 3 since
+      matches are Unicode by default for ``str`` patterns,
+      and Unicode matching isn't allowed for bytes patterns.
+      :py:const:`~re.UNICODE` and the inline flag ``(?u)`` are similarly redundant.
  
  
  .. data:: DEBUG
  
     Display debug information about compiled expression.
+
     No corresponding inline flag.
  
  
  .. data:: I
            IGNORECASE
  
-   Perform case-insensitive matching; expressions like ``[A-Z]`` will also
-   match lowercase letters.  Full Unicode matching (such as ``Ü`` matching
-   ``ü``) also works unless the :const:`re.ASCII` flag is used to disable
-   non-ASCII matches.  The current locale does not change the effect of this
-   flag unless the :const:`re.LOCALE` flag is also used.
+   Perform case-insensitive matching;
+   expressions like ``[A-Z]`` will also match lowercase letters.
+   Full Unicode matching (such as ``Ü`` matching ``ü``)
+   also works unless the :py:const:`~re.ASCII` flag
+   is used to disable non-ASCII matches.
+   The current locale does not change the effect of this flag
+   unless the :py:const:`~re.LOCALE` flag is also used.
+
     Corresponds to the inline flag ``(?i)``.
  
     Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
@@ -725,29 +769,35 @@ Flags
     letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
     letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
     'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
-   If the :const:`ASCII` flag is used, only letters 'a' to 'z'
+   If the :py:const:`~re.ASCII` flag is used, only letters 'a' to 'z'
     and 'A' to 'Z' are matched.
  
  .. data:: L
            LOCALE
  
     Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
-   dependent on the current locale.  This flag can be used only with bytes
-   patterns.  The use of this flag is discouraged as the locale mechanism
-   is very unreliable, it only handles one "culture" at a time, and it only
-   works with 8-bit locales.  Unicode matching is already enabled by default
-   in Python 3 for Unicode (str) patterns, and it is able to handle different
-   locales/languages.
+   dependent on the current locale.
+   This flag can be used only with bytes patterns.
+
     Corresponds to the inline flag ``(?L)``.
  
+   .. warning::
+
+      This flag is discouraged; consider Unicode matching instead.
+      The locale mechanism is very unreliable
+      as it only handles one "culture" at a time
+      and only works with 8-bit locales.
+      Unicode matching is enabled by default for Unicode (str) patterns
+      and it is able to handle different locales and languages.
+
     .. versionchanged:: 3.6
-      :const:`re.LOCALE` can be used only with bytes patterns and is
-      not compatible with :const:`re.ASCII`.
+      :py:const:`~re.LOCALE` can be used only with bytes patterns
+      and is not compatible with :py:const:`~re.ASCII`.
  
     .. versionchanged:: 3.7
-      Compiled regular expression objects with the :const:`re.LOCALE` flag no
-      longer depend on the locale at compile time.  Only the locale at
-      matching time affects the result of matching.
+      Compiled regular expression objects with the :py:const:`~re.LOCALE` flag
+      no longer depend on the locale at compile time.
+      Only the locale at matching time affects the result of matching.
  
  
  .. data:: M
@@ -759,6 +809,7 @@ Flags
     end of each line (immediately preceding each newline).  By default, ``'^'``
     matches only at the beginning of the string, and ``'$'`` only at the end of the
     string and immediately before the newline (if any) at the end of the string.
+
     Corresponds to the inline flag ``(?m)``.
  
  .. data:: NOFLAG
@@ -778,19 +829,19 @@ Flags
  
     Make the ``'.'`` special character match any character at all, including a
     newline; without this flag, ``'.'`` will match anything *except* a newline.
+
     Corresponds to the inline flag ``(?s)``.
  
  
  .. data:: U
            UNICODE
  
-   In Python 2, this flag made :ref:`special sequences <re-special-sequences>`
-   include Unicode characters in matches. Since Python 3, Unicode characters
-   are matched by default.
-
-   See :const:`A` for restricting matching on ASCII characters instead.
+   In Python 3, Unicode characters are matched by default
+   for ``str`` patterns.
+   This flag is therefore redundant with **no effect**
+   and is only kept for backward compatibility.
  
-   This flag is only kept for backward compatibility.
+   See :py:const:`~re.ASCII` to restrict matching to ASCII characters instead.
  
  .. data:: X
            VERBOSE
@@ -914,6 +965,8 @@ Functions
     Empty matches for the pattern split the string only when not adjacent
     to a previous empty match.
  
+   .. code:: pycon
+
        >>> re.split(r'\b', 'Words, words, words.')
        ['', 'Words', ', ', 'words', ', ', 'words', '.']
        >>> re.split(r'\W*', '...words...')
@@ -1237,7 +1290,7 @@ Regular Expression Objects
  
     The regex matching flags.  This is a combination of the flags given to
     :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
-   flags such as :data:`UNICODE` if the pattern is a Unicode string.
+   flags such as :py:const:`~re.UNICODE` if the pattern is a Unicode string.
  
  
  .. attribute:: Pattern.groups
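
Reviewer note (not part of the commit): the reworded passages above describe how ``\b``, ``\d`` and ``\w`` behave for str patterns with and without the ASCII flag, how a scoped inline flag such as ``(?a:...)`` applies only inside its group, and why ``re.U`` is redundant. The following is a minimal sketch of that behaviour, assuming Python 3.6 or later; the variable names and sample strings are illustrative and do not come from the patch.

```python
import re

# \b is the boundary between \w and \W (or the start/end of the string).
word_at = re.compile(r'\bat\b')
standalone_found = word_at.search('as at ay') is not None
prefix_found = word_at.search('attempt') is not None

# For str patterns, \d matches any Unicode decimal digit (category Nd)
# unless re.ASCII restricts it to [0-9].
arabic_indic_three = '\u0663'  # ARABIC-INDIC DIGIT THREE
digit_unicode = re.fullmatch(r'\d', arabic_indic_three) is not None
digit_ascii = re.fullmatch(r'\d', arabic_indic_three, re.ASCII) is not None

# \w matches Unicode alphanumerics and the underscore by default,
# and only [a-zA-Z0-9_] under re.ASCII.
cafe = 'caf\u00e9'
word_unicode = re.fullmatch(r'\w+', cafe) is not None
word_ascii = re.fullmatch(r'\w+', cafe, re.ASCII) is not None

# (?a:...) switches to ASCII-only matching for the inline group only;
# the second \w is outside the group and stays in Unicode mode.
scoped_accepts = re.fullmatch(r'(?a:\w)\w', 'a\u00e9') is not None
scoped_rejects = re.fullmatch(r'(?a:\w)\w', '\u00e9a') is None

# re.UNICODE is implied for str patterns, so passing it changes nothing.
same_flags = (re.compile(r'\w').flags
              == re.compile(r'\w', re.UNICODE).flags)

# re.LOCALE is only accepted for bytes patterns.
try:
    re.compile(r'\w', re.LOCALE)
    locale_rejected_for_str = False
except ValueError:
    locale_rejected_for_str = True

print(standalone_found, prefix_found)
print(digit_unicode, digit_ascii)
print(word_unicode, word_ascii)
print(scoped_accepts, scoped_rejects)
print(same_flags, locale_rejected_for_str)
```

Each pair contrasts the default Unicode behaviour with the ASCII-restricted one, which is the distinction the revised wording tries to make explicit.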