[3.14] gh-135676: Add a summary of source characters (GH-138194) (GH-139781)

author Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com>

Wed, 8 Oct 2025 16:07:05 +0000 (18:07 +0200)

committer GitHub <noreply@github.com>

Wed, 8 Oct 2025 16:07:05 +0000 (18:07 +0200)
diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst

index f93666dcdc8f44289385114d506bc7dae7a84437..dfa340763d9255dba4befc27766dc659bde190f5 100644 (file)
--- a/Doc/reference/lexical_analysis.rst
+++ b/Doc/reference/lexical_analysis.rst
@@ -10,12 +10,76 @@ Lexical analysis
  A Python program is read by a *parser*.  Input to the parser is a stream of
  :term:`tokens <token>`, generated by the *lexical analyzer* (also known as
  the *tokenizer*).
-This chapter describes how the lexical analyzer breaks a file into tokens.
+This chapter describes how the lexical analyzer produces these tokens.
  
-Python reads program text as Unicode code points; the encoding of a source file
-can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120`
-for details.  If the source file cannot be decoded, a :exc:`SyntaxError` is
-raised.
+The lexical analyzer determines the program text's :ref:`encoding <encodings>`
+(UTF-8 by default), and decodes the text into
+:ref:`source characters <lexical-source-character>`.
+If the text cannot be decoded, a :exc:`SyntaxError` is raised.
+
+Next, the lexical analyzer uses the source characters to generate a stream of tokens.
+The type of a generated token generally depends on the next source character to
+be processed. Similarly, other special behavior of the analyzer depends on
+the first source character that hasn't yet been processed.
+The following table gives a quick summary of these source characters,
+with links to sections that contain more information.
+
+.. list-table::
+   :header-rows: 1
+
+   * - Character
+     - Next token (or other relevant documentation)
+
+   * - * space
+       * tab
+       * formfeed
+     - * :ref:`Whitespace <whitespace>`
+
+   * - * CR, LF
+     - * :ref:`New line <line-structure>`
+       * :ref:`Indentation <indentation>`
+
+   * - * backslash (``\``)
+     - * :ref:`Explicit line joining <explicit-joining>`
+       * (Also significant in :ref:`string escape sequences <escape-sequences>`)
+
+   * - * hash (``#``)
+     - * :ref:`Comment <comments>`
+
+   * - * quote (``'``, ``"``)
+     - * :ref:`String literal <strings>`
+
+   * - * ASCII letter (``a``-``z``, ``A``-``Z``)
+       * non-ASCII character
+     - * :ref:`Name <identifiers>`
+       * Prefixed :ref:`string or bytes literal <strings>`
+
+   * - * underscore (``_``)
+     - * :ref:`Name <identifiers>`
+       * (Can also be part of :ref:`numeric literals <numbers>`)
+
+   * - * number (``0``-``9``)
+     - * :ref:`Numeric literal <numbers>`
+
+   * - * dot (``.``)
+     - * :ref:`Numeric literal <numbers>`
+       * :ref:`Operator <operators>`
+
+   * - * question mark (``?``)
+       * dollar (``$``)
+       *
+         .. (the following uses zero-width space characters to render
+         .. a literal backquote)
+
+         backquote (`````)
+       * control character
+     - * Error (outside string literals and comments)
+
+   * - * other printing character
+     - * :ref:`Operator or delimiter <operators>`
+
+   * - * end of file
+     - * :ref:`End marker <endmarker-token>`
  
  
  .. _line-structure:
@@ -120,6 +184,8 @@ If an encoding is declared, the encoding name must be recognized by Python
  encoding is used for all lexical analysis, including string literals, comments
  and identifiers.
  
+.. _lexical-source-character:
+
  All lexical analysis, including string literals, comments
  and identifiers, works on Unicode text decoded using the source encoding.
  Any Unicode code point, except the NUL control character, can appear in
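
Note (not part of the commit): the summary table added above can be checked informally with the standard library's ``tokenize`` module. The sketch below is only an illustration; the helper name ``first_token`` is hypothetical, and ``tokenize`` is a pure-Python interface whose output differs in places from the interpreter's own lexical analyzer (for example, it emits ``COMMENT`` tokens).

```python
import io
import tokenize


def first_token(source: str) -> str:
    """Return the type name of the first token generated for `source`.

    `first_token` is a hypothetical helper for illustration only.
    """
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return tokenize.tok_name[next(tokens).type]


# Each sample starts with a character from a different row of the table.
samples = ["spam", "_x", "42", ".5", ".", "'hi'", "b'hi'", "# note", "+"]
for text in samples:
    print(f"{text!r:10} -> {first_token(text)}")
```

A letter or an underscore starts a name, unless the letter turns out to be a string prefix, as in ``b'hi'``. A dot starts either a numeric literal or an operator, depending on what follows it.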