gh-135676: Add a summary of source characters (GH-138194)

author Petr Viktorin <encukou@gmail.com>

Wed, 8 Oct 2025 14:34:19 +0000 (16:34 +0200)

committer GitHub <noreply@github.com>

Wed, 8 Oct 2025 14:34:19 +0000 (16:34 +0200)
diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst

index 05ae410c168a31a1f26babc91d0ea958c125750c..0b0dba1a996af010180338a05a118d444cc573fb 100644
--- a/Doc/reference/lexical_analysis.rst
+++ b/Doc/reference/lexical_analysis.rst
@@ -10,12 +10,76 @@ Lexical analysis
  A Python program is read by a *parser*.  Input to the parser is a stream of
  :term:`tokens <token>`, generated by the *lexical analyzer* (also known as
  the *tokenizer*).
-This chapter describes how the lexical analyzer breaks a file into tokens.
+This chapter describes how the lexical analyzer produces these tokens.
  
-Python reads program text as Unicode code points; the encoding of a source file
-can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120`
-for details.  If the source file cannot be decoded, a :exc:`SyntaxError` is
-raised.
+The lexical analyzer determines the program text's :ref:`encoding <encodings>`
+(UTF-8 by default), and decodes the text into
+:ref:`source characters <lexical-source-character>`.
+If the text cannot be decoded, a :exc:`SyntaxError` is raised.
+
+Next, the lexical analyzer uses the source characters to generate a stream of tokens.
+The type of a generated token generally depends on the next source character to
+be processed. Similarly, other special behavior of the analyzer depends on
+the first source character that hasn't yet been processed.
+The following table gives a quick summary of these source characters,
+with links to sections that contain more information.
+
+.. list-table::
+   :header-rows: 1
+
+   * - Character
+     - Next token (or other relevant documentation)
+
+   * - * space
+       * tab
+       * formfeed
+     - * :ref:`Whitespace <whitespace>`
+
+   * - * CR, LF
+     - * :ref:`New line <line-structure>`
+       * :ref:`Indentation <indentation>`
+
+   * - * backslash (``\``)
+     - * :ref:`Explicit line joining <explicit-joining>`
+       * (Also significant in :ref:`string escape sequences <escape-sequences>`)
+
+   * - * hash (``#``)
+     - * :ref:`Comment <comments>`
+
+   * - * quote (``'``, ``"``)
+     - * :ref:`String literal <strings>`
+
+   * - * ASCII letter (``a``-``z``, ``A``-``Z``)
+       * non-ASCII character
+     - * :ref:`Name <identifiers>`
+       * Prefixed :ref:`string or bytes literal <strings>`
+
+   * - * underscore (``_``)
+     - * :ref:`Name <identifiers>`
+       * (Can also be part of :ref:`numeric literals <numbers>`)
+
+   * - * number (``0``-``9``)
+     - * :ref:`Numeric literal <numbers>`
+
+   * - * dot (``.``)
+     - * :ref:`Numeric literal <numbers>`
+       * :ref:`Operator <operators>`
+
+   * - * question mark (``?``)
+       * dollar (``$``)
+       *
+         .. (the following uses zero-width space characters to render
+         .. a literal backquote)
+
+         backquote (`````)
+       * control character
+     - * Error (outside string literals and comments)
+
+   * - * other printing character
+     - * :ref:`Operator or delimiter <operators>`
+
+   * - * end of file
+     - * :ref:`End marker <endmarker-token>`
  
  
  .. _line-structure:
@@ -120,6 +184,8 @@ If an encoding is declared, the encoding name must be recognized by Python
  encoding is used for all lexical analysis, including string literals, comments
  and identifiers.
  
+.. _lexical-source-character:
+
  All lexical analysis, including string literals, comments
  and identifiers, works on Unicode text decoded using the source encoding.
  Any Unicode code point, except the NUL control character, can appear in
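
As a rough illustration of the summary table added in the first hunk, the following sketch (not part of this commit) uses the standard library's ``tokenize`` module to show how the first source character of a line relates to the type of the first token. The helper name ``first_token_type`` and the sample inputs are made up for this example. Note that ``tokenize`` is a Python-level tokenizer whose output is close to, but not identical with, what the internal lexical analyzer produces: for instance, it reports comments as ``COMMENT`` tokens and uses the generic ``OP`` type for all operators and delimiters.

```python
import io
import tokenize


def first_token_type(source):
    """Return the name of the first token type that tokenize reports.

    Hypothetical helper for illustration only.
    """
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return tokenize.tok_name[next(tokens).type]


# Sample inputs keyed by the kind of first character they start with.
samples = {
    "hash": "# note\n",
    "quote": "'text'\n",
    "ASCII letter": "name\n",
    "underscore": "_private\n",
    "digit": "42\n",
    "dot before a digit": ".5\n",
    "other printing character": "+ 1\n",
}

for label, source in samples.items():
    print(f"{label:26} {source.strip()!r:12} {first_token_type(source)}")
```

The whitespace, backslash, and error rows are left out on purpose: how ``tokenize`` reports them (for example, as ``INDENT`` tokens or as exceptions) depends on context and on the Python version.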