A Python program is read by a *parser*. Input to the parser is a stream of
:term:`tokens <token>`, generated by the *lexical analyzer* (also known as
the *tokenizer*).
-This chapter describes how the lexical analyzer breaks a file into tokens.
+This chapter describes how the lexical analyzer produces these tokens.
-Python reads program text as Unicode code points; the encoding of a source file
-can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120`
-for details. If the source file cannot be decoded, a :exc:`SyntaxError` is
-raised.
+The lexical analyzer determines the program text's :ref:`encoding <encodings>`
+(UTF-8 by default), and decodes the text into
+:ref:`source characters <lexical-source-character>`.
+If the text cannot be decoded, a :exc:`SyntaxError` is raised.
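
As a rough illustration of the encoding step, the sketch below uses the standard library's :func:`tokenize.detect_encoding`. That function belongs to the pure-Python ``tokenize`` module, not to the interpreter's internal lexical analyzer, so treat it as an approximation of the behavior described above. The helper name ``source_encoding`` is hypothetical.

```python
import io
import tokenize


def source_encoding(source_bytes: bytes) -> str:
    """Return the encoding name that tokenize detects for the given source bytes."""
    # detect_encoding() reads at most two lines, looking for a BOM or an
    # encoding declaration, and falls back to UTF-8 when it finds neither.
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# No declaration: the default applies.
print(source_encoding(b"x = 1\n"))

# An explicit declaration on the first line; tokenize reports a normalized name.
print(source_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n"))
```

Note that ``detect_encoding`` normalizes some names, so the declared ``latin-1`` is reported as ``iso-8859-1``.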
+
+Next, the lexical analyzer uses the source characters to generate a stream of tokens.
+The type of each generated token generally depends on the first source
+character that has not yet been processed. Other special behavior of the
+analyzer, such as skipping whitespace or comments, depends on that character too.
+The following table summarizes these source characters,
+with links to the sections that describe them in more detail.
+
+.. list-table::
+   :header-rows: 1
+
+   * - Character
+     - Next token (or other relevant documentation)
+
+   * - * space
+       * tab
+       * formfeed
+     - * :ref:`Whitespace <whitespace>`
+
+   * - * CR, LF
+     - * :ref:`New line <line-structure>`
+       * :ref:`Indentation <indentation>`
+
+   * - * backslash (``\``)
+     - * :ref:`Explicit line joining <explicit-joining>`
+       * (Also significant in :ref:`string escape sequences <escape-sequences>`)
+
+   * - * hash (``#``)
+     - * :ref:`Comment <comments>`
+
+   * - * quote (``'``, ``"``)
+     - * :ref:`String literal <strings>`
+
+   * - * ASCII letter (``a``-``z``, ``A``-``Z``)
+       * non-ASCII character
+     - * :ref:`Name <identifiers>`
+       * Prefixed :ref:`string or bytes literal <strings>`
+
+   * - * underscore (``_``)
+     - * :ref:`Name <identifiers>`
+       * (Can also be part of :ref:`numeric literals <numbers>`)
+
+   * - * digit (``0``-``9``)
+     - * :ref:`Numeric literal <numbers>`
+
+   * - * dot (``.``)
+     - * :ref:`Numeric literal <numbers>`
+       * :ref:`Operator <operators>`
+
+   * - * question mark (``?``)
+       * dollar (``$``)
+       *
+         .. (the following uses zero-width space characters to render
+         .. a literal backquote)
+
+         backquote (``​`​``)
+       * control character
+     - * Error (outside string literals and comments)
+
+   * - * other printing character
+     - * :ref:`Operator or delimiter <operators>`
+
+   * - * end of file
+     - * :ref:`End marker <endmarker-token>`
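
The relationship between the first unprocessed character and the resulting token can be observed with the standard library's ``tokenize`` module. This is only a sketch: ``tokenize`` is a Python-level tokenizer whose output differs in places from the interpreter's internal lexical analyzer (for example, it reports comments as ``COMMENT`` tokens instead of discarding them). The helper name ``first_token`` is hypothetical.

```python
import io
import token
import tokenize


def first_token(text: str) -> tuple[str, str]:
    """Return (type name, string) of the first token that tokenize reports."""
    readline = io.StringIO(text).readline
    tok = next(tokenize.generate_tokens(readline))
    return token.tok_name[tok.type], tok.string


# Each sample starts with a different kind of character from the table:
# letter, underscore, digit, dot, quote, hash, and an operator character.
samples = ["spam = 1\n", "_x\n", "42\n", ".5\n", "'text'\n", "# note\n", "+ 1\n"]
for sample in samples:
    print(repr(sample), first_token(sample))
```

A leading dot is the ambiguous case: ``.5`` starts a numeric literal, while a dot followed by a name is an operator, so the analyzer has to look past the first character there.
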
.. _line-structure:
encoding is used for all lexical analysis, including string literals, comments
and identifiers.
+.. _lexical-source-character:
+
All lexical analysis, including string literals, comments
and identifiers, works on Unicode text decoded using the source encoding.
Any Unicode code point, except the NUL control character, can appear in