..
  Copyright 1988-2022 Free Software Foundation, Inc.
  This is part of the GCC manual.
  For copying conditions, see the copyright.rst file.

Lexing a token
**************

Lexing of an individual token is handled by ``_cpp_lex_direct`` and
its subroutines. In its current form the code is quite complicated,
with read ahead characters and such-like, since it strives to not step
back in the character stream in preparation for handling non-ASCII file
encodings. The current plan is to convert any such files to UTF-8
before processing them. This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

The job of ``_cpp_lex_direct`` is simply to lex a token. It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping. It necessarily has a minor rôle to play in memory
management of lexed lines. I discuss these issues in a separate section
(see :ref:`lexing-a-line`).

The lexer places the token it lexes into storage pointed to by the
variable ``cur_token``, and then increments it. This variable is
important for correct diagnostic positioning. Unless a specific line
and column are passed to the diagnostic routines, they will examine the
``line`` and ``col`` values of the token just before the location
that ``cur_token`` points to, and use that location to report the
diagnostic.

The lexer does not consider whitespace to be a token in its own right.
If whitespace (other than a new line) precedes a token, it sets the
``PREV_WHITE`` bit in the token's flags. Each token has its
``line`` and ``col`` variables set to the line and column of the
first character of the token. This line number is the line number in
the translation unit, and can be converted to a source (file, line) pair
using the line map code.
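
The bookkeeping described above can be illustrated with a minimal sketch.
It is not cpplib code: the ``struct token`` and ``struct lexer`` layouts,
the value of ``PREV_WHITE`` and the function ``lex_word`` are all
assumptions made for the example, and a whitespace-delimited word stands in
for a real token. Tabs count as one column here for simplicity.

.. code-block:: c

  #include <ctype.h>
  #include <stddef.h>

  /* Hypothetical flag bit and token layout, for illustration only.  */
  #define PREV_WHITE 0x01

  struct token {
      unsigned int line;    /* line of the token's first character */
      unsigned int col;     /* 1-based column of the first character */
      unsigned char flags;  /* PREV_WHITE if whitespace preceded it */
      const char *start;    /* spelling, pointing into the input */
      size_t len;
  };

  struct lexer {
      const char *cur;      /* next unread character */
      unsigned int line;
      unsigned int col;
  };

  /* Lex one whitespace-delimited word.  Returns 0 at end of input.  */
  int lex_word(struct lexer *lx, struct token *tok)
  {
      tok->flags = 0;
      for (;;) {
          char c = *lx->cur;
          if (c == '\n') {
              /* A new line is not recorded as preceding whitespace.  */
              tok->flags = 0;
              lx->cur++;
              lx->line++;
              lx->col = 1;
          } else if (c != '\0' && isspace((unsigned char)c)) {
              tok->flags |= PREV_WHITE;
              lx->cur++;
              lx->col++;
          } else {
              break;
          }
      }
      if (*lx->cur == '\0')
          return 0;

      /* Position is that of the first character of the token.  */
      tok->line = lx->line;
      tok->col = lx->col;
      tok->start = lx->cur;
      while (*lx->cur != '\0' && !isspace((unsigned char)*lx->cur)) {
          lx->cur++;
          lx->col++;
      }
      tok->len = (size_t)(lx->cur - tok->start);
      return 1;
  }

No whitespace token is ever produced; the only trace whitespace leaves is
the flag on the token that follows it.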

The first token on a logical, i.e. unescaped, line has the flag
``BOL`` set for beginning-of-line. This flag is intended for
internal use, both to distinguish a :samp:`#` that begins a directive
from one that doesn't, and to generate a call-back to clients that want
to be notified about the start of every non-directive line with tokens
on it. Clients cannot reliably determine this for themselves: the first
token might be a macro, and the tokens of a macro expansion do not have
the ``BOL`` flag set. The macro expansion may even be empty, and the
next token on the line certainly won't have the ``BOL`` flag set.

New lines are treated specially; exactly how the lexer handles them is
context-dependent. The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion. Therefore, if the state variable
``in_directive`` is set, the lexer returns a ``CPP_EOF`` token,
which is normally used to indicate end-of-file, to indicate
end-of-directive. In a directive a ``CPP_EOF`` token never means
end-of-file. Conveniently, if the caller was ``collect_args``, it
already handles ``CPP_EOF`` as if it were end-of-file, and reports an
error about an unterminated macro argument list.
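
The double duty of the end-of-file token can be sketched as follows. This
is an illustration under assumptions, not the real lexer: ``TOK_EOF``
stands in for ``CPP_EOF``, and ``struct lex_state`` and ``next_token`` are
hypothetical. The sketch leaves clearing ``in_directive`` to the caller.

.. code-block:: c

  /* TOK_EOF plays the part of CPP_EOF in this sketch.  */
  enum tok_type { TOK_WORD, TOK_EOF };

  struct lex_state {
      const char *cur;   /* next unread character */
      int in_directive;  /* nonzero while lexing a directive's line */
  };

  enum tok_type next_token(struct lex_state *s)
  {
      for (;;) {
          char c = *s->cur;
          if (c == '\0')
              return TOK_EOF;       /* genuine end of input */
          if (c == '\n') {
              s->cur++;
              if (s->in_directive)
                  return TOK_EOF;   /* end of directive, not of file */
              continue;             /* otherwise just move to next line */
          }
          if (c == ' ' || c == '\t') {
              s->cur++;
              continue;
          }
          break;
      }
      /* Consume a whitespace-delimited word as a stand-in token.  */
      while (*s->cur != '\0' && *s->cur != ' '
             && *s->cur != '\t' && *s->cur != '\n')
          s->cur++;
      return TOK_WORD;
  }

A caller that loops until it sees the end-of-file token therefore stops at
the end of the directive without needing a separate token type.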

The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace. This whitespace is
important in case the macro argument is stringized. The state variable
``parsing_args`` is nonzero when the preprocessor is collecting the
arguments to a macro call. It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes. One such time is here: the lexer sets the
``PREV_WHITE`` flag of a token if it meets a new line when
``parsing_args`` is set to 2. It doesn't set it if it meets a new
line when ``parsing_args`` is 1, since then code like

.. code-block:: c++

  #define foo() bar
  foo
  baz

would be output with an erroneous space before :samp:`baz`:

.. code-block:: c++

  foo
   baz

This is a good example of the subtlety of getting token spacing correct
in the preprocessor; there are plenty of tests in the testsuite for
corner cases like this.

The lexer is written to treat each of :samp:`\\r`, :samp:`\\n`, :samp:`\\r\\n`
and :samp:`\\n\\r` as a single new line indicator. This allows it to
transparently preprocess MS-DOS, Macintosh and Unix files without their
needing to pass through a special filter beforehand.
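
One way to recognize the four forms is sketched below. The functions
``newline_length`` and ``count_newlines`` are hypothetical helpers written
for this example, not part of cpplib. The sketch assumes NUL-terminated
input and treats a two-character pair as one new line only when the two
characters differ.

.. code-block:: c

  #include <stddef.h>

  /* If P points at a new line indicator, return its length in bytes
     (1 or 2); otherwise return 0.  */
  size_t newline_length(const char *p)
  {
      if (p[0] != '\n' && p[0] != '\r')
          return 0;
      /* "\r\n" and "\n\r" are one new line; "\n\n" and "\r\r" are two.  */
      if ((p[1] == '\n' || p[1] == '\r') && p[1] != p[0])
          return 2;
      return 1;
  }

  /* Count the new line indicators in TEXT.  */
  unsigned int count_newlines(const char *text)
  {
      unsigned int n = 0;
      while (*text != '\0') {
          size_t len = newline_length(text);
          if (len != 0) {
              n++;
              text += len;
          } else {
              text++;
          }
      }
      return n;
  }

Because the pairing is greedy, a sequence such as ``\n\r\n`` is read as
``\n\r`` followed by ``\n``; that is a property of this sketch.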

We also decided to treat a backslash, either ``\`` or the trigraph
``??/``, separated from one of the above newline indicators by
non-comment whitespace only, as intending to escape the newline. Such
trailing whitespace tends to be a typing mistake, and cannot reasonably
be mistaken for anything else in any of the C-family grammars. Since
handling it this way is not strictly conforming to the ISO standard, the
library issues a warning wherever it encounters it.

Handling newlines like this is made simpler by doing it in one place
only. The function ``handle_newline`` takes care of all newline
characters, and ``skip_escaped_newlines`` takes care of arbitrarily
long sequences of escaped newlines, deferring to ``handle_newline``
to handle the newlines themselves.

The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines. Trigraphs are processed before
any interpretation of the meaning of a character is made, and unfortunately
there is a trigraph representation for a backslash, so it is possible for
the trigraph ``??/`` to introduce an escaped newline.

Escaped newlines are tedious because theoretically they can occur
anywhere: between the :samp:`+` and :samp:`=` of the :samp:`+=` token,
within the characters of an identifier, and even between the :samp:`*`
and :samp:`/` that terminates a comment. Moreover, you cannot be sure
there is just one; there might be an arbitrarily long sequence of them.

So, for example, the routine that lexes a number, ``parse_number``,
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the :samp:`\\`
introducing an escaped newline, or the :samp:`?` introducing the trigraph
sequence that represents the :samp:`\\` of an escaped newline. If it
encounters a :samp:`?` or :samp:`\\`, it calls ``skip_escaped_newlines``
to skip over any potential escaped newlines before checking whether the
number has been finished.

Similarly code in the main body of ``_cpp_lex_direct`` cannot simply
check for a :samp:`=` after a :samp:`+` character to determine whether it
has a :samp:`+=` token; it needs to be prepared for an escaped newline of
some sort. Such cases use the function ``get_effective_char``, which
returns the first character after any intervening escaped newlines.
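
The idea can be sketched on a NUL-terminated string. The functions below
are hypothetical stand-ins with invented names and signatures
(``skip_escaped_newlines_sketch``, ``get_effective_char_sketch``,
``lex_plus``); the real routines work on the reader's buffer and also
issue diagnostics. The sketch accepts a backslash or its trigraph,
optional spaces and tabs, then a new line indicator.

.. code-block:: c

  #include <stddef.h>

  /* If P points at one escaped newline, return the position just past
     it; otherwise return P unchanged.  */
  static const char *skip_one_escaped_newline(const char *p)
  {
      const char *q = p;

      if (q[0] == '\\')
          q += 1;
      else if (q[0] == '?' && q[1] == '?' && q[2] == '/')
          q += 3;                    /* trigraph spelling of backslash */
      else
          return p;

      while (*q == ' ' || *q == '\t')
          q++;                       /* stray whitespace before newline */

      if (*q == '\n') {
          q++;
          if (*q == '\r')
              q++;
      } else if (*q == '\r') {
          q++;
          if (*q == '\n')
              q++;
      } else {
          return p;                  /* not an escaped newline after all */
      }
      return q;
  }

  /* Skip an arbitrarily long run of escaped newlines.  */
  const char *skip_escaped_newlines_sketch(const char *p)
  {
      for (;;) {
          const char *q = skip_one_escaped_newline(p);
          if (q == p)
              return p;
          p = q;
      }
  }

  /* Return the first character after any escaped newlines and advance
     *PP past it (but never past the terminating NUL).  */
  int get_effective_char_sketch(const char **pp)
  {
      const char *p = skip_escaped_newlines_sketch(*pp);
      int c = (unsigned char)*p;
      if (c != '\0')
          p++;
      *pp = p;
      return c;
  }

  enum tok_type { TOK_PLUS, TOK_PLUS_EQ, TOK_OTHER };

  /* Lex a token starting with '+', advancing *PP past it.  */
  enum tok_type lex_plus(const char **pp)
  {
      const char *p = *pp;
      const char *after;

      if (*p != '+')
          return TOK_OTHER;
      p++;
      after = p;
      if (get_effective_char_sketch(&after) == '=') {
          *pp = after;
          return TOK_PLUS_EQ;
      }
      *pp = p;                       /* leave the lookahead unread */
      return TOK_PLUS;
  }

With this, the input ``+``, backslash, new line, ``=`` is classified as a
single :samp:`+=` token, just as plain ``+=`` is.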

The lexer needs to keep track of the correct column position, including
counting tabs as specified by the :option:`-ftabstop=` option. This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.
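
A minimal sketch of tab-aware column counting follows. It assumes 1-based
columns with tab stops at columns 1, ``tabstop`` + 1, 2 * ``tabstop`` + 1
and so on, and a ``tabstop`` of at least 1; ``advance_column`` and
``column_of`` are hypothetical helpers, not the arithmetic cpplib itself
uses.

.. code-block:: c

  #include <stddef.h>

  /* Return the column after consuming character C at 1-based column
     COL.  A tab moves to the next tab stop.  */
  unsigned int advance_column(unsigned int col, char c, unsigned int tabstop)
  {
      if (c == '\t')
          return col + tabstop - (col - 1) % tabstop;
      return col + 1;
  }

  /* Return the column of the character at byte OFFSET in LINE.  Every
     character before it is counted, including those inside comments.  */
  unsigned int column_of(const char *line, size_t offset, unsigned int tabstop)
  {
      unsigned int col = 1;
      size_t i;

      for (i = 0; i < offset && line[i] != '\0'; i++)
          col = advance_column(col, line[i], tabstop);
      return col;
  }

For instance, with a tab stop of 8 a token that follows a single leading
tab is reported at column 9.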

.. _invalid-identifiers:

Some identifiers, such as ``__VA_ARGS__`` and poisoned identifiers,
may be invalid and require a diagnostic. However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
``parse_identifier``. In both cases, whether a diagnostic is needed
or not is dependent upon the lexer's state. For example, we don't want
to issue a diagnostic for re-poisoning a poisoned identifier, or for
using ``__VA_ARGS__`` in the expansion of a variable-argument macro.
Therefore ``parse_identifier`` makes use of state flags to determine
whether a diagnostic is appropriate. Since we change state on a
per-token basis, and don't lex whole lines at a time, this is not a
problem.

Another place where state flags are used to change behavior is whilst
lexing header names. Normally, a :samp:`<` is lexed as a token on its
own. After a ``#include`` directive, though, everything from the
:samp:`<` as far as the nearest :samp:`>` character should be lexed as
a single token. Note that we don't allow the terminators of header
names to be escaped; the first :samp:`"` or :samp:`>` terminates the
header name.
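
The effect of such a flag can be sketched as below. The function
``lex_angle``, its ``angled_headers`` parameter and the token type names
are assumptions made for the example. The sketch also assumes that a header
name does not extend past the end of the line.

.. code-block:: c

  #include <stddef.h>
  #include <string.h>

  enum tok_type { TOK_LESS, TOK_HEADER_NAME, TOK_OTHER };

  /* Classify the token starting at P and store its length in *LEN.
     ANGLED_HEADERS is a hypothetical state flag, set by the caller
     while lexing the operand of #include.  */
  enum tok_type lex_angle(const char *p, int angled_headers, size_t *len)
  {
      if (*p != '<') {
          *len = 0;
          return TOK_OTHER;
      }
      if (angled_headers) {
          /* Find the nearest '>' on this line.  A backslash before it
             does not escape it.  */
          size_t n = strcspn(p + 1, ">\n");
          if (p[1 + n] == '>') {
              *len = n + 2;          /* include both '<' and '>' */
              return TOK_HEADER_NAME;
          }
      }
      *len = 1;
      return TOK_LESS;
  }

The same input characters thus yield different tokens depending on the
state the caller is in, which is workable because state changes on a
per-token basis.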

Interpretation of some character sequences depends upon whether we are
lexing C, C++ or Objective-C, and on the revision of the standard in
force. For example, :samp:`::` is a single token in C++, but in C it is
two separate :samp:`:` tokens and almost certainly a syntax error. Such
cases are handled by ``_cpp_lex_direct`` based upon command-line
flags stored in the ``cpp_options`` structure.
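
As a sketch of the :samp:`::` case only: ``struct lang_options`` below is
a hypothetical stand-in for the relevant part of ``cpp_options``, and
``lex_colon`` and the token type names are invented for the example.
Escaped newlines between the two colons are ignored here for brevity.

.. code-block:: c

  #include <stddef.h>

  /* Hypothetical stand-in for a language flag kept in the options.  */
  struct lang_options {
      int cplusplus;   /* nonzero when lexing C++ */
  };

  enum tok_type { TOK_COLON, TOK_SCOPE };

  /* Lex a token starting at P, which must point at ':'.  The token's
     length is stored in *LEN.  */
  enum tok_type lex_colon(const char *p, const struct lang_options *opts,
                          size_t *len)
  {
      if (p[1] == ':' && opts->cplusplus) {
          *len = 2;
          return TOK_SCOPE;    /* "::" is one token in C++ */
      }
      *len = 1;
      return TOK_COLON;        /* in C, each ':' is its own token */
  }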

Once a token has been lexed, it leads an independent existence. The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with ``_cpp_pop_buffer``.
The storage holding the spellings of such tokens remains until the
client program calls ``cpp_destroy``, probably at the end of the
translation unit.