..
  Copyright 1988-2022 Free Software Foundation, Inc.
  This is part of the GCC manual.
  For copying conditions, see the copyright.rst file.

Lexing a token
**************

Lexing of an individual token is handled by ``_cpp_lex_direct`` and
its subroutines. In its current form the code is quite complicated,
with read-ahead characters and the like, since it strives not to step
back in the character stream in preparation for handling non-ASCII
file encodings. The current plan is to convert any such files to
UTF-8 before processing them. This complexity is therefore
unnecessary and will be removed, so I'll not discuss it further here.

The job of ``_cpp_lex_direct`` is simply to lex a token. It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping. It necessarily has a minor role to play in memory
management of lexed lines. I discuss these issues in a separate
section (see :ref:`lexing-a-line`).

The lexer places the token it lexes into storage pointed to by the
variable ``cur_token``, and then increments it. This variable is
important for correct diagnostic positioning. Unless a specific line
and column are passed to the diagnostic routines, they will examine
the ``line`` and ``col`` values of the token just before the location
that ``cur_token`` points to, and use that location to report the
diagnostic.

The lexer does not consider whitespace to be a token in its own right.
If whitespace (other than a new line) precedes a token, it sets the
``PREV_WHITE`` bit in the token's flags. Each token has its ``line``
and ``col`` variables set to the line and column of the first
character of the token. This line number is the line number in the
translation unit, and can be converted to a source (file, line) pair
using the line map code.

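As an illustration only (these structures and names are inventions
for this sketch, not libcpp's real line maps), the conversion from a
translation-unit line to a (file, line) pair can be pictured as a
lookup in a sorted array of entries, each recording where a run of
lines from one file begins:

.. code-block:: c++

  #include <assert.h>
  #include <string.h>

  /* Toy line map entry: the translation-unit line at which a run of
     lines from one file begins, the file's name, and the source line
     the run starts at.  Illustration only, not libcpp's real code.  */
  struct line_map
  {
    unsigned tu_start;    /* first TU line covered by this entry */
    const char *file;     /* file those lines came from */
    unsigned file_start;  /* source line of the first covered line */
  };

  /* Find the entry covering TU_LINE: the last one starting at or
     before it.  MAPS must be sorted by tu_start and nonempty.  */
  static const struct line_map *
  lookup_line (const struct line_map *maps, int n, unsigned tu_line)
  {
    int i = n - 1;
    while (i > 0 && maps[i].tu_start > tu_line)
      i--;
    return &maps[i];
  }

  /* Convert a TU line to a line within the entry's file.  */
  static unsigned
  source_line (const struct line_map *map, unsigned tu_line)
  {
    return map->file_start + (tu_line - map->tu_start);
  }

With one entry for the main file, one for an included header, and one
for the main file again after the ``#include``, a lookup of any TU
line yields the file and line to print in a diagnostic.
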
The first token on a logical, i.e. unescaped, line has the flag
``BOL`` set for beginning-of-line. This flag is intended for internal
use, both to distinguish a :samp:`#` that begins a directive from one
that doesn't, and to generate a call-back to clients that want to be
notified about the start of every non-directive line with tokens on
it. Clients cannot reliably determine this for themselves: the first
token might be a macro, and the tokens of a macro expansion do not
have the ``BOL`` flag set. The macro expansion may even be empty, and
the next token on the line certainly won't have the ``BOL`` flag set.

New lines are treated specially; exactly how the lexer handles them is
context-dependent. The C standard mandates that directives are
terminated by the first unescaped newline character, even if it
appears in the middle of a macro expansion. Therefore, if the state
variable ``in_directive`` is set, the lexer returns a ``CPP_EOF``
token, which is normally used to indicate end-of-file, to indicate
end-of-directive. In a directive a ``CPP_EOF`` token never means
end-of-file. Conveniently, if the caller was ``collect_args``, it
already handles ``CPP_EOF`` as if it were end-of-file, and reports an
error about an unterminated macro argument list.

The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace. This whitespace is
important in case the macro argument is stringized. The state variable
``parsing_args`` is nonzero when the preprocessor is collecting the
arguments to a macro call. It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes. One such time is here: the lexer sets the
``PREV_WHITE`` flag of a token if it meets a new line when
``parsing_args`` is set to 2. It doesn't set it if it meets a new
line when ``parsing_args`` is 1, since then code like

.. code-block:: c++

  #define foo() bar
  foo
  baz

would be output with an erroneous space before :samp:`baz`:

.. code-block:: c++

  foo
   baz

This is a good example of the subtlety of getting token spacing
correct in the preprocessor; there are plenty of tests in the
testsuite for corner cases like this.

The lexer is written to treat each of :samp:`\\r`, :samp:`\\n`,
:samp:`\\r\\n` and :samp:`\\n\\r` as a single new line indicator. This
allows it to transparently preprocess MS-DOS, Macintosh and Unix files
without their needing to pass through a special filter beforehand.

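A minimal sketch of this normalization (simplified from what libcpp
actually does in ``handle_newline``) might look like:

.. code-block:: c++

  #include <assert.h>

  /* Consume one logical newline at P, treating '\r', '\n', "\r\n" and
     "\n\r" each as a single indicator.  Returns a pointer to the first
     character after it.  A simplified sketch, not libcpp's code.  */
  static const char *
  handle_newline (const char *p)
  {
    char first = *p++;
    /* A second, different newline character pairs with the first.  */
    if ((*p == '\r' || *p == '\n') && *p != first)
      p++;
    return p;
  }

Because the second character merges only when it differs from the
first, a bare ``"\n\n"`` still counts as two newlines, as it must.
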
We also decided to treat a backslash, either ``\`` or the trigraph
``??/``, separated from one of the above newline indicators by
non-comment whitespace only, as intending to escape the newline. It
tends to be a typing mistake, and cannot reasonably be mistaken for
anything else in any of the C-family grammars. Since handling it this
way is not strictly conforming to the ISO standard, the library issues
a warning wherever it encounters it.

Handling newlines like this is made simpler by doing it in one place
only. The function ``handle_newline`` takes care of all newline
characters, and ``skip_escaped_newlines`` takes care of arbitrarily
long sequences of escaped newlines, deferring to ``handle_newline``
to handle the newlines themselves.

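In the same simplified spirit, skipping a run of escaped newlines
reduces to a loop over backslash-newline pairs. This sketch ignores
the trigraph ``??/`` spelling of the backslash and intervening
whitespace, both of which the real code must also consider:

.. code-block:: c++

  #include <assert.h>

  /* Skip an arbitrarily long run of backslash-escaped newlines
     starting at P, returning a pointer to the first character after
     them.  Simplified sketch only: the trigraph ??/ form of the
     backslash is not handled here.  */
  static const char *
  skip_escaped_newlines (const char *p)
  {
    while (p[0] == '\\' && (p[1] == '\n' || p[1] == '\r'))
      {
        char first = p[1];
        p += 2;                 /* past the backslash and first char */
        if ((*p == '\n' || *p == '\r') && *p != first)
          p++;                  /* second half of a \r\n-style pair */
      }
    return p;
  }
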
The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines. Trigraphs are processed
before any interpretation of the meaning of a character is made, and
unfortunately there is a trigraph representation for a backslash, so
it is possible for the trigraph ``??/`` to introduce an escaped
newline.

Escaped newlines are tedious because theoretically they can occur
anywhere---between the :samp:`+` and :samp:`=` of the :samp:`+=`
token, within the characters of an identifier, and even between the
:samp:`*` and :samp:`/` that terminate a comment. Moreover, you
cannot be sure there is just one---there might be an arbitrarily long
sequence of them.

So, for example, the routine that lexes a number, ``parse_number``,
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the :samp:`\\`
introducing an escaped newline, or the :samp:`?` introducing the
trigraph sequence that represents the :samp:`\\` of an escaped
newline. If it encounters a :samp:`?` or :samp:`\\`, it calls
``skip_escaped_newlines`` to skip over any potential escaped newlines
before checking whether the number has been finished.

Similarly, code in the main body of ``_cpp_lex_direct`` cannot simply
check for a :samp:`=` after a :samp:`+` character to determine whether
it has a :samp:`+=` token; it needs to be prepared for an escaped
newline of some sort. Such cases use the function
``get_effective_char``, which returns the first character after any
intervening escaped newlines.

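What ``get_effective_char`` has to do can be sketched as follows
(again a simplification that ignores the trigraph form of the
backslash):

.. code-block:: c++

  #include <assert.h>

  /* Return the first character at or after *PP that is not part of an
     escaped newline, advancing *PP to it.  Simplified sketch: the
     trigraph ??/ spelling of the backslash is not handled.  */
  static char
  get_effective_char (const char **pp)
  {
    const char *p = *pp;
    while (p[0] == '\\' && (p[1] == '\n' || p[1] == '\r'))
      {
        char first = p[1];
        p += 2;
        if ((*p == '\n' || *p == '\r') && *p != first)
          p++;
      }
    *pp = p;
    return *p;
  }

Having lexed a :samp:`+`, the lexer can use this to see whether an
:samp:`=` follows, however many escaped newlines intervene.
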
The lexer needs to keep track of the correct column position,
including counting tabs as specified by the :option:`-ftabstop=`
option. This should be done even within C-style comments; they can
appear in the middle of a line, and we want to report diagnostics in
the correct position for text appearing after the end of the comment.

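The tab-counting rule can be sketched as a single helper. The formula
below assumes 1-based columns and advances a tab to one past the next
multiple of the tab stop, which is the quantity :option:`-ftabstop=`
controls (default 8); it is an illustration, not libcpp's code:

.. code-block:: c++

  #include <assert.h>

  /* Advance a 1-based column position over character C, counting a
     tab as reaching one past the next multiple of TABSTOP.  */
  static unsigned
  advance_column (unsigned col, char c, unsigned tabstop)
  {
    if (c == '\t')
      return ((col - 1) / tabstop + 1) * tabstop + 1;
    return col + 1;
  }
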
.. _invalid-identifiers:

Some identifiers, such as ``__VA_ARGS__`` and poisoned identifiers,
may be invalid and require a diagnostic. However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
``parse_identifier``. In both cases, whether a diagnostic is needed
or not is dependent upon the lexer's state. For example, we don't
want to issue a diagnostic for re-poisoning a poisoned identifier, or
for using ``__VA_ARGS__`` in the expansion of a variable-argument
macro. Therefore ``parse_identifier`` makes use of state flags to
determine whether a diagnostic is appropriate. Since we change state
on a per-token basis, and don't lex whole lines at a time, this is
not a problem.

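The shape of such a state-dependent check might look like this; the
flag names and the function are inventions for illustration, not
libcpp's actual interface:

.. code-block:: c++

  #include <assert.h>
  #include <string.h>

  /* Hypothetical state bits; the real lexer keeps comparable
     information in its state structure.  */
  #define STATE_VA_ARGS_OK 1  /* expanding a variadic macro */
  #define STATE_POISONING  2  /* inside #pragma GCC poison itself */

  /* Decide whether lexing identifier NAME warrants a diagnostic,
     given whether it is poisoned and the current state.  */
  static int
  identifier_needs_diagnostic (const char *name, int poisoned,
                               unsigned state)
  {
    if (strcmp (name, "__VA_ARGS__") == 0)
      return !(state & STATE_VA_ARGS_OK);
    if (poisoned)
      /* Re-poisoning a poisoned identifier is not an error.  */
      return !(state & STATE_POISONING);
    return 0;
  }
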
Another place where state flags are used to change behavior is whilst
lexing header names. Normally, a :samp:`<` would be lexed as a single
token. After a ``#include`` directive, though, it should be lexed as
a single token as far as the nearest :samp:`>` character. Note that
we don't allow the terminators of header names to be escaped; the
first :samp:`"` or :samp:`>` terminates the header name.

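A sketch of the angle-bracket case: once the lexer knows it has just
seen ``#include``, it scans to the nearest :samp:`>`, recognizing no
escapes along the way. The function below is an illustration, not
libcpp's code:

.. code-block:: c++

  #include <assert.h>
  #include <string.h>

  /* Lex a <...> header name starting at P, copying its contents into
     OUT (assumed large enough).  Returns the name's length, or -1 if
     no terminating '>' appears before the end of the line.  */
  static int
  lex_header_name (const char *p, char *out)
  {
    const char *start;
    if (*p != '<')
      return -1;
    start = ++p;
    while (*p != '\0' && *p != '>' && *p != '\n')
      p++;
    if (*p != '>')
      return -1;                /* unterminated header name */
    memcpy (out, start, p - start);
    out[p - start] = '\0';
    return (int) (p - start);
  }
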
Interpretation of some character sequences depends upon whether we are
lexing C, C++ or Objective-C, and on the revision of the standard in
force. For example, :samp:`::` is a single token in C++, but in C it
is two separate :samp:`:` tokens and almost certainly a syntax error.
Such cases are handled by ``_cpp_lex_direct`` based upon command-line
flags stored in the ``cpp_options`` structure.

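For the :samp:`::` case, the decision reduces to a check like the
following, where the ``cplusplus`` flag is a stand-in for the
corresponding field of ``cpp_options``:

.. code-block:: c++

  #include <assert.h>

  /* Given P pointing at a ':', return how many characters form one
     token: two for a C++ '::' scope token, otherwise one.  The flag
     stands in for the relevant cpp_options field.  */
  static int
  colon_token_length (const char *p, int cplusplus)
  {
    if (cplusplus && p[1] == ':')
      return 2;
    return 1;
  }
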
Once a token has been lexed, it leads an independent existence. The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with ``_cpp_pop_buffer``.
The storage holding the spellings of such tokens remains until the
client program calls ``cpp_destroy``, probably at the end of the
translation unit.