..
  Copyright 1988-2022 Free Software Foundation, Inc.
  This is part of the GCC manual.
  For copying conditions, see the copyright.rst file.

Lexing a token
**************

Lexing of an individual token is handled by ``_cpp_lex_direct`` and
its subroutines.  In its current form the code is quite complicated,
with read-ahead characters and suchlike, since it strives not to step
back in the character stream in preparation for handling non-ASCII file
encodings.  The current plan is to convert any such files to UTF-8
before processing them.  This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

The job of ``_cpp_lex_direct`` is simply to lex a token.  It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping.  It necessarily has a minor rôle to play in memory
management of lexed lines.  I discuss these issues in a separate section
(see :ref:`lexing-a-line`).

The lexer places the token it lexes into storage pointed to by the
variable ``cur_token``, and then increments it.  This variable is
important for correct diagnostic positioning.  Unless a specific line
and column are passed to the diagnostic routines, they will examine the
``line`` and ``col`` values of the token just before the location
that ``cur_token`` points to, and use that location to report the
diagnostic.

The lexer does not consider whitespace to be a token in its own right.
If whitespace (other than a new line) precedes a token, it sets the
``PREV_WHITE`` bit in the token's flags.  Each token has its
``line`` and ``col`` variables set to the line and column of the
first character of the token.  This line number is the line number in
the translation unit, and can be converted to a source (file, line) pair
using the line map code.
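
As a rough illustration of the paragraph above, here is a minimal sketch
of a word lexer that never emits whitespace as a token and records it as
a flag instead.  Everything named ``sk_...`` or ``SK_...`` is hypothetical
and is not cpplib's real code; the sketch also assumes that a tab counts
as one column and that a new line clears any pending whitespace.

.. code-block:: c

  #include <stddef.h>

  /* Hypothetical flag and token type, loosely modelled on the text.  */
  #define SK_PREV_WHITE 0x1

  struct sk_token
  {
    const char *start;   /* First character of the token.  */
    size_t len;          /* Length of the spelling.  */
    unsigned int line;   /* Line of the first character (1-based).  */
    unsigned int col;    /* Column of the first character (1-based).  */
    unsigned int flags;  /* SK_PREV_WHITE if horizontal space preceded it.  */
  };

  /* Lex whitespace-separated words from SRC into OUT (at most MAX
     tokens) and return the number stored.  */
  static size_t
  sk_lex_words (const char *src, struct sk_token *out, size_t max)
  {
    size_t n = 0;
    unsigned int line = 1, col = 1;
    const char *p = src;

    while (*p != '\0' && n < max)
      {
        unsigned int flags = 0;

        /* Skip whitespace, remembering whether any of it was horizontal.  */
        while (*p == ' ' || *p == '\t' || *p == '\n')
          {
            if (*p == '\n')
              {
                line++;
                col = 1;
                flags = 0;   /* A new line alone does not set the flag.  */
              }
            else
              {
                col++;
                flags = SK_PREV_WHITE;
              }
            p++;
          }
        if (*p == '\0')
          break;

        out[n].start = p;
        out[n].line = line;
        out[n].col = col;
        out[n].flags = flags;
        while (*p != '\0' && *p != ' ' && *p != '\t' && *p != '\n')
          {
            p++;
            col++;
          }
        out[n].len = (size_t) (p - out[n].start);
        n++;
      }
    return n;
  }

A client that wants to reproduce the original spacing only has to test
the flag of each token; it never sees the whitespace itself.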

The first token on a logical, i.e. unescaped, line has the flag
``BOL`` set for beginning-of-line.  This flag is intended for
internal use, both to distinguish a :samp:`#` that begins a directive
from one that doesn't, and to generate a call-back to clients that want
to be notified about the start of every non-directive line with tokens
on it.  Clients cannot reliably determine this for themselves: the first
token might be a macro, and the tokens of a macro expansion do not have
the ``BOL`` flag set.  The macro expansion may even be empty, and the
next token on the line certainly won't have the ``BOL`` flag set.

New lines are treated specially; exactly how the lexer handles them is
context-dependent.  The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion.  Therefore, if the state variable
``in_directive`` is set when the lexer meets such a newline, it returns
a ``CPP_EOF`` token.  That token normally indicates end-of-file, but
here it indicates end-of-directive; in a directive a ``CPP_EOF`` token
never means end-of-file.  Conveniently, if the caller was
``collect_args``, it already handles ``CPP_EOF`` as if it were
end-of-file, and reports an error about an unterminated macro argument
list.

The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace.  This white space is
important in case the macro argument is stringized.  The state variable
``parsing_args`` is nonzero when the preprocessor is collecting the
arguments to a macro call.  It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes.  One such time is here: the lexer sets the
``PREV_WHITE`` flag of a token if it meets a new line when
``parsing_args`` is set to 2.  It doesn't set it if it meets a new
line when ``parsing_args`` is 1, since then code like

.. code-block:: c++

  #define foo() bar
  foo
  baz

would be output with an erroneous space before :samp:`baz`:

.. code-block:: c++

  foo
   baz

This is a good example of the subtlety of getting token spacing correct
in the preprocessor; there are plenty of tests in the testsuite for
corner cases like this.
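
The newline rules described so far can be summarized as a small decision
function.  This is only a sketch: ``sk_state``, ``sk_newline_action`` and
``sk_on_newline`` are hypothetical names, and the real lexer interleaves
this decision with the rest of its work rather than isolating it.

.. code-block:: c

  /* Hypothetical stand-ins for the lexer state described above.  */
  struct sk_state
  {
    int in_directive;   /* Nonzero while lexing a directive.  */
    int parsing_args;   /* 0, 1 or 2, as described in the text.  */
  };

  enum sk_newline_action
  {
    SK_END_DIRECTIVE,   /* Return the end-of-directive token.  */
    SK_MARK_WHITE,      /* Flag the next token as preceded by space.  */
    SK_PLAIN            /* Just move on to the next line.  */
  };

  /* Decide what an unescaped newline means in state S.  */
  static enum sk_newline_action
  sk_on_newline (const struct sk_state *s)
  {
    /* A directive ends at the first unescaped newline, even inside
       the arguments of a macro call.  */
    if (s->in_directive)
      return SK_END_DIRECTIVE;

    /* Inside an argument list the newline acts as whitespace.  */
    if (s->parsing_args == 2)
      return SK_MARK_WHITE;

    /* This covers parsing_args == 1: no space is recorded.  */
    return SK_PLAIN;
  }

Checking ``in_directive`` first mirrors the rule that a directive always
ends at the newline, whatever else is in progress.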

The lexer is written to treat each of :samp:`\\r`, :samp:`\\n`, :samp:`\\r\\n`
and :samp:`\\n\\r` as a single new line indicator.  This allows it to
transparently preprocess MS-DOS, Macintosh and Unix files without their
needing to pass through a special filter beforehand.
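
One way to recognize all four indicators is to consume the first
newline character and then a second one only if it differs from the
first.  The sketch below assumes exactly that rule; ``sk_skip_newline``
and ``sk_count_lines`` are hypothetical helpers, not the real
``handle_newline``.

.. code-block:: c

  /* P points at a carriage return or line feed.  Return a pointer just
     past the whole newline indicator, treating CR LF and LF CR as one.  */
  static const char *
  sk_skip_newline (const char *p)
  {
    char first = *p++;

    /* A different newline character right after the first belongs to
       the same indicator; a repeated one starts a new line.  */
    if ((*p == '\r' || *p == '\n') && *p != first)
      p++;
    return p;
  }

  /* Count the newline indicators in the NUL-terminated string S.  */
  static unsigned int
  sk_count_lines (const char *s)
  {
    unsigned int n = 0;

    while (*s != '\0')
      {
        if (*s == '\r' || *s == '\n')
          {
            s = sk_skip_newline (s);
            n++;
          }
        else
          s++;
      }
    return n;
  }

With this rule two identical characters in a row, such as two line
feeds, still count as two lines, so blank lines are preserved.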

We also decided to treat a backslash, either ``\`` or the trigraph
``??/``, separated from one of the above newline indicators by
non-comment whitespace only, as intending to escape the newline.  It
tends to be a typing mistake, and cannot reasonably be mistaken for
anything else in any of the C-family grammars.  Since handling it this
way is not strictly conforming to the ISO standard, the library issues a
warning wherever it encounters it.
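
A minimal sketch of that check follows.  The helper name
``sk_escaped_newline`` is hypothetical, the set of characters accepted
as whitespace is an assumption, and the trigraph spelling of the
backslash is left out for brevity.

.. code-block:: c

  #include <stddef.h>

  /* P points just past a backslash.  If only horizontal whitespace
     separates it from a newline character, return a pointer to that
     newline and set *WARN to nonzero when whitespace intervened.
     Otherwise return NULL and leave *WARN untouched.  */
  static const char *
  sk_escaped_newline (const char *p, int *warn)
  {
    const char *q = p;

    while (*q == ' ' || *q == '\t' || *q == '\f' || *q == '\v')
      q++;
    if (*q != '\n' && *q != '\r')
      return NULL;

    /* Intervening space is accepted, but it deserves a warning.  */
    *warn = (q != p);
    return q;
  }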

Handling newlines like this is made simpler by doing it in one place
only.  The function ``handle_newline`` takes care of all newline
characters, and ``skip_escaped_newlines`` takes care of arbitrarily
long sequences of escaped newlines, deferring to ``handle_newline``
to handle the newlines themselves.

The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines.  Trigraphs are processed before
any interpretation of the meaning of a character is made, and unfortunately
there is a trigraph representation for a backslash, so it is possible for
the trigraph ``??/`` to introduce an escaped newline.

Escaped newlines are tedious because theoretically they can occur
anywhere---between the :samp:`+` and :samp:`=` of the :samp:`+=` token,
within the characters of an identifier, and even between the :samp:`*`
and :samp:`/` that terminates a comment.  Moreover, you cannot be sure
there is just one---there might be an arbitrarily long sequence of them.

So, for example, the routine that lexes a number, ``parse_number``,
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the :samp:`\\`
introducing an escaped newline, or the :samp:`?` introducing the trigraph
sequence that represents the :samp:`\\` of an escaped newline.  If it
encounters a :samp:`?` or :samp:`\\`, it calls ``skip_escaped_newlines``
to skip over any potential escaped newlines before checking whether the
number has been finished.

Similarly, code in the main body of ``_cpp_lex_direct`` cannot simply
check for a :samp:`=` after a :samp:`+` character to determine whether it
has a :samp:`+=` token; it needs to be prepared for an escaped newline of
some sort.  Such cases use the function ``get_effective_char``, which
returns the first character after any intervening escaped newlines.
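
To make the idea concrete, here is a sketch of such a helper and of the
check for a plus-equals token built on top of it.  The ``sk_...``
functions are hypothetical and much simpler than the real code: they
work on a NUL-terminated string, accept only a line feed as the newline,
and always treat the backslash trigraph as enabled.

.. code-block:: c

  #include <stddef.h>

  /* If P points at an escaped newline, spelled with a backslash or with
     the backslash trigraph, return a pointer past it, else NULL.  */
  static const char *
  sk_skip_one_escaped_newline (const char *p)
  {
    if (p[0] == '\\')
      p += 1;
    else if (p[0] == '?' && p[1] == '?' && p[2] == '/')
      p += 3;
    else
      return NULL;
    return *p == '\n' ? p + 1 : NULL;
  }

  /* Return the first character at *PP that is not part of an escaped
     newline, and advance *PP past it.  At the end of the string return
     NUL without advancing.  */
  static char
  sk_get_effective_char (const char **pp)
  {
    const char *next;

    /* There may be an arbitrarily long run of escaped newlines.  */
    while ((next = sk_skip_one_escaped_newline (*pp)) != NULL)
      *pp = next;
    if (**pp == '\0')
      return '\0';
    return *(*pp)++;
  }

  /* Return nonzero if the text at P starts with a plus-equals token,
     possibly split by escaped newlines.  */
  static int
  sk_is_plus_eq (const char *p)
  {
    if (sk_get_effective_char (&p) != '+')
      return 0;
    return sk_get_effective_char (&p) == '=';
  }

A lone backslash that is not followed by a newline is returned as an
ordinary character, so it correctly prevents the two halves from
joining.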

The lexer needs to keep track of the correct column position, including
counting tabs as specified by the :option:`-ftabstop=` option.  This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.
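
A sketch of the column arithmetic, assuming 1-based columns and that a
tab advances to the column just after the next multiple of the tab
stop.  Both helper names are hypothetical, and the exact rule cpplib
applies may differ in detail.

.. code-block:: c

  #include <stddef.h>

  /* Return the column that follows character C when C sits at column
     COL, with tab stops every TABSTOP columns.  */
  static unsigned int
  sk_advance_column (unsigned int col, char c, unsigned int tabstop)
  {
    if (c == '\t')
      return ((col - 1) / tabstop + 1) * tabstop + 1;
    return col + 1;
  }

  /* Return the column of the character at OFFSET within LINE.  Every
     preceding character counts, including those inside comments.  */
  static unsigned int
  sk_column_of (const char *line, size_t offset, unsigned int tabstop)
  {
    unsigned int col = 1;
    size_t i;

    for (i = 0; i < offset && line[i] != '\0'; i++)
      col = sk_advance_column (col, line[i], tabstop);
    return col;
  }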

.. _invalid-identifiers:

Some identifiers, such as ``__VA_ARGS__`` and poisoned identifiers,
may be invalid and require a diagnostic.  However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
``parse_identifier``.  In both cases, whether a diagnostic is needed
or not is dependent upon the lexer's state.  For example, we don't want
to issue a diagnostic for re-poisoning a poisoned identifier, or for
using ``__VA_ARGS__`` in the expansion of a variable-argument macro.
Therefore ``parse_identifier`` makes use of state flags to determine
whether a diagnostic is appropriate.  Since we change state on a
per-token basis, and don't lex whole lines at a time, this is not a
problem.
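
The state-dependent check might look roughly like this.  The structure
and its two flags are hypothetical illustrations of the state flags
mentioned above, and the function is not the real ``parse_identifier``.

.. code-block:: c

  #include <string.h>

  /* Hypothetical lexer state flags.  */
  struct sk_lexer_state
  {
    int poisoned_ok;   /* Nonzero while poisoned names may be lexed.  */
    int va_args_ok;    /* Nonzero while __VA_ARGS__ is permitted.  */
  };

  /* Return nonzero if lexing identifier NAME in state ST calls for a
     diagnostic.  IS_POISONED says whether NAME has been poisoned.  */
  static int
  sk_needs_diagnostic (const char *name, int is_poisoned,
                       const struct sk_lexer_state *st)
  {
    if (is_poisoned && !st->poisoned_ok)
      return 1;
    if (strcmp (name, "__VA_ARGS__") == 0 && !st->va_args_ok)
      return 1;
    return 0;
  }

Because the flags are consulted once per token, the caller can flip them
just before lexing the tokens they apply to and restore them afterwards.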

Another place where state flags are used to change behavior is whilst
lexing header names.  Normally, a :samp:`<` would be lexed as a single
token.  After a ``#include`` directive, though, it should be lexed as
the start of a header name: a single token that extends as far as the
nearest :samp:`>` character.  Note that we don't allow the terminators
of header names to be escaped; the first :samp:`"` or :samp:`>`
terminates the header name.
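
A sketch of the angle-bracket case, under the assumption that a single
flag tells the lexer it is looking at the operand of ``#include``.  The
function and parameter names are hypothetical, as is the fallback of
lexing a lone less-than sign when no closing bracket is found on the
line.

.. code-block:: c

  #include <stddef.h>

  /* P points at a less-than sign.  Return the length of the token that
     starts there.  ANGLED_HEADERS is nonzero when a header name is
     expected.  */
  static size_t
  sk_lex_less_than (const char *p, int angled_headers)
  {
    if (angled_headers)
      {
        const char *q = p + 1;

        /* Scan to the nearest closing bracket on this line.  A
           backslash is not special, so the terminator cannot be
           escaped.  */
        while (*q != '\0' && *q != '\n' && *q != '>')
          q++;
        if (*q == '>')
          return (size_t) (q - p) + 1;
      }

    /* Ordinary context, or no terminator: a one-character token.  */
    return 1;
  }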

Interpretation of some character sequences depends upon whether we are
lexing C, C++ or Objective-C, and on the revision of the standard in
force.  For example, :samp:`::` is a single token in C++, but in C it is
two separate :samp:`:` tokens and almost certainly a syntax error.  Such
cases are handled by ``_cpp_lex_direct`` based upon command-line
flags stored in the ``cpp_options`` structure.
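
In outline, such a language-dependent decision is just a test of an
option while lexing.  The sketch below uses a hypothetical ``cplusplus``
parameter in place of a field of the options structure.

.. code-block:: c

  #include <stddef.h>

  /* P points at a colon.  Return the length of the token that starts
     there: two characters for a scope operator when CPLUSPLUS is
     nonzero, otherwise one.  */
  static size_t
  sk_lex_colon (const char *p, int cplusplus)
  {
    if (cplusplus && p[1] == ':')
      return 2;
    return 1;
  }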

Once a token has been lexed, it leads an independent existence.  The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with ``_cpp_pop_buffer``.
The storage holding the spellings of such tokens remains until the
client program calls ``cpp_destroy``, probably at the end of the
translation unit.
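
The effect of copying the spelling can be sketched with plain ``malloc``.
The type and functions below are hypothetical; the real library uses its
own storage pools rather than one allocation per token.

.. code-block:: c

  #include <stdlib.h>
  #include <string.h>

  /* A spelling that owns its own storage.  */
  struct sk_spelling
  {
    char *text;    /* NUL-terminated copy of the spelling.  */
    size_t len;    /* Length, not counting the terminator.  */
  };

  /* Copy LEN characters starting at START into fresh storage owned by
     OUT.  Return nonzero on success, zero if allocation fails.  */
  static int
  sk_save_spelling (struct sk_spelling *out, const char *start, size_t len)
  {
    char *copy = (char *) malloc (len + 1);

    if (copy == NULL)
      return 0;
    memcpy (copy, start, len);
    copy[len] = '\0';
    out->text = copy;
    out->len = len;
    return 1;
  }

  /* Release the storage held by SP.  */
  static void
  sk_free_spelling (struct sk_spelling *sp)
  {
    free (sp->text);
    sp->text = NULL;
    sp->len = 0;
  }

Once ``sk_save_spelling`` has returned, the input buffer can be
overwritten or freed without affecting the saved spelling.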