gcc/doc/cppinternals/lexing-a-token.rst

..
  Copyright 1988-2022 Free Software Foundation, Inc.
  This is part of the GCC manual.
  For copying conditions, see the copyright.rst file.

Lexing a token
**************

Lexing of an individual token is handled by ``_cpp_lex_direct`` and
its subroutines.  In its current form the code is quite complicated,
with read ahead characters and such-like, since it strives to not step
back in the character stream in preparation for handling non-ASCII file
encodings.  The current plan is to convert any such files to UTF-8
before processing them.  This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

The job of ``_cpp_lex_direct`` is simply to lex a token.  It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping.  It necessarily has a minor rôle to play in memory
management of lexed lines.  I discuss these issues in a separate section
(see :ref:`lexing-a-line`).

The lexer places the token it lexes into storage pointed to by the
variable ``cur_token``, and then increments it.  This variable is
important for correct diagnostic positioning.  Unless a specific line
and column are passed to the diagnostic routines, they will examine the
``line`` and ``col`` values of the token just before the location
that ``cur_token`` points to, and use that location to report the
diagnostic.

The lexer does not consider whitespace to be a token in its own right.
If whitespace (other than a new line) precedes a token, it sets the
``PREV_WHITE`` bit in the token's flags.  Each token has its
``line`` and ``col`` variables set to the line and column of the
first character of the token.  This line number is the line number in
the translation unit, and can be converted to a source (file, line) pair
using the line map code.

The first token on a logical, i.e. unescaped, line has the flag
``BOL`` set for beginning-of-line.  This flag is intended for
internal use, both to distinguish a :samp:`#` that begins a directive
from one that doesn't, and to generate a call-back to clients that want
to be notified about the start of every non-directive line with tokens
on it.  Clients cannot reliably determine this for themselves: the first
token might be a macro, and the tokens of a macro expansion do not have
the ``BOL`` flag set.  The macro expansion may even be empty, and the
next token on the line certainly won't have the ``BOL`` flag set.

New lines are treated specially; exactly how the lexer handles them is
context-dependent.  The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion.  Therefore, if the state variable
``in_directive`` is set, the lexer returns a ``CPP_EOF`` token,
which is normally used to indicate end-of-file, to indicate
end-of-directive.  In a directive a ``CPP_EOF`` token never means
end-of-file.  Conveniently, if the caller was ``collect_args``, it
already handles ``CPP_EOF`` as if it were end-of-file, and reports an
error about an unterminated macro argument list.

The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace.  This whitespace is
important in case the macro argument is stringized.  The state variable
``parsing_args`` is nonzero when the preprocessor is collecting the
arguments to a macro call.  It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes.  One such time is here: the lexer sets the
``PREV_WHITE`` flag of a token if it meets a new line when
``parsing_args`` is set to 2.  It doesn't set it if it meets a new
line when ``parsing_args`` is 1, since then code like

.. code-block:: c++

  #define foo() bar
  foo
  baz

would be output with an erroneous space before :samp:`baz`:

.. code-block:: c++

  foo
   baz

This is a good example of the subtlety of getting token spacing correct
in the preprocessor; there are plenty of tests in the testsuite for
corner cases like this.

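The stringizing case can be observed with any conforming compiler.  The
following is a small sketch rather than anything taken from cpplib; the
``STR`` macro and the names ``spelled`` and ``newline_became_space`` are
made up for illustration.

.. code-block:: c

  #include <string.h>

  #define STR(x) #x

  /* The macro argument is split across two lines; the new line between
     "a" and "b" counts as whitespace, so it becomes one space.  */
  static const char *const spelled = STR (a
                                          b);

  /* Return 1 if the stringized argument has the expected spelling.  */
  static int
  newline_became_space (void)
  {
    return strcmp (spelled, "a b") == 0;
  }

Because the new line inside the argument list is ordinary whitespace,
``spelled`` is ``"a b"`` and ``newline_became_space`` returns 1.
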
The lexer is written to treat each of :samp:`\\r`, :samp:`\\n`, :samp:`\\r\\n`
and :samp:`\\n\\r` as a single new line indicator.  This allows it to
transparently preprocess MS-DOS, Macintosh and Unix files without their
needing to pass through a special filter beforehand.

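As a rough illustration of the rule above, and not the code of
``handle_newline`` itself, the sketch below consumes one new line
indicator at a time.  The helper names ``skip_newline`` and
``count_newlines`` are hypothetical.

.. code-block:: c

  #include <stddef.h>

  /* Hypothetical helper, not the real handle_newline: CUR points at a
     '\r' or '\n'.  Return a pointer just past the whole new line
     indicator, consuming "\r\n" or "\n\r" as a single unit.  */
  static const char *
  skip_newline (const char *cur)
  {
    char first = *cur++;

    /* A second terminator character belongs to the same new line only
       if it differs from the first; "\n\n" is two new lines.  */
    if ((*cur == '\r' || *cur == '\n') && *cur != first)
      cur++;
    return cur;
  }

  /* Count the logical new lines in the NUL-terminated string TEXT.  */
  static size_t
  count_newlines (const char *text)
  {
    size_t count = 0;

    while (*text != '\0')
      {
        if (*text == '\r' || *text == '\n')
          {
            text = skip_newline (text);
            count++;
          }
        else
          text++;
      }
    return count;
  }

With this rule ``"a\r\nb\n\rc"`` contains two new lines, and ``"\n\n"``
also contains two, because a repeated terminator character starts a
fresh line.
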
We also decided to treat a backslash, either ``\`` or the trigraph
``??/``, separated from one of the above newline indicators by
non-comment whitespace only, as intending to escape the newline.  It
tends to be a typing mistake, and cannot reasonably be mistaken for
anything else in any of the C-family grammars.  Since handling it this
way is not strictly conforming to the ISO standard, the library issues a
warning wherever it encounters it.

Handling newlines like this is made simpler by doing it in one place
only.  The function ``handle_newline`` takes care of all newline
characters, and ``skip_escaped_newlines`` takes care of arbitrarily
long sequences of escaped newlines, deferring to ``handle_newline``
to handle the newlines themselves.

The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines.  Trigraphs are processed before
any interpretation of the meaning of a character is made, and unfortunately
there is a trigraph representation for a backslash, so it is possible for
the trigraph ``??/`` to introduce an escaped newline.

Escaped newlines are tedious because theoretically they can occur
anywhere---between the :samp:`+` and :samp:`=` of the :samp:`+=` token,
within the characters of an identifier, and even between the :samp:`*`
and :samp:`/` that terminates a comment.  Moreover, you cannot be sure
there is just one---there might be an arbitrarily long sequence of them.

So, for example, the routine that lexes a number, ``parse_number``,
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the :samp:`\\`
introducing an escaped newline, or the :samp:`?` introducing the trigraph
sequence that represents the :samp:`\\` of an escaped newline.  If it
encounters a :samp:`?` or :samp:`\\`, it calls ``skip_escaped_newlines``
to skip over any potential escaped newlines before checking whether the
number has been finished.

Similarly, code in the main body of ``_cpp_lex_direct`` cannot simply
check for a :samp:`=` after a :samp:`+` character to determine whether it
has a :samp:`+=` token; it needs to be prepared for an escaped newline of
some sort.  Such cases use the function ``get_effective_char``, which
returns the first character after any intervening escaped newlines.

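The idea can be sketched as follows.  This is a simplified stand-in and
not the cpplib implementation: ``escaped_newline_length``,
``next_effective_char`` and ``starts_with_plus_equals`` are hypothetical
names, only :samp:`\\n` is treated as a newline, and whitespace between
the backslash and the newline is not handled.

.. code-block:: c

  #include <stddef.h>

  /* If CUR points at an escaped newline, written either as backslash
     newline or as the backslash trigraph followed by a newline, return
     its length in characters; otherwise return 0.  */
  static size_t
  escaped_newline_length (const char *cur)
  {
    if (cur[0] == '\\' && cur[1] == '\n')
      return 2;
    /* Compare one character at a time so that this source file never
       contains the trigraph itself.  */
    if (cur[0] == '?' && cur[1] == '?' && cur[2] == '/' && cur[3] == '\n')
      return 4;
    return 0;
  }

  /* Hypothetical stand-in for get_effective_char: skip any run of
     escaped newlines at *PCUR, then return the next character and step
     past it.  Returns '\0' at the end of the string without advancing.  */
  static char
  next_effective_char (const char **pcur)
  {
    size_t len;

    while ((len = escaped_newline_length (*pcur)) != 0)
      *pcur += len;
    if (**pcur == '\0')
      return '\0';
    return *(*pcur)++;
  }

  /* Return 1 if TEXT begins with the token "+=", allowing escaped
     newlines between the two characters, and 0 otherwise.  */
  static int
  starts_with_plus_equals (const char *text)
  {
    if (next_effective_char (&text) != '+')
      return 0;
    return next_effective_char (&text) == '=';
  }

A caller that has just seen :samp:`+` asks for the next effective
character instead of peeking at the next raw character, so a
:samp:`+` and :samp:`=` separated by any number of escaped newlines
are still recognized as one token.
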
The lexer needs to keep track of the correct column position, including
counting tabs as specified by the :option:`-ftabstop=` option.  This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.

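A minimal sketch of tab-aware column counting is shown below.  It
assumes 1-based columns and evenly spaced tab stops; ``column_of`` is an
illustrative helper and not a cpplib function.

.. code-block:: c

  #include <assert.h>

  /* Return the 1-based column of the character at offset INDEX in LINE,
     assuming tab stops every TABSTOP columns.  Characters inside
     comments are counted like any others.  */
  static unsigned int
  column_of (const char *line, unsigned int index, unsigned int tabstop)
  {
    unsigned int col = 1;
    unsigned int i;

    assert (tabstop > 0);
    for (i = 0; i < index && line[i] != '\0'; i++)
      {
        if (line[i] == '\t')
          /* Advance to the column just after the next tab stop.  */
          col += tabstop - (col - 1) % tabstop;
        else
          col++;
      }
    return col;
  }

For the line ``"/* c */\tfoo"`` with a tab stop of 8, the comment
occupies columns 1 to 7, the tab occupies column 8, and ``foo`` starts
at column 9.
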
.. _invalid-identifiers:

Some identifiers, such as ``__VA_ARGS__`` and poisoned identifiers,
may be invalid and require a diagnostic.  However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
``parse_identifier``.  In both cases, whether a diagnostic is needed
or not is dependent upon the lexer's state.  For example, we don't want
to issue a diagnostic for re-poisoning a poisoned identifier, or for
using ``__VA_ARGS__`` in the expansion of a variable-argument macro.
Therefore ``parse_identifier`` makes use of state flags to determine
whether a diagnostic is appropriate.  Since we change state on a
per-token basis, and don't lex whole lines at a time, this is not a
problem.

Another place where state flags are used to change behavior is whilst
lexing header names.  Normally, a :samp:`<` would be lexed as a single
token.  After a ``#include`` directive, though, everything from the
:samp:`<` as far as the nearest :samp:`>` character should be lexed as
a single token.  Note that we don't allow the terminators of header
names to be escaped; the first :samp:`"` or :samp:`>` terminates the
header name.

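The following sketch shows the shape of such a state-dependent
decision.  It is not the cpplib code: ``header_name_length``,
``less_than_token_length`` and the ``after_include`` parameter are
hypothetical names standing in for the real state flag.

.. code-block:: c

  #include <stddef.h>

  /* Return the length of the header name starting at TEXT, including
     its delimiters, or 0 if TEXT does not start a terminated header
     name on this line.  The first matching terminator ends the name; a
     preceding backslash does not escape it.  */
  static size_t
  header_name_length (const char *text)
  {
    char terminator;
    size_t i;

    if (text[0] == '<')
      terminator = '>';
    else if (text[0] == '"')
      terminator = '"';
    else
      return 0;

    for (i = 1; text[i] != '\0' && text[i] != '\n'; i++)
      if (text[i] == terminator)
        return i + 1;
    return 0;
  }

  /* TEXT must start with '<'.  Return the length of the token that
     begins there: a whole header name when AFTER_INCLUDE is nonzero
     and a terminator is found, otherwise just the '<' itself.  */
  static size_t
  less_than_token_length (const char *text, int after_include)
  {
    size_t len = after_include ? header_name_length (text) : 0;

    return len != 0 ? len : 1;
  }

The same input ``<stdio.h>`` therefore yields a nine-character token
after ``#include`` and a one-character token elsewhere.
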
Interpretation of some character sequences depends upon whether we are
lexing C, C++ or Objective-C, and on the revision of the standard in
force.  For example, :samp:`::` is a single token in C++, but in C it is
two separate :samp:`:` tokens and almost certainly a syntax error.  Such
cases are handled by ``_cpp_lex_direct`` based upon command-line
flags stored in the ``cpp_options`` structure.

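As a sketch of the :samp:`::` example, assuming a much reduced option
set: the ``lex_options`` structure, its ``cplusplus`` field, the
``tok_type`` enumeration and ``lex_colon`` are all hypothetical, and
digraphs beginning with :samp:`:` are ignored.

.. code-block:: c

  #include <stddef.h>

  enum tok_type { TOK_COLON, TOK_SCOPE, TOK_OTHER };

  /* Hypothetical option set; the real cpp_options has many more
     fields.  */
  struct lex_options
  {
    int cplusplus;   /* Nonzero when lexing C++.  */
  };

  /* Lex the punctuator at TEXT.  Store its length in *LEN and return
     its type.  "::" is one token only when lexing C++.  */
  static enum tok_type
  lex_colon (const char *text, const struct lex_options *opts, size_t *len)
  {
    if (text[0] != ':')
      {
        *len = 0;
        return TOK_OTHER;
      }
    if (opts->cplusplus && text[1] == ':')
      {
        *len = 2;
        return TOK_SCOPE;
      }
    *len = 1;
    return TOK_COLON;
  }

Keeping the language test inside the lexer means that later stages see
either one ``TOK_SCOPE`` or two ``TOK_COLON`` tokens and never have to
re-examine the characters.
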
Once a token has been lexed, it leads an independent existence.  The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with ``_cpp_pop_buffer``.
The storage holding the spellings of such tokens remains until the
client program calls ``cpp_destroy``, probably at the end of the
translation unit.