..
  Copyright 1988-2022 Free Software Foundation, Inc.
  This is part of the GCC manual.
  For copying conditions, see the copyright.rst file.

.. _character-sets:

Character sets
**************

Source code character set processing in C and related languages is
rather complicated.  The C standard discusses two character sets, but
there are really at least four.

The files input to CPP might be in any character set at all.  CPP's
very first action, before it even looks for line boundaries, is to
convert the file into the character set it uses for internal
processing.  That set is what the C standard calls the :dfn:`source`
character set.  It must be isomorphic with ISO 10646, also known as
Unicode.  CPP uses the UTF-8 encoding of Unicode.

The character sets of the input files are specified using the
:option:`-finput-charset=` option.

All preprocessing work (the subject of the rest of this manual) is
carried out in the source character set.  If you request textual
output from the preprocessor with the :option:`-E` option, it will be
in UTF-8.

After preprocessing is complete, string and character constants are
converted again, into the :dfn:`execution` character set.  This
character set is under control of the user; the default is UTF-8,
matching the source character set.  Wide string and character
constants have their own character set, which is not called out
specifically in the standard.  Again, it is under control of the user.
The default is UTF-16 or UTF-32, whichever fits in the target's
``wchar_t`` type, in the target machine's byte
order [#f1]_.
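
As an illustration only, the following sketch prints the bytes that a
narrow string constant occupies after conversion, together with the
value of the corresponding wide constant.  The names ``narrow`` and
``wide`` are made up for this example, and the values mentioned in the
comments assume that the default execution character sets are in
effect.

.. code-block:: c

  #include <stdio.h>
  #include <string.h>
  #include <wchar.h>

  /* U+00E9 is written as a universal character name so that the
     example does not depend on the encoding of this file.  */
  static const char narrow[] = "\u00e9";
  static const wchar_t wide[] = L"\u00e9";

  int main(void)
  {
      size_t i;

      /* With the default UTF-8 execution character set the constant
         occupies two bytes, 0xc3 and 0xa9.  */
      for (i = 0; i < strlen(narrow); i++)
          printf("%02x ", (unsigned char) narrow[i]);
      printf("\n");

      /* U+00E9 fits in a single UTF-16 or UTF-32 code unit, so the
         wide constant holds 0xe9 under either default.  */
      printf("sizeof (wchar_t) = %zu, wide[0] = 0x%lx\n",
             sizeof (wchar_t), (unsigned long) wide[0]);
      return 0;
  }

Building the same file with a different :option:`-fexec-charset=` or
:option:`-fwide-exec-charset=` setting would be expected to change the
stored values, although the source text is unchanged.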

Octal and hexadecimal escape sequences do not undergo
conversion; ``'\x12'`` has the value 0x12 regardless of the currently
selected execution character set.  All other escapes are replaced by
the character in the source character set that they represent, then
converted to the execution character set, just like unescaped
characters.
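
The difference between the two kinds of escape can be sketched as
follows.  The variable names are hypothetical, and the use of
``_Static_assert`` assumes C11 or later.

.. code-block:: c

  #include <stdio.h>

  /* Numeric escapes give the value directly; no conversion to the
     execution character set takes place.  */
  static const char hex_ch = '\x12';
  static const char octal_ch = '\022';

  /* '\n' names a character, so it is converted in the same way as
     an unescaped character such as 'A'.  */
  static const char newline_ch = '\n';

  _Static_assert('\x12' == 0x12, "hexadecimal escape is not converted");
  _Static_assert('\022' == 0x12, "octal escape is not converted");

  int main(void)
  {
      printf("hex escape:   0x%02x\n", (unsigned char) hex_ch);
      printf("octal escape: 0x%02x\n", (unsigned char) octal_ch);

      /* This value depends on the selected execution character set.  */
      printf("newline:      0x%02x\n", (unsigned char) newline_ch);
      return 0;
  }

The two static assertions hold whatever execution character set is
selected; only the last line of output may vary with it.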

In identifiers, characters outside the ASCII range can be specified
with the :samp:`\\u` and :samp:`\\U` escapes or used directly in the input
encoding.  If strict ISO C90 conformance is specified with an option
such as :option:`-std=c90`, or :option:`-fno-extended-identifiers` is
used, then those constructs are not permitted in identifiers.
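
A minimal sketch of both spellings follows.  The identifier is invented
for the example, and the second spelling assumes that the file is saved
as UTF-8 (or that a matching :option:`-finput-charset=` is given) and
that the compiler version accepts such characters directly.

.. code-block:: c

  #include <stdio.h>

  /* Declared with a universal character name in the identifier.  */
  static int caf\u00e9 = 42;

  int main(void)
  {
      /* Referenced by writing U+00E9 directly in the input encoding;
         both spellings denote the same identifier.  */
      printf("%d\n", café);
      return 0;
  }

With :option:`-std=c90` or :option:`-fno-extended-identifiers`, neither
spelling would be accepted.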

.. [#f1] UTF-16 does not meet the requirements of the C
  standard for a wide character set, but the choice of 16-bit
  ``wchar_t`` is enshrined in some system ABIs so we cannot fix
  this.