Commit | Line | Data |
---|---|---|
fea681da MK |
1 | .\" Copyright (C) Markus Kuhn, 1996, 2001 |
2 | .\" | |
e4a74ca8 | 3 | .\" SPDX-License-Identifier: GPL-2.0-or-later |
fea681da MK |
4 | .\" |
5 | .\" 1995-11-26 Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de> | |
6 | .\" First version written | |
7 | .\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk> | |
8 | .\" Update | |
9 | .\" | |
45186a5d | 10 | .TH UTF-8 7 2019-03-06 "Linux man-pages (unreleased)" |
fea681da | 11 | .SH NAME |
ae03dc66 | 12 | UTF-8 \- an ASCII compatible multibyte Unicode encoding |
fea681da | 13 | .SH DESCRIPTION |
57e79231 | 14 | The Unicode 3.0 character set occupies a 16-bit code space. |
c13182ef | 15 | The most obvious |
57e79231 | 16 | Unicode encoding (known as UCS-2) |
c13182ef | 17 | consists of a sequence of 16-bit words. |
76f6db57 | 18 | Such strings can contain\(emas part of many 16-bit characters\(embytes |
d1a71985 | 19 | such as \(aq\e0\(aq or \(aq/\(aq, which have a |
c4bb193f | 20 | special meaning in filenames and other C library function arguments. |
76f6db57 | 21 | In addition, the majority of UNIX tools expect ASCII files and can't |
c13182ef MK |
22 | read 16-bit words as characters without major modifications. |
23 | For these reasons, | |
57e79231 | 24 | UCS-2 is not a suitable external encoding of Unicode |
9581cf78 | 25 | in filenames, text files, environment variables, and so on. |
57e79231 | 26 | The ISO 10646 Universal Character Set (UCS), |
76f6db57 MK |
27 | a superset of Unicode, occupies an even larger code |
28 | space\(em31\ bits\(emand the obvious | |
57e79231 | 29 | UCS-4 encoding for it (a sequence of 32-bit words) has the same problems. |
a721e8b2 | 30 | .PP |
57e79231 | 31 | The UTF-8 encoding of Unicode and UCS |
fea681da | 32 | does not have these problems and is the common way in which |
57e79231 | 33 | Unicode is used on UNIX-style operating systems. |
6427f1d8 | 34 | .SS Properties |
57e79231 | 35 | The UTF-8 encoding has the following nice properties: |
fea681da MK |
36 | .TP 0.2i |
37 | * | |
57e79231 MK |
38 | UCS |
39 | characters 0x00000000 to 0x0000007f (the classic US-ASCII | |
fea681da | 40 | characters) are encoded simply as bytes 0x00 to 0x7f (ASCII |
c13182ef MK |
41 | compatibility). |
42 | This means that files and strings which contain only | |
43 | 7-bit ASCII characters have the same encoding under both | |
57e79231 | 44 | ASCII |
fea681da | 45 | and |
57e79231 | 46 | UTF-8 . |
fea681da MK |
47 | .TP |
48 | * | |
57e79231 | 49 | All UCS characters greater than 0x7f are encoded as a multibyte sequence |
fea681da MK |
50 | consisting only of bytes in the range 0x80 to 0xfd, so no ASCII |
51 | byte can appear as part of another character and there are no | |
d1a71985 | 52 | problems with, for example, \(aq\e0\(aq or \(aq/\(aq. |
fea681da MK |
53 | .TP |
54 | * | |
57e79231 | 55 | The lexicographic sorting order of UCS-4 strings is preserved. |
fea681da MK |
56 | .TP |
57 | * | |
57e79231 | 58 | All possible 2^31 UCS codes can be encoded using UTF-8. |
fea681da MK |
59 | .TP |
60 | * | |
57e79231 | 61 | The bytes 0xc0, 0xc1, 0xfe, and 0xff are never used in the UTF-8 encoding. |
fea681da | 62 | .TP |
c13182ef | 63 | * |
ae03dc66 | 64 | The first byte of a multibyte sequence which represents a single non-ASCII |
57e79231 | 65 | UCS character is always in the range 0xc2 to 0xfd and indicates how long |
ae03dc66 MK |
66 | this multibyte sequence is. |
67 | All further bytes in a multibyte sequence | |
c13182ef MK |
68 | are in the range 0x80 to 0xbf. |
69 | This allows easy resynchronization and | |
fea681da MK |
70 | makes the encoding stateless and robust against missing bytes. |
71 | .TP | |
72 | * | |
57e79231 MK |
73 | UTF-8 encoded UCS characters may be up to six bytes long; however, the |
74 | Unicode standard specifies no characters above 0x10ffff, so Unicode characters | |
33a0ccb2 | 75 | can be only up to four bytes long in |
57e79231 | 76 | UTF-8. |
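
The resynchronization property listed above can be illustrated with a short sketch. The function name `utf8_resync` is hypothetical and not part of any library; it assumes only that continuation bytes are exactly those in the range 0x80 to 0xbf, as stated in the list.

```c
#include <stddef.h>

/* Hypothetical helper (not a library function): given a buffer that may
 * start in the middle of a multibyte sequence, return the index of the
 * first byte that is not a continuation byte (0x80..0xbf). That byte
 * starts a new character, or the index equals len if none is found. */
size_t utf8_resync(const unsigned char *s, size_t len)
{
    size_t i = 0;

    /* Continuation bytes all have the bit pattern 10xxxxxx. */
    while (i < len && (s[i] & 0xC0) == 0x80)
        i++;
    return i;
}
```

A reader that lost its place in a stream could call this on the remaining bytes and resume decoding at the returned offset, at the cost of dropping at most one damaged character.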
6427f1d8 | 77 | .SS Encoding |
c13182ef MK |
78 | The following byte sequences are used to represent a character. |
79 | The sequence to be used depends on the UCS code number of the character: | |
fea681da | 80 | .TP 0.4i |
4d9b6984 | 81 | 0x00000000 \- 0x0000007F: |
fea681da MK |
82 | .RI 0 xxxxxxx |
83 | .TP | |
4d9b6984 | 84 | 0x00000080 \- 0x000007FF: |
c13182ef | 85 | .RI 110 xxxxx |
fea681da MK |
86 | .RI 10 xxxxxx |
87 | .TP | |
4d9b6984 | 88 | 0x00000800 \- 0x0000FFFF: |
fea681da MK |
89 | .RI 1110 xxxx |
90 | .RI 10 xxxxxx | |
91 | .RI 10 xxxxxx | |
92 | .TP | |
4d9b6984 | 93 | 0x00010000 \- 0x001FFFFF: |
fea681da MK |
94 | .RI 11110 xxx |
95 | .RI 10 xxxxxx | |
96 | .RI 10 xxxxxx | |
97 | .RI 10 xxxxxx | |
98 | .TP | |
4d9b6984 | 99 | 0x00200000 \- 0x03FFFFFF: |
fea681da MK |
100 | .RI 111110 xx |
101 | .RI 10 xxxxxx | |
102 | .RI 10 xxxxxx | |
103 | .RI 10 xxxxxx | |
104 | .RI 10 xxxxxx | |
105 | .TP | |
4d9b6984 | 106 | 0x04000000 \- 0x7FFFFFFF: |
fea681da MK |
107 | .RI 1111110 x |
108 | .RI 10 xxxxxx | |
109 | .RI 10 xxxxxx | |
110 | .RI 10 xxxxxx | |
111 | .RI 10 xxxxxx | |
112 | .RI 10 xxxxxx | |
113 | .PP | |
114 | The | |
115 | .I xxx | |
116 | bit positions are filled with the bits of the character code number in | |
ad0fbddd | 117 | binary representation, most significant bit first (big-endian). |
ae03dc66 | 118 | Only the shortest possible multibyte sequence |
fea681da MK |
119 | which can represent the code number of the character can be used. |
120 | .PP | |
57e79231 | 121 | The UCS code values 0xd800\(en0xdfff (UTF-16 surrogates) as well as 0xfffe and |
15f0b7af AC |
122 | 0xffff (UCS noncharacters) should not appear in conforming UTF-8 streams. |
123 | According to RFC 3629, no code point above U+10FFFF should be used, |
124 | which limits characters to four bytes. | |
6427f1d8 | 125 | .SS Example |
57e79231 | 126 | The Unicode character 0xa9 = 1010 1001 (the copyright sign) is encoded |
fea681da MK |
127 | in UTF-8 as |
128 | .PP | |
129 | .RS | |
130 | 11000010 10101001 = 0xc2 0xa9 | |
131 | .RE | |
132 | .PP | |
133 | and character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) is | |
134 | encoded as: | |
135 | .PP | |
136 | .RS | |
137 | 11100010 10001001 10100000 = 0xe2 0x89 0xa0 | |
138 | .RE | |
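
The two worked examples above follow mechanically from the encoding table. The following is a minimal sketch of an encoder, assuming the RFC 3629 limit of U+10FFFF (so at most four bytes) and rejecting UTF-16 surrogates; the name `utf8_encode` is hypothetical and not a C library function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): encode one code point
 * into buf, which must have room for at least 4 bytes. Returns the
 * number of bytes written, or 0 if cp is a UTF-16 surrogate or lies
 * above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *buf)
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not encoded */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* above U+10FFFF */
}
```

Because each range is tested in ascending order, the function always emits the shortest possible sequence for a given code point.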
73d8cece | 139 | .SS Application notes |
57e79231 | 140 | Users have to select a UTF-8 locale, for example with |
fea681da MK |
141 | .PP |
142 | .RS | |
143 | export LANG=en_GB.UTF-8 | |
144 | .RE | |
145 | .PP | |
57e79231 | 146 | in order to activate the UTF-8 support in applications. |
fea681da MK |
147 | .PP |
148 | Application software that has to be aware of the character encoding |
149 | in use should always set the locale with, for example, |
150 | .PP | |
151 | .RS | |
152 | setlocale(LC_CTYPE, "") | |
153 | .RE | |
154 | .PP | |
155 | and programmers can then test the expression | |
156 | .PP | |
157 | .RS | |
158 | strcmp(nl_langinfo(CODESET), "UTF-8") == 0 | |
159 | .RE | |
160 | .PP | |
57e79231 | 161 | to determine whether a UTF-8 locale has been selected and whether |
fea681da | 162 | therefore all plaintext standard input and output, terminal |
735334d4 | 163 | communication, plaintext file content, filenames, and environment |
57e79231 | 164 | variables are encoded in UTF-8. |
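
The two calls shown above can be combined into one check. This is a sketch only; the wrapper name `locale_is_utf8` is hypothetical, and treating a failed setlocale() call as "not UTF-8" is an assumption rather than something the page prescribes.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper (not a library function): returns 1 if the
 * LC_CTYPE locale selected by the environment uses UTF-8, else 0. */
int locale_is_utf8(void)
{
    /* Adopt the character-type locale requested via LANG / LC_*. */
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;   /* assumption: an unavailable locale counts as non-UTF-8 */

    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

A program would typically call this once at startup and choose its input and output handling based on the result.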
fea681da | 165 | .PP |
57e79231 | 166 | Programmers accustomed to single-byte encodings such as US-ASCII or ISO 8859 |
fea681da | 167 | have to be aware that two assumptions made so far are no longer valid |
57e79231 | 168 | in UTF-8 locales. |
c13182ef MK |
169 | Firstly, a single byte does not necessarily correspond any |
170 | more to a single character. | |
57e79231 | 171 | Secondly, since modern terminal emulators in UTF-8 |
fea681da | 172 | mode also support Chinese, Japanese, and Korean |
57e79231 | 173 | double-width characters as well as nonspacing combining characters, |
fea681da | 174 | outputting a single character does not necessarily advance the cursor |
57e79231 | 175 | by one position as it did in ASCII. |
fea681da MK |
176 | Library functions such as |
177 | .BR mbsrtowcs (3) | |
178 | and | |
179 | .BR wcswidth (3) | |
180 | should be used today to count characters and cursor positions. | |
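
As a rough illustration of the advice above, the following sketch converts a multibyte string with mbsrtowcs(3) and asks wcswidth(3) for its column count. The function name `display_width` is hypothetical; the sketch assumes the caller has already selected a locale with setlocale(3), and it defines `_XOPEN_SOURCE` because some C libraries declare wcswidth() only when that macro is set.

```c
#define _XOPEN_SOURCE 700   /* assumption: needed to expose wcswidth() on some systems */
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not a library function): number of terminal
 * columns occupied by the multibyte string s in the current locale.
 * Returns -1 if s is not valid in the locale's encoding, if memory
 * runs out, or if wcswidth() finds a nonprintable character. */
int display_width(const char *s)
{
    mbstate_t st;
    const char *p = s;
    wchar_t *w;
    size_t n;
    int cols;

    /* First pass: count wide characters without storing them. */
    memset(&st, 0, sizeof st);
    n = mbsrtowcs(NULL, &p, 0, &st);
    if (n == (size_t)-1)
        return -1;

    w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return -1;

    /* Second pass: convert, including the terminating null. */
    p = s;
    memset(&st, 0, sizeof st);
    mbsrtowcs(w, &p, n + 1, &st);

    cols = wcswidth(w, n);
    free(w);
    return cols;
}
```

The value n from the first pass is the character count, while the return value is the cursor advance; in a UTF-8 locale neither needs to equal strlen(s).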
181 | .PP | |
57e79231 | 182 | The official ESC sequence to switch from an ISO 2022 |
fea681da | 183 | encoding scheme (as used for instance by VT100 terminals) to |
57e79231 | 184 | UTF-8 is ESC % G |
d1a71985 | 185 | ("\ex1b%G"). |
c13182ef | 186 | The corresponding return sequence from |
d1a71985 | 187 | UTF-8 to ISO 2022 is ESC % @ ("\ex1b%@"). |
c13182ef | 188 | Other ISO 2022 sequences (such as |
fea681da | 189 | for switching the G0 and G1 sets) are not applicable in UTF-8 mode. |
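
A small sketch of emitting the two escape sequences above from C follows. The function names are hypothetical. The sequences are written with fputs() rather than printf(), since "%G" in a printf() format string would be read as a conversion specification.

```c
#include <stdio.h>

/* Hypothetical helpers (not library functions). Each returns 0 on
 * success and -1 if the write fails. */

/* Ask the terminal to switch from ISO 2022 to UTF-8: ESC % G */
int term_select_utf8(FILE *out)
{
    return fputs("\x1b%G", out) == EOF ? -1 : 0;
}

/* Ask the terminal to return from UTF-8 to ISO 2022: ESC % @ */
int term_select_iso2022(FILE *out)
{
    return fputs("\x1b%@", out) == EOF ? -1 : 0;
}
```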
6427f1d8 | 190 | .SS Security |
57e79231 | 191 | The Unicode and UCS standards require that producers of UTF-8 |
75b94dc3 | 192 | shall use the shortest form possible; for example, producing a two-byte |
cfea5132 | 193 | sequence with first byte 0xc0 is nonconforming. |
57e79231 | 194 | Unicode 3.1 added the requirement that conforming programs must not accept |
c13182ef MK |
195 | non-shortest forms in their input. |
196 | This is for security reasons: if | |
fea681da | 197 | user input is checked for possible security violations, a program |
57e79231 | 198 | might check only for the ASCII |
fea681da | 199 | version of "/../" or ";" or NUL and overlook that there are many |
57e79231 | 200 | non-ASCII ways to represent these things in a non-shortest UTF-8 |
fea681da | 201 | encoding. |
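
The requirement above can be made concrete with a decoder that refuses non-shortest forms. This is a minimal sketch under stated assumptions, not a reference implementation: the name `utf8_decode` is hypothetical, and it follows the RFC 3629 limits (at most four bytes, no surrogates, nothing above U+10FFFF).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): decode one code point
 * from s[0..len). Returns the number of bytes consumed and stores the
 * code point in *out, or returns 0 if the input is truncated,
 * malformed, a non-shortest form, a surrogate, or above U+10FFFF. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    /* Smallest code point that legitimately needs n bytes. */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t n, i;
    uint32_t cp;

    if (len == 0)
        return 0;

    if (s[0] < 0x80) {                      /* plain ASCII */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        n = 2; cp = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        n = 3; cp = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        n = 4; cp = s[0] & 0x07;
    } else {
        return 0;                           /* stray continuation byte or 0xf8..0xff */
    }

    if (len < n)
        return 0;                           /* truncated sequence */

    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (cp < min_cp[n])
        return 0;                           /* non-shortest (overlong) form */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                           /* UTF-16 surrogate */
    if (cp > 0x10FFFF)
        return 0;                           /* beyond the Unicode range */

    *out = cp;
    return n;
}
```

With this check in place, the two-byte sequence 0xc0 0xaf, an overlong spelling of "/", is rejected instead of being decoded to 0x2f after an ASCII-only filter has already run.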
6427f1d8 | 202 | .SS Standards |
4550bf19 | 203 | ISO/IEC 10646-1:2000, Unicode 3.1, RFC\ 3629, Plan 9. |
fd7f0a7f MK |
204 | .\" .SH AUTHOR |
205 | .\" Markus Kuhn <mgk25@cl.cam.ac.uk> | |
47297adb | 206 | .SH SEE ALSO |
c664680a | 207 | .BR locale (1), |
fea681da MK |
208 | .BR nl_langinfo (3), |
209 | .BR setlocale (3), | |
210 | .BR charsets (7), | |
211 | .BR unicode (7) |