Commit | Line | Data |
---|---|---|
fea681da MK |
1 | .\" Copyright (C) Markus Kuhn, 1996, 2001 |
2 | .\" | |
1dd72f9c | 3 | .\" %%%LICENSE_START(GPLv2+_DOC_FULL) |
fea681da MK |
4 | .\" This is free documentation; you can redistribute it and/or |
5 | .\" modify it under the terms of the GNU General Public License as | |
6 | .\" published by the Free Software Foundation; either version 2 of | |
7 | .\" the License, or (at your option) any later version. | |
8 | .\" | |
9 | .\" The GNU General Public License's references to "object code" | |
10 | .\" and "executables" are to be interpreted as the output of any | |
11 | .\" document formatting or typesetting system, including | |
12 | .\" intermediate and printed output. | |
13 | .\" | |
14 | .\" This manual is distributed in the hope that it will be useful, | |
15 | .\" but WITHOUT ANY WARRANTY; without even the implied warranty of | |
16 | .\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
17 | .\" GNU General Public License for more details. | |
18 | .\" | |
19 | .\" You should have received a copy of the GNU General Public | |
c715f741 MK |
20 | .\" License along with this manual; if not, see |
21 | .\" <http://www.gnu.org/licenses/>. | |
6a8d8745 | 22 | .\" %%%LICENSE_END |
fea681da MK |
23 | .\" |
24 | .\" 1995-11-26 Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de> | |
25 | .\" First version written | |
26 | .\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk> | |
27 | .\" Update | |
28 | .\" | |
9ba01802 | 29 | .TH UTF-8 7 2019-03-06 "GNU" "Linux Programmer's Manual" |
fea681da | 30 | .SH NAME |
ae03dc66 | 31 | UTF-8 \- an ASCII compatible multibyte Unicode encoding |
fea681da | 32 | .SH DESCRIPTION |
57e79231 | 33 | The Unicode 3.0 character set occupies a 16-bit code space. |
c13182ef | 34 | The most obvious |
57e79231 | 35 | Unicode encoding (known as UCS-2) |
c13182ef | 36 | consists of a sequence of 16-bit words. |
76f6db57 | 37 | Such strings can contain\(emas part of many 16-bit characters\(embytes |
d1a71985 | 38 | such as \(aq\e0\(aq or \(aq/\(aq, which have a |
c4bb193f | 39 | special meaning in filenames and other C library function arguments. |
76f6db57 | 40 | In addition, the majority of UNIX tools expect ASCII files and can't |
c13182ef MK |
41 | read 16-bit words as characters without major modifications. |
42 | For these reasons, | |
57e79231 | 43 | UCS-2 is not a suitable external encoding of Unicode |
9581cf78 | 44 | in filenames, text files, environment variables, and so on. |
57e79231 | 45 | The ISO 10646 Universal Character Set (UCS), |
76f6db57 MK |
46 | a superset of Unicode, occupies an even larger code |
47 | space\(em31\ bits\(emand the obvious | |
57e79231 | 48 | UCS-4 encoding for it (a sequence of 32-bit words) has the same problems. |
a721e8b2 | 49 | .PP |
57e79231 | 50 | The UTF-8 encoding of Unicode and UCS |
fea681da | 51 | does not have these problems and is the common way in which |
57e79231 | 52 | Unicode is used on UNIX-style operating systems. |
6427f1d8 | 53 | .SS Properties |
57e79231 | 54 | The UTF-8 encoding has the following nice properties: |
fea681da MK |
55 | .TP 0.2i |
56 | * | |
57e79231 MK |
57 | UCS |
58 | characters 0x00000000 to 0x0000007f (the classic US-ASCII | |
fea681da | 59 | characters) are encoded simply as bytes 0x00 to 0x7f (ASCII |
c13182ef MK |
60 | compatibility). |
61 | This means that files and strings which contain only | |
62 | 7-bit ASCII characters have the same encoding under both | |
57e79231 | 63 | ASCII |
fea681da | 64 | and |
57e79231 | 65 | UTF-8 . |
fea681da MK |
66 | .TP |
67 | * | |
57e79231 | 68 | All UCS characters greater than 0x7f are encoded as a multibyte sequence |
fea681da MK |
69 | consisting only of bytes in the range 0x80 to 0xfd, so no ASCII |
70 | byte can appear as part of another character and there are no | |
d1a71985 | 71 | problems with, for example, \(aq\e0\(aq or \(aq/\(aq. |
fea681da MK |
72 | .TP |
73 | * | |
57e79231 | 74 | The lexicographic sorting order of UCS-4 strings is preserved. |
fea681da MK |
75 | .TP |
76 | * | |
57e79231 | 77 | All possible 2^31 UCS codes can be encoded using UTF-8. |
fea681da MK |
78 | .TP |
79 | * | |
57e79231 | 80 | The bytes 0xc0, 0xc1, 0xfe, and 0xff are never used in the UTF-8 encoding. |
fea681da | 81 | .TP |
c13182ef | 82 | * |
ae03dc66 | 83 | The first byte of a multibyte sequence which represents a single non-ASCII |
57e79231 | 84 | UCS character is always in the range 0xc2 to 0xfd and indicates how long |
ae03dc66 MK |
85 | this multibyte sequence is. |
86 | All further bytes in a multibyte sequence | |
c13182ef MK |
87 | are in the range 0x80 to 0xbf. |
88 | This allows easy resynchronization and | |
fea681da MK |
89 | makes the encoding stateless and robust against missing bytes. |
90 | .TP | |
91 | * | |
57e79231 MK |
92 | UTF-8 encoded UCS characters may be up to six bytes long; however, the |
93 | Unicode standard specifies no characters above 0x10ffff, so Unicode characters | |
33a0ccb2 | 94 | can be only up to four bytes long in |
57e79231 | 95 | UTF-8. |
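The resynchronization property listed above can be illustrated with a short sketch. The function name utf8_resync is hypothetical and not part of any library; it relies only on the fact that continuation bytes lie in the range 0x80 to 0xbf.

```c
#include <stddef.h>

/*
 * Hypothetical helper: starting at index pos, skip continuation bytes
 * (0x80..0xbf) and return the index of the first byte that can begin
 * a character. Returns len if no such byte is left in the buffer.
 */
size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    while (pos < len && (buf[pos] & 0xc0) == 0x80)
        pos++;
    return pos;
}
```

A reader that starts in the middle of a stream, or that has lost some bytes, could call such a helper once and then continue decoding from the returned index.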
6427f1d8 | 96 | .SS Encoding |
c13182ef MK |
97 | The following byte sequences are used to represent a character. |
98 | The sequence to be used depends on the UCS code number of the character: | |
fea681da | 99 | .TP 0.4i |
4d9b6984 | 100 | 0x00000000 \- 0x0000007F: |
fea681da MK |
101 | .RI 0 xxxxxxx |
102 | .TP | |
4d9b6984 | 103 | 0x00000080 \- 0x000007FF: |
c13182ef | 104 | .RI 110 xxxxx |
fea681da MK |
105 | .RI 10 xxxxxx |
106 | .TP | |
4d9b6984 | 107 | 0x00000800 \- 0x0000FFFF: |
fea681da MK |
108 | .RI 1110 xxxx |
109 | .RI 10 xxxxxx | |
110 | .RI 10 xxxxxx | |
111 | .TP | |
4d9b6984 | 112 | 0x00010000 \- 0x001FFFFF: |
fea681da MK |
113 | .RI 11110 xxx |
114 | .RI 10 xxxxxx | |
115 | .RI 10 xxxxxx | |
116 | .RI 10 xxxxxx | |
117 | .TP | |
4d9b6984 | 118 | 0x00200000 \- 0x03FFFFFF: |
fea681da MK |
119 | .RI 111110 xx |
120 | .RI 10 xxxxxx | |
121 | .RI 10 xxxxxx | |
122 | .RI 10 xxxxxx | |
123 | .RI 10 xxxxxx | |
124 | .TP | |
4d9b6984 | 125 | 0x04000000 \- 0x7FFFFFFF: |
fea681da MK |
126 | .RI 1111110 x |
127 | .RI 10 xxxxxx | |
128 | .RI 10 xxxxxx | |
129 | .RI 10 xxxxxx | |
130 | .RI 10 xxxxxx | |
131 | .RI 10 xxxxxx | |
132 | .PP | |
133 | The | |
134 | .I xxx | |
135 | bit positions are filled with the bits of the character code number in | |
ad0fbddd | 136 | binary representation, most significant bit first (big-endian). |
ae03dc66 | 137 | Only the shortest possible multibyte sequence |
fea681da MK |
138 | which can represent the code number of the character can be used. |
139 | .PP | |
57e79231 | 140 | The UCS code values 0xd800\(en0xdfff (UTF-16 surrogates) as well as 0xfffe and |
ad0fbddd SL |
141 | 0xffff (UCS noncharacters) should not appear in conforming UTF-8 streams. According |
142 | to RFC 3629, no code point above U+10FFFF should be used, which limits characters to four | |
143 | bytes. | |
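The bit layouts above translate fairly directly into code. The following is a minimal sketch, not a reference implementation: the name utf8_encode is hypothetical, the four-byte limit of RFC 3629 is assumed, and surrogates are rejected. Filtering of the noncharacters 0xfffe and 0xffff is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode one code point as UTF-8 into out.
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * UTF-16 surrogate or lies above 0x10ffff.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xd800 && cp <= 0xdfff)
            return 0;                       /* surrogates are not encoded */
        out[0] = (unsigned char)(0xe0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    }
    if (cp <= 0x10ffff) {                   /* 11110xxx and three 10xxxxxx */
        out[0] = (unsigned char)(0xf0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (cp & 0x3f));
        return 4;
    }
    return 0;                               /* beyond the RFC 3629 range */
}
```

Because each range check selects the smallest layout that fits, the sketch always produces the shortest possible sequence.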
6427f1d8 | 144 | .SS Example |
57e79231 | 145 | The Unicode character 0xa9 = 1010 1001 (the copyright sign) is encoded |
fea681da MK |
146 | in UTF-8 as |
147 | .PP | |
148 | .RS | |
149 | 11000010 10101001 = 0xc2 0xa9 | |
150 | .RE | |
151 | .PP | |
152 | and character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) is | |
153 | encoded as: | |
154 | .PP | |
155 | .RS | |
156 | 11100010 10001001 10100000 = 0xe2 0x89 0xa0 | |
157 | .RE | |
73d8cece | 158 | .SS Application notes |
57e79231 | 159 | Users have to select a UTF-8 locale, for example with |
fea681da MK |
160 | .PP |
161 | .RS | |
162 | export LANG=en_GB.UTF-8 | |
163 | .RE | |
164 | .PP | |
57e79231 | 165 | in order to activate the UTF-8 support in applications. |
fea681da MK |
166 | .PP |
167 | Application software that has to be aware of the character | |
168 | encoding in use should always set the locale with, for example, | |
169 | .PP | |
170 | .RS | |
171 | setlocale(LC_CTYPE, "") | |
172 | .RE | |
173 | .PP | |
174 | and programmers can then test the expression | |
175 | .PP | |
176 | .RS | |
177 | strcmp(nl_langinfo(CODESET), "UTF-8") == 0 | |
178 | .RE | |
179 | .PP | |
57e79231 | 180 | to determine whether a UTF-8 locale has been selected and whether |
fea681da MK |
181 | therefore all plaintext standard input and output, terminal |
182 | communication, plaintext file content, filenames and environment | |
57e79231 | 183 | variables are encoded in UTF-8. |
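The two calls above can be combined into one small function. This is only a sketch: the name locale_is_utf8 is hypothetical, and the parameter lets a caller pass "" to follow the environment, as the text recommends.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: select the LC_CTYPE locale given by name
 * (pass "" to use the environment) and report whether its codeset
 * is UTF-8. Returns 1 for UTF-8, 0 for any other codeset, and -1
 * if the locale could not be set.
 */
int locale_is_utf8(const char *name)
{
    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

A program would typically call locale_is_utf8("") once at startup and choose its text handling based on the result.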
fea681da | 184 | .PP |
57e79231 | 185 | Programmers accustomed to single-byte encodings such as US-ASCII or ISO 8859 |
fea681da | 186 | have to be aware that two assumptions made so far are no longer valid |
57e79231 | 187 | in UTF-8 locales. |
c13182ef MK |
188 | Firstly, a single byte does not necessarily correspond any |
189 | more to a single character. | |
57e79231 | 190 | Secondly, since modern terminal emulators in UTF-8 |
fea681da | 191 | mode also support Chinese, Japanese, and Korean |
57e79231 | 192 | double-width characters as well as nonspacing combining characters, |
fea681da | 193 | outputting a single character does not necessarily advance the cursor |
57e79231 | 194 | by one position as it did in ASCII. |
fea681da MK |
195 | Library functions such as |
196 | .BR mbsrtowcs (3) | |
197 | and | |
198 | .BR wcswidth (3) | |
199 | should be used today to count characters and cursor positions. | |
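As a rough illustration of the first point, the sketch below counts characters rather than bytes by asking mbsrtowcs(3) how many wide characters a string would convert to. The name mb_char_count is hypothetical, and the result depends on the LC_CTYPE locale that is currently selected.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: return the number of characters in the
 * multibyte string s under the current LC_CTYPE locale, or
 * (size_t) -1 if s contains an invalid multibyte sequence.
 */
size_t mb_char_count(const char *s)
{
    mbstate_t state;
    const char *src = s;

    memset(&state, 0, sizeof state);    /* start in the initial shift state */
    /* With a null destination, mbsrtowcs() only counts wide characters. */
    return mbsrtowcs(NULL, &src, 0, &state);
}
```

In a UTF-8 locale the count for a non-ASCII string is smaller than strlen(s). The number of terminal columns is a separate question, for which the text points to wcswidth(3).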
200 | .PP | |
57e79231 | 201 | The official ESC sequence to switch from an ISO 2022 |
fea681da | 202 | encoding scheme (as used for instance by VT100 terminals) to |
57e79231 | 203 | UTF-8 is ESC % G |
d1a71985 | 204 | ("\ex1b%G"). |
c13182ef | 205 | The corresponding return sequence from |
d1a71985 | 206 | UTF-8 to ISO 2022 is ESC % @ ("\ex1b%@"). |
c13182ef | 207 | Other ISO 2022 sequences (such as |
fea681da | 208 | for switching the G0 and G1 sets) are not applicable in UTF-8 mode. |
6427f1d8 | 209 | .SS Security |
57e79231 | 210 | The Unicode and UCS standards require that producers of UTF-8 |
75b94dc3 | 211 | shall use the shortest form possible; for example, producing a two-byte |
cfea5132 | 212 | sequence with first byte 0xc0 is nonconforming. |
57e79231 | 213 | Unicode 3.1 has added the requirement that conforming programs must not accept |
c13182ef MK |
214 | non-shortest forms in their input. |
215 | This is for security reasons: if | |
fea681da | 216 | user input is checked for possible security violations, a program |
57e79231 | 217 | might check only for the ASCII |
fea681da | 218 | version of "/../" or ";" or NUL and overlook that there are many |
57e79231 | 219 | non-ASCII ways to represent these things in a non-shortest UTF-8 |
fea681da | 220 | encoding. |
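A decoder can enforce the shortest-form rule by comparing the decoded value against the smallest code point that needs the given sequence length. The following is a minimal sketch under the RFC 3629 four-byte limit; utf8_decode_strict is a hypothetical name, not a library function.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode one character from the n bytes at s.
 * Returns the number of bytes consumed (1 to 4) and stores the code
 * point in *cp, or returns 0 if the input is truncated, malformed,
 * a non-shortest form, a surrogate, or above 0x10ffff.
 */
size_t utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs a given length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xe0) == 0xc0) {
        len = 2; c = s[0] & 0x1f;
    } else if ((s[0] & 0xf0) == 0xe0) {
        len = 3; c = s[0] & 0x0f;
    } else if ((s[0] & 0xf8) == 0xf0) {
        len = 4; c = s[0] & 0x07;
    } else {
        return 0;               /* stray continuation byte or 0xf8..0xff */
    }
    if (n < len)
        return 0;               /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return 0;           /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3f);
    }
    if (c < min_cp[len])
        return 0;               /* non-shortest form, e.g. 0xc0 0xaf for '/' */
    if (c > 0x10ffff || (c >= 0xd800 && c <= 0xdfff))
        return 0;
    *cp = c;
    return len;
}
```

With such a check in place, the two-byte sequence 0xc0 0xaf is rejected instead of being silently decoded as "/", so a later ASCII-only check for "/../" sees everything that could reach the filesystem.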
6427f1d8 | 221 | .SS Standards |
4550bf19 | 222 | ISO/IEC 10646-1:2000, Unicode 3.1, RFC\ 3629, Plan 9. |
fd7f0a7f MK |
223 | .\" .SH AUTHOR |
224 | .\" Markus Kuhn <mgk25@cl.cam.ac.uk> | |
47297adb | 225 | .SH SEE ALSO |
c664680a | 226 | .BR locale (1), |
fea681da MK |
227 | .BR nl_langinfo (3), |
228 | .BR setlocale (3), | |
229 | .BR charsets (7), | |
230 | .BR unicode (7) |