Commit | Line | Data |
---|---|---|
fea681da MK |
1 | .\" Copyright (c) 1996 Eric S. Raymond <esr@thyrsus.com> |
2 | .\" and Andries Brouwer <aeb@cwi.nl> | |
3 | .\" | |
89e3ffe9 | 4 | .\" %%%LICENSE_START(GPLv2+_DOC_ONEPARA) |
fea681da MK |
5 | .\" This is free documentation; you can redistribute it and/or |
6 | .\" modify it under the terms of the GNU General Public License as | |
7 | .\" published by the Free Software Foundation; either version 2 of | |
8 | .\" the License, or (at your option) any later version. | |
8f8359d8 | 9 | .\" %%%LICENSE_END |
fea681da MK |
10 | .\" |
11 | .\" This is combined from many sources, including notes by aeb and | |
12 | .\" research by esr. Portions derive from a writeup by Roman Czyborra. | |
13 | .\" | |
14 | .\" Last changed by David Starner <dstarner98@aasaa.ofe.org>. | |
608bf950 | 15 | .TH CHARSETS 7 2012-08-05 "Linux" "Linux Programmer's Manual" |
fea681da MK |
16 | .SH NAME |
17 | charsets \- programmer's view of character sets and internationalization | |
18 | .SH DESCRIPTION | |
c13182ef MK |
19 | Linux is an international operating system. |
20 | Various of its utilities | |
fea681da MK |
21 | and device drivers (including the console driver) support multilingual |
22 | character sets including Latin-alphabet letters with diacritical | |
23 | marks, accents, ligatures, and entire non-Latin alphabets including | |
24 | Greek, Cyrillic, Arabic, and Hebrew. | |
25 | .LP | |
26 | This manual page presents a programmer's-eye view of different | |
c13182ef MK |
27 | character-set standards and how they fit together on Linux. |
28 | Standards | |
fea681da | 29 | discussed include ASCII, ISO 8859, KOI8-R, Unicode, ISO 2022 and |
c13182ef MK |
30 | ISO 4873. |
31 | The primary emphasis is on character sets actually used as | |
fea681da MK |
32 | locale character sets, not the myriad others that can be found in data |
33 | from other systems. | |
34 | .LP | |
763f0e47 | 35 | A complete list of charsets used in an officially supported locale in glibc |
fea681da MK |
36 | 2.2.3 is: ISO-8859-{1,2,3,5,6,7,8,9,13,15}, CP1251, UTF-8, EUC-{KR,JP,TW}, |
37 | KOI8-{R,U}, GB2312, GB18030, GBK, BIG5, BIG5-HKSCS and TIS-620 (in no | |
c13182ef MK |
38 | particular order). |
39 | (Romanian may be switching to ISO-8859-16.) | |
1ce284ec | 40 | .SS ASCII |
fea681da | 41 | ASCII (American Standard Code For Information Interchange) is the original |
c13182ef MK |
42 | 7-bit character set, originally designed for American English. |
43 | It is currently described by the ECMA-6 standard. | |
fea681da MK |
44 | .LP |
45 | Various 7-bit ASCII variants exist that replace the dollar sign with other | |
46 | currency symbols and replace punctuation with non-English alphabetic | |
c13182ef MK |
47 | characters to cover German, French, Spanish, and other languages. |
48 | All are | |
5260fe08 | 49 | deprecated; glibc doesn't support locales whose character sets aren't |
6387216b MK |
50 | true supersets of ASCII. |
51 | (These sets are also known as ISO-646, a close | |
fea681da MK |
52 | relative of ASCII that permitted replacing these characters.) |
53 | .LP | |
54 | As Linux was written for hardware designed in the US, it natively | |
55 | supports ASCII. | |
1ce284ec | 56 | .SS ISO 8859 |
fea681da MK |
57 | ISO 8859 is a series of 15 8-bit character sets all of which have US |
58 | ASCII in their low (7-bit) half, invisible control characters in | |
59 | positions 128 to 159, and 96 fixed-width graphics in positions 160-255. | |
60 | .LP | |
c13182ef MK |
61 | Of these, the most important is ISO 8859-1 (Latin-1). |
62 | It is natively | |
fea681da MK |
63 | supported in the Linux console driver, fairly well supported in X11R6, |
64 | and is the base character set of HTML. | |
65 | .LP | |
66 | Console support for the other 8859 character sets is available under | |
67 | Linux through user-mode utilities (such as | |
68 | .BR setfont (8)) | |
69 | .\" // some distributions still have the deprecated consolechars | |
70 | that modify keyboard bindings and the EGA graphics | |
71 | table and employ the "user mapping" font table in the console | |
72 | driver. | |
73 | .LP | |
74 | Here are brief descriptions of each set: | |
75 | .TP | |
76 | 8859-1 (Latin-1) | |
77 | Latin-1 covers most Western European languages such as Albanian, Catalan, | |
78 | Danish, Dutch, English, Faroese, Finnish, French, German, Galician, | |
79 | Irish, Icelandic, Italian, Norwegian, Portuguese, Spanish, and | |
c13182ef MK |
80 | Swedish. |
81 | The lack of the ligatures Dutch ij, French oe and old-style | |
fea681da MK |
82 | ,,German`` quotation marks is considered tolerable. |
83 | .TP | |
84 | 8859-2 (Latin-2) | |
85 | Latin-2 supports most Latin-written Slavic and Central European | |
86 | languages: Croatian, Czech, German, Hungarian, Polish, Romanian, | |
87 | Slovak, and Slovene. | |
88 | .TP | |
89 | 8859-3 (Latin-3) | |
90 | Latin-3 is popular with authors of Esperanto, Galician, and Maltese. | |
91 | (Turkish is now written with 8859-9 instead.) | |
92 | .TP | |
93 | 8859-4 (Latin-4) | |
c13182ef MK |
94 | Latin-4 introduced letters for Estonian, Latvian, and Lithuanian. |
95 | It is essentially obsolete; see 8859-10 (Latin-6) and 8859-13 (Latin-7). | |
fea681da MK |
96 | .TP |
97 | 8859-5 | |
98 | Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian, | |
c13182ef | 99 | Russian, Serbian and Ukrainian. |
84c517a4 MK |
100 | Ukrainians read the letter "ghe" |
101 | with downstroke as "heh" and would need a ghe with upstroke to write a | |
c13182ef MK |
102 | correct ghe. |
103 | See the discussion of KOI8-R below. | |
fea681da MK |
104 | .TP |
105 | 8859-6 | |
c13182ef MK |
106 | Supports Arabic. |
107 | The 8859-6 glyph table is a fixed font of separate | |
fea681da MK |
108 | letter forms, but a proper display engine should combine these |
109 | using the proper initial, medial, and final forms. | |
110 | .TP | |
111 | 8859-7 | |
112 | Supports Modern Greek. | |
113 | .TP | |
114 | 8859-8 | |
c13182ef MK |
115 | Supports modern Hebrew without niqud (vowel points). |
116 | Niqud and full-fledged Biblical Hebrew are outside the scope of this | |
fea681da MK |
117 | character set; under Linux, UTF-8 is the preferred encoding for |
118 | these. | |
119 | .TP | |
120 | 8859-9 (Latin-5) | |
121 | This is a variant of Latin-1 that replaces Icelandic letters with | |
122 | Turkish ones. | |
123 | .TP | |
124 | 8859-10 (Latin-6) | |
125 | Latin 6 adds the last Inuit (Greenlandic) and Sami (Lappish) letters | |
c13182ef | 126 | that were missing in Latin 4 to cover the entire Nordic area. |
84c517a4 | 127 | RFC 1345 listed a preliminary and different "latin6". |
c13182ef | 128 | Skolt Sami still |
fea681da MK |
129 | needs a few more accents than these. |
130 | .TP | |
131 | 8859-11 | |
c13182ef MK |
132 | This only exists as a rejected draft standard. |
133 | The draft standard | |
fea681da MK |
134 | was identical to TIS-620, which is used under Linux for Thai. |
135 | .TP | |
136 | 8859-12 | |
c13182ef MK |
137 | This set does not exist. |
138 | While Vietnamese has been suggested for this | |
24b74457 | 139 | space, it does not fit within the 96 (noncombining) characters ISO |
c13182ef MK |
140 | 8859 offers. |
141 | UTF-8 is the preferred character set for Vietnamese use | |
fea681da MK |
142 | under Linux. |
143 | .TP | |
144 | 8859-13 (Latin-7) | |
145 | Supports the Baltic Rim languages; in particular, it includes Latvian | |
146 | characters not found in Latin-4. | |
147 | .TP | |
148 | 8859-14 (Latin-8) | |
149 | This is the Celtic character set, covering Gaelic and Welsh. | |
150 | This charset also contains the dotted characters needed for Old Irish. | |
151 | .TP | |
152 | 8859-15 (Latin-9) | |
153 | This adds the Euro sign and French and Finnish letters that were missing in | |
154 | Latin-1. | |
155 | .TP | |
156 | 8859-16 (Latin-10) | |
157 | This set covers many of the languages covered by 8859-2, and supports | |
158 | Romanian more completely than that set does. | |
1ce284ec | 159 | .SS KOI8-R |
c13182ef MK |
160 | KOI8-R is a non-ISO character set popular in Russia. |
161 | The lower half | |
fea681da | 162 | is US ASCII; the upper is a Cyrillic character set somewhat better |
c13182ef MK |
163 | designed than ISO 8859-5. |
164 | KOI8-U is a common character set, based on | |
165 | KOI8-R, that has better support for Ukrainian. | |
166 | Neither of these sets | |
fea681da MK |
167 | is ISO-2022 compatible, unlike the ISO-8859 series. |
168 | .LP | |
169 | Console support for KOI8-R is available under Linux through user-mode | |
170 | utilities that modify keyboard bindings and the EGA graphics table, | |
171 | and employ the "user mapping" font table in the console driver. | |
c13182ef | 172 | .\" Thanks to Tomohiro KUBOTA for the following sections about |
fea681da | 173 | .\" national standards. |
1ce284ec | 174 | .SS JIS X 0208 |
c13182ef MK |
175 | JIS X 0208 is a Japanese national standard character set. |
176 | Although there are other Japanese national standard character sets (like | |
177 | JIS X 0201, JIS X 0212, and JIS X 0213), this is the most important one. | |
178 | Characters are mapped into a 94x94 two-byte matrix, | |
179 | each byte of which is in the range 0x21-0x7e. | |
180 | Note that JIS X 0208 is a character set, not an encoding. | |
181 | This means that JIS X 0208 | |
182 | itself is not used for expressing text data. | |
183 | JIS X 0208 is used | |
fea681da | 184 | as a component to construct encodings such as EUC-JP, Shift_JIS, |
c13182ef MK |
185 | and ISO-2022-JP. |
186 | EUC-JP is the most important encoding for Linux | |
187 | and includes US ASCII and JIS X 0208. | |
188 | In EUC-JP, JIS X 0208 | |
fea681da MK |
189 | characters are expressed in two bytes, each of which is the |
190 | JIS X 0208 code plus 0x80. | |
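The "plus 0x80" rule can be sketched in a few lines of Python. This is only an illustration: the function name jis_to_euc_jp is hypothetical, and the sample code 0x2422 (HIRAGANA LETTER A) is chosen here as an example, not taken from this page. The result is checked against Python's own euc_jp codec.

```python
def jis_to_euc_jp(jis_code):
    """Convert a two-byte JIS X 0208 code (for example 0x2422) to EUC-JP bytes.

    Hypothetical helper for illustration; it handles only JIS X 0208
    characters, not the ASCII part of EUC-JP.
    """
    hi, lo = jis_code >> 8, jis_code & 0xFF
    # Both bytes of a JIS X 0208 code lie in the 94x94 matrix 0x21-0x7e.
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError("each byte must be in the range 0x21-0x7e")
    # EUC-JP sets the high bit of each byte, that is, adds 0x80.
    return bytes([hi + 0x80, lo + 0x80])


euc = jis_to_euc_jp(0x2422)
print(euc.hex(), euc.decode("euc_jp"))
```

Because both resulting bytes have the high bit set, they cannot be confused with the US ASCII half of the encoding.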
1ce284ec | 191 | .SS KS X 1001 |
c13182ef MK |
192 | KS X 1001 is a Korean national standard character set. |
193 | As in | |
fea681da MK |
194 | JIS X 0208, characters are mapped into a 94x94 two-byte matrix. |
195 | KS X 1001 is used like JIS X 0208, as a component | |
196 | to construct encodings such as EUC-KR, Johab, and ISO-2022-KR. | |
197 | EUC-KR is the most important encoding for Linux and includes | |
c13182ef MK |
198 | US ASCII and KS X 1001. |
199 | KS C 5601 is an older name for KS X 1001. | |
1ce284ec | 200 | .SS GB 2312 |
fea681da | 201 | GB 2312 is a mainland Chinese national standard character set used |
c13182ef MK |
202 | to express simplified Chinese. |
203 | Just like JIS X 0208, characters are | |
204 | mapped into a 94x94 two-byte matrix used to construct EUC-CN. | |
205 | EUC-CN | |
fea681da | 206 | is the most important encoding for Linux and includes US ASCII and |
c13182ef MK |
207 | GB 2312. |
208 | Note that EUC-CN is often called GB, GB 2312, or CN-GB. | |
1ce284ec | 209 | .SS Big5 |
fea681da | 210 | Big5 is a character set popular in Taiwan for expressing traditional |
c13182ef MK |
211 | Chinese. |
212 | (Big5 is both a character set and an encoding.) | |
213 | It is a superset of US ASCII. | |
214 | Non-ASCII characters are expressed in two bytes. | |
215 | Bytes 0xa1-0xfe are used as leading bytes for two-byte characters. | |
216 | Big5 and its extensions are widely used in Taiwan and Hong Kong. | |
217 | It is not ISO 2022-compliant. | |
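As a rough sketch of what the lead-byte rule means for a program, the following Python fragment splits a Big5 byte string into characters. The function name split_big5 and the sample text are assumptions made for this illustration. Note that the second byte of a Big5 character may fall in the ASCII range, so a stream has to be scanned from its start rather than from an arbitrary offset.

```python
def split_big5(data):
    """Split Big5-encoded bytes into one bytes object per character.

    Hypothetical helper: a byte in 0xa1-0xfe is taken as the leading
    byte of a two-byte character; any other byte stands alone.
    """
    chars = []
    i = 0
    while i < len(data):
        width = 2 if 0xA1 <= data[i] <= 0xFE else 1
        chars.append(data[i:i + width])
        i += width
    return chars


text = "A\u4e2dB\u6587"            # ASCII letters mixed with two hanzi
pieces = split_big5(text.encode("big5"))
print([p.hex() for p in pieces])
```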
1ce284ec | 218 | .SS TIS 620 |
fea681da | 219 | TIS 620 is a Thai national standard character set and a superset |
c13182ef MK |
220 | of US ASCII. |
221 | As in the ISO 8859 series, Thai characters are mapped into | |
222 | 0xa1-0xfe. | |
223 | TIS 620 is the only commonly used character set under | |
fea681da | 224 | Linux besides UTF-8 to have combining characters. |
1ce284ec | 225 | .SS UNICODE |
fea681da | 226 | Unicode (ISO 10646) is a standard which aims to unambiguously represent every |
c13182ef MK |
227 | character in every human language. |
228 | Unicode's structure permits 20.1 bits to encode every character. | |
229 | Since most computers don't include 20.1-bit | |
fea681da MK |
230 | integers, Unicode is usually encoded as 32-bit integers internally and |
231 | either a series of 16-bit integers (UTF-16) (needing two 16-bit integers | |
232 | only when encoding certain rare characters) or a series of 8-bit bytes | |
c13182ef | 233 | (UTF-8). |
608bf950 SK |
234 | Information on Unicode is available at |
235 | .UR http://www.unicode.org | |
236 | .UE . | |
fea681da MK |
237 | .LP |
238 | Linux represents Unicode using the 8-bit Unicode Transformation Format | |
c13182ef MK |
239 | (UTF-8). |
240 | UTF-8 is a variable length encoding of Unicode. | |
241 | It uses 1 | |
fea681da MK |
242 | byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, 4 bytes |
243 | for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits. | |
244 | .LP | |
c13182ef MK |
245 | Let 0,1,x stand for a zero, one, or arbitrary bit. |
246 | A byte 0xxxxxxx | |
fea681da | 247 | stands for the Unicode 00000000 0xxxxxxx which codes the same symbol |
c13182ef MK |
248 | as the ASCII 0xxxxxxx. |
249 | Thus, ASCII goes unchanged into UTF-8, and | |
fea681da MK |
250 | people using only ASCII do not notice any change: not in code, and not |
251 | in file size. | |
252 | .LP | |
253 | A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy | |
c13182ef MK |
254 | is assembled into 00000xxx xxyyyyyy. |
255 | A byte 1110xxxx is the start | |
fea681da MK |
256 | of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled |
257 | into xxxxyyyy yyzzzzzz. | |
258 | (When UTF-8 is used to code the 31-bit ISO 10646 | |
259 | then this progression continues up to 6-byte codes.) | |
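The bit layout described above can be written out as a small encoder. This is a minimal sketch, not a reference implementation: the name utf8_encode is hypothetical, the function stops at 4-byte codes (21 bits), and it does not reject surrogate code points. Its output is compared with Python's built-in UTF-8 encoder.

```python
def utf8_encode(cp):
    """Encode one code point as UTF-8, following the bit patterns above.

    Hypothetical helper; handles 1- to 4-byte codes only.
    """
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10yyyyyy
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x200000:                  # 11110xxx followed by three tail bytes
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("5- and 6-byte codes are not handled in this sketch")


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(hex(cp), utf8_encode(cp).hex())
```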
260 | .LP | |
261 | For most people who use ISO-8859 character sets, this means that the | |
c13182ef MK |
262 | characters outside of ASCII are now coded with two bytes. |
263 | This tends | |
264 | to expand ordinary text files by only one or two percent. | |
265 | For Russian | |
fea681da | 266 | or Greek users, this expands ordinary text files by 100%, since text in |
c13182ef MK |
267 | those languages is mostly outside of ASCII. |
268 | For Japanese users this means | |
269 | that the 16-bit codes now in common use will take three bytes. | |
270 | While there | |
8bb0494f | 271 | are algorithmic conversions from some character sets (especially ISO-8859-1) to |
fea681da MK |
272 | Unicode, general conversion requires carrying around conversion tables, |
273 | which can be quite large for 16-bit codes. | |
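The size effects described above can be observed with Python's standard codecs. The sample strings and the helper names sizes and latin1_to_unicode are assumptions chosen for this illustration; the second helper shows the algorithmic ISO-8859-1 conversion, where each byte value is already the Unicode code point.

```python
# Sample texts, keyed by the legacy codec used to encode them.
samples = {
    "latin-1": "caf\u00e9",                             # one non-ASCII letter
    "koi8_r": "\u041f\u0440\u0438\u0432\u0435\u0442",   # Russian, all Cyrillic
    "euc_jp": "\u65e5\u672c\u8a9e",                     # Japanese, three kanji
}


def sizes(text, legacy):
    """Return (size in the legacy encoding, size in UTF-8), in bytes."""
    return len(text.encode(legacy)), len(text.encode("utf-8"))


def latin1_to_unicode(data):
    """ISO-8859-1 needs no table: each byte value is the code point."""
    return "".join(chr(b) for b in data)


for codec, text in samples.items():
    old, new = sizes(text, codec)
    print(f"{codec}: {old} -> {new} bytes")
```

The other encodings, such as KOI8-R or EUC-JP, have no such arithmetic relation to Unicode, which is why the codecs above rely on built-in tables.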
274 | .LP | |
275 | Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail, any other | |
c13182ef MK |
276 | byte is the head of a code. |
277 | Note that the only way ASCII bytes occur | |
278 | in a UTF-8 stream is as themselves. | |
279 | In particular, there are no | |
f81fb444 | 280 | embedded NULs (\(aq\\0\(aq) or \(aq/\(aqs that form part of some larger code. |
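Self-synchronization means a program that lands in the middle of a UTF-8 stream can find the nearest character boundary by skipping tail bytes. A minimal sketch, assuming well-formed input (the name char_start is hypothetical):

```python
def char_start(buf, pos):
    """Return the index of the head byte of the character containing buf[pos].

    Hypothetical helper: steps backward while the byte is a tail
    (bit pattern 10xxxxxx). Assumes buf is well-formed UTF-8.
    """
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


buf = "a\u20acb".encode("utf-8")       # 'a', a 3-byte euro sign, 'b'
print([char_start(buf, i) for i in range(len(buf))])

# No byte of a multibyte code is an ASCII byte such as NUL or '/'.
assert all(b >= 0x80 for b in "\u20ac".encode("utf-8"))
```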
fea681da | 281 | .LP |
f81fb444 | 282 | Since ASCII, and, in particular, NUL and \(aq/\(aq, are unchanged, the |
c13182ef MK |
283 | kernel does not notice that UTF-8 is being used. |
284 | It does not care at | |
fea681da MK |
285 | all what the bytes it is handling stand for. |
286 | .LP | |
287 | Rendering of Unicode data streams is typically handled through | |
84c517a4 | 288 | "subfont" tables which map a subset of Unicode to glyphs. |
c13182ef | 289 | Internally |
fea681da MK |
290 | the kernel uses Unicode to describe the subfont loaded in video RAM. |
291 | This means that in UTF-8 mode one can use a character set with 512 | |
c13182ef MK |
292 | different symbols. |
293 | This is not enough for Japanese, Chinese and | |
fea681da MK |
294 | Korean, but it is enough for most other purposes. |
295 | .LP | |
296 | At the current time, the console driver does not handle combining | |
c13182ef MK |
297 | characters. |
298 | So Thai, Sioux and any other script needing combining | |
fea681da | 299 | characters can't be handled on the console. |
73d8cece | 300 | .SS ISO 2022 and ISO 4873 |
fea681da | 301 | The ISO 2022 and 4873 standards describe a font-control model |
c13182ef MK |
302 | based on VT100 practice. |
303 | This model is (partially) supported | |
fea681da MK |
304 | by the Linux kernel and by |
305 | .BR xterm (1). | |
306 | It is popular in Japan and Korea. | |
307 | .LP | |
308 | There are 4 graphic character sets, called G0, G1, G2 and G3, | |
309 | and one of them is the current character set for codes with | |
310 | high bit zero (initially G0), and one of them is the current | |
311 | character set for codes with high bit one (initially G1). | |
312 | Each graphic character set has 94 or 96 characters, and is | |
c13182ef MK |
313 | essentially a 7-bit character set. |
314 | It uses codes either | |
fea681da MK |
315 | 040-0177 (041-0176) or 0240-0377 (0241-0376). |
316 | G0 always has size 94 and uses codes 041-0176. | |
317 | .LP | |
318 | Switching between character sets is done using the shift functions | |
8bb93cd4 | 319 | \fB^N\fP (SO or LS1), \fB^O\fP (SI or LS0), ESC n (LS2), ESC o (LS3), |
fea681da MK |
320 | ESC N (SS2), ESC O (SS3), ESC ~ (LS1R), ESC } (LS2R), ESC | (LS3R). |
321 | The function LS\fIn\fP makes character set G\fIn\fP the current one | |
322 | for codes with high bit zero. | |
323 | The function LS\fIn\fPR makes character set G\fIn\fP the current one | |
324 | for codes with high bit one. | |
325 | The function SS\fIn\fP makes character set G\fIn\fP (\fIn\fP=2 or 3) | |
326 | the current one for the next character only (regardless of the value | |
327 | of its high order bit). | |
328 | .LP | |
329 | A 94-character set is designated as G\fIn\fP character set | |
330 | by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1), | |
331 | ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol | |
332 | or a pair of symbols found in the ISO 2375 International | |
333 | Register of Coded Character Sets. | |
334 | For example, ESC ( @ selects the ISO 646 character set as G0, | |
335 | ESC ( A selects the UK standard character set (with pound | |
336 | instead of number sign), ESC ( B selects ASCII (with dollar | |
337 | instead of currency sign), ESC ( M selects a character set | |
338 | for African languages, ESC ( ! A selects the Cuban character | |
bb0e6cec | 339 | set, and so on. |
fea681da MK |
340 | .LP |
341 | A 96-character set is designated as G\fIn\fP character set | |
4d9b6984 | 342 | by an escape sequence ESC \- xx (for G1), ESC . xx (for G2) |
fea681da | 343 | or ESC / xx (for G3). |
4d9b6984 | 344 | For example, ESC \- G selects the Hebrew alphabet as G1. |
fea681da MK |
345 | .LP |
346 | A multibyte character set is designated as G\fIn\fP character set | |
347 | by an escape sequence ESC $ xx or ESC $ ( xx (for G0), | |
348 | ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3). | |
349 | For example, ESC $ ( C selects the Korean character set for G0. | |
350 | The Japanese character set selected by ESC $ B has a more | |
351 | recent version selected by ESC & @ ESC $ B. | |
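These designation sequences can be seen in practice in ISO-2022-JP output. The following sketch uses Python's standard iso2022_jp codec and a sample character chosen for illustration; the variable names are hypothetical. It assumes the codec designates the Japanese set with ESC $ B and returns to ASCII with ESC ( B at the end of the string.

```python
ESC = b"\x1b"

text = "\u3042"                        # HIRAGANA LETTER A
encoded = text.encode("iso2022_jp")

designate_jis = ESC + b"$B"            # ESC $ B: Japanese multibyte set as G0
designate_ascii = ESC + b"(B"          # ESC ( B: ASCII as G0

# What remains between the two escape sequences is the two-byte code itself.
payload = encoded[len(designate_jis):-len(designate_ascii)]
print(encoded, payload.hex())

# The whole stream stays within 7 bits; only G0 is ever used.
assert all(b < 0x80 for b in encoded)
```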
352 | .LP | |
353 | ISO 4873 stipulates a narrower use of character sets, where G0 | |
354 | is fixed (always ASCII), so that G1, G2 and G3 | |
355 | can only be invoked for codes with the high order bit set. | |
8bb93cd4 | 356 | In particular, \fB^N\fP and \fB^O\fP are not used anymore, ESC ( xx |
fea681da | 357 | can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx |
4d9b6984 | 358 | are equivalent to ESC \- xx, ESC . xx, ESC / xx, respectively. |
47297adb | 359 | .SH SEE ALSO |
fea681da MK |
360 | .BR console (4), |
361 | .BR console_codes (4), | |
362 | .BR console_ioctl (4), | |
363 | .BR ascii (7), | |
364 | .BR iso_8859-1 (7), | |
365 | .BR unicode (7), | |
366 | .BR utf-8 (7) |