Commit | Line | Data |
---|---|---|
fea681da MK |
1 | .\" Copyright (c) 1996 Eric S. Raymond <esr@thyrsus.com> |
2 | .\" and Andries Brouwer <aeb@cwi.nl> | |
3 | .\" | |
4 | .\" This is free documentation; you can redistribute it and/or | |
5 | .\" modify it under the terms of the GNU General Public License as | |
6 | .\" published by the Free Software Foundation; either version 2 of | |
7 | .\" the License, or (at your option) any later version. | |
8 | .\" | |
9 | .\" This is combined from many sources, including notes by aeb and | |
10 | .\" research by esr. Portions derive from a writeup by Roman Czyborra. | |
11 | .\" | |
12 | .\" Last changed by David Starner <dstarner98@aasaa.ofe.org>. | |
13 | .TH CHARSETS 7 2001-05-07 "Linux" "Linux Programmer's Manual" | |
14 | .SH NAME | |
15 | charsets \- programmer's view of character sets and internationalization | |
16 | .SH DESCRIPTION | |
c13182ef MK |
17 | Linux is an international operating system. |
18 | Many of its utilities | |
fea681da MK |
19 | and device drivers (including the console driver) support multilingual |
20 | character sets including Latin-alphabet letters with diacritical | |
21 | marks, accents, ligatures, and entire non-Latin alphabets including | |
22 | Greek, Cyrillic, Arabic, and Hebrew. | |
23 | .LP | |
24 | This manual page presents a programmer's-eye view of different | |
c13182ef MK |
25 | character-set standards and how they fit together on Linux. |
26 | Standards | |
fea681da | 27 | discussed include ASCII, ISO 8859, KOI8-R, Unicode, ISO 2022 and |
c13182ef MK |
28 | ISO 4873. |
29 | The primary emphasis is on character sets actually used as | |
fea681da MK |
30 | locale character sets, not the myriad others that can be found in data |
31 | from other systems. | |
32 | .LP | |
33 | A complete list of charsets used in an officially supported locale in glibc | |
34 | 2.2.3 is: ISO-8859-{1,2,3,5,6,7,8,9,13,15}, CP1251, UTF-8, EUC-{KR,JP,TW}, | |
35 | KOI8-{R,U}, GB2312, GB18030, GBK, BIG5, BIG5-HKSCS and TIS-620 (in no | |
c13182ef MK |
36 | particular order). |
37 | (Romanian may be switching to ISO-8859-16.) | |
fea681da MK |
38 | .SH ASCII |
39 | ASCII (American Standard Code for Information Interchange) is the original | |
c13182ef MK |
40 | 7-bit character set, originally designed for American English. |
41 | It is currently described by the ECMA-6 standard. | |
fea681da MK |
42 | .LP |
43 | Various ASCII variants exist that replace the dollar sign with other | |
44 | currency symbols and replace punctuation with non-English alphabetic | |
c13182ef MK |
45 | characters to cover German, French, Spanish and others in 7 bits. |
46 | All are | |
fea681da MK |
47 | deprecated; GNU libc doesn't support locales whose character sets aren't |
48 | true supersets of ASCII. (These sets are also known as ISO-646, a close | |
49 | relative of ASCII that permitted replacing these characters.) | |
50 | .LP | |
51 | As Linux was written for hardware designed in the US, it natively | |
52 | supports ASCII. | |
fea681da MK |
53 | .SH ISO 8859 |
54 | ISO 8859 is a series of 15 8-bit character sets all of which have US | |
55 | ASCII in their low (7-bit) half, invisible control characters in | |
56 | positions 128 to 159, and 96 fixed-width graphics in positions 160-255. | |
57 | .LP | |
c13182ef MK |
58 | Of these, the most important is ISO 8859-1 (Latin-1). |
59 | It is natively | |
fea681da MK |
60 | supported in the Linux console driver, fairly well supported in X11R6, |
61 | and is the base character set of HTML. | |
62 | .LP | |
63 | Console support for the other 8859 character sets is available under | |
64 | Linux through user-mode utilities (such as | |
65 | .BR setfont (8)) | |
66 | .\" // some distributions still have the deprecated consolechars | |
67 | that modify keyboard bindings and the EGA graphics | |
68 | table and employ the "user mapping" font table in the console | |
69 | driver. | |
70 | .LP | |
71 | Here are brief descriptions of each set: | |
72 | .TP | |
73 | 8859-1 (Latin-1) | |
74 | Latin-1 covers most Western European languages such as Albanian, Catalan, | |
75 | Danish, Dutch, English, Faroese, Finnish, French, German, Galician, | |
76 | Irish, Icelandic, Italian, Norwegian, Portuguese, Spanish, and | |
c13182ef MK |
77 | Swedish. |
78 | The lack of the ligatures Dutch ij, French oe and old-style | |
fea681da MK |
79 | ,,German`` quotation marks is considered tolerable. |
80 | .TP | |
81 | 8859-2 (Latin-2) | |
82 | Latin-2 supports most Latin-written Slavic and Central European | |
83 | languages: Croatian, Czech, German, Hungarian, Polish, Romanian, | |
84 | Slovak, and Slovene. | |
85 | .TP | |
86 | 8859-3 (Latin-3) | |
87 | Latin-3 is popular with authors of Esperanto, Galician, and Maltese. | |
88 | (Turkish is now written with 8859-9 instead.) | |
89 | .TP | |
90 | 8859-4 (Latin-4) | |
c13182ef MK |
91 | Latin-4 introduced letters for Estonian, Latvian, and Lithuanian. |
92 | It is essentially obsolete; see 8859-10 (Latin-6) and 8859-13 (Latin-7). | |
fea681da MK |
93 | .TP |
94 | 8859-5 | |
95 | Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian, | |
c13182ef MK |
96 | Russian, Serbian and Ukrainian. |
97 | Ukrainians read the letter `ghe' | |
fea681da | 98 | with downstroke as `heh' and would need a ghe with upstroke to write a |
c13182ef MK |
99 | correct ghe. |
100 | See the discussion of KOI8-R below. | |
fea681da MK |
101 | .TP |
102 | 8859-6 | |
c13182ef MK |
103 | Supports Arabic. |
104 | The 8859-6 glyph table is a fixed font of separate | |
fea681da MK |
105 | letter forms, but a proper display engine should combine these |
106 | using the proper initial, medial, and final forms. | |
107 | .TP | |
108 | 8859-7 | |
109 | Supports Modern Greek. | |
110 | .TP | |
111 | 8859-8 | |
c13182ef MK |
112 | Supports modern Hebrew without niqud (vowel points). |
113 | Niqud and full-fledged Biblical Hebrew are outside the scope of this | |
fea681da MK |
114 | character set; under Linux, UTF-8 is the preferred encoding for |
115 | these. | |
116 | .TP | |
117 | 8859-9 (Latin-5) | |
118 | This is a variant of Latin-1 that replaces Icelandic letters with | |
119 | Turkish ones. | |
120 | .TP | |
121 | 8859-10 (Latin-6) | |
122 | Latin-6 adds the last Inuit (Greenlandic) and Sami (Lappish) letters | |
c13182ef MK |
123 | that were missing in Latin-4 to cover the entire Nordic area. |
124 | RFC 1345 listed a preliminary and different `latin6'. | |
125 | Skolt Sami still | |
fea681da MK |
126 | needs a few more accents than these. |
127 | .TP | |
128 | 8859-11 | |
c13182ef MK |
129 | This only exists as a rejected draft standard. |
130 | The draft standard | |
fea681da MK |
131 | was identical to TIS-620, which is used under Linux for Thai. |
132 | .TP | |
133 | 8859-12 | |
c13182ef MK |
134 | This set does not exist. |
135 | While Vietnamese has been suggested for this | |
fea681da | 136 | space, it does not fit within the 96 (non-combining) characters ISO |
c13182ef MK |
137 | 8859 offers. |
138 | UTF-8 is the preferred character set for Vietnamese use | |
fea681da MK |
139 | under Linux. |
140 | .TP | |
141 | 8859-13 (Latin-7) | |
142 | Supports the Baltic Rim languages; in particular, it includes Latvian | |
143 | characters not found in Latin-4. | |
144 | .TP | |
145 | 8859-14 (Latin-8) | |
146 | This is the Celtic character set, covering Gaelic and Welsh. | |
147 | This charset also contains the dotted characters needed for Old Irish. | |
148 | .TP | |
149 | 8859-15 (Latin-9) | |
150 | This adds the Euro sign and French and Finnish letters that were missing in | |
151 | Latin-1. | |
152 | .TP | |
153 | 8859-16 (Latin-10) | |
154 | This set covers many of the languages covered by 8859-2, and supports | |
155 | Romanian more completely than that set does. | |
156 | .SH KOI8-R | |
c13182ef MK |
157 | KOI8-R is a non-ISO character set popular in Russia. |
158 | The lower half | |
fea681da | 159 | is US ASCII; the upper is a Cyrillic character set somewhat better |
c13182ef MK |
160 | designed than ISO 8859-5. |
161 | KOI8-U is a common character set, based on | |
162 | KOI8-R, that has better support for Ukrainian. | |
163 | Neither of these sets | |
fea681da MK |
164 | is ISO-2022 compatible, unlike the ISO-8859 series. |
165 | .LP | |
166 | Console support for KOI8-R is available under Linux through user-mode | |
167 | utilities that modify keyboard bindings and the EGA graphics table, | |
168 | and employ the "user mapping" font table in the console driver. | |
c13182ef | 169 | .\" Thanks to Tomohiro KUBOTA for the following sections about |
fea681da MK |
170 | .\" national standards. |
171 | .SH JIS X 0208 | |
c13182ef MK |
172 | JIS X 0208 is a Japanese national standard character set. |
173 | Though there are other Japanese national standard character sets (such as | |
174 | JIS X 0201, JIS X 0212, and JIS X 0213), this is the most important one. | |
175 | Characters are mapped into a 94x94 two-byte matrix, | |
176 | each byte of which is in the range 0x21-0x7e. | |
177 | Note that JIS X 0208 is a character set, not an encoding. | |
178 | This means that JIS X 0208 | |
179 | itself is not used for expressing text data. | |
180 | JIS X 0208 is used | |
fea681da | 181 | as a component to construct encodings such as EUC-JP, Shift_JIS, |
c13182ef MK |
182 | and ISO-2022-JP. |
183 | EUC-JP is the most important encoding for Linux | |
184 | and includes US ASCII and JIS X 0208. | |
185 | In EUC-JP, JIS X 0208 | |
fea681da MK |
186 | characters are expressed in two bytes, each of which is the |
187 | JIS X 0208 code plus 0x80. | |
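
The "code plus 0x80" rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `jisx0208_to_euc_jp` is hypothetical, and the sample value 0x2422 (hiragana "a", row 4, cell 2) is chosen here for demonstration and does not come from the manual page.

```python
def jisx0208_to_euc_jp(code: bytes) -> bytes:
    """Map one two-byte JIS X 0208 code to its EUC-JP byte pair.

    Hypothetical helper for illustration: each byte must lie in
    0x21-0x7e, and the EUC-JP form is each byte plus 0x80.
    """
    if len(code) != 2 or not all(0x21 <= b <= 0x7E for b in code):
        raise ValueError("expected two bytes, each in the range 0x21-0x7e")
    return bytes(b + 0x80 for b in code)


# 0x24 0x22 is the JIS X 0208 position of hiragana "a" (U+3042).
euc = jisx0208_to_euc_jp(b"\x24\x22")
print(euc.hex(), euc.decode("euc_jp"))
```

Python's built-in `euc_jp` codec is used only to cross-check the result; the arithmetic itself needs no conversion table.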
fea681da | 188 | .SH KS X 1001 |
c13182ef MK |
189 | KS X 1001 is a Korean national standard character set. |
190 | Just as in | |
fea681da MK |
191 | JIS X 0208, characters are mapped into a 94x94 two-byte matrix. |
192 | KS X 1001 is used like JIS X 0208, as a component | |
193 | to construct encodings such as EUC-KR, Johab, and ISO-2022-KR. | |
194 | EUC-KR is the most important encoding for Linux and includes | |
c13182ef MK |
195 | US ASCII and KS X 1001. |
196 | KS C 5601 is an older name for KS X 1001. | |
fea681da MK |
197 | .SH GB 2312 |
198 | GB 2312 is a mainland Chinese national standard character set used | |
c13182ef MK |
199 | to express simplified Chinese. |
200 | Just like JIS X 0208, characters are | |
201 | mapped into a 94x94 two-byte matrix used to construct EUC-CN. | |
202 | EUC-CN | |
fea681da | 203 | is the most important encoding for Linux and includes US ASCII and |
c13182ef MK |
204 | GB 2312. |
205 | Note that EUC-CN is often called GB, GB 2312, or CN-GB. | |
fea681da MK |
206 | .SH Big5 |
207 | Big5 is a popular character set in Taiwan to express traditional | |
c13182ef MK |
208 | Chinese. |
209 | (Big5 is both a character set and an encoding.) | |
210 | It is a superset of US ASCII. | |
211 | Non-ASCII characters are expressed in two bytes. | |
212 | Bytes 0xa1-0xfe are used as leading bytes for two-byte characters. | |
213 | Big5 and its extensions are widely used in Taiwan and Hong Kong. | |
214 | It is not ISO 2022-compliant. | |
fea681da MK |
215 | .SH TIS 620 |
216 | TIS 620 is a Thai national standard character set and a superset | |
c13182ef MK |
217 | of US ASCII. |
218 | As in the ISO 8859 series, Thai characters are mapped into | |
219 | 0xa1-0xfe. | |
220 | TIS 620 is the only commonly used character set under | |
fea681da | 221 | Linux besides UTF-8 to have combining characters. |
fea681da MK |
222 | .SH UNICODE |
223 | Unicode (ISO 10646) is a standard which aims to unambiguously represent every | |
c13182ef MK |
224 | character in every human language. |
225 | Unicode's structure permits 20.1 bits to encode every character. | |
226 | Since most computers don't include 20.1-bit | |
fea681da MK |
227 | integers, Unicode is usually encoded as 32-bit integers internally and |
228 | either a series of 16-bit integers (UTF-16) (needing two 16-bit integers | |
229 | only when encoding certain rare characters) or a series of 8-bit bytes | |
c13182ef MK |
230 | (UTF-8). |
231 | Information on Unicode is available at <http://www.unicode.org>. | |
fea681da MK |
232 | .LP |
233 | Linux represents Unicode using the 8-bit Unicode Transformation Format | |
c13182ef MK |
234 | (UTF-8). |
235 | UTF-8 is a variable length encoding of Unicode. | |
236 | It uses 1 | |
fea681da MK |
237 | byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, 4 bytes |
238 | for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits. | |
239 | .LP | |
c13182ef MK |
240 | Let 0,1,x stand for a zero, one, or arbitrary bit. |
241 | A byte 0xxxxxxx | |
fea681da | 242 | stands for the Unicode 00000000 0xxxxxxx which codes the same symbol |
c13182ef MK |
243 | as the ASCII 0xxxxxxx. |
244 | Thus, ASCII goes unchanged into UTF-8, and | |
fea681da MK |
245 | people using only ASCII do not notice any change: not in code, and not |
246 | in file size. | |
247 | .LP | |
248 | A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy | |
c13182ef MK |
249 | is assembled into 00000xxx xxyyyyyy. |
250 | A byte 1110xxxx is the start | |
fea681da MK |
251 | of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled |
252 | into xxxxyyyy yyzzzzzz. | |
253 | (When UTF-8 is used to code the 31-bit ISO 10646 | |
254 | then this progression continues up to 6-byte codes.) | |
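
The bit patterns above can be turned into a small encoder. The following Python sketch is illustrative only: the function name `utf8_encode` is hypothetical, and it stops at 4-byte codes (code points up to 0x10FFFF), which is an assumption made here; the 5- and 6-byte forms mentioned above are not implemented.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the 0xxxxxxx / 110xxxxx /
    1110xxxx / 11110xxx head-byte patterns, with 10xxxxxx tails.

    Hypothetical helper for illustration; it does not reject
    surrogate code points.
    """
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:                      # 7 bits: ASCII passes through unchanged
        return bytes([cp])
    if cp < 0x800:                     # 11 bits: 110xxxxx 10yyyyyy
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 16 bits: 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:                  # 21 bits: 11110xxx plus three tails
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point beyond 0x10FFFF is not handled in this sketch")


for cp in (0x41, 0xE9, 0x20AC):
    print(hex(cp), utf8_encode(cp).hex())
```

The results can be compared against Python's own `chr(cp).encode("utf-8")` for any non-surrogate code point.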
255 | .LP | |
256 | For most people who use ISO-8859 character sets, this means that the | |
c13182ef MK |
257 | characters outside of ASCII are now coded with two bytes. |
258 | This tends | |
259 | to expand ordinary text files by only one or two percent. | |
260 | For Russian | |
fea681da | 261 | or Greek users, this expands ordinary text files by 100%, since text in |
c13182ef MK |
262 | those languages is mostly outside of ASCII. |
263 | For Japanese users this means | |
264 | that the 16-bit codes now in common use will take three bytes. | |
265 | While there | |
fea681da MK |
266 | are algorithmic conversions from some character sets (esp. ISO-8859-1) to |
267 | Unicode, general conversion requires carrying around conversion tables, | |
268 | which can be quite large for 16-bit codes. | |
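
The size effects described above are easy to check with Python's built-in codecs. This is a rough sketch: the sample strings and the helper name `size_pair` are chosen here for illustration, and short all-non-ASCII samples exaggerate the growth compared with ordinary text files.

```python
def size_pair(text: str, legacy_codec: str) -> tuple[int, int]:
    """Return (bytes in the legacy encoding, bytes in UTF-8) for text.

    Hypothetical helper for illustration only.
    """
    return len(text.encode(legacy_codec)), len(text.encode("utf-8"))


# Sample strings are assumptions made for this sketch.
samples = [
    ("latin_1", "caf\u00e9"),                          # one non-ASCII letter
    ("koi8_r", "\u041f\u0440\u0438\u0432\u0435\u0442"),  # six Cyrillic letters
    ("euc_jp", "\u65e5\u672c\u8a9e"),                  # three kanji
]
for codec, text in samples:
    legacy, utf8 = size_pair(text, codec)
    print(f"{codec}: {legacy} bytes -> {utf8} bytes in UTF-8")
```

The Cyrillic sample doubles in size and the Japanese sample grows from two to three bytes per character, matching the description above.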
269 | .LP | |
270 | Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail, any other | |
c13182ef MK |
271 | byte is the head of a code. |
272 | Note that the only way ASCII bytes occur | |
273 | in a UTF-8 stream is as themselves. | |
274 | In particular, there are no | |
28d88c17 | 275 | embedded NULs ('\\0') or '/'s that form part of some larger code. |
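
The self-synchronizing property can be illustrated with a short Python sketch. The names `is_tail` and `resync` are hypothetical; the idea is simply that a reader dropped at an arbitrary byte offset skips 10xxxxxx tail bytes until it reaches the head of the next code.

```python
def is_tail(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx (a continuation byte)."""
    return (byte & 0xC0) == 0x80


def resync(data: bytes, pos: int) -> int:
    """Advance pos to the next byte that starts a code.

    Hypothetical helper for illustration; returns len(data) if only
    tail bytes remain.
    """
    while pos < len(data) and is_tail(data[pos]):
        pos += 1
    return pos


# "a", e-acute, euro sign: 1-byte, 2-byte and 3-byte codes.
data = "a\u00e9\u20ac".encode("utf-8")
print(resync(data, 2))   # offset 2 is the tail of the 2-byte code
```

Because every byte of a multibyte code has its high bit set, a plain byte search for NUL or '/' in such data never matches inside a larger code.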
fea681da MK |
276 | .LP |
277 | Since ASCII, and, in particular, NUL and '/', are unchanged, the | |
c13182ef MK |
278 | kernel does not notice that UTF-8 is being used. |
279 | It does not care at | |
fea681da MK |
280 | all what the bytes it is handling stand for. |
281 | .LP | |
282 | Rendering of Unicode data streams is typically handled through | |
c13182ef MK |
283 | `subfont' tables which map a subset of Unicode to glyphs. |
284 | Internally | |
fea681da MK |
285 | the kernel uses Unicode to describe the subfont loaded in video RAM. |
286 | This means that in UTF-8 mode one can use a character set with 512 | |
c13182ef MK |
287 | different symbols. |
288 | This is not enough for Japanese, Chinese and | |
fea681da MK |
289 | Korean, but it is enough for most other purposes. |
290 | .LP | |
291 | At the current time, the console driver does not handle combining | |
c13182ef MK |
292 | characters. |
293 | So Thai, Sioux and any other script needing combining | |
fea681da | 294 | characters can't be handled on the console. |
fea681da MK |
295 | .SH "ISO 2022 AND ISO 4873" |
296 | The ISO 2022 and 4873 standards describe a font-control model | |
c13182ef MK |
297 | based on VT100 practice. |
298 | This model is (partially) supported | |
fea681da MK |
299 | by the Linux kernel and by |
300 | .BR xterm (1). | |
301 | It is popular in Japan and Korea. | |
302 | .LP | |
303 | There are 4 graphic character sets, called G0, G1, G2 and G3, | |
304 | and one of them is the current character set for codes with | |
305 | high bit zero (initially G0), and one of them is the current | |
306 | character set for codes with high bit one (initially G1). | |
307 | Each graphic character set has 94 or 96 characters, and is | |
c13182ef MK |
308 | essentially a 7-bit character set. |
309 | It uses codes either | |
fea681da MK |
310 | 040-0177 (041-0176) or 0240-0377 (0241-0376). |
311 | G0 always has size 94 and uses codes 041-0176. | |
312 | .LP | |
313 | Switching between character sets is done using the shift functions | |
314 | ^N (SO or LS1), ^O (SI or LS0), ESC n (LS2), ESC o (LS3), | |
315 | ESC N (SS2), ESC O (SS3), ESC ~ (LS1R), ESC } (LS2R), ESC | (LS3R). | |
316 | The function LS\fIn\fP makes character set G\fIn\fP the current one | |
317 | for codes with high bit zero. | |
318 | The function LS\fIn\fPR makes character set G\fIn\fP the current one | |
319 | for codes with high bit one. | |
320 | The function SS\fIn\fP makes character set G\fIn\fP (\fIn\fP=2 or 3) | |
321 | the current one for the next character only (regardless of the value | |
322 | of its high order bit). | |
323 | .LP | |
324 | A 94-character set is designated as G\fIn\fP character set | |
325 | by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1), | |
326 | ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol | |
327 | or a pair of symbols found in the ISO 2375 International | |
328 | Register of Coded Character Sets. | |
329 | For example, ESC ( @ selects the ISO 646 character set as G0, | |
330 | ESC ( A selects the UK standard character set (with pound | |
331 | instead of number sign), ESC ( B selects ASCII (with dollar | |
332 | instead of currency sign), ESC ( M selects a character set | |
333 | for African languages, ESC ( ! A selects the Cuban character | |
334 | set, and so on. | |
335 | .LP | |
336 | A 96-character set is designated as G\fIn\fP character set | |
4d9b6984 | 337 | by an escape sequence ESC \- xx (for G1), ESC . xx (for G2) |
fea681da | 338 | or ESC / xx (for G3). |
4d9b6984 | 339 | For example, ESC \- G selects the Hebrew alphabet as G1. |
fea681da MK |
340 | .LP |
341 | A multibyte character set is designated as G\fIn\fP character set | |
342 | by an escape sequence ESC $ xx or ESC $ ( xx (for G0), | |
343 | ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3). | |
344 | For example, ESC $ ( C selects the Korean character set for G0. | |
345 | The Japanese character set selected by ESC $ B has a more | |
346 | recent version selected by ESC & @ ESC $ B. | |
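
As an illustration of these designations, the sketch below wraps JIS X 0208 codes in ESC $ B (designate the Japanese set as G0) and ESC ( B (return to ASCII), and compares the result with Python's built-in `iso2022_jp` codec. The constant and function names are hypothetical, and the sample code 0x2422 (hiragana "a") is an assumption chosen for the example.

```python
ESC = b"\x1b"
DESIGNATE_JISX0208_G0 = ESC + b"$B"   # ESC $ B: multibyte Japanese set as G0
DESIGNATE_ASCII_G0 = ESC + b"(B"      # ESC ( B: ASCII as G0


def iso2022_jp_wrap(jis_codes: bytes) -> bytes:
    """Surround raw JIS X 0208 byte pairs with G0 designations.

    Hypothetical helper for illustration; it does not validate
    that jis_codes consists of well-formed byte pairs.
    """
    return DESIGNATE_JISX0208_G0 + jis_codes + DESIGNATE_ASCII_G0


stream = iso2022_jp_wrap(b"\x24\x22")
print(stream, stream.decode("iso2022_jp"))
```

Note that the whole stream stays within 7-bit bytes: the character set in effect is selected by the escape sequences, not by the high bit.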
347 | .LP | |
348 | ISO 4873 stipulates a narrower use of character sets, where G0 | |
349 | is fixed (always ASCII), so that G1, G2 and G3 | |
350 | can only be invoked for codes with the high order bit set. | |
351 | In particular, ^N and ^O are not used anymore, ESC ( xx | |
352 | can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx | |
4d9b6984 | 353 | are equivalent to ESC \- xx, ESC . xx, ESC / xx, respectively. |
fea681da MK |
354 | .SH "SEE ALSO" |
355 | .BR console (4), | |
356 | .BR console_codes (4), | |
357 | .BR console_ioctl (4), | |
358 | .BR ascii (7), | |
359 | .BR iso_8859-1 (7), | |
360 | .BR unicode (7), | |
361 | .BR utf-8 (7) |