Commit | Line | Data |
---|---|---|
42d940fa | 1 | '\" t -*- coding: UTF-8 -*- |
fea681da | 2 | .\" Copyright (c) 1996 Eric S. Raymond <esr@thyrsus.com> |
ac56b6a8 | 3 | .\" and Copyright (c) Andries Brouwer <aeb@cwi.nl> |
fea681da | 4 | .\" |
89e3ffe9 | 5 | .\" %%%LICENSE_START(GPLv2+_DOC_ONEPARA) |
fea681da MK |
6 | .\" This is free documentation; you can redistribute it and/or |
7 | .\" modify it under the terms of the GNU General Public License as | |
8 | .\" published by the Free Software Foundation; either version 2 of | |
9 | .\" the License, or (at your option) any later version. | |
8f8359d8 | 10 | .\" %%%LICENSE_END |
fea681da MK |
11 | .\" |
12 | .\" This is combined from many sources, including notes by aeb and | |
13 | .\" research by esr. Portions derive from a writeup by Roman Czyborra. | |
14 | .\" | |
a8ed5f74 | 15 | .\" Changes also by David Starner <dstarner98@aasaa.ofe.org>. |
5b3318fb | 16 | .\" |
3df541c0 | 17 | .TH CHARSETS 7 2016-07-17 "Linux" "Linux Programmer's Manual" |
fea681da | 18 | .SH NAME |
a8ed5f74 | 19 | charsets - character set standards and internationalization |
fea681da | 20 | .SH DESCRIPTION |
a8ed5f74 MM |
21 | This manual page gives an overview of different character set standards |
22 | and how they were used on Linux before Unicode became ubiquitous. | |
23 | Some of this information is still helpful for people working with legacy | |
24 | systems and documents. | |
25 | .LP | |
26 | Standards discussed include | |
27 | ASCII, GB 2312, ISO 8859, JIS, KOI8-R, KS, and Unicode. | |
fea681da | 28 | .LP |
a8ed5f74 MM |
29 | The primary emphasis is on character sets that were actually used by |
30 | locale character sets, not the myriad others that could be found in data | |
fea681da | 31 | from other systems. |
1ce284ec | 32 | .SS ASCII |
fea681da | 33 | ASCII (American Standard Code for Information Interchange) is the original |
c13182ef | 34 | 7-bit character set, originally designed for American English. |
a8ed5f74 MM |
35 | Also known as US-ASCII. |
36 | It is currently described by the ISO 646:1991 IRV | |
37 | (International Reference Version) standard. | |
fea681da MK |
38 | .LP |
39 | Various ASCII variants replacing the dollar sign with other currency | |
a8ed5f74 MM |
40 | symbols and replacing punctuation with non-English alphabetic |
41 | characters to cover German, French, Spanish, and others in 7 bits | |
42 | emerged. | |
43 | All are deprecated; | |
44 | glibc does not support locales whose character sets are not true | |
45 | supersets of ASCII. | |
fea681da | 46 | .LP |
a8ed5f74 MM |
47 | As Unicode, when using UTF-8, is ASCII-compatible, plain ASCII text |
48 | still renders properly on modern systems that use UTF-8. | |
1ce284ec | 49 | .SS ISO 8859 |
42d940fa | 50 | ISO 8859 is a series of 15 8-bit character sets, all of which have ASCII |
a8ed5f74 MM |
51 | in their low (7-bit) half, invisible control characters in positions |
52 | 128 to 159, and 96 fixed-width graphics in positions 160-255. | |
fea681da | 53 | .LP |
a8ed5f74 MM |
54 | Of these, the most important is ISO 8859-1 |
55 | ("Latin Alphabet No. 1" / Latin-1). | |
56 | It was widely adopted and supported by different systems, | |
57 | and is gradually being replaced with Unicode. | |
58 | The ISO 8859-1 characters are also the first 256 characters of Unicode. | |
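The relationship between ISO 8859-1 and the first 256 Unicode code points can be checked with a short Python sketch. It assumes Python's built-in "iso8859-1" codec is an acceptable stand-in for the standard; the variable names are illustrative only.

```python
# Every byte value 0x00-0xFF, decoded as ISO 8859-1.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso8859-1")

# Each decoded character's Unicode code point equals the original byte value.
code_points = [ord(ch) for ch in decoded]
identical = code_points == list(range(256))
print(identical)
```

This is why conversion from ISO 8859-1 to Unicode needs no lookup table: each byte value is used directly as the code point.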
fea681da MK |
59 | .LP |
60 | Console support for the other 8859 character sets is available under | |
61 | Linux through user-mode utilities (such as | |
62 | .BR setfont (8)) | |
fea681da MK |
63 | that modify keyboard bindings and the EGA graphics |
64 | table and employ the "user mapping" font table in the console | |
65 | driver. | |
66 | .LP | |
67 | Here are brief descriptions of each set: | |
68 | .TP | |
69 | 8859-1 (Latin-1) | |
a8ed5f74 | 70 | Latin-1 covers many West European languages such as Albanian, Basque, |
348f3b9d | 71 | Danish, English, Faroese, Galician, Icelandic, Irish, Italian, |
a8ed5f74 MM |
72 | Norwegian, Portuguese, Spanish, and Swedish. |
73 | The lack of the ligatures Dutch IJ/ij, French œ, and old-style „German“ | |
74 | quotation marks was considered tolerable. | |
fea681da MK |
75 | .TP |
76 | 8859-2 (Latin-2) | |
a8ed5f74 MM |
77 | Latin-2 supports many Latin-written Central and East European |
78 | languages such as Bosnian, Croatian, Czech, German, Hungarian, Polish, | |
fea681da | 79 | Slovak, and Slovene. |
a8ed5f74 | 80 | Replacing Romanian ș/ț with ş/ţ was considered tolerable. |
fea681da MK |
81 | .TP |
82 | 8859-3 (Latin-3) | |
42d940fa | 83 | Latin-3 was designed to cover Esperanto, Maltese, and Turkish, but |
a8ed5f74 | 84 | 8859-9 later superseded it for Turkish. |
fea681da MK |
85 | .TP |
86 | 8859-4 (Latin-4) | |
a8ed5f74 | 87 | Latin-4 introduced letters for North European languages such as |
42d940fa | 88 | Estonian, Latvian, and Lithuanian, but was superseded by 8859-10 and |
a8ed5f74 | 89 | 8859-13. |
fea681da MK |
90 | .TP |
91 | 8859-5 | |
92 | Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian, | |
a8ed5f74 MM |
93 | Russian, Serbian, and (almost completely) Ukrainian. |
94 | It was never widely used; see the discussion of KOI8-R/KOI8-U below. | |
fea681da MK |
95 | .TP |
96 | 8859-6 | |
a8ed5f74 | 97 | Was created for Arabic. |
c13182ef | 98 | The 8859-6 glyph table is a fixed font of separate |
fea681da MK |
99 | letter forms, but a proper display engine should combine these |
100 | using the proper initial, medial, and final forms. | |
101 | .TP | |
102 | 8859-7 | |
42d940fa | 103 | Was created for Modern Greek in 1987, updated in 2003. |
fea681da MK |
104 | .TP |
105 | 8859-8 | |
42d940fa | 106 | Supports Modern Hebrew without niqud (vowel points). |
a8ed5f74 | 107 | Niqud and full-fledged Biblical Hebrew were outside the scope of this |
79745892 | 108 | character set. |
fea681da MK |
109 | .TP |
110 | 8859-9 (Latin-5) | |
111 | This is a variant of Latin-1 that replaces Icelandic letters with | |
112 | Turkish ones. | |
113 | .TP | |
114 | 8859-10 (Latin-6) | |
91085d85 | 115 | Latin-6 added the Inuit (Greenlandic) and Sami (Lappish) letters that were |
a8ed5f74 | 116 | missing in Latin-4 to cover the entire Nordic area. |
fea681da MK |
117 | .TP |
118 | 8859-11 | |
a8ed5f74 MM |
119 | Supports the Thai alphabet and is nearly identical to the TIS-620 |
120 | standard. | |
fea681da MK |
121 | .TP |
122 | 8859-12 | |
c13182ef | 123 | This set does not exist. |
fea681da MK |
124 | .TP |
125 | 8859-13 (Latin-7) | |
126 | Supports the Baltic Rim languages; in particular, it includes Latvian | |
127 | characters not found in Latin-4. | |
128 | .TP | |
129 | 8859-14 (Latin-8) | |
a8ed5f74 MM |
130 | This is the Celtic character set, covering Old Irish, Manx, Gaelic, |
131 | Welsh, Cornish, and Breton. | |
fea681da MK |
132 | .TP |
133 | 8859-15 (Latin-9) | |
42d940fa | 134 | Latin-9 is similar to the widely used Latin-1 but replaces some less |
a8ed5f74 MM |
135 | common symbols with the Euro sign and French and Finnish letters that |
136 | were missing in Latin-1. | |
fea681da MK |
137 | .TP |
138 | 8859-16 (Latin-10) | |
a8ed5f74 MM |
139 | This set covers many Southeast European languages, and most |
140 | importantly supports Romanian more completely than Latin-2. | |
141 | .SS KOI8-R / KOI8-U | |
142 | KOI8-R is a non-ISO character set popular in Russia before Unicode. | |
143 | The lower half is ASCII; | |
144 | the upper is a Cyrillic character set somewhat better designed than | |
145 | ISO 8859-5. | |
42d940fa | 146 | KOI8-U, based on KOI8-R, has better support for Ukrainian. |
a8ed5f74 | 147 | Neither of these sets is ISO 2022 compatible, |
1acb8000 | 148 | unlike the ISO 8859 series. |
fea681da MK |
149 | .LP |
150 | Console support for KOI8-R is available under Linux through user-mode | |
151 | utilities that modify keyboard bindings and the EGA graphics table, | |
152 | and employ the "user mapping" font table in the console driver. | |
83f218d9 MM |
153 | .SS GB 2312 |
154 | GB 2312 is a mainland Chinese national standard character set used | |
155 | to express simplified Chinese. | |
156 | Just like JIS X 0208, characters are | |
157 | mapped into a 94x94 two-byte matrix used to construct EUC-CN. | |
158 | EUC-CN | |
159 | is the most important encoding for Linux and includes ASCII and | |
160 | GB 2312. | |
161 | Note that EUC-CN is often called GB, GB 2312, or CN-GB. | |
162 | .SS Big5 | |
163 | Big5 was a popular character set in Taiwan to express traditional | |
164 | Chinese. | |
165 | (Big5 is both a character set and an encoding.) | |
166 | It is a superset of ASCII. | |
167 | Non-ASCII characters are expressed in two bytes. | |
168 | Bytes 0xa1-0xfe are used as leading bytes for two-byte characters. | |
169 | Big5 and its extension were widely used in Taiwan and Hong Kong. | |
170 | It is not ISO 2022 compliant. | |
c13182ef | 171 | .\" Thanks to Tomohiro KUBOTA for the following sections about |
fea681da | 172 | .\" national standards. |
1ce284ec | 173 | .SS JIS X 0208 |
c13182ef MK |
174 | JIS X 0208 is a Japanese national standard character set. |
175 | Though there are some more Japanese national standard character sets (like | |
176 | JIS X 0201, JIS X 0212, and JIS X 0213), this is the most important one. | |
177 | Characters are mapped into a 94x94 two-byte matrix, | |
178 | each byte of which is in the range 0x21-0x7e. | |
179 | Note that JIS X 0208 is a character set, not an encoding. | |
180 | This means that JIS X 0208 | |
181 | itself is not used for expressing text data. | |
182 | JIS X 0208 is used | |
fea681da | 183 | as a component to construct encodings such as EUC-JP, Shift_JIS, |
c13182ef MK |
184 | and ISO-2022-JP. |
185 | EUC-JP is the most important encoding for Linux | |
a8ed5f74 | 186 | and includes ASCII and JIS X 0208. |
c13182ef | 187 | In EUC-JP, JIS X 0208 |
fea681da MK |
188 | characters are expressed in two bytes, each of which is the |
189 | JIS X 0208 code plus 0x80. | |
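The "code plus 0x80" rule can be illustrated with a small Python sketch. It assumes Python's built-in "euc_jp" codec as a reference and uses HIRAGANA LETTER A (U+3042), which sits at row 4, cell 2 of the JIS X 0208 matrix; the variable names are illustrative only.

```python
# JIS X 0208 code for HIRAGANA LETTER A: row 4, cell 2 of the 94x94 matrix.
# Each byte is the row or cell number plus 0x20, giving the range 0x21-0x7e.
row, cell = 4, 2
jis_code = bytes([row + 0x20, cell + 0x20])   # 0x24 0x22

# EUC-JP: add 0x80 to each byte of the JIS X 0208 code.
euc_manual = bytes(b + 0x80 for b in jis_code)

# Compare with Python's built-in EUC-JP codec (used here as a reference).
euc_codec = "\u3042".encode("euc_jp")
print(euc_manual.hex(), euc_codec.hex())
```

Because both bytes have the high bit set, a JIS X 0208 character in EUC-JP can never be confused with an ASCII byte.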
1ce284ec | 190 | .SS KS X 1001 |
c13182ef MK |
191 | KS X 1001 is a Korean national standard character set. |
192 | Just like | |
fea681da MK |
193 | JIS X 0208, characters are mapped into a 94x94 two-byte matrix. |
194 | KS X 1001 is used like JIS X 0208, as a component | |
195 | to construct encodings such as EUC-KR, Johab, and ISO-2022-KR. | |
196 | EUC-KR is the most important encoding for Linux and includes | |
a8ed5f74 | 197 | ASCII and KS X 1001. |
c13182ef | 198 | KS C 5601 is an older name for KS X 1001. |
83f218d9 MM |
199 | .SS ISO 2022 and ISO 4873 |
200 | The ISO 2022 and 4873 standards describe a font-control model | |
201 | based on VT100 practice. | |
202 | This model is (partially) supported | |
203 | by the Linux kernel and by | |
204 | .BR xterm (1). | |
9be7476d MM |
205 | Several ISO 2022-based character encodings have been defined, |
206 | especially for Japanese. | |
83f218d9 MM |
207 | .LP |
208 | There are 4 graphic character sets, called G0, G1, G2, and G3, | |
209 | and one of them is the current character set for codes with | |
210 | high bit zero (initially G0), and one of them is the current | |
211 | character set for codes with high bit one (initially G1). | |
212 | Each graphic character set has 94 or 96 characters, and is | |
213 | essentially a 7-bit character set. | |
214 | It uses codes either | |
215 | 040-0177 (041-0176) or 0240-0377 (0241-0376). | |
216 | G0 always has size 94 and uses codes 041-0176. | |
217 | .LP | |
218 | Switching between character sets is done using the shift functions | |
219 | \fB^N\fP (SO or LS1), \fB^O\fP (SI or LS0), ESC n (LS2), ESC o (LS3), | |
220 | ESC N (SS2), ESC O (SS3), ESC ~ (LS1R), ESC } (LS2R), ESC | (LS3R). | |
221 | The function LS\fIn\fP makes character set G\fIn\fP the current one | |
222 | for codes with high bit zero. | |
223 | The function LS\fIn\fPR makes character set G\fIn\fP the current one | |
224 | for codes with high bit one. | |
225 | The function SS\fIn\fP makes character set G\fIn\fP (\fIn\fP=2 or 3) | |
226 | the current one for the next character only (regardless of the value | |
227 | of its high order bit). | |
228 | .LP | |
229 | A 94-character set is designated as G\fIn\fP character set | |
230 | by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1), | |
231 | ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol | |
232 | or a pair of symbols found in the ISO 2375 International | |
233 | Register of Coded Character Sets. | |
234 | For example, ESC ( @ selects the ISO 646 character set as G0, | |
235 | ESC ( A selects the UK standard character set (with pound | |
236 | instead of number sign), ESC ( B selects ASCII (with dollar | |
237 | instead of currency sign), ESC ( M selects a character set | |
238 | for African languages, ESC ( ! A selects the Cuban character | |
239 | set, and so on. | |
240 | .LP | |
241 | A 96-character set is designated as G\fIn\fP character set | |
242 | by an escape sequence ESC \- xx (for G1), ESC . xx (for G2) | |
243 | or ESC / xx (for G3). | |
244 | For example, ESC \- G selects the Hebrew alphabet as G1. | |
245 | .LP | |
246 | A multibyte character set is designated as G\fIn\fP character set | |
247 | by an escape sequence ESC $ xx or ESC $ ( xx (for G0), | |
248 | ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3). | |
249 | For example, ESC $ ( C selects the Korean character set for G0. | |
250 | The Japanese character set selected by ESC $ B has a more | |
251 | recent version selected by ESC & @ ESC $ B. | |
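The designation sequences can be observed in practice in ISO-2022-JP text. The following Python sketch assumes the built-in "iso2022_jp" codec; it encodes an ASCII letter followed by HIRAGANA LETTER A (U+3042) and shows ESC $ B switching G0 to the Japanese character set and ESC ( B switching back to ASCII.

```python
ESC = b"\x1b"

# "A" is emitted as plain ASCII; the hiragana letter needs JIS X 0208.
encoded = "A\u3042".encode("iso2022_jp")

# Expected layout, written out piece by piece:
expected = (
    b"A"            # ASCII, the initial G0 set
    + ESC + b"$B"   # ESC $ B: designate the Japanese multibyte set as G0
    + b"\x24\x22"   # two 7-bit bytes: the JIS X 0208 code of the letter
    + ESC + b"(B"   # ESC ( B: designate ASCII as G0 again
)
print(encoded == expected)
```

Note that the whole stream stays 7-bit: the character set in effect is changed by escape sequences rather than by setting the high bit.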
252 | .LP | |
253 | ISO 4873 stipulates a narrower use of character sets, where G0 | |
254 | is fixed (always ASCII), so that G1, G2 and G3 | |
255 | can be invoked only for codes with the high order bit set. | |
256 | In particular, \fB^N\fP and \fB^O\fP are not used anymore, ESC ( xx | |
257 | can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx | |
258 | are equivalent to ESC \- xx, ESC . xx, ESC / xx, respectively. | |
a8ed5f74 MM |
259 | .SS TIS-620 |
260 | TIS-620 is a Thai national standard character set and a superset | |
261 | of ASCII. | |
42d940fa | 262 | In the same fashion as the ISO 8859 series, Thai characters are mapped into |
c13182ef | 263 | 0xa1-0xfe. |
a8ed5f74 MM |
264 | .SS Unicode |
265 | Unicode (ISO 10646) is a standard which aims to unambiguously represent | |
266 | every character in every human language. | |
c13182ef | 267 | Unicode's structure permits 20.1 bits to encode every character. |
91085d85 MK |
268 | Since most computers don't include 20.1-bit integers, Unicode is |
269 | usually encoded as 32-bit integers internally and either a series of | |
270 | 16-bit integers (UTF-16) (needing two 16-bit integers only when | |
a8ed5f74 | 271 | encoding certain rare characters) or a series of 8-bit bytes (UTF-8). |
fea681da MK |
272 | .LP |
273 | Linux represents Unicode using the 8-bit Unicode Transformation Format | |
c13182ef MK |
274 | (UTF-8). |
275 | UTF-8 is a variable length encoding of Unicode. | |
276 | It uses 1 | |
fea681da MK |
277 | byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, 4 bytes |
278 | for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits. | |
279 | .LP | |
c13182ef MK |
280 | Let 0,1,x stand for a zero, one, or arbitrary bit. |
281 | A byte 0xxxxxxx | |
fea681da | 282 | stands for the Unicode 00000000 0xxxxxxx which codes the same symbol |
c13182ef MK |
283 | as the ASCII 0xxxxxxx. |
284 | Thus, ASCII goes unchanged into UTF-8, and | |
fea681da MK |
285 | people using only ASCII do not notice any change: not in code, and not |
286 | in file size. | |
287 | .LP | |
288 | A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy | |
c13182ef MK |
289 | is assembled into 00000xxx xxyyyyyy. |
290 | A byte 1110xxxx is the start | |
fea681da MK |
291 | of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled |
292 | into xxxxyyyy yyzzzzzz. | |
293 | (When UTF-8 is used to code the 31-bit ISO 10646 | |
294 | then this progression continues up to 6-byte codes.) | |
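The bit assembly described above can be sketched in Python as follows. This is a minimal illustration covering only 1-, 2-, and 3-byte codes; the function name decode_utf8_char is hypothetical, and the sketch does not validate continuation bytes or reject overlong forms as a real decoder must.

```python
def decode_utf8_char(data: bytes) -> int:
    """Return the code point encoded at the start of data (1-3 bytes only)."""
    first = data[0]
    if (first & 0x80) == 0x00:        # 0xxxxxxx: same value as ASCII
        return first
    if (first & 0xE0) == 0xC0:        # 110xxxxx 10yyyyyy
        return ((first & 0x1F) << 6) | (data[1] & 0x3F)
    if (first & 0xF0) == 0xE0:        # 1110xxxx 10yyyyyy 10zzzzzz
        return ((first & 0x0F) << 12) | ((data[1] & 0x3F) << 6) | (data[2] & 0x3F)
    raise ValueError("longer sequences are not handled in this sketch")


print(hex(decode_utf8_char("\u00e9".encode("utf-8"))))   # 2-byte code
print(hex(decode_utf8_char("\u20ac".encode("utf-8"))))   # 3-byte code
```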
295 | .LP | |
1acb8000 | 296 | For most texts in ISO 8859 character sets, this means that the |
c13182ef MK |
297 | characters outside of ASCII are now coded with two bytes. |
298 | This tends | |
299 | to expand ordinary text files by only one or two percent. | |
300 | For Russian | |
a8ed5f74 | 301 | or Greek texts, this expands ordinary text files by 100%, since text in |
c13182ef MK |
302 | those languages is mostly outside of ASCII. |
303 | For Japanese users this means | |
304 | that the 16-bit codes now in common use will take three bytes. | |
91085d85 MK |
305 | While there are algorithmic conversions from some character sets |
306 | (especially ISO 8859-1) to Unicode, general conversion requires | |
307 | carrying around conversion tables, which can be quite large for 16-bit | |
a8ed5f74 | 308 | codes. |
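The size effects described above can be checked with a short Python sketch. The sample strings are arbitrary illustrations, and the legacy encodings are taken from Python's built-in codecs ("iso8859-1", "koi8_r", "euc_jp").

```python
# Mostly-ASCII Western European text: only the accented letter grows.
latin = "caf\u00e9"
latin_sizes = (len(latin.encode("iso8859-1")), len(latin.encode("utf-8")))

# Russian text with no ASCII characters: every letter doubles in size.
russian = "привет"
russian_sizes = (len(russian.encode("koi8_r")), len(russian.encode("utf-8")))

# Japanese text: two bytes per character in EUC-JP, three in UTF-8.
japanese = "日本語"
japanese_sizes = (len(japanese.encode("euc_jp")), len(japanese.encode("utf-8")))

print(latin_sizes, russian_sizes, japanese_sizes)
```

Real documents contain spaces, digits, and punctuation that remain one byte each, so the growth of an actual Russian or Greek file is somewhat below the 100% worst case.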
fea681da MK |
309 | .LP |
310 | Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail; any other | |
c13182ef MK |
311 | byte is the head of a code. |
312 | Note that the only way ASCII bytes occur | |
313 | in a UTF-8 stream is as themselves. | |
314 | In particular, there are no | |
f81fb444 | 315 | embedded NULs (\(aq\\0\(aq) or \(aq/\(aqs that form part of some larger code. |
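Both properties can be demonstrated with a short Python sketch. The helper name start_of_code is hypothetical; it assumes well-formed UTF-8 input and simply steps backward over tail bytes (10xxxxxx) until it reaches the head of a code.

```python
def start_of_code(data: bytes, pos: int) -> int:
    """Return the index of the head byte of the code containing data[pos]."""
    while pos > 0 and (data[pos] & 0xC0) == 0x80:   # 10xxxxxx is a tail byte
        pos -= 1
    return pos


# "a/" + EURO SIGN + "/b": the euro sign takes three bytes (e2 82 ac).
stream = "a/\u20ac/b".encode("utf-8")

# Landing anywhere inside the euro sign resynchronizes to its head at index 2.
heads = [start_of_code(stream, i) for i in range(len(stream))]

# The multibyte code contains no ASCII bytes, so b"/" matches only real slashes.
slash_positions = [i for i, b in enumerate(stream) if b == ord("/")]
print(heads, slash_positions)
```

This is the reason byte-oriented code that splits pathnames on '/' or stops at NUL keeps working on UTF-8 data without modification.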
fea681da | 316 | .LP |
f81fb444 | 317 | Since ASCII, and, in particular, NUL and \(aq/\(aq, are unchanged, the |
c13182ef MK |
318 | kernel does not notice that UTF-8 is being used. |
319 | It does not care at | |
fea681da MK |
320 | all what the bytes it is handling stand for. |
321 | .LP | |
322 | Rendering of Unicode data streams is typically handled through | |
84c517a4 | 323 | "subfont" tables which map a subset of Unicode to glyphs. |
c13182ef | 324 | Internally |
fea681da | 325 | the kernel uses Unicode to describe the subfont loaded in video RAM. |
91085d85 | 326 | This means that in the Linux console in UTF-8 mode, one can use a character |
a8ed5f74 | 327 | set with 512 different symbols. |
42d940fa | 328 | This is not enough for Japanese, Chinese, and |
fea681da | 329 | Korean, but it is enough for most other purposes. |
47297adb | 330 | .SH SEE ALSO |
a8ed5f74 | 331 | .BR iconv (1), |
fea681da MK |
332 | .BR ascii (7), |
333 | .BR iso_8859-1 (7), | |
334 | .BR unicode (7), | |
335 | .BR utf-8 (7) |