Commit | Line | Data |
---|---|---|
fea681da | 1 | .\" Copyright (c) 1996 Eric S. Raymond <esr@thyrsus.com> |
ac56b6a8 | 2 | .\" and Copyright (c) Andries Brouwer <aeb@cwi.nl> |
fea681da | 3 | .\" |
e4a74ca8 | 4 | .\" SPDX-License-Identifier: GPL-2.0-or-later |
fea681da MK |
5 | .\" |
6 | .\" This is combined from many sources, including notes by aeb and | |
7 | .\" research by esr. Portions derive from a writeup by Roman Czyborra. | |
8 | .\" | |
a8ed5f74 | 9 | .\" Changes also by David Starner <dstarner98@aasaa.ofe.org>. |
5b3318fb | 10 | .\" |
4c1c5274 | 11 | .TH charsets 7 (date) "Linux man-pages (unreleased)" |
fea681da | 12 | .SH NAME |
25d2cc43 | 13 | charsets \- character set standards and internationalization |
fea681da | 14 | .SH DESCRIPTION |
a8ed5f74 MM |
15 | This manual page gives an overview of different character set standards |
16 | and how they were used on Linux before Unicode became ubiquitous. | |
17 | Some of this information is still helpful for people working with legacy | |
18 | systems and documents. | |
dd3568a1 | 19 | .PP |
a8ed5f74 MM |
20 | Standards discussed include |
21 | ASCII, GB 2312, ISO 8859, JIS, KOI8-R, KS, and Unicode. | |
dd3568a1 | 22 | .PP |
a8ed5f74 MM |
23 | The primary emphasis is on character sets that were actually used as |
24 | locale character sets, not the myriad others that could be found in data | |
fea681da | 25 | from other systems. |
1ce284ec | 26 | .SS ASCII |
fea681da | 27 | ASCII (American Standard Code For Information Interchange) is the original |
c13182ef | 28 | 7-bit character set, originally designed for American English. |
a8ed5f74 MM |
29 | Also known as US-ASCII. |
30 | It is currently described by the ISO 646:1991 IRV | |
31 | (International Reference Version) standard. | |
dd3568a1 | 32 | .PP |
fea681da | 33 | Various ASCII variants replacing the dollar sign with other currency |
a8ed5f74 MM |
34 | symbols and replacing punctuation with non-English alphabetic |
35 | characters to cover German, French, Spanish, and others in 7 bits | |
36 | emerged. | |
37 | All are deprecated; | |
38 | glibc does not support locales whose character sets are not true | |
39 | supersets of ASCII. | |
dd3568a1 | 40 | .PP |
a8ed5f74 MM |
41 | As Unicode, when using UTF-8, is ASCII-compatible, plain ASCII text |
42 | still renders properly on modern systems that use UTF-8. | |
1ce284ec | 43 | .SS ISO 8859 |
42d940fa | 44 | ISO 8859 is a series of 15 8-bit character sets, all of which have ASCII |
a8ed5f74 | 45 | in their low (7-bit) half, invisible control characters in positions |
93f18cbb | 46 | 128 to 159, and 96 fixed-width graphics in positions 160\(en255. |
dd3568a1 | 47 | .PP |
a8ed5f74 MM |
48 | Of these, the most important is ISO 8859-1 |
49 | ("Latin Alphabet No. 1" / Latin-1). | |
50 | It was widely adopted and supported by different systems, | |
51 | and is gradually being replaced with Unicode. | |
52 | The ISO 8859-1 characters are also the first 256 characters of Unicode. | |
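The relationship between ISO 8859-1 and Unicode can be checked with a short Python sketch. It assumes only the standard "latin-1" codec; the variable names are illustrative.

```python
# Decode every possible byte value as ISO 8859-1 (Latin-1).
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Each byte value N should map to the Unicode code point U+00NN.
code_points = [ord(ch) for ch in decoded]
print(code_points == list(range(256)))
```

This is why conversion from ISO 8859-1 to Unicode needs no lookup table: the byte value is the code point.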
dd3568a1 | 53 | .PP |
fea681da MK |
54 | Console support for the other 8859 character sets is available under |
55 | Linux through user-mode utilities (such as | |
56 | .BR setfont (8)) | |
fea681da MK |
57 | that modify keyboard bindings and the EGA graphics |
58 | table and employ the "user mapping" font table in the console | |
59 | driver. | |
dd3568a1 | 60 | .PP |
fea681da MK |
61 | Here are brief descriptions of each set: |
62 | .TP | |
63 | 8859-1 (Latin-1) | |
a8ed5f74 | 64 | Latin-1 covers many West European languages such as Albanian, Basque, |
348f3b9d | 65 | Danish, English, Faroese, Galician, Icelandic, Irish, Italian, |
a8ed5f74 | 66 | Norwegian, Portuguese, Spanish, and Swedish. |
15f0b7af AC |
67 | The lack of the ligatures |
68 | Dutch IJ/ij, | |
69 | French œ, | |
70 | and old-style „German“ quotation marks | |
71 | was considered tolerable. | |
fea681da MK |
72 | .TP |
73 | 8859-2 (Latin-2) | |
a8ed5f74 MM |
74 | Latin-2 supports many Latin-written Central and East European |
75 | languages such as Bosnian, Croatian, Czech, German, Hungarian, Polish, | |
fea681da | 76 | Slovak, and Slovene. |
15f0b7af AC |
77 | Replacing Romanian ș/ț with ş/ţ |
78 | was considered tolerable. | |
fea681da MK |
79 | .TP |
80 | 8859-3 (Latin-3) | |
42d940fa | 81 | Latin-3 was designed to cover Esperanto, Maltese, and Turkish, but |
a8ed5f74 | 82 | 8859-9 later superseded it for Turkish. |
fea681da MK |
83 | .TP |
84 | 8859-4 (Latin-4) | |
a8ed5f74 | 85 | Latin-4 introduced letters for North European languages such as |
42d940fa | 86 | Estonian, Latvian, and Lithuanian, but was superseded by 8859-10 and |
a8ed5f74 | 87 | 8859-13. |
fea681da MK |
88 | .TP |
89 | 8859-5 | |
90 | Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian, | |
a8ed5f74 MM |
91 | Russian, Serbian, and (almost completely) Ukrainian. |
92 | It was never widely used; see the discussion of KOI8-R/KOI8-U below. | |
fea681da MK |
93 | .TP |
94 | 8859-6 | |
a8ed5f74 | 95 | Was created for Arabic. |
c13182ef | 96 | The 8859-6 glyph table is a fixed font of separate |
fea681da MK |
97 | letter forms, but a proper display engine should combine these |
98 | using the proper initial, medial, and final forms. | |
99 | .TP | |
100 | 8859-7 | |
42d940fa | 101 | Was created for Modern Greek in 1987, updated in 2003. |
fea681da MK |
102 | .TP |
103 | 8859-8 | |
42d940fa | 104 | Supports Modern Hebrew without niqud (vowel points). |
a8ed5f74 | 105 | Niqud and full-fledged Biblical Hebrew were outside the scope of this |
79745892 | 106 | character set. |
fea681da MK |
107 | .TP |
108 | 8859-9 (Latin-5) | |
109 | This is a variant of Latin-1 that replaces Icelandic letters with | |
110 | Turkish ones. | |
111 | .TP | |
112 | 8859-10 (Latin-6) | |
91085d85 | 113 | Latin-6 added the Inuit (Greenlandic) and Sami (Lappish) letters that were |
a8ed5f74 | 114 | missing in Latin-4 to cover the entire Nordic area. |
fea681da MK |
115 | .TP |
116 | 8859-11 | |
a8ed5f74 MM |
117 | Supports the Thai alphabet and is nearly identical to the TIS-620 |
118 | standard. | |
fea681da MK |
119 | .TP |
120 | 8859-12 | |
c13182ef | 121 | This set does not exist. |
fea681da MK |
122 | .TP |
123 | 8859-13 (Latin-7) | |
124 | Supports the Baltic Rim languages; in particular, it includes Latvian | |
125 | characters not found in Latin-4. | |
126 | .TP | |
127 | 8859-14 (Latin-8) | |
a8ed5f74 MM |
128 | This is the Celtic character set, covering Old Irish, Manx, Gaelic, |
129 | Welsh, Cornish, and Breton. | |
fea681da MK |
130 | .TP |
131 | 8859-15 (Latin-9) | |
42d940fa | 132 | Latin-9 is similar to the widely used Latin-1 but replaces some less |
a8ed5f74 MM |
133 | common symbols with the Euro sign and French and Finnish letters that |
134 | were missing in Latin-1. | |
fea681da MK |
135 | .TP |
136 | 8859-16 (Latin-10) | |
a8ed5f74 MM |
137 | This set covers many Southeast European languages, and most |
138 | importantly supports Romanian more completely than Latin-2. | |
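As an illustration of how closely related the Latin sets are, the following Python sketch lists the positions where Latin-9 differs from Latin-1. It assumes the standard Python codecs "iso8859_1" and "iso8859_15"; the variable names are illustrative.

```python
# Decode all 256 byte values under both character sets.
latin1 = bytes(range(256)).decode("iso8859_1")
latin9 = bytes(range(256)).decode("iso8859_15")

# Collect the byte positions that map to different characters.
differing = [i for i in range(256) if latin1[i] != latin9[i]]

for i in differing:
    print(f"0x{i:02x}: {latin1[i]!r} in Latin-1, {latin9[i]!r} in Latin-9")
```

Position 0xa4, for example, is the generic currency sign in Latin-1 and the Euro sign in Latin-9.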
139 | .SS KOI8-R / KOI8-U | |
140 | KOI8-R is a non-ISO character set popular in Russia before Unicode. | |
141 | The lower half is ASCII; | |
142 | the upper is a Cyrillic character set somewhat better designed than | |
143 | ISO 8859-5. | |
42d940fa | 144 | KOI8-U, based on KOI8-R, has better support for Ukrainian. |
a8ed5f74 | 145 | Neither of these sets is ISO 2022 compatible, |
1acb8000 | 146 | unlike the ISO 8859 series. |
dd3568a1 | 147 | .PP |
fea681da MK |
148 | Console support for KOI8-R is available under Linux through user-mode |
149 | utilities that modify keyboard bindings and the EGA graphics table, | |
150 | and employ the "user mapping" font table in the console driver. | |
83f218d9 MM |
151 | .SS GB 2312 |
152 | GB 2312 is a mainland Chinese national standard character set used | |
153 | to express simplified Chinese. | |
154 | Just like JIS X 0208, characters are | |
155 | mapped into a 94x94 two-byte matrix used to construct EUC-CN. | |
156 | EUC-CN | |
157 | is the most important encoding for Linux and includes ASCII and | |
158 | GB 2312. | |
159 | Note that EUC-CN is often called GB, GB 2312, or CN-GB. | |
160 | .SS Big5 | |
161 | Big5 was a popular character set in Taiwan to express traditional | |
162 | Chinese. | |
163 | (Big5 is both a character set and an encoding.) | |
164 | It is a superset of ASCII. | |
165 | Non-ASCII characters are expressed in two bytes. | |
93f18cbb | 166 | Bytes 0xa1\(en0xfe are used as leading bytes for two-byte characters. |
83f218d9 MM |
167 | Big5 and its extension were widely used in Taiwan and Hong Kong. |
168 | It is not ISO 2022 compliant. | |
c13182ef | 169 | .\" Thanks to Tomohiro KUBOTA for the following sections about |
fea681da | 170 | .\" national standards. |
1ce284ec | 171 | .SS JIS X 0208 |
c13182ef MK |
172 | JIS X 0208 is a Japanese national standard character set. |
173 | Though there are some more Japanese national standard character sets (like | |
174 | JIS X 0201, JIS X 0212, and JIS X 0213), this is the most important one. | |
175 | Characters are mapped into a 94x94 two-byte matrix, | |
93f18cbb | 176 | each byte of which is in the range 0x21\(en0x7e. |
c13182ef MK |
177 | Note that JIS X 0208 is a character set, not an encoding. |
178 | This means that JIS X 0208 | |
179 | itself is not used for expressing text data. | |
180 | JIS X 0208 is used | |
fea681da | 181 | as a component to construct encodings such as EUC-JP, Shift_JIS, |
c13182ef MK |
182 | and ISO-2022-JP. |
183 | EUC-JP is the most important encoding for Linux | |
a8ed5f74 | 184 | and includes ASCII and JIS X 0208. |
c13182ef | 185 | In EUC-JP, JIS X 0208 |
fea681da MK |
186 | characters are expressed in two bytes, each of which is the |
187 | JIS X 0208 code plus 0x80. | |
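The "plus 0x80" rule can be sketched in Python as follows. The example assumes that HIRAGANA LETTER A (U+3042) has the JIS X 0208 code 0x24 0x22 and uses Python's standard "euc_jp" codec only as a cross-check; the variable names are illustrative.

```python
# JIS X 0208 code of HIRAGANA LETTER A (assumed: row byte 0x24, cell byte 0x22).
jis_code = bytes([0x24, 0x22])

# EUC-JP sets the high bit of each byte, that is, adds 0x80.
euc_by_hand = bytes(b + 0x80 for b in jis_code)

# Cross-check against the codec shipped with Python.
euc_by_codec = "\u3042".encode("euc_jp")

print(euc_by_hand.hex(), euc_by_codec.hex())
```

Because both bytes end up in the range 0xa1 to 0xfe, they cannot be confused with ASCII bytes in the same stream.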
1ce284ec | 188 | .SS KS X 1001 |
c13182ef MK |
189 | KS X 1001 is a Korean national standard character set. |
190 | As with | |
fea681da MK |
191 | JIS X 0208, characters are mapped into a 94x94 two-byte matrix. |
192 | KS X 1001 is used like JIS X 0208, as a component | |
193 | to construct encodings such as EUC-KR, Johab, and ISO-2022-KR. | |
194 | EUC-KR is the most important encoding for Linux and includes | |
a8ed5f74 | 195 | ASCII and KS X 1001. |
c13182ef | 196 | KS C 5601 is an older name for KS X 1001. |
83f218d9 MM |
197 | .SS ISO 2022 and ISO 4873 |
198 | The ISO 2022 and 4873 standards describe a font-control model | |
199 | based on VT100 practice. | |
200 | This model is (partially) supported | |
201 | by the Linux kernel and by | |
202 | .BR xterm (1). | |
9be7476d MM |
203 | Several ISO 2022-based character encodings have been defined, |
204 | especially for Japanese. | |
dd3568a1 | 205 | .PP |
83f218d9 MM |
206 | There are 4 graphic character sets, called G0, G1, G2, and G3, |
207 | and one of them is the current character set for codes with | |
208 | high bit zero (initially G0), and one of them is the current | |
209 | character set for codes with high bit one (initially G1). | |
210 | Each graphic character set has 94 or 96 characters, and is | |
211 | essentially a 7-bit character set. | |
212 | It uses either codes | |
93f18cbb MK |
213 | 040\(en0177 (041\(en0176) or 0240\(en0377 (0241\(en0376). |
214 | G0 always has size 94 and uses codes 041\(en0176. | |
dd3568a1 | 215 | .PP |
83f218d9 | 216 | Switching between character sets is done using the shift functions |
9ca13180 | 217 | \fB\(haN\fP (SO or LS1), \fB\(haO\fP (SI or LS0), ESC n (LS2), ESC o (LS3), |
af2d18b2 | 218 | ESC N (SS2), ESC O (SS3), ESC \(ti (LS1R), ESC } (LS2R), ESC | (LS3R). |
83f218d9 MM |
219 | The function LS\fIn\fP makes character set G\fIn\fP the current one |
220 | for codes with high bit zero. | |
221 | The function LS\fIn\fPR makes character set G\fIn\fP the current one | |
222 | for codes with high bit one. | |
223 | The function SS\fIn\fP makes character set G\fIn\fP (\fIn\fP=2 or 3) | |
224 | the current one for the next character only (regardless of the value | |
225 | of its high order bit). | |
dd3568a1 | 226 | .PP |
83f218d9 MM |
227 | A 94-character set is designated as G\fIn\fP character set |
228 | by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1), | |
229 | ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol | |
230 | or a pair of symbols found in the ISO 2375 International | |
231 | Register of Coded Character Sets. | |
232 | For example, ESC ( @ selects the ISO 646 character set as G0, | |
233 | ESC ( A selects the UK standard character set (with pound | |
234 | instead of number sign), ESC ( B selects ASCII (with dollar | |
235 | instead of currency sign), ESC ( M selects a character set | |
236 | for African languages, ESC ( ! A selects the Cuban character | |
237 | set, and so on. | |
dd3568a1 | 238 | .PP |
83f218d9 MM |
239 | A 96-character set is designated as G\fIn\fP character set |
240 | by an escape sequence ESC \- xx (for G1), ESC . xx (for G2) | |
241 | or ESC / xx (for G3). | |
242 | For example, ESC \- G selects the Hebrew alphabet as G1. | |
dd3568a1 | 243 | .PP |
83f218d9 MM |
244 | A multibyte character set is designated as G\fIn\fP character set |
245 | by an escape sequence ESC $ xx or ESC $ ( xx (for G0), | |
246 | ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3). | |
247 | For example, ESC $ ( C selects the Korean character set for G0. | |
248 | The Japanese character set selected by ESC $ B has a more | |
249 | recent version selected by ESC & @ ESC $ B. | |
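A designation sequence can be observed in practice with Python's standard "iso2022_jp" codec, which is one ISO 2022-based encoding. The sketch below is only an illustration; the variable names are not taken from any standard.

```python
# Encode HIRAGANA LETTER A (U+3042) in ISO-2022-JP.
encoded = "\u3042".encode("iso2022_jp")

# The stream starts in ASCII, so the codec first emits ESC $ B to
# designate the Japanese multibyte set as G0.
designate_japanese = b"\x1b$B"

# After the character it emits ESC ( B to return G0 to ASCII.
designate_ascii = b"\x1b(B"

# The bytes in between are the two 7-bit bytes of the character itself.
payload = encoded[len(designate_japanese):-len(designate_ascii)]
print(encoded, payload.hex())
```

All bytes in the result are 7-bit; the escape sequences alone determine how the payload bytes are interpreted.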
dd3568a1 | 250 | .PP |
83f218d9 | 251 | ISO 4873 stipulates a narrower use of character sets, where G0 |
735334d4 | 252 | is fixed (always ASCII), so that G1, G2, and G3 |
83f218d9 | 253 | can be invoked only for codes with the high order bit set. |
9ca13180 | 254 | In particular, \fB\(haN\fP and \fB\(haO\fP are not used anymore, ESC ( xx |
83f218d9 MM |
255 | can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx |
256 | are equivalent to ESC \- xx, ESC . xx, ESC / xx, respectively. | |
a8ed5f74 MM |
257 | .SS TIS-620 |
258 | TIS-620 is a Thai national standard character set and a superset | |
259 | of ASCII. | |
42d940fa | 260 | In the same fashion as the ISO 8859 series, Thai characters are mapped into |
93f18cbb | 261 | 0xa1\(en0xfe. |
a8ed5f74 MM |
262 | .SS Unicode |
263 | Unicode (ISO 10646) is a standard which aims to unambiguously represent | |
264 | every character in every human language. | |
c13182ef | 265 | Unicode's structure permits 20.1 bits to encode every character. |
91085d85 MK |
266 | Since most computers don't include 20.1-bit integers, Unicode is |
267 | usually encoded as 32-bit integers internally and either a series of | |
268 | 16-bit integers (UTF-16) (needing two 16-bit integers only when | |
a8ed5f74 | 269 | encoding certain rare characters) or a series of 8-bit bytes (UTF-8). |
dd3568a1 | 270 | .PP |
fea681da | 271 | Linux represents Unicode using the 8-bit Unicode Transformation Format |
c13182ef MK |
272 | (UTF-8). |
273 | UTF-8 is a variable length encoding of Unicode. | |
274 | It uses 1 | |
fea681da MK |
275 | byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, 4 bytes |
276 | for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits. | |
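The byte counts above follow directly from the number of significant bits in the code point. The Python sketch below uses a hypothetical helper, utf8_length(), and compares it with Python's own encoder. Python only encodes code points up to U+10FFFF, so the 5- and 6-byte forms are computed but not cross-checked.

```python
def utf8_length(code_point):
    """Return the number of bytes the scheme described above would use."""
    bits = code_point.bit_length()
    if bits <= 7:
        return 1
    if bits <= 11:
        return 2
    if bits <= 16:
        return 3
    if bits <= 21:
        return 4
    if bits <= 26:
        return 5
    return 6

# Boundary values for the 1-, 2-, 3-, and 4-byte forms.
samples = [0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
for cp in samples:
    actual = len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: computed {utf8_length(cp)}, encoder {actual}")
```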
dd3568a1 | 277 | .PP |
c13182ef MK |
278 | Let 0,1,x stand for a zero, one, or arbitrary bit. |
279 | A byte 0xxxxxxx | |
fea681da | 280 | stands for the Unicode 00000000 0xxxxxxx which codes the same symbol |
c13182ef MK |
281 | as the ASCII 0xxxxxxx. |
282 | Thus, ASCII goes unchanged into UTF-8, and | |
fea681da MK |
283 | people using only ASCII do not notice any change: not in code, and not |
284 | in file size. | |
dd3568a1 | 285 | .PP |
fea681da | 286 | A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy |
c13182ef MK |
287 | is assembled into 00000xxx xxyyyyyy. |
288 | A byte 1110xxxx is the start | |
fea681da MK |
289 | of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled |
290 | into xxxxyyyy yyzzzzzz. | |
291 | (When UTF-8 is used to code the 31-bit ISO 10646 | |
292 | then this progression continues up to 6-byte codes.) | |
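The bit assembly described above can be written out as a small Python sketch. The function name decode_utf8_sequence() is hypothetical. It handles only the 1-, 2-, and 3-byte forms shown in the text and does no validation of tail bytes.

```python
def decode_utf8_sequence(seq):
    """Assemble a code point from a 1-, 2-, or 3-byte UTF-8 sequence."""
    first = seq[0]
    if (first & 0x80) == 0x00:
        # 0xxxxxxx: plain ASCII
        return first
    if (first & 0xE0) == 0xC0:
        # 110xxxxx 10yyyyyy -> 00000xxx xxyyyyyy
        return ((first & 0x1F) << 6) | (seq[1] & 0x3F)
    if (first & 0xF0) == 0xE0:
        # 1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyy yyzzzzzz
        return ((first & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)
    raise ValueError("longer sequences are not handled in this sketch")

print(hex(decode_utf8_sequence(b"A")))             # ASCII letter
print(hex(decode_utf8_sequence(b"\xc3\xa9")))      # e with acute accent
print(hex(decode_utf8_sequence(b"\xe2\x82\xac")))  # Euro sign
```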
dd3568a1 | 293 | .PP |
1acb8000 | 294 | For most texts in ISO 8859 character sets, this means that the |
c13182ef MK |
295 | characters outside of ASCII are now coded with two bytes. |
296 | This tends | |
297 | to expand ordinary text files by only one or two percent. | |
298 | For Russian | |
a8ed5f74 | 299 | or Greek texts, this expands ordinary text files by 100%, since text in |
c13182ef MK |
300 | those languages is mostly outside of ASCII. |
301 | For Japanese users this means | |
302 | that the 16-bit codes now in common use will take three bytes. | |
91085d85 MK |
303 | While there are algorithmic conversions from some character sets |
304 | (especially ISO 8859-1) to Unicode, general conversion requires | |
305 | carrying around conversion tables, which can be quite large for 16-bit | |
a8ed5f74 | 306 | codes. |
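The size effects described above can be measured with Python's standard codecs. The sample strings below are arbitrary illustrations, not taken from any corpus.

```python
# A Western European word: one non-ASCII letter out of four.
western = "caf\u00e9"
# A Russian word ("Privet" in Cyrillic): every letter is outside ASCII.
russian = "\u041f\u0440\u0438\u0432\u0435\u0442"
# A single Japanese character (HIRAGANA LETTER A).
japanese = "\u3042"

sizes = {
    "western": (len(western.encode("iso8859_1")), len(western.encode("utf-8"))),
    "russian": (len(russian.encode("koi8_r")), len(russian.encode("utf-8"))),
    "japanese": (len(japanese.encode("euc_jp")), len(japanese.encode("utf-8"))),
}

for name, (legacy, utf8) in sizes.items():
    print(f"{name}: {legacy} bytes in the legacy encoding, {utf8} bytes in UTF-8")
```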
dd3568a1 | 307 | .PP |
fea681da | 308 | Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail, any other |
c13182ef MK |
309 | byte is the head of a code. |
310 | Note that the only way ASCII bytes occur | |
311 | in a UTF-8 stream is as themselves. | |
312 | In particular, there are no | |
d1a71985 | 313 | embedded NULs (\(aq\e0\(aq) or \(aq/\(aqs that form part of some larger code. |
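Self-synchronization means that a reader dropped at an arbitrary byte offset can find the next character boundary by skipping tail bytes. The following Python sketch illustrates this; is_tail() and next_boundary() are hypothetical helper names.

```python
def is_tail(byte):
    """True for a continuation byte of the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80

def next_boundary(data, offset):
    """Advance offset until it points at the head of a code (or the end)."""
    while offset < len(data) and is_tail(data[offset]):
        offset += 1
    return offset

# 'a', e with acute accent, Euro sign, '/': 1 + 2 + 3 + 1 = 7 bytes.
data = "a\u00e9\u20ac/".encode("utf-8")

# Offset 4 falls in the middle of the 3-byte Euro sign.
resync = next_boundary(data, 4)
print(resync, data[resync:].decode("utf-8"))

# Counting head bytes gives the number of characters.
heads = sum(1 for b in data if not is_tail(b))
print(heads)
```

The byte 0x2f occurs exactly once in the sample, as the '/' itself, which is what makes UTF-8 safe for path names.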
dd3568a1 | 314 | .PP |
f81fb444 | 315 | Since ASCII, and, in particular, NUL and \(aq/\(aq, are unchanged, the |
c13182ef MK |
316 | kernel does not notice that UTF-8 is being used. |
317 | It does not care at | |
fea681da | 318 | all what the bytes it is handling stand for. |
dd3568a1 | 319 | .PP |
fea681da | 320 | Rendering of Unicode data streams is typically handled through |
84c517a4 | 321 | "subfont" tables which map a subset of Unicode to glyphs. |
c13182ef | 322 | Internally |
fea681da | 323 | the kernel uses Unicode to describe the subfont loaded in video RAM. |
91085d85 | 324 | This means that in the Linux console in UTF-8 mode, one can use a character |
a8ed5f74 | 325 | set with 512 different symbols. |
42d940fa | 326 | This is not enough for Japanese, Chinese, and |
fea681da | 327 | Korean, but it is enough for most other purposes. |
47297adb | 328 | .SH SEE ALSO |
a8ed5f74 | 329 | .BR iconv (1), |
fea681da | 330 | .BR ascii (7), |
28a4c58c | 331 | .BR iso_8859\-1 (7), |
fea681da | 332 | .BR unicode (7), |
28a4c58c | 333 | .BR utf\-8 (7) |