'\" t -*- coding: UTF-8 -*-
.\" Copyright (c) 1996 Eric S. Raymond <esr@thyrsus.com>
.\" and Copyright (c) Andries Brouwer <aeb@cwi.nl>
.\"
.\" %%%LICENSE_START(GPLv2+_DOC_ONEPARA)
.\" This is free documentation; you can redistribute it and/or
.\" modify it under the terms of the GNU General Public License as
.\" published by the Free Software Foundation; either version 2 of
.\" the License, or (at your option) any later version.
.\" %%%LICENSE_END
.\"
.\" This is combined from many sources, including notes by aeb and
.\" research by esr. Portions derive from a writeup by Roman Czyborra.
.\"
.\" Changes also by David Starner <dstarner98@aasaa.ofe.org>.
.\"
.TH CHARSETS 7 2016-07-17 "Linux" "Linux Programmer's Manual"
.SH NAME
charsets - character set standards and internationalization
.SH DESCRIPTION
This manual page gives an overview of different character set standards
and how they were used on Linux before Unicode became ubiquitous.
Some of this information is still helpful for people working with legacy
systems and documents.
.LP
Standards discussed include
ASCII, GB 2312, ISO 8859, JIS, KOI8-R, KS, and Unicode.
.LP
The primary emphasis is on character sets that were actually used as
locale character sets, not the myriad others that could be found in data
from other systems.
.SS ASCII
ASCII (American Standard Code For Information Interchange) is the original
7-bit character set, originally designed for American English.
It is also known as US-ASCII.
It is currently described by the ISO 646:1991 IRV
(International Reference Version) standard.
.LP
Various ASCII variants emerged that replaced the dollar sign with other
currency symbols and replaced punctuation with non-English alphabetic
characters to cover German, French, Spanish, and other languages in 7 bits.
All are deprecated;
glibc does not support locales whose character sets are not true
supersets of ASCII.
.LP
As Unicode, when using UTF-8, is ASCII-compatible, plain ASCII text
still renders properly on modern systems that use UTF-8.
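.LP
The ASCII compatibility of UTF-8 can be checked with a short sketch.
It is written in Python purely for illustration; the codec names
"ascii" and "utf-8" are Python's, and the sample string is arbitrary.

```python
# Plain ASCII text yields the same byte sequence under ASCII and UTF-8.
text = "Hello, world!\n"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Both encodings agree byte for byte.
print(ascii_bytes == utf8_bytes)

# Every byte has its high bit clear, as expected for 7-bit data.
print(all(b < 0x80 for b in utf8_bytes))
```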
.SS ISO 8859
ISO 8859 is a series of 15 8-bit character sets, all of which have ASCII
in their low (7-bit) half, invisible control characters in positions
128 to 159, and 96 fixed-width graphics in positions 160-255.
.LP
Of these, the most important is ISO 8859-1
("Latin Alphabet No. 1" / Latin-1).
It was widely adopted and supported by different systems,
and is gradually being replaced with Unicode.
The ISO 8859-1 characters are also the first 256 characters of Unicode.
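.LP
Because ISO 8859-1 coincides with the first 256 Unicode code points,
converting Latin-1 to Unicode needs no table.
The following Python sketch illustrates this; the codec name
"iso8859_1" is Python's spelling and is an assumption of the example.

```python
# Decode every possible byte value as ISO 8859-1.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso8859_1")

# Each byte value maps to the Unicode code point with the same number.
code_points = [ord(ch) for ch in decoded]
print(code_points == list(range(256)))
```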
.LP
Console support for the other 8859 character sets is available under
Linux through user-mode utilities (such as
.BR setfont (8))
that modify keyboard bindings and the EGA graphics
table and employ the "user mapping" font table in the console
driver.
.LP
Here are brief descriptions of each set:
.TP
8859-1 (Latin-1)
Latin-1 covers many West European languages such as Albanian, Basque,
Danish, English, Faroese, Galician, Icelandic, Irish, Italian,
Norwegian, Portuguese, Spanish, and Swedish.
The lack of the ligatures Dutch IJ/ij, French œ, and old-style „German“
quotation marks was considered tolerable.
.TP
8859-2 (Latin-2)
Latin-2 supports many Latin-written Central and East European
languages such as Bosnian, Croatian, Czech, German, Hungarian, Polish,
Slovak, and Slovene.
Replacing Romanian ș/ț with ş/ţ was considered tolerable.
.TP
8859-3 (Latin-3)
Latin-3 was designed to cover Esperanto, Maltese, and Turkish, but
8859-9 later superseded it for Turkish.
.TP
8859-4 (Latin-4)
Latin-4 introduced letters for North European languages such as
Estonian, Latvian, and Lithuanian, but was superseded by 8859-10 and
8859-13.
.TP
8859-5
Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian,
Russian, Serbian, and (almost completely) Ukrainian.
It was never widely used; see the discussion of KOI8-R/KOI8-U below.
.TP
8859-6
Was created for Arabic.
The 8859-6 glyph table is a fixed font of separate
letter forms, but a proper display engine should combine these
using the proper initial, medial, and final forms.
.TP
8859-7
Was created for Modern Greek in 1987 and updated in 2003.
.TP
8859-8
Supports Modern Hebrew without niqud (vowel points).
Niqud and full-fledged Biblical Hebrew were outside the scope of this
character set.
.TP
8859-9 (Latin-5)
This is a variant of Latin-1 that replaces Icelandic letters with
Turkish ones.
.TP
8859-10 (Latin-6)
Latin-6 added the Inuit (Greenlandic) and Sami (Lappish) letters that were
missing in Latin-4 to cover the entire Nordic area.
.TP
8859-11
Supports the Thai alphabet and is nearly identical to the TIS-620
standard.
.TP
8859-12
This set does not exist.
.TP
8859-13 (Latin-7)
Supports the Baltic Rim languages; in particular, it includes Latvian
characters not found in Latin-4.
.TP
8859-14 (Latin-8)
This is the Celtic character set, covering Old Irish, Manx, Gaelic,
Welsh, Cornish, and Breton.
.TP
8859-15 (Latin-9)
Latin-9 is similar to the widely used Latin-1 but replaces some less
common symbols with the Euro sign and French and Finnish letters that
were missing in Latin-1.
.TP
8859-16 (Latin-10)
This set covers many Southeast European languages, and most
importantly supports Romanian more completely than Latin-2.
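.LP
Two of the differences described in the list above (the Euro sign in
Latin-9, and the Romanian comma-below letters in Latin-10) can be probed
with the following Python sketch.
The helper can_encode() is hypothetical, and the codec names are those
used by Python, not names defined by this page.

```python
def can_encode(ch, codec):
    """Return True if the character exists in the given character set."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

euro = "\u20ac"       # EURO SIGN
s_comma = "\u0219"    # LATIN SMALL LETTER S WITH COMMA BELOW (Romanian)
s_cedilla = "\u015f"  # LATIN SMALL LETTER S WITH CEDILLA (the substitute)

# The Euro sign is missing from Latin-1 but present in Latin-9.
print(can_encode(euro, "iso8859_1"), can_encode(euro, "iso8859_15"))

# Latin-2 offers only the cedilla form; Latin-10 has the comma-below form.
print(can_encode(s_comma, "iso8859_2"), can_encode(s_cedilla, "iso8859_2"))
print(can_encode(s_comma, "iso8859_16"))
```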
.SS KOI8-R / KOI8-U
KOI8-R is a non-ISO character set popular in Russia before Unicode.
The lower half is ASCII;
the upper is a Cyrillic character set somewhat better designed than
ISO 8859-5.
KOI8-U, based on KOI8-R, has better support for Ukrainian.
Neither of these sets is ISO 2022 compatible,
unlike the ISO 8859 series.
.LP
Console support for KOI8-R is available under Linux through user-mode
utilities that modify keyboard bindings and the EGA graphics table,
and employ the "user mapping" font table in the console driver.
.SS GB 2312
GB 2312 is a mainland Chinese national standard character set used
to express simplified Chinese.
Just like JIS X 0208, characters are
mapped into a 94x94 two-byte matrix used to construct EUC-CN.
EUC-CN
is the most important encoding for Linux and includes ASCII and
GB 2312.
Note that EUC-CN is often called GB, GB 2312, or CN-GB.
.SS Big5
Big5 was a popular character set in Taiwan to express traditional
Chinese.
(Big5 is both a character set and an encoding.)
It is a superset of ASCII.
Non-ASCII characters are expressed in two bytes.
Bytes 0xa1-0xfe are used as leading bytes for two-byte characters.
Big5 and its extension were widely used in Taiwan and Hong Kong.
It is not ISO 2022 compliant.
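.LP
The byte layout of Big5 described above can be observed with a small
Python sketch.
The codec name "big5" is Python's, and the sample character is an
arbitrary choice made for this illustration.

```python
# One ASCII letter followed by one traditional Chinese character (U+4E2D).
text = "A\u4e2d"
encoded = text.encode("big5")

# The ASCII letter stays a single byte; the Chinese character takes two.
print(len(encoded))

# The first byte of the two-byte character falls in the lead-byte range.
lead = encoded[1]
print(0xA1 <= lead <= 0xFE)
```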
.\" Thanks to Tomohiro KUBOTA for the following sections about
.\" national standards.
.SS JIS X 0208
JIS X 0208 is a Japanese national standard character set.
Although there are other Japanese national standard character sets (like
JIS X 0201, JIS X 0212, and JIS X 0213), this is the most important one.
Characters are mapped into a 94x94 two-byte matrix,
each byte of which is in the range 0x21-0x7e.
Note that JIS X 0208 is a character set, not an encoding.
This means that JIS X 0208
itself is not used for expressing text data.
JIS X 0208 is used
as a component to construct encodings such as EUC-JP, Shift_JIS,
and ISO-2022-JP.
EUC-JP is the most important encoding for Linux
and includes ASCII and JIS X 0208.
In EUC-JP, JIS X 0208
characters are expressed in two bytes, each of which is the
JIS X 0208 code plus 0x80.
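.LP
The relationship between EUC-JP and the underlying JIS X 0208 code can
be sketched as follows.
This is an illustration only: the Python codec names "euc_jp" and
"iso2022_jp" and the sample character are assumptions of the example,
and the escape sequences wrapped around the 7-bit code are the
ISO-2022-JP designations for JIS X 0208 and ASCII.

```python
ch = "\u3042"  # HIRAGANA LETTER A

# EUC-JP form: two bytes, both with the high bit set.
euc = ch.encode("euc_jp")

# Subtracting 0x80 from each byte recovers the JIS X 0208 code.
jis = bytes(b - 0x80 for b in euc)
print(euc.hex(), jis.hex())

# The same 7-bit code, framed by ESC $ B and ESC ( B, is valid ISO-2022-JP.
iso = b"\x1b$B" + jis + b"\x1b(B"
print(iso.decode("iso2022_jp") == ch)
```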
.SS KS X 1001
KS X 1001 is a Korean national standard character set.
Just as with
JIS X 0208, characters are mapped into a 94x94 two-byte matrix.
KS X 1001 is used like JIS X 0208, as a component
to construct encodings such as EUC-KR, Johab, and ISO-2022-KR.
EUC-KR is the most important encoding for Linux and includes
ASCII and KS X 1001.
KS C 5601 is an older name for KS X 1001.
.SS ISO 2022 and ISO 4873
The ISO 2022 and 4873 standards describe a font-control model
based on VT100 practice.
This model is (partially) supported
by the Linux kernel and by
.BR xterm (1).
Several ISO 2022-based character encodings have been defined,
especially for Japanese.
.LP
There are 4 graphic character sets, called G0, G1, G2, and G3,
and one of them is the current character set for codes with
high bit zero (initially G0), and one of them is the current
character set for codes with high bit one (initially G1).
Each graphic character set has 94 or 96 characters, and is
essentially a 7-bit character set.
It uses codes either
040-0177 (041-0176) or 0240-0377 (0241-0376).
G0 always has size 94 and uses codes 041-0176.
.LP
Switching between character sets is done using the shift functions
\fB^N\fP (SO or LS1), \fB^O\fP (SI or LS0), ESC n (LS2), ESC o (LS3),
ESC N (SS2), ESC O (SS3), ESC ~ (LS1R), ESC } (LS2R), ESC | (LS3R).
The function LS\fIn\fP makes character set G\fIn\fP the current one
for codes with high bit zero.
The function LS\fIn\fPR makes character set G\fIn\fP the current one
for codes with high bit one.
The function SS\fIn\fP makes character set G\fIn\fP (\fIn\fP=2 or 3)
the current one for the next character only (regardless of the value
of its high order bit).
.LP
A 94-character set is designated as G\fIn\fP character set
by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1),
ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol
or a pair of symbols found in the ISO 2375 International
Register of Coded Character Sets.
For example, ESC ( @ selects the ISO 646 character set as G0,
ESC ( A selects the UK standard character set (with pound
instead of number sign), ESC ( B selects ASCII (with dollar
instead of currency sign), ESC ( M selects a character set
for African languages, ESC ( ! A selects the Cuban character
set, and so on.
.LP
A 96-character set is designated as G\fIn\fP character set
by an escape sequence ESC \- xx (for G1), ESC . xx (for G2)
or ESC / xx (for G3).
For example, ESC \- G selects the Hebrew alphabet as G1.
.LP
A multibyte character set is designated as G\fIn\fP character set
by an escape sequence ESC $ xx or ESC $ ( xx (for G0),
ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3).
For example, ESC $ ( C selects the Korean character set for G0.
The Japanese character set selected by ESC $ B has a more
recent version selected by ESC & @ ESC $ B.
.LP
ISO 4873 stipulates a narrower use of character sets, where G0
is fixed (always ASCII), so that G1, G2 and G3
can be invoked only for codes with the high order bit set.
In particular, \fB^N\fP and \fB^O\fP are not used anymore, ESC ( xx
can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx
are equivalent to ESC \- xx, ESC . xx, ESC / xx, respectively.
.SS TIS-620
TIS-620 is a Thai national standard character set and a superset
of ASCII.
In the same fashion as the ISO 8859 series, Thai characters are mapped into
0xa1-0xfe.
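.LP
As a rough check of the layout just described, the following Python
sketch encodes a short Thai string.
The codec names "tis_620" and "iso8859_11" are Python's, and the two
sample letters are arbitrary.

```python
# Two Thai letters: KO KAI (U+0E01) and SARA AA (U+0E32).
thai = "\u0e01\u0e32"
enc = thai.encode("tis_620")

# Each Thai letter occupies one byte in the upper half of the code space.
print(enc.hex())
print(all(0xA1 <= b <= 0xFE for b in enc))

# ASCII text passes through unchanged, and ISO 8859-11 gives the same bytes.
print("abc".encode("tis_620") == b"abc")
print(enc == thai.encode("iso8859_11"))
```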
.SS Unicode
Unicode (ISO 10646) is a standard which aims to unambiguously represent
every character in every human language.
Unicode's structure permits 20.1 bits to encode every character.
Since most computers don't include 20.1-bit integers, Unicode is
usually encoded as 32-bit integers internally and either a series of
16-bit integers (UTF-16) (needing two 16-bit integers only when
encoding certain rare characters) or a series of 8-bit bytes (UTF-8).
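.LP
The figure of 20.1 bits follows from the size of the Unicode code space,
which runs from U+0000 to U+10FFFF (17 planes of 65536 code points).
The upper limit is not stated above and is supplied here as background;
the arithmetic can be checked with a short Python sketch.

```python
import math

# Number of code points in the Unicode code space, U+0000 through U+10FFFF.
code_points = 0x10FFFF + 1

# Bits needed to distinguish that many values (not a whole number).
bits = math.log2(code_points)
print(round(bits, 1))
```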
.LP
Linux represents Unicode using the 8-bit Unicode Transformation Format
(UTF-8).
UTF-8 is a variable length encoding of Unicode.
It uses 1
byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, 4 bytes
for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits.
.LP
Let 0,1,x stand for a zero, one, or arbitrary bit.
A byte 0xxxxxxx
stands for the Unicode 00000000 0xxxxxxx which codes the same symbol
as the ASCII 0xxxxxxx.
Thus, ASCII goes unchanged into UTF-8, and
people using only ASCII do not notice any change: not in code, and not
in file size.
.LP
A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy
is assembled into 00000xxx xxyyyyyy.
A byte 1110xxxx is the start
of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled
into xxxxyyyy yyzzzzzz.
(When UTF-8 is used to code the 31-bit ISO 10646
then this progression continues up to 6-byte codes.)
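.LP
The bit patterns above translate directly into code.
The following Python function is a minimal sketch, not a complete
encoder: the name utf8_encode() is hypothetical, only codes of up to
4 bytes are produced, and no check is made for surrogates or for values
beyond the Unicode range.

```python
def utf8_encode(cp):
    """Encode one code point using the 1- to 4-byte UTF-8 patterns."""
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10yyyyyy
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx followed by three 10xxxxxx tail bytes
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Compare against Python's own encoder for one sample of each length.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(hex(cp), utf8_encode(cp).hex(),
          utf8_encode(cp) == chr(cp).encode("utf-8"))
```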
.LP
For most texts in ISO 8859 character sets, this means that the
characters outside of ASCII are now coded with two bytes.
This tends
to expand ordinary text files by only one or two percent.
For Russian
or Greek texts, this expands ordinary text files by 100%, since text in
those languages is mostly outside of ASCII.
For Japanese users this means
that the 16-bit codes now in common use will take three bytes.
While there are algorithmic conversions from some character sets
(especially ISO 8859-1) to Unicode, general conversion requires
carrying around conversion tables, which can be quite large for 16-bit
codes.
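.LP
The size effects described above can be measured on small samples.
The strings below are arbitrary examples chosen for this sketch, and the
codec names are Python's; real files also contain spaces, digits, and
punctuation that stay at one byte each.

```python
# (sample text, legacy 8-bit or 16-bit encoding)
samples = {
    "latin": ("caf\u00e9", "iso8859_1"),                         # one accented letter
    "russian": ("\u041f\u0440\u0438\u0432\u0435\u0442", "koi8_r"),  # all Cyrillic
    "japanese": ("\u65e5\u672c\u8a9e", "euc_jp"),                # three kanji
}

# Map each sample to (legacy size, UTF-8 size) in bytes.
sizes = {}
for name, (text, codec) in samples.items():
    sizes[name] = (len(text.encode(codec)), len(text.encode("utf-8")))
    print(name, sizes[name])
```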
.LP
Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail, any other
byte is the head of a code.
Note that the only way ASCII bytes occur
in a UTF-8 stream is as themselves.
In particular, there are no
embedded NULs (\(aq\\0\(aq) or \(aq/\(aqs that form part of some larger code.
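.LP
Self-synchronization means that the start of a character can be found
from any position by stepping back over tail bytes.
The sketch below assumes well-formed input; the function name
char_start() and the sample data are made up for this illustration.

```python
def char_start(buf, i):
    """Return the index of the head byte of the character covering buf[i]."""
    # Tail bytes have the form 10xxxxxx; step back until a head is found.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

# 'a', the Euro sign (three bytes), and '/'.
data = "a\u20ac/".encode("utf-8")
print([char_start(data, i) for i in range(len(data))])

# The multibyte character contains no NUL or '/' byte of its own.
euro = "\u20ac".encode("utf-8")
print(0x00 in euro, ord("/") in euro)
```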
.LP
Since ASCII, and, in particular, NUL and \(aq/\(aq, are unchanged, the
kernel does not notice that UTF-8 is being used.
It does not care at
all what the bytes it is handling stand for.
.LP
Rendering of Unicode data streams is typically handled through
"subfont" tables which map a subset of Unicode to glyphs.
Internally
the kernel uses Unicode to describe the subfont loaded in video RAM.
This means that in the Linux console in UTF-8 mode, one can use a character
set with 512 different symbols.
This is not enough for Japanese, Chinese, and
Korean, but it is enough for most other purposes.
.SH SEE ALSO
.BR iconv (1),
.BR ascii (7),
.BR iso_8859-1 (7),
.BR unicode (7),
.BR utf-8 (7)