.\" Copyright (c) 1996 Eric S. Raymond <esr@thyrsus.com>
.\" and Copyright (c) Andries Brouwer <aeb@cwi.nl>
.\"
.\" SPDX-License-Identifier: GPL-2.0-or-later
.\"
.\" This is combined from many sources, including notes by aeb and
.\" research by esr. Portions derive from a writeup by Roman Czyborra.
.\"
.\" Changes also by David Starner <dstarner98@aasaa.ofe.org>.
.\"
.TH charsets 7 (date) "Linux man-pages (unreleased)"
.SH NAME
charsets \- character set standards and internationalization
.SH DESCRIPTION
This manual page gives an overview of different character set standards
and how they were used on Linux before Unicode became ubiquitous.
Some of this information is still helpful for people working with legacy
systems and documents.
.PP
Standards discussed include
ASCII, GB 2312, ISO 8859, JIS, KOI8-R, KS, and Unicode.
.PP
The primary emphasis is on character sets that were actually used as
locale character sets, not the myriad others that could be found in data
from other systems.
.SS ASCII
ASCII (American Standard Code for Information Interchange) is the original
7-bit character set, originally designed for American English.
It is also known as US-ASCII.
It is currently described by the ISO 646:1991 IRV
(International Reference Version) standard.
.PP
Various ASCII variants emerged that replace the dollar sign with other
currency symbols and replace punctuation with non-English alphabetic
characters to cover German, French, Spanish, and others in 7 bits.
All are deprecated;
glibc does not support locales whose character sets are not true
supersets of ASCII.
.PP
As Unicode, when using UTF-8, is ASCII-compatible, plain ASCII text
still renders properly on modern systems that use UTF-8.
.SS ISO 8859
ISO 8859 is a series of 15 8-bit character sets, all of which have ASCII
in their low (7-bit) half, invisible control characters in positions
128 to 159, and 96 fixed-width graphics in positions 160\(en255.
.PP
Of these, the most important is ISO 8859-1
("Latin Alphabet No. 1" / Latin-1).
It was widely adopted and supported by different systems,
and is gradually being replaced with Unicode.
The ISO 8859-1 characters are also the first 256 characters of Unicode.
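.PP
The relation between ISO 8859-1 and the first 256 Unicode characters
can be checked with a short Python sketch.
It assumes Python's built-in latin_1 codec;
the variable names are illustrative only.

```python
# Every byte value 0..255, decoded as ISO 8859-1, yields the Unicode
# code point with the same number.
raw = bytes(range(256))
text = raw.decode("latin_1")
same_numbers = all(ord(ch) == b for ch, b in zip(text, raw))

# Encoding back to ISO 8859-1 restores the original bytes.
round_trip = text.encode("latin_1") == raw
```

This is why conversion from ISO 8859-1 to Unicode needs no table.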
.PP
Console support for the other 8859 character sets is available under
Linux through user-mode utilities (such as
.BR setfont (8))
that modify keyboard bindings and the EGA graphics
table and employ the "user mapping" font table in the console
driver.
.PP
Here are brief descriptions of each set:
.TP
8859-1 (Latin-1)
Latin-1 covers many West European languages such as Albanian, Basque,
Danish, English, Faroese, Galician, Icelandic, Irish, Italian,
Norwegian, Portuguese, Spanish, and Swedish.
The lack of the
Dutch IJ/ij
and French œ ligatures
and of old-style „German“ quotation marks
was considered tolerable.
.TP
8859-2 (Latin-2)
Latin-2 supports many Latin-written Central and East European
languages such as Bosnian, Croatian, Czech, German, Hungarian, Polish,
Slovak, and Slovene.
Replacing Romanian ș/ț with ş/ţ
was considered tolerable.
.TP
8859-3 (Latin-3)
Latin-3 was designed to cover Esperanto, Maltese, and Turkish, but
8859-9 later superseded it for Turkish.
.TP
8859-4 (Latin-4)
Latin-4 introduced letters for North European languages such as
Estonian, Latvian, and Lithuanian, but was superseded by 8859-10 and
8859-13.
.TP
8859-5
Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian,
Russian, Serbian, and (almost completely) Ukrainian.
It was never widely used; see the discussion of KOI8-R/KOI8-U below.
.TP
8859-6
Was created for Arabic.
The 8859-6 glyph table is a fixed font of separate
letter forms, but a proper display engine should combine these
using the proper initial, medial, and final forms.
.TP
8859-7
Was created for Modern Greek in 1987, updated in 2003.
.TP
8859-8
Supports Modern Hebrew without niqud (vowel points).
Niqud and full-fledged Biblical Hebrew were outside the scope of this
character set.
.TP
8859-9 (Latin-5)
This is a variant of Latin-1 that replaces Icelandic letters with
Turkish ones.
.TP
8859-10 (Latin-6)
Latin-6 added the Inuit (Greenlandic) and Sami (Lappish) letters that were
missing in Latin-4 to cover the entire Nordic area.
.TP
8859-11
Supports the Thai alphabet and is nearly identical to the TIS-620
standard.
.TP
8859-12
This set does not exist.
.TP
8859-13 (Latin-7)
Supports the Baltic Rim languages; in particular, it includes Latvian
characters not found in Latin-4.
.TP
8859-14 (Latin-8)
This is the Celtic character set, covering Old Irish, Manx, Gaelic,
Welsh, Cornish, and Breton.
.TP
8859-15 (Latin-9)
Latin-9 is similar to the widely used Latin-1 but replaces some less
common symbols with the Euro sign and French and Finnish letters that
were missing in Latin-1.
.TP
8859-16 (Latin-10)
This set covers many Southeast European languages, and most
importantly supports Romanian more completely than Latin-2.
.SS KOI8-R / KOI8-U
KOI8-R is a non-ISO character set popular in Russia before Unicode.
The lower half is ASCII;
the upper is a Cyrillic character set somewhat better designed than
ISO 8859-5.
KOI8-U, based on KOI8-R, has better support for Ukrainian.
Neither of these sets is ISO 2022 compatible,
unlike the ISO 8859 series.
.PP
Console support for KOI8-R is available under Linux through user-mode
utilities that modify keyboard bindings and the EGA graphics table,
and employ the "user mapping" font table in the console driver.
.SS GB 2312
GB 2312 is a mainland Chinese national standard character set used
to express simplified Chinese.
Just as in JIS X 0208, characters are
mapped into a 94x94 two-byte matrix used to construct EUC-CN.
EUC-CN
is the most important encoding for Linux and includes ASCII and
GB 2312.
Note that EUC-CN is often called GB, GB 2312, or CN-GB.
.SS Big5
Big5 was a popular character set in Taiwan for expressing traditional
Chinese.
(Big5 is both a character set and an encoding.)
It is a superset of ASCII.
Non-ASCII characters are expressed in two bytes.
Bytes 0xa1\(en0xfe are used as leading bytes for two-byte characters.
Big5 and its extensions were widely used in Taiwan and Hong Kong.
It is not ISO 2022 compliant.
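.PP
As a small illustration of the byte layout described above,
the following Python sketch uses Python's built-in big5 codec.
The helper name big5_lengths is hypothetical.

```python
def big5_lengths(text):
    """Return how many bytes Big5 uses for each character of text."""
    return [len(ch.encode("big5")) for ch in text]

# An ASCII letter stays a single byte; a Chinese character takes two.
lengths = big5_lengths("A\u4e2d")          # "A" followed by U+4E2D

# The first byte of a two-byte character lies in 0xa1..0xfe.
encoded = "\u4e2d".encode("big5")
lead_in_range = 0xA1 <= encoded[0] <= 0xFE
```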
.\" Thanks to Tomohiro KUBOTA for the following sections about
.\" national standards.
.SS JIS X 0208
JIS X 0208 is a Japanese national standard character set.
Though there are some more Japanese national standard character sets (like
JIS X 0201, JIS X 0212, and JIS X 0213), this is the most important one.
Characters are mapped into a 94x94 two-byte matrix,
each byte of which is in the range 0x21\(en0x7e.
Note that JIS X 0208 is a character set, not an encoding.
This means that JIS X 0208
itself is not used for expressing text data.
JIS X 0208 is used
as a component to construct encodings such as EUC-JP, Shift_JIS,
and ISO-2022-JP.
EUC-JP is the most important encoding for Linux
and includes ASCII and JIS X 0208.
In EUC-JP, JIS X 0208
characters are expressed in two bytes, each of which is the
JIS X 0208 code plus 0x80.
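.PP
The "plus 0x80" rule can be sketched in Python as follows.
The function name jis_to_euc_jp is hypothetical,
and the result is checked against Python's built-in euc_jp codec.

```python
def jis_to_euc_jp(jis_code):
    """Map a two-byte JIS X 0208 code to its two EUC-JP bytes."""
    hi, lo = jis_code >> 8, jis_code & 0xFF
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError("each byte must be in the range 0x21..0x7e")
    # Each EUC-JP byte is the corresponding JIS X 0208 byte plus 0x80.
    return bytes([hi + 0x80, lo + 0x80])

# 0x2422 is the JIS X 0208 code of HIRAGANA LETTER A (U+3042).
euc = jis_to_euc_jp(0x2422)
decoded = euc.decode("euc_jp")
```

Because both bytes have the high bit set,
they cannot be confused with ASCII bytes in the same stream.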
.SS KS X 1001
KS X 1001 is a Korean national standard character set.
Just as in
JIS X 0208, characters are mapped into a 94x94 two-byte matrix.
KS X 1001 is used like JIS X 0208, as a component
to construct encodings such as EUC-KR, Johab, and ISO-2022-KR.
EUC-KR is the most important encoding for Linux and includes
ASCII and KS X 1001.
KS C 5601 is an older name for KS X 1001.
.SS ISO 2022 and ISO 4873
The ISO 2022 and 4873 standards describe a font-control model
based on VT100 practice.
This model is (partially) supported
by the Linux kernel and by
.BR xterm (1).
Several ISO 2022-based character encodings have been defined,
especially for Japanese.
.PP
There are 4 graphic character sets, called G0, G1, G2, and G3.
At any time, one of them is the current character set for codes with
high bit zero (initially G0), and one of them is the current
character set for codes with high bit one (initially G1).
Each graphic character set has 94 or 96 characters, and is
essentially a 7-bit character set.
It uses either codes 040\(en0177 or codes 0240\(en0377;
a 94-character set leaves the first and last of these positions unused
(041\(en0176 or 0241\(en0376).
G0 always has size 94 and uses codes 041\(en0176.
.PP
Switching between character sets is done using the shift functions
\fB\(haN\fP (SO or LS1), \fB\(haO\fP (SI or LS0), ESC n (LS2), ESC o (LS3),
ESC N (SS2), ESC O (SS3), ESC \(ti (LS1R), ESC } (LS2R), ESC | (LS3R).
The function LS\fIn\fP makes character set G\fIn\fP the current one
for codes with high bit zero.
The function LS\fIn\fPR makes character set G\fIn\fP the current one
for codes with high bit one.
The function SS\fIn\fP makes character set G\fIn\fP (\fIn\fP=2 or 3)
the current one for the next character only (regardless of the value
of its high order bit).
.PP
A 94-character set is designated as G\fIn\fP character set
by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1),
ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol
or a pair of symbols found in the ISO 2375 International
Register of Coded Character Sets.
For example, ESC ( @ selects the ISO 646 character set as G0,
ESC ( A selects the UK standard character set (with pound
instead of number sign), ESC ( B selects ASCII (with dollar
instead of currency sign), ESC ( M selects a character set
for African languages, ESC ( ! A selects the Cuban character
set, and so on.
.PP
A 96-character set is designated as G\fIn\fP character set
by an escape sequence ESC \- xx (for G1), ESC . xx (for G2)
or ESC / xx (for G3).
For example, ESC \- G selects the Hebrew alphabet as G1.
.PP
A multibyte character set is designated as G\fIn\fP character set
by an escape sequence ESC $ xx or ESC $ ( xx (for G0),
ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3).
For example, ESC $ ( C selects the Korean character set for G0.
The Japanese character set selected by ESC $ B has a more
recent version selected by ESC & @ ESC $ B.
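.PP
These designation sequences can be seen in ISO-2022-JP text.
The Python sketch below relies on Python's built-in iso2022_jp codec
and only inspects the bytes it produces;
the variable names are illustrative.

```python
ESC = b"\x1b"
to_jis = ESC + b"$B"     # ESC $ B: designate JIS X 0208 as G0
to_ascii = ESC + b"(B"   # ESC ( B: designate ASCII as G0

# "A" followed by HIRAGANA LETTER A (U+3042).
encoded = "A\u3042".encode("iso2022_jp")

# Extract the bytes sent while JIS X 0208 is the current G0 set.
start = encoded.index(to_jis) + len(to_jis)
end = encoded.index(to_ascii, start)
jis_bytes = encoded[start:end]
```

The extracted bytes are plain 7-bit values in the range 0x21\(en0x7e,
which is how a G0 set is addressed.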
.PP
ISO 4873 stipulates a narrower use of character sets, where G0
is fixed (always ASCII), so that G1, G2, and G3
can be invoked only for codes with the high order bit set.
In particular, \fB\(haN\fP and \fB\(haO\fP are not used anymore, ESC ( xx
can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx
are equivalent to ESC \- xx, ESC . xx, ESC / xx, respectively.
.SS TIS-620
TIS-620 is a Thai national standard character set and a superset
of ASCII.
In the same fashion as the ISO 8859 series, Thai characters are mapped into
0xa1\(en0xfe.
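.PP
A brief Python sketch of this mapping,
assuming Python's built-in tis_620 and iso8859_11 codecs:

```python
# THAI CHARACTER KO KAI, KHO KHAI, and KHO KHWAI.
thai = "\u0e01\u0e02\u0e04"
encoded = thai.encode("tis_620")

# Each Thai letter becomes a single byte in the upper half.
in_upper_range = all(0xA1 <= b <= 0xFE for b in encoded)

# ASCII is unchanged, and ISO 8859-11 encodes these letters identically.
ascii_unchanged = "A".encode("tis_620") == b"A"
same_as_8859_11 = thai.encode("iso8859_11") == encoded
```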
.SS Unicode
Unicode (ISO 10646) is a standard which aims to unambiguously represent
every character in every human language.
Unicode's structure permits 20.1 bits to encode every character.
Since most computers don't include 20.1-bit integers, Unicode is
usually encoded as 32-bit integers internally and as either a series of
16-bit integers (UTF-16) (needing two 16-bit integers only when
encoding certain rare characters) or a series of 8-bit bytes (UTF-8).
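.PP
The figure of 20.1 bits follows from the size of the Unicode code space,
U+0000 through U+10FFFF.
The Python sketch below works it out and also counts UTF-16 units
for three sample characters; the names are illustrative.

```python
import math

# Unicode code points run from U+0000 to U+10FFFF.
code_points = 0x10FFFF + 1
bits = math.log2(code_points)        # a little over 20 bits

def utf16_units(ch):
    """Return the number of 16-bit units UTF-16 needs for ch."""
    return len(ch.encode("utf-16-be")) // 2

# "A", the euro sign, and a character outside the first 65536.
units = [utf16_units(ch) for ch in ("A", "\u20ac", "\U0001F600")]
```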
.PP
Linux represents Unicode using the 8-bit Unicode Transformation Format
(UTF-8).
UTF-8 is a variable length encoding of Unicode.
It uses 1
byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, and 4 bytes
for 21 bits, which is enough for all of Unicode.
The original design continued with 5 bytes for 26 bits and 6 bytes for
31 bits.
.PP
Let 0,1,x stand for a zero, one, or arbitrary bit.
A byte 0xxxxxxx
stands for the Unicode 00000000 0xxxxxxx which codes the same symbol
as the ASCII 0xxxxxxx.
Thus, ASCII goes unchanged into UTF-8, and
people using only ASCII do not notice any change: not in code, and not
in file size.
.PP
A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy
is assembled into 00000xxx xxyyyyyy.
A byte 1110xxxx is the start
of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled
into xxxxyyyy yyzzzzzz.
(When UTF-8 is used to code the 31-bit ISO 10646
then this progression continues up to 6-byte codes.)
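.PP
The bit patterns for 1-byte to 3-byte codes can be written out
as a minimal Python sketch.
The function name utf8_encode is hypothetical;
it does no validation beyond the range check
and is compared against Python's own encoder only as a sanity check.

```python
def utf8_encode(cp):
    """Encode one code point below 0x10000 as UTF-8 bytes."""
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10yyyyyy
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("longer codes are not handled in this sketch")

# "A", e with acute accent, and the euro sign.
samples = ["A", "\u00e9", "\u20ac"]
matches = [utf8_encode(ord(ch)) == ch.encode("utf-8") for ch in samples]
```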
.PP
For most texts in ISO 8859 character sets, this means that the
characters outside of ASCII are now coded with two bytes.
This tends
to expand ordinary text files by only one or two percent.
For Russian
or Greek texts, this expands ordinary text files by nearly 100%, since text in
those languages is mostly outside of ASCII.
For Japanese users this means
that the 16-bit codes now in common use will take three bytes.
While there are algorithmic conversions from some character sets
(especially ISO 8859-1) to Unicode, general conversion requires
carrying around conversion tables, which can be quite large for 16-bit
codes.
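.PP
The size effect can be measured directly.
The following Python sketch uses Python's built-in
latin_1, koi8_r, and euc_jp codecs;
the function name growth and the sample strings are illustrative.

```python
def growth(text, legacy_codec):
    """Return the UTF-8 size of text divided by its legacy-encoded size."""
    legacy_size = len(text.encode(legacy_codec))
    utf8_size = len(text.encode("utf-8"))
    return utf8_size / legacy_size

# Mostly ASCII text with two accented letters ("naive cafe" with accents).
western = growth("na\u00efve caf\u00e9", "latin_1")

# A Russian word: every letter is outside ASCII.
russian = growth("\u041f\u0440\u0438\u0432\u0435\u0442", "koi8_r")

# Three kanji: two bytes each in EUC-JP, three bytes each in UTF-8.
japanese = growth("\u65e5\u672c\u8a9e", "euc_jp")
```

Real text also contains ASCII spaces and punctuation,
so the growth for whole files is somewhat smaller than for these samples.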
.PP
Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail, any other
byte is the head of a code.
Note that the only way ASCII bytes occur
in a UTF-8 stream is as themselves.
In particular, there are no
embedded NULs (\(aq\e0\(aq) or \(aq/\(aqs that form part of some larger code.
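.PP
Self-synchronization means that a reader dropped at an arbitrary byte
offset can find the next code boundary by skipping tail bytes.
A minimal Python sketch, with the hypothetical helper name next_head:

```python
def next_head(data, pos):
    """Return the index of the first byte at or after pos that is not a tail."""
    # Tail bytes have the form 10xxxxxx.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

# "a", the euro sign (three bytes), then "b".
data = "a\u20acb".encode("utf-8")

# Offset 2 is in the middle of the euro sign; the next head is "b".
resync = next_head(data, 2)

# No byte of a multibyte code equals an ASCII byte such as "/" or NUL.
euro = "\u20ac".encode("utf-8")
no_ascii_inside = all(b >= 0x80 for b in euro)
```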
.PP
Since ASCII, and, in particular, NUL and \(aq/\(aq, are unchanged, the
kernel does not notice that UTF-8 is being used.
It does not care at
all what the bytes it is handling stand for.
.PP
Rendering of Unicode data streams is typically handled through
"subfont" tables which map a subset of Unicode to glyphs.
Internally
the kernel uses Unicode to describe the subfont loaded in video RAM.
This means that in the Linux console in UTF-8 mode, one can use a character
set with 512 different symbols.
This is not enough for Japanese, Chinese, and
Korean, but it is enough for most other purposes.
.SH SEE ALSO
.BR iconv (1),
.BR ascii (7),
.BR iso_8859\-1 (7),
.BR unicode (7),
.BR utf\-8 (7)