.\" Copyright (c) 1996 Eric S. Raymond <esr@thyrsus.com>
.\" and Copyright (c) Andries Brouwer <aeb@cwi.nl>
.\"
.\" %%%LICENSE_START(GPLv2+_DOC_ONEPARA)
.\" This is free documentation; you can redistribute it and/or
.\" modify it under the terms of the GNU General Public License as
.\" published by the Free Software Foundation; either version 2 of
.\" the License, or (at your option) any later version.
.\" %%%LICENSE_END
.\"
.\" This is combined from many sources, including notes by aeb and
.\" research by esr. Portions derive from a writeup by Roman Czyborra.
.\"
.\" Last changed by David Starner <dstarner98@aasaa.ofe.org>.
.\"
.\" FIXME This page was written long ago, and various pieces are probably
.\" no longer quite current. A reworking by someone knowledgeable
.\" on charsets is needed. Among other things, the page needs to
.\" give more prominence to Unicode. mtk, May 2014
.\"
.TH CHARSETS 7 2014-05-28 "Linux" "Linux Programmer's Manual"
.SH NAME
charsets \- programmer's view of character sets and internationalization
.SH DESCRIPTION
Linux is an international operating system.
Many of its utilities
and device drivers (including the console driver) support multilingual
character sets including Latin-alphabet letters with diacritical
marks, accents, ligatures, and entire non-Latin alphabets including
Greek, Cyrillic, Arabic, and Hebrew.
.LP
This manual page presents a programmer's-eye view of different
character-set standards and how they fit together on Linux.
Standards
discussed include ASCII, ISO 8859, KOI8-R, Unicode, ISO 2022 and
ISO 4873.
The primary emphasis is on character sets actually used as
locale character sets, not the myriad others that can be found in data
from other systems.
.SS ASCII
ASCII (American Standard Code for Information Interchange) is the original
7-bit character set, originally designed for American English.
It is currently described by the ECMA-6 standard.
.LP
Various ASCII variants exist that replace the dollar sign with other
currency symbols and replace punctuation with non-English alphabetic
characters, in order to cover German, French, Spanish, and other
languages in 7 bits.
All are
deprecated; glibc doesn't support locales whose character sets aren't
true supersets of ASCII.
(These sets are also known as ISO-646, a close
relative of ASCII that permitted replacing these characters.)
.LP
As Linux was written for hardware designed in the US, it natively
supports ASCII.
.SS ISO 8859
ISO 8859 is a series of 15 8-bit character sets, all of which have US
ASCII in their low (7-bit) half, invisible control characters in
positions 128 to 159, and 96 fixed-width graphics in positions 160 to 255.
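.LP
The three-part layout described above can be checked mechanically.
The following is a minimal sketch, assuming Python's bundled codecs
(named iso8859_1, iso8859_15, and koi8_r there) are faithful to the
standards; the helper name check_layout is hypothetical.

```python
import unicodedata


def check_layout(codec: str) -> bool:
    """Return True if codec has the ISO 8859 layout described above."""
    text = bytes(range(256)).decode(codec)
    # Positions 0-127 must be plain ASCII.
    low_is_ascii = text[:128] == bytes(range(128)).decode("ascii")
    # Positions 128-159 must be control characters (category Cc).
    controls = all(unicodedata.category(c) == "Cc" for c in text[128:160])
    # Positions 160-255 must be graphic (non-control) characters.
    graphics = all(unicodedata.category(c) != "Cc" for c in text[160:])
    return low_is_ascii and controls and graphics


for name in ("iso8859_1", "iso8859_15", "koi8_r"):
    print(name, check_layout(name))
```

KOI8-R is included only as a contrast: it places graphics in
positions 128 to 159, so the check is expected to fail for it.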
.LP
Of these, the most important is ISO 8859-1 (Latin-1).
It is natively
supported in the Linux console driver, fairly well supported in X11R6,
and is the base character set of HTML.
.LP
Console support for the other 8859 character sets is available under
Linux through user-mode utilities (such as
.BR setfont (8))
.\" // some distributions still have the deprecated consolechars
that modify keyboard bindings and the EGA graphics
table and employ the "user mapping" font table in the console
driver.
.LP
Here are brief descriptions of each set:
.TP
8859-1 (Latin-1)
Latin-1 covers most Western European languages such as Albanian, Catalan,
Danish, Dutch, English, Faroese, Finnish, French, German, Galician,
Irish, Icelandic, Italian, Norwegian, Portuguese, Spanish, and
Swedish.
The lack of the ligatures Dutch ij, French oe and old-style
,,German`` quotation marks is considered tolerable.
.TP
8859-2 (Latin-2)
Latin-2 supports most Latin-written Slavic and Central European
languages: Croatian, Czech, German, Hungarian, Polish, Romanian,
Slovak, and Slovene.
.TP
8859-3 (Latin-3)
Latin-3 is popular with authors of Esperanto, Galician, and Maltese.
(Turkish is now written with 8859-9 instead.)
.TP
8859-4 (Latin-4)
Latin-4 introduced letters for Estonian, Latvian, and Lithuanian.
It is essentially obsolete; see 8859-10 (Latin-6) and 8859-13 (Latin-7).
.TP
8859-5
Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian,
Russian, Serbian, and Ukrainian.
Ukrainians read the letter "ghe"
with downstroke as "heh" and would need a ghe with upstroke to write a
correct ghe.
See the discussion of KOI8-R below.
.TP
8859-6
Supports Arabic.
The 8859-6 glyph table is a fixed font of separate
letter forms, but a proper display engine should combine these
using the proper initial, medial, and final forms.
.TP
8859-7
Supports Modern Greek.
.TP
8859-8
Supports modern Hebrew without niqud (punctuation signs).
Niqud and full-fledged Biblical Hebrew are outside the scope of this
character set; under Linux, UTF-8 is the preferred encoding for
these.
.TP
8859-9 (Latin-5)
This is a variant of Latin-1 that replaces Icelandic letters with
Turkish ones.
.TP
8859-10 (Latin-6)
Latin-6 adds the last Inuit (Greenlandic) and Sami (Lappish) letters
that were missing in Latin-4 to cover the entire Nordic area.
RFC 1345 listed a preliminary and different "latin6".
Skolt Sami still
needs a few more accents than these.
.TP
8859-11
Supports Thai.
It is nearly identical to TIS-620, which is used under Linux for Thai.
.TP
8859-12
This set does not exist.
While Vietnamese has been suggested for this
space, it does not fit within the 96 (noncombining) characters ISO
8859 offers.
UTF-8 is the preferred character set for Vietnamese use
under Linux.
.TP
8859-13 (Latin-7)
Supports the Baltic Rim languages; in particular, it includes Latvian
characters not found in Latin-4.
.TP
8859-14 (Latin-8)
This is the Celtic character set, covering Gaelic and Welsh.
This charset also contains the dotted characters needed for Old Irish.
.TP
8859-15 (Latin-9)
This adds the Euro sign and French and Finnish letters that were missing in
Latin-1.
.TP
8859-16 (Latin-10)
This set covers many of the languages covered by 8859-2, and supports
Romanian more completely than that set does.
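.LP
As an illustration of how two sets in the series differ, the sketch
below decodes the same byte under Latin-1 and Latin-9.
It assumes Python's bundled iso8859_1 and iso8859_15 codecs; the
variable names are illustrative only.

```python
byte = b"\xa4"

# The same position holds different graphics in the two sets.
latin1 = byte.decode("iso8859_1")    # CURRENCY SIGN, U+00A4
latin9 = byte.decode("iso8859_15")   # EURO SIGN, U+20AC

# The French oe ligature exists in Latin-9 but not in Latin-1.
oe_latin9 = "\u0153".encode("iso8859_15")
try:
    "\u0153".encode("iso8859_1")
    oe_in_latin1 = True
except UnicodeEncodeError:
    oe_in_latin1 = False

print(hex(ord(latin1)), hex(ord(latin9)), oe_latin9.hex(), oe_in_latin1)
```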
.SS KOI8-R
KOI8-R is a non-ISO character set popular in Russia.
The lower half
is US ASCII; the upper is a Cyrillic character set somewhat better
designed than ISO 8859-5.
KOI8-U is a common character set, based on
KOI8-R, that has better support for Ukrainian.
Neither of these sets
is ISO-2022 compatible, unlike the ISO-8859 series.
.LP
Console support for KOI8-R is available under Linux through user-mode
utilities that modify keyboard bindings and the EGA graphics table,
and employ the "user mapping" font table in the console driver.
.\" Thanks to Tomohiro KUBOTA for the following sections about
.\" national standards.
.SS JIS X 0208
JIS X 0208 is a Japanese national standard character set.
Though there are some more Japanese national standard character sets (like
JIS X 0201, JIS X 0212, and JIS X 0213), this is the most important one.
Characters are mapped into a 94x94 two-byte matrix,
each byte of which is in the range 0x21-0x7e.
Note that JIS X 0208 is a character set, not an encoding.
This means that JIS X 0208
itself is not used for expressing text data.
JIS X 0208 is used
as a component to construct encodings such as EUC-JP, Shift_JIS,
and ISO-2022-JP.
EUC-JP is the most important encoding for Linux
and includes US ASCII and JIS X 0208.
In EUC-JP, JIS X 0208
characters are expressed in two bytes, each of which is the
JIS X 0208 code plus 0x80.
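.LP
The "code plus 0x80" rule can be sketched as follows.
This is an illustration only: the helper euc_jp_from_kuten is
hypothetical, it takes a row and cell number (1 to 94) in the 94x94
matrix, and it relies on Python's euc_jp codec to check the result.

```python
def euc_jp_from_kuten(row: int, cell: int) -> bytes:
    """Build the EUC-JP bytes for a JIS X 0208 row/cell position."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    # Row and cell 1..94 map to the JIS X 0208 byte range 0x21-0x7e.
    jis = (0x20 + row, 0x20 + cell)
    # EUC-JP adds 0x80 to each JIS X 0208 byte.
    return bytes(b + 0x80 for b in jis)


# HIRAGANA LETTER A sits at row 4, cell 2 (JIS X 0208 code 0x2422).
euc = euc_jp_from_kuten(4, 2)
print(euc.hex(), euc.decode("euc_jp"))
```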
.SS KS X 1001
KS X 1001 is a Korean national standard character set.
As with
JIS X 0208, characters are mapped into a 94x94 two-byte matrix.
KS X 1001 is used like JIS X 0208, as a component
to construct encodings such as EUC-KR, Johab, and ISO-2022-KR.
EUC-KR is the most important encoding for Linux and includes
US ASCII and KS X 1001.
KS C 5601 is an older name for KS X 1001.
.SS GB 2312
GB 2312 is a mainland Chinese national standard character set used
to express simplified Chinese.
Just like JIS X 0208, characters are
mapped into a 94x94 two-byte matrix used to construct EUC-CN.
EUC-CN
is the most important encoding for Linux and includes US ASCII and
GB 2312.
Note that EUC-CN is often called GB, GB 2312, or CN-GB.
.SS Big5
Big5 is a popular character set used in Taiwan to express traditional
Chinese.
(Big5 is both a character set and an encoding.)
It is a superset of US ASCII.
Non-ASCII characters are expressed in two bytes.
Bytes 0xa1-0xfe are used as leading bytes for two-byte characters.
Big5 and its extensions are widely used in Taiwan and Hong Kong.
It is not ISO 2022-compliant.
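.LP
The leading-byte rule is enough to split a Big5 byte string into
characters.
The sketch below assumes plain Big5 as described above (vendor
extensions may use further leading bytes) and Python's big5 codec;
the function name is hypothetical.

```python
def big5_char_lengths(data: bytes) -> list:
    """Return the byte length of each character in a Big5 string."""
    lengths = []
    i = 0
    while i < len(data):
        # A byte in 0xa1-0xfe starts a two-byte character;
        # anything else is treated as a single (ASCII) byte.
        n = 2 if 0xA1 <= data[i] <= 0xFE else 1
        lengths.append(n)
        i += n
    return lengths


sample = "A\u4e2dB".encode("big5")   # 'A', a CJK ideograph, 'B'
print(sample.hex(), big5_char_lengths(sample))
```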
.SS TIS 620
TIS 620 is a Thai national standard character set and a superset
of US ASCII.
As in the ISO 8859 series, Thai characters are mapped into
0xa1-0xfe.
TIS 620 is the only commonly used character set under
Linux besides UTF-8 to have combining characters.
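.LP
To see a combining character in TIS 620, one can decode a consonant
followed by a vowel sign and inspect the Unicode general category.
This sketch assumes Python's tis_620 codec and the unicodedata
module; category Mn marks a nonspacing (combining) mark.

```python
import unicodedata

# 0xa1 is THAI CHARACTER KO KAI, 0xd1 is THAI CHARACTER MAI HAN-AKAT.
raw = bytes([0xA1, 0xD1])
text = raw.decode("tis_620")

# "Lo" is an ordinary letter, "Mn" a nonspacing combining mark.
categories = [unicodedata.category(ch) for ch in text]
print([hex(ord(ch)) for ch in text], categories)
```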
.SS UNICODE
Unicode (ISO 10646) is a standard which aims to unambiguously represent every
character in every human language.
Unicode's code space (U+0000 to U+10FFFF) needs about 20.1 bits to
encode every character.
Since most computers don't include 20.1-bit
integers, Unicode is usually encoded as 32-bit integers internally and
either a series of 16-bit integers (UTF-16) (needing two 16-bit integers
only when encoding certain rare characters) or a series of 8-bit bytes
(UTF-8).
Information on Unicode is available at
.UR http://www.unicode.org
.UE .
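.LP
The 20.1-bit figure and the UTF-16 behavior can be reproduced with a
few lines.
This is a sketch assuming a code space of 0x110000 code points
(U+0000 to U+10FFFF); the sample characters are arbitrary.

```python
import math

CODE_POINTS = 0x110000            # U+0000 .. U+10FFFF
bits = math.log2(CODE_POINTS)     # slightly more than 20

# A character inside the first 65536 code points needs one 16-bit unit.
common = "\u00e9".encode("utf-16-be")
# A character beyond them needs two 16-bit units (a surrogate pair).
rare = "\U0001F600".encode("utf-16-be")

print(round(bits, 1), len(common) // 2, len(rare) // 2)
```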
.LP
Linux represents Unicode using the 8-bit Unicode Transformation Format
(UTF-8).
UTF-8 is a variable length encoding of Unicode.
It uses 1
byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, 4 bytes
for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits.
The 5- and 6-byte forms belong to the original definition for 31-bit
ISO 10646; RFC 3629 restricts UTF-8 to at most 4 bytes, which is
enough for all of Unicode.
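.LP
The length table above translates directly into a lookup.
The following sketch uses a hypothetical helper, utf8_length, that
follows the original 31-bit scheme; for code points within Unicode
it can be compared against Python's own UTF-8 encoder.

```python
# (payload bits, encoded length in bytes), as listed in the text.
LIMITS = [(7, 1), (11, 2), (16, 3), (21, 4), (26, 5), (31, 6)]


def utf8_length(code: int) -> int:
    """Return how many bytes the original UTF-8 scheme needs for code."""
    if code < 0:
        raise ValueError("code must be non-negative")
    for bits, nbytes in LIMITS:
        if code < (1 << bits):
            return nbytes
    raise ValueError("code does not fit in 31 bits")


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(hex(cp), utf8_length(cp), len(chr(cp).encode("utf-8")))
```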
.LP
Let 0,1,x stand for a zero, one, or arbitrary bit.
A byte 0xxxxxxx
stands for the Unicode 00000000 0xxxxxxx which codes the same symbol
as the ASCII 0xxxxxxx.
Thus, ASCII goes unchanged into UTF-8, and
people using only ASCII do not notice any change: not in code, and not
in file size.
.LP
A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy
is assembled into 00000xxx xxyyyyyy.
A byte 1110xxxx is the start
of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled
into xxxxyyyy yyzzzzzz.
(When UTF-8 is used to code the 31-bit ISO 10646
then this progression continues up to 6-byte codes.)
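.LP
The bit assembly for 2-byte and 3-byte codes can be written out with
masks and shifts.
This is a minimal sketch with hypothetical helper names; it performs
no validation of the tail bytes, which a real decoder would need.

```python
def assemble2(b0: int, b1: int) -> int:
    """110xxxxx 10yyyyyy -> 00000xxx xxyyyyyy"""
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def assemble3(b0: int, b1: int, b2: int) -> int:
    """1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyy yyzzzzzz"""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


two = "\u00e9".encode("utf-8")     # bytes c3 a9
three = "\u20ac".encode("utf-8")   # bytes e2 82 ac
print(hex(assemble2(*two)), hex(assemble3(*three)))
```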
.LP
For most people who use ISO-8859 character sets, this means that the
characters outside of ASCII are now coded with two bytes.
This tends
to expand ordinary text files by only one or two percent.
For Russian
or Greek users, this expands ordinary text files by nearly 100%, since
text in those languages is mostly outside of ASCII.
For Japanese users this means
that the 16-bit codes now in common use will take three bytes.
While there
are algorithmic conversions from some character sets (especially ISO-8859-1) to
Unicode, general conversion requires carrying around conversion tables,
which can be quite large for 16-bit codes.
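.LP
The size effects described above can be measured on small samples.
The strings below are arbitrary illustrations, and the legacy
encodings are taken from Python's bundled codecs.

```python
samples = [
    ("caf\u00e9", "iso8859_1"),            # mostly ASCII, one accent
    ("привет мир", "koi8_r"),              # Russian: 9 letters, 1 space
    ("\u65e5\u672c\u8a9e", "euc_jp"),      # Japanese: 3 two-byte codes
]

sizes = []
for text, legacy in samples:
    old = len(text.encode(legacy))
    new = len(text.encode("utf-8"))
    sizes.append((legacy, old, new))
    print(legacy, old, new)
```

The Russian sample grows from 10 to 19 bytes rather than exactly
doubling, because the space remains a single ASCII byte.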
.LP
Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail, any other
byte is the head of a code.
Note that the only way ASCII bytes occur
in a UTF-8 stream is as themselves.
In particular, there are no
embedded NULs (\(aq\\0\(aq) or \(aq/\(aqs that form part of some larger code.
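.LP
Self-synchronization means that a reader dropped at an arbitrary
byte offset can find the start of the current character by skipping
backward over tail bytes.
A minimal sketch, with a hypothetical helper name and assuming
well-formed input:

```python
def char_start(data: bytes, i: int) -> int:
    """Return the index of the head byte of the code containing data[i]."""
    # Tail bytes have the form 10xxxxxx; anything else is a head.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


data = "a/\u20ac".encode("utf-8")   # bytes 61 2f e2 82 ac
print([char_start(data, i) for i in range(len(data))])

# No byte of a multibyte code falls in the ASCII range, so a search
# for '/' or NUL can never match inside a larger code.
multibyte = "\u20ac".encode("utf-8")
print(all(b >= 0x80 for b in multibyte))
```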
.LP
Since ASCII, and, in particular, NUL and \(aq/\(aq, are unchanged, the
kernel does not notice that UTF-8 is being used.
It does not care at
all what the bytes it is handling stand for.
.LP
Rendering of Unicode data streams is typically handled through
"subfont" tables which map a subset of Unicode to glyphs.
Internally
the kernel uses Unicode to describe the subfont loaded in video RAM.
This means that in UTF-8 mode one can use a character set with 512
different symbols.
This is not enough for Japanese, Chinese and
Korean, but it is enough for most other purposes.
.LP
At the current time, the console driver does not handle combining
characters.
So Thai, Sioux and any other script needing combining
characters can't be handled on the console.
.SS ISO 2022 and ISO 4873
The ISO 2022 and 4873 standards describe a font-control model
based on VT100 practice.
This model is (partially) supported
by the Linux kernel and by
.BR xterm (1).
It is popular in Japan and Korea.
.LP
There are 4 graphic character sets, called G0, G1, G2, and G3,
and one of them is the current character set for codes with
high bit zero (initially G0), and one of them is the current
character set for codes with high bit one (initially G1).
Each graphic character set has 94 or 96 characters, and is
essentially a 7-bit character set.
It uses codes either
040-0177 (041-0176) or 0240-0377 (0241-0376).
G0 always has size 94 and uses codes 041-0176.
.LP
Switching between character sets is done using the shift functions
\fB^N\fP (SO or LS1), \fB^O\fP (SI or LS0), ESC n (LS2), ESC o (LS3),
ESC N (SS2), ESC O (SS3), ESC ~ (LS1R), ESC } (LS2R), ESC | (LS3R).
The function LS\fIn\fP makes character set G\fIn\fP the current one
for codes with high bit zero.
The function LS\fIn\fPR makes character set G\fIn\fP the current one
for codes with high bit one.
The function SS\fIn\fP makes character set G\fIn\fP (\fIn\fP=2 or 3)
the current one for the next character only (regardless of the value
of its high order bit).
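.LP
The shift functions can be modeled as a small state machine.
The sketch below is an illustration only, not the kernel's or
xterm's implementation: the function name classify is hypothetical,
designation escape sequences are ignored, and each graphic byte is
simply reported with the index of the G set it is taken from.

```python
ESC = 0x1B

# Locking shifts introduced by ESC: final byte -> (half, n).
LOCKING = {ord("n"): ("low", 2), ord("o"): ("low", 3),
           ord("~"): ("high", 1), ord("}"): ("high", 2),
           ord("|"): ("high", 3)}
# Single shifts introduced by ESC: final byte -> n.
SINGLE = {ord("N"): 2, ord("O"): 3}


def classify(data: bytes) -> list:
    """Return (G set index, 7-bit code) for each graphic byte."""
    low, high = 0, 1      # initially G0 for high bit zero, G1 for one
    single = None         # pending SS2/SS3, valid for one character
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        nxt = data[i + 1] if i + 1 < len(data) else None
        if b == 0x0E:                        # ^N, SO or LS1
            low = 1
        elif b == 0x0F:                      # ^O, SI or LS0
            low = 0
        elif b == ESC and nxt in LOCKING:
            half, n = LOCKING[nxt]
            if half == "low":
                low = n
            else:
                high = n
            i += 1
        elif b == ESC and nxt in SINGLE:
            single = SINGLE[nxt]
            i += 1
        elif (b & 0x7F) >= 0x20:             # graphic byte
            if single is not None:
                out.append((single, b & 0x7F))
                single = None
            else:
                out.append((high if b & 0x80 else low, b & 0x7F))
        i += 1
    return out


print(classify(b"A\x0eA\x0fA"))
```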
.LP
A 94-character set is designated as G\fIn\fP character set
by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1),
ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol
or a pair of symbols found in the ISO 2375 International
Register of Coded Character Sets.
For example, ESC ( @ selects the ISO 646 character set as G0,
ESC ( A selects the UK standard character set (with pound
instead of number sign), ESC ( B selects ASCII (with dollar
instead of currency sign), ESC ( M selects a character set
for African languages, ESC ( ! A selects the Cuban character
set, and so on.
.LP
A 96-character set is designated as G\fIn\fP character set
by an escape sequence ESC \- xx (for G1), ESC . xx (for G2)
or ESC / xx (for G3).
For example, ESC \- G selects the Hebrew alphabet as G1.
.LP
A multibyte character set is designated as G\fIn\fP character set
by an escape sequence ESC $ xx or ESC $ ( xx (for G0),
ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3).
For example, ESC $ ( C selects the Korean character set for G0.
The Japanese character set selected by ESC $ B has a more
recent version selected by ESC & @ ESC $ B.
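.LP
Designation sequences can be observed in ISO-2022-JP output.
The sketch below assumes Python's iso2022_jp codec, which designates
the Japanese set as G0 with ESC $ B and returns to ASCII with
ESC ( B; the variable names are illustrative.

```python
ESC = b"\x1b"

# 'a', HIRAGANA LETTER A, 'b'
encoded = "a\u3042b".encode("iso2022_jp")

# Locate the two designations and the two-byte code between them.
to_japanese = encoded.index(ESC + b"$B")
to_ascii = encoded.index(ESC + b"(B")
two_byte_code = encoded[to_japanese + 3:to_ascii]

print(encoded, two_byte_code.hex())
```

The two bytes between the designations are the 7-bit JIS X 0208 code
itself, with no high bit set.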
.LP
ISO 4873 stipulates a narrower use of character sets, where G0
is fixed (always ASCII), so that G1, G2 and G3
can be invoked only for codes with the high order bit set.
In particular, \fB^N\fP and \fB^O\fP are not used anymore, ESC ( xx
can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx
are equivalent to ESC \- xx, ESC . xx, ESC / xx, respectively.
.SH SEE ALSO
.BR console (4),
.BR console_codes (4),
.BR console_ioctl (4),
.BR ascii (7),
.BR iso_8859-1 (7),
.BR unicode (7),
.BR utf-8 (7)