.\" Copyright (c) 1996 Eric S. Raymond <esr@thyrsus.com>
.\" and Andries Brouwer <aeb@cwi.nl>
.\"
.\" This is free documentation; you can redistribute it and/or
.\" modify it under the terms of the GNU General Public License as
.\" published by the Free Software Foundation; either version 2 of
.\" the License, or (at your option) any later version.
.\"
.\" This is combined from many sources, including notes by aeb and
.\" research by esr. Portions derive from a writeup by Roman Czyborra.
.\"
.\" Last changed by David Starner <dstarner98@aasaa.ofe.org>.
.TH CHARSETS 7 2001-05-07 "Linux" "Linux Programmer's Manual"
.SH NAME
charsets \- programmer's view of character sets and internationalization
.SH DESCRIPTION
Linux is an international operating system.
Many of its utilities
and device drivers (including the console driver) support multilingual
character sets including Latin-alphabet letters with diacritical
marks, accents, ligatures, and entire non-Latin alphabets including
Greek, Cyrillic, Arabic, and Hebrew.
.LP
This manual page presents a programmer's-eye view of different
character-set standards and how they fit together on Linux.
Standards
discussed include ASCII, ISO 8859, KOI8-R, Unicode, ISO 2022 and
ISO 4873.
The primary emphasis is on character sets actually used as
locale character sets, not the myriad others that can be found in data
from other systems.
.LP
A complete list of charsets used in an officially supported locale in glibc
2.2.3 is: ISO-8859-{1,2,3,5,6,7,8,9,13,15}, CP1251, UTF-8, EUC-{KR,JP,TW},
KOI8-{R,U}, GB2312, GB18030, GBK, BIG5, BIG5-HKSCS and TIS-620 (in no
particular order).
(Romanian may be switching to ISO-8859-16.)
.SH ASCII
ASCII (American Standard Code for Information Interchange) is the original
7-bit character set, originally designed for American English.
It is currently described by the ECMA-6 standard.
.LP
Various 7-bit ASCII variants exist that replace the dollar sign with other
currency symbols and replace punctuation with non-English alphabetic
characters to cover German, French, Spanish and other languages.
All are
deprecated; GNU libc doesn't support locales whose character sets aren't
true supersets of ASCII.
(These variants are also known as ISO-646, a close
relative of ASCII that permitted replacing these characters.)
.LP
As Linux was written for hardware designed in the US, it natively
supports ASCII.
.SH ISO 8859
ISO 8859 is a series of 15 8-bit character sets, all of which have US
ASCII in their low (7-bit) half, invisible control characters in
positions 128 to 159, and 96 fixed-width graphics in positions 160-255.
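.LP
As an illustrative sketch only, the following Python fragment decodes
the same byte values under three ISO 8859 parts.
It assumes the codec names shipped with the Python standard library
(iso8859_1, iso8859_5, iso8859_7); the helper name decode_byte is
hypothetical.
.nf
```python
# Sketch: one byte value, three ISO 8859 parts.
# decode_byte is a hypothetical helper, not a library function.

def decode_byte(value, charset):
    """Decode a single byte value under the named character set."""
    return bytes([value]).decode(charset)

# The low half (0-127) is US ASCII in every part.
for charset in ("iso8859_1", "iso8859_5", "iso8859_7"):
    assert decode_byte(0x41, charset) == "A"

# The high half (160-255) differs from part to part.
latin1 = decode_byte(0xE9, "iso8859_1")    # U+00E9, e with acute
cyrillic = decode_byte(0xE9, "iso8859_5")  # U+0449, Cyrillic shcha
greek = decode_byte(0xE9, "iso8859_7")     # U+03B9, Greek iota

print([hex(ord(c)) for c in (latin1, cyrillic, greek)])
```
.fi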
.LP
Of these, the most important is ISO 8859-1 (Latin-1).
It is natively
supported in the Linux console driver, fairly well supported in X11R6,
and is the base character set of HTML.
.LP
Console support for the other 8859 character sets is available under
Linux through user-mode utilities (such as
.BR setfont (8))
.\" // some distributions still have the deprecated consolechars
that modify keyboard bindings and the EGA graphics
table and employ the "user mapping" font table in the console
driver.
.LP
Here are brief descriptions of each set:
.TP
8859-1 (Latin-1)
Latin-1 covers most Western European languages such as Albanian, Catalan,
Danish, Dutch, English, Faroese, Finnish, French, German, Galician,
Irish, Icelandic, Italian, Norwegian, Portuguese, Spanish, and
Swedish.
The lack of the ligatures Dutch ij, French oe and old-style
,,German`` quotation marks is considered tolerable.
.TP
8859-2 (Latin-2)
Latin-2 supports most Latin-written Slavic and Central European
languages: Croatian, Czech, German, Hungarian, Polish, Romanian,
Slovak, and Slovene.
.TP
8859-3 (Latin-3)
Latin-3 is popular with authors of Esperanto, Galician, and Maltese.
(Turkish is now written with 8859-9 instead.)
.TP
8859-4 (Latin-4)
Latin-4 introduced letters for Estonian, Latvian, and Lithuanian.
It is essentially obsolete; see 8859-10 (Latin-6) and 8859-13 (Latin-7).
.TP
8859-5
Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian,
Russian, Serbian and Ukrainian.
Ukrainians read the letter `ghe'
with downstroke as `heh' and would need a ghe with upstroke to write a
correct ghe.
See the discussion of KOI8-R below.
.TP
8859-6
Supports Arabic.
The 8859-6 glyph table is a fixed font of separate
letter forms, but a proper display engine should combine these
using the proper initial, medial, and final forms.
.TP
8859-7
Supports Modern Greek.
.TP
8859-8
Supports modern Hebrew without niqud (punctuation signs).
Niqud and full-fledged Biblical Hebrew are outside the scope of this
character set; under Linux, UTF-8 is the preferred encoding for
these.
.TP
8859-9 (Latin-5)
This is a variant of Latin-1 that replaces Icelandic letters with
Turkish ones.
.TP
8859-10 (Latin-6)
Latin-6 adds the last Inuit (Greenlandic) and Sami (Lappish) letters
that were missing in Latin-4 to cover the entire Nordic area.
RFC 1345 listed a preliminary and different `latin6'.
Skolt Sami still
needs a few more accents than these.
.TP
8859-11
This only exists as a rejected draft standard.
The draft standard
was identical to TIS-620, which is used under Linux for Thai.
.TP
8859-12
This set does not exist.
While Vietnamese has been suggested for this
space, it does not fit within the 96 (non-combining) characters ISO
8859 offers.
UTF-8 is the preferred character set for Vietnamese use
under Linux.
.TP
8859-13 (Latin-7)
Supports the Baltic Rim languages; in particular, it includes Latvian
characters not found in Latin-4.
.TP
8859-14 (Latin-8)
This is the Celtic character set, covering Gaelic and Welsh.
This charset also contains the dotted characters needed for Old Irish.
.TP
8859-15 (Latin-9)
This adds the Euro sign and French and Finnish letters that were missing in
Latin-1.
.TP
8859-16 (Latin-10)
This set covers many of the languages covered by 8859-2, and supports
Romanian more completely than that set does.
.SH KOI8-R
KOI8-R is a non-ISO character set popular in Russia.
The lower half
is US ASCII; the upper is a Cyrillic character set somewhat better
designed than ISO 8859-5.
KOI8-U is a common character set, based on
KOI8-R, that has better support for Ukrainian.
Neither of these sets
is ISO-2022 compatible, unlike the ISO-8859 series.
.LP
Console support for KOI8-R is available under Linux through user-mode
utilities that modify keyboard bindings and the EGA graphics table,
and employ the "user mapping" font table in the console driver.
.\" Thanks to Tomohiro KUBOTA for the following sections about
.\" national standards.
.SH JIS X 0208
JIS X 0208 is a Japanese national standard character set.
Though there are some more Japanese national standard character sets (like
JIS X 0201, JIS X 0212, and JIS X 0213), this is the most important one.
Characters are mapped into a 94x94 two-byte matrix,
in which each byte is in the range 0x21-0x7e.
Note that JIS X 0208 is a character set, not an encoding.
This means that JIS X 0208
itself is not used for expressing text data.
JIS X 0208 is used
as a component to construct encodings such as EUC-JP, Shift_JIS,
and ISO-2022-JP.
EUC-JP is the most important encoding for Linux
and includes US ASCII and JIS X 0208.
In EUC-JP, JIS X 0208
characters are expressed in two bytes, each of which is the
JIS X 0208 code plus 0x80.
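.LP
The "plus 0x80" rule can be sketched in Python as follows.
This is an illustration under stated assumptions, not a reference
implementation: the helper name jis_to_euc is hypothetical, and the
result is cross-checked against the euc_jp codec from the Python
standard library.
.nf
```python
# Sketch: EUC-JP bytes are the JIS X 0208 bytes plus 0x80.
# jis_to_euc is a hypothetical helper, not a library function.

def jis_to_euc(jis_bytes):
    """Add 0x80 to each byte of a two-byte JIS X 0208 code."""
    if len(jis_bytes) != 2 or not all(0x21 <= b <= 0x7E for b in jis_bytes):
        raise ValueError("expected two bytes in the range 0x21-0x7e")
    return bytes(b + 0x80 for b in jis_bytes)

# Hiragana letter A (U+3042) sits at JIS X 0208 code 0x24 0x22.
jis_code = bytes([0x24, 0x22])
euc = jis_to_euc(jis_code)

# Cross-check against the euc_jp codec in the standard library.
assert euc == chr(0x3042).encode("euc_jp")
print(euc.hex())  # a4a2
```
.fi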
.SH KS X 1001
KS X 1001 is a Korean national standard character set.
Just as with
JIS X 0208, characters are mapped into a 94x94 two-byte matrix.
KS X 1001 is used like JIS X 0208, as a component
to construct encodings such as EUC-KR, Johab, and ISO-2022-KR.
EUC-KR is the most important encoding for Linux and includes
US ASCII and KS X 1001.
KS C 5601 is an older name for KS X 1001.
.SH GB 2312
GB 2312 is a mainland Chinese national standard character set used
to express simplified Chinese.
Just like JIS X 0208, characters are
mapped into a 94x94 two-byte matrix used to construct EUC-CN.
EUC-CN
is the most important encoding for Linux and includes US ASCII and
GB 2312.
Note that EUC-CN is often called GB, GB 2312, or CN-GB.
.SH Big5
Big5 is a popular character set in Taiwan, used to express traditional
Chinese.
(Big5 is both a character set and an encoding.)
It is a superset of US ASCII.
Non-ASCII characters are expressed in two bytes.
Bytes 0xa1-0xfe are used as leading bytes for two-byte characters.
Big5 and its extensions are widely used in Taiwan and Hong Kong.
It is not ISO 2022-compliant.
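.LP
As a rough sketch, the leading-byte rule can be checked in Python.
The helper name is_big5_lead is hypothetical, and the example assumes
the big5 codec from the Python standard library.
.nf
```python
# Sketch: classify the bytes of a Big5 string.
# is_big5_lead is a hypothetical helper, not a library function.

def is_big5_lead(byte):
    """True if the byte can start a two-byte Big5 character."""
    return 0xA1 <= byte <= 0xFE

# U+4E2D (the character for "middle") followed by ASCII "A".
encoded = (chr(0x4E2D) + "A").encode("big5")

lead, trail, ascii_byte = encoded
assert is_big5_lead(lead)            # 0xa4 starts a two-byte character
assert not is_big5_lead(ascii_byte)  # 0x41 is plain ASCII
print(encoded.hex())  # a4a441
```
.fi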
.SH TIS 620
TIS 620 is a Thai national standard character set and a superset
of US ASCII.
As in the ISO 8859 series, Thai characters are mapped into
0xa1-0xfe.
TIS 620 is the only commonly used character set under
Linux besides UTF-8 to have combining characters.
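.LP
One way to see the combining characters is to scan the upper half of
the table.
The sketch below assumes the tis_620 codec and the unicodedata module
from the Python standard library, and treats Unicode general category
Mn (nonspacing mark) as the test for "combining".
.nf
```python
import unicodedata

# Sketch: find the combining characters in TIS-620.
# Assumes the tis_620 codec from the standard library.

combining = []
for value in range(0xA1, 0xFF):
    try:
        char = bytes([value]).decode("tis_620")
    except UnicodeDecodeError:
        continue  # unassigned position
    # General category Mn marks nonspacing (combining) characters.
    if unicodedata.category(char) == "Mn":
        combining.append(value)

# 0xd1 is MAI HAN-AKAT (U+0E31), a vowel sign drawn above a consonant.
assert 0xD1 in combining
print(len(combining), "combining characters found")
```
.fi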
.SH UNICODE
Unicode (ISO 10646) is a standard which aims to unambiguously represent every
character in every human language.
Unicode's structure needs about 20.1 bits to encode every character.
Since most computers don't include 20.1-bit
integers, Unicode is usually encoded as 32-bit integers internally and
as either a series of 16-bit integers (UTF-16) (needing two 16-bit integers
only when encoding certain rare characters) or a series of 8-bit bytes
(UTF-8).
Information on Unicode is available at <http://www.unicode.com>.
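.LP
The 20.1-bit figure and the UTF-16 behavior can be reproduced with a
short Python sketch.
It assumes a code space running from 0 through 0x10FFFF; the helper
name utf16_units is hypothetical.
.nf
```python
import math

# Sketch: size of the Unicode code space and UTF-16 code units.
# Assumes the code space 0 through 0x10FFFF.

code_points = 0x10FFFF + 1
bits_needed = math.log2(code_points)
print(round(bits_needed, 1))  # 20.1

def utf16_units(char):
    """Number of 16-bit integers UTF-16 needs for one character."""
    return len(char.encode("utf-16-le")) // 2

assert utf16_units("A") == 1
assert utf16_units(chr(0x20AC)) == 1   # euro sign
assert utf16_units(chr(0x1F600)) == 2  # outside the first 65536 codes
```
.fi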
.LP
Linux represents Unicode using the 8-bit Unicode Transformation Format
(UTF-8).
UTF-8 is a variable length encoding of Unicode.
It uses 1
byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, 4 bytes
for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits.
.LP
Let 0,1,x stand for a zero, one, or arbitrary bit.
A byte 0xxxxxxx
stands for the Unicode 00000000 0xxxxxxx which codes the same symbol
as the ASCII 0xxxxxxx.
Thus, ASCII goes unchanged into UTF-8, and
people using only ASCII do not notice any change: not in code, and not
in file size.
.LP
A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy
is assembled into 00000xxx xxyyyyyy.
A byte 1110xxxx is the start
of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled
into xxxxyyyy yyzzzzzz.
(When UTF-8 is used to code the 31-bit ISO 10646
then this progression continues up to 6-byte codes.)
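.LP
The bit patterns above can be turned into a small encoder.
The following Python sketch is illustrative only: utf8_encode is a
hypothetical helper that stops at 4-byte codes, and it is checked
against Python's built-in utf-8 codec.
.nf
```python
# Sketch: assemble UTF-8 bytes from the bit patterns described above.
# utf8_encode is a hypothetical helper covering 1- to 4-byte codes only.

def utf8_encode(code):
    """Encode one code point (0 through 0x1FFFFF) as UTF-8 bytes."""
    if code < 0x80:                      # 0xxxxxxx
        return bytes([code])
    if code < 0x800:                     # 110xxxxx 10yyyyyy
        return bytes([0xC0 | (code >> 6),
                      0x80 | (code & 0x3F)])
    if code < 0x10000:                   # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (code >> 12),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    if code < 0x200000:                  # 11110xxx plus three tail bytes
        return bytes([0xF0 | (code >> 18),
                      0x80 | ((code >> 12) & 0x3F),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    raise ValueError("code needs more than 4 bytes")

# Cross-check against the built-in codec.
for code in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(code) == chr(code).encode("utf-8")

print(utf8_encode(0x20AC).hex())  # e282ac
```
.fi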
.LP
For most people who use ISO-8859 character sets, this means that the
characters outside of ASCII are now coded with two bytes.
This tends
to expand ordinary text files by only one or two percent.
For Russian
or Greek users, this expands ordinary text files by close to 100%, since
text in those languages is mostly outside of ASCII.
For Japanese users this means
that the 16-bit codes now in common use will take three bytes.
While there
are algorithmic conversions from some character sets (esp. ISO-8859-1) to
Unicode, general conversion requires carrying around conversion tables,
which can be quite large for 16-bit codes.
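.LP
The size effect is easy to observe on short samples.
The Python sketch below is illustrative, not a measurement of real
text: the sample strings and the helper name sizes are made up for
this example, and the codecs are those of the Python standard library.
.nf
```python
# Sketch: compare encoded sizes in a legacy charset and in UTF-8.
# The sample strings are illustrative only, not measurements.

def sizes(text, legacy):
    """Return (bytes in the legacy charset, bytes in UTF-8)."""
    return len(text.encode(legacy)), len(text.encode("utf-8"))

french = "caf" + chr(0xE9)                       # "cafe" with e-acute
russian = "".join(chr(c) for c in (0x43C, 0x438, 0x440))  # Russian "mir"
japanese = chr(0x65E5) + chr(0x672C)             # "Nihon" in kanji

print(sizes(french, "iso8859_1"))   # (4, 5)
print(sizes(russian, "koi8_r"))     # (3, 6)
print(sizes(japanese, "euc_jp"))    # (4, 6)
```
.fi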
.LP
Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail, any other
byte is the head of a code.
Note that the only way ASCII bytes occur
in a UTF-8 stream is as themselves.
In particular, there are no
embedded NULs ('\\0') or '/'s that form part of some larger code.
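.LP
Self-synchronization means a reader dropped at an arbitrary byte can
find the start of the enclosing character by skipping tail bytes.
A minimal Python sketch, with the hypothetical helper char_start:
.nf
```python
# Sketch: resynchronize at an arbitrary offset in a UTF-8 byte string.
# char_start is a hypothetical helper, not a library function.

def char_start(data, offset):
    """Step back from offset to the head byte of the enclosing code."""
    while offset > 0 and (data[offset] & 0xC0) == 0x80:  # 10xxxxxx tail
        offset -= 1
    return offset

# "a", euro sign (3 bytes), "/" encoded as UTF-8: 61 e2 82 ac 2f
data = ("a" + chr(0x20AC) + "/").encode("utf-8")

assert char_start(data, 2) == 1   # a tail byte belongs to the euro sign
assert char_start(data, 3) == 1
assert char_start(data, 4) == 4   # "/" is a head byte: itself

# ASCII bytes such as "/" appear only as themselves.
assert data.count(b"/") == 1
```
.fi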
.LP
Since ASCII, and, in particular, NUL and '/', are unchanged, the
kernel does not notice that UTF-8 is being used.
It does not care at
all what the bytes it is handling stand for.
.LP
Rendering of Unicode data streams is typically handled through
`subfont' tables which map a subset of Unicode to glyphs.
Internally
the kernel uses Unicode to describe the subfont loaded in video RAM.
This means that in UTF-8 mode one can use a character set with 512
different symbols.
This is not enough for Japanese, Chinese and
Korean, but it is enough for most other purposes.
.LP
At the current time, the console driver does not handle combining
characters.
So Thai, Sioux and any other script needing combining
characters can't be handled on the console.
.SH "ISO 2022 AND ISO 4873"
The ISO 2022 and 4873 standards describe a font-control model
based on VT100 practice.
This model is (partially) supported
by the Linux kernel and by
.BR xterm (1).
It is popular in Japan and Korea.
.LP
There are 4 graphic character sets, called G0, G1, G2 and G3,
and one of them is the current character set for codes with
high bit zero (initially G0), and one of them is the current
character set for codes with high bit one (initially G1).
Each graphic character set has 94 or 96 characters, and is
essentially a 7-bit character set.
It uses codes either
040-0177 (041-0176) or 0240-0377 (0241-0376).
G0 always has size 94 and uses codes 041-0176.
.LP
Switching between character sets is done using the shift functions
^N (SO or LS1), ^O (SI or LS0), ESC n (LS2), ESC o (LS3),
ESC N (SS2), ESC O (SS3), ESC ~ (LS1R), ESC } (LS2R), ESC | (LS3R).
The function LS\fIn\fP makes character set G\fIn\fP the current one
for codes with high bit zero.
The function LS\fIn\fPR makes character set G\fIn\fP the current one
for codes with high bit one.
The function SS\fIn\fP makes character set G\fIn\fP (\fIn\fP=2 or 3)
the current one for the next character only (regardless of the value
of its high order bit).
.LP
A 94-character set is designated as G\fIn\fP character set
by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1),
ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol
or a pair of symbols found in the ISO 2375 International
Register of Coded Character Sets.
For example, ESC ( @ selects the ISO 646 character set as G0,
ESC ( A selects the UK standard character set (with pound
instead of number sign), ESC ( B selects ASCII (with dollar
instead of currency sign), ESC ( M selects a character set
for African languages, ESC ( ! A selects the Cuban character
set, etc.
.LP
A 96-character set is designated as G\fIn\fP character set
by an escape sequence ESC \- xx (for G1), ESC . xx (for G2)
or ESC / xx (for G3).
For example, ESC \- G selects the Hebrew alphabet as G1.
.LP
A multibyte character set is designated as G\fIn\fP character set
by an escape sequence ESC $ xx or ESC $ ( xx (for G0),
ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3).
For example, ESC $ ( C selects the Korean character set for G0.
The Japanese character set selected by ESC $ B has a more
recent version selected by ESC & @ ESC $ B.
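.LP
Designation sequences of this kind can be seen in ISO-2022-JP output.
The Python sketch below is an illustration under the assumption that
the iso2022_jp codec from the Python standard library is used; it
shows ESC $ B designating JIS X 0208 as G0 and ESC ( B returning to
ASCII.
.nf
```python
# Sketch: the designation sequences visible in ISO-2022-JP output.
# Assumes the iso2022_jp codec from the standard library.

ESC = bytes([0x1B])

# ASCII "A", then hiragana letter A (U+3042), then ASCII "B".
encoded = ("A" + chr(0x3042) + "B").encode("iso2022_jp")

expected = (
    b"A"
    + ESC + b"$B"          # designate JIS X 0208 as G0
    + bytes([0x24, 0x22])  # the 7-bit JIS X 0208 code for U+3042
    + ESC + b"(B"          # designate ASCII as G0 again
    + b"B"
)
assert encoded == expected
print(encoded)
```
.fi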
.LP
ISO 4873 stipulates a narrower use of character sets, where G0
is fixed (always ASCII), so that G1, G2 and G3
can only be invoked for codes with the high order bit set.
In particular, ^N and ^O are not used anymore, ESC ( xx
can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx
are equivalent to ESC \- xx, ESC . xx, ESC / xx, respectively.
.SH "SEE ALSO"
.BR console (4),
.BR console_codes (4),
.BR console_ioctl (4),
.BR ascii (7),
.BR iso_8859-1 (7),
.BR unicode (7),
.BR utf-8 (7)