'\" t -*- coding: UTF-8 -*-
.\" Copyright (c) 1996 Eric S. Raymond <esr@thyrsus.com>
.\" and Copyright (c) Andries Brouwer <aeb@cwi.nl>
.\"
.\" %%%LICENSE_START(GPLv2+_DOC_ONEPARA)
.\" This is free documentation; you can redistribute it and/or
.\" modify it under the terms of the GNU General Public License as
.\" published by the Free Software Foundation; either version 2 of
.\" the License, or (at your option) any later version.
.\" %%%LICENSE_END
.\"
.\" This is combined from many sources, including notes by aeb and
.\" research by esr.  Portions derive from a writeup by Roman Czyborra.
.\"
.\" Changes also by David Starner <dstarner98@aasaa.ofe.org>.
.\"
.TH CHARSETS 7 2016-07-17 "Linux" "Linux Programmer's Manual"
.SH NAME
charsets - character set standards and internationalization
.SH DESCRIPTION
This manual page gives an overview of different character set standards
and how they were used on Linux before Unicode became ubiquitous.
Some of this information is still helpful for people working with legacy
systems and documents.
.LP
Standards discussed include
ASCII, GB 2312, ISO 8859, JIS, KOI8-R, KS, and Unicode.
.LP
The primary emphasis is on character sets that were actually used as
locale character sets, not the myriad others that could be found in data
from other systems.
.SS ASCII
ASCII (American Standard Code For Information Interchange) is the original
7-bit character set, originally designed for American English.
Also known as US-ASCII.
It is currently described by the ISO 646:1991 IRV
(International Reference Version) standard.
.LP
Various ASCII variants emerged that replaced the dollar sign with other
currency symbols and replaced punctuation with non-English alphabetic
characters, in order to cover German, French, Spanish, and other
languages in 7 bits.
All are deprecated;
glibc does not support locales whose character sets are not true
supersets of ASCII.
.LP
Because UTF-8 is ASCII-compatible, plain ASCII text
still renders properly on modern systems that use UTF-8.
.SS ISO 8859
ISO 8859 is a series of 15 8-bit character sets, all of which have ASCII
in their low (7-bit) half, invisible control characters in positions
128 to 159, and 96 fixed-width graphics in positions 160-255.
.LP
Of these, the most important is ISO 8859-1
("Latin Alphabet No. 1" / Latin-1).
It was widely adopted and supported by different systems,
and is gradually being replaced with Unicode.
The ISO 8859-1 characters are also the first 256 characters of Unicode.
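.LP
That last property can be checked with a short Python sketch.
Python and its codec name "latin-1" are choices made for this
illustration only; nothing in the encoding depends on them.
.LP
.nf
```python
# Every byte value 0..255, in order.
all_bytes = bytes(range(256))

# Decode them as ISO 8859-1 (Python calls this codec "latin-1").
text = all_bytes.decode("latin-1")

# Each resulting character has the same number as the byte it came from.
code_points = [ord(ch) for ch in text]
print(code_points == list(range(256)))
```
.fi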
.LP
Console support for the other 8859 character sets is available under
Linux through user-mode utilities (such as
.BR setfont (8))
that modify keyboard bindings and the EGA graphics
table and employ the "user mapping" font table in the console
driver.
.LP
Here are brief descriptions of each set:
.TP
8859-1 (Latin-1)
Latin-1 covers many West European languages such as Albanian, Basque,
Danish, English, Faroese, Galician, Icelandic, Irish, Italian,
Norwegian, Portuguese, Spanish, and Swedish.
The lack of the ligatures Dutch Ĳ/ĳ, French œ, and old-style „German“
quotation marks was considered tolerable.
.TP
8859-2 (Latin-2)
Latin-2 supports many Latin-written Central and East European
languages such as Bosnian, Croatian, Czech, German, Hungarian, Polish,
Slovak, and Slovene.
Replacing Romanian ș/ț with ş/ţ was considered tolerable.
.TP
8859-3 (Latin-3)
Latin-3 was designed to cover Esperanto, Maltese, and Turkish, but
8859-9 later superseded it for Turkish.
.TP
8859-4 (Latin-4)
Latin-4 introduced letters for North European languages such as
Estonian, Latvian, and Lithuanian, but was superseded by 8859-10 and
8859-13.
.TP
8859-5
Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian,
Russian, Serbian, and (almost completely) Ukrainian.
It was never widely used; see the discussion of KOI8-R/KOI8-U below.
.TP
8859-6
Was created for Arabic.
The 8859-6 glyph table is a fixed font of separate
letter forms, but a proper display engine should combine these
using the proper initial, medial, and final forms.
.TP
8859-7
Was created for Modern Greek in 1987, updated in 2003.
.TP
8859-8
Supports Modern Hebrew without niqud (vowel points).
Niqud and full-fledged Biblical Hebrew were outside the scope of this
character set.
.TP
8859-9 (Latin-5)
This is a variant of Latin-1 that replaces Icelandic letters with
Turkish ones.
.TP
8859-10 (Latin-6)
Latin-6 added the Inuit (Greenlandic) and Sami (Lappish) letters that were
missing in Latin-4 to cover the entire Nordic area.
.TP
8859-11
Supports the Thai alphabet and is nearly identical to the TIS-620
standard.
.TP
8859-12
This set does not exist.
.TP
8859-13 (Latin-7)
Supports the Baltic Rim languages; in particular, it includes Latvian
characters not found in Latin-4.
.TP
8859-14 (Latin-8)
This is the Celtic character set, covering Old Irish, Manx, Gaelic,
Welsh, Cornish, and Breton.
.TP
8859-15 (Latin-9)
Latin-9 is similar to the widely used Latin-1 but replaces some less
common symbols with the Euro sign and French and Finnish letters that
were missing in Latin-1.
.TP
8859-16 (Latin-10)
This set covers many Southeast European languages, and most
importantly supports Romanian more completely than Latin-2.
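.LP
As an illustration of how closely related two of these sets can be,
the following Python sketch lists the positions at which Latin-1 and
Latin-9 differ.
The codec names "latin-1" and "iso8859-15" are the spellings used by
Python and are an assumption of this example.
.LP
.nf
```python
# Compare how each byte value decodes in Latin-1 and in Latin-9.
differences = []
for value in range(0xA0, 0x100):
    raw = bytes([value])
    latin1 = raw.decode("latin-1")
    latin9 = raw.decode("iso8859-15")
    if latin1 != latin9:
        differences.append((value, latin1, latin9))

# Print the byte value and the two Unicode code points it can stand for.
for value, latin1, latin9 in differences:
    print(hex(value), "U+%04X" % ord(latin1), "U+%04X" % ord(latin9))
```
.fi
.LP
The first entry printed should be position 0xa4, which holds the
currency sign in Latin-1 and the Euro sign in Latin-9.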
.SS KOI8-R / KOI8-U
KOI8-R is a non-ISO character set popular in Russia before Unicode.
The lower half is ASCII;
the upper is a Cyrillic character set somewhat better designed than
ISO 8859-5.
KOI8-U, based on KOI8-R, has better support for Ukrainian.
Neither of these sets is ISO 2022 compatible,
unlike the ISO 8859 series.
.LP
Console support for KOI8-R is available under Linux through user-mode
utilities that modify keyboard bindings and the EGA graphics table,
and employ the "user mapping" font table in the console driver.
.SS GB 2312
GB 2312 is a mainland Chinese national standard character set used
to express simplified Chinese.
Just like JIS X 0208, characters are
mapped into a 94x94 two-byte matrix used to construct EUC-CN.
EUC-CN
is the most important encoding for Linux and includes ASCII and
GB 2312.
Note that EUC-CN is often called GB, GB 2312, or CN-GB.
.SS Big5
Big5 was a popular character set in Taiwan to express traditional
Chinese.
(Big5 is both a character set and an encoding.)
It is a superset of ASCII.
Non-ASCII characters are expressed in two bytes.
Bytes 0xa1-0xfe are used as leading bytes for two-byte characters.
Big5 and its extension were widely used in Taiwan and Hong Kong.
It is not ISO 2022 compliant.
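.LP
The leading-byte rule can be sketched in Python as follows.
The function name split_big5 is hypothetical, and the sketch does no
validation of the second byte.
.LP
.nf
```python
def split_big5(data):
    """Split Big5-encoded bytes into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        # A byte in 0xa1-0xfe starts a two-byte character;
        # anything else is treated here as a single byte.
        if 0xA1 <= data[i] <= 0xFE and i + 1 < len(data):
            chars.append(data[i:i + 2])
            i += 2
        else:
            chars.append(data[i:i + 1])
            i += 1
    return chars

# "A", two Chinese characters (U+4E2D, U+6587), then "B".
sample = "A" + chr(0x4E2D) + chr(0x6587) + "B"
pieces = split_big5(sample.encode("big5"))
print([p.decode("big5") for p in pieces])
```
.fi
.LP
Because the second byte of a pair may fall in the ASCII range,
a Big5 stream has to be scanned from a known character boundary;
its bytes cannot be classified one at a time.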
.\" Thanks to Tomohiro KUBOTA for the following sections about
.\" national standards.
.SS JIS X 0208
JIS X 0208 is a Japanese national standard character set.
Though there are some more Japanese national standard character sets (like
JIS X 0201, JIS X 0212, and JIS X 0213), this is the most important one.
Characters are mapped into a 94x94 two-byte matrix,
in which each byte is in the range 0x21-0x7e.
Note that JIS X 0208 is a character set, not an encoding.
This means that JIS X 0208
itself is not used for expressing text data.
JIS X 0208 is used
as a component to construct encodings such as EUC-JP, Shift_JIS,
and ISO-2022-JP.
EUC-JP is the most important encoding for Linux
and includes ASCII and JIS X 0208.
In EUC-JP, JIS X 0208
characters are expressed in two bytes, each of which is the
JIS X 0208 code plus 0x80.
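.LP
The "plus 0x80" rule can be sketched in Python.
The function name jis_to_euc_jp is hypothetical, and the code 0x2422
is assumed here to be the JIS X 0208 code of HIRAGANA LETTER A (U+3042).
.LP
.nf
```python
def jis_to_euc_jp(jis_code):
    """Turn a 16-bit JIS X 0208 code into the two EUC-JP bytes."""
    high = (jis_code >> 8) & 0xFF
    low = jis_code & 0xFF
    # Both bytes of a JIS X 0208 code lie in 0x21-0x7e.
    if not (0x21 <= high <= 0x7E and 0x21 <= low <= 0x7E):
        raise ValueError("not a JIS X 0208 code")
    return bytes([high + 0x80, low + 0x80])

euc = jis_to_euc_jp(0x2422)
# Show the bytes, and check them against the EUC-JP codec.
print(euc.hex(), euc.decode("euc_jp") == chr(0x3042))
```
.fi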
.SS KS X 1001
KS X 1001 is a Korean national standard character set.
Just as with
JIS X 0208, characters are mapped into a 94x94 two-byte matrix.
KS X 1001 is used like JIS X 0208, as a component
to construct encodings such as EUC-KR, Johab, and ISO-2022-KR.
EUC-KR is the most important encoding for Linux and includes
ASCII and KS X 1001.
KS C 5601 is an older name for KS X 1001.
.SS ISO 2022 and ISO 4873
The ISO 2022 and 4873 standards describe a font-control model
based on VT100 practice.
This model is (partially) supported
by the Linux kernel and by
.BR xterm (1).
Several ISO 2022-based character encodings have been defined,
especially for Japanese.
.LP
There are 4 graphic character sets, called G0, G1, G2, and G3,
and one of them is the current character set for codes with
high bit zero (initially G0), and one of them is the current
character set for codes with high bit one (initially G1).
Each graphic character set has 94 or 96 characters, and is
essentially a 7-bit character set.
It uses either codes
040-0177 (041-0176 for a 94-character set) or
0240-0377 (0241-0376 for a 94-character set), all in octal.
G0 always has size 94 and uses codes 041-0176.
.LP
Switching between character sets is done using the shift functions
\fB^N\fP (SO or LS1), \fB^O\fP (SI or LS0), ESC n (LS2), ESC o (LS3),
ESC N (SS2), ESC O (SS3), ESC ~ (LS1R), ESC } (LS2R), ESC | (LS3R).
The function LS\fIn\fP makes character set G\fIn\fP the current one
for codes with high bit zero.
The function LS\fIn\fPR makes character set G\fIn\fP the current one
for codes with high bit one.
The function SS\fIn\fP makes character set G\fIn\fP (\fIn\fP=2 or 3)
the current one for the next character only (regardless of the value
of its high order bit).
.LP
A 94-character set is designated as G\fIn\fP character set
by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1),
ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol
or a pair of symbols found in the ISO 2375 International
Register of Coded Character Sets.
For example, ESC ( @ selects the ISO 646 character set as G0,
ESC ( A selects the UK standard character set (with pound
instead of number sign), ESC ( B selects ASCII (with dollar
instead of currency sign), ESC ( M selects a character set
for African languages, ESC ( ! A selects the Cuban character
set, and so on.
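.LP
The designation sequences for 94-character sets can be built
mechanically.
The following Python sketch uses the hypothetical names
INTERMEDIATE_94 and designate_94; it only assembles bytes and does not
check the final symbol against the register.
.LP
.nf
```python
ESC = bytes([0x1B])

# Intermediate characters that designate a 94-character set as G0..G3.
INTERMEDIATE_94 = ["(", ")", "*", "+"]

def designate_94(g, final):
    """Build the escape sequence that designates a 94-set as G0..G3.

    final is the registered symbol or pair of symbols, e.g. "B" for ASCII.
    """
    return ESC + INTERMEDIATE_94[g].encode("ascii") + final.encode("ascii")

print(designate_94(0, "B").hex())    # ASCII as G0
print(designate_94(1, "A").hex())    # UK set as G1
```
.fi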
.LP
A 96-character set is designated as G\fIn\fP character set
by an escape sequence ESC \- xx (for G1), ESC . xx (for G2)
or ESC / xx (for G3).
For example, ESC \- G selects the Hebrew alphabet as G1.
.LP
A multibyte character set is designated as G\fIn\fP character set
by an escape sequence ESC $ xx or ESC $ ( xx (for G0),
ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3).
For example, ESC $ ( C selects the Korean character set for G0.
The Japanese character set selected by ESC $ B has a more
recent version selected by ESC & @ ESC $ B.
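.LP
ISO-2022-JP is one encoding that uses such designations in practice.
The Python sketch below encodes a single Japanese character and
inspects the result; it assumes that the "iso2022_jp" codec of Python
designates JIS X 0208 with ESC $ B and returns to ASCII with ESC ( B.
.LP
.nf
```python
ESC = bytes([0x1B])

text = chr(0x3042)                    # HIRAGANA LETTER A
encoded = text.encode("iso2022_jp")

# Assumed layout: ESC $ B, two 7-bit bytes, then ESC ( B.
print(encoded.startswith(ESC + b"$B"))
print(encoded.endswith(ESC + b"(B"))

# Every byte stays below 0x80, so the stream is 7-bit clean.
print(all(byte < 0x80 for byte in encoded))
```
.fi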
.LP
ISO 4873 stipulates a narrower use of character sets, where G0
is fixed (always ASCII), so that G1, G2 and G3
can be invoked only for codes with the high order bit set.
In particular, \fB^N\fP and \fB^O\fP are not used anymore, ESC ( xx
can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx
are equivalent to ESC \- xx, ESC . xx, ESC / xx, respectively.
.SS TIS-620
TIS-620 is a Thai national standard character set and a superset
of ASCII.
In the same fashion as the ISO 8859 series, Thai characters are mapped into
0xa1-0xfe.
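.LP
A small Python sketch of this mapping follows.
It assumes that the Thai letters at 0xa1-0xda lie at a constant
distance from their Unicode code points, and it checks that assumption
against the "tis-620" codec of Python rather than taking it for granted.
.LP
.nf
```python
# Assumed constant distance between a TIS-620 byte and its code point.
OFFSET = 0x0E01 - 0xA1

for value in (0xA1, 0xA2, 0xDA):
    ch = bytes([value]).decode("tis-620")
    # Print the byte, the decoded code point, and whether the offset holds.
    print(hex(value), "U+%04X" % ord(ch), ord(ch) == value + OFFSET)
```
.fi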
.SS Unicode
Unicode (ISO 10646) is a standard which aims to unambiguously represent
every character in every human language.
Unicode's code space needs about 20.1 bits to encode every character.
Since most computers don't include 20.1-bit integers, Unicode is
usually encoded internally as 32-bit integers, and otherwise as either
a series of 16-bit integers (UTF-16), which needs two 16-bit integers
only for certain rare characters, or a series of 8-bit bytes (UTF-8).
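.LP
Both figures can be reproduced with a short Python sketch.
The function name utf16_units is hypothetical, and the sketch does not
reject the code points reserved for surrogates.
.LP
.nf
```python
import math

# Unicode code points run from 0 to 0x10FFFF.
code_space = 0x10FFFF + 1
print(round(math.log2(code_space), 1))

def utf16_units(code_point):
    """Return the 16-bit units UTF-16 uses for one code point."""
    if code_point < 0x10000:
        return [code_point]
    # Characters beyond 16 bits are split into a surrogate pair.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]

print([hex(u) for u in utf16_units(0x20AC)])     # EURO SIGN, one unit
print([hex(u) for u in utf16_units(0x1F600)])    # an emoji, two units
```
.fi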
.LP
Linux represents Unicode using the 8-bit Unicode Transformation Format
(UTF-8).
UTF-8 is a variable length encoding of Unicode.
It uses 1
byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, 4 bytes
for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits.
.LP
Let 0,1,x stand for a zero, one, or arbitrary bit.
A byte 0xxxxxxx
stands for the Unicode 00000000 0xxxxxxx which codes the same symbol
as the ASCII 0xxxxxxx.
Thus, ASCII goes unchanged into UTF-8, and
people using only ASCII do not notice any change: not in code, and not
in file size.
.LP
A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy
is assembled into 00000xxx xxyyyyyy.
A byte 1110xxxx is the start
of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled
into xxxxyyyy yyzzzzzz.
(When UTF-8 is used to code the full 31-bit ISO 10646 space,
this progression continues up to 6-byte codes;
the current Unicode range needs at most 4 bytes.)
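.LP
The bit patterns above translate directly into code.
The following Python sketch is a hand-written encoder for
illustration only (real programs should use a library routine);
the function name utf8_encode is hypothetical, and the 4-byte form
11110xxx is included although it is not spelled out above.
.LP
.nf
```python
def utf8_encode(code_point):
    """Encode one code point by hand, following the bit patterns."""
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10yyyyyy
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four-byte form: 11110xxx followed by three 10xxxxxx tails.
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print("U+%04X" % cp, utf8_encode(cp).hex())
```
.fi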
.LP
For most texts in ISO 8859 character sets, this means that the
characters outside of ASCII are now coded with two bytes.
This tends
to expand ordinary text files by only one or two percent.
For Russian
or Greek texts, this expands ordinary text files by 100%, since text in
those languages is mostly outside of ASCII.
For Japanese users this means
that the 16-bit codes now in common use will take three bytes.
While there are algorithmic conversions from some character sets
(especially ISO 8859-1) to Unicode, general conversion requires
carrying around conversion tables, which can be quite large for 16-bit
codes.
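.LP
The size effects described above can be measured on small samples.
In this Python sketch the sample strings and the codec names are
arbitrary choices made for illustration; real texts contain more
ASCII (spaces, digits, markup), so the ratios will differ.
.LP
.nf
```python
samples = [
    # (description, legacy codec name in Python, sample text)
    ("French", "latin-1", "Un café très cher"),
    ("Greek", "iso8859-7", "αβγδε"),
    ("Japanese", "euc_jp", "日本語"),
]

# Map each description to (legacy size, UTF-8 size) in bytes.
sizes = {}
for name, codec, text in samples:
    sizes[name] = (len(text.encode(codec)), len(text.encode("utf-8")))
    print(name, sizes[name])
```
.fi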
.LP
Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail, any other
byte is the head of a code.
Note that the only way ASCII bytes occur
in a UTF-8 stream is as themselves.
In particular, there are no
embedded NULs (\(aq\\0\(aq) or \(aq/\(aqs that form part of some larger code.
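.LP
Self-synchronization means that a program can find a character
boundary from any byte offset.
A minimal Python sketch, with the hypothetical function name
char_start and assuming well-formed input:
.LP
.nf
```python
def char_start(data, index):
    """Back up from any byte offset to the head byte of its character."""
    # Tail bytes look like 10xxxxxx, i.e. (byte & 0xC0) == 0x80.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index

# "a", the Euro sign (3 bytes in UTF-8), and a slash: 1 + 3 + 1 bytes.
data = ("a" + chr(0x20AC) + "/").encode("utf-8")
print([char_start(data, i) for i in range(len(data))])

# The ASCII slash appears only as itself, never inside the 3-byte code.
print(data.count(b"/"))
```
.fi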
.LP
Since ASCII, and, in particular, NUL and \(aq/\(aq, are unchanged, the
kernel does not notice that UTF-8 is being used.
It does not care at
all what the bytes it is handling stand for.
.LP
Rendering of Unicode data streams is typically handled through
"subfont" tables which map a subset of Unicode to glyphs.
Internally
the kernel uses Unicode to describe the subfont loaded in video RAM.
This means that in the Linux console in UTF-8 mode, one can use a character
set with 512 different symbols.
This is not enough for Japanese, Chinese, and
Korean, but it is enough for most other purposes.
.SH SEE ALSO
.BR iconv (1),
.BR ascii (7),
.BR iso_8859-1 (7),
.BR unicode (7),
.BR utf-8 (7)