.\" Copyright (C) Markus Kuhn, 1995, 2001
.\"
.\" %%%LICENSE_START(GPLv2+_DOC_FULL)
.\" This is free documentation; you can redistribute it and/or
.\" modify it under the terms of the GNU General Public License as
.\" published by the Free Software Foundation; either version 2 of
.\" the License, or (at your option) any later version.
.\"
.\" The GNU General Public License's references to "object code"
.\" and "executables" are to be interpreted as the output of any
.\" document formatting or typesetting system, including
.\" intermediate and printed output.
.\"
.\" This manual is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public
.\" License along with this manual; if not, see
.\" <http://www.gnu.org/licenses/>.
.\" %%%LICENSE_END
.\"
.\" 1995-11-26  Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
.\"      First version written
.\" 2001-05-11  Markus Kuhn <mgk25@cl.cam.ac.uk>
.\"      Update
.\"
.TH UNICODE 7 2017-09-15 "GNU" "Linux Programmer's Manual"
.SH NAME
unicode \- universal character set
.SH DESCRIPTION
The international standard ISO 10646 defines the
Universal Character Set (UCS).
UCS contains all characters of all other character set standards.
It also guarantees "round-trip compatibility";
in other words,
conversion tables can be built such that no information is lost
when a string is converted from any other encoding to UCS and back.
.PP
UCS contains the characters required to represent practically all
known languages.
This includes not only the Latin, Greek, Cyrillic,
Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese,
Japanese and Korean Han ideographs as well as scripts such as
Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati,
Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo,
Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian,
Ogham, Myanmar, Sinhala, Thaana, Yi, and others.
For scripts not yet
covered, research on how to best encode them for computer usage is
still going on and they will be added eventually.
This might
eventually include not only Hieroglyphs and various historic
Indo-European languages, but even some selected artistic scripts such
as Tengwar, Cirth, and Klingon.
UCS also covers a large number of
graphical, typographical, mathematical, and scientific symbols,
including those provided by TeX, Postscript, APL, MS-DOS, MS-Windows,
Macintosh, OCR fonts, as well as many word processing and publishing
systems, and more are being added.
.PP
The UCS standard (ISO 10646) describes a
31-bit character set architecture
consisting of 128 24-bit
.IR groups ,
each divided into 256 16-bit
.I planes
made up of 256 8-bit
.I rows
with 256
.I column
positions, one for each character.
Part 1 of the standard (ISO 10646-1)
defines the first 65534 code positions (0x0000 to 0xfffd), which form
the
.IR "Basic Multilingual Plane"
(BMP), that is plane 0 in group 0.
Part 2 of the standard (ISO 10646-2)
adds characters to group 0 outside the BMP in several
.I "supplementary planes"
in the range 0x10000 to 0x10ffff.
There are no plans to add characters
beyond 0x10ffff to the standard; therefore, of the entire code space,
only a small fraction of group 0 will actually be used in the
foreseeable future.
The BMP contains all characters found in
other commonly used character sets.
The supplementary planes added by
ISO 10646-2 cover only more exotic characters for special scientific,
dictionary printing, publishing industry, higher-level protocol, and
enthusiast needs.
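.PP
The group, plane, row, and column of a code position follow directly
from its bits.
The following is a minimal sketch in C, in which the type
.I struct ucs_position
and the function
.I ucs_split
are hypothetical names introduced only for this illustration:

```c
#include <stdint.h>

/* Hypothetical helper type: the four components of a 31-bit UCS code. */
struct ucs_position {
    unsigned group;   /* 0..127: top 7 bits */
    unsigned plane;   /* 0..255: next 8 bits */
    unsigned row;     /* 0..255: next 8 bits */
    unsigned column;  /* 0..255: lowest 8 bits */
};

/* Split a UCS code value into group, plane, row, and column. */
struct ucs_position ucs_split(uint32_t code)
{
    struct ucs_position p;

    p.group  = (code >> 24) & 0x7f;
    p.plane  = (code >> 16) & 0xff;
    p.row    = (code >> 8) & 0xff;
    p.column = code & 0xff;
    return p;
}
```
.PP
For example, 0x1f600 splits into group 0, plane 1, row 0xf6, and
column 0x00, which places it in the first supplementary plane.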
.PP
The representation of each UCS character as a 2-byte word is referred
to as the UCS-2 form (only for BMP characters),
whereas UCS-4 is the representation of each character by a 4-byte word.
In addition, there exist two encoding forms: UTF-8,
for backward compatibility with ASCII processing software, and UTF-16,
for the backward-compatible handling of non-BMP characters up to
0x10ffff by UCS-2 software.
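.PP
The bit layouts of UTF-8 and UTF-16 are not spelled out on this page.
The following sketch in C follows the published definitions of the two
encoding forms for code values up to 0x10ffff.
The function names
.I utf8_encode
and
.I utf16_encode
are hypothetical and exist only for this illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code value as UTF-8.
 * Returns the number of bytes stored in out (1 to 4), or 0 if the
 * value is a surrogate (0xd800 to 0xdfff) or lies beyond 0x10ffff. */
size_t utf8_encode(uint32_t code, unsigned char out[4])
{
    if (code < 0x80) {                      /* ASCII: one byte, unchanged */
        out[0] = (unsigned char) code;
        return 1;
    }
    if (code < 0x800) {
        out[0] = (unsigned char) (0xc0 | (code >> 6));
        out[1] = (unsigned char) (0x80 | (code & 0x3f));
        return 2;
    }
    if (code < 0x10000) {                   /* rest of the BMP */
        if (code >= 0xd800 && code <= 0xdfff)
            return 0;
        out[0] = (unsigned char) (0xe0 | (code >> 12));
        out[1] = (unsigned char) (0x80 | ((code >> 6) & 0x3f));
        out[2] = (unsigned char) (0x80 | (code & 0x3f));
        return 3;
    }
    if (code <= 0x10ffff) {                 /* supplementary planes */
        out[0] = (unsigned char) (0xf0 | (code >> 18));
        out[1] = (unsigned char) (0x80 | ((code >> 12) & 0x3f));
        out[2] = (unsigned char) (0x80 | ((code >> 6) & 0x3f));
        out[3] = (unsigned char) (0x80 | (code & 0x3f));
        return 4;
    }
    return 0;
}

/* Encode one code value as UTF-16.
 * Returns the number of 16-bit units stored in out (1 or 2), or 0 if
 * the value is a surrogate or lies beyond 0x10ffff. */
size_t utf16_encode(uint32_t code, uint16_t out[2])
{
    if (code < 0x10000) {                   /* BMP: same as UCS-2 */
        if (code >= 0xd800 && code <= 0xdfff)
            return 0;
        out[0] = (uint16_t) code;
        return 1;
    }
    if (code <= 0x10ffff) {                 /* surrogate pair */
        code -= 0x10000;
        out[0] = (uint16_t) (0xd800 | (code >> 10));
        out[1] = (uint16_t) (0xdc00 | (code & 0x3ff));
        return 2;
    }
    return 0;
}
```
.PP
With these functions, 0x00c4 becomes the two UTF-8 bytes 0xc3 0x84,
and the non-BMP value 0x1f600 becomes the UTF-16 pair 0xd83d 0xde00.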
.PP
The UCS characters 0x0000 to 0x007f are identical to those of the
classic US-ASCII
character set and the characters in the range 0x0000 to 0x00ff
are identical to those in
ISO 8859-1 (Latin-1).
.SS Combining characters
Some code points in UCS
have been assigned to
.IR "combining characters" .
These are similar to the nonspacing accent keys on a typewriter.
A combining character just adds an accent to the previous character.
The most important accented characters have codes of their own in UCS;
however, the combining character mechanism allows us to add accents
and other diacritical marks to any character.
The combining characters
always follow the character which they modify.
For example, the German
character Umlaut-A ("Latin capital letter A with diaeresis") can
either be represented by the precomposed UCS code 0x00c4, or
alternatively as the combination of a normal "Latin capital letter A"
followed by a "combining diaeresis": 0x0041 0x0308.
.PP
Combining characters are essential, for instance, for encoding the Thai
script, for mathematical typesetting, and for users of the International
Phonetic Alphabet.
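.PP
The two representations of Umlaut-A described above differ in length
but denote a single character.
The following sketch in C counts characters while skipping combining
marks.
It rests on a loud simplifying assumption: only the Combining
Diacritical Marks block (0x0300 to 0x036f) is recognized, whereas UCS
has combining characters in many other ranges.
Both function names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplifying assumption: treat only the Combining Diacritical Marks
 * block (0x0300 to 0x036f) as combining.  Real code would consult the
 * Unicode character database instead. */
int is_combining_diacritic(uint32_t code)
{
    return code >= 0x0300 && code <= 0x036f;
}

/* Count the code values in s[0..n-1] that are not combining marks,
 * that is, the characters that start a new base character. */
size_t count_base_characters(const uint32_t *s, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (!is_combining_diacritic(s[i]))
            count++;
    return count;
}
```
.PP
Under this assumption, both the precomposed sequence 0x00c4 and the
decomposed sequence 0x0041 0x0308 count as one base character.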
.SS Implementation levels
As not all systems are expected to support advanced mechanisms like
combining characters, ISO 10646-1 specifies the following three
.I implementation levels
of UCS:
.TP 0.9i
Level 1
Combining characters and Hangul Jamo
(a variant encoding of the Korean script, where a Hangul syllable
glyph is coded as a triplet or pair of vowel/consonant codes) are not
supported.
.TP
Level 2
In addition to level 1, combining characters are now allowed for some
languages where they are essential (e.g., Thai, Lao, Hebrew,
Arabic, Devanagari, Malayalam).
.TP
Level 3
All UCS characters are supported.
.PP
The Unicode 3.0 Standard
published by the Unicode Consortium
contains exactly the UCS Basic Multilingual Plane
at implementation level 3, as described in ISO 10646-1:2000.
Unicode 3.1 added the supplementary planes of ISO 10646-2.
The Unicode standard and
technical reports published by the Unicode Consortium provide much
additional information on the semantics and recommended usages of
various characters.
They provide guidelines and algorithms for
editing, sorting, comparing, normalizing, converting, and displaying
Unicode strings.
.SS Unicode under Linux
Under GNU/Linux, the C type
.I wchar_t
is a signed 32-bit integer type.
Its values are always interpreted
by the C library as UCS
code values (in all locales), a convention that is signaled by the GNU
C library to applications by defining the constant
.B __STDC_ISO_10646__
as specified in the ISO C99 standard.
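.PP
An application can test for this convention at compile time.
The following is a minimal sketch in C, in which the function name
.I wchar_is_ucs
is hypothetical:

```c
#include <wchar.h>

/* Return 1 if the C library promises that wchar_t values are UCS
 * code values in all locales, and 0 if it makes no such promise. */
int wchar_is_ucs(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```
.PP
Portable code that relies on UCS values in
.I wchar_t
could use such a check to select a fallback conversion path on other
systems.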
.PP
UCS/Unicode can be used just like ASCII in input/output streams,
terminal communication, plaintext files, filenames, and environment
variables when it is encoded in the ASCII-compatible UTF-8 multibyte encoding.
To signal the use of UTF-8 as the character
encoding to all applications, a suitable
.I locale
has to be selected via environment variables (e.g.,
"LANG=en_GB.UTF-8").
.PP
The
.B nl_langinfo(CODESET)
function returns the name of the selected encoding.
Library functions such as
.BR wctomb (3)
and
.BR mbsrtowcs (3)
can be used to transform the internal
.I wchar_t
characters and strings into the system character encoding and back,
and
.BR wcwidth (3)
reports how many positions (0\(en2) the cursor is advanced by the
output of a character.
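.PP
The following sketch in C shows how the locale selection and the
conversion functions mentioned above might fit together.
The function names
.I select_env_locale
and
.I encode_wchar
are hypothetical.
The standard calls used are
.BR setlocale (3),
.BR nl_langinfo (3),
and
.BR wctomb (3):

```c
#include <langinfo.h>
#include <limits.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Adopt the locale named by the environment (for example LANG) and
 * return the name of its character encoding, or NULL if the locale
 * could not be selected. */
const char *select_env_locale(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}

/* Convert one wchar_t into the multibyte encoding of the current
 * locale.  buf must have room for MB_LEN_MAX bytes.  Returns the
 * number of bytes stored, or -1 if wc cannot be represented. */
int encode_wchar(wchar_t wc, char buf[MB_LEN_MAX])
{
    (void) wctomb(NULL, 0);     /* reset any shift state */
    return wctomb(buf, wc);
}
```
.PP
In a UTF-8 locale,
.I encode_wchar
would be expected to store the two bytes 0xc3 0x84 for the value
0x00c4, whereas in a locale that cannot represent the character it
returns \-1.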
.PP
.SS Private Use Areas (PUA)
In the Basic Multilingual Plane,
the range 0xe000 to 0xf8ff will never be assigned to any characters by
the standard and is reserved for private usage.
For the Linux
community, this private area has been subdivided further into the
range 0xe000 to 0xefff, which can be used individually by any end user,
and the Linux zone in the range 0xf000 to 0xf8ff, where extensions are
coordinated among all Linux users.
The registry of the characters
assigned to the Linux zone is maintained by LANANA and the registry
itself is
.I Documentation/admin\-guide/unicode.rst
in the Linux kernel sources
.\" commit 9d85025b0418163fae079c9ba8f8445212de8568
(or
.I Documentation/unicode.txt
before Linux 4.10).
.PP
Two other planes are reserved for private usage, plane 15
(Supplementary Private Use Area-A, range 0xf0000 to 0xffffd)
and plane 16 (Supplementary Private Use Area-B, range
0x100000 to 0x10fffd).
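.PP
The private-use ranges listed above can be expressed as a simple
classification.
The following is a minimal sketch in C, in which the enumeration
.I enum pua_zone
and the function
.I pua_classify
are hypothetical names introduced only for this illustration:

```c
#include <stdint.h>

/* Hypothetical labels for the private-use ranges described above. */
enum pua_zone {
    PUA_NONE,          /* not a private-use code value */
    PUA_BMP_END_USER,  /* 0xe000 to 0xefff: individual end-user use */
    PUA_BMP_LINUX,     /* 0xf000 to 0xf8ff: Linux zone */
    PUA_PLANE_15,      /* 0xf0000 to 0xffffd: Supplementary PUA-A */
    PUA_PLANE_16       /* 0x100000 to 0x10fffd: Supplementary PUA-B */
};

/* Report which private-use range, if any, contains the code value. */
enum pua_zone pua_classify(uint32_t code)
{
    if (code >= 0xe000 && code <= 0xefff)
        return PUA_BMP_END_USER;
    if (code >= 0xf000 && code <= 0xf8ff)
        return PUA_BMP_LINUX;
    if (code >= 0xf0000 && code <= 0xffffd)
        return PUA_PLANE_15;
    if (code >= 0x100000 && code <= 0x10fffd)
        return PUA_PLANE_16;
    return PUA_NONE;
}
```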
.SS Literature
.IP * 3
Information technology \(em Universal Multiple-Octet Coded Character
Set (UCS) \(em Part 1: Architecture and Basic Multilingual Plane.
International Standard ISO/IEC 10646-1, International Organization
for Standardization, Geneva, 2000.
.IP
This is the official specification of UCS.
Available from
.UR http://www.iso.ch/
.UE .
.IP *
The Unicode Standard, Version 3.0.
The Unicode Consortium, Addison-Wesley,
Reading, MA, 2000, ISBN 0-201-61633-5.
.IP *
S.\& Harbison, G.\& Steele. C: A Reference Manual. Fourth edition,
Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3.
.IP
A good reference book about the C programming language.
The fourth