.\" Copyright (C) Markus Kuhn, 1995, 2001
.\"
.\" SPDX-License-Identifier: GPL-2.0-or-later
.\"
.\" 1995-11-26  Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
.\"      First version written
.\" 2001-05-11  Markus Kuhn <mgk25@cl.cam.ac.uk>
.\"      Update
.\"
.TH unicode 7 (date) "Linux man-pages (unreleased)"
.SH NAME
unicode \- universal character set
.SH DESCRIPTION
The international standard ISO 10646 defines the
Universal Character Set (UCS).
UCS contains all characters of all other character set standards.
It also guarantees "round-trip compatibility";
in other words,
conversion tables can be built such that no information is lost
when a string is converted from any other encoding to UCS and back.
.PP
UCS contains the characters required to represent practically all
known languages.
This includes not only the Latin, Greek, Cyrillic,
Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese,
Japanese and Korean Han ideographs as well as scripts such as
Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati,
Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo,
Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian,
Ogham, Myanmar, Sinhala, Thaana, Yi, and others.
For scripts not yet
covered, research on how best to encode them for computer use is
ongoing, and they are expected to be added over time.
This might
include not only Hieroglyphs and various historic
Indo-European languages, but even some selected artistic scripts such
as Tengwar, Cirth, and Klingon.
UCS also covers a large number of
graphical, typographical, mathematical, and scientific symbols,
including those provided by TeX, PostScript, APL, MS-DOS, MS-Windows,
Macintosh, OCR fonts, as well as many word processing and publishing
systems, and more are being added.
.PP
The UCS standard (ISO 10646) describes a
31-bit character set architecture
consisting of 128 24-bit
.IR groups ,
each divided into 256 16-bit
.I planes
made up of 256 8-bit
.I rows
with 256
.I column
positions, one for each character.
Part 1 of the standard (ISO 10646-1)
defines the first 65534 code positions (0x0000 to 0xfffd), which form
the
.I Basic Multilingual Plane
(BMP), that is plane 0 in group 0.
Part 2 of the standard (ISO 10646-2)
adds characters to group 0 outside the BMP in several
.I "supplementary planes"
in the range 0x10000 to 0x10ffff.
There are no plans to add characters
beyond 0x10ffff to the standard;
therefore, only a small fraction of group 0
(and of the code space as a whole)
will actually be used in the foreseeable future.
The BMP contains all characters found in the
other commonly used character sets.
The supplementary planes added by
ISO 10646-2 cover only more exotic characters for special scientific,
dictionary printing, publishing industry, higher-level protocol and
enthusiast needs.
.PP
The representation of each UCS character as a 2-byte word is referred
to as the UCS-2 form (only for BMP characters),
whereas UCS-4 is the representation of each character by a 4-byte word.
In addition, there exist two encoding forms:
UTF-8, for backward compatibility with ASCII processing software, and UTF-16,
for the backward-compatible handling of non-BMP characters up to
0x10ffff by UCS-2 software.
.PP
The UCS characters 0x0000 to 0x007f are identical to those of the
classic US-ASCII
character set and the characters in the range 0x0000 to 0x00ff
are identical to those in
ISO 8859-1 (Latin-1).
.SS Combining characters
Some code points in UCS
have been assigned to
.IR "combining characters" .
These are similar to the nonspacing accent keys on a typewriter.
A combining character just adds an accent to the previous character.
The most important accented characters have codes of their own in UCS;
however, the combining character mechanism allows us to add accents
and other diacritical marks to any character.
The combining characters
always follow the character which they modify.
For example, the German
character Umlaut-A ("Latin capital letter A with diaeresis") can
either be represented by the precomposed UCS code 0x00c4, or
alternatively as the combination of a normal "Latin capital letter A"
followed by a "combining diaeresis": 0x0041 0x0308.
.PP
Combining characters are essential, for instance, for encoding the Thai
script, for mathematical typesetting, and for users of the International
Phonetic Alphabet.
.SS Implementation levels
As not all systems are expected to support advanced mechanisms like
combining characters, ISO 10646-1 specifies the following three
.I implementation levels
of UCS:
.TP 0.9i
Level 1
Combining characters and Hangul Jamo
(a variant encoding of the Korean script, where a Hangul syllable
glyph is coded as a triplet or pair of vowel/consonant codes) are not
supported.
.TP
Level 2
In addition to level 1, combining characters are now allowed for some
languages where they are essential (e.g., Thai, Lao, Hebrew,
Arabic, Devanagari, Malayalam).
.TP
Level 3
All UCS characters are supported.
.PP
The Unicode 3.0 Standard
published by the Unicode Consortium
contains exactly the UCS Basic Multilingual Plane
at implementation level 3, as described in ISO 10646-1:2000.
Unicode 3.1 added the supplementary planes of ISO 10646-2.
The Unicode standard and
technical reports published by the Unicode Consortium provide much
additional information on the semantics and recommended usages of
various characters.
They provide guidelines and algorithms for
editing, sorting, comparing, normalizing, converting, and displaying
Unicode strings.
.SS Unicode under Linux
Under GNU/Linux, the C type
.I wchar_t
is a signed 32-bit integer type.
Its values are always interpreted
by the C library as UCS
code values (in all locales), a convention that is signaled by the GNU
C library to applications by defining the constant
.B __STDC_ISO_10646__
as specified in the ISO C99 standard.
.PP
UCS/Unicode can be used just like ASCII in input/output streams,
terminal communication, plaintext files, filenames, and environment
variables in the ASCII-compatible UTF-8 multibyte encoding.
To signal the use of UTF-8 as the character
encoding to all applications, a suitable
.I locale
has to be selected via environment variables (e.g.,
"LANG=en_GB.UTF-8").
.PP
The
.B nl_langinfo(CODESET)
function returns the name of the selected encoding.
Library functions such as
.BR wctomb (3)
and
.BR mbsrtowcs (3)
can be used to transform the internal
.I wchar_t
characters and strings into the system character encoding and back,
and
.BR wcwidth (3)
tells how many positions (0\(en2) the cursor is advanced by the
output of a character.
.SS Private Use Areas (PUA)
In the Basic Multilingual Plane,
the range 0xe000 to 0xf8ff will never be assigned to any characters by
the standard and is reserved for private usage.
For the Linux
community, this private area has been subdivided further into the
range 0xe000 to 0xefff which can be used individually by any end-user
and the Linux zone in the range 0xf000 to 0xf8ff where extensions are
coordinated among all Linux users.
The registry of the characters
assigned to the Linux zone is maintained by LANANA and the registry
itself is
.I Documentation/admin\-guide/unicode.rst
in the Linux kernel sources
.\" commit 9d85025b0418163fae079c9ba8f8445212de8568
(or
.I Documentation/unicode.txt
before Linux 4.10).
.PP
Two other planes are reserved for private usage: plane 15
(Supplementary Private Use Area-A, range 0xf0000 to 0xffffd)
and plane 16 (Supplementary Private Use Area-B, range
0x100000 to 0x10fffd).
.SS Literature
.IP \(bu 3
Information technology \(em Universal Multiple-Octet Coded Character
Set (UCS) \(em Part 1: Architecture and Basic Multilingual Plane.
International Standard ISO/IEC 10646-1, International Organization
for Standardization, Geneva, 2000.
.IP
This is the official specification of UCS.
Available from
.UR http://www.iso.ch/
.UE .
.IP \(bu
The Unicode Standard, Version 3.0.
The Unicode Consortium, Addison-Wesley,
Reading, MA, 2000, ISBN 0-201-61633-5.
.IP \(bu
S.\& Harbison, G.\& Steele. C: A Reference Manual. Fourth edition,
Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3.
.IP
A good reference book about the C programming language.
The fourth