.\" Copyright (C) Markus Kuhn, 1995, 2001
.\"
.\" %%%LICENSE_START(GPLv2+_DOC_FULL)
.\" This is free documentation; you can redistribute it and/or
.\" modify it under the terms of the GNU General Public License as
.\" published by the Free Software Foundation; either version 2 of
.\" the License, or (at your option) any later version.
.\"
.\" The GNU General Public License's references to "object code"
.\" and "executables" are to be interpreted as the output of any
.\" document formatting or typesetting system, including
.\" intermediate and printed output.
.\"
.\" This manual is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public
.\" License along with this manual; if not, see
.\" <http://www.gnu.org/licenses/>.
.\" %%%LICENSE_END
.\"
.\" 1995-11-26 Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
.\" First version written
.\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk>
.\" Update
.\"
.TH UNICODE 7 2017-09-15 "GNU" "Linux Programmer's Manual"
.SH NAME
unicode \- universal character set
.SH DESCRIPTION
The international standard ISO 10646 defines the
Universal Character Set (UCS).
UCS contains all characters of all other character set standards.
It also guarantees "round-trip compatibility";
in other words,
conversion tables can be built such that no information is lost
when a string is converted from any other encoding to UCS and back.
.PP
UCS contains the characters required to represent practically all
known languages.
This includes not only the Latin, Greek, Cyrillic,
Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese,
Japanese, and Korean Han ideographs as well as scripts such as
Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati,
Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo,
Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian,
Ogham, Myanmar, Sinhala, Thaana, Yi, and others.
For scripts not yet
covered, research on how to best encode them for computer usage is
still going on and they will be added eventually.
This might
eventually include not only Hieroglyphs and various historic
Indo-European languages, but even some selected artistic scripts such
as Tengwar, Cirth, and Klingon.
UCS also covers a large number of
graphical, typographical, mathematical, and scientific symbols,
including those provided by TeX, PostScript, APL, MS-DOS, MS-Windows,
Macintosh, OCR fonts, as well as many word processing and publishing
systems, and more are being added.
.PP
The UCS standard (ISO 10646) describes a
31-bit character set architecture
consisting of 128 24-bit
.IR groups ,
each divided into 256 16-bit
.I planes
made up of 256 8-bit
.I rows
with 256
.I column
positions, one for each character.
Part 1 of the standard (ISO 10646-1)
defines the first 65534 code positions (0x0000 to 0xfffd), which form
the
.IR "Basic Multilingual Plane"
(BMP), that is plane 0 in group 0.
Part 2 of the standard (ISO 10646-2)
adds characters to group 0 outside the BMP in several
.I "supplementary planes"
in the range 0x10000 to 0x10ffff.
There are no plans to add characters
beyond 0x10ffff to the standard; therefore, of the entire code space,
only a small fraction of group 0 will ever be actually used in the
foreseeable future.
The BMP contains all characters found in the
other commonly used character sets.
The supplementary planes added by
ISO 10646-2 cover only more exotic characters for special scientific,
dictionary printing, publishing industry, higher-level protocol, and
enthusiast needs.
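.PP
The group/plane/row/column layout described above maps directly onto bit fields of a UCS-4 value.
The following is a minimal sketch of that decomposition; the type
ucs_position and the functions ucs_split() and ucs_is_bmp() are hypothetical names
introduced for illustration and are not part of any library.

```c
#include <stdint.h>

/* Hypothetical helper type: the four coordinates of a UCS code position. */
struct ucs_position {
    unsigned group;  /* 0..127: top 7 bits of the 31-bit code space */
    unsigned plane;  /* 0..255 within the group */
    unsigned row;    /* 0..255 within the plane */
    unsigned column; /* 0..255 within the row */
};

/* Split a 31-bit UCS-4 value into group, plane, row, and column. */
static struct ucs_position ucs_split(uint32_t c)
{
    struct ucs_position p;

    p.group  = (c >> 24) & 0x7f;
    p.plane  = (c >> 16) & 0xff;
    p.row    = (c >> 8) & 0xff;
    p.column = c & 0xff;
    return p;
}

/* Nonzero if the value lies in plane 0 of group 0 (the BMP). */
static int ucs_is_bmp(uint32_t c)
{
    return c <= 0xffff;
}
```

For instance, 0x10ffff, the last position of the supplementary planes, splits into group 0, plane 0x10, row 0xff, column 0xff.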
.PP
The representation of each UCS character as a 2-byte word is referred
to as the UCS-2 form (only for BMP characters),
whereas UCS-4 is the representation of each character by a 4-byte word.
In addition, there exist two encoding forms: UTF-8
for backward compatibility with ASCII processing software and UTF-16
for the backward-compatible handling of non-BMP characters up to
0x10ffff by UCS-2 software.
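.PP
The text above does not spell out how UTF-16 reaches beyond the BMP.
As a sketch based on the UTF-16 definition in the Unicode Standard (not on this page),
a non-BMP value is written as a pair of 16-bit units taken from the
ranges 0xd800 to 0xdbff and 0xdc00 to 0xdfff.
The function name utf16_encode() is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one UCS value as UTF-16.
 * Returns the number of 16-bit units written to out (1 or 2),
 * or 0 if the value cannot be represented.
 */
static size_t utf16_encode(uint32_t c, uint16_t out[2])
{
    if (c < 0x10000) {
        /* 0xd800..0xdfff are reserved for surrogates, not characters. */
        if (c >= 0xd800 && c <= 0xdfff)
            return 0;
        out[0] = (uint16_t)c;        /* BMP: identical to UCS-2 */
        return 1;
    }
    if (c > 0x10ffff)
        return 0;                    /* beyond the UTF-16 range */

    c -= 0x10000;                    /* 20 bits remain */
    out[0] = (uint16_t)(0xd800 | (c >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xdc00 | (c & 0x3ff));  /* low surrogate */
    return 2;
}
```

With this scheme 0x10000 becomes the pair 0xd800 0xdc00 and 0x10ffff becomes 0xdbff 0xdfff, while BMP characters keep their UCS-2 value.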
.PP
The UCS characters 0x0000 to 0x007f are identical to those of the
classic US-ASCII
character set and the characters in the range 0x0000 to 0x00ff
are identical to those in
ISO 8859-1 (Latin-1).
.SS Combining characters
Some code points in UCS
have been assigned to
.IR "combining characters" .
These are similar to the nonspacing accent keys on a typewriter.
A combining character just adds an accent to the previous character.
The most important accented characters have codes of their own in UCS;
however, the combining character mechanism allows us to add accents
and other diacritical marks to any character.
The combining characters
always follow the character which they modify.
For example, the German
character Umlaut-A ("Latin capital letter A with diaeresis") can
either be represented by the precomposed UCS code 0x00c4, or
alternatively as the combination of a normal "Latin capital letter A"
followed by a "combining diaeresis": 0x0041 0x0308.
.PP
Combining characters are essential, for instance, for encoding the Thai
script, for mathematical typesetting, and for users of the International
Phonetic Alphabet.
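.PP
One practical consequence of the Umlaut-A example is that the two representations
are different code sequences, so a plain comparison does not treat them as equal.
The sketch below assumes that wchar_t holds UCS values (as described for GNU/Linux in this page);
the array and function names are hypothetical, and no normalization is attempted.

```c
#include <stddef.h>
#include <wchar.h>

/* Precomposed form: "Latin capital letter A with diaeresis". */
static const wchar_t a_umlaut_precomposed[] = { 0x00c4, 0 };

/* "Latin capital letter A" followed by "combining diaeresis". */
static const wchar_t a_umlaut_combining[] = { 0x0041, 0x0308, 0 };

/*
 * Nonzero if both wide strings are the same sequence of code values.
 * This is a raw comparison; it does not normalize either string.
 */
static int same_code_sequence(const wchar_t *a, const wchar_t *b)
{
    return wcscmp(a, b) == 0;
}
```

Here wcslen() reports 1 for the precomposed string and 2 for the combining one, and same_code_sequence() returns 0 for the pair even though both display as the same letter.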
.SS Implementation levels
As not all systems are expected to support advanced mechanisms like
combining characters, ISO 10646-1 specifies the following three
.I implementation levels
of UCS:
.TP 0.9i
Level 1
Combining characters and Hangul Jamo
(a variant encoding of the Korean script, where a Hangul syllable
glyph is coded as a triplet or pair of vowel/consonant codes) are not
supported.
.TP
Level 2
In addition to level 1, combining characters are now allowed for some
languages where they are essential (e.g., Thai, Lao, Hebrew,
Arabic, Devanagari, Malayalam).
.TP
Level 3
All UCS characters are supported.
.PP
The Unicode 3.0 Standard
published by the Unicode Consortium
contains exactly the UCS Basic Multilingual Plane
at implementation level 3, as described in ISO 10646-1:2000.
Unicode 3.1 added the supplemental planes of ISO 10646-2.
The Unicode standard and
technical reports published by the Unicode Consortium provide much
additional information on the semantics and recommended usages of
various characters.
They provide guidelines and algorithms for
editing, sorting, comparing, normalizing, converting, and displaying
Unicode strings.
.SS Unicode under Linux
Under GNU/Linux, the C type
.I wchar_t
is a signed 32-bit integer type.
Its values are always interpreted
by the C library as UCS
code values (in all locales), a convention that is signaled by the GNU
C library to applications by defining the constant
.B __STDC_ISO_10646__
as specified in the ISO C99 standard.
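.PP
A program can test for this convention at compile time.
The following is a minimal sketch; the function name wchar_ucs_version() is hypothetical.
In ISO C99 the macro, when defined, expands to an integer constant of the form yyyymmL.

```c
#include <wchar.h>

/*
 * Returns the value of __STDC_ISO_10646__ (a date of the form yyyymm)
 * if the C library promises that wchar_t values are UCS code values,
 * or 0 if it makes no such promise.
 */
static long wchar_ucs_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0;
#endif
}
```

Code that relies on wchar_t values being UCS code values can check for a nonzero result, or use the same #ifdef directly to select a fallback.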
.PP
UCS/Unicode can be used just like ASCII in input/output streams,
terminal communication, plaintext files, filenames, and environment
variables in the ASCII compatible UTF-8 multibyte encoding.
To signal the use of UTF-8 as the character
encoding to all applications, a suitable
.I locale
has to be selected via environment variables (e.g.,
"LANG=en_GB.UTF-8").
.PP
The
.B nl_langinfo(CODESET)
function returns the name of the selected encoding.
Library functions such as
.BR wctomb (3)
and
.BR mbsrtowcs (3)
can be used to transform the internal
.I wchar_t
characters and strings into the system character encoding and back,
and
.BR wcwidth (3)
tells how many positions (0\(en2) the cursor is advanced by the
output of a character.
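.PP
The calls named above can be wrapped as follows.
This is only a sketch: the wrapper names current_codeset(), wide_to_multibyte(),
and multibyte_to_wide() are hypothetical, while the library functions they call are the standard ones.
The results depend on the locale in effect when they are called.

```c
#include <langinfo.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Name of the character encoding selected by the current locale. */
static const char *current_codeset(void)
{
    return nl_langinfo(CODESET);
}

/*
 * Convert one wide character to the locale's multibyte encoding.
 * Returns the number of bytes stored in buf, or -1 if the character
 * cannot be represented in the current locale.
 */
static int wide_to_multibyte(wchar_t wc, char buf[MB_LEN_MAX])
{
    return wctomb(buf, wc);
}

/*
 * Convert a multibyte string to wide characters.
 * Returns the number of wide characters stored (excluding the
 * terminator), or (size_t)-1 on an invalid input sequence.
 */
static size_t multibyte_to_wide(const char *src, wchar_t *dst, size_t dstlen)
{
    mbstate_t state;

    memset(&state, 0, sizeof state);  /* start in the initial shift state */
    return mbsrtowcs(dst, &src, dstlen, &state);
}
```

A program would normally call setlocale(LC_ALL, "") once at startup so that the locale chosen through the environment takes effect; without that call, the "C" locale is used.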
.PP
.SS Private Use Areas (PUA)
In the Basic Multilingual Plane,
the range 0xe000 to 0xf8ff will never be assigned to any characters by
the standard and is reserved for private usage.
For the Linux
community, this private area has been subdivided further into the
range 0xe000 to 0xefff which can be used individually by any end-user
and the Linux zone in the range 0xf000 to 0xf8ff where extensions are
coordinated among all Linux users.
The registry of the characters
assigned to the Linux zone is maintained by LANANA and the registry
itself is
.I Documentation/admin\-guide/unicode.rst
in the Linux kernel sources
.\" commit 9d85025b0418163fae079c9ba8f8445212de8568
(or
.I Documentation/unicode.txt
before Linux 4.10).
.PP
Two other planes are reserved for private usage, plane 15
(Supplementary Private Use Area-A, range 0xf0000 to 0xffffd)
and plane 16 (Supplementary Private Use Area-B, range
0x100000 to 0x10fffd).
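.PP
The ranges listed in this section can be collected into a small classifier.
This is a sketch only; the enum pua_zone and the function pua_classify() are hypothetical names,
and the boundaries are exactly those given above.

```c
#include <stdint.h>

/* Hypothetical labels for the private-use ranges described above. */
enum pua_zone {
    PUA_NONE,        /* not a private-use code point */
    PUA_END_USER,    /* 0xe000..0xefff: free for individual end users */
    PUA_LINUX_ZONE,  /* 0xf000..0xf8ff: coordinated Linux zone */
    PUA_PLANE_15,    /* 0xf0000..0xffffd: Supplementary PUA-A */
    PUA_PLANE_16     /* 0x100000..0x10fffd: Supplementary PUA-B */
};

/* Classify a UCS value according to the private-use ranges. */
static enum pua_zone pua_classify(uint32_t c)
{
    if (c >= 0xe000 && c <= 0xefff)
        return PUA_END_USER;
    if (c >= 0xf000 && c <= 0xf8ff)
        return PUA_LINUX_ZONE;
    if (c >= 0xf0000 && c <= 0xffffd)
        return PUA_PLANE_15;
    if (c >= 0x100000 && c <= 0x10fffd)
        return PUA_PLANE_16;
    return PUA_NONE;
}
```

Note that the last two positions of planes 15 and 16 (for example 0xffffe and 0xfffff) fall outside the private-use ranges and are reported as PUA_NONE.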
.SS Literature
.IP * 3
Information technology \(em Universal Multiple-Octet Coded Character
Set (UCS) \(em Part 1: Architecture and Basic Multilingual Plane.
International Standard ISO/IEC 10646-1, International Organization
for Standardization, Geneva, 2000.
.IP
This is the official specification of UCS.
Available from
.UR http://www.iso.ch/
.UE .
.IP *
The Unicode Standard, Version 3.0.
The Unicode Consortium, Addison-Wesley,
Reading, MA, 2000, ISBN 0-201-61633-5.
.IP *
S.\& Harbison, G.\& Steele. C: A Reference Manual. Fourth edition,
Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3.
.IP
A good reference book about the C programming language.
The fourth