Commit | Line | Data |
---|---|---|
fea681da MK |
1 | .\" Copyright (C) Markus Kuhn, 1995, 2001 |
2 | .\" | |
1dd72f9c | 3 | .\" %%%LICENSE_START(GPLv2+_DOC_FULL) |
fea681da MK |
4 | .\" This is free documentation; you can redistribute it and/or |
5 | .\" modify it under the terms of the GNU General Public License as | |
6 | .\" published by the Free Software Foundation; either version 2 of | |
7 | .\" the License, or (at your option) any later version. | |
8 | .\" | |
9 | .\" The GNU General Public License's references to "object code" | |
10 | .\" and "executables" are to be interpreted as the output of any | |
11 | .\" document formatting or typesetting system, including | |
12 | .\" intermediate and printed output. | |
13 | .\" | |
14 | .\" This manual is distributed in the hope that it will be useful, | |
15 | .\" but WITHOUT ANY WARRANTY; without even the implied warranty of | |
16 | .\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
17 | .\" GNU General Public License for more details. | |
18 | .\" | |
19 | .\" You should have received a copy of the GNU General Public | |
c715f741 MK |
20 | .\" License along with this manual; if not, see |
21 | .\" <http://www.gnu.org/licenses/>. | |
6a8d8745 | 22 | .\" %%%LICENSE_END |
fea681da MK |
23 | .\" |
24 | .\" 1995-11-26 Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de> | |
25 | .\" First version written | |
26 | .\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk> | |
27 | .\" Update | |
28 | .\" | |
97986708 | 29 | .TH UNICODE 7 2016-03-15 "GNU" "Linux Programmer's Manual" |
fea681da | 30 | .SH NAME |
e095ac23 | 31 | unicode \- universal character set |
fea681da | 32 | .SH DESCRIPTION |
9423e95b MK |
33 | The international standard ISO 10646 defines the |
34 | Universal Character Set (UCS). | |
c13182ef | 35 | UCS contains all characters of all other character set standards. |
9423e95b | 36 | It also guarantees "round-trip compatibility"; |
88879aeb MK |
37 | in other words, |
38 | conversion tables can be built such that no information is lost | |
fea681da MK |
39 | when a string is converted from any other encoding to UCS and back. |
40 | ||
41 | UCS contains the characters required to represent practically all | |
c13182ef MK |
42 | known languages. |
43 | This includes not only the Latin, Greek, Cyrillic, | |
1954b6a9 | 44 | Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, |
fea681da MK |
45 | Japanese and Korean Han ideographs as well as scripts such as |
46 | Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, | |
47 | Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, | |
48 | Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian, | |
c13182ef MK |
49 | Ogham, Myanmar, Sinhala, Thaana, Yi, and others. |
50 | For scripts not yet | |
fea681da | 51 | covered, research on how to best encode them for computer usage is |
c13182ef MK |
52 | still going on and they will be added eventually. |
53 | This might | |
fea681da MK |
54 | eventually include not only Hieroglyphs and various historic |
55 | Indo-European languages, but even some selected artistic scripts such | |
c13182ef MK |
56 | as Tengwar, Cirth, and Klingon. |
57 | UCS also covers a large number of | |
a797afac | 58 | graphical, typographical, mathematical, and scientific symbols, |
fea681da MK |
59 | including those provided by TeX, PostScript, APL, MS-DOS, MS-Windows, |
60 | Macintosh, OCR fonts, as well as many word processing and publishing | |
61 | systems, and more are being added. | |
62 | ||
63 | The UCS standard (ISO 10646) describes a | |
492d9973 | 64 | 31-bit character set architecture |
fea681da MK |
65 | consisting of 128 24-bit |
66 | .IR groups , | |
67 | each divided into 256 16-bit | |
68 | .I planes | |
69 | made up of 256 8-bit | |
70 | .I rows | |
71 | with 256 | |
72 | .I column | |
c13182ef | 73 | positions, one for each character. |
9423e95b | 74 | Part 1 of the standard (ISO 10646-1) |
fea681da MK |
75 | defines the first 65534 code positions (0x0000 to 0xfffd), which form |
76 | the | |
18352067 MK |
77 | .IR "Basic Multilingual Plane" |
78 | (BMP), that is plane 0 in group 0. | |
9423e95b | 79 | Part 2 of the standard (ISO 10646-2) |
fea681da MK |
80 | adds characters to group 0 outside the BMP in several |
81 | .I "supplementary planes" | |
c13182ef MK |
82 | in the range 0x10000 to 0x10ffff. |
83 | There are no plans to add characters | |
fea681da MK |
84 | beyond 0x10ffff to the standard; therefore, of the entire code space, |
85 | only a small fraction of group 0 will actually be used in the | |
c13182ef MK |
86 | foreseeable future. |
87 | The BMP contains all characters found in the | |
88 | commonly used other character sets. | |
89 | The supplementary planes added by | |
fea681da MK |
90 | ISO 10646-2 cover only more exotic characters for special scientific, |
91 | dictionary printing, publishing industry, higher-level protocol and | |
92 | enthusiast needs. | |
93 | .PP | |
94 | The representation of each UCS character as a 2-byte word is referred | |
9423e95b MK |
95 | to as the UCS-2 form (only for BMP characters), |
96 | whereas UCS-4 is the representation of each character by a 4-byte word. | |
97 | In addition, there exist two encoding forms: UTF-8 | |
98 | for backward compatibility with ASCII processing software, and UTF-16 | |
66d3a13b | 99 | for the backward-compatible handling of non-BMP characters up to |
fea681da MK |
100 | 0x10ffff by UCS-2 software. |
101 | .PP | |
102 | The UCS characters 0x0000 to 0x007f are identical to those of the | |
9423e95b | 103 | classic US-ASCII |
fea681da MK |
104 | character set, and the characters in the range 0x0000 to 0x00ff |
105 | are identical to those in | |
9423e95b | 106 | ISO 8859-1 (Latin-1). |
73d8cece | 107 | .SS Combining characters |
9423e95b | 108 | Some code points in UCS |
fea681da MK |
109 | have been assigned to |
110 | .IR "combining characters" . | |
24b74457 | 111 | These are similar to the nonspacing accent keys on a typewriter. |
c13182ef MK |
112 | A combining character just adds an accent to the previous character. |
113 | The most important accented characters have codes of their own in UCS; | |
fea681da | 114 | however, the combining character mechanism allows us to add accents |
c13182ef MK |
115 | and other diacritical marks to any character. |
116 | The combining characters | |
117 | always follow the character which they modify. | |
118 | For example, the German | |
fea681da MK |
119 | character Umlaut-A ("Latin capital letter A with diaeresis") can |
120 | either be represented by the precomposed UCS code 0x00c4, or | |
121 | alternatively as the combination of a normal "Latin capital letter A" | |
122 | followed by a "combining diaeresis": 0x0041 0x0308. | |
123 | .PP | |
124 | Combining characters are essential, for instance, for encoding the Thai | |
125 | script, for mathematical typesetting, and for users of the International | |
126 | Phonetic Alphabet. | |
73d8cece | 127 | .SS Implementation levels |
fea681da MK |
128 | As not all systems are expected to support advanced mechanisms like |
129 | combining characters, ISO 10646-1 specifies the following three | |
130 | .I implementation levels | |
131 | of UCS: | |
132 | .TP 0.9i | |
133 | Level 1 | |
9423e95b | 134 | Combining characters and Hangul Jamo |
fea681da | 135 | (a variant encoding of the Korean script, where a Hangul syllable |
42f0a101 | 136 | glyph is coded as a triplet or pair of vowel/consonant codes) are not |
fea681da MK |
137 | supported. |
138 | .TP | |
139 | Level 2 | |
140 | In addition to level 1, combining characters are now allowed for some | |
141 | languages where they are essential (e.g., Thai, Lao, Hebrew, | |
e2ef00a5 | 142 | Arabic, Devanagari, Malayalam). |
fea681da MK |
143 | .TP |
144 | Level 3 | |
9423e95b | 145 | All UCS characters are supported. |
fea681da | 146 | .PP |
9423e95b MK |
147 | The Unicode 3.0 Standard |
148 | published by the Unicode Consortium | |
149 | contains exactly the UCS Basic Multilingual Plane | |
fea681da | 150 | at implementation level 3, as described in ISO 10646-1:2000. |
9423e95b | 151 | Unicode 3.1 added the supplementary planes of ISO 10646-2. |
c13182ef | 152 | The Unicode standard and |
fea681da MK |
153 | technical reports published by the Unicode Consortium provide much |
154 | additional information on the semantics and recommended usages of | |
c13182ef MK |
155 | various characters. |
156 | They provide guidelines and algorithms for | |
a797afac | 157 | editing, sorting, comparing, normalizing, converting, and displaying |
fea681da | 158 | Unicode strings. |
73d8cece | 159 | .SS Unicode under Linux |
fea681da | 160 | Under GNU/Linux, the C type |
f19a0f03 | 161 | .I wchar_t |
c13182ef MK |
162 | is a signed 32-bit integer type. |
163 | Its values are always interpreted | |
9423e95b | 164 | by the C library as UCS |
fea681da MK |
165 | code values (in all locales), a convention that is signaled by the GNU |
166 | C library to applications by defining the constant | |
167 | .B __STDC_ISO_10646__ | |
26b2443e | 168 | as specified in the ISO C99 standard. |
fea681da MK |
169 | |
170 | UCS/Unicode can be used just like ASCII in input/output streams, | |
171 | terminal communication, plaintext files, filenames, and environment | |
9423e95b | 172 | variables in the ASCII-compatible UTF-8 multibyte encoding. |
c13182ef | 173 | To signal the use of UTF-8 as the character |
fea681da | 174 | encoding to all applications, a suitable |
f19a0f03 | 175 | .I locale |
fea681da MK |
176 | has to be selected via environment variables (e.g., |
177 | "LANG=en_GB.UTF-8"). | |
178 | .PP | |
179 | The | |
180 | .B nl_langinfo(CODESET) | |
c13182ef MK |
181 | function returns the name of the selected encoding. |
182 | Library functions such as | |
fea681da MK |
183 | .BR wctomb (3) |
184 | and | |
185 | .BR mbsrtowcs (3) | |
186 | can be used to transform the internal | |
9ff08aad | 187 | .I wchar_t |
fea681da MK |
188 | characters and strings into the system character encoding and back, |
189 | and | |
190 | .BR wcwidth (3) | |
191 | tells how many positions (0\(en2) the cursor is advanced by the | |
192 | output of a character. | |
193 | .PP | |
66e80e31 | 194 | .SS Private Use Areas (PUA) |
beecf99e | 195 | In the Basic Multilingual Plane, |
fea681da | 196 | the range 0xe000 to 0xf8ff will never be assigned to any characters by |
c13182ef MK |
197 | the standard and is reserved for private usage. |
198 | For the Linux | |
fea681da MK |
199 | community, this private area has been subdivided further into the |
200 | range 0xe000 to 0xefff, which can be used individually by any end user, | |
201 | and the Linux zone in the range 0xf000 to 0xf8ff, where extensions are | |
c13182ef MK |
202 | coordinated among all Linux users. |
203 | The registry of the characters | |
79172100 MM |
204 | assigned to the Linux zone is maintained by LANANA and the registry |
205 | itself is | |
206 | .I Documentation/unicode.txt | |
207 | in the Linux kernel sources. | |
66e80e31 DTQ |
208 | .PP |
209 | Two other planes are reserved for private usage: plane 15 | |
210 | (Supplementary Private Use Area-A, range 0xf0000 to 0xffffd) | |
211 | and plane 16 (Supplementary Private Use Area-B, range | |
212 | 0x100000 to 0x10fffd). | |
d90a233f | 213 | .SS Literature |
f2cf1fbf | 214 | .IP * 3 |
fea681da MK |
215 | Information technology \(em Universal Multiple-Octet Coded Character |
216 | Set (UCS) \(em Part 1: Architecture and Basic Multilingual Plane. | |
217 | International Standard ISO/IEC 10646-1, International Organization | |
218 | for Standardization, Geneva, 2000. | |
219 | ||
9423e95b | 220 | This is the official specification of UCS. |
79172100 | 221 | Available from |
608bf950 SK |
222 | .UR http://www.iso.ch/ |
223 | .UE . | |
f2cf1fbf | 224 | .IP * |
fea681da MK |
225 | The Unicode Standard, Version 3.0. |
226 | The Unicode Consortium, Addison-Wesley, | |
227 | Reading, MA, 2000, ISBN 0-201-61633-5. | |
f2cf1fbf | 228 | .IP * |
fea681da MK |
229 | S. Harbison, G. Steele. C: A Reference Manual. Fourth edition, |
230 | Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3. | |
231 | ||
c13182ef MK |
232 | A good reference book about the C programming language. |
233 | The fourth | |