Commit | Line | Data |
---|---|---|
fea681da MK |
1 | .\" Copyright (C) Markus Kuhn, 1995, 2001 |
2 | .\" | |
e4a74ca8 | 3 | .\" SPDX-License-Identifier: GPL-2.0-or-later |
fea681da MK |
4 | .\" |
5 | .\" 1995-11-26 Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de> | |
6 | .\" First version written | |
7 | .\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk> | |
8 | .\" Update | |
9 | .\" | |
4c1c5274 | 10 | .TH unicode 7 (date) "Linux man-pages (unreleased)" |
fea681da | 11 | .SH NAME |
e095ac23 | 12 | unicode \- universal character set |
fea681da | 13 | .SH DESCRIPTION |
9423e95b MK |
14 | The international standard ISO 10646 defines the |
15 | Universal Character Set (UCS). | |
c13182ef | 16 | UCS contains all characters of all other character set standards. |
9423e95b | 17 | It also guarantees "round-trip compatibility"; |
88879aeb MK |
18 | in other words, |
19 | conversion tables can be built such that no information is lost | |
fea681da | 20 | when a string is converted from any other encoding to UCS and back. |
a721e8b2 | 21 | .PP |
fea681da | 22 | UCS contains the characters required to represent practically all |
c13182ef MK |
23 | known languages. |
24 | This includes not only the Latin, Greek, Cyrillic, | |
1954b6a9 | 25 | Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, |
fea681da MK |
26 | Japanese and Korean Han ideographs as well as scripts such as |
27 | Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, | |
28 | Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, | |
29 | Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian, | |
c13182ef MK |
30 | Ogham, Myanmar, Sinhala, Thaana, Yi, and others. |
31 | For scripts not yet | |
fea681da | 32 | covered, research on how to best encode them for computer usage is |
c13182ef MK |
33 | still going on and they will be added eventually. |
34 | This might | |
fea681da MK |
35 | eventually include not only Hieroglyphs and various historic |
36 | Indo-European languages, but even some selected artistic scripts such | |
c13182ef MK |
37 | as Tengwar, Cirth, and Klingon. |
38 | UCS also covers a large number of | |
a797afac | 39 | graphical, typographical, mathematical, and scientific symbols, |
fea681da MK |
40 | including those provided by TeX, PostScript, APL, MS-DOS, MS-Windows, |
41 | Macintosh, OCR fonts, as well as many word processing and publishing | |
42 | systems, and more are being added. | |
a721e8b2 | 43 | .PP |
fea681da | 44 | The UCS standard (ISO 10646) describes a |
492d9973 | 45 | 31-bit character set architecture |
fea681da MK |
46 | consisting of 128 24-bit |
47 | .IR groups , | |
48 | each divided into 256 16-bit | |
49 | .I planes | |
50 | made up of 256 8-bit | |
51 | .I rows | |
52 | with 256 | |
53 | .I column | |
c13182ef | 54 | positions, one for each character. |
9423e95b | 55 | Part 1 of the standard (ISO 10646-1) |
fea681da MK |
56 | defines the first 65534 code positions (0x0000 to 0xfffd), which form |
57 | the | |
1ae6b2c7 | 58 | .I Basic Multilingual Plane |
18352067 | 59 | (BMP), that is, plane 0 in group 0. |
9423e95b | 60 | Part 2 of the standard (ISO 10646-2) |
fea681da MK |
61 | adds characters to group 0 outside the BMP in several |
62 | .I "supplementary planes" | |
c13182ef MK |
63 | in the range 0x10000 to 0x10ffff. |
64 | There are no plans to add characters | |
fea681da MK |
65 | beyond 0x10ffff to the standard; therefore, of the entire code space, |
66 | only a small fraction of group 0 will ever be actually used in the | |
c13182ef MK |
67 | foreseeable future. |
68 | The BMP contains all characters found in the | |
69 | commonly used other character sets. | |
70 | The supplementary planes added by | |
fea681da MK |
71 | ISO 10646-2 cover only more exotic characters for special scientific, |
72 | dictionary printing, publishing industry, higher-level protocol and | |
73 | enthusiast needs. | |
74 | .PP | |
75 | The representation of each UCS character as a 2-byte word is referred | |
9423e95b MK |
76 | to as the UCS-2 form (only for BMP characters), |
77 | whereas UCS-4 is the representation of each character by a 4-byte word. | |
78 | In addition, there exist two encoding forms: UTF-8 | |
79 | for backward compatibility with ASCII processing software and UTF-16 | |
66d3a13b | 80 | for the backward-compatible handling of non-BMP characters up to |
fea681da MK |
81 | 0x10ffff by UCS-2 software. |
82 | .PP | |
83 | The UCS characters 0x0000 to 0x007f are identical to those of the | |
9423e95b | 84 | classic US-ASCII |
fea681da MK |
85 | character set and the characters in the range 0x0000 to 0x00ff |
86 | are identical to those in | |
9423e95b | 87 | ISO 8859-1 (Latin-1). |
73d8cece | 88 | .SS Combining characters |
9423e95b | 89 | Some code points in UCS |
fea681da MK |
90 | have been assigned to |
91 | .IR "combining characters" . | |
24b74457 | 92 | These are similar to the nonspacing accent keys on a typewriter. |
c13182ef MK |
93 | A combining character just adds an accent to the previous character. |
94 | The most important accented characters have codes of their own in UCS; | |
fea681da | 95 | however, the combining character mechanism allows us to add accents |
c13182ef MK |
96 | and other diacritical marks to any character. |
97 | The combining characters | |
98 | always follow the character which they modify. | |
99 | For example, the German | |
fea681da MK |
100 | character Umlaut-A ("Latin capital letter A with diaeresis") can |
101 | either be represented by the precomposed UCS code 0x00c4, or | |
102 | alternatively as the combination of a normal "Latin capital letter A" | |
103 | followed by a "combining diaeresis": 0x0041 0x0308. | |
104 | .PP | |
105 | Combining characters are essential, for instance, for encoding the Thai | |
106 | script, for mathematical typesetting, and for users of the International | |
107 | Phonetic Alphabet. | |
73d8cece | 108 | .SS Implementation levels |
fea681da MK |
109 | As not all systems are expected to support advanced mechanisms like |
110 | combining characters, ISO 10646-1 specifies the following three | |
111 | .I implementation levels | |
112 | of UCS: | |
113 | .TP 0.9i | |
114 | Level 1 | |
9423e95b | 115 | Combining characters and Hangul Jamo |
fea681da | 116 | (a variant encoding of the Korean script, where a Hangul syllable |
42f0a101 | 117 | glyph is coded as a triplet or pair of vowel/consonant codes) are not |
fea681da MK |
118 | supported. |
119 | .TP | |
120 | Level 2 | |
121 | In addition to level 1, combining characters are now allowed for some | |
122 | languages where they are essential (e.g., Thai, Lao, Hebrew, | |
e2ef00a5 | 123 | Arabic, Devanagari, Malayalam). |
fea681da MK |
124 | .TP |
125 | Level 3 | |
9423e95b | 126 | All UCS characters are supported. |
fea681da | 127 | .PP |
9423e95b MK |
128 | The Unicode 3.0 Standard |
129 | published by the Unicode Consortium | |
130 | contains exactly the UCS Basic Multilingual Plane | |
fea681da | 131 | at implementation level 3, as described in ISO 10646-1:2000. |
9423e95b | 132 | Unicode 3.1 added the supplementary planes of ISO 10646-2. |
c13182ef | 133 | The Unicode standard and |
fea681da MK |
134 | technical reports published by the Unicode Consortium provide much |
135 | additional information on the semantics and recommended usages of | |
c13182ef MK |
136 | various characters. |
137 | They provide guidelines and algorithms for | |
a797afac | 138 | editing, sorting, comparing, normalizing, converting, and displaying |
fea681da | 139 | Unicode strings. |
73d8cece | 140 | .SS Unicode under Linux |
fea681da | 141 | Under GNU/Linux, the C type |
f19a0f03 | 142 | .I wchar_t |
c13182ef MK |
143 | is a signed 32-bit integer type. |
144 | Its values are always interpreted | |
9423e95b | 145 | by the C library as UCS |
fea681da MK |
146 | code values (in all locales), a convention that is signaled by the GNU |
147 | C library to applications by defining the constant | |
148 | .B __STDC_ISO_10646__ | |
26b2443e | 149 | as specified in the ISO C99 standard. |
a721e8b2 | 150 | .PP |
fea681da MK |
151 | UCS/Unicode can be used just like ASCII in input/output streams, |
152 | terminal communication, plaintext files, filenames, and environment | |
9423e95b | 153 | variables in the ASCII compatible UTF-8 multibyte encoding. |
c13182ef | 154 | To signal the use of UTF-8 as the character |
fea681da | 155 | encoding to all applications, a suitable |
f19a0f03 | 156 | .I locale |
fea681da MK |
157 | has to be selected via environment variables (e.g., |
158 | "LANG=en_GB.UTF-8"). | |
159 | .PP | |
160 | The | |
161 | .B nl_langinfo(CODESET) | |
c13182ef MK |
162 | function returns the name of the selected encoding. |
163 | Library functions such as | |
fea681da MK |
164 | .BR wctomb (3) |
165 | and | |
166 | .BR mbsrtowcs (3) | |
167 | can be used to transform the internal | |
9ff08aad | 168 | .I wchar_t |
fea681da MK |
169 | characters and strings into the system character encoding and back, |
170 | and | |
171 | .BR wcwidth (3) | |
81b2a338 | 172 | tells how many positions (0\(en2) the cursor is advanced by the |
fea681da | 173 | output of a character. |
66e80e31 | 174 | .SS Private Use Areas (PUA) |
beecf99e | 175 | In the Basic Multilingual Plane, |
fea681da | 176 | the range 0xe000 to 0xf8ff will never be assigned to any characters by |
c13182ef MK |
177 | the standard and is reserved for private usage. |
178 | For the Linux | |
fea681da MK |
179 | community, this private area has been subdivided further into the |
180 | range 0xe000 to 0xefff which can be used individually by any end-user | |
181 | and the Linux zone in the range 0xf000 to 0xf8ff where extensions are | |
c13182ef MK |
182 | coordinated among all Linux users. |
183 | The registry of the characters | |
79172100 MM |
184 | assigned to the Linux zone is maintained by LANANA and the registry |
185 | itself is | |
b4a164cf ES |
186 | .I Documentation/admin\-guide/unicode.rst |
187 | in the Linux kernel sources | |
188 | .\" commit 9d85025b0418163fae079c9ba8f8445212de8568 | |
189 | (or | |
79172100 | 190 | .I Documentation/unicode.txt |
b4a164cf | 191 | before Linux 4.10). |
66e80e31 DTQ |
192 | .PP |
193 | Two other planes are reserved for private usage, plane 15 | |
194 | (Supplementary Private Use Area-A, range 0xf0000 to 0xffffd) | |
195 | and plane 16 (Supplementary Private Use Area-B, range | |
196 | 0x100000 to 0x10fffd). | |
d90a233f | 197 | .SS Literature |
22356d97 | 198 | .IP \(bu 3 |
fea681da MK |
199 | Information technology \(em Universal Multiple-Octet Coded Character |
200 | Set (UCS) \(em Part 1: Architecture and Basic Multilingual Plane. | |
201 | International Standard ISO/IEC 10646-1, International Organization | |
202 | for Standardization, Geneva, 2000. | |
a721e8b2 | 203 | .IP |
bb75585d | 204 | This is the official specification of UCS. |
79172100 | 205 | Available from |
608bf950 SK |
206 | .UR http://www.iso.ch/ |
207 | .UE . | |
22356d97 | 208 | .IP \(bu |
fea681da MK |
209 | The Unicode Standard, Version 3.0. |
210 | The Unicode Consortium, Addison-Wesley, | |
211 | Reading, MA, 2000, ISBN 0-201-61633-5. | |
22356d97 | 212 | .IP \(bu |
8fb01fde | 213 | S.\& Harbison, G.\& Steele. C: A Reference Manual. Fourth edition, |
fea681da | 214 | Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3. |
a721e8b2 | 215 | .IP |
c13182ef MK |
216 | A good reference book about the C programming language. |
217 | The fourth | |