.\" Copyright (C) Markus Kuhn, 1995, 2001
.\"
.\" SPDX-License-Identifier: GPL-2.0-or-later
.\"
.\" 1995-11-26 Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
.\" First version written
.\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk>
.\" Update
.\"
.TH unicode 7 (date) "Linux man-pages (unreleased)"
.SH NAME
unicode \- universal character set
.SH DESCRIPTION
The international standard ISO 10646 defines the
Universal Character Set (UCS).
UCS contains all characters of all other character set standards.
It also guarantees "round-trip compatibility";
in other words,
conversion tables can be built such that no information is lost
when a string is converted from any other encoding to UCS and back.
.PP
UCS contains the characters required to represent practically all
known languages.
This includes not only the Latin, Greek, Cyrillic,
Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese,
Japanese and Korean Han ideographs as well as scripts such as
Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati,
Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo,
Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian,
Ogham, Myanmar, Sinhala, Thaana, Yi, and others.
For scripts not yet
covered, research on how to best encode them for computer usage is
still going on and they will be added eventually.
This might
eventually include not only Hieroglyphs and various historic
Indo-European languages, but even some selected artistic scripts such
as Tengwar, Cirth, and Klingon.
UCS also covers a large number of
graphical, typographical, mathematical, and scientific symbols,
including those provided by TeX, PostScript, APL, MS-DOS, MS-Windows,
Macintosh, OCR fonts, as well as many word processing and publishing
systems, and more are being added.
.PP
The UCS standard (ISO 10646) describes a
31-bit character set architecture
consisting of 128 24-bit
.IR groups ,
each divided into 256 16-bit
.I planes
made up of 256 8-bit
.I rows
with 256
.I column
positions, one for each character.
Part 1 of the standard (ISO 10646-1)
defines the first 65534 code positions (0x0000 to 0xfffd), which form
the
.I Basic Multilingual Plane
(BMP), that is plane 0 in group 0.
Part 2 of the standard (ISO 10646-2)
adds characters to group 0 outside the BMP in several
.I "supplementary planes"
in the range 0x10000 to 0x10ffff.
There are no plans to add characters
beyond 0x10ffff to the standard; therefore, of the entire code space,
only a small fraction of group 0 will ever be actually used in the
foreseeable future.
The BMP contains all characters found in the
commonly used other character sets.
The supplemental planes added by
ISO 10646-2 cover only more exotic characters for special scientific,
dictionary printing, publishing industry, higher-level protocol and
enthusiast needs.
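.PP
The group/plane/row/column layout described above can be read directly
off the bits of a code value.
The following is a minimal sketch, not part of any standard API;
the type
ucs_position
and the functions
ucs_split
and
ucs_in_bmp
are hypothetical names chosen for illustration.

```c
#include <stdint.h>

/* Hypothetical helper type: the four fields of a 31-bit UCS code. */
struct ucs_position {
    unsigned group;   /* bits 24..30: one of 128 groups */
    unsigned plane;   /* bits 16..23: one of 256 planes in the group */
    unsigned row;     /* bits 8..15:  one of 256 rows in the plane */
    unsigned column;  /* bits 0..7:   one of 256 columns in the row */
};

/* Split a UCS code value into group, plane, row, and column. */
struct ucs_position ucs_split(uint32_t code)
{
    struct ucs_position p;

    p.group  = (code >> 24) & 0x7f;
    p.plane  = (code >> 16) & 0xff;
    p.row    = (code >> 8) & 0xff;
    p.column = code & 0xff;
    return p;
}

/* Nonzero if the code lies in plane 0 of group 0 (the BMP). */
int ucs_in_bmp(uint32_t code)
{
    return (code >> 16) == 0;
}
```

For the highest code mentioned above, 0x10ffff, this yields group 0,
plane 0x10, row 0xff, and column 0xff.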
.PP
The representation of each UCS character as a 2-byte word is referred
to as the UCS-2 form (only for BMP characters),
whereas UCS-4 is the representation of each character by a 4-byte word.
In addition, there exist two encoding forms: UTF-8
for backward compatibility with ASCII processing software, and UTF-16
for the backward-compatible handling of non-BMP characters up to
0x10ffff by UCS-2 software.
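.PP
To make the two encoding forms concrete, the following sketch encodes a
single code value as UTF-8 and as UTF-16.
It follows the bit layouts of the published UTF-8 and UTF-16
definitions, which this page does not spell out;
the function names
utf8_encode
and
utf16_encode
are hypothetical and are not library functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code value as UTF-8.
   Returns the number of bytes written (1 to 4),
   or 0 for a surrogate or a value above 0x10ffff. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* ASCII: unchanged */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xc0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xd800 && cp <= 0xdfff)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char) (0xe0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char) (0x80 | (cp & 0x3f));
        return 3;
    }
    if (cp <= 0x10ffff) {
        out[0] = (unsigned char) (0xf0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
        out[3] = (unsigned char) (0x80 | (cp & 0x3f));
        return 4;
    }
    return 0;
}

/* Encode one code value as UTF-16.
   Returns the number of 16-bit units written (1 or 2), or 0 on error.
   BMP characters are stored as in UCS-2; others use a surrogate pair. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xd800 && cp <= 0xdfff)
            return 0;
        out[0] = (uint16_t) cp;
        return 1;
    }
    if (cp <= 0x10ffff) {
        cp -= 0x10000;                    /* 20 bits remain */
        out[0] = (uint16_t) (0xd800 | (cp >> 10));
        out[1] = (uint16_t) (0xdc00 | (cp & 0x3ff));
        return 2;
    }
    return 0;
}
```

ASCII characters come out of
utf8_encode
as a single unchanged byte, which is the backward compatibility
mentioned above, and only codes outside the BMP need two 16-bit units
in UTF-16.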
.PP
The UCS characters 0x0000 to 0x007f are identical to those of the
classic US-ASCII
character set and the characters in the range 0x0000 to 0x00ff
are identical to those in
ISO 8859-1 (Latin-1).
.SS Combining characters
Some code points in UCS
have been assigned to
.IR "combining characters" .
These are similar to the nonspacing accent keys on a typewriter.
A combining character just adds an accent to the previous character.
The most important accented characters have codes of their own in UCS;
however, the combining character mechanism allows us to add accents
and other diacritical marks to any character.
The combining characters
always follow the character which they modify.
For example, the German
character Umlaut-A ("Latin capital letter A with diaeresis") can
either be represented by the precomposed UCS code 0x00c4, or
alternatively as the combination of a normal "Latin capital letter A"
followed by a "combining diaeresis": 0x0041 0x0308.
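.PP
Because a combining character follows its base character, software that
counts user-visible characters has to skip it.
The sketch below is deliberately simplified:
it treats only the range 0x0300 to 0x036f (the Combining Diacritical
Marks block, which contains 0x0308) as combining, whereas real
implementations consult the full Unicode character data.
The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplifying assumption: only 0x0300..0x036f counts as combining. */
int is_combining_diacritic(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036f;
}

/* Count base characters in an array of UCS code values,
   ignoring the combining marks that follow them. */
size_t count_base_chars(const uint32_t *s, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (!is_combining_diacritic(s[i]))
            count++;
    return count;
}
```

With this helper, both the precomposed form 0x00c4 and the sequence
0x0041 0x0308 count as one character, although they differ in length
as code sequences.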
.PP
Combining characters are essential, for instance, for encoding the Thai
script, for mathematical typesetting, and for users of the International
Phonetic Alphabet.
.SS Implementation levels
As not all systems are expected to support advanced mechanisms like
combining characters, ISO 10646-1 specifies the following three
.I implementation levels
of UCS:
.TP 0.9i
Level 1
Combining characters and Hangul Jamo
(a variant encoding of the Korean script, where a Hangul syllable
glyph is coded as a triplet or pair of vowel/consonant codes) are not
supported.
.TP
Level 2
In addition to level 1, combining characters are now allowed for some
languages where they are essential (e.g., Thai, Lao, Hebrew,
Arabic, Devanagari, Malayalam).
.TP
Level 3
All UCS characters are supported.
.PP
The Unicode 3.0 Standard
published by the Unicode Consortium
contains exactly the UCS Basic Multilingual Plane
at implementation level 3, as described in ISO 10646-1:2000.
Unicode 3.1 added the supplemental planes of ISO 10646-2.
The Unicode standard and
technical reports published by the Unicode Consortium provide much
additional information on the semantics and recommended usages of
various characters.
They provide guidelines and algorithms for
editing, sorting, comparing, normalizing, converting, and displaying
Unicode strings.
.SS Unicode under Linux
Under GNU/Linux, the C type
.I wchar_t
is a signed 32-bit integer type.
Its values are always interpreted
by the C library as UCS
code values (in all locales), a convention that is signaled by the GNU
C library to applications by defining the constant
.B __STDC_ISO_10646__
as specified in the ISO C99 standard.
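.PP
An application can test for this convention at compile time.
The following sketch wraps the check in a function;
the name
wchar_is_ucs
is hypothetical, and only the macro itself comes from the C standard.

```c
#include <wchar.h>

/* Return 1 if the C library declares that wchar_t values are
   ISO 10646 (UCS) code values in all locales, 0 otherwise. */
int wchar_is_ucs(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

On a system where the function returns 0, portable code should not
assume that a
wchar_t
value can be compared directly against a UCS code such as 0x00c4.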
.PP
UCS/Unicode can be used just like ASCII in input/output streams,
terminal communication, plaintext files, filenames, and environment
variables in the ASCII compatible UTF-8 multibyte encoding.
To signal the use of UTF-8 as the character
encoding to all applications, a suitable
.I locale
has to be selected via environment variables (e.g.,
"LANG=en_GB.UTF-8").
.PP
The
.B nl_langinfo(CODESET)
function returns the name of the selected encoding.
Library functions such as
.BR wctomb (3)
and
.BR mbsrtowcs (3)
can be used to transform the internal
.I wchar_t
characters and strings into the system character encoding and back,
and
.BR wcwidth (3)
tells how many positions (0\(en2) the cursor is advanced by the
output of a character.
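.PP
The functions named above can be combined as in the following sketch.
The helper names
current_codeset
and
display_columns
are hypothetical, the fixed buffer size of 256 is an arbitrary
assumption, and the results depend on the locale that the caller has
selected with
setlocale(LC_ALL, "").

```c
#define _XOPEN_SOURCE 700   /* expose wcwidth() and nl_langinfo() */
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Name of the character encoding of the current locale. */
const char *current_codeset(void)
{
    return nl_langinfo(CODESET);
}

/* Convert a multibyte string in the locale's encoding to wchar_t
   and add up the cursor positions reported by wcwidth().
   Returns -1 if the string is invalid in the current locale,
   is too long for the buffer, or has a nonprintable character. */
int display_columns(const char *s)
{
    wchar_t wbuf[256];          /* arbitrary limit for this sketch */
    mbstate_t state;
    const char *src = s;
    size_t n;
    int cols = 0;

    memset(&state, 0, sizeof(state));
    n = mbsrtowcs(wbuf, &src, 256, &state);
    if (n == (size_t) -1 || src != NULL)
        return -1;              /* bad sequence or input not fully consumed */

    for (size_t i = 0; i < n; i++) {
        int w = wcwidth(wbuf[i]);

        if (w < 0)
            return -1;
        cols += w;
    }
    return cols;
}
```

A caller would typically invoke
setlocale(LC_ALL, "")
once at startup, so that the settings from the environment take effect
before either helper is used.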
.SS Private Use Areas (PUA)
In the Basic Multilingual Plane,
the range 0xe000 to 0xf8ff will never be assigned to any characters by
the standard and is reserved for private usage.
For the Linux
community, this private area has been subdivided further into the
range 0xe000 to 0xefff which can be used individually by any end-user
and the Linux zone in the range 0xf000 to 0xf8ff where extensions are
coordinated among all Linux users.
The registry of the characters
assigned to the Linux zone is maintained by LANANA and the registry
itself is
.I Documentation/admin\-guide/unicode.rst
in the Linux kernel sources
.\" commit 9d85025b0418163fae079c9ba8f8445212de8568
(or
.I Documentation/unicode.txt
before Linux 4.10).
.PP
Two other planes are reserved for private usage, plane 15
(Supplementary Private Use Area-A, range 0xf0000 to 0xffffd)
and plane 16 (Supplementary Private Use Area-B, range
0x100000 to 0x10fffd).
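.PP
The ranges listed in this section can be expressed as a small
classification function.
This is an illustrative sketch only;
the enumeration and the function name
pua_classify
are hypothetical.

```c
#include <stdint.h>

/* Hypothetical labels for the private use ranges described above. */
enum pua_kind {
    PUA_NONE,        /* not a private use code */
    PUA_BMP_USER,    /* 0xe000..0xefff: individual end-user use */
    PUA_BMP_LINUX,   /* 0xf000..0xf8ff: the Linux zone */
    PUA_PLANE15,     /* 0xf0000..0xffffd: Supplementary PUA-A */
    PUA_PLANE16      /* 0x100000..0x10fffd: Supplementary PUA-B */
};

/* Classify a UCS code value by private use range. */
enum pua_kind pua_classify(uint32_t cp)
{
    if (cp >= 0xe000 && cp <= 0xefff)
        return PUA_BMP_USER;
    if (cp >= 0xf000 && cp <= 0xf8ff)
        return PUA_BMP_LINUX;
    if (cp >= 0xf0000 && cp <= 0xffffd)
        return PUA_PLANE15;
    if (cp >= 0x100000 && cp <= 0x10fffd)
        return PUA_PLANE16;
    return PUA_NONE;
}
```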
.SS Literature
.IP \(bu 3
Information technology \(em Universal Multiple-Octet Coded Character
Set (UCS) \(em Part 1: Architecture and Basic Multilingual Plane.
International Standard ISO/IEC 10646-1, International Organization
for Standardization, Geneva, 2000.
.IP
This is the official specification of UCS.
Available from
.UR http://www.iso.ch/
.UE .
.IP \(bu
The Unicode Standard, Version 3.0.
The Unicode Consortium, Addison-Wesley,
Reading, MA, 2000, ISBN 0-201-61633-5.
.IP \(bu
S.\& Harbison, G.\& Steele. C: A Reference Manual. Fourth edition,
Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3.
.IP
A good reference book about the C programming language.
The fourth