3 WORKING DRAFT Ira McDonald
4 <i18n_sdd.txt> High North Inc
6 Common UNIX Printing System ("CUPS")
7 Internationalization Software Design Description v0.3
9 Copyright (C) Easy Software Products (2002) - All Rights Reserved
12 Status of this Document
14 This document is an unapproved working draft and is incomplete in some
15 sections (see 'Ed Note:' comments).
20 This document provides general information and high-level design for the
21 Internationalization extensions for the Common UNIX Printing System
22 ("CUPS") Version 1.2. This document also provides C language header
23 files and high-level pseudo-code for all new modules and external
57 McDonald June 20, 2002 [Page 1]
59 CUPS Internationalization Software Design Description v0.3
1. Scope
  1.1. Identification
  1.2. System Overview
  1.3. Document Overview
2. References
  2.1. CUPS References
  2.2. Other Documents
3. Design Overview
  3.1. Transcoding - New
    3.1.1. transcode.h - Transcoding header
      3.1.1.1. cups_cmap_t - SBCS Charmap Structure
      3.1.1.2. cups_dmap_t - DBCS Charmap Structure
    3.1.2. transcode.c - Transcoding module
      3.1.2.1. cupsUtf8ToCharset()
      3.1.2.2. cupsCharsetToUtf8()
      3.1.2.3. cupsUtf8ToUtf16()
      3.1.2.4. cupsUtf16ToUtf8()
      3.1.2.5. cupsUtf8ToUtf32()
      3.1.2.6. cupsUtf32ToUtf8()
      3.1.2.7. cupsUtf16ToUtf32()
      3.1.2.8. cupsUtf32ToUtf16()
      3.1.2.9. Transcoding Utility Functions
        3.1.2.9.1. cupsCharmapGet()
        3.1.2.9.2. cupsCharmapFree()
        3.1.2.9.3. cupsCharmapFlush()
  3.2. Normalization - New
    3.2.1. normalize.h - Normalization header
      3.2.1.1. cups_normmap_t - Normalize Map Structure
      3.2.1.2. cups_foldmap_t - Case Fold Map Structure
      3.2.1.3. cups_propmap_t - Char Property Map Structure
      3.2.1.4. cups_prop_t - Char Property Structure
      3.2.1.5. cups_breakmap_t - Line Break Map Structure
      3.2.1.6. cups_combmap_t - Combining Class Map Structure
      3.2.1.7. cups_comb_t - Combining Class Structure
    3.2.2. normalize.c - Normalization module
      3.2.2.1. cupsUtf8Normalize()
      3.2.2.2. cupsUtf32Normalize()
      3.2.2.3. cupsUtf8CaseFold()
      3.2.2.4. cupsUtf32CaseFold()
      3.2.2.5. cupsUtf8CompareCaseless()
      3.2.2.6. cupsUtf32CompareCaseless()
      3.2.2.7. cupsUtf8CompareIdentifier()
      3.2.2.8. cupsUtf32CompareIdentifier()
      3.2.2.9. cupsUtf32CharacterProperty()
      3.2.2.10. Normalization Utility Functions
        3.2.2.10.1. cupsNormalizeMapsGet()
        3.2.2.10.2. cupsNormalizeMapsFree()
        3.2.2.10.3. cupsNormalizeMapsFlush()
  3.3. Language - Existing
    3.3.1. language.h - Language header
    3.3.2. language.c - Language module
      3.3.2.1. cupsLangEncoding() - Existing
      3.3.2.2. cupsLangFlush() - Existing
      3.3.2.3. cupsLangFree() - Existing
      3.3.2.4. cupsLangGet() - Existing
      3.3.2.5. cupsLangPrintf() - New
      3.3.2.6. cupsLangPuts() - New
      3.3.2.7. cupsEncodingName() - New
  3.4. Common Text Filter - Existing
    3.4.1. textcommon.h - Common text filter header
      3.4.1.1. lchar_t - Character/Attribute Structure
    3.4.2. textcommon.c - Common text filter
      3.4.2.1. TextMain() - Existing
      3.4.2.2. compare_keywords() - Existing
      3.4.2.3. getutf8() - Existing
  3.5. Text to PostScript Filter - Existing
    3.5.1. texttops.c - Text to PostScript filter
      3.5.1.1. main() - Existing
      3.5.1.2. WriteEpilogue () - Existing
      3.5.1.3. WritePage () - Existing
      3.5.1.4. WriteProlog () - Existing
      3.5.1.5. write_line() - Existing
      3.5.1.6. write_string() - Existing
      3.5.1.7. write_text() - Existing
A. Glossary
183 This document provides general information and high-level design for the
184 Internationalization extensions for the Common UNIX Printing System
185 ("CUPS") Version 1.2. This document also provides C language header
186 files and high-level pseudo-code for all new modules and external
192 The CUPS Internationalization extensions provide multilingual support
193 via Unicode 3.2:2002 [UNICODE3.2] / ISO-10646-1:2000 [ISO10646-1] and a
194 suite of local character sets (including all adopted parts of ISO-8859
195 and many MS Windows code pages) for CUPS 1.2.
197 The CUPS Internationalization extensions support UTF-8 [RFC2279] as the
198 common stream-oriented representation of all character data. UTF-8 is
199 defined in [ISO10646-1] and is further constrained (for integrity and
200 security) by [UNICODE3.2].
202 UTF-8 is the native character set of LDAPv3 [RFC2251], SLPv2 [RFC2608],
203 IPP/1.1 [RFC2910] [RFC2911], and many other Internet protocols.
206 1.3. Document Overview
209 This software design description document is organized into the
214 o 3 - Design Overview
240 See: Section 2.1 'CUPS Documentation' of CUPS Software Design
246 The following non-CUPS documents are referenced by this document.
248 [ANSI-X3.4] ANSI Coded Character Set - 7-bit American National Standard
249 Code for Information Interchange, ANSI X3.4, 1986 (aka US-ASCII).
251 [GB2312] Code of Chinese Graphic Character Set for Information
252 Interchange, Primary Set, GB 2312, 1980.
254 [ISO639-1] Codes for the Representation of Names of Languages -- Part 1:
255 Alpha-2 Code, ISO/IEC 639-1, 2000.
257 [ISO639-2] Codes for the Representation of Names of Languages -- Part 2:
258 Alpha-3 Code, ISO/IEC 639-2, 1998.
260 [ISO646] Information Technology - ISO 7-bit Coded Character Set for
261 Information Interchange, ISO/IEC 646, 1991.
263 [ISO2022] Information Processing - ISO 7-bit and 8-bit Coded Character
264 Sets - Code Extension Techniques, ISO/IEC 2022, 1994. (Technically
265 identical to ECMA-35.)
[ISO3166-1] Codes for the Representation of Names of Countries and their
Subdivisions, Part 1: Country Codes, ISO 3166-1, 1997.
[ISO8859] Information Processing - 8-bit Single-Byte Coded Graphic
Character Sets, ISO/IEC 8859-n, 1987-2001.
[ISO10646-1] Information Technology - Universal Multiple-Octet Coded
Character Set (UCS) - Part 1: Architecture and Basic Multilingual
Plane, ISO/IEC 10646-1, September 2000.
[ISO10646-2] Information Technology - Universal Multiple-Octet Coded
Character Set (UCS) - Part 2: Supplementary Planes, ISO/IEC 10646-2,
281 [RFC2119] Bradner. Key words for use in RFCs to Indicate Requirement
282 Levels, RFC 2119, March 1997.
[RFC2251] Wahl, Howes, Kille. Lightweight Directory Access Protocol
Version 3 (LDAPv3), RFC 2251, December 1997.
293 [RFC2277] Alvestrand. IETF Policy on Character Sets and Languages, RFC
296 [RFC2279] Yergeau. UTF-8, a Transformation Format of ISO 10646, RFC
299 [RFC2608] Guttman, Perkins, Veizades, Day. Service Location Protocol
300 Version 2 (SLPv2), RFC 2608, June 1999.
302 [RFC2910] Herriot, Butler, Moore, Turner, Wenn. Internet Printing
303 Protocol/1.1: Encoding and Transport, RFC 2910, September 2000.
305 [RFC2911] Hastings, Herriot, deBry, Isaacson, Powell. Internet Printing
306 Protocol/1.1: Model and Semantics, RFC 2911, September 2000.
308 [UNICODE3.0] Unicode Consortium, Unicode Standard Version 3.0,
309 Addison-Wesley Developers Press, ISBN 0-201-61633-5, 2000.
311 [UNICODE3.1] Unicode Consortium, Unicode Standard Version 3.1 (UAX-27),
314 [UNICODE3.2] Unicode Consortium, Unicode Standard Version 3.2 (UAX-28),
317 [US-ASCII] See [ANSI-X3.4] above.
350 The CUPS Internationalization extensions are composed of several header
351 files and modules which extend the Language functions in the existing
352 CUPS Application Programmers Interface (API).
355 3.1. Transcoding - New
357 Initially, the CUPS Internationalization extensions will only support
358 SBCS (single-byte character set) transcoding. But the design allows
359 future support for DBCS (double-byte character set) transcoding for CJK
360 (Chinese/Japanese/Korean) languages and the MBCS (multiple-byte
361 character set) compound sets that use escapes for charset switching.
To reduce code size and increase performance, all conventional
'mapping files' (tables of values in legacy character sets with their
corresponding Unicode scalar values) will ALSO be sorted and stored in
memory as reverse maps (for efficient conversion from Unicode scalar
values to their corresponding legacy character set values). Transcoding
368 will be done directly by 2-level lookup (without any searching or
371 [Ed Note: CJK languages will be fairly costly in mapping table sizes,
372 because they have thousands (or tens of thousands) of codepoints.]
376 3.1.1. transcode.h - Transcoding header
379 * "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
381 * Transcoding support for the Common UNIX Printing System (CUPS).
383 * Copyright 1997-2002 by Easy Software Products.
385 * These coded instructions, statements, and computer programs are
386 * the property of Easy Software Products and are protected by Federal
387 * copyright law. Distribution and use rights are outlined in the
388 * file "LICENSE.txt" which should have been included with this file.
389 * If this file is missing or damaged please contact Easy Software
392 * Attn: CUPS Licensing Information
393 * Easy Software Products
394 * 44141 Airport View Drive, Suite 204
395 * Hollywood, Maryland 20636-3111 USA
397 * Voice: (301) 373-9603
403 * EMail: cups-info@cups.org
404 * WWW: http://www.cups.org
407 #ifndef _CUPS_TRANSCODE_H_
408 # define _CUPS_TRANSCODE_H_
411 * Include necessary headers...
414 # include "cups/language.h"
418 # endif /* __cplusplus */
424 typedef unsigned char utf8_t; /* UTF-8 Unicode/ISO-10646 code unit */
425 typedef unsigned short utf16_t; /* UTF-16 Unicode/ISO-10646 code unit */
426 typedef unsigned long utf32_t; /* UTF-32 Unicode/ISO-10646 code unit */
427 typedef unsigned short ucs2_t; /* UCS-2 Unicode/ISO-10646 code unit */
428 typedef unsigned long ucs4_t; /* UCS-4 Unicode/ISO-10646 code unit */
429 typedef unsigned char sbcs_t; /* SBCS Legacy 8-bit code unit */
430 typedef unsigned short dbcs_t; /* DBCS Legacy 16-bit code unit */
typedef struct cups_cmap_str    /**** SBCS Charmap Cache Structure ****/
{
  struct cups_cmap_str  *next;          /* Next charmap in cache */
  int                   used;           /* Number of times entry used */
  cups_encoding_t       encoding;       /* Legacy charset encoding */
  ucs2_t                char2uni[256];  /* Map Legacy SBCS -> UCS-2 */
  sbcs_t                *uni2char[256]; /* Map UCS-2 -> Legacy SBCS */
} cups_cmap_t;

typedef struct cups_dmap_str    /**** DBCS Charmap Cache Structure ****/
{
  struct cups_dmap_str  *next;          /* Next charmap in cache */
  int                   used;           /* Number of times entry used */
  cups_encoding_t       encoding;       /* Legacy charset encoding */
  ucs2_t                *char2uni[256]; /* Map Legacy DBCS -> UCS-2 */
  dbcs_t                *uni2char[256]; /* Map UCS-2 -> Legacy DBCS */
} cups_dmap_t;
464 #define CUPS_MAX_USTRING 1024 /* Maximum size of Unicode string */
470 extern int TcFixMapNames; /* Fix map names to Unicode names */
471 extern int TcStrictUtf8; /* Non-shortest-form is illegal */
472 extern int TcStrictUtf16; /* Invalid surrogate pair is illegal */
473 extern int TcStrictUtf32; /* Greater than 0x10FFFF is illegal */
474 extern int TcRequireBOM; /* Require BOM for little/big-endian */
475 extern int TcSupportBOM; /* Support BOM for little/big-endian */
476 extern int TcSupport8859; /* Support ISO 8859-x repertoires */
477 extern int TcSupportWin; /* Support Windows-x repertoires */
478 extern int TcSupportCJK; /* Support CJK (Asian) repertoires */
485 * Utility functions for character set maps
487 extern void *cupsCharmapGet(const cups_encoding_t encoding);
489 extern void cupsCharmapFree(const cups_encoding_t encoding);
491 extern void cupsCharmapFlush(void);
494 * Convert UTF-8 to and from legacy character set
496 extern int cupsUtf8ToCharset(char *dest, /* O - Target string */
497 const utf8_t *src, /* I - Source string */
498 const int maxout, /* I - Max output */
499 cups_encoding_t encoding); /* I - Encoding */
500 extern int cupsCharsetToUtf8(utf8_t *dest, /* O - Target string */
501 const char *src, /* I - Source string */
502 const int maxout, /* I - Max output */
503 cups_encoding_t encoding); /* I - Encoding */
506 * Convert UTF-8 to and from UTF-16
508 extern int cupsUtf8ToUtf16(utf16_t *dest, /* O - Target string */
509 const utf8_t *src, /* I - Source string */
510 const int maxout); /* I - Max output */
511 extern int cupsUtf16ToUtf8(utf8_t *dest, /* O - Target string */
517 const utf16_t *src, /* I - Source string */
518 const int maxout); /* I - Max output */
521 * Convert UTF-8 to and from UTF-32
523 extern int cupsUtf8ToUtf32(utf32_t *dest, /* O - Target string */
524 const utf8_t *src, /* I - Source string */
525 const int maxout); /* I - Max output */
526 extern int cupsUtf32ToUtf8(utf8_t *dest, /* O - Target string */
527 const utf32_t *src, /* I - Source string */
528 const int maxout); /* I - Max output */
531 * Convert UTF-16 to and from UTF-32
533 extern int cupsUtf16ToUtf32(utf32_t *dest, /* O - Target string */
534 const utf16_t *src, /* I - Source string */
535 const int maxout); /* I - Max output */
536 extern int cupsUtf32ToUtf16(utf16_t *dest, /* O - Target string */
537 const utf32_t *src, /* I - Source string */
538 const int maxout); /* I - Max output */
542 # endif /* __cplusplus */
544 #endif /* !_CUPS_TRANSCODE_H_ */
547 * End of "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
552 3.1.1.1. cups_cmap_t - SBCS Charmap Structure
typedef struct cups_cmap_str    /**** SBCS Charmap Cache Structure ****/
{
  struct cups_cmap_str  *next;          /* Next charset map in cache */
  int                   used;           /* Number of times entry used */
  cups_encoding_t       encoding;       /* Legacy charset encoding */
  ucs2_t                char2uni[256];  /* Map Legacy SBCS -> UCS-2 */
  sbcs_t                *uni2char[256]; /* Map UCS-2 -> Legacy SBCS */
} cups_cmap_t;
563 'char2uni[]' is a (complete) array of UCS-2 values that supports direct
564 one-level lookup from an input SBCS legacy charset code point, for use
565 by 'cupsCharsetToUtf8()'.
567 'uni2char[]' is a (sparse) array of pointers to arrays of (256 each)
568 SBCS values, that supports direct two-level lookup from an input UCS-2
574 code point, for use by 'cupsUtf8ToCharset()'.
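As an illustration only, the two lookups described above might be
sketched as follows. The names 'sketch_cmap_t', 'sketch_sbcs_to_ucs2()',
and 'sketch_ucs2_to_sbcs()' are hypothetical and are not part of the
CUPS API; the structure reproduces only the two tables of 'cups_cmap_t'.
The sketch assumes that a NULL row pointer means no mapping is loaded
for that high byte. The design does not say how an unmapped cell inside
a loaded row is marked, so that case is not handled here.

```c
#include <stddef.h>

typedef unsigned short ucs2_t;          /* As declared in transcode.h */
typedef unsigned char  sbcs_t;          /* As declared in transcode.h */

/* Hypothetical reduced form of cups_cmap_t: lookup tables only. */
typedef struct
{
  ucs2_t char2uni[256];                 /* Legacy SBCS -> UCS-2 (complete) */
  sbcs_t *uni2char[256];                /* UCS-2 -> legacy SBCS (sparse rows) */
} sketch_cmap_t;

/* Forward lookup: a single array index, no searching. */
static ucs2_t
sketch_sbcs_to_ucs2(const sketch_cmap_t *map, sbcs_t ch)
{
  return (map->char2uni[ch]);
}

/*
 * Reverse lookup: the high byte of the UCS-2 value selects a row and
 * the low byte selects the cell within that row.
 * Returns 1 and stores the legacy byte, or 0 if no row is loaded
 * (assumed convention, see the lead-in).
 */
static int
sketch_ucs2_to_sbcs(const sketch_cmap_t *map, ucs2_t uni, sbcs_t *out)
{
  const sbcs_t *row = map->uni2char[(uni >> 8) & 0xFF];

  if (row == NULL)
    return (0);

  *out = row[uni & 0xFF];
  return (1);
}
```

With this layout an ISO 8859 style map typically needs only a few
256-entry rows, because most of its characters share a small number of
high bytes.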
578 3.1.1.2. cups_dmap_t - DBCS Charmap Structure
typedef struct cups_dmap_str    /**** DBCS Charmap Cache Structure ****/
{
  struct cups_dmap_str  *next;          /* Next charset map in cache */
  int                   used;           /* Number of times entry used */
  cups_encoding_t       encoding;       /* Legacy charset encoding */
  ucs2_t                *char2uni[256]; /* Map Legacy DBCS -> UCS-2 */
  dbcs_t                *uni2char[256]; /* Map UCS-2 -> Legacy DBCS */
} cups_dmap_t;
589 'char2uni[]' is a (sparse) array of pointers to arrays of (256 each)
590 UCS-2 values that supports direct two-level lookup from an input DBCS
591 legacy charset code point, for (future) use by 'cupsCharsetToUtf8()'.
593 'uni2char[]' is a (sparse) array of pointers to arrays of (256 each)
594 DBCS values, that supports direct two-level lookup from an input UCS-2
595 code point, for (future) use by 'cupsUtf8ToCharset()'.
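For the DBCS case the forward direction is also a two-level lookup. A
minimal sketch, assuming the first (lead) byte of the legacy character
selects the row and the second (trail) byte selects the cell, could look
like this. The function name 'sketch_dbcs_to_ucs2()' and the lead/trail
split are assumptions made for illustration; the design above only
states that the table is an array of pointers to 256-entry arrays.

```c
#include <stddef.h>

typedef unsigned short ucs2_t;          /* As declared in transcode.h */

/*
 * Hypothetical forward DBCS lookup over a table shaped like the
 * 'char2uni[]' member of cups_dmap_t.
 * Returns 1 and stores the UCS-2 value, or 0 if no row is loaded for
 * the lead byte.
 */
static int
sketch_dbcs_to_ucs2(ucs2_t *const char2uni[256],
                    unsigned char lead,
                    unsigned char trail,
                    ucs2_t        *out)
{
  const ucs2_t *row = char2uni[lead];

  if (row == NULL)
    return (0);

  *out = row[trail];
  return (1);
}
```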
599 3.1.2. transcode.c - Transcoding module
601 All of the transcoding functions are modelled on the C standard library
602 function 'strncpy()', except that they return the count of output, like
603 'strlen()', rather than the (redundant) pointer to the output.
605 If the transcoding functions detect invalid input parameters or they
606 detect an encoding error in their input, then they return '-1', rather
607 than the count of output.
All of the transcoding functions take an input parameter giving the
maximum number of output units (for safe operation). The functions that
return 16-bit (UTF-16) or 32-bit (UTF-32/UCS-4) output always return the
count of output code units (not including the final null) and NOT the
memory size in bytes.
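Because the return value is a count of code units, a caller that needs
a memory size has to scale it. The helper below is a hypothetical
illustration ('sketch_utf16_alloc_size()' is not part of the design); it
turns a count returned for UTF-16 output into the number of bytes
needed to hold the string plus its final null, and maps the '-1' error
return to zero.

```c
#include <stddef.h>

typedef unsigned short utf16_t;         /* As declared in transcode.h */

/* Bytes needed for 'count' UTF-16 code units plus the final null. */
static size_t
sketch_utf16_alloc_size(int count)
{
  if (count < 0)                        /* -1 signals a transcoding error */
    return (0);

  return (((size_t)count + 1) * sizeof(utf16_t));
}
```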
617 3.1.2.1. cupsUtf8ToCharset()
619 extern int cupsUtf8ToCharset(char *dest, /* O - Target string */
620 const utf8_t *src, /* I - Source string */
621 const int maxout, /* I - Max output */
622 cups_encoding_t encoding); /* I - Encoding */
624 <Find charset map by calling 'cupsCharmapGet()'>
625 <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
631 <Convert internal UCS-4 to legacy charset via charset map>
632 <Release charset map by calling 'cupsCharmapFree()'>
<Return length of output legacy charset string -- size in bytes>
637 3.1.2.2. cupsCharsetToUtf8()
639 extern int cupsCharsetToUtf8(utf8_t *dest, /* O - Target string */
640 const char *src, /* I - Source string */
641 const int maxout, /* I - Max output */
642 cups_encoding_t encoding); /* I - Encoding */
644 <Find charset map by calling 'cupsCharmapGet()'>
645 <Convert input legacy charset to internal UCS-4 via charset map>
646 <Convert internal UCS-4 to UTF-8 by calling 'cupsUtf32ToUtf8()'>
647 <Release charset map by calling 'cupsCharmapFree()'>
648 <Return length of output UTF-8 string -- size in bytes>
652 3.1.2.3. cupsUtf8ToUtf16()
654 extern int cupsUtf8ToUtf16(utf16_t *dest, /* O - Target string */
655 const utf8_t *src, /* I - Source string */
656 const int maxout); /* I - Max output */
658 <...to avoid duplicate code to handle surrogate pairs...>
659 <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
660 <Convert internal UCS-4 to UTF-16 by calling 'cupsUtf32ToUtf16()'>
661 <Return count of output UTF-16 string -- NOT memory size in bytes>
665 3.1.2.4. cupsUtf16ToUtf8()
667 extern int cupsUtf16ToUtf8(utf8_t *dest, /* O - Target string */
668 const utf16_t *src, /* I - Source string */
669 const int maxout); /* I - Max output */
671 <...to avoid duplicate code to handle surrogate pairs...>
672 <Convert input UTF-16 to internal UCS-4 by calling 'cupsUtf16ToUtf32()'>
673 <Convert internal UCS-4 to UTF-8 by calling 'cupsUtf32ToUtf8()'>
674 <Return length of output UTF-8 string -- size in bytes>
678 3.1.2.5. cupsUtf8ToUtf32()
680 extern int cupsUtf8ToUtf32(utf32_t *dest, /* O - Target string */
681 const utf8_t *src, /* I - Source string */
682 const int maxout); /* I - Max output */
689 <Convert input UTF-8 directly to output UCS-4...>
690 <...checking for valid range, shortest-form, etc.>
691 <Return count of output UTF-32 string -- NOT memory size in bytes>
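The pseudo-code above can be sketched as follows. This is a minimal
illustration, not the CUPS implementation: the name
'sketch_utf8_to_utf32()' is hypothetical, the checks are always strict
(the 'TcStrictUtf8' and 'TcStrictUtf32' switches are ignored), and it
assumes that 'maxout' is the capacity of 'dest' in code units including
the final null.

```c
#include <stddef.h>

typedef unsigned char utf8_t;           /* As declared in transcode.h */
typedef unsigned long utf32_t;          /* As declared in transcode.h */

/* Returns the count of UTF-32 units written, or -1 on any error. */
static int
sketch_utf8_to_utf32(utf32_t *dest, const utf8_t *src, int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  while (*src != 0)
  {
    utf32_t ch = *src ++;               /* Lead byte */
    utf32_t min;                        /* Smallest value for this length */
    int     extra;                      /* Continuation bytes expected */

    if (ch < 0x80)
    {
      extra = 0; min = 0;
    }
    else if ((ch & 0xE0) == 0xC0)
    {
      extra = 1; min = 0x80; ch &= 0x1F;
    }
    else if ((ch & 0xF0) == 0xE0)
    {
      extra = 2; min = 0x800; ch &= 0x0F;
    }
    else if ((ch & 0xF8) == 0xF0)
    {
      extra = 3; min = 0x10000; ch &= 0x07;
    }
    else
      return (-1);                      /* Stray continuation or bad lead */

    while (extra -- > 0)
    {
      if ((*src & 0xC0) != 0x80)
        return (-1);                    /* Truncated or malformed sequence */

      ch = (ch << 6) | (utf32_t)(*src ++ & 0x3F);
    }

    /* Reject non-shortest forms, surrogates, and values past 0x10FFFF. */
    if (ch < min || ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF))
      return (-1);

    if (count >= maxout - 1)
      return (-1);                      /* Output would not fit */

    dest[count ++] = ch;
  }

  dest[count] = 0;
  return (count);
}
```

The 'min' comparison is the shortest-form check: a two-byte sequence
such as 0xC0 0x80 decodes to a value below 0x80 and is rejected.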
695 3.1.2.6. cupsUtf32ToUtf8()
697 extern int cupsUtf32ToUtf8(utf8_t *dest, /* O - Target string */
698 const utf32_t *src, /* I - Source string */
699 const int maxout); /* I - Max output */
701 <Convert input UCS-4 directly to output UTF-8...>
702 <...checking for valid range, etc.>
703 <Return length of output UTF-8 string -- size in bytes>
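A corresponding sketch for the opposite direction is shown below. As
before, 'sketch_utf32_to_utf8()' is a hypothetical name, the range
checks are always applied, and 'maxout' is assumed to be the capacity
of 'dest' in bytes including the final null.

```c
#include <stddef.h>

typedef unsigned char utf8_t;           /* As declared in transcode.h */
typedef unsigned long utf32_t;          /* As declared in transcode.h */

/* Returns the length of the UTF-8 output in bytes, or -1 on any error. */
static int
sketch_utf32_to_utf8(utf8_t *dest, const utf32_t *src, int maxout)
{
  static const utf8_t lead[5] = { 0, 0, 0xC0, 0xE0, 0xF0 };
  int                 count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  for (; *src != 0; src ++)
  {
    utf32_t ch = *src;
    int     need;                       /* Bytes needed for this character */
    int     shift;

    /* Reject surrogates and values past the Unicode range. */
    if (ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF))
      return (-1);

    need = ch < 0x80 ? 1 : ch < 0x800 ? 2 : ch < 0x10000 ? 3 : 4;

    if (count + need > maxout - 1)
      return (-1);                      /* Output would not fit */

    if (need == 1)
    {
      dest[count ++] = (utf8_t)ch;
      continue;
    }

    /* Lead byte carries the top bits, then 6 bits per continuation. */
    shift          = 6 * (need - 1);
    dest[count ++] = (utf8_t)(lead[need] | (ch >> shift));

    while (shift > 0)
    {
      shift -= 6;
      dest[count ++] = (utf8_t)(0x80 | ((ch >> shift) & 0x3F));
    }
  }

  dest[count] = 0;
  return (count);
}
```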
707 3.1.2.7. cupsUtf16ToUtf32()
709 extern int cupsUtf16ToUtf32(utf32_t *dest, /* O - Target string */
710 const utf16_t *src, /* I - Source string */
711 const int maxout); /* I - Max output */
713 <Convert input UTF-16 directly to output UCS-4...>
714 <...handling surrogate pairs decoding from UTF-16>
715 <Return count of output UTF-32 string -- NOT memory size in bytes>
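Surrogate pair decoding, as named in the pseudo-code above, might be
sketched like this. The function name 'sketch_utf16_to_utf32()' is
hypothetical. The sketch treats every unpaired surrogate as an error
(as if 'TcStrictUtf16' were set), does no byte order mark handling, and
assumes 'maxout' is the capacity of 'dest' in code units including the
final null.

```c
#include <stddef.h>

typedef unsigned short utf16_t;         /* As declared in transcode.h */
typedef unsigned long  utf32_t;         /* As declared in transcode.h */

/* Returns the count of UTF-32 units written, or -1 on any error. */
static int
sketch_utf16_to_utf32(utf32_t *dest, const utf16_t *src, int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  while (*src != 0)
  {
    utf32_t ch = *src ++;

    if (ch >= 0xD800 && ch <= 0xDBFF)
    {
      /* High surrogate: the next unit must be a low surrogate. */
      utf32_t low = *src;

      if (low < 0xDC00 || low > 0xDFFF)
        return (-1);

      src ++;
      ch = 0x10000 + ((ch - 0xD800) << 10) + (low - 0xDC00);
    }
    else if (ch >= 0xDC00 && ch <= 0xDFFF)
      return (-1);                      /* Low surrogate without a high one */

    if (count >= maxout - 1)
      return (-1);                      /* Output would not fit */

    dest[count ++] = ch;
  }

  dest[count] = 0;
  return (count);
}
```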
719 3.1.2.8. cupsUtf32ToUtf16()
721 extern int cupsUtf32ToUtf16(utf16_t *dest, /* O - Target string */
722 const utf32_t *src, /* I - Source string */
723 const int maxout); /* I - Max output */
725 <Convert input UCS-4 directly to output UTF-16...>
726 <...handling surrogate pairs encoding to UTF-16>
727 <Return count of output UTF-16 string -- NOT memory size in bytes>
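The surrogate pair encoding step can be isolated in a small helper. The
sketch below converts one UCS-4 value into one or two UTF-16 code
units; 'sketch_utf32_to_utf16_unit()' is a hypothetical name, and a
full conversion function would call it in a loop while tracking the
output limit.

```c
typedef unsigned short utf16_t;         /* As declared in transcode.h */
typedef unsigned long  utf32_t;         /* As declared in transcode.h */

/*
 * Encode one UCS-4 value as UTF-16.
 * Returns the number of code units stored in 'out' (1 or 2), or -1 if
 * the value is a surrogate or is greater than 0x10FFFF.
 */
static int
sketch_utf32_to_utf16_unit(utf32_t ch, utf16_t out[2])
{
  if (ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF))
    return (-1);

  if (ch < 0x10000)
  {
    out[0] = (utf16_t)ch;               /* BMP value: stored as is */
    return (1);
  }

  ch     -= 0x10000;                    /* 20 bits remain */
  out[0]  = (utf16_t)(0xD800 | (ch >> 10));    /* High surrogate */
  out[1]  = (utf16_t)(0xDC00 | (ch & 0x3FF));  /* Low surrogate */
  return (2);
}
```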
731 3.1.2.9. Transcoding Utility Functions
733 The transcoding utility functions are used to load (from a file into
734 memory), free (logically, without freeing memory), and flush (actually
735 free memory) character maps for SBCS (single-byte character set) and
736 (future) DBCS (double-byte character set) transcoding to and from UTF-8.
747 3.1.2.9.1. cupsCharmapGet()
749 extern void *cupsCharmapGet(const cups_encoding_t encoding);
<Find SBCS or DBCS charset map in cache>
753 <...If found, increment 'used'>
754 <...and return pointer to SBCS or DBCS charset map>
755 <Get charset map file name by calling 'cupsEncodingName()'>
756 <Open charset map file>
<...If not found, return NULL>
<Allocate memory for SBCS or DBCS charset map in cache>
<...If no memory, close charset map file and return NULL>
760 <Add to SBCS or DBCS cache by assigning 'next' field>
761 <Assign 'encoding' field>
762 <Increment 'used' field>
763 <Read charset map file into memory in loop...>
764 <If SBCS, then 'char2uni[]' is an array of 'ucs2_t' values>
765 <...and 'uni2char[]' is an array of pointers to 'sbcs_t' arrays>
<If DBCS, then 'char2uni[]' is an array of pointers to 'ucs2_t' arrays>
767 <...and 'uni2char[]' is an array of pointers to 'dbcs_t' arrays>
768 <Close charset map file>
769 <Return pointer to SBCS or DBCS charset map>
773 3.1.2.9.2. cupsCharmapFree()
775 extern void cupsCharmapFree(const cups_encoding_t encoding);
<Find SBCS or DBCS charset map in cache>
779 <...If found, decrement 'used'>
784 3.1.2.9.3. cupsCharmapFlush()
786 extern void cupsCharmapFlush(void);
788 <Loop through SBCS charset map cache...>
789 <...Free 'uni2char[]' memory>
790 <...Free SBCS charset map memory>
791 <Loop through DBCS charset map cache...>
792 <...Free 'char2uni[]' memory>
793 <...Free 'uni2char[]' memory>
794 <...Free DBCS charset map memory>
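The get / free / flush life cycle described in this section can be
sketched with a single linked list and a use counter. Everything named
'sketch_...' below is hypothetical, an 'int' stands in for
'cups_encoding_t', and reading the map file and the lookup tables
themselves are omitted, so this shows only the cache bookkeeping.

```c
#include <stdlib.h>

typedef struct sketch_map_str
{
  struct sketch_map_str *next;          /* Next map in cache */
  int                   used;           /* Number of times entry used */
  int                   encoding;       /* Stand-in for cups_encoding_t */
} sketch_map_t;

static sketch_map_t *sketch_cache = NULL;

/* Find a map in the cache or create it; bump its use count. */
static sketch_map_t *
sketch_map_get(int encoding)
{
  sketch_map_t *map;

  for (map = sketch_cache; map != NULL; map = map->next)
    if (map->encoding == encoding)
    {
      map->used ++;
      return (map);
    }

  if ((map = calloc(1, sizeof(sketch_map_t))) == NULL)
    return (NULL);

  map->encoding = encoding;
  map->used     = 1;
  map->next     = sketch_cache;
  sketch_cache  = map;
  return (map);
}

/* Logical free: drop one use, keep the memory for later reuse. */
static void
sketch_map_free(int encoding)
{
  sketch_map_t *map;

  for (map = sketch_cache; map != NULL; map = map->next)
    if (map->encoding == encoding)
    {
      if (map->used > 0)
        map->used --;
      return;
    }
}

/* Flush: actually release every cached map. */
static void
sketch_map_flush(void)
{
  while (sketch_cache != NULL)
  {
    sketch_map_t *next = sketch_cache->next;

    free(sketch_cache);
    sketch_cache = next;
  }
}
```

Keeping the memory on a logical free means that a filter which converts
many strings in the same charset pays the file load cost only once.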
805 3.2. Normalization - New
809 3.2.1. normalize.h - Normalization header
812 * "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
814 * Unicode normalization for the Common UNIX Printing System (CUPS).
816 * Copyright 1997-2002 by Easy Software Products.
818 * These coded instructions, statements, and computer programs are
819 * the property of Easy Software Products and are protected by Federal
820 * copyright law. Distribution and use rights are outlined in the
821 * file "LICENSE.txt" which should have been included with this file.
822 * If this file is missing or damaged please contact Easy Software
825 * Attn: CUPS Licensing Information
826 * Easy Software Products
827 * 44141 Airport View Drive, Suite 204
828 * Hollywood, Maryland 20636-3111 USA
830 * Voice: (301) 373-9603
831 * EMail: cups-info@cups.org
832 * WWW: http://www.cups.org
835 #ifndef _CUPS_NORMALIZE_H_
836 # define _CUPS_NORMALIZE_H_
839 * Include necessary headers...
#  include "transcode.h"
846 # endif /* __cplusplus */
typedef enum                    /**** Normalization Types ****/
{
  CUPS_NORM_NFD,                /* Canonical Decomposition */
  CUPS_NORM_NFKD,               /* Compatibility Decomposition */
  CUPS_NORM_NFC,                /* NFD, then Canonical Composition */
  CUPS_NORM_NFKC                /* NFKD, then Canonical Composition */
} cups_normalize_t;
typedef enum                    /**** Case Folding Types ****/
{
  CUPS_FOLD_SIMPLE,             /* Simple - no expansion in size */
  CUPS_FOLD_FULL                /* Full - possible expansion in size */
} cups_folding_t;
871 typedef enum /**** Unicode Char Property Types ****/
873 CUPS_PROP_GENERAL_CATEGORY, /* See 'cups_gencat_t' enum */
874 CUPS_PROP_BIDI_CATEGORY, /* See 'cups_bidicat_t' enum */
875 CUPS_PROP_COMBINING_CLASS, /* See 'cups_combclass_t' type */
876 CUPS_PROP_BREAK_CLASS /* See 'cups_breakclass_t' enum */
880 * Note - parse Unicode char general category from 'UnicodeData.txt'
881 * into sparse local table in 'normalize.c'.
882 * Use major classes for logic optimizations throughout (by mask).
885 typedef enum /**** Unicode General Category ****/
887 CUPS_GENCAT_L = 0x10, /* Letter major class */
888 CUPS_GENCAT_LU = 0x11, /* Lu Letter, Uppercase */
889 CUPS_GENCAT_LL = 0x12, /* Ll Letter, Lowercase */
890 CUPS_GENCAT_LT = 0x13, /* Lt Letter, Titlecase */
891 CUPS_GENCAT_LM = 0x14, /* Lm Letter, Modifier */
892 CUPS_GENCAT_LO = 0x15, /* Lo Letter, Other */
893 CUPS_GENCAT_M = 0x20, /* Mark major class */
894 CUPS_GENCAT_MN = 0x21, /* Mn Mark, Non-Spacing */
895 CUPS_GENCAT_MC = 0x22, /* Mc Mark, Spacing Combining */
896 CUPS_GENCAT_ME = 0x23, /* Me Mark, Enclosing */
897 CUPS_GENCAT_N = 0x30, /* Number major class */
898 CUPS_GENCAT_ND = 0x31, /* Nd Number, Decimal Digit */
899 CUPS_GENCAT_NL = 0x32, /* Nl Number, Letter */
900 CUPS_GENCAT_NO = 0x33, /* No Number, Other */
901 CUPS_GENCAT_P = 0x40, /* Punctuation major class */
902 CUPS_GENCAT_PC = 0x41, /* Pc Punctuation, Connector */
903 CUPS_GENCAT_PD = 0x42, /* Pd Punctuation, Dash */
904 CUPS_GENCAT_PS = 0x43, /* Ps Punctuation, Open (start) */
905 CUPS_GENCAT_PE = 0x44, /* Pe Punctuation, Close (end) */
906 CUPS_GENCAT_PI = 0x45, /* Pi Punctuation, Initial Quote */
907 CUPS_GENCAT_PF = 0x46, /* Pf Punctuation, Final Quote */
908 CUPS_GENCAT_PO = 0x47, /* Po Punctuation, Other */
909 CUPS_GENCAT_S = 0x50, /* Symbol major class */
910 CUPS_GENCAT_SM = 0x51, /* Sm Symbol, Math */
916 CUPS_GENCAT_SC = 0x52, /* Sc Symbol, Currency */
917 CUPS_GENCAT_SK = 0x53, /* Sk Symbol, Modifier */
918 CUPS_GENCAT_SO = 0x54, /* So Symbol, Other */
919 CUPS_GENCAT_Z = 0x60, /* Separator major class */
920 CUPS_GENCAT_ZS = 0x61, /* Zs Separator, Space */
921 CUPS_GENCAT_ZL = 0x62, /* Zl Separator, Line */
922 CUPS_GENCAT_ZP = 0x63, /* Zp Separator, Paragraph */
923 CUPS_GENCAT_C = 0x70, /* Other (miscellaneous) major class */
924 CUPS_GENCAT_CC = 0x71, /* Cc Other, Control */
925 CUPS_GENCAT_CF = 0x72, /* Cf Other, Format */
926 CUPS_GENCAT_CS = 0x73, /* Cs Other, Surrogate */
927 CUPS_GENCAT_CO = 0x74, /* Co Other, Private Use */
928 CUPS_GENCAT_CN = 0x75 /* Cn Other, Not Assigned */
932 * Note - parse Unicode char bidi category from 'UnicodeData.txt'
933 * into sparse local table in 'normalize.c'.
934 * Add bidirectional support to 'textcommon.c' - per Mike
937 typedef enum /**** Unicode Bidi Category ****/
939 CUPS_BIDI_L, /* Left-to-Right (Alpha, Syllabic, Ideographic) */
940 CUPS_BIDI_LRE, /* Left-to-Right Embedding (explicit) */
941 CUPS_BIDI_LRO, /* Left-to-Right Override (explicit) */
942 CUPS_BIDI_R, /* Right-to-Left (Hebrew alphabet and most punct) */
943 CUPS_BIDI_AL, /* Right-to-Left Arabic (Arabic, Thaana, Syriac) */
944 CUPS_BIDI_RLE, /* Right-to-Left Embedding (explicit) */
945 CUPS_BIDI_RLO, /* Right-to-Left Override (explicit) */
946 CUPS_BIDI_PDF, /* Pop Directional Format */
947 CUPS_BIDI_EN, /* Euro Number (Euro and East Arabic-Indic digits) */
948 CUPS_BIDI_ES, /* Euro Number Separator (Slash) */
CUPS_BIDI_ET, /* Euro Number Terminator (Plus, Minus, Degree, etc) */
950 CUPS_BIDI_AN, /* Arabic Number (Arabic-Indic digits, separators) */
951 CUPS_BIDI_CS, /* Common Number Separator (Colon, Comma, Dot, etc) */
952 CUPS_BIDI_NSM, /* Non-Spacing Mark (category Mn / Me in UCD) */
953 CUPS_BIDI_BN, /* Boundary Neutral (Formatting / Control chars) */
954 CUPS_BIDI_B, /* Paragraph Separator */
955 CUPS_BIDI_S, /* Segment Separator (Tab) */
956 CUPS_BIDI_WS, /* Whitespace Space (Space, Line Separator, etc) */
957 CUPS_BIDI_ON /* Other Neutrals */
961 * Note - parse Unicode line break class from 'DerivedLineBreak.txt'
962 * into sparse local table (list of class ranges) in 'normalize.c'.
963 * Note - add state table from UAX-14, section 7.3 - Ira
964 * Remember to do BK and SP in outer loop (not in state table).
965 * Consider optimization for CM (combining mark).
966 * See 'LineBreak.txt' (12,875) and 'DerivedLineBreak.txt' (1,350).
typedef enum                    /**** Unicode Line Break Class ****/
{
  /*
   * (A)  - Allow Break AFTER
   * (XA) - Prevent Break AFTER
   * (B)  - Allow Break BEFORE
   * (XB) - Prevent Break BEFORE
   * (P)  - Allow Break For Pair
   * (XP) - Prevent Break For Pair
   */
  CUPS_BREAK_AI,        /* Ambiguous (Alphabetic or Ideograph) */
  CUPS_BREAK_AL,        /* Ordinary Alphabetic / Symbol Chars (XP) */
  CUPS_BREAK_BA,        /* Break Opportunity After Chars (A) */
  CUPS_BREAK_BB,        /* Break Opportunities Before Chars (B) */
  CUPS_BREAK_B2,        /* Break Opportunity Before / After (B/A/XP) */
  CUPS_BREAK_BK,        /* Mandatory Break (A) (normative) */
  CUPS_BREAK_CB,        /* Contingent Break (B/A) (normative) */
  CUPS_BREAK_CL,        /* Closing Punctuation (XB) */
  CUPS_BREAK_CM,        /* Attached Chars / Combining (XB) (normative) */
  CUPS_BREAK_CR,        /* Carriage Return (A) (normative) */
  CUPS_BREAK_EX,        /* Exclamation / Interrogation (XB) */
  CUPS_BREAK_GL,        /* Non-breaking ("Glue") (XB/XA) (normative) */
  CUPS_BREAK_HY,        /* Hyphen (XA) */
  CUPS_BREAK_ID,        /* Ideographic (B/A) */
  CUPS_BREAK_IN,        /* Inseparable chars (XP) */
  CUPS_BREAK_IS,        /* Numeric Separator (Infix) (XB) */
  CUPS_BREAK_LF,        /* Line Feed (A) (normative) */
  CUPS_BREAK_NS,        /* Non-starters (XB) */
  CUPS_BREAK_NU,        /* Numeric (XP) */
  CUPS_BREAK_OP,        /* Opening Punctuation (XA) */
  CUPS_BREAK_PO,        /* Postfix (Numeric) (XB) */
  CUPS_BREAK_PR,        /* Prefix (Numeric) (XA) */
  CUPS_BREAK_QU,        /* Ambiguous Quotation (XB/XA) */
  CUPS_BREAK_SA,        /* Context Dependent (South East Asian) (P) */
  CUPS_BREAK_SG,        /* Surrogates (XP) (normative) */
  CUPS_BREAK_SP,        /* Space (A) (normative) */
  CUPS_BREAK_SY,        /* Symbols Allowing Break After (A) */
  CUPS_BREAK_XX,        /* Unknown (XP) */
  CUPS_BREAK_ZW         /* Zero Width Space (A) (normative) */
} cups_breakclass_t;
1015 typedef int cups_combclass_t; /**** Unicode Combining Class ****/
1016 /* 0=base / 1..254=combining char */
typedef struct cups_normmap_str /**** Normalize Map Cache Struct ****/
{
  struct cups_normmap_str *next;  /* Next normalize in cache */
  int              used;          /* Number of times entry used */
  cups_normalize_t normalize;     /* Normalization type */
  int              normcount;     /* Count of Source Chars */
  ucs2_t           *uni2norm;     /* Char -> Normalization */
                                  /* ...only supports UCS-2 */
} cups_normmap_t;

typedef struct cups_foldmap_str /**** Case Fold Map Cache Struct ****/
{
  struct cups_foldmap_str *next;  /* Next case fold in cache */
  int              used;          /* Number of times entry used */
  cups_folding_t   fold;          /* Case folding type */
  int              foldcount;     /* Count of Source Chars */
  ucs2_t           *uni2fold;     /* Char -> Folded Char(s) */
                                  /* ...only supports UCS-2 */
} cups_foldmap_t;

typedef struct cups_prop_str    /**** Char Property Struct ****/
{
  ucs2_t           ch;            /* Unicode Char as UCS-2 */
  unsigned char    gencat;        /* General Category */
  unsigned char    bidicat;       /* Bidirectional Category */
} cups_prop_t;

typedef struct                  /**** Char Property Map Struct ****/
{
  int              used;          /* Number of times entry used */
  int              propcount;     /* Count of Source Chars */
  cups_prop_t      *uni2prop;     /* Char -> Properties */
} cups_propmap_t;

typedef struct                  /**** Line Break Class Map Struct ****/
{
  int              used;          /* Number of times entry used */
  int              breakcount;    /* Count of Source Chars */
  ucs2_t           *uni2break;    /* Char -> Line Break Class */
} cups_breakmap_t;

typedef struct cups_comb_str    /**** Char Combining Class Struct ****/
{
  ucs2_t           ch;            /* Unicode Char as UCS-2 */
  unsigned char    combclass;     /* Combining Class */
  unsigned char    reserved;      /* Reserved for alignment */
} cups_comb_t;

typedef struct                  /**** Combining Class Map Struct ****/
{
  int              used;          /* Number of times entry used */
  int              combcount;     /* Count of Source Chars */
  cups_comb_t      *uni2comb;     /* Char -> Combining Class */
} cups_combmap_t;
1092 extern int NzSupportUcs2; /* Support UCS-2 (16-bit) mapping */
1093 extern int NzSupportUcs4; /* Support UCS-4 (32-bit) mapping */
1100 * Utility functions for normalization module
1102 extern int cupsNormalizeMapsGet(void);
1103 extern int cupsNormalizeMapsFree(void);
1104 extern void cupsNormalizeMapsFlush(void);
1107 * Normalize UTF-8 string to Unicode UAX-15 Normalization Form
1108 * Note - Compatibility Normalization Forms (NFKD/NFKC) are
1109 * unsafe for subsequent transcoding to legacy charsets
1111 extern int cupsUtf8Normalize(utf8_t *dest, /* O - Target string */
1112 const utf8_t *src, /* I - Source string */
1113 const int maxout, /* I - Max output */
1114 const cups_normalize_t normalize);
1115 /* I - Normalization */
1118 * Normalize UTF-32 string to Unicode UAX-15 Normalization Form
1119 * Note - Compatibility Normalization Forms (NFKD/NFKC) are
1120 * unsafe for subsequent transcoding to legacy charsets
1122 extern int cupsUtf32Normalize(utf32_t *dest,
1123 /* O - Target string */
1124 const utf32_t *src, /* I - Source string */
1125 const int maxout, /* I - Max output */
1126 const cups_normalize_t normalize);
1127 /* I - Normalization */
1130 * Case Fold UTF-8 string per Unicode UAX-21 Section 2.3
1131 * Note - Case folding output is
1132 * unsafe for subsequent transcoding to legacy charsets
1134 extern int cupsUtf8CaseFold(utf8_t *dest, /* O - Target string */
1135 const utf8_t *src, /* I - Source string */
1136 const int maxout, /* I - Max output */
1137 const cups_folding_t fold); /* I - Fold Mode */
1146 * Case Fold UTF-32 string per Unicode UAX-21 Section 2.3
1147 * Note - Case folding output is
1148 * unsafe for subsequent transcoding to legacy charsets
1150 extern int cupsUtf32CaseFold(utf32_t *dest,/* O - Target string */
1151 const utf32_t *src, /* I - Source string */
1152 const int maxout, /* I - Max output */
1153 const cups_folding_t fold); /* I - Fold Mode */
1156 * Compare UTF-8 strings after case folding
1158 extern int cupsUtf8CompareCaseless(const utf8_t *s1,
1160 const utf8_t *s2); /* I - String2 */
1163 * Compare UTF-32 strings after case folding
1165 extern int cupsUtf32CompareCaseless(const utf32_t *s1,
1167 const utf32_t *s2); /* I - String2 */
1170 * Compare UTF-8 strings after case folding and NFKC normalization
1172 extern int cupsUtf8CompareIdentifier(const utf8_t *s1,
1174 const utf8_t *s2); /* I - String2 */
1177 * Compare UTF-32 strings after case folding and NFKC normalization
1179 extern int cupsUtf32CompareIdentifier(const utf32_t *s1,
1181 const utf32_t *s2); /* I - String2 */
1184 * Get UTF-32 character property
1186 extern int cupsUtf32CharacterProperty(const utf32_t ch,
1187 /* I - Source char */
1188 const cups_property_t property);
1189 /* I - Char Property */
1193 # endif /* __cplusplus */
1195 #endif /* !_CUPS_NORMALIZE_H_ */
1203 * End of "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
1208 3.2.1.1. cups_normmap_t - Normalize Map Structure
typedef struct cups_normmap_str /**** Normalize Map Cache Struct ****/
{
  struct cups_normmap_str *next;  /* Next normalize in cache */
  int              used;          /* Number of times entry used */
  cups_normalize_t normalize;     /* Normalization type */
  int              normcount;     /* Count of Source Chars */
  ucs2_t           *uni2norm;     /* Char -> Normalization */
                                  /* ...only supports UCS-2 */
} cups_normmap_t;
1220 'uni2norm' is a pointer to an array of _triplets_ of UCS-2 values.
1221 'normcount' is a count of _triplets_ in the 'uni2norm[]' array.
1223 For decompositions (NFD and NFKD), the triplets are: composed base
1224 character, decomposed base character, and decomposed accent character.
1225 These are used by 'cupsUtf8Normalize()' and 'cupsUtf32Normalize()' in
1226 performing canonical (NFD) or compatibility (NFKD) decomposition.
1228 For compositions (NFC and NFKC), the triplets are: decomposed base
1229 character, decomposed accent character, and composed base character.
1230 These are used by 'cupsUtf8Normalize()' and 'cupsUtf32Normalize()' in
1231 performing canonical composition (for NFC or NFKC).
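As an illustration only, the decomposition lookup described above can be
sketched with 'bsearch()' over an array of triplets. The table contents,
the 'sample_' names, and the definition of 'ucs2_t' as an unsigned
16-bit type are assumptions made for this example and are not taken from
'normalize.c'.

```c
#include <stdlib.h>

typedef unsigned short ucs2_t;          /* assumed: 16-bit UCS-2 value */

/*
 * Hypothetical sample data: {composed, base, accent} triplets, sorted by
 * the composed character so that 'bsearch()' can be used.
 */
static const ucs2_t sample_uni2norm[] =
{
  0x00C0, 0x0041, 0x0300,               /* A-grave -> A + combining grave */
  0x00C9, 0x0045, 0x0301,               /* E-acute -> E + combining acute */
  0x00E9, 0x0065, 0x0301                /* e-acute -> e + combining acute */
};
static const int sample_normcount = 3;  /* count of triplets, not values */

/* Compare the search key against the first value of one triplet. */
static int
compare_decompose(const void *k, const void *t)
{
  ucs2_t key   = *(const ucs2_t *)k;
  ucs2_t first = *(const ucs2_t *)t;

  return ((key > first) - (key < first));
}

/* Return the matching triplet, or NULL if 'ch' does not decompose. */
const ucs2_t *
sample_decompose(ucs2_t ch)
{
  return ((const ucs2_t *)bsearch(&ch, sample_uni2norm,
                                  (size_t)sample_normcount,
                                  3 * sizeof(ucs2_t), compare_decompose));
}
```

A composition table would use the same layout with the roles reversed
(base and accent as the two-value key, composed character as the
result).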
1235 3.2.1.2. cups_foldmap_t - Case Fold Map Structure
typedef struct cups_foldmap_str /**** Case Fold Map Cache Struct ****/
{
  struct cups_foldmap_str *next;  /* Next case fold in cache */
  int              used;          /* Number of times entry used */
  cups_folding_t   fold;          /* Case folding type */
  int              foldcount;     /* Count of Source Chars */
  ucs2_t           *uni2fold;     /* Char -> Folded Char(s) */
                                  /* ...only supports UCS-2 */
} cups_foldmap_t;
1246 'uni2fold' is a pointer to an array of _quadruplets_ of UCS-2 values.
1247 'foldcount' is a count of _quadruplets_ in the 'uni2fold[]' array.
1249 For simple case folding (without expansion of the size of the output
1250 string), the quadruplets are: input base character, output case folded
1251 character, zero (unused), and zero (unused).
For full case folding (with possible expansion of the size of the output
string), the quadruplets are: input base character, output case folded
character, second output character or zero, third output character or
zero.
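The quadruplet layout can be illustrated with a small lookup sketch.
The sample table, the 'sample_' names, and the rule that an unmapped
character folds to itself are assumptions made for this example; only
the four-value layout comes from the description above.

```c
#include <stdlib.h>

typedef unsigned short ucs2_t;          /* assumed: 16-bit UCS-2 value */

/*
 * Hypothetical full case folding data, sorted by input character:
 * {input, output 1, output 2 or zero, output 3 or zero}.
 */
static const ucs2_t sample_uni2fold[] =
{
  0x0041, 0x0061, 0x0000, 0x0000,       /* 'A' -> 'a' */
  0x00DF, 0x0073, 0x0073, 0x0000,       /* sharp s -> "ss" */
  0x0130, 0x0069, 0x0307, 0x0000        /* I with dot -> 'i' + dot above */
};
static const int sample_foldcount = 3;  /* count of quadruplets */

/* Compare the search key against the first value of one quadruplet. */
static int
compare_foldchar(const void *k, const void *q)
{
  ucs2_t key   = *(const ucs2_t *)k;
  ucs2_t first = *(const ucs2_t *)q;

  return ((key > first) - (key < first));
}

/* Fold one character into 'out[]'; return the count of output chars. */
int
sample_fold(ucs2_t ch, ucs2_t out[3])
{
  const ucs2_t *q;
  int          n;

  q = (const ucs2_t *)bsearch(&ch, sample_uni2fold,
                              (size_t)sample_foldcount,
                              4 * sizeof(ucs2_t), compare_foldchar);
  if (q == NULL)
  {
    out[0] = ch;                        /* assumption: no entry, no change */
    return (1);
  }

  for (n = 0; n < 3 && q[n + 1] != 0; n ++)
    out[n] = q[n + 1];

  return (n);
}
```

Because one input character may yield up to three output characters, a
caller folding a whole string has to check its remaining output space
before each call.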
1266 3.2.1.3. cups_propmap_t - Char Property Map Structure
typedef struct                  /**** Char Property Map Struct ****/
{
  int              used;          /* Number of times entry used */
  int              propcount;     /* Count of Source Chars */
  cups_prop_t      *uni2prop;     /* Char -> Properties */
} cups_propmap_t;
1275 'uni2prop' is a pointer to an array of 'cups_prop_t' (see below).
1276 'propcount' is a count of elements in the 'uni2prop[]' array.
1280 3.2.1.4. cups_prop_t - Char Property Structure
typedef struct cups_prop_str    /**** Char Property Struct ****/
{
  ucs2_t           ch;            /* Unicode Char as UCS-2 */
  unsigned char    gencat;        /* General Category */
  unsigned char    bidicat;       /* Bidirectional Category */
} cups_prop_t;
1291 3.2.1.5. cups_breakmap_t - Line Break Map Structure
typedef struct                  /**** Line Break Class Map Struct ****/
{
  int              used;          /* Number of times entry used */
  int              breakcount;    /* Count of Source Chars */
  ucs2_t           *uni2break;    /* Char -> Line Break Class */
} cups_breakmap_t;
1300 'uni2break' is a pointer to an array of _triplets_ of UCS-2 values.
1301 'breakcount' is a count of _triplets_ in the 'uni2break[]' array.
1303 The triplets in 'uni2break' are: first UCS-2 value in a range, last
1304 UCS-2 value in a range, and line break class stored as UCS-2.
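A range lookup over such triplets might look like the sketch below. The
reduced class list, the sample ranges, and the 'sample_' and
'compare_breakrange()' names are hypothetical; the sketch only shows how
a sparse list of class ranges can be searched with 'bsearch()'.

```c
#include <stdlib.h>

typedef unsigned short ucs2_t;          /* assumed: 16-bit UCS-2 value */

enum                                    /* reduced, hypothetical class list */
{
  SAMPLE_BREAK_AL,
  SAMPLE_BREAK_LF,
  SAMPLE_BREAK_NU,
  SAMPLE_BREAK_SP,
  SAMPLE_BREAK_XX
};

/* {first, last, class} triplets, sorted and non-overlapping. */
static const ucs2_t sample_uni2break[] =
{
  0x000A, 0x000A, SAMPLE_BREAK_LF,      /* Line Feed */
  0x0020, 0x0020, SAMPLE_BREAK_SP,      /* Space */
  0x0030, 0x0039, SAMPLE_BREAK_NU,      /* Digits 0-9 */
  0x0041, 0x005A, SAMPLE_BREAK_AL       /* Letters A-Z */
};
static const int sample_breakcount = 4; /* count of triplets */

/* Compare the search key against the [first, last] range of a triplet. */
static int
compare_breakrange(const void *k, const void *t)
{
  ucs2_t       key    = *(const ucs2_t *)k;
  const ucs2_t *range = (const ucs2_t *)t;

  if (key < range[0])
    return (-1);
  if (key > range[1])
    return (1);
  return (0);
}

/* Return the line break class of 'ch', or XX (unknown) if not listed. */
int
sample_break_class(ucs2_t ch)
{
  const ucs2_t *t;

  t = (const ucs2_t *)bsearch(&ch, sample_uni2break,
                              (size_t)sample_breakcount,
                              3 * sizeof(ucs2_t), compare_breakrange);

  return (t != NULL ? (int)t[2] : SAMPLE_BREAK_XX);
}
```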
1317 3.2.1.6. cups_combmap_t - Combining Class Map Structure
typedef struct                  /**** Combining Class Map Struct ****/
{
  int              used;          /* Number of times entry used */
  int              combcount;     /* Count of Source Chars */
  cups_comb_t      *uni2comb;     /* Char -> Combining Class */
} cups_combmap_t;
1326 'uni2comb' is a pointer to an array of 'cups_comb_t' (see below).
1327 'combcount' is a count of elements in the 'uni2comb[]' array.
1331 3.2.1.7. cups_comb_t - Combining Class Structure
typedef struct cups_comb_str    /**** Char Combining Class Struct ****/
{
  ucs2_t           ch;            /* Unicode Char as UCS-2 */
  unsigned char    combclass;     /* Combining Class */
  unsigned char    reserved;      /* Reserved for alignment */
} cups_comb_t;
1342 3.2.2. normalize.c - Normalization module
1344 The normalization function 'cupsUtf8Normalize()' and the case folding
1345 function 'cupsUtf8CaseFold()' are modelled on the C standard library
1346 function 'strncpy()', except that they return the count of the output,
1347 like 'strlen()', rather than the (redundant) pointer to the output.
1349 If the normalization or case folding functions detect invalid input
1350 parameters or they detect an encoding error in their input, then they
1351 return '-1', rather than the count of output.
1353 The normalization and case folding functions take an input parameter
1354 indicating the maximum output units (for safe operation).
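The calling convention described above can be sketched as follows. This
is a minimal illustration, not code from 'normalize.c': the function
'sample_utf32_copy()', the definition of 'utf32_t' as 'unsigned long',
and the choice to truncate (and always terminate) when 'maxout' is
reached are assumptions made for the example.

```c
#include <stddef.h>

typedef unsigned long utf32_t;          /* assumed: at least 32 bits wide */

/*
 * Copy a zero-terminated UTF-32 string, writing at most 'maxout' units
 * (including the terminator).  Returns the count of characters written
 * (like 'strlen()'), or -1 on invalid parameters or invalid input.
 */
int
sample_utf32_copy(utf32_t *dest, const utf32_t *src, const int maxout)
{
  int count;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);                        /* invalid input parameters */

  for (count = 0; src[count] != 0 && count < maxout - 1; count ++)
  {
    if (src[count] > 0x10FFFF ||
        (src[count] >= 0xD800 && src[count] <= 0xDFFF))
      return (-1);                      /* encoding error in input */

    dest[count] = src[count];
  }

  dest[count] = 0;                      /* assumption: always terminate */
  return (count);
}
```

A caller therefore tests for '-1' before using the output, and treats
any other return value as a character count rather than a size in
bytes.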
1358 3.2.2.1. cupsUtf8Normalize()
1361 * Normalize UTF-8 string to Unicode UAX-15 Normalization Form
1362 * Note - Compatibility Normalization Forms (NFKD/NFKC) are
1363 * unsafe for subsequent transcoding to legacy charsets
1365 extern int cupsUtf8Normalize(utf8_t *dest, /* O - Target string */
1366 const utf8_t *src, /* I - Source string */
1372 const int maxout, /* I - Max output */
1373 const cups_normalize_t normalize);
1374 /* I - Normalization */
<Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
<Normalize by calling 'cupsUtf32Normalize()'>
<Convert normalized UCS-4 to UTF-8 by calling 'cupsUtf32ToUtf8()'>
<Return length of output UTF-8 string -- size in bytes>
1383 3.2.2.2. cupsUtf32Normalize()
1385 extern int cupsUtf32Normalize(utf32_t *dest,
1386 /* O - Target string */
1387 const utf32_t *src, /* I - Source string */
1388 const int maxout, /* I - Max output */
1389 const cups_normalize_t normalize);
1390 /* I - Normalization */
1392 <Find normalize maps by calling 'cupsNormalizeMapsGet()'>
1393 <...if not found, return '-1'>
1394 <Repeatedly traverse internal UCS-4, decomposing (NFD or NFKD)...>
1395 <...with 'bsearch()' of 'uni2norm[]' using local 'compare_decompose()'>
1396 <...until one pass yields no further decomposition>
1397 <Repeatedly traverse internal UCS-4, doing canonical reordering>
1398 <...with 'bsearch()' of 'uni2comb[]' using local 'compare_combchar()'>
1399 <...until one pass yields no further canonical reordering>
1400 <If 'normalize' requests composition (NFC or NFKC)...>
1401 <...repeatedly traverse internal UCS-4, composing (NFC or NFKC)...>
1402 <...with 'bsearch()' of 'uni2norm[]' using local 'compare_compose()'>
1403 <...until one pass yields no further composition>
1404 <Release normalize maps by calling 'cupsNormalizeMapsFree()'>
<Return count of output UTF-32 string -- NOT memory size in bytes>
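The canonical reordering step in the pseudo-code above can be sketched
as a repeated pass that swaps adjacent combining marks until a pass
makes no change. The 'sample_' names are hypothetical, and the small
'switch' stands in for the 'bsearch()' of 'uni2comb[]'; the combining
class values shown (230, 220, 202) are those assigned by Unicode to the
listed marks.

```c
typedef unsigned long utf32_t;          /* assumed: at least 32 bits wide */

/* Stand-in for the 'uni2comb[]' lookup: 0 = base, 1..254 = combining. */
static int
sample_combining_class(utf32_t ch)
{
  switch (ch)
  {
    case 0x0300 :                       /* Combining grave accent */
    case 0x0301 :                       /* Combining acute accent */
        return (230);
    case 0x0323 :                       /* Combining dot below */
        return (220);
    case 0x0327 :                       /* Combining cedilla */
        return (202);
    default :
        return (0);
  }
}

/*
 * Reorder combining marks in place; return the number of swaps made.
 * An adjacent pair is swapped only when the second char is a combining
 * mark with a lower class than the first, so base chars never move.
 */
int
sample_canonical_reorder(utf32_t *s, int len)
{
  int i, swapped, swaps = 0;

  do
  {
    swapped = 0;

    for (i = 0; i + 1 < len; i ++)
    {
      int a = sample_combining_class(s[i]);
      int b = sample_combining_class(s[i + 1]);

      if (b != 0 && a > b)
      {
        utf32_t temp = s[i];

        s[i]     = s[i + 1];
        s[i + 1] = temp;
        swapped  = 1;
        swaps ++;
      }
    }
  }
  while (swapped);                      /* until a pass changes nothing */

  return (swaps);
}
```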
1409 3.2.2.3. cupsUtf8CaseFold()
1412 * Case Fold UTF-8 string per Unicode UAX-21 Section 2.3
1413 * Note - Case folding output is
1414 * unsafe for subsequent transcoding to legacy charsets
1416 extern int cupsUtf8CaseFold(utf8_t *dest, /* O - Target string */
1417 const utf8_t *src, /* I - Source string */
1418 const int maxout, /* I - Max output */
1419 const cups_folding_t fold); /* I - Fold Mode */
1421 <Find normalize maps by calling 'cupsNormalizeMapsGet()'>
1422 <...if not found, return '-1'>
1423 <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
1429 <Case fold internal UCS-4 by calling 'cupsUtf32CaseFold()'>
<Convert internal UCS-4 to output UTF-8 by calling 'cupsUtf32ToUtf8()'>
<Release normalize maps by calling 'cupsNormalizeMapsFree()'>
<Return length of output UTF-8 string -- size in bytes>
1436 3.2.2.4. cupsUtf32CaseFold()
1439 * Case Fold UTF-32 string per Unicode UAX-21 Section 2.3
1440 * Note - Case folding output is
1441 * unsafe for subsequent transcoding to legacy charsets
extern int cupsUtf32CaseFold(utf32_t *dest,     /* O - Target string */
                const utf32_t *src,             /* I - Source string */
                const int maxout,               /* I - Max output */
                const cups_folding_t fold);     /* I - Fold Mode */
1447 <Find case fold maps by calling 'cupsNormalizeMapsGet()'>
1448 <...if not found, return '-1'>
1449 <Traverse internal UCS-4 once, performing case folding...>
1450 <...with 'bsearch()' of 'uni2fold[]' using local 'compare_foldchar()'>
1451 <Copy internal UCS-4 to output UTF-32 string>
1452 <Release normalize maps by calling 'cupsNormalizeMapsFree()'>
1453 <Return count of output UTF-32 string -- NOT memory size in bytes>
1457 3.2.2.5. cupsUtf8CompareCaseless()
1460 * Compare UTF-8 strings after case folding
1462 extern int cupsUtf8CompareCaseless(const utf8_t *s1,
1464 const utf8_t *s2); /* I - String2 */
1466 <Case fold both input UTF-8 strings by calling 'cupsUtf8CaseFold()'>
1467 <Return compare of case folded first and second strings>
1471 3.2.2.6. cupsUtf32CompareCaseless()
1474 * Compare UTF-32 strings after case folding
1476 extern int cupsUtf32CompareCaseless(const utf32_t *s1,
1478 const utf32_t *s2); /* I - String2 */
1480 <Case fold both input UTF-32 strings by calling 'cupsUtf32CaseFold()'>
1486 <Return compare of case folded first and second strings>
1490 3.2.2.7. cupsUtf8CompareIdentifier()
1493 * Compare UTF-8 strings after case folding and NFKC normalization
1495 extern int cupsUtf8CompareIdentifier(const utf8_t *s1,
1497 const utf8_t *s2); /* I - String2 */
1499 <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
1500 <Case fold both strings by calling 'cupsUtf32CaseFold()'>
1501 <Normalize both strings to NFKC by calling 'cupsUtf32Normalize()'>
1502 <Return compare of case folded/normalized first and second strings>
1506 3.2.2.8. cupsUtf32CompareIdentifier()
1509 * Compare UTF-32 strings after case folding and NFKC normalization
1511 extern int cupsUtf32CompareIdentifier(const utf32_t *s1,
1513 const utf32_t *s2); /* I - String2 */
1515 <Case fold both strings by calling 'cupsUtf32CaseFold()'>
1516 <Normalize both strings to NFKC by calling 'cupsUtf32Normalize()'>
1517 <Return compare of case folded/normalized first and second strings>
1521 3.2.2.9. cupsUtf32CharacterProperty()
1524 * Get UTF-32 character property
1526 extern int cupsUtf32CharacterProperty(const utf32_t ch,
1527 /* I - Source char */
1528 const cups_property_t property);
1529 /* I - Char Property */
<Lookup UTF-32 character property in appropriate map...>
<...internal functions for each different map lookup>
1545 3.2.2.10. Normalization Utility Functions
1550 3.2.2.10.1. cupsNormalizeMapsGet()
extern void cupsNormalizeMapsGet(void);
1554 <Find normalize maps in cache>
1555 <...If found, increment 'used'>
1556 <...and return void>
1557 <For each map (normalization, case fold, combining class, etc.)...>
1558 <Open (preprocessed form of) Unicode data file...>
1559 <...If not found, return void>
1560 <Count lines in preprocessed form, for mapping memory alloc>
1561 <...Close (preprocessed form of) Unicode data file>
1562 <Open (preprocessed form of) Unicode data file...>
1563 <...If not found, return void>
<Allocate memory for appropriate map in cache...>
1565 <...If no memory, return void>
1566 <Add to appropriate cache by assigning 'next' field>
1567 <Assign map type field and count field>
1568 <Increment 'used' field>
1569 <Read normalize map into memory in loop...>
1570 <...Add values to 'uni2xxx[]' array>
1571 <Close (preprocessed form of) Unicode data file>
1576 3.2.2.10.2. cupsNormalizeMapsFree()
1578 extern void cupsNormalizeMapsFree(void);
1580 <Find normalize maps in cache>
1581 <...If found, decrement 'used'>
1586 3.2.2.10.3. cupsNormalizeMapsFlush()
1588 extern void cupsNormalizeMapsFlush(void);
1590 <Loop through normalize maps cache...>
1591 <...Free 'uni2norm[]' memory>
1592 <...Free normalize map memory>
1593 <Loop through case folding cache...>
1594 <...Free 'uni2fold[]' memory>
1600 <...Free case folding memory>
1601 <Loop through char property map cache...>
1602 <...Free 'uni2prop[]' memory>
1603 <...Free char property map memory>
1604 <Loop through line break class map cache...>
1605 <...Free 'uni2break[]' memory>
1606 <...Free line break class map memory>
1607 <Loop through combining class map cache...>
1608 <...Free 'uni2comb[]' memory>
1609 <...Free combining class map memory>
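The get/free/flush life cycle of the map caches can be illustrated with
a reduced, single-cache sketch. Everything named 'sample_' is
hypothetical, the data file loading is replaced by a zero-filled
allocation, and the 'int' status returns follow the declarations in the
'normalize.h' listing; this is not the 'normalize.c' implementation.

```c
#include <stdlib.h>

typedef unsigned short ucs2_t;          /* assumed: 16-bit UCS-2 value */

typedef struct sample_map_str           /* reduced stand-in for a map */
{
  struct sample_map_str *next;          /* Next map in cache */
  int                   used;           /* Number of times entry used */
  int                   count;          /* Count of triplets */
  ucs2_t                *values;        /* Stand-in for 'uni2xxx[]' */
} sample_map_t;

static sample_map_t *sample_cache = NULL;

/* Find or load the map; return 0 on success, -1 on failure. */
int
sample_maps_get(void)
{
  sample_map_t *map = sample_cache;

  if (map != NULL)
  {
    map->used ++;                       /* found in cache */
    return (0);
  }

  if ((map = (sample_map_t *)calloc(1, sizeof(sample_map_t))) == NULL)
    return (-1);

  map->count  = 3;                      /* a loader would count file lines */
  map->values = (ucs2_t *)calloc((size_t)map->count * 3, sizeof(ucs2_t));
  if (map->values == NULL)
  {
    free(map);
    return (-1);
  }

  map->used    = 1;
  map->next    = sample_cache;          /* add to cache */
  sample_cache = map;
  return (0);
}

/* Release one use of the map; the memory stays cached until flush. */
int
sample_maps_free(void)
{
  if (sample_cache == NULL || sample_cache->used < 1)
    return (-1);

  sample_cache->used --;
  return (0);
}

/* Free all cached maps regardless of their 'used' counts. */
void
sample_maps_flush(void)
{
  sample_map_t *map, *next;

  for (map = sample_cache; map != NULL; map = next)
  {
    next = map->next;
    free(map->values);
    free(map);
  }

  sample_cache = NULL;
}

/* Report the current use count (0 if nothing is cached). */
int
sample_maps_used(void)
{
  return (sample_cache != NULL ? sample_cache->used : 0);
}
```

Keeping the maps in memory after the use count drops to zero avoids
re-reading the data files on every call; only the flush function
releases the memory.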
1614 3.3. Language - Existing
1618 3.3.1. language.h - Language header
1622 (1) Change definition of 'cups_lang_t' to correct length of 'language[]'
1623 to 32 characters per [RFC3066] and [ISO639-2] and [ISO3166-1].
1627 3.3.2. language.c - Language module
1631 3.3.2.1. cupsLangEncoding() - Existing
1637 3.3.2.2. cupsLangFlush() - Existing
1643 3.3.2.3. cupsLangFree() - Existing
1659 3.3.2.4. cupsLangGet() - Existing
1663 (1) Change length of 'langname[]' and 'real[]' to 64 characters per
1664 [RFC3066] and potential length of encoding (charset) names;
1665 (2) Change language string normalization to support:
1666 (a) 8-character language codes per [RFC3066] and 3-character
1667 language codes per [ISO639-2];
1668 (b) 8-character country codes per [RFC3066] and 3-character country
1669 codes per [ISO3166-1];
1670 (c) Support for 'i' (IANA registered) and 'x' (private) language
1671 prefixes per [RFC3066];
1672 (d) Invariant use of 'utf-8' for encoding in message catalog, but
1673 save actual requested encoding name for later use.
1674 (3) Correct broken do/while statement for message catalog lookup (while
1675 condition is _never_ satisfied).
1679 3.3.2.5. cupsLangPrintf() - New
1681 extern int cupsLangPrintf(FILE *fp, /* I - File to write */
1682 const cups_lang_t *lang, /* I - Language/locale*/
1683 const cups_msg_t msg, /* I - Msg to format */
1684 ...); /* I - Args to format */
1686 <Set up variable args by calling 'va_start()'>
1687 <Format CUPS message with variable args by calling 'vsnprintf()'>
1688 <Clean up variable args by calling 'va_end()'>
1689 <Transcode CUPS message by calling 'cupsUtf8ToCharset()'>
1690 <Write CUPS message by calling 'fputs()'>
1691 <Return transcoded output CUPS message length>
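The format, transcode, and write sequence above can be sketched as
follows. The sketch is an assumption-laden illustration: it takes a
plain format string instead of 'lang' and 'msg', and the hypothetical
'sample_transcode()' merely copies its input where the real function
would call 'cupsUtf8ToCharset()'.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for 'cupsUtf8ToCharset()': copies unchanged. */
static int
sample_transcode(char *dest, const char *src, int maxout)
{
  int len = (int)strlen(src);

  if (len >= maxout)
    len = maxout - 1;                   /* truncate to fit the buffer */

  memcpy(dest, src, (size_t)len);
  dest[len] = '\0';
  return (len);
}

/* Format, transcode, and write a message; return output length or -1. */
int
sample_lang_printf(FILE *fp, const char *format, ...)
{
  char    buffer[2048];                 /* Formatted UTF-8 message */
  char    output[2048];                 /* Transcoded message */
  va_list ap;
  int     bytes;

  va_start(ap, format);
  bytes = vsnprintf(buffer, sizeof(buffer), format, ap);
  va_end(ap);

  if (bytes < 0)
    return (-1);

  bytes = sample_transcode(output, buffer, (int)sizeof(output));

  if (fputs(output, fp) == EOF)
    return (-1);

  return (bytes);
}
```

Formatting before transcoding keeps the message catalog in UTF-8, so
only the final string is converted to the requested charset.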
1695 3.3.2.6. cupsLangPuts() - New
1697 extern int cupsLangPuts(FILE *fp, /* I - File to write */
1698 const cups_lang_t *lang, /* I - Language/locale*/
1699 const cups_msg_t msg); /* I - Msg to write */
1701 <Transcode CUPS message by calling 'cupsUtf8ToCharset()'>
1702 <Write CUPS message by calling 'fputs()'>
1703 <Return transcoded output CUPS message length>
1716 3.3.2.7. cupsEncodingName() - New
1718 extern char *cupsEncodingName(cups_encoding_t encoding);
1720 <Lookup encoding name in static 'lang_encodings[]' array>
1721 <Return pointer to encoding name (charset map file name)>
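A static table lookup of this kind might be sketched as below. The
three-entry enumeration, the name strings, and the 'NULL' return for an
out-of-range value are assumptions for illustration; the actual
contents of 'lang_encodings[]' are not given in this document.

```c
#include <stddef.h>

typedef enum                            /* hypothetical, reduced list */
{
  SAMPLE_US_ASCII,
  SAMPLE_ISO8859_1,
  SAMPLE_UTF8,
  SAMPLE_ENCODING_MAX                   /* count of encodings above */
} sample_encoding_t;

static const char * const sample_lang_encodings[] =
{
  "us-ascii",
  "iso-8859-1",
  "utf-8"
};

/* Return the encoding name, or NULL if 'encoding' is out of range. */
const char *
sample_encoding_name(sample_encoding_t encoding)
{
  if ((int)encoding < 0 || (int)encoding >= (int)SAMPLE_ENCODING_MAX)
    return (NULL);

  return (sample_lang_encodings[encoding]);
}
```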
1725 3.4. Common Text Filter - Existing
1729 3.4.1. textcommon.h - Common text filter header
1733 (1) Revise 'lchar_t' as specified below, adding 'attrx' bit-mask for
1734 selected Unicode character properties;
1735 (2) Revise 'lchar_t' as specified below, adding 'comblen' and 'combch[]'
1736 for Unicode combining/attached chars (accents);
1737 (3) Add 'COMBLEN_MAX' limit as specified below;
(4) Add 'ATTRX_...' selected Unicode character properties as specified
below.
1743 3.4.1.1. lchar_t - Character/Attribute Structure
typedef struct lchar_str        /**** Character / Attribute Structure ****/
{
  unsigned short ch;            /* Unicode Char as UCS-2 */
                                /* or 8/16-bit Legacy Char */
  unsigned short attr;          /* Attributes of Char */
  unsigned short attrx;         /* Extended Attributes */
  unsigned short comblen;       /* Combining Char Count */
  unsigned short combch[8];     /* Combining Chars as UCS-2 */
} lchar_t;
1755 'ch' is a 16-bit UCS-2 character or a 8/16-bit legacy char. 'attr' is
1756 the character attributes defined for the existing 'lchar_t' structure
1757 (defined in 'textcommon.h'). 'attrx' is the extended character
1758 attributes defined for future selected Unicode character properties (see
1759 below). 'comblen' is the number of attached/combining characters.
1760 'combch' is an array of 16-bit UCS-2 attached/combining characters.
1762 Add to 'textcommon.h' constants:
1772 ATTRX_RIGHT2LEFT 0x0001
1776 3.4.2. textcommon.c - Common text filter
1780 (1) Revise 'TextMain()' function as described below.
1784 3.4.2.1. TextMain() - Existing
1788 [Ed Note: Pseudo code below needs more work on bidi handling.]
1790 (1) In main loop at the _beginning_ of the 'default' clause, add the
1791 following code for combining marks:
1797 * Check for Unicode combining mark (accent)
1799 if (UTF-8 && cupsUtf32CombiningClass(ch) > 0)
1803 * Save Unicode combining mark in SAME character
if (cp->comblen >= COMBLEN_MAX)
1807 cp->combch[cp->comblen] = ch;
(2) In main loop _after_ combining chars section in 'default' clause,
add the following code for Unicode bidi control characters:
1814 cups_bidicat_t bidicat;
1817 * Check for Unicode bidi control character
1821 bidicat = (cups_bidicat_t)
1822 cupsUtf32CharacterProperty(ch, CUPS_PROP_BIDI_CATEGORY);
if ((bidicat == CUPS_BIDI_LRE)          /* Left-to-Right Embedding */
    || (bidicat == CUPS_BIDI_LRO)       /* Left-to-Right Override */
    || (bidicat == CUPS_BIDI_RLE)       /* Right-to-Left Embedding */
    || (bidicat == CUPS_BIDI_RLO)       /* Right-to-Left Override */
    || (bidicat == CUPS_BIDI_PDF))      /* Pop Directional Format */
{
  /* Do bidi stuff here with memory for NEXT char's direction */
  /* Discard bidi control character and break */
}
if ((bidicat == CUPS_BIDI_R)            /* Right-to-Left Hebrew */
    || (bidicat == CUPS_BIDI_AL))       /* Right-to-Left Arabic */
{
  /* Set attrx for right-to-left */
  cp->attrx |= ATTRX_RIGHT2LEFT;
}
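The two 'TextMain()' additions can be tried in isolation with the sketch
below. The 'sample_' names are hypothetical, and the value 8 for the
combining limit is an assumption chosen to match the 'combch[8]' array;
the bounds check refuses to store a mark once the array is full.

```c
#include <stddef.h>

#define SAMPLE_COMBLEN_MAX      8       /* assumed to match 'combch[8]' */
#define SAMPLE_ATTRX_RIGHT2LEFT 0x0001

typedef struct                          /* stand-in for 'lchar_t' */
{
  unsigned short ch;                    /* Unicode Char as UCS-2 */
  unsigned short attr;                  /* Attributes of Char */
  unsigned short attrx;                 /* Extended Attributes */
  unsigned short comblen;               /* Combining Char Count */
  unsigned short combch[SAMPLE_COMBLEN_MAX];
                                        /* Combining Chars as UCS-2 */
} sample_lchar_t;

/*
 * Save a combining mark in the SAME character cell.  Returns the new
 * count of combining marks, or -1 if the cell is already full.
 */
int
sample_add_combining(sample_lchar_t *cp, unsigned short ch)
{
  if (cp == NULL || cp->comblen >= SAMPLE_COMBLEN_MAX)
    return (-1);

  cp->combch[cp->comblen] = ch;
  cp->comblen ++;

  return ((int)cp->comblen);
}

/* Flag a character cell as right-to-left. */
void
sample_mark_right2left(sample_lchar_t *cp)
{
  cp->attrx |= SAMPLE_ATTRX_RIGHT2LEFT;
}
```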
1847 3.4.2.2. compare_keywords() - Existing
1853 3.4.2.3. getutf8() - Existing
1857 [Ed Note: Future - allow 20-bit UTF-32 code points - requires updates
1858 in both 'textcommon.c' and 'texttops.c' for extended PostScript.]
1862 3.5. Text to PostScript Filter - Existing
1866 3.5.1. texttops.c - Text to PostScript filter
1870 (1) Revise local 'write_string()' function as described below.
1874 3.5.1.1. main() - Existing
1887 3.5.1.2. WriteEpilogue () - Existing
1893 3.5.1.3. WritePage () - Existing
1899 3.5.1.4. WriteProlog () - Existing
1905 3.5.1.5. write_line() - Existing
1911 3.5.1.6. write_string() - Existing
1915 (1) At the _beginning_ of Multiple Fonts section, _replace_ the while()
1916 loop and surrounding 'putchar()' calls with the following code:
1918 for (; len > 0; len --, s ++)
1920 utf32_t decstr[COMBLEN_MAX * 2];
1921 utf32_t cmpstr[COMBLEN_MAX * 2];
1925 if (s->comblen == 0)
1927 printf("<%04x>", Chars[s->ch]);
1932 * Normalize decomposed Unicode character to NFKC
1933 * (compatibility decomposition, then canonical composition)
1935 decstr[0] = (utf32_t) s->ch;
1936 for (i = 0; i < s->comblen; i ++)
1942 decstr[i + 1] = (utf32_t) s->combch[i];
1944 cmplen = cupsUtf32Normalize (&cmpstr[0],
1945 &decstr[0], COMBLEN_MAX * 2, CUPS_NORM_NFKC);
/*
 * Write combining chars, then composed base, to same location
 */
for (i = 1; i < cmplen; i ++)
{
  printf("<%04x>", Chars[(int) cmpstr[i]]);
  /*
   * Superimpose glyphs by backing up one column width
   */
  printf (" -%.3f ", (72.0f / (float) CharsPerInch));
}
printf("<%04x>", Chars[(int) cmpstr[0]]);
[Ed Note: Future - Bidi support - When writing Unicode characters
(checking for explicit bidi) convert input string (lchar_t) to display
order.]
1969 3.5.1.7. write_text() - Existing
2005 Abstract Character: A unit of information used for the organization,
2006 control, or representation of textual data.
2008 Accent Mark: A mark placed above, below, or to the side of a character
2009 to alter its phonetic value (also 'diacritic').
2011 Alphabet: A collection of symbols that, in the context of a particular
2012 written language, represent the sounds of that language.
Base Character: A character that does not graphically combine with
preceding characters, and that is neither a control nor a format
character.
2018 Basic Multilingual Plane: The Unicode (or UCS) code values 0x0000
2019 through 0xFFFF, specified by [ISO10646] (also 'Plane 0').
2021 BIDI: Abbreviation for Bidirectional, in reference to mixed
2022 left-to-right and right-to-left text.
2024 Bidirectional Display: The process or result of mixing left-to-right
2025 oriented text and right-to-left oriented text in a single line.
2027 Big-endian: A computer architecture that stores multiple-byte numerical
2028 values with the most significant byte (MSB) values first.
2030 BMP: Abbreviation for Basic Multilingual Plane.
2032 BOM: Acronym for byte order mark (also 'ZWNBSP').
2034 Byte Order Mark: The Unicode character U+FEFF Zero Width No-Break Space
2035 (ZWNBSP) when used to indicate the byte order of text.
Canonical: (1) Conforming to the general rules for encoding -- that is,
not compressed, compacted, or in any other form specified by a higher
protocol. (2) Characteristic of a normative mapping and form of
equivalence.
2042 Canonical Decomposition: The decomposition of a character that results
2043 from recursively applying the canonical mappings defined in the Unicode
2044 Character Database until no characters can be further decomposed, then
2045 reordering nonspacing marks according to section 3.10 of [UNICODE3.2].
2047 Canonical Equivalent: Two characters are canonical equivalents if their
2048 full canonical decompositions are identical.
Case: (1) Feature of certain alphabets where the letters have two
distinct forms. These variants are called the 'uppercase' letter (also
known as 'capital' or 'majuscule') and the 'lowercase' letter (also
known as 'small' or 'minuscule'). (2) Normative property of Unicode
characters, consisting of uppercase, lowercase, and titlecase.
2063 Character: (1) The smallest component of written language that has
2064 semantic value; refers to the abstract meaning and/or shape, rather than
2065 a specific shape (see also 'glyph'). (2) Synonym for 'abstract
2066 character'. (3) The basic unit of encoding for the Unicode character
2067 encoding. (4) The English name for the ideographic written elements of
2068 Chinese origin (see 'ideograph').
2070 Character Encoding Form (CEF): Mapping from a character set definition
2071 to the actual bits used to represent the data.
Character Encoding Scheme (CES): A 'character encoding form' plus byte
serialization. [UNICODE3.2] defines seven character encoding schemes:
UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
2077 Character Properties: A set of property names and property values
2078 associated with individual characters defined in [UNICODE3.2].
2080 Character Repertoire: (1) The collection of characters included in a
2081 character set. (2) The SUBSET of characters included in a large
2082 character set, e.g., [UNICODE3.2], that are necessary to support a
2083 complete mapping to another smaller character set, e.g., ISO8859-1 (also
Character Set: A collection of elements used to represent textual
information.
2089 Coded Character Set: A character set in which each character is
2090 assigned a numeric code value. Frequently abbreviated as 'character
2091 set', 'charset', or 'code set'.
2093 Code Point: (1) A numerical index (or position) in an encoding table
2094 used for encoding characters. (2) Synonym for 'Unicode scalar value'.
2096 Collation: The process of ordering units of textual information.
2097 Collation is usually specific to a particular language. Also known as
2098 'alphabetizing' or 'alphabetic sorting'.
2100 Combining Character: A character that graphically combines with a
2101 preceding 'base character'. The combining character is said to 'apply'
2102 to that base character. (See also 'nonspacing mark'.)
Compatibility: (1) Consistency with existing practice or preexisting
character encoding standards. (2) Characteristic of a normative
mapping and form of equivalence (see 'compatibility decomposition').
Compatibility Character: A character that has a compatibility
decomposition.
2119 Compatibility Decomposition: The decomposition of a character that
2120 results from recursively applying BOTH the compatibility mappings AND
2121 the canonical mappings found in the Unicode Character Database until no
2122 characters can be further decomposed, then reordering nonspacing marks
2123 according to section 3.10 of [UNICODE3.2].
2125 Compatibility Equivalent: Two characters are compatibility equivalents
2126 if their full compatibility decompositions are identical.
Composed Character: (See 'decomposable character'.)
2130 DBCS: Acronym for 'double-byte character set'.
2132 Decomposable Character: A character that is equivalent to a sequence of
2133 one or more other characters, according to the decomposition mappings
2134 found in [UNICODE3.2]. It may also be known as a 'precomposed
2135 character' or a 'composite character'.
2137 Decomposition: (1) The process of separating or analyzing a text
2138 element into component units. (2) A sequence of one or more characters
2139 that is equivalent to a 'decomposable character'.
2141 Diacritic: (See 'accent mark'.)
2143 Double-Byte Character Set (DBCS): One of a number of character sets
2144 defined for representing Chinese, Japanese, or Korean text (for example,
2145 JIS X 0208-1990). These character sets are often encoded in such a way
2146 as to allow double-byte character encodings to be mixed with single-byte
2147 character encodings. (See also 'multiple-byte character set'.)
2149 Font: A collection of glyphs used for visual depiction of character data.
2152 FSS-UTF: Abbreviation for 'File System Safe UCS Transformation Format',
2153 originally published by X/Open. Now called 'UTF-8'.
2155 Fullwidth: Characters of East Asian character sets whose glyph image
2156 extends across the entire character display cell. In legacy character
2157 sets, fullwidth characters are normally encoded in two or three bytes.
2159 Glyph: (1) An abstract form that represents one or more glyph images.
2160 (2) A synonym for 'glyph image'.
2162 Glyph Image: The actual, concrete image of a glyph representation
2163 having been rasterized or otherwise imaged onto some display surface.
2173 Halfwidth: Characters of East Asian character sets whose glyph image
2174 occupies half of the character display cell. In legacy character sets,
2175 halfwidth characters are normally encoded in a single byte.
2177 Han Characters: Ideographic characters of Chinese origin.
2179 Hangul: The name of the script used to write the Korean language.
2181 High-Surrogate: A Unicode code value in the range U+D800 to U+DBFF.
2183 Hiragana: One of two standard syllabaries associated with the Japanese
2184 writing system. Used to write particles, grammatical affixes, and words
2185 that have no 'kanji' form.
2187 IANA: Internet Assigned Numbers Authority.
2189 Ideograph: (1) Any symbol that denotes an idea (or meaning) in contrast
2190 to a sound or pronunciation (for example, a 'smiley face'). (2) A
2191 common term used to refer to Han characters.
2193 IPA: International Phonetic Alphabet.
2195 IRG: Abbreviation for Ideographic Rapporteur Group, a subgroup of
2196 ISO/IEC JTC1/SC2/WG2 (who work on Han unification and submission of new
2197 Han characters for inclusion in revised versions of Unicode/ISO 10646).
2199 Jamo: The Korean name for a single letter of the Hangul script. Jamos
2200 are used to form Hangul syllables.
2202 Joiner: An invisible character that affects the joining behavior of
2203 surrounding characters.
2205 JTC1: Abbreviation for Joint Technical Committee 1 of ISO/IEC,
2206 responsible for information technology standardization.
2208 Kana: The name of a primarily syllabic script used by the Japanese
2209 writing system, composed of 'hiragana' and 'katakana'.
2211 Kanji: The Japanese name for Han characters; derived from the Chinese
2212 word 'hanzi'. Also romanized as 'kanzi'.
2214 Katakana: One of two standard syllabaries associated with the Japanese
2215 writing system, typically used in representation of borrowed vocabulary.
2217 Ligature: A glyph representing a combination of two or more characters,
2218 for example, in the Latin script, the ligature between 'f' and 'i' (encoded as U+FB01 LATIN SMALL LIGATURE FI).
2221 Logical Order: The order in which text is typed on a keyboard. For the
2229 most part, logical order corresponds to phonetic order.
2231 Lowercase: (See 'case'.)
2233 Low-Surrogate: A Unicode code value in the range U+DC00 to U+DFFF.
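As an illustration of how the two surrogate ranges (U+D800 to U+DBFF and U+DC00 to U+DFFF) work together, the following is a minimal sketch in C that combines a high-surrogate and a low-surrogate into a single scalar value. The helper names are hypothetical and are not part of the CUPS API; returning U+FFFD for an invalid pair is an assumption made for this example only.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not CUPS functions. */

/* Returns 1 if 'u' lies in the high-surrogate range U+D800..U+DBFF. */
static int is_high_surrogate(uint16_t u)
{
  return (u >= 0xD800 && u <= 0xDBFF);
}

/* Returns 1 if 'u' lies in the low-surrogate range U+DC00..U+DFFF. */
static int is_low_surrogate(uint16_t u)
{
  return (u >= 0xDC00 && u <= 0xDFFF);
}

/*
 * Combine a surrogate pair into a scalar value in 0x10000..0x10FFFF.
 * Returns 0xFFFD (REPLACEMENT CHARACTER) if the pair is not valid;
 * that error convention is an assumption of this sketch.
 */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
  if (!is_high_surrogate(hi) || !is_low_surrogate(lo))
    return (0xFFFD);

  /* Each surrogate carries 10 bits of the value minus 0x10000. */
  return (0x10000 +
          (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00)));
}
```

For example, the pair 0xD83D, 0xDE00 combines to the scalar value 0x1F600.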
2235 MBCS: Acronym for 'multiple-byte character set'.
2237 Multiple-Byte Character Set (MBCS): A character set encoded with a
2238 variable number of bytes per character. Many large character sets have
2239 been defined as MBCS so as to keep strict compatibility with the
2240 US-ASCII subset and/or [ISO2022].
2242 Normalization: Transformation of data to a normal form.
2244 Plain Text: Computer-encoded text that consists ONLY of a sequence of
2245 code values from a given standard, with no other formatting or
2246 structural information.
2248 Precomposed Character: (See 'decomposable character'.)
2250 Rendering: (1) The process of selecting and laying out glyphs for the
2251 purpose of depicting characters. (2) The process of making glyphs
2252 visible on a display device.
2254 Repertoire: (See 'character repertoire'.)
2256 Replacement Character: A character used as a substitute for an
2257 uninterpretable character from another encoding. [UNICODE3.2] defines
2258 U+FFFD REPLACEMENT CHARACTER for this function.
2260 Rich Text: The result of adding information such as font data, color,
2261 formatting, phonetic annotations, etc. to 'plain text' (e.g., HTML).
2263 SBCS: Acronym for 'single-byte character set'.
2265 Scalar Value: (See 'Unicode scalar value'.)
2267 Script: A collection of symbols used to represent textual information
2268 in one or more writing systems.
2270 Single-Byte Character Set (SBCS): One of a number of one-byte character
2271 sets defined for representing (mostly) Western languages (for example,
2272 ISO 8859-1 'Latin-1'). These character sets are often encoded in such a
2273 way as to be strict supersets of 7-bit [US-ASCII].
2275 Sorting: (See 'collation'.)
2277 Transcoding: Conversion of character data between different character sets.
2287 Transformation Format: A mapping from a coded character sequence to a
2288 unique sequence of code values (typically octets).
2290 UCS: Abbreviation for Universal Character Set, specified by [ISO10646].
2292 UCS-2: UCS encoded in 2 octets, specified by [ISO10646].
2294 UCS-4: UCS encoded in 4 octets, specified by [ISO10646].
2296 Unicode Scalar Value: A number in the range 0 to 0x10FFFF.
2298 Uppercase: (See 'case'.)
2300 UTF: Abbreviation for Unicode (or UCS) Transformation Format.
2302 UTF-8: Unicode (or UCS) Transformation Format, 8-bit encoding form.
2303 Serializes a Unicode (or UCS) scalar value (code point) as a sequence of
2304 one to four octets. Does NOT suffer from byte-ordering ambiguities.
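The one-to-four octet serialization described above can be sketched in C as follows. This is an illustrative example only: the function name 'utf8_encode_scalar' is hypothetical (it is not one of the transcoding functions specified in this document), and returning 0 for surrogate values or values above 0x10FFFF is an assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one scalar value as UTF-8 into 'out' (room for 4 octets).
 * Returns the number of octets written (1 to 4), or 0 if 'cp' is a
 * surrogate value or is greater than 0x10FFFF.
 * Hypothetical helper for illustration; not a CUPS function.
 */
static size_t utf8_encode_scalar(uint32_t cp, unsigned char out[4])
{
  if (cp <= 0x7F)
  {
    out[0] = (unsigned char)cp;                    /* 0xxxxxxx */
    return (1);
  }
  if (cp <= 0x7FF)
  {
    out[0] = (unsigned char)(0xC0 | (cp >> 6));    /* 110xxxxx */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));  /* 10xxxxxx */
    return (2);
  }
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return (0);                                    /* surrogates are not encoded */
  if (cp <= 0xFFFF)
  {
    out[0] = (unsigned char)(0xE0 | (cp >> 12));   /* 1110xxxx */
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return (3);
  }
  if (cp <= 0x10FFFF)
  {
    out[0] = (unsigned char)(0xF0 | (cp >> 18));   /* 11110xxx */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return (4);
  }
  return (0);                                      /* out of range */
}
```

Because every octet is written individually, the result is the same on big-endian and little-endian machines, which is why UTF-8 needs no byte-order mark.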
2306 UTF-16: Unicode (or UCS) Transformation Format, 16-bit encoding form.
2307 Serializes a Unicode (or UCS) scalar value (code point) as a sequence of
2308 two or four octets (one 16-bit code unit, or a surrogate pair for values
2309 above U+FFFF), in either big-endian or little-endian format. Uses an
optional BOM (byte order mark) prefix to disambiguate byte-ordering.
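A minimal C sketch of UTF-16 serialization follows. It encodes one scalar value as one or two 16-bit code units (values above U+FFFF need a surrogate pair) and then stores a code unit as two octets in a chosen byte order. Both function names are hypothetical and are not part of the CUPS API; the return value of 0 for invalid input is an assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one scalar value as UTF-16 code units in 'out'.
 * Returns the number of code units written (1 or 2), or 0 if 'cp'
 * is a surrogate value or is greater than 0x10FFFF.
 * Hypothetical helper for illustration; not a CUPS function.
 */
static size_t utf16_encode_scalar(uint32_t cp, uint16_t out[2])
{
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return (0);                         /* lone surrogates are invalid */
  if (cp <= 0xFFFF)
  {
    out[0] = (uint16_t)cp;              /* single code unit */
    return (1);
  }
  if (cp <= 0x10FFFF)
  {
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high-surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low-surrogate */
    return (2);
  }
  return (0);
}

/* Store one code unit as two octets, big-endian if 'big_endian' != 0. */
static void utf16_put_unit(uint16_t unit, int big_endian,
                           unsigned char out[2])
{
  if (big_endian)
  {
    out[0] = (unsigned char)(unit >> 8);
    out[1] = (unsigned char)(unit & 0xFF);
  }
  else
  {
    out[0] = (unsigned char)(unit & 0xFF);
    out[1] = (unsigned char)(unit >> 8);
  }
}
```

Writing the BOM value 0xFEFF with utf16_put_unit() yields the octets FE FF in big-endian order and FF FE in little-endian order, which is how a reader can tell the two apart.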
2311 UTF-32: Unicode (or UCS) Transformation Format, 32-bit encoding form.
2312 Serializes a Unicode (or UCS) scalar value (code point) as a sequence of
2313 four octets, in either big-endian or little-endian format. Uses an
2314 optional BOM (byte order mark) prefix to disambiguate byte-ordering.
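The role of the BOM in UTF-32 can be sketched in C as follows. The example checks the first four octets of a buffer for the BOM (U+FEFF) in either byte order and then reads one four-octet value accordingly. The type and function names are hypothetical and are not part of the CUPS API; treating a missing BOM as "unknown" rather than picking a default order is an assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not part of the CUPS API. */
typedef enum
{
  ORDER_UNKNOWN,                        /* no BOM found */
  ORDER_BIG,                            /* 00 00 FE FF */
  ORDER_LITTLE                          /* FF FE 00 00 */
} byte_order_t;

/* Inspect the first four octets of 'buf' for a UTF-32 BOM. */
static byte_order_t utf32_detect_bom(const unsigned char *buf, size_t len)
{
  if (len < 4)
    return (ORDER_UNKNOWN);
  if (buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF)
    return (ORDER_BIG);
  if (buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00)
    return (ORDER_LITTLE);
  return (ORDER_UNKNOWN);
}

/*
 * Read one four-octet value at 'p'. Little-endian is used only when
 * 'order' is ORDER_LITTLE; any other value reads big-endian.
 */
static uint32_t utf32_read(const unsigned char *p, byte_order_t order)
{
  if (order == ORDER_LITTLE)
    return ((uint32_t)p[0] | ((uint32_t)p[1] << 8) |
            ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24));

  return (((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
          ((uint32_t)p[2] << 8) | (uint32_t)p[3]);
}
```

A caller would typically skip the four BOM octets after detection and read the remaining values with utf32_read().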
2316 Zero Width: Characteristic of some spaces or format control characters
2317 that do not advance text along the horizontal baseline.