WORKING DRAFT Ira McDonald High North Inc Common UNIX Printing System ("CUPS") Internationalization Software Design Description v0.3 Copyright (C) Easy Software Products (2002) - All Rights Reserved Status of this Document This document is an unapproved working draft and is incomplete in some sections (see 'Ed Note:' comments). Abstract This document provides general information and high-level design for the Internationalization extensions for the Common UNIX Printing System ("CUPS") Version 1.2. This document also provides C language header files and high-level pseudo-code for all new modules and external functions. McDonald June 20, 2002 [Page 1] CUPS Internationalization Software Design Description v0.3 Table of Contents 1. Scope ...................................................... 4 1.1. Identification ......................................... 4 1.2. System Overview ........................................ 4 1.3. Document Overview ...................................... 4 2. References ................................................. 5 2.1. CUPS References ........................................ 5 2.2. Other Documents ........................................ 5 3. Design Overview ............................................ 7 3.1. Transcoding - New ...................................... 7 3.1.1. transcode.h - Transcoding header ................... 7 3.1.1.1. cups_cmap_t - SBCS Charmap Structure ........... 10 3.1.1.2. cups_dmap_t - DBCS Charmap Structure ........... 11 3.1.2. transcode.c - Transcoding module ................... 11 3.1.2.1. cupsUtf8ToCharset() ............................ 11 3.1.2.2. cupsCharsetToUtf8() ............................ 12 3.1.2.3. cupsUtf8ToUtf16() .............................. 12 3.1.2.4. cupsUtf16ToUtf8() .............................. 12 3.1.2.5. cupsUtf8ToUtf32() .............................. 12 3.1.2.6. cupsUtf32ToUtf8() .............................. 13 3.1.2.7. cupsUtf16ToUtf32() ............................. 13 3.1.2.8. 
cupsUtf32ToUtf16() ............................. 13 3.1.2.9. Transcoding Utility Functions .................. 13 3.1.2.9.1. cupsCharmapGet() ........................... 14 3.1.2.9.2. cupsCharmapFree() .......................... 14 3.1.2.9.3. cupsCharmapFlush() ......................... 14 3.2. Normalization - New .................................... 15 3.2.1. normalize.h - Normalization header ................. 15 3.2.1.1. cups_normmap_t - Normalize Map Structure ....... 22 3.2.1.2. cups_foldmap_t - Case Fold Map Structure ....... 22 3.2.1.3. cups_propmap_t - Char Property Map Structure ... 23 3.2.1.4. cups_prop_t - Char Property Structure .......... 23 3.2.1.5. cups_breakmap_t - Line Break Map Structure ..... 23 3.2.1.6. cups_combmap_t - Combining Class Map Structure . 24 3.2.1.7. cups_comb_t - Combining Class Structure ........ 24 3.2.2. normalize.c - Normalization module ................. 24 3.2.2.1. cupsUtf8Normalize() ............................ 24 3.2.2.2. cupsUtf32Normalize() ........................... 25 3.2.2.3. cupsUtf8CaseFold() ............................. 25 3.2.2.4. cupsUtf32CaseFold() ............................ 26 3.2.2.5. cupsUtf8CompareCaseless() ...................... 26 3.2.2.6. cupsUtf32CompareCaseless() ..................... 26 3.2.2.7. cupsUtf8CompareIdentifier() .................... 27 3.2.2.8. cupsUtf32CompareIdentifier() ................... 27 3.2.2.9. cupsUtf32CharacterProperty() ................... 27 3.2.2.10. Normalization Utility Functions ............... 28 3.2.2.10.1. cupsNormalizeMapsGet() .................... 28 3.2.2.10.2. cupsNormalizeMapsFree() ................... 28 3.2.2.10.3. cupsNormalizeMapsFlush() .................. 28 3.3. Language - Existing .................................... 29 3.3.1. language.h - Language header ....................... 29 McDonald June 20, 2002 [Page 2] CUPS Internationalization Software Design Description v0.3 3.3.2. language.c - Language module ....................... 29 3.3.2.1. 
cupsLangEncoding() - Existing .................. 29 3.3.2.2. cupsLangFlush() - Existing ..................... 29 3.3.2.3. cupsLangFree() - Existing ...................... 29 3.3.2.4. cupsLangGet() - Existing ....................... 30 3.3.2.5. cupsLangPrintf() - New ......................... 30 3.3.2.6. cupsLangPuts() - New ........................... 30 3.3.2.7. cupsEncodingName() - New ....................... 31 3.4. Common Text Filter - Existing .......................... 31 3.4.1. textcommon.h - Common text filter header ........... 31 3.4.1.1. lchar_t - Character/Attribute Structure ........ 31 3.4.2. textcommon.c - Common text filter .................. 32 3.4.2.1. TextMain() - Existing .......................... 32 3.4.2.2. compare_keywords() - Existing .................. 33 3.4.2.3. getutf8() - Existing ........................... 33 3.5. Text to PostScript Filter - Existing ................... 33 3.5.1. texttops.c - Text to PostScript filter ............. 33 3.5.1.1. main() - Existing .............................. 33 3.5.1.2. WriteEpilogue () - Existing .................... 34 3.5.1.3. WritePage () - Existing ........................ 34 3.5.1.4. WriteProlog () - Existing ...................... 34 3.5.1.5. write_line() - Existing ........................ 34 3.5.1.6. write_string() - Existing ...................... 34 3.5.1.7. write_text() - Existing ........................ 35 A. Glossary ................................................... A-1 McDonald June 20, 2002 [Page 3] CUPS Internationalization Software Design Description v0.3 1. Scope 1.1. Identification This document provides general information and high-level design for the Internationalization extensions for the Common UNIX Printing System ("CUPS") Version 1.2. This document also provides C language header files and high-level pseudo-code for all new modules and external functions. 1.2. 
System Overview

The CUPS Internationalization extensions provide multilingual support
via Unicode 3.2:2002 [UNICODE3.2] / ISO-10646-1:2000 [ISO10646-1] and a
suite of local character sets (including all adopted parts of ISO-8859
and many MS Windows code pages) for CUPS 1.2.

The CUPS Internationalization extensions support UTF-8 [RFC2279] as the
common stream-oriented representation of all character data. UTF-8 is
defined in [ISO10646-1] and is further constrained (for integrity and
security) by [UNICODE3.2]. UTF-8 is the native character set of LDAPv3
[RFC2251], SLPv2 [RFC2608], IPP/1.1 [RFC2910] [RFC2911], and many other
Internet protocols.

1.3. Document Overview

This software design description document is organized into the
following sections:

o  1 - Scope
o  2 - References
o  3 - Design Overview
o  A - Glossary

2. References

2.1. CUPS References

See: Section 2.1 'CUPS Documentation' of CUPS Software Design
Description.

2.2. Other Documents

The following non-CUPS documents are referenced by this document.

[ANSI-X3.4]
    ANSI Coded Character Set - 7-bit American National Standard Code
    for Information Interchange, ANSI X3.4, 1986 (aka US-ASCII).

[GB2312]
    Code of Chinese Graphic Character Set for Information Interchange,
    Primary Set, GB 2312, 1980.

[ISO639-1]
    Codes for the Representation of Names of Languages -- Part 1:
    Alpha-2 Code, ISO/IEC 639-1, 2000.

[ISO639-2]
    Codes for the Representation of Names of Languages -- Part 2:
    Alpha-3 Code, ISO/IEC 639-2, 1998.

[ISO646]
    Information Technology - ISO 7-bit Coded Character Set for
    Information Interchange, ISO/IEC 646, 1991.

[ISO2022]
    Information Processing - ISO 7-bit and 8-bit Coded Character Sets -
    Code Extension Techniques, ISO/IEC 2022, 1994. (Technically
    identical to ECMA-35.)

[ISO3166-1]
    Codes for the Representation of Names of Countries and their
    Subdivisions, Part 1: Country Codes, ISO 3166-1, 1997.
[ISO8859]
    Information Processing - 8-bit Single-Byte Code Graphic Character
    Sets, ISO/IEC 8859-n, 1987-2001.

[ISO10646-1]
    Information Technology - Universal Multiple-Octet Code Character
    Set (UCS) - Part 1: Architecture and Basic Multilingual Plane,
    ISO/IEC 10646-1, September 2000.

[ISO10646-2]
    Information Technology - Universal Multiple-Octet Code Character
    Set (UCS) - Part 2: Supplemental Planes, ISO/IEC 10646-2,
    January 2001.

[RFC2119]
    Bradner. Key words for use in RFCs to Indicate Requirement Levels,
    RFC 2119, March 1997.

[RFC2251]
    Wahl, Howes, Kille. Lightweight Directory Access Protocol
    Version 3 (LDAPv3), RFC 2251, December 1997.

[RFC2277]
    Alvestrand. IETF Policy on Character Sets and Languages, RFC 2277,
    January 1998.

[RFC2279]
    Yergeau. UTF-8, a Transformation Format of ISO 10646, RFC 2279,
    January 1998.

[RFC2608]
    Guttman, Perkins, Veizades, Day. Service Location Protocol
    Version 2 (SLPv2), RFC 2608, June 1999.

[RFC2910]
    Herriot, Butler, Moore, Turner, Wenn. Internet Printing
    Protocol/1.1: Encoding and Transport, RFC 2910, September 2000.

[RFC2911]
    Hastings, Herriot, deBry, Isaacson, Powell. Internet Printing
    Protocol/1.1: Model and Semantics, RFC 2911, September 2000.

[UNICODE3.0]
    Unicode Consortium, Unicode Standard Version 3.0, Addison-Wesley
    Developers Press, ISBN 0-201-61633-5, 2000.

[UNICODE3.1]
    Unicode Consortium, Unicode Standard Version 3.1 (UAX-27),
    May 2001.

[UNICODE3.2]
    Unicode Consortium, Unicode Standard Version 3.2 (UAX-28),
    March 2002.

[US-ASCII]
    See [ANSI-X3.4] above.

3. Design Overview

The CUPS Internationalization extensions are composed of several header
files and modules which extend the Language functions in the existing
CUPS Application Programmers Interface (API).

3.1. Transcoding - New

Initially, the CUPS Internationalization extensions will only support
SBCS (single-byte character set) transcoding. But the design allows
future support for DBCS (double-byte character set) transcoding for CJK
(Chinese/Japanese/Korean) languages and the MBCS (multiple-byte
character set) compound sets that use escapes for charset switching.

To reduce code size and increase performance, all conventional 'mapping
files' (tables of values in legacy character sets with their
corresponding Unicode scalar values) will ALSO be sorted and stored in
memory as reverse maps (for efficient conversion from Unicode scalar
values to their corresponding legacy character set values).
Transcoding will be done directly by 2-level lookup (without any
searching or sorting).

[Ed Note: CJK languages will be fairly costly in mapping table sizes,
because they have thousands (or tens of thousands) of codepoints.]

3.1.1. transcode.h - Transcoding header

/*
 * "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
 *
 *   Transcoding support for the Common UNIX Printing System (CUPS).
 *
 *   Copyright 1997-2002 by Easy Software Products.
 *
 *   These coded instructions, statements, and computer programs are
 *   the property of Easy Software Products and are protected by Federal
 *   copyright law. Distribution and use rights are outlined in the
 *   file "LICENSE.txt" which should have been included with this file.
 *   If this file is missing or damaged please contact Easy Software
 *   Products at:
 *
 *       Attn: CUPS Licensing Information
 *       Easy Software Products
 *       44141 Airport View Drive, Suite 204
 *       Hollywood, Maryland 20636-3111 USA
 *
 *       Voice: (301) 373-9603
 *       EMail: cups-info@cups.org
 *         WWW: http://www.cups.org
 */

#ifndef _CUPS_TRANSCODE_H_
#  define _CUPS_TRANSCODE_H_

/*
 * Include necessary headers...
*/ # include "cups/language.h" # ifdef __cplusplus extern "C" { # endif /* __cplusplus */ /* * Types... */ typedef unsigned char utf8_t; /* UTF-8 Unicode/ISO-10646 code unit */ typedef unsigned short utf16_t; /* UTF-16 Unicode/ISO-10646 code unit */ typedef unsigned long utf32_t; /* UTF-32 Unicode/ISO-10646 code unit */ typedef unsigned short ucs2_t; /* UCS-2 Unicode/ISO-10646 code unit */ typedef unsigned long ucs4_t; /* UCS-4 Unicode/ISO-10646 code unit */ typedef unsigned char sbcs_t; /* SBCS Legacy 8-bit code unit */ typedef unsigned short dbcs_t; /* DBCS Legacy 16-bit code unit */ /* * Structures... */ typedef struct cups_cmap_str /**** SBCS Charmap Cache Structure ****/ { struct cups_cmap_str *next; /* Next charmap in cache */ int used; /* Number of times entry used */ cups_encoding_t encoding; /* Legacy charset encoding */ ucs2_t char2uni[256]; /* Map Legacy SBCS -> UCS-2 */ sbcs_t *uni2char[256]; /* Map UCS-2 -> Legacy SBCS */ } cups_cmap_t; #if 0 typedef struct cups_dmap_str /**** DBCS Charmap Cache Structure ****/ { struct cups_dmap_str *next; /* Next charmap in cache */ int used; /* Number of times entry used */ cups_encoding_t encoding; /* Legacy charset encoding */ ucs2_t *char2uni[256]; /* Map Legacy DBCS -> UCS-2 */ dbcs_t *uni2char[256]; /* Map UCS-2 -> Legacy DBCS */ } cups_dmap_t; #endif McDonald June 20, 2002 [Page 8] CUPS Internationalization Software Design Description v0.3 /* * Constants... */ #define CUPS_MAX_USTRING 1024 /* Maximum size of Unicode string */ /* * Globals... 
*/ extern int TcFixMapNames; /* Fix map names to Unicode names */ extern int TcStrictUtf8; /* Non-shortest-form is illegal */ extern int TcStrictUtf16; /* Invalid surrogate pair is illegal */ extern int TcStrictUtf32; /* Greater than 0x10FFFF is illegal */ extern int TcRequireBOM; /* Require BOM for little/big-endian */ extern int TcSupportBOM; /* Support BOM for little/big-endian */ extern int TcSupport8859; /* Support ISO 8859-x repertoires */ extern int TcSupportWin; /* Support Windows-x repertoires */ extern int TcSupportCJK; /* Support CJK (Asian) repertoires */ /* * Prototypes... */ /* * Utility functions for character set maps */ extern void *cupsCharmapGet(const cups_encoding_t encoding); /* I - Encoding */ extern void cupsCharmapFree(const cups_encoding_t encoding); /* I - Encoding */ extern void cupsCharmapFlush(void); /* * Convert UTF-8 to and from legacy character set */ extern int cupsUtf8ToCharset(char *dest, /* O - Target string */ const utf8_t *src, /* I - Source string */ const int maxout, /* I - Max output */ cups_encoding_t encoding); /* I - Encoding */ extern int cupsCharsetToUtf8(utf8_t *dest, /* O - Target string */ const char *src, /* I - Source string */ const int maxout, /* I - Max output */ cups_encoding_t encoding); /* I - Encoding */ /* * Convert UTF-8 to and from UTF-16 */ extern int cupsUtf8ToUtf16(utf16_t *dest, /* O - Target string */ const utf8_t *src, /* I - Source string */ const int maxout); /* I - Max output */ extern int cupsUtf16ToUtf8(utf8_t *dest, /* O - Target string */ McDonald June 20, 2002 [Page 9] CUPS Internationalization Software Design Description v0.3 const utf16_t *src, /* I - Source string */ const int maxout); /* I - Max output */ /* * Convert UTF-8 to and from UTF-32 */ extern int cupsUtf8ToUtf32(utf32_t *dest, /* O - Target string */ const utf8_t *src, /* I - Source string */ const int maxout); /* I - Max output */ extern int cupsUtf32ToUtf8(utf8_t *dest, /* O - Target string */ const utf32_t *src, /* I - 
Source string */ const int maxout); /* I - Max output */ /* * Convert UTF-16 to and from UTF-32 */ extern int cupsUtf16ToUtf32(utf32_t *dest, /* O - Target string */ const utf16_t *src, /* I - Source string */ const int maxout); /* I - Max output */ extern int cupsUtf32ToUtf16(utf16_t *dest, /* O - Target string */ const utf32_t *src, /* I - Source string */ const int maxout); /* I - Max output */ # ifdef __cplusplus } # endif /* __cplusplus */ #endif /* !_CUPS_TRANSCODE_H_ */ /* * End of "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $" */ 3.1.1.1. cups_cmap_t - SBCS Charmap Structure typedef struct cups_cmap_str /**** SBCS Charmap Cache Structure ****/ { struct cups_cmap_str *next; /* Next charset map in cache */ int used; /* Number of times entry used */ cups_encoding_t encoding; /* Legacy charset encoding */ ucs2_t char2uni[256]; /* Map Legacy SBCS -> UCS-2 */ sbcs_t *uni2char[256]; /* Map UCS-2 -> Legacy SBCS */ } cups_cmap_t; 'char2uni[]' is a (complete) array of UCS-2 values that supports direct one-level lookup from an input SBCS legacy charset code point, for use by 'cupsCharsetToUtf8()'. 'uni2char[]' is a (sparse) array of pointers to arrays of (256 each) SBCS values, that supports direct two-level lookup from an input UCS-2 McDonald June 20, 2002 [Page 10] CUPS Internationalization Software Design Description v0.3 code point, for use by 'cupsUtf8ToCharset()'. 3.1.1.2. 
cups_dmap_t - DBCS Charmap Structure typedef struct cups_dmap_str /**** DBCS Charmap Cache Structure ****/ { struct cups_dmap_str *next; /* Next charset map in cache */ int used; /* Number of times entry used */ cups_encoding_t encoding; /* Legacy charset encoding */ ucs2_t *char2uni[256]; /* Map Legacy DBCS -> UCS-2 */ dbcs_t *uni2char[256]; /* Map UCS-2 -> Legacy DBCS */ } cups_dmap_t; 'char2uni[]' is a (sparse) array of pointers to arrays of (256 each) UCS-2 values that supports direct two-level lookup from an input DBCS legacy charset code point, for (future) use by 'cupsCharsetToUtf8()'. 'uni2char[]' is a (sparse) array of pointers to arrays of (256 each) DBCS values, that supports direct two-level lookup from an input UCS-2 code point, for (future) use by 'cupsUtf8ToCharset()'. 3.1.2. transcode.c - Transcoding module All of the transcoding functions are modelled on the C standard library function 'strncpy()', except that they return the count of output, like 'strlen()', rather than the (redundant) pointer to the output. If the transcoding functions detect invalid input parameters or they detect an encoding error in their input, then they return '-1', rather than the count of output. All of the transcoding functions take an input parameter indicating the maximum output units (for safe operation). The functions that return 16-bit (UTF-16) or 32-bit (UTF-32/UCS-4) output always return the output string count (not including the final null) and NOT the memory size in bytes. 3.1.2.1. cupsUtf8ToCharset() extern int cupsUtf8ToCharset(char *dest, /* O - Target string */ const utf8_t *src, /* I - Source string */ const int maxout, /* I - Max output */ cups_encoding_t encoding); /* I - Encoding */ McDonald June 20, 2002 [Page 11] CUPS Internationalization Software Design Description v0.3 3.1.2.2. 
cupsCharsetToUtf8() extern int cupsCharsetToUtf8(utf8_t *dest, /* O - Target string */ const char *src, /* I - Source string */ const int maxout, /* I - Max output */ cups_encoding_t encoding); /* I - Encoding */ 3.1.2.3. cupsUtf8ToUtf16() extern int cupsUtf8ToUtf16(utf16_t *dest, /* O - Target string */ const utf8_t *src, /* I - Source string */ const int maxout); /* I - Max output */ <...to avoid duplicate code to handle surrogate pairs...> 3.1.2.4. cupsUtf16ToUtf8() extern int cupsUtf16ToUtf8(utf8_t *dest, /* O - Target string */ const utf16_t *src, /* I - Source string */ const int maxout); /* I - Max output */ <...to avoid duplicate code to handle surrogate pairs...> 3.1.2.5. cupsUtf8ToUtf32() extern int cupsUtf8ToUtf32(utf32_t *dest, /* O - Target string */ const utf8_t *src, /* I - Source string */ const int maxout); /* I - Max output */ McDonald June 20, 2002 [Page 12] CUPS Internationalization Software Design Description v0.3 <...checking for valid range, shortest-form, etc.> 3.1.2.6. cupsUtf32ToUtf8() extern int cupsUtf32ToUtf8(utf8_t *dest, /* O - Target string */ const utf32_t *src, /* I - Source string */ const int maxout); /* I - Max output */ <...checking for valid range, etc.> 3.1.2.7. cupsUtf16ToUtf32() extern int cupsUtf16ToUtf32(utf32_t *dest, /* O - Target string */ const utf16_t *src, /* I - Source string */ const int maxout); /* I - Max output */ <...handling surrogate pairs decoding from UTF-16> 3.1.2.8. cupsUtf32ToUtf16() extern int cupsUtf32ToUtf16(utf16_t *dest, /* O - Target string */ const utf32_t *src, /* I - Source string */ const int maxout); /* I - Max output */ <...handling surrogate pairs encoding to UTF-16> 3.1.2.9. Transcoding Utility Functions The transcoding utility functions are used to load (from a file into memory), free (logically, without freeing memory), and flush (actually free memory) character maps for SBCS (single-byte character set) and (future) DBCS (double-byte character set) transcoding to and from UTF-8. 
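The get / free / flush life cycle described above can be sketched as a
small reference-counted cache. This is a minimal sketch under stated
assumptions, not the CUPS implementation: the names 'sketch_cmap_t',
'sketchCharmapGet()', 'sketchCharmapFree()', 'sketchCharmapFlush()' and
'sketchCharmapCount()' are hypothetical, an 'int' stands in for
'cups_encoding_t', an identity table stands in for a mapping file read
from disk, and flushing every entry regardless of its 'used' count is
an assumption.

```c
#include <stdlib.h>

/* Hypothetical stand-in for 'cups_cmap_t'; only cache fields are kept. */
typedef struct sketch_cmap_str
{
  struct sketch_cmap_str *next;          /* Next charmap in cache */
  int                    used;           /* Number of times entry used */
  int                    encoding;       /* Stand-in for cups_encoding_t */
  unsigned short         char2uni[256];  /* Legacy SBCS -> UCS-2 */
} sketch_cmap_t;

static sketch_cmap_t *cmap_cache = NULL; /* Head of charmap cache */

/* Get: return the cached map (incrementing 'used') or load a new one. */
sketch_cmap_t *
sketchCharmapGet(int encoding)
{
  sketch_cmap_t *cmap;
  int           i;

  for (cmap = cmap_cache; cmap != NULL; cmap = cmap->next)
    if (cmap->encoding == encoding)
    {
      cmap->used ++;
      return (cmap);
    }

  if ((cmap = calloc(1, sizeof(sketch_cmap_t))) == NULL)
    return (NULL);                       /* No memory */

  /* Assumption: an identity map stands in for reading a mapping file. */
  for (i = 0; i < 256; i ++)
    cmap->char2uni[i] = (unsigned short)i;

  cmap->encoding = encoding;
  cmap->used     = 1;
  cmap->next     = cmap_cache;
  cmap_cache     = cmap;

  return (cmap);
}

/* Free: logical release only; the map stays in the cache. */
void
sketchCharmapFree(int encoding)
{
  sketch_cmap_t *cmap;

  for (cmap = cmap_cache; cmap != NULL; cmap = cmap->next)
    if (cmap->encoding == encoding)
    {
      if (cmap->used > 0)
        cmap->used --;
      return;
    }
}

/* Flush: actually release the memory of every cached map. */
void
sketchCharmapFlush(void)
{
  sketch_cmap_t *cmap, *next;

  for (cmap = cmap_cache; cmap != NULL; cmap = next)
  {
    next = cmap->next;
    free(cmap);
  }

  cmap_cache = NULL;
}

/* Helper for the example: count the maps currently held in the cache. */
int
sketchCharmapCount(void)
{
  sketch_cmap_t *cmap;
  int           count = 0;

  for (cmap = cmap_cache; cmap != NULL; cmap = cmap->next)
    count ++;

  return (count);
}
```

In this sketch, two 'get' calls for the same encoding return the same
pointer with 'used' equal to 2, a 'free' call only lowers the count,
and memory is returned to the system only by 'flush'. Keeping released
maps cached avoids reloading a mapping file when the same legacy
charset is requested again.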
3.1.2.9.1. cupsCharmapGet()

extern void *cupsCharmapGet(const cups_encoding_t encoding);
                                        /* I - Encoding */

    <...If found, increment 'used'>
    <...and return pointer to SBCS or DBCS charset map>
    <...If not found, return NULL>
    <...If no memory, return NULL>
    <...and 'uni2char[]' is an array of pointers to 'sbcs_t' arrays>
    <...and 'uni2char[]' is an array of pointers to 'dbcs_t' arrays>

3.1.2.9.2. cupsCharmapFree()

extern void cupsCharmapFree(const cups_encoding_t encoding);
                                        /* I - Encoding */

    <...If found, decrement 'used'>

3.1.2.9.3. cupsCharmapFlush()

extern void cupsCharmapFlush(void);

    <...Free 'uni2char[]' memory>
    <...Free SBCS charset map memory>
    <...Free 'char2uni[]' memory>
    <...Free 'uni2char[]' memory>
    <...Free DBCS charset map memory>

3.2. Normalization - New

3.2.1. normalize.h - Normalization header

/*
 * "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
 *
 *   Unicode normalization for the Common UNIX Printing System (CUPS).
 *
 *   Copyright 1997-2002 by Easy Software Products.
 *
 *   These coded instructions, statements, and computer programs are
 *   the property of Easy Software Products and are protected by Federal
 *   copyright law. Distribution and use rights are outlined in the
 *   file "LICENSE.txt" which should have been included with this file.
 *   If this file is missing or damaged please contact Easy Software
 *   Products at:
 *
 *       Attn: CUPS Licensing Information
 *       Easy Software Products
 *       44141 Airport View Drive, Suite 204
 *       Hollywood, Maryland 20636-3111 USA
 *
 *       Voice: (301) 373-9603
 *       EMail: cups-info@cups.org
 *         WWW: http://www.cups.org
 */

#ifndef _CUPS_NORMALIZE_H_
#  define _CUPS_NORMALIZE_H_

/*
 * Include necessary headers...
 */

#  include "transcode.h"

#  ifdef __cplusplus
extern "C" {
#  endif /* __cplusplus */

/*
 * Types...
 */

typedef enum                    /**** Normalization Types ****/
{
  CUPS_NORM_NFD,                /* Canonical Decomposition */
  CUPS_NORM_NFKD,               /* Compatibility Decomposition */
  CUPS_NORM_NFC,                /* NFD, then Canonical Composition */
  CUPS_NORM_NFKC                /* NFKD, then Canonical Composition */
} cups_normalize_t;

typedef enum                    /**** Case Folding Types ****/
{
  CUPS_FOLD_SIMPLE,             /* Simple - no expansion in size */
  CUPS_FOLD_FULL                /* Full - possible expansion in size */
} cups_folding_t;

typedef enum                    /**** Unicode Char Property Types ****/
{
  CUPS_PROP_GENERAL_CATEGORY,   /* See 'cups_gencat_t' enum */
  CUPS_PROP_BIDI_CATEGORY,      /* See 'cups_bidicat_t' enum */
  CUPS_PROP_COMBINING_CLASS,    /* See 'cups_combclass_t' type */
  CUPS_PROP_BREAK_CLASS         /* See 'cups_breakclass_t' enum */
} cups_property_t;

/*
 * Note - parse Unicode char general category from 'UnicodeData.txt'
 * into sparse local table in 'normalize.c'.
 * Use major classes for logic optimizations throughout (by mask).
 */

typedef enum                    /**** Unicode General Category ****/
{
  CUPS_GENCAT_L  = 0x10,        /* Letter major class */
  CUPS_GENCAT_LU = 0x11,        /* Lu Letter, Uppercase */
  CUPS_GENCAT_LL = 0x12,        /* Ll Letter, Lowercase */
  CUPS_GENCAT_LT = 0x13,        /* Lt Letter, Titlecase */
  CUPS_GENCAT_LM = 0x14,        /* Lm Letter, Modifier */
  CUPS_GENCAT_LO = 0x15,        /* Lo Letter, Other */
  CUPS_GENCAT_M  = 0x20,        /* Mark major class */
  CUPS_GENCAT_MN = 0x21,        /* Mn Mark, Non-Spacing */
  CUPS_GENCAT_MC = 0x22,        /* Mc Mark, Spacing Combining */
  CUPS_GENCAT_ME = 0x23,        /* Me Mark, Enclosing */
  CUPS_GENCAT_N  = 0x30,        /* Number major class */
  CUPS_GENCAT_ND = 0x31,        /* Nd Number, Decimal Digit */
  CUPS_GENCAT_NL = 0x32,        /* Nl Number, Letter */
  CUPS_GENCAT_NO = 0x33,        /* No Number, Other */
  CUPS_GENCAT_P  = 0x40,        /* Punctuation major class */
  CUPS_GENCAT_PC = 0x41,        /* Pc Punctuation, Connector */
  CUPS_GENCAT_PD = 0x42,        /* Pd Punctuation, Dash */
  CUPS_GENCAT_PS = 0x43,        /* Ps Punctuation, Open (start) */
  CUPS_GENCAT_PE = 0x44,        /* Pe Punctuation, Close (end) */
  CUPS_GENCAT_PI = 0x45,        /* Pi Punctuation, Initial Quote */
  CUPS_GENCAT_PF = 0x46,        /* Pf Punctuation, Final Quote */
  CUPS_GENCAT_PO = 0x47,        /* Po Punctuation, Other */
  CUPS_GENCAT_S  = 0x50,        /* Symbol major class */
  CUPS_GENCAT_SM = 0x51,        /* Sm Symbol, Math */
  CUPS_GENCAT_SC = 0x52,        /* Sc Symbol, Currency */
  CUPS_GENCAT_SK = 0x53,        /* Sk Symbol, Modifier */
  CUPS_GENCAT_SO = 0x54,        /* So Symbol, Other */
  CUPS_GENCAT_Z  = 0x60,        /* Separator major class */
  CUPS_GENCAT_ZS = 0x61,        /* Zs Separator, Space */
  CUPS_GENCAT_ZL = 0x62,        /* Zl Separator, Line */
  CUPS_GENCAT_ZP = 0x63,        /* Zp Separator, Paragraph */
  CUPS_GENCAT_C  = 0x70,        /* Other (miscellaneous) major class */
  CUPS_GENCAT_CC = 0x71,        /* Cc Other, Control */
  CUPS_GENCAT_CF = 0x72,        /* Cf Other, Format */
  CUPS_GENCAT_CS = 0x73,        /* Cs Other, Surrogate */
  CUPS_GENCAT_CO = 0x74,        /* Co Other, Private Use */
  CUPS_GENCAT_CN = 0x75         /* Cn Other, Not Assigned */
} cups_gencat_t;

/*
 * Note - parse Unicode char bidi category from 'UnicodeData.txt'
 * into sparse local table in 'normalize.c'.
 * Add bidirectional support to 'textcommon.c' - per Mike
 */

typedef enum                    /**** Unicode Bidi Category ****/
{
  CUPS_BIDI_L,    /* Left-to-Right (Alpha, Syllabic, Ideographic) */
  CUPS_BIDI_LRE,  /* Left-to-Right Embedding (explicit) */
  CUPS_BIDI_LRO,  /* Left-to-Right Override (explicit) */
  CUPS_BIDI_R,    /* Right-to-Left (Hebrew alphabet and most punct) */
  CUPS_BIDI_AL,   /* Right-to-Left Arabic (Arabic, Thaana, Syriac) */
  CUPS_BIDI_RLE,  /* Right-to-Left Embedding (explicit) */
  CUPS_BIDI_RLO,  /* Right-to-Left Override (explicit) */
  CUPS_BIDI_PDF,  /* Pop Directional Format */
  CUPS_BIDI_EN,   /* Euro Number (Euro and East Arabic-Indic digits) */
  CUPS_BIDI_ES,   /* Euro Number Separator (Slash) */
  CUPS_BIDI_ET,   /* Euro Number Terminator (Plus, Minus, Degree, etc) */
  CUPS_BIDI_AN,   /* Arabic Number (Arabic-Indic digits, separators) */
  CUPS_BIDI_CS,   /* Common Number Separator (Colon, Comma, Dot, etc) */
  CUPS_BIDI_NSM,  /* Non-Spacing Mark (category Mn / Me in UCD) */
  CUPS_BIDI_BN,   /* Boundary Neutral (Formatting / Control chars) */
  CUPS_BIDI_B,    /* Paragraph Separator */
  CUPS_BIDI_S,    /* Segment Separator (Tab) */
  CUPS_BIDI_WS,   /* Whitespace Space (Space, Line Separator, etc) */
  CUPS_BIDI_ON    /* Other Neutrals */
} cups_bidicat_t;

/*
 * Note - parse Unicode line break class from 'DerivedLineBreak.txt'
 * into sparse local table (list of class ranges) in 'normalize.c'.
 * Note - add state table from UAX-14, section 7.3 - Ira
 * Remember to do BK and SP in outer loop (not in state table).
 * Consider optimization for CM (combining mark).
 * See 'LineBreak.txt' (12,875) and 'DerivedLineBreak.txt' (1,350).
*/ McDonald June 20, 2002 [Page 17] CUPS Internationalization Software Design Description v0.3 typedef enum /**** Unicode Line Break Class ****/ { /* * (A) - Allow Break AFTER * (XA) - Prevent Break AFTER * (B) - Allow Break BEFORE * (XB) - Prevent Break BEFORE * (P) - Allow Break For Pair * (XP) - Prevent Break For Pair */ CUPS_BREAK_AI, /* Ambiguous (Alphabetic or Ideograph) */ CUPS_BREAK_AL, /* Ordinary Alphabetic / Symbol Chars (XP) */ CUPS_BREAK_BA, /* Break Opportunity After Chars (A) */ CUPS_BREAK_BB, /* Break Opportunities Before Chars (B) */ CUPS_BREAK_B2, /* Break Opportunity Before / After (B/A/XP) */ CUPS_BREAK_BK, /* Mandatory Break (A) (normative) */ CUPS_BREAK_CB, /* Contingent Break (B/A) (normative) */ CUPS_BREAK_CL, /* Closing Punctuation (XB) */ CUPS_BREAK_CM, /* Attached Chars / Combining (XB) (normative) */ CUPS_BREAK_CR, /* Carriage Return (A) (normative) */ CUPS_BREAK_EX, /* Exclamation / Interrogation (XB) */ CUPS_BREAK_GL, /* Non-breaking ("Glue") (XB/XA) (normative) */ CUPS_BREAK_HY, /* Hyphen (XA) */ CUPS_BREAK_ID, /* Ideographic (B/A) */ CUPS_BREAK_IN, /* Inseparable chars (XP) */ CUPS_BREAK_IS, /* Numeric Separator (Infix) (XB) */ CUPS_BREAK_LF, /* Line Feed (A) (normative) */ CUPS_BREAK_NS, /* Non-starters (XB) */ CUPS_BREAK_NU, /* Numeric (XP) */ CUPS_BREAK_OP, /* Opening Punctuation (XA) */ CUPS_BREAK_PO, /* Postfix (Numeric) (XB) */ CUPS_BREAK_PR, /* Prefix (Numeric) (XA) */ CUPS_BREAK_QU, /* Ambiguous Quotation (XB/XA) */ CUPS_BREAK_SA, /* Context Dependent (South East Asian) (P) */ CUPS_BREAK_SG, /* Surrogates (XP) (normative) */ CUPS_BREAK_SP, /* Space (A) (normative) */ CUPS_BREAK_SY, /* Symbols Allowing Break After (A) */ CUPS_BREAK_XX, /* Unknown (XP) */ CUPS_BREAK_ZW /* Zero Width Space (A) (normative) */ } cups_breakclass_t; typedef int cups_combclass_t; /**** Unicode Combining Class ****/ /* 0=base / 1..254=combining char */ /* * Structures... 
*/ typedef struct cups_normmap_str /**** Normalize Map Cache Struct ****/ { struct cups_normmap_str *next; /* Next normalize in cache */ McDonald June 20, 2002 [Page 18] CUPS Internationalization Software Design Description v0.3 int used; /* Number of times entry used */ cups_normalize_t normalize; /* Normalization type */ int normcount; /* Count of Source Chars */ ucs2_t *uni2norm; /* Char -> Normalization */ /* ...only supports UCS-2 */ } cups_normmap_t; typedef struct cups_foldmap_str /**** Case Fold Map Cache Struct ****/ { struct cups_foldmap_str *next; /* Next case fold in cache */ int used; /* Number of times entry used */ cups_folding_t fold; /* Case folding type */ int foldcount; /* Count of Source Chars */ ucs2_t *uni2fold; /* Char -> Folded Char(s) */ /* ...only supports UCS-2 */ } cups_foldmap_t; typedef struct cups_prop_str /**** Char Property Struct ****/ { ucs2_t ch; /* Unicode Char as UCS-2 */ unsigned char gencat; /* General Category */ unsigned char bidicat; /* Bidirectional Category */ } cups_prop_t; typedef struct /**** Char Property Map Struct ****/ { int used; /* Number of times entry used */ int propcount; /* Count of Source Chars */ cups_prop_t *uni2prop; /* Char -> Properties */ } cups_propmap_t; typedef struct /**** Line Break Class Map Struct ****/ { int used; /* Number of times entry used */ int breakcount; /* Count of Source Chars */ ucs2_t *uni2break; /* Char -> Line Break Class */ } cups_breakmap_t; typedef struct cups_comb_str /**** Char Combining Class Struct ****/ { ucs2_t ch; /* Unicode Char as UCS-2 */ unsigned char combclass; /* Combining Class */ unsigned char reserved; /* Reserved for alignment */ } cups_comb_t; typedef struct /**** Combining Class Map Struct ****/ { int used; /* Number of times entry used */ int combcount; /* Count of Source Chars */ cups_comb_t *uni2comb; /* Char -> Combining Class */ } cups_combmap_t; McDonald June 20, 2002 [Page 19] CUPS Internationalization Software Design Description v0.3 /* * Globals... 
 */

extern int NzSupportUcs2;       /* Support UCS-2 (16-bit) mapping */
extern int NzSupportUcs4;       /* Support UCS-4 (32-bit) mapping */

/*
 * Prototypes...
 */

/*
 * Utility functions for normalization module
 */
extern int  cupsNormalizeMapsGet(void);
extern int  cupsNormalizeMapsFree(void);
extern void cupsNormalizeMapsFlush(void);

/*
 * Normalize UTF-8 string to Unicode UAX-15 Normalization Form
 * Note - Compatibility Normalization Forms (NFKD/NFKC) are
 * unsafe for subsequent transcoding to legacy charsets
 */
extern int cupsUtf8Normalize(utf8_t *dest,      /* O - Target string */
    const utf8_t *src,                          /* I - Source string */
    const int maxout,                           /* I - Max output */
    const cups_normalize_t normalize);          /* I - Normalization */

/*
 * Normalize UTF-32 string to Unicode UAX-15 Normalization Form
 * Note - Compatibility Normalization Forms (NFKD/NFKC) are
 * unsafe for subsequent transcoding to legacy charsets
 */
extern int cupsUtf32Normalize(utf32_t *dest,    /* O - Target string */
    const utf32_t *src,                         /* I - Source string */
    const int maxout,                           /* I - Max output */
    const cups_normalize_t normalize);          /* I - Normalization */

/*
 * Case Fold UTF-8 string per Unicode UAX-21 Section 2.3
 * Note - Case folding output is
 * unsafe for subsequent transcoding to legacy charsets
 */
extern int cupsUtf8CaseFold(utf8_t *dest,       /* O - Target string */
    const utf8_t *src,                          /* I - Source string */
    const int maxout,                           /* I - Max output */
    const cups_folding_t fold);                 /* I - Fold Mode */

/*
 * Case Fold UTF-32 string per Unicode UAX-21 Section 2.3
 * Note - Case folding output is
 * unsafe for subsequent transcoding to legacy charsets
 */
extern int cupsUtf32CaseFold(utf32_t *dest,     /* O - Target string */
    const utf32_t *src,                         /* I - Source string */
    const int maxout,                           /* I - Max output */
    const cups_folding_t fold);                 /* I - Fold Mode */

/*
 * Compare UTF-8 strings after case folding
 */
extern int cupsUtf8CompareCaseless(const utf8_t *s1, /* I - String1 */
    const utf8_t *s2);                               /* I - String2 */

/*
 * Compare UTF-32 strings after case folding
 */
extern int cupsUtf32CompareCaseless(const utf32_t *s1, /* I - String1 */
    const utf32_t *s2);                                /* I - String2 */

/*
 * Compare UTF-8 strings after case folding and NFKC normalization
 */
extern int cupsUtf8CompareIdentifier(const utf8_t *s1, /* I - String1 */
    const utf8_t *s2);                                 /* I - String2 */

/*
 * Compare UTF-32 strings after case folding and NFKC normalization
 */
extern int cupsUtf32CompareIdentifier(const utf32_t *s1, /* I - String1 */
    const utf32_t *s2);                                  /* I - String2 */

/*
 * Get UTF-32 character property
 */
extern int cupsUtf32CharacterProperty(const utf32_t ch, /* I - Source char */
    const cups_property_t property);                /* I - Char Property */

#  ifdef __cplusplus
}
#  endif /* __cplusplus */

#endif /* !_CUPS_NORMALIZE_H_ */

/*
 * End of "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
 */

3.2.1.1. cups_normmap_t - Normalize Map Structure

typedef struct cups_normmap_str /**** Normalize Map Cache Struct ****/
{
  struct cups_normmap_str *next;  /* Next normalize in cache */
  int used;                       /* Number of times entry used */
  cups_normalize_t normalize;     /* Normalization type */
  int normcount;                  /* Count of Source Chars */
  ucs2_t *uni2norm;               /* Char -> Normalization */
                                  /* ...only supports UCS-2 */
} cups_normmap_t;

'uni2norm' is a pointer to an array of _triplets_ of UCS-2 values.

'normcount' is a count of _triplets_ in the 'uni2norm[]' array.

For decompositions (NFD and NFKD), the triplets are: composed base
character, decomposed base character, and decomposed accent character.
These are used by 'cupsUtf8Normalize()' and 'cupsUtf32Normalize()' in
performing canonical (NFD) or compatibility (NFKD) decomposition.

For compositions (NFC and NFKC), the triplets are: decomposed base
character, decomposed accent character, and composed base character.
These are used by 'cupsUtf8Normalize()' and 'cupsUtf32Normalize()' in
performing canonical composition (for NFC or NFKC).

3.2.1.2. cups_foldmap_t - Case Fold Map Structure

typedef struct cups_foldmap_str /**** Case Fold Map Cache Struct ****/
{
  int used;                       /* Number of times entry used */
  cups_folding_t fold;            /* Case folding type */
  int foldcount;                  /* Count of Source Chars */
  ucs2_t *uni2fold;               /* Char -> Folded Char(s) */
                                  /* ...only supports UCS-2 */
} cups_foldmap_t;

'uni2fold' is a pointer to an array of _quadruplets_ of UCS-2 values.

'foldcount' is a count of _quadruplets_ in the 'uni2fold[]' array.

For simple case folding (without expansion of the size of the output
string), the quadruplets are: input base character, output case folded
character, zero (unused), and zero (unused).

For full case folding (with possible expansion of the size of the
output string), the quadruplets are: input base character, output case
folded character, second output character or zero, third output
character or zero.

3.2.1.3. cups_propmap_t - Char Property Map Structure

typedef struct /**** Char Property Map Struct ****/
{
  int used;                       /* Number of times entry used */
  int propcount;                  /* Count of Source Chars */
  cups_prop_t *uni2prop;          /* Char -> Properties */
} cups_propmap_t;

'uni2prop' is a pointer to an array of 'cups_prop_t' (see below).

'propcount' is a count of elements in the 'uni2prop[]' array.

3.2.1.4. cups_prop_t - Char Property Structure

typedef struct cups_prop_str /**** Char Property Struct ****/
{
  ucs2_t ch;                      /* Unicode Char as UCS-2 */
  unsigned char gencat;           /* General Category */
  unsigned char bidicat;          /* Bidirectional Category */
} cups_prop_t;

3.2.1.5.
cups_breakmap_t - Line Break Map Structure

typedef struct /**** Line Break Class Map Struct ****/
{
  int used;                       /* Number of times entry used */
  int breakcount;                 /* Count of Source Chars */
  ucs2_t *uni2break;              /* Char -> Line Break Class */
} cups_breakmap_t;

'uni2break' is a pointer to an array of _triplets_ of UCS-2 values.

'breakcount' is a count of _triplets_ in the 'uni2break[]' array.

The triplets in 'uni2break' are: first UCS-2 value in a range, last
UCS-2 value in a range, and line break class stored as UCS-2.

3.2.1.6. cups_combmap_t - Combining Class Map Structure

typedef struct /**** Combining Class Map Struct ****/
{
  int used;                       /* Number of times entry used */
  int combcount;                  /* Count of Source Chars */
  cups_comb_t *uni2comb;          /* Char -> Combining Class */
} cups_combmap_t;

'uni2comb' is a pointer to an array of 'cups_comb_t' (see below).

'combcount' is a count of elements in the 'uni2comb[]' array.

3.2.1.7. cups_comb_t - Combining Class Structure

typedef struct cups_comb_str /**** Char Combining Class Struct ****/
{
  unsigned short ch;              /* Unicode Char as UCS-2 */
  unsigned char combclass;        /* Combining Class */
  unsigned char reserved;         /* Reserved for alignment */
} cups_comb_t;

3.2.2. normalize.c - Normalization module

The normalization function 'cupsUtf8Normalize()' and the case folding
function 'cupsUtf8CaseFold()' are modelled on the C standard library
function 'strncpy()', except that they return the count of the output,
like 'strlen()', rather than the (redundant) pointer to the output.

If the normalization or case folding functions detect invalid input
parameters or they detect an encoding error in their input, then they
return '-1', rather than the count of output.

The normalization and case folding functions take an input parameter
indicating the maximum output units (for safe operation).

3.2.2.1.
cupsUtf8Normalize()

/*
 * Normalize UTF-8 string to Unicode UAX-15 Normalization Form
 * Note - Compatibility Normalization Forms (NFKD/NFKC) are
 * unsafe for subsequent transcoding to legacy charsets
 */
extern int cupsUtf8Normalize(utf8_t *dest,      /* O - Target string */
    const utf8_t *src,                          /* I - Source string */
    const int maxout,                           /* I - Max output */
    const cups_normalize_t normalize);          /* I - Normalization */

3.2.2.2. cupsUtf32Normalize()

extern int cupsUtf32Normalize(utf32_t *dest,    /* O - Target string */
    const utf32_t *src,                         /* I - Source string */
    const int maxout,                           /* I - Max output */
    const cups_normalize_t normalize);          /* I - Normalization */

<...if not found, return '-1'>
<...with 'bsearch()' of 'uni2norm[]' using local 'compare_decompose()'>
<...until one pass yields no further decomposition>
<...with 'bsearch()' of 'uni2comb[]' using local 'compare_combchar()'>
<...until one pass yields no further canonical reordering>
<...repeatedly traverse internal UCS-4, composing (NFC or NFKC)...>
<...with 'bsearch()' of 'uni2norm[]' using local 'compare_compose()'>
<...until one pass yields no further composition>

3.2.2.3. cupsUtf8CaseFold()

/*
 * Case Fold UTF-8 string per Unicode UAX-21 Section 2.3
 * Note - Case folding output is
 * unsafe for subsequent transcoding to legacy charsets
 */
extern int cupsUtf8CaseFold(utf8_t *dest,       /* O - Target string */
    const utf8_t *src,                          /* I - Source string */
    const int maxout,                           /* I - Max output */
    const cups_folding_t fold);                 /* I - Fold Mode */

<...if not found, return '-1'>

3.2.2.4.
cupsUtf32CaseFold()

/*
 * Case Fold UTF-32 string per Unicode UAX-21 Section 2.3
 * Note - Case folding output is
 * unsafe for subsequent transcoding to legacy charsets
 */
extern int cupsUtf32CaseFold(utf32_t *dest,     /* O - Target string */
    const utf32_t *src,                         /* I - Source string */
    const int maxout,                           /* I - Max output */
    const cups_folding_t fold);                 /* I - Fold Mode */

<...if not found, return '-1'>
<...with 'bsearch()' of 'uni2fold[]' using local 'compare_foldchar()'>

3.2.2.5. cupsUtf8CompareCaseless()

/*
 * Compare UTF-8 strings after case folding
 */
extern int cupsUtf8CompareCaseless(const utf8_t *s1, /* I - String1 */
    const utf8_t *s2);                               /* I - String2 */

3.2.2.6. cupsUtf32CompareCaseless()

/*
 * Compare UTF-32 strings after case folding
 */
extern int cupsUtf32CompareCaseless(const utf32_t *s1, /* I - String1 */
    const utf32_t *s2);                                /* I - String2 */

3.2.2.7. cupsUtf8CompareIdentifier()

/*
 * Compare UTF-8 strings after case folding and NFKC normalization
 */
extern int cupsUtf8CompareIdentifier(const utf8_t *s1, /* I - String1 */
    const utf8_t *s2);                                 /* I - String2 */

3.2.2.8. cupsUtf32CompareIdentifier()

/*
 * Compare UTF-32 strings after case folding and NFKC normalization
 */
extern int cupsUtf32CompareIdentifier(const utf32_t *s1, /* I - String1 */
    const utf32_t *s2);                                  /* I - String2 */

3.2.2.9. cupsUtf32CharacterProperty()

/*
 * Get UTF-32 character property
 */
extern int cupsUtf32CharacterProperty(const utf32_t ch, /* I - Source char */
    const cups_property_t property);                /* I - Char Property */

<...internal functions for each different map lookup>

3.2.2.10. Normalization Utility Functions

3.2.2.10.1.
cupsNormalizeMapsGet()

extern void cupsNormalizeMapsGet(void);

<...If found, increment 'used'>
<...and return void>
<...If not found, return void>
<...Close (preprocessed form of) Unicode data file>
<...If not found, return void>
<...If no memory, return void>
<...Add values to 'uni2xxx[]' array>

3.2.2.10.2. cupsNormalizeMapsFree()

extern void cupsNormalizeMapsFree(void);

<...If found, decrement 'used'>

3.2.2.10.3. cupsNormalizeMapsFlush()

extern void cupsNormalizeMapsFlush(void);

<...Free 'uni2norm[]' memory>
<...Free normalize map memory>
<...Free 'uni2fold[]' memory>
<...Free case folding memory>
<...Free 'uni2prop[]' memory>
<...Free char property map memory>
<...Free 'uni2break[]' memory>
<...Free line break class map memory>
<...Free 'uni2comb[]' memory>
<...Free combining class map memory>

3.3. Language - Existing

3.3.1. language.h - Language header

Required Changes:

(1) Change definition of 'cups_lang_t' to correct length of
    'language[]' to 32 characters per [RFC3066] and [ISO639-2] and
    [ISO3166-1].

3.3.2. language.c - Language module

3.3.2.1. cupsLangEncoding() - Existing

[No Change]

3.3.2.2. cupsLangFlush() - Existing

[No Change]

3.3.2.3. cupsLangFree() - Existing

[No Change]

3.3.2.4.
cupsLangGet() - Existing

Required Changes:

(1) Change length of 'langname[]' and 'real[]' to 64 characters per
    [RFC3066] and potential length of encoding (charset) names;

(2) Change language string normalization to support:

    (a) 8-character language codes per [RFC3066] and 3-character
        language codes per [ISO639-2];

    (b) 8-character country codes per [RFC3066] and 3-character
        country codes per [ISO3166-1];

    (c) Support for 'i' (IANA registered) and 'x' (private) language
        prefixes per [RFC3066];

    (d) Invariant use of 'utf-8' for encoding in message catalog, but
        save actual requested encoding name for later use.

(3) Correct broken do/while statement for message catalog lookup
    (while condition is _never_ satisfied).

3.3.2.5. cupsLangPrintf() - New

extern int cupsLangPrintf(FILE *fp,             /* I - File to write */
    const cups_lang_t *lang,                    /* I - Language/locale */
    const cups_msg_t msg,                       /* I - Msg to format */
    ...);                                       /* I - Args to format */

3.3.2.6. cupsLangPuts() - New

extern int cupsLangPuts(FILE *fp,               /* I - File to write */
    const cups_lang_t *lang,                    /* I - Language/locale */
    const cups_msg_t msg);                      /* I - Msg to write */

3.3.2.7. cupsEncodingName() - New

extern char *cupsEncodingName(cups_encoding_t encoding);

3.4. Common Text Filter - Existing

3.4.1. textcommon.h - Common text filter header

Required changes:

(1) Revise 'lchar_t' as specified below, adding 'attrx' bit-mask for
    selected Unicode character properties;

(2) Revise 'lchar_t' as specified below, adding 'comblen' and
    'combch[]' for Unicode combining/attached chars (accents);

(3) Add 'COMBLEN_MAX' limit as specified below;

(4) Add 'ATTRX_...' selected Unicode character properties as specified
    below.

3.4.1.1.
lchar_t - Character/Attribute Structure

typedef struct lchar_str /**** Character / Attribute Structure ****/
{
  unsigned short ch;              /* Unicode Char as UCS-2 */
                                  /* or 8/16-bit Legacy Char */
  unsigned short attr;            /* Attributes of Char */
  unsigned short attrx;           /* Extended Attributes */
  unsigned short comblen;         /* Combining Char Count */
  unsigned short combch[8];       /* Combining Chars as UCS-2 */
} lchar_t;

'ch' is a 16-bit UCS-2 character or an 8/16-bit legacy char.

'attr' is the character attributes defined for the existing 'lchar_t'
structure (defined in 'textcommon.h').

'attrx' is the extended character attributes defined for future
selected Unicode character properties (see below).

'comblen' is the number of attached/combining characters.

'combch' is an array of 16-bit UCS-2 attached/combining characters.

Add to 'textcommon.h' constants:

    COMBLEN_MAX         8
    ATTRX_RIGHT2LEFT    0x0001

3.4.2. textcommon.c - Common text filter

Required Changes:

(1) Revise 'TextMain()' function as described below.

3.4.2.1. TextMain() - Existing

Required Changes:

[Ed Note: Pseudo code below needs more work on bidi handling.]
(1) In main loop at the _beginning_ of the 'default' clause, add the
    following code for combining marks:

    lchar_t *cp;

    cp = Page[line];
    cp += column;

    /*
     * Check for Unicode combining mark (accent)
     */
    if (UTF-8 && cupsUtf32CombiningClass(ch) > 0)
    {
      /*
       * Save Unicode combining mark in SAME character
       * (drop it if all COMBLEN_MAX slots are already in use)
       */
      if (cp->comblen >= COMBLEN_MAX)
        break;
      cp->combch[cp->comblen] = ch;
      cp->comblen ++;
      break;
    }

(2) In main loop _after_ combining chars section in 'default' clause,
    add the following code for Unicode bidi control characters:

    cups_bidicat_t bidicat;

    /*
     * Check for Unicode bidi control character
     */
    if (UTF-8)
    {
      bidicat = (cups_bidicat_t)
          cupsUtf32CharacterProperty(ch, CUPS_PROP_BIDI_CATEGORY);

      if ((bidicat == CUPS_BIDI_LRE)     /* Left-to-Right Embedding */
       || (bidicat == CUPS_BIDI_LRO)     /* Left-to-Right Override */
       || (bidicat == CUPS_BIDI_RLE)     /* Right-to-Left Embedding */
       || (bidicat == CUPS_BIDI_RLO)     /* Right-to-Left Override */
       || (bidicat == CUPS_BIDI_PDF))    /* Pop Directional Format */
      {
        /* Do bidi stuff here with memory for NEXT char's direction */
        /* Discard bidi control character and break */
      }

      if ((bidicat == CUPS_BIDI_R)       /* Right-to-Left Hebrew */
       || (bidicat == CUPS_BIDI_AL))     /* Right-to-Left Arabic */
      {
        /* Set attrx for right-to-left */
        cp->attrx |= ATTRX_RIGHT2LEFT;
      }
    }

3.4.2.2. compare_keywords() - Existing

[No Change]

3.4.2.3. getutf8() - Existing

[No Change]

[Ed Note: Future - allow 20-bit UTF-32 code points - requires updates
in both 'textcommon.c' and 'texttops.c' for extended PostScript.]

3.5. Text to PostScript Filter - Existing

3.5.1. texttops.c - Text to PostScript filter

Required Changes:

(1) Revise local 'write_string()' function as described below.

3.5.1.1. main() - Existing

[No Change]

3.5.1.2. WriteEpilogue () - Existing

[No Change]

3.5.1.3.
WritePage () - Existing

[No Change]

3.5.1.4. WriteProlog () - Existing

[No Change]

3.5.1.5. write_line() - Existing

[No Change]

3.5.1.6. write_string() - Existing

Required Changes:

(1) At the _beginning_ of Multiple Fonts section, _replace_ the while()
    loop and surrounding 'putchar()' calls with the following code:

    for (; len > 0; len --, s ++)
    {
      utf32_t decstr[COMBLEN_MAX * 2];
      utf32_t cmpstr[COMBLEN_MAX * 2];
      int cmplen;
      int i;

      if (s->comblen == 0)
      {
        printf("<%04x>", Chars[s->ch]);
        continue;
      }

      /*
       * Normalize decomposed Unicode character to NFKC
       * (compatibility decomposition, then canonical composition)
       */
      decstr[0] = (utf32_t) s->ch;
      for (i = 0; i < s->comblen; i ++)
        decstr[i + 1] = (utf32_t) s->combch[i];
      decstr[i + 1] = 0;        /* terminate AFTER last combining char */
      cmplen = cupsUtf32Normalize (&cmpstr[0], &decstr[0],
                                   COMBLEN_MAX * 2, CUPS_NORM_NFKC);
      if (cmplen < 1)
        continue;

      /*
       * Write combining chars, then composed base, to same location
       */
      for (i = 1; i < cmplen; i ++)
      {
        printf("<%04x>", Chars[(int) cmpstr[i]]);

        /*
         * Superimpose glyphs by backing up one column width
         */
        printf (" -%.3f ", (72.0f / (float) CharsPerInch));
      }
      printf("<%04x>", Chars[(int) cmpstr[0]]);
    }

[Ed Note: Future - Bidi support - When writing Unicode characters
(checking for explicit bidi) convert input string (lchar_t) to display
order???]

3.5.1.7. write_text() - Existing

[No Change]

APPENDIX A Glossary

A. Glossary

Abstract Character: A unit of information used for the organization,
control, or representation of textual data.

Accent Mark: A mark placed above, below, or to the side of a character
to alter its phonetic value (also 'diacritic').

Alphabet: A collection of symbols that, in the context of a particular
written language, represent the sounds of that language.

Base Character: A character that does not graphically combine with
preceding characters, and that is neither a control nor a format
character.

Basic Multilingual Plane: The Unicode (or UCS) code values 0x0000
through 0xFFFF, specified by [ISO10646] (also 'Plane 0').

BIDI: Abbreviation for Bidirectional, in reference to mixed
left-to-right and right-to-left text.

Bidirectional Display: The process or result of mixing left-to-right
oriented text and right-to-left oriented text in a single line.

Big-endian: A computer architecture that stores multiple-byte numerical
values with the most significant byte (MSB) values first.

BMP: Abbreviation for Basic Multilingual Plane.

BOM: Acronym for byte order mark (also 'ZWNBSP').

Byte Order Mark: The Unicode character U+FEFF Zero Width No-Break Space
(ZWNBSP) when used to indicate the byte order of text.

Canonical: (1) Conforming to the general rules for encoding -- that is,
not compressed, compacted, or in any other form specified by a higher
protocol. (2) Characteristic of a normative mapping and form of
equivalence.

Canonical Decomposition: The decomposition of a character that results
from recursively applying the canonical mappings defined in the Unicode
Character Database until no characters can be further decomposed, then
reordering nonspacing marks according to section 3.10 of [UNICODE3.2].

Canonical Equivalent: Two characters are canonical equivalents if their
full canonical decompositions are identical.

Case: (1) Feature of certain alphabets where the letters have two
distinct forms. These variants are called the 'uppercase' letter (also
known as 'capital' or 'majuscule') and the 'lowercase' letter (also
known as 'small' or 'minuscule'). (2) Normative property of Unicode
characters, consisting of uppercase, lowercase, and titlecase.

Character: (1) The smallest component of written language that has
semantic value; refers to the abstract meaning and/or shape, rather
than a specific shape (see also 'glyph'). (2) Synonym for 'abstract
character'. (3) The basic unit of encoding for the Unicode character
encoding. (4) The English name for the ideographic written elements of
Chinese origin (see 'ideograph').

Character Encoding Form (CEF): Mapping from a character set definition
to the actual bits used to represent the data.

Character Encoding Scheme (CES): A 'character encoding form' plus byte
serialization. [UNICODE3.2] defines seven character encoding schemes:
UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.

Character Properties: A set of property names and property values
associated with individual characters defined in [UNICODE3.2].

Character Repertoire: (1) The collection of characters included in a
character set. (2) The SUBSET of characters included in a large
character set, e.g., [UNICODE3.2], that are necessary to support a
complete mapping to another smaller character set, e.g., ISO8859-1
(also called 'Latin-1').

Character Set: A collection of elements used to represent textual
information.

Coded Character Set: A character set in which each character is
assigned a numeric code value. Frequently abbreviated as 'character
set', 'charset', or 'code set'.

Code Point: (1) A numerical index (or position) in an encoding table
used for encoding characters. (2) Synonym for 'Unicode scalar value'.

Collation: The process of ordering units of textual information.
Collation is usually specific to a particular language. Also known as
'alphabetizing' or 'alphabetic sorting'.

Combining Character: A character that graphically combines with a
preceding 'base character'. The combining character is said to 'apply'
to that base character. (See also 'nonspacing mark'.)

Compatibility: (1) Consistency with existing practice or preexisting
character encoding standards.
(2) Characteristic of a normative mapping and form of equivalence (see
'compatibility decomposition').

Compatibility Character: A character that has a compatibility
decomposition.

Compatibility Decomposition: The decomposition of a character that
results from recursively applying BOTH the compatibility mappings AND
the canonical mappings found in the Unicode Character Database until no
characters can be further decomposed, then reordering nonspacing marks
according to section 3.10 of [UNICODE3.2].

Compatibility Equivalent: Two characters are compatibility equivalents
if their full compatibility decompositions are identical.

Composed Character: (See 'decomposable character'.)

DBCS: Acronym for 'double-byte character set'.

Decomposable Character: A character that is equivalent to a sequence of
one or more other characters, according to the decomposition mappings
found in [UNICODE3.2]. It may also be known as a 'precomposed
character' or a 'composite character'.

Decomposition: (1) The process of separating or analyzing a text
element into component units. (2) A sequence of one or more characters
that is equivalent to a 'decomposable character'.

Diacritic: (See 'accent mark'.)

Double-Byte Character Set (DBCS): One of a number of character sets
defined for representing Chinese, Japanese, or Korean text (for
example, JIS X 0208-1990). These character sets are often encoded in
such a way as to allow double-byte character encodings to be mixed with
single-byte character encodings. (See also 'multiple-byte character
set'.)

Font: A collection of glyphs used for visual depiction of character
data.

FSS-UTF: Abbreviation for 'File System Safe UCS Transformation Format',
originally published by X/Open. Now called 'UTF-8'.

Fullwidth: Characters of East Asian character sets whose glyph image
extends across the entire character display cell.
In legacy character sets, fullwidth characters are normally encoded in
two or three bytes.

Glyph: (1) An abstract form that represents one or more glyph images.
(2) A synonym for 'glyph image'.

Glyph Image: The actual, concrete image of a glyph representation
having been rasterized or otherwise imaged onto some display surface.

Halfwidth: Characters of East Asian character sets whose glyph image
occupies half of the character display cell. In legacy character sets,
halfwidth characters are normally encoded in a single byte.

Han Characters: Ideographic characters of Chinese origin.

Hangul: The name of the script used to write the Korean language.

High-Surrogate: A Unicode code value in the range U+D800 to U+DBFF.

Hiragana: One of two standard syllabaries associated with the Japanese
writing system. Used to write particles, grammatical affixes, and words
that have no 'kanji' form.

IANA: Internet Assigned Numbers Authority.

Ideograph: (1) Any symbol that denotes an idea (or meaning) in contrast
to a sound or pronunciation (for example, a 'smiley face'). (2) A
common term used to refer to Han characters.

IPA: International Phonetic Alphabet.

IRG: Abbreviation for Ideographic Rapporteur Group, a subgroup of
ISO/IEC JTC1/SC2/WG2 (who work on Han unification and submission of new
Han characters for inclusion in revised versions of Unicode/ISO 10646).

Jamo: The Korean name for a single letter of the Hangul script. Jamos
are used to form Hangul syllables.

Joiner: An invisible character that affects the joining behavior of
surrounding characters.

JTC1: Abbreviation for Joint Technical Committee 1 of ISO/IEC,
responsible for information technology standardization.

Kana: The name of a primarily syllabic script used by the Japanese
writing system, composed of 'hiragana' and 'katakana'.

Kanji: The Japanese name for Han characters; derived from the Chinese
word 'hanzi'.
Also romanized as 'kanzi'.

Katakana: One of two standard syllabaries associated with the Japanese
writing system, typically used in representation of borrowed
vocabulary.

Ligature: A glyph representing a combination of two or more characters,
for example in the Latin script the ligature between 'f' and 'i' as
'fi'.

Logical Order: The order in which text is typed on a keyboard. For the
most part, logical order corresponds to phonetic order.

Lowercase: (See 'case'.)

Low-Surrogate: A Unicode code value in the range U+DC00 to U+DFFF.

MBCS: Acronym for 'multiple-byte character set'.

Multiple-Byte Character Set (MBCS): A character set encoded with a
variable number of bytes per character. Many large character sets have
been defined as MBCS so as to keep strict compatibility with the
US-ASCII subset and/or [ISO2022].

Normalization: Transformation of data to a normal form.

Plain Text: Computer-encoded text that consists ONLY of a sequence of
code values from a given standard, with no other formatting or
structural information.

Precomposed Character: (See 'decomposable character'.)

Rendering: (1) The process of selecting and laying out glyphs for the
purpose of depicting characters. (2) The process of making glyphs
visible on a display device.

Repertoire: (See 'character repertoire'.)

Replacement Character: A character used as a substitute for an
uninterpretable character from another encoding. [UNICODE3.2] defines
U+FFFD REPLACEMENT CHARACTER for this function.

Rich Text: The result of adding information such as font data, color,
formatting, phonetic annotations, etc. to 'plain text' (e.g., HTML).

SBCS: Acronym for 'single-byte character set'.

Scalar Value: (See 'Unicode scalar value'.)

Script: A collection of symbols used to represent textual information
in one or more writing systems.

Single-Byte Character Set (SBCS): One of a number of one-byte character
sets defined for representing (mostly) Western languages (for example,
ISO 8859-1 'Latin-1'). These character sets are often encoded in such a
way as to be strict supersets of 7-bit [US-ASCII].

Sorting: (See 'collation'.)

Transcoding: Conversion of character data between different character
sets.

Transformation Format: A mapping from a coded character sequence to a
unique sequence of code values (typically octets).

UCS: Abbreviation for Universal Character Set, specified by [ISO10646].

UCS-2: UCS encoded in 2 octets, specified by [ISO10646].

UCS-4: UCS encoded in 4 octets, specified by [ISO10646].

Unicode Scalar Value: A number from 0 to 0x10FFFF.

Uppercase: (See 'case'.)

UTF: Abbreviation for Unicode (or UCS) Transformation Format.

UTF-8: Unicode (or UCS) Transformation Format, 8-bit encoding form.
Serializes a Unicode (or UCS) scalar value (code point) as a sequence
of one to four octets. Does NOT suffer from byte-ordering ambiguities.

UTF-16: Unicode (or UCS) Transformation Format, 16-bit encoding form.
Serializes a Unicode (or UCS) scalar value (code point) as a sequence
of two or four octets (a single 16-bit code value, or a high-surrogate
followed by a low-surrogate), in either big-endian or little-endian
format. Uses an (optional) prefix of BOM to disambiguate byte-ordering.

UTF-32: Unicode (or UCS) Transformation Format, 32-bit encoding form.
Serializes a Unicode (or UCS) scalar value (code point) as a sequence
of four octets, in either big-endian or little-endian format. Uses an
(optional) prefix of BOM to disambiguate byte-ordering.

Zero Width: Characteristic of some spaces or format control characters
that do not advance text along the horizontal baseline.
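
As an informal illustration of the 'UTF-8', 'UTF-16', 'High-Surrogate'
and 'Low-Surrogate' entries above, the following minimal sketch shows
how one Unicode scalar value might be serialized in each encoding form.
The function names 'sketch_utf8_encode()' and 'sketch_utf16_encode()'
are hypothetical and are NOT part of the CUPS API described in this
document; the '-1' error return merely mirrors the convention used by
the transcoding and normalization functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (assumption, not a CUPS function):
 * encode one Unicode scalar value as UTF-8.
 * Returns the number of octets written (1 to 4), or -1 if 'ch' is a
 * surrogate code value or is greater than 0x10FFFF.
 */
int sketch_utf8_encode(uint32_t ch, unsigned char *out)
{
  assert(out != NULL);

  if ((ch >= 0xD800 && ch <= 0xDFFF) || ch > 0x10FFFF)
    return (-1);

  if (ch < 0x80)                        /* 7 bits  -> 1 octet  */
  {
    out[0] = (unsigned char)ch;
    return (1);
  }
  if (ch < 0x800)                       /* 11 bits -> 2 octets */
  {
    out[0] = (unsigned char)(0xC0 | (ch >> 6));
    out[1] = (unsigned char)(0x80 | (ch & 0x3F));
    return (2);
  }
  if (ch < 0x10000)                     /* 16 bits -> 3 octets */
  {
    out[0] = (unsigned char)(0xE0 | (ch >> 12));
    out[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (ch & 0x3F));
    return (3);
  }
                                        /* 21 bits -> 4 octets */
  out[0] = (unsigned char)(0xF0 | (ch >> 18));
  out[1] = (unsigned char)(0x80 | ((ch >> 12) & 0x3F));
  out[2] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
  out[3] = (unsigned char)(0x80 | (ch & 0x3F));
  return (4);
}

/*
 * Hypothetical helper (assumption, not a CUPS function):
 * encode one Unicode scalar value as UTF-16 code values.
 * Returns 1 for a BMP character, 2 for a high/low surrogate pair,
 * or -1 if 'ch' is not a valid scalar value.
 */
int sketch_utf16_encode(uint32_t ch, uint16_t *out)
{
  assert(out != NULL);

  if ((ch >= 0xD800 && ch <= 0xDFFF) || ch > 0x10FFFF)
    return (-1);

  if (ch < 0x10000)                     /* BMP: one 16-bit value */
  {
    out[0] = (uint16_t)ch;
    return (1);
  }

  ch -= 0x10000;                        /* 20 bits remain */
  out[0] = (uint16_t)(0xD800 | (ch >> 10));    /* high-surrogate */
  out[1] = (uint16_t)(0xDC00 | (ch & 0x3FF));  /* low-surrogate  */
  return (2);
}
```

A caller would pass a buffer of at least four octets (UTF-8) or two
16-bit values (UTF-16) and check for a '-1' return before using the
output. Byte order (big-endian or little-endian) only becomes relevant
when the 16-bit values are written out as octets, which this sketch
does not attempt.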