   1
   2
   3     WORKING DRAFT                                               Ira McDonald
   4     <i18n_sdd.txt>                                            High North Inc
   5
   6                       Common UNIX Printing System ("CUPS")
   7              Internationalization Software Design Description v0.3
   8
   9        Copyright (C) Easy Software Products (2002) - All Rights Reserved
  10
  11
  12     Status of this Document
  13
  14     This document is an unapproved working draft and is incomplete in some
  15     sections (see 'Ed Note:' comments).
  16
  17
  18     Abstract
  19
  20     This document provides general information and high-level design for the
  21     Internationalization extensions for the Common UNIX Printing System
  22     ("CUPS") Version 1.2.  This document also provides C language header
  23     files and high-level pseudo-code for all new modules and external
  24     functions.
  25
  26
  57     McDonald                     June 20, 2002                      [Page 1]
  58 \f
  59            CUPS Internationalization Software Design Description v0.3
  60
  61                                Table of Contents
  62
  63     1.  Scope ......................................................       4
  64       1.1.  Identification .........................................       4
  65       1.2.  System Overview ........................................       4
  66       1.3.  Document Overview ......................................       4
  67     2.  References .................................................       5
  68       2.1.  CUPS References ........................................       5
  69       2.2.  Other Documents ........................................       5
  70     3.  Design Overview ............................................       7
  71       3.1.  Transcoding - New ......................................       7
  72         3.1.1.  transcode.h - Transcoding header ...................       7
  73           3.1.1.1.  cups_cmap_t - SBCS Charmap Structure ...........      10
  74           3.1.1.2.  cups_dmap_t - DBCS Charmap Structure ...........      11
  75         3.1.2.  transcode.c - Transcoding module ...................      11
  76           3.1.2.1.  cupsUtf8ToCharset() ............................      11
  77           3.1.2.2.  cupsCharsetToUtf8() ............................      12
  78           3.1.2.3.  cupsUtf8ToUtf16() ..............................      12
  79           3.1.2.4.  cupsUtf16ToUtf8() ..............................      12
  80           3.1.2.5.  cupsUtf8ToUtf32() ..............................      12
  81           3.1.2.6.  cupsUtf32ToUtf8() ..............................      13
  82           3.1.2.7.  cupsUtf16ToUtf32() .............................      13
  83           3.1.2.8.  cupsUtf32ToUtf16() .............................      13
  84           3.1.2.9.  Transcoding Utility Functions ..................      13
  85             3.1.2.9.1.  cupsCharmapGet() ...........................      14
  86             3.1.2.9.2.  cupsCharmapFree() ..........................      14
  87             3.1.2.9.3.  cupsCharmapFlush() .........................      14
  88       3.2.  Normalization - New ....................................      15
  89         3.2.1.  normalize.h - Normalization header .................      15
  90           3.2.1.1.  cups_normmap_t - Normalize Map Structure .......      22
  91           3.2.1.2.  cups_foldmap_t - Case Fold Map Structure .......      22
  92           3.2.1.3.  cups_propmap_t - Char Property Map Structure ...      23
  93           3.2.1.4.  cups_prop_t - Char Property Structure ..........      23
  94           3.2.1.5.  cups_breakmap_t - Line Break Map Structure .....      23
  95           3.2.1.6.  cups_combmap_t - Combining Class Map Structure .      24
  96           3.2.1.7.  cups_comb_t - Combining Class Structure ........      24
  97         3.2.2.  normalize.c - Normalization module .................      24
  98           3.2.2.1.  cupsUtf8Normalize() ............................      24
  99           3.2.2.2.  cupsUtf32Normalize() ...........................      25
 100           3.2.2.3.  cupsUtf8CaseFold() .............................      25
 101           3.2.2.4.  cupsUtf32CaseFold() ............................      26
 102           3.2.2.5.  cupsUtf8CompareCaseless() ......................      26
 103           3.2.2.6.  cupsUtf32CompareCaseless() .....................      26
 104           3.2.2.7.  cupsUtf8CompareIdentifier() ....................      27
 105           3.2.2.8.  cupsUtf32CompareIdentifier() ...................      27
 106           3.2.2.9.  cupsUtf32CharacterProperty() ...................      27
 107           3.2.2.10.  Normalization Utility Functions ...............      28
 108             3.2.2.10.1.  cupsNormalizeMapsGet() ....................      28
 109             3.2.2.10.2.  cupsNormalizeMapsFree() ...................      28
 110             3.2.2.10.3.  cupsNormalizeMapsFlush() ..................      28
 111       3.3.  Language - Existing ....................................      29
 112         3.3.1.  language.h - Language header .......................      29
 118         3.3.2.  language.c - Language module .......................      29
 119           3.3.2.1.  cupsLangEncoding() - Existing ..................      29
 120           3.3.2.2.  cupsLangFlush() - Existing .....................      29
 121           3.3.2.3.  cupsLangFree() - Existing ......................      29
 122           3.3.2.4.  cupsLangGet() - Existing .......................      30
 123           3.3.2.5.  cupsLangPrintf() - New .........................      30
 124           3.3.2.6.  cupsLangPuts() - New ...........................      30
 125           3.3.2.7.  cupsEncodingName() - New .......................      31
 126       3.4.  Common Text Filter - Existing ..........................      31
 127         3.4.1.  textcommon.h - Common text filter header ...........      31
 128           3.4.1.1.  lchar_t - Character/Attribute Structure ........      31
 129         3.4.2.  textcommon.c - Common text filter ..................      32
 130           3.4.2.1.  TextMain() - Existing ..........................      32
 131           3.4.2.2.  compare_keywords() - Existing ..................      33
 132           3.4.2.3.  getutf8() - Existing ...........................      33
 133       3.5.  Text to PostScript Filter - Existing ...................      33
 134         3.5.1.  texttops.c - Text to PostScript filter .............      33
 135           3.5.1.1.  main() - Existing ..............................      33
 136           3.5.1.2.  WriteEpilogue () - Existing ....................      34
 137           3.5.1.3.  WritePage () - Existing ........................      34
 138           3.5.1.4.  WriteProlog () - Existing ......................      34
 139           3.5.1.5.  write_line() - Existing ........................      34
 140           3.5.1.6.  write_string() - Existing ......................      34
 141           3.5.1.7.  write_text() - Existing ........................      35
 142     A.  Glossary ...................................................   A-1
 174
 175
 176
 177     1.  Scope
 178
 179
 180
 181     1.1.  Identification
 182
 183     This document provides general information and high-level design for the
 184     Internationalization extensions for the Common UNIX Printing System
 185     ("CUPS") Version 1.2.  This document also provides C language header
 186     files and high-level pseudo-code for all new modules and external
 187     functions.
 188
 189
 190     1.2.  System Overview
 191
 192     The CUPS Internationalization extensions provide multilingual support
 193     via Unicode 3.2:2002 [UNICODE3.2] / ISO-10646-1:2000 [ISO10646-1] and a
 194     suite of local character sets (including all adopted parts of ISO-8859
 195     and many MS Windows code pages) for CUPS 1.2.
 196
 197     The CUPS Internationalization extensions support UTF-8 [RFC2279] as the
 198     common stream-oriented representation of all character data.  UTF-8 is
 199     defined in [ISO10646-1] and is further constrained (for integrity and
 200     security) by [UNICODE3.2].
 201
 202     UTF-8 is the native character set of LDAPv3 [RFC2251], SLPv2 [RFC2608],
 203     IPP/1.1 [RFC2910] [RFC2911], and many other Internet protocols.
 204
 205
 206     1.3.  Document Overview
 207
 208
 209     This software design description document is organized into the
 210     following sections:
 211
 212     o   1 - Scope
 213     o   2 - References
 214     o   3 - Design Overview
 215     o   A - Glossary
 231
 232
 233
 234     2.  References
 235
 236
 237
 238     2.1.  CUPS References
 239
 240     See:  Section 2.1 'CUPS Documentation' of CUPS Software Design
 241     Description.
 242
 243
 244     2.2.  Other Documents
 245
 246     The following non-CUPS documents are referenced by this document.
 247
 248     [ANSI-X3.4] ANSI Coded Character Set - 7-bit American National Standard
 249     Code for Information Interchange, ANSI X3.4, 1986 (aka US-ASCII).
 250
 251     [GB2312] Code of Chinese Graphic Character Set for Information
 252     Interchange, Primary Set, GB 2312, 1980.
 253
 254     [ISO639-1] Codes for the Representation of Names of Languages -- Part 1:
 255     Alpha-2 Code, ISO/IEC 639-1, 2000.
 256
 257     [ISO639-2] Codes for the Representation of Names of Languages -- Part 2:
 258     Alpha-3 Code, ISO/IEC 639-2, 1998.
 259
 260     [ISO646] Information Technology - ISO 7-bit Coded Character Set for
 261     Information Interchange, ISO/IEC 646, 1991.
 262
 263     [ISO2022] Information Processing - ISO 7-bit and 8-bit Coded Character
 264     Sets - Code Extension Techniques, ISO/IEC 2022, 1994.  (Technically
 265     identical to ECMA-35.)
 266
 267     [ISO3166-1] Codes for the Representation of Names of Countries and their
  268     Subdivisions, Part 1:  Country Codes, ISO 3166-1, 1997.
 269
 270     [ISO8859] Information Processing - 8-bit Single-Byte Code Graphic
 271     Character Sets, ISO/IEC 8859-n, 1987-2001.
 272
 273     [ISO10646-1] Information Technology - Universal Multiple-Octet Code
 274     Character Set (UCS) - Part 1:  Architecture and Basic Multilingual
 275     Plane, ISO/IEC 10646-1, September 2000.
 276
 277     [ISO10646-2] Information Technology - Universal Multiple-Octet Code
 278     Character Set (UCS) - Part 2:  Supplemental Planes, ISO/IEC 10646-2,
 279     January 2001.
 280
 281     [RFC2119] Bradner.  Key words for use in RFCs to Indicate Requirement
 282     Levels, RFC 2119, March 1997.
 283
  290     [RFC2251] Wahl, Howes, Kille.  Lightweight Directory Access Protocol
 291     Version 3 (LDAPv3), RFC 2251, December 1997.
 292
 293     [RFC2277] Alvestrand.  IETF Policy on Character Sets and Languages, RFC
 294     2277, January 1998.
 295
 296     [RFC2279] Yergeau.  UTF-8, a Transformation Format of ISO 10646, RFC
 297     2279, January 1998.
 298
 299     [RFC2608] Guttman, Perkins, Veizades, Day.  Service Location Protocol
 300     Version 2 (SLPv2), RFC 2608, June 1999.
 301
 302     [RFC2910] Herriot, Butler, Moore, Turner, Wenn.  Internet Printing
 303     Protocol/1.1:  Encoding and Transport, RFC 2910, September 2000.
 304
 305     [RFC2911] Hastings, Herriot, deBry, Isaacson, Powell.  Internet Printing
 306     Protocol/1.1:  Model and Semantics, RFC 2911, September 2000.
 307
 308     [UNICODE3.0] Unicode Consortium, Unicode Standard Version 3.0,
 309     Addison-Wesley Developers Press, ISBN 0-201-61633-5, 2000.
 310
 311     [UNICODE3.1] Unicode Consortium, Unicode Standard Version 3.1 (UAX-27),
 312     May 2001.
 313
 314     [UNICODE3.2] Unicode Consortium, Unicode Standard Version 3.2 (UAX-28),
 315     March 2002.
 316
 317     [US-ASCII] See [ANSI-X3.4] above.
 345
 346
 347
 348     3.  Design Overview
 349
 350     The CUPS Internationalization extensions are composed of several header
 351     files and modules which extend the Language functions in the existing
 352     CUPS Application Programmers Interface (API).
 353
 354
 355     3.1.  Transcoding - New
 356
 357     Initially, the CUPS Internationalization extensions will only support
 358     SBCS (single-byte character set) transcoding.  But the design allows
 359     future support for DBCS (double-byte character set) transcoding for CJK
 360     (Chinese/Japanese/Korean) languages and the MBCS (multiple-byte
 361     character set) compound sets that use escapes for charset switching.
 362
  363     In order to reduce code size and increase performance, all conventional
  364     'mapping files' (tables of values in legacy character sets with their
 365     corresponding Unicode scalar values) will ALSO be sorted and stored in
 366     memory as reverse maps (for efficient conversion from Unicode scalar
 367     values to their corresponding legacy character set values).  Transcoding
 368     will be done directly by 2-level lookup (without any searching or
 369     sorting).
 370
 371     [Ed Note:  CJK languages will be fairly costly in mapping table sizes,
 372     because they have thousands (or tens of thousands) of codepoints.]
 373
 374
 375
 376     3.1.1.  transcode.h - Transcoding header
 377
 378     /*
 379      * "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
 380      *
 381      *   Transcoding support for the Common UNIX Printing System (CUPS).
 382      *
 383      *   Copyright 1997-2002 by Easy Software Products.
 384      *
 385      *   These coded instructions, statements, and computer programs are
 386      *   the property of Easy Software Products and are protected by Federal
 387      *   copyright law.  Distribution and use rights are outlined in the
 388      *   file "LICENSE.txt" which should have been included with this file.
 389      *   If this file is missing or damaged please contact Easy Software
 390      *   Products at:
 391      *
 392      *       Attn: CUPS Licensing Information
 393      *       Easy Software Products
 394      *       44141 Airport View Drive, Suite 204
 395      *       Hollywood, Maryland 20636-3111 USA
 396      *
 397      *       Voice: (301) 373-9603
 403      *       EMail: cups-info@cups.org
 404      *         WWW: http://www.cups.org
 405      */
 406
 407     #ifndef _CUPS_TRANSCODE_H_
 408     #  define _CUPS_TRANSCODE_H_
 409
 410     /*
 411      * Include necessary headers...
 412      */
 413
 414     #  include "cups/language.h"
 415
 416     #  ifdef __cplusplus
 417     extern "C" {
 418     #  endif /* __cplusplus */
 419
 420     /*
 421      * Types...
 422      */
 423
 424     typedef unsigned char  utf8_t;  /* UTF-8 Unicode/ISO-10646 code unit */
 425     typedef unsigned short utf16_t; /* UTF-16 Unicode/ISO-10646 code unit */
 426     typedef unsigned long  utf32_t; /* UTF-32 Unicode/ISO-10646 code unit */
 427     typedef unsigned short ucs2_t;  /* UCS-2 Unicode/ISO-10646 code unit */
 428     typedef unsigned long  ucs4_t;  /* UCS-4 Unicode/ISO-10646 code unit */
 429     typedef unsigned char  sbcs_t;  /* SBCS Legacy 8-bit code unit */
 430     typedef unsigned short dbcs_t;  /* DBCS Legacy 16-bit code unit */
 431
 432     /*
 433      * Structures...
 434      */
 435
 436     typedef struct cups_cmap_str    /**** SBCS Charmap Cache Structure ****/
 437     {
 438       struct cups_cmap_str  *next;          /* Next charmap in cache */
 439       int                   used;           /* Number of times entry used */
 440       cups_encoding_t       encoding;       /* Legacy charset encoding */
 441       ucs2_t                char2uni[256];  /* Map Legacy SBCS -> UCS-2 */
 442       sbcs_t                *uni2char[256]; /* Map UCS-2 -> Legacy SBCS */
 443     } cups_cmap_t;
 444
 445     #if 0
 446     typedef struct cups_dmap_str    /**** DBCS Charmap Cache Structure ****/
 447     {
 448       struct cups_dmap_str  *next;          /* Next charmap in cache */
 449       int                   used;           /* Number of times entry used */
 450       cups_encoding_t       encoding;       /* Legacy charset encoding */
 451       ucs2_t                *char2uni[256]; /* Map Legacy DBCS -> UCS-2 */
 452       dbcs_t                *uni2char[256]; /* Map UCS-2 -> Legacy DBCS */
 453     } cups_dmap_t;
 454     #endif
 455
 461     /*
 462      * Constants...
 463      */
 464     #define CUPS_MAX_USTRING    1024    /* Maximum size of Unicode string */
 465
 466     /*
 467      * Globals...
 468      */
 469
 470     extern int      TcFixMapNames;  /* Fix map names to Unicode names */
 471     extern int      TcStrictUtf8;   /* Non-shortest-form is illegal */
 472     extern int      TcStrictUtf16;  /* Invalid surrogate pair is illegal */
 473     extern int      TcStrictUtf32;  /* Greater than 0x10FFFF is illegal */
 474     extern int      TcRequireBOM;   /* Require BOM for little/big-endian */
 475     extern int      TcSupportBOM;   /* Support BOM for little/big-endian */
 476     extern int      TcSupport8859;  /* Support ISO 8859-x repertoires */
 477     extern int      TcSupportWin;   /* Support Windows-x repertoires */
 478     extern int      TcSupportCJK;   /* Support CJK (Asian) repertoires */
 479
 480     /*
 481      * Prototypes...
 482      */
 483
 484     /*
 485      * Utility functions for character set maps
 486      */
 487     extern void     *cupsCharmapGet(const cups_encoding_t encoding);
 488                                                     /* I - Encoding */
 489     extern void     cupsCharmapFree(const cups_encoding_t encoding);
 490                                                     /* I - Encoding */
 491     extern void     cupsCharmapFlush(void);
 492
 493     /*
 494      * Convert UTF-8 to and from legacy character set
 495      */
 496     extern int      cupsUtf8ToCharset(char *dest,   /* O - Target string */
 497                         const utf8_t *src,          /* I - Source string */
 498                         const int maxout,           /* I - Max output */
 499                         cups_encoding_t encoding);  /* I - Encoding */
 500     extern int      cupsCharsetToUtf8(utf8_t *dest, /* O - Target string */
 501                         const char *src,            /* I - Source string */
 502                         const int maxout,           /* I - Max output */
 503                         cups_encoding_t encoding);  /* I - Encoding */
 504
 505     /*
 506      * Convert UTF-8 to and from UTF-16
 507      */
 508     extern int      cupsUtf8ToUtf16(utf16_t *dest,  /* O - Target string */
 509                         const utf8_t *src,          /* I - Source string */
 510                         const int maxout);          /* I - Max output */
 511     extern int      cupsUtf16ToUtf8(utf8_t *dest,   /* O - Target string */
 517                         const utf16_t *src,         /* I - Source string */
 518                         const int maxout);          /* I - Max output */
 519
 520     /*
 521      * Convert UTF-8 to and from UTF-32
 522      */
 523     extern int      cupsUtf8ToUtf32(utf32_t *dest,  /* O - Target string */
 524                         const utf8_t *src,          /* I - Source string */
 525                         const int maxout);          /* I - Max output */
 526     extern int      cupsUtf32ToUtf8(utf8_t *dest,   /* O - Target string */
 527                         const utf32_t *src,         /* I - Source string */
 528                         const int maxout);          /* I - Max output */
 529
 530     /*
 531      * Convert UTF-16 to and from UTF-32
 532      */
 533     extern int      cupsUtf16ToUtf32(utf32_t *dest, /* O - Target string */
 534                         const utf16_t *src,         /* I - Source string */
 535                         const int maxout);          /* I - Max output */
 536     extern int      cupsUtf32ToUtf16(utf16_t *dest, /* O - Target string */
 537                         const utf32_t *src,         /* I - Source string */
 538                         const int maxout);          /* I - Max output */
 539
 540     #  ifdef __cplusplus
 541     }
 542     #  endif /* __cplusplus */
 543
 544     #endif /* !_CUPS_TRANSCODE_H_ */
 545
 546     /*
 547      * End of "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
 548      */
 549
 550
 551
 552     3.1.1.1.  cups_cmap_t - SBCS Charmap Structure
 553
 554     typedef struct cups_cmap_str    /**** SBCS Charmap Cache Structure ****/
 555     {
 556       struct cups_cmap_str  *next;          /* Next charset map in cache */
 557       int                   used;           /* Number of times entry used */
 558       cups_encoding_t       encoding;       /* Legacy charset encoding */
 559       ucs2_t                char2uni[256];  /* Map Legacy SBCS -> UCS-2 */
 560       sbcs_t                *uni2char[256]; /* Map UCS-2 -> Legacy SBCS */
 561     } cups_cmap_t;
 562
 563     'char2uni[]' is a (complete) array of UCS-2 values that supports direct
 564     one-level lookup from an input SBCS legacy charset code point, for use
 565     by 'cupsCharsetToUtf8()'.
 566
 567     'uni2char[]' is a (sparse) array of pointers to arrays of (256 each)
 568     SBCS values, that supports direct two-level lookup from an input UCS-2
 574     code point, for use by 'cupsUtf8ToCharset()'.
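
The two-level reverse lookup described above can be sketched as follows.
This is a minimal illustration and not the CUPS implementation: the names
'sketch_cmap_t', 'sketch_cmap_add()', and 'sketch_cmap_reverse()' are
hypothetical, and the use of 0 to mean "no mapping" is an assumption made
only for this sketch.

```c
#include <stddef.h>
#include <stdlib.h>

typedef unsigned short sketch_ucs2_t;   /* stands in for ucs2_t */
typedef unsigned char  sketch_sbcs_t;   /* stands in for sbcs_t */

typedef struct
{
  sketch_ucs2_t char2uni[256];    /* Legacy byte -> UCS-2 (complete) */
  sketch_sbcs_t *uni2char[256];   /* UCS-2 high byte -> row of 256 (sparse) */
} sketch_cmap_t;

/* Record one mapping pair in both directions; 0 on success, -1 if no memory */
int sketch_cmap_add(sketch_cmap_t *map, sketch_sbcs_t legacy,
                    sketch_ucs2_t unichar)
{
  unsigned hi = (unichar >> 8) & 0xff;  /* First-level index */
  unsigned lo = unichar & 0xff;         /* Second-level index */

  map->char2uni[legacy] = unichar;

  if (map->uni2char[hi] == NULL)
  {
    /* Rows are allocated only for high bytes that are actually used */
    map->uni2char[hi] = calloc(256, sizeof(sketch_sbcs_t));
    if (map->uni2char[hi] == NULL)
      return (-1);
  }

  map->uni2char[hi][lo] = legacy;
  return (0);
}

/* Reverse lookup without searching; returns 0 when no row or entry exists */
sketch_sbcs_t sketch_cmap_reverse(const sketch_cmap_t *map,
                                  sketch_ucs2_t unichar)
{
  const sketch_sbcs_t *row = map->uni2char[(unichar >> 8) & 0xff];

  return (row != NULL ? row[unichar & 0xff] : 0);
}
```

Each reverse lookup costs two array indexes, at the price of one 256-byte
row per distinct UCS-2 high byte used by the charset.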
 575
 576
 577
 578     3.1.1.2.  cups_dmap_t - DBCS Charmap Structure
 579
 580     typedef struct cups_dmap_str    /**** DBCS Charmap Cache Structure ****/
 581     {
 582       struct cups_dmap_str  *next;          /* Next charset map in cache */
 583       int                   used;           /* Number of times entry used */
 584       cups_encoding_t       encoding;       /* Legacy charset encoding */
 585       ucs2_t                *char2uni[256]; /* Map Legacy DBCS -> UCS-2 */
 586       dbcs_t                *uni2char[256]; /* Map UCS-2 -> Legacy DBCS */
 587     } cups_dmap_t;
 588
 589     'char2uni[]' is a (sparse) array of pointers to arrays of (256 each)
 590     UCS-2 values that supports direct two-level lookup from an input DBCS
 591     legacy charset code point, for (future) use by 'cupsCharsetToUtf8()'.
 592
 593     'uni2char[]' is a (sparse) array of pointers to arrays of (256 each)
 594     DBCS values, that supports direct two-level lookup from an input UCS-2
 595     code point, for (future) use by 'cupsUtf8ToCharset()'.
 596
 597
 598
 599     3.1.2.  transcode.c - Transcoding module
 600
 601     All of the transcoding functions are modelled on the C standard library
 602     function 'strncpy()', except that they return the count of output, like
 603     'strlen()', rather than the (redundant) pointer to the output.
 604
 605     If the transcoding functions detect invalid input parameters or they
 606     detect an encoding error in their input, then they return '-1', rather
 607     than the count of output.
 608
 609     All of the transcoding functions take an input parameter indicating the
  610     maximum output units (for safe operation).  The functions that return
  611     16-bit (UTF-16) or 32-bit (UTF-32/UCS-4) output always return the count
  612     of output code units (not including the final null) and NOT the memory
  613     size in bytes.
 614
 615
 616
 617     3.1.2.1.  cupsUtf8ToCharset()
 618
 619     extern int      cupsUtf8ToCharset(char *dest,   /* O - Target string */
 620                         const utf8_t *src,          /* I - Source string */
 621                         const int maxout,           /* I - Max output */
 622                         cups_encoding_t encoding);  /* I - Encoding */
 623
 624     <Find charset map by calling 'cupsCharmapGet()'>
 625     <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
 631     <Convert internal UCS-4 to legacy charset via charset map>
 632     <Release charset map by calling 'cupsCharmapFree()'>
  633     <Return length of output legacy charset string -- size in bytes>
 634
 635
 636
 637     3.1.2.2.  cupsCharsetToUtf8()
 638
 639     extern int      cupsCharsetToUtf8(utf8_t *dest, /* O - Target string */
 640                         const char *src,            /* I - Source string */
 641                         const int maxout,           /* I - Max output */
 642                         cups_encoding_t encoding);  /* I - Encoding */
 643
 644     <Find charset map by calling 'cupsCharmapGet()'>
 645     <Convert input legacy charset to internal UCS-4 via charset map>
 646     <Convert internal UCS-4 to UTF-8 by calling 'cupsUtf32ToUtf8()'>
 647     <Release charset map by calling 'cupsCharmapFree()'>
 648     <Return length of output UTF-8 string -- size in bytes>
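
The SBCS case of the steps above can be sketched as follows.  This is an
illustration under stated assumptions, not the CUPS implementation:
'sketch_sbcs_to_utf8()' is a hypothetical name, the forward table is
passed in directly instead of being fetched with 'cupsCharmapGet()', a
table value of 0 is treated as "unmapped", 'maxout' is assumed to include
room for the final null, and overflow is reported as an error.

```c
#include <stddef.h>

typedef unsigned char  sketch_utf8_t;   /* stands in for utf8_t */
typedef unsigned short sketch_ucs2_t;   /* stands in for ucs2_t */

/* Returns the UTF-8 length in bytes (without the final null), or -1 */
int sketch_sbcs_to_utf8(sketch_utf8_t *dest, const char *src, int maxout,
                        const sketch_ucs2_t char2uni[256])
{
  int count = 0;

  if (dest == NULL || src == NULL || char2uni == NULL || maxout < 1)
    return (-1);

  for (; *src != '\0'; src++)
  {
    /* One-level forward lookup: legacy byte -> UCS-2 */
    sketch_ucs2_t ch = char2uni[(unsigned char)*src];
    int           need;

    if (ch == 0 || (ch >= 0xd800 && ch <= 0xdfff))
      return (-1);                      /* Unmapped or surrogate value */

    need = ch < 0x80 ? 1 : ch < 0x800 ? 2 : 3;
    if (count + need > maxout - 1)
      return (-1);                      /* No room (assumed to be an error) */

    if (need == 1)
      dest[count++] = (sketch_utf8_t)ch;
    else if (need == 2)
    {
      dest[count++] = (sketch_utf8_t)(0xc0 | (ch >> 6));
      dest[count++] = (sketch_utf8_t)(0x80 | (ch & 0x3f));
    }
    else
    {
      dest[count++] = (sketch_utf8_t)(0xe0 | (ch >> 12));
      dest[count++] = (sketch_utf8_t)(0x80 | ((ch >> 6) & 0x3f));
      dest[count++] = (sketch_utf8_t)(0x80 | (ch & 0x3f));
    }
  }

  dest[count] = 0;
  return (count);
}
```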
 649
 650
 651
 652     3.1.2.3.  cupsUtf8ToUtf16()
 653
 654     extern int      cupsUtf8ToUtf16(utf16_t *dest,  /* O - Target string */
 655                         const utf8_t *src,          /* I - Source string */
 656                         const int maxout);          /* I - Max output */
 657
 658     <...to avoid duplicate code to handle surrogate pairs...>
 659     <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
 660     <Convert internal UCS-4 to UTF-16 by calling 'cupsUtf32ToUtf16()'>
 661     <Return count of output UTF-16 string -- NOT memory size in bytes>
 662
 663
 664
 665     3.1.2.4.  cupsUtf16ToUtf8()
 666
 667     extern int      cupsUtf16ToUtf8(utf8_t *dest,   /* O - Target string */
 668                         const utf16_t *src,         /* I - Source string */
 669                         const int maxout);          /* I - Max output */
 670
 671     <...to avoid duplicate code to handle surrogate pairs...>
 672     <Convert input UTF-16 to internal UCS-4 by calling 'cupsUtf16ToUtf32()'>
 673     <Convert internal UCS-4 to UTF-8 by calling 'cupsUtf32ToUtf8()'>
 674     <Return length of output UTF-8 string -- size in bytes>
 675
 676
 677
 678     3.1.2.5.  cupsUtf8ToUtf32()
 679
 680     extern int      cupsUtf8ToUtf32(utf32_t *dest,  /* O - Target string */
 681                         const utf8_t *src,          /* I - Source string */
 682                         const int maxout);          /* I - Max output */
 683
 689     <Convert input UTF-8 directly to output UCS-4...>
 690     <...checking for valid range, shortest-form, etc.>
 691     <Return count of output UTF-32 string -- NOT memory size in bytes>
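
A minimal sketch of the strict decoding outlined above, assuming the
behavior that 'TcStrictUtf8' and 'TcStrictUtf32' would enable.  The
function name 'sketch_utf8_to_utf32()' is hypothetical.  As assumptions
for this sketch only, 'maxout' includes room for the final null and an
overflow is reported as an error.

```c
#include <stddef.h>

typedef unsigned char sketch_utf8_t;    /* stands in for utf8_t */
typedef unsigned long sketch_utf32_t;   /* stands in for utf32_t */

/* Returns the count of code points written (without the null), or -1 */
int sketch_utf8_to_utf32(sketch_utf32_t *dest, const sketch_utf8_t *src,
                         int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  while (*src != 0)
  {
    sketch_utf32_t ch = *src++;         /* Lead byte */
    sketch_utf32_t min;                 /* Smallest value for this length */
    int            more;                /* Continuation bytes expected */

    if (ch < 0x80)
    { more = 0; min = 0; }
    else if ((ch & 0xe0) == 0xc0)
    { more = 1; min = 0x80; ch &= 0x1f; }
    else if ((ch & 0xf0) == 0xe0)
    { more = 2; min = 0x800; ch &= 0x0f; }
    else if ((ch & 0xf8) == 0xf0)
    { more = 3; min = 0x10000; ch &= 0x07; }
    else
      return (-1);                      /* Stray continuation or bad lead */

    while (more-- > 0)
    {
      if ((*src & 0xc0) != 0x80)
        return (-1);                    /* Truncated sequence */
      ch = (ch << 6) | (*src++ & 0x3f);
    }

    /* Reject non-shortest forms, out-of-range values, and surrogates */
    if (ch < min || ch > 0x10ffff || (ch >= 0xd800 && ch <= 0xdfff))
      return (-1);

    if (count >= maxout - 1)
      return (-1);                      /* No room (assumed to be an error) */

    dest[count++] = ch;
  }

  dest[count] = 0;
  return (count);
}
```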
 692
 693
 694
 695     3.1.2.6.  cupsUtf32ToUtf8()
 696
 697     extern int      cupsUtf32ToUtf8(utf8_t *dest,   /* O - Target string */
 698                         const utf32_t *src,         /* I - Source string */
 699                         const int maxout);          /* I - Max output */
 700
 701     <Convert input UCS-4 directly to output UTF-8...>
 702     <...checking for valid range, etc.>
 703     <Return length of output UTF-8 string -- size in bytes>
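
The encoding direction can be sketched in the same way.  The function
name 'sketch_utf32_to_utf8()' is hypothetical.  The sketch assumes that
'maxout' includes room for the final null and that an overflow is an
error; neither point is settled by this document.

```c
#include <stddef.h>

typedef unsigned char sketch_utf8_t;    /* stands in for utf8_t */
typedef unsigned long sketch_utf32_t;   /* stands in for utf32_t */

/* Returns the UTF-8 length in bytes (without the final null), or -1 */
int sketch_utf32_to_utf8(sketch_utf8_t *dest, const sketch_utf32_t *src,
                         int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  for (; *src != 0; src++)
  {
    sketch_utf32_t ch = *src;
    int            need;

    /* Valid range check: at most 0x10FFFF and not a surrogate */
    if (ch > 0x10ffff || (ch >= 0xd800 && ch <= 0xdfff))
      return (-1);

    need = ch < 0x80 ? 1 : ch < 0x800 ? 2 : ch < 0x10000 ? 3 : 4;
    if (count + need > maxout - 1)
      return (-1);                      /* No room (assumed to be an error) */

    switch (need)
    {
      case 1 :
          dest[count++] = (sketch_utf8_t)ch;
          break;
      case 2 :
          dest[count++] = (sketch_utf8_t)(0xc0 | (ch >> 6));
          dest[count++] = (sketch_utf8_t)(0x80 | (ch & 0x3f));
          break;
      case 3 :
          dest[count++] = (sketch_utf8_t)(0xe0 | (ch >> 12));
          dest[count++] = (sketch_utf8_t)(0x80 | ((ch >> 6) & 0x3f));
          dest[count++] = (sketch_utf8_t)(0x80 | (ch & 0x3f));
          break;
      default :
          dest[count++] = (sketch_utf8_t)(0xf0 | (ch >> 18));
          dest[count++] = (sketch_utf8_t)(0x80 | ((ch >> 12) & 0x3f));
          dest[count++] = (sketch_utf8_t)(0x80 | ((ch >> 6) & 0x3f));
          dest[count++] = (sketch_utf8_t)(0x80 | (ch & 0x3f));
          break;
    }
  }

  dest[count] = 0;
  return (count);
}
```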
 704
 705
 706
 707     3.1.2.7.  cupsUtf16ToUtf32()
 708
 709     extern int      cupsUtf16ToUtf32(utf32_t *dest, /* O - Target string */
 710                         const utf16_t *src,         /* I - Source string */
 711                         const int maxout);          /* I - Max output */
 712
 713     <Convert input UTF-16 directly to output UCS-4...>
 714     <...handling surrogate pairs decoding from UTF-16>
 715     <Return count of output UTF-32 string -- NOT memory size in bytes>
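
Surrogate pair decoding can be sketched as follows.  The function name
'sketch_utf16_to_utf32()' is hypothetical, and BOM handling (see
'TcSupportBOM' and 'TcRequireBOM') is deliberately left out.  As in the
other sketches, 'maxout' is assumed to include room for the final null.

```c
#include <stddef.h>

typedef unsigned short sketch_utf16_t;  /* stands in for utf16_t */
typedef unsigned long  sketch_utf32_t;  /* stands in for utf32_t */

/* Returns the count of code points written (without the null), or -1 */
int sketch_utf16_to_utf32(sketch_utf32_t *dest, const sketch_utf16_t *src,
                          int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  while (*src != 0)
  {
    sketch_utf32_t ch = *src++;

    if (ch >= 0xd800 && ch <= 0xdbff)
    {
      /* High surrogate: the next unit must be a low surrogate */
      sketch_utf32_t low = *src;

      if (low < 0xdc00 || low > 0xdfff)
        return (-1);                    /* Unpaired high surrogate */

      src++;
      ch = 0x10000 + ((ch - 0xd800) << 10) + (low - 0xdc00);
    }
    else if (ch >= 0xdc00 && ch <= 0xdfff)
      return (-1);                      /* Low surrogate with no high one */

    if (count >= maxout - 1)
      return (-1);                      /* No room (assumed to be an error) */

    dest[count++] = ch;
  }

  dest[count] = 0;
  return (count);
}
```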
 716
 717
 718
 719     3.1.2.8.  cupsUtf32ToUtf16()
 720
 721     extern int      cupsUtf32ToUtf16(utf16_t *dest, /* O - Target string */
 722                         const utf32_t *src,         /* I - Source string */
 723                         const int maxout);          /* I - Max output */
 724
 725     <Convert input UCS-4 directly to output UTF-16...>
 726     <...handling surrogate pairs encoding to UTF-16>
 727     <Return count of output UTF-16 string -- NOT memory size in bytes>
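
Surrogate pair encoding is the inverse operation.  In the sketch below
the function name 'sketch_utf32_to_utf16()' is hypothetical, and 'maxout'
is again assumed to include room for the final null.  The returned count
is in 16-bit code units, so a supplementary character counts as two.

```c
#include <stddef.h>

typedef unsigned short sketch_utf16_t;  /* stands in for utf16_t */
typedef unsigned long  sketch_utf32_t;  /* stands in for utf32_t */

/* Returns the count of UTF-16 code units written (without the null), or -1 */
int sketch_utf32_to_utf16(sketch_utf16_t *dest, const sketch_utf32_t *src,
                          int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  for (; *src != 0; src++)
  {
    sketch_utf32_t ch = *src;

    if (ch > 0x10ffff || (ch >= 0xd800 && ch <= 0xdfff))
      return (-1);                      /* Out of range or surrogate value */

    if (ch < 0x10000)
    {
      if (count + 1 > maxout - 1)
        return (-1);
      dest[count++] = (sketch_utf16_t)ch;
    }
    else
    {
      /* Split the 20 bits above 0x10000 into two 10-bit halves */
      if (count + 2 > maxout - 1)
        return (-1);
      ch -= 0x10000;
      dest[count++] = (sketch_utf16_t)(0xd800 + (ch >> 10));
      dest[count++] = (sketch_utf16_t)(0xdc00 + (ch & 0x3ff));
    }
  }

  dest[count] = 0;
  return (count);
}
```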
 728
 729
 730
 731     3.1.2.9.  Transcoding Utility Functions
 732
 733     The transcoding utility functions are used to load (from a file into
 734     memory), free (logically, without freeing memory), and flush (actually
 735     free memory) character maps for SBCS (single-byte character set) and
 736     (future) DBCS (double-byte character set) transcoding to and from UTF-8.
 744
 745
 746
 747     3.1.2.9.1.  cupsCharmapGet()
 748
 749     extern void     *cupsCharmapGet(const cups_encoding_t encoding);
 750                                                     /* I - Encoding */
 751
  752     <Find SBCS or DBCS charset map in cache>
 753     <...If found, increment 'used'>
 754     <...and return pointer to SBCS or DBCS charset map>
 755     <Get charset map file name by calling 'cupsEncodingName()'>
 756     <Open charset map file>
  757     <...If not found, return NULL>
  758     <Allocate memory for SBCS or DBCS charset map in cache>
  759     <...If no memory, return NULL>
 760     <Add to SBCS or DBCS cache by assigning 'next' field>
 761     <Assign 'encoding' field>
 762     <Increment 'used' field>
 763     <Read charset map file into memory in loop...>
 764     <If SBCS, then 'char2uni[]' is an array of 'ucs2_t' values>
 765     <...and 'uni2char[]' is an array of pointers to 'sbcs_t' arrays>
 766     <If DBCS, then char2uni[]' is an array of pointers to 'ucs2_t' arrays>
 767     <...and 'uni2char[]' is an array of pointers to 'dbcs_t' arrays>
 768     <Close charset map file>
 769     <Return pointer to SBCS or DBCS charset map>
 770
 771
 772
 773     3.1.2.9.2.  cupsCharmapFree()
 774
 775     extern void     cupsCharmapFree(const cups_encoding_t encoding);
 776                                                     /* I - Encoding */
 777
 778     <Find SBCS or DBCS charset map in cache>
 779     <...If found, decrement 'used'>
 780     <Return void>
 781
 782
 783
 784     3.1.2.9.3.  cupsCharmapFlush()
 785
 786     extern void     cupsCharmapFlush(void);
 787
 788     <Loop through SBCS charset map cache...>
 789     <...Free 'uni2char[]' memory>
 790     <...Free SBCS charset map memory>
 791     <Loop through DBCS charset map cache...>
 792     <...Free 'char2uni[]' memory>
 793     <...Free 'uni2char[]' memory>
 794     <...Free DBCS charset map memory>
 795     <Return void>
 796
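The get / free / flush life cycle described in this section can be
sketched as a small use-counted cache.  This is only an illustration
under stated assumptions: the 'sketch_' names are hypothetical, the
encoding key is a plain 'int' rather than 'cups_encoding_t', and the
step that reads the charset map file is replaced by a bare allocation.

```c
#include <stdlib.h>

typedef struct sketch_map_s             /* Hypothetical cache entry */
{
  struct sketch_map_s *next;            /* Next map in cache */
  int                 used;             /* Number of times entry used */
  int                 encoding;         /* Encoding key (stand-in type) */
} sketch_map_t;

sketch_map_t *sketch_cache = NULL;      /* Head of the map cache */

/* Find a map in the cache, or create and link a new one. */
sketch_map_t *
sketch_map_get(int encoding)
{
  sketch_map_t *map;

  for (map = sketch_cache; map != NULL; map = map->next)
    if (map->encoding == encoding)
    {
      map->used ++;
      return (map);
    }

  /* A real implementation would open and read the charset map file here */
  if ((map = calloc(1, sizeof(sketch_map_t))) == NULL)
    return (NULL);

  map->next     = sketch_cache;
  map->encoding = encoding;
  map->used     = 1;
  sketch_cache  = map;

  return (map);
}

/* Logically release a map: decrement 'used' but keep the memory. */
void
sketch_map_free(int encoding)
{
  sketch_map_t *map;

  for (map = sketch_cache; map != NULL; map = map->next)
    if (map->encoding == encoding)
    {
      if (map->used > 0)
        map->used --;
      return;
    }
}

/* Actually free the memory of every cached map. */
void
sketch_map_flush(void)
{
  sketch_map_t *map, *next;

  for (map = sketch_cache; map != NULL; map = next)
  {
    next = map->next;
    free(map);
  }

  sketch_cache = NULL;
}
```

A second 'get' for the same encoding returns the cached pointer and
only increments 'used', so the map file would be read once per flush.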
 797
 804
 805     3.2.  Normalization - New
 806
 807
 808
 809     3.2.1.  normalize.h - Normalization header
 810
 811     /*
 812      * "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
 813      *
 814      *   Unicode normalization for the Common UNIX Printing System (CUPS).
 815      *
 816      *   Copyright 1997-2002 by Easy Software Products.
 817      *
 818      *   These coded instructions, statements, and computer programs are
 819      *   the property of Easy Software Products and are protected by Federal
 820      *   copyright law.  Distribution and use rights are outlined in the
 821      *   file "LICENSE.txt" which should have been included with this file.
 822      *   If this file is missing or damaged please contact Easy Software
 823      *   Products at:
 824      *
 825      *       Attn: CUPS Licensing Information
 826      *       Easy Software Products
 827      *       44141 Airport View Drive, Suite 204
 828      *       Hollywood, Maryland 20636-3111 USA
 829      *
 830      *       Voice: (301) 373-9603
 831      *       EMail: cups-info@cups.org
 832      *         WWW: http://www.cups.org
 833      */
 834
 835     #ifndef _CUPS_NORMALIZE_H_
 836     #  define _CUPS_NORMALIZE_H_
 837
 838     /*
 839      * Include necessary headers...
 840      */
 841
 842     #  include "transcod.h"
 843
 844     #  ifdef __cplusplus
 845     extern "C" {
 846     #  endif /* __cplusplus */
 847
 848     /*
 849      * Types...
 850      */
 851
 852     typedef enum                    /**** Normalization Types ****/
 853     {
 859       CUPS_NORM_NFD,                /* Canonical Decomposition */
 860       CUPS_NORM_NFKD,               /* Compatibility Decomposition */
 861       CUPS_NORM_NFC,                /* NFD, then Canonical Composition */
 862       CUPS_NORM_NFKC                /* NFKD, then Canonical Composition */
 863     } cups_normalize_t;
 864
 865     typedef enum                    /**** Case Folding Types ****/
 866     {
 867       CUPS_FOLD_SIMPLE,             /* Simple - no expansion in size */
 868       CUPS_FOLD_FULL                /* Full - possible expansion in size */
 869     } cups_folding_t;
 870
 871     typedef enum                    /**** Unicode Char Property Types ****/
 872     {
 873       CUPS_PROP_GENERAL_CATEGORY,   /* See 'cups_gencat_t' enum */
 874       CUPS_PROP_BIDI_CATEGORY,      /* See 'cups_bidicat_t' enum */
 875       CUPS_PROP_COMBINING_CLASS,    /* See 'cups_combclass_t' type */
 876       CUPS_PROP_BREAK_CLASS         /* See 'cups_breakclass_t' enum */
 877     } cups_property_t;
 878
 879     /*
 880      * Note - parse Unicode char general category from 'UnicodeData.txt'
 881      * into sparse local table in 'normalize.c'.
 882      * Use major classes for logic optimizations throughout (by mask).
 883      */
 884
 885     typedef enum                    /**** Unicode General Category ****/
 886     {
 887       CUPS_GENCAT_L  = 0x10, /* Letter major class */
 888       CUPS_GENCAT_LU = 0x11, /* Lu Letter, Uppercase */
 889       CUPS_GENCAT_LL = 0x12, /* Ll Letter, Lowercase */
 890       CUPS_GENCAT_LT = 0x13, /* Lt Letter, Titlecase */
 891       CUPS_GENCAT_LM = 0x14, /* Lm Letter, Modifier */
 892       CUPS_GENCAT_LO = 0x15, /* Lo Letter, Other */
 893       CUPS_GENCAT_M  = 0x20, /* Mark major class */
 894       CUPS_GENCAT_MN = 0x21, /* Mn Mark, Non-Spacing */
 895       CUPS_GENCAT_MC = 0x22, /* Mc Mark, Spacing Combining */
 896       CUPS_GENCAT_ME = 0x23, /* Me Mark, Enclosing */
 897       CUPS_GENCAT_N  = 0x30, /* Number major class */
 898       CUPS_GENCAT_ND = 0x31, /* Nd Number, Decimal Digit */
 899       CUPS_GENCAT_NL = 0x32, /* Nl Number, Letter */
 900       CUPS_GENCAT_NO = 0x33, /* No Number, Other */
 901       CUPS_GENCAT_P  = 0x40, /* Punctuation major class */
 902       CUPS_GENCAT_PC = 0x41, /* Pc Punctuation, Connector */
 903       CUPS_GENCAT_PD = 0x42, /* Pd Punctuation, Dash */
 904       CUPS_GENCAT_PS = 0x43, /* Ps Punctuation, Open (start) */
 905       CUPS_GENCAT_PE = 0x44, /* Pe Punctuation, Close (end) */
 906       CUPS_GENCAT_PI = 0x45, /* Pi Punctuation, Initial Quote */
 907       CUPS_GENCAT_PF = 0x46, /* Pf Punctuation, Final Quote */
 908       CUPS_GENCAT_PO = 0x47, /* Po Punctuation, Other */
 909       CUPS_GENCAT_S  = 0x50, /* Symbol major class */
 910       CUPS_GENCAT_SM = 0x51, /* Sm Symbol, Math */
 916       CUPS_GENCAT_SC = 0x52, /* Sc Symbol, Currency */
 917       CUPS_GENCAT_SK = 0x53, /* Sk Symbol, Modifier */
 918       CUPS_GENCAT_SO = 0x54, /* So Symbol, Other */
 919       CUPS_GENCAT_Z  = 0x60, /* Separator major class */
 920       CUPS_GENCAT_ZS = 0x61, /* Zs Separator, Space */
 921       CUPS_GENCAT_ZL = 0x62, /* Zl Separator, Line */
 922       CUPS_GENCAT_ZP = 0x63, /* Zp Separator, Paragraph */
 923       CUPS_GENCAT_C  = 0x70, /* Other (miscellaneous) major class */
 924       CUPS_GENCAT_CC = 0x71, /* Cc Other, Control */
 925       CUPS_GENCAT_CF = 0x72, /* Cf Other, Format */
 926       CUPS_GENCAT_CS = 0x73, /* Cs Other, Surrogate */
 927       CUPS_GENCAT_CO = 0x74, /* Co Other, Private Use */
 928       CUPS_GENCAT_CN = 0x75  /* Cn Other, Not Assigned */
 929     } cups_gencat_t;
 930
 931     /*
 932      * Note - parse Unicode char bidi category from 'UnicodeData.txt'
 933      * into sparse local table in 'normalize.c'.
 934      * Add bidirectional support to 'textcommon.c' - per Mike
 935      */
 936
 937     typedef enum                    /**** Unicode Bidi Category ****/
 938     {
 939       CUPS_BIDI_L,   /* Left-to-Right (Alpha, Syllabic, Ideographic) */
 940       CUPS_BIDI_LRE, /* Left-to-Right Embedding (explicit) */
 941       CUPS_BIDI_LRO, /* Left-to-Right Override (explicit) */
 942       CUPS_BIDI_R,   /* Right-to-Left (Hebrew alphabet and most punct) */
 943       CUPS_BIDI_AL,  /* Right-to-Left Arabic (Arabic, Thaana, Syriac) */
 944       CUPS_BIDI_RLE, /* Right-to-Left Embedding (explicit) */
 945       CUPS_BIDI_RLO, /* Right-to-Left Override (explicit) */
 946       CUPS_BIDI_PDF, /* Pop Directional Format */
 947       CUPS_BIDI_EN,  /* Euro Number (Euro and East Arabic-Indic digits) */
 948       CUPS_BIDI_ES,  /* Euro Number Separator (Slash) */
 949       CUPS_BIDI_ET,  /* Euro Number Terminator (Plus, Minus, Degree, etc) */
 950       CUPS_BIDI_AN,  /* Arabic Number (Arabic-Indic digits, separators) */
 951       CUPS_BIDI_CS,  /* Common Number Separator (Colon, Comma, Dot, etc) */
 952       CUPS_BIDI_NSM, /* Non-Spacing Mark (category Mn / Me in UCD) */
 953       CUPS_BIDI_BN,  /* Boundary Neutral (Formatting / Control chars) */
 954       CUPS_BIDI_B,   /* Paragraph Separator */
 955       CUPS_BIDI_S,   /* Segment Separator (Tab) */
 956       CUPS_BIDI_WS,  /* Whitespace (Space, Line Separator, etc) */
 957       CUPS_BIDI_ON   /* Other Neutrals */
 958     } cups_bidicat_t;
 959
 960     /*
 961      * Note - parse Unicode line break class from 'DerivedLineBreak.txt'
 962      * into sparse local table (list of class ranges) in 'normalize.c'.
 963      * Note - add state table from UAX-14, section 7.3 - Ira
 964      * Remember to do BK and SP in outer loop (not in state table).
 965      * Consider optimization for CM (combining mark).
 966      * See 'LineBreak.txt' (12,875) and 'DerivedLineBreak.txt' (1,350).
 967      */
 968
 974     typedef enum                    /**** Unicode Line Break Class ****/
 975     {
 976      /*
 977       * (A) - Allow Break AFTER
 978       * (XA) - Prevent Break AFTER
 979       * (B) - Allow Break BEFORE
 980       * (XB) - Prevent Break BEFORE
 981       * (P) - Allow Break For Pair
 982       * (XP) - Prevent Break For Pair
 983       */
 984       CUPS_BREAK_AI, /* Ambiguous (Alphabetic or Ideograph) */
 985       CUPS_BREAK_AL, /* Ordinary Alphabetic / Symbol Chars (XP) */
 986       CUPS_BREAK_BA, /* Break Opportunity After Chars (A) */
 987       CUPS_BREAK_BB, /* Break Opportunities Before Chars (B) */
 988       CUPS_BREAK_B2, /* Break Opportunity Before / After (B/A/XP) */
 989       CUPS_BREAK_BK, /* Mandatory Break (A) (normative) */
 990       CUPS_BREAK_CB, /* Contingent Break (B/A) (normative) */
 991       CUPS_BREAK_CL, /* Closing Punctuation (XB) */
 992       CUPS_BREAK_CM, /* Attached Chars / Combining (XB) (normative) */
 993       CUPS_BREAK_CR, /* Carriage Return (A) (normative) */
 994       CUPS_BREAK_EX, /* Exclamation / Interrogation (XB) */
 995       CUPS_BREAK_GL, /* Non-breaking ("Glue") (XB/XA) (normative) */
 996       CUPS_BREAK_HY, /* Hyphen (XA) */
 997       CUPS_BREAK_ID, /* Ideographic (B/A) */
 998       CUPS_BREAK_IN, /* Inseparable chars (XP) */
 999       CUPS_BREAK_IS, /* Numeric Separator (Infix) (XB) */
1000       CUPS_BREAK_LF, /* Line Feed (A) (normative) */
1001       CUPS_BREAK_NS, /* Non-starters (XB) */
1002       CUPS_BREAK_NU, /* Numeric (XP) */
1003       CUPS_BREAK_OP, /* Opening Punctuation (XA) */
1004       CUPS_BREAK_PO, /* Postfix (Numeric) (XB) */
1005       CUPS_BREAK_PR, /* Prefix (Numeric) (XA) */
1006       CUPS_BREAK_QU, /* Ambiguous Quotation (XB/XA) */
1007       CUPS_BREAK_SA, /* Context Dependent (South East Asian) (P) */
1008       CUPS_BREAK_SG, /* Surrogates (XP) (normative) */
1009       CUPS_BREAK_SP, /* Space (A) (normative) */
1010       CUPS_BREAK_SY, /* Symbols Allowing Break After (A) */
1011       CUPS_BREAK_XX, /* Unknown (XP) */
1012       CUPS_BREAK_ZW  /* Zero Width Space (A) (normative) */
1013     } cups_breakclass_t;
1014
1015     typedef int cups_combclass_t;   /**** Unicode Combining Class ****/
1016                                     /* 0=base / 1..254=combining char */
1017
1018     /*
1019      * Structures...
1020      */
1021
1022     typedef struct cups_normmap_str /**** Normalize Map Cache Struct ****/
1023     {
1024       struct cups_normmap_str *next;        /* Next normalize in cache */
1030       int                   used;           /* Number of times entry used */
1031       cups_normalize_t      normalize;      /* Normalization type */
1032       int                   normcount;      /* Count of Source Chars */
1033       ucs2_t                *uni2norm;      /* Char -> Normalization */
1034                                             /* ...only supports UCS-2 */
1035     } cups_normmap_t;
1036
1037     typedef struct cups_foldmap_str /**** Case Fold Map Cache Struct ****/
1038     {
1039       struct cups_foldmap_str *next;        /* Next case fold in cache */
1040       int                   used;           /* Number of times entry used */
1041       cups_folding_t        fold;           /* Case folding type */
1042       int                   foldcount;      /* Count of Source Chars */
1043       ucs2_t                *uni2fold;      /* Char -> Folded Char(s) */
1044                                             /* ...only supports UCS-2 */
1045     } cups_foldmap_t;
1046
1047     typedef struct cups_prop_str    /**** Char Property Struct ****/
1048     {
1049       ucs2_t                ch;             /* Unicode Char as UCS-2 */
1050       unsigned char         gencat;         /* General Category */
1051       unsigned char         bidicat;        /* Bidirectional Category */
1052     } cups_prop_t;
1053
1054     typedef struct                  /**** Char Property Map Struct ****/
1055     {
1056       int                   used;           /* Number of times entry used */
1057       int                   propcount;      /* Count of Source Chars */
1058       cups_prop_t           *uni2prop;      /* Char -> Properties */
1059     } cups_propmap_t;
1060
1061     typedef struct                  /**** Line Break Class Map Struct ****/
1062     {
1063       int                   used;           /* Number of times entry used */
1064       int                   breakcount;     /* Count of Source Chars */
1065       ucs2_t                *uni2break;     /* Char -> Line Break Class */
1066     } cups_breakmap_t;
1067
1068     typedef struct cups_comb_str    /**** Char Combining Class Struct ****/
1069     {
1070       ucs2_t                ch;             /* Unicode Char as UCS-2 */
1071       unsigned char         combclass;      /* Combining Class */
1072       unsigned char         reserved;       /* Reserved for alignment */
1073     } cups_comb_t;
1074
1075     typedef struct                  /**** Combining Class Map Struct ****/
1076     {
1077       int                   used;           /* Number of times entry used */
1078       int                   combcount;      /* Count of Source Chars */
1079       cups_comb_t           *uni2comb;      /* Char -> Combining Class */
1080     } cups_combmap_t;
1081
1088     /*
1089      * Globals...
1090      */
1091
1092     extern int      NzSupportUcs2;  /* Support UCS-2 (16-bit) mapping */
1093     extern int      NzSupportUcs4;  /* Support UCS-4 (32-bit) mapping */
1094
1095     /*
1096      * Prototypes...
1097      */
1098
1099     /*
1100      * Utility functions for normalization module
1101      */
1102     extern int      cupsNormalizeMapsGet(void);
1103     extern int      cupsNormalizeMapsFree(void);
1104     extern void     cupsNormalizeMapsFlush(void);
1105
1106     /*
1107      * Normalize UTF-8 string to Unicode UAX-15 Normalization Form
1108      * Note - Compatibility Normalization Forms (NFKD/NFKC) are
1109      * unsafe for subsequent transcoding to legacy charsets
1110      */
1111     extern int      cupsUtf8Normalize(utf8_t *dest, /* O - Target string */
1112                         const utf8_t *src,          /* I - Source string */
1113                         const int maxout,           /* I - Max output */
1114                         const cups_normalize_t normalize);
1115                                                     /* I - Normalization */
1116
1117     /*
1118      * Normalize UTF-32 string to Unicode UAX-15 Normalization Form
1119      * Note - Compatibility Normalization Forms (NFKD/NFKC) are
1120      * unsafe for subsequent transcoding to legacy charsets
1121      */
1122     extern int      cupsUtf32Normalize(utf32_t *dest,
1123                                                     /* O - Target string */
1124                         const utf32_t *src,         /* I - Source string */
1125                         const int maxout,           /* I - Max output */
1126                         const cups_normalize_t normalize);
1127                                                     /* I - Normalization */
1128
1129     /*
1130      * Case Fold UTF-8 string per Unicode UAX-21 Section 2.3
1131      * Note - Case folding output is
1132      * unsafe for subsequent transcoding to legacy charsets
1133      */
1134     extern int      cupsUtf8CaseFold(utf8_t *dest,  /* O - Target string */
1135                         const utf8_t *src,          /* I - Source string */
1136                         const int maxout,           /* I - Max output */
1137                         const cups_folding_t fold); /* I - Fold Mode */
1138
1145     /*
1146      * Case Fold UTF-32 string per Unicode UAX-21 Section 2.3
1147      * Note - Case folding output is
1148      * unsafe for subsequent transcoding to legacy charsets
1149      */
1150     extern int      cupsUtf32CaseFold(utf32_t *dest,/* O - Target string */
1151                         const utf32_t *src,         /* I - Source string */
1152                         const int maxout,           /* I - Max output */
1153                         const cups_folding_t fold); /* I - Fold Mode */
1154
1155     /*
1156      * Compare UTF-8 strings after case folding
1157      */
1158     extern int      cupsUtf8CompareCaseless(const utf8_t *s1,
1159                                                     /* I - String1 */
1160                         const utf8_t *s2);          /* I - String2 */
1161
1162     /*
1163      * Compare UTF-32 strings after case folding
1164      */
1165     extern int      cupsUtf32CompareCaseless(const utf32_t *s1,
1166                                                     /* I - String1 */
1167                         const utf32_t *s2);         /* I - String2 */
1168
1169     /*
1170      * Compare UTF-8 strings after case folding and NFKC normalization
1171      */
1172     extern int      cupsUtf8CompareIdentifier(const utf8_t *s1,
1173                                                     /* I - String1 */
1174                         const utf8_t *s2);          /* I - String2 */
1175
1176     /*
1177      * Compare UTF-32 strings after case folding and NFKC normalization
1178      */
1179     extern int      cupsUtf32CompareIdentifier(const utf32_t *s1,
1180                                                     /* I - String1 */
1181                         const utf32_t *s2);         /* I - String2 */
1182
1183     /*
1184      * Get UTF-32 character property
1185      */
1186     extern int      cupsUtf32CharacterProperty(const utf32_t ch,
1187                                                     /* I - Source char */
1188                         const cups_property_t property);
1189                                                     /* I - Char Property */
1190
1191     #  ifdef __cplusplus
1192     }
1193     #  endif /* __cplusplus */
1194
1195     #endif /* !_CUPS_NORMALIZE_H_ */
1196
1202     /*
1203      * End of "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
1204      */
1205
1206
1207
1208     3.2.1.1.  cups_normmap_t - Normalize Map Structure
1209
1210     typedef struct cups_normmap_str /**** Normalize Map Cache Struct ****/
1211     {
1212       struct cups_normmap_str *next;        /* Next normalize in cache */
1213       int                   used;           /* Number of times entry used */
1214       cups_normalize_t      normalize;      /* Normalization type */
1215       int                   normcount;      /* Count of Source Chars */
1216       ucs2_t                *uni2norm;      /* Char -> Normalization */
1217                                             /* ...only supports UCS-2 */
1218     } cups_normmap_t;
1219
1220     'uni2norm' is a pointer to an array of _triplets_ of UCS-2 values.
1221     'normcount' is a count of _triplets_ in the 'uni2norm[]' array.
1222
1223     For decompositions (NFD and NFKD), the triplets are:  composed base
1224     character, decomposed base character, and decomposed accent character.
1225     These are used by 'cupsUtf8Normalize()' and 'cupsUtf32Normalize()' in
1226     performing canonical (NFD) or compatibility (NFKD) decomposition.
1227
1228     For compositions (NFC and NFKC), the triplets are:  decomposed base
1229     character, decomposed accent character, and composed base character.
1230     These are used by 'cupsUtf8Normalize()' and 'cupsUtf32Normalize()' in
1231     performing canonical composition (for NFC or NFKC).
1232
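A lookup in the decomposition triplets might look like the sketch
below.  It assumes that 'uni2norm[]' is sorted by the first member of
each triplet (implied by the use of 'bsearch()' in the normalization
module, but not stated here); the 'sketch_' names are hypothetical.

```c
#include <stdlib.h>
#include <stdint.h>

typedef uint16_t sketch_ucs2_t;         /* Stand-in for 'ucs2_t' */

/* Compare a key character against the first member of a triplet. */
static int
sketch_compare_decompose(const void *key, const void *triplet)
{
  sketch_ucs2_t k = *(const sketch_ucs2_t *)key;
  sketch_ucs2_t t = *(const sketch_ucs2_t *)triplet;

  return ((k > t) - (k < t));
}

/*
 * Look up 'ch' in a sorted array of 'normcount' decomposition triplets.
 * Returns a pointer to the matching triplet (composed char, decomposed
 * base, decomposed accent), or NULL if 'ch' has no entry in the table.
 */
const sketch_ucs2_t *
sketch_find_decompose(const sketch_ucs2_t *uni2norm, int normcount,
                      sketch_ucs2_t ch)
{
  if (uni2norm == NULL || normcount <= 0)
    return (NULL);

  return (bsearch(&ch, uni2norm, (size_t)normcount,
                  3 * sizeof(sketch_ucs2_t), sketch_compare_decompose));
}
```

For example, with a table holding the triplets (U+00C0, U+0041, U+0300)
and (U+00E9, U+0065, U+0301), a lookup of U+00E9 yields the base U+0065
and the accent U+0301, while a lookup of U+0041 yields no entry.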
1233
1234
1235     3.2.1.2.  cups_foldmap_t - Case Fold Map Structure
1236
1237     typedef struct cups_foldmap_str /**** Case Fold Map Cache Struct ****/
1238     {
           struct cups_foldmap_str *next;        /* Next case fold in cache */
1239       int                   used;           /* Number of times entry used */
1240       cups_folding_t        fold;           /* Case folding type */
1241       int                   foldcount;      /* Count of Source Chars */
1242       ucs2_t                *uni2fold;      /* Char -> Folded Char(s) */
1243                                             /* ...only supports UCS-2 */
1244     } cups_foldmap_t;
1245
1246     'uni2fold' is a pointer to an array of _quadruplets_ of UCS-2 values.
1247     'foldcount' is a count of _quadruplets_ in the 'uni2fold[]' array.
1248
1249     For simple case folding (without expansion of the size of the output
1250     string), the quadruplets are:  input base character, output case folded
1251     character, zero (unused), and zero (unused).
1252
1259     For full case folding (with possible expansion of the size of the output
1260     string), the quadruplets are:  input base character, output case folded
1261     character, second output character or zero, third output character or
1262     zero.
1263
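The quadruplet layout lets one loop serve both folding modes, since
unused output slots are zero.  The sketch below is an illustration
only: the function name, the explicit table parameters, and the
reading of 'maxout' as the output capacity in units (including the
terminating zero) are assumptions, and the table is assumed to be
sorted by input character.

```c
#include <stdlib.h>
#include <stdint.h>

typedef uint16_t sketch_ucs2_t;         /* Stand-in for 'ucs2_t' */
typedef uint32_t sketch_utf32_t;        /* Stand-in for 'utf32_t' */

/* Compare a key character against the first member of a quadruplet. */
static int
sketch_compare_foldchar(const void *key, const void *quad)
{
  sketch_ucs2_t k = *(const sketch_ucs2_t *)key;
  sketch_ucs2_t q = *(const sketch_ucs2_t *)quad;

  return ((k > q) - (k < q));
}

/*
 * Hypothetical helper: case fold 'src' into 'dest' using quadruplets.
 * Returns the count of output units, or -1 on error or overflow.
 */
int
sketch_case_fold(sketch_utf32_t *dest, const sketch_utf32_t *src,
                 int maxout, const sketch_ucs2_t *uni2fold, int foldcount)
{
  int count = 0;                        /* Output units written so far */

  if (!dest || !src || maxout < 1 || !uni2fold || foldcount < 0)
    return (-1);

  for (; *src != 0; src ++)
  {
    const sketch_ucs2_t *quad = NULL;   /* Matching quadruplet, if any */
    int                 i;

    if (*src <= 0xFFFF && foldcount > 0)
    {
      sketch_ucs2_t key = (sketch_ucs2_t)*src;

      quad = bsearch(&key, uni2fold, (size_t)foldcount,
                     4 * sizeof(sketch_ucs2_t), sketch_compare_foldchar);
    }

    if (quad == NULL)
    {
      /* No entry: the character folds to itself */
      if (count + 2 > maxout)
        return (-1);

      dest[count ++] = *src;
      continue;
    }

    /* Copy one to three output characters, stopping at the first zero */
    for (i = 1; i < 4 && quad[i] != 0; i ++)
    {
      if (count + 2 > maxout)
        return (-1);

      dest[count ++] = quad[i];
    }
  }

  dest[count] = 0;

  return (count);
}
```

With a full folding table, U+00DF (sharp s) expands to two output
characters, which is why the caller must size the output for growth.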
1264
1265
1266     3.2.1.3.  cups_propmap_t - Char Property Map Structure
1267
1268     typedef struct                  /**** Char Property Map Struct ****/
1269     {
1270       int                   used;           /* Number of times entry used */
1271       int                   propcount;      /* Count of Source Chars */
1272       cups_prop_t           *uni2prop;      /* Char -> Properties */
1273     } cups_propmap_t;
1274
1275     'uni2prop' is a pointer to an array of 'cups_prop_t' (see below).
1276     'propcount' is a count of elements in the 'uni2prop[]' array.
1277
1278
1279
1280     3.2.1.4.  cups_prop_t - Char Property Structure
1281
1282     typedef struct cups_prop_str    /**** Char Property Struct ****/
1283     {
1284       ucs2_t                ch;             /* Unicode Char as UCS-2 */
1285       unsigned char         gencat;         /* General Category */
1286       unsigned char         bidicat;        /* Bidirectional Category */
1287     } cups_prop_t;
1288
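A property lookup over 'uni2prop[]' and a major class test by mask
can be sketched as follows.  The general category values place the
major class in the high nibble (for example, 0x11 'Lu' belongs to
0x10 'L'), so a single mask tests for the whole class.  The use of
'bsearch()' on a table sorted by 'ch' is an assumption, and the
'sketch_' names are hypothetical.

```c
#include <stdlib.h>
#include <stdint.h>

typedef uint16_t sketch_ucs2_t;         /* Stand-in for 'ucs2_t' */

typedef struct                          /* Mirrors 'cups_prop_t' */
{
  sketch_ucs2_t ch;                     /* Unicode Char as UCS-2 */
  unsigned char gencat;                 /* General Category */
  unsigned char bidicat;                /* Bidirectional Category */
} sketch_prop_t;

/* Major class of a general category value: the high nibble */
#define SKETCH_GENCAT_MAJOR(gencat) ((gencat) & 0xF0)

/* Compare a key character against the 'ch' member of an element. */
static int
sketch_compare_propchar(const void *key, const void *elem)
{
  sketch_ucs2_t k = *(const sketch_ucs2_t *)key;
  sketch_ucs2_t e = ((const sketch_prop_t *)elem)->ch;

  return ((k > e) - (k < e));
}

/*
 * Hypothetical helper: return the general category of 'ch', or -1 if
 * the character is not present in the sparse table.
 */
int
sketch_gencat(const sketch_prop_t *uni2prop, int propcount,
              sketch_ucs2_t ch)
{
  const sketch_prop_t *hit;             /* Matching element, if any */

  if (uni2prop == NULL || propcount <= 0)
    return (-1);

  hit = bsearch(&ch, uni2prop, (size_t)propcount, sizeof(sketch_prop_t),
                sketch_compare_propchar);

  return (hit ? hit->gencat : -1);
}
```

A caller should check for '-1' before applying the mask, since the
table is sparse and most characters may have no entry.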
1289
1290
1291     3.2.1.5.  cups_breakmap_t - Line Break Map Structure
1292
1293     typedef struct                  /**** Line Break Class Map Struct ****/
1294     {
1295       int                   used;           /* Number of times entry used */
1296       int                   breakcount;     /* Count of Source Chars */
1297       ucs2_t                *uni2break;     /* Char -> Line Break Class */
1298     } cups_breakmap_t;
1299
1300     'uni2break' is a pointer to an array of _triplets_ of UCS-2 values.
1301     'breakcount' is a count of _triplets_ in the 'uni2break[]' array.
1302
1303     The triplets in 'uni2break' are:  first UCS-2 value in a range, last
1304     UCS-2 value in a range, and line break class stored as UCS-2.
1305
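Because each triplet describes a range, the comparison function for a
binary search must test the key against both ends of the range.  The
sketch below assumes the ranges are sorted and do not overlap, and it
returns the 'XX' (unknown) class for characters outside every range;
both points are assumptions, as are the 'sketch_' names.  The class
values follow the order of the 'cups_breakclass_t' enum.

```c
#include <stdlib.h>
#include <stdint.h>

typedef uint16_t sketch_ucs2_t;         /* Stand-in for 'ucs2_t' */

/* A few line break classes, numbered by their position in the
 * 'cups_breakclass_t' enum (AI = 0, AL = 1, and so on). */
enum
{
  SKETCH_BREAK_AL = 1,                  /* Ordinary Alphabetic */
  SKETCH_BREAK_NU = 18,                 /* Numeric */
  SKETCH_BREAK_SP = 25,                 /* Space */
  SKETCH_BREAK_XX = 27                  /* Unknown */
};

/* Compare a key character against the range held in a triplet. */
static int
sketch_compare_breakchar(const void *key, const void *triplet)
{
  sketch_ucs2_t       k = *(const sketch_ucs2_t *)key;
  const sketch_ucs2_t *t = (const sketch_ucs2_t *)triplet;

  if (k < t[0])
    return (-1);                        /* Below first value in range */
  if (k > t[1])
    return (1);                         /* Above last value in range */

  return (0);                           /* Inside the range */
}

/* Hypothetical helper: return the line break class of 'ch'. */
int
sketch_break_class(const sketch_ucs2_t *uni2break, int breakcount,
                   sketch_ucs2_t ch)
{
  const sketch_ucs2_t *hit;             /* Matching triplet, if any */

  if (uni2break == NULL || breakcount <= 0)
    return (SKETCH_BREAK_XX);

  hit = bsearch(&ch, uni2break, (size_t)breakcount,
                3 * sizeof(sketch_ucs2_t), sketch_compare_breakchar);

  return (hit ? hit[2] : SKETCH_BREAK_XX);
}
```

Storing ranges rather than single characters keeps the table small,
since long runs of characters share one line break class.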
1306
1307
1317     3.2.1.6.  cups_combmap_t - Combining Class Map Structure
1318
1319     typedef struct                  /**** Combining Class Map Struct ****/
1320     {
1321       int                   used;           /* Number of times entry used */
1322       int                   combcount;      /* Count of Source Chars */
1323       cups_comb_t           *uni2comb;      /* Char -> Combining Class */
1324     } cups_combmap_t;
1325
1326     'uni2comb' is a pointer to an array of 'cups_comb_t' (see below).
1327     'combcount' is a count of elements in the 'uni2comb[]' array.
1328
1329
1330
1331     3.2.1.7.  cups_comb_t - Combining Class Structure
1332
1333     typedef struct cups_comb_str    /**** Char Combining Class Struct ****/
1334     {
1335       ucs2_t                ch;             /* Unicode Char as UCS-2 */
1336       unsigned char         combclass;      /* Combining Class */
1337       unsigned char         reserved;       /* Reserved for alignment */
1338     } cups_comb_t;
1339
1340
1341
1342     3.2.2.  normalize.c - Normalization module
1343
1344     The normalization function 'cupsUtf8Normalize()' and the case folding
1345     function 'cupsUtf8CaseFold()' are modelled on the C standard library
1346     function 'strncpy()', except that they return the count of the output,
1347     like 'strlen()', rather than the (redundant) pointer to the output.
1348
1349     If the normalization or case folding functions detect invalid input
1350     parameters or they detect an encoding error in their input, then they
1351     return '-1', rather than the count of output.
1352
1353     The normalization and case folding functions take an input parameter
1354     indicating the maximum output units (for safe operation).
1355
1356
1357
1358     3.2.2.1.  cupsUtf8Normalize()
1359
1360     /*
1361      * Normalize UTF-8 string to Unicode UAX-15 Normalization Form
1362      * Note - Compatibility Normalization Forms (NFKD/NFKC) are
1363      * unsafe for subsequent transcoding to legacy charsets
1364      */
1365     extern int      cupsUtf8Normalize(utf8_t *dest, /* O - Target string */
1366                         const utf8_t *src,          /* I - Source string */
1372                         const int maxout,           /* I - Max output */
1373                         const cups_normalize_t normalize);
1374                                                     /* I - Normalization */
1375
1376     <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
1377     <Normalize by calling 'cupsUtf32Normalize()'>
1378     <Convert normalized UCS-4 to UTF-8 by calling 'cupsUtf32ToUtf8()'>
1379     <Return length of output UTF-8 string -- size in bytes>
1380
1381
1382
1383     3.2.2.2.  cupsUtf32Normalize()
1384
1385     extern int      cupsUtf32Normalize(utf32_t *dest,
1386                                                     /* O - Target string */
1387                         const utf32_t *src,         /* I - Source string */
1388                         const int maxout,           /* I - Max output */
1389                         const cups_normalize_t normalize);
1390                                                     /* I - Normalization */
1391
1392     <Find normalize maps by calling 'cupsNormalizeMapsGet()'>
1393     <...if not found, return '-1'>
1394     <Repeatedly traverse internal UCS-4, decomposing (NFD or NFKD)...>
1395     <...with 'bsearch()' of 'uni2norm[]' using local 'compare_decompose()'>
1396     <...until one pass yields no further decomposition>
1397     <Repeatedly traverse internal UCS-4, doing canonical reordering>
1398     <...with 'bsearch()' of 'uni2comb[]' using local 'compare_combchar()'>
1399     <...until one pass yields no further canonical reordering>
1400     <If 'normalize' requests composition (NFC or NFKC)...>
1401     <...repeatedly traverse internal UCS-4, composing (NFC or NFKC)...>
1402     <...with 'bsearch()' of 'uni2norm[]' using local 'compare_compose()'>
1403     <...until one pass yields no further composition>
1404     <Release normalize maps by calling 'cupsNormalizeMapsFree()'>
1405     <Return count of output UTF-32 string -- NOT memory size in bytes>
1406
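The canonical reordering step above can be sketched as repeated passes
that swap adjacent combining characters until a pass makes no change.
A pair is swapped only when the second character has a non-zero
combining class lower than that of the first, so base characters
(class 0) are never moved.  The 'sketch_' names are hypothetical, and
'uni2comb[]' is assumed to be sorted by 'ch'.

```c
#include <stdlib.h>
#include <stdint.h>

typedef uint16_t sketch_ucs2_t;         /* Stand-in for 'ucs2_t' */
typedef uint32_t sketch_utf32_t;        /* Stand-in for 'utf32_t' */

typedef struct                          /* Mirrors 'cups_comb_t' */
{
  sketch_ucs2_t ch;                     /* Unicode Char as UCS-2 */
  unsigned char combclass;              /* Combining Class */
  unsigned char reserved;               /* Reserved for alignment */
} sketch_comb_t;

/* Compare a key character against the 'ch' member of an element. */
static int
sketch_compare_combchar(const void *key, const void *elem)
{
  sketch_ucs2_t k = *(const sketch_ucs2_t *)key;
  sketch_ucs2_t e = ((const sketch_comb_t *)elem)->ch;

  return ((k > e) - (k < e));
}

/* Return the combining class of 'ch', or 0 (base) if not in the table. */
static int
sketch_combclass(const sketch_comb_t *uni2comb, int combcount,
                 sketch_utf32_t ch)
{
  sketch_ucs2_t       key;              /* Search key as UCS-2 */
  const sketch_comb_t *hit;             /* Matching element, if any */

  if (uni2comb == NULL || combcount <= 0 || ch > 0xFFFF)
    return (0);

  key = (sketch_ucs2_t)ch;
  hit = bsearch(&key, uni2comb, (size_t)combcount, sizeof(sketch_comb_t),
                sketch_compare_combchar);

  return (hit ? hit->combclass : 0);
}

/* Hypothetical helper: canonically reorder 'len' characters in place. */
void
sketch_canonical_reorder(sketch_utf32_t *str, int len,
                         const sketch_comb_t *uni2comb, int combcount)
{
  int i, swapped;

  if (str == NULL)
    return;

  do
  {
    swapped = 0;

    for (i = 0; i + 1 < len; i ++)
    {
      int a = sketch_combclass(uni2comb, combcount, str[i]);
      int b = sketch_combclass(uni2comb, combcount, str[i + 1]);

      if (b != 0 && a > b)
      {
        sketch_utf32_t tmp = str[i];

        str[i]     = str[i + 1];
        str[i + 1] = tmp;
        swapped    = 1;
      }
    }
  }
  while (swapped);
}
```

For example, U+0061 U+0301 U+0323 becomes U+0061 U+0323 U+0301, since
U+0323 (class 220) must precede U+0301 (class 230).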
1407
1408
1409     3.2.2.3.  cupsUtf8CaseFold()
1410
1411     /*
1412      * Case Fold UTF-8 string per Unicode UAX-21 Section 2.3
1413      * Note - Case folding output is
1414      * unsafe for subsequent transcoding to legacy charsets
1415      */
1416     extern int      cupsUtf8CaseFold(utf8_t *dest,  /* O - Target string */
1417                         const utf8_t *src,          /* I - Source string */
1418                         const int maxout,           /* I - Max output */
1419                         const cups_folding_t fold); /* I - Fold Mode */
1420
1421     <Find normalize maps by calling 'cupsNormalizeMapsGet()'>
1422     <...if not found, return '-1'>
1423     <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
1429     <Case fold internal UCS-4 by calling 'cupsUtf32CaseFold()'>
1430     <Convert internal UCS-4 to output UTF-8 by calling 'cupsUtf32ToUtf8()'>
1431     <Release normalize maps by calling 'cupsNormalizeMapsFree()'>
1432     <Return length of output UTF-8 string -- size in bytes>
1433
1434
1435
1436     3.2.2.4.  cupsUtf32CaseFold()
1437
1438     /*
1439      * Case Fold UTF-32 string per Unicode UAX-21 Section 2.3
1440      * Note - Case folding output is
1441      * unsafe for subsequent transcoding to legacy charsets
1442      */
1443     extern int      cupsUtf32CaseFold(utf32_t *dest,/* O - Target string */
1444                         const utf32_t *src,         /* I - Source string */
1445                         const int maxout,           /* I - Max output */
                             const cups_folding_t fold); /* I - Fold Mode */
1446
1447     <Find case fold maps by calling 'cupsNormalizeMapsGet()'>
1448     <...if not found, return '-1'>
1449     <Traverse internal UCS-4 once, performing case folding...>
1450     <...with 'bsearch()' of 'uni2fold[]' using local 'compare_foldchar()'>
1451     <Copy internal UCS-4 to output UTF-32 string>
1452     <Release normalize maps by calling 'cupsNormalizeMapsFree()'>
1453     <Return count of output UTF-32 string -- NOT memory size in bytes>
1454
1455
1456
1457     3.2.2.5.  cupsUtf8CompareCaseless()
1458
1459     /*
1460      * Compare UTF-8 strings after case folding
1461      */
1462     extern int      cupsUtf8CompareCaseless(const utf8_t *s1,
1463                                                     /* I - String1 */
1464                         const utf8_t *s2);          /* I - String2 */
1465
1466     <Case fold both input UTF-8 strings by calling 'cupsUtf8CaseFold()'>
1467     <Return compare of case folded first and second strings>
1468
1469
1470
1471     3.2.2.6.  cupsUtf32CompareCaseless()
1472
1473     /*
1474      * Compare UTF-32 strings after case folding
1475      */
1476     extern int      cupsUtf32CompareCaseless(const utf32_t *s1,
1477                                                     /* I - String1 */
1478                         const utf32_t *s2);         /* I - String2 */
1479
1480     <Case fold both input UTF-32 strings by calling 'cupsUtf32CaseFold()'>
1486     <Return compare of case folded first and second strings>
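
         A rough sketch of the compare step follows.  The helper
         'fold_ascii()' folds ASCII letters only and is a stand-in for
         'cupsUtf32CaseFold()'; it and 'sketch_utf32_compare_caseless()' are
         hypothetical names.  The sketch folds one code point at a time
         instead of folding whole strings first, which is an assumption made
         to keep the example short.

```c
typedef unsigned long utf32_t;          /* Assumed 32-bit code unit type */

/* Stand-in for real case folding: handles ASCII 'A'-'Z' only */
static utf32_t
fold_ascii(utf32_t ch)
{
  return ((ch >= 'A' && ch <= 'Z') ? ch + ('a' - 'A') : ch);
}

/* Returns <0, 0, or >0 like strcmp(), comparing folded code points */
int
sketch_utf32_compare_caseless(const utf32_t *s1, const utf32_t *s2)
{
  for (;; s1 ++, s2 ++)
  {
    utf32_t c1 = fold_ascii(*s1);
    utf32_t c2 = fold_ascii(*s2);

    if (c1 != c2)
      return (c1 < c2 ? -1 : 1);
    if (c1 == 0)
      return (0);                       /* Both strings ended together */
  }
}
```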
1487
1488
1489
1490     3.2.2.7.  cupsUtf8CompareIdentifier()
1491
1492     /*
1493      * Compare UTF-8 strings after case folding and NFKC normalization
1494      */
1495     extern int      cupsUtf8CompareIdentifier(const utf8_t *s1,
1496                                                     /* I - String1 */
1497                         const utf8_t *s2);          /* I - String2 */
1498
         <Convert both inputs from UTF-8 to UCS-4 by calling 'cupsUtf8ToUtf32()'>
1500     <Case fold both strings by calling 'cupsUtf32CaseFold()'>
1501     <Normalize both strings to NFKC by calling 'cupsUtf32Normalize()'>
1502     <Return compare of case folded/normalized first and second strings>
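
         The first step above (UTF-8 to internal UCS-4) might look like the
         sketch below.  It is not the 'cupsUtf8ToUtf32()' implementation; the
         name 'sketch_utf8_to_utf32()' and the error convention (return -1)
         are assumptions.  Overlong forms, surrogates, and values above
         0x10FFFF are rejected.

```c
typedef unsigned char utf8_t;           /* Assumed UTF-8 code unit type */
typedef unsigned long utf32_t;          /* Assumed UTF-32 code unit type */

/* Decode NUL-terminated UTF-8; returns count of code points or -1 */
int
sketch_utf8_to_utf32(utf32_t *dest, const utf8_t *src, const int maxout)
{
  int count = 0;

  if (!dest || !src || maxout < 1)
    return (-1);

  while (*src)
  {
    utf32_t ch   = *src++;
    utf32_t min  = 0;                   /* Smallest value for this length */
    int     more = 0;                   /* Continuation bytes expected */

    if (ch < 0x80)
      more = 0;
    else if ((ch & 0xE0) == 0xC0) { ch &= 0x1F; more = 1; min = 0x80; }
    else if ((ch & 0xF0) == 0xE0) { ch &= 0x0F; more = 2; min = 0x800; }
    else if ((ch & 0xF8) == 0xF0) { ch &= 0x07; more = 3; min = 0x10000; }
    else
      return (-1);                      /* Invalid lead byte */

    while (more-- > 0)
    {
      if ((*src & 0xC0) != 0x80)
        return (-1);                    /* Missing continuation byte */
      ch = (ch << 6) | (utf32_t)(*src++ & 0x3F);
    }

    if (ch < min || ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF))
      return (-1);                      /* Overlong, too large, surrogate */
    if (count >= maxout - 1)
      return (-1);                      /* Output would overflow */

    dest[count++] = ch;
  }

  dest[count] = 0;                      /* Terminate output string */
  return (count);
}
```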
1503
1504
1505
1506     3.2.2.8.  cupsUtf32CompareIdentifier()
1507
1508     /*
1509      * Compare UTF-32 strings after case folding and NFKC normalization
1510      */
1511     extern int      cupsUtf32CompareIdentifier(const utf32_t *s1,
1512                                                     /* I - String1 */
1513                         const utf32_t *s2);         /* I - String2 */
1514
1515     <Case fold both strings by calling 'cupsUtf32CaseFold()'>
1516     <Normalize both strings to NFKC by calling 'cupsUtf32Normalize()'>
1517     <Return compare of case folded/normalized first and second strings>
1518
1519
1520
1521     3.2.2.9.  cupsUtf32CharacterProperty()
1522
1523     /*
1524      * Get UTF-32 character property
1525      */
1526     extern int      cupsUtf32CharacterProperty(const utf32_t ch,
1527                                                     /* I - Source char */
1528                         const cups_property_t property);
1529                                                     /* I - Char Property */
1530
         <Lookup UTF-32 character property in appropriate map...>
         <...internal functions for each different map lookup>
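
         One way to picture a single map lookup is a binary search over
         sorted code point ranges.  The sketch below assumes a range-based
         map; 'prop_range_t', 'sketch_bidicat_t', and
         'sketch_bidi_category()' are hypothetical names, and the table holds
         only a few sample ranges rather than the full Unicode data.

```c
#include <stdlib.h>

typedef unsigned long utf32_t;          /* Assumed 32-bit code unit type */

typedef enum                            /* Hypothetical subset of categories */
{
  SKETCH_BIDI_OTHER = 0,                /* Not covered by the sample table */
  SKETCH_BIDI_L,                        /* Left-to-Right */
  SKETCH_BIDI_R,                        /* Right-to-Left (Hebrew) */
  SKETCH_BIDI_AL,                       /* Right-to-Left (Arabic) */
  SKETCH_BIDI_LRE,                      /* Left-to-Right Embedding */
  SKETCH_BIDI_RLE,                      /* Right-to-Left Embedding */
  SKETCH_BIDI_PDF,                      /* Pop Directional Format */
  SKETCH_BIDI_LRO,                      /* Left-to-Right Override */
  SKETCH_BIDI_RLO                       /* Right-to-Left Override */
} sketch_bidicat_t;

typedef struct                          /* Hypothetical range map entry */
{
  utf32_t          first;               /* First code point in range */
  utf32_t          last;                /* Last code point in range */
  sketch_bidicat_t cat;                 /* Category for the whole range */
} prop_range_t;

/* Sample ranges only, sorted by 'first' and non-overlapping */
static const prop_range_t uni2prop[] =
{
  { 0x0041, 0x005A, SKETCH_BIDI_L },
  { 0x0061, 0x007A, SKETCH_BIDI_L },
  { 0x05D0, 0x05EA, SKETCH_BIDI_R },
  { 0x0621, 0x063A, SKETCH_BIDI_AL },
  { 0x202A, 0x202A, SKETCH_BIDI_LRE },
  { 0x202B, 0x202B, SKETCH_BIDI_RLE },
  { 0x202C, 0x202C, SKETCH_BIDI_PDF },
  { 0x202D, 0x202D, SKETCH_BIDI_LRO },
  { 0x202E, 0x202E, SKETCH_BIDI_RLO }
};

static int
compare_propchar(const void *key, const void *entry)
{
  utf32_t             ch    = *(const utf32_t *)key;
  const prop_range_t *range = (const prop_range_t *)entry;

  if (ch < range->first)
    return (-1);
  if (ch > range->last)
    return (1);
  return (0);
}

sketch_bidicat_t
sketch_bidi_category(const utf32_t ch)
{
  const prop_range_t *match;

  match = bsearch(&ch, uni2prop, sizeof(uni2prop) / sizeof(uni2prop[0]),
                  sizeof(uni2prop[0]), compare_propchar);
  return (match ? match->cat : SKETCH_BIDI_OTHER);
}
```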
1533
1534
1535
1545     3.2.2.10.  Normalization Utility Functions
1546
1547
1548
1549
1550     3.2.2.10.1.  cupsNormalizeMapsGet()
1551
         extern void     cupsNormalizeMapsGet(void);
1553
1554     <Find normalize maps in cache>
1555     <...If found, increment 'used'>
1556     <...and return void>
1557     <For each map (normalization, case fold, combining class, etc.)...>
1558     <Open (preprocessed form of) Unicode data file...>
1559     <...If not found, return void>
1560     <Count lines in preprocessed form, for mapping memory alloc>
1561     <...Close (preprocessed form of) Unicode data file>
1562     <Open (preprocessed form of) Unicode data file...>
1563     <...If not found, return void>
         <Allocate memory for appropriate map in cache...>
1565     <...If no memory, return void>
1566     <Add to appropriate cache by assigning 'next' field>
1567     <Assign map type field and count field>
1568     <Increment 'used' field>
1569     <Read normalize map into memory in loop...>
1570     <...Add values to 'uni2xxx[]' array>
1571     <Close (preprocessed form of) Unicode data file>
1572     <Return void>
1573
1574
1575
1576     3.2.2.10.2.  cupsNormalizeMapsFree()
1577
1578     extern void     cupsNormalizeMapsFree(void);
1579
1580     <Find normalize maps in cache>
1581     <...If found, decrement 'used'>
1582     <Return void>
1583
1584
1585
1586     3.2.2.10.3.  cupsNormalizeMapsFlush()
1587
1588     extern void     cupsNormalizeMapsFlush(void);
1589
1590     <Loop through normalize maps cache...>
1591     <...Free 'uni2norm[]' memory>
1592     <...Free normalize map memory>
1593     <Loop through case folding cache...>
1594     <...Free 'uni2fold[]' memory>
1600     <...Free case folding memory>
1601     <Loop through char property map cache...>
1602     <...Free 'uni2prop[]' memory>
1603     <...Free char property map memory>
1604     <Loop through line break class map cache...>
1605     <...Free 'uni2break[]' memory>
1606     <...Free line break class map memory>
1607     <Loop through combining class map cache...>
1608     <...Free 'uni2comb[]' memory>
1609     <...Free combining class map memory>
1610     <Return void>
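
         The get/free/flush life cycle described above can be sketched with a
         single reference-counted cache.  Everything here is illustrative:
         'sketch_map_t' and the 'sketch_maps_...()' functions are
         hypothetical, the map data is a zero-filled placeholder instead of
         file contents, and the functions return values (unlike the 'void'
         prototypes above) only so that the behavior can be checked.

```c
#include <stdlib.h>

typedef struct sketch_map_str           /* Hypothetical cached map */
{
  struct sketch_map_str *next;          /* Next map in cache */
  int                   used;           /* Number of active users */
  int                   count;          /* Entries in 'values[]' */
  unsigned long         *values;        /* Stand-in for 'uni2xxx[]' */
} sketch_map_t;

static sketch_map_t *map_cache = NULL;  /* Head of cache list */

/* Find or load the map; returns NULL if memory is exhausted */
sketch_map_t *
sketch_maps_get(void)
{
  sketch_map_t *map = map_cache;

  if (map)                              /* Found in cache */
  {
    map->used ++;
    return (map);
  }

  if ((map = calloc(1, sizeof(sketch_map_t))) == NULL)
    return (NULL);

  map->count = 4;                       /* Would be the data file line count */
  map->values = calloc((size_t)map->count, sizeof(unsigned long));
  if (map->values == NULL)
  {
    free(map);
    return (NULL);
  }

  map->used = 1;
  map->next = map_cache;                /* Add to cache via 'next' field */
  map_cache = map;
  return (map);
}

/* Release one use; memory stays cached until flush */
void
sketch_maps_free(void)
{
  if (map_cache && map_cache->used > 0)
    map_cache->used --;
}

/* Free every cached map; returns the number of maps freed */
int
sketch_maps_flush(void)
{
  int freed = 0;

  while (map_cache)
  {
    sketch_map_t *next = map_cache->next;

    free(map_cache->values);
    free(map_cache);
    map_cache = next;
    freed ++;
  }

  return (freed);
}
```

         Keeping the memory until an explicit flush means repeated
         normalization calls pay the load cost only once.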
1611
1612
1613
1614     3.3.  Language - Existing
1615
1616
1617
1618     3.3.1.  language.h - Language header
1619
1620     Required Changes:
1621
1622     (1) Change definition of 'cups_lang_t' to correct length of 'language[]'
1623         to 32 characters per [RFC3066] and [ISO639-2] and [ISO3166-1].
1624
1625
1626
1627     3.3.2.  language.c - Language module
1628
1629
1630
1631     3.3.2.1.  cupsLangEncoding() - Existing
1632
1633     [No Change]
1634
1635
1636
1637     3.3.2.2.  cupsLangFlush() - Existing
1638
1639     [No Change]
1640
1641
1642
1643     3.3.2.3.  cupsLangFree() - Existing
1644
1645     [No Change]
1656
1657
1658
1659     3.3.2.4.  cupsLangGet() - Existing
1660
1661     Required Changes:
1662
1663     (1) Change length of 'langname[]' and 'real[]' to 64 characters per
1664         [RFC3066] and potential length of encoding (charset) names;
1665     (2) Change language string normalization to support:
1666         (a) 8-character language codes per [RFC3066] and 3-character
1667         language codes per [ISO639-2];
1668         (b) 8-character country codes per [RFC3066] and 3-character country
1669         codes per [ISO3166-1];
1670         (c) Support for 'i' (IANA registered) and 'x' (private) language
1671         prefixes per [RFC3066];
1672         (d) Invariant use of 'utf-8' for encoding in message catalog, but
1673         save actual requested encoding name for later use.
1674     (3) Correct broken do/while statement for message catalog lookup (while
1675         condition is _never_ satisfied).
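
         The normalization in item (2) could be approached as in the sketch
         below.  It is an assumption-laden illustration, not the planned
         'cupsLangGet()' code:  'sketch_lang_normalize()' is a hypothetical
         name, the 'll_CC.charset' input shape is assumed, and the 'i' and
         'x' prefixes of item (2)(c) are not handled.

```c
#include <ctype.h>
#include <string.h>

#define SKETCH_SUBTAG_MAX 8             /* Max subtag length per [RFC3066] */

/*
 * Split 'll[-_]CC[.charset]' into lowercase language, uppercase country,
 * and the requested charset name.  Returns 0 on success, -1 on error.
 */
int
sketch_lang_normalize(const char *name,
                      char       *lang,     /* O - at least 9 bytes */
                      char       *country,  /* O - at least 9 bytes */
                      char       *charset,  /* O - 'csize' bytes */
                      size_t     csize)
{
  size_t n;

  if (!name || !lang || !country || !charset || csize < 1)
    return (-1);

  lang[0] = country[0] = charset[0] = '\0';

  for (n = 0; isalnum((unsigned char)*name); name ++)
  {
    if (n >= SKETCH_SUBTAG_MAX)
      return (-1);                      /* Language subtag too long */
    lang[n ++] = (char)tolower((unsigned char)*name);
  }
  lang[n] = '\0';
  if (n == 0)
    return (-1);                        /* No language subtag */

  if (*name == '-' || *name == '_')
  {
    name ++;
    for (n = 0; isalnum((unsigned char)*name); name ++)
    {
      if (n >= SKETCH_SUBTAG_MAX)
        return (-1);                    /* Country subtag too long */
      country[n ++] = (char)toupper((unsigned char)*name);
    }
    country[n] = '\0';
  }

  if (*name == '.')
  {
    /* Save the requested encoding name for later use */
    strncpy(charset, name + 1, csize - 1);
    charset[csize - 1] = '\0';
  }
  else if (*name != '\0')
    return (-1);                        /* Unexpected trailing character */

  return (0);
}
```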
1676
1677
1678
1679     3.3.2.5.  cupsLangPrintf() - New
1680
1681     extern  int     cupsLangPrintf(FILE *fp,        /* I - File to write */
1682                         const cups_lang_t *lang,    /* I - Language/locale*/
1683                         const cups_msg_t msg,       /* I - Msg to format */
1684                         ...);                       /* I - Args to format */
1685
1686     <Set up variable args by calling 'va_start()'>
1687     <Format CUPS message with variable args by calling 'vsnprintf()'>
1688     <Clean up variable args by calling 'va_end()'>
1689     <Transcode CUPS message by calling 'cupsUtf8ToCharset()'>
1690     <Write CUPS message by calling 'fputs()'>
1691     <Return transcoded output CUPS message length>
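
         The format-then-write sequence above is sketched below with the
         transcoding step left out.  'sketch_lang_printf()' is a hypothetical
         name with a simplified argument list (a plain format string instead
         of 'lang' and 'msg'), and the 1024-byte buffer size is an
         assumption.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Format a message and write it to 'fp'.  Transcoding is omitted here;
 * the formatted UTF-8 buffer is written unchanged.
 * Returns the number of bytes written, or -1 on error.
 */
int
sketch_lang_printf(FILE *fp, const char *format, ...)
{
  char    buffer[1024];                 /* Assumed maximum message size */
  int     bytes;
  va_list ap;

  if (!fp || !format)
    return (-1);

  va_start(ap, format);
  bytes = vsnprintf(buffer, sizeof(buffer), format, ap);
  va_end(ap);

  if (bytes < 0)
    return (-1);
  if ((size_t)bytes >= sizeof(buffer))
    bytes = (int)sizeof(buffer) - 1;    /* Output was truncated */

  /* A call such as 'cupsUtf8ToCharset()' would transcode 'buffer' here */

  if (fputs(buffer, fp) == EOF)
    return (-1);

  return (bytes);
}
```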
1692
1693
1694
1695     3.3.2.6.  cupsLangPuts() - New
1696
1697     extern  int     cupsLangPuts(FILE *fp,          /* I - File to write */
1698                         const cups_lang_t *lang,    /* I - Language/locale*/
1699                         const cups_msg_t msg);      /* I - Msg to write */
1700
1701     <Transcode CUPS message by calling 'cupsUtf8ToCharset()'>
1702     <Write CUPS message by calling 'fputs()'>
1703     <Return transcoded output CUPS message length>
1713
1714
1715
1716     3.3.2.7.  cupsEncodingName() - New
1717
1718     extern  char    *cupsEncodingName(cups_encoding_t encoding);
1719
1720     <Lookup encoding name in static 'lang_encodings[]' array>
1721     <Return pointer to encoding name (charset map file name)>
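
         A static table lookup of this kind can be sketched as follows.  The
         enumeration, the four sample names, and the fallback to the first
         entry for out-of-range values are assumptions made for the example;
         the real 'lang_encodings[]' array and 'cups_encoding_t' values are
         defined elsewhere.

```c
#include <stddef.h>

typedef enum                            /* Hypothetical subset of encodings */
{
  SKETCH_US_ASCII = 0,
  SKETCH_ISO8859_1,
  SKETCH_ISO8859_2,
  SKETCH_UTF8,
  SKETCH_ENCODING_MAX                   /* Number of entries */
} sketch_encoding_t;

/* Sample names, indexed by 'sketch_encoding_t' */
static const char * const lang_encodings[] =
{
  "us-ascii",
  "iso-8859-1",
  "iso-8859-2",
  "utf-8"
};

/* Returns the encoding name, or the first entry if out of range */
const char *
sketch_encoding_name(sketch_encoding_t encoding)
{
  if ((int)encoding < 0 || (int)encoding >= (int)SKETCH_ENCODING_MAX)
    return (lang_encodings[SKETCH_US_ASCII]);

  return (lang_encodings[encoding]);
}
```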
1722
1723
1724
1725     3.4.  Common Text Filter - Existing
1726
1727
1728
1729     3.4.1.  textcommon.h - Common text filter header
1730
         Required Changes:
1732
1733     (1) Revise 'lchar_t' as specified below, adding 'attrx' bit-mask for
1734         selected Unicode character properties;
1735     (2) Revise 'lchar_t' as specified below, adding 'comblen' and 'combch[]'
1736         for Unicode combining/attached chars (accents);
1737     (3) Add 'COMBLEN_MAX' limit as specified below;
1738     (4) Add 'ATTRX_...' selected Unicode character properties as specified
1739         below.
1740
1741
1742
1743     3.4.1.1.  lchar_t - Character/Attribute Structure
1744
1745     typedef struct lchar_str    /**** Character / Attribute Structure ****/
1746     {
1747       unsigned short        ch;             /* Unicode Char as UCS-2 */
1748                                             /* or 8/16-bit Legacy Char */
1749       unsigned short        attr;           /* Attributes of Char */
1750       unsigned short        attrx;          /* Extended Attributes */
1751       unsigned short        comblen;        /* Combining Char Count */
1752       unsigned short        combch[8];      /* Combining Chars as UCS-2 */
1753     } lchar_t;
1754
         'ch' is a 16-bit UCS-2 character or an 8/16-bit legacy char.  'attr' is
1756     the character attributes defined for the existing 'lchar_t' structure
1757     (defined in 'textcommon.h').  'attrx' is the extended character
1758     attributes defined for future selected Unicode character properties (see
1759     below).  'comblen' is the number of attached/combining characters.
1760     'combch' is an array of 16-bit UCS-2 attached/combining characters.
1761
         Add the following constants to 'textcommon.h':

         #define COMBLEN_MAX       8
         #define ATTRX_RIGHT2LEFT  0x0001
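
         To show how 'comblen' and 'combch[]' are meant to work together, the
         sketch below appends combining characters to an 'lchar_t' with a
         bounds check.  The helper 'sketch_add_combining()' is hypothetical;
         the structure and constant repeat the definitions above.

```c
#define COMBLEN_MAX 8

typedef struct lchar_str        /**** Character / Attribute Structure ****/
{
  unsigned short ch;                    /* Unicode Char as UCS-2 */
  unsigned short attr;                  /* Attributes of Char */
  unsigned short attrx;                 /* Extended Attributes */
  unsigned short comblen;               /* Combining Char Count */
  unsigned short combch[COMBLEN_MAX];   /* Combining Chars as UCS-2 */
} lchar_t;

/* Attach one combining char; returns 1 if stored, 0 if 'combch[]' is full */
int
sketch_add_combining(lchar_t *cp, unsigned short ch)
{
  if (cp->comblen >= COMBLEN_MAX)
    return (0);                         /* Valid indexes are 0..COMBLEN_MAX-1 */

  cp->combch[cp->comblen ++] = ch;
  return (1);
}
```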
1773
1774
1775
1776     3.4.2.  textcommon.c - Common text filter
1777
1778     Required Changes:
1779
1780     (1) Revise 'TextMain()' function as described below.
1781
1782
1783
1784     3.4.2.1.  TextMain() - Existing
1785
1786     Required Changes:
1787
1788     [Ed Note:  Pseudo code below needs more work on bidi handling.]
1789
         (1) In main loop at the _beginning_ of the 'default' clause, add the
             following code for combining marks:

             lchar_t *cp;

             cp = Page[line];
             cp += column;

             /*
              * Check for Unicode combining mark (accent)
              */
             if (UTF8 && cupsUtf32CombiningClass(ch) > 0)
             {
              /*
               * Save Unicode combining mark in SAME character;
               * drop marks that do not fit in 'combch[]'
               */
               if (cp->comblen >= COMBLEN_MAX)
                 break;
               cp->combch[cp->comblen] = ch;
               cp->comblen ++;
               break;
             }
1811
         (2) In main loop _after_ combining chars section in 'default' clause,
             add the following code for Unicode bidi control characters:

             cups_bidicat_t bidicat;

             /*
              * Check for Unicode bidi control character
              */
             if (UTF8)
             {
               bidicat = (cups_bidicat_t)
                 cupsUtf32CharacterProperty(ch, CUPS_PROP_BIDI_CATEGORY);

               if ((bidicat == CUPS_BIDI_LRE)    /* Left-to-Right Embedding */
               || (bidicat == CUPS_BIDI_LRO)     /* Left-to-Right Override */
               || (bidicat == CUPS_BIDI_RLE)     /* Right-to-Left Embedding */
               || (bidicat == CUPS_BIDI_RLO)     /* Right-to-Left Override */
               || (bidicat == CUPS_BIDI_PDF))    /* Pop Directional Format */
               {
                 /* Do bidi stuff here with memory for NEXT char's direction */
                 /* Discard bidi control character and break */
               }
               if ((bidicat == CUPS_BIDI_R)      /* Right-to-Left Hebrew */
               || (bidicat == CUPS_BIDI_AL))     /* Right-to-Left Arabic */
               {
                 /* Set attrx for right-to-left */
                 cp->attrx |= ATTRX_RIGHT2LEFT;
               }
             }
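
         The "discard bidi control character" step can be pictured with the
         small filter below.  This is only one possible approach and does not
         track embedding or override state; 'sketch_is_bidi_control()' and
         'sketch_strip_bidi_controls()' are hypothetical names.  The code
         points tested are the explicit formatting characters U+202A (LRE),
         U+202B (RLE), U+202C (PDF), U+202D (LRO), and U+202E (RLO).

```c
#include <stddef.h>

typedef unsigned long utf32_t;          /* Assumed 32-bit code unit type */

/* True for LRE, RLE, PDF, LRO, and RLO (U+202A through U+202E) */
static int
sketch_is_bidi_control(utf32_t ch)
{
  return (ch >= 0x202A && ch <= 0x202E);
}

/* Remove explicit bidi controls in place; returns the new length */
size_t
sketch_strip_bidi_controls(utf32_t *text, size_t len)
{
  size_t in;
  size_t out = 0;

  for (in = 0; in < len; in ++)
    if (!sketch_is_bidi_control(text[in]))
      text[out ++] = text[in];

  return (out);
}
```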
1844
1845
1846
1847     3.4.2.2.  compare_keywords() - Existing
1848
1849     [No Change]
1850
1851
1852
1853     3.4.2.3.  getutf8() - Existing
1854
1855     [No Change]
1856
         [Ed Note:  Future - allow 21-bit UTF-32 code points - requires updates
         in both 'textcommon.c' and 'texttops.c' for extended PostScript.]
1859
1860
1861
1862     3.5.  Text to PostScript Filter - Existing
1863
1864
1865
1866     3.5.1.  texttops.c - Text to PostScript filter
1867
1868     Required Changes:
1869
1870     (1) Revise local 'write_string()' function as described below.
1871
1872
1873
1874     3.5.1.1.  main() - Existing
1875
1876     [No Change]
1884
1885
1886
         3.5.1.2.  WriteEpilogue() - Existing
1888
1889     [No Change]
1890
1891
1892
         3.5.1.3.  WritePage() - Existing
1894
1895     [No Change]
1896
1897
1898
         3.5.1.4.  WriteProlog() - Existing
1900
1901     [No Change]
1902
1903
1904
1905     3.5.1.5.  write_line() - Existing
1906
1907     [No Change]
1908
1909
1910
1911     3.5.1.6.  write_string() - Existing
1912
1913     Required Changes:
1914
1915     (1) At the _beginning_ of Multiple Fonts section, _replace_ the while()
1916         loop and surrounding 'putchar()' calls with the following code:
1917
             for (; len > 0; len --, s ++)
             {
               utf32_t decstr[COMBLEN_MAX * 2];
               utf32_t cmpstr[COMBLEN_MAX * 2];
               int     cmplen;
               int     i;

               if (s->comblen == 0)
               {
                 printf("<%04x>", Chars[s->ch]);
                 continue;
               }

              /*
               * Normalize decomposed Unicode character to NFKC
               * (compatibility decomposition, then canonical composition)
               */
               decstr[0] = (utf32_t) s->ch;
               for (i = 0; i < s->comblen; i ++)
                 decstr[i + 1] = (utf32_t) s->combch[i];
               decstr[i + 1] = 0;      /* Terminate AFTER last combining char */
               cmplen = cupsUtf32Normalize(&cmpstr[0],
                            &decstr[0], COMBLEN_MAX * 2, CUPS_NORM_NFKC);
               if (cmplen < 1)
                 continue;

              /*
               * Write combining chars, then composed base, to same location
               */
               for (i = 1; i < cmplen; i ++)
               {
                 printf("<%04x>", Chars[(int) cmpstr[i]]);
                /*
                 * Superimpose glyphs by backing up one column width
                 */
                 printf(" -%.3f ", (72.0f / (float) CharsPerInch));
               }
               printf("<%04x>", Chars[(int) cmpstr[0]]);
             }
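
         The build-and-compose step in the loop above can be exercised in
         isolation with the sketch below.  It is not 'cupsUtf32Normalize()':
         'sketch_compose_pair()' knows three sample compositions only, and
         the canonical reordering and blocking rules of real NFKC are
         ignored.  All names are hypothetical.

```c
#define COMBLEN_MAX 8

typedef unsigned long utf32_t;          /* Assumed 32-bit code unit type */

/* Tiny sample of canonical compositions: base + mark -> composed, else 0 */
static utf32_t
sketch_compose_pair(utf32_t base, utf32_t mark)
{
  if (base == 0x0065 && mark == 0x0301)
    return (0x00E9);                    /* e + acute */
  if (base == 0x0061 && mark == 0x0300)
    return (0x00E0);                    /* a + grave */
  if (base == 0x006E && mark == 0x0303)
    return (0x00F1);                    /* n + tilde */
  return (0);
}

/*
 * Build base + marks, compose what the sample table allows, and write the
 * result to 'out'.  Returns the count of output code points, or -1.
 */
int
sketch_compose(utf32_t *out, int maxout, unsigned short ch,
               const unsigned short *combch, int comblen)
{
  utf32_t decstr[COMBLEN_MAX + 2];      /* Base + marks + terminator */
  int     i;
  int     count = 1;

  if (!out || !combch || comblen < 0 || comblen > COMBLEN_MAX ||
      maxout < comblen + 2)
    return (-1);

  decstr[0] = ch;
  for (i = 0; i < comblen; i ++)
    decstr[i + 1] = combch[i];
  decstr[i + 1] = 0;                    /* Terminator goes AFTER last mark */

  out[0] = decstr[0];
  for (i = 1; decstr[i] != 0; i ++)
  {
    utf32_t composed = sketch_compose_pair(out[0], decstr[i]);

    if (composed)
      out[0] = composed;                /* Mark absorbed into the base */
    else
      out[count ++] = decstr[i];        /* Mark stays separate */
  }

  out[count] = 0;
  return (count);
}
```

         When a mark cannot be composed it remains in the output after the
         base, which corresponds to the case where 'write_string()' must
         superimpose separate glyphs.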
1962
1963     [Ed Note:  Future - Bidi support - When writing Unicode characters
1964     (checking for explicit bidi) convert input string (lchar_t) to display
1965     order???]
1966
1967
1968
1969     3.5.1.7.  write_text() - Existing
1970
1971     [No Change]
1972
1973
1974
1998                                    APPENDIX A
1999                                     Glossary
2000
2001
2002
2003     A.  Glossary
2004
2005     Abstract Character:  A unit of information used for the organization,
2006     control, or representation of textual data.
2007
2008     Accent Mark:  A mark placed above, below, or to the side of a character
2009     to alter its phonetic value (also 'diacritic').
2010
2011     Alphabet:  A collection of symbols that, in the context of a particular
2012     written language, represent the sounds of that language.
2013
2014     Base Character:  A character that does not graphically combine with
2015     preceding characters, and that is neither a control nor a format
2016     character.
2017
2018     Basic Multilingual Plane:  The Unicode (or UCS) code values 0x0000
2019     through 0xFFFF, specified by [ISO10646] (also 'Plane 0').
2020
2021     BIDI:  Abbreviation for Bidirectional, in reference to mixed
2022     left-to-right and right-to-left text.
2023
2024     Bidirectional Display:  The process or result of mixing left-to-right
2025     oriented text and right-to-left oriented text in a single line.
2026
2027     Big-endian:  A computer architecture that stores multiple-byte numerical
2028     values with the most significant byte (MSB) values first.
2029
2030     BMP:  Abbreviation for Basic Multilingual Plane.
2031
2032     BOM:  Acronym for byte order mark (also 'ZWNBSP').
2033
2034     Byte Order Mark:  The Unicode character U+FEFF Zero Width No-Break Space
2035     (ZWNBSP) when used to indicate the byte order of text.
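
         As an illustration of how a byte order mark indicates byte order,
         the sketch below inspects the first two octets of UTF-16 data.  The
         names are hypothetical, and UTF-32 (where the same U+FEFF is stored
         in four octets) is deliberately not handled.

```c
#include <stddef.h>

typedef enum                            /* Hypothetical result values */
{
  SKETCH_BOM_NONE = 0,                  /* No byte order mark found */
  SKETCH_BOM_UTF16BE,                   /* FE FF - big-endian */
  SKETCH_BOM_UTF16LE                    /* FF FE - little-endian */
} sketch_bom_t;

/* Check for U+FEFF at the start of UTF-16 data */
sketch_bom_t
sketch_detect_bom16(const unsigned char *data, size_t len)
{
  if (data && len >= 2)
  {
    if (data[0] == 0xFE && data[1] == 0xFF)
      return (SKETCH_BOM_UTF16BE);
    if (data[0] == 0xFF && data[1] == 0xFE)
      return (SKETCH_BOM_UTF16LE);
  }

  return (SKETCH_BOM_NONE);
}
```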
2036
2037     Canonical:  (1) Conforming to the general rules for encoding -- that is,
2038     not compressed, compacted, or in any other form specified by a higher
2039     protocol.  (2) Characteristic of a normative mapping and form of
2040     equivalence.
2041
2042     Canonical Decomposition:  The decomposition of a character that results
2043     from recursively applying the canonical mappings defined in the Unicode
2044     Character Database until no characters can be further decomposed, then
2045     reordering nonspacing marks according to section 3.10 of [UNICODE3.2].
2046
2047     Canonical Equivalent:  Two characters are canonical equivalents if their
2048     full canonical decompositions are identical.
2049
         Case:  (1) Feature of certain alphabets where the letters have two
2058     distinct forms.  These variants are called the 'uppercase' letter (also
2059     known as 'capital' or 'majuscule') and the 'lowercase' letter (also
2060     known as 'small' or 'minuscule').  (2) Normative property of Unicode
2061     characters, consisting of uppercase, lowercase, and titlecase.
2062
2063     Character:  (1) The smallest component of written language that has
2064     semantic value; refers to the abstract meaning and/or shape, rather than
2065     a specific shape (see also 'glyph').  (2) Synonym for 'abstract
2066     character'.  (3) The basic unit of encoding for the Unicode character
2067     encoding.  (4) The English name for the ideographic written elements of
2068     Chinese origin (see 'ideograph').
2069
2070     Character Encoding Form (CEF):  Mapping from a character set definition
2071     to the actual bits used to represent the data.
2072
2073     Character Encoding Scheme (CES):  A 'character encoding form' plus byte
2074     serialization.  [UNICODE3.2] defines seven character encoding schemes:
         UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
2076
2077     Character Properties:  A set of property names and property values
2078     associated with individual characters defined in [UNICODE3.2].
2079
2080     Character Repertoire:  (1) The collection of characters included in a
2081     character set.  (2) The SUBSET of characters included in a large
2082     character set, e.g., [UNICODE3.2], that are necessary to support a
2083     complete mapping to another smaller character set, e.g., ISO8859-1 (also
2084     called 'Latin-1').
2085
2086     Character Set:  A collection of elements used to represent textual
2087     information.
2088
2089     Coded Character Set:  A character set in which each character is
2090     assigned a numeric code value.  Frequently abbreviated as 'character
2091     set', 'charset', or 'code set'.
2092
2093     Code Point:  (1) A numerical index (or position) in an encoding table
2094     used for encoding characters.  (2) Synonym for 'Unicode scalar value'.
2095
2096     Collation:  The process of ordering units of textual information.
2097     Collation is usually specific to a particular language.  Also known as
2098     'alphabetizing' or 'alphabetic sorting'.
2099
2100     Combining Character:  A character that graphically combines with a
2101     preceding 'base character'.  The combining character is said to 'apply'
2102     to that base character.  (See also 'nonspacing mark'.)
2103
2104     Compatibility:  (1) Consistency with existing practice or preexisting
         character encoding standards.  (2) Characteristic of a normative
2106     mapping and form of equivalence (see 'compatibility decomposition').
2107
2116     Compatibility Character:  A character that has a compatibility
2117     decomposition.
2118
2119     Compatibility Decomposition:  The decomposition of a character that
2120     results from recursively applying BOTH the compatibility mappings AND
2121     the canonical mappings found in the Unicode Character Database until no
2122     characters can be further decomposed, then reordering nonspacing marks
2123     according to section 3.10 of [UNICODE3.2].
2124
2125     Compatibility Equivalent:  Two characters are compatibility equivalents
2126     if their full compatibility decompositions are identical.
2127
         Composed Character:  (See 'decomposable character'.)
2129
2130     DBCS:  Acronym for 'double-byte character set'.
2131
2132     Decomposable Character:  A character that is equivalent to a sequence of
2133     one or more other characters, according to the decomposition mappings
2134     found in [UNICODE3.2].  It may also be known as a 'precomposed
2135     character' or a 'composite character'.
2136
2137     Decomposition:  (1) The process of separating or analyzing a text
2138     element into component units.  (2) A sequence of one or more characters
2139     that is equivalent to a 'decomposable character'.
2140
2141     Diacritic:  (See 'accent mark'.)
2142
2143     Double-Byte Character Set (DBCS):  One of a number of character sets
2144     defined for representing Chinese, Japanese, or Korean text (for example,
2145     JIS X 0208-1990).  These character sets are often encoded in such a way
2146     as to allow double-byte character encodings to be mixed with single-byte
2147     character encodings.  (See also 'multiple-byte character set'.)
2148
         Font:  A collection of glyphs used for visual depiction of character
2150     data.
2151
2152     FSS-UTF:  Abbreviation for 'File System Safe UCS Transformation Format',
2153     originally published by X/Open.  Now called 'UTF-8'.
2154
2155     Fullwidth:  Characters of East Asian character sets whose glyph image
2156     extends across the entire character display cell.  In legacy character
2157     sets, fullwidth characters are normally encoded in two or three bytes.
2158
2159     Glyph:  (1) An abstract form that represents one or more glyph images.
2160     (2) A synonym for 'glyph image'.
2161
2162     Glyph Image:  The actual, concrete image of a glyph representation
         having been rasterized or otherwise imaged onto some display surface.
2164
2173     Halfwidth:  Characters of East Asian character sets whose glyph image
2174     occupies half of the character display cell.  In legacy character sets,
2175     halfwidth characters are normally encoded in a single byte.
2176
2177     Han Characters:  Ideographic characters of Chinese origin.
2178
2179     Hangul:  The name of the script used to write the Korean language.
2180
2181     High-Surrogate:  A Unicode code value in the range U+D800 to U+DBFF.
2182
2183     Hiragana:  One of two standard syllabaries associated with the Japanese
         writing system.  Used to write particles, grammatical affixes, and words
2185     that have no 'kanji' form.
2186
2187     IANA:  Internet Assigned Numbers Authority.
2188
2189     Ideograph:  (1) Any symbol that denotes an idea (or meaning) in contrast
2190     to a sound or pronunciation (for example, a 'smiley face').  (2) A
2191     common term used to refer to Han characters.
2192
2193     IPA:  International Phonetic Alphabet.
2194
2195     IRG:  Abbreviation for Ideographic Rapporteur Group, a subgroup of
2196     ISO/IEC JTC1/SC2/WG2 (who work on Han unification and submission of new
2197     Han characters for inclusion in revised versions of Unicode/ISO 10646).
2198
2199     Jamo:  The Korean name for a single letter of the Hangul script.  Jamos
2200     are used to form Hangul syllables.
2201
2202     Joiner:  An invisible character that affects the joining behavior of
2203     surrounding characters.
2204
2205     JTC1:  Abbreviation for Joint Technical Committee 1 of ISO/IEC,
2206     responsible for information technology standardization.
2207
2208     Kana:  The name of a primarily syllabic script used by the Japanese
2209     writing system, composed of 'hiragana' and 'katakana'.
2210
2211     Kanji:  The Japanese name for Han characters; derived from the Chinese
2212     word 'hanzi'.  Also romanized as 'kanzi'.
2213
2214     Katakana:  One of two standard syllabaries associated with the Japanese
2215     writing system, typically used in representation of borrowed vocabulary.
2216
2217     Ligature:  A glyph representing a combination of two or more characters,
2218     for example in the Latin script the ligature between 'f' and 'i' as
2219     'fi'.
2220
2221     Logical Order:  The order in which text is typed on a keyboard.  For the
2229     most part, logical order corresponds to phonetic order.
2230
2231     Lowercase:  (See 'case'.)
2232
2233     Low-Surrogate:  A Unicode code value in the range U+DC00 to U+DFFF.
2234
2235     MBCS:  Acronym for 'multiple-byte character set'.
2236
2237     Multiple-Byte Character Set (MBCS):  A character set encoded with a
2238     variable number of bytes per character.  Many large character sets have
2239     been defined as MBCS so as to keep strict compatibility with the
2240     US-ASCII subset and/or [ISO2022].
2241
2242     Normalization:  Transformation of data to a normal form.
2243
2244     Plain Text:  Computer-encoded text that consists ONLY of a sequence of
2245     code values from a given standard, with no other formatting or
2246     structural information.
2247
2248     Precomposed Character:  (See 'decomposable character'.)
2249
2250     Rendering:  (1) The process of selecting and laying out glyphs for the
2251     purpose of depicting characters.  (2) The process of making glyphs
2252     visible on a display device.
2253
2254     Repertoire:  (See 'character repertoire'.)
2255
2256     Replacement Character:  A character used as a substitute for an
2257     uninterpretable character from another encoding.  [UNICODE3.2] defines
2258     U+FFFD REPLACEMENT CHARACTER for this function.
2259
2260     Rich Text:  The result of adding information such as font data, color,
2261     formatting, phonetic annotations, etc. to 'plain text' (e.g., HTML).
2262
2263     SBCS:  Acronym for 'single-byte character set'.
2264
2265     Scalar Value:  (See 'Unicode scalar value'.)
2266
2267     Script:  A collection of symbols used to represent textual information
2268     in one or more writing systems.
2269
2270     Single-Byte Character Set (SBCS):  One of a number of one-byte character
2271     sets defined for representing (mostly) Western languages (for example,
2272     ISO 8859-1 'Latin-1').  These character sets are often encoded in such a
2273     way as to be strict supersets of 7-bit [US-ASCII].
2274
2275     Sorting:  (See 'collation'.)
2276
2277     Transcoding:  Conversion of character data between different character
2278     sets.
2279
2287     Transformation Format:  A mapping from a coded character sequence to a
2288     unique sequence of code values (typically octets).
2289
2290     UCS:  Abbreviation for Universal Character Set, specified by [ISO10646].
2291
2292     UCS-2:  UCS encoded in 2 octets, specified by [ISO10646].
2293
2294     UCS-4:  UCS encoded in 4 octets, specified by [ISO10646].
2295
         Unicode Scalar Value:  A number in the range 0 to 0x10FFFF.
2297
2298     Uppercase:  (See 'case'.)
2299
2300     UTF:  Abbreviation for Unicode (or UCS) Transformation Format.
2301
2302     UTF-8:  Unicode (or UCS) Transformation Format, 8-bit encoding form.
2303     Serializes a Unicode (or UCS) scalar value (code point) as a sequence of
2304     one to four octets.  Does NOT suffer from byte-ordering ambiguities.
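
         The one-to-four octet serialization can be sketched as follows.
         'sketch_utf8_encode()' is a hypothetical helper, not part of the
         CUPS API; it rejects surrogates and values above 0x10FFFF by
         returning 0.

```c
/* Encode one scalar value as UTF-8; returns octet count (1-4) or 0 */
int
sketch_utf8_encode(unsigned long ch, unsigned char *out)
{
  if (ch < 0x80)
  {
    out[0] = (unsigned char)ch;
    return (1);
  }

  if (ch < 0x800)
  {
    out[0] = (unsigned char)(0xC0 | (ch >> 6));
    out[1] = (unsigned char)(0x80 | (ch & 0x3F));
    return (2);
  }

  if (ch < 0x10000)
  {
    if (ch >= 0xD800 && ch <= 0xDFFF)
      return (0);                       /* Surrogates are not scalar values */

    out[0] = (unsigned char)(0xE0 | (ch >> 12));
    out[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (ch & 0x3F));
    return (3);
  }

  if (ch <= 0x10FFFF)
  {
    out[0] = (unsigned char)(0xF0 | (ch >> 18));
    out[1] = (unsigned char)(0x80 | ((ch >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (ch & 0x3F));
    return (4);
  }

  return (0);                           /* Beyond the Unicode code space */
}
```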
2305
         UTF-16:  Unicode (or UCS) Transformation Format, 16-bit encoding form.
         Serializes a Unicode (or UCS) scalar value (code point) as a sequence of
         two or four octets (one 16-bit code unit, or a surrogate pair for values
         above 0xFFFF), in either big-endian or little-endian format.  Uses an
         (optional) prefix of BOM to disambiguate byte-ordering.
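
         For scalar values above 0xFFFF, UTF-16 uses a high-surrogate
         followed by a low-surrogate.  The sketch below shows the arithmetic
         on 16-bit code units, before any byte serialization;
         'sketch_utf16_encode()' is a hypothetical helper.

```c
/* Encode one scalar value as UTF-16 code units; returns 1, 2, or 0 */
int
sketch_utf16_encode(unsigned long ch, unsigned short *out)
{
  if (ch >= 0xD800 && ch <= 0xDFFF)
    return (0);                         /* Surrogates are not scalar values */

  if (ch < 0x10000)
  {
    out[0] = (unsigned short)ch;        /* BMP: single code unit */
    return (1);
  }

  if (ch <= 0x10FFFF)
  {
    ch -= 0x10000;                      /* Leaves a 20-bit value */
    out[0] = (unsigned short)(0xD800 | (ch >> 10));    /* High-surrogate */
    out[1] = (unsigned short)(0xDC00 | (ch & 0x3FF));  /* Low-surrogate */
    return (2);
  }

  return (0);                           /* Beyond the Unicode code space */
}
```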
2310
2311     UTF-32:  Unicode (or UCS) Transformation Format, 32-bit encoding form.
2312     Serializes a Unicode (or UCS) scalar value (code point) as a sequence of
2313     four octets, in either big-endian or little-endian format.  Uses an
2314     (optional) prefix of BOM to disambiguate byte-ordering.
2315
2316     Zero Width:  Characteristic of some spaces or format control characters
2317     that do not advance text along the horizontal baseline.