1
2
3 WORKING DRAFT Ira McDonald
4 <i18n_sdd.txt> High North Inc
5
6 Common UNIX Printing System ("CUPS")
7 Internationalization Software Design Description v0.3
8
9 Copyright (C) Easy Software Products (2002) - All Rights Reserved
10
11
12 Status of this Document
13
14 This document is an unapproved working draft and is incomplete in some
15 sections (see 'Ed Note:' comments).
16
17
18 Abstract
19
20 This document provides general information and high-level design for the
21 Internationalization extensions for the Common UNIX Printing System
22 ("CUPS") Version 1.2. This document also provides C language header
23 files and high-level pseudo-code for all new modules and external
24 functions.
25
57 McDonald June 20, 2002 [Page 1]
59 CUPS Internationalization Software Design Description v0.3
60
61 Table of Contents
62
63 1. Scope ...................................................... 4
64 1.1. Identification ......................................... 4
65 1.2. System Overview ........................................ 4
66 1.3. Document Overview ...................................... 4
67 2. References ................................................. 5
68 2.1. CUPS References ........................................ 5
69 2.2. Other Documents ........................................ 5
70 3. Design Overview ............................................ 7
71 3.1. Transcoding - New ...................................... 7
72 3.1.1. transcode.h - Transcoding header ................... 7
73 3.1.1.1. cups_cmap_t - SBCS Charmap Structure ........... 10
74 3.1.1.2. cups_dmap_t - DBCS Charmap Structure ........... 11
75 3.1.2. transcode.c - Transcoding module ................... 11
76 3.1.2.1. cupsUtf8ToCharset() ............................ 11
77 3.1.2.2. cupsCharsetToUtf8() ............................ 12
78 3.1.2.3. cupsUtf8ToUtf16() .............................. 12
79 3.1.2.4. cupsUtf16ToUtf8() .............................. 12
80 3.1.2.5. cupsUtf8ToUtf32() .............................. 12
81 3.1.2.6. cupsUtf32ToUtf8() .............................. 13
82 3.1.2.7. cupsUtf16ToUtf32() ............................. 13
83 3.1.2.8. cupsUtf32ToUtf16() ............................. 13
84 3.1.2.9. Transcoding Utility Functions .................. 13
85 3.1.2.9.1. cupsCharmapGet() ........................... 14
86 3.1.2.9.2. cupsCharmapFree() .......................... 14
87 3.1.2.9.3. cupsCharmapFlush() ......................... 14
88 3.2. Normalization - New .................................... 15
89 3.2.1. normalize.h - Normalization header ................. 15
90 3.2.1.1. cups_normmap_t - Normalize Map Structure ....... 22
91 3.2.1.2. cups_foldmap_t - Case Fold Map Structure ....... 22
92 3.2.1.3. cups_propmap_t - Char Property Map Structure ... 23
93 3.2.1.4. cups_prop_t - Char Property Structure .......... 23
94 3.2.1.5. cups_breakmap_t - Line Break Map Structure ..... 23
95 3.2.1.6. cups_combmap_t - Combining Class Map Structure . 24
96 3.2.1.7. cups_comb_t - Combining Class Structure ........ 24
97 3.2.2. normalize.c - Normalization module ................. 24
98 3.2.2.1. cupsUtf8Normalize() ............................ 24
99 3.2.2.2. cupsUtf32Normalize() ........................... 25
100 3.2.2.3. cupsUtf8CaseFold() ............................. 25
101 3.2.2.4. cupsUtf32CaseFold() ............................ 26
102 3.2.2.5. cupsUtf8CompareCaseless() ...................... 26
103 3.2.2.6. cupsUtf32CompareCaseless() ..................... 26
104 3.2.2.7. cupsUtf8CompareIdentifier() .................... 27
105 3.2.2.8. cupsUtf32CompareIdentifier() ................... 27
106 3.2.2.9. cupsUtf32CharacterProperty() ................... 27
107 3.2.2.10. Normalization Utility Functions ............... 28
108 3.2.2.10.1. cupsNormalizeMapsGet() .................... 28
109 3.2.2.10.2. cupsNormalizeMapsFree() ................... 28
110 3.2.2.10.3. cupsNormalizeMapsFlush() .................. 28
111 3.3. Language - Existing .................................... 29
112 3.3.1. language.h - Language header ....................... 29
118 3.3.2. language.c - Language module ....................... 29
119 3.3.2.1. cupsLangEncoding() - Existing .................. 29
120 3.3.2.2. cupsLangFlush() - Existing ..................... 29
121 3.3.2.3. cupsLangFree() - Existing ...................... 29
122 3.3.2.4. cupsLangGet() - Existing ....................... 30
123 3.3.2.5. cupsLangPrintf() - New ......................... 30
124 3.3.2.6. cupsLangPuts() - New ........................... 30
125 3.3.2.7. cupsEncodingName() - New ....................... 31
126 3.4. Common Text Filter - Existing .......................... 31
127 3.4.1. textcommon.h - Common text filter header ........... 31
128 3.4.1.1. lchar_t - Character/Attribute Structure ........ 31
129 3.4.2. textcommon.c - Common text filter .................. 32
130 3.4.2.1. TextMain() - Existing .......................... 32
131 3.4.2.2. compare_keywords() - Existing .................. 33
132 3.4.2.3. getutf8() - Existing ........................... 33
133 3.5. Text to PostScript Filter - Existing ................... 33
134 3.5.1. texttops.c - Text to PostScript filter ............. 33
135 3.5.1.1. main() - Existing .............................. 33
136 3.5.1.2. WriteEpilogue () - Existing .................... 34
137 3.5.1.3. WritePage () - Existing ........................ 34
138 3.5.1.4. WriteProlog () - Existing ...................... 34
139 3.5.1.5. write_line() - Existing ........................ 34
140 3.5.1.6. write_string() - Existing ...................... 34
141 3.5.1.7. write_text() - Existing ........................ 35
142 A. Glossary ................................................... A-1
143
174
175
176
177 1. Scope
178
179
180
181 1.1. Identification
182
183 This document provides general information and high-level design for the
184 Internationalization extensions for the Common UNIX Printing System
185 ("CUPS") Version 1.2. This document also provides C language header
186 files and high-level pseudo-code for all new modules and external
187 functions.
188
189
190 1.2. System Overview
191
192 The CUPS Internationalization extensions provide multilingual support
193 via Unicode 3.2:2002 [UNICODE3.2] / ISO-10646-1:2000 [ISO10646-1] and a
194 suite of local character sets (including all adopted parts of ISO-8859
195 and many MS Windows code pages) for CUPS 1.2.
196
197 The CUPS Internationalization extensions support UTF-8 [RFC2279] as the
198 common stream-oriented representation of all character data. UTF-8 is
199 defined in [ISO10646-1] and is further constrained (for integrity and
200 security) by [UNICODE3.2].
201
202 UTF-8 is the native character set of LDAPv3 [RFC2251], SLPv2 [RFC2608],
203 IPP/1.1 [RFC2910] [RFC2911], and many other Internet protocols.
204
205
206 1.3. Document Overview
207
208
209 This software design description document is organized into the
210 following sections:
211
212 o 1 - Scope
213 o 2 - References
214 o 3 - Design Overview
215 o A - Glossary
216
231
232
233
234 2. References
235
236
237
238 2.1. CUPS References
239
240 See: Section 2.1 'CUPS Documentation' of CUPS Software Design
241 Description.
242
243
244 2.2. Other Documents
245
246 The following non-CUPS documents are referenced by this document.
247
248 [ANSI-X3.4] ANSI Coded Character Set - 7-bit American National Standard
249 Code for Information Interchange, ANSI X3.4, 1986 (aka US-ASCII).
250
251 [GB2312] Code of Chinese Graphic Character Set for Information
252 Interchange, Primary Set, GB 2312, 1980.
253
254 [ISO639-1] Codes for the Representation of Names of Languages -- Part 1:
255 Alpha-2 Code, ISO/IEC 639-1, 2000.
256
257 [ISO639-2] Codes for the Representation of Names of Languages -- Part 2:
258 Alpha-3 Code, ISO/IEC 639-2, 1998.
259
260 [ISO646] Information Technology - ISO 7-bit Coded Character Set for
261 Information Interchange, ISO/IEC 646, 1991.
262
263 [ISO2022] Information Processing - ISO 7-bit and 8-bit Coded Character
264 Sets - Code Extension Techniques, ISO/IEC 2022, 1994. (Technically
265 identical to ECMA-35.)
266
267 [ISO3166-1] Codes for the Representation of Names of Countries and their
268 Subdivisions, Part 1: Country Codes, ISO 3166-1, 1997.
269
270 [ISO8859] Information Processing - 8-bit Single-Byte Code Graphic
271 Character Sets, ISO/IEC 8859-n, 1987-2001.
272
273 [ISO10646-1] Information Technology - Universal Multiple-Octet Code
274 Character Set (UCS) - Part 1: Architecture and Basic Multilingual
275 Plane, ISO/IEC 10646-1, September 2000.
276
277 [ISO10646-2] Information Technology - Universal Multiple-Octet Code
278 Character Set (UCS) - Part 2: Supplemental Planes, ISO/IEC 10646-2,
279 January 2001.
280
281 [RFC2119] Bradner. Key words for use in RFCs to Indicate Requirement
282 Levels, RFC 2119, March 1997.
283
290 [RFC2251] Wahl, Howes, Kille. Lightweight Directory Access Protocol
291 Version 3 (LDAPv3), RFC 2251, December 1997.
292
293 [RFC2277] Alvestrand. IETF Policy on Character Sets and Languages, RFC
294 2277, January 1998.
295
296 [RFC2279] Yergeau. UTF-8, a Transformation Format of ISO 10646, RFC
297 2279, January 1998.
298
299 [RFC2608] Guttman, Perkins, Veizades, Day. Service Location Protocol
300 Version 2 (SLPv2), RFC 2608, June 1999.
301
302 [RFC2910] Herriot, Butler, Moore, Turner, Wenn. Internet Printing
303 Protocol/1.1: Encoding and Transport, RFC 2910, September 2000.
304
305 [RFC2911] Hastings, Herriot, deBry, Isaacson, Powell. Internet Printing
306 Protocol/1.1: Model and Semantics, RFC 2911, September 2000.
307
308 [UNICODE3.0] Unicode Consortium, Unicode Standard Version 3.0,
309 Addison-Wesley Developers Press, ISBN 0-201-61633-5, 2000.
310
311 [UNICODE3.1] Unicode Consortium, Unicode Standard Version 3.1 (UAX-27),
312 May 2001.
313
314 [UNICODE3.2] Unicode Consortium, Unicode Standard Version 3.2 (UAX-28),
315 March 2002.
316
317 [US-ASCII] See [ANSI-X3.4] above.
318
345
346
347
348 3. Design Overview
349
350 The CUPS Internationalization extensions are composed of several header
351 files and modules which extend the Language functions in the existing
352 CUPS Application Programmers Interface (API).
353
354
355 3.1. Transcoding - New
356
357 Initially, the CUPS Internationalization extensions will only support
358 SBCS (single-byte character set) transcoding. But the design allows
359 future support for DBCS (double-byte character set) transcoding for CJK
360 (Chinese/Japanese/Korean) languages and the MBCS (multiple-byte
361 character set) compound sets that use escapes for charset switching.
362
363 To reduce code size and improve performance, each conventional
364 'mapping file' (a table of legacy character set values with their
365 corresponding Unicode scalar values) will ALSO be sorted and stored in
366 memory as a reverse map (for efficient conversion from Unicode scalar
367 values to their corresponding legacy character set values). Transcoding
368 will then be done directly by 2-level lookup (without any searching or
369 sorting).
370
371 [Ed Note: CJK languages will be fairly costly in mapping table sizes,
372 because they have thousands (or tens of thousands) of codepoints.]
373
374
375
376 3.1.1. transcode.h - Transcoding header
377
378 /*
379 * "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
380 *
381 * Transcoding support for the Common UNIX Printing System (CUPS).
382 *
383 * Copyright 1997-2002 by Easy Software Products.
384 *
385 * These coded instructions, statements, and computer programs are
386 * the property of Easy Software Products and are protected by Federal
387 * copyright law. Distribution and use rights are outlined in the
388 * file "LICENSE.txt" which should have been included with this file.
389 * If this file is missing or damaged please contact Easy Software
390 * Products at:
391 *
392 * Attn: CUPS Licensing Information
393 * Easy Software Products
394 * 44141 Airport View Drive, Suite 204
395 * Hollywood, Maryland 20636-3111 USA
396 *
397 * Voice: (301) 373-9603
403 * EMail: cups-info@cups.org
404 * WWW: http://www.cups.org
405 */
406
407 #ifndef _CUPS_TRANSCODE_H_
408 # define _CUPS_TRANSCODE_H_
409
410 /*
411 * Include necessary headers...
412 */
413
414 # include "cups/language.h"
415
416 # ifdef __cplusplus
417 extern "C" {
418 # endif /* __cplusplus */
419
420 /*
421 * Types...
422 */
423
424 typedef unsigned char utf8_t; /* UTF-8 Unicode/ISO-10646 code unit */
425 typedef unsigned short utf16_t; /* UTF-16 Unicode/ISO-10646 code unit */
426 typedef unsigned long utf32_t; /* UTF-32 Unicode/ISO-10646 code unit */
427 typedef unsigned short ucs2_t; /* UCS-2 Unicode/ISO-10646 code unit */
428 typedef unsigned long ucs4_t; /* UCS-4 Unicode/ISO-10646 code unit */
429 typedef unsigned char sbcs_t; /* SBCS Legacy 8-bit code unit */
430 typedef unsigned short dbcs_t; /* DBCS Legacy 16-bit code unit */
431
432 /*
433 * Structures...
434 */
435
436 typedef struct cups_cmap_str /**** SBCS Charmap Cache Structure ****/
437 {
438 struct cups_cmap_str *next; /* Next charmap in cache */
439 int used; /* Number of times entry used */
440 cups_encoding_t encoding; /* Legacy charset encoding */
441 ucs2_t char2uni[256]; /* Map Legacy SBCS -> UCS-2 */
442 sbcs_t *uni2char[256]; /* Map UCS-2 -> Legacy SBCS */
443 } cups_cmap_t;
444
445 #if 0
446 typedef struct cups_dmap_str /**** DBCS Charmap Cache Structure ****/
447 {
448 struct cups_dmap_str *next; /* Next charmap in cache */
449 int used; /* Number of times entry used */
450 cups_encoding_t encoding; /* Legacy charset encoding */
451 ucs2_t *char2uni[256]; /* Map Legacy DBCS -> UCS-2 */
452 dbcs_t *uni2char[256]; /* Map UCS-2 -> Legacy DBCS */
453 } cups_dmap_t;
454 #endif
455
460
461 /*
462 * Constants...
463 */
464 #define CUPS_MAX_USTRING 1024 /* Maximum size of Unicode string */
465
466 /*
467 * Globals...
468 */
469
470 extern int TcFixMapNames; /* Fix map names to Unicode names */
471 extern int TcStrictUtf8; /* Non-shortest-form is illegal */
472 extern int TcStrictUtf16; /* Invalid surrogate pair is illegal */
473 extern int TcStrictUtf32; /* Greater than 0x10FFFF is illegal */
474 extern int TcRequireBOM; /* Require BOM for little/big-endian */
475 extern int TcSupportBOM; /* Support BOM for little/big-endian */
476 extern int TcSupport8859; /* Support ISO 8859-x repertoires */
477 extern int TcSupportWin; /* Support Windows-x repertoires */
478 extern int TcSupportCJK; /* Support CJK (Asian) repertoires */
479
480 /*
481 * Prototypes...
482 */
483
484 /*
485 * Utility functions for character set maps
486 */
487 extern void *cupsCharmapGet(const cups_encoding_t encoding);
488 /* I - Encoding */
489 extern void cupsCharmapFree(const cups_encoding_t encoding);
490 /* I - Encoding */
491 extern void cupsCharmapFlush(void);
492
493 /*
494 * Convert UTF-8 to and from legacy character set
495 */
496 extern int cupsUtf8ToCharset(char *dest, /* O - Target string */
497 const utf8_t *src, /* I - Source string */
498 const int maxout, /* I - Max output */
499 cups_encoding_t encoding); /* I - Encoding */
500 extern int cupsCharsetToUtf8(utf8_t *dest, /* O - Target string */
501 const char *src, /* I - Source string */
502 const int maxout, /* I - Max output */
503 cups_encoding_t encoding); /* I - Encoding */
504
505 /*
506 * Convert UTF-8 to and from UTF-16
507 */
508 extern int cupsUtf8ToUtf16(utf16_t *dest, /* O - Target string */
509 const utf8_t *src, /* I - Source string */
510 const int maxout); /* I - Max output */
511 extern int cupsUtf16ToUtf8(utf8_t *dest, /* O - Target string */
517 const utf16_t *src, /* I - Source string */
518 const int maxout); /* I - Max output */
519
520 /*
521 * Convert UTF-8 to and from UTF-32
522 */
523 extern int cupsUtf8ToUtf32(utf32_t *dest, /* O - Target string */
524 const utf8_t *src, /* I - Source string */
525 const int maxout); /* I - Max output */
526 extern int cupsUtf32ToUtf8(utf8_t *dest, /* O - Target string */
527 const utf32_t *src, /* I - Source string */
528 const int maxout); /* I - Max output */
529
530 /*
531 * Convert UTF-16 to and from UTF-32
532 */
533 extern int cupsUtf16ToUtf32(utf32_t *dest, /* O - Target string */
534 const utf16_t *src, /* I - Source string */
535 const int maxout); /* I - Max output */
536 extern int cupsUtf32ToUtf16(utf16_t *dest, /* O - Target string */
537 const utf32_t *src, /* I - Source string */
538 const int maxout); /* I - Max output */
539
540 # ifdef __cplusplus
541 }
542 # endif /* __cplusplus */
543
544 #endif /* !_CUPS_TRANSCODE_H_ */
545
546 /*
547 * End of "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
548 */
549
550
551
552 3.1.1.1. cups_cmap_t - SBCS Charmap Structure
553
554 typedef struct cups_cmap_str /**** SBCS Charmap Cache Structure ****/
555 {
556 struct cups_cmap_str *next; /* Next charset map in cache */
557 int used; /* Number of times entry used */
558 cups_encoding_t encoding; /* Legacy charset encoding */
559 ucs2_t char2uni[256]; /* Map Legacy SBCS -> UCS-2 */
560 sbcs_t *uni2char[256]; /* Map UCS-2 -> Legacy SBCS */
561 } cups_cmap_t;
562
563 'char2uni[]' is a (complete) array of UCS-2 values that supports direct
564 one-level lookup from an input SBCS legacy charset code point, for use
565 by 'cupsCharsetToUtf8()'.
566
567 'uni2char[]' is a (sparse) array of pointers to arrays of (256 each)
568 SBCS values, that supports direct two-level lookup from an input UCS-2
574 code point, for use by 'cupsUtf8ToCharset()'.
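
The two lookups described above can be sketched in C as follows. This is
a minimal illustration and not the CUPS implementation: 'demo_cmap_t' is
a hypothetical, trimmed-down stand-in for 'cups_cmap_t' (the cache fields
are omitted), the 'demo_*' function names are invented for this example,
and the use of 0 to mean "unmapped" is an assumption.

```c
#include <stdlib.h>

typedef unsigned short ucs2_t;  /* as in transcode.h */
typedef unsigned char  sbcs_t;  /* as in transcode.h */

/* Hypothetical stand-in for cups_cmap_t, without the cache fields. */
typedef struct
{
  ucs2_t char2uni[256];   /* legacy byte -> UCS-2 (complete array) */
  sbcs_t *uni2char[256];  /* UCS-2 high byte -> row of 256 legacy bytes */
} demo_cmap_t;

/* Record one mapping in both directions; 0 on success, -1 if no memory. */
static int demo_cmap_add(demo_cmap_t *map, sbcs_t legacy, ucs2_t unichar)
{
  unsigned row = (unichar >> 8) & 0xFF;  /* first-level index */
  unsigned col = unichar & 0xFF;         /* second-level index */

  if (map->uni2char[row] == NULL)
  {
    /* Rows are allocated on demand, so the reverse map stays sparse. */
    map->uni2char[row] = calloc(256, sizeof(sbcs_t));
    if (map->uni2char[row] == NULL)
      return -1;
  }

  map->char2uni[legacy]   = unichar;
  map->uni2char[row][col] = legacy;
  return 0;
}

/* Legacy SBCS -> UCS-2: direct one-level lookup. */
static ucs2_t demo_to_unicode(const demo_cmap_t *map, sbcs_t legacy)
{
  return map->char2uni[legacy];
}

/* UCS-2 -> legacy SBCS: two-level lookup, 0 if unmapped (assumption). */
static sbcs_t demo_to_legacy(const demo_cmap_t *map, ucs2_t unichar)
{
  const sbcs_t *row = map->uni2char[(unichar >> 8) & 0xFF];

  return row != NULL ? row[unichar & 0xFF] : 0;
}
```

Because only the rows that actually contain mapped characters are
allocated, a typical SBCS charset needs just a handful of 256-byte rows
for its reverse map.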
575
576
577
578 3.1.1.2. cups_dmap_t - DBCS Charmap Structure
579
580 typedef struct cups_dmap_str /**** DBCS Charmap Cache Structure ****/
581 {
582 struct cups_dmap_str *next; /* Next charset map in cache */
583 int used; /* Number of times entry used */
584 cups_encoding_t encoding; /* Legacy charset encoding */
585 ucs2_t *char2uni[256]; /* Map Legacy DBCS -> UCS-2 */
586 dbcs_t *uni2char[256]; /* Map UCS-2 -> Legacy DBCS */
587 } cups_dmap_t;
588
589 'char2uni[]' is a (sparse) array of pointers to arrays of (256 each)
590 UCS-2 values that supports direct two-level lookup from an input DBCS
591 legacy charset code point, for (future) use by 'cupsCharsetToUtf8()'.
592
593 'uni2char[]' is a (sparse) array of pointers to arrays of (256 each)
594 DBCS values, that supports direct two-level lookup from an input UCS-2
595 code point, for (future) use by 'cupsUtf8ToCharset()'.
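
For the (future) DBCS case both directions are sparse. The sketch below
shows one possible shape of the two-level lookups. It is illustrative
only: 'demo_dmap_t' and the 'demo_*' functions are hypothetical names,
splitting the 16-bit legacy code into a lead byte (row) and trail byte
(column) is an assumption, and 0 is assumed to mean "unmapped".

```c
#include <stdlib.h>

typedef unsigned short ucs2_t;  /* as in transcode.h */
typedef unsigned short dbcs_t;  /* as in transcode.h */

/* Hypothetical stand-in for cups_dmap_t, without the cache fields. */
typedef struct
{
  ucs2_t *char2uni[256];  /* DBCS lead byte -> row of 256 UCS-2 values */
  dbcs_t *uni2char[256];  /* UCS-2 high byte -> row of 256 DBCS values */
} demo_dmap_t;

/* Record one mapping in both directions; 0 on success, -1 if no memory. */
static int demo_dmap_add(demo_dmap_t *map, dbcs_t legacy, ucs2_t unichar)
{
  unsigned lrow = (legacy >> 8) & 0xFF;   /* row in char2uni[] */
  unsigned urow = (unichar >> 8) & 0xFF;  /* row in uni2char[] */

  if (map->char2uni[lrow] == NULL &&
      (map->char2uni[lrow] = calloc(256, sizeof(ucs2_t))) == NULL)
    return -1;
  if (map->uni2char[urow] == NULL &&
      (map->uni2char[urow] = calloc(256, sizeof(dbcs_t))) == NULL)
    return -1;

  map->char2uni[lrow][legacy & 0xFF]  = unichar;
  map->uni2char[urow][unichar & 0xFF] = legacy;
  return 0;
}

/* Legacy DBCS -> UCS-2: two-level lookup, 0 if unmapped (assumption). */
static ucs2_t demo_dbcs_to_unicode(const demo_dmap_t *map, dbcs_t legacy)
{
  const ucs2_t *row = map->char2uni[(legacy >> 8) & 0xFF];

  return row != NULL ? row[legacy & 0xFF] : 0;
}

/* UCS-2 -> legacy DBCS: two-level lookup, 0 if unmapped (assumption). */
static dbcs_t demo_unicode_to_dbcs(const demo_dmap_t *map, ucs2_t unichar)
{
  const dbcs_t *row = map->uni2char[(unichar >> 8) & 0xFF];

  return row != NULL ? row[unichar & 0xFF] : 0;
}
```

Each allocated row here costs 512 bytes (256 entries of 16 bits, assuming
a 16-bit 'unsigned short'), which is why CJK tables with many populated
rows are noticeably larger than SBCS tables.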
596
597
598
599 3.1.2. transcode.c - Transcoding module
600
601 All of the transcoding functions are modelled on the C standard library
602 function 'strncpy()', except that they return the count of output, like
603 'strlen()', rather than the (redundant) pointer to the output.
604
605 If the transcoding functions detect invalid input parameters or they
606 detect an encoding error in their input, then they return '-1', rather
607 than the count of output.
608
609 All of the transcoding functions take an input parameter indicating the
610 maximum output units (for safe operation). The functions that return
611 16-bit (UTF-16) or 32-bit (UTF-32/UCS-4) output always return the output
612 string count (not including the final null) and NOT the memory size in
613 bytes.
614
615
616
617 3.1.2.1. cupsUtf8ToCharset()
618
619 extern int cupsUtf8ToCharset(char *dest, /* O - Target string */
620 const utf8_t *src, /* I - Source string */
621 const int maxout, /* I - Max output */
622 cups_encoding_t encoding); /* I - Encoding */
623
624 <Find charset map by calling 'cupsCharmapGet()'>
625 <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
631 <Convert internal UCS-4 to legacy charset via charset map>
632 <Release charset map by calling 'cupsCharmapFree()'>
633 <Return length of output legacy charset string -- size in bytes>
634
635
636
637 3.1.2.2. cupsCharsetToUtf8()
638
639 extern int cupsCharsetToUtf8(utf8_t *dest, /* O - Target string */
640 const char *src, /* I - Source string */
641 const int maxout, /* I - Max output */
642 cups_encoding_t encoding); /* I - Encoding */
643
644 <Find charset map by calling 'cupsCharmapGet()'>
645 <Convert input legacy charset to internal UCS-4 via charset map>
646 <Convert internal UCS-4 to UTF-8 by calling 'cupsUtf32ToUtf8()'>
647 <Release charset map by calling 'cupsCharmapFree()'>
648 <Return length of output UTF-8 string -- size in bytes>
649
650
651
652 3.1.2.3. cupsUtf8ToUtf16()
653
654 extern int cupsUtf8ToUtf16(utf16_t *dest, /* O - Target string */
655 const utf8_t *src, /* I - Source string */
656 const int maxout); /* I - Max output */
657
658 <...to avoid duplicate code to handle surrogate pairs...>
659 <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
660 <Convert internal UCS-4 to UTF-16 by calling 'cupsUtf32ToUtf16()'>
661 <Return count of output UTF-16 string -- NOT memory size in bytes>
662
663
664
665 3.1.2.4. cupsUtf16ToUtf8()
666
667 extern int cupsUtf16ToUtf8(utf8_t *dest, /* O - Target string */
668 const utf16_t *src, /* I - Source string */
669 const int maxout); /* I - Max output */
670
671 <...to avoid duplicate code to handle surrogate pairs...>
672 <Convert input UTF-16 to internal UCS-4 by calling 'cupsUtf16ToUtf32()'>
673 <Convert internal UCS-4 to UTF-8 by calling 'cupsUtf32ToUtf8()'>
674 <Return length of output UTF-8 string -- size in bytes>
675
676
677
678 3.1.2.5. cupsUtf8ToUtf32()
679
680 extern int cupsUtf8ToUtf32(utf32_t *dest, /* O - Target string */
681 const utf8_t *src, /* I - Source string */
682 const int maxout); /* I - Max output */
683
689 <Convert input UTF-8 directly to output UCS-4...>
690 <...checking for valid range, shortest-form, etc.>
691 <Return count of output UTF-32 string -- NOT memory size in bytes>
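
The pseudo-code above can be sketched in C roughly as follows. This is
not the CUPS implementation: 'demo_utf8_to_utf32()' is a hypothetical
name, the sketch is always strict (as if 'TcStrictUtf8' and
'TcStrictUtf32' were set), and it assumes that 'maxout' counts output
units including the final null and that running out of room is reported
as '-1'.

```c
#include <stddef.h>

typedef unsigned char utf8_t;   /* as in transcode.h */
typedef unsigned long utf32_t;  /* as in transcode.h */

/* Returns the count of UTF-32 units written (not bytes), or -1 on error. */
static int demo_utf8_to_utf32(utf32_t *dest, const utf8_t *src, int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return -1;

  while (*src != 0)
  {
    utf32_t ch;     /* scalar value being assembled */
    utf32_t min;    /* smallest value allowed for this sequence length */
    int     extra;  /* continuation bytes still expected */

    if (*src < 0x80)
      { ch = *src;        extra = 0; min = 0; }
    else if ((*src & 0xE0) == 0xC0)
      { ch = *src & 0x1F; extra = 1; min = 0x80; }
    else if ((*src & 0xF0) == 0xE0)
      { ch = *src & 0x0F; extra = 2; min = 0x800; }
    else if ((*src & 0xF8) == 0xF0)
      { ch = *src & 0x07; extra = 3; min = 0x10000; }
    else
      return -1;  /* stray continuation byte or invalid lead byte */
    src++;

    while (extra-- > 0)
    {
      if ((*src & 0xC0) != 0x80)
        return -1;  /* truncated sequence (also catches the final null) */
      ch = (ch << 6) | (utf32_t)(*src++ & 0x3F);
    }

    if (ch < min || ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF))
      return -1;  /* non-shortest form, out of range, or surrogate */

    if (count >= maxout - 1)
      return -1;  /* no room for this unit plus the final null */
    dest[count++] = ch;
  }

  dest[count] = 0;
  return count;
}
```

The 'min' check is what rejects non-shortest-form input such as the two
bytes 0xC0 0x80, which would otherwise decode to U+0000.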
692
693
694
695 3.1.2.6. cupsUtf32ToUtf8()
696
697 extern int cupsUtf32ToUtf8(utf8_t *dest, /* O - Target string */
698 const utf32_t *src, /* I - Source string */
699 const int maxout); /* I - Max output */
700
701 <Convert input UCS-4 directly to output UTF-8...>
702 <...checking for valid range, etc.>
703 <Return length of output UTF-8 string -- size in bytes>
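
A possible C rendering of the pseudo-code above is shown below. It is a
sketch under stated assumptions rather than the CUPS implementation:
'demo_utf32_to_utf8()' is a hypothetical name, 'maxout' is assumed to
count bytes in 'dest' including the final null, and lack of room is
assumed to be reported as '-1'.

```c
#include <stddef.h>

typedef unsigned char utf8_t;   /* as in transcode.h */
typedef unsigned long utf32_t;  /* as in transcode.h */

/* Returns the length of the UTF-8 output in bytes, or -1 on error. */
static int demo_utf32_to_utf8(utf8_t *dest, const utf32_t *src, int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return -1;

  for (; *src != 0; src++)
  {
    utf32_t ch = *src;
    int     len;

    if (ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF))
      return -1;  /* outside the Unicode range, or a surrogate */

    len = ch < 0x80 ? 1 : ch < 0x800 ? 2 : ch < 0x10000 ? 3 : 4;
    if (count + len > maxout - 1)
      return -1;  /* no room for this sequence plus the final null */

    switch (len)
    {
      case 1:
        dest[count++] = (utf8_t)ch;
        break;
      case 2:
        dest[count++] = (utf8_t)(0xC0 | (ch >> 6));
        dest[count++] = (utf8_t)(0x80 | (ch & 0x3F));
        break;
      case 3:
        dest[count++] = (utf8_t)(0xE0 | (ch >> 12));
        dest[count++] = (utf8_t)(0x80 | ((ch >> 6) & 0x3F));
        dest[count++] = (utf8_t)(0x80 | (ch & 0x3F));
        break;
      default:
        dest[count++] = (utf8_t)(0xF0 | (ch >> 18));
        dest[count++] = (utf8_t)(0x80 | ((ch >> 12) & 0x3F));
        dest[count++] = (utf8_t)(0x80 | ((ch >> 6) & 0x3F));
        dest[count++] = (utf8_t)(0x80 | (ch & 0x3F));
        break;
    }
  }

  dest[count] = 0;
  return count;
}
```

Choosing the sequence length from the scalar value guarantees that the
output is always in shortest form.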
704
705
706
707 3.1.2.7. cupsUtf16ToUtf32()
708
709 extern int cupsUtf16ToUtf32(utf32_t *dest, /* O - Target string */
710 const utf16_t *src, /* I - Source string */
711 const int maxout); /* I - Max output */
712
713 <Convert input UTF-16 directly to output UCS-4...>
714 <...handling surrogate pairs decoding from UTF-16>
715 <Return count of output UTF-32 string -- NOT memory size in bytes>
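
The surrogate pair decoding mentioned above might look like the
following sketch. It is illustrative only: 'demo_utf16_to_utf32()' is a
hypothetical name, input is assumed to be in native byte order with no
BOM handling (so 'TcRequireBOM' and 'TcSupportBOM' are ignored), the
sketch is always strict about unpaired surrogates, and 'maxout' is
assumed to count output units including the final null.

```c
#include <stddef.h>

typedef unsigned short utf16_t;  /* as in transcode.h */
typedef unsigned long  utf32_t;  /* as in transcode.h */

/* Returns the count of UTF-32 units written (not bytes), or -1 on error. */
static int demo_utf16_to_utf32(utf32_t *dest, const utf16_t *src, int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return -1;

  while (*src != 0)
  {
    utf32_t ch = *src++;

    if (ch >= 0xD800 && ch <= 0xDBFF)
    {
      /* High surrogate: it must be followed by a low surrogate. */
      utf32_t low = *src;

      if (low < 0xDC00 || low > 0xDFFF)
        return -1;  /* unpaired high surrogate */
      src++;
      ch = 0x10000 + ((ch - 0xD800) << 10) + (low - 0xDC00);
    }
    else if (ch >= 0xDC00 && ch <= 0xDFFF)
      return -1;  /* low surrogate without a preceding high surrogate */

    if (count >= maxout - 1)
      return -1;  /* no room for this unit plus the final null */
    dest[count++] = ch;
  }

  dest[count] = 0;
  return count;
}
```

For example, the pair 0xD83D 0xDE00 decodes to the single scalar value
0x1F600, so two input units become one output unit.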
716
717
718
719 3.1.2.8. cupsUtf32ToUtf16()
720
721 extern int cupsUtf32ToUtf16(utf16_t *dest, /* O - Target string */
722 const utf32_t *src, /* I - Source string */
723 const int maxout); /* I - Max output */
724
725 <Convert input UCS-4 directly to output UTF-16...>
726 <...handling surrogate pairs encoding to UTF-16>
727 <Return count of output UTF-16 string -- NOT memory size in bytes>
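
The reverse direction, encoding supplementary characters as surrogate
pairs, can be sketched as below. As before this is only an illustration:
'demo_utf32_to_utf16()' is a hypothetical name, output is produced in
native byte order without a BOM, and 'maxout' is assumed to count output
units including the final null.

```c
#include <stddef.h>

typedef unsigned short utf16_t;  /* as in transcode.h */
typedef unsigned long  utf32_t;  /* as in transcode.h */

/* Returns the count of UTF-16 units written (not bytes), or -1 on error. */
static int demo_utf32_to_utf16(utf16_t *dest, const utf32_t *src, int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return -1;

  for (; *src != 0; src++)
  {
    utf32_t ch   = *src;
    int     need = ch >= 0x10000 ? 2 : 1;  /* units required for 'ch' */

    if (ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF))
      return -1;  /* outside the Unicode range, or a surrogate */

    if (count + need > maxout - 1)
      return -1;  /* no room for these units plus the final null */

    if (need == 2)
    {
      /* Split the 20 bits above 0x10000 into two 10-bit halves. */
      ch -= 0x10000;
      dest[count++] = (utf16_t)(0xD800 | (ch >> 10));
      dest[count++] = (utf16_t)(0xDC00 | (ch & 0x3FF));
    }
    else
      dest[count++] = (utf16_t)ch;
  }

  dest[count] = 0;
  return count;
}
```

Converting the two scalar values 0x41 and 0x1F600 yields a return value
of 3 (three 16-bit units), even though the output occupies 6 bytes plus
the final null, which matches the "count, NOT memory size" rule stated
for the 16-bit and 32-bit functions.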
728
729
730
731 3.1.2.9. Transcoding Utility Functions
732
733 The transcoding utility functions are used to load (from a file into
734 memory), free (logically, without freeing memory), and flush (actually
735 free memory) character maps for SBCS (single-byte character set) and
736 (future) DBCS (double-byte character set) transcoding to and from UTF-8.
737
738
745
746
747 3.1.2.9.1. cupsCharmapGet()
748
749 extern void *cupsCharmapGet(const cups_encoding_t encoding);
750 /* I - Encoding */
751
752 <Find SBCS or DBCS charset map in cache>
753 <...If found, increment 'used'>
754 <...and return pointer to SBCS or DBCS charset map>
755 <Get charset map file name by calling 'cupsEncodingName()'>
756 <Open charset map file>
757 <...If not found, return NULL>
758 <Allocate memory for SBCS or DBCS charset map in cache>
759 <...If no memory, close charset map file and return NULL>
760 <Add to SBCS or DBCS cache by assigning 'next' field>
761 <Assign 'encoding' field>
762 <Increment 'used' field>
763 <Read charset map file into memory in loop...>
764 <If SBCS, then 'char2uni[]' is an array of 'ucs2_t' values>
765 <...and 'uni2char[]' is an array of pointers to 'sbcs_t' arrays>
766 <If DBCS, then 'char2uni[]' is an array of pointers to 'ucs2_t' arrays>
767 <...and 'uni2char[]' is an array of pointers to 'dbcs_t' arrays>
768 <Close charset map file>
769 <Return pointer to SBCS or DBCS charset map>
770
771
772
773 3.1.2.9.2. cupsCharmapFree()
774
775 extern void cupsCharmapFree(const cups_encoding_t encoding);
776 /* I - Encoding */
777
778 <Find SBCS or DBCS charset map in cache>
779 <...If found, decrement 'used'>
780 <Return void>
781
782
783
784 3.1.2.9.3. cupsCharmapFlush()
785
786 extern void cupsCharmapFlush(void);
787
788 <Loop through SBCS charset map cache...>
789 <...Free 'uni2char[]' memory>
790 <...Free SBCS charset map memory>
791 <Loop through DBCS charset map cache...>
792 <...Free 'char2uni[]' memory>
793 <...Free 'uni2char[]' memory>
794 <...Free DBCS charset map memory>
795 <Return void>
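
The get / free / flush life cycle of the charset map cache can be
sketched as follows. This is a simplified illustration, not the CUPS
implementation: 'demo_map_t' and the 'demo_map_*' functions are
hypothetical names, the encoding is a plain 'int' instead of
'cups_encoding_t', and reading the charset map file is reduced to a
comment.

```c
#include <stdlib.h>

/* Hypothetical cache entry with only the bookkeeping fields. */
typedef struct demo_map_s
{
  struct demo_map_s *next;      /* next charmap in cache */
  int               used;       /* number of times entry used */
  int               encoding;   /* stand-in for cups_encoding_t */
} demo_map_t;

static demo_map_t *demo_cache = NULL;  /* head of the cache list */

/* Find or create a cache entry and count one more user; NULL if no memory. */
static demo_map_t *demo_map_get(int encoding)
{
  demo_map_t *map;

  for (map = demo_cache; map != NULL; map = map->next)
    if (map->encoding == encoding)
    {
      map->used++;
      return map;
    }

  /* Real code would open and parse the charset map file here. */
  if ((map = calloc(1, sizeof(demo_map_t))) == NULL)
    return NULL;

  map->encoding = encoding;
  map->used     = 1;
  map->next     = demo_cache;
  demo_cache    = map;
  return map;
}

/* Logical free: drop one user but keep the entry in memory. */
static void demo_map_free(int encoding)
{
  demo_map_t *map;

  for (map = demo_cache; map != NULL; map = map->next)
    if (map->encoding == encoding)
    {
      if (map->used > 0)
        map->used--;
      return;
    }
}

/* Flush: actually release every cache entry. */
static void demo_map_flush(void)
{
  while (demo_cache != NULL)
  {
    demo_map_t *next = demo_cache->next;

    free(demo_cache);
    demo_cache = next;
  }
}
```

Keeping entries in the cache after a logical free means that a filter
converting many strings in the same charset reads the mapping file only
once; memory is returned only when the cache is flushed.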
796
803
804
805 3.2. Normalization - New
806
807
808
809 3.2.1. normalize.h - Normalization header
810
811 /*
812 * "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
813 *
814 * Unicode normalization for the Common UNIX Printing System (CUPS).
815 *
816 * Copyright 1997-2002 by Easy Software Products.
817 *
818 * These coded instructions, statements, and computer programs are
819 * the property of Easy Software Products and are protected by Federal
820 * copyright law. Distribution and use rights are outlined in the
821 * file "LICENSE.txt" which should have been included with this file.
822 * If this file is missing or damaged please contact Easy Software
823 * Products at:
824 *
825 * Attn: CUPS Licensing Information
826 * Easy Software Products
827 * 44141 Airport View Drive, Suite 204
828 * Hollywood, Maryland 20636-3111 USA
829 *
830 * Voice: (301) 373-9603
831 * EMail: cups-info@cups.org
832 * WWW: http://www.cups.org
833 */
834
835 #ifndef _CUPS_NORMALIZE_H_
836 # define _CUPS_NORMALIZE_H_
837
838 /*
839 * Include necessary headers...
840 */
841
842 # include "transcode.h"
843
844 # ifdef __cplusplus
845 extern "C" {
846 # endif /* __cplusplus */
847
848 /*
849 * Types...
850 */
851
852 typedef enum /**** Normalization Types ****/
853 {
859 CUPS_NORM_NFD, /* Canonical Decomposition */
860 CUPS_NORM_NFKD, /* Compatibility Decomposition */
861 CUPS_NORM_NFC, /* NFD, then Canonical Composition */
862 CUPS_NORM_NFKC /* NFKD, then Canonical Composition */
863 } cups_normalize_t;
864
865 typedef enum /**** Case Folding Types ****/
866 {
867 CUPS_FOLD_SIMPLE, /* Simple - no expansion in size */
868 CUPS_FOLD_FULL /* Full - possible expansion in size */
869 } cups_folding_t;
870
871 typedef enum /**** Unicode Char Property Types ****/
872 {
873 CUPS_PROP_GENERAL_CATEGORY, /* See 'cups_gencat_t' enum */
874 CUPS_PROP_BIDI_CATEGORY, /* See 'cups_bidicat_t' enum */
875 CUPS_PROP_COMBINING_CLASS, /* See 'cups_combclass_t' type */
876 CUPS_PROP_BREAK_CLASS /* See 'cups_breakclass_t' enum */
877 } cups_property_t;
878
879 /*
880 * Note - parse Unicode char general category from 'UnicodeData.txt'
881 * into sparse local table in 'normalize.c'.
882 * Use major classes for logic optimizations throughout (by mask).
883 */
884
885 typedef enum /**** Unicode General Category ****/
886 {
887 CUPS_GENCAT_L = 0x10, /* Letter major class */
888 CUPS_GENCAT_LU = 0x11, /* Lu Letter, Uppercase */
889 CUPS_GENCAT_LL = 0x12, /* Ll Letter, Lowercase */
890 CUPS_GENCAT_LT = 0x13, /* Lt Letter, Titlecase */
891 CUPS_GENCAT_LM = 0x14, /* Lm Letter, Modifier */
892 CUPS_GENCAT_LO = 0x15, /* Lo Letter, Other */
893 CUPS_GENCAT_M = 0x20, /* Mark major class */
894 CUPS_GENCAT_MN = 0x21, /* Mn Mark, Non-Spacing */
895 CUPS_GENCAT_MC = 0x22, /* Mc Mark, Spacing Combining */
896 CUPS_GENCAT_ME = 0x23, /* Me Mark, Enclosing */
897 CUPS_GENCAT_N = 0x30, /* Number major class */
898 CUPS_GENCAT_ND = 0x31, /* Nd Number, Decimal Digit */
899 CUPS_GENCAT_NL = 0x32, /* Nl Number, Letter */
900 CUPS_GENCAT_NO = 0x33, /* No Number, Other */
901 CUPS_GENCAT_P = 0x40, /* Punctuation major class */
902 CUPS_GENCAT_PC = 0x41, /* Pc Punctuation, Connector */
903 CUPS_GENCAT_PD = 0x42, /* Pd Punctuation, Dash */
904 CUPS_GENCAT_PS = 0x43, /* Ps Punctuation, Open (start) */
905 CUPS_GENCAT_PE = 0x44, /* Pe Punctuation, Close (end) */
906 CUPS_GENCAT_PI = 0x45, /* Pi Punctuation, Initial Quote */
907 CUPS_GENCAT_PF = 0x46, /* Pf Punctuation, Final Quote */
908 CUPS_GENCAT_PO = 0x47, /* Po Punctuation, Other */
909 CUPS_GENCAT_S = 0x50, /* Symbol major class */
910 CUPS_GENCAT_SM = 0x51, /* Sm Symbol, Math */
916 CUPS_GENCAT_SC = 0x52, /* Sc Symbol, Currency */
917 CUPS_GENCAT_SK = 0x53, /* Sk Symbol, Modifier */
918 CUPS_GENCAT_SO = 0x54, /* So Symbol, Other */
919 CUPS_GENCAT_Z = 0x60, /* Separator major class */
920 CUPS_GENCAT_ZS = 0x61, /* Zs Separator, Space */
921 CUPS_GENCAT_ZL = 0x62, /* Zl Separator, Line */
922 CUPS_GENCAT_ZP = 0x63, /* Zp Separator, Paragraph */
923 CUPS_GENCAT_C = 0x70, /* Other (miscellaneous) major class */
924 CUPS_GENCAT_CC = 0x71, /* Cc Other, Control */
925 CUPS_GENCAT_CF = 0x72, /* Cf Other, Format */
926 CUPS_GENCAT_CS = 0x73, /* Cs Other, Surrogate */
927 CUPS_GENCAT_CO = 0x74, /* Co Other, Private Use */
928 CUPS_GENCAT_CN = 0x75 /* Cn Other, Not Assigned */
929 } cups_gencat_t;
930
931 /*
932 * Note - parse Unicode char bidi category from 'UnicodeData.txt'
933 * into sparse local table in 'normalize.c'.
934 * Add bidirectional support to 'textcommon.c' - per Mike
935 */
936
937 typedef enum /**** Unicode Bidi Category ****/
938 {
939 CUPS_BIDI_L, /* Left-to-Right (Alpha, Syllabic, Ideographic) */
940 CUPS_BIDI_LRE, /* Left-to-Right Embedding (explicit) */
941 CUPS_BIDI_LRO, /* Left-to-Right Override (explicit) */
942 CUPS_BIDI_R, /* Right-to-Left (Hebrew alphabet and most punct) */
943 CUPS_BIDI_AL, /* Right-to-Left Arabic (Arabic, Thaana, Syriac) */
944 CUPS_BIDI_RLE, /* Right-to-Left Embedding (explicit) */
945 CUPS_BIDI_RLO, /* Right-to-Left Override (explicit) */
946 CUPS_BIDI_PDF, /* Pop Directional Format */
947 CUPS_BIDI_EN, /* Euro Number (Euro and East Arabic-Indic digits) */
948 CUPS_BIDI_ES, /* Euro Number Separator (Slash) */
949 CUPS_BIDI_ET, /* Euro Number Terminator (Plus, Minus, Degree, etc) */
950 CUPS_BIDI_AN, /* Arabic Number (Arabic-Indic digits, separators) */
951 CUPS_BIDI_CS, /* Common Number Separator (Colon, Comma, Dot, etc) */
952 CUPS_BIDI_NSM, /* Non-Spacing Mark (category Mn / Me in UCD) */
953 CUPS_BIDI_BN, /* Boundary Neutral (Formatting / Control chars) */
954 CUPS_BIDI_B, /* Paragraph Separator */
955 CUPS_BIDI_S, /* Segment Separator (Tab) */
956 CUPS_BIDI_WS, /* Whitespace Space (Space, Line Separator, etc) */
957 CUPS_BIDI_ON /* Other Neutrals */
958 } cups_bidicat_t;
959
960 /*
961 * Note - parse Unicode line break class from 'DerivedLineBreak.txt'
962 * into sparse local table (list of class ranges) in 'normalize.c'.
963 * Note - add state table from UAX-14, section 7.3 - Ira
964 * Remember to do BK and SP in outer loop (not in state table).
965 * Consider optimization for CM (combining mark).
966 * See 'LineBreak.txt' (12,875) and 'DerivedLineBreak.txt' (1,350).
967 */
968
974 typedef enum /**** Unicode Line Break Class ****/
975 {
976 /*
977 * (A) - Allow Break AFTER
978 * (XA) - Prevent Break AFTER
979 * (B) - Allow Break BEFORE
980 * (XB) - Prevent Break BEFORE
981 * (P) - Allow Break For Pair
982 * (XP) - Prevent Break For Pair
983 */
984 CUPS_BREAK_AI, /* Ambiguous (Alphabetic or Ideograph) */
985 CUPS_BREAK_AL, /* Ordinary Alphabetic / Symbol Chars (XP) */
986 CUPS_BREAK_BA, /* Break Opportunity After Chars (A) */
987 CUPS_BREAK_BB, /* Break Opportunities Before Chars (B) */
988 CUPS_BREAK_B2, /* Break Opportunity Before / After (B/A/XP) */
989 CUPS_BREAK_BK, /* Mandatory Break (A) (normative) */
990 CUPS_BREAK_CB, /* Contingent Break (B/A) (normative) */
991 CUPS_BREAK_CL, /* Closing Punctuation (XB) */
992 CUPS_BREAK_CM, /* Attached Chars / Combining (XB) (normative) */
993 CUPS_BREAK_CR, /* Carriage Return (A) (normative) */
994 CUPS_BREAK_EX, /* Exclamation / Interrogation (XB) */
995 CUPS_BREAK_GL, /* Non-breaking ("Glue") (XB/XA) (normative) */
996 CUPS_BREAK_HY, /* Hyphen (XA) */
997 CUPS_BREAK_ID, /* Ideographic (B/A) */
998 CUPS_BREAK_IN, /* Inseparable chars (XP) */
999 CUPS_BREAK_IS, /* Numeric Separator (Infix) (XB) */
1000 CUPS_BREAK_LF, /* Line Feed (A) (normative) */
1001 CUPS_BREAK_NS, /* Non-starters (XB) */
1002 CUPS_BREAK_NU, /* Numeric (XP) */
1003 CUPS_BREAK_OP, /* Opening Punctuation (XA) */
1004 CUPS_BREAK_PO, /* Postfix (Numeric) (XB) */
1005 CUPS_BREAK_PR, /* Prefix (Numeric) (XA) */
1006 CUPS_BREAK_QU, /* Ambiguous Quotation (XB/XA) */
1007 CUPS_BREAK_SA, /* Context Dependent (South East Asian) (P) */
1008 CUPS_BREAK_SG, /* Surrogates (XP) (normative) */
1009 CUPS_BREAK_SP, /* Space (A) (normative) */
1010 CUPS_BREAK_SY, /* Symbols Allowing Break After (A) */
1011 CUPS_BREAK_XX, /* Unknown (XP) */
1012 CUPS_BREAK_ZW /* Zero Width Space (A) (normative) */
1013 } cups_breakclass_t;
1014
1015 typedef int cups_combclass_t; /**** Unicode Combining Class ****/
1016 /* 0=base / 1..254=combining char */
1017
1018 /*
1019 * Structures...
1020 */
1021
1022 typedef struct cups_normmap_str /**** Normalize Map Cache Struct ****/
1023 {
1024 struct cups_normmap_str *next; /* Next normalize in cache */
1030 int used; /* Number of times entry used */
1031 cups_normalize_t normalize; /* Normalization type */
1032 int normcount; /* Count of Source Chars */
1033 ucs2_t *uni2norm; /* Char -> Normalization */
1034 /* ...only supports UCS-2 */
1035 } cups_normmap_t;
1036
1037 typedef struct cups_foldmap_str /**** Case Fold Map Cache Struct ****/
1038 {
1039 struct cups_foldmap_str *next; /* Next case fold in cache */
1040 int used; /* Number of times entry used */
1041 cups_folding_t fold; /* Case folding type */
1042 int foldcount; /* Count of Source Chars */
1043 ucs2_t *uni2fold; /* Char -> Folded Char(s) */
1044 /* ...only supports UCS-2 */
1045 } cups_foldmap_t;
1046
1047 typedef struct cups_prop_str /**** Char Property Struct ****/
1048 {
1049 ucs2_t ch; /* Unicode Char as UCS-2 */
1050 unsigned char gencat; /* General Category */
1051 unsigned char bidicat; /* Bidirectional Category */
1052 } cups_prop_t;
1053
1054 typedef struct /**** Char Property Map Struct ****/
1055 {
1056 int used; /* Number of times entry used */
1057 int propcount; /* Count of Source Chars */
1058 cups_prop_t *uni2prop; /* Char -> Properties */
1059 } cups_propmap_t;
1060
1061 typedef struct /**** Line Break Class Map Struct ****/
1062 {
1063 int used; /* Number of times entry used */
1064 int breakcount; /* Count of Source Chars */
1065 ucs2_t *uni2break; /* Char -> Line Break Class */
1066 } cups_breakmap_t;
1067
1068 typedef struct cups_comb_str /**** Char Combining Class Struct ****/
1069 {
1070 ucs2_t ch; /* Unicode Char as UCS-2 */
1071 unsigned char combclass; /* Combining Class */
1072 unsigned char reserved; /* Reserved for alignment */
1073 } cups_comb_t;
1074
1075 typedef struct /**** Combining Class Map Struct ****/
1076 {
1077 int used; /* Number of times entry used */
1078 int combcount; /* Count of Source Chars */
1079 cups_comb_t *uni2comb; /* Char -> Combining Class */
1080 } cups_combmap_t;
1081
1088 /*
1089 * Globals...
1090 */
1091
1092 extern int NzSupportUcs2; /* Support UCS-2 (16-bit) mapping */
1093 extern int NzSupportUcs4; /* Support UCS-4 (32-bit) mapping */
1094
1095 /*
1096 * Prototypes...
1097 */
1098
1099 /*
1100 * Utility functions for normalization module
1101 */
1102 extern int cupsNormalizeMapsGet(void);
1103 extern int cupsNormalizeMapsFree(void);
1104 extern void cupsNormalizeMapsFlush(void);
1105
1106 /*
1107 * Normalize UTF-8 string to Unicode UAX-15 Normalization Form
1108 * Note - Compatibility Normalization Forms (NFKD/NFKC) are
1109 * unsafe for subsequent transcoding to legacy charsets
1110 */
1111 extern int cupsUtf8Normalize(utf8_t *dest, /* O - Target string */
1112 const utf8_t *src, /* I - Source string */
1113 const int maxout, /* I - Max output */
1114 const cups_normalize_t normalize);
1115 /* I - Normalization */
1116
1117 /*
1118 * Normalize UTF-32 string to Unicode UAX-15 Normalization Form
1119 * Note - Compatibility Normalization Forms (NFKD/NFKC) are
1120 * unsafe for subsequent transcoding to legacy charsets
1121 */
1122 extern int cupsUtf32Normalize(utf32_t *dest,
1123 /* O - Target string */
1124 const utf32_t *src, /* I - Source string */
1125 const int maxout, /* I - Max output */
1126 const cups_normalize_t normalize);
1127 /* I - Normalization */
1128
1129 /*
1130 * Case Fold UTF-8 string per Unicode UAX-21 Section 2.3
1131 * Note - Case folding output is
1132 * unsafe for subsequent transcoding to legacy charsets
1133 */
1134 extern int cupsUtf8CaseFold(utf8_t *dest, /* O - Target string */
1135 const utf8_t *src, /* I - Source string */
1136 const int maxout, /* I - Max output */
1137 const cups_folding_t fold); /* I - Fold Mode */
1138
1145 /*
1146 * Case Fold UTF-32 string per Unicode UAX-21 Section 2.3
1147 * Note - Case folding output is
1148 * unsafe for subsequent transcoding to legacy charsets
1149 */
1150 extern int cupsUtf32CaseFold(utf32_t *dest,/* O - Target string */
1151 const utf32_t *src, /* I - Source string */
1152 const int maxout, /* I - Max output */
1153 const cups_folding_t fold); /* I - Fold Mode */
1154
1155 /*
1156 * Compare UTF-8 strings after case folding
1157 */
1158 extern int cupsUtf8CompareCaseless(const utf8_t *s1,
1159 /* I - String1 */
1160 const utf8_t *s2); /* I - String2 */
1161
1162 /*
1163 * Compare UTF-32 strings after case folding
1164 */
1165 extern int cupsUtf32CompareCaseless(const utf32_t *s1,
1166 /* I - String1 */
1167 const utf32_t *s2); /* I - String2 */
1168
1169 /*
1170 * Compare UTF-8 strings after case folding and NFKC normalization
1171 */
1172 extern int cupsUtf8CompareIdentifier(const utf8_t *s1,
1173 /* I - String1 */
1174 const utf8_t *s2); /* I - String2 */
1175
1176 /*
1177 * Compare UTF-32 strings after case folding and NFKC normalization
1178 */
1179 extern int cupsUtf32CompareIdentifier(const utf32_t *s1,
1180 /* I - String1 */
1181 const utf32_t *s2); /* I - String2 */
1182
1183 /*
1184 * Get UTF-32 character property
1185 */
1186 extern int cupsUtf32CharacterProperty(const utf32_t ch,
1187 /* I - Source char */
1188 const cups_property_t property);
1189 /* I - Char Property */
1190
1191 # ifdef __cplusplus
1192 }
1193 # endif /* __cplusplus */
1194
1195 #endif /* !_CUPS_NORMALIZE_H_ */
1196
1202 /*
1203 * End of "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
1204 */
1205
1206
1207
1208 3.2.1.1. cups_normmap_t - Normalize Map Structure
1209
1210 typedef struct cups_normmap_str /**** Normalize Map Cache Struct ****/
1211 {
1212 struct cups_normmap_str *next; /* Next normalize in cache */
1213 int used; /* Number of times entry used */
1214 cups_normalize_t normalize; /* Normalization type */
1215 int normcount; /* Count of Source Chars */
1216 ucs2_t *uni2norm; /* Char -> Normalization */
1217 /* ...only supports UCS-2 */
1218 } cups_normmap_t;
1219
1220 'uni2norm' is a pointer to an array of _triplets_ of UCS-2 values.
1221 'normcount' is a count of _triplets_ in the 'uni2norm[]' array.
1222
1223 For decompositions (NFD and NFKD), the triplets are: composed base
1224 character, decomposed base character, and decomposed accent character.
1225 These are used by 'cupsUtf8Normalize()' and 'cupsUtf32Normalize()' in
1226 performing canonical (NFD) or compatibility (NFKD) decomposition.
1227
1228 For compositions (NFC and NFKC), the triplets are: decomposed base
1229 character, decomposed accent character, and composed base character.
1230 These are used by 'cupsUtf8Normalize()' and 'cupsUtf32Normalize()' in
1231 performing canonical composition (for NFC or NFKC).
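
As a minimal sketch of a composition lookup over such triplets, the
fragment below searches a small sample table with 'bsearch()'. It
assumes the triplets are sorted by (base, accent), which this document
does not state. The names 'sketch_compare_compose()',
'sketch_compose()' and 'sketch_triplets[]' and the local 'ucs2_t'
typedef are hypothetical and are not part of the CUPS API.

```c
#include <stdlib.h>

typedef unsigned short ucs2_t;   /* Assumption: 16-bit UCS-2 unit */

/* Hypothetical comparator: key is (base, accent), elem is one triplet */
int sketch_compare_compose(const void *key, const void *elem)
{
  const ucs2_t *k = (const ucs2_t *)key;
  const ucs2_t *t = (const ucs2_t *)elem;

  if (k[0] != t[0])
    return (k[0] < t[0] ? -1 : 1);
  if (k[1] != t[1])
    return (k[1] < t[1] ? -1 : 1);
  return (0);
}

/* Return the composed char for base + accent, or 0 if no triplet matches */
ucs2_t sketch_compose(const ucs2_t *uni2norm, int normcount,
                      ucs2_t base, ucs2_t accent)
{
  ucs2_t       key[2];
  const ucs2_t *hit;

  key[0] = base;
  key[1] = accent;
  hit    = (const ucs2_t *)bsearch(key, uni2norm, (size_t)normcount,
                                   3 * sizeof(ucs2_t),
                                   sketch_compare_compose);
  return (hit ? hit[2] : 0);
}

/* Sample composition triplets, sorted by (base, accent) */
const ucs2_t sketch_triplets[] =
{
  0x0041, 0x0300, 0x00C0,   /* A + grave -> A-grave */
  0x0041, 0x0301, 0x00C1,   /* A + acute -> A-acute */
  0x0045, 0x0301, 0x00C9    /* E + acute -> E-acute */
};
```

A decomposition lookup would follow the same pattern with a one-value
key (the composed character) and a table sorted on that value.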
1232
1233
1234
1235 3.2.1.2. cups_foldmap_t - Case Fold Map Structure
1236
1237 typedef struct cups_foldmap_str /**** Case Fold Map Cache Struct ****/
1238 {
struct cups_foldmap_str *next; /* Next case fold in cache */
1239 int used; /* Number of times entry used */
1240 cups_folding_t fold; /* Case folding type */
1241 int foldcount; /* Count of Source Chars */
1242 ucs2_t *uni2fold; /* Char -> Folded Char(s) */
1243 /* ...only supports UCS-2 */
1244 } cups_foldmap_t;
1245
1246 'uni2fold' is a pointer to an array of _quadruplets_ of UCS-2 values.
1247 'foldcount' is a count of _quadruplets_ in the 'uni2fold[]' array.
1248
1249 For simple case folding (without expansion of the size of the output
1250 string), the quadruplets are: input base character, output case folded
1251 character, zero (unused), and zero (unused).
1252
1259 For full case folding (with possible expansion of the size of the output
1260 string), the quadruplets are: input base character, output case folded
1261 character, second output character or zero, third output character or
1262 zero.
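
The quadruplet layout can be illustrated with a short sketch that folds
a single character and copes with expansion to two or three output
characters. It assumes the quadruplets are sorted by their first value
and that running out of output room is reported as '-1'. Neither point
is stated in this document. All 'sketch_' names and the local typedefs
are hypothetical.

```c
#include <stdlib.h>

typedef unsigned short ucs2_t;    /* Assumption: 16-bit UCS-2 unit */
typedef unsigned long  utf32_t;   /* Assumption: holds one UCS-4 value */

/* Hypothetical comparator: key is one char, elem is one quadruplet */
int sketch_compare_foldchar(const void *key, const void *elem)
{
  ucs2_t k = *(const ucs2_t *)key;
  ucs2_t e = *(const ucs2_t *)elem;

  return (k < e ? -1 : (k > e ? 1 : 0));
}

/* Fold 'ch' into dest[]; return units written (1..3) or -1 if no room */
int sketch_fold_char(const ucs2_t *uni2fold, int foldcount,
                     utf32_t ch, utf32_t *dest, int maxout)
{
  ucs2_t       key;
  const ucs2_t *hit = NULL;
  int          i, n = 0;

  if (ch <= 0xFFFF)               /* Map only covers UCS-2 */
  {
    key = (ucs2_t)ch;
    hit = (const ucs2_t *)bsearch(&key, uni2fold, (size_t)foldcount,
                                  4 * sizeof(ucs2_t),
                                  sketch_compare_foldchar);
  }

  if (!hit)                       /* No mapping: char folds to itself */
  {
    if (maxout < 1)
      return (-1);
    dest[0] = ch;
    return (1);
  }

  for (i = 1; i < 4 && hit[i] != 0; i ++)
  {
    if (n >= maxout)
      return (-1);
    dest[n ++] = hit[i];
  }
  return (n);
}

/* Sample full case folding quadruplets, sorted by input char */
const ucs2_t sketch_folds[] =
{
  0x0041, 0x0061, 0x0000, 0x0000,   /* A -> a */
  0x00DF, 0x0073, 0x0073, 0x0000,   /* sharp s -> "ss" */
  0xFB03, 0x0066, 0x0066, 0x0069    /* ffi ligature -> "ffi" */
};
```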
1263
1264
1265
1266 3.2.1.3. cups_propmap_t - Char Property Map Structure
1267
1268 typedef struct /**** Char Property Map Struct ****/
1269 {
1270 int used; /* Number of times entry used */
1271 int propcount; /* Count of Source Chars */
1272 cups_prop_t *uni2prop; /* Char -> Properties */
1273 } cups_propmap_t;
1274
1275 'uni2prop' is a pointer to an array of 'cups_prop_t' (see below).
1276 'propcount' is a count of elements in the 'uni2prop[]' array.
1277
1278
1279
1280 3.2.1.4. cups_prop_t - Char Property Structure
1281
1282 typedef struct cups_prop_str /**** Char Property Struct ****/
1283 {
1284 ucs2_t ch; /* Unicode Char as UCS-2 */
1285 unsigned char gencat; /* General Category */
1286 unsigned char bidicat; /* Bidirectional Category */
1287 } cups_prop_t;
1288
1289
1290
1291 3.2.1.5. cups_breakmap_t - Line Break Map Structure
1292
1293 typedef struct /**** Line Break Class Map Struct ****/
1294 {
1295 int used; /* Number of times entry used */
1296 int breakcount; /* Count of Source Chars */
1297 ucs2_t *uni2break; /* Char -> Line Break Class */
1298 } cups_breakmap_t;
1299
1300 'uni2break' is a pointer to an array of _triplets_ of UCS-2 values.
1301 'breakcount' is a count of _triplets_ in the 'uni2break[]' array.
1302
1303 The triplets in 'uni2break' are: first UCS-2 value in a range, last
1304 UCS-2 value in a range, and line break class stored as UCS-2.
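
A range table of this shape lends itself to a binary search on the
range bounds. The sketch below assumes the ranges are sorted and do not
overlap, which this document does not state. The class values
'SKETCH_XX' etc. are placeholders, not the 'cups_breakclass_t' values,
and all 'sketch_' names are hypothetical.

```c
typedef unsigned short ucs2_t;   /* Assumption: 16-bit UCS-2 unit */

/* Placeholder class codes for the sample table only */
enum { SKETCH_XX = 0, SKETCH_SP = 1, SKETCH_NU = 2, SKETCH_AL = 3 };

/* Return the class of the range containing 'ch', or SKETCH_XX if none */
int sketch_break_class(const ucs2_t *uni2break, int breakcount, ucs2_t ch)
{
  int lo = 0;
  int hi = breakcount - 1;

  while (lo <= hi)
  {
    int          mid = lo + (hi - lo) / 2;
    const ucs2_t *t  = uni2break + 3 * mid;   /* first, last, class */

    if (ch < t[0])
      hi = mid - 1;
    else if (ch > t[1])
      lo = mid + 1;
    else
      return ((int)t[2]);
  }
  return (SKETCH_XX);
}

/* Sample range triplets: first char, last char, class */
const ucs2_t sketch_ranges[] =
{
  0x0020, 0x0020, SKETCH_SP,   /* Space */
  0x0030, 0x0039, SKETCH_NU,   /* ASCII digits */
  0x0041, 0x005A, SKETCH_AL,   /* ASCII upper case */
  0x0061, 0x007A, SKETCH_AL    /* ASCII lower case */
};
```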
1305
1306
1307
1317 3.2.1.6. cups_combmap_t - Combining Class Map Structure
1318
1319 typedef struct /**** Combining Class Map Struct ****/
1320 {
1321 int used; /* Number of times entry used */
1322 int combcount; /* Count of Source Chars */
1323 cups_comb_t *uni2comb; /* Char -> Combining Class */
1324 } cups_combmap_t;
1325
1326 'uni2comb' is a pointer to an array of 'cups_comb_t' (see below).
1327 'combcount' is a count of elements in the 'uni2comb[]' array.
1328
1329
1330
1331 3.2.1.7. cups_comb_t - Combining Class Structure
1332
1333 typedef struct cups_comb_str /**** Char Combining Class Struct ****/
1334 {
1335 ucs2_t ch; /* Unicode Char as UCS-2 */
1336 unsigned char combclass; /* Combining Class */
1337 unsigned char reserved; /* Reserved for alignment */
1338 } cups_comb_t;
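
A lookup of the combining class for one character might look like the
following sketch. It assumes 'uni2comb[]' is sorted by 'ch' and that a
character absent from the table has class 0 (base). The structure is
re-declared locally as 'sketch_comb_t' so that the fragment stands
alone. All 'sketch_' names are hypothetical.

```c
#include <stdlib.h>

typedef unsigned short ucs2_t;   /* Assumption: 16-bit UCS-2 unit */

typedef struct sketch_comb_str   /* Local copy of the cups_comb_t layout */
{
  ucs2_t        ch;              /* Unicode Char as UCS-2 */
  unsigned char combclass;       /* Combining Class */
  unsigned char reserved;        /* Reserved for alignment */
} sketch_comb_t;

/* Hypothetical comparator: key is one char, elem is one table entry */
int sketch_compare_combchar(const void *key, const void *elem)
{
  ucs2_t k = *(const ucs2_t *)key;
  ucs2_t e = ((const sketch_comb_t *)elem)->ch;

  return (k < e ? -1 : (k > e ? 1 : 0));
}

/* Return combining class of 'ch', or 0 when it is not in the table */
int sketch_combining_class(const sketch_comb_t *uni2comb, int combcount,
                           ucs2_t ch)
{
  const sketch_comb_t *hit;

  hit = (const sketch_comb_t *)bsearch(&ch, uni2comb, (size_t)combcount,
                                       sizeof(sketch_comb_t),
                                       sketch_compare_combchar);
  return (hit ? (int)hit->combclass : 0);
}

/* Sample entries, sorted by char */
const sketch_comb_t sketch_combs[] =
{
  { 0x0300, 230, 0 },   /* Combining grave accent */
  { 0x0301, 230, 0 },   /* Combining acute accent */
  { 0x0323, 220, 0 },   /* Combining dot below */
  { 0x0327, 202, 0 }    /* Combining cedilla */
};
```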
1339
1340
1341
1342 3.2.2. normalize.c - Normalization module
1343
1344 The normalization function 'cupsUtf8Normalize()' and the case folding
1345 function 'cupsUtf8CaseFold()' are modelled on the C standard library
1346 function 'strncpy()', except that they return the count of the output,
1347 like 'strlen()', rather than the (redundant) pointer to the output.
1348
1349 If the normalization or case folding functions detect invalid input
1350 parameters or they detect an encoding error in their input, then they
1351 return '-1', rather than the count of output.
1352
1353 The normalization and case folding functions take an input parameter
1354 indicating the maximum output units (for safe operation).
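
The calling convention described above can be sketched with a trivial
copy function that has the same shape: output first, source second, a
maximum output count, and a return value of the output count or '-1'.
The function 'sketch_utf32_copy()' is hypothetical. Treating a full
output buffer as an error and always writing a terminator are
assumptions, since this document does not say how the real functions
behave in those cases.

```c
typedef unsigned long utf32_t;   /* Assumption: holds one UCS-4 value */

/* Copy src to dest; return count of chars copied, or -1 on any error */
int sketch_utf32_copy(utf32_t *dest, const utf32_t *src, int maxout)
{
  int n;

  if (!dest || !src || maxout < 1)
    return (-1);                 /* Invalid input parameters */

  for (n = 0; src[n] != 0; n ++)
  {
    if (n >= maxout - 1)
      return (-1);               /* No room left (assumed to be an error) */
    if (src[n] > 0x10FFFF || (src[n] >= 0xD800 && src[n] <= 0xDFFF))
      return (-1);               /* Encoding error in input */
    dest[n] = src[n];
  }

  dest[n] = 0;                   /* Terminator is not counted */
  return (n);
}
```

A caller checks for a negative result before using the output, in the
same way it would check the result of the normalization and case
folding functions.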
1355
1356
1357
1358 3.2.2.1. cupsUtf8Normalize()
1359
1360 /*
1361 * Normalize UTF-8 string to Unicode UAX-15 Normalization Form
1362 * Note - Compatibility Normalization Forms (NFKD/NFKC) are
1363 * unsafe for subsequent transcoding to legacy charsets
1364 */
1365 extern int cupsUtf8Normalize(utf8_t *dest, /* O - Target string */
1366 const utf8_t *src, /* I - Source string */
1372 const int maxout, /* I - Max output */
1373 const cups_normalize_t normalize);
1374 /* I - Normalization */
1375
1376 <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
1377 <Normalize by calling 'cupsUtf32Normalize()'>
1378 <Convert normalized UCS-4 to UTF-8 by calling 'cupsUtf32ToUtf8()'>
1379 <Return length of output UTF-8 string -- size in bytes>
1380
1381
1382
1383 3.2.2.2. cupsUtf32Normalize()
1384
1385 extern int cupsUtf32Normalize(utf32_t *dest,
1386 /* O - Target string */
1387 const utf32_t *src, /* I - Source string */
1388 const int maxout, /* I - Max output */
1389 const cups_normalize_t normalize);
1390 /* I - Normalization */
1391
1392 <Find normalize maps by calling 'cupsNormalizeMapsGet()'>
1393 <...if not found, return '-1'>
1394 <Repeatedly traverse internal UCS-4, decomposing (NFD or NFKD)...>
1395 <...with 'bsearch()' of 'uni2norm[]' using local 'compare_decompose()'>
1396 <...until one pass yields no further decomposition>
1397 <Repeatedly traverse internal UCS-4, doing canonical reordering>
1398 <...with 'bsearch()' of 'uni2comb[]' using local 'compare_combchar()'>
1399 <...until one pass yields no further canonical reordering>
1400 <If 'normalize' requests composition (NFC or NFKC)...>
1401 <...repeatedly traverse internal UCS-4, composing (NFC or NFKC)...>
1402 <...with 'bsearch()' of 'uni2norm[]' using local 'compare_compose()'>
1403 <...until one pass yields no further composition>
1404 <Release normalize maps by calling 'cupsNormalizeMapsFree()'>
1405 <Return count of output UTF-32 string -- NOT memory size in bytes>
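
The canonical reordering step in the pseudo-code above can be sketched
as repeated passes that swap adjacent combining marks while any pair is
out of order. In place of a 'uni2comb[]' search, the sketch uses a
hard-coded stand-in, 'sketch_comb_class()', that knows only four marks.
Both function names are hypothetical.

```c
typedef unsigned long utf32_t;   /* Assumption: holds one UCS-4 value */

/* Stand-in for the 'uni2comb[]' lookup: a few known combining classes */
int sketch_comb_class(utf32_t ch)
{
  switch (ch)
  {
    case 0x0300 :                /* Combining grave accent */
    case 0x0301 :                /* Combining acute accent */
        return (230);
    case 0x0323 :                /* Combining dot below */
        return (220);
    case 0x0327 :                /* Combining cedilla */
        return (202);
    default :
        return (0);              /* Base character */
  }
}

/* Reorder marks in place; return the number of passes that swapped */
int sketch_canonical_reorder(utf32_t *s, int len)
{
  int i, swapped, passes = 0;

  do
  {
    swapped = 0;
    for (i = 0; i + 1 < len; i ++)
    {
      int a = sketch_comb_class(s[i]);
      int b = sketch_comb_class(s[i + 1]);

      /* Swap only when both are marks and the classes are descending */
      if (b != 0 && a > b)
      {
        utf32_t tmp = s[i];

        s[i]     = s[i + 1];
        s[i + 1] = tmp;
        swapped  = 1;
      }
    }
    if (swapped)
      passes ++;
  }
  while (swapped);

  return (passes);
}
```

Because a base character has class 0, marks never move across it, so
each base keeps its own set of marks.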
1406
1407
1408
1409 3.2.2.3. cupsUtf8CaseFold()
1410
1411 /*
1412 * Case Fold UTF-8 string per Unicode UAX-21 Section 2.3
1413 * Note - Case folding output is
1414 * unsafe for subsequent transcoding to legacy charsets
1415 */
1416 extern int cupsUtf8CaseFold(utf8_t *dest, /* O - Target string */
1417 const utf8_t *src, /* I - Source string */
1418 const int maxout, /* I - Max output */
1419 const cups_folding_t fold); /* I - Fold Mode */
1420
1421 <Find normalize maps by calling 'cupsNormalizeMapsGet()'>
1422 <...if not found, return '-1'>
1423 <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
1429 <Case fold internal UCS-4 by calling 'cupsUtf32CaseFold()'>
1430 <Convert internal UCS-4 to output UTF-8 by calling 'cupsUtf32ToUtf8()'>
1431 <Release normalize maps by calling 'cupsNormalizeMapsFree()'>
1432 <Return length of output UTF-8 string -- size in bytes>
1433
1434
1435
1436 3.2.2.4. cupsUtf32CaseFold()
1437
1438 /*
1439 * Case Fold UTF-32 string per Unicode UAX-21 Section 2.3
1440 * Note - Case folding output is
1441 * unsafe for subsequent transcoding to legacy charsets
1442 */
1439 extern int cupsUtf32CaseFold(utf32_t *dest, /* O - Target string */
1440 const utf32_t *src, /* I - Source string */
1441 const int maxout, /* I - Max output */
const cups_folding_t fold); /* I - Fold Mode */
1446
1447 <Find case fold maps by calling 'cupsNormalizeMapsGet()'>
1448 <...if not found, return '-1'>
1449 <Traverse internal UCS-4 once, performing case folding...>
1450 <...with 'bsearch()' of 'uni2fold[]' using local 'compare_foldchar()'>
1451 <Copy internal UCS-4 to output UTF-32 string>
1452 <Release normalize maps by calling 'cupsNormalizeMapsFree()'>
1453 <Return count of output UTF-32 string -- NOT memory size in bytes>
1454
1455
1456
1457 3.2.2.5. cupsUtf8CompareCaseless()
1458
1459 /*
1460 * Compare UTF-8 strings after case folding
1461 */
1462 extern int cupsUtf8CompareCaseless(const utf8_t *s1,
1463 /* I - String1 */
1464 const utf8_t *s2); /* I - String2 */
1465
1466 <Case fold both input UTF-8 strings by calling 'cupsUtf8CaseFold()'>
1467 <Return compare of case folded first and second strings>
1468
1469
1470
1471 3.2.2.6. cupsUtf32CompareCaseless()
1472
1473 /*
1474 * Compare UTF-32 strings after case folding
1475 */
1476 extern int cupsUtf32CompareCaseless(const utf32_t *s1,
1477 /* I - String1 */
1478 const utf32_t *s2); /* I - String2 */
1479
1480 <Case fold both input UTF-32 strings by calling 'cupsUtf32CaseFold()'>
1486 <Return compare of case folded first and second strings>
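
As an illustration of comparing after folding, the sketch below folds
each pair of characters on the fly and compares the results by code
point. The fold step is an ASCII-only stand-in ('sketch_fold()') rather
than a 'uni2fold[]' lookup, so it cannot show expansion. Both function
names are hypothetical.

```c
typedef unsigned long utf32_t;   /* Assumption: holds one UCS-4 value */

/* Stand-in for case folding: handles ASCII upper case only */
utf32_t sketch_fold(utf32_t ch)
{
  return ((ch >= 'A' && ch <= 'Z') ? ch + 0x20 : ch);
}

/* Return <0, 0 or >0 as folded s1 sorts before, equal to or after s2 */
int sketch_compare_caseless(const utf32_t *s1, const utf32_t *s2)
{
  for (;; s1 ++, s2 ++)
  {
    utf32_t a = sketch_fold(*s1);
    utf32_t b = sketch_fold(*s2);

    if (a != b)
      return (a < b ? -1 : 1);
    if (a == 0)
      return (0);                /* Both strings ended together */
  }
}
```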
1487
1488
1489
1490 3.2.2.7. cupsUtf8CompareIdentifier()
1491
1492 /*
1493 * Compare UTF-8 strings after case folding and NFKC normalization
1494 */
1495 extern int cupsUtf8CompareIdentifier(const utf8_t *s1,
1496 /* I - String1 */
1497 const utf8_t *s2); /* I - String2 */
1498
1499 <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
1500 <Case fold both strings by calling 'cupsUtf32CaseFold()'>
1501 <Normalize both strings to NFKC by calling 'cupsUtf32Normalize()'>
1502 <Return compare of case folded/normalized first and second strings>
1503
1504
1505
1506 3.2.2.8. cupsUtf32CompareIdentifier()
1507
1508 /*
1509 * Compare UTF-32 strings after case folding and NFKC normalization
1510 */
1511 extern int cupsUtf32CompareIdentifier(const utf32_t *s1,
1512 /* I - String1 */
1513 const utf32_t *s2); /* I - String2 */
1514
1515 <Case fold both strings by calling 'cupsUtf32CaseFold()'>
1516 <Normalize both strings to NFKC by calling 'cupsUtf32Normalize()'>
1517 <Return compare of case folded/normalized first and second strings>
1518
1519
1520
1521 3.2.2.9. cupsUtf32CharacterProperty()
1522
1523 /*
1524 * Get UTF-32 character property
1525 */
1526 extern int cupsUtf32CharacterProperty(const utf32_t ch,
1527 /* I - Source char */
1528 const cups_property_t property);
1529 /* I - Char Property */
1530
1531 <Lookup UTF-32 character property in appropriate map...>
1532 <...internal functions for each different map lookup>
1533
1534
1535
1545 3.2.2.10. Normalization Utility Functions
1546
1547
1548
1549
1550 3.2.2.10.1. cupsNormalizeMapsGet()
1551
1552 extern int cupsNormalizeMapsGet(void);
1553
1554 <Find normalize maps in cache>
1555 <...If found, increment 'used'>
1556 <...and return success status>
1557 <For each map (normalization, case fold, combining class, etc.)...>
1558 <Open (preprocessed form of) Unicode data file...>
1559 <...If not found, return error status>
1560 <Count lines in preprocessed form, for mapping memory alloc>
1561 <...Close (preprocessed form of) Unicode data file>
1562 <Open (preprocessed form of) Unicode data file...>
1563 <...If not found, return error status>
1564 <Allocate memory for appropriate map in cache...>
1565 <...If no memory, return error status>
1566 <Add to appropriate cache by assigning 'next' field>
1567 <Assign map type field and count field>
1568 <Increment 'used' field>
1569 <Read normalize map into memory in loop...>
1570 <...Add values to 'uni2xxx[]' array>
1571 <Close (preprocessed form of) Unicode data file>
1572 <Return success status>
1573
1574
1575
1576 3.2.2.10.2. cupsNormalizeMapsFree()
1577
1578 extern int cupsNormalizeMapsFree(void);
1579
1580 <Find normalize maps in cache>
1581 <...If found, decrement 'used'>
1582 <Return status>
1583
1584
1585
1586 3.2.2.10.3. cupsNormalizeMapsFlush()
1587
1588 extern void cupsNormalizeMapsFlush(void);
1589
1590 <Loop through normalize maps cache...>
1591 <...Free 'uni2norm[]' memory>
1592 <...Free normalize map memory>
1593 <Loop through case folding cache...>
1594 <...Free 'uni2fold[]' memory>
1600 <...Free case folding memory>
1601 <Loop through char property map cache...>
1602 <...Free 'uni2prop[]' memory>
1603 <...Free char property map memory>
1604 <Loop through line break class map cache...>
1605 <...Free 'uni2break[]' memory>
1606 <...Free line break class map memory>
1607 <Loop through combining class map cache...>
1608 <...Free 'uni2comb[]' memory>
1609 <...Free combining class map memory>
1610 <Return void>
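
The get / free / flush life cycle of the map cache can be sketched with
a single reference-counted entry. The sketch returns '0' for success
and '-1' for failure from the get and free calls, which is an
assumption based on the 'int' return types declared in 'normalize.h'.
It does not read any Unicode data file. All 'sketch_' names are
hypothetical.

```c
#include <stdlib.h>

typedef struct sketch_map_str    /* Simplified stand-in for one map */
{
  int            used;           /* Number of times entry used */
  int            count;          /* Count of Source Chars */
  unsigned short *data;          /* Map values (left empty here) */
} sketch_map_t;

static sketch_map_t *sketch_cache = NULL;

/* Load the map on first use, otherwise just count another user */
int sketch_maps_get(void)
{
  if (sketch_cache)
  {
    sketch_cache->used ++;
    return (0);
  }

  sketch_cache = (sketch_map_t *)calloc(1, sizeof(sketch_map_t));
  if (!sketch_cache)
    return (-1);                 /* No memory */

  sketch_cache->used = 1;
  return (0);
}

/* Release one use; memory stays cached until flush */
int sketch_maps_free(void)
{
  if (!sketch_cache)
    return (-1);
  if (sketch_cache->used > 0)
    sketch_cache->used --;
  return (0);
}

/* Free all cached memory */
void sketch_maps_flush(void)
{
  if (sketch_cache)
  {
    free(sketch_cache->data);
    free(sketch_cache);
    sketch_cache = NULL;
  }
}

/* Helper for inspection: current use count, or -1 if nothing cached */
int sketch_maps_used(void)
{
  return (sketch_cache ? sketch_cache->used : -1);
}
```

Keeping the maps in memory after the last free call means repeated
normalization calls avoid re-reading the data files; only an explicit
flush releases them.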
1611
1612
1613
1614 3.3. Language - Existing
1615
1616
1617
1618 3.3.1. language.h - Language header
1619
1620 Required Changes:
1621
1622 (1) Change definition of 'cups_lang_t' to correct length of 'language[]'
1623 to 32 characters per [RFC3066] and [ISO639-2] and [ISO3166-1].
1624
1625
1626
1627 3.3.2. language.c - Language module
1628
1629
1630
1631 3.3.2.1. cupsLangEncoding() - Existing
1632
1633 [No Change]
1634
1635
1636
1637 3.3.2.2. cupsLangFlush() - Existing
1638
1639 [No Change]
1640
1641
1642
1643 3.3.2.3. cupsLangFree() - Existing
1644
1645 [No Change]
1646
1647
1648
1659 3.3.2.4. cupsLangGet() - Existing
1660
1661 Required Changes:
1662
1663 (1) Change length of 'langname[]' and 'real[]' to 64 characters per
1664 [RFC3066] and potential length of encoding (charset) names;
1665 (2) Change language string normalization to support:
1666 (a) 8-character language codes per [RFC3066] and 3-character
1667 language codes per [ISO639-2];
1668 (b) 8-character country codes per [RFC3066] and 3-character country
1669 codes per [ISO3166-1];
1670 (c) Support for 'i' (IANA registered) and 'x' (private) language
1671 prefixes per [RFC3066];
1672 (d) Invariant use of 'utf-8' for encoding in message catalog, but
1673 save actual requested encoding name for later use.
1674 (3) Correct broken do/while statement for message catalog lookup (while
1675 condition is _never_ satisfied).
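
Part of change (2) can be sketched as a small parser that splits a
locale string into a lower case language code of up to 8 characters, an
upper case country code of up to 8 characters, and the requested
encoding name saved for later use. The function
'sketch_parse_locale()', its buffer sizes and its accepted separators
are assumptions. The sketch does not handle the 'i' and 'x' prefixes of
item (c).

```c
#include <ctype.h>
#include <string.h>

/* Split "lang[-_]COUNTRY[.charset]"; return 0 on success, -1 on error */
int sketch_parse_locale(const char *name, char lang[9], char country[9],
                        char charset[33])
{
  int n;

  if (!name || !lang || !country || !charset)
    return (-1);

  lang[0] = country[0] = charset[0] = '\0';

  /* Language code: up to 8 alphanumerics, stored in lower case */
  for (n = 0; *name && *name != '-' && *name != '_' && *name != '.';
       name ++)
  {
    if (n >= 8 || !isalnum((unsigned char)*name))
      return (-1);
    lang[n ++] = (char)tolower((unsigned char)*name);
  }
  lang[n] = '\0';
  if (n == 0)
    return (-1);

  /* Country code: up to 8 alphanumerics, stored in upper case */
  if (*name == '-' || *name == '_')
  {
    for (n = 0, name ++; *name && *name != '.'; name ++)
    {
      if (n >= 8 || !isalnum((unsigned char)*name))
        return (-1);
      country[n ++] = (char)toupper((unsigned char)*name);
    }
    country[n] = '\0';
  }

  /* Requested encoding name: saved as given */
  if (*name == '.')
  {
    name ++;
    if (strlen(name) > 32)
      return (-1);
    strcpy(charset, name);
  }

  return (0);
}
```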
1676
1677
1678
1679 3.3.2.5. cupsLangPrintf() - New
1680
1681 extern int cupsLangPrintf(FILE *fp, /* I - File to write */
1682 const cups_lang_t *lang, /* I - Language/locale*/
1683 const cups_msg_t msg, /* I - Msg to format */
1684 ...); /* I - Args to format */
1685
1686 <Set up variable args by calling 'va_start()'>
1687 <Format CUPS message with variable args by calling 'vsnprintf()'>
1688 <Clean up variable args by calling 'va_end()'>
1689 <Transcode CUPS message by calling 'cupsUtf8ToCharset()'>
1690 <Write CUPS message by calling 'fputs()'>
1691 <Return transcoded output CUPS message length>
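
The format / transcode / write sequence above can be sketched as
follows. The sketch takes a format string directly rather than a
'cups_lang_t' and 'cups_msg_t', uses fixed 1024-byte buffers, and
replaces the call to 'cupsUtf8ToCharset()' with a stand-in,
'sketch_transcode()', that copies UTF-8 unchanged. These are all
assumptions made to keep the fragment self-contained.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for transcoding: copy unchanged, return length or -1 */
int sketch_transcode(char *dest, const char *src, int maxout)
{
  int len = (int)strlen(src);

  if (len >= maxout)
    return (-1);
  memcpy(dest, src, (size_t)len + 1);
  return (len);
}

/* Format, transcode and write a message; return output length or -1 */
int sketch_lang_printf(FILE *fp, const char *format, ...)
{
  char    buffer[1024];          /* Formatted UTF-8 message */
  char    output[1024];          /* Transcoded message */
  va_list ap;
  int     len;

  va_start(ap, format);
  len = vsnprintf(buffer, sizeof(buffer), format, ap);
  va_end(ap);

  if (len < 0 || len >= (int)sizeof(buffer))
    return (-1);                 /* Format error or truncated output */

  len = sketch_transcode(output, buffer, (int)sizeof(output));
  if (len < 0)
    return (-1);

  if (fputs(output, fp) == EOF)
    return (-1);

  return (len);
}
```

Formatting before transcoding keeps 'vsnprintf()' working on UTF-8
only, so the format string and its arguments never mix encodings.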
1692
1693
1694
1695 3.3.2.6. cupsLangPuts() - New
1696
1697 extern int cupsLangPuts(FILE *fp, /* I - File to write */
1698 const cups_lang_t *lang, /* I - Language/locale*/
1699 const cups_msg_t msg); /* I - Msg to write */
1700
1701 <Transcode CUPS message by calling 'cupsUtf8ToCharset()'>
1702 <Write CUPS message by calling 'fputs()'>
1703 <Return transcoded output CUPS message length>
1704
1705
1706
1716 3.3.2.7. cupsEncodingName() - New
1717
1718 extern char *cupsEncodingName(cups_encoding_t encoding);
1719
1720 <Lookup encoding name in static 'lang_encodings[]' array>
1721 <Return pointer to encoding name (charset map file name)>
1722
1723
1724
1725 3.4. Common Text Filter - Existing
1726
1727
1728
1729 3.4.1. textcommon.h - Common text filter header
1730
1731 Required changes:
1732
1733 (1) Revise 'lchar_t' as specified below, adding 'attrx' bit-mask for
1734 selected Unicode character properties;
1735 (2) Revise 'lchar_t' as specified below, adding 'comblen' and 'combch[]'
1736 for Unicode combining/attached chars (accents);
1737 (3) Add 'COMBLEN_MAX' limit as specified below;
1738 (4) Add 'ATTRX_...' selected Unicode character properties as specified
1739 below.
1740
1741
1742
1743 3.4.1.1. lchar_t - Character/Attribute Structure
1744
1745 typedef struct lchar_str /**** Character / Attribute Structure ****/
1746 {
1747 unsigned short ch; /* Unicode Char as UCS-2 */
1748 /* or 8/16-bit Legacy Char */
1749 unsigned short attr; /* Attributes of Char */
1750 unsigned short attrx; /* Extended Attributes */
1751 unsigned short comblen; /* Combining Char Count */
1752 unsigned short combch[8]; /* Combining Chars as UCS-2 */
1753 } lchar_t;
1754
1755 'ch' is a 16-bit UCS-2 character or an 8/16-bit legacy char. 'attr' is
1756 the character attributes defined for the existing 'lchar_t' structure
1757 (defined in 'textcommon.h'). 'attrx' is the extended character
1758 attributes defined for future selected Unicode character properties (see
1759 below). 'comblen' is the number of attached/combining characters.
1760 'combch' is an array of 16-bit UCS-2 attached/combining characters.
1761
1762 Add to 'textcommon.h' constants:
1763
1764 COMBLEN_MAX 8
1765
1772 ATTRX_RIGHT2LEFT 0x0001
1773
1774
1775
1776 3.4.2. textcommon.c - Common text filter
1777
1778 Required Changes:
1779
1780 (1) Revise 'TextMain()' function as described below.
1781
1782
1783
1784 3.4.2.1. TextMain() - Existing
1785
1786 Required Changes:
1787
1788 [Ed Note: Pseudo code below needs more work on bidi handling.]
1789
1790 (1) In main loop at the _beginning_ of the 'default' clause, add the
1791 following code for combining marks:
1792 lchar_t *cp;
1793
1794 cp = Page[line];
1795 cp += column;
1796 /*
1797 * Check for Unicode combining mark (accent)
1798 */
1799 if (UTF-8 && cupsUtf32CombiningClass(ch) > 0)
1800 {
1801
1802 /*
1803 * Save Unicode combining mark in SAME character
1804 */
1805 if (cp->comblen >= COMBLEN_MAX)
1806 break;
1807 cp->combch[cp->comblen] = ch;
1808 cp->comblen ++;
1809 break;
1810 }
1811
1812 (2) In main loop _after_ combining chars section in 'default' clause,
1813 add the following code for Unicode bidi control characters
1814 cups_bidicat_t bidicat;
1815
1816 /*
1817 * Check for Unicode bidi control character
1818 */
1819 if (UTF-8)
1820 {
1821 bidicat = (cups_bidicat_t)
1822 cupsUtf32CharacterProperty(ch, CUPS_PROP_BIDI_CATEGORY);
1828 if ((bidicat == CUPS_BIDI_LRE) /* Left-to-Right Embedding */
1829 || (bidicat == CUPS_BIDI_LRO) /* Left-to-Right Override */
1830 || (bidicat == CUPS_BIDI_RLE) /* Right-to-Left Embedding */
1831 || (bidicat == CUPS_BIDI_RLO) /* Right-to-Left Override */
1832 || (bidicat == CUPS_BIDI_PDF)) /* Pop Directional Format */
1833 {
1834 /* Do bidi stuff here with memory for NEXT char's direction */
1835 /* Discard bidi control character and break */
1836 }
1837 if ((bidicat == CUPS_BIDI_R) /* Right-to-Left Hebrew */
1838 || (bidicat == CUPS_BIDI_AL)) /* Right-to-Left Arabic */
1839 {
1840 /* Set attrx for right-to-left */
1841 cp->attrx |= ATTRX_RIGHT2LEFT;
1842 }
1843 }
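
The combining mark bookkeeping of step (1) can be sketched in isolation
as a helper that appends a mark to a character cell only while there is
room in 'combch[]'. The structure is re-declared locally as
'sketch_lchar_t' so that the fragment stands alone. The helper
'sketch_attach_mark()' and the choice to drop surplus marks silently
are assumptions.

```c
#define SKETCH_COMBLEN_MAX 8     /* Same value as COMBLEN_MAX */

typedef struct sketch_lchar_str  /* Local copy of the lchar_t layout */
{
  unsigned short ch;             /* Unicode Char as UCS-2 */
  unsigned short attr;           /* Attributes of Char */
  unsigned short attrx;          /* Extended Attributes */
  unsigned short comblen;        /* Combining Char Count */
  unsigned short combch[SKETCH_COMBLEN_MAX];
                                 /* Combining Chars as UCS-2 */
} sketch_lchar_t;

/* Attach 'mark' to cell 'cp'; return 1 if stored, 0 if dropped */
int sketch_attach_mark(sketch_lchar_t *cp, unsigned short mark)
{
  /* Valid indexes are 0..SKETCH_COMBLEN_MAX - 1, so test with >= */
  if (cp->comblen >= SKETCH_COMBLEN_MAX)
    return (0);

  cp->combch[cp->comblen] = mark;
  cp->comblen ++;
  return (1);
}
```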
1844
1845
1846
1847 3.4.2.2. compare_keywords() - Existing
1848
1849 [No Change]
1850
1851
1852
1853 3.4.2.3. getutf8() - Existing
1854
1855 [No Change]
1856
1857 [Ed Note: Future - allow 21-bit UTF-32 code points - requires updates
1858 in both 'textcommon.c' and 'texttops.c' for extended PostScript.]
1859
1860
1861
1862 3.5. Text to PostScript Filter - Existing
1863
1864
1865
1866 3.5.1. texttops.c - Text to PostScript filter
1867
1868 Required Changes:
1869
1870 (1) Revise local 'write_string()' function as described below.
1871
1872
1873
1874 3.5.1.1. main() - Existing
1875
1876 [No Change]
1877
1878
1879
1887 3.5.1.2. WriteEpilogue () - Existing
1888
1889 [No Change]
1890
1891
1892
1893 3.5.1.3. WritePage () - Existing
1894
1895 [No Change]
1896
1897
1898
1899 3.5.1.4. WriteProlog () - Existing
1900
1901 [No Change]
1902
1903
1904
1905 3.5.1.5. write_line() - Existing
1906
1907 [No Change]
1908
1909
1910
1911 3.5.1.6. write_string() - Existing
1912
1913 Required Changes:
1914
1915 (1) At the _beginning_ of Multiple Fonts section, _replace_ the while()
1916 loop and surrounding 'putchar()' calls with the following code:
1917
1918 for (; len > 0; len --, s ++)
1919 {
1920 utf32_t decstr[COMBLEN_MAX * 2];
1921 utf32_t cmpstr[COMBLEN_MAX * 2];
1922 int cmplen;
1923 int i;
1924
1925 if (s->comblen == 0)
1926 {
1927 printf("<%04x>", Chars[s->ch]);
1928 continue;
1929 }
1930
1931 /*
1932 * Normalize decomposed Unicode character to NFKC
1933 * (compatibility decomposition, then canonical composition)
1934 */
1935 decstr[0] = (utf32_t) s->ch;
1936 for (i = 0; i < s->comblen; i ++)
1942 decstr[i + 1] = (utf32_t) s->combch[i];
1943 decstr[i + 1] = 0;
1944 cmplen = cupsUtf32Normalize (&cmpstr[0],
1945 &decstr[0], COMBLEN_MAX * 2, CUPS_NORM_NFKC);
1946 if (cmplen < 1)
1947 continue;
1948
1949 /*
1950 * Write combining chars, then composed base, to same location
1951 */
1952 for (i = 1; i < cmplen; i ++)
1953 {
1954 printf("<%04x>", Chars[(int) cmpstr[i]]);
1955 /*
1956 * Superimpose glyphs by backing up one column width
1957 */
1958 printf (" -%.3f ", (72.0f / (float) CharsPerInch));
1959 }
1960 printf("<%04x>", Chars[(int) cmpstr[0]);
1961 }
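
The buffer-building step in the code above (base character first, then
each combining character, then a 0 terminator) can be sketched in
isolation as follows. This is only an illustration: 'demo_lchar_t',
'demo_utf32_t', 'DEMO_COMBLEN_MAX' (with its value of 4) and
'demo_build_decomposed()' are hypothetical stand-ins for the extended
'lchar_t', 'utf32_t' and 'COMBLEN_MAX' used by this design, and are not
part of CUPS.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_COMBLEN_MAX 4          /* assumed value, for illustration only */

typedef uint32_t demo_utf32_t;      /* hypothetical stand-in for utf32_t */

typedef struct                      /* hypothetical subset of extended lchar_t */
{
  unsigned short ch;                        /* base character */
  int            comblen;                   /* number of combining chars */
  unsigned short combch[DEMO_COMBLEN_MAX];  /* combining characters */
} demo_lchar_t;

/*
 * Copy the base character and its combining characters into 'dst' and
 * terminate the sequence with 0.  Returns the number of characters stored
 * (not counting the terminator), or -1 if 'comblen' is out of range or
 * 'dst' is too small.
 */
int
demo_build_decomposed(const demo_lchar_t *s, demo_utf32_t *dst, size_t dstsize)
{
  int i;

  if (s->comblen < 0 || s->comblen > DEMO_COMBLEN_MAX)
    return (-1);
  if (dstsize < (size_t) s->comblen + 2)
    return (-1);

  dst[0] = (demo_utf32_t) s->ch;
  for (i = 0; i < s->comblen; i ++)
    dst[i + 1] = (demo_utf32_t) s->combch[i];
  dst[i + 1] = 0;                   /* terminator follows the last combining char */

  return (s->comblen + 1);
}
```

For a base 'e' (0x0065) with one combining acute accent (0x0301), the
sketch stores { 0x0065, 0x0301, 0 } and returns 2; the terminator must
not overwrite the last combining character.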
1962
1963 [Ed Note: Future - Bidi support - When writing Unicode characters
1964 (checking for explicit bidi) convert input string (lchar_t) to display
1965 order???]
1966
1967
1968
1969 3.5.1.7. write_text() - Existing
1970
1971 [No Change]
1972
1973
1974
1998 APPENDIX A
1999 Glossary
2000
2001
2002
2003 A. Glossary
2004
2005 Abstract Character: A unit of information used for the organization,
2006 control, or representation of textual data.
2007
2008 Accent Mark: A mark placed above, below, or to the side of a character
2009 to alter its phonetic value (also 'diacritic').
2010
2011 Alphabet: A collection of symbols that, in the context of a particular
2012 written language, represent the sounds of that language.
2013
2014 Base Character: A character that does not graphically combine with
2015 preceding characters, and that is neither a control nor a format
2016 character.
2017
2018 Basic Multilingual Plane: The Unicode (or UCS) code values 0x0000
2019 through 0xFFFF, specified by [ISO10646] (also 'Plane 0').
2020
2021 BIDI: Abbreviation for Bidirectional, in reference to mixed
2022 left-to-right and right-to-left text.
2023
2024 Bidirectional Display: The process or result of mixing left-to-right
2025 oriented text and right-to-left oriented text in a single line.
2026
2027 Big-endian: A computer architecture that stores multiple-byte numerical
2028 values with the most significant byte (MSB) values first.
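
As a small illustration of big-endian storage, the sketch below writes a
32-bit value into four octets with the most significant byte first,
independent of the host byte order. The function name
'demo_store_be32()' is hypothetical and is not part of this design.

```c
#include <stdint.h>

/*
 * Store 'value' in 'out' in big-endian order (most significant byte first).
 * Shifts are used so the result does not depend on the host architecture.
 */
void
demo_store_be32(uint32_t value, unsigned char out[4])
{
  out[0] = (unsigned char) ((value >> 24) & 0xFF);
  out[1] = (unsigned char) ((value >> 16) & 0xFF);
  out[2] = (unsigned char) ((value >> 8) & 0xFF);
  out[3] = (unsigned char) (value & 0xFF);
}
```

Storing 0x0000FEFF this way yields the octets 00 00 FE FF.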
2029
2030 BMP: Abbreviation for Basic Multilingual Plane.
2031
2032 BOM: Acronym for byte order mark (also 'ZWNBSP').
2033
2034 Byte Order Mark: The Unicode character U+FEFF Zero Width No-Break Space
2035 (ZWNBSP) when used to indicate the byte order of text.
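
A minimal sketch of using the byte order mark to detect the byte order
of UTF-16 text follows. U+FEFF serialized big-endian is the octet pair
FE FF, and little-endian is FF FE. The names 'demo_order_t' and
'demo_utf16_byte_order()' are hypothetical; the sketch deliberately
handles only UTF-16 and does not try to distinguish a UTF-32 BOM.

```c
#include <stddef.h>

typedef enum
{
  DEMO_ORDER_UNKNOWN,               /* no BOM found */
  DEMO_ORDER_BIG,                   /* FE FF: big-endian */
  DEMO_ORDER_LITTLE                 /* FF FE: little-endian */
} demo_order_t;

/*
 * Inspect the first two octets of a UTF-16 buffer for U+FEFF.
 */
demo_order_t
demo_utf16_byte_order(const unsigned char *buf, size_t len)
{
  if (len >= 2)
  {
    if (buf[0] == 0xFE && buf[1] == 0xFF)
      return (DEMO_ORDER_BIG);
    if (buf[0] == 0xFF && buf[1] == 0xFE)
      return (DEMO_ORDER_LITTLE);
  }

  return (DEMO_ORDER_UNKNOWN);
}
```

When no BOM is present the caller has to fall back on some other
convention; the sketch only reports that the order is unknown.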
2036
2037 Canonical: (1) Conforming to the general rules for encoding -- that is,
2038 not compressed, compacted, or in any other form specified by a higher
2039 protocol. (2) Characteristic of a normative mapping and form of
2040 equivalence.
2041
2042 Canonical Decomposition: The decomposition of a character that results
2043 from recursively applying the canonical mappings defined in the Unicode
2044 Character Database until no characters can be further decomposed, then
2045 reordering nonspacing marks according to section 3.10 of [UNICODE3.2].
2046
2047 Canonical Equivalent: Two characters are canonical equivalents if their
2048 full canonical decompositions are identical.
2049
2050 Case: (1) Feature of certain alphabets where the letters have two
2058 distinct forms. These variants are called the 'uppercase' letter (also
2059 known as 'capital' or 'majuscule') and the 'lowercase' letter (also
2060 known as 'small' or 'minuscule'). (2) Normative property of Unicode
2061 characters, consisting of uppercase, lowercase, and titlecase.
2062
2063 Character: (1) The smallest component of written language that has
2064 semantic value; refers to the abstract meaning and/or shape, rather than
2065 a specific shape (see also 'glyph'). (2) Synonym for 'abstract
2066 character'. (3) The basic unit of encoding for the Unicode character
2067 encoding. (4) The English name for the ideographic written elements of
2068 Chinese origin (see 'ideograph').
2069
2070 Character Encoding Form (CEF): Mapping from a character set definition
2071 to the actual bits used to represent the data.
2072
2073 Character Encoding Scheme (CES): A 'character encoding form' plus byte
2074 serialization. [UNICODE3.2] defines seven character encoding schemes:
2075 UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
2076
2077 Character Properties: A set of property names and property values
2078 associated with individual characters defined in [UNICODE3.2].
2079
2080 Character Repertoire: (1) The collection of characters included in a
2081 character set. (2) The SUBSET of characters included in a large
2082 character set, e.g., [UNICODE3.2], that are necessary to support a
2083 complete mapping to another smaller character set, e.g., ISO8859-1 (also
2084 called 'Latin-1').
2085
2086 Character Set: A collection of elements used to represent textual
2087 information.
2088
2089 Coded Character Set: A character set in which each character is
2090 assigned a numeric code value. Frequently abbreviated as 'character
2091 set', 'charset', or 'code set'.
2092
2093 Code Point: (1) A numerical index (or position) in an encoding table
2094 used for encoding characters. (2) Synonym for 'Unicode scalar value'.
2095
2096 Collation: The process of ordering units of textual information.
2097 Collation is usually specific to a particular language. Also known as
2098 'alphabetizing' or 'alphabetic sorting'.
2099
2100 Combining Character: A character that graphically combines with a
2101 preceding 'base character'. The combining character is said to 'apply'
2102 to that base character. (See also 'nonspacing mark'.)
2103
2104 Compatibility: (1) Consistency with existing practice or preexisting
2105 character encoding standards. (2) Characteristic of a normative
2106 mapping and form of equivalence (see 'compatibility decomposition').
2107
2116 Compatibility Character: A character that has a compatibility
2117 decomposition.
2118
2119 Compatibility Decomposition: The decomposition of a character that
2120 results from recursively applying BOTH the compatibility mappings AND
2121 the canonical mappings found in the Unicode Character Database until no
2122 characters can be further decomposed, then reordering nonspacing marks
2123 according to section 3.10 of [UNICODE3.2].
2124
2125 Compatibility Equivalent: Two characters are compatibility equivalents
2126 if their full compatibility decompositions are identical.
2127
2128 Composed Character: (See 'decomposable character'.)
2129
2130 DBCS: Acronym for 'double-byte character set'.
2131
2132 Decomposable Character: A character that is equivalent to a sequence of
2133 one or more other characters, according to the decomposition mappings
2134 found in [UNICODE3.2]. It may also be known as a 'precomposed
2135 character' or a 'composite character'.
2136
2137 Decomposition: (1) The process of separating or analyzing a text
2138 element into component units. (2) A sequence of one or more characters
2139 that is equivalent to a 'decomposable character'.
2140
2141 Diacritic: (See 'accent mark'.)
2142
2143 Double-Byte Character Set (DBCS): One of a number of character sets
2144 defined for representing Chinese, Japanese, or Korean text (for example,
2145 JIS X 0208-1990). These character sets are often encoded in such a way
2146 as to allow double-byte character encodings to be mixed with single-byte
2147 character encodings. (See also 'multiple-byte character set'.)
2148
2149 Font: A collection of glyphs used for visual depiction of character
2150 data.
2151
2152 FSS-UTF: Abbreviation for 'File System Safe UCS Transformation Format',
2153 originally published by X/Open. Now called 'UTF-8'.
2154
2155 Fullwidth: Characters of East Asian character sets whose glyph image
2156 extends across the entire character display cell. In legacy character
2157 sets, fullwidth characters are normally encoded in two or three bytes.
2158
2159 Glyph: (1) An abstract form that represents one or more glyph images.
2160 (2) A synonym for 'glyph image'.
2161
2162 Glyph Image: The actual, concrete image of a glyph representation
2163 having been rasterized or otherwise imaged onto some display surface.
2164
2173 Halfwidth: Characters of East Asian character sets whose glyph image
2174 occupies half of the character display cell. In legacy character sets,
2175 halfwidth characters are normally encoded in a single byte.
2176
2177 Han Characters: Ideographic characters of Chinese origin.
2178
2179 Hangul: The name of the script used to write the Korean language.
2180
2181 High-Surrogate: A Unicode code value in the range U+D800 to U+DBFF.
2182
2183 Hiragana: One of two standard syllabaries associated with the Japanese
2184 writing system. Used to write particles, grammatical affixes, and words
2185 that have no 'kanji' form.
2186
2187 IANA: Internet Assigned Numbers Authority.
2188
2189 Ideograph: (1) Any symbol that denotes an idea (or meaning) in contrast
2190 to a sound or pronunciation (for example, a 'smiley face'). (2) A
2191 common term used to refer to Han characters.
2192
2193 IPA: International Phonetic Alphabet.
2194
2195 IRG: Abbreviation for Ideographic Rapporteur Group, a subgroup of
2196 ISO/IEC JTC1/SC2/WG2 (who work on Han unification and submission of new
2197 Han characters for inclusion in revised versions of Unicode/ISO 10646).
2198
2199 Jamo: The Korean name for a single letter of the Hangul script. Jamos
2200 are used to form Hangul syllables.
2201
2202 Joiner: An invisible character that affects the joining behavior of
2203 surrounding characters.
2204
2205 JTC1: Abbreviation for Joint Technical Committee 1 of ISO/IEC,
2206 responsible for information technology standardization.
2207
2208 Kana: The name of a primarily syllabic script used by the Japanese
2209 writing system, composed of 'hiragana' and 'katakana'.
2210
2211 Kanji: The Japanese name for Han characters; derived from the Chinese
2212 word 'hanzi'. Also romanized as 'kanzi'.
2213
2214 Katakana: One of two standard syllabaries associated with the Japanese
2215 writing system, typically used in representation of borrowed vocabulary.
2216
2217 Ligature: A glyph representing a combination of two or more characters,
2218 for example in the Latin script the ligature between 'f' and 'i' as
2219 'fi'.
2220
2221 Logical Order: The order in which text is typed on a keyboard. For the
2229 most part, logical order corresponds to phonetic order.
2230
2231 Lowercase: (See 'case'.)
2232
2233 Low-Surrogate: A Unicode code value in the range U+DC00 to U+DFFF.
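
The high-surrogate and low-surrogate ranges combine to represent code
values above 0xFFFF. The sketch below shows the standard arithmetic for
joining a pair. The function names are hypothetical, and returning
U+FFFD (REPLACEMENT CHARACTER) for a malformed pair is an assumption of
this example rather than a requirement of this design.

```c
#include <stdint.h>

/* Range tests taken from the glossary definitions. */
int
demo_is_high_surrogate(uint32_t v)
{
  return (v >= 0xD800 && v <= 0xDBFF);
}

int
demo_is_low_surrogate(uint32_t v)
{
  return (v >= 0xDC00 && v <= 0xDFFF);
}

/*
 * Combine a high/low surrogate pair into a single code value.
 * Each surrogate contributes 10 bits; the result is offset by 0x10000.
 */
uint32_t
demo_combine_surrogates(uint16_t high, uint16_t low)
{
  if (!demo_is_high_surrogate(high) || !demo_is_low_surrogate(low))
    return (0xFFFD);                /* assumed error convention */

  return (0x10000 + (((uint32_t) high - 0xD800) << 10) +
          ((uint32_t) low - 0xDC00));
}
```

For example, the pair 0xD83D, 0xDE00 combines to 0x1F600.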
2234
2235 MBCS: Acronym for 'multiple-byte character set'.
2236
2237 Multiple-Byte Character Set (MBCS): A character set encoded with a
2238 variable number of bytes per character. Many large character sets have
2239 been defined as MBCS so as to keep strict compatibility with the
2240 US-ASCII subset and/or [ISO2022].
2241
2242 Normalization: Transformation of data to a normal form.
2243
2244 Plain Text: Computer-encoded text that consists ONLY of a sequence of
2245 code values from a given standard, with no other formatting or
2246 structural information.
2247
2248 Precomposed Character: (See 'decomposable character'.)
2249
2250 Rendering: (1) The process of selecting and laying out glyphs for the
2251 purpose of depicting characters. (2) The process of making glyphs
2252 visible on a display device.
2253
2254 Repertoire: (See 'character repertoire'.)
2255
2256 Replacement Character: A character used as a substitute for an
2257 uninterpretable character from another encoding. [UNICODE3.2] defines
2258 U+FFFD REPLACEMENT CHARACTER for this function.
2259
2260 Rich Text: The result of adding information such as font data, color,
2261 formatting, phonetic annotations, etc. to 'plain text' (e.g., HTML).
2262
2263 SBCS: Acronym for 'single-byte character set'.
2264
2265 Scalar Value: (See 'Unicode scalar value'.)
2266
2267 Script: A collection of symbols used to represent textual information
2268 in one or more writing systems.
2269
2270 Single-Byte Character Set (SBCS): One of a number of one-byte character
2271 sets defined for representing (mostly) Western languages (for example,
2272 ISO 8859-1 'Latin-1'). These character sets are often encoded in such a
2273 way as to be strict supersets of 7-bit [US-ASCII].
2274
2275 Sorting: (See 'collation'.)
2276
2277 Transcoding: Conversion of character data between different character
2278 sets.
2279
2287 Transformation Format: A mapping from a coded character sequence to a
2288 unique sequence of code values (typically octets).
2289
2290 UCS: Abbreviation for Universal Character Set, specified by [ISO10646].
2291
2292 UCS-2: UCS encoded in 2 octets, specified by [ISO10646].
2293
2294 UCS-4: UCS encoded in 4 octets, specified by [ISO10646].
2295
2296 Unicode Scalar Value: A number in the range 0 to 0x10FFFF.
2297
2298 Uppercase: (See 'case'.)
2299
2300 UTF: Abbreviation for Unicode (or UCS) Transformation Format.
2301
2302 UTF-8: Unicode (or UCS) Transformation Format, 8-bit encoding form.
2303 Serializes a Unicode (or UCS) scalar value (code point) as a sequence of
2304 one to four octets. Does NOT suffer from byte-ordering ambiguities.
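
The one-to-four octet serialization can be sketched as follows. The
function name 'demo_utf8_encode()' is hypothetical and is not the
transcoding interface defined by this design; rejecting surrogate code
values and values above 0x10FFFF by returning 0 is a convention chosen
for this example.

```c
#include <stdint.h>

/*
 * Encode one scalar value as UTF-8.  Returns the number of octets written
 * to 'out' (1 to 4), or 0 if 'cp' is a surrogate or exceeds 0x10FFFF.
 */
int
demo_utf8_encode(uint32_t cp, unsigned char out[4])
{
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return (0);

  if (cp < 0x80)                    /* 7 bits: 0xxxxxxx */
  {
    out[0] = (unsigned char) cp;
    return (1);
  }

  if (cp < 0x800)                   /* 11 bits: 110xxxxx 10xxxxxx */
  {
    out[0] = (unsigned char) (0xC0 | (cp >> 6));
    out[1] = (unsigned char) (0x80 | (cp & 0x3F));
    return (2);
  }

  if (cp < 0x10000)                 /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
  {
    out[0] = (unsigned char) (0xE0 | (cp >> 12));
    out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (cp & 0x3F));
    return (3);
  }

  /* 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
  out[0] = (unsigned char) (0xF0 | (cp >> 18));
  out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (cp & 0x3F));
  return (4);
}
```

Because every octet carries its own marker bits, the output is the same
on big-endian and little-endian hosts, which is why UTF-8 needs no BOM
to resolve byte order.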
2305
2306 UTF-16: Unicode (or UCS) Transformation Format, 16-bit encoding form.
2307 Serializes a Unicode (or UCS) scalar value (code point) as a sequence of
2308 two or four octets (four for values above 0xFFFF), in either big-endian or
2309 little-endian format. Uses an (optional) BOM prefix to disambiguate byte order.
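
A minimal sketch of UTF-16 encoding follows. Scalar values up to 0xFFFF
occupy one 16-bit code value; values above 0xFFFF are split into a
high-surrogate and low-surrogate pair. The function name
'demo_utf16_encode()' is hypothetical, and returning 0 for invalid input
is a convention of this example. Byte serialization (big-endian or
little-endian) is a separate step that is not shown.

```c
#include <stdint.h>

/*
 * Encode one scalar value as UTF-16 code values.  Returns the number of
 * 16-bit values written to 'out' (1 or 2), or 0 if 'cp' is a surrogate
 * or exceeds 0x10FFFF.
 */
int
demo_utf16_encode(uint32_t cp, uint16_t out[2])
{
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return (0);

  if (cp < 0x10000)                 /* fits in a single code value */
  {
    out[0] = (uint16_t) cp;
    return (1);
  }

  cp -= 0x10000;                    /* 20 bits remain */
  out[0] = (uint16_t) (0xD800 + (cp >> 10));     /* high surrogate: top 10 bits */
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));   /* low surrogate: low 10 bits */
  return (2);
}
```

For example, 0x20AC encodes as the single value 0x20AC, while 0x1F600
encodes as the pair 0xD83D, 0xDE00.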
2310
2311 UTF-32: Unicode (or UCS) Transformation Format, 32-bit encoding form.
2312 Serializes a Unicode (or UCS) scalar value (code point) as a sequence of
2313 four octets, in either big-endian or little-endian format. Uses an
2314 (optional) prefix of BOM to disambiguate byte-ordering.
2315
2316 Zero Width: Characteristic of some spaces or format control characters
2317 that do not advance text along the horizontal baseline.