Commit | Line | Data |
---|---|---|
ef416fc2 | 1 | |
2 | ||
3 | WORKING DRAFT Ira McDonald | |
4 | <i18n_sdd.txt> High North Inc | |
5 | ||
6 | Common UNIX Printing System ("CUPS") | |
7 | Internationalization Software Design Description v0.3 | |
8 | ||
9 | Copyright (C) Easy Software Products (2002) - All Rights Reserved | |
10 | ||
11 | ||
12 | Status of this Document | |
13 | ||
14 | This document is an unapproved working draft and is incomplete in some | |
15 | sections (see 'Ed Note:' comments). | |
16 | ||
17 | ||
18 | Abstract | |
19 | ||
20 | This document provides general information and high-level design for the | |
21 | Internationalization extensions for the Common UNIX Printing System | |
22 | ("CUPS") Version 1.2. This document also provides C language header | |
23 | files and high-level pseudo-code for all new modules and external | |
24 | functions. | |
25 | ||
26 | ||
27 | ||
28 | ||
29 | ||
30 | ||
31 | ||
32 | ||
33 | ||
34 | ||
35 | ||
36 | ||
37 | ||
38 | ||
39 | ||
40 | ||
41 | ||
42 | ||
43 | ||
44 | ||
45 | ||
46 | ||
47 | ||
48 | ||
49 | ||
50 | ||
51 | ||
52 | ||
53 | ||
54 | ||
55 | ||
56 | ||
57 | McDonald June 20, 2002 [Page 1] | |
58 | \f | |
59 | CUPS Internationalization Software Design Description v0.3 | |
60 | ||
61 | Table of Contents | |
62 | ||
63 | 1. Scope ...................................................... 4 | |
64 | 1.1. Identification ......................................... 4 | |
65 | 1.2. System Overview ........................................ 4 | |
66 | 1.3. Document Overview ...................................... 4 | |
67 | 2. References ................................................. 5 | |
68 | 2.1. CUPS References ........................................ 5 | |
69 | 2.2. Other Documents ........................................ 5 | |
70 | 3. Design Overview ............................................ 7 | |
71 | 3.1. Transcoding - New ...................................... 7 | |
72 | 3.1.1. transcode.h - Transcoding header ................... 7 | |
73 | 3.1.1.1. cups_cmap_t - SBCS Charmap Structure ........... 10 | |
74 | 3.1.1.2. cups_dmap_t - DBCS Charmap Structure ........... 11 | |
75 | 3.1.2. transcode.c - Transcoding module ................... 11 | |
76 | 3.1.2.1. cupsUtf8ToCharset() ............................ 11 | |
77 | 3.1.2.2. cupsCharsetToUtf8() ............................ 12 | |
78 | 3.1.2.3. cupsUtf8ToUtf16() .............................. 12 | |
79 | 3.1.2.4. cupsUtf16ToUtf8() .............................. 12 | |
80 | 3.1.2.5. cupsUtf8ToUtf32() .............................. 12 | |
81 | 3.1.2.6. cupsUtf32ToUtf8() .............................. 13 | |
82 | 3.1.2.7. cupsUtf16ToUtf32() ............................. 13 | |
83 | 3.1.2.8. cupsUtf32ToUtf16() ............................. 13 | |
84 | 3.1.2.9. Transcoding Utility Functions .................. 13 | |
85 | 3.1.2.9.1. cupsCharmapGet() ........................... 14 | |
86 | 3.1.2.9.2. cupsCharmapFree() .......................... 14 | |
87 | 3.1.2.9.3. cupsCharmapFlush() ......................... 14 | |
88 | 3.2. Normalization - New .................................... 15 | |
89 | 3.2.1. normalize.h - Normalization header ................. 15 | |
90 | 3.2.1.1. cups_normmap_t - Normalize Map Structure ....... 22 | |
91 | 3.2.1.2. cups_foldmap_t - Case Fold Map Structure ....... 22 | |
92 | 3.2.1.3. cups_propmap_t - Char Property Map Structure ... 23 | |
93 | 3.2.1.4. cups_prop_t - Char Property Structure .......... 23 | |
94 | 3.2.1.5. cups_breakmap_t - Line Break Map Structure ..... 23 | |
95 | 3.2.1.6. cups_combmap_t - Combining Class Map Structure . 24 | |
96 | 3.2.1.7. cups_comb_t - Combining Class Structure ........ 24 | |
97 | 3.2.2. normalize.c - Normalization module ................. 24 | |
98 | 3.2.2.1. cupsUtf8Normalize() ............................ 24 | |
99 | 3.2.2.2. cupsUtf32Normalize() ........................... 25 | |
100 | 3.2.2.3. cupsUtf8CaseFold() ............................. 25 | |
101 | 3.2.2.4. cupsUtf32CaseFold() ............................ 26 | |
102 | 3.2.2.5. cupsUtf8CompareCaseless() ...................... 26 | |
103 | 3.2.2.6. cupsUtf32CompareCaseless() ..................... 26 | |
104 | 3.2.2.7. cupsUtf8CompareIdentifier() .................... 27 | |
105 | 3.2.2.8. cupsUtf32CompareIdentifier() ................... 27 | |
106 | 3.2.2.9. cupsUtf32CharacterProperty() ................... 27 | |
107 | 3.2.2.10. Normalization Utility Functions ............... 28 | |
108 | 3.2.2.10.1. cupsNormalizeMapsGet() .................... 28 | |
109 | 3.2.2.10.2. cupsNormalizeMapsFree() ................... 28 | |
110 | 3.2.2.10.3. cupsNormalizeMapsFlush() .................. 28 | |
111 | 3.3. Language - Existing .................................... 29 | |
112 | 3.3.1. language.h - Language header ....................... 29 | |
118 | 3.3.2. language.c - Language module ....................... 29 | |
119 | 3.3.2.1. cupsLangEncoding() - Existing .................. 29 | |
120 | 3.3.2.2. cupsLangFlush() - Existing ..................... 29 | |
121 | 3.3.2.3. cupsLangFree() - Existing ...................... 29 | |
122 | 3.3.2.4. cupsLangGet() - Existing ....................... 30 | |
123 | 3.3.2.5. cupsLangPrintf() - New ......................... 30 | |
124 | 3.3.2.6. cupsLangPuts() - New ........................... 30 | |
125 | 3.3.2.7. cupsEncodingName() - New ....................... 31 | |
126 | 3.4. Common Text Filter - Existing .......................... 31 | |
127 | 3.4.1. textcommon.h - Common text filter header ........... 31 | |
128 | 3.4.1.1. lchar_t - Character/Attribute Structure ........ 31 | |
129 | 3.4.2. textcommon.c - Common text filter .................. 32 | |
130 | 3.4.2.1. TextMain() - Existing .......................... 32 | |
131 | 3.4.2.2. compare_keywords() - Existing .................. 33 | |
132 | 3.4.2.3. getutf8() - Existing ........................... 33 | |
133 | 3.5. Text to PostScript Filter - Existing ................... 33 | |
134 | 3.5.1. texttops.c - Text to PostScript filter ............. 33 | |
135 | 3.5.1.1. main() - Existing .............................. 33 | |
136 | 3.5.1.2. WriteEpilogue () - Existing .................... 34 | |
137 | 3.5.1.3. WritePage () - Existing ........................ 34 | |
138 | 3.5.1.4. WriteProlog () - Existing ...................... 34 | |
139 | 3.5.1.5. write_line() - Existing ........................ 34 | |
140 | 3.5.1.6. write_string() - Existing ...................... 34 | |
141 | 3.5.1.7. write_text() - Existing ........................ 35 | |
142 | A. Glossary ................................................... A-1 | |
143 | ||
144 | ||
174 | ||
175 | ||
176 | ||
177 | 1. Scope | |
178 | ||
179 | ||
180 | ||
181 | 1.1. Identification | |
182 | ||
183 | This document provides general information and high-level design for the | |
184 | Internationalization extensions for the Common UNIX Printing System | |
185 | ("CUPS") Version 1.2. This document also provides C language header | |
186 | files and high-level pseudo-code for all new modules and external | |
187 | functions. | |
188 | ||
189 | ||
190 | 1.2. System Overview | |
191 | ||
192 | The CUPS Internationalization extensions provide multilingual support | |
193 | via Unicode 3.2:2002 [UNICODE3.2] / ISO-10646-1:2000 [ISO10646-1] and a | |
194 | suite of local character sets (including all adopted parts of ISO-8859 | |
195 | and many MS Windows code pages) for CUPS 1.2. | |
196 | ||
197 | The CUPS Internationalization extensions support UTF-8 [RFC2279] as the | |
198 | common stream-oriented representation of all character data. UTF-8 is | |
199 | defined in [ISO10646-1] and is further constrained (for integrity and | |
200 | security) by [UNICODE3.2]. | |
201 | ||
202 | UTF-8 is the native character set of LDAPv3 [RFC2251], SLPv2 [RFC2608], | |
203 | IPP/1.1 [RFC2910] [RFC2911], and many other Internet protocols. | |
204 | ||
205 | ||
206 | 1.3. Document Overview | |
207 | ||
208 | ||
209 | This software design description document is organized into the | |
210 | following sections: | |
211 | ||
212 | o 1 - Scope | |
213 | o 2 - References | |
214 | o 3 - Design Overview | |
215 | o A - Glossary | |
216 | ||
217 | ||
231 | ||
232 | ||
233 | ||
234 | 2. References | |
235 | ||
236 | ||
237 | ||
238 | 2.1. CUPS References | |
239 | ||
240 | See: Section 2.1 'CUPS Documentation' of CUPS Software Design | |
241 | Description. | |
242 | ||
243 | ||
244 | 2.2. Other Documents | |
245 | ||
246 | The following non-CUPS documents are referenced by this document. | |
247 | ||
248 | [ANSI-X3.4] ANSI Coded Character Set - 7-bit American National Standard | |
249 | Code for Information Interchange, ANSI X3.4, 1986 (aka US-ASCII). | |
250 | ||
251 | [GB2312] Code of Chinese Graphic Character Set for Information | |
252 | Interchange, Primary Set, GB 2312, 1980. | |
253 | ||
254 | [ISO639-1] Codes for the Representation of Names of Languages -- Part 1: | |
255 | Alpha-2 Code, ISO/IEC 639-1, 2000. | |
256 | ||
257 | [ISO639-2] Codes for the Representation of Names of Languages -- Part 2: | |
258 | Alpha-3 Code, ISO/IEC 639-2, 1998. | |
259 | ||
260 | [ISO646] Information Technology - ISO 7-bit Coded Character Set for | |
261 | Information Interchange, ISO/IEC 646, 1991. | |
262 | ||
263 | [ISO2022] Information Processing - ISO 7-bit and 8-bit Coded Character | |
264 | Sets - Code Extension Techniques, ISO/IEC 2022, 1994. (Technically | |
265 | identical to ECMA-35.) | |
266 | ||
267 | [ISO3166-1] Codes for the Representation of Names of Countries and their | |
268 | Subdivisions, Part 1: Country Codes, ISO 3166-1, 1997. | |
269 | ||
270 | [ISO8859] Information Processing - 8-bit Single-Byte Code Graphic | |
271 | Character Sets, ISO/IEC 8859-n, 1987-2001. | |
272 | ||
273 | [ISO10646-1] Information Technology - Universal Multiple-Octet Code | |
274 | Character Set (UCS) - Part 1: Architecture and Basic Multilingual | |
275 | Plane, ISO/IEC 10646-1, September 2000. | |
276 | ||
277 | [ISO10646-2] Information Technology - Universal Multiple-Octet Code | |
278 | Character Set (UCS) - Part 2: Supplemental Planes, ISO/IEC 10646-2, | |
279 | January 2001. | |
280 | ||
281 | [RFC2119] Bradner. Key words for use in RFCs to Indicate Requirement | |
282 | Levels, RFC 2119, March 1997. | |
283 | ||
290 | [RFC2251] Wahl, Howes, Kille. Lightweight Directory Access Protocol | |
291 | Version 3 (LDAPv3), RFC 2251, December 1997. | |
292 | ||
293 | [RFC2277] Alvestrand. IETF Policy on Character Sets and Languages, RFC | |
294 | 2277, January 1998. | |
295 | ||
296 | [RFC2279] Yergeau. UTF-8, a Transformation Format of ISO 10646, RFC | |
297 | 2279, January 1998. | |
298 | ||
299 | [RFC2608] Guttman, Perkins, Veizades, Day. Service Location Protocol | |
300 | Version 2 (SLPv2), RFC 2608, June 1999. | |
301 | ||
302 | [RFC2910] Herriot, Butler, Moore, Turner, Wenn. Internet Printing | |
303 | Protocol/1.1: Encoding and Transport, RFC 2910, September 2000. | |
304 | ||
305 | [RFC2911] Hastings, Herriot, deBry, Isaacson, Powell. Internet Printing | |
306 | Protocol/1.1: Model and Semantics, RFC 2911, September 2000. | |
307 | ||
308 | [UNICODE3.0] Unicode Consortium, Unicode Standard Version 3.0, | |
309 | Addison-Wesley Developers Press, ISBN 0-201-61633-5, 2000. | |
310 | ||
311 | [UNICODE3.1] Unicode Consortium, Unicode Standard Version 3.1 (UAX-27), | |
312 | May 2001. | |
313 | ||
314 | [UNICODE3.2] Unicode Consortium, Unicode Standard Version 3.2 (UAX-28), | |
315 | March 2002. | |
316 | ||
317 | [US-ASCII] See [ANSI-X3.4] above. | |
318 | ||
319 | ||
345 | ||
346 | ||
347 | ||
348 | 3. Design Overview | |
349 | ||
350 | The CUPS Internationalization extensions are composed of several header | |
351 | files and modules which extend the Language functions in the existing | |
352 | CUPS Application Programmers Interface (API). | |
353 | ||
354 | ||
355 | 3.1. Transcoding - New | |
356 | ||
357 | Initially, the CUPS Internationalization extensions will only support | |
358 | SBCS (single-byte character set) transcoding. But the design allows | |
359 | future support for DBCS (double-byte character set) transcoding for CJK | |
360 | (Chinese/Japanese/Korean) languages and the MBCS (multiple-byte | |
361 | character set) compound sets that use escapes for charset switching. | |
362 | ||
363 | In order to reduce code size and increase performance all conventional | |
364 | 'mapping files' (tables of values in legacy character sets with their | |
365 | corresponding Unicode scalar values) will ALSO be sorted and stored in | |
366 | memory as reverse maps (for efficient conversion from Unicode scalar | |
367 | values to their corresponding legacy character set values). Transcoding | |
368 | will be done directly by 2-level lookup (without any searching or | |
369 | sorting). | |
370 | ||
371 | [Ed Note: CJK languages will be fairly costly in mapping table sizes, | |
372 | because they have thousands (or tens of thousands) of codepoints.] | |
373 | ||
374 | ||
375 | ||
376 | 3.1.1. transcode.h - Transcoding header | |
377 | ||
378 | /* | |
379 | * "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $" | |
380 | * | |
381 | * Transcoding support for the Common UNIX Printing System (CUPS). | |
382 | * | |
383 | * Copyright 1997-2002 by Easy Software Products. | |
384 | * | |
385 | * These coded instructions, statements, and computer programs are | |
386 | * the property of Easy Software Products and are protected by Federal | |
387 | * copyright law. Distribution and use rights are outlined in the | |
388 | * file "LICENSE.txt" which should have been included with this file. | |
389 | * If this file is missing or damaged please contact Easy Software | |
390 | * Products at: | |
391 | * | |
392 | * Attn: CUPS Licensing Information | |
393 | * Easy Software Products | |
394 | * 44141 Airport View Drive, Suite 204 | |
395 | * Hollywood, Maryland 20636-3111 USA | |
396 | * | |
397 | * Voice: (301) 373-9603 | |
403 | * EMail: cups-info@cups.org | |
404 | * WWW: http://www.cups.org | |
405 | */ | |
406 | ||
407 | #ifndef _CUPS_TRANSCODE_H_ | |
408 | # define _CUPS_TRANSCODE_H_ | |
409 | ||
410 | /* | |
411 | * Include necessary headers... | |
412 | */ | |
413 | ||
414 | # include "cups/language.h" | |
415 | ||
416 | # ifdef __cplusplus | |
417 | extern "C" { | |
418 | # endif /* __cplusplus */ | |
419 | ||
420 | /* | |
421 | * Types... | |
422 | */ | |
423 | ||
424 | typedef unsigned char utf8_t; /* UTF-8 Unicode/ISO-10646 code unit */ | |
425 | typedef unsigned short utf16_t; /* UTF-16 Unicode/ISO-10646 code unit */ | |
426 | typedef unsigned long utf32_t; /* UTF-32 Unicode/ISO-10646 code unit */ | |
427 | typedef unsigned short ucs2_t; /* UCS-2 Unicode/ISO-10646 code unit */ | |
428 | typedef unsigned long ucs4_t; /* UCS-4 Unicode/ISO-10646 code unit */ | |
429 | typedef unsigned char sbcs_t; /* SBCS Legacy 8-bit code unit */ | |
430 | typedef unsigned short dbcs_t; /* DBCS Legacy 16-bit code unit */ | |
431 | ||
432 | /* | |
433 | * Structures... | |
434 | */ | |
435 | ||
436 | typedef struct cups_cmap_str /**** SBCS Charmap Cache Structure ****/ | |
437 | { | |
438 | struct cups_cmap_str *next; /* Next charmap in cache */ | |
439 | int used; /* Number of times entry used */ | |
440 | cups_encoding_t encoding; /* Legacy charset encoding */ | |
441 | ucs2_t char2uni[256]; /* Map Legacy SBCS -> UCS-2 */ | |
442 | sbcs_t *uni2char[256]; /* Map UCS-2 -> Legacy SBCS */ | |
443 | } cups_cmap_t; | |
444 | ||
445 | #if 0 | |
446 | typedef struct cups_dmap_str /**** DBCS Charmap Cache Structure ****/ | |
447 | { | |
448 | struct cups_dmap_str *next; /* Next charmap in cache */ | |
449 | int used; /* Number of times entry used */ | |
450 | cups_encoding_t encoding; /* Legacy charset encoding */ | |
451 | ucs2_t *char2uni[256]; /* Map Legacy DBCS -> UCS-2 */ | |
452 | dbcs_t *uni2char[256]; /* Map UCS-2 -> Legacy DBCS */ | |
453 | } cups_dmap_t; | |
454 | #endif | |
455 | ||
461 | /* | |
462 | * Constants... | |
463 | */ | |
464 | #define CUPS_MAX_USTRING 1024 /* Maximum size of Unicode string */ | |
465 | ||
466 | /* | |
467 | * Globals... | |
468 | */ | |
469 | ||
470 | extern int TcFixMapNames; /* Fix map names to Unicode names */ | |
471 | extern int TcStrictUtf8; /* Non-shortest-form is illegal */ | |
472 | extern int TcStrictUtf16; /* Invalid surrogate pair is illegal */ | |
473 | extern int TcStrictUtf32; /* Greater than 0x10FFFF is illegal */ | |
474 | extern int TcRequireBOM; /* Require BOM for little/big-endian */ | |
475 | extern int TcSupportBOM; /* Support BOM for little/big-endian */ | |
476 | extern int TcSupport8859; /* Support ISO 8859-x repertoires */ | |
477 | extern int TcSupportWin; /* Support Windows-x repertoires */ | |
478 | extern int TcSupportCJK; /* Support CJK (Asian) repertoires */ | |
479 | ||
480 | /* | |
481 | * Prototypes... | |
482 | */ | |
483 | ||
484 | /* | |
485 | * Utility functions for character set maps | |
486 | */ | |
487 | extern void *cupsCharmapGet(const cups_encoding_t encoding); | |
488 | /* I - Encoding */ | |
489 | extern void cupsCharmapFree(const cups_encoding_t encoding); | |
490 | /* I - Encoding */ | |
491 | extern void cupsCharmapFlush(void); | |
492 | ||
493 | /* | |
494 | * Convert UTF-8 to and from legacy character set | |
495 | */ | |
496 | extern int cupsUtf8ToCharset(char *dest, /* O - Target string */ | |
497 | const utf8_t *src, /* I - Source string */ | |
498 | const int maxout, /* I - Max output */ | |
499 | cups_encoding_t encoding); /* I - Encoding */ | |
500 | extern int cupsCharsetToUtf8(utf8_t *dest, /* O - Target string */ | |
501 | const char *src, /* I - Source string */ | |
502 | const int maxout, /* I - Max output */ | |
503 | cups_encoding_t encoding); /* I - Encoding */ | |
504 | ||
505 | /* | |
506 | * Convert UTF-8 to and from UTF-16 | |
507 | */ | |
508 | extern int cupsUtf8ToUtf16(utf16_t *dest, /* O - Target string */ | |
509 | const utf8_t *src, /* I - Source string */ | |
510 | const int maxout); /* I - Max output */ | |
511 | extern int cupsUtf16ToUtf8(utf8_t *dest, /* O - Target string */ | |
517 | const utf16_t *src, /* I - Source string */ | |
518 | const int maxout); /* I - Max output */ | |
519 | ||
520 | /* | |
521 | * Convert UTF-8 to and from UTF-32 | |
522 | */ | |
523 | extern int cupsUtf8ToUtf32(utf32_t *dest, /* O - Target string */ | |
524 | const utf8_t *src, /* I - Source string */ | |
525 | const int maxout); /* I - Max output */ | |
526 | extern int cupsUtf32ToUtf8(utf8_t *dest, /* O - Target string */ | |
527 | const utf32_t *src, /* I - Source string */ | |
528 | const int maxout); /* I - Max output */ | |
529 | ||
530 | /* | |
531 | * Convert UTF-16 to and from UTF-32 | |
532 | */ | |
533 | extern int cupsUtf16ToUtf32(utf32_t *dest, /* O - Target string */ | |
534 | const utf16_t *src, /* I - Source string */ | |
535 | const int maxout); /* I - Max output */ | |
536 | extern int cupsUtf32ToUtf16(utf16_t *dest, /* O - Target string */ | |
537 | const utf32_t *src, /* I - Source string */ | |
538 | const int maxout); /* I - Max output */ | |
539 | ||
540 | # ifdef __cplusplus | |
541 | } | |
542 | # endif /* __cplusplus */ | |
543 | ||
544 | #endif /* !_CUPS_TRANSCODE_H_ */ | |
545 | ||
546 | /* | |
547 | * End of "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $" | |
548 | */ | |
549 | ||
550 | ||
551 | ||
552 | 3.1.1.1. cups_cmap_t - SBCS Charmap Structure | |
553 | ||
554 | typedef struct cups_cmap_str /**** SBCS Charmap Cache Structure ****/ | |
555 | { | |
556 | struct cups_cmap_str *next; /* Next charset map in cache */ | |
557 | int used; /* Number of times entry used */ | |
558 | cups_encoding_t encoding; /* Legacy charset encoding */ | |
559 | ucs2_t char2uni[256]; /* Map Legacy SBCS -> UCS-2 */ | |
560 | sbcs_t *uni2char[256]; /* Map UCS-2 -> Legacy SBCS */ | |
561 | } cups_cmap_t; | |
562 | ||
563 | 'char2uni[]' is a (complete) array of UCS-2 values that supports direct | |
564 | one-level lookup from an input SBCS legacy charset code point, for use | |
565 | by 'cupsCharsetToUtf8()'. | |
566 | ||
567 | 'uni2char[]' is a (sparse) array of pointers to arrays of (256 each) | |
568 | SBCS values, that supports direct two-level lookup from an input UCS-2 | |
574 | code point, for use by 'cupsUtf8ToCharset()'. | |
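The two-level reverse lookup can be sketched as follows. This is a minimal illustration only, assuming that the high byte of the UCS-2 value selects a row in 'uni2char[]' and the low byte selects the entry within that 256-element row. The names 'demo_cmap_t', 'demo_add()' and 'demo_to_sbcs()', and the '?' substitution for unmapped values, are hypothetical and are not part of the CUPS API.

```c
#include <stdlib.h>

typedef unsigned short ucs2_t;          /* UCS-2 code unit */
typedef unsigned char  sbcs_t;          /* SBCS legacy 8-bit code unit */

typedef struct                          /* Hypothetical cut-down charmap */
{
  ucs2_t char2uni[256];                 /* Map Legacy SBCS -> UCS-2 */
  sbcs_t *uni2char[256];                /* Map UCS-2 -> Legacy SBCS */
} demo_cmap_t;

/* Record one mapping pair in both directions; returns 0 on success */
static int
demo_add(demo_cmap_t *map, sbcs_t legacy, ucs2_t unichar)
{
  int hi = unichar >> 8;                /* First-level index (row) */
  int lo = unichar & 0xff;              /* Second-level index (entry) */

  if (map->uni2char[hi] == NULL)
  {
    /* Rows are allocated only when first needed (sparse array) */
    map->uni2char[hi] = calloc(256, sizeof(sbcs_t));
    if (map->uni2char[hi] == NULL)
      return (-1);
  }

  map->char2uni[legacy] = unichar;
  map->uni2char[hi][lo] = legacy;
  return (0);
}

/* Two-level reverse lookup: two array indexes, no searching */
static sbcs_t
demo_to_sbcs(const demo_cmap_t *map, ucs2_t unichar)
{
  const sbcs_t *row = map->uni2char[unichar >> 8];

  if (row == NULL || (row[unichar & 0xff] == 0 && unichar != 0))
    return ('?');                       /* Assumed substitution char */

  return (row[unichar & 0xff]);
}
```

Only rows that contain at least one mapped value are allocated, so a typical SBCS charset needs a handful of 256-byte rows rather than a full 64K reverse table.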
575 | ||
576 | ||
577 | ||
578 | 3.1.1.2. cups_dmap_t - DBCS Charmap Structure | |
579 | ||
580 | typedef struct cups_dmap_str /**** DBCS Charmap Cache Structure ****/ | |
581 | { | |
582 | struct cups_dmap_str *next; /* Next charset map in cache */ | |
583 | int used; /* Number of times entry used */ | |
584 | cups_encoding_t encoding; /* Legacy charset encoding */ | |
585 | ucs2_t *char2uni[256]; /* Map Legacy DBCS -> UCS-2 */ | |
586 | dbcs_t *uni2char[256]; /* Map UCS-2 -> Legacy DBCS */ | |
587 | } cups_dmap_t; | |
588 | ||
589 | 'char2uni[]' is a (sparse) array of pointers to arrays of (256 each) | |
590 | UCS-2 values that supports direct two-level lookup from an input DBCS | |
591 | legacy charset code point, for (future) use by 'cupsCharsetToUtf8()'. | |
592 | ||
593 | 'uni2char[]' is a (sparse) array of pointers to arrays of (256 each) | |
594 | DBCS values, that supports direct two-level lookup from an input UCS-2 | |
595 | code point, for (future) use by 'cupsUtf8ToCharset()'. | |
596 | ||
597 | ||
598 | ||
599 | 3.1.2. transcode.c - Transcoding module | |
600 | ||
601 | All of the transcoding functions are modelled on the C standard library | |
602 | function 'strncpy()', except that they return the count of output, like | |
603 | 'strlen()', rather than the (redundant) pointer to the output. | |
604 | ||
605 | If the transcoding functions detect invalid input parameters or they | |
606 | detect an encoding error in their input, then they return '-1', rather | |
607 | than the count of output. | |
608 | ||
609 | All of the transcoding functions take an input parameter indicating the | |
610 | maximum output units (for safe operation). The functions that return | |
611 | 16-bit (UTF-16) or 32-bit (UTF-32/UCS-4) output always return the output | |
612 | string count (not including the final null) and NOT the memory size in | |
613 | bytes. | |
614 | ||
615 | ||
616 | ||
617 | 3.1.2.1. cupsUtf8ToCharset() | |
618 | ||
619 | extern int cupsUtf8ToCharset(char *dest, /* O - Target string */ | |
620 | const utf8_t *src, /* I - Source string */ | |
621 | const int maxout, /* I - Max output */ | |
622 | cups_encoding_t encoding); /* I - Encoding */ | |
623 | ||
624 | <Find charset map by calling 'cupsCharmapGet()'> | |
625 | <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'> | |
631 | <Convert internal UCS-4 to legacy charset via charset map> | |
632 | <Release charset map by calling 'cupsCharmapFree()'> | |
633 | <Return length of output legacy charset string -- size in bytes> | |
634 | ||
635 | ||
636 | ||
637 | 3.1.2.2. cupsCharsetToUtf8() | |
638 | ||
639 | extern int cupsCharsetToUtf8(utf8_t *dest, /* O - Target string */ | |
640 | const char *src, /* I - Source string */ | |
641 | const int maxout, /* I - Max output */ | |
642 | cups_encoding_t encoding); /* I - Encoding */ | |
643 | ||
644 | <Find charset map by calling 'cupsCharmapGet()'> | |
645 | <Convert input legacy charset to internal UCS-4 via charset map> | |
646 | <Convert internal UCS-4 to UTF-8 by calling 'cupsUtf32ToUtf8()'> | |
647 | <Release charset map by calling 'cupsCharmapFree()'> | |
648 | <Return length of output UTF-8 string -- size in bytes> | |
649 | ||
650 | ||
651 | ||
652 | 3.1.2.3. cupsUtf8ToUtf16() | |
653 | ||
654 | extern int cupsUtf8ToUtf16(utf16_t *dest, /* O - Target string */ | |
655 | const utf8_t *src, /* I - Source string */ | |
656 | const int maxout); /* I - Max output */ | |
657 | ||
658 | <...to avoid duplicate code to handle surrogate pairs...> | |
659 | <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'> | |
660 | <Convert internal UCS-4 to UTF-16 by calling 'cupsUtf32ToUtf16()'> | |
661 | <Return count of output UTF-16 string -- NOT memory size in bytes> | |
662 | ||
663 | ||
664 | ||
665 | 3.1.2.4. cupsUtf16ToUtf8() | |
666 | ||
667 | extern int cupsUtf16ToUtf8(utf8_t *dest, /* O - Target string */ | |
668 | const utf16_t *src, /* I - Source string */ | |
669 | const int maxout); /* I - Max output */ | |
670 | ||
671 | <...to avoid duplicate code to handle surrogate pairs...> | |
672 | <Convert input UTF-16 to internal UCS-4 by calling 'cupsUtf16ToUtf32()'> | |
673 | <Convert internal UCS-4 to UTF-8 by calling 'cupsUtf32ToUtf8()'> | |
674 | <Return length of output UTF-8 string -- size in bytes> | |
675 | ||
676 | ||
677 | ||
678 | 3.1.2.5. cupsUtf8ToUtf32() | |
679 | ||
680 | extern int cupsUtf8ToUtf32(utf32_t *dest, /* O - Target string */ | |
681 | const utf8_t *src, /* I - Source string */ | |
682 | const int maxout); /* I - Max output */ | |
683 | ||
689 | <Convert input UTF-8 directly to output UCS-4...> | |
690 | <...checking for valid range, shortest-form, etc.> | |
691 | <Return count of output UTF-32 string -- NOT memory size in bytes> | |
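The pseudo-code above might be realized along the following lines. This is a sketch under stated assumptions, not the CUPS implementation: 'demo_utf8_to_utf32()' is a hypothetical name, 'maxout' is assumed to be the capacity of 'dest' in code units including the final null, and strict checking (shortest form, no surrogates, nothing above 0x10FFFF) is always on rather than controlled by the 'TcStrict...' globals.

```c
#include <stddef.h>

typedef unsigned char utf8_t;           /* UTF-8 code unit */
typedef unsigned long utf32_t;          /* UTF-32 code unit */

/* Returns count of UTF-32 units written (excluding null), or -1 */
static int
demo_utf8_to_utf32(utf32_t *dest, const utf8_t *src, int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  while (*src != 0)
  {
    utf32_t ch = *src++;                /* Lead byte, then scalar value */
    utf32_t min;                        /* Smallest value for this length */
    int     more;                       /* Continuation bytes expected */

    if (ch < 0x80)                { more = 0; min = 0; }
    else if ((ch & 0xe0) == 0xc0) { more = 1; min = 0x80;    ch &= 0x1f; }
    else if ((ch & 0xf0) == 0xe0) { more = 2; min = 0x800;   ch &= 0x0f; }
    else if ((ch & 0xf8) == 0xf0) { more = 3; min = 0x10000; ch &= 0x07; }
    else
      return (-1);                      /* Stray continuation or bad lead */

    while (more-- > 0)
    {
      if ((*src & 0xc0) != 0x80)
        return (-1);                    /* Truncated or malformed sequence */
      ch = (ch << 6) | (*src++ & 0x3f);
    }

    /* Reject non-shortest form, out-of-range values, and surrogates */
    if (ch < min || ch > 0x10ffff || (ch >= 0xd800 && ch <= 0xdfff))
      return (-1);

    if (count >= maxout - 1)
      return (-1);                      /* Output would overflow */

    dest[count++] = ch;
  }

  dest[count] = 0;
  return (count);
}
```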
692 | ||
693 | ||
694 | ||
695 | 3.1.2.6. cupsUtf32ToUtf8() | |
696 | ||
697 | extern int cupsUtf32ToUtf8(utf8_t *dest, /* O - Target string */ | |
698 | const utf32_t *src, /* I - Source string */ | |
699 | const int maxout); /* I - Max output */ | |
700 | ||
701 | <Convert input UCS-4 directly to output UTF-8...> | |
702 | <...checking for valid range, etc.> | |
703 | <Return length of output UTF-8 string -- size in bytes> | |
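A possible shape for this conversion is sketched below. The name 'demo_utf32_to_utf8()' is hypothetical, and 'maxout' is assumed to be the capacity of 'dest' in bytes including the final null; the real function may treat the limit differently.

```c
#include <stddef.h>

typedef unsigned char utf8_t;           /* UTF-8 code unit */
typedef unsigned long utf32_t;          /* UTF-32 code unit */

/* Returns length of UTF-8 output in bytes (excluding null), or -1 */
static int
demo_utf32_to_utf8(utf8_t *dest, const utf32_t *src, int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  for (; *src != 0; src++)
  {
    utf32_t ch = *src;
    int     len;                        /* Bytes needed for this value */

    /* Range check: no surrogates, nothing above 0x10FFFF */
    if (ch > 0x10ffff || (ch >= 0xd800 && ch <= 0xdfff))
      return (-1);

    len = ch < 0x80 ? 1 : ch < 0x800 ? 2 : ch < 0x10000 ? 3 : 4;
    if (count + len > maxout - 1)
      return (-1);                      /* Output would overflow */

    switch (len)
    {
      case 1:
        dest[count++] = (utf8_t)ch;
        break;
      case 2:
        dest[count++] = (utf8_t)(0xc0 | (ch >> 6));
        dest[count++] = (utf8_t)(0x80 | (ch & 0x3f));
        break;
      case 3:
        dest[count++] = (utf8_t)(0xe0 | (ch >> 12));
        dest[count++] = (utf8_t)(0x80 | ((ch >> 6) & 0x3f));
        dest[count++] = (utf8_t)(0x80 | (ch & 0x3f));
        break;
      default:
        dest[count++] = (utf8_t)(0xf0 | (ch >> 18));
        dest[count++] = (utf8_t)(0x80 | ((ch >> 12) & 0x3f));
        dest[count++] = (utf8_t)(0x80 | ((ch >> 6) & 0x3f));
        dest[count++] = (utf8_t)(0x80 | (ch & 0x3f));
        break;
    }
  }

  dest[count] = 0;
  return (count);
}
```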
704 | ||
705 | ||
706 | ||
707 | 3.1.2.7. cupsUtf16ToUtf32() | |
708 | ||
709 | extern int cupsUtf16ToUtf32(utf32_t *dest, /* O - Target string */ | |
710 | const utf16_t *src, /* I - Source string */ | |
711 | const int maxout); /* I - Max output */ | |
712 | ||
713 | <Convert input UTF-16 directly to output UCS-4...> | |
714 | <...handling surrogate pairs decoding from UTF-16> | |
715 | <Return count of output UTF-32 string -- NOT memory size in bytes> | |
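Surrogate pair decoding could look like the sketch below. It assumes native byte order with no BOM handling, assumes 'maxout' is the capacity of 'dest' in code units including the final null, and always rejects unpaired surrogates; 'demo_utf16_to_utf32()' is a hypothetical name.

```c
#include <stddef.h>

typedef unsigned short utf16_t;         /* UTF-16 code unit */
typedef unsigned long  utf32_t;         /* UTF-32 code unit */

/* Returns count of UTF-32 units written (excluding null), or -1 */
static int
demo_utf16_to_utf32(utf32_t *dest, const utf16_t *src, int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  while (*src != 0)
  {
    utf32_t ch = *src++;

    if (ch >= 0xdc00 && ch <= 0xdfff)
      return (-1);                      /* Low surrogate without a high */

    if (ch >= 0xd800 && ch <= 0xdbff)
    {
      utf32_t low = *src;

      if (low < 0xdc00 || low > 0xdfff)
        return (-1);                    /* High not followed by a low */

      src++;
      /* Combine 10 bits from each half, then add the plane offset */
      ch = 0x10000 + ((ch - 0xd800) << 10) + (low - 0xdc00);
    }

    if (count >= maxout - 1)
      return (-1);                      /* Output would overflow */

    dest[count++] = ch;
  }

  dest[count] = 0;
  return (count);
}
```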
716 | ||
717 | ||
718 | ||
719 | 3.1.2.8. cupsUtf32ToUtf16() | |
720 | ||
721 | extern int cupsUtf32ToUtf16(utf16_t *dest, /* O - Target string */ | |
722 | const utf32_t *src, /* I - Source string */ | |
723 | const int maxout); /* I - Max output */ | |
724 | ||
725 | <Convert input UCS-4 directly to output UTF-16...> | |
726 | <...handling surrogate pairs encoding to UTF-16> | |
727 | <Return count of output UTF-16 string -- NOT memory size in bytes> | |
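The encoding direction can be sketched in the same way. As before, 'demo_utf32_to_utf16()' is a hypothetical name, 'maxout' is assumed to be the capacity of 'dest' in code units including the final null, and no BOM is written.

```c
#include <stddef.h>

typedef unsigned short utf16_t;         /* UTF-16 code unit */
typedef unsigned long  utf32_t;         /* UTF-32 code unit */

/* Returns count of UTF-16 units written (excluding null), or -1 */
static int
demo_utf32_to_utf16(utf16_t *dest, const utf32_t *src, int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  for (; *src != 0; src++)
  {
    utf32_t ch = *src;

    /* Range check: no surrogates, nothing above 0x10FFFF */
    if (ch > 0x10ffff || (ch >= 0xd800 && ch <= 0xdfff))
      return (-1);

    if (ch < 0x10000)
    {
      if (count + 1 > maxout - 1)
        return (-1);                    /* Output would overflow */
      dest[count++] = (utf16_t)ch;
    }
    else
    {
      if (count + 2 > maxout - 1)
        return (-1);                    /* No room for a full pair */
      ch -= 0x10000;                    /* Leaves a 20-bit value */
      dest[count++] = (utf16_t)(0xd800 + (ch >> 10));
      dest[count++] = (utf16_t)(0xdc00 + (ch & 0x3ff));
    }
  }

  dest[count] = 0;
  return (count);
}
```

Note that the returned count is in 16-bit units, so a character outside the Basic Multilingual Plane contributes two to the result.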
728 | ||
729 | ||
730 | ||
731 | 3.1.2.9. Transcoding Utility Functions | |
732 | ||
733 | The transcoding utility functions are used to load (from a file into | |
734 | memory), free (logically, without freeing memory), and flush (actually | |
735 | free memory) character maps for SBCS (single-byte character set) and | |
736 | (future) DBCS (double-byte character set) transcoding to and from UTF-8. | |
737 | ||
744 | ||
745 | ||
746 | ||
747 | 3.1.2.9.1. cupsCharmapGet() | |
748 | ||
749 | extern void *cupsCharmapGet(const cups_encoding_t encoding); | |
750 | /* I - Encoding */ | |
751 | ||
752 | <Find SBCS or DBCS charset map in cache> | |
753 | <...If found, increment 'used'> | |
754 | <...and return pointer to SBCS or DBCS charset map> | |
755 | <Get charset map file name by calling 'cupsEncodingName()'> | |
756 | <Open charset map file> | |
757 | <...If not found, return NULL> | |
758 | <Allocate memory for SBCS or DBCS charset map in cache> | |
759 | <...If no memory, return NULL> | |
760 | <Add to SBCS or DBCS cache by assigning 'next' field> | |
761 | <Assign 'encoding' field> | |
762 | <Increment 'used' field> | |
763 | <Read charset map file into memory in loop...> | |
764 | <If SBCS, then 'char2uni[]' is an array of 'ucs2_t' values> | |
765 | <...and 'uni2char[]' is an array of pointers to 'sbcs_t' arrays> | |
766 | <If DBCS, then 'char2uni[]' is an array of pointers to 'ucs2_t' arrays> | |
767 | <...and 'uni2char[]' is an array of pointers to 'dbcs_t' arrays> | |
768 | <Close charset map file> | |
769 | <Return pointer to SBCS or DBCS charset map> | |
770 | ||
771 | ||
772 | ||
773 | 3.1.2.9.2. cupsCharmapFree() | |
774 | ||
775 | extern void cupsCharmapFree(const cups_encoding_t encoding); | |
776 | /* I - Encoding */ | |
777 | ||
778 | <Find SBCS or DBCS charset map in cache> | |
779 | <...If found, decrement 'used'> | |
780 | <Return void> | |
781 | ||
782 | ||
783 | ||
784 | 3.1.2.9.3. cupsCharmapFlush() | |
785 | ||
786 | extern void cupsCharmapFlush(void); | |
787 | ||
788 | <Loop through SBCS charset map cache...> | |
789 | <...Free 'uni2char[]' memory> | |
790 | <...Free SBCS charset map memory> | |
791 | <Loop through DBCS charset map cache...> | |
792 | <...Free 'char2uni[]' memory> | |
793 | <...Free 'uni2char[]' memory> | |
794 | <...Free DBCS charset map memory> | |
795 | <Return void> | |
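The get / free / flush life cycle described in this section amounts to a use-counted linked list. The sketch below shows only that bookkeeping, under stated assumptions: the 'demo_' names are hypothetical, a plain 'int' stands in for 'cups_encoding_t', and the charset map file loading and the 'char2uni[]' / 'uni2char[]' tables are omitted.

```c
#include <stdlib.h>

typedef struct demo_map_str             /* Hypothetical cache entry */
{
  struct demo_map_str *next;            /* Next map in cache */
  int                 used;             /* Number of times entry used */
  int                 encoding;         /* Stand-in for cups_encoding_t */
} demo_map_t;

static demo_map_t *demo_cache = NULL;   /* Head of the cache list */

/* Find a map in the cache or create it; returns NULL if no memory */
static demo_map_t *
demo_map_get(int encoding)
{
  demo_map_t *map;

  for (map = demo_cache; map != NULL; map = map->next)
    if (map->encoding == encoding)
    {
      map->used++;
      return (map);
    }

  if ((map = calloc(1, sizeof(demo_map_t))) == NULL)
    return (NULL);

  /* A real implementation would read the charset map file here */
  map->encoding = encoding;
  map->used     = 1;
  map->next     = demo_cache;
  demo_cache    = map;
  return (map);
}

/* Logical free: drop one use, but keep the map in memory */
static void
demo_map_free(int encoding)
{
  demo_map_t *map;

  for (map = demo_cache; map != NULL; map = map->next)
    if (map->encoding == encoding && map->used > 0)
    {
      map->used--;
      return;
    }
}

/* Flush: actually release every cached map */
static void
demo_map_flush(void)
{
  while (demo_cache != NULL)
  {
    demo_map_t *next = demo_cache->next;

    free(demo_cache);
    demo_cache = next;
  }
}
```

Keeping a released map in the cache means a later request for the same encoding does not have to re-read the mapping file; memory is only returned when the cache is flushed.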
796 | ||
797 | ||
801 | ||
802 | ||
803 | ||
804 | ||
805 | 3.2. Normalization - New | |
806 | ||
807 | ||
808 | ||
809 | 3.2.1. normalize.h - Normalization header | |
810 | ||
811 | /* | |
812 | * "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $" | |
813 | * | |
814 | * Unicode normalization for the Common UNIX Printing System (CUPS). | |
815 | * | |
816 | * Copyright 1997-2002 by Easy Software Products. | |
817 | * | |
818 | * These coded instructions, statements, and computer programs are | |
819 | * the property of Easy Software Products and are protected by Federal | |
820 | * copyright law. Distribution and use rights are outlined in the | |
821 | * file "LICENSE.txt" which should have been included with this file. | |
822 | * If this file is missing or damaged please contact Easy Software | |
823 | * Products at: | |
824 | * | |
825 | * Attn: CUPS Licensing Information | |
826 | * Easy Software Products | |
827 | * 44141 Airport View Drive, Suite 204 | |
828 | * Hollywood, Maryland 20636-3111 USA | |
829 | * | |
830 | * Voice: (301) 373-9603 | |
831 | * EMail: cups-info@cups.org | |
832 | * WWW: http://www.cups.org | |
833 | */ | |
834 | ||
835 | #ifndef _CUPS_NORMALIZE_H_ | |
836 | # define _CUPS_NORMALIZE_H_ | |
837 | ||
838 | /* | |
839 | * Include necessary headers... | |
840 | */ | |
841 | ||
842 | # include "transcode.h" | |
843 | ||
844 | # ifdef __cplusplus | |
845 | extern "C" { | |
846 | # endif /* __cplusplus */ | |
847 | ||
848 | /* | |
849 | * Types... | |
850 | */ | |
851 | ||
852 | typedef enum /**** Normalization Types ****/ | |
853 | { | |
859 | CUPS_NORM_NFD, /* Canonical Decomposition */ | |
860 | CUPS_NORM_NFKD, /* Compatibility Decomposition */ | |
861 | CUPS_NORM_NFC, /* NFD, then Canonical Composition */ | |
862 | CUPS_NORM_NFKC /* NFKD, then Canonical Composition */ | |
863 | } cups_normalize_t; | |
864 | ||
865 | typedef enum /**** Case Folding Types ****/ | |
866 | { | |
867 | CUPS_FOLD_SIMPLE, /* Simple - no expansion in size */ | |
868 | CUPS_FOLD_FULL /* Full - possible expansion in size */ | |
869 | } cups_folding_t; | |
870 | ||
871 | typedef enum /**** Unicode Char Property Types ****/ | |
872 | { | |
873 | CUPS_PROP_GENERAL_CATEGORY, /* See 'cups_gencat_t' enum */ | |
874 | CUPS_PROP_BIDI_CATEGORY, /* See 'cups_bidicat_t' enum */ | |
875 | CUPS_PROP_COMBINING_CLASS, /* See 'cups_combclass_t' type */ | |
876 | CUPS_PROP_BREAK_CLASS /* See 'cups_breakclass_t' enum */ | |
877 | } cups_property_t; | |
878 | ||
879 | /* | |
880 | * Note - parse Unicode char general category from 'UnicodeData.txt' | |
881 | * into sparse local table in 'normalize.c'. | |
882 | * Use major classes for logic optimizations throughout (by mask). | |
883 | */ | |
884 | ||
885 | typedef enum /**** Unicode General Category ****/ | |
886 | { | |
887 | CUPS_GENCAT_L = 0x10, /* Letter major class */ | |
888 | CUPS_GENCAT_LU = 0x11, /* Lu Letter, Uppercase */ | |
889 | CUPS_GENCAT_LL = 0x12, /* Ll Letter, Lowercase */ | |
890 | CUPS_GENCAT_LT = 0x13, /* Lt Letter, Titlecase */ | |
891 | CUPS_GENCAT_LM = 0x14, /* Lm Letter, Modifier */ | |
892 | CUPS_GENCAT_LO = 0x15, /* Lo Letter, Other */ | |
893 | CUPS_GENCAT_M = 0x20, /* Mark major class */ | |
894 | CUPS_GENCAT_MN = 0x21, /* Mn Mark, Non-Spacing */ | |
895 | CUPS_GENCAT_MC = 0x22, /* Mc Mark, Spacing Combining */ | |
896 | CUPS_GENCAT_ME = 0x23, /* Me Mark, Enclosing */ | |
897 | CUPS_GENCAT_N = 0x30, /* Number major class */ | |
898 | CUPS_GENCAT_ND = 0x31, /* Nd Number, Decimal Digit */ | |
899 | CUPS_GENCAT_NL = 0x32, /* Nl Number, Letter */ | |
900 | CUPS_GENCAT_NO = 0x33, /* No Number, Other */ | |
901 | CUPS_GENCAT_P = 0x40, /* Punctuation major class */ | |
902 | CUPS_GENCAT_PC = 0x41, /* Pc Punctuation, Connector */ | |
903 | CUPS_GENCAT_PD = 0x42, /* Pd Punctuation, Dash */ | |
904 | CUPS_GENCAT_PS = 0x43, /* Ps Punctuation, Open (start) */ | |
905 | CUPS_GENCAT_PE = 0x44, /* Pe Punctuation, Close (end) */ | |
906 | CUPS_GENCAT_PI = 0x45, /* Pi Punctuation, Initial Quote */ | |
907 | CUPS_GENCAT_PF = 0x46, /* Pf Punctuation, Final Quote */ | |
908 | CUPS_GENCAT_PO = 0x47, /* Po Punctuation, Other */ | |
909 | CUPS_GENCAT_S = 0x50, /* Symbol major class */ | |
910 | CUPS_GENCAT_SM = 0x51, /* Sm Symbol, Math */ | |
916 | CUPS_GENCAT_SC = 0x52, /* Sc Symbol, Currency */ | |
917 | CUPS_GENCAT_SK = 0x53, /* Sk Symbol, Modifier */ | |
918 | CUPS_GENCAT_SO = 0x54, /* So Symbol, Other */ | |
919 | CUPS_GENCAT_Z = 0x60, /* Separator major class */ | |
920 | CUPS_GENCAT_ZS = 0x61, /* Zs Separator, Space */ | |
921 | CUPS_GENCAT_ZL = 0x62, /* Zl Separator, Line */ | |
922 | CUPS_GENCAT_ZP = 0x63, /* Zp Separator, Paragraph */ | |
923 | CUPS_GENCAT_C = 0x70, /* Other (miscellaneous) major class */ | |
924 | CUPS_GENCAT_CC = 0x71, /* Cc Other, Control */ | |
925 | CUPS_GENCAT_CF = 0x72, /* Cf Other, Format */ | |
926 | CUPS_GENCAT_CS = 0x73, /* Cs Other, Surrogate */ | |
927 | CUPS_GENCAT_CO = 0x74, /* Co Other, Private Use */ | |
928 | CUPS_GENCAT_CN = 0x75 /* Cn Other, Not Assigned */ | |
929 | } cups_gencat_t; | |
930 | ||
931 | /* | |
932 | * Note - parse Unicode char bidi category from 'UnicodeData.txt' | |
933 | * into sparse local table in 'normalize.c'. | |
934 | * Add bidirectional support to 'textcommon.c' - per Mike | |
935 | */ | |
936 | ||
937 | typedef enum /**** Unicode Bidi Category ****/ | |
938 | { | |
939 | CUPS_BIDI_L, /* Left-to-Right (Alpha, Syllabic, Ideographic) */ | |
940 | CUPS_BIDI_LRE, /* Left-to-Right Embedding (explicit) */ | |
941 | CUPS_BIDI_LRO, /* Left-to-Right Override (explicit) */ | |
942 | CUPS_BIDI_R, /* Right-to-Left (Hebrew alphabet and most punct) */ | |
943 | CUPS_BIDI_AL, /* Right-to-Left Arabic (Arabic, Thaana, Syriac) */ | |
944 | CUPS_BIDI_RLE, /* Right-to-Left Embedding (explicit) */ | |
945 | CUPS_BIDI_RLO, /* Right-to-Left Override (explicit) */ | |
946 | CUPS_BIDI_PDF, /* Pop Directional Format */ | |
947 | CUPS_BIDI_EN, /* Euro Number (Euro and East Arabic-Indic digits) */ | |
948 | CUPS_BIDI_ES, /* Euro Number Separator (Slash) */ | |
949 | CUPS_BIDI_ET, /* Euro Number Terminator (Plus, Minus, Degree, etc) */ | |
950 | CUPS_BIDI_AN, /* Arabic Number (Arabic-Indic digits, separators) */ | |
951 | CUPS_BIDI_CS, /* Common Number Separator (Colon, Comma, Dot, etc) */ | |
952 | CUPS_BIDI_NSM, /* Non-Spacing Mark (category Mn / Me in UCD) */ | |
953 | CUPS_BIDI_BN, /* Boundary Neutral (Formatting / Control chars) */ | |
954 | CUPS_BIDI_B, /* Paragraph Separator */ | |
955 | CUPS_BIDI_S, /* Segment Separator (Tab) */ | |
956 | CUPS_BIDI_WS, /* Whitespace Space (Space, Line Separator, etc) */ | |
957 | CUPS_BIDI_ON /* Other Neutrals */ | |
958 | } cups_bidicat_t; | |
959 | ||
960 | /* | |
961 | * Note - parse Unicode line break class from 'DerivedLineBreak.txt' | |
962 | * into sparse local table (list of class ranges) in 'normalize.c'. | |
963 | * Note - add state table from UAX-14, section 7.3 - Ira | |
964 | * Remember to do BK and SP in outer loop (not in state table). | |
965 | * Consider optimization for CM (combining mark). | |
966 | * See 'LineBreak.txt' (12,875) and 'DerivedLineBreak.txt' (1,350). | |
967 | */ | |
968 | ||
974 | typedef enum /**** Unicode Line Break Class ****/ | |
975 | { | |
976 | /* | |
977 | * (A) - Allow Break AFTER | |
978 | * (XA) - Prevent Break AFTER | |
979 | * (B) - Allow Break BEFORE | |
980 | * (XB) - Prevent Break BEFORE | |
981 | * (P) - Allow Break For Pair | |
982 | * (XP) - Prevent Break For Pair | |
983 | */ | |
984 | CUPS_BREAK_AI, /* Ambiguous (Alphabetic or Ideograph) */ | |
985 | CUPS_BREAK_AL, /* Ordinary Alphabetic / Symbol Chars (XP) */ | |
986 | CUPS_BREAK_BA, /* Break Opportunity After Chars (A) */ | |
987 | CUPS_BREAK_BB, /* Break Opportunities Before Chars (B) */ | |
988 | CUPS_BREAK_B2, /* Break Opportunity Before / After (B/A/XP) */ | |
989 | CUPS_BREAK_BK, /* Mandatory Break (A) (normative) */ | |
990 | CUPS_BREAK_CB, /* Contingent Break (B/A) (normative) */ | |
991 | CUPS_BREAK_CL, /* Closing Punctuation (XB) */ | |
992 | CUPS_BREAK_CM, /* Attached Chars / Combining (XB) (normative) */ | |
993 | CUPS_BREAK_CR, /* Carriage Return (A) (normative) */ | |
994 | CUPS_BREAK_EX, /* Exclamation / Interrogation (XB) */ | |
995 | CUPS_BREAK_GL, /* Non-breaking ("Glue") (XB/XA) (normative) */ | |
996 | CUPS_BREAK_HY, /* Hyphen (XA) */ | |
997 | CUPS_BREAK_ID, /* Ideographic (B/A) */ | |
998 | CUPS_BREAK_IN, /* Inseparable chars (XP) */ | |
999 | CUPS_BREAK_IS, /* Numeric Separator (Infix) (XB) */ | |
1000 | CUPS_BREAK_LF, /* Line Feed (A) (normative) */ | |
1001 | CUPS_BREAK_NS, /* Non-starters (XB) */ | |
1002 | CUPS_BREAK_NU, /* Numeric (XP) */ | |
1003 | CUPS_BREAK_OP, /* Opening Punctuation (XA) */ | |
1004 | CUPS_BREAK_PO, /* Postfix (Numeric) (XB) */ | |
1005 | CUPS_BREAK_PR, /* Prefix (Numeric) (XA) */ | |
1006 | CUPS_BREAK_QU, /* Ambiguous Quotation (XB/XA) */ | |
1007 | CUPS_BREAK_SA, /* Context Dependent (South East Asian) (P) */ | |
1008 | CUPS_BREAK_SG, /* Surrogates (XP) (normative) */ | |
1009 | CUPS_BREAK_SP, /* Space (A) (normative) */ | |
1010 | CUPS_BREAK_SY, /* Symbols Allowing Break After (A) */ | |
1011 | CUPS_BREAK_XX, /* Unknown (XP) */ | |
1012 | CUPS_BREAK_ZW /* Zero Width Space (A) (normative) */ | |
1013 | } cups_breakclass_t; | |
1014 | ||
1015 | typedef int cups_combclass_t; /**** Unicode Combining Class ****/ | |
1016 | /* 0=base / 1..254=combining char */ | |
1017 | ||
1018 | /* | |
1019 | * Structures... | |
1020 | */ | |
1021 | ||
1022 | typedef struct cups_normmap_str /**** Normalize Map Cache Struct ****/ | |
1023 | { | |
1024 | struct cups_normmap_str *next; /* Next normalize in cache */ | |
1030 | int used; /* Number of times entry used */ | |
1031 | cups_normalize_t normalize; /* Normalization type */ | |
1032 | int normcount; /* Count of Source Chars */ | |
1033 | ucs2_t *uni2norm; /* Char -> Normalization */ | |
1034 | /* ...only supports UCS-2 */ | |
1035 | } cups_normmap_t; | |
1036 | ||
1037 | typedef struct cups_foldmap_str /**** Case Fold Map Cache Struct ****/ | |
1038 | { | |
1039 | struct cups_foldmap_str *next; /* Next case fold in cache */ | |
1040 | int used; /* Number of times entry used */ | |
1041 | cups_folding_t fold; /* Case folding type */ | |
1042 | int foldcount; /* Count of Source Chars */ | |
1043 | ucs2_t *uni2fold; /* Char -> Folded Char(s) */ | |
1044 | /* ...only supports UCS-2 */ | |
1045 | } cups_foldmap_t; | |
1046 | ||
1047 | typedef struct cups_prop_str /**** Char Property Struct ****/ | |
1048 | { | |
1049 | ucs2_t ch; /* Unicode Char as UCS-2 */ | |
1050 | unsigned char gencat; /* General Category */ | |
1051 | unsigned char bidicat; /* Bidirectional Category */ | |
1052 | } cups_prop_t; | |
1053 | ||
1054 | typedef struct /**** Char Property Map Struct ****/ | |
1055 | { | |
1056 | int used; /* Number of times entry used */ | |
1057 | int propcount; /* Count of Source Chars */ | |
1058 | cups_prop_t *uni2prop; /* Char -> Properties */ | |
1059 | } cups_propmap_t; | |
1060 | ||
1061 | typedef struct /**** Line Break Class Map Struct ****/ | |
1062 | { | |
1063 | int used; /* Number of times entry used */ | |
1064 | int breakcount; /* Count of Source Chars */ | |
1065 | ucs2_t *uni2break; /* Char -> Line Break Class */ | |
1066 | } cups_breakmap_t; | |
1067 | ||
1068 | typedef struct cups_comb_str /**** Char Combining Class Struct ****/ | |
1069 | { | |
1070 | ucs2_t ch; /* Unicode Char as UCS-2 */ | |
1071 | unsigned char combclass; /* Combining Class */ | |
1072 | unsigned char reserved; /* Reserved for alignment */ | |
1073 | } cups_comb_t; | |
1074 | ||
1075 | typedef struct /**** Combining Class Map Struct ****/ | |
1076 | { | |
1077 | int used; /* Number of times entry used */ | |
1078 | int combcount; /* Count of Source Chars */ | |
1079 | cups_comb_t *uni2comb; /* Char -> Combining Class */ | |
1080 | } cups_combmap_t; | |
1081 | ||
1088 | /* | |
1089 | * Globals... | |
1090 | */ | |
1091 | ||
1092 | extern int NzSupportUcs2; /* Support UCS-2 (16-bit) mapping */ | |
1093 | extern int NzSupportUcs4; /* Support UCS-4 (32-bit) mapping */ | |
1094 | ||
1095 | /* | |
1096 | * Prototypes... | |
1097 | */ | |
1098 | ||
1099 | /* | |
1100 | * Utility functions for normalization module | |
1101 | */ | |
1102 | extern int cupsNormalizeMapsGet(void); | |
1103 | extern int cupsNormalizeMapsFree(void); | |
1104 | extern void cupsNormalizeMapsFlush(void); | |
1105 | ||
1106 | /* | |
1107 | * Normalize UTF-8 string to Unicode UAX-15 Normalization Form | |
1108 | * Note - Compatibility Normalization Forms (NFKD/NFKC) are | |
1109 | * unsafe for subsequent transcoding to legacy charsets | |
1110 | */ | |
1111 | extern int cupsUtf8Normalize(utf8_t *dest, /* O - Target string */ | |
1112 | const utf8_t *src, /* I - Source string */ | |
1113 | const int maxout, /* I - Max output */ | |
1114 | const cups_normalize_t normalize); | |
1115 | /* I - Normalization */ | |
1116 | ||
1117 | /* | |
1118 | * Normalize UTF-32 string to Unicode UAX-15 Normalization Form | |
1119 | * Note - Compatibility Normalization Forms (NFKD/NFKC) are | |
1120 | * unsafe for subsequent transcoding to legacy charsets | |
1121 | */ | |
1122 | extern int cupsUtf32Normalize(utf32_t *dest, | |
1123 | /* O - Target string */ | |
1124 | const utf32_t *src, /* I - Source string */ | |
1125 | const int maxout, /* I - Max output */ | |
1126 | const cups_normalize_t normalize); | |
1127 | /* I - Normalization */ | |
1128 | ||
1129 | /* | |
1130 | * Case Fold UTF-8 string per Unicode UAX-21 Section 2.3 | |
1131 | * Note - Case folding output is | |
1132 | * unsafe for subsequent transcoding to legacy charsets | |
1133 | */ | |
1134 | extern int cupsUtf8CaseFold(utf8_t *dest, /* O - Target string */ | |
1135 | const utf8_t *src, /* I - Source string */ | |
1136 | const int maxout, /* I - Max output */ | |
1137 | const cups_folding_t fold); /* I - Fold Mode */ | |
1138 | ||
1145 | /* | |
1146 | * Case Fold UTF-32 string per Unicode UAX-21 Section 2.3 | |
1147 | * Note - Case folding output is | |
1148 | * unsafe for subsequent transcoding to legacy charsets | |
1149 | */ | |
1150 | extern int cupsUtf32CaseFold(utf32_t *dest,/* O - Target string */ | |
1151 | const utf32_t *src, /* I - Source string */ | |
1152 | const int maxout, /* I - Max output */ | |
1153 | const cups_folding_t fold); /* I - Fold Mode */ | |
1154 | ||
1155 | /* | |
1156 | * Compare UTF-8 strings after case folding | |
1157 | */ | |
1158 | extern int cupsUtf8CompareCaseless(const utf8_t *s1, | |
1159 | /* I - String1 */ | |
1160 | const utf8_t *s2); /* I - String2 */ | |
1161 | ||
1162 | /* | |
1163 | * Compare UTF-32 strings after case folding | |
1164 | */ | |
1165 | extern int cupsUtf32CompareCaseless(const utf32_t *s1, | |
1166 | /* I - String1 */ | |
1167 | const utf32_t *s2); /* I - String2 */ | |
1168 | ||
1169 | /* | |
1170 | * Compare UTF-8 strings after case folding and NFKC normalization | |
1171 | */ | |
1172 | extern int cupsUtf8CompareIdentifier(const utf8_t *s1, | |
1173 | /* I - String1 */ | |
1174 | const utf8_t *s2); /* I - String2 */ | |
1175 | ||
1176 | /* | |
1177 | * Compare UTF-32 strings after case folding and NFKC normalization | |
1178 | */ | |
1179 | extern int cupsUtf32CompareIdentifier(const utf32_t *s1, | |
1180 | /* I - String1 */ | |
1181 | const utf32_t *s2); /* I - String2 */ | |
1182 | ||
1183 | /* | |
1184 | * Get UTF-32 character property | |
1185 | */ | |
1186 | extern int cupsUtf32CharacterProperty(const utf32_t ch, | |
1187 | /* I - Source char */ | |
1188 | const cups_property_t property); | |
1189 | /* I - Char Property */ | |
1190 | ||
1191 | # ifdef __cplusplus | |
1192 | } | |
1193 | # endif /* __cplusplus */ | |
1194 | ||
1195 | #endif /* !_CUPS_NORMALIZE_H_ */ | |
1196 | ||
1202 | /* | |
1203 | * End of "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $" | |
1204 | */ | |
1205 | ||
1206 | ||
1207 | ||
1208 | 3.2.1.1. cups_normmap_t - Normalize Map Structure | |
1209 | ||
1210 | typedef struct cups_normmap_str /**** Normalize Map Cache Struct ****/ | |
1211 | { | |
1212 | struct cups_normmap_str *next; /* Next normalize in cache */ | |
1213 | int used; /* Number of times entry used */ | |
1214 | cups_normalize_t normalize; /* Normalization type */ | |
1215 | int normcount; /* Count of Source Chars */ | |
1216 | ucs2_t *uni2norm; /* Char -> Normalization */ | |
1217 | /* ...only supports UCS-2 */ | |
1218 | } cups_normmap_t; | |
1219 | ||
1220 | 'uni2norm' is a pointer to an array of _triplets_ of UCS-2 values. | |
1221 | 'normcount' is a count of _triplets_ in the 'uni2norm[]' array. | |
1222 | ||
1223 | For decompositions (NFD and NFKD), the triplets are: composed base | |
1224 | character, decomposed base character, and decomposed accent character. | |
1225 | These are used by 'cupsUtf8Normalize()' and 'cupsUtf32Normalize()' in | |
1226 | performing canonical (NFD) or compatibility (NFKD) decomposition. | |
1227 | ||
1228 | For compositions (NFC and NFKC), the triplets are: decomposed base | |
1229 | character, decomposed accent character, and composed base character. | |
1230 | These are used by 'cupsUtf8Normalize()' and 'cupsUtf32Normalize()' in | |
1231 | performing canonical composition (for NFC or NFKC). | |
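
The triplet layout lends itself to a binary search keyed on the first
element of each triplet. A minimal sketch, assuming the array is sorted
by that first element: 'demo_ucs2_t', 'demo_nfd_table', and
'demo_decompose()' are hypothetical names for illustration, and the
table holds only three sample canonical decompositions.

```c
#include <stdlib.h>

typedef unsigned short demo_ucs2_t;   /* Stand-in for 'ucs2_t' */

/* Sample NFD triplets: composed, decomposed base, decomposed accent */
const demo_ucs2_t demo_nfd_table[] =
{
  0x00C0, 0x0041, 0x0300,             /* A grave -> A + combining grave */
  0x00C9, 0x0045, 0x0301,             /* E acute -> E + combining acute */
  0x00E9, 0x0065, 0x0301              /* e acute -> e + combining acute */
};

/* Compare the search key with the first element of one triplet */
int
demo_compare_decompose(const void *key, const void *triplet)
{
  demo_ucs2_t k = *(const demo_ucs2_t *)key;
  demo_ucs2_t t = *(const demo_ucs2_t *)triplet;

  return ((k > t) - (k < t));
}

/* Decompose one char; returns the number of chars stored in out[] */
int
demo_decompose(const demo_ucs2_t *uni2norm, int normcount,
               demo_ucs2_t ch, demo_ucs2_t out[2])
{
  const demo_ucs2_t *hit;

  hit = (const demo_ucs2_t *)bsearch(&ch, uni2norm, (size_t)normcount,
                                     3 * sizeof(demo_ucs2_t),
                                     demo_compare_decompose);
  if (hit == NULL)
  {
    out[0] = ch;                      /* No mapping: copy unchanged */
    return (1);
  }

  out[0] = hit[1];                    /* Decomposed base character */
  out[1] = hit[2];                    /* Decomposed accent character */
  return (2);
}
```

A composition table would be searched the same way, except that the
key would be the pair formed by the first two elements of each triplet.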
1232 | ||
1233 | ||
1234 | ||
1235 | 3.2.1.2. cups_foldmap_t - Case Fold Map Structure | |
1236 | ||
1237 | typedef struct cups_foldmap_str /**** Case Fold Map Cache Struct ****/ | |
1238 | { | |
1239 | int used; /* Number of times entry used */ | |
1240 | cups_folding_t fold; /* Case folding type */ | |
1241 | int foldcount; /* Count of Source Chars */ | |
1242 | ucs2_t *uni2fold; /* Char -> Folded Char(s) */ | |
1243 | /* ...only supports UCS-2 */ | |
1244 | } cups_foldmap_t; | |
1245 | ||
1246 | 'uni2fold' is a pointer to an array of _quadruplets_ of UCS-2 values. | |
1247 | 'foldcount' is a count of _quadruplets_ in the 'uni2fold[]' array. | |
1248 | ||
1249 | For simple case folding (without expansion of the size of the output | |
1250 | string), the quadruplets are: input base character, output case folded | |
1251 | character, zero (unused), and zero (unused). | |
1252 | ||
1259 | For full case folding (with possible expansion of the size of the output | |
1260 | string), the quadruplets are: input base character, output case folded | |
1261 | character, second output character or zero, third output character or | |
1262 | zero. | |
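
The quadruplet layout can be illustrated with a lookup that expands one
input character into up to three folded characters. A minimal sketch,
assuming the array is sorted by input character and that zero marks an
unused slot: 'demo_fold_table' and 'demo_fold_char()' are hypothetical
names, and the table holds only three sample full case foldings.

```c
#include <stdlib.h>

typedef unsigned short demo_ucs2_t;   /* Stand-in for 'ucs2_t' */

/* Sample full case folding quadruplets, sorted by input char */
const demo_ucs2_t demo_fold_table[] =
{
  0x0041, 0x0061, 0x0000, 0x0000,     /* A -> a */
  0x00DF, 0x0073, 0x0073, 0x0000,     /* sharp s -> "ss" */
  0xFB03, 0x0066, 0x0066, 0x0069      /* ffi ligature -> "ffi" */
};

/* Compare the search key with the first element of one quadruplet */
int
demo_compare_foldchar(const void *key, const void *quad)
{
  demo_ucs2_t k = *(const demo_ucs2_t *)key;
  demo_ucs2_t q = *(const demo_ucs2_t *)quad;

  return ((k > q) - (k < q));
}

/* Fold one char; returns the number of chars stored in out[] (1..3) */
int
demo_fold_char(const demo_ucs2_t *uni2fold, int foldcount,
               demo_ucs2_t ch, demo_ucs2_t out[3])
{
  const demo_ucs2_t *hit;
  int               n;

  hit = (const demo_ucs2_t *)bsearch(&ch, uni2fold, (size_t)foldcount,
                                     4 * sizeof(demo_ucs2_t),
                                     demo_compare_foldchar);
  if (hit == NULL)
  {
    out[0] = ch;                      /* No mapping: copy unchanged */
    return (1);
  }

  for (n = 0; n < 3 && hit[n + 1] != 0; n ++)
    out[n] = hit[n + 1];              /* Copy non-zero output slots */

  return (n);
}
```

With a simple case folding table the loop always stops after one
character, so the same lookup would serve both folding types.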
1263 | ||
1264 | ||
1265 | ||
1266 | 3.2.1.3. cups_propmap_t - Char Property Map Structure | |
1267 | ||
1268 | typedef struct /**** Char Property Map Struct ****/ | |
1269 | { | |
1270 | int used; /* Number of times entry used */ | |
1271 | int propcount; /* Count of Source Chars */ | |
1272 | cups_prop_t *uni2prop; /* Char -> Properties */ | |
1273 | } cups_propmap_t; | |
1274 | ||
1275 | 'uni2prop' is a pointer to an array of 'cups_prop_t' (see below). | |
1276 | 'propcount' is a count of elements in the 'uni2prop[]' array. | |
1277 | ||
1278 | ||
1279 | ||
1280 | 3.2.1.4. cups_prop_t - Char Property Structure | |
1281 | ||
1282 | typedef struct cups_prop_str /**** Char Property Struct ****/ | |
1283 | { | |
1284 | ucs2_t ch; /* Unicode Char as UCS-2 */ | |
1285 | unsigned char gencat; /* General Category */ | |
1286 | unsigned char bidicat; /* Bidirectional Category */ | |
1287 | } cups_prop_t; | |
1288 | ||
1289 | ||
1290 | ||
1291 | 3.2.1.5. cups_breakmap_t - Line Break Map Structure | |
1292 | ||
1293 | typedef struct /**** Line Break Class Map Struct ****/ | |
1294 | { | |
1295 | int used; /* Number of times entry used */ | |
1296 | int breakcount; /* Count of Source Chars */ | |
1297 | ucs2_t *uni2break; /* Char -> Line Break Class */ | |
1298 | } cups_breakmap_t; | |
1299 | ||
1300 | 'uni2break' is a pointer to an array of _triplets_ of UCS-2 values. | |
1301 | 'breakcount' is a count of _triplets_ in the 'uni2break[]' array. | |
1302 | ||
1303 | The triplets in 'uni2break' are: first UCS-2 value in a range, last | |
1304 | UCS-2 value in a range, and line break class stored as UCS-2. | |
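
Because each triplet describes a range, the lookup compares the key
against both ends of the range rather than a single value. A minimal
sketch, assuming the ranges are sorted and do not overlap:
'demo_breakclass_t', 'demo_break_table', and 'demo_break_class()' are
hypothetical names, the enum values are arbitrary, and returning the
"unknown" class for unlisted characters is an assumption of this sketch.

```c
#include <stdlib.h>

typedef unsigned short demo_ucs2_t;   /* Stand-in for 'ucs2_t' */

typedef enum                          /* Tiny subset of break classes */
{
  DEMO_BREAK_AL,                      /* Ordinary Alphabetic */
  DEMO_BREAK_NU,                      /* Numeric */
  DEMO_BREAK_SP,                      /* Space */
  DEMO_BREAK_XX                       /* Unknown */
} demo_breakclass_t;

/* Sample range triplets: first char, last char, line break class */
const demo_ucs2_t demo_break_table[] =
{
  0x0020, 0x0020, DEMO_BREAK_SP,      /* Space */
  0x0030, 0x0039, DEMO_BREAK_NU,      /* Digits 0..9 */
  0x0041, 0x005A, DEMO_BREAK_AL       /* Letters A..Z */
};

/* Compare the search key with the range held in one triplet */
int
demo_compare_breakrange(const void *key, const void *triplet)
{
  demo_ucs2_t       k = *(const demo_ucs2_t *)key;
  const demo_ucs2_t *r = (const demo_ucs2_t *)triplet;

  if (k < r[0])
    return (-1);                      /* Key is below this range */
  if (k > r[1])
    return (1);                       /* Key is above this range */
  return (0);                         /* Key falls inside the range */
}

/* Look up the line break class of one char */
demo_breakclass_t
demo_break_class(const demo_ucs2_t *uni2break, int breakcount,
                 demo_ucs2_t ch)
{
  const demo_ucs2_t *hit;

  hit = (const demo_ucs2_t *)bsearch(&ch, uni2break, (size_t)breakcount,
                                     3 * sizeof(demo_ucs2_t),
                                     demo_compare_breakrange);

  return (hit ? (demo_breakclass_t)hit[2] : DEMO_BREAK_XX);
}
```

Storing ranges keeps the table small, since long runs of characters
that share a class collapse into a single triplet.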
1305 | ||
1306 | ||
1307 | ||
1317 | 3.2.1.6. cups_combmap_t - Combining Class Map Structure | |
1318 | ||
1319 | typedef struct /**** Combining Class Map Struct ****/ | |
1320 | { | |
1321 | int used; /* Number of times entry used */ | |
1322 | int combcount; /* Count of Source Chars */ | |
1323 | cups_comb_t *uni2comb; /* Char -> Combining Class */ | |
1324 | } cups_combmap_t; | |
1325 | ||
1326 | 'uni2comb' is a pointer to an array of 'cups_comb_t' (see below). | |
1327 | 'combcount' is a count of elements in the 'uni2comb[]' array. | |
1328 | ||
1329 | ||
1330 | ||
1331 | 3.2.1.7. cups_comb_t - Combining Class Structure | |
1332 | ||
1333 | typedef struct cups_comb_str /**** Char Combining Class Struct ****/ | |
1334 | { | |
1335 | ucs2_t ch; /* Unicode Char as UCS-2 */ | |
1336 | unsigned char combclass; /* Combining Class */ | |
1337 | unsigned char reserved; /* Reserved for alignment */ | |
1338 | } cups_comb_t; | |
1339 | ||
1340 | ||
1341 | ||
1342 | 3.2.2. normalize.c - Normalization module | |
1343 | ||
1344 | The normalization function 'cupsUtf8Normalize()' and the case folding | |
1345 | function 'cupsUtf8CaseFold()' are modelled on the C standard library | |
1346 | function 'strncpy()', except that they return the count of the output, | |
1347 | like 'strlen()', rather than the (redundant) pointer to the output. | |
1348 | ||
1349 | If the normalization or case folding functions detect invalid input | |
1350 | parameters or an encoding error in their input, they return '-1' | |
1351 | rather than the count of output. | |
1352 | ||
1353 | The normalization and case folding functions take an input parameter, | |
1354 | 'maxout', that limits the number of output units written to 'dest'. | |
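
The calling convention described above can be shown with a stand-in
that performs a plain copy in place of real normalization. A minimal
sketch: 'demo_utf32_copy()' and 'demo_utf32_t' are hypothetical, and
two behaviours are assumptions of this sketch rather than statements of
the design, namely that 'maxout' includes room for the terminating zero
and that overlong input is truncated rather than rejected.

```c
#include <stddef.h>

typedef unsigned long demo_utf32_t;   /* Stand-in for 'utf32_t' */

/*
 * Copy a zero-terminated UTF-32 string with the same contract as the
 * normalization functions: bounded output, count returned, -1 on error.
 */
int                                   /* O - Count of output chars or -1 */
demo_utf32_copy(demo_utf32_t *dest, const demo_utf32_t *src,
                const int maxout)
{
  int i;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);                      /* Invalid input parameters */

  for (i = 0; src[i] != 0 && i < maxout - 1; i ++)
  {
    if (src[i] > 0x10FFFF || (src[i] >= 0xD800 && src[i] <= 0xDFFF))
      return (-1);                    /* Encoding error in input */

    dest[i] = src[i];
  }

  dest[i] = 0;                        /* Always terminate (assumption) */
  return (i);                         /* Count excludes the terminator */
}
```

A caller checks for a negative result before using 'dest', and then
uses the returned count where it would otherwise have called a length
function on the output.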
1355 | ||
1356 | ||
1357 | ||
1358 | 3.2.2.1. cupsUtf8Normalize() | |
1359 | ||
1360 | /* | |
1361 | * Normalize UTF-8 string to Unicode UAX-15 Normalization Form | |
1362 | * Note - Compatibility Normalization Forms (NFKD/NFKC) are | |
1363 | * unsafe for subsequent transcoding to legacy charsets | |
1364 | */ | |
1365 | extern int cupsUtf8Normalize(utf8_t *dest, /* O - Target string */ | |
1366 | const utf8_t *src, /* I - Source string */ | |
1372 | const int maxout, /* I - Max output */ | |
1373 | const cups_normalize_t normalize); | |
1374 | /* I - Normalization */ | |
1375 | ||
1376 | <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'> | |
1377 | <Normalize by calling 'cupsUtf32Normalize()'> | |
1378 | <Convert normalized UCS-4 to UTF-8 by calling 'cupsUtf32ToUtf8()'> | |
1379 | <Return length of output UTF-8 string -- size in bytes> | |
1380 | ||
1381 | ||
1382 | ||
1383 | 3.2.2.2. cupsUtf32Normalize() | |
1384 | ||
1385 | extern int cupsUtf32Normalize(utf32_t *dest, | |
1386 | /* O - Target string */ | |
1387 | const utf32_t *src, /* I - Source string */ | |
1388 | const int maxout, /* I - Max output */ | |
1389 | const cups_normalize_t normalize); | |
1390 | /* I - Normalization */ | |
1391 | ||
1392 | <Find normalize maps by calling 'cupsNormalizeMapsGet()'> | |
1393 | <...if not found, return '-1'> | |
1394 | <Repeatedly traverse internal UCS-4, decomposing (NFD or NFKD)...> | |
1395 | <...with 'bsearch()' of 'uni2norm[]' using local 'compare_decompose()'> | |
1396 | <...until one pass yields no further decomposition> | |
1397 | <Repeatedly traverse internal UCS-4, doing canonical reordering> | |
1398 | <...with 'bsearch()' of 'uni2comb[]' using local 'compare_combchar()'> | |
1399 | <...until one pass yields no further canonical reordering> | |
1400 | <If 'normalize' requests composition (NFC or NFKC)...> | |
1401 | <...repeatedly traverse internal UCS-4, composing (NFC or NFKC)...> | |
1402 | <...with 'bsearch()' of 'uni2norm[]' using local 'compare_compose()'> | |
1403 | <...until one pass yields no further composition> | |
1404 | <Release normalize maps by calling 'cupsNormalizeMapsFree()'> | |
1405 | <Return count of output UTF-32 string -- NOT memory size in bytes> | |
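
The canonical reordering pass in the middle of the steps above can be
sketched as repeated adjacent swaps. This is a minimal sketch, not the
module's implementation: 'demo_comb_t', 'demo_comb_table', and the
functions are hypothetical, and the table lists only three combining
marks. Characters missing from the table are treated as class 0 (base).

```c
#include <stdlib.h>

typedef unsigned long demo_utf32_t;   /* Stand-in for 'utf32_t' */

typedef struct                        /* Mirrors 'cups_comb_t' layout */
{
  unsigned short ch;                  /* Unicode Char as UCS-2 */
  unsigned char  combclass;           /* Combining Class */
  unsigned char  reserved;            /* Reserved for alignment */
} demo_comb_t;

const demo_comb_t demo_comb_table[] = /* Sorted by 'ch' */
{
  { 0x0301, 230, 0 },                 /* Combining acute accent */
  { 0x0323, 220, 0 },                 /* Combining dot below */
  { 0x0327, 202, 0 }                  /* Combining cedilla */
};

int
demo_compare_combchar(const void *key, const void *elem)
{
  demo_utf32_t k = *(const demo_utf32_t *)key;
  demo_utf32_t e = ((const demo_comb_t *)elem)->ch;

  return ((k > e) - (k < e));
}

int                                   /* O - Combining class, 0 = base */
demo_comb_class(demo_utf32_t ch)
{
  const demo_comb_t *hit;

  hit = (const demo_comb_t *)bsearch(&ch, demo_comb_table,
            sizeof(demo_comb_table) / sizeof(demo_comb_table[0]),
            sizeof(demo_comb_t), demo_compare_combchar);

  return (hit ? hit->combclass : 0);
}

/* Swap adjacent marks until one pass makes no further change */
void
demo_canonical_reorder(demo_utf32_t *work, int count)
{
  int          i, swapped;
  demo_utf32_t temp;

  do
  {
    swapped = 0;
    for (i = 0; i + 1 < count; i ++)
    {
      int a = demo_comb_class(work[i]);
      int b = demo_comb_class(work[i + 1]);

      if (b != 0 && a > b)            /* Never move a mark past a base */
      {
        temp        = work[i];
        work[i]     = work[i + 1];
        work[i + 1] = temp;
        swapped     = 1;
      }
    }
  }
  while (swapped);
}
```

Only adjacent pairs with different non-zero classes are exchanged, so
marks of equal class keep their original relative order.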
1406 | ||
1407 | ||
1408 | ||
1409 | 3.2.2.3. cupsUtf8CaseFold() | |
1410 | ||
1411 | /* | |
1412 | * Case Fold UTF-8 string per Unicode UAX-21 Section 2.3 | |
1413 | * Note - Case folding output is | |
1414 | * unsafe for subsequent transcoding to legacy charsets | |
1415 | */ | |
1416 | extern int cupsUtf8CaseFold(utf8_t *dest, /* O - Target string */ | |
1417 | const utf8_t *src, /* I - Source string */ | |
1418 | const int maxout, /* I - Max output */ | |
1419 | const cups_folding_t fold); /* I - Fold Mode */ | |
1420 | ||
1421 | <Find normalize maps by calling 'cupsNormalizeMapsGet()'> | |
1422 | <...if not found, return '-1'> | |
1423 | <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'> | |
1429 | <Case fold internal UCS-4 by calling 'cupsUtf32CaseFold()'> | |
1430 | <Convert internal UCS-4 to output UTF-8 by calling 'cupsUtf32ToUtf8()'> | |
1431 | <Release normalize maps by calling 'cupsNormalizeMapsFree()'> | |
1432 | <Return length of output UTF-8 string -- size in bytes> | |
1433 | ||
1434 | ||
1435 | ||
1436 | 3.2.2.4. cupsUtf32CaseFold() | |
1437 | ||
1438 | /* | |
1439 | * Case Fold UTF-32 string per Unicode UAX-21 Section 2.3 | |
1440 | * Note - Case folding output is | |
1441 | * unsafe for subsequent transcoding to legacy charsets | |
1442 | */ | |
1443 | extern int cupsUtf32CaseFold(utf32_t *dest, /* O - Target string */ | |
1444 | const utf32_t *src, /* I - Source string */ | |
1445 | const int maxout, const cups_folding_t fold); /* I - Max output, Fold Mode */ | |
1446 | ||
1447 | <Find case fold maps by calling 'cupsNormalizeMapsGet()'> | |
1448 | <...if not found, return '-1'> | |
1449 | <Traverse internal UCS-4 once, performing case folding...> | |
1450 | <...with 'bsearch()' of 'uni2fold[]' using local 'compare_foldchar()'> | |
1451 | <Copy internal UCS-4 to output UTF-32 string> | |
1452 | <Release normalize maps by calling 'cupsNormalizeMapsFree()'> | |
1453 | <Return count of output UTF-32 string -- NOT memory size in bytes> | |
1454 | ||
1455 | ||
1456 | ||
1457 | 3.2.2.5. cupsUtf8CompareCaseless() | |
1458 | ||
1459 | /* | |
1460 | * Compare UTF-8 strings after case folding | |
1461 | */ | |
1462 | extern int cupsUtf8CompareCaseless(const utf8_t *s1, | |
1463 | /* I - String1 */ | |
1464 | const utf8_t *s2); /* I - String2 */ | |
1465 | ||
1466 | <Case fold both input UTF-8 strings by calling 'cupsUtf8CaseFold()'> | |
1467 | <Return compare of case folded first and second strings> | |
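
The two steps above (fold both strings, then compare the results) can
be sketched with an ASCII-only stand-in for the folding function. All
names here are hypothetical. The fixed buffer size, and reporting a
fold failure as "not equal", are assumptions of this sketch.

```c
#include <string.h>

#define DEMO_MAXFOLD 256              /* Assumed work buffer size */

/* ASCII-only stand-in for the case folding function */
int                                   /* O - Count of output or -1 */
demo_fold_ascii(char *dest, const char *src, const int maxout)
{
  int i;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  for (i = 0; src[i] != '\0' && i < maxout - 1; i ++)
  {
    if (src[i] >= 'A' && src[i] <= 'Z')
      dest[i] = (char)(src[i] - 'A' + 'a');
    else
      dest[i] = src[i];
  }

  dest[i] = '\0';
  return (i);
}

/* Compare two strings after folding both into local buffers */
int                                   /* O - <0, 0, or >0 like strcmp() */
demo_compare_caseless(const char *s1, const char *s2)
{
  char fold1[DEMO_MAXFOLD];
  char fold2[DEMO_MAXFOLD];

  if (demo_fold_ascii(fold1, s1, DEMO_MAXFOLD) < 0 ||
      demo_fold_ascii(fold2, s2, DEMO_MAXFOLD) < 0)
    return (1);                       /* Assumption: failure = not equal */

  return (strcmp(fold1, fold2));
}
```

Folding into separate buffers leaves both input strings unmodified,
which matches the 'const' qualifiers on the documented prototypes.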
1468 | ||
1469 | ||
1470 | ||
1471 | 3.2.2.6. cupsUtf32CompareCaseless() | |
1472 | ||
1473 | /* | |
1474 | * Compare UTF-32 strings after case folding | |
1475 | */ | |
1476 | extern int cupsUtf32CompareCaseless(const utf32_t *s1, | |
1477 | /* I - String1 */ | |
1478 | const utf32_t *s2); /* I - String2 */ | |
1479 | ||
1480 | <Case fold both input UTF-32 strings by calling 'cupsUtf32CaseFold()'> | |
1486 | <Return compare of case folded first and second strings> | |
1487 | ||
1488 | ||
1489 | ||
1490 | 3.2.2.7. cupsUtf8CompareIdentifier() | |
1491 | ||
1492 | /* | |
1493 | * Compare UTF-8 strings after case folding and NFKC normalization | |
1494 | */ | |
1495 | extern int cupsUtf8CompareIdentifier(const utf8_t *s1, | |
1496 | /* I - String1 */ | |
1497 | const utf8_t *s2); /* I - String2 */ | |
1498 | ||
1499 | <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'> | |
1500 | <Case fold both strings by calling 'cupsUtf32CaseFold()'> | |
1501 | <Normalize both strings to NFKC by calling 'cupsUtf32Normalize()'> | |
1502 | <Return compare of case folded/normalized first and second strings> | |
1503 | ||
1504 | ||
1505 | ||
1506 | 3.2.2.8. cupsUtf32CompareIdentifier() | |
1507 | ||
1508 | /* | |
1509 | * Compare UTF-32 strings after case folding and NFKC normalization | |
1510 | */ | |
1511 | extern int cupsUtf32CompareIdentifier(const utf32_t *s1, | |
1512 | /* I - String1 */ | |
1513 | const utf32_t *s2); /* I - String2 */ | |
1514 | ||
1515 | <Case fold both strings by calling 'cupsUtf32CaseFold()'> | |
1516 | <Normalize both strings to NFKC by calling 'cupsUtf32Normalize()'> | |
1517 | <Return compare of case folded/normalized first and second strings> | |
1518 | ||
1519 | ||
1520 | ||
1521 | 3.2.2.9. cupsUtf32CharacterProperty() | |
1522 | ||
1523 | /* | |
1524 | * Get UTF-32 character property | |
1525 | */ | |
1526 | extern int cupsUtf32CharacterProperty(const utf32_t ch, | |
1527 | /* I - Source char */ | |
1528 | const cups_property_t property); | |
1529 | /* I - Char Property */ | |
1530 | ||
1531 | <Lookup UTF-32 character property in appropriate map...> | |
1532 | <...internal functions for each different map lookup> | |
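
One such internal lookup, for the general category, might look like the
following. This is a minimal sketch under stated assumptions: the
'DEMO_GENCAT_*' values copy a few entries of 'cups_gencat_t', while the
table contents, the function names, and the '-1' result for characters
missing from the sparse table are hypothetical. The major class test
masks off the low four bits, following the "by mask" note in the header.

```c
#include <stdlib.h>

typedef unsigned long  demo_utf32_t;  /* Stand-in for 'utf32_t' */
typedef unsigned short demo_ucs2_t;   /* Stand-in for 'ucs2_t' */

enum                                  /* Values as in 'cups_gencat_t' */
{
  DEMO_GENCAT_L  = 0x10,              /* Letter major class */
  DEMO_GENCAT_LU = 0x11,              /* Lu Letter, Uppercase */
  DEMO_GENCAT_LL = 0x12,              /* Ll Letter, Lowercase */
  DEMO_GENCAT_M  = 0x20,              /* Mark major class */
  DEMO_GENCAT_MN = 0x21,              /* Mn Mark, Non-Spacing */
  DEMO_GENCAT_ND = 0x31               /* Nd Number, Decimal Digit */
};

typedef struct                        /* Mirrors 'cups_prop_t' layout */
{
  demo_ucs2_t   ch;                   /* Unicode Char as UCS-2 */
  unsigned char gencat;               /* General Category */
  unsigned char bidicat;              /* Left as 0 in this sketch */
} demo_prop_t;

const demo_prop_t demo_prop_table[] = /* Sorted by 'ch' */
{
  { 0x0031, DEMO_GENCAT_ND, 0 },      /* Digit one */
  { 0x0041, DEMO_GENCAT_LU, 0 },      /* Latin capital A */
  { 0x0061, DEMO_GENCAT_LL, 0 },      /* Latin small a */
  { 0x0301, DEMO_GENCAT_MN, 0 }       /* Combining acute accent */
};

int
demo_compare_propchar(const void *key, const void *elem)
{
  demo_utf32_t k = *(const demo_utf32_t *)key;
  demo_utf32_t e = ((const demo_prop_t *)elem)->ch;

  return ((k > e) - (k < e));
}

int                                   /* O - General category or -1 */
demo_general_category(demo_utf32_t ch)
{
  const demo_prop_t *hit;

  hit = (const demo_prop_t *)bsearch(&ch, demo_prop_table,
            sizeof(demo_prop_table) / sizeof(demo_prop_table[0]),
            sizeof(demo_prop_t), demo_compare_propchar);

  return (hit ? hit->gencat : -1);
}

int                                   /* O - 1 if in major class, else 0 */
demo_in_major_class(int gencat, int major)
{
  if (gencat < 0)
    return (0);                       /* Unknown char: in no class */

  return ((gencat & 0xF0) == major);
}
```

The mask test lets callers ask "is this any kind of letter" with one
comparison instead of checking each of the five letter subcategories.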
1533 | ||
1534 | ||
1535 | ||
1545 | 3.2.2.10. Normalization Utility Functions | |
1546 | ||
1547 | ||
1548 | ||
1549 | ||
1550 | 3.2.2.10.1. cupsNormalizeMapsGet() | |
1551 | ||
1552 | extern void cupsNormalizeMapsGet(void); | |
1553 | ||
1554 | <Find normalize maps in cache> | |
1555 | <...If found, increment 'used'> | |
1556 | <...and return void> | |
1557 | <For each map (normalization, case fold, combining class, etc.)...> | |
1558 | <Open (preprocessed form of) Unicode data file...> | |
1559 | <...If not found, return void> | |
1560 | <Count lines in preprocessed form, for mapping memory alloc> | |
1561 | <...Close (preprocessed form of) Unicode data file> | |
1562 | <Open (preprocessed form of) Unicode data file...> | |
1563 | <...If not found, return void> | |
1564 | <Allocate memory for appropriate map in cache...> | |
1565 | <...If no memory, return void> | |
1566 | <Add to appropriate cache by assigning 'next' field> | |
1567 | <Assign map type field and count field> | |
1568 | <Increment 'used' field> | |
1569 | <Read normalize map into memory in loop...> | |
1570 | <...Add values to 'uni2xxx[]' array> | |
1571 | <Close (preprocessed form of) Unicode data file> | |
1572 | <Return void> | |
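
The count-then-load pattern above can be sketched for a single map. The
file format is an assumption (one triplet of hexadecimal values per
line), as are the names 'demo_count_lines()' and 'demo_load_triplets()'.
Where the steps above close and reopen the file between passes, this
sketch uses 'rewind()' on an already open stream instead.

```c
#include <stdio.h>
#include <stdlib.h>

typedef unsigned short demo_ucs2_t;   /* Stand-in for 'ucs2_t' */

/* Pass 1: count lines so the map can be allocated in one block */
int
demo_count_lines(FILE *fp)
{
  int ch;
  int last  = '\n';
  int count = 0;

  while ((ch = getc(fp)) != EOF)
  {
    if (ch == '\n')
      count ++;
    last = ch;
  }

  if (last != '\n')
    count ++;                         /* Final line without newline */

  return (count);
}

/* Pass 2: allocate and read; caller frees the returned array */
demo_ucs2_t *                         /* O - Triplet array or NULL */
demo_load_triplets(FILE *fp, int *count)
{
  int          lines;
  int          n = 0;
  unsigned     a, b, c;
  demo_ucs2_t  *map;

  lines = demo_count_lines(fp);
  rewind(fp);                         /* Stands in for close + reopen */

  if (lines < 1)
    return (NULL);

  map = (demo_ucs2_t *)calloc((size_t)lines * 3, sizeof(demo_ucs2_t));
  if (map == NULL)
    return (NULL);                    /* No memory */

  while (n < lines && fscanf(fp, "%x %x %x", &a, &b, &c) == 3)
  {
    map[3 * n]     = (demo_ucs2_t)a;
    map[3 * n + 1] = (demo_ucs2_t)b;
    map[3 * n + 2] = (demo_ucs2_t)c;
    n ++;
  }

  *count = n;                         /* Count of triplets read */
  return (map);
}
```

Counting first avoids repeated reallocation while reading, at the cost
of scanning the file twice.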
1573 | ||
1574 | ||
1575 | ||
1576 | 3.2.2.10.2. cupsNormalizeMapsFree() | |
1577 | ||
1578 | extern void cupsNormalizeMapsFree(void); | |
1579 | ||
1580 | <Find normalize maps in cache> | |
1581 | <...If found, decrement 'used'> | |
1582 | <Return void> | |
1583 | ||
1584 | ||
1585 | ||
1586 | 3.2.2.10.3. cupsNormalizeMapsFlush() | |
1587 | ||
1588 | extern void cupsNormalizeMapsFlush(void); | |
1589 | ||
1590 | <Loop through normalize maps cache...> | |
1591 | <...Free 'uni2norm[]' memory> | |
1592 | <...Free normalize map memory> | |
1593 | <Loop through case folding cache...> | |
1594 | <...Free 'uni2fold[]' memory> | |
1600 | <...Free case folding memory> | |
1601 | <Loop through char property map cache...> | |
1602 | <...Free 'uni2prop[]' memory> | |
1603 | <...Free char property map memory> | |
1604 | <Loop through line break class map cache...> | |
1605 | <...Free 'uni2break[]' memory> | |
1606 | <...Free line break class map memory> | |
1607 | <Loop through combining class map cache...> | |
1608 | <...Free 'uni2comb[]' memory> | |
1609 | <...Free combining class map memory> | |
1610 | <Return void> | |
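The get/free/flush life cycle described above can be sketched with a reference count, as below. This is a simplified illustration with a single cache list. All 'demo_' names are hypothetical, and the fixed 16-entry array merely stands in for map data that would be read from the Unicode data files.

```c
#include <stdlib.h>

/* Hypothetical cache entry; field names follow the pseudo-code. */
typedef struct demo_nmap_s
{
  struct demo_nmap_s *next;      /* Next map in cache        */
  int                used;       /* Reference count          */
  unsigned short     *uni2norm;  /* Map data (placeholder)   */
} demo_nmap_t;

static demo_nmap_t *demo_cache = NULL;

/* Get: reuse the cached map if present, else create it. */
demo_nmap_t *demo_maps_get(void)
{
  demo_nmap_t *map = demo_cache;

  if (map == NULL)
  {
    if ((map = calloc(1, sizeof(demo_nmap_t))) == NULL)
      return (NULL);
    map->uni2norm = calloc(16, sizeof(unsigned short));
    map->next     = demo_cache;
    demo_cache    = map;
  }
  map->used ++;
  return (map);
}

/* Free: drop one reference; the memory stays cached for reuse. */
void demo_maps_free(void)
{
  if (demo_cache != NULL && demo_cache->used > 0)
    demo_cache->used --;
}

/* Flush: release every cached map regardless of 'used'. */
void demo_maps_flush(void)
{
  demo_nmap_t *map, *next;

  for (map = demo_cache; map != NULL; map = next)
  {
    next = map->next;
    free(map->uni2norm);
    free(map);
  }
  demo_cache = NULL;
}

/* Inspection helper: current reference count, or -1 if empty. */
int demo_maps_used(void)
{
  return (demo_cache ? demo_cache->used : -1);
}
```

Keeping the memory after 'free' makes a later 'get' cheap; only 'flush' actually returns memory to the system.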
1611 | ||
1612 | ||
1613 | ||
1614 | 3.3. Language - Existing | |
1615 | ||
1616 | ||
1617 | ||
1618 | 3.3.1. language.h - Language header | |
1619 | ||
1620 | Required Changes: | |
1621 | ||
1622 | (1) Change definition of 'cups_lang_t' to correct length of 'language[]' | |
1623 | to 32 characters per [RFC3066] and [ISO639-2] and [ISO3166-1]. | |
1624 | ||
1625 | ||
1626 | ||
1627 | 3.3.2. language.c - Language module | |
1628 | ||
1629 | ||
1630 | ||
1631 | 3.3.2.1. cupsLangEncoding() - Existing | |
1632 | ||
1633 | [No Change] | |
1634 | ||
1635 | ||
1636 | ||
1637 | 3.3.2.2. cupsLangFlush() - Existing | |
1638 | ||
1639 | [No Change] | |
1640 | ||
1641 | ||
1642 | ||
1643 | 3.3.2.3. cupsLangFree() - Existing | |
1644 | ||
1645 | [No Change] | |
1646 | ||
1647 | ||
1648 | ||
1659 | 3.3.2.4. cupsLangGet() - Existing | |
1660 | ||
1661 | Required Changes: | |
1662 | ||
1663 | (1) Change length of 'langname[]' and 'real[]' to 64 characters per | |
1664 | [RFC3066] and potential length of encoding (charset) names; | |
1665 | (2) Change language string normalization to support: | |
1666 | (a) 8-character language codes per [RFC3066] and 3-character | |
1667 | language codes per [ISO639-2]; | |
1668 | (b) 8-character country codes per [RFC3066] and 3-character country | |
1669 | codes per [ISO3166-1]; | |
1670 | (c) Support for 'i' (IANA registered) and 'x' (private) language | |
1671 | prefixes per [RFC3066]; | |
1672 | (d) Invariant use of 'utf-8' for encoding in message catalog, but | |
1673 | save actual requested encoding name for later use. | |
1674 | (3) Correct broken do/while statement for message catalog lookup (while | |
1675 | condition is _never_ satisfied). | |
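The language string normalization in item (2) might look roughly like the sketch below. It assumes an input of the form 'll[-_]CC[.charset]' and a normalized name of the form 'll_CC'; the function name 'demo_lang_normalize()' is hypothetical. The sketch does not treat the 'i' and 'x' prefixes specially, and it does not consult any message catalog.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Split "ll[-_]CC[.charset]" into a normalized "ll_CC" name and a
 * lowercase charset name.  Returns 0 on success, -1 if a subtag is
 * longer than 8 characters or a buffer is too small.
 */
int demo_lang_normalize(const char *in,
                        char *name, size_t namesize,
                        char *charset, size_t charsetsize)
{
  size_t n = 0, c = 0, len;

  if (namesize < 18 || charsetsize < 1)  /* 8 + '_' + 8 + NUL */
    return (-1);

  /* Language subtag: letters only, lowercased, at most 8 chars. */
  for (len = 0; isalpha((unsigned char)*in); in ++, len ++)
  {
    if (len >= 8)
      return (-1);
    name[n ++] = (char)tolower((unsigned char)*in);
  }

  /* Optional country subtag after '-' or '_', uppercased. */
  if (*in == '-' || *in == '_')
  {
    name[n ++] = '_';
    for (in ++, len = 0; isalnum((unsigned char)*in); in ++, len ++)
    {
      if (len >= 8)
        return (-1);
      name[n ++] = (char)toupper((unsigned char)*in);
    }
  }
  name[n] = '\0';

  /* Optional charset after '.', saved in lowercase for later use. */
  if (*in == '.')
  {
    for (in ++; *in != '\0' && c + 1 < charsetsize; in ++)
      charset[c ++] = (char)tolower((unsigned char)*in);
  }
  charset[c] = '\0';
  return (0);
}
```

For example, the input 'EN-us.ISO-8859-1' yields the name 'en_US' and the saved charset 'iso-8859-1'.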
1676 | ||
1677 | ||
1678 | ||
1679 | 3.3.2.5. cupsLangPrintf() - New | |
1680 | ||
1681 | extern int cupsLangPrintf(FILE *fp, /* I - File to write */ | |
1682 | const cups_lang_t *lang, /* I - Language/locale*/ | |
1683 | const cups_msg_t msg, /* I - Msg to format */ | |
1684 | ...); /* I - Args to format */ | |
1685 | ||
1686 | <Set up variable args by calling 'va_start()'> | |
1687 | <Format CUPS message with variable args by calling 'vsnprintf()'> | |
1688 | <Clean up variable args by calling 'va_end()'> | |
1689 | <Transcode CUPS message by calling 'cupsUtf8ToCharset()'> | |
1690 | <Write CUPS message by calling 'fputs()'> | |
1691 | <Return transcoded output CUPS message length> | |
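The format, transcode, and write sequence above can be sketched as follows. This is a simplified illustration: it takes a plain format string rather than a 'cups_lang_t' and 'cups_msg_t', and the hypothetical 'demo_transcode()' only copies UTF-8 unchanged where a real implementation would call 'cupsUtf8ToCharset()'. The 2048-byte buffers are an arbitrary choice for the example.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the transcoding step: copies UTF-8 unchanged. */
static int demo_transcode(char *dst, const char *src, size_t dstsize)
{
  size_t len = strlen(src);

  if (len >= dstsize)
    len = dstsize - 1;
  memcpy(dst, src, len);
  dst[len] = '\0';
  return ((int)len);
}

/* Format, transcode, write; returns output length or -1 on error. */
int demo_lang_printf(FILE *fp, const char *format, ...)
{
  char    buffer[2048];         /* Formatted UTF-8 message   */
  char    output[2048];         /* Transcoded message        */
  int     bytes;
  va_list ap;

  va_start(ap, format);
  bytes = vsnprintf(buffer, sizeof(buffer), format, ap);
  va_end(ap);
  if (bytes < 0)
    return (-1);

  bytes = demo_transcode(output, buffer, sizeof(output));
  if (fputs(output, fp) == EOF)
    return (-1);
  return (bytes);
}
```

Formatting happens before transcoding so that the format arguments are substituted while the text is still UTF-8; the returned length is that of the transcoded output.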
1692 | ||
1693 | ||
1694 | ||
1695 | 3.3.2.6. cupsLangPuts() - New | |
1696 | ||
1697 | extern int cupsLangPuts(FILE *fp, /* I - File to write */ | |
1698 | const cups_lang_t *lang, /* I - Language/locale*/ | |
1699 | const cups_msg_t msg); /* I - Msg to write */ | |
1700 | ||
1701 | <Transcode CUPS message by calling 'cupsUtf8ToCharset()'> | |
1702 | <Write CUPS message by calling 'fputs()'> | |
1703 | <Return transcoded output CUPS message length> | |
1704 | ||
1705 | ||
1706 | ||
1716 | 3.3.2.7. cupsEncodingName() - New | |
1717 | ||
1718 | extern char *cupsEncodingName(cups_encoding_t encoding); | |
1719 | ||
1720 | <Lookup encoding name in static 'lang_encodings[]' array> | |
1721 | <Return pointer to encoding name (charset map file name)> | |
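A lookup in a static name array, as outlined above, can be sketched like this. The enumeration and the three names below are a hypothetical subset chosen for illustration; they are not the actual 'cups_encoding_t' values or the contents of 'lang_encodings[]'.

```c
#include <stddef.h>

/* Hypothetical subset of encodings for illustration only. */
typedef enum
{
  DEMO_US_ASCII,
  DEMO_ISO8859_1,
  DEMO_UTF8,
  DEMO_ENCODING_MAX
} demo_encoding_t;

/* Names indexed by enum value; order must match the enum above. */
static const char * const demo_encodings[DEMO_ENCODING_MAX] =
{
  "us-ascii",
  "iso-8859-1",
  "utf-8"
};

/* Return the charset name, or NULL for an out-of-range value. */
const char *demo_encoding_name(demo_encoding_t encoding)
{
  if ((int)encoding < 0 || (int)encoding >= DEMO_ENCODING_MAX)
    return (NULL);
  return (demo_encodings[encoding]);
}
```

The range check matters because the array index comes from the caller; an unchecked value would read past the end of the table.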
1722 | ||
1723 | ||
1724 | ||
1725 | 3.4. Common Text Filter - Existing | |
1726 | ||
1727 | ||
1728 | ||
1729 | 3.4.1. textcommon.h - Common text filter header | |
1730 | ||
1731 | Required Changes: | |
1732 | ||
1733 | (1) Revise 'lchar_t' as specified below, adding 'attrx' bit-mask for | |
1734 | selected Unicode character properties; | |
1735 | (2) Revise 'lchar_t' as specified below, adding 'comblen' and 'combch[]' | |
1736 | for Unicode combining/attached chars (accents); | |
1737 | (3) Add 'COMBLEN_MAX' limit as specified below; | |
1738 | (4) Add 'ATTRX_...' selected Unicode character properties as specified | |
1739 | below. | |
1740 | ||
1741 | ||
1742 | ||
1743 | 3.4.1.1. lchar_t - Character/Attribute Structure | |
1744 | ||
1745 | typedef struct lchar_str /**** Character / Attribute Structure ****/ | |
1746 | { | |
1747 | unsigned short ch; /* Unicode Char as UCS-2 */ | |
1748 | /* or 8/16-bit Legacy Char */ | |
1749 | unsigned short attr; /* Attributes of Char */ | |
1750 | unsigned short attrx; /* Extended Attributes */ | |
1751 | unsigned short comblen; /* Combining Char Count */ | |
1752 | unsigned short combch[8]; /* Combining Chars as UCS-2 */ | |
1753 | } lchar_t; | |
1754 | ||
1755 | 'ch' is a 16-bit UCS-2 character or an 8/16-bit legacy char. 'attr' is | |
1756 | the character attributes defined for the existing 'lchar_t' structure | |
1757 | (defined in 'textcommon.h'). 'attrx' is the extended character | |
1758 | attributes defined for future selected Unicode character properties (see | |
1759 | below). 'comblen' is the number of attached/combining characters. | |
1760 | 'combch' is an array of 16-bit UCS-2 attached/combining characters. | |
1761 | ||
1762 | Add to 'textcommon.h' constants: | |
1763 | ||
1764 | COMBLEN_MAX 8 | |
1765 | ||
1772 | ATTRX_RIGHT2LEFT 0x0001 | |
1773 | ||
1774 | ||
1775 | ||
1776 | 3.4.2. textcommon.c - Common text filter | |
1777 | ||
1778 | Required Changes: | |
1779 | ||
1780 | (1) Revise 'TextMain()' function as described below. | |
1781 | ||
1782 | ||
1783 | ||
1784 | 3.4.2.1. TextMain() - Existing | |
1785 | ||
1786 | Required Changes: | |
1787 | ||
1788 | [Ed Note: Pseudo code below needs more work on bidi handling.] | |
1789 | ||
1790 | (1) In main loop at the _beginning_ of the 'default' clause, add the | |
1791 | following code for combining marks: | |
1792 | lchar_t *cp; | |
1793 | ||
1794 | cp = Page[line]; | |
1795 | cp += column; | |
1796 | /* | |
1797 | * Check for Unicode combining mark (accent) | |
1798 | */ | |
1799 | if (UTF-8 && cupsUtf32CombiningClass(ch) > 0) | |
1800 | { | |
1801 | ||
1802 | /* | |
1803 | * Save Unicode combining mark in SAME character | |
1804 | */ | |
1805 | if (cp->comblen >= COMBLEN_MAX) | |
1806 | break; | |
1807 | cp->combch[cp->comblen] = ch; | |
1808 | cp->comblen ++; | |
1809 | break; | |
1810 | } | |
1811 | ||
1812 | (2) In main loop _after_ combining chars section in 'default' clause, | |
1813 | add the following code for Unicode bidi control characters: | |
1814 | cups_bidicat_t bidicat; | |
1815 | ||
1816 | /* | |
1817 | * Check for Unicode bidi control character | |
1818 | */ | |
1819 | if (UTF-8) | |
1828 | if ((bidicat == CUPS_BIDI_LRE) /* Left-to-Right Embedding */ | |
1829 | || (bidicat == CUPS_BIDI_LRO) /* Left-to-Right Override */ | |
1830 | || (bidicat == CUPS_BIDI_RLE) /* Right-to-Left Embedding */ | |
1831 | || (bidicat == CUPS_BIDI_RLO) /* Right-to-Left Override */ | |
1832 | || (bidicat == CUPS_BIDI_PDF)) /* Pop Directional Format */ | |
1833 | { | |
1834 | /* Do bidi stuff here with memory for NEXT char's direction */ | |
1835 | /* Discard bidi control character and break */ | |
1836 | } | |
1837 | if ((bidicat == CUPS_BIDI_R) /* Right-to-Left Hebrew */ | |
1838 | || (bidicat == CUPS_BIDI_AL)) /* Right-to-Left Arabic */ | |
1839 | { | |
1840 | /* Set attrx for right-to-left */ | |
1841 | cp->attrx |= ATTRX_RIGHT2LEFT; | |
1842 | } | |
1843 | } | |
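The combining-mark handling in item (1) can be exercised in isolation with the sketch below. It repeats the 'lchar_t' layout from section 3.4.1.1 and uses a hypothetical 'demo_is_combining()' in place of 'cupsUtf32CombiningClass()'; that stand-in recognizes only the Combining Diacritical Marks block (U+0300 to U+036F). The bounds check runs before the store so that the index always stays inside 'combch[]'.

```c
#define COMBLEN_MAX 8

typedef struct lchar_str        /**** Character / Attribute Structure ****/
{
  unsigned short ch;                    /* Unicode Char as UCS-2     */
  unsigned short attr;                  /* Attributes of Char        */
  unsigned short attrx;                 /* Extended Attributes       */
  unsigned short comblen;               /* Combining Char Count      */
  unsigned short combch[COMBLEN_MAX];   /* Combining Chars as UCS-2  */
} lchar_t;

/* Stand-in test: U+0300..U+036F only (assumption for this sketch). */
static int demo_is_combining(unsigned long ch)
{
  return (ch >= 0x0300 && ch <= 0x036F);
}

/*
 * Returns 1 if 'ch' is a combining mark (stored in 'cp', or dropped
 * when the array is already full), 0 if 'ch' is an ordinary character.
 */
int demo_attach_mark(lchar_t *cp, unsigned long ch)
{
  if (!demo_is_combining(ch))
    return (0);
  if (cp->comblen < COMBLEN_MAX)        /* Check before storing */
    cp->combch[cp->comblen ++] = (unsigned short)ch;
  return (1);
}
```

Marks beyond the eighth are silently discarded here, which mirrors the 'break' in the code above.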
1844 | ||
1845 | ||
1846 | ||
1847 | 3.4.2.2. compare_keywords() - Existing | |
1848 | ||
1849 | [No Change] | |
1850 | ||
1851 | ||
1852 | ||
1853 | 3.4.2.3. getutf8() - Existing | |
1854 | ||
1855 | [No Change] | |
1856 | ||
1857 | [Ed Note: Future - allow 21-bit UTF-32 code points - requires updates | |
1858 | in both 'textcommon.c' and 'texttops.c' for extended PostScript.] | |
1859 | ||
1860 | ||
1861 | ||
1862 | 3.5. Text to PostScript Filter - Existing | |
1863 | ||
1864 | ||
1865 | ||
1866 | 3.5.1. texttops.c - Text to PostScript filter | |
1867 | ||
1868 | Required Changes: | |
1869 | ||
1870 | (1) Revise local 'write_string()' function as described below. | |
1871 | ||
1872 | ||
1873 | ||
1874 | 3.5.1.1. main() - Existing | |
1875 | ||
1876 | [No Change] | |
1877 | ||
1878 | ||
1879 | ||
1887 | 3.5.1.2. WriteEpilogue () - Existing | |
1888 | ||
1889 | [No Change] | |
1890 | ||
1891 | ||
1892 | ||
1893 | 3.5.1.3. WritePage () - Existing | |
1894 | ||
1895 | [No Change] | |
1896 | ||
1897 | ||
1898 | ||
1899 | 3.5.1.4. WriteProlog () - Existing | |
1900 | ||
1901 | [No Change] | |
1902 | ||
1903 | ||
1904 | ||
1905 | 3.5.1.5. write_line() - Existing | |
1906 | ||
1907 | [No Change] | |
1908 | ||
1909 | ||
1910 | ||
1911 | 3.5.1.6. write_string() - Existing | |
1912 | ||
1913 | Required Changes: | |
1914 | ||
1915 | (1) At the _beginning_ of Multiple Fonts section, _replace_ the while() | |
1916 | loop and surrounding 'putchar()' calls with the following code: | |
1917 | ||
1918 | for (; len > 0; len --, s ++) | |
1919 | { | |
1920 | utf32_t decstr[COMBLEN_MAX * 2]; | |
1921 | utf32_t cmpstr[COMBLEN_MAX * 2]; | |
1922 | int cmplen; | |
1923 | int i; | |
1924 | ||
1925 | if (s->comblen == 0) | |
1926 | { | |
1927 | printf("<%04x>", Chars[s->ch]); | |
1928 | continue; | |
1929 | } | |
1930 | ||
1931 | /* | |
1932 | * Normalize decomposed Unicode character to NFKC | |
1933 | * (compatibility decomposition, then canonical composition) | |
1934 | */ | |
1935 | decstr[0] = (utf32_t) s->ch; | |
1936 | for (i = 0; i < s->comblen; i ++) | |
1942 | decstr[i + 1] = (utf32_t) s->combch[i]; | |
1943 | decstr[i + 1] = 0; | |
1944 | cmplen = cupsUtf32Normalize (&cmpstr[0], | |
1945 | &decstr[0], COMBLEN_MAX * 2, CUPS_NORM_NFKC); | |
1946 | if (cmplen < 1) | |
1947 | continue; | |
1948 | ||
1949 | /* | |
1950 | * Write combining chars, then composed base, to same location | |
1951 | */ | |
1952 | for (i = 1; i < cmplen; i ++) | |
1953 | { | |
1954 | printf("<%04x>", Chars[(int) cmpstr[i]]); | |
1955 | /* | |
1956 | * Superimpose glyphs by backing up one column width | |
1957 | */ | |
1958 | printf (" -%.3f ", (72.0f / (float) CharsPerInch)); | |
1959 | } | |
1960 | printf("<%04x>", Chars[(int) cmpstr[0]]); | |
1961 | } | |
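The step that builds the decomposed input string for normalization can be checked on its own with the sketch below. It assumes 'utf32_t' is an unsigned type of at least 32 bits and repeats the 'lchar_t' layout from 'textcommon.h'; 'demo_build_decomposed()' is a hypothetical helper, and the call to 'cupsUtf32Normalize()' is left out.

```c
#define COMBLEN_MAX 8

typedef unsigned long utf32_t;  /* Assumed: at least 32 bits */

typedef struct lchar_str
{
  unsigned short ch;                    /* Base character            */
  unsigned short attr;                  /* Attributes of Char        */
  unsigned short attrx;                 /* Extended Attributes       */
  unsigned short comblen;               /* Combining Char Count      */
  unsigned short combch[COMBLEN_MAX];   /* Combining Chars           */
} lchar_t;

/*
 * Copy the base character and its combining characters into 'out'
 * and terminate with 0.  Returns the number of characters written,
 * not counting the terminator.  'out' must hold at least
 * COMBLEN_MAX + 2 elements.
 */
int demo_build_decomposed(const lchar_t *s, utf32_t *out)
{
  int i;
  int n = s->comblen > COMBLEN_MAX ? COMBLEN_MAX : s->comblen;

  out[0] = (utf32_t)s->ch;
  for (i = 0; i < n; i ++)
    out[i + 1] = (utf32_t)s->combch[i];
  out[n + 1] = 0;               /* Terminator follows the last mark */
  return (n + 1);
}
```

With a base 'e' and two marks, the output holds three characters followed by the terminator at index 3, so no combining character is overwritten.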
1962 | ||
1963 | [Ed Note: Future - Bidi support - When writing Unicode characters | |
1964 | (checking for explicit bidi) convert input string (lchar_t) to display | |
1965 | order???] | |
1966 | ||
1967 | ||
1968 | ||
1969 | 3.5.1.7. write_text() - Existing | |
1970 | ||
1971 | [No Change] | |
1972 | ||
1973 | ||
1974 | ||
1998 | APPENDIX A | |
1999 | Glossary | |
2000 | ||
2001 | ||
2002 | ||
2003 | A. Glossary | |
2004 | ||
2005 | Abstract Character: A unit of information used for the organization, | |
2006 | control, or representation of textual data. | |
2007 | ||
2008 | Accent Mark: A mark placed above, below, or to the side of a character | |
2009 | to alter its phonetic value (also 'diacritic'). | |
2010 | ||
2011 | Alphabet: A collection of symbols that, in the context of a particular | |
2012 | written language, represent the sounds of that language. | |
2013 | ||
2014 | Base Character: A character that does not graphically combine with | |
2015 | preceding characters, and that is neither a control nor a format | |
2016 | character. | |
2017 | ||
2018 | Basic Multilingual Plane: The Unicode (or UCS) code values 0x0000 | |
2019 | through 0xFFFF, specified by [ISO10646] (also 'Plane 0'). | |
2020 | ||
2021 | BIDI: Abbreviation for Bidirectional, in reference to mixed | |
2022 | left-to-right and right-to-left text. | |
2023 | ||
2024 | Bidirectional Display: The process or result of mixing left-to-right | |
2025 | oriented text and right-to-left oriented text in a single line. | |
2026 | ||
2027 | Big-endian: A computer architecture that stores multiple-byte numerical | |
2028 | values with the most significant byte (MSB) values first. | |
2029 | ||
2030 | BMP: Abbreviation for Basic Multilingual Plane. | |
2031 | ||
2032 | BOM: Acronym for byte order mark (also 'ZWNBSP'). | |
2033 | ||
2034 | Byte Order Mark: The Unicode character U+FEFF Zero Width No-Break Space | |
2035 | (ZWNBSP) when used to indicate the byte order of text. | |
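As an illustration of how a byte order mark is used, the sketch below inspects the first two octets of a UTF-16 stream. U+FEFF serialized big-endian is FE FF, and little-endian is FF FE. The enumeration and function names are hypothetical.

```c
#include <stddef.h>

typedef enum
{
  DEMO_BO_UNKNOWN,              /* No BOM present            */
  DEMO_BO_BIG,                  /* FE FF: big-endian         */
  DEMO_BO_LITTLE                /* FF FE: little-endian      */
} demo_byteorder_t;

/* Detect the byte order of UTF-16 text from a leading BOM. */
demo_byteorder_t demo_utf16_byteorder(const unsigned char *buf,
                                      size_t len)
{
  if (len >= 2)
  {
    if (buf[0] == 0xFE && buf[1] == 0xFF)
      return (DEMO_BO_BIG);
    if (buf[0] == 0xFF && buf[1] == 0xFE)
      return (DEMO_BO_LITTLE);
  }
  return (DEMO_BO_UNKNOWN);
}
```

When no BOM is found, the caller has to rely on an externally declared byte order or a default.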
2036 | ||
2037 | Canonical: (1) Conforming to the general rules for encoding -- that is, | |
2038 | not compressed, compacted, or in any other form specified by a higher | |
2039 | protocol. (2) Characteristic of a normative mapping and form of | |
2040 | equivalence. | |
2041 | ||
2042 | Canonical Decomposition: The decomposition of a character that results | |
2043 | from recursively applying the canonical mappings defined in the Unicode | |
2044 | Character Database until no characters can be further decomposed, then | |
2045 | reordering nonspacing marks according to section 3.10 of [UNICODE3.2]. | |
2046 | ||
2047 | Canonical Equivalent: Two characters are canonical equivalents if their | |
2048 | full canonical decompositions are identical. | |
2049 | ||
2050 | Case: (1) Feature of certain alphabets where the letters have two | |
2058 | distinct forms. These variants are called the 'uppercase' letter (also | |
2059 | known as 'capital' or 'majuscule') and the 'lowercase' letter (also | |
2060 | known as 'small' or 'minuscule'). (2) Normative property of Unicode | |
2061 | characters, consisting of uppercase, lowercase, and titlecase. | |
2062 | ||
2063 | Character: (1) The smallest component of written language that has | |
2064 | semantic value; refers to the abstract meaning and/or shape, rather than | |
2065 | a specific shape (see also 'glyph'). (2) Synonym for 'abstract | |
2066 | character'. (3) The basic unit of encoding for the Unicode character | |
2067 | encoding. (4) The English name for the ideographic written elements of | |
2068 | Chinese origin (see 'ideograph'). | |
2069 | ||
2070 | Character Encoding Form (CEF): Mapping from a character set definition | |
2071 | to the actual bits used to represent the data. | |
2072 | ||
2073 | Character Encoding Scheme (CES): A 'character encoding form' plus byte | |
2074 | serialization. [UNICODE3.2] defines seven character encoding schemes: | |
2075 | UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. | |
2076 | ||
2077 | Character Properties: A set of property names and property values | |
2078 | associated with individual characters defined in [UNICODE3.2]. | |
2079 | ||
2080 | Character Repertoire: (1) The collection of characters included in a | |
2081 | character set. (2) The SUBSET of characters included in a large | |
2082 | character set, e.g., [UNICODE3.2], that are necessary to support a | |
2083 | complete mapping to another smaller character set, e.g., ISO8859-1 (also | |
2084 | called 'Latin-1'). | |
2085 | ||
2086 | Character Set: A collection of elements used to represent textual | |
2087 | information. | |
2088 | ||
2089 | Coded Character Set: A character set in which each character is | |
2090 | assigned a numeric code value. Frequently abbreviated as 'character | |
2091 | set', 'charset', or 'code set'. | |
2092 | ||
2093 | Code Point: (1) A numerical index (or position) in an encoding table | |
2094 | used for encoding characters. (2) Synonym for 'Unicode scalar value'. | |
2095 | ||
2096 | Collation: The process of ordering units of textual information. | |
2097 | Collation is usually specific to a particular language. Also known as | |
2098 | 'alphabetizing' or 'alphabetic sorting'. | |
2099 | ||
2100 | Combining Character: A character that graphically combines with a | |
2101 | preceding 'base character'. The combining character is said to 'apply' | |
2102 | to that base character. (See also 'nonspacing mark'.) | |
2103 | ||
2104 | Compatibility: (1) Consistency with existing practice or preexisting | |
2105 | character encoding standards. (2) Characteristic of a normative | |
2106 | mapping and form of equivalence (see 'compatibility decomposition'). | |
2107 | ||
2116 | Compatibility Character: A character that has a compatibility | |
2117 | decomposition. | |
2118 | ||
2119 | Compatibility Decomposition: The decomposition of a character that | |
2120 | results from recursively applying BOTH the compatibility mappings AND | |
2121 | the canonical mappings found in the Unicode Character Database until no | |
2122 | characters can be further decomposed, then reordering nonspacing marks | |
2123 | according to section 3.10 of [UNICODE3.2]. | |
2124 | ||
2125 | Compatibility Equivalent: Two characters are compatibility equivalents | |
2126 | if their full compatibility decompositions are identical. | |
2127 | ||
2128 | Composed Character: (See 'decomposable character'.) | |
2129 | ||
2130 | DBCS: Acronym for 'double-byte character set'. | |
2131 | ||
2132 | Decomposable Character: A character that is equivalent to a sequence of | |
2133 | one or more other characters, according to the decomposition mappings | |
2134 | found in [UNICODE3.2]. It may also be known as a 'precomposed | |
2135 | character' or a 'composite character'. | |
2136 | ||
2137 | Decomposition: (1) The process of separating or analyzing a text | |
2138 | element into component units. (2) A sequence of one or more characters | |
2139 | that is equivalent to a 'decomposable character'. | |
2140 | ||
2141 | Diacritic: (See 'accent mark'.) | |
2142 | ||
2143 | Double-Byte Character Set (DBCS): One of a number of character sets | |
2144 | defined for representing Chinese, Japanese, or Korean text (for example, | |
2145 | JIS X 0208-1990). These character sets are often encoded in such a way | |
2146 | as to allow double-byte character encodings to be mixed with single-byte | |
2147 | character encodings. (See also 'multiple-byte character set'.) | |
2148 | ||
2149 | Font: A collection of glyphs used for visual depiction of character | |
2150 | data. | |
2151 | ||
2152 | FSS-UTF: Abbreviation for 'File System Safe UCS Transformation Format', | |
2153 | originally published by X/Open. Now called 'UTF-8'. | |
2154 | ||
2155 | Fullwidth: Characters of East Asian character sets whose glyph image | |
2156 | extends across the entire character display cell. In legacy character | |
2157 | sets, fullwidth characters are normally encoded in two or three bytes. | |
2158 | ||
2159 | Glyph: (1) An abstract form that represents one or more glyph images. | |
2160 | (2) A synonym for 'glyph image'. | |
2161 | ||
2162 | Glyph Image: The actual, concrete image of a glyph representation | |
2163 | having been rasterized or otherwise imaged onto some display surface. | |
2164 | ||
2173 | Halfwidth: Characters of East Asian character sets whose glyph image | |
2174 | occupies half of the character display cell. In legacy character sets, | |
2175 | halfwidth characters are normally encoded in a single byte. | |
2176 | ||
2177 | Han Characters: Ideographic characters of Chinese origin. | |
2178 | ||
2179 | Hangul: The name of the script used to write the Korean language. | |
2180 | ||
2181 | High-Surrogate: A Unicode code value in the range U+D800 to U+DBFF. | |
2182 | ||
2183 | Hiragana: One of two standard syllabaries associated with the Japanese | |
2184 | writing system. Used to write particles, grammatical affixes, and words | |
2185 | that have no 'kanji' form. | |
2186 | ||
2187 | IANA: Internet Assigned Numbers Authority. | |
2188 | ||
2189 | Ideograph: (1) Any symbol that denotes an idea (or meaning) in contrast | |
2190 | to a sound or pronunciation (for example, a 'smiley face'). (2) A | |
2191 | common term used to refer to Han characters. | |
2192 | ||
2193 | IPA: International Phonetic Alphabet. | |
2194 | ||
2195 | IRG: Abbreviation for Ideographic Rapporteur Group, a subgroup of | |
2196 | ISO/IEC JTC1/SC2/WG2 (who work on Han unification and submission of new | |
2197 | Han characters for inclusion in revised versions of Unicode/ISO 10646). | |
2198 | ||
2199 | Jamo: The Korean name for a single letter of the Hangul script. Jamos | |
2200 | are used to form Hangul syllables. | |
2201 | ||
2202 | Joiner: An invisible character that affects the joining behavior of | |
2203 | surrounding characters. | |
2204 | ||
2205 | JTC1: Abbreviation for Joint Technical Committee 1 of ISO/IEC, | |
2206 | responsible for information technology standardization. | |
2207 | ||
2208 | Kana: The name of a primarily syllabic script used by the Japanese | |
2209 | writing system, composed of 'hiragana' and 'katakana'. | |
2210 | ||
2211 | Kanji: The Japanese name for Han characters; derived from the Chinese | |
2212 | word 'hanzi'. Also romanized as 'kanzi'. | |
2213 | ||
2214 | Katakana: One of two standard syllabaries associated with the Japanese | |
2215 | writing system, typically used in representation of borrowed vocabulary. | |
2216 | ||
2217 | Ligature: A glyph representing a combination of two or more characters, | |
2218 | for example in the Latin script the ligature between 'f' and 'i' as | |
2219 | 'fi'. | |
2220 | ||
2221 | Logical Order: The order in which text is typed on a keyboard. For the | |
2229 | most part, logical order corresponds to phonetic order. | |
2230 | ||
2231 | Lowercase: (See 'case'.) | |
2232 | ||
2233 | Low-Surrogate: A Unicode code value in the range U+DC00 to U+DFFF. | |
2234 | ||
2235 | MBCS: Acronym for 'multiple-byte character set'. | |
2236 | ||
2237 | Multiple-Byte Character Set (MBCS): A character set encoded with a | |
2238 | variable number of bytes per character. Many large character sets have | |
2239 | been defined as MBCS so as to keep strict compatibility with the | |
2240 | US-ASCII subset and/or [ISO2022]. | |
2241 | ||
2242 | Normalization: Transformation of data to a normal form. | |
2243 | ||
2244 | Plain Text: Computer-encoded text that consists ONLY of a sequence of | |
2245 | code values from a given standard, with no other formatting or | |
2246 | structural information. | |
2247 | ||
2248 | Precomposed Character: (See 'decomposable character'.) | |
2249 | ||
2250 | Rendering: (1) The process of selecting and laying out glyphs for the | |
2251 | purpose of depicting characters. (2) The process of making glyphs | |
2252 | visible on a display device. | |
2253 | ||
2254 | Repertoire: (See 'character repertoire'.) | |
2255 | ||
2256 | Replacement Character: A character used as a substitute for an | |
2257 | uninterpretable character from another encoding. [UNICODE3.2] defines | |
2258 | U+FFFD REPLACEMENT CHARACTER for this function. | |
2259 | ||
2260 | Rich Text: The result of adding information such as font data, color, | |
2261 | formatting, phonetic annotations, etc. to 'plain text' (e.g., HTML). | |
2262 | ||
2263 | SBCS: Acronym for 'single-byte character set'. | |
2264 | ||
2265 | Scalar Value: (See 'Unicode scalar value'.) | |
2266 | ||
2267 | Script: A collection of symbols used to represent textual information | |
2268 | in one or more writing systems. | |
2269 | ||
2270 | Single-Byte Character Set (SBCS): One of a number of one-byte character | |
2271 | sets defined for representing (mostly) Western languages (for example, | |
2272 | ISO 8859-1 'Latin-1'). These character sets are often encoded in such a | |
2273 | way as to be strict supersets of 7-bit [US-ASCII]. | |
2274 | ||
2275 | Sorting: (See 'collation'.) | |
2276 | ||
2277 | Transcoding: Conversion of character data between different character | |
2278 | sets. | |
2279 | ||
2287 | Transformation Format: A mapping from a coded character sequence to a | |
2288 | unique sequence of code values (typically octets). | |
2289 | ||
2290 | UCS: Abbreviation for Universal Character Set, specified by [ISO10646]. | |
2291 | ||
2292 | UCS-2: UCS encoded in 2 octets, specified by [ISO10646]. | |
2293 | ||
2294 | UCS-4: UCS encoded in 4 octets, specified by [ISO10646]. | |
2295 | ||
2296 | Unicode Scalar Value: A number between 0 and 0x10FFFF. | |
2297 | ||
2298 | Uppercase: (See 'case'.) | |
2299 | ||
2300 | UTF: Abbreviation for Unicode (or UCS) Transformation Format. | |
2301 | ||
2302 | UTF-8: Unicode (or UCS) Transformation Format, 8-bit encoding form. | |
2303 | Serializes a Unicode (or UCS) scalar value (code point) as a sequence of | |
2304 | one to four octets. Does NOT suffer from byte-ordering ambiguities. | |
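The one-to-four octet serialization can be illustrated with the sketch below. The function name 'demo_utf8_encode()' is hypothetical; the sketch rejects surrogate code values and values above 0x10FFFF by returning 0.

```c
#include <stddef.h>

/*
 * Encode one scalar value as UTF-8.  Returns the number of octets
 * written (1 to 4), or 0 for surrogates and values above 0x10FFFF.
 */
size_t demo_utf8_encode(unsigned long ch, unsigned char out[4])
{
  if (ch < 0x80)                        /* 0xxxxxxx */
  {
    out[0] = (unsigned char)ch;
    return (1);
  }
  if (ch < 0x800)                       /* 110xxxxx 10xxxxxx */
  {
    out[0] = (unsigned char)(0xC0 | (ch >> 6));
    out[1] = (unsigned char)(0x80 | (ch & 0x3F));
    return (2);
  }
  if (ch >= 0xD800 && ch <= 0xDFFF)     /* Surrogates are not scalars */
    return (0);
  if (ch < 0x10000)                     /* 1110xxxx + 2 trail octets */
  {
    out[0] = (unsigned char)(0xE0 | (ch >> 12));
    out[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (ch & 0x3F));
    return (3);
  }
  if (ch <= 0x10FFFF)                   /* 11110xxx + 3 trail octets */
  {
    out[0] = (unsigned char)(0xF0 | (ch >> 18));
    out[1] = (unsigned char)(0x80 | ((ch >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (ch & 0x3F));
    return (4);
  }
  return (0);
}
```

Because each octet is written individually, the result is the same on big-endian and little-endian machines.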
2305 | ||
2306 | UTF-16: Unicode (or UCS) Transformation Format, 16-bit encoding form. | |
2307 | Serializes a Unicode (or UCS) scalar value (code point) as one or two | |
2308 | 16-bit code units (two or four octets), in big-endian or little-endian | |
2309 | format. Uses an (optional) prefix of BOM to disambiguate byte-ordering. | |
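Scalar values above the Basic Multilingual Plane are carried in UTF-16 as a high-surrogate followed by a low-surrogate (see those entries). The sketch below shows the arithmetic; 'demo_utf16_encode()' is a hypothetical name and the output is in code units, before any byte-order serialization.

```c
#include <stddef.h>

/*
 * Encode one scalar value as UTF-16 code units.  Returns 1 or 2, or
 * 0 for surrogate code values and values above 0x10FFFF.
 */
size_t demo_utf16_encode(unsigned long ch, unsigned short out[2])
{
  if (ch >= 0xD800 && ch <= 0xDFFF)
    return (0);
  if (ch < 0x10000)                     /* BMP: one code unit */
  {
    out[0] = (unsigned short)ch;
    return (1);
  }
  if (ch > 0x10FFFF)
    return (0);

  ch -= 0x10000;                        /* 20 bits remain */
  out[0] = (unsigned short)(0xD800 | (ch >> 10));    /* High 10 bits */
  out[1] = (unsigned short)(0xDC00 | (ch & 0x3FF));  /* Low 10 bits  */
  return (2);
}
```

For example, U+1D11E becomes the pair D834 DD1E.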
2310 | ||
2311 | UTF-32: Unicode (or UCS) Transformation Format, 32-bit encoding form. | |
2312 | Serializes a Unicode (or UCS) scalar value (code point) as a sequence of | |
2313 | four octets, in either big-endian or little-endian format. Uses an | |
2314 | (optional) prefix of BOM to disambiguate byte-ordering. | |
2315 | ||
2316 | Zero Width: Characteristic of some spaces or format control characters | |
2317 | that do not advance text along the horizontal baseline. | |
2318 | ||