Commit | Line | Data |
---|---|---|
0b2b18a2 UD |
1 | @node Character Set Handling, Locales, String and Array Utilities, Top |
2 | @c %MENU% Support for extended character sets | |
3 | @chapter Character Set Handling | |
4 | ||
5 | @ifnottex | |
6 | @macro cal{text} | |
7 | \text\ | |
8 | @end macro | |
9 | @end ifnottex | |
10 | ||
11 | Character sets used in the early days of computing had only six, seven, | |
12 | or eight bits for each character: there was never a case where more than | |
13 | eight bits (one byte) were used to represent a single character. The | |
14 | limitations of this approach became more apparent as more people | |
15 | grappled with non-Roman character sets, where not all the characters | |
16 | that make up a language's character set can be represented by @math{2^8} | |
17 | choices. This chapter shows the functionality that was added to the C | |
18 | library to support multiple character sets. | |
19 | ||
20 | @menu | |
21 | * Extended Char Intro:: Introduction to Extended Characters. | |
22 | * Charset Function Overview:: Overview about Character Handling | |
23 | Functions. | |
24 | * Restartable multibyte conversion:: Restartable multibyte conversion | |
25 | Functions. | |
26 | * Non-reentrant Conversion:: Non-reentrant Conversion Function. | |
27 | * Generic Charset Conversion:: Generic Charset Conversion. | |
28 | @end menu | |
29 | ||
30 | ||
31 | @node Extended Char Intro | |
32 | @section Introduction to Extended Characters | |
33 | ||
d987d219 | 34 | A variety of solutions are available to overcome the differences between |
0b2b18a2 UD |
35 | character sets with a 1:1 relation between bytes and characters and |
36 | character sets with ratios of 2:1 or 4:1. The remainder of this | |
37 | section gives a few examples to help understand the design decisions | |
38 | made while developing the functionality of the @w{C library}. | |
39 | ||
40 | @cindex internal representation | |
41 | A distinction we have to make right away is between internal and | |
42 | external representation. @dfn{Internal representation} means the | |
43 | representation used by a program while keeping the text in memory. | |
44 | External representations are used when text is stored or transmitted | |
45 | through some communication channel. Examples of external | |
46 | representations include files waiting in a directory to be | |
47 | read and parsed. | |
48 | ||
49 | Traditionally there has been no difference between the two representations. | |
50 | It was equally comfortable and useful to use the same single-byte | |
51 | representation internally and externally. This comfort level decreases | |
52 | with more and larger character sets. | |
53 | ||
54 | One of the problems to overcome with the internal representation is | |
55 | handling text that is externally encoded using different character | |
56 | sets. Assume a program that reads two texts and compares them using | |
57 | some metric. The comparison can be usefully done only if the texts are | |
58 | internally kept in a common format. | |
59 | ||
60 | @cindex wide character | |
61 | For such a common format (@math{=} character set) eight bits are certainly | |
62 | no longer enough. So the smallest entity will have to grow: @dfn{wide | |
63 | characters} will now be used. Instead of one byte per character, two or | |
64 | four will be used. (Three bytes are awkward to address in memory, and | |
65 | more than four bytes do not seem to be necessary.) | |
66 | ||
67 | @cindex Unicode | |
68 | @cindex ISO 10646 | |
69 | As shown in some other part of this manual, | |
70 | @c !!! Ahem, wide char string functions are not yet covered -- drepper | |
71 | a completely new family has been created of functions that can handle wide | |
72 | character texts in memory. The most commonly used character sets for such | |
73 | internal wide character representations are Unicode and @w{ISO 10646} | |
74 | (also known as UCS for Universal Character Set). Unicode was originally | |
75 | planned as a 16-bit character set; whereas, @w{ISO 10646} was designed to | |
76 | be a 31-bit large code space. The two standards are practically identical. | |
77 | They have the same character repertoire and code table, but Unicode specifies | |
78 | added semantics. Originally, only characters in the first @code{0x10000} | |
79 | code positions (the so-called Basic Multilingual Plane, BMP) were | |
80 | assigned; more specialized characters outside this 16-bit space have | |
81 | since been assigned as well. A number of encodings have been | |
82 | defined for Unicode and @w{ISO 10646} characters: | |
83 | @cindex UCS-2 | |
84 | @cindex UCS-4 | |
85 | @cindex UTF-8 | |
86 | @cindex UTF-16 | |
87 | UCS-2 is a 16-bit word that can only represent characters | |
88 | from the BMP, UCS-4 is a 32-bit word that can represent any Unicode |
89 | and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where | |
90 | ASCII characters are represented by ASCII bytes and non-ASCII characters | |
91 | by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension | |
92 | of UCS-2 in which pairs of certain UCS-2 words can be used to encode | |
93 | non-BMP characters up to @code{0x10ffff}. | |
94 | ||
95 | To represent wide characters the @code{char} type is not suitable. For | |
96 | this reason the @w{ISO C} standard introduces a new type that is | |
97 | designed to keep one character of a wide character string. To maintain | |
98 | the similarity there is also a type corresponding to @code{int} for | |
99 | those functions that take a single wide character. | |
100 | ||
0b2b18a2 | 101 | @deftp {Data type} wchar_t |
d08a7e4c | 102 | @standards{ISO, stddef.h} |
0b2b18a2 | 103 | This data type is used as the base type for wide character strings. |
bd3916e8 UD |
104 | In other words, arrays of objects of this type are the equivalent of |
105 | @code{char[]} for multibyte character strings. The type is defined in | |
0b2b18a2 UD |
106 | @file{stddef.h}. |
107 | ||
108 | The @w{ISO C90} standard, where @code{wchar_t} was introduced, does not | |
109 | say anything specific about the representation. It only requires that | |
110 | this type is capable of storing all elements of the basic character set. | |
111 | Therefore it would be legitimate to define @code{wchar_t} as @code{char}, | |
112 | which might make sense for embedded systems. | |
113 | ||
a7a93d50 | 114 | But in @theglibc{} @code{wchar_t} is always 32 bits wide and, therefore, |
0b2b18a2 UD |
115 | capable of representing all UCS-4 values and, therefore, covering all of |
116 | @w{ISO 10646}. Some Unix systems define @code{wchar_t} as a 16-bit type | |
117 | and thereby follow Unicode very strictly. This definition is perfectly | |
118 | fine with the standard, but it also means that to represent all | |
119 | characters from Unicode and @w{ISO 10646} one has to use UTF-16 surrogate | |
120 | characters, which is in fact a multi-wide-character encoding. But | |
121 | resorting to multi-wide-character encoding contradicts the purpose of the | |
122 | @code{wchar_t} type. | |
123 | @end deftp | |
124 | ||
0b2b18a2 | 125 | @deftp {Data type} wint_t |
d08a7e4c | 126 | @standards{ISO, wchar.h} |
0b2b18a2 UD |
127 | @code{wint_t} is a data type used for parameters and variables that |
128 | contain a single wide character. As the name suggests this type is the | |
129 | equivalent of @code{int} when using the normal @code{char} strings. The | |
130 | types @code{wchar_t} and @code{wint_t} often have the same | |
131 | representation if their size is 32 bits wide but if @code{wchar_t} is | |
132 | defined as @code{char} the type @code{wint_t} must be defined as | |
133 | @code{int} due to the parameter promotion. | |
134 | ||
135 | @pindex wchar.h | |
136 | This type is defined in @file{wchar.h} and was introduced in | |
137 | @w{Amendment 1} to @w{ISO C90}. | |
138 | @end deftp | |
139 | ||
140 | As with the @code{char} data type, macros are available for | |
141 | specifying the minimum and maximum value representable in an object of | |
142 | type @code{wchar_t}. | |
143 | ||
0b2b18a2 | 144 | @deftypevr Macro wint_t WCHAR_MIN |
d08a7e4c | 145 | @standards{ISO, wchar.h} |
0b2b18a2 UD |
146 | The macro @code{WCHAR_MIN} evaluates to the minimum value representable |
147 | by an object of type @code{wchar_t}. | |
148 | ||
149 | This macro was introduced in @w{Amendment 1} to @w{ISO C90}. | |
150 | @end deftypevr | |
151 | ||
0b2b18a2 | 152 | @deftypevr Macro wint_t WCHAR_MAX |
d08a7e4c | 153 | @standards{ISO, wchar.h} |
0b2b18a2 UD |
154 | The macro @code{WCHAR_MAX} evaluates to the maximum value representable |
155 | by an object of type @code{wchar_t}. | |
156 | ||
157 | This macro was introduced in @w{Amendment 1} to @w{ISO C90}. | |
158 | @end deftypevr | |
159 | ||
160 | Another special wide character value is the equivalent to @code{EOF}. | |
161 | ||
0b2b18a2 | 162 | @deftypevr Macro wint_t WEOF |
d08a7e4c | 163 | @standards{ISO, wchar.h} |
0b2b18a2 UD |
164 | The macro @code{WEOF} evaluates to a constant expression of type |
165 | @code{wint_t} whose value is different from any member of the extended | |
166 | character set. | |
167 | ||
168 | @code{WEOF} need not be the same value as @code{EOF} and unlike | |
bd3916e8 | 169 | @code{EOF} it also need @emph{not} be negative. In other words, sloppy |
0b2b18a2 UD |
170 | code like |
171 | ||
172 | @smallexample | |
173 | @{ | |
174 | int c; | |
95fdc6a0 | 175 | @dots{} |
0b2b18a2 | 176 | while ((c = getc (fp)) >= 0)
95fdc6a0 | 177 | @dots{} |
0b2b18a2 UD |
178 | @} |
179 | @end smallexample | |
180 | ||
181 | @noindent | |
182 | has to be rewritten to use @code{WEOF} explicitly when wide characters | |
183 | are used: | |
184 | ||
185 | @smallexample | |
186 | @{ | |
187 | wint_t c; | |
95fdc6a0 | 188 | @dots{} |
b8de7980 | 189 | while ((c = getwc (fp)) != WEOF) |
95fdc6a0 | 190 | @dots{} |
0b2b18a2 UD |
191 | @} |
192 | @end smallexample | |
193 | ||
194 | @pindex wchar.h | |
195 | This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is | |
196 | defined in @file{wchar.h}. | |
197 | @end deftypevr | |
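The loop shape described for @code{WEOF} can be sketched as a small helper. This is only an illustration: the function name @code{count_wide_chars} is hypothetical and not part of any library. It relies on the standard @code{getwc} function and compares against @code{WEOF} rather than testing the sign of the result.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: count the wide characters that can still be
   read from FP.  The loop stops when getwc returns WEOF, which
   signals either end of file or a read/conversion error.  */
long
count_wide_chars (FILE *fp)
{
  wint_t c;   /* wint_t, not wchar_t, so that WEOF is representable.  */
  long n = 0;

  while ((c = getwc (fp)) != WEOF)
    ++n;
  return n;
}
```

Because @code{WEOF} is returned for errors as well as for end of file, a caller that needs to tell the two apart would check @code{ferror} on the stream afterwards.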
198 | ||
199 | ||
d987d219 | 200 | These internal representations present problems when it comes to storage |
0b2b18a2 | 201 | and transmittal. Because each single wide character consists of more |
6c55cda3 | 202 | than one byte, they are affected by byte-ordering. Thus, machines with |
0b2b18a2 UD |
203 | different endiannesses would see different values when accessing the same |
204 | data. This byte ordering concern also applies to communication protocols | |
11bf311e | 205 | that are all byte-based and therefore require the sender to
0b2b18a2 UD |
206 | decide how to split each wide character into bytes. A last (but not least |
207 | important) point is that wide characters often require more storage space | |
208 | than a customized byte-oriented character set. | |
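To make the byte-ordering concern concrete, here is a minimal sketch of how a sender could fix the order explicitly instead of writing the in-memory representation. The function names @code{put_ucs4_be} and @code{get_ucs4_be} are hypothetical, and big-endian order is chosen only for the sake of the example.

```c
#include <stdint.h>

/* Hypothetical helper: store the 32-bit value WC into OUT with the
   most significant byte first, regardless of the host byte order.  */
void
put_ucs4_be (uint32_t wc, unsigned char out[4])
{
  out[0] = (unsigned char) (wc >> 24);
  out[1] = (unsigned char) (wc >> 16);
  out[2] = (unsigned char) (wc >> 8);
  out[3] = (unsigned char) wc;
}

/* Hypothetical helper: the inverse of put_ucs4_be.  */
uint32_t
get_ucs4_be (const unsigned char in[4])
{
  return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16)
         | ((uint32_t) in[2] << 8) | (uint32_t) in[3];
}
```

Since both functions work with shifts rather than with the object representation, machines of either endianness produce and read the same four bytes.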
209 | ||
210 | @cindex multibyte character | |
211 | @cindex EBCDIC | |
bd3916e8 UD |
212 | For all the above reasons, an external encoding that is different from |
213 | the internal encoding is often used if the latter is UCS-2 or UCS-4. | |
0b2b18a2 UD |
214 | The external encoding is byte-based and can be chosen appropriately for |
215 | the environment and for the texts to be handled. A variety of different | |
216 | character sets can be used for this external encoding (information that | |
217 | will not be exhaustively presented here--instead, a description of the | |
218 | major groups will suffice). All of the ASCII-based character sets | |
bd3916e8 UD |
219 | fulfill one requirement: they are "filesystem safe." This means that |
220 | the character @code{'/'} is used in the encoding @emph{only} to | |
0b2b18a2 UD |
221 | represent itself. Things are a bit different for character sets like |
222 | EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set | |
6c55cda3 | 223 | family used by IBM), but if the operating system does not understand |
bd3916e8 UD |
224 | EBCDIC directly, the parameters to system calls have to be converted |
225 | first anyhow. | |
0b2b18a2 UD |
226 | |
227 | @itemize @bullet | |
bd3916e8 UD |
228 | @item |
229 | The simplest character sets are single-byte character sets. There can | |
230 | be only up to 256 characters (for @w{8 bit} character sets), which is | |
231 | not sufficient to cover all languages but might be sufficient to handle | |
232 | a specific text. Handling of @w{8 bit} character sets is simple. This | |
233 | is not true for other kinds presented later, and therefore, the | |
0b2b18a2 UD |
234 | application one uses might require the use of @w{8 bit} character sets. |
235 | ||
236 | @cindex ISO 2022 | |
237 | @item | |
238 | The @w{ISO 2022} standard defines a mechanism for extended character | |
239 | sets where one character @emph{can} be represented by more than one | |
240 | byte. This is achieved by associating a state with the text. | |
241 | Characters that can be used to change the state can be embedded in the | |
242 | text. Each byte in the text might have a different interpretation in each | |
243 | state. The state might even influence whether a given byte stands for a | |
244 | character on its own or whether it has to be combined with some more | |
245 | bytes. | |
246 | ||
247 | @cindex EUC | |
248 | @cindex Shift_JIS | |
249 | @cindex SJIS | |
250 | In most uses of @w{ISO 2022} the defined character sets do not allow | |
251 | state changes that cover more than the next character. This has the | |
252 | big advantage that whenever one can identify the beginning of the byte | |
253 | sequence of a character one can interpret a text correctly. Examples of | |
254 | character sets using this policy are the various EUC character sets | |
6c55cda3 | 255 | (used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN) |
0b2b18a2 UD |
256 | or Shift_JIS (SJIS, a Japanese encoding). |
257 | ||
258 | But there are also character sets using a state that is valid for more | |
259 | than one character and has to be changed by another byte sequence. | |
260 | Examples of this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN. | |
261 | ||
262 | @item | |
263 | @cindex ISO 6937 | |
264 | Early attempts to fix 8 bit character sets for other languages using the | |
265 | Roman alphabet led to character sets like @w{ISO 6937}. Here bytes | |
266 | representing characters like the acute accent do not produce output | |
267 | themselves: one has to combine them with other characters to get the | |
268 | desired result. For example, one writes the byte sequence @code{0xc2 0x61} | |
269 | (non-spacing acute accent followed by lower-case `a') to get the ``small | |
270 | a with acute'' character. To get the acute accent character on its own, | |
271 | one has to write @code{0xc2 0x20} (the non-spacing acute followed by a | |
272 | space). | |
273 | ||
bd3916e8 | 274 | Character sets like @w{ISO 6937} are used in some embedded systems such |
0b2b18a2 UD |
275 | as teletex. |
276 | ||
277 | @item | |
278 | @cindex UTF-8 | |
279 | Instead of converting the Unicode or @w{ISO 10646} text used internally, | |
280 | it is often also sufficient to simply use an encoding different than | |
281 | UCS-2/UCS-4. The Unicode and @w{ISO 10646} standards even specify such an | |
282 | encoding: UTF-8. This encoding is able to represent all of @w{ISO | |
283 | 10646} 31 bits in a byte string of length one to six. | |
284 | ||
285 | @cindex UTF-7 | |
286 | There were a few other attempts to encode @w{ISO 10646} such as UTF-7, | |
287 | but UTF-8 is today the only encoding that should be used. In fact, with | |
288 | any luck UTF-8 will soon be the only external encoding that has to be | |
289 | supported. It proves to be universally usable and its only disadvantage | |
290 | is that it favors Roman languages by making the byte string | |
291 | representation of other scripts (Cyrillic, Greek, Asian scripts) longer | |
292 | than necessary if using a specific character set for these scripts. | |
293 | Methods like the Unicode compression scheme can alleviate these | |
294 | problems. | |
295 | @end itemize | |
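The UTF-8 item above describes byte strings of length one to six covering the 31-bit @w{ISO 10646} code space. The following sketch shows how such an encoder might look; the name @code{ucs4_to_utf8} is hypothetical, and the code follows the original 31-bit scheme described here. Current UTF-8 definitions restrict the range to @code{0x10ffff} (at most four bytes) and exclude surrogate values, which this sketch does not check.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode CODE (below 2^31) into BUF, which must
   have room for six bytes.  Returns the number of bytes written, or
   0 if CODE is outside the 31-bit range.  */
size_t
ucs4_to_utf8 (uint32_t code, unsigned char buf[6])
{
  size_t len;

  if (code < 0x80)
    {
      /* ASCII characters are represented by themselves.  */
      buf[0] = (unsigned char) code;
      return 1;
    }
  else if (code < 0x800)
    len = 2;
  else if (code < 0x10000)
    len = 3;
  else if (code < 0x200000)
    len = 4;
  else if (code < 0x4000000)
    len = 5;
  else if (code < 0x80000000u)
    len = 6;
  else
    return 0;

  /* Fill the continuation bytes from the end; each carries six
     payload bits below the marker bits 10.  */
  for (size_t i = len - 1; i > 0; --i)
    {
      buf[i] = (unsigned char) (0x80 | (code & 0x3f));
      code >>= 6;
    }

  /* The lead byte has LEN one-bits, a zero bit, and the remaining
     payload bits.  */
  buf[0] = (unsigned char) ((0xff00u >> len) | code);
  return len;
}
```

For instance, the ``small a with acute'' character at code position @code{0xe1} is encoded as the two bytes @code{0xc3 0xa1}, both of which are non-ASCII.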
296 | ||
297 | The question remaining is: how to select the character set or encoding | |
298 | to use. The answer: you cannot decide this yourself; it is decided | |
299 | by the developers of the system or the majority of the users. Since the | |
300 | goal is interoperability one has to use whatever the other people one | |
301 | works with use. If there are no constraints, the selection is based on | |
302 | the requirements the expected circle of users will have. In other words, | |
303 | if a project is expected to be used in only, say, Russia it is fine to use | |
304 | KOI8-R or a similar character set. But if at the same time people from, | |
305 | say, Greece are participating one should use a character set that allows | |
306 | all people to collaborate. | |
307 | ||
308 | The most widely useful solution seems to be: go with the most general | |
309 | character set, namely @w{ISO 10646}. Use UTF-8 as the external encoding | |
310 | and problems about users not being able to use their own language | |
311 | adequately are a thing of the past. | |
312 | ||
313 | One final comment about the choice of the wide character representation | |
314 | is necessary at this point. We have said above that the natural choice | |
315 | is using Unicode or @w{ISO 10646}. This is not required, but at least | |
316 | encouraged, by the @w{ISO C} standard. The standard defines at least a | |
317 | macro @code{__STDC_ISO_10646__} that is only defined on systems where | |
318 | the @code{wchar_t} type encodes @w{ISO 10646} characters. If this | |
319 | symbol is not defined one should avoid making assumptions about the wide | |
320 | character representation. If the programmer uses only the functions | |
321 | provided by the C library to handle wide character strings there should | |
322 | be no compatibility problems with other systems. | |
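A program that wants to know what it can assume about @code{wchar_t} could probe the implementation along the following lines. The helper names @code{wchar_bits} and @code{wchar_is_iso10646} are hypothetical; only @code{CHAR_BIT} and the @code{__STDC_ISO_10646__} macro come from the standard.

```c
#include <limits.h>
#include <wchar.h>

/* Hypothetical helper: width of wchar_t in bits on this system.  */
int
wchar_bits (void)
{
  return (int) (sizeof (wchar_t) * CHAR_BIT);
}

/* Hypothetical helper: 1 if the implementation promises that
   wchar_t values are ISO 10646 code positions, 0 otherwise.  */
int
wchar_is_iso10646 (void)
{
#ifdef __STDC_ISO_10646__
  return 1;
#else
  return 0;
#endif
}
```

If @code{wchar_is_iso10646} returns zero, the safe course is the one described above: treat wide character values as opaque and use only the library functions to process them.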
323 | ||
324 | @node Charset Function Overview | |
325 | @section Overview about Character Handling Functions | |
326 | ||
bd3916e8 UD |
327 | A Unix @w{C library} contains three different sets of functions in two |
328 | families to handle character set conversion. One of the function families | |
329 | (the most commonly used) is specified in the @w{ISO C90} standard and, | |
330 | therefore, is portable even beyond the Unix world. Unfortunately this | |
331 | family is the least useful one. These functions should be avoided | |
332 | whenever possible, especially when developing libraries (as opposed to | |
333 | applications). | |
0b2b18a2 UD |
334 | |
335 | The second family of functions was introduced in the early Unix standards | |
336 | (XPG2) and is still part of the latest and greatest Unix standard: | |
337 | @w{Unix 98}. It is also the most powerful and useful set of functions. | |
338 | But we will start with the functions defined in @w{Amendment 1} to | |
339 | @w{ISO C90}. | |
340 | ||
341 | @node Restartable multibyte conversion | |
342 | @section Restartable Multibyte Conversion Functions | |
343 | ||
344 | The @w{ISO C} standard defines functions to convert strings from a | |
345 | multibyte representation to wide character strings. There are a number | |
346 | of peculiarities: | |
347 | ||
348 | @itemize @bullet | |
349 | @item | |
350 | The character set assumed for the multibyte encoding is not specified | |
351 | as an argument to the functions. Instead the character set specified by | |
352 | the @code{LC_CTYPE} category of the current locale is used; see | |
353 | @ref{Locale Categories}. | |
354 | ||
355 | @item | |
356 | The functions handling more than one character at a time require NUL | |
357 | terminated strings as the argument (i.e., converting blocks of text | |
bd3916e8 | 358 | does not work unless one can add a NUL byte at an appropriate place). |
1f77f049 | 359 | @Theglibc{} contains some extensions to the standard that allow |
0b2b18a2 UD |
360 | specifying a size, but basically they also expect terminated strings. |
361 | @end itemize | |
362 | ||
363 | Despite these limitations the @w{ISO C} functions can be used in many | |
364 | contexts. In graphical user interfaces, for instance, it is not | |
365 | uncommon to have functions that require text to be displayed in a wide | |
bd3916e8 | 366 | character string if the text is not simple ASCII. The text itself might |
0b2b18a2 UD |
367 | come from a file with translations and the user should decide about the |
368 | current locale, which determines the translation and therefore also the | |
369 | external encoding used. In such a situation (and many others) the | |
370 | functions described here are perfect. If more freedom while performing | |
371 | the conversion is necessary take a look at the @code{iconv} functions | |
372 | (@pxref{Generic Charset Conversion}). | |
373 | ||
374 | @menu | |
375 | * Selecting the Conversion:: Selecting the conversion and its properties. | |
376 | * Keeping the state:: Representing the state of the conversion. | |
377 | * Converting a Character:: Converting Single Characters. | |
378 | * Converting Strings:: Converting Multibyte and Wide Character | |
379 | Strings. | |
380 | * Multibyte Conversion Example:: A Complete Multibyte Conversion Example. | |
381 | @end menu | |
382 | ||
383 | @node Selecting the Conversion | |
384 | @subsection Selecting the conversion and its properties | |
385 | ||
386 | We already said above that the currently selected locale for the | |
d987d219 | 387 | @code{LC_CTYPE} category decides the conversion that is performed |
0b2b18a2 UD |
388 | by the functions we are about to describe. Each locale uses its own |
389 | character set (given as an argument to @code{localedef}) and this is the | |
390 | one assumed as the external multibyte encoding. The wide character | |
a7a93d50 | 391 | set is always UCS-4 in @theglibc{}. |
0b2b18a2 UD |
392 | |
393 | A characteristic of each multibyte character set is the maximum number | |
394 | of bytes that can be necessary to represent one character. This | |
395 | information is quite important when writing code that uses the | |
396 | conversion functions (as shown in the examples below). | |
397 | The @w{ISO C} standard defines two macros that provide this information. | |
398 | ||
399 | ||
0b2b18a2 | 400 | @deftypevr Macro int MB_LEN_MAX |
d08a7e4c | 401 | @standards{ISO, limits.h} |
0b2b18a2 UD |
402 | @code{MB_LEN_MAX} specifies the maximum number of bytes in the multibyte |
403 | sequence for a single character in any of the supported locales. It is | |
404 | a compile-time constant and is defined in @file{limits.h}. | |
405 | @pindex limits.h | |
406 | @end deftypevr | |
407 | ||
0b2b18a2 | 408 | @deftypevr Macro int MB_CUR_MAX |
d08a7e4c | 409 | @standards{ISO, stdlib.h} |
0b2b18a2 UD |
410 | @code{MB_CUR_MAX} expands into a positive integer expression that is the |
411 | maximum number of bytes in a multibyte character in the current locale. | |
412 | The value is never greater than @code{MB_LEN_MAX}. Unlike | |
bd3916e8 | 413 | @code{MB_LEN_MAX} this macro need not be a compile-time constant, and in |
1f77f049 | 414 | @theglibc{} it is not. |
0b2b18a2 UD |
415 | |
416 | @pindex stdlib.h | |
417 | @code{MB_CUR_MAX} is defined in @file{stdlib.h}. | |
418 | @end deftypevr | |
419 | ||
420 | Two different macros are necessary since strictly @w{ISO C90} compilers | |
421 | do not allow variable length array definitions, but still it is desirable | |
422 | to avoid dynamic allocation. This incomplete piece of code shows the | |
423 | problem: | |
424 | ||
425 | @smallexample | |
426 | @{ | |
427 | char buf[MB_LEN_MAX]; | |
428 | ssize_t len = 0; | |
429 | ||
430 | while (! feof (fp)) | |
431 | @{ | |
432 | fread (&buf[len], 1, MB_CUR_MAX - len, fp); | |
95fdc6a0 | 433 | /* @r{@dots{} process} buf */ |
0b2b18a2 UD |
434 | len -= used; |
435 | @} | |
436 | @} | |
437 | @end smallexample | |
438 | ||
439 | The code in the inner loop is expected always to have enough bytes in | |
440 | the array @var{buf} to convert one multibyte character. The array | |
441 | @var{buf} has to be sized statically since many compilers do not allow a | |
bd3916e8 | 442 | variable size. The @code{fread} call makes sure that @code{MB_CUR_MAX} |
0b2b18a2 UD |
443 | bytes are always available in @var{buf}. Note that it isn't |
444 | a problem if @code{MB_CUR_MAX} is not a compile-time constant. | |
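One way the incomplete fragment could be filled in is sketched below. This is an assumption-laden illustration, not the manual's own code: the function name @code{count_mb_chars} is hypothetical, the ``processing'' merely counts characters with @code{mbrtowc}, and an incomplete sequence is treated as an error even though a stateful encoding with long shift sequences might need more than @code{MB_CUR_MAX} bytes.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the multibyte characters in FP using
   the current locale.  Returns -1 for invalid or truncated input.  */
long
count_mb_chars (FILE *fp)
{
  char buf[MB_LEN_MAX];   /* Static size: the compile-time limit.  */
  size_t len = 0;         /* Number of unprocessed bytes in buf.  */
  long count = 0;
  mbstate_t state;

  memset (&state, '\0', sizeof (state));

  for (;;)
    {
      /* Top up the buffer to MB_CUR_MAX bytes, the run-time limit.  */
      size_t want = MB_CUR_MAX;
      if (len < want)
        len += fread (&buf[len], 1, want - len, fp);
      if (len == 0)
        break;            /* Nothing left to convert.  */

      size_t used = mbrtowc (NULL, buf, len, &state);
      if (used == (size_t) -1 || used == (size_t) -2)
        return -1;        /* Invalid or incomplete sequence.  */
      if (used == 0)
        used = 1;         /* A NUL byte; assumed to be one byte long.  */

      /* Drop the consumed bytes and keep the rest for the next round.  */
      memmove (buf, &buf[used], len - used);
      len -= used;
      ++count;
    }
  return count;
}
```

The buffer is declared with @code{MB_LEN_MAX} so that its size is known at compile time, while the amount actually requested from @code{fread} is bounded by @code{MB_CUR_MAX}, which may differ from locale to locale.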
445 | ||
446 | ||
447 | @node Keeping the state | |
448 | @subsection Representing the state of the conversion | |
449 | ||
450 | @cindex stateful | |
451 | In the introduction of this chapter it was said that certain character | |
bd3916e8 | 452 | sets use a @dfn{stateful} encoding. That is, the encoded values depend |
0b2b18a2 UD |
453 | in some way on the previous bytes in the text. |
454 | ||
455 | Since the conversion functions allow converting a text in more than one | |
456 | step we must have a way to pass this information from one call of the | |
457 | functions to another. | |
458 | ||
0b2b18a2 | 459 | @deftp {Data type} mbstate_t |
d08a7e4c | 460 | @standards{ISO, wchar.h} |
0b2b18a2 UD |
461 | @cindex shift state |
462 | A variable of type @code{mbstate_t} can contain all the information | |
463 | about the @dfn{shift state} needed from one call to a conversion | |
464 | function to another. | |
465 | ||
466 | @pindex wchar.h | |
467 | @code{mbstate_t} is defined in @file{wchar.h}. It was introduced in | |
468 | @w{Amendment 1} to @w{ISO C90}. | |
469 | @end deftp | |
470 | ||
bd3916e8 UD |
471 | To use objects of type @code{mbstate_t} the programmer has to define such |
472 | objects (normally as local variables on the stack) and pass a pointer to | |
0b2b18a2 UD |
473 | the object to the conversion functions. This way the conversion function |
474 | can update the object if the current multibyte character set is stateful. | |
475 | ||
476 | There is no specific function or initializer to put the state object in | |
477 | any specific state. The rules are that the object should always | |
478 | represent the initial state before the first use, and this is achieved by | |
479 | clearing the whole variable with code such as follows: | |
480 | ||
481 | @smallexample | |
482 | @{ | |
483 | mbstate_t state; | |
484 | memset (&state, '\0', sizeof (state)); | |
485 | /* @r{from now on @var{state} can be used.} */ | |
95fdc6a0 | 486 | @dots{} |
0b2b18a2 UD |
487 | @} |
488 | @end smallexample | |
489 | ||
490 | When using the conversion functions to generate output it is often | |
491 | necessary to test whether the current state corresponds to the initial | |
492 | state. This is necessary, for example, to decide whether to emit | |
493 | escape sequences to set the state to the initial state at certain | |
494 | sequence points. Communication protocols often require this. | |
495 | ||
0b2b18a2 | 496 | @deftypefun int mbsinit (const mbstate_t *@var{ps}) |
d08a7e4c | 497 | @standards{ISO, wchar.h} |
86e60666 AO |
498 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} |
499 | @c ps is dereferenced once, unguarded. This would call for @mtsrace:ps, | |
500 | @c but since a single word-sized field is (atomically) accessed, any | |
501 | @c race here would be harmless. Other functions that take an optional | |
502 | @c mbstate_t* argument named ps are marked with @mtasurace:<func>/!ps, | |
503 | @c to indicate that the function uses a static buffer if ps is NULL. | |
504 | @c These could also have been marked with @mtsrace:ps, but we'll omit | |
505 | @c that for brevity, for it's somewhat redundant with the @mtasurace. | |
bd3916e8 UD |
506 | The @code{mbsinit} function determines whether the state object pointed |
507 | to by @var{ps} is in the initial state. If @var{ps} is a null pointer or | |
508 | the object is in the initial state the return value is nonzero. Otherwise | |
0b2b18a2 UD |
509 | it is zero. |
510 | ||
511 | @pindex wchar.h | |
bd3916e8 | 512 | @code{mbsinit} was introduced in @w{Amendment 1} to @w{ISO C90} and is |
0b2b18a2 UD |
513 | declared in @file{wchar.h}. |
514 | @end deftypefun | |
515 | ||
bd3916e8 | 516 | Code using @code{mbsinit} often looks similar to this: |
0b2b18a2 UD |
517 | |
518 | @c Fix the example to explicitly say how to generate the escape sequence | |
519 | @c to restore the initial state. | |
520 | @smallexample | |
521 | @{ | |
522 | mbstate_t state; | |
523 | memset (&state, '\0', sizeof (state)); | |
524 | /* @r{Use @var{state}.} */ | |
95fdc6a0 | 525 | @dots{} |
0b2b18a2 UD |
526 | if (! mbsinit (&state)) |
527 | @{ | |
528 | /* @r{Emit code to return to initial state.} */ | |
529 | const wchar_t empty[] = L""; | |
530 | const wchar_t *srcp = empty; | |
531 | wcsrtombs (outbuf, &srcp, outbuflen, &state); | |
532 | @} | |
95fdc6a0 | 533 | @dots{} |
0b2b18a2 UD |
534 | @} |
535 | @end smallexample | |
536 | ||
537 | The code to emit the escape sequence to get back to the initial state is | |
538 | interesting. The @code{wcsrtombs} function can be used to determine the | |
a7a93d50 JM |
539 | necessary output code (@pxref{Converting Strings}). Please note that with |
540 | @theglibc{} it is not necessary to perform this extra action for the | |
0b2b18a2 UD |
541 | conversion from multibyte text to wide character text since the wide |
542 | character encoding is not stateful. But there is nothing mentioned in | |
d987d219 | 543 | any standard that prohibits making @code{wchar_t} use a stateful |
0b2b18a2 UD |
544 | encoding. |
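The relationship between clearing an @code{mbstate_t} object and @code{mbsinit} can be checked with a few lines. The two helper names below are hypothetical; the second one additionally assumes that the single byte @code{'a'} is a complete character in the current locale, which holds for ASCII-based character sets.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: nonzero if a freshly cleared state object is
   reported as being in the initial state.  */
int
cleared_state_is_initial (void)
{
  mbstate_t state;

  memset (&state, '\0', sizeof (state));
  return mbsinit (&state) != 0;
}

/* Hypothetical helper: nonzero if the state is initial again after
   one complete single-byte character has been converted.  */
int
state_initial_after_char (void)
{
  mbstate_t state;
  wchar_t wc;

  memset (&state, '\0', sizeof (state));
  size_t n = mbrtowc (&wc, "a", 1, &state);
  return n == 1 && mbsinit (&state) != 0;
}
```

With a stateful multibyte encoding the second check could fail after a shift sequence has been consumed, which is exactly the situation in which the reset code shown above is needed.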
545 | ||
546 | @node Converting a Character | |
547 | @subsection Converting Single Characters | |
548 | ||
549 | The most fundamental of the conversion functions are those dealing with | |
550 | single characters. Please note that this does not always mean single | |
551 | bytes. But since there is very often a subset of the multibyte | |
552 | character set that consists of single byte sequences, there are | |
d987d219 | 553 | functions to help with converting bytes. Frequently, ASCII is a subset |
bd3916e8 UD |
554 | of the multibyte character set. In such a scenario, each ASCII character |
555 | stands for itself, and all other characters have at least a first byte | |
0b2b18a2 UD |
556 | that is beyond the range @math{0} to @math{127}. |
557 | ||
0b2b18a2 | 558 | @deftypefun wint_t btowc (int @var{c}) |
d08a7e4c | 559 | @standards{ISO, wchar.h} |
86e60666 AO |
560 | @safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
561 | @c Calls btowc_fct or __fct; reads from locale, and from the | |
562 | @c get_gconv_fcts result multiple times. get_gconv_fcts calls | |
563 | @c __wcsmbs_load_conv to initialize the ctype if it's null. | |
564 | @c wcsmbs_load_conv takes a non-recursive wrlock before allocating | |
565 | @c memory for the fcts structure, initializing it, and then storing it | |
566 | @c in the locale object. The initialization involves dlopening and a | |
567 | @c lot more. | |
0b2b18a2 UD |
568 | The @code{btowc} function (``byte to wide character'') converts a valid |
569 | single byte character @var{c} in the initial shift state into the wide | |
570 | character equivalent using the conversion rules from the currently | |
571 | selected locale of the @code{LC_CTYPE} category. | |
572 | ||
573 | If @code{(unsigned char) @var{c}} is no valid single byte multibyte | |
574 | character or if @var{c} is @code{EOF}, the function returns @code{WEOF}. | |
575 | ||
576 | Please note the restriction of @var{c} being tested for validity only in | |
577 | the initial shift state. No @code{mbstate_t} object is used from | |
578 | which the state information is taken, and the function also does not use | |
579 | any static state. | |
580 | ||
581 | @pindex wchar.h | |
bd3916e8 | 582 | The @code{btowc} function was introduced in @w{Amendment 1} to @w{ISO C90} |
0b2b18a2 UD |
583 | and is declared in @file{wchar.h}. |
584 | @end deftypefun | |
585 | ||
82acaacb JM |
586 | Despite the limitation that the single byte value is always interpreted |
587 | in the initial state, this function is actually useful most of the time. | |
0b2b18a2 | 588 | Most character sets are either entirely single-byte character sets or they |
d987d219 | 589 | are extensions to ASCII. It is then possible to write code like this |
0b2b18a2 UD |
590 | (not that this specific example is very useful): |
591 | ||
592 | @smallexample | |
593 | wchar_t * | |
594 | itow (unsigned long int val) | |
595 | @{ | |
596 | static wchar_t buf[30]; | |
597 | wchar_t *wcp = &buf[29]; | |
598 | *wcp = L'\0'; | |
599 | while (val != 0) | |
600 | @{ | |
601 | *--wcp = btowc ('0' + val % 10); | |
602 | val /= 10; | |
603 | @} | |
604 | if (wcp == &buf[29]) | |
605 | *--wcp = L'0'; | |
606 | return wcp; | |
607 | @} | |
608 | @end smallexample | |
609 | ||
610 | Why is it necessary to use such a complicated implementation and not | |
611 | simply cast @code{'0' + val % 10} to a wide character? The answer is | |
612 | that there is no guarantee that one can perform this kind of arithmetic | |
613 | on the characters of the character set used for @code{wchar_t} | |
614 | representation. In other situations the bytes are not constant at | |
615 | compile time and so the compiler cannot do the work. In situations like | |
82acaacb | 616 | this, using @code{btowc} is required. |
0b2b18a2 UD |
617 | |
618 | @noindent | |
82acaacb | 619 | There is also a function for the conversion in the other direction. |
0b2b18a2 | 620 | |
0b2b18a2 | 621 | @deftypefun int wctob (wint_t @var{c}) |
d08a7e4c | 622 | @standards{ISO, wchar.h} |
86e60666 | 623 | @safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
0b2b18a2 UD |
624 | The @code{wctob} function (``wide character to byte'') takes as the |
625 | parameter a valid wide character. If the multibyte representation for | |
626 | this character in the initial state is exactly one byte long, the return | |
627 | value of this function is that single byte. Otherwise the return value is | |
628 | @code{EOF}. | |
629 | ||
630 | @pindex wchar.h | |
631 | @code{wctob} was introduced in @w{Amendment 1} to @w{ISO C90} and | |
632 | is declared in @file{wchar.h}. | |
633 | @end deftypefun | |
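
As a small illustration of how the two single-byte functions fit
together, the following sketch checks whether a byte survives a round
trip through @code{btowc} and @code{wctob} in the current locale.  The
helper name @code{byte_round_trips} is hypothetical and not part of
@theglibc{}; the sketch assumes nothing beyond the behavior described
above.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper, not a library function: return 1 if the byte C
   is a valid single-byte character in the initial shift state and maps
   back to the same byte, 0 otherwise.  */
int
byte_round_trips (int c)
{
  wint_t wc = btowc (c);

  /* btowc reports EOF and invalid bytes as WEOF.  */
  if (wc == WEOF)
    return 0;

  /* wctob returns EOF unless the wide character has a one-byte
     multibyte representation.  */
  return wctob (wc) == (int) (unsigned char) c;
}
```

In a locale whose character set extends ASCII, every ASCII byte is
expected to pass this check, while @code{EOF} never does.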
634 | ||
d987d219 | 635 | There are more general functions to convert single characters from |
0b2b18a2 UD |
636 | multibyte representation to wide characters and vice versa. These |
637 | functions pose no limit on the length of the multibyte representation | |
638 | and they also do not require it to be in the initial state. | |
639 | ||
0b2b18a2 | 640 | @deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps}) |
d08a7e4c | 641 | @standards{ISO, wchar.h} |
86e60666 | 642 | @safety{@prelim{}@mtunsafe{@mtasurace{:mbrtowc/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
0b2b18a2 UD |
643 | @cindex stateful |
644 | The @code{mbrtowc} function (``multibyte restartable to wide | |
645 | character'') converts the next multibyte character in the string pointed | |
cf138b0c FW |
646 | to by @var{s} into a wide character and stores it in the location |
647 | pointed to by @var{pwc}. The conversion is performed according | |
0b2b18a2 UD |
648 | to the locale currently selected for the @code{LC_CTYPE} category. If |
649 | the conversion for the character set used in the locale requires a state, | |
650 | the multibyte string is interpreted in the state represented by the | |
651 | object pointed to by @var{ps}. If @var{ps} is a null pointer, a static, | |
652 | internal state variable used only by the @code{mbrtowc} function is | |
653 | used. | |
654 | ||
cf138b0c | 655 | If the next multibyte character corresponds to the null wide character, |
0b2b18a2 UD |
656 | the return value of the function is @math{0} and the state object is |
657 | afterwards in the initial state. If the next @var{n} or fewer bytes | |
658 | form a correct multibyte character, the return value is the number of | |
659 | bytes starting from @var{s} that form the multibyte character. The | |
660 | conversion state is updated according to the bytes consumed in the | |
661 | conversion. In both cases the wide character (either the @code{L'\0'} | |
662 | or the one found in the conversion) is stored in the location pointed to | |
663 | by @var{pwc} if @var{pwc} is not null. | |
664 | ||
665 | If the first @var{n} bytes of the multibyte string possibly form a valid | |
666 | multibyte character but there are more than @var{n} bytes needed to | |
667 | complete it, the return value of the function is @code{(size_t) -2} and | |
cf138b0c FW |
668 | no value is stored in @code{*@var{pwc}}. The conversion state is |
669 | updated and all @var{n} input bytes are consumed and should not be | |
670 | submitted again. Please note that this can happen even if @var{n} has a | |
671 | value greater than or equal to @code{MB_CUR_MAX} since the input might | |
672 | contain redundant shift sequences. | |
0b2b18a2 UD |
673 | |
674 | If the first @var{n} bytes of the multibyte string cannot possibly form | |
675 | a valid multibyte character, no value is stored, the global variable | |
676 | @code{errno} is set to the value @code{EILSEQ}, and the function returns | |
677 | @code{(size_t) -1}. The conversion state is afterwards undefined. | |
678 | ||
cf138b0c FW |
679 | As specified, the @code{mbrtowc} function could deal with multibyte |
680 | sequences which contain embedded null bytes (which happens in Unicode | |
681 | encodings such as UTF-16), but @theglibc{} does not support such | |
682 | multibyte encodings. When encountering a null input byte, the function | |
683 | will either return zero, or return @code{(size_t) -1} and report a | |
684 | @code{EILSEQ} error. The @code{iconv} function can be used for | |
685 | converting between arbitrary encodings. @xref{Generic Conversion | |
686 | Interface}. | |
687 | ||
0b2b18a2 UD |
688 | @pindex wchar.h |
689 | @code{mbrtowc} was introduced in @w{Amendment 1} to @w{ISO C90} and | |
690 | is declared in @file{wchar.h}. | |
691 | @end deftypefun | |
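
The restartable nature of @code{mbrtowc} is easiest to see when the
input arrives in pieces.  The following sketch counts the characters in
a buffer while handing @code{mbrtowc} only a limited number of bytes per
call, so that the @code{(size_t) -2} case described above can occur.
The function name @code{count_chunked} and its parameters are
hypothetical; the sketch also assumes an encoding in which a null byte
always occupies exactly one byte, as is the case for the encodings
supported by @theglibc{}.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in the LEN bytes at BUF,
   passing at most CHUNK bytes to mbrtowc per call.  Returns
   (size_t) -1 for invalid input or input that ends in the middle of a
   character.  */
size_t
count_chunked (const char *buf, size_t len, size_t chunk)
{
  mbstate_t state;
  size_t count = 0;
  int pending = 0;

  if (chunk == 0)
    return (size_t) -1;
  memset (&state, '\0', sizeof (state));
  while (len > 0)
    {
      size_t avail = len < chunk ? len : chunk;
      size_t nbytes = mbrtowc (NULL, buf, avail, &state);

      if (nbytes == (size_t) -1)
        return (size_t) -1;
      if (nbytes == (size_t) -2)
        {
          /* All AVAIL bytes were absorbed into STATE and must not be
             submitted again; continue with the bytes that follow.  */
          buf += avail;
          len -= avail;
          pending = 1;
          continue;
        }
      if (nbytes == 0)
        /* A null byte was converted; step over it.  */
        nbytes = 1;
      buf += nbytes;
      len -= nbytes;
      pending = 0;
      ++count;
    }
  return pending ? (size_t) -1 : count;
}
```

Because the partially consumed bytes live in @code{state}, the caller
only has to advance past them; no bytes are ever passed to
@code{mbrtowc} twice.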
692 | ||
cf138b0c FW |
693 | A function that copies a multibyte string into a wide character string |
694 | while at the same time converting all lowercase characters into | |
695 | uppercase could look like this: | |
0b2b18a2 UD |
696 | |
697 | @smallexample | |
0f339252 | 698 | @include mbstouwcs.c.texi |
0b2b18a2 UD |
699 | @end smallexample |
700 | ||
cf138b0c FW |
701 | In the inner loop, a single wide character is stored in @code{wc}, and |
702 | the number of consumed bytes is stored in the variable @code{nbytes}. | |
703 | If the conversion is successful, the uppercase variant of the wide | |
690c3475 | 704 | character is stored in the @code{result} array and the pointer to the |
cf138b0c FW |
705 | input string and the number of available bytes are adjusted. If the | |
706 | @code{mbrtowc} function returns zero, the null input byte has not been | |
707 | converted, so it must be stored explicitly in the result. | |
708 | ||
709 | The above code uses the fact that there can never be more wide | |
710 | characters in the converted result than there are bytes in the multibyte | |
711 | input string. This method yields a pessimistic guess about the size of | |
712 | the result, and if many wide character strings have to be constructed | |
713 | this way or if the strings are long, the extra memory required to be | |
714 | allocated because the input string contains multibyte characters might | |
715 | be significant. The allocated memory block can be resized to the | |
716 | correct size before returning it, but a better solution might be to | |
717 | allocate just the right amount of space for the result right away. | |
718 | Unfortunately there is no function to compute the length of the wide | |
719 | character string directly from the multibyte string. There is, however, | |
720 | a function that does part of the work. | |
0b2b18a2 | 721 | |
0b2b18a2 | 722 | @deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps}) |
d08a7e4c | 723 | @standards{ISO, wchar.h} |
86e60666 | 724 | @safety{@prelim{}@mtunsafe{@mtasurace{:mbrlen/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
0b2b18a2 UD |
725 | The @code{mbrlen} function (``multibyte restartable length'') computes |
726 | how many of the at most @var{n} bytes starting at @var{s} form the | |
727 | next valid and complete multibyte character. | |
728 | ||
729 | If the next multibyte character corresponds to the NUL wide character, | |
730 | the return value is @math{0}. If the next @var{n} bytes form a valid | |
731 | multibyte character, the number of bytes belonging to this multibyte | |
732 | character byte sequence is returned. | |
733 | ||
11bf311e | 734 | If the first @var{n} bytes possibly form a valid multibyte |
bd3916e8 UD |
735 | character but the character is incomplete, the return value is |
736 | @code{(size_t) -2}. Otherwise the multibyte character sequence is invalid | |
0b2b18a2 UD |
737 | and the return value is @code{(size_t) -1}. |
738 | ||
739 | The multibyte sequence is interpreted in the state represented by the | |
740 | object pointed to by @var{ps}. If @var{ps} is a null pointer, a state | |
741 | object local to @code{mbrlen} is used. | |
742 | ||
743 | @pindex wchar.h | |
744 | @code{mbrlen} was introduced in @w{Amendment 1} to @w{ISO C90} and | |
745 | is declared in @file{wchar.h}. | |
746 | @end deftypefun | |
747 | ||
bd3916e8 | 748 | The attentive reader now will note that @code{mbrlen} can be implemented |
0b2b18a2 UD |
749 | as |
750 | ||
751 | @smallexample | |
752 | mbrtowc (NULL, s, n, ps != NULL ? ps : &internal) | |
753 | @end smallexample | |
754 | ||
755 | This is true and in fact is mentioned in the official specification. | |
756 | How can this function be used to determine the length of the wide | |
757 | character string created from a multibyte character string? It is not | |
758 | directly usable, but we can define a function @code{mbslen} using it: | |
759 | ||
760 | @smallexample | |
761 | size_t | |
762 | mbslen (const char *s) | |
763 | @{ | |
764 | mbstate_t state; | |
765 | size_t result = 0; | |
766 | size_t nbytes; | |
767 | memset (&state, '\0', sizeof (state)); | |
768 | while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0) | |
769 | @{ | |
770 | if (nbytes >= (size_t) -2) | |
771 | /* @r{Something is wrong.} */ | |
772 | return (size_t) -1; | |
773 | s += nbytes; | |
774 | ++result; | |
775 | @} | |
776 | return result; | |
777 | @} | |
778 | @end smallexample | |
779 | ||
780 | This function simply calls @code{mbrlen} for each multibyte character | |
781 | in the string and counts the number of function calls. Please note that | |
782 | we here use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen} | |
f24a6d08 | 783 | call. This is acceptable since a) this value is at least as large as the length of |
bd3916e8 UD |
784 | the longest multibyte character sequence and b) we know that the string |
785 | @var{s} ends with a NUL byte, which cannot be part of any other multibyte | |
786 | character sequence but the one representing the NUL wide character. | |
0b2b18a2 UD |
787 | Therefore, the @code{mbrlen} function will never read invalid memory. |
788 | ||
789 | Now that this function is available (just to make this clear, this | |
1f77f049 | 790 | function is @emph{not} part of @theglibc{}) we can compute the |
d987d219 | 791 | number of bytes required to store the wide character form of the multibyte |
0b2b18a2 UD |
792 | character string @var{s} using |
793 | ||
794 | @smallexample | |
795 | wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t); | |
796 | @end smallexample | |
797 | ||
798 | Please note that the @code{mbslen} function is quite inefficient. The | |
bd3916e8 UD |
799 | implementation of @code{mbstouwcs} with @code{mbslen} would have to |
800 | perform the conversion of the multibyte character input string twice, and | |
801 | this conversion might be quite expensive. So it is necessary to think | |
802 | about the consequences of using the easier but imprecise method before | |
0b2b18a2 UD |
803 | doing the work twice. |
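
The alternative mentioned earlier, allocating pessimistically and
resizing afterwards, can be sketched as follows.  This is only an
illustration under the assumption that one conversion pass plus a
@code{realloc} call is cheaper than two passes; the function name
@code{mbstowcs_shrink} is hypothetical and not part of @theglibc{}.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the multibyte string S into a newly
   allocated wide character string.  Returns NULL on allocation
   failure or invalid input.  */
wchar_t *
mbstowcs_shrink (const char *s)
{
  /* Include the terminating null byte so mbrtowc can convert it.  */
  size_t remaining = strlen (s) + 1;
  /* There are never more wide characters than input bytes.  */
  wchar_t *result = malloc (remaining * sizeof (wchar_t));
  wchar_t *wcp = result;
  wchar_t *shrunk;
  mbstate_t state;
  size_t nbytes;

  if (result == NULL)
    return NULL;
  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrtowc (wcp, s, remaining, &state)) > 0)
    {
      if (nbytes >= (size_t) -2)
        {
          /* Invalid or incomplete multibyte sequence.  */
          free (result);
          return NULL;
        }
      s += nbytes;
      remaining -= nbytes;
      ++wcp;
    }
  /* mbrtowc returned 0 and already stored L'\0' at *WCP.
     Give back the memory that was not needed.  */
  shrunk = realloc (result, (wcp - result + 1) * sizeof (wchar_t));
  return shrunk != NULL ? shrunk : result;
}
```

If shrinking fails, the original (larger) block is still valid and is
returned unchanged.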
804 | ||
0b2b18a2 | 805 | @deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps}) |
d08a7e4c | 806 | @standards{ISO, wchar.h} |
86e60666 AO |
807 | @safety{@prelim{}@mtunsafe{@mtasurace{:wcrtomb/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
808 | @c wcrtomb uses a static, non-thread-local unguarded state variable when | |
809 | @c PS is NULL. When a state is passed in, and it's not used | |
810 | @c concurrently in other threads, this function behaves safely as long | |
811 | @c as gconv modules don't bring MT safety issues of their own. | |
812 | @c Attempting to load gconv modules or to build conversion chains in | |
813 | @c signal handlers may encounter gconv databases or caches in a | |
814 | @c partially-updated state, and asynchronous cancellation may leave them | |
815 | @c in such states, besides leaking the lock that guards them. | |
816 | @c get_gconv_fcts ok | |
817 | @c wcsmbs_load_conv ok | |
818 | @c norm_add_slashes ok | |
819 | @c wcsmbs_getfct ok | |
820 | @c gconv_find_transform ok | |
821 | @c gconv_read_conf (libc_once) | |
822 | @c gconv_lookup_cache ok | |
823 | @c find_module_idx ok | |
824 | @c find_module ok | |
825 | @c gconv_find_shlib (ok) | |
826 | @c ->init_fct (assumed ok) | |
827 | @c gconv_get_builtin_trans ok | |
828 | @c gconv_release_step ok | |
829 | @c do_lookup_alias ok | |
830 | @c find_derivation ok | |
831 | @c derivation_lookup ok | |
832 | @c increment_counter ok | |
833 | @c gconv_find_shlib ok | |
834 | @c step->init_fct (assumed ok) | |
835 | @c gen_steps ok | |
836 | @c gconv_find_shlib ok | |
837 | @c dlopen (presumed ok) | |
838 | @c dlsym (presumed ok) | |
839 | @c step->init_fct (assumed ok) | |
840 | @c step->end_fct (assumed ok) | |
841 | @c gconv_get_builtin_trans ok | |
842 | @c gconv_release_step ok | |
843 | @c add_derivation ok | |
844 | @c gconv_close_transform ok | |
845 | @c gconv_release_step ok | |
846 | @c step->end_fct (assumed ok) | |
847 | @c gconv_release_shlib ok | |
848 | @c dlclose (presumed ok) | |
849 | @c gconv_release_cache ok | |
850 | @c ->tomb->__fct (assumed ok) | |
0b2b18a2 UD |
851 | The @code{wcrtomb} function (``wide character restartable to |
852 | multibyte'') converts a single wide character into a multibyte string | |
853 | corresponding to that wide character. | |
854 | ||
855 | If @var{s} is a null pointer, the function resets the state stored in | |
d987d219 | 856 | the object pointed to by @var{ps} (or the internal @code{mbstate_t} |
0b2b18a2 UD |
857 | object) to the initial state. This can also be achieved by a call like |
858 | this: | |
859 | ||
860 | @smallexample | |
861 | wcrtomb (temp_buf, L'\0', ps) | |
862 | @end smallexample | |
863 | ||
864 | @noindent | |
865 | since, if @var{s} is a null pointer, @code{wcrtomb} performs as if it | |
866 | writes into an internal buffer, which is guaranteed to be large enough. | |
867 | ||
868 | If @var{wc} is the NUL wide character, @code{wcrtomb} emits, if | |
869 | necessary, a shift sequence to get the state @var{ps} into the initial | |
bd3916e8 | 870 | state followed by a single NUL byte, which is stored in the string |
0b2b18a2 UD |
871 | @var{s}. |
872 | ||
bd3916e8 UD |
873 | Otherwise a byte sequence (possibly including shift sequences) is written |
874 | into the string @var{s}. This only happens if @var{wc} is a valid wide | |
875 | character (i.e., it has a multibyte representation in the character set | |
876 | selected by the locale of the @code{LC_CTYPE} category). If @var{wc} is not a | |
877 | valid wide character, nothing is stored in the string @var{s}, | |
878 | @code{errno} is set to @code{EILSEQ}, the conversion state in @var{ps} | |
0b2b18a2 UD |
879 | is undefined and the return value is @code{(size_t) -1}. |
880 | ||
881 | If no error occurred the function returns the number of bytes stored in | |
882 | the string @var{s}. This includes all bytes representing shift | |
883 | sequences. | |
884 | ||
885 | One word about the interface of the function: there is no parameter | |
886 | specifying the length of the array @var{s}. Instead the function | |
887 | assumes that there are at least @code{MB_CUR_MAX} bytes available since | |
888 | this is the maximum length of any byte sequence representing a single | |
889 | character. So the caller has to make sure that there is enough space | |
890 | available, otherwise buffer overruns can occur. | |
891 | ||
892 | @pindex wchar.h | |
893 | @code{wcrtomb} was introduced in @w{Amendment 1} to @w{ISO C90} and is | |
894 | declared in @file{wchar.h}. | |
895 | @end deftypefun | |
896 | ||
897 | Using @code{wcrtomb} is as easy as using @code{mbrtowc}. The following | |
898 | example appends a wide character string to a multibyte character string. | |
899 | Again, the code is not really useful (or correct); it is simply here to | |
900 | demonstrate the use and some problems. | |
901 | ||
902 | @smallexample | |
903 | char * | |
904 | mbscatwcs (char *s, size_t len, const wchar_t *ws) | |
905 | @{ | |
906 | mbstate_t state; | |
907 | /* @r{Find the end of the existing string.} */ | |
908 | char *wp = strchr (s, '\0'); | |
909 | len -= wp - s; | |
910 | memset (&state, '\0', sizeof (state)); | |
911 | do | |
912 | @{ | |
913 | size_t nbytes; | |
914 | if (len < MB_CUR_MAX) | |
915 | @{ | |
916 | /* @r{We cannot guarantee that the next} | |
917 | @r{character fits into the buffer, so} | |
918 | @r{return an error.} */ | |
919 | errno = E2BIG; | |
920 | return NULL; | |
921 | @} | |
922 | nbytes = wcrtomb (wp, *ws, &state); | |
923 | if (nbytes == (size_t) -1) | |
924 | /* @r{Error in the conversion.} */ | |
925 | return NULL; | |
926 | len -= nbytes; | |
927 | wp += nbytes; | |
928 | @} | |
929 | while (*ws++ != L'\0'); | |
930 | return s; | |
931 | @} | |
932 | @end smallexample | |
933 | ||
934 | First the function has to find the end of the string currently in the | |
935 | array @var{s}. The @code{strchr} call does this very efficiently since a | |
936 | requirement for multibyte character representations is that the NUL byte | |
937 | is never used except to represent itself (and in this context, the end | |
938 | of the string). | |
939 | ||
940 | After initializing the state object the loop is entered where the first | |
941 | task is to make sure there is enough room in the array @var{s}. We | |
942 | abort if there are not at least @code{MB_CUR_MAX} bytes available. This | |
943 | is not always optimal but we have no other choice. We might have fewer | |
944 | than @code{MB_CUR_MAX} bytes available but the next multibyte character | |
945 | might also be only one byte long. At the time the @code{wcrtomb} call | |
bd3916e8 UD |
946 | returns it is too late to decide whether the buffer was large enough. If |
947 | this solution is unsuitable, there is a very slow but more accurate | |
0b2b18a2 UD |
948 | solution. |
949 | ||
950 | @smallexample | |
95fdc6a0 | 951 | @dots{} |
0b2b18a2 UD |
952 | if (len < MB_CUR_MAX) | |
953 | @{ | |
954 | mbstate_t temp_state = state; | |
955 | char temp_buf[MB_LEN_MAX]; | |
956 | if (wcrtomb (temp_buf, *ws, &temp_state) > len) | |
957 | @{ | |
958 | /* @r{We cannot guarantee that the next} | |
959 | @r{character fits into the buffer, so} | |
960 | @r{return an error.} */ | |
961 | errno = E2BIG; | |
962 | return NULL; | |
963 | @} | |
964 | @} | |
95fdc6a0 | 965 | @dots{} |
0b2b18a2 UD |
966 | @end smallexample | |
967 | ||
bd3916e8 UD |
968 | Here we first perform the conversion that might overflow the buffer into a | |
969 | scratch array, so that we can afterwards make an exact decision about the | |
970 | buffer size. Note that the trial call cannot pass @code{NULL} as the | |
971 | destination: as described above, @code{wcrtomb} would then only reset the | |
972 | state instead of converting @code{*ws}. The | |
973 | most unusual thing about this piece of code certainly is the duplication | |
974 | of the conversion state object, but if a change of the state is necessary | |
975 | to emit the next multibyte character, we want to have the same shift state | |
976 | change performed in the real conversion. Therefore, we have to preserve | |
0b2b18a2 UD |
977 | the initial shift state information. |
978 | ||
979 | There are certainly many more and even better solutions to this problem. | |
980 | This example is only provided for educational purposes. | |
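
One such alternative, shown here only as a sketch, converts every
character into a scratch array of @code{MB_LEN_MAX} bytes and copies it
into the destination once its exact length is known.  This avoids both
the pessimistic check and the duplicated state object.  The function
name @code{mbscatwcs_scratch} is hypothetical; as in the example above,
@var{len} is assumed to be the total size in bytes of the array
@var{s}.

```c
#include <errno.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical variant of mbscatwcs: append WS to the multibyte string
   in S, whose array holds LEN bytes.  On error, S is left unchanged
   and NULL is returned.  */
char *
mbscatwcs_scratch (char *s, size_t len, const wchar_t *ws)
{
  mbstate_t state;
  /* Find the end of the existing string.  */
  char *end = strchr (s, '\0');
  char *wp = end;

  len -= wp - s;
  memset (&state, '\0', sizeof (state));
  do
    {
      char scratch[MB_LEN_MAX];
      size_t nbytes = wcrtomb (scratch, *ws, &state);

      if (nbytes == (size_t) -1 || nbytes > len)
        {
          /* Conversion error (errno is EILSEQ) or not enough room.  */
          if (nbytes != (size_t) -1)
            errno = E2BIG;
          /* Restore the original terminator.  */
          *end = '\0';
          return NULL;
        }
      memcpy (wp, scratch, nbytes);
      wp += nbytes;
      len -= nbytes;
    }
  while (*ws++ != L'\0');
  return s;
}
```

The extra copy costs a few bytes per character, but the decision about
the remaining space is exact for every character.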
981 | ||
982 | @node Converting Strings | |
983 | @subsection Converting Multibyte and Wide Character Strings | |
984 | ||
985 | The functions described in the previous section only convert a single | |
986 | character at a time. Most operations to be performed in real-world | |
987 | programs include strings and therefore the @w{ISO C} standard also | |
988 | defines conversions on entire strings. However, the defined set of | |
1f77f049 | 989 | functions is quite limited; therefore, @theglibc{} contains a few |
0b2b18a2 UD |
990 | extensions that can help in some important situations. |
991 | ||
0b2b18a2 | 992 | @deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps}) |
d08a7e4c | 993 | @standards{ISO, wchar.h} |
86e60666 | 994 | @safety{@prelim{}@mtunsafe{@mtasurace{:mbsrtowcs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
0b2b18a2 | 995 | The @code{mbsrtowcs} function (``multibyte string restartable to wide |
d987d219 | 996 | character string'') converts the NUL-terminated multibyte character |
0b2b18a2 UD |
997 | string at @code{*@var{src}} into an equivalent wide character string, |
998 | including the NUL wide character at the end. The conversion is started | |
999 | using the state information from the object pointed to by @var{ps} or | |
1000 | from an internal object of @code{mbsrtowcs} if @var{ps} is a null | |
bd3916e8 | 1001 | pointer. Before returning, the state object is updated to match the state |
0b2b18a2 UD |
1002 | after the last converted character. The state is the initial state if the |
1003 | terminating NUL byte is reached and converted. | |
1004 | ||
1005 | If @var{dst} is not a null pointer, the result is stored in the array | |
1006 | pointed to by @var{dst}; otherwise, the conversion result is not | |
1007 | available since it is stored in an internal buffer. | |
1008 | ||
1009 | If @var{len} wide characters are stored in the array @var{dst} before | |
1010 | reaching the end of the input string, the conversion stops and @var{len} | |
1011 | is returned. If @var{dst} is a null pointer, @var{len} is never checked. | |
1012 | ||
1013 | Another reason for a premature return from the function call is if the | |
1014 | input string contains an invalid multibyte sequence. In this case the | |
1015 | global variable @code{errno} is set to @code{EILSEQ} and the function | |
1016 | returns @code{(size_t) -1}. | |
1017 | ||
1018 | @c XXX The ISO C9x draft seems to have a problem here. It says that PS | |
1019 | @c is not updated if DST is NULL. This is not said straightforward and | |
1020 | @c none of the other functions is described like this. It would make sense | |
1021 | @c to define the function this way but I don't think it is meant like this. | |
1022 | ||
1023 | In all other cases the function returns the number of wide characters | |
1024 | converted during this call. If @var{dst} is not null, @code{mbsrtowcs} | |
bd3916e8 | 1025 | stores in the pointer pointed to by @var{src} either a null pointer (if |
0b2b18a2 UD |
1026 | the NUL byte in the input string was reached) or the address of the byte |
1027 | following the last converted multibyte character. | |
1028 | ||
61af4bbb CD |
1029 | As with @code{mbstowcs}, the @var{dst} parameter may be a null pointer and | |
1030 | the function can be used to count the number of wide characters that | |
1031 | would be required. | |
1032 | ||
0b2b18a2 UD |
1033 | @pindex wchar.h |
1034 | @code{mbsrtowcs} was introduced in @w{Amendment 1} to @w{ISO C90} and is | |
1035 | declared in @file{wchar.h}. | |
1036 | @end deftypefun | |
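
The counting mode just described makes it possible to allocate exactly
the right amount of memory before converting.  The following sketch
shows one way to do this; the function name @code{mbs_to_wcs_alloc} is
hypothetical, and the input is assumed to start in the initial shift
state.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the multibyte string S into a newly
   allocated wide character string of exactly the required size.
   Returns NULL on allocation failure or invalid input.  */
wchar_t *
mbs_to_wcs_alloc (const char *s)
{
  mbstate_t state;
  const char *src = s;
  wchar_t *result;
  size_t n;

  /* First pass: DST is null, so only the length is computed and
     SRC is left untouched.  */
  memset (&state, '\0', sizeof (state));
  n = mbsrtowcs (NULL, &src, 0, &state);
  if (n == (size_t) -1)
    return NULL;

  result = malloc ((n + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  /* Second pass: convert for real, including the terminating
     NUL wide character.  */
  src = s;
  memset (&state, '\0', sizeof (state));
  if (mbsrtowcs (result, &src, n + 1, &state) == (size_t) -1)
    {
      free (result);
      return NULL;
    }
  return result;
}
```

The price for the exact allocation is that the input is scanned twice.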
1037 | ||
bd3916e8 UD |
1038 | The definition of the @code{mbsrtowcs} function has one important |
1039 | limitation. The requirement that the source string has to be a NUL-terminated | |
0b2b18a2 | 1040 | string causes problems if one wants to convert buffers with text. A |
d987d219 | 1041 | buffer is not normally a collection of NUL-terminated strings but instead a |
0b2b18a2 UD |
1042 | continuous collection of lines, separated by newline characters. Now |
1043 | assume that a function to convert one line from a buffer is needed. Since | |
1044 | the line is not NUL-terminated, the source pointer cannot directly point | |
1045 | into the unmodified text buffer. This means, either one inserts the NUL | |
1046 | byte at the appropriate place for the time of the @code{mbsrtowcs} | |
1047 | function call (which is not doable for a read-only buffer or in a | |
1048 | multi-threaded application) or one copies the line in an extra buffer | |
bd3916e8 UD |
1049 | where it can be terminated by a NUL byte. Note that it is not in general |
1050 | possible to limit the number of characters to convert by setting the | |
1051 | parameter @var{len} to any specific value. Since it is not known how | |
1052 | many bytes each multibyte character sequence is in length, one can only | |
0b2b18a2 UD |
1053 | guess. |
1054 | ||
1055 | @cindex stateful | |
1056 | There is still a problem with the method of NUL-terminating a line right | |
1057 | after the newline character, which could lead to very strange results. | |
d987d219 | 1058 | As said in the description of the @code{mbsrtowcs} function above, the |
0b2b18a2 UD |
1059 | conversion state is guaranteed to be in the initial shift state after |
1060 | processing the NUL byte at the end of the input string. But this NUL | |
1061 | byte is not really part of the text (i.e., the conversion state after | |
1062 | the newline in the original text could be something different than the | |
1063 | initial shift state and therefore the first character of the next line | |
1064 | is encoded using this state). But the state in question is never | |
1065 | accessible to the user since the conversion stops after the NUL byte | |
1066 | (which resets the state). Most stateful character sets in use today | |
1067 | require that the shift state after a newline be the initial state--but | |
1068 | this is not a strict guarantee. Therefore, simply NUL-terminating a | |
bd3916e8 | 1069 | piece of a running text is not always an adequate solution and, |
0b2b18a2 UD |
1070 | therefore, should never be used in general-purpose code. | |
1071 | ||
1072 | The generic conversion interface (@pxref{Generic Charset Conversion}) | |
1073 | does not have this limitation (it simply works on buffers, not | |
1f77f049 | 1074 | strings), and @theglibc{} contains a set of functions that take |
0b2b18a2 UD |
1075 | additional parameters specifying the maximal number of bytes that are |
1076 | consumed from the input string. This way the problem of | |
1077 | @code{mbsrtowcs}'s example above could be solved by determining the line | |
1078 | length and passing this length to the function. | |
1079 | ||
0b2b18a2 | 1080 | @deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps}) |
d08a7e4c | 1081 | @standards{ISO, wchar.h} |
86e60666 | 1082 | @safety{@prelim{}@mtunsafe{@mtasurace{:wcsrtombs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
0b2b18a2 UD |
1083 | The @code{wcsrtombs} function (``wide character string restartable to |
1084 | multibyte string'') converts the NUL-terminated wide character string at | |
bd3916e8 | 1085 | @code{*@var{src}} into an equivalent multibyte character string and |
0b2b18a2 UD |
1086 | stores the result in the array pointed to by @var{dst}. The NUL wide |
1087 | character is also converted. The conversion starts in the state | |
1088 | described in the object pointed to by @var{ps} or by a state object | |
d987d219 | 1089 | local to @code{wcsrtombs} in case @var{ps} is a null pointer. If |
0b2b18a2 UD |
1090 | @var{dst} is a null pointer, the conversion is performed as usual but the |
1091 | result is not available. If all characters of the input string were | |
bd3916e8 | 1092 | successfully converted and if @var{dst} is not a null pointer, the |
0b2b18a2 UD |
1093 | pointer pointed to by @var{src} gets assigned a null pointer. |
1094 | ||
1095 | If one of the wide characters in the input string has no valid multibyte | |
1096 | character equivalent, the conversion stops early, sets the global | |
1097 | variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}. | |
1098 | ||
1099 | Another reason for a premature stop is if @var{dst} is not a null | |
1100 | pointer and the next converted character would require more than | |
1101 | @var{len} bytes in total to be stored in the array @var{dst}. In this case (and if | |
d987d219 | 1102 | @var{dst} is not a null pointer) the pointer pointed to by @var{src} is |
0b2b18a2 UD |
1103 | assigned a value pointing to the wide character right after the last one |
1104 | successfully converted. | |
1105 | ||
bd3916e8 UD |
1106 | Except in the case of an encoding error the return value of the |
1107 | @code{wcsrtombs} function is the number of bytes in all the multibyte | |
61af4bbb CD |
1108 | character sequences which were or would have been (if @var{dst} were | |
1109 | not a null pointer) stored in @var{dst}. Before returning, the state in the | |
1110 | object pointed to by @var{ps} (or the internal object in case @var{ps} | |
1111 | is a null pointer) is updated to reflect the state after the last | |
1112 | conversion. The state is the initial shift state in case the | |
0b2b18a2 UD |
1113 | terminating NUL wide character was converted. |
1114 | ||
1115 | @pindex wchar.h | |
bd3916e8 | 1116 | The @code{wcsrtombs} function was introduced in @w{Amendment 1} to |
0b2b18a2 UD |
1117 | @w{ISO C90} and is declared in @file{wchar.h}. |
1118 | @end deftypefun | |
1119 | ||
1120 | The restriction mentioned above for the @code{mbsrtowcs} function applies | |
1121 | here also. There is no possibility of directly controlling the number of | |
bd3916e8 UD |
1122 | input characters. One has to place the NUL wide character at the correct |
1123 | place or control the consumed input indirectly via the available output | |
0b2b18a2 UD |
1124 | array size (the @var{len} parameter). |
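
Controlling the input indirectly through the output size can be
sketched as a loop that converts a long wide character string through a
small fixed-size buffer.  The function name @code{put_wcs} and the
buffer size are assumptions made for this illustration only.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: write the wide character string WS to FP as
   multibyte text.  Returns 0 on success and -1 on error.  */
int
put_wcs (const wchar_t *ws, FILE *fp)
{
  mbstate_t state;
  const wchar_t *src = ws;

  memset (&state, '\0', sizeof (state));
  /* wcsrtombs sets SRC to a null pointer once the terminating NUL
     wide character has been converted.  */
  while (src != NULL)
    {
      char buf[4 * MB_LEN_MAX];
      size_t n = wcsrtombs (buf, &src, sizeof (buf), &state);

      if (n == (size_t) -1)
        /* A wide character has no multibyte equivalent.  */
        return -1;
      if (n == 0 && src != NULL)
        /* No progress; give up instead of looping forever.  */
        return -1;
      if (fwrite (buf, 1, n, fp) != n)
        return -1;
    }
  return 0;
}
```

The same state object is passed to every call, so a shift state left by
one chunk carries over to the next.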
1125 | ||
0b2b18a2 | 1126 | @deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps}) |
d08a7e4c | 1127 | @standards{GNU, wchar.h} |
86e60666 | 1128 | @safety{@prelim{}@mtunsafe{@mtasurace{:mbsnrtowcs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
0b2b18a2 UD |
1129 | The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs} |
1130 | function. All the parameters are the same except for @var{nmc}, which is | |
1131 | new. The return value is the same as for @code{mbsrtowcs}. | |
1132 | ||
1133 | This new parameter specifies how many bytes at most can be used from the | |
bd3916e8 UD |
1134 | multibyte character string. In other words, the multibyte character |
1135 | string @code{*@var{src}} need not be NUL-terminated. But if a NUL byte | |
1136 | is found within the @var{nmc} first bytes of the string, the conversion | |
d987d219 | 1137 | stops there. |
0b2b18a2 | 1138 | |
61af4bbb CD |
1139 | As with @code{mbstowcs}, the @var{dst} parameter may be a null pointer and |
1140 | the function can be used to count the number of wide characters that | |
1141 | would be required. | |
1142 | ||
0b2b18a2 UD |
1143 | This function is a GNU extension. It is meant to work around the |
1144 | problems mentioned above. Now it is possible to convert a buffer with | |
d987d219 | 1145 | multibyte character text piece by piece without having to care about |
0b2b18a2 UD |
1146 | inserting NUL bytes and the effect of NUL bytes on the conversion state. |
1147 | @end deftypefun | |
1148 | ||
1149 | A function to convert a multibyte string into a wide character string | |
1150 | and display it could be written like this (this is not a really useful | |
1151 | example): | |
1152 | ||
1153 | @smallexample | |
1154 | void | |
1155 | showmbs (const char *src, FILE *fp) | |
1156 | @{ | |
1157 | mbstate_t state; | |
1158 | int cnt = 0; | |
1159 | memset (&state, '\0', sizeof (state)); | |
1160 | while (1) | |
1161 | @{ | |
1162 | wchar_t linebuf[100]; | |
1163 | const char *endp = strchr (src, '\n'); | |
1164 | size_t n; | |
1165 | ||
1166 | /* @r{Exit if there is no more line.} */ | |
1167 | if (endp == NULL) | |
1168 | break; | |
1169 | ||
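1170 | n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state); |
| /* @r{Stop at an invalid multibyte sequence.} */ |
| if (n == (size_t) -1) |
| break; |
1171 | linebuf[n] = L'\0'; |
1172 | fprintf (fp, "line %d: \"%S\"\n", ++cnt, linebuf); |
| /* @r{Continue with the text after the newline.} */ |
| src = endp + 1; |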
1173 | @} | |
1174 | @} | |
1175 | @end smallexample | |
1176 | ||
1177 | There is no problem with the state after a call to @code{mbsnrtowcs}. | |
1178 | Since we don't insert characters in the strings that were not in there | |
1179 | right from the beginning and we use @var{state} only for the conversion | |
1180 | of the given buffer, there is no problem with altering the state. | |
1181 | ||
0b2b18a2 | 1182 | @deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps}) |
d08a7e4c | 1183 | @standards{GNU, wchar.h} |
86e60666 | 1184 | @safety{@prelim{}@mtunsafe{@mtasurace{:wcsnrtombs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
0b2b18a2 UD |
1185 | The @code{wcsnrtombs} function implements the conversion from wide |
1186 | character strings to multibyte character strings. It is similar to | |
1187 | @code{wcsrtombs} but, just like @code{mbsnrtowcs}, it takes an extra | |
1188 | parameter, which specifies the length of the input string. | |
1189 | ||
1190 | No more than @var{nwc} wide characters from the input string | |
1191 | @code{*@var{src}} are converted. If the input string contains a NUL | |
1192 | wide character in the first @var{nwc} characters, the conversion stops at | |
1193 | this place. | |
1194 | ||
bd3916e8 UD |
1195 | The @code{wcsnrtombs} function is a GNU extension and just like |
1196 | @code{mbsnrtowcs} helps in situations where no NUL-terminated input | |
0b2b18a2 UD |
1197 | strings are available. |
1198 | @end deftypefun | |
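As a rough illustration of the extra length parameter, the sketch below converts only the first @var{nwc} elements of a wide character array that has no terminating NUL. The helper name @code{wide_prefix_to_mbs} and its parameters are assumptions made for this example; only @code{wcsnrtombs} itself is a library function.

```c
#define _GNU_SOURCE
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert at most NWC wide characters from WBUF,
   which need not be NUL-terminated, into DST (capacity DSTSIZE bytes).
   Returns the number of bytes stored, not counting the NUL byte this
   helper appends itself, or (size_t) -1 on error.  */
size_t
wide_prefix_to_mbs (char *dst, size_t dstsize,
                    const wchar_t *wbuf, size_t nwc)
{
  mbstate_t state;
  const wchar_t *src = wbuf;
  size_t n;

  if (dstsize == 0)
    return (size_t) -1;

  memset (&state, '\0', sizeof (state));
  /* Reserve one byte for the terminator added below.  */
  n = wcsnrtombs (dst, &src, nwc, dstsize - 1, &state);
  if (n == (size_t) -1)
    return n;

  dst[n] = '\0';
  return n;
}
```

Because the input is cut off by count rather than by a NUL wide character, the output of a stateful encoding is not guaranteed to end in the initial shift state; the sketch ignores that case.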
1199 | ||
1200 | ||
1201 | @node Multibyte Conversion Example | |
1202 | @subsection A Complete Multibyte Conversion Example | |
1203 | ||
1204 | The example programs given in the last sections are only brief and do | |
1205 | not contain all the error checking, etc. Presented here is a complete | |
1206 | and documented example. It features the @code{mbrtowc} function but it | |
1207 | should be easy to derive versions using the other functions. | |
1208 | ||
1209 | @smallexample | |
1210 | int | |
1211 | file_mbsrtowcs (int input, int output) | |
1212 | @{ | |
1213 | /* @r{Note the use of @code{MB_LEN_MAX}.} | |
1214 | @r{@code{MB_CUR_MAX} cannot portably be used here.} */ | |
1215 | char buffer[BUFSIZ + MB_LEN_MAX]; | |
1216 | mbstate_t state; | |
1217 | int filled = 0; | |
1218 | int eof = 0; | |
1219 | ||
1220 | /* @r{Initialize the state.} */ | |
1221 | memset (&state, '\0', sizeof (state)); | |
1222 | ||
1223 | while (!eof) | |
1224 | @{ | |
1225 | ssize_t nread; | |
1226 | ssize_t nwrite; | |
1227 | char *inp = buffer; | |
1228 | wchar_t outbuf[BUFSIZ]; | |
1229 | wchar_t *outp = outbuf; | |
1230 | ||
1231 | /* @r{Fill up the buffer from the input file.} */ | |
1232 | nread = read (input, buffer + filled, BUFSIZ); | |
1233 | if (nread < 0) | |
1234 | @{ | |
1235 | perror ("read"); | |
1236 | return 0; | |
1237 | @} | |
1238 | /* @r{If we reach end of file, make a note to read no more.} */ | |
1239 | if (nread == 0) | |
1240 | eof = 1; | |
1241 | ||
1242 | /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */ | |
1243 | filled += nread; | |
1244 | ||
1245 | /* @r{Convert those bytes to wide characters--as many as we can.} */ | |
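1246 | while (filled > 0) |
1247 | @{ |
| /* @r{Work on a copy so that @code{state} stays valid} |
| @r{if this character cannot be converted yet.} */ |
| mbstate_t trial = state; |
1248 | size_t thislen = mbrtowc (outp, inp, filled, &trial); |
1249 | /* @r{Stop converting at invalid character;} |
1250 | @r{@code{(size_t) -2} means we have read just the} |
1251 | @r{first part of a valid character.} */ |
1252 | if (thislen == (size_t) -1 || thislen == (size_t) -2) |
1253 | break; |
| state = trial; |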
1254 | /* @r{We want to handle embedded NUL bytes} | |
1255 | @r{but the return value is 0. Correct this.} */ | |
1256 | if (thislen == 0) | |
1257 | thislen = 1; | |
1258 | /* @r{Advance past this character.} */ | |
1259 | inp += thislen; | |
1260 | filled -= thislen; | |
1261 | ++outp; | |
1262 | @} | |
1263 | ||
1264 | /* @r{Write the wide characters we just made.} */ | |
1265 | nwrite = write (output, outbuf, | |
1266 | (outp - outbuf) * sizeof (wchar_t)); | |
1267 | if (nwrite < 0) | |
1268 | @{ | |
1269 | perror ("write"); | |
1270 | return 0; | |
1271 | @} | |
1272 | ||
1273 | /* @r{See if we have a @emph{real} invalid character.} */ | |
1274 | if ((eof && filled > 0) || filled >= MB_CUR_MAX) | |
1275 | @{ | |
1276 | error (0, 0, "invalid multibyte character"); | |
1277 | return 0; | |
1278 | @} | |
1279 | ||
1280 | /* @r{If any characters must be carried forward,} | |
1281 | @r{put them at the beginning of @code{buffer}.} */ | |
1282 | if (filled > 0) | |
21e66bc5 | 1283 | memmove (buffer, inp, filled); |
0b2b18a2 UD |
1284 | @} |
1285 | ||
1286 | return 1; | |
1287 | @} | |
1288 | @end smallexample | |
1289 | ||
1290 | ||
1291 | @node Non-reentrant Conversion | |
1292 | @section Non-reentrant Conversion Function | |
1293 | ||
1294 | The functions described in the previous section are defined in |
bd3916e8 UD |
1295 | @w{Amendment 1} to @w{ISO C90}, but the original @w{ISO C90} standard |
1296 | also contained functions for character set conversion. The reason that | |
1297 | these original functions are not described first is that they are almost | |
0b2b18a2 UD |
1298 | entirely useless. |
1299 | ||
bd3916e8 UD |
1300 | The problem is that all the conversion functions described in the |
1301 | original @w{ISO C90} use a hidden internal state. Such a state implies that |
1302 | multiple conversions at the same time (not only when using threads) | |
1303 | cannot be done, and that you cannot first convert single characters and | |
1304 | then strings since you cannot tell the conversion functions which state | |
0b2b18a2 UD |
1305 | to use. |
1306 | ||
bd3916e8 | 1307 | These original functions are therefore usable only in a very limited set |
0b2b18a2 UD |
1308 | of situations. One must complete converting the entire string before |
1309 | starting a new one, and each string/text must be converted with the same | |
1310 | function (there is no problem with the library itself; it is guaranteed | |
1311 | that no library function changes the state of any of these functions). | |
1312 | @strong{For the above reasons it is strongly recommended that the functions |
bd3916e8 | 1313 | described in the previous section be used in place of non-reentrant |
0b2b18a2 UD |
1314 | conversion functions.} |
1315 | ||
1316 | @menu | |
1317 | * Non-reentrant Character Conversion:: Non-reentrant Conversion of Single | |
1318 | Characters. | |
1319 | * Non-reentrant String Conversion:: Non-reentrant Conversion of Strings. | |
1320 | * Shift State:: States in Non-reentrant Functions. | |
1321 | @end menu | |
1322 | ||
1323 | @node Non-reentrant Character Conversion | |
1324 | @subsection Non-reentrant Conversion of Single Characters | |
1325 | ||
0b2b18a2 | 1326 | @deftypefun int mbtowc (wchar_t *restrict @var{result}, const char *restrict @var{string}, size_t @var{size}) |
d08a7e4c | 1327 | @standards{ISO, stdlib.h} |
86e60666 | 1328 | @safety{@prelim{}@mtunsafe{@mtasurace{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
0b2b18a2 UD |
1329 | The @code{mbtowc} (``multibyte to wide character'') function when called |
1330 | with non-null @var{string} converts the first multibyte character | |
1331 | beginning at @var{string} to its corresponding wide character code. It | |
1332 | stores the result in @code{*@var{result}}. | |
1333 | ||
1334 | @code{mbtowc} never examines more than @var{size} bytes. (The idea is | |
1335 | to supply for @var{size} the number of bytes of data you have in hand.) | |
1336 | ||
1337 | @code{mbtowc} with non-null @var{string} distinguishes three | |
1338 | possibilities: the first @var{size} bytes at @var{string} start with | |
1339 | valid multibyte characters, they start with an invalid byte sequence or | |
1340 | just part of a character, or @var{string} points to an empty string (a | |
1341 | null character). | |
1342 | ||
1343 | For a valid multibyte character, @code{mbtowc} converts it to a wide | |
1344 | character and stores that in @code{*@var{result}}, and returns the | |
1345 | number of bytes in that character (always at least @math{1} and never | |
1346 | more than @var{size}). | |
1347 | ||
1348 | For an invalid byte sequence, @code{mbtowc} returns @math{-1}. For an | |
1349 | empty string, it returns @math{0}, also storing @code{'\0'} in | |
1350 | @code{*@var{result}}. | |
1351 | ||
1352 | If the multibyte character code uses shift characters, then | |
1353 | @code{mbtowc} maintains and updates a shift state as it scans. If you | |
1354 | call @code{mbtowc} with a null pointer for @var{string}, that | |
1355 | initializes the shift state to its standard initial value. It also | |
1356 | returns nonzero if the multibyte character code in use actually has a | |
1357 | shift state. @xref{Shift State}. | |
1358 | @end deftypefun | |
1359 | ||
0b2b18a2 | 1360 | @deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar}) |
d08a7e4c | 1361 | @standards{ISO, stdlib.h} |
86e60666 | 1362 | @safety{@prelim{}@mtunsafe{@mtasurace{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
0b2b18a2 UD |
1363 | The @code{wctomb} (``wide character to multibyte'') function converts |
1364 | the wide character code @var{wchar} to its corresponding multibyte | |
1365 | character sequence, and stores the result in bytes starting at | |
1366 | @var{string}. At most @code{MB_CUR_MAX} bytes are stored. |
1367 | ||
1368 | @code{wctomb} with non-null @var{string} distinguishes three | |
1369 | possibilities for @var{wchar}: a valid wide character code (one that can | |
bd3916e8 | 1370 | be translated to a multibyte character), an invalid code, and |
0b2b18a2 UD |
1371 | @code{L'\0'}. |
1372 | ||
1373 | Given a valid code, @code{wctomb} converts it to a multibyte character, | |
1374 | storing the bytes starting at @var{string}. Then it returns the number | |
1375 | of bytes in that character (always at least @math{1} and never more | |
1376 | than @code{MB_CUR_MAX}). | |
1377 | ||
1378 | If @var{wchar} is an invalid wide character code, @code{wctomb} returns |
1379 | @math{-1}. If @var{wchar} is @code{L'\0'}, it stores the null byte @code{'\0'} (preceded by any |
1380 | shift sequence needed to restore the initial shift state) and returns the number of bytes stored. |
1381 | ||
1382 | If the multibyte character code uses shift characters, then |
1383 | @code{wctomb} maintains and updates a shift state as it scans. If you |
1384 | call @code{wctomb} with a null pointer for @var{string}, that |
1385 | initializes the shift state to its standard initial value. It also |
1386 | returns nonzero if the multibyte character code in use actually has a |
1387 | shift state. @xref{Shift State}. |
1388 | ||
1389 | Calling this function with a @var{wchar} argument of zero when |
1390 | @var{string} is not null has the side-effect of reinitializing the |
1391 | stored shift state @emph{as well as} storing the multibyte character |
1392 | @code{'\0'} and returning the number of bytes stored. |
1393 | @end deftypefun | |
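A minimal sketch of a single conversion with @code{wctomb} might look as follows. The helper name @code{encode_wchar} is hypothetical; the buffer is sized with @code{MB_LEN_MAX} because @code{MB_CUR_MAX} is not a compile-time constant.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: store the multibyte form of WC in OUT, which
   must have room for MB_LEN_MAX bytes.  Returns the length in bytes,
   or -1 if WC cannot be represented in the current locale.  */
int
encode_wchar (char out[MB_LEN_MAX], wchar_t wc)
{
  /* A null STRING resets the hidden shift state, so this conversion
     does not depend on earlier calls.  */
  wctomb (NULL, L'\0');
  return wctomb (out, wc);
}
```

The bytes written to the buffer are not NUL-terminated, so the caller has to use the returned length.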
1394 | ||
1395 | Similar to @code{mbrlen} there is also a non-reentrant function that | |
1396 | computes the length of a multibyte character. It can be defined in | |
1397 | terms of @code{mbtowc}. | |
1398 | ||
0b2b18a2 | 1399 | @deftypefun int mblen (const char *@var{string}, size_t @var{size}) |
d08a7e4c | 1400 | @standards{ISO, stdlib.h} |
86e60666 | 1401 | @safety{@prelim{}@mtunsafe{@mtasurace{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
0b2b18a2 UD |
1402 | The @code{mblen} function with a non-null @var{string} argument returns |
1403 | the number of bytes that make up the multibyte character beginning at | |
1404 | @var{string}, never examining more than @var{size} bytes. (The idea is | |
1405 | to supply for @var{size} the number of bytes of data you have in hand.) | |
1406 | ||
1407 | The return value of @code{mblen} distinguishes three possibilities: the | |
1408 | first @var{size} bytes at @var{string} start with valid multibyte | |
1409 | characters, they start with an invalid byte sequence or just part of a | |
1410 | character, or @var{string} points to an empty string (a null character). | |
1411 | ||
1412 | For a valid multibyte character, @code{mblen} returns the number of | |
1413 | bytes in that character (always at least @code{1} and never more than | |
bd3916e8 | 1414 | @var{size}). For an invalid byte sequence, @code{mblen} returns |
0b2b18a2 UD |
1415 | @math{-1}. For an empty string, it returns @math{0}. |
1416 | ||
1417 | If the multibyte character code uses shift characters, then @code{mblen} | |
1418 | maintains and updates a shift state as it scans. If you call | |
1419 | @code{mblen} with a null pointer for @var{string}, that initializes the | |
1420 | shift state to its standard initial value. It also returns a nonzero | |
1421 | value if the multibyte character code in use actually has a shift state. | |
1422 | @xref{Shift State}. | |
1423 | ||
1424 | @pindex stdlib.h | |
1425 | The function @code{mblen} is declared in @file{stdlib.h}. | |
1426 | @end deftypefun | |
1427 | ||
1428 | ||
1429 | @node Non-reentrant String Conversion | |
1430 | @subsection Non-reentrant Conversion of Strings | |
1431 | ||
bd3916e8 | 1432 | For convenience the @w{ISO C90} standard also defines functions to |
0b2b18a2 UD |
1433 | convert entire strings instead of single characters. These functions |
1434 | suffer from the same problems as their reentrant counterparts from | |
1435 | @w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}. | |
1436 | ||
0b2b18a2 | 1437 | @deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size}) |
d08a7e4c | 1438 | @standards{ISO, stdlib.h} |
86e60666 AO |
1439 | @safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
1440 | @c Odd... Although this was supposed to be non-reentrant, the internal | |
1441 | @c state is not a static buffer, but an automatic variable. | |
0b2b18a2 UD |
1442 | The @code{mbstowcs} (``multibyte string to wide character string'') |
1443 | function converts the null-terminated string of multibyte characters | |
1444 | @var{string} to an array of wide character codes, storing not more than | |
1445 | @var{size} wide characters into the array beginning at @var{wstring}. | |
1446 | The terminating null character counts towards the size, so if @var{size} | |
1447 | is less than or equal to the number of wide characters resulting from |
1448 | @var{string}, no terminating null character is stored. | |
1449 | ||
1450 | The conversion of characters from @var{string} begins in the initial | |
1451 | shift state. | |
1452 | ||
bd3916e8 UD |
1453 | If an invalid multibyte character sequence is found, the @code{mbstowcs} |
1454 | function returns a value of @math{-1}. Otherwise, it returns the number | |
1455 | of wide characters stored in the array @var{wstring}. This number does | |
1456 | not include the terminating null character, which is present if the | |
0b2b18a2 UD |
1457 | number is less than @var{size}. |
1458 | ||
1459 | Here is an example showing how to convert a string of multibyte | |
1460 | characters, allocating enough space for the result. | |
1461 | ||
1462 | @smallexample | |
1463 | wchar_t * | |
1464 | mbstowcs_alloc (const char *string) | |
1465 | @{ | |
1466 | size_t size = strlen (string) + 1; | |
1467 | wchar_t *buf = xmalloc (size * sizeof (wchar_t)); | |
1468 | ||
1469 | size = mbstowcs (buf, string, size); | |
1470 | if (size == (size_t) -1) |
| @{ |
| /* @r{Do not leak the buffer on a conversion error.} */ |
| free (buf); |
1471 | return NULL; |
| @} |
bdc674d9 | 1472 | buf = xreallocarray (buf, size + 1, sizeof *buf); |
0b2b18a2 UD |
1473 | return buf; |
1474 | @} | |
1475 | @end smallexample | |
1476 | ||
61af4bbb CD |
1477 | If @var{wstring} is a null pointer then no output is written and the |
1478 | conversion proceeds as above, and the result is returned. In practice | |
1479 | such behaviour is useful for calculating the exact number of wide | |
1480 | characters required to convert @var{string}. This behaviour of | |
1481 | accepting a null pointer for @var{wstring} is an @w{XPG4.2} extension | |
1482 | that is not specified in @w{ISO C} and is optional in @w{POSIX}. | |
0b2b18a2 UD |
1483 | @end deftypefun |
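Where a null @var{wstring} is accepted (as in @theglibc{}; other systems may differ), the exact size can be computed before allocating. The following sketch is one way to do this; the name @code{mbstowcs_exact} is made up for the example, and it uses plain @code{malloc} so that it does not depend on any helper outside the standard library.

```c
#include <stdlib.h>

/* Hypothetical helper: convert STRING to a newly allocated wide
   character string of exactly the required size.  Returns NULL on an
   invalid multibyte sequence or if memory runs out.  */
wchar_t *
mbstowcs_exact (const char *string)
{
  /* With a null destination nothing is written; the result is the
     number of wide characters needed, not counting the terminator.  */
  size_t n = mbstowcs (NULL, string, 0);
  wchar_t *buf;

  if (n == (size_t) -1)
    return NULL;

  buf = malloc ((n + 1) * sizeof *buf);
  if (buf == NULL)
    return NULL;

  /* N + 1 leaves room for the terminating null wide character.  */
  mbstowcs (buf, string, n + 1);
  return buf;
}
```

Compared with allocating @code{strlen (string) + 1} elements up front, this trades a second pass over the input for not having to shrink the buffer afterwards.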
1484 | ||
0b2b18a2 | 1485 | @deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size}) |
d08a7e4c | 1486 | @standards{ISO, stdlib.h} |
86e60666 | 1487 | @safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
0b2b18a2 UD |
1488 | The @code{wcstombs} (``wide character string to multibyte string'') |
1489 | function converts the null-terminated wide character array @var{wstring} | |
1490 | into a string containing multibyte characters, storing not more than | |
1491 | @var{size} bytes starting at @var{string}, followed by a terminating | |
1492 | null character if there is room. The conversion of characters begins in | |
1493 | the initial shift state. | |
1494 | ||
1495 | The terminating null character counts towards the size, so if @var{size} | |
1496 | is less than or equal to the number of bytes needed for @var{wstring}, no |
1497 | terminating null character is stored. | |
1498 | ||
1499 | If a code that does not correspond to a valid multibyte character is | |
bd3916e8 UD |
1500 | found, the @code{wcstombs} function returns a value of @math{-1}. |
1501 | Otherwise, the return value is the number of bytes stored in the array | |
1502 | @var{string}. This number does not include the terminating null character, | |
0b2b18a2 UD |
1503 | which is present if the number is less than @var{size}. |
1504 | @end deftypefun | |
1505 | ||
1506 | @node Shift State | |
1507 | @subsection States in Non-reentrant Functions | |
1508 | ||
1509 | In some multibyte character codes, the @emph{meaning} of any particular | |
1510 | byte sequence is not fixed; it depends on what other sequences have come | |
bd3916e8 UD |
1511 | earlier in the same string. Typically there are just a few sequences that |
1512 | can change the meaning of other sequences; these few are called | |
0b2b18a2 UD |
1513 | @dfn{shift sequences} and we say that they set the @dfn{shift state} for |
1514 | other sequences that follow. | |
1515 | ||
1516 | To illustrate shift state and shift sequences, suppose we decide that | |
1517 | the sequence @code{0200} (just one byte) enters Japanese mode, in which | |
1518 | pairs of bytes in the range from @code{0240} to @code{0377} are single | |
1519 | characters, while @code{0201} enters Latin-1 mode, in which single bytes | |
1520 | in the range from @code{0240} to @code{0377} are characters, and | |
1521 | interpreted according to the ISO Latin-1 character set. This is a | |
1522 | multibyte code that has two alternative shift states (``Japanese mode'' | |
1523 | and ``Latin-1 mode''), and two shift sequences that specify particular | |
1524 | shift states. | |
1525 | ||
1526 | When the multibyte character code in use has shift states, then | |
1527 | @code{mblen}, @code{mbtowc}, and @code{wctomb} must maintain and update | |
1528 | the current shift state as they scan the string. To make this work | |
1529 | properly, you must follow these rules: | |
1530 | ||
1531 | @itemize @bullet | |
1532 | @item | |
1533 | Before starting to scan a string, call the function with a null pointer | |
1534 | for the multibyte character address---for example, @code{mblen (NULL, | |
1535 | 0)}. This initializes the shift state to its standard initial value. | |
1536 | ||
1537 | @item | |
1538 | Scan the string one character at a time, in order. Do not ``back up'' | |
1539 | and rescan characters already scanned, and do not intersperse the | |
1540 | processing of different strings. | |
1541 | @end itemize | |
1542 | ||
1543 | Here is an example of using @code{mblen} following these rules: | |
1544 | ||
1545 | @smallexample | |
1546 | void | |
1547 | scan_string (char *s) | |
1548 | @{ | |
1549 | int length = strlen (s); | |
1550 | ||
1551 | /* @r{Initialize shift state.} */ | |
1552 | mblen (NULL, 0); | |
1553 | ||
1554 | while (1) | |
1555 | @{ | |
1556 | int thischar = mblen (s, length); | |
1557 | /* @r{Deal with end of string and invalid characters.} */ | |
1558 | if (thischar == 0) | |
1559 | break; | |
1560 | if (thischar == -1) | |
1561 | @{ | |
1562 | error (0, 0, "invalid multibyte character"); |
1563 | break; | |
1564 | @} | |
1565 | /* @r{Advance past this character.} */ | |
1566 | s += thischar; | |
1567 | length -= thischar; | |
1568 | @} | |
1569 | @} | |
1570 | @end smallexample | |
1571 | ||
1572 | The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not | |
1573 | reentrant when using a multibyte code that uses a shift state. However, | |
1574 | no other library functions call these functions, so you don't have to | |
1575 | worry that the shift state will be changed mysteriously. | |
1576 | ||
1577 | ||
1578 | @node Generic Charset Conversion | |
1579 | @section Generic Charset Conversion | |
1580 | ||
1581 | The conversion functions mentioned so far in this chapter all had in | |
1582 | common that they operate on character sets that are not directly | |
1583 | specified by the functions. The multibyte encoding used is specified by | |
1584 | the currently selected locale for the @code{LC_CTYPE} category. The | |
1f77f049 | 1585 | wide character set is fixed by the implementation (in the case of @theglibc{} |
d987d219 | 1586 | it is always UCS-4 encoded @w{ISO 10646}). |
0b2b18a2 UD |
1587 | |
1588 | This design causes several problems when it comes to general character |
1589 | conversion: | |
1590 | ||
1591 | @itemize @bullet | |
1592 | @item | |
bd3916e8 UD |
1593 | For every conversion where neither the source nor the destination |
1594 | character set is the character set of the locale for the @code{LC_CTYPE} | |
1595 | category, one has to change the @code{LC_CTYPE} locale using | |
0b2b18a2 UD |
1596 | @code{setlocale}. |
1597 | ||
79c6869c | 1598 | Changing the @code{LC_CTYPE} locale introduces major problems for the rest |
bd3916e8 UD |
1599 | of the program since several more functions (e.g., the character |
1600 | classification functions, @pxref{Classification of Characters}) use the | |
0b2b18a2 UD |
1601 | @code{LC_CTYPE} category. |
1602 | ||
1603 | @item | |
1604 | Parallel conversions to and from different character sets are not | |
1605 | possible since the @code{LC_CTYPE} selection is global and shared by all | |
1606 | threads. | |
1607 | ||
1608 | @item | |
1609 | If neither the source nor the destination character set is the character | |
1610 | set used for @code{wchar_t} representation, there is at least a two-step | |
bd3916e8 UD |
1611 | process necessary to convert a text using the functions above. One would |
1612 | have to select the source character set as the multibyte encoding, | |
0b2b18a2 UD |
1613 | convert the text into a @code{wchar_t} text, select the destination |
1614 | character set as the multibyte encoding, and convert the wide character | |
1615 | text to the multibyte (@math{=} destination) character set. | |
1616 | ||
1617 | Even if this is possible (which is not guaranteed), it is tedious |
1618 | work. It also suffers even more from the two problems above because |
1619 | the locale must be changed repeatedly. |
1620 | @end itemize | |
1621 | ||
1622 | The XPG2 standard defines a completely new set of functions, which has | |
1623 | none of these limitations. They are not at all coupled to the selected | |
1624 | locales, and they have no constraints on the character sets selected for | |
bd3916e8 UD |
1625 | source and destination. Only the set of available conversions limits |
1626 | them. The standard does not specify that any conversion at all must be | |
1627 | available. Such availability is a measure of the quality of the | |
0b2b18a2 UD |
1628 | implementation. |
1629 | ||
1630 | The following text first describes the interface to @code{iconv} and then the |
1631 | conversion function itself. Comparisons with other |
1632 | implementations will show what obstacles stand in the way of portable | |
bd3916e8 | 1633 | applications. Finally, the implementation is described in so far as might |
0b2b18a2 UD |
1634 | interest the advanced user who wants to extend conversion capabilities. |
1635 | ||
1636 | @menu | |
1637 | * Generic Conversion Interface:: Generic Character Set Conversion Interface. | |
1638 | * iconv Examples:: A complete @code{iconv} example. | |
1639 | * Other iconv Implementations:: Some Details about other @code{iconv} | |
1640 | Implementations. | |
1641 | * glibc iconv Implementation:: The @code{iconv} Implementation in the GNU C | |
1642 | library. | |
1643 | @end menu | |
1644 | ||
1645 | @node Generic Conversion Interface | |
1646 | @subsection Generic Character Set Conversion Interface | |
1647 | ||
1648 | This set of functions follows the traditional cycle of using a resource: | |
1649 | open--use--close. The interface consists of three functions, each of | |
1650 | which implements one step. | |
1651 | ||
1652 | Before the interfaces are described it is necessary to introduce a | |
1653 | data type. Just like other open--use--close interfaces the functions | |
1654 | introduced here work using handles and the @file{iconv.h} header | |
1655 | defines a special type for the handles used. | |
1656 | ||
0b2b18a2 | 1657 | @deftp {Data Type} iconv_t |
d08a7e4c | 1658 | @standards{XPG2, iconv.h} |
0b2b18a2 UD |
1659 | This data type is an abstract type defined in @file{iconv.h}. The user |
1660 | must not assume anything about the definition of this type; it must be | |
1661 | completely opaque. | |
1662 | ||
d987d219 | 1663 | Objects of this type can be assigned handles for the conversions using |
0b2b18a2 UD |
1664 | the @code{iconv} functions. The objects themselves need not be freed, but |
1665 | the conversions for which the handles stand have to be. |
1666 | @end deftp | |
1667 | ||
1668 | @noindent | |
1669 | The first step is the function to create a handle. | |
1670 | ||
0b2b18a2 | 1671 | @deftypefun iconv_t iconv_open (const char *@var{tocode}, const char *@var{fromcode}) |
d08a7e4c | 1672 | @standards{XPG2, iconv.h} |
86e60666 AO |
1673 | @safety{@prelim{}@mtsafe{@mtslocale{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
1674 | @c Calls malloc if tocode and/or fromcode are too big for alloca. Calls | |
1675 | @c strip and upstr on both, then gconv_open. strip and upstr call | |
1676 | @c isalnum_l and toupper_l with the C locale. gconv_open may MT-safely | |
1677 | @c tokenize toset, replace unspecified codesets with the current locale | |
1678 | @c (possibly two different accesses), and finally it calls | |
1679 | @c gconv_find_transform and initializes the gconv_t result with all the | |
1680 | @c steps in the conversion sequence, running each one's initializer, | |
1681 | @c destructing and releasing them all if anything fails. | |
1682 | ||
0b2b18a2 UD |
1683 | The @code{iconv_open} function has to be used before starting a |
1684 | conversion. The two parameters this function takes determine the | |
1685 | source and destination character set for the conversion, and if the | |
1686 | implementation has the possibility to perform such a conversion, the | |
1687 | function returns a handle. | |
1688 | ||
bd3916e8 | 1689 | If the wanted conversion is not available, the @code{iconv_open} function |
cf822e3c | 1690 | returns @code{(iconv_t) -1}. In this case the global variable |
0b2b18a2 UD |
1691 | @code{errno} can have the following values: |
1692 | ||
1693 | @table @code | |
1694 | @item EMFILE | |
1695 | The process already has @code{OPEN_MAX} file descriptors open. | |
1696 | @item ENFILE | |
d987d219 | 1697 | The system limit of open files is reached. |
0b2b18a2 UD |
1698 | @item ENOMEM |
1699 | Not enough memory to carry out the operation. | |
1700 | @item EINVAL | |
1701 | The conversion from @var{fromcode} to @var{tocode} is not supported. | |
1702 | @end table | |
1703 | ||
1704 | It is not possible to use the same descriptor in different threads to | |
1705 | perform independent conversions. The data structures associated | |
1706 | with the descriptor include information about the conversion state. | |
1707 | This must not be messed up by using it in different conversions. | |
1708 | ||
1709 | An @code{iconv} descriptor is like a file descriptor in that a new |
1710 | descriptor must be created for every use. The descriptor does not stand for all |
1711 | of the conversions from @var{fromcode} to @var{tocode}. |
1712 | ||
1f77f049 | 1713 | The @glibcadj{} implementation of @code{iconv_open} has one |
0b2b18a2 UD |
1714 | significant extension to other implementations. To ease the extension |
1715 | of the set of available conversions, the implementation allows storing | |
bd3916e8 | 1716 | the necessary files with data and code in an arbitrary number of |
0b2b18a2 UD |
1717 | directories. How this extension must be written will be explained below |
1718 | (@pxref{glibc iconv Implementation}). Here it is only important to say | |
1719 | that all directories mentioned in the @code{GCONV_PATH} environment | |
1720 | variable are considered only if they contain a file @file{gconv-modules}. | |
1721 | These directories need not necessarily be created by the system | |
1722 | administrator. In fact, this extension was introduced to help users |
bd3916e8 | 1723 | write and use their own, new conversions. Of course, this does not |
0b2b18a2 | 1724 | work for security reasons in SUID binaries; in this case only the system |
bd3916e8 UD |
1725 | directory is considered and this normally is |
1726 | @file{@var{prefix}/lib/gconv}. The @code{GCONV_PATH} environment | |
1727 | variable is examined exactly once at the first call of the | |
1728 | @code{iconv_open} function. Later modifications of the variable have no | |
0b2b18a2 UD |
1729 | effect. |
1730 | ||
1731 | @pindex iconv.h | |
bd3916e8 UD |
1732 | The @code{iconv_open} function was introduced early in the X/Open |
1733 | Portability Guide, @w{version 2}. It is supported by all commercial | |
1734 | Unices as it is required for the Unix branding. However, the quality and | |
1735 | completeness of the implementation varies widely. The @code{iconv_open} | |
0b2b18a2 UD |
1736 | function is declared in @file{iconv.h}. |
1737 | @end deftypefun | |
1738 | ||
1739 | The @code{iconv} implementation can associate large data structures with | |
bd3916e8 UD |
1740 | the handle returned by @code{iconv_open}. Therefore, it is crucial to |
1741 | free all the resources once all conversions are carried out and the | |
0b2b18a2 UD |
1742 | conversion is not needed anymore. |
1743 | ||
0b2b18a2 | 1744 | @deftypefun int iconv_close (iconv_t @var{cd}) |
d08a7e4c | 1745 | @standards{XPG2, iconv.h} |
86e60666 AO |
1746 | @safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{}}} |
1747 | @c Calls gconv_close to destruct and release each of the conversion | |
1748 | @c steps, release the gconv_t object, then call gconv_close_transform. | |
1749 | @c Access to the gconv_t object is not guarded, but calling iconv_close | |
1750 | @c concurrently with any other use is undefined. | |
1751 | ||
0b2b18a2 UD |
1752 | The @code{iconv_close} function frees all resources associated with the |
1753 | handle @var{cd}, which must have been returned by a successful call to | |
1754 | the @code{iconv_open} function. | |
1755 | ||
1756 | If the function call was successful the return value is @math{0}. | |
1757 | Otherwise it is @math{-1} and @code{errno} is set appropriately. | |
d987d219 | 1758 | Defined errors are: |
0b2b18a2 UD |
1759 | |
1760 | @table @code | |
1761 | @item EBADF | |
1762 | The conversion descriptor is invalid. | |
1763 | @end table | |
1764 | ||
1765 | @pindex iconv.h | |
bd3916e8 | 1766 | The @code{iconv_close} function was introduced together with the rest |
0b2b18a2 UD |
1767 | of the @code{iconv} functions in XPG2 and is declared in @file{iconv.h}. |
1768 | @end deftypefun | |
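
As a minimal sketch of the open/close pairing described above, the
following helper probes whether a conversion is available and releases
the descriptor right away. The name @code{conversion_supported} and its
return convention are hypothetical and not part of the library.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if a conversion from FROMCODE to
   TOCODE can be opened, 0 if iconv_open reports EINVAL (conversion
   not supported), and -1 on any other failure.  */
int
conversion_supported (const char *tocode, const char *fromcode)
{
  assert (tocode != NULL && fromcode != NULL);

  iconv_t cd = iconv_open (tocode, fromcode);
  if (cd == (iconv_t) -1)
    return errno == EINVAL ? 0 : -1;

  /* The descriptor may own large data structures, so release it as
     soon as it is no longer needed.  */
  if (iconv_close (cd) != 0)
    return -1;
  return 1;
}
```

A caller could use such a probe to choose between several candidate
character set names before starting the real conversion.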
1769 | ||
1770 | The standard defines only one actual conversion function. This has, | |
1771 | therefore, the most general interface: it allows conversion from one | |
1772 | buffer to another. Conversion from a file to a buffer, vice versa, or | |
1773 | even file to file can be implemented on top of it. | |
1774 | ||
0b2b18a2 | 1775 | @deftypefun size_t iconv (iconv_t @var{cd}, char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft}) |
d08a7e4c | 1776 | @standards{XPG2, iconv.h} |
86e60666 AO |
1777 | @safety{@prelim{}@mtsafe{@mtsrace{:cd}}@assafe{}@acunsafe{@acucorrupt{}}} |
1778 | @c Without guarding access to the iconv_t object pointed to by cd, call | |
1779 | @c the conversion function to convert inbuf or flush the internal | |
1780 | @c conversion state. | |
0b2b18a2 UD |
1781 | @cindex stateful |
1782 | The @code{iconv} function converts the text in the input buffer | |
1783 | according to the rules associated with the descriptor @var{cd} and | |
1784 | stores the result in the output buffer. It is possible to call the | |
1785 | function for the same text several times in a row since for stateful | |
1786 | character sets the necessary state information is kept in the data | |
1787 | structures associated with the descriptor. | |
1788 | ||
1789 | The input buffer is specified by @code{*@var{inbuf}} and it contains | |
1790 | @code{*@var{inbytesleft}} bytes. The extra indirection is necessary for | |
1791 | communicating the used input back to the caller (see below). It is | |
1792 | important to note that the buffer pointer is of type @code{char *} and the | |
1793 | length is measured in bytes even if the input text is encoded in wide | |
1794 | characters. | |
1795 | ||
1796 | The output buffer is specified in a similar way. @code{*@var{outbuf}} | |
1797 | points to the beginning of the buffer with at least | |
1798 | @code{*@var{outbytesleft}} bytes room for the result. The buffer | |
1799 | pointer again is of type @code{char *} and the length is measured in | |
1800 | bytes. If @var{outbuf} or @code{*@var{outbuf}} is a null pointer, the | |
1801 | conversion is performed but no output is available. | |
1802 | ||
1803 | If @var{inbuf} is a null pointer, the @code{iconv} function performs the | |
1804 | necessary action to put the state of the conversion into the initial | |
1805 | state. This is obviously a no-op for non-stateful encodings, but if the | |
1806 | encoding has a state, such a function call might put some byte sequences | |
1807 | in the output buffer, which perform the necessary state changes. The | |
1808 | next call with @var{inbuf} not being a null pointer then simply goes on | |
1809 | from the initial state. It is important that the programmer never makes | |
bd3916e8 UD |
1810 | any assumption as to whether the conversion has to deal with states. |
1811 | Even if the input and output character sets are not stateful, the | |
0b2b18a2 | 1812 | implementation might still have to keep states. This is due to the |
1f77f049 | 1813 | implementation chosen for @theglibc{} as it is described below. |
0b2b18a2 UD |
1814 | Therefore an @code{iconv} call to reset the state should always be |
1815 | performed if some protocol requires this for the output text. | |
1816 | ||
cf822e3c | 1817 | The conversion stops for one of three reasons. The first is that all |
0b2b18a2 UD |
1818 | characters from the input buffer are converted. This actually can mean |
1819 | two things: either all bytes from the input buffer are consumed or | |
1820 | there are some bytes at the end of the buffer that possibly can form a | |
1821 | complete character but the input is incomplete. The second reason for a | |
1822 | stop is that the output buffer is full. And the third reason is that | |
1823 | the input contains invalid characters. | |
1824 | ||
1825 | In all of these cases the buffer pointers after the last successful | |
d987d219 | 1826 | conversion, for the input and output buffers, are stored in @var{inbuf} and |
0b2b18a2 UD |
1827 | @var{outbuf}, and the available room in each buffer is stored in |
1828 | @var{inbytesleft} and @var{outbytesleft}. | |
1829 | ||
1830 | Since the character sets selected in the @code{iconv_open} call can be | |
1831 | almost arbitrary, there can be situations where the input buffer contains | |
1832 | valid characters, which have no identical representation in the output | |
1833 | character set. The behavior in this situation is undefined. The | |
1f77f049 | 1834 | @emph{current} behavior of @theglibc{} in this situation is to |
0b2b18a2 UD |
1835 | return with an error immediately. This certainly is not the most |
1836 | desirable solution; therefore, future versions will provide better ones, | |
1837 | but they are not yet finished. | |
1838 | ||
1839 | If all input from the input buffer is successfully converted and stored | |
1840 | in the output buffer, the function returns the number of non-reversible | |
1841 | conversions performed. In all other cases the return value is | |
1842 | @code{(size_t) -1} and @code{errno} is set appropriately. In such cases | |
1843 | the value pointed to by @var{inbytesleft} is nonzero. | |
1844 | ||
1845 | @table @code | |
1846 | @item EILSEQ | |
1847 | The conversion stopped because of an invalid byte sequence in the input. | |
1848 | After the call, @code{*@var{inbuf}} points at the first byte of the | |
1849 | invalid byte sequence. | |
1850 | ||
1851 | @item E2BIG | |
1852 | The conversion stopped because it ran out of space in the output buffer. | |
1853 | ||
1854 | @item EINVAL | |
1855 | The conversion stopped because of an incomplete byte sequence at the end | |
1856 | of the input buffer. | |
1857 | ||
1858 | @item EBADF | |
1859 | The @var{cd} argument is invalid. | |
1860 | @end table | |
1861 | ||
1862 | @pindex iconv.h | |
bd3916e8 | 1863 | The @code{iconv} function was introduced in the XPG2 standard and is |
0b2b18a2 UD |
1864 | declared in the @file{iconv.h} header. |
1865 | @end deftypefun | |
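
The calling convention can be sketched as a single buffer-to-buffer
conversion followed by the state-reset call. The helper name
@code{convert_buffer} is hypothetical, and treating every failure
(including @code{EINVAL} for a truncated character) as fatal is a
simplification made for this sketch only.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert INLEN bytes at IN from FROMCODE to
   TOCODE into OUT (OUTSIZE bytes).  Returns the number of bytes
   written, or (size_t) -1 with errno set by iconv_open or iconv.  */
size_t
convert_buffer (const char *tocode, const char *fromcode,
                const char *in, size_t inlen,
                char *out, size_t outsize)
{
  assert (in != NULL && out != NULL);

  iconv_t cd = iconv_open (tocode, fromcode);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  char *inptr = (char *) in;    /* iconv only reads the input bytes */
  char *outptr = out;
  size_t inleft = inlen;
  size_t outleft = outsize;
  int saved_errno = 0;

  /* Convert the text, then pass a null input buffer so that any byte
     sequence needed to reach the initial state is written out.  */
  if (iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1
      || iconv (cd, NULL, NULL, &outptr, &outleft) == (size_t) -1)
    saved_errno = errno;

  iconv_close (cd);
  if (saved_errno != 0)
    {
      errno = saved_errno;
      return (size_t) -1;
    }
  return outsize - outleft;
}
```

After a failed call, @code{errno} distinguishes the three stop reasons:
@code{E2BIG} for a full output buffer, @code{EILSEQ} for invalid input,
and @code{EINVAL} for an incomplete sequence at the end of the input.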
1866 | ||
1867 | The definition of the @code{iconv} function is quite good overall. It | |
1868 | provides quite flexible functionality. The only problems lie in the | |
1869 | boundary cases, which are incomplete byte sequences at the end of the | |
1870 | input buffer and invalid input. A third problem, which is not really | |
1871 | a design problem, is the way conversions are selected. The standard | |
1872 | does not say anything about the legitimate names or about a minimal set of | |
1873 | available conversions. We will see how this negatively impacts other | |
1874 | implementations, as demonstrated below. | |
1875 | ||
1876 | @node iconv Examples | |
1877 | @subsection A complete @code{iconv} example | |
1878 | ||
1879 | The example below features a solution for a common problem. Given that | |
1880 | one knows the internal encoding used by the system for @code{wchar_t} | |
1881 | strings, one often is in the position to read text from a file and store | |
1882 | it in wide character buffers. One can do this using @code{mbsrtowcs}, | |
1883 | but then we run into the problems discussed above. | |
1884 | ||
1885 | @smallexample | |
1886 | int | |
1887 | file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail) | |
1888 | @{ | |
1889 | char inbuf[BUFSIZ]; | |
1890 | size_t insize = 0; | |
1891 | char *wrptr = (char *) outbuf; | |
1892 | int result = 0; | |
1893 | iconv_t cd; | |
1894 | ||
1895 | cd = iconv_open ("WCHAR_T", charset); | |
1896 | if (cd == (iconv_t) -1) | |
1897 | @{ | |
1898 | /* @r{Something went wrong.} */ | |
1899 | if (errno == EINVAL) | |
1900 | error (0, 0, "conversion from '%s' to wchar_t not available", | |
1901 | charset); | |
1902 | else | |
1903 | perror ("iconv_open"); | |
1904 | ||
1905 | /* @r{Terminate the output string.} */ | |
1906 | *outbuf = L'\0'; | |
1907 | ||
1908 | return -1; | |
1909 | @} | |
1910 | ||
1911 | while (avail > 0) | |
1912 | @{ | |
1913 | ssize_t nread; | |
1914 | size_t nconv; | |
1915 | char *inptr = inbuf; | |
1916 | ||
1917 | /* @r{Read more input; a read error ends the loop like end of file.} */ | |
1918 | nread = read (fd, inbuf + insize, sizeof (inbuf) - insize); | |
1919 | if (nread <= 0) | |
1920 | @{ | |
1921 | /* @r{When we come here the file is completely read.} | |
1922 | @r{This still could mean there are some unused} | |
1923 | @r{characters in the @code{inbuf}. Put them back.} */ | |
1924 | if (lseek (fd, -(off_t) insize, SEEK_CUR) == -1) | |
1925 | result = -1; | |
1926 | ||
1927 | /* @r{Now write out the byte sequence to get into the} | |
1928 | @r{initial state if this is necessary.} */ | |
1929 | iconv (cd, NULL, NULL, &wrptr, &avail); | |
1930 | ||
1931 | break; | |
1932 | @} | |
1933 | insize += nread; | |
1934 | ||
1935 | /* @r{Do the conversion.} */ | |
1936 | nconv = iconv (cd, &inptr, &insize, &wrptr, &avail); | |
1937 | if (nconv == (size_t) -1) | |
1938 | @{ | |
1939 | /* @r{Not everything went right. It might only be} | |
1940 | @r{an unfinished byte sequence at the end of the} | |
1941 | @r{buffer. Or it is a real problem.} */ | |
1942 | if (errno == EINVAL) | |
1943 | /* @r{This is harmless. Simply move the unused} | |
1944 | @r{bytes to the beginning of the buffer so that} | |
1945 | @r{they can be used in the next round.} */ | |
1946 | memmove (inbuf, inptr, insize); | |
1947 | else | |
1948 | @{ | |
1949 | /* @r{It is a real problem. Maybe we ran out of} | |
1950 | @r{space in the output buffer or we have invalid} | |
1951 | @r{input. In any case back the file pointer to} | |
1952 | @r{the position of the last processed byte.} */ | |
1953 | lseek (fd, -(off_t) insize, SEEK_CUR); | |
1954 | result = -1; | |
1955 | break; | |
1956 | @} | |
1957 | @} | |
1958 | @} | |
1959 | ||
1960 | /* @r{Terminate the output string.} */ | |
1961 | if (avail >= sizeof (wchar_t)) | |
1962 | *((wchar_t *) wrptr) = L'\0'; | |
1963 | ||
1964 | if (iconv_close (cd) != 0) | |
1965 | perror ("iconv_close"); | |
1966 | ||
1967 | return (wchar_t *) wrptr - outbuf; | |
1968 | @} | |
1969 | @end smallexample | |
1970 | ||
1971 | @cindex stateful | |
1972 | This example shows the most important aspects of using the @code{iconv} | |
1973 | functions. It shows how successive calls to @code{iconv} can be used to | |
1974 | convert large amounts of text. The user does not have to care about | |
1975 | stateful encodings as the functions take care of everything. | |
1976 | ||
1977 | An interesting point is the case where @code{iconv} returns an error and | |
bd3916e8 UD |
1978 | @code{errno} is set to @code{EINVAL}. This is not really an error in the |
1979 | transformation. It can happen whenever the input character set contains | |
1980 | byte sequences of more than one byte for some character and texts are not | |
1981 | processed in one piece. In this case there is a chance that a multibyte | |
1982 | sequence is cut. The caller can then simply read the remainder of the | |
1983 | text and feed the offending bytes together with new characters from the | |
1984 | input to @code{iconv} and continue the work. The internal state kept in | |
1985 | the descriptor is @emph{not} unspecified after such an event as is the | |
0b2b18a2 UD |
1986 | case with the conversion functions from the @w{ISO C} standard. |
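
The @code{EINVAL} case can be isolated in a small sketch that feeds the
input to @code{iconv} in deliberately small pieces and carries the bytes
of a cut character over to the next call. The function name
@code{convert_in_chunks}, the 64-byte staging buffer, and the limit on
the chunk size are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical helper: convert SRCLEN bytes at SRC through CD, handing
   iconv at most CHUNK new bytes per call.  Returns the number of
   output bytes, or (size_t) -1 on failure.  */
size_t
convert_in_chunks (iconv_t cd, const char *src, size_t srclen,
                   size_t chunk, char *out, size_t outsize)
{
  char buf[64];
  size_t pending = 0;           /* bytes in BUF not yet converted */
  char *outptr = out;
  size_t outleft = outsize;

  assert (chunk > 0 && chunk <= sizeof buf / 2);

  while (srclen > 0)
    {
      /* Append the next chunk after any bytes carried over.  */
      size_t n = srclen < chunk ? srclen : chunk;
      if (n > sizeof buf - pending)
        return (size_t) -1;     /* carried-over tail unexpectedly large */
      memcpy (buf + pending, src, n);
      src += n;
      srclen -= n;
      pending += n;

      char *inptr = buf;
      if (iconv (cd, &inptr, &pending, &outptr, &outleft) == (size_t) -1)
        {
          if (errno != EINVAL)
            return (size_t) -1; /* E2BIG or EILSEQ: a real problem */
          /* Harmless: a character was cut.  Move its bytes to the
             front so the next round can complete it.  */
          memmove (buf, inptr, pending);
        }
    }

  if (pending > 0)
    {
      errno = EINVAL;           /* input ended inside a character */
      return (size_t) -1;
    }
  return outsize - outleft;
}
```

With a chunk size of two bytes, a two-byte UTF-8 character that
straddles a chunk boundary first triggers @code{EINVAL} and is then
converted once its second byte arrives.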
1987 | ||
1988 | The example also shows the problem of using wide character strings with | |
1989 | @code{iconv}. As explained in the description of the @code{iconv} | |
1990 | function above, the function always takes a pointer to a @code{char} | |
1991 | array and the available space is measured in bytes. In the example, the | |
1992 | output buffer is a wide character buffer; therefore, we use a local | |
1993 | variable @var{wrptr} of type @code{char *}, which is used in the | |
1994 | @code{iconv} calls. | |
1995 | ||
1996 | This looks rather innocent but can lead to problems on platforms that | |
bd3916e8 UD |
1997 | have tight restrictions on alignment. Therefore the caller of @code{iconv} |
1998 | has to make sure that the pointers passed are suitable for access of | |
0b2b18a2 UD |
1999 | characters from the appropriate character set. Since, in the |
2000 | above case, the input parameter to the function is a @code{wchar_t} | |
2001 | pointer, this is the case (unless the user violates alignment when | |
2002 | computing the parameter). But in other situations, especially when | |
2003 | writing generic functions where one does not know what type of character | |
2004 | set one uses and, therefore, treats text as a sequence of bytes, it might | |
2005 | become tricky. | |
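
A minimal sketch of the safe pattern: the output storage is declared as
a @code{wchar_t} array, so the @code{char *} view handed to
@code{iconv} is suitably aligned, and all sizes are converted to bytes.
The helper name @code{utf8_to_wcs} is hypothetical, and the sketch
assumes the character set names @code{WCHAR_T} and @code{UTF-8} are
accepted by @code{iconv_open}.

```c
#include <assert.h>
#include <iconv.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the UTF-8 string IN into OUT, which has
   room for OUTCHARS wide characters including the terminator.
   Returns the number of wide characters stored (excluding the
   terminator), or (size_t) -1 on failure.  */
size_t
utf8_to_wcs (const char *in, wchar_t *out, size_t outchars)
{
  assert (in != NULL && out != NULL && outchars > 0);

  iconv_t cd = iconv_open ("WCHAR_T", "UTF-8");
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  char *inptr = (char *) in;
  size_t inleft = strlen (in);
  /* OUT is a wchar_t array, so this char * is aligned for wchar_t.
     The length is given in bytes; one element stays reserved for the
     terminating null wide character.  */
  char *outptr = (char *) out;
  size_t outleft = (outchars - 1) * sizeof (wchar_t);

  size_t rc = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  iconv_close (cd);
  if (rc == (size_t) -1)
    return (size_t) -1;

  size_t n = (size_t) ((wchar_t *) outptr - out);
  out[n] = L'\0';
  return n;
}
```

A generic routine that receives only a @code{char *} cannot rely on this
guarantee and would have to check or enforce the alignment itself.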
2006 | ||
2007 | @node Other iconv Implementations | |
2008 | @subsection Some Details about other @code{iconv} Implementations | |
2009 | ||
2010 | This is not really the place to discuss the @code{iconv} implementation | |
2011 | of other systems but it is necessary to know a bit about them to write | |
2012 | portable programs. The above mentioned problems with the specification | |
2013 | of the @code{iconv} functions can lead to portability issues. | |
2014 | ||
2015 | The first thing to notice is that, due to the large number of character | |
2016 | sets in use, it is certainly not practical to encode the conversions | |
2017 | directly in the C library. Therefore, the conversion information must | |
2018 | come from files outside the C library. This is usually done in one or | |
2019 | both of the following ways: | |
2020 | ||
2021 | @itemize @bullet | |
2022 | @item | |
2023 | The C library contains a set of generic conversion functions that can | |
2024 | read the needed conversion tables and other information from data files. | |
2025 | These files get loaded when necessary. | |
2026 | ||
2027 | This solution is problematic as it requires a great deal of effort to | |
bd3916e8 | 2028 | apply to all character sets (potentially an infinite set). The |
0b2b18a2 UD |
2029 | differences in the structure of the different character sets are so large |
2030 | that many different variants of the table-processing functions must be | |
bd3916e8 | 2031 | developed. In addition, the generic nature of these functions makes them |
0b2b18a2 UD |
2032 | slower than specifically implemented functions. |
2033 | ||
2034 | @item | |
2035 | The C library only contains a framework that can dynamically load | |
2036 | object files and execute the conversion functions contained therein. | |
2037 | ||
2038 | This solution provides much more flexibility. The C library itself | |
2039 | contains only very little code and therefore reduces the general memory | |
2040 | footprint. Also, with a documented interface between the C library and | |
2041 | the loadable modules it is possible for third parties to extend the set | |
2042 | of available conversion modules. A drawback of this solution is that | |
2043 | dynamic loading must be available. | |
2044 | @end itemize | |
2045 | ||
bd3916e8 UD |
2046 | Some implementations in commercial Unices implement a mixture of these |
2047 | possibilities; the majority implement only the second solution. Using | |
2048 | loadable modules moves the code out of the library itself and keeps | |
0b2b18a2 UD |
2049 | the door open for extensions and improvements, but this design is also |
2050 | limiting on some platforms since not many platforms support dynamic | |
2051 | loading in statically linked programs. On platforms without this | |
2052 | capability it is therefore not possible to use this interface in | |
1f77f049 | 2053 | statically linked programs. @Theglibc{} has, on ELF platforms, no |
0b2b18a2 | 2054 | problems with dynamic loading in these situations; therefore, this |
bd3916e8 | 2055 | point is moot. The danger is that one gets acquainted with this |
0b2b18a2 UD |
2056 | situation and forgets about the restrictions on other systems. |
2057 | ||
2058 | A second thing to know about other @code{iconv} implementations is that | |
2059 | the number of available conversions is often very limited. Some | |
2060 | implementations provide, in the standard release (not special | |
2061 | international or developer releases), at most 100 to 200 conversion | |
2062 | possibilities. This does not mean 200 different character sets are | |
bd3916e8 | 2063 | supported; for example, conversions from one character set to a set of 10 |
0b2b18a2 | 2064 | others might count as 10 conversions. Together with the other direction |
bd3916e8 | 2065 | this makes 20 conversion possibilities used up by one character set. One |
d987d219 | 2066 | can imagine the thin coverage these platforms provide. Some Unix vendors |
bd3916e8 | 2067 | even provide only a handful of conversions, which renders them useless for |
0b2b18a2 UD |
2068 | almost all uses. |
2069 | ||
2070 | This directly leads to a third and probably the most problematic point. | |
2071 | Given the way the @code{iconv} conversion functions are implemented on all | |
2072 | known Unix systems, the availability of the conversion functions from | |
2073 | character set @math{@cal{A}} to @math{@cal{B}} and the conversion from | |
2074 | @math{@cal{B}} to @math{@cal{C}} does @emph{not} imply that the | |
2075 | conversion from @math{@cal{A}} to @math{@cal{C}} is available. | |
2076 | ||
2077 | This might not seem unreasonable or problematic at first, but it is | |
2078 | quite a big problem, as one will notice shortly after hitting it. To show | |
2079 | the problem, assume we write a program that has to convert from | |
2080 | @math{@cal{A}} to @math{@cal{C}}. A call like | |
2081 | ||
2082 | @smallexample | |
2083 | cd = iconv_open ("@math{@cal{C}}", "@math{@cal{A}}"); | |
2084 | @end smallexample | |
2085 | ||
2086 | @noindent | |
2087 | fails according to the assumption above. But what does the program | |
2088 | do now? The conversion is necessary; therefore, simply giving up is not | |
2089 | an option. | |
2090 | ||
2091 | This is a nuisance. The @code{iconv} function should take care of this. | |
bd3916e8 | 2092 | But how should the program proceed from here on? If it tries to convert |
0b2b18a2 UD |
2093 | to character set @math{@cal{B}} first, the two @code{iconv_open} | |
2094 | calls | |
2095 | ||
2096 | @smallexample | |
2097 | cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}"); | |
2098 | @end smallexample | |
2099 | ||
2100 | @noindent | |
2101 | and | |
2102 | ||
2103 | @smallexample | |
2104 | cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}"); | |
2105 | @end smallexample | |
2106 | ||
2107 | @noindent | |
2108 | will succeed, but how to find @math{@cal{B}}? | |
2109 | ||
2110 | Unfortunately, the answer is: there is no general solution. On some | |
2111 | systems guessing might help. On those systems most character sets can | |
d987d219 | 2112 | convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Besides |
bd3916e8 | 2113 | this only some very system-specific methods can help. Since the |
0b2b18a2 UD |
2114 | conversion functions come from loadable modules and these modules must |
2115 | be stored somewhere in the filesystem, one @emph{could} try to find them | |
2116 | and determine from the available files which conversions are available | |
2117 | and whether there is an indirect route from @math{@cal{A}} to | |
2118 | @math{@cal{C}}. | |
2119 | ||
bd3916e8 | 2120 | This example shows one of the design errors of @code{iconv} mentioned |
0b2b18a2 | 2121 | above. It should at least be possible to determine the list of available |
d987d219 | 2122 | conversions programmatically so that if @code{iconv_open} says there is no |
0b2b18a2 UD |
2123 | such conversion, one could make sure this also is true for indirect |
2124 | routes. | |
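
The guessing strategy mentioned above can be sketched as follows: try
the direct conversion first and, only if @code{iconv_open} rejects it
with @code{EINVAL}, route the text through an intermediate character
set. All function names here are hypothetical, the choice of UTF-8 as
the fallback pivot is merely the guess discussed in the text, and the
fixed 256-byte intermediate buffer is an arbitrary limit chosen to keep
the sketch short.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Convert IN with a single descriptor; returns bytes written or
   (size_t) -1 with errno set.  */
static size_t
one_step (const char *tocode, const char *fromcode,
          const char *in, size_t inlen, char *out, size_t outsize)
{
  iconv_t cd = iconv_open (tocode, fromcode);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  char *inptr = (char *) in;
  char *outptr = out;
  size_t outleft = outsize;
  size_t rc = iconv (cd, &inptr, &inlen, &outptr, &outleft);
  if (rc != (size_t) -1)
    rc = iconv (cd, NULL, NULL, &outptr, &outleft);   /* reset state */
  int saved_errno = errno;
  iconv_close (cd);
  errno = saved_errno;
  return rc == (size_t) -1 ? rc : outsize - outleft;
}

/* Convert FROMCODE -> PIVOT -> TOCODE using two descriptors.  */
size_t
convert_via_pivot (const char *tocode, const char *pivot,
                   const char *fromcode, const char *in, size_t inlen,
                   char *out, size_t outsize)
{
  char tmp[256];                /* illustrative fixed-size buffer */
  size_t n = one_step (pivot, fromcode, in, inlen, tmp, sizeof tmp);
  if (n == (size_t) -1)
    return n;
  return one_step (tocode, pivot, tmp, n, out, outsize);
}

/* Try the direct route; fall back to UTF-8 as a guessed pivot.  */
size_t
convert_with_fallback (const char *tocode, const char *fromcode,
                       const char *in, size_t inlen,
                       char *out, size_t outsize)
{
  assert (tocode != NULL && fromcode != NULL);

  iconv_t cd = iconv_open (tocode, fromcode);
  if (cd != (iconv_t) -1)
    {
      iconv_close (cd);
      return one_step (tocode, fromcode, in, inlen, out, outsize);
    }
  if (errno != EINVAL)
    return (size_t) -1;
  return convert_via_pivot (tocode, "UTF-8", fromcode,
                            in, inlen, out, outsize);
}
```

If the guessed pivot is not supported either, the fallback fails as
well, which is exactly the lack of a general solution described above.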
2125 | ||
2126 | @node glibc iconv Implementation | |
1f77f049 | 2127 | @subsection The @code{iconv} Implementation in @theglibc{} |
0b2b18a2 UD |
2128 | |
2129 | After reading about the problems of @code{iconv} implementations in the | |
2130 | last section it is certainly good to note that the implementation in | |
1f77f049 | 2131 | @theglibc{} has none of the problems mentioned above. What |
0b2b18a2 UD |
2132 | follows is a step-by-step analysis of the points raised above. The |
2133 | evaluation is based on the current state of the development (as of | |
2134 | January 1999). The development of the @code{iconv} functions is not | |
2135 | complete, but basic functionality has solidified. | |
2136 | ||
1f77f049 | 2137 | @Theglibc{}'s @code{iconv} implementation uses shared loadable |
0b2b18a2 UD |
2138 | modules to implement the conversions. A very small number of |
2139 | conversions are built into the library itself but these are only rather | |
2140 | trivial conversions. | |
2141 | ||
1f77f049 | 2142 | All the benefits of loadable modules are available in the @glibcadj{} |
0b2b18a2 UD |
2143 | implementation. This is especially appealing since the interface is |
2144 | well documented (see below), and it, therefore, is easy to write new | |
2145 | conversion modules. The drawback of using loadable objects is not a | |
1f77f049 | 2146 | problem in @theglibc{}, at least on ELF systems. Since the |
0b2b18a2 | 2147 | library is able to load shared objects even in statically linked |
bd3916e8 | 2148 | binaries, static linking need not be forbidden in case one wants to use |
0b2b18a2 UD |
2149 | @code{iconv}. |
2150 | ||
2151 | The second mentioned problem is the number of supported conversions. | |
1f77f049 | 2152 | Currently, @theglibc{} supports more than 150 character sets. Because |
0b2b18a2 UD |
2153 | of the way the implementation is designed, the number of supported conversions | |
2154 | is greater than 22350 (@math{150} times @math{149}). If any conversion | |
2155 | from or to a character set is missing, it can be added easily. | |
2156 | ||
2157 | Particularly impressive as it may be, this high number is due to the | |
1f77f049 | 2158 | fact that the @glibcadj{} implementation of @code{iconv} does not have |
0b2b18a2 UD |
2159 | the third problem mentioned above (i.e., whenever there is a conversion |
2160 | from a character set @math{@cal{A}} to @math{@cal{B}} and from | |
2161 | @math{@cal{B}} to @math{@cal{C}} it is always possible to convert from | |
2162 | @math{@cal{A}} to @math{@cal{C}} directly). If @code{iconv_open} | |
bd3916e8 | 2163 | returns an error and sets @code{errno} to @code{EINVAL}, there is no |
0b2b18a2 UD |
2164 | known way, directly or indirectly, to perform the wanted conversion. |
2165 | ||
2166 | @cindex triangulation | |
bd3916e8 UD |
2167 | Triangulation is achieved by providing for each character set a |
2168 | conversion from and to UCS-4 encoded @w{ISO 10646}. Using @w{ISO 10646} | |
0b2b18a2 UD |
2169 | as an intermediate representation it is possible to @dfn{triangulate} |
2170 | (i.e., convert with an intermediate representation). | |
2171 | ||
2172 | There is no inherent requirement to provide a conversion to @w{ISO | |
2173 | 10646} for a new character set, and it is also possible to provide other | |
2174 | conversions where neither source nor destination character set is @w{ISO | |
bd3916e8 | 2175 | 10646}. The existing set of conversions is simply meant to cover all |
0b2b18a2 UD |
2176 | conversions that might be of interest. |
2177 | ||
2178 | @cindex ISO-2022-JP | |
2179 | @cindex EUC-JP | |
2180 | All currently available conversions use the triangulation method above, | |
bd3916e8 | 2181 | making conversions run unnecessarily slow. If, for example, somebody |
0b2b18a2 UD |
2182 | often needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution |
2183 | would involve direct conversion between the two character sets, skipping | |
2184 | the detour through @w{ISO 10646}. The two character sets of interest | |
2185 | are much more similar to each other than to @w{ISO 10646}. | |
2186 | ||
2187 | In such a situation one easily can write a new conversion and provide it | |
1f77f049 | 2188 | as a better alternative. The @glibcadj{} @code{iconv} implementation |
0b2b18a2 UD |
2189 | would automatically use the module implementing the conversion if it is |
2190 | specified to be more efficient. | |
2191 | ||
2192 | @subsubsection Format of @file{gconv-modules} files | |
2193 | ||
2194 | All information about the available conversions comes from a file named | |
2195 | @file{gconv-modules}, which can be found in any of the directories along | |
2196 | the @code{GCONV_PATH}. The @file{gconv-modules} files are line-oriented | |
2197 | text files, where each of the lines has one of the following formats: | |
2198 | ||
2199 | @itemize @bullet | |
2200 | @item | |
bd3916e8 | 2201 | If the first non-whitespace character is a @kbd{#} the line contains only |
0b2b18a2 UD |
2202 | comments and is ignored. |
2203 | ||
2204 | @item | |
bd3916e8 UD |
2205 | Lines starting with @code{alias} define an alias name for a character |
2206 | set. Two more words are expected on the line. The first word | |
0b2b18a2 UD |
2207 | defines the alias name, and the second defines the original name of the |
2208 | character set. The effect is that it is possible to use the alias name | |
2209 | in the @var{fromcode} or @var{tocode} parameters of @code{iconv_open} and | |
2210 | achieve the same result as when using the real character set name. | |
2211 | ||
2212 | This is quite important as a character set often has many different | |
bd3916e8 | 2213 | names. There is normally an official name but this need not correspond to |
d987d219 | 2214 | the most popular name. Besides this many character sets have special |
bd3916e8 UD |
2215 | names that are somehow constructed. For example, all character sets |
2216 | specified by the ISO have an alias of the form @code{ISO-IR-@var{nnn}} | |
2217 | where @var{nnn} is the registration number. This allows programs that | |
2218 | know about the registration number to construct character set names and | |
2219 | use them in @code{iconv_open} calls. More on the available names and | |
0b2b18a2 UD |
2220 | aliases follows below. |
2221 | ||
2222 | @item | |
2223 | Lines starting with @code{module} introduce an available conversion | |
2224 | module. These lines must contain three or four more words. | |
2225 | ||
2226 | The first word specifies the source character set, the second word the | |
bd3916e8 | 2227 | destination character set of the conversion implemented in this module, and |
0b2b18a2 UD |
2228 | the third word is the name of the loadable module. The filename is |
2229 | constructed by appending the usual shared object suffix (normally | |
2230 | @file{.so}) and this file is then supposed to be found in the same | |
bd3916e8 | 2231 | directory the @file{gconv-modules} file is in. The last word on the line, |
0b2b18a2 UD |
2232 | which is optional, is a numeric value representing the cost of the |
2233 | conversion. If this word is missing, a cost of @math{1} is assumed. The | |
2234 | numeric value itself does not matter that much; what counts are the | |
2235 | relative values of the sums of costs for all possible conversion paths. | |
2236 | Below is a more precise description of the use of the cost value. | |
2237 | @end itemize | |
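
The three line formats can be sketched as a small classifier. This is
an illustration of the rules listed above, not the parser used by the
library: the type and function names are hypothetical, words are
assumed to be separated by whitespace and shorter than 64 characters,
and treating blank lines as ignorable is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

enum line_kind { LINE_IGNORED, LINE_ALIAS, LINE_MODULE, LINE_INVALID };

struct gconv_line
{
  enum line_kind kind;
  char a[64];     /* alias name, or source character set */
  char b[64];     /* original name, or destination character set */
  char file[64];  /* module name without the shared object suffix */
  int cost;       /* stays at 1 when the cost word is omitted */
};

struct gconv_line
parse_gconv_line (const char *line)
{
  struct gconv_line r = { LINE_INVALID, "", "", "", 1 };
  char keyword[16];

  assert (line != NULL);

  /* A '#' as the first non-whitespace character marks a comment.  */
  while (isspace ((unsigned char) *line))
    line++;
  if (*line == '\0' || *line == '#')
    {
      r.kind = LINE_IGNORED;
      return r;
    }

  if (sscanf (line, "%15s", keyword) != 1)
    return r;

  if (strcmp (keyword, "alias") == 0)
    {
      /* Two more words: the alias and the original name.  */
      if (sscanf (line, "%*s %63s %63s", r.a, r.b) == 2)
        r.kind = LINE_ALIAS;
    }
  else if (strcmp (keyword, "module") == 0)
    {
      /* Three or four more words; a missing cost leaves r.cost at 1.  */
      int n = sscanf (line, "%*s %63s %63s %63s %d",
                      r.a, r.b, r.file, &r.cost);
      if (n >= 3)
        r.kind = LINE_MODULE;
    }
  return r;
}
```

For a @code{module} line, the file to load would then be named by
appending the shared object suffix to the @code{file} field.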
2238 | ||
2239 | Let us return to the example above, where one has written a module to directly | |
2240 | convert from ISO-2022-JP to EUC-JP and back. All that has to be done is | |
2241 | to put the new module, say @file{ISO2022JP-EUCJP.so}, in a directory | |
2242 | and add a file @file{gconv-modules} with the following content in the | |
2243 | same directory: | |
2244 | ||
2245 | @smallexample | |
2246 | module ISO-2022-JP// EUC-JP// ISO2022JP-EUCJP 1 | |
2247 | module EUC-JP// ISO-2022-JP// ISO2022JP-EUCJP 1 | |
2248 | @end smallexample | |
2249 | ||
2250 | To see why this is sufficient, it is necessary to understand how the | |
2251 | conversion used by @code{iconv} (and described in the descriptor) is | |
2252 | selected. The approach to this problem is quite simple. | |
2253 | ||
2254 | At the first call of the @code{iconv_open} function the program reads | |
2255 | all available @file{gconv-modules} files and builds up two tables: one | |
2256 | containing all the known aliases and another that contains the | |
2257 | information about the conversions and which shared object implements | |
2258 | them. | |
2259 | ||
2260 | @subsubsection Finding the conversion path in @code{iconv} | |
2261 | ||
2262 | The set of available conversions forms a directed graph with weighted | |
2263 | edges. The weights on the edges are the costs specified in the | |
2264 | @file{gconv-modules} files. The @code{iconv_open} function uses an | |
2265 | algorithm suitable for searching for the best path in such a graph and so | |
2266 | constructs a list of conversions that must be performed in succession | |
2267 | to get the transformation from the source to the destination character | |
2268 | set. | |
2269 | ||
2270 | Explaining why the above @file{gconv-modules} file allows the | |
2271 | @code{iconv} implementation to resolve the specific ISO-2022-JP to | |
2272 | EUC-JP conversion module instead of the conversion coming with the | |
2273 | library itself is straightforward. Since the latter conversion takes two | |
2274 | steps (from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to | |
2275 | EUC-JP), the cost is @math{1+1 = 2}. The above @file{gconv-modules} | |
2276 | file, however, specifies that the new conversion module can perform this | |
2277 | conversion with only the cost of @math{1}. | |
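
The cost comparison can be reproduced with a small shortest-path
sketch. The text does not say which algorithm @code{iconv_open} uses;
the Bellman-Ford relaxation below is chosen only because it is short,
and the names @code{struct edge} and @code{cheapest_cost} as well as the
limit of 16 character sets are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define MAX_NODES 16

/* One conversion step: a weighted, directed edge.  */
struct edge { const char *from; const char *to; int cost; };

/* Return the index of NAME in NAMES, adding it if necessary;
   -1 if the table is full.  */
static int
node_index (const char *names[], int *count, const char *name)
{
  for (int i = 0; i < *count; i++)
    if (strcmp (names[i], name) == 0)
      return i;
  if (*count == MAX_NODES)
    return -1;
  names[*count] = name;
  return (*count)++;
}

/* Return the smallest total cost of a path from FROM to TO, or -1 if
   there is none.  Costs are assumed to be non-negative.  */
int
cheapest_cost (const struct edge *edges, int nedges,
               const char *from, const char *to)
{
  const char *names[MAX_NODES];
  int dist[MAX_NODES];
  int count = 0;

  assert (edges != NULL || nedges == 0);

  int src = node_index (names, &count, from);
  int dst = node_index (names, &count, to);
  for (int e = 0; e < nedges; e++)
    if (node_index (names, &count, edges[e].from) < 0
        || node_index (names, &count, edges[e].to) < 0)
      return -1;

  for (int i = 0; i < count; i++)
    dist[i] = INT_MAX;
  dist[src] = 0;

  /* Relax every edge COUNT - 1 times.  */
  for (int pass = 1; pass < count; pass++)
    for (int e = 0; e < nedges; e++)
      {
        int u = node_index (names, &count, edges[e].from);
        int v = node_index (names, &count, edges[e].to);
        if (dist[u] != INT_MAX && dist[u] + edges[e].cost < dist[v])
          dist[v] = dist[u] + edges[e].cost;
      }

  return dist[dst] == INT_MAX ? -1 : dist[dst];
}
```

With only the two library edges through @code{INTERNAL} the cheapest
route costs 2; adding the direct edge with cost 1 makes the new module
the preferred path.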
2278 | ||
2279 | A mysterious item about the @file{gconv-modules} file above (and also | |
1f77f049 | 2280 | the file coming with @theglibc{}) is the naming of the character |
0b2b18a2 UD |
2281 | sets specified in the @code{module} lines. Why do almost all the names |
2282 | end in @code{//}? And this is not all: the names can actually be | |
2283 | regular expressions. At this point in time this mystery should not be | |
2284 | revealed, unless you have the relevant spell-casting materials: ashes | |
2285 | from an original @w{DOS 6.2} boot disk burnt in effigy, a crucifix | |
2286 | blessed by St.@: Emacs, assorted herbal roots from Central America, sand | |
2287 | from Cebu, etc. Sorry! @strong{The part of the implementation where | |
2288 | this is used is not yet finished. For now please simply follow the | |
2289 | existing examples. It'll become clearer once it is. --drepper} | |
2290 | ||
2291 | A last remark about the @file{gconv-modules} file concerns the names not | |
bd3916e8 UD |
2292 | ending with @code{//}. A character set named @code{INTERNAL} is often |
2293 | mentioned. From the discussion above and the chosen name it should have | |
2294 | become clear that this is the name for the representation used in the | |
2295 | intermediate step of the triangulation. We have said that this is UCS-4 | |
2296 | but actually that is not quite right. The UCS-4 specification also | |
2297 | includes the specification of the byte ordering used. Since a UCS-4 value | |
6c55cda3 | 2298 | consists of four bytes, a stored value is affected by byte ordering. The |
bd3916e8 UD |
2299 | internal representation is @emph{not} the same as UCS-4 in case the byte |
2300 | ordering of the processor (or at least the running process) is not the | |
2301 | same as the one required for UCS-4. This is done for performance reasons | |
2302 | as one does not want to perform unnecessary byte-swapping operations if | |
2303 | one is not interested in actually seeing the result in UCS-4. To avoid | |
11bf311e | 2304 | trouble with endianness, the internal representation is consistently named |
bd3916e8 | 2305 | @code{INTERNAL} even on big-endian systems where the representations are |
0b2b18a2 UD |
2306 | identical. |
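The difference between the two representations can be made visible with a few lines of C.  This is only a sketch: the function names are invented for the illustration, and it assumes that the host-order form is a plain copy of a 32-bit value, while UCS-4 stores the most significant byte first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store CH in UCS-4 byte order: most significant byte first.  */
void
store_ucs4 (unsigned char out[4], uint32_t ch)
{
  out[0] = (ch >> 24) & 0xff;
  out[1] = (ch >> 16) & 0xff;
  out[2] = (ch >> 8) & 0xff;
  out[3] = ch & 0xff;
}

/* Store CH in the byte order of the running process, which is what
   a host-order internal form amounts to: no byte swapping at all.  */
void
store_host_order (unsigned char out[4], uint32_t ch)
{
  memcpy (out, &ch, 4);
}

/* Return nonzero if both forms coincide on this machine, which is
   the case on big-endian hosts.  */
int
host_order_is_ucs4 (void)
{
  unsigned char a[4];
  unsigned char b[4];

  store_ucs4 (a, 0x0001F600);
  store_host_order (b, 0x0001F600);
  return memcmp (a, b, 4) == 0;
}
```

On a little-endian machine @code{host_order_is_ucs4} returns zero, and avoiding the byte swap in every intermediate step is the performance reason given above.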
2307 | ||
2308 | @subsubsection @code{iconv} module data structures | |
2309 | ||
bd3916e8 | 2310 | So far this section has described how modules are located and considered |
0b2b18a2 | 2311 | to be used. What remains to be described is the interface of the modules |
cf822e3c | 2312 | so that one can write new ones. This section describes the interface as |
bd3916e8 | 2313 | it is in use in January 1999. The interface will change a bit in the |
0b2b18a2 UD |
2314 | future but, with luck, only in an upwardly compatible way. |
2315 | ||
2316 | The definitions necessary to write new modules are publicly available | |
2317 | in the non-standard header @file{gconv.h}. The following text, | |
bd3916e8 | 2318 | therefore, describes the definitions from this header file. First, |
0b2b18a2 UD |
2319 | however, it is necessary to get an overview. |
2320 | ||
2321 | From the perspective of the user of @code{iconv} the interface is quite | |
bd3916e8 UD |
2322 | simple: the @code{iconv_open} function returns a handle that can be used |
2323 | in calls to @code{iconv}, and finally the handle is freed with a call to | |
0b2b18a2 UD |
2324 | @code{iconv_close}. The problem is that the handle has to be able to |
2325 | represent the possibly long sequences of conversion steps and also the | |
2326 | state of each conversion since the handle is all that is passed to the | |
2327 | @code{iconv} function. Therefore, the data structures are really the | |
2328 | elements necessary for understanding the implementation. | |
2329 | ||
2330 | We need two different kinds of data structures. The first describes the | |
2331 | conversion and the second describes the state etc. There are really two | |
2332 | type definitions like this in @file{gconv.h}. | |
2333 | @pindex gconv.h | |
2334 | ||
0b2b18a2 | 2335 | @deftp {Data type} {struct __gconv_step} |
d08a7e4c | 2336 | @standards{GNU, gconv.h} |
0b2b18a2 UD |
2337 | This data structure describes one conversion a module can perform. For |
2338 | each function in a loaded module with conversion functions there is | |
2339 | exactly one object of this type. This object is shared by all users of | |
2340 | the conversion (i.e., this object does not contain any information | |
2341 | corresponding to an actual conversion; it only describes the conversion | |
2342 | itself). | |
2343 | ||
2344 | @table @code | |
2345 | @item struct __gconv_loaded_object *__shlib_handle | |
2346 | @itemx const char *__modname | |
2347 | @itemx int __counter | |
2348 | All these elements of the structure are used internally in the C library | |
d987d219 | 2349 | to coordinate loading and unloading the shared object. One must not expect any |
0b2b18a2 UD |
2350 | of the other elements to be available or initialized. |
2351 | ||
2352 | @item const char *__from_name | |
2353 | @itemx const char *__to_name | |
2354 | @code{__from_name} and @code{__to_name} contain the names of the source and | |
2355 | destination character sets. They can be used to identify the actual | |
bd3916e8 | 2356 | conversion to be carried out since one module might implement conversions |
0b2b18a2 UD |
2357 | for more than one character set and/or direction. |
2358 | ||
2359 | @item gconv_fct __fct | |
2360 | @itemx gconv_init_fct __init_fct | |
2361 | @itemx gconv_end_fct __end_fct | |
2362 | These elements contain pointers to the functions in the loadable module. | |
2363 | The interface will be explained below. | |
2364 | ||
2365 | @item int __min_needed_from | |
2366 | @itemx int __max_needed_from | |
2367 | @itemx int __min_needed_to | |
2368 | @itemx int __max_needed_to | |
2369 | These values have to be supplied in the init function of the module. The | |
2370 | @code{__min_needed_from} value specifies the minimum number of bytes a | |
2371 | character of the source character set needs. The @code{__max_needed_from} | |
2372 | specifies the maximum value that also includes possible shift sequences. | |
2373 | ||
2374 | The @code{__min_needed_to} and @code{__max_needed_to} values serve the | |
bd3916e8 | 2375 | same purpose as @code{__min_needed_from} and @code{__max_needed_from} but |
0b2b18a2 UD |
2376 | this time for the destination character set. |
2377 | ||
2378 | It is crucial that these values be accurate since otherwise the | |
2379 | conversion functions will have problems or not work at all. | |
2380 | ||
2381 | @item int __stateful | |
bd3916e8 UD |
2382 | This element must also be initialized by the init function. |
2383 | @code{__stateful} is nonzero if the source character set is stateful. | |
0b2b18a2 UD |
2384 | Otherwise it is zero. |
2385 | ||
2386 | @item void *__data | |
2387 | This element can be used freely by the conversion functions in the | |
bd3916e8 UD |
2388 | module. The @code{__data} element can be used to communicate extra | |
2389 | information from one call to another. It need not be initialized if it is | |
2390 | not needed at all. If the @code{__data} element is assigned a pointer | |
2391 | to dynamically allocated memory (presumably in the init function), the | |
2392 | end function must deallocate that memory. Otherwise | |
0b2b18a2 UD |
2393 | the application will leak memory. |
2394 | ||
2395 | It is important to be aware that this data structure is shared by all | |
2396 | users of this specific conversion and therefore the @code{__data} | |
2397 | element must not contain data specific to one specific use of the | |
2398 | conversion function. | |
2399 | @end table | |
2400 | @end deftp | |
2401 | ||
0b2b18a2 | 2402 | @deftp {Data type} {struct __gconv_step_data} |
d08a7e4c | 2403 | @standards{GNU, gconv.h} |
0b2b18a2 UD |
2404 | This is the data structure that contains the information specific to |
2405 | each use of the conversion functions. | |
2406 | ||
2407 | ||
2408 | @table @code | |
2409 | @item char *__outbuf | |
2410 | @itemx char *__outbufend | |
2411 | These elements specify the output buffer for the conversion step. The | |
2412 | @code{__outbuf} element points to the beginning of the buffer, and | |
2413 | @code{__outbufend} points to the byte following the last byte in the | |
2414 | buffer. The conversion function must not assume anything about the size | |
d987d219 | 2415 | of the buffer but it can be safely assumed there is room for at |
0b2b18a2 UD |
2416 | least one complete character in the output buffer. |
2417 | ||
2418 | Once the conversion is finished, if the conversion is the last step, the | |
2419 | @code{__outbuf} element must be modified to point after the last byte | |
2420 | written into the buffer to signal how much output is available. If this | |
2421 | conversion step is not the last one, the element must not be modified. | |
2422 | The @code{__outbufend} element must not be modified. | |
2423 | ||
2424 | @item int __is_last | |
2425 | This element is nonzero if this conversion step is the last one. This | |
2426 | information is necessary for the recursion. See the description of the | |
2427 | conversion function internals below. This element must never be | |
2428 | modified. | |
2429 | ||
2430 | @item int __invocation_counter | |
bd3916e8 UD |
2431 | The conversion function can use this element to see how many calls of |
2432 | the conversion function already happened. Some character sets require a | |
0b2b18a2 | 2433 | certain prolog when generating output, and by comparing this value with |
bd3916e8 UD |
2434 | zero, one can find out whether it is the first call and whether, |
2435 | therefore, the prolog should be emitted. This element must never be | |
0b2b18a2 UD |
2436 | modified. |
2437 | ||
2438 | @item int __internal_use | |
2439 | This element is another one rarely used but needed in certain | |
2440 | situations. It is assigned a nonzero value in case the conversion | |
2441 | functions are used to implement @code{mbsrtowcs} et al.@: (i.e., the | |
2442 | function is not used directly through the @code{iconv} interface). | |
2443 | ||
2444 | This sometimes makes a difference as it is expected that the | |
2445 | @code{iconv} functions are used to translate entire texts while the | |
2446 | @code{mbsrtowcs} functions are normally used only to convert single | |
2447 | strings and might be used multiple times to convert entire texts. | |
2448 | ||
2449 | But in this situation we would have a problem complying with some rules of | |
2450 | the character set specification. Some character sets require a prolog, | |
2451 | which must appear exactly once for an entire text. If a number of | |
2452 | @code{mbsrtowcs} calls are used to convert the text, only the first call | |
2453 | must add the prolog. However, because there is no communication between the | |
2454 | different calls of @code{mbsrtowcs}, the conversion functions have no | |
2455 | way to find this out. The situation is different for sequences | |
2456 | of @code{iconv} calls since the handle allows access to the needed | |
2457 | information. | |
2458 | ||
bd3916e8 | 2459 | The @code{int __internal_use} element is mostly used together with |
0b2b18a2 UD |
2460 | @code{__invocation_counter} as follows: |
2461 | ||
2462 | @smallexample | |
2463 | if (!data->__internal_use | |
2464 | && data->__invocation_counter == 0) | |
2465 | /* @r{Emit prolog.} */ | |
95fdc6a0 | 2466 | @dots{} |
0b2b18a2 UD |
2467 | @end smallexample |
2468 | ||
2469 | This element must never be modified. | |
2470 | ||
2471 | @item mbstate_t *__statep | |
2472 | The @code{__statep} element points to an object of type @code{mbstate_t} | |
2473 | (@pxref{Keeping the state}). The conversion of a stateful character | |
bd3916e8 UD |
2474 | set must use the object pointed to by @code{__statep} to store |
2475 | information about the conversion state. The @code{__statep} element | |
0b2b18a2 UD |
2476 | itself must never be modified. |
2477 | ||
2478 | @item mbstate_t __state | |
2479 | This element must @emph{never} be used directly. It is only part of | |
2480 | this structure to have the needed space allocated. | |
2481 | @end table | |
2482 | @end deftp | |
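The interplay of @code{__internal_use} and @code{__invocation_counter} described above can be tried out in isolation.  The sketch below uses a simplified stand-in, @code{struct mock_step_data}, not the real structure from @file{gconv.h}.  The two-byte prolog and the function name are likewise assumptions made for the illustration, and the counter is advanced by the caller here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct __gconv_step_data.  It carries only
   the fields this sketch needs and is not the real definition.  */
struct mock_step_data
{
  char *outbuf;
  char *outbufend;
  int invocation_counter;
  int internal_use;
};

/* Hypothetical two-byte prolog, similar to a byte order mark.  */
static const char prolog[2] = { '\xfe', '\xff' };

/* Write the prolog if this is the first call made through the
   iconv-style interface.  Return the number of bytes written.  */
size_t
emit_prolog_if_first (struct mock_step_data *data)
{
  assert (data->outbuf <= data->outbufend);

  if (!data->internal_use && data->invocation_counter == 0)
    {
      /* No room: a real module would report a full output buffer.  */
      if ((size_t) (data->outbufend - data->outbuf) < sizeof prolog)
        return 0;

      memcpy (data->outbuf, prolog, sizeof prolog);
      data->outbuf += sizeof prolog;
      return sizeof prolog;
    }

  return 0;
}
```

The prolog is written only when both conditions hold.  A later call with a nonzero counter, or any call made on behalf of @code{mbsrtowcs}, leaves the buffer untouched.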
2483 | ||
2484 | @subsubsection @code{iconv} module interfaces | |
2485 | ||
2486 | With the knowledge about the data structures we now can describe the | |
2487 | conversion function itself. To understand the interface a bit of | |
bd3916e8 | 2488 | knowledge is necessary about the functionality in the C library that |
0b2b18a2 UD |
2489 | loads the objects with the conversions. |
2490 | ||
2491 | It is often the case that one conversion is used more than once (i.e., | |
2492 | there are several @code{iconv_open} calls for the same set of character | |
2493 | sets during one program run). The @code{mbsrtowcs} et al.@: functions in | |
1f77f049 | 2494 | @theglibc{} also use the @code{iconv} functionality, which |
0b2b18a2 UD |
2495 | increases the number of uses of the same functions even more. |
2496 | ||
bd3916e8 UD |
2497 | Because of this multiple use of conversions, the modules do not get |
2498 | loaded exclusively for one conversion. Instead a module once loaded can | |
2499 | be used by an arbitrary number of @code{iconv} or @code{mbsrtowcs} calls | |
0b2b18a2 | 2500 | at the same time. The splitting of the information between conversion- |
bd3916e8 | 2501 | function-specific information and conversion data makes this possible. |
0b2b18a2 UD |
2502 | The last section showed the two data structures used to do this. |
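Why the split matters can be shown with a toy stateful conversion.  Everything in this sketch is hypothetical: @code{struct mock_step} stands for the shared, read-only description of the conversion, @code{struct mock_use} for the per-use data, and the shift bytes are chosen arbitrarily.

```c
#include <assert.h>
#include <stddef.h>

/* Shared description of the conversion, read-only after
   initialization.  Mock type, not from gconv.h.  */
struct mock_step
{
  unsigned char shift_out;   /* Byte that enters the alternate mode.  */
  unsigned char shift_in;    /* Byte that returns to the initial mode.  */
};

/* Per-use data: one object for each open handle.  Mock type.  */
struct mock_use
{
  int shifted;
};

/* Toy conversion: in the alternate mode lowercase ASCII letters are
   written in upper case.  OUT must have room for INLEN bytes.
   Return the number of bytes written.  */
size_t
toy_convert (const struct mock_step *step, struct mock_use *use,
             const unsigned char *in, size_t inlen, unsigned char *out)
{
  size_t n = 0;

  assert (step != NULL && use != NULL);
  for (size_t i = 0; i < inlen; ++i)
    {
      if (in[i] == step->shift_out)
        use->shifted = 1;
      else if (in[i] == step->shift_in)
        use->shifted = 0;
      else if (use->shifted && in[i] >= 'a' && in[i] <= 'z')
        out[n++] = in[i] - 'a' + 'A';
      else
        out[n++] = in[i];
    }
  return n;
}
```

Two @code{struct mock_use} objects can be driven through the same @code{struct mock_step} in any interleaving without disturbing each other, because the shift state never touches the shared object.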
2503 | ||
2504 | This is of course also reflected in the interface and semantics of the | |
2505 | functions that the modules must provide. There are three functions that | |
2506 | must have the following names: | |
2507 | ||
2508 | @table @code | |
2509 | @item gconv_init | |
2510 | The @code{gconv_init} function initializes the conversion function | |
2511 | specific data structure. This very same object is shared by all | |
2512 | conversions that use this conversion and, therefore, no state information | |
bd3916e8 UD |
2513 | about the conversion itself must be stored in here. If a module |
2514 | implements more than one conversion, the @code{gconv_init} function will | |
0b2b18a2 UD |
2515 | be called multiple times. |
2516 | ||
2517 | @item gconv_end | |
2518 | The @code{gconv_end} function is responsible for freeing all resources | |
2519 | allocated by the @code{gconv_init} function. If there is nothing to do, | |
2520 | this function can be missing. Special care must be taken if the module | |
2521 | implements more than one conversion and the @code{gconv_init} function | |
2522 | does not allocate the same resources for all conversions. | |
2523 | ||
2524 | @item gconv | |
2525 | This is the actual conversion function. It is called to convert one | |
2526 | block of text. It gets passed the conversion step information | |
2527 | initialized by @code{gconv_init} and the conversion data, specific to | |
2528 | this use of the conversion functions. | |
2529 | @end table | |
2530 | ||
2531 | There are three data types defined for the three module interface | |
2532 | functions and these define the interface. | |
2533 | ||
0b2b18a2 | 2534 | @deftypevr {Data type} int {(*__gconv_init_fct)} (struct __gconv_step *) |
d08a7e4c | 2535 | @standards{GNU, gconv.h} |
0b2b18a2 UD |
2536 | This specifies the interface of the initialization function of the |
2537 | module. It is called exactly once for each conversion the module | |
2538 | implements. | |
2539 | ||
2540 | As explained in the description of the @code{struct __gconv_step} data | |
2541 | structure above the initialization function has to initialize parts of | |
2542 | it. | |
2543 | ||
2544 | @table @code | |
2545 | @item __min_needed_from | |
2546 | @itemx __max_needed_from | |
2547 | @itemx __min_needed_to | |
2548 | @itemx __max_needed_to | |
2549 | These elements must be initialized to the exact numbers of the minimum | |
2550 | and maximum number of bytes used by one character in the source and | |
2551 | destination character sets, respectively. If the characters all have the | |
2552 | same size, the minimum and maximum values are the same. | |
2553 | ||
2554 | @item __stateful | |
9dcc8f11 | 2555 | This element must be initialized to a nonzero value if the source |
0b2b18a2 UD |
2556 | character set is stateful. Otherwise it must be zero. |
2557 | @end table | |
2558 | ||
2559 | If the initialization function needs to communicate some information | |
bd3916e8 UD |
2560 | to the conversion function, this communication can happen using the |
2561 | @code{__data} element of the @code{__gconv_step} structure. But since | |
2562 | this data is shared by all the conversions, it must not be modified by | |
0b2b18a2 UD |
2563 | the conversion function. The example below shows how this can be used. |
2564 | ||
2565 | @smallexample | |
2566 | #define MIN_NEEDED_FROM 1 | |
2567 | #define MAX_NEEDED_FROM 4 | |
2568 | #define MIN_NEEDED_TO 4 | |
2569 | #define MAX_NEEDED_TO 4 | |
2570 | ||
2571 | int | |
2572 | gconv_init (struct __gconv_step *step) | |
2573 | @{ | |
2574 | /* @r{Determine which direction.} */ | |
2575 | struct iso2022jp_data *new_data; | |
2576 | enum direction dir = illegal_dir; | |
2577 | enum variant var = illegal_var; | |
2578 | int result; | |
2579 | ||
2580 | if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0) | |
2581 | @{ | |
2582 | dir = from_iso2022jp; | |
2583 | var = iso2022jp; | |
2584 | @} | |
2585 | else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0) | |
2586 | @{ | |
2587 | dir = to_iso2022jp; | |
2588 | var = iso2022jp; | |
2589 | @} | |
2590 | else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0) | |
2591 | @{ | |
2592 | dir = from_iso2022jp; | |
2593 | var = iso2022jp2; | |
2594 | @} | |
2595 | else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0) | |
2596 | @{ | |
2597 | dir = to_iso2022jp; | |
2598 | var = iso2022jp2; | |
2599 | @} | |
2600 | ||
2601 | result = __GCONV_NOCONV; | |
2602 | if (dir != illegal_dir) | |
2603 | @{ | |
2604 | new_data = (struct iso2022jp_data *) | |
2605 | malloc (sizeof (struct iso2022jp_data)); | |
2606 | ||
2607 | result = __GCONV_NOMEM; | |
2608 | if (new_data != NULL) | |
2609 | @{ | |
2610 | new_data->dir = dir; | |
2611 | new_data->var = var; | |
2612 | step->__data = new_data; | |
2613 | ||
2614 | if (dir == from_iso2022jp) | |
2615 | @{ | |
2616 | step->__min_needed_from = MIN_NEEDED_FROM; | |
2617 | step->__max_needed_from = MAX_NEEDED_FROM; | |
2618 | step->__min_needed_to = MIN_NEEDED_TO; | |
2619 | step->__max_needed_to = MAX_NEEDED_TO; | |
2620 | @} | |
2621 | else | |
2622 | @{ | |
2623 | step->__min_needed_from = MIN_NEEDED_TO; | |
2624 | step->__max_needed_from = MAX_NEEDED_TO; | |
2625 | step->__min_needed_to = MIN_NEEDED_FROM; | |
2626 | step->__max_needed_to = MAX_NEEDED_FROM + 2; | |
2627 | @} | |
2628 | ||
2629 | /* @r{Yes, this is a stateful encoding.} */ | |
2630 | step->__stateful = 1; | |
2631 | ||
2632 | result = __GCONV_OK; | |
2633 | @} | |
2634 | @} | |
2635 | ||
2636 | return result; | |
2637 | @} | |
2638 | @end smallexample | |
2639 | ||
2640 | The function first checks which conversion is wanted. The module from | |
bd3916e8 | 2641 | which this function is taken implements four different conversions; |
0b2b18a2 UD |
2642 | which one is selected can be determined by comparing the names. The |
2643 | comparison should always be done without paying attention to the case. | |
2644 | ||
bd3916e8 | 2645 | Next, a data structure, which contains the necessary information about |
0b2b18a2 | 2646 | which conversion is selected, is allocated. The data structure |
bd3916e8 UD |
2647 | @code{struct iso2022jp_data} is locally defined since, outside the |
2648 | module, this data is not used at all. Please note that if all four | |
d987d219 | 2649 | conversions this module supports are requested there are four data |
0b2b18a2 UD |
2650 | blocks. |
2651 | ||
2652 | One interesting thing is the initialization of the @code{__min_} and | |
2653 | @code{__max_} elements of the step data object. A single ISO-2022-JP | |
2654 | character can consist of one to four bytes. Therefore the | |
2655 | @code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined | |
2656 | this way. The output is always the @code{INTERNAL} character set (aka | |
2657 | UCS-4) and therefore each character consists of exactly four bytes. For | |
2658 | the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into | |
2659 | account that escape sequences might be necessary to switch the character | |
2660 | sets. Therefore the @code{__max_needed_to} element for this direction | |
2661 | gets assigned @code{MAX_NEEDED_FROM + 2}. This takes into account the | |
d987d219 | 2662 | two bytes needed for the escape sequences to signal the switching. The |
0b2b18a2 UD |
2663 | asymmetry in the maximum values for the two directions can be explained |
2664 | easily: when reading ISO-2022-JP text, escape sequences can be handled | |
2665 | alone (i.e., it is not necessary to process a real character since the | |
2666 | effect of the escape sequence can be recorded in the state information). | |
2667 | The situation is different for the other direction. Since it is in | |
2668 | general not known which character comes next, one cannot emit escape | |
2669 | sequences to change the state in advance. This means the escape | |
d987d219 | 2670 | sequences have to be emitted together with the next character. |
0b2b18a2 UD |
2671 | Therefore one needs more room than only for the character itself. |
2672 | ||
2673 | The possible return values of the initialization function are: | |
2674 | ||
2675 | @table @code | |
2676 | @item __GCONV_OK | |
2677 | The initialization succeeded. | |
2678 | @item __GCONV_NOCONV | |
2679 | The requested conversion is not supported in the module. This can | |
2680 | happen if the @file{gconv-modules} file has errors. | |
2681 | @item __GCONV_NOMEM | |
2682 | Memory required to store additional information could not be allocated. | |
2683 | @end table | |
2684 | @end deftypevr | |
2685 | ||
2686 | The function called before the module is unloaded is significantly | |
2687 | simpler. It often has nothing at all to do, in which case it can be left | |
2688 | out completely. | |
2689 | ||
0b2b18a2 | 2690 | @deftypevr {Data type} void {(*__gconv_end_fct)} (struct __gconv_step *) |
d08a7e4c | 2691 | @standards{GNU, gconv.h} |
0b2b18a2 UD |
2692 | The task of this function is to free all resources allocated in the |
2693 | initialization function. Therefore only the @code{__data} element of | |
2694 | the object pointed to by the argument is of interest. Continuing the | |
2695 | example from the initialization function, the finalization function | |
2696 | looks like this: | |
2697 | ||
2698 | @smallexample | |
2699 | void | |
2700 | gconv_end (struct __gconv_step *data) | |
2701 | @{ | |
2702 | free (data->__data); | |
2703 | @} | |
2704 | @end smallexample | |
2705 | @end deftypevr | |
2706 | ||
2707 | The most important function is the conversion function itself, which can | |
2708 | get quite complicated for complex character sets. But since this is not | |
2709 | of interest here, we will only describe a possible skeleton for the | |
2710 | conversion function. | |
2711 | ||
0b2b18a2 | 2712 | @deftypevr {Data type} int {(*__gconv_fct)} (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int) |
d08a7e4c | 2713 | @standards{GNU, gconv.h} |
d987d219 | 2714 | The conversion function can be called for two basic reasons: to convert |
0b2b18a2 UD |
2715 | text or to reset the state. From the description of the @code{iconv} |
2716 | function it can be seen why the flushing mode is necessary. What mode | |
bd3916e8 | 2717 | is selected is determined by the sixth argument, an integer. This |
0b2b18a2 UD |
2718 | argument being nonzero means that flushing is selected. |
2719 | ||
2720 | Common to both modes is where the output buffer can be found. The | |
2721 | information about this buffer is stored in the conversion step data. A | |
bd3916e8 UD |
2722 | pointer to this information is passed as the second argument to this |
2723 | function. The description of the @code{struct __gconv_step_data} | |
0b2b18a2 UD |
2724 | structure has more information on the conversion step data. |
2725 | ||
2726 | @cindex stateful | |
2727 | What has to be done for flushing depends on the source character set. | |
bd3916e8 UD |
2728 | If the source character set is not stateful, nothing has to be done. |
2729 | Otherwise the function has to emit a byte sequence to bring the state | |
2730 | object into the initial state. Once all this has happened the other | |
2731 | conversion modules in the chain of conversions have to get the same | |
2732 | chance. Whether another step follows can be determined from the | |
2733 | @code{__is_last} element of the step data structure to which the second | |
0b2b18a2 UD |
2734 | parameter points. |
2735 | ||
bd3916e8 UD |
2736 | The more interesting mode is when actual text has to be converted. The |
2737 | first step in this case is to convert as much text as possible from the | |
2738 | input buffer and store the result in the output buffer. The start of the | |
2739 | input buffer is determined by the third argument, which is a pointer to a | |
2740 | pointer variable referencing the beginning of the buffer. The fourth | |
0b2b18a2 UD |
2741 | argument is a pointer to the byte right after the last byte in the buffer. |
2742 | ||
2743 | The conversion has to be performed according to the current state if the | |
2744 | character set is stateful. The state is stored in an object pointed to | |
2745 | by the @code{__statep} element of the step data (second argument). Once | |
2746 | either the input buffer is empty or the output buffer is full the | |
2747 | conversion stops. At this point, the pointer variable referenced by the | |
2748 | third parameter must point to the byte following the last processed | |
2749 | byte (i.e., if all of the input is consumed, this pointer and the fourth | |
2750 | parameter have the same value). | |
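The stopping rule and the pointer update can be sketched with a trivial conversion from ISO 8859-1 to four-byte values in host byte order.  The status names below are stand-ins for the @code{__GCONV_} codes, and the output buffer is passed directly instead of through the step data; both are simplifications made for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for the status codes named in the text.  */
enum mock_status { MOCK_EMPTY_INPUT, MOCK_FULL_OUTPUT };

/* Convert bytes from *INBUF up to INBUFEND into 32-bit values stored
   at *OUTBUF.  Stop when the input is used up or the output has no
   room for another character.  On return *INBUF points after the
   last processed byte and *OUTBUF after the last written byte.  */
enum mock_status
latin1_to_host32 (const char **inbuf, const char *inbufend,
                  char **outbuf, char *outbufend)
{
  const char *in = *inbuf;
  char *out = *outbuf;
  enum mock_status status = MOCK_EMPTY_INPUT;

  assert (in <= inbufend && out <= outbufend);
  while (in < inbufend)
    {
      if (outbufend - out < 4)
        {
          status = MOCK_FULL_OUTPUT;
          break;
        }

      uint32_t ch = (unsigned char) *in++;
      memcpy (out, &ch, 4);
      out += 4;
    }

  *inbuf = in;
  *outbuf = out;
  return status;
}
```

If the output buffer fills up first, @code{*inbuf} is left pointing at the first unconverted byte, so a second call with a fresh output buffer continues exactly where the first one stopped.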
2751 | ||
bd3916e8 UD |
2752 | What now happens depends on whether this step is the last one. If it is |
2753 | the last step, the only thing that has to be done is to update the | |
0b2b18a2 | 2754 | @code{__outbuf} element of the step data structure to point after the |
bd3916e8 | 2755 | last written byte. This update gives the caller the information on how |
0b2b18a2 UD |
2756 | much text is available in the output buffer. In addition, the variable |
2757 | pointed to by the fifth parameter, which is of type @code{size_t}, must | |
2758 | be incremented by the number of characters (@emph{not bytes}) that were | |
2759 | converted in a non-reversible way. Then, the function can return. | |
2760 | ||
2761 | In case the step is not the last one, the later conversion functions have | |
2762 | to get a chance to do their work. Therefore, the appropriate conversion | |
2763 | function has to be called. The information about the functions is | |
2764 | stored in the conversion data structures, passed as the first parameter. | |
2765 | This information and the step data are stored in arrays, so the next | |
2766 | element in both cases can be found by simple pointer arithmetic: | |
2767 | ||
2768 | @smallexample | |
2769 | int | |
2770 | gconv (struct __gconv_step *step, struct __gconv_step_data *data, | |
2771 | const char **inbuf, const char *inbufend, size_t *written, | |
2772 | int do_flush) | |
2773 | @{ | |
2774 | struct __gconv_step *next_step = step + 1; | |
2775 | struct __gconv_step_data *next_data = data + 1; | |
95fdc6a0 | 2776 | @dots{} |
0b2b18a2 UD |
2777 | @end smallexample |
2778 | ||
2779 | The @code{next_step} pointer references the next step information and | |
2780 | @code{next_data} the next data record. The call of the next function | |
2781 | therefore will look similar to this: | |
2782 | ||
2783 | @smallexample | |
2784 | next_step->__fct (next_step, next_data, &outerr, outbuf, | |
2785 | written, 0) | |
2786 | @end smallexample | |
2787 | ||
2788 | But this is not yet all. Once the function call returns, the conversion | |
bd3916e8 UD |
2789 | function might have some more to do. If the return value of the function |
2790 | is @code{__GCONV_EMPTY_INPUT}, more room is available in the output | |
d987d219 | 2791 | buffer. Unless the input buffer is empty, the conversion functions start |
bd3916e8 UD |
2792 | all over again and process the rest of the input buffer. If the return |
2793 | value is not @code{__GCONV_EMPTY_INPUT}, something went wrong and we have | |
0b2b18a2 UD |
2794 | to recover from this. |
2795 | ||
2796 | A requirement for the conversion function is that the input buffer | |
2797 | pointer (the third argument) always point to the last character that | |
2798 | was put in converted form into the output buffer. This is trivially | |
2799 | true after the conversion performed in the current step, but if the | |
2800 | conversion functions deeper downstream stop prematurely, not all | |
2801 | characters from the output buffer are consumed and, therefore, the input | |
2802 | buffer pointers must be backed off to the right position. | |
2803 | ||
bd3916e8 UD |
2804 | Correcting the input buffers is easy to do if the input and output |
2805 | character sets have a fixed width for all characters. In this situation | |
2806 | we can compute how many characters are left in the output buffer and, | |
2807 | therefore, can correct the input buffer pointer appropriately with a | |
2808 | similar computation. Things get tricky if either character set | |
2809 | has characters represented with variable length byte sequences, and it | |
2810 | gets even more complicated if the conversion has to take care of the | |
2811 | state. In these cases the conversion has to be performed once again, from | |
2812 | the known state before the initial conversion (i.e., if necessary the | |
2813 | state of the conversion has to be reset and the conversion loop has to be | |
2814 | executed again). The difference now is that it is known how much input | |
2815 | must be created, and the conversion can stop before converting the first | |
2816 | unused character. Once this is done the input buffer pointers must be | |
0b2b18a2 UD |
2817 | updated again and the function can return. |
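For the fixed-width case the correction is a short computation.  The helper below is hypothetical and only illustrates the arithmetic; it assumes that every input character occupies @code{from_width} bytes and every output character @code{to_width} bytes.

```c
#include <assert.h>
#include <stddef.h>

/* IN_PROCESSED_END points after the last input byte converted in this
   step.  OUT_WRITTEN_END points after the last output byte written,
   and OUT_CONSUMED_END after the last output byte the next step took.
   Return the input position that matches OUT_CONSUMED_END.  */
const char *
back_off_input (const char *in_processed_end,
                const char *out_written_end,
                const char *out_consumed_end,
                size_t from_width, size_t to_width)
{
  size_t unused_bytes = (size_t) (out_written_end - out_consumed_end);

  /* Fixed-width output means only whole characters remain unused.  */
  assert (to_width > 0 && unused_bytes % to_width == 0);

  size_t unused_chars = unused_bytes / to_width;
  return in_processed_end - unused_chars * from_width;
}
```

For example, if six one-byte characters produced 24 bytes of output and the next step consumed only 16 of them, two characters are unused and the input pointer moves back by two bytes.  With variable-width or stateful character sets no such shortcut exists and the conversion has to be rerun as described above.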
2818 | ||
2819 | One final thing should be mentioned. If it is necessary for the | |
2820 | conversion to know whether it is the first invocation (in case a prolog | |
bd3916e8 UD |
2821 | has to be emitted), the conversion function should increment the |
2822 | @code{__invocation_counter} element of the step data structure just | |
0b2b18a2 UD |
2823 | before returning to the caller. See the description of the @code{struct |
2824 | __gconv_step_data} structure above for more information on how this can | |
2825 | be used. | |
2826 | ||
2827 | The return value must be one of the following values: | |
2828 | ||
2829 | @table @code | |
2830 | @item __GCONV_EMPTY_INPUT | |
2831 | All input was consumed and there is room left in the output buffer. | |
2832 | @item __GCONV_FULL_OUTPUT | |
2833 | No more room in the output buffer. In case this is not the last step | |
2834 | this value is propagated down from the call of the next conversion | |
bd3916e8 | 2835 | function in the chain. |
0b2b18a2 UD |
2836 | @item __GCONV_INCOMPLETE_INPUT |
2837 | The input buffer is not entirely empty since it contains an incomplete | |
2838 | character sequence. | |
2839 | @end table | |
2840 | ||
2841 | The following example provides a framework for a conversion function. | |
2842 | In case a new conversion has to be written the holes in this | |
2843 | implementation have to be filled and that is it. | |
2844 | ||
2845 | @smallexample | |
2846 | int | |
2847 | gconv (struct __gconv_step *step, struct __gconv_step_data *data, | |
2848 | const char **inbuf, const char *inbufend, size_t *written, | |
2849 | int do_flush) | |
2850 | @{ | |
2851 | struct __gconv_step *next_step = step + 1; | |
2852 | struct __gconv_step_data *next_data = data + 1; | |
2853 | gconv_fct fct = next_step->__fct; | |
2854 | int status; | |
2855 | ||
2856 | /* @r{If the function is called with no input this means we have} | |
2857 | @r{to reset to the initial state. The possibly partly} | |
2858 | @r{converted input is dropped.} */ | |
2859 | if (do_flush) | |
2860 | @{ | |
2861 | status = __GCONV_OK; | |
2862 | ||
2863 |       /* @r{Possibly emit a byte sequence which puts the state object} | |
2864 |          @r{into the initial state.} */ | |
2865 | ||
2866 | /* @r{Call the steps down the chain if there are any but only} | |
2867 | @r{if we successfully emitted the escape sequence.} */ | |
2868 | if (status == __GCONV_OK && ! data->__is_last) | |
2869 | status = fct (next_step, next_data, NULL, NULL, | |
2870 | written, 1); | |
2871 | @} | |
2872 | else | |
2873 | @{ | |
2874 | /* @r{We preserve the initial values of the pointer variables.} */ | |
2875 | const char *inptr = *inbuf; | |
2876 | char *outbuf = data->__outbuf; | |
2877 | char *outend = data->__outbufend; | |
2878 | char *outptr; | |
2879 | ||
2880 | do | |
2881 | @{ | |
2882 | /* @r{Remember the start value for this round.} */ | |
2883 | inptr = *inbuf; | |
2884 | /* @r{The outbuf buffer is empty.} */ | |
2885 | outptr = outbuf; | |
2886 | ||
2887 |           /* @r{For stateful encodings the state must be saved here.} */ | |
2888 | ||
2889 | /* @r{Run the conversion loop. @code{status} is set} | |
2890 | @r{appropriately afterwards.} */ | |
2891 | ||
cf822e3c | 2892 | /* @r{If this is the last step, leave the loop. There is} |
0b2b18a2 UD |
2893 | @r{nothing we can do.} */ |
2894 | if (data->__is_last) | |
2895 | @{ | |
2896 | /* @r{Store information about how many bytes are} | |
2897 | @r{available.} */ | |
2898 | data->__outbuf = outbuf; | |
2899 | ||
2900 | /* @r{If any non-reversible conversions were performed,} | |
2901 | @r{add the number to @code{*written}.} */ | |
2902 | ||
2903 | break; | |
2904 | @} | |
2905 | ||
2906 | /* @r{Write out all output that was produced.} */ | |
2907 | if (outbuf > outptr) | |
2908 | @{ | |
2909 | const char *outerr = data->__outbuf; | |
2910 | int result; | |
2911 | ||
2912 | result = fct (next_step, next_data, &outerr, | |
2913 | outbuf, written, 0); | |
2914 | ||
2915 | if (result != __GCONV_EMPTY_INPUT) | |
2916 | @{ | |
2917 | if (outerr != outbuf) | |
2918 | @{ | |
2919 | /* @r{Reset the input buffer pointer. We} | |
2920 | @r{document here the complex case.} */ | |
2921 | size_t nstatus; | |
2922 | ||
2923 | /* @r{Reload the pointers.} */ | |
2924 | *inbuf = inptr; | |
2925 | outbuf = outptr; | |
2926 | ||
2927 | /* @r{Possibly reset the state.} */ | |
2928 | ||
2929 | /* @r{Redo the conversion, but this time} | |
2930 | @r{the end of the output buffer is at} | |
2931 | @r{@code{outerr}.} */ | |
2932 | @} | |
2933 | ||
2934 | /* @r{Change the status.} */ | |
2935 | status = result; | |
2936 | @} | |
2937 | else | |
2938 |                 /* @r{All the output is consumed; we can make} | |
2939 |                    @r{another run if everything was ok.} */ | |
2940 | if (status == __GCONV_FULL_OUTPUT) | |
2941 | status = __GCONV_OK; | |
2942 | @} | |
2943 | @} | |
2944 | while (status == __GCONV_OK); | |
2945 | ||
2946 | /* @r{We finished one use of this step.} */ | |
2947 | ++data->__invocation_counter; | |
2948 | @} | |
2949 | ||
2950 | return status; | |
2951 | @} | |
2952 | @end smallexample | |
2953 | @end deftypevr | |
2954 | ||
2955 | This information should be sufficient to write new modules. Anybody | |
1f77f049 JM |
2956 | doing so should also take a look at the code available in the | |
2957 | @glibcadj{} sources. It contains many examples of working and optimized | |
0b2b18a2 UD |
2958 | modules. |
2959 | ||
bd3916e8 | 2960 | @c File charset.texi edited October 2001 by Dennis Grace, IBM Corporation |