Commit | Line | Data |
---|---|---|
40a55d20 UD |
1 | @node Message Translation |
2 | @chapter Message Translation | |
3 | ||
4 | The program's interface with the human should be designed to make the | |
5 | human's task easier. One of the possibilities is to use messages in | |
6 | whatever language the user prefers. | |
7 | ||
8 | Printing messages in different languages can be implemented in different | |
9 | ways. One could add all the different languages in the source code and | |
10 | choose among the variants every time a message has to be printed. This is | |
11 | certainly not a good solution since extending the set of languages is | |
12 | difficult (the code must be changed) and the code itself can become | |
13 | really big with dozens of message sets. | |
14 | ||
15 | A better solution is to keep the message sets for each language | |
16 | in separate files which are loaded at runtime depending on the language | |
17 | selection of the user. | |
18 | ||
19 | The GNU C Library provides two different sets of functions to support | |
20 | message translation. The problem is that neither of the interfaces is | |
21 | officially defined by the POSIX standard. The @code{catgets} family of | |
f2ea0f5b UD |
22 | functions is defined in the X/Open standard but this is derived from |
23 | industry decisions and therefore not necessarily based on reasonable | |
40a55d20 UD |
24 | decisions. |
25 | ||
26 | As mentioned above the message catalog handling provides easy | |
27 | extendibility by using external data files which contain the message | |
28 | translations. I.e., these files contain for each of the messages used | |
29 | in the program a translation for the appropriate language. So the tasks | |
fed8f7f7 | 30 | of the message handling functions are |
40a55d20 UD |
31 | |
32 | @itemize @bullet | |
33 | @item | |
34 | locate the external data file with the appropriate translations. | |
35 | @item | |
36 | load the data and make it possible to address the messages | |
37 | @item | |
38 | map a given key to the translated message | |
39 | @end itemize | |
40 | ||
41 | The two approaches mainly differ in the implementation of this last | |
42 | step. The design decisions made for it influence everything else. | |
43 | ||
44 | @menu | |
45 | * Message catalogs a la X/Open:: The @code{catgets} family of functions. | |
46 | * The Uniforum approach:: The @code{gettext} family of functions. | |
47 | @end menu | |
48 | ||
49 | ||
50 | @node Message catalogs a la X/Open | |
51 | @section X/Open Message Catalog Handling | |
52 | ||
53 | The @code{catgets} functions are based on the simple scheme: | |
54 | ||
55 | @quotation | |
56 | Associate every message to translate in the source code with a unique | |
57 | identifier. To retrieve a message from a catalog file solely the | |
58 | identifier is used. | |
59 | @end quotation | |
60 | ||
61 | This means for the author of the program that s/he will have to make | |
62 | sure the meaning of the identifier in the program code and in the | |
63 | message catalogs is always the same. | |
64 | ||
65 | Before a message can be translated the catalog file must be located. | |
66 | The user of the program must be able to guide the responsible function | |
67 | to find whatever catalog the user wants. This is separated from what | |
68 | the programmer had in mind. | |
69 | ||
f2ea0f5b | 70 | All the types, constants and functions for the @code{catgets} functions |
40a55d20 UD |
71 | are defined/declared in the @file{nl_types.h} header file. |
72 | ||
73 | @menu | |
74 | * The catgets Functions:: The @code{catgets} function family. | |
75 | * The message catalog files:: Format of the message catalog files. | |
76 | * The gencat program:: How to generate message catalogs files which | |
77 | can be used by the functions. | |
78 | * Common Usage:: How to use the @code{catgets} interface. | |
79 | @end menu | |
80 | ||
81 | ||
82 | @node The catgets Functions | |
83 | @subsection The @code{catgets} function family | |
84 | ||
85 | @comment nl_types.h | |
86 | @comment X/Open | |
87 | @deftypefun nl_catd catopen (const char *@var{cat_name}, int @var{flag}) | |
88 | The @code{catopen} function tries to locate the message data file named | |
89 | @var{cat_name} and loads it when found. The return value is of an | |
90 | opaque type and can be used in calls to the other functions to refer to | |
91 | this loaded catalog. | |
92 | ||
93 | The return value is @code{(nl_catd) -1} in case the function failed and | |
94 | no catalog was loaded. The global variable @var{errno} contains a code | |
95 | for the error causing the failure. But even if the function call | |
96 | succeeded this does not mean that all messages can be translated. | |
97 | ||
98 | Locating the catalog file must happen in a way which lets the user of | |
99 | the program influence the decision. It is up to the user to decide | |
100 | about the language to use and sometimes it is useful to use alternate | |
101 | catalog files. All this can be specified by the user by setting some | |
f2ea0f5b | 102 | environment variables. |
40a55d20 UD |
103 | |
104 | The first problem is to find out where all the message catalogs are | |
105 | stored. Every program could have its own place to keep all the | |
106 | different files but usually the catalog files are grouped by languages | |
107 | and the catalogs for all programs are kept in the same place. | |
108 | ||
109 | @cindex NLSPATH environment variable | |
110 | To tell the @code{catopen} function where the catalog for the program | |
111 | can be found the user can set the environment variable @code{NLSPATH} to | |
112 | a value which describes her/his choice. Since this value must be usable | |
113 | for different languages and locales it cannot be a simple string. | |
114 | Instead it is a format string (similar to @code{printf}'s). An example | |
115 | is | |
116 | ||
117 | @smallexample | |
118 | /usr/share/locale/%L/%N:/usr/share/locale/%L/LC_MESSAGES/%N | |
119 | @end smallexample | |
120 | ||
121 | First one can see that more than one directory can be specified (with | |
122 | the usual syntax of separating them by colons). The next things to | |
123 | observe are the format elements, @code{%L} and @code{%N} in this case. | |
124 | The @code{catopen} function knows about several of them and the | |
125 | replacement for all of them is of course different. | |
126 | ||
127 | @table @code | |
128 | @item %N | |
129 | This format element is substituted with the name of the catalog file. | |
130 | This is the value of the @var{cat_name} argument given to | |
131 | @code{catopen}. | |
132 | ||
133 | @item %L | |
134 | This format element is substituted with the name of the currently | |
135 | selected locale for translating messages. How this is determined is | |
136 | explained below. | |
137 | ||
138 | @item %l | |
139 | (This is the lowercase ell.) This format element is substituted with the | |
f2ea0f5b | 140 | language element of the locale name. The string describing the selected |
40a55d20 UD |
141 | locale is expected to have the form |
142 | @code{@var{lang}[_@var{terr}[.@var{codeset}]]} and this format uses the | |
143 | first part @var{lang}. | |
144 | ||
145 | @item %t | |
146 | This format element is substituted by the territory part @var{terr} of | |
147 | the name of the currently selected locale. See the explanation of the | |
148 | format above. | |
149 | ||
150 | @item %c | |
151 | This format element is substituted by the codeset part @var{codeset} of | |
152 | the name of the currently selected locale. See the explanation of the | |
153 | format above. | |
154 | ||
155 | @item %% | |
156 | Since @code{%} is used as a meta character there must be a way to | |
157 | express the @code{%} character in the result itself. Using @code{%%} | |
158 | does this just like it works for @code{printf}. | |
159 | @end table | |
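
The substitution rules in the table above can be sketched in C. This is only an illustration, not the code used by @code{catopen}: the helper name @code{expand_nlspath_entry}, its buffer-based interface, and the decision to copy unknown @code{%} sequences verbatim are all assumptions made for this example. It expands a single template, so a caller would first split @code{NLSPATH} at the colons.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the C library: expand one NLSPATH
   template TMPL into OUT (at most OUTSIZE bytes, terminator included).
   LOCALE is expected to have the form lang[_terr[.codeset]].
   Returns 0 on success and -1 if the result does not fit.  */
int
expand_nlspath_entry (const char *tmpl, const char *cat_name,
                      const char *locale, char *out, size_t outsize)
{
  /* Split the locale name into language, territory and codeset.  */
  size_t lang_len = strcspn (locale, "_.");
  const char *terr = "";
  const char *codeset = "";
  size_t terr_len = 0;
  size_t codeset_len = 0;
  const char *p = locale + lang_len;
  size_t used = 0;

  if (outsize == 0)
    return -1;
  if (*p == '_')
    {
      terr = p + 1;
      terr_len = strcspn (terr, ".");
      p = terr + terr_len;
    }
  if (*p == '.')
    {
      codeset = p + 1;
      codeset_len = strlen (codeset);
    }

  for (; *tmpl != '\0'; ++tmpl)
    {
      /* By default copy the current character unchanged.  */
      const char *piece = tmpl;
      size_t piece_len = 1;

      if (*tmpl == '%' && tmpl[1] != '\0')
        {
          ++tmpl;
          switch (*tmpl)
            {
            case 'N': piece = cat_name; piece_len = strlen (cat_name); break;
            case 'L': piece = locale; piece_len = strlen (locale); break;
            case 'l': piece = locale; piece_len = lang_len; break;
            case 't': piece = terr; piece_len = terr_len; break;
            case 'c': piece = codeset; piece_len = codeset_len; break;
            case '%': piece = "%"; piece_len = 1; break;
            /* Assumption: unknown elements are copied verbatim.  */
            default: piece = tmpl - 1; piece_len = 2; break;
            }
        }
      if (used + piece_len >= outsize)
        return -1;
      memcpy (out + used, piece, piece_len);
      used += piece_len;
    }
  out[used] = '\0';
  return 0;
}
```

With the template @code{/usr/share/locale/%L/%N}, the catalog name @code{hello.cat} and the locale @code{de_DE.UTF-8}, the sketch produces @code{/usr/share/locale/de_DE.UTF-8/hello.cat}.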
160 | ||
161 | ||
162 | Using @code{NLSPATH} allows arbitrary directories to be | |
163 | searched for message catalogs while still allowing different languages | |
164 | to be used. If the @code{NLSPATH} environment variable is not set the | |
165 | default value is | |
166 | ||
167 | @smallexample | |
168 | @var{prefix}/share/locale/%L/%N:@var{prefix}/share/locale/%L/LC_MESSAGES/%N | |
169 | @end smallexample | |
170 | ||
171 | @noindent | |
172 | where @var{prefix} is given to @code{configure} while installing the GNU | |
173 | C Library (this value is in many cases @code{/usr} or the empty string). | |
174 | ||
175 | The remaining problem is to decide which locale must be used. Its name | |
176 | determines the substitution of the format elements mentioned above. | |
177 | First of all the user can specify a path in the message catalog name | |
178 | (i.e., the name contains a slash character). In this situation the | |
179 | @code{NLSPATH} environment variable is not used. The catalog must exist | |
180 | as specified in the program, perhaps relative to the current working | |
181 | directory. This situation is not desirable and catalog names should | |
182 | never be written this way. Besides, this behaviour is not portable | |
183 | to all other platforms providing the @code{catgets} interface. | |
184 | ||
185 | @cindex LC_ALL environment variable | |
186 | @cindex LC_MESSAGES environment variable | |
187 | @cindex LANG environment variable | |
188 | Otherwise the values of environment variables from the standard | |
f2ea0f5b | 189 | environment are examined (@pxref{Standard Environment}). Which |
40a55d20 UD |
190 | variables are examined is decided by the @var{flag} parameter of |
191 | @code{catopen}. If the value is @code{NL_CAT_LOCALE} (which is defined | |
192 | in @file{nl_types.h}) then the @code{catopen} function examines the | |
193 | environment variables @code{LC_ALL}, @code{LC_MESSAGES}, and @code{LANG} | |
194 | in this order. The first variable which is set in the current | |
195 | environment will be used. | |
196 | ||
197 | If @var{flag} is zero only the @code{LANG} environment variable is | |
198 | examined. This is a left-over from the early days of this function | |
199 | when the other environment variables were not known. | |
200 | ||
201 | In any case the environment variable should have a value of the form | |
202 | @code{@var{lang}[_@var{terr}[.@var{codeset}]]} as explained above. If | |
203 | no environment variable is set the @code{"C"} locale is used which | |
204 | prevents any translation. | |
205 | ||
206 | The return value of @code{catgets} is in any case a valid string. Either | |
207 | it is a translation from a message catalog or it is the same as the | |
208 | @var{string} parameter. So a piece of code to decide whether a | |
209 | translation actually happened must look like this: | |
210 | ||
211 | @smallexample | |
212 | @{ | |
213 | char *trans = catgets (desc, set, msg, input_string); | |
214 | if (trans == input_string) | |
215 | @{ | |
216 | /* Something went wrong. */ | |
217 | @} | |
218 | @} | |
219 | @end smallexample | |
220 | ||
221 | @noindent | |
222 | When an error occurred the global variable @var{errno} is set to | |
223 | ||
224 | @table @var | |
225 | @item EBADF | |
226 | The catalog does not exist. | |
227 | @item ENOMSG | |
f2ea0f5b | 228 | The set/message tuple does not name an existing element in the |
40a55d20 UD |
229 | message catalog. |
230 | @end table | |
231 | ||
232 | While it sometimes can be useful to test for errors programs normally | |
233 | will avoid any test. If the translation is not available it is no big | |
234 | problem if the original, untranslated message is printed. Either the | |
235 | user understands this as well or s/he will look for the reason why the | |
236 | messages are not translated. | |
237 | @end deftypefun | |
238 | ||
239 | Please note that the currently selected locale does not depend on a call | |
240 | to the @code{setlocale} function. It is not necessary that the locale | |
241 | data files for this locale exist and calling @code{setlocale} succeeds. | |
242 | The @code{catopen} function directly reads the values of the environment | |
243 | variables. | |
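
The lookup order described above can be written down as a small C sketch. The function @code{select_catalog_locale} is hypothetical; it takes the three variable values as arguments so that the rule can be read in isolation. Treating an empty value like an unset variable is an assumption of this sketch and is not stated in the text above.

```c
#include <nl_types.h>
#include <stddef.h>

/* Assumption of this sketch: an empty value counts as "not set".  */
static int
is_set (const char *value)
{
  return value != NULL && value[0] != '\0';
}

/* Hypothetical helper: return the locale name selected for the given
   catopen FLAG from the values of LC_ALL, LC_MESSAGES and LANG.  */
const char *
select_catalog_locale (int flag, const char *lc_all,
                       const char *lc_messages, const char *lang)
{
  /* Only NL_CAT_LOCALE consults LC_ALL and LC_MESSAGES.  */
  if (flag == NL_CAT_LOCALE)
    {
      if (is_set (lc_all))
        return lc_all;
      if (is_set (lc_messages))
        return lc_messages;
    }
  if (is_set (lang))
    return lang;
  /* Nothing is set: the "C" locale prevents any translation.  */
  return "C";
}
```

A caller would pass the results of @code{getenv ("LC_ALL")}, @code{getenv ("LC_MESSAGES")} and @code{getenv ("LANG")}; no call to @code{setlocale} is involved.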
244 | ||
245 | ||
246 | @deftypefun {char *} catgets (nl_catd @var{catalog_desc}, int @var{set}, int @var{message}, const char *@var{string}) | |
247 | The function @code{catgets} has to be used to access the message catalog | |
248 | previously opened using the @code{catopen} function. The | |
249 | @var{catalog_desc} parameter must be a value previously returned by | |
250 | @code{catopen}. | |
251 | ||
252 | The next two parameters, @var{set} and @var{message}, reflect the | |
253 | internal organization of the message catalog files. This will be | |
254 | explained in detail below. For now it is interesting to know that a | |
255 | catalog can consist of several sets and the messages in each set are | |
256 | individually numbered. Neither the set numbers nor the | |
257 | message numbers need be consecutive. They can be arbitrarily chosen. | |
258 | But each message (unless equal to another one) must have its own unique | |
259 | pair of set and message number. | |
260 | ||
261 | Since it is not guaranteed that the message catalog for the language | |
262 | selected by the user exists the last parameter @var{string} helps to | |
263 | handle this case gracefully. If no matching string can be found | |
264 | @var{string} is returned. This means for the programmer that | |
265 | ||
266 | @itemize @bullet | |
267 | @item | |
268 | the @var{string} parameters should contain reasonable text (this also | |
269 | helps to understand the program since otherwise there would be no hint | |
270 | on the string which is expected to be returned). | |
271 | @item | |
272 | all @var{string} arguments should be written in the same language. | |
273 | @end itemize | |
274 | @end deftypefun | |
275 | ||
276 | It is somewhat uncomfortable to write a program using the @code{catgets} | |
277 | functions if no supporting functionality is available. Since each | |
f2ea0f5b | 278 | set/message number tuple must be unique the programmer must keep lists |
40a55d20 UD |
279 | of the messages at the same time the code is written. And the work |
280 | between several people working on the same project must be coordinated. | |
281 | In @ref{Common Usage} we will see how these problems can be eased | |
282 | a bit. | |
283 | ||
284 | @deftypefun int catclose (nl_catd @var{catalog_desc}) | |
285 | The @code{catclose} function can be used to free the resources | |
286 | associated with a message catalog which previously was opened by a call | |
287 | to @code{catopen}. If the resources can be successfully freed the | |
288 | function returns @code{0}. Otherwise it returns @code{@minus{}1} and the | |
289 | global variable @var{errno} is set. Errors can occur if the catalog | |
290 | descriptor @var{catalog_desc} is not valid in which case @var{errno} is | |
291 | set to @code{EBADF}. | |
292 | @end deftypefun | |
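
Taken together, the three functions are typically used as in the following sketch. The helper @code{lookup_message} and its buffer interface are hypothetical and exist only for this illustration. The message is copied before @code{catclose} is called because the pointer returned by @code{catgets} may refer to data belonging to the catalog.

```c
#include <nl_types.h>
#include <stdio.h>

/* Hypothetical helper: look up message MSG of set SET in the catalog
   CAT_NAME and copy the text (or FALLBACK) into OUT.  Returns 1 if a
   translation was found and 0 if the fallback string was used.  */
int
lookup_message (const char *cat_name, int set, int msg,
                const char *fallback, char *out, size_t outsize)
{
  const char *text = fallback;
  nl_catd catd = catopen (cat_name, NL_CAT_LOCALE);
  int translated;

  /* catgets is only called with a descriptor that catopen returned.  */
  if (catd != (nl_catd) -1)
    text = catgets (catd, set, msg, fallback);

  /* catgets hands back FALLBACK itself when nothing was found.  */
  translated = (text != fallback);

  /* Copy before closing the catalog.  */
  snprintf (out, outsize, "%s", text);

  if (catd != (nl_catd) -1)
    catclose (catd);
  return translated;
}
```

When the catalog cannot be opened the caller still gets a usable string, which matches the advice above that programs normally need no explicit error handling here.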
293 | ||
294 | ||
295 | @node The message catalog files | |
296 | @subsection Format of the message catalog files | |
297 | ||
298 | The only reasonable way to translate all the messages of a program and | |
299 | store the result in a message catalog file which can be read by the | |
300 | @code{catopen} function is to give all the message text to the | |
301 | translator and let her/him translate them all. I.e., we must have a | |
f2ea0f5b | 302 | file with entries which associate the set/message tuple with a specific |
40a55d20 UD |
303 | translation. This file format is specified in the X/Open standard and |
304 | is as follows: | |
305 | ||
306 | @itemize @bullet | |
307 | @item | |
308 | Lines containing only whitespace characters or empty lines are ignored. | |
309 | ||
310 | @item | |
311 | Lines which contain as the first non-whitespace character a @code{$} | |
312 | followed by a whitespace character are comments and are also ignored. | |
313 | ||
314 | @item | |
315 | If a line contains as the first non-whitespace characters the sequence | |
316 | @code{$set} followed by a whitespace character an additional argument | |
317 | is required to follow. This argument can either be: | |
318 | ||
319 | @itemize @minus | |
320 | @item | |
321 | a number. In this case the value of this number determines the set | |
322 | to which the following messages are added. | |
323 | ||
324 | @item | |
325 | an identifier consisting of alphanumeric characters plus the underscore | |
326 | character. In this case the set automatically gets a number assigned. | |
327 | This value is one more than the largest set number which has appeared so far. | |
328 | ||
329 | How to use the symbolic names is explained in @ref{Common Usage}. | |
330 | ||
331 | It is an error if a symbol name appears more than once. All following | |
332 | messages are placed in a set with this number. | |
333 | @end itemize | |
334 | ||
335 | @item | |
336 | If a line contains as the first non-whitespace characters the sequence | |
337 | @code{$delset} followed by a whitespace character an additional argument | |
338 | is required to follow. This argument can either be: | |
339 | ||
340 | @itemize @minus | |
341 | @item | |
342 | a number. In this case the value of this number determines the set | |
343 | which will be deleted. | |
344 | ||
345 | @item | |
346 | an identifier consisting of alphanumeric characters plus the underscore | |
347 | character. This symbolic identifier must match a name for a set which | |
348 | previously was defined. It is an error if the name is unknown. | |
349 | @end itemize | |
350 | ||
351 | In both cases all messages in the specified set will be removed. They | |
352 | will not appear in the output. But if this set is later selected again | |
353 | with a @code{$set} command, messages can be added again and these | |
354 | messages will appear in the output. | |
355 | ||
356 | @item | |
357 | If a line contains after leading whitespace the sequence | |
358 | @code{$quote}, the quoting character used for this input file is | |
359 | changed to the first non-whitespace character following the | |
360 | @code{$quote}. If no non-whitespace character is present before the | |
361 | line ends quoting is disabled. | |
362 | ||
363 | By default no quoting character is used. In this mode strings are | |
364 | terminated with the first unescaped line break. If there is a | |
365 | @code{$quote} sequence present newlines need not be escaped. Instead a | |
f2ea0f5b | 366 | string is terminated with the first unescaped appearance of the quote |
40a55d20 UD |
367 | character. |
368 | ||
369 | A common usage of this feature would be to set the quote character to | |
f2ea0f5b | 370 | @code{"}. Then any appearance of the @code{"} in the strings must |
40a55d20 UD |
371 | be escaped using the backslash (i.e., @code{\"} must be written). |
372 | ||
373 | @item | |
374 | Any other line must start with a number or an alphanumeric identifier | |
375 | (with the underscore character included). The following characters | |
376 | (starting at the first non-whitespace character) will form the string | |
377 | which gets associated with the currently selected set and the message | |
378 | number represented by the number or identifier, respectively. | |
379 | ||
380 | If the start of the line is a number the message number is obvious. It | |
381 | is an error if the same message number already appeared for this set. | |
382 | ||
383 | If the leading token was an identifier the message number gets | |
384 | automatically assigned. The value is the current maximum message | |
385 | number for this set plus one. It is an error if the identifier was | |
386 | already used for a message in this set. It is ok to reuse the | |
387 | identifier for a message in another set. How to use the symbolic | |
388 | identifiers will be explained below (@pxref{Common Usage}). There is | |
389 | one limitation with the identifier: it must not be @code{Set}. The | |
390 | reason will be explained below. | |
391 | ||
392 | Please note that you must use a quoting character if a message contains | |
393 | leading whitespace. Since one cannot guarantee this never happens it is | |
394 | probably a good idea to always use quoting. | |
395 | ||
396 | The text of the messages can contain escape sequences. The usual set | |
397 | of sequences known from the @w{ISO C} language is recognized | |
398 | (@code{\n}, @code{\t}, @code{\v}, @code{\b}, @code{\r}, @code{\f}, | |
399 | @code{\\}, and @code{\@var{nnn}}, where @var{nnn} is the octal coding of | |
400 | a character code). | |
401 | @end itemize | |
402 | ||
403 | @strong{Important:} The handling of identifiers instead of numbers for | |
404 | the set and messages is a GNU extension. Systems strictly following the | |
405 | X/Open specification do not have this feature. An example for a message | |
406 | catalog file is this: | |
407 | ||
408 | @smallexample | |
409 | $ This is a leading comment. | |
410 | $quote " | |
411 | ||
412 | $set SetOne | |
413 | 1 Message with ID 1. | |
414 | two " Message with ID \"two\", which gets the value 2 assigned" | |
415 | ||
416 | $set SetTwo | |
f2ea0f5b | 417 | $ Since the last set got the number 1 assigned this set has number 2. |
40a55d20 UD |
418 | 4000 "The numbers can be arbitrary, they need not start at one." |
419 | @end smallexample | |
420 | ||
421 | This small example shows various aspects: | |
422 | @itemize @bullet | |
423 | @item | |
424 | Lines 1 and 9 are comments since they start with @code{$} followed by | |
425 | a whitespace. | |
426 | @item | |
427 | The quoting character is set to @code{"}. Otherwise the quotes in the | |
428 | message definition would have to be left out and in this case the | |
429 | message with the identifier @code{two} would lose its leading whitespace. | |
430 | @item | |
431 | Mixing numbered messages with messages having symbolic names is no | |
f2ea0f5b | 432 | problem and the numbering happens automatically. |
40a55d20 UD |
433 | @end itemize |
434 | ||
435 | ||
436 | While this file format is pretty easy it is not the best possible for | |
437 | use in a running program. The @code{catopen} function would have to | |
438 | parse the file and handle syntactic errors gracefully. This is not so | |
439 | easy and the whole process is pretty slow. Therefore the @code{catgets} | |
440 | functions expect the data in another more compact and ready-to-use file | |
f2ea0f5b | 441 | format. There is a special program @code{gencat} which is explained in |
40a55d20 UD |
442 | detail in the next section. |
443 | ||
444 | Files in this other format are not human readable. To be easy to use by | |
445 | programs it is a binary file. But the format is byte order independent | |
446 | so translation files can be shared by systems of arbitrary architecture | |
447 | (as long as they use the GNU C Library). | |
448 | ||
449 | Details about the binary file format are not important to know since | |
450 | these files are always created by the @code{gencat} program. The | |
451 | sources of the GNU C Library also provide the sources for the | |
f2ea0f5b | 452 | @code{gencat} program and so the interested reader can look through |
40a55d20 UD |
453 | these source files to learn about the file format. |
454 | ||
455 | ||
456 | @node The gencat program | |
457 | @subsection Generate Message Catalogs files | |
458 | ||
459 | @cindex gencat | |
460 | The @code{gencat} program is specified in the X/Open standard and the | |
461 | GNU implementation follows this specification and so can process | |
462 | all correctly formed input files. Additionally some extensions are | |
3081378b | 463 | implemented which help to work in a more reasonable way with the |
40a55d20 UD |
464 | @code{catgets} functions. |
465 | ||
466 | The @code{gencat} program can be invoked in two ways: | |
467 | ||
468 | @example | |
469 | gencat [@var{Option}]@dots{} [@var{Output-File} [@var{Input-File}]@dots{}] | |
470 | @end example | |
471 | ||
472 | This is the interface defined in the X/Open standard. If no | |
473 | @var{Input-File} parameter is given input will be read from standard | |
474 | input. Multiple input files will be read as if they are concatenated. | |
475 | If @var{Output-File} is also missing, the output will be written to | |
476 | standard output. To provide the interface one is used to from other | |
477 | programs a second interface is provided. | |
478 | ||
479 | @smallexample | |
480 | gencat [@var{Option}]@dots{} -o @var{Output-File} [@var{Input-File}]@dots{} | |
481 | @end smallexample | |
482 | ||
483 | The option @samp{-o} is used to specify the output file and all file | |
484 | arguments are used as input files. | |
485 | ||
486 | Besides this, one can use @file{-} or @file{/dev/stdin} for | |
487 | @var{Input-File} to denote the standard input. Correspondingly one can | |
488 | use @file{-} and @file{/dev/stdout} for @var{Output-File} to denote | |
489 | standard output. Using @file{-} as a file name is allowed in X/Open | |
490 | while using the device names is a GNU extension. | |
491 | ||
492 | The @code{gencat} program works by concatenating all input files and | |
493 | then @strong{merging} the resulting collection of message sets with a | |
f2ea0f5b UD |
494 | possibly existing output file. This is done by removing all messages |
495 | with set/message number tuples matching any of the generated messages | |
40a55d20 UD |
496 | from the output file and then adding all the new messages. |
497 | Regenerating a catalog file while ignoring the old contents therefore | |
498 | requires removing the output file if it exists. If the output is | |
499 | written to standard output no merging takes place. | |
500 | ||
501 | @noindent | |
502 | The following table shows the options understood by the @code{gencat} | |
503 | program. The X/Open standard does not specify any option for the | |
504 | program so all of these are GNU extensions. | |
505 | ||
506 | @table @samp | |
507 | @item -V | |
508 | @itemx --version | |
509 | Print the version information and exit. | |
510 | @item -h | |
511 | @itemx --help | |
512 | Print a usage message listing all available options, then exit successfully. | |
513 | @item --new | |
514 | Never merge the new messages from the input files with the old content | |
515 | of the output file. The old content of the output file is discarded. | |
516 | @item -H | |
517 | @itemx --header=@var{name} | |
518 | This option is used to emit the symbolic names given to sets and | |
519 | messages in the input files for use in the program. Details about how | |
520 | to use this are given in the next section. The @var{name} parameter to | |
521 | this option specifies the name of the output file. It will contain a | |
522 | number of C preprocessor @code{#define}s to associate a name with a | |
523 | number. | |
524 | ||
525 | Please note that the generated file only contains the symbols from the | |
526 | input files. If the output is merged with the previous content of the | |
527 | output file the possibly existing symbols from the file(s) which | |
528 | generated the old output files are not in the generated header file. | |
529 | @end table | |
530 | ||
531 | ||
532 | @node Common Usage | |
533 | @subsection How to use the @code{catgets} interface | |
534 | ||
535 | The @code{catgets} functions can be used in two different ways: by | |
536 | slavishly following the X/Open specs and not relying on the extensions, | |
537 | or by using the GNU extensions. We will take a look at the former | |
538 | method first to understand the benefits of the extensions. | |
539 | ||
fed8f7f7 | 540 | @subsubsection Not using symbolic names |
40a55d20 UD |
541 | |
542 | Since the X/Open format of the message catalog files does not allow | |
543 | symbol names we have to work with numbers all the time. When we start | |
f2ea0f5b UD |
544 | writing a program we have to replace all appearances of translatable |
545 | strings with something like | |
40a55d20 UD |
546 | |
547 | @smallexample | |
548 | catgets (catdesc, set, msg, "string") | |
549 | @end smallexample | |
550 | ||
551 | @noindent | |
552 | @var{catdesc} is retrieved from a call to @code{catopen} which is | |
553 | normally done once at the program start. The @code{"string"} is the | |
554 | string we want to translate. The problems start with the set and | |
555 | message numbers. | |
556 | ||
557 | In a bigger program several programmers usually work at the same time on | |
558 | the program and so coordinating the number allocation is crucial. | |
f2ea0f5b UD |
559 | Though no two different strings may be indexed by the same tuple of |
560 | numbers it is highly desirable to reuse the numbers for equal strings | |
40a55d20 UD |
561 | with equal translations (please note that there might be strings which |
562 | are equal in one language but have different translations due to | |
563 | different contexts). | |
564 | ||
565 | The allocation process can be eased a bit by using different set numbers for | |
566 | different parts of the program. So the number of developers who have to | |
567 | coordinate the allocation can be reduced. But lists that keep | |
568 | track of the allocation must still be maintained and errors can easily happen. These errors | |
569 | cannot be discovered by the compiler or the @code{catgets} functions. | |
570 | Only the user of the program might see wrong messages printed. In the | |
571 | worst cases the messages are so misleading that they cannot be | |
572 | recognized as wrong. Think about the translations for @code{"true"} and | |
f2ea0f5b | 573 | @code{"false"} being exchanged. This could result in a disaster. |
40a55d20 UD |
574 | |
575 | ||
576 | @subsubsection Using symbolic names | |
577 | ||
578 | The problems mentioned in the last section derive from the fact that: | |
579 | ||
580 | @enumerate | |
581 | @item | |
582 | the numbers are allocated once and due to the possibly frequent use of | |
583 | them it is difficult to change a number later. | |
584 | @item | |
585 | the numbers give no hint about the string and | |
586 | therefore collisions can easily happen. | |
587 | @end enumerate | |
588 | ||
589 | By consistently using symbolic names and by providing a method which maps | |
590 | the string content to a symbolic name (however this will happen) one can | |
591 | prevent both problems above. The cost of this is that the programmer | |
592 | has to write a complete message catalog file while s/he is writing the | |
593 | program itself. | |
594 | ||
595 | This is necessary since the symbolic names must be mapped to numbers | |
596 | before the program sources can be compiled. In the last section it was | |
597 | described how to generate a header containing the mapping of the names. | |
598 | E.g., for the example message file given in the last section we could | |
599 | call the @code{gencat} program as follows (assume @file{ex.msg} contains | |
600 | the sources). | |
601 | ||
602 | @smallexample | |
603 | gencat -H ex.h -o ex.cat ex.msg | |
604 | @end smallexample | |
605 | ||
606 | @noindent | |
607 | This generates a header file with the following content: | |
608 | ||
609 | @smallexample | |
610 | #define SetTwoSet 0x2 /* ex.msg:8 */ | |
611 | ||
612 | #define SetOneSet 0x1 /* ex.msg:4 */ | |
613 | #define SetOnetwo 0x2 /* ex.msg:6 */ | |
614 | @end smallexample | |
615 | ||
616 | As can be seen the various symbols given in the source file are mangled | |
617 | to generate unique identifiers and these identifiers get numbers | |
618 | assigned. Reading the source file and knowing about the rules will | |
619 | allow one to predict the content of the header file (it is deterministic) | |
620 | but this is not necessary. The @code{gencat} program can take care of | |
621 | everything. All the programmer has to do is to put the generated header | |
622 | file in the dependency list of the source files of her/his project and | |
623 | to add a rule to regenerate the header if any of the input files | |
624 | changes. | |
625 | ||
626 | One word about the symbol mangling. Every symbol consists of two parts: | |
627 | the name of the message set plus the name of the message or the special | |
628 | string @code{Set}. So @code{SetOnetwo} means this macro can be used to | |
629 | access the translation with identifier @code{two} in the message set | |
630 | @code{SetOne}. | |
631 | ||
632 | The other names denote the names of the message sets. The special | |
633 | string @code{Set} is used in the place of the message identifier. | |
634 | ||
635 | If in the code the second string of the set @code{SetOne} is used the C | |
636 | code should look like this: | |
637 | ||
638 | @smallexample | |
639 | catgets (catdesc, SetOneSet, SetOnetwo, | |
640 | " Message with ID \"two\", which gets the value 2 assigned") | |
641 | @end smallexample | |
642 | ||
643 | Writing the function this way makes it possible to change the message number | |
644 | and even the set number without requiring any change in the C source | |
645 | code. (The text of the string is normally not the same; this is only | |
646 | for this example.) | |
647 | ||
648 | ||
649 | @subsubsection How does this allow one to develop | |
650 | ||
651 | To illustrate the usual way to work with the symbolic message numbers | |
652 | here is a little example. Assume we want to write the very complex and | |
653 | famous greeting program. We start by writing the code as usual: | |
654 | ||
655 | @smallexample | |
656 | #include <stdio.h> | |
657 | int | |
658 | main (void) | |
659 | @{ | |
660 | printf ("Hello, world!\n"); | |
661 | return 0; | |
662 | @} | |
663 | @end smallexample | |
664 | ||
665 | Now we want to internationalize the message and therefore replace the | |
666 | message with whatever the user wants. | |
667 | ||
668 | @smallexample | |
669 | #include <nl_types.h> | |
670 | #include <stdio.h> | |
671 | #include "msgnrs.h" | |
672 | int | |
673 | main (void) | |
674 | @{ | |
675 | nl_catd catdesc = catopen ("hello.cat", NL_CAT_LOCALE); | |
676 | printf (catgets (catdesc, MainSet, MainHello, | |
677 | "Hello, world!\n")); | |
678 | catclose (catdesc); | |
679 | return 0; | |
680 | @} | |
681 | @end smallexample | |
682 | ||
683 | We see how the catalog object is opened and the returned descriptor used | |
684 | in the other function calls. It is not really necessary to check for | |
685 | failure of any of the functions since even in these situations the | |
686 | functions will behave reasonably. They simply will not return a | |
687 | translation; @code{catgets} falls back to the string given as its last argument. | |
688 | ||
689 | What remains unspecified here are the constants @code{MainSet} and | |
690 | @code{MainHello}. These are the symbolic names describing the | |
691 | message. To get the actual definitions which match the information in | |
692 | the catalog file we have to create the message catalog source file and | |
693 | process it using the @code{gencat} program. | |
694 | ||
695 | @smallexample | |
696 | $ Messages for the famous greeting program. | |
697 | $quote " | |
698 | ||
699 | $set Main | |
700 | Hello "Hallo, Welt!\n" | |
701 | @end smallexample | |
702 | ||
703 | Now we can start building the program (assume the message catalog source | |
704 | file is named @file{hello.msg} and the program source file @file{hello.c}): | |
705 | ||
706 | @smallexample | |
707 | @cartouche | |
708 | % gencat -H msgnrs.h -o hello.cat hello.msg | |
709 | % cat msgnrs.h | |
710 | #define MainSet 0x1 /* hello.msg:4 */ | |
711 | #define MainHello 0x1 /* hello.msg:5 */ | |
712 | % gcc -o hello hello.c -I. | |
713 | % cp hello.cat /usr/share/locale/de/LC_MESSAGES | |
714 | % echo $LC_ALL | |
715 | de | |
716 | % ./hello | |
717 | Hallo, Welt! | |
718 | % | |
719 | @end cartouche | |
720 | @end smallexample | |
721 | ||
722 | The call of the @code{gencat} program creates the missing header file | |
723 | @file{msgnrs.h} as well as the message catalog binary. The former is | |
724 | used in the compilation of @file{hello.c} while the latter is placed in a | |
725 | directory in which the @code{catopen} function will try to locate it. | |
726 | Please check the @code{LC_ALL} environment variable and the default path | |
727 | for @code{catopen} presented in the description above. | |
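
The statement that the functions behave reasonably even on failure can be tried out with a small sketch. The helper name @code{print_greeting} is made up for this illustration, and the two macros only stand in for the generated @file{msgnrs.h}; the sketch assumes that @code{catopen} reports failure by returning @code{(nl_catd) -1} and that @code{catgets} then returns its last argument.

```c
#include <nl_types.h>
#include <stdio.h>

/* Stand-ins for the generated msgnrs.h (values as in the example above).  */
#define MainSet 0x1
#define MainHello 0x1

/* Hypothetical helper: print the greeting found in CATALOG_NAME to OUT.
   Returns 1 if the catalog could be opened and 0 if catopen failed, in
   which case the default string is printed.  */
int
print_greeting (const char *catalog_name, FILE *out)
{
  nl_catd catdesc = catopen (catalog_name, NL_CAT_LOCALE);
  int opened = catdesc != (nl_catd) -1;

  /* catgets returns its last argument when no translation is found.  */
  fputs (catgets (catdesc, MainSet, MainHello, "Hello, world!\n"), out);

  if (opened)
    catclose (catdesc);
  return opened;
}
```

Calling @code{print_greeting ("hello.cat", stdout)} without an installed catalog should still print the English text. The return value is only there for callers that want to report the missing catalog.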
728 | ||
729 | ||
730 | @node The Uniforum approach | |
731 | @section The Uniforum approach to Message Translation | |
732 | ||
733 | Sun Microsystems tried to standardize a different approach to message | |
734 | translation in the Uniforum group. There never was a real standard | |
735 | defined but still the interface was used in Sun's operating systems. | |
736 | Since this approach fits better in the development process of free | |
737 | software it is also used throughout the GNU project and the GNU | |
738 | @file{gettext} package provides support for this outside the GNU C | |
739 | Library. | |
740 | ||
741 | The code of the @file{libintl} from GNU @file{gettext} is the same as | |
742 | the code in the GNU C Library. So the documentation in the GNU | |
743 | @file{gettext} manual is also valid for the functionality here. The | |
744 | following text will describe the library functions in detail. But the | |
745 | numerous helper programs are not described in this manual. Instead | |
746 | people should read the GNU @file{gettext} manual | |
747 | (@pxref{Top,,GNU gettext utilities,gettext,Native Language Support Library and Tools}). | |
748 | We will only give a short overview. | |
749 | ||
750 | Though the @code{catgets} functions are available by default on more | |
751 | systems the @code{gettext} interface is at least as portable as the | |
752 | former. The GNU @file{gettext} package can be used wherever the | |
753 | functions are not available. | |
754 | ||
755 | ||
756 | @menu | |
757 | * Message catalogs with gettext:: The @code{gettext} family of functions. | |
758 | * Helper programs for gettext:: Programs to handle message catalogs | |
759 | for @code{gettext}. | |
760 | @end menu | |
761 | ||
762 | ||
763 | @node Message catalogs with gettext | |
764 | @subsection The @code{gettext} family of functions | |
765 | ||
766 | The paradigm underlying the @code{gettext} approach to message | |
767 | translation is different from that of the @code{catgets} functions, but | |
768 | the basic functionality is equivalent. There are functions of the | |
769 | following categories: | |
770 | ||
771 | @menu | |
772 | * Translation with gettext:: What has to be done to translate a message. | |
773 | * Locating gettext catalog:: How to determine which catalog to be used. | |
774 | * Using gettextized software:: The possibilities of the user to influence | |
775 | the way @code{gettext} works. | |
776 | @end menu | |
777 | ||
778 | @node Translation with gettext | |
779 | @subsubsection What has to be done to translate a message? | |
780 | ||
781 | The @code{gettext} functions have a very simple interface. The most | |
782 | basic function just takes the string which shall be translated as the | |
783 | argument and it returns the translation. This is fundamentally | |
784 | different from the @code{catgets} approach where an extra key is | |
785 | necessary and the original string is only used for the error case. | |
786 | ||
787 | If the string which has to be translated is the only argument this of | |
788 | course means the string itself is the key. I.e., the translation will | |
789 | be selected based on the original string. The message catalogs must | |
790 | therefore contain the original strings plus one translation for any such | |
791 | string. The task of the @code{gettext} function is to compare the | |
792 | argument string with the available strings in the catalog and return the | |
793 | appropriate translation. Of course this process is optimized so that | |
794 | it is not more expensive than an access using an atomic key | |
795 | like in @code{catgets}. | |
796 | ||
797 | The @code{gettext} approach has some advantages but also some | |
798 | disadvantages. Please see the GNU @file{gettext} manual for a detailed | |
799 | discussion of the pros and cons. | |
800 | ||
801 | All the definitions and declarations for @code{gettext} can be found in | |
802 | the @file{libintl.h} header file. On systems where these functions are | |
803 | not part of the C library they can be found in a separate library named | |
804 | @file{libintl.a} (or accordingly different for shared libraries). | |
805 | ||
806 | @deftypefun {char *} gettext (const char *@var{msgid}) | |
807 | The @code{gettext} function searches the currently selected message | |
808 | catalogs for a string which is equal to @var{msgid}. If there is such a | |
809 | string available it is returned. Otherwise the argument string | |
810 | @var{msgid} is returned. | |
811 | ||
812 | Please note that although the return value is @code{char *} the | |
813 | returned string must not be changed. This broken type results from the | |
814 | history of the function and does not reflect the way the function should | |
815 | be used. | |
816 | ||
817 | Please note that above we wrote ``message catalogs'' (plural). This is | |
818 | a speciality of the GNU implementation of these functions and we will | |
819 | say more about this when we talk about the ways message catalogs are | |
820 | selected (@pxref{Locating gettext catalog}). | |
821 | ||
822 | The @code{gettext} function does not modify the value of the global | |
823 | @var{errno} variable. This is necessary to make it possible to write | |
824 | something like | |
825 | ||
826 | @smallexample | |
827 | printf (gettext ("Operation failed: %m\n")); | |
828 | @end smallexample | |
829 | ||
830 | Here the @var{errno} value is used in the @code{printf} function while | |
831 | processing the @code{%m} format element and if the @code{gettext} | |
832 | function changed this value (it is called before @code{printf} is | |
833 | called) we would get a wrong message. | |
834 | |
835 | So there is no easy way to detect a missing message catalog besides | |
836 | comparing the argument string with the result. But it is normally the | |
837 | task of the user to react to missing catalogs. The program cannot guess | |
838 | when a message catalog is really necessary since a user who speaks | |
839 | the language the program was developed in does not need any translation. | |
840 | @end deftypefun | |
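
The two properties described above, the fallback to @var{msgid} and the untouched @var{errno}, can be observed with a short sketch. Both helper names are hypothetical. The pointer comparison relies on the assumption that an untranslated message is returned as the very same pointer that was passed in.

```c
#include <errno.h>
#include <libintl.h>

/* Hypothetical helper: nonzero if gettext returned something other than
   the MSGID pointer itself, i.e. a translation came from a catalog.  */
int
has_translation (const char *msgid)
{
  return gettext (msgid) != msgid;
}

/* Hypothetical helper: nonzero if a gettext call leaves errno alone.  */
int
gettext_keeps_errno (const char *msgid)
{
  errno = ENOENT;
  (void) gettext (msgid);
  return errno == ENOENT;
}
```

A program would normally not act on @code{has_translation}; as explained above, a missing catalog is not necessarily an error.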
841 | ||
842 | The remaining two functions to access the message catalog add some | |
843 | functionality to select a message catalog which is not the default one. | |
844 | This is important if parts of the program are developed independently. | |
845 | Every part can have its own message catalog and all of them can be used | |
846 | at the same time. The C library itself is an example: internally it | |
847 | uses the @code{gettext} functions but since it must not depend on a | |
848 | currently selected default message catalog it must specify all ambiguous | |
849 | information. | |
850 | ||
851 | @deftypefun {char *} dgettext (const char *@var{domainname}, const char *@var{msgid}) | |
852 | The @code{dgettext} function acts just like the @code{gettext} | |
853 | function. It only takes an additional first argument @var{domainname} | |
854 | which guides the selection of the message catalogs which are searched | |
855 | for the translation. If the @var{domainname} parameter is the null | |
856 | pointer the @code{dgettext} function is exactly equivalent to | |
857 | @code{gettext} since the default value for the domain name is used. | |
858 | ||
859 | As for @code{gettext} the return value type is @code{char *} which is an | |
860 | anachronism. The returned string must never be modified. | |
861 | @end deftypefun |
862 | ||
863 | @deftypefun {char *} dcgettext (const char *@var{domainname}, const char *@var{msgid}, int @var{category}) | |
864 | The @code{dcgettext} adds another argument to those which | |
865 | @code{dgettext} takes. This argument @var{category} specifies the last | |
866 | piece of information needed to locate the message catalog. I.e., the | |
867 | domain name and the locale category exactly specify which message | |
868 | catalog has to be used (relative to a given directory, see below). | |
869 | ||
870 | The @code{dgettext} function can be expressed in terms of | |
871 | @code{dcgettext} by using | |
872 | ||
873 | @smallexample | |
874 | dcgettext (domain, string, LC_MESSAGES) | |
875 | @end smallexample | |
876 | ||
877 | @noindent | |
878 | instead of | |
879 | ||
880 | @smallexample | |
881 | dgettext (domain, string) | |
882 | @end smallexample | |
883 | ||
884 | This also shows which values are expected for the third parameter. One | |
885 | has to use the available selectors for the categories available in | |
886 | @file{locale.h}. Normally the available values are @code{LC_CTYPE}, | |
887 | @code{LC_COLLATE}, @code{LC_MESSAGES}, @code{LC_MONETARY}, | |
888 | @code{LC_NUMERIC}, and @code{LC_TIME}. Please note that @code{LC_ALL} | |
889 | must not be used and even though the names might suggest this, there is | |
890 | no relation to the environment variables of this name. | |
891 | ||
892 | The @code{dcgettext} function is only implemented for compatibility with | |
893 | other systems which have @code{gettext} functions. There is not really | |
894 | any situation where it is necessary (or useful) to use a value other | |
895 | than @code{LC_MESSAGES} for the @var{category} parameter. We are | |
896 | dealing with messages here and any other choice can only be irritating. | |
897 | ||
898 | As for @code{gettext} the return value type is @code{char *} which is an | |
899 | anachronism. The returned string must never be modified. | |
900 | @end deftypefun |
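
The equivalence between @code{dgettext} and @code{dcgettext} with @code{LC_MESSAGES} can be written down as a tiny check. This is only a sketch; the helper name @code{same_lookup} is made up, and the comparison assumes both calls resolve to the same catalog entry or both fall back to @var{msgid}.

```c
#include <libintl.h>
#include <locale.h>

/* Hypothetical helper: nonzero if dgettext and dcgettext with
   LC_MESSAGES return the same string for MSGID in DOMAIN.  */
int
same_lookup (const char *domain, const char *msgid)
{
  const char *from_dgettext = dgettext (domain, msgid);
  const char *from_dcgettext = dcgettext (domain, msgid, LC_MESSAGES);

  return from_dgettext == from_dcgettext;
}
```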
901 | ||
902 | When using the three functions above in a program it is a frequent case | |
903 | that the @var{msgid} argument is a constant string. So it is worth | |
904 | optimizing this case. Thinking briefly about this one will realize that | |
905 | as long as no new message catalog is loaded the translation of a message | |
906 | will not change. I.e., the algorithm to determine the translation is | |
907 | deterministic. | |
908 | ||
909 | Exactly this is what the optimizations implemented in the | |
910 | @file{libintl.h} header will use. Whenever a program is compiled with | |
911 | the GNU C compiler, optimization is selected and the @var{msgid} |
912 | argument to @code{gettext}, @code{dgettext} or @code{dcgettext} is a | |
913 | constant string the actual function call will only be done the first | |
914 | time the message is used and then always only if any new message catalog | |
915 | was loaded and so the result of the translation lookup might be | |
916 | different. See the @file{libintl.h} header file for details. For the | |
917 | user it is only important to know that the result is always the same, | |
918 | independent of the compiler or compiler options in use. | |
919 | ||
920 | ||
921 | @node Locating gettext catalog | |
922 | @subsubsection How to determine which catalog to be used | |
923 | ||
924 | The functions to retrieve the translations for a given message have a | |
925 | remarkably simple interface. But to provide the user of the program | |
926 | still the opportunity to select exactly the translation s/he wants and | |
927 | also to provide the programmer the possibility to influence the way | |
928 | catalog files are located there is a quite complicated | |
929 | underlying mechanism which controls all this. The code is complicated, | |
930 | but its use is easy. | |
931 | ||
932 | Basically we have two different tasks to perform which can also be | |
933 | performed by the @code{catgets} functions: | |
934 | ||
935 | @enumerate | |
936 | @item | |
937 | Locate the set of message catalogs. There are a number of files for | |
938 | different languages which all belong to the package. Usually they | |
939 | are all stored in the filesystem below a certain directory. | |
940 | ||
941 | There can be arbitrarily many packages installed and they can follow | |
942 | different guidelines for the placement of their files. | |
943 | ||
944 | @item | |
945 | Relative to the location specified by the package the actual translation | |
946 | files must be searched, based on the wishes of the user. I.e., for each | |
947 | language the user selects the program should be able to locate the | |
948 | appropriate file. | |
949 | @end enumerate | |
950 | ||
951 | This is the functionality required by the specifications for | |
952 | @code{gettext} and this is also what the @code{catgets} functions are | |
953 | able to do. But there are some problems unresolved: | |
954 | ||
955 | @itemize @bullet | |
956 | @item | |
957 | The language to be used can be specified in several different ways. | |
958 | There is no generally accepted standard for this and the user always | |
959 | expects the program to understand what s/he means. E.g., to select the | |
960 | German translation one could write @code{de}, @code{german}, or | |
961 | @code{deutsch} and the program should always react the same. | |
962 | ||
963 | @item | |
964 | Sometimes the specification of the user is too detailed. If s/he, e.g., | |
965 | specifies @code{de_DE.ISO-8859-1} which means German, spoken in Germany, | |
966 | coded using the @w{ISO 8859-1} character set there is the possibility | |
967 | that a message catalog matching this exactly is not available. But | |
968 | there could be a catalog matching @code{de} and if the character set | |
969 | used on the machine is always @w{ISO 8859-1} there is no reason why this | |
970 | latter message catalog should not be used. (We call this @dfn{message | |
971 | inheritance}.) | |
972 | ||
973 | @item | |
974 | If a catalog for a wanted language is not available it is not always the | |
975 | second best choice to fall back on the language of the developer and | |
976 | simply not translate any message. Instead a user might be better able | |
977 | to read the messages in another language and so the user of the program | |
978 | should be able to define a precedence order of languages. | |
979 | @end itemize | |
980 | ||
981 | We can divide the configuration actions into two parts: one is | |
982 | performed by the programmer, the other by the user. We will start with |
983 | the functions the programmer can use since the user configuration will | |
984 | be based on this. | |
985 | ||
986 | As the functions described in the last sections already mention, separate | |
987 | sets of messages can be selected by a @dfn{domain name}. This is a | |
988 | simple string which should be unique for each program part which uses a | |
989 | separate domain. It is possible to use in one program arbitrarily many | |
990 | domains at the same time. E.g., the GNU C Library itself uses a domain | |
991 | named @code{libc} while the program using the C Library could use a | |
992 | domain named @code{foo}. The important point is that at any time | |
993 | exactly one domain is active. This is controlled with the following | |
994 | function. | |
995 | ||
996 | @deftypefun {char *} textdomain (const char *@var{domainname}) | |
997 | The @code{textdomain} function sets the default domain, which is used in | |
998 | all future @code{gettext} calls, to @var{domainname}. Please note that | |
999 | @code{dgettext} and @code{dcgettext} calls are not influenced if the | |
1000 | @var{domainname} parameter of these functions is not the null pointer. | |
1001 | ||
1002 | Before the first call to @code{textdomain} the default domain is | |
1003 | @code{messages}. This is the name specified in the specification of | |
1004 | the @code{gettext} API. This name is as good as any other name. No |
1005 | program should ever really use a domain with this name since this can | |
1006 | only lead to problems. | |
1007 | ||
1008 | The function returns the value which is from now on taken as the default | |
1009 | domain. If the system ran out of memory the returned value is | |
1010 | @code{NULL} and the global variable @var{errno} is set to @code{ENOMEM}. | |
1011 | Despite the return value type being @code{char *} the return string must | |
1012 | not be changed. It is allocated internally by the @code{textdomain} | |
1013 | function. | |
1014 | ||
1015 | If the @var{domainname} parameter is the null pointer no new default | |
1016 | domain is set. Instead the currently selected default domain is | |
1017 | returned. | |
1018 | ||
1019 | If the @var{domainname} parameter is the empty string the default domain | |
1020 | is reset to its initial value, the domain with the name @code{messages}. | |
1021 | This possibility is questionable to use since the domain @code{messages} | |
1022 | really never should be used. | |
1023 | @end deftypefun | |
1024 | ||
1025 | @deftypefun {char *} bindtextdomain (const char *@var{domainname}, const char *@var{dirname}) | |
1026 | The @code{bindtextdomain} function can be used to specify the directory | |
1027 | which contains the message catalogs for domain @var{domainname} for the | |
1028 | different languages. To be correct, this is the directory where the | |
1029 | hierarchy of directories is expected. Details are explained below. | |
1030 | |
1031 | For the programmer it is important to note that the translations which | |
1032 | come with the program have to be placed in a directory hierarchy starting | |
1033 | at, say, @file{/foo/bar}. Then the program should make a | |
1034 | @code{bindtextdomain} call to bind the domain for the current program to | |
1035 | this directory. This makes sure the catalogs are found. A correctly | |
1036 | running program does not depend on the user setting an environment | |
1037 | variable. | |
1038 | ||
1039 | The @code{bindtextdomain} function can be used several times and if the | |
1040 | @var{domainname} argument is different the previously bound domains | |
1041 | will not be overwritten. |
1042 | ||
1043 | If the program which wishes to use @code{bindtextdomain} at some point | |
1044 | uses the @code{chdir} function to change the current working | |
1045 | directory it is important that the @var{dirname} string is an | |
1046 | absolute pathname. Otherwise the addressed directory might vary over | |
1047 | time. | |
1048 | ||
1049 | If the @var{dirname} parameter is the null pointer @code{bindtextdomain} |
1050 | returns the currently selected directory for the domain with the name | |
1051 | @var{domainname}. | |
1052 | ||
1053 | The @code{bindtextdomain} function returns a pointer to a string | |
1054 | containing the name of the selected directory. The string is | |
1055 | allocated internally in the function and must not be changed by the | |
1056 | user. If the system ran out of memory during the execution of | |
1057 | @code{bindtextdomain} the return value is @code{NULL} and the global | |
1058 | variable @var{errno} is set accordingly. | |
1059 | @end deftypefun | |
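
A minimal sketch of how a program might combine the two functions and check their return values follows. The helper names @code{setup_domain} and @code{domain_is_bound_to} are hypothetical; the only library behavior relied upon is what is described above, namely the @code{NULL} return on failure and the query form with a null @var{dirname}.

```c
#include <libintl.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: make DOMAIN the default domain and bind it to DIR.
   Returns 0 on success and -1 if either call returned NULL.  */
int
setup_domain (const char *domain, const char *dir)
{
  if (textdomain (domain) == NULL)
    return -1;
  if (bindtextdomain (domain, dir) == NULL)
    return -1;
  return 0;
}

/* Hypothetical helper: nonzero if DOMAIN is currently bound to DIR.
   A null dirname argument only queries the current binding.  */
int
domain_is_bound_to (const char *domain, const char *dir)
{
  const char *current = bindtextdomain (domain, NULL);

  return current != NULL && strcmp (current, dir) == 0;
}
```

Passing an absolute directory name, as recommended above, keeps the binding valid even if the program later calls @code{chdir}.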
1060 | ||
1061 | ||
1062 | @node Using gettextized software | |
1063 | @subsubsection User influence on @code{gettext} | |
1064 | ||
1065 | The last sections described what the programmer can do to | |
1066 | internationalize the messages of the program. But it is finally up to | |
1067 | the user to select the message s/he wants to see. S/He must understand | |
1068 | them. | |
1069 | ||
1070 | The POSIX locale model uses the environment variables @code{LC_COLLATE}, | |
1071 | @code{LC_CTYPE}, @code{LC_MESSAGES}, @code{LC_MONETARY}, @code{LC_NUMERIC}, | |
1072 | and @code{LC_TIME} to select the locale which is to be used. This way | |
1073 | the user can influence lots of functions. As we mentioned above the | |
1074 | @code{gettext} functions also take advantage of this. | |
1075 | ||
1076 | To understand how this happens it is necessary to take a look at the | |
1077 | various components of the filename which gets computed to locate a | |
1078 | message catalog. It is composed as follows: | |
1079 | ||
1080 | @smallexample | |
1081 | @var{dir_name}/@var{locale}/LC_@var{category}/@var{domain_name}.mo | |
1082 | @end smallexample | |
1083 | ||
1084 | The default value for @var{dir_name} is system specific. It is computed | |
1085 | from the value given as the prefix while configuring the C library. | |
1086 | This value normally is @file{/usr} or @file{/}. For the former the | |
1087 | complete @var{dir_name} is: | |
1088 | ||
1089 | @smallexample | |
1090 | /usr/share/locale | |
1091 | @end smallexample | |
1092 | ||
1093 | We can use @file{/usr/share} since the @file{.mo} files containing the | |
1094 | message catalogs are system independent, so all systems can use the same | |
1095 | files. If the program executed the @code{bindtextdomain} function for | |
1096 | the message domain that is currently handled the @code{dir_name} | |
1097 | component is exactly the value which was given to the function as | |
1098 | the second parameter. I.e., @code{bindtextdomain} allows overwriting | |
1099 | the only system dependent and fixed value to make it possible to | |
1100 | address files anywhere in the filesystem. | |
1101 | ||
1102 | The @var{category} is the name of the locale category which was selected | |
1103 | in the program code. For @code{gettext} and @code{dgettext} this is | |
1104 | always @code{LC_MESSAGES}, for @code{dcgettext} this is selected by the | |
1105 | value of the third parameter. As said above it should be avoided to | |
1106 | ever use a category other than @code{LC_MESSAGES}. | |
1107 | ||
1108 | The @var{locale} component is computed based on the category used. Just | |
1109 | like for the @code{setlocale} function here the user selection comes | |
1110 | into play. Some environment variables are examined in a fixed order | |
1111 | and the first environment variable set determines the return value of | |
1112 | the lookup process. In detail, for the category @code{LC_xxx} the | |
1113 | following variables in this order are examined: | |
1114 | ||
1115 | @table @code | |
1116 | @item LANGUAGE | |
1117 | @item LC_ALL | |
1118 | @item LC_xxx | |
1119 | @item LANG | |
1120 | @end table | |
1121 | ||
1122 | This looks very familiar. With the exception of the @code{LANGUAGE} | |
1123 | environment variable this is exactly the lookup order the | |
1124 | @code{setlocale} function uses. But why introduce the @code{LANGUAGE} | |
1125 | variable? | |
1126 | ||
1127 | The reason is that the syntax of the values these variables can have is | |
1128 | different to what is expected by the @code{setlocale} function. If we | |
1129 | set @code{LC_ALL} to a value following the extended syntax that | |
1130 | would mean the @code{setlocale} function could never use the | |
1131 | value of this variable as well. An additional variable removes this | |
1132 | problem plus we can select the language independently of the locale | |
1133 | setting which sometimes is useful. | |
1134 | ||
1135 | While for the @code{LC_xxx} variables the value should consist of | |
1136 | exactly one specification of a locale the @code{LANGUAGE} variable's | |
1137 | value can consist of a colon separated list of locale names. The | |
1138 | attentive reader will realize that this is the way we manage to | |
1139 | implement one of our additional demands above: we want to be able to | |
1140 | specify an ordered list of languages. | |
1141 | ||
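
The lookup order of the environment variables can be sketched in a few lines of C. This is not the library's implementation; the helper name @code{first_set_variable} is made up, it follows only the order listed above, and treating an empty value like an unset variable is an assumption of this sketch.

```c
#include <stdlib.h>

/* Hypothetical helper following only the order listed above: return the
   name of the first variable that is set to a non-empty value, or NULL
   if none is.  CATEGORY_VAR is the LC_xxx name, e.g. "LC_MESSAGES".  */
const char *
first_set_variable (const char *category_var)
{
  const char *names[] = { "LANGUAGE", "LC_ALL", category_var, "LANG" };

  for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i)
    {
      const char *value = getenv (names[i]);
      /* Assumption: an empty value counts as "not set".  */
      if (value != NULL && value[0] != '\0')
        return names[i];
    }
  return NULL;
}
```

For the @code{LANGUAGE} case the value found this way would still have to be split at the colons to obtain the ordered list of languages.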
1142 | Back to the constructed filename we have only one component missing. | |
1143 | The @var{domain_name} part is the name which was either registered using | |
1144 | the @code{textdomain} function or which was given to @code{dgettext} or | |
1145 | @code{dcgettext} as the first parameter. Now it becomes obvious that a | |
1146 | good choice for the domain name in the program code is a string which is | |
1147 | closely related to the program/package name. E.g., for the GNU C | |
1148 | Library the domain name is @code{libc}. | |
1149 | ||
1150 | @noindent | |
1151 | A little piece of example code should show how the programmer is supposed | |
1152 | to work: | |
1153 | ||
1154 | @smallexample | |
1155 | @{ | |
1156 | textdomain ("test-package"); | |
1157 | bindtextdomain ("test-package", "/usr/local/share/locale"); | |
1158 | puts (gettext ("Hello, world!")); | |
1159 | @} | |
1160 | @end smallexample | |
1161 | ||
1162 | At the program start the default domain is @code{messages}. The | |
1163 | @code{textdomain} call changes this to @code{test-package}. The | |
1164 | @code{bindtextdomain} call specifies that the message catalogs for the | |
1165 | domain @code{test-package} can be found below the directory | |
1166 | @file{/usr/local/share/locale}. | |
1167 | ||
1168 | If now the user sets in her/his environment the variable @code{LANGUAGE} | |
1169 | to @code{de} the @code{gettext} function will try to use the | |
1170 | translations from the file | |
1171 | ||
1172 | @smallexample | |
1173 | /usr/local/share/locale/de/LC_MESSAGES/test-package.mo | |
1174 | @end smallexample | |
1175 | ||
1176 | From the above descriptions it should be clear which component of this | |
1177 | filename is determined by which source. |
1178 | ||
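
How the four components are put together can be shown with a small formatting helper. The function name @code{catalog_path} and its parameter names are hypothetical; the sketch only reproduces the filename pattern given above and does not look at the environment or the filesystem.

```c
#include <stdio.h>

/* Hypothetical helper: compose DIR_NAME/LOCALE/LC_CATEGORY/DOMAIN_NAME.mo
   into BUF.  CATEGORY is given without the LC_ prefix, e.g. "MESSAGES".
   Returns the length of the path, or -1 if it did not fit into BUF.  */
int
catalog_path (char *buf, size_t size, const char *dir_name,
              const char *locale, const char *category,
              const char *domain_name)
{
  int n = snprintf (buf, size, "%s/%s/LC_%s/%s.mo",
                    dir_name, locale, category, domain_name);

  return (n < 0 || (size_t) n >= size) ? -1 : n;
}
```

With the values from the example, @code{"/usr/local/share/locale"}, @code{"de"}, @code{"MESSAGES"}, and @code{"test-package"}, the buffer receives the filename shown above.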
1179 | In the above example we assumed that the @code{LANGUAGE} environment | |
1180 | variable is set to @code{de}. This might be an appropriate selection but what | |
1181 | happens if the user wants to use @code{LC_ALL} because of the wider | |
1182 | usability and here the required value is @code{de_DE.ISO-8859-1}? We | |
1183 | already mentioned above that a situation like this is not infrequent. | |
1184 | E.g., a person might prefer reading a dialect and if this is not | |
1185 | available fall back on the standard language. | |
1186 | ||
1187 | The @code{gettext} functions know about situations like this and can | |
1188 | handle them gracefully. The functions recognize the format of the value | |
1189 | of the environment variable. They can split the value into different pieces | |
1190 | and by leaving out one or the other part they can construct new | |
1191 | values. This happens of course in a predictable way. To understand | |
1192 | this one must know the format of the environment variable value. There | |
1193 | are two more or less standardized forms: | |
1194 | ||
1195 | @table @emph | |
1196 | @item X/Open Format | |
1197 | @code{language[_territory[.codeset]][@@modifier]} | |
1198 | ||
1199 | @item CEN Format (European Community Standard) | |
1200 | @code{language[_territory][+audience][+special][,[sponsor][_revision]]} | |
1201 | @end table | |
1202 | ||
1203 | The functions will automatically recognize which format is used. Less | |
1204 | specific locale names are built by stripping off parts in the order of | |
1205 | the following list: | |
1206 | ||
1207 | @enumerate |
1208 | @item | |
1209 | @code{revision} | |
1210 | @item | |
1211 | @code{sponsor} | |
1212 | @item | |
1213 | @code{special} | |
1214 | @item | |
1215 | @code{codeset} | |
1216 | @item | |
1217 | @code{normalized codeset} | |
1218 | @item | |
1219 | @code{territory} | |
1220 | @item | |
1221 | @code{audience}/@code{modifier} | |
1222 | @end enumerate | |
1223 | ||
1224 | From the last entry one can see that the @code{modifier} | |
1225 | field in the X/Open format and the @code{audience} field in the CEN | |
1226 | format have the same meaning. Besides, one can see that the @code{language} | |
1227 | field for obvious reasons never will be dropped. | |
1228 | ||
1229 | The only new thing is the @code{normalized codeset} entry. This is | |
1230 | another goodie which is introduced to help reduce the chaos which | |
1231 | derives from the inability of people to standardize the names of | |
1232 | character sets. Instead of @w{ISO-8859-1} one can often see @w{8859-1}, | |
1233 | @w{88591}, @w{iso8859-1}, or @w{iso_8859-1}. The @code{normalized | |
1234 | codeset} value is generated from the user-provided character set name by | |
1235 | applying the following rules: | |
1236 | ||
1237 | @enumerate | |
1238 | @item | |
1239 | Remove all characters besides digits and letters. | |
1240 | @item | |
1241 | Fold letters to lowercase. | |
1242 | @item | |
1243 | If the resulting name contains only digits, prepend the string @code{"iso"}. | |
1244 | @end enumerate | |
1245 | ||
1246 | @noindent | |
1247 | So all of the above names will be normalized to @code{iso88591}. This | |
1248 | allows the program user much more freedom in choosing the locale name. | |
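
The three rules above can be sketched in a few lines of C. The function name @code{normalize_codeset} is hypothetical and chosen only for this illustration; it is not a public interface of the C library, and the library's internal implementation may differ.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated, normalized copy of
   the character set name NAME, or NULL if memory is exhausted.  The
   caller must free the result.  */
char *
normalize_codeset (const char *name)
{
  size_t len = strlen (name);
  /* Room for the optional "iso" prefix, the name and the NUL byte.  */
  char *result = malloc (len + 4);
  char *out = result;
  int only_digits = 1;
  size_t i;

  if (result == NULL)
    return NULL;

  /* Rule 3 depends on whether any letter survives rule 1.  */
  for (i = 0; i < len; i++)
    if (isalpha ((unsigned char) name[i]))
      only_digits = 0;

  if (only_digits)
    {
      memcpy (out, "iso", 3);
      out += 3;
    }

  for (i = 0; i < len; i++)
    {
      unsigned char c = (unsigned char) name[i];
      /* Rule 1: keep only digits and letters.
         Rule 2: fold letters to lowercase.  */
      if (isalnum (c))
        *out++ = (char) tolower (c);
    }
  *out = '\0';

  return result;
}
```

With this sketch, each of the spellings listed above, for instance @code{normalize_codeset ("8859-1")}, yields the string @code{iso88591}.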
1249 | ||
1250 | Even this extended functionality still does not help to solve the | |
1251 | problem that completely different names can be used to denote the same | |
1252 | locale (e.g., @code{de} and @code{german}). To be of help in this | |
1253 | situation the locale implementation and also the @code{gettext} | |
1254 | functions know about aliases. | |
1255 | ||
1256 | The file @file{/usr/share/locale/locale.alias} (replace @file{/usr} with | |
1257 | whatever prefix you used for configuring the C library) contains a | |
1258 | mapping of alternative names to more regular names. The system manager | |
1259 | is free to add new entries to meet her/his own needs. The selected | |
1260 | locale from the environment is compared with the entries in the first | |
1261 | column of this file, ignoring case. If they match, the value of the | |
1262 | second column is used instead for further handling. | |
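
The lookup described here can be sketched as follows. To keep the example self-contained it uses a small in-memory table instead of reading the alias file; the table entries, the type @code{struct locale_alias} and the function @code{resolve_alias} are all hypothetical and serve only as an illustration. The actual contents of the alias file vary from system to system.

```c
#include <stddef.h>
#include <strings.h>

/* One line of the alias file: alternative name, then regular name.  */
struct locale_alias
{
  const char *alias;
  const char *value;
};

/* Illustrative entries only, not the contents of any real file.  */
static const struct locale_alias sample_aliases[] =
  {
    { "german", "de_DE.ISO-8859-1" },
    { "french", "fr_FR.ISO-8859-1" },
  };

/* Return the regular name for NAME if the first column matches,
   ignoring case; otherwise return NAME unchanged.  */
const char *
resolve_alias (const char *name)
{
  size_t i;

  for (i = 0; i < sizeof sample_aliases / sizeof sample_aliases[0]; i++)
    if (strcasecmp (name, sample_aliases[i].alias) == 0)
      return sample_aliases[i].value;

  return name;
}
```

In this sketch @code{resolve_alias ("German")} yields the value from the second column of the sample table, while a name without an entry, such as @code{de}, is passed through unchanged.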
1263 | ||
1264 | In the description of the format of the environment variables we already | |
1265 | mentioned the character set as a factor in the selection of the message | |
1266 | catalog. In fact, only catalogs which contain text written using the | |
1267 | character set of the system/program can be used (directly; there will | |
1268 | come a solution for this some day). This means for the user that s/he | |
1269 | will always have to take care of this. If in the collection of the | |
1270 | message catalogs there are files for the same language but coded using | |
1271 | different character sets, the user has to be careful to select the matching one. | |
40a55d20 UD |
1272 | |
1273 | ||
1274 | @node Helper programs for gettext | |
1275 | @subsection Programs to handle message catalogs for @code{gettext} | |
1276 | ||
f41c8091 UD |
1277 | The GNU C Library does not contain the source code for the programs to |
1278 | handle message catalogs for the @code{gettext} functions. As part of | |
1279 | the GNU project the GNU gettext package contains everything the | |
1280 | developer needs. The functionality provided by the tools in this | |
1281 | package by far exceeds the abilities of the @code{gencat} program | |
1282 | described above for the @code{catgets} functions. | |
1283 | ||
1284 | The @code{msgfmt} program is the equivalent of the | |
1285 | @code{gencat} program. It generates from the human-readable and | |
1286 | -editable form of the message catalog a binary file which can be used by | |
1287 | the @code{gettext} functions. But there are several more programs | |
1288 | available. | |
1289 | ||
1290 | The @code{xgettext} program can be used to automatically extract the | |
1291 | translatable messages from a source file. I.e., the programmer need not | |
1292 | take care of the translations and the list of messages which have to be | |
1293 | translated. S/He will simply wrap the translatable strings in calls to | |
1294 | @code{gettext} et al.@: and the rest will be done by @code{xgettext}. This | |
1295 | program has a lot of options which help to customize the output or | |
1296 | help it to understand the input better. | |
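
As an illustration of what wrapping a translatable string looks like, here is a minimal sketch. The text domain name @code{example-app}, the catalog directory and the function names @code{init_i18n} and @code{greeting} are assumptions made for this example only. The macro @code{_} is a widespread convention rather than something the library defines.

```c
#include <libintl.h>
#include <locale.h>

/* Common shorthand; it keeps the marked strings readable.  */
#define _(msgid) gettext (msgid)

/* Select the user's locale and the message catalog to use.  The
   domain name and the directory are hypothetical.  */
void
init_i18n (void)
{
  setlocale (LC_ALL, "");
  bindtextdomain ("example-app", "/usr/share/locale");
  textdomain ("example-app");
}

/* Return the greeting, translated if a matching catalog is
   installed, or the original string otherwise.  */
const char *
greeting (void)
{
  return _("Hello from the example program");
}
```

Since the string is passed through the macro and not through @code{gettext} directly, @code{xgettext} has to be told about the additional keyword, for instance with its @code{--keyword=_} option, to pick up the string.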
1297 | ||
1298 | Other programs help to manage the development cycle when new messages appear | |
1299 | in the source files or when a new translation of the messages appears. | |
f2ea0f5b | 1300 | Here it should only be noted that using all the tools in GNU gettext it |
f41c8091 UD |
1301 | is possible to @emph{completely} automate the handling of message |
1302 | catalogs. Besides marking the translatable strings in the source code and | |
1303 | generating the translations the developers do not have anything to do | |
1304 | themselves.