@node Message Translation
@chapter Message Translation

The program's interface with the human should be designed to make the
user's task easier. One way to do this is to print messages in
whatever language the user prefers.

Printing messages in different languages can be implemented in different
ways. One could put all the different languages in the source code and
choose among the variants every time a message has to be printed. This is
certainly not a good solution since extending the set of languages is
difficult (the code must be changed) and the code itself can become
really big with dozens of message sets.

A better solution is to keep the message sets for each language
in separate files which are loaded at runtime depending on the language
selection of the user.

The GNU C Library provides two different sets of functions to support
message translation. The problem is that neither of the interfaces is
officially defined by the POSIX standard. The @code{catgets} family of
functions is defined in the X/Open standard but this is derived from
industry decisions and is therefore not necessarily based on reasonable
decisions.

As mentioned above the message catalog handling provides easy
extendibility by using external data files which contain the message
translations. I.e., these files contain for each of the messages used
in the program a translation for the appropriate language. So the tasks
of the message handling functions are

@itemize @bullet
@item
locate the external data file with the appropriate translations
@item
load the data and make it possible to address the messages
@item
map a given key to the translated message
@end itemize

The two approaches mainly differ in the implementation of this last
step. The design decisions made for this step influence everything else.

@menu
* Message catalogs a la X/Open:: The @code{catgets} family of functions.
* The Uniforum approach:: The @code{gettext} family of functions.
@end menu


@node Message catalogs a la X/Open
@section X/Open Message Catalog Handling

The @code{catgets} functions are based on a simple scheme:

@quotation
Associate every message to translate in the source code with a unique
identifier. To retrieve a message from a catalog file solely the
identifier is used.
@end quotation

This means for the author of the program that s/he will have to make
sure the meaning of the identifier in the program code and in the
message catalogs is always the same.

Before a message can be translated the catalog file must be located.
The user of the program must be able to guide the responsible function
to find whatever catalog the user wants. This is separate from what
the programmer had in mind.

All the types, constants and functions for the @code{catgets} functions
are defined/declared in the @file{nl_types.h} header file.

@menu
* The catgets Functions:: The @code{catgets} function family.
* The message catalog files:: Format of the message catalog files.
* The gencat program:: How to generate message catalog files which
                         can be used by the functions.
* Common Usage:: How to use the @code{catgets} interface.
@end menu


@node The catgets Functions
@subsection The @code{catgets} function family

@comment nl_types.h
@comment X/Open
@deftypefun nl_catd catopen (const char *@var{cat_name}, int @var{flag})
The @code{catopen} function tries to locate the message data file named
@var{cat_name} and loads it when found. The return value is of an
opaque type and can be used in calls to the other functions to refer to
this loaded catalog.

The return value is @code{(nl_catd) -1} in case the function failed and
no catalog was loaded. The global variable @var{errno} contains a code
for the error causing the failure. But even if the function call
succeeded this does not mean that all messages can be translated.

Locating the catalog file must happen in a way which lets the user of
the program influence the decision. It is up to the user to decide
about the language to use and sometimes it is useful to use alternate
catalog files. All this can be specified by the user by setting some
environment variables.

The first problem is to find out where all the message catalogs are
stored. Every program could have its own place to keep all the
different files but usually the catalog files are grouped by languages
and the catalogs for all programs are kept in the same place.

@cindex NLSPATH environment variable
To tell the @code{catopen} function where the catalog for the program
can be found the user can set the environment variable @code{NLSPATH} to
a value which describes her/his choice. Since this value must be usable
for different languages and locales it cannot be a simple string.
Instead it is a format string (similar to @code{printf}'s). An example
is

@smallexample
/usr/share/locale/%L/%N:/usr/share/locale/%L/LC_MESSAGES/%N
@end smallexample

First one can see that more than one directory can be specified (with
the usual syntax of separating them by colons). The next things to
observe are the format elements, @code{%L} and @code{%N} in this case.
The @code{catopen} function knows about several of them and the
replacement for each of them is of course different.

@table @code
@item %N
This format element is substituted with the name of the catalog file.
This is the value of the @var{cat_name} argument given to
@code{catopen}.

@item %L
This format element is substituted with the name of the currently
selected locale for translating messages. How this is determined is
explained below.

@item %l
(This is the lowercase ell.) This format element is substituted with the
language element of the locale name. The string describing the selected
locale is expected to have the form
@code{@var{lang}[_@var{terr}[.@var{codeset}]]} and this format uses the
first part @var{lang}.

@item %t
This format element is substituted by the territory part @var{terr} of
the name of the currently selected locale. See the explanation of the
format above.

@item %c
This format element is substituted by the codeset part @var{codeset} of
the name of the currently selected locale. See the explanation of the
format above.

@item %%
Since @code{%} is used as a meta character there must be a way to
express the @code{%} character in the result itself. Using @code{%%}
does this just like it works for @code{printf}.
@end table
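
The substitution rules in the table above can be sketched in a few lines
of C. This is only an illustration and not the code used inside
@code{catopen}: the function name @code{expand_nls_template} and its
interface are hypothetical, it handles a single path template (one
colon-separated component of @code{NLSPATH}), and copying unknown format
elements unchanged is an assumption of this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the C library: expand one NLSPATH
   template into BUF.  LOCALE has the form lang[_terr[.codeset]] and
   CAT_NAME is the name passed to catopen.  Returns 0 on success and
   -1 if the result does not fit into BUF.  */
int
expand_nls_template (const char *tmpl, const char *locale,
                     const char *cat_name, char *buf, size_t size)
{
  /* Split the locale name into language, territory and codeset.  */
  size_t lang_len = strcspn (locale, "_.");
  const char *terr = "";
  size_t terr_len = 0;
  const char *codeset = "";
  size_t codeset_len = 0;
  const char *rest = locale + lang_len;
  size_t used = 0;

  if (size == 0)
    return -1;
  if (*rest == '_')
    {
      terr = rest + 1;
      terr_len = strcspn (terr, ".");
      rest = terr + terr_len;
    }
  if (*rest == '.')
    {
      codeset = rest + 1;
      codeset_len = strlen (codeset);
    }

  for (const char *p = tmpl; *p != '\0'; ++p)
    {
      const char *piece = p;    /* Default: copy the character itself.  */
      size_t piece_len = 1;

      if (*p == '%' && p[1] != '\0')
        {
          ++p;
          switch (*p)
            {
            case 'N': piece = cat_name; piece_len = strlen (cat_name); break;
            case 'L': piece = locale; piece_len = strlen (locale); break;
            case 'l': piece = locale; piece_len = lang_len; break;
            case 't': piece = terr; piece_len = terr_len; break;
            case 'c': piece = codeset; piece_len = codeset_len; break;
            case '%': piece = "%"; piece_len = 1; break;
            default:
              /* Unknown element: copy it unchanged (an assumption of
                 this sketch).  */
              piece = p - 1;
              piece_len = 2;
              break;
            }
        }

      /* Always leave room for the terminating null byte.  */
      if (used + piece_len >= size)
        return -1;
      memcpy (buf + used, piece, piece_len);
      used += piece_len;
    }

  buf[used] = '\0';
  return 0;
}
```

With the locale name @code{de_DE.UTF-8} and the catalog name
@code{hello.cat}, the template
@code{/usr/share/locale/%L/LC_MESSAGES/%N} expands to
@code{/usr/share/locale/de_DE.UTF-8/LC_MESSAGES/hello.cat} in this
sketch.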


Using @code{NLSPATH} allows arbitrary directories to be
searched for message catalogs while still allowing different languages
to be used. If the @code{NLSPATH} environment variable is not set the
default value is

@smallexample
@var{prefix}/share/locale/%L/%N:@var{prefix}/share/locale/%L/LC_MESSAGES/%N
@end smallexample

@noindent
where @var{prefix} is given to @code{configure} while installing the GNU
C Library (this value is in many cases @code{/usr} or the empty string).

The remaining problem is to decide which locale must be used. Its name
determines the substitution of the format elements mentioned above.
First of all a path can be specified in the message catalog name
(i.e., the name contains a slash character). In this situation the
@code{NLSPATH} environment variable is not used. The catalog must exist
as specified in the program, perhaps relative to the current working
directory. This situation is not desirable and catalog names should
never be written this way. Besides, this behaviour is not portable
to all other platforms providing the @code{catgets} interface.

@cindex LC_ALL environment variable
@cindex LC_MESSAGES environment variable
@cindex LANG environment variable
Otherwise the values of environment variables from the standard
environment are examined (@pxref{Standard Environment}). Which
variables are examined is decided by the @var{flag} parameter of
@code{catopen}. If the value is @code{NL_CAT_LOCALE} (which is defined
in @file{nl_types.h}) then the @code{catopen} function examines the
environment variables @code{LC_ALL}, @code{LC_MESSAGES}, and @code{LANG}
in this order. The first variable which is set in the current
environment will be used.

If @var{flag} is zero only the @code{LANG} environment variable is
examined. This is a left-over from the early days of this function
when the other environment variables were not known.

In any case the environment variable should have a value of the form
@code{@var{lang}[_@var{terr}[.@var{codeset}]]} as explained above. If
no environment variable is set the @code{"C"} locale is used which
prevents any translation.
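
The selection rule just described can be written down as a short
function. This is a minimal sketch of the rule as stated in the text,
not the library's implementation; the name
@code{select_message_locale} is hypothetical.

```c
#include <nl_types.h>
#include <stdlib.h>

/* Hypothetical helper: return the locale name that the rules above
   select for the given catopen FLAG value.  */
const char *
select_message_locale (int flag)
{
  static const char *const vars[] = { "LC_ALL", "LC_MESSAGES", "LANG" };
  /* With a flag of zero only LANG, the last entry, is examined.  */
  size_t first = (flag == NL_CAT_LOCALE) ? 0 : 2;

  for (size_t i = first; i < sizeof vars / sizeof vars[0]; ++i)
    {
      const char *value = getenv (vars[i]);
      if (value != NULL)
        return value;
    }

  /* No variable is set: fall back to the "C" locale.  */
  return "C";
}
```

The returned name is what the sketch would use for the @code{%L},
@code{%l}, @code{%t} and @code{%c} format elements.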

The return value of @code{catgets} is in any case a valid string. Either
it is a translation from a message catalog or it is the same as the
@var{string} parameter. So a piece of code to decide whether a
translation actually happened must look like this:

@smallexample
@{
  char *trans = catgets (desc, set, msg, input_string);
  if (trans == input_string)
    @{
      /* Something went wrong.  */
    @}
@}
@end smallexample

@noindent
When an error occurred the global variable @var{errno} is set to

@table @var
@item EBADF
The catalog does not exist.
@item ENOMSG
The set/message tuple does not name an existing element in the
message catalog.
@end table

While it sometimes can be useful to test for errors programs normally
will avoid any test. If the translation is not available it is no big
problem if the original, untranslated message is printed. Either the
user understands this as well or s/he will look for the reason why the
messages are not translated.
@end deftypefun

Please note that the currently selected locale does not depend on a call
to the @code{setlocale} function. It is not necessary that the locale
data files for this locale exist or that a call to @code{setlocale}
succeeds. The @code{catopen} function directly reads the values of the
environment variables.


@deftypefun {char *} catgets (nl_catd @var{catalog_desc}, int @var{set}, int @var{message}, const char *@var{string})
The function @code{catgets} has to be used to access the message catalog
previously opened using the @code{catopen} function. The
@var{catalog_desc} parameter must be a value previously returned by
@code{catopen}.

The next two parameters, @var{set} and @var{message}, reflect the
internal organization of the message catalog files. This will be
explained in detail below. For now it is interesting to know that a
catalog can consist of several sets and the messages in each set are
individually numbered. Neither the set numbers nor the
message numbers need be consecutive. They can be arbitrarily chosen.
But each message (unless equal to another one) must have its own unique
pair of set and message number.

Since it is not guaranteed that the message catalog for the language
selected by the user exists the last parameter @var{string} helps to
handle this case gracefully. If no matching string can be found
@var{string} is returned. This means for the programmer that

@itemize @bullet
@item
the @var{string} parameters should contain reasonable text (this also
helps to understand the program since otherwise there would be no hint
on the string which is expected to be returned).
@item
all @var{string} arguments should be written in the same language.
@end itemize
@end deftypefun

It is somewhat uncomfortable to write a program using the @code{catgets}
functions if no supporting functionality is available. Since each
set/message number tuple must be unique the programmer must keep lists
of the messages at the same time the code is written. And the work
between several people working on the same project must be coordinated.
In @ref{Common Usage} we will see how these problems can be relaxed
a bit.

@deftypefun int catclose (nl_catd @var{catalog_desc})
The @code{catclose} function can be used to free the resources
associated with a message catalog which previously was opened by a call
to @code{catopen}. If the resources can be successfully freed the
function returns @code{0}. Otherwise it returns @code{@minus{}1} and the
global variable @var{errno} is set. Errors can occur if the catalog
descriptor @var{catalog_desc} is not valid in which case @var{errno} is
set to @code{EBADF}.
@end deftypefun
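
The three functions are normally used together. The following sketch
shows one possible way to combine them; the helper name
@code{fetch_message} is hypothetical. It copies the text into a
caller-supplied buffer before closing the catalog, on the cautious
assumption that the pointer returned by @code{catgets} may refer to
storage owned by the catalog, and it skips @code{catgets} entirely when
@code{catopen} failed.

```c
#include <nl_types.h>
#include <stdio.h>

/* Hypothetical helper: look up one message and copy it into BUF so that
   the text stays usable after the catalog has been closed.  If the
   catalog or the message is missing, FALLBACK is copied instead.  */
void
fetch_message (const char *cat_name, int set, int msg,
               const char *fallback, char *buf, size_t size)
{
  nl_catd catd = catopen (cat_name, NL_CAT_LOCALE);
  const char *text = fallback;

  /* Only consult the catalog if it could be opened.  */
  if (catd != (nl_catd) -1)
    text = catgets (catd, set, msg, fallback);

  /* Copy (and possibly truncate) the text while the catalog is open.  */
  snprintf (buf, size, "%s", text);

  if (catd != (nl_catd) -1)
    catclose (catd);
}
```

Opening and closing the catalog for every message is wasteful; a real
program would normally call @code{catopen} once at startup, as the later
examples in this chapter do.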


@node The message catalog files
@subsection Format of the message catalog files

The only reasonable way to translate all the messages of a program and
store the result in a message catalog file which can be read by the
@code{catopen} function is to hand all the message text to the
translator and let her/him translate them all. I.e., we must have a
file with entries which associate the set/message tuple with a specific
translation. This file format is specified in the X/Open standard and
is as follows:

@itemize @bullet
@item
Lines containing only whitespace characters or empty lines are ignored.

@item
Lines which contain as the first non-whitespace character a @code{$}
followed by a whitespace character are comments and are also ignored.

@item
If a line contains as the first non-whitespace characters the sequence
@code{$set} followed by a whitespace character an additional argument
is required to follow. This argument can either be:

@itemize @minus
@item
a number. In this case the value of this number determines the set
to which the following messages are added.

@item
an identifier consisting of alphanumeric characters plus the underscore
character. In this case the set automatically gets a number assigned.
This value is one more than the largest set number which has appeared so
far.

How to use the symbolic names is explained in section @ref{Common Usage}.

It is an error if a symbol name appears more than once. All following
messages are placed in a set with this number.
@end itemize

@item
If a line contains as the first non-whitespace characters the sequence
@code{$delset} followed by a whitespace character an additional argument
is required to follow. This argument can either be:

@itemize @minus
@item
a number. In this case the value of this number determines the set
which will be deleted.

@item
an identifier consisting of alphanumeric characters plus the underscore
character. This symbolic identifier must match a name for a set which
previously was defined. It is an error if the name is unknown.
@end itemize

In both cases all messages in the specified set will be removed. They
will not appear in the output. But if this set is selected again later
with a @code{$set} command, messages can be added again and these
messages will appear in the output.

@item
If a line contains after leading whitespace the sequence
@code{$quote}, the quoting character used for this input file is
changed to the first non-whitespace character following the
@code{$quote}. If no non-whitespace character is present before the
line ends quoting is disabled.

By default no quoting character is used. In this mode strings are
terminated with the first unescaped line break. If there is a
@code{$quote} sequence present newlines need not be escaped. Instead a
string is terminated with the first unescaped appearance of the quote
character.

A common usage of this feature would be to set the quote character to
@code{"}. Then any appearance of the @code{"} in the strings must
be escaped using the backslash (i.e., @code{\"} must be written).

@item
Any other line must start with a number or an alphanumeric identifier
(with the underscore character included). The following characters
(starting at the first non-whitespace character) will form the string
which gets associated with the currently selected set and the message
number given by the number or identifier respectively.

If the start of the line is a number the message number is obvious. It
is an error if the same message number already appeared for this set.

If the leading token was an identifier the message number gets
automatically assigned. The value is the current maximum message
number for this set plus one. It is an error if the identifier was
already used for a message in this set. It is ok to reuse the
identifier for a message in another set. How to use the symbolic
identifiers will be explained below (@pxref{Common Usage}). There is
one limitation with the identifier: it must not be @code{Set}. The
reason will be explained below.

Please note that you must use a quoting character if a message contains
leading whitespace. Since one cannot guarantee this never happens it is
probably a good idea to always use quoting.

The text of the messages can contain escape characters. The usual bunch
of characters known from the @w{ISO C} language are recognized
(@code{\n}, @code{\t}, @code{\v}, @code{\b}, @code{\r}, @code{\f},
@code{\\}, and @code{\@var{nnn}}, where @var{nnn} is the octal coding of
a character code).
@end itemize
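
The automatic numbering rule for symbolic set and message names is
simply ``largest number so far, plus one''. As a minimal sketch, with a
hypothetical helper name and under the assumption that numbering starts
at 1 when nothing has been numbered yet:

```c
#include <stddef.h>

/* Hypothetical helper: return the number assigned to a new symbolic set
   or message name, given the COUNT numbers already in use.  */
int
next_automatic_number (const int *used, size_t count)
{
  int max = 0;

  for (size_t i = 0; i < count; ++i)
    if (used[i] > max)
      max = used[i];

  /* One more than the largest number seen so far.  */
  return max + 1;
}
```

So a symbolic message name that follows a message numbered 1 in the same
set gets the number 2, and one that follows a message numbered 4000
would get 4001.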

@strong{Important:} The handling of identifiers instead of numbers for
the set and messages is a GNU extension. Systems strictly following the
X/Open specification do not have this feature. An example for a message
catalog file is this:

@smallexample
$ This is a leading comment.
$quote "

$set SetOne
1 Message with ID 1.
two " Message with ID \"two\", which gets the value 2 assigned"

$set SetTwo
$ Since the last set got the number 1 assigned this set has number 2.
4000 "The numbers can be arbitrary, they need not start at one."
@end smallexample

This small example shows various aspects:
@itemize @bullet
@item
Lines 1 and 9 are comments since they start with @code{$} followed by
a whitespace.
@item
The quoting character is set to @code{"}. Otherwise the quotes in the
message definition would have to be left out and in this case the
message with the identifier @code{two} would lose its leading whitespace.
@item
Mixing numbered messages with messages having symbolic names is no
problem and the numbering happens automatically.
@end itemize
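
To make the line-level rules of the source format concrete, here is a
small classifier for a single input line. It is a sketch under stated
assumptions and not the parser used by @code{gencat}: the names
@code{line_kind}, @code{has_directive} and @code{classify_catalog_line}
are hypothetical, a directive at the very end of a line is accepted as
if whitespace followed, and unknown @code{$} sequences are reported as
invalid.

```c
#include <ctype.h>
#include <string.h>

enum line_kind
{
  LINE_IGNORED,   /* Empty line or comment.  */
  LINE_SET,       /* $set directive.  */
  LINE_DELSET,    /* $delset directive.  */
  LINE_QUOTE,     /* $quote directive.  */
  LINE_MESSAGE,   /* Number or identifier followed by a message.  */
  LINE_INVALID    /* Anything else.  */
};

/* Return nonzero if LINE starts with the directive WORD followed by
   whitespace or the end of the line.  */
static int
has_directive (const char *line, const char *word)
{
  size_t len = strlen (word);

  return strncmp (line, word, len) == 0
         && (line[len] == '\0' || isspace ((unsigned char) line[len]));
}

/* Hypothetical helper: classify one line of a message catalog source
   file according to the rules listed above.  */
enum line_kind
classify_catalog_line (const char *line)
{
  /* Leading whitespace is never significant for the classification.  */
  while (isspace ((unsigned char) *line))
    ++line;

  if (*line == '\0')
    return LINE_IGNORED;
  if (has_directive (line, "$"))
    return LINE_IGNORED;
  if (has_directive (line, "$set"))
    return LINE_SET;
  if (has_directive (line, "$delset"))
    return LINE_DELSET;
  if (has_directive (line, "$quote"))
    return LINE_QUOTE;
  if (isalnum ((unsigned char) *line) || *line == '_')
    return LINE_MESSAGE;
  return LINE_INVALID;
}
```

Applied to the example file, the first line is classified as ignored,
the @code{$quote} and @code{$set} lines as directives, and the lines
starting with @code{1}, @code{two} and @code{4000} as messages.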


While this file format is pretty easy it is not the best possible for
use in a running program. The @code{catopen} function would have to
parse the file and handle syntactic errors gracefully. This is not so
easy and the whole process is pretty slow. Therefore the @code{catgets}
functions expect the data in another more compact and ready-to-use file
format. There is a special program @code{gencat} which produces this
format and which is explained in detail in the next section.

Files in this other format are not human readable. To be easy to use by
programs it is a binary file. But the format is byte order independent
so translation files can be shared by systems of arbitrary architecture
(as long as they use the GNU C Library).

Details about the binary file format are not important to know since
these files are always created by the @code{gencat} program. The
sources of the GNU C Library also provide the sources for the
@code{gencat} program and so the interested reader can look through
these source files to learn about the file format.


@node The gencat program
@subsection Generate Message Catalog Files

@cindex gencat
The @code{gencat} program is specified in the X/Open standard and the
GNU implementation follows this specification and so can process
all correctly formed input files. Additionally some extensions are
implemented which help to work in a more reasonable way with the
@code{catgets} functions.

The @code{gencat} program can be invoked in two ways:

@example
`gencat [@var{Option}]@dots{} [@var{Output-File} [@var{Input-File}]@dots{}]`
@end example

This is the interface defined in the X/Open standard. If no
@var{Input-File} parameter is given input will be read from standard
input. Multiple input files will be read as if they are concatenated.
If @var{Output-File} is also missing, the output will be written to
standard output. To provide the interface one is used to from other
programs a second interface is provided.

@smallexample
`gencat [@var{Option}]@dots{} -o @var{Output-File} [@var{Input-File}]@dots{}`
@end smallexample

The option @samp{-o} is used to specify the output file and all file
arguments are used as input files.

Besides this one can use @file{-} or @file{/dev/stdin} for
@var{Input-File} to denote the standard input. Correspondingly one can
use @file{-} and @file{/dev/stdout} for @var{Output-File} to denote
standard output. Using @file{-} as a file name is allowed in X/Open
while using the device names is a GNU extension.

The @code{gencat} program works by concatenating all input files and
then @strong{merging} the resulting collection of message sets with a
possibly existing output file. This is done by removing all messages
with set/message number tuples matching any of the generated messages
from the output file and then adding all the new messages. Regenerating
a catalog file while ignoring the old contents therefore
requires removing the output file if it exists. If the output is
written to standard output no merging takes place.

@noindent
The following table shows the options understood by the @code{gencat}
program. The X/Open standard does not specify any option for the
program so all of these are GNU extensions.

@table @samp
@item -V
@itemx --version
Print the version information and exit.
@item -h
@itemx --help
Print a usage message listing all available options, then exit successfully.
@item --new
Never merge the new messages from the input files with the old content
of the output file. The old content of the output file is discarded.
@item -H
@itemx --header=@var{name}
This option is used to emit the symbolic names given to sets and
messages in the input files for use in the program. Details about how
to use this are given in the next section. The @var{name} parameter to
this option specifies the name of the output file. It will contain a
number of C preprocessor @code{#define}s to associate a name with a
number.

Please note that the generated file only contains the symbols from the
input files. If the output is merged with the previous content of the
output file the possibly existing symbols from the file(s) which
generated the old output files are not in the generated header file.
@end table


@node Common Usage
@subsection How to use the @code{catgets} interface

The @code{catgets} functions can be used in two different ways: by
following slavishly the X/Open specs and not relying on the extensions,
or by using the GNU extensions. We will take a look at the former
method first to understand the benefits of the extensions.

@subsubsection Not using symbolic names

Since the X/Open format of the message catalog files does not allow
symbol names we have to work with numbers all the time. When we start
writing a program we have to replace all appearances of translatable
strings with something like

@smallexample
catgets (catdesc, set, msg, "string")
@end smallexample

@noindent
@var{catdesc} is retrieved from a call to @code{catopen} which is
normally done once at the program start. The @code{"string"} is the
string we want to translate. The problems start with the set and
message numbers.

In a bigger program several programmers usually work at the same time on
the program and so coordinating the number allocation is crucial.
Though no two different strings may be indexed by the same tuple of
numbers it is highly desirable to reuse the numbers for equal strings
with equal translations (please note that there might be strings which
are equal in one language but have different translations due to
different contexts).

The allocation process can be relaxed a bit by using different set
numbers for different parts of the program. So the number of developers
who have to coordinate the allocation can be reduced. But still lists
must be kept to track the allocation and errors can easily happen. These
errors cannot be discovered by the compiler or the @code{catgets}
functions. Only the user of the program might see wrong messages
printed. In the worst cases the messages are so misleading that they
cannot be recognized as wrong. Think about the translations for
@code{"true"} and @code{"false"} being exchanged. This could result in
a disaster.


@subsubsection Using symbolic names

The problems mentioned in the last section derive from the fact that:

@enumerate
@item
the numbers are allocated once and due to the possibly frequent use of
them it is difficult to change a number later.
@item
the numbers do not allow one to guess anything about the string and
therefore collisions can easily happen.
@end enumerate

By consistently using symbolic names and by providing a method which maps
the string content to a symbolic name (however this will happen) one can
prevent both problems above. The cost of this is that the programmer
has to write a complete message catalog file while s/he is writing the
program itself.

This is necessary since the symbolic names must be mapped to numbers
before the program sources can be compiled. In the last section it was
described how to generate a header containing the mapping of the names.
E.g., for the example message file given in the last section we could
call the @code{gencat} program as follows (assume @file{ex.msg} contains
the sources).

@smallexample
gencat -H ex.h -o ex.cat ex.msg
@end smallexample

@noindent
This generates a header file with the following content:

@smallexample
#define SetTwoSet 0x2 /* ex.msg:8 */

#define SetOneSet 0x1 /* ex.msg:4 */
#define SetOnetwo 0x2 /* ex.msg:6 */
@end smallexample

As can be seen the various symbols given in the source file are mangled
to generate unique identifiers and these identifiers get numbers
assigned. Reading the source file and knowing about the rules will
allow one to predict the content of the header file (it is deterministic)
but this is not necessary. The @code{gencat} program can take care of
everything. All the programmer has to do is to put the generated header
file in the dependency list of the source files of her/his project and
to add a rule to regenerate the header if any of the input files
changes.

One word about the symbol mangling. Every symbol consists of two parts:
the name of the message set plus the name of the message or the special
string @code{Set}. So @code{SetOnetwo} means this macro can be used to
access the translation with identifier @code{two} in the message set
@code{SetOne}.

The other names denote the names of the message sets. The special
string @code{Set} is used in the place of the message identifier.
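
The mangling rule is plain concatenation, which a few lines of C can
illustrate. The helper name @code{mangle_symbol} is hypothetical and the
code only mirrors the rule described above; it is not taken from the
@code{gencat} sources.

```c
#include <stdio.h>

/* Hypothetical helper: build the macro name for message MSG_NAME in set
   SET_NAME.  A null MSG_NAME asks for the name of the set itself.
   Returns 0 on success and -1 if the name does not fit into BUF.  */
int
mangle_symbol (const char *set_name, const char *msg_name,
               char *buf, size_t size)
{
  int n = snprintf (buf, size, "%s%s", set_name,
                    msg_name != NULL ? msg_name : "Set");

  return (n >= 0 && (size_t) n < size) ? 0 : -1;
}
```

The sketch also shows why a message identifier must not be @code{Set}:
such a message in the set @code{SetOne} would be mangled to
@code{SetOneSet}, the same name as the macro for the set itself.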

If in the code the second string of the set @code{SetOne} is used the C
code should look like this:

@smallexample
catgets (catdesc, SetOneSet, SetOnetwo,
         " Message with ID \"two\", which gets the value 2 assigned")
@end smallexample

Writing the function call this way makes it possible to change the
message number and even the set number without requiring any change in
the C source code. (The text of the string is normally not the same;
this is only for this example.)


@subsubsection How does this help in development

To illustrate the usual way to work with the symbolic names
here is a little example. Assume we want to write the very complex and
famous greeting program. We start by writing the code as usual:

@smallexample
#include <stdio.h>
int
main (void)
@{
  printf ("Hello, world!\n");
  return 0;
@}
@end smallexample

Now we want to internationalize the message and therefore replace the
message with whatever the user wants.

@smallexample
#include <nl_types.h>
#include <stdio.h>
#include "msgnrs.h"
int
main (void)
@{
  nl_catd catdesc = catopen ("hello.cat", NL_CAT_LOCALE);
  printf (catgets (catdesc, MainSet, MainHello, "Hello, world!\n"));
  catclose (catdesc);
  return 0;
@}
@end smallexample

We see how the catalog object is opened and the returned descriptor used
in the other function calls. It is not really necessary to check for
failure of any of the functions since even in these situations the
functions will behave reasonably. The @code{catgets} call will simply
return the untranslated default string.

What remains unspecified here are the constants @code{MainSet} and
@code{MainHello}. These are the symbolic names describing the
message. To get the actual definitions which match the information in
the catalog file we have to create the message catalog source file and
process it using the @code{gencat} program.

@smallexample
$ Messages for the famous greeting program.
$quote "

$set Main
Hello "Hallo, Welt!\n"
@end smallexample

Now we can start building the program (assume the message catalog source
file is named @file{hello.msg} and the program source file @file{hello.c}):

@smallexample
@cartouche
% gencat -H msgnrs.h -o hello.cat hello.msg
% cat msgnrs.h
#define MainSet 0x1 /* hello.msg:4 */
#define MainHello 0x1 /* hello.msg:5 */
% gcc -o hello hello.c -I.
% cp hello.cat /usr/share/locale/de/LC_MESSAGES
% echo $LC_ALL
de
% ./hello
Hallo, Welt!
%
@end cartouche
@end smallexample

The call of the @code{gencat} program creates the missing header file
@file{msgnrs.h} as well as the message catalog binary. The former is
used in the compilation of @file{hello.c} while the latter is placed in a
directory in which the @code{catopen} function will try to locate it.
Please check the @code{LC_ALL} environment variable and the default path
for @code{catopen} presented in the description above.


729 @node The Uniforum approach
730 @section The Uniforum approach to Message Translation
731
Sun Microsystems tried to standardize a different approach to message
translation in the Uniforum group.  There never was a real standard
defined but still the interface was used in Sun's operating systems.
Since this approach fits better in the development process of free
software it is also used throughout the GNU project and the GNU
@file{gettext} package provides support for this outside the GNU C
Library.
739
The code of the @file{libintl} from GNU @file{gettext} is the same as
the code in the GNU C Library.  So the documentation in the GNU
@file{gettext} manual is also valid for the functionality here.  The
following text describes the library functions in detail, but the
numerous helper programs are not described in this manual.  Instead
people should read the GNU @file{gettext} manual
(@pxref{Top,,GNU gettext utilities,gettext,Native Language Support Library and Tools}).
We will only give a short overview.
748
Though the @code{catgets} functions are available by default on more
systems, the @code{gettext} interface is at least as portable as the
former.  The GNU @file{gettext} package can be used wherever the
functions are not available.
753
754
755 @menu
756 * Message catalogs with gettext:: The @code{gettext} family of functions.
757 * Helper programs for gettext:: Programs to handle message catalogs
758 for @code{gettext}.
759 @end menu
760
761
762 @node Message catalogs with gettext
763 @subsection The @code{gettext} family of functions
764
The paradigm underlying the @code{gettext} approach to message
translation is different from that of the @code{catgets} functions, but
the basic functionality is equivalent.  There are functions of the
following categories:
769
770 @menu
771 * Translation with gettext:: What has to be done to translate a message.
772 * Locating gettext catalog:: How to determine which catalog to be used.
773 * Using gettextized software:: The possibilities of the user to influence
774 the way @code{gettext} works.
775 @end menu
776
777 @node Translation with gettext
778 @subsubsection What has to be done to translate a message?
779
780 The @code{gettext} functions have a very simple interface. The most
781 basic function just takes the string which shall be translated as the
782 argument and it returns the translation. This is fundamentally
783 different from the @code{catgets} approach where an extra key is
784 necessary and the original string is only used for the error case.
785
If the string which has to be translated is the only argument this of
course means the string itself is the key.  I.e., the translation will
be selected based on the original string.  The message catalogs must
therefore contain the original strings plus one translation for any such
string.  The task of the @code{gettext} function is to compare the
argument string with the available strings in the catalog and return the
appropriate translation.  Of course this process is optimized so that it
is not more expensive than an access using an atomic key like in
@code{catgets}.
795
796 The @code{gettext} approach has some advantages but also some
797 disadvantages. Please see the GNU @file{gettext} manual for a detailed
798 discussion of the pros and cons.
799
800 All the definitions and declarations for @code{gettext} can be found in
801 the @file{libintl.h} header file. On systems where these functions are
802 not part of the C library they can be found in a separate library named
803 @file{libintl.a} (or accordingly different for shared libraries).
804
805 @deftypefun {char *} gettext (const char *@var{msgid})
806 The @code{gettext} function searches the currently selected message
807 catalogs for a string which is equal to @var{msgid}. If there is such a
808 string available it is returned. Otherwise the argument string
809 @var{msgid} is returned.
810
Please note that although the return value is @code{char *} the
returned string must not be changed.  This broken type results from the
history of the function and does not reflect the way the function should
be used.

Please note that above we wrote ``message catalogs'' (plural).  This is
a speciality of the GNU implementation of these functions and we will
say more about this when we talk about the ways message catalogs are
selected (@pxref{Locating gettext catalog}).
820
821 The @code{gettext} function does not modify the value of the global
822 @var{errno} variable. This is necessary to make it possible to write
823 something like
824
@smallexample
printf (gettext ("Operation failed: %m\n"));
@end smallexample

Here the @var{errno} value is used in the @code{printf} function while
processing the @code{%m} format element.  Since @code{gettext} is called
before @code{printf}, we would get a wrong message if @code{gettext}
changed this value.
833
So there is no easy way to detect a missing message catalog besides
comparing the argument string with the result.  But it is normally the
task of the user to react to missing catalogs.  The program cannot guess
when a message catalog is really necessary since a user who speaks the
language the program was developed in does not need any translation.
839 @end deftypefun
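
The two properties just described, the unchanged @var{errno} value and
the argument string coming back when no translation exists, can be
checked with a small sketch like the following.  The helper names
@code{gettext_keeps_errno} and @code{is_untranslated} are hypothetical,
and the sketch assumes a system where @code{gettext} is part of the C
library.

```c
#include <assert.h>
#include <errno.h>
#include <libintl.h>
#include <string.h>

/* Return 1 if a gettext call leaves a previously stored errno value
   untouched.  ERANGE is only an arbitrary marker value.  */
static int
gettext_keeps_errno (const char *msgid)
{
  assert (msgid != NULL);
  errno = ERANGE;
  (void) gettext (msgid);
  return errno == ERANGE;
}

/* Return 1 if gettext handed back a string equal to MSGID.  This is
   the only portable hint that no translation was found; note that a
   translation identical to the original looks the same.  */
static int
is_untranslated (const char *msgid)
{
  assert (msgid != NULL);
  return strcmp (gettext (msgid), msgid) == 0;
}
```

A program would normally not act on the result of
@code{is_untranslated}; as said above, reacting to missing catalogs is
the task of the user.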
840
841 The remaining two functions to access the message catalog add some
842 functionality to select a message catalog which is not the default one.
843 This is important if parts of the program are developed independently.
844 Every part can have its own message catalog and all of them can be used
845 at the same time. The C library itself is an example: internally it
846 uses the @code{gettext} functions but since it must not depend on a
847 currently selected default message catalog it must specify all ambiguous
848 information.
849
850 @deftypefun {char *} dgettext (const char *@var{domainname}, const char *@var{msgid})
The @code{dgettext} function acts just like the @code{gettext}
function.  It only takes an additional first argument @var{domainname}
which guides the selection of the message catalogs which are searched
for the translation.  If the @var{domainname} parameter is the null
pointer the @code{dgettext} function is exactly equivalent to
@code{gettext} since the default value for the domain name is used.

As for @code{gettext} the return value type is @code{char *} which is an
anachronism.  The returned string must never be modified.
860 @end deftypefun
861
862 @deftypefun {char *} dcgettext (const char *@var{domainname}, const char *@var{msgid}, int @var{category})
The @code{dcgettext} function adds another argument to those which
@code{dgettext} takes.  This argument @var{category} specifies the last
piece of information needed to locate the message catalog.  I.e., the
domain name and the locale category exactly specify which message
catalog has to be used (relative to a given directory, see below).
868
The @code{dgettext} function can be expressed in terms of
@code{dcgettext} by using

@smallexample
dcgettext (domain, string, LC_MESSAGES)
@end smallexample

@noindent
instead of

@smallexample
dgettext (domain, string)
@end smallexample
882
This also shows which values are expected for the third parameter.  One
has to use the available selectors for the categories available in
@file{locale.h}.  Normally the available values are @code{LC_CTYPE},
@code{LC_COLLATE}, @code{LC_MESSAGES}, @code{LC_MONETARY},
@code{LC_NUMERIC}, and @code{LC_TIME}.  Please note that @code{LC_ALL}
must not be used and even though the names might suggest this, there is
no relation to the environment variables of the same names.

The @code{dcgettext} function is only implemented for compatibility with
other systems which have @code{gettext} functions.  There is not really
any situation where it is necessary (or useful) to use a value other
than @code{LC_MESSAGES} for the @var{category} parameter.  We are
dealing with messages here and any other choice can only be confusing.

As for @code{gettext} the return value type is @code{char *} which is an
anachronism.  The returned string must never be modified.
899 @end deftypefun
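
The equivalence of @code{dgettext} and @code{dcgettext} with
@code{LC_MESSAGES} can be illustrated with a short sketch.  The helper
name @code{same_lookup} is hypothetical; it only compares the results of
the two calls for one message.

```c
#include <assert.h>
#include <libintl.h>
#include <locale.h>
#include <string.h>

/* Return 1 if dgettext and dcgettext with LC_MESSAGES produce the
   same string for MSGID in DOMAIN.  DOMAIN may be the null pointer,
   in which case the default domain is used by both calls.  */
static int
same_lookup (const char *domain, const char *msgid)
{
  assert (msgid != NULL);
  const char *short_form = dgettext (domain, msgid);
  const char *long_form = dcgettext (domain, msgid, LC_MESSAGES);
  return strcmp (short_form, long_form) == 0;
}
```

Both calls search the same catalog, so the comparison should succeed no
matter whether a translation exists.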
900
When using the three functions above in a program, the @var{msgid}
argument is frequently a constant string, so this case is worth
optimizing.  A moment's thought shows that as long as no new message
catalog is loaded the translation of a message will not change.  I.e.,
the algorithm to determine the translation is deterministic.

This is exactly what the optimizations implemented in the
@file{libintl.h} header exploit.  Whenever a program is compiled with
the GNU C compiler, optimization is selected, and the @var{msgid}
argument to @code{gettext}, @code{dgettext} or @code{dcgettext} is a
constant string, the actual function call is made only the first time
the message is used and after that only if a new message catalog was
loaded, since the result of the translation lookup might then be
different.  See the @file{libintl.h} header file for details.  For the
user it is only important to know that the result is always the same,
independent of the compiler or compiler options in use.
918
919
920 @node Locating gettext catalog
921 @subsubsection How to determine which catalog to be used
922
The functions to retrieve the translations for a given message have a
remarkably simple interface.  But to still give the user of the program
the opportunity to select exactly the translation s/he wants, and to
give the programmer the possibility to influence the way catalog files
are located, there is a quite complicated underlying mechanism which
controls all this.  The code is complicated, the use is easy.
930
931 Basically we have two different tasks to perform which can also be
932 performed by the @code{catgets} functions:
933
934 @enumerate
935 @item
Locate the set of message catalogs.  There are a number of files for
different languages which all belong to the package.  Usually they
are all stored in the filesystem below a certain directory.

There can be arbitrarily many packages installed and they can follow
different guidelines for the placement of their files.
942
943 @item
944 Relative to the location specified by the package the actual translation
945 files must be searched, based on the wishes of the user. I.e., for each
946 language the user selects the program should be able to locate the
947 appropriate file.
948 @end enumerate
949
This is the functionality required by the specifications for
@code{gettext} and this is also what the @code{catgets} functions are
able to do.  But there are some unresolved problems:
953
954 @itemize @bullet
955 @item
The language to be used can be specified in several different ways.
There is no generally accepted standard for this and the user always
expects the program to understand what s/he means.  E.g., to select the
German translation one could write @code{de}, @code{german}, or
@code{deutsch} and the program should always react the same.
961
962 @item
Sometimes the specification of the user is too detailed.  If s/he, e.g.,
specifies @code{de_DE.ISO-8859-1} which means German, spoken in Germany,
coded using the @w{ISO 8859-1} character set there is the possibility
that a message catalog matching this exactly is not available.  But
there could be a catalog matching @code{de} and if the character set
used on the machine is always @w{ISO 8859-1} there is no reason why this
latter message catalog should not be used.  (We call this @dfn{message
inheritance}.)
971
972 @item
If a catalog for a wanted language is not available it is not always the
second best choice to fall back on the language of the developer and
simply not translate any message.  Instead a user might be better able
to read the messages in another language and so the user of the program
should be able to define a precedence order of languages.
978 @end itemize
979
We can divide the configuration actions into two parts: the one is
performed by the programmer, the other by the user.  We will start with
the functions the programmer can use since the user configuration will
be based on this.

As already mentioned in the descriptions of the functions in the last
sections, separate sets of messages can be selected by a @dfn{domain
name}.  This is a simple string which should be unique for each program
part which uses a separate domain.  It is possible to use arbitrarily
many domains in one program at the same time.  E.g., the GNU C Library
itself uses a domain named @code{libc} while the program using the C
Library could use a domain named @code{foo}.  The important point is
that at any time exactly one domain is active.  This is controlled with
the following function.
994
995 @deftypefun {char *} textdomain (const char *@var{domainname})
996 The @code{textdomain} function sets the default domain, which is used in
997 all future @code{gettext} calls, to @var{domainname}. Please note that
998 @code{dgettext} and @code{dcgettext} calls are not influenced if the
999 @var{domainname} parameter of these functions is not the null pointer.
1000
Before the first call to @code{textdomain} the default domain is
@code{messages}.  This is the name specified in the specification of
the @code{gettext} API.  This name is as good as any other name.  No
program should ever really use a domain with this name since this can
only lead to problems.

The function returns the value which is from now on taken as the default
domain.  If the system ran out of memory the returned value is
@code{NULL} and the global variable @var{errno} is set to @code{ENOMEM}.
Despite the return value type being @code{char *} the returned string
must not be changed.  It is allocated internally by the
@code{textdomain} function.

If the @var{domainname} parameter is the null pointer no new default
domain is set.  Instead the currently selected default domain is
returned.

If the @var{domainname} parameter is the empty string the default domain
is reset to its initial value, the domain with the name @code{messages}.
Using this possibility is questionable since the domain @code{messages}
really never should be used.
1022 @end deftypefun
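
The three cases described for @code{textdomain} (query with the null
pointer, setting a new name, reset with the empty string) can be
sketched as follows.  The helper names @code{default_domain_is} and
@code{switch_domain} are hypothetical, and the sketch assumes a system
where the function is part of the C library.

```c
#include <assert.h>
#include <libintl.h>
#include <string.h>

/* Return 1 if the currently selected default domain is NAME.
   Passing the null pointer to textdomain only queries the value.  */
static int
default_domain_is (const char *name)
{
  assert (name != NULL);
  const char *current = textdomain (NULL);
  return current != NULL && strcmp (current, name) == 0;
}

/* Make NAME the default domain.  Return 1 on success and 0 if
   textdomain reported a failure by returning NULL.  */
static int
switch_domain (const char *name)
{
  assert (name != NULL);
  return textdomain (name) != NULL;
}
```

Before any call to @code{switch_domain} the first helper should report
the domain @code{messages}; calling @code{switch_domain ("")} restores
that state.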
1023
1024 @deftypefun {char *} bindtextdomain (const char *@var{domainname}, const char *@var{dirname})
The @code{bindtextdomain} function can be used to specify the directory
which contains the message catalogs for domain @var{domainname} for the
different languages.  To be correct, this is the directory where the
hierarchy of directories is expected.  Details are explained below.

For the programmer it is important to note that the translations which
come with the program have to be placed in a directory hierarchy
starting at, say, @file{/foo/bar}.  Then the program should make a
@code{bindtextdomain} call to bind the domain for the current program to
this directory.  This ensures the catalogs are found.  A correctly
running program does not depend on the user setting an environment
variable.

The @code{bindtextdomain} function can be used several times and if the
@var{domainname} argument is different the previously bound domains
will not be overwritten.

If the @var{dirname} parameter is the null pointer @code{bindtextdomain}
returns the currently selected directory for the domain with the name
@var{domainname}.

The @code{bindtextdomain} function returns a pointer to a string
containing the name of the selected directory.  The string is
allocated internally in the function and must not be changed by the
user.  If the system ran out of memory during the execution of
@code{bindtextdomain} the return value is @code{NULL} and the global
variable @var{errno} is set accordingly.
1052 @end deftypefun
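
A short sketch may illustrate how bindings for several domains coexist.
The helper names @code{bind_domain} and @code{bound_dir_is} as well as
the domain and directory names used with them are hypothetical; the
directories do not have to exist for the binding itself to succeed.

```c
#include <assert.h>
#include <libintl.h>
#include <string.h>

/* Bind DOMAIN to DIR.  Return 1 on success and 0 if bindtextdomain
   reported a failure by returning NULL.  */
static int
bind_domain (const char *domain, const char *dir)
{
  assert (domain != NULL && dir != NULL);
  return bindtextdomain (domain, dir) != NULL;
}

/* Return 1 if DOMAIN is currently bound to DIR.  A null pointer as
   the second argument of bindtextdomain only queries the binding.  */
static int
bound_dir_is (const char *domain, const char *dir)
{
  assert (domain != NULL && dir != NULL);
  const char *bound = bindtextdomain (domain, NULL);
  return bound != NULL && strcmp (bound, dir) == 0;
}
```

Binding a second domain leaves the directory recorded for the first one
untouched, which is what allows independently developed program parts to
keep their catalogs in different places.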
1053
1054
1055 @node Using gettextized software
1056 @subsubsection User influence on @code{gettext}
1057
The last sections described what the programmer can do to
internationalize the messages of the program.  But it is finally up to
the user to select the messages s/he wants to see.  S/He must understand
them.

The POSIX locale model uses the environment variables @code{LC_COLLATE},
@code{LC_CTYPE}, @code{LC_MESSAGES}, @code{LC_MONETARY},
@code{LC_NUMERIC}, and @code{LC_TIME} to select the locale which is to
be used.  This way the user can influence lots of functions.  As we
mentioned above the @code{gettext} functions also take advantage of
this.
1068
1069 To understand how this happens it is necessary to take a look at the
1070 various components of the filename which gets computed to locate a
1071 message catalog. It is composed as follows:
1072
@smallexample
@var{dir_name}/@var{locale}/LC_@var{category}/@var{domain_name}.mo
@end smallexample
1076
The default value for @var{dir_name} is system specific.  It is computed
from the value given as the prefix while configuring the C library.
This value normally is @file{/usr} or @file{/}.  For the former the
complete @var{dir_name} is:

@smallexample
/usr/share/locale
@end smallexample

We can use @file{/usr/share} since the @file{.mo} files containing the
message catalogs are system independent, so all systems can use the same
files.  If the program executed the @code{bindtextdomain} function for
the message domain that is currently handled, the @var{dir_name}
component is exactly the value which was given to the function as
the second parameter.  I.e., @code{bindtextdomain} allows overriding
the only system dependent and fixed value, which makes it possible to
address files anywhere in the filesystem.

The @var{category} is the name of the locale category which was selected
in the program code.  For @code{gettext} and @code{dgettext} this is
always @code{LC_MESSAGES}, for @code{dcgettext} this is selected by the
value of the third parameter.  As said above, one should avoid ever
using a category other than @code{LC_MESSAGES}.

The @var{locale} component is computed based on the category used.  Just
like for the @code{setlocale} function, this is where the user's
selection comes into play.  Some environment variables are examined in a
fixed order and the first environment variable set determines the result
of the lookup process.  In detail, for the category @code{LC_xxx} the
following variables in this order are examined:
1107
1108 @table @code
1109 @item LANGUAGE
1110 @item LC_ALL
1111 @item LC_xxx
1112 @item LANG
1113 @end table
1114
This looks very familiar.  With the exception of the @code{LANGUAGE}
environment variable this is exactly the lookup order the
@code{setlocale} function uses.  But why introduce the @code{LANGUAGE}
variable?

The reason is that the syntax of the values these variables can have is
different from what is expected by the @code{setlocale} function.  If we
set @code{LC_ALL} to a value following the extended syntax, the
@code{setlocale} function would no longer be able to use the value of
this variable.  An additional variable removes this problem, plus we
can select the language independently of the locale setting which
sometimes is useful.

While for the @code{LC_xxx} variables the value should consist of
exactly one specification of a locale, the @code{LANGUAGE} variable's
value can consist of a colon-separated list of locale names.  The
attentive reader will realize that this is the way we manage to
implement one of our additional demands above: we want to be able to
specify an ordered list of languages.
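
The lookup order and the list syntax of @code{LANGUAGE} can be modelled
with a few lines of C.  This is only a simplified sketch of the rules
stated above, not the code used in the library: the helper names
@code{locale_setting} and @code{nth_language} are hypothetical, and
treating an empty value like an unset variable is an assumption made
here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the value of the first variable among LANGUAGE, LC_ALL,
   CATEGORY_VAR and LANG which is set and (assumption of this sketch)
   not empty.  Return NULL if none of them qualifies.  */
static const char *
locale_setting (const char *category_var)
{
  assert (category_var != NULL);
  const char *names[] = { "LANGUAGE", "LC_ALL", category_var, "LANG" };

  for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
    {
      const char *value = getenv (names[i]);
      if (value != NULL && value[0] != '\0')
        return value;
    }
  return NULL;
}

/* Copy the Nth entry (counting from zero) of the colon-separated
   LIST into BUF.  Return 1 on success, 0 if LIST has fewer entries
   or the entry does not fit into SIZE bytes.  */
static int
nth_language (const char *list, size_t n, char *buf, size_t size)
{
  assert (list != NULL && buf != NULL);
  const char *start = list;

  for (;;)
    {
      const char *end = strchr (start, ':');
      size_t len = end != NULL ? (size_t) (end - start) : strlen (start);

      if (n == 0)
        {
          if (len >= size)
            return 0;
          memcpy (buf, start, len);
          buf[len] = '\0';
          return 1;
        }
      if (end == NULL)
        return 0;               /* List exhausted.  */
      start = end + 1;
      n--;
    }
}
```

With @code{LANGUAGE} set to @code{de:fr} the first helper returns the
whole list and the second one yields @code{de} and @code{fr} in the
order of preference.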
1134
1135 Back to the constructed filename we have only one component missing.
1136 The @var{domain_name} part is the name which was either registered using
1137 the @code{textdomain} function or which was given to @code{dgettext} or
1138 @code{dcgettext} as the first parameter. Now it becomes obvious that a
1139 good choice for the domain name in the program code is a string which is
1140 closely related to the program/package name. E.g., for the GNU C
1141 Library the domain name is @code{libc}.
1142
@noindent
A small piece of example code should show how the programmer is supposed
to work:

@smallexample
@{
  textdomain ("test-package");
  bindtextdomain ("test-package", "/usr/local/share/locale");
  puts (gettext ("Hello, world!"));
@}
@end smallexample
1154
1155 At the program start the default domain is @code{messages}. The
1156 @code{textdomain} call changes this to @code{test-package}. The
1157 @code{bindtextdomain} call specifies that the message catalogs for the
1158 domain @code{test-package} can be found below the directory
1159 @file{/usr/local/share/locale}.
1160
If the user now sets the variable @code{LANGUAGE} in her/his environment
to @code{de}, the @code{gettext} function will try to use the
translations from the file

@smallexample
/usr/local/share/locale/de/LC_MESSAGES/test-package.mo
@end smallexample

From the above descriptions it should be clear which component of this
filename is determined by which source.
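
How the components are put together can be sketched with a helper which
builds the filename.  This is an illustration only; the name
@code{catalog_path} is hypothetical and the library's own lookup code is
more involved.  The category is passed as the complete name, e.g.@:
@code{LC_MESSAGES}.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Compose DIR_NAME/LOCALE/CATEGORY/DOMAIN_NAME.mo in BUF.  Return 1
   on success and 0 if the result does not fit into SIZE bytes.  */
static int
catalog_path (char *buf, size_t size, const char *dir_name,
              const char *locale, const char *category,
              const char *domain_name)
{
  assert (buf != NULL && dir_name != NULL && locale != NULL
          && category != NULL && domain_name != NULL);

  /* Three slashes, the ".mo" suffix and the terminating null byte.  */
  size_t needed = strlen (dir_name) + strlen (locale) + strlen (category)
                  + strlen (domain_name) + 3 + 3 + 1;
  if (needed > size)
    return 0;

  snprintf (buf, size, "%s/%s/%s/%s.mo",
            dir_name, locale, category, domain_name);
  return 1;
}
```

Called with the directory given to @code{bindtextdomain}, the locale
@code{de}, the category @code{LC_MESSAGES} and the domain
@code{test-package}, the helper produces the filename shown above.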
1171
1172 @c Describe:
@c * message inheritance
1174 @c * locale aliasing
1175 @c * character set dependence
1176
1177
1178 @node Helper programs for gettext
1179 @subsection Programs to handle message catalogs for @code{gettext}
1180
1181 @c Describe:
1182 @c * msgfmt
1183 @c * xgettext
1184 @c Mention:
1185 @c * other programs from GNU gettext