@node Message Translation
@chapter Message Translation

The program's interface with the human should be designed to make the
user's task easier. One of the possibilities is to use messages in
whatever language the user prefers.

Printing messages in different languages can be implemented in different
ways. One could add all the different languages in the source code and
choose among the variants every time a message has to be printed. This is
certainly not a good solution since extending the set of languages is
difficult (the code must be changed) and the code itself can become
really big with dozens of message sets.

A better solution is to keep the message sets for each language
in separate files which are loaded at runtime depending on the language
selection of the user.

The GNU C Library provides two different sets of functions to support
message translation. The problem is that neither of the interfaces is
officially defined by the POSIX standard. The @code{catgets} family of
functions is defined in the X/Open standard but this is derived from
industry decisions and therefore not necessarily based on reasonable
decisions.

As mentioned above the message catalog handling provides easy
extensibility by using external data files which contain the message
translations. I.e., these files contain for each of the messages used
in the program a translation for the appropriate language. So the tasks
of the message handling functions are

@itemize @bullet
@item
locate the external data file with the appropriate translations
@item
load the data and make it possible to address the messages
@item
map a given key to the translated message
@end itemize

The two approaches mainly differ in the implementation of this last
step. The design decisions made for this step influence everything else.

@menu
* Message catalogs a la X/Open:: The @code{catgets} family of functions.
* The Uniforum approach:: The @code{gettext} family of functions.
@end menu


@node Message catalogs a la X/Open
@section X/Open Message Catalog Handling

The @code{catgets} functions are based on the simple scheme:

@quotation
Associate every message to translate in the source code with a unique
identifier. To retrieve a message from a catalog file solely the
identifier is used.
@end quotation

This means for the author of the program that s/he will have to make
sure the meaning of the identifier in the program code and in the
message catalogs is always the same.

Before a message can be translated the catalog file must be located.
The user of the program must be able to guide the responsible function
to find whatever catalog the user wants. This is separated from what
the programmer had in mind.

All the types, constants and functions for the @code{catgets} functions
are defined/declared in the @file{nl_types.h} header file.

@menu
* The catgets Functions:: The @code{catgets} function family.
* The message catalog files:: Format of the message catalog files.
* The gencat program:: How to generate message catalogs files which
 can be used by the functions.
* Common Usage:: How to use the @code{catgets} interface.
@end menu


@node The catgets Functions
@subsection The @code{catgets} function family

@comment nl_types.h
@comment X/Open
@deftypefun nl_catd catopen (const char *@var{cat_name}, int @var{flag})
The @code{catopen} function tries to locate the message data file named
@var{cat_name} and loads it when found. The return value is of an
opaque type and can be used in calls to the other functions to refer to
this loaded catalog.

The return value is @code{(nl_catd) -1} in case the function failed and
no catalog was loaded. The global variable @var{errno} contains a code
for the error causing the failure. But even if the function call
succeeded this does not mean that all messages can be translated.

Locating the catalog file must happen in a way which lets the user of
the program influence the decision. It is up to the user to decide
about the language to use and sometimes it is useful to use alternate
catalog files. All this can be specified by the user by setting some
environment variables.

The first problem is to find out where all the message catalogs are
stored. Every program could have its own place to keep all the
different files but usually the catalog files are grouped by languages
and the catalogs for all programs are kept in the same place.

@cindex NLSPATH environment variable
To tell the @code{catopen} function where the catalog for the program
can be found the user can set the environment variable @code{NLSPATH} to
a value which describes her/his choice. Since this value must be usable
for different languages and locales it cannot be a simple string.
Instead it is a format string (similar to @code{printf}'s). An example
is

@smallexample
/usr/share/locale/%L/%N:/usr/share/locale/%L/LC_MESSAGES/%N
@end smallexample

First one can see that more than one directory can be specified (with
the usual syntax of separating them by colons). The next things to
observe are the format elements, @code{%L} and @code{%N} in this case.
The @code{catopen} function knows about several of them and each of
them is replaced differently.

@table @code
@item %N
This format element is substituted with the name of the catalog file.
This is the value of the @var{cat_name} argument given to
@code{catopen}.

@item %L
This format element is substituted with the name of the currently
selected locale for translating messages. How this is determined is
explained below.

@item %l
(This is the lowercase ell.) This format element is substituted with the
language element of the locale name. The string describing the selected
locale is expected to have the form
@code{@var{lang}[_@var{terr}[.@var{codeset}]]} and this format uses the
first part @var{lang}.

@item %t
This format element is substituted by the territory part @var{terr} of
the name of the currently selected locale. See the explanation of the
format above.

@item %c
This format element is substituted by the codeset part @var{codeset} of
the name of the currently selected locale. See the explanation of the
format above.

@item %%
Since @code{%} is used as a meta character there must be a way to
express the @code{%} character in the result itself. Using @code{%%}
does this just like it works for @code{printf}.
@end table
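
The substitution rules in the table above can be sketched in C as
follows. This is only an illustration and not the code used by the GNU
C Library: the names @code{expand_nlspath_template} and @code{append}
are hypothetical, the function handles a single colon-separated element
of @code{NLSPATH}, and it assumes that unknown format elements are
simply dropped.

```c
#include <stddef.h>
#include <string.h>

/* Append N bytes of SRC to OUT at position *POS and keep OUT terminated.
   Returns 0 on success, -1 if the buffer would overflow.  */
static int
append (char *out, size_t outsz, size_t *pos, const char *src, size_t n)
{
  if (*pos + n >= outsz)
    return -1;
  memcpy (out + *pos, src, n);
  *pos += n;
  out[*pos] = '\0';
  return 0;
}

/* Expand one NLSPATH template (hypothetical helper).  NAME stands for
   the cat_name argument, LOCALE for a string of the form
   lang[_terr[.codeset]].  Returns 0 on success, -1 on overflow.  */
int
expand_nlspath_template (const char *tmpl, const char *name,
                         const char *locale, char *out, size_t outsz)
{
  size_t pos = 0;
  /* Split the locale name into its language, territory and codeset.  */
  size_t lang_len = strcspn (locale, "_.");
  const char *terr = "";
  size_t terr_len = 0;
  const char *codeset = "";
  const char *dot = strchr (locale, '.');

  if (locale[lang_len] == '_')
    {
      terr = locale + lang_len + 1;
      terr_len = strcspn (terr, ".");
    }
  if (dot != NULL)
    codeset = dot + 1;

  if (outsz == 0)
    return -1;
  out[0] = '\0';

  for (const char *p = tmpl; *p != '\0'; ++p)
    {
      int rc = 0;
      if (*p != '%')
        rc = append (out, outsz, &pos, p, 1);
      else
        switch (*++p)
          {
          case 'N': rc = append (out, outsz, &pos, name, strlen (name)); break;
          case 'L': rc = append (out, outsz, &pos, locale, strlen (locale)); break;
          case 'l': rc = append (out, outsz, &pos, locale, lang_len); break;
          case 't': rc = append (out, outsz, &pos, terr, terr_len); break;
          case 'c': rc = append (out, outsz, &pos, codeset, strlen (codeset)); break;
          case '%': rc = append (out, outsz, &pos, "%", 1); break;
          case '\0': return 0;  /* Lone '%' at the end: ignored.  */
          default: break;       /* Unknown element: dropped (assumption).  */
          }
      if (rc != 0)
        return -1;
    }
  return 0;
}
```

With the template @code{/usr/share/locale/%L/%N}, the catalog name
@code{hello.cat} and the locale @code{de_DE.UTF-8} this sketch produces
@file{/usr/share/locale/de_DE.UTF-8/hello.cat}.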


Using @code{NLSPATH} allows arbitrary directories to be specified as
places to search for message catalogs while still allowing different
languages to be used. If the @code{NLSPATH} environment variable is not
set the default value is

@smallexample
@var{prefix}/share/locale/%L/%N:@var{prefix}/share/locale/%L/LC_MESSAGES/%N
@end smallexample

@noindent
where @var{prefix} is given to @code{configure} while installing the GNU
C Library (this value is in many cases @code{/usr} or the empty string).

The remaining problem is to decide which catalog must be used. The
selected locale determines the substitution of the format elements
mentioned above. First of all the user can specify a path in the message
catalog name (i.e., the name contains a slash character). In this
situation the @code{NLSPATH} environment variable is not used. The
catalog must exist as specified in the program, perhaps relative to the
current working directory. This situation is not desirable and catalog
names should never be written this way. Besides, this behaviour is not
portable to all other platforms providing the @code{catgets} interface.

@cindex LC_ALL environment variable
@cindex LC_MESSAGES environment variable
@cindex LANG environment variable
Otherwise the values of environment variables from the standard
environment are examined (@pxref{Standard Environment}). Which
variables are examined is decided by the @var{flag} parameter of
@code{catopen}. If the value is @code{NL_CAT_LOCALE} (which is defined
in @file{nl_types.h}) then the @code{catopen} function examines the
environment variables @code{LC_ALL}, @code{LC_MESSAGES}, and @code{LANG}
in this order. The first variable which is set in the current
environment will be used.

If @var{flag} is zero only the @code{LANG} environment variable is
examined. This is a left-over from the early days of this function
when the other environment variables were not known.

In any case the environment variable should have a value of the form
@code{@var{lang}[_@var{terr}[.@var{codeset}]]} as explained above. If
no environment variable is set the @code{"C"} locale is used which
prevents any translation.
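
The lookup order just described can be written down as a short C
sketch. It only restates the rules given in the text and is not taken
from the library sources; the names @code{select_catalog_locale} and
@code{nonempty_env} are hypothetical, and treating an empty variable
like an unset one is an assumption of the sketch.

```c
#include <nl_types.h>
#include <stdlib.h>

/* Return the value of the environment variable NAME, or NULL if it is
   unset or empty (treating empty as unset is an assumption).  */
static const char *
nonempty_env (const char *name)
{
  const char *val = getenv (name);
  return (val != NULL && val[0] != '\0') ? val : NULL;
}

/* Pick the locale name used for the NLSPATH substitutions, following
   the order described in the text (hypothetical helper).  */
const char *
select_catalog_locale (int flag)
{
  const char *val = NULL;

  if (flag == NL_CAT_LOCALE)
    {
      /* LC_ALL takes precedence over LC_MESSAGES.  */
      val = nonempty_env ("LC_ALL");
      if (val == NULL)
        val = nonempty_env ("LC_MESSAGES");
    }
  /* LANG is examined last, and it is the only variable used when
     FLAG is zero.  */
  if (val == NULL)
    val = nonempty_env ("LANG");

  /* Without any setting the "C" locale is used.  */
  return val != NULL ? val : "C";
}
```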

The return value of the @code{catgets} function (described below) is in
any case a valid string. Either it is a translation from a message
catalog or it is the same as the @var{string} parameter. So a piece of
code to decide whether a translation actually happened must look like
this:

@smallexample
@{
  char *trans = catgets (desc, set, msg, input_string);
  if (trans == input_string)
    @{
      /* Something went wrong.  */
    @}
@}
@end smallexample

@noindent
When an error occurred the global variable @var{errno} is set to

@table @var
@item EBADF
The catalog does not exist.
@item ENOMSG
The set/message tuple does not name an existing element in the
message catalog.
@end table

While it sometimes can be useful to test for errors programs normally
will avoid any test. If the translation is not available it is no big
problem if the original, untranslated message is printed. Either the
user understands this as well or s/he will look for the reason why the
messages are not translated.
@end deftypefun

Please note that the currently selected locale does not depend on a call
to the @code{setlocale} function. It is not necessary that the locale
data files for this locale exist and that calling @code{setlocale}
succeeds. The @code{catopen} function directly reads the values of the
environment variables.


@deftypefun {char *} catgets (nl_catd @var{catalog_desc}, int @var{set}, int @var{message}, const char *@var{string})
The function @code{catgets} has to be used to access the message catalog
previously opened using the @code{catopen} function. The
@var{catalog_desc} parameter must be a value previously returned by
@code{catopen}.

The next two parameters, @var{set} and @var{message}, reflect the
internal organization of the message catalog files. This will be
explained in detail below. For now it is interesting to know that a
catalog can consist of several sets and the messages in each set are
individually numbered. Neither the set numbers nor the message numbers
need to be consecutive. They can be arbitrarily chosen.
But each message (unless equal to another one) must have its own unique
pair of set and message number.

Since it is not guaranteed that the message catalog for the language
selected by the user exists the last parameter @var{string} helps to
handle this case gracefully. If no matching string can be found
@var{string} is returned. This means for the programmer that

@itemize @bullet
@item
the @var{string} parameters should contain reasonable text (this also
helps to understand the program since otherwise there would be no hint
on the string which is expected to be returned).
@item
all @var{string} arguments should be written in the same language.
@end itemize
@end deftypefun

It is somewhat uncomfortable to write a program using the @code{catgets}
functions if no supporting functionality is available. Since each
set/message number tuple must be unique the programmer must keep lists
of the messages at the same time the code is written. And the work
between several people working on the same project must be coordinated.
In @ref{Common Usage} we will see how these problems can be relaxed
a bit.

@deftypefun int catclose (nl_catd @var{catalog_desc})
The @code{catclose} function can be used to free the resources
associated with a message catalog which previously was opened by a call
to @code{catopen}. If the resources can be successfully freed the
function returns @code{0}. Otherwise it returns @code{@minus{}1} and the
global variable @var{errno} is set. Errors can occur if the catalog
descriptor @var{catalog_desc} is not valid in which case @var{errno} is
set to @code{EBADF}.
@end deftypefun
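
The three functions are typically used together. The following sketch
shows one possible way to combine them with the error checks described
above; the helper name @code{lookup_message} and its buffer-based
interface are hypothetical and only serve the illustration.

```c
#include <nl_types.h>
#include <stdio.h>

/* Copy message SET/MSG from the catalog CATALOG_NAME into BUF.
   Returns 1 if a translation was found, 0 if FALLBACK was used.
   (Hypothetical helper, not part of the catgets interface.)  */
int
lookup_message (const char *catalog_name, int set, int msg,
                const char *fallback, char *buf, size_t bufsz)
{
  int translated = 0;
  const char *text = fallback;
  nl_catd desc = catopen (catalog_name, NL_CAT_LOCALE);

  if (desc != (nl_catd) -1)
    {
      /* catgets returns its last argument if no translation exists.  */
      text = catgets (desc, set, msg, fallback);
      translated = (text != fallback);
    }

  /* Copy before closing, since the returned pointer may refer to data
     belonging to the catalog.  */
  snprintf (buf, bufsz, "%s", text);

  if (desc != (nl_catd) -1)
    catclose (desc);
  return translated;
}
```

Opening and closing the catalog for every message is wasteful; a real
program would normally call @code{catopen} once at startup, as shown in
@ref{Common Usage}.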


@node The message catalog files
@subsection Format of the message catalog files

The only reasonable way to translate all the messages of a program and
store the result in a message catalog file which can be read by the
@code{catopen} function is to give all the message text to the
translator and let her/him translate them all. I.e., we must have a
file with entries which associate the set/message tuple with a specific
translation. This file format is specified in the X/Open standard and
is as follows:

@itemize @bullet
@item
Lines containing only whitespace characters or empty lines are ignored.

@item
Lines which contain as the first non-whitespace character a @code{$}
followed by a whitespace character are comments and are also ignored.

@item
If a line contains as the first non-whitespace characters the sequence
@code{$set} followed by a whitespace character an additional argument
is required to follow. This argument can either be:

@itemize @minus
@item
a number. In this case the value of this number determines the set
to which the following messages are added.

@item
an identifier consisting of alphanumeric characters plus the underscore
character. In this case the set automatically gets a number assigned.
This value is one added to the largest set number which so far appeared.

How to use the symbolic names is explained in section @ref{Common Usage}.

It is an error if a symbol name appears more than once. All following
messages are placed in a set with this number.
@end itemize

@item
If a line contains as the first non-whitespace characters the sequence
@code{$delset} followed by a whitespace character an additional argument
is required to follow. This argument can either be:

@itemize @minus
@item
a number. In this case the value of this number determines the set
which will be deleted.

@item
an identifier consisting of alphanumeric characters plus the underscore
character. This symbolic identifier must match a name for a set which
previously was defined. It is an error if the name is unknown.
@end itemize

In both cases all messages in the specified set will be removed. They
will not appear in the output. But if this set is later selected again
with a @code{$set} command, messages can be added again and these
messages will appear in the output.

@item
If a line contains after leading whitespace the sequence
@code{$quote}, the quoting character used for this input file is
changed to the first non-whitespace character following the
@code{$quote}. If no non-whitespace character is present before the
line ends quoting is disabled.

By default no quoting character is used. In this mode strings are
terminated with the first unescaped line break. If there is a
@code{$quote} sequence present newlines need not be escaped. Instead a
string is terminated with the first unescaped appearance of the quote
character.

A common usage of this feature would be to set the quote character to
@code{"}. Then any appearance of the @code{"} in the strings must
be escaped using the backslash (i.e., @code{\"} must be written).

@item
Any other line must start with a number or an alphanumeric identifier
(with the underscore character included). The following characters
(starting at the first non-whitespace character) will form the string
which gets associated with the currently selected set and the message
number represented by the number and identifier respectively.

If the start of the line is a number the message number is obvious. It
is an error if the same message number already appeared for this set.

If the leading token was an identifier the message number gets
automatically assigned. The value is the current maximum message
number for this set plus one. It is an error if the identifier was
already used for a message in this set. It is ok to reuse the
identifier for a message in another set. How to use the symbolic
identifiers will be explained below (@pxref{Common Usage}). There is
one limitation with the identifier: it must not be @code{Set}. The
reason will be explained below.

Please note that you must use a quoting character if a message contains
leading whitespace. Since one cannot guarantee this never happens it is
probably a good idea to always use quoting.

The text of the messages can contain escape characters. The usual bunch
of characters known from the @w{ISO C} language are recognized
(@code{\n}, @code{\t}, @code{\v}, @code{\b}, @code{\r}, @code{\f},
@code{\\}, and @code{\@var{nnn}}, where @var{nnn} is the octal coding of
a character code).
@end itemize
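
To make the line classification rules above concrete, here is a small C
sketch which decides what kind of line it is looking at. It is not the
parser used by @code{gencat}; the names @code{classify_catalog_line},
@code{is_directive} and the @code{LINE_*} constants are hypothetical,
and the sketch assumes that every directive, including @code{$quote},
is followed by whitespace or the end of the line.

```c
#include <ctype.h>
#include <string.h>

enum catalog_line_kind
{
  LINE_BLANK, LINE_COMMENT, LINE_SET, LINE_DELSET, LINE_QUOTE,
  LINE_MESSAGE, LINE_INVALID
};

/* True if S starts with WORD followed by whitespace or the end of
   the line.  */
static int
is_directive (const char *s, const char *word)
{
  size_t n = strlen (word);
  return strncmp (s, word, n) == 0
         && (s[n] == '\0' || isspace ((unsigned char) s[n]));
}

/* Classify one line of a message catalog source file.  */
enum catalog_line_kind
classify_catalog_line (const char *line)
{
  /* Leading whitespace is never significant for the classification.  */
  while (isspace ((unsigned char) *line))
    ++line;
  if (*line == '\0')
    return LINE_BLANK;
  if (is_directive (line, "$set"))
    return LINE_SET;
  if (is_directive (line, "$delset"))
    return LINE_DELSET;
  if (is_directive (line, "$quote"))
    return LINE_QUOTE;
  /* A lone '$' followed by whitespace starts a comment.  */
  if (is_directive (line, "$"))
    return LINE_COMMENT;
  /* Everything else must begin with a number or an identifier.  */
  if (isalnum ((unsigned char) *line) || *line == '_')
    return LINE_MESSAGE;
  return LINE_INVALID;
}
```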

@strong{Important:} The handling of identifiers instead of numbers for
the set and messages is a GNU extension. Systems strictly following the
X/Open specification do not have this feature. An example for a message
catalog file is this:

@smallexample
$ This is a leading comment.
$quote "

$set SetOne
1 Message with ID 1.
two " Message with ID \"two\", which gets the value 2 assigned"

$set SetTwo
$ Since the last set got the number 1 assigned this set has number 2.
4000 "The numbers can be arbitrary, they need not start at one."
@end smallexample

This small example shows various aspects:
@itemize @bullet
@item
Lines 1 and 9 are comments since they start with @code{$} followed by
a whitespace.
@item
The quoting character is set to @code{"}. Otherwise the quotes in the
message definition would have to be omitted and in this case the
message with the identifier @code{two} would lose its leading whitespace.
@item
Mixing numbered messages with messages having symbolic names is no
problem and the numbering happens automatically.
@end itemize


While this file format is pretty easy it is not the best possible for
use in a running program. The @code{catopen} function would have to
parse the file and handle syntactic errors gracefully. This is not so
easy and the whole process is pretty slow. Therefore the @code{catgets}
functions expect the data in another more compact and ready-to-use file
format. There is a special program @code{gencat} which generates files
in this format; it is explained in detail in the next section.

Files in this other format are not human readable. To be easy to use by
programs it is a binary file. But the format is byte order independent
so translation files can be shared by systems of arbitrary architecture
(as long as they use the GNU C Library).

Details about the binary file format are not important to know since
these files are always created by the @code{gencat} program. The
sources of the GNU C Library also provide the sources for the
@code{gencat} program and so the interested reader can look through
these source files to learn about the file format.


@node The gencat program
@subsection Generate Message Catalogs files

@cindex gencat
The @code{gencat} program is specified in the X/Open standard and the
GNU implementation follows this specification and so can process
all correctly formed input files. Additionally some extensions are
implemented which help to work in a more reasonable way with the
@code{catgets} functions.

The @code{gencat} program can be invoked in two ways:

@example
`gencat [@var{Option}]@dots{} [@var{Output-File} [@var{Input-File}]@dots{}]`
@end example

This is the interface defined in the X/Open standard. If no
@var{Input-File} parameter is given input will be read from standard
input. Multiple input files will be read as if they are concatenated.
If @var{Output-File} is also missing, the output will be written to
standard output. To provide the interface one is used to from other
programs a second interface is provided.

@smallexample
`gencat [@var{Option}]@dots{} -o @var{Output-File} [@var{Input-File}]@dots{}`
@end smallexample

The option @samp{-o} is used to specify the output file and all file
arguments are used as input files.

Besides this one can use @file{-} or @file{/dev/stdin} for
@var{Input-File} to denote the standard input. Correspondingly one can
use @file{-} and @file{/dev/stdout} for @var{Output-File} to denote
standard output. Using @file{-} as a file name is allowed in X/Open
while using the device names is a GNU extension.

The @code{gencat} program works by concatenating all input files and
then @strong{merging} the resulting collection of message sets with a
possibly existing output file. This is done by removing all messages
with set/message number tuples matching any of the generated messages
from the output file and then adding all the new messages. To
regenerate a catalog file while ignoring the old contents it is
therefore necessary to remove the output file if it exists. If the
output is written to standard output no merging takes place.

@noindent
The following table shows the options understood by the @code{gencat}
program. The X/Open standard does not specify any option for the
program so all of these are GNU extensions.

@table @samp
@item -V
@itemx --version
Print the version information and exit.
@item -h
@itemx --help
Print a usage message listing all available options, then exit successfully.
@item --new
Never merge the new messages from the input files with the old content
of the output file. The old content of the output file is discarded.
@item -H
@itemx --header=name
This option is used to emit the symbolic names given to sets and
messages in the input files for use in the program. Details about how
to use this are given in the next section. The @var{name} parameter to
this option specifies the name of the output file. It will contain a
number of C preprocessor @code{#define}s to associate a name with a
number.

Please note that the generated file only contains the symbols from the
input files. If the output is merged with the previous content of the
output file the possibly existing symbols from the file(s) which
generated the old output files are not in the generated header file.
@end table


@node Common Usage
@subsection How to use the @code{catgets} interface

The @code{catgets} functions can be used in two different ways: by
following slavishly the X/Open specs and not relying on the extensions,
or by using the GNU extensions. We will take a look at the former
method first to understand the benefits of extensions.

@subsubsection Not using symbolic names

Since the X/Open format of the message catalog files does not allow
symbol names we have to work with numbers all the time. When we start
writing a program we have to replace all appearances of translatable
strings with something like

@smallexample
catgets (catdesc, set, msg, "string")
@end smallexample

@noindent
@var{catdesc} is obtained from a call to @code{catopen} which is
normally done once at the program start. The @code{"string"} is the
string we want to translate. The problems start with the set and
message numbers.

In a bigger program several programmers usually work at the same time on
the program and so coordinating the number allocation is crucial.
Though no two different strings must be indexed by the same tuple of
numbers it is highly desirable to reuse the numbers for equal strings
with equal translations (please note that there might be strings which
are equal in one language but have different translations due to
different contexts).

The allocation process can be relaxed a bit by using different set
numbers for different parts of the program. So the number of developers
who have to coordinate the allocation can be reduced. But still lists
must be kept to track the allocation and errors can easily happen. These
errors cannot be discovered by the compiler or the @code{catgets}
functions. Only the user of the program might see wrong messages
printed. In the worst cases the messages are so irritating that they
cannot be recognized as wrong. Think about the translations for
@code{"true"} and @code{"false"} being exchanged. This could result in
a disaster.


@subsubsection Using symbolic names

The problems mentioned in the last section derive from the fact that:

@enumerate
@item
the numbers are allocated once and due to the possibly frequent use of
them it is difficult to change a number later.
@item
the numbers do not allow one to guess anything about the string and
therefore collisions can easily happen.
@end enumerate

By constantly using symbolic names and by providing a method which maps
the string content to a symbolic name (however this will happen) one can
prevent both problems above. The cost of this is that the programmer
has to write a complete message catalog file while s/he is writing the
program itself.

This is necessary since the symbolic names must be mapped to numbers
before the program sources can be compiled. In the last section it was
described how to generate a header containing the mapping of the names.
E.g., for the example message file given in the last section we could
call the @code{gencat} program as follows (assume @file{ex.msg} contains
the sources).

@smallexample
gencat -H ex.h -o ex.cat ex.msg
@end smallexample

@noindent
This generates a header file with the following content:

@smallexample
#define SetTwoSet 0x2 /* ex.msg:8 */

#define SetOneSet 0x1 /* ex.msg:4 */
#define SetOnetwo 0x2 /* ex.msg:6 */
@end smallexample

As can be seen the various symbols given in the source file are mangled
to generate unique identifiers and these identifiers get numbers
assigned. Reading the source file and knowing about the rules will
allow one to predict the content of the header file (it is
deterministic) but this is not necessary. The @code{gencat} program can
take care of everything. All the programmer has to do is to put the
generated header file in the dependency list of the source files of
her/his project and to add a rule to regenerate the header if any of
the input files change.

One word about the symbol mangling. Every symbol consists of two parts:
the name of the message set plus the name of the message or the special
string @code{Set}. So @code{SetOnetwo} means this macro can be used to
access the translation with identifier @code{two} in the message set
@code{SetOne}.

The other names denote the names of the message sets. The special
string @code{Set} is used in the place of the message identifier.
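
The mangling rule is simple enough to be written down directly. The
following C sketch reproduces it as described in the text; it is not
taken from the @code{gencat} sources, and the function name
@code{mangle_catalog_symbol} is hypothetical.

```c
#include <stdio.h>

/* Build the macro name for message MSG_NAME in set SET_NAME.  Pass
   NULL as MSG_NAME to get the macro naming the set itself.  Returns 0
   on success, -1 if the result does not fit into OUT.  */
int
mangle_catalog_symbol (const char *set_name, const char *msg_name,
                       char *out, size_t outsz)
{
  /* The set name comes first, followed by the message name or by the
     special string "Set".  */
  int n = snprintf (out, outsz, "%s%s", set_name,
                    msg_name != NULL ? msg_name : "Set");
  return (n >= 0 && (size_t) n < outsz) ? 0 : -1;
}
```

For the set @code{SetOne} this yields @code{SetOneSet} for the set
itself and @code{SetOnetwo} for the message @code{two}. It also shows
why a message must not be named @code{Set}: its macro would collide
with the one for the set.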

If in the code the second string of the set @code{SetOne} is used the C
code should look like this:

@smallexample
catgets (catdesc, SetOneSet, SetOnetwo,
         " Message with ID \"two\", which gets the value 2 assigned")
@end smallexample

Writing the function this way will allow the message number and even
the set number to be changed without requiring any change in the C
source code. (The text of the string is normally not the same; this is
only for this example.)


@subsubsection How does this allow to develop

To illustrate the usual way to work with the symbolic names
here is a little example. Assume we want to write the very complex and
famous greeting program. We start by writing the code as usual:

@smallexample
#include <stdio.h>
int
main (void)
@{
  printf ("Hello, world!\n");
  return 0;
@}
@end smallexample

Now we want to internationalize the message and therefore replace the
message with whatever the user wants.

@smallexample
#include <nl_types.h>
#include <stdio.h>
#include "msgnrs.h"
int
main (void)
@{
  nl_catd catdesc = catopen ("hello.cat", NL_CAT_LOCALE);
  printf (catgets (catdesc, SetMainSet, SetMainHello,
                   "Hello, world!\n"));
  catclose (catdesc);
  return 0;
@}
@end smallexample

We see how the catalog object is opened and the returned descriptor used
in the other function calls. It is not really necessary to check for
failure of any of the functions since even in these situations the
functions will behave reasonably. They will simply return the
untranslated string.

What remains unspecified here are the constants @code{SetMainSet} and
@code{SetMainHello}. These are the symbolic names describing the
message. To get the actual definitions which match the information in
the catalog file we have to create the message catalog source file and
process it using the @code{gencat} program.

@smallexample
$ Messages for the famous greeting program.
$quote "

$set SetMain
Hello "Hallo, Welt!\n"
@end smallexample

Now we can start building the program (assume the message catalog source
file is named @file{hello.msg} and the program source file @file{hello.c}):

@smallexample
@cartouche
% gencat -H msgnrs.h -o hello.cat hello.msg
% cat msgnrs.h
#define SetMainSet 0x1 /* hello.msg:4 */
#define SetMainHello 0x1 /* hello.msg:5 */
% gcc -o hello hello.c -I.
% cp hello.cat /usr/share/locale/de/LC_MESSAGES
% echo $LC_ALL
de
% ./hello
Hallo, Welt!
%
@end cartouche
@end smallexample

The call of the @code{gencat} program creates the missing header file
@file{msgnrs.h} as well as the message catalog binary. The former is
used in the compilation of @file{hello.c} while the latter is placed in a
directory in which the @code{catopen} function will try to locate it.
Please check the @code{LC_ALL} environment variable and the default path
for @code{catopen} presented in the description above.
728
729
730@node The Uniforum approach
731@section The Uniforum approach to Message Translation
732
Sun Microsystems tried to standardize a different approach to message
translation in the Uniforum group.  There never was a real standard
defined but still the interface was used in Sun's operating systems.
Since this approach fits better in the development process of free
software it is also used throughout the GNU project and the GNU
@file{gettext} package provides support for this outside the GNU C
Library.
740
741The code of the @file{libintl} from GNU @file{gettext} is the same as
742the code in the GNU C Library. So the documentation in the GNU
743@file{gettext} manual is also valid for the functionality here. The
744following text will describe the library functions in detail. But the
745numerous helper programs are not described in this manual. Instead
746people should read the GNU @file{gettext} manual
747(@pxref{Top,,GNU gettext utilities,gettext,Native Language Support Library and Tools}).
748We will only give a short overview.
749
750Though the @code{catgets} functions are available by default on more
751systems the @code{gettext} interface is at least as portable as the
752former. The GNU @file{gettext} package can be used wherever the
753functions are not available.
754
755
756@menu
757* Message catalogs with gettext:: The @code{gettext} family of functions.
758* Helper programs for gettext:: Programs to handle message catalogs
759 for @code{gettext}.
760@end menu
761
762
763@node Message catalogs with gettext
764@subsection The @code{gettext} family of functions
765
The paradigm underlying the @code{gettext} approach to message
translation is different from that of the @code{catgets} functions, but
the basic functionality is equivalent.  There are functions of the
following categories:
770
771@menu
772* Translation with gettext:: What has to be done to translate a message.
773* Locating gettext catalog:: How to determine which catalog to be used.
774* Using gettextized software:: The possibilities of the user to influence
775 the way @code{gettext} works.
776@end menu
777
778@node Translation with gettext
779@subsubsection What has to be done to translate a message?
780
781The @code{gettext} functions have a very simple interface. The most
782basic function just takes the string which shall be translated as the
783argument and it returns the translation. This is fundamentally
784different from the @code{catgets} approach where an extra key is
785necessary and the original string is only used for the error case.
786
If the string which has to be translated is the only argument this of
course means the string itself is the key.  I.e., the translation will
be selected based on the original string.  The message catalogs must
therefore contain the original strings plus one translation for any such
string.  The task of the @code{gettext} function is to compare the
argument string with the available strings in the catalog and return the
appropriate translation.  Of course this process is optimized so that
it is not more expensive than an access using an atomic key as in
@code{catgets}.
796
797The @code{gettext} approach has some advantages but also some
798disadvantages. Please see the GNU @file{gettext} manual for a detailed
799discussion of the pros and cons.
800
801All the definitions and declarations for @code{gettext} can be found in
802the @file{libintl.h} header file. On systems where these functions are
803not part of the C library they can be found in a separate library named
804@file{libintl.a} (or accordingly different for shared libraries).
805
806@deftypefun {char *} gettext (const char *@var{msgid})
807The @code{gettext} function searches the currently selected message
808catalogs for a string which is equal to @var{msgid}. If there is such a
809string available it is returned. Otherwise the argument string
810@var{msgid} is returned.
811
Please note that although the return value is @code{char *} the
returned string must not be changed.  This broken type results from the
history of the function and does not reflect the way the function should
be used.

Please note that above we wrote ``message catalogs'' (plural).  This is
a specialty of the GNU implementation of these functions and we will
say more about this when we talk about the ways message catalogs are
selected (@pxref{Locating gettext catalog}).
821
822The @code{gettext} function does not modify the value of the global
823@var{errno} variable. This is necessary to make it possible to write
824something like
825
826@smallexample
827 printf (gettext ("Operation failed: %m\n"));
828@end smallexample
829
Here the @var{errno} value is used in the @code{printf} function while
processing the @code{%m} format element.  If the @code{gettext}
function changed this value (it is called before @code{printf} is
called) we would get a wrong message.
834
So there is no easy way to detect a missing message catalog besides
comparing the argument string with the result.  But it is normally the
task of the user to react to missing catalogs.  The program cannot guess
when a message catalog is really necessary since a user who speaks
the language the program was developed in does not need any translation.
840@end deftypefun
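
The two properties described above can be sketched in a few lines.  This
is only an illustration, not code from the library: the helper names
@code{translate} and @code{report_failure} are made up, and the second
function uses @code{%s} with @code{strerror} instead of the GNU-specific
@code{%m} so that it does not depend on a @code{printf} extension.

```c
#include <errno.h>
#include <libintl.h>
#include <stdio.h>
#include <string.h>

/* Translate MSGID and report whether a translation was found.
   gettext returns its argument unchanged when no catalog entry
   matches, so a pointer comparison is enough.  */
static const char *
translate (const char *msgid, int *translated)
{
  const char *result = gettext (msgid);
  if (translated != NULL)
    *translated = (result != msgid);
  return result;
}

/* Print a translated error message for the errno value that was
   current on entry.  gettext leaves errno alone, so strerror still
   sees the original error code.  */
static void
report_failure (void)
{
  const char *format = gettext ("Operation failed: %s\n");
  fprintf (stderr, format, strerror (errno));
}
```

Whether the program should act on an untranslated message at all is a
separate question; as said above, this is normally left to the user.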
841
842The remaining two functions to access the message catalog add some
843functionality to select a message catalog which is not the default one.
844This is important if parts of the program are developed independently.
845Every part can have its own message catalog and all of them can be used
846at the same time. The C library itself is an example: internally it
847uses the @code{gettext} functions but since it must not depend on a
848currently selected default message catalog it must specify all ambiguous
849information.
850
851@deftypefun {char *} dgettext (const char *@var{domainname}, const char *@var{msgid})
The @code{dgettext} function acts just like the @code{gettext}
function.  It only takes an additional first argument @var{domainname}
which guides the selection of the message catalogs which are searched
for the translation.  If the @var{domainname} parameter is the null
pointer the @code{dgettext} function is exactly equivalent to
@code{gettext} since the default value for the domain name is used.
858
As for @code{gettext} the return value type is @code{char *} which is an
anachronism.  The returned string must never be modified.
861@end deftypefun
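
As a minimal sketch of how an independently developed program part
might use @code{dgettext}: the domain name @code{libfoo} and both helper
names below are hypothetical and only serve as an illustration.

```c
#include <libintl.h>
#include <stddef.h>

/* Hypothetical domain name used by an independently developed
   library part; it is not defined anywhere in this manual.  */
#define LIBFOO_DOMAIN "libfoo"

/* Look up MSGID in the library's own domain, regardless of the
   default domain the main program selected.  */
static const char *
libfoo_message (const char *msgid)
{
  return dgettext (LIBFOO_DOMAIN, msgid);
}

/* With a null domain name dgettext uses the default domain, i.e.,
   it behaves like gettext.  */
static const char *
default_message (const char *msgid)
{
  return dgettext (NULL, msgid);
}
```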
862
863@deftypefun {char *} dcgettext (const char *@var{domainname}, const char *@var{msgid}, int @var{category})
The @code{dcgettext} function adds another argument to those which
@code{dgettext} takes.  This argument @var{category} specifies the last
piece of information needed to locate the message catalog.  I.e., the
domain name and the locale category exactly specify which message
catalog has to be used (relative to a given directory, see below).
869
870The @code{dgettext} function can be expressed in terms of
871@code{dcgettext} by using
872
873@smallexample
874dcgettext (domain, string, LC_MESSAGES)
875@end smallexample
876
877@noindent
878instead of
879
880@smallexample
881dgettext (domain, string)
882@end smallexample
883
This also shows which values are expected for the third parameter.  One
has to use the available selectors for the categories available in
@file{locale.h}.  Normally the available values are @code{LC_CTYPE},
@code{LC_COLLATE}, @code{LC_MESSAGES}, @code{LC_MONETARY},
@code{LC_NUMERIC}, and @code{LC_TIME}.  Please note that @code{LC_ALL}
must not be used and even though the names might suggest this, there is
no relation to the environment variables of the same names.

The @code{dcgettext} function is only implemented for compatibility with
other systems which have @code{gettext} functions.  There is not really
any situation where it is necessary (or useful) to use a value other
than @code{LC_MESSAGES} for the @var{category} parameter.  We are
dealing with messages here and any other choice can only be confusing.
897
As for @code{gettext} the return value type is @code{char *} which is an
anachronism.  The returned string must never be modified.
900@end deftypefun
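
The equivalence stated above can be written as a tiny wrapper.  The
function name @code{message_in_domain} is made up for this sketch.

```c
#include <libintl.h>
#include <locale.h>

/* Same lookup as dgettext (domain, msgid), spelled with the explicit
   category argument.  LC_MESSAGES is the only category the text
   above recommends.  */
static const char *
message_in_domain (const char *domain, const char *msgid)
{
  return dcgettext (domain, msgid, LC_MESSAGES);
}
```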
901
When using the three functions above in a program it is a frequent case
that the @var{msgid} argument is a constant string.  So it is worthwhile
to optimize this case.  Thinking about this briefly one will realize
that as long as no new message catalog is loaded the translation of a
message will not change.  I.e., the algorithm to determine the
translation is deterministic.

This is exactly what the optimizations implemented in the
@file{libintl.h} header will use.  Whenever a program is compiled with
the GNU C compiler, optimization is selected, and the @var{msgid}
argument to @code{gettext}, @code{dgettext}, or @code{dcgettext} is a
constant string, the actual function call will only be done the first
time the message is used and after that only if a new message catalog
was loaded, so that the result of the translation lookup might be
different.  See the @file{libintl.h} header file for details.  For the
user it is only important to know that the result is always the same,
independent of the compiler or compiler options in use.
919
920
921@node Locating gettext catalog
922@subsubsection How to determine which catalog to be used
923
The functions to retrieve the translations for a given message have a
remarkably simple interface.  But to still give the user of the program
the opportunity to select exactly the translation s/he wants, and to
give the programmer the possibility to influence the way the catalog
files are located, there is a quite complicated underlying mechanism
which controls all this.  The code is complicated, the use is easy.
931
932Basically we have two different tasks to perform which can also be
933performed by the @code{catgets} functions:
934
935@enumerate
936@item
Locate the set of message catalogs.  There are a number of files for
different languages which all belong to the package.  Usually they
are all stored in the filesystem below a certain directory.

There can be arbitrarily many packages installed and they can follow
different guidelines for the placement of their files.
943
944@item
945Relative to the location specified by the package the actual translation
946files must be searched, based on the wishes of the user. I.e., for each
947language the user selects the program should be able to locate the
948appropriate file.
949@end enumerate
950
951This is the functionality required by the specifications for
952@code{gettext} and this is also what the @code{catgets} functions are
953able to do. But there are some problems unresolved:
954
955@itemize @bullet
956@item
The language to be used can be specified in several different ways.
There is no generally accepted standard for this and the user always
expects the program to understand what s/he means.  E.g., to select the
German translation one could write @code{de}, @code{german}, or
@code{deutsch} and the program should always react the same.

@item
Sometimes the specification of the user is too detailed.  If s/he, e.g.,
specifies @code{de_DE.ISO-8859-1} which means German, spoken in Germany,
coded using the @w{ISO 8859-1} character set there is the possibility
that a message catalog matching this exactly is not available.  But
there could be a catalog matching @code{de} and if the character set
used on the machine is always @w{ISO 8859-1} there is no reason why this
latter message catalog should not be used.  (We call this @dfn{message
inheritance}.)

@item
If a catalog for a wanted language is not available it is not always the
second best choice to fall back on the language of the developer and
simply not translate any message.  Instead a user might be better able
to read the messages in another language and so the user of the program
should be able to define a precedence order of languages.
979@end itemize
980
We can divide the configuration actions into two parts: one is
performed by the programmer, the other by the user.  We will start with
the functions the programmer can use since the user configuration will
be based on this.

As already mentioned in the description of the functions in the last
section, separate sets of messages can be selected by a @dfn{domain
name}.  This is a simple string which should be unique for each program
part which uses a separate domain.  It is possible to use arbitrarily
many domains in one program at the same time.  E.g., the GNU C Library
itself uses a domain named @code{libc} while the program using the C
Library could use a domain named @code{foo}.  The important point is
that at any time exactly one domain is active.  This is controlled with
the following function.
995
996@deftypefun {char *} textdomain (const char *@var{domainname})
997The @code{textdomain} function sets the default domain, which is used in
998all future @code{gettext} calls, to @var{domainname}. Please note that
999@code{dgettext} and @code{dcgettext} calls are not influenced if the
1000@var{domainname} parameter of these functions is not the null pointer.
1001
Before the first call to @code{textdomain} the default domain is
@code{messages}.  This is the name specified in the specification of
the @code{gettext} API.  This name is as good as any other name.  No
program should ever really use a domain with this name since this can
only lead to problems.

The function returns the value which is from now on taken as the default
domain.  If the system ran out of memory the returned value is
@code{NULL} and the global variable @var{errno} is set to @code{ENOMEM}.
Despite the return value type being @code{char *} the returned string
must not be changed.  It is allocated internally by the
@code{textdomain} function.
1014
1015If the @var{domainname} parameter is the null pointer no new default
1016domain is set. Instead the currently selected default domain is
1017returned.
1018
If the @var{domainname} parameter is the empty string the default domain
is reset to its initial value, the domain with the name @code{messages}.
Using this possibility is questionable since the domain @code{messages}
should never really be used.
1023@end deftypefun
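
The following sketch shows the setting and querying forms of the call
side by side.  The domain name @code{test-package} and the helper names
are made up for the illustration.

```c
#include <libintl.h>
#include <stdio.h>

/* Select "test-package" (a made-up name) as the default domain and
   return the name now in effect, or NULL if memory ran out.  */
static const char *
select_domain (void)
{
  const char *current = textdomain ("test-package");
  if (current == NULL)
    perror ("textdomain");
  return current;
}

/* Query the default domain without changing it.  */
static const char *
current_domain (void)
{
  return textdomain (NULL);
}
```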
1024
1025@deftypefun {char *} bindtextdomain (const char *@var{domainname}, const char *@var{dirname})
The @code{bindtextdomain} function can be used to specify the directory
which contains the message catalogs for domain @var{domainname} for the
different languages.  To be correct, this is the directory where the
hierarchy of directories is expected.  Details are explained below.

For the programmer it is important to note that the translations which
come with the program have to be placed in a directory hierarchy starting
at, say, @file{/foo/bar}.  Then the program should make a
@code{bindtextdomain} call to bind the domain for the current program to
this directory.  This makes sure the catalogs are found.  A correctly
running program does not depend on the user setting an environment
variable.

The @code{bindtextdomain} function can be used several times and if the
@var{domainname} argument is different the previously bound domains
will not be overwritten.
1042
If the program which wishes to use @code{bindtextdomain} at some point
uses the @code{chdir} function to change the current working directory
it is important that the @var{dirname} string is an absolute pathname.
Otherwise the addressed directory might vary over time.

If the @var{dirname} parameter is the null pointer @code{bindtextdomain}
returns the currently selected directory for the domain with the name
@var{domainname}.

The @code{bindtextdomain} function returns a pointer to a string
containing the name of the selected directory.  The string is
allocated internally in the function and must not be changed by the
user.  If the system ran out of memory during the execution of
@code{bindtextdomain} the return value is @code{NULL} and the global
variable @var{errno} is set accordingly.
1059@end deftypefun
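
A minimal sketch of binding and querying a catalog directory follows.
The helper names are hypothetical, and the caller is assumed to pass an
absolute directory name as recommended above.

```c
#include <libintl.h>
#include <stdio.h>

/* Bind DOMAIN to the absolute directory DIR and return the directory
   now registered for it, or NULL if the binding failed.  */
static const char *
bind_catalog_dir (const char *domain, const char *dir)
{
  const char *bound = bindtextdomain (domain, dir);
  if (bound == NULL)
    perror ("bindtextdomain");
  return bound;
}

/* Ask which directory is currently registered for DOMAIN.  */
static const char *
catalog_dir (const char *domain)
{
  return bindtextdomain (domain, NULL);
}
```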
1060
1061
1062@node Using gettextized software
1063@subsubsection User influence on @code{gettext}
1064
1065The last sections described what the programmer can do to
1066internationalize the messages of the program. But it is finally up to
1067the user to select the message s/he wants to see. S/He must understand
1068them.
1069
The POSIX locale model uses the environment variables @code{LC_COLLATE},
@code{LC_CTYPE}, @code{LC_MESSAGES}, @code{LC_MONETARY},
@code{LC_NUMERIC}, and @code{LC_TIME} to select the locale which is to
be used.  This way the user can influence lots of functions.  As we
mentioned above the @code{gettext} functions also take advantage of
this.
1075
1076To understand how this happens it is necessary to take a look at the
1077various components of the filename which gets computed to locate a
1078message catalog. It is composed as follows:
1079
1080@smallexample
1081@var{dir_name}/@var{locale}/LC_@var{category}/@var{domain_name}.mo
1082@end smallexample
1083
1084The default value for @var{dir_name} is system specific. It is computed
1085from the value given as the prefix while configuring the C library.
1086This value normally is @file{/usr} or @file{/}. For the former the
1087complete @var{dir_name} is:
1088
1089@smallexample
1090/usr/share/locale
1091@end smallexample
1092
We can use @file{/usr/share} since the @file{.mo} files containing the
message catalogs are system independent, so all systems can use the same
files.  If the program executed the @code{bindtextdomain} function for
the message domain that is currently handled, the @var{dir_name}
component is exactly the value which was given to the function as
the second parameter.  I.e., @code{bindtextdomain} allows overriding
the only system dependent and fixed value to make it possible to
address files anywhere in the filesystem.
1101
1102The @var{category} is the name of the locale category which was selected
1103in the program code. For @code{gettext} and @code{dgettext} this is
1104always @code{LC_MESSAGES}, for @code{dcgettext} this is selected by the
1105value of the third parameter. As said above it should be avoided to
1106ever use a category other than @code{LC_MESSAGES}.
1107
The @var{locale} component is computed based on the category used.  Just
like for the @code{setlocale} function, this is where the user selection
comes into play.  Some environment variables are examined in a fixed
order and the first environment variable set determines the return value
of the lookup process.  In detail, for the category @code{LC_xxx} the
following variables are examined in this order:
1114
1115@table @code
1116@item LANGUAGE
1117@item LC_ALL
1118@item LC_xxx
1119@item LANG
1120@end table
1121
This looks very familiar.  With the exception of the @code{LANGUAGE}
environment variable this is exactly the lookup order the
@code{setlocale} function uses.  But why introduce the @code{LANGUAGE}
variable?

The reason is that the syntax of the values these variables can have is
different from what is expected by the @code{setlocale} function.  If we
set @code{LC_ALL} to a value following the extended syntax, the
@code{setlocale} function would no longer be able to use the value of
this variable.  An additional variable removes this problem and in
addition we can select the language independently of the locale
setting, which sometimes is useful.

While for the @code{LC_xxx} variables the value should consist of
exactly one specification of a locale the @code{LANGUAGE} variable's
value can consist of a colon separated list of locale names.  The
attentive reader will realize that this is the way we manage to
implement one of our additional demands above: we want to be able to
specify an ordered list of languages.
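
The lookup order and the colon separated list can be sketched as
follows.  This is a simplified illustration and not the code used by
the library, which applies additional rules.  The function names are
made up, and treating an empty value like an unset variable is an
assumption of this sketch.

```c
#include <stdlib.h>
#include <string.h>

/* Return the value of the first variable that is set and non-empty,
   checking LANGUAGE, LC_ALL, CATEGORY_NAME (e.g. "LC_MESSAGES") and
   LANG in that order.  Returns NULL if none is set.  */
static const char *
first_locale_setting (const char *category_name)
{
  const char *names[] = { "LANGUAGE", "LC_ALL", category_name, "LANG" };
  for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i)
    {
      const char *value = getenv (names[i]);
      if (value != NULL && value[0] != '\0')
        return value;
    }
  return NULL;
}

/* Copy the INDEXth entry (counting from 0) of the colon separated
   LIST into BUF.  Returns 1 on success, 0 if there is no such entry
   or it does not fit into SIZE bytes.  */
static int
language_entry (const char *list, size_t index, char *buf, size_t size)
{
  while (index > 0)
    {
      list = strchr (list, ':');
      if (list == NULL)
        return 0;
      ++list;
      --index;
    }
  size_t len = strcspn (list, ":");
  if (len == 0 || len >= size)
    return 0;
  memcpy (buf, list, len);
  buf[len] = '\0';
  return 1;
}
```

A caller would walk the entries of the list in order and use the first
one for which a catalog can be found.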
1141
1142Back to the constructed filename we have only one component missing.
1143The @var{domain_name} part is the name which was either registered using
1144the @code{textdomain} function or which was given to @code{dgettext} or
1145@code{dcgettext} as the first parameter. Now it becomes obvious that a
1146good choice for the domain name in the program code is a string which is
1147closely related to the program/package name. E.g., for the GNU C
1148Library the domain name is @code{libc}.
1149
@noindent
A little piece of example code should show how the programmer is supposed
to work:

@smallexample
@{
  textdomain ("test-package");
  bindtextdomain ("test-package", "/usr/local/share/locale");
  puts (gettext ("Hello, world!"));
@}
@end smallexample
1161
1162At the program start the default domain is @code{messages}. The
1163@code{textdomain} call changes this to @code{test-package}. The
1164@code{bindtextdomain} call specifies that the message catalogs for the
1165domain @code{test-package} can be found below the directory
1166@file{/usr/local/share/locale}.
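
For completeness, here is the same sequence as a compilable function.
This is a sketch under assumptions: the function name @code{greeting}
is made up, and the @code{setlocale} call is added on the assumption
that the program wants to adopt the locale from the user's environment,
which the fragment above leaves out.

```c
#include <libintl.h>
#include <locale.h>
#include <stdio.h>

/* Set up message translation for the made-up package "test-package"
   and return the (possibly translated) greeting.  */
static const char *
greeting (void)
{
  /* Adopt the user's locale from the environment; without this call
     the program stays in the "C" locale.  */
  setlocale (LC_ALL, "");
  bindtextdomain ("test-package", "/usr/local/share/locale");
  textdomain ("test-package");
  return gettext ("Hello, world!");
}
```

A @code{main} function would simply call @code{puts (greeting ())}.  If
no catalog is installed the original English string is printed.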
1167
If the user now sets the variable @code{LANGUAGE} in her/his environment
to @code{de} the @code{gettext} function will try to use the
translations from the file
1171
1172@smallexample
1173/usr/local/share/locale/de/LC_MESSAGES/test-package.mo
1174@end smallexample
1175
From the above descriptions it should be clear which component of this
filename is determined by which source.
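
The composition of the filename can be sketched with a single
@code{snprintf} call.  The function name @code{catalog_path} and its
parameters are hypothetical; the library computes the name internally
and does not export such a function.

```c
#include <stdio.h>

/* Compose DIR_NAME/LOCALE/CATEGORY/DOMAIN.mo into BUF.  CATEGORY is
   the full category name, e.g. "LC_MESSAGES".  Returns 1 on success
   and 0 if the result did not fit into SIZE bytes.  */
static int
catalog_path (char *buf, size_t size, const char *dir_name,
              const char *locale, const char *category,
              const char *domain)
{
  int n = snprintf (buf, size, "%s/%s/%s/%s.mo",
                    dir_name, locale, category, domain);
  return n >= 0 && (size_t) n < size;
}
```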
1178
In the above example we assumed that the @code{LANGUAGE} environment
variable is set to @code{de}.  This might be an appropriate selection
but what happens if the user wants to use @code{LC_ALL} because of the
wider usability and here the required value is @code{de_DE.ISO-8859-1}?
We already mentioned above that a situation like this is not infrequent.
E.g., a person might prefer reading a dialect and if this is not
available fall back on the standard language.

The @code{gettext} functions know about situations like this and can
handle them gracefully.  The functions recognize the format of the value
of the environment variable.  They can split the value into different
pieces and by leaving out one or the other part they can construct new
values.  This happens of course in a predictable way.  To understand
this one must know the format of the environment variable value.  There
are two more or less standardized forms:
1194
1195@table @emph
1196@item X/Open Format
1197@code{language[_territory[.codeset]][@@modifier]}
1198
1199@item CEN Format (European Community Standard)
1200@code{language[_territory][+audience][+special][,[sponsor][_revision]]}
1201@end table
1202
The functions will automatically recognize which format is used.  Less
specific locale names are constructed by stripping off components in
the order of the following list:

1207@enumerate
1208@item
1209@code{revision}
1210@item
1211@code{sponsor}
1212@item
1213@code{special}
1214@item
1215@code{codeset}
1216@item
1217@code{normalized codeset}
1218@item
1219@code{territory}
1220@item
1221@code{audience}/@code{modifier}
1222@end enumerate
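
To make the stripping idea concrete, here is a much simplified sketch
for names in the X/Open format only.  It is not the algorithm used by
the library: it ignores the CEN fields and the normalized codeset, and
the function name @code{strip_component} is made up.

```c
#include <string.h>

/* Remove the least important remaining component from the X/Open
   style locale NAME in place.  The order is a simplification of the
   list above: codeset first, then territory, then modifier.
   Returns 1 if something was removed, 0 if only the language is
   left.  */
static int
strip_component (char *name)
{
  char *modifier = strchr (name, '@');
  char *end = modifier != NULL ? modifier : name + strlen (name);
  char *p;

  /* ".codeset" runs from the dot up to the modifier or the end.  */
  p = memchr (name, '.', (size_t) (end - name));
  if (p == NULL)
    /* "_territory" likewise.  */
    p = memchr (name, '_', (size_t) (end - name));
  if (p != NULL)
    {
      /* Close the gap, keeping a trailing "@modifier" if present.  */
      memmove (p, end, strlen (end) + 1);
      return 1;
    }
  if (modifier != NULL)
    {
      *modifier = '\0';
      return 1;
    }
  return 0;
}
```

Calling the function repeatedly on a writable copy of
@code{de_DE.ISO-8859-1} yields @code{de_DE} and then @code{de}, after
which nothing more is removed.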
1223
From the last entry one can see that the @code{modifier} field in the
X/Open format and the @code{audience} field in the CEN format have the
same meaning.  Besides, one can see that the @code{language} field for
obvious reasons will never be dropped.

The only new thing is the @code{normalized codeset} entry.  This is
another convenience which is introduced to help reduce the chaos which
derives from the inability of people to standardize the names of
character sets.  Instead of @w{ISO-8859-1} one can often see @w{8859-1},
@w{88591}, @w{iso8859-1}, or @w{iso_8859-1}.  The @code{normalized
codeset} value is generated from the user-provided character set name by
applying the following rules:
1236
1237@enumerate
@item
Remove all characters besides numbers and letters.
@item
Fold letters to lowercase.
@item
If the name only contains digits prepend the string @code{"iso"}.
1244@end enumerate
1245
1246@noindent
So all of the above names will be normalized to @code{iso88591}.  This
allows the program user much more freedom in choosing the locale name.
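
The three rules translate almost directly into code.  The following is
a sketch of the rules as stated, not the library's own implementation;
the function name @code{normalize_codeset} and its buffer interface are
made up.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Write the normalized form of CODESET into BUF: drop everything
   except letters and digits, fold letters to lowercase and prepend
   "iso" if only digits remain.  Returns 1 on success, 0 if BUF
   (of SIZE bytes) is too small.  */
static int
normalize_codeset (const char *codeset, char *buf, size_t size)
{
  size_t len = 0;
  int only_digits = 1;

  /* First pass: count the characters that are kept.  */
  for (const char *p = codeset; *p != '\0'; ++p)
    if (isalnum ((unsigned char) *p))
      {
        ++len;
        if (!isdigit ((unsigned char) *p))
          only_digits = 0;
      }

  size_t prefix = only_digits ? 3 : 0;
  if (prefix + len + 1 > size)
    return 0;

  /* Second pass: emit the optional prefix and the folded name.  */
  if (only_digits)
    memcpy (buf, "iso", 3);
  char *out = buf + prefix;
  for (const char *p = codeset; *p != '\0'; ++p)
    if (isalnum ((unsigned char) *p))
      *out++ = (char) tolower ((unsigned char) *p);
  *out = '\0';
  return 1;
}
```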
1249
1250Even this extended functionality still does not help to solve the
1251problem that completely different names can be used to denote the same
1252locale (e.g., @code{de} and @code{german}). To be of help in this
1253situation the locale implementation and also the @code{gettext}
1254functions know about aliases.
1255
1256The file @file{/usr/share/locale/locale.alias} (replace @file{/usr} with
1257whatever prefix you used for configuring the C library) contains a
1258mapping of alternative names to more regular names. The system manager
1259is free to add new entries to fill her/his own needs. The selected
1260locale from the environment is compared with the entries in the first
1261column of this file ignoring the case. If they match the value of the
1262second column is used instead for the further handling.
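
The alias lookup amounts to a case-insensitive search in a two-column
table.  In the sketch below the table is a made-up in-memory sample
instead of the real file, and the names @code{sample_aliases} and
@code{resolve_alias} are hypothetical.

```c
#include <stddef.h>
#include <strings.h>

struct locale_alias
{
  const char *alias;   /* First column of locale.alias.  */
  const char *value;   /* Second column.  */
};

/* Made-up sample entries; the real file is maintained by the
   system manager.  */
static const struct locale_alias sample_aliases[] =
{
  { "german", "de_DE.ISO-8859-1" },
  { "deutsch", "de_DE.ISO-8859-1" },
};

/* Return the replacement for NAME if it matches an alias, ignoring
   case; otherwise return NAME itself.  */
static const char *
resolve_alias (const char *name)
{
  size_t n = sizeof sample_aliases / sizeof sample_aliases[0];
  for (size_t i = 0; i < n; ++i)
    if (strcasecmp (name, sample_aliases[i].alias) == 0)
      return sample_aliases[i].value;
  return name;
}
```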
1263
In the description of the format of the environment variables we already
mentioned the character set as a factor in the selection of the message
catalog.  In fact, only catalogs which contain text written using the
character set of the system/program can be used (directly; there will
come a solution for this some day).  This means for the user that s/he
will always have to take care of this.  If in the collection of the
message catalogs there are files for the same language but coded using
different character sets the user has to be careful.
1272
1273
1274@node Helper programs for gettext
1275@subsection Programs to handle message catalogs for @code{gettext}
1276
1277The GNU C Library does not contain the source code for the programs to
1278handle message catalogs for the @code{gettext} functions. As part of
1279the GNU project the GNU gettext package contains everything the
1280developer needs. The functionality provided by the tools in this
1281package by far exceeds the abilities of the @code{gencat} program
1282described above for the @code{catgets} functions.
1283
1284There is a program @code{msgfmt} which is the equivalent program to the
1285@code{gencat} program. It generates from the human-readable and
1286-editable form of the message catalog a binary file which can be used by
1287the @code{gettext} functions. But there are several more programs
1288available.
1289
The @code{xgettext} program can be used to automatically extract the
translatable messages from a source file.  I.e., the programmer need not
take care of the translations and the list of messages which have to be
translated.  S/He will simply wrap the translatable strings in calls to
@code{gettext} et al.@: and the rest will be done by @code{xgettext}.
This program has a lot of options which help to customize the output or
help to understand the input better.
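
Marking strings in the source might look like the following sketch.
The shorthand macros @code{_} and @code{N_} are a common convention and
an assumption here; they are not defined by the C library, and
@code{xgettext} has to be told about such names (for example with its
@code{--keyword} option).  The array and function names are made up.

```c
#include <libintl.h>
#include <stddef.h>

/* Conventional shorthands (an assumption here, not defined by the
   C library): _() translates at run time, N_() only marks a string
   so that an extraction tool can find it.  */
#define _(msgid) gettext (msgid)
#define N_(msgid) msgid

/* Strings in a static initializer cannot call gettext, so they are
   marked with N_() and translated when used.  */
static const char *const weekdays[] =
{
  N_("Monday"), N_("Tuesday"), N_("Wednesday")
};

/* Translate the marked string at the time it is needed.  */
static const char *
weekday_name (size_t index)
{
  return _(weekdays[index]);
}

/* An ordinary translatable string in running code.  */
static const char *
greeting (void)
{
  return _("Hello, world!");
}
```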
1297
Other programs help to manage the development cycle when new messages
appear in the source files or when a new translation of the messages
appears.  Here it should only be noted that using all the tools in GNU
gettext it is possible to @emph{completely} automate the handling of
message catalogs.  Besides marking the translatable strings in the
source code and generating the translations the developers do not have
anything to do themselves.