@node Message Translation, Searching and Sorting, Locales, Top
@c %MENU% How to make the program speak the user's language
@chapter Message Translation

The program's interface with the user should be designed to ease the user's
task. One way to ease the user's task is to use messages in whatever
language the user prefers.

Printing messages in different languages can be implemented in different
ways. One could add all the different languages in the source code and
choose among the variants every time a message has to be printed. This is
certainly not a good solution since extending the set of languages is
cumbersome (the code must be changed) and the code itself can become
really big with dozens of message sets.

A better solution is to keep the message sets for each language
in separate files which are loaded at runtime depending on the language
selection of the user.

@Theglibc{} provides two different sets of functions to support
message translation. The problem is that neither of the interfaces is
officially defined by the POSIX standard. The @code{catgets} family of
functions is defined in the X/Open standard but this is derived from
industry decisions and therefore not necessarily based on reasonable
decisions.

As mentioned above, the message catalog handling provides easy
extensibility by using external data files which contain the message
translations. That is, these files contain, for each of the messages used
in the program, a translation for the appropriate language. So the tasks
of the message handling functions are

@itemize @bullet
@item
locate the external data file with the appropriate translations
@item
load the data and make it possible to address the messages
@item
map a given key to the translated message
@end itemize

The two approaches mainly differ in the implementation of this last
step. Decisions made in the last step influence the rest of the design.

@menu
* Message catalogs a la X/Open:: The @code{catgets} family of functions.
* The Uniforum approach:: The @code{gettext} family of functions.
@end menu


@node Message catalogs a la X/Open
@section X/Open Message Catalog Handling

The @code{catgets} functions are based on the simple scheme:

@quotation
Associate every message to translate in the source code with a unique
identifier. To retrieve a message from a catalog file solely the
identifier is used.
@end quotation

This means for the author of the program that s/he will have to make
sure the meaning of the identifier in the program code and in the
message catalogs is always the same.

Before a message can be translated the catalog file must be located.
The user of the program must be able to guide the responsible function
to find whatever catalog the user wants. This is separated from what
the programmer had in mind.

All the types, constants and functions for the @code{catgets} functions
are defined/declared in the @file{nl_types.h} header file.

@menu
* The catgets Functions:: The @code{catgets} function family.
* The message catalog files:: Format of the message catalog files.
* The gencat program:: How to generate message catalog files which
 can be used by the functions.
* Common Usage:: How to use the @code{catgets} interface.
@end menu


@node The catgets Functions
@subsection The @code{catgets} function family

@comment nl_types.h
@comment X/Open
@deftypefun nl_catd catopen (const char *@var{cat_name}, int @var{flag})
@safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@ascuheap{}}@acunsafe{@acsmem{}}}
@c catopen @mtsenv @ascuheap @acsmem
@c strchr ok
@c setlocale(,NULL) ok
@c getenv @mtsenv
@c strlen ok
@c alloca ok
@c stpcpy ok
@c malloc @ascuheap @acsmem
@c __open_catalog @ascuheap @acsmem
@c strchr ok
@c open_not_cancel_2 @acsfd
@c strlen ok
@c ENOUGH ok
@c alloca ok
@c memcpy ok
@c fxstat64 ok
@c __set_errno ok
@c mmap @acsmem
@c malloc dup @ascuheap @acsmem
@c read_not_cancel ok
@c free dup @ascuheap @acsmem
@c munmap ok
@c close_not_cancel_no_status ok
@c free @ascuheap @acsmem
The @code{catopen} function tries to locate the message data file named
@var{cat_name} and loads it when found. The return value is of an
opaque type and can be used in calls to the other functions to refer to
this loaded catalog.

The return value is @code{(nl_catd) -1} in case the function failed and
no catalog was loaded. The global variable @var{errno} contains a code
for the error causing the failure. But even if the function call
succeeded this does not mean that all messages can be translated.
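
A minimal sketch of checking the result of @code{catopen} follows. The helper name @code{open_catalog_or_warn} is hypothetical (it is not part of the library); it only reports the @code{errno} value when no catalog could be loaded.

```c
#include <errno.h>
#include <nl_types.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open NAME and report a failure on stderr.
   Returns the descriptor, or (nl_catd) -1 if no catalog was loaded.  */
nl_catd
open_catalog_or_warn (const char *name)
{
  nl_catd cat = catopen (name, NL_CAT_LOCALE);
  if (cat == (nl_catd) -1)
    fprintf (stderr, "cannot open message catalog %s: %s\n",
             name, strerror (errno));
  return cat;
}
```

A failed open is usually not fatal, so a caller can keep the returned value and continue with untranslated messages.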

Locating the catalog file must happen in a way which lets the user of
the program influence the decision. It is up to the user to decide
about the language to use and sometimes it is useful to use alternate
catalog files. All this can be specified by the user by setting some
environment variables.

The first problem is to find out where all the message catalogs are
stored. Every program could have its own place to keep all the
different files but usually the catalog files are grouped by languages
and the catalogs for all programs are kept in the same place.

@cindex NLSPATH environment variable
To tell the @code{catopen} function where the catalog for the program
can be found the user can set the environment variable @code{NLSPATH} to
a value which describes her/his choice. Since this value must be usable
for different languages and locales it cannot be a simple string.
Instead it is a format string (similar to @code{printf}'s). An example
is

@smallexample
/usr/share/locale/%L/%N:/usr/share/locale/%L/LC_MESSAGES/%N
@end smallexample

First one can see that more than one directory can be specified (with
the usual syntax of separating them by colons). The next things to
observe are the format elements, @code{%L} and @code{%N} in this case.
The @code{catopen} function knows about several of them and each one
is replaced by a different value.

@table @code
@item %N
This format element is substituted with the name of the catalog file.
This is the value of the @var{cat_name} argument given to
@code{catopen}.

@item %L
This format element is substituted with the name of the currently
selected locale for translating messages. How this is determined is
explained below.

@item %l
(This is the lowercase ell.) This format element is substituted with the
language element of the locale name. The string describing the selected
locale is expected to have the form
@code{@var{lang}[_@var{terr}[.@var{codeset}]]} and this format uses the
first part @var{lang}.

@item %t
This format element is substituted by the territory part @var{terr} of
the name of the currently selected locale. See the explanation of the
format above.

@item %c
This format element is substituted by the codeset part @var{codeset} of
the name of the currently selected locale. See the explanation of the
format above.

@item %%
Since @code{%} is used as a meta character there must be a way to
express the @code{%} character in the result itself. Using @code{%%}
does this just like it works for @code{printf}.
@end table
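
The substitution rules in the table can be sketched as follows. The function @code{expand_nlspath_entry} is hypothetical and only illustrates the rules; it is not the code used inside @code{catopen}, and it handles a single colon-separated entry of @code{NLSPATH}.

```c
#include <stddef.h>
#include <string.h>

/* Append the first N bytes of SRC to the string in OUT (capacity
   OUTSZ).  Returns 0 on success, -1 if the result would not fit.  */
static int
append_bytes (char *out, size_t outsz, const char *src, size_t n)
{
  size_t used = strlen (out);
  if (n >= outsz - used)
    return -1;
  memcpy (out + used, src, n);
  out[used + n] = '\0';
  return 0;
}

/* Illustrative only: expand one NLSPATH entry TMPL for catalog NAME
   and locale name LOCALE into OUT.  Returns 0 on success, -1 on an
   unknown format element or if OUT is too small.  */
int
expand_nlspath_entry (const char *tmpl, const char *name,
                      const char *locale, char *out, size_t outsz)
{
  /* Split LOCALE, expected in the form lang[_terr[.codeset]].  */
  size_t lang_len = strcspn (locale, "_.");
  const char *terr = "";
  size_t terr_len = 0;
  const char *codeset = "";
  const char *rest = locale + lang_len;

  if (*rest == '_')
    {
      terr = rest + 1;
      terr_len = strcspn (terr, ".");
      rest = terr + terr_len;
    }
  if (*rest == '.')
    codeset = rest + 1;

  if (outsz == 0)
    return -1;
  out[0] = '\0';

  for (const char *p = tmpl; *p != '\0'; ++p)
    {
      int rc;
      if (*p != '%')
        rc = append_bytes (out, outsz, p, 1);
      else
        switch (*++p)
          {
          case 'N': rc = append_bytes (out, outsz, name, strlen (name)); break;
          case 'L': rc = append_bytes (out, outsz, locale, strlen (locale)); break;
          case 'l': rc = append_bytes (out, outsz, locale, lang_len); break;
          case 't': rc = append_bytes (out, outsz, terr, terr_len); break;
          case 'c': rc = append_bytes (out, outsz, codeset, strlen (codeset)); break;
          case '%': rc = append_bytes (out, outsz, "%", 1); break;
          default: return -1;  /* Unknown element or trailing '%'.  */
          }
      if (rc != 0)
        return -1;
    }
  return 0;
}
```

With the locale name @code{de_DE.UTF-8} and the catalog name @code{hello.cat}, the entry @code{/usr/share/locale/%L/%N} expands to @code{/usr/share/locale/de_DE.UTF-8/hello.cat}.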


Using @code{NLSPATH} allows arbitrary directories to be searched for
message catalogs while still allowing different languages to be used.
If the @code{NLSPATH} environment variable is not set, the default value
is

@smallexample
@var{prefix}/share/locale/%L/%N:@var{prefix}/share/locale/%L/LC_MESSAGES/%N
@end smallexample

@noindent
where @var{prefix} is given to @code{configure} while installing @theglibc{}
(this value is in many cases @code{/usr} or the empty string).

The remaining problem is to decide which locale name must be used. This
value determines the substitution of the format elements mentioned above.
First of all the message catalog name can itself contain a path
(i.e., the name contains a slash character). In this situation the
@code{NLSPATH} environment variable is not used. The catalog must exist
as specified in the program, perhaps relative to the current working
directory. This situation is not desirable and catalog names should
never be written this way. Besides, this behavior is not portable
to all other platforms providing the @code{catgets} interface.

@cindex LC_ALL environment variable
@cindex LC_MESSAGES environment variable
@cindex LANG environment variable
Otherwise the values of environment variables from the standard
environment are examined (@pxref{Standard Environment}). Which
variables are examined is decided by the @var{flag} parameter of
@code{catopen}. If the value is @code{NL_CAT_LOCALE} (which is defined
in @file{nl_types.h}) then the @code{catopen} function uses the name of
the locale currently selected for the @code{LC_MESSAGES} category.

If @var{flag} is zero the @code{LANG} environment variable is examined.
This is a left-over from the early days when the concept of locales
had not even reached the level of POSIX locales.

The environment variable and the locale name should have a value of the
form @code{@var{lang}[_@var{terr}[.@var{codeset}]]} as explained above.
If no environment variable is set the @code{"C"} locale is used which
prevents any translation.
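
The selection just described can be sketched as a small pure function. The name @code{choose_catalog_locale} and its parameters are hypothetical; the caller is assumed to supply the name of the locale selected for @code{LC_MESSAGES} and the value of @code{LANG}. Treating an empty value like a missing one is an assumption of this sketch.

```c
#include <nl_types.h>
#include <stddef.h>

/* Hypothetical helper mirroring the rules above.  FLAG is the catopen
   flag, MESSAGES_LOCALE the name of the locale selected for
   LC_MESSAGES, and LANG_VALUE the value of the LANG environment
   variable (NULL if unset).  */
const char *
choose_catalog_locale (int flag, const char *messages_locale,
                       const char *lang_value)
{
  const char *name = (flag == NL_CAT_LOCALE) ? messages_locale : lang_value;
  /* Assumption: an empty value is treated like a missing one.  */
  if (name == NULL || name[0] == '\0')
    return "C";
  return name;
}
```

The result is the string that the @code{%L}, @code{%l}, @code{%t} and @code{%c} elements are derived from.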

The return value of @code{catgets} is in any case a valid string. Either
it is a translation from a message catalog or it is the same as the
@var{string} parameter. So a piece of code to decide whether a
translation actually happened must look like this:

@smallexample
@{
  char *trans = catgets (desc, set, msg, input_string);
  if (trans == input_string)
    @{
      /* Something went wrong. */
    @}
@}
@end smallexample

@noindent
When an error occurs the global variable @var{errno} is set to

@table @var
@item EBADF
The catalog does not exist.
@item ENOMSG
The set/message tuple does not name an existing element in the
message catalog.
@end table

While it can sometimes be useful to test for errors, programs normally
avoid any test. If the translation is not available it is no big
problem if the original, untranslated message is printed. Either the
user understands this as well or s/he will look for the reason why the
messages are not translated.
@end deftypefun
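
The pointer comparison shown above can be wrapped in a helper. This is a sketch only: @code{lookup_message} is a hypothetical name, and skipping the @code{catgets} call for a descriptor that failed to open is a defensive choice of this example, not something the interface requires.

```c
#include <nl_types.h>
#include <stddef.h>

/* Hypothetical helper: look up SET/MSG in CAT and fall back to
   FALLBACK.  If TRANSLATED is not NULL it is set to 1 when a string
   from the catalog was returned and to 0 otherwise.  */
const char *
lookup_message (nl_catd cat, int set, int msg, const char *fallback,
                int *translated)
{
  const char *text = fallback;
  /* Defensive assumption: do not pass a failed descriptor on.  */
  if (cat != (nl_catd) -1)
    text = catgets (cat, set, msg, fallback);
  if (translated != NULL)
    *translated = (text != fallback);
  return text;
}
```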

Please note that the currently selected locale does not depend on a call
to the @code{setlocale} function. It is not necessary that the locale
data files for this locale exist and calling @code{setlocale} succeeds.
The @code{catopen} function directly reads the values of the environment
variables.


@deftypefun {char *} catgets (nl_catd @var{catalog_desc}, int @var{set}, int @var{message}, const char *@var{string})
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
The function @code{catgets} has to be used to access the message catalog
previously opened using the @code{catopen} function. The
@var{catalog_desc} parameter must be a value previously returned by
@code{catopen}.

The next two parameters, @var{set} and @var{message}, reflect the
internal organization of the message catalog files. This will be
explained in detail below. For now it is interesting to know that a
catalog can consist of several sets and the messages in each set are
individually numbered. Neither the set numbers nor the
message numbers need to be consecutive. They can be arbitrarily chosen.
But each message (unless equal to another one) must have its own unique
pair of set and message numbers.

Since it is not guaranteed that the message catalog for the language
selected by the user exists the last parameter @var{string} helps to
handle this case gracefully. If no matching string can be found
@var{string} is returned. This means for the programmer that

@itemize @bullet
@item
the @var{string} parameters should contain reasonable text (this also
helps to understand the program since otherwise there would be no hint
on the string which is expected to be returned).
@item
all @var{string} arguments should be written in the same language.
@end itemize
@end deftypefun

It is somewhat uncomfortable to write a program using the @code{catgets}
functions if no supporting functionality is available. Since each
set/message number tuple must be unique the programmer must keep lists
of the messages at the same time the code is written. And the work
between several people working on the same project must be coordinated.
We will see how some of these problems can be relaxed a bit (@pxref{Common
Usage}).
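
Such a hand-maintained list can at least be checked mechanically. The following sketch assumes a hypothetical in-program table of entries (the type @code{struct msg_entry} and the function @code{find_conflict} are not part of any library) and reports the first set/message pair that is reused for a different string.

```c
#include <stddef.h>
#include <string.h>

/* One line of the hand-maintained message list.  */
struct msg_entry
{
  int set;
  int msg;
  const char *text;
};

/* Return the index of the first entry that reuses the set/message
   pair of an earlier entry for a different string, or -1 if the list
   is consistent.  Equal strings may share a pair.  */
long
find_conflict (const struct msg_entry *list, size_t n)
{
  for (size_t j = 1; j < n; ++j)
    for (size_t i = 0; i < j; ++i)
      if (list[i].set == list[j].set && list[i].msg == list[j].msg
          && strcmp (list[i].text, list[j].text) != 0)
        return (long) j;
  return -1;
}
```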

@deftypefun int catclose (nl_catd @var{catalog_desc})
@safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{}}@acunsafe{@acucorrupt{} @acsmem{}}}
@c catclose @ascuheap @acucorrupt @acsmem
@c __set_errno ok
@c munmap ok
@c free @ascuheap @acsmem
The @code{catclose} function can be used to free the resources
associated with a message catalog which previously was opened by a call
to @code{catopen}. If the resources can be successfully freed the
function returns @code{0}. Otherwise it returns @code{@minus{}1} and the
global variable @var{errno} is set. Errors can occur if the catalog
descriptor @var{catalog_desc} is not valid in which case @var{errno} is
set to @code{EBADF}.
@end deftypefun
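
Since @code{catclose} frees the resources of the catalog, a string returned by @code{catgets} is best copied if it is needed after the catalog has been closed. The following sketch assumes that the returned pointer may refer to storage owned by the catalog; the helper name @code{fetch_message} is hypothetical.

```c
#include <nl_types.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: open CATALOG, copy the message SET/MSG (or
   FALLBACK) into OUT, and close the catalog again.  Returns 0 on
   success and -1 if OUT has no room or catclose reports an error.  */
int
fetch_message (const char *catalog, int set, int msg,
               const char *fallback, char *out, size_t outsz)
{
  if (outsz == 0)
    return -1;

  nl_catd cat = catopen (catalog, NL_CAT_LOCALE);
  const char *text = fallback;
  if (cat != (nl_catd) -1)
    text = catgets (cat, set, msg, fallback);

  /* Copy while the catalog is still open; long text is truncated.  */
  snprintf (out, outsz, "%s", text);

  if (cat != (nl_catd) -1 && catclose (cat) != 0)
    return -1;
  return 0;
}
```

Opening and closing the catalog for every message is slow; a real program would normally keep the descriptor open for its whole run.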


@node The message catalog files
@subsection Format of the message catalog files

The only reasonable way to translate all the messages of a program and
store the result in a message catalog file which can be read by the
@code{catopen} function is to hand all the message text to the
translator and let her/him translate them all. I.e., we must have a
file with entries which associate the set/message tuple with a specific
translation. This file format is specified in the X/Open standard and
is as follows:

@itemize @bullet
@item
Lines containing only whitespace characters or empty lines are ignored.

@item
Lines which contain as the first non-whitespace character a @code{$}
followed by a whitespace character are comments and are also ignored.

@item
If a line contains as the first non-whitespace characters the sequence
@code{$set} followed by a whitespace character an additional argument
is required to follow. This argument can either be:

@itemize @minus
@item
a number. In this case the value of this number determines the set
to which the following messages are added.

@item
an identifier consisting of alphanumeric characters plus the underscore
character. In this case the set is automatically assigned a number.
This value is one added to the largest set number which so far appeared.

How to use the symbolic names is explained in section @ref{Common Usage}.

It is an error if a symbol name appears more than once. All following
messages are placed in a set with this number.
@end itemize

@item
If a line contains as the first non-whitespace characters the sequence
@code{$delset} followed by a whitespace character an additional argument
is required to follow. This argument can either be:

@itemize @minus
@item
a number. In this case the value of this number determines the set
which will be deleted.

@item
an identifier consisting of alphanumeric characters plus the underscore
character. This symbolic identifier must match a name for a set which
previously was defined. It is an error if the name is unknown.
@end itemize

In both cases all messages in the specified set will be removed. They
will not appear in the output. But if this set is later selected again
with a @code{$set} command, messages can be added again and these
messages will appear in the output.

@item
If a line contains after leading whitespace the sequence
@code{$quote}, the quoting character used for this input file is
changed to the first non-whitespace character following
@code{$quote}. If no non-whitespace character is present before the
line ends quoting is disabled.

By default no quoting character is used. In this mode strings are
terminated with the first unescaped line break. If there is a
@code{$quote} sequence present newline need not be escaped. Instead a
string is terminated with the first unescaped appearance of the quote
character.

A common usage of this feature would be to set the quote character to
@code{"}. Then any appearance of the @code{"} in the strings must
be escaped using the backslash (i.e., @code{\"} must be written).

@item
Any other line must start with a number or an alphanumeric identifier
(with the underscore character included). The following characters
(starting after the first whitespace character) will form the string
which gets associated with the currently selected set and the message
number represented by the number or identifier, respectively.

If the start of the line is a number the message number is obvious. It
is an error if the same message number already appeared for this set.

If the leading token was an identifier the message number gets
automatically assigned. The value is the current maximum message
number for this set plus one. It is an error if the identifier was
already used for a message in this set. It is OK to reuse the
identifier for a message in another set. How to use the symbolic
identifiers will be explained below (@pxref{Common Usage}). There is
one limitation with the identifier: it must not be @code{Set}. The
reason will be explained below.

The text of the messages can contain escape characters. The usual bunch
of characters known from the @w{ISO C} language are recognized
(@code{\n}, @code{\t}, @code{\v}, @code{\b}, @code{\r}, @code{\f},
@code{\\}, and @code{\@var{nnn}}, where @var{nnn} is the octal coding of
a character code).
@end itemize

@strong{Important:} The handling of identifiers instead of numbers for
the set and messages is a GNU extension. Systems strictly following the
X/Open specification do not have this feature. An example of a message
catalog file is this:

@smallexample
$ This is a leading comment.
$quote "

$set SetOne
1 Message with ID 1.
two " Message with ID \"two\", which gets the value 2 assigned"

$set SetTwo
$ Since the last set got the number 1 assigned this set has number 2.
4000 "The numbers can be arbitrary, they need not start at one."
@end smallexample

This small example shows various aspects:
@itemize @bullet
@item
Lines 1 and 9 are comments since they start with @code{$} followed by
whitespace.
@item
The quoting character is set to @code{"}. Otherwise the quotes in the
message definition would have to be omitted and in this case the
message with the identifier @code{two} would lose its leading whitespace.
@item
Mixing numbered messages with messages having symbolic names is no
problem and the numbering happens automatically.
@end itemize
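
The line kinds of the source format can be told apart by looking at the first token only. The following sketch shows one way to do this; the names @code{line_kind} and @code{classify_line} are hypothetical and the code is not taken from @code{gencat}. It does not check whether required arguments are present, and it treats a line consisting only of @code{$} as a comment, which is an assumption of this example.

```c
#include <ctype.h>
#include <string.h>

enum line_kind
{
  LINE_BLANK, LINE_COMMENT, LINE_SET, LINE_DELSET, LINE_QUOTE,
  LINE_MESSAGE, LINE_INVALID
};

/* Nonzero if S starts with WORD followed by whitespace or end of line.  */
static int
starts_with_word (const char *s, const char *word)
{
  size_t len = strlen (word);
  return strncmp (s, word, len) == 0
         && (s[len] == '\0' || isspace ((unsigned char) s[len]));
}

/* Classify one line of a message catalog source file.  */
enum line_kind
classify_line (const char *line)
{
  while (isspace ((unsigned char) *line))
    ++line;
  if (*line == '\0')
    return LINE_BLANK;
  if (starts_with_word (line, "$set"))
    return LINE_SET;
  if (starts_with_word (line, "$delset"))
    return LINE_DELSET;
  if (starts_with_word (line, "$quote"))
    return LINE_QUOTE;
  if (starts_with_word (line, "$"))
    return LINE_COMMENT;
  /* Message lines start with a number or an identifier.  */
  if (isalnum ((unsigned char) *line) || *line == '_')
    return LINE_MESSAGE;
  return LINE_INVALID;
}
```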


While this file format is pretty easy it is not the best possible for
use in a running program. The @code{catopen} function would have to
parse the file and handle syntactic errors gracefully. This is not so
easy and the whole process is pretty slow. Therefore the @code{catgets}
functions expect the data in another more compact and ready-to-use file
format. There is a special program @code{gencat} which is explained in
detail in the next section.

Files in this other format are not human readable. To be easy to use by
programs it is a binary file. But the format is byte order independent
so translation files can be shared by systems of arbitrary architecture
(as long as they use @theglibc{}).

Details about the binary file format are not important to know since
these files are always created by the @code{gencat} program. The
sources of @theglibc{} also provide the sources for the
@code{gencat} program and so the interested reader can look through
these source files to learn about the file format.


@node The gencat program
@subsection Generate Message Catalog Files

@cindex gencat
The @code{gencat} program is specified in the X/Open standard and the
GNU implementation follows this specification and so processes
all correctly formed input files. Additionally some extensions are
implemented which help to work in a more reasonable way with the
@code{catgets} functions.

The @code{gencat} program can be invoked in two ways:

@example
`gencat [@var{Option} @dots{}] [@var{Output-File} [@var{Input-File} @dots{}]]`
@end example

This is the interface defined in the X/Open standard. If no
@var{Input-File} parameter is given, input will be read from standard
input. Multiple input files will be read as if they were concatenated.
If @var{Output-File} is also missing, the output will be written to
standard output. To provide the interface one is used to from other
programs a second interface is provided.

@smallexample
`gencat [@var{Option} @dots{}] -o @var{Output-File} [@var{Input-File} @dots{}]`
@end smallexample

The option @samp{-o} is used to specify the output file and all file
arguments are used as input files.

Besides this one can use @file{-} or @file{/dev/stdin} for
@var{Input-File} to denote the standard input. Correspondingly one can
use @file{-} and @file{/dev/stdout} for @var{Output-File} to denote
standard output. Using @file{-} as a file name is allowed in X/Open
while using the device names is a GNU extension.

The @code{gencat} program works by concatenating all input files and
then @strong{merging} the resulting collection of message sets with a
possibly existing output file. This is done by removing all messages
with set/message number tuples matching any of the generated messages
from the output file and then adding all the new messages. To
regenerate a catalog file while ignoring the old contents therefore
requires removing the output file if it exists. If the output is
written to standard output no merging takes place.
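
The merge rule can be illustrated on in-memory lists. This is only a model of the behavior described above, not the code of @code{gencat}; the type @code{struct cat_msg} and the function @code{merge_messages} are hypothetical names.

```c
#include <stddef.h>

/* One message together with its set/message number tuple.  */
struct cat_msg
{
  int set;
  int msg;
  const char *text;
};

/* Return nonzero if A and B name the same set/message tuple.  */
static int
same_id (const struct cat_msg *a, const struct cat_msg *b)
{
  return a->set == b->set && a->msg == b->msg;
}

/* Merge INCOMING into OLD: old entries whose tuple also occurs in
   INCOMING are dropped, then all incoming entries are appended.
   Returns the number of entries stored in OUT, or (size_t) -1 if OUT
   (capacity CAP) is too small.  */
size_t
merge_messages (const struct cat_msg *old, size_t n_old,
                const struct cat_msg *incoming, size_t n_in,
                struct cat_msg *out, size_t cap)
{
  size_t n_out = 0;
  for (size_t i = 0; i < n_old; ++i)
    {
      int replaced = 0;
      for (size_t j = 0; j < n_in && !replaced; ++j)
        replaced = same_id (&old[i], &incoming[j]);
      if (replaced)
        continue;
      if (n_out == cap)
        return (size_t) -1;
      out[n_out++] = old[i];
    }
  for (size_t j = 0; j < n_in; ++j)
    {
      if (n_out == cap)
        return (size_t) -1;
      out[n_out++] = incoming[j];
    }
  return n_out;
}
```

Old messages that are not mentioned in the new input survive the merge, which is why a stale output file has to be removed to get a fresh catalog.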

@noindent
The following table shows the options understood by the @code{gencat}
program. The X/Open standard does not specify any options for the
program so all of these are GNU extensions.

@table @samp
@item -V
@itemx --version
Print the version information and exit.
@item -h
@itemx --help
Print a usage message listing all available options, then exit successfully.
@item --new
Do not merge the new messages from the input files with the old content
of the output file. The old content of the output file is discarded.
@item -H
@itemx --header=name
This option is used to emit the symbolic names given to sets and
messages in the input files for use in the program. Details about how
to use this are given in the next section. The @var{name} parameter to
this option specifies the name of the output file. It will contain a
number of C preprocessor @code{#define}s to associate a name with a
number.

Please note that the generated file only contains the symbols from the
input files. If the output is merged with the previous content of the
output file the possibly existing symbols from the file(s) which
generated the old output files are not in the generated header file.
@end table


@node Common Usage
@subsection How to use the @code{catgets} interface

The @code{catgets} functions can be used in two different ways: by
strictly following the X/Open specification and not relying on any
extensions, or by using the GNU extensions. We will take a look at the
former method first to understand the benefits of extensions.

@subsubsection Not using symbolic names

Since the X/Open format of the message catalog files does not allow
symbol names we have to work with numbers all the time. When we start
writing a program we have to replace all appearances of translatable
strings with something like

@smallexample
catgets (catdesc, set, msg, "string")
@end smallexample

@noindent
@var{catdesc} is retrieved from a call to @code{catopen} which is
normally done once at the program start. The @code{"string"} is the
string we want to translate. The problems start with the set and
message numbers.

In a bigger program several programmers usually work at the same time on
the program and so coordinating the number allocation is crucial.
Though no two different strings must be indexed by the same tuple of
numbers it is highly desirable to reuse the numbers for equal strings
with equal translations (please note that there might be strings which
are equal in one language but have different translations due to
different contexts).

The allocation process can be relaxed a bit by using different set
numbers for different parts of the program. So the number of developers
who have to coordinate the allocation can be reduced. But still lists
must be kept to track the allocation and errors can easily happen. These
errors cannot be discovered by the compiler or the @code{catgets} functions.
Only the user of the program might see wrong messages printed. In the
worst cases the messages are so confusing that they cannot be
recognized as wrong. Think about the translations for @code{"true"} and
@code{"false"} being exchanged. This could result in a disaster.


@subsubsection Using symbolic names

The problems mentioned in the last section derive from the fact that:

@enumerate
@item
the numbers are allocated once and due to the possibly frequent use of
them it is difficult to change a number later.
@item
the numbers do not allow guessing anything about the string and
therefore collisions can easily happen.
@end enumerate

By constantly using symbolic names and by providing a method which maps
the string content to a symbolic name (however this will happen) one can
prevent both problems above. The cost of this is that the programmer
has to write a complete message catalog file while s/he is writing the
program itself.

This is necessary since the symbolic names must be mapped to numbers
before the program sources can be compiled. In the last section it was
described how to generate a header containing the mapping of the names.
E.g., for the example message file given in the last section we could
call the @code{gencat} program as follows (assume @file{ex.msg} contains
the sources).

@smallexample
gencat -H ex.h -o ex.cat ex.msg
@end smallexample

@noindent
This generates a header file with the following content:

@smallexample
#define SetTwoSet 0x2 /* ex.msg:8 */

#define SetOneSet 0x1 /* ex.msg:4 */
#define SetOnetwo 0x2 /* ex.msg:6 */
@end smallexample

As can be seen the various symbols given in the source file are mangled
to generate unique identifiers and these identifiers get numbers
assigned. Reading the source file and knowing about the rules
allows one to predict the content of the header file (it is deterministic)
but this is not necessary. The @code{gencat} program can take care of
everything. All the programmer has to do is to put the generated header
file in the dependency list of the source files of her/his project and
add a rule to regenerate the header if any of the input files change.

One word about the symbol mangling. Every symbol consists of two parts:
the name of the message set plus the name of the message or the special
string @code{Set}. So @code{SetOnetwo} means this macro can be used to
access the translation with identifier @code{two} in the message set
@code{SetOne}.

The other names denote the names of the message sets. The special
string @code{Set} is used in the place of the message identifier.
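
The mangling rule amounts to a simple concatenation. The following sketch uses a hypothetical function @code{mangle_symbol} and is not the code of @code{gencat}; it also suggests why a message identifier named @code{Set} would be a problem, since it would produce the same macro name as the set itself.

```c
#include <stddef.h>
#include <stdio.h>

/* Build the macro name for message MSG_NAME in set SET_NAME.  Pass
   NULL as MSG_NAME to get the macro name of the set itself.  Returns
   0 on success, -1 if OUT (capacity OUTSZ) is too small.  */
int
mangle_symbol (const char *set_name, const char *msg_name,
               char *out, size_t outsz)
{
  int n = snprintf (out, outsz, "%s%s", set_name,
                    msg_name != NULL ? msg_name : "Set");
  return (n >= 0 && (size_t) n < outsz) ? 0 : -1;
}
```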

If in the code the second string of the set @code{SetOne} is used the C
code should look like this:

@smallexample
catgets (catdesc, SetOneSet, SetOnetwo,
         " Message with ID \"two\", which gets the value 2 assigned")
@end smallexample

Writing the function call this way allows changing the message number
and even the set number without requiring any change in the C source
code. (The text of the string is normally not the same; this is only
for this example.)


@subsubsection How does this help with development

To illustrate the usual way to work with the symbolic names
here is a little example. Assume we want to write the very complex and
famous greeting program. We start by writing the code as usual:

@smallexample
#include <stdio.h>
int
main (void)
@{
  printf ("Hello, world!\n");
  return 0;
@}
@end smallexample
689
690Now we want to internationalize the message and therefore replace the
691message with whatever the user wants.
692
@smallexample
#include <nl_types.h>
#include <stdio.h>
#include "msgnrs.h"
int
main (void)
@{
  nl_catd catdesc = catopen ("hello.cat", NL_CAT_LOCALE);
  printf (catgets (catdesc, MainSet, MainHello,
                   "Hello, world!\n"));
  catclose (catdesc);
  return 0;
@}
@end smallexample

We see how the catalog object is opened and the returned descriptor used
in the other function calls. It is not really necessary to check for
failure of any of the functions since even in these situations the
functions will behave reasonably. If the catalog cannot be opened,
@code{catgets} simply returns the default string instead of a
translation.

What remains unspecified here are the constants @code{MainSet} and
@code{MainHello}. These are the symbolic names describing the
message. To get the actual definitions which match the information in
the catalog file we have to create the message catalog source file and
process it using the @code{gencat} program.

@smallexample
$ Messages for the famous greeting program.
$quote "

$set Main
Hello "Hallo, Welt!\n"
@end smallexample

Now we can start building the program (assume the message catalog source
file is named @file{hello.msg} and the program source file @file{hello.c}):

@smallexample
% gencat -H msgnrs.h -o hello.cat hello.msg
% cat msgnrs.h
#define MainSet 0x1     /* hello.msg:4 */
#define MainHello 0x1   /* hello.msg:5 */
% gcc -o hello hello.c -I.
% cp hello.cat /usr/share/locale/de/LC_MESSAGES
% echo $LC_ALL
de
% ./hello
Hallo, Welt!
%
@end smallexample

The call of the @code{gencat} program creates the missing header file
@file{msgnrs.h} as well as the message catalog binary. The former is
used in the compilation of @file{hello.c} while the latter is placed in a
directory in which the @code{catopen} function will try to locate it.
Please check the @code{LC_ALL} environment variable and the default path
for @code{catopen} presented in the description above.
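
If the catalog cannot be located, the program still prints the default
string. The following sketch makes this fallback explicit. The helper
name @code{print_greeting} is an assumption of this example, and the
literal set and message numbers 1 and 1 stand for the values of
@code{MainSet} and @code{MainHello} from the generated header shown
above.

```c
#include <nl_types.h>
#include <stdio.h>

/* Hypothetical helper: print the greeting from the catalog
   CATALOG_NAME, or the default string if the catalog cannot be
   opened.  Returns nonzero if a catalog was found.  */
int
print_greeting (const char *catalog_name)
{
  nl_catd catdesc = catopen (catalog_name, NL_CAT_LOCALE);
  int found = catdesc != (nl_catd) -1;

  /* With an invalid descriptor catgets hands back its last argument.  */
  fputs (catgets (catdesc, 1, 1, "Hello, world!\n"), stdout);

  if (found)
    catclose (catdesc);
  return found;
}
```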


@node The Uniforum approach
@section The Uniforum approach to Message Translation

Sun Microsystems tried to standardize a different approach to message
translation in the Uniforum group. There never was a real standard
defined but still the interface was used in Sun's operating systems.
Since this approach fits better in the development process of free
software it is also used throughout the GNU project and the GNU
@file{gettext} package provides support for this outside @theglibc{}.

The code of the @file{libintl} from GNU @file{gettext} is the same as
the code in @theglibc{}. So the documentation in the GNU
@file{gettext} manual is also valid for the functionality here. The
following text will describe the library functions in detail. But the
numerous helper programs are not described in this manual. Instead
people should read the GNU @file{gettext} manual
(@pxref{Top,,GNU gettext utilities,gettext,Native Language Support Library and Tools}).
We will only give a short overview.

Though the @code{catgets} functions are available by default on more
systems the @code{gettext} interface is at least as portable as the
former. The GNU @file{gettext} package can be used wherever the
functions are not available.


@menu
* Message catalogs with gettext::  The @code{gettext} family of functions.
* Helper programs for gettext::    Programs to handle message catalogs
                                    for @code{gettext}.
@end menu


@node Message catalogs with gettext
@subsection The @code{gettext} family of functions

Although the paradigm underlying the @code{gettext} approach to message
translation is different from that of the @code{catgets} functions, the
basic functionality is equivalent. There are functions of the following
categories:

@menu
* Translation with gettext::       What has to be done to translate a message.
* Locating gettext catalog::       How to determine which catalog to be used.
* Advanced gettext functions::     Additional functions for more complicated
                                    situations.
* Charset conversion in gettext::  How to specify the output character set
                                    @code{gettext} uses.
* GUI program problems::           How to use @code{gettext} in GUI programs.
* Using gettextized software::     The possibilities of the user to influence
                                    the way @code{gettext} works.
@end menu

@node Translation with gettext
@subsubsection What has to be done to translate a message?

The @code{gettext} functions have a very simple interface. The most
basic function just takes the string which shall be translated as the
argument and it returns the translation. This is fundamentally
different from the @code{catgets} approach where an extra key is
necessary and the original string is only used for the error case.

If the string which has to be translated is the only argument this of
course means the string itself is the key. I.e., the translation will
be selected based on the original string. The message catalogs must
therefore contain the original strings plus one translation for any such
string. The task of the @code{gettext} function is to compare the
argument string with the available strings in the catalog and return the
appropriate translation. Of course this process is optimized so that
it is not more expensive than an access using an atomic key
like in @code{catgets}.

The @code{gettext} approach has some advantages but also some
disadvantages. Please see the GNU @file{gettext} manual for a detailed
discussion of the pros and cons.

All the definitions and declarations for @code{gettext} can be found in
the @file{libintl.h} header file. On systems where these functions are
not part of the C library they can be found in a separate library named
@file{libintl.a} (or accordingly different for shared libraries).

@comment libintl.h
@comment GNU
@deftypefun {char *} gettext (const char *@var{msgid})
@safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
@c Wrapper for dcgettext.
The @code{gettext} function searches the currently selected message
catalogs for a string which is equal to @var{msgid}. If there is such a
string available its translation is returned. Otherwise the argument
string @var{msgid} is returned.

Please note that although the return value is @code{char *} the
returned string must not be changed. This broken type results from the
history of the function and does not reflect the way the function should
be used.

Please note that above we wrote ``message catalogs'' (plural). This is
a specialty of the GNU implementation of these functions and we will
say more about this when we talk about the ways message catalogs are
selected (@pxref{Locating gettext catalog}).

The @code{gettext} function does not modify the value of the global
@var{errno} variable. This is necessary to make it possible to write
something like

@smallexample
  printf (gettext ("Operation failed: %m\n"));
@end smallexample

Here the @var{errno} value is used in the @code{printf} function while
processing the @code{%m} format element and if the @code{gettext}
function changed this value (it is called before @code{printf} is
called) we would get a wrong message.

So there is no easy way to detect a missing message catalog besides
comparing the argument string with the result. But it is normally the
task of the user to react to missing catalogs. The program cannot guess
when a message catalog is really necessary since for a user who speaks
the language the program was developed in, the message does not need any
translation.
@end deftypefun
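
The two properties described above, the fallback to @var{msgid} and the
preservation of @var{errno}, can be checked with a small sketch. The
helper names are assumptions of this example. Note that the comparison
cannot tell a missing catalog from a translation which happens to be
identical to the original string.

```c
#include <errno.h>
#include <libintl.h>
#include <string.h>

/* Hypothetical helper: nonzero if gettext returned something other
   than the text of MSGID, i.e. a differing translation was found.  */
int
has_translation (const char *msgid)
{
  return strcmp (gettext (msgid), msgid) != 0;
}

/* Hypothetical helper: nonzero if a gettext call left errno alone.  */
int
gettext_keeps_errno (const char *msgid)
{
  errno = ENOENT;
  (void) gettext (msgid);
  return errno == ENOENT;
}
```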

The remaining two functions to access the message catalog add some
functionality to select a message catalog which is not the default one.
This is important if parts of the program are developed independently.
Every part can have its own message catalog and all of them can be used
at the same time. The C library itself is an example: internally it
uses the @code{gettext} functions but since it must not depend on a
currently selected default message catalog it must specify all ambiguous
information.

@comment libintl.h
@comment GNU
@deftypefun {char *} dgettext (const char *@var{domainname}, const char *@var{msgid})
@safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
@c Wrapper for dcgettext.
The @code{dgettext} function acts just like the @code{gettext}
function. It only takes an additional first argument @var{domainname}
which guides the selection of the message catalogs which are searched
for the translation. If the @var{domainname} parameter is the null
pointer the @code{dgettext} function is exactly equivalent to
@code{gettext} since the default value for the domain name is used.

As for @code{gettext} the return value type is @code{char *} which is an
anachronism. The returned string must never be modified.
@end deftypefun
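
A library would typically wrap @code{dgettext} so that its messages are
looked up in its own domain regardless of the default domain of the
program. In this sketch the domain name @code{libfoo} and the function
name @code{libfoo_gettext} are hypothetical.

```c
#include <libintl.h>

/* Hypothetical domain name of an independently developed library.  */
#define LIBFOO_DOMAIN "libfoo"

/* Look up MSGID in the library's own domain.  The default domain of
   the calling program is neither consulted nor changed.  */
const char *
libfoo_gettext (const char *msgid)
{
  return dgettext (LIBFOO_DOMAIN, msgid);
}
```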

@comment libintl.h
@comment GNU
@deftypefun {char *} dcgettext (const char *@var{domainname}, const char *@var{msgid}, int @var{category})
@safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
@c dcgettext @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
@c dcigettext @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
@c libc_rwlock_rdlock @asulock @aculock
@c current_locale_name ok [protected from @mtslocale]
@c tfind ok
@c libc_rwlock_unlock ok
@c plural_lookup ok
@c plural_eval ok
@c rawmemchr ok
@c DETERMINE_SECURE ok, nothing
@c strcmp ok
@c strlen ok
@c getcwd @ascuheap @acsmem @acsfd
@c strchr ok
@c stpcpy ok
@c category_to_name ok
@c guess_category_value @mtsenv
@c getenv @mtsenv
@c current_locale_name dup ok [protected from @mtslocale by dcigettext]
@c strcmp ok
@c ENABLE_SECURE ok
@c _nl_find_domain @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
@c libc_rwlock_rdlock dup @asulock @aculock
@c _nl_make_l10nflist dup @ascuheap @acsmem
@c libc_rwlock_unlock dup ok
@c _nl_load_domain @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
@c libc_lock_lock_recursive @aculock
@c libc_lock_unlock_recursive @aculock
@c open->open_not_cancel_2 @acsfd
@c fstat ok
@c mmap dup @acsmem
@c close->close_not_cancel_no_status @acsfd
@c malloc dup @ascuheap @acsmem
@c read->read_not_cancel ok
@c munmap dup @acsmem
@c W dup ok
@c strlen dup ok
@c get_sysdep_segment_value ok
@c memcpy dup ok
@c hash_string dup ok
@c free dup @ascuheap @acsmem
@c libc_rwlock_init ok
@c _nl_find_msg dup @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
@c libc_rwlock_fini ok
@c EXTRACT_PLURAL_EXPRESSION @ascuheap @acsmem
@c strstr dup ok
@c isspace ok
@c strtoul ok
@c PLURAL_PARSE @ascuheap @acsmem
@c malloc dup @ascuheap @acsmem
@c free dup @ascuheap @acsmem
@c INIT_GERMANIC_PLURAL ok, nothing
@c the pre-C99 variant is @acucorrupt [protected from @mtuinit by dcigettext]
@c _nl_expand_alias dup @ascuheap @asulock @acsmem @acsfd @aculock
@c _nl_explode_name dup @ascuheap @acsmem
@c libc_rwlock_wrlock dup @asulock @aculock
@c free dup @asulock @aculock @acsfd @acsmem
@c _nl_find_msg @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
@c _nl_load_domain dup @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
@c strlen ok
@c hash_string ok
@c W ok
@c SWAP ok
@c bswap_32 ok
@c strcmp ok
@c get_output_charset @mtsenv @ascuheap @acsmem
@c getenv dup @mtsenv
@c strlen dup ok
@c malloc dup @ascuheap @acsmem
@c memcpy dup ok
@c libc_rwlock_rdlock dup @asulock @aculock
@c libc_rwlock_unlock dup ok
@c libc_rwlock_wrlock dup @asulock @aculock
@c realloc @ascuheap @acsmem
@c strdup @ascuheap @acsmem
@c strstr ok
@c strcspn ok
@c mempcpy dup ok
@c norm_add_slashes dup ok
@c gconv_open @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsmem @acsfd
@c [protected from @mtslocale by dcigettext locale lock]
@c free dup @ascuheap @acsmem
@c libc_lock_lock @asulock @aculock
@c calloc @ascuheap @acsmem
@c gconv dup @acucorrupt [protected from @mtsrace and @asucorrupt by lock]
@c libc_lock_unlock ok
@c malloc @ascuheap @acsmem
@c mempcpy ok
@c memcpy ok
@c strcpy ok
@c libc_rwlock_wrlock @asulock @aculock
@c tsearch @ascuheap @acucorrupt @acsmem [protected from @mtsrace and @asucorrupt]
@c transcmp ok
@c strmp dup ok
@c free @ascuheap @acsmem
The @code{dcgettext} function adds another argument to those which
@code{dgettext} takes. This argument @var{category} specifies the last
piece of information needed to locate the message catalog. I.e., the
domain name and the locale category exactly specify which message
catalog has to be used (relative to a given directory, see below).

The @code{dgettext} function can be expressed in terms of
@code{dcgettext} by using

@smallexample
dcgettext (domain, string, LC_MESSAGES)
@end smallexample

@noindent
instead of

@smallexample
dgettext (domain, string)
@end smallexample

This also shows which values are expected for the third parameter. One
has to use the available selectors for the categories available in
@file{locale.h}. Normally the available values are @code{LC_CTYPE},
@code{LC_COLLATE}, @code{LC_MESSAGES}, @code{LC_MONETARY},
@code{LC_NUMERIC}, and @code{LC_TIME}. Please note that @code{LC_ALL}
must not be used and even though the names might suggest this, there is
no relation to the environment variable of this name.

The @code{dcgettext} function is only implemented for compatibility with
other systems which have @code{gettext} functions. There is not really
any situation where it is necessary (or useful) to use a different value
than @code{LC_MESSAGES} for the @var{category} parameter. We are
dealing with messages here and any other choice can only be irritating.

As for @code{gettext} the return value type is @code{char *} which is an
anachronism. The returned string must never be modified.
@end deftypefun

When using the three functions above in a program it is a frequent case
that the @var{msgid} argument is a constant string. So it is worthwhile to
optimize this case. Thinking about this briefly one will realize that
as long as no new message catalog is loaded the translation of a message
will not change. This optimization is actually implemented by the
@code{gettext}, @code{dgettext} and @code{dcgettext} functions.


@node Locating gettext catalog
@subsubsection How to determine which catalog to be used

The functions to retrieve the translations for a given message have a
remarkably simple interface. But to still give the user of the program
the opportunity to select exactly the translation s/he wants, and
to give the programmer the possibility to influence where the catalog
files are searched for, there is a quite complicated underlying
mechanism which controls all this. The code is complicated; the use is
easy.

Basically we have two different tasks to perform which can also be
performed by the @code{catgets} functions:

@enumerate
@item
Locate the set of message catalogs. There are a number of files for
different languages which all belong to the package. Usually they
are all stored in the filesystem below a certain directory.

There can be arbitrarily many packages installed and they can follow
different guidelines for the placement of their files.

@item
Relative to the location specified by the package the actual translation
files must be searched, based on the wishes of the user. I.e., for each
language the user selects the program should be able to locate the
appropriate file.
@end enumerate

This is the functionality required by the specifications for
@code{gettext} and this is also what the @code{catgets} functions are
able to do. But there are some problems unresolved:

@itemize @bullet
@item
The language to be used can be specified in several different ways.
There is no generally accepted standard for this and the user always
expects the program to understand what s/he means. E.g., to select the
German translation one could write @code{de}, @code{german}, or
@code{deutsch} and the program should always react the same.

@item
Sometimes the specification of the user is too detailed. If s/he, e.g.,
specifies @code{de_DE.ISO-8859-1} which means German, spoken in Germany,
coded using the @w{ISO 8859-1} character set there is the possibility
that a message catalog matching this exactly is not available. But
there could be a catalog matching @code{de} and if the character set
used on the machine is always @w{ISO 8859-1} there is no reason why this
latter message catalog should not be used. (We call this @dfn{message
inheritance}.)

@item
If a catalog for a wanted language is not available it is not always the
second best choice to fall back on the language of the developer and
simply not translate any message. Instead a user might be better able
to read the messages in another language and so the user of the program
should be able to define a precedence order of languages.
@end itemize
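
The idea of message inheritance from the second item can be sketched as
follows. This is only an illustration under simplifying assumptions:
the function name @code{less_specific} is hypothetical, modifiers such
as @code{@@euro} are not handled, and the lookup actually performed by
the library is more elaborate.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: store in OUT (of SIZE bytes) the next less
   specific variant of the locale name NAME.  The codeset (".xxx") is
   dropped first, then the territory ("_XX").  Returns 1 if a shorter
   name was produced and 0 if nothing is left to strip or OUT is too
   small.  NAME and OUT must not overlap.  */
int
less_specific (const char *name, char *out, size_t size)
{
  const char *cut = strchr (name, '.');
  if (cut == NULL)
    cut = strchr (name, '_');
  if (cut == NULL)
    return 0;

  size_t len = (size_t) (cut - name);
  if (len >= size)
    return 0;

  memcpy (out, name, len);
  out[len] = '\0';
  return 1;
}
```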

We can divide the configuration actions in two parts: the one is
performed by the programmer, the other by the user. We will start with
the functions the programmer can use since the user configuration will
be based on this.

As already mentioned in the descriptions of the functions in the last
sections, separate sets of messages can be selected by a @dfn{domain
name}. This is a simple string which should be unique for each program
part that uses a separate domain. It is possible to use arbitrarily
many domains at the same time in one program. E.g., @theglibc{} itself
uses a domain named @code{libc} while the program using the C Library
could use a domain named @code{foo}. The important point is that at any
time exactly one domain is active. This is controlled with the
following function.

@comment libintl.h
@comment GNU
@deftypefun {char *} textdomain (const char *@var{domainname})
@safety{@prelim{}@mtsafe{}@asunsafe{@asulock{} @ascuheap{}}@acunsafe{@aculock{} @acsmem{}}}
@c textdomain @asulock @ascuheap @aculock @acsmem
@c libc_rwlock_wrlock @asulock @aculock
@c strcmp ok
@c strdup @ascuheap @acsmem
@c free @ascuheap @acsmem
@c libc_rwlock_unlock ok
The @code{textdomain} function sets the default domain, which is used in
all future @code{gettext} calls, to @var{domainname}. Please note that
@code{dgettext} and @code{dcgettext} calls are not influenced if the
@var{domainname} parameter of these functions is not the null pointer.

Before the first call to @code{textdomain} the default domain is
@code{messages}. This is the name specified in the specification of
the @code{gettext} API. This name is as good as any other name. No
program should ever really use a domain with this name since this can
only lead to problems.

The function returns the value which is from now on taken as the default
domain. If the system ran out of memory the returned value is
@code{NULL} and the global variable @var{errno} is set to @code{ENOMEM}.
Despite the return value type being @code{char *} the returned string
must not be changed. It is allocated internally by the
@code{textdomain} function.

If the @var{domainname} parameter is the null pointer no new default
domain is set. Instead the currently selected default domain is
returned.

If the @var{domainname} parameter is the empty string the default domain
is reset to its initial value, the domain with the name @code{messages}.
This possibility is questionable to use since the domain @code{messages}
really never should be used.
@end deftypefun
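
The three ways of calling @code{textdomain} described above (a domain
name, the null pointer, and the empty string) can be exercised with the
following sketch. The helper names and the domain name @code{foo} used
in the calls are assumptions of this example.

```c
#include <libintl.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: make DOMAIN the default domain.  Returns 0 on
   success and -1 if textdomain failed (errno is then ENOMEM).  */
int
select_domain (const char *domain)
{
  if (textdomain (domain) == NULL)
    {
      perror ("textdomain");
      return -1;
    }
  return 0;
}

/* Hypothetical helper: nonzero if DOMAIN is the current default
   domain.  Passing a null pointer only queries the setting.  */
int
is_default_domain (const char *domain)
{
  return strcmp (textdomain (NULL), domain) == 0;
}
```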

@comment libintl.h
@comment GNU
@deftypefun {char *} bindtextdomain (const char *@var{domainname}, const char *@var{dirname})
@safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{}}@acunsafe{@acsmem{}}}
@c bindtextdomain @ascuheap @acsmem
@c set_binding_values @ascuheap @acsmem
@c libc_rwlock_wrlock dup @asulock @aculock
@c strcmp dup ok
@c strdup dup @ascuheap @acsmem
@c free dup @ascuheap @acsmem
@c malloc dup @ascuheap @acsmem
The @code{bindtextdomain} function can be used to specify the directory
which contains the message catalogs for domain @var{domainname} for the
different languages. To be correct, this is the directory where the
hierarchy of directories is expected. Details are explained below.

For the programmer it is important to note that the translations which
come with the program have to be placed in a directory hierarchy starting
at, say, @file{/foo/bar}. Then the program should make a
@code{bindtextdomain} call to bind the domain for the current program to
this directory. This ensures that the catalogs are found. A correctly
running program does not depend on the user setting an environment
variable.

The @code{bindtextdomain} function can be used several times and if the
@var{domainname} argument is different the previously bound domains
will not be overwritten.

If a program which uses @code{bindtextdomain} at some point calls the
@code{chdir} function to change the current working directory it is
important that the @var{dirname} string is an absolute pathname.
Otherwise the addressed directory might vary over time.

If the @var{dirname} parameter is the null pointer @code{bindtextdomain}
returns the currently selected directory for the domain with the name
@var{domainname}.

The @code{bindtextdomain} function returns a pointer to a string
containing the name of the selected directory. The string is
allocated internally in the function and must not be changed by the
user. If the system ran out of memory during the execution of
@code{bindtextdomain} the return value is @code{NULL} and the global
variable @var{errno} is set accordingly.
@end deftypefun
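
A typical start-up sequence combining @code{bindtextdomain} and
@code{textdomain} might look like the following sketch. The helper
names are assumptions of this example; the check for an absolute path
reflects the advice about @code{chdir} given above and is not something
the library itself enforces.

```c
#include <libintl.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: typical start-up sequence.  DIR must be an
   absolute path so that a later chdir cannot change its meaning.
   Returns 0 on success and -1 on failure.  */
int
init_translations (const char *domain, const char *dir)
{
  if (dir[0] != '/')
    return -1;

  /* Use the locale selected by the user's environment.  */
  setlocale (LC_ALL, "");

  if (bindtextdomain (domain, dir) == NULL)
    return -1;
  if (textdomain (domain) == NULL)
    return -1;
  return 0;
}

/* Hypothetical helper: nonzero if DOMAIN is currently bound to DIR.
   A null pointer as second argument only queries the binding.  */
int
is_bound_to (const char *domain, const char *dir)
{
  const char *bound = bindtextdomain (domain, NULL);
  return bound != NULL && strcmp (bound, dir) == 0;
}
```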


@node Advanced gettext functions
@subsubsection Additional functions for more complicated situations

The functions of the @code{gettext} family described so far (and all the
@code{catgets} functions as well) have one problem in the real world
which has been neglected completely in all existing approaches. What
is meant here is the handling of plural forms.

Looking through Unix source code before the time anybody thought about
internationalization (and, sadly, even afterwards) one can often find
code similar to the following:

@smallexample
   printf ("%d file%s deleted", n, n == 1 ? "" : "s");
@end smallexample

@noindent
After the first complaints from people internationalizing the code people
either completely avoided formulations like this or used strings like
@code{"file(s)"}. Both look unnatural and should be avoided. First
tries to solve the problem correctly looked like this:

@smallexample
   if (n == 1)
     printf ("%d file deleted", n);
   else
     printf ("%d files deleted", n);
@end smallexample

But this does not solve the problem. It helps languages where the
plural form of a noun is not simply constructed by adding an `s' but
that is all. Once again people fell into the trap of believing the
rules their language uses are universal. But the handling of plural
forms differs widely between the language families. There are two
things which can differ between (and even inside) language families:

@itemize @bullet
@item
The way plural forms are built differs. This is a problem with
languages which have many irregularities. German, for instance, is a
drastic case. Though English and German are part of the same language
family (Germanic), the almost regular forming of plural noun forms
(appending an `s') is hardly found in German.

@item
The number of plural forms differs. This is somewhat surprising for
those who only have experience with Romance and Germanic languages
since here the number is the same (there are two).

But other language families have only one form or many forms. More
information on this in an extra section.
@end itemize

The consequence of this is that application writers should not try to
solve the problem in their code. This would be localization since it is
only usable for certain, hardcoded language environments. Instead the
extended @code{gettext} interface should be used.

These extra functions take two strings and a numerical argument instead
of the one key string. The idea behind this is that using
the numerical argument and the first string as a key, the implementation
can select the right plural form using rules specified by the
translator. The two string arguments then will be used to provide a
return value in case no message catalog is found (similar to the normal
@code{gettext} behavior). In this case the rules for Germanic languages
are used and it is assumed that the first string argument is the singular
form, the second the plural form.

This has the consequence that programs without language catalogs can
display the correct strings only if the program itself is written using
a Germanic language. This is a limitation but since @theglibc{}
(as well as the GNU @code{gettext} package) is written as part of the
GNU package and the coding standards for the GNU project require programs
to be written in English, this solution nevertheless fulfills its
purpose.

@comment libintl.h
@comment GNU
@deftypefun {char *} ngettext (const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n})
@safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
@c Wrapper for dcngettext.
The @code{ngettext} function is similar to the @code{gettext} function
as it finds the message catalogs in the same way. But it takes two
extra arguments. The @var{msgid1} parameter must contain the singular
form of the string to be converted. It is also used as the key for the
search in the catalog. The @var{msgid2} parameter is the plural form.
The parameter @var{n} is used to determine the plural form. If no
message catalog is found @var{msgid1} is returned if @code{n == 1},
otherwise @var{msgid2}.

An example for the use of this function is:

@smallexample
  printf (ngettext ("%d file removed", "%d files removed", n), n);
@end smallexample

Please note that the numeric value @var{n} has to be passed to the
@code{printf} function as well. It is not sufficient to pass it only to
@code{ngettext}.
@end deftypefun
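
The following sketch wraps the idiom in a function so that the behavior
without a message catalog can be observed. The name
@code{format_removed} is hypothetical. Since @var{n} has the type
@code{unsigned long int} the sketch uses the @code{%lu} conversion in
both message strings.

```c
#include <libintl.h>
#include <stdio.h>

/* Hypothetical helper: format the "files removed" message for N
   files into BUF, which has room for SIZE bytes.  Returns the value
   of snprintf.  N is passed twice: once to select the plural form
   and once to be printed.  */
int
format_removed (char *buf, size_t size, unsigned long int n)
{
  return snprintf (buf, size,
                   ngettext ("%lu file removed", "%lu files removed", n),
                   n);
}
```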

@comment libintl.h
@comment GNU
@deftypefun {char *} dngettext (const char *@var{domain}, const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n})
@safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
@c Wrapper for dcngettext.
The @code{dngettext} function is similar to the @code{dgettext}
function in the way the message catalog is selected. The difference is
that it takes two extra parameters to provide the correct plural form.
These two parameters are handled in the same way @code{ngettext} handles
them.
@end deftypefun

@comment libintl.h
@comment GNU
@deftypefun {char *} dcngettext (const char *@var{domain}, const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n}, int @var{category})
@safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
@c Wrapper for dcigettext.
The @code{dcngettext} function is similar to the @code{dcgettext}
function in the way the message catalog is selected. The difference is
that it takes two extra parameters to provide the correct plural form.
These two parameters are handled in the same way @code{ngettext} handles
them.
@end deftypefun
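
As a brief sketch, a library with its own domain could select plural
forms like this. The domain name @code{libfoo} and the function names
are hypothetical; both functions return the same string because
@code{LC_MESSAGES} is the category @code{dngettext} uses.

```c
#include <libintl.h>
#include <locale.h>

/* Hypothetical domain name of an independently developed library.  */
#define LIBFOO_DOMAIN "libfoo"

/* Return the format string matching N items, looked up in the
   library's own domain.  */
const char *
libfoo_items_format (unsigned long int n)
{
  return dngettext (LIBFOO_DOMAIN, "%lu item", "%lu items", n);
}

/* The same lookup with the locale category spelled out.  */
const char *
libfoo_items_format_c (unsigned long int n)
{
  return dcngettext (LIBFOO_DOMAIN, "%lu item", "%lu items", n,
                     LC_MESSAGES);
}
```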

@subsubheading The problem of plural forms

A description of the problem can be found at the beginning of the last
section. Now there is the question how to solve it. Without the input
of linguists (which was not available) it was not possible to determine
whether there are only a few different forms in which plural forms are
formed or whether the number can increase with every new supported
language.

Therefore the solution implemented is to allow the translator to specify
the rules of how to select the plural form. Since the formula varies
with every language this is the only viable solution except for
hardcoding the information in the code (which still would require the
possibility of extensions to not prevent the use of new languages). The
details are explained in the GNU @code{gettext} manual. Here only a
bit of information is provided.

The information about the plural form selection has to be stored in the
header entry (the one with the empty @code{msgid} string). It looks
like this:

@smallexample
Plural-Forms: nplurals=2; plural=n == 1 ? 0 : 1;
@end smallexample

1351The @code{nplurals} value must be a decimal number which specifies how
1352many different plural forms exist for this language. The string
10b89412
RJ
1353following @code{plural} is an expression using the C language
1354syntax. Exceptions are that no negative numbers are allowed, numbers
b8a46c1d
UD
1355must be decimal, and the only variable allowed is @code{n}. This
1356expression will be evaluated whenever one of the functions
1357@code{ngettext}, @code{dngettext}, or @code{dcngettext} is called. The
1358numeric value passed to these functions is then substituted for all uses
1359of the variable @code{n} in the expression. The resulting value then
must be greater than or equal to zero and smaller than the value given as the
1361value of @code{nplurals}.
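
As a minimal sketch of how a program hands the number to one of these
functions, consider the following. The function name
@code{removed_format} and the message texts are made up for this
illustration. When no message catalog is found, @code{ngettext} falls
back to the English rule and returns the first string for @code{n == 1}
and the second string otherwise.

@smallexample
#include <libintl.h>
#include <stdio.h>

/* Return the format string matching N.  With a catalog installed, the
   Plural-Forms expression of that catalog selects the translation;
   without one, the English rule is applied to the two arguments.  */
static const char *
removed_format (unsigned long int n)
@{
  return ngettext ("%lu file removed\n", "%lu files removed\n", n);
@}

/* Print the message for N files.  */
static void
report_removed (unsigned long int n)
@{
  printf (removed_format (n), n);
@}
@end smallexample

Note that the number is passed twice: once to select the form and once
as the argument for the @code{%lu} conversion.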
1362
1363@noindent
The following rules are known at this point. The languages are listed
together with their families, but this does not necessarily mean the
information can be generalized for the whole family (as can be easily
seen in the table below).@footnote{Additions are welcome. Send appropriate information to
@email{bug-glibc-manual@@gnu.org}.}
1369
1370@table @asis
1371@item Only one form:
1372Some languages only require one single form. There is no distinction
c891b2df 1373between the singular and plural form. An appropriate header entry
b8a46c1d
UD
1374would look like this:
1375
1376@smallexample
Plural-Forms: nplurals=1; plural=0;
1378@end smallexample
1379
1380@noindent
1381Languages with this property include:
1382
1383@table @asis
1384@item Finno-Ugric family
1385Hungarian
1386@item Asian family
3c945c44 1387Japanese, Korean
b8a46c1d
UD
1388@item Turkic/Altaic family
1389Turkish
1390@end table
1391
1392@item Two forms, singular used for one only
c934e1c0 1393This is the form used in most existing programs since it is what English
10b89412 1394uses. A header entry would look like this:
b8a46c1d
UD
1395
1396@smallexample
Plural-Forms: nplurals=2; plural=n != 1;
1398@end smallexample
1399
(Note: this uses the feature of C expressions that boolean expressions
have the value zero or one.)
1402
1403@noindent
1404Languages with this property include:
1405
1406@table @asis
1407@item Germanic family
1408Danish, Dutch, English, German, Norwegian, Swedish
1409@item Finno-Ugric family
aa9e3c39 1410Estonian, Finnish
b8a46c1d
UD
1411@item Latin/Greek family
1412Greek
1413@item Semitic family
1414Hebrew
1415@item Romance family
3c945c44 1416Italian, Portuguese, Spanish
b8a46c1d
UD
1417@item Artificial
1418Esperanto
1419@end table
1420
1421@item Two forms, singular used for zero and one
1422Exceptional case in the language family. The header entry would be:
1423
1424@smallexample
Plural-Forms: nplurals=2; plural=n>1;
1426@end smallexample
1427
1428@noindent
1429Languages with this property include:
1430
1431@table @asis
@item Romance family
3c945c44
UD
1433French, Brazilian Portuguese
1434@end table
1435
1436@item Three forms, special case for zero
1437The header entry would be:
1438
1439@smallexample
Plural-Forms: nplurals=3; plural=n%10==1 && n%100!=11 ? 0 : n != 0 ? 1 : 2;
1441@end smallexample
1442
1443@noindent
1444Languages with this property include:
1445
1446@table @asis
1447@item Baltic family
1448Latvian
b8a46c1d
UD
1449@end table
1450
1451@item Three forms, special cases for one and two
1452The header entry would be:
1453
1454@smallexample
Plural-Forms: nplurals=3; plural=n==1 ? 0 : n==2 ? 1 : 2;
1456@end smallexample
1457
1458@noindent
1459Languages with this property include:
1460
1461@table @asis
1462@item Celtic
3c945c44
UD
1463Gaeilge (Irish)
1464@end table
1465
1466@item Three forms, special case for numbers ending in 1[2-9]
1467The header entry would look like this:
1468
1469@smallexample
Plural-Forms: nplurals=3; \
    plural=n%10==1 && n%100!=11 ? 0 : \
           n%10>=2 && (n%100<10 || n%100>=20) ? 1 : 2;
1473@end smallexample
1474
1475@noindent
1476Languages with this property include:
1477
1478@table @asis
1479@item Baltic family
1480Lithuanian
b8a46c1d
UD
1481@end table
1482
aa9e3c39 1483@item Three forms, special cases for numbers ending in 1 and 2, 3, 4, except those ending in 1[1-4]
b8a46c1d
UD
1484The header entry would look like this:
1485
1486@smallexample
Plural-Forms: nplurals=3; \
    plural=n%100/10==1 ? 2 : n%10==1 ? 0 : (n+9)%10>3 ? 2 : 1;
1489@end smallexample
1490
1491@noindent
1492Languages with this property include:
1493
1494@table @asis
1495@item Slavic family
3c945c44 1496Croatian, Czech, Russian, Ukrainian
107d41a9
UD
1497@end table
1498
1499@item Three forms, special cases for 1 and 2, 3, 4
1500The header entry would look like this:
1501
1502@smallexample
Plural-Forms: nplurals=3; \
    plural=(n==1) ? 1 : (n>=2 && n<=4) ? 2 : 0;
1505@end smallexample
1506
1507@noindent
1508Languages with this property include:
1509
1510@table @asis
1511@item Slavic family
1512Slovak
b8a46c1d
UD
1513@end table
1514
1515@item Three forms, special case for one and some numbers ending in 2, 3, or 4
1516The header entry would look like this:
1517
1518@smallexample
Plural-Forms: nplurals=3; \
    plural=n==1 ? 0 : \
           n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
1522@end smallexample
1523
b8a46c1d
UD
1524@noindent
1525Languages with this property include:
1526
1527@table @asis
1528@item Slavic family
1529Polish
1530@end table
1531
3c945c44 1532@item Four forms, special case for one and all numbers ending in 02, 03, or 04
b8a46c1d
UD
1533The header entry would look like this:
1534
1535@smallexample
Plural-Forms: nplurals=4; \
    plural=n%100==1 ? 0 : n%100==2 ? 1 : n%100==3 || n%100==4 ? 2 : 3;
1538@end smallexample
1539
1540@noindent
1541Languages with this property include:
1542
1543@table @asis
1544@item Slavic family
1545Slovenian
1546@end table
1547@end table
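
Since the @code{plural} expressions use C syntax, a translator can check
a rule by pasting it into a small C function. The following sketch does
this for two of the header entries shown above, the one listed for
Croatian, Czech, Russian, and Ukrainian and the one listed for Polish.
The function names are made up for this illustration and are not part
of any library.

@smallexample
/* Rule listed above for Croatian, Czech, Russian, Ukrainian.  */
static unsigned int
plural_index_russian (unsigned long int n)
@{
  return n%100/10==1 ? 2 : n%10==1 ? 0 : (n+9)%10>3 ? 2 : 1;
@}

/* Rule listed above for Polish.  */
static unsigned int
plural_index_polish (unsigned long int n)
@{
  return n==1 ? 0
         : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
@}
@end smallexample

For example, the first function yields index 0 for 1 and 21, index 1
for 2 and 22, and index 2 for 0, 5, and 11. Every result is smaller
than the @code{nplurals} value of 3, as required.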
1548
1549
17c389fc
UD
1550@node Charset conversion in gettext
1551@subsubsection How to specify the output character set @code{gettext} uses
1552
10b89412 1553@code{gettext} not only looks up a translation in a message catalog, it
17c389fc
UD
1554also converts the translation on the fly to the desired output character
1555set. This is useful if the user is working in a different character set
1556than the translator who created the message catalog, because it avoids
1557distributing variants of message catalogs which differ only in the
1558character set.
1559
1560The output character set is, by default, the value of @code{nl_langinfo
1561(CODESET)}, which depends on the @code{LC_CTYPE} part of the current
1562locale. But programs which store strings in a locale independent way
1563(e.g. UTF-8) can request that @code{gettext} and related functions
1564return the translations in that encoding, by use of the
1565@code{bind_textdomain_codeset} function.
1566
1567Note that the @var{msgid} argument to @code{gettext} is not subject to
1568character set conversion. Also, when @code{gettext} does not find a
1569translation for @var{msgid}, it returns @var{msgid} unchanged --
1570independently of the current output character set. It is therefore
1571recommended that all @var{msgid}s be US-ASCII strings.
1572
1573@comment libintl.h
1574@comment GNU
1575@deftypefun {char *} bind_textdomain_codeset (const char *@var{domainname}, const char *@var{codeset})
29e7e2df
AO
1576@safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{}}@acunsafe{@acsmem{}}}
1577@c bind_textdomain_codeset @ascuheap @acsmem
1578@c set_binding_values dup @ascuheap @acsmem
17c389fc
UD
1579The @code{bind_textdomain_codeset} function can be used to specify the
1580output character set for message catalogs for domain @var{domainname}.
1410e233
UD
1581The @var{codeset} argument must be a valid codeset name which can be used
1582for the @code{iconv_open} function, or a null pointer.
17c389fc
UD
1583
1584If the @var{codeset} parameter is the null pointer,
1585@code{bind_textdomain_codeset} returns the currently selected codeset
cf822e3c 1586for the domain with the name @var{domainname}. It returns @code{NULL} if
17c389fc
UD
1587no codeset has yet been selected.
1588
107d41a9 1589The @code{bind_textdomain_codeset} function can be used several times.
17c389fc
UD
1590If used multiple times with the same @var{domainname} argument, the
1591later call overrides the settings made by the earlier one.
1592
1593The @code{bind_textdomain_codeset} function returns a pointer to a
1594string containing the name of the selected codeset. The string is
1595allocated internally in the function and must not be changed by the
user. If the system ran out of memory during the execution of
@code{bind_textdomain_codeset}, the return value is @code{NULL} and the
global variable @code{errno} is set accordingly.
1599@end deftypefun
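
The following is a minimal sketch of how a program that works with
UTF-8 strings internally might use this function. The domain name
@code{test-package} and the helper names @code{request_utf8} and
@code{codeset_is} are assumptions made for this illustration.

@smallexample
#include <libintl.h>
#include <stdio.h>
#include <string.h>

/* Ask for translations of the domain "test-package" in UTF-8.
   Return 0 on success and -1 if the request could not be recorded.  */
static int
request_utf8 (void)
@{
  if (bind_textdomain_codeset ("test-package", "UTF-8") == NULL)
    @{
      perror ("bind_textdomain_codeset");
      return -1;
    @}
  return 0;
@}

/* Return nonzero if CODESET is currently selected for DOMAINNAME.
   A null pointer as second argument only queries the setting.  */
static int
codeset_is (const char *domainname, const char *codeset)
@{
  const char *current = bind_textdomain_codeset (domainname, NULL);
  return current != NULL && strcmp (current, codeset) == 0;
@}
@end smallexample

The returned string is only read here, never modified, in line with the
restriction stated above.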
17c389fc
UD
1600
1601
608cc1f0
UD
1602@node GUI program problems
1603@subsubsection How to use @code{gettext} in GUI programs
1604
1410e233
UD
1605One place where the @code{gettext} functions, if used normally, have big
1606problems is within programs with graphical user interfaces (GUIs). The
problem is that many of the strings which have to be translated are very
short. They have to appear in pull-down menus, which restricts their
length. But strings which do not contain entire sentences, or at
least large fragments of a sentence, may appear in more than one
situation in the program and might need a different translation in each. This is
especially true for the one-word strings which are frequently used in
GUI programs.
1614
1615As a consequence many people say that the @code{gettext} approach is
1616wrong and instead @code{catgets} should be used which indeed does not
1617have this problem. But there is a very simple and powerful method to
handle this kind of problem with the @code{gettext} functions.
1619
1620@noindent
bbf70ae9 1621As an example consider the following fictional situation. A GUI program
608cc1f0
UD
1622has a menu bar with the following entries:
1623
1624@smallexample
+------------+------------+--------------------------------------+
| File       | Printer    |                                      |
+------------+------------+--------------------------------------+
| Open     | | Select   |
| New      | | Open     |
+----------+ | Connect  |
             +----------+
1632@end smallexample
1633
1634To have the strings @code{File}, @code{Printer}, @code{Open},
1635@code{New}, @code{Select}, and @code{Connect} translated there has to be
1636at some point in the code a call to a function of the @code{gettext}
1637family. But in two places the string passed into the function would be
1638@code{Open}. The translations might not be the same and therefore we
1639are in the dilemma described above.
1640
ef48b196 1641One solution to this problem is to artificially extend the strings
608cc1f0 1642to make them unambiguous. But what would the program do if no
ef48b196 1643translation is available? The extended string is not what should be
10b89412 1644printed. So we should use a slightly modified version of the functions.
608cc1f0 1645
ef48b196 1646To extend the strings a uniform method should be used. E.g., in the
10b89412 1647example above, the strings could be chosen as
608cc1f0
UD
1648
1649@smallexample
1650Menu|File
1651Menu|Printer
1652Menu|File|Open
1653Menu|File|New
1654Menu|Printer|Select
1655Menu|Printer|Open
1656Menu|Printer|Connect
1657@end smallexample
1658
1659Now all the strings are different and if now instead of @code{gettext}
1660the following little wrapper function is used, everything works just
1661fine:
1662
1663@cindex sgettext
1664@smallexample
  char *
  sgettext (const char *msgid)
  @{
    char *msgval = gettext (msgid);
    if (msgval == msgid)
      msgval = strrchr (msgid, '|') + 1;
    return msgval;
  @}
1673@end smallexample
1674
1675What this little function does is to recognize the case when no
1676translation is available. This can be done very efficiently by a
1677pointer comparison since the return value is the input value. If there
1678is no translation we know that the input string is in the format we used
1679for the Menu entries and therefore contains a @code{|} character. We
1680simply search for the last occurrence of this character and return a
1681pointer to the character following it. That's it!
1682
ef48b196 1683If one now consistently uses the extended string form and replaces
608cc1f0
UD
1684the @code{gettext} calls with calls to @code{sgettext} (this is normally
1685limited to very few places in the GUI implementation) then it is
1686possible to produce a program which can be internationalized.
1687
1688With advanced compilers (such as GNU C) one can write the
@code{sgettext} function as an inline function or as a macro like this:
1690
1691@cindex sgettext
1692@smallexample
#define sgettext(msgid) \
  (@{ const char *__msgid = (msgid);            \
     char *__msgval = gettext (__msgid);       \
     if (__msgval == __msgid)                  \
       __msgval = strrchr (__msgid, '|') + 1;  \
     __msgval; @})
1699@end smallexample
1700
1701The other @code{gettext} functions (@code{dgettext}, @code{dcgettext}
1702and the @code{ngettext} equivalents) can and should have corresponding
1703functions as well which look almost identical, except for the parameters
1704and the call to the underlying function.
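
As an illustration, a counterpart for @code{dgettext} could be sketched
as follows. The name @code{dsgettext} is an assumption made for this
example. Unlike the @code{sgettext} wrapper shown above, this sketch
also returns strings without a @code{|} character unchanged instead of
assuming that the separator is always present.

@smallexample
#include <libintl.h>
#include <string.h>

/* Like dgettext, but if no translation is found return only the part
   of MSGID which follows the last '|' character.  */
static const char *
dsgettext (const char *domainname, const char *msgid)
@{
  const char *msgval = dgettext (domainname, msgid);

  if (msgval == msgid)
    @{
      /* Untranslated: strip the disambiguating prefix, if any.  */
      const char *sep = strrchr (msgid, '|');
      if (sep != NULL)
        msgval = sep + 1;
    @}
  return msgval;
@}
@end smallexample

Without a catalog for the given domain, a call such as
@code{dsgettext ("test-package", "Menu|Printer|Open")} yields the
string @code{Open}.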
1705
1706Now there is of course the question why such functions do not exist in
1f77f049 1707@theglibc{}? There are two parts of the answer to this question.
608cc1f0
UD
1708
1709@itemize @bullet
1710@item
1711They are easy to write and therefore can be provided by the project they
1712are used in. This is not an answer by itself and must be seen together
1713with the second part which is:
1714
1715@item
1716There is no way the C library can contain a version which can work
1717everywhere. The problem is the selection of the character to separate
ef48b196 1718the prefix from the actual string in the extended string. The
608cc1f0
UD
1719examples above used @code{|} which is a quite good choice because it
1720resembles a notation frequently used in this context and it also is a
1721character not often used in message strings.
1722
But what if the character is used in message strings? Or what if the chosen
character is not available in the character set on the machine one
compiles on (e.g., @code{|} is not required to exist for @w{ISO C}; this is
why the @file{iso646.h} file exists in @w{ISO C} programming environments)?
1727@end itemize
1728
There is only one more comment left to make. The wrapper function above
requires that the translation strings are not extended themselves.
1731This is only logical. There is no need to disambiguate the strings
1732(since they are never used as keys for a search) and one also saves
1733quite some memory and disk space by doing this.
1734
1735
40a55d20
UD
1736@node Using gettextized software
1737@subsubsection User influence on @code{gettext}
1738
1739The last sections described what the programmer can do to
internationalize the messages of the program. But in the end it is up to
the user to select the messages s/he wants to see, and s/he must be able
to understand them.
1743
1744The POSIX locale model uses the environment variables @code{LC_COLLATE},
a1286745 1745@code{LC_CTYPE}, @code{LC_MESSAGES}, @code{LC_MONETARY}, @code{LC_NUMERIC},
40a55d20 1746and @code{LC_TIME} to select the locale which is to be used. This way
10b89412 1747the user can influence lots of functions. As we mentioned above, the
40a55d20
UD
1748@code{gettext} functions also take advantage of this.
1749
1750To understand how this happens it is necessary to take a look at the
1751various components of the filename which gets computed to locate a
1752message catalog. It is composed as follows:
1753
1754@smallexample
1755@var{dir_name}/@var{locale}/LC_@var{category}/@var{domain_name}.mo
1756@end smallexample
1757
1758The default value for @var{dir_name} is system specific. It is computed
1759from the value given as the prefix while configuring the C library.
1760This value normally is @file{/usr} or @file{/}. For the former the
1761complete @var{dir_name} is:
1762
1763@smallexample
1764/usr/share/locale
1765@end smallexample
1766
1767We can use @file{/usr/share} since the @file{.mo} files containing the
e8b1163e 1768message catalogs are system independent, so all systems can use the same
40a55d20 1769files. If the program executed the @code{bindtextdomain} function for
e8b1163e
AJ
1770the message domain that is currently handled, the @code{dir_name}
1771component is exactly the value which was given to the function as
1772the second parameter. I.e., @code{bindtextdomain} allows overwriting
f2ea0f5b 1773the only system dependent and fixed value to make it possible to
e8b1163e 1774address files anywhere in the filesystem.
40a55d20
UD
1775
1776The @var{category} is the name of the locale category which was selected
1777in the program code. For @code{gettext} and @code{dgettext} this is
1778always @code{LC_MESSAGES}, for @code{dcgettext} this is selected by the
1779value of the third parameter. As said above it should be avoided to
1780ever use a category other than @code{LC_MESSAGES}.
1781
The @var{locale} component is computed based on the category used. Just
like for the @code{setlocale} function, the user selection comes
into play here. Some environment variables are examined in a fixed order
and the first environment variable set determines the result of
1786the lookup process. In detail, for the category @code{LC_xxx} the
1787following variables in this order are examined:
1788
1789@table @code
1790@item LANGUAGE
1791@item LC_ALL
1792@item LC_xxx
1793@item LANG
1794@end table
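
The lookup order in this table can be sketched in a few lines of C.
This is only an illustration of the order described here, not the code
used in @theglibc{}, which handles additional special cases. The
function name @code{first_locale_setting} is made up, and treating an
empty value like an unset variable is an assumption of this sketch.

@smallexample
#include <stdlib.h>

/* Return the value of the first variable which is set to a non-empty
   string, examining LANGUAGE, LC_ALL, CATEGORY_VAR (for instance
   "LC_MESSAGES"), and LANG in this order.  Return NULL if none of
   them is set.  */
static const char *
first_locale_setting (const char *category_var)
@{
  const char *names[] = @{ "LANGUAGE", "LC_ALL", category_var, "LANG" @};

  for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i)
    @{
      const char *value = getenv (names[i]);
      if (value != NULL && value[0] != '\0')
        return value;
    @}
  return NULL;
@}
@end smallexample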
1795
1796This looks very familiar. With the exception of the @code{LANGUAGE}
1797environment variable this is exactly the lookup order the
10b89412 1798@code{setlocale} function uses. But why introduce the @code{LANGUAGE}
40a55d20
UD
1799variable?
1800
The reason is that the syntax of the values these variables can have is
different from what is expected by the @code{setlocale} function. If we
set @code{LC_ALL} to a value following the extended syntax, the
@code{setlocale} function would not be able to use the
value of this variable. An additional variable removes this
problem, and it also lets us select the language independently of the locale
setting, which is sometimes useful.
1808
While for the @code{LC_xxx} variables the value should consist of
exactly one specification of a locale, the @code{LANGUAGE} variable's
value can consist of a colon-separated list of locale names. The
1812attentive reader will realize that this is the way we manage to
1813implement one of our additional demands above: we want to be able to
10b89412 1814specify an ordered list of languages.
40a55d20
UD
1815
1816Back to the constructed filename we have only one component missing.
1817The @var{domain_name} part is the name which was either registered using
1818the @code{textdomain} function or which was given to @code{dgettext} or
1819@code{dcgettext} as the first parameter. Now it becomes obvious that a
1820good choice for the domain name in the program code is a string which is
1f77f049
JM
1821closely related to the program/package name. E.g., for @theglibc{}
1822the domain name is @code{libc}.
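
To make the composition concrete, the following sketch assembles such a
filename from its four components. It only illustrates the pattern
shown above; it is not how @theglibc{} locates catalogs internally. The
function name @code{catalog_path} is made up, and the category is
passed as the complete name (e.g., @code{LC_MESSAGES}).

@smallexample
#include <stdio.h>

/* Store DIR_NAME/LOCALE/CATEGORY/DOMAIN_NAME.mo in BUF, which has
   room for SIZE bytes.  Return 0 on success and -1 if the result
   does not fit.  */
static int
catalog_path (char *buf, size_t size, const char *dir_name,
              const char *locale, const char *category,
              const char *domain_name)
@{
  int n = snprintf (buf, size, "%s/%s/%s/%s.mo",
                    dir_name, locale, category, domain_name);
  return n >= 0 && (size_t) n < size ? 0 : -1;
@}
@end smallexample

With the arguments @code{"/usr/share/locale"}, @code{"de"},
@code{"LC_MESSAGES"}, and @code{"libc"} the buffer receives
@file{/usr/share/locale/de/LC_MESSAGES/libc.mo}.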
40a55d20
UD
1823
1824@noindent
10b89412 1825A limited piece of example code should show how the program is supposed
40a55d20
UD
1826to work:
1827
1828@smallexample
@{
  setlocale (LC_ALL, "");
  textdomain ("test-package");
  bindtextdomain ("test-package", "/usr/local/share/locale");
  puts (gettext ("Hello, world!"));
@}
1835@end smallexample
1836
1410e233
UD
1837At the program start the default domain is @code{messages}, and the
1838default locale is "C". The @code{setlocale} call sets the locale
1839according to the user's environment variables; remember that correct
1840functioning of @code{gettext} relies on the correct setting of the
1841@code{LC_MESSAGES} locale (for looking up the message catalog) and
1842of the @code{LC_CTYPE} locale (for the character set conversion).
1843The @code{textdomain} call changes the default domain to
1844@code{test-package}. The @code{bindtextdomain} call specifies that
1845the message catalogs for the domain @code{test-package} can be found
1846below the directory @file{/usr/local/share/locale}.
40a55d20 1847
If the user sets the variable @code{LANGUAGE} to @code{de} in her/his
environment, the @code{gettext} function will try to use the
translations from the file
1851
1852@smallexample
1853/usr/local/share/locale/de/LC_MESSAGES/test-package.mo
1854@end smallexample
1855
1856From the above descriptions it should be clear which component of this
f41c8091
UD
1857filename is determined by which source.
1858
10b89412
RJ
1859In the above example we assumed the @code{LANGUAGE} environment
1860variable to be @code{de}. This might be an appropriate selection but what
f41c8091
UD
1861happens if the user wants to use @code{LC_ALL} because of the wider
1862usability and here the required value is @code{de_DE.ISO-8859-1}? We
1863already mentioned above that a situation like this is not infrequent.
1864E.g., a person might prefer reading a dialect and if this is not
1865available fall back on the standard language.
1866
The @code{gettext} functions know about situations like this and can
handle them gracefully. The functions recognize the format of the value
of the environment variable. They can split the value into different pieces
and, by leaving out one or the other part, construct new
values. This happens of course in a predictable way. To understand
this one must know the format of the environment variable value. There
7a9a2681
UD
1873is one more or less standardized form, originally from the X/Open
1874specification:
f41c8091 1875
f41c8091
UD
1876@code{language[_territory[.codeset]][@@modifier]}
1877
To obtain less specific locale names, the components are stripped in the
order of the following list:
40a55d20 1880
f41c8091
UD
1881@enumerate
1882@item
f41c8091
UD
1883@code{codeset}
1884@item
1885@code{normalized codeset}
1886@item
1887@code{territory}
1888@item
7a9a2681 1889@code{modifier}
f41c8091
UD
1890@end enumerate
1891
7a9a2681 1892The @code{language} field will never be dropped for obvious reasons.
f41c8091
UD
1893
1894The only new thing is the @code{normalized codeset} entry. This is
10b89412
RJ
1895another goodie which is introduced to help reduce the chaos which
1896derives from the inability of people to standardize the names of
f41c8091
UD
1897character sets. Instead of @w{ISO-8859-1} one can often see @w{8859-1},
1898@w{88591}, @w{iso8859-1}, or @w{iso_8859-1}. The @code{normalized
1899codeset} value is generated from the user-provided character set name by
1900applying the following rules:
1901
1902@enumerate
1903@item
10b89412 1904Remove all characters besides numbers and letters.
f41c8091
UD
1905@item
1906Fold letters to lowercase.
1907@item
If the result contains only digits, prepend the string @code{"iso"}.
1909@end enumerate
1910
1911@noindent
10b89412
RJ
1912So all of the above names will be normalized to @code{iso88591}. This
1913allows the program user much more freedom in choosing the locale name.
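
The three rules can be sketched in C as follows. This is an
illustration of the rules as stated above under the assumption of
ASCII codeset names, not the code used in @theglibc{}. The function
name @code{normalize_codeset} and its buffer interface are made up for
this example.

@smallexample
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Store the normalized form of NAME in OUT, which has room for SIZE
   bytes.  Return 0 on success and -1 if OUT is too small.  */
static int
normalize_codeset (const char *name, char *out, size_t size)
@{
  size_t len = 0;
  int only_digits = 1;

  /* First pass: count the characters to keep, look for letters.  */
  for (const char *p = name; *p != '\0'; ++p)
    if (isalnum ((unsigned char) *p))
      @{
        ++len;
        if (isalpha ((unsigned char) *p))
          only_digits = 0;
      @}

  size_t prefix = only_digits ? 3 : 0;
  if (prefix + len + 1 > size)
    return -1;

  /* Rule 3: a result consisting only of digits gets the prefix.  */
  if (only_digits)
    memcpy (out, "iso", 3);

  /* Rules 1 and 2: keep letters and digits, fold to lowercase.  */
  char *q = out + prefix;
  for (const char *p = name; *p != '\0'; ++p)
    if (isalnum ((unsigned char) *p))
      *q++ = (char) tolower ((unsigned char) *p);
  *q = '\0';
  return 0;
@}
@end smallexample

Applied to @w{ISO-8859-1}, @w{8859-1}, @w{88591}, and @w{iso_8859-1}
this function produces @code{iso88591} in each case.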
f41c8091
UD
1914
1915Even this extended functionality still does not help to solve the
1916problem that completely different names can be used to denote the same
1917locale (e.g., @code{de} and @code{german}). To be of help in this
1918situation the locale implementation and also the @code{gettext}
1919functions know about aliases.
1920
1921The file @file{/usr/share/locale/locale.alias} (replace @file{/usr} with
1922whatever prefix you used for configuring the C library) contains a
1923mapping of alternative names to more regular names. The system manager
1924is free to add new entries to fill her/his own needs. The selected
1925locale from the environment is compared with the entries in the first
10b89412 1926column of this file ignoring the case. If they match, the value of the
f41c8091
UD
1927second column is used instead for the further handling.
1928
1929In the description of the format of the environment variables we already
1930mentioned the character set as a factor in the selection of the message
1931catalog. In fact, only catalogs which contain text written using the
1932character set of the system/program can be used (directly; there will
1933come a solution for this some day). This means for the user that s/he
10b89412 1934will always have to take care of this. If in the collection of the
f41c8091
UD
1935message catalogs there are files for the same language but coded using
1936different character sets the user has to be careful.
40a55d20
UD
1937
1938
1939@node Helper programs for gettext
1940@subsection Programs to handle message catalogs for @code{gettext}
1941
1f77f049 1942@Theglibc{} does not contain the source code for the programs to
f41c8091
UD
1943handle message catalogs for the @code{gettext} functions. As part of
1944the GNU project the GNU gettext package contains everything the
1945developer needs. The functionality provided by the tools in this
1946package by far exceeds the abilities of the @code{gencat} program
1947described above for the @code{catgets} functions.
1948
There is a program @code{msgfmt} which is the equivalent of the
@code{gencat} program. From the human-readable and editable
form of the message catalog it generates a binary file which can be used by
1952the @code{gettext} functions. But there are several more programs
1953available.
1954
1955The @code{xgettext} program can be used to automatically extract the
1956translatable messages from a source file. I.e., the programmer need not
c430c4af 1957take care of the translations and the list of messages which have to be
translated. S/He will simply wrap the translatable strings in calls to
@code{gettext} et al.@: and the rest will be done by @code{xgettext}. This
c430c4af 1960program has a lot of options which help to customize the output or
f41c8091
UD
1961help to understand the input better.
1962
c430c4af
BS
1963Other programs help to manage the development cycle when new messages appear
1964in the source files or when a new translation of the messages appears.
11bf311e
UD
1965Here it should only be noted that using all the tools in GNU gettext it
1966is possible to @emph{completely} automate the handling of message
10b89412 1967catalogs. Besides marking the translatable strings in the source code and
f41c8091 1968generating the translations the developers do not have anything to do
608cc1f0 1969themselves.