@comment manual/=float.texinfo, thirdparty/glibc.git (update from main archive 961207)
@node Floating-Point Limits
@chapter Floating-Point Limits
@pindex <float.h>
@cindex floating-point number representation
@cindex representation of floating-point numbers

Because floating-point numbers are represented internally as approximate
quantities, algorithms for manipulating floating-point data often need
to be parameterized in terms of the accuracy of the representation.
Some of the functions in the C library itself need this information; for
example, the algorithms for printing and reading floating-point numbers
(@pxref{I/O on Streams}) and for calculating trigonometric and
irrational functions (@pxref{Mathematics}) use information about the
underlying floating-point representation to avoid round-off error and
loss of accuracy. User programs that implement numerical analysis
techniques also often need to be parameterized in this way in order to
minimize or compute error bounds.

The specific representation of floating-point numbers varies from
machine to machine. The GNU C Library defines a set of parameters which
characterize each of the supported floating-point representations on a
particular system.

@menu
* Floating-Point Representation::  Definitions of terminology.
* Floating-Point Parameters::      Descriptions of the library facilities.
* IEEE Floating Point::            An example of a common representation.
@end menu

@node Floating-Point Representation
@section Floating-Point Representation

This section introduces the terminology used to characterize the
representation of floating-point numbers.

You are probably already familiar with most of these concepts in terms
of scientific or exponential notation for floating-point numbers. For
example, the number @code{123456.0} could be expressed in exponential
notation as @code{1.23456e+05}, a shorthand notation indicating that the
mantissa @code{1.23456} is multiplied by the base @code{10} raised to
power @code{5}.

More formally, the internal representation of a floating-point number
can be characterized in terms of the following parameters:

@itemize @bullet
@item
The @dfn{sign} is either @code{-1} or @code{1}.
@cindex sign (of floating-point number)

@item
The @dfn{base} or @dfn{radix} for exponentiation; an integer greater
than @code{1}. This is a constant for the particular representation.
@cindex base (of floating-point number)
@cindex radix (of floating-point number)

@item
The @dfn{exponent} to which the base is raised. The upper and lower
bounds of the exponent value are constants for the particular
representation.
@cindex exponent (of floating-point number)

Sometimes, in the actual bits representing the floating-point number,
the exponent is @dfn{biased} by adding a constant to it, to make it
always be represented as an unsigned quantity. This is only important
if you have some reason to pick apart the bit fields making up the
floating-point number by hand, which is something for which the GNU
library provides no support. So this is ignored in the discussion that
follows.
@cindex bias, in exponent (of floating-point number)

@item
The value of the @dfn{mantissa} or @dfn{significand}, which is an
unsigned quantity.
@cindex mantissa (of floating-point number)
@cindex significand (of floating-point number)

@item
The @dfn{precision} of the mantissa. If the base of the representation
is @var{b}, then the precision is the number of base-@var{b} digits in
the mantissa. This is a constant for the particular representation.

Many floating-point representations have an implicit @dfn{hidden bit} in
the mantissa. Any such hidden bits are counted in the precision.
Again, the GNU library provides no facilities for dealing with such low-level
aspects of the representation.
@cindex precision (of floating-point number)
@cindex hidden bit, in mantissa (of floating-point number)
@end itemize

The mantissa of a floating-point number actually represents an implicit
fraction whose denominator is the base raised to the power of the
precision. Since the largest representable mantissa is one less than
this denominator, the value of the fraction is always strictly less than
@code{1}. The mathematical value of a floating-point number is then the
product of this fraction, the sign, and the base raised to the exponent.

If the floating-point number is @dfn{normalized}, the mantissa is also
greater than or equal to the base raised to the power of one less
than the precision (unless the number represents a floating-point zero,
in which case the mantissa is zero). The fractional quantity is
therefore greater than or equal to @code{1/@var{b}}, where @var{b} is
the base.
@cindex normalized floating-point number

@node Floating-Point Parameters
@section Floating-Point Parameters

@strong{Incomplete:} This section needs some more concrete examples
of what these parameters mean and how to use them in a program.

These macro definitions can be accessed by including the header file
@file{<float.h>} in your program.

Macro names starting with @samp{FLT_} refer to the @code{float} type,
while names beginning with @samp{DBL_} refer to the @code{double} type
and names beginning with @samp{LDBL_} refer to the @code{long double}
type. (In implementations that do not support @code{long double} as
a distinct data type, the values for those constants are the same
as the corresponding constants for the @code{double} type.)@refill

Note that only @code{FLT_RADIX} is guaranteed to be a constant
expression, so the other macros listed here cannot be reliably used in
places that require constant expressions, such as @samp{#if}
preprocessing directives and array size specifications.

Although the @w{ISO C} standard specifies minimum and maximum values for
most of these parameters, the GNU C implementation uses whatever
floating-point representations are supported by the underlying hardware.
So whether GNU C actually satisfies the @w{ISO C} requirements depends on
what machine it is running on.

@comment float.h
@comment ISO
@defvr Macro FLT_ROUNDS
This value characterizes the rounding mode for floating-point addition.
The following values indicate standard rounding modes:

@table @code
@item -1
The mode is indeterminable.
@item 0
Rounding is towards zero.
@item 1
Rounding is to the nearest number.
@item 2
Rounding is towards positive infinity.
@item 3
Rounding is towards negative infinity.
@end table

@noindent
Any other value represents a machine-dependent nonstandard rounding
mode.
@end defvr

@comment float.h
@comment ISO
@defvr Macro FLT_RADIX
This is the value of the base, or radix, of exponent representation.
This is guaranteed to be a constant expression, unlike the other macros
described in this section.
@end defvr

@comment float.h
@comment ISO
@defvr Macro FLT_MANT_DIG
This is the number of base-@code{FLT_RADIX} digits in the floating-point
mantissa for the @code{float} data type.
@end defvr

@comment float.h
@comment ISO
@defvr Macro DBL_MANT_DIG
This is the number of base-@code{FLT_RADIX} digits in the floating-point
mantissa for the @code{double} data type.
@end defvr

@comment float.h
@comment ISO
@defvr Macro LDBL_MANT_DIG
This is the number of base-@code{FLT_RADIX} digits in the floating-point
mantissa for the @code{long double} data type.
@end defvr

@comment float.h
@comment ISO
@defvr Macro FLT_DIG
This is the number of decimal digits of precision for the @code{float}
data type. Technically, if @var{p} and @var{b} are the precision and
base (respectively) for the representation, then the decimal precision
@var{q} is the maximum number of decimal digits such that any
floating-point number with @var{q} base-10 digits can be rounded to a
floating-point number with @var{p} base-@var{b} digits and back again,
without change to the @var{q} decimal digits.

The value of this macro is guaranteed to be at least @code{6}.
@end defvr

@comment float.h
@comment ISO
@defvr Macro DBL_DIG
This is similar to @code{FLT_DIG}, but is for the @code{double} data
type. The value of this macro is guaranteed to be at least @code{10}.
@end defvr

@comment float.h
@comment ISO
@defvr Macro LDBL_DIG
This is similar to @code{FLT_DIG}, but is for the @code{long double}
data type. The value of this macro is guaranteed to be at least
@code{10}.
@end defvr

@comment float.h
@comment ISO
@defvr Macro FLT_MIN_EXP
This is the minimum negative integer such that @code{FLT_RADIX} raised
to one less than this power can be represented as a normalized
floating-point number of type @code{float}. In terms of the
representation described above, this is just the smallest value that
the exponent of a normalized number can take.
@end defvr

@comment float.h
@comment ISO
@defvr Macro DBL_MIN_EXP
This is similar to @code{FLT_MIN_EXP}, but is for the @code{double} data
type.
@end defvr

@comment float.h
@comment ISO
@defvr Macro LDBL_MIN_EXP
This is similar to @code{FLT_MIN_EXP}, but is for the @code{long double}
data type.
@end defvr

@comment float.h
@comment ISO
@defvr Macro FLT_MIN_10_EXP
This is the minimum negative integer such that @code{10} raised to this
power lies in the range of normalized floating-point numbers of type
@code{float}. This is guaranteed to be no greater than @code{-37}.
@end defvr

@comment float.h
@comment ISO
@defvr Macro DBL_MIN_10_EXP
This is similar to @code{FLT_MIN_10_EXP}, but is for the @code{double}
data type.
@end defvr

@comment float.h
@comment ISO
@defvr Macro LDBL_MIN_10_EXP
This is similar to @code{FLT_MIN_10_EXP}, but is for the @code{long
double} data type.
@end defvr

@comment float.h
@comment ISO
@defvr Macro FLT_MAX_EXP
This is the maximum integer such that @code{FLT_RADIX} raised to one
less than this power can be represented as a finite floating-point
number of type @code{float}. In terms of the representation described
above, this is just the largest value that the exponent can take.
@end defvr

@comment float.h
@comment ISO
@defvr Macro DBL_MAX_EXP
This is similar to @code{FLT_MAX_EXP}, but is for the @code{double} data
type.
@end defvr

@comment float.h
@comment ISO
@defvr Macro LDBL_MAX_EXP
This is similar to @code{FLT_MAX_EXP}, but is for the @code{long double}
data type.
@end defvr

@comment float.h
@comment ISO
@defvr Macro FLT_MAX_10_EXP
This is the maximum integer such that @code{10} raised to this power
lies in the range of representable finite floating-point numbers of type
@code{float}. This is guaranteed to be at least @code{37}.
@end defvr

@comment float.h
@comment ISO
@defvr Macro DBL_MAX_10_EXP
This is similar to @code{FLT_MAX_10_EXP}, but is for the @code{double}
data type.
@end defvr

@comment float.h
@comment ISO
@defvr Macro LDBL_MAX_10_EXP
This is similar to @code{FLT_MAX_10_EXP}, but is for the @code{long
double} data type.
@end defvr

@comment float.h
@comment ISO
@defvr Macro FLT_MAX
The value of this macro is the maximum representable floating-point
number of type @code{float}, and is guaranteed to be at least
@code{1E+37}.
@end defvr

@comment float.h
@comment ISO
@defvr Macro DBL_MAX
The value of this macro is the maximum representable floating-point
number of type @code{double}, and is guaranteed to be at least
@code{1E+37}.
@end defvr

@comment float.h
@comment ISO
@defvr Macro LDBL_MAX
The value of this macro is the maximum representable floating-point
number of type @code{long double}, and is guaranteed to be at least
@code{1E+37}.
@end defvr

@comment float.h
@comment ISO
@defvr Macro FLT_MIN
The value of this macro is the minimum normalized positive
floating-point number that is representable by type @code{float}, and is
guaranteed to be no more than @code{1E-37}.
@end defvr

@comment float.h
@comment ISO
@defvr Macro DBL_MIN
The value of this macro is the minimum normalized positive
floating-point number that is representable by type @code{double}, and
is guaranteed to be no more than @code{1E-37}.
@end defvr

@comment float.h
@comment ISO
@defvr Macro LDBL_MIN
The value of this macro is the minimum normalized positive
floating-point number that is representable by type @code{long double},
and is guaranteed to be no more than @code{1E-37}.
@end defvr

@comment float.h
@comment ISO
@defvr Macro FLT_EPSILON
This is the difference between @code{1.0} and the smallest
floating-point number of type @code{float} that is greater than
@code{1.0}, so that @code{1.0 + FLT_EPSILON != 1.0} is true. It's
guaranteed to be no greater than @code{1E-5}.
@end defvr

@comment float.h
@comment ISO
@defvr Macro DBL_EPSILON
This is similar to @code{FLT_EPSILON}, but is for the @code{double}
type. It is guaranteed to be no greater than @code{1E-9}.
@end defvr

@comment float.h
@comment ISO
@defvr Macro LDBL_EPSILON
This is similar to @code{FLT_EPSILON}, but is for the @code{long double}
type. It is guaranteed to be no greater than @code{1E-9}.
@end defvr

@node IEEE Floating Point
@section IEEE Floating Point

Here is an example showing how these parameters work for a common
floating-point representation, specified by the @cite{IEEE Standard for
Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985 or ANSI/IEEE
Std 854-1987)}.

The IEEE single-precision float representation uses a base of 2. There
is a sign bit, a mantissa with 23 bits plus one hidden bit (so the total
precision is 24 base-2 digits), and an 8-bit exponent that can represent
values in the range -125 to 128, inclusive.

So, for an implementation that uses this representation for the
@code{float} data type, appropriate values for the corresponding
parameters are:

@example
FLT_RADIX                             2
FLT_MANT_DIG                         24
FLT_DIG                               6
FLT_MIN_EXP                        -125
FLT_MIN_10_EXP                      -37
FLT_MAX_EXP                         128
FLT_MAX_10_EXP                      +38
FLT_MIN                 1.17549435E-38F
FLT_MAX                 3.40282347E+38F
FLT_EPSILON             1.19209290E-07F
@end example